OrthoSystem Atelier

The Compiler Is a Supply Chain Variable

A live demonstration of how identical C++ source code produces different floating-point results across GCC versions

In the essay From Artificial Intelligence to Artificial Wisdom, I explained that your compiler version, your CUDA compute level, your PyTorch version, and your driver version are each components in your supply chain, and that the same code run through two different versions of a compiler can produce different results, even when the flags are identical. This page lets you verify that claim in real time.

Below is a short C++ program. It assigns 4.2 to a double, then subtracts the literal 4.2 from it. Mathematically, the result should be zero. The output panels show the results from two different GCC versions: same source, same flags. Click Refresh Live to recompile in real time using the Compiler Explorer API, or read on; the cached results below already tell the story.

The Source

C++17 -O2 -std=c++17 -mfpmath=387
#include <iostream>
#include <iomanip>
#include <cfloat>

int main() {
    std::cout << "FLT_EVAL_METHOD = " << FLT_EVAL_METHOD << "\n";

    double d = 4.2;
    double diff = d - 4.2;          // this is the key line
    bool eq = (d == 4.2);

    std::cout << std::boolalpha;
    std::cout << "d == 4.2         = " << eq << "\n";

    std::cout << std::scientific << std::setprecision(20);
    std::cout << "d               = " << d << "\n";
    std::cout << "diff (d - 4.2)  = " << diff << "\n";
}
Result for d − 4.2: 0.0 (GCC 11.5) vs. 1.78 × 10⁻¹⁶ (GCC 15.2)

Same source code. Same compiler flags. Different answers. Both compilers are correct.

Amplification

A difference of 10⁻¹⁶ sounds negligible. But floating-point operations rarely happen once. Neural network training involves billions of sequential floating-point operations where each result feeds into the next. What happens when that invisible per-operation error is repeated?

This second program takes the same d - 4.2 computation and accumulates it across 10 million iterations, printing the running total at five checkpoints. On one compiler, the accumulator stays at zero. On the other, it grows.

C++17 -O2 -std=c++17 -mfpmath=387
#include <iostream>
#include <iomanip>

int main() {
    double d = 4.2;
    double acc = 0.0;

    std::cout << std::scientific << std::setprecision(6);
    for (int i = 1; i <= 10000000; i++) {
        acc += (d - 4.2);
        if (i == 1 || i == 100 || i == 10000 || i == 1000000 || i == 10000000)
            std::cout << "n=" << std::setw(8) << i << "  err=" << acc << "\n";
    }
}
Accumulated error after 10,000,000 operations: 0.0 (GCC 11.5) vs. 1.78 × 10⁻⁹ (GCC 15.2)

An invisible per-operation difference of 10⁻¹⁶ becomes a measurable 10⁻⁹ after ten million iterations: 10⁷ × 1.78 × 10⁻¹⁶ ≈ 1.78 × 10⁻⁹, growing linearly with the operation count. A typical neural network training run involves billions.

Why This Matters for Software in Regulated Industries and AI

This demonstration uses a flag (-mfpmath=387) that almost nobody uses in production. That's intentional: it isolates the effect cleanly. But the underlying principle applies broadly. In production toolchains, floating-point divergence arises from SIMD instruction selection, FMA (fused multiply-add) availability, auto-vectorization decisions, and GPU warp scheduling. The specific values differ; the supply chain variable is the same.

When you train a neural network, the final weights are the product of millions of floating-point operations chained together. If your training framework uses cuDNN, the specific convolution algorithm selected can vary between runs and between driver versions. NVIDIA's documentation explicitly notes that bitwise reproducibility requires specific configuration choices and that the default behavior prioritizes performance over determinism.

In medical device manufacturing, you are required to maintain a Software Bill of Materials documenting every component version. You are required to demonstrate that your build is reproducible. You are required to investigate when outputs change unexpectedly. These requirements exist because silent changes in behavior are how patients get hurt.

In AI development, the equivalent of this GCC version change happens routinely: PyTorch updates its default CUDA kernels, cuDNN changes its autotuner heuristics, a driver update changes instruction scheduling, a cloud provider silently rotates the GPU hardware behind your API endpoint. Each change is individually minor. The compound effect on model behavior is largely unstudied. Very few teams maintain the traceability to even know it happened.

The Supply Chain Question

If two versions of GCC produce different results from identical source code and identical flags, how confident are you that your production AI model was produced by the exact toolchain you think it was? Could you prove it? In aerospace or medical devices, this would be a finding. In AI, it's Tuesday.

Technical Notes

This page compiles and executes code in real time using the Compiler Explorer REST API, an open-source project created by Matt Godbolt. The compiler identifiers are g115 (x86-64 GCC 11.5) and g152 (x86-64 GCC 15.2). No code is executed locally in your browser.

If the Compiler Explorer API is unavailable, the button will report an error; the demonstration depends on Godbolt's servers being reachable at the time you click. You can verify these results yourself at godbolt.org by entering the source code above and selecting the compilers x86-64 gcc 11.5 and x86-64 gcc 15.2. The previous hyperlink may expire at some point.