Before surgery, you need a plan. To have a plan, you need a diagnosis. To get a diagnosis, you start with a differential diagnosis to consider all the possibilities. This is differential diagnostics for AI computing infrastructure—a systematic benchmark suite for understanding CPU memory, GPU memory, CPU↔GPU transfers, and multi-node networking. This guide explains what each test measures, why it matters, and how modern hardware actually works.
01 — Download
These binaries are tested on the DGX Spark (DGX-OS) and on x86-64 under Windows Subsystem for Linux 2. The networking tests use NVIDIA NCCL only.
Optimized for Neoverse V2 cores and SM121 (Blackwell). DGX Spark only. Single code path tuned for the Grace CPU's SVE2 vector units—no runtime dispatch overhead.
Download ARM64 (SHA256: b49ececa0d2661aac2fabdc8d67bcb6d0b5c38543c6618443e4dc5d73e6be3a1)
Runtime CPU dispatch via IFUNC. Contains multiple code paths optimized for different SIMD generations: AVX-512 (Skylake-X, Zen 4+), AVX2 (Haswell, Zen 1-3), and SSE4.2 fallback.
Download x86-64 (SHA256: 2ec428999ea468ed88bf5241e2e6fd6b636994a11d6a7f2b0a9dee095af15922)
DDx requires X11/Motif for the GUI and OpenMP for multithreaded benchmarks:
sudo apt install libx11-6 libxext6 libxt6 libmotif-common libxm4 libgomp1
libnccl2 is required for multi-node benchmarks. We recommend building NCCL from source for optimal performance on your specific network hardware. Ensure the library is on your library search path, or consider packaging it as an AppImage. See NVIDIA's Stacked Sparks guide for DGX Spark-specific instructions.
# GUI mode (default) - opens the node discovery lobby
./orthosystem_ddx_NVDA_GB10

# Local benchmarks only - no GUI, no networking
./orthosystem_ddx_NVDA_GB10 --local

# Multi-node: Host a benchmark session expecting N total nodes
./orthosystem_ddx_NVDA_GB10 --primary --expect N

# Multi-node: Join an existing session at the given IP
./orthosystem_ddx_NVDA_GB10 --join <ip>
Replace with orthosystem_ddx_Linux_x86-64 for the x86 build.
02 — Results at a Glance
The GB10's unified LPDDR5X architecture means CPU and GPU share the same 273 GB/s memory pool. No PCIe bottleneck, no staging buffers, no pinned vs. pageable distinction.
The RTX 5090's 1792 GB/s GDDR7 is accessed through PCIe 5.0 x16 (64 GB/s theoretical). This discrete architecture requires explicit data movement—and makes the pinned vs. pageable distinction critical.
Pageable memory (standard malloc) can be swapped to disk by the OS. Before DMA transfer, the CUDA driver must copy data to an internal pinned staging buffer—a double-copy that halves effective bandwidth. Pinned memory (cudaHostAlloc) is page-locked and enables direct DMA. On unified memory systems like the GB10, this distinction vanishes entirely.
03 — The Lobby
When launched without arguments, DDx opens a Motif-based GUI that serves as a "lobby" for coordinating multi-node benchmarks. The lobby discovers other DDx instances on your local network segment, displays their IP addresses and detected architectures, and lets you manually add nodes by IP if automatic discovery fails.
Motif (libXm) is a stable, minimal X11 toolkit that works identically across all Linux distributions including DGX-OS. Unlike GTK or Qt, it has no version fragmentation issues and adds only ~2MB to the binary. The aesthetic matches NVIDIA's DGX software stack, and the API has been stable since 1989—no breaking changes in 35 years.
To run multi-node benchmarks, select the nodes you want to include and click "Run Benchmark." The lobby coordinates NCCL initialization: the first selected node becomes the primary (rank 0) and generates an NCCL unique ID, then broadcasts it to secondary nodes. This handshake establishes the communication topology before benchmarks begin.
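The handshake follows NCCL's standard bootstrap pattern. Below is a minimal sketch, not DDx's actual source: it assumes the unique ID has already been delivered to every secondary (the lobby sends it over its TCP channel), and it omits error handling.

// NCCL bootstrap as the lobby coordinates it. Illustrative sketch only.
#include <cuda_runtime.h>
#include <nccl.h>

// On the primary (rank 0): generate the session token and hand it to the
// lobby, which forwards it to every secondary node.
ncclUniqueId make_session_id() {
    ncclUniqueId id;
    ncclGetUniqueId(&id);
    return id;
}

// On every node (primary included), once the ID has arrived:
ncclComm_t join_session(int rank, int n_ranks, ncclUniqueId id) {
    cudaSetDevice(0);                            // one GPU per node in this sketch
    ncclComm_t comm;
    ncclCommInitRank(&comm, n_ranks, id, rank);  // blocks until all ranks check in
    return comm;                                 // communication topology established
}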
04 — Understanding Output
DDx reports results with statistical context rather than single numbers. Memory bandwidth benchmarks are notoriously variable—cache state, thermal conditions, and background processes all affect measurements. Reporting a single number hides this reality; DDx exposes it.
Each benchmark runs multiple trials (typically 5) and reports:
Size      Mean         [Min    -    Max]       SD    vs. GB10 @273
128 MB    121.4 GB/s   [119.2  -  124.1]    ± 1.8    44.5%
The columns are: buffer size, mean bandwidth across trials, the range of observed values, standard deviation, and percentage of theoretical maximum. The reference bandwidth (here, 273 GB/s for the GB10's LPDDR5X) provides context for whether you're achieving reasonable efficiency.
When the coefficient of variation (CV = standard deviation ÷ mean) exceeds 10%, DDx automatically prints an ASCII histogram. This threshold is defined in the source as HISTOGRAM_CV_THRESHOLD = 0.10. High variance usually indicates one of two things: cache effects (some runs hit cache, others spill to DRAM) or transient system activity.
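The gate itself is a one-line statistic. A minimal sketch follows; only HISTOGRAM_CV_THRESHOLD comes from the source, the helper name is illustrative.

// Decide whether a benchmark's trial results warrant a histogram:
// coefficient of variation = standard deviation / mean.
#include <cmath>
#include <numeric>
#include <vector>

constexpr double HISTOGRAM_CV_THRESHOLD = 0.10;   // constant named in the DDx source

bool needs_histogram(const std::vector<double>& trial_gbps) {
    const double mean = std::accumulate(trial_gbps.begin(), trial_gbps.end(), 0.0)
                        / trial_gbps.size();
    double var = 0.0;
    for (double x : trial_gbps) var += (x - mean) * (x - mean);
    const double sd = std::sqrt(var / trial_gbps.size());  // population SD over the trials
    return sd / mean > HISTOGRAM_CV_THRESHOLD;
}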
  4 MB   1013.0 GB/s   [963.8  - 1067.8]   ± 30.0    371.1%  ⚠
A bimodal distribution (two distinct peaks) often indicates cache behavior: some runs hit L2/L3 cache while others spill to DRAM. A single peak with scattered outliers suggests transient system activity—background processes, interrupts, or thermal throttling.
05 — SIMD Fundamentals
Before diving into memory benchmarks, we need to understand how modern CPUs actually move data. The key concept is SIMD: Single Instruction, Multiple Data. Instead of processing one number at a time, SIMD instructions operate on multiple values simultaneously using wide vector registers.
SIMD register widths have grown from 128 bits (SSE) to 512 bits (AVX-512). ARM's SVE is length-agnostic—the same binary runs on implementations from 128 to 2048 bits.
The x86 SIMD story is one of backward compatibility layered atop backward compatibility. SSE (1999) introduced 128-bit XMM registers. AVX (2011) doubled them to 256-bit YMM registers. AVX-512 (2017) doubled again to 512-bit ZMM registers—and added mask registers, embedded rounding, and a bewildering array of sub-extensions.
MMX
Intel's first SIMD: 64-bit registers repurposed from x87 FPU. Integer only.
AMD 3DNow!
Single-precision floats in MMX registers. Included PREFETCHW. Deprecated 2010.
SSE
New 128-bit XMM registers. Single-precision floats. Still supported today.
AVX2 + FMA
256-bit integer operations. Fused multiply-add. Haswell/Zen baseline.
AVX-512
512-bit ZMM registers. Mask registers. Skylake-X server, Zen 4 consumer.
DDx's x86-64 build contains multiple implementations of performance-critical functions, each optimized for a different SIMD generation. At program startup, glibc's IFUNC (indirect function) resolver inspects CPUID and patches the Global Offset Table to call the optimal version. There's no runtime dispatch overhead after this one-time initialization.
// GCC's target_clones generates multiple versions automatically
#define MULTIVERSION __attribute__((target_clones( \
    "arch=sapphirerapids",  /* Intel SPR: AVX-512 + AMX + BF16   */ \
    "arch=skylake-avx512",  /* Intel SKX/CLX: AVX-512F/BW/VL/DQ  */ \
    "arch=znver4",          /* AMD Zen 4/5: AVX-512              */ \
    "arch=haswell",         /* Intel HSW / AMD Zen 1-3: AVX2+FMA */ \
    "default")))            /* SSE4.2 fallback (Nehalem+)        */
ARM took a different approach with SVE (Scalable Vector Extension). Rather than fixing register width at compile time, SVE code is length-agnostic: the same binary runs on implementations from 128 to 2048 bits. The CPU tells software its vector length at runtime. The Grace CPU (Neoverse V2) in the DGX Spark implements SVE2 with 256-bit vectors.
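What "length-agnostic" means in practice: the loop below never hard-codes a vector width; it asks the hardware how many 32-bit lanes it has and predicates the tail. This is an illustrative ACLE-intrinsics sketch, not DDx's actual kernel.

// Length-agnostic SVE: the same loop works whether the hardware vector is
// 128-bit (4 floats) or 256-bit (8 floats, as on Neoverse V2).
// Compile with -march=armv8-a+sve (or -mcpu=neoverse-v2).
#include <arm_sve.h>

void scale(float* dst, const float* src, float q, long n) {
    for (long i = 0; i < n; i += svcntw()) {      // svcntw() = floats per vector
        svbool_t pg = svwhilelt_b32(i, n);        // predicate masks the tail elements
        svfloat32_t v = svld1(pg, src + i);
        svst1(pg, dst + i, svmul_x(pg, v, q));
    }
}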
GCC's target_clones for SVE requires glibc IFUNC resolver support that isn't yet functional on most ARM64 distributions (as of 2025). The ARM64 build uses a single code path compiled with -mcpu=neoverse-v2.
06 — STREAM Benchmark
The STREAM benchmark, developed by John McCalpin in 1991 and maintained at the University of Virginia, measures sustainable memory bandwidth using four simple vector operations. "Sustainable" is key—STREAM deliberately uses arrays too large to fit in cache, forcing every access to hit main memory.
Each level of the memory hierarchy trades capacity for speed. STREAM uses buffers large enough to exceed L3 cache, measuring true DRAM bandwidth.
Copy: The simplest operation: read from one array, write to another. Tests raw memory bandwidth without computational overhead. Compilers typically generate non-temporal (streaming) stores that bypass cache on the write path.
Scale: Adds a floating-point multiply to each element. Tests whether the memory system can keep the floating-point units fed. On modern CPUs, the multiply is essentially free—memory latency dominates.
Add: Reads from two arrays, writes to a third. Tests memory bandwidth with a higher read:write ratio (2:1 vs. 1:1). Some memory controllers optimize differently for read-heavy versus write-heavy workloads.
Triad: The most computationally intensive: fused multiply-add (FMA) on each element. Tests whether memory bandwidth or FMA throughput is the bottleneck. On bandwidth-limited systems, Triad performs similarly to Add. A schematic of all four kernels follows below.
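In schematic form (a sketch using McCalpin's array and scalar names, not DDx's exact source), the four kernels are:

// The four STREAM kernels (q is the scalar constant; compile with -O3 -fopenmp).
// Buffers must be sized well past L3 so every access reaches DRAM, not cache.
void stream_kernels(double* a, double* b, double* c, double q, long n) {
#pragma omp parallel for                      // Copy:  c[i] = a[i]
    for (long i = 0; i < n; i++) c[i] = a[i];
#pragma omp parallel for                      // Scale: b[i] = q * c[i]
    for (long i = 0; i < n; i++) b[i] = q * c[i];
#pragma omp parallel for                      // Add:   c[i] = a[i] + b[i]
    for (long i = 0; i < n; i++) c[i] = a[i] + b[i];
#pragma omp parallel for                      // Triad: a[i] = b[i] + q * c[i]
    for (long i = 0; i < n; i++) a[i] = b[i] + q * c[i];
}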
DDx runs the STREAM suite twice: once compiled by GCC with -O3 -march=native, and once compiled through nvcc with -Xcompiler flags. The nvcc compiler preprocesses source code before invoking GCC, which can disrupt loop patterns that GCC's auto-vectorizer recognizes.
// Compile directly with GCC → full SVE/AVX vectorization
$ gcc -O3 -march=native -fopenmp stream.cpp -o stream_gcc

// Compile through nvcc → GCC sees preprocessed code
$ nvcc -Xcompiler=-O3,-march=native stream.cu -o stream_nvcc

// On ARM64 (Grace), I measured:
//   GCC native:  116 GB/s  (STREAM Triad @ 2 GB)
//   nvcc+Xcomp:   42 GB/s  (same code, same flags)
07 — GPU Memory
GPU memory benchmarks measure bandwidth within the GPU's memory subsystem. Nearly all AI inference and training workloads are memory-bandwidth limited, not compute limited. Understanding your actual achievable bandwidth tells you whether you're getting full value from your hardware.
GPU memory hierarchy. The L2 cache is key to understanding results: small buffers achieve "impossible" bandwidth because they never leave L2.
Modern GPUs have large L2 caches: 24 MB on the GB10, 96 MB on the RTX 5090. Benchmarks with small buffers may report bandwidth far exceeding the theoretical DRAM limit—because the data never leaves cache. DDx tests buffer sizes from 2 MB to 2 GB to characterize both cache-resident and DRAM-limited performance.
Size      Mean          vs. RTX 5090 @1792
  8 MB    1,863 GB/s    104.0%   ← L2 cache bandwidth
 32 MB    4,582 GB/s    255.7%   ← Sweet spot: entire buffer in 96 MB L2
128 MB    1,555 GB/s     86.8%   ← Transitioning to DRAM
  2 GB    1,608 GB/s     89.8%   ← Sustained DRAM bandwidth
If you see 4,000+ GB/s at small buffer sizes on an RTX 5090, that's L2 cache bandwidth, not DRAM bandwidth. The 1,792 GB/s spec is for GDDR7 memory. Use 128+ MB buffers to measure actual memory subsystem performance. DDx reports percentages against theoretical DRAM bandwidth, making cache effects immediately visible as >100% results.
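A minimal sketch of such a measurement for one buffer size, timing a device-to-device copy with CUDA events (illustrative only; DDx's harness adds repeated trials and the statistics from section 04):

// Device-memory bandwidth for one buffer size. A device-to-device copy
// reads and writes the buffer, so effective bandwidth counts 2x the bytes.
#include <cuda_runtime.h>

double dtod_bandwidth_gbps(size_t bytes) {
    void *src, *dst;
    cudaMalloc(&src, bytes);
    cudaMalloc(&dst, bytes);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);  // warm-up pass
    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);  // timed pass
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaFree(src); cudaFree(dst);
    cudaEventDestroy(start); cudaEventDestroy(stop);
    return (2.0 * bytes / 1e9) / (ms / 1e3);                // GB/s, read + write
}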
08 — CPU↔GPU Transfers
On discrete GPUs, moving data between CPU and GPU memory traverses the PCIe bus. PCIe 5.0 x16 provides 64 GB/s unidirectional. The key factor is whether memory is pageable (can be swapped to disk) or pinned (page-locked in physical RAM).
Pageable memory requires an extra copy to a pinned staging buffer before DMA. Pinned memory enables direct GPU access.
The performance difference is dramatic: typically 5-6× faster with pinned allocations. This is why CUDA applications that transfer large buffers should use cudaHostAlloc() or cudaMallocHost().
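The API difference is a single allocation call. A minimal sketch of the comparison, with error checks omitted:

// Pageable vs. pinned host-to-device copies. On a discrete GPU the pinned
// path is typically several times faster; on unified memory the two are
// nearly identical. Illustrative sketch only.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>

static float time_h2d_ms(void* dst, const void* src, size_t bytes) {
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0);
    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    cudaEventDestroy(t0); cudaEventDestroy(t1);
    return ms;
}

int main() {
    const size_t bytes = 256ull << 20;                      // 256 MB
    void *dev, *pageable, *pinned;
    cudaMalloc(&dev, bytes);
    pageable = std::malloc(bytes);                          // swappable; driver stages it
    cudaHostAlloc(&pinned, bytes, cudaHostAllocDefault);    // page-locked; direct DMA
    std::memset(pageable, 1, bytes);                        // touch pages before timing
    std::memset(pinned, 1, bytes);

    std::printf("pageable: %.1f GB/s\n", bytes / 1e6 / time_h2d_ms(dev, pageable, bytes));
    std::printf("pinned:   %.1f GB/s\n", bytes / 1e6 / time_h2d_ms(dev, pinned, bytes));

    cudaFree(dev); cudaFreeHost(pinned); std::free(pageable);
    return 0;
}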
09 — Unified Memory
The DGX Spark (GB10), Grace Hopper (GH200), and integrated GPUs represent a fundamentally different architecture: CPU and GPU share the same physical memory. There's no PCIe bus to cross—cudaMemcpy becomes a formality that the driver can optimize away entirely.
Discrete GPUs require explicit data movement across PCIe. Unified memory systems share physical DRAM—pointers work from both processors without copying.
DDx detects unified memory architectures and runs additional benchmarks that only make sense in this context: cudaMallocManaged prefetch (should be near-instant), zero-copy kernel access, and cache-to-cache transfers via the 600 GB/s NVLink-C2C interconnect.
On unified memory systems, DDx shows nearly identical performance for pageable and pinned allocations—typically within 3%. There's no staging buffer because there's no separate memory pool to stage to. This eliminates an entire category of optimization work and bugs.
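A sketch of what this looks like from the programmer's side, assuming a unified-memory device (illustrative, not DDx's source):

// One managed allocation, touched by the CPU, then handed to the GPU.
// On the GB10 the prefetch is effectively a hint: the pages already live
// in the single shared LPDDR5X pool.
#include <cuda_runtime.h>

float* alloc_shared(size_t n) {
    float* p = nullptr;
    cudaMallocManaged(&p, n * sizeof(float));    // one pointer, visible to CPU and GPU

    for (size_t i = 0; i < n; i++) p[i] = 1.0f;  // CPU writes through the same pointer

    int device = 0;
    cudaGetDevice(&device);
    cudaMemPrefetchAsync(p, n * sizeof(float), device, 0);  // near-instant on unified memory
    cudaDeviceSynchronize();
    return p;                                    // GPU kernels read p with zero copies
}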
10 — NCCL Collectives
NCCL (NVIDIA Collective Communications Library) provides optimized implementations of collective operations used in distributed training. DDx measures AllGather, SendRecv Ring, All-to-All, and ReduceScatter across your actual network fabric.
These results come from actual benchmark runs on a heterogeneous cluster: two NVIDIA DGX Spark systems (aarch64) and one x86_64 workstation running WSL2.
The 0.64 GB/s Spark↔WSL2 throughput represents only 51% of 10GbE line rate—a 3.7× reduction compared to native Spark-to-Spark. This overhead comes from WSL2's virtualized network stack (NAT through Windows host) rather than the x86_64 architecture itself. Native Linux on the same hardware would likely achieve full 10GbE throughput.
NCCL collective patterns. AllGather replicates data to all ranks; All-to-All creates maximum simultaneous network flows to stress test switch fabric.
NCCL benchmarks report efficiency against detected network speed. DDx queries link speed via ethtool. Values above 100% indicate full-duplex operation—200GbE is 25 GB/s per direction, so 45 GB/s bidirectional is 180% of unidirectional theoretical.
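For context, this is roughly how one collective data point is produced (an illustrative sketch, not DDx's harness): each rank contributes count floats to an AllGather, and the elapsed time becomes a bandwidth figure to compare against the link speed ethtool reported.

// Time one AllGather on an existing communicator and convert to GB/s.
// Real harnesses add warm-up iterations and multiple trials.
#include <cuda_runtime.h>
#include <nccl.h>

double allgather_gbps(ncclComm_t comm, int n_ranks, size_t count) {
    float *send, *recv;
    cudaMalloc(&send, count * sizeof(float));
    cudaMalloc(&recv, count * sizeof(float) * n_ranks);

    cudaStream_t stream = 0;                      // default stream
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);

    cudaEventRecord(t0, stream);
    ncclAllGather(send, recv, count, ncclFloat, comm, stream);
    cudaEventRecord(t1, stream);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    cudaFree(send); cudaFree(recv);
    cudaEventDestroy(t0); cudaEventDestroy(t1);

    const double bytes = double(count) * sizeof(float) * n_ranks;  // data each rank ends up holding
    return (bytes / 1e9) / (ms / 1e3);            // algorithm bandwidth in GB/s
}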
11 — Interpretation
DDx reports percentages against theoretical bandwidth, but "theoretical" is often unachievable due to protocol overhead, memory controller scheduling, and cache coherency traffic. Here are reasonable efficiency targets:
CPU memory (STREAM): 40-50% of theoretical is typical. The GB10 achieves ~43% (117 GB/s of 273 GB/s).
GPU memory: 75-85% of theoretical with coalesced access. The GB10 achieves ~78% (213 GB/s).
CPU↔GPU transfers (pinned): 85-95% of theoretical PCIe bandwidth. The RTX 5090 on PCIe 5.0 x16 should achieve 55-60 GB/s.
NCCL collectives: 90%+ of detected network bandwidth for large messages (128+ MB). DDx achieves 180% of unidirectional on 200GbE.
For detailed benchmark numbers across GB10, RTX 5090, and cross-node configurations, see the DGX Spark review.
A 5% difference from published results is normal measurement variance. A 50% difference indicates a real problem: wrong NUMA node assignment, thermal throttling, misconfigured PCIe slot (x8 electrical in x16 physical), or driver issues.