Infrastructure Diagnostics

OrthoSystem DDx v21

Before surgery, you need a plan. To have a plan, you need a diagnosis. To get a diagnosis, you start with a differential diagnosis to consider all the possibilities. This is differential diagnostics for AI computing infrastructure—a systematic benchmark suite for understanding CPU memory, GPU memory, CPU↔GPU transfers, and multi-node networking. This guide explains what each test measures, why it matters, and how modern hardware actually works.

01 — Download

Getting Started

Linux / DGX-OS Only

These binaries are tested on the DGX Spark (DGX-OS) and on Windows Subsystem for Linux 2 (WSL2) for x86-64. The networking tests support NVIDIA NCCL only.

ARM64

DGX Spark (GB10)

Optimized for Neoverse V2 cores and SM121 (Blackwell). DGX Spark only. Single code path tuned for the Grace CPU's SVE2 vector units—no runtime dispatch overhead.

Download ARM64

SHA256: b49ececa0d2661aac2fabdc8d67bcb6d0b5c38543c6618443e4dc5d73e6be3a1

x86-64

Fat Binary

Runtime CPU dispatch via IFUNC. Contains multiple code paths optimized for different SIMD generations: AVX-512 (Skylake-X, Zen 4+), AVX2 (Haswell, Zen 1-3), and SSE4.2 fallback.

Download x86-64

SHA256: 2ec428999ea468ed88bf5241e2e6fd6b636994a11d6a7f2b0a9dee095af15922

Required Libraries

DDx requires X11/Motif for the GUI and OpenMP for multithreaded benchmarks:

sudo apt install libx11-6 libxext6 libxt6 libmotif-common libxm4 libgomp1

libnccl2 is required for multi-node benchmarks. We recommend building NCCL from source for optimal performance on your specific network hardware. Ensure the library is on your system's library search path, or consider packaging the tool as an AppImage. See NVIDIA's Stacked Sparks guide for DGX Spark-specific instructions.

Usage

# GUI mode (default) - opens the node discovery lobby
./orthosystem_ddx_NVDA_GB10

# Local benchmarks only - no GUI, no networking
./orthosystem_ddx_NVDA_GB10 --local

# Multi-node: Host a benchmark session expecting N total nodes
./orthosystem_ddx_NVDA_GB10 --primary --expect N

# Multi-node: Join an existing session at the given IP
./orthosystem_ddx_NVDA_GB10 --join <ip>

Replace the binary name with orthosystem_ddx_Linux_x86-64 for the x86-64 build.

02 — Results at a Glance

DGX Spark (GB10) — Unified Memory

The GB10's unified LPDDR5X architecture means CPU and GPU share the same 273 GB/s memory pool. No PCIe bottleneck, no staging buffers, no pinned vs. pageable distinction.

CPU Memory: 116 GB/s (STREAM Triad @ 2 GB · 43%)
GPU Memory: 213 GB/s (kernel @ 2 GB · 78%)
NCCL AllGather: 45.2 GB/s (200GbE · 181% of unidirectional)

RTX 5090 + Xeon W7-2475X — Discrete GPU

The RTX 5090's 1792 GB/s GDDR7 is accessed through PCIe 5.0 x16 (64 GB/s theoretical). This discrete architecture requires explicit data movement—and makes the pinned vs. pageable distinction critical.

CPU Memory: 110 GB/s (STREAM Triad @ 2 GB · 72%)
GPU Memory: 1,608 GB/s (kernel @ 2 GB · 90%)
PCIe Pinned: 57 GB/s (H→D @ 1 GB · 89%)
PCIe transfer, pinned vs. pageable memory (RTX 5090 · Host→Device · 1 GB buffer · PCIe 5.0 x16): pinned 57.2 GB/s vs. pageable 9.3 GB/s, 6.1× faster with pinned memory against a 64 GB/s theoretical ceiling.

Why the Massive Difference?

Pageable memory (standard malloc) can be swapped to disk by the OS, so the GPU's DMA engine cannot target it directly. Before the transfer, the CUDA driver must copy the data into an internal pinned staging buffer, a chunked double copy that leaves only a fraction of the PCIe bandwidth (about 9 GB/s in the chart above). Pinned memory (cudaHostAlloc) is page-locked and enables direct DMA. On unified memory systems like the GB10, this distinction vanishes entirely.
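The effect is easy to reproduce outside DDx. The sketch below is not DDx's code; it simply times the same 1 GB host-to-device copy from a malloc'd buffer and from a cudaHostAlloc'd buffer using CUDA events (error handling omitted):

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Time one host-to-device cudaMemcpy with CUDA events; returns GB/s.
static float h2d_gbps(void* dst, const void* src, size_t bytes)
{
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventRecord(t0);
    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms = 0.f;
    cudaEventElapsedTime(&ms, t0, t1);
    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    return bytes / ms / 1e6f;
}

int main()
{
    const size_t bytes = 1ull << 30;   // 1 GB, matching the chart above
    void *dev = nullptr, *pageable = nullptr, *pinned = nullptr;
    cudaMalloc(&dev, bytes);
    pageable = std::malloc(bytes);                         // swappable by the OS
    cudaHostAlloc(&pinned, bytes, cudaHostAllocDefault);   // page-locked

    std::printf("pageable: %.1f GB/s\n", h2d_gbps(dev, pageable, bytes));
    std::printf("pinned:   %.1f GB/s\n", h2d_gbps(dev, pinned, bytes));

    cudaFree(dev);
    std::free(pageable);
    cudaFreeHost(pinned);
    return 0;
}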

03 — The Lobby

GUI Mode and Node Discovery

When launched without arguments, DDx opens a Motif-based GUI that serves as a "lobby" for coordinating multi-node benchmarks. The lobby discovers other DDx instances on your local network segment, displays their IP addresses and detected architectures, and lets you manually add nodes by IP if automatic discovery fails.

Why Motif?

Motif (libXm) is a stable, minimal X11 toolkit that works identically across all Linux distributions including DGX-OS. Unlike GTK or Qt, it has no version fragmentation issues and adds only ~2MB to the binary. The aesthetic matches NVIDIA's DGX software stack, and the API has been stable since 1989—no breaking changes in 35 years.

To run multi-node benchmarks, select the nodes you want to include and click "Run Benchmark." The lobby coordinates NCCL initialization: the first selected node becomes the primary (rank 0) and generates an NCCL unique ID, then broadcasts it to secondary nodes. This handshake establishes the communication topology before benchmarks begin.
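In NCCL terms, the handshake reduces to a few calls. The sketch below is illustrative rather than DDx's actual code; the out-of-band channel that carries the unique ID between nodes (the lobby's own socket) is elided:

#include <nccl.h>

// Sketch of the lobby handshake (illustrative, not DDx's code). The transport
// that moves the 128-byte unique ID from the primary to the secondaries is
// elided here.
ncclComm_t lobby_init(int rank, int nranks)
{
    ncclUniqueId id;
    if (rank == 0)
        ncclGetUniqueId(&id);        // primary (rank 0) generates the ID
    // ...primary broadcasts `id` to secondaries; secondaries receive it...

    ncclComm_t comm;
    ncclCommInitRank(&comm, nranks, id, rank);   // blocks until all ranks join
    return comm;
}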

04 — Understanding Output

Reading Benchmark Results

DDx reports results with statistical context rather than single numbers. Memory bandwidth benchmarks are notoriously variable—cache state, thermal conditions, and background processes all affect measurements. Reporting a single number hides this reality; DDx exposes it.

Each benchmark runs multiple trials (typically 5) and reports:

  Size      Mean      [Min   -   Max]    SD     vs.GB10 @273
  128 MB    121.4 GB/s [ 119.2- 124.1]  ± 1.8    44.5%

The columns are: buffer size, mean bandwidth across trials, the range of observed values, standard deviation, and percentage of theoretical maximum. The reference bandwidth (here, 273 GB/s for the GB10's LPDDR5X) provides context for whether you're achieving reasonable efficiency.

Automatic Histograms for High-Variance Results

When the coefficient of variation (CV = standard deviation ÷ mean) exceeds 10%, DDx automatically prints an ASCII histogram. This threshold is defined in the source as HISTOGRAM_CV_THRESHOLD = 0.10. High variance usually indicates one of two things: cache effects (some runs hit cache, others spill to DRAM) or transient system activity.

  4 MB     1013.0 GB/s [ 963.8-1067.8] ± 30.0  371.1% 
     963-980  GB/s: █ (1)
     997-1014 GB/s: ██████ (3)
    1014-1031 GB/s: ████████████ (5)
    1048-1068 GB/s: █ (1)

A bimodal distribution (two distinct peaks) often indicates cache behavior: some runs hit L2/L3 cache while others spill to DRAM. A single peak with scattered outliers suggests transient system activity—background processes, interrupts, or thermal throttling.
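The variance check itself is only a few lines. A minimal sketch, assuming trials are collected as GB/s values; HISTOGRAM_CV_THRESHOLD is the constant cited above, the other names are illustrative:

#include <cmath>
#include <vector>

constexpr double HISTOGRAM_CV_THRESHOLD = 0.10;

// Returns true when the coefficient of variation exceeds the threshold.
// Assumes at least two trials.
bool needs_histogram(const std::vector<double>& trials_gbps)
{
    double mean = 0.0;
    for (double x : trials_gbps) mean += x;
    mean /= trials_gbps.size();

    double var = 0.0;
    for (double x : trials_gbps) var += (x - mean) * (x - mean);
    double sd = std::sqrt(var / (trials_gbps.size() - 1));   // sample standard deviation

    return (sd / mean) > HISTOGRAM_CV_THRESHOLD;              // CV = SD / mean
}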

05 — SIMD Fundamentals

How CPUs Process Data in Parallel

Before diving into memory benchmarks, we need to understand how modern CPUs actually move data. The key concept is SIMD: Single Instruction, Multiple Data. Instead of processing one number at a time, SIMD instructions operate on multiple values simultaneously using wide vector registers.

SIMD register width evolution: SSE (128-bit, 2 doubles/instruction) → AVX2 (256-bit, 4 doubles/instruction) → AVX-512 (512-bit, 8 doubles/instruction).

SIMD register widths have grown from 128 bits (SSE) to 512 bits (AVX-512). ARM's SVE is length-agnostic—the same binary runs on implementations from 128 to 2048 bits.

The x86 SIMD story is one of backward compatibility layered atop backward compatibility. SSE (1999) introduced 128-bit XMM registers. AVX (2011) doubled them to 256-bit YMM registers. AVX-512 (2017) doubled again to 512-bit ZMM registers—and added mask registers, embedded rounding, and a bewildering array of sub-extensions.

1997 · MMX: Intel's first SIMD. 64-bit registers repurposed from the x87 FPU; integer only.

1998 · AMD 3DNow!: Single-precision floats in MMX registers. Included PREFETCHW. Deprecated in 2010.

1999 · SSE: New 128-bit XMM registers. Single-precision floats. Still supported today.

2013 · AVX2 + FMA: 256-bit integer operations. Fused multiply-add. Haswell/Zen baseline.

2017 · AVX-512: 512-bit ZMM registers. Mask registers. Skylake-X server, Zen 4 consumer.

The Fat Binary Approach

DDx's x86-64 build contains multiple implementations of performance-critical functions, each optimized for a different SIMD generation. At program startup, glibc's IFUNC (indirect function) resolver inspects CPUID and patches the Global Offset Table to call the optimal version. There's no runtime dispatch overhead after this one-time initialization.

// GCC's target_clones generates multiple versions automatically
#define MULTIVERSION __attribute__((target_clones(                 \
    "arch=sapphirerapids", /* Intel SPR: AVX-512 + AMX + BF16   */ \
    "arch=skylake-avx512", /* Intel SKX/CLX: AVX-512F/BW/VL/DQ  */ \
    "arch=znver4",         /* AMD Zen 4/5: AVX-512              */ \
    "arch=haswell",        /* Intel HSW / AMD Zen 1-3: AVX2+FMA */ \
    "default")))           /* SSE4.2 fallback (Nehalem+)        */

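Applied to a hot loop, the macro above would look roughly like this (an illustrative function, not one of DDx's kernels); GCC emits one clone per listed target plus an IFUNC resolver that inspects CPUID once at load time:

#include <stddef.h>

// Each clone auto-vectorizes the loop for its own register width:
// ZMM (AVX-512), YMM (AVX2), or XMM (SSE).
MULTIVERSION
void scale(double* c, const double* a, double scalar, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        c[i] = scalar * a[i];
}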
ARM: A Cleaner Design

ARM took a different approach with SVE (Scalable Vector Extension). Rather than fixing register width at compile time, SVE code is length-agnostic: the same binary runs on implementations from 128 to 2048 bits. The CPU tells software its vector length at runtime. The Grace CPU (Neoverse V2) in the DGX Spark implements SVE2 with 256-bit vectors.
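The triad kernel written with SVE intrinsics shows what length-agnostic means in practice. This is a sketch assuming the ACLE intrinsics in <arm_sve.h> and a compiler flag such as -mcpu=neoverse-v2, not DDx's actual implementation:

#include <arm_sve.h>
#include <stddef.h>

// Length-agnostic STREAM Triad. svcntd() asks the hardware how many doubles
// fit in one vector, and the svwhilelt predicate masks off the tail, so the
// same binary is correct on 128-bit through 2048-bit implementations.
void triad_sve(double* c, const double* a, const double* b,
               double scalar, size_t n)
{
    for (size_t i = 0; i < n; i += svcntd()) {
        svbool_t pg = svwhilelt_b64(i, n);                    // active lanes this iteration
        svfloat64_t va = svld1_f64(pg, &a[i]);
        svfloat64_t vb = svld1_f64(pg, &b[i]);
        svfloat64_t vc = svmla_n_f64_x(pg, va, vb, scalar);   // va + vb * scalar
        svst1_f64(pg, &c[i], vc);
    }
}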

Why ARM64 DDx Isn't a Fat Binary

GCC's target_clones for SVE requires glibc IFUNC resolver support that isn't yet functional on most ARM64 distributions (as of 2025). The ARM64 build uses a single code path compiled with -mcpu=neoverse-v2.

06 — STREAM Benchmark

Measuring CPU Memory Bandwidth

The STREAM benchmark, developed by John McCalpin at the University of Virginia in 1991, measures sustainable memory bandwidth using four simple vector operations. "Sustainable" is key—STREAM deliberately uses arrays too large to fit in cache, forcing every access to hit main memory.

Memory hierarchy, latency vs. capacity: L1 cache 32-64 KB, ~1 ns, ~1 TB/s · L2 cache 256 KB-2 MB, ~4 ns, ~500 GB/s · L3 cache 24-100+ MB, ~15 ns, ~200 GB/s · main memory (DRAM) 32-128+ GB, ~80 ns, 100-300 GB/s.

Each level of the memory hierarchy trades capacity for speed. STREAM uses buffers large enough to exceed L3 cache, measuring true DRAM bandwidth.

Copy c[i] = a[i]

Bytes moved: 2 × array size (1 read + 1 write)

The simplest operation: read from one array, write to another. Tests raw memory bandwidth without computational overhead. Some compilers generate non-temporal (streaming) stores that bypass the cache on the write path.

Scale c[i] = scalar × a[i]

Bytes moved: 2 × array size (1 read + 1 write)

Adds a floating-point multiply to each element. Tests whether the memory system can keep the floating-point units fed. On modern CPUs the multiply is essentially free; memory bandwidth, not arithmetic, dominates.

Add c[i] = a[i] + b[i]

Bytes moved: 3 × array size (2 reads + 1 write)

Reads from two arrays, writes to a third. Tests memory bandwidth with a higher read:write ratio (2:1 vs 1:1). Some memory controllers optimize differently for read-heavy versus write-heavy workloads.

Triad c[i] = a[i] + scalar × b[i]

Bytes moved: 3 × array size (2 reads + 1 write)

The most computationally intensive: fused multiply-add (FMA) on each element. Tests whether memory bandwidth or FMA throughput is the bottleneck. On bandwidth-limited systems, Triad performs similarly to Add.
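The four kernels reduce to a few lines of OpenMP C. A minimal sketch (the real STREAM adds timing, repetition, and validation, and the arrays must exceed the last-level cache); the comments repeat the "bytes moved" accounting above:

#include <stddef.h>

void stream_kernels(double* a, double* b, double* c, double scalar, size_t n)
{
    #pragma omp parallel for
    for (size_t i = 0; i < n; ++i) c[i] = a[i];                 // Copy:  2 x array size

    #pragma omp parallel for
    for (size_t i = 0; i < n; ++i) c[i] = scalar * a[i];        // Scale: 2 x array size

    #pragma omp parallel for
    for (size_t i = 0; i < n; ++i) c[i] = a[i] + b[i];          // Add:   3 x array size

    #pragma omp parallel for
    for (size_t i = 0; i < n; ++i) c[i] = a[i] + scalar * b[i]; // Triad: 3 x array size
}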

Why DDx Runs CPU Benchmarks Twice

DDx runs the STREAM suite twice: once compiled by GCC with -O3 -march=native, and once compiled through nvcc with -Xcompiler flags. The nvcc compiler preprocesses source code before invoking GCC, which can disrupt loop patterns that GCC's auto-vectorizer recognizes.

ARM64 penalty: 2.8× (116 → 42 GB/s via nvcc)
x86-64 penalty: 5.8× (110 → 19 GB/s via nvcc)
GCC vs. nvcc host compilation, STREAM Triad @ 2 GB (same source code, different compilation paths): the DGX Spark (ARM64 / SVE2) reaches 116 GB/s with GCC -O3 -march=native but 42 GB/s via nvcc -Xcompiler; the Xeon W7-2475X (x86-64 / AVX-512) reaches 110 GB/s vs. 19 GB/s (broken vectorization).
# Compile directly with GCC → full SVE/AVX vectorization
$ gcc -O3 -march=native -fopenmp stream.cpp -o stream_gcc

# Compile through nvcc → GCC sees preprocessed code
$ nvcc -Xcompiler=-O3,-march=native stream.cu -o stream_nvcc

# On ARM64 (Grace), I measured:
#   GCC native:  116 GB/s (STREAM Triad @ 2GB)
#   nvcc+Xcomp:   42 GB/s (same code, same flags)

07 — GPU Memory

Device Memory Architecture

GPU memory benchmarks measure bandwidth within the GPU's memory subsystem. Nearly all AI inference and training workloads are memory-bandwidth limited, not compute limited. Understanding your actual achievable bandwidth tells you whether you're getting full value from your hardware.

GPU memory hierarchy: registers 255/thread, ~20 TB/s · shared memory 64-228 KB/SM, ~10 TB/s · L1 cache 128-256 KB/SM, ~5 TB/s · L2 cache 24 MB (GB10) / 96 MB (RTX 5090), ~4-5 TB/s at small sizes · device memory 128 GB (GB10) / 32 GB (RTX 5090), 273-1792 GB/s.

GPU memory hierarchy. The L2 cache is key to understanding results: small buffers achieve "impossible" bandwidth because they never leave L2.

Understanding Cache Effects

Modern GPUs have large L2 caches: 24 MB on the GB10, 96 MB on the RTX 5090. Benchmarks with small buffers may report bandwidth far exceeding the theoretical DRAM limit—because the data never leaves cache. DDx tests buffer sizes from 2 MB to 2 GB to characterize both cache-resident and DRAM-limited performance.

RTX 5090 L2 cache: 4,582 GB/s (32 MB buffer · 256% of DRAM)
RTX 5090 DRAM: 1,608 GB/s (2 GB buffer · 90% efficiency)
GB10 L2 cache: 1,157 GB/s (4 MB buffer · 424% of DRAM)
  Size      Mean        vs.RTX 5090 @1792
  8 MB      1,863 GB/s   104.0%    ← L2 cache bandwidth
  32 MB     4,582 GB/s   255.7%    ← Sweet spot: entire buffer in 96MB L2
  128 MB    1,555 GB/s    86.8%    ← Transitioning to DRAM
  2 GB      1,608 GB/s    89.8%    ← Sustained DRAM bandwidth

Interpreting "Impossible" Numbers

If you see 4,000+ GB/s at small buffer sizes on an RTX 5090, that's L2 cache bandwidth, not DRAM bandwidth. The 1,792 GB/s spec is for GDDR7 memory. Use 128+ MB buffers to measure actual memory subsystem performance. DDx reports percentages against theoretical DRAM bandwidth, making cache effects immediately visible as >100% results.
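Both regimes are easy to reproduce with a grid-stride copy kernel timed by CUDA events. A sketch, not DDx's actual kernel (a single launch is timed, so very small buffers also pick up launch overhead):

#include <cuda_runtime.h>

// Grid-stride copy: 16 B read + 16 B write per float4 element.
__global__ void copy_kernel(const float4* __restrict__ src,
                            float4* __restrict__ dst, size_t n)
{
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x)
        dst[i] = src[i];
}

// Returns achieved bandwidth in GB/s for one timed launch.
float device_copy_gbps(size_t bytes)
{
    size_t n = bytes / sizeof(float4);
    float4 *src, *dst;
    cudaMalloc(&src, bytes);
    cudaMalloc(&dst, bytes);
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventRecord(t0);
    copy_kernel<<<1024, 256>>>(src, dst, n);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms = 0.f;
    cudaEventElapsedTime(&ms, t0, t1);
    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    cudaFree(src);
    cudaFree(dst);
    return 2.0f * bytes / ms / 1e6f;   // read + write traffic
}

Small `bytes` values stay resident in L2 and report "impossible" bandwidth; 128 MB and larger buffers report sustained DRAM bandwidth.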

08 — CPU↔GPU Transfers

Moving Data Between Processors

On discrete GPUs, moving data between CPU and GPU memory traverses the PCIe bus. PCIe 5.0 x16 provides 64 GB/s unidirectional. The key factor is whether memory is pageable (can be swapped to disk) or pinned (page-locked in physical RAM).

Pageable vs. pinned memory transfers: the pageable (malloc) path goes user buffer → driver staging buffer → GPU memory at ~10 GB/s (double-copy overhead); the pinned (cudaHostAlloc) path goes page-locked user buffer → GPU memory via direct DMA at ~57 GB/s (full PCIe bandwidth), 5-6× faster with pinned allocations.

Pageable memory requires an extra copy to a pinned staging buffer before DMA. Pinned memory enables direct GPU access.

The performance difference is dramatic: typically 5-6× faster with pinned allocations. This is why CUDA applications that transfer large buffers should use cudaHostAlloc() or cudaMallocHost().
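Pinned buffers are also what make truly asynchronous transfers possible. A sketch of the usual pattern, with illustrative names (with a pageable buffer, cudaMemcpyAsync is staged through driver-internal buffers and loses most of its asynchrony):

#include <cuda_runtime.h>
#include <cstddef>

// Page-locked host buffer + cudaMemcpyAsync: the copy can overlap with
// kernels running on other streams.
void upload_async(float* dev, size_t bytes, cudaStream_t stream)
{
    float* host = nullptr;
    cudaMallocHost(&host, bytes);     // page-locked allocation
    // ... fill `host` with input data here ...
    cudaMemcpyAsync(dev, host, bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);    // wait before freeing or reusing the buffer
    cudaFreeHost(host);
}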

09 — Unified Memory

When CPU and GPU Share DRAM

The DGX Spark (GB10), Grace Hopper (GH200), and integrated GPUs represent a fundamentally different architecture: CPU and GPU share the same physical memory. There's no PCIe bus to cross—cudaMemcpy becomes a formality that the driver can optimize away entirely.

Discrete GPU (RTX 5090): the CPU with 256 GB of DDR5 connects to the GPU's 32 GB of GDDR7 over PCIe 5.0, a 64 GB/s bottleneck that data must cross. Unified memory (GB10): the Grace CPU and Blackwell GPU connect via NVLink-C2C and share 128 GB of LPDDR5X at 273 GB/s; both processors access the same physical RAM.

Discrete GPUs require explicit data movement across PCIe. Unified memory systems share physical DRAM—pointers work from both processors without copying.

DDx detects unified memory architectures and runs additional benchmarks that only make sense in this context: cudaMallocManaged prefetch (should be near-instant), zero-copy kernel access, and cache-to-cache transfers via the 600 GB/s NVLink-C2C interconnect.
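A sketch of the managed-memory pattern these benchmarks exercise (illustrative, not DDx's code): the same pointer is touched from host and device, and on unified-memory hardware the prefetch is a hint rather than a bulk copy.

#include <cuda_runtime.h>

__global__ void touch(double* x, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0;   // GPU reads/writes the same pages the CPU wrote
}

void unified_demo(size_t n)
{
    double* x = nullptr;
    cudaMallocManaged(&x, n * sizeof(double));
    for (size_t i = 0; i < n; ++i) x[i] = 0.0;   // CPU writes through the same pointer

    int dev = 0;
    cudaGetDevice(&dev);
    cudaMemPrefetchAsync(x, n * sizeof(double), dev);   // near-instant on unified memory
    touch<<<(unsigned)((n + 255) / 256), 256>>>(x, n);
    cudaDeviceSynchronize();
    cudaFree(x);
}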

The Pinned/Pageable Distinction Vanishes

On unified memory systems, DDx shows nearly identical performance for pageable and pinned allocations—typically within 3%. There's no staging buffer because there's no separate memory pool to stage to. This eliminates an entire category of optimization work and bugs.

10 — NCCL Collectives

Multi-Node Network Benchmarks

NCCL (NVIDIA Collective Communications Library) provides optimized implementations of collective operations used in distributed training. DDx measures AllGather, SendRecv Ring, All-to-All, and ReduceScatter across your actual network fabric.
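Each collective reduces to a single NCCL call bracketed by CUDA events on the communication stream. A timing sketch for AllGather, assuming the communicator from the lobby handshake described earlier and illustrative buffer handling:

#include <nccl.h>
#include <cuda_runtime.h>

// Times one AllGather and converts it to per-rank wire traffic in GB/s.
// `recvbuf` must hold nranks chunks of bytes_per_rank each.
float allgather_gbps(ncclComm_t comm, int nranks, size_t bytes_per_rank,
                     const void* sendbuf, void* recvbuf, cudaStream_t stream)
{
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventRecord(t0, stream);
    ncclAllGather(sendbuf, recvbuf, bytes_per_rank, ncclChar, comm, stream);
    cudaEventRecord(t1, stream);
    cudaStreamSynchronize(stream);
    float ms = 0.f;
    cudaEventElapsedTime(&ms, t0, t1);
    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    // Each rank sends its own chunk once and receives (nranks - 1) chunks.
    double wire_bytes = (double)bytes_per_rank * (nranks - 1);
    return (float)(wire_bytes / ms / 1e6);   // GB/s
}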

200GbE RoCE (DGX Spark ↔ DGX Spark)

AllGather: 45.2 GB/s (2 GB · 181% of unidirectional)
SendRecv Ring: 41.3 GB/s (2 GB · 165% full-duplex)
All-to-All: 39.0 GB/s (2 GB · switch stress test)

10GbE vs. 200GbE: Real-World Comparison

The following results demonstrate actual benchmark runs on a heterogeneous cluster: two NVIDIA DGX Spark systems (aarch64) and one x86_64 workstation running WSL2.

NCCL AllGather bandwidth @ 128 MB, network fabric comparison across configurations: 200GbE RoCE Spark ↔ Spark 40.8 GB/s (163% efficiency), 10GbE TCP Spark ↔ Spark 2.35 GB/s (188% efficiency, native TCP), 10GbE TCP Spark ↔ WSL2 0.64 GB/s (51% efficiency, TCP via WSL2).
200GbE vs. 10GbE: 17× bandwidth improvement
WSL2 overhead: 3.7× vs. native 10GbE

WSL2 Networking Overhead

The 0.64 GB/s Spark↔WSL2 throughput represents only 51% of 10GbE line rate—a 3.7× reduction compared to native Spark-to-Spark. This overhead comes from WSL2's virtualized network stack (NAT through Windows host) rather than the x86_64 architecture itself. Native Linux on the same hardware would likely achieve full 10GbE throughput.

AllGather: each rank contributes its chunk and every rank receives all of them (before: A, B, C, D spread across ranks; after: all ranks hold [A, B, C, D]). All-to-All: every rank exchanges data with every other rank; with 4 ranks that is N×(N-1) = 12 simultaneous flows.

NCCL collective patterns. AllGather replicates data to all ranks; All-to-All creates maximum simultaneous network flows to stress test switch fabric.

NCCL benchmarks report efficiency against detected network speed. DDx queries link speed via ethtool. Values above 100% indicate full-duplex operation—200GbE is 25 GB/s per direction, so 45 GB/s bidirectional is 180% of unidirectional theoretical.
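The efficiency figure is just measured bandwidth divided by the detected line rate. The sketch below reads link speed from sysfs for brevity (DDx itself queries ethtool, as noted above; the function and interface names are illustrative):

#include <cstdio>

// Reads /sys/class/net/<iface>/speed (Mbit/s) and returns the percentage of
// line rate that a measured GB/s figure represents.
double efficiency_pct(double measured_gbps, const char* iface)
{
    char path[128];
    std::snprintf(path, sizeof(path), "/sys/class/net/%s/speed", iface);
    long mbit = 0;
    if (FILE* f = std::fopen(path, "r")) {
        std::fscanf(f, "%ld", &mbit);
        std::fclose(f);
    }
    if (mbit <= 0) return 0.0;                       // link down or speed unknown
    double line_rate_gbps = mbit / 8000.0;           // 200000 Mbit/s -> 25 GB/s
    return 100.0 * measured_gbps / line_rate_gbps;   // >100% implies full-duplex traffic
}

For a 200GbE link (25 GB/s per direction), a measured 45.2 GB/s comes back as roughly 181%, matching the AllGather figure above.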

11 — Interpretation

What Good Results Look Like

DDx reports percentages against theoretical bandwidth, but "theoretical" is often unachievable due to protocol overhead, memory controller scheduling, and cache coherency traffic. Here are reasonable efficiency targets:

CPU DRAM (STREAM Triad)

40-50% of theoretical is typical. The GB10 achieves ~43% (116 GB/s of 273 GB/s).

GPU DRAM (kernel)

75-85% of theoretical with coalesced access. The GB10 achieves ~78% (213 GB/s).

PCIe Transfers (pinned)

85-95% of theoretical PCIe bandwidth. RTX 5090 on PCIe 5.0 x16 should achieve 55-60 GB/s.

NCCL AllGather

90%+ of network bandwidth for large messages (128+ MB). DDx achieves 180% of unidirectional on 200GbE.

For detailed benchmark numbers across GB10, RTX 5090, and cross-node configurations, see the DGX Spark review.

When to Worry

A 5% difference from published results is normal measurement variance. A 50% difference indicates a real problem: wrong NUMA node assignment, thermal throttling, misconfigured PCIe slot (x8 electrical in x16 physical), or driver issues.