The scalpel is one of the oldest tools in surgery.
Its form hasn't fundamentally changed in centuries.
Yet in the hands of a master surgeon versus a first-year resident,
the outcomes are very different.
Artificial intelligence is the same kind of tool. The technology exists. What matters is how it's wielded, and by whom. OrthoSystem represents the convergence of three decades in computing and a decade in academic medicine, built on the understanding that educational AI demands the same precision we require in the operating room.
December 2025 · NVIDIA Grace Blackwell Optimized
01 — The Journey
OrthoSystem didn't emerge from a startup accelerator or a corporate AI lab. It emerged from the unusual intersection of two careers that shouldn't have overlapped and the realization that understanding both domains deeply enough reveals solutions invisible in isolation.
In high school and college, I ran what became the world's most popular enthusiast site covering ATI (now AMD) graphics hardware. I was invited by ATI to Computex Taipei but had to decline, because I needed to take the MCAT. In 2001, I attended the launch party for the world's first programmable GPU, the NVIDIA GeForce 3. I was watching the birth of what would become the foundation of modern AI computing.
As a freelance tech journalist, I was granted the exclusive interview with Ian Buck, the architect of CUDA, NVIDIA's parallel computing platform that would eventually enable their leadership in accelerated computing. That NVIDIA trusted me to tell the story of CUDA to enthusiasts reflected years of building credibility in the hardware community. I wrote about the future of accelerated computing in 2010, before most of the industry understood why GPUs would matter beyond graphics, all while doing my orthopaedic surgery residency.
I worked my way from assistant professor to full professor of orthopaedic spine surgery at UCSF, practicing and teaching at the San Francisco VA Health Center. The technical instincts didn't disappear; they found new applications. Virtual surgical planning (US9601030B2, 2017). Adult stem cell therapies (US10994050B2, 2021). Trabecular bone lattice design for advanced orthopaedic implants and next-generation anisotropic heat exchangers (US11717418B2, 2023). A decade of teaching residents, watching how they learn, understanding where traditional education fails them. This is also when I began a 3D printing program for education and co-founded a company to help deliver orthopaedic innovation globally.
When I wrote a review article explaining the basic science of large language models for orthopaedic surgeons, I recognized the failure modes. The reasoning errors AI makes, drawing wrong conclusions from pattern-matching without structural understanding, were familiar. They were the same errors I see in early learners. And I realized: I understood both domains deeply enough to fix this myself.
NVIDIA released the DGX Spark: 128GB of unified memory and a Grace Blackwell SoC with ConnectX-7 200GbE networking. For the first time, affordable hardware existed that could run the full vision on a single node: large embedding models, GPU-accelerated compute, real-time visualization, and enterprise-level software tools. That same month, the UCSF Department of Orthopaedic Surgery funded this AI research through my fall teaching bonus.
Leveraging the combination of the NVIDIA software stack and the DGX Spark, I completed OrthoRAG V41 in two months: a production-grade, single-user retrieval system with instruction-tuned embeddings, multi-scale chunking, hybrid search, and neural reranking. Just “me, myself, and AI.” The same teaching instincts that guide a resident through a complex case now inform the design of the tutoring system.
02 — Overview
OrthoSystem integrates four complementary subsystems to address fundamentally different modes of knowledge access. The platform recognizes that experts and learners approach information differently and provides tools optimized for each cognitive style while leveraging shared infrastructure.
OrthoRAG: How experts think. Query-driven retrieval with instruction-tuned embeddings. Augments existing knowledge with precise, cited sources. Grounds LLM responses with professorial reliability.
OrthoGraph: How learners explore. Visual navigation through semantic space. See relationships, discover clusters, identify gaps. Transforms the unknown into navigable territory.
OrthoTutor: AI-assisted instruction. Integrates OrthoRAG and OrthoGraph capabilities for personalized learning experiences.
OrthoBench: Systematic evaluation. Two-step LLM-as-judge scoring against domain rubrics. Measures accuracy, safety, and speed across any OpenAI-compatible endpoint.
03 — What is RAG
When you ask ChatGPT or any other LLM a question, it generates an answer from patterns learned during training. It has no access to your documents, your textbooks, or the latest research. Retrieval-Augmented Generation (RAG) changes this. Before the language model generates a response, a retrieval system searches your documents and injects the relevant passages into the prompt. The model then generates an answer grounded in actual sources rather than statistical memory. This is why a service like ChatGPT appears to be continuously learning: the underlying LLM has a fixed amount of knowledge, but it combines web search and updated knowledge documents at the time of response.
The concept sounds simple. The implementation is not.
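The naive version fits in a dozen lines. A minimal sketch of the retrieve-then-generate loop, where `embed` and `vector_search` are hypothetical stand-ins for the embedding service and vector database, and the endpoint and model name are illustrative:

```python
# Minimal retrieve-then-generate loop. `embed` and `vector_search` are
# hypothetical stand-ins; the endpoint and model name are illustrative.
from openai import OpenAI

llm = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

def answer(question: str, embed, vector_search, top_k: int = 5) -> str:
    query_vec = embed(question)                       # text -> high-dimensional vector
    passages = vector_search(query_vec, limit=top_k)  # nearest chunks in the corpus
    context = "\n\n".join(p["text"] for p in passages)
    response = llm.chat.completions.create(
        model="gpt-oss-20b",
        messages=[
            {"role": "system",
             "content": "Answer only from the provided sources, and cite them."},
            {"role": "user",
             "content": f"Sources:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```

Everything that follows is about why this naive version falls short.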
I didn't set out to create my own RAG, but my initial exploration of consumer-level RAG platforms revealed fundamental limitations. Many solutions had arbitrary file size limits, refusing to work with large textbooks. Others used generic embedding models trained on web text, with retrieval quality varying wildly depending on how questions were phrased.
The deeper problem is architectural. RAG isn't a single tool. Instead, it's an orchestration of multiple AI subsystems, each making decisions that compound downstream.
Embedding models. These translate text into vectors: coordinates in high-dimensional space where similar concepts cluster together. Generic embeddings confuse medical terminology. Instruction-tuned embeddings understand that a query differs from a document, improving retrieval precision. OrthoRAG uses NVIDIA's Llama Embed Nemotron 8B with asymmetric instruction tuning optimized for medical content.
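A sketch of what asymmetry looks like at the API level: the query side carries a task instruction while documents are embedded as-is, so a short question and a long passage land in comparable regions of the space. The prefix text, endpoint, and service name are illustrative, not the model's actual prompt template:

```python
import requests

EMBED_URL = "http://embedding-service:8001/v1/embeddings"  # illustrative Docker DNS name

def embed(text: str, is_query: bool) -> list[float]:
    # Asymmetric instruction tuning: queries get a task instruction,
    # documents do not. The exact prefix here is a stand-in.
    prefix = ("Instruct: Retrieve passages that answer this orthopaedic question.\nQuery: "
              if is_query else "")
    resp = requests.post(EMBED_URL, json={"model": "llama-embed-nemotron-8b",
                                          "input": prefix + text})
    return resp.json()["data"][0]["embedding"]  # 4096-dimensional vector
```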
Chunking. Documents must be split into retrievable segments. Too large, and irrelevant content dilutes the signal. Too small, and context is lost. Multi-scale chunking preserves both detail and context through parent-child relationships.
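A minimal sketch of the parent-child idea: retrieval matches against small child chunks, but the answer context comes from the larger parent. The sizes and character-based splitting rule are illustrative; production chunkers respect sentence and section boundaries:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    parent_id: int | None = None  # child chunks point back to their parent

def multiscale_chunks(document: str, parent_size: int = 2000, child_size: int = 400):
    """Split into large parents for context, small children for retrieval."""
    parents, children = [], []
    for p_start in range(0, len(document), parent_size):
        parent = Chunk(document[p_start:p_start + parent_size])
        pid = len(parents)
        parents.append(parent)
        for c_start in range(0, len(parent.text), child_size):
            children.append(Chunk(parent.text[c_start:c_start + child_size], parent_id=pid))
    return parents, children

# At query time: search over children, then expand each hit to its parent
# so the LLM sees surrounding context, not an isolated fragment.
```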
Reranking. Initial retrieval casts a wide net. A reranker, another neural network, scores each candidate against the actual query, promoting truly relevant passages. Without reranking, retrieval often returns factually accurate but irrelevant content, which can impair reasoning, a problem seen with orthopaedic learners too.
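The two-stage pattern in miniature: vector search over-fetches candidates, then a cross-encoder scores each (query, passage) pair jointly. The sentence-transformers model here is a stand-in for the production reranker service:

```python
from sentence_transformers import CrossEncoder

# Stand-in cross-encoder; OrthoRAG runs a dedicated reranker service instead.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], keep: int = 5) -> list[str]:
    # Scoring each pair jointly is far more precise than comparing two
    # independent embeddings, but too slow to run over the whole corpus.
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda s: s[0], reverse=True)
    return [c for _, c in ranked[:keep]]
```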
Each component is itself a "mini AI" making decisions that affect final answer quality. The embedding model decides what counts as similar. The chunker decides what constitutes a coherent unit. The reranker decides what's actually relevant. Commercial platforms make these decisions for you, generically. Building OrthoRAG meant taking control and fine-tuning each stage.
Two parallel RAG implementations, each serving different reliability and capability requirements.
V33: Production-stable. Proven reliability. Hand-tuned performance.
V41: Enhanced capabilities for OrthoGraph integration.
V41's additional metadata enables OrthoGraph's multiple semantic views; organizing the corpus by subspecialty, anatomy, or evidence level requires that information to exist in the database.
04 — Philosophy
When I search the literature, I have a precise question: "What's the latest evidence on 3D printed titanium lumbar interbody cages?" I'm filling a specific gap in my knowledge.
Learners don't have that framework yet. But it's not as if they're standing at the edge of unfamiliar territory with no map. They have a map; it's just too big and too complex. The challenge is breaking that massive universe of information into learnable chunks for residents working 70 to 80 hours a week in the hospital while trying to preserve a healthy work-life balance and personal commitments.
"That giant textbook? Learners don't need help figuring out what parts are important — everything is important."
Query-driven. The expert knows what to ask.
Exploration-driven. The learner doesn't know what they don't know.
The mathematics that make this possible emerged not from Silicon Valley but from university research. In the 1950s, Charles Osgood and his team at the University of Illinois asked hundreds of undergraduate students to rate concepts (words like lady, father, fire, boulder, tornado, and sword) on fifty different scales: good-bad, hard-soft, hot-cold, sweet-sour. Across multiple studies and populations, concepts consistently clustered into three dimensions: good-bad, strong-weak, and active-passive. Each word could be described by three numbers, just like x, y, z coordinates. This research required the University of Illinois's ILLIAC, a state-of-the-art vacuum-tube computer of the era.
The next major leap did come from Silicon Valley in 2013, when Tomas Mikolov's team at Google operationalized an old linguistic idea: the meaning of a word is defined by the words it tends to occur with. Using a Google News dataset with six billion words, the team placed one million common words in 300 dimensions, derived purely from patterns of co-occurrence. The stunning result: if you took the coordinate for king, subtracted man, and added woman, the nearest word was queen. Paris minus France plus Italy? Rome. Meaning had become geometry.
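The arithmetic is easy to reproduce with the published vectors, for example via the gensim library (assuming you accept its roughly 1.6GB model download):

```python
import gensim.downloader as api

# Loads the pretrained Google News vectors (300 dimensions, ~1M words).
model = api.load("word2vec-google-news-300")

# king - man + woman ~= queen: meaning as geometry.
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# [('queen', ~0.71)]
```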
In 2017, the transformer architecture enabled this mathematical structure to scale to entire sentences and documents. OrthoGraph leverages this same geometry. Semantic similarity becomes spatial proximity. A search doesn't return a list; it illuminates a region.
The embedding service has been allocated ~24GB memory. Instruction-tuned for medical terminology. Produces 4096-dimensional vectors.
GPU UMAP via cuML/RAPIDS. Dimensionality reduction for visualization. 50K chunks require ~6GB of working memory.
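A sketch of the projection step with cuML's GPU UMAP; the random array stands in for real embeddings, and the parameters are illustrative, not the tuned production values:

```python
import cupy as cp
from cuml.manifold import UMAP

# Stand-in for the corpus: 50K chunk embeddings, 4096 dims, already on the GPU.
embeddings = cp.random.rand(50_000, 4096).astype(cp.float32)

projector = UMAP(n_components=2, n_neighbors=15, min_dist=0.1)
coords = projector.fit_transform(embeddings)   # shape (50_000, 2)

# The 2D coordinates are stored alongside each vector in Qdrant, so the
# frontend can stream viewports without recomputing the layout.
```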
128K context windows. OpenAI's 20B and 120B open-weight models, configured for maximum reasoning effort, work with OrthoRAG.
The DGX Spark's 128GB unified memory enables all services to coexist on a single node for development. Production deployments with larger models benefit from dual-node configurations where inference runs on dedicated hardware; larger models bring more sophisticated reasoning capabilities.
05 — Architecture
Every architectural decision has been carefully considered to provide performance today while remaining scalable. From prompts to the configuration of HNSW (Hierarchical Navigable Small World) indices, almost nothing is left at default parameters. Qdrant was selected from a multitude of vector databases because it matches our latency priorities and can store visualization coordinates alongside 4096-dimensional embeddings without a separate database.
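A sketch of the collection setup this implies, using the qdrant-client API; the HNSW numbers and payload field names are illustrative, not the tuned production values:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, HnswConfigDiff, PointStruct, VectorParams

client = QdrantClient(url="http://qdrant:6333")   # internal Docker DNS name

client.create_collection(
    collection_name="ortho_chunks",
    vectors_config=VectorParams(size=4096, distance=Distance.COSINE),
    # HNSW graph parameters trade index size and build time for recall/latency.
    hnsw_config=HnswConfigDiff(m=32, ef_construct=256),
)

# Visualization coordinates and filtering metadata live in the payload,
# right next to the embedding: no second database required.
client.upsert(
    collection_name="ortho_chunks",
    points=[PointStruct(
        id=1,
        vector=[0.0] * 4096,                       # the chunk's embedding
        payload={"text": "...", "umap_xy": [0.42, -1.30],
                 "subspecialty": "Spine", "evidence_level": "RCT"},
    )],
)
```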
Why FlashAttention matters: this optional technology optimizes the way GPUs execute the attention math at the heart of LLMs. Because it is optional and requires model-specific optimization, many hobbyist-level RAG and local LLM implementations do not use it, especially for models released as recently as October 2025. This is one example of the technical nuance in the system.
06 — OrthoGraph
Learners face a challenge: sometimes you don't know what questions to ask. OrthoGraph addresses this by transforming the curated documents into navigable space. The same vector embeddings that power retrieval become coordinates in a visual field where semantic similarity becomes spatial proximity.
A search doesn't just return a list; it illuminates a region. Clusters reveal topic structure. Sparse areas indicate gaps. The learner develops intuition about the knowledge landscape before diving into specific content.
Query: "ACL reconstruction techniques" — Results appear as bright points. Their clustering reveals related literature; their distance from other clusters shows how the topic connects to the broader field.
The same data will be viewable in multiple ways.
Subspecialty. Clusters by clinical domain: Sports Medicine, Trauma, Arthroplasty, Spine, Pediatric, Oncology.
Anatomy. Organizes by body region: Knee, Hip, Shoulder, Spine, Foot/Ankle, Hand/Wrist.
Evidence. Stratifies by research quality. Background gradient shows evidence strength distribution.
Source. Groups by source category: Textbook, Review, RCT, Case Report, Guidelines.
OrthoGraph consists of three containerized services: a GPU-accelerated projection service for UMAP dimensionality reduction, an API server for search orchestration and viewport streaming, and a WebGPU frontend for visualization.
OrthoGraph adapts to different OrthoRAG deployment levels, auto-detecting capabilities on startup (a sketch of this detection follows the list below).
Embeddings only. Basic UMAP projection with search illumination.
V33 single-scale. Adds document metadata and source filtering.
V41 multi-scale. Parent/child LOD, subspecialty views, evidence filtering.
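A minimal sketch of what that startup detection could look like: probe one stored chunk's payload for marker fields and degrade gracefully. The field names and feature flags are illustrative; the real probe logic may differ:

```python
def detect_level(sample_payload: dict) -> str:
    """Infer the OrthoRAG deployment level from one stored chunk's payload."""
    if "parent_id" in sample_payload and "subspecialty" in sample_payload:
        return "V41"         # multi-scale metadata present
    if "source" in sample_payload:
        return "V33"         # single-scale with document metadata
    return "embeddings"      # bare vectors: projection + search illumination only

FEATURES = {
    "embeddings": {"umap", "search"},
    "V33": {"umap", "search", "source_filter"},
    "V41": {"umap", "search", "source_filter", "lod",
            "subspecialty_views", "evidence_filter"},
}
```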
07 — Implementation
OrthoSystem is not a prototype or proof-of-concept. The following implementations are production-grade, containerized, and currently deployed.
OrthoRAG V33: ~10,000 lines. Single-scale chunking implementation. Five containerized services. Vector search with neural reranking.
OrthoRAG V41: ~20,000 lines. Multi-scale chunking with parent-child relationships. Enhanced metadata for OrthoGraph integration.
OrthoGraph: ~8,000 lines to date. GPU UMAP projection service. WebGPU visualization frontend. Three containerized services with real-time viewport streaming.
Services are organized by subsystem. All communication occurs via Docker network DNS with internal service discovery.
08 — Deployment
Each subsystem follows single-script deployment, theoretically allowing any DGX Spark user to deploy the system in minutes.
All services run in Docker containers, many based on NVIDIA NGC images.
All containers communicate through TCP protocols, enabling future multi-node scaling.
Docker Compose for single-node development. Scalable to multi-node production.
Container memory allocations based on production deployment. RAG services total ~38GB allocated, with Qdrant scaled to corpus size. Model sizes listed below are weights only; 128K context windows require additional KV cache memory.
Embedding Service: 24GB
Reranker Service: 6GB
RAG Web: 4GB
Ingestion Service: 4GB
Subtotal: 38GB
GPT-OSS-20B
Weights (MXFP4): ~12GB
KV Cache (BF16 @ 128K): ~6GB
Total: ~18GB + overhead
GPT-OSS-120B
Weights (MXFP4): ~60GB
KV Cache (BF16 @ 128K): ~9GB
Total: ~69GB + overhead
KV cache uses BF16—attention layers are not quantized.
GQA (8 KV heads vs 64 query heads) reduces cache 8×.
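These cache figures follow directly from the attention shapes. A back-of-envelope calculation, assuming the published GPT-OSS layer counts (24 for the 20B, 36 for the 120B, both with 8 KV heads of dimension 64):

```python
def kv_cache_gb(layers: int, kv_heads: int = 8, head_dim: int = 64,
                context: int = 128 * 1024, bytes_per_elem: int = 2) -> float:
    # K and V each store (context x kv_heads x head_dim) values per layer, in BF16.
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 1e9

print(kv_cache_gb(layers=24))   # GPT-OSS-20B  -> ~6.4 GB
print(kv_cache_gb(layers=36))   # GPT-OSS-120B -> ~9.7 GB

# With 64 query heads sharing 8 KV heads (GQA), the cache is 8x smaller
# than full multi-head attention would require (~72 GB for the 120B model).
```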
Single node: Development configuration with the 20B model.
Comfortable headroom for development. 20B weights (12GB MXFP4) + KV cache (6GB BF16 @ 128K) + overhead.
Dual node: Required for the 120B model + vision capabilities.
Node 1: Services
Node 2: Inference
120B weights (60GB MXFP4) + KV cache (9GB BF16 @ 128K) + activations.
09 — FAQ
Why build on the NVIDIA platform?
Fundamentally, the architecture is platform-agnostic. Vector databases, embedding models, and transformer inference all have implementations across AMD, Intel, and Apple Silicon. However, I chose to build on the NVIDIA platform for cost and time efficiency. The NVIDIA software stack (cuML for GPU UMAP, TensorRT-LLM for optimized inference, the NGC container registry) dramatically accelerates development. To prove I could actually execute this vision, I built not just a working prototype but a full enterprise-grade RAG system. Leveraging the combination of the NVIDIA software stack and the DGX Spark allowed me to complete OrthoRAG V41 in two months. I would welcome support to port OrthoSystem to other platforms.
Why not a knowledge graph?
Knowledge graphs require entity extraction, relationship classification, and schema maintenance: significant complexity for uncertain retrieval benefit. Modern LLMs handle multi-hop reasoning at inference time. OrthoGraph uses the same embeddings as OrthoRAG, projected into visual space, avoiding a separate database and extraction pipeline. The visualization provides exploration value without complicating retrieval.
Why does this need 128GB of unified memory?
Memory requirements compound quickly. The embedding model (Llama Embed Nemotron 8B) needs ~16GB. GPU UMAP working memory adds ~6GB. Qdrant with 50K chunks needs ~4GB. The inference stack is where it gets interesting: GPT-OSS-120B weights in MXFP4 require ~60GB, but the KV cache runs in BF16 (attention layers aren't quantized), adding ~9GB at full 128K context. Grouped Query Attention (8 KV heads vs 64 query heads) keeps this manageable. Without GQA, the cache alone would be ~72GB. Total system memory exceeds what consumer GPUs can address. The unified memory architecture of GB10 enables flexible allocation across all services.
How does OrthoBench score models?
OrthoBench uses open-ended orthopaedic surgery questions (not multiple choice). A scorer LLM evaluates answers against domain rubrics in two steps: natural language analysis, then structured JSON scoring. Tiers range from "Gold" (>75) through "Dangerous" (<0, flagging patient safety concerns). The benchmark measures accuracy, tokens/second, and produces per-question reasoning trails.
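A sketch of the two-step pattern against an OpenAI-compatible endpoint; the prompts, endpoint, model name, and middle-tier label are simplified stand-ins, not OrthoBench's actual rubric:

```python
import json
from openai import OpenAI

judge = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

def score_answer(question: str, answer: str, rubric: str,
                 model: str = "gpt-oss-120b") -> dict:
    # Step 1: free-form analysis, so the judge reasons before committing to numbers.
    analysis = judge.chat.completions.create(model=model, messages=[
        {"role": "user",
         "content": f"Rubric:\n{rubric}\n\nQuestion: {question}\nAnswer: {answer}\n\n"
                    "Analyze the answer's accuracy and safety in plain language."},
    ]).choices[0].message.content

    # Step 2: structured JSON scoring grounded in that analysis.
    verdict = judge.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},
        messages=[{"role": "user",
                   "content": f"Based on this analysis:\n{analysis}\n\nReturn JSON with "
                              'keys "score" (-100 to 100), "safety_flag" (bool), "reasoning".'}],
    ).choices[0].message.content

    result = json.loads(verdict)
    result["tier"] = ("Gold" if result["score"] > 75
                      else "Dangerous" if result["score"] < 0
                      else "intermediate")   # middle tiers elided in this sketch
    return result
```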
Why not build something simpler?
A simpler system would be easier to build. But simpler systems make tradeoffs that compromise educational reliability. A standard embedding model doesn't retrieve as consistently as instruction-tuned embeddings do. A monolithic application can't scale individual bottlenecks; microservices can.
Who owns the intellectual property?
Because development used UCSF resources, ownership of the OrthoRAG IP is held by the State of California, technically the UC Regents. Some elements of the project (OrthoBench) are also co-owned by the Department of Veterans Affairs. Elements that may represent novel patentable inventions have not been disclosed. The special element of OrthoSystem falls into the category of trade secrets, subject-matter expertise, and know-how: a three-Michelin-star restaurant and a home baker may use the same ingredients, but the recipe and its execution differ.
"Production grade" refers to the architecture and code quality, not deployment scale. OrthoSystem uses a microservices architecture where each component runs as an independent containerized service communicating over standard protocols. While everything runs on a single DGX Spark, the system could use a dedicated machine for individual service or even resilient clusters for each service. This also allows us to upgrade features within the whole system without breaking anything.
Is this software-as-a-service?
No. I've designed this to be software, not software-as-a-service. The system is designed to run air-gapped, completely offline once deployed. There's no multi-user authentication or session management yet, though those may be worth adding in the future. The code is written to be resilient and horizontally scalable, but today it is a production-ready single-user experience.
10 — Roadmap
OrthoRAG (V33/V41) and OrthoBench are operational, with V41 optimization ongoing. The immediate focus is OrthoGraph, with a target completion of Q2 2026. OrthoTutor, the tutoring system that ties everything together, represents the largest development effort.
About the Author
Health Sciences Clinical Professor of Orthopaedic Surgery, UCSF
Staff Surgeon, San Francisco VA Health Center
Three decades in computing. A decade in academic spine surgery. Multiple patents.
One vision: AI that teaches the way surgeons actually learn.
The Workshop
Technical documentation from active development. Benchmarks, tutorials, and deep dives into the engineering challenges of building AI infrastructure for medical education.
Same deep dives, different decade and different stakes.
Open to collaborations that expand the bandwidth from nights and weekends to business hours.