The scalpel is one of the oldest tools in surgery. Its form hasn't fundamentally changed in centuries. Yet in the hands of a master surgeon versus a first-year resident, the outcomes are very different.
Artificial intelligence is the same kind of tool. The technology exists. What matters is how it's wielded—and by whom. OrthoSystem represents the convergence of three decades in computing and a decade in academic medicine, built on the understanding that educational AI demands the same precision we require in the operating room.
Version 2.0 · December 2025 · ARM64/Blackwell Optimized
01 — The Journey
OrthoSystem didn't emerge from a startup accelerator or a corporate AI lab. It emerged from the unusual intersection of two careers that shouldn't have overlapped—and the realization that understanding both domains deeply enough reveals solutions invisible to specialists in either.
In high school and college, I ran what became the world's most popular enthusiast site covering ATI (now AMD) graphics hardware. I was invited by ATI to Computex Taipei—a trip I had to cancel because I needed to take the MCAT. In 2001, I attended the launch party for the world's first programmable GPU, the NVIDIA GeForce 3. I was watching the birth of what would become the foundation of modern AI computing.
As a freelance tech journalist, I was granted the exclusive interview with Ian Buck, the architect of CUDA—NVIDIA's parallel computing platform that would eventually enable their leadership in large language models. That NVIDIA trusted me to tell the story of CUDA to enthusiasts reflected years of building credibility in the hardware community. I wrote about the future of accelerated computing in 2010, before most of the industry understood why GPUs would matter beyond graphics, all while doing my orthopaedic surgery residency.
I worked my way from assistant professor to full professor of orthopaedic spine surgery at UCSF, practicing and teaching at the San Francisco VA Health Center. The technical instincts didn't disappear—they found new applications. Virtual surgical planning (US9601030B2, 2017). Adult stem cell therapies (US10994050B2, 2021). Trabecular bone lattice design for advanced orthopaedic implants and next-generation anisotropic heat exchangers (US11717418B2, 2023). A decade of teaching residents, watching how they learn, understanding where traditional education fails them.
When I wrote a review article explaining the basic science of large language models for orthopaedic surgeons, I recognized the failure modes. The reasoning errors AI makes—drawing wrong conclusions from pattern-matching without structural understanding—were familiar. They were the same errors I see in early learners. And I realized: I understood both domains deeply enough to fix this myself.
NVIDIA released the DGX Spark—128GB of unified memory, Grace Blackwell SoC with ConnectX-7 200GbE networking. For the first time, hardware existed that could run the full vision on a single node: large embedding models, GPU-accelerated compute, real-time visualization, and enterprise-level software tools. That same month, the UCSF Department of Orthopaedic Surgery funded this AI research through my fall teaching bonus—recognition that the work with residents had value worth investing in. The same appreciation for parallel computing that let me interview Ian Buck in 2009 now had a platform capable of executing the architecture I'd been imagining.
Leveraging the combination of the NVIDIA software stack and the DGX Spark, I completed OrthoRAG V41 in two months—not a prototype, but a production-grade retrieval system with instruction-tuned embeddings, multi-scale chunking, hybrid search, and neural reranking. Just “me, myself, and AI.” The same teaching instincts that guide a resident through a complex case now inform the design of an adaptive tutoring agent.
The insight: Most people building AI tools understand the technology but not the domain. Most domain experts understand the problems but not the technology. OrthoSystem.ai exists because these threads converged in one person.
02 — Overview
OrthoSystem integrates four complementary subsystems to address fundamentally different modes of knowledge access. The platform recognizes that experts and learners approach information differently—and provides tools optimized for each cognitive style while leveraging shared infrastructure.
OrthoRAG: How experts think. Query-driven retrieval with instruction-tuned embeddings. Augments existing knowledge with precise, cited sources. Grounds LLM responses with professorial reliability.
OrthoGraph: How learners explore. Visual navigation through semantic space. See relationships, discover clusters, identify gaps. Transforms the unknown into navigable territory.
OrthoTutor: Agentic instruction. Dynamically leverages OrthoRAG and OrthoGraph based on student performance. Adapts explanation style to individual learning patterns.
OrthoBench: Systematic evaluation. Two-step LLM-as-judge scoring against domain rubrics. Measures accuracy, safety, and speed across any OpenAI-compatible endpoint.
Current status: OrthoRAG (V33 and V41) and OrthoBench are operational. OrthoGraph is in active development—the visual corpus navigation that will demonstrate the platform's educational potential. OrthoTutor, the agentic system that ties everything together, requires the most development time and represents the primary support opportunity.
03 — What is RAG
When you ask ChatGPT a question, it generates an answer from patterns learned during training—patterns frozen at whatever date the model was last updated. It has no access to your documents, your textbooks, or the latest research. Retrieval-Augmented Generation (RAG) changes this. Before the language model generates a response, a retrieval system searches your documents and injects the relevant passages into the prompt. The model then generates an answer grounded in actual sources rather than statistical memory.
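To make the mechanics concrete, here is a minimal sketch of the retrieve-then-generate loop in Python, assuming an OpenAI-compatible chat endpoint and a pre-built vector index. The `embed` and `vector_store` objects and the endpoint URL are illustrative placeholders, not OrthoRAG's actual API.

```python
# Minimal retrieval-augmented generation loop (illustrative sketch).
# Assumes: embed() and vector_store.search() are provided elsewhere, and an
# OpenAI-compatible server is reachable at the placeholder URL below.
from openai import OpenAI

client = OpenAI(base_url="http://inference-node:8000/v1", api_key="not-needed")

def answer(question: str, vector_store, embed, top_k: int = 5) -> str:
    # 1. Embed the question and retrieve the most similar passages.
    query_vector = embed(question)
    passages = vector_store.search(query_vector, limit=top_k)

    # 2. Inject the retrieved passages into the prompt as grounding context.
    context = "\n\n".join(f"[{p.source}] {p.text}" for p in passages)
    prompt = (
        "Answer using only the sources below, and cite them.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )

    # 3. Generate an answer grounded in the retrieved sources.
    response = client.chat.completions.create(
        model="local-model",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```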
The concept sounds simple. The implementation is not.
Our initial exploration of available RAG platforms revealed fundamental limitations. Many commercial solutions impose arbitrary file size limits—cutting off textbook chapters mid-sentence. Others use generic embedding models trained on web text, poorly suited to medical terminology where "reduction" means something entirely different than in everyday English. The retrieval quality varied wildly depending on how questions were phrased.
The deeper problem is architectural. RAG isn't a single tool—it's an orchestration of multiple AI subsystems, each making decisions that compound downstream.
Embedding models. These translate text into vectors—coordinates in high-dimensional space where similar concepts cluster together. Generic embeddings confuse medical terminology. Instruction-tuned embeddings understand that a query differs from a document, improving retrieval precision. We use NVIDIA Llama Embed Nemotron 8B with asymmetric instruction tuning to optimize for medical content.
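In practice, asymmetric instruction tuning means queries and documents are encoded with different task prefixes. A minimal sketch, assuming an OpenAI-compatible embeddings endpoint on the embedding service port; the instruction string and model identifier are illustrative, not the exact ones the deployed model uses.

```python
# Illustrative asymmetric embedding: queries get a task instruction prefix,
# documents are embedded as-is. Prefix wording and model name are hypothetical.
from openai import OpenAI

embedder = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

QUERY_INSTRUCTION = (
    "Instruct: Retrieve passages that answer this orthopaedic question.\nQuery: "
)

def embed(text: str, is_query: bool) -> list[float]:
    prefix = QUERY_INSTRUCTION if is_query else ""
    result = embedder.embeddings.create(
        model="llama-embed-nemotron-8b",
        input=prefix + text,
    )
    return result.data[0].embedding

query_vec = embed("graft selection for ACL reconstruction in adolescents", is_query=True)
doc_vec = embed("Hamstring autograft remains common in skeletally immature patients...", is_query=False)
```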
Chunking. Documents must be split into retrievable segments. Too large, and irrelevant content dilutes the signal. Too small, and context is lost. Multi-scale chunking preserves both detail and context through parent-child relationships—a technique that requires custom implementation.
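One way to implement parent-child chunking is sketched below: large parent windows preserve context, small child chunks are what retrieval matches, and each child records its parent's ID so the surrounding context can be reassembled at answer time. The sizes and field names are assumptions for illustration, not OrthoRAG's ingestion code.

```python
# Illustrative multi-scale chunking: small child chunks for retrieval,
# larger parent chunks for context. Sizes and field names are hypothetical.
import uuid

def multiscale_chunks(text: str, parent_size: int = 2000, child_size: int = 400):
    chunks = []
    for p_start in range(0, len(text), parent_size):
        parent_text = text[p_start : p_start + parent_size]
        parent_id = str(uuid.uuid4())
        chunks.append({"id": parent_id, "level": "parent", "text": parent_text})
        for c_start in range(0, len(parent_text), child_size):
            chunks.append({
                "id": str(uuid.uuid4()),
                "level": "child",
                "parent_id": parent_id,  # link back to the surrounding context
                "text": parent_text[c_start : c_start + child_size],
            })
    return chunks
```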
Reranking. Initial retrieval casts a wide net. A reranker—another neural network—scores each candidate against the actual query, promoting truly relevant passages. Without reranking, retrieval often returns factually accurate but irrelevant content, which can impair reasoning, a problem seen with orthopaedic learners too.
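A sketch of the retrieve-then-rerank pattern, assuming a reranker exposed over HTTP on the reranker service port; the route and payload shape are illustrative assumptions, not the actual service contract.

```python
# Illustrative two-stage retrieval: wide vector search, then neural reranking.
# The reranker endpoint route and payload shape are hypothetical.
import requests

def retrieve_and_rerank(query_vector, query_text, vector_store, keep: int = 8):
    # Stage 1: cast a wide net with fast vector search.
    candidates = vector_store.search(query_vector, limit=50)

    # Stage 2: score each candidate against the actual query text.
    scores = requests.post(
        "http://localhost:8081/rerank",
        json={"query": query_text, "passages": [c.text for c in candidates]},
        timeout=30,
    ).json()["scores"]

    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [candidate for candidate, _ in ranked[:keep]]
```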
Each component is itself a "mini AI"—a specialized neural network making decisions that affect final answer quality. The embedding model decides what counts as similar. The chunker decides what constitutes a coherent unit. The reranker decides what's actually relevant. Commercial platforms make these decisions for you, generically. Building OrthoRAG meant taking control and fine-tuning each stage.
We maintain two parallel RAG implementations, each serving different reliability and capability requirements.
| Version | Collection | Description |
|---|---|---|
| V33 | documents | Production-stable. Proven reliability. Better than available alternatives. |
| V41 | documents_multiscale | Enhanced capabilities for OrthoGraph integration. V41's additional metadata enables OrthoGraph's multiple semantic views; organizing the corpus by subspecialty, anatomy, or evidence level requires that information to exist in the database. |
04 — Philosophy
When I search the literature, I have a precise question—"What's the latest evidence on graft selection for ACL reconstruction in adolescents?" I'm filling a specific gap in my knowledge.
Learners don't have that framework yet. It's not that they're standing at the edge of unfamiliar territory with no map: the giant textbook is the map. Learners don't need help figuring out which parts are important, because everything is important. The challenge is breaking that massive universe of information into learnable chunks for residents working 70-80 hours a week in the hospital while trying to preserve a healthy work-life balance and personal commitments.
This insight drives the architecture. We need multiple systems because experts and learners work in different cognitive modes.
Query-driven. The expert knows what to ask.
Exploration-driven. The learner doesn't know what they don't know.
The mathematics that make this possible emerged not from Silicon Valley but from university research. In the 1950s, Charles Osgood and his team at the University of Illinois asked hundreds of undergraduate students to rate concepts—words like lady, father, fire, boulder, tornado, and sword—on fifty different scales: good-bad, hard-soft, hot-cold, sweet-sour. Across multiple studies and populations, concepts consistently clustered into three dimensions: good-bad, strong-weak, and active-passive. Each word could be described by three numbers—just like x, y, z coordinates. This research utilized vacuum tube supercomputers from IBM.
The next major leap did come from Silicon Valley: in 2013, Mikolov hypothesized that the meaning of a word is defined by the words it tends to occur with. Using a Google News dataset of six billion words, his team placed one million common words in 300 dimensions—derived purely from patterns of co-occurrence. The stunning result: if you took the coordinate for king, subtracted man, and added woman, the nearest word was queen. Paris minus France plus Italy? Rome. Meaning had become geometry.
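The analogy really is vector arithmetic. A toy sketch with fabricated 3-dimensional vectors (real word2vec vectors have 300 dimensions) shows the shape of the computation; the numbers are made up purely for illustration.

```python
# Toy demonstration of "king - man + woman ~ queen" with fabricated 3-D
# vectors; real word2vec embeddings are 300-D and learned from co-occurrence.
import numpy as np

vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.1, 0.8, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "queen": np.array([0.9, 0.1, 0.9]),
    "paris": np.array([0.5, 0.5, 0.5]),
}

def nearest(target, exclude):
    # Cosine similarity against every stored word except the inputs.
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max((w for w in vectors if w not in exclude),
               key=lambda w: cos(vectors[w], target))

analogy = vectors["king"] - vectors["man"] + vectors["woman"]
print(nearest(analogy, exclude={"king", "man", "woman"}))  # -> "queen" (by construction)
```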
In 2017, the transformer architecture enabled this mathematical structure to scale to entire sentences and documents. OrthoGraph leverages this same geometry—the embeddings that power retrieval become coordinates in visual space. Semantic similarity becomes spatial proximity. A search doesn't return a list; it illuminates a region.
Why not knowledge graphs? Knowledge graphs built from subject-predicate-object triples were considered, but we concluded that modern LLMs with robust retrieval handle multi-hop reasoning, aggregation, and causal chains at inference time. Knowledge graphs add complexity and constraint without retrieval benefit. Today's hardware no longer needs the intermediate step. OrthoGraph uses the same vector embeddings as OrthoRAG—projected into visual space—rather than maintaining a separate graph database.
The embedding service has been allocated ~24GB memory. Instruction-tuned for medical terminology. Produces 4096-dimensional vectors.
GPU UMAP via cuML/RAPIDS. 4096D → 768D reduction for 50K chunks requires ~6GB working memory.
128K context windows. OpenAI’s 20B and 120B open source models, configured for maximum thinking, work with OrthoRAG.
The DGX Spark's 128GB unified memory enables all services to coexist on a single node for development. Production deployments with larger models benefit from dual-node configurations where inference runs on dedicated hardware. Larger, more sophisticated models bring correspondingly stronger reasoning capabilities.
05 — Architecture
Every architectural decision reflects a constraint or an insight. Microservices add complexity but enable scaling as the technology grows. Qdrant was selected from a multitude of vector databases because it meets our latency priorities and enables storage of our 768-dimensional projections alongside the 4096-dimensional embeddings without a separate database.
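Storing both representations side by side looks roughly like the following with the qdrant-client Python library. The collection and payload field names mirror those used elsewhere in this document, but the snippet is a sketch, not the production ingestion code.

```python
# Sketch: one Qdrant collection holding the 4096-D embedding as the vector
# and the 768-D UMAP projection in the payload. Illustrative only.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(host="localhost", port=6333)

client.create_collection(
    collection_name="documents_multiscale",
    vectors_config=VectorParams(size=4096, distance=Distance.COSINE),
)

# Placeholder values; in practice these come from the embedding service
# and the GPU UMAP precompute step.
embedding_4096 = [0.0] * 4096
projection_768 = [0.0] * 768

client.upsert(
    collection_name="documents_multiscale",
    points=[PointStruct(
        id=1,
        vector=embedding_4096,
        payload={
            "text": "Sample chunk text...",
            "intermediate_768d": projection_768,
            "subspecialty": "Sports Medicine",
        },
    )],
)
```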
Why FlashAttention matters: Context window determines how much the model can consider at once. Early GPT-3 handled 2,048 tokens—roughly 1,500 words. Modern architectures using FlashAttention extend this to 128K tokens and beyond while sustaining high-performance output. Supporting these implementations on models released in October 2025 and later is one of the technical challenges we've overcome.
06 — OrthoGraph
Learners face a challenge that experts forget: when you don't know a domain, you don't know what questions to ask. OrthoGraph addresses this by transforming the corpus into navigable space. The same vector embeddings that power retrieval become coordinates in a visual field where semantic similarity becomes spatial proximity.
A search doesn't just return a list—it illuminates a region. Clusters reveal topic structure. Sparse areas indicate gaps. The learner develops intuition about the knowledge landscape before diving into specific content.
Query: "ACL reconstruction techniques" — Results appear as bright points. Their clustering reveals related literature; their distance from other clusters shows how the topic connects to the broader field.
Real-time visualization of 50,000+ chunks at 60fps requires careful engineering. CPU-based UMAP on 50K vectors at 4096 dimensions takes 15-30 minutes. GPU UMAP via cuML completes in under 2 minutes.
Pre-computed during ingestion. 4096D → 768D reduction preserves semantic structure. NVIDIA cuML/RAPIDS enables GPU acceleration. Results stored as intermediate_768d in Qdrant payload.
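A sketch of that precompute step, assuming cuML's UMAP reducing the 4096-dimensional embeddings to the 768-dimensional intermediate on the GPU. The parameter values and the loader function are illustrative assumptions.

```python
# Illustrative GPU UMAP precompute with NVIDIA cuML/RAPIDS.
# Reduces 4096-D chunk embeddings to the 768-D intermediate representation
# later stored as `intermediate_768d`. Parameters and loader are assumptions.
import cupy as cp
from cuml.manifold import UMAP

embeddings = cp.asarray(load_chunk_embeddings())   # shape: (n_chunks, 4096); loader assumed

reducer = UMAP(
    n_components=768,    # intermediate dimensionality described in the text
    n_neighbors=15,
    min_dist=0.1,
    init="random",       # spectral init is costly at this output dimensionality
)
intermediate_768d = reducer.fit_transform(embeddings)  # shape: (n_chunks, 768)
```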
Real-time 768D → 2D via trained LDA matrices. Different matrices produce different views (subspecialty, anatomy, evidence level). WebGPU compute shader enables smooth morphing between views.
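Conceptually, a view switch is a change of projection matrix plus interpolation of the resulting 2D positions. A numpy sketch of the idea follows; OrthoGraph performs the equivalent in a WebGPU compute shader, and the matrix names here are illustrative.

```python
# Illustrative view morphing: project the 768-D intermediates through two
# trained 768x2 matrices (e.g. subspecialty vs. anatomy views) and blend.
import numpy as np

def morph_positions(points_768d: np.ndarray,
                    view_a: np.ndarray,      # (768, 2) trained projection matrix
                    view_b: np.ndarray,      # (768, 2) matrix for the target view
                    t: float) -> np.ndarray:
    """Blend factor t goes 0 -> 1 as the corpus morphs from view A to view B."""
    positions_a = points_768d @ view_a       # (n_points, 2) layout for view A
    positions_b = points_768d @ view_b       # (n_points, 2) layout for view B
    return (1.0 - t) * positions_a + t * positions_b
```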
GPU-accelerated instanced rendering. Bloom post-processing for illumination effects. Canvas2D overlay for labels and UI chrome. Targets 60fps sustained with 50K particles.
The same data projects differently depending on what structure you want to see. Switching views triggers smooth interpolation through intermediate positions—the corpus "morphs" from one organization to another.
Subspecialty view. Clusters by clinical domain: Sports Medicine, Trauma, Arthroplasty, Spine, Pediatric, Oncology.
Anatomy view. Organizes by body region: Knee, Hip, Shoulder, Spine, Foot/Ankle, Hand/Wrist.
Evidence view. Stratifies by research quality. Background gradient shows evidence strength distribution.
Source view. Groups by source category: Textbook, Review, RCT, Case Report, Guidelines.
| Service | Port | Base Image | Function |
|---|---|---|---|
| orthograph-projection | 8220 | nvcr.io/nvidia/pytorch:25.11-py3 | GPU UMAP, LDA matrix generation, R-tree spatial index |
| orthograph-api | 8200 | nvcr.io/nvidia/pytorch:25.11-py3 | Search orchestration, viewport streaming, clustering |
| orthograph-web | 8201 | node:20-slim | WebGPU visualization, Canvas2D overlay, chat panel |
OrthoGraph adapts to different OrthoRAG deployment levels. The tier is auto-detected on startup by probing Qdrant payload fields.
| Tier | RAG Version | Collection | OrthoGraph Features |
|---|---|---|---|
| Tier 0 | Embeddings only | Any | Basic UMAP projection, search illumination |
| Tier 1 | V33 (single-scale) | documents | + Document metadata, source filtering |
| Tier 2 | V41 (multi-scale) | documents_multiscale | + Parent/child LOD, subspecialty views, evidence filtering |
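The startup probe described above can be as simple as sampling one point and inspecting its payload. A sketch using qdrant-client; the payload field names are taken from this document, but the probing logic itself is an assumption about how detection might work.

```python
# Illustrative startup probe: infer the OrthoGraph feature tier from which
# payload fields exist on a sampled Qdrant point. Field names follow the text.
from qdrant_client import QdrantClient

def detect_tier(client: QdrantClient, collection: str) -> int:
    points, _ = client.scroll(collection_name=collection, limit=1, with_payload=True)
    if not points:
        return 0
    payload = points[0].payload or {}
    if "parent_id" in payload and "subspecialty" in payload:
        return 2   # V41 multi-scale: parent/child LOD, subspecialty views
    if "source" in payload:
        return 1   # V33 single-scale: document metadata, source filtering
    return 0       # embeddings only: basic projection and search illumination

client = QdrantClient(host="localhost", port=6333)
tier = detect_tier(client, "documents_multiscale")
```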
07 — Port Allocation
Ports are organized by subsystem. Internal services communicate via Docker network DNS and are not exposed to the host.
| Port | Service | Container | Status |
|---|---|---|---|
| 6333 | Qdrant Vector DB | v34_qdrant | Deployed |
| 8000 | RAG Web UI | v34_rag-web | Deployed |
| 8080 | Embedding Service | v34_embedding-service | Deployed |
| 8081 | Reranker Service | v34_reranker-service | Deployed |
| 8100 | Chat V41 (RLHF) | chat-web-v41 | Deployed |
| Port | Service | Container | Status |
|---|---|---|---|
| 8200 | OrthoGraph API | orthograph-api | In Development |
| 8201 | OrthoGraph Web | orthograph-web | In Development |
| 8220 | Projection Service | orthograph-projection | In Development |
| Port | Service | Container | Status |
|---|---|---|---|
| 8300 | OrthoTutor Web | orthotutor-web | Requires Funding |
| 8310 | Teaching Agent | orthotutor-agent | Requires Funding |
| 8320 | Assessment Engine | orthotutor-assess | Requires Funding |
| 8330 | Student Model | orthotutor-student | Requires Funding |
| 5432 | PostgreSQL | orthotutor-postgres | Requires Funding |
08 — Deployment
Each subsystem follows single-script deployment. Scripts include dependencies, health checks, and graceful error handling. The RAG_PREFIX environment variable enables switching between V33 and V41 deployments.
```yaml
orthograph-api:
  image: nvcr.io/nvidia/pytorch:25.11-py3
  networks:
    - ortho-network
    - ${RAG_PREFIX:-v34}_rag-network
  environment:
    - RAG_PREFIX=${RAG_PREFIX:-v34}
    - QDRANT_HOST=${RAG_PREFIX:-v34}_qdrant
    - QDRANT_COLLECTION=${QDRANT_COLLECTION:-documents_multiscale}

orthograph-projection:
  image: nvcr.io/nvidia/pytorch:25.11-py3
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu]

networks:
  ortho-network:
    driver: bridge
  ${RAG_PREFIX:-v34}_rag-network:
    external: true
```
Development configuration with 70B FP4 model.
Production with dedicated inference node.
Node 1: Services
Node 2: Inference
09 — FAQ
Is OrthoSystem tied to NVIDIA hardware? Fundamentally, the architecture is platform-agnostic. Vector databases, embedding models, and transformer inference all have implementations across AMD, Intel, and Apple Silicon. However, we chose to build on the NVIDIA platform based on cost and time efficiencies. The NVIDIA software stack—cuML for GPU UMAP, TensorRT-LLM for optimized inference, the NGC container registry—dramatically accelerates development. To prove we could actually execute this vision, we built not just a working prototype but a full enterprise-grade RAG system. Leveraging the combination of the NVIDIA software stack and the DGX Spark allowed us to complete OrthoRAG V41 in two months. We would welcome support to port OrthoSystem to other platforms—the educational mission benefits from broader accessibility.
Why not knowledge graphs? Knowledge graphs require entity extraction, relationship classification, and schema maintenance—significant complexity for uncertain retrieval benefit. Modern LLMs handle multi-hop reasoning at inference time. OrthoGraph uses the same embeddings as OrthoRAG, projected into visual space, avoiding a separate database and extraction pipeline. The visualization provides exploration value without complicating retrieval.
Which browsers support the WebGPU visualization? Chrome 113+, Edge 113+, and Firefox (behind a flag). Safari has partial support. OrthoGraph implements a Canvas2D fallback for unsupported browsers, though at reduced particle counts and without bloom effects. The primary target is Chrome on desktop, where WebGPU is stable and performant.
Why do projections require GPU acceleration? CPU UMAP on 50K 4096-dimensional vectors takes 15-30 minutes. GPU UMAP via cuML completes in under 2 minutes. Since projections must be recomputed when the corpus changes or new views are trained, GPU acceleration is essential for practical operation. The 768D intermediate representation balances dimensionality (preserving structure) against storage and computation costs.
Can OrthoTutor work without OrthoGraph? Yes. OrthoTutor's Teaching Agent uses OrthoRAG for content retrieval and can provide text-based instruction without visualization. OrthoGraph integration enables the agent to show students where a topic fits in the broader field, but isn't required for core tutoring functionality.
Why does this require 128GB of unified memory? The embedding models used (such as Llama Embed Nemotron 8B) require substantial memory. GPU UMAP working memory adds ~6GB. Qdrant with 50K chunks needs ~4GB. Inference models range from 40GB (70B FP4) to 100GB (120B). The total exceeds what consumer GPUs can address. The unified memory architecture of GB10 enables flexible allocation and an easy path for deploying this software.
How does OrthoBench score answers? OrthoBench uses open-ended orthopaedic surgery questions (not multiple choice). A scorer LLM evaluates answers against domain rubrics in two steps: natural language analysis, then structured JSON scoring. Tiers range from "Gold" (>75) through "Dangerous" (<0, flagging patient safety concerns). The benchmark measures accuracy, tokens/second, and produces per-question reasoning trails.
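A simplified sketch of that two-step judging loop, assuming an OpenAI-compatible scorer endpoint; the prompts, field names, and tier handling are condensed for illustration and are not the benchmark's actual templates.

```python
# Illustrative two-step LLM-as-judge scoring: free-text analysis first,
# then structured JSON scoring against the rubric. Prompts are simplified.
import json
from openai import OpenAI

judge = OpenAI(base_url="http://inference-node:8000/v1", api_key="not-needed")

def score_answer(question: str, answer: str, rubric: str) -> dict:
    # Step 1: natural-language analysis of the answer against the rubric.
    analysis = judge.chat.completions.create(
        model="scorer-model",
        messages=[{"role": "user", "content": (
            f"Rubric:\n{rubric}\n\nQuestion: {question}\n\nAnswer: {answer}\n\n"
            "Analyze strengths, errors, and any patient-safety concerns."
        )}],
    ).choices[0].message.content

    # Step 2: convert the analysis into a structured JSON score.
    structured = judge.chat.completions.create(
        model="scorer-model",
        messages=[{"role": "user", "content": (
            f"Analysis:\n{analysis}\n\n"
            'Respond with JSON only: {"score": <number from -100 to 100>, "safety_flag": <true or false>}'
        )}],
    ).choices[0].message.content

    result = json.loads(structured)
    # Map score to tiers; only the endpoints named in the text are shown here.
    if result["score"] > 75:
        result["tier"] = "Gold"
    elif result["score"] < 0:
        result["tier"] = "Dangerous"
    result["analysis"] = analysis
    return result
```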
Why not build something simpler? A simpler system would be easier to build. But simpler systems make tradeoffs that compromise educational reliability. A single embedding model can't distinguish queries from documents—instruction-tuned embeddings can. CPU-based dimensionality reduction takes 30 minutes—GPU UMAP takes 90 seconds. Reducing the dimensions to 2D in a single step would prevent OrthoGraph from taking advantage of other reranking models built around 768 dimensions. A monolithic application can't scale individual bottlenecks—microservices can.
10 — Roadmap
OrthoRAG (V41) and OrthoBench are operational. The immediate focus is OrthoGraph—the visual corpus navigation that will demonstrate the platform's educational potential. OrthoTutor, the agentic system that ties everything together, represents the largest development effort and the primary funding opportunity.
- Single view projection, basic WebGPU rendering, search illumination, Qdrant integration. Viewport streaming with R-tree spatial index. (Current)
- LDA matrix training for subspecialty/anatomy/evidence views. Smooth view morphing. Level-of-detail with parent/child chunks. Star clustering.
- Sliding panel with OrthoRAG chat. Click-to-cite from visualization. Bidirectional: chat queries illuminate, clicks populate context.
- Retrieval failure heatmap from Chat V41 feedback. Exploration trails. Corpus gap analysis dashboard.
- Teaching Agent with Socratic dialogue. Student Model with Bayesian knowledge tracing. MCQ integration. Spaced repetition scheduling. (Requires Funding)
- OrthoTutor dynamically leverages OrthoRAG and OrthoGraph based on student performance. Adaptive explanation strategies. (Requires Funding)
The opportunity: OrthoGraph will be the showcase—visual evidence that semantic structure can become navigable space. But the transformative potential lies in OrthoTutor: an agentic system that watches how a student learns, recognizes when they need structure versus detail, and dynamically chooses whether to retrieve, visualize, or explain. This is where the educational AI field is heading. OrthoSystem is positioned to lead.
About the Author
Health Sciences Clinical Professor of Orthopaedic Surgery, UCSF
Staff Surgeon, San Francisco VA Health Center
Three decades in computing. A decade in academic spine surgery. Three patents. One vision: AI that teaches the way surgeons actually learn.