The scalpel is one of the oldest tools in surgery.
Its form hasn't fundamentally changed in centuries.
Yet in the hands of a master surgeon versus a first-year resident,
the outcomes are very different.
Artificial intelligence is the same kind of tool. The technology exists. What matters is how it's wielded, and by whom. OrthoSystem represents the convergence of three decades in computing and a decade in academic medicine, built on the understanding that educational AI demands the same precision we require in the operating room.
December 2025 · NVIDIA Grace Blackwell Optimized
01 — The Journey
OrthoSystem didn't emerge from a startup accelerator or a corporate AI lab. It emerged from the unusual intersection of two careers that shouldn't have overlapped and the realization that understanding both domains deeply enough reveals solutions invisible in isolation.
In high school and college, I ran what became the world's most popular enthusiast site covering ATI (now AMD) graphics hardware. I was invited by ATI to Computex Taipei but had to decline, because I needed to take the MCAT. In 2001, I attended the launch party for the world's first programmable GPU, the NVIDIA GeForce 3. I was watching the birth of what would become the foundation of modern AI computing.
As a freelance tech journalist, I was granted the exclusive interview with Ian Buck, the architect of CUDA, NVIDIA's parallel computing platform that would eventually enable their leadership in accelerated computing. That NVIDIA trusted me to tell the story of CUDA to enthusiasts reflected years of building credibility in the hardware community. I wrote about the future of accelerated computing in 2010, before most of the industry understood why GPUs would matter beyond graphics, all while doing my orthopaedic surgery residency.
I worked my way from assistant professor to full professor of orthopaedic spine surgery at UCSF, practicing and teaching at the San Francisco VA Health Center. The technical instincts didn't disappear; they found new applications. Virtual surgical planning (US9601030B2, 2017). Adult stem cell therapies (US10994050B2, 2021). Trabecular bone lattice design for advanced orthopaedic implants and next-generation anisotropic heat exchangers (US11717418B2, 2023). A decade of teaching residents, watching how they learn, understanding where traditional education fails them. This is also when I began a 3D printing program for education and co-founded a company to help deliver orthopaedic innovation globally.
When I wrote a review article explaining the basic science of large language models for orthopaedic surgeons, I recognized the failure modes. The reasoning errors AI makes, drawing wrong conclusions from pattern-matching without structural understanding, were familiar. They were the same errors I see in early learners. And I realized: I understood both domains deeply enough to fix this myself.
NVIDIA released the DGX Spark: 128GB of unified memory and a Grace Blackwell SoC with ConnectX-7 200GbE networking. For the first time, affordable hardware existed that could run the full vision on a single node: large embedding models, GPU-accelerated compute, real-time visualization, and enterprise-level software tools. That same month, the UCSF Department of Orthopaedic Surgery funded this AI research through my fall teaching bonus.
Leveraging the combination of the NVIDIA software stack and the DGX Spark, I completed OrthoRAG V41 in two months: a production-grade, single-user retrieval system with instruction-tuned embeddings, multi-scale chunking, hybrid search, and neural reranking. Just “me, myself, and AI.” The same teaching instincts that guide a resident through a complex case now inform the design of the tutoring system.
02 — Overview
OrthoSystem integrates four complementary subsystems to address fundamentally different modes of knowledge access. The platform recognizes that experts and learners approach information differently and provides tools optimized for each cognitive style while leveraging shared infrastructure.
OrthoRAG: How experts think. Query-driven retrieval with instruction-tuned embeddings. Augments existing knowledge with precise, cited sources. Grounds LLM responses with professorial reliability.
OrthoGraph: How learners explore. Visual navigation through semantic space. See relationships, discover clusters, identify gaps. Transforms the unknown into navigable territory.
OrthoTutor: AI-assisted instruction. Integrates OrthoRAG and OrthoGraph capabilities for personalized learning experiences.
OrthoBench: Systematic evaluation. Two-step LLM-as-judge scoring against domain rubrics. Measures accuracy, safety, and speed across any OpenAI-compatible endpoint.
03 — What is RAG
When you ask ChatGPT or any other LLM a question, it generates an answer from patterns learned during training. It has no access to your documents, your textbooks, or the latest research. Retrieval-Augmented Generation (RAG) changes this. Before the language model generates a response, a retrieval system searches your documents and injects the relevant passages into the prompt. The model then generates an answer grounded in actual sources rather than statistical memory. This is why a service like ChatGPT appears to be continuously learning: the underlying LLM has a fixed amount of knowledge, but it combines web search and updated knowledge documents at the time of response.
The concept sounds simple. The implementation is not.
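The naive version fits in a dozen lines. A minimal sketch of the retrieve-then-generate loop, where `embed` and `vector_search` are hypothetical stand-ins for the embedding service and vector database, and the endpoint and model name are illustrative:

```python
# Minimal retrieve-then-generate loop. `embed` and `vector_search` are
# hypothetical stand-ins; the endpoint and model name are illustrative.
from openai import OpenAI

llm = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

def answer(question: str, embed, vector_search, top_k: int = 5) -> str:
    query_vec = embed(question)                       # text -> high-dimensional vector
    passages = vector_search(query_vec, limit=top_k)  # nearest chunks in the corpus
    context = "\n\n".join(p["text"] for p in passages)
    response = llm.chat.completions.create(
        model="gpt-oss-20b",
        messages=[
            {"role": "system",
             "content": "Answer only from the provided sources, and cite them."},
            {"role": "user",
             "content": f"Sources:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```

Everything that follows is about why this naive version falls short.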
I didn't set out to create my own RAG, but my initial exploration of consumer-level RAG platforms revealed fundamental limitations. Many solutions had arbitrary file size limits, refusing to work with large textbooks. Others used generic embedding models trained on web text, with retrieval quality varying wildly depending on how questions were phrased.
The deeper problem is architectural. RAG isn't a single tool. Instead, it's an orchestration of multiple AI subsystems, each making decisions that compound downstream.
Embedding models. These translate text into vectors: coordinates in high-dimensional space where similar concepts cluster together. Generic embeddings confuse medical terminology. Instruction-tuned embeddings understand that a query differs from a document, improving retrieval precision. OrthoRAG uses NVIDIA's Llama Embed Nemotron 8B with asymmetric instruction tuning optimized for medical content.
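A sketch of what asymmetry looks like at the API level: the query side carries a task instruction while documents are embedded as-is, so a short question and a long passage land in comparable regions of the space. The prefix text, endpoint, and service name are illustrative, not the model's actual prompt template:

```python
import requests

EMBED_URL = "http://embedding-service:8001/v1/embeddings"  # illustrative Docker DNS name

def embed(text: str, is_query: bool) -> list[float]:
    # Asymmetric instruction tuning: queries get a task instruction,
    # documents do not. The exact prefix here is a stand-in.
    prefix = ("Instruct: Retrieve passages that answer this orthopaedic question.\nQuery: "
              if is_query else "")
    resp = requests.post(EMBED_URL, json={"model": "llama-embed-nemotron-8b",
                                          "input": prefix + text})
    return resp.json()["data"][0]["embedding"]  # 4096-dimensional vector
```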
Chunking. Documents must be split into retrievable segments. Too large, and irrelevant content dilutes the signal. Too small, and context is lost. Multi-scale chunking preserves both detail and context through parent-child relationships.
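A minimal sketch of the parent-child idea: retrieval matches against small child chunks, but the answer context comes from the larger parent. The sizes and character-based splitting rule are illustrative; production chunkers respect sentence and section boundaries:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    parent_id: int | None = None  # child chunks point back to their parent

def multiscale_chunks(document: str, parent_size: int = 2000, child_size: int = 400):
    """Split into large parents for context, small children for retrieval."""
    parents, children = [], []
    for p_start in range(0, len(document), parent_size):
        parent = Chunk(document[p_start:p_start + parent_size])
        pid = len(parents)
        parents.append(parent)
        for c_start in range(0, len(parent.text), child_size):
            children.append(Chunk(parent.text[c_start:c_start + child_size], parent_id=pid))
    return parents, children

# At query time: search over children, then expand each hit to its parent
# so the LLM sees surrounding context, not an isolated fragment.
```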
Reranking. Initial retrieval casts a wide net. A reranker, another neural network, scores each candidate against the actual query, promoting truly relevant passages. Without reranking, retrieval often returns factually accurate but irrelevant content, which can impair reasoning, a problem seen with orthopaedic learners too.
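The two-stage pattern in miniature: vector search over-fetches candidates, then a cross-encoder scores each (query, passage) pair jointly. The sentence-transformers model here is a stand-in for the production reranker service:

```python
from sentence_transformers import CrossEncoder

# Stand-in cross-encoder; OrthoRAG runs a dedicated reranker service instead.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], keep: int = 5) -> list[str]:
    # Scoring each pair jointly is far more precise than comparing two
    # independent embeddings, but too slow to run over the whole corpus.
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda s: s[0], reverse=True)
    return [c for _, c in ranked[:keep]]
```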
Each component is itself a "mini AI" making decisions that affect final answer quality. The embedding model decides what counts as similar. The chunker decides what constitutes a coherent unit. The reranker decides what's actually relevant. Commercial platforms make these decisions for you, generically. Building OrthoRAG meant taking control and fine-tuning each stage.
Two parallel RAG implementations, each serving different reliability and capability requirements.
V33: Production-stable. Proven reliability. Hand-tuned performance.
V41: Enhanced capabilities for OrthoGraph integration.
V41's additional metadata enables OrthoGraph's multiple semantic views; organizing the corpus by subspecialty, anatomy, or evidence level requires that information to exist in the database.
04 — Philosophy
When I search the literature, I have a precise question: "What's the latest evidence on 3D printed titanium lumbar interbody cages?" I'm filling a specific gap in my knowledge.
Learners don't have that framework yet. But it's not as if they're standing at the edge of unfamiliar territory with no map. They have a map; it's just too big and too complex. The challenge is breaking that massive universe of information into learnable chunks for residents working 70 to 80 hours a week in the hospital while trying to preserve a healthy work-life balance and personal commitments.
"That giant textbook? Learners don't need help figuring out what parts are important — everything is important."
Query-driven. The expert knows what to ask.
Exploration-driven. The learner doesn't know what they don't know.
The mathematics that make this possible emerged not from Silicon Valley but from university research. In the 1950s, Charles Osgood and his team at the University of Illinois asked hundreds of undergraduate students to rate concepts (words like lady, father, fire, boulder, tornado, and sword) on fifty different scales: good-bad, hard-soft, hot-cold, sweet-sour. Across multiple studies and populations, concepts consistently clustered into three dimensions: good-bad, strong-weak, and active-passive. Each word could be described by three numbers, just like x, y, z coordinates. This research required the University of Illinois's ILLIAC, a state-of-the-art vacuum-tube computer of the era.
The next major leap did come from Silicon Valley in 2013, when Tomas Mikolov's team at Google operationalized an old linguistic idea: the meaning of a word is defined by the words it tends to occur with. Using a Google News dataset with six billion words, the team placed one million common words in 300 dimensions, derived purely from patterns of co-occurrence. The stunning result: if you took the coordinate for king, subtracted man, and added woman, the nearest word was queen. Paris minus France plus Italy? Rome. Meaning had become geometry.
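The arithmetic is easy to reproduce with the published vectors, for example via the gensim library (assuming you accept its roughly 1.6GB model download):

```python
import gensim.downloader as api

# Loads the pretrained Google News vectors (300 dimensions, ~1M words).
model = api.load("word2vec-google-news-300")

# king - man + woman ~= queen: meaning as geometry.
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# [('queen', ~0.71)]
```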
In 2017, the transformer architecture enabled this mathematical structure to scale to entire sentences and documents. OrthoGraph leverages this same geometry. Semantic similarity becomes spatial proximity. A search doesn't return a list; it illuminates a region.
The embedding service has been allocated ~24GB memory. Instruction-tuned for medical terminology. Produces 4096-dimensional vectors.
GPU UMAP via cuML/RAPIDS. Dimensionality reduction for visualization. 50K chunks require ~6GB of working memory.
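A sketch of the projection step with cuML's GPU UMAP; the random array stands in for real embeddings, and the parameters are illustrative, not the tuned production values:

```python
import cupy as cp
from cuml.manifold import UMAP

# Stand-in for the corpus: 50K chunk embeddings, 4096 dims, already on the GPU.
embeddings = cp.random.rand(50_000, 4096).astype(cp.float32)

projector = UMAP(n_components=2, n_neighbors=15, min_dist=0.1)
coords = projector.fit_transform(embeddings)   # shape (50_000, 2)

# The 2D coordinates are stored alongside each vector in Qdrant, so the
# frontend can stream viewports without recomputing the layout.
```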
128K context windows. OpenAI's 20B and 120B open-weight models, configured for maximum reasoning effort, work with OrthoRAG.
The DGX Spark's 128GB unified memory enables all services to coexist on a single node for development. Production deployments with larger models benefit from dual-node configurations where inference runs on dedicated hardware; larger models bring more sophisticated reasoning capabilities.
05 — Architecture
Every architectural decision has been carefully considered to provide performance today while remaining scalable. From prompts to the configuration of HNSW (Hierarchical Navigable Small World) indices, almost nothing is left at default parameters. Qdrant was selected from a multitude of vector databases because it matches our latency priorities and can store visualization coordinates alongside 4096-dimensional embeddings without a separate database.
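A sketch of the collection setup this implies, using the qdrant-client API; the HNSW numbers and payload field names are illustrative, not the tuned production values:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, HnswConfigDiff, PointStruct, VectorParams

client = QdrantClient(url="http://qdrant:6333")   # internal Docker DNS name

client.create_collection(
    collection_name="ortho_chunks",
    vectors_config=VectorParams(size=4096, distance=Distance.COSINE),
    # HNSW graph parameters trade index size and build time for recall/latency.
    hnsw_config=HnswConfigDiff(m=32, ef_construct=256),
)

# Visualization coordinates and filtering metadata live in the payload,
# right next to the embedding: no second database required.
client.upsert(
    collection_name="ortho_chunks",
    points=[PointStruct(
        id=1,
        vector=[0.0] * 4096,                       # the chunk's embedding
        payload={"text": "...", "umap_xy": [0.42, -1.30],
                 "subspecialty": "Spine", "evidence_level": "RCT"},
    )],
)
```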
Why FlashAttention matters: this optional technology optimizes the way GPUs execute the attention math at the heart of LLMs. Because it is optional and requires model-specific optimization, many hobbyist-level RAG and local LLM implementations do not use it, especially for models released as recently as October 2025. This is one example of the technical nuance in the system.
06 — OrthoGraph
Learners face a challenge: sometimes you don't know what questions to ask. OrthoGraph addresses this by transforming the curated documents into navigable space. The same vector embeddings that power retrieval become coordinates in a visual field where semantic similarity becomes spatial proximity.
A search doesn't just return a list; it illuminates a region. Clusters reveal topic structure. Sparse areas indicate gaps. The learner develops intuition about the knowledge landscape before diving into specific content.
Query: "ACL reconstruction techniques" — Results appear as bright points. Their clustering reveals related literature; their distance from other clusters shows how the topic connects to the broader field.
The same data will be viewable in multiple ways.
Subspecialty. Clusters by clinical domain: Sports Medicine, Trauma, Arthroplasty, Spine, Pediatric, Oncology.
Anatomy. Organizes by body region: Knee, Hip, Shoulder, Spine, Foot/Ankle, Hand/Wrist.
Evidence. Stratifies by research quality. Background gradient shows evidence strength distribution.
Source. Groups by source category: Textbook, Review, RCT, Case Report, Guidelines.
OrthoGraph consists of three containerized services: a GPU-accelerated projection service for UMAP dimensionality reduction, an API server for search orchestration and viewport streaming, and a WebGPU frontend for visualization.
OrthoGraph adapts to different OrthoRAG deployment levels, auto-detecting capabilities on startup (a sketch of this detection follows the list below).
Embeddings only. Basic UMAP projection with search illumination.
V33 single-scale. Adds document metadata and source filtering.
V41 multi-scale. Parent/child LOD, subspecialty views, evidence filtering.
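A minimal sketch of what that startup detection could look like: probe one stored chunk's payload for marker fields and degrade gracefully. The field names and feature flags are illustrative; the real probe logic may differ:

```python
def detect_level(sample_payload: dict) -> str:
    """Infer the OrthoRAG deployment level from one stored chunk's payload."""
    if "parent_id" in sample_payload and "subspecialty" in sample_payload:
        return "V41"         # multi-scale metadata present
    if "source" in sample_payload:
        return "V33"         # single-scale with document metadata
    return "embeddings"      # bare vectors: projection + search illumination only

FEATURES = {
    "embeddings": {"umap", "search"},
    "V33": {"umap", "search", "source_filter"},
    "V41": {"umap", "search", "source_filter", "lod",
            "subspecialty_views", "evidence_filter"},
}
```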
07 — Implementation
OrthoSystem is not a prototype or proof-of-concept. The following implementations are production-grade, containerized, and currently deployed.
OrthoRAG V33: ~10,000 lines. Single-scale chunking implementation. Five containerized services. Vector search with neural reranking.
OrthoRAG V41: ~20,000 lines. Multi-scale chunking with parent-child relationships. Enhanced metadata for OrthoGraph integration.
OrthoGraph: ~8,000 lines to date. GPU UMAP projection service. WebGPU visualization frontend. Three containerized services with real-time viewport streaming.
Services are organized by subsystem. All communication occurs via Docker network DNS with internal service discovery.
08 — Deployment
Each subsystem follows single-script deployment, theoretically allowing any DGX Spark user to deploy the system in minutes.
All services run in Docker containers, many based on NVIDIA NGC images.
All containers communicate through TCP protocols, enabling future multi-node scaling.
Docker Compose for single-node development. Scalable to multi-node production.
Container memory allocations based on production deployment. RAG services total ~38GB allocated, with Qdrant scaled to corpus size. Model sizes listed below are weights only; 128K context windows require additional KV cache memory.
Embedding Service: 24GB
Reranker Service: 6GB
RAG Web: 4GB
Ingestion Service: 4GB
Subtotal: 38GB
GPT-OSS-20B
Weights (MXFP4): ~12GB
KV Cache (BF16 @ 128K): ~6GB
Total: ~18GB + overhead
GPT-OSS-120B
Weights (MXFP4): ~60GB
KV Cache (BF16 @ 128K): ~9GB
Total: ~69GB + overhead
KV cache uses BF16—attention layers are not quantized.
GQA (8 KV heads vs 64 query heads) reduces cache 8×.
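These cache figures follow directly from the attention shapes. A back-of-envelope calculation, assuming the published GPT-OSS layer counts (24 for the 20B, 36 for the 120B, both with 8 KV heads of dimension 64):

```python
def kv_cache_gb(layers: int, kv_heads: int = 8, head_dim: int = 64,
                context: int = 128 * 1024, bytes_per_elem: int = 2) -> float:
    # K and V each store (context x kv_heads x head_dim) values per layer, in BF16.
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 1e9

print(kv_cache_gb(layers=24))   # GPT-OSS-20B  -> ~6.4 GB
print(kv_cache_gb(layers=36))   # GPT-OSS-120B -> ~9.7 GB

# With 64 query heads sharing 8 KV heads (GQA), the cache is 8x smaller
# than full multi-head attention would require (~72 GB for the 120B model).
```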
Single node: Development configuration with the 20B model.
Comfortable headroom for development. 20B weights (12GB MXFP4) + KV cache (6GB BF16 @ 128K) + overhead.
Dual node: Required for the 120B model + vision capabilities.
Node 1: Services
Node 2: Inference
120B weights (60GB MXFP4) + KV cache (9GB BF16 @ 128K) + activations.
09 — FAQ
Why build on the NVIDIA platform?
Fundamentally, the architecture is platform-agnostic. Vector databases, embedding models, and transformer inference all have implementations across AMD, Intel, and Apple Silicon. However, I chose to build on the NVIDIA platform for cost and time efficiency. The NVIDIA software stack (cuML for GPU UMAP, TensorRT-LLM for optimized inference, the NGC container registry) dramatically accelerates development. To prove I could actually execute this vision, I built not just a working prototype but a full enterprise-grade RAG system. Leveraging the combination of the NVIDIA software stack and the DGX Spark allowed me to complete OrthoRAG V41 in two months. I would welcome support to port OrthoSystem to other platforms.
Why not a knowledge graph?
Knowledge graphs require entity extraction, relationship classification, and schema maintenance: significant complexity for uncertain retrieval benefit. Modern LLMs handle multi-hop reasoning at inference time. OrthoGraph uses the same embeddings as OrthoRAG, projected into visual space, avoiding a separate database and extraction pipeline. The visualization provides exploration value without complicating retrieval.
Why does this need 128GB of unified memory?
Memory requirements compound quickly. The embedding model (Llama Embed Nemotron 8B) needs ~16GB. GPU UMAP working memory adds ~6GB. Qdrant with 50K chunks needs ~4GB. The inference stack is where it gets interesting: GPT-OSS-120B weights in MXFP4 require ~60GB, but the KV cache runs in BF16 (attention layers aren't quantized), adding ~9GB at full 128K context. Grouped Query Attention (8 KV heads vs 64 query heads) keeps this manageable. Without GQA, the cache alone would be ~72GB. Total system memory exceeds what consumer GPUs can address. The unified memory architecture of GB10 enables flexible allocation across all services.
How does OrthoBench score models?
OrthoBench uses open-ended orthopaedic surgery questions (not multiple choice). A scorer LLM evaluates answers against domain rubrics in two steps: natural language analysis, then structured JSON scoring. Tiers range from "Gold" (>75) through "Dangerous" (<0, flagging patient safety concerns). The benchmark measures accuracy, tokens/second, and produces per-question reasoning trails.
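A sketch of the two-step pattern against an OpenAI-compatible endpoint; the prompts, endpoint, model name, and middle-tier label are simplified stand-ins, not OrthoBench's actual rubric:

```python
import json
from openai import OpenAI

judge = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

def score_answer(question: str, answer: str, rubric: str,
                 model: str = "gpt-oss-120b") -> dict:
    # Step 1: free-form analysis, so the judge reasons before committing to numbers.
    analysis = judge.chat.completions.create(model=model, messages=[
        {"role": "user",
         "content": f"Rubric:\n{rubric}\n\nQuestion: {question}\nAnswer: {answer}\n\n"
                    "Analyze the answer's accuracy and safety in plain language."},
    ]).choices[0].message.content

    # Step 2: structured JSON scoring grounded in that analysis.
    verdict = judge.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},
        messages=[{"role": "user",
                   "content": f"Based on this analysis:\n{analysis}\n\nReturn JSON with "
                              'keys "score" (-100 to 100), "safety_flag" (bool), "reasoning".'}],
    ).choices[0].message.content

    result = json.loads(verdict)
    result["tier"] = ("Gold" if result["score"] > 75
                      else "Dangerous" if result["score"] < 0
                      else "intermediate")   # middle tiers elided in this sketch
    return result
```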
Why not build something simpler?
A simpler system would be easier to build. But simpler systems make tradeoffs that compromise educational reliability. A standard embedding model doesn't retrieve as consistently as instruction-tuned embeddings do. A monolithic application can't scale individual bottlenecks; microservices can.
Who owns the intellectual property?
Because development used UCSF resources, ownership of the OrthoRAG IP is held by the State of California, technically the UC Regents. Some elements of the project (OrthoBench) are also co-owned by the Department of Veterans Affairs. Elements that may represent novel patentable inventions have not been disclosed. The special element of OrthoSystem falls into the category of trade secrets, subject-matter expertise, and know-how: a three-Michelin-star restaurant and a home baker may use the same ingredients, but the recipe and its execution differ.
"Production grade" refers to the architecture and code quality, not deployment scale. OrthoSystem uses a microservices architecture where each component runs as an independent containerized service communicating over standard protocols. While everything runs on a single DGX Spark, the system could use a dedicated machine for individual service or even resilient clusters for each service. This also allows us to upgrade features within the whole system without breaking anything.
Is this software-as-a-service?
No. I've designed this to be software, not software-as-a-service. The system is designed to run air-gapped, completely offline once deployed. There's no multi-user authentication or session management yet, though those may be worth adding in the future. The code is written to be resilient and horizontally scalable, but today it is a production-ready single-user experience.
10 — Roadmap
OrthoRAG (V33/V41) and OrthoBench are operational, with V41 optimization ongoing. The immediate focus is OrthoGraph, with a target completion of Q2 2026. OrthoTutor, the tutoring system that ties everything together, represents the largest development effort.
About the Author
Health Sciences Clinical Professor of Orthopaedic Surgery, UCSF
Staff Surgeon, San Francisco VA Health Center
Three decades in computing. A decade in academic spine surgery. Multiple patents.
One vision: AI that teaches the way surgeons actually learn.
The Workshop
Technical documentation from active development. Benchmarks, tutorials, and deep dives into the engineering challenges of building AI infrastructure for medical education.
Same deep dives, different decade and different stakes.
Open to collaborations that expand the bandwidth from nights and weekends to business hours.