orthosystem.ai

The OrthoSystem Vision

The scalpel is one of the oldest tools in surgery. Its form hasn't fundamentally changed in centuries. Yet in the hands of a master surgeon versus a first-year resident, the outcomes are very different.

Artificial intelligence is the same kind of tool. The technology exists. What matters is how it's wielded—and by whom. OrthoSystem represents the convergence of three decades in computing and a decade in academic medicine, built on the understanding that educational AI demands the same precision we require in the operating room.

Version 2.0 · December 2025 · ARM64/Blackwell Optimized

01 — The Journey

Why This Exists

OrthoSystem didn't emerge from a startup accelerator or a corporate AI lab. It emerged from the unusual intersection of two careers that shouldn't have overlapped—and the realization that understanding both domains deeply enough reveals solutions invisible to specialists in either.

1997–2001

The Graphics Hardware Years

In high school and college, I ran what became the world's most popular enthusiast site covering ATI (now AMD) graphics hardware. I was invited by ATI to Computex Taipei—a trip I had to cancel because I needed to take the MCAT. In 2001, I attended the launch party for the world's first programmable GPU, the NVIDIA GeForce 3. I was watching the birth of what would become the foundation of modern AI computing.

2009–2010

The CUDA Interview

As a freelance tech journalist, I was granted the exclusive interview with Ian Buck, the architect of CUDA—NVIDIA's parallel computing platform that would eventually enable their leadership in large language models. That NVIDIA trusted me to tell the story of CUDA to enthusiasts reflected years of building credibility in the hardware community. I wrote about the future of accelerated computing in 2010, before most of the industry understood why GPUs would matter beyond graphics, all while doing my orthopaedic surgery residency.

2013–2024

Academic Surgery

I worked my way from assistant professor to full professor of orthopaedic spine surgery at UCSF, practicing and teaching at the San Francisco VA Health Center. The technical instincts didn't disappear—they found new applications. Virtual surgical planning (US9601030B2, 2017). Adult stem cell therapies (US10994050B2, 2021). Trabecular bone lattice design for advanced orthopaedic implants and next-generation anisotropic heat exchangers (US11717418B2, 2023). A decade of teaching residents, watching how they learn, understanding where traditional education fails them.

Early 2025

The Realization

When I wrote a review article explaining the basic science of large language models for orthopaedic surgeons, I recognized the failure modes. The reasoning errors AI makes—drawing wrong conclusions from pattern-matching without structural understanding—were familiar. They were the same errors I see in early learners. And I realized: I understood both domains deeply enough to fix this myself.

October 15, 2025

The Hardware Arrives

NVIDIA released the DGX Spark—128GB of unified memory, Grace Blackwell SoC with ConnectX-7 200GbE networking. For the first time, hardware existed that could run the full vision on a single node: large embedding models, GPU-accelerated compute, real-time visualization, and enterprise-level software tools. That same month, the UCSF Department of Orthopaedic Surgery funded this AI research through my fall teaching bonus—recognition that the work with residents had value worth investing in. The same appreciation for parallel computing that let me interview Ian Buck in 2009 now had a platform capable of executing the architecture I'd been imagining.

December 2025

Enterprise-Grade in Two Months

Leveraging the combination of the NVIDIA software stack and the DGX Spark, I completed OrthoRAG V41 in two months—not a prototype, but a production-grade retrieval system with instruction-tuned embeddings, multi-scale chunking, hybrid search, and neural reranking. Just “me, myself, and AI.” The same teaching instincts that guide a resident through a complex case now inform the design of an adaptive tutoring agent.

The insight: Most people building AI tools understand the technology but not the domain. Most domain experts understand the problems but not the technology. OrthoSystem.ai exists because these threads converged in one person.

02 — Overview

Four Systems, One Platform

OrthoSystem integrates four complementary subsystems to address fundamentally different modes of knowledge access. The platform recognizes that experts and learners approach information differently—and provides tools optimized for each cognitive style while leveraging shared infrastructure.

OrthoRAG

How experts think. Query-driven retrieval with instruction-tuned embeddings. Augments existing knowledge with precise, cited sources. Grounds LLM responses with professorial reliability.

V33/V41 · Deployed

OrthoGraph

How learners explore. Visual navigation through semantic space. See relationships, discover clusters, identify gaps. Transforms the unknown into navigable territory.

In Development

OrthoTutor

Agentic instruction. Dynamically leverages OrthoRAG and OrthoGraph based on student performance. Adapts explanation style to individual learning patterns.

Requires Funding

OrthoBench

Systematic evaluation. Two-step LLM-as-judge scoring against domain rubrics. Measures accuracy, safety, and speed across any OpenAI-compatible endpoint.

Operational · Standalone

Current status: OrthoRAG (V33 and V41) and OrthoBench are operational. OrthoGraph is in active development—the visual corpus navigation that will demonstrate the platform's educational potential. OrthoTutor, the agentic system that ties everything together, requires the most development time and represents the primary support opportunity.

03 — What is RAG

Retrieval-Augmented Generation

When you ask ChatGPT a question, it generates an answer from patterns learned during training—patterns frozen at whatever date the model was last updated. It has no access to your documents, your textbooks, or the latest research. Retrieval-Augmented Generation (RAG) changes this. Before the language model generates a response, a retrieval system searches your documents and injects the relevant passages into the prompt. The model then generates an answer grounded in actual sources rather than statistical memory.
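A minimal sketch of the retrieve-then-generate loop, assuming an OpenAI-compatible endpoint; the URL, model name, and passage format here are placeholders, not OrthoRAG's actual interfaces:

```
import requests

# Example endpoint; any OpenAI-compatible server works (e.g. LM Studio on host.docker.internal:1234).
LLM_URL = "http://localhost:1234/v1/chat/completions"

def answer(question: str, passages: list[dict]) -> str:
    """Retrieve-then-generate: retrieved passages are injected into the prompt,
    so the model answers from cited sources rather than frozen training data."""
    context = "\n\n".join(f"[{p['source']}] {p['text']}" for p in passages)
    prompt = (
        "Answer the question using only the sources below, citing them in brackets.\n\n"
        f"SOURCES:\n{context}\n\nQUESTION: {question}"
    )
    resp = requests.post(
        LLM_URL,
        json={"model": "local-model",
              "messages": [{"role": "user", "content": prompt}],
              "temperature": 0.1},
        timeout=120,
    )
    return resp.json()["choices"][0]["message"]["content"]

# `passages` would come from the retrieval stages described below (embedding, search, reranking).
```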

The concept sounds simple. The implementation is not.

Why Existing Solutions Failed Us

Our initial exploration of available RAG platforms revealed fundamental limitations. Many commercial solutions impose arbitrary file size limits—cutting off textbook chapters mid-sentence. Others use generic embedding models trained on web text, poorly suited to medical terminology where "reduction" means something entirely different than in everyday English. The retrieval quality varied wildly depending on how questions were phrased.

The deeper problem is architectural. RAG isn't a single tool—it's an orchestration of multiple AI subsystems, each making decisions that compound downstream.

Embedding Models

These translate text into vectors—coordinates in high-dimensional space where similar concepts cluster together. Generic embeddings confuse medical terminology. Instruction-tuned embeddings understand that a query differs from a document, improving retrieval precision. We use NVIDIA's Llama Embed Nemotron 8B with asymmetric instruction tuning to optimize retrieval for medical content.
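To illustrate the asymmetry, here is a hedged sketch in which queries receive a task instruction while documents are embedded verbatim; the endpoint, model name, and instruction string are assumptions for illustration, not the production configuration:

```
import numpy as np
import requests

EMBED_URL = "http://localhost:8080/v1/embeddings"          # assumed OpenAI-compatible route
QUERY_INSTRUCTION = ("Instruct: Given an orthopaedic surgery question, "
                     "retrieve passages that answer it.\nQuery: ")   # illustrative instruction text

def embed(text: str) -> np.ndarray:
    resp = requests.post(EMBED_URL,
                         json={"model": "llama-embed-nemotron-8b", "input": text},
                         timeout=60)
    return np.array(resp.json()["data"][0]["embedding"])

# Asymmetric encoding: queries carry a task instruction, documents are embedded as-is.
query_vec = embed(QUERY_INSTRUCTION + "How should a displaced distal radius fracture be managed?")
doc_vec = embed("Closed reduction and casting is appropriate when ...")

similarity = float(query_vec @ doc_vec / (np.linalg.norm(query_vec) * np.linalg.norm(doc_vec)))
print(f"cosine similarity: {similarity:.3f}")
```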

Chunking Strategy

Documents must be split into retrievable segments. Too large, and irrelevant content dilutes the signal. Too small, and context is lost. Multi-scale chunking preserves both detail and context through parent-child relationships—a technique that requires custom implementation.
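A simplified sketch of parent-child chunking; the production pipeline splits on semantic boundaries with overlap rather than fixed character counts, so the sizes and splitting logic here are illustrative only:

```
from dataclasses import dataclass
import uuid

@dataclass
class Chunk:
    id: str
    text: str
    parent_id: str | None = None          # child chunks point back to their parent

def multiscale_chunks(document: str, parent_size: int = 2000, child_size: int = 400) -> list[Chunk]:
    """Large parent chunks preserve context; small child chunks give retrieval precision.
    Children are embedded and searched; the parent is what gets handed to the LLM."""
    chunks: list[Chunk] = []
    for p in range(0, len(document), parent_size):
        parent = Chunk(id=str(uuid.uuid4()), text=document[p:p + parent_size])
        chunks.append(parent)
        for c in range(0, len(parent.text), child_size):
            chunks.append(Chunk(id=str(uuid.uuid4()),
                                text=parent.text[c:c + child_size],
                                parent_id=parent.id))
    return chunks
```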

Reranking Models

Initial retrieval casts a wide net. A reranker—another neural network—scores each candidate against the actual query, promoting truly relevant passages. Without reranking, retrieval often returns content that is factually accurate but irrelevant, which can impair downstream reasoning; we see the same problem in orthopaedic learners.
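The pattern looks roughly like this sketch, which uses an off-the-shelf cross-encoder purely for illustration; OrthoRAG serves its own reranking model at port 8081:

```
from sentence_transformers import CrossEncoder

# Off-the-shelf cross-encoder for illustration; OrthoRAG uses its own reranker service.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    """Score every (query, passage) pair and keep only the most relevant passages."""
    scores = reranker.predict([(query, passage) for passage in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked[:top_k]]
```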

Each component is itself a "mini AI"—a specialized neural network making decisions that affect final answer quality. The embedding model casts a wide net on what's similar. The chunker decides what constitutes a coherent unit. The reranker decides what's actually relevant. Commercial platforms make these decisions for you, generically. Building OrthoRAG meant taking control of and fine-tuning each stage.

Two Parallel Systems: V33 and V41

We maintain two parallel RAG implementations, each serving different reliability and capability requirements.

V33: Single-Scale Chunking

Production-stable. Proven reliability. Better than available alternatives.

  • Architecture: Fixed-size chunks with overlap
  • Collection: documents
  • Strength: Predictable behavior, well-tested
  • Use case: Clinical queries requiring maximum reliability

V41: Multi-Scale Intelligence

Enhanced capabilities for OrthoGraph integration.

  • Architecture: Parent-child chunk relationships with metadata
  • Collection: documents_multiscale
  • Strength: Subspecialty tagging, evidence levels, source types
  • Use case: Visual navigation, adaptive tutoring, research exploration

V41's additional metadata enables OrthoGraph's multiple semantic views—organizing the corpus by subspecialty, anatomy, or evidence level requires that information to exist in the database.
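As a hedged illustration, a V41-style chunk stored in Qdrant might carry a payload like the one below; the field names are examples inferred from the descriptions above, not the actual production schema, and the collection is assumed to already exist:

```
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

client = QdrantClient(host="localhost", port=6333)

# Field names are illustrative; assumes the documents_multiscale collection already exists.
client.upsert(
    collection_name="documents_multiscale",
    points=[PointStruct(
        id=1,
        vector=[0.0] * 4096,                   # placeholder for the real embedding
        payload={
            "text": "Graft selection for ACL reconstruction in adolescents ...",
            "parent_id": "chapter-42-section-3",   # multi-scale parent/child link
            "subspecialty": "Sports Medicine",
            "anatomy": "Knee",
            "evidence_level": "Level II",
            "source_type": "Review",
        },
    )],
)
```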

04 — Philosophy

Bridging Expert and Learner Cognition

When I search the literature, I have a precise question—"What's the latest evidence on graft selection for ACL reconstruction in adolescents?" I'm filling a specific gap in my knowledge.

Learners don't have that framework yet. It's not that they're standing at the edge of unfamiliar territory with no map; the giant textbook is the map. Nor do they need help figuring out which parts are important — everything is important. The challenge is breaking that massive universe of information into learnable chunks for residents working 70-80 hours a week in the hospital while trying to preserve a healthy work-life balance and personal commitments.

This insight drives the architecture. We need different systems because experts and learners work in different cognitive modes.

Expert Mode: OrthoRAG

Query-driven. The expert knows what to ask.

  • Mental model: "I need the latest evidence on graft selection for ACL reconstruction in adolescents."
  • Interaction: Precise queries, expecting cited sources
  • Value: Augments existing expertise with current literature
  • Output: Grounded responses with page-level citations

Learner Mode: OrthoGraph

Exploration-driven. The learner doesn't know what they don't know.

  • Mental model: "What is all this knee stuff? How does it connect?"
  • Interaction: Visual navigation, pattern recognition
  • Value: Reveals the structure of unfamiliar territory using the faculty-curated database of literature.
  • Output: Spatial understanding of concept relationships

The Origins of Word Vectors

The mathematics that make this possible emerged not from Silicon Valley but from university research. In the 1950s, Charles Osgood and his team at the University of Illinois asked hundreds of undergraduate students to rate concepts—words like lady, father, fire, boulder, tornado, and sword—on fifty different scales: good-bad, hard-soft, hot-cold, sweet-sour. Across multiple studies and populations, concepts consistently clustered into three dimensions: good-bad, strong-weak, and active-passive. Each word could be described by three numbers—just like x, y, z coordinates. This research utilized vacuum tube supercomputers from IBM.

The next major leap did come from Silicon Valley, in 2013, when Mikolov built on the hypothesis that the meaning of a word is defined by the words it tends to occur with. Using a Google News dataset with six billion words, his team placed one million common words in 300 dimensions—derived purely from patterns of co-occurrence. The stunning result: if you took the coordinate for king, subtracted man, and added woman, the nearest word was queen. Paris minus France plus Italy? Rome. Meaning had become geometry.

In 2017, the transformer architecture enabled this mathematical structure to scale to entire sentences and documents. OrthoGraph leverages this same geometry—the embeddings that power retrieval become coordinates in visual space. Semantic similarity becomes spatial proximity. A search doesn't return a list; it illuminates a region.

Why not knowledge graphs? Knowledge graphs built from subject-predicate-object triples were considered, but we concluded that modern LLMs with robust retrieval handle multi-hop reasoning, aggregation, and causal chains at inference time. Knowledge graphs add complexity and constraint without a retrieval benefit. Today's hardware no longer needs the intermediate step. OrthoGraph uses the same vector embeddings as OrthoRAG—projected into visual space—rather than maintaining a separate graph database.

Hardware Requirements

Embedding Pipeline

The embedding service has been allocated ~24GB memory. Instruction-tuned for medical terminology. Produces 4096-dimensional vectors.

Projection Service

GPU UMAP via cuML/RAPIDS. 4096D → 768D reduction for 50K chunks requires ~6GB working memory.

Inference

128K context windows. OpenAI's open-weight 20B and 120B models (gpt-oss), configured for maximum reasoning effort, work with OrthoRAG.

The DGX Spark's 128GB unified memory enables all services to coexist on a single node for development. Production deployments with larger models benefit from dual-node configurations where inference runs on dedicated hardware. Larger models bring correspondingly stronger reasoning capabilities.

05 — Architecture

System Topology

Every architectural decision reflects a constraint or an insight. Microservices add complexity but enable scaling as the technology grows. Qdrant was selected from a multitude of vector databases because it meets our latency requirements and stores our 768-dimensional projections alongside the 4096-dimensional embeddings without a separate database.

ORTHOSYSTEM PLATFORM ARCHITECTURE

User interfaces: RAG Web :8000 · Chat V41 :8100 · OrthoGraph WebGPU Canvas :8201 · OrthoTutor :8300 · OrthoBench (standalone)

Service layer:
  • OrthoRAG services: Embedding :8080 (GPU) · Reranker :8081 (GPU) · Ingestion :8082
  • OrthoGraph services: API :8200 · Projection :8220 (GPU UMAP, cuML/RAPIDS)
  • OrthoTutor services: Agent :8310 · Assessment :8320 · Student :8330 (agentic; leverages RAG + Graph)

Data layer: Qdrant (4096D vectors + 768D projections) :6333 · PostgreSQL (students, MCQs) :5432 · SQLite (RLHF, conversations) conversations.db

Inference layer: LM Studio / llama.cpp (host.docker.internal:1234), TensorRT-LLM with FlashAttention, or a commercial API (OpenAI, Anthropic, etc.)

Legend: Deployed · In Dev · Funding

Why FlashAttention matters: Context window determines how much the model can consider at once. Early GPT-3 handled 2,048 tokens—roughly 1,500 words. Modern architectures using FlashAttention extend this to 128K tokens and beyond while sustaining high-throughput generation. Getting these optimizations working on models released in October 2025 and later is among the technical challenges we've already solved.

06 — OrthoGraph

Visual Corpus Navigation

Learners face a challenge that experts forget: when you don't know a domain, you don't know what questions to ask. OrthoGraph addresses this by transforming the corpus into navigable space. The same vector embeddings that power retrieval become coordinates in a visual field where semantic similarity becomes spatial proximity.

A search doesn't just return a list—it illuminates a region. Clusters reveal topic structure. Sparse areas indicate gaps. The learner develops intuition about the knowledge landscape before diving into specific content.

Illustration labels: KNEE · LIGAMENT · RECONSTRUCTION · Graft Selection

Search Illuminates Context

Query: "ACL reconstruction techniques" — Results appear as bright points. Their clustering reveals related literature; their distance from other clusters shows how the topic connects to the broader field.

Technical Implementation

Real-time visualization of 50,000+ chunks at 60fps requires careful engineering. CPU-based UMAP on 50K vectors at 4096 dimensions takes 15-30 minutes. GPU UMAP via cuML completes in under 2 minutes.

Stage 1: GPU UMAP

Pre-computed during ingestion. 4096D → 768D reduction preserves semantic structure. NVIDIA cuML/RAPIDS enables GPU acceleration. Results stored as intermediate_768d in Qdrant payload.
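A minimal sketch of Stage 1 using cuML; the parameter values and corpus array are placeholders, not the tuned production settings:

```
import cupy as cp
from cuml.manifold import UMAP

# Stand-in for the corpus embeddings (50K chunks x 4096 dims).
embeddings = cp.random.random((50_000, 4096)).astype(cp.float32)

# Reduce to the 768-D intermediate representation on the GPU.
reducer = UMAP(n_components=768, n_neighbors=15, init="random", random_state=42)
intermediate_768d = reducer.fit_transform(embeddings)

print(intermediate_768d.shape)   # (50000, 768), stored as intermediate_768d in the Qdrant payload
```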

Stage 2: Linear Projection

Real-time 768D → 2D via trained LDA matrices. Different matrices produce different views (subspecialty, anatomy, evidence level). WebGPU compute shader enables smooth morphing between views.
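A sketch of Stage 2 in NumPy terms; in production the blending runs per-frame in a WebGPU compute shader, and the matrices come from LDA training rather than the random placeholders used here:

```
import numpy as np

def project(points_768d: np.ndarray, view_matrix: np.ndarray) -> np.ndarray:
    """Linear projection to screen space: (N, 768) @ (768, 2) -> (N, 2)."""
    return points_768d @ view_matrix

def morph(points_768d: np.ndarray, matrix_a: np.ndarray, matrix_b: np.ndarray, t: float) -> np.ndarray:
    """Blend two trained view matrices; the shader performs the same interpolation each frame."""
    blended = (1.0 - t) * matrix_a + t * matrix_b
    return project(points_768d, blended)

# Random placeholders for trained matrices (e.g. subspecialty view vs. anatomy view).
subspecialty_view = np.random.randn(768, 2).astype(np.float32)
anatomy_view = np.random.randn(768, 2).astype(np.float32)
chunks_768d = np.random.randn(50_000, 768).astype(np.float32)

frame = morph(chunks_768d, subspecialty_view, anatomy_view, t=0.25)   # 25% through the transition
```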

WebGPU Rendering

GPU-accelerated instanced rendering. Bloom post-processing for illumination effects. Canvas2D overlay for labels and UI chrome. Targets 60fps sustained with 50K particles.

Multiple Semantic Views

The same data projects differently depending on what structure you want to see. Switching views triggers smooth interpolation through intermediate positions—the corpus "morphs" from one organization to another.

Subspecialty

Clusters by clinical domain: Sports Medicine, Trauma, Arthroplasty, Spine, Pediatric, Oncology.

Anatomy

Organizes by body region: Knee, Hip, Shoulder, Spine, Foot/Ankle, Hand/Wrist.

Evidence Level

Stratifies by research quality. Background gradient shows evidence strength distribution.

Document Type

Groups by source category: Textbook, Review, RCT, Case Report, Guidelines.

Service Architecture

Service | Port | Base Image | Function
orthograph-projection | 8220 | nvcr.io/nvidia/pytorch:25.11-py3 | GPU UMAP, LDA matrix generation, R-tree spatial index
orthograph-api | 8200 | nvcr.io/nvidia/pytorch:25.11-py3 | Search orchestration, viewport streaming, clustering
orthograph-web | 8201 | node:20-slim | WebGPU visualization, Canvas2D overlay, chat panel

RAG Compatibility Tiers

OrthoGraph adapts to different OrthoRAG deployment levels. The tier is auto-detected on startup by probing Qdrant payload fields.

Tier | RAG Version | Collection | OrthoGraph Features
Tier 0 | Embeddings only | Any | Basic UMAP projection, search illumination
Tier 1 | V33 (single-scale) | documents | + Document metadata, source filtering
Tier 2 | V41 (multi-scale) | documents_multiscale | + Parent/child LOD, subspecialty views, evidence filtering
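A sketch of how this startup probe might look; collection and field names follow the tables above, and the exact detection logic is illustrative rather than the production implementation:

```
from qdrant_client import QdrantClient

def detect_tier(client: QdrantClient) -> int:
    """Pick the richest tier supported by the connected RAG deployment."""
    collections = {c.name for c in client.get_collections().collections}
    if "documents_multiscale" in collections:
        points, _ = client.scroll("documents_multiscale", limit=1, with_payload=True)
        if points and "parent_id" in (points[0].payload or {}):
            return 2               # Tier 2: V41 multi-scale metadata available
    if "documents" in collections:
        return 1                   # Tier 1: V33 single-scale with document metadata
    return 0                       # Tier 0: embeddings only

print(detect_tier(QdrantClient(host="localhost", port=6333)))
```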

07 — Port Allocation

Service Endpoints

Ports are organized by subsystem. Internal services communicate via Docker network DNS and are not exposed to the host.

OrthoRAG (V41)

Port | Service | Container | Status
6333 | Qdrant Vector DB | v34_qdrant | Deployed
8000 | RAG Web UI | v34_rag-web | Deployed
8080 | Embedding Service | v34_embedding-service | Deployed
8081 | Reranker Service | v34_reranker-service | Deployed
8100 | Chat V41 (RLHF) | chat-web-v41 | Deployed

OrthoGraph

Port | Service | Container | Status
8200 | OrthoGraph API | orthograph-api | In Development
8201 | OrthoGraph Web | orthograph-web | In Development
8220 | Projection Service | orthograph-projection | In Development

OrthoTutor

Port | Service | Container | Status
8300 | OrthoTutor Web | orthotutor-web | Requires Funding
8310 | Teaching Agent | orthotutor-agent | Requires Funding
8320 | Assessment Engine | orthotutor-assess | Requires Funding
8330 | Student Model | orthotutor-student | Requires Funding
5432 | PostgreSQL | orthotutor-postgres | Requires Funding

08 — Deployment

Infrastructure Configuration

Each subsystem follows single-script deployment. Scripts include dependencies, health checks, and graceful error handling. The RAG_PREFIX environment variable enables switching between V33 and V41 deployments.

docker-compose.yml excerpt YAML
```
# Joins both the OrthoGraph network and the selected RAG deployment's network,
# so the API can reach Qdrant and the embedding/reranker services directly.
orthograph-api:
  image: nvcr.io/nvidia/pytorch:25.11-py3
  networks:
    - ortho-network
    - ${RAG_PREFIX:-v34}_rag-network
  environment:
    - RAG_PREFIX=${RAG_PREFIX:-v34}          # switches between V33 and V41 deployments
    - QDRANT_HOST=${RAG_PREFIX:-v34}_qdrant
    - QDRANT_COLLECTION=${QDRANT_COLLECTION:-documents_multiscale}

# The projection service reserves a GPU for cuML UMAP.
orthograph-projection:
  image: nvcr.io/nvidia/pytorch:25.11-py3
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu]

networks:
  ortho-network:
    driver: bridge
  ${RAG_PREFIX:-v34}_rag-network:
    external: true                           # created by the RAG deployment
```

Memory Configuration

Single Node · 128GB

Development configuration with 70B FP4 model.

  • RAG: 46GB
  • Graph: 10GB
  • Tutor: 8GB
  • gpt-oss-20B: 40GB
  • Remaining: 24GB

09 — FAQ

Technical Questions

Can this run on non-NVIDIA hardware?

Fundamentally, the architecture is platform-agnostic. Vector databases, embedding models, and transformer inference all have implementations across AMD, Intel, and Apple Silicon. However, we chose to build on the NVIDIA platform based on cost and time efficiencies. The NVIDIA software stack—cuML for GPU UMAP, TensorRT-LLM for optimized inference, the NGC container registry—dramatically accelerates development. To prove we could actually execute this vision, we built not just a working prototype but a full enterprise-grade RAG system. Leveraging the combination of the NVIDIA software stack and the DGX Spark allowed us to complete OrthoRAG V41 in two months. We would welcome support to port OrthoSystem to other platforms—the educational mission benefits from broader accessibility.

Why not use a knowledge graph for OrthoGraph?

Knowledge graphs require entity extraction, relationship classification, and schema maintenance—significant complexity for uncertain retrieval benefit. Modern LLMs handle multi-hop reasoning at inference time. OrthoGraph uses the same embeddings as OrthoRAG, projected into visual space, avoiding a separate database and extraction pipeline. The visualization provides exploration value without complicating retrieval.

What browsers support WebGPU?

Chrome 113+, Edge 113+, and Firefox (behind flag). Safari has partial support. OrthoGraph implements Canvas2D fallback for unsupported browsers, though at reduced particle counts and without bloom effects. The primary target is Chrome on desktop, where WebGPU is stable and performant.

Why does OrthoGraph need GPU UMAP?

CPU UMAP on 50K 4096-dimensional vectors takes 15-30 minutes. GPU UMAP via cuML completes in under 2 minutes. Since projections must be recomputed when corpus changes or new views are trained, GPU acceleration is essential for practical operation. The 768D intermediate representation balances dimensionality (preserving structure) against storage and computation costs.

Can OrthoTutor function without OrthoGraph?

Yes. OrthoTutor's Teaching Agent uses OrthoRAG for content retrieval and can provide text-based instruction without visualization. OrthoGraph integration enables the agent to show students where a topic fits in the broader field, but isn't required for core tutoring functionality.

Why require GB10-class hardware?

The embedding models used (i.e., Llama Embed Nemotron 8B) require substantial memory; the embedding service alone is allocated ~24GB. GPU UMAP working memory adds ~6GB. Qdrant with 50K chunks needs ~4GB. Inference models range from 40GB (70B FP4) to 100GB (120B). The total exceeds what consumer GPUs can address. The unified memory architecture of GB10 enables flexible allocation and an easy path to deploying this software.

What does OrthoBench measure?

OrthoBench uses open-ended orthopaedic surgery questions (not multiple choice). A scorer LLM evaluates answers against domain rubrics in two steps: natural language analysis, then structured JSON scoring. Tiers range from "Gold" (>75) through "Dangerous" (<0, flagging patient safety concerns). The benchmark measures accuracy, tokens/second, and produces per-question reasoning trails.
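A hedged sketch of the two-step judging pattern; the prompts, JSON keys, and endpoint below are illustrative stand-ins rather than OrthoBench's actual rubrics or scoring schema:

```
import json
import requests

JUDGE_URL = "http://localhost:1234/v1/chat/completions"   # any OpenAI-compatible judge endpoint

def _chat(prompt: str) -> str:
    resp = requests.post(JUDGE_URL,
                         json={"model": "judge-model",
                               "messages": [{"role": "user", "content": prompt}]},
                         timeout=300)
    return resp.json()["choices"][0]["message"]["content"]

def judge(question: str, answer: str, rubric: str) -> dict:
    """Step 1: free-text analysis against the rubric. Step 2: structured JSON score."""
    analysis = _chat(f"Rubric:\n{rubric}\n\nQuestion: {question}\nAnswer: {answer}\n\n"
                     "Analyse the answer against the rubric, noting errors and patient-safety issues.")
    scored = _chat("Based on this analysis, return only JSON with keys "
                   '"score" (number), "tier" (string), "reasoning" (string).\n\n'
                   f"Analysis:\n{analysis}")
    return json.loads(scored)
```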

Why this level of infrastructure complexity?

A simpler system would be easier to build. But simpler systems make tradeoffs that compromise educational reliability. A single embedding model can't distinguish queries from documents—instruction-tuned embeddings can. CPU-based dimensionality reduction takes 30 minutes—GPU UMAP takes 90 seconds. Collapsing the embeddings straight to 2D would prevent OrthoGraph from reusing the 768-dimensional intermediate representation that other models built around 768 dimensions can work with. A monolithic application can't scale individual bottlenecks—microservices can.

10 — Roadmap

Development Phases

OrthoRAG (V41) and OrthoBench are operational. The immediate focus is OrthoGraph—the visual corpus navigation that will demonstrate the platform's educational potential. OrthoTutor, the agentic system that ties everything together, represents the largest development effort and the primary funding opportunity.

Phase 1: OrthoGraph Foundation

Single view projection, basic WebGPU rendering, search illumination, Qdrant integration. Viewport streaming with R-tree spatial index.

Current

Phase 2: Multi-View System

LDA matrix training for subspecialty/anatomy/evidence views. Smooth view morphing. Level-of-detail with parent/child chunks. Star clustering.

Phase 3: Chat Integration

Sliding panel with OrthoRAG chat. Click-to-cite from visualization. Bidirectional: chat queries illuminate, clicks populate context.

Phase 4: RLHF Visualization

Retrieval failure heatmap from Chat V41 feedback. Exploration trails. Corpus gap analysis dashboard.

Phase 5: OrthoTutor Core

Teaching Agent with Socratic dialogue. Student Model with Bayesian knowledge tracing. MCQ integration. Spaced repetition scheduling.

Requires Funding

Phase 6: Agentic Integration

OrthoTutor dynamically leverages OrthoRAG and OrthoGraph based on student performance. Adaptive explanation strategies.

Requires Funding

The opportunity: OrthoGraph will be the showcase—visual evidence that semantic structure can become navigable space. But the transformative potential lies in OrthoTutor: an agentic system that watches how a student learns, recognizes when they need structure versus detail, and dynamically chooses whether to retrieve, visualize, or explain. This is where the educational AI field is heading. OrthoSystem is positioned to lead.

About the Author

Alan B.C. Dang, MD

Health Sciences Clinical Professor of Orthopaedic Surgery, UCSF
Staff Surgeon, San Francisco VA Health Center

Three decades in computing. A decade in academic spine surgery. Three patents. One vision: AI that teaches the way surgeons actually learn.

UCSF Faculty Profile · LinkedIn
