CUDA Proves Nvidia Is a Software Company (wired.com) 12
Nvidia's real AI moat isn't "a piece of hardware," writes Wired's Sheon Han. It's CUDA: a mature, deeply optimized software ecosystem that keeps machine-learning workloads tied to Nvidia GPUs. An anonymous reader quotes a report from Wired: What sounds like a chemical compound banned by the FDA may be the one true moat in AI. CUDA technically stands for Compute Unified Device Architecture, but much like laser or scuba, no one bothers to expand the acronym; we just say "KOO-duh." So what is this all-important treasure good for? If forced to give a one-word answer: parallelization. Here's a simple example. Let's say we task a machine with filling out a 9x9 multiplication table. Using a computer with a single core, all 81 operations are executed dutifully one by one. But a GPU with nine cores can assign tasks so that each core takes a different column -- one from 1x1 to 1x9, another from 2x1 to 2x9, and so on -- for a ninefold speed gain. Modern GPUs can be even cleverer. For example, if programmed to recognize commutativity -- 7x9 = 9x7 -- they can avoid duplicate work, reducing 81 operations to 45, nearly halving the workload. When a single training run costs a hundred million dollars, every optimization counts.
Nvidia's GPUs were originally built to render graphics for video games. In the early 2000s, a Stanford PhD student named Ian Buck, who first got into GPUs as a gamer, realized their architecture could be repurposed for general high-performance computing. He created a programming language called Brook, was hired by Nvidia, and, with John Nickolls, led the development of CUDA. If AI ushers in the age of a permanent white-collar underclass and autonomous weapons, just know that it would all be because someone somewhere playing Doom thought a demon's scrotum should jiggle at 60 frames per second. CUDA is not a programming language in itself but a "platform." I use that weasel word because, not unlike how The New York Times is a newspaper that's also a gaming company, CUDA has, over the years, become a nested bundle of software libraries for AI. Each function shaves nanoseconds off single mathematical operations -- added up, they make GPUs, in industry parlance, go brrr.
A modern graphics card is not just a circuit board crammed with chips and memory and fans. It's an elaborate confection of cache hierarchies and specialized units called "tensor cores" and "streaming multiprocessors." In that sense, what chip companies sell is like a professional kitchen, and more cores are akin to more grilling stations. But even a kitchen with 30 grilling stations won't run any faster without a capable head chef deftly assigning tasks -- as CUDA does for GPU cores. To extend the metaphor, hand-tuned CUDA libraries optimized for one matrix operation are the equivalent of kitchen tools designed for a single job and nothing more -- a cherry pitter, a shrimp deveiner -- which are indulgences for home cooks but not if you have 10,000 shrimp guts to yank out. Which brings us back to DeepSeek. Its engineers went below this already deep layer of abstraction to work directly in PTX, a kind of assembly language for Nvidia GPUs. Let's say the task is peeling garlic. An unoptimized GPU would go: "Peel the skin with your fingernails." CUDA can instruct: "Smash the clove with the flat of a knife." PTX lets you dictate every sub-instruction: "Lift the blade 2.35 inches above the cutting board, make it parallel to the clove's equator, and strike downward with your palm at a force of 36.2 newtons." "You can begin to see why CUDA is so valuable to Nvidia -- and so hard for anyone else to touch," writes Han. "Tuning GPU performance is a gnarly problem. You can't just conscript some tender-footed undergrad on Market Street, hand them a Claude Max plan, and expect them to hack GPU kernels. Writing at this level is a grindsome enterprise -- unless you're a cracker-jack programmer at DeepSeek..."
Han goes on to argue that rivals like AMD and Intel offer competitive specs on paper, but their software stacks have struggled with bugs, compatibility issues, and weak adoption. As a result, Nvidia has built an Apple-like moat around AI computing, leaving the industry dependent on its expensive hardware.
Nvidia's GPUs were originally built to render graphics for video games. In the early 2000s, a Stanford PhD student named Ian Buck, who first got into GPUs as a gamer, realized their architecture could be repurposed for general high-performance computing. He created a programming language called Brook, was hired by Nvidia, and, with John Nickolls, led the development of CUDA. If AI ushers in the age of a permanent white-collar underclass and autonomous weapons, just know that it would all be because someone somewhere playing Doom thought a demon's scrotum should jiggle at 60 frames per second. CUDA is not a programming language in itself but a "platform." I use that weasel word because, not unlike how The New York Times is a newspaper that's also a gaming company, CUDA has, over the years, become a nested bundle of software libraries for AI. Each function shaves nanoseconds off single mathematical operations -- added up, they make GPUs, in industry parlance, go brrr.
A modern graphics card is not just a circuit board crammed with chips and memory and fans. It's an elaborate confection of cache hierarchies and specialized units called "tensor cores" and "streaming multiprocessors." In that sense, what chip companies sell is like a professional kitchen, and more cores are akin to more grilling stations. But even a kitchen with 30 grilling stations won't run any faster without a capable head chef deftly assigning tasks -- as CUDA does for GPU cores. To extend the metaphor, hand-tuned CUDA libraries optimized for one matrix operation are the equivalent of kitchen tools designed for a single job and nothing more -- a cherry pitter, a shrimp deveiner -- which are indulgences for home cooks but not if you have 10,000 shrimp guts to yank out. Which brings us back to DeepSeek. Its engineers went below this already deep layer of abstraction to work directly in PTX, a kind of assembly language for Nvidia GPUs. Let's say the task is peeling garlic. An unoptimized GPU would go: "Peel the skin with your fingernails." CUDA can instruct: "Smash the clove with the flat of a knife." PTX lets you dictate every sub-instruction: "Lift the blade 2.35 inches above the cutting board, make it parallel to the clove's equator, and strike downward with your palm at a force of 36.2 newtons." "You can begin to see why CUDA is so valuable to Nvidia -- and so hard for anyone else to touch," writes Han. "Tuning GPU performance is a gnarly problem. You can't just conscript some tender-footed undergrad on Market Street, hand them a Claude Max plan, and expect them to hack GPU kernels. Writing at this level is a grindsome enterprise -- unless you're a cracker-jack programmer at DeepSeek..."
Han goes on to argue that rivals like AMD and Intel offer competitive specs on paper, but their software stacks have struggled with bugs, compatibility issues, and weak adoption. As a result, Nvidia has built an Apple-like moat around AI computing, leaving the industry dependent on its expensive hardware.