Configuring 100K NVIDIA H100 GPUs Usually Takes Years But Musk Did It In 19 Days

Jensen Huang during his interview on the BG2 Pod.

NVIDIA CEO Jensen Huang had high praise for Elon Musk and his team at xAI, calling them “superhuman.” In an interview with YouTube channel BG2 Pod, Huang remarked that Elon’s specific combination of engineering smarts and project management is “singular,” and he credited both the X owner and NVIDIA’s own infrastructure expertise with the incredible achievement of setting up the world’s fastest AI-training supercomputing cluster in just nineteen days.

The cluster in question is xAI’s Colossus system in Memphis, Tennessee, and it sports 100,000 Hopper H100 GPUs, making it theoretically the fastest AI training cluster in the world. The “nineteen days” remark is a little misleading, though; that’s the time from hardware setup to its first functional use for AI training. The full Colossus project was set up in 122 days from start to finish, according to Musk.

To be clear, though, both of those time frames are unbelievably short. Huang is almost awestruck as he describes xAI’s achievement, which we’ll quote here in abridged form (Jensen was rambling a bit):

“From the moment that we decided to go … to training: 19 days. […] Do you know how many days 19 days is? It’s just a couple of weeks, and the mountain of technology, if you were ever to see it, is just unbelievable. […] What they achieved is singular; never been done before. Just to put it in perspective, 100,000 GPUs — that’s easily the fastest supercomputer on the planet, that one cluster. A supercomputer that you would build would take normally three years to plan, and then they can deliver the equipment, and then it takes one year to get it all working. We’re talking about 19 days.”

One of NVIDIA’s Blackwell GB200 NVL72 racks. The entire rack functions as one GPU.
NVIDIA’s CEO is also quick to point out that “networking NVIDIA gear is very different from networking hyperscale datacenters.” He goes on to explain that NVIDIA clusters require far more connectivity between nodes than a typical datacenter because of the high-bandwidth nature of GPU compute workloads, and he comically sums it up by exclaiming that “the back of the computer’s all wires!”
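To put some rough numbers behind that remark, here’s a back-of-envelope sketch of our own, not Huang’s or xAI’s. It assumes eight GPUs per server and one 400 Gb/s fabric NIC per GPU, which is typical of HGX-class AI nodes but not a confirmed detail of Colossus, compared against a garden-variety datacenter server with dual 25 GbE links:

```python
# Back-of-envelope sketch (illustrative assumptions, not xAI's actual configuration).

GPUS_PER_SERVER = 8                # typical HGX H100 node (assumed)
NIC_PER_GPU_GBPS = 400             # assume one 400 Gb/s fabric NIC per GPU
TYPICAL_SERVER_NIC_GBPS = 2 * 25   # a common dual-25 GbE general-purpose server

# East-west bandwidth a single GPU node presents to the network fabric
gpu_node_fabric_bw = GPUS_PER_SERVER * NIC_PER_GPU_GBPS     # 3,200 Gb/s
ratio = gpu_node_fabric_bw / TYPICAL_SERVER_NIC_GBPS        # ~64x a typical server

# Roughly one high-speed GPU-side link per GPU, before counting switch uplinks
cluster_gpus = 100_000
fabric_links = cluster_gpus

print(f"Per-node fabric bandwidth: {gpu_node_fabric_bw} Gb/s (~{ratio:.0f}x a typical server)")
print(f"GPU-side links to cable and validate: ~{fabric_links:,}")
```

Even under those rough assumptions, each GPU node pushes dozens of times more east-west traffic than an ordinary server, and every one of those hundred thousand or so GPU-side links is a physical cable that has to be run, seated, and validated.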

While Musk is involved with supercomputing datacenters across all of his businesses, Colossus is part of xAI, his venture to become a big player in the AI space. The new system absolutely dwarfs the 2,000-GPU AI training cluster at Tesla’s Austin, Texas facility, which is hardly surprising, given that it outclasses almost every supercomputing cluster in the world.

The relevant portion starts at 46:40 or thereabouts, if you’re short on time.

The full interview on the BG2 Pod is worth a watch if you’re interested in AI and the future of NVIDIA. Huang has some pretty interesting ideas about what the next ten years are going to look like. We won’t repeat everything he said here, but you can watch the video for yourself on the BG2 Pod’s YouTube channel.