NVIDIA H200 Tensor Core GPU

The GPU for Generative AI and HPC

The NVIDIA H200 Tensor Core GPU supercharges generative AI and high-performance computing (HPC) workloads with game-changing performance and memory capabilities. As the first GPU with HBM3e, the H200’s larger and faster memory fuels the acceleration of generative AI and large language models (LLMs) while advancing scientific computing for HPC workloads.
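
To see this memory capacity from software, the short sketch below queries the first GPU's name and total memory through NVML. It assumes the nvidia-ml-py package (imported as pynvml) is installed and that device index 0 is the GPU of interest.

    # Query the name and total HBM capacity of the first visible GPU via NVML.
    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumption: GPU 0 is the H200
    name = pynvml.nvmlDeviceGetName(handle)
    if isinstance(name, bytes):  # older bindings return bytes instead of str
        name = name.decode()
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"{name}: {mem.total / 1e9:.0f} GB total memory")
    pynvml.nvmlShutdown()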

Unmatched End-to-End Accelerated Computing Platform

The NVIDIA HGX B300 integrates NVIDIA Blackwell Ultra GPUs with high-speed interconnects to propel the data center into a new era of accelerated computing and generative AI. A premier accelerated scale-up platform with up to 11x more inference performance than the previous generation, NVIDIA Blackwell-based HGX is designed for the most demanding generative AI, data analytics, and HPC workloads.

NVIDIA HGX includes advanced networking options—at speeds up to 800 gigabits per second (Gb/s)—using NVIDIA Quantum-X800 InfiniBand and Spectrum™-X Ethernet for the highest AI performance. HGX also includes NVIDIA BlueField®-3 data processing units (DPUs) to enable cloud networking, composable storage, zero-trust security, and GPU compute elasticity in hyperscale AI clouds.

AI Reasoning Inference: Performance and Versatility

11x Higher Inference on Llama 3.1 405B

Real-Time Large Language Model Inference

HGX B300 achieves up to 11x higher inference performance over the previous NVIDIA Hopper™ generation for models such as Llama 3.1 405B. The second-generation Transformer Engine uses custom Blackwell Tensor Core technology combined with TensorRT™-LLM innovations to accelerate inference for large language models (LLMs).

Projected performance subject to change. Token-to-token latency (TTL) = 20 ms real time, first token latency (FTL) = 5 s, input sequence length = 32,768, output sequence length = 1,028; per-GPU performance comparison of 1x eight-way HGX H100 (air-cooled) vs. 1x eight-way HGX B300 (air-cooled); served using disaggregated inference.
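
For readers who want to experiment with TensorRT-LLM directly, the minimal sketch below uses its high-level Python LLM API. The model ID and sampling settings are illustrative assumptions, not the benchmark configuration described above, and a 405B model requires a multi-GPU system with sufficient memory.

    # Minimal LLM inference with the TensorRT-LLM high-level Python API.
    from tensorrt_llm import LLM, SamplingParams

    # Assumption: the model weights are available locally or via Hugging Face.
    llm = LLM(model="meta-llama/Llama-3.1-405B-Instruct")
    params = SamplingParams(temperature=0.8, max_tokens=128)

    outputs = llm.generate(["Explain HBM3e memory in one sentence."], params)
    for output in outputs:
        print(output.outputs[0].text)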

AI Training: Performance and Scalability

4x Faster Training on Llama 3.1 405B

Next-Level Training Performance

The second-generation Transformer Engine, featuring 8-bit floating point (FP8) and new precisions, enables a remarkable 4x faster training for large language models like Llama 3.1 405B. This breakthrough is complemented by fifth-generation NVLink with 1.8 TB/s of GPU-to-GPU interconnect, InfiniBand networking, and NVIDIA Magnum IO™ software. Together, these ensure efficient scalability for enterprises and extensive GPU computing clusters.

Projected performance subject to change. 1x eight-way HGX H100 vs. 1x eight-way HGX B300, per-GPU performance comparison.
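
As a concrete illustration of FP8 training, the sketch below runs a single training step with Transformer Engine's PyTorch API, executing the forward pass of a linear layer under an FP8 autocast. The layer size, optimizer, and delayed-scaling recipe are illustrative defaults, not a tuned Llama 3.1 405B configuration.

    # One FP8 training step with NVIDIA Transformer Engine (PyTorch API).
    import torch
    import transformer_engine.pytorch as te
    from transformer_engine.common import recipe

    layer = te.Linear(4096, 4096).cuda()
    optimizer = torch.optim.AdamW(layer.parameters(), lr=1e-4)
    x = torch.randn(8, 4096, device="cuda")

    # HYBRID = E4M3 for forward activations/weights, E5M2 for backward gradients.
    fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID)
    with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
        y = layer(x)  # forward runs in FP8
    loss = y.float().sum()
    loss.backward()  # backward is called outside the autocast context
    optimizer.step()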

Accelerating HGX With NVIDIA Networking

The data center is the new unit of computing, and networking plays an integral role in scaling application performance across it. Paired with NVIDIA Quantum InfiniBand, HGX delivers world-class performance and efficiency, which ensures the full utilization of computing resources.

For AI cloud data centers that deploy Ethernet, HGX is best used with the NVIDIA Spectrum-X networking platform, which powers the highest AI performance over Ethernet. It features Spectrum-X switches and NVIDIA SuperNICs for optimal resource utilization and performance isolation, delivering consistent, predictable outcomes for thousands of simultaneous AI jobs at every scale. Spectrum-X enables advanced cloud multi-tenancy and zero-trust security. As a reference design, NVIDIA created Israel-1, a hyperscale generative AI supercomputer built with Dell PowerEdge XE9680 servers based on the NVIDIA HGX 8-GPU platform, BlueField-3 SuperNICs, and Spectrum-4 switches.
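
At the application level, this fabric is typically exercised through NCCL collectives. The sketch below runs an all-reduce with PyTorch's NCCL backend; across nodes, NCCL transparently uses the InfiniBand or Ethernet fabric described above. The launch command and script name are illustrative.

    # All-reduce across GPUs with PyTorch's NCCL backend.
    # Launch with, e.g.: torchrun --nproc-per-node=8 allreduce.py
    import os

    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)

    t = torch.ones(1024, device="cuda") * dist.get_rank()
    dist.all_reduce(t, op=dist.ReduceOp.SUM)  # sums the tensor across all ranks
    if dist.get_rank() == 0:
        print(f"after all-reduce: {t[0].item()}")
    dist.destroy_process_group()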

NVIDIA HGX Specifications

NVIDIA HGX is available as a single baseboard with four or eight NVIDIA Hopper SXM GPUs, or with eight NVIDIA Blackwell or Blackwell Ultra SXM GPUs. These powerful combinations of hardware and software lay the foundation for unprecedented AI supercomputing performance.
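
To confirm what a given baseboard actually exposes to software, the short sketch below enumerates the visible GPUs and their memory with PyTorch.

    # List the GPUs visible on the system, e.g., the SXM modules on an HGX baseboard.
    import torch

    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 1e9:.0f} GB")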