Multi-GPU Training Explained
Multi-GPU training distributes AI model training across multiple GPUs to handle models too large for a single GPU or to reduce training time. The two main approaches are data parallelism (same model, different data batches) and model parallelism (model split across GPUs). GPU interconnect bandwidth — NVLink, NVSwitch, InfiniBand — is critical for efficient multi-GPU training.
Data Parallelism
In data parallelism, each GPU holds a complete copy of the model and processes a different batch of training data. After each backward pass, the GPUs average their gradients (typically via an all-reduce) so every replica applies the same weight update.
This is the simplest form of multi-GPU training and works well when the model fits in a single GPU's memory. It scales training throughput almost linearly: 8 GPUs process roughly 8x as much data per step as one GPU.
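A minimal sketch of data parallelism using PyTorch's DistributedDataParallel. The tiny linear model and random batches are placeholders; in real training each rank would read a different shard of the dataset.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model: each GPU holds a full replica.
    model = torch.nn.Linear(1024, 1024).to(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        # In real training each rank loads a different data shard
        # (e.g. via DistributedSampler); random tensors stand in here.
        batch = torch.randn(32, 1024, device=local_rank)
        loss = model(batch).square().mean()
        loss.backward()          # DDP averages gradients across GPUs here
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # launch: torchrun --nproc_per_node=8 ddp_sketch.py
```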
The limitation is memory: each GPU needs enough VRAM to hold the entire model. For a 70B parameter model, the FP16 weights alone require ~140GB, so data parallelism by itself isn't possible on 80GB H100 GPUs; and that is before gradients and optimizer states are counted.
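The arithmetic behind that figure, as a quick back-of-envelope. The 16-bytes-per-parameter training estimate is a common rule of thumb for mixed-precision Adam, not an exact number:

```python
params = 70e9                        # 70B parameters
fp16_weights_gb = params * 2 / 1e9   # 2 bytes per parameter in FP16
print(fp16_weights_gb)               # 140.0 -> already > 80GB H100 VRAM

# Training needs more than weights: gradients plus Adam optimizer
# states push a common rule of thumb to ~16 bytes per parameter.
training_gb = params * 16 / 1e9
print(training_gb)                   # 1120.0
```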
Model Parallelism
Model parallelism splits the model itself across multiple GPUs. Tensor parallelism splits individual layers across GPUs, while pipeline parallelism assigns different layers to different GPUs.
Model parallelism enables training of models that exceed a single GPU's memory. However, it introduces communication overhead — GPUs must constantly exchange intermediate results, making interconnect bandwidth critical.
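A simplified sketch of one tensor parallelism pattern, a column-parallel linear layer loosely following the Megatron-LM idea. The class name is ours, it assumes the process group was initialized as in the earlier sketch, and it shows the forward pass only (real frameworks use autograd-aware collectives):

```python
import torch
import torch.distributed as dist

class ColumnParallelLinear(torch.nn.Module):
    """Toy tensor parallelism: each rank owns a slice of the output
    columns of one weight matrix (illustrative, not Megatron-LM)."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        world = dist.get_world_size()
        assert out_features % world == 0
        # Each GPU stores only out_features / world_size columns.
        self.weight = torch.nn.Parameter(
            torch.randn(out_features // world, in_features) * 0.02
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local_out = x @ self.weight.t()   # partial result on this GPU
        # Communication step: gather every rank's slice to rebuild the
        # full activation -- this exchange is the interconnect traffic.
        chunks = [torch.empty_like(local_out)
                  for _ in range(dist.get_world_size())]
        dist.all_gather(chunks, local_out)
        return torch.cat(chunks, dim=-1)
```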
NVLink 4.0 (used in H100) provides 900 GB/s bidirectional bandwidth between GPUs, while InfiniBand handles communication across servers. These high-bandwidth interconnects are essential for efficient model parallelism.
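To get a feel for why that bandwidth matters, here is a rough upper-bound estimate of the time to synchronise a 70B model's FP16 gradients across 8 GPUs with a ring all-reduce. Real systems overlap communication with compute and rarely reach the full advertised bandwidth, so treat this as a sketch:

```python
grad_bytes = 70e9 * 2    # FP16 gradients for a 70B model
n = 8                    # GPUs in the ring
bandwidth = 900e9        # NVLink 4.0 aggregate, bytes/s

# A ring all-reduce moves ~2 * (n - 1) / n times the data per GPU.
volume = 2 * (n - 1) / n * grad_bytes
print(f"{volume / bandwidth:.2f} s per sync")   # ~0.27 s
```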
Choosing the Right Strategy
Most large-scale training uses a combination of both approaches:
- **Small models (<15B params):** Data parallelism on multiple GPUs is sufficient
- **Medium models (15-70B params):** Tensor + data parallelism across 4-8 GPUs
- **Large models (70B+ params):** Full 3D parallelism (data + tensor + pipeline) across dozens or hundreds of GPUs
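The same decision logic as a function. The thresholds come straight from the list above; the function name and cutoffs are illustrative, not a standard API:

```python
def pick_strategy(params_billions: float) -> str:
    """Map model size to a parallelism strategy (illustrative cutoffs)."""
    if params_billions < 15:
        return "data parallelism"
    if params_billions <= 70:
        return "tensor + data parallelism (4-8 GPUs)"
    return "3D parallelism: data + tensor + pipeline"

for size in (7, 30, 405):
    print(f"{size}B -> {pick_strategy(size)}")
```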
The GPU's memory capacity determines the minimum parallelism required. Higher-memory GPUs like the H200 (141GB) or MI300X (192GB) reduce the number of GPUs needed, simplifying the training setup.
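A back-of-envelope way to see how per-GPU memory drives the minimum GPU count. This counts FP16 weights only, ignoring activations, gradients, and optimizer states, so treat the result as a floor rather than a sizing recommendation:

```python
import math

def min_gpus(params_billions: float, gpu_mem_gb: float,
             bytes_per_param: int = 2) -> int:
    """Floor on GPU count from weight memory alone."""
    weights_gb = params_billions * bytes_per_param
    return math.ceil(weights_gb / gpu_mem_gb)

print(min_gpus(70, 80))    # H100 80GB   -> 2
print(min_gpus(70, 141))   # H200 141GB  -> 1
print(min_gpus(70, 192))   # MI300X 192GB -> 1
```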
Frequently Asked Questions
How many GPUs do I need to train a large model?
It depends on model size and GPU memory. A 7B model can be trained on a single A100 80GB with mixed precision and a memory-efficient optimizer. A 70B model needs 4-8 H100s with model parallelism. A 405B model requires 64+ GPUs. Higher-memory GPUs (H200, MI300X) reduce the count needed.
What is the difference between data and model parallelism?
Data parallelism: each GPU has the full model, processes different data. Model parallelism: the model is split across GPUs. Data parallelism is simpler but requires the model to fit in one GPU's memory.