![](https://crypto4nerd.com/wp-content/uploads/2023/11/1xDuE6EuA8GrPLRWWlzNd9A.png)
In the first part of this series, I explored the Data Parallel (DP) algorithm, highlighting its efficiency in scenarios where all the GPUs are located within a single server. However, a common question arises: what if the GPUs are distributed across multiple hosts? This is precisely the situation that the Distributed Data Parallel (DDP) algorithm is designed to handle.
Similar to the DP algorithm, DDP requires that a model’s parameters be small enough to fit in the memory of each of the GPUs in the group. However, DDP distinguishes itself by its ability to scale processing across multiple servers, thanks to a significant reduction in communication overhead. This efficiency is primarily achieved by consolidating all communication into a single step: gradient reduction.
The accompanying diagram illustrates this process. Initially, each system starts with a copy of the model and a specific portion of the training data. Each then independently performs every stage from the forward pass up to gradient reduction. During the gradient reduction phase, each system communicates and merges its gradient values with those from all the other nodes, ensuring that every GPU ends up with matching gradient values.
In the final step, each GPU, beginning with its clone of the original model, applies the aggregated gradients obtained from the reduction phase. This ensures that every GPU generates an identical copy of the updated model, thus maintaining consistency across all nodes. With such synchronization already established, the systems are set for the subsequent forward pass, negating the need for further synchronization steps.
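To illustrate the reduction step in isolation, here is a minimal sketch that averages gradients by hand with `torch.distributed.all_reduce` after the local backward pass. The function name and arguments are illustrative only, and it assumes the process group has already been initialized; production implementations such as PyTorch's DDP wrapper bucket gradients and overlap this communication with the backward pass rather than running it as a separate loop.

```python
import torch
import torch.distributed as dist

def synchronized_step(model, optimizer, inputs, targets, loss_fn):
    """One DDP-style iteration: local forward/backward, then a gradient all-reduce.

    Assumes dist.init_process_group(...) has already been called and that
    model, inputs, and targets live on this process's GPU.
    """
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)   # forward pass and loss, local to this GPU
    loss.backward()                          # backward pass, local to this GPU

    world_size = dist.get_world_size()
    for param in model.parameters():         # merge gradients with every other worker
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size         # average so the update matches single-GPU training

    optimizer.step()                         # identical update on every replica
    return loss.item()
```

Because every replica starts from the same weights and applies the same averaged gradients, the models stay in lockstep without any explicit weight broadcast between iterations.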
Workflow Steps:
1. Initially, workers retrieve a mini-batch from disk and transfer it into the system's pinned memory.
2. The CPU then moves the initial model and the mini-batch onto the GPU. (Note: from this point onward the CPU adds no further value, so references to GPU operations encompass the whole system the GPU is integrated into.)
3. Each GPU then carries out a forward pass on its mini-batch, operating independently.
4. Each GPU computes the loss, still working in isolation.
5. Continuing independently, each GPU performs a backward pass.
6. The GPUs then engage in a collective operation, communicating and aggregating gradients through an all-reduce across every unit.
7. Finally, each GPU updates its own copy of the model, maintaining independence in this final step.
From step 3 onward, the only inter-GPU communication is the all-reduce in step 6; every other step runs independently on each GPU. Because the resulting updates are identical, the replicas remain synchronized and are ready for the next cycle without any redundant data transfer.
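To make the steps concrete, here is a minimal sketch of the same workflow using PyTorch's `DistributedDataParallel` wrapper, which performs the all-reduce of step 6 automatically during the backward pass. The model class, dataset, batch size, and optimizer settings are placeholders, and the script assumes it is launched with one process per GPU (for example via `torchrun`), which sets the rank and world-size environment variables.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def train(model_cls, dataset, epochs=1):
    # One process per GPU; the launcher supplies the rank/world-size
    # environment variables that init_process_group reads.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    device = torch.device("cuda", local_rank)

    # Steps 1-2: each worker reads its own shard and stages batches in pinned memory.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler, pin_memory=True)

    model = DDP(model_cls().to(device), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(epochs):
        sampler.set_epoch(epoch)                          # reshuffle shards each epoch
        for inputs, targets in loader:
            inputs = inputs.to(device, non_blocking=True)     # step 2: host-to-GPU copy
            targets = targets.to(device, non_blocking=True)
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)        # steps 3-4: forward pass and loss
            loss.backward()                               # steps 5-6: backward pass; DDP all-reduces gradients
            optimizer.step()                              # step 7: identical update on every replica

    dist.destroy_process_group()
```

Note that the wrapper overlaps gradient communication with the backward pass, which is a large part of the communication savings described above.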
Key Notes:
- DDP excels in scaling across multiple systems.
- DDP significantly reduces communication overhead.
- DDP can be combined with DP, where DDP manages inter-system synchronization and DP handles intra-system coordination.
- A key consideration with DDP is that the model must still fit entirely within each GPU's memory (a rough check is sketched below).
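As a rough illustration of that memory constraint, the back-of-envelope check below (a sketch with a hypothetical `fits_on_gpu` helper and an arbitrary safety factor) compares the parameter footprint against a single GPU's capacity; gradients, optimizer state, and activations also consume memory during training, which is why the multiplier is deliberately generous.

```python
import torch

def fits_on_gpu(model, device_index=0, safety_factor=4.0):
    """Rough check that a model's parameters can fit in one GPU's memory.

    safety_factor is a crude placeholder for gradients, optimizer state,
    and activations, which also live on the GPU during training.
    """
    param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    total_bytes = torch.cuda.get_device_properties(device_index).total_memory
    return param_bytes * safety_factor < total_bytes
```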