Tools of the Trade: C2C Activation Offloading on Grace Blackwell
We demonstrate the potential of activation offloading over NVIDIA's NVLink-C2C interconnect on Grace-based superchips as a high-performance alternative to selective activation checkpointing (AC). By offloading MLP activations to host memory during training, we achieve a 6–13% throughput improvement over selective AC with negligible memory overhead.
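
To get a feel for when offloading can beat recomputation, it helps to estimate how many bytes one MLP block saves for the backward pass and how long moving them to host memory would take. The sketch below is a back-of-the-envelope calculation, not our measurement setup. The saved-tensor model (block input plus pre- and post-nonlinearity tensors), the model dimensions, and the bandwidth value are all illustrative assumptions, and the function names are hypothetical.

```python
def mlp_activation_bytes(tokens, hidden, ffn_hidden, bytes_per_elem=2):
    """Bytes kept for backward by one MLP block under an assumed simple model:
    the block input (tokens x hidden) plus the pre- and post-nonlinearity
    tensors (each tokens x ffn_hidden). Real frameworks may save more or less."""
    elems = tokens * hidden + 2 * tokens * ffn_hidden
    return elems * bytes_per_elem


def transfer_seconds(num_bytes, bandwidth_bytes_per_s):
    """Ideal one-way copy time, ignoring latency and contention."""
    return num_bytes / bandwidth_bytes_per_s


def offload_is_hidden(transfer_s, overlap_window_s):
    """Offloading is free only if the copy fits inside the compute it overlaps."""
    return transfer_s <= overlap_window_s


# Hypothetical configuration: 8192 tokens per micro-batch, bf16 activations.
size = mlp_activation_bytes(tokens=8192, hidden=4096, ffn_hidden=16384)

# Placeholder one-way bandwidth in bytes/s; an assumption, not a measured
# or quoted figure for any specific system.
ASSUMED_BANDWIDTH = 400e9
copy_time = transfer_seconds(size, ASSUMED_BANDWIDTH)

print(f"{size / 2**20:.0f} MiB per MLP block, {copy_time * 1e3:.2f} ms to copy")
```

If the copy time is shorter than the compute it overlaps with (for example, the next layer's forward pass), the offload can be hidden entirely, whereas recomputation always costs extra FLOPs. Substitute measured bandwidth and layer timings for the placeholders before drawing conclusions.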