blog_

Hello 👋,
I am an engineer at heart (about me), focused on thoughtful creation with an emphasis on throughput, availability, and correctness.

You can trace my mind dump here:

NOTES: training the distributed scale?

Key definitions

- Data parallelism: every GPU (worker) holds a full copy of the model and processes a different slice of the training data; after computing gradients, the workers combine (synchronise) them.
- Model parallelism: the model itself is split across devices (by layers or parameters), so different GPUs handle different parts of the model.
- Parameters / gradients / optimizer state: model parameters are the weights; gradients are the derivatives computed during back-propagation; optimizer state is the extra per-parameter data (momentum, Adam’s moments).
- Collective communication / all-reduce: multiple devices coordinate a data exchange; an all-reduce, for example, sums a tensor across all devices and then distributes the result back to each of them.
- Sharding: dividing a tensor or state among workers so that each stores only a part, instead of a full replica.
- Offloading: moving data (parameters, optimizer state) to slower but larger memory (CPU RAM or SSD/NVMe) when it isn’t actively needed on the GPU.

1. PyTorch Distributed Data Parallel (DDP)

The problem is training large deep-neural-network models efficiently across many GPUs. DDP addresses it by replicating the model on every GPU, processing different samples in parallel, and then synchronising gradients via collective operations, which scales well. The main insight is the efficient overlap of computation and communication: gradient tensors are grouped into “buckets”, and each bucket’s reduction is launched while the backward pass is still running, enabled by the process-group abstraction in torch.distributed and collective backends such as NCCL. The second takeaway is that the process-group abstraction hides the details of inter-GPU communication: torch.distributed.ProcessGroup delegates to NCCL- or MPI-based libraries that run all-reduce/broadcast under the hood. The third insight is reducing redundant copies of model parameters and buffers across processes: rather than each process fully copying every buffer, DDP uses shared storage for model parameters in multi-process mode (via torch.multiprocessing + shared CUDA memory), so intra-node replicas reuse memory instead of duplicating it. Finally, DDP achieves near-linear scaling (up to ~256 GPUs) when configured properly (bucketing, overlapping), as shown in the PyTorch paper. (arXiv) ...
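To make the bucketing/overlap story concrete, here is a minimal sketch of a single DDP training step, assuming a torchrun launch; the toy Linear model and hyperparameters are illustrative, not from the paper:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # Launched with `torchrun --nproc_per_node=<num_gpus> train.py`;
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).to(local_rank)
    # bucket_cap_mb sets the gradient-bucket size DDP uses to overlap
    # all-reduce with the backward pass (25 MB is the default).
    ddp_model = DDP(model, device_ids=[local_rank], bucket_cap_mb=25)

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)
    x = torch.randn(32, 1024, device=local_rank)   # each rank sees its own slice of data
    loss = ddp_model(x).sum()
    loss.backward()   # gradients are all-reduced bucket by bucket while backward runs
    optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Because every rank applies the same averaged gradients, the replicas stay in lockstep without any explicit parameter broadcast after the initial one.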

October 21, 2025 · 8 min · 1618 words · Me

NOTES: vSpiders or how to virtualize a network

Cloud networking is about squeezing determinism from chaos - hundreds of thousands of tenants, each expecting a private, secure, and high-performance network that doesn’t even “feel” shared. This post takes you from the ground up: we’ll start with what virtualization means in networking, explain overlay networks, fabric topologies, and control planes, and then walk through three real systems:

(1) Koponen et al., Network Virtualization in Multi-Tenant Datacenters, NSDI 2014 - PDF from USENIX
(2) Dalton et al., Andromeda: Performance, Isolation, and Velocity at Scale, NSDI 2018 - PDF from USENIX
(3) Snap: a Microkernel Approach to Host Networking, SOSP / Google - official web page (includes PDF) at Google Research ...

October 10, 2025 · 7 min · 1430 words · Me

NOTES: GPU Architecture, TVM, and Triton

If TensorFlow showed us how to scale ML across entire data centers, and PyTorch 2.0 showed us how to compile dynamic Python code, GPUs and compiler stacks like TVM and Triton show us how to squeeze the absolute last drop of performance out of hardware. This post is my attempt to tie together GPU architecture, TVM’s tensor-level compilation, and Triton’s clever tiling JIT strategy, all in one nerdy, slightly dorky package. ...
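As a taste of the tiling style Triton encourages, here is the canonical vector-add tutorial kernel (not code from the post): each program instance owns one BLOCK_SIZE tile and masks the ragged tail.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # One "program" (roughly a CUDA thread block) handles one contiguous tile.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements          # guard the last, possibly partial tile
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)   # one program per 1024-element tile
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
```

The tile size is a compile-time constant, so Triton's JIT can specialize loads, stores, and vectorization for it, which is the core of its tiling strategy.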

October 6, 2025 · 5 min · 971 words · Me

NOTES: VMing the Containers - The Latency-Availability Tradeoff

Before we dive into containers vs VMs, let’s talk about what virtualization actually is. If you’re coming from an OOP background, think of virtualization as the ultimate facade pattern: you’re presenting a clean interface (a “virtual machine”) that hides the messy reality underneath (the actual hardware and hypervisor). There are three main levels of virtualization, each with different tradeoffs:

- Trap-and-emulate: The classic approach where privileged instructions from the guest OS “trap” into the hypervisor, which then emulates them. This is what early VMware and QEMU did in software mode. ...
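As a rough illustration of the trap-and-emulate control flow, here is a deliberately tiny toy model in Python (nothing like a real hypervisor; all names are made up for the sketch):

```python
# Toy model of trap-and-emulate: unprivileged guest instructions "run" directly,
# while privileged ones trap into a hypervisor object that emulates them against
# virtualized device state.
PRIVILEGED = {"outb", "hlt"}          # pretend these are privileged instructions


class ToyHypervisor:
    def __init__(self):
        self.virtual_serial = []      # virtualized device state, not real hardware

    def emulate(self, instr, operand):
        """Handle a trapped privileged instruction on the guest's behalf."""
        if instr == "outb":
            self.virtual_serial.append(operand)   # emulate an I/O port write
            return "resume"
        if instr == "hlt":
            return "halt"
        return "resume"


def run_guest(program, hv):
    for instr, operand in program:
        if instr in PRIVILEGED:
            # On real hardware the CPU raises a trap here; control moves to the hypervisor.
            if hv.emulate(instr, operand) == "halt":
                break
        else:
            pass  # unprivileged instruction: would execute directly on the hardware


hv = ToyHypervisor()
run_guest([("mov", None), ("outb", "h"), ("outb", "i"), ("hlt", None)], hv)
print(hv.virtual_serial)              # ['h', 'i'] captured by the virtual device
```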

October 5, 2025 · 6 min · 1198 words · Me

NOTES: TensorFlow, TorchDynamo, and TorchInductor

If TensorFlow was about making large-scale distributed ML programmable, PyTorch 2.0 is about making dynamic Python code compilable without losing flexibility. The TensorFlow paper walked through distributed dataflow graphs, parameter servers, checkpointing, and execution placement. In contrast, the PyTorch 2.0 paper focused on TorchDynamo and TorchInductor, showing how dynamic graphs can be captured and compiled into fused kernels that run nearly as fast as handwritten CUDA. Below are my notes on the history, design, and impact of these systems. ...
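For reference, this is roughly what the user-facing side looks like: TorchDynamo captures the Python-level graph and TorchInductor generates the fused kernels behind a single torch.compile call (the toy model below is mine, not from the paper):

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(256, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)

# TorchDynamo intercepts the Python bytecode to capture the graph;
# TorchInductor compiles it into fused kernels (Triton on GPU, C++/OpenMP on CPU).
compiled = torch.compile(model, backend="inductor")

x = torch.randn(32, 256)
out = compiled(x)   # first call triggers capture + compilation; later calls reuse the cached code
```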

September 30, 2025 · 3 min · 559 words · Me