pytorch/docs
Lucas Pasqualin ff8e33556e Enables load balancing duplicates in DCP (#116469)
Enables the deduplication of saved entries by load balancing duplicates across ranks.

Tested with existing and modified tests. Additionally tested with the following code snippet, which saves a 20GB DDP model in **~3 seconds on 8 ranks**.  Before this PR, the same operation has been measured at ~19 seconds.

```
def run(local_rank, world_size, param_size, num_params, work_dir):

    os.environ["RANK"] = str(local_rank)
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    device = torch.device(f"cuda:{local_rank}")
    torch.cuda.set_device(device)
    dist.init_process_group(backend="nccl", rank=local_rank, world_size=world_size)

    model = Model(param_size=param_size, num_params=num_params)
    model = DistributedDataParallel(model, gradient_as_bucket_view=True)
    _patch_model_state_dict(model)

    sz = sum(t.nelement() * t.element_size() for t in model.parameters())
    rank_0_print(f"Model size: {sz / 1_000_000_000.0} GB")
    rank_0_print("Saving the model with DCP...")

    checkpointer = _FileSystemCheckpointer(
        f"{args.work_dir}/dcp",
        sync_files=False,
        single_file_per_rank=False,
        thread_count=1
    )

    begin_ts = time.monotonic()
    checkpointer.save(state_dict={"model": model})
    end_ts = time.monotonic()
    rank_0_print(f"Took {end_ts - begin_ts} seconds with DCP")
```

Differential Revision: [D52435926](https://our.internmc.facebook.com/intern/diff/D52435926/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116469
Approved by: https://github.com/fegin, https://github.com/wz337
2024-01-26 22:34:14 +00:00
..
caffe2 Enable UFMT on a bunch of low traffic Python files outside of main files (#106052) 2023-07-27 01:01:17 +00:00
cpp [Doc] Fix typo in cpp/installing when wheel is used (#111143) 2023-10-13 18:56:27 +00:00
source Enables load balancing duplicates in DCP (#116469) 2024-01-26 22:34:14 +00:00
.gitignore
libtorch.rst Replace master with main in links and docs/conf.py (#100176) 2023-05-02 18:20:32 +00:00
make.bat
Makefile Refactor torch.onnx documentation (#108379) 2023-09-08 18:23:48 +00:00
README.md Add docs/README.md to make existing doc build info more discoverable (#49286) 2020-12-16 11:55:45 -08:00
requirements.txt update requirements.txt in /docs (#101092) 2023-05-12 03:19:36 +00:00

Please see the Writing documentation section of CONTRIBUTING.md for details on both writing and building the docs.