pytorch/docs at 9bce208dfbdb71e38f9e9ee38a07d43645ffb82a - pytorch - Carlos Sousa's Git

OSSForks/pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-06 12:20:52 +01:00


Lucas Pasqualin ff8e33556e Enables load balancing duplicates in DCP (#116469)

Enables the deduplication of saved entries by load balancing duplicates across ranks.

Tested with existing and modified tests. Additionally tested with the following code snippet, which saves a 20GB DDP model in ~3 seconds on 8 ranks. Before this PR, the same operation has been measured at ~19 seconds.

```
def run(local_rank, world_size, param_size, num_params, work_dir):
    os.environ["RANK"] = str(local_rank)
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    device = torch.device(f"cuda:{local_rank}")
    torch.cuda.set_device(device)
    dist.init_process_group(backend="nccl", rank=local_rank, world_size=world_size)

    model = Model(param_size=param_size, num_params=num_params)
    model = DistributedDataParallel(model, gradient_as_bucket_view=True)
    _patch_model_state_dict(model)

    sz = sum(t.nelement() * t.element_size() for t in model.parameters())
    rank_0_print(f"Model size: {sz / 1_000_000_000.0} GB")
    rank_0_print("Saving the model with DCP...")

    checkpointer = _FileSystemCheckpointer(
        f"{args.work_dir}/dcp",
        sync_files=False,
        single_file_per_rank=False,
        thread_count=1
    )

    begin_ts = time.monotonic()
    checkpointer.save(state_dict={"model": model})
    end_ts = time.monotonic()
    rank_0_print(f"Took {end_ts - begin_ts} seconds with DCP")
```

Differential Revision: [D52435926](https://our.internmc.facebook.com/intern/diff/D52435926/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116469
Approved by: https://github.com/fegin, https://github.com/wz337
2024-01-26 22:34:14 +00:00
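The commit message describes the idea only briefly: when several ranks hold the same entry (as with DDP, where every rank has a full copy of the model), each duplicate is written by a single rank, and the duplicates are spread across ranks rather than all landing on one. The following is a minimal sketch of that idea, not the DCP implementation. The function name `balance_duplicates`, the plan format (one `{key: size_in_bytes}` dict per rank), and the greedy largest-first strategy are all assumptions made for illustration.

```python
from collections import defaultdict


def balance_duplicates(plans):
    """Assign every key to exactly one of the ranks that holds it.

    `plans` is a hypothetical format: a list indexed by rank, where each
    item maps an entry key to its size in bytes. This is an illustrative
    greedy heuristic, not the algorithm used by torch.distributed.checkpoint.
    """
    holders = defaultdict(list)  # key -> ranks that could write it
    sizes = {}                   # key -> size in bytes
    for rank, plan in enumerate(plans):
        for key, size in plan.items():
            holders[key].append(rank)
            sizes[key] = size

    load = [0] * len(plans)               # bytes assigned to each rank so far
    assigned = [dict() for _ in plans]    # deduplicated plan per rank

    # Place the largest entries first; break ties by key so that every
    # rank would compute the same assignment.
    for key in sorted(holders, key=lambda k: (-sizes[k], k)):
        # Among the ranks holding this key, pick the least loaded one.
        rank = min(holders[key], key=lambda r: (load[r], r))
        assigned[rank][key] = sizes[key]
        load[rank] += sizes[key]
    return assigned


if __name__ == "__main__":
    # Four ranks, each holding the same eight equally sized parameters.
    plans = [{f"p{i}": 10 for i in range(8)} for _ in range(4)]
    for rank, plan in enumerate(balance_duplicates(plans)):
        print(rank, sorted(plan))
```

In this sketch each rank ends up writing a quarter of the bytes. Writing every duplicate from one rank would remove the duplicates but leave that rank doing all of the I/O.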
caffe2	Enable UFMT on a bunch of low traffic Python files outside of main files (#106052)	2023-07-27 01:01:17 +00:00
cpp	[Doc] Fix typo in cpp/installing when wheel is used (#111143)	2023-10-13 18:56:27 +00:00
source	Enables load balancing duplicates in DCP (#116469)	2024-01-26 22:34:14 +00:00
.gitignore
libtorch.rst	Replace master with main in links and docs/conf.py (#100176)	2023-05-02 18:20:32 +00:00
make.bat
Makefile	Refactor torch.onnx documentation (#108379)	2023-09-08 18:23:48 +00:00
README.md	Add docs/README.md to make existing doc build info more discoverable (#49286)	2020-12-16 11:55:45 -08:00
requirements.txt	update requirements.txt in /docs (#101092)	2023-05-12 03:19:36 +00:00

README.md

Please see the Writing documentation section of CONTRIBUTING.md for details on both writing and building the docs.