Summary:
This diff periodically (e.g., every 30s) logs critical collective progress
status to a Scuba table, starting with a few metrics such as the last
enqueued seq ID.
With the Scuba table, we hope to easily detect the straggler of a PG,
e.g., a rank whose seq_ has not progressed for X seconds while other ranks in the same PG have a larger seq_.
The implementation makes sure that Scuba is used only for FB-internal use
cases.
For OSS, we still provide a generic logger data struct and a logger that can be
easily extended. If users do not register the logger, nothing will be logged.
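For illustration, a minimal sketch of the kind of extension point described above; the type and function names here are hypothetical, not the actual c10d API:

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class WorkProgressData:
    """Generic progress record (illustrative fields only)."""
    pg_name: str = ""
    rank: int = -1
    last_enqueued_seq: int = -1
    extra: Dict[str, str] = field(default_factory=dict)

class WorkProgressLogger:
    """Base class users could extend with their own backend (e.g., Scuba internally)."""
    def log(self, data: WorkProgressData) -> None:
        raise NotImplementedError

_registered_logger: Optional[WorkProgressLogger] = None

def register_work_progress_logger(logger: WorkProgressLogger) -> None:
    global _registered_logger
    _registered_logger = logger

def maybe_log(data: WorkProgressData) -> None:
    # If no logger has been registered, this is a no-op: nothing is logged.
    if _registered_logger is not None:
        _registered_logger.log(data)
```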
Test Plan:
Re-used the existing unit tests for the FB side of operations, such as
test_register_and_dump in test_c10d_manifold, and changed the dump period to a
very small number, e.g., 1ms. Verified that the logs are correctly shown in the Scuba table:
https://fburl.com/scuba/c10d_work_update/9trhwnmy
Reviewed By: wconstab
Differential Revision: D54556219
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121859
Approved by: https://github.com/wconstab
Headers under torch/csrc/distributed may be referenced with a relative path, e.g., "<c10d/...>". However, relative paths cannot be handled gracefully by the Meta-internal build when the NCCL PG is hipified to support AMD/RCCL, because the "hipified" header files are generated in other directories. Moreover, using absolute paths for header inclusion is already the convention in most PyTorch components. Thus, this patch refactors all header paths in torch/csrc/distributed to be absolute.
See D39835774 for more details about the Meta-internal complications.
**How to test**: commit 9e5d199 removes -I./torch/csrc/distributed from the compile options; use it to verify that we don't miss any relative-path uses of torch/csrc/distributed headers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85780
Approved by: https://github.com/kumpera, https://github.com/huydhn
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73731
During bucket rebuilding, in addition to syncing bucket_indices, per_bucket_limits would also need to be synced before calling initialize_buckets(). Syncing per_bucket_limits would increase communication volume as well as code complexity. After taking a closer look at the code, the per_bucket_limits used inside initialize_buckets() is actually not useful: it only assigns the bucket_size_limit property of the bucket struct, and that property is not used anywhere. So it is good to remove this property and avoid syncing per_bucket_limits.
Differential Revision: [D34605513](https://our.internmc.facebook.com/intern/diff/D34605513/)
Approved by: https://github.com/rohan-varma
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72456
It is easier to log whether static graph is set at construction time now that it is natively supported in the DDP constructor, as opposed to waiting for the first iteration to finish. In some failure cases we're seeing, the first iteration does not finish, and so we don't have this data, which is valuable for debugging.
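A minimal, single-process sketch of the construction-time usage this refers to (the gloo backend, world size 1, and the toy model are only for illustration):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process setup purely for illustration.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(8, 8)
# static_graph is passed at construction, so it can be logged immediately
# instead of only after the first iteration completes.
ddp_model = DDP(model, static_graph=True)

dist.destroy_process_group()
```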
ghstack-source-id: 148840679
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D34045204
fbshipit-source-id: 72a187c1ce031db217de4b3ad20a64f2a74995bc
(cherry picked from commit 1d622c88f3)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67059
While debugging some workflows, sometimes the training does not finish, but I
still want to know whether or not the graph was static. Also, log 0 for the
unused parameter size if no unused params were found.
ghstack-source-id: 141428950
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D31846669
fbshipit-source-id: 21763fcdc1b244ba829117da1f15b2271d966983
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66680
Closes https://github.com/pytorch/pytorch/issues/66215. Tracks models
with sync BN so we can find workflows that use them and target them for perf
optimization.
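For reference, a minimal sketch of the pattern this logging is meant to detect (module names and shapes are arbitrary):

```python
import torch

# A model whose BatchNorm layers are converted to SyncBatchNorm before DDP
# wrapping; this is the kind of workload the new logging field identifies.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3),
    torch.nn.BatchNorm2d(8),
)
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

# In a real job the converted model would then be moved to a GPU and wrapped, e.g.:
#   ddp_model = DistributedDataParallel(model.cuda(rank), device_ids=[rank])
# (SyncBatchNorm layers require GPU modules, typically with the NCCL backend.)
print(model)
```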
ghstack-source-id: 140875182
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D31679477
fbshipit-source-id: 0e68cd1a7aabbc5b26227895c53d33b8e98bfb8e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65772
Looking at some workloads, it would be useful to have this info.
ghstack-source-id: 140555200
Test Plan: CI
Reviewed By: zhaojuanmao, wayi1
Differential Revision: D31224417
fbshipit-source-id: 14eeb053aced87c7ca43b6879f81f54bd0a42b76
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62751
This will help us determine whether the gradient ready order and bucket
indices are aligned across all ranks. This should always be true for rank 0,
since we determine the rebuilt bucket order from the gradient ready order on rank 0,
but it would be interesting to see this on different workloads for the other ranks.
ghstack-source-id: 135104369
Test Plan: CI
Reviewed By: SciPioneer, wanchaol
Differential Revision: D30111833
fbshipit-source-id: a0ab38413a45022d953da76384800bee53cbcf9f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62232
Logs the bucket sizes in DDP logging so that we know which workflow ran with what bucket size config. This will be used to verify how changing bucket sizes in DDP affects perf.
Based on the test, we can see an inconsistency in where the "first" bucket actually is (it is the last bucket before buckets are rebuilt, and the first one after).
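A minimal, single-process sketch of how the bucket size config is set and how the logged data can be inspected; _get_ddp_logging_data() is a private helper and the exact field names (e.g., "bucket_sizes") may vary across PyTorch versions:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process setup purely for illustration.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.Linear(1024, 1024),
)
# bucket_cap_mb controls the target gradient-bucket size; the logged bucket
# sizes record what a given workflow actually ran with.
ddp_model = DDP(model, bucket_cap_mb=10)

# Dump the collected DDP logging fields (keys depend on the PyTorch version).
logging_data = ddp_model._get_ddp_logging_data()
print({k: v for k, v in logging_data.items() if "bucket" in k})

dist.destroy_process_group()
```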
ghstack-source-id: 134663867
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D29922299
fbshipit-source-id: 538b331c96e77048164ad130b377433be100a761
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61871
When set_static_graph=False, the only type of dynamism we really
support in DDP is a dynamic set of unused parameters, which must be explicitly
enabled with find_unused_parameters=True. However, some workflows have a static
set of unused parameters, so it would be good to detect this and add it to the logging to
identify workflows that are candidates for the static graph optimization.
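A minimal, single-process sketch of such a workload: the same parameter is unused on every iteration, so the unused set is static even though find_unused_parameters=True is required (backend, world size, and the toy module are only for illustration):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process setup purely for illustration.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29502")
dist.init_process_group("gloo", rank=0, world_size=1)

class PartiallyUsed(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.used = torch.nn.Linear(4, 4)
        self.unused = torch.nn.Linear(4, 4)  # never touched in forward()

    def forward(self, x):
        return self.used(x)

# find_unused_parameters=True is required because self.unused never receives a
# gradient, yet the unused set is identical every iteration (i.e., static), so
# this model is a candidate for the static graph optimization.
ddp_model = DDP(PartiallyUsed(), find_unused_parameters=True)
ddp_model(torch.randn(2, 4)).sum().backward()

dist.destroy_process_group()
```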
ghstack-source-id: 134371429
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D29773962
fbshipit-source-id: 1f741984c6e6f8e3e55cf69ca719b1e25a485b13
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60543
Now that c10d is part of libtorch, it would also be nice if the sources all lived in one place.
ghstack-source-id: 132306292
Test Plan: It builds
Reviewed By: cbalioglu
Differential Revision: D29062002
fbshipit-source-id: d9e1301e9d73e1643fa0f0119cd2d618f1ad52e6