Aaron Orenstein
|
316808e4e9
|
PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163)
See #145101 for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145163
Approved by: https://github.com/Skylion007
|
2025-01-19 20:55:59 +00:00 |
|
Xuehai Pan
|
e6d4451ae8
|
[BE][Easy] enable UFMT for torch/distributed/{algorithms,autograd,benchmarks,checkpoint,elastic}/ (#128866)
Part of #123062
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128866
Approved by: https://github.com/fegin
|
2024-06-18 13:51:53 +00:00 |
|
Tristan Rice
|
597922ba21
|
Reapply "distributed debug handlers (#126601)" (#127805)
This reverts commit 7646825c3e.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127805
Approved by: https://github.com/PaliC
|
2024-06-04 19:44:30 +00:00 |
|
PyTorch MergeBot
|
7646825c3e
|
Revert "distributed debug handlers (#126601)"
This reverts commit 3d541835d5.
Reverted https://github.com/pytorch/pytorch/pull/126601 on behalf of https://github.com/PaliC due to breaking internal typechecking tests ([comment](https://github.com/pytorch/pytorch/pull/126601#issuecomment-2141076987))
|
2024-05-31 01:21:24 +00:00 |
|
Tristan Rice
|
3d541835d5
|
distributed debug handlers (#126601)
This adds debug handlers as described in:
* https://gist.github.com/d4l3k/828b7be585c7615e85b2c448b308d925 (public copy)
* https://docs.google.com/document/d/1la68szcS6wUYElUUX-P6zXgkPA8lnfzpagMTPys3aQ8/edit (internal copy)
This adds only the C++ pieces that will be used from the main process. The Python and torchrun pieces will be added in a follow-up PR.
This adds two handlers out of the box:
* `/handler/ping` for testing purposes
* `/handler/dump_nccl_trace_pickle` as a POC integration with Flight Recorder
Test plan:
```
python test/distributed/elastic/test_control_plane.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126601
Approved by: https://github.com/kurman, https://github.com/c-p-i-o
|
2024-05-30 02:21:08 +00:00 |
|