1. We take option 3 as discussed in https://github.com/pytorch/pytorch/issues/135712: on every retry we first create a new TCPStore server, so we no longer need to append the attempt count as a prefix and we avoid eventual TCPStore sync failures. (This applies only when TCPStore sharing is enabled.)
2. We start the new server bound to an ephemeral port (i.e. 0) so it gets assigned a free port, and we pass that port downstream (to the trainer or c10d). This way the TCPStore is managed by the elastic agent instead of racing to bind a specific port in the trainer; see the sketch after this list.
3. The port is then broadcast for dynamic_rendezvous.
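A minimal sketch of point 2, assuming a PyTorch build where `TCPStore` exposes the bound port; the host name and keyword arguments here are illustrative, not the agent's verbatim code:

```python
from torch.distributed import TCPStore

# Bind the server to port 0 so the OS assigns a free ephemeral port.
# The agent owns this store and passes the assigned port downstream
# (trainer or c10d) instead of racing to bind a fixed port there.
server_store = TCPStore("localhost", 0, is_master=True, wait_for_workers=False)

assigned_port = server_store.port  # assumes the `port` accessor is available
print(f"agent-owned TCPStore is listening on port {assigned_port}")
```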
One remaining question: what do we do about the store created by `_create_tcp_store` in torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py, are we OK with creating a duplicate TCPStore server?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135957
Approved by: https://github.com/d4l3k, https://github.com/c-p-i-o
Summary:
There was a regression introduced in https://github.com/pytorch/pytorch/pull/125743 that caused `local_addr` to no longer be used. This fixes it by passing `local_addr` to `RendezvousStoreInfo.build` everywhere it's used.
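A hypothetical sketch of the restored pattern; the `_finalize_rendezvous` helper and the exact `RendezvousStoreInfo.build` signature are assumptions for illustration and may differ from the actual code:

```python
from typing import Optional

from torch.distributed import Store
from torch.distributed.elastic.rendezvous.api import RendezvousStoreInfo

def _finalize_rendezvous(rank: int, store: Store, local_addr: Optional[str]) -> RendezvousStoreInfo:
    # local_addr comes from the user-facing rendezvous config; the regression
    # meant some handlers dropped it and fell back to an auto-detected address.
    # The keyword form of `build` is assumed here.
    return RendezvousStoreInfo.build(rank, store, local_addr=local_addr)
```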
This also fixes a number of tests, allowing them to be run in parallel, which hugely sped up the testing cycle since this change touches many different rendezvous implementations. That required a few fixes in unrelated tests.
Test Plan:
Added tests for the common rendezvous implementations checking that `local_addr` is respected, to prevent future regressions.
```
buck2 test @//mode/dev-nosan fbcode//caffe2/test/distributed/elastic/... fbcode//caffe2/torch/distributed/elastic/... -- --stress-runs 3
```
To vet the parallelism changes, I also ran the tests with 3 stress runs each to identify flakiness caused by running in parallel.
Differential Revision: D62256407
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135262
Approved by: https://github.com/fduwjj, https://github.com/wz337
Summary:
1. Define an explicit `use_agent_store` flag on rdzv handlers. Handlers that set it to true can share the store.
2. Instead of the agent coordinating master_addr/master_port values, the logic is now encapsulated by the *rdzv_handler*: `RendezvousInfo` carries a `RendezvousStoreInfo` object that handlers must return.
- Depending on the implementation, they can either:
  - point to an existing store (and are expected to set `use_agent_store` to true, per point 1). Client code relies on the `TORCHELASTIC_USE_AGENT_STORE` env variable to know whether the store is shared.
  - provide args that `torch.distributed.init_process_group` can use to bootstrap a new store.
Additional points:
- When the TCPStore is shared, it should be wrapped in a PrefixStore to qualify/scope the namespace for other use cases (see the sketch after these points).
- The `next_rendezvous` signature changed to return an instance of `RendezvousInfo` instead of a (store, rank, world_size) tuple, for extensibility.
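For illustration, a trainer-side sketch of the sharing contract described above; the helper name and the exact fallback behavior are assumptions, not the verbatim library code:

```python
import os
from datetime import timedelta
from torch.distributed import PrefixStore, TCPStore

def _bootstrap_store(master_addr: str, master_port: int, world_size: int, rank: int):
    # The agent sets TORCHELASTIC_USE_AGENT_STORE when its TCPStore is shared.
    shared = os.environ.get("TORCHELASTIC_USE_AGENT_STORE", "False") == "True"
    # When shared, every rank attaches as a client; otherwise rank 0 hosts a
    # fresh server that the other ranks connect to.
    store = TCPStore(
        master_addr,
        master_port,
        world_size,
        is_master=(not shared and rank == 0),
        timeout=timedelta(seconds=60),
    )
    # Wrap in a PrefixStore so this use case gets its own key namespace.
    return PrefixStore("trainer/", store)
```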
Why:
- Reduce moving parts
- easier to swap implementations
- improve tractability
- addressing perf/debuggability will benefit all use cases
Test Plan: CI
Differential Revision: D57055235
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125743
Approved by: https://github.com/d4l3k
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58160
This PR updates the Torch Distributed Elastic documentation with references to the new `c10d` backend.
ghstack-source-id: 128783809
Test Plan: Visually verified the correct rendering of the updated documentation.
Reviewed By: tierex
Differential Revision: D28384996
fbshipit-source-id: a40b0c37989ce67963322565368403e2be5d2592
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57723
Updated the note section of `RendezvousHandler`:
- Removed the experimental API warning.
- Recommended using the C10d Store instead of etcd for most users (see the illustrative launch sketch below).
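For context, a minimal launch sketch using the recommended C10d-backed rendezvous instead of etcd; the endpoint, node counts, and run id are placeholder values, and the entrypoint is a stand-in:

```python
from torch.distributed.launcher.api import LaunchConfig, elastic_launch

def train():
    # placeholder worker entrypoint
    print("worker started")

config = LaunchConfig(
    min_nodes=1,
    max_nodes=1,
    nproc_per_node=2,
    run_id="example_job",
    rdzv_backend="c10d",              # recommended over "etcd" for most users
    rdzv_endpoint="localhost:29400",  # host:port where the C10d store lives
)

if __name__ == "__main__":
    elastic_launch(config, train)()
```

This mirrors an equivalent `torchrun` invocation with `--rdzv-backend=c10d` and `--rdzv-endpoint`.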
Test Plan: N/A
Reviewed By: kiukchung
Differential Revision: D28253828
fbshipit-source-id: c4f34dffd1a3cc132977029fe449b6d63ddc877b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55635
This diff introduces the `DynamicRendezvousHandler` type as a stub implementation and its accompanying `RendezvousBackend` interface.
`DynamicRendezvousHandler` is intended to be a backend-agnostic type that will contain the core (bulk) logic of rendezvous handling. Any backend-specific operation will be delegated to a concrete subclass of `RendezvousBackend` (e.g. `C10dRendezvousBackend` - see D27654492) that is passed as a constructor argument to `DynamicRendezvousHandler`.
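As a rough sketch of the eventual API (this diff only introduces the stub, so details may differ), a handler would be composed with a backend roughly like this:

```python
from torch.distributed import TCPStore
from torch.distributed.elastic.rendezvous.c10d_rendezvous_backend import C10dRendezvousBackend
from torch.distributed.elastic.rendezvous.dynamic_rendezvous import DynamicRendezvousHandler

# The store both hosts the C10d backend state and is handed to the handler.
store = TCPStore("localhost", 29400, is_master=True, wait_for_workers=False)
backend = C10dRendezvousBackend(store, "my_run_id")

# DynamicRendezvousHandler holds the backend-agnostic core logic and delegates
# state reads/writes to the RendezvousBackend implementation.
handler = DynamicRendezvousHandler.from_backend(
    run_id="my_run_id",
    store=store,
    backend=backend,
    min_nodes=1,
    max_nodes=4,
)
```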
ghstack-source-id: 126304697
Test Plan: Run the existing and newly-introduced unit/integration tests.
Reviewed By: tierex
Differential Revision: D27654478
fbshipit-source-id: 9fc89a6e4cb308971c65b29a7c5af7ae191f70c5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55466
Improve the implementation and the unit test coverage of `RendezvousHandlerRegistry`.
### Note
See the original diff (D27442325 (df299dbd7d)) that had to be reverted due to an unexpected Python version incompatibility between the internal and external PyTorch CI tests.
Test Plan: Run the existing and newly-introduced unit tests.
Reviewed By: tierex
Differential Revision: D27623215
fbshipit-source-id: 51538d0f154f64e04f685a95d40d805b478c93f9
Summary: Improve the implementation and the unit test coverage of `RendezvousHandlerRegistry`.
Test Plan: Run the existing and newly-introduced unit tests.
Reviewed By: tierex
Differential Revision: D27442325
fbshipit-source-id: 8519a2caacbe2e3ce5d9a02e87a910503dea27d7
Summary:
Pull Request resolved: https://github.com/pytorch/elastic/pull/146
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54807
Improve the implementation and the unit test coverage of `RendezvousParameters`.
Test Plan: Run the existing and newly-introduced unit tests.
Reviewed By: kiukchung
Differential Revision: D27342444
fbshipit-source-id: 88de356c0a799844a739eb9105185bb8c1acf11f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54803
Revise the rendezvous exception types to align their naming convention more closely with the standard Python exception types.
Test Plan: Run the existing test suite.
Reviewed By: H-Huang
Differential Revision: D27327505
fbshipit-source-id: 862c59222f9ca61a0e5afde89ae8f226090b4f92
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53172
Pull Request resolved: https://github.com/pytorch/elastic/pull/141
Upstreams two modules to torch:
1. `torchelastic.rendezvous`
2. `torchelastic.utils`
These modules were chosen as `[1/n]` since they are the leaf modules in torchelastic.
==== NOTES: ====
1. I'm disabling the etcd_rendezvous and etcd_server tests in CircleCI for the moment, since I need to edit the test dockers to contain the etcd server binary (there are 4-5 test dockers, one for each platform, so it is going to take some time to set up the environments and test) - T85992919.
2. I've fixed all lint errors in the Python files, but there are remaining ones in the cpp files of ZeusRendezvous. I took a look at them, and I don't want to fix the linter errors right now for 2 major reasons:
  1. Some of them are more than formatting changes (e.g. std::move vs pass by value) and I don't want to introduce bundled changes with the move.
  2. The old rendezvous code (the one we forked from in caffe2/fb) has the same problems, and I think it's better for us to deal with this when we deprecate caffe2/fb/rendezvous in favor of the one in torchelastic - T86012579.
Test Plan:
```
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/utils/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/utils/data/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/rendezvous/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/rendezvous/fb/...
buck test mode/dev-nosan //pytorch/elastic/torchelastic/...
```
+ Sandcastle
Reviewed By: H-Huang
Differential Revision: D26718746
fbshipit-source-id: 67cc0350c3d847221cb3c3038f98f47915362f51