Commit Graph

26 Commits

Author SHA1 Message Date
Max Wang
c5845c4482 Add support for reduce-scatter in c10d (#18844)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18844
ghimport-source-id: c6b2f0032c7c2212be2000a9c1f262f63d878a97

Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18844 Add support for reduce-scatter in c10d**
* #18820 Refactor ProcessGroupNCCL collective primitives
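
A minimal usage sketch of the collective this adds, assuming an initialized job and that the frontend exposes it as `dist.reduce_scatter` (its name in current torch.distributed):

```
import torch
import torch.distributed as dist

# Each rank contributes one input chunk per rank; rank i receives the
# elementwise sum of chunk i across all ranks.
world_size = dist.get_world_size()
rank = dist.get_rank()
inputs = [torch.full((4,), float(rank)) for _ in range(world_size)]
output = torch.empty(4)
dist.reduce_scatter(output, inputs, op=dist.ReduceOp.SUM)
```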

Reviewed By: mrshenli

Differential Revision: D14768369

fbshipit-source-id: a9def7a0da6e9cd995e982371cc1e22f3df1a156
2019-04-26 13:46:57 -07:00
Kutta Srinivasan
b7323a94ad Cleanup init_process_group (#19033)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19033

torch.distributed.init_process_group() has had many parameters added over time, but the contract isn't clear. Adding documentation, asserts, and explicit arguments should make the contract clearer to callers and more strictly enforced.
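
A hedged illustration of the clarified contract (single-process world for brevity; the address is a placeholder):

```
import torch.distributed as dist

# Name the backend explicitly and pass rank/world_size as explicit
# arguments (or via RANK/WORLD_SIZE env vars with init_method="env://").
dist.init_process_group(
    backend="gloo",
    init_method="tcp://127.0.0.1:23456",
    rank=0,
    world_size=1,
)
assert dist.is_initialized()
```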

Reviewed By: mrshenli

Differential Revision: D14813070

fbshipit-source-id: 80e4e7123087745bed436eb390887db9d1876042
2019-04-18 09:37:38 -07:00
Pieter Noordhuis
ce166d949d ProcessGroupMPI exists only if it is valid (#14809)
Summary:
Previously, MPI process groups were created for all processes, even if
they were not part of the created group. Their MPI_Comm member field
would be MPI_COMM_NULL and they would ignore any calls. Their rank and
size were identical to those of the global process group, and they had
special groupRank and groupSize fields to capture the _real_ rank and size.

This also meant asymmetry with other process group types, where creating
a new group would either return the process group OR
GroupMember.NON_GROUP_MEMBER. For the MPI process group, it would always
return a process group, and an additional check was needed to verify
whether or not a process was indeed part of a process group.

This commit changes this such that every MPI process group is a valid
process group, and by extension that we no longer have to special case
MPI to determine whether or not a process is part of a group. Now, if
the value returned by `new_group` is GroupMember.NON_GROUP_MEMBER, the
process is not a member, otherwise it is.
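
A sketch of the now-uniform membership check (assumes an initialized global group with at least two ranks):

```
import torch
import torch.distributed as dist

group = dist.new_group(ranks=[0, 1])
if group == dist.GroupMember.NON_GROUP_MEMBER:
    pass  # this process is not in the group; skip its collectives
else:
    dist.all_reduce(torch.ones(1), group=group)
```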
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14809

Differential Revision: D14887937

Pulled By: pietern

fbshipit-source-id: c5bf86d3b33e524cc5004ee68e30103178fa491d
2019-04-10 21:36:35 -07:00
Shen Li
8f9b11cf33 Propagate ProcessGroup timeout to Store (#16571)
Summary:
closes #16520

Hi pietern, I am not sure if this is the expected way to pass the timeout to `Store`; could you please take a look? Thanks!

Questions:
1. How do I write tests for this? I wanted to do something like `test_barrier_timeout_global`, but it seems I need to set the process group's timeout to be larger than the `Store`'s default timeout (3 minutes) to see a difference, which is too long for a unit test. I do not want to change the `Store`'s default timeout either. Any suggestions?
2. Should I also propagate timeout configuration down to `PrefixStore` in `_new_process_group_helper`?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16571

Differential Revision: D13954527

Pulled By: mrshenli

fbshipit-source-id: 77f2653903f24255207233eb298f7c0321119a87
2019-04-09 12:36:28 -07:00
Pieter Noordhuis
7a19d3c9e1 Allow override of backend in dist.new_group() (#18595)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18595

There is no need to force the backend to be the same as the global
process group, as long as the backend is "nccl" or "gloo".
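
For example, a hedged sketch (assumes an initialized NCCL job containing ranks 0 and 1):

```
import torch
import torch.distributed as dist

# The subgroup may use Gloo even though the global group is NCCL,
# e.g. for occasional CPU-tensor collectives.
cpu_group = dist.new_group(ranks=[0, 1], backend="gloo")
if cpu_group != dist.GroupMember.NON_GROUP_MEMBER:
    dist.all_reduce(torch.ones(2), group=cpu_group)
```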

Reviewed By: mrshenli

Differential Revision: D14657204

fbshipit-source-id: 868817b9f219e3be8db0761a487f0027ed46663b
2019-04-04 14:23:03 -07:00
Shen Li
c0ad6747a9 Highlight NCCL all_reduce and all_gather requirements (#18741)
Summary:
See #18689
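
A hedged sketch of the requirement being highlighted (assumes an initialized NCCL job with one GPU per process):

```
import torch
import torch.distributed as dist

# With the NCCL backend, each process should drive exactly one GPU, and
# every tensor passed to all_reduce/all_gather must live on that device.
rank = dist.get_rank()
torch.cuda.set_device(rank)
tensor = torch.ones(4, device="cuda")
dist.all_reduce(tensor)
gathered = [torch.empty(4, device="cuda") for _ in range(dist.get_world_size())]
dist.all_gather(gathered, tensor)
```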
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18741

Differential Revision: D14726874

Pulled By: mrshenli

fbshipit-source-id: a92404c653e3c62fc23fa3ccacfb3b2959b2e307
2019-04-03 09:50:29 -07:00
Igor Fedan
36237c4893 Fix flake8 issues in gradgrad test
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/18727

Differential Revision: D14724887

Pulled By: ifedan

fbshipit-source-id: 8c1db6460303e746e4aea0142302b8d61277c067
2019-04-02 12:45:18 -07:00
Pieter Noordhuis
bdfdf6c2b9 C++ handler for gradient reduction (#18251)
Summary:
This commit adds the `c10d::Reducer` class that hooks into autograd
and performs gradient bucketing and reduction. These are the core
parts of `nn.parallel.DistributedDataParallel` that up to now were
only usable for CUDA models.

This should enable the following:

* Distributed data parallelism for models defined using the C++ frontend.
* Allow overlap of gradient computation and reduction for non-CUDA models.
* Enable distributed data parallelism for models with some unused parameters.

This does not include any logic for computing bucket assignment, which
can be done separately; either by observing autograd execution order
(this is what Apex does), or by assigning buckets based on some
maximum byte size, or both.

Also see #17757 and #13273.
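
A sketch of what this enables at the Python level, assuming a Gloo (CPU) process group (single-process world and placeholder address for brevity):

```
import torch
import torch.nn as nn
import torch.distributed as dist

dist.init_process_group(backend="gloo", init_method="tcp://127.0.0.1:23456",
                        rank=0, world_size=1)
model = nn.parallel.DistributedDataParallel(nn.Linear(8, 2))
loss = model(torch.randn(4, 8)).sum()
loss.backward()  # gradients are bucketed and reduced by c10d::Reducer
```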
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18251

Reviewed By: mrshenli

Differential Revision: D14571899

Pulled By: pietern

fbshipit-source-id: 20f95eefd288dfe8cfffe0a28ca22fa7c9c3cd4c
2019-04-01 14:30:02 -07:00
Edward Yang
173f224570 Turn on F401: Unused import warning. (#18598)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18598
ghimport-source-id: c74597e5e7437e94a43c163cee0639b20d0d0c6a

Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18598 Turn on F401: Unused import warning.**

This was requested by someone at Facebook; this lint is turned
on for Facebook by default.  "Sure, why not."

I had to noqa a number of imports in __init__.  Hypothetically
we're supposed to use __all__ in this case, but I was too lazy
to fix it.  Left for future work.

Be careful!  flake8-2 and flake8-3 behave differently with
respect to import resolution for # type: comments.  flake8-3 will
report an import unused; flake8-2 will not.  For now, I just
noqa'd all these sites.
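
For illustration, the noqa pattern used at those sites (the module here is a stand-in, not a file from this diff):

```
# Suppress F401 ("imported but unused") where an import exists for
# re-export or its side effects rather than direct use.
import json  # noqa: F401
```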

All the changes were done by hand.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: D14687478

fbshipit-source-id: 30d532381e914091aadfa0d2a5a89404819663e3
2019-03-30 09:01:17 -07:00
Brian Johnson
fd04073e61 Fixed a formatting issue in doc comments (#17505)
Summary:
for torch.distributed.broadcast_multigpu per issue #17243
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17505

Reviewed By: janewangfb

Differential Revision: D14373865

Pulled By: pietern

fbshipit-source-id: 6d7e91a3da50a7c9ba417ad852f7746eb5200043
2019-03-12 09:55:29 -07:00
Jane Wang
a2b9f7f484 add elastic zeus handler (#16746)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16746

as titled. We use a special URL scheme, elasticzeus, for elastic Zeus so that we don't need to change the public interface of init_process_group.
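
A hypothetical sketch of the selection mechanism (the host/port/path here are placeholders, not a documented format):

```
import torch.distributed as dist

# The rendezvous handler is chosen purely by the URL scheme, so the
# public signature of init_process_group stays unchanged.
dist.init_process_group(backend="gloo",
                        init_method="elasticzeus://host:port/path")
```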

Reviewed By: aazzolini, soumith

Differential Revision: D13948151

fbshipit-source-id: 88939dcfa0ad93467dabedad6905ec32e6ec60e6
2019-02-27 11:29:59 -08:00
hysts
cbefd0323b Fix typo
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/17521

Differential Revision: D14237482

Pulled By: soumith

fbshipit-source-id: 636e0fbe2c667d15fcb649136a65ae64937fa0cb
2019-02-26 20:23:34 -08:00
Teng Li
2d3cf98b49 Making dist.get_default_group private for PT1 release (#14767)
Summary:
When I wrote the frontend API, I designed it so that users would not use the default group directly in any functions. It should really be private.

All collectives are supposed to use either group.WORLD or anything that comes out of new_group. That was the initial design.

We need a TODO on removing group.WORLD one day. It exists for backward compatibility reasons and adds lots of complexity.
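
A sketch of the intended usage (assumes an initialized job):

```
import torch
import torch.distributed as dist

t = torch.ones(2)
dist.all_reduce(t)                       # implicitly uses group.WORLD
subgroup = dist.new_group(ranks=[0, 1])
dist.all_reduce(t, group=subgroup)       # or a group from new_group
```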
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14767

Reviewed By: pietern

Differential Revision: D13330655

Pulled By: teng-li

fbshipit-source-id: ace107e1c3a9b3910a300b22815a9e8096fafb1c
2018-12-04 19:22:24 -08:00
Pieter Noordhuis
11ef5191ff Enable tests for CPU tensors in test_distributed.py (#14572)
Summary:
These were not enabled after adding support in the Gloo backend. The
argument checks in ProcessGroupGloo raised an error in two cases:

* If the input tensor list to scatter was ``[None]`` on processes other
  than the source process.
* If the output tensor list to gather was ``[None]`` on processes other
  than the destination process.

This commit prepares these arguments explicitly instead of boxing them
at the process group call site.

This fixes #14536.
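
A sketch of the CPU-tensor calling convention these tests exercise (assumes an initialized Gloo job):

```
import torch
import torch.distributed as dist

# Only the source rank provides scatter_list; other ranks pass None.
rank = dist.get_rank()
out = torch.empty(4)
if rank == 0:
    chunks = [torch.full((4,), float(i)) for i in range(dist.get_world_size())]
    dist.scatter(out, scatter_list=chunks, src=0)
else:
    dist.scatter(out, scatter_list=None, src=0)
```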
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14572

Differential Revision: D13272812

Pulled By: pietern

fbshipit-source-id: 12cb0d85ec92f175365cbada585260f89330aad8
2018-11-29 21:39:02 -08:00
Teng Li
9127ab3866 Fixed new_group won't work for two or more different rank groups (#14529)
Summary:
This fixed two things:

(1) The NCCL backend didn't support two or more groups. This is because we need a group name in the ProcessGroupNCCL class to keep track of the process group ID within that group name, and also the NCCL unique ID within that group name and process group ID. Otherwise, different processes will create different NCCL process groups in different orders and can clash on these names. This fixes the NCCL problem.

(2)  When using new_group, each rank should enter this function and update its global group name counter to ensure that every rank always operates on the same group name.

With both fixes, the repro code in https://github.com/pytorch/pytorch/issues/14528 works with both the NCCL and Gloo backends.

```
tengli@learnfair096:~$ python -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --master_port=30000 ~/github_issues/nccl_group.py
rank: 0 - val: 6.0
rank: 2 - val: 6.0
rank: 3 - val: 6.0
rank: 1 - val: 6.0
rank: 4 - val: 22.0
rank: 6 - val: 22.0
rank: 5 - val: 22.0
rank: 7 - val: 22.0
```
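
A hedged repro sketch consistent with that output (8 ranks, two 4-rank subgroups; each rank all-reduces its own rank value, so the first group sums to 0+1+2+3 = 6 and the second to 4+5+6+7 = 22):

```
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")  # rank/size from the launcher env
rank = dist.get_rank()
torch.cuda.set_device(rank)
g1 = dist.new_group(ranks=[0, 1, 2, 3])  # every rank calls new_group for
g2 = dist.new_group(ranks=[4, 5, 6, 7])  # every group, in the same order
val = torch.full((1,), float(rank), device="cuda")
dist.all_reduce(val, group=g1 if rank < 4 else g2)
print(f"rank: {rank} - val: {val.item()}")
```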
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14529

Differential Revision: D13253434

Pulled By: teng-li

fbshipit-source-id: 8eb45882b996b06d951fc9a306d5de86a42e8b84
2018-11-29 19:57:47 -08:00
Teng Li
0d3cb91d8c Make env init_method support both env and args for rank and size (#14494)
Summary:
Fixing: https://github.com/pytorch/pytorch/issues/14446

This was supported behavior in the old torch.distributed, and we want to support it in the new release.

The tests should cover every combination of rank and world size being supplied via environment variables, via arguments, or both.
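
A sketch of the restored behavior (single-process world for brevity):

```
import os
import torch.distributed as dist

# With init_method="env://", rank and world_size may come from the
# RANK/WORLD_SIZE environment variables, the keyword arguments, or both.
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29500"
dist.init_process_group(backend="gloo", init_method="env://",
                        rank=0, world_size=1)
```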
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14494

Differential Revision: D13253433

Pulled By: teng-li

fbshipit-source-id: c05974d84f1bdf969f74ec45763e11a841fe4848
2018-11-29 18:48:20 -08:00
Pieter Noordhuis
4ec6bd7356 Add sourceRank() to ProcessGroup::Work (#14453)
Summary:
This function is only implemented for the subclasses where it makes
sense. If it's not overridden it will throw an error. Having this
function removes the need for a pointer passing hack to pass the
source rank of a recv operation back to the caller. Instead, the
caller can now call `source_rank` on the work object and achieve
the same result.

Closes #11804.
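
The Python-level counterpart (assumes another rank has a matching send in flight):

```
import torch
import torch.distributed as dist

buf = torch.empty(4)
# A recv from any source returns the sender's rank directly, with no
# out-parameter hack.
sender = dist.recv(buf, src=None)
print(f"received from rank {sender}")
```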
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14453

Differential Revision: D13230898

Pulled By: pietern

fbshipit-source-id: ef38f48bfaca8ef9a364e5be122951bafc9f8e49
2018-11-29 09:16:53 -08:00
Pieter Noordhuis
0f62af4ab1 Add timeout kwarg to init_process_group (#14435)
Summary:
This applies to the gloo backend only. Timeout support for the NCCL and
MPI backends is tracked in issues #14371 and #14372 respectively.

When creating a new process group (either the global one or any subgroup
created through `new_group`) you can specify a timeout keyword
argument (of type datetime.timedelta). This timeout applies to all
collective operations executed against that process group, such that any
operation taking longer than the timeout will throw a runtime error.
Using a different, better catchable error type is tracked in #14433.

This fixes #14376.
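
A sketch of the keyword argument (Gloo only at this point; single-process world and placeholder address for brevity):

```
from datetime import timedelta

import torch.distributed as dist

# Any collective on either group that exceeds its timeout will raise
# a runtime error.
dist.init_process_group(backend="gloo", init_method="tcp://127.0.0.1:23456",
                        rank=0, world_size=1,
                        timeout=timedelta(minutes=5))
group = dist.new_group(ranks=[0], timeout=timedelta(seconds=30))
```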
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14435

Differential Revision: D13234317

Pulled By: pietern

fbshipit-source-id: 973993b67994dc64861c0977cbb6f051ec9d87f6
2018-11-28 11:35:01 -08:00
Teng Li
b807970aea Tensor type checking and informative error messages for torch.distributed (#14204)
Summary:
This will address https://github.com/pytorch/pytorch/issues/13574

This error message should be more informative to the user for all the non-multi-GPU ops, since the Python bindings always dispatch to the multi-GPU ops.

test_distributed should cover everything. I also tested both RuntimeErrors:

```
>>> a = torch.ByteTensor([])
>>> b = [a, a]
>>> dist.all_reduce(b)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/private/home/tengli/pytorch/torch/distributed/distributed_c10d.py", line 809, in all_reduce
    _check_single_tensor(tensor, "tensor")
  File "/private/home/tengli/pytorch/torch/distributed/distributed_c10d.py", line 207, in _check_single_tensor
    "to be a torch.Tensor type".format(param_name))
RuntimeError: Invalid function argument. Expecting parameter: tensor to be a torch.Tensor type

>>> b = ["b"]
>>> dist.all_gather(b, a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/private/home/tengli/pytorch/torch/distributed/distributed_c10d.py", line 1006, in all_gather
    _check_tensor_list(tensor_list, "tensor_list")
  File "/private/home/tengli/pytorch/torch/distributed/distributed_c10d.py", line 225, in _check_tensor_list
    "to be a List[torch.Tensor] type".format(param_name))
RuntimeError: Invalid function argument. Expecting parameter: tensor_list to be a List[torch.Tensor] type
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14204

Differential Revision: D13131526

Pulled By: teng-li

fbshipit-source-id: bca3d881e41044a013a6b90fa187e722b9dd45f2
2018-11-19 18:30:54 -08:00
Tongzhou Wang
044d00516c Rename DistBackend -> Backend (#11830)
Summary:
Also add docs for get_backend, Backend, and reduce_op

fixes #11803

cc pietern apaszke
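
A small sketch of the renamed class (string-like, case-normalizing):

```
import torch.distributed as dist

b = dist.Backend("GLOO")  # construction validates and lowercases
assert b == dist.Backend.GLOO == "gloo"
```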
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11830

Differential Revision: D9927991

Pulled By: SsnL

fbshipit-source-id: a2ffb70826241ba84264f36f2cb173e00b19af48
2018-11-07 11:58:12 -08:00
Teng Li
1b64c0f8fe Error msg on TCP backend (#13596)
Summary:
Cleaning this up from my queue:

https://github.com/pytorch/pytorch/issues/12721

```
>>> torch.distributed.init_process_group(backend="tcp")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/private/home/tengli/pytorch/torch/distributed/distributed_c10d.py", line 275, in init_process_group
    backend = DistBackend(backend)
  File "/private/home/tengli/pytorch/torch/distributed/distributed_c10d.py", line 55, in __new__
    raise ValueError("TCP backend has been deprecated. Please use "
ValueError: TCP backend has been deprecated. Please use Gloo or MPI backends for collective operations on CPU tensors.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13596

Differential Revision: D12931196

Pulled By: teng-li

fbshipit-source-id: bb739b107ad7454e2e0a17430087161fedd4c392
2018-11-05 16:40:02 -08:00
Pieter Noordhuis
526460fc8b Use default timeout of 30 minutes for gloo backend (#13056)
Summary:
The existing default timeout was set at 10 seconds, which is too low
for asynchronous tasks that depend on a barrier to resynchronize.
Having a single timeout for all operations is not ideal and this will
be addressed in future commits.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13056

Reviewed By: teng-li

Differential Revision: D10558746

Pulled By: pietern

fbshipit-source-id: d857ea55b1776fc7d0baf2efd77951b5d98beabb
2018-10-25 16:35:53 -07:00
Edward Yang
dfa03e94eb Fix mispelling of AVAILABLE. (#12016)
Summary:
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12016

Reviewed By: pietern

Differential Revision: D10010808

Pulled By: ezyang

fbshipit-source-id: ff6394ae9a53f7fdad2cadb4e019e09ac63bba96
2018-09-24 20:46:41 -07:00
Tongzhou Wang
540ef9b1fc Add distributed get_backend (#11715)
Summary:
I have no idea how to run distributed tests locally so I'll let CI do this. Hopefully everything still works with `IntEnum`.

cc mcarilli
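
A usage sketch (assumes an initialized job):

```
import torch.distributed as dist

print(dist.get_backend())            # backend of the global group, e.g. "gloo"
subgroup = dist.new_group(ranks=[0])
print(dist.get_backend(subgroup))    # backend of a specific group
```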
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11715

Reviewed By: pietern

Differential Revision: D9889646

Pulled By: SsnL

fbshipit-source-id: 1e2a487cb6fe0bd4cc67501c9d72a295c35693e2
2018-09-18 10:56:24 -07:00
Pieter Noordhuis
7535d98ec4 Add message tag parameter to send/recv
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/11490
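
A sketch of the new parameter (assumes two initialized ranks):

```
import torch
import torch.distributed as dist

# Tags let a receiver match a specific send when several are in flight.
if dist.get_rank() == 0:
    dist.send(torch.ones(1), dst=1, tag=7)
else:
    buf = torch.empty(1)
    dist.recv(buf, src=0, tag=7)  # matches only the tag=7 send
```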

Reviewed By: teng-li

Differential Revision: D9828116

Pulled By: pietern

fbshipit-source-id: 98be1ae84b6763ffb329e63c030c5e3ec0e748b7
2018-09-14 10:55:37 -07:00
Teng Li
0988bbad2d C10d release to torch.distributed for PT1 (#11405)
Summary:
The old `torch.distributed` will move to `torch.distributed.deprecated`.
The old DDP will move to `torch.nn.parallel.deprecated`.

Now `torch.nn.parallel.DDP` will use the c10d DDP, and `torch.distributed` will use the c10d frontend API.
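
A small sketch of the resulting namespaces (module paths as stated above):

```
# c10d-backed entry points after this release:
import torch.distributed as dist                        # c10d frontend API
from torch.nn.parallel import DistributedDataParallel   # c10d-based DDP

# The pre-1.0 implementations are moved rather than removed; they live
# under torch.distributed.deprecated and torch.nn.parallel.deprecated.
```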
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11405

Reviewed By: pietern

Differential Revision: D9733733

Pulled By: teng-li

fbshipit-source-id: d6a3f3e73f8d3a7fcb1f4baef53c78063b8cbb08
2018-09-10 23:27:22 -07:00