Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72657
The _ProcessGroupWrapper check needs to be gated on Gloo availability;
without the gate, this fails when Gloo is not available.
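A minimal sketch of the gating pattern, assuming the present-day helper `torch.distributed.is_gloo_available()` and that `_ProcessGroupWrapper` is importable from `torch.distributed.distributed_c10d` (the exact location is an assumption, not the PR's code):
```python
import torch.distributed as dist

def _is_wrapped_pg(pg):
    # Only touch _ProcessGroupWrapper when Gloo was compiled in; otherwise the
    # symbol does not exist and the isinstance check itself would fail.
    if dist.is_gloo_available():
        from torch.distributed.distributed_c10d import _ProcessGroupWrapper
        return isinstance(pg, _ProcessGroupWrapper)
    return False
```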
ghstack-source-id: 148837056
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D34144848
fbshipit-source-id: 42a04918b968247f3259cd2cde5438e1265b04fe
(cherry picked from commit ba5de98939)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71623
Enable gather_object on the NCCL backend, since we already support `dist.gather` on NCCL. This requires the user to set the current device properly.
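Usage sketch of the requirement above (the device assignment is illustrative; it assumes one GPU per rank on the node):
```python
import torch
import torch.distributed as dist

# With the NCCL backend, pin the current CUDA device before the object collective.
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
obj = {"rank": dist.get_rank()}
gather_list = [None] * dist.get_world_size() if dist.get_rank() == 0 else None
dist.gather_object(obj, gather_list, dst=0)
```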
ghstack-source-id: 147754836
Test Plan: distributed_nccl_spawn -r test_gather_object
Reviewed By: zou3519
Differential Revision: D33701042
fbshipit-source-id: 39cff22947a7cac69d0c923b956dc10f25353a6f
(cherry picked from commit 6e6eff497f)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70605
broadcast_object_list cast the sum of all object lengths from long to int, causing overflows.
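A hedged illustration of the bug class (variable names are illustrative, not the actual c10d internals): keep the accumulated length in int64 so a >2GB total does not wrap to a negative 32-bit value.
```python
import torch

sizes = [2**31, 1024]  # pretend byte lengths of two serialized objects
object_sizes_tensor = torch.tensor(sizes, dtype=torch.long)
total = int(object_sizes_tensor.sum().item())  # 2147484672, not a negative int32
# torch.empty(total, dtype=torch.uint8) now gets a valid, positive dimension.
```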
Test Plan:
Add a Tensor with a >2GB storage requirement (in distributed_test.py) to the object broadcast.
This Tensor is only added if the tests are running at Meta, as the GitHub tests will OOM.
Without the fix, the length overflows and the program requests a negatively sized Tensor:
```
RuntimeError: Trying to create tensor with negative dimension -2147482417: [-2147482417]
```
With the fix, the test passes.
Tests used on a server with GPUs:
buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn --local -- broadcast_object
buck test mode/dev-nosan //caffe2/test/distributed:distributed_gloo_spawn --local -- broadcast_object
Reviewed By: r-barnes
Differential Revision: D33405741
fbshipit-source-id: 972165f8297b3f5d475636e6127ed4a49adacab1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70336
broadcast_object_list cast the sum of all object lengths from long to int, causing overflows.
Test Plan:
Increased the size of the Tensor used in object transfers to have a >2GB storage requirement (in distributed_test.py).
Without the fix, the length overflows and the program requests a negatively sized Tensor:
```
RuntimeError: Trying to create tensor with negative dimension -2147482417: [-2147482417]
```
With the fix, the test passes.
Test used on a server with GPUs:
buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn --local -- broadcast_object
Differential Revision: D33281300
fbshipit-source-id: 1bc83e8624edc14e747eeced7bc8a7a10e443ee4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69400
Hopefully this makes naming more consistent. Without this change, some tests will fail for plugins since values can be set to upper case in some cases. This should prevent that and make lookup and comparison consistent.
Test Plan: Check the signals. There is no specific test for this, but all tests should pass.
Reviewed By: mrshenli
Differential Revision: D32836529
fbshipit-source-id: 1b7d2b64e04fe0391b710aa6ed6d1e47df9027a3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68223
DETAIL debug mode didn't work with object-based collectives for the NCCL backend, because we'd only check whether the backend is NCCL and then move tensors to CUDA.
Instead, check whether it is a wrapped PG, and then check the wrapped pg to see if it is NCCL.
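A hedged Python sketch of the check described above (the `wrapped_pg` attribute name is an assumption; the real code unwraps the wrapper differently):
```python
import torch

def _object_collective_device(pg):
    inner = getattr(pg, "wrapped_pg", pg)           # unwrap a DETAIL-mode wrapper if present
    if type(inner).__name__ == "ProcessGroupNCCL":  # the wrapped backend decides the device
        return torch.device("cuda", torch.cuda.current_device())
    return torch.device("cpu")
```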
ghstack-source-id: 143242023
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D32366840
fbshipit-source-id: be0a2af6849f8f24446593f4a4fbea4a67586ee5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67639
Due to BC considerations, we cannot directly error out, as that
might break existing applications. Raise warnings first to improve
debuggability.
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang
Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D32075151
Pulled By: mrshenli
fbshipit-source-id: 5680d420f5f6cd3f74a36616c03350e8a976b363
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66991
Currently, c10d extensions use Backend.NAME to store the creator
function. However, builtin backends use that same field to store the
name. This commit makes c10d extensions consistent with builtin ones,
and uses a dedicated `_plugins` field to store creator functions.
Thanks bryanmr for pointing this out.
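A simplified model (not the actual c10d source) of the layout this change introduces: builtin backends keep their lower-case name as a class attribute, while third-party creator functions live in the separate `_plugins` registry.
```python
class Backend:
    GLOO = "gloo"
    NCCL = "nccl"
    _plugins = {}

    @classmethod
    def register_backend(cls, name, creator_fn):
        setattr(cls, name.upper(), name.lower())  # name attribute, consistent with builtins
        cls._plugins[name.upper()] = creator_fn   # creator stored separately, not on NAME
```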
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang
Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D31820307
Pulled By: mrshenli
fbshipit-source-id: 259769ebfc80c0c9fc44d25498c8d19a3a09d1bc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64722
`all_reduce_coalesced` and `all_gather_coalesced` have never been publicly
exposed in our API docs, so I would assume the blast radius to be small.
The motivation for this change is to allow implementing
`all_reduce_coalesced` and `all_gather_coalesced` by reusing the `allreduce`
and `allgather` C++ cores and performing the flatten and copy only on the Python
side. With that, we can then remove `all_reduce_coalesced` and
`all_gather_coalesced` from the C++ ProcessGroup APIs. For async mode,
the copy-back logic after the communication will need to be chained
as a callback on the returned Future, with the chained child Future used
as the return value (otherwise, we would need to wrap the child Future
into another work handle). This PR tests whether we can directly
return a Future without breaking tests and internal use cases. If yes,
it will make the consolidation a lot easier.
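A hedged sketch of the Python-side pattern this motivates: flatten, call the core `all_reduce`, and chain the copy-back on the returned Future (function name and structure are illustrative, not the eventual implementation):
```python
import torch
import torch.distributed as dist

def all_reduce_coalesced_sketch(tensors, group=None):
    flat = torch.cat([t.reshape(-1) for t in tensors])
    fut = dist.all_reduce(flat, group=group, async_op=True).get_future()

    def _copy_back(_):
        offset = 0
        for t in tensors:
            n = t.numel()
            t.copy_(flat[offset:offset + n].view_as(t))
            offset += n
        return tensors

    return fut.then(_copy_back)  # the chained child Future is what the caller gets
```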
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse agolynski SciPioneer H-Huang mrzzd cbalioglu gcramer23
Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D30830994
Pulled By: mrshenli
fbshipit-source-id: dcde0ed9245e9e8fee357b3588b07d540a4b6318
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63910
Addresses the current issue that `init_method=tcp://` is not compatible with `torch.distributed.run` and `torch.distributed.launch`. When running with a training script that initializes the process group with `init_method=tcp://localhost:$port`, like so:
```
$ python -u -m torch.distributed.run --max_restarts 0 --nproc_per_node 1 --nnodes 1 --master_addr $(hostname) --master_port 6000 ~/tmp/test.py
```
An `Address in use` error is raised since the training script tries to create a TCPStore on port 6000, which is already taken because the elastic agent is already running a TCPStore on that port.
For details see: https://github.com/pytorch/pytorch/issues/63874.
This change does a couple of things:
1. Adds `is_torchelastic_launched()` check function that users can use in the training scripts to see whether the script is launched via torchelastic.
1. Update the `torch.distributed` docs page to include the new `is_torchelastic_launched()` function.
1. Makes `init_method=tcp://` torchelastic compatible by modifying `_tcp_rendezvous_handler` in `torch.distributed.rendezvous` (this is NOT the elastic rendezvous, it is the old rendezvous module which is slotted for deprecation in future releases) to check `is_torchelastic_launched()` AND `torchelastic_use_agent_store()` and if so, only create TCPStore clients (no daemons, not even for rank 0).
1. Adds a bunch of unittests to cover the different code paths
NOTE: the issue mentions that we should fail fast with an assertion on `init_method!=env://` when `is_torchelastic_launched()` is `True`. There are three registered init_methods in pytorch: env://, tcp://, file://. Since this diff makes tcp:// compatible with torchelastic, and I've validated that file:// is also compatible with torchelastic, there is no need to add assertions. I did update the docs to point out that env:// is the RECOMMENDED init_method. We should probably deprecate the other init_methods in the future, but this is out of scope for this issue.
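Usage sketch of the new check (the port number is illustrative; it assumes the function is exported at the `torch.distributed` top level):
```python
import torch.distributed as dist

if dist.is_torchelastic_launched():
    dist.init_process_group(backend="nccl")  # the agent already provides the env:// rendezvous
else:
    dist.init_process_group(backend="nccl", init_method="tcp://localhost:29500")
```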
Test Plan: Unittests.
Reviewed By: cbalioglu
Differential Revision: D30529984
fbshipit-source-id: 267aea6d4dad73eb14a2680ac921f210ff547cc5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61299
Modify all_to_all and scatter to support complex tensors as well as float tensors.
Test Plan: buck run //caffe2/test/distributed:distributed_gloo_fork -- test_name --print-passing-details --run-disabled
Reviewed By: wanchaol
Differential Revision: D29563938
fbshipit-source-id: 59e436b3fa1aee3d5195cbcffd39587e642c76b9
Summary:
Revised version of https://github.com/pytorch/pytorch/issues/60573.
**Overview:**
This makes two changes:
- It introduces a `map_location` argument to `broadcast_object_list()`. The argument specifies the device to load tensors contained in objects received from the broadcast. This change requires modifying the implementation of `_object_to_tensor()` and `_tensor_to_object()` to use `torch.save()` and `torch.load()` respectively.
- It removes all calls to `_broadcast_object()` in `ZeroRedundancyOptimizer` and the corresponding test file in favor of `broadcast_object_list()`.
The default value of `map_location` is `None`, in which case `_object_to_tensor()` and hence `broadcast_object_list()` preserve their original behavior. Namely, contained tensors are loaded to their original device.
In `consolidate_state_dict()`, I specify `map_location=torch.device("cpu")` instead of `self._default_device`. This slightly changes the behavior from before when using `_broadcast_object()`. The reason I do so is that it saves one GPU to CPU data transfer since the action immediately after receiving the broadcasted `local_state_dict` is to copy it to CPU.
Explicitly, if `map_location=self._default_device`, then the data transfer path assuming NCCL backend is as follows:
`source GPU --[before serialize]--> source CPU --[before broadcast]--> source GPU --[broadcast]--> destination GPU --[before deserialize]--> destination CPU --[deserialize]--> destination GPU --[copy]--> destination CPU`
Hence, by setting `map_location=torch.device("cpu")` instead, the suffix becomes:
`destination CPU --[deserialize]--> destination CPU --[copy]--> destination CPU`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61539
Test Plan:
I added a test `test_broadcast_object_list_map_location()` that checks for both `map_location` as CPU and GPU that (1) tensors contained in broadcasted objects are appropriately loaded onto the specified device and (2) that the contents of the tensors are correct.
The existing `ZeroRedundancyOptimizer` tests pass.
```
gpurun4 python test/distributed/optim/test_zero_redundancy_optimizer.py
```
The existing `broadcast_object_list()` test passes:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py -- TestDistBackendWithFork.test_broadcast_object_list
```
Reviewed By: zou3519
Differential Revision: D29701479
Pulled By: andwgu
fbshipit-source-id: c8d5f9057b32e5e9f40e8edc5b2cc25fb21414a9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61305
Part I (this PR): Add dist_device argument to broadcast_object_list API
Part II: andwgu@ will deprecate _broadcast_object with the newly introduced API
Also includes the changes to `_object_to_tensor()`/`_tensor_to_object()` from PR 60573
Context: https://github.com/pytorch/pytorch/issues/60062
Test Plan:
Run the following on a devgpu with two CUDA devices:
$ python setup.py develop  # run this build on the devgpu
$ BACKEND='nccl' WORLD_SIZE=2 with-proxy python test/distributed/test_distributed_fork.py TestDistBackendWithFork.test_broadcast_object_list --v
$ BACKEND='gloo' WORLD_SIZE=2 with-proxy python test/distributed/test_distributed_fork.py TestDistBackendWithFork.test_broadcast_object_list --v
Build with distributed on: USE_DISTRIBUTED=1 python setup.py develop
Test on CPU devvm:
$ with-proxy python test/distributed/optim/test_zero_redundancy_optimizer.py
Imported from OSS
Differential Revision: D29566538
Reviewed By: iramazanli, mrshenli
Pulled By: bowangbj
fbshipit-source-id: 0bea52442551c5194acba85eadda16ba2ec4b6ef
Summary:
During development it is common practice to put `type: ignore` comments on lines that are correct, but that `mypy` doesn't recognize as such. This often stems from the fact that the `mypy` version in use wasn't able to handle the pattern.
With every new release `mypy` gets better at handling complex code. In addition to fixing all the previously accepted but now failing patterns, we should also revisit all `type: ignore` comments to see whether they are still needed. Fortunately, we don't need to do this manually: by adding `warn_unused_ignores = True` to the configuration, `mypy` will error out whenever it encounters a `type: ignore` that is no longer needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60006
Reviewed By: jbschlosser, malfet
Differential Revision: D29133237
Pulled By: albanD
fbshipit-source-id: 41e82edc5cd5affa7ccedad044b59b94dad4425a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59111
Create a util function for initializing subgroups. By default, each subgroup contains all the ranks within a machine. This util function can be used by both local SGD and SyncBatchNorm optimization.
Additionally, clang format `distributed/__init__.py` after importing `_rank_not_in_group` which is used by the unit test, and also clang format `distributed_c10d.py`.
Note that this API does not accept another overall main group. Like the APEX API `create_syncbn_process_group` ([here](https://nvidia.github.io/apex/_modules/apex/parallel.html)), it always uses the global world size and should only be applied when CUDA is available.
#Closes: https://github.com/pytorch/pytorch/issues/53962
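Usage sketch of the utility (the signature is inferred from the test names below and should be treated as an approximation):
```python
import torch
import torch.distributed as dist

cur_subgroup, subgroups = dist.new_subgroups()  # default: one subgroup per machine
t = torch.ones(1, device="cuda")
dist.all_reduce(t, group=cur_subgroup)          # e.g. local SGD / SyncBatchNorm averaging
```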
ghstack-source-id: 130975027
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_group_size_exceeds_world_size
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_world_size_not_divisible_by_group_size
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_by_enumeration
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_by_enumeration_input_rank_exceeds_world_size
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_overlap_not_allowed
Reviewed By: rohan-varma
Differential Revision: D28495672
fbshipit-source-id: fdcc405411dd409634eb51806ee0a320d1ecd4e0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58329
This PR is part of a stack that addresses the GitHub issue #41614; it introduces:
- A new `multiTenant` constructor option for the `TCPStore` class indicating whether multiple store instances can be initialized with the same host:port pair.
- Updates to the C10d distributed (elastic) rendezvous and the `init_process_group` method to leverage the new `multiTenant` feature.
Note that the multi-tenancy feature itself is implemented in the fourth PR of this stack. In this PR, passing `true` to `multiTenant` only results in a warning being output.
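A hedged usage sketch; the Python keyword is assumed to mirror the C++ `multiTenant` option and may differ in released versions:
```python
from datetime import timedelta
from torch.distributed import TCPStore

store = TCPStore("localhost", 29500, world_size=2, is_master=True,
                 timeout=timedelta(seconds=30), wait_for_workers=False,
                 multi_tenant=True)
```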
ghstack-source-id: 130676389
Test Plan: Run the existing tests since there are no behavioral changes.
Reviewed By: rohan-varma
Differential Revision: D28424978
fbshipit-source-id: fb1d1d81b8b5884cc5b54486700a8182a69c1f29
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58753
TSAN was (rightfully!) detecting and complaining about a race due to the fact that upon init the TP agent exchanges the device maps between nodes using RPC requests (and by doing so it accesses the device maps) and then sets the reverse device maps (thus possibly modifying the set of devices). This resulted in a data race, i.e., simultaneously reading and writing the set of devices without synchronizing.
One solution is to add a mutex around the devices, which works, but is "annoying". An alternative solution is to make the set of devices immutable (i.e., `const`). For that to work, we need to exchange the device maps without using RPC calls. We can do so using the process group that we need to create anyways.
Since now there's a lot more logic in Python, I've moved (and restructured) all safety checks over there, and removed them from C++.
ghstack-source-id: 130583775
Test Plan: Unit tests
Reviewed By: mrshenli
Differential Revision: D28603754
fbshipit-source-id: 88533e65d72d1eb806dc41bec8d55def5082e290
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58281
When TORCH_DISTRIBUTED_DEBUG=DETAIL is enabled, this PR causes process groups created by `new_group` and `init_process_group` that are nccl or gloo to be wrapped in `ProcessGroupWrapper`.
As a result, the user will get back a `ProcessGroupWrapper` that they can use in the exact same way as a regular nccl/gloo pg, but will be more helpful in terms of debugging desync/hangs.
Besides doing collective desync checks, which should be transparent if there are indeed no issues in the user application, there are no semantic differences in using the wrapper pg. Note that there is a performance implication here but that is a tradeoff we are making when DETAIL debug mode is enabled.
Open to suggestions on how to test this better. I verified locally that enabling TORCH_DISTRIBUTED_DEBUG=DETAIL creates the wrapper and all tests still pass, but that doesn't run in CI. On the other hand, running everything with debug=detail in addition to the regular tests might be too much, so we have only added it to a few tests for now. We also have tests in the diff below.
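Usage sketch: the wrapper is driven entirely by the environment variable, with no user-code changes (the script name is illustrative):
```python
# Launch with: TORCH_DISTRIBUTED_DEBUG=DETAIL python train.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")        # the default pg is transparently wrapped
dist.all_reduce(torch.ones(1, device="cuda"))  # desync checks run before the real collective
```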
ghstack-source-id: 129817857
Test Plan: ci
Reviewed By: SciPioneer
Differential Revision: D28402301
fbshipit-source-id: c4d3438320f6f0986e128c738c9d4a87bbb6eede
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58224
Adds C++ implementation of ProcessGroupWrapper. It wraps
an underlying ProcessGroup and does debug checks before dispatching the
collective to the underlying pg. The design mostly follows https://github.com/pytorch/pytorch/issues/22071.
Concretely, on each collective, we:
1. Verify op type consistency. This can help catch mismatched ops in the user application (e.g. allreduce on one rank and allgather on another)
2. Verify tensor shapes. This can help catch bugs where the tensor inputs are malformed, whereas normally in NCCL this would just lead to a hang. The shapes verification for allgather/allreduce_coalesced is omitted because they actually accept different shape tensors and don't error out.
This is done through an abstraction called `CollectiveFingerPrint` which uses a helper process group to do the above verification. Concretely, we gather the data we need for each of the above checks into tensors, and allgather them, and verify their equivalence.
Once all of this passes we simply dispatch the collective to the underlying pg.
Added `ProcessGroupWrapperTest` in python to comprehensively test these changes.
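A hedged Python sketch of the `CollectiveFingerPrint` idea (the real implementation is in C++ and gathers the full shape information; this fixed-length checksum is a simplification):
```python
import torch
import torch.distributed as dist

def check_collective(helper_pg, op_id, tensors):
    shape_sum = sum(d for t in tensors for d in t.shape)
    fp = torch.tensor([op_id, len(tensors), shape_sum], dtype=torch.long)
    gathered = [torch.empty_like(fp) for _ in range(dist.get_world_size(group=helper_pg))]
    dist.all_gather(gathered, fp, group=helper_pg)  # exchange fingerprints via the helper pg
    for other in gathered:
        if not torch.equal(other, fp):
            raise RuntimeError("Detected mismatched collective across ranks")
```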
ghstack-source-id: 129735687
Test Plan: ci
Reviewed By: zhaojuanmao
Differential Revision: D28023981
fbshipit-source-id: 1defc203c5efa72ca0476ade0d1d8d05aacd4e64
Summary:
Will not land before the release, but would be good to have this function documented in master for its use in distributed debugability.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58322
Reviewed By: SciPioneer
Differential Revision: D28595405
Pulled By: rohan-varma
fbshipit-source-id: fb00fa22fbe97a38c396eae98a904d1c4fb636fa
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57711
Seeing some hangs/issues around store based barrier internally, would
be good to have this log to indicate whether store based barrier has completed
successfully or not for a particular rank to debug further.
ghstack-source-id: 128605600
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D28249087
fbshipit-source-id: 644e5780519017ae780c3bc78bbe5def322db3f8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56531
Per the discussions in
https://github.com/pytorch/pytorch/pull/53663/files#r593409009, we need
to make sure our API does not confuse users by letting them pass both a
timeout argument and a timeout in ProcessGroup.Options. This PR makes
`ProcessGroup.Options.timeout` a private field that is only used in
our test utils; for both `init_process_group` and `new_group`, we still
allow users to pass `timeout` as a separate argument. Since
`ProcessGroupGloo.Options` only has a `timeout` config, both functions
will not allow passing in options for the GLOO backend.
This way we still preserve the single `timeout` API, and only allow users
to use `ProcessGroupNCCL.Options` when needed.
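Usage sketch of the surviving public knob described above:
```python
from datetime import timedelta
import torch.distributed as dist

dist.init_process_group(backend="gloo", timeout=timedelta(minutes=5))
sub = dist.new_group(ranks=[0, 1], timeout=timedelta(seconds=60))
```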
cc pritamdamania87 rohan-varma
Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D27893395
Pulled By: wanchaol
fbshipit-source-id: cdd29c84648002226ef3d9f9f3ea67b795e64bc5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55319
Adds a sequence number class as well as integration with ProcessGroup (nccl and gloo) as part of better debuggability.
The main use case is that each ProcessGroup instantiated will have a sequence number initially set by rank 0 and broadcast to all others. We will increment the number on each collective, thus allowing us to match the numbers appropriately when checking for desynchronization.
This PR just adds the bare-bones integration and verifies sequence numbers are set appropriately at the beginning.
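A hedged Python sketch of the scheme (the real sequence number lives inside the C++ process group; this only illustrates the set-by-rank-0-then-broadcast idea):
```python
import torch
import torch.distributed as dist

seq = torch.zeros(1, dtype=torch.long)
if dist.get_rank() == 0:
    seq.random_(0, 2**31)   # rank 0 chooses the initial sequence number
dist.broadcast(seq, src=0)  # all ranks now agree; increment on every collective afterwards
```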
ghstack-source-id: 127011277
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D27562769
fbshipit-source-id: d4a4de7529ce07a0c86fcf6beb06f317f359d89b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55861
APIs such as torch.LongTensor and torch.ByteTensor are deprecated and
the recommended API is torch.tensor(args, dtype=...). Use this API in
distributed_c10d.
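Example of the replacement spelling:
```python
import torch

# deprecated: torch.LongTensor([1, 2, 3]); torch.ByteTensor(8)
t = torch.tensor([1, 2, 3], dtype=torch.long)
b = torch.empty(8, dtype=torch.uint8)
```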
ghstack-source-id: 126777875
Test Plan: CI
Reviewed By: pbelevich
Differential Revision: D27726600
fbshipit-source-id: 07eb8168d93697593589002c93c3903ce29431ef
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55990
Reland of https://github.com/pytorch/pytorch/pull/55197, which failed a Windows test that was only run on master.
Disabled these tests on Windows, similar to how they are disabled on macOS. The reason for disabling them is that they use the libuv transport, which does not have as robust error handling as TCP on Linux. The result is that healthy non-zero ranks don't throw immediately (like they do on Linux) but instead throw on timeout. The error handling still occurs as expected on rank 0 for all platforms.
ghstack-source-id: 126478371
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D27758424
fbshipit-source-id: d30841c8dda77f51b09a58161e638657ef758e63
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55197
From initial user feedback, one unexpected difference between monitored_barrier impl and barrier is the "all or nothing" semantics.
In barrier, all ranks pass or they all fail. With monitored barrier however, if rank 1 is healthy, it will respond to both send and recv from rank 0, but rank 0 can later fail because rank 2 is stuck. In this case, rank 1 will move forward out of the barrier.
This change makes it so that if a rank fails in monitored barrier, all other ranks in monitored barrier will also fail. It does so by the following process, similar to acknowledgements:
1. Nonzero ranks call send()
2. Nonzero ranks call recv()
3. Rank 0 calls recv(); if this succeeds, rank 0 has acknowledged rank N as healthy
4. Once all ranks are acknowledged as healthy, rank 0 calls send() to all nonzero ranks to unblock them
Modified unittests to ensure the all or nothing failure behavior
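A hedged Python sketch of the acknowledgement scheme above (the actual implementation is inside the Gloo process group in C++; this only mirrors the message flow):
```python
import torch
import torch.distributed as dist

def monitored_barrier_sketch():
    rank, world = dist.get_rank(), dist.get_world_size()
    t = torch.zeros(1)
    if rank != 0:
        dist.send(t, dst=0)          # announce "I'm here"
        dist.recv(t, src=0)          # block until rank 0 has heard from everyone
    else:
        for other in range(1, world):
            dist.recv(t, src=other)  # a failure here marks `other` as unresponsive
        for other in range(1, world):
            dist.send(t, dst=other)  # unblock the healthy ranks only once all have acked
```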
ghstack-source-id: 126413088
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D27523060
fbshipit-source-id: fa05e4f8ad8ae97fd6cb20da5c3a7ef76fd31de6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55660
Noticed this doc was missing clarification on nccl env vars that
init_process_group docs have. Also, specify default behavior when backend=None
is passed in.
ghstack-source-id: 126251116
Test Plan: Ci
Reviewed By: SciPioneer
Differential Revision: D27672208
fbshipit-source-id: 2e79d297174e135173bceb059450ea267367bde4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55010
Follow up change to add a flag to provide an option for monitored barrier to collect all the failed ranks and then throw instead of just throwing on the first one. This is useful as now monitored barrier will be able to pick up on all hanging ranks instead of just one.
This is done by passing in a flag `wait_all_ranks=True`.
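Usage sketch of the flag:
```python
import torch.distributed as dist

dist.monitored_barrier(wait_all_ranks=True)  # collect every unresponsive rank before raising
```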
ghstack-source-id: 125699839
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D27447787
fbshipit-source-id: ec23aee212060d9eb515ff8adc96c6a17822d1bb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53787
Per title, exposes a python-based monitored barrier API that we can use as part of debuggability and that may be useful for user applications.
ghstack-source-id: 125124315
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D26965127
fbshipit-source-id: 6c7826e63758462e3e5111f28cced54cba76a758
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53663
This adds the process group options as an optional argument to `new_group`
and `init_process_group`, which allows users to pass in an initialized
process group options object for Gloo and NCCL.
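A hedged usage sketch (the `pg_options` keyword is the spelling assumed here and may differ by version):
```python
import torch.distributed as dist

opts = dist.ProcessGroupNCCL.Options()
dist.init_process_group(backend="nccl", pg_options=opts)
```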
Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D26968857
Pulled By: wanchaol
fbshipit-source-id: 2ff73a009120b85e83ecde7c69956b731902abc2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53060
As title. We would like to use alternative pickler/unpickler
implementations, to make it possible to send objects over the wire that
are coming from a torch.package
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D26737317
Pulled By: suo
fbshipit-source-id: 6bdef9824e48ef657dcad72cc5a9114e6612ea4a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50625
Make API signatures consistent and provide default argument similar to
the tensor collectives.
ghstack-source-id: 120718121
Test Plan: CI
Reviewed By: wanchaol
Differential Revision: D25932012
fbshipit-source-id: d16267e236a65ac9d55e19e2178f9d9267b08a20
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49930
Certain store implementations don't work well when we use get() and
add() on the same key. To avoid this issue, we only use add() in the store
based barrier. The buggy store implementations can't be properly fixed due to
legacy reasons.
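A hedged sketch of an add()-only store barrier (key name and polling interval are illustrative):
```python
import time
from datetime import timedelta

def store_barrier(store, world_size, key="store_barrier_key", timeout=timedelta(seconds=300)):
    arrived = store.add(key, 1)          # increment-and-read; no get() on the same key
    deadline = time.monotonic() + timeout.total_seconds()
    while arrived < world_size:
        if time.monotonic() > deadline:
            raise RuntimeError("Timed out waiting for the store based barrier")
        time.sleep(0.01)
        arrived = store.add(key, 0)      # read the current count without changing it
    return True
```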
Test Plan:
1) unit tests.
2) waitforbuildbot
Reviewed By: osalpekar
Differential Revision: D25725386
fbshipit-source-id: 1535e2629914de7f78847b730f8764f92cde67e7
Summary:
On a multi-GPU node, the mapping between a rank and its corresponding GPU can differ.
Provide an optional parameter to specify the GPU device number for the
allreduce operation in the barrier function.
Add test cases to validate barrier device_ids.
Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>
Fixes https://github.com/pytorch/pytorch/issues/48110
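Usage sketch of the new optional parameter (NCCL backend):
```python
import torch
import torch.distributed as dist

dist.barrier(device_ids=[torch.cuda.current_device()])
```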
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49069
Reviewed By: mrshenli
Differential Revision: D25658528
Pulled By: rohan-varma
fbshipit-source-id: 418198b6224c8c1fd95993b80c072a8ff8f02eec
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49694
The store based barrier introduced in
https://github.com/pytorch/pytorch/pull/49419 broke for certain store types.
This is a quick fix to resolve the issues for other store types.
ghstack-source-id: 119006874
Test Plan: 1) waitforbuildbot
Reviewed By: ppwwyyxx, rohan-varma
Differential Revision: D25668404
fbshipit-source-id: 751fb8b229ad6f50ee9c50f63a70de5a91c9eda5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49419
As described in https://github.com/pytorch/pytorch/issues/48110, the
newly introduced `barrier()` in `init_process_group` messes up NCCL
communicator state since it uses a bunch of default devices to perform an
allreduce which simulates a barrier(). As a result, subsequent NCCL operations
might not behave as expected.
ghstack-source-id: 118861776
Test Plan:
1) unit test added.
2) waitforbuildbot
Reviewed By: mrshenli
Differential Revision: D25566550
fbshipit-source-id: ab083b67b634d7c515f4945deb228f959b27c936
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49131
Users frequently assume the correct range of ranks is 1 ...
`world_size`. This PR updates the docs to indicate that the correct rank range
users should specify is 0 ... `world_size` - 1.
Test Plan: Rendering and Building Docs
Reviewed By: mrshenli
Differential Revision: D25410532
fbshipit-source-id: fe0f17a4369b533dc98543204a38b8558e68497a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48767
As part of investigating
https://github.com/pytorch/pytorch/issues/48464, I realized some weird
inconsistency in how we use `_default_pg` and `group.WORLD`. `group.WORLD`
apparently was an `object()` and never changed despite `_default_pg` changing.
In this sense, `group.WORLD` was being used as a constant to refer to the default
pg, but wasn't of type PG at all. In fact the passed in group is also compared
via `==` to `group.WORLD` in many places, and it just worked since the default
argument was `group.WORLD`.
To clean this up, I got rid of `_default_pg` completely and instead used
`group.WORLD` as the default pg throughout the codebase. This also fixes the
documentation issues mentioned in
https://github.com/pytorch/pytorch/issues/48464.
#Closes: https://github.com/pytorch/pytorch/issues/48464
ghstack-source-id: 118459779
Test Plan: waitforbuildbot
Reviewed By: rohan-varma
Differential Revision: D25292893
fbshipit-source-id: 9a1703c71610aee2591683ab60b010332e05e412
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48872
Using NCCL communicators concurrently is not safe and this is
documented in NCCL docs.
However, this is not documented in PyTorch and we should add documentation for
ProcessGroupNCCL so that users are aware of this limitation.
ghstack-source-id: 118148014
Test Plan: waitforbuildbot
Reviewed By: rohan-varma
Differential Revision: D25351778
fbshipit-source-id: f7f448dc834c47cc1244f821362f5437dd17ce77
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43932
Adds some basic examples to the documentation for each of the newly added
object-based collectives.
ghstack-source-id: 117965966
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D23441838
fbshipit-source-id: 91344612952cfcaa71f08ccf2a2c9ed162ca9c89
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43930
Closes #23232. As part of addressing #23232, this PR adds support for scatter_object_list, which is an API to scatter arbitrary picklable objects to all the other ranks.
The implementation approach follows a similar approach as https://github.com/pytorch/pytorch/pull/42189. The result of the `scatter` is stored as the first element of `scatter_object_output_list`, and the src rank is expected to provide an input list `scatter_object_input_list` which contains the objects to scatter.
Note that this API requires 1 broadcast and 2 scatters. This is because we must communicate the maximum object size to be scattered, which only the src rank knows about. After that, we also need to communicate the objects themselves as well as the true sizes of the object.
Note that the API is designed to match the tensor-based collectives other than supporting async_op. For now, it is a blocking call. If we see demand to support async_op, we will have to make more progress on merging work/future to support this.
It only works for Gloo because NCCL doesn't support scatter.
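Usage sketch of the API:
```python
import torch.distributed as dist

out = [None]
inputs = [{"rank": r} for r in range(dist.get_world_size())] if dist.get_rank() == 0 else None
dist.scatter_object_list(out, inputs, src=0)
# out[0] now holds the object scattered to this rank (Gloo backend, per the note above)
```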
ghstack-source-id: 117904065
Reviewed By: mrshenli
Differential Revision: D23430686
fbshipit-source-id: f033b89cd82dadd194f2b036312a98423449c26b
Summary:
Calling torch.distributed.irecv(src=None) fails with "The global rank None is not part of the group". This change calls recv_anysource if src is None. Tested locally with MPI backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47137
Reviewed By: heitorschueroff
Differential Revision: D25292656
fbshipit-source-id: beb018ba0b676924aeaabeb4a4d6acf96e4a1926
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47797
NCCL p2p tests had hang issues before; the reason is that there were some unexpected context switches. For example, process 1, which is supposed to only use GPU 1, could use GPU 0 as a result of not explicitly setting the device.
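A hedged sketch of the fix pattern: pin the CUDA device explicitly so no rank's NCCL p2p call silently lands on GPU 0 (a one-GPU-per-rank layout is assumed):
```python
import torch
import torch.distributed as dist

rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())
t = torch.ones(1, device="cuda")
if rank == 0:
    dist.send(t, dst=1)
elif rank == 1:
    dist.recv(t, src=0)
```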
ghstack-source-id: 116461969
Test Plan: waitforsandcastle
Reviewed By: jiayisuse
Differential Revision: D24863808
fbshipit-source-id: 92bd3a4874be8334210c7c8ee6363648893c963e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47644
Minor Update to the init_process_group docs.
ghstack-source-id: 116441798
Test Plan: CI
Reviewed By: jiayisuse, mrshenli
Differential Revision: D24633432
fbshipit-source-id: fbd38dab464ee156d119f9f0b22ffd0e416c4fd7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46897
These APIs implicitly assumed that the GPU for a rank equals the rank index, but
that is not necessarily true. For example, the first GPU could be used for a
different purpose and rank 0 could use GPU 1, rank 1 uses GPU 2, etc. Thus, we
mandate that the user specify the device to use via `torch.cuda.set_device()`
before making calls to this API. This expectation should be okay since we
clearly document it, and we expect the user to set this for
DistributedDataParallel as well.
Also adds/tidies up some documentation.
ghstack-source-id: 115359633
Test Plan: Modified unittests
Reviewed By: divchenko
Differential Revision: D24556177
fbshipit-source-id: 7e826007241eba0fde3019180066ed56faf3c0ca
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46856
Add reference to NCCL_ASYNC_ERROR_HANDLING in the pytorch docs,
similar to how NCCL_BLOCKING_WAIT is currently described.
ghstack-source-id: 115186877
Test Plan: CI, verifying docs change
Reviewed By: jiayisuse
Differential Revision: D24541822
fbshipit-source-id: a0b3e843bc6392d2787a4bb270118f2dfda5f4ec
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45994
Send/Recv tests were disabled because of the https://github.com/pytorch/pytorch/issues/42517. With that issue fixed, this diff enables those tests.
ghstack-source-id: 113970569
Test Plan: waitforsandcastle
Reviewed By: jiayisuse
Differential Revision: D24172484
fbshipit-source-id: 7492ee2e9bf88840c0d0086003ce8e99995aeb91
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44921
This diff adds support for Process Group point-to-point operations on NCCL backend based on ncclSend/ncclRecv. See https://github.com/pytorch/pytorch/issues/43995 for more context.
ghstack-source-id: 113592785
Test Plan: unittest
Reviewed By: jiayisuse
Differential Revision: D23709848
fbshipit-source-id: cdf38050379ecbb10450f3394631317b41163258
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45181
`init_process_group` and `new_group` update a bunch of global
variables after initializing the actual process group. As a result, there is a
race that after initializing the process group on say rank 0, if we immediately
check the default process group on rank 1 (say via RPC), we might actually get
an error since rank 1 hasn't yet updated its _default_pg variable.
To resolve this issue, I've added barrier() at the end of both of these calls.
This ensures that once these calls return we are guaranteed about correct
initialization on all ranks.
Since these calls are usually done mostly during initialization, it should be
fine to add the overhead of a barrier() here.
#Closes: https://github.com/pytorch/pytorch/issues/40434, https://github.com/pytorch/pytorch/issues/40378
ghstack-source-id: 112923112
Test Plan:
Reproduced the failures in
https://github.com/pytorch/pytorch/issues/40434 and
https://github.com/pytorch/pytorch/issues/40378 and verified that this PR fixes
the issue.
Reviewed By: mrshenli
Differential Revision: D23858025
fbshipit-source-id: c4d5e46c2157981caf3ba1525dec5310dcbc1830
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44000
This wasn't documented, so add a doc saying all ranks are used when
ranks=None
ghstack-source-id: 111206308
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D23465034
fbshipit-source-id: 4c51f37ffcba3d58ffa5a0adcd5457e0c5676a5d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43887
As part of addressing #23232, this PR adds support for `broadcast_object_list` which is an API to broadcast arbitrary picklable objects to all the other ranks. This has been a long-requested feature, so would be good for Pytorch to natively support this.
The implementation approach follows a similar approach as https://github.com/pytorch/pytorch/pull/42189. The input is a list of objects to be broadcasted and it is in place, meaning all ranks part of the group will have their input list modified to contain the broadcasted objects from the src rank.
Note that the API is designed to match the tensor-based collectives other than supporting async_op. For now, it is a blocking call. If we see demand to support async_op, we will have to make more progress on merging work/future to support this.
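Usage sketch of the in-place semantics described above:
```python
import torch.distributed as dist

objs = ["foo", {"k": 1}, [1, 2, 3]] if dist.get_rank() == 0 else [None, None, None]
dist.broadcast_object_list(objs, src=0)
# every rank's `objs` now holds the values from the src rank
```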
ghstack-source-id: 111180436
Reviewed By: mrshenli
Differential Revision: D23422577
fbshipit-source-id: fa700abb86eff7128dc29129a0823e83caf4ab0e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42189
Rehash of https://github.com/pytorch/pytorch/pull/28811, which was several months old.
As part of addressing https://github.com/pytorch/pytorch/issues/23232, this PR adds support for the following APIs:
`allgather_object` and `gather_object` to support gather/allgather of generic, picklable Python objects. This has been a long-requested feature so PyTorch should provide these helpers built-in.
The methodology is what is proposed in the original issue:
1) Pickle object to ByteTensor using torch.save
2) Comm. tensor sizes
3) Copy local ByteTensor into a tensor of maximal size
4) Call tensor-based collectives on the result of (3)
5) Unpickle back into object using torch.load
Note that the API is designed to match the tensor-based collectives, other than supporting `async_op`. For now, it is a blocking call. If we see demand to support `async_op`, we will have to make more progress on merging work/future to support this.
If this is a suitable approach, we can support `scatter`, `broadcast` in follow up PRs.
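A hedged Python sketch of the five steps above (simplified; the real helpers in distributed_c10d handle devices, groups, and per-rank bookkeeping more carefully):
```python
import io
import torch
import torch.distributed as dist

def all_gather_object_sketch(obj, group=None):
    buf = io.BytesIO()
    torch.save(obj, buf)                                             # 1) pickle via torch.save
    local = torch.frombuffer(bytearray(buf.getvalue()), dtype=torch.uint8)
    size = torch.tensor([local.numel()], dtype=torch.long)
    world = dist.get_world_size(group=group)
    sizes = [torch.zeros_like(size) for _ in range(world)]
    dist.all_gather(sizes, size, group=group)                        # 2) communicate sizes
    max_len = int(max(s.item() for s in sizes))
    padded = torch.zeros(max_len, dtype=torch.uint8)
    padded[: local.numel()] = local                                  # 3) copy into max-size tensor
    out = [torch.zeros(max_len, dtype=torch.uint8) for _ in range(world)]
    dist.all_gather(out, padded, group=group)                        # 4) tensor-based collective
    return [torch.load(io.BytesIO(bytes(o[: int(s.item())].tolist())))
            for o, s in zip(out, sizes)]                             # 5) unpickle via torch.load
```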
ghstack-source-id: 109322433
Reviewed By: mrshenli
Differential Revision: D22785387
fbshipit-source-id: a265a44ec0aa3aaffc3c6966023400495904c7d8
Summary:
PyTorch c10d originally supported only the built-in c10d backends, such as
nccl/gloo/mpi. This patch extends the c10d capability to support dynamically
loading third-party communication libraries that are derived from the ProcessGroup base class.
The related RFC is: https://github.com/pytorch/pytorch/issues/27955
This way, users just need to specify a third-party c10d backend name when invoking
torch.distributed.init_process_group(). The proposed logic will try to load the corresponding
c10d backend cpp extension automatically. For how to develop a new third-party c10d backend
through a cpp extension, please refer to test/cpp_extensions/cpp_c10d_extension.cpp.
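A hedged usage sketch; `Backend.register_backend` is today's public hook, and whether it matches this PR's exact loading mechanism is an assumption:
```python
import torch.distributed as dist

def _create_third_party_pg(store, rank, world_size, timeout):
    raise NotImplementedError  # a real extension returns a ProcessGroup built by its cpp extension

dist.Backend.register_backend("my_backend", _create_third_party_pg)
# dist.init_process_group(backend="my_backend", rank=..., world_size=..., init_method=...)
```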
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28068
Differential Revision: D19174838
Pulled By: agolynski
fbshipit-source-id: 3409a504a43ce7260e6f9d1207c00e87471fac62
Summary:
I think this warning isn't true anymore, and the NCCL backend works without PyTorch needing to be built from source.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34051
Differential Revision: D20195310
Pulled By: ezyang
fbshipit-source-id: 14f879a8c43ea5efdbdf0f638792ea2b90011f4a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33434
Reland of https://github.com/pytorch/pytorch/pull/33325, since the
unit test was flaky and failed on land.
To ensure that the test is not flaky, I bumped the timeout so the rendezvous
does not timeout (timing out the rendezvous in 1s led to the flakiness). I also
generalized our mechanism for retrying on errors to include retrying on errors
due to timeout in rendezvous.
ghstack-source-id: 98558377
Test Plan: Added UT test_tcp_store_timeout_set
Differential Revision: D19935390
fbshipit-source-id: 56ccf8c333dd2f954a33614d35cd1642d4e9473a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33325
Closes https://github.com/pytorch/pytorch/issues/32924. There was a bug where for TCPStore, we would not respect the timeout passed into `init_process_group` while constructing the TCPStore. Instead, we'd set the timeout after the rendezvous created the store, meaning that we used the default timeout of 300s while connecting to the server. This diff passes the timeout passed into `init_process_group` to rendezvous so that it can be passed into the constructor for TCPStore, so that we can use the right timeout at construction time.
Question: Should we make this change for FileStore as well? Currently the FileStore constructor does not take in a timeout at all.
ghstack-source-id: 98401875
Test Plan: Added a UT
Differential Revision: D19871946
fbshipit-source-id: dd002180c4c883216645b8a97cc472c6116ac117
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29059
This is a resubmit of reverted diff D18209289 ( PR #28857 ).
Test Plan:
buck test caffe2/test:c10d
buck test caffe2/test:distributed_gloo
Reviewed By: pietern
Differential Revision: D18277097
fbshipit-source-id: aecfd7206d70829f0cac66182bf02fccee410fed
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28634
caveat 1: this only works in sync mode.
caveat 2: this is going to go away and be replaced by a C++ implementation
Test Plan: buck test caffe2/test:distributed_gloo -- test_all_gather_coalesced
Reviewed By: mrshenli
Differential Revision: D18123422
fbshipit-source-id: cfb9950d5d54c6181a5240e7cc9fed88ed47f5d9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28226
# Goal
Rendezvous step should be the first step not only for `init_process_group` but also for `init_model_parallel`.
The road block is that there is a special step in `init_process_group` where the `rank` and `world_size` arguments passed to `init_process_group(..)` are appended to the `init_method` URL string.
We need to make this argument appending step common and re-usable for both `init_process_group` and `init_model_parallel`.
# Solution
- Put argument appending inside of `rendezvous` function.
- Remove manual `init_method` url construction. Delegate the responsibility to the `rendezvous` function.
- Use the `rendezvous` function for any `RpcAgent`.
Test Plan:
```
buck test mode/dev-nosan caffe2/test:c10d
```
```
buck test mode/dev-nosan caffe2/test:rpc_fork -- test_invalid_names
buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_worker_id
```
```
buck test mode/dev-nosan caffe2/torch/fb/distributed/pytorch/tests:test_rpc -- test_sync_rpc
```
```
buck test mode/dev-nosan caffe2/torch/fb/rendezvous:zeus_test
```
```
buck test mode/dev-nosan //caffe2/torch/fb/distributed/modules/tests:test_sharded_pairwise_attention_pooling -- test_single_trainer_multiple_pss
```
Differential Revision: D5524494
fbshipit-source-id: 50be58ec3c928621b0874b044ef4a1640534d8ef
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27850
Many of these are real problems in the documentation (i.e., link or
bullet point doesn't display correctly).
Test Plan: - built and viewed the documentation for each change locally.
Differential Revision: D17908123
Pulled By: zou3519
fbshipit-source-id: 65c92a352c89b90fb6b508c388b0874233a3817a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27224
As part of adding error handling to NCCL, we are now able to specify a
timeout for operations using ProcessGroupNCCL. However, this timeout had a
default of 10 seconds and didn't respect the timeout specified in
init_process_group.
In this change, I've ensured we pass the appropriate timeout to
ProcessGroupNCCL.
ghstack-source-id: 91283548
Test Plan:
Added unit test to verify timeout passed in to init_process_group is
respected.
Differential Revision: D17717992
fbshipit-source-id: c73320187f1f3b2693ba1e177d80646e282d01a2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26912
The group name is used as a prefix in the c10d store, and without a consistent name the process group cannot be initialized.
When a process group doesn't have an explicit name (only the WORLD (default) process group can have an explicit name), we use the global _group_counter to generate the name. We need to reset the counter on destruction so a consistent value is generated when we re-create process groups after some trainers recover from failure.
Test Plan: existing tests passed
Reviewed By: mrshenli
Differential Revision: D17594268
fbshipit-source-id: 17f4d2746584dadaa5d468085d871ff3e95a1c84
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25905
Now that we can detect and recover from failures in NCCL we should
allow processes that are started at different times (and perhaps have
had previous NCCL process group instances), to eventually be part of
the same process group. Keeping track of group names in global
variables prevents that, because the processes will be out of sync.
This commit removes the global group name maps and defers
responsibility of isolating access to the same store from multiple
process groups to the store itself. Users can use `c10d::PrefixStore`
to derive new store instances whose keyspace is scoped to some
prefix. Functionally, this is identical to keeping a global map and
using a group name, but also gives more flexibility to the front-end
API to reset state and have processes that have started at different
times to join the same process group.
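A hedged sketch of scoping a store's keyspace with `c10d::PrefixStore` from Python (the store choice is illustrative):
```python
import torch.distributed as dist

base = dist.HashStore()                      # any c10d store works here
scoped = dist.PrefixStore("group_a", base)   # keys set/get through `scoped` are namespaced
scoped.set("rank0_ready", "1")
print(scoped.get("rank0_ready"))             # b'1'; other groups would use a different prefix
```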
ghstack-source-id: 89804865
Test Plan: Tests pass.
Differential Revision: D17281416
fbshipit-source-id: eab3b48463a9b0ef24aedeca76e2bb970b9f33ef
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25575
For both scatter and gather, only the source and destination rank,
respectively, need to supply a list of tensors. The `scatter_list` and
`gather_list` arguments were mandatory, however, and this has resulted
in some confusion. This commit makes both the `scatter_list` and
`gather_list`, and the `src` and `dst` arguments optional.
Closes #25463.
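Usage sketch of the relaxed arguments:
```python
import torch
import torch.distributed as dist

t = torch.ones(1) * dist.get_rank()
if dist.get_rank() == 0:
    out = [torch.zeros(1) for _ in range(dist.get_world_size())]
    dist.gather(t, gather_list=out, dst=0)
else:
    dist.gather(t, dst=0)  # non-destination ranks may now omit gather_list entirely
```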
Test Plan: Imported from OSS
Differential Revision: D17164253
fbshipit-source-id: a16bc208c87a1c96163c1a86d4a7ca8634a26f95
Summary:
addresses https://github.com/pytorch/pytorch/issues/21640 for CPU tensors and the Gloo backend.
Questions:
- ~~currently takes `AllreduceOptions`, since all of the options are the same. Would it be better to make a new `AllreduceCoalescedOptions` class?~~
- ~~I decided to inherit from `ProcessGroupGloo::AsyncWork` instead of `AsyncAllreduceWork` to shorten the inheritance chain a bit and for consistency with existing classes. However, this means that the two `getFunction` methods are copy-pasted. Would inheriting from `AsyncAllreduceWork` be preferable?~~
- ~~should the work class be named `AsyncCoalescedAllreduceWork` or `AsyncAllreduceCoalescedWork`?~~
thank you!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24949
Differential Revision: D17055580
Pulled By: mrshenli
fbshipit-source-id: e63b5fcaec6021053ea960776a09ee8cf11d1ec2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19033
torch.distributed.init_process_group() has had many parameters added, but the contract isn't clear. Adding documentation, asserts, and explicit args should make this clearer to callers and more strictly enforced.
Reviewed By: mrshenli
Differential Revision: D14813070
fbshipit-source-id: 80e4e7123087745bed436eb390887db9d1876042
Summary:
Previously, MPI process groups were created for all processes, even if
they were not part of the created group. Their MPI_Comm member field
would be MPI_COMM_NULL and they would ignore any calls. Their rank and
size were identical to that of the global process group and they had a
special groupRank and groupSize field to capture the _real_ rank.
This also meant asymmetry with other process group types, where creating
a new group would either return the process group OR
GroupMember.NON_GROUP_MEMBER. For the MPI process group, it would always
return a process group and an additional check was needed to verify
whether or not a process was indeed part of a process group or not.
This commit changes this such that every MPI process group is a valid
process group, and by extension that we no longer have to special case
MPI to determine whether or not a process is part of a group. Now, if
the value returned by `new_group` is GroupMember.NON_GROUP_MEMBER, the
process is not a member, otherwise it is.
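Usage sketch of the unified semantics described above:
```python
import torch.distributed as dist

group = dist.new_group(ranks=[0, 1])
if group is dist.GroupMember.NON_GROUP_MEMBER:
    pass                        # this process is not in the subgroup; skip its collectives
else:
    dist.barrier(group=group)
```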
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14809
Differential Revision: D14887937
Pulled By: pietern
fbshipit-source-id: c5bf86d3b33e524cc5004ee68e30103178fa491d
Summary:
closes #16520
Hi pietern, I am not sure if this is the expected way to pass timeout to `Store`, could you please help take a look? Thanks!
Questions:
1. How do I write tests for this? I wanted to do something like `test_barrier_timeout_global`, but it seems I need to set the pg's timeout larger than the `Store`'s default timeout (3 min) to see a difference, which is too long for a unit test. And I do not want to change the `Store`'s default timeout either. Any suggestion?
2. Should I also propagate timeout configuration down to `PrefixStore` in `_new_process_group_helper`?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16571
Differential Revision: D13954527
Pulled By: mrshenli
fbshipit-source-id: 77f2653903f24255207233eb298f7c0321119a87
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18595
There is no need to force the backend to be the same as the global
process group, as long as the backend is "nccl" or "gloo".
Reviewed By: mrshenli
Differential Revision: D14657204
fbshipit-source-id: 868817b9f219e3be8db0761a487f0027ed46663b
Summary:
This commit adds the `c10d::Reducer` class that hooks into autograd
and performs gradient bucketing and reduction. These are the core
parts of `nn.parallel.DistributedDataParallel` that up to now were
only usable for CUDA models.
This should enable the following:
* Distributed data parallelism for models defined using the C++ frontend.
* Allow overlap of gradient computation and reduction for non-CUDA models.
* Enable distributed data parallelism for models with some unused parameters.
This does not include any logic for computing bucket assignment, which
can be done separately; either by observing autograd execution order
(this is what Apex does), or by assigning buckets based on some
maximum byte size, or both.
Also see #17757 and #13273.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18251
Reviewed By: mrshenli
Differential Revision: D14571899
Pulled By: pietern
fbshipit-source-id: 20f95eefd288dfe8cfffe0a28ca22fa7c9c3cd4c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18598
ghimport-source-id: c74597e5e7437e94a43c163cee0639b20d0d0c6a
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18598 Turn on F401: Unused import warning.**
This was requested by someone at Facebook; this lint is turned
on for Facebook by default. "Sure, why not."
I had to noqa a number of imports in __init__. Hypothetically
we're supposed to use __all__ in this case, but I was too lazy
to fix it. Left for future work.
Be careful! flake8-2 and flake8-3 behave differently with
respect to import resolution for # type: comments. flake8-3 will
report an import unused; flake8-2 will not. For now, I just
noqa'd all these sites.
All the changes were done by hand.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D14687478
fbshipit-source-id: 30d532381e914091aadfa0d2a5a89404819663e3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16746
As titled. We use a special URL scheme, elasticzeus, for elastic Zeus so that we don't need to change the public interface of init_process_group.
Reviewed By: aazzolini, soumith
Differential Revision: D13948151
fbshipit-source-id: 88939dcfa0ad93467dabedad6905ec32e6ec60e6
Summary:
When I wrote the frontend API, it was designed around not letting users use the default group directly in any functions. It should really be private.
All collectives are supposed to use either group.WORLD or anything that comes out of new_group. That was the initial design.
We need to add a TODO on removing group.WORLD one day. It exists for backward compatibility reasons and adds lots of complexity.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14767
Reviewed By: pietern
Differential Revision: D13330655
Pulled By: teng-li
fbshipit-source-id: ace107e1c3a9b3910a300b22815a9e8096fafb1c
Summary:
These were not enabled after adding support in the Gloo backend. The
argument checks in ProcessGroupGloo raised an error in two cases:
* If the input tensor list to scatter was ``[None]`` on processes other
than the source process.
* If the output tensor list to gather was ``[None]`` on processes other
than the destination process.
This commit prepares these arguments explicitly instead of boxing them
at the process group call site.
This fixes #14536.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14572
Differential Revision: D13272812
Pulled By: pietern
fbshipit-source-id: 12cb0d85ec92f175365cbada585260f89330aad8
Summary:
This fixes two things:
(1) The NCCL backend didn't support 2 or more groups. This is because we need a group name in the ProcessGroupNCCL class to keep track of the ProcessGroup ID within that group name, and also the NCCL unique ID within that group name and process group ID. Otherwise, different processes will create different NCCL PGs in different orders and can clash on these names. This fixes the NCCL problem.
(2) When using new_group, each rank should enter this function and update its global group name counter to ensure that every rank always operates on the same group name.
With both fixes: repro code in: https://github.com/pytorch/pytorch/issues/14528 should work with both NCCL and Gloo backends.
```
tengli@learnfair096:~$ python -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --master_port=30000 ~/github_issues/nccl_group.py
rank: 0 - val: 6.0
rank: 2 - val: 6.0
rank: 3 - val: 6.0
rank: 1 - val: 6.0
rank: 4 - val: 22.0
rank: 6 - val: 22.0
rank: 5 - val: 22.0
rank: 7 - val: 22.0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14529
Differential Revision: D13253434
Pulled By: teng-li
fbshipit-source-id: 8eb45882b996b06d951fc9a306d5de86a42e8b84
Summary:
Fixing: https://github.com/pytorch/pytorch/issues/14446
This was a supported behavior in old torch.distributed. We want to support it in the new release.
Tests should cover all combinations of scenarios where either env vars or args are set for rank, size, or both.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14494
Differential Revision: D13253433
Pulled By: teng-li
fbshipit-source-id: c05974d84f1bdf969f74ec45763e11a841fe4848
Summary:
This function is only implemented for the subclasses where it makes
sense. If it's not overridden it will throw an error. Having this
function removes the need for a pointer passing hack to pass the
source rank of a recv operation back to the caller. Instead, the
caller can now call `source_rank` on the work object and achieve
the same result.
Closes #11804.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14453
Differential Revision: D13230898
Pulled By: pietern
fbshipit-source-id: ef38f48bfaca8ef9a364e5be122951bafc9f8e49
Summary:
This applies to the gloo backend only. Timeout support for the NCCL and
MPI backends is tracked in issues #14371 and #14372 respectively.
When creating a new process group (either the global one or any subgroup
created through `new_group`) you can specify a timeout keyword
argument (of type datetime.timedelta). This timeout applies to all
collective operations executed against that process group, such that any
operation taking longer than the timeout will throw a runtime error.
Using a different, better catchable error type is tracked in #14433.
This fixes #14376.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14435
Differential Revision: D13234317
Pulled By: pietern
fbshipit-source-id: 973993b67994dc64861c0977cbb6f051ec9d87f6
Summary:
This will address https://github.com/pytorch/pytorch/issues/13574
This error message should be more informative to the user for all the non-multi-GPU ops, since we always python-bind to the multi-GPU ops.
test_distributed should cover everything. Also tested both RuntimeErrors.
```
>>> a = torch.ByteTensor([])
>>> b = [a, a]
>>> dist.all_reduce(b)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/private/home/tengli/pytorch/torch/distributed/distributed_c10d.py", line 809, in all_reduce
_check_single_tensor(tensor, "tensor")
File "/private/home/tengli/pytorch/torch/distributed/distributed_c10d.py", line 207, in _check_single_tensor
"to be a torch.Tensor type".format(param_name))
RuntimeError: Invalid function argument. Expecting parameter: tensor to be a torch.Tensor type
>>> b = ["b"]
>>> dist.all_gather(b, a)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/private/home/tengli/pytorch/torch/distributed/distributed_c10d.py", line 1006, in all_gather
_check_tensor_list(tensor_list, "tensor_list")
File "/private/home/tengli/pytorch/torch/distributed/distributed_c10d.py", line 225, in _check_tensor_list
"to be a List[torch.Tensor] type".format(param_name))
RuntimeError: Invalid function argument. Expecting parameter: tensor_list to be a List[torch.Tensor] type
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14204
Differential Revision: D13131526
Pulled By: teng-li
fbshipit-source-id: bca3d881e41044a013a6b90fa187e722b9dd45f2
Summary:
Also add docs for get_backend, Backend, and reduce_op
fixes #11803
cc pietern apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11830
Differential Revision: D9927991
Pulled By: SsnL
fbshipit-source-id: a2ffb70826241ba84264f36f2cb173e00b19af48
Summary:
Clean it up from my queue:
https://github.com/pytorch/pytorch/issues/12721
```
>>> torch.distributed.init_process_group(backend="tcp")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/private/home/tengli/pytorch/torch/distributed/distributed_c10d.py", line 275, in init_process_group
backend = DistBackend(backend)
File "/private/home/tengli/pytorch/torch/distributed/distributed_c10d.py", line 55, in __new__
raise ValueError("TCP backend has been deprecated. Please use "
ValueError: TCP backend has been deprecated. Please use Gloo or MPI backends for collective operations on CPU tensors.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13596
Differential Revision: D12931196
Pulled By: teng-li
fbshipit-source-id: bb739b107ad7454e2e0a17430087161fedd4c392
Summary:
The existing default timeout was set at 10 seconds, which is too low
for asynchronous tasks that depend on a barrier to resynchronize.
Having a single timeout for all operations is not ideal and this will
be addressed in future commits.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13056
Reviewed By: teng-li
Differential Revision: D10558746
Pulled By: pietern
fbshipit-source-id: d857ea55b1776fc7d0baf2efd77951b5d98beabb
Summary:
I have no idea how to run distributed tests locally so I'll let CI do this. Hopefully everything still works with `IntEnum`.
cc mcarilli
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11715
Reviewed By: pietern
Differential Revision: D9889646
Pulled By: SsnL
fbshipit-source-id: 1e2a487cb6fe0bd4cc67501c9d72a295c35693e2
Summary:
The old `torch.distributed` will go to `torch.distributed.deprecated`
The old DDP will go to `torch.nn.parallel.deprecated`
Now `torch.nn.parallel.DDP` will use c10d DDP
Now `torch.distributed` will use C10d frontend API
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11405
Reviewed By: pietern
Differential Revision: D9733733
Pulled By: teng-li
fbshipit-source-id: d6a3f3e73f8d3a7fcb1f4baef53c78063b8cbb08