Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47896
Per title
ghstack-source-id: 116710141
Test Plan: CI
Reviewed By: osalpekar
Differential Revision: D24943323
fbshipit-source-id: 7bf33ce3a021b9750b65e0c08f602c465cd81d28
Summary:
If world_size is less than or equal to the number of GPUs available,
then each rank can be mapped directly to the corresponding GPU.
This fixes the issue referenced in https://github.com/pytorch/pytorch/issues/45435 and https://github.com/pytorch/pytorch/issues/47629
For world_size = 3 and 8 available GPUs, the rank-to-GPU mapping used to be 0, 2, 4. With the barrier introduced by PR https://github.com/pytorch/pytorch/issues/45181, the tensors in the barrier are mapped to cuda:0, cuda:1, cuda:2 while the tensors in the actual test cases are mapped to cuda:0, cuda:2, cuda:4, resulting in different streams and leading to a timeout. This issue is specific to the default process group; it is not observed in a new process group, since the streams are created again after the initial barrier call.
This patch maps each rank to the corresponding GPU when world_size is less than or equal to the number of GPUs, in this case 0, 1, 2.
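A minimal sketch of the mapping described above (the helper name is illustrative, not the actual test code):
```
import torch

def rank_to_gpu(rank, world_size):
    num_gpus = torch.cuda.device_count()
    if world_size <= num_gpus:
        # One GPU per rank: rank i uses cuda:i, matching the devices used by
        # the initial barrier, so both run on the same streams.
        return rank
    # More ranks than GPUs: wrap around.
    return rank % max(num_gpus, 1)
```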
Note: The barrier function in distributed_c10d.py should include a new parameter
to specify the tensor or rank-to-GPU mapping. In that case, this patch will be
redundant but harmless, since the tests can specify tensors on the appropriate
GPUs.
Fixes https://github.com/pytorch/pytorch/issues/47629
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47898
Reviewed By: smessmer
Differential Revision: D24956021
Pulled By: rohan-varma
fbshipit-source-id: a88257f22a7991ba36566329766c106d3360bb4e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46804
As per our design in https://github.com/pytorch/pytorch/issues/44827,
changing the API such that the user places modules on the appropriate devices
instead of having `balance` and `devices` parameters that decide this.
This design allows us to use RemoteModule in the future.
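A hedged sketch of the resulting usage (module sizes and the wrapper step are illustrative, not the exact API): the user places each stage on a device up front, and the pipeline reads placement from the modules themselves.
```
import torch.nn as nn

# The user decides placement explicitly, instead of passing balance/devices.
stage0 = nn.Linear(16, 16).to("cuda:0")
stage1 = nn.Linear(16, 16).to("cuda:1")
model = nn.Sequential(stage0, stage1)
# `model` is then handed to the pipeline wrapper, which infers the devices
# from where the submodules already live.
```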
ghstack-source-id: 116479842
Test Plan: waitforbuildbot
Reviewed By: mrshenli
Differential Revision: D24524219
fbshipit-source-id: 9973172c2bb7636572cdc37ce06bf8368638a463
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47797
NCCL p2p tests previously had hang issues; the reason is that there were some unexpected CUDA context switches. For example, process 1, which is supposed to use only GPU 1, could end up using GPU 0 because the device was not set explicitly.
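A hedged sketch of the fix pattern (the helper name is illustrative): each test process pins itself to its own GPU before touching CUDA.
```
import torch

def setup_device(rank):
    # Explicitly select this process's GPU so no kernel or communication
    # accidentally runs on GPU 0 from another rank.
    torch.cuda.set_device(rank)
    return torch.device("cuda", rank)
```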
ghstack-source-id: 116461969
Test Plan: waitforsandcastle
Reviewed By: jiayisuse
Differential Revision: D24863808
fbshipit-source-id: 92bd3a4874be8334210c7c8ee6363648893c963e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47394
This is a preliminary refactor for the next diff that will add an
additional flag to control whether we throw a StopIteration or not. We
basically move the flags for DDP uneven inputs into a simple class.
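A minimal Python sketch of the idea (names are illustrative, not the actual class introduced by this diff):
```
class _JoinConfig:
    """Groups the DDP uneven-inputs flags that previously lived as separate attributes."""

    def __init__(self, enable=False, throw_on_early_termination=False):
        self.enable = enable
        # Placeholder for the additional flag the follow-up diff adds.
        self.throw_on_early_termination = throw_on_early_termination
```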
ghstack-source-id: 116428177
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D24739509
fbshipit-source-id: 96bf41bd1c02dd27e68f6f37d08e22f33129b319
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47470
Reland of https://github.com/pytorch/pytorch/pull/47206, which was reverted due to failing multigpu tests.
The fix to make multigpu tests work is to compare against `torch.tensor([world_size, 0])`, not hardcode `torch.tensor([2, 0])`, which assumes a world size of 2.
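A hedged sketch of the assertion change (assumes an initialized process group; the assertion call is illustrative):
```
import torch
import torch.distributed as dist

world_size = dist.get_world_size()
expected = torch.tensor([world_size, 0])  # instead of hardcoding torch.tensor([2, 0])
# self.assertEqual(result, expected)
```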
Original commit description:
As discussed offline with pritamdamania87, add testing to ensure per-iteration and rank-dependent control flow works as expected in DDP with find_unused_parameters=True.
ghstack-source-id: 115993934
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D24767893
fbshipit-source-id: 7d7a2449270eb3e72b5061694e897166e16f9bbc
Summary:
Added a convenience function that allows users to load models without DP/DDP from a DP/DDP state dict.
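A hedged sketch of the underlying idea (the helper name here is illustrative, not the function added by this PR): DP/DDP prefixes every key with "module.", so loading into a plain model amounts to stripping that prefix.
```
from collections import OrderedDict

def strip_dp_prefix(state_dict, prefix="module."):
    # Drop the "module." prefix DP/DDP adds to every parameter/buffer key.
    return OrderedDict(
        (k[len(prefix):] if k.startswith(prefix) else k, v)
        for k, v in state_dict.items()
    )

# plain_model.load_state_dict(strip_dp_prefix(ddp_model.state_dict()))
```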
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45643
Reviewed By: rohan-varma
Differential Revision: D24574649
fbshipit-source-id: 17d29ab16ae24a30890168fa84da6c63650e61e9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47206
As discussed offline with pritamdamania87, add testing to ensure per-iteration and rank-dependent control flow works as expected in DDP with `find_unused_parameters=True`.
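A hedged sketch of the kind of model such a test exercises (module and shapes are illustrative):
```
import torch.nn as nn

class RankDependentNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.a = nn.Linear(10, 10)
        self.b = nn.Linear(10, 10)

    def forward(self, x, rank, iteration):
        # Which branch runs depends on the rank and the iteration, so some
        # parameters are unused on some ranks/iterations.
        return self.a(x) if (rank + iteration) % 2 == 0 else self.b(x)
```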
ghstack-source-id: 115854944
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D24659901
fbshipit-source-id: 17fc2b3ebba9cef2dd01d2877bad5702174b9767
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46755
As reported in https://github.com/pytorch/pytorch/issues/41324, there is a bug in DDP when `find_unused_parameters=True` and 2 or more parameters share the same gradient accumulator.
In the reducer, we currently keep a mapping of grad accumulator to index and populate it with map[accumulator] = index, but this overwrites indices when the accumulator is the same. To fix this, switch the mapping values to a vector of indices to hold all such indices that share the same accumulator.
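A hedged Python sketch of the reducer change (the real code is C++):
```
from collections import defaultdict

def build_accumulator_to_indices(accumulators):
    # accumulators[i] is the gradient accumulator of parameter i; two parameters
    # may share one accumulator, so keep *all* indices instead of only the last.
    mapping = defaultdict(list)
    for index, acc in enumerate(accumulators):
        mapping[acc].append(index)
    return mapping
```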
ghstack-source-id: 115453567
Test Plan: Added UT
Reviewed By: pritamdamania87
Differential Revision: D24497388
fbshipit-source-id: d32dfa9c5cd0b7a8df13c7873d5d28917b766640
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46773
Changed the constructor of RemoteModule to accept a `remote_device` arg in the following format:
"<workername>/<device>" (e.g., "trainer0/cpu", "ps0/cuda:0")
This arg merges the original `on` and `device` arg.
Original PR issue: RemoteDevice Format #46554
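A hedged usage sketch (the worker name is a placeholder, and the import path/signature may differ slightly from this diff):
```
import torch.nn as nn
from torch.distributed.nn import RemoteModule

remote_linear = RemoteModule(
    "trainer0/cuda:0",  # "<workername>/<device>" replaces the old `on` + `device` args
    nn.Linear,
    args=(32, 64),
)
```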
ghstack-source-id: 115448051
Test Plan: buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- RemoteModule
Reviewed By: pritamdamania87
Differential Revision: D24482562
fbshipit-source-id: 5acfc73772576a4b674df27625bf560b8f8e67c1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46897
These APIs implicitly assumed that the GPU for a rank == the rank index, but
that is not necessarily true. For example, the first GPU could be used for a
different purpose, and rank 0 could use GPU 1, rank 1 GPU 2, etc. Thus, we
mandate that the user specify the device to use via `torch.cuda.set_device()`
before making calls to this API. This expectation should be okay since we
clearly document it, and we expect the user to set this for
DistributedDataParallel as well.
Also adds/tidies up some documentation.
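A hedged sketch of the documented expectation (assumes an initialized NCCL process group; the device choice is illustrative):
```
import torch
import torch.distributed as dist

rank = dist.get_rank()
device_id = (rank + 1) % torch.cuda.device_count()  # e.g. rank 0 owns GPU 1
torch.cuda.set_device(device_id)  # tell the collective which GPU this rank uses
dist.barrier()
```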
ghstack-source-id: 115359633
Test Plan: Modified unittests
Reviewed By: divchenko
Differential Revision: D24556177
fbshipit-source-id: 7e826007241eba0fde3019180066ed56faf3c0ca
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46568
This PR adds support for an RRef.backward() API. This would be useful
in applications like pipeline parallelism as described here:
https://github.com/pytorch/pytorch/issues/44827
This PR only adds support for local RRefs; remote RRef support will be added in
a follow-up PR.
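A hedged usage sketch for the local-RRef case (assumes `rpc.init_rpc` has already been called):
```
import torch
import torch.distributed.rpc as rpc

x = torch.ones(2, 2, requires_grad=True)
rref = rpc.RRef(x.sum())   # local RRef holding a scalar loss
rref.backward()            # runs autograd locally through the RRef's value
print(x.grad)
```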
ghstack-source-id: 115100729
Test Plan:
1) unit tests.
2) waitforbuildbot
Reviewed By: mrshenli
Differential Revision: D24406311
fbshipit-source-id: fb0b4e185d9721bf57f4dea9847e0aaa66b3e513
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41807
Test Plan: Make sure ci tests pass, including newly written test
Reviewed By: mrshenli
Differential Revision: D22640839
Pulled By: osandoval-fb
fbshipit-source-id: 3ff98d8e8c6e6d08575e307f05b5e159442d7216
Summary:
Follow-up of https://github.com/pytorch/pytorch/issues/46461 with a similar goal.
Makes the affected call sites more readable and possibly faster. Care has to be taken because `list(map(...))` builds the full list of results immediately, while `(x for x in xs)` is a generator expression that is evaluated lazily. This is a benefit in cases where it is not necessary to actually materialize the list of values in memory (e.g. when passing to `tuple`, `extend`, or `join`).
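A small example of the distinction called out above:
```
xs = ["1", "2", "3"]

eager = list(map(int, xs))      # the full list is built right away
lazy = (int(x) for x in xs)     # generator expression: nothing converted yet

total = sum(lazy)               # values are produced only when consumed here
joined = ", ".join(x for x in xs)  # no intermediate list needed
```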
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46462
Reviewed By: zou3519
Differential Revision: D24422343
Pulled By: ezyang
fbshipit-source-id: 252e33499c92ac0b15238f2df32681dbbda2b237
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46372
Currently, in `_run_function`, we catch any exception from the Python
function being run and report it back to the master. However, in some large-scale
training jobs, it would be valuable to also log the error on the trainer
itself for faster debugging.
Test Plan: Added unittest.
Reviewed By: pritamdamania87
Differential Revision: D24324578
fbshipit-source-id: 88460d7599ea69d2c38fd9c10eb6471f7edd4100
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46304
In the case where a single process operates on only one GPU, we can
avoid the scatter and instead replace it with a recursive version of `to`,
which transfers the input tensors to the correct device.
The implementation of `_recursive_to` is modeled after `scatter` in https://github.com/pytorch/pytorch/blob/master/torch/nn/parallel/scatter_gather.py, in order to keep parity with the previous conventions (i.e. custom types not having their tensors moved).
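A hedged sketch of the recursion (not the literal helper; the real version mirrors scatter_gather.py more closely):
```
import torch

def recursive_to(obj, device):
    if isinstance(obj, torch.Tensor):
        return obj.to(device)
    if isinstance(obj, tuple) and hasattr(obj, "_fields"):  # namedtuple
        return type(obj)(*(recursive_to(o, device) for o in obj))
    if isinstance(obj, (list, tuple)):
        return type(obj)(recursive_to(o, device) for o in obj)
    if isinstance(obj, dict):
        return {k: recursive_to(v, device) for k, v in obj.items()}
    return obj  # custom types are left untouched, as with scatter
```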
ghstack-source-id: 114896677
Test Plan: Added unittest, and CI
Reviewed By: pritamdamania87
Differential Revision: D24296377
fbshipit-source-id: 536242da05ecabfcd36dffe14168b1f2cf58ca1d
Summary:
Resolves one item in https://github.com/pytorch/pytorch/issues/46321
This PR sets up DistExamplesTest, which will be used as the class to implement future tests for examples. This class is run as part of CI tests. It also creates a dist_examples folder and includes the [batch server example](https://github.com/pytorch/examples/blob/master/distributed/rpc/batch/parameter_server.py), which is slightly modified so that it can be tested.
Run test:
pytest test/distributed/rpc/test_tensorpipe_agent.py -k test_batch_updating_parameter_server -vs
pytest test/distributed/rpc/test_process_group_agent.py -k test_batch_updating_parameter_server -vs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46510
Reviewed By: mrshenli
Differential Revision: D24379296
Pulled By: H-Huang
fbshipit-source-id: 1c102041e338b022b7a659a51894422addc0e06f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45989
This test was failing internally for the Thrift-based RPC agent, since
it has a different error regex. Use `self.get_timeout_error_regex`, which gets
the timeout error string for each backend, to fix this.
ghstack-source-id: 114463458
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D24170394
fbshipit-source-id: 9b30945e3e30f36472268d042173f8175ad88098
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46221
The RPC framework only allowed sending RPCs based on a provided
WorkerInfo or name. When using RPC with DDP, sometimes it might just be easier
to refer to everything in terms of ranks, since DDP doesn't support names yet.
As a result, it would be helpful for the `to` parameter in the RPC APIs to also
accept a rank.
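A hedged usage sketch (assumes `rpc.init_rpc` has been called and rank 1 exists):
```
import torch
import torch.distributed.rpc as rpc

# Destination given as a rank instead of a WorkerInfo or a worker name.
fut = rpc.rpc_async(1, torch.add, args=(torch.ones(2), torch.ones(2)))
result = fut.wait()
```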
ghstack-source-id: 114207172
Test Plan:
1) waitforbuildbot
2) Unit Tests
Reviewed By: mrshenli
Differential Revision: D24264989
fbshipit-source-id: 5edf5d92e2bd2f213471dfe7c74eebfa9efc9f70
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45994
Send/Recv tests were disabled because of https://github.com/pytorch/pytorch/issues/42517. With that issue fixed, this diff enables those tests.
ghstack-source-id: 113970569
Test Plan: waitforsandcastle
Reviewed By: jiayisuse
Differential Revision: D24172484
fbshipit-source-id: 7492ee2e9bf88840c0d0086003ce8e99995aeb91
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45933
Occasionally users run DDP with models that have unused params; in this
case we would like to surface an error message telling them to run with
find_unused_parameters=True. However, a recent change to the rebuild_buckets logic (https://github.com/pytorch/pytorch/pull/44798) made
it so that we raise a size mismatch error when this happens, but the
information about unused parameters is likely to be more useful and is likely to
be the most common cause of failure. Prefer raising this error over the
subsequent size mismatch errors.
ghstack-source-id: 113914759
Test Plan: Added unittest
Reviewed By: mrshenli
Differential Revision: D24151256
fbshipit-source-id: 5d349a988b4aac7d3e0ef7b3cd84dfdcbe9db675
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45873
This diff adds support for sending/receiving to/from self. It also fixes a bug that occurred when p2p operations were not used by all processes.
ghstack-source-id: 113910526
Test Plan: waitforsandcastle
Reviewed By: jiayisuse
Differential Revision: D24124413
fbshipit-source-id: edccb830757ac64f569e7908fec8cb2b43cd098d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45783
After the previous device maps commits, `pipeWrite` might throw. In
this case, if we increment active calls before `pipeWrite` on the
caller, that active call won't be decremented properly when `pipeWrite`
throws. As a result, `shutdown` can silently time out. I noticed this
because some tests took more than 60s to finish.
This commit extracts the tensor device checking logic out of pipeWrite
and makes sure the error is thrown before the active call count is
incremented.
Differential Revision: D24094803
Test Plan: Imported from OSS
Reviewed By: mruberry
Pulled By: mrshenli
fbshipit-source-id: d30316bb23d2afd3ba4f5540c3bd94a2ac10969b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44921
This diff adds support for Process Group point-to-point operations on the NCCL backend, based on ncclSend/ncclRecv. See https://github.com/pytorch/pytorch/issues/43995 for more context.
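A hedged sketch of the point-to-point usage this enables (assumes an initialized NCCL process group with one GPU per rank):
```
import torch
import torch.distributed as dist

rank = dist.get_rank()
torch.cuda.set_device(rank)
t = torch.full((4,), float(rank), device="cuda")
if rank == 0:
    dist.send(t, dst=1)
elif rank == 1:
    dist.recv(t, src=0)  # t now holds rank 0's values
```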
ghstack-source-id: 113592785
Test Plan: unittest
Reviewed By: jiayisuse
Differential Revision: D23709848
fbshipit-source-id: cdf38050379ecbb10450f3394631317b41163258
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44220
Closes https://github.com/pytorch/pytorch/issues/44009
Currently, if a dataloader returns objects created with a
collections.namedtuple, they will incorrectly be cast to plain tuples. As a result, if we have data of these types, there can be runtime errors during the forward pass if the module is expecting a named tuple.
Fix this in
`scatter_gather.py` to resolve the issue reported in
https://github.com/pytorch/pytorch/issues/44009
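A hedged sketch of the approach (not the literal scatter_gather.py code):
```
def is_namedtuple(obj):
    return isinstance(obj, tuple) and hasattr(obj, "_fields")

def map_structure(obj, fn):
    if is_namedtuple(obj):
        return type(obj)(*map(fn, obj))  # rebuild the same namedtuple type
    if isinstance(obj, tuple):
        return tuple(map(fn, obj))       # plain tuples stay plain tuples
    return fn(obj)
```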
ghstack-source-id: 113423287
Test Plan: CI
Reviewed By: colesbury
Differential Revision: D23536752
fbshipit-source-id: 3838e60162f29ebe424e83e474c4350ae838180b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44826
As described in https://github.com/pytorch/pytorch/issues/43690, there
is a need for DDP to be able to ignore certain parameters in the module (not
install allreduce hooks) for certain use cases. `find_unused_parameters` is
sufficient from a correctness perspective, but we can get better performance
with this upfront list if users know which params are unused, since we won't
have to traverse the autograd graph every iteration.
To enable this, we add a field `parameters_to_ignore` to DDP init and don't
pass those parameters to the reducer if they are in the given list.
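A hedged sketch of building such a list (how it is handed to DDP follows the `parameters_to_ignore` mechanism described above; the exact entry point may differ from this sketch):
```
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))
# Fully-qualified names of parameters DDP should skip (no allreduce hooks).
params_to_ignore = [
    name for name, _ in model.named_parameters() if name.startswith("1.")
]
```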
ghstack-source-id: 113210109
Test Plan: Added unittest
Reviewed By: xw285cornell, mrshenli
Differential Revision: D23740639
fbshipit-source-id: a0411712a8b0b809b9c9e6da04bef2b955ba5314
Summary:
In the profiler, CUDA ops did not report self time, so for composite functions there was no way to determine which function was really taking the time. In addition, the reported "total cuda time" was frequently more than the total wallclock time. This PR adds "self CUDA time" to the profiler and computes total CUDA time based on self CUDA time, similar to how it's done for CPU. Also, slight formatting changes to make the table more compact. Before:
```
-------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- ---------------
Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls
-------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- ---------------
aten::matmul 0.17% 890.805us 99.05% 523.401ms 5.234ms 49.91% 791.184ms 7.912ms 100
aten::mm 98.09% 518.336ms 98.88% 522.511ms 5.225ms 49.89% 790.885ms 7.909ms 100
aten::t 0.29% 1.530ms 0.49% 2.588ms 25.882us 0.07% 1.058ms 10.576us 100
aten::view 0.46% 2.448ms 0.46% 2.448ms 12.238us 0.06% 918.936us 4.595us 200
aten::transpose 0.13% 707.204us 0.20% 1.058ms 10.581us 0.03% 457.802us 4.578us 100
aten::empty 0.14% 716.056us 0.14% 716.056us 7.161us 0.01% 185.694us 1.857us 100
aten::as_strided 0.07% 350.935us 0.07% 350.935us 3.509us 0.01% 156.380us 1.564us 100
aten::stride 0.65% 3.458ms 0.65% 3.458ms 11.527us 0.03% 441.258us 1.471us 300
-------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- ---------------
Self CPU time total: 528.437ms
CUDA time total: 1.585s
Recorded timeit time: 789.0814 ms
```
Note that the recorded timeit time (with proper cuda syncs) is 2 times smaller than the "CUDA time total" reported by the profiler.
After:
```
-------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
-------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
aten::matmul 0.15% 802.716us 99.06% 523.548ms 5.235ms 302.451us 0.04% 791.151ms 7.912ms 100
aten::mm 98.20% 519.007ms 98.91% 522.745ms 5.227ms 790.225ms 99.63% 790.848ms 7.908ms 100
aten::t 0.27% 1.406ms 0.49% 2.578ms 25.783us 604.964us 0.08% 1.066ms 10.662us 100
aten::view 0.45% 2.371ms 0.45% 2.371ms 11.856us 926.281us 0.12% 926.281us 4.631us 200
aten::transpose 0.15% 783.462us 0.22% 1.173ms 11.727us 310.016us 0.04% 461.282us 4.613us 100
aten::empty 0.11% 591.603us 0.11% 591.603us 5.916us 176.566us 0.02% 176.566us 1.766us 100
aten::as_strided 0.07% 389.270us 0.07% 389.270us 3.893us 151.266us 0.02% 151.266us 1.513us 100
aten::stride 0.60% 3.147ms 0.60% 3.147ms 10.489us 446.451us 0.06% 446.451us 1.488us 300
-------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 528.498ms
CUDA time total: 793.143ms
Recorded timeit time: 788.9832 ms
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45209
Reviewed By: zou3519
Differential Revision: D23925491
Pulled By: ngimel
fbshipit-source-id: 7f9c49238d116bfd2db9db3e8943355c953a77d0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44419
Closes https://github.com/pytorch/pytorch/issues/39969
This PR adds support for propagation of input shapes over the wire when the profiler is invoked with `record_shapes=True` over RPC. Previously, we did not respect this argument.
This is done by saving the shapes as an ivalue list and recovering it as the expected type (`std::vector<std::vector<int>>` on the client). A test is added to ensure that remote ops have the same `input_shapes` as if the op were run locally.
ghstack-source-id: 112977899
Reviewed By: pritamdamania87
Differential Revision: D23591274
fbshipit-source-id: 7cf3b2e8df26935ead9d70e534fc2c872ccd6958
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45221
This PR introduces a distributed functional optimizer, so that
distributed optimizer can reuse the functional optimizer APIs and
maintain their own states. This could enable the TorchScript-compatible
functional optimizers when using the distributed optimizer, which helps get rid
of the GIL and improves the overall performance of training, especially for
distributed model parallel training.
Test Plan: Imported from OSS
Reviewed By: ailzhang
Differential Revision: D23935256
Pulled By: wanchaol
fbshipit-source-id: 59b6d77ff4693ab24a6e1cbb6740bcf614cc624a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45353
Temporarily removing this feature; it will be added back after the branch cut.
Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D23939865
Pulled By: mrshenli
fbshipit-source-id: 7dceaffea6b9a16512b5ba6036da73e7f8f83a8e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45181
`init_process_group` and `new_group` update a bunch of global
variables after initializing the actual process group. As a result, there is a
race that after initializing the process group on say rank 0, if we immediately
check the default process group on rank 1 (say via RPC), we might actually get
an error since rank 1 hasn't yet updated its _default_pg variable.
To resolve this issue, I've added barrier() at the end of both of these calls.
This ensures that once these calls return we are guaranteed about correct
initialization on all ranks.
Since these calls are usually done mostly during initialization, it should be
fine to add the overhead of a barrier() here.
Closes: https://github.com/pytorch/pytorch/issues/40434, https://github.com/pytorch/pytorch/issues/40378
ghstack-source-id: 112923112
Test Plan:
Reproduced the failures in
https://github.com/pytorch/pytorch/issues/40434 and
https://github.com/pytorch/pytorch/issues/40378 and verified that this PR fixes
the issue.
Reviewed By: mrshenli
Differential Revision: D23858025
fbshipit-source-id: c4d5e46c2157981caf3ba1525dec5310dcbc1830
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44923
This ensures that RPC profiling works in single-threaded server
scenarios and that we won't make the assumption that we'll have multiple
threads when working on this code. For example, this assumption resulted in a
bug in the previous diff (which has since been fixed).
ghstack-source-id: 112868469
Test Plan: CI
Reviewed By: lw
Differential Revision: D23691304
fbshipit-source-id: b17d34ade823794cbe949b70a5ab35723d974203
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44664
Closes https://github.com/pytorch/pytorch/issues/39971. This PR adds support for functions decorated with `rpc.functions.async_execution` to be profiled over RPC as builtins, jit functions, and blocking python UDFs currently can be. The reasoning for this is to provide complete feature support in terms of RPC profiling and the various types of functions users can run.
To enable this, the PR below this one makes it safe to call `disableProfiler()` from another thread. We use that functionality to defer disabling the profiler on the server until the future corresponding to the RPC request completes (rather than only until the blocking `processRPC` call returns, as was done previously). Since by the time that future completes the async function has been kicked off and its own future has completed, we are able to capture any RPCs the function would have called and the actual work done on the other node.
For example, if the following async function is ran on a server over RPC:
```
import time
import torch
import torch.distributed.rpc as rpc

def slow_add(x, y):
    time.sleep(1)
    return torch.add(x, y)

@rpc.functions.async_execution
def slow_async_add(to, x, y):
    return rpc.rpc_async(to, slow_add, args=(x, y))
```
we expect to see the original RPC profiled, the nested RPC profiled, and the actual torch.add() work. All of these events should be recorded with the correct node id. Here is an example profiling output:
```
Name                                                                                                                   Self CPU total %  Self CPU total  CPU total %  CPU total   CPU time avg  Number of Calls  Node ID
---------------------------------------------------------------------------------------------------------------------  ----------------  --------------  -----------  ----------  ------------  ---------------  -------
rpc_async#slow_async_add(worker1 -> worker2)                                                                             0.00%             0.000us         0            1.012s      1.012s        1                1
aten::empty                                                                                                              7.02%             11.519us        7.02%        11.519us    11.519us      1                1
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)                           0.00%             0.000us         0            1.006s      1.006s        1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: aten::empty                                                      7.21%             11.843us        7.21%        11.843us    11.843us      1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::add      71.94%            118.107us       85.77%       140.802us   140.802us     1                3
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::empty    13.82%            22.695us        13.82%       22.695us    22.695us      1                3
---------------------------------------------------------------------------------------------------------------------  ----------------  --------------  -----------  ----------  ------------  ---------------  -------
Self CPU time total: 164.164us
```
This PR also moves a bunch of the profiling logic to `rpc/utils.cpp` to declutter `request_callback` code.
ghstack-source-id: 112868470
Test Plan:
```
rvarm1@devbig978:fbcode (52dd34f6)$ buck test mode/no-gpu mode/dev-nosan //caffe2/test/distributed/rpc:process_group_agent -- test_rpc_profiling_async_function --print-passing-details --stress-runs 1
```
Reviewed By: mrshenli
Differential Revision: D23638387
fbshipit-source-id: eedb6d48173a4ecd41d70a9c64048920bd4807c4