Commit Graph

230 Commits

Author SHA1 Message Date
Rohan Varma
f8248543a1 Pass in smaller timeout into init_process_group for distributed_test (#47896)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47896

Per title
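
A hedged sketch of what a caller-side smaller timeout looks like (the backend, timeout value, and `rank`/`world_size` here are illustrative, not taken from the PR):

```
from datetime import timedelta

import torch.distributed as dist

# init_process_group accepts a timeout argument (default is 30 minutes);
# tests can pass a smaller value so a hang fails fast.
dist.init_process_group(
    backend="gloo",
    rank=rank,                      # assumed set by the test harness
    world_size=world_size,          # assumed set by the test harness
    timeout=timedelta(seconds=60),  # illustrative value
)
```
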
ghstack-source-id: 116710141

Test Plan: CI

Reviewed By: osalpekar

Differential Revision: D24943323

fbshipit-source-id: 7bf33ce3a021b9750b65e0c08f602c465cd81d28
2020-11-14 13:38:20 -08:00
Jagadish Krishnamoorthy
1606899dbe distributed_test: Map rank to GPU accordingly (#47898)
Summary:
If world_size is less than or equal to the number of GPUs available,
then each rank can be mapped directly to the corresponding GPU.
This fixes the issue referenced in https://github.com/pytorch/pytorch/issues/45435 and https://github.com/pytorch/pytorch/issues/47629

For world_size = 3 and 8 GPUs, the rank-to-GPU mapping would be 0,2,4.
Due to the introduction of barrier
(refer to PR https://github.com/pytorch/pytorch/issues/45181),
the tensors in the barrier are mapped to cuda:0,1,2 while the tensors in the
actual test cases are mapped to cuda:0,2,4, resulting in different streams and
leading to a timeout. This issue is specific to the default process group.
The issue is not observed in a new process group, since the streams are created
again after the initial barrier call.

This patch maps each rank to the corresponding GPU when world_size is
less than or equal to the number of GPUs, in this case 0,1,2.

Note: The barrier function in distributed_c10d.py should take a new parameter
to specify the tensor or rank-to-GPU mapping. In that case, this patch will be
redundant but harmless, since the tests can specify tensors with the appropriate
GPU ranks.
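
A hedged sketch of the mapping described above (`rank_to_gpu` is a hypothetical helper, not the actual test code):

```
import torch

def rank_to_gpu(rank, world_size):
    # Hypothetical helper mirroring the patch's intent.
    n_gpus = torch.cuda.device_count()
    if world_size <= n_gpus:
        # Direct mapping: rank i -> cuda:i, matching the devices used by
        # the initial barrier() call (cuda:0,1,2 for world_size = 3).
        return rank
    # Otherwise fall back to spreading ranks across the available GPUs.
    return rank % n_gpus
```
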

Fixes https://github.com/pytorch/pytorch/issues/47629

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47898

Reviewed By: smessmer

Differential Revision: D24956021

Pulled By: rohan-varma

fbshipit-source-id: a88257f22a7991ba36566329766c106d3360bb4e
2020-11-13 23:59:42 -08:00
Natalia Gimelshein
eb8331e759 Revert D24524219: Remove balance and devices parameter from Pipe.
Test Plan: revert-hammer

Differential Revision:
D24524219 (8da7576303)

Original commit changeset: 9973172c2bb7

fbshipit-source-id: b187c80270adb2a412e3882863a2d7de2a52ed56
2020-11-12 19:31:19 -08:00
Pritam Damania
8da7576303 Remove balance and devices parameter from Pipe. (#46804)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46804

As per our design in https://github.com/pytorch/pytorch/issues/44827,
changing the API so that the user places modules on appropriate devices
instead of having a `balance` and `devices` parameter that decides this.

This design allows us to use RemoteModule in the future.
ghstack-source-id: 116479842

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D24524219

fbshipit-source-id: 9973172c2bb7636572cdc37ce06bf8368638a463
2020-11-12 14:20:23 -08:00
Mingzhe Li
66f9b1de1b [NCCL] enable p2p tests (#47797)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47797

NCCL p2p tests had hang issues before; the reason is that there were some unexpected context switches. For example, process 1, which is supposed to use only GPU 1, could end up using GPU 0 because the device was not set explicitly.
ghstack-source-id: 116461969

Test Plan: waitforsandcastle

Reviewed By: jiayisuse

Differential Revision: D24863808

fbshipit-source-id: 92bd3a4874be8334210c7c8ee6363648893c963e
2020-11-12 10:44:50 -08:00
Kyle Chen
859e054314 skip test_all_reduce_sum_cuda_async test case for ROCM (#47630)
Summary:
Skip the following test case on ROCm (when PYTORCH_TEST_WITH_ROCM=1):
- test_all_reduce_sum_cuda_async (__main__.TestDistBackendWithFork)

jeffdaily
pruthvistony

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47630

Reviewed By: seemethere, heitorschueroff

Differential Revision: D24849755

Pulled By: walterddr

fbshipit-source-id: b952c81677df2dfd35d459b94ce0f7a5b12c0d5c
2020-11-12 07:19:32 -08:00
Rohan Varma
c9f6e70c09 Refactor DDP uneven inputs control flags (#47394)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47394

This is a preliminary refactor for the next diff, which will add an
additional flag to control whether we throw a StopIteration or not. We
move the flags for DDP uneven inputs into a simple class.
ghstack-source-id: 116428177

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D24739509

fbshipit-source-id: 96bf41bd1c02dd27e68f6f37d08e22f33129b319
2020-11-11 16:51:56 -08:00
Rohan Varma
0a7ebf00f8 [Reland] Add tests for DDP control flow models. (#47470)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47470

Reland of https://github.com/pytorch/pytorch/pull/47206, which was reverted due to failing multigpu tests.

The fix to make multigpu tests work is to compare against `torch.tensor([world_size, 0])`, not hardcode `torch.tensor([2, 0])`, which assumes a world size of 2.

Original commit description:

As discussed offline with pritamdamania87, add testing to ensure per-iteration and rank-dependent control flow works as expected in DDP with find_unused_parameters=True.
ghstack-source-id: 115993934

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D24767893

fbshipit-source-id: 7d7a2449270eb3e72b5061694e897166e16f9bbc
2020-11-10 12:22:59 -08:00
Richard Zou
22d21414d7 Revert D24574649: [pytorch][PR] Utility that loads a DP/DDP model state dict into a non-DDP model with the same architecture.
Test Plan: revert-hammer

Differential Revision:
D24574649 (b631c872c9)

Original commit changeset: 17d29ab16ae2

fbshipit-source-id: 6766c6b21b82c9463143da0370192d9c68dbce6c
2020-11-10 06:55:45 -08:00
Pradeep Ganesan
b631c872c9 Utility that loads a DP/DDP model state dict into a non-DDP model with the same architecture. (#45643)
Summary:
Added a convenience function that allows users to load models without DP/DDP from a DP/DDP state dict.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45643

Reviewed By: rohan-varma

Differential Revision: D24574649

fbshipit-source-id: 17d29ab16ae24a30890168fa84da6c63650e61e9
2020-11-09 20:49:29 -08:00
Pritam Damania
781e0ed835 Support RRef.backward() for Owner RRefs. (#46641)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46641

Second part of https://github.com/pytorch/pytorch/pull/46568, allows
RRef.backward() to work for owner RRefs.
ghstack-source-id: 115440252

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D24441300

fbshipit-source-id: 64af28e6b6ae47ea27e611a148f217bc344a4c5b
2020-11-07 21:25:32 -08:00
Mehdi Mirzazadeh
160db3db4f Adding profiling capability to c++ ddp collective functions (#46471)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46471

ghstack-source-id: 116018837

Test Plan:
Added unit tests:

 buck test mode/dev-nosan caffe2/test/distributed:distributed_gloo_fork
 buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork

Reviewed By: rohan-varma

Differential Revision: D23948397

fbshipit-source-id: 6d93a370aff26bf96c39e5d78a2492c5142a9156
2020-11-06 10:29:58 -08:00
Xu Zhao
eaa993a2e0 Add type annotations to torch._C._distributed_rpc module. (#46624)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46624

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D24761656

Pulled By: xuzhao9

fbshipit-source-id: b55aee5dd2b97f573a50e5bbfddde7d984943fec
2020-11-06 01:28:51 -08:00
Richard Zou
9c8078cdfb Revert D24659901: Add tests for DDP control flow models.
Test Plan: revert-hammer

Differential Revision:
D24659901 (31c9d2efcd)

Original commit changeset: 17fc2b3ebba9

fbshipit-source-id: 26b0bdbe83cba54da4f363cfa7fc85c503aa05ab
2020-11-05 08:08:59 -08:00
Pritam Damania
c8872051e6 Validate number of GPUs in distributed_test. (#47259)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47259

As described in https://github.com/pytorch/pytorch/issues/47257, not
using enough GPUs would result in an error.

As a result, before we call `init_process_group` in distributed_test, we
validate we have enough GPUs.

#Closes: https://github.com/pytorch/pytorch/issues/47257
ghstack-source-id: 115790475

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D24699122

fbshipit-source-id: 59c78d191881d1e063c43623dcf4d7eb75a2e94e
2020-11-04 17:55:34 -08:00
Rohan Varma
31c9d2efcd Add tests for DDP control flow models. (#47206)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47206

As discussed offline with pritamdamania87, add testing to ensure per-iteration and rank-dependent control flow works as expected in DDP with `find_unused_parameters=True`.
ghstack-source-id: 115854944

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D24659901

fbshipit-source-id: 17fc2b3ebba9cef2dd01d2877bad5702174b9767
2020-11-04 15:40:57 -08:00
Jeff Daily
6906701bde [ROCm] enable stream priorities (#47136)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47136

Reviewed By: mruberry

Differential Revision: D24672457

Pulled By: ngimel

fbshipit-source-id: 54f60c32df87cbd40fccd7fb1ecf0437905f01a3
2020-11-02 11:25:44 -08:00
Rohan Varma
d850b5c98c Fix DDP issue where parameters share same grad_accumulator (#46755)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46755

As reported in https://github.com/pytorch/pytorch/issues/41324, there is a bug in DDP when `find_unused_parameters=True` and 2 or more parameters share the same gradient accumulator.

In the reducer, we currently keep a mapping of grad accumulator to index and populate it with map[accumulator] = index, but this overwrites indices when the accumulator is the same. To fix this, switch the mapping values to a vector of indices to hold all such indices that share the same accumulator.
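
An illustrative model for the shared-accumulator case (weight tying is one way two parameter entries can end up sharing a grad accumulator; this sketch is not the PR's test):

```
import torch.nn as nn

class TiedModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(10, 4)
        self.decode = nn.Linear(4, 10, bias=False)
        self.decode.weight = self.embed.weight  # tied weights

# Before this fix, wrapping such a model in DistributedDataParallel with
# find_unused_parameters=True could mis-track one of the tied parameter
# indices, since map[accumulator] = index overwrote earlier entries.
```
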
ghstack-source-id: 115453567

Test Plan: Added UT

Reviewed By: pritamdamania87

Differential Revision: D24497388

fbshipit-source-id: d32dfa9c5cd0b7a8df13c7873d5d28917b766640
2020-10-29 12:23:06 -07:00
Yi Wang
cab32d9cdf [RPC Framework] Support remote device format "<workername>/<device>" (#46773)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46773

Changed the constructor of RemoteModule to accept a `remote_device` arg in the following format:
"<workername>/<device>" (e.g., "trainer0/cpu", "ps0/cuda:0")

This arg merges the original `on` and `device` arg.
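
A sketch of the new constructor usage (assumes an initialized RPC framework with a worker named `trainer0`; the import path follows current PyTorch and may have differed at the time of this PR):

```
import torch.nn as nn
from torch.distributed.nn import RemoteModule

# Placement is now a single "<workername>/<device>" string instead of
# separate `on` and `device` arguments.
remote_linear = RemoteModule(
    remote_device="trainer0/cuda:0",
    module_cls=nn.Linear,
    args=(20, 30),
)
```
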

Original PR issue: RemoteDevice Format #46554
ghstack-source-id: 115448051

Test Plan: buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- RemoteModule

Reviewed By: pritamdamania87

Differential Revision: D24482562

fbshipit-source-id: 5acfc73772576a4b674df27625bf560b8f8e67c1
2020-10-29 00:14:56 -07:00
Rohan Varma
c7183c9878 Fix object-based collectives API to use torch.cuda.current_device instead of (#46897)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46897

These APIs implicitly assumed that the GPU index for a rank equals the rank
itself, but that is not necessarily true. For example, the first GPU could be
used for a different purpose, and rank 0 could use GPU 1, rank 1 GPU 2, etc.
Thus, we mandate that the user specify the device to use via
`torch.cuda.set_device()` before making calls to this API. This expectation
should be okay since we clearly document it, and we expect the user to set this
for DistributedDataParallel as well.
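
A sketch of the documented expectation (`gather_ranks` and `device_id` are illustrative; the device id need not equal the rank):

```
import torch
import torch.distributed as dist

def gather_ranks(rank, world_size, device_id):
    # The caller must set the device explicitly before using the
    # object-based collectives; device_id need not equal rank.
    torch.cuda.set_device(device_id)
    gathered = [None] * world_size
    dist.all_gather_object(gathered, {"rank": rank})
    return gathered
```
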

Also adds/tidies up some documentation.
ghstack-source-id: 115359633

Test Plan: Modified unittests

Reviewed By: divchenko

Differential Revision: D24556177

fbshipit-source-id: 7e826007241eba0fde3019180066ed56faf3c0ca
2020-10-28 18:12:50 -07:00
Pritam Damania
adafd3d4b2 Support RRef.backward() for local RRefs. (#46568)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46568

This PR adds support for an RRef.backward() API. This would be useful
in applications like pipeline parallelism as described here:
https://github.com/pytorch/pytorch/issues/44827

This PR only adds support for local RRefs, remote RRef support will be added in
a follow up PR.
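
A minimal sketch of the new API on a local RRef (assumes RPC has been initialized on this worker):

```
import torch
import torch.distributed.rpc as rpc

t = torch.rand(2, 2, requires_grad=True)
rref = rpc.RRef(t.sum())   # RRef wrapping a local value
rref.backward()            # runs autograd through the wrapped value
print(t.grad)              # gradients are now populated
```
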
ghstack-source-id: 115100729

Test Plan:
1) unit tests.
2) waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D24406311

fbshipit-source-id: fb0b4e185d9721bf57f4dea9847e0aaa66b3e513
2020-10-26 17:31:17 -07:00
Oscar Sandoval
58ed60c259 Added context manager enabling all futures returned by rpc_async and custom build rpc functions to be automatically waited on (#41807)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41807

Test Plan: Make sure ci tests pass, including newly written test

Reviewed By: mrshenli

Differential Revision: D22640839

Pulled By: osandoval-fb

fbshipit-source-id: 3ff98d8e8c6e6d08575e307f05b5e159442d7216
2020-10-26 12:53:35 -07:00
Alexander Grund
93719440b8 Replace map(lambda constructs (#46462)
Summary:
Follow-up of https://github.com/pytorch/pytorch/issues/46461 with a similar goal

Makes them more readable and possibly faster. Care has to be taken because `list(map(...))` builds the whole list of values immediately, while a generator expression `(f(x) for x in xs)` is evaluated lazily. The lazy form is a benefit in cases where the list of values never needs to exist in memory (e.g. when the result is passed to `tuple`, `extend`, or `join`).
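
An illustrative before/after (the names are arbitrary):

```
xs = [1, 2, 3]

# Before: a lambda plus map
doubled = list(map(lambda x: x * 2, xs))

# After: a list comprehension when the list is actually needed...
doubled = [x * 2 for x in xs]

# ...or a generator expression when the values are consumed once,
# avoiding the intermediate list entirely.
total = sum(x * 2 for x in xs)
joined = ", ".join(str(x) for x in xs)
```
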

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46462

Reviewed By: zou3519

Differential Revision: D24422343

Pulled By: ezyang

fbshipit-source-id: 252e33499c92ac0b15238f2df32681dbbda2b237
2020-10-22 09:50:22 -07:00
Rohan Varma
25dc0056f2 [RPC] print exception message on workers that run python functions (#46372)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46372

Currently, in `_run_function`, we catch an exception from the python
function which is run, and report it back to the master. However in some large
scale training jobs, it would be valuable to also log the error on the trainer
itself for faster debugging.

Test Plan: Added unittest.

Reviewed By: pritamdamania87

Differential Revision: D24324578

fbshipit-source-id: 88460d7599ea69d2c38fd9c10eb6471f7edd4100
2020-10-22 09:44:15 -07:00
Ivan Kobzarev
3112e23428 [py][vulkan][reland] Add is_vulkan to py api, add vulkan to device type parsing (#46655)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46655

Test Plan: Imported from OSS

Pulled By: IvanKobzarev

Reviewed By: mrshenli

Differential Revision: D24448984

fbshipit-source-id: 5000846a06077f7a5a06dd51da422d2a42f70820
2020-10-22 09:35:50 -07:00
Rohan Varma
7245d2c939 Avoid scatter for single-device case in DDP (#46304)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46304

In the case that a single process operates only on one GPU, we can
avoid this scatter and instead replace it with a recursive version of `to`
which transfers the input tensors to the correct device.

The implementation of `_recursive_to` is modeled after `scatter` in https://github.com/pytorch/pytorch/blob/master/torch/nn/parallel/scatter_gather.py, in order to keep parity with the previous conventions (i.e. custom types not having their tensors moved).
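
A hedged sketch of what such a recursive `to` can look like (not the actual `_recursive_to` implementation):

```
import torch

def recursive_to(obj, device):
    # Move tensors, descend into containers, preserve namedtuples, and
    # leave custom types untouched (matching scatter's conventions).
    if isinstance(obj, torch.Tensor):
        return obj.to(device)
    if isinstance(obj, tuple) and hasattr(obj, "_fields"):  # namedtuple
        return type(obj)(*(recursive_to(x, device) for x in obj))
    if isinstance(obj, (list, tuple)):
        return type(obj)(recursive_to(x, device) for x in obj)
    if isinstance(obj, dict):
        return {k: recursive_to(v, device) for k, v in obj.items()}
    return obj  # custom types are not traversed
```
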
ghstack-source-id: 114896677

Test Plan: Added unittest, and CI

Reviewed By: pritamdamania87

Differential Revision: D24296377

fbshipit-source-id: 536242da05ecabfcd36dffe14168b1f2cf58ca1d
2020-10-22 08:29:37 -07:00
Howard Huang
611f028168 Add Batch-Updating Parameter Server Example to CI Tests (#46510)
Summary:
Resolves one item in https://github.com/pytorch/pytorch/issues/46321

This PR sets up DistExamplesTest, which will be used as the class to implement future tests for examples. This class is run as part of CI tests. It also creates a dist_examples folder and includes the [batch server example](https://github.com/pytorch/examples/blob/master/distributed/rpc/batch/parameter_server.py), which is slightly modified so it can be tested.

Run test:
pytest test/distributed/rpc/test_tensorpipe_agent.py -k test_batch_updating_parameter_server -vs
pytest test/distributed/rpc/test_process_group_agent.py -k test_batch_updating_parameter_server -vs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46510

Reviewed By: mrshenli

Differential Revision: D24379296

Pulled By: H-Huang

fbshipit-source-id: 1c102041e338b022b7a659a51894422addc0e06f
2020-10-21 08:46:46 -07:00
Alexander Grund
5b0f400488 Replace list(map(...)) constructs by list comprehensions (#46461)
Summary:
As discussed in https://github.com/pytorch/pytorch/issues/46392 this makes the code more readable and possibly more performant.

It also fixes a bug detected by this where the argument order of `map` was confused: 030a24906e (diff-5bb26bd3a23ee3bb540aeadcc0385df2a4e48de39f87ed9ea76b21990738fe98L1537-R1537)

Fixes https://github.com/pytorch/pytorch/issues/46392

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46461

Reviewed By: ailzhang

Differential Revision: D24367015

Pulled By: ezyang

fbshipit-source-id: d55a67933cc22346b00544c9671f09982ad920e7
2020-10-19 18:42:49 -07:00
Rohan Varma
5c5484c889 [Flaky tests] Fix test_all_gather_timeout test (#45989)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45989

This test was failing internally for the Thrift-based RPC agent, since
it has a different error regex. Use `self.get_timeout_error_regex` which gets
the timeout error string for each backend to fix this.
ghstack-source-id: 114463458

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D24170394

fbshipit-source-id: 9b30945e3e30f36472268d042173f8175ad88098
2020-10-16 09:02:46 -07:00
HyunJun
a69910868a Fix possible padding length overflow in DistributedSampler (#45329)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45324

This fix handles the case `len(dataset) * 2 < num_replicas` in DistributedSampler, which previously resulted in an error.
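
A hedged sketch of padding that survives this case (`pad_indices` is a hypothetical helper; the idea is to repeat the index list as many times as needed before truncating, instead of assuming one extra copy suffices):

```
import math

def pad_indices(indices, total_size):
    # Repeat the index list enough times to cover the shortfall, then
    # truncate to exactly the padding size.
    padding_size = total_size - len(indices)
    if padding_size > 0:
        indices += (indices * math.ceil(padding_size / len(indices)))[:padding_size]
    return indices
```
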

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45329

Reviewed By: mruberry

Differential Revision: D24205035

Pulled By: rohan-varma

fbshipit-source-id: f94329d9c1e7deaee41e5af319e7c7d0c741910c
2020-10-14 17:19:44 -07:00
Brian Hirsh
1f791c06f0 adding BAND/BOR/BXOR reduce ops to unsupported list for complex numbers. added tests (#46270)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46270

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D24284702

Pulled By: bdhirsh

fbshipit-source-id: 7e6c3fce83a4367808a638f0400999399b2c35b0
2020-10-14 08:48:14 -07:00
Pritam Damania
f89498f3f8 Allow RPC framework to use rank in addition to WorkerInfo and name. (#46221)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46221

The RPC framework only allowed sending RPCs based on a provided
WorkerInfo or name. When using RPC with DDP, sometimes it might just be easier
to refer to everything in terms of ranks, since DDP doesn't support names yet.

As a result, it would be helpful for the `to` parameter in the RPC APIs to also
accept a rank.
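
A sketch of the rank-based addressing (assumes an initialized RPC framework where rank 1 is a valid peer):

```
import torch
import torch.distributed.rpc as rpc

# `to` may now be a rank (int), in addition to a WorkerInfo or a name.
fut = rpc.rpc_async(to=1, func=torch.add, args=(torch.ones(2), 3))
print(fut.wait())
```
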
ghstack-source-id: 114207172

Test Plan:
1) waitforbuildbot
2) Unit Tests

Reviewed By: mrshenli

Differential Revision: D24264989

fbshipit-source-id: 5edf5d92e2bd2f213471dfe7c74eebfa9efc9f70
2020-10-13 17:52:54 -07:00
Brian Hirsh
a3caa719af fix #45552 - adding add_done_callback(fn) to torch.futures.Future (#45675)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45675
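
A minimal example of the new callback hook:

```
import torch

fut = torch.futures.Future()
# The callback receives the completed Future itself; unlike then(), it
# does not produce a new Future for chaining.
fut.add_done_callback(lambda f: print("done:", f.value()))
fut.set_result(42)  # triggers the callback, printing "done: 42"
```
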

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D24055353

Pulled By: bdhirsh

fbshipit-source-id: 9233c8e17acc878f0fecbe740a4397fb55cf722f
2020-10-13 07:47:36 -07:00
Brian Hirsh
c02efdefa8 adding complex support for distributed functions. Fixes #45760 (#45879)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45879

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D24127949

Pulled By: bdhirsh

fbshipit-source-id: 8061b14fa1c0adbe22b9397c2d7f92618556d223
2020-10-12 12:44:47 -07:00
Mingzhe Li
281463ba0b [NCCL] Enable send/recv tests (#45994)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45994

Send/Recv tests were disabled because of https://github.com/pytorch/pytorch/issues/42517. With that issue fixed, this diff enables those tests.
ghstack-source-id: 113970569

Test Plan: waitforsandcastle

Reviewed By: jiayisuse

Differential Revision: D24172484

fbshipit-source-id: 7492ee2e9bf88840c0d0086003ce8e99995aeb91
2020-10-09 15:00:39 -07:00
Rohan Varma
62554a3bd2 Prioritize raising error message about unused parameters when rebuild_buckets fails (#45933)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45933

Occasionally users run DDP with models that have unused params; in this
case we would like to surface an error message telling them to run with
find_unused_parameters=True. However, a recent change to the rebuild_buckets logic (https://github.com/pytorch/pytorch/pull/44798) made
it so that we raise a size mismatch error when this happens, but the
information about unused parameters is likely to be more useful and is
the most common failure case. Prefer raising this error over the
subsequent size mismatch errors.
ghstack-source-id: 113914759

Test Plan: Added unittest

Reviewed By: mrshenli

Differential Revision: D24151256

fbshipit-source-id: 5d349a988b4aac7d3e0ef7b3cd84dfdcbe9db675
2020-10-09 09:16:45 -07:00
Mingzhe Li
b7f7378b2d [NCCL] support send/recv to/from self when communicator is created on demand (#45873)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45873

This diff adds support for sending/receiving to/from self. It also fixes a bug that occurred when p2p operations were not used by all processes.
ghstack-source-id: 113910526

Test Plan: waitforsandcastle

Reviewed By: jiayisuse

Differential Revision: D24124413

fbshipit-source-id: edccb830757ac64f569e7908fec8cb2b43cd098d
2020-10-08 19:19:15 -07:00
Shen Li
96d48178c8 Make pipeWrite and pipeRead noexcept (#45783)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45783

After the previous device maps commits, `pipeWrite` might throw. In
this case, if we increment active calls before `pipeWrite` on the
caller, that active call won't be decremented properly when `pipeWrite`
throws. As a result, `shutdown` can silently timeout. I noticed this
as some tests take more than 60s to finish.

This commit extracts the tensor device checking logic out of pipeWrite
and makes sure the error is thrown before the active call count is
incremented.

Differential Revision: D24094803

Test Plan: Imported from OSS

Reviewed By: mruberry

Pulled By: mrshenli

fbshipit-source-id: d30316bb23d2afd3ba4f5540c3bd94a2ac10969b
2020-10-08 18:53:51 -07:00
Mingzhe Li
59083d6176 [NCCL] Support NCCL Send/Recv (#44921)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44921

This diff adds support for Process Group point-to-point operations on NCCL backend based on ncclSend/ncclRecv. See https://github.com/pytorch/pytorch/issues/43995 for more context.
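
A sketch of the new point-to-point ops on the NCCL backend (assumes `init_process_group("nccl", ...)` has run and each rank owns one GPU; the rank-equals-device mapping is an assumption for brevity):

```
import torch
import torch.distributed as dist

def ping(rank):
    # Each process must pin itself to its own GPU before p2p ops.
    torch.cuda.set_device(rank)
    t = torch.ones(4, device="cuda")
    if rank == 0:
        dist.send(t, dst=1)   # blocking send to rank 1
    elif rank == 1:
        dist.recv(t, src=0)   # blocking receive from rank 0
```
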
ghstack-source-id: 113592785

Test Plan: unittest

Reviewed By: jiayisuse

Differential Revision: D23709848

fbshipit-source-id: cdf38050379ecbb10450f3394631317b41163258
2020-10-05 18:27:57 -07:00
Shen Li
8cb7280242 Revert "Remove device maps from TensorPipe for v1.7 release (#45353)" (#45762)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45762

This reverts commit 5211fb97ac.

Test Plan: Imported from OSS

Reviewed By: colesbury

Differential Revision: D24088231

Pulled By: mrshenli

fbshipit-source-id: b6ee15ec5ae137ea127bdc2db8e1842764bc01d4
2020-10-02 15:14:05 -07:00
Rohan Varma
f8c1ca5dd8 Enable NamedTuple data type to work with DDP (#44220)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44220

Closes https://github.com/pytorch/pytorch/issues/44009
Currently, if a dataloader returns objects created with a
collections.namedtuple, they will incorrectly be cast to plain tuples. As a result, if we have data of these types, there can be runtime errors during the forward pass if the module expects a named tuple.

Fix this in
`scatter_gather.py` to resolve the issue reported in
https://github.com/pytorch/pytorch/issues/44009
ghstack-source-id: 113423287

Test Plan: CI

Reviewed By: colesbury

Differential Revision: D23536752

fbshipit-source-id: 3838e60162f29ebe424e83e474c4350ae838180b
2020-10-02 13:33:08 -07:00
Rohan Varma
181afd5220 Add an option to DDP to take a list of parameters to ignore upfront. (#44826)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44826

As described in https://github.com/pytorch/pytorch/issues/43690, there
is a need for DDP to be able to ignore certain parameters in the module (not
install allreduce hooks) for certain use cases. `find_unused_parameters` is
sufficient from a correctness perspective, but we can get better performance
with this upfront list if users know which params are unused, since we won't
have to traverse the autograd graph every iteration.

To enable this, we add a field `parameters_to_ignore` to DDP init and don't
pass in that parameter to reducer if that parameter is in the given list.
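
A hedged usage sketch (the hook name follows what this PR exposes on DistributedDataParallel; the model and the fully-qualified parameter names in the list are illustrative):

```
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 2))
# Mark parameters DDP should skip (no allreduce hooks installed).
DDP._set_params_and_buffers_to_ignore_for_model(model, ["0.weight", "0.bias"])
ddp_model = DDP(model)  # assumes a process group is already initialized
```
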
ghstack-source-id: 113210109

Test Plan: Added unittest

Reviewed By: xw285cornell, mrshenli

Differential Revision: D23740639

fbshipit-source-id: a0411712a8b0b809b9c9e6da04bef2b955ba5314
2020-09-30 11:52:50 -07:00
Natalia Gimelshein
50b91103a9 add self cuda time to avoid double/quadruple counting (#45209)
Summary:
In the profiler, CUDA ops did not report self time, so for composite functions there was no way to determine which function was actually taking the time. In addition, the "total CUDA time" reported was frequently more than the total wallclock time. This PR adds "self CUDA time" to the profiler, and computes total CUDA time based on self CUDA time, similar to how it's done for CPU. Also, slight formatting changes to make the table more compact. Before:
```
--------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                  Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls
--------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
aten::matmul          0.17%            890.805us        99.05%           523.401ms        5.234ms          49.91%           791.184ms        7.912ms          100
aten::mm              98.09%           518.336ms        98.88%           522.511ms        5.225ms          49.89%           790.885ms        7.909ms          100
aten::t               0.29%            1.530ms          0.49%            2.588ms          25.882us         0.07%            1.058ms          10.576us         100
aten::view            0.46%            2.448ms          0.46%            2.448ms          12.238us         0.06%            918.936us        4.595us          200
aten::transpose       0.13%            707.204us        0.20%            1.058ms          10.581us         0.03%            457.802us        4.578us          100
aten::empty           0.14%            716.056us        0.14%            716.056us        7.161us          0.01%            185.694us        1.857us          100
aten::as_strided      0.07%            350.935us        0.07%            350.935us        3.509us          0.01%            156.380us        1.564us          100
aten::stride          0.65%            3.458ms          0.65%            3.458ms          11.527us         0.03%            441.258us        1.471us          300
--------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Self CPU time total: 528.437ms
CUDA time total: 1.585s

Recorded timeit time:  789.0814 ms

```
Note that the recorded timeit time (with proper CUDA syncs) is half the "CUDA time total" reported by the profiler.

After
```
--------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
--------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
        aten::matmul         0.15%     802.716us        99.06%     523.548ms       5.235ms     302.451us         0.04%     791.151ms       7.912ms           100
            aten::mm        98.20%     519.007ms        98.91%     522.745ms       5.227ms     790.225ms        99.63%     790.848ms       7.908ms           100
             aten::t         0.27%       1.406ms         0.49%       2.578ms      25.783us     604.964us         0.08%       1.066ms      10.662us           100
          aten::view         0.45%       2.371ms         0.45%       2.371ms      11.856us     926.281us         0.12%     926.281us       4.631us           200
     aten::transpose         0.15%     783.462us         0.22%       1.173ms      11.727us     310.016us         0.04%     461.282us       4.613us           100
         aten::empty         0.11%     591.603us         0.11%     591.603us       5.916us     176.566us         0.02%     176.566us       1.766us           100
    aten::as_strided         0.07%     389.270us         0.07%     389.270us       3.893us     151.266us         0.02%     151.266us       1.513us           100
        aten::stride         0.60%       3.147ms         0.60%       3.147ms      10.489us     446.451us         0.06%     446.451us       1.488us           300
--------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 528.498ms
CUDA time total: 793.143ms

Recorded timeit time:  788.9832 ms

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45209

Reviewed By: zou3519

Differential Revision: D23925491

Pulled By: ngimel

fbshipit-source-id: 7f9c49238d116bfd2db9db3e8943355c953a77d0
2020-09-28 21:51:13 -07:00
Rong Rong
48d29c830d [hotfix] disable problematic cuda tests on rocm builds (#45435)
Summary:
Disable the recent 3 cuda tests on amd rocm build/tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45435

Reviewed By: malfet

Differential Revision: D23962881

Pulled By: walterddr

fbshipit-source-id: ad4ea1f835b4722cdbdce685806cfd64376cc16f
2020-09-28 10:02:12 -07:00
Rohan Varma
23dfca8351 Support record_shapes in RPC profiling (#44419)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44419

Closes https://github.com/pytorch/pytorch/issues/39969

This PR adds support for propagation of input shapes over the wire when the profiler is invoked with `record_shapes=True` over RPC. Previously, we did not respect this argument.

This is done by saving the shapes as an ivalue list and recovering it as the expected type (`std::vector<std::vector<int>>` on the client). A test is added to ensure that remote ops have the same `input_shapes` as if the op were run locally.
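
A sketch of the behavior (assumes RPC is initialized and a worker named `worker1` exists):

```
import torch
import torch.distributed.rpc as rpc
from torch.autograd import profiler

with profiler.profile(record_shapes=True) as prof:
    rpc.rpc_sync("worker1", torch.add,
                 args=(torch.ones(2, 2), torch.ones(2, 2)))
# Remote ops now carry the same input_shapes as a local run would.
print(prof.key_averages(group_by_input_shape=True).table())
```
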
ghstack-source-id: 112977899

Reviewed By: pritamdamania87

Differential Revision: D23591274

fbshipit-source-id: 7cf3b2e8df26935ead9d70e534fc2c872ccd6958
2020-09-26 13:26:44 -07:00
Wanchao Liang
32c355af5b [dist_optim] introduce distributed functional optimizer (#45221)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45221

This PR introduces a distributed functional optimizer, so that the
distributed optimizer can reuse the functional optimizer APIs and
maintain its own states. This enables a torchscript-compatible
functional optimizer when using the distributed optimizer, which helps get
rid of the GIL and improves the overall performance of training, especially
distributed model parallel training.
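
A hedged usage sketch (`params_rref` is an assumed list of RRefs to remote parameters; an active distributed autograd context is required to step):

```
import torch
from torch.distributed.optim import DistributedOptimizer

# The optimizer class is mapped internally to its functional variant
# where one exists; the user-facing API is unchanged.
dist_optim = DistributedOptimizer(
    torch.optim.SGD,
    params_rref,   # assumed: list of RRefs to remote parameters
    lr=0.05,
)
```
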

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D23935256

Pulled By: wanchaol

fbshipit-source-id: 59b6d77ff4693ab24a6e1cbb6740bcf614cc624a
2020-09-25 17:13:10 -07:00
Shen Li
5211fb97ac Remove device maps from TensorPipe for v1.7 release (#45353)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45353

Temporarily removing this feature, will add this back after branch cut.

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D23939865

Pulled By: mrshenli

fbshipit-source-id: 7dceaffea6b9a16512b5ba6036da73e7f8f83a8e
2020-09-25 16:51:45 -07:00
Pritam Damania
a2b4177c5b Add barrier() at the end of init_process_group and new_group. (#45181)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45181

`init_process_group` and `new_group` update a bunch of global
variables after initializing the actual process group. As a result, there is a
race: after initializing the process group on, say, rank 0, if we immediately
check the default process group on rank 1 (say via RPC), we might actually get
an error since rank 1 hasn't yet updated its _default_pg variable.

To resolve this issue, I've added barrier() at the end of both of these calls.
This ensures that once these calls return we are guaranteed about correct
initialization on all ranks.

Since these calls are usually done mostly during initialization, it should be
fine to add the overhead of a barrier() here.
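
In effect, the guarantee after this change is (sketch; `rank`/`world_size` assumed from the launcher):

```
import torch.distributed as dist

dist.init_process_group("gloo", rank=rank, world_size=world_size)
# Once this returns on any rank, every rank has finished updating its
# globals (e.g. _default_pg), so immediate cross-rank calls are safe.
```
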

#Closes: https://github.com/pytorch/pytorch/issues/40434, https://github.com/pytorch/pytorch/issues/40378
ghstack-source-id: 112923112

Test Plan:
Reproduced the failures in
https://github.com/pytorch/pytorch/issues/40434 and
https://github.com/pytorch/pytorch/issues/40378 and verified that this PR fixes
the issue.

Reviewed By: mrshenli

Differential Revision: D23858025

fbshipit-source-id: c4d5e46c2157981caf3ba1525dec5310dcbc1830
2020-09-25 15:46:59 -07:00
Rohan Varma
7c5436d557 [RPC profiling] Add tests to ensure RPC profiling works on single threaded (#44923)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44923

This ensures that RPC profiling works in single-threaded server
scenarios and that we won't make the assumption that we'll have multiple
threads when working on this code. For example, this assumption resulted in a
bug in the previous diff (which was fixed)
ghstack-source-id: 112868469

Test Plan: CI

Reviewed By: lw

Differential Revision: D23691304

fbshipit-source-id: b17d34ade823794cbe949b70a5ab35723d974203
2020-09-25 13:24:18 -07:00
Rohan Varma
27ab9bc0f9 [RPC profiling] Extend RPC profiling to support async function execution over RPC. (#44664)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44664

Closes https://github.com/pytorch/pytorch/issues/39971. This PR adds support for functions decorated with `rpc.functions.async_execution` to be profiled over RPC as builtins, jit functions, and blocking python UDFs currently can be. The reasoning for this is to provide complete feature support in terms of RPC profiling and the various types of functions users can run.

To enable this, the PR below this one makes it safe to call `disableProfiler()` from another thread. We use that functionality to defer disabling the profiler on the server until the future corresponding to the RPC request completes (rather than only the blocking `processRPC` call, as was done previously). Since by the time the future completes the async function has been kicked off and its corresponding future has completed, we are able to capture any RPCs the function called as well as the actual work done on the other node.

For example, if the following async function is ran on a server over RPC:

```
import time

import torch
import torch.distributed.rpc as rpc

def slow_add(x, y):
    time.sleep(1)
    return torch.add(x, y)

@rpc.functions.async_execution
def slow_async_add(to, x, y):
    return rpc.rpc_async(to, slow_add, args=(x, y))
```

we expect to see the original RPC profiled, the nested RPC profiled, and the actual torch.add() work. All of these events should be recorded with the correct node id. Here is an example profiling output:

```
Name                                                                                                                    Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  Number of Calls  Node ID
----------------------------------------------------------------------------------------------------------------------  ----------------  --------------  -----------  ---------  ------------  ---------------  -------
rpc_async#slow_async_add(worker1 -> worker2)                                                                            0.00%             0.000us         0            1.012s     1.012s        1                1
aten::empty                                                                                                             7.02%             11.519us        7.02%        11.519us   11.519us      1                1
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)                          0.00%             0.000us         0            1.006s     1.006s        1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: aten::empty                                                     7.21%             11.843us        7.21%        11.843us   11.843us      1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::add     71.94%            118.107us       85.77%       140.802us  140.802us     1                3
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::empty   13.82%            22.695us        13.82%       22.695us   22.695us      1                3
----------------------------------------------------------------------------------------------------------------------  ----------------  --------------  -----------  ---------  ------------  ---------------  -------
Self CPU time total: 164.164us
```

This PR also moves a bunch of the profiling logic to `rpc/utils.cpp` to declutter `request_callback` code.
ghstack-source-id: 112868470

Test Plan:
```
rvarm1@devbig978:fbcode  (52dd34f6)$ buck test mode/no-gpu mode/dev-nosan //caffe2/test/distributed/rpc:process_group_agent -- test_rpc_profiling_async_function --print-passing-details --stress-runs 1
```

Reviewed By: mrshenli

Differential Revision: D23638387

fbshipit-source-id: eedb6d48173a4ecd41d70a9c64048920bd4807c4
2020-09-25 13:19:26 -07:00