Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70029
This PR implements scatter and adds it to ProcessGroupNCCL.
NCCL doesn't directly provide a scatter primitive, so it has to be implemented on top of NCCL's send/recv API.
1. In ProcessGroupNCCL.cpp, the inputTensors are first flattened, then outputTensors and inputFlattened are passed by the collective class to the scatter() function in nccl.cpp.
2. In nccl.cpp, scatter is implemented using ncclSend/ncclRecv: the root rank uses a for loop to send (distribute) the inputTensors to each rank, and all ranks receive their inputTensor from the root rank.
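For illustration, a minimal Python sketch of the same scatter-on-top-of-send/recv pattern, written against torch.distributed point-to-point ops rather than the actual ncclSend/ncclRecv calls in nccl.cpp; the helper name and the root's local copy are assumptions, not the PR's code:
```
import torch.distributed as dist

def scatter_via_send_recv(output, inputs, root, group=None):
    # Root distributes inputs[i] to rank i; every other rank receives its tensor.
    rank = dist.get_rank(group)
    world_size = dist.get_world_size(group)
    if rank == root:
        for dst in range(world_size):
            if dst == root:
                output.copy_(inputs[dst])        # local copy, no network traffic
            else:
                dist.send(inputs[dst], dst, group=group)
    else:
        dist.recv(output, src=root, group=group)
```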
ghstack-source-id: 147754837
Test Plan:
test_scatter_ops
test_scatter_stress
test_scatter_checks
Reviewed By: pritamdamania87
Differential Revision: D33154823
fbshipit-source-id: 4513e7eaf7d47a60eb67da99dc6c2e9a2882f3fd
(cherry picked from commit 93201f9d4a)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66745
This PR implements gather and adds it to ProcessGroupNCCL using NCCL's send/recv API.
NCCL doesn't directly provide a gather primitive, so it has to be implemented on top of NCCL's send/recv API.
1. In ProcessGroupNCCL.cpp, the outputTensors are first flattened, then inputTensors and outputFlattened are passed by the collective class to the gather() function in nccl.cpp.
2. In nccl.cpp, gather is implemented using ncclSend/ncclRecv: all ranks send their inputTensor to the root rank, and the root rank uses a for loop to receive these inputTensors.
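And the mirror-image pattern for gather, again as a hedged Python sketch using torch.distributed point-to-point ops rather than the C++ implementation in nccl.cpp:
```
import torch.distributed as dist

def gather_via_send_recv(input, outputs, root, group=None):
    # Every rank sends its input to the root; the root receives into outputs[i].
    rank = dist.get_rank(group)
    world_size = dist.get_world_size(group)
    if rank == root:
        for src in range(world_size):
            if src == root:
                outputs[src].copy_(input)        # local copy for the root's own slice
            else:
                dist.recv(outputs[src], src=src, group=group)
    else:
        dist.send(input, dst=root, group=group)
```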
ghstack-source-id: 147754838
Test Plan:
test_gather_ops
test_gather_checks
test_gather_stress
Reviewed By: pritamdamania87
Differential Revision: D29616361
fbshipit-source-id: b500d9b8e67113194c5cc6575fb0e5d806dc7782
(cherry picked from commit d560ee732e)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70628
The event_listener thread is used to log process tracebacks when a timed
out process sends it a request for its traceback. However, this thread is
created in the `_run` function, which is overridden by some classes such as
`TestDistBackend`, so those tests did not have this feature. Move the
event_listener setup logic to `run_test`, which is called by all distributed
test classes, enabling it for all distributed tests. Also modify the logger
setup to ensure that logging.info calls are printed in the subprocess.
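A hedged sketch of the intended shape (argument and attribute names assumed; the real run_test in common_distributed.py has more plumbing):
```
import logging
import threading

def run_test(self, test_name, parent_pipe):
    # Configure logging in the subprocess so logging.info output is visible.
    logging.basicConfig(level=logging.INFO)
    # Start the traceback listener here rather than in _run, which some
    # subclasses (e.g. TestDistBackend) override; shutdown handling omitted.
    listener = threading.Thread(target=self._event_listener, args=(parent_pipe,))
    listener.start()
    getattr(self, test_name)()
```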
ghstack-source-id: 146714642
Test Plan: CI
Reviewed By: jaceyca, fduwjj
Differential Revision: D33410613
fbshipit-source-id: aa616d69d251bc9d04e45781c501d2244f011843
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67256
To control which tests can be run in various cases, the check logic should be moved into functions and variables that can be changed.
One challenge here is that decorators are not dynamic: if something is read in at import time and changed afterwards, the change has no effect. This means we need to separate out the variables that need to be changed for our use case.
Those are put into common_distributed.py and can be changed before importing the distributed_test.py code.
The use case is to add new backends to the tests and split them into tests that can be run on demand as a separate instance. To do so, you would import DistTestSkipCases into a launcher or a setup script, change it there, and then load distributed_test (see the sketch below).
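A toy illustration of the import-time problem and the fix; none of these names are the real ones (DistTestSkipCases itself lives in common_distributed.py). The point is that the skip condition must be read when the test runs, not captured at import:
```
import functools
import unittest

SKIPPED_BACKENDS = {"some_backend"}   # stand-in for the configurable skip cases

def skip_if_backend_disabled(backend):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # Read the variable lazily, so a launcher that mutates it after
            # import (but before the test runs) still takes effect.
            if backend in SKIPPED_BACKENDS:
                raise unittest.SkipTest(f"{backend} is disabled in this run")
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```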
Test Plan: Check the signals
Reviewed By: mrshenli
Differential Revision: D31906947
fbshipit-source-id: 45e3258c55f4dc34e12a468bed65280f4c25748f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68223
DETAIL debug mode didn't work with object-based collectives for the NCCL backend, because we'd only check whether the backend is NCCL and then move tensors to CUDA.
Instead, check whether it is a wrapped PG, and then check the wrapped pg to see if it's NCCL.
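A hedged sketch of that check; the `wrapped_pg` attribute and helper name are assumptions based on this description rather than the exact code in distributed_c10d.py:
```
import torch.distributed as dist

def _is_nccl_backend(pg):
    # DETAIL debug mode wraps the real process group; unwrap it (attribute
    # name assumed here) before checking whether the inner backend is NCCL.
    while hasattr(pg, "wrapped_pg"):
        pg = pg.wrapped_pg
    return dist.is_nccl_available() and isinstance(pg, dist.ProcessGroupNCCL)
```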
ghstack-source-id: 143242023
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D32366840
fbshipit-source-id: be0a2af6849f8f24446593f4a4fbea4a67586ee5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67050
This PR moves init_multi_gpu_helper to common_distributed so that it could be shared by different distributed tests.
ghstack-source-id: 141370119
Test Plan: wait for ci.
Reviewed By: mrshenli
Differential Revision: D31842644
fbshipit-source-id: c7bad25d6cef9bdce7ad1fb6c60c1cad4b765702
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65772
Looking at some workloads, it would be useful to have this info.
ghstack-source-id: 140555200
Test Plan: CI
Reviewed By: zhaojuanmao, wayi1
Differential Revision: D31224417
fbshipit-source-id: 14eeb053aced87c7ca43b6879f81f54bd0a42b76
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65391
TSAN tests are much slower than the usual dev/opt modes, by about 5-10x.
As a result, for TSAN build mode we use a much higher timeout for distributed
tests.
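An illustrative sketch of the kind of change involved (the constants and environment variable check are assumptions, not the exact values in common_distributed.py):
```
import os

DEFAULT_TIMEOUT = 300        # seconds per distributed test
TSAN_MULTIPLIER = 10         # assumed factor covering the 5-10x slowdown

def distributed_test_timeout():
    running_with_tsan = os.environ.get("PYTORCH_TEST_WITH_TSAN") == "1"
    return DEFAULT_TIMEOUT * (TSAN_MULTIPLIER if running_with_tsan else 1)
```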
ghstack-source-id: 138584613
Test Plan: waitforbuildbot
Reviewed By: cbalioglu
Differential Revision: D31076575
fbshipit-source-id: 44a485f07101deac536470ceeff2a52cac4f9e0b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63711
This removes `_fork_process` from common_distributed.py and fixes all
other call sites to use `spawn_process` instead.
ghstack-source-id: 136395719
Test Plan: waitforbuildbot
Reviewed By: xush6528
Differential Revision: D30463834
fbshipit-source-id: 0c09e8a996d0e5b912c8cdd45488a39951bac4db
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63361
Python multiprocessing doesn't support LSAN and causes false positives
instead. As a result, disable LSAN for these tests so that we can still run
them with opt-asan.
ghstack-source-id: 135962489
Test Plan: waitforbuildbot
Reviewed By: rohan-varma
Differential Revision: D30352269
fbshipit-source-id: f6ab5abce7bdef00cd5e1f5977424d2b151174af
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62051
The goal here is to enable opt-asan for "spawn" based unit tests since
this works for "spawn" unlike "dev-asan". As a result, we can run ASAN for
"spawn" unit tests as well.
This means we can completely remove fork unit tests from the code base since
the only purpose for these tests was to run ASAN.
ghstack-source-id: 135523770
Test Plan: waitforbuildbot
Reviewed By: SciPioneer
Differential Revision: D29854514
fbshipit-source-id: 02a5bfcfae2afc21badecff77082c7a6ad83636b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61991
Continuation of https://github.com/pytorch/pytorch/pull/61887 and
removing unittest.skip as much as possible.
ghstack-source-id: 134759368
Test Plan: waitforbuildbot
Reviewed By: SciPioneer
Differential Revision: D29831860
fbshipit-source-id: fe57a7d56d4423924a2dec10bb670137ace0c9a4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61887
1) Introduced a `sandcastle_skip_if` decorator that ensures these
tests just get reported as passed on Sandcastle.
2) Fixed all test files under `test/distributed` to not use `unittest.skip`.
The overall goal is to avoid using skips, since Sandcastle tags these tests as
continuously skipping.
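A hedged sketch of such a decorator; the environment check is an assumption (compare the `SANDCASTLE=1` Test Plan of D29779699 below):
```
import functools
import os
import unittest

def sandcastle_skip_if(condition, reason):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if condition:
                if os.environ.get("SANDCASTLE") == "1":
                    # On Sandcastle, return instead of skipping so the test is
                    # reported as passed rather than continuously skipping.
                    print(f"Passing {fn.__name__} on Sandcastle: {reason}")
                    return None
                raise unittest.SkipTest(reason)
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```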
ghstack-source-id: 134382237
Test Plan: waitforbuildbot
Reviewed By: SciPioneer
Differential Revision: D29784152
fbshipit-source-id: 17b4df6c5a55ff1d1e8e1de128fa679c3dfbcb7d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61876
In the sandcastle environment, avoid skipping tests and instead just
"pass" these tests to avoid a large number of tasks being created which are not
actionable.
ghstack-source-id: 133846232
Test Plan: Test with `SANDCASTLE=1 TW_JOB_USER=sandcastle`
Reviewed By: rohan-varma
Differential Revision: D29779699
fbshipit-source-id: add71008830dfa6f456ce2365a2d70436b7b7a31
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61409
We used a multiprocessing.Manager in order to share TEST_SKIPS between the parent and the child processes. TEST_SKIPS is a global variable that defines a unique error code for each "error type", so that the parent can figure out the reason a child exited.

While originally this mapping was immutable, at some point we allowed children to modify the parent's value of that mapping so they could update the message for the `multi-gpu` error to make it reflect how many GPUs were really needed. This occurred in D23285790 (2a4d312027). Since then this Manager has proved to be quite problematic, especially around thread safety, races, TSAN, ... (see D22753459 (f0c46878c6), D23641618 (567c51cce9), D28490129, D28794321 (0128eb9a85) and D29585862).

This seems like an awful lot of trouble for such a small piece of functionality. Here I propose we drop the Manager and instead get the same result by using separate error codes for each number of GPUs. It should be much simpler and thus more robust.
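A sketch of the proposed shape; the TestSkip structure and exit codes here are placeholders rather than the actual values in common_distributed.py:
```
from typing import NamedTuple

class TestSkip(NamedTuple):
    exit_code: int
    message: str

# One immutable entry -- and hence one distinct child exit code -- per GPU
# count, instead of mutating a shared "multi-gpu" entry through a Manager.
TEST_SKIPS = {
    f"multi-gpu-{n}": TestSkip(80 + n, f"Need at least {n} CUDA devices")
    for n in range(2, 9)
}
```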
ghstack-source-id: 133236447
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D29612614
fbshipit-source-id: 8ad0fedcb7796e5832a0eb196f8fdc147e02b3df
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59238
Creating a `multiprocessing.Manager()` launches a new process using the `fork` method (because it's the default one), and then in that subprocess it launches a new thread. TSAN really doesn't like this (and rightly so!) because we already had threads in the parent process, and intermixing threads and forks is dangerous. The proper way to deal with this is to `exec` inside the child process or, in other words, use the `spawn` method.
Note that the method used to launch the Manager is entirely unrelated to the method used to launch our "own" subprocesses, hence we were using `fork` for the Manager even though we were using `spawn` for our own subprocesses.
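Presumably the fix amounts to creating the Manager from an explicit spawn context, along these lines (a sketch, not the exact diff):
```
import multiprocessing

# Use a "spawn" context so the Manager's helper process execs a fresh
# interpreter instead of forking an already-threaded parent.
manager = multiprocessing.get_context("spawn").Manager()
shared_skips = manager.dict()
```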
ghstack-source-id: 130240724
Test Plan: Reverted the silencing introduced in D28490129, ran the `test_init_rpc_then_pg` test from the TensorPipe suite and saw the original TSAN failure. Then applied my fix, re-ran the test, and the failure was gone.
Reviewed By: zhaojuanmao
Differential Revision: D28794321
fbshipit-source-id: 12242e69be399a7f02a40a0ebb3d92f92e00ce73
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60917
The second line of the error log didn't use an f-string properly.
Before fix:
```
exiting process with exit code: {MultiProcessTestCase.TEST_ERROR_EXIT_CODE}
```
After fix:
```
exiting process 3 with exit code: 10
```
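In other words, the second log line was simply missing the `f` prefix; a minimal reproduction of the before/after behavior:
```
rank, exit_code = 3, 10
print("exiting process {rank} with exit code: {exit_code}")    # before: braces printed literally
print(f"exiting process {rank} with exit code: {exit_code}")   # after: values interpolated
```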
ghstack-source-id: 132618199
Test Plan: waitforbuildbot
Reviewed By: rohan-varma
Differential Revision: D29446574
fbshipit-source-id: f806ef0470cb6aa86fe3c404e1c895514abb6488
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60167
We were getting errors such as this on Windows in our c10d ProcessGroup test suite:
```
test_send_recv_all_to_all (__main__.ProcessGroupGlooTest) ... Exception in thread Thread-1:
Traceback (most recent call last):
File "C:\Jenkins\Miniconda3\lib\threading.py", line 932, in _bootstrap_inner
self.run()
File "C:\Jenkins\Miniconda3\lib\threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\circleci\project\build\win_tmp\build\torch\testing\_internal\common_distributed.py", line 471, in _event_listener
if pipe.poll(None):
File "C:\Jenkins\Miniconda3\lib\multiprocessing\connection.py", line 257, in poll
return self._poll(timeout)
File "C:\Jenkins\Miniconda3\lib\multiprocessing\connection.py", line 330, in _poll
return bool(wait([self], timeout))
File "C:\Jenkins\Miniconda3\lib\multiprocessing\connection.py", line 883, in wait
ov.cancel()
OSError: [WinError 6] The handle is invalid
Fatal Python error: could not acquire lock for <_io.BufferedWriter name='<stderr>'> at interpreter shutdown, possibly due to daemon threads
Python runtime state: finalizing (tstate=000001EFDF228CE0)
Thread 0x00001f68 (most recent call first):
File "C:\Jenkins\Miniconda3\lib\threading.py", line 1202 in invoke_excepthook
File "C:\Jenkins\Miniconda3\lib\threading.py", line 934 in _bootstrap_inner
File "C:\Jenkins\Miniconda3\lib\threading.py", line 890 in _bootstrap
Current thread 0x00000f94 (most recent call first):
<no Python frame>
FAIL (5.009s)
```
And the process would then exit with error code 3221226505.
See: https://app.circleci.com/pipelines/github/pytorch/pytorch/337351/workflows/ad919a3e-fe9a-4566-8ad6-8b0a252f730c/jobs/14170191/steps
By looking at [the code of `_event_listener` in `common_distributed.py`](eb36f67dcc/torch/testing/_internal/common_distributed.py (L467-L489)) I think that the first exception (the one about the handle being invalid) is "expected" as it results from another thread purposely closing the pipe while that thread is polling it.
The relevant part of the problem seems to be the "could not acquire lock" one. I think this stems from the event listener thread being launched as a daemon thread, which means the interpreter will not wait for that thread to complete before shutting down. When the interpreter shuts down it instantly aborts all other threads. If the event listener thread was aborted _while_ it was logging to stderr then that thread was holding the lock but never got to release it. This is probably what the error is complaining about. This seems to be intended/expected behavior for CPython: https://bugs.python.org/issue42717.
The solution thus is simple: don't make that thread a daemon thread and explicitly wait for it to terminate before shutting down.
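A self-contained sketch of that shutdown discipline (simplified relative to common_distributed.py):
```
import multiprocessing
import threading

def _event_listener(pipe):
    # Poll until the parent closes its end, then return cleanly instead of
    # being killed mid-write at interpreter shutdown.
    while True:
        try:
            if pipe.poll(None):
                pipe.recv()
        except (EOFError, OSError):
            return

parent_conn, child_conn = multiprocessing.Pipe()
listener = threading.Thread(target=_event_listener, args=(child_conn,))  # not a daemon
listener.start()
# ... test body runs here ...
parent_conn.close()   # wakes the listener's poll()
listener.join()       # explicitly wait for it before the interpreter shuts down
child_conn.close()
```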
ghstack-source-id: 132293710
Test Plan: Will see...
Reviewed By: pritamdamania87
Differential Revision: D29193014
fbshipit-source-id: 4aabe1fc74bf9c54ca605e7a702ac99655489780
Summary:
During development it is common practice to put `type: ignore` comments on lines that are correct, but that `mypy` doesn't recognize as such. This often stems from the fact that the `mypy` version in use wasn't able to handle the pattern.
With every new release `mypy` gets better at handling complex code. In addition to fixing all the previously accepted but now failing patterns, we should also revisit all `type: ignore` comments to see if they are still needed. Fortunately, we don't need to do this manually: by adding `warn_unused_ignores = True` to the configuration, `mypy` will error out whenever it encounters a `type: ignore` that is no longer needed.
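For example, once a newer `mypy` understands the pattern, `warn_unused_ignores = True` makes it report the stale suppression itself so it can be deleted:
```
def parse_port(raw: str) -> int:
    # Hypothetical example: an older mypy needed the ignore below; once mypy
    # handles the pattern, warn_unused_ignores flags it as an unused comment.
    return int(raw)  # type: ignore[return-value]
```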
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60006
Reviewed By: jbschlosser, malfet
Differential Revision: D29133237
Pulled By: albanD
fbshipit-source-id: 41e82edc5cd5affa7ccedad044b59b94dad4425a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59284
Logs a few python-side errors to DDP logging.
TODO: Most python errors actually have to do with user input correctness, so they throw before the reducer is constructed and thus there is no logger. For this case, should we allow `logger` to be created optionally without a reducer, just for the purpose of logging errors, so that we can gain insight into these errors in Scuba?
ghstack-source-id: 130412973
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D28820290
fbshipit-source-id: 610e5dba885b173c52351f7ab25c923edce639e0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56717
The signal_handler was under the caffe2 namespace but was being used
by PyTorch as well.
I've fixed this by moving it to the c10 namespace where now both C2 and PyTorch
can use it.
The signal_handler interface in caffe2/utils/signal_handler.h is kept the same
for backward compatibility for C2, but most of the common code is moved to c10.
ghstack-source-id: 127446929
Test Plan: waitforbuildbot
Reviewed By: ezyang
Differential Revision: D27946738
fbshipit-source-id: d6228d1a0108f4c807d405e7a0bb799c5375388f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56410
Changes:
- Move the create_tcp_store() helper function to a common file
- Update test_jit_c10d to retry TCPStore creation in case the allocated port becomes used
fixes https://github.com/pytorch/pytorch/issues/55053
Test Plan: Imported from OSS
Reviewed By: heitorschueroff
Differential Revision: D27869560
Pulled By: H-Huang
fbshipit-source-id: f4a6613049bb25e6f6f194214379a380968bb19c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55075
Constructs a mapping of parameter names and passes it into Reducer so that error messages about unused parameters/not all parameters getting gradient can say which parameters were unused.
Use case:
1) User runs DDP forward + bwd, and it has some unused parameters that will result in ddp error in next iteration
2) Next forward pass calls `Reducer::ensure_prior_reduction_finished()` where we check all params got gradient from the previous bwd pass. DDP would throw here in this case.
3) Reducer maintains mapping and tracks used parameters, and computes which parameters did not get gradient and logs this as part of the error.
Implementation details:
0) The following is only enabled for debug modes of INFO or DETAIL.
1) To save memory, we don't map param -> param name so that we don't have to copy the entire tensor, instead we map param_index -> param_name and use the existing concept of variable_index in Reducer to look up parameter names.
2) DDP constructs param index -> param name mapping. The name is the fully qualified name: f"{module_name}:{param_name}" and passes it into Reducer
3) Reducer maintains per-iteration std::set<int> of variable indices that have had `mark_variable_ready` called.
4) When some params go unused, we take a set difference to detect the unused params.
5) Unittests to test the logged unused params, as well as for nested modules, are added
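A hedged Python sketch of the mapping built in step 2 of the implementation details above; the helper name and traversal are assumptions, only the `f"{module_name}:{param_name}"` format comes from the description:
```
import torch

def build_param_index_to_name(module: torch.nn.Module, params):
    # Map each parameter object to its fully qualified name ...
    param_to_name = {}
    for module_name, submodule in module.named_modules():
        for param_name, param in submodule.named_parameters(recurse=False):
            param_to_name[param] = f"{module_name}:{param_name}"
    # ... then key the mapping by the Reducer's variable index instead of the
    # tensor itself, so no parameter storage needs to be copied or held.
    return {idx: param_to_name[param] for idx, param in enumerate(params)}
```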
ghstack-source-id: 126581051
Test Plan: CI, UT
Reviewed By: zhaojuanmao
Differential Revision: D27356394
fbshipit-source-id: 89f436af4e74145b0a8eda92b3c4e2af8e747332
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54769
Follow-up to #53820. This
- makes the `asserts.py` module private as per suggestion from rgommers in https://github.com/pytorch/pytorch/pull/53820#issuecomment-802661387. With this the functions should only be accessible through `torch.testing`, giving us the option the change the underlying structure later.
- moves the code from `torch/testing/__init__.py` to `torch/testing/_core.py` (happy to accept other name suggestions). Otherwise we can't import the new `_asserts.py` in `torch/testing/__init__.py` due to circular imports.
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D27438451
Pulled By: mruberry
fbshipit-source-id: c7292b4d5709185b42b4aac8016648562688040e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55003
Using the `caffe2::setPrintStackTracesOnFatalSignal` utility in
distributed tests to set a signal handler that dumps the state of all threads
for all processes when it receives a FATAL signal. This would help in debugging
tests further.
I had to revert all the python faulthandler code since only one signal handler
function is supported, so running python faulthandler with
`setPrintStackTracesOnFatalSignal` doesn't work.
Sample output:
```
SIGSEGV(11), PID: 3492872, Thread 3492872:
[0] ???(0x7fa7b2d1d61b) in libcaffe2_caffe2_caffe2_cpu.so
[1] ???(0x7fa7b2d1d3fb) in libcaffe2_caffe2_caffe2_cpu.so
[2] ???(0x7fa7b2d1d33d) in libcaffe2_caffe2_caffe2_cpu.so
[3] ???(0x7fa7b2d1d167) in libcaffe2_caffe2_caffe2_cpu.so
[4] ???(0x7fa7ce683150) in libpthread.so.0
[5] ???(0x7fa7be2b233c) in libcaffe2__C_impl_cuda.so
[6] ???(0x7fa7be2ce80c) in libcaffe2__C_impl_cuda.so
[7] ???(0x7fa7be2a0512) in libcaffe2__C_impl_cuda.so
[8] torch::distributed::rpc::TensorPipeAgent::send(torch::distributed::rpc::WorkerInfo const&, torch::distributed::rpc::Message&&, float, std::unordered_map<signed char, signed char, std::hash<signed char>, std::equal_to<signed char>, std::allocator<std::pair<signed char const, signed char> > > const&)+0x24f(0x7fa7be29f71f) in libcaffe2__C_impl_cuda.so
[9] torch::distributed::autograd::sendMessageWithAutograd(torch::distributed::rpc::RpcAgent&, torch::distributed::rpc::WorkerInfo const&, torch::distributed::rpc::Message&&, bool, float, bool)+0x393(0x7fa7b602b203) in libcaffe2_libtorch.so
[10] torch::distributed::rpc::pyRpcPythonUdf(torch::distributed::rpc::WorkerInfo const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, std::vector<at::Tensor, std::allocator<at::Tensor> >&, float, bool)+0x201(0x7fa7bd844971) in libcaffe2__C_impl_cuda.so
```
ghstack-source-id: 125630551
Test Plan: waitforbuildbot
Reviewed By: SciPioneer
Differential Revision: D27419714
fbshipit-source-id: 8aca9a14ef688004053d8798124d9c3a3fbe3489
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54818
Several flaky tests fail due to some sort of timeout and it isn't
clear from the error message in CI where exactly each process is stuck. In this
PR, I've added a mechanism to dump the entire python traceback of all python
threads when we encounter a timeout.
Example traceback:
```
Process 3 timed out with traceback:
Current thread 0x00007ff3363ff700 (most recent call first):
File "torch/testing/_internal/common_distributed.py", line 373 in _event_listener
File "threading.py", line 870 in run
File "threading.py", line 932 in _bootstrap_inner
File "threading.py", line 890 in _bootstrap
Thread 0x00007ff406132180 (most recent call first):
File "torch/distributed/distributed_c10d.py", line 2477 in barrier
File "torch/testing/_internal/distributed/rpc/rpc_test.py", line 838 in test_reinit
File "torch/testing/_internal/dist_utils.py", line 90 in new_test_method
File "torch/testing/_internal/common_distributed.py", line 292 in wrapper
File "torch/testing/_internal/common_distributed.py", line 409 in run_test
File "torch/testing/_internal/common_distributed.py", line 393 in _run
File "multiprocessing/process.py", line 108 in run
File "multiprocessing/process.py", line 315 in _bootstrap
File "multiprocessing/popen_fork.py", line 75 in _launch
File "multiprocessing/popen_fork.py", line 19 in __init__
File "multiprocessing/context.py", line 277 in _Popen
File "multiprocessing/process.py", line 121 in start
```
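The per-thread dump itself can be produced with standard-library facilities; a sketch (the parent/child pipe plumbing that triggers it is omitted):
```
import sys
import traceback

def all_thread_tracebacks() -> str:
    chunks = []
    for thread_id, frame in sys._current_frames().items():
        chunks.append(f"Thread {thread_id:#x} (most recent call last):\n")
        chunks.append("".join(traceback.format_stack(frame)))
    return "".join(chunks)
```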
ghstack-source-id: 125323810
Test Plan: waitforbuildbot
Reviewed By: rohan-varma
Differential Revision: D27378764
fbshipit-source-id: 661c009a5458c724f004aa83de9347a4bc03b63e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54740
Adds a simple helper decorator to set/unset NCCL blocking wait for
tests. This is easier than having to manually set/unset the os.environ vars
every time.
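A hedged sketch of such a decorator; the helper name is assumed, while NCCL_BLOCKING_WAIT is the environment variable this refers to:
```
import functools
import os

def with_nccl_blocking_wait(fn):
    # Turn on NCCL_BLOCKING_WAIT for the duration of the test, then restore
    # whatever value (or absence) was there before.
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        prev = os.environ.get("NCCL_BLOCKING_WAIT")
        os.environ["NCCL_BLOCKING_WAIT"] = "1"
        try:
            return fn(*args, **kwargs)
        finally:
            if prev is None:
                os.environ.pop("NCCL_BLOCKING_WAIT", None)
            else:
                os.environ["NCCL_BLOCKING_WAIT"] = prev
    return wrapper
```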
ghstack-source-id: 125233693
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D27277222
fbshipit-source-id: c289b9d05e2f6328d672810b07501979b6e177c6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54531
Enabling faulthandler will intercept signals like SIGSEGV, SIGFPE,
SIGABRT, SIGBUS and SIGILL and dump the entire python traceback before the
process goes down.
This can help us in debugging flaky tests where a process crashes and we need
to debug what happened.
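Enabling it is a one-line standard-library call, run early in each test process; for example:
```
import faulthandler
import sys

# Dump the Python traceback of all threads if the process receives a fatal
# signal (SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL).
faulthandler.enable(file=sys.stderr, all_threads=True)
```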
ghstack-source-id: 125045894
Test Plan:
1) Tested locally to see traceback is produced.
2) waitforbuildbot
Reviewed By: rohan-varma
Differential Revision: D27271048
fbshipit-source-id: ca12125a9da6cdfc7bac5619ad1c7e116666014b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52632
Distributed tests run in a multiprocessing environment, where a parent
process drives the tests through several child processes. As a result, when a
child process fails the parent only prints the following:
```
Process 0 exited with error code 10
```
The child process also logs its own exception, but it is cumbersome to go
through the logs and track this down.
To alleviate this, I've added a set of pipes, one for each child process, so that
the child process writes its error to the pipe before exiting and the parent
process can read the appropriate error from the pipe and display it.
The new output printed by the parent is as follows:
```
> RuntimeError: Process 0 exited with error code 10 and exception:
Traceback (most recent call last):
File "torch/testing/_internal/common_distributed.py", line 361, in _run
getattr(self, test_name)()
File "torch/testing/_internal/common_distributed.py", line 288, in wrapper
fn()
File "test_c10d.py", line 789, in test_broadcast_checks
pg.broadcast([t1], opts)
ValueError: ProcessGroupGloo::broadcast: invalid root rank: -1
Process 1 exited with error code 10 and exception:
Traceback (most recent call last):
File "torch/testing/_internal/common_distributed.py", line 361, in _run
getattr(self, test_name)()
File "torch/testing/_internal/common_distributed.py", line 288, in wrapper
fn()
File "test_c10d.py", line 789, in test_broadcast_checks
pg.broadcast([t1], opts)
ValueError: ProcessGroupGloo::broadcast: invalid root rank: -1
Process 2 exited with error code 10 and exception:
Traceback (most recent call last):
File "torch/testing/_internal/common_distributed.py", line 361, in _run
getattr(self, test_name)()
File "torch/testing/_internal/common_distributed.py", line 288, in wrapper
fn()
File "test_c10d.py", line 789, in test_broadcast_checks
pg.broadcast([t1], opts)
ValueError: ProcessGroupGloo::broadcast: invalid root rank: -1
Process 3 exited with error code 10 and exception:
Traceback (most recent call last):
File "torch/testing/_internal/common_distributed.py", line 361, in _run
getattr(self, test_name)()
File "torch/testing/_internal/common_distributed.py", line 288, in wrapper
fn()
File "test_c10d.py", line 789, in test_broadcast_checks
pg.broadcast([t1], opts)
ValueError: ProcessGroupGloo::broadcast: invalid root rank: -1
```
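A simplified, self-contained sketch of the mechanism (MultiProcessTestCase does this per test process; the helper names here are made up):
```
import multiprocessing
import sys
import traceback

TEST_ERROR_EXIT_CODE = 10

def _child(test_fn, conn):
    try:
        test_fn()
        conn.send(None)
    except Exception:
        conn.send(traceback.format_exc())   # ship the traceback to the parent
        conn.close()
        sys.exit(TEST_ERROR_EXIT_CODE)
    conn.close()

def run_in_subprocess(test_fn):
    parent_conn, child_conn = multiprocessing.Pipe()
    proc = multiprocessing.Process(target=_child, args=(test_fn, child_conn))
    proc.start()
    proc.join()
    error = parent_conn.recv() if parent_conn.poll() else None
    if proc.exitcode != 0:
        raise RuntimeError(
            f"Process exited with error code {proc.exitcode} and exception:\n{error}"
        )
```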
ghstack-source-id: 122273793
Test Plan: waitforbuildbot
Reviewed By: rohan-varma
Differential Revision: D26589274
fbshipit-source-id: 7b7a71ec790b216a89db7c157377f426531349a5
Summary:
Take 2 of https://github.com/pytorch/pytorch/issues/50914
This change moves the early termination logic into common_utils.TestCase class.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52126
Test Plan: CI with ci-all tag
Reviewed By: malfet
Differential Revision: D26391762
Pulled By: walterddr
fbshipit-source-id: a149ecc47ccda7f2795e107fb95915506ae060b4
Summary:
This is a follow-up to https://github.com/pytorch/pytorch/issues/49869.
Previously, CUDA early termination only happened for generic test classes that extend from `DeviceTypeTestBase`. However, JIT test cases, which extend from common_utils.TestCase, could not benefit from the early termination.
This change moves the early termination logic into the common_utils.TestCase class.
- All tests extended from common_utils.TestCase should now terminate early if a CUDA assert occurs.
- For TestCases that extend from common_device_type.DeviceTypeTestBase, still only do torch.cuda.synchronize() when an RTE is thrown.
- For TestCases that extend common_utils.TestCase, regardless of whether a test case uses the GPU or not, it will always synchronize CUDA as long as `torch.cuda.is_initialized()` returns true.
- Disable this in common_distributed.py.
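A hedged sketch of the check, assuming a post-test hook on the TestCase; the real logic in common_utils.TestCase differs in detail:
```
import torch

def _check_for_cuda_assert(testcase):
    # Only touch CUDA if something in the process has already initialized it.
    if torch.cuda.is_initialized():
        try:
            torch.cuda.synchronize()
        except RuntimeError as err:
            # A device-side assert poisons the CUDA context; fail loudly so the
            # remaining tests in this process can be terminated early.
            testcase.fail(f"CUDA context is in a bad state: {err}")
```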
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50914
Reviewed By: malfet
Differential Revision: D26019289
Pulled By: walterddr
fbshipit-source-id: ddc7c1c0d00db4d073a6c8bc5b7733637a7e77d1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44418
This commit uses TensorPipe's cuda_ipc channel to conduct
cross-process same-machine GPU-to-GPU communication. On the sender
side, `TensorPipeAgent` grabs a stream for each device used by the
message, lets these streams wait for the current streams, and passes
the streams to TensorPipe's `CudaBuffer`. On the receiver side, it
also grabs a stream for each device used in the message, and uses
these streams to receive tensors and run user functions. After that,
these streams are then used for sending the response back to the
sender. When receiving the response, the sender will grab a new set
of streams and use them for TensorPipe's `CudaBuffer`.
If device maps are provided, `TensorPipeAgent::send` will return a
derived class of `CUDAFuture`, which is specifically tailored for
RPC Messages.
TODOs:
1. Enable sending CUDA RPC to the same process.
2. Add a custom CUDA stream pool.
3. When TensorPipe addressed the error for `cudaPointerGetAttributes()`,
remove `cuda:0` context initialization code in `backend_registry.py`.
4. When TensorPipe can detect availability of peer access, enable all
tests on platforms without peer access.
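A rough Python analogue of the stream handling described above (the real logic is C++ inside TensorPipeAgent; this just illustrates grabbing a fresh stream per device and ordering it after pending work):
```
import torch

def streams_for_message(devices):
    streams = []
    for device in devices:
        stream = torch.cuda.Stream(device=device)
        # Order the new stream after whatever is already queued on the device.
        stream.wait_stream(torch.cuda.current_stream(device))
        streams.append(stream)
    return streams
```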
Differential Revision: D23626207
Test Plan: Imported from OSS
Reviewed By: lw
Pulled By: mrshenli
fbshipit-source-id: d30e89e8a98bc44b8d237807b84e78475c2763f0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46372
Currently, in `_run_function`, we catch an exception from the python
function which is run, and report it back to the master. However in some large
scale training jobs, it would be valuable to also log the error on the trainer
itself for faster debugging.
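A hedged sketch of the idea (function body and names assumed, not the actual agent code): log locally before reporting the failure back to the master.
```
import logging

logger = logging.getLogger(__name__)

def _run_function(python_function):
    try:
        return python_function()
    except Exception as exc:
        # Log on the trainer itself for faster debugging, then still propagate
        # the failure back to the master as before.
        logger.error("Exception while running function: %s", exc, exc_info=True)
        raise
```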
Test Plan: Added unittest.
Reviewed By: pritamdamania87
Differential Revision: D24324578
fbshipit-source-id: 88460d7599ea69d2c38fd9c10eb6471f7edd4100
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45642
Prior to https://github.com/pytorch/pytorch/pull/45181, initializing a
NCCL process group would work even if no GPUs were present. However, now that
init_process_group calls `barrier()`, this fails.
In general the problem was that we could initialize ProcessGroupNCCL without
GPUs, and then if we called a method like `barrier()` the process would crash,
since we do % numGPUs, resulting in a division by zero.
ghstack-source-id: 113490343
Test Plan: waitforbuildbot
Reviewed By: osalpekar
Differential Revision: D24038839
fbshipit-source-id: a1f1db52cabcfb83e06c1a11ae9744afbf03f8dc
Summary:
Enabled type checking in common_distributed by using tensors of ints
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44821
Test Plan: Run python test/test_type_hints.py; errors are no longer ignored by mypy.ini
Reviewed By: walterddr
Differential Revision: D23747466
Pulled By: alanadakotashine
fbshipit-source-id: 820fd502d7ff715728470fbef0be90ae7f128dd6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44525
Since `TEST_SKIPS` is a global multiprocessing.Manager, this was causing
issues when one test would fail and make the rest of the tests fail during
setup due to networking errors.
See the failed CI job: https://app.circleci.com/pipelines/github/pytorch/pytorch/212491/workflows/0450151d-ca09-4cf6-863d-272de6ed917f/jobs/7389065 for an example, where `test_ddp_backward` failed but then caused the rest of the tests to fail at the line `test_skips.update(TEST_SKIPS)`.
To fix this issue, at the end of every test we revert `TEST_SKIPS` back to a regular dict, and redo the conversion to a `multiprocessing.Manager` in the next test, which prevents these errors.
ghstack-source-id: 111844724
Test Plan: CI
Reviewed By: malfet
Differential Revision: D23641618
fbshipit-source-id: 27ce823968ece9804bb4dda898ffac43ef732b89