Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54981
Move parts of the code in autograd_hook into functions, so that they can be reused for static graph training later on.
ghstack-source-id: 127755405
Test Plan: unit tests
Reviewed By: SciPioneer
Differential Revision: D27439508
fbshipit-source-id: a02a4b029841f5e7f11cfc5496bb7972ef53d878
Summary:
This adds some more compiler warning ignores for everything that happens on a standard CPU build (CUDA builds still have a bunch of warnings, so we can't turn on `-Werror` everywhere yet).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56630
Pulled By: driazati
Reviewed By: malfet
Differential Revision: D28005063
fbshipit-source-id: 541ed415eb0470ddf7e08c22c5eb6da9db26e9a0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57191
Changed Store::compareSet() to a pure virtual function and added compareSet definition to PythonStore. Rest of changes are from clang-format.
Test Plan: Imported from OSS
Reviewed By: cbalioglu
Differential Revision: D28076557
Pulled By: H-Huang
fbshipit-source-id: 379636cf8b031088341a032250ba410d84ccf692
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57175
Update other Store implementations to add the value when current value is empty to match the amendment made to TCPStore (#55636). Added test to cover this case.
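The behavior described above can be illustrated with a toy in-memory store. This is a minimal sketch, not the c10d implementation: the `DictStore` class is hypothetical, and the return values (in particular, returning `expected` when the key is absent and `expected` is non-empty) are assumptions made for the example.

```python
class DictStore:
    """Toy in-memory store; a stand-in, not the real c10d Store API."""

    def __init__(self):
        self._data = {}

    def compare_set(self, key, expected, desired):
        current = self._data.get(key)
        if current is None:
            # Key is absent: treat it as matching only an empty expected value.
            if expected == "":
                self._data[key] = desired
                return desired
            # Assumption for this sketch: leave the store untouched.
            return expected
        if current == expected:
            self._data[key] = desired
            return desired
        # Mismatch: keep and report the current value.
        return current


store = DictStore()
first = store.compare_set("k", "", "v1")        # key missing, expected empty: value is added
second = store.compare_set("k", "wrong", "v2")  # expected does not match: unchanged
third = store.compare_set("k", "v1", "v2")      # expected matches: updated
```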
Test:
`pytest -vs test/distributed/test_c10d_common.py -k compare_set`
Test Plan: Imported from OSS
Reviewed By: cbalioglu
Differential Revision: D28069380
Pulled By: H-Huang
fbshipit-source-id: eac703edb41faee32a4e7cda61107e2a0e726326
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57052
This PR caps a stack whose goal was to merge CUDAFuture into ivalue::Future. CUDAFuture used to be a subclass of ivalue::Future, which was already pretty good, but it meant that in several places we needed `#ifdef`s or registries in order to create the right type of class, which was annoying. We've made CUDAFuture device-agnostic, by using generic helpers, so that it doesn't depend on CUDA. Now all its code can be inserted into ivalue::Future.
This PR does this very naively, by copy-pasting CUDAFuture's code into the (previously empty) virtual methods of ivalue::Future. This helps ensure the correctness of this PR, as it's straightforward to see it behaves exactly like before. However, we probably want to polish it a bit later to iron out some wrinkles.
ghstack-source-id: 127713138
(Note: this ignores all push blocking failures!)
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28036829
fbshipit-source-id: 3e5b16402f5dc245c1fcb9d7bf06db64dcb0d2a3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57051
Make CUDAFuture autodetect the device type from its arguments (which thus change from DeviceIndices to full Devices). This in fact transforms CUDAFuture into an AnythingFuture, since it's not tied to CUDA in any way anymore. Having made it fully device-agnostic, we'll merge it into ivalue::Future in the next PR.
ghstack-source-id: 127713134
(Note: this ignores all push blocking failures!)
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28032711
fbshipit-source-id: 8ba23b1b0d97f61db8693cd5f3c7bae7989a9bcd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57049
There was a comment above CUDAMultiStreamGuard which said "TODO: Implement this generically in c10". This is what I'm doing here.
The new generic MultiStreamGuard class is able to take a vector of device-agnostic c10::Streams and is able to support any device type (CUDA, but also ROCm and others) by using a VirtualGuardImpl. A class called CUDAMultiStreamGuard is still kept around, for convenience, and slightly for performance as it avoids a vtable lookup.
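As a rough illustration of the guard semantics only (set several streams as current, restore the previous ones on exit), here is a sketch in Python. The `current_streams` registry, the `multi_stream_guard` name, and the string stream labels are all hypothetical stand-ins; the real class is C++ and dispatches through a VirtualGuardImpl.

```python
from contextlib import contextmanager

# Hypothetical registry of the "current stream" per (device_type, index) pair.
current_streams = {}


@contextmanager
def multi_stream_guard(streams):
    """Make each given stream current on its device, then restore on exit.

    `streams` maps a (device_type, device_index) tuple to a stream label.
    """
    # Remember what was current on every device we are about to touch.
    saved = {dev: current_streams.get(dev, "default") for dev in streams}
    current_streams.update(streams)
    try:
        yield
    finally:
        # Restore the previous streams even if the body raised.
        current_streams.update(saved)


with multi_stream_guard({("cuda", 0): "s1", ("hip", 1): "s2"}):
    inside = dict(current_streams)
after = dict(current_streams)
```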
ghstack-source-id: 127713139
(Note: this ignores all push blocking failures!)
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28029158
fbshipit-source-id: 2f3181371f8cb0d77a3b2e6aa510f1dd74e8f69b
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os


def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files


def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname, "-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit", "--all", "-m", f"NOLINT stubs for {fname}"])


def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)


if __name__ == "__main__":
    main()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892
Reviewed By: H-Huang
Differential Revision: D27991944
Pulled By: malfet
fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56217
Reland of https://github.com/pytorch/pytorch/pull/54264
Changes:
- Update socket send() to use flag MSG_NOSIGNAL to prevent SIGPIPE, because the error in the return value is already captured
- Update watchKey to block until callback has been registered on master.
- Fix race condition in testWatchKeyCallback which caused flaky test failures.
Test:
Ran TCPStoreTest 100 times locally with no errors, running [ci-all tests](https://github.com/pytorch/pytorch/pull/56219)
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D27824802
Pulled By: H-Huang
fbshipit-source-id: c32230ce726d7d848b9896a63aa52b8eb04a0a2d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56530
For upcoming diffs, ProcessGroup will need to know about debug level
for e.g. logging collective operations.
ghstack-source-id: 127535775
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D27849839
fbshipit-source-id: a9f016a27d30a242eced19929b3824ae68fe430f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56709
Right now, ProcessGroupMPITest testGather() fails with
```
what(): Gather: number of output tensors should be 0 for non-root
[devgpu025:429730] *** Process received signal ***
```
There is a similar issue with testScatter(), where the number of input/output tensors on source/destination respectively should be 0.
In addition testSendRecv(true); fails with
```
terminate called after throwing an instance of 'std::runtime_error'
what(): src rank is wrong for recvAnysource
```
since we never populate `srcRanks`.
Test Plan: Imported from OSS
Reviewed By: pbelevich
Differential Revision: D28001963
Pulled By: agolynski
fbshipit-source-id: c381dfc6f417ee78fbbaf884e567b0485076dfc8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56641
Currently ddpLoggingData is a flat struct, which requires internal DDP developers and external users to know the struct field names. This makes it inflexible to delete or add fields in the future. It is also hard to access ddpLoggingData.
With maps/dicts, developers and users can easily access the fields without knowing the field names in advance, and it is easier to add or remove fields.
Since C++ maps cannot hold values of different types, ddpLoggingData currently contains two types of maps.
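The two-map layout can be sketched in Python as follows. This is only an illustration of the idea: the `LoggingData` class, its `set`/`get` helpers, and the example field names (`backend_name`, `world_size`) are hypothetical, not taken from the DDP code.

```python
class LoggingData:
    """Toy version of a logging record split into one map per value type."""

    def __init__(self):
        self.strs_map = {}  # string-valued fields
        self.ints_map = {}  # integer-valued fields

    def set(self, name, value):
        # Route each field to the map that matches its type.
        if isinstance(value, str):
            self.strs_map[name] = value
        elif isinstance(value, int) and not isinstance(value, bool):
            self.ints_map[name] = value
        else:
            raise TypeError(f"unsupported type for {name}: {type(value).__name__}")

    def get(self, name):
        # Callers look fields up by name without knowing which map holds them.
        if name in self.strs_map:
            return self.strs_map[name]
        return self.ints_map[name]


data = LoggingData()
data.set("backend_name", "nccl")  # hypothetical field name
data.set("world_size", 8)         # hypothetical field name
```

Adding or removing a field then only touches the code that sets it; no struct definition has to change.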
ghstack-source-id: 127482694
Test Plan: unit tests
Reviewed By: SciPioneer
Differential Revision: D27923723
fbshipit-source-id: c90199c14925fc50ef219000e2f809dc7601cce1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55204
Implements a fix discussed offline with pritamdamania87 to run end callbacks after `CUDAFuture`'s wrapCallback has ensured appropriate synchronization. Also enables the relevant distributed profiling tests that were previously disabled for ProcessGroupNCCL.
Note that the profiling infrastructure has moved to primarily encourage the use of torch.profiler and CUPTI to trace CUDA kernels; support for distributed collectives there will require further discussion with ilia-cher. However, this PR improves the usability of torch.autograd.profiler with respect to distributed collectives.
ghstack-source-id: 127357995
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D27491711
fbshipit-source-id: cec7703a4c5d59b5023b0aa8fef4c2e3fb8d37d0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55718
Increments sequence numbers when ProcessGroupGloo::enqueue or
ProcessGroupNCCL::collective is run, which is a common call all collectives
make. The next step will be to log these along with other collective info in
debug mode as well as integrating them with the process group wrapper.
ghstack-source-id: 127215077
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D27690690
fbshipit-source-id: cb284b7c760763b7c0f814a41f06656fabf806d6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56515
In https://github.com/pytorch/pytorch/pull/56405 we finally found a solution to support RPC remote user functions that created/used CUDA tensors on devices that were not used by their arguments, by defining a "bounding set" of devices when constructing the agent and allowing all functions to freely use any of those devices.
We had the same exact problem with the callbacks of CUDAFuture, and in this PR I'm adopting the same exact solution: I allow specifying a set of devices when constructing a CUDAFuture, and then every callback is allowed to use any of those devices. (These devices will also be propagated to child futures.)
I'm also making ProcessGroupNCCL pass these devices. I can't yet do it for TensorPipeAgent, until #56405 lands.
ghstack-source-id: 127261552
Test Plan: Added a test for this later in the stack.
Reviewed By: mrshenli
Differential Revision: D27861067
fbshipit-source-id: 8ab2c9d06a514c0407a7e96abc3704e8d5c5dc09
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56549
This makes `kProcessGroupDefaultTimeout` the same as on the python
side, and the python side now directly uses the pybind value instead.
Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D27899190
Pulled By: wanchaol
fbshipit-source-id: 388a7f42358b0abed75cf4934fb7b311fd33fee6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56531
Per discussions in
https://github.com/pytorch/pytorch/pull/53663/files#r593409009, we need
to make sure our API does not confuse users by accepting a timeout both as
an argument and in processgroup.options. This PR tries to make
`ProcessGroup.Options.timeout` a private field that is only used in
our test utils. For both `init_process_group` and `new_group`, we still
allow users to pass `timeout` as a separate argument. Since
`ProcessGroupGloo.Options` only has a `timeout` config, both functions
will not allow passing in options for the GLOO backend.
This way we preserve `timeout` as the only timeout API, and only allow users
to use `ProcessGroupNCCL.Options` when needed.
cc pritamdamania87 rohan-varma
Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D27893395
Pulled By: wanchaol
fbshipit-source-id: cdd29c84648002226ef3d9f9f3ea67b795e64bc5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55319
Adds a sequence number class as well as integration with ProcessGroup (nccl and gloo) as part of better debuggability.
The main use case is that each ProcessGroup instantiated will have a sequence number initially set by rank 0 and broadcast to all others. We will increment the number on each collective, thus allowing us to match the numbers appropriately when checking for desynchronization.
This PR just adds the bare-bones integration and verifies sequence numbers are set appropriately at the beginning.
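A small single-process simulation may help picture the scheme. It is a sketch under stated assumptions: the `SequenceNum` class, `init_group`, and `desynced_ranks` below are hypothetical names, the "broadcast" is just a copy of rank 0's value, and comparing against rank 0 is one possible way to check for desynchronization.

```python
import random


class SequenceNum:
    """Per-rank counter that advances once per collective."""

    def __init__(self, initial):
        self._num = initial

    def increment(self):
        self._num += 1

    def get(self):
        return self._num


def init_group(world_size, seed=0):
    # Rank 0 picks the starting value; copying it stands in for the broadcast.
    start = random.Random(seed).randrange(2**32)
    return [SequenceNum(start) for _ in range(world_size)]


def desynced_ranks(seqs):
    # Ranks whose counter differs from rank 0 have run a different number
    # of collectives.
    reference = seqs[0].get()
    return [rank for rank, seq in enumerate(seqs) if seq.get() != reference]


seqs = init_group(world_size=3)
in_sync_at_start = desynced_ranks(seqs)

for seq in seqs:          # every rank runs the first collective
    seq.increment()
for seq in seqs[:2]:      # rank 2 misses the second collective
    seq.increment()
out_of_sync = desynced_ranks(seqs)
```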
ghstack-source-id: 127011277
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D27562769
fbshipit-source-id: d4a4de7529ce07a0c86fcf6beb06f317f359d89b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55976
- Define a concrete `DebugInfo` to collect Param comms.
- Add a macro to easily log `DebugInfo`
Test Plan:
Tested on `ads:simplified_launcher` with `dyno gputrace`
locally tested in libkinetoObserver that it can collect the debug Infobase
Reviewed By: kingchc, ilia-cher
Differential Revision: D26773447
fbshipit-source-id: a8eeede2d6dbf34d7a1b3614843b4a1baba94448
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55075
Constructs a mapping with parameter names and passes it to Reducer, so that error messages about unused parameters (not all parameters getting a gradient) can log which parameters were unused.
Use case:
1) User runs DDP forward + bwd, and it has some unused parameters that will result in ddp error in next iteration
2) Next forward pass calls `Reducer::ensure_prior_reduction_finished()` where we check all params got gradient from the previous bwd pass. DDP would throw here in this case.
3) Reducer maintains mapping and tracks used parameters, and computes which parameters did not get gradient and logs this as part of the error.
Implementation details:
0) The following is only enabled for debug modes of INFO or DETAIL.
1) To save memory, we don't map param -> param name so that we don't have to copy the entire tensor, instead we map param_index -> param_name and use the existing concept of variable_index in Reducer to look up parameter names.
2) DDP constructs param index -> param name mapping. The name is the fully qualified name: f"{module_name}:{param_name}" and passes it into Reducer
3) Reducer maintains per-iteration std::set<int> of variable indices that have had `mark_variable_ready` called.
4) When some params go unused, we take a set difference to detect the unused params.
5) Unittests to test the logged unused params, as well as for nested modules, are added
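The set-difference step in the implementation details can be sketched in plain Python. The function name `unused_param_names` and the example parameter names are made up for illustration; only the index-to-name mapping and the set difference come from the description above.

```python
def unused_param_names(param_names, ready_indices):
    """Return names of parameters that never had mark_variable_ready called.

    param_names: dict mapping param index -> fully qualified name.
    ready_indices: set of indices marked ready during the iteration.
    """
    unused = sorted(set(param_names) - set(ready_indices))
    return [param_names[index] for index in unused]


# Hypothetical model with three parameters; index 1 received no gradient.
param_names = {0: "net:weight", 1: "net:bias", 2: "head:weight"}
ready = {0, 2}
missing = unused_param_names(param_names, ready)
```

Mapping indices rather than tensors keeps the bookkeeping small, which matches the memory concern stated in point 1.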
ghstack-source-id: 126581051
Test Plan: CI, UT
Reviewed By: zhaojuanmao
Differential Revision: D27356394
fbshipit-source-id: 89f436af4e74145b0a8eda92b3c4e2af8e747332
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54085
Fixes https://github.com/pytorch/pytorch/issues/50121.
This fixes two similar issues pointed out with the dtype that `torch.pow` performs its computation. Thanks ngimel for spotting the issues originally (comments [here](https://github.com/pytorch/pytorch/pull/53669#discussion_r594624355) and [here](https://github.com/pytorch/pytorch/pull/53669#discussion_r594719704))!
Before:
```
>>> torch.pow(2, torch.tensor([17], dtype=torch.uint8), out=torch.tensor([0]))
tensor([0])
>>> torch.pow(2, torch.tensor(17, dtype=torch.uint8), out=torch.tensor(0))
tensor(131072)
>>> torch.pow(2, torch.tensor([17], dtype=torch.uint8, device='cuda'), out=torch.tensor([0], device='cuda'))
tensor([131072], device='cuda:0')
>>> torch.pow(2, torch.tensor(17, dtype=torch.uint8, device='cuda'), out=torch.tensor(0, device='cuda'))
tensor(131072, device='cuda:0')
```
After:
```
>>> torch.pow(2, torch.tensor([17], dtype=torch.uint8), out=torch.tensor([0]))
tensor([0])
>>> torch.pow(2, torch.tensor(17, dtype=torch.uint8), out=torch.tensor(0))
tensor(0)
>>> torch.pow(2, torch.tensor([17], dtype=torch.uint8, device='cuda'), out=torch.tensor([0], device='cuda'))
tensor([0], device='cuda:0')
>>> torch.pow(2, torch.tensor(17, dtype=torch.uint8, device='cuda'), out=torch.tensor(0, device='cuda'))
tensor(0, device='cuda:0')
```
In all four cases above, `tensor(0, ...)` is the correct value because the computed "common dtype" among the inputs is expected to be `uint8`. Computing `2 ** 17` in uint8 will then overflow to zero. Finally, we cast the computed output to the output tensor's dtype, which is `int64`.
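The arithmetic behind the expected result can be checked without torch. The helper below is a plain-Python sketch that models uint8 wraparound with a modulo; `pow_in_uint8` is a hypothetical name, not a PyTorch function.

```python
def pow_in_uint8(base, exponent):
    """Repeated multiplication with every intermediate wrapped to 8 bits."""
    result = 1
    for _ in range(exponent):
        result = (result * base) % 256  # uint8 keeps only the low 8 bits
    return result


wrapped = pow_in_uint8(2, 17)   # 2 ** 17 = 131072, a multiple of 256
in_range = pow_in_uint8(2, 7)   # 128 still fits in uint8
```

Casting the wrapped value to a wider output dtype afterwards does not bring the lost bits back, which is why the result stays zero.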
There were two separate issues fixed in this PR: one for cpu and one for cuda:
* For CPU, the `pow(Scalar, Tensor)` overload wasn't calling `set_wrapped_number(true)` after wrapping the scalar in a Tensor, which caused the "promoted" scalar to incorrectly participate in type promotion (see the documented behavior [here](aa8714dfed/c10/core/TensorImpl.h (L590)))
* For CUDA, the cuda kernels defined in `PowKernel.cu` were using the output's dtype to run the computation, instead of the common dtype.
As an aside: the CPU and CUDA kernels actually both use `iter.dtype()` instead of `iter.common_dtype()` to run the computation, which I fixed. The reason that only manifested here for CUDA is that TensorIterator has cpu-specific logic to create temporary outputs with the intermediate dtype (shown [here](aa8714dfed/aten/src/ATen/TensorIterator.cpp (L349))). I'm not sure what the end state is there; I can imagine that being something we're more okay doing for cpu than for cuda, but it also leads to hard-to-track-down inconsistencies between the two like in this case.
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D27096330
Pulled By: bdhirsh
fbshipit-source-id: a7e2909243851625cb3056d1e7abb2383bfe95f2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54264
**Changes**
- Creates new listener thread on each client to run the callback
- Create a new class that the listener thread and master thread derive from; this class handles shutdown and cleanup of the thread on Windows and Linux
- Add watchKey method and update any functions that change the key value.
**Background**
This PR adds functionality to TCPStore to allow users to watch a key and execute a callback on key change.
It introduces a new watchKey() API:
`TCPStore::watchKey(const std::string& key, std::function<void(std::string, std::string)> callback)`, which has parameters `key` and `callback(old_key, new_key)` to run on key change. Since current methods are blocking (for example, in `TCPStore::get()` a worker will send a "get key" request to the master -> wait for a response back -> then exit the function and return the value to the user), we need a non-blocking, asynchronous way to execute the callback whenever a key changes. This is done by creating a new listener thread on each client which the master can communicate with.
Right now, the API is C++ only and only for TCPStore, the internal use case is for elastic RPC. We will have an internal key such as `_NumNodes` and all nodes in the elastic RPC group will watch this key. When a node leaves, this key will be updated and each node will execute a callback to clean up Autograd context and RRef context.
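The listener-thread idea can be sketched in a single Python process. Everything here is an assumption made for illustration: the `WatchableStore` class, its `flush` helper, and passing the old and new values to the callback are not taken from TCPStore, and a queue stands in for the socket between master and client.

```python
import queue
import threading


class WatchableStore:
    """Toy store whose key-change callbacks run on a separate listener thread."""

    def __init__(self):
        self._data = {}
        self._callbacks = {}
        self._events = queue.Queue()  # stands in for master -> client messages
        listener = threading.Thread(target=self._listen, daemon=True)
        listener.start()

    def watch_key(self, key, callback):
        self._callbacks[key] = callback

    def set(self, key, value):
        old = self._data.get(key)
        self._data[key] = value
        # set() only enqueues a notification; it never runs the callback itself.
        if key in self._callbacks and old != value:
            self._events.put((key, old, value))

    def _listen(self):
        while True:
            key, old, new = self._events.get()
            self._callbacks[key](old, new)
            self._events.task_done()

    def flush(self):
        # Block until every queued notification has been handled.
        self._events.join()


seen = []
store = WatchableStore()
store.watch_key("_NumNodes", lambda old, new: seen.append((old, new)))
store.set("_NumNodes", "3")
store.set("_NumNodes", "2")
store.flush()
```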
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D27709912
Pulled By: H-Huang
fbshipit-source-id: 619aa3b2a8eb23f4be5f5736efdcca6c175aadf3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55990
Reland of https://github.com/pytorch/pytorch/pull/55197, which fails windows test that was only run on master.
Disabled these tests for Windows, similar to how they are disabled on macOS. The reason for disabling is that they use the libuv transport, which does not have error handling as robust as tcp on Linux. The result is that non-zero ranks that were healthy don't throw immediately (like they do on Linux) but throw on timeout. The error handling still occurs as expected on rank 0 for all platforms.
ghstack-source-id: 126478371
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D27758424
fbshipit-source-id: d30841c8dda77f51b09a58161e638657ef758e63
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55265
Logs API usage of monitored barrier for better tracking and use case
understanding.
ghstack-source-id: 126413087
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D27548433
fbshipit-source-id: 7520ad0948b8dc9d44fa3118d5ea953d52f9f1c5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55197
From initial user feedback, one unexpected difference between monitored_barrier impl and barrier is the "all or nothing" semantics.
In barrier, all ranks pass or they all fail. With monitored barrier however, if rank 1 is healthy, it will respond to both send and recv from rank 0, but rank 0 can later fail because rank 2 is stuck. In this case, rank 1 will move forward out of the barrier.
This change makes it so that if a rank fails in monitored barrier, all other ranks in monitored barrier will also fail. It does so by the following process, similar to acknowledgements:
1) Nonzero ranks call send()
2) Nonzero ranks call recv()
3) Rank 0 calls recv(); if this succeeds, rank 0 has acknowledged rank N as healthy
4) Once all ranks are acknowledged as healthy, rank 0 calls send() to all nonzero ranks to unblock them
Modified unittests to ensure the all or nothing failure behavior
ghstack-source-id: 126413088
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D27523060
fbshipit-source-id: fa05e4f8ad8ae97fd6cb20da5c3a7ef76fd31de6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55444
Changes ~ProcessGroupNCCL so that we join work cleanup thread before aborting nccl communicators. This is because if we abort nccl communicators first on destruction, outstanding work objects in workMetaList can have exceptions set on them. Right now this doesn't trigger errors in nccl async error handling due to the terminated check, but it seems a bit cleaner to just join this thread first.
The main motivation is also to reduce log spam since we added some logging when an exception is set on WorkNCCL, but this unexpectedly resulted in a lot of false-positive errors being logged even after pg shutdown. An example is below:
I0406 18:30:27.361981 1567104 ProcessGroupNCCL.cpp:527] [Rank 1] NCCL watchdog thread terminated normally
I0406 18:30:27.364675 1567105 ProcessGroupNCCL.cpp:265] [Rank 1] found async exception when checking for NCCL errors: NCCL error: unhandled system error, NCCL version 2.7.3
With this change, we no longer see these false positive logs.
ghstack-source-id: 126145284
Test Plan: CI
Reviewed By: osalpekar
Differential Revision: D27613035
fbshipit-source-id: abf924630128b50e7f66ae41ac83403e7a0aac96
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54977
Move parts of the code in prepare_for_backward into functions, so that those functions can be reused for static graph training and delaying all reduce later on.
ghstack-source-id: 126366714
Test Plan: unit tests
Reviewed By: rohan-varma
Differential Revision: D27439195
fbshipit-source-id: 8899eda621260232d774cb145f9c6d683c47e188
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54991
The actual proposed fix is in
https://github.com/pytorch/pytorch/pull/53934. In the meantime, it would be useful
to include this LOG when barrier does not know what devices to use, and to suggest
the workaround of passing device_ids into barrier().
ghstack-source-id: 126351889
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D27444917
fbshipit-source-id: 0f269c5a7732e5be6e51adfca7ef70d04ffd71d3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55636
This diff introduces:
- The `C10dRendezvousBackend` type to support C10d stores as rendezvous backends.
- A fix to the `TCPStore.compare_set()` function to support non-existent keys.
- A placeholder `c10d-experimental` registry to instantiate C10d-backed rendezvous backends via `get_rendezvous_handler()`.
ghstack-source-id: 126312162
Test Plan: Run the existing and newly-introduced unit/integration tests.
Reviewed By: tierex
Differential Revision: D27654492
fbshipit-source-id: 09f498138b35186de4b0e174adb33fb5b5aa4b52
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55353
Remove all the code branches that will only be executed when `device_ids > 1`.
Some helper functions are also removed:
1. `_verify_replicas_within_process` and `verify_replicas_within_process`
2. `_replicate_modules_within_process`
3. `parallel_apply`
The next step is deprecating `_module_copies` field.
ghstack-source-id: 126201121
Test Plan: waitforbuildbot
Reviewed By: rohan-varma
Differential Revision: D27552201
fbshipit-source-id: 128d0216a202f5b1ba4279517d68c3badba92a6c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55074
This function accesses member variables that can be modified by
different threads (i.e. autograd engine threads), so call it within lock scope.
ghstack-source-id: 125707513
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D27474526
fbshipit-source-id: 8d43faedd6e6eeeb69e21ce3262337ab83d7ba07
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55253
Previously, DDP communication hooks took a tensor list as input. Now they take only a single tensor, in preparation for retiring SPMD and providing only a single model replica to DDP communication hooks.
The next step is limiting Reducer to only 1 model replica.
ghstack-source-id: 125677637
Test Plan: waitforbuildbot
Reviewed By: zhaojuanmao
Differential Revision: D27533898
fbshipit-source-id: 5db92549c440f33662cf4edf8e0a0fd024101eae
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55010
Follow up change to add a flag to provide an option for monitored barrier to collect all the failed ranks and then throw instead of just throwing on the first one. This is useful as now monitored barrier will be able to pick up on all hanging ranks instead of just one.
This is done by passing in a flag `wait_all_ranks=True`.
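The effect of the flag can be shown with a small simulation. This is a sketch, not the ProcessGroupGloo code: the function signature, the `responsive_ranks` argument, and the error messages are hypothetical, and a set membership test stands in for a recv() that either succeeds or times out.

```python
def monitored_barrier(world_size, responsive_ranks, wait_all_ranks=False):
    """Simulate rank 0 waiting on every other rank.

    responsive_ranks: set of nonzero ranks that answer before the timeout.
    """
    failed = []
    for rank in range(1, world_size):
        if rank in responsive_ranks:
            continue
        if not wait_all_ranks:
            # Default behavior: report the first rank that did not respond.
            raise RuntimeError(f"Rank {rank} did not respond")
        failed.append(rank)
    if failed:
        # With wait_all_ranks, report every rank that did not respond.
        raise RuntimeError(f"Ranks {failed} did not respond")


def failure_message(**kwargs):
    try:
        monitored_barrier(**kwargs)
    except RuntimeError as err:
        return str(err)
    return None


first_only = failure_message(world_size=4, responsive_ranks={2})
all_failed = failure_message(world_size=4, responsive_ranks={2}, wait_all_ranks=True)
no_failure = failure_message(world_size=4, responsive_ranks={1, 2, 3}, wait_all_ranks=True)
```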
ghstack-source-id: 125699839
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D27447787
fbshipit-source-id: ec23aee212060d9eb515ff8adc96c6a17822d1bb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55009
Changes monitoredBarrier so that we await acknowledgement from ranks
in a consistent order (from least to greatest). This will reduce confusion
around the order the ranks are awaited. We are still planning to add support
for awaiting all ranks in follow up changes.
ghstack-source-id: 125699838
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D27405417
fbshipit-source-id: b9a3e72742cbffdd9bf890ab2c94103b768a7b71
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55212
Error out SPMD in C++ Reducer.
Added a new test `test_reducer_no_multi_replicas`, which checks no multiple replicas are allowed at the Reducer constructor.
Removed 2 tests relevant to reducer in SPMD mode:
`test_ddp_comm_hook_multiple_replica_check`
`test_forward_backward_multi_replica`
ghstack-source-id: 125602472
Test Plan: waitforbuildbot
Reviewed By: pritamdamania87
Differential Revision: D27497747
fbshipit-source-id: 17ef1bc4d889cbe8076bcb3d504aed4c1aea1562
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54919
Log the use of uneven inputs API for better tracking and use case
detection.
ghstack-source-id: 125446499
Test Plan: CI, added ut
Reviewed By: zhaojuanmao, SciPioneer
Differential Revision: D27410764
fbshipit-source-id: abc8055a2e15a3ee087d9959f8881b05a0ea933e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54558
In blocking wait's polling synchronization loop, we frequently call checkAndSetException() as part of isCompleted() to check the status of nccl operations. It would be useful to log here in case we encounter any exceptions (which are later thrown by `checkAndThrowException`).
Also slightly refactors code previously added to make use of a helper function to get the error message given an `std::exception_ptr`.
ghstack-source-id: 125124314
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D27136202
fbshipit-source-id: 256eb63c5c2a84be909722d3fd7377ad9303fa11
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54557
When looping through the nccl communicator cache checking for errors, enhance the watchdog to log exceptions that are set on the communicator.
This will allow for better debuggability since the NCCL error will be logged when the watchdog receives errors for the communicators and aborts them appropriately.
Tested by forcing a NCCL error with NCCL_BLOCKING_WAIT=1 and verifying that the exception is indeed logged.
ghstack-source-id: 125124310
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D27106699
fbshipit-source-id: 1d2bd9f057a3796ce15dd8a4ce34cf6899eee45c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53773
Closes https://github.com/pytorch/pytorch/issues/52876
Implements a barrier by doing send/recv to rank 0. Rank 0 waits for these requests and, on timeout, throws an exception indicating which rank did not join within the given timeout.
This barrier is only intended for CPU use cases and built into process group gloo, and will be used for debugging synchronization/hang issues.
Test Plan: Added UT
Reviewed By: zhaojuanmao
Differential Revision: D26921357
fbshipit-source-id: 7c16e861b4b8ea2bdd67a36b3de7b1029af7d173
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54764
We mark a few vars as const in Reducer, also do this for replicas_ and
process_group_ as they should not be changed by Reducer during training. This
can help eliminate issues at compile time and prevent the developer from
accidentally changing these variables.
ghstack-source-id: 125040110
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D27357132
fbshipit-source-id: 23a0edf754a8e4f9e6440e99860e5549724cb7ad
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54763
Replaces deprecated torch::autograd::variable with at::Tensor.
torch::autograd::variable is defined as equal to at::Tensor now so this should
be a noop, but follows convention of using tensor instead of Variable.
ghstack-source-id: 125040109
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D27356450
fbshipit-source-id: 1a001358d7726a597141ec47803c8213db4814c0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54117
https://github.com/pytorch/pytorch/pull/45950 enhanced our NCCL error logging so that we add some basic debug information about what went wrong when erroring out with a NCCL error.
However, that PR only used the added function for `C10D_NCCL_CHECK` which is used to check the return values of NCCL calls. However, in ProcessGroupNCCL we also have `checkForNCCLErrors` which checks for errors on nccl communicators, and in case of errors it would be good to have this logging there too.
Also renames the function s/errorMessage/getNcclErrorDetailStr
ghstack-source-id: 124662592
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D27100497
fbshipit-source-id: fec3663ffa3e92bae8391ef4f77054abb4bb9715
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52692
Porting `at::mul` to structured.
One other issue I hit with the port was the fact that there are a bunch of other places around the code base that used to call out to variants of `at::native::mul`, which no longer exists. *Technically*, `at::cpu::mul` does the equivalent thing now, so I patched most call-sites to use that. There were two other places where I did something slightly different (calling `at::cuda::mul` and `at::mul`, respectively), which I called out in the comments.
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D27029822
Pulled By: bdhirsh
fbshipit-source-id: 6cc80de0dfccec304bf8e16a1823e733bed27bf4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54090
This PR adds an options field to both ProcessGroupGloo/NCCL so that we
have a constant `options` field even after the initialization of
ProcessGroup, which gives us the ability to inspect the options during
construction of specific ProcessGroup. Also use options inside different
methods instead of separate fields.
Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D27093670
Pulled By: wanchaol
fbshipit-source-id: b02d9394290e9be88b21bddb94d4de7993b4a2e3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53662
Add a base processgroup::options so that we can do inheritance and
provide a universal option API in python
Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D26968856
Pulled By: wanchaol
fbshipit-source-id: 858f4b61b27aecb1943959bba68f8c14114f67d8
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53159.
See comments for a description of the race condition. Thanks to ptrblck xwang233 and especially zasdfgbnm for lots of help isolating the problem and discussing the fix.
PRing for discussion. We can try to concoct a dedicated test for the problem if you want. The ingredients are:
- DDP(..., find_unused_parameters=True)
- Use all the DDP-ed model's params in forward such that the "lazy local used work wait()" path will be taken in backward
- Queue up a lot of asynchronous dummy work just before backward(), so stream work gets pushed far into the future relative to CPU work
Benchmark:
Bert model, when find_unused_parameters=true, latency (sec) per iteration P50: trunk 1.265 sec, this PR 1.263 sec; if a blocking copy is added before calling local_used_.fill(i), 1.236 sec
Bert model, when find_unused_parameters=false, latency (sec) per iteration P50: trunk 1.00 sec, this PR 1.026 sec
Resnet50 model, accuracy is also matched with trunk when find_unused_parameters=true and false
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53160
Reviewed By: albanD
Differential Revision: D26916766
Pulled By: zhaojuanmao
fbshipit-source-id: 3e0ed91b7b5c42e2f2c82e12d4d2940fdc89e023
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53928
HashStoreTest was taking forever to run. Turns out it was because a default timeout is set when creating Store() and setTimeout for prefixStore is not actually able to change the timeout of the underlying store.
After removing the default timeout and updating setTimeout, this will save ~10 minutes for all of the gcc_test CI runs.
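A minimal sketch of the delegation idea, in plain Python rather than the actual C++ Store classes: a prefix wrapper forwards `set_timeout` to the store it wraps, so the change takes effect on the store that actually blocks. `ToyStore` and `ToyPrefixStore` are hypothetical names used only for illustration.

```python
from datetime import timedelta


class ToyStore:
    """Hypothetical base store; no default timeout is forced at creation."""

    def __init__(self, timeout=None):
        self.timeout = timeout

    def set_timeout(self, timeout):
        self.timeout = timeout


class ToyPrefixStore:
    """Hypothetical wrapper that namespaces the keys of an underlying store."""

    def __init__(self, prefix, store):
        self.prefix = prefix
        self.store = store

    def set_timeout(self, timeout):
        # Forward to the wrapped store so the new timeout is actually used.
        self.store.set_timeout(timeout)

    @property
    def timeout(self):
        # The wrapper has no timeout of its own; it reports the wrapped one.
        return self.store.timeout


base = ToyStore()
wrapped = ToyPrefixStore("test/", base)
wrapped.set_timeout(timedelta(seconds=1))
```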
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D27025275
Pulled By: H-Huang
fbshipit-source-id: 650c8c1eb8b166da1d412ed88e765747a2ca2069
Summary:
The tcpstore delete key implementation inadvertently set "moreData" when sending the key when it was in fact the last message.
Thank you, PetrochukM, for the reproducing example which was instrumental in developing the fix (and is the blueprint for the test case).
Fixes https://github.com/pytorch/pytorch/issues/53872
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53886
Reviewed By: jbschlosser
Differential Revision: D27011846
Pulled By: H-Huang
fbshipit-source-id: 5c460d1e4d095a8bc267bf63613b556856ced3e8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53860
Fixes [#53840](https://github.com/pytorch/pytorch/issues/53840)
Right now [TCPStore wait([LIST_OF_KEYS_TO_AWAIT])](https://pytorch.org/docs/master/distributed.html#torch.distributed.Store.wait) will hang if any of the keys in [LIST_OF_KEYS_TO_AWAIT] has been previously set. This change will ensure that wait() is only waiting for the keys that have not been set.
Before change:
```
# Case 1: HANG
store.set("1", "1")
store.wait(["1", "2"])
store.set("2", "2")
# Case 2: SUCCEED
store.wait(["1", "2"])
store.set("1", "1")
store.set("2", "2")
```
After change:
Both cases work
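The intended behavior can be illustrated with a toy in-memory store. This is only a sketch of the semantics, not the TCPStore implementation: `ToyWaitStore` and its `timeout` parameter are hypothetical. `wait()` returns as soon as every requested key is present, regardless of whether some were set before the call.

```python
import threading


class ToyWaitStore:
    """Hypothetical in-memory stand-in for a key-value store."""

    def __init__(self):
        self._data = {}
        self._cond = threading.Condition()

    def set(self, key, value):
        with self._cond:
            self._data[key] = value
            self._cond.notify_all()

    def wait(self, keys, timeout=1.0):
        def all_present():
            # Keys that were set earlier already count as satisfied.
            return all(k in self._data for k in keys)

        with self._cond:
            if not self._cond.wait_for(all_present, timeout=timeout):
                missing = [k for k in keys if k not in self._data]
                raise TimeoutError(f"keys never set: {missing}")


# Case 1 from above: "1" is set before wait(), "2" arrives later.
store = ToyWaitStore()
store.set("1", "1")
setter = threading.Timer(0.05, store.set, args=("2", "2"))
setter.start()
store.wait(["1", "2"], timeout=2.0)
setter.join()
```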
TODO: working on adding a test for wait()
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D26999929
Pulled By: H-Huang
fbshipit-source-id: 8931749923c98b520366538f785af82ef37cca8e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52949
Enables distributed profiling which we have for gloo and nccl for the MPI backend
ghstack-source-id: 123610105
Test Plan: CI
Reviewed By: wanchaol
Differential Revision: D26591590
fbshipit-source-id: a20ec9d104faa26bc62c727dd01319c3ea230f5d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53148
clang format reducer and logger files
ghstack-source-id: 123453983
Test Plan: unit test
Reviewed By: SciPioneer
Differential Revision: D26764509
fbshipit-source-id: 711efcfd77420f912861cfd20c69e3af5086f4b9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53162
it is possible there are multiple data types in mixed precision training, so log data types as a list of data type names.
ghstack-source-id: 123452626
Test Plan: unit test
Reviewed By: SciPioneer
Differential Revision: D26769256
fbshipit-source-id: 8f7d73821e89864fedbbce723f301fe8fbad5685
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53145
add a new API to allow users to set the sample rate for runtime stats, and add per-iteration latency breakdowns to the DDPLoggingData struct. E.g.,
if users set the sample rate to 1, they can analyze how per-iteration latency changes over time (not averaged)
ghstack-source-id: 123443369
Test Plan: unit test
Reviewed By: SciPioneer
Differential Revision: D26763957
fbshipit-source-id: baff6a09c2a590e6eb91362ca6f47ae8fa6ddb0e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52966
Logs the registered comm hook if there is one, else logs
"builtin_allreduce"
ghstack-source-id: 123174803
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D26709388
fbshipit-source-id: 484fdbbd6643ec261b3797bd8d9824b2b6a1a490
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52887
This diff changes the way to do model consistency check (i.e. `_verify_replicas_across_processes`) in DDP.
There were a few things that could be improved with the way we verify model across processes in DDP initialization:
1. We should do this check before syncing module states in DDP init, otherwise with Gloo backend this will throw but we would like to throw the error corresponding to different models on different ranks. To do this, we move the methods to be standalone C++ functions (not part of reducer) and move this check to before synchronizing parameters.
2. Refactor DDP init in the following ways:
- Run model consistency check before creating reducer
- add helper functions to build params to pass into reducer
- add helper function to call `_verify_model_across_ranks`
- move `def parameters` to a helper function `_get_parameters` to be used more broadly within DDP
In follow-up changes we will add the ability to detect which rank had an inconsistent model (https://github.com/pytorch/pytorch/issues/52876 would be useful for this to determine which rank(s) had errors).
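The ordering described above can be sketched as a single-process simulation. This is not the DDP code: `shapes_match` and `init_sketch` are hypothetical helpers, and the per-rank shape lists stand in for what real ranks would exchange through a collective.

```python
def shapes_match(shapes_per_rank):
    """Return True if every rank reports the same parameter shapes as rank 0."""
    reference = shapes_per_rank[0]
    return all(shapes == reference for shapes in shapes_per_rank)


def init_sketch(shapes_per_rank, sync_states):
    # Verify first, so a model mismatch surfaces as its own error
    # instead of as a failure inside state synchronization.
    if not shapes_match(shapes_per_rank):
        raise RuntimeError("model parameters differ across ranks")
    sync_states()


calls = []
init_sketch([[(4, 2), (2,)], [(4, 2), (2,)]], lambda: calls.append("sync"))
```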
ghstack-source-id: 123171877
Test Plan:
CI/unittest
buck test mode/dev-nosan //caffe2/test/distributed:c10d
BACKEND="nccl" WORLD_SIZE="2" ~/fbcode/buck-out/dev/gen/caffe2/test/distributed/distributed_nccl_fork#binary.par -r test_ddp_model_diff_across_ranks
Reviewed By: zhaojuanmao
Differential Revision: D26565290
fbshipit-source-id: f0e1709585b53730e86915e768448f5b8817a608
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53098
Remove some low-level methods that are no longer needed since `get_per_parameter_tensors` method is added to `GradBucket` class.
Avoid unnecessary exposure to the internals before publishing GradBucket APIs.
ghstack-source-id: 122979064
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
Reviewed By: osalpekar
Differential Revision: D26784249
fbshipit-source-id: d1b27bb026989c25a5b65be4767cb752afd6f19b
Summary:
Currently, `torch.nn.parallel.DistributedDataParallel(model...)` doesn't deduplicate params shared across `model`'s child Modules before calling Reducer with the param list. This can cause Reducer to register more than one hook on the shared param(s), at which point who knows what happens.
We ran into this in mlperf BERT, which has at least one param shared across submodules (an embedding weight iirc, not 100% sure). Running with `gradient_as_bucket_view = False` produced different numerics from running with `gradient_as_bucket_view = True` (which i guess is one potential consequence of multiple DDP hooks on a given param, not sure why, i'd have to dig further).
This PR changes DDP to deduplicate shared params (a small diff), and adds some tests (right now just `test_ddp_weight_sharing`, but I'll add more). `test_ddp_weight_sharing` fails with bad numerics on current master (proving the shared param issue is real) and passes with the deduplication diff.
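A rough sketch of identity-based deduplication, assuming the shared parameter is the same Python object reachable from several submodules. `dedup_by_identity` and `FakeParam` are illustrative names, not the helper used in the PR.

```python
def dedup_by_identity(params):
    """Keep the first occurrence of each object, preserving order."""
    seen = set()
    unique = []
    for p in params:
        # Compare by identity: two distinct params with equal values must both stay.
        if id(p) not in seen:
            seen.add(id(p))
            unique.append(p)
    return unique


class FakeParam:
    """Hypothetical stand-in for a parameter tensor."""

    def __init__(self, name):
        self.name = name


embedding = FakeParam("embedding.weight")
bias = FakeParam("bias")
# The embedding weight is reachable from two submodules, so it appears twice.
collected = [embedding, bias, embedding]
unique_params = dedup_by_identity(collected)
```

With the duplicates removed, each parameter would be handed to the reducer once, so only one hook would be registered per parameter.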
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51929
Reviewed By: zou3519
Differential Revision: D26625807
Pulled By: zhaojuanmao
fbshipit-source-id: f5f5959fef90dfe2c55812d79fa88b877f22ecc3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53098
Remove some low-level methods that are no longer needed since `get_per_parameter_tensors` method is added to `GradBucket` class.
Avoid unnecessary exposure to the internals before publishing GradBucket APIs.
ghstack-source-id: 122723683
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
Reviewed By: rohan-varma
Differential Revision: D26720919
fbshipit-source-id: 46fb6423008792e72d7a1dd68930a31e0724c92c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53010
To determine the boundary between different iterations in a DDP communication hook, currently the user code needs `bucket.get_index() == 0`, which involves internal bucketization implementation details and undermines the usability of DDP communication hook.
Create an API to hide the details and improve the usability before publishing GradBucket APIs.
ghstack-source-id: 122723081
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
Reviewed By: rohan-varma
Differential Revision: D26720813
fbshipit-source-id: f4a3147382c1f970534d7f0dee0cd599156c8b8c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53102
In `GradBucket` constructor, `offsets`, `lengths`, and `sizes_vec` are optional arguments and could possibly be empty. It will be safe to remove the default values.
ghstack-source-id: 122833603
Test Plan: waitforbuildbot
Reviewed By: rohan-varma
Differential Revision: D26748199
fbshipit-source-id: 2e3bcd1b732851919a64bbbd20fe85e77a616fe3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53009
It can be a common operation to apply layer-wise operations over per-parameter tensors in a DDP communication hook.
Create a util method in GradBucket class before publishing GradBucket APIs.
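To illustrate what per-parameter access to a flat bucket buys a hook author, here is a small sketch using a plain `array` and `memoryview` instead of tensors. `per_parameter_views`, `offsets`, and `lengths` are assumptions made for the example and do not reflect the actual GradBucket signature.

```python
from array import array


def per_parameter_views(flat, offsets, lengths):
    """Return one writable view per parameter into the flat buffer."""
    buf = memoryview(flat)
    return [buf[o:o + n] for o, n in zip(offsets, lengths)]


# A flat bucket holding two parameters: 2 elements, then 3 elements.
flat = array("d", [1.0, 2.0, 3.0, 4.0, 5.0])
views = per_parameter_views(flat, offsets=[0, 2], lengths=[2, 3])

# Layer-wise operation: scale only the second parameter, in place.
for i in range(len(views[1])):
    views[1][i] *= 10.0
```

Because the views alias the flat buffer, the layer-wise update is visible in the bucket without any copy back.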
ghstack-source-id: 122833594
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
f254364097
Reviewed By: rohan-varma
Differential Revision: D26717893
fbshipit-source-id: 916db319de8b85dd22bc4e35db5671bf4e34740f
Summary:
This PR fixes a resource leakage bug in the constructor of `TCPStore` where an exception thrown in `TCPStoreDaemon` or `tcputil::connect()` can leave the server socket dangling. The ideal long-term solution would be to have a RAII wrapper for TCP sockets returned by `tcputil`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52860
Reviewed By: osalpekar
Differential Revision: D26671775
Pulled By: cbalioglu
fbshipit-source-id: ccebbd7533ac601a4b80e6e759f2fb4fe01c70fa
Summary:
This PR introduces the `timeout` accessor to `Store` and `host`, `port` accessors to `TCPStore` to help testing and troubleshooting higher level APIs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52784
Reviewed By: anjali411
Differential Revision: D26648202
Pulled By: cbalioglu
fbshipit-source-id: 9cf23bf998ed330d648dfec2a93e1bbb50817292
Summary:
- Fixes the ordering of the value parameters of TCPStore's `compare_set()` in the pybind11 interop layer. The C++ API expects (old, new) while we are passing (new, old) in Python.
- Fixes the implementation of TCPStore's `compareSetHandler()` for cases where the key already exists in the store.
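The effect of swapping the two value arguments can be seen in a toy compare-and-set. This sketch is not the TCPStore code: `ToyKV` is hypothetical, and its handling of a missing key (leave the store unchanged and return `None`) is an assumption made only to keep the example short.

```python
class ToyKV:
    """Hypothetical dict-backed store with an (expected, desired) compare_set."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]

    def compare_set(self, key, expected, desired):
        current = self._data.get(key)
        if current is not None and current == expected:
            self._data[key] = desired
            return desired
        # Mismatch or missing key: leave the store unchanged (sketch assumption).
        return current


# Correct order: (old, new).
good = ToyKV()
good.set("k", "old")
good_result = good.compare_set("k", "old", "new")

# Swapped order: (new, old). The comparison fails, so nothing is updated.
swapped = ToyKV()
swapped.set("k", "old")
swapped_result = swapped.compare_set("k", "new", "old")
```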
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52696
Test Plan: `python test/distributed/test_c10d.py`
Reviewed By: malfet, H-Huang
Differential Revision: D26616976
Pulled By: cbalioglu
fbshipit-source-id: e6a70542e837be04697b5850947924edd896dbf6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52391
There are 2 ways DDP can throw the exception refactored here -
1) Unused params in the forward pass. We provide `find_unused_parameters=True` for this.
2) All params used in fwd pass, but not all outputs used in loss computation. There are a few workarounds for this but we do not provide native support.
Previously, these 2 issues were combined into 1 error message but that has historically resulted in confusion, with users reporting getting this error even when they enable `find_unused_parameters=True` (which they expect to fix this error). As a result there is additional churn to debug these issues because the true cause (1) vs (2) is not known.
This commit helps to fix the issue by separating out the 2 error messages depending on if we ran with unused parameter detection or not. Hopefully this should make the error message much more clear and actionable.
error msg with `find_unused_params=True`:
```
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. Since `find_unused_parameters=True` is enabled, this likely means that not all `forward` outputs participate in computing loss. You can fix this by making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
```
error msg without `find_unused_params` specified:
```
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
```
ghstack-source-id: 122097900
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D26496688
fbshipit-source-id: 4a9eeeda10293da13d94a692d10cb954e4506d7c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51386
add stats such as rebuilt bucket stats, unused parameter stats and performance stats to ddp logging data
1. gpu time stats are not collected for single process multiple devices in this diff, as that requires events to be created and recorded on multiple devices
2. use at::cuda::event API for safer calls
3. events may not be created in the autograd hook if the hook is not triggered in the user's code, e.g., when users run in non-sync mode in some iterations. So we check whether events were created before synchronizing, and also skip invalid results.
4. users may not set the device upfront, so explicitly set the proper device before creating events in our prepare_forward() and prepare_backward() calls
ghstack-source-id: 121933566
Test Plan: unit tests
Reviewed By: SciPioneer
Differential Revision: D26158645
fbshipit-source-id: ce5f15187802eba76accb980449be68902c10178
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52385
This warning should specify that we did not find unused params in the
_forward_ pass, which is when we log this warning. This is to avoid confusion
when we get an error because not all outputs were used to compute loss, which
also raises an error about unused parameters (to be fixed in the next diff)
ghstack-source-id: 122001929
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D26494136
fbshipit-source-id: d9b41732ea7e5e31b899d590d311080e3dc56682
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52031
Closes https://github.com/pytorch/pytorch/issues/52020
Ensures that we can profile collectives in DDP by propagating the profiler threadLocalState appropriately. As described in the above issue, before this wouldn't work as the profiler would only be enabled on the main thread.
ghstack-source-id: 121818080
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D26356192
fbshipit-source-id: 0158b5833a3f857a0b4b2943ae3037e9d998dfd1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51822
Adds support for shape recording for profiling distributed collectives, for nccl/gloo backends. Added
both cpp and python tests to ensure that shapes are recorded properly. Note that we don't add `ProcessGroupNCCLTest`s since they need to be modified to support single process per device and > 1 world size.
ghstack-source-id: 121507509
Test Plan: CI
Reviewed By: mrzzd
Differential Revision: D26291739
fbshipit-source-id: 5f7bd54d8c36d17a4a29e172b25266ca3dbd8fbd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51759
Some unit tests actually register a comm hook on other backends like GLOO. Example: `test_ddp_comm_hook_future_passing_cpu`
Therefore, only do the check on `register_builtin_comm_hook`.
Currently DDP communication hook can only be supported on NCCL. Add a check in the registration methods.
ghstack-source-id: 121115814
Test Plan: unit tests.
Reviewed By: pritamdamania87
Differential Revision: D26268581
fbshipit-source-id: c739fa4dca6d320202dc6689d790c2761c834c30
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51370
TORCH_CHECK should be used when confirming the correctness of function
arguments like the tag passed to Gloo functions.
ghstack-source-id: 120908449
Test Plan: Sandcastle/CI
Reviewed By: mingzhe09088
Differential Revision: D26152359
fbshipit-source-id: ddffaa6f11393aaedaf0870759dc526d8d4530ee
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51066
The backend name of a processgroup created using the distributed_c10d python API is tracked, but there is no good way to track the name of a processgroup created using the processGroup c++ API. In some cases, knowing the backend name of a processGroup is useful, e.g., to log the backend name, or to write code that depends on the known backend.
ghstack-source-id: 120628432
Test Plan: unit tests
Reviewed By: pritamdamania87
Differential Revision: D26059769
fbshipit-source-id: 6584c6695c5c3570137dc98c16e06cbe4b7f5503
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50622
1. Define a DDPLoggingData struct that is the placeholder for all the ddp related logging fields
2. Put the DDPLoggingData struct in the C10 directory so that it can be easily imported by c10 and torch files
3. Expose get_ddp_logging_data() method in python so that users can get the logging data and dump in their applications
4. Unit test verifies that the logging data can be set and retrieved as expected
5. Follow-ups will add more logging fields such as perf stats, internal states, env variables, etc.
ghstack-source-id: 120275870
Test Plan: unit tests
Reviewed By: SciPioneer
Differential Revision: D25930527
fbshipit-source-id: 290c200161019c58e28eed9a5a2a7a8153113f99
Summary:
In https://github.com/pytorch/pytorch/issues/42514, NCCL `alltoall_single` is already added. This PR adds NCCL `alltoall`.
The difference between `alltoall_single` and `alltoall` is: `alltoall_single` works on a single tensor and sends/receives slices of that tensor, while `alltoall` works on a list of tensors, and sends/receives tensors in that list.
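The data movement of the two variants can be simulated in a single process with plain lists. This is only a sketch of the semantics: the function names below are illustrative, and the equal-sized slices in `simulate_all_to_all_single` are a simplifying assumption.

```python
def simulate_all_to_all(inputs):
    """inputs[src][dst] is the item rank src sends to rank dst.

    Returns outputs where outputs[dst][src] is what rank dst received from src.
    """
    world = len(inputs)
    return [[inputs[src][dst] for src in range(world)] for dst in range(world)]


def simulate_all_to_all_single(buffers):
    """buffers[src] is one flat buffer, split into equal slices, one per rank."""
    world = len(buffers)
    chunk = len(buffers[0]) // world
    outputs = []
    for dst in range(world):
        received = []
        for src in range(world):
            received.extend(buffers[src][dst * chunk:(dst + 1) * chunk])
        outputs.append(received)
    return outputs


# List variant: each rank sends a separate (possibly differently sized) item.
list_out = simulate_all_to_all([[[0], [1, 1]], [[2, 2, 2], [3]]])

# Single-buffer variant: each rank sends slices of one flat buffer.
single_out = simulate_all_to_all_single([[0, 1, 2, 3], [10, 11, 12, 13]])
```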
cc: ptrblck ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44374
Reviewed By: zhangguanheng66, mrshenli
Differential Revision: D24455427
Pulled By: srinivas212
fbshipit-source-id: 42fdebdd14f8340098e2c34ef645bd40603552b1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50455
Certain systems only print logging messages for ERROR/WARN and the
error message that the watchdog is timing out a particular operation is pretty
important.
As a result, changing its level to ERROR instead of INFO.
ghstack-source-id: 119761029
Test Plan: waitforbuildbot
Reviewed By: rohan-varma
Differential Revision: D25894795
fbshipit-source-id: 259b16c13f6cdf9cb1956602d15784b92aa53f17
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50133
`find_unused_parameters=True` is only needed when the model has unused parameters that are not known at model definition time or differ due to control flow.
Unfortunately, many DDP users pass this flag in as `True` even when they do not need it, sometimes as a precaution to mitigate possible errors that may be raised (such as the error we raise with not using all outputs). While this is a larger issue to be fixed in DDP, it would also be useful to warn once if we did not detect unused parameters.
The downside of this is that in the case of flow control models where the first iteration doesn't have unused params but the rest do, this would be a false warning. However, I think the warning's value exceeds this downside.
ghstack-source-id: 119707101
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D25411118
fbshipit-source-id: 9f4a18ad8f45e364eae79b575cb1a9eaea45a86c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50131
Noticed that in the internal diff for
https://github.com/pytorch/pytorch/pull/49069 there was a clang-tidy warning to
use emplace instead of push_back. This can save us a copy, as the element is
constructed in place instead of being copied from a temporary
ghstack-source-id: 119560979
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D25800134
fbshipit-source-id: 243e57318f5d6e43de524d4e5409893febe6164c
Summary:
For a multi GPU node, rank and corresponding GPU mapping can be different.
Provide optional parameter to specify the GPU device number for the
allreduce operation in barrier function.
Add test cases to validate barrier device_ids.
Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>
Fixes https://github.com/pytorch/pytorch/issues/48110
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49069
Reviewed By: mrshenli
Differential Revision: D25658528
Pulled By: rohan-varma
fbshipit-source-id: 418198b6224c8c1fd95993b80c072a8ff8f02eec
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49556
Implemented the missing Store functionality (specifically numKeys) in the FileStore.
Test Plan: Added both C++ and Python tests to verify functionality.
Reviewed By: jiayisuse
Differential Revision: D25619001
fbshipit-source-id: 9146d0da9e0903622be3035880f619bbb2cc3891
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49343
at::cuda::CUDAEvent is "lazy" and only creates an event when it's first recorded. Until then, at::cuda::CUDAEvent is empty. If we use at::cuda::CUDAEvent::query() this is taken into account (an empty event is always ready), but WorkNCCL extracts the raw cudaEvent_t value from at::cuda::CUDAEvent and calls cudaEventQuery manually and doesn't check this. This could cause a failure.
It's unclear if this is ever supposed to happen, but we're seeing that failure, and we want to sort it out in order to see if there's something "deeper" going on.
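The failure mode can be sketched with a toy lazy event in Python. `LazyEvent`, `raw_query`, and `guarded_raw_query` are hypothetical and only mirror the shape of the problem; they do not call any CUDA API.

```python
class LazyEvent:
    """Hypothetical event whose underlying handle is created on first record()."""

    def __init__(self):
        self._handle = None

    def record(self, done=False):
        self._handle = {"done": done}

    def is_created(self):
        return self._handle is not None

    def query(self):
        # An event that was never recorded has nothing pending: treat as ready.
        if self._handle is None:
            return True
        return self._handle["done"]


def raw_query(event):
    # Mimics querying the raw handle directly, without the emptiness check.
    return event._handle["done"]


def guarded_raw_query(event):
    # Same raw access, but an empty event is reported as ready first.
    if not event.is_created():
        return True
    return event._handle["done"]


empty = LazyEvent()
ready_via_wrapper = empty.query()
ready_via_guard = guarded_raw_query(empty)
```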
ghstack-source-id: 118532806
Test Plan: Unit tests
Reviewed By: SciPioneer
Differential Revision: D25537844
fbshipit-source-id: 506319f4742e1c0a02aa75ecc01112ea3be42d8f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49014
We extracted a generic and reusable CUDAFuture class from FutureNCCL, but we had left FutureNCCL around, as a subclass of CUDAFuture, in order to deal with some peculiarity of ProcessGroupNCCL, namely that the future would be completed right away when constructed and that its CUDA events would be _shared_ with the ones of the WorkNCCL. This required some "hacks" in CUDAFuture itself (protected members, fields wrapped in shared_ptrs, ...).
My understanding is that creating CUDA events is a rather cheap operation. That would mean that we could afford to record _twice_ the events after each NCCL call, once for the WorkNCCL and once for the future. By doing so, we can use the CUDAFuture class directly and revert all its hacks.
ghstack-source-id: 118391217
Test Plan: Unit tests
Reviewed By: mrshenli
Differential Revision: D25355272
fbshipit-source-id: 3a2a0891724928221ff0f08600675d2f5990e674
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48789
CUDAFuture aims to "capture" the current state of CUDA-related stuff when the future is marked complete (e.g., by looking at current streams and recording events on them) and then "replicate" a similar state when users synchronize with the result of the future (by synchronizing the current streams with these events).
However, one "contextual" aspect of CUDA that we weren't capturing/replicating was the current device. This diff tries to fix that. I must mention that we can only do this for callbacks, while we cannot do it for the wait() method. I don't know if such a discrepancy between the two actually makes the overall behavior _worse_. I'd love to hear people's opinions on this.
ghstack-source-id: 118081338
Test Plan: Unit tests
Reviewed By: mrshenli
Differential Revision: D25210335
fbshipit-source-id: 1d1a3f80b1cc42e5114bc88554ed50617f1aaa90
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48946
Move recordFunctionEndCallback to after the blocking portion of launching the NCCL kernel, and remove addCallback since it runs the lambda inline anyways, and triggers unnecessary CUDA stream logic. If we want CUDA operations such as NCCL kernels accurately profiled, we should use the profiler with use_cuda=True. However, we are currently debugging a deadlock for the use_cuda=True case, fix is being tracked in #48987.
To ensure that the tests are no longer flaky, submitted this PR to ci-all: #48947 and ran the test a bunch of times ssh'd into the CI machine.
ghstack-source-id: 118330130
Test Plan: CI
Reviewed By: mrzzd
Differential Revision: D25368322
fbshipit-source-id: 7d17036248a3dcd855e58addc383bba64d6bc391
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49145
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49105
(1) Add a safety check `C10_CUDA_KERNEL_LAUNCH_CHECK()` after each kernel launch. This diff only changes the files inside the directory /fbsource/fbcode/caffe2/modules/, /fbsource/fbcode/caffe2/fb/, /fbsource/fbcode/caffe2/test/.
(2) Get rid of old check `AT_CUDA_CHECK(cudaGetLastError())` when necessary.
Test Plan:
Test build:
```
buck build mode/dev-nosan //caffe2/modules/detectron:
buck test mode/dev-nosan //caffe2/modules/detectron:
buck build mode/dev-nosan //caffe2/torch/fb/:
buck test mode/dev-nosan //caffe2/torch/fb/:
```
To check for launches without checks:
```
python3 caffe2/torch/testing/check_kernel_launches.py
```
Make sure none of the updated files are in the returned list.
Reviewed By: r-barnes
Differential Revision: D25452852
fbshipit-source-id: d6657edab612c9e0fa99b29c68460be8b1a20064
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48788
CUDAFuture needs to inspect the value it contains in order to first determine what devices its tensors reside on (so that it can record events on those devices), and then to record these tensors with the caching allocator when they are used in other streams. Extracting data ptrs can become somewhat expensive (especially if we resort to using the pickler to do that), hence it's probably a good idea to cache the result the first time we compute it.
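A minimal sketch of the caching idea, assuming extraction is a pure function of the stored value. `CachingHolder` and `extract` are hypothetical names; the real class works on tensors and data pointers rather than Python lists.

```python
class CachingHolder:
    """Hypothetical holder that computes an expensive extraction at most once."""

    def __init__(self, value, extract):
        self._value = value
        self._extract = extract
        self._cached = None

    def data_ptrs(self):
        # Compute on first use, then reuse for every later consumer.
        if self._cached is None:
            self._cached = self._extract(self._value)
        return self._cached


extract_calls = []


def slow_extract(value):
    extract_calls.append(value)
    return [id(item) for item in value]


holder = CachingHolder(["a", "b"], slow_extract)
first = holder.data_ptrs()   # used to decide which devices need events
second = holder.data_ptrs()  # reused when recording with the allocator
```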
ghstack-source-id: 118180023
Test Plan: Unit tests
Reviewed By: mrshenli
Differential Revision: D25303486
fbshipit-source-id: 5c541640f6d19249dfb5489ba5e8fad2502836fb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48506
This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).
---
FutureNCCL is now a general-purpose type-agnostic multi-device class, so in this commit I extract it from ProcessGroupNCCL to make it available for wider use (notably by the RPC module). We'll call this new class CUDAFuture. We'll keep FutureNCCL as a subclass of CUDAFuture to deal with some NCCL peculiarity, namely the fact that the future becomes complete immediately upon creation. We can clean this up for good once we're done merging Future and Work.
I'm not exactly sure of where to put CUDAFuture. It needs to be available to both c10d and RPC (which lives under torch/csrc). If I figured CMake out correctly (and that's a big if) I think c10d can only depend on ATen (I'll maybe add a comment with how I tracked that down). Hence we cannot put CUDAFuture in torch/csrc. On the other hand, RPC currently depends on c10d, because RPC agents use ProcessGroups internally, so it would be "ok" to put CUDAFuture in c10d. However, we want to get rid of ProcessGroups in RPC, and at that point RPC should in principle not depend on c10d. In that case, the only shared dep between the two that I see is ATen itself.
While I'm a bit wary of putting it right in ATen, I think it might actually make sense. CUDAFuture is intended to be a general-purpose component that can be reused in all settings and is not particularly tied to c10d or RPC. Moreover, ATen already contains ivalue::Future, and it contains a lot of CUDA helpers, so CUDAFuture definitely belongs to the "closure" of what's already there.
ghstack-source-id: 118180030
Test Plan: Unit tests?
Reviewed By: wanchaol
Differential Revision: D25180532
fbshipit-source-id: 697f655240dbdd3be22a568d5102ab27691f86d4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48505
This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).
---
FutureNCCL isn't just adding CUDA support to ivalue::Future, it's also reimplementing a lot of the latter's logic (by overriding plenty of its methods). That's brittle, as whenever a new method is added to ivalue::Future there's a risk of forgetting to add it to FutureNCCL, and in such a case calling this method on FutureNCCL would defer to the base class and give inconsistent results (e.g., future not being completed when it actually is). This _is already happening_, for example with the waitAndThrow or hasError, which are not implemented by FutureNCCL. In addition, this creates duplication between the two classes, which could lead to inconsistencies of behavior, bugs, missing features, ...
The best solution would be to keep the core future logic in ivalue::Future, and have _only_ the CUDA additions in FutureNCCL. That's what we're going to do, in two steps. In the previous commit, I split the CUDA features into separate hooks, which are called by FutureNCCL's other methods. In this commit, I'm removing these latter methods, and invoke the hooks directly from ivalue::Future.
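The end state of the two steps can be sketched in Python: the base class owns all of the future logic and calls overridable hooks at fixed points, while the subclass adds only device-specific behavior. The class and hook names here are hypothetical and are not claimed to match the ones in ivalue::Future.

```python
class BaseFuture:
    """Owns completion state; subclasses customize only via hooks."""

    def __init__(self):
        self._completed = False
        self._value = None

    def mark_completed(self, value):
        self._value = value
        self._completed = True
        self.post_mark_completed_hook(value)

    def completed(self):
        return self._completed

    def wait(self):
        if not self._completed:
            raise RuntimeError("future is not completed")
        self.pre_wait_hook()
        return self._value

    # Default hooks do nothing.
    def post_mark_completed_hook(self, value):
        pass

    def pre_wait_hook(self):
        pass


class DeviceAwareFuture(BaseFuture):
    """Adds only the device-specific steps; inherits everything else."""

    def __init__(self):
        super().__init__()
        self.log = []

    def post_mark_completed_hook(self, value):
        self.log.append("record events")

    def pre_wait_hook(self):
        self.log.append("sync streams")


fut = DeviceAwareFuture()
fut.mark_completed(42)
result = fut.wait()
```

A method added to the base class later (such as `completed()` here) behaves consistently for the subclass without any override.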
ghstack-source-id: 118180032
Test Plan: Unit tests
Reviewed By: wanchaol
Differential Revision: D25180535
fbshipit-source-id: 19181fe133152044eb677062a9e31e5e4ad3c03c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48504
This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).
---
FutureNCCL isn't just adding CUDA support to ivalue::Future, it's also reimplementing a lot of the latter's logic (by overriding plenty of its methods). That's brittle, as whenever a new method is added to ivalue::Future there's a risk of forgetting to add it to FutureNCCL, and in such a case calling this method on FutureNCCL would defer to the base class and give inconsistent results (e.g., future not being completed when it actually is). This _is already happening_, for example with waitAndThrow and hasError, which are not implemented by FutureNCCL. In addition, this creates duplication between the two classes, which could lead to inconsistencies of behavior, bugs, missing features, ...
The best solution would be to keep the core future logic in ivalue::Future, and have _only_ the CUDA additions in FutureNCCL. That's what we're going to do, in two steps. In this commit, I'll split the CUDA features into separate hooks, which are called by FutureNCCL's other methods. In the next commit, I'll remove these latter methods, and invoke the hooks directly from ivalue::Future.
ghstack-source-id: 118180025
Test Plan: Unit tests
Reviewed By: mrshenli
Differential Revision: D25180534
fbshipit-source-id: 7b3cd374aee78f6c07104daec793c4d248404c61
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48502
This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).
---
FutureNCCL restricted the values to be tensors, or (singleton) lists of tensors, or Python objects that could be converted to either of those types. We need a CUDA future that can handle more generic types though.
The main challenge is extracting all DataPtrs from an arbitrary object. I think I found some ways of doing so, but I'd like some JIT experts to look into this and tell me if there are better ways. I'll add inline comments for where their input would be appreciated.
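To make the extraction problem concrete, here is a small Python sketch of one possible approach: a recursive walk over nested containers. It is only an illustration under simplifying assumptions, not the method used in the actual C++ code. `FakeTensor` and `extract_storages` are hypothetical names, and only lists, tuples, and dicts are traversed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeTensor:
    """Stand-in for a tensor: a storage id and a device index (both hypothetical)."""
    storage_id: int
    device: int


def extract_storages(value):
    """Recursively collect every FakeTensor in nested lists, tuples and dicts."""
    found = []
    if isinstance(value, FakeTensor):
        found.append(value)
    elif isinstance(value, (list, tuple)):
        for item in value:
            found.extend(extract_storages(item))
    elif isinstance(value, dict):
        for key, item in value.items():
            found.extend(extract_storages(key))
            found.extend(extract_storages(item))
    # Any other type (int, str, None, ...) holds no tensors in this toy model.
    return found


nested = {
    "a": [FakeTensor(1, 0), 3],
    "b": (FakeTensor(2, 1), {"c": FakeTensor(3, 1)}),
}
storages = extract_storages(nested)
devices = sorted({t.device for t in storages})
```

A real implementation would also need to handle custom classes and opaque Python objects, which this sketch deliberately ignores.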
ghstack-source-id: 118180026
Test Plan: Unit tests (I should probably add new ones)
Reviewed By: wanchaol
Differential Revision: D25177562
fbshipit-source-id: 1ef18e67bf44543c70abb4ca152f1610dea4e533
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48501
This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).
---
FutureNCCL stores a set of devices (on which the tensors in the data reside) and a CUDA event for each of those devices. In fact, each event instance also already contains the device it belongs to, which means we can avoid storing that information separately (with the risk that it'll be mismatched and/or inaccurate).
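The idea of deriving the devices from the events, rather than storing them twice, can be sketched in Python as follows. `FakeEvent` and `FutureWithEvents` are hypothetical stand-ins, not the real classes.

```python
from dataclasses import dataclass


@dataclass
class FakeEvent:
    """Stand-in for a CUDA event that remembers which device it belongs to."""
    device_index: int


class FutureWithEvents:
    """Stores only events; the device list is computed from them on demand."""

    def __init__(self, events):
        self._events = list(events)

    def devices(self):
        # No separate device list exists, so it cannot drift out of sync.
        return [event.device_index for event in self._events]


fut = FutureWithEvents([FakeEvent(0), FakeEvent(2)])
device_list = fut.devices()
```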
ghstack-source-id: 118180024
Test Plan: Unit tests
Reviewed By: mrshenli
Differential Revision: D25177554
fbshipit-source-id: 64667c176efc2a7dafe99457a1fbba5d142cb06c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48500
This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).
---
After the previous changes, this is now much simpler than it sounds. For the most part it just consists in repeating some operations multiple times, once per device (e.g., recording and blocking on events). Funnily, we already had a vector of events, even though we only ever stored one element in it (this probably comes from the fact that this is shared with WorkNCCL, which can hold more than one event). Here, we now also store a vector of device indices.
Perhaps the only non-trivial part of this is that now, for "follow-up" Futures (for callbacks), we can't know in advance which device the result will be on so we must determine it dynamically when we receive the result, by inspecting it. That's also easier than it sounds because we already have a dataptr extractor.
ghstack-source-id: 118180022
Test Plan: Unit tests (I should probably add new ones)
Reviewed By: mrshenli
Differential Revision: D25177556
fbshipit-source-id: 41ef39ec0dc458e341aa1564f2b9f2b573d7fa9f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48563
This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).
---
The CUDA caching allocator requires us to register all streams in which a DataPtr is used. We already do so when we invoke a callback, for which we obtain streams from the ATen pool. However, we didn't do so when the user waits for the Future and then uses the results in their current streams. This was probably fine in most cases, because the outputs of the NCCL ops (which are the tensors we're dealing with here) were user-provided, and thus already registered in some user streams, but in principle the user could use different streams when waiting than the ones they used to create the tensors. (If they use the same streams, registering becomes a no-op.) More importantly, this change will help us turn FutureNCCL into a more general-purpose class: in RPC, for example, the tensors of the result are allocated by PyTorch itself, and thus we need to record their usage on the user's streams with the caching allocator.
ghstack-source-id: 118180033
Test Plan: Unit tests
Reviewed By: mrshenli
Differential Revision: D25210338
fbshipit-source-id: e0a4ba157653b74dd84cf5665c992ccce2dea188
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48503
This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).
---
My impression is that one property of the upstream Future class is that once .wait() returns, or once a callback is invoked, then .completed() should return True. This was not the case for FutureNCCL because .wait() would return immediately, and callbacks would be invoked inline, but .completed() could return False if the CUDA async operations hadn't completed yet.
That was odd and confusing. Since there are other ways for users to check the status of CUDA operations (if they really need to, and typically I don't think it's so common), perhaps it's best to avoid checking the status of CUDA events in .completed().
ghstack-source-id: 118180028
Test Plan: Unit tests
Reviewed By: mrshenli
Differential Revision: D25180531
fbshipit-source-id: e1207f6b91f010f278923cc5fec1190d0fcdab30
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48499
This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).
---
We can merge and "hide" a whole bunch of CUDA-related logic if we store and record the CUDA events that correspond to the completion of a FutureNCCL when we call markCompleted (rather than splitting it between the constructor, the `then` method, and a wrapper around the callback).
A more concrete reason for this change is that soon I'll add support for multi-device, and in that case we can't necessarily know in advance which devices a value will be on until we get that value (and we don't want to record an event on all devices as then we might "over-synchronize").
To me, this also makes more conceptual sense: the moment when we store a value on the future, which is the "signal" that the future is now ready, should also be the time at which we record the events needed to synchronize with that value. Though this may just be personal preference.
ghstack-source-id: 118180034
Test Plan: Unit tests
Reviewed By: mrshenli
Differential Revision: D25177557
fbshipit-source-id: 53d4bcdfb89fa0d11bb7b1b94db5d652edeb3b7b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48498
This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).
---
FutureNCCL has a dedicated CUDA stream that it sets as current when running callbacks. This stream is initialized by the ProcessGroupNCCL by extracting it from the global ATen pool.
In order to decouple FutureNCCL from that specific ProcessGroup and make it more generic, in this commit we make FutureNCCL extract a fresh stream from the ATen pool each time it needs one.
This introduces a functional change, because it removes the implicit synchronization and ordering between the callbacks of the same Future. In fact, such an ordering is hard to guarantee in the general case as, for example, a user could attach a new callback just after the future becomes completed, and thus that callback would be run inline, immediately, out-of-order wrt the other callbacks. (There are ways to "fix" this but they are complicated.) NCCL got around this because its futures are already marked complete when they're returned, but in fact it could also run into issues if multiple threads were adding callbacks simultaneously.
Note that it is still possible to enforce ordering between callbacks, but one must now do so explicitly. Namely, instead of this:
```
fut.then(cb1)
fut.then(cb2)
```
one must now do:
```
fut.then(cb1).then(cb2)
```
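To see why chaining enforces the order, consider this self-contained Python sketch. `ToyFuture` is a hypothetical, single-threaded stand-in and not the real Future class: `then` returns a child future that completes only after the callback has run, so a callback attached to the child cannot start before its predecessor finishes.

```python
class ToyFuture:
    """Single-threaded toy future; `then` returns a child future."""

    def __init__(self):
        self._done = False
        self._value = None
        self._callbacks = []

    def set_result(self, value):
        self._value = value
        self._done = True
        callbacks, self._callbacks = self._callbacks, []
        for cb in callbacks:
            cb(self)

    def value(self):
        return self._value

    def then(self, fn):
        child = ToyFuture()

        def run(parent):
            # The child completes only after fn has returned.
            child.set_result(fn(parent))

        if self._done:
            run(self)  # already complete: run inline, immediately
        else:
            self._callbacks.append(run)
        return child


order = []


def cb1(fut):
    order.append("cb1")
    return fut.value() + 1


def cb2(fut):
    order.append("cb2")
    return fut.value() * 2


root = ToyFuture()
last = root.then(cb1).then(cb2)
root.set_result(1)
```

Here `cb2` receives the future produced by `cb1`, so it necessarily runs second and sees `cb1`'s return value.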
ghstack-source-id: 118180029
Test Plan: Unit tests
Reviewed By: mrshenli
Differential Revision: D25177559
fbshipit-source-id: 4d4e73ea7bda0ea65066548109b9ea6d5b465599
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48497
This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).
---
When we record the events to mark a "follow-up" future complete (for a callback), we used to record them onto the dedicated stream, but that stream is the current stream at that time, so instead we can just record them onto the current stream. This introduces no functional differences. The reason I'm adding this additional layer of indirection is so that the dedicated stream is only referenced inside the `addCallback` method, which will later allow us to more easily change how that stream works.
ghstack-source-id: 118180035
Test Plan: Unit tests
Reviewed By: mrshenli
Differential Revision: D25177553
fbshipit-source-id: c6373eddd34bd399df09fd4861915bf98fd50681
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48496
This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).
---
There are two ways to add a callback to a Future: `then` and `addCallback` (with the former deferring to the latter). FutureNCCL only "patched" `then`, which caused `addCallback` to be unsupported. By patching `addCallback`, on the other hand, we cover both.
The high-level goal of this change though is to remove all CUDA-specific stuff from `then`, and move it to either `markCompleted` or to a wrapper around the callback. This will take a few more steps to achieve.
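The reason patching `addCallback` covers both entry points can be shown with a small Python sketch. All names here (`BaseFuture`, `PatchedFuture`, `add_callback`, `wrapped_calls`) are hypothetical, and the counter merely stands in for the CUDA-specific wrapping.

```python
class BaseFuture:
    """Toy future in which `then` is implemented on top of `add_callback`."""

    def __init__(self):
        self._done = False
        self._value = None
        self._callbacks = []

    def add_callback(self, fn):
        if self._done:
            fn()
        else:
            self._callbacks.append(fn)

    def then(self, fn):
        child = type(self)()
        # `then` defers to `add_callback`, so a subclass override applies here too.
        self.add_callback(lambda: child.mark_completed(fn(self._value)))
        return child

    def mark_completed(self, value):
        self._value, self._done = value, True
        pending, self._callbacks = self._callbacks, []
        for cb in pending:
            cb()


class PatchedFuture(BaseFuture):
    """Overrides only `add_callback`; counts how many callbacks it wrapped and ran."""

    wrapped_calls = 0

    def add_callback(self, fn):
        def wrapped():
            PatchedFuture.wrapped_calls += 1
            fn()

        super().add_callback(wrapped)


fut = PatchedFuture()
seen = []
fut.add_callback(lambda: seen.append("direct"))
child = fut.then(lambda v: v + 1)
fut.mark_completed(10)
```

Both the direct callback and the one registered through `then` pass through the wrapper, which would not be the case if only `then` were overridden.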
ghstack-source-id: 118180031
Test Plan: Unit tests
Reviewed By: mrshenli
Differential Revision: D25177558
fbshipit-source-id: ee0ad24eb2e56494c353db700319858ef9dcf32b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48562
This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).
---
In this commit I'm adding a few asserts to the constructors of FutureNCCL to make sure that what's passed in is what we expect (fun fact: until two commits ago that wasn't the case, as we were passed some empty events).
I'm also making the second constructor private, as it's only supposed to be used by the then() method.
ghstack-source-id: 118180036
Test Plan: Unit tests
Reviewed By: mrshenli
Differential Revision: D25210333
fbshipit-source-id: d2eacf0f7de5cc763e3cdd1ae5fd521fd2eec317
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48495
This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).
---
PythonFutureWrapper needs to provide a GIL-aware way to extract tensors from an IValue of type PyObject. Since this was only used by FutureNCCL it was guarded by #ifdef USE_C10D_NCCL. However, we will need to use it with CUDA-aware futures other than the NCCL one. This might have been achieved simply by replacing USE_C10D_NCCL with USE_CUDA, but I wanted to clean this up better.
We're dealing with two independent dimensions: C++-vs-Python and CPU-vs-CUDA. To make the code more modular, the two dimensions should be dealt with by orthogonal solutions: the user setting a custom callback to handle Python, and the subclass being CUDA-aware. Mixing these two axes makes it more complicated.
Another reason for changing how this works is that later on, when we'll introduce multi-device support, we'll need to extract dataptrs for other reasons too (rather than just recording streams with the caching allocator), namely to inspect the value to determine which devices it resides on.
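The separation of the two axes can be sketched in Python as a future that accepts a pluggable extractor. This is an illustration only: `CudaAwareFuture`, `PyWrapper`, `default_extractor`, and `python_extractor` are hypothetical names, and plain integers stand in for tensors.

```python
def default_extractor(value):
    """Default: treat the value as a flat list of 'tensors' (plain ints here)."""
    return list(value)


class CudaAwareFuture:
    """CUDA-aware axis: knows what to do with tensors, not how to find them."""

    def __init__(self, extractor=default_extractor):
        self._extractor = extractor
        self.recorded = []

    def mark_completed(self, value):
        # Ask the pluggable extractor for the tensors, then "record" each one.
        self.recorded = list(self._extractor(value))


class PyWrapper:
    """Stand-in for an opaque Python object held by the future."""

    def __init__(self, tensors):
        self.tensors = tensors


def python_extractor(value):
    # Python axis: in the real code this is where the GIL would be acquired;
    # in this toy model it is just an attribute read.
    return list(value.tensors)


plain = CudaAwareFuture()
plain.mark_completed([1, 2])

wrapped = CudaAwareFuture(extractor=python_extractor)
wrapped.mark_completed(PyWrapper([3, 4, 5]))
```

The same extractor could later serve other purposes, such as inspecting the value to find out which devices it resides on.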
ghstack-source-id: 118180038
Test Plan: Unit tests
Reviewed By: mrshenli
Differential Revision: D25177560
fbshipit-source-id: 3a424610c1ea191e8371ffee0a26d62639895884
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48561
This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).
---
WorkNCCL allows extracting a FutureNCCL through getFuture(). There is one instance of this method being called by ProcessGroupNCCL itself, in order to attach a callback to it. This was happening _before_ the work was actually launched; however, FutureNCCL _always_ invokes its callbacks immediately inline. The events that the FutureNCCL was using hadn't been recorded yet, thus blocking on them was a no-op. Moreover, the function that was being called was installed by the generic ProcessGroup superclass, which is not CUDA-aware, and thus probably didn't make any use of the CUDA events or streams.
383abf1f0c/torch/lib/c10d/ProcessGroup.cpp (L66)
In short: I believe that creating a FutureNCCL and attaching a callback was equivalent to just invoking that function directly, without any CUDA-specific thing. I'm thus converting the code to do just that, in order to simplify it.
Note that, given the comment, I don't think this was the original intention of that code. It seems that the function was intended to be run once the work finished. However, I am not familiar with this code, and I don't want to introduce any functional changes.
ghstack-source-id: 118180037
Test Plan: Unit tests
Reviewed By: mrshenli
Differential Revision: D25210337
fbshipit-source-id: 54033c814ac77641cbbe79b4d01686dfc2b45495
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49105
(1) Add a safety check `C10_CUDA_KERNEL_LAUNCH_CHECK()` after each kernel launch. This diff only changes the files inside the directory /fbsource/fbcode/caffe2/modules/, /fbsource/fbcode/caffe2/fb/, /fbsource/fbcode/caffe2/test/.
(2) Get rid of old check `AT_CUDA_CHECK(cudaGetLastError())` when necessary.
Test Plan:
Test build:
```
buck build //caffe2/modules/detectron:
buck build //caffe2/torch/fb/:
```
To check for launches without checks:
```
python3 caffe2/torch/testing/check_kernel_launches.py
```
Make sure none of the updated files are in the returned list.
Reviewed By: r-barnes
Differential Revision: D25325039
fbshipit-source-id: 2043d6e63c7d029c35576d3101c18247ffe92f01
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48757
Add an index field to GradBucket, so that error_dict can be keyed by this index instead of the hashcode of the input tensor. The replacement will be done in a separate diff, as the definition of this new method somehow couldn't be recognized in the OSS version.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 117939208
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
Reviewed By: rohan-varma
Differential Revision: D25288496
fbshipit-source-id: 6f71977809690a0367e408bd59601ee62c9c03ea
Summary:
This diff enables JIT serialization of `ProcessGroup`, including both base `ProcessGroup` class and derived classes like `ProcessGroupNCCL`.
If a `ProcessGroup` is created via high-level APIs like `dist_c10d.frontend().new_process_group_helper()`, it is automatically serializable. If a `ProcessGroup` is created via its derived class TorchBind APIs like `dist_c10d.ProcessGroupNCCL()`, then it has to be given a name and registered with `dist_c10d.frontend().register_process_group_name` to be uniquely identifiable and serializable.
* Fixed a minor bug in new dist_c10d frontend which fails to check whether a process group is used or not
* Fixed an issue where `test_jit_c10d.py` wasn't really run due to a configuration bug. Now tests are run as a slow test (need ci-all/* branch)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48544
Reviewed By: wanchaol
Differential Revision: D25298309
Pulled By: gmagogsfm
fbshipit-source-id: ed27ce37373c88277dc0c78704c48d4c19d46d46
Summary:
Enable TcpStore for DDP on the Windows platform, in order to improve the performance of running DDP across machines.
Related RFC is https://github.com/pytorch/pytorch/issues/47659
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47749
Reviewed By: bdhirsh
Differential Revision: D25220401
Pulled By: mrshenli
fbshipit-source-id: da4b46b42296e666fa7d8ec8040093de7443a529
Summary:
This PR aims to reduce the import overhead and symbol noise from the `windows.h` headers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48009
Reviewed By: gchanan
Differential Revision: D25045840
Pulled By: ezyang
fbshipit-source-id: 01fda70f433ba2dd0cd2d7cd676ab6ffe9d98b90
Summary:
Add TorchBind-binding for ProcessGroup class.
Currently there are a few limitations of TorchBind that prevent us from fully matching the existing PyBind binding of ProcessGroup:
- TorchBind doesn't support method overloading. The current PyBind binding uses overloading extensively to provide a flexible API, but TorchBind (and the TorchScript ClassType behind it) doesn't yet support it. Therefore, we can provide at most one version of the API under each name.
- TorchBind doesn't support C++ enums yet. This prevents us from making real use of XXXOptions, which are widely used in many APIs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47907
Reviewed By: wanchaol
Differential Revision: D24945814
Pulled By: gmagogsfm
fbshipit-source-id: e103d448849ea838c10414068c3e4795db91ab1c