Commit Graph

1423 Commits

Author SHA1 Message Date
Yanli Zhao
ea421fb249 enable static graph training in DDP (#55248)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55248

This PR enables static graph training when users call `_set_static_graph()`. This can help support more use cases in DDP without performance regression, and can also potentially improve performance when there are unused parameters in the graph.
1. The first iteration records graph states such as how many times a gradient is calculated and whether the gradient is used. The first iteration then queues a delay_all_reduce callback to all-reduce the gradients.
2. Since an autograd callback is associated with the current graph task, the delay_all_reduce callback should be associated with the outermost backward graph task. A DDP sink layer is added in the DDP forward pass so that the delay_all_reduce callback can be queued in the sink layer.
3. After the first iteration, DDP uses the saved graph states to determine whether a gradient is used and whether a gradient is ready for communication.
4. Bucket rebuilding happens in the second iteration, after graph states are recorded in the first iteration.
5. If the graph states change, DDP will throw an error (a usage sketch follows this list).
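
A minimal usage sketch, assuming a typical launcher/setup; `_set_static_graph()` is the private API named in this PR, everything else is illustrative:

```python
# Sketch: opt into static-graph training right after constructing DDP.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")  # environment variables provided by the launcher
rank = dist.get_rank()

model = DDP(torch.nn.Linear(10, 10).to(rank), device_ids=[rank])
model._set_static_graph()  # graph structure must stay the same across iterations

opt = torch.optim.SGD(model.parameters(), lr=0.01)
for _ in range(3):  # iteration 1 records graph states, iteration 2 rebuilds buckets
    opt.zero_grad()
    model(torch.randn(8, 10, device=rank)).sum().backward()
    opt.step()
```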
ghstack-source-id: 128599464

Test Plan: unit tests. adding more tests

Reviewed By: rohan-varma

Differential Revision: D27539964

fbshipit-source-id: 74de1ad2719465be67bab8688d6e293cd6e3a246
2021-05-11 10:23:25 -07:00
Rohan Varma
5840c8cfd8 [nccl] log rank when communicator is aborted (#57974)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57974

We see this error quite a bit in internal workflows; it would be useful
to have this additional logging information here.
ghstack-source-id: 128602199

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D28331693

fbshipit-source-id: 25398c6a3420a2b594d79aa8f46936cd0addd426
2021-05-10 21:23:31 -07:00
Alexander Golynski
db412a6885 Avoid 2 extra copies when reducing sparse tensors and fix result() vs inplace output discrepancy (#57822)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57822

* `AsyncSparseAllreduceWork` can avoid copying output tensors, since we keep all the results alive by modifying the input vector directly.
* `AsyncSparseAllreduceWork` now returns the inputs back to the user instead of the former behavior of returning copies of the inputs. This is consistent with other operations and process group implementations.
* `AsyncSparseAllreduceCUDAWork` now copies tensors directly from CPU to the input tensors, avoiding the extra copy `output` -> `outputs` -> `inputs`. The inputs are returned back to the user. This is consistent with other operations and process group implementations.

Overall, `AsyncSparseAllreduceCUDAWork` now avoids 2 extra copies (as it reuses `AsyncSparseAllreduceWork`'s implementation). A usage sketch follows.
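
A rough Python-level sketch of the affected path, assuming an initialized gloo process group; the exact way `result()` is exposed in Python is an assumption:

```python
# Sketch: sparse all-reduce on the gloo backend. Per this change, the reduced values end up
# in the input tensor itself rather than in freshly copied outputs.
import torch
import torch.distributed as dist

# assumes dist.init_process_group("gloo", ...) has already been called
indices = torch.tensor([[0, 2]])
values = torch.tensor([1.0, 2.0])
sparse_grad = torch.sparse_coo_tensor(indices, values, size=(4,))

work = dist.all_reduce(sparse_grad, op=dist.ReduceOp.SUM, async_op=True)
work.wait()
print(sparse_grad.to_dense())  # holds the reduced result; no extra output copies are made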

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D28298325

Pulled By: agolynski

fbshipit-source-id: 18e2104413cdf5e73a01aad464e2613807779297
2021-05-07 15:12:58 -07:00
Pavel Belevich
96e1a83fb2 Add Gloo TCP_TLS transport (#56442)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56442

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D27896285

Pulled By: pbelevich

fbshipit-source-id: 589af59ca4c7c9bab2329f079382c09b71cfcf9e
2021-05-07 13:36:11 -07:00
Luca Wehrstedt
36e47af58b Pass reference to parent future in callbacks (#57635)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57635

Note: this PR looks massive, but it's just one simple change, codemodded many times.

In many cases, a callback needs to access the value/error produced by the parent future. In Python this was easy because the callback was invoked with the parent future as an argument and could thus inspect it. In C++ the callbacks didn't take any arguments, so in many cases we worked around this by capturing the future in its own callback. This is risky (it leads to a reference cycle and thus a memory leak) and must be done carefully (spoiler: sometimes we weren't careful).
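
A Python-level analogue of the new pattern using `torch.futures` (the C++ change itself is internal): the callback receives the completed parent future as an argument, so it never has to capture the future it is attached to.

```python
# The callback gets the parent future passed in, so it can inspect its value/error
# without capturing the future inside its own callback (which risks a reference cycle).
import torch

parent = torch.futures.Future()

def on_done(fut):              # `fut` is the completed parent future
    return fut.value() * 2

child = parent.then(on_done)
parent.set_result(21)
print(child.wait())            # 42
```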
ghstack-source-id: 128296580

Test Plan: CI

Reviewed By: wanchaol

Differential Revision: D28178783

fbshipit-source-id: 6de02c4568be42123372edc008f630d5ddae0081
2021-05-07 03:59:18 -07:00
Jay Chae
1101a5f6e9 [paramcomms] support for in and out split sizes (#57709)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57709

NOTE: initial commit got reverted D28247764

Add a way to accept input and output split sizes.

Test Plan:
{F613245151}
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2F0%2F1620153506%2F127.0.0.1%2Flibkineto_activities_1112677.json.gz&bucket=gpu_traces
NOTE: ignore the GPU user showing up in CPU - the issue is fixed in the diff above the stack D28196723 (fc657b547a)

UPDATED: now the sizes are encoded as arrays in .json
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2F0%2F1620259313%2F127.0.0.1%2Flibkineto_activities_3944235.json.gz&bucket=gpu_traces

Reviewed By: kingchc

Differential Revision: D28248333

fbshipit-source-id: cee523612667cb37170c94e3c40dab5fba432225
2021-05-06 12:04:34 -07:00
Alexander Golynski
dc06f52480 Add result() to ProcessGroupGloo::AsyncWork's (#57565)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57565

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D28255120

Pulled By: agolynski

fbshipit-source-id: 1e904d4fe024d5b99cb642f8689ca32be0581e82
2021-05-06 08:48:48 -07:00
Horace He
ccbbb2d6f8 Revert D28052211: [paramcomms] support for in and out split sizes
Test Plan: revert-hammer

Differential Revision:
D28052211 (866b19e95d)

Original commit changeset: 4ab7d425fc72

fbshipit-source-id: 80c001ddcb3730f0487adddf66d9166f53c45a8c
2021-05-05 21:10:31 -07:00
Jay Chae
866b19e95d [paramcomms] support for in and out split sizes
Summary: Add a way to accept input and output split sizes.

Test Plan:
{F613245151}
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2F0%2F1620153506%2F127.0.0.1%2Flibkineto_activities_1112677.json.gz&bucket=gpu_traces
NOTE: ignore the GPU user showing up in CPU - the issue is fixed in the diff above the stack D28196723

UPDATED: now the sizes are encoded as arrays in .json
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2F0%2F1620259313%2F127.0.0.1%2Flibkineto_activities_3944235.json.gz&bucket=gpu_traces

Reviewed By: kingchc

Differential Revision: D28052211

fbshipit-source-id: 4ab7d425fc722907d9bbcfad7e364d031ff69b29
2021-05-05 20:46:11 -07:00
Rohan Varma
7115a4b870 Clang format ProcessGroupNCCL.cpp (#56840)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56840

Per comments in https://github.com/pytorch/pytorch/pull/56427/files
ghstack-source-id: 128142665

Test Plan: Ci

Reviewed By: SciPioneer

Differential Revision: D27980768

fbshipit-source-id: 0158ae1cfd892ff3385ffa0084dd7ef9de014f8c
2021-05-05 10:17:09 -07:00
Rohan Varma
a948e279ac [c10d] Profiler support for nccl p2p collectives (#56427)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56427

This PR enables support for nccl send/recv profiling similar to how we have it for MPI and Gloo.

The process to do so is similar to the NCCL collectives where we create the `recordingFunction` in `initWork` and then add a callback that runs the profiler end callbacks. Tests are added similar to send/recv tests with gloo/MPI.

We also test with both autograd profiler and torch.profiler.
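
A hedged sketch of how the new profiling can be exercised, assuming an initialized 2-rank NCCL process group; the exact event names in the trace are an assumption:

```python
# Sketch: profile an NCCL point-to-point pair; the send/recv ops should now show up as events.
import torch
import torch.distributed as dist

rank = dist.get_rank()
t = torch.ones(4, device=f"cuda:{rank}")

with torch.autograd.profiler.profile(use_cuda=True) as prof:
    if rank == 0:
        dist.send(t, dst=1)
    else:
        dist.recv(t, src=0)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=5))
```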
ghstack-source-id: 128142666

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D27866600

fbshipit-source-id: f29d9103e22b22f658632fece0df9ba36911fc62
2021-05-05 10:14:56 -07:00
Rohan Varma
7175d49122 [Dist profiling] Add is_async field (#57253)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57253

This PR:

1. Adds is_async getter/setter to RecordFunction
2. Adds is_async field to LegacyEvent and KinetoEvent, read from RecordFunction
3. Modifies python profiler code to check is_async via this flag (and keeps the old thread check as well)
4. Sets profiling of c10d collectives as async in ProcessGroup.cpp
5. Modifies tests to ensure is_async is set

This also fixes flaky tests such as #50840 and #56690 which have been flaky due to the profiling part (https://github.com/pytorch/pytorch/pull/56963 tried to do so as well but this is a better approach).
ghstack-source-id: 128021158

Test Plan: CI

Reviewed By: walterddr, ilia-cher

Differential Revision: D28086719

fbshipit-source-id: 4473db4aed939a71fbe9db5d6655f3008347cb29
2021-05-04 17:44:28 -07:00
Alexander Golynski
2b6c09c11e Add futures to ProcessGroupMPI work (but not including Send/Recv) and python DDP comm hook testing (#57214)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57214

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D28200791

Pulled By: agolynski

fbshipit-source-id: 83f814abd4f2eea70e383ed373b04aae8291be55
2021-05-04 16:04:45 -07:00
Rohan Varma
375c8a81dc [DDP] Profile search_unused_parameters (#57376)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57376

Having this in profiler/trace outputs will be useful when
investigating performance overhead of find_unused_parameters for certain
workloads, to determine whether it is a bottleneck or not.
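
A sketch of the workload this targets; the recorded event name is assumed to follow the title (`search_unused_parameters`), and the setup is illustrative:

```python
# Sketch: DDP with unused-parameter detection, profiled so the search cost is visible.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# assumes dist.init_process_group(...) has already been called
model = DDP(torch.nn.Linear(10, 10), find_unused_parameters=True)

with torch.autograd.profiler.profile() as prof:
    model(torch.randn(8, 10)).sum().backward()

# Look for an event named along the lines of "search_unused_parameters" in the output.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```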
ghstack-source-id: 127942159

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D28126233

fbshipit-source-id: 93082ae5b84e64351d59447a29f97eaf9b0bbd64
2021-05-03 09:41:18 -07:00
Alexander Golynski
f332a8bdff Implement result() function in MPI Work classes (#57168)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57168

Implement result() for MPI which wasn't previously supported.

Some users rely on output args; however, future use cases (e.g. the DDP comm hook) need the result returned explicitly.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D28129125

Pulled By: agolynski

fbshipit-source-id: d6abcd2114163471c045043534a0a3377f2579b4
2021-05-03 07:12:46 -07:00
Brad Fish
e68c46bb3a Propagate information on torch_shm_manager execl failure to parent process (#57310)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57310

If we fail to exec `torch_shm_manager`, write an appropriate error message to stdout so that the parent process can have some context on the failure.

Reviewed By: ejguan

Differential Revision: D28047917

fbshipit-source-id: 68bf357df7a6b318c036f4f62cbb428a62cb139e
2021-04-30 11:11:09 -07:00
Brad Fish
2c2aa9e030 Address temp file/bind race condition in torch_shm_manager (#57309)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57309

Addressing a race condition that can occur in `torch_shm_manager` between the time its temporary file is unlinked and when it `bind()`s the manager server socket to that same name. In that time window, other threads/processes can re-create another temporary file with the same name, causing `bind()` to fail with `EADDRINUSE`.

This diff introduces `c10::TempDir` and associated helper functions that mirror those of `c10::TempFile` and generates the manager socket name using a combination of a temporary directory, which will be valid for the lifetime of `torch_shm_manager`, and a well-known file name within that directory that will never be used outside of `bind()`.

Reviewed By: ejguan

Differential Revision: D28047914

fbshipit-source-id: 148d54818add44159881d3afc2ffb31bd73bcabf
2021-04-30 11:11:07 -07:00
Brad Fish
7eed5410cd Make c10::TempFile non-copyable but movable (#57308)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57308

This diff makes `c10::TempFile` non-copyable but movable. `torch_shm_manager` was previously dependent upon some hidden behavior that was a result of copying `TempFile`s, which is also being made more explicit now that they can be moved but not copied.

Context:

`c10::TempFile` is currently copyable, which leads to surprising behavior. A seemingly valid `TempFile` may in fact be invalid if the original it was copied from has already been destroyed, resulting in the file descriptor being closed and the filename being unlinked without the user knowing about it.

**In fact, both `c10::try_make_tempfile` and `c10::make_tempfile` cause copies of `TempFile` to be made**, which can easily be verified by explicitly deleting the copy constructor of `TempFile` and attempting to compile. This means that in practice, users of these functions are getting temporary files that have already been closed and unlinked.

This copying of `TempFile` is particularly interesting in the case of `torch_shm_manager`, which uses `try_make_tempfile` to generate the name of a Unix domain socket to communicate with clients. In order for `bind()` on the socket name to be successful, a file with that same name must not be linked in the filesystem, or `EADDRINUSE` will result. Happily, because `try_make_tempfile` previously created a copy of the `TempFile` while destroying the original, `torch_shm_manager` did not encounter this. With this change, however, `torch_shm_manager` must now explicitly destroy the `TempFile` before attempting to `bind()`. Unfortunately, this exposes a race condition--**other code can re-generate the same-named temporary file after the one created by `torch_shm_manager` is explicitly unlinked but before `torch_shm_manager` binds it to the server socket.** To be clear: this race condition already existed before this diff, but this makes things more explicit. The real fix will be in a follow-up change.

Reviewed By: ejguan

Differential Revision: D28047915

fbshipit-source-id: e8a1b6bb50419fe65620cfecdb67c566a4cf9056
2021-04-30 11:11:06 -07:00
Brad Fish
788aefd7cc Propagate information on torch_shm_manager failures to parent process (#57307)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57307

Extend the `"ERROR"` message that `torch_shm_manager` writes to the pipe when it encounters a fatal error with some extra context (specifically, the `what()` on a caught `std::exception`), allowing the parent process to gain some insight into the cause of the failure.

Also, simply return from `main()` with an error exit code when a fatal exception is caught rather than re-throwing, because re-throwing leads to premature process termination that may prevent standard output from being flushed (and therefore the parent process from being able to read the error context from the pipe).

Reviewed By: ejguan

Differential Revision: D28047916

fbshipit-source-id: d423ee8ed1b2bf7831db877e8f8515ec6d6aa169
2021-04-30 11:09:47 -07:00
Yanli Zhao
3f81912885 static graph api skeleton (#54995)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54995

Provide a private DDP API to explicitly mark the training graph as static; also set this flag in the logger.
ghstack-source-id: 127755713

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D27444965

fbshipit-source-id: 06ef1c372296815944b2adb33fbdf4e1217c1359
2021-04-30 11:07:26 -07:00
Yanli Zhao
5f2b9b1df9 refactor autograd_hook (#54981)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54981

Move part of the code in autograd_hook into functions so that it can be reused for static graph training later on.
ghstack-source-id: 127755405

Test Plan: unit tests

Reviewed By: SciPioneer

Differential Revision: D27439508

fbshipit-source-id: a02a4b029841f5e7f11cfc5496bb7972ef53d878
2021-04-30 11:06:04 -07:00
davidriazati@fb.com
c44cbc63cc Ignore more compiler warnings, unify WERROR options (#56630)
Summary:
This adds some more compiler warning ignores for everything that happens on a standard CPU build (CUDA builds still have a bunch of warnings, so we can't turn on `-Werror` everywhere yet).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56630

Pulled By: driazati

Reviewed By: malfet

Differential Revision: D28005063

fbshipit-source-id: 541ed415eb0470ddf7e08c22c5eb6da9db26e9a0
2021-04-29 21:20:29 -07:00
Howard Huang
149000c3f0 Update compare_set docs (#57203)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57203

Update documentation to remove the warning. Refactored arguments from `old_value` -> `expected_value` and `new_value` -> `desired_value`.

Test Plan: Imported from OSS

Reviewed By: gchanan, cbalioglu

Differential Revision: D28076556

Pulled By: H-Huang

fbshipit-source-id: 5fcc5bcfff89cad51d8dc0b74a234964f1af20ed
2021-04-29 13:58:57 -07:00
Howard Huang
95f393f212 Add compare_set to trampoline class, add typing and formatting (#57191)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57191

Changed Store::compareSet() to a pure virtual function and added a compareSet definition to PythonStore. The rest of the changes are from clang-format.

Test Plan: Imported from OSS

Reviewed By: cbalioglu

Differential Revision: D28076557

Pulled By: H-Huang

fbshipit-source-id: 379636cf8b031088341a032250ba410d84ccf692
2021-04-29 13:29:11 -07:00
Howard Huang
ee71584236 Update compare_set implementation for FileStore and HashStore (#57175)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57175

Update the other Store implementations to add the value when the current value is empty, matching the amendment made to TCPStore (#55636). Added a test to cover this case (a usage sketch follows).
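
A rough sketch of the intended semantics; constructor arguments and concrete return values are illustrative assumptions:

```python
# compare_set(key, expected_value, desired_value): set when expected matches, including the
# case where the key does not exist yet and the expected value is empty.
from datetime import timedelta
from torch.distributed import TCPStore

store = TCPStore("127.0.0.1", 29500, 1, is_master=True, timeout=timedelta(seconds=30))

store.compare_set("key", "", "first")        # key missing + empty expected -> stores "first"
store.compare_set("key", "wrong", "second")  # expected mismatch -> value stays "first"
store.compare_set("key", "first", "second")  # expected matches -> value becomes "second"
```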

Test:
`pytest -vs test/distributed/test_c10d_common.py -k compare_set`

Test Plan: Imported from OSS

Reviewed By: cbalioglu

Differential Revision: D28069380

Pulled By: H-Huang

fbshipit-source-id: eac703edb41faee32a4e7cda61107e2a0e726326
2021-04-29 10:48:11 -07:00
Luca Wehrstedt
311ad5e3af Merge CUDAFuture into ivalue::Future (#57052)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57052

This PR caps a stack whose goal was to merge CUDAFuture into ivalue::Future. CUDAFuture used to be a subclass of ivalue::Future, which was already pretty good, but it meant that in several places we needed `#ifdef`s or registries in order to create the right type of class, which was annoying. We've made CUDAFuture device-agnostic, by using generic helpers, so that it doesn't depend on CUDA. Now all its code can be inserted into ivalue::Future.

This PR does this very naively, by copy-pasting CUDAFuture's code into the (previously empty) virtual methods of ivalue::Future. This helps ensure the correctness of this PR, as it's straightforward to see it behaves exactly like before. However, we probably want to polish it a bit later to iron out some wrinkles.
ghstack-source-id: 127713138

(Note: this ignores all push blocking failures!)

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28036829

fbshipit-source-id: 3e5b16402f5dc245c1fcb9d7bf06db64dcb0d2a3
2021-04-29 09:31:52 -07:00
Luca Wehrstedt
71c2f88b90 Make CUDAFuture handle any kind of device type (#57051)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57051

Make CUDAFuture autodetect the device type from its arguments (which thus change from DeviceIndices to full Devices). This in fact transforms CUDAFuture into an AnythingFuture, since it's not tied to CUDA in any way anymore. Having made it fully device-agnostic, we'll merge it into ivalue::Future in the next PR.
ghstack-source-id: 127713134

(Note: this ignores all push blocking failures!)

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28032711

fbshipit-source-id: 8ba23b1b0d97f61db8693cd5f3c7bae7989a9bcd
2021-04-29 09:31:50 -07:00
Luca Wehrstedt
682476022f Introduce generic MultiStreamGuard (#57049)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57049

There was a comment above CUDAMultiStreamGuard which said "TODO: Implement this generically in c10". This is what I'm doing here.

The new generic MultiStreamGuard class is able to take a vector of device-agnostic c10::Streams and is able to support any device type (CUDA, but also ROCm and others) by using a VirtualGuardImpl. A class called CUDAMultiStreamGuard is still kept around, for convenience, and slightly for performance as it avoids a vtable lookup.
ghstack-source-id: 127713139

(Note: this ignores all push blocking failures!)

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28029158

fbshipit-source-id: 2f3181371f8cb0d77a3b2e6aa510f1dd74e8f69b
2021-04-29 09:31:47 -07:00
Nikita Shulga
4cb534f92e Make PyTorch code-base clang-tidy compliant (#56892)
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os

def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files

def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname,"-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit","--all", "-m", f"NOLINT stubs for {fname}"])

def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)

if __name__ == "__main__":
    main()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892

Reviewed By: H-Huang

Differential Revision: D27991944

Pulled By: malfet

fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
2021-04-28 14:10:25 -07:00
Howard Huang
5a10ee71d6 [Reland] TCPStore add watchKey method and new listener thread (#56217)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56217

Reland of https://github.com/pytorch/pytorch/pull/54264

Changes:
- Update socket send() to use the flag MSG_NOSIGNAL to prevent SIGPIPE, because the error in the return value is already captured
- Update watchKey to block until callback has been registered on master.
- Fix race condition in testWatchKeyCallback which caused flaky test failures.

Test:
Ran TCPStoreTest 100 times locally with no errors, running [ci-all tests](https://github.com/pytorch/pytorch/pull/56219)

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D27824802

Pulled By: H-Huang

fbshipit-source-id: c32230ce726d7d848b9896a63aa52b8eb04a0a2d
2021-04-28 13:46:02 -07:00
Rohan Varma
fe09d54120 [c10d] Add debug level field in ProcessGroup (#56530)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56530

For upcoming diffs, ProcessGroup will need to know about the debug level,
e.g. for logging collective operations.
ghstack-source-id: 127535775

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27849839

fbshipit-source-id: a9f016a27d30a242eced19929b3824ae68fe430f
2021-04-28 10:01:21 -07:00
Alexander Golynski
4638bd0f0f Fix ProcessGroupMPITest.cpp Gather, Scatter and SendRecv. Enable ProcessGroupMPITest (#56709)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56709

Right now, ProcessGroupMPITest testGather() fails with

 ```
what():  Gather: number of output tensors should be 0 for non-root
[devgpu025:429730] *** Process received signal ***

```

There is a similar issue with testScatter(), where the number of input/output tensors on the source/destination, respectively, should be 0.

In addition testSendRecv(true); fails with

```
terminate called after throwing an instance of 'std::runtime_error'
  what():  src rank is wrong for recvAnysource

```

since we never populate `srcRanks`

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D28001963

Pulled By: agolynski

fbshipit-source-id: c381dfc6f417ee78fbbaf884e567b0485076dfc8
2021-04-28 08:39:08 -07:00
Yanli Zhao
1e77ba36db change ddpLoggingData struct to map or dict (#56641)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56641

Currently ddpLoggingData is a flat struct, which requires internal DDP developers and external users to know the struct field names. This is not flexible for deleting or adding fields in the future, and it also makes ddpLoggingData hard to access.

With maps/dicts, developers and users can access the fields more easily, and it is easier to add or remove fields.

Since C++ does not support map values of different types, ddpLoggingData currently contains two types of maps.
ghstack-source-id: 127482694

Test Plan: unit tests

Reviewed By: SciPioneer

Differential Revision: D27923723

fbshipit-source-id: c90199c14925fc50ef219000e2f809dc7601cce1
2021-04-28 06:43:25 -07:00
Yanli Zhao
28a9483e36 fix ddp logging test (#56640)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56640

Reset performance stats for the current iteration; also fix DDP logging verification for sampled iterations.
ghstack-source-id: 127327708

Test Plan: unit tests

Reviewed By: SciPioneer

Differential Revision: D27923414

fbshipit-source-id: aaa1b10f64a0c952ba345c789c864bcef5cf1ab0
2021-04-26 10:12:05 -07:00
Rohan Varma
2d2370bb61 [Dist profiling] Fix ProcessGroupNCCL collective profiling (#55204)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55204

Implements a fix discussed offline with pritamdamia87 to run end callbacks after `CUDAFuture`'s wrapCallback has ensured appropriate synchronization. Also enables the relevant distributed profiling tests that were previously disabled for ProcessGroupNCCL.

Note that the profiling infrastructure has moved to primarily encourage the use of torch.profiler and CUPTI to trace CUDA kernels, support for distributed collectives for that will require further discussion with ilia-cher. However, this PR improves the usability of torch.autograd.profiler with respect to distributed collectives.

ghstack-source-id: 127357995

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D27491711

fbshipit-source-id: cec7703a4c5d59b5023b0aa8fef4c2e3fb8d37d0
2021-04-25 19:40:19 -07:00
Liang Luo
c37095760d [torch distributed] Implementing all_gather_base (#56315)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56315

This diff implements all_gather_base in PyTorch distributed.
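
A hedged usage sketch, assuming an initialized process group; the Python-facing name used below is the underscore-prefixed `_all_gather_base`, which is an assumption about how this diff is exposed:

```python
# Sketch: gather every rank's tensor into one flat output tensor instead of a list of tensors.
import torch
import torch.distributed as dist

# assumes dist.init_process_group(...) has already been called
world_size = dist.get_world_size()
inp = torch.ones(4) * dist.get_rank()
out = torch.empty(4 * world_size)

dist._all_gather_base(out, inp)  # rank i's input lands in out[4*i : 4*(i+1)]
```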

Test Plan: dist.all_gather_base(output, input)...

Reviewed By: agolynski, amylittleyang

Differential Revision: D27488999

fbshipit-source-id: 937ec8bddf9527fa4d114f984d1d0f6a5b8c3936
2021-04-23 14:16:47 -07:00
Rohan Varma
7ff1990caf [c10d] Increment sequence numbers on collectives. (#55718)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55718

Increments sequence numbers when ProcessGroupGloo::enqueue or
ProcessGroupNCCL::collective is run, which is a common call all collectives
make. The next step will be to log these along with other collective info in
debug mode as well as integrating them with the process group wrapper.
ghstack-source-id: 127215077

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27690690

fbshipit-source-id: cb284b7c760763b7c0f814a41f06656fabf806d6
2021-04-23 10:06:56 -07:00
Luca Wehrstedt
58d12eb75e Allow to specify a set of device for CUDAFuture (#56515)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56515

In https://github.com/pytorch/pytorch/pull/56405 we finally found a solution to support RPC remote user functions that created/used CUDA tensors on devices that were not used by their arguments, by defining a "bounding set" of devices when constructing the agent and allowing all functions to freely use any of those devices.

We had the same exact problem with the callbacks of CUDAFuture, and in this PR I'm adopting the same exact solution: I allow to specify a set of devices when constructing a CUDAFuture, and then every callback is allowed to use any of those devices. (These devices will also be propagated to child futures).

I'm also making ProcessGroupNCCL pass these devices. I can't yet do it for TensorPipeAgent, until #56405 lands.
ghstack-source-id: 127261552

Test Plan: Added a test for this later in the stack.

Reviewed By: mrshenli

Differential Revision: D27861067

fbshipit-source-id: 8ab2c9d06a514c0407a7e96abc3704e8d5c5dc09
2021-04-23 08:12:41 -07:00
Pavel Belevich
5cc75e46fa Split test_c10d.py to test_c10d_common.py, test_c10d_gloo.py, test_c10d_nccl.py (#56598)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56598

Test Plan: NA

Reviewed By: SciPioneer

Differential Revision: D27913170

fbshipit-source-id: 3439d18141131b02d55f2ca399a4c795cba2b04b
2021-04-21 22:10:41 -07:00
Wanchao Liang
43ad172c54 make ProcessGroupDefaultTimeout the same as python (#56549)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56549

This makes `kProcessGroupDefaultTimeout` the same as the Python
side, and the Python side directly uses the pybind value instead.

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D27899190

Pulled By: wanchaol

fbshipit-source-id: 388a7f42358b0abed75cf4934fb7b311fd33fee6
2021-04-21 17:56:05 -07:00
Wanchao Liang
a970e525fd make ProcessGroup.Options.timeout argument private in python (#56531)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56531

Per the discussion in
https://github.com/pytorch/pytorch/pull/53663/files#r593409009, we need
to make sure our API does not confuse users by accepting a timeout both as an
argument and via ProcessGroup.Options. This PR makes
`ProcessGroup.Options.timeout` a private field that is only used in
our test utils; for both `init_process_group` and `new_group`, we still
allow users to pass `timeout` as a separate argument (see the sketch below). Since
`ProcessGroupGloo.Options` only has a `timeout` config, both functions
will not allow passing in options for the GLOO backend.

This way we still preserve a single `timeout` API, and only allow users
to use `ProcessGroupNCCL.Options` when needed.
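
A minimal sketch of the preserved public API; the backend and timeout values are illustrative:

```python
# Timeout stays a plain keyword argument on the public APIs rather than going through Options.
from datetime import timedelta
import torch.distributed as dist

dist.init_process_group("gloo", timeout=timedelta(minutes=5))
subgroup = dist.new_group(ranks=[0], timeout=timedelta(minutes=1))
```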

cc pritamdamania87 rohan-varma

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D27893395

Pulled By: wanchaol

fbshipit-source-id: cdd29c84648002226ef3d9f9f3ea67b795e64bc5
2021-04-21 17:55:10 -07:00
Ailing Zhang
27a0d6f1df AutoDispatchBelowAutograd takes no arguments. (#56424)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56424

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D27866607

Pulled By: ailzhang

fbshipit-source-id: b82cfb90af5bc7b4129266083fe31f8b335a5b41
2021-04-21 14:44:12 -07:00
Rohan Varma
b7d5a0cf10 [c10d] sequence number in process group (#55319)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55319

Adds a sequence number class as well as integration with ProcessGroup (nccl and gloo) as part of better debuggability.

The main use case is that each ProcessGroup instantiated will have a sequence number initially set by rank 0, and broadcasted to all others. We will increment the number on each collective, thus allowing us to match the numbers appropriately when checking for desynchronization.

This PR just adds the bare-bones integration and verifies sequence numbers are set appropriately at the beginning.
ghstack-source-id: 127011277

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27562769

fbshipit-source-id: d4a4de7529ce07a0c86fcf6beb06f317f359d89b
2021-04-21 10:59:24 -07:00
Ailing Zhang
3d904b56ec s/AutoNonVariableTypeMode/AutoDispatchBelowAutograd/ (#56423)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56423

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D27866606

Pulled By: ailzhang

fbshipit-source-id: e3942356dc3133d1c5722de40ec0d45e6a60f2f1
2021-04-20 17:17:46 -07:00
marksaroufim
48aaea3359 unified GlooStore and c10d store API (#56222)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56222

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55719

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D27785267

Pulled By: msaroufim

fbshipit-source-id: ce247f9226ecc971af8e1f08adeb835f64973e12
2021-04-19 10:57:18 -07:00
Jay Chae
400398006f [PARAM] Param comms debug info (#55976)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55976

- Define a concrete `DebugInfo` to collect Param comms.
- Add a macro to easily log `DebugInfo`

Test Plan:
Tested on `ads:simplified_launcher` with `dyno gputrace`
locally tested in libkinetoObserver that it can collect the debug Infobase

Reviewed By: kingchc, ilia-cher

Differential Revision: D26773447

fbshipit-source-id: a8eeede2d6dbf34d7a1b3614843b4a1baba94448
2021-04-15 16:22:01 -07:00
Rohan Varma
51e7a371f5 [DDP] Param to name mapping in Reducer (#55075)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55075

Constructs and passes a mapping of parameter names into Reducer so that error messages about unused parameters / not all parameters getting gradients can identify the offending parameters by name.

Use case:
1) User runs DDP forward + bwd, and it has some unused parameters that will result in ddp error in next iteration
2) Next forward pass calls `Reducer::ensure_prior_reduction_finished()` where we check all params got gradient from the previous bwd pass. DDP would throw here in this case.
3) Reducer maintains mapping and tracks used parameters, and computes which parameters did not get gradient and logs this as part of the error.

Implementation details:
0) The following is only enabled for debug modes of INFO or DETAIL.
1) To save memory, we don't map param -> param name so that we don't have to copy the entire tensor, instead we map param_index -> param_name and use the existing concept of variable_index in Reducer to look up parameter names.
2) DDP constructs param index -> param name mapping. The name is the fully qualified name: f"{module_name}:{param_name}" and passes it into Reducer
3) Reducer maintains per-iteration std::set<int> of variable indices that have had `mark_variable_ready` called.
4) When some params go unused, we take a set difference to detect the unused params.
5) Unit tests for the logged unused params, as well as for nested modules, are added (a usage sketch follows this list).
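
A rough usage sketch; selecting the debug mode via the TORCH_DISTRIBUTED_DEBUG environment variable is an assumption, and the model/setup is illustrative:

```python
# Sketch: with debug mode INFO/DETAIL, the DDP error on the next forward pass names the
# parameters that never received a gradient.
import os
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # assumed mechanism for picking the debug mode

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.used = torch.nn.Linear(10, 10)
        self.unused = torch.nn.Linear(10, 10)  # never participates in the loss

    def forward(self, x):
        return self.used(x)

# assumes dist.init_process_group(...) has already been called
model = DDP(Net())  # find_unused_parameters left at its default (False)
model(torch.randn(8, 10)).sum().backward()
model(torch.randn(8, 10))  # raises; with this change the message lists the unused
                           # parameters by their fully qualified names
```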
ghstack-source-id: 126581051

Test Plan: CI, UT

Reviewed By: zhaojuanmao

Differential Revision: D27356394

fbshipit-source-id: 89f436af4e74145b0a8eda92b3c4e2af8e747332
2021-04-15 09:19:50 -07:00
Brian Hirsh
e8faf69739 fix torch.pow type promotion issue (#54085)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54085

Fixes https://github.com/pytorch/pytorch/issues/50121.

This fixes two similar issues pointed out with the dtype that `torch.pow` performs its computation. Thanks ngimel for spotting the issues originally (comments [here](https://github.com/pytorch/pytorch/pull/53669#discussion_r594624355) and [here](https://github.com/pytorch/pytorch/pull/53669#discussion_r594719704))!

Before:
```
>>> torch.pow(2, torch.tensor([17], dtype=torch.uint8), out=torch.tensor([0]))
tensor([0])
>>> torch.pow(2, torch.tensor(17, dtype=torch.uint8), out=torch.tensor(0))
tensor(131072)
>>> torch.pow(2, torch.tensor([17], dtype=torch.uint8, device='cuda'), out=torch.tensor([0], device='cuda'))
tensor([131072], device='cuda:0')
>>> torch.pow(2, torch.tensor(17, dtype=torch.uint8, device='cuda'), out=torch.tensor(0, device='cuda'))
tensor(131072, device='cuda:0')
```

After:
```
>>> torch.pow(2, torch.tensor([17], dtype=torch.uint8), out=torch.tensor([0]))
tensor([0])
>>> torch.pow(2, torch.tensor(17, dtype=torch.uint8), out=torch.tensor(0))
tensor(0)
>>> torch.pow(2, torch.tensor([17], dtype=torch.uint8, device='cuda'), out=torch.tensor([0], device='cuda'))
tensor([0], device='cuda:0')
>>> torch.pow(2, torch.tensor(17, dtype=torch.uint8, device='cuda'), out=torch.tensor(0, device='cuda'))
tensor(0, device='cuda:0')
```

In all four cases above, `tensor(0, ...)` is the correct value because the computed "common dtype" among the inputs is expected to be `uint8`. Computing `2 ** 17` in uint8 will then overflow to zero. Finally, we cast the computed output to the output tensor's dtype, which is `int32`.

There were two separate issues fixed in this PR: one for cpu and one for cuda:
* For CPU, The `pow(Scalar, Tensor)` overload wasn't calling `set_wrapped_number(true)` after wrapping the scalar in a Tensor, which caused the "promoted" scalar to incorrectly participate in type promotion (see the documented behavior [here](aa8714dfed/c10/core/TensorImpl.h (L590)))
* For CUDA, the cuda kernels defined in `PowKernel.cu` were using the output's dtype to run the computation, instead of the common dtype.

As an aside: The CPU and CUDA kernels actually both use `iter.dtype()` instead of `iter.common_dtype()` to run the computation, which I fixed. The reason that only manifested here for CUDA is because TensorIterator has cpu-specific logic to create temporary outputs with the intermediate dtype (shown [here](aa8714dfed/aten/src/ATen/TensorIterator.cpp (L349))). I'm not sure what the end state is there- I can imagine that being something we're more okay doing for cpu than for cuda, but it also leads to hard-to-track-down inconsistencies between the two like in this case.
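
A small check of the common-dtype rule described above, using `torch.result_type` to report the promoted dtype:

```python
# The wrapped Python scalar does not drive promotion, so the common dtype is uint8 and
# 2 ** 17 overflows to 0 before any cast to the out tensor's dtype.
import torch

assert torch.result_type(2, torch.tensor(17, dtype=torch.uint8)) == torch.uint8
print(torch.pow(2, torch.tensor(17, dtype=torch.uint8)))  # tensor(0, dtype=torch.uint8)
```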

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D27096330

Pulled By: bdhirsh

fbshipit-source-id: a7e2909243851625cb3056d1e7abb2383bfe95f2
2021-04-15 08:55:53 -07:00
Howard Huang
5cab3b9cf6 Revert D27709912: TCPStore add watchKey method and new listener thread
Test Plan: revert-hammer

Differential Revision:
D27709912 (f8f756efb2)

Original commit changeset: 619aa3b2a8eb

fbshipit-source-id: 3ef96ccaa76c702d7e5427dfc263531fb1c274ab
2021-04-15 07:43:48 -07:00
Howard Huang
f8f756efb2 TCPStore add watchKey method and new listener thread (#54264)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54264

**Changes**

- Creates new listener thread on each client to run the callback
- Create new class which listener thread and master thread derive from, this class is used to handle shut down and clean up of the thread in windows and linux
- Add watchKey method and update any functions that changes the key value.

**Background**
This PR adds functionality to TCPStore to allow users to watch a key and execute a callback on key change.

It introduces a new watchKey() API:
`TCPStore::watchKey(const std::string& key, std::function<void(std::string, std::string)> callback)` which has parameters `key` and `callback(old_key, new_key)` to run on key change. Since current methods are blocking, for example in `TCPStore::get()` a worker will send a "get key" request to the master -> wait for a response back -> then exit the function and return the value to the user, we need a non-blocking, asynchronous way to execute the callback whenever a key changes. This is done by creating a new listener thread on each client which the master can communicate with.

Right now, the API is C++ only and only for TCPStore, the internal use case is for elastic RPC. We will have an internal key such as `_NumNodes` and all nodes in the elastic RPC group will watch this key. When a node leaves, this key will be updated and each node will execute a callback to clean up Autograd context and RRef context.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D27709912

Pulled By: H-Huang

fbshipit-source-id: 619aa3b2a8eb23f4be5f5736efdcca6c175aadf3
2021-04-14 13:23:12 -07:00
Rohan Varma
bbc4c775bb [reland][c10d] monitored_barrier: ensure all ranks pass or none do (#55990)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55990

Reland of https://github.com/pytorch/pytorch/pull/55197, which fails windows test that was only run on master.

Disabled these tests for Windows, similar to how they are disabled on MacOS. The reason for disabling is that they use the libuv transport, which does not have as robust error handling as TCP on Linux. The result is that non-zero ranks that were healthy don't throw immediately (like they do on Linux) but instead throw on timeout. The error handling still occurs as expected on rank 0 for all platforms.
ghstack-source-id: 126478371

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D27758424

fbshipit-source-id: d30841c8dda77f51b09a58161e638657ef758e63
2021-04-14 12:26:54 -07:00
Rohan Varma
752f5b1030 [reland][c10d] Log API usage of monitored barrier (#55989)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55989

Reland of https://github.com/pytorch/pytorch/pull/55197, which fails windows test that was only run on master.
ghstack-source-id: 126477554

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D27758425

fbshipit-source-id: ebca8b6baf0019879bc4b16639d6cccf27dc6b1c
2021-04-14 12:25:35 -07:00
Rohan Varma
48c73d24b8 Revert D27523060: [c10d] monitored_barrier: ensure all ranks pass or none do
Test Plan: revert-hammer

Differential Revision:
D27523060 (a5290adea5)

Original commit changeset: fa05e4f8ad8a

fbshipit-source-id: aa59c1c3ab0ed5b124583a52aed0f93c3b93a05a
2021-04-13 21:33:09 -07:00
Rohan Varma
c7aa1026a8 Revert D27548433: [c10d] Log API usage of monitored barrier
Test Plan: revert-hammer

Differential Revision:
D27548433 (09231b5db1)

Original commit changeset: 7520ad0948b8

fbshipit-source-id: aa946d8d27472d19c0fe855952ec58d1266ee35a
2021-04-13 21:31:49 -07:00
Rohan Varma
09231b5db1 [c10d] Log API usage of monitored barrier (#55265)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55265

Logs API usage of monitored barrier for better tracking and use case
understanding.
ghstack-source-id: 126413087

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D27548433

fbshipit-source-id: 7520ad0948b8dc9d44fa3118d5ea953d52f9f1c5
2021-04-13 19:02:52 -07:00
Rohan Varma
a5290adea5 [c10d] monitored_barrier: ensure all ranks pass or none do (#55197)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55197

From initial user feedback, one unexpected difference between monitored_barrier impl and barrier is the "all or nothing" semantics.

In barrier, all ranks pass or they all fail. With monitored barrier however, if rank 1 is healthy, it will respond to both send and recv from rank 0, but rank 0 can later fail because rank 2 is stuck. In this case, rank 1 will move forward out of the barrier.

This change makes it so that if a rank fails in monitored barrier, all other ranks in monitored barrier will also fail. It does so by the following process, similar to acknowledgements:

1. Nonzero ranks call send().
2. Nonzero ranks call recv().
3. Rank 0 calls recv(); if this succeeds, rank 0 has acknowledged rank N as healthy.
4. Once all ranks are acknowledged as healthy, rank 0 calls send() to all nonzero ranks to unblock them.

Modified unit tests to ensure the all-or-nothing failure behavior.
ghstack-source-id: 126413088

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D27523060

fbshipit-source-id: fa05e4f8ad8ae97fd6cb20da5c3a7ef76fd31de6
2021-04-13 19:01:25 -07:00
Yi Wang
132f5c1f36 Clang-format ProcessGroupMPI.cpp (#55969)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55969

Per title
ghstack-source-id: 126453717

Test Plan: N/A

Reviewed By: zhaojuanmao

Differential Revision: D27752173

fbshipit-source-id: e5069b91d699b9d02b12e5dab5e62007dbcee9f0
2021-04-13 17:11:19 -07:00
Yi Wang
de5e3b5eb0 Fix OSS flaky test_destroy_full_group on MPI backend in pytorch_linux_xenial_cuda10_2_cudnn7_py3_multigpu_test environment by adding a barrier and retrying MPI_Comm_create 3 times (#55921)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55921

Fix this flaky test by adding a barrier and retrying the flaky function call `MPI_Comm_create` 3 times.

Couldn't figure out the root cause of why `createProcessGroupMPI` can be flaky when just creating a subgroup communicator, which mainly involves invoking `MPI_Comm_create`. Here `createProcessGroupMPI` does not involve any p2p or collective communication at all. Cannot dig further into `MPI_Comm_create`, which is in the MPI codebase.

Also checked the commit history, and no commit on `ProcessGroupMPI.cpp` can be found within a few days before Mar 10th.

First failure (on Mar 10th):
https://app.circleci.com/pipelines/github/pytorch/pytorch/283704/workflows/d84ac4a0-42e3-4925-b1cf-32d3c3d1022a/jobs/11456129

Note that the test failure cannot be reproduced locally.

Verified the fix on CI:
https://app.circleci.com/pipelines/github/pytorch/pytorch/300586/workflows/a5c16db4-3ae2-44c7-a9c8-b0885dad2a64/jobs/12356852
test_destroy_full_group has been rerun 100 times and passes.

#Closes: https://github.com/pytorch/pytorch/issues/53899
ghstack-source-id: 126414937

Test Plan:
```
export BACKEND=mpi
export WORLD_SIZE=2
pytest -k test_destroy_full_group test/distributed/test_distributed_fork.py -vs
```

```
#!/bin/bash
for i in {1..100}
do
pytest -k test_destroy_full_group test/distributed/test_distributed_fork.py
done
```

The CI tests triggered by a new branch:
https://app.circleci.com/pipelines/github/pytorch/pytorch?branch=ci-all%2Fwayi_mpi

Reviewed By: mrshenli

Differential Revision: D27245421

fbshipit-source-id: 86e7fe208e34eda8a33885e385d56ec6b60eca27
2021-04-13 15:28:51 -07:00
Rohan Varma
c218ac3bc0 [NCCL] Join work clean up thread before aborting communicators (#55444)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55444

Changes ~ProcessGroupNCCL so that we join work cleanup thread before aborting nccl communicators. This is because if we abort nccl communicators first on destruction, outstanding work objects in workMetaList can have exceptions set on them. Right now this doesn't trigger errors in nccl async error handling due to the terminated check, but it seems a bit cleaner to just join this thread first.

The main motivation is also to reduce log spam since we added some logging when an exception is set on WorkNCCL, but this unexpectedly resulted in a lot of false-positive errors being logged even after pg shutdown. An example is below:

I0406 18:30:27.361981 1567104 ProcessGroupNCCL.cpp:527] [Rank 1] NCCL watchdog thread terminated normally
I0406 18:30:27.364675 1567105 ProcessGroupNCCL.cpp:265] [Rank 1] found async exception when checking for NCCL errors: NCCL error: unhandled system error, NCCL version 2.7.3

With this change, we no longer see these false-positive logs.
ghstack-source-id: 126145284

Test Plan: CI

Reviewed By: osalpekar

Differential Revision: D27613035

fbshipit-source-id: abf924630128b50e7f66ae41ac83403e7a0aac96
2021-04-13 15:25:22 -07:00
Yanli Zhao
5ffc4e3b0f refactor prepare_for_backward (#54977)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54977

Move part of the code in prepare_for_backward into functions so that those functions can be used for static graph training and delayed all-reduce later on.
ghstack-source-id: 126366714

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D27439195

fbshipit-source-id: 8899eda621260232d774cb145f9c6d683c47e188
2021-04-13 14:25:29 -07:00
Rohan Varma
657b66e87d [NCCL] Log when barrier guesses device to use (#54991)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54991

The actual proposed fix is in
https://github.com/pytorch/pytorch/pull/53934; in the meantime, it is useful
to include this LOG when barrier does not know which devices to use, and to suggest
the workaround of passing device_ids into barrier() (see the sketch below).
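
The suggested workaround, sketched (launcher/setup assumed):

```python
# Tell the NCCL barrier explicitly which GPU to use instead of letting it guess.
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)
dist.barrier(device_ids=[local_rank])
```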
ghstack-source-id: 126351889

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27444917

fbshipit-source-id: 0f269c5a7732e5be6e51adfca7ef70d04ffd71d3
2021-04-13 11:53:55 -07:00
Can Balioglu
339d3bf394 [2/n] [torch/elastic] Introduce C10dRendezvousBackend. (#55636)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55636

This diff introduces:

- The `C10dRendezvousBackend` type to support C10d stores as rendezvous backends.
- A fix to the `TCPStore.compare_set()` function to support non-existent keys.
- A placeholder `c10d-experimental` registry to instantiate C10d-backed rendezvous backends via `get_rendezvous_handler()`.
ghstack-source-id: 126312162

Test Plan: Run the existing and newly-introduced unit/integration tests.

Reviewed By: tierex

Differential Revision: D27654492

fbshipit-source-id: 09f498138b35186de4b0e174adb33fb5b5aa4b52
2021-04-12 22:20:27 -07:00
Yi Wang
3e9cbe5ef7 [SPMD] Remove the code branches only used in SPMD mode from distributed.py (#55353)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55353

Remove all the code branches that will only be executed when `device_ids > 1`.

Some helper functions are also removed:
1.  `_verify_replicas_within_process` and `verify_replicas_within_process`
2. `_replicate_modules_within_process`
3. `parallel_apply`

The next step is deprecating `_module_copies` field.
ghstack-source-id: 126201121

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D27552201

fbshipit-source-id: 128d0216a202f5b1ba4279517d68c3badba92a6c
2021-04-09 17:27:56 -07:00
Rohan Varma
0e03a2978a [DDP] Call ensure_prior_reduction_finished within lock (#55074)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55074

This function accesses member variables that can be modified by
different threads (i.e. autograd engine threads), so call it within lock scope.
ghstack-source-id: 125707513

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D27474526

fbshipit-source-id: 8d43faedd6e6eeeb69e21ce3262337ab83d7ba07
2021-04-05 22:16:13 -07:00
Yi Wang
6a2f046504 [SPMD] Restrict DDP communication hooks to SPSD mode (#55253)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55253

Previously, DDP communication hooks took a tensor list as the input. Now they take only a single tensor, in preparation for retiring SPMD and providing only a single model replica to DDP communication hooks.

The next step is limiting only 1 model replica in Reducer.
ghstack-source-id: 125677637

Test Plan: waitforbuildbot

Reviewed By: zhaojuanmao

Differential Revision: D27533898

fbshipit-source-id: 5db92549c440f33662cf4edf8e0a0fd024101eae
2021-04-05 16:46:47 -07:00
Rohan Varma
19a0eb4cdb [c10d] Monitored barrier: option to collect all failed ranks (#55010)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55010

Follow-up change to add a flag that lets monitored barrier collect all the failed ranks and then throw, instead of just throwing on the first one. This is useful as monitored barrier can now pick up on all hanging ranks instead of just one.

This is done by passing in a flag `wait_all_ranks=True`.
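
A minimal usage sketch, assuming an initialized gloo process group; the timeout value is illustrative:

```python
from datetime import timedelta
import torch.distributed as dist

# On timeout, rank 0 now reports *all* ranks that failed to reach the barrier
# instead of throwing on the first missing rank.
dist.monitored_barrier(timeout=timedelta(seconds=30), wait_all_ranks=True)
```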
ghstack-source-id: 125699839

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27447787

fbshipit-source-id: ec23aee212060d9eb515ff8adc96c6a17822d1bb
2021-04-04 21:39:54 -07:00
Rohan Varma
0ec1af4b7e [c10d] Enforce order of waited ranks in monitored barrier. (#55009)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55009

Changes monitoredBarrier so that we await acknowledgemenet from ranks
in a consistent order (from least to greatest). This will reduce confusion
around the order the ranks are awaited. We are still planning to add support
for awaiting all ranks in follow up changes.
ghstack-source-id: 125699838

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27405417

fbshipit-source-id: b9a3e72742cbffdd9bf890ab2c94103b768a7b71
2021-04-04 21:38:25 -07:00
Mike Ruberry
c0ac0fef4e Revert D27448156: irange for size_t
Test Plan: revert-hammer

Differential Revision:
D27448156 (041b4431b2)

Original commit changeset: 585da57d4de9

fbshipit-source-id: 8e047c29f391c0166e0a1a87c3fb2a0854377365
2021-04-03 19:14:00 -07:00
Richard Barnes
041b4431b2 irange for size_t (#55163)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55163

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D27448156

fbshipit-source-id: 585da57d4de91c692b6360d65f7b8a66deb0f8c1
2021-04-02 23:22:29 -07:00
Yi Wang
322854d2f0 [SPMD] Error out SPMD in C++ Reducer (#55212)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55212

Error out SPMD in C++ Reducer.

Added a new test `test_reducer_no_multi_replicas`, which checks that multiple replicas are not allowed in the Reducer constructor.

Removed 2 tests relevant to reducer in SPMD mode:
`test_ddp_comm_hook_multiple_replica_check`
`test_forward_backward_multi_replica`

ghstack-source-id: 125602472

Test Plan: waitforbuildbot

Reviewed By: pritamdamania87

Differential Revision: D27497747

fbshipit-source-id: 17ef1bc4d889cbe8076bcb3d504aed4c1aea1562
2021-04-02 22:59:25 -07:00
Rohan Varma
3575e71be8 [DDP Logging] Log use of uneven inputs API (#54919)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54919

Log the use of uneven inputs API for better tracking and use case
detection.
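
For context, a sketch of the uneven-inputs API being logged here, i.e. DDP's `join()` context manager per the DDP docs; the model and setup are illustrative:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# assumes dist.init_process_group(...) has already been called
model = DDP(torch.nn.Linear(10, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.01)

num_batches = 5 + dist.get_rank()  # ranks see different numbers of batches
with model.join():                 # keeps collectives consistent despite uneven inputs
    for _ in range(num_batches):
        model(torch.randn(8, 10)).sum().backward()
        opt.step()
        opt.zero_grad()
```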
ghstack-source-id: 125446499

Test Plan: CI, added ut

Reviewed By: zhaojuanmao, SciPioneer

Differential Revision: D27410764

fbshipit-source-id: abc8055a2e15a3ee087d9959f8881b05a0ea933e
2021-04-01 16:22:32 -07:00
Rohan Varma
d5564618d0 [NCCL][Blocking Wait] Log set exceptions when checking for exceptions in (#54558)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54558

In blocking wait's polling synchronization loop, we frequently call checkAndSetException() as part of isCompleted() to check the status of nccl operations. It would be useful to log here in case we encounter any exceptions (which are later thrown by `checkAndThrowException`).

Also slightly refactors code previously added to make use of a helper function to get the error message given an `std::exception_ptr`.
ghstack-source-id: 125124314

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D27136202

fbshipit-source-id: 256eb63c5c2a84be909722d3fd7377ad9303fa11
2021-03-29 14:15:45 -07:00
Rohan Varma
028d2d6e63 [NCCL] Enhance watchdog to log exceptions (#54557)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54557

When looping through the nccl communicator cache checking for errors, enhance the watchdog to log exceptions that are set on the communicator.

This will allow for better debugability since the NCCL error will be logged when the watchdog receives errors for the communicators and aborts them appropriately.

Tested by forcing a NCCL error with NCCL_BLOCKING_WAIT=1 and verifying that the exception is indeed logged.
ghstack-source-id: 125124310

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27106699

fbshipit-source-id: 1d2bd9f057a3796ce15dd8a4ce34cf6899eee45c
2021-03-29 14:15:42 -07:00
Rohan Varma
4541f60390 Gloo-only CPU-based monitored barrier (#53773)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53773

Closes https://github.com/pytorch/pytorch/issues/52876

Implements a barrier by doing send/recv to rank 0; rank 0 waits for these requests and, on timeout, throws an exception indicating which rank did not join within the given timeout.

This barrier is only intended for CPU use cases and built into process group gloo, and will be used for debugging synchronization/hang issues.

Test Plan: Added UT

Reviewed By: zhaojuanmao

Differential Revision: D26921357

fbshipit-source-id: 7c16e861b4b8ea2bdd67a36b3de7b1029af7d173
2021-03-29 14:14:10 -07:00
Rohan Varma
5c3d80d8fa [DDP] Mark a few variables as const in reducer (#54764)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54764

We mark a few vars as const in Reducer; we also do this for replicas_ and
process_group_, as they should not be changed by Reducer during training. This
can help eliminate issues at compile time and prevent the developer from
accidentally changing these variables.
ghstack-source-id: 125040110

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27357132

fbshipit-source-id: 23a0edf754a8e4f9e6440e99860e5549724cb7ad
2021-03-27 21:40:18 -07:00
Rohan Varma
671f80a313 [c10d] s/torch::autograd::variable/at::Tensor/g (#54763)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54763

Replaces deprecated torch::autograd::variable with at::Tensor.
torch::autograd::variable is defined as equal to at::Tensor now so this should
be a noop, but follows convention of using tensor instead of Variable.
ghstack-source-id: 125040109

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27356450

fbshipit-source-id: 1a001358d7726a597141ec47803c8213db4814c0
2021-03-27 21:38:51 -07:00
Wenlei Xie
593295daac Migrate kernels with TensorOptions to C10 full dispatcher (#54539)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54539

Codemod commands generated by https://github.com/pytorch/pytorch/pull/54468

ghstack-source-id: 125018630

# Facebook:
The following 2 files are changed on fb side:
```
// Should be hidden
```

Test Plan: buck build //caffe2/aten/...

Reviewed By: smessmer

Differential Revision: D27273744

fbshipit-source-id: 35c1bff63189477645008caaf0dc794096e3fcc4
2021-03-26 13:55:22 -07:00
Michael Carilli
1442a92741 Ensure local_used_maps_tmp is distinct from local_used_maps_[i] (#54474)
Summary:
Followup/hotfix for https://github.com/pytorch/pytorch/pull/53160. rohan-varma and zhaojuanmao were seeing https://github.com/pytorch/pytorch/pull/53160/files#diff-9273e5ff7b40f30d6a4444d1c7be9fe9a5c2068070c68af4e7b0ac2d4cff0923R582 fire in some internal workloads, indicating `local_used_maps_tmp` wasn't actually being created as a distinct temporary, in other words, `local_used_maps_[i]` was already pinned for some reason. This seems like a bug with the CPU allocator: [`local_used_maps_` should not have been pinned on construction](9be4c75fa0/torch/lib/c10d/reducer.cpp (L180-L183)). We should [investigate that separately](https://github.com/pytorch/pytorch/pull/53160/files#r599188373).

In the meantime, the present PR should ensure `local_used_maps_tmp` is always distinct from `local_used_maps_[i]` (and therefore prevents the race condition described in https://github.com/pytorch/pytorch/pull/51360) even if `local_used_maps_[i]` is already pinned.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54474

Reviewed By: zhaojuanmao

Differential Revision: D27268039

Pulled By: rohan-varma

fbshipit-source-id: ab9af3dd845098bde788cb28a9217caea246ddfa
2021-03-24 16:58:31 -07:00
Rohan Varma
789dc6d445 [NCCL] Add more details for checkForNCCLErrors (#54117)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54117

https://github.com/pytorch/pytorch/pull/45950 enhanced our NCCL error logging so that we add some basic debug information about what went wrong when erroring out with an NCCL error.

However, that PR only used the added function for `C10D_NCCL_CHECK` which is used to check the return values of NCCL calls. However, in ProcessGroupNCCL we also have `checkForNCCLErrors` which checks for errors on nccl communicators, and in case of errors it would be good to have this logging there too.

Also renames the function s/errorMessage/getNcclErrorDetailStr
ghstack-source-id: 124662592

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D27100497

fbshipit-source-id: fec3663ffa3e92bae8391ef4f77054abb4bb9715
2021-03-23 20:29:16 -07:00
Brian Hirsh
bc4f521178 port at::mul to structured (#52692)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52692

Porting `at::mul` to structured.

One other issue I hit with the port was the fact that there are a bunch of other places around the code base that used to call out to variants of `at::native::mul`, which no longer exists. *Technically*, `at::cpu::mul` does the equivalent thing now, so I patched most call-sites to use that. There were two other places where I did something slightly different (calling `at::cuda::mul` and `at::mul`, respectively), which I called out in the comments.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D27029822

Pulled By: bdhirsh

fbshipit-source-id: 6cc80de0dfccec304bf8e16a1823e733bed27bf4
2021-03-19 11:34:33 -07:00
Wanchao Liang
f4a044ca1d [distributed] add options field in ProcessGroupGloo/NCCL (#54090)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54090

This PR adds an options field to both ProcessGroupGloo/NCCL so that we
have a constant `options` field even after the initialization of the
ProcessGroup, which gives us the ability to inspect the options used during
construction of a specific ProcessGroup. Also use options inside different
methods instead of separate fields.

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D27093670

Pulled By: wanchaol

fbshipit-source-id: b02d9394290e9be88b21bddb94d4de7993b4a2e3
2021-03-17 18:41:55 -07:00
Wanchao Liang
a4f0f8b1e9 [distributed] add base processgroup::options (#53662)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53662

Add a base processgroup::options so that we can do inheritance and
provide
a universal option API in python

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D26968856

Pulled By: wanchaol

fbshipit-source-id: 858f4b61b27aecb1943959bba68f8c14114f67d8
2021-03-17 18:40:04 -07:00
Michael Carilli
ce40ff5c64 Avoid DDP race condition with find_unused_parameters=True when all params are used (#53160)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53159.

See comments for a description of the race condition. Thanks to ptrblck xwang233 and especially zasdfgbnm for lots of help isolating the problem and discussing the fix.

PRing for discussion. We can try to concoct a dedicated test for the problem if you want. The ingredients are:
- DDP(..., find_unused_parameters=True)
- Use all the DDP-ed model's params in forward such that the "lazy local used work wait()" path will be taken in backward
- Queue up a lot of asynchronous dummy work just before backward(), so stream work gets pushed far into the future relative to CPU work

Benchmark:
Bert model, with find_unused_parameters=true, latency (sec) per iteration P50: trunk 1.265 sec, this PR 1.263 sec, with a blocking copy added before calling local_used_.fill(i) 1.236 sec
Bert model, with find_unused_parameters=false, latency (sec) per iteration P50: trunk 1.00 sec, this PR 1.026 sec
Resnet50 model, accuracy also matches trunk with find_unused_parameters=true and false

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53160

Reviewed By: albanD

Differential Revision: D26916766

Pulled By: zhaojuanmao

fbshipit-source-id: 3e0ed91b7b5c42e2f2c82e12d4d2940fdc89e023
2021-03-17 10:08:22 -07:00
Rohan Varma
f52a3bd634 [DDP] remove dedupe check in reducer (#53919)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53919

https://github.com/pytorch/pytorch/pull/53279/files has landed
deduplicating the shared params in python before constructing reducer. Because
of this, we no longer need the changes in
https://github.com/pytorch/pytorch/pull/46755/files.

This is already tested by `test_ddp_shared_grad_acc_unused_params` and
`test_ddp_weight_sharing`
ghstack-source-id: 123828299

Test Plan: ci

Reviewed By: SciPioneer

Differential Revision: D27015466

fbshipit-source-id: efb079540c1a0e18bb38e68479caeb50cf550304
2021-03-15 18:50:05 -07:00
Howard Huang
7f88840495 Fix prefix store timeout bug (#53928)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53928

HashStoreTest was taking forever to run. It turns out this was because a default timeout is set when creating Store(), and setTimeout for PrefixStore was not actually able to change the timeout of the underlying store.

After removing the default timeout and updating setTimeout, this will save ~10 minutes for all of the gcc_test CI runs.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D27025275

Pulled By: H-Huang

fbshipit-source-id: 650c8c1eb8b166da1d412ed88e765747a2ca2069
2021-03-15 13:23:20 -07:00
Thomas Viehmann
8734e88f0b delete has no more data after the key (#53886)
Summary:
The TCPStore delete-key implementation inadvertently set "moreData" when sending the key, even though it was in fact the last message.

Thank you, PetrochukM, for the reproducing example which was instrumental in developing the fix (and is the blueprint for the test case).

Fixes https://github.com/pytorch/pytorch/issues/53872

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53886

Reviewed By: jbschlosser

Differential Revision: D27011846

Pulled By: H-Huang

fbshipit-source-id: 5c460d1e4d095a8bc267bf63613b556856ced3e8
2021-03-15 08:44:55 -07:00
Isaac Seessel
3078233e9a [Gradient Compression] Make FP16 compression as a wrapper that can be combined with other communication hooks (#53808)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53808

Create a FP16 wrapper that can combine FP16 gradient compression with any gradient compression algorithm.
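
A minimal sketch of combining the wrapper with PowerSGD, assuming `model` is an already-constructed DistributedDataParallel module and that the wrapper lives in `default_hooks` (as the test name below suggests):

```python
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks as default
from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook as powerSGD

# Wrap PowerSGD so gradients are cast to FP16 before compression/communication.
state = powerSGD.PowerSGDState(process_group=None, matrix_approximation_rank=1)
model.register_comm_hook(state, default.fp16_compress_wrapper(powerSGD.powerSGD_hook))
```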

Test Plan:
Unit test:
```
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_compress_wrapper
```

Performance Test on DDP QPS Benchmark: Check if AllReduce + FP16 Wrapper = FP16 Compression
1) FP16 Compression:
f256897690

2) FP16 Wrapper + AllReduce (after patching D26960986):
f256897289

Reviewed By: SciPioneer

Differential Revision: D26978832

fbshipit-source-id: 0dcd18b050c02f5e9f3cff56344d1f39a04e20c0
2021-03-12 17:31:07 -08:00
Siva Datta Mannava
fdbd667e31 compareSet method for HashStore and FileStore (#53803)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53803

Reviewed By: ngimel

Differential Revision: D27017014

Pulled By: H-Huang

fbshipit-source-id: 736aa5ad848f5708e6581e472e48d5682bef7131
2021-03-12 12:38:30 -08:00
Howard Huang
4873641602 Fix TCPStore wait() hang when key is previously set (#53860)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53860

Fixes [#53840](https://github.com/pytorch/pytorch/issues/53840)

Right now [TCPStore wait([LIST_OF_KEYS_TO_AWAIT])](https://pytorch.org/docs/master/distributed.html#torch.distributed.Store.wait) will hang if any of the keys in [LIST_OF_KEYS_TO_AWAIT] has been previously set. This change ensures that wait() only waits for the keys that have not been set yet.

Before change:
```
# Case 1: HANG
store.set("1", "1")
store.wait(["1", "2"])
store.set("2", "2")

# Case 2: SUCCEED
store.wait(["1", "2"])
store.set("1", "1")
store.set("2", "2")
```
After change:
Both cases work

TODO: working on adding a test for wait()

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D26999929

Pulled By: H-Huang

fbshipit-source-id: 8931749923c98b520366538f785af82ef37cca8e
2021-03-12 07:05:31 -08:00
cyy
14d02517e1 replace data with data_ptr (#53097)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53097

Reviewed By: albanD

Differential Revision: D26972445

Pulled By: rohan-varma

fbshipit-source-id: 04798a3fd55dd297638377513cfc57ff86c8916d
2021-03-11 13:14:35 -08:00
Rohan Varma
fa980bb22a [wip][Dist Profiling] Enable dist profiling for MPI backend (#52949)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52949

Enables distributed profiling, which we already have for gloo and nccl, for the MPI backend
ghstack-source-id: 123610105

Test Plan: CI

Reviewed By: wanchaol

Differential Revision: D26591590

fbshipit-source-id: a20ec9d104faa26bc62c727dd01319c3ea230f5d
2021-03-11 13:08:41 -08:00
Yanli Zhao
a76b4736db clang format reducer and logger files (#53148)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53148

clang format reducer and logger files
ghstack-source-id: 123453983

Test Plan: unit test

Reviewed By: SciPioneer

Differential Revision: D26764509

fbshipit-source-id: 711efcfd77420f912861cfd20c69e3af5086f4b9
2021-03-10 11:35:30 -08:00
Yanli Zhao
d032287ec3 fix data type logging (#53162)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53162

it is possible there are multiple data types in mixed precision training, so log data types as a list of data type names.
ghstack-source-id: 123452626

Test Plan: unit test

Reviewed By: SciPioneer

Differential Revision: D26769256

fbshipit-source-id: 8f7d73821e89864fedbbce723f301fe8fbad5685
2021-03-10 11:35:26 -08:00
Yanli Zhao
7d4b229d61 add is_multi_device_module logging field (#53149)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53149

add is_multi_device_module logging field
ghstack-source-id: 123444621

Test Plan: unit test

Reviewed By: SciPioneer

Differential Revision: D26765355

fbshipit-source-id: d4d9c5981b18b1744299aebe8af37eb4e2e35c61
2021-03-10 11:35:22 -08:00
Yanli Zhao
a08fc1a7fc allow users to set sample rate and add per iteration latency breakdowns (#53145)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53145

add a new API to allow users to set the sample rate for runtime stats, and also add per-iteration latency breakdowns to the DDPLoggingData struct. For example,
if users set the sample rate to 1, they can analyze the per-iteration latency change over time (not averaged).
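
A hypothetical usage sketch; the helper name `_set_ddp_runtime_logging_sample_rate` is an assumption, since the summary does not spell out the new API, and `net`/`rank` are assumed to exist:

```python
from torch.nn.parallel import DistributedDataParallel as DDP

model = DDP(net.to(rank), device_ids=[rank])
# Assumed helper name: collect runtime stats every iteration instead of sampling.
model._set_ddp_runtime_logging_sample_rate(1)
```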
ghstack-source-id: 123443369

Test Plan: unit test

Reviewed By: SciPioneer

Differential Revision: D26763957

fbshipit-source-id: baff6a09c2a590e6eb91362ca6f47ae8fa6ddb0e
2021-03-10 11:35:18 -08:00
Michael Carilli
e787872a47 [RELAND] Deduplicate shared params before constructing Reducer in DDP (#53279)
Summary:
Original PR https://github.com/pytorch/pytorch/pull/51929 seemed to trigger failures in `pytorch_linux_xenial_py3_clang5_asan_test2`. Resubmitting to figure out why, and hopefully reland.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53279

Reviewed By: mrshenli

Differential Revision: D26916701

Pulled By: zhaojuanmao

fbshipit-source-id: 75c74c8ad8ad24154eb59eddb2b222da0a09897e
2021-03-10 07:56:20 -08:00
Rohan Varma
14fa47631b [DDP Logging] Log comm. hook in ddp logging (#52966)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52966

Logs the registered comm hook if there is one, else logs
"builtin_allreduce"
ghstack-source-id: 123174803

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D26709388

fbshipit-source-id: 484fdbbd6643ec261b3797bd8d9824b2b6a1a490
2021-03-05 11:23:26 -08:00
Rohan Varma
5d9b7bee1a [DDP Logging] Log nccl_async_error_handling (#52965)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52965

Logs nccl async error handling in ddp logger
ghstack-source-id: 123171876

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D26709030

fbshipit-source-id: 530456a5005b8e4956d7fb023986e9b948ebe1a8
2021-03-05 11:23:22 -08:00
Rohan Varma
bdbfc2582d [Dist Debugality] Log key DDP metrics to stderr under debug mode. (#52957)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52957

This diff:
1. Under TORCH_DISTRIBUTED_DEBUG=INFO or DETAIL, logs DDP information during init time (all stats in ddp_logging_data_)
2. Under TORCH_DISTRIBUTED_DEBUG=DETAIL, logs runtime stats when they are collected (the first 10 iterations and then once every 100 iterations). We avoid logging every iteration so as not to spam the logs.

Verified by inspecting logs:

```
I0226 19:12:47.109243 2818475 logger.cpp:140] [Rank 1]: DDP Initialized with:
world_size: 2 module_name: Linear device_ids: 1 output_device: 1 backend_name: nccl parameter_dtype: float total_parameter_size_in_bytes: 40 num_parameter_tensors: 2 bucket_sizes: 40 CUDA_VISIBLE_DEVICES: N/A broadcast_buffers: 1 bucket_cap_mb: 25 find_unused_parameters: 0 gradient_as_bucket_view: 0
 Backend Info: nccl_socket_ifname: N/A nccl_blocking_wait: N/A nccl_debug: WARN nccl_nthreads: N/A nccl_ib_timeout: N/A
I0226 19:12:47.109252 2818473 logger.cpp:140] [Rank 0]: DDP Initialized with:
world_size: 2 module_name: Linear device_ids: 0 output_device: 0 backend_name: nccl parameter_dtype: float total_parameter_size_in_bytes: 40 num_parameter_tensors: 2 bucket_sizes: 40 CUDA_VISIBLE_DEVICES: N/A broadcast_buffers: 1 bucket_cap_mb: 25 find_unused_parameters: 0 gradient_as_bucket_view: 0
 Backend Info: nccl_socket_ifname: N/A nccl_blocking_wait: N/A nccl_debug: WARN nccl_nthreads: N/A nccl_ib_timeout: N/A
```

```
I0226 19:12:48.117936 2818473 logger.cpp:286] [Rank 0 / 2] Training Linear unused_parameter_size=0
 Avg forward compute time: 568944
 Avg backward compute time: 885504
Avg backward comm. time: 692496
 Avg backward comm/comp overlap time: 113536
I0226 19:12:48.118517 2818475 logger.cpp:286] [Rank 1 / 2] Training Linear unused_parameter_size=0
 Avg forward compute time: 565584
 Avg backward compute time: 876992
Avg backward comm. time: 201872
 Avg backward comm/comp overlap time: 128624
```
ghstack-source-id: 123171875

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D26708184

fbshipit-source-id: 16defd5610d28bc4cf3fc2a0cc564e84efcfa791
2021-03-05 11:23:18 -08:00
Rohan Varma
68134374cb Refactor/fix DDP model check during init (#52887)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52887

This diff changes the way to do model consistency check (i.e. `_verify_replicas_across_processes`) in DDP.

There were a few things that could be improved with the way we verify model across processes in DDP initialization:

1. We should do this check before syncing module states in DDP init; otherwise, with the Gloo backend the sync itself will throw, whereas we would like to throw the error corresponding to different models on different ranks. To do this, we move the methods to be standalone C++ functions (not part of the reducer) and move this check to before synchronizing parameters.
2. Refactor DDP init in the following ways:
- Run model consistency check before creating reducer
- add helper functions to build params to pass into reducer
- add helper function to call `_verify_model_across_ranks`
- move `def parameters` to a helper function `_get_parameters` to be used more broadly within DDP

In follow up changes we will add the ability to detect which rank had inconsistent model (https://github.com/pytorch/pytorch/issues/52876 would be useful for this to determine which ranks(s) had errors).
ghstack-source-id: 123171877

Test Plan:
CI/unittest
buck test mode/dev-nosan //caffe2/test/distributed:c10d
BACKEND="nccl" WORLD_SIZE="2" ~/fbcode/buck-out/dev/gen/caffe2/test/distributed/distributed_nccl_fork#binary.par -r test_ddp_model_diff_across_ranks

Reviewed By: zhaojuanmao

Differential Revision: D26565290

fbshipit-source-id: f0e1709585b53730e86915e768448f5b8817a608
2021-03-05 11:21:45 -08:00
Mike Ruberry
30a8a13a7d Revert D26625807: [pytorch][PR] Deduplicate shared params before constructing Reducer in DDP
Test Plan: revert-hammer

Differential Revision:
D26625807 (5c15a5bb46)

Original commit changeset: f5f5959fef90

fbshipit-source-id: c875cc86b8fd21d9d64f934559f8e3126ed1d23d
2021-03-03 20:05:47 -08:00
Yi Wang
510c03d922 [Gradient Compression] Remove some low-level methods of GradBucket class (#53098)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53098

Remove some low-level methods that are no longer needed since `get_per_parameter_tensors` method is added to `GradBucket` class.

Avoid unnecessary exposure to the internals before publishing GradBucket APIs.
ghstack-source-id: 122979064

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

Reviewed By: osalpekar

Differential Revision: D26784249

fbshipit-source-id: d1b27bb026989c25a5b65be4767cb752afd6f19b
2021-03-03 12:06:14 -08:00
Michael Carilli
5c15a5bb46 Deduplicate shared params before constructing Reducer in DDP (#51929)
Summary:
Currently, `torch.nn.parallel.DistributedDataParallel(model...)` doesn't deduplicate params shared across `model`'s child Modules before calling Reducer with the param list. This can cause Reducer to register more than one hook on the shared param(s), at which point who knows what happens.

We ran into this in mlperf BERT, which has at least one param shared across submodules (an embedding weight iirc, not 100% sure). Running with `gradient_as_bucket_view = False` produced different numerics from running with `gradient_as_bucket_view = True` (which i guess is one potential consequence of multiple DDP hooks on a given param, not sure why, i'd have to dig further).

This PR changes DDP to deduplicate shared params (a small diff), and adds some tests (right now just `test_ddp_weight_sharing`, but I'll add more). `test_ddp_weight_sharing` fails with bad numerics on current master (proving the shared param issue is real) and passes with the deduplication diff.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51929

Reviewed By: zou3519

Differential Revision: D26625807

Pulled By: zhaojuanmao

fbshipit-source-id: f5f5959fef90dfe2c55812d79fa88b877f22ecc3
2021-03-03 10:13:24 -08:00
Omkar Salpekar
593b0fbade Revert D26720919: [Gradient Compression] Remove some low-level methods of GradBucket class
Test Plan: revert-hammer

Differential Revision:
D26720919 (521e1e83ea)

Original commit changeset: 46fb64230087

fbshipit-source-id: e2b68892d1735b7249b4d36f3dff57160c9cbc78
2021-03-02 16:18:39 -08:00
Yi Wang
521e1e83ea [Gradient Compression] Remove some low-level methods of GradBucket class (#53098)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53098

Remove some low-level methods that are no longer needed since `get_per_parameter_tensors` method is added to `GradBucket` class.

Avoid unnecessary exposure to the internals before publishing GradBucket APIs.
ghstack-source-id: 122723683

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

Reviewed By: rohan-varma

Differential Revision: D26720919

fbshipit-source-id: 46fb6423008792e72d7a1dd68930a31e0724c92c
2021-03-02 14:39:19 -08:00
Yi Wang
b05dd931ee [Gradient Compression] Add is_the_last_bucket_to_allreduce method to GradBucket class (#53010)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53010

To determine the boundary between different iterations in a DDP communication hook, the user code currently needs to check `bucket.get_index() == 0`, which involves internal bucketization implementation details and undermines the usability of the DDP communication hook.

Create an API to hide the details and improve the usability before publishing GradBucket APIs.
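
A toy sketch of the boundary detection this method hides, using a stand-in bucket class; per the summary above, the bucket with index 0 is assumed to be the last one all-reduced:

```python
class FakeBucket:
    """Stand-in for the real GradBucket, only to illustrate the new method."""
    def __init__(self, index):
        self._index = index
    def get_index(self):
        return self._index
    def is_the_last_bucket_to_allreduce(self):
        # Mirrors the old user-side check `bucket.get_index() == 0`.
        return self._index == 0

iteration = 0
for idx in reversed(range(3)):  # buckets processed during one backward pass
    bucket = FakeBucket(idx)
    if bucket.is_the_last_bucket_to_allreduce():
        iteration += 1          # iteration boundary detected inside the hook
print(iteration)  # 1
```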
ghstack-source-id: 122723081

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

Reviewed By: rohan-varma

Differential Revision: D26720813

fbshipit-source-id: f4a3147382c1f970534d7f0dee0cd599156c8b8c
2021-03-02 14:39:12 -08:00
Yi Wang
4997c38a15 [Gradient Compression] Don't provide default values in GradBucket constructor (#53102)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53102

In `GradBucket` constructor, `offsets`, `lengths`, and `sizes_vec` are optional arguments and could possibly be empty. It will be safe to remove the default values.
ghstack-source-id: 122833603

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D26748199

fbshipit-source-id: 2e3bcd1b732851919a64bbbd20fe85e77a616fe3
2021-03-02 14:39:07 -08:00
Yi Wang
ecb5ac90ed [Gradient Compression] Add get_per_parameter_tensors method to GradBucket class (#53009)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53009

Applying layer-wise operations over per-parameter tensors is a common pattern in a DDP communication hook.

Create a util method in GradBucket class before publishing GradBucket APIs.
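
A small illustrative sketch of such a layer-wise operation; inside a real comm hook the list would come from `bucket.get_per_parameter_tensors()`, and the surrounding hook/allreduce plumbing is omitted here:

```python
import torch

def clamp_each_parameter(per_param_grads, max_val=1.0):
    # Example layer-wise op: elementwise clamp of each parameter's gradient slice.
    for grad in per_param_grads:
        grad.clamp_(-max_val, max_val)
    return per_param_grads

grads = [torch.randn(4), torch.randn(2, 3)]
clamp_each_parameter(grads)
```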
ghstack-source-id: 122833594

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

f254364097

Reviewed By: rohan-varma

Differential Revision: D26717893

fbshipit-source-id: 916db319de8b85dd22bc4e35db5671bf4e34740f
2021-03-02 14:39:03 -08:00
Rohan Varma
b3bf08e67f Log nccl debug level in ProcessGroupNCCL (#52803)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52803

This is useful for double checking we have the expected nccl_debug
level when debugging problematic jobs.

New logs:

When default is warn:
```
NCCL_ASYNC_ERROR_HANDLING: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 60000
USE_HIGH_PRIORITY_STREAM: 0
NCCL_DEBUG: WARN
```

off:

```
NCCL_ASYNC_ERROR_HANDLING: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
NCCL_DEBUG: N/A
```
ghstack-source-id: 122751110

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D26653699

fbshipit-source-id: 845cc1236f3838f4763c6dcf2a30d059b3d44f02
2021-03-01 14:57:22 -08:00
Rohan Varma
a3cd881890 Fix grammar in reducer warning (#52835)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52835

Addresses comment in https://github.com/pytorch/pytorch/pull/52385
that was missed before landing the PR
ghstack-source-id: 122543534

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D26660764

fbshipit-source-id: 3edfebed56f382c1414ba9eb65a753ced7e34154
2021-02-25 22:29:52 -08:00
Can Balioglu
94da8b9816 Fix resource leak bug in TCPStore constructor (#52860)
Summary:
This PR fixes a resource leakage bug in the constructor of `TCPStore` where an exception thrown in `TCPStoreDaemon` or `tcputil::connect()` can leave the server socket dangling. The ideal long-term solution would be to have a RAII wrapper for TCP sockets returned by `tcputil`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52860

Reviewed By: osalpekar

Differential Revision: D26671775

Pulled By: cbalioglu

fbshipit-source-id: ccebbd7533ac601a4b80e6e759f2fb4fe01c70fa
2021-02-25 15:32:38 -08:00
Can Balioglu
a11b601100 Expose Store's timeout and TCPStore's host and port in Python API (#52784)
Summary:
This PR introduces the `timeout` accessor to `Store` and `host`, `port` accessors to `TCPStore` to help testing and troubleshooting higher level APIs.
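
A minimal sketch of the new accessors; the attribute spellings (`host`, `port`, `timeout`) are taken from this summary and the port value is illustrative:

```python
import torch.distributed as dist

store = dist.TCPStore("127.0.0.1", 29500, is_master=True)
print(store.host)     # "127.0.0.1"
print(store.port)     # 29500
print(store.timeout)  # the store's timeout as a datetime.timedelta
```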

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52784

Reviewed By: anjali411

Differential Revision: D26648202

Pulled By: cbalioglu

fbshipit-source-id: 9cf23bf998ed330d648dfec2a93e1bbb50817292
2021-02-25 11:05:15 -08:00
Richard Barnes
373a20ad4a Modernize for-loops in caffe2/torch (#52618)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52618

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50797

Modernize for-loops throughout caffe2/ subdirs to use ranged-loops where possible (all `.cpp` files were examined).

```
find caffe2/ -iname "*.cpp" > /home/rbarnes/files
buck run mode/opt foundation/clangr:clangr_local -- -j 10 --file=/home/rbarnes/files --multi --apply-replacements=true tidy '--checks=-*,modernize-loop-convert'
```

Test Plan: Sandcastle tests

Reviewed By: suo

Differential Revision: D26585065

fbshipit-source-id: 439b9f9ce7c54fa9b4b80161f6bb27ebe8a35967
2021-02-24 18:17:46 -08:00
Can Balioglu
3489b4a7b8 Fix the ordering of TCPStore's compare_set parameters (#52696)
Summary:
- Fixes the ordering of the value parameters of TCPStore's `compare_set()` in the pybind11 interop layer. The C++ API expects (old, new) while we are passing (new, old) in Python.
- Fixes the implementation of TCPStore's `compareSetHandler()` for cases where the key already exists in the store.
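
A minimal sketch of the corrected Python-side semantics; the store construction follows the pattern used elsewhere in this log and the port is illustrative:

```python
import torch.distributed as dist

store = dist.TCPStore("127.0.0.1", 29500, is_master=True)
store.set("key", "old")
# Argument order after the fix: compare_set(key, expected_value, desired_value).
store.compare_set("key", "old", "new")      # expected matches -> value becomes "new"
store.compare_set("key", "stale", "newer")  # mismatch -> value stays "new"
print(store.get("key"))                     # b"new"
```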

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52696

Test Plan: `python test/distributed/test_c10d.py`

Reviewed By: malfet, H-Huang

Differential Revision: D26616976

Pulled By: cbalioglu

fbshipit-source-id: e6a70542e837be04697b5850947924edd896dbf6
2021-02-24 06:59:03 -08:00
Richard Barnes
2eb9c0832e Modernize for-loops in torch misc (#52452)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52452

Test Plan: Sandcastle

Reviewed By: pritamdamania87

Differential Revision: D26520760

fbshipit-source-id: c13161324f24f553ad679308d0dc279ab178e129
2021-02-22 13:37:19 -08:00
Rohan Varma
ef8d17e112 [DDP] Separate error messages for unused params in forward and not all outputs (#52391)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52391

There are 2 ways DDP can throw the exception refactored here -
1) Unused params in the forward pass. We provide `find_unused_parameters=True` for this.
2) All params used in fwd pass, but not all outputs used in loss computation. There are a few workarounds for this but we do not provide native support.

Previously, these 2 issues were combined into 1 error message but that has historically resulted in confusion, with users reporting getting this error even when they enable `find_unused_parameters=True` (which they expect to fix this error). As a result there is additional churn to debug these issues because the true cause (1) vs (2) is not known.

This commit helps to fix the issue by separating out the 2 error messages depending on if we ran with unused parameter detection or not. Hopefully this should make the error message much more clear and actionable.

error msg with `find_unused_params=True`:
```
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. Since `find_unused_parameters=True` is enabled, this likely  means that not all `forward` outputs participate in computing loss. You can fix this by making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
```
error msg without `find_unused_params` specified:
```
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
```
ghstack-source-id: 122097900

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D26496688

fbshipit-source-id: 4a9eeeda10293da13d94a692d10cb954e4506d7c
2021-02-19 17:09:22 -08:00
Howard Huang
bc6852c192 Change TCPStore world_size and is_master to be optional (#51809)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51809

Changes to TCPStore which will make world_size and is_master optional parameters for initialization.

API before change:
```python
# arguments: host_name, port, world_size, is_master, timeout=300s
server_store = dist.TCPStore("127.0.0.1", 0, 2, True)
client_store = dist.TCPStore("127.0.0.1", 0, 2, False)
```

API after change:
```python
# arguments: host_name, port, world_size=-1, is_master=False, timeout=300s
server_store = dist.TCPStore("127.0.0.1", 0, is_master=True)
client_store = dist.TCPStore("127.0.0.1", 0)
```

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D26461770

Pulled By: H-Huang

fbshipit-source-id: 5b2157029c73e8706e158cd49ecce60c9f3a7f41
2021-02-19 09:56:51 -08:00
Yanli Zhao
d0795ab358 log newly added construction and runtime stats at randomly selected iterations (#51394)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51394

log newly added construction and runtime stats at randomly selected iterations
ghstack-source-id: 121934040

Test Plan: unit tests

Reviewed By: SciPioneer

Differential Revision: D26161885

fbshipit-source-id: add6e02c1a03e6f74f08b9a9aecf90fa81631d60
2021-02-19 00:15:04 -08:00
Yanli Zhao
c75fa39b6c add stats that can only be collected at runtime (#51386)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51386

add stats such as rebuilt bucket stats, unused parameter stats and performance stats to ddp logging data

1. GPU time stats are not collected for single process multiple devices in this diff, as that requires events to be created and recorded on multiple devices
2. use the at::cuda::event API for safer calls
3. events may not be created in the autograd hook if the hook is not triggered in the user's code, e.g., the user runs in non-sync mode in some iterations. So we check whether events were created before synchronizing, and skip invalid results.
4. users may not set the device upfront, so we explicitly set the proper device before creating events in our prepare_forward() and prepare_backward() calls

ghstack-source-id: 121933566

Test Plan: unit tests

Reviewed By: SciPioneer

Differential Revision: D26158645

fbshipit-source-id: ce5f15187802eba76accb980449be68902c10178
2021-02-19 00:13:11 -08:00
Rohan Varma
0c46b6b3f6 [DDP] Enhance warning for find_unused_params (#52385)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52385

This warning should specify that we did not find unused params in the
_forward_ pass, which is when we log this warning. This is to avoid confusion
when we get an error because not all outputs were used to compute loss, which
also raises an error about unused parameters (to be fixed in the next diff)
ghstack-source-id: 122001929

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D26494136

fbshipit-source-id: d9b41732ea7e5e31b899d590d311080e3dc56682
2021-02-18 23:36:08 -08:00
Rohan Varma
6dabe0b291 [Dist Profiling] Enable dist profiling for DDP (gloo only) (#52031)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52031

Closes https://github.com/pytorch/pytorch/issues/52020
Ensures that we can profile collectives in DDP by propagating the profiler threadLocalState appropriately. As described in the above issue, before this wouldn't work as the profiler would only be enabled on the main thread.
ghstack-source-id: 121818080

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D26356192

fbshipit-source-id: 0158b5833a3f857a0b4b2943ae3037e9d998dfd1
2021-02-17 12:21:37 -08:00
Rohan Varma
7b21c6be67 [Dist Profiling] Enable profiling for gloo send/recv (#52004)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52004

Enables profiling of p2p collectives for Gloo. Modified/added relevant unittests.
ghstack-source-id: 121507511

Test Plan: CI

Reviewed By: mrzzd

Differential Revision: D26347164

fbshipit-source-id: f4d1c474fccf40d5776fc13c4add7a053ea08960
2021-02-12 13:46:51 -08:00
Rohan Varma
4c93a79a04 [Dist Profiling] Support shape recording for profiling collectives (#51822)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51822

Adds support for shape recording for profiling distributed collectives, for nccl/gloo backends. Added
both cpp and python tests to ensure that shapes are recorded properly. Note that we don't add `ProcessGroupNCCLTest`s since they need to be modified to support single process per device and > 1 world size.
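
A minimal sketch of recording collective input shapes, assuming a gloo or nccl process group is already initialized:

```python
import torch
import torch.distributed as dist
from torch.autograd import profiler

tensor = torch.ones(4, 16)
with profiler.profile(record_shapes=True) as prof:
    dist.all_reduce(tensor)
# Group by input shape so the collective's tensor shapes show up in the table.
print(prof.key_averages(group_by_input_shape=True).table(sort_by="cpu_time_total"))
```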
ghstack-source-id: 121507509

Test Plan: CI

Reviewed By: mrzzd

Differential Revision: D26291739

fbshipit-source-id: 5f7bd54d8c36d17a4a29e172b25266ca3dbd8fbd
2021-02-11 12:42:26 -08:00
Richard Barnes
fa325d7c9f Use sum_integers and multiply_integers (#51146)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51146

Test Plan: Sandcastle tests

Reviewed By: ngimel

Differential Revision: D25903430

fbshipit-source-id: 329c14018c9e5192864eed88a8ed0a5068ff1c69
2021-02-10 18:05:45 -08:00
Yanli Zhao
18e0a61388 add more logging fields that can be set in construction time (#51260)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51260

add more logging fields to DDPLoggingData, including param stats, bucket stats, environment variables, nccl version, data type
ghstack-source-id: 121260224

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D26118245

fbshipit-source-id: ba48b7a11340bda1f5f3b24c8603545d346361e9
2021-02-09 21:58:58 -08:00
Howard Huang
97e35858ec [Resubmit] Add compare_set operation and test to TCPStore (#51815)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51815

This is resubmission of #51593, already approved.

Test Plan: Imported from OSS

Reviewed By: izdeby

Differential Revision: D26316875

Pulled By: H-Huang

fbshipit-source-id: d81cb131ef6b9e2ebaee32bb505dfc11235bc29d
2021-02-08 13:44:31 -08:00
Yi Wang
5a962369e2 [Gradient Compression] Check if the backend is NCCL when a DDP communication hook is registered (#51759)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51759

Some unit tests actually register a comm hook on other backends like GLOO. Example: `test_ddp_comm_hook_future_passing_cpu`

Therefore, only do the check on `register_builtin_comm_hook`.

Currently DDP communication hook can only be supported on NCCL. Add a check in the registration methods.
ghstack-source-id: 121115814

Test Plan: unit tests.

Reviewed By: pritamdamania87

Differential Revision: D26268581

fbshipit-source-id: c739fa4dca6d320202dc6689d790c2761c834c30
2021-02-05 09:59:12 -08:00
Howard Huang
62aea33d7f Revert D26237328: Add compare_set operation and test to TCPStore
Test Plan: revert-hammer

Differential Revision:
D26237328 (7d00aec6bc)

Original commit changeset: c6837a4cc34f

fbshipit-source-id: 662f8067ead9bce0da13b35d393fb781635dd2b9
2021-02-04 13:43:05 -08:00
Howard Huang
7d00aec6bc Add compare_set operation and test to TCPStore (#51593)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51593

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D26237328

Pulled By: H-Huang

fbshipit-source-id: c6837a4cc34f8247df6e1c29c1f40fd9e7953313
2021-02-04 10:36:58 -08:00
Omkar Salpekar
3361d365bd [Gloo] Use TORCH_CHECK for ensuring tag is nonnegative (#51370)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51370

TORCH_CHECK should be used when confirming the correctness of function
arguments like the tag passed to Gloo functions.
ghstack-source-id: 120908449

Test Plan: Sandcastle/CI

Reviewed By: mingzhe09088

Differential Revision: D26152359

fbshipit-source-id: ddffaa6f11393aaedaf0870759dc526d8d4530ee
2021-02-03 11:48:20 -08:00
Yanli Zhao
e54cbb8250 Create PyTorch DDP logging APIs for applications to use (#50637)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50637

add APIs so that applications can log PyTorch DDP logging data.

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D25933411

fbshipit-source-id: 57c248a2f002da06a386fc7406d3e5533ebb9124
2021-02-02 18:24:21 -08:00
Yanli Zhao
d5541c50a3 add a c++ interface in processGroup to get its backend name (#51066)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51066

The backend name of a process group created using the distributed_c10d Python API is tracked, but there is no good way to track the name of a process group created using the ProcessGroup C++ API. In some cases, knowing the backend name of a ProcessGroup is useful, e.g., to log the backend name, or to write code that depends on the known backend.
ghstack-source-id: 120628432

Test Plan: unit tests

Reviewed By: pritamdamania87

Differential Revision: D26059769

fbshipit-source-id: 6584c6695c5c3570137dc98c16e06cbe4b7f5503
2021-01-29 17:28:42 -08:00
Yanli Zhao
250c71121b Create a DDPLoggingData and expose it to python interface (#50622)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50622

1. Define a DDPLoggingData struct that is the placeholder for all the ddp related logging fields
2. Put the DDPLoggingData struct in the C10 directory so that it can be easily imported by c10 and torch files
3. Expose the get_ddp_logging_data() method in Python so that users can get the logging data and dump it in their applications (see the sketch after this list)
4. Unit tests verify the logging data can be set and retrieved as expected
5. Follow-up will add more logging fields such as perf stats, internal states, env variables, etc.
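
The sketch referenced in point 3, assuming the method hangs off the DDP module and that `net` and `rank` already exist:

```python
from torch.nn.parallel import DistributedDataParallel as DDP

model = DDP(net.to(rank), device_ids=[rank])
ddp_logging_data = model.get_ddp_logging_data()  # method name taken from point 3 above
print(ddp_logging_data)
```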
ghstack-source-id: 120275870

Test Plan: unit tests

Reviewed By: SciPioneer

Differential Revision: D25930527

fbshipit-source-id: 290c200161019c58e28eed9a5a2a7a8153113f99
2021-01-25 15:23:07 -08:00
Xiang Gao
44922f26f5 Add support for NCCL alltoall (#44374)
Summary:
In https://github.com/pytorch/pytorch/issues/42514, NCCL `alltoall_single` is already added. This PR adds NCCL `alltoall`.

The difference between `alltoall_single` and `alltoall` is: `alltoall_single` works on a single tensor and sends/receives slices of that tensor, while `alltoall` works on a list of tensors and sends/receives the tensors in that list.
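
A minimal sketch of the list-based collective, assuming an NCCL process group is already initialized:

```python
import torch
import torch.distributed as dist

rank = dist.get_rank()
world_size = dist.get_world_size()
device = torch.device("cuda", rank)
# inputs[j] is sent to rank j; outputs[j] is received from rank j.
inputs = [torch.full((2,), float(rank), device=device) for _ in range(world_size)]
outputs = [torch.empty(2, device=device) for _ in range(world_size)]
dist.all_to_all(outputs, inputs)
```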

cc: ptrblck ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44374

Reviewed By: zhangguanheng66, mrshenli

Differential Revision: D24455427

Pulled By: srinivas212

fbshipit-source-id: 42fdebdd14f8340098e2c34ef645bd40603552b1
2021-01-20 14:57:12 -08:00
Pritam Damania
4e248eb3f6 Change watchdog timeout logging from INFO to ERROR. (#50455)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50455

Certain systems only print logging messages for ERROR/WARN, and the message
saying that the watchdog is timing out a particular operation is pretty
important.

As a result, changing its level to ERROR instead of INFO.
ghstack-source-id: 119761029

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D25894795

fbshipit-source-id: 259b16c13f6cdf9cb1956602d15784b92aa53f17
2021-01-12 20:15:39 -08:00
Rohan Varma
78e71ce627 warn user once for possible unnecessary find_unused_params (#50133)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50133

`find_unused_parameters=True` is only needed when the model has unused parameters that are not known at model definition time or differ due to control flow.

Unfortunately, many DDP users pass this flag in as `True` even when they do not need it, sometimes as a precaution to mitigate possible errors that may be raised (such as the error we raise when not all outputs are used). While this is a larger issue to be fixed in DDP, it would also be useful to warn once if we did not detect unused parameters.

The downside of this is that, in the case of control-flow models where the first iteration doesn't have unused params but later iterations do, this would be a false warning. However, I think the warning's value exceeds this downside.
ghstack-source-id: 119707101

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D25411118

fbshipit-source-id: 9f4a18ad8f45e364eae79b575cb1a9eaea45a86c
2021-01-12 02:55:06 -08:00
Rohan Varma
294b7867eb Address clang-tidy warnings in ProcessGroupNCCL (#50131)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50131

Noticed that in the internal diff for
https://github.com/pytorch/pytorch/pull/49069 there was a clang-tidy warning to
use emplace instead of push_back. This can save us a copy, since emplace
constructs the element in place and avoids an unnecessary temporary
ghstack-source-id: 119560979

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D25800134

fbshipit-source-id: 243e57318f5d6e43de524d4e5409893febe6164c
2021-01-07 21:29:28 -08:00
Jagadish Krishnamoorthy
c115957df0 [distributed] Provide parameter to pass GPU ID in barrier function (#49069)
Summary:
For a multi-GPU node, the rank and corresponding GPU mapping can be different.
Provide an optional parameter to specify the GPU device number for the
allreduce operation in the barrier function.
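
A minimal usage sketch, assuming the parameter is exposed in Python as `device_ids` and that `gpu_id` is the GPU assigned to this rank:

```python
import torch.distributed as dist

# An NCCL process group is assumed to be initialized already.
dist.barrier(device_ids=[gpu_id])  # gpu_id is an assumed, caller-provided variable
```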

Add test cases to validate barrier device_ids.

Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>

Fixes https://github.com/pytorch/pytorch/issues/48110

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49069

Reviewed By: mrshenli

Differential Revision: D25658528

Pulled By: rohan-varma

fbshipit-source-id: 418198b6224c8c1fd95993b80c072a8ff8f02eec
2021-01-05 11:27:54 -08:00
Omkar Salpekar
31fcbbdf35 [FileStore] Implemented numKeys and Added Tests (#49556)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49556

Implemented the missing Store functionality (specifically numKeys) in the FileStore.
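
A minimal sketch, assuming `num_keys()` is the Python spelling of numKeys and using an illustrative file path:

```python
import torch.distributed as dist

store = dist.FileStore("/tmp/example_filestore", 2)  # path is illustrative
store.set("first_key", "1")
store.set("second_key", "2")
print(store.num_keys())
```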

Test Plan: Added both C++ and Python tests to verify functionality.

Reviewed By: jiayisuse

Differential Revision: D25619001

fbshipit-source-id: 9146d0da9e0903622be3035880f619bbb2cc3891
2020-12-17 14:54:24 -08:00
Luca Wehrstedt
9234f5026d Make WorkNCCL use CUDAEvent::query() rather than re-implement it (#49343)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49343

at::cuda::CUDAEvent is "lazy" and only creates an event when it's first recorded. Until then, at::cuda::CUDAEvent is empty. If we use at::cuda::CUDAEvent::query() this is taken into account (an empty event is always ready), but WorkNCCL extracts the raw cudaEvent_t value from at::cuda::CUDAEvent and calls cudaEventQuery manually and doesn't check this. This could cause a failure.

It's unclear if this is ever supposed to happen, but we're seeing that failure, and we want to sort it out in order to see if there's something "deeper" going on.
ghstack-source-id: 118532806

Test Plan: Unit tests

Reviewed By: SciPioneer

Differential Revision: D25537844

fbshipit-source-id: 506319f4742e1c0a02aa75ecc01112ea3be42d8f
2020-12-15 03:15:48 -08:00
Luca Wehrstedt
f204f77e6d Drop FutureNCCL in favor of vanilla CUDAFuture (#49014)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49014

We extracted a generic and reusable CUDAFuture class from FutureNCCL, but we had left FutureNCCL around, as a subclass of CUDAFuture, in order to deal with some peculiarity of ProcessGroupNCCL, namely that the future would be completed right away when constructed and that its CUDA events would be _shared_ with the ones of the WorkNCCL. This required some "hacks" in CUDAFuture itself (protected members, fields wrapped in shared_ptrs, ...).

My understanding is that creating CUDA events is a rather cheap operation. That would mean that we could afford to record _twice_ the events after each NCCL call, once for the WorkNCCL and once for the future. By doing so, we can use the CUDAFuture class directly and revert all its hacks.
ghstack-source-id: 118391217

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D25355272

fbshipit-source-id: 3a2a0891724928221ff0f08600675d2f5990e674
2020-12-11 09:25:05 -08:00
Luca Wehrstedt
5ab90b2fda Make CUDAFuture remember and restore current device in callback (#48789)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48789

CUDAFuture aims to "capture" the current state of CUDA-related stuff when the future is marked complete (e.g., by looking at current streams and recording events on them) and then "replicate" a similar state when users synchronize with the result of the future (by synchronizing the current streams with these events).

However, one "contextual" aspect of CUDA that we weren't capturing/replicating was the current device. This diff tries to fix that. I must mention that we can only do this for callbacks, while we cannot do it for the wait() method. I don't know if such a discrepancy between the two actually makes the overall behavior _worse_. I'd love to hear people's opinions on this.
ghstack-source-id: 118081338

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D25210335

fbshipit-source-id: 1d1a3f80b1cc42e5114bc88554ed50617f1aaa90
2020-12-11 03:35:53 -08:00
Rohan Varma
696e30af6e Fix ProcessGroupNCCL profiling when profiler is not run with use_cuda (#48946)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48946

Move recordFunctionEndCallback to after the blocking portion of launching the NCCL kernel, and remove addCallback since it runs the lambda inline anyway and triggers unnecessary CUDA stream logic. If we want CUDA operations such as NCCL kernels accurately profiled, we should use the profiler with use_cuda=True. However, we are currently debugging a deadlock for the use_cuda=True case; the fix is being tracked in #48987.

To ensure that the tests are no longer flaky, submitted this PR to ci-all: #48947 and ran the test a bunch of times ssh'd into the CI machine.

ghstack-source-id: 118330130

Test Plan: Ci

Reviewed By: mrzzd

Differential Revision: D25368322

fbshipit-source-id: 7d17036248a3dcd855e58addc383bba64d6bc391
2020-12-10 21:09:41 -08:00
Yixin Bao
840e71f4e6 Check CUDA kernel launches (/fbcode/caffe2/) (#49145)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49145

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49105

(1) Add a safety check `C10_CUDA_KERNEL_LAUNCH_CHECK()` after each kernel launch. This diff only changes the files inside the directory /fbsource/fbcode/caffe2/modules/, /fbsource/fbcode/caffe2/fb/, /fbsource/fbcode/caffe2/test/.

(2) Get rid of old check `AT_CUDA_CHECK(cudaGetLastError())` when necessary.

Test Plan:
Test build:
```
buck build mode/dev-nosan //caffe2/modules/detectron:
buck test mode/dev-nosan //caffe2/modules/detectron:
buck build mode/dev-nosan //caffe2/torch/fb/:
buck test mode/dev-nosan //caffe2/torch/fb/:
```

To check for launches without checks:
```
python3 caffe2/torch/testing/check_kernel_launches.py
```
Make sure none of the updated files are in the returned list.

Reviewed By: r-barnes

Differential Revision: D25452852

fbshipit-source-id: d6657edab612c9e0fa99b29c68460be8b1a20064
2020-12-10 10:43:03 -08:00
Luca Wehrstedt
b5a7e25059 Cache the DataPtrs in CUDAFuture (#48788)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48788

CUDAFuture needs to inspect the value it contains in order to first determine what devices its tensors reside on (so that it can record events on those devices), and then to record these tensors with the caching allocator when they are used in other streams. Extracting data ptrs can become somewhat expensive (especially if we resort to using the pickler to do that), hence it's probably a good idea to cache the result the first time we compute it.
ghstack-source-id: 118180023

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D25303486

fbshipit-source-id: 5c541640f6d19249dfb5489ba5e8fad2502836fb
2020-12-10 03:54:29 -08:00
Luca Wehrstedt
030fa6cfba Split out reusable CUDAFuture from FutureNCCL (#48506)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48506

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).

 ---

FutureNCCL is now a general-purpose type-agnostic multi-device class, so in this commit I extract it from ProcessGroupNCCL to make it available for wider use (notably by the RPC module). We'll call this new class CUDAFuture. We'll keep FutureNCCL as a subclass of CUDAFuture to deal with some NCCL peculiarity, namely the fact that the future becomes complete immediately upon creation. We can clean this up for good once we're done merging Future and Work.

I'm not exactly sure of where to put CUDAFuture. It needs to be available to both c10d and RPC (which lives under torch/csrc). If I figured CMake out correctly (and that's a big if) I think c10d can only depend on ATen (I'll maybe add a comment with how I tracked that down). Hence we cannot put CUDAFuture in torch/csrc. On the other hand, RPC currently depends on c10d, because RPC agents use ProcessGroups internally, so it would be "ok" to put CUDAFuture in c10d. However, we want to get rid of ProcessGroups in RPC, and at that point RPC should in principle not depend on c10d. In that case, the only shared dep between the two that I see is ATen itself.

While I'm a bit wary of putting it right in ATen, I think it might actually make sense. CUDAFuture is intended to be a general-purpose component that can be reused in all settings and is not particularly tied to c10d or RPC. Moreover, ATen already contains ivalue::Future, and it contains a lot of CUDA helpers, so CUDAFuture definitely belongs to the "closure" of what's already there.
ghstack-source-id: 118180030

Test Plan: Unit tests?

Reviewed By: wanchaol

Differential Revision: D25180532

fbshipit-source-id: 697f655240dbdd3be22a568d5102ab27691f86d4
2020-12-10 03:54:26 -08:00
Luca Wehrstedt
4c425e8da0 Merge common parts of FutureNCCL into at::ivalue::Future (#48505)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48505

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).

 ---

FutureNCCL isn't just adding CUDA support to ivalue::Future, it's also reimplementing a lot of the latter's logic (by overriding plenty of its methods). That's brittle, as whenever a new method is added to ivalue::Future there's a risk of forgetting to add it to FutureNCCL, and in such a case calling this method on FutureNCCL would defer to the base class and give inconsistent results (e.g., future not being completed when it actually is). This _is already happening_, for example with the waitAndThrow or hasError, which are not implemented by FutureNCCL. In addition, this creates duplication between the two classes, which could lead to inconsistencies of behavior, bugs, missing features, ...

The best solution would be to keep the core future logic in ivalue::Future, and have _only_ the CUDA additions in FutureNCCL. That's what we're going to do, in two steps. In the previous commit, I split the CUDA features into separate hooks, which are called by FutureNCCL's other methods. In this commit, I'm removing these latter methods, and invoke the hooks directly from ivalue::Future.
ghstack-source-id: 118180032

Test Plan: Unit tests

Reviewed By: wanchaol

Differential Revision: D25180535

fbshipit-source-id: 19181fe133152044eb677062a9e31e5e4ad3c03c
2020-12-10 03:54:22 -08:00
Luca Wehrstedt
9078088edb Split FutureNCCL's CUDA-specific parts from generic future logic (#48504)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48504

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).

 ---

FutureNCCL isn't just adding CUDA support to ivalue::Future, it's also reimplementing a lot of the latter's logic (by overriding plenty of its methods). That's brittle, as whenever a new method is added to ivalue::Future there's a risk of forgetting to add it to FutureNCCL, and in such a case calling this method on FutureNCCL would defer to the base class and give inconsistent results (e.g., future not being completed when it actually is). This _is already happening_, for example with the waitAndThrow or hasError, which are not implemented by FutureNCCL. In addition, this creates duplication between the two classes, which could lead to inconsistencies of behavior, bugs, missing features, ...

The best solution would be to keep the core future logic in ivalue::Future, and have _only_ the CUDA additions in FutureNCCL. That's what we're going to do, in two steps. In this commit, I'll split the CUDA features into separate hooks, which are called by FutureNCCL's other methods. In the next commit, I'll remove these latter methods, and invoke the hooks directly from ivalue::Future.
ghstack-source-id: 118180025

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D25180534

fbshipit-source-id: 7b3cd374aee78f6c07104daec793c4d248404c61
2020-12-10 03:54:19 -08:00
Luca Wehrstedt
a6778989d1 Support wider range of types in FutureNCCL (#48502)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48502

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).

 ---

FutureNCCL restricted the values to be tensors, or (singleton) lists of tensors, or Python objects that could be converted to either of those types. We need a CUDA future that can handle more generic types, though.

The main challenge is extracting all DataPtrs from an arbitrary object. I think I found some ways of doing so, but I'd like some JIT experts to look into this and tell me if there are better ways. I'll add inline comments for where their input would be appreciated.
ghstack-source-id: 118180026

Test Plan: Unit tests (I should probably add new ones)

Reviewed By: wanchaol

Differential Revision: D25177562

fbshipit-source-id: 1ef18e67bf44543c70abb4ca152f1610dea4e533
2020-12-10 03:54:15 -08:00
Luca Wehrstedt
9fe3ac3650 Don't store device indices separately on FutureNCCL (#48501)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48501

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).

 ---

FutureNCCL stores a set of devices (on which the tensors in the data reside) and a CUDA event for each of those devices. In fact, each event instance also already contains the device it belongs to, which means we can avoid storing that information separately (with the risk that it'll be mismatched and/or inaccurate).
ghstack-source-id: 118180024

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D25177554

fbshipit-source-id: 64667c176efc2a7dafe99457a1fbba5d142cb06c
2020-12-10 03:54:12 -08:00