Commit Graph

259 Commits

Author SHA1 Message Date
mrshenli
2ed3ad2891
fix autodoc for torch.distributed.launch (#40963) (#41089)
Summary:
The doc for `torch.distributed.launch` has been missing since v1.2.0 (see issue https://github.com/pytorch/pytorch/issues/36386) because PR https://github.com/pytorch/pytorch/issues/22501 added some imports on the first line, above the module docstring.
542ac74987/torch/distributed/launch.py (L1-L5)
I moved the imports below the docstring to make autodoc in Sphinx work normally.
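A rough sketch of the fix (not the exact file contents): the module docstring has to be the first statement in the file for Sphinx autodoc to pick it up, so the imports have to follow it.

```
r"""
`torch.distributed.launch` is a module that spawns up multiple distributed
training processes on each of the training nodes.
"""
# Imports must come after the module docstring. With the imports on the first
# line, the string below them is no longer the module's __doc__, so Sphinx
# autodoc found nothing to render.
import sys
import subprocess
```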

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40963

Differential Revision: D22380816

Pulled By: mrshenli

fbshipit-source-id: ee8406785b9a198bbf3fc65e589854379179496f

Co-authored-by: Xin Yao <yaox12@outlook.com>
2020-07-07 14:23:31 -07:00
mrshenli
4316199832
Add examples and tests for combining static/class method with async execution (#40619) (#40688)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40619

Test Plan: Imported from OSS

Differential Revision: D22258407

Pulled By: mrshenli

fbshipit-source-id: 036d85a2affc4505efd2df197fc513dba010e359
2020-06-29 19:34:23 -07:00
mrshenli
0dc93ac119
[v1.6.0 patch] Install method docstrings from PyRRef to RRef (#40620)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40461

It turned out that `:inherited-members:` (see [doc](https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html#directive-autoclass)) is not really usable.

This is because pybind11 generates docstrings that annotate `self` with the parent-class type, `rpc.PyRRef`.

As a workaround, I pull the docstrings from the parent class, `PyRRef`, into the subclass, `RRef`, and do some surgery on the docstrings generated by pybind11.
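A hypothetical Python sketch of the kind of surgery described above (the helper and its details are illustrative, not the actual patch):

```
import inspect

def _copy_docstrings(parent_cls, child_cls):
    # Hypothetical helper: copy method docstrings from the pybind11-generated
    # parent class (PyRRef) onto the user-facing subclass (RRef), rewriting
    # the parent-class name that pybind11 writes into the signatures.
    for name, member in inspect.getmembers(parent_cls, callable):
        doc = getattr(member, "__doc__", None)
        if doc and not name.startswith("_") and hasattr(child_cls, name):
            getattr(child_cls, name).__doc__ = doc.replace("PyRRef", "RRef")
```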

{F241283111}

ghstack-source-id: 106472496

P134031188

Differential Revision: D7933834

fbshipit-source-id: c03a8a4c9d98888b64492a8caba1591595bfe247

Co-authored-by: Shihao Xu <shihaoxu@fb.com>
2020-06-26 12:15:28 -07:00
Rohan Varma
14f7e95c1a Add prefix of remote events for RPC profiling (#40066)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40066

Builds on top of the previous PR to ensure that all remotely profiled events are prefixed with the key for the RPC that generated them.

The key is generated by the result of `_build_rpc_profiling_key` in `rpc/internal.py` and prefixed onto the event name. In order to do this, we set the current-key when creating the RPC in Python, retrieve the currently-set key in C++ and save a GloballyUniqueId -> key mapping to an in-memory map. When we receive an RPC with profiling information, we expect to receive this ID back, and look up the corresponding profiling key in the map.

The key is then added to all the remote events.
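A conceptual sketch of the prefixing described above (the real logic lives in C++; the key shape follows `_build_rpc_profiling_key`, and the separator shown is an assumption):

```
# Conceptual only: the caller records a profiling key for the RPC it creates,
# e.g. "rpc_sync#my_func(worker0 -> worker1)", keyed by the RPC's unique id.
profiling_key = "rpc_sync#my_func(worker0 -> worker1)"  # assumed example key
remote_event_names = ["aten::add", "aten::mul"]

# When the profiled events come back from the callee, each one is renamed with
# the key of the RPC that produced it.
prefixed = ["{}#{}".format(profiling_key, name) for name in remote_event_names]
```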

Tested by adding tests to ensure the key is added to all the remote events. Also added a UT which tests this under a multi-threading scenario, to ensure that the mapping's correctness is maintained when several RPCs are in the process of being created at once.
ghstack-source-id: 106316106

Test Plan: Unit test

Differential Revision: D22040035

fbshipit-source-id: 9215feb06084b294edbfa6e03385e13c1d730c43
2020-06-22 11:01:07 -07:00
Shen Li
314d645e05 Add a warning to mention that async_execution does not work with autograd profiler (#40309)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40309

Test Plan: Imported from OSS

Differential Revision: D22145130

Pulled By: mrshenli

fbshipit-source-id: d6f7250e53648d6939367f1ad4c9b898be00afed
2020-06-19 15:35:00 -07:00
Shen Li
5d0044389a Minor RPC doc improvements (#40305)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40305

Test Plan: Imported from OSS

Differential Revision: D22144304

Pulled By: mrshenli

fbshipit-source-id: 1c8a9648043eabaf909c6e4ae116672396a9f0f5
2020-06-19 15:34:58 -07:00
Shen Li
caf0c286b8 Fix RPC API doc links (#40299)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40299

Test Plan: Imported from OSS

Differential Revision: D22143156

Pulled By: mrshenli

fbshipit-source-id: c11848ebfe8863d59509a0fbc042eed71a58e514
2020-06-19 15:34:53 -07:00
Shen Li
d6d579397d Improve docs for init_rpc (#40298)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40298

Test Plan: Imported from OSS

Differential Revision: D22143155

Pulled By: mrshenli

fbshipit-source-id: deadcc29eda157b401ca6a091c3ba17455acb6b5
2020-06-19 15:34:51 -07:00
Alexander Mols
b7bfdcbe3e [caffe2/torch] Use logger in jit instantiator
Summary:
Previously the module would log some data using `print()`. This can be
a problem when used in contexts where the process expects to write data to
stdout itself. This diff changes the log statements to use `logger` instead.
This makes it similar to other log statements in the same module.
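A minimal illustration of the change (module name and message are illustrative):

```
import logging

logger = logging.getLogger(__name__)
path = "/tmp/_remote_module_example.py"  # illustrative

# Before: print("Writing instantiated module to", path)  # wrote to stdout
# After: route the message through logging, like the rest of the module.
logger.info("Writing instantiated module to %s", path)
```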

Test Plan:
Confirmed no unexpected test failures showed up when running:

buck test caffe2/test/distributed/nn/api:remote_module_fork

Differential Revision: D22136172

fbshipit-source-id: a3d144eba6c75925ed684981793c84b36eb45a5d
2020-06-19 07:49:15 -07:00
Luca Wehrstedt
2393bab036 [TensorPipe] Update documentation (#40222)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40222

Mention the TensorPipe agent in the RPC docs and give users the information they need to choose which agent to use.
ghstack-source-id: 106225711

Test Plan: Export to GitHub, build locally and try out the docs.

Differential Revision: D22116494

fbshipit-source-id: 30703ba8410c40f64e785f60d71dfd9faa8de4a1
2020-06-19 04:26:49 -07:00
Rohan Varma
7e82382ad5 Allow profiler to be enabled remotely with RPC (#38748)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38748

This diff contains the message scaffolding and profiler changes in order to be able to remotely run the profiler across different nodes and aggregate the results on a single node.

As discussed, we have implemented this by creating new message types, that similar to autograd messages, wrap the profiling information with the original message, and send this new message over the wire. On the receiving end, this wrapped message is detected, we fetch the original message from it, and process the original message with the profiler enabled. When sending a response with profiling information, we serialize the profiled `Events` and send them back over RPC. When such a message is received, the events profiled on the remote node are stored (added back to the local profiler).

Changes in this PR:
- New message types (run_with_profiling_req, run_with_profiling_resp) to send profiling info over the wire. Message parsing logic is added to handle these wrapped types.
- Handling of sending profiler data over the wire, in particular, the attributes of the `ProfilerConfig` and the serialized profiled `Event`s
- The logic for wrapping RPC messages is deduped with that in `rpc_with_autograd`, and the common payload wrapping/unwrapping logic is moved to helper functions in `rpc/utils.cpp`
- Changes in `autograd/utils.cpp` to detect if we have enabled the profiler and are sending an RPC, if so, uses the above new message types
- Changes in request_callback to parse and turn on the profiler in a thread-local fashion
- Serialization and deserialization of profiling `Events`, and support to add the remote events to the thread-local profiler
- Introduction of the concept of `node_id`, which as discussed with ilia-cher , will be used along with the `Event`s handle attribute to distinguish between events. When there are events from different nodes, this node information is rendered in the profile output (e.g. when printing tables), otherwise, it is not, since it is irrelevant.
- Some changes to profiler.cpp to add useful helper methods/guards
- toHere() is now profiled for RRefs
- Unittests
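A hedged usage sketch of what these changes enable from the caller's side (worker names and the exact rendering of remote events are assumptions):

```
import torch
import torch.distributed.rpc as rpc
from torch.autograd import profiler

# Profiling on the caller now also captures events that ran on the callee;
# remote events carry the callee's node id so they can be told apart.
with profiler.profile() as prof:
    fut = rpc.rpc_async("worker1", torch.add, args=(torch.ones(2, 2), 1))
    fut.wait()

print(prof.key_averages().table(sort_by="cpu_time_total"))
```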
ghstack-source-id: 106134626

Test Plan: Added unittests, existing profiler unittests.

Differential Revision: D19510010

fbshipit-source-id: 044347af992f19a9e3b357c9567f6fc73e988157
2020-06-18 17:01:57 -07:00
Gemfield
034eddca01 Fix typos in RPC Docs (#40219)
Summary:
The environment variables MASTER_ADDRESS and MASTER_port in the docs should be MASTER_ADDR and MASTER_PORT, respectively.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40219

Differential Revision: D22116585

Pulled By: mrshenli

fbshipit-source-id: d312ae66210b0a16ec3ab1f468b1654bb0a75a0f
2020-06-18 11:40:11 -07:00
Shihao Xu
f3f30d4354 [JIT x RPC] Consolidate RRef type class and RRef impl class (#35694)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35694

close https://github.com/pytorch/pytorch/issues/35110

Differential Revision: D7881729

fbshipit-source-id: eedda8f1b7510491886d469efeed4e002bb8b991
2020-06-18 07:46:38 -07:00
Luca Wehrstedt
7c9e78fdf5 [TensorPipe] Add options for agent, including backend killswitches (#40162)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40162

The only public option is `num_worker_threads`. The other ones are private (as indicated by the leading underscore, is that enough?) and allow specifying a different set and order of transports/channels. These can thus be used to disable a backend (by not specifying it) or to force one (by raising its priority). They can therefore be used to work around defective backends, in case we find any post-release.
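A hedged sketch of how the public option is passed (the private transport/channel options are omitted since their names are internal):

```
import torch.distributed.rpc as rpc

# num_worker_threads is the only public knob added here; everything else is
# private (leading underscore) and intended only as a killswitch.
opts = rpc.TensorPipeRpcBackendOptions(num_worker_threads=16)
rpc.init_rpc(
    "worker0",
    rank=0,
    world_size=2,
    backend=rpc.backend_registry.BackendType.TENSORPIPE,
    rpc_backend_options=opts,
)
```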
ghstack-source-id: 106103238

Test Plan: Built //caffe2:ifbpy and, using TensorPipe's verbose logging, verified that the transports/channels I specified were indeed the ones that were being registered.

Differential Revision: D22090661

fbshipit-source-id: 789bbe3bde4444cfa20c40276246e4ab67c50cd0
2020-06-18 02:54:17 -07:00
Shihao Xu
bc9e8af218 [distributed.nn] Change remote module template instantiator to write to tmp folder (#40173)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40173

- Avoid path sharing across runs and workers, so even when the test methods/workers run in parallel on the same host, they don't interfere with each other.
- On some environments (e.g. the fb internal CI platform), the torch package file tree is not writable, but the temporary folder chosen by Python's `tempfile` module is always writable (on Linux it's "/tmp").
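A hedged sketch of the approach (variable names are illustrative, not the actual instantiator code):

```
import tempfile

# Each process gets its own writable scratch directory for generated template
# instances, instead of writing into the installed torch package tree.
_scratch_dir = tempfile.TemporaryDirectory(prefix="torch_remote_module_")
INSTANTIATED_TEMPLATE_DIR_PATH = _scratch_dir.name
```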

close https://github.com/pytorch/pytorch/issues/40120

ghstack-source-id: 106086340

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/nn/jit:test_instantiator

buck build mode/dev-nosan //caffe2/test/distributed/nn/jit:test_instantiator && \
buck-out/gen/caffe2/test/distributed/nn/jit/test_instantiator\#binary.par -r test_instantiate_scripted_remote_module_template

buck build mode/dev-nosan //caffe2/test/distributed/nn/jit:test_instantiator && \
buck-out/gen/caffe2/test/distributed/nn/jit/test_instantiator\#binary.par -r test_instantiate_non_scripted_remote_module_template
```

```
buck test mode/dev-nosan //caffe2/test/distributed/nn/api:remote_module_fork
```

Differential Revision: D5708493

fbshipit-source-id: dd92695682433aaf79d1912c7956cef40a450eaf
2020-06-17 15:01:30 -07:00
Pritam Damania
145df306ae Avoid using default process group in ProcessGroupAgent. (#39909)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39909

As described in https://github.com/pytorch/pytorch/issues/33583,
ProcessGroupAgent initializes the default process group and this causes issues
if the user initializes the default process group themsleves. Either the RPC
initialization would fail or the user's process group initialization would
fail.

To avoid this, I've changed ProcessGroupAgent init to create its own
ProcessGroupGloo and not use the default one at all.
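A hedged sketch of the scenario this fixes (rank/world_size and the address are illustrative):

```
import os
import torch.distributed as dist
import torch.distributed.rpc as rpc

os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29500"
rank, world_size = 0, 2  # illustrative

# The user owns the default process group for collectives...
dist.init_process_group("gloo", rank=rank, world_size=world_size)

# ...and init_rpc no longer touches it: ProcessGroupAgent now creates its own
# ProcessGroupGloo internally instead of relying on the default group.
rpc.init_rpc("worker{}".format(rank), rank=rank, world_size=world_size)
```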

Closes: https://github.com/pytorch/pytorch/issues/33583
ghstack-source-id: 105953303

Test Plan: waitforbuildbot

Differential Revision: D22011868

fbshipit-source-id: 7346a3fcb2821a0bc08e0bdc0625947abb5ae16f
2020-06-16 12:00:29 -07:00
Shihao Xu
00651b8c93 [distributed.nn] Implement TorchScript-compatible RemoteModule API (#37139)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37139

See design doc in https://github.com/pytorch/pytorch/issues/37136
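A hedged usage sketch based on the design-doc title; the constructor signature shown (remote worker name, module class, constructor args) is an assumption and may differ from the final API:

```
import torch
from torch.distributed.nn import RemoteModule

# Create an nn.Linear owned by "worker1"; forward() runs there via RPC.
remote_linear = RemoteModule("worker1", torch.nn.Linear, args=(20, 30))

out_fut = remote_linear.forward_async(torch.randn(5, 20))  # returns a Future
out = out_fut.wait()
```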

ghstack-source-id: 105926270

Test Plan:
TODO:

- Make the generated Interface usable. https://github.com/pytorch/pytorch/pull/37139#discussion_r434190978
-
- Avoid generating the same template instances for Modules that are not scriptable.
- Remove "infer_module_interface_cls".
- Use Python format instead of a CodeTemplate.
- Use Python tempfile to track and delete the file. Does it work if there is a crash?

```
buck test mode/dev-nosan //caffe2/test/distributed/nn/jit:test_instantiator

buck build mode/dev-nosan //caffe2/test/distributed/nn/jit:test_instantiator && \
buck-out/gen/caffe2/test/distributed/nn/jit/test_instantiator\#binary.par -r test_instantiate_scripted_remote_module_template

buck build mode/dev-nosan //caffe2/test/distributed/nn/jit:test_instantiator && \
buck-out/gen/caffe2/test/distributed/nn/jit/test_instantiator\#binary.par -r test_instantiate_non_scripted_remote_module_template
```

```
buck test mode/dev-nosan //caffe2/test/distributed/nn/api:remote_module_spawn
```

```
buck test mode/dev-nosan //caffe2/test/distributed/nn/api:remote_module_fork

buck build mode/dev-nosan //caffe2/test/distributed/nn/api:remote_module_fork && \
buck-out/gen/caffe2/test/distributed/nn/api/remote_module_fork\#binary.par -r test_user_provided_global_unique_name

buck build mode/dev-nosan //caffe2/test/distributed/nn/api:remote_module_fork && \
buck-out/gen/caffe2/test/distributed/nn/api/remote_module_fork\#binary.par -r test_forward_async_script

buck build mode/dev-nosan //caffe2/test/distributed/nn/api:remote_module_fork && \
buck-out/gen/caffe2/test/distributed/nn/api/remote_module_fork\#binary.par -r test_forward_sync_script

buck build mode/dev-nosan //caffe2/test/distributed/nn/api:remote_module_fork && \
buck-out/gen/caffe2/test/distributed/nn/api/remote_module_fork\#binary.par -r test_forward_with_kwargs

buck build mode/dev-nosan //caffe2/test/distributed/nn/api:remote_module_fork && \
buck-out/gen/caffe2/test/distributed/nn/api/remote_module_fork\#binary.par -r test_user_provided_global_unique_name
```

```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork
```

```
buck test mode/opt-asan //caffe2/test:jit -- 'test_script_forward_method_replacement

buck build mode/dev-nosan //caffe2/test:jit && \
buck-out/gen/caffe2/test/jit\#binary.par -r 'test_script_forward_method_replacement'

buck build mode/dev-nosan //caffe2/test:jit && \
buck-out/gen/caffe2/test/jit\#binary.par -r 'test_imported_classes'
```

Differential Revision: D20499658

fbshipit-source-id: dd9383ae4eb2343366c11127664f845b91ca3b0a
2020-06-15 19:07:35 -07:00
Shihao Xu
d602950cb4 [torch.distributed.rpc] Add WorkerInfo python repr magic method (#40004)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40004

close https://github.com/pytorch/pytorch/issues/39965
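A hedged sketch of what this enables (the exact repr format is an assumption):

```
import torch.distributed.rpc as rpc

info = rpc.get_worker_info("worker1")
# Previously this printed an opaque pybind object; with __repr__ installed it
# renders the fields, e.g. something like WorkerInfo(id=1, name=worker1).
print(info)
```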
ghstack-source-id: 105891281

Test Plan:
buck test mode/opt-asan //caffe2/test:jit -- 'test_vae_quantized \(jit\.test_models\.TestModels\)'

buck test mode/dev-nosan //caffe2/test/distributed/rpc/:rpc_fork

Differential Revision: D5696583

fbshipit-source-id: 19570414dc833c38fcd1ad38d2f0a816dbf51743
2020-06-15 15:08:29 -07:00
Shen Li
3fb1e73a4e Add rpc.async_execution support for rpc.remote on script functions (#39758)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39758

Test Plan: Imported from OSS

Differential Revision: D21963789

Pulled By: mrshenli

fbshipit-source-id: f16f464ba01401b160cc4d3daf036e4bc806d7ea
2020-06-10 13:17:07 -07:00
Luca Wehrstedt
9bfb91b50b Fix possible deadlock in _wait_all_workers (#39535)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39535

This is my understanding of what could happen: on workerN (N != 0), `_wait_all_workers_sequence_id_to_states`, which is a `defaultdict`, is accessed twice: once in the body of `_wait_all_workers` (by the "main thread" of workerN) and once in `_set_proceed_shutdown_signal`, called by worker0 through an RPC call. I think the two could race and access the `_wait_all_workers_sequence_id_to_states` at the same time, and thus create two separate copies of `WaitAllWorkersStates`. One of those threads would wait on the event of one copy, but the other thread would set the event of the other copy. This led to a deadlock, as the main thread would end up waiting forever.
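A hypothetical sketch of the shape of the fix (the lock and dict names are made up; the real change is in the RPC shutdown code):

```
import threading
from collections import defaultdict

_states_lock = threading.Lock()              # hypothetical
_sequence_id_to_states = defaultdict(dict)   # stands in for WaitAllWorkersStates

def _get_states(sequence_id):
    # Both the worker's main thread and the RPC callback from worker0 go
    # through the same lock, so they can no longer race and each create a
    # separate states object for the same sequence id.
    with _states_lock:
        return _sequence_id_to_states[sequence_id]
```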
ghstack-source-id: 105283327

Test Plan: I added additional logging in those functions, ran a stress test of the RPC test suite, based on the logs I suspected that this could be the issue, fixed it and re-run the stress test and didn't see the bug anymore. This is admittedly not very convincing evidence, as I may just have been lucky that second time...

Differential Revision: D21889752

fbshipit-source-id: 05ec710bd2930313e1480ae896b4b2f5f503aa17
2020-06-05 02:42:32 -07:00
Shen Li
8a6914ddb2 Add @rpc.functions.async_execution for rpc.remote (#39486)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39486

Test Plan: Imported from OSS

Differential Revision: D21871422

Pulled By: mrshenli

fbshipit-source-id: 3c432b7718a47732b2aee064c554f6bdcc5c95c1
2020-06-04 22:38:35 -07:00
Rohan Varma
8b2bb02e09 Implement timeout support for RRefs (#38590)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38590

This PR implements timeout semantics for RRef for parity with rpc_sync and rpc_async. How it works:

- Timeout parameter is added to rpc.remote. If the rpc.remote call times out, note that the error won't be raised to the user in that call, as it is not blocking (similar to rpc_async). Instead, the timeout error will be raised the next time the RRef is used (either by pickling or to_here call).
- Error handling semantics are added to RRef to deal with the timeout errors. Previously, if there was an error creating the OwnerRRef, the callback on the local user would throw from within a callback, resulting in an `std::terminate`. Instead of this, the error is now caught and surfaced to the user the next time the RRef is used. As part of this, we have added an `RPCErrorType` enum and defined RRef error handlers to handle the `RPCErrorType`s (currently just timeout and unknown).
- A timeout parameter is added to `to_here()` which gives the user control over the max amount of time it can block for.
- `ctx.prepareChildForFork()` which is called when the RRef is pickled (i.e. used as an arg over RPC) checks if the `rpc.remote()` call had timed out, and if so, raises that error to the user.
- Tests are added, primarily via delay injection.
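A hedged usage sketch of the semantics above (timeout values are in seconds; the error surfaces at first use, not at the `rpc.remote` call itself):

```
import torch
import torch.distributed.rpc as rpc

x = torch.ones(2, 2)

# The remote call itself never raises on timeout (it is non-blocking)...
rref = rpc.remote("worker1", torch.add, args=(x, 1), timeout=1.0)

# ...instead, the timeout error, if any, is raised the next time the RRef is
# used, e.g. here or when the RRef is pickled as an RPC argument.
result = rref.to_here(timeout=1.0)
```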
ghstack-source-id: 105232837

Test Plan: CI

Differential Revision: D21588165

fbshipit-source-id: c9f9e8aa3521012ea1de3e0f152a41afdf8b23f3
2020-06-04 02:14:42 -07:00
Shen Li
67cea74dd3 Add rpc.async_function decorator for TorchScript functions (#39267)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39267

When combined with `torch.jit.script`, the order of decorators matters:
`rpc.functions.async_execution` must be the outermost one. The
`async_execution` decorator will store the TorchScript function in
attribute `_wrapped_async_rpc_function` on the wrapper function, and
pass this wrapped TorchScript function (i.e., an instance of
`torch.jit.ScriptFunction`) to RPC. The caller will mark the ScriptCall
with `isAsyncExecution=true`, and the callee will extract the returned
`Future` in C++ and install subsequent processing as a callback to
that `Future`.
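A hedged sketch of the ordering rule, following the documented pattern for async TorchScript functions:

```
import torch
import torch.distributed.rpc as rpc
from torch import Tensor
from torch.futures import Future

@torch.jit.script
def script_add(x: Tensor, y: Tensor) -> Tensor:
    return x + y

# async_execution must be the outermost decorator; it stores the ScriptFunction
# in _wrapped_async_rpc_function so RPC can mark the call isAsyncExecution=true.
@rpc.functions.async_execution
@torch.jit.script
def async_add(to: str, x: Tensor, y: Tensor) -> Future[Tensor]:
    return rpc.rpc_async(to, script_add, (x, y))
```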

Test Plan: Imported from OSS

Differential Revision: D21792688

fbshipit-source-id: de095eb148d21e9114a478e9e6047c707d34fd07
2020-06-03 22:27:15 -07:00
Shen Li
a05ef17e46 Add rpc.functions.async_execution decorator for rpc_sync/rpc_async (#39216)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39216

The `rpc.functions.async_execution` decorator specifies that the
wrapped function is guaranteed to return a `torch.futures.Future`.
The decorator adds a `_wrapped_async_rpc_function` attribute to
the wrapper function. The caller retrieves this information and
then sets `isAsyncFunction` argument accordingly which is later
added to PythonCall RPC message as a field. On the callee side,
if the PythonCall carries an asynchronous function, it will cast
the function's return value to a jit::PythonFutureWrapper object,
and then install response creation and communication as a callback
on that jit::PythonFutureWrapper.

For applications, this feature is useful when a function needs to
wait for IO or additional signaling. In those cases, marking the
user function as `rpc.functions.async_execution` will prevent it
from blocking one thread on callee for too long.
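A hedged usage sketch of the decorator in eager mode (worker names are illustrative):

```
import torch
import torch.distributed.rpc as rpc

@rpc.functions.async_execution
def async_add_chained(to, x, y, z):
    # Return a Future immediately; the callee thread is released, and the RPC
    # response is created as a callback once the Future completes.
    return rpc.rpc_async(to, torch.add, args=(x, y)).then(
        lambda fut: fut.wait() + z
    )

# The caller sees the unwrapped value, not the Future.
ret = rpc.rpc_sync("worker1", async_add_chained,
                   args=("worker2", torch.ones(2), 1, 1))
```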

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D21779962

fbshipit-source-id: 6b6aa698bf6f91dad6ed2a7ee433df429b59e941
2020-06-02 23:21:25 -07:00
Omkar Salpekar
a6f0051db2 Fix test_get_and_set_timeout for TensorPipe Agent (#39353)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39353

This test failed with TSAN since the shortened timeout prevented all
messages from being processed within the timeout during Phase 1 of
wait_all_workers during RPC shutdown. Phase 2 already had a longer timeout, so
we extend this to Phase 1 as well.
ghstack-source-id: 105045926

Test Plan: Ran the test_get_and_set_timeout with TSAN

Differential Revision: D21826783

fbshipit-source-id: 7edfdeb50169b31e997dd36a3fd8eea0e9ae7189
2020-06-02 12:01:11 -07:00
Tongzhou Wang
3001facd7a [doc] [distributed] fix typo (#39264)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39264

Differential Revision: D21791426

Pulled By: mrshenli

fbshipit-source-id: c3aa8fda1893aa3c0f9ad3db7da25f1ee80303e8
2020-06-01 19:19:46 -07:00
Shihao Xu
45baf0e1a0 [Profiler x RPC] Enable RPC Server Global Profiler (#38847)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38847

See motivation and design in https://github.com/pytorch/pytorch/issues/38845.

Close https://github.com/pytorch/pytorch/issues/38845.

Changes,

- Add pre-request and post-response hooks to RPC "request_callback_impl.cpp". For a thread that executes an RPC handler, check whether server-side global profiling is on; if it is, enable profiling on that thread and, after the response, merge the thread-local profiling result into the global profiling state.
- Add a context-style Python API to parse the profiling Events into ranges represented by FunctionEvent.
- Add data structures that serve as global profiling state, supporting nesting and acting as a container for consolidating results from multiple threads.
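A hedged usage sketch on the server side; the `_server_process_global_profile` name is inferred from the test name below, and the exact import path and result handling are assumptions:

```
from torch.distributed.rpc import _server_process_global_profile  # assumed path

# Entered on the server: every RPC handler thread that runs while this context
# is active records events, and they are merged into the global state on exit.
with _server_process_global_profile():
    pass  # serve incoming RPCs here
```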

Test,

- Add a test that uses nested profiling ranges and inspects the profiling events.

ghstack-source-id: 104991517

Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed/rpc/:rpc_fork

buck build mode/dev-nosan //caffe2/test/distributed/rpc/:rpc_fork && \
buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par -r test_server_process_global_profiler

Differential Revision: D5665992

fbshipit-source-id: 07f3bef5efd33d1214ef3404284c3803f5deca26
2020-06-01 12:35:52 -07:00
Luca Wehrstedt
54046c1024 [TensorPipe] Implement join correctly (#38933)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38933

Based on what I could understand from how the RPC shutdown operates and from what the ProcessGroup agent does, the join method is supposed to act as a barrier among all workers that waits until they all have finished all their pending work, including work that may be triggered by nested calls or by callbacks.

ghstack-source-id: 104760684

Test Plan: Before this diff, the `test_user_rrefs_confirmed` test of the RPC suite was flakily deadlocking. After this, I haven't been able to repro that.

Differential Revision: D21703020

fbshipit-source-id: 3d36c6544f1ba8e17ce27ef520ecfd30552045dd
2020-05-28 10:48:13 -07:00
Rohan Varma
01815be1e4 Infinite timeout for operations against ProcessGroup for RPC (#38577)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38577

We don't want to limit a timeout to 30 min since there could be no
operations within that time frame. Bump to 2^31 - 1 (int32 max)
ghstack-source-id: 104743727

Test Plan: CI

Differential Revision: D21602425

fbshipit-source-id: ab002262f01664b538761202b3bd7584fcee3c6b
2020-05-27 22:35:13 -07:00
Mingzhe Li
6736a76cec Back out "[RPC] [Minor] RPC entry point cleanup"
Summary:
Original commit changeset: b509c47fb612

(Note: this ignores all push blocking failures!)

Reviewed By: xush6528

Differential Revision: D21669711

fbshipit-source-id: e452a513a2d22eaa3bffa333fdb3277fabc24b41
2020-05-20 15:35:24 -07:00
Shihao Xu
befc76bb65 [RPC] [Minor] RPC entry point cleanup (#34292)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34292

This is to finish a cleanup request from https://github.com/pytorch/pytorch/pull/34733#discussion_r392479110.

ghstack-source-id: 104361618

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork

buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork && \
buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par \
-r test_return_local_script_class_rref_in_py_and_use_in_script

buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork && \
buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par \
-r test_return_local_script_module_rref_in_py_and_use_in_script
```

Differential Revision: D7436759

fbshipit-source-id: b509c47fb612ec3486ff1199c005eba69480ee05
2020-05-19 14:23:11 -07:00
Rohan Varma
4d4895a62a Use Future's then() API to fix RPC profiling (#38352)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38352

Fixes the RPC profiling by using the `then()` API added in https://github.com/pytorch/pytorch/pull/37311. Instead of adding a regular callback, we return a new future that completes when the profiling callback is finished. This is transparent to the user as the future still completes with the value of the original future (i.e. the RPC's return value)

To make this work for RRef, we add a `_set_profiling_future` to set the profiling future, and `_get_profiling_future` to retrieve this future and wait on it in the tests.
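A hedged sketch of the pattern using the public `Future.then()` API (the profiling callback body is a placeholder):

```
import torch
import torch.distributed.rpc as rpc

def _record_profiling(fut):
    pass  # placeholder for the real profiling bookkeeping

fut = rpc.rpc_async("worker1", torch.add, args=(torch.ones(2), 1))

# then() returns a NEW future that completes only after the callback has run,
# but we make it complete with the original RPC result, so callers see no
# difference from waiting on the original future.
def _cb(original_fut):
    _record_profiling(original_fut)
    return original_fut.wait()

profiled_fut = fut.then(_cb)
result = profiled_fut.wait()  # same value as fut.wait()
```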

Re-enabled profiling tests and stress tested them 1000 times to verify the fix
ghstack-source-id: 104086114

Test Plan: Re-enabled profiling tests

Differential Revision: D21506940

fbshipit-source-id: 35cde22f0551c825c9bc98ddc24cca412878a63a
2020-05-14 12:52:45 -07:00
Shihao Xu
3d0279862d Consolidate builtin/python_udf RPC to return ivalue::Future like torchscript RPC does (#35154)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35154

This is for issue https://github.com/pytorch/pytorch/issues/34999.

close https://github.com/pytorch/pytorch/issues/34999.

https://github.com/pytorch/pytorch/issues/34997 needs more work.

This will make a few work items easier, like 1) Dist autograd profiler, 2) JIT annotation for Future.

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork

buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork -- test_rref_forward_chain --stress-runs 100

buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork && \
buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par \
-r test_call_method_on_rref
```

buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork -- 'test_rref_proxy_class \(fb\.test_rpc_fork\.RpcTestWithFork\)' --stress-runs 100

test_rref_proxy_reuse
test_handle_send_exceptions

```
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork

buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork && \
buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par \
-r test_script_call_python_return_future
```

Differential Revision: D7722184

fbshipit-source-id: bd92b855bfea4913d6672700590c57622fa86e0e
2020-05-08 21:28:56 -07:00
Luca Wehrstedt
91f451a5e6 [TensorPipe] Do not require user to provide worker name-to-rank map (#38052)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38052

The initial version of the TensorPipe agent required the user to specify the full map between workers' names and their ids, on each worker. However it's enough for each worker to just specify their name and id, as these can then be exchanged using the store.

Addresses #37784, although I think we can go further and use the store to also automatically assign ranks to workers, so that the user only needs to specify a name.
ghstack-source-id: 103741595

(Note: this ignores all push blocking failures!)

Test Plan:
On worker 0:
```
In [1]: import os
   ...: import torch
   ...: import torch.distributed.rpc as rpc
   ...: os.environ["MASTER_ADDR"] = "127.0.0.1"
   ...: os.environ["MASTER_PORT"] = "8765"

In [2]: rpc.init_rpc(name="foo", rank=0, backend=rpc.backend_registry.BackendType.TENSORPIPE, world_size=2)

In [3]: rpc.rpc_sync("bar", torch.add, args=(torch.full((2,2), 1), torch.full((2,2), 2)))
Out[3]:
tensor([[3., 3.],
        [3., 3.]])

In [4]: rpc.rpc_sync("bar", torch.add, args=(1, 2))
Out[4]: 3
```
On worker 1:
```
In [1]: import os
   ...: import torch
   ...: import torch.distributed.rpc as rpc
   ...: os.environ["MASTER_ADDR"] = "127.0.0.1"
   ...: os.environ["MASTER_PORT"] = "8765"

In [2]: rpc.init_rpc(name="bar", rank=1, backend=rpc.backend_registry.BackendType.TENSORPIPE, world_size=2)
```

Then also tested by adding `rpc_backend_options=rpc.TensorPipeRpcBackendOptions(init_method="file:///tmp/init/foo")` to `rpc_init`.

Differential Revision: D21463833

fbshipit-source-id: b53d7af6fc060789358ac845aa1898ddea6e8f31
2020-05-08 10:48:48 -07:00
Quang Luong
9d7a79ac27 [Caffe2] raise exceptions instead of str (#37744)
Summary:
Some exceptions were not correctly wrapped in an exception class; they were raised as plain strings.
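A minimal illustration of the kind of change (the message is illustrative):

```
# Raising a bare string has been illegal since Python 3; wrap the message in an
# exception class instead.
# before: raise "RPC agent is not set"
raise RuntimeError("RPC agent is not set")
```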
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37744

Differential Revision: D21388197

Pulled By: mrshenli

fbshipit-source-id: 2d69e2543c2e05116c367d137968b982c254d2dc
2020-05-05 13:34:33 -07:00
Hongyi Jia
0549e1f384 [Tensorpipe/RPC] tensorpipe RPC agent (#35483)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35483

Implement the initial version of TensorPipe RPC agent, and register to RPC registry to expose to Python interface. As a starter, it utilizes all available TensorPipe transports (shm, uv) and channels (basic, cma).

Test Plan:
https://our.intern.facebook.com/intern/diffusion/FBS/browse/master/fbcode/experimental/jiayisuse/tensorpipe_rpc
  export MASTER_ADDR=127.0.0.1
  export MASTER_PORT=28500
  buck build mode/dev-nosan mode/no-gpu //experimental/jiayisuse/tensorpipe_rpc:main
  ./buck-out/gen/experimental/jiayisuse/tensorpipe_rpc/main.par
  buck build mode/dev-nosan mode/no-gpu //experimental/jiayisuse/tensorpipe_rpc:benchmark
  ./buck-out/gen/experimental/jiayisuse/tensorpipe_rpc/benchmark.par

Multiple connections with async echo
  ./buck-out/gen/experimental/jiayisuse/tensorpipe_rpc/async_echo.par

Reviewed By: lw

Differential Revision: D20088366

fbshipit-source-id: 980f641af3321ca93583c62753e1c9174b7d4afc
2020-05-05 05:47:43 -07:00
Rohan Varma
d639418307 Add timeout injection to faulty agent for testing (#37485)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37485

Adds arbitrary timeout injection to the faulty RPC agent. This is to better test scenarios that involve long-running RPCs, such as properly testing RPC timeouts and the profiler in all scenarios.

This is done by overriding ProcessGroupAgent's `enqueueSend()` function to inject the timeout. Determining which messages to time out is done similarly to the existing `faulty_messages`, by having the user specify a mapping from message type to timeout.

Added unit tests that verify RPC timeouts work with builtin + TorchScript functions, which was not tested before.
ghstack-source-id: 103341662

Test Plan: Added unit tests in `FaultyRpcAgentTest`.

Differential Revision: D21296537

fbshipit-source-id: 1dbc21aee14e49780272634e9cbb2b5a448f2896
2020-05-01 23:48:28 -07:00
Rohan Varma
c0a985fcd6 Allow customizing retryable message types in Faulty agent tests (#37450)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37450

It doesn't seem like we could customize the retryable message types by
passing faulty_messages into dist_utils, as the `FaultyRpcAgentTestFixture`
overrode the `rpc_backend_options` function and provided the default list of
retryable message types. Needed to fix this as part of adding timeout injection
support as mentioned in https://github.com/pytorch/pytorch/issues/36272
ghstack-source-id: 103287164

Test Plan: `buck test mode/dev-nosan //caffe2/test/distributed/rpc/faulty_agent:rpc_spawn_faulty -- --print-passing-details`

Differential Revision: D21270127

fbshipit-source-id: e5dd847dcf92f14b490f84e9ee79291698b85ffa
2020-05-01 12:00:36 -07:00
Rohan Varma
4ff4119d45 [rpc] Move _set_rpc_backand and RpcBackendOptions to use float instead of timedelta (#37027)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37027

The RPC timeout passed into rpc_sync and rpc_async after the below
change is now float, so we should make these APIs consistent.
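A hedged sketch of the resulting API shape (values are illustrative):

```
import torch.distributed.rpc as rpc

# rpc_timeout is now a plain float number of seconds, not a datetime.timedelta.
opts = rpc.ProcessGroupRpcBackendOptions(
    rpc_timeout=20.0,
    init_method="env://",
)
rpc.init_rpc("worker0", rank=0, world_size=2, rpc_backend_options=opts)
```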
ghstack-source-id: 102971906

Test Plan:
Existing unittests, also added unittest testing specific timeout set
in ProcessGroupRpcBackendOptions and the dispatch rpc backend options handling.

Differential Revision: D21125171

fbshipit-source-id: a5894b8ce31d2926f2c3d323d1cda4d54b30cef1
2020-04-27 19:38:06 -07:00
Rohan Varma
7bd2014eec [resubmit][rpc] per-RPC timeouts for rpc_sync and rpc_async (#34650)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34650

Resubmit of https://github.com/pytorch/pytorch/pull/33840, which was overly eager in the sense that it deleted a lot of code that we didn't want to get rid of yet (default timeout handling).

This PR adds an optional argument into `rpc_sync` and `rpc_async` as well as `RpcAgent::send()` that allows the user to specify a timeout for an RPC to override the default set timeout. If the user does not specify this argument, then the currently set default RPC timeout given in the RPC constructor or by `rpc.set_rpc_timeout()` is used. Otherwise, we use the passed in timeout.
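A hedged usage sketch of the per-call override (timeouts in seconds; omitting the argument falls back to the default):

```
import torch
import torch.distributed.rpc as rpc

x = torch.ones(2, 2)

# Override the default RPC timeout for just this call.
y = rpc.rpc_sync("worker1", torch.add, args=(x, 1), timeout=1.0)

# rpc_async takes the same optional argument; leaving it out means the default
# set at init time, or via the rpc.set_rpc_timeout() mentioned above, is used.
fut = rpc.rpc_async("worker1", torch.add, args=(x, 1), timeout=0.5)
fut.wait()
```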

This diff does not address:
1) timeout support when called rpc.rpc_async is called as a JIT operator. For this to work, we would need to change the logic in `register_distributed_ops` to pass in this timeout to `rpcTorchscript`. One more issue is that torchscript doesn't support the timedelta object. This will be done in a follow up PR as it requires a fair amount of changes to the argument parsing logic.
2) Per-RPC timeouts for internal messages or `rpc.remote()`. A follow-up diff will address the latter with the approach of raising the timeout error to the user at the earliest possible next time, such as the next time the RRef is forked or `to_here` is called.

Added unit tests to confirm the current behavior
ghstack-source-id: 102622601

Test Plan: Added unit tests in rpc_test

Differential Revision: D20376953

fbshipit-source-id: 9fb3f147520588308ab50dd33286255658d76d47
2020-04-22 13:00:42 -07:00
David Reiss
e75fb4356b Remove (most) Python 2 support from Python code (#35615)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35615

Python 2 has reached end-of-life and is no longer supported by PyTorch.
Now we can clean up a lot of cruft that we put in place to support it.
These changes were all done manually, and I skipped anything that seemed
like it would take more than a few seconds, so I think it makes sense to
review it manually as well (though using side-by-side view and ignoring
whitespace change might be helpful).

Test Plan: CI

Differential Revision: D20842886

Pulled By: dreiss

fbshipit-source-id: 8cad4e87c45895e7ce3938a88e61157a79504aed
2020-04-22 09:23:14 -07:00
Shen Li
5c2b273089 Add RRef Python Helper to launch function on the referenced object (#36619)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36619

With this PR, applications no longer need to create dedicated helpers
to run functions on the object referenced by an RRef. Instead,
`rref.rpc_sync().some_func()` will use `rpc_sync` to run `some_func`
on the owner of the RRef using the object referenced by the RRef.
Similar helpers for `rref.rpc_async().some_func()` and
`rref.remote().some_func()` are also added.
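A hedged usage sketch of the new helpers (the remote module class and method are illustrative):

```
import torch
import torch.distributed.rpc as rpc

rref = rpc.remote("worker1", torch.nn.Linear, args=(10, 10))
x = torch.randn(2, 10)

out = rref.rpc_sync().forward(x)      # blocking call, runs on the RRef's owner
fut = rref.rpc_async().forward(x)     # returns a Future
out_rref = rref.remote().forward(x)   # returns another RRef
```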

An alternative design is to expose PyRRef as RRefBase and then
implement everything in a new Python RRef class. However, the RRef
class cannot directly inherit from PyRRef/RRefBase; otherwise we
would need to let the pyRemote* C++ functions load RRef from Python
and return an RRef instance. It is possible to let RRef hold an
instance of PyRRef instead of inheriting from it, but this does not
look like an elegant design, as we would have RRef holding PyRRef and
PyRRef holding the C++ RRef. Another alternative is to use dynamic
method loading, by installing member methods on PyRRef instances.
However, this would require different solutions to handle
RRef(data) and rpc.remote(...). Based on the above thinking, we
decided to go with the current implementation for simplicity and we
can also keep all RRef-related APIs in one place.

Test Plan: Imported from OSS

Differential Revision: D21028333

Pulled By: mrshenli

fbshipit-source-id: fe90f56ef7183d18874e357900093755e1601eb4
2020-04-21 19:29:54 -07:00
Shen Li
b982a6a247 Expose torch.distributed.is_available() API (#37021)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37021

Test Plan: Imported from OSS

Differential Revision: D21164318

Pulled By: mrshenli

fbshipit-source-id: 08a446af342cbe54f3eb4994956ffa7ef4922bcf
2020-04-21 18:38:46 -07:00
Rohan Varma
752d3c281a [profiler] Allow record_function ctx manager to profile futures (#35055)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35055

This is the first step to improving the way RPCs are profiled as suggested by Ilia. For now, since RPC can return two different types of futures, we have to implement two different code paths, one for the python eager mode future and one for the jit future.

This diff implements the python eager part. We have defined a method `_call_end_callbacks_on_future` that takes in a future and schedules a `RecordFunction` to be completed as a callback on the future.

Once https://github.com/pytorch/pytorch/pull/35039 lands, we can implement the JIT codepath by registering an operator that takes a `Future(t)` as well.

These code paths will be merged once the futures are merged.
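A hedged sketch of the eager-mode path (`_call_end_callbacks_on_future` is the method named above; the surrounding usage is an assumption):

```
import torch
import torch.distributed.rpc as rpc
from torch.autograd import profiler

x = torch.ones(2, 2)
with profiler.profile() as prof:
    with profiler.record_function("rpc_async#torch.add") as rf:
        fut = rpc.rpc_async("worker1", torch.add, args=(x, 1))
        # End the RecordFunction range when the future completes, rather than
        # when the `with` block exits, so the range spans the whole RPC.
        rf._call_end_callbacks_on_future(fut)
    fut.wait()
```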
ghstack-source-id: 102478180

Test Plan: Added unit tests

Differential Revision: D20452003

fbshipit-source-id: 1acdcb073bd1f63d6fb2e78277ac0be00fd6671d
2020-04-20 12:37:54 -07:00
Pritam Damania
136d84dd38 Enhance error message for MPI unavailability. (#36781)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36781

Mention that you need to build PyTorch from source to enable MPI.
Additional context:
https://discuss.pytorch.org/t/distributed-pytorch-with-mpi/77106.
ghstack-source-id: 102341246

Test Plan: waitforbuildbot

Differential Revision: D21082009

fbshipit-source-id: 3a3286349e71322726a341dfc743b5978c7d9a56
2020-04-18 14:45:44 -07:00
Sudarshan Raghunathan
739351fac4 Fix linter warning: replace f-strings with str.format for Py2 compat (#35492)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35492

Test Plan: Imported from OSS

Differential Revision: D20998727

Pulled By: drdarshan

fbshipit-source-id: 54f34a7649a2772ad030b456f1b50aba831ce2e0
2020-04-13 18:43:58 -07:00
Rohan Varma
f59e646faa [rpc] Allow profiling in RPC to work with torchscript function invocations (#36275)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36275

Calling a TorchScript function from within RPC was added after initial
support for the profiler with RPC, hence we were not recording torchscript
functions invoked under RPC correctly. This diff passes the `RecordFunction` to
the `_invoke_torchscript..` calls similar to what is done for builtin and UDFs.

However, this is only a temporary solution. We will be removing the use of
`RecordFunction` as a standalone in the RPC code in
https://github.com/pytorch/pytorch/pull/35055. This diff is to unblock
recording of torchscript functions in the meantime.
ghstack-source-id: 101800134

Test Plan:
Added tests for calling a script function with builtin, sync, and
async. The output looks like the following:

```
Name                                                                                                       Self CPU total %  Self CPU total  CPU total %  CPU total   CPU time avg  Number of Calls
---------------------------------------------------------------------------------------------------------  ----------------  --------------  -----------  ----------  ------------  ---------------
rpc_sync#__torch__.torch.testing._internal.distributed.rpc.rpc_test.my_script_func(worker1 -> worker2)      99.92%            1.056s          99.92%       1.056s      1.056s        1
select                                                                                                       0.04%             383.661us       0.04%        383.661us   95.915us      4
fill_                                                                                                        0.02%             210.966us       0.02%        210.966us   52.741us      4
to                                                                                                           0.00%             26.276us        0.00%        26.276us    26.276us      1
empty                                                                                                        0.02%             159.802us       0.02%        159.802us   79.901us      2
set_                                                                                                         0.01%             93.818us        0.01%        93.818us    93.818us      1
---------------------------------------------------------------------------------------------------------  ----------------  --------------  -----------  ----------  ------------  ---------------
Self CPU time total: 1.057s
```

Note that we use `torch.jit._qualified_name` to get the name of the script fn.

Differential Revision: D20930453

fbshipit-source-id: c6d940aa44fcd9dd8a1a29c156aa19e0d8428d60
2020-04-08 23:58:36 -07:00
Pritam Damania
82dd01150c Fix race during RPC shutdown. (#36113)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36113

As part of debugging https://github.com/pytorch/pytorch/issues/35863,
I discovered that the unit test would timeout during clean shutdown.

Looking into this further, it looks like there is a race in
`_on_leader_follower_report_shutdown_intent` when multiple followers call the
same method on the leader.

To fix this, I've ensured we have an appropriate lock in
`_on_leader_follower_report_shutdown_intent` to guard against this.

I ran the test 500 times to validate that this fix works.

Closes #35863
ghstack-source-id: 101641463

Test Plan:
1) waitforbuildbot
2) Ran the test 500 times.

Differential Revision: D20884373

fbshipit-source-id: 9d580e9892adffc0c9a4c2e832881fb291a1ff16
2020-04-08 14:12:33 -07:00
Feng Tian
762270c51f add c10d dynamic loading mechanism and unit test (#28068)
Summary:
The original behavior of PyTorch c10d only supported built-in c10d backends, such as
nccl/gloo/mpi. This patch extends the c10d capability to support dynamically
loading 3rd-party communication libraries that are derived from the ProcessGroup base class.

related RFC is in: https://github.com/pytorch/pytorch/issues/27955

This way, the user just needs to specify a 3rd-party c10d backend name when invoking
torch.distributed.init_process_group(). The proposed logic will try to load the corresponding
c10d backend cpp extension automatically. As for how to develop a new 3rd-party c10d backend
through a cpp extension, please refer to test/cpp_extensions/cpp_c10d_extension.cpp
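A hedged usage sketch of the extension flow (the backend name and extension module are hypothetical, and the registration call shown is an assumption; see the test file above for a real cpp extension):

```
import torch.distributed as dist
import my_c10d_backend  # hypothetical cpp_extension deriving from ProcessGroup

# Register the 3rd-party backend under a name, then initialize with it.
dist.Backend.register_backend("my_backend", my_c10d_backend.createProcessGroup)
dist.init_process_group(backend="my_backend", rank=0, world_size=2,
                        init_method="tcp://127.0.0.1:23456")
```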
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28068

Differential Revision: D19174838

Pulled By: agolynski

fbshipit-source-id: 3409a504a43ce7260e6f9d1207c00e87471fac62
2020-04-02 15:46:51 -07:00
Dhiraj D Kalamkar
945d7a7408 Add All-to-all comms support to distributed module and MPI backend (#32361)
Summary:
As described in https://github.com/pytorch/pytorch/issues/32345, this is a prototype implementation that adds an alltoall communication primitive to the torch.distributed module and the ProcessGroup abstract interface. It also implements alltoall in the ProcessGroupMPI backend.
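A hedged usage sketch of the tensor-list variant of the new primitive (assumes the MPI process group is already initialized; sizes are illustrative):

```
import torch
import torch.distributed as dist

rank = dist.get_rank()
world_size = dist.get_world_size()

# Each rank scatters one chunk to every other rank and gathers one chunk back.
inputs = list((torch.arange(world_size * 2) + rank * 100).chunk(world_size))
outputs = list(torch.empty(world_size * 2, dtype=torch.int64).chunk(world_size))
dist.all_to_all(outputs, inputs)
```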

mnaumovfb JianpingChen066 dmudiger srinivas212 Jianhui-Li mshiryaev ftian1

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini xush6528 osalpekar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32361

Reviewed By: mrshenli

Differential Revision: D20635481

Pulled By: srinivas212

fbshipit-source-id: 3dd0af800ce55d02f02813cde550e3a0f1a287d2
2020-04-01 08:57:12 -07:00