pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Sergii Dymchenko	f51f6aa387	Fix non-existing parameters in docstrings (#90505 ) Continuation after https://github.com/pytorch/pytorch/pull/90163. Here is a script I used to find all the non-existing arguments in the docstrings (the script can give false positives in presence of `*args`/`**kwargs` or decorators): _Edit:_ I've realized that the indentation is wrong for the last `break` in the script, so the script only gives output for a function if the first docstring argument is wrong. I'll create a separate PR if I find more issues with corrected script.

```python
import ast
import os
import docstring_parser

for root, dirs, files in os.walk('.'):
    for name in files:
        if root.startswith("./.git/") or root.startswith("./third_party/"):
            continue
        if name.endswith(".py"):
            full_name = os.path.join(root, name)
            with open(full_name, "r") as source:
                tree = ast.parse(source.read())
                for node in ast.walk(tree):
                    if isinstance(node, ast.FunctionDef):
                        # Collect every parameter name the function declares.
                        all_node_args = node.args.args
                        if node.args.vararg is not None:
                            all_node_args.append(node.args.vararg)
                        if node.args.kwarg is not None:
                            all_node_args.append(node.args.kwarg)
                        if node.args.posonlyargs is not None:
                            all_node_args.extend(node.args.posonlyargs)
                        if node.args.kwonlyargs is not None:
                            all_node_args.extend(node.args.kwonlyargs)
                        args = [a.arg for a in all_node_args]
                        docstring = docstring_parser.parse(ast.get_docstring(node))
                        doc_args = [a.arg_name for a in docstring.params]
                        # Keep only identifier characters of each documented name.
                        clean_doc_args = []
                        for a in doc_args:
                            clean_a = ""
                            for c in a.split()[0]:
                                if c.isalnum() or c == '_':
                                    clean_a += c
                            if clean_a:
                                clean_doc_args.append(clean_a)
                        doc_args = clean_doc_args
                        for a in doc_args:
                            if a not in args:
                                print(full_name, node.lineno, args, doc_args)
                            # Mis-indented as noted in the edit above: this
                            # `break` belongs inside the `if`.
                            break
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90505 Approved by: https://github.com/malfet, https://github.com/ZainRizvi	2022-12-09 21:43:09 +00:00
Ram Rachum	351d73b97f	Fix exception causes all over the codebase (#90271 ) This is the continuation to #90134 and hopefully the final PR in this series. Pull Request resolved: https://github.com/pytorch/pytorch/pull/90271 Approved by: https://github.com/kit1980	2022-12-07 04:29:00 +00:00
Sergii Dymchenko	fa7a963f65	Remove BaseException TODO (#89540 ) After discussion in https://github.com/pytorch/pytorch/pull/88461#issuecomment-1318965664 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89540 Approved by: https://github.com/H-Huang	2022-11-23 19:39:49 +00:00
Tom Stein	fd60b818b9	[Python] refactor slices on sorted (#86995 ) Sometimes you want to query the smallest element of a set of elements and use `sorted(elements)[0]` without a second thought. However, this is not optimal, since the entire list must be sorted first (`O(n log n)`). It would be better to use the `min(elements)` function provided for this purpose (`O(n)`). Furthermore, `sorted(elements)[::-1]` is not very efficient; it would be better to use `sorted(elements, reverse=True)` to save the slice operation. TLDR: using `sorted(elements)[0]` is slow and can be replaced with `min(elements)`. I stumbled across these code snippets while playing around with CodeQL (see https://lgtm.com/query/4148064474379348546/). Pull Request resolved: https://github.com/pytorch/pytorch/pull/86995 Approved by: https://github.com/jansel	2022-10-25 04:07:19 +00:00
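The replacement described in this commit can be illustrated in plain Python. The list `elements` below is made-up sample data, not code from the PyTorch change.

```python
elements = [7, 3, 9, 1, 4]

# O(n log n): sorts the whole list just to read one item.
smallest_slow = sorted(elements)[0]

# O(n): a single pass over the data, same result.
smallest_fast = min(elements)

# Sorting and then reversing with a slice builds an extra list.
descending_slow = sorted(elements)[::-1]

# Sorting in descending order directly avoids the slice.
descending_fast = sorted(elements, reverse=True)
```

For items that compare equal but are distinguishable, the two descending forms can order ties differently, because `reverse=True` keeps equal elements in their original relative order while the slice reverses them.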
Rohan Varma	07bd053a7e	[rpc] Wrap exception creation with try/catch (#87224 ) Sometimes, we cannot recreate the exception with only string (for example if it is a custom exception type). Ideal situation would be to carry over all details on how to recreate the remote end's exception and throw that on client, but for now, we raise a RuntimeError with the original error msg when we cannot reconstruct. Created from CodeHub with https://fburl.com/edit-in-codehub Differential Revision: [D40353274](https://our.internmc.facebook.com/intern/diff/D40353274/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87224 Approved by: https://github.com/fduwjj	2022-10-20 00:02:24 +00:00
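A minimal sketch of the fallback this commit describes, in plain Python. The helper `rebuild_remote_exception` and the class `NeedsTwoArgs` are hypothetical names chosen for illustration; they are not the `torch.distributed.rpc` internals.

```python
class NeedsTwoArgs(Exception):
    """Custom exception that cannot be rebuilt from a message alone."""

    def __init__(self, code, detail):
        super().__init__(f"{code}: {detail}")


def rebuild_remote_exception(exc_type, msg):
    """Try to recreate exc_type from msg; fall back to RuntimeError."""
    try:
        return exc_type(msg)
    except Exception:
        # The constructor needs more than a string, so keep the original
        # error text inside a RuntimeError instead.
        return RuntimeError(f"Failed to recreate {exc_type.__name__}: {msg}")


ok = rebuild_remote_exception(ValueError, "bad input")
fallback = rebuild_remote_exception(NeedsTwoArgs, "42: remote failure")
```

Here `ok` is a `ValueError`, while `fallback` is a `RuntimeError` that still carries the original message.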
anjali411	e2a4dfa468	Add correct __all__ for torch.distributed and torch.cuda submodules (#85702 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85702 Approved by: https://github.com/ezyang, https://github.com/albanD, https://github.com/rohan-varma	2022-10-10 19:15:24 +00:00
anjali411	cf2f552cd8	Add __all__ to torch.{fx, distributed, backends} submodules (#85079 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85079 Approved by: https://github.com/rohan-varma	2022-09-20 12:51:08 +00:00
Sergii Dymchenko	591222f5d9	Fix use-dict-literal lint (#83718 ) Fix use-dict-literal pylint suggestions by changing `dict()` to `{}`. This PR should do the change for every Python file except test/jit/test_list_dict.py, where I think the intent is to test the constructor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83718 Approved by: https://github.com/albanD	2022-08-24 00:26:46 +00:00
joncrall	b136f3f310	More doctest refinements. (#83317 ) Follow up to #82797 Now that the doctests themselves are in a better state, we should be able to enable xdoctest on the CI so they stay that way. @ezyang @vadimkantorov Pull Request resolved: https://github.com/pytorch/pytorch/pull/83317 Approved by: https://github.com/ezyang	2022-08-22 20:07:26 +00:00
Taylor Robie	1fa9a377d0	[Profiler] Start moving python bindings out of autograd (#82584 ) A lot of profiler code still lives in autograd for historic reasons. However as we formalize and clean up profiler internals it makes sense to pull more and more into the profiler folders/namespace. For now I'm just moving some of the core config data structures and those related to `torch::profiler::impl::Result` to keep the scope manageable. Differential Revision: [D37961462](https://our.internmc.facebook.com/intern/diff/D37961462/) NOTE FOR REVIEWERS: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D37961462/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/82584 Approved by: https://github.com/albanD, https://github.com/Gamrix	2022-08-19 17:15:18 +00:00
joncrall	4618371da5	Integrate xdoctest - Rebased (#82797 ) This is a new version of #15648 based on the latest master branch. Unlike the previous PR where I fixed a lot of the doctests in addition to integrating xdoctest, I'm going to reduce the scope here. I'm simply going to integrate xdoctest, and then I'm going to mark all of the failing tests as "SKIP". This will let xdoctest run on the dashboards, provide some value, and still let the dashboards pass. I'll leave fixing the doctests themselves to another PR. In my initial commit, I do the bare minimum to get something running with failing dashboards. The few tests that I marked as skip are causing segfaults. Running xdoctest results in 293 failed, 201 passed tests. The next commits will be to disable those tests. (unfortunately I don't have a tool that will insert the `#xdoctest: +SKIP` directive over every failing test, so I'm going to do this mostly manually.) Fixes https://github.com/pytorch/pytorch/issues/71105 @ezyang Pull Request resolved: https://github.com/pytorch/pytorch/pull/82797 Approved by: https://github.com/ezyang	2022-08-12 02:08:01 +00:00
ProGamerGov	71d50f4f89	Change docstring type callable to Callable for consistency (#82487 ) ### Description Across PyTorch's docstrings, both `callable` and `Callable` are used for variable types. `Callable` should be capitalized, as we are referring to the `Callable` type and not the Python `callable()` function. ### Testing There shouldn't be any testing required. Pull Request resolved: https://github.com/pytorch/pytorch/pull/82487 Approved by: https://github.com/albanD	2022-08-01 17:26:09 +00:00
Howard Huang	81ca2ff353	Prevent automatic cuda init in init_rpc (#80180 ) Fixes #80141 Only initialize cuda if there are devices specified in `init_rpc` Differential Revision: [D37458309](https://our.internmc.facebook.com/intern/diff/D37458309) Pull Request resolved: https://github.com/pytorch/pytorch/pull/80180 Approved by: https://github.com/rohan-varma	2022-07-08 14:18:02 +00:00
anjali411	3bcc19b29a	Add __all__ to various submodules in torch.fx, distributions, distributed, package (#80367 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/80367 Approved by: https://github.com/albanD	2022-06-27 21:27:30 +00:00
Howard Huang	31d03c2f63	[RPC small change] Improving logging for store.wait error Pull Request resolved: https://github.com/pytorch/pytorch/pull/76548 Approved by: https://github.com/mrshenli	2022-05-05 18:23:17 +00:00
Howard Huang	e68686bb05	Add optional timeout argument for RpcAgent join() (#76194 ) Summary: This PR was created to resolve issue brought up in https://fb.workplace.com/groups/319878845696681/permalink/741428653541696/ Changes: - Adds timeout argument to RpcAgent.join() - Add optional timeout argument to ThriftRpcAgent barrier() - When shutdown (ThriftRpcAgent join) calls the barrier, the agent will use the timeout passed to shutdown and pass that timeout into the join(). - Update API.py to also fix a bug (missing timeout for signal) - Change default shutdown timeout to 0 (no timeout). Existing functionality in _all_gather will remain the same and wait indefinitely for signal if no timeout is set for the function. New functionality has user specify timeout for both the signal and rpc calls. Pull Request resolved: https://github.com/pytorch/pytorch/pull/76194 Test Plan: Modified barrier test buck test torch/fb/distributed/thriftRpcBackend/test:ThriftRpcAgentTest -- BarrierTest Reviewed By: mrshenli Differential Revision: D35825382 fbshipit-source-id: e91e9ab5d9fca08787cb6b6b8125a4b03d1c7cde (cherry picked from commit fcf899a387001574bf4e39a213ea741611d76097)	2022-05-03 01:10:17 +00:00
vitrioil	f92cddd890	Removed direct doc formatting Fixes #76034 This does not make python remove all `__doc__` because in some places `__doc__` is assigned to a string. Example: `04b3313379/torch/nn/modules/conv.py (L174-L233)` Since there are quite a few of these, I will add all of them together in this PR later. (Basically still a lot of docstring will persist even with `-OO` enabled.) Pull Request resolved: https://github.com/pytorch/pytorch/pull/76619 Approved by: https://github.com/albanD	2022-05-02 14:14:33 +00:00
Rohan Varma	ec62901a2c	Disable RPC profiling for kineto profilers Pull Request resolved: https://github.com/pytorch/pytorch/pull/76234 RPC profiling is only enabled when the profiler is of legacy type. Differential Revision: [D35484579](https://our.internmc.facebook.com/intern/diff/D35484579/) Approved by: https://github.com/H-Huang	2022-04-26 23:35:30 +00:00
Howard Huang	811ccde41a	[Dynamic RPC] Add graceful shutdown for dynamic RPC members Pull Request resolved: https://github.com/pytorch/pytorch/pull/74561 Approved by: https://github.com/mrshenli	2022-04-26 13:12:55 +00:00
Brian Coutinho	8385e06b0b	[pytorch][cupti profiler 6/n] Changes to configure Kineto cupti profiler from pytorch profiler interface (#75616 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/75616 Kineto introduced a new profiler to read performance counters from NVIDIA GPUs (CUPTI Range Profiler API) Here we are adding support to configure this Kineto range profiler mode Example

```
with torch.profiler.profile(
    activities=[ProfilerActivity.CUDA],
    record_shapes=True,
    on_trace_ready=trace_handler,
    experimental_config=torch.profiler._ExperimentalConfig(
        profiler_metrics=[
            "kineto__tensor_core_insts",
            "dram__bytes_read.sum",
            "dram__bytes_write.sum"],
        profiler_measure_per_kernel=False),
) as prof:
    res = train_batch(modeldef)
    prof.step()
```

## Details
* Introduce a new structure `KinetoProfilerConfig` so users can configure Kineto specific options, keeps profiler API consistent.
* Populate configuration options for Kineto.

Test Plan: CI and tested on resnet50 Reviewed By: robieta Differential Revision: D34489487 fbshipit-source-id: 8ef82d2593f4f4d5824ca634f7d25507bc572caa (cherry picked from commit 4a2af70629db55a605d4b8d0a54d41df2b247183)	2022-04-20 22:19:54 +00:00
Howard Huang	8646e0dc28	[Dynamic RPC] Allow existing ranks to communicate with newly joined ranks Pull Request resolved: https://github.com/pytorch/pytorch/pull/74035 Approved by: https://github.com/mrshenli	2022-04-20 18:07:40 +00:00
Howard Huang	f76d1c022e	[Dynamic RPC] Allow for optional world_size argument in init_rpc (#73372 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73372 This PR allows for an optional `world_size` argument in init_rpc. This makes changes in rendezvous to allow for `NoneType` for world_size and creates a new code path when initializing TensorPipe agent for init_rpc. The TensorPipe agent is protected by a critical section enforced using the store, so that only one node can create a TPAgent at a time. This PR does not yet enable RPC commands between ranks. Previously:

```python
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '29500'
init_rpc("worker0", world_size=1, rank=0)
```

Now (only rank is needed):

```python
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '29500'
init_rpc("worker0", rank=0)
```

Test Plan: Imported from OSS Reviewed By: mrshenli Differential Revision: D34621651 Pulled By: H-Huang fbshipit-source-id: 09dbb511d5a00c219a6ce0a35501ff2e388998b0 (cherry picked from commit 834aedc3256167399c323169ef2f0c9b3cf98dff)	2022-03-24 16:19:28 +00:00
Edward Yang	c24783fbd4	Don't discard stacktrace when rewriting AttributeError (#73720 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73720 Signed-off-by: Edward Z. Yang <ezyang@fb.com> Test Plan: Imported from OSS Reviewed By: pritamdamania87 Differential Revision: D34603306 Pulled By: ezyang fbshipit-source-id: 8c484df231071decd336dbf6932f8f491071f215 (cherry picked from commit 4950d15681d4cd7ca904af19cad87a83ddc8bba6)	2022-03-04 01:29:43 +00:00
Rodrigo Kumpera	ef4bc3fa2f	[distributed] Make rref_proxy._invoke_rpc trully async when needed. (#70206 ) Summary: From https://github.com/pytorch/pytorch/issues/67626: RRefProxy (rref.rpc_async, rref.rpc_sync, rref.remote) currently uses a blocking RPC call to the owner This is done by chaining async calls. In the sync case we wait on the resulting Future. Pull Request resolved: https://github.com/pytorch/pytorch/pull/70206 Test Plan: I ran rpc_tests using tensorpipe_rpc_agent_test_fixture.py and had to adjust test_rref_proxy_timeout to the new behavior. I ran into test_tensorpipe_set_default_timeout failing due to the timeout being too small. Doesn't look related to this change. mrshenli Fixes https://github.com/pytorch/pytorch/issues/67626 cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang Reviewed By: pritamdamania87 Differential Revision: D33243348 Pulled By: kumpera fbshipit-source-id: e1e8c34bb3d170407c0a793e2e585357f905d3c6 (cherry picked from commit `1ad5a7ceea`)	2022-01-19 23:37:15 +00:00
Howard Huang	be7e159e71	Remove extraneous logging (#68830 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68830 No logical changes, removing a logging statement that was accidentally committed. cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang jjlilley mrzzd Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D32628711 Pulled By: H-Huang fbshipit-source-id: 070190b92f97c8e38d8bb03124c13cb061fc9ec1	2021-11-24 07:15:50 -08:00
Howard Huang	7b376bf844	Remove ProcessGroup from TensorPipeAgent initialization (#68128 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68128 Reland of D31762735 (`0cbfd466d2`). This diff was originally reverted due to failure in test_send_export_type_through_rpc_with_custom_pickler. I updated rpc_pickler_test.py to prevent a race condition where processes were not registering their pickler before handling their rpc_sync calls. Test Plan: rpc_pickler_test file: buck test mode/dev-nosan -c 'cxx.coverage_only=caffe2' //caffe2/torch/fb/training_toolkit/backend/metrics/tests:rpc_pickler_test //caffe2/torch/fb/training_toolkit/backend/metrics/collectors/fbdata_aggregator/tests:batch_collector_test -- --run-disabled --collect-coverage '--code-coverage-session=test_session' --force-tpx rpc_pickler stress test: buck test mode/dev-nosan -c 'cxx.coverage_only=caffe2' //caffe2/torch/fb/training_toolkit/backend/metrics/tests:rpc_pickler_test -- --exact 'caffe2/torch/fb/training_toolkit/backend/metrics/tests:rpc_pickler_test - test_send_export_type_through_rpc_with_custom_pickler (caffe2.torch.fb.training_toolkit.backend.metrics.tests.rpc_pickler_test.CythonTypeRpcSpawnTest)' --run-disabled --collect-coverage '--code-coverage-session=test_session' --force-tpx --jobs 18 --stress-runs 10 --record-results Reviewed By: mrshenli Differential Revision: D32316077 fbshipit-source-id: e58de2335fbaa3ab46d46fe222c659197633a5e4	2021-11-11 12:28:55 -08:00
Onyiee	442d7d72de	fixed type checking errors in options.py (#68056 ) Summary: Fixes [issue#64](https://github.com/MLH-Fellowship/pyre-check/issues/64) This PR fixes the type checking errors in torch/distributed/rpc/options.py. The variable types in 84:8 and 85:8 were declared to have type `List` but were sometimes assigned a value of `None`. This caused an incompatible variable type error. Therefore, I changed the type from `List` to `Optional[List]`, which fixes the incompatible variable type error. Signed-off-by: Onyemowo Agbo onionymous 0xedward Pull Request resolved: https://github.com/pytorch/pytorch/pull/68056 Reviewed By: zou3519 Differential Revision: D32282289 Pulled By: mrshenli fbshipit-source-id: ee410165e623834b4f5f3da8d44bd5a29306daae	2021-11-09 11:42:34 -08:00
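The `List` to `Optional[List]` change can be sketched as follows. The `Options` class and `device_count` helper are hypothetical and only illustrate the typing pattern; they are not the contents of torch/distributed/rpc/options.py.

```python
from typing import List, Optional


class Options:
    """Hypothetical options holder, used only to show the annotation."""

    def __init__(self, devices: Optional[List[int]] = None):
        # Annotating the attribute as Optional[List[int]] lets a type
        # checker accept both a list and None as assigned values.
        self.devices: Optional[List[int]] = devices


def device_count(opts: Options) -> int:
    # Narrow the Optional before using list operations on it.
    return 0 if opts.devices is None else len(opts.devices)
```

With a plain `List[int]` annotation, a type checker would reject the `None` default, which is the kind of error the commit message reports.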
Howard Huang	9fb3ba9d7b	Revert D31762735 (#67924 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67924 This diff reverts the changes made in D31762735 (`0cbfd466d2`) Test Plan: Wait for CI Reviewed By: derekmod-fb Differential Revision: D32214744 fbshipit-source-id: e0a65b6a31a88216ae1243549fcbc901ef812374	2021-11-06 17:34:13 -07:00
Howard Huang	cfd998c197	Remove ProcessGroup RPC backend placeholder as part of 1.11 (#67363 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67363 ProcessGroup RPC backend is deprecated. In 1.10 it would throw an error to the user to be more user friendly. This PR now removes it completely. cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang Test Plan: Imported from OSS Reviewed By: bdhirsh Differential Revision: D32138321 Pulled By: H-Huang fbshipit-source-id: b4f700d8f1b1d46ada7b5062d3f754646571ea90	2021-11-04 07:57:58 -07:00
Pritam Damania	05e17e7ff6	Add API usage logging for several other RPC APIs. (#67722 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67722 ghstack-source-id: 142259452 Test Plan: waitforbuildbot Reviewed By: jaceyca, fduwjj Differential Revision: D32118872 fbshipit-source-id: 041ab5601221b1846c56ce4bb63364bec9ad28b0	2021-11-03 14:02:00 -07:00
Howard Huang	0cbfd466d2	Remove ProcessGroup from TensorPipeAgent initialization (#66708 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66708 cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang Test Plan: Imported from OSS Reviewed By: anjali411 Differential Revision: D31762735 Pulled By: H-Huang fbshipit-source-id: 9f3879fca6b8258f7e6171b14d2c1d6cce21627d	2021-11-01 14:15:27 -07:00
Pritam Damania	285d5a55b9	Add API usage to torch.RPC (#67515 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67515 Adding API usage to torch.rpc to better understand usage of this API. ghstack-source-id: 141877028 Reviewed By: rohan-varma Differential Revision: D32011465 fbshipit-source-id: 34d006ece307ae4a90fbcc6cb44fc0b7edca611e	2021-10-29 10:38:41 -07:00
Bin Wen	da166d4f12	Add a timeout argument to RPC shutdown() (#65425 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65425 cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang gcramer23 Test Plan: Imported from OSS python3 test/distributed/rpc/test_tensorpipe_agent.py -v -k test_wait_all_workers_timeout Reviewed By: mrshenli Differential Revision: D31092483 Pulled By: dracifer fbshipit-source-id: 5b5e9f20b1d6602cf8cde3772678f721dddf0d78	2021-09-23 10:42:58 -07:00
Pritam Damania	c245632e2e	Use higher timeout for TSAN tests. (#65391 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65391 TSAN tests are much slower than the usual dev/opt mode, about 5-10x slower. As a result, for TSAN build mode we use a much higher timeout for distributed tests. ghstack-source-id: 138584613 Test Plan: waitforbuildbot Reviewed By: cbalioglu Differential Revision: D31076575 fbshipit-source-id: 44a485f07101deac536470ceeff2a52cac4f9e0b	2021-09-21 12:08:27 -07:00
Kimish Patel	54f2eb6e7e	[Pytorch Profiler] Add support for adding module hierarchy to (#61792 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61792 KinetoEvent This PR adds module hierarchy information to events. What is module hierarchy information attached to events? During profiling a TorchScript module, when events are added, we ask JIT what is the module hierarchy associated with the node being executed. At the time of execution of that node, there might be multiple frames in the stack of interpreter. For each frame, we find corresponding node and the corresponding module hierarchy is queried. Module hierarchy corresponding to the node is associated with node's InlinedCallStack. InlinedCallStack of node tracks the path via which the node is inlined. Thus during the inlining process we annotate module information corresponding to the CallMethod nodes being inlined. With this PR, chrome trace will contain additional metadata: "Module Hierarchy". This can look like this: TOP(ResNet)::forward.SELF(ResNet)::_forward_impl.layer1(Sequential)::forward.0(BasicBlock)::forward.conv1(Conv2d)::forward.SELF(Conv2d)::_conv_forward It contains module instance, type name and the method name in the callstack. Test Plan: test_profiler Imported from OSS Reviewed By: raziel, ilia-cher Differential Revision: D29745442 fbshipit-source-id: dc8dfaf7c5b8ab256ff0b2ef1e5ec265ca366528	2021-08-13 21:39:10 -07:00
Ilia Cherniavskii	773a8eede4	[profiler][refactor] Refactor the usage of legacy profiler implementation (#61931 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61931 This PR consolidates the profiling code around a new C++ implementation (profiler_kineto.h/cpp) and uses it unconditionally from torch.autograd.profiler/torch.profiler: 1. Always use profiler_kineto.h/cpp as the C++ implementation 2. Simplify profiler.py to remove unneeded parts depending on legacy impl 3. Move some of the legacy logic into profiler_legacy.py (to be fully deleted later) Test Plan: USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install --cmake python test/test_profiler.py -v USE_KINETO=0 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install --cmake python test/test_profiler.py -v Imported from OSS Reviewed By: gdankel Differential Revision: D29801599 fbshipit-source-id: 9794d29f2af38dddbcd90dbce4481fc8575fa29e	2021-08-03 18:51:29 -07:00
Howard Huang	dc1bd6acee	Remove PROCESS GROUP rpc backend (#62411 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62411 Test Plan: Imported from OSS Reviewed By: mrshenli Differential Revision: D29990408 Pulled By: H-Huang fbshipit-source-id: 183d3b316767b12993cebbe32b73c2850fd1cc42	2021-08-02 12:26:22 -07:00
Howard Huang	b3781f0244	Remove faulty process group agent logic (#62409 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62409 This is a reland of #61907 because removing process_group_agent.h / cpp broke facebook specific tests. I will remove the files and update the internal test code in a separate PR. Test Plan: Imported from OSS Reviewed By: mrshenli Differential Revision: D29990001 Pulled By: H-Huang fbshipit-source-id: 2ee333322247d8b72691152308c3297e8c0c006d	2021-07-30 08:12:48 -07:00
Howard Huang	a15fff0a7f	Revert D29794666: Remove faulty process group code Test Plan: revert-hammer Differential Revision: D29794666 (`afe3644321`) Original commit changeset: 0b35191cc072 fbshipit-source-id: 6467bc5100f4115f2fdb385e205740cd68c89743	2021-07-28 10:15:34 -07:00
Howard Huang	afe3644321	Remove faulty process group code (#61907 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61907 Removing the code for faulty process group agent since it was replaced by faulty tensorpipe agent Test Plan: Imported from OSS Reviewed By: mrshenli Differential Revision: D29794666 Pulled By: H-Huang fbshipit-source-id: 0b35191cc07220b6774ecacc8d004f25fd2e87f0	2021-07-27 07:37:40 -07:00
Howard Huang	e8d2916b84	Add faulty tensorpipe implementation (#61421 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61421 This PR adds the faulty tensorpipe agent implementation and replaces all faulty process group agent tests with it. The faulty tensorpipe agent code is very similar to that of faulty process group agent. It allows the user to fail or delay certain types of rpc messages, which is used in the faulty agent tests. These changes are needed to deprecate the process group rpc backend. Summary of changes: - Add faulty tensorpipe agent class - Update tensorpipe pipeWrite function to allow it to be overwritten and add delay - Update test backend registry and faulty agent tests to use the FAULTY_TENSORPIPE_AGENT backend. This affects all faulty agent tests; here are a few of them as sample commands: `pytest test/distributed/rpc/test_faulty_agent.py -vs -k test_verify_backend_options` `pytest test/distributed/rpc/test_faulty_agent.py -vs -k test_no_faulty_messages` `pytest test/distributed/rpc/test_faulty_agent.py -vs -k test_builtin_remote_message_dropped_timeout` Test Plan: Imported from OSS Reviewed By: mrshenli Differential Revision: D29773739 Pulled By: H-Huang fbshipit-source-id: 6b2bc366735d70b79943d4207f454bc9555bbf5f	2021-07-20 13:54:30 -07:00
Pritam Damania	1d1d5acbb0	[RPC] Ensure _wait_all_workers doesn't swallow exception. (#61094 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61094 `_wait_all_workers` was swallowing exceptions and as a result if there were any errors it would still continue with rpc_agent.join() which would hang since something already failed before. To fix this, I've ensured that wait_all_workers throws and in that case we just proceed with an ungraceful shutdown without joining. ghstack-source-id: 133160706 Test Plan: 1) Added unit test. 2) waitforbuildbot Reviewed By: rohan-varma Differential Revision: D29509286 fbshipit-source-id: 7c3f1c68d712ae2f63e10e0216580db8e9bcc29d	2021-07-07 18:28:41 -07:00
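The control flow described in this commit (let the wait step raise, and skip the join when it does) might look roughly like the sketch below. The `shutdown` function, its callback parameters, and `failing_wait` are hypothetical stand-ins, not the actual RPC API.

```python
def shutdown(wait_all_workers, join, graceful=True):
    """Hypothetical shutdown flow: join only if the wait step succeeded."""
    steps = []
    if graceful:
        try:
            wait_all_workers()
        except RuntimeError as err:
            # The wait step raised instead of swallowing the error, so skip
            # join (which could hang) and shut down ungracefully.
            steps.append(f"ungraceful: {err}")
        else:
            steps.append("waited")
            join()
            steps.append("joined")
    steps.append("stopped")
    return steps


def failing_wait():
    raise RuntimeError("worker1 unreachable")


result = shutdown(failing_wait, join=lambda: None)
clean = shutdown(lambda: None, join=lambda: None)
```

In the failing case `join` is never called, which is the behavior the commit message describes as avoiding the hang.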
Shen Li	bbedfd913d	Run an dummy rpc._all_gather in init_rpc to avoid shutdown timeout (#59801 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59801 Fixes https://github.com/pytorch/pytorch/issues/59795. The RPC calls in shutdown are no longer able to finish within 5s if there are no other RPCs before `rpc.shutdown()` in that process, because agent initialization can take longer than 5s. We didn't have this problem previously, because TensorPipe's backend registry used to use RPC to communicate CUDA devices in `init_rpc`. However, after #58753, `init_rpc` uses ProcessGroup to communicate devices, and hence the channels/transport could be uninitialized after `init_rpc`. Differential Revision: D29039238 D29039238 Test Plan: Imported from OSS Reviewed By: rohan-varma Pulled By: mrshenli fbshipit-source-id: 46f89b01a058a51d271ddef9084a67b220a067b7	2021-06-17 11:47:54 -07:00
Luca Wehrstedt	8f4cfaa9db	Fix race condition in TP agent (#58753 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58753 TSAN was (rightfully!) detecting and complaining about a race due to the fact that upon init the TP agent exchanges the device maps between nodes using RPC requests (and by doing so it accesses the device maps) and then sets the reverse device maps (thus possibly modifying the set of devices). This resulted in a data race, i.e., simultaneously reading and writing the set of devices without synchronizing. One solution is to add a mutex around the devices, which works, but is "annoying". An alternative solution is to make the set of devices immutable (i.e., `const`). For that to work, we need to exchange the device maps without using RPC calls. We can do so using the process group that we need to create anyways. Since now there's a lot more logic in Python, I've moved (and restructured) all safety checks over there, and removed them from C++. ghstack-source-id: 130583775 Test Plan: Unit tests Reviewed By: mrshenli Differential Revision: D28603754 fbshipit-source-id: 88533e65d72d1eb806dc41bec8d55def5082e290	2021-06-04 06:53:42 -07:00
Howard Huang	7ee68363a8	Add new rpc.barrier API (#53423 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53423 closes #40166 This change exposes a new API, rpc.barrier() which blocks the main processes of all workers running RPC until the whole group completes this function. Optionally rpc.barrier can take in a set of worker_names and only synchronize across those worker names. Example:

```python
import os
import torch.multiprocessing as mp
import torch.distributed.rpc as rpc

os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "5678"

world_size = 4
odd_num_workers = [f"worker{i}" for i in range(world_size) if i % 2]
even_num_workers = [f"worker{i}" for i in range(world_size) if not i % 2]

def worker(i):
    print(i)
    rpc.init_rpc(f"worker{i}", rank=i, world_size=world_size)
    if i % 2:
        print(f"start barrier {i}")
        rpc.barrier(set(odd_num_workers))
    else:
        print(f"start barrier {i}")
        rpc.barrier(set(even_num_workers))
    rpc.shutdown()
    print(f"shutdown{i}")

if __name__ == '__main__':
    with mp.Pool(processes=world_size) as pool:
        pool.map(worker, range(world_size))
```

Test Plan: Imported from OSS Reviewed By: rohan-varma Differential Revision: D27737145 Pulled By: H-Huang fbshipit-source-id: 369196bc62446f506d1fb6a3fa5bebcb0b09da9f	2021-06-02 14:20:16 -07:00
Yi Wang	dbe629c51d	[RPC Framework] Support creating a RemoteModule by RRef (#59242 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59242 #Original PR Issue: https://github.com/pytorch/pytorch/issues/58274 This can be a workaround: Instead of passing a script `RemoteModule` over RPC, pass its `module_rref` field over RPC, and then construct a new `RemoteModule` on the receiver end. ghstack-source-id: 130268018 Test Plan: buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_send_remote_module_over_the_wire_script_not_supported buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_remote_module_py_pickle_not_supported_script buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_create_remote_module_by_module_rref Reviewed By: vipannalla Differential Revision: D28794905 fbshipit-source-id: 1a677ff0d4b47c078ad47b50d7102a198a1fc39b	2021-06-01 22:35:03 -07:00
Pritam Damania	0d6fa1adc5	Introduce ChunkShardingSpec as a model sharding specification. (#55728 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55728 Full design: https://github.com/pytorch/pytorch/issues/55207 This PR introduces ChunkShardingSpec (SingleShardingSpec in the design). Used the name ChunkShardingSpec since it is very similar to `torch.chunk` in terms of how a Tensor is split up and feels more clear compared to SingleShardingSpec. ghstack-source-id: 129603318 Test Plan: waitforbuildbot Reviewed By: SciPioneer Differential Revision: D27694108 fbshipit-source-id: c8764abe6a4d5fc56d023fda29b74b5af2a73b49	2021-05-23 16:04:57 -07:00
Yi Wang	fd3d3ef900	[RPC Framework] Add _script_module_reducer unconditionally for RecursiveScriptModule in RPC pickler (#58020 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58020 Previously there was no RPC pickler for `RecursiveScriptModule`. Although it is a subclass of `ScriptModule`, the reducer of `ScriptModule` is not triggered for `RecursiveScriptModule` when a script remote module is sent over RPC. This PR checkpoints the investigation of #58274, which makes sure that an RPC pickler is invoked here. This still cannot fix `test_send_remote_module_over_the_wire_script`. Will revisit this bug once there is a feature request from users. ghstack-source-id: 128949642 Test Plan: TODO: re-enable these tests buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_send_remote_module_over_the_wire_script buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_remote_module_py_pickle_not_supported_script Reviewed By: rohan-varma Differential Revision: D28346758 fbshipit-source-id: 3cff84ca665da03da6ed6acb094a1f594fcd945e	2021-05-13 17:51:25 -07:00
Yi Wang	e507771294	[RPC Framework] Replace Python Pickler with internal RPC pickler for RemoteModule (#58019 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58019 In order to support sending `RemoteModule` over RPC, previously the pickling/unpickling of `RemoteModule` was implemented based on `__setstate__` and `__getstate__`. However, this means that the user can call regular Python pickler/unpickler to invoke the same logic, which should not be allowed. This PR ensures that the pickling can only happen over RPC and not via regular python pickle. Additionally, when a new attribute is added to `RemoteModule`, if it's not added to either `_REMOTE_MODULE_PICKLED_ATTRIBUTES` or `_REMOTE_MODULE_ATTRIBUTES_IGNORE_FOR_PICKLING`, this attribute will be ignored and an error message will be printed to stderr. However, it will not raise an exception like before, because such exception raised at the RPC layer will somehow cause timeout. #Closes: https://github.com/pytorch/pytorch/issues/57516 ghstack-source-id: 128868501 Test Plan: buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_send_remote_module_over_the_wire buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_remote_module_py_pickle_not_supported buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_send_remote_module_with_a_new_attribute_ignored_over_the_wire buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- RemoteModule buck test mode/dev-nosan //caffe2/torch/fb/csrc/concurrency/test:atomic_int_interprocess_test -- --exact 'caffe2/torch/fb/csrc/concurrency/test:atomic_int_interprocess_test - test_multiple_processes (caffe2.torch.fb.csrc.concurrency.test.atomic_int_interprocess_test.ForkMultipleProcessTest)' buck test mode/dev //caffe2/torch/distributed/fb/test:app_test -- --exact 'caffe2/torch/distributed/fb/test:app_test - test_custom_init_rpc (caffe2.torch.distributed.fb.test.app_test.TestRpc)' Reviewed By: mrshenli Differential Revision: D28318270 fbshipit-source-id: 7e7df2a6690f0860c4531a244d38789db424496f	2021-05-13 09:37:42 -07:00
Lucas Hosseini	dc49299078	Allow passing cpu to CUDA RPC device maps (#57019 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57019 Based on https://github.com/pytorch/pytorch/pull/56043 Test Plan: Imported from OSS Reviewed By: anjali411 Differential Revision: D28169796 Pulled By: beauby fbshipit-source-id: 7fcf623de07c74c4f1ab415b7e20b518876a567a	2021-05-04 04:14:27 -07:00
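
Note on 0d6fa1adc5: the commit message says ChunkShardingSpec splits a Tensor much like `torch.chunk`. The split arithmetic can be sketched in plain Python as below. This is only an illustration of the chunking rule (equal chunks of size ceil(n / chunks), a smaller last chunk, possibly fewer chunks than requested), not the ChunkShardingSpec API. The helper names `chunk_sizes` and `assign_shards` and the placement strings are hypothetical.

```python
import math


def chunk_sizes(dim_size, num_chunks):
    """Sizes of the pieces when a dimension is split torch.chunk-style."""
    if num_chunks <= 0:
        raise ValueError("num_chunks must be positive")
    if dim_size == 0:
        return []
    # Every chunk has this size except possibly the last one.
    step = math.ceil(dim_size / num_chunks)
    return [min(step, dim_size - start) for start in range(0, dim_size, step)]


def assign_shards(dim_size, placements):
    """Pair each placement label (hypothetical) with an (offset, size) slice."""
    shards = []
    offset = 0
    # zip stops early if the split produced fewer chunks than placements.
    for placement, size in zip(placements, chunk_sizes(dim_size, len(placements))):
        shards.append((placement, offset, size))
        offset += size
    return shards


if __name__ == "__main__":
    for shard in assign_shards(5, ["rank:0/cpu", "rank:1/cpu", "rank:2/cpu"]):
        print(shard)
```

With this rule a dimension of size 6 split four ways yields three chunks of 2, so the last placement would receive no shard. That edge case is worth checking against the real spec before relying on it.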

223 Commits
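
Note on e507771294: the idea of allowing pickling only through an internal pickler, and rejecting the regular Python pickler, can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not PyTorch's implementation: `RemoteHandle`, `InternalRpcPickler` and `rpc_dumps` are hypothetical names. It relies on `pickle.Pickler` consulting a `dispatch_table` before falling back to `__reduce_ex__`.

```python
import copyreg
import io
import pickle


class RemoteHandle:
    """Hypothetical stand-in for an object that must only travel over RPC."""

    def __init__(self, name, device):
        self.name = name
        self.device = device

    def __reduce_ex__(self, protocol):
        # The regular pickler ends up here, so refuse outright.
        raise RuntimeError("RemoteHandle can only be pickled by the RPC pickler")


def _reduce_remote_handle(obj):
    # Reducer known only to the internal pickler: rebuild from chosen attributes.
    return RemoteHandle, (obj.name, obj.device)


class InternalRpcPickler(pickle.Pickler):
    # A per-pickler dispatch table takes precedence over __reduce_ex__.
    dispatch_table = copyreg.dispatch_table.copy()
    dispatch_table[RemoteHandle] = _reduce_remote_handle


def rpc_dumps(obj):
    """Serialize obj with the internal pickler (hypothetical helper)."""
    buf = io.BytesIO()
    InternalRpcPickler(buf).dump(obj)
    return buf.getvalue()


if __name__ == "__main__":
    handle = RemoteHandle("worker1", "cpu")
    restored = pickle.loads(rpc_dumps(handle))
    print(restored.name, restored.device)
    try:
        pickle.dumps(handle)
    except RuntimeError as exc:
        print("regular pickle rejected:", exc)
```

In this sketch only the sending side is restricted; the payload still loads with plain `pickle.loads`. A stricter design would pair the pickler with a matching internal unpickler.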