Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59017
See the comment in ThreadLocal.h for context.
I used a slightly dirty preprocessor hack to minimize the number of changes.
The hope is that we'll be able to revert all of these soon.
Test Plan:
CI.
Built FB4A with gnustl and saw no references to cxa_thread_atexit
in the PyTorch libraries.
Reviewed By: ilia-cher
Differential Revision: D28720762
fbshipit-source-id: 0f13c7ac5a108b95f8fde6dbc63c6b8bdb8599de
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58420
In https://github.com/pytorch/pytorch/pull/57636 I migrated most uses of Future to an intrusive_ptr. I thought I had all of them but I missed a couple. These are the remaining ones. (The next PR will make it impossible to add new usages of shared_ptr).
ghstack-source-id: 129567071
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28477285
fbshipit-source-id: 75008276baa59e26b450e942c009ec7e78f89b13
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os


def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files


def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname, "-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit", "--all", "-m", f"NOLINT stubs for {fname}"])


def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)


if __name__ == "__main__":
    main()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892
Reviewed By: H-Huang
Differential Revision: D27991944
Pulled By: malfet
fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56808
For information about data-race-on-vptr in general, see https://www.internalfb.com/intern/wiki/TSAN/Common_Concurrency_Mistakes/Stopping_a_Thread_in_Destructor/
Engine::~Engine() was previously tasked with stopping the threads. This causes a data race on the object's vptr when PythonEngine is being destructed. This fixes the data race by making ~PythonEngine trigger the thread stopping before going down to the base class's destructor.
Test Plan:
Many tests are affected, but here's one example:
buck test mode/dev-tsan -c fbcode.tsan_strict_mode=true //oculus/research/orcoptics/deep_learning/srg_nn/tests:test_grating_net -- 'test_train (oculus.research.orcoptics.deep_learning.srg_nn.tests.test_grating_net.TestGratingNet)' --run-disabled
Reviewed By: walterddr, albanD
Differential Revision: D27972384
fbshipit-source-id: 8b70fec8d9326497c591a2777b355ea590a85082
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56174
evaluate_function:
1. calls the autograd function (call_function)
2. accumulates gradients into buffers
Previously, ThreadLocalStateGuard only covered part of `call_function`.
However, it should cover all Tensor operations in `evaluate_function`,
so this PR moves it to do so.
One alternative would have been to move ThreadLocalStateGuard to here:
71f9e99e29/torch/csrc/autograd/engine.cpp (L394)
Unfortunately that adds 2% additional instructions according to the
instruction count benchmark in the next section. This is because
`evaluate_function` does an early return:
71f9e99e29/torch/csrc/autograd/engine.cpp (L732-L735)
If this is preferred, please let me know.
Test Plan:
- run existing tests. It's hard to actually come up with a test case for
this.
Benchmark plan:
TL;DR: Instruction count decreases by a little after this PR.
```
import torch
from torch.utils.benchmark import Timer
timer = Timer(
    stmt="""\
torch::autograd::grad({y}, {x}, {}, /*retain_grad=*/true);""",
    setup="""\
auto x = torch::ones({}, torch::requires_grad());
auto y = x * 2;""",
    language="cpp")
stats = timer.collect_callgrind()
print(stats)
```
This gave the following:
```
Before:
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7f4b28ce6a90>
torch::autograd::grad({y}, {x}, {}, /*retain_grad=*/true);
setup:
auto x = torch::ones({}, torch::requires_grad());
auto y = x * 2;
All Noisy symbols removed
Instructions: 3514184 3514184
Baseline: 0 0
100 runs per measurement, 1 thread
After:
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7fdbc9d187d0>
torch::autograd::grad({y}, {x}, {}, /*retain_grad=*/true);
setup:
auto x = torch::ones({}, torch::requires_grad());
auto y = x * 2;
All Noisy symbols removed
Instructions: 3513884 3513884
Baseline: 0 0
100 runs per measurement, 1 thread
```
Reviewed By: albanD
Differential Revision: D27799283
Pulled By: zou3519
fbshipit-source-id: 0a8213824e08c04748d38e66604c73f395285d63
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53973
Two parts to this PR; I had to put them together because adding support for X causes more test code to be exercised, which in turn may require a fix for Y.
The first part is restoring the concept of storage to meta tensors. Previously, meta tensors had a nullptr storage (e.g., `meta_tensor.storage()` is an error.) As I was increasing the coverage of meta tensors, I started running into test cases (specifically memory overlap tests) that were failing because not having storage meant I couldn't check for memory overlap. After some discussion, we decided that it would make sense for meta tensors to model this as well (we already model strides, so getting accurate view information also seems useful). This PR does that by:
* Rewrite all of the factory functions in MetaTensor.cpp to use the generic versions (which are very carefully written to not actually poke at the data pointer, so everything works out). The key idea here is we give meta tensors a special allocator, MetaAllocator, which always returns a nullptr even if you ask for a nonzero number of bytes. resize_ is also made generic; the normal variant can be used directly rather than having to instruct it to avoid resizing storage
* Turn on memory overlap checking in TensorIterator even for meta tensors
* Although meta tensors now have storage, the concept of meta storage is NOT exposed to Python land (as it would imply I would have to codegen MetaFloatStorage, MetaDoubleStorage, etc. classes). So `x.storage()` still raises an error and I have a kludge in `__deepcopy__` to break storage sharing upon deep copy (this is wrong, but no tests exercise this at the moment).
The second part is adding more support for the most used functions in the test suite.
* Inplace operations have very simple meta functions. I added `fill_`, `zero_`, `random_`, `uniform_` and `normal_`. In the case of random, I take advantage of pbelevich's templates for defining random kernels, so that I can reuse the common scaffolding, and then just register a noop stub that actually does the RNG. (Look, another structured kernels tiny variant!)
* `copy_` is now implemented. Copying into a meta tensor is always OK, but copying out of a meta tensor raises an error (as we don't know what the "correct" data to copy out is in this case)
* `empty_strided` usage from structured kernels now is implemented (TBH, this could have been done as soon as `empty_strided` was added)
* Meta was missing in a few places in TensorOptions/DispatchKey utility functions, so I added them
* Autograd engine now correctly homes meta tensors with CPU tensors (they have -1 device index so CUDA queues wouldn't work anyway)
* `apply_`, `map_` and `map2_` are special cased to no-op on meta tensor self. These count as inplace operations too but they are implemented a little differently.
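For illustration, here is a minimal sketch of the copy and in-place behavior described above (my own example assuming the public `device="meta"` factory path; not code from this PR):
```
import torch

# A meta tensor carries sizes/strides (and now a storage with a null data pointer),
# but no actual data.
m = torch.empty(2, 3, device="meta")
print(m.shape, m.stride())

# Copying *into* a meta tensor is always OK.
m.copy_(torch.randn(2, 3))

# Copying *out of* a meta tensor should raise, since there is no data to copy out.
try:
    torch.empty(2, 3).copy_(m)
except Exception as e:
    print("expected error:", e)

# Simple in-place meta functions such as fill_/zero_ succeed without touching data.
m.fill_(1.0)
m.zero_()
```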
Getting more meta function support triggers a number of bugs in the test suite, which I then fix:
- Linear algebra functions sometimes don't report NotImplementedError because it gets swallowed by catch-all try blocks. This is tracked in https://github.com/pytorch/pytorch/issues/53739
- dlpack obviously doesn't work with meta tensors, so I just disabled the test
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D27036572
Test Plan: Imported from OSS
Reviewed By: agolynski, bdhirsh
Pulled By: ezyang
fbshipit-source-id: 7005ecf4feb92a643c37389fdfbd852dbf00ac78
Summary:
Fixes https://github.com/pytorch/pytorch/issues/12635
This change will help us speed up autograd's discovery algorithm in cases where we use `.grad` and we try to "unroll" the training loop. For example the example in the issue and also https://github.com/pytorch/pytorch/pull/52180#issuecomment-783400832 observe an unbounded multiple of speed-up.
We do this by adding a new sequence_nr-type numbering: for each node, we maintain the length of the longest path from it to any leaf node. How does this help us speed up discovery (dfs)? Previously the bottleneck was that the dfs that computes which nodes need to be executed always explored every node. With this change, before we run dfs, we first compute the minimum seq_nr among all the nodes passed as the `inputs`. If we let this be some number N, intuitively this means that dfs should stay at least N units away from any leaf node. So, if we find ourselves too close to any leaf node, we can stop our search early.
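To make the pattern concrete, a small sketch of the "unrolled loop" shape this targets (my own illustration, not the linked benchmark code): each `.grad()` call only needs the newest intermediate, yet discovery previously walked the whole ever-growing graph.
```
import torch

w = torch.randn(10, requires_grad=True)
x = w.clone()
for step in range(1000):
    x = x * 1.01                  # keeps extending ("unrolling") the same graph
    loss = (x * x).sum()
    # Gradient w.r.t. the newest intermediate only; the needed subgraph is tiny,
    # but before this PR discovery still visited every node back to the leaf w,
    # so each iteration got slower than the last.
    (gx,) = torch.autograd.grad(loss, (x,))
```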
Edit:
After some discussion offline, the plan is:
- make old sequence_nr a construct of the profiler. This means we can avoid accessing thread local state in cases where the profiler is disabled. Note that we cannot replace sequence_nr as-is because profiler's use-case requires that thread-id + sequence_nr can uniquely identify a given node in order for downstream users/programs to correlate nodes from backward and forward passes. This means we must maintain two sequence_nr's and that we have an extra field in Node.
- In a future PR, we can potentially remove sequence_nr entirely from the profiler as well, but we avoid doing it now because we haven't measured, and it's a larger effort because we'd have to mess around with the dispatcher and profiler
Testing with this [code](https://gist.github.com/kyunghyuncho/5fb9991ce1233f909051854a84b7148e), we see that runtime no longer increases as we iterate.
Before:
```
100: Time taken: 0.47s, loss: 1.1e+06
200: Time taken: 0.064s, loss: 6.5e+05
300: Time taken: 0.088s, loss: 4.4e+05
400: Time taken: 0.1s, loss: 3.2e+05
500: Time taken: 0.12s, loss: 2.5e+05
600: Time taken: 0.15s, loss: 2e+05
700: Time taken: 0.18s, loss: 1.7e+05
800: Time taken: 0.2s, loss: 1.4e+05
900: Time taken: 0.22s, loss: 1.2e+05
1000: Time taken: 0.24s, loss: 1.1e+05
1100: Time taken: 0.27s, loss: 9.3e+04
1200: Time taken: 0.3s, loss: 8.3e+04
1300: Time taken: 0.34s, loss: 7.4e+04
1400: Time taken: 0.36s, loss: 6.7e+04
1500: Time taken: 0.38s, loss: 6.1e+04
1600: Time taken: 0.4s, loss: 5.6e+04
1700: Time taken: 0.42s, loss: 5.1e+04
1800: Time taken: 0.44s, loss: 4.7e+04
1900: Time taken: 0.47s, loss: 4.4e+04
2000: Time taken: 0.5s, loss: 4.1e+04
```
After:
```
100: Time taken: 0.49s, loss: 1.2e+06
200: Time taken: 0.031s, loss: 6.9e+05
300: Time taken: 0.031s, loss: 4.6e+05
400: Time taken: 0.031s, loss: 3.3e+05
500: Time taken: 0.031s, loss: 2.6e+05
600: Time taken: 0.031s, loss: 2.1e+05
700: Time taken: 0.031s, loss: 1.7e+05
800: Time taken: 0.031s, loss: 1.4e+05
900: Time taken: 0.031s, loss: 1.2e+05
1000: Time taken: 0.031s, loss: 1.1e+05
1100: Time taken: 0.031s, loss: 9.6e+04
1200: Time taken: 0.031s, loss: 8.6e+04
1300: Time taken: 0.031s, loss: 7.7e+04
1400: Time taken: 0.031s, loss: 7e+04
1500: Time taken: 0.031s, loss: 6.3e+04
1600: Time taken: 0.031s, loss: 5.8e+04
1700: Time taken: 0.031s, loss: 5.3e+04
1800: Time taken: 0.031s, loss: 4.9e+04
1900: Time taken: 0.031s, loss: 4.5e+04
2000: Time taken: 0.032s, loss: 4.2e+04
```
Testing w/ small graph to check for regression:
```
import torch
from torch.utils.benchmark import Timer
setup="""
a = torch.rand((2, 2), requires_grad=True)
b = torch.rand((2, 2), requires_grad=True)
gradient = torch.ones(2, 2)
"""
stmt="""
torch.autograd.grad(a*b, [a, b], gradient)
"""
timer = Timer(stmt, setup)
print(timer.timeit(10000))
print(timer.collect_callgrind(100))
```
Result: there doesn't seem to be any significant regression
```
Time before: 12.74 us
Time after: 13.12 us
Instruction count before:
All Noisy symbols removed
Instructions: 8078960 8000882
Baseline: 4226 3838
Instruction count after:
All Noisy symbols removed
Instructions: 8091846 8017940
Baseline: 4336 3838
100 runs per measurement, 1 thread
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52180
Reviewed By: gchanan, zhangguanheng66
Differential Revision: D26794387
Pulled By: soulitzer
fbshipit-source-id: c00d387a29f151109c33dc6f1b56a8f275cdec58
Summary:
Fixes https://github.com/pytorch/pytorch/issues/34067 by using https://github.com/pytorch/pytorch/issues/34426 (by hczhu).
In addition to removing the unnecessary any(), we also:
- Get rid of the outer loop since graph_root also needs to be checked
- Update the pseudocode description so it matches what the code does
- Add some comments explaining the difference between assigning `info.needed_` and `info.captures_` in terms of how that affects discovery
- [edit: another benefit is that exec_info entries are no longer created for all reachable nodes]
This PR is on top of https://github.com/pytorch/pytorch/issues/51940, so once that lands rebasing on top of master should get rid of the extra commits and changes
I'm not sure if this change will bring a lot of performance gains, but the main benefit is that the code is easier to read.
Trivial graph:
```
torch.autograd.grad(a*b, [a, b], gradient)
setup:
a = torch.rand((2, 2), requires_grad=True)
b = torch.rand((2, 2), requires_grad=True)
gradient = torch.ones(2, 2)
Time before:
15.45 us
Time after:
14.33 us
1 measurement, 10000 runs , 1 thread
Instructions after:
All Noisy symbols removed
Instructions: 8271213 8193169
Baseline: 4244 3838
Instructions before:
All Noisy symbols removed
Instructions: 8142843 8054463
Baseline: 4280 3838
100 runs per measurement, 1 thread
```
Small graph:
```
torch.autograd.grad((b*a.exp()+a*b.exp()).sum(), (a, b))
setup:
a = torch.rand((2, 2), requires_grad=True)
b = torch.rand((2, 2), requires_grad=True)
Time before:
52.25 us
Time after:
50.80 us
1 measurement, 10000 runs , 1 thread
Instruction count before:
All Noisy symbols removed
Instructions: 25601257 25518229
Baseline: 4228 3838
Instruction count after:
All Noisy symbols removed
Instructions: 25606533 25522797
Baseline: 4228
100 runs per measurement, 1 thread
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52057
Reviewed By: ngimel
Differential Revision: D26432207
Pulled By: soulitzer
fbshipit-source-id: beef68344d66e9e286378e31e3311ba43c25c749
Summary:
Fixes https://github.com/pytorch/pytorch/issues/39784
At the time the issue was filed, there was only issue (1) below.
There are actually now two issues here:
1. We always set all inputs passed in through `inputs` arg as `needed = True` in exec_info. So if we pass in an input that has a grad_fn that is not materialized, we create an entry of exec_info with nullptr as key with `needed = True`. Coincidentally, when we perform simple arithmetic operations, such as "2 * x", one of the next edges of mul is an invalid edge, meaning that its grad_fn is also nullptr. This causes the discovery algorithm to set all grad_fns that have a path to this invalid_edge as `needed = True`.
2. Before the commit that enabled the engine skipped the dummy node, we knew that root node is always needed, i.e., we hardcode `exec_info[&graph_root]=true`. The issue was that this logic wasn't updated after the code was updated to skip the graph root.
To address (1), instead of passing in an invalid edge if an input in `inputs` has no grad_fn, we create a dummy grad_fn. This is done in both python and cpp entry points. The alternative is to add logic for both backward() and grad() cases to check whether the grad_fn is nullptr and set needed=false in that case (the .grad() case would be slightly more complicated than the .backward() case here).
For (2), we perform one final iteration of the discovery algorithm so that we really know whether we need to execute the graph root.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51940
Reviewed By: VitalyFedyunin
Differential Revision: D26369529
Pulled By: soulitzer
fbshipit-source-id: 14a01ae7988a8de621b967a31564ce1d7a00084e
Summary:
This solves a race condition where the worker thread might
see a partially initialized graph_task
Fixes https://github.com/pytorch/pytorch/issues/49652
I don't know how to reliably trigger the race so I didn't add any test. But the rocm build flakiness (it just happens to race more often on rocm builds) should disappear after this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50164
Reviewed By: zou3519
Differential Revision: D25824954
Pulled By: albanD
fbshipit-source-id: 6a3391753cb2afd2ab415d3fb2071a837cc565bb
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46373
As noted in https://github.com/pytorch/pytorch/issues/46373, there needs to be a flag passed into the engine that indicates whether it was executed through the backward api or grad api. Tentatively named the flag `accumulate_grad` since functionally, backward api accumulates grad into .grad while grad api captures the grad and returns it.
Changes that are not necessary for the Python API (cpp, TorchScript) have been moved to a new PR.
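For reference, a minimal sketch of the two entry points whose behavior the flag distinguishes (my own example, not code from the PR):
```
import torch

x = torch.ones(3, requires_grad=True)
y = (x * x).sum()

# backward API: accumulate_grad=True -> gradients are accumulated into .grad
y.backward()
print(x.grad)            # tensor([2., 2., 2.])

# grad API: accumulate_grad=False -> gradients are captured and returned,
# leaving .grad untouched by this call
z = (x * x * x).sum()
(gx,) = torch.autograd.grad(z, (x,))
print(gx)                # tensor([3., 3., 3.])
print(x.grad)            # still tensor([2., 2., 2.])
```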
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46855
Reviewed By: ngimel
Differential Revision: D24649054
Pulled By: soulitzer
fbshipit-source-id: 6925d5a67d583eeb781fc7cfaec807c410e1fc65
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45867
In most cases the lock ordering was: hold a lock in local autograd and
then hold a lock in DistAutogradContext.
In the case of `set_exception_without_signal` the lock order was reversed, and as
a result we saw potential deadlock issues in our TSAN tests. To fix this, I
removed the lock and instead just used an std::atomic exchange.
In addition to this, I fixed TestE2E to ensure that we use the appropriate
timeout.
TestE2EProcessGroup was flaky for these two reasons and now is fixed.
ghstack-source-id: 113592709
Test Plan: waitforbuildbot.
Reviewed By: albanD
Differential Revision: D24120962
fbshipit-source-id: 12447b84ceae772b91e9a183c90d1e6340f44e66
Summary:
We are trying to build libtorch statically (BUILD_SHARED_LIBS=OFF) then link it into a DLL. Our setup hits the infinite loop mentioned [here](54c05fa34e/torch/csrc/autograd/engine.cpp (L228)) because we build with `BUILD_SHARED_LIBS=OFF` but still link it all into a DLL at the end of the day.
This PR fixes the issue by changing the condition to guard on which windows runtime the build links against using the `CAFFE2_USE_MSVC_STATIC_RUNTIME` flag. `CAFFE2_USE_MSVC_STATIC_RUNTIME` defaults to ON when `BUILD_SHARED_LIBS=OFF`, so backwards compatibility is maintained.
I'm not entirely confident I understand the subtleties of the windows runtime versus linking setup, but this setup works for us and should not affect the existing builds.
Fixes https://github.com/pytorch/pytorch/issues/44470
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43532
Reviewed By: mrshenli
Differential Revision: D24053767
Pulled By: albanD
fbshipit-source-id: 1127fefe5104d302a4fc083106d4e9f48e50add8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43684
This PR attempts to address #42560 by capturing the appropriate
exception_ptr in the autograd engine and passing it over to the Future.
As part of this change, there is a significant change to the Future API, where we
now only accept an exception_ptr as part of setError.
For the example in #42560, the exception trace would now look like:
```
> Traceback (most recent call last):
> File "test_autograd.py", line 6914, in test_preserve_backtrace
> Foo.apply(t).sum().backward()
> File "torch/tensor.py", line 214, in backward
> torch.autograd.backward(self, gradient, retain_graph, create_graph)
> File "torch/autograd/__init__.py", line 127, in backward
> allow_unreachable=True) # allow_unreachable flag
> File "torch/autograd/function.py", line 87, in apply
> return self._forward_cls.backward(self, *args)
> File "test_autograd.py", line 6910, in backward
> raise ValueError("something")
> ValueError: something
```
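For reference, a minimal sketch reconstructing the kind of custom Function behind the trace above (my reconstruction, not the exact test code):
```
import torch

class Foo(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        # With the exception_ptr propagation, the Python traceback of this raise
        # is preserved when the autograd engine surfaces the error.
        raise ValueError("something")

t = torch.ones(2, requires_grad=True)
Foo.apply(t).sum().backward()
```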
ghstack-source-id: 111109637
Test Plan: waitforbuildbot
Reviewed By: albanD
Differential Revision: D23365408
fbshipit-source-id: 1470c4776ec8053ea92a6ee1663460a3bae6edc5
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43405.
This pull request adds a feature that prints all tracebacks if `detect_anomaly` mode detects `nan` in nested backward operations.
The way I did it is by assigning a node as a parent to all nodes it produces during its backward calculation. Then if one of the children produces `nan`, it will print the traceback from the parent and grandparents (if any).
The parent is assigned in the `parent_node_` member of the `Node` class, which is accessible in C++ via `node->parent()` and in Python via `node.parent_function`.
A node has a parent iff:
1. it is created from a backward operation, and
2. created when anomaly mode and grad mode are both enabled.
An example of this feature:
```
import torch

def example():
    x = torch.tensor(1.0, requires_grad=True)
    y = torch.tensor(1e-8, requires_grad=True)  # small to induce nan in n-th backward
    a = x * y
    b = x * y
    z1 = a / b  # can produce nan in n-th backward as long as https://github.com/pytorch/pytorch/issues/43414 is unsolved
    z = z1 * z1
    gy , = torch.autograd.grad( z , (y,), create_graph=True)
    gy2, = torch.autograd.grad(gy , (y,), create_graph=True)
    gy3, = torch.autograd.grad(gy2, (y,), create_graph=True)
    gy4, = torch.autograd.grad(gy3, (y,), create_graph=True)
    return gy4

with torch.autograd.detect_anomaly():
    gy4 = example()
```
with output:
```
example.py:16: UserWarning: Anomaly Detection has been enabled. This mode will increase the runtime and should only be enabled for debugging.
with torch.autograd.detect_anomaly():
/home/mfkasim/anaconda2/envs/base3/lib/python3.8/site-packages/torch/autograd/__init__.py:190: UserWarning: Error detected in DivBackward0. Traceback of forward call that caused the error:
File "example.py", line 17, in <module>
gy4 = example()
File "example.py", line 12, in example
gy3, = torch.autograd.grad(gy2, (y,), create_graph=True)
File "/home/mfkasim/anaconda2/envs/base3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 190, in grad
return Variable._execution_engine.run_backward(
(Triggered internally at ../torch/csrc/autograd/python_anomaly_mode.cpp:61.)
return Variable._execution_engine.run_backward(
/home/mfkasim/anaconda2/envs/base3/lib/python3.8/site-packages/torch/autograd/__init__.py:190: UserWarning:
Traceback of forward call that induces the previous calculation:
File "example.py", line 17, in <module>
gy4 = example()
File "example.py", line 11, in example
gy2, = torch.autograd.grad(gy , (y,), create_graph=True)
File "/home/mfkasim/anaconda2/envs/base3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 190, in grad
return Variable._execution_engine.run_backward(
(Triggered internally at ../torch/csrc/autograd/python_anomaly_mode.cpp:65.)
return Variable._execution_engine.run_backward(
/home/mfkasim/anaconda2/envs/base3/lib/python3.8/site-packages/torch/autograd/__init__.py:190: UserWarning:
Traceback of forward call that induces the previous calculation:
File "example.py", line 17, in <module>
gy4 = example()
File "example.py", line 8, in example
z1 = a / b # can produce nan in n-th backward as long as https://github.com/pytorch/pytorch/issues/43414 is unsolved
(Triggered internally at ../torch/csrc/autograd/python_anomaly_mode.cpp:65.)
return Variable._execution_engine.run_backward(
Traceback (most recent call last):
File "example.py", line 17, in <module>
gy4 = example()
File "example.py", line 13, in example
gy4, = torch.autograd.grad(gy3, (y,), create_graph=True)
File "/home/mfkasim/anaconda2/envs/base3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 190, in grad
return Variable._execution_engine.run_backward(
RuntimeError: Function 'DivBackward0' returned nan values in its 1th output.
```
cc & thanks to albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43626
Reviewed By: malfet
Differential Revision: D23397499
Pulled By: albanD
fbshipit-source-id: aa7435ec2a7f0d23a7a02ab7db751c198faf3b7d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43676
This is one part of https://github.com/pytorch/pytorch/issues/41574 to
ensure we consolidate everything around ivalue::Future.
I've removed the use of torch/csrc/utils/future.h from the autograd engines and
used ivalue::Future instead.
ghstack-source-id: 110895545
Test Plan: waitforbuildbot.
Reviewed By: albanD
Differential Revision: D23362415
fbshipit-source-id: aa109b3f8acf0814d59fc5264a85a8c27ef4bdb6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40312
As part of https://github.com/pytorch/pytorch/issues/40255, we
realized that GPU support for distributed autograd was broken as part of our
multithreaded autograd change.
To fix this in the short term for 1.6, this PR includes the following changes:
1) Long lived CPU thread in DistEngine to execute GPU->CPU continuations in the
autograd graph.
2) The long lived CPU thread has its own ready_queue and this queue is used for
all GraphTasks created by DistEngine.
3) In thread_main(), the CPU thread cannot exit once the GraphTask is done
processing because of the new CPU thread added in 1).
4) To resolve this, thread_main() now has a parameter `device_thread` instead
of `reentrant_thread`. When device_thread is True, we expect this to be a long
lived device thread that does not exit.
5) When device_thread is False, thread_main is expected to run a GraphTask and
return once done.
ghstack-source-id: 106391329
Test Plan: waitforbuildbot
Differential Revision: D22146183
fbshipit-source-id: dd146b7a95f55db75f6767889b7255e9d62d5825
Summary:
## Why doesn’t DDP work under dist_autograd?
DDP follows the steps below
1. [DDP Python constructor](8d6a8d2b3f/torch/nn/parallel/distributed.py (L389-L393)) (on a module) creates a [C++ Reducer](https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/reducer.cpp), which holds references to all parameters (or variables in C++ code).
2. The reducer installs a post hook on each model parameter.
3. The backward run starts and triggers the post hooks installed above.
4. The post hook of a parameter simply marks the parameter ready for all-reduce.
5. Once all parameters in a bucket are ready, an all-reduce process starts by reading variable `.grad` and writes to variable `.grad`.
But under dist_autograd, `.grad` of a variable is not populated at all. Instead, grads are in a global map in distributed context from variables to their grads.
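A short hedged sketch of that difference (assumes an initialized RPC agent; only meant to show where the grads live):
```
import torch
import torch.distributed.autograd as dist_autograd

# Regular autograd: backward() populates param.grad.
param = torch.ones(2, 2, requires_grad=True)
(param * 2).sum().backward()
print(param.grad)                      # populated

# Distributed autograd: grads live in the context map, not in .grad.
with dist_autograd.context() as context_id:
    loss = (param * 3).sum()
    dist_autograd.backward(context_id, [loss])
    grads = dist_autograd.get_gradients(context_id)   # {variable: grad} map
    print(param.grad)                  # unchanged by the distributed backward
    print(grads[param])
```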
## Solution of this PR
The distributed engine sets a thread_local variable during a backward run indicating that we're running in distributed mode. The DDP reducer can then appropriately use `.grad` or the distributed context based on that thread local. More precisely, the thread local is set before calling the post hooks installed by the DDP reducer, so that the DDP post hooks can retrieve it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37998
Test Plan:
```
python test/distributed/test_ddp_under_dist_autograd.py
```
FB repo
```
buck test caffe2/test/distributed/...
```
DDP accuracy benchmark workflow run
```
flow-cli canary pytorch.benchmark.accuracy_comparison.workflow --parameters-json '{"node_world_size": 4, "dist_backend": "nccl"}' --run-as-secure-group fblearner_flow --entitlement gpu_prod
```
f196173157
Reviewed By: pritamdamania87
Differential Revision: D21513795
Pulled By: hczhu
fbshipit-source-id: fe21e68ecdc9274182db4d4bb5a1e2d68ef927a2
Summary:
If the Engine is created shortly before the application exits, then a non-reentrant worker thread might not have a chance to spawn, which would result in an infinite wait in `Engine::~Engine()`
Prevent this by actually waiting for threads to spawn before returning from `Engine::start_device_threads()`
Make sure that thread count is incremented before GIL is acquired in PythonThread
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39194
Differential Revision: D21789219
Pulled By: malfet
fbshipit-source-id: d9b5e74d5ddeb2474b575af2e4f33d022efcfe53
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36606
This PR refactors the continuation logic of the async mode in the autograd
engine, to avoid launching spinning work. To achieve that:
1. remove the continuation logic in
`execute_graph_task_with_continuation`
2. separate the usage of execute_graph_task between dist_engine and the
local engine; dist_engine now universally uses
`execute_graph_task_until_ready_queue_empty` (a better name appreciated
here)
3. remove enqueue_blocked_task_on_cpu
4. remove the async mode in `execute_with_graph_task` as we don't need
it in dist_engine
Test Plan: Imported from OSS
Differential Revision: D21032731
Pulled By: wanchaol
fbshipit-source-id: 708ea3bc14815bdc151b56afa15eb85b4ac0f4b1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37061
This PR refactors:
1. `set_device`, to move it out of Engine
2. put `graph_task_completed` into GraphTask
3. put `mark_graph_task_completed` into GraphTask
This also makes it easy for the distributed engine to call those functions.
Test Plan: Imported from OSS
Differential Revision: D21188688
Pulled By: wanchaol
fbshipit-source-id: f56106e6ed7d966cfa4d962781c7865cc3c5321d
Summary:
# Goals
Do the following things during a distributed backward pass.
1. Accumulate the gradient of a variable to RPC context once the gradient is ready instead of at the very end of the backward pass.
2. Run post/pre hooks installed in `AccumulateGrad` nodes once the gradient is ready for the variable. Currently, the hooks in `AccumulateGrad` are not executed, simply because the `AccumulateGrad` function itself is never evaluated by the local engine.
3. Make it extensible to support post hooks installed by DDP's reducer.
# Introduce GradCapturePreHook
## Why do we need this?
### Root issue:
* dist engine uses the autograd.grad-like API on the vanilla engine and then in the Future callback populates the context with the gradients. This is a bad emulation of the .backward() call on the vanilla engine.
### Practical issue:
* The leaf's hooks are not called (because they are associated with the AccumulateGrad node, which is not called by the autograd.grad-like API). Modules like DDP rely on these hooks.
* The Future is marked as completed before the context is actually populated with the grads, leading to unexpected behavior on the user side.
* The Future callback is only called at the very end of the backward pass, which is too late for DDP if it wants to overlap compute and transfer.
### Proposed solution:
* Provide hooks in the autograd.grad-like API that will allow the distributed engine to populate the context and call the hooks to better emulate the .backward call.
## Who can install a grad capture pre-hook?
This will be an internal hook at the C++ level and it won't be exposed to Python code. Only call-sites directly interacting with the local engine can install such hooks.
## Signature
The returned `grad` will be captured.
```
virtual const torch::Tensor& operator()(const torch::Tensor& grads) = 0;
```
## Where are hooks installed?
Grad capture pre-hooks are installed in GraphTask::ExecInfo::Capture. ExecInfo is per node. Every backward run will have its own GraphTask instance.
## When/How will hooks be called?
When the local engine captures the grads for a node, all grad capture pre hooks are called one by one in the order they are added. The output grads of the hooks will replace the original grads.
The output of the last hook will be used for grad capturing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34501
Test Plan:
All existing tests should pass.
```
python setup.py develop
python test/distributed/rpc/test_dist_autograd_spawn.py DistAutogradTestWithSpawn.test_post_hooks
```
Differential Revision: D20953673
Pulled By: hczhu
fbshipit-source-id: 543b3844823330ea9f9856bab7c5cb2679290a53
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36856
Previously, we could early-exit mark_graph_task_completed() without the future
actually being fully complete - we were only guaranteeing that it was at least
in the process of being marked complete.
This seems to be triggering an assert on graph_task->future_result_->completed().
This change simply adds a 1-line waitNoThrow() call to ensure that the future
has been marked complete before exiting the mark_graph_task_completed() function.
The cost is relatively reasonable, since this isn't the common path.
ghstack-source-id: 102423589
Test Plan: buck test mode/dev-nosan caffe2/test/...
Differential Revision: D21104121
fbshipit-source-id: 51c1554618880fe80d52d5eb96716abc15f6be8a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36745
As we hold a mutex for our custom C++ Node, when calling reentrant
backward from a custom C++ function we will be concurrently holding many
mutexes, up to MAX_DEPTH of them. TSAN only allows 65 mutexes held at once,
otherwise it will complain. This PR lowers the limit accordingly for TSAN.
TSAN Reference: https://github.com/google/sanitizers/issues/950
Test Plan: Imported from OSS
Differential Revision: D21072604
Pulled By: wanchaol
fbshipit-source-id: 99cd1acab41a203d834fa4947f4e6f0ffd2e70f2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36640
We had the following race when two threads entered
'mark_graph_task_completed'.
1) Thread 1 grabs the graph_task mutex first and moves captured_vars_ to its
local 'vars'.
2) Thread 1 releases the lock.
3) Thread 2 grabs the mutex and moves an empty captured_vars_ to its local
'vars'.
4) Thread 2 now proceeds to call 'markCompleted' with empty grads.
5) Thread 1 which actually has the right grads never gets to set the grads on
the future since future_completed_ is set to True by Thread 2.
Discovered this while running our RNN example:
https://github.com/pytorch/examples/tree/master/distributed/rpc/rnn and
verified this PR fixes the race.
ghstack-source-id: 102237850
Test Plan: waitforbuildbot
Differential Revision: D21035196
fbshipit-source-id: 1963826194d466b93f19e8016b38e4f9cad47720
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35101
TSAN is noting a lock-order-inversion in the context of dist autograd because
we're holding a lock when GraphTask calls markCompleted() on the relevant futureResult_.
Add an atomic bool to make it possible to protect this without holding the mutex,
and also fix the alignment of a few struct vars.
ghstack-source-id: 101805283
Test Plan: buck test mode/opt-tsan //caffe2/test/distributed/rpc:dist_autograd_spawn_thrift
Differential Revision: D20553517
fbshipit-source-id: 446e3718dd68876bd312166ecceed1d92868ce4e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35523
In this PR we extend ThreadLocalState to cover dispatch keys and
ThreadLocalDebugInfo and move it from JIT interpreter down to
thread management (at::launch) and autograd (backward threads) code
Test Plan: unit tests (CI)
Reviewed By: dzhulgakov
Differential Revision: D20615714
fbshipit-source-id: 16a9fc96a25cb6c2629230b1187fbf78786ac565
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35599
Before https://github.com/pytorch/pytorch/pull/33157 we didn't check whether the
ready queue was empty, because the CPU worker's queue might not be empty; after
#33157 we check whether the owner thread's ready_queue is empty after inline execution.
This does not always hold true. Imagine the following case:
a CPU thread that calls backward() and a GPU device thread, with a graph like
GraphRoot(CPU) -> ComputeNode(GPU)
In both thread_main calls they decrement `--local_graph_task->outstanding_tasks_` to zero together, and then both threads enter `if (graph_task_completed(local_graph_task))`. The CPU thread breaks out, finishes, and checks whether local_ready_queue is empty, while the GPU thread sends a dummy task to the CPU thread's ready queue because it thinks the graph_task finished on its own thread (it actually finished on both threads together). So there are cases where a dummy task remains in the queue.
This happens very rarely and non-deterministically, but it might get triggered when we run many jobs in CI. Remove the check to fix the flakiness.
Test Plan: Imported from OSS
Differential Revision: D20739778
Pulled By: wanchaol
fbshipit-source-id: 75a671762650a188f44720625d53f0873617c684
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33157
This PR enables graph-level thread parallelism on CPU for the Autograd
Engine. It replaces https://github.com/pytorch/pytorch/pull/29574 because of
the drawbacks of task-level parallelism with the existing autograd
system.
Fixes https://github.com/pytorch/pytorch/issues/18333
The graph level parallelism on CPU design:
1. Remove the single CPU thread that was initialized in the Engine itself and allow
the owning thread (which calls Engine::execute) to drive the Engine
execution, so that outer threading can enable thread parallelism.
2. Maintain a separate ReadyQueue per CPU thread, and stash the
ReadyQueue for different devices/threads into a thread-local
shared_ptr; the Engine itself remembers the shared_ptr of the
ReadyQueue for the different devices (other than CPU).
3. The CPU thread-local ReadyQueue is initialized per CPU-thread
Engine::execute call (or `backward()`/`grad()` call), and its shared_ptr is
stored in the GraphTask, since every `backward()` call has
its own GraphTask.
4. Cross-device NodeTask push is accomplished by 2 and 3: we can refer
to a device's ReadyQueue from the Engine, and the CPU's ReadyQueue from the
GraphTask, which means we can push to a different ReadyQueue
according to the device.
5. Termination of the CPU thread: if we mark the graph_task as
completed, we exit the while loop and terminate the current
backward execution, because it's guaranteed that all other NodeTasks
are finished before we mark a GraphTask as complete.
6. Re-entrant thread logic stays the same; reentrant thread detection is
similar to before: we set the worker_device to NO_DEVICE initially
and set it to CPU afterward to detect whether this is a reentrant call.
7. We still have the reentrant thread pool that creates new threads in
deep reentrant cases, and reuses the parent thread's ReadyQueue
for performance.
Since we introduce thread parallelism on CPU, we have to ensure the
thread safety of the GraphTask. This is not a problem if we execute all
forwards in different threads, since we will build separate GraphTasks in
different threads, and each GraphTask is a separate instance that shares
nothing, i.e. Hogwild training on CPU should be fine in this case.
But there might be cases where the user would like to do some part of the task in
a single thread, and do the rest of the work in several threads
concurrently, so thread safety is crucial in those cases. The thread
safety strategy for the multithreaded autograd is as follows:
1. Add a mutex to protect thread safety in an Autograd Node/Function, and
hold the lock for the different data-racing cases.
2. Lock the mutex during Node::apply(); this ensures that Nodes
writing to shared variables are not racing across threads (i.e.
AccumulateGrad and custom C++ Autograd Nodes that write to shared
variables).
3. Lock the mutex during Node::release_variables(); this serves the
purpose that when we release saved_variables from one thread, no
other thread can call Node::apply(), which ensures that variable
references from other threads aren't dangling.
4. If we don't release any variables and there is no shared data read/write in
the Node, i.e. it is purely functional, we don't lock the mutex.
This way we can protect thread safety on the Autograd Node, but we
still cannot protect thread safety on Node pre/post C++ hooks
(Python hooks are automatically thread safe); we rely on the user to
write thread-safe C++ hooks if they want the hooks to be correctly
applied in a multithreading environment.
**User visible changes**:
There are not many user-visible changes. Since we use the owning
thread to drive the autograd execution, users can write their own
threading code without blocking on the Autograd engine. Some behaviors
that users should be aware of:
**Non-determinism**:
If we call backward() on multiple threads concurrently but with
shared inputs (i.e. Hogwild CPU training): since parameters are automatically shared across threads, gradient accumulation might become non-deterministic across the concurrent backward calls, because two backward calls might access and try to accumulate into the same .grad attribute. This is technically not safe, and it might result in a race condition whose result is invalid to use.
But this is the expected pattern if users are using the multithreading
approach to drive the whole training process with shared
parameters; users who use multithreading should have the threading model
in mind and should expect this to happen. Users should use the functional
interface `torch.autograd.grad()` to calculate the gradients instead of
calling `backward()` on the loss.
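A minimal sketch of that Hogwild-style pattern (my own example): each thread does its own forward, and the functional API avoids writing to the shared `.grad`.
```
import threading
import torch

w = torch.randn(100, requires_grad=True)   # parameter shared by all threads

def worker():
    loss = (w * w).sum()
    # Recommended: the functional API returns the gradient without writing to
    # the shared w.grad. Calling loss.backward() here instead would make the
    # threads race while accumulating into the same .grad attribute.
    (gw,) = torch.autograd.grad(loss, (w,))

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```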
**Graph retaining**:
If part of the autograd graph is shared between threads, i.e. we run the first
part of the forward in a single thread and then run the second part in multiple threads,
then the first part of the graph is shared. In this case different threads executing grad() or backward() on the same graph might
destroy the graph on the fly in one thread, and the
other thread will crash. We will error out to the user,
similar to calling `backward()` twice without `retain_graph=True`, and let the user know they should use `retain_graph=True`.
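And a sketch of the shared-graph case (my own example): the first part of the forward runs once in the main thread, and several threads backward through it, which requires `retain_graph=True`.
```
import threading
import torch

x = torch.randn(5, requires_grad=True)
y = x * x                 # first part of the forward, shared by all threads

def worker():
    loss = (y * 2).sum()  # second part, built per thread
    # Without retain_graph=True one thread may free the shared part of the
    # graph while another thread still needs it; the engine errors out then.
    (gx,) = torch.autograd.grad(loss, (x,), retain_graph=True)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```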
**TODOs**:
[ ] benchmark the PR with example models and datasets to demonstrate
the performance gain in CPU training
[ ] ensure that we don't regress the single thread autograd performance
**Follow ups**:
[ ] a correct and tight integration with distributed autograd
[ ] try to unify the thread pool between JIT and Autograd, and see if
there's unifying pattern that we could apply universally
Test Plan: Imported from OSS
Differential Revision: D20236771
Pulled By: wanchaol
fbshipit-source-id: 1e0bd4eec14ffebeffdb60b763b8d6f0e427eb64
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35066
Closes #24965
Prior to this commit, final_callbacks_ are cleared on exit of ANY
backward. When using reentrant backward, the last backward would
remove all callbacks from the engine. However, this might lead to
unexpected behavior. For example, the application could install
a final callback after the forward pass, expecting this callback to fire
when all gradients are ready. If there is a reentrant backward on
a subgraph, it would fire the callback and delete it on exit,
meaning that when fired, not all gradients are ready.
**Failed Attempt**
The 1st attempt was trying to move the callback to the GraphTask
in engine::execute(). However, this failed because more callbacks
could be installed during backward pass.
**Current Solution**
Final callbacks are stored as a member variable in the GraphTask.
* Insertion: use the thread_local current_graph_task to find the
target GraphTask, and append final callback.
* Deletion: final callbacks have the same lifetime as a GraphTask
* Execution: Use the GraphTask provided in the argument to find
final callbacks.
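A small hedged sketch of the behavior from Python, via the engine's private `queue_callback` interface (shown only for illustration; the exact usage here is my assumption, not code from this PR):
```
import torch
from torch.autograd import Variable

x = torch.ones(3, requires_grad=True)

def hook(grad):
    # Queue a final callback from inside a backward pass. After this PR the
    # callback lives on the current GraphTask, so a nested (reentrant) backward
    # elsewhere can no longer fire and clear it prematurely.
    Variable._execution_engine.queue_callback(lambda: print("all grads ready"))
    return grad

x.register_hook(hook)
(x * 2).sum().backward()
```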
Test Plan: Imported from OSS
Differential Revision: D20546474
Pulled By: mrshenli
fbshipit-source-id: d3f3449bb5af9f8703bcae63e6b52056cd535f11
Summary:
Because `this` must be valid while `Engine::main_thread` is running, at least for non-reentrant worker threads
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34529
Test Plan: Run `test_api --gtest-filter=ModulesTest.InstanceNorm1d` in a loop
Differential Revision: D20552717
Pulled By: malfet
fbshipit-source-id: a0197671db1b7b1499dda675e43e0826f368bf0d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34638
Fixes: https://github.com/pytorch/pytorch/issues/27643
This PR manages notifying workers in the event of a failure during distributed autograd. Gracefully handles propagating errors across all nodes in the backward pass and sets state in the local autograd engines accordingly.
(Note: this ignores all push blocking failures!)
Test Plan: Added 2 new tests checking errors when they are thrown in an intermediate node during distributed autograd. Ensured that all existing distributed autograd tests pass.
Differential Revision: D20164420
fbshipit-source-id: 3d4ed74230969ac70bb763f1b5b1c16d979f66a2
Summary:
Make sure that there could not be more than one instance of either `torch::autograd::Engine` or `torch::autograd::python::PythonEngine`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34567
Test Plan: CI
Differential Revision: D20390622
Pulled By: malfet
fbshipit-source-id: c90595032afc88f552dee52901361b58b282dc1a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33875
Fixes #33675.
I added a `current_node_name` argument to AnomalyMetadata::print_stack.
This is a mandatory arg because I found only one callsite and making it
a default arg on a virtual function can be confusing.
Test Plan:
- Tested locally:
https://gist.github.com/zou3519/09937387c83efc76e1700374d5c9c9d9
- I don't know how to add a test for this: the message is printed to
stderr but it isn't an exception nor a warning. I considered capturing
the stderr of a subprocess but that seems like asking for flakiness.
Differential Revision: D20349399
Pulled By: zou3519
fbshipit-source-id: 7585ddffe2bf9e1081f4028a9c44de783978a052
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33214
Distributed autograd had some custom logic in terms of how we
accumulated gradients. This was mostly done early on to enable basic
functionality. Although, in the long term we should merge this logic with what
we have in the local autograd engine. A lot of work has gone into ensuring we
accumulate grads correctly and efficiently and we should reuse that as a
starting point.
We can investigate if we need further custom logic for distributed autograd
later on if we need additional optimizations.
In this PR I've merged the gradient accumulation logic and also the gradient
hooks. As a result, now gradient hooks are called in distributed autograd as
well.
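A hedged sketch of what this enables (assumes an initialized RPC agent; not code from this PR): a hook registered on a parameter now also fires during a distributed backward pass.
```
import torch
import torch.distributed.autograd as dist_autograd

param = torch.ones(2, 2, requires_grad=True)
param.register_hook(lambda grad: print("hook saw grad of shape", grad.shape))

with dist_autograd.context() as context_id:
    loss = (param * 2).sum()
    # With the merged accumulation logic, the hook above is invoked during the
    # distributed backward, just as it would be for a local backward().
    dist_autograd.backward(context_id, [loss])
    grads = dist_autograd.get_gradients(context_id)
```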
ghstack-source-id: 99838019
Test Plan: waitforbuildbot
Differential Revision: D19843284
fbshipit-source-id: 7923d7e871fb6afd3e98dba7de96606264dcb5f3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33885
Fixes: #32835
Fixes: #5834
This cannot be combined with CUDA's implementation, as each of them requires its own `std::once_flag` as well as a different `forked_autograd_child` function. The CUDA version relays the error to the Python module, while autograd uses TORCH_CHECK to report the error to Python and C++.
Test Plan: Imported from OSS
Differential Revision: D20144024
Pulled By: VitalyFedyunin
fbshipit-source-id: e7cf30568fff5110e9df7fe5b23f18ed992fa17f