In preparation for adopting future rocblas library options, it is necessary to track when the backward pass of training is executing. The scope-based helper class `BackwardPassGuard` is provided to toggle this state.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71881
Approved by: https://github.com/albanD
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76366
caffe2 is not currently being built for XROS.
Test Plan: CI
Reviewed By: kimishpatel
Differential Revision: D35923922
fbshipit-source-id: 260dacadf0bd5b6bab7833a4ce81e896d280b053
(cherry picked from commit 8370b8dd2519d55a79fa8d45e7951ca8dc0b21a8)
This pull request enables accumulating gradients for CSR tensors.
Functions that work and are tested:
- tensor.abs()
- tensor.neg()
- tensor.conj_physical()
- torch.addmm
`torch.mm` also works, but tests will be added later.
In addition, this PR makes accessing strides, storage, and contiguity info on a CSR tensor throw an error.
`tensor.to_sparse_csr().to_sparse_csr()` was failing and is now fixed.
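A minimal sketch of the new behavior (assumed usage for illustration, not the PR's actual tests), accumulating gradients into a CSR tensor through `torch.addmm`, one of the functions listed above:
```python
import torch

# Hypothetical example: two backward passes accumulate gradients into a sparse CSR tensor.
a = torch.eye(3).to_sparse_csr().requires_grad_(True)
b = torch.randn(3, 3)
c = torch.zeros(3, 3)

torch.addmm(c, a, b).sum().backward()
torch.addmm(c, a, b).sum().backward()  # the second pass accumulates into a.grad
print(a.grad)
```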
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75435
Approved by: https://github.com/cpuhrsch
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72689
Fix https://github.com/pytorch/pytorch/issues/69839
Should we add a private Python binding to check whether the bad fork guard has been set, and add a test in CI to make sure that it is never set on our CPU-only CI build? Not sure how flaky that will be outside of CI for people who run a CPU build on a machine that has CUDA installed...
EDIT: it turns out we already had such tests in test_multiprocessing, so this should be tested and enforced now!
Test Plan: Imported from OSS
Reviewed By: soulitzer
Differential Revision: D34180243
Pulled By: albanD
fbshipit-source-id: 3284db52dcf4568362244b60e3c5657153e64fa4
(cherry picked from commit 6e23f7a33a)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72688
Refactor how we decide what to run on the CPU queue.
Lazy tensors are moved there because the Lazy device guard is always present and would otherwise make the number of devices 1 all the time (forcing the creation of a worker thread).
FYI wconstab you most likely don't care about this unless you ever use multiple Lazy devices?
This should slightly improve performance when running backward with Lazy tensors, as the work will be done in the main thread rather than a worker thread.
Test Plan: Imported from OSS
Reviewed By: soulitzer
Differential Revision: D34180245
Pulled By: albanD
fbshipit-source-id: 88c5d5bdd631ad01bf271d720d1eab69aba84fc0
(cherry picked from commit da7e9b902f)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72687
Not sure who would use that. It is not used in the code base as far as I can see. And I don't know of anyone working with the engine directly out of tree. So tentatively removing it.
Test Plan: Imported from OSS
Reviewed By: soulitzer
Differential Revision: D34180244
Pulled By: albanD
fbshipit-source-id: 678ba1c4a1cbd9a0458d33be97664d1e3d1bd86b
(cherry picked from commit 3968ca3a38)
Summary:
Since issue https://github.com/pytorch/pytorch/issues/59750 is fixed, this PR removes the workaround implemented for it on ROCm.
It also enables the hasPrimaryContext()-related PyTorch unit tests on ROCm.
cc: amathews-amd, jithunnair-amd
cc jeffdaily sunway513 jithunnair-amd ROCmSupport KyleCZH
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71146
Reviewed By: anjali411
Differential Revision: D33754615
Pulled By: albanD
fbshipit-source-id: b3a5c65a20c6d52d5f2ffc9e6f9628c819329b5d
(cherry picked from commit cfdd12166c)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69041
`TH_CONCAT_{N}` is still being used by THP, so I've moved that into
its own header, but all the compiled code is gone.
Test Plan: Imported from OSS
Reviewed By: anjali411
Differential Revision: D32872477
Pulled By: ngimel
fbshipit-source-id: 06c82d8f96dbcee0715be407c61dfc7d7e8be47a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68266
* Use `#if ... #endif` guards to adjust PyTorch internals for XROS
Test Plan: CI
Reviewed By: kkosik20
Differential Revision: D32190771
fbshipit-source-id: cce073dea53c2b5681d913321101cd83c6472019
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50209
This adds a new warning handler that stores all warnings in a shared
queue, which can be "replayed" at a later time and, crucially, on
another thread. Then, I use this inside the autograd engine to ensure
that warnings are processed by the handler registered on the main
thread.
For testing, I also add an operator that always warns in the backward
pass and test that the warning is a normal Python warning.
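The actual handler lives in C++; the following is only an illustrative Python sketch of the idea of buffering warnings raised on a worker thread and replaying them on the main thread (all names here are hypothetical):
```python
import threading
import warnings
from queue import SimpleQueue

class DelayedWarningHandler:
    """Buffer warnings raised on worker threads; replay them later on the main thread."""
    def __init__(self):
        self._queue = SimpleQueue()

    def warn(self, message):
        # Called from a worker thread: only record the warning.
        self._queue.put(message)

    def replay(self):
        # Called on the main thread: re-raise each stored warning normally,
        # so the main thread's warning filters and handlers see it.
        while not self._queue.empty():
            warnings.warn(self._queue.get())

handler = DelayedWarningHandler()
worker = threading.Thread(target=handler.warn, args=("warning raised during backward",))
worker.start()
worker.join()
handler.replay()
```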
cc ezyang albanD zou3519 gqchen pearu nikitaved soulitzer Lezcano Varal7
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66235
Reviewed By: ejguan
Differential Revision: D31505413
Pulled By: albanD
fbshipit-source-id: 1a7f60b038f55c20591c0748b9e86735b3fec2f9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65610
- Replace HIP_PLATFORM_HCC with USE_ROCM
- Don't rely on CUDA_VERSION or HIP_VERSION; use USE_ROCM and ROCM_VERSION instead.
- In the next PR:
  - Remove the mapping from CUDA_VERSION to HIP_VERSION and from CUDA to HIP in hipify.
  - HIP_PLATFORM_HCC is deprecated, so add HIP_PLATFORM_AMD to support HIP host code compilation on gcc.
cc jeffdaily sunway513 jithunnair-amd ROCmSupport amathews-amd
Reviewed By: jbschlosser
Differential Revision: D30909053
Pulled By: ezyang
fbshipit-source-id: 224a966ebf1aaec79beccbbd686fdf3d49267e06
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65235
1. Updated the legacy type checks in `torch/csrc/autograd/engine.cpp` to individually validate dtype, device, and layout equality between the grad and the tensor.
2. Removed the device field from `InputMetadata` since it is already available through the stored options. Also added `dtype()` and `layout()` methods to `InputMetadata`. Some call sites had to be updated due to the constructor change.
3. To fix https://github.com/pytorch/pytorch/issues/65016, added an `is_tensor_subclass` field in `InputMetadata` to skip the device check for grad and tensor when the tensor has the Python key set on it (tensor subclass).
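A hedged sketch of the kind of mismatch the stricter checks catch (the exact error message may differ): a custom Function whose backward returns a gradient with the wrong dtype should be rejected by the engine.
```python
import torch

class BadGrad(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x * 2

    @staticmethod
    def backward(ctx, grad_output):
        # Wrong dtype on purpose: the engine's metadata validation should reject this.
        return grad_output.double()

x = torch.randn(3, requires_grad=True)
try:
    BadGrad.apply(x).sum().backward()
except RuntimeError as err:
    print("rejected mismatched grad:", err)
```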
Test Plan: Imported from OSS
Reviewed By: jbschlosser
Differential Revision: D31117318
Pulled By: anjali411
fbshipit-source-id: 825401df98695c48bf9b320be54585f6aff500bd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65235
1. Updated the legacy type checks in `torch/csrc/autograd/engine.cpp` to individually validate dtype, device, and layout equality between the grad and the tensor.
2. Removed the device field from `InputMetadata` since it is already available through the stored options. Also added `dtype()` and `layout()` methods to `InputMetadata`. Some call sites had to be updated due to the constructor change.
3. To fix https://github.com/pytorch/pytorch/issues/65016, added an `is_tensor_subclass` field in `InputMetadata` to skip the device check for grad and tensor when the tensor has the Python key set on it (tensor subclass).
Test Plan: Imported from OSS
Reviewed By: pbelevich
Differential Revision: D31082693
Pulled By: anjali411
fbshipit-source-id: cb551cd438c6ca40b0f18a4d0009e0861cf0fd4e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63619
Adds a RECORD_FUNCTION with the function that is being evaluated as part
of backward execution. This has been useful in picking up some operations
in the backward pass that otherwise would not show up, for example custom
autograd functions implemented in C++.
ghstack-source-id: 137041723
Test Plan:
CI
benchmark:
buck run mode/opt //scripts/rvarm1/ddp:bench
Reviewed By: albanD
Differential Revision: D30439492
fbshipit-source-id: 955917770cdf2a2edb0303223ace710b668ba388
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63116
This PR removes the special flag to disable grad mode tracking on the ThreadLocalState and replaces it with an explicit setter that users can call.
This reduces the complexity of ThreadLocalState.
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D30388098
Pulled By: albanD
fbshipit-source-id: 85641b3d711179fb78ff6a41ed077548dc821a2f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63115
This actually changes behavior:
- callbacks now run with the proper grad mode even in worker threads
- the GraphTask's Future callbacks now run with the proper TLS when erroring out from a worker thread
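For context, a small sketch (using the internal `Variable._execution_engine.queue_callback` API, shown here purely for illustration) of the kind of end-of-backward callback whose thread-local context this change fixes:
```python
import torch
from torch.autograd import Variable

def final_callback():
    # With this change, such a callback sees the proper grad mode / TLS
    # even when it runs on an autograd worker thread.
    print("grad enabled inside callback:", torch.is_grad_enabled())

def hook(grad):
    # queue_callback must be called while a backward pass is in progress.
    Variable._execution_engine.queue_callback(final_callback)
    return grad

x = torch.randn(3, requires_grad=True)
x.register_hook(hook)
(x * 2).sum().backward()
```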
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D30388100
Pulled By: albanD
fbshipit-source-id: 7ae9c461c2f0040548dd9e1e314f25e8da0c2e67
Summary:
As the GoogleTest `TEST` macro, as well as `DEFINE_DISPATCH`, is non-compliant with the `cppcoreguidelines-avoid-non-const-global-variables` check, this change drops that check (the `.clang-tidy` edit) together with the per-line NOLINT suppressions for it.
All changes but the ones to `.clang-tidy` are generated using the following script:
```
for i in `find . -type f -iname "*.c*" -or -iname "*.h"|xargs grep cppcoreguidelines-avoid-non-const-global-variables|cut -f1 -d:|sort|uniq`; do sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" $i; done
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008
Reviewed By: driazati, r-barnes
Differential Revision: D29838584
Pulled By: malfet
fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
Summary:
The thread-local state of the backward thread is not aligned with the GraphTask's `thread_local_` when calling hooks in backward.
This is required for profiling the statistics of c10d operations in the `DistributedDataParallel` module.
Is there any concern with adding the thread-local state guard when calling the hooks in backward? ezyang
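A hedged sketch of the user-visible effect: thread-local state set on the main thread (here, the profiler) should now be visible to hooks that the engine invokes on backward worker threads.
```python
import torch

x = torch.randn(4, requires_grad=True)
y = x * 3
# This hook runs on an autograd worker thread during backward.
y.register_hook(lambda grad: grad * 2)

with torch.autograd.profiler.profile() as prof:
    y.sum().backward()

# With the thread-local state guard in place, the ops executed inside the hook
# are expected to show up in the profile as well.
print(prof.key_averages())
```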
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60067
Reviewed By: ezyang
Differential Revision: D29654599
Pulled By: albanD
fbshipit-source-id: 656c4f91017184fd40f1a184de24757a13387e37
Summary:
Before https://github.com/pytorch/pytorch/pull/57833, calls to backward() or grad() synced only the calling thread's default stream with autograd leaf streams at the end of backward. This made the following weird pattern safe:
```python
with torch.cuda.stream(s):
# imagine forward used many streams, so backward leaf nodes may run on many streams
loss.backward()
# no sync
use grads
```
but a more benign-looking pattern was unsafe:
```python
with torch.cuda.stream(s):
# imagine forward used a lot of streams, so backward leaf nodes may run on many streams
loss.backward()
# backward() syncs the default stream with all the leaf streams, but does not sync s with anything,
# so counterintuitively (even though we're in the same stream context as backward()!)
# it is NOT SAFE to use grads here, and there's no easy way to make it safe,
# unless you manually sync on all the streams you used in forward,
# or move "use grads" back to default stream outside the context.
use grads
```
mruberry, ngimel, and I decided backward() should have the [same user-facing stream semantics as any cuda op](https://pytorch.org/docs/master/notes/cuda.html#stream-semantics-of-backward-passes).** In other words, the weird pattern should be unsafe, and the benign-looking pattern should be safe. Implementation-wise, this meant backward() should sync its calling thread's current stream, not the default stream, with the leaf streams.
After https://github.com/pytorch/pytorch/pull/57833, backward syncs the calling thread's current stream AND default stream with all leaf streams at the end of backward. The default stream syncs were retained for temporary backward compatibility.
This PR finishes https://github.com/pytorch/pytorch/pull/57833's work by deleting syncs on the default stream.
With this PR, graph-capturing an entire backward() call should be possible (see the [test_graph_grad_scaling diffs](https://github.com/pytorch/pytorch/compare/master...mcarilli:streaming_backwards_remove_default_syncs?expand=1#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450R3641-R3642)).
** first paragraph has a formatting error which this PR should also fix.
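With the new semantics, the recommended pattern when grads are consumed outside the stream context that ran backward() is to sync the consuming stream with that stream; a minimal sketch (requires a CUDA device):
```python
import torch

if torch.cuda.is_available():
    x = torch.randn(8, 8, device="cuda", requires_grad=True)
    s = torch.cuda.Stream()
    with torch.cuda.stream(s):
        (x * 2).sum().backward()
        # Safe: still on stream s, which backward() synced with.
        grad_inside = x.grad.clone()
    # Back on the default stream: wait on s before consuming the grads.
    torch.cuda.current_stream().wait_stream(s)
    grad_outside = x.grad.clone()
```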
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60421
Reviewed By: albanD
Differential Revision: D29370344
Pulled By: ngimel
fbshipit-source-id: 3248bc5fb92fc517db0c15c897e5d7250f67d7fe
Summary:
Before https://github.com/pytorch/pytorch/pull/57833, calls to backward() or grad() synced only the calling thread's default stream with autograd leaf streams at the end of backward. This made the following weird pattern safe:
```python
with torch.cuda.stream(s):
# imagine forward used many streams, so backward leaf nodes may run on many streams
loss.backward()
# no sync
use grads
```
but a more benign-looking pattern was unsafe:
```python
with torch.cuda.stream(s):
# imagine forward used a lot of streams, so backward leaf nodes may run on many streams
loss.backward()
# backward() syncs the default stream with all the leaf streams, but does not sync s with anything,
# so counterintuitively (even though we're in the same stream context as backward()!)
# it is NOT SAFE to use grads here, and there's no easy way to make it safe,
# unless you manually sync on all the streams you used in forward,
# or move "use grads" back to default stream outside the context.
use grads
```
mruberry, ngimel, and I decided backward() should have the [same user-facing stream semantics as any cuda op](https://pytorch.org/docs/master/notes/cuda.html#stream-semantics-of-backward-passes).** In other words, the weird pattern should be unsafe, and the benign-looking pattern should be safe. Implementation-wise, this meant backward() should sync its calling thread's current stream, not the default stream, with the leaf streams.
After https://github.com/pytorch/pytorch/pull/57833, backward syncs the calling thread's current stream AND default stream with all leaf streams at the end of backward. The default stream syncs were retained for temporary backward compatibility.
This PR finishes https://github.com/pytorch/pytorch/pull/57833's work by deleting syncs on the default stream.
With this PR, graph-capturing an entire backward() call should be possible (see the [test_graph_grad_scaling diffs](https://github.com/pytorch/pytorch/compare/master...mcarilli:streaming_backwards_remove_default_syncs?expand=1#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450R3641-R3642)).
** first paragraph has a formatting error which this PR should also fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60421
Reviewed By: VitalyFedyunin, albanD
Differential Revision: D29342234
Pulled By: ngimel
fbshipit-source-id: 98e6be7fdd8550872f0a78f9a66cb8dfe75abf63
Summary:
Fixes https://github.com/pytorch/pytorch/issues/59844.
Streaming backwards collects "leaf streams" for AccumulateGrad functions that stash or accumulate .grad attributes for autograd leaf tensors, and syncs those streams with some ambient stream(s) so later ops can safely consume the grads on the ambient stream(s).
But, currently, streaming backwards does not collect leaf streams for grads produced out-of-place (ie, not stashed onto a .grad attribute) by `torch.autograd.grad`, because these out-of-place grads are "captured" and returned before they reach an AccumulateGrad function. Some out-of-place grads might not even have an AccumulateGrad function to go to, because `torch.autograd.grad` can be told to make grads for non-leaf temporaries.[1]
The upshot is, when streaming backwards makes ops that produce out-of-place gradients run on side streams, no ambient stream is told to sync on these side streams, so `torch.autograd.grad` doesn't offer the same post-call safe-use guarantees for grads as the leaf accumulation of `torch.autograd.backward`.
This PR ensures `torch.autograd.grad` gives the same safe-use guarantees as `torch.autograd.backward` by also stashing leaf streams for grads created out-of-place.
I augmented a streaming backwards test to include a torch.autograd.grad attempt. The test fails on current master[2] and passes with the engine.cpp diffs.
I have no idea if this bug or its fix matter to distributed autograd. pritamdamania mrshenli should take a look before it's merged.
[1] example:
```python
leaf = torch.tensor(..., requires_grad=True)
tmp = leaf * 2
loss = tmp.sum()
torch.autograd.grad(loss, inputs=(tmp, leaf))
```
Technically, because `torch.autograd.grad` can be told to produce grads for non-leaf temporaries, these streams might NOT be "leaf streams". Maybe I should rename `leaf_streams`?
[2] the way the test currently fails is fun: it reports
```
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 0 element(s) (out of 25) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.0 (5.0 vs. 5.0), which occurred at index (0, 0).
```
I suspect this [kafka trap](https://en.wiktionary.org/wiki/Kafkatrap) happens because assertEqual does a comparison test on the device, syncs on some bool result, sees failure, and prints the tensors post-sync, at which point it IS safe to access the values.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60127
Reviewed By: mrshenli
Differential Revision: D29276581
Pulled By: albanD
fbshipit-source-id: a9f797e2fd76e2f884cce5a32ecf5d9b704c88ee
Summary:
The grad() function needs to return the updated values, and hence
needs a non-empty `inputs` argument to populate.
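A short sketch of the expected usage: `torch.autograd.grad` returns the gradients for the given inputs, so `inputs` must be non-empty.
```python
import torch

x = torch.randn(3, requires_grad=True)
y = (x ** 2).sum()
(gx,) = torch.autograd.grad(y, inputs=(x,))  # returns the gradient for x
# torch.autograd.grad(y, inputs=())          # an empty inputs sequence is rejected
print(gx)
```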
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52016
Test Plan:
Passes Python and C++ unit tests, and added new tests to catch this behavior.
Fixes https://github.com/pytorch/pytorch/issues/47061
Reviewed By: albanD
Differential Revision: D26406444
Pulled By: dagitses
fbshipit-source-id: 023aeca9a40cd765c5bad6a1a2f8767a33b75a1a
Summary:
Switches most of the simple for loops outside of `jit` directories to use `c10::irange`.
Generated with D28874212.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59481
Test Plan: Sandcastle
Reviewed By: ngimel
Differential Revision: D28909681
fbshipit-source-id: ec9ab1bd602933238d9d0f73d4d8d027b75d9d85
Summary:
Fixes https://github.com/pytorch/pytorch/issues/4661
- Add warnings in the engine's `execute` function so they can be triggered through both the C++ and Python codepaths
- Add an RAII guard version of `c10::Warning::set_warnAlways` and replace all prior usages of set_warnAlways with the new one
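As an illustration only (the specific warning shown here is an assumption, not named by the commit message), an engine-triggered warning can be observed through the normal Python warnings machinery; `backward(create_graph=True)` is one case that is expected to warn about the reference cycle it can create:
```python
import warnings
import torch

x = torch.randn(3, requires_grad=True)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    (x * x).sum().backward(create_graph=True)
print([str(w.message) for w in caught])
```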
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59412
Reviewed By: jbschlosser
Differential Revision: D28969294
Pulled By: soulitzer
fbshipit-source-id: b03369c926a3be18ce1cf363b39edd82a14245f0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59017
See the comment in ThreadLocal.h for context.
I used a slightly dirty preprocessor hack to minimize the number of changes.
The hope is that we'll be able to revert all of these soon.
Test Plan:
CI.
Built FB4A with gnustl and saw no references to cxa_thread_atexit
in the PyTorch libraries.
Reviewed By: ilia-cher
Differential Revision: D28720762
fbshipit-source-id: 0f13c7ac5a108b95f8fde6dbc63c6b8bdb8599de
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58420
In https://github.com/pytorch/pytorch/pull/57636 I migrated most uses of Future to an intrusive_ptr. I thought I had all of them but I missed a couple. These are the remaining ones. (The next PR will make it impossible to add new usages of shared_ptr).
ghstack-source-id: 129567071
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28477285
fbshipit-source-id: 75008276baa59e26b450e942c009ec7e78f89b13
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os


def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files


def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname, "-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit", "--all", "-m", f"NOLINT stubs for {fname}"])


def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)


if __name__ == "__main__":
    main()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892
Reviewed By: H-Huang
Differential Revision: D27991944
Pulled By: malfet
fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56808
For information about data-race-on-vptr in general, see https://www.internalfb.com/intern/wiki/TSAN/Common_Concurrency_Mistakes/Stopping_a_Thread_in_Destructor/
Engine::~Engine() was previously tasked with stopping the threads. This causes a data race on the object's vptr when PythonEngine is being destructed. This fixes the data race by making ~PythonEngine trigger the thread stopping before going down to the base class's destructor.
Test Plan:
Many tests are affected, but here's one example:
buck test mode/dev-tsan -c fbcode.tsan_strict_mode=true //oculus/research/orcoptics/deep_learning/srg_nn/tests:test_grating_net -- 'test_train (oculus.research.orcoptics.deep_learning.srg_nn.tests.test_grating_net.TestGratingNet)' --run-disabled
Reviewed By: walterddr, albanD
Differential Revision: D27972384
fbshipit-source-id: 8b70fec8d9326497c591a2777b355ea590a85082
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56174
evaluate_function:
1. calls the autograd function (call_function)
2. accumulates gradients into buffers
Previously, ThreadLocalStateGuard only covered part of `call_function`.
However, it should cover all Tensor operations in `evaluate_function`,
so this PR moves it to do so.
One alternative would have been to move ThreadLocalStateGuard to here:
71f9e99e29/torch/csrc/autograd/engine.cpp (L394)
Unfortunately that adds 2% additional instructions according to the
instruction count benchmark in the next section. This is because
`evaluate_function` does an early return:
71f9e99e29/torch/csrc/autograd/engine.cpp (L732-L735)
If this is preferred, please let me know.
Test Plan:
- run existing tests. It's hard to actually come up with a test case for
this.
Benchmark plan:
TL;DR: Instruction count decreases by a little after this PR.
```
import torch
from torch.utils.benchmark import Timer

timer = Timer(
    stmt="""\
torch::autograd::grad({y}, {x}, {}, /*retain_grad=*/true);""",
    setup="""\
auto x = torch::ones({}, torch::requires_grad());
auto y = x * 2;""",
    language="cpp")
stats = timer.collect_callgrind()
print(stats)
```
This gave the following:
```
Before:
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7f4b28ce6a90>
torch::autograd::grad({y}, {x}, {}, /*retain_grad=*/true);
setup:
auto x = torch::ones({}, torch::requires_grad());
auto y = x * 2;
All Noisy symbols removed
Instructions: 3514184 3514184
Baseline: 0 0
100 runs per measurement, 1 thread
After:
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7fdbc9d187d0>
torch::autograd::grad({y}, {x}, {}, /*retain_grad=*/true);
setup:
auto x = torch::ones({}, torch::requires_grad());
auto y = x * 2;
All Noisy symbols removed
Instructions: 3513884 3513884
Baseline: 0 0
100 runs per measurement, 1 thread
```
Reviewed By: albanD
Differential Revision: D27799283
Pulled By: zou3519
fbshipit-source-id: 0a8213824e08c04748d38e66604c73f395285d63
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53973
Two parts to this PR; I had to put them together because adding support for X causes more test code to be exercised, which in turn may require a fix for Y.
The first part is restoring the concept of storage to meta tensors. Previously, meta tensors had a nullptr storage (e.g., `meta_tensor.storage()` is an error.) As I was increasing the coverage of meta tensors, I started running into test cases (specifically memory overlap tests) that were failing because not having storage meant I couldn't check for memory overlap. After some discussion, we decided that it would make sense for meta tensors to model this as well (we already model strides, so getting accurate view information also seems useful). This PR does that by:
* Rewrite all of the factory functions in MetaTensor.cpp to use the generic versions (which are very carefully written to not actually poke at the data pointer, so everything works out). The key idea here is we give meta tensors a special allocator, MetaAllocator, which always returns a nullptr even if you ask for a nonzero number of bytes. resize_ is also made generic; the normal variant can be used directly rather than having to instruct it to avoid resizing storage
* Turn on memory overlap checking in TensorIterator even for meta tensors
* Although meta tensors now have storage, the concept of meta storage is NOT exposed to Python land (as it would imply I would have to codegen MetaFloatStorage, MetaDoubleStorage, etc. classes). So `x.storage()` still raises an error and I have a cludge in `__deepcopy__` to break storage sharing upon deep copy (this is wrong, but no tests exercise this at the moment).
The second part is adding more support for the most used functions in the test suite.
* Inplace operations have very simple meta functions. I added `fill_`, `zero_`, `random_`, `uniform_` and `normal_`. In the case of random, I take advantage of pbelevich's templates for defining random kernels, so that I can reuse the common scaffolding, and then just register a noop stub that actually does the RNG. (Look, another structured kernels tiny variant!)
* `copy_` is now implemented. Copying into a meta tensor is always OK, but copying out of a meta tensor raises an error (as we don't know what the "correct" data to copy out is in this case)
* `empty_strided` usage from structured kernels now is implemented (TBH, this could have been done as soon as `empty_strided` was added)
* Meta was missing in a few places in TensorOptions/DispatchKey utility functions, so I added them
* Autograd engine now correctly homes meta tensors with CPU tensors (they have -1 device index so CUDA queues wouldn't work anyway)
* `apply_`, `map_` and `map2_` are special cased to no-op on meta tensor self. These count as inplace operations too but they are implemented a little differently.
Getting more meta function support triggers a number of bugs in the test suite, which I then fix:
- Linear algebra functions sometimes don't report NotImplementedError because they get swallowed by catch-all try blocks. This is tracked in https://github.com/pytorch/pytorch/issues/53739
- dlpack obviously doesn't work with meta tensors, I just disabled the test
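A small sketch of the meta-tensor support described above (meta tensors carry sizes, strides, and storage metadata, but no real data):
```python
import torch

m = torch.empty_strided((4, 4), (4, 1), device="meta")
m.fill_(1.0)      # in-place ops are supported as no-ops on meta tensors
m.normal_()
dst = torch.empty(4, 4, device="meta")
dst.copy_(torch.randn(4, 4))     # copying *into* a meta tensor is allowed
# torch.randn(4, 4).copy_(dst)   # copying *out of* a meta tensor raises an error
print(m.shape, m.stride(), m.device)
```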
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D27036572
Test Plan: Imported from OSS
Reviewed By: agolynski, bdhirsh
Pulled By: ezyang
fbshipit-source-id: 7005ecf4feb92a643c37389fdfbd852dbf00ac78
Summary:
Fixes https://github.com/pytorch/pytorch/issues/12635
This change will help us speed up autograd's discovery algorithm in cases where we use `.grad` and try to "unroll" the training loop. For example, the example in the issue and the one in https://github.com/pytorch/pytorch/pull/52180#issuecomment-783400832 see a speed-up factor that grows without bound (a minimal sketch of this pattern appears after the benchmark results below).
We do this by adding a new sequence_nr-type numbering: for each node, we maintain the length of the longest path from it to any leaf node. How does this help us speed up discovery (dfs)? Previously the bottleneck was that the dfs that computes which nodes need to be executed always explored every node. With this change, before we run dfs, we first compute the minimum seq_nr among all the nodes passed as the `inputs`. If we let this be some number N, intuitively this means that dfs should stay at least N units away from any leaf node. So, if we find ourselves too close to any leaf node, we stop our search early.
Edit:
After some discussion offline, the plan is:
- make old sequence_nr a construct of the profiler. This means we can avoid accessing thread local state in cases where the profiler is disabled. Note that we cannot replace sequence_nr as-is because profiler's use-case requires that thread-id + sequence_nr can uniquely identify a given node in order for downstream users/programs to correlate nodes from backward and forward passes. This means we must maintain two sequence_nr's and that we have an extra field in Node.
- In a future PR, we can potentially remove sequence_nr entirely from the profiler as well, but we avoid doing it now because we haven't measured the impact, and it's a larger effort because we'd have to mess around with the dispatcher and profiler
Testing with this [code](https://gist.github.com/kyunghyuncho/5fb9991ce1233f909051854a84b7148e), we see that runtime no longer increases as we iterate.
Before:
```
100: Time taken: 0.47s, loss: 1.1e+06
200: Time taken: 0.064s, loss: 6.5e+05
300: Time taken: 0.088s, loss: 4.4e+05
400: Time taken: 0.1s, loss: 3.2e+05
500: Time taken: 0.12s, loss: 2.5e+05
600: Time taken: 0.15s, loss: 2e+05
700: Time taken: 0.18s, loss: 1.7e+05
800: Time taken: 0.2s, loss: 1.4e+05
900: Time taken: 0.22s, loss: 1.2e+05
1000: Time taken: 0.24s, loss: 1.1e+05
1100: Time taken: 0.27s, loss: 9.3e+04
1200: Time taken: 0.3s, loss: 8.3e+04
1300: Time taken: 0.34s, loss: 7.4e+04
1400: Time taken: 0.36s, loss: 6.7e+04
1500: Time taken: 0.38s, loss: 6.1e+04
1600: Time taken: 0.4s, loss: 5.6e+04
1700: Time taken: 0.42s, loss: 5.1e+04
1800: Time taken: 0.44s, loss: 4.7e+04
1900: Time taken: 0.47s, loss: 4.4e+04
2000: Time taken: 0.5s, loss: 4.1e+04
```
After:
```
100: Time taken: 0.49s, loss: 1.2e+06
200: Time taken: 0.031s, loss: 6.9e+05
300: Time taken: 0.031s, loss: 4.6e+05
400: Time taken: 0.031s, loss: 3.3e+05
500: Time taken: 0.031s, loss: 2.6e+05
600: Time taken: 0.031s, loss: 2.1e+05
700: Time taken: 0.031s, loss: 1.7e+05
800: Time taken: 0.031s, loss: 1.4e+05
900: Time taken: 0.031s, loss: 1.2e+05
1000: Time taken: 0.031s, loss: 1.1e+05
1100: Time taken: 0.031s, loss: 9.6e+04
1200: Time taken: 0.031s, loss: 8.6e+04
1300: Time taken: 0.031s, loss: 7.7e+04
1400: Time taken: 0.031s, loss: 7e+04
1500: Time taken: 0.031s, loss: 6.3e+04
1600: Time taken: 0.031s, loss: 5.8e+04
1700: Time taken: 0.031s, loss: 5.3e+04
1800: Time taken: 0.031s, loss: 4.9e+04
1900: Time taken: 0.031s, loss: 4.5e+04
2000: Time taken: 0.032s, loss: 4.2e+04
```
Testing w/ small graph to check for regression:
```
import torch
from torch.utils.benchmark import Timer
setup="""
a = torch.rand((2, 2), requires_grad=True)
b = torch.rand((2, 2), requires_grad=True)
gradient = torch.ones(2, 2)
"""
stmt="""
torch.autograd.grad(a*b, [a, b], gradient)
"""
timer = Timer(stmt, setup)
print(timer.timeit(10000))
print(timer.collect_callgrind(100))
```
Result: there doesn't seem to be any significant regression
```
Time before: 12.74 us
Time after: 13.12 us
Instruction count before:
All Noisy symbols removed
Instructions: 8078960 8000882
Baseline: 4226 3838
Instruction count after:
All Noisy symbols removed
Instructions: 8091846 8017940
Baseline: 4336 3838
100 runs per measurement, 1 thread
```
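For reference, a minimal sketch (assumed for illustration, simpler than the linked gist) of the "unrolled" `.grad` pattern whose discovery cost this change keeps bounded:
```python
import torch

x = torch.randn(16, requires_grad=True)
state = x
for step in range(100):
    # The autograd graph keeps growing across iterations...
    state = state * 1.001 + 0.1
    loss = state.sum()
    # ...but discovery no longer has to walk the whole history on every call.
    (gx,) = torch.autograd.grad(loss, x, retain_graph=True)
print(gx.shape)
```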
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52180
Reviewed By: gchanan, zhangguanheng66
Differential Revision: D26794387
Pulled By: soulitzer
fbshipit-source-id: c00d387a29f151109c33dc6f1b56a8f275cdec58
Summary:
Fixes https://github.com/pytorch/pytorch/issues/34067 by using https://github.com/pytorch/pytorch/issues/34426 by hczhu
In addition to removing the unnecessary any() we do also:
- Get rid of the outer loop since graph_root also needs to be checked
- Update the pseudocode description so it matches what the code does
- Add some comments explaining the difference between assigning `info.needed_` and `info.captures_` in terms of how that affects discovery
- [edit: another benefit is that exec_info entries are no longer created for all reachable nodes]
This PR is on top of https://github.com/pytorch/pytorch/issues/51940, so once that lands rebasing on top of master should get rid of the extra commits and changes
I'm not sure if this change will bring a lot of performance gains, but the main benefit is that the code is easier to read.
Trivial graph:
```
torch.autograd.grad(a*b, [a, b], gradient)
setup:
a = torch.rand((2, 2), requires_grad=True)
b = torch.rand((2, 2), requires_grad=True)
gradient = torch.ones(2, 2)
Timer before:
15.45 us
Time after:
14.33 us
1 measurement, 10000 runs , 1 thread
Instructions after:
All Noisy symbols removed
Instructions: 8271213 8193169
Baseline: 4244 3838
Instructions before:
All Noisy symbols removed
Instructions: 8142843 8054463
Baseline: 4280 3838
100 runs per measurement, 1 thread
```
Small graph:
```
torch.autograd.grad((b*a.exp()+a*b.exp()).sum(), (a, b))
setup:
a = torch.rand((2, 2), requires_grad=True)
b = torch.rand((2, 2), requires_grad=True)
Time before:
52.25 us
Time after:
50.80 us
1 measurement, 10000 runs , 1 thread
Instruction count before:
All Noisy symbols removed
Instructions: 25601257 25518229
Baseline: 4228 3838
Instruction count after:
All Noisy symbols removed
Instructions: 25606533 25522797
Baseline: 4228
100 runs per measurement, 1 thread
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52057
Reviewed By: ngimel
Differential Revision: D26432207
Pulled By: soulitzer
fbshipit-source-id: beef68344d66e9e286378e31e3311ba43c25c749
Summary:
Fixes https://github.com/pytorch/pytorch/issues/39784
At the time the issue was filed, there was only issue (1) below.
There are actually now two issues here:
1. We always set all inputs passed in through `inputs` arg as `needed = True` in exec_info. So if we pass in an input that has a grad_fn that is not materialized, we create an entry of exec_info with nullptr as key with `needed = True`. Coincidentally, when we perform simple arithmetic operations, such as "2 * x", one of the next edges of mul is an invalid edge, meaning that its grad_fn is also nullptr. This causes the discovery algorithm to set all grad_fns that have a path to this invalid_edge as `needed = True`.
2. Before the commit that made the engine skip the dummy node, we knew that the root node is always needed, i.e., we hardcoded `exec_info[&graph_root]=true`. The issue was that this logic wasn't updated after the code was changed to skip the graph root.
To address (1), instead of passing in an invalid edge if an input in `inputs` has no grad_fn, we create a dummy grad_fn. This is done in both python and cpp entry points. The alternative is to add logic for both backward() and grad() cases to check whether the grad_fn is nullptr and set needed=false in that case (the .grad() case would be slightly more complicated than the .backward() case here).
For (2), we perform one final iteration of the discovery algorithm so that we really know whether we need to execute the graph root.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51940
Reviewed By: VitalyFedyunin
Differential Revision: D26369529
Pulled By: soulitzer
fbshipit-source-id: 14a01ae7988a8de621b967a31564ce1d7a00084e
Summary:
This solves a race condition where the worker thread might
see a partially initialized graph_task
Fixes https://github.com/pytorch/pytorch/issues/49652
I don't know how to reliably trigger the race so I didn't add any test. But the ROCm build flakiness (it just happens to race more often on ROCm builds) should disappear after this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50164
Reviewed By: zou3519
Differential Revision: D25824954
Pulled By: albanD
fbshipit-source-id: 6a3391753cb2afd2ab415d3fb2071a837cc565bb