In preparation for adopting future rocblas library options, it is necessary to track when the backward pass of training is executing. The scope-based helper class `BackwardPassGuard` is provided to toggle this state.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71881
Approved by: https://github.com/albanD
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76366
caffe2 is not currently being built for XROS.
Test Plan: CI
Reviewed By: kimishpatel
Differential Revision: D35923922
fbshipit-source-id: 260dacadf0bd5b6bab7833a4ce81e896d280b053
(cherry picked from commit 8370b8dd2519d55a79fa8d45e7951ca8dc0b21a8)
This pull request enables accumulating gradients for CSR tensors.
Functions that work and are tested:
- tensor.abs()
- tensor.neg()
- tensor.conj_physical()
- torch.addmm
`torch.mm` also works, but tests will be added later.
In addition, this PR makes accessing strides, storage, and contiguity info on a CSR tensor throw an error.
`tensor.to_sparse_csr().to_sparse_csr()` was failing and is now fixed.
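A minimal sketch of the new behavior (assumed usage for illustration, not the PR's actual tests), accumulating gradients into a CSR tensor through `torch.addmm`, one of the functions listed above:
```python
import torch

# Hypothetical example: two backward passes accumulate gradients into a sparse CSR tensor.
a = torch.eye(3).to_sparse_csr().requires_grad_(True)
b = torch.randn(3, 3)
c = torch.zeros(3, 3)

torch.addmm(c, a, b).sum().backward()
torch.addmm(c, a, b).sum().backward()  # the second pass accumulates into a.grad
print(a.grad)
```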
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75435
Approved by: https://github.com/cpuhrsch
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72689
Fix https://github.com/pytorch/pytorch/issues/69839
Should we add a private Python binding to check whether the bad fork guard has been set, and add a test in CI to make sure that it is never set on our CPU-only CI build? Not sure how flaky that will be outside of CI for people who run a CPU build on a machine that has CUDA installed...
EDIT: it turns out we already had such tests in test_multiprocessing, so this should be tested and enforced now!
Test Plan: Imported from OSS
Reviewed By: soulitzer
Differential Revision: D34180243
Pulled By: albanD
fbshipit-source-id: 3284db52dcf4568362244b60e3c5657153e64fa4
(cherry picked from commit 6e23f7a33a)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72688
Refactor how we decide what to run on the CPU queue.
Lazy tensors are moved there because the Lazy device guard is always present and would otherwise make the number of devices 1 all the time (forcing the creation of a worker thread).
FYI wconstab you most likely don't care about this unless you ever use multiple Lazy devices?
This should slightly improve performance when running backward with Lazy tensors, as the work will be done in the main thread rather than a worker thread.
Test Plan: Imported from OSS
Reviewed By: soulitzer
Differential Revision: D34180245
Pulled By: albanD
fbshipit-source-id: 88c5d5bdd631ad01bf271d720d1eab69aba84fc0
(cherry picked from commit da7e9b902f)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72687
Not sure who would use that. It is not used in the code base as far as I can see. And I don't know of anyone working with the engine directly out of tree. So tentatively removing it.
Test Plan: Imported from OSS
Reviewed By: soulitzer
Differential Revision: D34180244
Pulled By: albanD
fbshipit-source-id: 678ba1c4a1cbd9a0458d33be97664d1e3d1bd86b
(cherry picked from commit 3968ca3a38)
Summary:
Since issue https://github.com/pytorch/pytorch/issues/59750 is fixed, this PR removes the workaround implemented for it on ROCm.
It also enables the hasPrimaryContext()-related PyTorch unit tests on ROCm.
cc: amathews-amd, jithunnair-amd
cc jeffdaily sunway513 jithunnair-amd ROCmSupport KyleCZH
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71146
Reviewed By: anjali411
Differential Revision: D33754615
Pulled By: albanD
fbshipit-source-id: b3a5c65a20c6d52d5f2ffc9e6f9628c819329b5d
(cherry picked from commit cfdd12166c)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69041
`TH_CONCAT_{N}` is still being used by THP, so I've moved that into
its own header, but all the compiled code is gone.
Test Plan: Imported from OSS
Reviewed By: anjali411
Differential Revision: D32872477
Pulled By: ngimel
fbshipit-source-id: 06c82d8f96dbcee0715be407c61dfc7d7e8be47a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68266
* Use `#if ... #endif` guards to adjust PyTorch internals for XROS
Test Plan: CI
Reviewed By: kkosik20
Differential Revision: D32190771
fbshipit-source-id: cce073dea53c2b5681d913321101cd83c6472019
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50209
This adds a new warning handler that stores all warnings in a shared
queue, which can be "replayed" at a later time and, crucially, on
another thread. Then, I use this inside the autograd engine to ensure
that warnings are processed by the handler registered on the main
thread.
For testing, I also add an operator that always warns in the backward
pass and test that the warning is a normal Python warning.
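The actual handler lives in C++; the following is only an illustrative Python sketch of the idea of buffering warnings raised on a worker thread and replaying them on the main thread (all names here are hypothetical):
```python
import threading
import warnings
from queue import SimpleQueue

class DelayedWarningHandler:
    """Buffer warnings raised on worker threads; replay them later on the main thread."""
    def __init__(self):
        self._queue = SimpleQueue()

    def warn(self, message):
        # Called from a worker thread: only record the warning.
        self._queue.put(message)

    def replay(self):
        # Called on the main thread: re-raise each stored warning normally,
        # so the main thread's warning filters and handlers see it.
        while not self._queue.empty():
            warnings.warn(self._queue.get())

handler = DelayedWarningHandler()
worker = threading.Thread(target=handler.warn, args=("warning raised during backward",))
worker.start()
worker.join()
handler.replay()
```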
cc ezyang albanD zou3519 gqchen pearu nikitaved soulitzer Lezcano Varal7
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66235
Reviewed By: ejguan
Differential Revision: D31505413
Pulled By: albanD
fbshipit-source-id: 1a7f60b038f55c20591c0748b9e86735b3fec2f9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65610
- Replace HIP_PLATFORM_HCC with USE_ROCM
- Don't rely on CUDA_VERSION or HIP_VERSION; use USE_ROCM and ROCM_VERSION instead.
- In the next PR:
  - Remove the mapping from CUDA_VERSION to HIP_VERSION and from CUDA to HIP in hipify.
  - HIP_PLATFORM_HCC is deprecated, so add HIP_PLATFORM_AMD to support HIP host code compilation on gcc.
cc jeffdaily sunway513 jithunnair-amd ROCmSupport amathews-amd
Reviewed By: jbschlosser
Differential Revision: D30909053
Pulled By: ezyang
fbshipit-source-id: 224a966ebf1aaec79beccbbd686fdf3d49267e06
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65235
1. Updated the legacy type checks in `torch/csrc/autograd/engine.cpp` to individually validate dtype, device, and layout equality between the grad and the tensor.
2. Removed the device field from `InputMetadata` since it is already available through the stored options. Also added `dtype()` and `layout()` methods to `InputMetadata`. Some call sites had to be updated due to the constructor change.
3. To fix https://github.com/pytorch/pytorch/issues/65016, added an `is_tensor_subclass` field in `InputMetadata` to skip the device check for grad and tensor when the tensor has the Python key set on it (tensor subclass).
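A hedged sketch of the kind of mismatch the stricter checks catch (the exact error message may differ): a custom Function whose backward returns a gradient with the wrong dtype should be rejected by the engine.
```python
import torch

class BadGrad(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x * 2

    @staticmethod
    def backward(ctx, grad_output):
        # Wrong dtype on purpose: the engine's metadata validation should reject this.
        return grad_output.double()

x = torch.randn(3, requires_grad=True)
try:
    BadGrad.apply(x).sum().backward()
except RuntimeError as err:
    print("rejected mismatched grad:", err)
```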
Test Plan: Imported from OSS
Reviewed By: jbschlosser
Differential Revision: D31117318
Pulled By: anjali411
fbshipit-source-id: 825401df98695c48bf9b320be54585f6aff500bd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65235
1. Updated the legacy type checks in `torch/csrc/autograd/engine.cpp` to individually validate dtype, device, and layout equality between the grad and the tensor.
2. Removed the device field from `InputMetadata` since it is already available through the stored options. Also added `dtype()` and `layout()` methods to `InputMetadata`. Some call sites had to be updated due to the constructor change.
3. To fix https://github.com/pytorch/pytorch/issues/65016, added an `is_tensor_subclass` field in `InputMetadata` to skip the device check for grad and tensor when the tensor has the Python key set on it (tensor subclass).
Test Plan: Imported from OSS
Reviewed By: pbelevich
Differential Revision: D31082693
Pulled By: anjali411
fbshipit-source-id: cb551cd438c6ca40b0f18a4d0009e0861cf0fd4e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63619
Adds a RECORD_FUNCTION with the function that is being evaluated as part
of backward execution. This has been useful in picking up some operations
in the backward pass that otherwise would not show up, for example custom
autograd functions implemented in C++.
ghstack-source-id: 137041723
Test Plan:
CI
benchmark:
buck run mode/opt //scripts/rvarm1/ddp:bench
Reviewed By: albanD
Differential Revision: D30439492
fbshipit-source-id: 955917770cdf2a2edb0303223ace710b668ba388
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63116
This PR removes the special flag to disable grad mode tracking on the ThreadLocalState and replaces it with an explicit setter that users can call.
This reduces the complexity of ThreadLocalState.
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D30388098
Pulled By: albanD
fbshipit-source-id: 85641b3d711179fb78ff6a41ed077548dc821a2f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63115
This actually changes behavior:
- callbacks now run with the proper grad mode even in worker threads
- the GraphTask's Future callbacks now run with the proper TLS when erroring out from a worker thread
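For context, a small sketch (using the internal `Variable._execution_engine.queue_callback` API, shown here purely for illustration) of the kind of end-of-backward callback whose thread-local context this change fixes:
```python
import torch
from torch.autograd import Variable

def final_callback():
    # With this change, such a callback sees the proper grad mode / TLS
    # even when it runs on an autograd worker thread.
    print("grad enabled inside callback:", torch.is_grad_enabled())

def hook(grad):
    # queue_callback must be called while a backward pass is in progress.
    Variable._execution_engine.queue_callback(final_callback)
    return grad

x = torch.randn(3, requires_grad=True)
x.register_hook(hook)
(x * 2).sum().backward()
```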
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D30388100
Pulled By: albanD
fbshipit-source-id: 7ae9c461c2f0040548dd9e1e314f25e8da0c2e67
Summary:
As the GoogleTest `TEST` macro, as well as `DEFINE_DISPATCH`, is non-compliant with the `cppcoreguidelines-avoid-non-const-global-variables` check, this change drops that check (the `.clang-tidy` edit) together with the per-line NOLINT suppressions for it.
All changes but the ones to `.clang-tidy` are generated using the following script:
```
for i in `find . -type f -iname "*.c*" -or -iname "*.h"|xargs grep cppcoreguidelines-avoid-non-const-global-variables|cut -f1 -d:|sort|uniq`; do sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" $i; done
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008
Reviewed By: driazati, r-barnes
Differential Revision: D29838584
Pulled By: malfet
fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
Summary:
The thread-local state of the backward thread is not aligned with the GraphTask's `thread_local_` when calling hooks in backward.
This is required for profiling the statistics of c10d operations in the `DistributedDataParallel` module.
Is there any concern with adding the thread-local state guard when calling the hooks in backward? ezyang
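A hedged sketch of the user-visible effect: thread-local state set on the main thread (here, the profiler) should now be visible to hooks that the engine invokes on backward worker threads.
```python
import torch

x = torch.randn(4, requires_grad=True)
y = x * 3
# This hook runs on an autograd worker thread during backward.
y.register_hook(lambda grad: grad * 2)

with torch.autograd.profiler.profile() as prof:
    y.sum().backward()

# With the thread-local state guard in place, the ops executed inside the hook
# are expected to show up in the profile as well.
print(prof.key_averages())
```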
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60067
Reviewed By: ezyang
Differential Revision: D29654599
Pulled By: albanD
fbshipit-source-id: 656c4f91017184fd40f1a184de24757a13387e37
Summary:
Before https://github.com/pytorch/pytorch/pull/57833, calls to backward() or grad() synced only the calling thread's default stream with autograd leaf streams at the end of backward. This made the following weird pattern safe:
```python
with torch.cuda.stream(s):
# imagine forward used many streams, so backward leaf nodes may run on many streams
loss.backward()
# no sync
use grads
```
but a more benign-looking pattern was unsafe:
```python
with torch.cuda.stream(s):
# imagine forward used a lot of streams, so backward leaf nodes may run on many streams
loss.backward()
# backward() syncs the default stream with all the leaf streams, but does not sync s with anything,
# so counterintuitively (even though we're in the same stream context as backward()!)
# it is NOT SAFE to use grads here, and there's no easy way to make it safe,
# unless you manually sync on all the streams you used in forward,
# or move "use grads" back to default stream outside the context.
use grads
```
mruberry, ngimel, and I decided backward() should have the [same user-facing stream semantics as any cuda op](https://pytorch.org/docs/master/notes/cuda.html#stream-semantics-of-backward-passes).** In other words, the weird pattern should be unsafe, and the benign-looking pattern should be safe. Implementation-wise, this meant backward() should sync its calling thread's current stream, not the default stream, with the leaf streams.
After https://github.com/pytorch/pytorch/pull/57833, backward syncs the calling thread's current stream AND default stream with all leaf streams at the end of backward. The default stream syncs were retained for temporary backward compatibility.
This PR finishes https://github.com/pytorch/pytorch/pull/57833's work by deleting syncs on the default stream.
With this PR, graph-capturing an entire backward() call should be possible (see the [test_graph_grad_scaling diffs](https://github.com/pytorch/pytorch/compare/master...mcarilli:streaming_backwards_remove_default_syncs?expand=1#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450R3641-R3642)).
** first paragraph has a formatting error which this PR should also fix.
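With the new semantics, the recommended pattern when grads are consumed outside the stream context that ran backward() is to sync the consuming stream with that stream; a minimal sketch (requires a CUDA device):
```python
import torch

if torch.cuda.is_available():
    x = torch.randn(8, 8, device="cuda", requires_grad=True)
    s = torch.cuda.Stream()
    with torch.cuda.stream(s):
        (x * 2).sum().backward()
        # Safe: still on stream s, which backward() synced with.
        grad_inside = x.grad.clone()
    # Back on the default stream: wait on s before consuming the grads.
    torch.cuda.current_stream().wait_stream(s)
    grad_outside = x.grad.clone()
```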
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60421
Reviewed By: albanD
Differential Revision: D29370344
Pulled By: ngimel
fbshipit-source-id: 3248bc5fb92fc517db0c15c897e5d7250f67d7fe
Summary:
Before https://github.com/pytorch/pytorch/pull/57833, calls to backward() or grad() synced only the calling thread's default stream with autograd leaf streams at the end of backward. This made the following weird pattern safe:
```python
with torch.cuda.stream(s):
# imagine forward used many streams, so backward leaf nodes may run on many streams
loss.backward()
# no sync
use grads
```
but a more benign-looking pattern was unsafe:
```python
with torch.cuda.stream(s):
# imagine forward used a lot of streams, so backward leaf nodes may run on many streams
loss.backward()
# backward() syncs the default stream with all the leaf streams, but does not sync s with anything,
# so counterintuitively (even though we're in the same stream context as backward()!)
# it is NOT SAFE to use grads here, and there's no easy way to make it safe,
# unless you manually sync on all the streams you used in forward,
# or move "use grads" back to default stream outside the context.
use grads
```
mruberry, ngimel, and I decided backward() should have the [same user-facing stream semantics as any cuda op](https://pytorch.org/docs/master/notes/cuda.html#stream-semantics-of-backward-passes).** In other words, the weird pattern should be unsafe, and the benign-looking pattern should be safe. Implementation-wise, this meant backward() should sync its calling thread's current stream, not the default stream, with the leaf streams.
After https://github.com/pytorch/pytorch/pull/57833, backward syncs the calling thread's current stream AND default stream with all leaf streams at the end of backward. The default stream syncs were retained for temporary backward compatibility.
This PR finishes https://github.com/pytorch/pytorch/pull/57833's work by deleting syncs on the default stream.
With this PR, graph-capturing an entire backward() call should be possible (see the [test_graph_grad_scaling diffs](https://github.com/pytorch/pytorch/compare/master...mcarilli:streaming_backwards_remove_default_syncs?expand=1#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450R3641-R3642)).
** first paragraph has a formatting error which this PR should also fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60421
Reviewed By: VitalyFedyunin, albanD
Differential Revision: D29342234
Pulled By: ngimel
fbshipit-source-id: 98e6be7fdd8550872f0a78f9a66cb8dfe75abf63
Summary:
Fixes https://github.com/pytorch/pytorch/issues/59844.
Streaming backwards collects "leaf streams" for AccumulateGrad functions that stash or accumulate .grad attributes for autograd leaf tensors, and syncs those streams with some ambient stream(s) so later ops can safely consume the grads on the ambient stream(s).
But, currently, streaming backwards does not collect leaf streams for grads produced out-of-place (ie, not stashed onto a .grad attribute) by `torch.autograd.grad`, because these out-of-place grads are "captured" and returned before they reach an AccumulateGrad function. Some out-of-place grads might not even have an AccumulateGrad function to go to, because `torch.autograd.grad` can be told to make grads for non-leaf temporaries.[1]
The upshot is, when streaming backwards makes ops that produce out-of-place gradients run on side streams, no ambient stream is told to sync on these side streams, so `torch.autograd.grad` doesn't offer the same post-call safe-use guarantees for grads as the leaf accumulation of `torch.autograd.backward`.
This PR ensures `torch.autograd.grad` gives the same safe-use guarantees as `torch.autograd.backward` by also stashing leaf streams for grads created out-of-place.
I augmented a streaming backwards test to include a torch.autograd.grad attempt. The test fails on current master[2] and passes with the engine.cpp diffs.
I have no idea if this bug or its fix matter to distributed autograd. pritamdamania mrshenli should take a look before it's merged.
[1] example:
```python
leaf = torch.tensor(..., requires_grad=True)
tmp = leaf * 2
loss = tmp.sum()
torch.autograd.grad(loss, inputs=(tmp, leaf))
```
Technically, because `torch.autograd.grad` can be told to produce grads for non-leaf temporaries, these streams might NOT be "leaf streams". Maybe I should rename `leaf_streams`?
[2] the way the test currently fails is fun: it reports
```
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 0 element(s) (out of 25) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.0 (5.0 vs. 5.0), which occurred at index (0, 0).
```
I suspect this [kafka trap](https://en.wiktionary.org/wiki/Kafkatrap) happens because assertEqual does a comparison test on the device, syncs on some bool result, sees failure, and prints the tensors post-sync, at which point it IS safe to access the values.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60127
Reviewed By: mrshenli
Differential Revision: D29276581
Pulled By: albanD
fbshipit-source-id: a9f797e2fd76e2f884cce5a32ecf5d9b704c88ee
Summary:
The grad() function needs to return the updated values, and hence
needs a non-empty `inputs` argument to populate.
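A short sketch of the expected usage: `torch.autograd.grad` returns the gradients for the given inputs, so `inputs` must be non-empty.
```python
import torch

x = torch.randn(3, requires_grad=True)
y = (x ** 2).sum()
(gx,) = torch.autograd.grad(y, inputs=(x,))  # returns the gradient for x
# torch.autograd.grad(y, inputs=())          # an empty inputs sequence is rejected
print(gx)
```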
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52016
Test Plan:
Passes Python and C++ unit tests, and added new tests to catch this behavior.
Fixes https://github.com/pytorch/pytorch/issues/47061
Reviewed By: albanD
Differential Revision: D26406444
Pulled By: dagitses
fbshipit-source-id: 023aeca9a40cd765c5bad6a1a2f8767a33b75a1a
Summary:
Switches most of the simple for loops outside of `jit` directories to use `c10::irange`.
Generated with D28874212.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59481
Test Plan: Sandcastle
Reviewed By: ngimel
Differential Revision: D28909681
fbshipit-source-id: ec9ab1bd602933238d9d0f73d4d8d027b75d9d85
Summary:
Fixes https://github.com/pytorch/pytorch/issues/4661
- Add warnings in the engine's `execute` function so they can be triggered through both the C++ and Python codepaths
- Add an RAII guard version of `c10::Warning::set_warnAlways` and replace all prior usages of set_warnAlways with the new one
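As an illustration only (the specific warning shown here is an assumption, not named by the commit message), an engine-triggered warning can be observed through the normal Python warnings machinery; `backward(create_graph=True)` is one case that is expected to warn about the reference cycle it can create:
```python
import warnings
import torch

x = torch.randn(3, requires_grad=True)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    (x * x).sum().backward(create_graph=True)
print([str(w.message) for w in caught])
```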
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59412
Reviewed By: jbschlosser
Differential Revision: D28969294
Pulled By: soulitzer
fbshipit-source-id: b03369c926a3be18ce1cf363b39edd82a14245f0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59017
See the comment in ThreadLocal.h for context.
I used a slightly dirty preprocessor hack to minimize the number of changes.
The hope is that we'll be able to revert all of these soon.
Test Plan:
CI.
Built FB4A with gnustl and saw no references to cxa_thread_atexit
in the PyTorch libraries.
Reviewed By: ilia-cher
Differential Revision: D28720762
fbshipit-source-id: 0f13c7ac5a108b95f8fde6dbc63c6b8bdb8599de
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58420
In https://github.com/pytorch/pytorch/pull/57636 I migrated most uses of Future to an intrusive_ptr. I thought I had all of them but I missed a couple. These are the remaining ones. (The next PR will make it impossible to add new usages of shared_ptr).
ghstack-source-id: 129567071
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28477285
fbshipit-source-id: 75008276baa59e26b450e942c009ec7e78f89b13
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os


def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files


def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname, "-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit", "--all", "-m", f"NOLINT stubs for {fname}"])


def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)


if __name__ == "__main__":
    main()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892
Reviewed By: H-Huang
Differential Revision: D27991944
Pulled By: malfet
fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56808
For information about data-race-on-vptr in general, see https://www.internalfb.com/intern/wiki/TSAN/Common_Concurrency_Mistakes/Stopping_a_Thread_in_Destructor/
Engine::~Engine() was previously tasked with stopping the threads. This causes a data race on the object's vptr when PythonEngine is being destructed. This fixes the data race by making ~PythonEngine trigger the thread stopping before going down to the base class's destructor.
Test Plan:
Many tests are affected, but here's one example:
buck test mode/dev-tsan -c fbcode.tsan_strict_mode=true //oculus/research/orcoptics/deep_learning/srg_nn/tests:test_grating_net -- 'test_train (oculus.research.orcoptics.deep_learning.srg_nn.tests.test_grating_net.TestGratingNet)' --run-disabled
Reviewed By: walterddr, albanD
Differential Revision: D27972384
fbshipit-source-id: 8b70fec8d9326497c591a2777b355ea590a85082
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56174
evaluate_function:
1. calls the autograd function (call_function)
2. accumulates gradients into buffers
Previously, ThreadLocalStateGuard only covered part of `call_function`.
However, it should cover all Tensor operations in `evaluate_function`,
so this PR moves it to do so.
One alternative would have been to move ThreadLocalStateGuard to here:
71f9e99e29/torch/csrc/autograd/engine.cpp (L394)
Unfortunately that adds 2% additional instructions according to the
instruction count benchmark in the next section. This is because
`evaluate_function` does an early return:
71f9e99e29/torch/csrc/autograd/engine.cpp (L732-L735)
If this is preferred, please let me know.
Test Plan:
- run existing tests. It's hard to actually come up with a test case for
this.
Benchmark plan:
TL;DR: Instruction count decreases by a little after this PR.
```
import torch
from torch.utils.benchmark import Timer

timer = Timer(
    stmt="""\
torch::autograd::grad({y}, {x}, {}, /*retain_grad=*/true);""",
    setup="""\
auto x = torch::ones({}, torch::requires_grad());
auto y = x * 2;""",
    language="cpp")
stats = timer.collect_callgrind()
print(stats)
```
This gave the following:
```
Before:
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7f4b28ce6a90>
torch::autograd::grad({y}, {x}, {}, /*retain_grad=*/true);
setup:
auto x = torch::ones({}, torch::requires_grad());
auto y = x * 2;
All Noisy symbols removed
Instructions: 3514184 3514184
Baseline: 0 0
100 runs per measurement, 1 thread
After:
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7fdbc9d187d0>
torch::autograd::grad({y}, {x}, {}, /*retain_grad=*/true);
setup:
auto x = torch::ones({}, torch::requires_grad());
auto y = x * 2;
All Noisy symbols removed
Instructions: 3513884 3513884
Baseline: 0 0
100 runs per measurement, 1 thread
```
Reviewed By: albanD
Differential Revision: D27799283
Pulled By: zou3519
fbshipit-source-id: 0a8213824e08c04748d38e66604c73f395285d63
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53973
Two parts to this PR; I had to put them together because adding support for X causes more test code to be exercised, which in turn may require a fix for Y.
The first part is restoring the concept of storage to meta tensors. Previously, meta tensors had a nullptr storage (e.g., `meta_tensor.storage()` is an error.) As I was increasing the coverage of meta tensors, I started running into test cases (specifically memory overlap tests) that were failing because not having storage meant I couldn't check for memory overlap. After some discussion, we decided that it would make sense for meta tensors to model this as well (we already model strides, so getting accurate view information also seems useful). This PR does that by:
* Rewrite all of the factory functions in MetaTensor.cpp to use the generic versions (which are very carefully written to not actually poke at the data pointer, so everything works out). The key idea here is we give meta tensors a special allocator, MetaAllocator, which always returns a nullptr even if you ask for a nonzero number of bytes. resize_ is also made generic; the normal variant can be used directly rather than having to instruct it to avoid resizing storage
* Turn on memory overlap checking in TensorIterator even for meta tensors
* Although meta tensors now have storage, the concept of meta storage is NOT exposed to Python land (as it would imply I would have to codegen MetaFloatStorage, MetaDoubleStorage, etc. classes). So `x.storage()` still raises an error and I have a cludge in `__deepcopy__` to break storage sharing upon deep copy (this is wrong, but no tests exercise this at the moment).
The second part is adding more support for the most used functions in the test suite.
* Inplace operations have very simple meta functions. I added `fill_`, `zero_`, `random_`, `uniform_` and `normal_`. In the case of random, I take advantage of pbelevich's templates for defining random kernels, so that I can reuse the common scaffolding, and then just register a noop stub that actually does the RNG. (Look, another structured kernels tiny variant!)
* `copy_` is now implemented. Copying into a meta tensor is always OK, but copying out of a meta tensor raises an error (as we don't know what the "correct" data to copy out is in this case)
* `empty_strided` usage from structured kernels now is implemented (TBH, this could have been done as soon as `empty_strided` was added)
* Meta was missing in a few places in TensorOptions/DispatchKey utility functions, so I added them
* Autograd engine now correctly homes meta tensors with CPU tensors (they have -1 device index so CUDA queues wouldn't work anyway)
* `apply_`, `map_` and `map2_` are special cased to no-op on meta tensor self. These count as inplace operations too but they are implemented a little differently.
Getting more meta function support triggers a number of bugs in the test suite, which I then fix:
- Linear algebra functions sometimes don't report NotImplementedError because they get swallowed by catch-all try blocks. This is tracked in https://github.com/pytorch/pytorch/issues/53739
- dlpack obviously doesn't work with meta tensors, I just disabled the test
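A small sketch of the meta-tensor support described above (meta tensors carry sizes, strides, and storage metadata, but no real data):
```python
import torch

m = torch.empty_strided((4, 4), (4, 1), device="meta")
m.fill_(1.0)      # in-place ops are supported as no-ops on meta tensors
m.normal_()
dst = torch.empty(4, 4, device="meta")
dst.copy_(torch.randn(4, 4))     # copying *into* a meta tensor is allowed
# torch.randn(4, 4).copy_(dst)   # copying *out of* a meta tensor raises an error
print(m.shape, m.stride(), m.device)
```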
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D27036572
Test Plan: Imported from OSS
Reviewed By: agolynski, bdhirsh
Pulled By: ezyang
fbshipit-source-id: 7005ecf4feb92a643c37389fdfbd852dbf00ac78
Summary:
Fixes https://github.com/pytorch/pytorch/issues/12635
This change will help us speed up autograd's discovery algorithm in cases where we use `.grad` and try to "unroll" the training loop. For example, the example in the issue and the one in https://github.com/pytorch/pytorch/pull/52180#issuecomment-783400832 see a speed-up factor that grows without bound (a minimal sketch of this pattern appears after the benchmark results below).
We do this by adding a new sequence_nr-type numbering: for each node, we maintain the length of the longest path from it to any leaf node. How does this help us speed up discovery (dfs)? Previously the bottleneck was that the dfs that computes which nodes need to be executed always explored every node. With this change, before we run dfs, we first compute the minimum seq_nr among all the nodes passed as the `inputs`. If we let this be some number N, intuitively this means that dfs should stay at least N units away from any leaf node. So, if we find ourselves too close to any leaf node, we stop our search early.
Edit:
After some discussion offline, the plan is:
- make old sequence_nr a construct of the profiler. This means we can avoid accessing thread local state in cases where the profiler is disabled. Note that we cannot replace sequence_nr as-is because profiler's use-case requires that thread-id + sequence_nr can uniquely identify a given node in order for downstream users/programs to correlate nodes from backward and forward passes. This means we must maintain two sequence_nr's and that we have an extra field in Node.
- In a future PR, we can potentially remove sequence_nr entirely from the profiler as well, but we avoid doing it now because we haven't measured the impact, and it's a larger effort because we'd have to mess around with the dispatcher and profiler
Testing with this [code](https://gist.github.com/kyunghyuncho/5fb9991ce1233f909051854a84b7148e), we see that runtime no longer increases as we iterate.
Before:
```
100: Time taken: 0.47s, loss: 1.1e+06
200: Time taken: 0.064s, loss: 6.5e+05
300: Time taken: 0.088s, loss: 4.4e+05
400: Time taken: 0.1s, loss: 3.2e+05
500: Time taken: 0.12s, loss: 2.5e+05
600: Time taken: 0.15s, loss: 2e+05
700: Time taken: 0.18s, loss: 1.7e+05
800: Time taken: 0.2s, loss: 1.4e+05
900: Time taken: 0.22s, loss: 1.2e+05
1000: Time taken: 0.24s, loss: 1.1e+05
1100: Time taken: 0.27s, loss: 9.3e+04
1200: Time taken: 0.3s, loss: 8.3e+04
1300: Time taken: 0.34s, loss: 7.4e+04
1400: Time taken: 0.36s, loss: 6.7e+04
1500: Time taken: 0.38s, loss: 6.1e+04
1600: Time taken: 0.4s, loss: 5.6e+04
1700: Time taken: 0.42s, loss: 5.1e+04
1800: Time taken: 0.44s, loss: 4.7e+04
1900: Time taken: 0.47s, loss: 4.4e+04
2000: Time taken: 0.5s, loss: 4.1e+04
```
After:
```
100: Time taken: 0.49s, loss: 1.2e+06
200: Time taken: 0.031s, loss: 6.9e+05
300: Time taken: 0.031s, loss: 4.6e+05
400: Time taken: 0.031s, loss: 3.3e+05
500: Time taken: 0.031s, loss: 2.6e+05
600: Time taken: 0.031s, loss: 2.1e+05
700: Time taken: 0.031s, loss: 1.7e+05
800: Time taken: 0.031s, loss: 1.4e+05
900: Time taken: 0.031s, loss: 1.2e+05
1000: Time taken: 0.031s, loss: 1.1e+05
1100: Time taken: 0.031s, loss: 9.6e+04
1200: Time taken: 0.031s, loss: 8.6e+04
1300: Time taken: 0.031s, loss: 7.7e+04
1400: Time taken: 0.031s, loss: 7e+04
1500: Time taken: 0.031s, loss: 6.3e+04
1600: Time taken: 0.031s, loss: 5.8e+04
1700: Time taken: 0.031s, loss: 5.3e+04
1800: Time taken: 0.031s, loss: 4.9e+04
1900: Time taken: 0.031s, loss: 4.5e+04
2000: Time taken: 0.032s, loss: 4.2e+04
```
Testing w/ small graph to check for regression:
```
import torch
from torch.utils.benchmark import Timer
setup="""
a = torch.rand((2, 2), requires_grad=True)
b = torch.rand((2, 2), requires_grad=True)
gradient = torch.ones(2, 2)
"""
stmt="""
torch.autograd.grad(a*b, [a, b], gradient)
"""
timer = Timer(stmt, setup)
print(timer.timeit(10000))
print(timer.collect_callgrind(100))
```
Result: there doesn't seem to be any significant regression
```
Time before: 12.74 us
Time after: 13.12 us
Instruction count before:
All Noisy symbols removed
Instructions: 8078960 8000882
Baseline: 4226 3838
Instruction count after:
All Noisy symbols removed
Instructions: 8091846 8017940
Baseline: 4336 3838
100 runs per measurement, 1 thread
```
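For reference, a minimal sketch (assumed for illustration, simpler than the linked gist) of the "unrolled" `.grad` pattern whose discovery cost this change keeps bounded:
```python
import torch

x = torch.randn(16, requires_grad=True)
state = x
for step in range(100):
    # The autograd graph keeps growing across iterations...
    state = state * 1.001 + 0.1
    loss = state.sum()
    # ...but discovery no longer has to walk the whole history on every call.
    (gx,) = torch.autograd.grad(loss, x, retain_graph=True)
print(gx.shape)
```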
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52180
Reviewed By: gchanan, zhangguanheng66
Differential Revision: D26794387
Pulled By: soulitzer
fbshipit-source-id: c00d387a29f151109c33dc6f1b56a8f275cdec58
Summary:
Fixes https://github.com/pytorch/pytorch/issues/34067 by using https://github.com/pytorch/pytorch/issues/34426 by hczhu
In addition to removing the unnecessary any() we do also:
- Get rid of the outer loop since graph_root also needs to be checked
- Update the pseudocode description so it matches what the code does
- Add some comments explaining the difference between assigning `info.needed_` and `info.captures_` in terms of how that affects discovery
- [edit: another benefit is that exec_info entries are no longer created for all reachable nodes]
This PR is on top of https://github.com/pytorch/pytorch/issues/51940, so once that lands rebasing on top of master should get rid of the extra commits and changes
I'm not sure if this change will bring a lot of performance gains, but the main benefit is that the code is easier to read.
Trivial graph:
```
torch.autograd.grad(a*b, [a, b], gradient)
setup:
a = torch.rand((2, 2), requires_grad=True)
b = torch.rand((2, 2), requires_grad=True)
gradient = torch.ones(2, 2)
Timer before:
15.45 us
Time after:
14.33 us
1 measurement, 10000 runs , 1 thread
Instructions after:
All Noisy symbols removed
Instructions: 8271213 8193169
Baseline: 4244 3838
Instructions before:
All Noisy symbols removed
Instructions: 8142843 8054463
Baseline: 4280 3838
100 runs per measurement, 1 thread
```
Small graph:
```
torch.autograd.grad((b*a.exp()+a*b.exp()).sum(), (a, b))
setup:
a = torch.rand((2, 2), requires_grad=True)
b = torch.rand((2, 2), requires_grad=True)
Time before:
52.25 us
Time after:
50.80 us
1 measurement, 10000 runs , 1 thread
Instruction count before:
All Noisy symbols removed
Instructions: 25601257 25518229
Baseline: 4228 3838
Instruction count after:
All Noisy symbols removed
Instructions: 25606533 25522797
Baseline: 4228
100 runs per measurement, 1 thread
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52057
Reviewed By: ngimel
Differential Revision: D26432207
Pulled By: soulitzer
fbshipit-source-id: beef68344d66e9e286378e31e3311ba43c25c749
Summary:
Fixes https://github.com/pytorch/pytorch/issues/39784
At the time the issue was filed, there was only issue (1) below.
There are actually now two issues here:
1. We always set all inputs passed in through `inputs` arg as `needed = True` in exec_info. So if we pass in an input that has a grad_fn that is not materialized, we create an entry of exec_info with nullptr as key with `needed = True`. Coincidentally, when we perform simple arithmetic operations, such as "2 * x", one of the next edges of mul is an invalid edge, meaning that its grad_fn is also nullptr. This causes the discovery algorithm to set all grad_fns that have a path to this invalid_edge as `needed = True`.
2. Before the commit that made the engine skip the dummy node, we knew that the root node is always needed, i.e., we hardcoded `exec_info[&graph_root]=true`. The issue was that this logic wasn't updated after the code was changed to skip the graph root.
To address (1), instead of passing in an invalid edge if an input in `inputs` has no grad_fn, we create a dummy grad_fn. This is done in both python and cpp entry points. The alternative is to add logic for both backward() and grad() cases to check whether the grad_fn is nullptr and set needed=false in that case (the .grad() case would be slightly more complicated than the .backward() case here).
For (2), we perform one final iteration of the discovery algorithm so that we really know whether we need to execute the graph root.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51940
Reviewed By: VitalyFedyunin
Differential Revision: D26369529
Pulled By: soulitzer
fbshipit-source-id: 14a01ae7988a8de621b967a31564ce1d7a00084e
Summary:
This solves a race condition where the worker thread might
see a partially initialized graph_task
Fixes https://github.com/pytorch/pytorch/issues/49652
I don't know how to reliably trigger the race so I didn't add any test. But the ROCm build flakiness (it just happens to race more often on ROCm builds) should disappear after this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50164
Reviewed By: zou3519
Differential Revision: D25824954
Pulled By: albanD
fbshipit-source-id: 6a3391753cb2afd2ab415d3fb2071a837cc565bb