Summary:
CUDA_VERSION and HIP_VERSION follow very unrelated versioning schemes, so it does not make sense to use CUDA_VERSION to determine the ROCm path. This note explicitly addresses it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62850
Reviewed By: mruberry
Differential Revision: D30547562
Pulled By: malfet
fbshipit-source-id: 02990fa66a88466c2330ab85f446b25b78545150
Summary:
- Adds some code examples for `ctx` methods and make requirements of arguments more clear
- Type annotations for `save_for_backward`, `mark_dirty`, `mark_non_differentiable`, and `set_materialize_grads` (BC-breaking?)
- Refactor `torch.autograd.Function` doc
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60312
Reviewed By: VitalyFedyunin
Differential Revision: D30314961
Pulled By: soulitzer
fbshipit-source-id: a284314b65662e26390417bd2b6b12cd85e68dc8
Summary:
This FAQ has a section for CUDA OOMs where there are lots of don'ts. This limits modeling solution. Deep nets can blow up memory due to output caching during training.
It's a known problem with a known solution: to trade-off compute for memory via checkpointing.
FAQ should mention it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62709
Reviewed By: nairbv
Differential Revision: D30103326
Pulled By: ezyang
fbshipit-source-id: 3a8b465a7fbe19aae88f83cc50fe82ebafcb56c9
Summary:
Before https://github.com/pytorch/pytorch/pull/57833, calls to backward() or grad() synced only the calling thread's default stream with autograd leaf streams at the end of backward. This made the following weird pattern safe:
```python
with torch.cuda.stream(s):
# imagine forward used many streams, so backward leaf nodes may run on many streams
loss.backward()
# no sync
use grads
```
but a more benign-looking pattern was unsafe:
```python
with torch.cuda.stream(s):
# imagine forward used a lot of streams, so backward leaf nodes may run on many streams
loss.backward()
# backward() syncs the default stream with all the leaf streams, but does not sync s with anything,
# so counterintuitively (even though we're in the same stream context as backward()!)
# it is NOT SAFE to use grads here, and there's no easy way to make it safe,
# unless you manually sync on all the streams you used in forward,
# or move "use grads" back to default stream outside the context.
use grads
```
mruberry ngimel and I decided backward() should have the [same user-facing stream semantics as any cuda op](https://pytorch.org/docs/master/notes/cuda.html#stream-semantics-of-backward-passes).** In other words, the weird pattern should be unsafe, and the benign-looking pattern should be safe. Implementationwise, this meant backward() should sync its calling thread's current stream, not default stream, with the leaf streams.
After https://github.com/pytorch/pytorch/pull/57833, backward syncs the calling thread's current stream AND default stream with all leaf streams at the end of backward. The default stream syncs were retained for temporary backward compatibility.
This PR finishes https://github.com/pytorch/pytorch/pull/57833's work by deleting syncs on the default stream.
With this PR, graph-capturing an entire backward() call should be possible (see the [test_graph_grad_scaling diffs](https://github.com/pytorch/pytorch/compare/master...mcarilli:streaming_backwards_remove_default_syncs?expand=1#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450R3641-R3642)).
** first paragraph has a formatting error which this PR should also fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60421
Reviewed By: albanD
Differential Revision: D29370344
Pulled By: ngimel
fbshipit-source-id: 3248bc5fb92fc517db0c15c897e5d7250f67d7fe
Summary:
Before https://github.com/pytorch/pytorch/pull/57833, calls to backward() or grad() synced only the calling thread's default stream with autograd leaf streams at the end of backward. This made the following weird pattern safe:
```python
with torch.cuda.stream(s):
# imagine forward used many streams, so backward leaf nodes may run on many streams
loss.backward()
# no sync
use grads
```
but a more benign-looking pattern was unsafe:
```python
with torch.cuda.stream(s):
# imagine forward used a lot of streams, so backward leaf nodes may run on many streams
loss.backward()
# backward() syncs the default stream with all the leaf streams, but does not sync s with anything,
# so counterintuitively (even though we're in the same stream context as backward()!)
# it is NOT SAFE to use grads here, and there's no easy way to make it safe,
# unless you manually sync on all the streams you used in forward,
# or move "use grads" back to default stream outside the context.
use grads
```
mruberry ngimel and I decided backward() should have the [same user-facing stream semantics as any cuda op](https://pytorch.org/docs/master/notes/cuda.html#stream-semantics-of-backward-passes).** In other words, the weird pattern should be unsafe, and the benign-looking pattern should be safe. Implementationwise, this meant backward() should sync its calling thread's current stream, not default stream, with the leaf streams.
After https://github.com/pytorch/pytorch/pull/57833, backward syncs the calling thread's current stream AND default stream with all leaf streams at the end of backward. The default stream syncs were retained for temporary backward compatibility.
This PR finishes https://github.com/pytorch/pytorch/pull/57833's work by deleting syncs on the default stream.
With this PR, graph-capturing an entire backward() call should be possible (see the [test_graph_grad_scaling diffs](https://github.com/pytorch/pytorch/compare/master...mcarilli:streaming_backwards_remove_default_syncs?expand=1#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450R3641-R3642)).
** first paragraph has a formatting error which this PR should also fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60421
Reviewed By: VitalyFedyunin, albanD
Differential Revision: D29342234
Pulled By: ngimel
fbshipit-source-id: 98e6be7fdd8550872f0a78f9a66cb8dfe75abf63
Summary:
Fixes https://github.com/pytorch/pytorch/issues/35901
This change is designed to prevent fragmentation in the Caching Allocator. Permissive block splitting in the allocator allows very large blocks to be split into many pieces. Once split too finely it is unlikely all pieces will be 'free' at that same time so the original allocation can never be returned. Anecdotally, we've seen a model run out of memory failing to alloc a 50 MB block on a 32 GB card while the caching allocator is holding 13 GB of 'split free blocks'
Approach:
- Large blocks above a certain size are designated "oversize". This limit is currently set 1 decade above large, 200 MB
- Oversize blocks can not be split
- Oversize blocks must closely match the requested size (e.g. a 200 MB request will match an existing 205 MB block, but not a 300 MB block)
- In lieu of splitting oversize blocks there is a mechanism to quickly free a single oversize block (to the system allocator) to allow an appropriate size block to be allocated. This will be activated under memory pressure and will prevent _release_cached_blocks()_ from triggering
Initial performance tests show this is similar or quicker than the original strategy. Additional tests are ongoing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44742
Reviewed By: zou3519
Differential Revision: D29186394
Pulled By: ezyang
fbshipit-source-id: c88918836db3f51df59de6d1b3e03602ebe306a9
Summary:
Adds a note explaining the difference between several often conflated mechanisms in the autograd note
Also adds a link to this note from the docs in `grad_mode` and `nn.module`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58513
Reviewed By: gchanan
Differential Revision: D28651129
Pulled By: soulitzer
fbshipit-source-id: af9eb1749b641fc1b632815634eea36bf7979156
Summary:
You can find the latest rendered version in the `python_doc_build` CI job below, in the artifact tab of that build on circle CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55966
Reviewed By: H-Huang
Differential Revision: D28032446
Pulled By: albanD
fbshipit-source-id: 227ad37b03d39894d736c19cae3195b4d56fc62f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56528
Tried to search across internal and external usage of DataLoader. People haven't started to use `generator` for `DataLoader`.
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D27908487
Pulled By: ejguan
fbshipit-source-id: 14c83ed40d4ba4dc988b121968a78c2732d8eb93
Summary:
Fixes https://github.com/pytorch/pytorch/issues/35901
This change is designed to prevent fragmentation in the Caching Allocator. Permissive block splitting in the allocator allows very large blocks to be split into many pieces. Once split too finely it is unlikely all pieces will be 'free' at that same time so the original allocation can never be returned. Anecdotally, we've seen a model run out of memory failing to alloc a 50 MB block on a 32 GB card while the caching allocator is holding 13 GB of 'split free blocks'
Approach:
- Large blocks above a certain size are designated "oversize". This limit is currently set 1 decade above large, 200 MB
- Oversize blocks can not be split
- Oversize blocks must closely match the requested size (e.g. a 200 MB request will match an existing 205 MB block, but not a 300 MB block)
- In lieu of splitting oversize blocks there is a mechanism to quickly free a single oversize block (to the system allocator) to allow an appropriate size block to be allocated. This will be activated under memory pressure and will prevent _release_cached_blocks()_ from triggering
Initial performance tests show this is similar or quicker than the original strategy. Additional tests are ongoing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44742
Reviewed By: ngimel
Differential Revision: D23752058
Pulled By: ezyang
fbshipit-source-id: ccb7c13e3cf8ef2707706726ac9aaac3a5e3d5c8
Summary:
*Context:* https://github.com/pytorch/pytorch/issues/53406 added a lint for trailing whitespace at the ends of lines. However, in order to pass FB-internal lints, that PR also had to normalize the trailing newlines in four of the files it touched. This PR adds an OSS lint to normalize trailing newlines.
The changes to the following files (made in 54847d0adb9be71be4979cead3d9d4c02160e4cd) are the only manually-written parts of this PR:
- `.github/workflows/lint.yml`
- `mypy-strict.ini`
- `tools/README.md`
- `tools/test/test_trailing_newlines.py`
- `tools/trailing_newlines.py`
I would have liked to make this just a shell one-liner like the other three similar lints, but nothing I could find quite fit the bill. Specifically, all the answers I tried from the following Stack Overflow questions were far too slow (at least a minute and a half to run on this entire repository):
- [How to detect file ends in newline?](https://stackoverflow.com/q/38746)
- [How do I find files that do not end with a newline/linefeed?](https://stackoverflow.com/q/4631068)
- [How to list all files in the Git index without newline at end of file](https://stackoverflow.com/q/27624800)
- [Linux - check if there is an empty line at the end of a file [duplicate]](https://stackoverflow.com/q/34943632)
- [git ensure newline at end of each file](https://stackoverflow.com/q/57770972)
To avoid giving false positives during the few days after this PR is merged, we should probably only merge it after https://github.com/pytorch/pytorch/issues/54967.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54737
Test Plan:
Running the shell script from the "Ensure correct trailing newlines" step in the `quick-checks` job of `.github/workflows/lint.yml` should print no output and exit in a fraction of a second with a status of 0. That was not the case prior to this PR, as shown by this failing GHA workflow run on an earlier draft of this PR:
- https://github.com/pytorch/pytorch/runs/2197446987?check_suite_focus=true
In contrast, this run (after correcting the trailing newlines in this PR) succeeded:
- https://github.com/pytorch/pytorch/pull/54737/checks?check_run_id=2197553241
To unit-test `tools/trailing_newlines.py` itself (this is run as part of our "Test tools" GitHub Actions workflow):
```
python tools/test/test_trailing_newlines.py
```
Reviewed By: malfet
Differential Revision: D27409736
Pulled By: samestep
fbshipit-source-id: 46f565227046b39f68349bbd5633105b2d2e9b19
Summary:
Context: https://github.com/pytorch/pytorch/pull/53299#discussion_r587882857
These are the only hand-written parts of this diff:
- the addition to `.github/workflows/lint.yml`
- the file endings changed in these four files (to appease FB-internal land-blocking lints):
- `GLOSSARY.md`
- `aten/src/ATen/core/op_registration/README.md`
- `scripts/README.md`
- `torch/csrc/jit/codegen/fuser/README.md`
The rest was generated by running this command (on macOS):
```
git grep -I -l ' $' -- . ':(exclude)**/contrib/**' ':(exclude)third_party' | xargs gsed -i 's/ *$//'
```
I looked over the auto-generated changes and didn't see anything that looked problematic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53406
Test Plan:
This run (after adding the lint but before removing existing trailing spaces) failed:
- https://github.com/pytorch/pytorch/runs/2043032377
This run (on the tip of this PR) succeeded:
- https://github.com/pytorch/pytorch/runs/2043296348
Reviewed By: walterddr, seemethere
Differential Revision: D26856620
Pulled By: samestep
fbshipit-source-id: 3f0de7f7c2e4b0f1c089eac9b5085a58dd7e0d97
Summary:
Minor doc fix in clarifying that the input data is rounded not truncated.
CC zasdfgbnm ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49625
Reviewed By: mruberry
Differential Revision: D25668244
Pulled By: ngimel
fbshipit-source-id: ac97e41e0ca296276544f9e9f85b2cf1790d9985
Summary:
Ref https://github.com/pytorch/pytorch/issues/42175
This removes the 4 deprecated spectral functions: `torch.{fft,rfft,ifft,irfft}`. `torch.fft` is also now imported by by default.
The actual `at::native` functions are still used in `torch.stft` so can't be full removed yet. But will once https://github.com/pytorch/pytorch/issues/47601 has been merged.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48594
Reviewed By: heitorschueroff
Differential Revision: D25298929
Pulled By: mruberry
fbshipit-source-id: e36737fe8192fcd16f7e6310f8b49de478e63bf0
Summary:
I have been asked several times how to toggle this flag on libtorch. I think it would be good to mention it in the docs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47331
Reviewed By: glaringlee
Differential Revision: D24777576
Pulled By: mruberry
fbshipit-source-id: cc2a338c477bb57e0bb74b8960c47fde99665e41
Summary:
Currently, a GraphRoot instance doesn't have an associated stream. Streaming backward synchronization logic assumes the instance ran on the default stream, and tells consumer ops to sync with the default stream. If the gradient the GraphRoot instance passes to consumer backward ops was populated on a non-default stream, we have a race condition.
The race condition can exist even if the user doesn't give a manually populated gradient:
```python
with torch.cuda.stream(side_stream):
# loss.backward() implicitly synthesizes a one-element 1.0 tensor on side_stream
# GraphRoot passes it to consumers, but consumers first sync on default stream, not side_stream.
loss.backward()
# Internally to backward(), streaming-backward logic takes over, stuff executes on the same stream it ran on in forward,
# and the side_stream context is irrelevant. GraphRoot's interaction with its first consumer(s) is the spot where
# the side_stream context causes a problem.
```
This PR fixes the race condition by associating a GraphRoot instance, at construction time, with the current stream(s) on the device(s) of the grads it will pass to consumers. (i think this relies on GraphRoot executing in the main thread, before backward thread(s) fork, because the grads were populated on the main thread.)
The test demonstrates the race condition. It fails reliably without the PR's GraphRoot diffs and passes with the GraphRoot diffs.
With the GraphRoot diffs, manually populating an incoming-gradient arg for `backward` (or `torch.autograd.grad`) and the actual call to `autograd.backward` will have the same stream-semantics relationship as any other pair of ops:
```python
# implicit population is safe
with torch.cuda.stream(side_stream):
loss.backward()
# explicit population in side stream then backward in side stream is safe
with torch.cuda.stream(side_stream):
kickoff_grad = torch.ones_like(loss)
loss.backward(gradient=kickoff_grad)
# explicit population in one stream then backward kickoff in another stream
# is NOT safe, even with this PR's diffs, but that unsafety is consistent with
# stream-semantics relationship of any pair of ops
kickoff_grad = torch.ones_like(loss)
with torch.cuda.stream(side_stream):
loss.backward(gradient=kickoff_grad)
# Safe, as you'd expect for any pair of ops
kickoff_grad = torch.ones_like(loss)
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream):
loss.backward(gradient=kickoff_grad)
```
This PR also adds the last three examples above to cuda docs and references them from autograd docstrings.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45787
Reviewed By: nairbv
Differential Revision: D24138376
Pulled By: albanD
fbshipit-source-id: bc4cd9390f9f0358633db530b1b09f9c1080d2a3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45294
While tracking down a recent memory corruption bug we found that
cuda-memcheck wasn't finding the bad accesses, and ngimel pointed out that
it's because we use a caching allocator so a lot of "out of bounds" accesses
land in a valid slab.
This PR adds a runtime knob (`PYTORCH_NO_CUDA_MEMORY_CACHING`) that, when set,
bypasses the caching allocator's caching logic so that allocations go straight
to cudaMalloc. This way, cuda-memcheck will actually work.
Test Plan:
Insert some memory errors and run a test under cuda-memcheck;
observe that cuda-memcheck flags an error where expected.
Specifically I removed the output-masking logic here:
https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/tensorexpr/cuda_codegen.cpp#L819-L826
And ran:
```
PYTORCH_NO_CUDA_MEMORY_CACHING=1 cuda-memcheck pytest -k test_superslomo test_jit_fuser_te.py
```
Reviewed By: ngimel
Differential Revision: D23964734
Pulled By: bertmaher
fbshipit-source-id: 04efd11e8aff037b9edde80c70585cb820ee6e39