Commit Graph

260 Commits

Author SHA1 Message Date
leslie-fang-intel
3a112ebb57 add autocast cpu doc
As discussed in https://github.com/pytorch/pytorch/issues/55374#issuecomment-968333614, here we update the cpu autocast operation list in autocast API document.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68567
Approved by: https://github.com/ezyang
2022-03-22 02:02:43 +00:00
Jaewon Lee
11ea09effc [CUDACachingAlloc/GPUInference] Implement garbage collection without GPU sync (#74261)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74261

### Goal
Implement a cheap way to reclaim GPU memory (garbage collection) without incurring GPU sync.

### Why do we need this?
Currently, there are only two ways to reclaim GPU memory block already assigned to a particular stream.

- `release_available_cached_blocks(params)`: Free blocks exceeding the `CachingAllocatorConfig::max_split_size()` until we can satisfy the request.

Issue: If the `max_split_size` is unset (default), this function is a no-op. Even if this is set, the reclamation is quite conservative (e.g., never frees blocks under max_split_size).

- `release_cached_blocks()`: Waits for all the in-flight events and then reclaim blocks.

Issue: 'waiting for all event' is very expensive as it will likely stall all the GPU operations. Many GPU applications without a proper handling of potential GPU throttling would suffer/crash.

### Proposed idea
- If the garbage collection threshold is set, try to reclaim some memory blocks *without* synchronization. It should be safe to do so, as `release_available_cached_blocks` essentially does the same thing (but less aggressively).
- GC is triggered only when we fail to serve a `malloc` request from the block pool. No need to free blocks when the block pool is functioning just fine.
- Prioritize reclaiming blocks that weren't reused for long time. Reclamation stops once the used memory capacity < threshold.
- This code path is totally optional; by default it won't be invoked.

Test Plan:
- Unit tests
- Manually checked that the GPU memory usage stays as indicated by the garbage collector. If not the caching allocator at least tries to keep freeing the blocks.

Reviewed By: jianyuh

Differential Revision: D34482514

fbshipit-source-id: d5eae62ac60b94b0bca851f9d233a092d086e3c2
(cherry picked from commit 05780f1ed4b176f05e765b2411c9eaa2eaeb48b0)
2022-03-21 18:46:02 +00:00
Banit Agrawal
ac3effd150 [PyTorch GPU Allocator] Better use of blocks with rounding of allocation sizes (#74213)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74213

In the current CUDACachingAllocator, the sizes are rounded up in multiple of blocks size of 512, so this works for smaller sizes. However for large sizes, we can have lots of different size blocks in the larger pool. This is problematic when we have variable batch sizes 1001, 1021, 1023 -> all will go to different block size and will create different size of blocks. This will create lots of unused blocks and will waste GPU memory capacity.

This diff adds a rounding approach to allocation size. It rounds up the size to nearest power-of-2 divisions and the power2-division can be changed with env variable setting.

   For example, if we need to round-up  size of1200 and if number of divisions is 4,
   the size 1200 lies between 1024 and 2048 and if we do 4 divisions between
   them, the values are 1024, 1280, 1536, and 1792. So the function will
   return 1280 as the nearest ceiling of power-2 division.

env setting:
   export PYTORCH_CUDA_ALLOC_CONF=roundup_power2_divisions:4
ghstack-source-id: 151446017

Reviewed By: ezyang

Differential Revision: D34868036

fbshipit-source-id: 494785add16e6b37c920dcb5a2b81d4c637b554a
(cherry picked from commit 548454ccacbd8700e7ffd2d762e40b4ba37abbae)
2022-03-16 02:53:53 +00:00
Rohit Goswami
801abc0cdd MAINT, DOC: Trivial spellings and warnings (#72745)
Summary:
Fixes N/A.
Just minor annoyances.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72745

Reviewed By: samdow

Differential Revision: D34216016

Pulled By: albanD

fbshipit-source-id: b65600b50e41a1dd7bf7d076b0dd3e2d1c99caf9
(cherry picked from commit b959392a5f)
2022-02-14 21:55:19 +00:00
Felix Divo
340fae4363 [Doc] Better formatting in autograd.rst (#72586)
Summary:
See title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72586

Reviewed By: soulitzer

Differential Revision: D34177704

Pulled By: albanD

fbshipit-source-id: 1adf6ebed4f64ec4d8fff160df300c8e6ee528ea
(cherry picked from commit bbb586d67d)
2022-02-11 22:46:10 +00:00
Felix Divo
25fba4a019 [DOC] Add link to "double backward" from "extending pytorch" page (#72584)
Summary:
It is probably the most user friendly to link to that (lesser known?) feature.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72584

Reviewed By: soulitzer

Differential Revision: D34173999

Pulled By: albanD

fbshipit-source-id: 99fff7a55412faf54888f8317ab2388f4d7d30e4
(cherry picked from commit 2191ee7657)
2022-02-11 20:34:13 +00:00
Mike Ruberry
9b9b878c89 Fixes jiterator cache macro include + updates CUDA note with cache variables (#71452)
Summary:
Per title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71452

Reviewed By: ngimel

Differential Revision: D33646495

Pulled By: mruberry

fbshipit-source-id: bbf627e6d7a724a83a3ea2ae9c0f50430f8d578e
(cherry picked from commit d1e72b144a)
2022-01-19 03:45:05 +00:00
Rohan Varma
4fd1992a60 [Docs][BE] DDP doc fix (#71363)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71363

Looks like DDP example is currently broken as per
https://discuss.pytorch.org/t/official-ddp-example-is-broken/141493. Fix the
issue by setting the correct env variable.
ghstack-source-id: 147080377

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D33607250

fbshipit-source-id: e0e7d03cc365c186253b959c4c5405a5e3609218
(cherry picked from commit 32472884ec)
2022-01-18 22:24:51 +00:00
Jake Tae
23f902f7e4 Fix incorrect variable in autograd docs (#70884)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/68362.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70884

Reviewed By: mruberry

Differential Revision: D33463331

Pulled By: ngimel

fbshipit-source-id: 834ba9c450972710e0424cc92af222551f0b4a4a
2022-01-06 20:53:10 -08:00
Peter Bell
e279963eef Remove remaining THC code (#69039)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69039

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D32872476

Pulled By: ngimel

fbshipit-source-id: 7972aacc24aef9450fb59b707ed6396c501bcb31
2021-12-08 12:18:08 -08:00
Rodrigo Bermúdez Schettino
1a202b0c39 Docs: Fix broken code syntax in autograd.rst (#69362)
Summary:
The backticks around `nn.Parameters` were not rendered correctly because the word was enclosed in an italics block.
Spotted the issue on https://pytorch.org/docs/stable/notes/autograd.html#locally-disable-grad-doc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69362

Reviewed By: zou3519

Differential Revision: D32924093

Pulled By: albanD

fbshipit-source-id: 5a310ac3f3d13a5116f7aa911817b9452eee711d
2021-12-07 12:03:15 -08:00
Michael Carilli
da023611d7 [CUDA graphs] Fixes make_graphed_callables example typos (#69379)
Summary:
cc mcarilli

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69379

Reviewed By: mruberry

Differential Revision: D32841260

Pulled By: ngimel

fbshipit-source-id: a7d0b9db0578526907547b201eddd55827812b63
2021-12-03 16:51:14 -08:00
Elio
088a4feb41 Update the documentation for AMP with DataParallel (#69218)
Summary:
Following https://github.com/pytorch/pytorch/issues/60540 and pull request https://github.com/pytorch/pytorch/issues/43102

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69218

Reviewed By: gchanan

Differential Revision: D32803814

Pulled By: ngimel

fbshipit-source-id: 06fdbbee2c7734153271be70ec4bc24263c8c367
2021-12-03 14:58:47 -08:00
Vansh Sharma
ff125a3624 Minor changes in documentation (#68557)
Summary:
Fixed some small typos

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68557

Reviewed By: mruberry

Differential Revision: D32538749

Pulled By: ngimel

fbshipit-source-id: 09a9cd4031463b6a40d7307bd8fcb7d364444ac3
2021-11-18 17:57:16 -08:00
eqy
790763b0fe Add an option to disable reduced precision reductions for FP16 GEMM (#67946)
Summary:
https://github.com/pytorch/pytorch/issues/67578 disabled reduced precision reductions for FP16 GEMMs. After benchmarking, we've found that this has substantial performance impacts for common GEMM shapes (e.g., those found in popular instantiations of multiheaded-attention) on architectures such as Volta. As these performance regressions may come as a surprise to current users, this PR adds a toggle to disable reduced precision reductions
`torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = `
rather than making it the default behavior.

CC ngimel ptrblck
stas00 Note that the behavior after the previous PR can be replicated with
`torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67946

Reviewed By: zou3519

Differential Revision: D32289896

Pulled By: ngimel

fbshipit-source-id: a1ea2918b77e27a7d9b391e030417802a0174abe
2021-11-09 17:27:20 -08:00
Alban Desmaison
708f7b1209 Update extending doc to cover forward mode AD (#66962)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66962

Reviewed By: VitalyFedyunin

Differential Revision: D31897782

Pulled By: albanD

fbshipit-source-id: 64164783a14a7ed4cedc17da28f1181d9807a499
2021-10-27 14:18:38 -07:00
Natalia Gimelshein
fdd9f49cf5 add a note on numerical accuracy (#65947)
Summary:
Per title
Fixes https://github.com/pytorch/pytorch/issues/54437

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65947

Reviewed By: albanD

Differential Revision: D31612445

Pulled By: ngimel

fbshipit-source-id: 5c155891a088aef3b9813f253d0dc1ee4d51ae1c
2021-10-13 12:43:55 -07:00
Rodrigo Berriel
7e772e7685 Update link to tutorial on defining NN modules (#65534)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/65527. Please, see my comment in the issue: https://github.com/pytorch/pytorch/issues/65527#issuecomment-925863193. The file was renamed in ce58d5904c (diff-e5ef486bd89eb38de15752211d9437953681b8caa8f44d7c86bb820d13151df2), but the link in this repository was not updated.

It doesn't change the fact that the old link is still working, but I guess this has to be fixed in [pytorch/tutorials](https://github.com/pytorch/tutorials) instead of here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65534

Reviewed By: soulitzer

Differential Revision: D31144269

Pulled By: H-Huang

fbshipit-source-id: f70744a21113b7dc84510e2992d87f0fed793985
2021-09-23 11:26:50 -07:00
Rodrigo Berriel
f0ada4bd54 [docs] Remove .data from some docs (#65358)
Summary:
Related to https://github.com/pytorch/pytorch/issues/30987. Fix the following task:

- [ ] Remove the use of `.data` in all our internal code:
  - [ ] ...
  - [x] `docs/source/scripts/build_activation_images.py` and `docs/source/notes/extending.rst`

In `docs/source/scripts/build_activation_images.py`, I used `nn.init` because the snippet already assumes `nn` is available (the class inherits from `nn.Module`).

cc albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65358

Reviewed By: malfet

Differential Revision: D31061790

Pulled By: albanD

fbshipit-source-id: be936c2035f0bdd49986351026fe3e932a5b4032
2021-09-21 06:32:31 -07:00
Michael Carilli
e3210ca184 [CUDA graphs] Beta, not prototype (#65247)
Summary:
Powers have decided this API should be listed as beta.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65247

Reviewed By: malfet

Differential Revision: D31057940

Pulled By: ngimel

fbshipit-source-id: 137b63cbd2c7409fecdc161a22135619bfc96bfa
2021-09-20 13:32:36 -07:00
albanD
473e55d5b2 Use classmethods for overrides (#64841)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64841

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D30991424

Pulled By: albanD

fbshipit-source-id: 551e2119768f3a4292713f3bfa83930f5506adbd
2021-09-17 08:32:49 -07:00
Jane Xu
4c4c03124b Remove old references to 9.2 in documentation (#65059)
Summary:
Removes references in .rst and README.md and comments in the Dockerfile

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65059

Reviewed By: malfet

Differential Revision: D30961110

Pulled By: janeyx99

fbshipit-source-id: 702a9a81bf08125ec4ac38bc656fc2c128c30018
2021-09-16 13:24:05 -07:00
Michael Carilli
36cac2be4d [CUDA graphs] moves memory sharing intro paragraph (#64996)
Summary:
Puts memory sharing intro under Sharing memory... header, where it should have been all along.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64996

Reviewed By: mruberry

Differential Revision: D30948619

Pulled By: ngimel

fbshipit-source-id: 5d9dd267b34e9d3fc499d4738377b58a22da1dc2
2021-09-14 17:53:43 -07:00
Michael Carilli
8d08b103be [CUDA graphs] Prototype API and documentation (#63269)
Summary:
RFC: https://github.com/pytorch/pytorch/issues/61880

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63269

Reviewed By: mruberry

Differential Revision: D30596643

Pulled By: ngimel

fbshipit-source-id: b1f8061406364b667e2c2d4d30fbce1f0d8456be
2021-08-31 13:34:23 -07:00
Joel Schlosser
196fd3ee7a Modules note v2 (#63963)
Summary:
This PR expands the [note on modules](https://pytorch.org/docs/stable/notes/modules.html) with additional info for 1.10.

It adds the following:
* Examples of using hooks
* Examples of using apply()
* Examples for ParameterList / ParameterDict
* register_parameter() / register_buffer() usage
* Discussion of train() / eval() modes
* Distributed training overview / links
* TorchScript overview / links
* Quantization overview / links
* FX overview / links
* Parametrization overview / link to tutorial

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63963

Reviewed By: albanD

Differential Revision: D30606604

Pulled By: jbschlosser

fbshipit-source-id: c1030b19162bcb5fe7364bcdc981a2eb6d6e89b4
2021-08-27 11:30:18 -07:00
Jithun Nair
730ce29baf Add note on ifdefing based on CUDA_VERSION for ROCm path (#62850)
Summary:
CUDA_VERSION and HIP_VERSION follow very unrelated versioning schemes, so it does not make sense to use CUDA_VERSION to determine the ROCm path. This note explicitly addresses it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62850

Reviewed By: mruberry

Differential Revision: D30547562

Pulled By: malfet

fbshipit-source-id: 02990fa66a88466c2330ab85f446b25b78545150
2021-08-25 15:02:03 -07:00
Victor Quach
b95ce1591d Add docs describing saved tensor hooks (#62362)
Summary:
Add section to the Autograd mechanics docs to describe the recently
exposed saved tensors (https://github.com/pytorch/pytorch/issues/52451), how to register packing / unpacking
hooks (https://github.com/pytorch/pytorch/issues/60975) and how to use default hooks (https://github.com/pytorch/pytorch/issues/61834)

Sister PR: https://github.com/pytorch/pytorch/issues/62361 (will add a link from autograd.rst to notes/autograd in whatever PR does not land first)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62362

Reviewed By: soulitzer

Differential Revision: D30453177

Pulled By: Varal7

fbshipit-source-id: f5759977b069ff0ef36a47b08856d297691a6caa
2021-08-20 11:10:51 -07:00
soulitzer
2f615f6313 Improve custom function docs (#60312)
Summary:
- Adds some code examples for `ctx` methods and make requirements of arguments more clear
- Type annotations for `save_for_backward`, `mark_dirty`, `mark_non_differentiable`, and `set_materialize_grads` (BC-breaking?)
- Refactor `torch.autograd.Function` doc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60312

Reviewed By: VitalyFedyunin

Differential Revision: D30314961

Pulled By: soulitzer

fbshipit-source-id: a284314b65662e26390417bd2b6b12cd85e68dc8
2021-08-18 11:31:31 -07:00
kyshel
e75ed4a4b5 add comma to prevent syntax errors (#62492)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62492

Reviewed By: VitalyFedyunin

Differential Revision: D30304684

Pulled By: ezyang

fbshipit-source-id: db08ca39bcecbfd79ea50df18536bf4e87f51e15
2021-08-16 12:27:31 -07:00
cpatru
6d896cb545 Update faq.rst so OOM section mentions checkpoint (#62709)
Summary:
This FAQ has a section for CUDA OOMs where there are lots of don'ts. This limits modeling solution. Deep nets can blow up memory due to output caching during training.
It's a known problem with a known solution: to trade-off compute for memory via checkpointing.
FAQ should mention it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62709

Reviewed By: nairbv

Differential Revision: D30103326

Pulled By: ezyang

fbshipit-source-id: 3a8b465a7fbe19aae88f83cc50fe82ebafcb56c9
2021-08-05 07:40:08 -07:00
Victor Quach
5830f122f1 Add docstrings for save_on_cpu hooks (#62410)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62410

This PR adds docstrings for CPU hooks introduced in #61928.

Also uncomments the warning about pinned memory in CUDA semantics docs.

Depends on: #62361.

For now docstrings are an orphan page at https://docs-preview.pytorch.org/62410/generated/torch.autograd.graph.set_save_on_cpu_hooks.html#torch-autograd-graph-set-save-on-cpu-hooks

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D29990129

Pulled By: Varal7

fbshipit-source-id: 7a98eeee6a0abb11e2c2d9169cd1aa35ad7ba3f4
2021-08-03 17:53:45 -07:00
Michael Dagitses
58df01c3b8 clarify default value of requires_grad for tensors (#61038)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61038

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D29491984

Pulled By: dagitses

fbshipit-source-id: 7e6b7f8e81d77f38c881b86a68c17d3cf5483dad
2021-07-12 12:57:37 -07:00
Jithun Nair
336970c03e Add note on torch.distributed backends on ROCm (#58975)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58975

Reviewed By: soulitzer

Differential Revision: D29595510

Pulled By: rohan-varma

fbshipit-source-id: 384bb67fcd003d65b76e957a474406b2a38099b9
2021-07-10 03:51:19 -07:00
Michael Carilli
2fa6c7627e [CUDA graphs][BC-breaking] Removes post-backward syncs on default stream (#60421)
Summary:
Before https://github.com/pytorch/pytorch/pull/57833, calls to backward() or grad() synced only the calling thread's default stream with autograd leaf streams at the end of backward. This made the following weird pattern safe:
```python
with torch.cuda.stream(s):
    # imagine forward used many streams, so backward leaf nodes may run on many streams
    loss.backward()
# no sync
use grads
```

but a more benign-looking pattern was unsafe:
```python
with torch.cuda.stream(s):
    # imagine forward used a lot of streams, so backward leaf nodes may run on many streams
    loss.backward()
    # backward() syncs the default stream with all the leaf streams, but does not sync s with anything,
    # so counterintuitively (even though we're in the same stream context as backward()!)
    # it is NOT SAFE to use grads here, and there's no easy way to make it safe,
    # unless you manually sync on all the streams you used in forward,
    # or move "use grads" back to default stream outside the context.
    use grads
```
mruberry ngimel and I decided backward() should have the [same user-facing stream semantics as any cuda op](https://pytorch.org/docs/master/notes/cuda.html#stream-semantics-of-backward-passes).** In other words, the weird pattern should be unsafe, and the benign-looking pattern should be safe. Implementationwise, this meant backward() should sync its calling thread's current stream, not default stream, with the leaf streams.

After https://github.com/pytorch/pytorch/pull/57833, backward syncs the calling thread's current stream AND default stream with all leaf streams at the end of backward. The default stream syncs were retained for temporary backward compatibility.

This PR finishes https://github.com/pytorch/pytorch/pull/57833's work by deleting syncs on the default stream.

With this PR, graph-capturing an entire backward() call should be possible (see the [test_graph_grad_scaling diffs](https://github.com/pytorch/pytorch/compare/master...mcarilli:streaming_backwards_remove_default_syncs?expand=1#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450R3641-R3642)).

** first paragraph has a formatting error which this PR should also fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60421

Reviewed By: albanD

Differential Revision: D29370344

Pulled By: ngimel

fbshipit-source-id: 3248bc5fb92fc517db0c15c897e5d7250f67d7fe
2021-06-24 17:34:02 -07:00
Luca Wehrstedt
bb9e1150ea Revert D29342234: [pytorch][PR] [CUDA graphs][BC-breaking] Removes post-backward syncs on default stream
Test Plan: revert-hammer

Differential Revision:
D29342234 (675cea1adb)

Original commit changeset: 98e6be7fdd85

fbshipit-source-id: 84022973248b2254210eee57402df2c4f4bc43c6
2021-06-24 04:49:28 -07:00
Michael Carilli
675cea1adb [CUDA graphs][BC-breaking] Removes post-backward syncs on default stream (#60421)
Summary:
Before https://github.com/pytorch/pytorch/pull/57833, calls to backward() or grad() synced only the calling thread's default stream with autograd leaf streams at the end of backward. This made the following weird pattern safe:
```python
with torch.cuda.stream(s):
    # imagine forward used many streams, so backward leaf nodes may run on many streams
    loss.backward()
# no sync
use grads
```

but a more benign-looking pattern was unsafe:
```python
with torch.cuda.stream(s):
    # imagine forward used a lot of streams, so backward leaf nodes may run on many streams
    loss.backward()
    # backward() syncs the default stream with all the leaf streams, but does not sync s with anything,
    # so counterintuitively (even though we're in the same stream context as backward()!)
    # it is NOT SAFE to use grads here, and there's no easy way to make it safe,
    # unless you manually sync on all the streams you used in forward,
    # or move "use grads" back to default stream outside the context.
    use grads
```
mruberry ngimel and I decided backward() should have the [same user-facing stream semantics as any cuda op](https://pytorch.org/docs/master/notes/cuda.html#stream-semantics-of-backward-passes).** In other words, the weird pattern should be unsafe, and the benign-looking pattern should be safe. Implementationwise, this meant backward() should sync its calling thread's current stream, not default stream, with the leaf streams.

After https://github.com/pytorch/pytorch/pull/57833, backward syncs the calling thread's current stream AND default stream with all leaf streams at the end of backward. The default stream syncs were retained for temporary backward compatibility.

This PR finishes https://github.com/pytorch/pytorch/pull/57833's work by deleting syncs on the default stream.

With this PR, graph-capturing an entire backward() call should be possible (see the [test_graph_grad_scaling diffs](https://github.com/pytorch/pytorch/compare/master...mcarilli:streaming_backwards_remove_default_syncs?expand=1#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450R3641-R3642)).

** first paragraph has a formatting error which this PR should also fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60421

Reviewed By: VitalyFedyunin, albanD

Differential Revision: D29342234

Pulled By: ngimel

fbshipit-source-id: 98e6be7fdd8550872f0a78f9a66cb8dfe75abf63
2021-06-23 23:35:24 -07:00
Michael Wootton
2f3be2735f Don't split oversize cached blocks (#44742)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/35901

This change is designed to prevent fragmentation in the Caching Allocator.  Permissive block splitting in the allocator allows very large blocks to be split into many pieces.  Once split too finely it is unlikely all pieces will be 'free' at that same time so the original allocation can never be returned.   Anecdotally, we've seen a model run out of memory failing to alloc a 50 MB block on a 32 GB card while the caching allocator is holding 13 GB of 'split free blocks'

Approach:

- Large blocks above a certain size are designated "oversize".  This limit is currently set 1 decade above large, 200 MB
- Oversize blocks can not be split
- Oversize blocks must closely match the requested size (e.g. a 200 MB request will match an existing 205 MB block, but not a 300 MB block)
- In lieu of splitting oversize blocks there is a mechanism to quickly free a single oversize block (to the system allocator) to allow an appropriate size block to be allocated.  This will be activated under memory pressure and will prevent _release_cached_blocks()_ from triggering

Initial performance tests show this is similar or quicker than the original strategy.  Additional tests are ongoing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44742

Reviewed By: zou3519

Differential Revision: D29186394

Pulled By: ezyang

fbshipit-source-id: c88918836db3f51df59de6d1b3e03602ebe306a9
2021-06-21 11:46:08 -07:00
Michael Carilli
be038d8989 [CUDA graphs] Make stream semantics of backward calls consistent with other cuda ops (ci-all edition) (#57833)
Summary:
ci-all resubmit of https://github.com/pytorch/pytorch/pull/54227.

Tests look good except for a few distributed autograd failures (pytorch_linux_xenial_cuda10_2_cudnn7_py3_multigpu_test) and rocm failures (pr/pytorch-linux-bionic-rocm4.1-py3.6).

The common denominator in rocm failures appears to be multi-gpu activity: some [multiprocess DDP failures](https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-bionic-rocm4.1-py3.6-test1/8115/console), some [single-process failures](https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-bionic-rocm4.1-py3.6-test2/8115/console) where the single process has autograd ops that span devices. jeffdaily jithunnair-amd sunway513, could one of you take a look? The streaming backward change is also beneficial to rocm, I expect.

For debugging rocm failures, I think we should ignore the multiprocess/DDP tests and focus on the single process cases. The root cause is probably the same and the single process cases are simpler.

----------------------------------

Update: Rocm failures are due to https://github.com/pytorch/pytorch/issues/59750.
2718a54032 is a workaround, to be updated once https://github.com/pytorch/pytorch/issues/59750 is fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57833

Reviewed By: mruberry

Differential Revision: D28942391

Pulled By: ngimel

fbshipit-source-id: d6047e971c5f1c6386334bf3641402a92f12e2f8
2021-06-13 12:09:56 -07:00
Jeffrey Wan
a7a5992d7d Add no-grad inference mode note (#58513)
Summary:
Adds a note explaining the difference between several often conflated mechanisms in the autograd note
Also adds a link to this note from the docs in `grad_mode` and `nn.module`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58513

Reviewed By: gchanan

Differential Revision: D28651129

Pulled By: soulitzer

fbshipit-source-id: af9eb1749b641fc1b632815634eea36bf7979156
2021-05-25 13:06:54 -07:00
Jithun Nair
ab6b5fa036 Add HIP (ROCm) semantics doc (#57871)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57871

Reviewed By: agolynski

Differential Revision: D28385510

Pulled By: malfet

fbshipit-source-id: 9cf69e52d026a1cf74cc12d8727ca17ae026235e
2021-05-12 12:34:07 -07:00
albanD
d16ed1ee8a Add first draft of gradcheck note (#55966)
Summary:
You can find the latest rendered version in the `python_doc_build` CI job below, in the artifact tab of that build on circle CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55966

Reviewed By: H-Huang

Differential Revision: D28032446

Pulled By: albanD

fbshipit-source-id: 227ad37b03d39894d736c19cae3195b4d56fc62f
2021-04-27 14:33:42 -07:00
Ilqar Ramazanli
70d9be0f42 Replace duplicative s with alpha (#56804)
Summary:
It is always easier to read a document when different objects / concepts denoted with different variables / representations.
In this PR we make sure the [complex autograd](https://pytorch.org/docs/master/notes/autograd.html#autograd-for-complex-numbers) documentation, the variable of output and step size diverge.

Fixes https://github.com/pytorch/pytorch/issues/53633

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56804

Reviewed By: anjali411

Differential Revision: D27989959

Pulled By: iramazanli

fbshipit-source-id: c271590ee744c8aeeff62bfaa2295429765ef64e
2021-04-25 16:27:09 -07:00
Erjia Guan
8cf85a1152 [DataLoader][doc] Randomness for base_seed generator and NumPy seed (#56528)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56528

Tried to search across internal and external usage of DataLoader. People haven't started to use `generator` for `DataLoader`.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D27908487

Pulled By: ejguan

fbshipit-source-id: 14c83ed40d4ba4dc988b121968a78c2732d8eb93
2021-04-22 09:40:45 -07:00
Natalia Gimelshein
f94c95a2dd Revert D23752058: [pytorch][PR] Don't split oversize cached blocks
Test Plan: revert-hammer

Differential Revision:
D23752058 (67dcd62310)

Original commit changeset: ccb7c13e3cf8

fbshipit-source-id: 12ae9702135ea510e9714ed97fb75ca3b9f97c27
2021-04-14 09:24:08 -07:00
Michael Wootton
67dcd62310 Don't split oversize cached blocks (#44742)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/35901

This change is designed to prevent fragmentation in the Caching Allocator.  Permissive block splitting in the allocator allows very large blocks to be split into many pieces.  Once split too finely it is unlikely all pieces will be 'free' at that same time so the original allocation can never be returned.   Anecdotally, we've seen a model run out of memory failing to alloc a 50 MB block on a 32 GB card while the caching allocator is holding 13 GB of 'split free blocks'

Approach:

- Large blocks above a certain size are designated "oversize".  This limit is currently set 1 decade above large, 200 MB
- Oversize blocks can not be split
- Oversize blocks must closely match the requested size (e.g. a 200 MB request will match an existing 205 MB block, but not a 300 MB block)
- In lieu of splitting oversize blocks there is a mechanism to quickly free a single oversize block (to the system allocator) to allow an appropriate size block to be allocated.  This will be activated under memory pressure and will prevent _release_cached_blocks()_ from triggering

Initial performance tests show this is similar or quicker than the original strategy.  Additional tests are ongoing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44742

Reviewed By: ngimel

Differential Revision: D23752058

Pulled By: ezyang

fbshipit-source-id: ccb7c13e3cf8ef2707706726ac9aaac3a5e3d5c8
2021-04-14 03:04:41 -07:00
Yukio Siraichi
93bf0ae6fc Remove legacy constructor calls from pytorch codebase. (#54142)
Summary:
Follow up from https://github.com/pytorch/pytorch/issues/53889
Related to https://github.com/pytorch/pytorch/issues/47112

Removing every occurrence of the legacy constructor call present in PyTorch at:
- _docs_
- _benchmarks_
- _test_
- _caffe2_
- _CONTRIBUTING.md_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54142

Reviewed By: ngimel

Differential Revision: D27699450

Pulled By: mruberry

fbshipit-source-id: 530aa3f5746cc8bc1407d5d51b2bbd8075e30546
2021-04-11 15:45:17 -07:00
James Reed
ec38dda1cc Remove extra close bracket in extending.rst (#55409)
Summary:
Small typo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55409

Reviewed By: pbelevich

Differential Revision: D27611177

Pulled By: jamesr66a

fbshipit-source-id: 8a5ff702e4ab8a7eb2403432889f8b7a5a69484b
2021-04-07 21:15:46 -07:00
Peter Bell
8ac0619784 Avoid infinite recursion in __torch_function__ example (#55391)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55284

This gets the example to run but probably doesn't help the readability of the example.

Thoughts?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55391

Reviewed By: mrshenli

Differential Revision: D27621096

Pulled By: ezyang

fbshipit-source-id: d02c4fb0001e54139a167b477fd3b4a229e4dc8c
2021-04-07 20:31:46 -07:00
James Reed
c96f076248 Fix typo in extending.rst (#55408)
Summary:
Small typo in docs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55408

Reviewed By: pbelevich

Differential Revision: D27611175

Pulled By: jamesr66a

fbshipit-source-id: a83a6220054c0411329792c7ac6afceb2b699f44
2021-04-07 03:46:01 -07:00
Sam Estep
5bcbbf5373 Lint trailing newlines (#54737)
Summary:
*Context:* https://github.com/pytorch/pytorch/issues/53406 added a lint for trailing whitespace at the ends of lines. However, in order to pass FB-internal lints, that PR also had to normalize the trailing newlines in four of the files it touched. This PR adds an OSS lint to normalize trailing newlines.

The changes to the following files (made in 54847d0adb9be71be4979cead3d9d4c02160e4cd) are the only manually-written parts of this PR:

- `.github/workflows/lint.yml`
- `mypy-strict.ini`
- `tools/README.md`
- `tools/test/test_trailing_newlines.py`
- `tools/trailing_newlines.py`

I would have liked to make this just a shell one-liner like the other three similar lints, but nothing I could find quite fit the bill. Specifically, all the answers I tried from the following Stack Overflow questions were far too slow (at least a minute and a half to run on this entire repository):

- [How to detect file ends in newline?](https://stackoverflow.com/q/38746)
- [How do I find files that do not end with a newline/linefeed?](https://stackoverflow.com/q/4631068)
- [How to list all files in the Git index without newline at end of file](https://stackoverflow.com/q/27624800)
- [Linux - check if there is an empty line at the end of a file [duplicate]](https://stackoverflow.com/q/34943632)
- [git ensure newline at end of each file](https://stackoverflow.com/q/57770972)

To avoid giving false positives during the few days after this PR is merged, we should probably only merge it after https://github.com/pytorch/pytorch/issues/54967.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54737

Test Plan:
Running the shell script from the "Ensure correct trailing newlines" step in the `quick-checks` job of `.github/workflows/lint.yml` should print no output and exit in a fraction of a second with a status of 0. That was not the case prior to this PR, as shown by this failing GHA workflow run on an earlier draft of this PR:

- https://github.com/pytorch/pytorch/runs/2197446987?check_suite_focus=true

In contrast, this run (after correcting the trailing newlines in this PR) succeeded:

- https://github.com/pytorch/pytorch/pull/54737/checks?check_run_id=2197553241

To unit-test `tools/trailing_newlines.py` itself (this is run as part of our "Test tools" GitHub Actions workflow):
```
python tools/test/test_trailing_newlines.py
```

Reviewed By: malfet

Differential Revision: D27409736

Pulled By: samestep

fbshipit-source-id: 46f565227046b39f68349bbd5633105b2d2e9b19
2021-03-30 13:09:52 -07:00