Commit Graph

44 Commits

Author SHA1 Message Date
Victor Quach
5830f122f1 Add docstrings for save_on_cpu hooks (#62410)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62410

This PR adds docstrings for CPU hooks introduced in #61928.

Also uncomments the warning about pinned memory in CUDA semantics docs.

Depends on: #62361.

For now, the docstrings live on an orphan page at https://docs-preview.pytorch.org/62410/generated/torch.autograd.graph.set_save_on_cpu_hooks.html#torch-autograd-graph-set-save-on-cpu-hooks

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D29990129

Pulled By: Varal7

fbshipit-source-id: 7a98eeee6a0abb11e2c2d9169cd1aa35ad7ba3f4
2021-08-03 17:53:45 -07:00
Michael Carilli
2fa6c7627e [CUDA graphs][BC-breaking] Removes post-backward syncs on default stream (#60421)
Summary:
Before https://github.com/pytorch/pytorch/pull/57833, calls to backward() or grad() synced only the calling thread's default stream with autograd leaf streams at the end of backward. This made the following weird pattern safe:
```python
with torch.cuda.stream(s):
    # imagine forward used many streams, so backward leaf nodes may run on many streams
    loss.backward()
# no sync
# use grads here (e.g., read p.grad)
```

but a more benign-looking pattern was unsafe:
```python
with torch.cuda.stream(s):
    # imagine forward used a lot of streams, so backward leaf nodes may run on many streams
    loss.backward()
    # backward() syncs the default stream with all the leaf streams, but does not sync s with anything,
    # so counterintuitively (even though we're in the same stream context as backward()!)
    # it is NOT SAFE to use grads here, and there's no easy way to make it safe,
    # unless you manually sync on all the streams you used in forward,
    # or move "use grads" back to default stream outside the context.
    # use grads (unsafe here without the manual syncs described above)
```
mruberry, ngimel, and I decided backward() should have the [same user-facing stream semantics as any CUDA op](https://pytorch.org/docs/master/notes/cuda.html#stream-semantics-of-backward-passes).** In other words, the weird pattern should be unsafe, and the benign-looking pattern should be safe. Implementation-wise, this meant backward() should sync its calling thread's current stream, not the default stream, with the leaf streams.

After https://github.com/pytorch/pytorch/pull/57833, backward syncs the calling thread's current stream AND default stream with all leaf streams at the end of backward. The default stream syncs were retained for temporary backward compatibility.

This PR finishes https://github.com/pytorch/pytorch/pull/57833's work by deleting syncs on the default stream.
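A minimal sketch of the explicit ordering the old "weird" pattern now needs (assuming `s`, `loss`, and the surrounding setup from the snippets above; `wait_stream` is the usual way to order two streams, not something this PR adds):
```python
with torch.cuda.stream(s):
    loss.backward()
# The implicit default-stream sync is gone, so make the consumer stream wait on s
# before reading any gradients outside the stream context.
torch.cuda.current_stream().wait_stream(s)
# now safe to use grads (e.g., read p.grad) on the current/default stream
```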

With this PR, graph-capturing an entire backward() call should be possible (see the [test_graph_grad_scaling diffs](https://github.com/pytorch/pytorch/compare/master...mcarilli:streaming_backwards_remove_default_syncs?expand=1#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450R3641-R3642)).

** The first paragraph has a formatting error, which this PR should also fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60421

Reviewed By: albanD

Differential Revision: D29370344

Pulled By: ngimel

fbshipit-source-id: 3248bc5fb92fc517db0c15c897e5d7250f67d7fe
2021-06-24 17:34:02 -07:00
Luca Wehrstedt
bb9e1150ea Revert D29342234: [pytorch][PR] [CUDA graphs][BC-breaking] Removes post-backward syncs on default stream
Test Plan: revert-hammer

Differential Revision:
D29342234 (675cea1adb)

Original commit changeset: 98e6be7fdd85

fbshipit-source-id: 84022973248b2254210eee57402df2c4f4bc43c6
2021-06-24 04:49:28 -07:00
Michael Carilli
675cea1adb [CUDA graphs][BC-breaking] Removes post-backward syncs on default stream (#60421)
Summary:
Before https://github.com/pytorch/pytorch/pull/57833, calls to backward() or grad() synced only the calling thread's default stream with autograd leaf streams at the end of backward. This made the following weird pattern safe:
```python
with torch.cuda.stream(s):
    # imagine forward used many streams, so backward leaf nodes may run on many streams
    loss.backward()
# no sync
# use grads here (e.g., read p.grad)
```

but a more benign-looking pattern was unsafe:
```python
with torch.cuda.stream(s):
    # imagine forward used a lot of streams, so backward leaf nodes may run on many streams
    loss.backward()
    # backward() syncs the default stream with all the leaf streams, but does not sync s with anything,
    # so counterintuitively (even though we're in the same stream context as backward()!)
    # it is NOT SAFE to use grads here, and there's no easy way to make it safe,
    # unless you manually sync on all the streams you used in forward,
    # or move "use grads" back to default stream outside the context.
    # use grads (unsafe here without the manual syncs described above)
```
mruberry, ngimel, and I decided backward() should have the [same user-facing stream semantics as any CUDA op](https://pytorch.org/docs/master/notes/cuda.html#stream-semantics-of-backward-passes).** In other words, the weird pattern should be unsafe, and the benign-looking pattern should be safe. Implementation-wise, this meant backward() should sync its calling thread's current stream, not the default stream, with the leaf streams.

After https://github.com/pytorch/pytorch/pull/57833, backward syncs the calling thread's current stream AND default stream with all leaf streams at the end of backward. The default stream syncs were retained for temporary backward compatibility.

This PR finishes https://github.com/pytorch/pytorch/pull/57833's work by deleting syncs on the default stream.
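A minimal sketch of the benign-looking pattern that these semantics make safe (assuming `s` and `loss` as in the snippets above; `model` is a hypothetical module standing in for "use grads"):
```python
with torch.cuda.stream(s):
    loss.backward()
    # Safe: backward() synced s (the calling thread's current stream) with every leaf stream,
    # so gradient consumers queued on s are correctly ordered.
    grads = [p.grad for p in model.parameters() if p.grad is not None]
```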

With this PR, graph-capturing an entire backward() call should be possible (see the [test_graph_grad_scaling diffs](https://github.com/pytorch/pytorch/compare/master...mcarilli:streaming_backwards_remove_default_syncs?expand=1#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450R3641-R3642)).

** The first paragraph has a formatting error, which this PR should also fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60421

Reviewed By: VitalyFedyunin, albanD

Differential Revision: D29342234

Pulled By: ngimel

fbshipit-source-id: 98e6be7fdd8550872f0a78f9a66cb8dfe75abf63
2021-06-23 23:35:24 -07:00
Michael Wootton
2f3be2735f Don't split oversize cached blocks (#44742)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/35901

This change is designed to prevent fragmentation in the Caching Allocator.  Permissive block splitting in the allocator allows very large blocks to be split into many pieces.  Once split too finely, it is unlikely that all pieces will be 'free' at the same time, so the original allocation can never be returned.  Anecdotally, we've seen a model run out of memory failing to allocate a 50 MB block on a 32 GB card while the caching allocator is holding 13 GB of 'split free blocks'.

Approach:

- Large blocks above a certain size are designated "oversize".  This limit is currently set one decade above "large", at 200 MB.
- Oversize blocks cannot be split.
- Oversize blocks must closely match the requested size (e.g. a 200 MB request will match an existing 205 MB block, but not a 300 MB block).
- In lieu of splitting oversize blocks, there is a mechanism to quickly free a single oversize block (back to the system allocator) so that an appropriately sized block can be allocated.  This is activated under memory pressure and prevents `release_cached_blocks()` from triggering.

Initial performance tests show this is similar to or quicker than the original strategy.  Additional tests are ongoing.
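As a hedged illustration of how this behavior surfaces to users: later PyTorch releases expose the split threshold through the caching-allocator config string. The exact knob name (`max_split_size_mb`) and the value below are assumptions for illustration, not something stated in this commit message.
```python
import os

# Assumed knob: must be set before the first CUDA allocation so the allocator reads it.
# With this setting, cached blocks larger than 200 MB would never be split.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:200"

import torch
x = torch.empty(64 * 1024 * 1024, device="cuda")  # ~256 MB request served without splitting an oversize block
```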

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44742

Reviewed By: zou3519

Differential Revision: D29186394

Pulled By: ezyang

fbshipit-source-id: c88918836db3f51df59de6d1b3e03602ebe306a9
2021-06-21 11:46:08 -07:00
Michael Carilli
be038d8989 [CUDA graphs] Make stream semantics of backward calls consistent with other cuda ops (ci-all edition) (#57833)
Summary:
ci-all resubmit of https://github.com/pytorch/pytorch/pull/54227.

Tests look good except for a few distributed autograd failures (pytorch_linux_xenial_cuda10_2_cudnn7_py3_multigpu_test) and ROCm failures (pr/pytorch-linux-bionic-rocm4.1-py3.6).

The common denominator in the ROCm failures appears to be multi-GPU activity: some [multiprocess DDP failures](https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-bionic-rocm4.1-py3.6-test1/8115/console), some [single-process failures](https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-bionic-rocm4.1-py3.6-test2/8115/console) where the single process has autograd ops that span devices. jeffdaily jithunnair-amd sunway513, could one of you take a look? I expect the streaming-backward change is also beneficial on ROCm.

For debugging the ROCm failures, I think we should ignore the multiprocess/DDP tests and focus on the single-process cases. The root cause is probably the same, and the single-process cases are simpler.

----------------------------------

Update: ROCm failures are due to https://github.com/pytorch/pytorch/issues/59750.
2718a54032 is a workaround, to be updated once https://github.com/pytorch/pytorch/issues/59750 is fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57833

Reviewed By: mruberry

Differential Revision: D28942391

Pulled By: ngimel

fbshipit-source-id: d6047e971c5f1c6386334bf3641402a92f12e2f8
2021-06-13 12:09:56 -07:00
Natalia Gimelshein
f94c95a2dd Revert D23752058: [pytorch][PR] Don't split oversize cached blocks
Test Plan: revert-hammer

Differential Revision:
D23752058 (67dcd62310)

Original commit changeset: ccb7c13e3cf8

fbshipit-source-id: 12ae9702135ea510e9714ed97fb75ca3b9f97c27
2021-04-14 09:24:08 -07:00
Michael Wootton
67dcd62310 Don't split oversize cached blocks (#44742)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/35901

This change is designed to prevent fragmentation in the Caching Allocator.  Permissive block splitting in the allocator allows very large blocks to be split into many pieces.  Once split too finely, it is unlikely that all pieces will be 'free' at the same time, so the original allocation can never be returned.  Anecdotally, we've seen a model run out of memory failing to allocate a 50 MB block on a 32 GB card while the caching allocator is holding 13 GB of 'split free blocks'.

Approach:

- Large blocks above a certain size are designated "oversize".  This limit is currently set one decade above "large", at 200 MB.
- Oversize blocks cannot be split.
- Oversize blocks must closely match the requested size (e.g. a 200 MB request will match an existing 205 MB block, but not a 300 MB block).
- In lieu of splitting oversize blocks, there is a mechanism to quickly free a single oversize block (back to the system allocator) so that an appropriately sized block can be allocated.  This is activated under memory pressure and prevents `release_cached_blocks()` from triggering.

Initial performance tests show this is similar to or quicker than the original strategy.  Additional tests are ongoing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44742

Reviewed By: ngimel

Differential Revision: D23752058

Pulled By: ezyang

fbshipit-source-id: ccb7c13e3cf8ef2707706726ac9aaac3a5e3d5c8
2021-04-14 03:04:41 -07:00
pbialecki
1451d84766 Minor doc fix: change truncating to rounding in TF32 docs (#49625)
Summary:
Minor doc fix clarifying that the input data is rounded, not truncated.

CC zasdfgbnm ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49625

Reviewed By: mruberry

Differential Revision: D25668244

Pulled By: ngimel

fbshipit-source-id: ac97e41e0ca296276544f9e9f85b2cf1790d9985
2020-12-22 13:46:33 -08:00
Peter Bell
5180caeeb4 Remove deprecated spectral ops from torch namespace (#48594)
Summary:
Ref https://github.com/pytorch/pytorch/issues/42175

This removes the 4 deprecated spectral functions: `torch.{fft,rfft,ifft,irfft}`. `torch.fft` is also now imported by default.

The actual `at::native` functions are still used in `torch.stft`, so they can't be fully removed yet; that will happen once https://github.com/pytorch/pytorch/issues/47601 has been merged.
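For reference, a short sketch of the replacement API; the `torch.fft` calls below exist in released PyTorch, while the correspondence comments are only an approximation of the removed signatures.
```python
import torch

x = torch.randn(8)
X = torch.fft.rfft(x)             # roughly replaces the removed torch.rfft(x, 1)
x_back = torch.fft.irfft(X, n=8)  # roughly replaces the removed torch.irfft(..., signal_sizes=(8,))
```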

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48594

Reviewed By: heitorschueroff

Differential Revision: D25298929

Pulled By: mruberry

fbshipit-source-id: e36737fe8192fcd16f7e6310f8b49de478e63bf0
2020-12-05 04:12:32 -08:00
Xiang Gao
4a7de2746f Add docs on how to toggle TF32 flags on C++ (#47331)
Summary:
I have been asked several times how to toggle this flag in libtorch. I think it would be good to mention it in the docs.
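For context, a hedged sketch of the Python-side equivalents of the flags the libtorch docs describe (both attributes exist in released PyTorch; defaults can differ across versions):
```python
import torch

torch.backends.cuda.matmul.allow_tf32 = True  # TF32 for cuBLAS matmuls on Ampere+
torch.backends.cudnn.allow_tf32 = True        # TF32 for cuDNN convolutions
```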

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47331

Reviewed By: glaringlee

Differential Revision: D24777576

Pulled By: mruberry

fbshipit-source-id: cc2a338c477bb57e0bb74b8960c47fde99665e41
2020-11-08 01:29:24 -08:00
Michael Carilli
5640b79bf8 Allow consumer ops to sync on GraphRoot's gradient (#45787)
Summary:
Currently, a GraphRoot instance doesn't have an associated stream.  Streaming backward synchronization logic assumes the instance ran on the default stream, and tells consumer ops to sync with the default stream.  If the gradient the GraphRoot instance passes to consumer backward ops was populated on a non-default stream, we have a race condition.

The race condition can exist even if the user doesn't give a manually populated gradient:
```python
with torch.cuda.stream(side_stream):
    # loss.backward() implicitly synthesizes a one-element 1.0 tensor on side_stream
    # GraphRoot passes it to consumers, but consumers first sync on default stream, not side_stream.
    loss.backward()

    # Internally to backward(), streaming-backward logic takes over, stuff executes on the same stream it ran on in forward,
    # and the side_stream context is irrelevant.  GraphRoot's interaction with its first consumer(s) is the spot where
    # the side_stream context causes a problem.
```

This PR fixes the race condition by associating a GraphRoot instance, at construction time, with the current stream(s) on the device(s) of the grads it will pass to consumers. (I think this relies on GraphRoot executing in the main thread, before the backward thread(s) fork, because the grads were populated on the main thread.)

The test demonstrates the race condition. It fails reliably without the PR's GraphRoot diffs and passes with the GraphRoot diffs.

With the GraphRoot diffs, manually populating an incoming-gradient arg for `backward` (or `torch.autograd.grad`) and the actual call to `autograd.backward` will have the same stream-semantics relationship as any other pair of ops:
```python
# implicit population is safe
with torch.cuda.stream(side_stream):
    loss.backward()

# explicit population in side stream then backward in side stream is safe
with torch.cuda.stream(side_stream):
    kickoff_grad = torch.ones_like(loss)
    loss.backward(gradient=kickoff_grad)

# explicit population in one stream then backward kickoff in another stream
# is NOT safe, even with this PR's diffs, but that unsafety is consistent with
# stream-semantics relationship of any pair of ops
kickoff_grad = torch.ones_like(loss)
with torch.cuda.stream(side_stream):
    loss.backward(gradient=kickoff_grad)

# Safe, as you'd expect for any pair of ops
kickoff_grad = torch.ones_like(loss)
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream):
    loss.backward(gradient=kickoff_grad)
```
This PR also adds the last three examples above to the CUDA docs and references them from the autograd docstrings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45787

Reviewed By: nairbv

Differential Revision: D24138376

Pulled By: albanD

fbshipit-source-id: bc4cd9390f9f0358633db530b1b09f9c1080d2a3
2020-10-07 08:53:53 -07:00
Bert Maher
03342af3a3 Add env variable to bypass CUDACachingAllocator for debugging (#45294)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45294

While tracking down a recent memory-corruption bug, we found that cuda-memcheck wasn't finding the bad accesses; ngimel pointed out that this is because we use a caching allocator, so a lot of "out of bounds" accesses land in a valid slab.

This PR adds a runtime knob (`PYTORCH_NO_CUDA_MEMORY_CACHING`) that, when set,
bypasses the caching allocator's caching logic so that allocations go straight
to cudaMalloc.  This way, cuda-memcheck will actually work.

Test Plan:
Insert some memory errors and run a test under cuda-memcheck;
observe that cuda-memcheck flags an error where expected.

Specifically I removed the output-masking logic here:
https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/tensorexpr/cuda_codegen.cpp#L819-L826

And ran:
```
PYTORCH_NO_CUDA_MEMORY_CACHING=1 cuda-memcheck pytest -k test_superslomo test_jit_fuser_te.py
```

Reviewed By: ngimel

Differential Revision: D23964734

Pulled By: bertmaher

fbshipit-source-id: 04efd11e8aff037b9edde80c70585cb820ee6e39
2020-09-28 11:40:04 -07:00
Gao, Xiang
5e97f251a8 Enable TF32 support for cuDNN (#40737)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40737

Reviewed By: mruberry

Differential Revision: D22801525

Pulled By: ngimel

fbshipit-source-id: ac7f7e728b4b3e01925337e8c9996f26a6433fd2
2020-09-01 15:34:24 -07:00
Xiang Gao
23174ca71b [reland] Enable TF32 support for cuBLAS (#41498)
Summary:
Fix ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41498

Reviewed By: mruberry

Differential Revision: D22560572

Pulled By: ngimel

fbshipit-source-id: 5ee79e96cb29e70d9180830d058efb53d1c6c041
2020-07-15 21:00:55 -07:00
Shen Li
3a63a939d4 Revert D22517785: [pytorch][PR] Enable TF32 support for cuBLAS
Test Plan: revert-hammer

Differential Revision:
D22517785 (288ece89e1)

Original commit changeset: 87334c893561

fbshipit-source-id: 0a0674f49c1bcfc98f7f88af5a8c7de93b76e458
2020-07-15 08:15:48 -07:00
Xiang Gao
288ece89e1 Enable TF32 support for cuBLAS (#40800)
Summary:
Benchmark on a fully connected network and torchvision models (time in seconds) on GA100:

| model              | batch size | forward(TF32) | forward(FP32) | backward(TF32) | backward(FP32) |
|--------------------|------------|---------------|---------------|----------------|----------------|
| FC 512-128-32-8    | 512        | 0.000211      | 0.000321      | 0.000499       | 0.000532       |
| alexnet            | 512        | 0.0184        | 0.0255        | 0.0486         | 0.0709         |
| densenet161        | 128        | 0.0665        | 0.204         | 0.108          | 0.437          |
| googlenet          | 256        | 0.0925        | 0.110         | 0.269          | 0.326          |
| inception_v3       | 256        | 0.155         | 0.214         | 0.391          | 0.510          |
| mnasnet1_0         | 512        | 0.108         | 0.137         | 0.298          | 0.312          |
| mobilenet_v2       | 512        | 0.114         | 0.294         | 0.133          | 0.303          |
| resnet18           | 512        | 0.0722        | 0.100         | 0.182          | 0.228          |
| resnext50_32x4d    | 256        | 0.170         | 0.237         | 0.373          | 0.479          |
| shufflenet_v2_x1_0 | 512        | 0.0463        | 0.0473        | 0.125          | 0.123          |
| squeezenet1_0      | 512        | 0.0870        | 0.0948        | 0.205          | 0.214          |
| vgg16              | 256        | 0.167         | 0.234         | 0.401          | 0.502          |
| wide_resnet50_2    | 512        | 0.186         | 0.310         | 0.415          | 0.638          |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40800

Reviewed By: mruberry

Differential Revision: D22517785

Pulled By: ngimel

fbshipit-source-id: 87334c8935616f72a6af5abbd3ae69f76923dc3e
2020-07-14 13:21:10 -07:00
Edward Leardi
6b50874cb7 Fix HTTP links in documentation to HTTPS (#40878)
Summary:
I ran `make linkcheck` using `sphinx.builders.linkcheck` on the documentation and noticed a few links weren't using HTTPS, so I quickly updated them all.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40878

Differential Revision: D22404647

Pulled By: ngimel

fbshipit-source-id: 9c9756db59197304023fddc28f252314f6cf4af3
2020-07-06 20:05:21 -07:00
Xiang Gao
df8d6eeb19 Update docs about DP and DDP for CUDA (#35063)
Summary:
We should recommend DDP instead of DP. Hopefully we can also cherry-pick this for 1.5.
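A minimal, hedged sketch of the recommended DDP pattern (the `torch.distributed` calls exist in released PyTorch; the env:// rendezvous, one-process-per-GPU launch, and `MyModel` module are assumptions for illustration):
```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")  # env:// rendezvous assumed (MASTER_ADDR, RANK, etc. set by the launcher)
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)
model = DDP(MyModel().cuda(local_rank), device_ids=[local_rank])  # MyModel is hypothetical
# train as usual; DDP overlaps gradient all-reduce with the backward pass
```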
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35063

Differential Revision: D20549621

Pulled By: ngimel

fbshipit-source-id: 86b1b2134664065cc6070ea4212895f993eaf543
2020-03-20 20:06:37 -07:00
Jerry Ma
1610ea8ef8 Comprehensive-ish instrumentation for CUDA memory allocator (#27361)
Summary:
Adds comprehensive memory instrumentation to the CUDA caching memory allocator.

# Counters

Added comprehensive instrumentation for the following stats:
  - Allocation requests (`allocation`)
  - Allocated memory (`allocated_bytes`)
  - Reserved segments from cudaMalloc (`segment`)
  - Reserved memory (`reserved_bytes`)
  - Active memory blocks (`active`)
  - Active memory (`active_bytes`)
  - Inactive, non-releasable blocks (`inactive_split`)
  - Inactive, non-releasable memory (`inactive_split_bytes`)
  - Number of failed cudaMalloc calls that result in a cache flush and retry (`cuda_malloc_retries`)
  - Number of OOMs (`num_ooms`)

Except for the last two, these stats are segmented between all memory, large blocks, and small blocks. Along with the current value of each stat, historical counts of allocs/frees as well as peak usage are tracked by the allocator.

# Snapshots

Added the capability to get a "memory snapshot" – that is, to generate a complete dump of the allocator block/segment state.

# Implementation: major changes

- Added `torch.cuda.memory_stats()` (and associated C++ changes) which returns all instrumented stats as a dictionary.
- Added `torch.cuda.snapshot()` (and associated C++ changes) which returns a complete dump of the allocator block/segment state as a list of segments.
- Added memory summary generator in `torch.cuda.memory_summary()` for ease of client access to the instrumentation stats. Potentially useful to dump when catching OOMs. Sample output here: https://pastebin.com/uKZjtupq
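A hedged usage sketch of the accessors above; the exact key layout inside `memory_stats()` (e.g. `"allocated_bytes.all.current"`) is an assumption based on the stat names listed earlier.
```python
import torch

x = torch.empty(1024, 1024, device="cuda")
stats = torch.cuda.memory_stats()                # flat dict of counters
print(stats.get("allocated_bytes.all.current"))  # current allocated bytes (assumed key)
print(torch.cuda.memory_summary())               # human-readable table, handy to dump when catching OOMs
```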

# Implementation: minor changes

- Add error-checking helper functions for Python dicts and lists in `torch/csrc/utils/`.
- Existing memory management functions in `torch.cuda` moved from `__init__.py` to `memory.py` and star-imported to the main CUDA module.
- Add various helper functions to `torch.cuda` to return individual items from `torch.cuda.memory_stats()`.
- `torch.cuda.reset_max_memory_cached()` and `torch.cuda.reset_max_memory_allocated()` are deprecated in favor of `reset_peak_stats`. It's a bit difficult to think of a case where only one of those stats should be reset, and IMO this makes the peak stats collectively more consistent.
- `torch.cuda.memory_cached()` and `torch.cuda.max_memory_cached()` are deprecated in favor of `*memory_reserved()`.
- Style (add access modifiers in the allocator class, random nit fixes, etc.)

# Testing

- Added consistency check for stats in `test_cuda.py`. This verifies that the data from `memory_stats()` is faithful to the data from `snapshot()`.
- Ran on various basic workflows (toy example, CIFAR)

# Performance

Running the following speed benchmark: https://pastebin.com/UNndQg50

- Before this PR: 45.98 microseconds per tensor creation
- After this PR: 46.65 microseconds per tensor creation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27361

Differential Revision: D17758747

Pulled By: jma127

fbshipit-source-id: 5a84e82d696c40c505646b9a1b4e0c3bba38aeb6
2019-10-08 15:42:48 -07:00
Tongzhou Wang
98d3d1659e Document benchmarking practice for CUDA
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/23910

Differential Revision: D16732365

Pulled By: ezyang

fbshipit-source-id: 24e055602d479293da3e00a7143bba8f92bb7c4a
2019-08-13 15:07:23 -07:00
Tongzhou Wang
058beae411 Add IterableDataset (#19228)
Summary:
This is a modified version of https://github.com/pytorch/pytorch/pull/14705, since the commit structure of that PR is quite messy.

1. Add `IterableDataset`.
2. So we now have two data loading modes: `Iterable` and `Map`.

    1. `Iterable` if the `dataset` is an instance of `IterableDataset`
    2. `Map` otherwise.

3. Add better support for non-batch loading (i.e., `batch_size=None` and `batch_sampler=None`). This is useful for things like bulk loading.
4. Refactor `DataLoaderIter` into two classes, `_SingleProcessDataLoaderIter` and `_MultiProcessingDataLoaderIter`. Rename some methods to be more generic, e.g., `get_batch` -> `get_data`.
5. Add `torch.utils.data.get_worker_info`, which returns worker information in a worker process (e.g., worker id, dataset obj copy, etc.) and can be used in `IterableDataset.__iter__` and `worker_init_fn` to do per-worker configuration (see the sketch after this list).
6. Add `ChainDataset`, the analog of `ConcatDataset` for `IterableDataset`.
7. Import `torch.utils.data` in `torch/__init__.py`.
8. Add data loader examples and documentation.
9. Use `get_worker_info` to detect whether we are in a worker process in `default_collate`.
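A minimal sketch (using only the APIs named above) of an `IterableDataset` that uses `get_worker_info()` to shard its range across `DataLoader` workers:
```python
import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class RangeDataset(IterableDataset):
    def __init__(self, start, end):
        self.start, self.end = start, end

    def __iter__(self):
        info = get_worker_info()
        if info is None:  # single-process loading: yield the whole range
            return iter(range(self.start, self.end))
        # in a worker process: split the range evenly among workers
        per_worker = (self.end - self.start + info.num_workers - 1) // info.num_workers
        lo = self.start + info.id * per_worker
        hi = min(lo + per_worker, self.end)
        return iter(range(lo, hi))

loader = DataLoader(RangeDataset(0, 10), num_workers=2, batch_size=None)  # non-batch loading
print(list(loader))
```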

Closes https://github.com/pytorch/pytorch/issues/17909, https://github.com/pytorch/pytorch/issues/18096, https://github.com/pytorch/pytorch/issues/19946, and some of https://github.com/pytorch/pytorch/issues/13023
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19228

Reviewed By: bddppq

Differential Revision: D15058152

fbshipit-source-id: 9e081a901a071d7e4502b88054a34b450ab5ddde
2019-06-20 20:12:44 -07:00
Tongzhou Wang
bb89827e1d Update cuda pinned memory note to include tensor.to (#20977)
Summary:
Separates out bits of changes from #19228.
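A hedged sketch of the pattern the note covers (both calls exist in released PyTorch): pinned host memory lets `tensor.to` perform the host-to-device copy asynchronously.
```python
import torch

x = torch.randn(1024, 1024).pin_memory()  # page-locked host tensor
y = x.to("cuda", non_blocking=True)       # async copy; can overlap with other work on the stream
```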
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20977

Differential Revision: D15511919

Pulled By: soumith

fbshipit-source-id: 5015a29cdac6d6e160388c493182c330f0da63ec
2019-05-26 22:22:06 -07:00
Tongzhou Wang
6d307db5b4 Move cuFFT plan cache note outside Best Practices (#19538)
Summary:
I mistakenly put it there.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19538

Differential Revision: D15026500

Pulled By: soumith

fbshipit-source-id: 0c13499571fdfd789c3bd1c4b58abd870725d422
2019-04-20 21:39:59 -07:00
Tongzhou Wang
973d51079b Add device-specific cuFFT plan caches (#19300)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/19224
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19300

Differential Revision: D14986967

Pulled By: soumith

fbshipit-source-id: 8c31237db50d6924bba1472434c10326610d9255
2019-04-18 06:39:35 -07:00
SsnL
300dcc3b96 Add cuda.reset_max_memory_* (#15985)
Summary:
Addresses #15968
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15985

Differential Revision: D13649916

Pulled By: soumith

fbshipit-source-id: a207aea5709a79dba7a6fc541d0a70103f49efff
2019-01-14 07:31:51 -08:00
cclauss
b0248df72a Docs: Change cuda(async) —> cuda(non_blocking) (#12158)
Summary:
goldsborough Modify the docs to match the changes made in #4999
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12158

Differential Revision: D10103964

Pulled By: SsnL

fbshipit-source-id: 1b8692da86aca1a52e8d2e6cea76a5ad1f71e058
2018-09-28 08:39:27 -07:00
Kaiyu Shi
0169ac5936 Fix sample code for cuda stream (#8319) 2018-06-10 11:41:50 -04:00
Richard Zou
0430bfe40b [docs] Update broadcasting and cuda semantics notes (#6904)
* [docs] Update broadcasting and cuda semantics notes

* Update multiprocessing.rst

* address comments

* Address comments
2018-04-24 13:41:24 -04:00
Kento NOZAWA
c00ee6da8f Fix typos (#6348)
* Fix typo

* Fix typo

* Update faq.rst
2018-04-06 11:06:42 -04:00
Tongzhou Wang
392fc8885c add faq on cuda memory management and dataloder (#5378) 2018-02-27 18:35:30 -05:00
Tongzhou Wang
6420c6b224 Improve torch.cuda.empty_cache documentation (#4879)
* add doc noting that empty_cache won't increase the amount of memory available

* typo
2018-01-27 04:54:25 -05:00
Yongjik Kim
dd5c195646 More documentation for CUDA stream functions. (#4756) 2018-01-21 12:58:51 +01:00
Tongzhou Wang
5918243b0c Methods for checking CUDA memory usage (#4511)
* gpu mem allocated

* add test

* addressed some of @apaszke 's comments

* cache stats

* add more comments about test
2018-01-09 11:47:48 -05:00
SsnL
bb1b826cdc Exposing emptyCache from allocator (#3518)
* Add empty_cache binding

* cuda.empty_cache document

* update docs
2017-11-07 17:00:38 -05:00
Kaixhin
5de7f9e731 Tidy up CUDA notes 2017-11-05 14:42:06 +01:00
Kai Arulkumaran
a7c5be1d45 Document CUDA best practices (#3227) 2017-10-25 22:38:17 +02:00
Hungryof
73128f7b08 fix minor typos (#2051)
* Update extending.rst

fix typo

* Update cuda.rst

fix typo
2017-07-11 11:01:41 -04:00
Du Phan
86e40ed875 Fix a typo in docs about pinned memory buffers (#1023)
* remove misleading guide for BCELoss

* fix docs about pinned memory buffers
2017-03-17 05:08:03 -04:00
Eli Stevens
88275da5e8 CUDA documentation tweaks (#858) 2017-02-26 20:37:43 +01:00
Eli Stevens
b87c113cf4 CUDA documentation enhancement and docs versioning (#848)
* Add more detail to CUDA documentation

Also adds better cross-linking to the pages that discuss relevant topics.

* Adds recommendation to torch.save docs

* Make the version numbers for the docs dynamic

Might need tweaks for beta, 1.0, etc.
2017-02-26 08:33:26 -05:00
Alfredo Canziani
a38749d15f Fix cuda notes
Target GPU *is* consistent with source GPU
2017-01-27 19:30:49 +01:00
Adam Paszke
4cc11066b2 Add torch.utils.data docs and improve notes (#460)
* Add torch.utils.data docs and improve notes
2017-01-17 14:51:05 -05:00
Adam Paszke
15c1dad340 Minor fixes and torch.cuda docs 2017-01-16 20:38:14 -05:00