Commit Graph

842 Commits

Author SHA1 Message Date
Ting Lu
c167c34396 Skip unsupported tests on arm (#117344)
Add skips to tests that involve record_context_cpp on ARM, as it is only supported on the Linux x86_64 architecture. The error is reported as below:
```
Traceback (most recent call last):
  File "/usr/lib/python3.10/unittest/case.py", line 59, in testPartExecutor
    yield
  File "/usr/lib/python3.10/unittest/case.py", line 591, in run
    self._callTestMethod(testMethod)
  File "/usr/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
    method()
  File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 2674, in wrapper
    method(*args, **kwargs)
  File "/opt/pytorch/pytorch/test/test_cuda.py", line 3481, in test_direct_traceback
    c = gather_traceback(True, True, True)
RuntimeError: record_context_cpp is not support on non-linux non-x86_64 platforms
```
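A minimal sketch of the kind of skip this adds (hypothetical test name; `IS_ARM64`/`IS_LINUX` are the platform flags from `torch.testing._internal.common_utils`, and the actual PR may gate on a different condition):
```python
import unittest

from torch.testing._internal.common_utils import IS_ARM64, IS_LINUX, TestCase, run_tests


class TestRecordContextCpp(TestCase):
    # record_context_cpp relies on the C++ unwinder, which is only built for Linux x86_64.
    @unittest.skipIf(IS_ARM64 or not IS_LINUX, "record_context_cpp is only supported on Linux x86_64")
    def test_direct_traceback(self):
        from torch._C._profiler import gather_traceback

        self.assertIsNotNone(gather_traceback(True, True, True))


if __name__ == "__main__":
    run_tests()
```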

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117344
Approved by: https://github.com/malfet, https://github.com/drisspg
2024-01-12 21:12:11 +00:00
Doe Hyun Yoon
83c45a9931 Faster gc_count update for CUDACachingAllocator (and avoid nullptr dereference) (#109065) (#117064)

Summary:

Modify the way we update gc_count in CUDACachingAllocator to make it faster.

Originally D48481557, but reverted due to a nullptr dereference in some cases (D49003756). This diff changes it to use the correct constructor for the search key (so it avoids the nullptr dereference). Also, added a nullptr check (returning 0 if null) in the gc_count functions.

Differential Revision: D49068760

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117064
Approved by: https://github.com/zdevito
2024-01-11 19:47:05 +00:00
Nikita Shulga
a6325ad86c Fix cuInit test on Windows (#117055)
By changing library name from `libcuda.so.1` to `nvcuda.dll` on Windows

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117055
Approved by: https://github.com/Skylion007, https://github.com/huydhn, https://github.com/atalman
2024-01-10 00:45:18 +00:00
Nikita Shulga
81b7a09d27 [CI] Test that cuInit is not called during import (#117010)
By making a driver API call in a subprocess and expecting it to return `CUDA_ERROR_NOT_INITIALIZED`
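A rough sketch of the idea, not the exact test code: in a fresh subprocess, import torch and then check that a raw driver call still reports `CUDA_ERROR_NOT_INITIALIZED` (error code 3), which proves `cuInit` has not run:
```python
import subprocess
import sys

CHECK = """
import ctypes, sys
import torch  # importing torch must not call cuInit
lib = ctypes.CDLL("nvcuda.dll" if sys.platform == "win32" else "libcuda.so.1")
count = ctypes.c_int()
rc = lib.cuDeviceGetCount(ctypes.byref(count))
sys.exit(0 if rc == 3 else 1)  # 3 == CUDA_ERROR_NOT_INITIALIZED
"""

result = subprocess.run([sys.executable, "-c", CHECK])
assert result.returncode == 0, "cuInit was called during `import torch`"
```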

Test Plan: run it on nightlies before https://github.com/pytorch/pytorch/pull/116201 got reverted and observe the failure

This is very important for lots of distributed launchers

Fixes https://github.com/pytorch/pytorch/issues/116276

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117010
Approved by: https://github.com/albanD
2024-01-09 14:44:22 +00:00
Aaron Gokaslan
95041829c8 Add bfloat16 CUDA support to RNN (#116927)
Fixes #116925
Fixes #116763
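A minimal sketch of what this enables (assuming a CUDA device with bfloat16 support is available):
```python
import torch

rnn = torch.nn.RNN(input_size=16, hidden_size=32, device="cuda", dtype=torch.bfloat16)
x = torch.randn(5, 3, 16, device="cuda", dtype=torch.bfloat16)  # (seq, batch, feature)
out, h = rnn(x)
print(out.dtype)  # torch.bfloat16
```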

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116927
Approved by: https://github.com/malfet
2024-01-06 22:55:34 +00:00
Aaron Gokaslan
3fe437b24b [BE]: Update flake8 to v6.1.0 and fix lints (#116591)
Updates flake8 to v6.1.0 and fixes a few lints using sed and some ruff tooling.
- Replace `assert(0)` with `raise AssertionError()`
- Remove extraneous parentheses, i.e.
  - `assert(a == b)` -> `assert a == b`
  - `if(x > y or y < z):`->`if x > y or y < z:`
  - And `return('...')` -> `return '...'`

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116591
Approved by: https://github.com/albanD, https://github.com/malfet
2024-01-03 06:04:44 +00:00
Aaron Gokaslan
bd10fea79a [BE]: Enable F821 and fix bugs (#116579)
Fixes #112371

I tried to fix as many of the bugs as I could; a few I could not figure out the proper fix for, so I left them with noqas.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116579
Approved by: https://github.com/ezyang
2024-01-01 08:40:46 +00:00
zdevito
4afe2687d5 Reland "Serve multistream graph captures from correct pool (#114647)" (#116199)
Fixes a variable shadowing problem that broke internal builds.

This reverts commit fe15645619.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116199
Approved by: https://github.com/eellison
2023-12-20 21:22:34 +00:00
PyTorch MergeBot
fe15645619 Revert "Serve multistream graph captures from correct pool (#114647)"
This reverts commit 8a445f7bd5.

Reverted https://github.com/pytorch/pytorch/pull/114647 on behalf of https://github.com/jeanschmidt due to breaking multiple internal build jobs, please check internal diff in order to obtain more details ([comment](https://github.com/pytorch/pytorch/pull/114647#issuecomment-1864840724))
2023-12-20 17:11:42 +00:00
zdevito
8a445f7bd5 Serve multistream graph captures from correct pool (#114647)
This fixes #114320 by placing the logic for determining whether to allocate
to a pool inside a callback that is controlled by CUDAGraph.cpp or by the
python bound api to allocate a stream directly to a pool.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114647
Approved by: https://github.com/ngimel, https://github.com/eellison
2023-12-18 18:24:15 +00:00
rzou
8ddca5aeae markDynamoStrictTest some more tests (#115857)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115857
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845, #115855, #115856
2023-12-15 01:22:38 +00:00
atalman
43e3242490 [BE] Remove test corner cases for CUDA older than supported 11.8 (#114989)
Remove deprecated CUDA use cases from tests.
Similar to: https://github.com/pytorch/pytorch/pull/112873

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114989
Approved by: https://github.com/malfet
2023-12-04 21:41:03 +00:00
eqy
6a86cf00ad [CUDA][cuBLAS] Remove explicit cuBLAS workspace allocation for CUDA 12.2+ (#113994)
cuBLAS should be using `cudaMallocAsync` in CUDA 12.2+, which removes the need for explicit workspace allocation to avoid increasing memory usage with multiple graph captures.

CC @ptrblck @malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113994
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-11-22 23:23:51 +00:00
Banit Agrawal
cc776d2186 [PyTorch Pinned Allocator] Create per thread task pool for mapping memory space (#111545)
Differential Revision: D50443865

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111545
Approved by: https://github.com/zdevito
2023-10-22 00:23:49 +00:00
Kazuaki Ishizaki
a603dcc307 Fix typo under test directory (#110826)
This PR fixes the typo `the the` in comments in files under the `test` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110826
Approved by: https://github.com/Skylion007
2023-10-08 20:52:38 +00:00
Banit Agrawal
64583c4d04 [CUDA Host Allocator] Add support of CudaHostRegister (#108488)
Summary: This diff adds another option to create CUDA pinned memory using cudaHostRegister.

Differential Revision: D45843715

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108488
Approved by: https://github.com/zdevito
2023-10-06 04:13:02 +00:00
Aidyn-A
e7bd9c5315 [CUDA][CUDA Graphs] Fix CUDAGraph::reset function (#108896)
The following two cases fail due to a small oversight in `CUDAGraph::reset()` that causes failures in the graph destructor
```Python
import torch

x = torch.zeros(4, device="cuda")
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    x = x + 1

g.reset()
del g
```
that fails with:
```
terminate called after throwing an instance of 'c10::Error'
  what():  uc >= 0 INTERNAL ASSERT FAILED at ".../pytorch/c10/cuda/CUDACachingAllocator.cpp":2157, please report a bug to PyTorch.
```

and reset followed by a subsequent re-capture
```Python
import torch

x = torch.zeros(4, device="cuda")
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    x = x + 1

g.reset()

with torch.cuda.graph(g):
    x = x + 1
g.replay()
```
which fails with:
```
Traceback (most recent call last):
  File "test_graph.py", line 11, in <module>
    with torch.cuda.graph(g):
  File ".../pytorch/torch/cuda/graphs.py", line 192, in __enter__
    self.cuda_graph.capture_begin(
  File ".../pytorch/torch/cuda/graphs.py", line 77, in capture_begin
    super().capture_begin(pool=pool, capture_error_mode=capture_error_mode)
RuntimeError: This CUDAGraph instance already owns a captured graph. To capture a new graph, create a new instance.

```

This PR fixes the `CUDAGraph::reset()` function for the above two use cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108896
Approved by: https://github.com/ezyang
2023-09-11 19:49:31 +00:00
Michael Lazos
b193f295b6 Add capturable ASGD impl (#107857)
Add capturable ASGD impl + test
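A minimal sketch of the new flag (assuming the `capturable` keyword this PR adds; capturable optimizers keep their state on the device so the step can be CUDA-graph captured):
```python
import torch

params = [torch.randn(8, device="cuda", requires_grad=True)]
opt = torch.optim.ASGD(params, lr=1e-2, capturable=True)

loss = (params[0] ** 2).sum()
loss.backward()
opt.step()  # step/state tensors stay on the GPU instead of the host
```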

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107857
Approved by: https://github.com/janeyx99
2023-09-07 06:30:30 +00:00
Banit Agrawal
b8af8ac784 [CUDACaching Allocator] Release the allocator lock on the slow path (#108367)
Summary: This diff releases the global allocator lock on the slow path when we make a synchronous cudaMalloc call.

Differential Revision: D48750077

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108367
Approved by: https://github.com/zdevito
2023-09-02 02:52:25 +00:00
Elias Ellison
0a9778a372 Expose cudaStreamCaptureMode in CUDA Graphs, use local setting in inductor (#107407)
>  capture_error_mode (str, optional): specifies the cudaStreamCaptureMode for the graph capture stream.
Can be "global", "thread_local" or "relaxed". During cuda graph capture, some actions, such as cudaMalloc,
 may be unsafe. "global" will error on actions in other threads, "thread_local" will only error for
 actions in the current thread, and "relaxed" will not error on these actions.

Inductor codegen is single-threaded, so it should be safe to enable "thread_local" for inductor's cuda graph capturing. We have seen errors when inductor cudagraphs has been used concurrently with data preprocessing in other threads.
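A small sketch of the new argument on the `torch.cuda.graph` context manager:
```python
import torch

static_x = torch.zeros(8, device="cuda")
g = torch.cuda.CUDAGraph()
# "thread_local" only flags unsafe actions (e.g. cudaMalloc) issued by the capturing thread,
# so concurrent work in other threads does not abort the capture.
with torch.cuda.graph(g, capture_error_mode="thread_local"):
    static_y = static_x + 1
g.replay()
```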

Differential Revision: [D48656014](https://our.internmc.facebook.com/intern/diff/D48656014)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107407
Approved by: https://github.com/albanD, https://github.com/eqy
2023-08-25 01:44:26 +00:00
Zachary DeVito
cc54448a07 [memory snapshot] add 'address' key to block (#107171)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107171
Approved by: https://github.com/ngimel
2023-08-23 18:57:24 +00:00
Aaron Gokaslan
660e8060ad [BE]: Update ruff to 0.285 (#107519)
This updates ruff to 0.285, which is faster, better, and fixes a bunch of false negatives with regard to f-strings.

I also enabled RUF017, which looks for accidental quadratic list summation. Luckily, it seems there are no instances of it in our codebase, so I'm enabling it so that it stays that way. :)
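For illustration (not code from this PR), the pattern RUF017 flags and a linear alternative:
```python
import itertools

lists = [[1, 2], [3], [4, 5]]

flat_quadratic = sum(lists, [])                            # O(n^2): a new list per addition
flat_linear = list(itertools.chain.from_iterable(lists))   # linear time
assert flat_quadratic == flat_linear
```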

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
2023-08-22 23:16:38 +00:00
PyTorch MergeBot
d59a6864fb Revert "[BE]: Update ruff to 0.285 (#107519)"
This reverts commit 88ab3e4322.

Reverted https://github.com/pytorch/pytorch/pull/107519 on behalf of https://github.com/ZainRizvi due to Sorry, but this PR breaks internal tests. @ezyang, can you please hep them get unblocked? It seems like one of the strings was prob accidentally modified ([comment](https://github.com/pytorch/pytorch/pull/107519#issuecomment-1688833480))
2023-08-22 19:53:32 +00:00
Aaron Gokaslan
88ab3e4322 [BE]: Update ruff to 0.285 (#107519)
This updates ruff to 0.285, which is faster, better, and fixes a bunch of false negatives with regard to f-strings.

I also enabled RUF017, which looks for accidental quadratic list summation. Luckily, it seems there are no instances of it in our codebase, so I'm enabling it so that it stays that way. :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
2023-08-20 01:36:18 +00:00
lcskrishna
bc662ffff9 [ROCm] Update ROCm skip decorators (#106138)
This PR adds a msg argument for skipIfRocm and skipCUDAIfRocm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106138
Approved by: https://github.com/jataylo, https://github.com/jeffdaily, https://github.com/pruthvistony, https://github.com/albanD
2023-08-18 22:02:06 +00:00
Zachary DeVito
80988b6277 Introduce memory stacks for free (#106758)
Previously when we recorded a free action in a memory trace, we would provide
the stack for when the block was allocated. This is faster because we do not
have to record stacks for free, which would otherwise double the number of stacks
collected. However, sometimes knowing the location of a free is useful for
figuring out why a tensor was live. So this PR adds this behavior. If
performance ends up being a concern the old behavior is possible by passing
"alloc" to the context argument rather than "all".

Also refactors some of glue logic to be consistent across C++ and Python and
routes the Python API through the C++ version.
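A hedged sketch of the two modes via `torch.cuda.memory._record_memory_history` (keyword names follow the later recording-API cleanup and are an assumption here):
```python
import torch

# Record stacks for both allocations and frees (the default described above).
torch.cuda.memory._record_memory_history(context="all")
x = torch.randn(1024, device="cuda")
del x

# If recording frees is a performance concern, keep the old allocation-only behavior.
torch.cuda.memory._record_memory_history(context="alloc")
```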
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106758
Approved by: https://github.com/albanD
2023-08-14 20:38:15 +00:00
Jane Xu
0208574db9 [NAdam] Add capturable API and tests + fix differentiable (#106615)
This PR:
- adds a capturable API for NAdam similar to Adam(W)
- adds tests accordingly
- discovered and fixed bugs in the differentiable implementation (now tested through the capturable codepath).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106615
Approved by: https://github.com/albanD
2023-08-07 19:49:11 +00:00
Zachary DeVito
3e5a52cedd [memory snapshot] track context for segments (#106113)
We want to display the stack for the original cudaMalloc that created a segment.
Previously we could only report the last time the segment memory was used,
or the record of the segment_alloc could appear in the list of allocator actions.
This PR ensures that, regardless of whether we still have the segment_alloc action,
the context for a segment is still available. The visualizer is updated to
be able to incorporate this information.

This PR adds a new field to Block. However the previous stacked cleanup PR
 removed a field of the same size, making the change to Block size-neutral.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106113
Approved by: https://github.com/aaronenyeshi
2023-07-28 06:45:48 +00:00
Zachary DeVito
45b564766d [memory snapshots] removed chained history (#106079)
For free blocks of memory in the allocator, we previously kept a linked list
of the stack frames of previous allocations that lived there. This was only
ever used in one flamegraph visualization and never proved useful for
understanding what was going on. When memory history tracing was added, it
became redundant, since we can see the history of the free space from recording
the previous actions anyway.

This patch removes this functionality and simplifies the snapshot format:
allocated blocks directly have a 'frames' attribute rather than burying stack frames in the history.
Previously the memory history tracked the real size of allocations before rounding.
Since history was added, 'requested_size' has been added directly to the block which records the same information,
so this patch also removes that redundancy.

None of this functionality has been part of a PyTorch release with BC guarantees, so it should be safe to alter
this part of the format.

This patch also updates our visualization tools to work with the simplified format. Visualization tools keep
support for the old format in `_legacy` functions so that during the transition old snapshot files can still be read.
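A small sketch of reading the simplified format (the key names follow the description above and should be treated as an assumption, not a stable API):
```python
import torch

torch.cuda.memory._record_memory_history()
x = torch.randn(1 << 20, device="cuda")

snapshot = torch.cuda.memory._snapshot()
for segment in snapshot["segments"]:
    for block in segment["blocks"]:
        if block["state"] == "active_allocated":
            # 'frames' now lives directly on the block; 'requested_size' is the pre-rounding size.
            print(block["requested_size"], len(block.get("frames", [])))
```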

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106079
Approved by: https://github.com/eellison
2023-07-28 06:45:48 +00:00
Justin Chu
4cc1745b13 [BE] f-stringify torch/ and scripts (#105538)
This PR is a follow up on the pyupgrade series to convert more strings to use f-strings using `flynt`.

- https://docs.python.org/3/reference/lexical_analysis.html#f-strings
- https://pypi.org/project/flynt/

Command used:

```
flynt torch/ -ll 120
flynt scripts/ -ll 120
flynt tools/ -ll 120
```

and excluded `collect_env.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105538
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-07-21 19:35:24 +00:00
Justin Chu
73e1455327 [BE] Enable ruff's UP rules and autoformat test/ (#105434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105434
Approved by: https://github.com/albanD
2023-07-19 20:36:06 +00:00
Nikita Shulga
c3e4a67905 Refactor multigpu tests to test_cuda_multigpu (#104059)
Mostly a refactor that moves all the tests from `test_cuda` that benefit from a multi-GPU environment into their own file.

- Add `TestCudaMallocAsync` class for Async tests ( to separate them from `TestCudaComm`)
- Move individual tests from `TestCuda` to `TestCudaMultiGPU`
- Move `_create_scaling_models_optimizers` and `_create_scaling_case` to `torch.testing._internal.common_cuda`
- Add newly created `test_cuda_multigpu` to the multigpu periodic test

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at f4d46fa</samp>

This pull request fixes a flaky test and improves the testing of gradient scaling on multiple GPUs. It adds verbose output for two CUDA tests, and refactors some common code into helper functions in `torch/testing/_internal/common_cuda.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104059
Approved by: https://github.com/huydhn
2023-06-27 05:32:05 +00:00
Zachary DeVito
afc788a99c Re-land _cycleviz.py: visualize reference cycles holding cuda memory (#104051)
Reference cycles are freed by the cycle collector rather than being cleaned up
when the objects in the cycle first become unreachable. If a cycle points to a tensor,
the CUDA memory for that tensor will not be freed until garbage collection runs.
Accumulation of CUDA allocations can lead to out of memory errors (OOMs), as well as
non-deterministic allocation behavior which is harder to debug.

This visualizer installs a garbage collection hook to look for cycles containing
CUDA tensors and saves a visualization of the garbage:

```
from torch.cuda._cycleviz import warn_tensor_cycles
warn_tensor_cycles()
# do some work that results in a cycle getting garbage collected
# ...
> WARNING:root:Reference cycle includes a CUDA Tensor see visualization of cycle /tmp/tmpeideu9gl.html
```

Reland to make windows skip the test.

This reverts commit 7b3b6dd426.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104051
Approved by: https://github.com/aaronenyeshi, https://github.com/malfet
2023-06-23 13:44:58 +00:00
PyTorch MergeBot
7b3b6dd426 Revert "_cycleviz.py: visualize reference cycles holding cuda memory (#102656)"
This reverts commit dba67f71c9.

Reverted https://github.com/pytorch/pytorch/pull/102656 on behalf of https://github.com/huydhn due to Sorry for reverting your PR. But I think the change is failing on Windows CUDA https://github.com/pytorch/pytorch/actions/runs/5341701630/jobs/9683293600 ([comment](https://github.com/pytorch/pytorch/pull/102656#issuecomment-1603035364))
2023-06-22 17:16:47 +00:00
Zachary DeVito
dba67f71c9 _cycleviz.py: visualize reference cycles holding cuda memory (#102656)
Reference cycles are freed by the cycle collector rather than being cleaned up
when the objects in the cycle first become unreachable. If a cycle points to a tensor,
the CUDA memory for that tensor will not be freed until garbage collection runs.
Accumulation of CUDA allocations can lead to out of memory errors (OOMs), as well as
non-deterministic allocation behavior which is harder to debug.

This visualizer installs a garbage collection hook to look for cycles containing
CUDA tensors and saves a visualization of the garbage:

```
from torch.cuda._cycleviz import warn_tensor_cycles
warn_tensor_cycles()
# do some work that results in a cycle getting garbage collected
# ...
> WARNING:root:Reference cycle includes a CUDA Tensor see visualization of cycle /tmp/tmpeideu9gl.html
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102656
Approved by: https://github.com/aaronenyeshi
2023-06-22 04:00:28 +00:00
Nikita Shulga
cd05c3b98c [BE] Use TEST_MULTIGPU from common_cuda.py (#103982)
The concern about `TEST_CUDNN` being evaluated over and over has long been alleviated by wrapping the check with `LazyVal`, which caches the result.
Also, delete unused `TEST_MAGMA`.

Prep change for https://github.com/pytorch/pytorch/issues/100006

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at e3a5b39</samp>

> _`common_cuda.py`_
> _Refactored for dynamo tests_
> _Winter code cleanup_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103982
Approved by: https://github.com/atalman, https://github.com/janeyx99
2023-06-22 00:07:44 +00:00
Zachary DeVito
19b3e07fe0 [memory_viz] Unified viewer (#103565)
This replaces the individual visualization routines in _memory_viz.py with
a single javascript application.

The javascript application can load pickled snapshot dumps directly using
drag/drop, requesting them via fetch, or by embedding them in a webpage.

The _memory_viz.py commands use the embedding approach.
We can also host MemoryViz.js on a webpage to use the drag/drop approach, e.g.
https://zdevito.github.io/assets/viz/
(eventually this should be hosted with the pytorch docs).

All views/multiple cuda devices are supported on one page.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103565
Approved by: https://github.com/eellison, https://github.com/albanD
2023-06-16 03:49:48 +00:00
Xiao Wang
39f3514fa3 Add an env PYTORCH_TEST_SKIP_CUDAGRAPH to skip all cuda graph-related unit tests (#103032)
Skip all cuda graph-related unit tests by setting env var `PYTORCH_TEST_SKIP_CUDAGRAPH=1`

This PR refactors the `TEST_CUDA` python variable in test_cuda.py into common_utils.py. This PR also creates a new python variable `TEST_CUDA_GRAPH` in common_utils.py, which has an env var switch to turn off all cuda graph-related tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103032
Approved by: https://github.com/malfet
2023-06-06 07:51:57 +00:00
Nikita Shulga
ca470fc59f [BE] Make test_no_triton_on_import simple (#102674)
Do not try to parse raised exception for no good reason
Add short description
Reduce script to a single line

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at ea4164e</samp>

> _`test_no_triton_on_import`_
> _Cleans up the code, adds docs_
> _No hidden errors_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102674
Approved by: https://github.com/cpuhrsch, https://github.com/albanD
2023-06-01 20:31:18 +00:00
Nikita Vedeneev
d80d3b18d0 nn.Linear with BSR inputs: spare the user from explicit Triton kernel registrations (#98403)
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 08f7a6a</samp>

This pull request adds support for triton kernels in `torch` and `torch/cuda`, and refactors and tests the existing triton kernel for BSR matrix multiplication. It also adds a test case to ensure that importing `torch` does not implicitly import `triton`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98403
Approved by: https://github.com/malfet, https://github.com/cpuhrsch
2023-05-31 13:09:45 +00:00
Masaki Kozuki
c8579b7374 Run test_cpp_memory_snapshot_pickle only when linux and x86_64 (#101366)
On Arm, I got

```
Traceback (most recent call last):
  File "/opt/pytorch/pytorch/test/test_cuda.py", line 5260, in test_cpp_memory_snapshot_pickle
    mem = run()
  File "/opt/pytorch/pytorch/test/test_cuda.py", line 5257, in run
    t = the_script_fn()
  File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 496, in prof_func_call
    return prof_callable(func_call, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 493, in prof_callable
    return callable(*args, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "/opt/pytorch/pytorch/test/test_cuda.py", line 5254, in the_script_fn
                @torch.jit.script
                def the_script_fn():
                    return torch.rand(311, 411, device='cuda')
                           ~~~~~~~~~~ <--- HERE
RuntimeError: record_context_cpp is not support on non-linux non-x86_64 platforms
```

dfe484a3b3/torch/csrc/profiler/unwind/unwind.cpp (L4-L24) seems related

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101366
Approved by: https://github.com/zdevito
2023-05-17 19:44:21 +00:00
Elias Ellison
3edff6b6ec Improve detection of workspace/non-output allocations in cudagraphs (#99985)
When we run cudagraph trees we are not allowed to have permanent workspace allocations like in cublas because we might need to reclaim that memory for a previous cudagraph recording, and it is memory that is not accounted for in output weakrefs so it does not work with checkpointing. Previously, I would check that we didn't have any additional allocations through snapshotting. This was extremely slow so I had to turn it off.

This PR first does the quick check to see whether we are in an error state, and only if we are does it do the slow logic of creating a snapshot. It also turns on history recording so we get a stacktrace of where the bad allocation came from.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99985
Approved by: https://github.com/zdevito
2023-05-01 15:58:45 +00:00
Jane Xu
808267767c Prevent grad scale from overflowing (#98876)
Fixes #98828 by capping the growth in the kernel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98876
Approved by: https://github.com/ngimel
2023-04-25 20:59:44 +00:00
Aaron Gokaslan
e2a3817dfd [BE] Enable C419 rule for any all shortcircuiting (#99890)
Apparently https://github.com/pytorch/pytorch/pull/78142 made torch.jit allow simple generator expressions, which lets us enable rules that replace unnecessary list comprehensions with generators in any/all. This was originally part of #99280, but I split it off into this PR so that it can be easily reverted should anything break.
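For illustration, the kind of rewrite C419 performs — a generator expression lets `any`/`all` short-circuit instead of materializing the whole list first:
```python
data = range(10**6)

found_list = any([x > 5 for x in data])  # builds the full list before checking
found_gen = any(x > 5 for x in data)     # stops at the first match
assert found_list == found_gen
```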

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99890
Approved by: https://github.com/justinchuby, https://github.com/kit1980, https://github.com/malfet
2023-04-25 15:02:13 +00:00
Masaki Kozuki
b87c7ab6d6 Remove redundant found_inf recompute from _step_supports_amp_unscaling path (#98620)
following https://github.com/pytorch/pytorch/pull/97415#issuecomment-1499787115.

Rel: https://github.com/pytorch/pytorch/pull/98613

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98620
Approved by: https://github.com/janeyx99
2023-04-20 19:24:09 +00:00
Animesh Jain
971df458db Reland of "Python binding to set/get CUDA rng state offset" (#99565)
Why?
* To reduce the latency of hot path in https://github.com/pytorch/pytorch/pull/97377

Concern - I had to add `set_offset` in all instances of `GeneratorImpl`. I don't know if there is a better way.

~~~~
import torch
torch.cuda.manual_seed(123)
print(torch.cuda.get_rng_state())
torch.cuda.set_rng_state_offset(40)
print(torch.cuda.get_rng_state())

tensor([123,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0], dtype=torch.uint8)
tensor([123,   0,   0,   0,   0,   0,   0,   0,  40,   0,   0,   0,   0,   0,
          0,   0], dtype=torch.uint8)
~~~~

Reland of https://github.com/pytorch/pytorch/pull/98965

(cherry picked from commit 8214fe07e8)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99565
Approved by: https://github.com/anijain2305
2023-04-20 15:42:25 +00:00
PyTorch MergeBot
bb2cd4a107 Revert "Python binding to set/get CUDA rng state offset (#98965)"
This reverts commit 8214fe07e8.

Reverted https://github.com/pytorch/pytorch/pull/98965 on behalf of https://github.com/DanilBaibak due to Break internal build
2023-04-19 11:23:32 +00:00
Animesh Jain
8214fe07e8 Python binding to set/get CUDA rng state offset (#98965)
Why?
* To reduce the latency of hot path in https://github.com/pytorch/pytorch/pull/97377

Concern - I had to add `set_offset` in all instances of `GeneratorImpl`. I don't know if there is a better way.

~~~~
import torch
torch.cuda.manual_seed(123)
print(torch.cuda.get_rng_state())
torch.cuda.set_rng_state_offset(40)
print(torch.cuda.get_rng_state())

tensor([123,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0], dtype=torch.uint8)
tensor([123,   0,   0,   0,   0,   0,   0,   0,  40,   0,   0,   0,   0,   0,
          0,   0], dtype=torch.uint8)
~~~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98965
Approved by: https://github.com/kulinseth, https://github.com/ezyang
2023-04-18 07:52:21 +00:00
Zachary DeVito
7ff1f3f3f6 Revert "Revert "Expandable blocks in allocator (#96995)"" (#99275)
This reverts commit 851e89c8e8.

Differential Revision: [D45034526](https://our.internmc.facebook.com/intern/diff/D45034526)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99275
Approved by: https://github.com/eellison
2023-04-17 23:46:08 +00:00
PyTorch MergeBot
851e89c8e8 Revert "Expandable blocks in allocator (#96995)"
This reverts commit 6a50b83b73.

Reverted https://github.com/pytorch/pytorch/pull/96995 on behalf of https://github.com/izaitsevfb due to Breaks internal tests
2023-04-16 19:23:37 +00:00
Zachary DeVito
6a50b83b73 Expandable blocks in allocator (#96995)
Common advice we give for handling memory fragmentation issues is to
allocate a big block upfront to reserve memory which will get split up later.
For programs with changing tensor sizes this can be especially helpful to
avoid OOMs that happen the first time we see a new largest input and would
otherwise have to allocate new segments.

However, the issue with allocating a block upfront is that it is nearly impossible
to correctly estimate the size of that block. If too small, space in the block
will run out and the allocator will allocate separate blocks anyway. Too large,
and other non-PyTorch libraries might stop working because they cannot allocate
any memory.

This patch provides the same benefits as using a pre-allocating block but
without having to choose its size upfront. Using the cuMemMap-style APIs,
it adds the ability to expand the last block in a segment when more memory is
needed.

Compared to universally using cudaMallocAsync to avoid fragmentation,
this patch can fix this common fragmentation issue while preserving most
of the existing allocator behavior. This behavior can be enabled and disabled dynamically.
 This should allow users to, for instance, allocate long-lived parameters and state in individual buffers,
and put temporary state into the large expandable blocks, further reducing
fragmentation.

See inline comments for information about the implementation and its limitations.
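A hedged sketch of enabling the behavior through the allocator config (the flag must be set before the first CUDA allocation in the process):
```python
import os

# Must be set before torch makes its first CUDA allocation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

x = torch.randn(1 << 20, device="cuda")  # served from an expandable segment
```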

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96995
Approved by: https://github.com/eellison
2023-04-14 09:49:11 +00:00
Peeyush Agarwal
ebd4c165ff Back out "GradScaler recomputes optimizer_state["found_inf_per_device"] before optimizer.step (#97415)" (#98613)
Summary: This change causes a multi-GPU job from the XI team to hang after 8K steps.

Differential Revision: D44797248

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98613
Approved by: https://github.com/ngimel
2023-04-07 23:31:31 +00:00
Zachary DeVito
b1a83c4da4 [memory history] cleanup recording API (#97406)
This makes the options for recording memory history
easier to understand and makes the default to record
the most information.

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 4706acf</samp>

This pull request enhances the memory profiling and debugging capabilities of PyTorch on CUDA devices. It introduces a new API for memory history recording in `torch/cuda/memory.py` and `test/test_cuda.py`, and adds new functions for memory snapshot management and visualization in `torch/cuda/memory.py`.

Also adds a quick _dump_snapshot function to make
it easier to look at the common visualizations.
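A short sketch of the cleaned-up API (the string-valued `enabled` argument plus `_dump_snapshot`; the output path shown is an assumption, and later releases write a pickled snapshot rather than a directory of plots):
```python
import torch

# New string-valued API: record stacks and the event history for all allocations.
torch.cuda.memory._record_memory_history(enabled="all", max_entries=100_000)

x = torch.randn(1 << 20, device="cuda")
y = x * 2

# Dump the common visualizations for the current snapshot.
torch.cuda.memory._dump_snapshot("memory_snapshot")

torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
```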

<!--
copilot:walkthrough
-->
### <samp>🤖 Generated by Copilot at 4706acf</samp>

*  Modify the `_record_memory_history` function to use a new API that accepts a string argument for the `enabled` parameter and more parameters to control the stack trace collection and memory event history ([link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-80bd98caafb20d758f45a4d23711810f7e0b9ce7a6505094f9dbb0e00a657377L620-R696))
* Add a new function `_dump_snapshot` that allows users to dump a memory snapshot to a directory with HTML plots of the memory segments and events ([link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-80bd98caafb20d758f45a4d23711810f7e0b9ce7a6505094f9dbb0e00a657377R703-R713))
* Update the test cases in `test/test_cuda.py` to use the new API for memory history recording and check the expected output of the memory plots ([link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L4946-R4946), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L4984-R4984), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5000-R5000), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5015-R5015), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5035-R5038), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450R5045-R5046), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5060-R5059), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5068-R5065), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5088-R5085))
* Add missing imports and types to the `torch/cuda/memory.py` module ([link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-80bd98caafb20d758f45a4d23711810f7e0b9ce7a6505094f9dbb0e00a657377L5-R15))
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97406
Approved by: https://github.com/ezyang
2023-03-28 16:31:10 +00:00
soulitzer
51c3fd39a5 Modify all calls to checkpoint pass use_reentrant explicitly (#97376)
Fixes #ISSUE_NUMBER

This is the first step toward making use_reentrant=False the default.
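For reference, an explicit call after this change (the non-reentrant variant that will eventually become the default):
```python
import torch
from torch.utils.checkpoint import checkpoint


def block(x):
    return torch.nn.functional.relu(x @ x.t())


x = torch.randn(32, 32, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)  # `block` is recomputed during backward
y.sum().backward()
```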
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97376
Approved by: https://github.com/albanD
2023-03-27 13:37:42 +00:00
Masaki Kozuki
b5edf18334 GradScaler recomputes optimizer_state["found_inf_per_device"] before optimizer.step (#97415)
I found a discrepancy between non-fused and fused optimizers: whether to use `optimizer_state["found_inf"]` or to recompute `found_inf`.

- non fused: e64ddd1ab9/torch/cuda/amp/grad_scaler.py (L289)
- fused: e64ddd1ab9/torch/cuda/amp/grad_scaler.py (L353)
    - where `_check_inf_per_device` is e64ddd1ab9/torch/cuda/amp/grad_scaler.py (L564-L573)

The other way to align the behavior is to use the existing `found_inf` in e64ddd1ab9/torch/cuda/amp/grad_scaler.py (L353).

I'd say this PR is for the sake of "safety" and the alternative is to keep the existing behavior.
I honestly have no idea if it's expected to double-check the sanity of gradients in `GradScaler.step`.

---

What I've observed in the huggingface/transformers T5-base example so far is that non-fused optimizers lead to invalid parameters while the fused ones do not.
The cause seems to be that `gradients` become inf/nan before `GradScaler.step(optimizer)` and after `GradScaler._unscale_grads_` (more precisely, the call of `torch._amp_foreach_non_finite_check_and_unscale_`) in the script of the linked issue, i.e. the gradient clipping and/or unscaling lead to inf/nan, as these happen after the grad check. See
788300cc2a/aten/src/ATen/native/cuda/AmpKernels.cu (L165-L174).
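For reference, the unscale-then-clip pattern discussed above, following the AMP gradient-clipping recipe linked in the docs:
```python
import torch

model = torch.nn.Linear(16, 16).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(8, 16, device="cuda")
with torch.cuda.amp.autocast():
    loss = model(x).sum()
scaler.scale(loss).backward()

scaler.unscale_(opt)  # grads are unscaled here; inf/nan can appear at this point
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
scaler.step(opt)      # with this PR, found_inf is recomputed here so a bad step is skipped
scaler.update()
```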

Fixes #96755 🙏

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97415
Approved by: https://github.com/ngimel, https://github.com/janeyx99
2023-03-24 17:36:47 +00:00
Tailing Yuan
63e1f12b49 Speedup bincount and histc on CUDA (#97090)
This is to speed up torch.bincount and torch.histc on CUDA.

1. Speed up int64_t gpuAtomicAdd,
2. and optimize the histogram kernel.

# Fixes #96626
After speedup, time cost in #96626 would be

```
... (run 2 times and ignore the first run)
case 1 CPU  0.0003631114959716797 seconds
case 1 CUDA 0.0005860328674316406 seconds
case 2 CPU  0.0013742446899414062 seconds
case 2 CUDA 0.0008623600006103516 seconds
```

Note that in "*case 1 CUDA*", the **max** op takes the most time, i.e., 5ee5a164ff/aten/src/ATen/native/cuda/SummaryOps.cu (L334-L335), which is not to be optimized in this PR.

# Benchmark

Time is measured on i7-10700 + RTX 3080, Ubuntu 22.04 (in WSL). The baseline is PyTorch 2.0.0+cu117. My dev version of PyTorch is compiled with CUDA 11.8. Each case is measured 15 times to take the median.

## torch.bincount
#elem | nbins | distribution | CPU | PyTorch 2.0.0 | this PR | speedup
-- | -- | -- | -- | -- | -- | --
2**20 | 80 | random.uniform | 0.000834 | 0.005783 | 0.000266 | 21.8x
2**20 | 80 | narrow in 1 bin | 0.001576 | 0.003967 | 0.000563 | 7.0x
2**20 | 500 | random.uniform | 0.000852 | 0.003641 | 0.000334 | 10.9x
2**20 | 500 | narrow in 1% bins | 0.000894 | 0.001878 | 0.000349 | 5.4x
2**20 | 2048 | random.uniform | 0.000891 | 0.000820 | 0.000298 | 2.8x
2**20 | 2048 | narrow in 1% bins | 0.000958 | 1.043251 | 0.000335 | 3,116.6x
2**26 | 80 | random.uniform | 0.067715 | 0.322409 | 0.003032 | 106.3x
2**26 | 80 | narrow in 1 bin | 0.110940 | 0.194644 | 0.017651 | 11.0x
2**26 | 500 | random.uniform | 0.066666 | 0.192302 | 0.002535 | 75.8x
2**26 | 500 | narrow in 1% bins | 0.066130 | 0.092237 | 0.005462 | 16.9x
2**26 | 2048 | random.uniform | 0.066371 | 0.035308 | 0.002476 | 14.3x
2**26 | 2048 | narrow in 1% bins | 0.068453 | 72.122858 | 0.003185 | 22,644.3x

## torch.histc (float32)
#elem | nbins | distribution | CPU | PyTorch 2.0.0 | this PR | speedup
-- | -- | -- | -- | -- | -- | --
2**20 | 80 | random.uniform | 0.001261 | 0.000145 | 9.47E-05 | 1.5x
2**20 | 80 | narrow in 1 bin | 0.001074 | 0.000356 | 0.000311 | 1.1x
2**20 | 500 | random.uniform | 0.001162 | 0.000227 | 9.18E-05 | 2.5x
2**20 | 500 | narrow in 1% bins | 0.001082 | 0.000201 | 0.000152 | 1.3x
2**20 | 2048 | random.uniform | 0.001100 | 0.000203 | 0.000118 | 1.7x
2**20 | 2048 | narrow in 1% bins | 0.001089 | 0.000396 | 0.000107 | 3.7x
2**26 | 80 | random.uniform | 0.064219 | 0.001170 | 0.000786 | 1.5x
2**26 | 80 | narrow in 1 bin | 0.056471 | 0.013283 | 0.011939 | 1.1x
2**26 | 500 | random.uniform | 0.078183 | 0.003411 | 0.000562 | 6.1x
2**26 | 500 | narrow in 1% bins | 0.056711 | 0.002763 | 0.002738 | 1.0x
2**26 | 2048 | random.uniform | 0.059296 | 0.003503 | 0.000533 | 6.6x
2**26 | 2048 | narrow in 1% bins | 0.061754 | 0.015703 | 0.000962 | 16.3x

## torch.histc (int64)
#elem | nbins | distribution | CPU | PyTorch 2.0.0 | this PR | speedup
-- | -- | -- | -- | -- | -- | --
2**20 | 80 | random.uniform | N/A | 0.005614 | 9.47E-05 | 59.3x
2**20 | 80 | narrow in 1 bin | N/A | 0.003799 | 0.000395 | 9.6x
2**20 | 500 | random.uniform | N/A | 0.003665 | 9.58E-05 | 38.2x
2**20 | 500 | narrow in 1% bins | N/A | 0.001760 | 0.000178 | 9.9x
2**20 | 2048 | random.uniform | N/A | 0.000693 | 0.000111 | 6.2x
2**20 | 2048 | narrow in 1% bins | N/A | 1.082904 | 0.000123 | 8,802.4x
2**26 | 80 | random.uniform | N/A | 0.320400 | 0.001145 | 279.9x
2**26 | 80 | narrow in 1 bin | N/A | 0.193668 | 0.015229 | 12.7x
2**26 | 500 | random.uniform | N/A | 0.182897 | 0.000823 | 222.2x
2**26 | 500 | narrow in 1% bins | N/A | 0.089363 | 0.00376 | 23.8x
2**26 | 2048 | random.uniform | N/A | 0.033190 | 0.000832 | 39.9x
2**26 | 2048 | narrow in 1% bins | N/A | 71.721012 | 0.001525 | 47,017.8x

## Benchmark code

Here is the benchmark code:

```python3
import time
import torch

cases = [
    ("bincount    bins=80   wide  ", torch.randint(80, [2**20]),   lambda x: torch.bincount(x, minlength=80)),
    ("bincount    bins=80   narrow", torch.randint(1, [2**20]),    lambda x: torch.bincount(x, minlength=80)),
    ("bincount    bins=500  wide  ", torch.randint(500, [2**20]),  lambda x: torch.bincount(x, minlength=500)),
    ("bincount    bins=500  narrow", torch.randint(5, [2**20]),    lambda x: torch.bincount(x, minlength=500)),
    ("bincount    bins=2048 wide  ", torch.randint(2048, [2**20]), lambda x: torch.bincount(x, minlength=2048)),
    ("bincount    bins=2048 narrow", torch.randint(20, [2**20]),   lambda x: torch.bincount(x, minlength=2048)),
    ("histc_float bins=80   wide  ", torch.rand(2**20),            lambda x: torch.histc(x, bins=80, min=0., max=1.)),
    ("histc_float bins=80   narrow", torch.rand(2**20)*.01,        lambda x: torch.histc(x, bins=80, min=0., max=1.)),
    ("histc_float bins=500  wide  ", torch.rand(2**20),            lambda x: torch.histc(x, bins=500, min=0., max=1.)),
    ("histc_float bins=500  narrow", torch.rand(2**20)*.01,        lambda x: torch.histc(x, bins=500, min=0., max=1.)),
    ("histc_float bins=2048 wide  ", torch.rand(2**20),            lambda x: torch.histc(x, bins=2048, min=0., max=1.)),
    ("histc_float bins=2048 narrow", torch.rand(2**20)*.01,        lambda x: torch.histc(x, bins=2048, min=0., max=1.)),
    ("histc_int   bins=80   wide  ", torch.randint(80, [2**20]),   lambda x: torch.histc(x, bins=80, min=0., max=80.)),
    ("histc_int   bins=80   narrow", torch.randint(1, [2**20]),    lambda x: torch.histc(x, bins=80, min=0., max=80.)),
    ("histc_int   bins=500  wide  ", torch.randint(500, [2**20]),  lambda x: torch.histc(x, bins=500, min=0., max=500.)),
    ("histc_int   bins=500  narrow", torch.randint(5, [2**20]),    lambda x: torch.histc(x, bins=500, min=0., max=500.)),
    ("histc_int   bins=2048 wide  ", torch.randint(2048, [2**20]), lambda x: torch.histc(x, bins=2048, min=0., max=2048.)),
    ("histc_int   bins=2048 narrow", torch.randint(20, [2**20]),   lambda x: torch.histc(x, bins=2048, min=0., max=2048.)),
]

def test(case, device):
    name, x, func = case
    x = x.to(device)
    time_samples = []
    for _ in range(15):
        torch.cuda.synchronize()
        t1 = time.time()
        func(x)
        torch.cuda.synchronize()
        t2 = time.time()
        time_samples.append(t2 - t1)
    median = sorted(time_samples)[len(time_samples) // 2]
    print(device, name, median)

for case in cases:
    test(case, device="cuda")

# for case in cases:
#     test(case, device="cpu")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97090
Approved by: https://github.com/ngimel
2023-03-24 00:25:34 +00:00
Masaki Kozuki
22ea21da3d Change 1D Tensor of 1 element to 0D Tensor (#96994)
add 0d tensor to graph adam/adamw test

Affected:
- `torch.cuda.amp.GradScaler`'s `found_inf`, `_scale`, and `_growth_tracker`
- `step` of Adam & AdamW of `capturable`

Fixes #96776 🤞

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96994
Approved by: https://github.com/janeyx99
2023-03-21 18:24:19 +00:00
Elias Ellison
571f96bf59 cudagraph trees (#89146)
CUDA Graph Trees

Design doc: https://docs.google.com/document/d/1ZrxLGWz7T45MSX6gPsL6Ln4t0eZCSfWewtJ_qLd_D0E/edit

Not currently implemented :

- Right now, we are using weak tensor refs from outputs to check if a tensor has died. This doesn't work because of a) aliasing, and b) aot_autograd detaches tensors (see note [Detaching saved tensors in AOTAutograd]). Would need either https://github.com/pytorch/pytorch/issues/91395 to land to use storage weak refs or manually add a deleter fn that does what I want. This is doable, but there are some interactions with the caching allocator checkpointing, so I'm saving it for a stacked PR.

- Reclaiming memory from the inputs during model recording. This isn't terribly difficult, but I'm deferring it to another PR. You would need to write over the input memory during warmup, and therefore copy the inputs to CPU. Saving for a stacked PR.

- Warning on overwriting previous generation outputs, and handling nested torch.compile() calls in generation tracking

Differential Revision: [D43999887](https://our.internmc.facebook.com/intern/diff/D43999887)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89146
Approved by: https://github.com/ezyang
2023-03-17 02:47:03 +00:00
Elias Ellison
ea7415087a Expose Stream Recording Apis in python (#96384)
Differential Revision: [D43999891](https://our.internmc.facebook.com/intern/diff/D43999891)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96384
Approved by: https://github.com/zdevito
2023-03-16 23:45:43 +00:00
Zachary DeVito
e74f70d212 Revert "Revert "[memory profiling] add a facility to gather combined C++/Python/TorchScript stack traces. (#95541)"" (#96878)
This reverts commit e1ea584b1c.
Adds __has_include check to fix fbcode build.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96878
Approved by: https://github.com/ezyang
2023-03-16 04:12:54 +00:00
PyTorch MergeBot
e1ea584b1c Revert "[memory profiling] add a facility to gather combined C++/Python/TorchScript stack traces. (#95541)"
This reverts commit 4e1060c609.

Reverted https://github.com/pytorch/pytorch/pull/95541 on behalf of https://github.com/DanilBaibak due to breaking internal builds
2023-03-15 13:28:41 +00:00
Zachary DeVito
4e1060c609 [memory profiling] add a facility to gather combined C++/Python/TorchScript stack traces. (#95541)
This refactors the stack trace facility specific to memory profiling
    in python+cuda to make a generic facility to generate combined stack
    traces.

    The generic facility (combined_traceback.h) does not require
    python to be around to work, but will return python stacks if it is
    present.

    This facility is then used to add support for stack trace gathering in memory profiling that
    happens directly from C++.

    It is also used to expose a python API for gathering and symbolizing
    combined stacks.
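A rough sketch of the Python side, assuming the bindings are exposed under `torch._C._profiler` as exercised by `test_cuda.py`'s `test_direct_traceback`:
```python
# Assumed import path and signatures; only useful where the C++ unwinder is built (Linux x86_64).
from torch._C._profiler import gather_traceback, symbolize_tracebacks

capture = gather_traceback(True, True, True)      # python, script, and C++ frames
(symbolized,) = symbolize_tracebacks([capture])   # resolve to file/line/function records
print(symbolized)
```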

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95541
Approved by: https://github.com/ezyang
2023-03-14 18:26:05 +00:00
Elias Ellison
da265652d6 Return Live Data Pointers from Checkpoint, swap onto tensors (#95020)
When we checkpoint the state of the private pool allocator, we will need to make sure that its current live allocated blocks will get properly cleaned up when the tensors they correspond to die. Return DataPtrs for these newly allocated blocks that the callee can swap onto live Tensors.

The exact api for setting the checkpoint can be manipulated after this as the cudagraph implementation is built out, but this at least shows its sufficiently general.

This should be the last PR touching cuda caching allocator necessary for new cudagraphs integration.

Differential Revision: [D43999888](https://our.internmc.facebook.com/intern/diff/D43999888)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95020
Approved by: https://github.com/zdevito
2023-03-14 01:22:19 +00:00
Elias Ellison
1cc32aedb0 Handle additional live allocations not in checkpointed state (#94943)
We choose to ignore certain blocks that are currently allocated when we set the pool to its checkpoint. For those blocks, we need to swap out the deleter function of their corresponding blocks so that a deallocation is not triggered when they die.

Differential Revision: [D43999886](https://our.internmc.facebook.com/intern/diff/D43999886)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94943
Approved by: https://github.com/zdevito
2023-03-14 01:00:47 +00:00
Elias Ellison
d798de2b05 Checkpoint CUDA Allocator Private Pool State (#94653)
Copying note from cuda caching allocator:

```
   * Note [Checkpointing PrivatePoolState]
   *
   * Refer above to Note [Interaction with CUDA graph capture]. Allocations made
   * during graph capture are made from a separate private pool. During graph
   * capture allocations behave as usual. During graph replay the allocator
   * state does not change even as new tensors are created. The private pool
   * will not free its blocks to the main caching allocator until cuda graph use
   * is finished to prevent an allocation from eager clobbering the memory from
   * a live but unaccounted for tensor that was created during replay.
   *
   * `make_graphed_callables`, a series of separate callables chained in
   * successive cuda graphs, can share a memory pool because after a cuda graph
   * recording the allocations in the shared private pool exactly reflect the
   * tensors that are allocated.
   *
   * We would like to extend callable chaining to support a graphed callable
   * tree. In this scenario, we have a tree of callable chains which will be
   * captured with cuda graphs. In the diagram below, we have a tree with four
   * callables, A, B, C, and D. Suppose we have captured, and subsequently
   * replayed, A, B, and C. Then on a new invocation, we replay A and B, but
   * would now like to record D. At this point the private pool will not reflect
   * any of the live tensors created during graph replay. Allocations made
   * during a new recording with the pool could overwrite those live tensors.
   *
   * In order to record a new graph capture after replaying prior callables in
   * the tree, we need the allocator to reflect the state of the live tensors.
   * We checkpoint the state of the private pool after each recording, and then
   * reapply it when we are starting a new recording chain. Additionally, we
   * must free the allocations for any tensors that died between the end of our
   * previous graph replaying and our new recording (TODO). All of the allocated
   * segments that existed in the checkpointed state must still exist in the
   * pool. There may also exist new segments, which we will free (TODO : link
   * note [live tensors between iterations] when it exists).
   *
   *
   *  ---------------> A ---------------> B ---------------> C
   *                                |
   *                                |
   *                                |
   *                                |
   *                                  ---------------> D
```

A few TODOs:
- need to add logic for freeing tensors that have died between a last replay and current new recording
- Add logic for free that might be called on a pointer multiple times (because we are manually freeing live tensors)

The two scenarios above have not been exercised in the tests yet.

Differential Revision: [D43999889](https://our.internmc.facebook.com/intern/diff/D43999889)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94653
Approved by: https://github.com/zdevito
2023-03-14 00:47:30 +00:00
Zachary DeVito
4b372e3958 [memory profiling] C++ tracing support (#95357)
Adds the ability to quickly generate stack traces for C++,
and combine Python, TorchScript, and C++ frames into a single trace.

This makes it possible for the memory tracer to record allocations inside
C++ code (e.g. convolution temporaries, backward operators).

The unwinder code is ~10x faster than execinfo.h's backward because it
caches fast unwinder routines for instruction pointers that have already been seen.
It is also only 1.2--2x slower than copying the entire stack (the approach perf takes),
while using 2 orders of magnitude less space per stack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95357
Approved by: https://github.com/bertmaher
2023-03-12 07:24:14 +00:00
Zachary DeVito
266089a3fe [memory snapshots] record scripted stack traces (#95356)
Adds support for seeing both python and script stack traces in memory
debugging.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95356
Approved by: https://github.com/aaronenyeshi
2023-03-12 07:24:14 +00:00
Zachary DeVito
d6d8d3484e _memory_viz.py: Visualize how blocks fit into segments. (#91336)
Add a segment_plot command that visualizes how blocks are allocated into segments.
This is similar to the 'stats' command but produces an interactive html viewer rather
than text dump, allowing exploration of stack traces.

It also adds the ability to see the layout at any point in the trace by starting from the
snapshot and then apply the events backwards to reconstruct what memory would have looked like.

Example:
![Screen Shot 2022-12-22 at 3 32 49 PM](https://user-images.githubusercontent.com/370202/209242650-b952372e-37ac-400a-a01c-13be2b5426fa.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91336
Approved by: https://github.com/bhosmer
2023-03-07 21:07:18 +00:00
Zachary DeVito
71f369092d Revert "Revert "memory viz: Add colors for categories and a legend (#90587)"" (#96133)
This reverts commit b38b39c441.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96133
Approved by: https://github.com/bhosmer
2023-03-07 21:07:18 +00:00
Catherine Lee
eea0733045 Reduce pytest blocklist (#96016)
`TestCase = object` or variations of it get switched to `TestCase = NoTest`.

unittest collects tests based on subclassing unittest.TestCase, so setting TestCase = object removes the class from unittest test collection.  pytest collects based on name (https://docs.pytest.org/en/7.1.x/reference/reference.html#confval-python_classes) but can be told to ignore a class (bottom of https://docs.pytest.org/en/7.1.x/example/pythoncollection.html#changing-naming-conventions)
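For illustration, the pattern this converges on (a sketch assuming `NoTest` exported from `common_utils`):
```python
import sys

import torch
from torch.testing._internal.common_utils import NoTest, TestCase, run_tests

if not torch.cuda.is_available():
    print("CUDA not available, skipping tests", file=sys.stderr)
    TestCase = NoTest  # noqa: F811  -- neither unittest nor pytest will collect the class below


class TestCudaSomething(TestCase):
    def test_device_count(self):
        self.assertGreater(torch.cuda.device_count(), 0)


if __name__ == "__main__":
    run_tests()
```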
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96016
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn
2023-03-07 18:30:27 +00:00
Eli Uriegas
b38b39c441 Revert "memory viz: Add colors for categories and a legend (#90587)"
This reverts commit ee43842505.
2023-03-06 11:38:58 -08:00
Zachary DeVito
ee43842505 memory viz: Add colors for categories and a legend (#90587)
Adds a category legend to memory trace plots that colors allocations by their role (activation, parameter, gradient, etc.) as captured by kineto.

Differential Revision: [D43757381](https://our.internmc.facebook.com/intern/diff/D43757381)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90587
Approved by: https://github.com/aaronenyeshi
2023-03-03 20:42:22 +00:00
Mark Saroufim
9f707f164e Add more GPU metric instrumentation (#91717)
Fixes https://github.com/pytorch/serve/issues/1937

A fairly common query I see folks running while using pytorch is

`nvidia-smi --format=csv,noheader,nounits --query-gpu=utilization.gpu,utilization.memory,memory.total,memory.used,temperature.gpu,power.draw,clocks.current.sm,clocks.current.memory -l 10`

Existing metrics we have
* For kernel utilization`torch.cuda.utilization()`
* For memory utilization we have them under `torch.cuda.memory` the memory allocated with `torch.cuda.memory.memory_allocated()`
* For total available memory we have `torch.cuda.get_device_properties(0).total_memory`

Which means the only metrics we're missing are
* Temperature: now in `torch.cuda.temperature()`
* Power draw: now in `torch.cuda.power()`
* Clock speed: now in `torch.cuda.clock_speed()`

With some important details on each

* Clock speed settings: I picked the SM clock domain which is documented here https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceEnumvs.html#group__nvmlDeviceEnumvs_1g805c0647be9996589fc5e3f6ff680c64
* Temperature: I use `pynvml.nvmlDeviceGetTemperature(handle, 0)` where 0 refers to the GPU die temperature
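A quick sketch polling these from Python (the temperature/utilization calls require `pynvml`; the new power and clock getters named above would be read the same way):
```python
import torch

props = torch.cuda.get_device_properties(0)
print("total memory (bytes):", props.total_memory)
print("allocated (bytes):   ", torch.cuda.memory_allocated(0))
print("kernel utilization %:", torch.cuda.utilization(0))
print("die temperature (C): ", torch.cuda.temperature(0))  # added by this PR
```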
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91717
Approved by: https://github.com/ngimel
2023-02-24 00:38:03 +00:00
Pearu Peterson
cece63f197 Add warn-once deprecation warning to legacy sparse constructors (#94850)
Addresses https://github.com/pytorch/pytorch/issues/68323#issuecomment-1425174341

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94850
Approved by: https://github.com/amjames, https://github.com/cpuhrsch
2023-02-23 15:05:12 +00:00
puririshi98
8aa34602f7 Jetson Update for CI Redo (#94549)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94549
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-02-21 17:13:38 +00:00
dllehr-amd
98012e4a59 [ROCm] hipGraph support for pytorch mainline (#88202)
With the release of ROCm 5.3 hip now supports a hipGraph implementation.

All necessary backend work and hipification is done to support the same functionality as cudaGraph.

Unit tests are modified to support a new TEST_GRAPH feature which allows us to create a single check for graph support instead of attempting to gather the CUDA level in annotations for every graph test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88202
Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/malfet
2023-02-14 22:18:56 +00:00
Xuehai Pan
b005ec62b9 [BE] Remove dependency on six and future (#94709)
Remove the Python 2 and 3 compatibility libraries [six](https://pypi.org/project/six) and [future](https://pypi.org/project/future), along with `torch._six`. We only support Python 3.8+ now. It's time to retire them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94709
Approved by: https://github.com/malfet, https://github.com/Skylion007
2023-02-14 09:14:14 +00:00
Xuehai Pan
046e88a291 [BE] [3/3] Rewrite super() calls in test (#94592)
Rewrite Python built-in class `super()` calls. Only non-semantic changes should be applied.

- #94587
- #94588
- #94592

Also, methods with only a `super()` call are removed:

```diff
class MyModule(nn.Module):
-   def __init__(self):
-       super().__init__()
-
    def forward(self, ...):
        ...
```

Some cases that change the semantics should be kept unchanged. E.g.:

f152a79be9/caffe2/python/net_printer.py (L184-L190)

f152a79be9/test/test_jit_fuser_te.py (L2628-L2635)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94592
Approved by: https://github.com/ezyang, https://github.com/seemethere
2023-02-12 22:20:53 +00:00
c-odrin
54b7c7d5e9 Added requested_bytes to CUDA Caching Allocator Stats (#88575)
Summary:
The caching allocator can be configured to round memory allocations in order to reduce fragmentation. Sometimes however, the overhead from rounding can be higher than the fragmentation it helps reduce.

We have added a new stat to CUDA caching allocator stats to help track if rounding is adding too much overhead and help tune the roundup_power2_divisions flag:
    - "requested_bytes.{current,peak,allocated,freed}": memory requested by client code, compare this with allocated_bytes to check if allocation rounding adds too much overhead

Test Plan: Added test case in caffe2/test/test_cuda.py

Differential Revision: D40810674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88575
Approved by: https://github.com/zdevito
2023-02-09 21:37:25 +00:00
Masaki Kozuki
6ba041fcae Look up group["capturable"], not defaults["capturable"] in Adam(W) (#94149)
We could set different values in each `param_group` when calling dunder init of `torch.optim` optimizers as in e.g.  https://github.com/pytorch/pytorch/issues/89987.

So check whether or not `capturable` is `True` among all the `param_group`s.
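A minimal sketch of the per-group check this switches to (illustrative, not the exact internal code):

```python
import torch

params = [torch.nn.Parameter(torch.randn(4)) for _ in range(2)]
opt = torch.optim.Adam([
    {"params": params[:1], "capturable": False},
    {"params": params[1:], "capturable": True},   # value differs per group
])

# Look at every group, not just opt.defaults["capturable"]:
any_capturable = any(group["capturable"] for group in opt.param_groups)
print(any_capturable)  # True
```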
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94149
Approved by: https://github.com/albanD
2023-02-07 00:24:35 +00:00
Masaki Kozuki
4207d3c330 FusedAdam(W) should take OptState into account before unscaling grads (#94060)
the optimizers have to consult `OptState` before unscaling gradients because we could call `GradScaler.unscale_` explicitly, e.g. for `clip_grad_norm_`, as mentioned in e52786f3d1/torch/cuda/amp/grad_scaler.py (L235-L266) and https://pytorch.org/docs/stable/notes/amp_examples.html#working-with-unscaled-gradients
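The documented pattern this has to stay compatible with: `unscale_` called explicitly before gradient clipping, then `step` (a sketch assuming a CUDA model and optimizer):

```python
import torch

model = torch.nn.Linear(16, 16).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

for _ in range(2):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(torch.randn(8, 16, device="cuda")).sum()
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                      # grads are now unscaled...
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)                          # ...so step() must not unscale again
    scaler.update()
```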

Related #90752

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94060
Approved by: https://github.com/albanD
2023-02-04 05:20:13 +00:00
Masaki Kozuki
a23ed38f9a [mta][foreach] Implement fused adamw (#88015)
related: https://github.com/pytorch/pytorch/issues/68041, https://github.com/pytorch/pytorch/issues/71274, https://github.com/pytorch/pytorch/issues/80167
possibly related to https://github.com/pytorch/pytorch/issues/80595#issuecomment-1178519436

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88015
Approved by: https://github.com/albanD, https://github.com/ngimel
2023-02-01 19:32:29 +00:00
albanD
d8aa68c683 make sure that our error handling runs with the GIL enabled (#92848)
Fixes https://github.com/pytorch/pytorch/issues/92684

I checked the other use case of this API and they never release the GIL

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92848
Approved by: https://github.com/ngimel
2023-01-24 09:30:42 +00:00
eqy
fb38b9ff2a [cuBLAS][TF32] Fix TF32 get/set test when TORCH_ALLOW_TF32_CUBLAS_OVERRIDE is set (#92052)
Follow up of #85859 to fix the test for when the environment variable is set.

CC @xwang233 @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92052
Approved by: https://github.com/ngimel
2023-01-12 05:36:06 +00:00
eqy
97ff20d722 [cuBLAS] (re-open) Fix default cuBLAS workspace size and parsing for multiple workspaces (#91564)
re-open of #89027
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91564
Approved by: https://github.com/ngimel
2023-01-03 23:48:15 +00:00
PyTorch MergeBot
39d49dbe45 Revert "[cuBLAS] Fix default cuBLAS workspace size and parsing for multiple workspaces (#89027)"
This reverts commit b407d98dbe.

Reverted https://github.com/pytorch/pytorch/pull/89027 on behalf of https://github.com/kit1980 due to Fails test_cublas_workspace_explicit_allocation on ROCm
2022-12-31 23:04:57 +00:00
eqy
b407d98dbe [cuBLAS] Fix default cuBLAS workspace size and parsing for multiple workspaces (#89027)
Follow-up of #86167 ; The number of pools was mistakenly ignored and the default workspace size appears to be too small to match selected cuBLAS kernels before the explicit allocation change.

CC @ptrblck @ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89027
Approved by: https://github.com/ngimel
2022-12-31 06:58:04 +00:00
lezcano
484dd40022 Implement PReLU in a compositional way (#91238)
The PReLU implementation was all over the place. This led to a number
of bugs like https://github.com/pytorch/pytorch/issues/68760. We fix it by:
- Keeping the weird broadcasting logic it has as a CompositeImplicit kernel that calls into a second kernel
- This second kernel is just a good-ol' pointwise kernel.
- We implement the derivative for the pointwise kernel via TI as well for speed.
- We implement the second derivative for the pointwise kernel and the forward AD derivatives compositionally

This fixes a number of issues:
- We don't perform copies any more when the inputs are not contiguous
- The derivatives are now correct
- We fix vmap and many other functorch-related issues.
- CPU and CUDA now share the relevant broadcasting logic
- The implementation is about 1/3 the length.

Fixes https://github.com/pytorch/pytorch/issues/68760
Fixes https://github.com/pytorch/pytorch/issues/89895

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91238
Approved by: https://github.com/kshitij12345, https://github.com/jbschlosser, https://github.com/albanD
2022-12-30 10:42:30 +00:00
Eddie Yan
8b617f813d [cuBLAS] Add an option to disable reduced precision reductions for BF16 GEMM (#89172)
Essentially the same change as #67946, except that the default is to disallow reduced precision reductions in `BFloat16` GEMMs (for now). If performance is severely regressed, we can change the default, but this option appears to be necessary to pass some `addmm` `BFloat16` tests on H100.
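A sketch of toggling the new switch via the matmul backend flag in `torch.backends.cuda`:

```python
import torch

# Default after this change: reduced-precision reductions are disallowed for bf16 GEMMs.
print(torch.backends.cuda.matmul.allow_bf16_reduced_precision_reduction)

# Opt back in if the accuracy trade-off is acceptable for your workload.
torch.backends.cuda.matmul.allow_bf16_reduced_precision_reduction = True
a = torch.randn(256, 256, device="cuda", dtype=torch.bfloat16)
print((a @ a).float().norm())
```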

CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89172
Approved by: https://github.com/ngimel
2022-12-21 18:58:28 +00:00
Eddie Yan
dabf515c18 [cuDNN][cuDNN V8 API] (re-re-re-open) cuDNN V8 API on by default (#91117)
Re-opening following #91025

CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91117
Approved by: https://github.com/ngimel
2022-12-20 18:52:29 +00:00
PyTorch MergeBot
ba7aeac37b Revert "[cuDNN][cuDNN V8 API] (re-re-open) cuDNN V8 API on by default (#89022)"
This reverts commit eecd621f06.

Reverted https://github.com/pytorch/pytorch/pull/89022 on behalf of https://github.com/ngimel due to breaks some convolution configurations #91025
2022-12-16 23:06:35 +00:00
Rich Zhu
4372dbb89f use pytree to allow any input format for cuda graph (#90941)
Summary:
1. use pytree to allow any input format for make_graphed_callables
2. add allow_unused_input argument for make_graphed_callables
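A small sketch of the resulting API, assuming a CUDA build; the module and shapes are placeholders:

```python
import torch

model = torch.nn.Linear(16, 16).cuda()
sample_args = (torch.randn(8, 16, device="cuda", requires_grad=True),)

# allow_unused_input tolerates inputs that receive no gradient during capture.
graphed = torch.cuda.make_graphed_callables(model, sample_args, allow_unused_input=True)

out = graphed(torch.randn(8, 16, device="cuda", requires_grad=True))
out.sum().backward()
```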

Test Plan: buck2 test mode/dev-nosan  //caffe2/test:cuda --  --print-passing-details

Differential Revision: D42077976

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90941
Approved by: https://github.com/ngimel
2022-12-16 03:01:47 +00:00
Eddie Yan
eecd621f06 [cuDNN][cuDNN V8 API] (re-re-open) cuDNN V8 API on by default (#89022)
Testing V8 on by default again after fixes have been merged for e.g., https://github.com/pytorch/torchdynamo/issues/1833

One new failure that seems to surface with V8 on appears in halonext + amp
```
RuntimeError: Internal Triton PTX codegen error:
Segmentation fault (core dumped)
```
But I'm not sure if this points to a V8 issue or a Triton issue CC @ngimel @ptrblck

Current dynamo benchmarks on A100:
v7 vs. v8
|dev |name                           |batch_size|abs_latency_v7|abs_latency_v8|
|----|-------------------------------|----------|--------------|--------------|
|cuda|adv_inception_v3               |128       |166.0240      |165.5798      |
|cuda|beit_base_patch16_224          |64        |123.5912      |123.0797      |
|cuda|botnet26t_256                  |128       |107.7343      |107.5948      |
|cuda|cait_m36_384                   |4         |184.5038      |184.0271      |
|cuda|coat_lite_mini                 |128       |142.3061      |140.5814      |
|cuda|convit_base                    |64        |165.2499      |161.0743      |
|cuda|convmixer_768_32               |32        |325.6984      |325.7094      |
|cuda|convnext_base                  |64        |237.4632      |238.0142      |
|cuda|crossvit_9_240                 |128       |72.2980       |72.4367       |
|cuda|cspdarknet53                   |64        |96.6862       |96.8308       |
|cuda|deit_base_distilled_patch16_224|64        |117.6045      |117.9616      |
|cuda|dla102                         |128       |182.3073      |182.2304      |
|cuda|dm_nfnet_f0                    |128       |133.6011      |133.6298      |
|cuda|dpn107                         |32        |148.5080      |148.5885      |
|cuda|eca_botnext26ts_256            |128       |113.8676      |113.1514      |
|cuda|eca_halonext26ts               |128       |119.2242      |119.1845      |
|cuda|ese_vovnet19b_dw               |128       |80.0217       |79.9438       |
|cuda|fbnetc_100                     |128       |91.4548       |91.4009       |
|cuda|fbnetv3_b                      |128       |115.4496      |115.5058      |
|cuda|gernet_l                       |128       |114.8365      |114.7870      |
|cuda|ghostnet_100                   |128       |58.5766       |58.5766       |
|cuda|gluon_inception_v3             |128       |165.5222      |165.7167      |
|cuda|gluon_xception65               |32        |165.8779      |165.7818      |
|cuda|gmixer_24_224                  |128       |116.3611      |113.4925      |
|cuda|gmlp_s16_224                   |128       |121.2607      |121.2534      |
|cuda|hrnet_w18                      |128       |246.5706      |246.7599      |
|cuda|inception_v3                   |128       |166.1096      |166.2034      |
|cuda|jx_nest_base                   |32        |93.6064       |93.4088       |
|cuda|lcnet_050                      |128       |21.4156       |21.4207       |
|cuda|levit_128                      |128       |27.2901       |27.2543       |
|cuda|mixer_b16_224                  |128       |157.8992      |158.2878      |
|cuda|mixnet_l                       |128       |197.3443      |197.2125      |
|cuda|mnasnet_100                    |128       |71.4604       |71.2997       |
|cuda|mobilenetv2_100                |128       |67.6080       |67.7515       |
|cuda|mobilenetv3_large_100          |128       |57.7224       |57.6591       |
|cuda|mobilevit_s                    |64        |93.0372       |93.0530       |
|cuda|nfnet_l0                       |128       |113.1664      |113.2853      |
|cuda|pit_b_224                      |64        |133.3333      |133.4153      |
|cuda|pnasnet5large                  |16        |238.9545      |238.8122      |
|cuda|poolformer_m36                 |64        |144.2353      |144.2375      |
|cuda|regnety_002                    |128       |32.8534       |32.9069       |
|cuda|repvgg_a2                      |128       |102.4150      |102.3827      |
|cuda|res2net101_26w_4s              |64        |120.8127      |120.8322      |
|cuda|res2net50_14w_8s               |128       |149.7052      |149.8969      |
|cuda|res2next50                     |128       |153.7439      |153.8215      |
|cuda|resmlp_12_224                  |128       |89.1918       |86.9226       |
|cuda|resnest101e                    |64        |159.4706      |159.3133      |
|cuda|rexnet_100                     |128       |88.0032       |88.0397       |
|cuda|sebotnet33ts_256               |64        |80.4635       |80.0120       |
|cuda|selecsls42b                    |128       |70.4430       |70.3663       |
|cuda|spnasnet_100                   |128       |78.0537       |78.1991       |
|cuda|swin_base_patch4_window7_224   |64        |212.9073      |213.0824      |
|cuda|swsl_resnext101_32x16d         |32        |193.0229      |193.0404      |
|cuda|tf_efficientnet_b0             |128       |97.1316       |97.0410       |
|cuda|tf_mixnet_l                    |128       |203.4956      |203.5340      |
|cuda|tinynet_a                      |128       |82.4038       |82.8733       |
|cuda|tnt_s_patch16_224              |128       |284.8576      |284.8867      |
|cuda|twins_pcpvt_base               |64        |118.3893      |119.2329      |
|cuda|visformer_small                |128       |126.0533      |126.0390      |
|cuda|vit_base_patch16_224           |64        |118.2873      |118.0573      |
|cuda|volo_d1_224                    |64        |108.7764      |108.2063      |
|cuda|xcit_large_24_p8_224           |5         |100.4656      |100.5209      |

v7 vs. v8 amp

|dev |name                           |batch_size|abs_latency_v7|abs_latency_v8|
|----|-------------------------------|----------|--------------|--------------|
|cuda|adv_inception_v3               |128       |104.9729      |105.1237      |
|cuda|beit_base_patch16_224          |64        |75.4330       |75.2039       |
|cuda|botnet26t_256                  |128       |74.5149       |74.8071       |
|cuda|cait_m36_384                   |4         |110.9788      |111.5170      |
|cuda|coat_lite_mini                 |128       |62.3618       |64.4965       |
|cuda|convit_base                    |64        |116.4054      |117.9129      |
|cuda|convmixer_768_32               |32        |264.4401      |264.4491      |
|cuda|convnext_base                  |64        |182.9009      |179.2136      |
|cuda|crossvit_9_240                 |128       |48.8586       |48.8359       |
|cuda|cspdarknet53                   |64        |80.0245       |80.0160       |
|cuda|deit_base_distilled_patch16_224|64        |66.5921       |66.7448       |
|cuda|dla102                         |128       |116.7780      |117.1683      |
|cuda|dm_nfnet_f0                    |128       |78.9322       |79.1135       |
|cuda|dpn107                         |32        |85.5206       |85.7514       |
|cuda|eca_botnext26ts_256            |128       |76.3672       |77.0050       |
|cuda|eca_halonext26ts               |128       |86.2458       |              |
|cuda|ese_vovnet19b_dw               |128       |43.2943       |43.3379       |
|cuda|fbnetc_100                     |128       |54.8479       |54.9251       |
|cuda|fbnetv3_b                      |128       |70.7504       |71.0188       |
|cuda|gernet_l                       |128       |66.1607       |66.0379       |
|cuda|ghostnet_100                   |128       |43.8882       |43.9336       |
|cuda|gluon_inception_v3             |128       |104.9297      |105.0204      |
|cuda|gluon_xception65               |32        |85.7118       |85.8370       |
|cuda|gmixer_24_224                  |128       |75.1214       |76.1170       |
|cuda|gmlp_s16_224                   |128       |76.4207       |76.6641       |
|cuda|hrnet_w18                      |128       |186.1326      |186.2435      |
|cuda|inception_v3                   |128       |105.0561      |105.0783      |
|cuda|jx_nest_base                   |32        |65.3066       |65.3245       |
|cuda|lcnet_050                      |128       |14.7991       |14.8687       |
|cuda|levit_128                      |128       |19.2893       |19.4772       |
|cuda|mixer_b16_224                  |128       |93.9826       |94.2056       |
|cuda|mixnet_l                       |128       |147.1245      |147.0435      |
|cuda|mnasnet_100                    |128       |39.1781       |39.2565       |
|cuda|mobilenetv2_100                |128       |42.3704       |42.3114       |
|cuda|mobilenetv3_large_100          |128       |37.2946       |37.2816       |
|cuda|mobilevit_s                    |64        |55.8930       |55.8934       |
|cuda|nfnet_l0                       |128       |64.0448       |64.4438       |
|cuda|pit_b_224                      |64        |80.6342       |80.2933       |
|cuda|pnasnet5large                  |16        |154.9611      |154.8654      |
|cuda|poolformer_m36                 |64        |101.7489      |101.8138      |
|cuda|regnety_002                    |128       |27.0939       |27.0309       |
|cuda|repvgg_a2                      |128       |60.9651       |61.2533       |
|cuda|res2net101_26w_4s              |64        |77.3291       |77.4739       |
|cuda|res2net50_14w_8s               |128       |93.6572       |93.7221       |
|cuda|res2next50                     |128       |112.4975      |112.3248      |
|cuda|resmlp_12_224                  |128       |59.5422       |60.7644       |
|cuda|resnest101e                    |64        |97.9894       |98.3358       |
|cuda|rexnet_100                     |128       |55.2218       |55.0718       |
|cuda|sebotnet33ts_256               |64        |60.4880       |60.8113       |
|cuda|selecsls42b                    |128       |41.4294       |41.5341       |
|cuda|spnasnet_100                   |128       |45.0037       |45.0304       |
|cuda|swin_base_patch4_window7_224   |64        |98.2561       |98.6925       |
|cuda|swsl_resnext101_32x16d         |32        |100.6179      |100.9195      |
|cuda|tf_efficientnet_b0             |128       |56.5344       |56.4591       |
|cuda|tf_mixnet_l                    |128       |153.0318      |152.9367      |
|cuda|tinynet_a                      |128       |54.1307       |53.9298       |
|cuda|tnt_s_patch16_224              |128       |142.4801      |142.6589      |
|cuda|twins_pcpvt_base               |64        |67.9027       |67.8325       |
|cuda|visformer_small                |128       |72.5589       |72.9427       |
|cuda|vit_base_patch16_224           |64        |71.4885       |71.7342       |
|cuda|volo_d1_224                    |64        |69.3539       |69.5910       |
|cuda|xcit_large_24_p8_224           |5         |59.9000       |59.9699       |

v7 vs. v8 float16
|dev |name                           |batch_size|abs_latency_v7|abs_latency_v8|
|----|-------------------------------|----------|--------------|--------------|
|cuda|adv_inception_v3               |128       |104.2544   |104.2677   |
|cuda|beit_base_patch16_224          |64        |85.3601    |85.3786    |
|cuda|botnet26t_256                  |128       |72.1476    |71.8277    |
|cuda|cait_m36_384                   |4         |108.3075   |108.5941   |
|cuda|coat_lite_mini                 |128       |61.2382    |61.6049    |
|cuda|convmixer_768_32               |32        |263.3818   |263.3598   |
|cuda|convnext_base                  |64        |172.6821   |173.8520   |
|cuda|crossvit_9_240                 |128       |44.6321    |44.6340    |
|cuda|cspdarknet53                   |64        |79.3165    |79.2964    |
|cuda|deit_base_distilled_patch16_224|64        |61.9816    |62.2109    |
|cuda|dla102                         |128       |115.7403   |115.9928   |
|cuda|dm_nfnet_f0                    |128       |77.5434    |77.7440    |
|cuda|dpn107                         |32        |83.6489    |83.5605    |
|cuda|eca_botnext26ts_256            |128       |73.9953    |74.1031    |
|cuda|eca_halonext26ts               |128       |81.7951    |81.7103    |
|cuda|ese_vovnet19b_dw               |128       |42.9618    |42.8853    |
|cuda|fbnetc_100                     |128       |54.3590    |54.3575    |
|cuda|fbnetv3_b                      |128       |69.7977    |70.1696    |
|cuda|gernet_l                       |128       |64.8684    |65.1726    |
|cuda|ghostnet_100                   |128       |43.2054    |43.1319    |
|cuda|gluon_inception_v3             |128       |104.1988   |104.3030   |
|cuda|gluon_xception65               |32        |84.2245    |84.5085    |
|cuda|gmixer_24_224                  |128       |82.0418    |82.7252    |
|cuda|gmlp_s16_224                   |128       |75.4792    |75.8374    |
|cuda|hrnet_w18                      |128       |184.1450   |184.1848   |
|cuda|inception_v3                   |128       |104.1203   |104.2536   |
|cuda|jx_nest_base                   |32        |58.2386    |58.4901    |
|cuda|lcnet_050                      |128       |14.6409    |14.5616    |
|cuda|levit_128                      |128       |22.3875    |22.4680    |
|cuda|mixer_b16_224                  |128       |98.9534    |98.4730    |
|cuda|mixnet_l                       |128       |146.1623   |146.1947   |
|cuda|mnasnet_100                    |128       |38.9208    |39.3463    |
|cuda|mobilenetv2_100                |128       |41.8946    |41.9847    |
|cuda|mobilenetv3_large_100          |128       |36.7810    |36.8264    |
|cuda|mobilevit_s                    |64        |55.3211    |55.3186    |
|cuda|nfnet_l0                       |128       |63.1302    |63.5544    |
|cuda|pit_b_224                      |64        |73.8752    |73.4602    |
|cuda|pnasnet5large                  |16        |151.6806   |151.6111   |
|cuda|poolformer_m36                 |64        |86.8341    |86.8021    |
|cuda|regnety_002                    |128       |26.6798    |26.5295    |
|cuda|repvgg_a2                      |128       |61.6652    |62.1482    |
|cuda|res2net101_26w_4s              |64        |75.8037    |75.7739    |
|cuda|res2net50_14w_8s               |128       |92.6362    |92.4338    |
|cuda|res2next50                     |128       |111.5371   |111.5832   |
|cuda|resmlp_12_224                  |128       |58.2349    |57.9807    |
|cuda|resnest101e                    |64        |96.1114    |96.2742    |
|cuda|rexnet_100                     |128       |54.8138    |54.7643    |
|cuda|sebotnet33ts_256               |64        |53.1524    |53.3823    |
|cuda|selecsls42b                    |128       |40.6070    |40.7104    |
|cuda|spnasnet_100                   |128       |44.5732    |44.4318    |
|cuda|swin_base_patch4_window7_224   |64        |98.6447    |98.8445    |
|cuda|swsl_resnext101_32x16d         |32        |97.0195    |97.2968    |
|cuda|tf_efficientnet_b0             |128       |56.0640    |56.0278    |
|cuda|tf_mixnet_l                    |128       |152.0958   |152.0874   |
|cuda|tinynet_a                      |128       |53.3694    |53.3762    |
|cuda|tnt_s_patch16_224              |128       |130.2981   |130.3726   |
|cuda|twins_pcpvt_base               |64        |62.5459    |62.6416    |
|cuda|visformer_small                |128       |68.8502    |69.1756    |
|cuda|vit_base_patch16_224           |64        |65.8587    |66.0285    |
|cuda|volo_d1_224                    |64        |64.5348    |64.6057    |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89022
Approved by: https://github.com/ngimel
2022-12-15 03:24:44 +00:00
PyTorch MergeBot
cba96366a2 Revert "remove torch.equal usages (#89527)"
This reverts commit 4095ef8b80.

Reverted https://github.com/pytorch/pytorch/pull/89527 on behalf of https://github.com/clee2000 due to broke periodic multigpu tests 4095ef8b80 https://github.com/pytorch/pytorch/actions/runs/3592806602/jobs/6049368502
2022-12-02 21:36:13 +00:00
Philip Meier
4095ef8b80 remove torch.equal usages (#89527)
Preparation for the next PR in this stack: #89559.

I replaced

- `self.assertTrue(torch.equal(...))` with `self.assertEqual(..., rtol=0, atol=0, exact_device=True)`,
- the same for `self.assertFalse(...)` with `self.assertNotEqual(...)`, and
- `assert torch.equal(...)` with `torch.testing.assert_close(..., rtol=0, atol=0)` (note that we don't need to set `check_device=True` here since that is the default).

There were a few instances where the result of `torch.equal` is used directly. In those cases I've replaced it with `(... == ...).all().item()`, while sometimes also dropping the `.item()` depending on the context.
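For reference, a tiny sketch of the replacement pattern for the standalone-assert case:

```python
import torch

a = torch.arange(4.0)
b = torch.arange(4.0)

# before: assert torch.equal(a, b)
torch.testing.assert_close(a, b, rtol=0, atol=0)   # exact match; device checked by default

# when the boolean result itself is needed:
same = (a == b).all().item()
print(same)
```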

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89527
Approved by: https://github.com/mruberry
2022-12-01 11:22:52 +00:00
Aidyn-A
0057be3361 [CUDA graphs] Add warning if captured graph is empty (#88754)
Fixes #87894

This PR adds a warning if captured graph is empty (consists of zero nodes).
An example snippet where it would be useful:

```python
import torch

x = torch.randn(10)
z = torch.zeros(10)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    z = x * x
# Warn user
```

and in #87894

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88754
Approved by: https://github.com/ezyang
2022-11-28 23:20:19 +00:00
Nikita Shulga
da2afcb1e0 Add test for out-of-bounds Tensor access on GPU (#39211)
Since the CUDA context cannot recover safely from an on-device assert, use `torch.multiprocessing.spawn` to execute a method in another context and verify that it raises an unrecoverable error.

As those types of tests are pretty slow (6 seconds on a powerful Linux box with one GPU), run it only in the slow shard.

Closes https://github.com/pytorch/pytorch/issues/38944

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39211
Approved by: https://github.com/ezyang
2022-11-15 21:06:02 +00:00
PyTorch MergeBot
d98a884b33 Revert "[cuDNN] (re-open) Enable cuDNN Frontend v8 API by Default (#87669)"
This reverts commit 3c6bddc3f6.

Reverted https://github.com/pytorch/pytorch/pull/87669 on behalf of https://github.com/eqy due to investigating convnext benchmark regressions
2022-11-08 19:04:25 +00:00
Kurt Mohler
ee28b865ee Deprecate TypedStorage, its derived classes, and all of their public methods (#85303)
Part of #85302

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85303
Approved by: https://github.com/ezyang
2022-11-08 18:11:01 +00:00
Codrin Popa
5b767d404e Modified roundup_power2_divisions to specify the number of divisions for each power of two interval (#87290)
Summary:
Improved the roundup_power2_divisions knob so it allows better control of rounding in the PyTorch CUDA Caching Allocator.

This new version allows setting the number of divisions per power-of-two interval starting from 1MB and ending at 64GB and above. An example use case is when rounding is desirable for small allocations but there are also very large allocations which are persistent, thus would not benefit from rounding and take up extra space.
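A minimal usage sketch; the setting must be in place before the allocator is initialized, and only the simple single-value form is shown (the per-interval syntax is what this change adds and is described in the allocator docs):

```python
import os

# Must be set before the first CUDA allocation so the caching allocator picks it up.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "roundup_power2_divisions:4"

import torch
x = torch.empty(1_000_003, dtype=torch.uint8, device="cuda")  # request gets rounded up
print(torch.cuda.memory_stats()["allocated_bytes.all.current"])
```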

Test Plan: Tested locally

Differential Revision: D40103909

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87290
Approved by: https://github.com/zdevito
2022-11-04 19:31:16 +00:00
eqy
3c6bddc3f6 [cuDNN] (re-open) Enable cuDNN Frontend v8 API by Default (#87669)
#58414

Has a small tweak to a test that was breaking on A10 (CC @malfet).

CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87669
Approved by: https://github.com/ngimel
2022-11-02 01:36:37 +00:00
Masaki Kozuki
bc03aa6013 Store autocast_gpu_dtype in custom_fwd and custom_bwd for BFloat16 autocast (#88029)
As per #87979, `custom_bwd` seems to forcefully use `torch.float16` for `torch.autograd.Function.backward` regardless of the `dtype` used in the forward.

Changes:
- store the `dtype` in `args[0]`
- update tests to confirm the dtype of intermediate result tensors that are outputs of autocast compatible `torch` functions
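A sketch of the decorator pair in question; after this change, the dtype stashed by `custom_fwd` is what `custom_bwd` re-enters autocast with, instead of a hard-coded `torch.float16`:

```python
import torch
from torch.cuda.amp import custom_fwd, custom_bwd

class MyMM(torch.autograd.Function):
    @staticmethod
    @custom_fwd                      # records the ambient autocast dtype (e.g. bfloat16)
    def forward(ctx, a, b):
        ctx.save_for_backward(a, b)
        return a.mm(b)

    @staticmethod
    @custom_bwd                      # backward now runs under the recorded dtype
    def backward(ctx, grad):
        a, b = ctx.saved_tensors
        return grad.mm(b.t()), a.t().mm(grad)

a = torch.randn(8, 8, device="cuda", requires_grad=True)
b = torch.randn(8, 8, device="cuda", requires_grad=True)
with torch.cuda.amp.autocast(dtype=torch.bfloat16):
    out = MyMM.apply(a, b)
out.float().sum().backward()
```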

cc @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88029
Approved by: https://github.com/ngimel
2022-10-31 22:45:26 +00:00
Zachary DeVito
00c91f4446 [allocator] disable tests that don't work for cudaMallocAsyncAllocator (#87250)
Two tests were failing locally for me and don't appear to be run in our CI.
Disabling them so we can otherwise refactor the allocators.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87250
Approved by: https://github.com/wconstab
2022-10-19 18:29:35 +00:00
PyTorch MergeBot
746500d58d Revert "[cuDNN] Enable cuDNN Frontend v8 API by Default (#84948)"
This reverts commit 427e0a6b4e.

Reverted https://github.com/pytorch/pytorch/pull/84948 on behalf of https://github.com/malfet due to Broke SM86 sanity
2022-10-14 14:25:51 +00:00
Eddie Yan
427e0a6b4e [cuDNN] Enable cuDNN Frontend v8 API by Default (#84948)
#58414

Opening this PR for testing for now to check CI status. 🤞

CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84948
Approved by: https://github.com/ngimel
2022-10-13 17:26:36 +00:00
Eddie Yan
25725fd624 (Re-open) Adds cudaMallocAsync as an alternative backend for the CUDA allocator (#82682)
Rebased version of @mcarilli 's cudaMallocAsync #65365 for continued testing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82682
Approved by: https://github.com/ngimel
2022-10-12 03:44:21 +00:00
eqy
352d926482 [CUBLAS][CUDA GRAPHS] (re-re-re-re-open of #83461) Explicitly set the workspace for cuBLAS handles (#86645)
re-opening (again) in hopes of working around failed/stuck CLA check

CC @ptrblck @ngimel @huydhn
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86645
Approved by: https://github.com/zdevito
2022-10-11 16:03:49 +00:00
Zachary DeVito
91b1bae1df Caching allocator tracing (#86241)
We currently can take snapshots of the state of the allocated cuda memory, but we do not have a way to correlate these snapshots with the actions the allocator took between snapshots. This PR adds a simple fixed-size buffer that records the major actions that the allocator takes (ALLOC, FREE, SEGMENT_ALLOC, SEGMENT_FREE, OOM, SNAPSHOT) and includes these with the snapshot information. Capturing periodic snapshots with a big enough trace buffer makes it possible to see how the allocator state changes over time.

We plan to use this functionality to guide how settings in the allocator can be adjusted and eventually have a more robust overall algorithm.

As a component of this functionality, we also add the ability to get a callback when the allocator will throw an OOM, primarily so that snapshots can be taken immediately to see why the program ran out of memory (most programs have some C++ state that would free tensors before the OutOfMemory exception can be caught).

This PR also updates the _memory_viz.py script to pretty-print the trace information and provide a better textual summary of snapshots distinguishing between internal and external fragmentation.
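A rough sketch using the private Python hooks around this machinery; their exact names and signatures have shifted across releases, so treat this purely as illustrative:

```python
import torch

torch.cuda.memory._record_memory_history()           # start recording allocator events
blocks = [torch.empty(1 << 20, device="cuda") for _ in range(8)]
del blocks[::2]                                       # free a few to create trace entries
snap = torch.cuda.memory._snapshot()                  # segments plus recent trace buffer
print(len(snap["segments"]), len(snap.get("device_traces", [])))
```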
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86241
Approved by: https://github.com/ngimel
2022-10-07 23:19:54 +00:00
Edward Z. Yang
adf5919720 Add option to record C++ backtraces in _record_memory_history (#86145)
I used this to debug https://github.com/pytorch/pytorch/issues/86136 so it is useful. The implementation is not so fast so it is not enabled by default.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86145
Approved by: https://github.com/albanD, https://github.com/zdevito
2022-10-06 04:07:37 +00:00
Zachary DeVito
736adc0808 Memory snapshots from C++ (#86190)
Sometimes the driving process wants to save memory snapshots but isn't Python.
Add a simple API to turn it on without Python stack traces. It still
saves to the same format for the visualization and summary scripts, using
the C++ Pickler.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86190
Approved by: https://github.com/ezyang
2022-10-05 07:36:39 +00:00
PyTorch MergeBot
71eb04403c Revert "[CUBLAS][CUDA GRAPHS] (re-re-open of #83461) Explicitly set the workspace for cuBLAS handles (#85447)"
This reverts commit b04b2fa9aa.

Reverted https://github.com/pytorch/pytorch/pull/85447 on behalf of https://github.com/seemethere due to Caused a CUDA memory leak, detected by our performance benchmark suite
2022-09-30 20:53:41 +00:00
Masaki Kozuki
5f26df0345 resubmit: "resubmit: [mta] APEX style Fused Adam (#81705) (#85507)" (#85739)
Embarrassingly move the pow implementations around [ATen/native/cuda/PowKernel.cu#L21-L66](849b08f14b/aten/src/ATen/native/cuda/PowKernel.cu (L21-L66)) to a new header file and let FusedAdam use them to tame MSVC, hopefully.

cc @ngimel @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85739
Approved by: https://github.com/ngimel
2022-09-29 16:58:59 +00:00
Eddie Yan
b04b2fa9aa [CUBLAS][CUDA GRAPHS] (re-re-open of #83461) Explicitly set the workspace for cuBLAS handles (#85447)
Now includes @dagitses 's optimizations and fixes for teardown

CC @ngimel @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85447
Approved by: https://github.com/malfet
2022-09-28 16:04:58 +00:00
Andres Lugo-Reyes
5709c67f1f [ROCm] Retry loop implemented to avoid transient memory leak errors (#82607)
### Description
Added a retry loop to memory leak checker to avoid rare case in which ROCM reports a false positive memory leak.

### Issue
Original issue observed as part of this ticket: https://github.com/pytorch/pytorch/issues/62533

### Testing
- Applied changes and built
- python test/test_cuda.py
- Ensure all tests pass

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82607
Approved by: https://github.com/malfet
2022-09-28 15:48:24 +00:00
PyTorch MergeBot
7167996346 Revert "resubmit: [mta] APEX style Fused Adam (#81705) (#85507)"
This reverts commit 4615d1bcfa.

Reverted https://github.com/pytorch/pytorch/pull/85507 on behalf of https://github.com/atalman due to Break internal windows builds
2022-09-27 16:59:35 +00:00
Masaki Kozuki
4615d1bcfa resubmit: [mta] APEX style Fused Adam (#81705) (#85507)
This PR implements an APEX style FusedAdam in PyTorch. This is different from the APEX one in that this is compatible with `torch.cuda.amp.GradScaler` by setting `_step_supports_amp_scaling` to `True` and unscales gradients inside its CUDA kernel.

related: https://github.com/pytorch/pytorch/issues/68041, https://github.com/pytorch/pytorch/issues/71274, https://github.com/pytorch/pytorch/issues/80167 possibly related to https://github.com/pytorch/pytorch/issues/80595#issuecomment-1178519436

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81705
Approved by: https://github.com/ngimel

cc @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85507
Approved by: https://github.com/ngimel
2022-09-23 18:56:00 +00:00
PyTorch MergeBot
e505360eb8 Revert "[mta] APEX style Fused Adam (#81705)"
This reverts commit 7a6c4d0c50.

Reverted https://github.com/pytorch/pytorch/pull/81705 on behalf of https://github.com/dagitses due to broke internal builds, details to come
2022-09-22 19:37:29 +00:00
PyTorch MergeBot
0ac6311356 Revert "[CUBLAS][CUDA GRAPHS] (re-open of #83461) Explicitly set the workspace for cuBLAS handles (#85292)"
This reverts commit 4012e623e8.

Reverted https://github.com/pytorch/pytorch/pull/85292 on behalf of https://github.com/dagitses due to broke an internal test during shutdown. Re-submit with #85399 in stack
2022-09-21 17:57:49 +00:00
Masaki Kozuki
7a6c4d0c50 [mta] APEX style Fused Adam (#81705)
This PR implements an APEX style FusedAdam in PyTorch.
This is different from the APEX one in that this is compatible with `torch.cuda.amp.GradScaler` by setting `_step_supports_amp_scaling` to `True` and unscales gradients inside its CUDA kernel.
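A usage sketch on the public side, assuming a CUDA build and assuming the fused path is selected via the `fused=True` constructor flag as in current releases (CUDA tensors only):

```python
import torch

model = torch.nn.Linear(32, 32).cuda()
opt = torch.optim.Adam(model.parameters(), lr=1e-3, fused=True)
scaler = torch.cuda.amp.GradScaler()

for _ in range(3):
    opt.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(torch.randn(8, 32, device="cuda")).sum()
    scaler.scale(loss).backward()
    scaler.step(opt)   # grads are unscaled inside the fused kernel (_step_supports_amp_scaling)
    scaler.update()
```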

related: https://github.com/pytorch/pytorch/issues/68041, https://github.com/pytorch/pytorch/issues/71274, https://github.com/pytorch/pytorch/issues/80167
possibly related to https://github.com/pytorch/pytorch/issues/80595#issuecomment-1178519436

cc @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81705
Approved by: https://github.com/ngimel
2022-09-20 17:18:33 +00:00
eqy
4012e623e8 [CUBLAS][CUDA GRAPHS] (re-open of #83461) Explicitly set the workspace for cuBLAS handles (#85292)
re-open of #83461 with fix for 10.2 build

CC @ngimel @malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85292
Approved by: https://github.com/malfet
2022-09-20 16:31:54 +00:00
Hector Yuen
d23ce29761 allow changing the cuda allocator settings even after the process started (#84970)
Summary:
- expose a python call to set the allocator settings, it uses the same format as the value for PYTORCH_CUDA_ALLOCATOR
- keep the implementation contained within the cpp file to avoid increasing build times, only expose a function to call the setting
- make some of the Allocator Config methods public, now it looks more like a singleton
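A sketch of the exposed call; the entry point name used here (`torch.cuda.memory._set_allocator_settings`) is an assumption on my part and may differ from what landed:

```python
import torch

# Same key:value format as the allocator environment variable, applied at runtime.
# NOTE: the function name below is an assumption, not confirmed by the commit text.
torch.cuda.memory._set_allocator_settings("max_split_size_mb:128")
x = torch.empty(1 << 24, device="cuda")
```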

Test Plan: added the unit test

Differential Revision: D39487522

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84970
Approved by: https://github.com/zdevito
2022-09-17 09:42:42 +00:00
PyTorch MergeBot
2711b9fa63 Revert "[CUBLAS][CUDA GRAPHS] Explicitly set the workspace for cuBLAS handles (#83461)"
This reverts commit 713d8b8552.

Reverted https://github.com/pytorch/pytorch/pull/83461 on behalf of https://github.com/malfet due to Broke CUDA-10.2 builds, see 713d8b8552
2022-09-14 22:27:30 +00:00
Eddie Yan
713d8b8552 [CUBLAS][CUDA GRAPHS] Explicitly set the workspace for cuBLAS handles (#83461)
We're seeing an issue where repeatedly capturing graphs incurs increasing memory usage as cuBLAS internally allocates a new workspace for each graph even when the same handle is being used:
https://gist.github.com/tomconerlyanth/a20c04a4a46a0f6e9ce18f5280729b36

This PR works around the issue by intercepting the `CUBLAS_WORKSPACE_CONFIG` environment variable and allocating the workspace for the cuBLAS handle explicitly.
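The environment variable being intercepted, for reference (a sketch; `:4096:8` requests eight workspaces of 4096 KiB, the format NVIDIA documents):

```python
import os

# Must be set before the first cuBLAS handle is created in the process.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

import torch
a = torch.randn(512, 512, device="cuda")
print((a @ a).sum().item())
```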

CC @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83461
Approved by: https://github.com/ngimel
2022-09-14 21:56:48 +00:00
Aidyn-A
5271494ef2 [CUDA graphs] Fixes errors in RNG seed (#84967)
Fixes #84614

Prior to this PR CUDAGraph did not store the RNG seed, that is why `torch.cuda.manual_seed(new_seed)` would only reset the offset but not update the seed at all keeping whatever value was used during graph capture.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84967
Approved by: https://github.com/ngimel
2022-09-14 19:56:12 +00:00
jataylo
09bcc006e9 ROCm support for test_lazy_init (#84333)
Added ROCm support for the test_lazy_init unit test by including a condition on TEST_WITH_ROCM to switch CUDA_VISIBLE_DEVICES with HIP_VISIBLE_DEVICES.

This is needed because HIP_VISIBLE_DEVICES is set when running the single-GPU tests in CI: a47bc96fb7/.jenkins/pytorch/test.sh (L38), but this test sets CUDA_VISIBLE_DEVICES, which takes lower precedence than HIP_VISIBLE_DEVICES on ROCm.

**Testing Logs (to show behavior difference)**
12:40:41 Aug 30 11:40:41 CUDA_VISIBLE_DEVICES='0': 0
12:40:41 Aug 30 11:40:41 1
12:40:41 Aug 30 11:40:41 CUDA_VISIBLE_DEVICES='32': 32
12:40:41 Aug 30 11:40:41 1
12:40:41 Aug 30 11:40:41 HIP_VISIBLE_DEVICES='0': 0
12:40:41 Aug 30 11:40:41 1
12:40:41 Aug 30 11:40:41 HIP_VISIBLE_DEVICES='32': 32
12:40:41 Aug 30 11:40:41 0

**Passing UT**
Aug 30 17:03:15 test_lazy_init (main.TestCuda)
Aug 30 17:03:17 Validate that no CUDA calls are made during import torch call ... ok (2.471s)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84333
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet
2022-09-09 14:14:59 +00:00
Fabio Rocha
88b1cc885c Removed tri[lu]* tests, superseded by OpInfos (#84256)
triu, tril, triu_indices and tril_indices had some
tests in test_tensor_creation_ops.py and test_cuda.py
that are redundant with the ones done by OpInfos for those ops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84256
Approved by: https://github.com/Lezcano, https://github.com/ngimel
2022-09-06 18:54:10 +00:00
Aidyn-A
ce1b727e77 Disable autocast cache in torch.cuda.make_graphed_callables (#84289)
There are conflicts between `torch.clear_autocast_cache()` and `cudaMallocAsync` from #82682.
Moreover, the use of autocast caching is not reasonable during training, which is the main target of `make_graphed_callables`.

cc @eqy @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84289
Approved by: https://github.com/ngimel
2022-09-01 21:34:51 +00:00
Pruthvi Madugundu
8473e69684 [ROCm] Fixes the kernel asserts API declaration mismatch error (#81790)
This PR updates PR [#73040](https://github.com/pytorch/pytorch/pull/73040).

With these changes, compiling PyTorch with ROCm succeeds when `NDEBUG` is enabled.

Solution:
For HIP we keep `__device__ __assert_fail()`
and for host side compilation we want to use the `__assert_fail()` from the glibc library.

Tested the code by compiling with below steps
```
python3 tools/amd_build/build_amd.py
python3 setup.py develop --cmake-only
cmake -DHIP_HIPCC_FLAGS_RELEASE="-DNDEBUG" build
cmake --build build
```

The UT test_fixed_cuda_assert_async is still skipped due to performance overhead.

cc @jithunnair-amd

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81790
Approved by: https://github.com/shintaro-iwasaki, https://github.com/jeffdaily, https://github.com/malfet
2022-08-16 19:22:31 +00:00
Zachary DeVito
4128712397 Propagate CUDAOutOfMemoryError to Python. (#83146)
The intention is to make it easier to catch this situation for debugging,
logging, or application-specific recovery.
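A sketch of the kind of application-specific recovery this enables; the exception is exposed as `torch.cuda.OutOfMemoryError` in current releases (the commit title refers to it as `CUDAOutOfMemoryError`):

```python
import torch

def try_alloc(numel):
    try:
        return torch.empty(numel, device="cuda")
    except torch.cuda.OutOfMemoryError:
        # Log, free cached blocks, and retry with a smaller request.
        torch.cuda.empty_cache()
        return torch.empty(numel // 2, device="cuda")

buf = try_alloc(1 << 28)
```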
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83146
Approved by: https://github.com/albanD
2022-08-11 21:32:11 +00:00
Zachary DeVito
726d040692 annotated allocator snapshots (#82146)
Record stack trace information for each allocated segment in the allocator.
It takes around 1.5us to record 50 stack frames of context.
Since invoking a Pytorch operator is around 8us, this adds minimal overhead but we still leave it disabled by default so that we can test it more on real workloads first.

Stack information is kept both for allocated blocks and for the last allocation that used now-inactive blocks. We could potentially keep around the _first_ allocation that caused the block to get allocated from cuda as well.

Potential Followups:
* stack frame entries are small (16 bytes), but the list of Frames is not compressed even though most frames will share some entries. So far this doesn't produce huge dumps (7MB for one real workload that uses all memory on the GPU), but it can be much smaller through compression.
* Code to format the information is slow (a few seconds) because it uses python and FlameGraph.pl
* Things allocated during the backward pass have no stack frames because they are run on another C++ thread.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82146
Approved by: https://github.com/albanD
2022-08-09 17:21:35 +00:00
Aidyn-A
da0a3fe058 [Re-land] [CUDA graphs] Clear autocast amp cache (#81896)
Re-lands #81558 that got reverted due failing tests.

This failure happened because of a test I designed poorly. [The loop here](https://github.com/pytorch/pytorch/pull/81558/files#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450R3837) is doing `cache_enabled=False` and then `cache_enabled=True`. By doing this loop, the graph from the previous iteration (case `False`) conflicts with the next one (case `True`). I redesigned the test such that it does not do any loops. The new test does separate function calls with different argument values.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81896
Approved by: https://github.com/ngimel
2022-08-02 23:22:00 +00:00
Kurt Mohler
14d0296e5c Rename _Typed/_UntypedStorage to Typed/UntypedStorage and update docs (#82438)
### Description

Since the major changes for `_TypedStorage` and `_UntypedStorage` are now complete, they can be renamed to be public.

`TypedStorage._untyped()` is renamed to `TypedStorage.untyped()`.

Documentation for storages is improved as well.

### Issue
Fixes #82436

### Testing
N/A

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82438
Approved by: https://github.com/ezyang
2022-07-30 19:37:08 +00:00
Eddie Yan
0b2566456f [CUDNN] Update tests and dispatching for CUDNN V8 API behavior for bfloat16 convs (#81139)
cuDNN via the V8 API supports `bfloat16` on Ampere (`>= (8, 0)` but not older devices) which might be unexpected given current test settings. This PR fixes some dispatching to check the device capability before dispatching `bfloat16` convs and adjusts the expected failure conditions for the autocast test.

CC @xwang233 @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81139
Approved by: https://github.com/ngimel
2022-07-29 23:28:58 +00:00
PyTorch MergeBot
f5b460b200 Revert "[CUDA graphs] Clear autocast amp cache (#81558)"
This reverts commit e9d07bd4f0.

Reverted https://github.com/pytorch/pytorch/pull/81558 on behalf of https://github.com/janeyx99 due to Breaks windows 11.6 tests on trunk e9d07bd4f0
2022-07-21 12:46:36 +00:00
Aidyn-A
e9d07bd4f0 [CUDA graphs] Clear autocast amp cache (#81558)
According to [autocast_mode.cpp](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/autocast_mode.cpp), `cached_casts` is to be cleared at the end of each forward pass. However, this was not the case in the current implementation of `make_graphed_callables`, so a graph created the following way:

```
    with torch.cuda.amp.autocast(cache_enabled=True):
        graphed_foo = torch.cuda.make_graphed_callables(foo, tensors)
```
behaves incorrectly.

cc @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81558
Approved by: https://github.com/ngimel
2022-07-21 01:44:14 +00:00
Jeff Daily
ff6655defb [ROCm] unskip external streams tests (#80922)
These two tests are passing for ROCm 5.1.1 and 5.2.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80922
Approved by: https://github.com/cpuhrsch
2022-07-08 21:29:29 +00:00
Nikita Shulga
1ad7ef3f21 Add check for cuda lazy init (#80912)
Validate that no CUDA calls are made during the `import torch` call by importing torch
with visible devices limited to a non-existent device

Should prevent regressions like ones reported in https://github.com/pytorch/pytorch/issues/80876

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80912
Approved by: https://github.com/ngimel, https://github.com/atalman
2022-07-06 01:39:27 +00:00
Jeff Daily
20d56d2b32 increase sleep for TestCuda.test_caching_pinned_memory_multi_gpu (#76601)
Fixes #68299.  Fixes #70875.

Test is flaky on ROCm because the HIP runtime occasionally copies asynchronously too quickly for the current sleep value of 50ms.  This is not a bug.  Increasing the sleep value to 1s to avoid flakiness.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76601
Approved by: https://github.com/pruthvistony, https://github.com/malfet
2022-06-14 21:10:35 +00:00
Michael Carilli
ba27ee9e8f [CUDA graphs] Allows Adam and AdamW to be capture-safe (#77862)
Near term fix for https://github.com/pytorch/pytorch/issues/76368.

Q. Why does the user need to request `capturable=True` in the optimizer constructor? Why can't capture safety be completely automatic?
A. We need to set up capture-safe (device-side) state variables before capture. If we don't, and step() internally detects capture is underway, it's too late: the best we could do is create a device state variable and copy the current CPU value into it, which is not something we want baked into the graph.

Q. Ok, why not just do the capture-safe approach with device-side state variables all the time?
A. It incurs several more kernel launches per parameter, which could really add up and regress cpu overhead for ungraphed step()s. If the optimizer won't be captured, we should allow step() to stick with its current cpu-side state handling.

Q. But cuda RNG is a stateful thing that maintains its state on the cpu outside of capture and replay, and we capture it automatically. Why can't we do the same thing here?
A. The graph object can handle RNG generator increments because its capture_begin, capture_end, and replay() methods can see and access generator object. But the graph object has no explicit knowledge of or access to optimizer steps in its capture scope. We could let the user tell the graph object what optimizers will be stepped in its scope, ie something like
```python
graph.will_use_optimizer(opt)
graph.capture_begin()
...
```
but that seems clunkier than an optimizer constructor arg.

I'm open to other ideas, but right now I think constructor arg is necessary and the least bad approach.

Long term, https://github.com/pytorch/pytorch/issues/71274 is a better fix.
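A condensed sketch of the intended use, following the whole-network capture pattern from the CUDA graphs docs (warmup on a side stream, then capture forward, backward, and step with `capturable=True`); shapes and iteration counts are placeholders:

```python
import torch

model = torch.nn.Linear(64, 64).cuda()
opt = torch.optim.Adam(model.parameters(), lr=1e-3, capturable=True)  # device-side step/state
static_input = torch.randn(8, 64, device="cuda")

# Warmup on a side stream (required before capture).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        opt.zero_grad(set_to_none=True)
        model(static_input).sum().backward()
        opt.step()
torch.cuda.current_stream().wait_stream(s)

# Capture forward, backward, and the optimizer step in one graph.
g = torch.cuda.CUDAGraph()
opt.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    static_loss = model(static_input).sum()
    static_loss.backward()
    opt.step()

# Replay with new data copied into the static input tensor.
static_input.copy_(torch.randn(8, 64, device="cuda"))
g.replay()
```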
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77862
Approved by: https://github.com/ezyang
2022-06-13 01:56:47 +00:00
Kurt Mohler
aea6e2c396 Merge torch.cuda._UntypedStorage into torch._UntypedStorage (#75459)
Fixes #74933

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75459
Approved by: https://github.com/ezyang
2022-05-19 13:54:39 +00:00
Michael Carilli
929f1d5317 [RELAND] Adds torch.cuda.is_current_stream_capturing (#77789)
Resubmit of https://github.com/pytorch/pytorch/pull/77673, which was reverted due to Windows test failures: https://github.com/pytorch/pytorch/pull/77673#issuecomment-1130425845.

I suspect these failures happened because I don't explicitly set a side stream for graph capture in the new test.
Not setting a side stream explicitly is alright on Linux because cuda tests implicitly use a side stream.
I think Windows cuda tests implicitly use the default stream, breaking capture and leaving the backend in a bad state.
Other graphs tests explicitly set side streams and don't error in Windows builds, so i'm 95% sure doing the same for the new test will work.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77789
Approved by: https://github.com/ezyang
2022-05-18 23:18:53 +00:00
Jeff Daily
de86146c61 rocblas alt impl during backward pass only (#71881)
In preparation of adopting future rocblas library options, it is necessary to track when the backward pass of training is executing.  The scope-based helper class `BackwardPassGuard` is provided to toggle state.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71881
Approved by: https://github.com/albanD
2022-05-18 19:42:58 +00:00
PyTorch MergeBot
0d8a0f186b Revert "Adds torch.cuda.is_current_stream_capturing (#77673)"
This reverts commit d03d43df52.

Reverted https://github.com/pytorch/pytorch/pull/77673 on behalf of https://github.com/suo
2022-05-18 19:31:49 +00:00
Michael Carilli
d03d43df52 Adds torch.cuda.is_current_stream_capturing (#77673)
Exposes a way to query if CUDA graph capture is underway on the current stream.
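A small sketch of the query (assuming a CUDA build):

```python
import torch

g = torch.cuda.CUDAGraph()
print(torch.cuda.is_current_stream_capturing())       # False outside capture

with torch.cuda.graph(g):                              # switches to a capture stream
    print(torch.cuda.is_current_stream_capturing())   # True while capturing
    torch.zeros(8, device="cuda")                      # some work so the graph isn't empty

g.replay()
```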
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77673
Approved by: https://github.com/ezyang
2022-05-18 16:46:35 +00:00
Eddie Yan
76b952bb35 [CUBLAS][TF32] Skip test_cublas_allow_tf32_get_set if TORCH_ALLOW_TF32_CUBLAS_OVERRIDE is set (#77298)
Follow-up to #77114 to prevent test breakages when the environment variable is set.

CC @xwang233 @ngimel @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77298
Approved by: https://github.com/xwang233, https://github.com/ngimel
2022-05-17 21:57:09 +00:00
Eddie Yan
e838137b3e Add high level control of fp32 matmul precision; disable TF32 for matmuls by default
#76440
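The knob this adds, as a short sketch (TF32 stays off for matmuls under the default "highest" setting):

```python
import torch

print(torch.get_float32_matmul_precision())        # "highest": full fp32 matmuls
torch.set_float32_matmul_precision("high")         # allow TF32 on supporting hardware

a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")
c = a @ b                                           # may use TF32 under "high"
```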

CC @mruberry @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76509
Approved by: https://github.com/ngimel
2022-05-04 20:40:13 +00:00
Felipe Petroski Such
b0c5fba967 [CUDA Graphs] Fix OOM inside graph capture_begin
release_cached_blocks calls this:
```
void synchronize_and_free_events() {
    TORCH_INTERNAL_ASSERT(captures_underway == 0);
```
Which means we can't call that function when we are capturing a cuda graph:
```
import torch

with torch.cuda.graph(torch.cuda.CUDAGraph()):
    torch.zeros(2 ** 40, device="cuda")
```

results in:
```
RuntimeError: captures_underway == 0INTERNAL ASSERT FAILED at "/tmp/torch/c10/cuda/CUDACachingAllocator.cpp":1224, please report a bug to PyTorch.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76247
Approved by: https://github.com/ngimel
2022-04-29 17:42:04 +00:00
Jeff Daily
e846ef8818 add rocm ciflow/slow workflow
Enables additional tests that historically have been missed for ROCm CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72686
Approved by: https://github.com/seemethere
2022-04-22 17:41:28 +00:00
Ivan Yashchuk
4bb5e6e830 Fix test_reduce_add_coalesced failure (#74027)
Summary:
Recent change (https://github.com/pytorch/pytorch/pull/69751) introduced the requirement of using `.coalesce()` explicitly in the tests. Unfortunately, not all tests are run in the current CI configuration and one test failure slipped through.
Fixes https://github.com/pytorch/pytorch/issues/74015.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/74027

Reviewed By: samdow

Differential Revision: D34858112

Pulled By: mruberry

fbshipit-source-id: 8904fac5e2b5335684a21f95a22646469478eb81
(cherry picked from commit 06d6e6d2a796af0e8444f4c57841a07ec4f67c9f)
2022-03-15 06:29:54 +00:00
Michael Carilli
2f957f513e Deletes unused line in test_autocast_rnn (#73195)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73195

Reviewed By: mruberry

Differential Revision: D34557677

Pulled By: ngimel

fbshipit-source-id: 284018b4596471332d0e90a08e2c38303fb2b3ae
(cherry picked from commit bbf6913009e206c02e124c49ab80ef9596f7fcad)
2022-03-02 01:27:55 +00:00
Shintaro Iwasaki
7dc2cfa249 [c10][rocm] fix __assert_fail() declaration mismatch error (#73040)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73040

This patch fixes a compilation error in PyTorch with ROCm when `NDEBUG` is passed.

## Problem

Forward declaration of `__host__ __device__ __assert_fail()` is used in `c10/macros/Macros.h` for HIP compilation when `NDEBUG` is set. However, HIP has `__device__ __assert_fail()` in `hip/amd_detail/amd_device_functions.h`, causing a function type error.

This issue does not appear in ROCm CI tests since it happens only when `NDEBUG` is passed.

## Solution

[EDIT] After the discussion on GitHub, we chose to entirely disable `CUDA_KERNEL_ASSERT()` for ROCm.

 ---

To solve this compilation error, this patch disables `CUDA_KERNEL_ASSERT()`, which uses `__assert_fail()` when
1. `c10/macros/Macros.h` is included for `*.hip` (precisely speaking, `__HIP__` or `__HIP_ARCH__` is defined), and
2. `NDEBUG` is passed.

Note that there's no impact on default compilation because, without a special compilation flag, those HIP files are compiled without `-NDEBUG`. And that's why this issue has not been found.

### Justification
[1] We cannot declare one host-and-device function for two separate host and device functions.
```
__device__ int func() {return 0};
__host__ int func() {return 0};
// Compile error (hipcc)
// __device__ __host__ int func();
```
[2] Forward declaration of a correct `__device__` only `__assert_fail()` for `__HIP__` causes the following error:
```
pytorch/c10/util/TypeCast.h:135:7: error: reference to __device__ function '__assert_fail' in __host__ __device__ function
      ERROR_UNSUPPORTED_CAST
      ^
pytorch/c10/util/TypeCast.h:118:32: note: expanded from macro 'ERROR_UNSUPPORTED_CAST'
#define ERROR_UNSUPPORTED_CAST CUDA_KERNEL_ASSERT(false);
                               ^
pytorch/c10/macros/Macros.h:392:5: note: expanded from macro 'CUDA_KERNEL_ASSERT'
    __assert_fail(
```

[3] Maybe there's a way to properly define `__assert_fail()` for HIP + NDEBUG, but this might be too much. Please let me just disable it.

### Technical details

Error
```
pytorch/c10/macros/Macros.h:368:5: error: __host__ __device__ function '__assert_fail' cannot overload __device__ function '__assert_fail'
    __assert_fail(
    ^
/opt/rocm/hip/include/hip/amd_detail/amd_device_functions.h:1173:6: note: previous declaration is here
void __assert_fail(const char *assertion,
```

CUDA definition (9.x) of `__assert_fail()`
```
#elif defined(__GNUC__)
extern __host__ __device__ __cudart_builtin__ void __assert_fail(
  const char *, const char *, unsigned int, const char *)
  __THROW;
```

ROCm definition (the latest version)
```
// 2b59661f3e/include/hip/amd_detail/amd_device_functions.h (L1172-L1177)
extern "C" __device__ __attribute__((noinline)) __attribute__((weak))
void __assert_fail(const char *assertion,
                   const char *file,
                   unsigned int line,
                   const char *function);
```

Test Plan:
CI + reproducer
```
python3 tools/amd_build/build_amd.py
python3 setup.py develop --cmake-only
cmake -DHIP_HIPCC_FLAGS_RELEASE="-DNDEBUG" build
cmake --build build
```

Reviewed By: xw285cornell

Differential Revision: D34310555

fbshipit-source-id: 7542288912590533ced3f20afd2e704b6551991b
(cherry picked from commit 9e52196e36820abe36bf6427cabc7389d3ea6cb5)
2022-03-01 04:35:30 +00:00
Philip Meier
b5f2574f36 no longer coalesce sparse COO tensors before comparison (#69751)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69751

cc nikitaved pearu cpuhrsch IvanYashchuk

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D34262453

Pulled By: ezyang

fbshipit-source-id: e2e62d2aa03fc569d2951c880960b256f5dc4aaa
(cherry picked from commit cb6b0ef719)
2022-02-17 02:33:08 +00:00
Kurt Mohler
8e7fe87630 Rename Typed/UntypedStorage to _Typed/_UntypedStorage (#72540)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72540

Reviewed By: jbschlosser

Differential Revision: D34216823

Pulled By: bdhirsh

fbshipit-source-id: 1bc9930ab582771ebf02308e035576cd1a0dbe47
(cherry picked from commit 329238f612)
2022-02-15 23:53:01 +00:00
Louis Feng
83b3b5fb00 [PyTorch] Support NVTX range_start and range_end (#70030)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70030

range_push and range_pop do not support multi-threaded use; they only work when the push and the pop happen on the same thread.

For process-level ranges, we should use range_start and range_end. This is important because the PyTorch forward pass runs on one thread while autograd runs on a different thread.

See NVidia implementation documentation:
cab2dec760/NSight/nvToolsExt.h (L397-L407)
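
A hedged sketch of the difference, assuming the Python bindings `torch.cuda.nvtx.range_start`/`range_end` exposed by this change (the exact binding names are an assumption):
```python
import threading
import torch

# range_push/range_pop are per-thread: the pop must happen on the pushing thread.
torch.cuda.nvtx.range_push("forward")
x = torch.randn(8, device="cuda") * 2
torch.cuda.nvtx.range_pop()

# range_start/range_end return/consume a handle, so a range can be closed from a
# different thread (e.g. forward on the main thread, autograd on another thread).
handle = torch.cuda.nvtx.range_start("fwd+bwd")   # assumed binding name
t = threading.Thread(target=lambda: torch.cuda.nvtx.range_end(handle))
t.start(); t.join()
```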

Test Plan:
```
buck test caffe2/test:cuda

Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/8162774391483460
    ✓ ListingSuccess: caffe2/test:cuda - main (19.640)
Summary
  ListingSuccess: 1
If you need help understanding your runs, please follow the wiki: https://fburl.com/posting_in_tpx_users
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/8162774391483460
```

Reviewed By: malfet

Differential Revision: D33155244

fbshipit-source-id: c7d5143f6da9b6ef0e0811e2fcae03a3e76f24de
(cherry picked from commit 22134e91b7)
2022-02-07 17:31:57 +00:00
Andrew Tulloch
0099796978 [CUDA Pinned Memory] [Retry] Alternative implementation of pinned memory allocator focusing on multi-threaded scalability (#69299)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69299

https://github.com/pytorch/pytorch/pull/68906 + https://github.com/pytorch/pytorch/pull/68749 plugged one correctness hole (non-blocking copies of offset pinned memory tensors) while introducing another (non-blocking copies of pinned memory tensors with a non-standard DataPtr context).

In this revision, we use both the tensor data pointer and context to attempt to identify the originating block in the pinned memory allocator.

Test Plan: New unit tests added to cover the missing case previously.

Reviewed By: yinghai

Differential Revision: D32787087

fbshipit-source-id: 0cb0d29d7c39a13f433eb1cd423dc0d2a303c955
(cherry picked from commit 297157b1a1)
2022-01-27 01:33:55 +00:00
Mike Ruberry
e0d829a266 Kill the test_torch.py mixin and creates test_scatter_gather_ops (#71691)
Summary:
Per title.

Also annotates test_torch.py with additional cleanup tasks and adds empty sample inputs to elementwise unary and binary OpInfos.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71691

Reviewed By: ngimel

Differential Revision: D33735126

Pulled By: mruberry

fbshipit-source-id: 8cc097a7581a8b620540c95b2a5889c1165ecf23
(cherry picked from commit 5c6a245a3f)
2022-01-24 09:32:32 +00:00
Leo Fang
67941c8a94 Document torch.cuda.ExternalStream, torch.cuda.caching_allocator_alloc and torch.cuda.caching_allocator_delete (#70126)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/67414. Fixes https://github.com/pytorch/pytorch/issues/70117.

cc brianjo mruberry ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70126

Reviewed By: mruberry

Differential Revision: D33542910

Pulled By: ngimel

fbshipit-source-id: 4b870f4dceca6ee4cc8fba58819f1cb18ac9f857
2022-01-12 15:44:40 -08:00
Jane Xu
20489ebdc9 Increase tensor size for mem check tests (#70603)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/70226

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70603

Reviewed By: mruberry

Differential Revision: D33410439

Pulled By: janeyx99

fbshipit-source-id: e94615ece6d0fdf230de5297118678b70f34a18c
2022-01-05 08:27:48 -08:00
Jane Xu
c555b7bacb GHA: Remove caffe2 check in Windows shard 1 smoke tests (#70010)
Summary:
Windows shard 1 hasn't actually been running any tests because the script that runs them exited before the Python tests but did not report an error. This has been happening to all Windows test jobs across the board, for example https://github.com/pytorch/pytorch/runs/4526170542?check_suite_focus=true

Removing the caffe2.python check makes the smoke tests pass now. You can observe that run_test.py is now called in the Windows CPU job: https://github.com/pytorch/pytorch/runs/4541331717?check_suite_focus=true

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70010

Reviewed By: malfet, seemethere

Differential Revision: D33161291

Pulled By: janeyx99

fbshipit-source-id: 85024b0ebb3ac42297684467ee4d0898ecf394de
2021-12-20 16:05:38 -08:00
Mike Ruberry
84b7832010 Updates CUDA memory leak check to verify against driver API and print more diagnostic information (#69556)
Summary:
Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69556

Reviewed By: mrshenli

Differential Revision: D32954770

Pulled By: mruberry

fbshipit-source-id: a6c2ae6f704422c178569980ca4b9c72c4272f55
2021-12-17 23:37:49 -08:00
Mike Ruberry
dc87cf5fe1 Fixes mem_get_info when querying on a device other than the current device (#69640)
Summary:
Also fixes the documentation failing to appear and adds a test to validate that op works with multiple devices properly.
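
For reference, a small sketch of the multi-device query this covers (assumes at least two visible GPUs):
```python
import torch

torch.cuda.set_device(0)
free0, total0 = torch.cuda.mem_get_info(0)
free1, total1 = torch.cuda.mem_get_info(1)   # should report device 1, not the current device
print(free0, total0, free1, total1)
```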

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69640

Reviewed By: ngimel

Differential Revision: D32965391

Pulled By: mruberry

fbshipit-source-id: 4fe502809b353464da8edf62d92ca9863804f08e
2021-12-08 23:04:30 -08:00
Dennis van der Staay
cbe0a38d8c Back out "[CUDA Pinned Memory] Event recording with non-blocking copies should track the storage context, not the tensor data pointer" (#69193)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69193

Reviewed By: xing-liu, yuchenhao

Differential Revision: D32748570

fbshipit-source-id: bd73d7567f94c70daeace49d4081381b8adf2d77
2021-12-01 19:30:08 -08:00
Andrew Tulloch
d44e610efa [CUDA Pinned Memory] Event recording with non-blocking copies should track the storage context, not the tensor data pointer (#68749)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68749

The logic for asynchronous copies (either HtoD or DtoH) using cudaMemcpyAsync relies on recording an event with the caching host allocator to notify it that a given allocation has been used on a stream - and thus it should wait for that stream to proceed before reusing the host memory.

This tracking is based on the allocator maintaining a map from storage allocation pointers to some state.

If we try to record an event for a pointer we don't understand, we will silently drop the event and ignore it (9554ebe44e/aten/src/ATen/cuda/CachingHostAllocator.cpp (L171-L175)).

Thus, if we use the data_ptr of a Tensor instead of the storage allocation, then reasonable code can lead to incorrectness due to missed events.

One way this can occur is simply by slicing a tensor into sub-tensors - which have different values of `data_ptr()` but share the same storage, for example:

```
image_batch = torch.randn(M, B, C, H, W).pin_memory()
for m in range(M):
  sub_batch = image_batch[m].cuda(non_blocking=True)
  # sub_batch.data_ptr() != image_batch.data_ptr() except for m == 0.
  # however, sub_batch.storage().data_ptr() == image_batch.storage().data_ptr() always.
```

Therefore, we instead use the storage context pointer when recording events, as this is the same state that is tracked by the caching allocator itself. This is a correctness fix, although it's hard to determine how widespread this issue is.

Using the storage context also allows us to use a more efficient structure internally to the caching allocator, which will be sent in future diffs.

Test Plan: Test added which demonstrates the issue, although it's hard to demonstrate the race explicitly.

Reviewed By: ngimel

Differential Revision: D32588785

fbshipit-source-id: d87cc5e49ff8cbf59052c3c97da5b48dd1fe75cc
2021-11-24 13:20:22 -08:00
eqy
790763b0fe Add an option to disable reduced precision reductions for FP16 GEMM (#67946)
Summary:
https://github.com/pytorch/pytorch/issues/67578 disabled reduced precision reductions for FP16 GEMMs. After benchmarking, we've found that this has substantial performance impacts for common GEMM shapes (e.g., those found in popular instantiations of multiheaded-attention) on architectures such as Volta. As these performance regressions may come as a surprise to current users, this PR adds a toggle to disable reduced precision reductions
`torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction`
rather than making that the default behavior.

CC ngimel ptrblck stas00

Note that the behavior after the previous PR can be replicated with
`torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False`
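
A minimal usage sketch of the toggle (shapes here are arbitrary; leaving the flag untouched keeps the default):
```python
import torch

# Opt out of reduced-precision reductions for FP16 GEMMs
# (replicates the behavior of the earlier PR).
torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False

a = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
b = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
c = a @ b
```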

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67946

Reviewed By: zou3519

Differential Revision: D32289896

Pulled By: ngimel

fbshipit-source-id: a1ea2918b77e27a7d9b391e030417802a0174abe
2021-11-09 17:27:20 -08:00
Jane Xu
2578de4851 [skip ci] Set test owner for test_cuda* tests (#66838)
Summary:
Action following https://github.com/pytorch/pytorch/issues/66232

cc ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66838

Reviewed By: saketh-are

Differential Revision: D31841411

Pulled By: janeyx99

fbshipit-source-id: 5cdffdef4a92f9adcef1143ae4598b052c5acc6b
2021-10-21 17:36:25 -07:00
arindamroy-eng
32e790997b [Rocm]Reduce severity of detected possible memory leak from assertion to warning (#65973)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/62533.
In very rare cases, the decorator for detecting memory leaks throws an assertion even when the test passes and the memory is freed with only a tiny delay. The issue does not reproduce in internal testing, but shows up occasionally in the CI environment.

This change reduces the severity of such a detection to a warning, so as not to fail the CI tests: the actual test is not failing, only the check inside the decorator is.

The change is limited to ROCm only for now.

cc jeffdaily sunway513 jithunnair-amd ROCmSupport

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65973

Reviewed By: anjali411

Differential Revision: D31776154

Pulled By: malfet

fbshipit-source-id: 432199fca17669648463c4177c62adb553cacefd
2021-10-21 07:10:54 -07:00
Yanli Zhao
8173d4df69 move get_cycles_per_ms() to common_utils (#66798)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66798

get_cycles_per_ms is copied and used in a few places; move it to common_utils so that it can be used as a shared utility function.
ghstack-source-id: 140790599

Test Plan: unit tests

Reviewed By: pritamdamania87

Differential Revision: D31706870

fbshipit-source-id: e8dccecb13862646a19aaadd7bad7c8f414fd4ab
2021-10-18 14:04:09 -07:00
Kurt Mohler
5883523c1d Remove dtype from torch.Storage and use only torch.ByteStorage (#62030)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62030

Remove dtype tracking from Python Storage interface, remove all the different `<type>Storage` classes except for `ByteStorage`, and update serialization accordingly, while maintaining as much FC/BC as possible

Fixes https://github.com/pytorch/pytorch/issues/47442

* **THE SERIALIZATION FORMAT IS FULLY FC/BC.** We worked very hard to make sure this is the case. We will probably want to break FC at some point to make the serialization structure of tensors make more sense, but not today.
* There is now only a single torch.ByteStorage class. Methods like `Tensor.set_` no longer check that the dtype of storage is appropriate.
* As we no longer know what dtype of a storage is, we've **removed** the size method from Storage, replacing it with nbytes. This is to help catch otherwise silent errors where you confuse number of elements with number of bytes.
* `Storage._new_shared` takes a `nbytes` kwarg and will reject previous positional only calls.  `Storage._new_with_file` and `_set_from_file` require explicit element size arguments.
* It's no longer possible to convert storages to different types using the float/double/etc methods. Instead, do the conversion using a tensor (see the sketch after this list).
* It's no longer possible to allocate a typed storage directly using FloatStorage/DoubleStorage/etc constructors. Instead, construct a tensor and extract its storage. The classes still exist but they are used purely for unpickling.
* The preexisting serialization format stores dtype with storage, and in fact this dtype is used to determine the dtype of the tensor overall.
 To accommodate this case, we introduce a new TypedStorage concept that exists only during unpickling time which is used to temporarily store the dtype so we can construct a tensor. **If you overrode the handling of pickling/unpickling, you MUST add handling for TypedStorage** or your serialization code will degrade to standard file-based serialization.
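
As a hedged illustration of the `nbytes` and tensor-based conversion points above (a minimal sketch against the post-refactor API; exact class names have shifted across releases):
```python
import torch

t = torch.arange(4, dtype=torch.float32)
s = t.storage()

# Storages now report bytes, not elements, to avoid silently confusing the two.
print(s.nbytes())        # 16 for four float32 values

# Convert to another dtype by going through a tensor instead of calling
# storage.double()/float()/... directly.
t64 = t.to(torch.float64)
print(t64.storage().nbytes())   # 32
```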

Original pull request: https://github.com/pytorch/pytorch/pull/59671

Reviewed By: soulitzer, ngimel

Differential Revision: D29466819

Pulled By: ezyang

fbshipit-source-id: 4a14e5d3c2b08e06e558683d97f7378a3180b00e
2021-10-05 13:50:34 -07:00
Michael Dagitses
b737629ff0 simplify op name determination into a single forward pass (#64261)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64261

Note that this does not preserve byte-for-byte compatibility with
existing names.

Test Plan:
* Rely on CI to catch gross errors.
* Merge after release cut to catch subtle issues.

Reviewed By: albanD

Differential Revision: D30700647

Pulled By: dagitses

fbshipit-source-id: 7b02f34b8fae3041240cc78fbc6bcae498c3acd4
2021-09-02 07:32:11 -07:00
Michael Carilli
24e50b8453 [CUDA graphs] hotfix for test_graph_ (#64339)
Summary:
Graphed workloads that try to capture a full backward pass must do warmup on a non-default stream. If warmup happens on the default stream, AccumulateGrad functions might tag themselves to run on the default stream, and therefore won't be capturable.

ngimel and I suspect some test_cuda.py tests run with the default stream as the ambient stream, which breaks `test_graph_grad_scaling` because `test_graph_grad_scaling` does warmup on the ambient stream _assuming_ the ambient stream is a non-default stream.

This PR explicitly sets a side stream for the warmup in `test_graph_grad_scaling`, which is what I should have done all along because it's what the new documentation recommends.
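
A minimal sketch of the side-stream warmup idiom (a standalone example with a stand-in workload, not the test's exact code):
```python
import torch

model = torch.nn.Linear(8, 8).cuda()                  # stand-in workload
static_input = torch.randn(16, 8, device="cuda")

s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    # Warmup on a non-default stream, so AccumulateGrad nodes don't bind
    # themselves to the default stream and remain capturable later.
    for _ in range(3):
        model.zero_grad(set_to_none=True)
        model(static_input).sum().backward()
torch.cuda.current_stream().wait_stream(s)
```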

I pushed the PR branch straight to the main pytorch repo because we need to run ci-all on it, and I'm not sure what the requirements are these days.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64339

Reviewed By: mruberry

Differential Revision: D30690711

Pulled By: ngimel

fbshipit-source-id: 91ad75f46a11f311e25bc468ea184e22acdcc25a
2021-08-31 22:34:10 -07:00
Rishi Puri
13484084a6 fix syntax error in bfloat16 PR (#64122)
Summary:
Fixes a prior syntax error from the earlier bfloat16 PR. cc ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64122

Reviewed By: H-Huang

Differential Revision: D30643596

Pulled By: ngimel

fbshipit-source-id: 0a2d5a40fb6dc7339cd03112e57ef0e1bf8a000e
2021-08-31 14:33:12 -07:00
Michael Carilli
8d08b103be [CUDA graphs] Prototype API and documentation (#63269)
Summary:
RFC: https://github.com/pytorch/pytorch/issues/61880

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63269

Reviewed By: mruberry

Differential Revision: D30596643

Pulled By: ngimel

fbshipit-source-id: b1f8061406364b667e2c2d4d30fbce1f0d8456be
2021-08-31 13:34:23 -07:00
Philip Meier
57d4c6cf42 replace self.assertTrue(torch.allclose(..)) with self.assertEqual(…) (#63637)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/63565

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63637

Reviewed By: malfet

Differential Revision: D30541266

Pulled By: mruberry

fbshipit-source-id: ab461949782c6908a589ea098fcfcf5c3e081ee6
2021-08-25 16:47:40 -07:00
Shen Li
1022443168 Revert D30279364: [codemod][lint][fbcode/c*] Enable BLACK by default
Test Plan: revert-hammer

Differential Revision:
D30279364 (b004307252)

Original commit changeset: c1ed77dfe43a

fbshipit-source-id: eab50857675c51e0088391af06ec0ecb14e2347e
2021-08-12 11:45:01 -07:00
Zsolt Dollenstein
b004307252 [codemod][lint][fbcode/c*] Enable BLACK by default
Test Plan: manual inspection & sandcastle

Reviewed By: zertosh

Differential Revision: D30279364

fbshipit-source-id: c1ed77dfe43a3bde358f92737cd5535ae5d13c9a
2021-08-12 10:58:35 -07:00
Rishi Puri
324673a537 rebase for autocast updates to include device_type and dtype flags (#61002)
Summary:
Fixes #55374 (https://github.com/pytorch/pytorch/issues/55374)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61002

Reviewed By: malfet, mruberry

Differential Revision: D30016812

Pulled By: ngimel

fbshipit-source-id: 6e09a29f539d28e9aea5cd9489b1e633cc588033
2021-08-10 20:03:12 -07:00
Kevin Tse
4b47ea9446 adding a skip for ROCm for a flaky test (#62664)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62664

Skipping a test for ROCm because of issue #62602

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D30079534

Pulled By: NivekT

fbshipit-source-id: a9cf35e5d3a8d218edc9c5a704d1f9599d2f38a6
2021-08-04 07:29:06 -07:00
Michael Carilli
9fb6b40f3e Makes a streaming backward test try gradient stealing more directly (#60065)
Summary:
Closes https://github.com/pytorch/pytorch/issues/59846.

https://github.com/pytorch/pytorch/issues/59846 is likely paranoia, and some of the test_streaming_backward_* in test_cuda.py already use gradient stealing (ie, they start with `.grad`s as None before backward). Regardless, this PR augments one of the tests to stress gradient stealing a bit more directly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60065

Reviewed By: mrshenli

Differential Revision: D29779518

Pulled By: ngimel

fbshipit-source-id: ccbf278543c3adebe5f4ba0365b1dace9a14da9b
2021-07-19 20:39:55 -07:00
Michael Carilli
2fa6c7627e [CUDA graphs][BC-breaking] Removes post-backward syncs on default stream (#60421)
Summary:
Before https://github.com/pytorch/pytorch/pull/57833, calls to backward() or grad() synced only the calling thread's default stream with autograd leaf streams at the end of backward. This made the following weird pattern safe:
```python
with torch.cuda.stream(s):
    # imagine forward used many streams, so backward leaf nodes may run on many streams
    loss.backward()
# no sync
use grads
```

but a more benign-looking pattern was unsafe:
```python
with torch.cuda.stream(s):
    # imagine forward used a lot of streams, so backward leaf nodes may run on many streams
    loss.backward()
    # backward() syncs the default stream with all the leaf streams, but does not sync s with anything,
    # so counterintuitively (even though we're in the same stream context as backward()!)
    # it is NOT SAFE to use grads here, and there's no easy way to make it safe,
    # unless you manually sync on all the streams you used in forward,
    # or move "use grads" back to default stream outside the context.
    use grads
```
mruberry ngimel and I decided backward() should have the [same user-facing stream semantics as any cuda op](https://pytorch.org/docs/master/notes/cuda.html#stream-semantics-of-backward-passes).** In other words, the weird pattern should be unsafe, and the benign-looking pattern should be safe. Implementationwise, this meant backward() should sync its calling thread's current stream, not default stream, with the leaf streams.
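
With the new semantics, the benign-looking pattern can be written straightforwardly; a self-contained sketch with a stand-in model:
```python
import torch

model = torch.nn.Linear(4, 4).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(8, 4, device="cuda")

s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    loss = model(x).sum()
    loss.backward()
    # Safe now: backward() syncs the calling thread's *current* stream (s)
    # with all autograd leaf streams, so the grads may be consumed here.
    opt.step()
torch.cuda.current_stream().wait_stream(s)
```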

After https://github.com/pytorch/pytorch/pull/57833, backward syncs the calling thread's current stream AND default stream with all leaf streams at the end of backward. The default stream syncs were retained for temporary backward compatibility.

This PR finishes https://github.com/pytorch/pytorch/pull/57833's work by deleting syncs on the default stream.

With this PR, graph-capturing an entire backward() call should be possible (see the [test_graph_grad_scaling diffs](https://github.com/pytorch/pytorch/compare/master...mcarilli:streaming_backwards_remove_default_syncs?expand=1#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450R3641-R3642)).

** first paragraph has a formatting error which this PR should also fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60421

Reviewed By: albanD

Differential Revision: D29370344

Pulled By: ngimel

fbshipit-source-id: 3248bc5fb92fc517db0c15c897e5d7250f67d7fe
2021-06-24 17:34:02 -07:00
Luca Wehrstedt
bb9e1150ea Revert D29342234: [pytorch][PR] [CUDA graphs][BC-breaking] Removes post-backward syncs on default stream
Test Plan: revert-hammer

Differential Revision:
D29342234 (675cea1adb)

Original commit changeset: 98e6be7fdd85

fbshipit-source-id: 84022973248b2254210eee57402df2c4f4bc43c6
2021-06-24 04:49:28 -07:00
Michael Carilli
675cea1adb [CUDA graphs][BC-breaking] Removes post-backward syncs on default stream (#60421)
Summary:
Before https://github.com/pytorch/pytorch/pull/57833, calls to backward() or grad() synced only the calling thread's default stream with autograd leaf streams at the end of backward. This made the following weird pattern safe:
```python
with torch.cuda.stream(s):
    # imagine forward used many streams, so backward leaf nodes may run on many streams
    loss.backward()
# no sync
use grads
```

but a more benign-looking pattern was unsafe:
```python
with torch.cuda.stream(s):
    # imagine forward used a lot of streams, so backward leaf nodes may run on many streams
    loss.backward()
    # backward() syncs the default stream with all the leaf streams, but does not sync s with anything,
    # so counterintuitively (even though we're in the same stream context as backward()!)
    # it is NOT SAFE to use grads here, and there's no easy way to make it safe,
    # unless you manually sync on all the streams you used in forward,
    # or move "use grads" back to default stream outside the context.
    use grads
```
mruberry ngimel and I decided backward() should have the [same user-facing stream semantics as any cuda op](https://pytorch.org/docs/master/notes/cuda.html#stream-semantics-of-backward-passes).** In other words, the weird pattern should be unsafe, and the benign-looking pattern should be safe. Implementationwise, this meant backward() should sync its calling thread's current stream, not default stream, with the leaf streams.

After https://github.com/pytorch/pytorch/pull/57833, backward syncs the calling thread's current stream AND default stream with all leaf streams at the end of backward. The default stream syncs were retained for temporary backward compatibility.

This PR finishes https://github.com/pytorch/pytorch/pull/57833's work by deleting syncs on the default stream.

With this PR, graph-capturing an entire backward() call should be possible (see the [test_graph_grad_scaling diffs](https://github.com/pytorch/pytorch/compare/master...mcarilli:streaming_backwards_remove_default_syncs?expand=1#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450R3641-R3642)).

** first paragraph has a formatting error which this PR should also fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60421

Reviewed By: VitalyFedyunin, albanD

Differential Revision: D29342234

Pulled By: ngimel

fbshipit-source-id: 98e6be7fdd8550872f0a78f9a66cb8dfe75abf63
2021-06-23 23:35:24 -07:00
Michael Carilli
56481f9762 Ensure proper syncs for out-of-place grad creation (torch.autograd.grad) when backward ops run on side streams (#60127)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/59844.

Streaming backwards collects "leaf streams" for AccumulateGrad functions that stash or accumulate .grad attributes for autograd leaf tensors, and syncs those streams with some ambient stream(s) so later ops can safely consume the grads on the ambient stream(s).

But, currently, streaming backwards does not collect leaf streams for grads produced out-of-place (ie, not stashed onto a .grad attribute) by `torch.autograd.grad`, because these out-of-place grads are "captured" and returned before they reach an AccumulateGrad function. Some out-of-place grads might not even have an AccumulateGrad function to go to, because `torch.autograd.grad` can be told to make grads for non-leaf temporaries.[1]

The upshot is, when streaming backwards makes ops that produce out-of-place gradients run on side streams, no ambient stream is told to sync on these side streams, so `torch.autograd.grad` doesn't offer the same post-call safe-use guarantees for grads as the leaf accumulation of `torch.autograd.backward`.

This PR ensures `torch.autograd.grad` gives the same safe-use guarantees as `torch.autograd.backward` by also stashing leaf streams for grads created out-of-place.

I augmented a streaming backwards test to include a torch.autograd.grad attempt. The test fails on current master[2] and passes with the engine.cpp diffs.

I have no idea if this bug or its fix matter to distributed autograd. pritamdamania mrshenli should take a look before it's merged.

[1] example:
```python
leaf = torch.tensor(..., requires_grad=True)
tmp = leaf * 2
loss = tmp.sum()
torch.autograd.grad(loss, inputs=(tmp, leaf))
```
Technically, because `torch.autograd.grad` can be told to produce grads for non-leaf temporaries, these streams might NOT be "leaf streams". Maybe I should rename `leaf_streams`?

[2] the way the test currently fails is fun: it reports
```
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 0 element(s) (out of 25) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.0 (5.0 vs. 5.0), which occurred at index (0, 0).
```
I suspect this [kafka trap](https://en.wiktionary.org/wiki/Kafkatrap) happens because assertEqual does a comparison test on the device, syncs on some bool result, sees failure and prints the tensors post-sync, at which point it IS safe to access the values.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60127

Reviewed By: mrshenli

Differential Revision: D29276581

Pulled By: albanD

fbshipit-source-id: a9f797e2fd76e2f884cce5a32ecf5d9b704c88ee
2021-06-23 07:14:01 -07:00
Alexander Grund
3846cef2d7 Increase tolerance for test_grad_scaling_clipping (#60458)
Summary:
This makes it pass on A100 and with e.g. torch.manual_seed(6) called before running this test.

Fixes https://github.com/pytorch/pytorch/issues/60455

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60458

Reviewed By: mrshenli

Differential Revision: D29309618

Pulled By: ngimel

fbshipit-source-id: 72584087bcc949f7bc96b0644b701e69ae1fa025
2021-06-22 23:43:25 -07:00
Emilio Castillo
f9ec86a6c6 External stream (#59527)
Summary:
The previous attempt is https://github.com/pytorch/pytorch/issues/57781

We now add two CUDA bindings to avoid using ctypes, which fixes a Windows issue.
However, we still use ctypes to allocate the stream and create its pointer
(we could do this with a 0-dim tensor too if that feels better).
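
A hedged sketch of how the resulting binding is meant to be used. In practice the raw handle would come from another library (e.g. CuPy); here a regular torch stream stands in for it:
```python
import torch

s = torch.cuda.Stream()                            # stand-in for a stream owned elsewhere
ext = torch.cuda.ExternalStream(s.cuda_stream)     # wrap the raw cudaStream_t handle
with torch.cuda.stream(ext):
    y = torch.randn(8, device="cuda") * 2
ext.synchronize()                                  # we can now sync on the external stream
```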

CC. ezyang rgommers ngimel mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59527

Reviewed By: albanD

Differential Revision: D29053062

Pulled By: ezyang

fbshipit-source-id: 661e7e58de98b1bdb7a0871808cd41d91fe8f13f
2021-06-14 13:46:11 -07:00
Michael Carilli
be038d8989 [CUDA graphs] Make stream semantics of backward calls consistent with other cuda ops (ci-all edition) (#57833)
Summary:
ci-all resubmit of https://github.com/pytorch/pytorch/pull/54227.

Tests look good except for a few distributed autograd failures (pytorch_linux_xenial_cuda10_2_cudnn7_py3_multigpu_test) and rocm failures (pr/pytorch-linux-bionic-rocm4.1-py3.6).

The common denominator in rocm failures appears to be multi-gpu activity: some [multiprocess DDP failures](https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-bionic-rocm4.1-py3.6-test1/8115/console), some [single-process failures](https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-bionic-rocm4.1-py3.6-test2/8115/console) where the single process has autograd ops that span devices. jeffdaily jithunnair-amd sunway513, could one of you take a look? The streaming backward change is also beneficial to rocm, I expect.

For debugging rocm failures, I think we should ignore the multiprocess/DDP tests and focus on the single process cases. The root cause is probably the same and the single process cases are simpler.

----------------------------------

Update: Rocm failures are due to https://github.com/pytorch/pytorch/issues/59750.
2718a54032 is a workaround, to be updated once https://github.com/pytorch/pytorch/issues/59750 is fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57833

Reviewed By: mruberry

Differential Revision: D28942391

Pulled By: ngimel

fbshipit-source-id: d6047e971c5f1c6386334bf3641402a92f12e2f8
2021-06-13 12:09:56 -07:00
Jeff Daily
24e27af683 [ROCm] enable kernel asserts (#49624)
Summary:
Addresses missing ROCm feature indicated in https://github.com/pytorch/pytorch/issues/38943.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49624

Reviewed By: agolynski

Differential Revision: D28902459

Pulled By: malfet

fbshipit-source-id: 29c9b552770241a0ec52cd057ea45efc4389d838
2021-06-07 13:43:07 -07:00
Mike Ruberry
de40c8e495 Adds remaining OpInfos and removes redundant test generators (#55558)
Summary:
Per title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55558

Reviewed By: ngimel

Differential Revision: D28922522

Pulled By: mruberry

fbshipit-source-id: 89cefd93788bc8aa0683f4583cf5caa81aa2dc93
2021-06-06 14:52:26 -07:00
Rong Rong (AI Infra)
689a5edd0a Revert D28326365: [pytorch][PR] Add torch.cuda.streams.ExternalStream
Test Plan: revert-hammer

Differential Revision:
D28326365 (d7ef9b73fb)

Original commit changeset: b67858c80339

fbshipit-source-id: 337588d40b96cf04e46e554fa481ae7fd4254478
2021-06-04 11:19:36 -07:00
Emilio Castillo
d7ef9b73fb Add torch.cuda.streams.ExternalStream (#57781)
Summary:
This is required in https://github.com/pytorch/pytorch/pull/57110#issuecomment-828357947

We need to provide means to synchronize on externally allocated streams for dlpack support in python array data api.

cc mruberry rgommers leofang asi1024 kmaehashi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57781

Reviewed By: mrshenli

Differential Revision: D28326365

Pulled By: ezyang

fbshipit-source-id: b67858c8033949951b49a3d319f649884dfd0a91
2021-06-04 08:47:09 -07:00
Michael Carilli
3efefc4016 [CUDA graphs] Makes sure all graphs tests call empty_cache() at some point before capture (#59233)
Summary:
Graphs tests are sometimes flaky in CI ([example](https://app.circleci.com/pipelines/github/pytorch/pytorch/328930/workflows/0311199b-a0be-4802-a286-cf1e73f96c70/jobs/13793451)) because when the GPU runs near its max memory capacity (which is not unusual during a long test), sometimes, to satisfy new allocations that don't match any existing unused blocks, the caching allocator may call `synchronize_and_free_events` to wait on block end-of-life events and cudaFree unused blocks, then re-cudaMalloc a new block. For ungraphed ops this isn't a problem, but synchronizing or calling cudaFree while capturing is illegal, so `synchronize_and_free_events` raises an error if called during capture.

The graphs tests themselves don't use much memory, so calling torch.cuda.empty_cache() at some point before their captures should ensure memory is available and the captures never need `synchronize_and_free_events`.
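
Schematically, the pattern looks like this (a sketch using today's public `torch.cuda.CUDAGraph` name; the tests at the time used a private binding):
```python
import torch

torch.cuda.empty_cache()     # free cached blocks so capture never needs to cudaFree/sync
static_input = torch.ones(8, device="cuda")

capture_stream = torch.cuda.Stream()
capture_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(capture_stream):
    g = torch.cuda.CUDAGraph()
    g.capture_begin()
    static_out = static_input * 2
    g.capture_end()
torch.cuda.current_stream().wait_stream(capture_stream)

g.replay()
```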

I was already calling empty_cache() near the beginning of several graphs tests. This PR extends it to the ones I forgot.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59233

Reviewed By: mruberry

Differential Revision: D28816691

Pulled By: ngimel

fbshipit-source-id: 5cd83e48e43b1107daed5cfa2efff0fdb4f99dff
2021-06-01 21:05:46 -07:00
Masaki Kozuki
7eade660c6 [PyTorch] Reduce errors of foreach functions (#56993)
Summary:
This is based on  https://github.com/pytorch/pytorch/issues/48224.

To make `foreach` more flexible, this PR pushes unsupported cases to slow path.
Also, this adds some tests to verify that
- `foreach` functions work with tensors of different dtypes and/or memory layouts in 7bd4b2c89f
- `foreach` functions work with lists containing tensors on different devices, as long as tensors at the same index across lists are on the same device: def4b9b5a1
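
A hedged sketch of the dtype case, using the private `torch._foreach_add` binding (mixed dtypes are expected to take the slow per-tensor path rather than error):
```python
import torch

xs = [torch.randn(4, dtype=torch.float32),
      torch.randn(4, dtype=torch.float64)]   # mixed dtypes in one list
ys = torch._foreach_add(xs, 1.0)             # falls back to the slow (per-tensor) path
print([y.dtype for y in ys])
```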

Future plans:
1. Improve the coverage of unittests using `ops` decorator & updating `foreach_unary_op_db` and creating `foreach_(binary|pointwise|minmax)_db`.
2. Support broadcasting in slow path. Ref:  https://github.com/pytorch/pytorch/pull/52448
3. Support type promotion in fast path. Ref https://github.com/pytorch/pytorch/pull/52449

CC: ngimel mcarilli  ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56993

Reviewed By: zou3519

Differential Revision: D28630580

Pulled By: ngimel

fbshipit-source-id: e26ee74a39a591025e18c1ead48948cb7ec53c19
2021-05-25 10:50:20 -07:00
Michael Carilli
dbedb1fa1c [CUDA graphs] Sync after replay (#57556)
Summary:
Right now** there's a bug in libcuda.so that triggers sometimes when graphs with certain topologies are replayed back to back without a sync in between. Replays that hit this bug turn into spaghetti: kernels reordered ignoring dependencies, kernels elided, corrupted results. Currently, the only workaround I know that fixes all our repros is a manual sync between replays.

I'll remove the sync (or special case it based on cuda version) in a later PR, as soon as a fixed libcuda.so is available.

The only substantive change is the cudaDeviceSynchronize; the other changed lines just de-indent an unneeded scope.

** The bug is in current and semi-recent public versions of libcuda.so. We discovered the bug recently and we're not sure yet which public release was first affected. The version that ships with 11.3 is definitely affected, versions that shipped with 11.1 and earlier are likely not affected.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57556

Reviewed By: mruberry

Differential Revision: D28343043

Pulled By: ngimel

fbshipit-source-id: 3b907241aebdb8ad47ae96a6314a8b02de7bfa77
2021-05-11 09:38:47 -07:00
Gao, Xiang
db7b31358f Fix internal assert in CUDA caching allocator when trying to allocate ~2^64 memory (#57571)
Summary:
When the requested memory is huge, some internal logic in the CUDA caching allocator can overflow. As a result of the overflow, the caching allocator gives a confusing error message.

For example:

```python
import torch
import torch.nn as nn
from torch.utils import cpp_extension
cuda_source = """
#include <c10/cuda/CUDACachingAllocator.h>
void my_fun(void)
{
    size_t temp_storage_bytes = 18446744073708433663UL;
    auto& caching_allocator = *::c10::cuda::CUDACachingAllocator::get();
    auto temp_storage = caching_allocator.allocate(temp_storage_bytes);
    return;
}
"""
cpp_source = """
    void my_fun(void);
"""
module = torch.utils.cpp_extension.load_inline(
    name="cuda_test_extension",
    cpp_sources=cpp_source,
    cuda_sources=cuda_source,
    functions="my_fun",
    extra_cuda_cflags=["--extended-lambda"],
    verbose=True,
)
module.my_fun()
print('done')
```

gives

```
Traceback (most recent call last):
  File "/home/gaoxiang/misc/caching-allocator.py", line 26, in <module>
    module.my_fun()
RuntimeError: p.block != nullptr && p.block->ptr != nullptrINTERNAL ASSERT FAILED at "../c10/cuda/CUDACachingAllocator.cpp":991, please report a bug to PyTorch.
Exception raised from alloc_block at ../c10/cuda/CUDACachingAllocator.cpp:991 (most recent call first):
frame #0: <unknown function> + 0x83e93 (0x7f424f05ee93 in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x83bf9 (0x7f424f05ebf9 in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x839bd (0x7f424f05e9bd in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #3: std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>::operator()() const + 0x4c (0x7f428a3350a2 in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x40 (0x7f424f05dc34 in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #5: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x97 (0x7f424f05c42f in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #6: <unknown function> + 0x6948b4 (0x7f42978fd8b4 in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x22373 (0x7f424f0e2373 in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #8: <unknown function> + 0x1fa6c (0x7f424f0dfa6c in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #9: <unknown function> + 0x2337a (0x7f424f0e337a in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #10: <unknown function> + 0x23f18 (0x7f424f0e3f18 in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #11: my_fun() + 0x4b (0x7f4200338f74 in /home/gaoxiang/.cache/torch_extensions/cuda_test_extension/cuda_test_extension.so)
frame #12: torch::detail::wrap_pybind_function_impl_<void (&)()>(void (&)(), std::integer_sequence<unsigned long>)::{lambda()#1}::operator()() const + 0x3f (0x7f420031e575 in /home/gaoxiang/.cache/torch_extensions/cuda_test_extension/cuda_test_extension.so)
frame #13: <unknown function> + 0x570f2 (0x7f42003350f2 in /home/gaoxiang/.cache/torch_extensions/cuda_test_extension/cuda_test_extension.so)
frame #14: <unknown function> + 0x536e2 (0x7f42003316e2 in /home/gaoxiang/.cache/torch_extensions/cuda_test_extension/cuda_test_extension.so)
frame #15: <unknown function> + 0x4ef2f (0x7f420032cf2f in /home/gaoxiang/.cache/torch_extensions/cuda_test_extension/cuda_test_extension.so)
frame #16: <unknown function> + 0x4ef93 (0x7f420032cf93 in /home/gaoxiang/.cache/torch_extensions/cuda_test_extension/cuda_test_extension.so)
frame #17: <unknown function> + 0x3e7f2 (0x7f420031c7f2 in /home/gaoxiang/.cache/torch_extensions/cuda_test_extension/cuda_test_extension.so)
<omitting python frames>
frame #30: __libc_start_main + 0xd5 (0x7f42c60bab25 in /usr/lib/libc.so.6)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57571

Reviewed By: VitalyFedyunin

Differential Revision: D28224574

Pulled By: ezyang

fbshipit-source-id: df440961f6eaf58048af36ae2a06c59f3c18baec
2021-05-06 01:36:58 -07:00
Michael Carilli
e841f335aa [RELAND] [CUDA graphs] Avoid sync errors when graph capturing cudnn rnn calls that use cudnn dropout (#57373)
Summary:
https://github.com/pytorch/pytorch/pull/56433 was reverted because the test perceived internal dropout state creation as a memory leak. This PR resubmits with the leak check skipped.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57373

Reviewed By: anjali411

Differential Revision: D28152186

Pulled By: ezyang

fbshipit-source-id: 9a593fcdbbabbb09dc4e4221191663e94b697503
2021-05-03 11:41:40 -07:00
Wenlei Xie
20085f6d23 Support auto generation of device check (#56872)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56872

ghstack-source-id: 127914018

Test Plan: auto test

Reviewed By: ezyang

Differential Revision: D27986429

fbshipit-source-id: 0da8413b0b8e6810fcea27ed1de499f11f68bd1f
2021-05-01 12:02:09 -07:00
Michael Carilli
bbc3cc6718 [CUDA graphs] [BC-breaking] Makes torch.cuda.amp.GradScaler scale updates in-place for better composability with graph capture (#55562)
Summary:
I'd like the following pattern (a natural composition of Amp with full fwd+bwd capture) to work:
```python
# Create "static_input" with dummy data, run warmup iterations,
# call optimizer.zero_grad(set_to_none=True), then
g = torch.cuda._Graph()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    optimizer.zero_grad(set_to_none=True)
    g.capture_begin()
    with autocast():
        out = model(static_input)
        loss = loss_fn(out)
    scaler.scale(loss).backward()
    g.capture_end()
torch.cuda.current_stream().wait_stream(s)

# Training loop:
for b in data:
    # optimizer.zero_grad() deliberately omitted, replay()'s baked-in backward will refill statically held .grads
    static_input.copy_(b)
    g.replay()
    scaler.step(optimizer)
    scaler.update()
```

Right now `GradScaler` can't work with this pattern because `update()` creates the scale tensor for the next iteration out of place. This PR changes `update()` to act in place on a long-lived scale tensor that stays static across iterations.

I'm not sure how this change affects XLA (see https://github.com/pytorch/pytorch/pull/48570), so we shouldn't merge without approval from ailzhang yaochengji.

Tagged bc-breaking because it's a change to the amp update utility function in native_functions.yaml. The function was never meant to be user-facing though.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55562

Reviewed By: zou3519

Differential Revision: D28046159

Pulled By: ngimel

fbshipit-source-id: 02018c221609974546c562f691e20ab6ac611910
2021-04-30 13:03:05 -07:00
Nikita Shulga
0a30d64c83 Revert D27966444: [pytorch][PR] [CUDA graphs] Avoid sync errors when graph capturing cudnn rnn calls that use cudnn dropout
Test Plan: revert-hammer

Differential Revision:
D27966444 (610c984d2e)

Original commit changeset: fe0df843c521

fbshipit-source-id: 8223b7f8b7183f0e7c9df6a7aa8f6b164e5634db
2021-04-28 14:51:10 -07:00
Michael Carilli
610c984d2e [CUDA graphs] Avoid sync errors when graph capturing cudnn rnn calls that use cudnn dropout (#56433)
Summary:
Cudnn rnn calls that use cudnn dropout maintain a "state" buffer across calls. [DropoutState](fe3f6f2da2/aten/src/ATen/native/cudnn/RNN.cpp (L1388-L1402))'s lock() and unlock() ensure the current call's use of the state buffer syncs with the end of the previous call's use of the state buffer (in case the previous call was on a different stream).

Telling a capturing stream to wait on an event recorded in a non-capturing stream is an error (1). Telling a non-capturing stream to wait on an event recorded during capture is also an error (2). So DropoutState's flow can error in either of two simple use cases:
```python
rnn = nn.LSTM(512, 512, 2, dropout=0.5).cuda()

out1 = rnn(in1)

# calling cudnn rnn with dropout in capture after calling it uncaptured triggers 1
capture_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(capture_stream):
    graph.capture_begin()
    out2 = rnn(in2)
    graph.capture_end()
torch.cuda.current_stream().wait_stream(capture_stream)

# calling cudnn rnn with dropout uncaptured after calling it in capture triggers 2
out3 = rnn(in3)
```

This PR fixes both cases by telling `DropoutState::lock()`: "if the most recent end-of-usage event was in a different capture state (ie, we crossed a capturing<->noncapturing border) or in a different capture, don't sync on it." While considering the fix I had two assumptions in mind:
- only one capture using the RNN can be underway at a time in this process
- no noncapturing ops in this process are issuing RNN calls while the capture using the RNN is underway.

That second assumption seems brittle if, for example, someone wants to capture an internal region of the forward method of a model wrapped with DataParallel: multiple threads could be issuing RNN calls with some currently capturing and some not. We should talk about whether that use case seems realistic.

(Bigger-picture thoughts: I don't know if forcing calls to serialize on using the shared state buffer is the best design. And if we want to do it that way, we might as well run all cudnn rnns with dropout on a dedicated side stream synced with the surrounding stream (capturing or not), in which case I don't think this PR's event-handling diffs would be needed.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56433

Reviewed By: heitorschueroff

Differential Revision: D27966444

Pulled By: ezyang

fbshipit-source-id: fe0df843c521e0d48d7f2c81a17aff84c5497e20
2021-04-28 12:52:03 -07:00
Michael Carilli
ffdecc1ac4 [CUDA graphs] Allows DeviceCachingAllocator to capture cross-stream memory use (#55860)
Summary:
Safely deallocating and repurposing memory used across streams relies on recording end-of-life events in all an allocation's usage streams beyond its original allocation stream. The events are later queried to see if all GPU work in those extra streams that could have used the allocation is done (from the CPU's perspective) before repurposing the allocation for use in its original stream.

The trouble is, calling EventQuery on an ordinary event recorded in a capturing stream is illegal. Calling EventQuery while capture is underway is also illegal. So when we call `tensor.record_stream` (or `c10::cuda::cudaCachingAllocator::recordStream`) on any tensor that's used or deleted in or around a capture, we often end up with a confusing error thrown from the cudaEventQuery in DeviceCachingAllocator::process_events().

This PR enables hopefully-safe deletion of tensors used across streams in or around capture with a conservative but simple approach: don't record or process end of life events for such tensors until the allocator's sure no captures are underway. You could whiteboard cases where this causes cross-stream-used allocations to be unavailable for reuse longer than absolutely necessary, but cross-stream-used allocations are uncommon, so for practical purposes this approach's impact on the memory footprint of captured sequences should be small.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55860

Reviewed By: ejguan

Differential Revision: D27822557

Pulled By: ezyang

fbshipit-source-id: b2e18a19d83ed05bad67a8157a14a606ed14d04e
2021-04-18 20:32:10 -07:00
Arindam Roy
4cfbb2401f [ROCM] Re-enable 3 previously faling tests in test_cuda.py (#55813)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53190
The following tests are passing in ROCm 4.1, hence re-enabling them:
test_grad_scaling_multigpu
test_streaming_backwards_device_transfer
test_streaming_backwards_multiple_streams

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55813

Reviewed By: yinghai

Differential Revision: D27725547

Pulled By: ngimel

fbshipit-source-id: d8b3ed69fa44c2086f0666b4db0fabb30ad59439
2021-04-13 01:09:11 -07:00
Yukio Siraichi
93bf0ae6fc Remove legacy constructor calls from pytorch codebase. (#54142)
Summary:
Follow up from https://github.com/pytorch/pytorch/issues/53889
Related to https://github.com/pytorch/pytorch/issues/47112

Removing every occurrence of the legacy constructor call present in PyTorch at:
- _docs_
- _benchmarks_
- _test_
- _caffe2_
- _CONTRIBUTING.md_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54142

Reviewed By: ngimel

Differential Revision: D27699450

Pulled By: mruberry

fbshipit-source-id: 530aa3f5746cc8bc1407d5d51b2bbd8075e30546
2021-04-11 15:45:17 -07:00
Heitor Schueroff
5d68b3695c [Relanding] Implemented torch.linalg.multi_dot (#52859)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52859

This reverts commit 92a4ee1cf6.

Added support for bfloat16 for CUDA 11 and removed the fast path for empty input tensors that was affecting the autograd graph.
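
A small usage sketch of the op being relanded (shapes chosen arbitrarily):
```python
import torch

A = torch.randn(10, 100)
B = torch.randn(100, 5)
C = torch.randn(5, 50)
# Chooses the cheapest multiplication order automatically.
out = torch.linalg.multi_dot([A, B, C])
print(out.shape)   # torch.Size([10, 50])
```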

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D27402390

Pulled By: heitorschueroff

fbshipit-source-id: 73c5ccf54f3da3d29eb63c9ed3601e2fe6951034
2021-04-01 04:49:05 -07:00
Kurt Mohler
6c235ef267 Allow std=0 in torch.normal, and error if std<0 (#51317)
Summary:
Part of https://github.com/pytorch/pytorch/issues/49998

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51317

Reviewed By: bdhirsh

Differential Revision: D27253939

Pulled By: mruberry

fbshipit-source-id: af7a72c3d91549b1a88b73849b6973e7619dc50b
2021-03-31 21:06:07 -07:00
Kurt Mohler
3ddc6174da Raise error in clip_grad_norm_ if norm is non-finite (#53843)
Summary:
**BC-breaking note**: This change throws errors for cases that used to silently pass. The old behavior can be obtained by setting `error_if_nonfinite=False`
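
For reference, a minimal sketch of opting back into the old behavior via the new keyword argument:
```python
import torch

p = torch.nn.Parameter(torch.randn(3))
p.grad = torch.tensor([1.0, float("inf"), 2.0])

# With this change the default raises on a non-finite total norm; passing
# error_if_nonfinite=False keeps the old silent behavior.
torch.nn.utils.clip_grad_norm_([p], max_norm=1.0, error_if_nonfinite=False)
```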

Fixes https://github.com/pytorch/pytorch/issues/46849

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53843

Reviewed By: malfet

Differential Revision: D27291838

Pulled By: jbschlosser

fbshipit-source-id: 216d191b26e1b5919a44a3af5cde6f35baf825c4
2021-03-29 08:41:21 -07:00
albanD
1126d51de9 Remove useless contiguous calls from torch.matmul (#54616)
Summary:
This reduces the memory usage of matmul significantly for expanded batch size.

This reduces the peak memory usage of
```
a = torch.rand(1, 1024, 1024, device="cuda")
b = torch.rand(1024, 1024, 1, device="cuda")

out = torch.matmul(a, b)
```
from 4 GB to 16 MB, which is not too bad.

It also fixes the same problem when `b` is not batched.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54616

Reviewed By: ailzhang

Differential Revision: D27327056

Pulled By: albanD

fbshipit-source-id: 4bb5f4015aeab4174148512f3c5b8d1ffa97bf54
2021-03-26 06:34:24 -07:00
Nikita Vedeneev
61b074581c torch.prod backward for complex types. (#48125)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53511
torch.det depends on torch.prod, which in turn depends on several other functions that also depend on torch.prod, so there is a circular relationship; hence this PR enables complex backward support for several functions at once.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48125

Reviewed By: pbelevich

Differential Revision: D27188589

Pulled By: anjali411

fbshipit-source-id: bbb80f8ecb83a0c3bea2b917627d3cd3b84eb09a
2021-03-19 09:44:08 -07:00
Michael Carilli
b27e678dfb [RELAND] [CUDA graphs] Private mempools for CUDA graphs (#54038)
Summary:
Resubmit of https://github.com/pytorch/pytorch/pull/51436.

Apparently some non-public windows builds run cuda tests on the default stream, so I changed a few capture tests to manually ensure all captures happen on non-default streams.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54038

Reviewed By: mruberry

Differential Revision: D27068649

Pulled By: ngimel

fbshipit-source-id: 4284475fa40ee38c0f8faff05a2faa310cf8a207
2021-03-16 12:13:33 -07:00
Natalia Gimelshein
76129c7cdf Revert D26993790: [pytorch][PR] [CUDA graphs] Private mempools for CUDA graphs
Test Plan: revert-hammer

Differential Revision:
D26993790 (90dfdef226)

Original commit changeset: a992eaee1b8c

fbshipit-source-id: 6ddb4aedd6154d7d89847aa5a34181158d06a309
2021-03-12 13:07:28 -08:00
Michael Carilli
90dfdef226 [CUDA graphs] Private mempools for CUDA graphs (#51436)
Summary:
Implements https://github.com/pytorch/pytorch/issues/51075#issuecomment-768884685 and additions discussed offline with ezyang and ngimel. (Calling it "simple" is charitable, but it's not too bad.)

[High level strategy](https://github.com/pytorch/pytorch/pull/51436/files#diff-acc6337586bf9cdcf0a684380779300ec171897d05b8569bf439820dc8c93bd5R57-R82)

The current design aggregates stats from private pools with the ordinary pools, which may or may not be what we want.

Instead of adding PrivatePools as an internal feature of DeviceAllocator, I could inherit from DeviceAllocator (eg `DevicePrivateAllocator : public DeviceAllocator`) and create separate per-graph instances of the inherited class. I'm not sure if that would be better.

Graph bindings in Python are almost unchanged from https://github.com/pytorch/pytorch/pull/48875:
```python
# Same bindings as 48875, but now implicitly grabs a private mempool
graph1.capture_begin()
graph1.capture_end()

# pool=... is new.  It hints that allocations during graph2's capture may share graph1's mempool
graph2.capture_begin(pool=graph1.pool())
graph2.capture_end()

# graph3 also implicitly creates its own mempool
graph3.capture_begin()
graph3.capture_end()
```

Test plan (other suggestions appreciated):

- [x] Stop maintaining manual references for all the tensors in my existing graphs+RNG tests. If private pools somehow give bad allocations, they should start failing intermittently. They run eager ops and eager allocations mixed with graph replays, so they may expose if eager ops and replays corrupt each other.
- [x] `test_graph_two_successive`: Capture successive graphs, with the second graph using the first graph's result. Try with and without sharing a pool. Check results, also check memory stats to confirm sharing a pool saves memory.
- [x] `test_graph_concurrent_replay`: Capture some graphs in separate private pools, replay them concurrently in different streams, check the results to make sure they don't corrupt each other's memory. Capture some graphs with a shared pool, replay them concurrently in different streams, check results, confirm they DO corrupt each other's memory.
- [x] `test_graph_three_successive`: A three-graph case, checking the safe and unsafe replay patterns in [Restrictions of the Strawman API](https://github.com/pytorch/pytorch/issues/51075)).
- [x] `test_graph_memory_stats_and_use_result_after_destroy_graph`: Comprehensively check torch.cuda.memory_stats() changes that result from graph capture and delete. Check that a tensor ref created during capture and held after graph delete stays valid until the tensor itself is deleted.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51436

Reviewed By: mruberry

Differential Revision: D26993790

Pulled By: ngimel

fbshipit-source-id: a992eaee1b8c23628e7b388a5a3c26e0f80e54da
2021-03-12 11:07:47 -08:00
Jagadish Krishnamoorthy
ec6a7cace3 [ROCm] Fix the flaky test test_stream_event_nogil (#53850)
Summary:
Fix the flaky test in https://github.com/pytorch/pytorch/issues/53192 properly.

Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53850

Reviewed By: albanD

Differential Revision: D26993582

Pulled By: malfet

fbshipit-source-id: b0aefb188a236a5e94ee31a30ede7e8175443ff5
2021-03-11 16:07:41 -08:00
Jagadish Krishnamoorthy
0a549f9412 [ROCm] Disable flaky tests on ROCm (#53192)
Summary:
The disabled tests are tracked by
https://github.com/pytorch/pytorch/issues/53190

Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53192

Reviewed By: zhangguanheng66

Differential Revision: D26782204

Pulled By: mrshenli

fbshipit-source-id: bc90b182c236249961da1f0d4894d29f6b44fa27
2021-03-11 08:29:12 -08:00
Edward Yang
758fb94fcb Prefix assert_async with underscore, fix some bugs in assert_async CUDA testing (#53276)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53276

- One of the tests had a syntax error (but the test
  wasn't fine grained enough to catch this; any error
  was a pass)
- Doesn't work on ROCm

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: D26820048

Test Plan: Imported from OSS

Reviewed By: mruberry

Pulled By: ezyang

fbshipit-source-id: b02c4252d10191c3b1b78f141d008084dc860c45
2021-03-05 17:36:01 -08:00
Edward Yang
cfd9360d09 Revert D26837780: Revert D26819810: Revert D26815021: Revert D26744062: Add assert_async
Test Plan: revert-hammer

Differential Revision:
D26837780

Original commit changeset: 21567cab5c0f

fbshipit-source-id: 8ea735e5fdc97e32ae3fafd40297a1b8a7cd34b0
2021-03-04 20:45:35 -08:00
Edward Yang
1accffe450 Revert D26819810: Revert D26815021: Revert D26744062: Add assert_async
Test Plan: revert-hammer

Differential Revision:
D26819810

Original commit changeset: e528260e1aa9

fbshipit-source-id: 21567cab5c0ff5f5e60a699d4d4678773a567c30
2021-03-04 18:48:56 -08:00
Edward Yang
9e5e5a7d96 Revert D26815021: Revert D26744062: Add assert_async
Test Plan: revert-hammer

Differential Revision:
D26815021

Original commit changeset: 972eaafcdf14

fbshipit-source-id: e528260e1aa91df1873c73af00aa57addd671607
2021-03-04 09:28:25 -08:00
Mike Ruberry
b864457743 Revert D26744062: Add assert_async
Test Plan: revert-hammer

Differential Revision:
D26744062 (12d63cc2f5)

Original commit changeset: be6d2653afe5

fbshipit-source-id: 972eaafcdf14d96abdec3dea6bcbd5cac1f3d759
2021-03-04 04:11:25 -08:00
Edward Yang
12d63cc2f5 Add assert_async (#53086)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53086

Fixes #36853

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D26744062

Pulled By: ezyang

fbshipit-source-id: be6d2653afe584adf67a05b5d43185b40764650d
2021-03-03 16:18:07 -08:00
Kyle Chen
f2657d2e4f [ROCm] Enable test cases in test_cuda.py for ROCm (#52739)
Summary:
Enabling four test cases in test_cuda.py for ROCm because they are passing.

Signed-off-by: Kyle Chen <kylechen@amd.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52739

Reviewed By: H-Huang

Differential Revision: D26706321

Pulled By: ngimel

fbshipit-source-id: 6907c548c4ac4e387f0eb7c646e8a01f0d036c8a
2021-03-01 12:54:40 -08:00
AJ San Joaquin
578f0a04c7 fix torch.nn.parallel.scatter_gather.gather to handle NamedTuples and handle moving output to CPU (#51104)
Summary:
Fixes [#50510](https://github.com/pytorch/pytorch/issues/50510)

Allows ```torch.nn.parallel.scatter_gather.gather``` to accept a list of NamedTuples as input and returns a NamedTuple whose elements are tensors. I added the author's fix using the ```is_namedtuple``` function.
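
For illustration, a minimal hedged sketch of the gathering behavior this enables (the two-GPU layout, field names, and shapes are assumptions for the example):
```python
# Hedged sketch: gather a list of per-device NamedTuples onto one device.
# Requires at least two visible GPUs; shapes and field names are illustrative.
import collections
import torch
from torch.nn.parallel.scatter_gather import gather

Output = collections.namedtuple("Output", ["logits", "loss"])
outputs = [
    Output(torch.randn(2, 4, device="cuda:0"), torch.zeros(1, device="cuda:0")),
    Output(torch.randn(2, 4, device="cuda:1"), torch.ones(1, device="cuda:1")),
]
merged = gather(outputs, target_device=0)  # returns an Output, not a plain tuple
print(type(merged).__name__, merged.logits.shape)  # Output torch.Size([4, 4])
```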

While testing this fix, I encountered a deprecation warning instructing me to use ```'cpu'``` instead of ```-1``` to move the outputs to the CPU. However, doing this causes an assertion error in the ```_get_device_index``` function. I solved this by handling the CPU case in the affected ```forward``` function.
rohan-varma

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51104

Reviewed By: albanD

Differential Revision: D26395578

Pulled By: rohan-varma

fbshipit-source-id: 6e98c9ce1d9f1725973c18d24a6554c1bceae465
2021-02-11 15:50:28 -08:00
Chester Liu
58eb23378f Clean up usage of torch._six partially (#49785)
Summary:
See https://github.com/pytorch/pytorch/issues/42919

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49785

Reviewed By: mruberry

Differential Revision: D25963833

Pulled By: bugra

fbshipit-source-id: 11c90d6b8d3f206c9d0a4d8621b773beb10c6ba2
2021-02-08 13:58:34 -08:00
Jagadish Krishnamoorthy
506fdf9abf [ROCm] disable tests for ROCm 4.0.1 (#51510)
Summary:
These tests are failing for ROCm 4.0/4.0.1 release.  Disable the tests until they are fixed.

- TestCuda.test_cudnn_multiple_threads_same_device
- TestCudaFuser.test_reduction

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51510

Reviewed By: H-Huang

Differential Revision: D26205179

Pulled By: seemethere

fbshipit-source-id: 0c3d29989d711deab8b5046b458c772a1543d8ed
2021-02-02 14:39:08 -08:00
Nikita Shulga
43f0ccd1ec torch.cuda.memory_allocated to return {} if not initialized (#51179)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49952

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51179

Reviewed By: ngimel

Differential Revision: D26094932

Pulled By: malfet

fbshipit-source-id: 0ec28ef9b0604245753d3f2b0e3536286700668d
2021-01-28 20:38:17 -08:00
Jeffrey Wan
6e3e57095c Add complex support for torch.nn.L1Loss (#49912)
Summary:
Building on top of the work of anjali411 (https://github.com/pytorch/pytorch/issues/46640)

Things added in this PR:
1. Modify backward and double-backward formulas
2. Add complex support for `new module tests` and criterion tests (and add complex tests for L1)
3. Modify some existing tests to support complex

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49912

Reviewed By: zhangguanheng66

Differential Revision: D25853036

Pulled By: soulitzer

fbshipit-source-id: df619f1b71c450ab2818eb17804e0c55990aa8ad
2021-01-15 15:53:15 -08:00
Nikita Shulga
bf4fcab681 Fix SyncBatchNorm usage without stats tracking (#50126)
Summary:
In `batch_norm_gather_stats_with_counts_cuda`, use `input.scalar_type()` if `running_mean` is not defined.
In the `SyncBatchNorm` forward function, create the count tensor with `torch.float32` type if `running_mean` is None.
Fix a few typos.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50126

Test Plan:
```
python -c "import torch;print(torch.batch_norm_gather_stats_with_counts( torch.randn(1, 3, 3, 3, device='cuda'), mean = torch.ones(2, 3, device='cuda'), invstd = torch.ones(2, 3, device='cuda'), running_mean = None, running_var = None  , momentum = .1, eps = 1e-5, counts = torch.ones(2, device='cuda')))"
```

Fixes https://github.com/pytorch/pytorch/issues/49730

Reviewed By: ngimel

Differential Revision: D25797930

Pulled By: malfet

fbshipit-source-id: 22a91e3969b5e9bbb7969d9cc70b45013a42fe83
2021-01-07 18:31:13 -08:00
Michael Carilli
ee271047b5 torch.utils.checkpoint.checkpoint + torch.cuda.amp (#49757)
Summary:
Adds a test to orphaned original PR (https://github.com/pytorch/pytorch/pull/40221).

Should fix https://github.com/pytorch/pytorch/issues/49738 and https://github.com/pytorch/pytorch/issues/47183

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49757

Reviewed By: mruberry

Differential Revision: D25689609

Pulled By: ngimel

fbshipit-source-id: 0a6adc11eb98382048ef9a9775e185dcdeff6010
2020-12-22 22:25:11 -08:00
Nikita Shulga
befe337072 Fix test_cuda_init_race skip rules (#49693)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49432

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49693

Reviewed By: walterddr, janeyx99

Differential Revision: D25668027

Pulled By: malfet

fbshipit-source-id: 802cbd39e4ebe585709179f332b680f5f7978814
2020-12-21 14:30:00 -08:00
Michael Carilli
c068180a17 [CUDA graphs] Cuda RNG-safe graph capture and replay bindings (#48875)
Summary:
Part 2 of https://github.com/pytorch/pytorch/pull/46148 refactor.  (part 1 was https://github.com/pytorch/pytorch/pull/48694.)
Contains
- a few more CUDAGeneratorImpl diffs to clean up graph capture interaction
- Capture and replay bindings that interact correctly with CUDAGeneratorImpl
- Tests.

Diffs compile and tests pass on my machine (ubuntu 20.04, cuda 11.0) but it needs finetuning for many CI builds.

See [Note [CUDA Graph-safe RNG states]](02d89f9f1d/aten/src/ATen/CUDAGeneratorImpl.h (L13-L85)) for the strategy, based on https://github.com/pytorch/pytorch/pull/46148#issuecomment-724414794.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48875

Reviewed By: zou3519

Differential Revision: D25482654

Pulled By: ngimel

fbshipit-source-id: 634dbc4c6c9d7d0d9a62dc81a52d430561f905fe
2020-12-14 10:51:58 -08:00
Jeff Daily
d5c4a80cfd Allow ROCm CI to use non-default stream. (#48424)
Summary:
Revert https://github.com/pytorch/pytorch/issues/26394. Fixes https://github.com/pytorch/pytorch/issues/27356.  Not all MIOpen handles were setting their stream to the current stream prior to running the op.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48424

Reviewed By: H-Huang

Differential Revision: D25420384

Pulled By: mruberry

fbshipit-source-id: 051683ba9e3d264b71162bd344031a0c58bf6a41
2020-12-10 09:55:11 -08:00
x00480351
47aa253632 [Feature] Allow user to specify a fraction of the GPU memory. (#48172)
Summary:
Add a new function, torch.cuda.set_per_process_memory_fraction(fraction, device), to torch.cuda. Related: https://github.com/pytorch/pytorch/issues/18626
The fraction (a float from 0 to 1) limits the memory available to the caching allocator on the given GPU device. It can be set for any visible GPU. The allowed memory equals total memory * fraction, and an OOM error is raised when an allocation would exceed that limit. This function is similar to TensorFlow's per_process_gpu_memory_fraction.
Note that this setting only limits the caching allocator within one process. If you are using multiprocessing, the setting must be applied inside each subprocess to limit its GPU memory, because each subprocess has its own allocator.

## usage
In some cases, one needs to split a GPU device into two parts. Set the limit before any GPU memory is used.
E.g., on device 0, to give each part half of the memory:
```
torch.cuda.set_per_process_memory_fraction(0.5, 0)
```
There is an example to show what it is.
```python
import torch
torch.cuda.set_per_process_memory_fraction(0.5, 0)
torch.cuda.empty_cache()
total_memory = torch.cuda.get_device_properties(0).total_memory
# less than 0.5 will be ok:
tmp_tensor = torch.empty(int(total_memory * 0.499), dtype=torch.int8, device='cuda')
del tmp_tensor
torch.cuda.empty_cache()
# this allocation will raise an OOM error:
torch.empty(total_memory // 2, dtype=torch.int8, device='cuda')

"""
It raises an error as follows:
RuntimeError: CUDA out of memory. Tried to allocate 5.59 GiB (GPU 0; 11.17 GiB total capacity; 0 bytes already allocated; 10.91 GiB free; 5.59 GiB allowed; 0 bytes reserved in total by PyTorch)
"""
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48172

Reviewed By: bdhirsh

Differential Revision: D25275381

Pulled By: VitalyFedyunin

fbshipit-source-id: d8e7af31902c2eb795d416b57011cc8a22891b8f
2020-12-03 11:45:56 -08:00
pbialecki
22c3ae8b57 Disable autocast cache for tensor views as fix for #48049 (#48696)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48049

Root cause of the issue explained [here](https://github.com/pytorch/pytorch/issues/48049#issuecomment-736701769).

This PR implements albanD's suggestion to add the `!t.is_view()` check and disable autocast caching for views of tensors.

The added test checks for an increase in memory usage by comparing the initially allocated memory with the memory after 3 iterations using a single `nn.Linear` layer in a `no_grad` and `autocast` context.
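
For reference, a minimal hedged sketch of that check (layer size, batch size, and iteration count are made up for illustration):
```python
# Hedged sketch of the memory check: run a Linear layer a few times under
# no_grad + autocast and compare allocated memory before and after.
import torch
import torch.nn as nn

linear = nn.Linear(1024, 1024).cuda()
data = torch.randn(64, 1024, device="cuda")

torch.cuda.empty_cache()
start = torch.cuda.memory_allocated()
for _ in range(3):
    with torch.no_grad(), torch.cuda.amp.autocast():
        out = linear(data)
end = torch.cuda.memory_allocated()
# Before the fix, allocated memory grew with each iteration in this setup;
# after the fix, the growth should stop.
print(end - start)
```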

After this PR the memory usage in the original issue doesn't grow anymore and yields:
```python
autocast: True
0: 0MB (peak 1165MB)
1: 0MB (peak 1264MB)
2: 0MB (peak 1265MB)
3: 0MB (peak 1265MB)
4: 0MB (peak 1265MB)
5: 0MB (peak 1265MB)
6: 0MB (peak 1265MB)
7: 0MB (peak 1265MB)
8: 0MB (peak 1265MB)
9: 0MB (peak 1265MB)
```

CC ngimel mcarilli

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48696

Reviewed By: bdhirsh

Differential Revision: D25276231

Pulled By: ngimel

fbshipit-source-id: e2571e9f166c0a6f6f569b0c28e8b9ca34132743
2020-12-02 20:25:13 -08:00
Jeff Daily
5dfced3b0d work around #47028 until a proper fix is identified (#48405)
Summary:
Otherwise, this test will appear flaky for ROCm even though it is a generic PyTorch issue.

CC albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48405

Reviewed By: mrshenli

Differential Revision: D25183473

Pulled By: ngimel

fbshipit-source-id: 0fa19b5497a713cc6c5d251598e57cc7068604be
2020-11-26 18:33:19 -08:00
Gao, Xiang
315122ce15 Bump up the CUDA OOM test memory size (#48029)
Summary:
80GB is no longer considered large: https://nvidianews.nvidia.com/news/nvidia-doubles-down-announces-a100-80gb-gpu-supercharging-worlds-most-powerful-gpu-for-ai-supercomputing

Hopefully, the new size could be OK until the end of Moore's Law :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48029

Reviewed By: linbinyu

Differential Revision: D25003603

Pulled By: zou3519

fbshipit-source-id: 626b9c031daee950df8453be4d7643dd67647213
2020-11-17 11:16:31 -08:00
Jeff Daily
6906701bde [ROCm] enable stream priorities (#47136)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47136

Reviewed By: mruberry

Differential Revision: D24672457

Pulled By: ngimel

fbshipit-source-id: 54f60c32df87cbd40fccd7fb1ecf0437905f01a3
2020-11-02 11:25:44 -08:00
Michael Carilli
3c643d112e Pin destination memory for cuda_tensor.to("cpu", non_blocking=True) (#46878)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/39694.

[`torch.cuda._sleep(int(100 * get_cycles_per_ms()))`](https://github.com/pytorch/pytorch/pull/46878/files#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450R511-R513) in the test helps avoid flakiness noted by ngimel (https://github.com/pytorch/pytorch/pull/35144#issuecomment-602103631).
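
For context, a hedged sketch of the copy pattern this PR targets (the explicit synchronization is the part callers must not forget before reading the result):
```python
# Hedged sketch: non-blocking device-to-host copy. With this change the CPU
# destination should be allocated in pinned memory, so the copy can be async.
import torch

src = torch.randn(1024, 1024, device="cuda")
dst = src.to("cpu", non_blocking=True)
torch.cuda.current_stream().synchronize()  # wait before reading dst on the CPU
print(dst.is_pinned())                     # expected True after this change
```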

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46878

Reviewed By: izdeby

Differential Revision: D24550403

Pulled By: xw285cornell

fbshipit-source-id: 1ecc35ef75f9a38ab332aacdf4835955105edafc
2020-10-29 15:42:55 -07:00
Jeff Daily
151f31ba27 remove event not ready assertion from TestCuda.test_copy_non_blocking (#46857)
Summary:
It is incorrect to assume that a newly recorded event will immediately query as False.
This test is flaky on ROCm due to this incorrect assumption.
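
A hedged sketch of the event semantics in question (illustrative only):
```python
# Hedged sketch: Event.query() right after record() is not guaranteed to be
# False -- if the stream has no pending work, the event may complete at once.
import torch

e = torch.cuda.Event()
e.record(torch.cuda.current_stream())
print(e.query())   # may already be True; asserting False here is the flaky part
e.synchronize()
print(e.query())   # guaranteed True once the event has completed
```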

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46857

Reviewed By: albanD

Differential Revision: D24565581

Pulled By: mrshenli

fbshipit-source-id: 0e9ba02cf52554957b29dbeaa5093696dc914b67
2020-10-27 14:21:40 -07:00
anjali411
d94bd998ec Update backward formulas (Re #44444) (#46275)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46275

Re #44444

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D24285785

Pulled By: anjali411

fbshipit-source-id: c60ecd4fe4f144132085f2c91d3b950e92b2a491
2020-10-25 19:40:59 -07:00
ashish
88e94da580 Enable softmax and tiny norm FP16 tests on ROCm (#46363)
Summary:
This pull request enables the following tests on ROCm:
* TestCuda.test_tiny_half_norm_
* TestNNDeviceTypeCUDA.test_softmax_cuda_float16
* TestNNDeviceTypeCUDA.test_softmax_cuda_float32
* TestNNDeviceTypeCUDA.test_softmax_results_cuda_float16
* TestNNDeviceTypeCUDA.test_softmax_results_cuda_float32

The earlier failures, because of which the tests were skipped, were because of a precision issue for FP16 compute on MI25 hardware with ROCm 3.7 and older. The fix was delivered in the compiler in ROCm 3.8.

The pull request fixes https://github.com/pytorch/pytorch/issues/37493

cc: jeffdaily ezyang malfet mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46363

Reviewed By: heitorschueroff

Differential Revision: D24325639

Pulled By: ezyang

fbshipit-source-id: a7dbb238cf38c04b6592baad40b4d71725a358c9
2020-10-22 19:40:00 -07:00
Richard Barnes
52a970bac9 Minor cleaning of test_cuda.py (#46617)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46617

Sort includes, fix deprecated test warning

Test Plan:
```
buck run mode/dev-nosan //caffe2/test:cuda
```

Reviewed By: drdarshan

Differential Revision: D24429247

fbshipit-source-id: 65f53d7c904032e5c8f8ca45d1d2bb437358ffdd
2020-10-22 09:03:30 -07:00
Alexander Grund
5b0f400488 Replace list(map(...)) constructs by list comprehensions (#46461)
Summary:
As discussed in https://github.com/pytorch/pytorch/issues/46392, this makes the code more readable and possibly more performant.

It also fixes a bug, uncovered by this change, where the argument order of `map` was confused: 030a24906e (diff-5bb26bd3a23ee3bb540aeadcc0385df2a4e48de39f87ed9ea76b21990738fe98L1537-R1537)

Fixes https://github.com/pytorch/pytorch/issues/46392

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46461

Reviewed By: ailzhang

Differential Revision: D24367015

Pulled By: ezyang

fbshipit-source-id: d55a67933cc22346b00544c9671f09982ad920e7
2020-10-19 18:42:49 -07:00
Michael Carilli
5640b79bf8 Allow consumer ops to sync on GraphRoot's gradient (#45787)
Summary:
Currently, a GraphRoot instance doesn't have an associated stream.  Streaming backward synchronization logic assumes the instance ran on the default stream, and tells consumer ops to sync with the default stream.  If the gradient the GraphRoot instance passes to consumer backward ops was populated on a non-default stream, we have a race condition.

The race condition can exist even if the user doesn't give a manually populated gradient:
```python
with torch.cuda.stream(side_stream):
    # loss.backward() implicitly synthesizes a one-element 1.0 tensor on side_stream
    # GraphRoot passes it to consumers, but consumers first sync on default stream, not side_stream.
    loss.backward()

    # Internally to backward(), streaming-backward logic takes over, stuff executes on the same stream it ran on in forward,
    # and the side_stream context is irrelevant.  GraphRoot's interaction with its first consumer(s) is the spot where
    # the side_stream context causes a problem.
```

This PR fixes the race condition by associating a GraphRoot instance, at construction time, with the current stream(s) on the device(s) of the grads it will pass to consumers. (I think this relies on GraphRoot executing in the main thread, before the backward thread(s) fork, because the grads were populated on the main thread.)

The test demonstrates the race condition. It fails reliably without the PR's GraphRoot diffs and passes with the GraphRoot diffs.

With the GraphRoot diffs, manually populating an incoming-gradient arg for `backward` (or `torch.autograd.grad`) and the actual call to `autograd.backward` will have the same stream-semantics relationship as any other pair of ops:
```python
# implicit population is safe
with torch.cuda.stream(side_stream):
    loss.backward()

# explicit population in side stream then backward in side stream is safe
with torch.cuda.stream(side_stream):
    kickoff_grad = torch.ones_like(loss)
    loss.backward(gradient=kickoff_grad)

# explicit population in one stream then backward kickoff in another stream
# is NOT safe, even with this PR's diffs, but that unsafety is consistent with
# stream-semantics relationship of any pair of ops
kickoff_grad = torch.ones_like(loss)
with torch.cuda.stream(side_stream):
    loss.backward(gradient=kickoff_grad)

# Safe, as you'd expect for any pair of ops
kickoff_grad = torch.ones_like(loss)
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream):
    loss.backward(gradient=kickoff_grad)
```
This PR also adds the last three examples above to cuda docs and references them from autograd docstrings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45787

Reviewed By: nairbv

Differential Revision: D24138376

Pulled By: albanD

fbshipit-source-id: bc4cd9390f9f0358633db530b1b09f9c1080d2a3
2020-10-07 08:53:53 -07:00
Rohan Varma
f8c1ca5dd8 Enable NamedTuple data type to work with DDP (#44220)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44220

Closes https://github.com/pytorch/pytorch/issues/44009
Currently if a dataloader returns objects created with a
collections.namedtuple, this will incorrectly be cast to a tuple. As a result, if we have data of these types, there can be runtime errors during the forward pass if the module is expecting a named tuple.

Fix this in
`scatter_gather.py` to resolve the issue reported in
https://github.com/pytorch/pytorch/issues/44009
ghstack-source-id: 113423287

Test Plan: CI

Reviewed By: colesbury

Differential Revision: D23536752

fbshipit-source-id: 3838e60162f29ebe424e83e474c4350ae838180b
2020-10-02 13:33:08 -07:00
Michael Carilli
72bc3d9de4 Use MTA for amp grad unscaling, enforce op math type in MTA functors, and allow op lambdas (#44778)
Summary:
Amp gradient unscaling is a great use case for multi tensor apply (in fact it's the first case I wrote it for).  This PR adds an MTA unscale+infcheck functor.  Really excited to have it for `torch.cuda.amp`. izdeby your interface was clean and straightforward to use, great work!
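
For orientation, a hedged sketch of the user-facing `torch.cuda.amp` loop that exercises this path (model, optimizer, and data are placeholders):
```python
# Hedged sketch: standard GradScaler loop; unscale_() is where the fused
# foreach unscale + inf/nan check runs under the hood.
import torch

model = torch.nn.Linear(64, 64).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()
data = torch.randn(32, 64, device="cuda")
target = torch.randn(32, 64, device="cuda")

for _ in range(3):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = torch.nn.functional.mse_loss(model(data), target)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # explicit unscale (optional; step() would do it)
    scaler.step(optimizer)
    scaler.update()
```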

Labeled as bc-breaking because the native_functions.yaml exposure of unscale+infcheck changes from [`_amp_non_finite_check_and_unscale_` to `_amp_foreach_non_finite_check_and_unscale_`]( https://github.com/pytorch/pytorch/pull/44778/files#diff-f1e4b2c15de770d978d0eb77b53a4077L6289-L6293).

The PR also modifies Unary/Binary/Pointwise Functors to
- do ops' internal math in FP32 for FP16 or bfloat16 inputs, which improves precision ([and throughput, on some architectures!](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions)) and has no downside for the ops we care about.
- accept an instantiated op functor rather than an op functor template (`template<class> class Op`).  This allows calling code to pass lambdas.

Open question:  As written now, the PR has MTA Functors take care of pre- and post-casting FP16/bfloat16 inputs to FP32 before running the ops.  However, alternatively, the pre- and post-math casting could be deferred/written into the ops themselves, which gives them a bit more control.  I can easily rewrite it that way if you prefer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44778

Reviewed By: gchanan

Differential Revision: D23944102

Pulled By: izdeby

fbshipit-source-id: 22b25ccad5f69b413c77afe8733fa9cacc8e766d
2020-10-01 07:51:16 -07:00
Nikita Shulga
c3a5aed5f7 Run pytorch_core CUDA tests on GPU using TPX
Summary:
Modify contbuild to disable sanitizers, add option to run "cuda" test using TPX RE

(Note: this ignores all push blocking failures!)

Test Plan: CI

Reviewed By: walterddr, cspanda

Differential Revision: D23854578

fbshipit-source-id: 327d7cc3655c17034a6a7bc78f69967403290623
2020-09-24 12:12:23 -07:00
Edward Yang
da4033d32a Make cudaHostRegister actually useful on cudart. (#45159)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45159

By default, pybind11 binds void* to be capsules.  After a lot of
Googling, I have concluded that this is not actually useful:
you can't actually create a capsule from Python land, and our
data_ptr() function returns an int, which means that the
function is effectively unusable.  It didn't help that we had no
tests exercising it.

I've replaced the void* with uintptr_t, so that we now accept int
(and you can pass data_ptr() in directly).  I'm not sure if we
should make these functions accept ctypes types; unfortunately,
pybind11 doesn't seem to have any easy way to do this.

Fixes #43006

Also added cudaHostUnregister which was requested.
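
A hedged usage sketch, mirroring my understanding of the new test (return-value handling omitted on purpose):
```python
# Hedged sketch: pass data_ptr() (a Python int) straight to the cudart bindings
# to pin an ordinary CPU tensor, then unregister it again.
import torch

t = torch.ones(20)                 # pageable CPU tensor
cudart = torch.cuda.cudart()
cudart.cudaHostRegister(t.data_ptr(), t.numel() * t.element_size(), 0)
print(t.is_pinned())               # expected True once registration succeeds
cudart.cudaHostUnregister(t.data_ptr())
print(t.is_pinned())               # expected False again
```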

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D23849731

Pulled By: ezyang

fbshipit-source-id: 8a79986f3aa9546abbd2a6a5828329ae90fd298f
2020-09-23 11:05:44 -07:00
Xiao Wang
d75c402755 Add cusolver to build, rewrite MAGMA inverse with cusolver (#42403)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42265

This PR adds cusolver to the pytorch build, and enables the use of cusolver/cublas library functions on GPU `torch.inverse` on certain tensor shapes.

Specifically, when

* the tensor is two dimensional (single batch), or
* has >2 dimensions (multiple batches) and `batch_size <= 2`, or
* magma is not linked,

cusolver/cublas will be used. In other conditions, the current implementation of MAGMA will still be used.

8c0949ae45/aten/src/ATen/native/cuda/BatchLinearAlgebra.cu (L742-L752)

The reason for this is that for tensors with large batch_size, `cublasXgetrfBatched` and `cublasXgetriBatched` doesn't perform very well. For `batch_size > 1`, we launch cusolver functions in multiple streams. This lets cusolver functions run in parallel, and can greatly increase the performance. When `batch_size > 2`, the parallel launched cusolver functions are slightly slower than the current magma implementation, so we still use the current magma impl.

On CUDA 9.2, some numerical issues were detected, so the cusolver impl will not be used there. The cusolver impl will also not be used on platforms other than NVIDIA CUDA.

060769feaf/aten/src/ATen/native/cuda/BatchLinearAlgebraLib.h (L10-L13)

Note that there is a new heuristic used before cusolver/cublas calls here:

8c0949ae45/aten/src/ATen/native/cuda/MiscUtils.h (L113-L121)

where `use_loop_launch = true` means launch single batch cusolver functions in parallel, and `use_loop_launch = false` means use cublas_X_batched functions. When magma is enabled (only `batch_size <= 2` will be dispatched to cusolver/cublas), the heuristic will always return `true` and the cusolver calls are faster than small batch_size magma calls. When magma is disabled, this adds the functionality of `torch.inverse`, which was disabled before for all shapes (though large batch_size cublas performance may not be as good as magma's).

Checklist:
- [X] Add benchmark, cpu, gpu-before (magma), gpu-after (cusolver)
- [X] Rewrite single inverse (ndim == 2) with cusolver
- [X] Rewrite batched inverse (ndim > 2) with cublas
- [X] Add cusolver to build
- [x] Clean up functions related to `USE_MAGMA` define guard
- [x] Workaround for non-cuda platform
- [x] Workaround for cuda 9.2
- [x] Add zero size check
- [x] Add tests

Next step:

If cusolver doesn't cause any problems in the pytorch build, and no major performance regressions are reported after this PR is merged, I will start porting other cusolver/cublas functions for linear algebra to improve performance.

<details>
<summary> benchmark 73499c6 </summary>

benchmark code: https://github.com/xwang233/code-snippet/blob/master/torch.inverse/inverse-cusolver.ipynb

shape meaning:

* `[] 2 torch.float32 -> torch.randn(2, 2, dtype=torch.float32)`
* `[2] 4 torch.float32 -> torch.randn(2, 4, 4, dtype=torch.float32)`

| shape | cpu_time (ms) | gpu_time_before (magma) (ms) | gpu_time_after (ms) |
| --- | --- | --- | --- |
| [] 2 torch.float32 |  0.095 |  7.534 |  0.129  |
| [] 4 torch.float32 |  0.009 |  7.522 |  0.129  |
| [] 8 torch.float32 |  0.011 |  7.647 |  0.138  |
| [] 16 torch.float32 |  0.075 |  7.582 |  0.135  |
| [] 32 torch.float32 |  0.073 |  7.573 |  0.191  |
| [] 64 torch.float32 |  0.134 |  7.694 |  0.288  |
| [] 128 torch.float32 |  0.398 |  8.073 |  0.491  |
| [] 256 torch.float32 |  1.054 |  11.860 |  1.074  |
| [] 512 torch.float32 |  5.218 |  14.130 |  2.582  |
| [] 1024 torch.float32 |  19.010 |  18.780 |  6.936  |
| [1] 2 torch.float32 |  0.009 |  0.113 |  0.128 ***regressed |
| [1] 4 torch.float32 |  0.009 |  0.113 |  0.131 ***regressed |
| [1] 8 torch.float32 |  0.011 |  0.116 |  0.129 ***regressed |
| [1] 16 torch.float32 |  0.015 |  0.122 |  0.135 ***regressed |
| [1] 32 torch.float32 |  0.032 |  0.177 |  0.178 ***regressed |
| [1] 64 torch.float32 |  0.070 |  0.420 |  0.281  |
| [1] 128 torch.float32 |  0.328 |  0.816 |  0.490  |
| [1] 256 torch.float32 |  1.125 |  1.690 |  1.084  |
| [1] 512 torch.float32 |  4.344 |  4.305 |  2.576  |
| [1] 1024 torch.float32 |  16.510 |  16.340 |  6.928  |
| [2] 2 torch.float32 |  0.009 |  0.113 |  0.186 ***regressed |
| [2] 4 torch.float32 |  0.011 |  0.115 |  0.184 ***regressed |
| [2] 8 torch.float32 |  0.012 |  0.114 |  0.184 ***regressed |
| [2] 16 torch.float32 |  0.019 |  0.119 |  0.173 ***regressed |
| [2] 32 torch.float32 |  0.050 |  0.170 |  0.240 ***regressed |
| [2] 64 torch.float32 |  0.120 |  0.429 |  0.375  |
| [2] 128 torch.float32 |  0.576 |  0.830 |  0.675  |
| [2] 256 torch.float32 |  2.021 |  1.748 |  1.451  |
| [2] 512 torch.float32 |  9.070 |  4.749 |  3.539  |
| [2] 1024 torch.float32 |  33.655 |  18.240 |  12.220  |
| [4] 2 torch.float32 |  0.009 |  0.112 |  0.318 ***regressed |
| [4] 4 torch.float32 |  0.010 |  0.115 |  0.319 ***regressed |
| [4] 8 torch.float32 |  0.013 |  0.115 |  0.320 ***regressed |
| [4] 16 torch.float32 |  0.027 |  0.120 |  0.331 ***regressed |
| [4] 32 torch.float32 |  0.085 |  0.173 |  0.385 ***regressed |
| [4] 64 torch.float32 |  0.221 |  0.431 |  0.646 ***regressed |
| [4] 128 torch.float32 |  1.102 |  0.834 |  1.055 ***regressed |
| [4] 256 torch.float32 |  4.042 |  1.811 |  2.054 ***regressed |
| [4] 512 torch.float32 |  18.390 |  4.884 |  5.087 ***regressed |
| [4] 1024 torch.float32 |  69.025 |  19.840 |  20.000 ***regressed |

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42403

Reviewed By: ailzhang, mruberry

Differential Revision: D23717984

Pulled By: ngimel

fbshipit-source-id: 54cbd9ea72a97989cff4127089938e8a8e29a72b
2020-09-18 20:43:29 -07:00
Michael Carilli
2a87742ffa Autocast wrappers for RNN cell apis (#44296)
Summary:
Should fix https://github.com/pytorch/pytorch/issues/42605.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44296

Reviewed By: izdeby

Differential Revision: D23580447

Pulled By: ezyang

fbshipit-source-id: 86027b693fd2b648f043ab781b84ffcc1f72854d
2020-09-09 09:44:59 -07:00
Gao, Xiang
5e97f251a8 Enable TF32 support for cuDNN (#40737)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40737

Reviewed By: mruberry

Differential Revision: D22801525

Pulled By: ngimel

fbshipit-source-id: ac7f7e728b4b3e01925337e8c9996f26a6433fd2
2020-09-01 15:34:24 -07:00
Peter Bell
42f6c3b1f4 Raise error on device mismatch in addmm (#43505)
Summary:
Fixes gh-42282

This adds a device-mismatch check to `addmm` on CPU and CUDA. The dispatcher seems to always select the CUDA version here if any of the inputs are on GPU, so in theory the CPU check is unnecessary, but it is probably better to err on the side of caution.
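
A hedged sketch of the failure mode the check now reports cleanly:
```python
# Hedged sketch: mixing CPU and CUDA inputs to addmm should raise a clear
# RuntimeError about mismatched devices rather than failing obscurely.
import torch

bias = torch.randn(2, 2)                 # CPU
mat1 = torch.randn(2, 2, device="cuda")
mat2 = torch.randn(2, 2, device="cuda")
try:
    torch.addmm(bias, mat1, mat2)
except RuntimeError as e:
    print("device mismatch:", e)
```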

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43505

Reviewed By: mruberry

Differential Revision: D23331651

Pulled By: ngimel

fbshipit-source-id: 8eb2f64f13d87e3ca816bacec9d91fe285d83ea0
2020-08-26 09:37:57 -07:00
Michael Carilli
fbf274f5a7 Autocast support for cudnn RNNs (#42385)
Summary:
Should close https://github.com/pytorch/pytorch/issues/36428.

The cudnn RNN API expects weights to occupy a flat buffer in memory with a particular layout.  This PR implements a "speed of light" fix:  [`_cudnn_rnn_cast_reflatten`](https://github.com/pytorch/pytorch/pull/42385/files#diff-9ef93b6a4fb5a06a37c562b83737ac6aR327) (the autocast wrapper assigned to `_cudnn_rnn`) copies weights to the right slices of a flat FP16 buffer with a single read/write per weight (as opposed to casting them to FP16 individually then reflattening the individual FP16 weights, which would require 2 read/writes per weight).
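
A hedged usage sketch of the path this wrapper covers (layer sizes are illustrative):
```python
# Hedged sketch: running a cuDNN LSTM under autocast; the wrapper casts and
# reflattens the FP32 weights into a single FP16 buffer behind the scenes.
import torch

rnn = torch.nn.LSTM(input_size=32, hidden_size=64, num_layers=2).cuda()
x = torch.randn(16, 8, 32, device="cuda")   # (seq, batch, feature)
with torch.cuda.amp.autocast():
    out, (h, c) = rnn(x)
print(out.dtype)                            # expected torch.float16
```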

It isn't pretty but IMO it doesn't make rnn bindings much more tortuous than they already are.

The [test](https://github.com/pytorch/pytorch/pull/42385/files#diff-e68a7bc6ba14f212e5e7eb3727394b40R2683) tries a forward under autocast and a backward for the full cross product of RNN options and input/weight/hidden dtypes.  As for all FP16list autocast tests, forward output and backward grads are checked against a control where inputs (including RNN module weights in this case) are precasted to FP16 on the python side.

Not sure who to ask for review, tagging ezyang and ngimel because Ed wrote this file (almost 2 years ago) and Natalia did the most recent major [surgery](https://github.com/pytorch/pytorch/pull/12600).

Side quests discovered:
- Should we update [persistent RNN heuristics](dbdd28207c/aten/src/ATen/native/cudnn/RNN.cpp (L584)) to include compute capability 8.0?  Could be another PR but seems easy enough to include.
- Many (maybe all?!) of the raw cudnn API calls in [RNN.cpp](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cudnn/RNN.cpp) are deprecated in cudnn 8. I don't mind taking the AI to update them since my mental cache is full of rnn stuff, but that would be a substantial separate PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42385

Reviewed By: zhangguanheng66

Differential Revision: D23077782

Pulled By: ezyang

fbshipit-source-id: a2afb1bdab33ba0442879a703df13dc87f03ec2e
2020-08-18 13:37:42 -07:00
Pritam Damania
872237c1f2 Output to stderr in distributed tests. (#42139)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42139

A bunch of tests were failing with buck since we would output to
stdout and buck would fail parsing stdout in some cases.

Moving these print statements to stderr fixes this issue.
ghstack-source-id: 108606579

Test Plan: Run the offending unit tests.

Reviewed By: mrshenli

Differential Revision: D22779135

fbshipit-source-id: 789af3b16a03b68a6cb12377ed852e5b5091bbad
2020-07-29 19:23:34 -07:00