Commit Graph

894 Commits

PyTorch MergeBot
03b7ec9237 Revert "create a new torch.cuda.memory_usage_in_bytes api (#140719)"
This reverts commit 9febc47637.

Reverted https://github.com/pytorch/pytorch/pull/140719 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the test is flaky on ROCm ([comment](https://github.com/pytorch/pytorch/pull/140719#issuecomment-2479832082))
2024-11-15 20:05:32 +00:00
Yu Guo
9febc47637 create a new torch.cuda.memory_usage_in_bytes api (#140719)
Summary:
The current torch.cuda.memory_usage returns memory utilization, i.e. the percent of time over the past sample period during which global memory was being read or written (on NVIDIA), rather than usage in bytes.

see more details in https://github.com/pytorch/pytorch/issues/140638
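
For orientation, a rough sketch of how the two calls differ; the `memory_usage_in_bytes` name comes from this PR's title (the change was later reverted), so it is shown only as a comment, with a stand-in built from existing APIs:

```python
import torch

if torch.cuda.is_available():
    # Existing API: GPU *utilization*, i.e. percent of time over the past
    # sample period that global memory was being read/written (NVML-based).
    util_percent = torch.cuda.memory_usage(0)

    # API proposed in this PR (name/signature assumed from the title):
    # used_bytes = torch.cuda.memory_usage_in_bytes(0)

    # A portable stand-in using existing APIs: free/total from the driver.
    free_bytes, total_bytes = torch.cuda.mem_get_info(0)
    used_bytes = total_bytes - free_bytes
    print(util_percent, used_bytes)
```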

Test Plan: added a new unittest

Differential Revision: D65928031

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140719
Approved by: https://github.com/xw285cornell, https://github.com/hongxiayang
2024-11-15 05:59:40 +00:00
Syed Tousif Ahmed
341a28f0ce Refactors empty_cache to return only MemPool memory to the system (#133602)
Canonically, the empty_cache API releases all cached blocks of the CUDACachingAllocator. There is no API that can release only the cached blocks of a given pool.

In this PR, we extend the empty_cache API so that it releases only the cached blocks of the active pool. When empty_cache is called under a MemPoolContext, we release only the cached blocks that correspond to the pool id of the active pool.
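
A minimal usage sketch, assuming the Python-side `torch.cuda.MemPool` / `torch.cuda.use_mem_pool` surface introduced elsewhere in this history (`MemPoolContext` itself is the C++-level construct):

```python
import torch

pool = torch.cuda.MemPool()  # Python wrapper around a private pool

with torch.cuda.use_mem_pool(pool):
    buf = torch.empty(1024, device="cuda")
    del buf
    # With this PR, empty_cache() called while the pool is active releases
    # only the cached blocks belonging to `pool`, not the global cache.
    torch.cuda.empty_cache()

# Outside the context, empty_cache() keeps its canonical behavior and
# releases all cached blocks of the CUDACachingAllocator.
torch.cuda.empty_cache()
```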

Part of https://github.com/pytorch/pytorch/issues/124807.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133602
Approved by: https://github.com/ezyang
2024-10-29 23:58:44 +00:00
Jeff Daily
7c7b2d89ba [ROCm] set hipblas workspace (#138791)
Fixes #138532.

This brings hipBLAS behavior in line with cuBLAS with respect to setting the workspace to an allocation from the caching allocator, and adds support for the HIPBLAS_WORKSPACE_CONFIG env var.
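
A small sketch of how the knob might be set; the value format here is an assumption that mirrors the cuBLAS `CUBLAS_WORKSPACE_CONFIG` convention:

```python
import os

# Must be set before the first hipBLAS handle is created (i.e. before any
# GEMM runs); value format assumed to mirror CUBLAS_WORKSPACE_CONFIG.
os.environ.setdefault("HIPBLAS_WORKSPACE_CONFIG", ":4096:2")

import torch  # noqa: E402

a = torch.randn(128, 128, device="cuda")
b = torch.randn(128, 128, device="cuda")
c = a @ b  # workspace now comes from the caching allocator, sized per the env var
```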

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138791
Approved by: https://github.com/naromero77amd, https://github.com/eqy, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-10-29 01:37:55 +00:00
Syed Tousif Ahmed
1637a40796 Adds snapshot API for MemPools to get pool memory segments (#133601)
Canonically, the snapshot API returns the entire memory state of the CUDACachingAllocator (using `get_all_blocks`). There is no API that returns the memory state of only a given pool.

In this PR, we extend the snapshot API so that it can return only the memory addresses of an active pool. When the snapshot API is called under a MemPoolContext, we return only the blocks that correspond to the pool id of the active pool.
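
A hedged sketch of a pool-scoped snapshot; the Python entry point (`torch.cuda.memory_snapshot`) and the pool context manager are assumptions based on related commits in this history:

```python
import torch

pool = torch.cuda.MemPool()  # assumed Python-side pool handle

with torch.cuda.use_mem_pool(pool):
    x = torch.empty(4096, device="cuda")
    # With this PR, a snapshot taken while the pool is active reports only
    # the segments/blocks whose pool id matches the active pool.
    pool_segments = torch.cuda.memory_snapshot()

# Taken outside any pool context, the snapshot covers the allocator's
# entire state (all blocks, via get_all_blocks).
all_segments = torch.cuda.memory_snapshot()
```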

Part of https://github.com/pytorch/pytorch/issues/124807.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133601
Approved by: https://github.com/ezyang
2024-10-29 01:01:47 +00:00
PyTorch MergeBot
3b0f39336c Revert "Adds snapshot API for MemPools to get pool memory segments (#133601)"
This reverts commit 00504aa6b8.

Reverted https://github.com/pytorch/pytorch/pull/133601 on behalf of https://github.com/wdvr due to reverting for now as this breaks lots of internal tests. Details below ([comment](https://github.com/pytorch/pytorch/pull/133601#issuecomment-2441864871))
2024-10-28 15:12:20 +00:00
Syed Tousif Ahmed
00504aa6b8 Adds snapshot API for MemPools to get pool memory segments (#133601)
Canonically, the snapshot API returns the entire memory state of the CUDACachingAllocator (using `get_all_blocks`). There is no API that returns the memory state of only a given pool.

In this PR, we extend the snapshot API so that it can return only the memory addresses of an active pool. When the snapshot API is called under a MemPoolContext, we return only the blocks that correspond to the pool id of the active pool.

Part of https://github.com/pytorch/pytorch/issues/124807.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133601
Approved by: https://github.com/ezyang
2024-10-26 03:34:59 +00:00
Kiuk Chung
940658405b [test/test_cuda] Use temp file for test_improper_device_name (#138856)
Use `tempfile.NamedTemporaryFile()` so that test_specify_improper_device_name saves/loads to a temp file rather than the current working directory.
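
Roughly the pattern being adopted (a sketch, not the exact test body):

```python
import tempfile
import torch

# Save/load through a named temp file instead of writing into the CWD.
with tempfile.NamedTemporaryFile() as f:
    torch.save(torch.ones(3), f.name)
    t = torch.load(f.name, map_location="cpu")
```
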
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138856
Approved by: https://github.com/Skylion007
2024-10-26 02:42:25 +00:00
Syed Tousif Ahmed
03c72976a5 Properly uses ref-counting for torch.cuda.use_mem_pool (#133600)
This PR refactors some ref-counting functionality out of `beginAllocateToPool` and `releasePool`. The ref-counting logic is then used in construction and destruction of `torch.cuda.MemPool`.

The `use_count` variable in the CUDACachingAllocator is essentially a refcount of how many context managers are using the pool. Since we are now lifting the MemPool abstraction up to the user, the MemPool object itself needs to hold an extra reference as well.

Part of https://github.com/pytorch/pytorch/issues/124807.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133600
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-10-22 03:21:53 +00:00
Jack Taylor
966a1a971e [ROCm] Add AMDSMI support for UUID input (#129741)
Adds support for using UUIDs with the AMDSMI utilities in PyTorch via CUDA_VISIBLE_DEVICES/HIP_VISIBLE_DEVICES.
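
Roughly what this enables; the UUID string below is a placeholder, not a real device:

```python
import os

# Select devices by UUID rather than index; with this PR the AMDSMI-backed
# utilities (device_count, memory stats, etc.) resolve such entries on ROCm.
os.environ["HIP_VISIBLE_DEVICES"] = "GPU-00000000-0000-0000-0000-000000000000"

import torch  # noqa: E402

print(torch.cuda.device_count())
```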

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129741
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
2024-10-15 15:56:30 +00:00
eqy
cba3f4f5e3 [CUDA] Clean up asserts in test_cuda.py (#137034)
Switch some `assertTrue` checks to `assertEqual` etc. for better debuggability in logs.
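
The motivation in one line: on failure, `assertEqual` logs both operands while `assertTrue` only reports "False is not true". A tiny illustration:

```python
import unittest

class Example(unittest.TestCase):
    def test_style(self):
        allocated, expected = 4, 4
        # assertTrue only reports "False is not true" when it fails ...
        self.assertTrue(allocated == expected)
        # ... while assertEqual prints both operands, which is what the CI
        # logs need for debugging.
        self.assertEqual(allocated, expected)

if __name__ == "__main__":
    unittest.main()
```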

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137034
Approved by: https://github.com/Skylion007
2024-10-08 23:16:19 +00:00
Nikita Shulga
6d0d7b6e37 [CI][BE] Restore cuda memory allocator setting (#137383)
By adding a `finally:` clause at the end of the test.

Might fix https://github.com/pytorch/pytorch/issues/137098#issuecomment-2389172552
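
The general pattern being applied (a sketch; an env var stands in for whatever allocator setting the original test actually toggles):

```python
import os

def run_with_temporary_alloc_conf():
    saved = os.environ.get("PYTORCH_CUDA_ALLOC_CONF")
    try:
        os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
        print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])  # stand-in for the test body
    finally:
        # Restore the previous value even if the body raises, so later tests
        # in the same process see the original allocator settings.
        if saved is None:
            os.environ.pop("PYTORCH_CUDA_ALLOC_CONF", None)
        else:
            os.environ["PYTORCH_CUDA_ALLOC_CONF"] = saved
```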

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137383
Approved by: https://github.com/ngimel
2024-10-05 04:16:38 +00:00
Jeff Daily
c7b0d4b148 raw_alloc ignores PYTORCH_NO_CUDA_MEMORY_CACHING (#131114)
raw_alloc is used by cudnn, miopen, thrust, and tunableop.  Without this PR, the env var for disabling the caching allocator will only partially work.
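
For reference, a minimal way to exercise the flag (set before importing torch to be safe):

```python
import os

# Disable the caching allocator entirely; with this PR the raw_alloc path
# used by cuDNN/MIOpen/Thrust/TunableOp respects the flag as well.
os.environ["PYTORCH_NO_CUDA_MEMORY_CACHING"] = "1"

import torch  # noqa: E402

x = torch.randn(1024, device="cuda")   # allocated directly, not cached
del x                                   # returned to the driver immediately
```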

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131114
Approved by: https://github.com/eqy, https://github.com/houseroad, https://github.com/albanD

Co-authored-by: Nichols A. Romero <nick.romero@amd.com>
2024-10-04 15:36:29 +00:00
cyy
6327a71880 [Environment Variable][2/N] Use thread-safe setenv wrapper (#124485)
This follows #119449 to make setenv thread-safe.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124485
Approved by: https://github.com/eqy
2024-10-04 07:30:51 +00:00
PyTorch MergeBot
0d1701f310 Revert "raw_alloc ignores PYTORCH_NO_CUDA_MEMORY_CACHING (#131114)"
This reverts commit 7001907480.

Reverted https://github.com/pytorch/pytorch/pull/131114 on behalf of https://github.com/PaliC due to failing internal builds ([comment](https://github.com/pytorch/pytorch/pull/131114#issuecomment-2390615007))
2024-10-03 06:22:55 +00:00
Jeff Daily
7001907480 raw_alloc ignores PYTORCH_NO_CUDA_MEMORY_CACHING (#131114)
raw_alloc is used by cudnn, miopen, thrust, and tunableop.  Without this PR, the env var for disabling the caching allocator will only partially work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131114
Approved by: https://github.com/eqy, https://github.com/houseroad, https://github.com/albanD

Co-authored-by: Nichols A. Romero <nick.romero@amd.com>
2024-10-02 16:27:15 +00:00
Yu, Guangye
df5bbc09d1 Make device-specific event inherits from torch.Event (#134845)
# Motivation
This PR intends to make device-specific Event classes inherit from the generic torch.Event. The benefit is providing a generic abstract class `torch.Event` for different devices, like `torch.Stream`; this makes it easier for Dynamo to capture events from different devices, like torch.cuda.Event and torch.xpu.Event.
A follow-up PR will remove the now-unneeded base classes `_StreamBase` and `_EventBase` to avoid multiple inheritance.
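
What the new hierarchy buys in practice, as a sketch (it assumes a CUDA build where `torch.cuda.Event` now derives from `torch.Event`):

```python
import torch

ev = torch.cuda.Event(enable_timing=True)

# Device-specific events are now instances of the generic torch.Event,
# so device-agnostic code (and Dynamo) can type-check against one class.
assert isinstance(ev, torch.Event)
```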

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134845
Approved by: https://github.com/albanD, https://github.com/EikanWang
2024-10-01 06:28:41 +00:00
FFFrog
e14b58ffbd Using device-agnostic autocast api (#136613)
- using torch.autocast(device_type="cuda") instead of torch.cuda.amp.autocast()
- using torch.autocast(device_type="cpu") instead of torch.cpu.amp.autocast() (see the sketch below)
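
A sketch of the equivalent device-agnostic usage (the first block assumes a CUDA device is available):

```python
import torch

model = torch.nn.Linear(8, 8).cuda()
x = torch.randn(4, 8, device="cuda")

# Device-agnostic replacement for torch.cuda.amp.autocast():
with torch.autocast(device_type="cuda", dtype=torch.float16):
    y = model(x)

# The same API covers CPU autocast (formerly torch.cpu.amp.autocast()):
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    z = torch.nn.Linear(8, 8)(x.cpu())
```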

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136613
Approved by: https://github.com/shink, https://github.com/cyyever, https://github.com/kwen2501
2024-09-27 07:16:24 +00:00
Fuzzkatt
d1382aaf3d skip test_out_of_memory for jetson (#133270)
Skip test_out_of_memory in test/test_cuda.py on Jetson as OOM reporting in Jetson has issues due to partially missing NVML support. cc @eqy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133270
Approved by: https://github.com/eqy, https://github.com/albanD, https://github.com/seemethere
2024-09-27 02:36:48 +00:00
drisspg
d05645841e Update get_device_properties to take in optional device (#136683)
Aligns behavior with the rest of CUDA's device-info query methods.
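
After this change, both call forms below should work; the no-argument form defaulting to the current device is the behavior added here (sketch):

```python
import torch

# Explicit device, as before.
props = torch.cuda.get_device_properties(0)

# With this PR, the device argument is optional and defaults to the
# current device, matching torch.cuda.get_device_name() and friends.
props_current = torch.cuda.get_device_properties()

print(props.name, props_current.total_memory)
```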

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136683
Approved by: https://github.com/eqy
2024-09-26 15:07:31 +00:00
eqy
8225e7706e [CUDA][Expandable Segments] Account for non-gc'able memory in expandable segments tests (#136496)
Some other tests seem to hold onto memory that is not gc'able (e.g., cuBLAS workspaces), so these tests pass in isolation but fail when run alongside others, e.g. via `python test/test_cuda.py -k able`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136496
Approved by: https://github.com/ezyang
2024-09-25 01:14:45 +00:00
Yuxin Wu
663e760065 add unittest for OOM message (#129671)
Add unittest for the bug in #123984
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129671
Approved by: https://github.com/eqy
2024-09-23 04:48:01 +00:00
Aaron Orenstein
8c356ce3da Fix lint errors in fbcode (#135614)
Summary: Fixed a bunch of fbcode imports that happened to work but confused autodeps.  After this autodeps still suggests "improvements" to TARGETS (which breaks our builds) but at least it can find all the imports.

Test Plan:
```
fbpython fbcode/tools/build/buck/linters/lint_autoformat.py --linter=autodeps --default-exec-timeout=1800 -- fbcode/caffe2/TARGETS fbcode/caffe2/test/TARGETS
```
Before:
```
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/testing.py:229) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fbur$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export.py:87) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fburl$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_serdes.py:9) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fb$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_serdes.py:10) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fburl$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_retraceability.py:7) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https:$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_retraceability.py:6) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See ht$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export_nonstrict.py:7) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See http$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_export_nonstrict.py:6) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See $
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_export_training_ir_to_run_decomp.py:8) when processing rule "test_export". Please make sure it's listed in the srcs parameter of an$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export_training_ir_to_run_decomp.py:10) when processing rule "test_export". Please make sure it's listed in the srcs parameter of anoth$
ERROR while processing caffe2/test/TARGETS: Found "//python/typeshed_internal:typeshed_internal_library" owner for "cv2" but it is protected by visibility rules: [] (from caffe2/test/test_bundled_images.py:7) when processing rule "test_bundled_$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "caffe2.test.profiler_test_cpp_thread_lib" (from caffe2/test/profiler/test_cpp_thread.py:29) when processing rule "profiler_test_cpp_thread". Please make sure it's listed in t$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._utils_internal.get_file_path_2" (from caffe2/test/test_custom_ops.py:23) when processing rule "custom_ops". Please make sure it's listed in the srcs parameter of anoth$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._utils_internal.get_file_path_2" (from caffe2/test/test_public_bindings.py:13) when processing rule "public_bindings". Please make sure it's listed in the srcs paramete$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._C._profiler.symbolize_tracebacks" (from caffe2/test/test_cuda.py:3348) when processing rule "test_cuda". Please make sure it's listed in the srcs parameter of another $
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._C._profiler.gather_traceback" (from caffe2/test/test_cuda.py:3348) when processing rule "test_cuda". Please make sure it's listed in the srcs parameter of another rule$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for include <torch/csrc/autograd/profiler_kineto.h> (from caffe2/test/profiler/test_cpp_thread.cpp:2) when processing profiler_test_cpp_thread_lib.  Some things to try:
```

Differential Revision: D62049222

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135614
Approved by: https://github.com/oulgen, https://github.com/laithsakka
2024-09-13 02:04:34 +00:00
FFFrog
80a6d60829 Moving _run_autocast_outofplace to basic class named TestAutocast to reduce redundance (#134460)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134460
Approved by: https://github.com/EikanWang, https://github.com/ezyang
2024-09-04 10:48:58 +00:00
Natalia Gimelshein
c25b64a057 expose host_emptyCache to python, fix a bug in freeing cudaHostRegistered memory (#134919)

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134919
Approved by: https://github.com/eqy
2024-09-01 09:07:25 +00:00
zdevito
d91b49dbaa expandable_segments <-> other allocator options (#134338)
Previously setting  garbage_collection_threshold or max_split_size_mb along with expandable_segments:True could cause the allocator to hit assert failures when running nearly out of memory. This PR ensures garbage_collection and max_split freeing do not accidentally try to release expandable segments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134338
Approved by: https://github.com/ezyang
2024-08-29 18:43:59 +00:00
Syed Tousif Ahmed
4655eb3ee2 Uses MemPoolContext to route allocations from CUDACachingAllocator (#134685)
Re-open of https://github.com/pytorch/pytorch/pull/133599 that was mistakenly closed by issuing `ghstack land`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134685
Approved by: https://github.com/ezyang
2024-08-29 03:56:31 +00:00
Chien-Lin Chen
40de63be09 parameterized test_graph_optims and test_graph_scaling_fused_optimizers (#133749)
Fixes #123451

This is a rework of a reverted pull request, https://github.com/pytorch/pytorch/pull/125127.
The test failure is fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133749
Approved by: https://github.com/janeyx99
2024-08-28 16:34:06 +00:00
Mikayla Gawarecki
018e48c337 [Reland] Add wrappers for synchronous GPUDirect Storage APIs (#133489)
Reland #130633

USE_CUFILE turned off by default in this version
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133489
Approved by: https://github.com/albanD
2024-08-15 17:11:52 +00:00
Syed Tousif Ahmed
42cd397a0e Loads .pyd instead of .so in MemPool test for windows (#132749)
Fixes #132650

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132749
Approved by: https://github.com/albanD
2024-08-08 14:29:56 +00:00
PyTorch MergeBot
123d9ec5bf Revert "Loads .pyd instead of .so in MemPool test for windows (#132749)"
This reverts commit 37ab0f3385.

Reverted https://github.com/pytorch/pytorch/pull/132749 on behalf of https://github.com/syed-ahmed due to Seems like periodic is still failing: 7c79e89bc5 ([comment](https://github.com/pytorch/pytorch/pull/132749#issuecomment-2274041302))
2024-08-07 18:08:44 +00:00
Syed Tousif Ahmed
37ab0f3385 Loads .pyd instead of .so in MemPool test for windows (#132749)
Fixes #132650

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132749
Approved by: https://github.com/albanD
2024-08-07 09:58:52 +00:00
albanD
9a1ad3345f Fix periodic windows test (#132648)
This test fails to clean up folders on windows for the past week, see 27f61eba58 for example

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132648
Approved by: https://github.com/janeyx99, https://github.com/zou3519, https://github.com/malfet
2024-08-05 20:54:20 +00:00
Xuehai Pan
4226ed1585 [BE] Format uncategorized Python files with ruff format (#132576)
Remove patterns `**`, `test/**`, and `torch/**` in `tools/linter/adapters/pyfmt_linter.py` and run `lintrunner`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132576
Approved by: https://github.com/ezyang, https://github.com/Skylion007
ghstack dependencies: #132574
2024-08-04 17:13:31 +00:00
Oguz Ulgen
221350e3a4 Add None return type to init -- tests (#132352)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132352
Approved by: https://github.com/ezyang
ghstack dependencies: #132335, #132351
2024-08-01 15:44:51 +00:00
Syed Tousif Ahmed
7c89ec0f7c Implements torch.cuda.MemPool() API (#131152)
In this PR:
- Pool id creation logic is refactored and moved to a MemPool class. `graph_pool_handle()` API now uses `torch.cuda.MemPool()` to get a unique id for a pool. Existing tests should cover this change.
- MemPool holds a pointer to a CUDAAllocator as proposed in https://github.com/pytorch/pytorch/issues/124807#issuecomment-2077506997. Tests are added to show usage with CUDAPluggableAllocator.
- MemPoolContext API makes a mempool active. Tests are added to show usage of this API. This API will be used in CUDACachingAllocator to route allocations to a user provided allocator. See draft here: https://github.com/pytorch/pytorch/pull/125722/
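
Putting the pieces together, a minimal sketch; `graph_pool_handle()` is the path exercised by existing tests, and routing allocations through a user-provided allocator is covered by the follow-up MemPoolContext work:

```python
import torch

# graph_pool_handle() now obtains its unique id via torch.cuda.MemPool under
# the hood; a MemPool can also be constructed directly and, together with a
# CUDAPluggableAllocator, route allocations to a user-provided allocator.
pool = torch.cuda.MemPool()
handle = torch.cuda.graph_pool_handle()

x = torch.ones(4, device="cuda")
_ = x * 2  # warm up the kernel outside of capture

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g, pool=handle):
    y = x * 2
g.replay()
```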

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131152
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-08-01 01:29:30 +00:00
Aidyn-A
301ec32ae8 [EASY][TEST][CUDA] Fix typo in test_graph_make_graphed_callables_same_pool (#132059)
Per title.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132059
Approved by: https://github.com/Skylion007
2024-07-29 19:15:37 +00:00
PyTorch MergeBot
e191b83462 Revert "Add wrappers for synchronous GPUDirect Storage APIs (#130633)"
This reverts commit 709ddf7a9d.

Reverted https://github.com/pytorch/pytorch/pull/130633 on behalf of https://github.com/clee2000 due to still failing internally D60265673 ([comment](https://github.com/pytorch/pytorch/pull/130633#issuecomment-2253239607))
2024-07-26 18:08:20 +00:00
Mikayla Gawarecki
709ddf7a9d Add wrappers for synchronous GPUDirect Storage APIs (#130633)
Based in part on https://github.com/NVIDIA/apex/pull/1774

Differential Revision: [D60155434](https://our.internmc.facebook.com/intern/diff/D60155434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130633
Approved by: https://github.com/albanD
2024-07-25 22:23:38 +00:00
PyTorch MergeBot
e4b5645f83 Revert "Add wrappers for synchronous GPUDirect Storage APIs (#130633)"
This reverts commit 5b5e0698a5.

Reverted https://github.com/pytorch/pytorch/pull/130633 on behalf of https://github.com/clee2000 due to breaking a lot of jobs and build rules internally D60085885, possibly needs to update some bazel build? ([comment](https://github.com/pytorch/pytorch/pull/130633#issuecomment-2245806738))
2024-07-23 17:19:34 +00:00
Mikayla Gawarecki
5b5e0698a5 Add wrappers for synchronous GPUDirect Storage APIs (#130633)
Based in part on https://github.com/NVIDIA/apex/pull/1774

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130633
Approved by: https://github.com/albanD
2024-07-22 14:51:24 +00:00
Xuehai Pan
ba48cf6535 [BE][Easy][6/19] enforce style for empty lines in import segments in test/ (#129757)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129757
Approved by: https://github.com/ezyang
2024-07-17 06:42:37 +00:00
Bilal Khan
54a932b0ac Support for expandable segments with cuda graph trees (#128068)
This PR adds support for using expandable segments with private memory pools, which should unblock using them with CUDA graphs and CUDA graph trees. Currently, the allocator silently avoids expandable segments when allocating in a private pool because checkpoint saving/restoring does not mesh well with how we keep track of unmapped blocks.

The PR itself is pretty short, most of the logic for checkpointing and reapplying state for non-expandable segments transfers over without much work.

Expandable segments reserve a virtual address space of size equal to the amount of physical memory on the GPU. Every time we want to `malloc()` or `free()` memory in a memory pool with expandable segments turned on, we map/unmap pages of physical GPU memory under the hood to create a new block that we return to the caller. This is beneficial due to the fact that each memory pool functions as a single segment of memory with a contiguous block of memory addresses that can grow and shrink as needed, avoiding fragmentation from allocating multiple non-contiguous segments that may not be merged together.

The caching allocator handles this by creating an unmapped block for the entire reserved virtual address space at init, which is treated similarly to an unallocated block in a free pool. When callers call `malloc()`, it's split and mapped to create allocated blocks, and calling `free()` similarly caches and merges free blocks in a free pool to be used later. Expandable blocks are unmapped and returned to CUDA when they are cleaned up, or when we hit an OOM and the allocator attempts to remap cached free blocks. The code paths to map, free, and unmap blocks in expandable segments are similar to those for normal blocks and do all the same work of updating stats on memory usage, moving blocks between active and free pools, and returning memory to CUDA.

With CUDA Graph Trees and private memory pools, we need to take a checkpoint of the allocator state after each graph capture, and to reapply that state before capturing a new graph after a replay, so that the new capture sees the allocator state as it was after replaying the previously captured graph and can reuse empty blocks and allocate new ones.

As mentioned in a comment below, memory in a private pool is cached until the private pool is destroyed, and allocations can only grow from extra graph captures; freeing memory would result in invalid memory addresses and would break CUDA graphs.

One implementation detail to note for unmapped blocks with expandable segments is that unmapped blocks are tracked in a member variable `unmapped` of a `BlockPool`. `unmapped` is *not* part of the checkpointed state of the caching allocator and isn't restored when reapplying checkpoints, since we never free/unmap memory back to CUDA; it persists across graph captures/replays.

Checkpointing the current state of the memory allocator works as expected with expandable segments. Checkpointing grabs the first block of every segment in the active and free pools of the private pool and traverses the linked list of blocks in the segment to capture the state of every segment, which is then saved and kept for when it is needed to be reapplied. For expandable blocks, the last block in every segment will be an unallocated unmapped block containing the remaining amount of unmapped memory at graph capture time, and this too is saved in the checkpoint.

Reapplying the checkpoints works by freeing all allocated blocks and merging them into a single block per segment, then for each segment, we manually split and allocate all blocks from the checkpoint and then free the blocks marked as unallocated in the checkpoint state. For expandable segments, we need to make some modifications to not split unmapped blocks and avoid manually mapping then freeing unmapped blocks.
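
For context, expandable segments are opted into through the allocator config; a sketch of combining that setting with a CUDA graph capture, which is what this PR unblocks (capture details are simplified):

```python
import os

# Must be set before the CUDA caching allocator is initialized.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # noqa: E402

x = torch.zeros(1024, device="cuda")
_ = x + 1  # warm up outside of capture

g = torch.cuda.CUDAGraph()
# Graph capture allocates from a private memory pool; with this PR that pool
# can itself use expandable segments instead of silently falling back.
with torch.cuda.graph(g):
    y = x + 1
g.replay()
```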

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128068
Approved by: https://github.com/eqy, https://github.com/eellison
2024-07-15 23:23:23 +00:00
Tobias Ringwald
e5de25896f Fixed CUDA randint generation for large ranges. (#126066)
Fixes #125224

For large ranges, calls to CUDA `randint` use a different `unroll_factor` to generate random ints. This `unroll_factor` was not considered correctly in the calculation of the Philox offsets. Thus, some of the random states were reused, resulting in lower entropy (see #125224).

This also affects multiple other random functions, such as `torch.rand` and `torch.randn`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126066
Approved by: https://github.com/eqy, https://github.com/lezcano
2024-07-13 21:42:27 +00:00
eqy
60fc01d0ab [CUDA] Don't double-destroy CUDA graph when debug dump is used (#130401)
Repro from @eellison

Could have sworn we had another PR with this fix floating around somewhere but I couldn't find it...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130401
Approved by: https://github.com/Skylion007, https://github.com/eellison
2024-07-12 18:57:07 +00:00
PyTorch MergeBot
578388bed8 Revert "Support for expandable segments with cuda graph trees (#128068)"
This reverts commit fdc83610f2.

Reverted https://github.com/pytorch/pytorch/pull/128068 on behalf of https://github.com/janeyx99 due to Reverting for breaking ROCm tests on trunk, I think the tests need to be qualified with @onlyCUDA ([comment](https://github.com/pytorch/pytorch/pull/128068#issuecomment-2223672381))
2024-07-11 18:58:13 +00:00
Bilal Khan
fdc83610f2 Support for expandable segments with cuda graph trees (#128068)
This PR adds support for using expandable segments with private memory pools, which should unblock using them with CUDA graphs and CUDA graph trees. Currently, the allocator silently avoids expandable segments when allocating in a private pool because checkpoint saving/restoring does not mesh well with how we keep track of unmapped blocks.

The PR itself is pretty short, most of the logic for checkpointing and reapplying state for non-expandable segments transfers over without much work.

Expandable segments reserve a virtual address space of size equal to the amount of physical memory on the GPU. Every time we want to `malloc()` or `free()` memory in a memory pool with expandable segments turned on, we map/unmap pages of physical GPU memory under the hood to create a new block that we return to the caller. This is beneficial due to the fact that each memory pool functions as a single segment of memory with a contiguous block of memory addresses that can grow and shrink as needed, avoiding fragmentation from allocating multiple non-contiguous segments that may not be merged together.

The caching allocator handles this by creating an unmapped block for the entire reserved virtual address space at init, which is treated similarly to an unallocated block in a free pool. When callers call `malloc()`, it's split and mapped to create allocated blocks, and calling `free()` similarly caches and merges free blocks in a free pool to be used later. Expandable blocks are unmapped and returned to CUDA when they are cleaned up, or when we hit an OOM and the allocator attempts to remap cached free blocks. The code paths to map, free, and unmap blocks in expandable segments are similar to those for normal blocks and do all the same work of updating stats on memory usage, moving blocks between active and free pools, and returning memory to CUDA.

With CUDA Graph Trees and private memory pools, we need to take a checkpoint of the allocator state after each graph capture, and to reapply that state before capturing a new graph after a replay, so that the new capture sees the allocator state as it was after replaying the previously captured graph and can reuse empty blocks and allocate new ones.

As mentioned in a comment below, memory in a private pool is cached until the private pool is destroyed, and allocations can only grow from extra graph captures; freeing memory would result in invalid memory addresses and would break CUDA graphs.

One implementation detail to note for unmapped blocks with expandable segments is that unmapped blocks are tracked in a member variable `unmapped` of a `BlockPool`. `unmapped` is *not* part of the checkpointed state of the caching allocator and isn't restored when reapplying checkpoints, since we never free/unmap memory back to CUDA; it persists across graph captures/replays.

Checkpointing the current state of the memory allocator works as expected with expandable segments. Checkpointing grabs the first block of every segment in the active and free pools of the private pool and traverses the linked list of blocks in the segment to capture the state of every segment, which is then saved and kept for when it is needed to be reapplied. For expandable blocks, the last block in every segment will be an unallocated unmapped block containing the remaining amount of unmapped memory at graph capture time, and this too is saved in the checkpoint.

Reapplying the checkpoints works by freeing all allocated blocks and merging them into a single block per segment, then for each segment, we manually split and allocate all blocks from the checkpoint and then free the blocks marked as unallocated in the checkpoint state. For expandable segments, we need to make some modifications to not split unmapped blocks and avoid manually mapping then freeing unmapped blocks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128068
Approved by: https://github.com/zdevito, https://github.com/eqy
2024-07-11 05:33:09 +00:00
Jeff Willette
5c9d5272e4 fixes #124582 (#128483)
Added a check to make_graphed_callables for the existence of outputs that require grad.

Added a new test case and updated an existing test case to include parameterless modules.

Fixes #124582

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128483
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-07-02 08:45:59 +00:00
Jack Taylor
e1b426b345 [ROCm] CUDA_VISIBLE_DEVICES fallback option for device_count (#129650)
Updating `_parse_visible_devices` to allow use of CUDA_VISIBLE_DEVICES if HIP_VISIBLE_DEVICES is unset, to avoid any unnecessary code changes in workloads that already rely on CUDA_VISIBLE_DEVICES.
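
A sketch of the fallback order described here (illustrative only; the real parsing lives in `torch.cuda._parse_visible_devices`):

```python
import os

def visible_devices_spec() -> str | None:
    # On ROCm, HIP_VISIBLE_DEVICES wins if set; otherwise fall back to
    # CUDA_VISIBLE_DEVICES so existing CUDA-oriented launch scripts keep working.
    hip = os.environ.get("HIP_VISIBLE_DEVICES")
    if hip is not None:
        return hip
    return os.environ.get("CUDA_VISIBLE_DEVICES")

print(visible_devices_spec())
```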

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129650
Approved by: https://github.com/hongxiayang, https://github.com/malfet
2024-07-01 11:40:09 +00:00
Jeff Daily
169b4ca07e add uuid in cudaDeviceProperties (#125083)
Replaces #99967.

Fixes #99903.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125083
Approved by: https://github.com/pruthvistony, https://github.com/albanD, https://github.com/eqy, https://github.com/malfet
2024-06-27 23:53:13 +00:00
yousufmo
305ba62906 Add support to GradScaler for respecting an already set grad_scale value (#123429)
Fixes #123428

Co-authored-by: Yousuf Mohamed-Ahmed <youmed.tech@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123429
Approved by: https://github.com/ezyang
2024-06-27 22:40:54 +00:00
Dmitry Rogozhkin
321bdcb372 Fix device propagation for checkpointing (#128671)
Fixes: #128478

In the backward() implementation, the checkpointing code was querying the device type from the rng_state tensors saved during forward(). These are CPU-only tensors and carry no device information, so the CUDA device was assumed by default, which is incorrect if the user runs on some other device, for example XPU.

This patch saves the full device information during forward() and uses it in backward() to obtain the device type; previously, forward() saved only the device index.
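
The distinction in a nutshell, as a sketch of the idea rather than the actual checkpoint code:

```python
import torch

x = torch.randn(2, 2)          # on an XPU build this might live on "xpu:0"

# Before: only the device *index* was stashed, so backward() had to guess the
# device type and defaulted to CUDA.
saved_index = x.get_device() if x.is_cuda else 0
guessed = torch.device("cuda", saved_index)          # wrong on XPU (or CPU)

# After: the full torch.device is stashed, so the type round-trips correctly.
saved_device = x.device
restored = torch.device(saved_device.type, saved_device.index or 0)
```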

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128671
Approved by: https://github.com/guangyey, https://github.com/soulitzer
2024-06-27 17:14:13 +00:00
Fuzzkatt
4ca8eecca4 skip test_graph_capture_oom for jetson (#128661)
On Jetson IGX, `python test/test_cuda.py -k test_graph_capture_oom` fails with the following error:

```
RuntimeError: NVML_SUCCESS == r INTERNAL ASSERT FAILED at "/opt/pytorch/pytorch/c10/cuda/CUDACachingAllocator.cpp":841, please report a bug to PyTorch.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/lib/python3.10/unittest/case.py", line 59, in testPartExecutor
    yield
  File "/usr/lib/python3.10/unittest/case.py", line 591, in run
    self._callTestMethod(testMethod)
  File "/usr/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
    method()
  File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 2759, in wrapper
    method(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 2759, in wrapper
    method(*args, **kwargs)
  File "/opt/pytorch/pytorch/test/test_cuda.py", line 2255, in test_graph_capture_oom
    with self.assertRaisesRegex(RuntimeError, oom_regex):
  File "/usr/lib/python3.10/unittest/case.py", line 239, in __exit__
    self._raiseFailure('"{}" does not match "{}"'.format(
  File "/usr/lib/python3.10/unittest/case.py", line 163, in _raiseFailure
    raise self.test_case.failureException(msg)
AssertionError: "out of memory" does not match "NVML_SUCCESS == r INTERNAL ASSERT FAILED at "/opt/pytorch/pytorch/c10/cuda/CUDACachingAllocator.cpp":841, please report a bug to PyTorch. "

```

This is a known issue as nvml support on Jetson is limited, and the OOM reporting in CUDACachingAllocator.cpp requires nvml to be properly loaded, which fails on Jetson.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128661
Approved by: https://github.com/eqy, https://github.com/atalman
2024-06-25 08:25:11 +00:00
Xuehai Pan
a7c596870d [BE][Eazy] remove torch.torch.xxx usages (#127800)
NB: `torch` is exposed in `torch/__init__.py`, so chains like `torch.torch.torch.xxx` resolve.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127800
Approved by: https://github.com/peterbell10, https://github.com/kit1980, https://github.com/malfet
2024-06-05 21:53:49 +00:00
Xuehai Pan
67ef2683d9 [BE] wrap deprecated function/class with typing_extensions.deprecated (#127689)
Use `typing_extensions.deprecated` for deprecation annotations where possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` calls that are missing a category.

Note that only warnings whose messages contain `[Dd]eprecat(ed|ion)` are updated in this PR.
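
The two patterns this PR standardizes on, roughly (function names here are hypothetical):

```python
import warnings
from typing_extensions import deprecated

@deprecated("Use the torch.amp API instead.", category=FutureWarning)
def old_autocast_helper():
    ...

def another_old_helper():
    # Where the decorator doesn't fit, the warning gets an explicit category.
    warnings.warn(
        "another_old_helper is deprecated, use the new API instead.",
        category=FutureWarning,
        stacklevel=2,
    )
```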

Resolves #126888

- #126888

This PR is split from PR #126898.

- #126898

------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127689
Approved by: https://github.com/Skylion007
2024-06-02 12:30:43 +00:00
PyTorch MergeBot
033e733021 Revert "[BE] wrap deprecated function/class with typing_extensions.deprecated (#126898)"
This reverts commit 749a132fb0.

Reverted https://github.com/pytorch/pytorch/pull/126898 on behalf of https://github.com/fbgheith due to switching typing-extensions=4.3.0 to 4.9.0 causes internal failure ([comment](https://github.com/pytorch/pytorch/pull/126898#issuecomment-2142884456))
2024-05-31 19:47:24 +00:00
SandishKumarHN
da39461d61 [optim] Move test_grad_scaling_autocast_fused_optimizers to test_cuda.py (#126418)
This PR addresses the review comments on #124904.

- Move test_grad_scaling_autocast_fused_optimizers to test_cuda.py
- Combine _grad_scaling_autocast_fused_optimizers into test_grad_scaling_autocast_fused_optimizers
- Move to the OptimizerInfo framework.
- For the failing tests test_grad_scaling_autocast_fused_optimizers AdamW_cuda_float32 and Adam_cuda_float32:
    - Added a toleranceOverride in this PR
    - Created issue #127000

```
> (c2env) [sandish@devgpu166.ash6 ~/pytorch (refactoroptimizers)]$ python test/test_cuda.py -k test_grad_scaling_autocast_fused_optimizers -v
/home/sandish/pytorch/torch/backends/cudnn/__init__.py:106: UserWarning: PyTorch was compiled without cuDNN/MIOpen support. To use cuDNN/MIOpen, rebuild PyTorch making sure the library is visible to the build system.
  warnings.warn(
/home/sandish/pytorch/torch/backends/cudnn/__init__.py:106: UserWarning: PyTorch was compiled without cuDNN/MIOpen support. To use cuDNN/MIOpen, rebuild PyTorch making sure the library is visible to the build system.
  warnings.warn(
test_grad_scaling_autocast_fused_optimizers_Adagrad_cpu_float32 (__main__.TestCudaOptimsCPU) ... {'fused': True}
{'fused': True}
{'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'lr': 0.1, 'fused': True}
{'lr': 0.1, 'fused': True}
{'initial_accumulator_value': 0.1, 'weight_decay': 0.1, 'fused': True}
{'initial_accumulator_value': 0.1, 'weight_decay': 0.1, 'fused': True}
{'lr': 0.1, 'lr_decay': 0.5, 'weight_decay': 0.1, 'fused': True}
{'lr': 0.1, 'lr_decay': 0.5, 'weight_decay': 0.1, 'fused': True}
{'lr': tensor(0.0010), 'fused': True}
{'lr': tensor(0.0010), 'fused': True}
ok
test_grad_scaling_autocast_fused_optimizers_AdamW_cpu_float32 (__main__.TestCudaOptimsCPU) ... {'fused': True}
{'fused': True}
{'lr': 0.01, 'fused': True}
{'lr': 0.01, 'fused': True}
{'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'fused': True}
ok
test_grad_scaling_autocast_fused_optimizers_Adam_cpu_float32 (__main__.TestCudaOptimsCPU) ... {'fused': True}
{'fused': True}
{'lr': 0.01, 'fused': True}
{'lr': 0.01, 'fused': True}
{'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'fused': True}
ok
test_grad_scaling_autocast_fused_optimizers_SGD_cpu_float32 (__main__.TestCudaOptimsCPU) ... {'fused': True}
{'fused': True}
{'lr': 0.01, 'fused': True}
{'lr': 0.01, 'fused': True}
{'lr': tensor(0.0010), 'fused': True}
{'lr': tensor(0.0010), 'fused': True}
{'momentum': 0.9, 'fused': True}
{'momentum': 0.9, 'fused': True}
{'momentum': 0.9, 'dampening': 0.5, 'fused': True}
{'momentum': 0.9, 'dampening': 0.5, 'fused': True}
{'momentum': 0.9, 'weight_decay': 0.1, 'fused': True}
{'momentum': 0.9, 'weight_decay': 0.1, 'fused': True}
{'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.1, 'fused': True}
{'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
ok
test_grad_scaling_autocast_fused_optimizers_Adagrad_cuda_float32 (__main__.TestCudaOptimsCUDA) ... skipped 'cuda is not supported for fused on Adagrad'
test_grad_scaling_autocast_fused_optimizers_AdamW_cuda_float32 (__main__.TestCudaOptimsCUDA) ... {'fused': True}
{'fused': True}
{'lr': 0.01, 'fused': True}
{'lr': 0.01, 'fused': True}
{'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'fused': True}
{'capturable': True, 'fused': True}
{'capturable': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'fused': True}
{'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'fused': True}
{'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'fused': True}
ok
test_grad_scaling_autocast_fused_optimizers_Adam_cuda_float32 (__main__.TestCudaOptimsCUDA) ... {'fused': True}
{'fused': True}
{'lr': 0.01, 'fused': True}
{'lr': 0.01, 'fused': True}
{'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'fused': True}
{'capturable': True, 'fused': True}
{'capturable': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'fused': True}
{'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'fused': True}
{'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'fused': True}
ok
test_grad_scaling_autocast_fused_optimizers_SGD_cuda_float32 (__main__.TestCudaOptimsCUDA) ... {'fused': True}
{'fused': True}
{'lr': 0.01, 'fused': True}
{'lr': 0.01, 'fused': True}
{'lr': tensor(0.0010), 'fused': True}
{'lr': tensor(0.0010), 'fused': True}
{'momentum': 0.9, 'fused': True}
{'momentum': 0.9, 'fused': True}
{'momentum': 0.9, 'dampening': 0.5, 'fused': True}
{'momentum': 0.9, 'dampening': 0.5, 'fused': True}
{'momentum': 0.9, 'weight_decay': 0.1, 'fused': True}
{'momentum': 0.9, 'weight_decay': 0.1, 'fused': True}
{'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.1, 'fused': True}
{'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
ok

----------------------------------------------------------------------
Ran 8 tests in 16.117s

OK (skipped=1)

> lintrunner test/test_cuda.py
----------------------------------------------------------------------
ok No lint issues.

> lintrunner torch/testing/_internal/common_optimizers.py
----------------------------------------------------------------------
ok No lint issues.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126418
Approved by: https://github.com/janeyx99
2024-05-30 01:47:41 +00:00
Xuehai Pan
749a132fb0 [BE] wrap deprecated function/class with typing_extensions.deprecated (#126898)
Use `typing_extensions.deprecated` for deprecation annotations where possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` calls that are missing a category.

Note that only warnings whose messages contain `[Dd]eprecat(ed|ion)` are updated in this PR.

UPDATE: Use `FutureWarning` instead of `DeprecationWarning`.

Resolves #126888

- #126888

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126898
Approved by: https://github.com/albanD
2024-05-29 12:09:27 +00:00
Yu, Guangye
e7a42702f9 generalize custom_fwd&custom_bwd to be device-agnostic (#126531)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126531
Approved by: https://github.com/jgong5, https://github.com/gujinghui, https://github.com/albanD, https://github.com/EikanWang
ghstack dependencies: #126527
2024-05-25 06:48:16 +00:00
Yu, Guangye
c09205a057 Deprecate device-specific GradScaler autocast API (#126527)
# Motivation

## for `torch.amp.GradScaler`,
- `torch.cpu.amp.GradScaler(args...)` is completely equivalent to `torch.amp.GradScaler("cpu", args...)`.
- `torch.cuda.amp.GradScaler(args...)` is completely equivalent to `torch.amp.GradScaler("cuda", args...)`.

So, we intend to deprecate them and **strongly recommend** that developers use `torch.amp.GradScaler`.

## for `custom_fwd` and `custom_bwd`,
these make a custom autograd function run correctly whether or not it is called inside an autocast-enabled region, and the mechanism can be shared by other backends, like CPU and XPU.
So we generalize them to be device-agnostic, put them into `torch/amp/autocast_mode.py`, and re-expose them as `torch.amp.custom_fwd` and `torch.amp.custom_bwd`. Meanwhile, we deprecate `torch.cuda.amp.custom_fwd` and `torch.cuda.amp.custom_bwd`.
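
A hedged sketch of the device-agnostic form, assuming the re-exposed `torch.amp.custom_fwd`/`custom_bwd` take a `device_type` argument as described:

```python
import torch

class ScaledMM(torch.autograd.Function):
    @staticmethod
    @torch.amp.custom_fwd(device_type="cuda")
    def forward(ctx, a, b):
        ctx.save_for_backward(a, b)
        return a @ b

    @staticmethod
    @torch.amp.custom_bwd(device_type="cuda")
    def backward(ctx, grad):
        a, b = ctx.saved_tensors
        return grad @ b.t(), a.t() @ grad
```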

# Additional Context
Add a UT to cover the deprecation warning.
No need for more UTs to cover the functionality of `torch.amp.custom_f/bwd`; the existing UTs that previously covered `torch.cuda.amp.custom_f/bwd` cover them.
To facilitate review, we split these code changes into two PRs. The first PR covers `torch.amp.GradScaler`; the follow-up covers `custom_fwd` and `custom_bwd`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126527
Approved by: https://github.com/jgong5, https://github.com/gujinghui, https://github.com/janeyx99, https://github.com/EikanWang
2024-05-25 06:41:34 +00:00
Catherine Lee
ef86a27dba Mark test_set_per_process_memory_fraction serial (#127087)
Occasionally OOMs.

It should probably get the entire GPU to itself anyway.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127087
Approved by: https://github.com/huydhn
2024-05-25 06:26:47 +00:00
Jack Taylor
d30cdc4321 [ROCm] amdsmi library integration (#119182)
Adds monitoring support for ROCm using amdsmi in place of pynvml.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119182
Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/xw285cornell
2024-05-21 01:59:26 +00:00
PyTorch MergeBot
cb69c51b6f Revert " Updated test_graph_optims and test_graph_scaling_fused_optimizers to use new OptimizerInfo infrastructure (#125127)"
This reverts commit cf35a591b9.

Reverted https://github.com/pytorch/pytorch/pull/125127 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/125127#issuecomment-2120337584))
2024-05-20 12:14:22 +00:00
jayanth domalapalli
cf35a591b9 Updated test_graph_optims and test_graph_scaling_fused_optimizers to use new OptimizerInfo infrastructure (#125127)
This PR is meant to address issue #123451, more specifically, the ```test_graph_optims``` and ```test_graph_scaling_fused_optimizers``` functions in ```test_cuda.py``` have been updated so that they now use the new OptimizerInfo infrastructure.

Lintrunner passed:
```
$ lintrunner test/test_cuda.py
ok No lint issues.
```
Tests passed:
```
>python test_cuda.py -k test_graph_optims
Ran 19 tests in 7.463s

OK (skipped=9)

>python test_cuda.py -k test_graph_scaling_fused_optimizers
Ran 6 tests in 2.800s

OK (skipped=3)
```
Both the functions have been moved to the newly created TestCase class ```TestCudaOptims```. The test is mostly the same except the ```@optims``` decorator is used at the top of the function to implicitly call the function using each of the optimizers mentioned in the decorator instead of explicitly using a for loop to iterate through each of the optimizers.
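
The shape of the migrated test, roughly (illustrative; it leans on internal testing helpers, so exact names and filters may differ):

```python
import torch
from torch.testing._internal.common_device_type import instantiate_device_type_tests
from torch.testing._internal.common_optimizers import optim_db, optims
from torch.testing._internal.common_utils import TestCase, run_tests

class TestCudaOptims(TestCase):
    # One test is generated per (optimizer, dtype) pair listed here, replacing
    # the explicit for-loop over optimizer classes in the old test.
    @optims(optim_db, dtypes=[torch.float32])
    def test_graph_optims(self, device, dtype, optim_info):
        ...  # body builds optimizer inputs from optim_info and runs the capture

instantiate_device_type_tests(TestCudaOptims, globals(), only_for="cuda")

if __name__ == "__main__":
    run_tests()
```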

I was unable to use the ```_get_optim_inputs_including_global_cliquey_kwargs``` to get all kwargs for each of the optimizers since some of the kwargs that are used in the original ```test_graph_optims``` function are not being returned by the new OptimizerInfo infrastructure, more specifically, for the ```torch.optim.rmsprop.RMSprop``` optimizer, the following kwargs are not returned whenever ```_get_optim_inputs_including_global_cliquey_kwargs``` is called:
```
{'foreach': False, 'maximize': True, 'weight_decay': 0}
{ 'foreach': True, 'maximize': True, 'weight_decay': 0}
```
I ran into the same issue for ```test_graph_scaling_fused_optimizers```, for the ```torch.optim.adamw.AdamW``` optimizer, whenever ```optim_info.optim_inputs_func(device=device)``` was called, the following kwarg was not returned:
```
{'amsgrad': True}
```

Due to this issue, I resorted to using a dictionary to store the kwargs for each of the optimizers, I am aware that this is less than ideal. I was wondering whether I should use the OptimizerInfo infrastructure to get all the kwargs regardless of the fact that it lacks some kwargs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125127
Approved by: https://github.com/janeyx99
2024-05-20 06:20:45 +00:00
Yu, Guangye
58378f1224 [Doc] Add deprecated autocast comments for doc (#126062)
# Motivation
We generalized the device-agnostic API `torch.amp.autocast` in [#125103](https://github.com/pytorch/pytorch/pull/125103). After that,
- `torch.cpu.amp.autocast(args...)` is completely equivalent to `torch.amp.autocast('cpu', args...)`, and
- `torch.cuda.amp.autocast(args...)` is completely equivalent to `torch.amp.autocast('cuda', args...)`

in both eager mode and JIT mode.
Based on this, we would like to deprecate `torch.cpu.amp.autocast` and `torch.cuda.amp.autocast` and **strongly recommend** developers use the device-agnostic `torch.amp.autocast`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126062
Approved by: https://github.com/eqy, https://github.com/albanD
2024-05-16 05:26:43 +00:00
haozhe.zhu
f9d107af66 [optim] add fused_adagrad support for CPU device (#124905)
Add fused_adagrad kernel support for the CPU device.

## Bench result:
32 core/sockets ICX
Test Scripts:
https://gist.github.com/zhuhaozhe/79e842e0a6e25d6d7fa1e4598807272c
https://gist.github.com/zhuhaozhe/b4c6998a509dcea1796dd05b3005c969
```
Tensor Size: 262144, Num Tensor 4, Num Threads: 1
_single_tensor_adagrad time: 0.2500 seconds
_fused_adagrad time: 0.0933 seconds
Tensor Size: 4194304, Num Tensor 32, Num Threads: 32
_single_tensor_adagrad time: 2.8819 seconds
_fused_adagrad time: 1.7591 seconds
```
## Test Plan:
```
python test_optim.py -k test_fused_matches_forloop
python test_optim.py -k test_fused_large_tensor
python test_optim.py -k test_can_load_older_state_dict
python test_optim.py -k test_grad_scaling_autocast_fused_optimizers
python test_torch.py -k test_grad_scaling_autocast_fused
python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step
```
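
Usage-wise, the new path is opted into via the existing `fused` flag; a sketch assuming Adagrad accepts `fused=True` on CPU after this change:

```python
import torch

model = torch.nn.Linear(64, 64)
opt = torch.optim.Adagrad(model.parameters(), lr=0.1, fused=True)  # CPU fused path

x = torch.randn(8, 64)
loss = model(x).sum()
loss.backward()
opt.step()
opt.zero_grad()
```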

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124905
Approved by: https://github.com/jgong5, https://github.com/janeyx99
2024-05-16 01:11:51 +00:00
PyTorch MergeBot
bd3cbdba2f Revert "[optim] add fused_adagrad support for CPU device (#124905)"
This reverts commit 1c3fe84033.

Reverted https://github.com/pytorch/pytorch/pull/124905 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing distributed multigpu test in trunk 1c3fe84033 ([comment](https://github.com/pytorch/pytorch/pull/124905#issuecomment-2108777063))
2024-05-13 20:53:22 +00:00
haozhe.zhu
1c3fe84033 [optim] add fused_adagrad support for CPU device (#124905)
Add fused_adagrad kernel support for the CPU device.

## Bench result:
32 core/sockets ICX
Test Scripts:
https://gist.github.com/zhuhaozhe/79e842e0a6e25d6d7fa1e4598807272c
https://gist.github.com/zhuhaozhe/b4c6998a509dcea1796dd05b3005c969
```
Tensor Size: 262144, Num Tensor 4, Num Threads: 1
_single_tensor_adagrad time: 0.2500 seconds
_fused_adagrad time: 0.0933 seconds
Tensor Size: 4194304, Num Tensor 32, Num Threads: 32
_single_tensor_adagrad time: 2.8819 seconds
_fused_adagrad time: 1.7591 seconds
```
## Test Plan:
```
python test_optim.py -k test_fused_matches_forloop
python test_optim.py -k test_fused_large_tensor
python test_optim.py -k test_can_load_older_state_dict
python test_optim.py -k test_grad_scaling_autocast_fused_optimizers
python test_torch.py -k test_grad_scaling_autocast_fused
python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step
```

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124905
Approved by: https://github.com/jgong5, https://github.com/janeyx99
2024-05-13 01:16:20 +00:00
Yu, Guangye
31372fa842 Support generic stream/event on CUDA/HIP backend (#125757)
# Motivation
Following [#123611](https://github.com/pytorch/pytorch/pull/123611), we support generic stream/event on the CUDA backend.

# Additional Context
New methods/attributes on `torch.Event` for CUDA (see the sketch below):
- torch.Event.event_id
- torch.Event.elapsed_time
- torch.Event.synchronize

New methods on `c10::Event` for the CUDA backend:
- c10.Event.event_id
- c10.Event.elapsed_time
- c10.Event.synchronize
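
A sketch exercising the attributes listed above (assumes a CUDA build):

```python
import torch

start = torch.Event(device=torch.device("cuda"), enable_timing=True)
end = torch.Event(device=torch.device("cuda"), enable_timing=True)

start.record()
torch.ones(1 << 20, device="cuda").sum()
end.record()

end.synchronize()                 # new: blocks until the event completes
print(start.elapsed_time(end))    # new: milliseconds between the two events
print(end.event_id)               # new: backend event id
```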

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125757
Approved by: https://github.com/albanD, https://github.com/jgong5, https://github.com/EikanWang
2024-05-10 13:34:09 +00:00
PyTorch MergeBot
0d4fdb0bb7 Revert "[ROCm] amdsmi library integration (#119182)"
This reverts commit 85447c41e3.

Reverted https://github.com/pytorch/pytorch/pull/119182 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the ROCm failed test is legit 85447c41e3 ([comment](https://github.com/pytorch/pytorch/pull/119182#issuecomment-2103433197))
2024-05-09 21:18:21 +00:00
PyTorch MergeBot
6fd745255e Revert "add uuid in cudaDeviceProperties (#125083)"
This reverts commit 3f36145db2.

Reverted https://github.com/pytorch/pytorch/pull/125083 on behalf of https://github.com/izaitsevfb due to Fails internal builds with: no member named 'uuid' in 'hipDeviceProp_t' ([comment](https://github.com/pytorch/pytorch/pull/125083#issuecomment-2103315320))
2024-05-09 19:52:45 +00:00
Jack Taylor
85447c41e3 [ROCm] amdsmi library integration (#119182)
Adds monitoring support for ROCm using amdsmi in place of pynvml.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119182
Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/xw285cornell
2024-05-09 18:21:38 +00:00
Jeff Daily
3f36145db2 add uuid in cudaDeviceProperties (#125083)
Replaces #99967.

Fixes #99903.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125083
Approved by: https://github.com/pruthvistony, https://github.com/albanD, https://github.com/eqy
2024-05-08 19:15:55 +00:00
PyTorch MergeBot
5fd0b6e5f7 Revert "add uuid in cudaDeviceProperties (#125083)"
This reverts commit f35fe4eaf1.

Reverted https://github.com/pytorch/pytorch/pull/125083 on behalf of https://github.com/clee2000 due to test_uuid is flaky.  ex https://github.com/pytorch/pytorch/actions/runs/8988855916/job/24692369523 https://hud.pytorch.org/flakytest?name=test_uuid&suite=TestCuda&file=%25&limit=300 ([comment](https://github.com/pytorch/pytorch/pull/125083#issuecomment-2099029993))
2024-05-07 18:16:27 +00:00
Jeff Daily
f35fe4eaf1 add uuid in cudaDeviceProperties (#125083)
Replaces #99967.

Fixes #99903.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125083
Approved by: https://github.com/pruthvistony, https://github.com/albanD, https://github.com/eqy
2024-05-07 01:26:01 +00:00
haozhe.zhu
489b4586e9 [optim]fix ut and sgd kernel (#124904)
- Original `test_grad_scaling_autocast_fused_optimizers` does not work since there is no "fused" in `optim_inputs`.
- We should use different `grad_scaler` instances; they should not share one `scale`. No issue is exposed here because the default `_growth_interval` is 2000, so the scale never grows, and no inf is found, so it is never reduced. The corresponding test in `test_cuda.py` has the same issue.
- I set a manual seed for reproducibility in case of any numerical failure.
- I use a tensor tracker here because this UT failed in the dynamo case: the generated C++ code is not exactly the same for the fused and non-fused kernels.
- I make it check both `cuda` and `cpu`.
- I found an SGD numerical issue with `clang` and fixed it by using `fmadd` instead of `add/mul` in the fused SGD vec kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124904
Approved by: https://github.com/jgong5, https://github.com/janeyx99
2024-05-03 09:13:24 +00:00
Yuanhao Ji
d5182bb75b Enable UFMT on test/test_cuda*.py (#124352)
Part of: #123062

Ran lintrunner on:

- test/test_cuda.py
- test/test_cuda_expandable_segments.py
- test/test_cuda_multigpu.py
- test/test_cuda_nvml_based_avail.py
- test/test_cuda_primary_ctx.py
- test/test_cuda_sanitizer.py
- test/test_cuda_trace.py

Detail:

```bash
$ lintrunner -a --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124352
Approved by: https://github.com/ezyang
2024-04-25 18:31:08 +00:00
PyTorch MergeBot
c0fd7894cc Revert "Fast standalone symbolize for unwinding (#123966)"
This reverts commit 772ae6da1e.

Reverted https://github.com/pytorch/pytorch/pull/123966 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, check D56522678 ([comment](https://github.com/pytorch/pytorch/pull/123966#issuecomment-2076821043))
2024-04-25 10:04:48 +00:00
Tiger Huo
94af62b000 Updated test_graph_grad_scaling to use new OptimizerInfo infrastructure (#123581)
This PR targets the issue mentioned in #123451 and solves the specific task of updating `test_graph_grad_scaling` in `test/test_cuda.py` to use the new OptimizerInfo infrastructure.

`test_graph_grad_scaling` is moved to a new `TestCase` class called `TestCudaOptims` in order to use `instantiate_device_type_tests`. The test content remained the same. `@onlyCUDA` is applied to the new test; the original use of the wrapper function is also changed to a `@parametrize` decorator for better style.

If we think that this migration is successful, we can delete the original test item under `TestCuda`. Currently it is left untouched to avoid any unexpected issues.

Local linter passed.
```
$ lintrunner test/test_cuda.py
ok No lint issues.
```

Local tests passed.
```
> python .\test\test_cuda.py -k test_graph_grad_scaling
Ran 7 tests in 0.458s
OK (skipped = 3)
```
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123581
Approved by: https://github.com/janeyx99
2024-04-25 06:29:20 +00:00
Catherine Lee
4f29103749 [ez][CI] Move test_cuda off CI_SERIAL_LIST (#124649)
Tag test cases with large tensors as serial; also tag a few more that failed on a previous iteration of this PR.

Move test_cuda and test_cuda_expandable_segments off the serial list
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124649
Approved by: https://github.com/ZainRizvi
2024-04-24 22:04:23 +00:00
zdevito
772ae6da1e Fast standalone symbolize for unwinding (#123966)
We've had issues using addr2line. Certain versions of CentOS ship a version
with a performance regression that makes it very slow, and even normally it is
not that fast, taking several seconds even when parallelized
for a typical memory trace dump.

Folly Symbolize or LLVMSymbolize are fast, but using them would require PyTorch to take a dependency on those libraries, and given the number of environments we run in, we end up hitting cases where we fall back to slow addr2line behavior.

This adds a standalone symbolizer to PyTorch similar to the unwinder which has
no external dependencies and is ~20x faster than addr2line for unwinding PyTorch frames.

I've tested this on some memory profiling runs using all combinations of {gcc, clang} x {dwarf4, dwarf5} and it seems to do a good job at getting line numbers and function names right. It is also careful to route all reads of library data through the `CheckedLexer` object, which ensures it is not reading out of bounds of the section. Errors are routed through UnwindError so that those exceptions get caught and we produce a ?? frame rather than crash. I also added a fuzz test which gives all our symbolizer options random addresses in the process to make sure they do not crash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123966
Approved by: https://github.com/ezyang
2024-04-23 15:27:18 +00:00
Aaron Gokaslan
5a1216bb2e [BE]: Update ruff to 0.4.1 (#124549)
Update ruff to 0.4.1 .
This version fixes a lot of false negatives/false positives, is 20-40% faster, and has various other bug fixes.

Below is a before and after table showing the execution time of ruff lint and ruff format in milliseconds courtesy of https://astral.sh/blog/ruff-v0.4.0

| Repository                                         | Linter (v0.3) | Linter (v0.4) | Formatter (v0.3) | Formatter (v0.4) |
|----------------------------------------------------|---------------|---------------|------------------|------------------|
| [pytorch/pytorch](https://github.com/pytorch/pytorch) | 328.7         | 251.8         | 351.1            | 274.9            |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124549
Approved by: https://github.com/ezyang
2024-04-21 14:06:23 +00:00
Michael Lazos
16771747c2 Add tensor step and capturable support to rprop (#122261)
Towards fixing https://github.com/pytorch/pytorch/issues/115679
Fixes Rprop step update while compiling

Also adds capturable support + testing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122261
Approved by: https://github.com/janeyx99
2024-03-28 23:31:18 +00:00
Edward Z. Yang
0284bca99b Don't cache device_count if we haven't initialized CUDA yet (#122815)
Before CUDA is initialized, the device count can change if CUDA_VISIBLE_DEVICES is modified.
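
A hedged illustration of the resulting behavior (hypothetical setup; the key point is that the environment variable is changed before any CUDA initialization):

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # takes effect because CUDA has not been initialized yet

import torch
print(torch.cuda.device_count())  # reports 1 even on a multi-GPU box: the count is no longer cached eagerly
```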

Fixes https://github.com/pytorch/pytorch/issues/122085
Fixes https://github.com/pytorch/pytorch/issues/38616
Fixes https://github.com/pytorch/pytorch/issues/110000
Fixes https://github.com/pytorch/pytorch/issues/110971
Fixes https://github.com/pytorch/pytorch/issues/95073

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122815
Approved by: https://github.com/albanD
2024-03-28 13:23:45 +00:00
Michael Lazos
caa57e4fcd Add tensor step and capturable support to rmsprop (#122264)
Towards fixing https://github.com/pytorch/pytorch/issues/115679
Fixes RMSprop step update while compiling

Adds capturable support to RMSprop

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122264
Approved by: https://github.com/janeyx99
2024-03-28 03:39:28 +00:00
Frank Lin
249e65b92d Graph-Safe RNG State Exchange for Tensor Parallelism (#114068)
See #113541

The PR allows for registering and controlling multiple RNG states using indices, ensuring cudagraph-safe operations, and includes both C++ and Python API changes to support this functionality.

cc  @eellison @anijain2305 @jansel @ezyang @ptrblck @csarofeen @mcarilli
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114068
Approved by: https://github.com/ezyang, https://github.com/eqy, https://github.com/xuzhao9
2024-03-27 01:14:38 +00:00
PyTorch MergeBot
4dc09d6aa4 Revert "Graph-Safe RNG State Exchange for Tensor Parallelism (#114068)"
This reverts commit e9dcda5cba.

Reverted https://github.com/pytorch/pytorch/pull/114068 on behalf of https://github.com/ezyang due to memory leak in another ci ([comment](https://github.com/pytorch/pytorch/pull/114068#issuecomment-2018044527))
2024-03-25 13:49:04 +00:00
Michael Lazos
365e89a591 Add tensor step to adadelta (#122252)
Towards fixing https://github.com/pytorch/pytorch/issues/115679
Fixes Adadelta step update while compiling

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122252
Approved by: https://github.com/janeyx99
2024-03-21 07:28:47 +00:00
Frank Lin
e9dcda5cba Graph-Safe RNG State Exchange for Tensor Parallelism (#114068)
See #113541

The PR allows for registering and controlling multiple RNG states using indices, ensuring cudagraph-safe operations, and includes both C++ and Python API changes to support this functionality.

cc  @eellison @anijain2305 @jansel @ezyang @ptrblck @csarofeen @mcarilli
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114068
Approved by: https://github.com/ezyang
2024-03-21 01:57:08 +00:00
Andres Lugo-Reyes
e01b07e1e8 [ROCm] Autocast RNN Support (#121539)
Fixes #116361

Implements an autocast wrapper for MIOpen RNNs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121539
Approved by: https://github.com/albanD, https://github.com/jeffdaily
2024-03-11 21:14:43 +00:00
Natalia Gimelshein
89add71168 fix synchronization behavior for copies with type change (#121341)
Fixes #121320

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121341
Approved by: https://github.com/albanD
2024-03-11 17:09:45 +00:00
Aidyn-A
ca9678405a [CUDA graphs] Pool argument for make_graphed_callables (#121475)
It is just a nice feature to have for situations when users want multiple graph captures and/or graphed callables to share the same memory pool.
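
A minimal sketch of the new `pool` argument (assuming a CUDA build; the modules and shapes are placeholders, not the PR's test code):

```python
import torch

model_a = torch.nn.Linear(64, 64).cuda()
model_b = torch.nn.Linear(64, 64).cuda()
x = torch.randn(8, 64, device="cuda", requires_grad=True)

pool = torch.cuda.graph_pool_handle()  # one memory pool shared by both graphed callables
graphed_a = torch.cuda.make_graphed_callables(model_a, (x,), pool=pool)
graphed_b = torch.cuda.make_graphed_callables(model_b, (x,), pool=pool)
```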

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121475
Approved by: https://github.com/eellison, https://github.com/eqy
2024-03-09 00:15:38 +00:00
Jane Xu
9d6c5be781 Add ASGD capturable API for forloop (#121264)
@tfsingh I got to it first--wanted to land this stack and close the gap ASAP.

This PR also fixes a discrepancy between `_init_group` and `__set_state__` because we have the constants live on params' device always.

There are some next steps though:
- ASGD can be made faster by making etas, mus, steps be on CPU when NOT capturable. (I had mistakenly thought foreachifying was faster and so we landed https://github.com/pytorch/pytorch/pull/107857, but it is slower). No one has complained yet though.  ¯\_(ツ)_/¯

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121264
Approved by: https://github.com/albanD
ghstack dependencies: #121260
2024-03-08 00:00:30 +00:00
Jane Xu
24821fec26 Add RAdam capturable API for forloop (#121260)
Implementation thanks to @MarouaneMaatouk in https://github.com/pytorch/pytorch/pull/118697, though I've since cleaned it up a lot to save perf on the rect < 5 eager case. It also just looks better now :) Added tests and the cudagraph health check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121260
Approved by: https://github.com/mlazos
2024-03-08 00:00:30 +00:00
Jane Xu
53bdae736d Add capturable single tensor Adamax (#121183)
Finishes the work started in https://github.com/pytorch/pytorch/pull/118697. Thanks @MarouaneMaatouk for the attempt, but due to inactivity I have opened this PR for Adamax. Note that the new capturable implementation is much simpler and I've modified the foreach capturable impl--it now calls fewer kernels and is more easily comparable to forloop.
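
For reference, a minimal sketch of how the capturable path might be exercised inside a CUDA graph (warmup details simplified; this is an illustration, not the PR's test code):

```python
import torch

params = [torch.randn(32, 32, device="cuda", requires_grad=True)]
opt = torch.optim.Adamax(params, lr=1e-2, capturable=True)  # state stays on the GPU

for p in params:
    p.grad = torch.randn_like(p)
opt.step()  # one eager step to initialize optimizer state before capture

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    opt.step()
g.replay()  # replays the captured optimizer step
```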

Next steps:
* This PR discovered two bugs: #121178 and #121238.
* Move the now hefty graph optim tests in test_cuda to use OptimInfo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121183
Approved by: https://github.com/albanD
2024-03-07 17:57:02 +00:00
Aaron Enye Shi
aa36821615 [Memory Snapshot] Stop clearing history when changing context (#120436)
Summary:
This change will avoid clearing the memory event history when changing the context from `record_memory_history(context=None)` to `record_memory_history(context="python")`.

Now it will continue recording memory events with changing context on the fly. Only `record_memory_history(enabled=None)` will clear the history.
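
A sketch of the new behavior, mirroring the test plan below (`_record_memory_history` is a private API, so the exact signature may differ across versions):

```python
import torch
from torch.cuda.memory import _record_memory_history, _dump_snapshot

_record_memory_history(context=None, stacks="python")   # record events without call stacks
a = torch.randn(1024, 1024, device="cuda")

_record_memory_history(context="all", stacks="python")  # switch context on the fly; earlier history is kept
b = torch.randn(1024, 1024, device="cuda")

_dump_snapshot("snapshot.pickle")  # contains events from both phases
```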

Test Plan:
# Ran on the following local Resnet50 example:

- At iteration=0, record_memory_history(context=None, stacks="python")
- At iteration=3, record_memory_history(context="all", stacks="python")
- After iteration=4, export_memory_snapshot()

## Before:
 - Only collects the last 2 iterations with python call stacks.
![image](https://github.com/pytorch/pytorch/assets/17602366/86154532-9f73-4d10-9194-19e8c96ee4f3)

## After:
 - Collects all 5 iterations, where first 3 iterations have no call stacks, and last 2 iterations have python call stacks.
![image](https://github.com/pytorch/pytorch/assets/17602366/c2c277d6-b400-4da2-85c8-a7f119d409f8)
![image](https://github.com/pytorch/pytorch/assets/17602366/dc9da2f8-41cc-44b0-9c32-ec3cbe79d2c4)

Differential Revision: D54084017

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120436
Approved by: https://github.com/zdevito, https://github.com/leitian
2024-02-28 22:46:26 +00:00
CaoE
113138aa55 add test cases for GradScaler on CPU (#109994)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109994
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-02-02 21:49:07 +00:00
Michael Lazos
800e2e823f Add compilable foreach RAdam support (#117912)
Fixes https://github.com/pytorch/pytorch/issues/117807

This brings the number of supported optimizers with `torch.compile` to 11/13 (!)
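
A hedged sketch of the usual pattern for compiling an optimizer step (placeholder parameters; not the PR's test code):

```python
import torch

params = [torch.randn(32, 32, device="cuda", requires_grad=True)]
opt = torch.optim.RAdam(params, lr=1e-3, foreach=True)
for p in params:
    p.grad = torch.randn_like(p)

@torch.compile
def compiled_step():
    opt.step()

compiled_step()  # the foreach RAdam update is traced and compiled
```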

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117912
Approved by: https://github.com/janeyx99
2024-01-27 04:32:27 +00:00
Aaron Shi
6ac284122b [Memory Snapshot] Track context for SEGMENT_FREE and SEGMENT_UNMAP (#118055)
Summary: Show the stack when SEGMENT_FREE and SEGMENT_UNMAP occur. This may be useful for debugging, such as when empty_cache() causes a segment to be freed. If the free context is unavailable, resort to the segment allocation stack.

Test Plan: CI

Differential Revision: D52984953

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118055
Approved by: https://github.com/zdevito
2024-01-23 21:48:57 +00:00
Michael Lazos
aaae2d8bb6 Add compilable and capturable foreach adamax with tests (#117835)
Based off of https://github.com/pytorch/pytorch/pull/110345

Fixes https://github.com/pytorch/pytorch/issues/117812

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117835
Approved by: https://github.com/janeyx99
2024-01-20 05:29:05 +00:00
Masaki Kozuki
1d14adfa66 [mta] Fused SGD (#116585)
depends on #116583

rel:
- #94791

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116585
Approved by: https://github.com/janeyx99
2024-01-16 23:54:38 +00:00
CaoE
29516bd2a0 add _amp_foreach_non_finite_check_and_unscale_cpu_ and _amp_update_scale_cpu_ kernels on CPU (#109281)
Step1 of https://github.com/pytorch/pytorch/issues/111559.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109281
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-01-16 15:25:08 +00:00
Ting Lu
c167c34396 Skip unsupported tests on arm (#117344)
Add skips to tests that involve record_context_cpp on ARM, as it is only supported on the Linux x86_64 architecture. The error is reported as below:
```
Traceback (most recent call last):
  File "/usr/lib/python3.10/unittest/case.py", line 59, in testPartExecutor
    yield
  File "/usr/lib/python3.10/unittest/case.py", line 591, in run
    self._callTestMethod(testMethod)
  File "/usr/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
    method()
  File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 2674, in wrapper
    method(*args, **kwargs)
  File "/opt/pytorch/pytorch/test/test_cuda.py", line 3481, in test_direct_traceback
    c = gather_traceback(True, True, True)
RuntimeError: record_context_cpp is not support on non-linux non-x86_64 platforms
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117344
Approved by: https://github.com/malfet, https://github.com/drisspg
2024-01-12 21:12:11 +00:00
Doe Hyun Yoon
83c45a9931 Faster gc_count update for CUDACachingAllocator (and avoid nullptr dereference) (#109065) (#117064)

Summary:

Modify the way we update gc_count in CUDACachingAllocator to make it faster.

Originally D48481557, but reverted due to a nullptr dereference in some cases (D49003756). This diff changes the code to use the correct constructor for the search key (avoiding the nullptr dereference). It also adds a nullptr check (returning 0 in that case) in the gc_count functions.

Differential Revision: D49068760

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117064
Approved by: https://github.com/zdevito
2024-01-11 19:47:05 +00:00
Nikita Shulga
a6325ad86c Fix cuInit test on Windows (#117055)
By changing library name from `libcuda.so.1` to `nvcuda.dll` on Windows

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117055
Approved by: https://github.com/Skylion007, https://github.com/huydhn, https://github.com/atalman
2024-01-10 00:45:18 +00:00
Nikita Shulga
81b7a09d27 [CI] Test that cuInit is not called during import (#117010)
By making a driver API call in a subprocess and expecting it to return `CUDA_ERROR_NOT_INITIALIZED`
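
Roughly, the check looks like the sketch below (an illustrative ctypes sketch, not the test's exact code; `libcuda.so.1` is the Linux driver library):

```python
import ctypes
import torch  # importing torch alone must not initialize the CUDA driver

CUDA_ERROR_NOT_INITIALIZED = 3
libcuda = ctypes.CDLL("libcuda.so.1")  # "nvcuda.dll" on Windows
count = ctypes.c_int()
status = libcuda.cuDeviceGetCount(ctypes.byref(count))
assert status == CUDA_ERROR_NOT_INITIALIZED  # driver calls fail until cuInit has run
```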

Test Plan: run it on nightlies before https://github.com/pytorch/pytorch/pull/116201 got reverted and observe the failure

This is very important for lots of distributed launchers

Fixes https://github.com/pytorch/pytorch/issues/116276

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117010
Approved by: https://github.com/albanD
2024-01-09 14:44:22 +00:00
Aaron Gokaslan
95041829c8 Add bfloat16 CUDA support to RNN (#116927)
Fixes #116925
Fixes #116763

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116927
Approved by: https://github.com/malfet
2024-01-06 22:55:34 +00:00
Aaron Gokaslan
3fe437b24b [BE]: Update flake8 to v6.1.0 and fix lints (#116591)
Updates flake8 to v6.1.0 and fixes a few lints using sed and some ruff tooling.
- Replace `assert(0)` with `raise AssertionError()`
- Remove extraneous parenthesis i.e.
  - `assert(a == b)` -> `assert a == b`
  - `if(x > y or y < z):`->`if x > y or y < z:`
  - And `return('...')` -> `return '...'`

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116591
Approved by: https://github.com/albanD, https://github.com/malfet
2024-01-03 06:04:44 +00:00
Aaron Gokaslan
bd10fea79a [BE]: Enable F821 and fix bugs (#116579)
Fixes #112371

I tried to fix as many of the bugs as I could, a few I could not figure out what the proper fix for them was though and so I left them with noqas.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116579
Approved by: https://github.com/ezyang
2024-01-01 08:40:46 +00:00
zdevito
4afe2687d5 Reland "Serve multistream graph captures from correct pool (#114647)" (#116199)
Fixes a variable shadowing problem that broke internal builds.

This reverts commit fe15645619.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116199
Approved by: https://github.com/eellison
2023-12-20 21:22:34 +00:00
PyTorch MergeBot
fe15645619 Revert "Serve multistream graph captures from correct pool (#114647)"
This reverts commit 8a445f7bd5.

Reverted https://github.com/pytorch/pytorch/pull/114647 on behalf of https://github.com/jeanschmidt due to breaking multiple internal build jobs, please check internal diff in order to obtain more details ([comment](https://github.com/pytorch/pytorch/pull/114647#issuecomment-1864840724))
2023-12-20 17:11:42 +00:00
zdevito
8a445f7bd5 Serve multistream graph captures from correct pool (#114647)
This fixes #114320 by placing the logic for determining whether to allocate
to a pool inside a callback that is controlled by CUDAGraph.cpp or by the
Python-bound API for allocating a stream directly to a pool.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114647
Approved by: https://github.com/ngimel, https://github.com/eellison
2023-12-18 18:24:15 +00:00
rzou
8ddca5aeae markDynamoStrictTest some more tests (#115857)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115857
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845, #115855, #115856
2023-12-15 01:22:38 +00:00
atalman
43e3242490 [BE] Remove test corner cases for CUDA older than supported 11.8 (#114989)
Remove deprecated CUDA use cases from tests.
Similar to: https://github.com/pytorch/pytorch/pull/112873

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114989
Approved by: https://github.com/malfet
2023-12-04 21:41:03 +00:00
eqy
6a86cf00ad [CUDA][cuBLAS] Remove explicit cuBLAS workspace allocation for CUDA 12.2+ (#113994)
cuBLAS should be using `cudaMallocAsync` in CUDA 12.2+, which removes the need for explicit workspace allocation to avoid increasing memory usage with multiple graph captures.

CC @ptrblck @malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113994
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-11-22 23:23:51 +00:00
Banit Agrawal
cc776d2186 [PyTorch Pinned Allocator] Create per thread task pool for mapping memory space (#111545)
Differential Revision: D50443865

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111545
Approved by: https://github.com/zdevito
2023-10-22 00:23:49 +00:00
Kazuaki Ishizaki
a603dcc307 Fix typo under test directory (#110826)
This PR fixes the typo `the the` in comments in files under the `test` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110826
Approved by: https://github.com/Skylion007
2023-10-08 20:52:38 +00:00
Banit Agrawal
64583c4d04 [CUDA Host Allocator] Add support of CudaHostRegister (#108488)
Summary: This diff adds another option for creating CUDA pinned memory, using cudaHostRegister.
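
The option is assumed to be exposed through the allocator configuration; the knob names below come from the CUDA memory-management docs and are assumptions relative to this commit, not code from the diff:

```python
import os
# Must be set before CUDA / pinned-memory machinery is initialized.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "pinned_use_cuda_host_register:True,pinned_num_register_threads:4"

import torch
x = torch.empty(1 << 20, pin_memory=True)  # pinned via cudaHostRegister instead of cudaHostAlloc
```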

Differential Revision: D45843715

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108488
Approved by: https://github.com/zdevito
2023-10-06 04:13:02 +00:00
Aidyn-A
e7bd9c5315 [CUDA][CUDA Graphs] Fix CUDAGraph::reset function (#108896)
The following two cases fail due to a small oversight in `CUDAGraph::reset()` that causes failures in the graph destructor
```Python
import torch

x = torch.zeros(4, device="cuda")
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    x = x + 1

g.reset()
del g
```
that fails with:
```
terminate called after throwing an instance of 'c10::Error'
  what():  uc >= 0 INTERNAL ASSERT FAILED at ".../pytorch/c10/cuda/CUDACachingAllocator.cpp":2157, please report a bug to PyTorch.
```

and reset and subsequent re-capture
```Python
import torch

x = torch.zeros(4, device="cuda")
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    x = x + 1

g.reset()

with torch.cuda.graph(g):
    x = x + 1
g.replay()
```
which fails with:
```
Traceback (most recent call last):
  File "test_graph.py", line 11, in <module>
    with torch.cuda.graph(g):
  File ".../pytorch/torch/cuda/graphs.py", line 192, in __enter__
    self.cuda_graph.capture_begin(
  File ".../pytorch/torch/cuda/graphs.py", line 77, in capture_begin
    super().capture_begin(pool=pool, capture_error_mode=capture_error_mode)
RuntimeError: This CUDAGraph instance already owns a captured graph. To capture a new graph, create a new instance.

```

This PR fixes the `CUDAGraph::reset()` function for the above two use cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108896
Approved by: https://github.com/ezyang
2023-09-11 19:49:31 +00:00
Michael Lazos
b193f295b6 Add capturable ASGD impl (#107857)
Add capturable ASGD impl + test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107857
Approved by: https://github.com/janeyx99
2023-09-07 06:30:30 +00:00
Banit Agrawal
b8af8ac784 [CUDACaching Allocator] Release the allocator lock on the slow path (#108367)
Summary: This diff releases the global allocator lock on the slow path when we make a synchronous cudaMalloc call.

Differential Revision: D48750077

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108367
Approved by: https://github.com/zdevito
2023-09-02 02:52:25 +00:00
Elias Ellison
0a9778a372 Expose cudaStreamCaptureMode in CUDA Graphs, use local setting in inductor (#107407)
>  capture_error_mode (str, optional): specifies the cudaStreamCaptureMode for the graph capture stream.
Can be "global", "thread_local" or "relaxed". During cuda graph capture, some actions, such as cudaMalloc,
 may be unsafe. "global" will error on actions in other threads, "thread_local" will only error for
 actions in the current thread, and "relaxed" will not error on these actions.

Inductor codegen is single-threaded, so it should be safe to enable "thread_local" for inductor's cuda graph capturing. We have seen errors when inductor cudagraphs has been used concurrently with data preprocessing in other threads.
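
A minimal usage sketch of the new argument (assuming a CUDA build):

```python
import torch

g = torch.cuda.CUDAGraph()
x = torch.zeros(8, device="cuda")
# "thread_local" only errors on unsafe CUDA calls made from the capturing thread,
# so work in other threads (e.g. data preprocessing) does not abort the capture.
with torch.cuda.graph(g, capture_error_mode="thread_local"):
    y = x + 1
g.replay()
```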

Differential Revision: [D48656014](https://our.internmc.facebook.com/intern/diff/D48656014)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107407
Approved by: https://github.com/albanD, https://github.com/eqy
2023-08-25 01:44:26 +00:00
Zachary DeVito
cc54448a07 [memory snapshot] add 'address' key to block (#107171)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107171
Approved by: https://github.com/ngimel
2023-08-23 18:57:24 +00:00
Aaron Gokaslan
660e8060ad [BE]: Update ruff to 0.285 (#107519)
This updates ruff to 0.285, which is faster, better, and fixes a bunch of false negatives with regard to f-strings.

I also enabled RUF017, which looks for accidental quadratic list summation. Luckily, it seems there are no instances of it in our codebase, so I'm enabling it so that it stays that way. :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
2023-08-22 23:16:38 +00:00
PyTorch MergeBot
d59a6864fb Revert "[BE]: Update ruff to 0.285 (#107519)"
This reverts commit 88ab3e4322.

Reverted https://github.com/pytorch/pytorch/pull/107519 on behalf of https://github.com/ZainRizvi due to Sorry, but this PR breaks internal tests. @ezyang, can you please hep them get unblocked? It seems like one of the strings was prob accidentally modified ([comment](https://github.com/pytorch/pytorch/pull/107519#issuecomment-1688833480))
2023-08-22 19:53:32 +00:00
Aaron Gokaslan
88ab3e4322 [BE]: Update ruff to 0.285 (#107519)
This updates ruff to 0.285, which is faster, better, and fixes a bunch of false negatives with regard to f-strings.

I also enabled RUF017, which looks for accidental quadratic list summation. Luckily, it seems there are no instances of it in our codebase, so I'm enabling it so that it stays that way. :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
2023-08-20 01:36:18 +00:00
lcskrishna
bc662ffff9 [ROCm] Update ROCm skip decorators (#106138)
This PR adds a msg argument for skipIfRocm and skipCUDAIfRocm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106138
Approved by: https://github.com/jataylo, https://github.com/jeffdaily, https://github.com/pruthvistony, https://github.com/albanD
2023-08-18 22:02:06 +00:00
Zachary DeVito
80988b6277 Introduce memory stacks for free (#106758)
Previously when we recorded a free action in a memory trace, we would provide
the stack for when the block was allocated. This is faster because we do not
have to record stacks for free actions, which would otherwise double the number of stacks
collected. However, sometimes knowing the location of a free is useful for
figuring out why a tensor was live. So this PR adds this behavior. If
performance ends up being a concern the old behavior is possible by passing
"alloc" to the context argument rather than "all".

Also refactors some of the glue logic to be consistent across C++ and Python and
routes the Python API through the C++ version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106758
Approved by: https://github.com/albanD
2023-08-14 20:38:15 +00:00
Jane Xu
0208574db9 [NAdam] Add capturable API and tests + fix differentiable (#106615)
This PR:
- adds a capturable API for NAdam similar to Adam(W)
- adds tests accordingly
- discovered and fixed bugs in the differentiable implementation (now tested through the capturable codepath).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106615
Approved by: https://github.com/albanD
2023-08-07 19:49:11 +00:00
Zachary DeVito
3e5a52cedd [memory snapshot] track context for segments (#106113)
We want to display the stack for the original cudaMalloc that created a segment.
Previously we could only report the last time the segment memory was used,
or the record of the segment_alloc could appear in the list of allocator actions.
This PR ensures that, regardless of whether we still have the segment_alloc action,
the context for a segment is still available. The visualizer is updated to
be able to incorporate this information.

This PR adds a new field to Block. However the previous stacked cleanup PR
 removed a field of the same size, making the change to Block size-neutral.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106113
Approved by: https://github.com/aaronenyeshi
2023-07-28 06:45:48 +00:00
Zachary DeVito
45b564766d [memory snapshots] removed chained history (#106079)
For free blocks of memory in the allocator, we previously kept a linked list
of the stack frames of previous allocations that lived there. This was only
ever used in one flamegraph visualization and never proved useful at
understanding what was going on. When memory history tracing was added, it
became redundant, since we can see the history of the free space from recording
the previous actions anyway.

This patch removes this functionality and simplifies the snapshot format:
allocated blocks directly have a 'frames' attribute rather than burying stack frames in the history.
Previously the memory history tracked the real size of allocations before rounding.
Since history was added, 'requested_size' has been added directly to the block which records the same information,
so this patch also removes that redundancy.

None of this functionality has been part of a PyTorch release with BC guarantees, so it should be safe to alter
this part of the format.

This patch also updates our visualization tools to work with the simplified format. Visualization tools keep
support for the old format in `_legacy` functions so that during the transition old snapshot files can still be read.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106079
Approved by: https://github.com/eellison
2023-07-28 06:45:48 +00:00
Justin Chu
4cc1745b13 [BE] f-stringify torch/ and scripts (#105538)
This PR is a follow up on the pyupgrade series to convert more strings to use f-strings using `flynt`.

- https://docs.python.org/3/reference/lexical_analysis.html#f-strings
- https://pypi.org/project/flynt/

Command used:

```
flynt torch/ -ll 120
flynt scripts/ -ll 120
flynt tools/ -ll 120
```

and excluded `collect_env.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105538
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-07-21 19:35:24 +00:00
Justin Chu
73e1455327 [BE] Enable ruff's UP rules and autoformat test/ (#105434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105434
Approved by: https://github.com/albanD
2023-07-19 20:36:06 +00:00
Nikita Shulga
c3e4a67905 Refactor multigpu tests to test_cuda_multigpu (#104059)
Mostly a refactor that moves all the tests from `test_cuda` that benefit from a multi-GPU environment into their own file.

- Add `TestCudaMallocAsync` class for Async tests ( to separate them from `TestCudaComm`)
- Move individual tests from `TestCuda` to `TestCudaMultiGPU`
- Move `_create_scaling_models_optimizers` and `_create_scaling_case` to `torch.testing._internal.common_cuda`
- Add newly created `test_cuda_multigpu` to the multigpu periodic test

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at f4d46fa</samp>

This pull request fixes a flaky test and improves the testing of gradient scaling on multiple GPUs. It adds verbose output for two CUDA tests, and refactors some common code into helper functions in `torch/testing/_internal/common_cuda.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104059
Approved by: https://github.com/huydhn
2023-06-27 05:32:05 +00:00
Zachary DeVito
afc788a99c Re-land _cycleviz.py: visualize reference cycles holding cuda memory (#104051)
Reference cycles are freed by the cycle collector rather than being cleaned up
when the objects in the cycle first become unreachable. If a cycle points to a tensor,
the CUDA memory for that tensor will not be freed until garbage collection runs.
Accumulation of CUDA allocations can lead to out of memory errors (OOMs), as well as
non-deterministic allocation behavior which is harder to debug.

This visualizer installs a garbage collection hook to look for cycles containing
CUDA tensors and saves a visualization of the garbage:

```
from torch.cuda._cycleviz import warn_tensor_cycles
warn_tensor_cycles()
# do some work that results in a cycle getting garbage collected
# ...
> WARNING:root:Reference cycle includes a CUDA Tensor see visualization of cycle /tmp/tmpeideu9gl.html
```

Reland to make windows skip the test.

This reverts commit 7b3b6dd426.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104051
Approved by: https://github.com/aaronenyeshi, https://github.com/malfet
2023-06-23 13:44:58 +00:00
PyTorch MergeBot
7b3b6dd426 Revert "_cycleviz.py: visualize reference cycles holding cuda memory (#102656)"
This reverts commit dba67f71c9.

Reverted https://github.com/pytorch/pytorch/pull/102656 on behalf of https://github.com/huydhn due to Sorry for reverting your PR. But I think the change is failing on Windows CUDA https://github.com/pytorch/pytorch/actions/runs/5341701630/jobs/9683293600 ([comment](https://github.com/pytorch/pytorch/pull/102656#issuecomment-1603035364))
2023-06-22 17:16:47 +00:00
Zachary DeVito
dba67f71c9 _cycleviz.py: visualize reference cycles holding cuda memory (#102656)
Reference cycles are freed by the cycle collector rather than being cleaned up
when the objects in the cycle first become unreachable. If a cycle points to a tensor,
the CUDA memory for that tensor will not be freed until garbage collection runs.
Accumulation of CUDA allocations can lead to out of memory errors (OOMs), as well as
non-deterministic allocation behavior which is harder to debug.

This visualizer installs a garbage collection hook to look for cycles containing
CUDA tensors and saves a visualization of the garbage:

```
from torch.cuda._cycleviz import warn_tensor_cycles
warn_tensor_cycles()
# do some work that results in a cycle getting garbage collected
# ...
> WARNING:root:Reference cycle includes a CUDA Tensor see visualization of cycle /tmp/tmpeideu9gl.html
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102656
Approved by: https://github.com/aaronenyeshi
2023-06-22 04:00:28 +00:00
Nikita Shulga
cd05c3b98c [BE] Use TEST_MULTIGPU from common_cuda.py (#103982)
The comment about `TEST_CUDNN` being called over and over has long been alleviated by wrapping the check with `LazyVal`, which caches the result.
Also, delete unused `TEST_MAGMA`.

Prep change for https://github.com/pytorch/pytorch/issues/100006

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at e3a5b39</samp>

> _`common_cuda.py`_
> _Refactored for dynamo tests_
> _Winter code cleanup_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103982
Approved by: https://github.com/atalman, https://github.com/janeyx99
2023-06-22 00:07:44 +00:00
Zachary DeVito
19b3e07fe0 [memory_viz] Unified viewer (#103565)
This replaces the individual visualization routines in _memory_viz.py with
a single javascript application.

The javascript application can load pickled snapshot dumps directly using
drag/drop, requesting them via fetch, or by embedding them in a webpage.

The _memory_viz.py commands use the embedding approach.
We can also host MemoryViz.js on a webpage to use the drag/drop approach, e.g.
https://zdevito.github.io/assets/viz/
(eventually this should be hosted with the pytorch docs).

All views/multiple cuda devices are supported on one page.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103565
Approved by: https://github.com/eellison, https://github.com/albanD
2023-06-16 03:49:48 +00:00
Xiao Wang
39f3514fa3 Add an env PYTORCH_TEST_SKIP_CUDAGRAPH to skip all cuda graph-related unit tests (#103032)
Skip all cuda graph-related unit tests by setting env var `PYTORCH_TEST_SKIP_CUDAGRAPH=1`

This PR refactors the `TEST_CUDA` python variable in test_cuda.py into common_utils.py. This PR also creates a new python variable `TEST_CUDA_GRAPH` in common_utils.py, which has an env var switch to turn off all cuda graph-related tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103032
Approved by: https://github.com/malfet
2023-06-06 07:51:57 +00:00
Nikita Shulga
ca470fc59f [BE] Make test_no_triton_on_import simple (#102674)
Do not try to parse raised exception for no good reason
Add short description
Reduce script to a single line

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at ea4164e</samp>

> _`test_no_triton_on_import`_
> _Cleans up the code, adds docs_
> _No hidden errors_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102674
Approved by: https://github.com/cpuhrsch, https://github.com/albanD
2023-06-01 20:31:18 +00:00
Nikita Vedeneev
d80d3b18d0 nn.Linear with BSR inputs: spare the user from explicit Triton kernel registrations (#98403)
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 08f7a6a</samp>

This pull request adds support for triton kernels in `torch` and `torch/cuda`, and refactors and tests the existing triton kernel for BSR matrix multiplication. It also adds a test case to ensure that importing `torch` does not implicitly import `triton`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98403
Approved by: https://github.com/malfet, https://github.com/cpuhrsch
2023-05-31 13:09:45 +00:00
Masaki Kozuki
c8579b7374 Run test_cpp_memory_snapshot_pickle only when linux and x86_64 (#101366)
On Arm, I got

```
Traceback (most recent call last):
  File "/opt/pytorch/pytorch/test/test_cuda.py", line 5260, in test_cpp_memory_snapshot_pickle
    mem = run()
  File "/opt/pytorch/pytorch/test/test_cuda.py", line 5257, in run
    t = the_script_fn()
  File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 496, in prof_func_call
    return prof_callable(func_call, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 493, in prof_callable
    return callable(*args, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "/opt/pytorch/pytorch/test/test_cuda.py", line 5254, in the_script_fn
                @torch.jit.script
                def the_script_fn():
                    return torch.rand(311, 411, device='cuda')
                           ~~~~~~~~~~ <--- HERE
RuntimeError: record_context_cpp is not support on non-linux non-x86_64 platforms
```

dfe484a3b3/torch/csrc/profiler/unwind/unwind.cpp (L4-L24) seems related

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101366
Approved by: https://github.com/zdevito
2023-05-17 19:44:21 +00:00
Elias Ellison
3edff6b6ec Improve detection of workspace/non-output allocations in cudagraphs (#99985)
When we run cudagraph trees we are not allowed to have permanent workspace allocations like in cuBLAS, because we might need to reclaim that memory for a previous cudagraph recording, and it is memory that is not accounted for in output weakrefs, so it does not work with checkpointing. Previously, I would check that we didn't have any additional allocations through snapshotting. This was extremely slow, so I had to turn it off.

This PR first does a quick check to see if we are in an error state, and only then runs the slow logic of creating a snapshot. It also turns on history recording so we get a stack trace of where the bad allocation came from.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99985
Approved by: https://github.com/zdevito
2023-05-01 15:58:45 +00:00
Jane Xu
808267767c Prevent grad scale from overflowing (#98876)
Fixes #98828 by capping the growth in the kernel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98876
Approved by: https://github.com/ngimel
2023-04-25 20:59:44 +00:00
Aaron Gokaslan
e2a3817dfd [BE] Enable C419 rule for any all shortcircuiting (#99890)
Apparently https://github.com/pytorch/pytorch/pull/78142 made torch.JIT allow for simple generator expressions which allows us to enable rules that replace unnecessary list comprehensions with generators in any/all. This was originally part of #99280 but I split it off into this PR so that it can be easily reverted should anything break.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99890
Approved by: https://github.com/justinchuby, https://github.com/kit1980, https://github.com/malfet
2023-04-25 15:02:13 +00:00
Masaki Kozuki
b87c7ab6d6 Remove redundant found_inf recompute from _step_supports_amp_unscaling path (#98620)
following https://github.com/pytorch/pytorch/pull/97415#issuecomment-1499787115.

Rel: https://github.com/pytorch/pytorch/pull/98613

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98620
Approved by: https://github.com/janeyx99
2023-04-20 19:24:09 +00:00
Animesh Jain
971df458db Reland of "Python binding to set/get CUDA rng state offset" (#99565)
Why?
* To reduce the latency of hot path in https://github.com/pytorch/pytorch/pull/97377

Concern - I had to add `set_offset` in all instances of `GeneratorImpl`. I don't know if there is a better way.

~~~~
import torch
torch.cuda.manual_seed(123)
print(torch.cuda.get_rng_state())
torch.cuda.set_rng_state_offset(40)
print(torch.cuda.get_rng_state())

tensor([123,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0], dtype=torch.uint8)
tensor([123,   0,   0,   0,   0,   0,   0,   0,  40,   0,   0,   0,   0,   0,
          0,   0], dtype=torch.uint8)
~~~~

Reland of https://github.com/pytorch/pytorch/pull/98965

(cherry picked from commit 8214fe07e8)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99565
Approved by: https://github.com/anijain2305
2023-04-20 15:42:25 +00:00
PyTorch MergeBot
bb2cd4a107 Revert "Python binding to set/get CUDA rng state offset (#98965)"
This reverts commit 8214fe07e8.

Reverted https://github.com/pytorch/pytorch/pull/98965 on behalf of https://github.com/DanilBaibak due to Break internal build
2023-04-19 11:23:32 +00:00
Animesh Jain
8214fe07e8 Python binding to set/get CUDA rng state offset (#98965)
Why?
* To reduce the latency of hot path in https://github.com/pytorch/pytorch/pull/97377

Concern - I had to add `set_offset` in all instances of `GeneratorImpl`. I don't know if there is a better way.

~~~~
import torch
torch.cuda.manual_seed(123)
print(torch.cuda.get_rng_state())
torch.cuda.set_rng_state_offset(40)
print(torch.cuda.get_rng_state())

tensor([123,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0], dtype=torch.uint8)
tensor([123,   0,   0,   0,   0,   0,   0,   0,  40,   0,   0,   0,   0,   0,
          0,   0], dtype=torch.uint8)
~~~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98965
Approved by: https://github.com/kulinseth, https://github.com/ezyang
2023-04-18 07:52:21 +00:00
Zachary DeVito
7ff1f3f3f6 Revert "Revert "Expandable blocks in allocator (#96995)"" (#99275)
This reverts commit 851e89c8e8.

Differential Revision: [D45034526](https://our.internmc.facebook.com/intern/diff/D45034526)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99275
Approved by: https://github.com/eellison
2023-04-17 23:46:08 +00:00
PyTorch MergeBot
851e89c8e8 Revert "Expandable blocks in allocator (#96995)"
This reverts commit 6a50b83b73.

Reverted https://github.com/pytorch/pytorch/pull/96995 on behalf of https://github.com/izaitsevfb due to Breaks internal tests
2023-04-16 19:23:37 +00:00
Zachary DeVito
6a50b83b73 Expandable blocks in allocator (#96995)
Common advice we give for handling memory fragmentation issues is to
allocate a big block upfront to reserve memory which will get split up later.
For programs with changing tensor sizes this can be especially helpful to
avoid OOMs that happen the first time we see a new largest input and would
otherwise have to allocate new segments.

However the issue with allocating a block upfront is that it is nearly impossible
to correctly estimate the size of that block. If too small, space in the block
will run out and the allocator will allocate separate blocks anyway. Too large,
and other non-PyTorch libraries might stop working because they cannot allocate
any memory.

This patch provides the same benefits as using a pre-allocating block but
without having to choose its size upfront. Using the cuMemMap-style APIs,
it adds the ability to expand the last block in a segment when more memory is
needed.

Compared to universally using cudaMallocAsync to avoid fragmentation,
this patch can fix this common fragmentation issue while preserving most
of the existing allocator behavior. This behavior can be enabled and disabled dynamically.
 This should allow users to, for instance, allocate long-lived parameters and state in individual buffers,
and put temporary state into the large expandable blocks, further reducing
fragmentation.
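
The behavior is opt-in; a minimal sketch of enabling it (the config key is an assumption based on the allocator settings documented elsewhere, not code from this patch):

```python
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"  # set before CUDA is initialized

import torch
x = torch.empty(1 << 28, device="cuda")  # the last block in a segment can now grow in place
```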

See inline comments for information about the implementation and its limitations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96995
Approved by: https://github.com/eellison
2023-04-14 09:49:11 +00:00
Peeyush Agarwal
ebd4c165ff Back out "GradScaler recomputes optimizer_state["found_inf_per_device"] before optimizer.step (#97415)" (#98613)
Summary: This change causes a multi-GPU job from the XI team to hang after 8K steps.

Differential Revision: D44797248

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98613
Approved by: https://github.com/ngimel
2023-04-07 23:31:31 +00:00
Zachary DeVito
b1a83c4da4 [memory history] cleanup recording API (#97406)
This makes the options for recording memory history
easier to understand and makes the default record
the most information.

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 4706acf</samp>

This pull request enhances the memory profiling and debugging capabilities of PyTorch on CUDA devices. It introduces a new API for memory history recording in `torch/cuda/memory.py` and `test/test_cuda.py`, and adds new functions for memory snapshot management and visualization in `torch/cuda/memory.py`.

Also adds a quick _dump_snapshot function to make
it easier to look at the common visualizations.

<!--
copilot:walkthrough
-->
### <samp>🤖 Generated by Copilot at 4706acf</samp>

*  Modify the `_record_memory_history` function to use a new API that accepts a string argument for the `enabled` parameter and more parameters to control the stack trace collection and memory event history ([link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-80bd98caafb20d758f45a4d23711810f7e0b9ce7a6505094f9dbb0e00a657377L620-R696))
* Add a new function `_dump_snapshot` that allows users to dump a memory snapshot to a directory with HTML plots of the memory segments and events ([link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-80bd98caafb20d758f45a4d23711810f7e0b9ce7a6505094f9dbb0e00a657377R703-R713))
* Update the test cases in `test/test_cuda.py` to use the new API for memory history recording and check the expected output of the memory plots ([link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L4946-R4946), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L4984-R4984), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5000-R5000), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5015-R5015), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5035-R5038), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450R5045-R5046), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5060-R5059), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5068-R5065), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5088-R5085))
* Add missing imports and types to the `torch/cuda/memory.py` module ([link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-80bd98caafb20d758f45a4d23711810f7e0b9ce7a6505094f9dbb0e00a657377L5-R15))
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97406
Approved by: https://github.com/ezyang
2023-03-28 16:31:10 +00:00
soulitzer
51c3fd39a5 Modify all calls to checkpoint pass use_reentrant explicitly (#97376)
Fixes #ISSUE_NUMBER

This is the first step toward making use_reentrant=False the default.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97376
Approved by: https://github.com/albanD
2023-03-27 13:37:42 +00:00
Masaki Kozuki
b5edf18334 GradScaler recomputes optimizer_state["found_inf_per_device"] before optimizer.step (#97415)
I found a discrepancy between non-fused and fused optimizers: whether to use `optimizer_state["found_inf"]` or to recompute `found_inf`.

- non fused: e64ddd1ab9/torch/cuda/amp/grad_scaler.py (L289)
- fused: e64ddd1ab9/torch/cuda/amp/grad_scaler.py (L353)
    - where `_check_inf_per_device` is e64ddd1ab9/torch/cuda/amp/grad_scaler.py (L564-L573)

The other way to align the behavior is to use the existing `found_inf` in e64ddd1ab9/torch/cuda/amp/grad_scaler.py (L353).

I'd say this PR is for the sake of "safety" and the alternative is to keep the existing behavior.
I honestly have no idea if it's expected to double-check the sanity of gradients in `GradScaler.step`.

---

What I've observed in the huggingface/transformers T5-base example so far is that non-fused optimizers lead to invalid parameters while the fused one does not.
The cause seems to be that gradients become inf/nan before `GradScaler.step(optimizer)` but after `GradScaler._unscale_grads_` (more precisely, the call of `torch._amp_foreach_non_finite_check_and_unscale_`) in the script of the issue linked below, i.e. the gradient clipping and/or unscaling lead to inf/nan as these happen after the grad check. See
788300cc2a/aten/src/ATen/native/cuda/AmpKernels.cu (L165-L174).

Fixes #96755 🙏

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97415
Approved by: https://github.com/ngimel, https://github.com/janeyx99
2023-03-24 17:36:47 +00:00
Tailing Yuan
63e1f12b49 Speedup bincount and histc on CUDA (#97090)
This is to speed up torch.bincount and torch.histc on CUDA.

1. Speed up int64_t gpuAtomicAdd,
2. and optimize the histogram kernel.

# Fixes #96626
After speedup, time cost in #96626 would be

```
... (run 2 times and ignore the first run)
case 1 CPU  0.0003631114959716797 seconds
case 1 CUDA 0.0005860328674316406 seconds
case 2 CPU  0.0013742446899414062 seconds
case 2 CUDA 0.0008623600006103516 seconds
```

Note that in "*case 1 CUDA*", the **max** op takes the most time, i.e., 5ee5a164ff/aten/src/ATen/native/cuda/SummaryOps.cu (L334-L335), which is not to be optimized in this PR.

# Benchmark

Time is measured on i7-10700 + RTX 3080, Ubuntu 22.04 (in WSL). The baseline is PyTorch 2.0.0+cu117. My dev version of PyTorch is compiled with CUDA 11.8. Each case is measured 15 times to take the median.

## torch.bincount
#elem | nbins | distribution | CPU | PyTorch 2.0.0 | this PR | speedup
-- | -- | -- | -- | -- | -- | --
2**20 | 80 | random.uniform | 0.000834 | 0.005783 | 0.000266 | 21.8x
2**20 | 80 | narrow in 1 bin | 0.001576 | 0.003967 | 0.000563 | 7.0x
2**20 | 500 | random.uniform | 0.000852 | 0.003641 | 0.000334 | 10.9x
2**20 | 500 | narrow in 1% bins | 0.000894 | 0.001878 | 0.000349 | 5.4x
2**20 | 2048 | random.uniform | 0.000891 | 0.000820 | 0.000298 | 2.8x
2**20 | 2048 | narrow in 1% bins | 0.000958 | 1.043251 | 0.000335 | 3,116.6x
2**26 | 80 | random.uniform | 0.067715 | 0.322409 | 0.003032 | 106.3x
2**26 | 80 | narrow in 1 bin | 0.110940 | 0.194644 | 0.017651 | 11.0x
2**26 | 500 | random.uniform | 0.066666 | 0.192302 | 0.002535 | 75.8x
2**26 | 500 | narrow in 1% bins | 0.066130 | 0.092237 | 0.005462 | 16.9x
2**26 | 2048 | random.uniform | 0.066371 | 0.035308 | 0.002476 | 14.3x
2**26 | 2048 | narrow in 1% bins | 0.068453 | 72.122858 | 0.003185 | 22,644.3x

## torch.histc (float32)
#elem | nbins | distribution | CPU | PyTorch 2.0.0 | this PR | speedup
-- | -- | -- | -- | -- | -- | --
2**20 | 80 | random.uniform | 0.001261 | 0.000145 | 9.47E-05 | 1.5x
2**20 | 80 | narrow in 1 bin | 0.001074 | 0.000356 | 0.000311 | 1.1x
2**20 | 500 | random.uniform | 0.001162 | 0.000227 | 9.18E-05 | 2.5x
2**20 | 500 | narrow in 1% bins | 0.001082 | 0.000201 | 0.000152 | 1.3x
2**20 | 2048 | random.uniform | 0.001100 | 0.000203 | 0.000118 | 1.7x
2**20 | 2048 | narrow in 1% bins | 0.001089 | 0.000396 | 0.000107 | 3.7x
2**26 | 80 | random.uniform | 0.064219 | 0.001170 | 0.000786 | 1.5x
2**26 | 80 | narrow in 1 bin | 0.056471 | 0.013283 | 0.011939 | 1.1x
2**26 | 500 | random.uniform | 0.078183 | 0.003411 | 0.000562 | 6.1x
2**26 | 500 | narrow in 1% bins | 0.056711 | 0.002763 | 0.002738 | 1.0x
2**26 | 2048 | random.uniform | 0.059296 | 0.003503 | 0.000533 | 6.6x
2**26 | 2048 | narrow in 1% bins | 0.061754 | 0.015703 | 0.000962 | 16.3x

## torch.histc (int64)
#elem | nbins | distribution | CPU | PyTorch 2.0.0 | this PR | speedup
-- | -- | -- | -- | -- | -- | --
2**20 | 80 | random.uniform | N/A | 0.005614 | 9.47E-05 | 59.3x
2**20 | 80 | narrow in 1 bin | N/A | 0.003799 | 0.000395 | 9.6x
2**20 | 500 | random.uniform | N/A | 0.003665 | 9.58E-05 | 38.2x
2**20 | 500 | narrow in 1% bins | N/A | 0.001760 | 0.000178 | 9.9x
2**20 | 2048 | random.uniform | N/A | 0.000693 | 0.000111 | 6.2x
2**20 | 2048 | narrow in 1% bins | N/A | 1.082904 | 0.000123 | 8,802.4x
2**26 | 80 | random.uniform | N/A | 0.320400 | 0.001145 | 279.9x
2**26 | 80 | narrow in 1 bin | N/A | 0.193668 | 0.015229 | 12.7x
2**26 | 500 | random.uniform | N/A | 0.182897 | 0.000823 | 222.2x
2**26 | 500 | narrow in 1% bins | N/A | 0.089363 | 0.00376 | 23.8x
2**26 | 2048 | random.uniform | N/A | 0.033190 | 0.000832 | 39.9x
2**26 | 2048 | narrow in 1% bins | N/A | 71.721012 | 0.001525 | 47,017.8x

## Benchmark code

Here is the benchmark code:

```python3
import time
import torch

cases = [
    ("bincount    bins=80   wide  ", torch.randint(80, [2**20]),   lambda x: torch.bincount(x, minlength=80)),
    ("bincount    bins=80   narrow", torch.randint(1, [2**20]),    lambda x: torch.bincount(x, minlength=80)),
    ("bincount    bins=500  wide  ", torch.randint(500, [2**20]),  lambda x: torch.bincount(x, minlength=500)),
    ("bincount    bins=500  narrow", torch.randint(5, [2**20]),    lambda x: torch.bincount(x, minlength=500)),
    ("bincount    bins=2048 wide  ", torch.randint(2048, [2**20]), lambda x: torch.bincount(x, minlength=2048)),
    ("bincount    bins=2048 narrow", torch.randint(20, [2**20]),   lambda x: torch.bincount(x, minlength=2048)),
    ("histc_float bins=80   wide  ", torch.rand(2**20),            lambda x: torch.histc(x, bins=80, min=0., max=1.)),
    ("histc_float bins=80   narrow", torch.rand(2**20)*.01,        lambda x: torch.histc(x, bins=80, min=0., max=1.)),
    ("histc_float bins=500  wide  ", torch.rand(2**20),            lambda x: torch.histc(x, bins=500, min=0., max=1.)),
    ("histc_float bins=500  narrow", torch.rand(2**20)*.01,        lambda x: torch.histc(x, bins=500, min=0., max=1.)),
    ("histc_float bins=2048 wide  ", torch.rand(2**20),            lambda x: torch.histc(x, bins=2048, min=0., max=1.)),
    ("histc_float bins=2048 narrow", torch.rand(2**20)*.01,        lambda x: torch.histc(x, bins=2048, min=0., max=1.)),
    ("histc_int   bins=80   wide  ", torch.randint(80, [2**20]),   lambda x: torch.histc(x, bins=80, min=0., max=80.)),
    ("histc_int   bins=80   narrow", torch.randint(1, [2**20]),    lambda x: torch.histc(x, bins=80, min=0., max=80.)),
    ("histc_int   bins=500  wide  ", torch.randint(500, [2**20]),  lambda x: torch.histc(x, bins=500, min=0., max=500.)),
    ("histc_int   bins=500  narrow", torch.randint(5, [2**20]),    lambda x: torch.histc(x, bins=500, min=0., max=500.)),
    ("histc_int   bins=2048 wide  ", torch.randint(2048, [2**20]), lambda x: torch.histc(x, bins=2048, min=0., max=2048.)),
    ("histc_int   bins=2048 narrow", torch.randint(20, [2**20]),   lambda x: torch.histc(x, bins=2048, min=0., max=2048.)),
]

def test(case, device):
    name, x, func = case
    x = x.to(device)
    time_samples = []
    for _ in range(15):
        torch.cuda.synchronize()  # make sure pending work is finished before timing
        t1 = time.time()
        func(x)
        torch.cuda.synchronize()  # wait for the kernel to complete
        t2 = time.time()
        time_samples.append(t2 - t1)
    median = sorted(time_samples)[len(time_samples) // 2]  # median of 15 runs
    print(device, name, median)

for case in cases:
    test(case, device="cuda")

# for case in cases:
#     test(case, device="cpu")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97090
Approved by: https://github.com/ngimel
2023-03-24 00:25:34 +00:00
Masaki Kozuki
22ea21da3d Change 1D Tensor of 1 element to 0D Tensor (#96994)
add 0d tensor to graph adam/adamw test

Affected:
- `torch.cuda.amp.GradScaler`'s `found_inf`, `_scale`, and `_growth_tracker`
- `step` of Adam & AdamW of `capturable`

Fixes #96776 🤞
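
As a rough sketch (not from this PR), the dimensionality change amounts to replacing a 1-element 1-D tensor with a 0-D (scalar) tensor for the scaler's internal state:

```python
import torch

# before: a 1-element 1-D tensor; after this change: a 0-D (scalar) tensor
one_d  = torch.zeros(1)    # shape (1,), dim() == 1
zero_d = torch.zeros(())   # shape (),  dim() == 0

assert one_d.dim() == 1 and zero_d.dim() == 0
# both read back the same scalar value
assert one_d.item() == zero_d.item() == 0.0
```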

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96994
Approved by: https://github.com/janeyx99
2023-03-21 18:24:19 +00:00
Elias Ellison
571f96bf59 cudagraph trees (#89146)
CUDA Graph Trees

Design doc: https://docs.google.com/document/d/1ZrxLGWz7T45MSX6gPsL6Ln4t0eZCSfWewtJ_qLd_D0E/edit

Not currently implemented :

- Right now, we are using weak tensor refs from outputs to check if a tensor has died. This doesn't work because of a) aliasing, and b) aot_autograd detaching tensors (see note [Detaching saved tensors in AOTAutograd]). Would need either https://github.com/pytorch/pytorch/issues/91395 to land to use storage weak refs or manually add a deleter fn that does what I want. This is doable but there are some interactions with the caching allocator checkpointing, so saving it for a stacked PR.

- Reclaiming memory from the inputs during model recording. This isn't terribly difficult but is deferred to another PR. You would need to write over the input memory during warmup, and therefore copy the inputs to CPU. Saving it for a stacked PR.

- Warning on overwriting previous-generation outputs, and handling nested torch.compile() calls in generation tracking
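
As a hedged usage sketch (not part of this PR), cudagraph trees are exercised indirectly through inductor's "reduce-overhead" mode; whether trees are actually used depends on the inductor config at the time:

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU()).cuda()
compiled = torch.compile(model, mode="reduce-overhead")  # enables CUDA graph capture in inductor

x = torch.randn(32, 64, device="cuda")
for _ in range(3):
    out = compiled(x)  # early calls warm up / record, later calls replay the graph
```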

Differential Revision: [D43999887](https://our.internmc.facebook.com/intern/diff/D43999887)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89146
Approved by: https://github.com/ezyang
2023-03-17 02:47:03 +00:00
Elias Ellison
ea7415087a Expose Stream Recording Apis in python (#96384)
Differential Revision: [D43999891](https://our.internmc.facebook.com/intern/diff/D43999891)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96384
Approved by: https://github.com/zdevito
2023-03-16 23:45:43 +00:00
Zachary DeVito
e74f70d212 Revert "Revert "[memory profiling] add a facility to gather combined C++/Python/TorchScript stack traces. (#95541)"" (#96878)
This reverts commit e1ea584b1c.
Adds __has_include check to fix fbcode build.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96878
Approved by: https://github.com/ezyang
2023-03-16 04:12:54 +00:00
PyTorch MergeBot
e1ea584b1c Revert "[memory profiling] add a facility to gather combined C++/Python/TorchScript stack traces. (#95541)"
This reverts commit 4e1060c609.

Reverted https://github.com/pytorch/pytorch/pull/95541 on behalf of https://github.com/DanilBaibak due to breaking internal builds
2023-03-15 13:28:41 +00:00
Zachary DeVito
4e1060c609 [memory profiling] add a facility to gather combined C++/Python/TorchScript stack traces. (#95541)
This refactors the stack trace facility specific to memory profiling
    in python+cuda to make a generic facility to generate combined stack
    traces.

    The generic facility (combined_traceback.h) does not require
    python to be around to work, but will return python stacks if it is
    present.

    This facility is then used to add support for stack trace gathering in memory profiling that
    happens directly from C++.

    It is also used to expose a python API for gathering and symbolizing
    combined stacks.
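
A hedged usage sketch of the Python side (these are private `torch.cuda.memory` helpers whose exact signatures have shifted across versions):

```python
import torch

torch.cuda.memory._record_memory_history(True)    # start recording allocation stack traces
x = torch.randn(1024, 1024, device="cuda")
snap = torch.cuda.memory._snapshot()               # stacks can now include C++ frames when available
del x
torch.cuda.memory._record_memory_history(False)    # stop recording
print(len(snap["segments"]), "segments captured")
```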

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95541
Approved by: https://github.com/ezyang
2023-03-14 18:26:05 +00:00
Elias Ellison
da265652d6 Return Live Data Pointers from Checkpoint, swap onto tensors (#95020)
When we checkpoint the state of the private pool allocator, we will need to make sure that its current live allocated blocks get properly cleaned up when the tensors they correspond to die. Return DataPtrs for these newly allocated blocks that the callee can swap onto live Tensors.

The exact api for setting the checkpoint can be manipulated after this as the cudagraph implementation is built out, but this at least shows it's sufficiently general.

This should be the last PR touching cuda caching allocator necessary for new cudagraphs integration.

Differential Revision: [D43999888](https://our.internmc.facebook.com/intern/diff/D43999888)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95020
Approved by: https://github.com/zdevito
2023-03-14 01:22:19 +00:00
Elias Ellison
1cc32aedb0 Handle additional live allocations not in checkpointed state (#94943)
We choose to ignore certain blocks that are currently allocated when we set the pool to its checkpoint. For those blocks, we need to swap out the deleter function so that a deallocation is not triggered when they die.

Differential Revision: [D43999886](https://our.internmc.facebook.com/intern/diff/D43999886)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94943
Approved by: https://github.com/zdevito
2023-03-14 01:00:47 +00:00
Elias Ellison
d798de2b05 Checkpoint CUDA Allocator Private Pool State (#94653)
Copying note from cuda caching allocator:

```
   * Note [Checkpointing PrivatePoolState]
   *
   * Refer above to Note [Interaction with CUDA graph capture]. Allocations made
   * during graph capture are made from a separate private pool. During graph
   * capture allocations behave as usual. During graph replay the allocator
   * state does not change even as new tensors are created. The private pool
   * will not free its blocks to the main caching allocator until cuda graph use
   * is finished to prevent an allocation from eager clobbering the memory from
   * a live but unaccounted for tensor that was created during replay.
   *
   * `make_graphed_callables`, a series of separate callables chained in
   * successive cuda graphs, can share a memory pool because after a cuda graph
   * recording the allocations in the shared private pool exactly reflect the
   * tensors that are allocated.
   *
   * We would like to extend callable chaining to support a graphed callable
   * tree. In this scenario, we have a tree of callable chains which will be
   * captured with cuda graphs. In the diagram below, we have a tree with four
   * callables, A, B, C, and D. Suppose we have captured, and subsequently
   * replayed, A, B, and C. Then on a new invocation, we replay A and B, but
   * would now like to record D. At this point the private pool will not reflect
   * any of the live tensors created during graph replay. Allocations made
   * during a new recording with the pool could overwrite those live tensors.
   *
   * In order to record a new graph capture after replaying prior callables in
   * the tree, we need the allocator to reflect the state of the live tensors.
   * We checkpoint the state of the private pool after each recording, and then
   * reapply it when we are starting a new recording chain. Additionally, we
   * must free the allocations for any tensors that died between the end of our
   * previous graph replaying and our new recording (TODO). All of the allocated
   * segments that existed in the checkpointed state must still exist in the
   * pool. There may also exist new segments, which we will free (TODO : link
   * note [live tensors between iterations] when it exists).
   *
   *
   *  ---------------> A ---------------> B ---------------> C
   *                                |
   *                                |
   *                                |
   *                                |
   *                                  ---------------> D
```
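
A hedged illustration (not from this PR) of the chained-callables case the note describes, using the public `torch.cuda.make_graphed_callables` API:

```python
import torch

lin1 = torch.nn.Linear(16, 16).cuda()
lin2 = torch.nn.Linear(16, 16).cuda()
x = torch.randn(8, 16, device="cuda", requires_grad=True)

# the two captured graphs share one private memory pool, as described above
graphed_lin1, graphed_lin2 = torch.cuda.make_graphed_callables((lin1, lin2), ((x,), (x,)))

y = graphed_lin2(graphed_lin1(x))  # replays both captured graphs
y.sum().backward()
```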

A few TODOs:
- need to add logic for freeing tensors that have died between a last replay and current new recording
- Add logic for handling a free that might be called on the same pointer multiple times (because we are manually freeing live tensors)

The two scenarios above have not been exercised in the tests yet.

Differential Revision: [D43999889](https://our.internmc.facebook.com/intern/diff/D43999889)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94653
Approved by: https://github.com/zdevito
2023-03-14 00:47:30 +00:00
Zachary DeVito
4b372e3958 [memory profiling] C++ tracing support (#95357)
Adds the ability to quickly generate stack traces for C++,
and combine Python, TorchScript, and C++ frames into a single trace.

This makes it possible for the memory tracer to record allocations inside
C++ code (e.g. convolution temporaries, backward operators).

The unwinder code is ~10x faster than execinfo.h's backtrace because it
caches fast unwinder routines for instruction pointers that have already been seen.
It is also only 1.2--2x slower than copying the entire stack (the approach perf takes),
while using 2 orders of magnitude less space per stack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95357
Approved by: https://github.com/bertmaher
2023-03-12 07:24:14 +00:00
Zachary DeVito
266089a3fe [memory snapshots] record scripted stack traces (#95356)
Adds support for seeing both python and script stack traces in memory
debugging.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95356
Approved by: https://github.com/aaronenyeshi
2023-03-12 07:24:14 +00:00
Zachary DeVito
d6d8d3484e _memory_viz.py: Visualize how blocks fit into segments. (#91336)
Add a segment_plot command that visualizes how blocks are allocated into segments.
This is similar to the 'stats' command but produces an interactive html viewer rather
than a text dump, allowing exploration of stack traces.

It also adds the ability to see the layout at any point in the trace by starting from the
snapshot and then applying the events backwards to reconstruct what memory would have looked like.

Example:
![Screen Shot 2022-12-22 at 3 32 49 PM](https://user-images.githubusercontent.com/370202/209242650-b952372e-37ac-400a-a01c-13be2b5426fa.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91336
Approved by: https://github.com/bhosmer
2023-03-07 21:07:18 +00:00
Zachary DeVito
71f369092d Revert "Revert "memory viz: Add colors for categories and a legend (#90587)"" (#96133)
This reverts commit b38b39c441.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96133
Approved by: https://github.com/bhosmer
2023-03-07 21:07:18 +00:00
Catherine Lee
eea0733045 Reduce pytest blocklist (#96016)
`TestCase = object` or variations of it get switched to `TestCase = NoTest`.

unittest collects tests based on subclassing unittest.TestCase, so setting TestCase = object removes a class from unittest test collection. pytest collects based on name (https://docs.pytest.org/en/7.1.x/reference/reference.html#confval-python_classes) but can be told to ignore a class (bottom of https://docs.pytest.org/en/7.1.x/example/pythoncollection.html#changing-naming-conventions).
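
A hedged sketch of the resulting pattern (assuming `NoTest` as provided by `torch.testing._internal.common_utils`; the test class name here is made up):

```python
import sys
import torch
from torch.testing._internal.common_utils import NoTest, TestCase, run_tests

if not torch.cuda.is_available():
    print("CUDA not available, skipping tests", file=sys.stderr)
    TestCase = NoTest  # noqa: F811  -- neither unittest nor pytest collects the class below now

class TestMyCudaFeature(TestCase):
    def test_something(self):
        self.assertTrue(torch.cuda.is_available())

if __name__ == "__main__":
    run_tests()
```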
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96016
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn
2023-03-07 18:30:27 +00:00
Eli Uriegas
b38b39c441 Revert "memory viz: Add colors for categories and a legend (#90587)"
This reverts commit ee43842505.
2023-03-06 11:38:58 -08:00
Zachary DeVito
ee43842505 memory viz: Add colors for categories and a legend (#90587)
Adds a category legend to memory trace plots that colors allocations by their role (activation, parameter, gradient, etc.) as captured by kineto.

Differential Revision: [D43757381](https://our.internmc.facebook.com/intern/diff/D43757381)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90587
Approved by: https://github.com/aaronenyeshi
2023-03-03 20:42:22 +00:00
Mark Saroufim
9f707f164e Add more GPU metric instrumentation (#91717)
Fixes https://github.com/pytorch/serve/issues/1937

A fairly common query I see folks running while using pytorch is

`nvidia-smi --format=csv,noheader,nounits --query-gpu=utilization.gpu,utilization.memory,memory.total,memory.used,temperature.gpu,power.draw,clocks.current.sm,clocks.current.memory -l 10`

Existing metrics we have
* For kernel utilization we have `torch.cuda.utilization()`
* For memory utilization we have the stats under `torch.cuda.memory`, e.g. the memory allocated with `torch.cuda.memory.memory_allocated()`
* For total available memory we have `torch.cuda.get_device_properties(0).total_memory`

Which means the only metrics we're missing are
* Temperature: now in `torch.cuda.temperature()`
* Power draw: now in `torch.cuda.power()`
* Clock speed: now in `torch.cuda.clock_speed()`

With some important details on each

* Clock speed settings: I picked the SM clock domain which is documented here https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceEnumvs.html#group__nvmlDeviceEnumvs_1g805c0647be9996589fc5e3f6ff680c64
* Temperature: I use `pynvml.nvmlDeviceGetTemperature(handle, 0)` where 0 refers to the GPU die temperature
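
A hedged usage sketch of the metrics discussed above; in released builds the new ones are assumed to appear as `torch.cuda.temperature()`, `torch.cuda.power_draw()` and `torch.cuda.clock_rate()` (they wrap pynvml, so it must be installed):

```python
import torch

if torch.cuda.is_available():
    print("kernel utilization :", torch.cuda.utilization())
    print("memory allocated   :", torch.cuda.memory.memory_allocated())
    print("total memory       :", torch.cuda.get_device_properties(0).total_memory)
    print("temperature (C)    :", torch.cuda.temperature())  # GPU die temperature
    print("power draw         :", torch.cuda.power_draw())
    print("SM clock           :", torch.cuda.clock_rate())
```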
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91717
Approved by: https://github.com/ngimel
2023-02-24 00:38:03 +00:00
Pearu Peterson
cece63f197 Add warn-once deprecation warning to legacy sparse constructors (#94850)
Addresses https://github.com/pytorch/pytorch/issues/68323#issuecomment-1425174341

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94850
Approved by: https://github.com/amjames, https://github.com/cpuhrsch
2023-02-23 15:05:12 +00:00
puririshi98
8aa34602f7 Jetson Update for CI Redo (#94549)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94549
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-02-21 17:13:38 +00:00
dllehr-amd
98012e4a59 [ROCm] hipGraph support for pytorch mainline (#88202)
With the release of ROCm 5.3, HIP now supports a hipGraph implementation.

All necessary backend work and hipification is done to support the same functionality as cudaGraph.

Unit tests are modified to support a new TEST_GRAPH feature which allows us to create a single check for graph support instead of attempting to gather the CUDA level in annotations for every graph test.
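
A hedged sketch of the shared graph API that now runs on ROCm as well (the usual warmup-then-capture recipe from the CUDA graphs docs):

```python
import torch

g = torch.cuda.CUDAGraph()
static_in = torch.randn(8, device="cuda")

# warm up on a side stream before capture
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    _ = static_in * 2
torch.cuda.current_stream().wait_stream(s)

with torch.cuda.graph(g):
    static_out = static_in * 2

static_in.copy_(torch.randn(8, device="cuda"))
g.replay()  # static_out now reflects the updated static_in
print(static_out)
```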

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88202
Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/malfet
2023-02-14 22:18:56 +00:00
Xuehai Pan
b005ec62b9 [BE] Remove dependency on six and future (#94709)
Remove the Python 2 and 3 compatibility libraries [six](https://pypi.org/project/six) and [future](https://pypi.org/project/future) and `torch._six`. We only support Python 3.8+ now. It's time to retire them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94709
Approved by: https://github.com/malfet, https://github.com/Skylion007
2023-02-14 09:14:14 +00:00
Xuehai Pan
046e88a291 [BE] [3/3] Rewrite super() calls in test (#94592)
Rewrite Python built-in class `super()` calls. Only non-semantic changes should be applied.

- #94587
- #94588
- #94592

Also, methods with only a `super()` call are removed:

```diff
class MyModule(nn.Module):
-   def __init__(self):
-       super().__init__()
-
    def forward(self, ...):
        ...
```

Cases where the rewrite would change the semantics are kept unchanged. E.g.:

f152a79be9/caffe2/python/net_printer.py (L184-L190)

f152a79be9/test/test_jit_fuser_te.py (L2628-L2635)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94592
Approved by: https://github.com/ezyang, https://github.com/seemethere
2023-02-12 22:20:53 +00:00
c-odrin
54b7c7d5e9 Added requested_bytes to CUDA Caching Allocator Stats (#88575)
Summary:
The caching allocator can be configured to round memory allocations in order to reduce fragmentation. Sometimes, however, the overhead from rounding can be higher than the fragmentation it helps reduce.

We have added a new stat to CUDA caching allocator stats to help track if rounding is adding too much overhead and help tune the roundup_power2_divisions flag:
    - "requested_bytes.{current,peak,allocated,freed}": memory requested by client code, compare this with allocated_bytes to check if allocation rounding adds too much overhead

Test Plan: Added test case in caffe2/test/test_cuda.py

Differential Revision: D40810674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88575
Approved by: https://github.com/zdevito
2023-02-09 21:37:25 +00:00
Masaki Kozuki
6ba041fcae Look up group["capturable"], not defaults["capturable"] in Adam(W) (#94149)
We could set different values in each `param_group` when calling the dunder init of `torch.optim` optimizers, as in e.g. https://github.com/pytorch/pytorch/issues/89987.

So check whether or not `capturable` is `True` among all the `param_group`s.
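
A hedged sketch of the situation this fixes — `capturable` set on one group but not in the defaults:

```python
import torch

w1 = torch.nn.Parameter(torch.randn(2, device="cuda"))
w2 = torch.nn.Parameter(torch.randn(2, device="cuda"))
opt = torch.optim.Adam([
    {"params": [w1], "capturable": True},  # group-level override
    {"params": [w2]},                      # falls back to the default (False)
])

# defaults["capturable"] is False here; the correct check spans all groups
assert opt.defaults["capturable"] is False
assert any(group["capturable"] for group in opt.param_groups)
```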
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94149
Approved by: https://github.com/albanD
2023-02-07 00:24:35 +00:00
Masaki Kozuki
4207d3c330 FusedAdam(W) should take OptState into account before unscaling grads (#94060)
The optimizers have to consult `OptState` before unscaling gradients because we could call `GradScaler.unscale_` explicitly, e.g. for `clip_grad_norm_`, as mentioned in e52786f3d1/torch/cuda/amp/grad_scaler.py (L235-L266) and https://pytorch.org/docs/stable/notes/amp_examples.html#working-with-unscaled-gradients
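
A hedged sketch of that pattern — an explicit `unscale_` before gradient clipping, after which `step` must not unscale again:

```python
import torch

model = torch.nn.Linear(8, 8).cuda()
opt = torch.optim.AdamW(model.parameters(), fused=True)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(4, 8, device="cuda")
with torch.cuda.amp.autocast():
    loss = model(x).sum()
scaler.scale(loss).backward()

scaler.unscale_(opt)                                     # explicit unscale for clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
scaler.step(opt)                                         # consults OptState: grads are already unscaled
scaler.update()
```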

Related #90752

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94060
Approved by: https://github.com/albanD
2023-02-04 05:20:13 +00:00
Masaki Kozuki
a23ed38f9a [mta][foreach] Implement fused adamw (#88015)
related: https://github.com/pytorch/pytorch/issues/68041, https://github.com/pytorch/pytorch/issues/71274, https://github.com/pytorch/pytorch/issues/80167
possibly related to https://github.com/pytorch/pytorch/issues/80595#issuecomment-1178519436

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88015
Approved by: https://github.com/albanD, https://github.com/ngimel
2023-02-01 19:32:29 +00:00
albanD
d8aa68c683 make sure that our error handling runs with the GIL enabled (#92848)
Fixes https://github.com/pytorch/pytorch/issues/92684

I checked the other use cases of this API and they never release the GIL.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92848
Approved by: https://github.com/ngimel
2023-01-24 09:30:42 +00:00
eqy
fb38b9ff2a [cuBLAS][TF32] Fix TF32 get/set test when TORCH_ALLOW_TF32_CUBLAS_OVERRIDE is set (#92052)
Follow up of #85859 to fix the test for when the environment variable is set.

CC @xwang233 @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92052
Approved by: https://github.com/ngimel
2023-01-12 05:36:06 +00:00
eqy
97ff20d722 [cuBLAS] (re-open) Fix default cuBLAS workspace size and parsing for multiple workspaces (#91564)
re-open of #89027
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91564
Approved by: https://github.com/ngimel
2023-01-03 23:48:15 +00:00
PyTorch MergeBot
39d49dbe45 Revert "[cuBLAS] Fix default cuBLAS workspace size and parsing for multiple workspaces (#89027)"
This reverts commit b407d98dbe.

Reverted https://github.com/pytorch/pytorch/pull/89027 on behalf of https://github.com/kit1980 due to Fails test_cublas_workspace_explicit_allocation on ROCm
2022-12-31 23:04:57 +00:00
eqy
b407d98dbe [cuBLAS] Fix default cuBLAS workspace size and parsing for multiple workspaces (#89027)
Follow-up of #86167; the number of pools was mistakenly ignored, and the default workspace size appears to be too small to match selected cuBLAS kernels before the explicit allocation change.

CC @ptrblck @ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89027
Approved by: https://github.com/ngimel
2022-12-31 06:58:04 +00:00
lezcano
484dd40022 Implement PReLU in a compositional way (#91238)
The PReLU implementation was all over the place. This led to a number
of bugs like https://github.com/pytorch/pytorch/issues/68760.  We fix it by:
- Keeping the weird broadcasting logic it has as a CompositeImplicit kernel that calls into a second kernel
- This second kernel is just a good-ol' pointwise kernel.
- We implement the derivative for the pointwise kernel via TI as well for speed.
- We implement the second derivative for the pointwise kernel and the forward AD derivatives compositionally

This fixes a number of issues:
- We don't perform copies any more when the inputs are not contiguous
- The derivatives are now correct
- We fix vmap and many other functorch-related issues.
- CPU and CUDA now share the relevant broadcasting logic
- The implementation is about 1/3 the length.
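
A hedged sketch (not from this PR) exercising the points above — a non-contiguous input plus first- and second-derivative checks:

```python
import torch

prelu = torch.nn.PReLU(num_parameters=3, init=0.25).double()
x = torch.randn(2, 4, 3, dtype=torch.double, requires_grad=True)

# feed a non-contiguous view; no extra copies should be needed after this change
fn = lambda t: prelu(t.transpose(1, 2))   # (2, 4, 3) -> (2, 3, 4), channels at dim 1
torch.autograd.gradcheck(fn, (x,))
torch.autograd.gradgradcheck(fn, (x,))
```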

Fixes https://github.com/pytorch/pytorch/issues/68760
Fixes https://github.com/pytorch/pytorch/issues/89895

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91238
Approved by: https://github.com/kshitij12345, https://github.com/jbschlosser, https://github.com/albanD
2022-12-30 10:42:30 +00:00
Eddie Yan
8b617f813d [cuBLAS] Add an option to disable reduced precision reductions for BF16 GEMM (#89172)
Essentially the same change as #67946, except that the default is to disallow reduced precision reductions in `BFloat16` GEMMs (for now). If performance is severely regressed, we can change the default, but this option appears to be necessary to pass some `addmm` `BFloat16` tests on H100.
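
A hedged sketch of toggling the option (flag name assumed to be the one exposed under `torch.backends.cuda.matmul`):

```python
import torch

# with the new default, BF16 GEMM reductions accumulate in full precision
torch.backends.cuda.matmul.allow_bf16_reduced_precision_reduction = False

a = torch.randn(256, 256, device="cuda", dtype=torch.bfloat16)
b = torch.randn(256, 256, device="cuda", dtype=torch.bfloat16)
c = a @ b  # avoids reduced-precision reductions under this setting
```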

CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89172
Approved by: https://github.com/ngimel
2022-12-21 18:58:28 +00:00
Eddie Yan
dabf515c18 [cuDNN][cuDNN V8 API] (re-re-re-open) cuDNN V8 API on by default (#91117)
Re-opening following #91025

CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91117
Approved by: https://github.com/ngimel
2022-12-20 18:52:29 +00:00
PyTorch MergeBot
ba7aeac37b Revert "[cuDNN][cuDNN V8 API] (re-re-open) cuDNN V8 API on by default (#89022)"
This reverts commit eecd621f06.

Reverted https://github.com/pytorch/pytorch/pull/89022 on behalf of https://github.com/ngimel due to breaks some convolution configurations #91025
2022-12-16 23:06:35 +00:00
Rich Zhu
4372dbb89f use pytree to allow any input format for cuda graph (#90941)
Summary:
1. use pytree to allow any input format for make_graphed_callables
2. add allow_unused_input argument for make_graphed_callables

Test Plan: buck2 test mode/dev-nosan  //caffe2/test:cuda --  --print-passing-details

Differential Revision: D42077976

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90941
Approved by: https://github.com/ngimel
2022-12-16 03:01:47 +00:00
Eddie Yan
eecd621f06 [cuDNN][cuDNN V8 API] (re-re-open) cuDNN V8 API on by default (#89022)
Testing V8 on by default again after fixes have been merged for e.g., https://github.com/pytorch/torchdynamo/issues/1833

One new failure that seems to surface with V8 on appears in halonext + amp:
```
RuntimeError: Internal Triton PTX codegen error:
Segmentation fault (core dumped)
```
But I'm not sure if this points to a V8 issue or a Triton issue CC @ngimel @ptrblck

Current dynamo benchmarks on A100:
v7 vs. v8
|dev |name                           |batch_size|abs_latency_v7|abs_latency_v8|
|----|-------------------------------|----------|--------------|--------------|
|cuda|adv_inception_v3               |128       |166.0240      |165.5798      |
|cuda|beit_base_patch16_224          |64        |123.5912      |123.0797      |
|cuda|botnet26t_256                  |128       |107.7343      |107.5948      |
|cuda|cait_m36_384                   |4         |184.5038      |184.0271      |
|cuda|coat_lite_mini                 |128       |142.3061      |140.5814      |
|cuda|convit_base                    |64        |165.2499      |161.0743      |
|cuda|convmixer_768_32               |32        |325.6984      |325.7094      |
|cuda|convnext_base                  |64        |237.4632      |238.0142      |
|cuda|crossvit_9_240                 |128       |72.2980       |72.4367       |
|cuda|cspdarknet53                   |64        |96.6862       |96.8308       |
|cuda|deit_base_distilled_patch16_224|64        |117.6045      |117.9616      |
|cuda|dla102                         |128       |182.3073      |182.2304      |
|cuda|dm_nfnet_f0                    |128       |133.6011      |133.6298      |
|cuda|dpn107                         |32        |148.5080      |148.5885      |
|cuda|eca_botnext26ts_256            |128       |113.8676      |113.1514      |
|cuda|eca_halonext26ts               |128       |119.2242      |119.1845      |
|cuda|ese_vovnet19b_dw               |128       |80.0217       |79.9438       |
|cuda|fbnetc_100                     |128       |91.4548       |91.4009       |
|cuda|fbnetv3_b                      |128       |115.4496      |115.5058      |
|cuda|gernet_l                       |128       |114.8365      |114.7870      |
|cuda|ghostnet_100                   |128       |58.5766       |58.5766       |
|cuda|gluon_inception_v3             |128       |165.5222      |165.7167      |
|cuda|gluon_xception65               |32        |165.8779      |165.7818      |
|cuda|gmixer_24_224                  |128       |116.3611      |113.4925      |
|cuda|gmlp_s16_224                   |128       |121.2607      |121.2534      |
|cuda|hrnet_w18                      |128       |246.5706      |246.7599      |
|cuda|inception_v3                   |128       |166.1096      |166.2034      |
|cuda|jx_nest_base                   |32        |93.6064       |93.4088       |
|cuda|lcnet_050                      |128       |21.4156       |21.4207       |
|cuda|levit_128                      |128       |27.2901       |27.2543       |
|cuda|mixer_b16_224                  |128       |157.8992      |158.2878      |
|cuda|mixnet_l                       |128       |197.3443      |197.2125      |
|cuda|mnasnet_100                    |128       |71.4604       |71.2997       |
|cuda|mobilenetv2_100                |128       |67.6080       |67.7515       |
|cuda|mobilenetv3_large_100          |128       |57.7224       |57.6591       |
|cuda|mobilevit_s                    |64        |93.0372       |93.0530       |
|cuda|nfnet_l0                       |128       |113.1664      |113.2853      |
|cuda|pit_b_224                      |64        |133.3333      |133.4153      |
|cuda|pnasnet5large                  |16        |238.9545      |238.8122      |
|cuda|poolformer_m36                 |64        |144.2353      |144.2375      |
|cuda|regnety_002                    |128       |32.8534       |32.9069       |
|cuda|repvgg_a2                      |128       |102.4150      |102.3827      |
|cuda|res2net101_26w_4s              |64        |120.8127      |120.8322      |
|cuda|res2net50_14w_8s               |128       |149.7052      |149.8969      |
|cuda|res2next50                     |128       |153.7439      |153.8215      |
|cuda|resmlp_12_224                  |128       |89.1918       |86.9226       |
|cuda|resnest101e                    |64        |159.4706      |159.3133      |
|cuda|rexnet_100                     |128       |88.0032       |88.0397       |
|cuda|sebotnet33ts_256               |64        |80.4635       |80.0120       |
|cuda|selecsls42b                    |128       |70.4430       |70.3663       |
|cuda|spnasnet_100                   |128       |78.0537       |78.1991       |
|cuda|swin_base_patch4_window7_224   |64        |212.9073      |213.0824      |
|cuda|swsl_resnext101_32x16d         |32        |193.0229      |193.0404      |
|cuda|tf_efficientnet_b0             |128       |97.1316       |97.0410       |
|cuda|tf_mixnet_l                    |128       |203.4956      |203.5340      |
|cuda|tinynet_a                      |128       |82.4038       |82.8733       |
|cuda|tnt_s_patch16_224              |128       |284.8576      |284.8867      |
|cuda|twins_pcpvt_base               |64        |118.3893      |119.2329      |
|cuda|visformer_small                |128       |126.0533      |126.0390      |
|cuda|vit_base_patch16_224           |64        |118.2873      |118.0573      |
|cuda|volo_d1_224                    |64        |108.7764      |108.2063      |
|cuda|xcit_large_24_p8_224           |5         |100.4656      |100.5209      |

v7 vs. v8 amp

|dev |name                           |batch_size|abs_latency_v7|abs_latency_v8|
|----|-------------------------------|----------|--------------|--------------|
|cuda|adv_inception_v3               |128       |104.9729      |105.1237      |
|cuda|beit_base_patch16_224          |64        |75.4330       |75.2039       |
|cuda|botnet26t_256                  |128       |74.5149       |74.8071       |
|cuda|cait_m36_384                   |4         |110.9788      |111.5170      |
|cuda|coat_lite_mini                 |128       |62.3618       |64.4965       |
|cuda|convit_base                    |64        |116.4054      |117.9129      |
|cuda|convmixer_768_32               |32        |264.4401      |264.4491      |
|cuda|convnext_base                  |64        |182.9009      |179.2136      |
|cuda|crossvit_9_240                 |128       |48.8586       |48.8359       |
|cuda|cspdarknet53                   |64        |80.0245       |80.0160       |
|cuda|deit_base_distilled_patch16_224|64        |66.5921       |66.7448       |
|cuda|dla102                         |128       |116.7780      |117.1683      |
|cuda|dm_nfnet_f0                    |128       |78.9322       |79.1135       |
|cuda|dpn107                         |32        |85.5206       |85.7514       |
|cuda|eca_botnext26ts_256            |128       |76.3672       |77.0050       |
|cuda|eca_halonext26ts               |128       |86.2458       |              |
|cuda|ese_vovnet19b_dw               |128       |43.2943       |43.3379       |
|cuda|fbnetc_100                     |128       |54.8479       |54.9251       |
|cuda|fbnetv3_b                      |128       |70.7504       |71.0188       |
|cuda|gernet_l                       |128       |66.1607       |66.0379       |
|cuda|ghostnet_100                   |128       |43.8882       |43.9336       |
|cuda|gluon_inception_v3             |128       |104.9297      |105.0204      |
|cuda|gluon_xception65               |32        |85.7118       |85.8370       |
|cuda|gmixer_24_224                  |128       |75.1214       |76.1170       |
|cuda|gmlp_s16_224                   |128       |76.4207       |76.6641       |
|cuda|hrnet_w18                      |128       |186.1326      |186.2435      |
|cuda|inception_v3                   |128       |105.0561      |105.0783      |
|cuda|jx_nest_base                   |32        |65.3066       |65.3245       |
|cuda|lcnet_050                      |128       |14.7991       |14.8687       |
|cuda|levit_128                      |128       |19.2893       |19.4772       |
|cuda|mixer_b16_224                  |128       |93.9826       |94.2056       |
|cuda|mixnet_l                       |128       |147.1245      |147.0435      |
|cuda|mnasnet_100                    |128       |39.1781       |39.2565       |
|cuda|mobilenetv2_100                |128       |42.3704       |42.3114       |
|cuda|mobilenetv3_large_100          |128       |37.2946       |37.2816       |
|cuda|mobilevit_s                    |64        |55.8930       |55.8934       |
|cuda|nfnet_l0                       |128       |64.0448       |64.4438       |
|cuda|pit_b_224                      |64        |80.6342       |80.2933       |
|cuda|pnasnet5large                  |16        |154.9611      |154.8654      |
|cuda|poolformer_m36                 |64        |101.7489      |101.8138      |
|cuda|regnety_002                    |128       |27.0939       |27.0309       |
|cuda|repvgg_a2                      |128       |60.9651       |61.2533       |
|cuda|res2net101_26w_4s              |64        |77.3291       |77.4739       |
|cuda|res2net50_14w_8s               |128       |93.6572       |93.7221       |
|cuda|res2next50                     |128       |112.4975      |112.3248      |
|cuda|resmlp_12_224                  |128       |59.5422       |60.7644       |
|cuda|resnest101e                    |64        |97.9894       |98.3358       |
|cuda|rexnet_100                     |128       |55.2218       |55.0718       |
|cuda|sebotnet33ts_256               |64        |60.4880       |60.8113       |
|cuda|selecsls42b                    |128       |41.4294       |41.5341       |
|cuda|spnasnet_100                   |128       |45.0037       |45.0304       |
|cuda|swin_base_patch4_window7_224   |64        |98.2561       |98.6925       |
|cuda|swsl_resnext101_32x16d         |32        |100.6179      |100.9195      |
|cuda|tf_efficientnet_b0             |128       |56.5344       |56.4591       |
|cuda|tf_mixnet_l                    |128       |153.0318      |152.9367      |
|cuda|tinynet_a                      |128       |54.1307       |53.9298       |
|cuda|tnt_s_patch16_224              |128       |142.4801      |142.6589      |
|cuda|twins_pcpvt_base               |64        |67.9027       |67.8325       |
|cuda|visformer_small                |128       |72.5589       |72.9427       |
|cuda|vit_base_patch16_224           |64        |71.4885       |71.7342       |
|cuda|volo_d1_224                    |64        |69.3539       |69.5910       |
|cuda|xcit_large_24_p8_224           |5         |59.9000       |59.9699       |

v7 vs. v8 float16
|dev |name                           |batch_size|abs_latency_v7|abs_latency_v8|
|----|-------------------------------|----------|--------------|--------------|
|cuda|adv_inception_v3               |128       |104.2544   |104.2677   |
|cuda|beit_base_patch16_224          |64        |85.3601    |85.3786    |
|cuda|botnet26t_256                  |128       |72.1476    |71.8277    |
|cuda|cait_m36_384                   |4         |108.3075   |108.5941   |
|cuda|coat_lite_mini                 |128       |61.2382    |61.6049    |
|cuda|convmixer_768_32               |32        |263.3818   |263.3598   |
|cuda|convnext_base                  |64        |172.6821   |173.8520   |
|cuda|crossvit_9_240                 |128       |44.6321    |44.6340    |
|cuda|cspdarknet53                   |64        |79.3165    |79.2964    |
|cuda|deit_base_distilled_patch16_224|64        |61.9816    |62.2109    |
|cuda|dla102                         |128       |115.7403   |115.9928   |
|cuda|dm_nfnet_f0                    |128       |77.5434    |77.7440    |
|cuda|dpn107                         |32        |83.6489    |83.5605    |
|cuda|eca_botnext26ts_256            |128       |73.9953    |74.1031    |
|cuda|eca_halonext26ts               |128       |81.7951    |81.7103    |
|cuda|ese_vovnet19b_dw               |128       |42.9618    |42.8853    |
|cuda|fbnetc_100                     |128       |54.3590    |54.3575    |
|cuda|fbnetv3_b                      |128       |69.7977    |70.1696    |
|cuda|gernet_l                       |128       |64.8684    |65.1726    |
|cuda|ghostnet_100                   |128       |43.2054    |43.1319    |
|cuda|gluon_inception_v3             |128       |104.1988   |104.3030   |
|cuda|gluon_xception65               |32        |84.2245    |84.5085    |
|cuda|gmixer_24_224                  |128       |82.0418    |82.7252    |
|cuda|gmlp_s16_224                   |128       |75.4792    |75.8374    |
|cuda|hrnet_w18                      |128       |184.1450   |184.1848   |
|cuda|inception_v3                   |128       |104.1203   |104.2536   |
|cuda|jx_nest_base                   |32        |58.2386    |58.4901    |
|cuda|lcnet_050                      |128       |14.6409    |14.5616    |
|cuda|levit_128                      |128       |22.3875    |22.4680    |
|cuda|mixer_b16_224                  |128       |98.9534    |98.4730    |
|cuda|mixnet_l                       |128       |146.1623   |146.1947   |
|cuda|mnasnet_100                    |128       |38.9208    |39.3463    |
|cuda|mobilenetv2_100                |128       |41.8946    |41.9847    |
|cuda|mobilenetv3_large_100          |128       |36.7810    |36.8264    |
|cuda|mobilevit_s                    |64        |55.3211    |55.3186    |
|cuda|nfnet_l0                       |128       |63.1302    |63.5544    |
|cuda|pit_b_224                      |64        |73.8752    |73.4602    |
|cuda|pnasnet5large                  |16        |151.6806   |151.6111   |
|cuda|poolformer_m36                 |64        |86.8341    |86.8021    |
|cuda|regnety_002                    |128       |26.6798    |26.5295    |
|cuda|repvgg_a2                      |128       |61.6652    |62.1482    |
|cuda|res2net101_26w_4s              |64        |75.8037    |75.7739    |
|cuda|res2net50_14w_8s               |128       |92.6362    |92.4338    |
|cuda|res2next50                     |128       |111.5371   |111.5832   |
|cuda|resmlp_12_224                  |128       |58.2349    |57.9807    |
|cuda|resnest101e                    |64        |96.1114    |96.2742    |
|cuda|rexnet_100                     |128       |54.8138    |54.7643    |
|cuda|sebotnet33ts_256               |64        |53.1524    |53.3823    |
|cuda|selecsls42b                    |128       |40.6070    |40.7104    |
|cuda|spnasnet_100                   |128       |44.5732    |44.4318    |
|cuda|swin_base_patch4_window7_224   |64        |98.6447    |98.8445    |
|cuda|swsl_resnext101_32x16d         |32        |97.0195    |97.2968    |
|cuda|tf_efficientnet_b0             |128       |56.0640    |56.0278    |
|cuda|tf_mixnet_l                    |128       |152.0958   |152.0874   |
|cuda|tinynet_a                      |128       |53.3694    |53.3762    |
|cuda|tnt_s_patch16_224              |128       |130.2981   |130.3726   |
|cuda|twins_pcpvt_base               |64        |62.5459    |62.6416    |
|cuda|visformer_small                |128       |68.8502    |69.1756    |
|cuda|vit_base_patch16_224           |64        |65.8587    |66.0285    |
|cuda|volo_d1_224                    |64        |64.5348    |64.6057    |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89022
Approved by: https://github.com/ngimel
2022-12-15 03:24:44 +00:00
PyTorch MergeBot
cba96366a2 Revert "remove torch.equal usages (#89527)"
This reverts commit 4095ef8b80.

Reverted https://github.com/pytorch/pytorch/pull/89527 on behalf of https://github.com/clee2000 due to broke periodic multigpu tests 4095ef8b80 https://github.com/pytorch/pytorch/actions/runs/3592806602/jobs/6049368502
2022-12-02 21:36:13 +00:00
Philip Meier
4095ef8b80 remove torch.equal usages (#89527)
Preparation for the next PR in this stack: #89559.

I replaced

- `self.assertTrue(torch.equal(...))` with `self.assertEqual(..., rtol=0, atol=0, exact_device=True)`,
- the same for `self.assertFalse(...)` with `self.assertNotEqual(...)`, and
- `assert torch.equal(...)` with `torch.testing.assert_close(..., rtol=0, atol=0)` (note that we don't need to set `check_device=True` here since that is the default).

There were a few instances where the result of `torch.equal` was used directly. In those cases I've replaced it with `(... == ...).all().item()`, sometimes also dropping the `.item()` depending on the context.
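
A hedged before/after sketch of the replacements described above:

```python
import torch
from torch.testing import assert_close

a = torch.tensor([1.0, 2.0])
b = torch.tensor([1.0, 2.0])

# before: assert torch.equal(a, b)
assert_close(a, b, rtol=0, atol=0)   # exact match, with a useful mismatch message

# when the boolean result itself is needed:
same = (a == b).all().item()
```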

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89527
Approved by: https://github.com/mruberry
2022-12-01 11:22:52 +00:00
Aidyn-A
0057be3361 [CUDA graphs] Add warning if captured graph is empty (#88754)
Fixes #87894

This PR adds a warning if the captured graph is empty (consists of zero nodes).
An example snippet where it would be useful:

```python
import torch

x = torch.randn(10)
z = torch.zeros(10)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    z = x * x
# Warn user
```

and in #87894

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88754
Approved by: https://github.com/ezyang
2022-11-28 23:20:19 +00:00
Nikita Shulga
da2afcb1e0 Add test for out-of-bounds Tensor access on GPU (#39211)
Since the CUDA context cannot recover safely from an on-device assert, use `torch.multiprocessing.spawn` to execute the method in a separate process and verify that it raises an unrecoverable error.

As these tests are pretty slow (6 seconds on a powerful Linux box with one GPU), run it only in the slow shard.
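
A hedged sketch of the approach — trigger the device-side assert in a child process so the parent's CUDA context stays healthy:

```python
import torch
import torch.multiprocessing as mp

def _index_out_of_bounds(rank):
    x = torch.zeros(4, device="cuda")
    idx = torch.tensor([10], device="cuda")  # out of bounds -> device-side assert
    print(x[idx])                            # sync raises "device-side assert triggered"

if __name__ == "__main__":
    try:
        mp.spawn(_index_out_of_bounds, nprocs=1)
    except Exception as e:                   # the child dies; the parent keeps a usable context
        print("child failed as expected:", type(e).__name__)
    print(torch.ones(1, device="cuda"))      # still works in the parent
```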

Closes https://github.com/pytorch/pytorch/issues/38944

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39211
Approved by: https://github.com/ezyang
2022-11-15 21:06:02 +00:00
PyTorch MergeBot
d98a884b33 Revert "[cuDNN] (re-open) Enable cuDNN Frontend v8 API by Default (#87669)"
This reverts commit 3c6bddc3f6.

Reverted https://github.com/pytorch/pytorch/pull/87669 on behalf of https://github.com/eqy due to investigating convnext benchmark regressions
2022-11-08 19:04:25 +00:00