pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
PyTorch MergeBot	123d9ec5bf	Revert "Loads .pyd instead of .so in MemPool test for windows (#132749 )" This reverts commit `37ab0f3385`. Reverted https://github.com/pytorch/pytorch/pull/132749 on behalf of https://github.com/syed-ahmed due to Seems like periodic is still failing: `7c79e89bc5` ([comment](https://github.com/pytorch/pytorch/pull/132749#issuecomment-2274041302))	2024-08-07 18:08:44 +00:00
Syed Tousif Ahmed	37ab0f3385	Loads .pyd instead of .so in MemPool test for windows (#132749 ) Fixes #132650 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132749 Approved by: https://github.com/albanD	2024-08-07 09:58:52 +00:00
albanD	9a1ad3345f	Fix periodic windows test (#132648 ) This test has failed to clean up folders on Windows for the past week; see `27f61eba58` for an example. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132648 Approved by: https://github.com/janeyx99, https://github.com/zou3519, https://github.com/malfet	2024-08-05 20:54:20 +00:00
Xuehai Pan	4226ed1585	[BE] Format uncategorized Python files with `ruff format` (#132576 ) Remove patterns `**`, `test/**`, and `torch/**` in `tools/linter/adapters/pyfmt_linter.py` and run `lintrunner`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132576 Approved by: https://github.com/ezyang, https://github.com/Skylion007 ghstack dependencies: #132574	2024-08-04 17:13:31 +00:00
Oguz Ulgen	221350e3a4	Add None return type to init -- tests (#132352 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132352 Approved by: https://github.com/ezyang ghstack dependencies: #132335, #132351	2024-08-01 15:44:51 +00:00
Syed Tousif Ahmed	7c89ec0f7c	Implements torch.cuda.MemPool() API (#131152 ) In this PR: - Pool id creation logic is refactored and moved to a MemPool class. `graph_pool_handle()` API now uses `torch.cuda.MemPool()` to get a unique id for a pool. Existing tests should cover this change. - MemPool holds a pointer to a CUDAAllocator as proposed in https://github.com/pytorch/pytorch/issues/124807#issuecomment-2077506997. Tests are added to show usage with CUDAPluggableAllocator. - MemPoolContext API makes a mempool active. Tests are added to show usage of this API. This API will be used in CUDACachingAllocator to route allocations to a user provided allocator. See draft here: https://github.com/pytorch/pytorch/pull/125722/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/131152 Approved by: https://github.com/eqy, https://github.com/ezyang	2024-08-01 01:29:30 +00:00
Aidyn-A	301ec32ae8	[EASY][TEST][CUDA] Fix typo in test_graph_make_graphed_callables_same_pool (#132059 ) Per title. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132059 Approved by: https://github.com/Skylion007	2024-07-29 19:15:37 +00:00
PyTorch MergeBot	e191b83462	Revert "Add wrappers for synchronous GPUDirect Storage APIs (#130633 )" This reverts commit `709ddf7a9d`. Reverted https://github.com/pytorch/pytorch/pull/130633 on behalf of https://github.com/clee2000 due to still failing internally D60265673 ([comment](https://github.com/pytorch/pytorch/pull/130633#issuecomment-2253239607))	2024-07-26 18:08:20 +00:00
Mikayla Gawarecki	709ddf7a9d	Add wrappers for synchronous GPUDirect Storage APIs (#130633 ) Based in part on https://github.com/NVIDIA/apex/pull/1774 Differential Revision: [D60155434](https://our.internmc.facebook.com/intern/diff/D60155434) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130633 Approved by: https://github.com/albanD	2024-07-25 22:23:38 +00:00
PyTorch MergeBot	e4b5645f83	Revert "Add wrappers for synchronous GPUDirect Storage APIs (#130633 )" This reverts commit `5b5e0698a5`. Reverted https://github.com/pytorch/pytorch/pull/130633 on behalf of https://github.com/clee2000 due to breaking a lot of jobs and build rules internally D60085885, possibly needs to update some bazel build? ([comment](https://github.com/pytorch/pytorch/pull/130633#issuecomment-2245806738))	2024-07-23 17:19:34 +00:00
Mikayla Gawarecki	5b5e0698a5	Add wrappers for synchronous GPUDirect Storage APIs (#130633 ) Based in part on https://github.com/NVIDIA/apex/pull/1774 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130633 Approved by: https://github.com/albanD	2024-07-22 14:51:24 +00:00
Xuehai Pan	ba48cf6535	[BE][Easy][6/19] enforce style for empty lines in import segments in `test/` (#129757 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129757 Approved by: https://github.com/ezyang	2024-07-17 06:42:37 +00:00
Bilal Khan	54a932b0ac	Support for expandable segments with cuda graph trees (#128068 ) This PR adds support for using expandable segments with private memory pools, which should unblock using them with CUDA graphs and CUDA graph trees. Currently, the allocator silently avoids using expandable segments when allocating in a private pool, because checkpoint saving/restoring does not mesh well with how we keep track of unmapped blocks. The PR itself is pretty short; most of the logic for checkpointing and reapplying state for non-expandable segments transfers over without much work. Expandable segments reserve a virtual address space of size equal to the amount of physical memory on the GPU. Every time we want to `malloc()` or `free()` memory in a memory pool with expandable segments turned on, we map/unmap pages of physical GPU memory under the hood to create a new block that we return to the caller. This is beneficial because each memory pool functions as a single segment of memory with a contiguous block of memory addresses that can grow and shrink as needed, avoiding fragmentation from allocating multiple non-contiguous segments that may not be merged together. The caching allocator handles this by creating an unmapped block for the entire reserved virtual address space at init, which is treated similarly to an unallocated block in a free pool. When callers call `malloc()`, it is split and mapped to create allocated blocks, and calling `free()` similarly caches and merges free blocks in a free pool to be used later. Expandable blocks are unmapped and returned to CUDA when they are cleaned up, or when we hit an OOM and the allocator attempts to remap cached free blocks. The code paths to map, free, and unmap blocks in expandable segments are similar to those for normal blocks and do all the same work of updating stats on memory usage, moving blocks between active and free pools, and returning memory to CUDA.
With CUDA Graph Trees and private memory pools, we need to checkpoint the state of the memory allocator after each graph capture and to reapply that state before capturing a new graph after replaying a captured one. The new CUDA graph capture then sees the allocator as it was right after the replayed graph, so it can reuse empty blocks and allocate new ones. As mentioned in a comment below, memory in a private pool is cached until the private pool is destroyed, and allocations can only grow with additional graph captures; freeing memory would leave invalid memory addresses and break CUDA graphs. One implementation detail for expandable segments: unmapped blocks are tracked in the `unmapped` member variable of a `BlockPool`. `unmapped` is not part of the checkpointed state of the caching allocator and is not restored when reapplying checkpoints, because we never free or unmap memory back to CUDA; it persists across graph captures and replays. Checkpointing the current state of the memory allocator works as expected with expandable segments. Checkpointing grabs the first block of every segment in the active and free pools of the private pool and traverses the linked list of blocks in the segment to capture the state of every segment, which is then saved until it needs to be reapplied. For expandable blocks, the last block in every segment will be an unallocated unmapped block containing the remaining amount of unmapped memory at graph capture time, and this too is saved in the checkpoint. Reapplying a checkpoint works by freeing all allocated blocks and merging them into a single block per segment; then, for each segment, we manually split and allocate all blocks from the checkpoint and free the blocks marked as unallocated in the checkpoint state.
For expandable segments, we need to make some modifications to not split unmapped blocks and avoid manually mapping then freeing unmapped blocks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128068 Approved by: https://github.com/eqy, https://github.com/eellison	2024-07-15 23:23:23 +00:00
Tobias Ringwald	e5de25896f	Fixed CUDA randint generation for large ranges. (#126066 ) Fixes #125224 For large ranges, calls to CUDA `randint` use a different `unroll_factor` to generate random ints. This `unroll_factor` was not considered correctly in the calculation of the Philox offsets. Thus, some of the random states were reused, resulting in lower entropy (see #125224). This also affects multiple other random functions, such as `torch.rand` and `torch.randn`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126066 Approved by: https://github.com/eqy, https://github.com/lezcano	2024-07-13 21:42:27 +00:00
eqy	60fc01d0ab	[CUDA] Don't double-destroy CUDA graph when debug dump is used (#130401 ) Repro from @eellison Could have sworn we had another PR with this fix floating around somewhere but I couldn't find it... Pull Request resolved: https://github.com/pytorch/pytorch/pull/130401 Approved by: https://github.com/Skylion007, https://github.com/eellison	2024-07-12 18:57:07 +00:00
PyTorch MergeBot	578388bed8	Revert "Support for expandable segments with cuda graph trees (#128068 )" This reverts commit `fdc83610f2`. Reverted https://github.com/pytorch/pytorch/pull/128068 on behalf of https://github.com/janeyx99 due to Reverting for breaking ROCm tests on trunk, I think the tests need to be qualified with @onlyCUDA ([comment](https://github.com/pytorch/pytorch/pull/128068#issuecomment-2223672381))	2024-07-11 18:58:13 +00:00
Bilal Khan	fdc83610f2	Support for expandable segments with cuda graph trees (#128068 ) This PR adds support for using expandable segments with private memory pools, which should unblock using them with CUDA graphs and CUDA graph trees. Currently, the allocator silently avoids using expandable segments when allocating in a private pool, because checkpoint saving/restoring does not mesh well with how we keep track of unmapped blocks. The PR itself is pretty short; most of the logic for checkpointing and reapplying state for non-expandable segments transfers over without much work. Expandable segments reserve a virtual address space of size equal to the amount of physical memory on the GPU. Every time we want to `malloc()` or `free()` memory in a memory pool with expandable segments turned on, we map/unmap pages of physical GPU memory under the hood to create a new block that we return to the caller. This is beneficial because each memory pool functions as a single segment of memory with a contiguous block of memory addresses that can grow and shrink as needed, avoiding fragmentation from allocating multiple non-contiguous segments that may not be merged together. The caching allocator handles this by creating an unmapped block for the entire reserved virtual address space at init, which is treated similarly to an unallocated block in a free pool. When callers call `malloc()`, it is split and mapped to create allocated blocks, and calling `free()` similarly caches and merges free blocks in a free pool to be used later. Expandable blocks are unmapped and returned to CUDA when they are cleaned up, or when we hit an OOM and the allocator attempts to remap cached free blocks. The code paths to map, free, and unmap blocks in expandable segments are similar to those for normal blocks and do all the same work of updating stats on memory usage, moving blocks between active and free pools, and returning memory to CUDA.
With CUDA Graph Trees and private memory pools, we need to checkpoint the state of the memory allocator after each graph capture and to reapply that state before capturing a new graph after replaying a captured one. The new CUDA graph capture then sees the allocator as it was right after the replayed graph, so it can reuse empty blocks and allocate new ones. As mentioned in a comment below, memory in a private pool is cached until the private pool is destroyed, and allocations can only grow with additional graph captures; freeing memory would leave invalid memory addresses and break CUDA graphs. One implementation detail for expandable segments: unmapped blocks are tracked in the `unmapped` member variable of a `BlockPool`. `unmapped` is not part of the checkpointed state of the caching allocator and is not restored when reapplying checkpoints, because we never free or unmap memory back to CUDA; it persists across graph captures and replays. Checkpointing the current state of the memory allocator works as expected with expandable segments. Checkpointing grabs the first block of every segment in the active and free pools of the private pool and traverses the linked list of blocks in the segment to capture the state of every segment, which is then saved until it needs to be reapplied. For expandable blocks, the last block in every segment will be an unallocated unmapped block containing the remaining amount of unmapped memory at graph capture time, and this too is saved in the checkpoint. Reapplying a checkpoint works by freeing all allocated blocks and merging them into a single block per segment; then, for each segment, we manually split and allocate all blocks from the checkpoint and free the blocks marked as unallocated in the checkpoint state.
For expandable segments, we need to make some modifications to not split unmapped blocks and avoid manually mapping then freeing unmapped blocks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128068 Approved by: https://github.com/zdevito, https://github.com/eqy	2024-07-11 05:33:09 +00:00
Jeff Willette	5c9d5272e4	fixes #124582 (#128483 ) added check for existence of outputs requiring grad to make_graphed_callables. added new test case, updated existing test case to include parameterless modules. Fixes #124582 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128483 Approved by: https://github.com/eqy, https://github.com/ezyang	2024-07-02 08:45:59 +00:00
Jack Taylor	e1b426b345	[ROCm] CUDA_VISIBLE_DEVICES fallback option for device_count (#129650 ) Updating `_parse_visible_devices` to allow use of CUDA_VISIBLE_DEVICES if HIP_VISIBLE_DEVICES is unset, to avoid any unnecessary code changes in workloads that already rely on CUDA_VISIBLE_DEVICES. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129650 Approved by: https://github.com/hongxiayang, https://github.com/malfet	2024-07-01 11:40:09 +00:00
Jeff Daily	169b4ca07e	add uuid in cudaDeviceProperties (#125083 ) Replaces #99967. Fixes #99903. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125083 Approved by: https://github.com/pruthvistony, https://github.com/albanD, https://github.com/eqy, https://github.com/malfet	2024-06-27 23:53:13 +00:00
yousufmo	305ba62906	Add support to `GradScaler` for respecting an already set `grad_scale` value (#123429 ) Fixes #123428 Co-authored-by: Yousuf Mohamed-Ahmed <youmed.tech@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123429 Approved by: https://github.com/ezyang	2024-06-27 22:40:54 +00:00
Dmitry Rogozhkin	321bdcb372	Fix device propagation for checkpointing (#128671 ) Fixes: #128478 In the backward() implementation, the checkpointing code was querying the device type from the rng_state tensors saved on forward(). These tensors are CPU-only tensors and don't carry device information with them. As a result, the CUDA device was assumed as a default, which is not correct if the user runs on some other device, for example XPU. This patch saves full device information on forward() and uses it on backward() to get the device type. Previously, forward() saved only the device index. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128671 Approved by: https://github.com/guangyey, https://github.com/soulitzer	2024-06-27 17:14:13 +00:00
Fuzzkatt	4ca8eecca4	skip test_graph_capture_oom for jetson (#128661 ) On Jetson IGX, `python test/test_cuda.py -k test_graph_capture_oom` fails with the following error: ``` RuntimeError: NVML_SUCCESS == r INTERNAL ASSERT FAILED at "/opt/pytorch/pytorch/c10/cuda/CUDACachingAllocator.cpp":841, please report a bug to PyTorch. During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/usr/lib/python3.10/unittest/case.py", line 59, in testPartExecutor yield File "/usr/lib/python3.10/unittest/case.py", line 591, in run self._callTestMethod(testMethod) File "/usr/lib/python3.10/unittest/case.py", line 549, in _callTestMethod method() File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 2759, in wrapper method(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 2759, in wrapper method(*args, **kwargs) File "/opt/pytorch/pytorch/test/test_cuda.py", line 2255, in test_graph_capture_oom with self.assertRaisesRegex(RuntimeError, oom_regex): File "/usr/lib/python3.10/unittest/case.py", line 239, in __exit__ self._raiseFailure('"{}" does not match "{}"'.format( File "/usr/lib/python3.10/unittest/case.py", line 163, in _raiseFailure raise self.test_case.failureException(msg) AssertionError: "out of memory" does not match "NVML_SUCCESS == r INTERNAL ASSERT FAILED at "/opt/pytorch/pytorch/c10/cuda/CUDACachingAllocator.cpp":841, please report a bug to PyTorch. " ``` This is a known issue: NVML support on Jetson is limited, and the OOM reporting in CUDACachingAllocator.cpp requires NVML to be properly loaded, which fails on Jetson. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128661 Approved by: https://github.com/eqy, https://github.com/atalman	2024-06-25 08:25:11 +00:00
Xuehai Pan	a7c596870d	[BE][Eazy] remove `torch.torch.xxx` usages (#127800 ) NB: `torch` is exposed in `torch/__init__.py`. So there can be `torch.torch.torch.xxx`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127800 Approved by: https://github.com/peterbell10, https://github.com/kit1980, https://github.com/malfet	2024-06-05 21:53:49 +00:00
Xuehai Pan	67ef2683d9	[BE] wrap deprecated function/class with `typing_extensions.deprecated` (#127689 ) Use `typing_extensions.deprecated` for deprecation annotation if possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` if the category is missing. Note that only warnings that their messages contain `[Dd]eprecat(ed\|ion)` are updated in this PR. Resolves #126888 - #126888 This PR is split from PR #126898. - #126898 ------ Pull Request resolved: https://github.com/pytorch/pytorch/pull/127689 Approved by: https://github.com/Skylion007	2024-06-02 12:30:43 +00:00
PyTorch MergeBot	033e733021	Revert "[BE] wrap deprecated function/class with `typing_extensions.deprecated` (#126898 )" This reverts commit `749a132fb0`. Reverted https://github.com/pytorch/pytorch/pull/126898 on behalf of https://github.com/fbgheith due to switching typing-extensions=4.3.0 to 4.9.0 causes internal failure ([comment](https://github.com/pytorch/pytorch/pull/126898#issuecomment-2142884456))	2024-05-31 19:47:24 +00:00
SandishKumarHN	da39461d61	[optim] Move test_grad_scaling_autocast_fused_optimizers to test_cuda.py (#126418 ) This PR addresses the comments in PR #124904 - Move test_grad_scaling_autocast_fused_optimizers to test_cuda.py - Combine _grad_scaling_autocast_fused_optimizers into test_grad_scaling_autocast_fused_optimizers - Move to OptimizerInfo framework. - For failing tests test_grad_scaling_autocast_fused_optimizers AdamW_cuda_float32, Adam_cuda_float32 - Added toleranceOverride in this PR - Created an issue #127000 ``` > (c2env) [sandish@devgpu166.ash6 ~/pytorch (refactoroptimizers)]$ python test/test_cuda.py -k test_grad_scaling_autocast_fused_optimizers -v /home/sandish/pytorch/torch/backends/cudnn/__init__.py:106: UserWarning: PyTorch was compiled without cuDNN/MIOpen support. To use cuDNN/MIOpen, rebuild PyTorch making sure the library is visible to the build system. warnings.warn( /home/sandish/pytorch/torch/backends/cudnn/__init__.py:106: UserWarning: PyTorch was compiled without cuDNN/MIOpen support. To use cuDNN/MIOpen, rebuild PyTorch making sure the library is visible to the build system. warnings.warn( test_grad_scaling_autocast_fused_optimizers_Adagrad_cpu_float32 (__main__.TestCudaOptimsCPU) ... {'fused': True} {'fused': True} {'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'lr': 0.1, 'fused': True} {'lr': 0.1, 'fused': True} {'initial_accumulator_value': 0.1, 'weight_decay': 0.1, 'fused': True} {'initial_accumulator_value': 0.1, 'weight_decay': 0.1, 'fused': True} {'lr': 0.1, 'lr_decay': 0.5, 'weight_decay': 0.1, 'fused': True} {'lr': 0.1, 'lr_decay': 0.5, 'weight_decay': 0.1, 'fused': True} {'lr': tensor(0.0010), 'fused': True} {'lr': tensor(0.0010), 'fused': True} ok test_grad_scaling_autocast_fused_optimizers_AdamW_cpu_float32 (__main__.TestCudaOptimsCPU) ... 
{'fused': True} {'fused': True} {'lr': 0.01, 'fused': True} {'lr': 0.01, 'fused': True} {'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'fused': True} ok test_grad_scaling_autocast_fused_optimizers_Adam_cpu_float32 (__main__.TestCudaOptimsCPU) ... {'fused': True} {'fused': True} {'lr': 0.01, 'fused': True} {'lr': 0.01, 'fused': True} {'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'fused': True} ok test_grad_scaling_autocast_fused_optimizers_SGD_cpu_float32 (__main__.TestCudaOptimsCPU) ... {'fused': True} {'fused': True} {'lr': 0.01, 'fused': True} {'lr': 0.01, 'fused': True} {'lr': tensor(0.0010), 'fused': True} {'lr': tensor(0.0010), 'fused': True} {'momentum': 0.9, 'fused': True} {'momentum': 0.9, 'fused': True} {'momentum': 0.9, 'dampening': 0.5, 'fused': True} {'momentum': 0.9, 'dampening': 0.5, 'fused': True} {'momentum': 0.9, 'weight_decay': 0.1, 'fused': True} {'momentum': 0.9, 'weight_decay': 0.1, 'fused': True} {'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.1, 'fused': True} {'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} ok test_grad_scaling_autocast_fused_optimizers_Adagrad_cuda_float32 (__main__.TestCudaOptimsCUDA) ... skipped 'cuda is not supported for fused on Adagrad' test_grad_scaling_autocast_fused_optimizers_AdamW_cuda_float32 (__main__.TestCudaOptimsCUDA) ... 
{'fused': True} {'fused': True} {'lr': 0.01, 'fused': True} {'lr': 0.01, 'fused': True} {'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'fused': True} {'capturable': True, 'fused': True} {'capturable': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'fused': True} {'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'fused': True} {'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'fused': True} ok test_grad_scaling_autocast_fused_optimizers_Adam_cuda_float32 (__main__.TestCudaOptimsCUDA) ... {'fused': True} {'fused': True} {'lr': 0.01, 'fused': True} {'lr': 0.01, 'fused': True} {'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'fused': True} {'capturable': True, 'fused': True} {'capturable': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'fused': True} {'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'fused': True} {'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'fused': True} ok test_grad_scaling_autocast_fused_optimizers_SGD_cuda_float32 (__main__.TestCudaOptimsCUDA) ... 
{'fused': True} {'fused': True} {'lr': 0.01, 'fused': True} {'lr': 0.01, 'fused': True} {'lr': tensor(0.0010), 'fused': True} {'lr': tensor(0.0010), 'fused': True} {'momentum': 0.9, 'fused': True} {'momentum': 0.9, 'fused': True} {'momentum': 0.9, 'dampening': 0.5, 'fused': True} {'momentum': 0.9, 'dampening': 0.5, 'fused': True} {'momentum': 0.9, 'weight_decay': 0.1, 'fused': True} {'momentum': 0.9, 'weight_decay': 0.1, 'fused': True} {'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.1, 'fused': True} {'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} ok ---------------------------------------------------------------------- Ran 8 tests in 16.117s OK (skipped=1) > lintrunner test/test_cuda.py ---------------------------------------------------------------------- ok No lint issues. > lintrunner torch/testing/_internal/common_optimizers.py ---------------------------------------------------------------------- ok No lint issues. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126418 Approved by: https://github.com/janeyx99	2024-05-30 01:47:41 +00:00
Xuehai Pan	749a132fb0	[BE] wrap deprecated function/class with `typing_extensions.deprecated` (#126898 ) Use `typing_extensions.deprecated` for deprecation annotation if possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` if the category is missing. Note that only warnings that their messages contain `[Dd]eprecat(ed\|ion)` are updated in this PR. UPDATE: Use `FutureWarning` instead of `DeprecationWarning`. Resolves #126888 - #126888 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126898 Approved by: https://github.com/albanD	2024-05-29 12:09:27 +00:00
Yu, Guangye	e7a42702f9	generalize custom_fwd&custom_bwd to be device-agnostic (#126531 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126531 Approved by: https://github.com/jgong5, https://github.com/gujinghui, https://github.com/albanD, https://github.com/EikanWang ghstack dependencies: #126527	2024-05-25 06:48:16 +00:00
Yu, Guangye	c09205a057	Deprecate device-specific GradScaler autocast API (#126527 ) # Motivation ## for `torch.amp.GradScaler`, - `torch.cpu.amp.GradScaler(args...)` is completely equivalent to `torch.amp.GradScaler("cpu", args...)`. - `torch.cuda.amp.GradScaler(args...)` is completely equivalent to `torch.amp.GradScaler("cuda", args...)`. So, we intend to deprecate them and strongly recommend that developers use `torch.amp.GradScaler`. ## for `custom_fwd` and `custom_bwd`, this is a good solution to make the custom function run with or without effect even in an autocast-enabled region and can be shared by other backends, like CPU and XPU. So we generalize it to be device-agnostic, put them into `torch/amp/autocast_mode.py`, and re-expose them as `torch.amp.custom_fwd` and `torch.amp.custom_bwd`. Meanwhile, we deprecate `torch.cuda.amp.custom_fwd` and `torch.cuda.amp.custom_bwd`. # Additional Context Add UT to cover the deprecation warning. No more UTs are needed to cover the functionality of `torch.amp.custom_f/bwd`; the existing UTs that previously covered the functionality of `torch.cuda.amp.custom_f/bwd` cover them. To facilitate the review, we separate these code changes into two PRs. The first PR covers `torch.amp.GradScaler`. The follow-up covers `custom_fwd` and `custom_bwd`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126527 Approved by: https://github.com/jgong5, https://github.com/gujinghui, https://github.com/janeyx99, https://github.com/EikanWang	2024-05-25 06:41:34 +00:00
Catherine Lee	ef86a27dba	Mark test_set_per_process_memory_fraction serial (#127087 ) Occasionally OOMs Also should probably give the entire GPU for this anyways Pull Request resolved: https://github.com/pytorch/pytorch/pull/127087 Approved by: https://github.com/huydhn	2024-05-25 06:26:47 +00:00
Jack Taylor	d30cdc4321	[ROCm] amdsmi library integration (#119182 ) Adds monitoring support for ROCm using amdsmi in place of pynvml. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119182 Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/xw285cornell	2024-05-21 01:59:26 +00:00
PyTorch MergeBot	cb69c51b6f	Revert " Updated test_graph_optims and test_graph_scaling_fused_optimizers to use new OptimizerInfo infrastructure (#125127 )" This reverts commit `cf35a591b9`. Reverted https://github.com/pytorch/pytorch/pull/125127 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/125127#issuecomment-2120337584))	2024-05-20 12:14:22 +00:00
jayanth domalapalli	cf35a591b9	Updated test_graph_optims and test_graph_scaling_fused_optimizers to use new OptimizerInfo infrastructure (#125127 ) This PR is meant to address issue #123451, more specifically, the ```test_graph_optims``` and ```test_graph_scaling_fused_optimizers``` functions in ```test_cuda.py``` have been updated so that they now use the new OptimizerInfo infrastructure. Lintrunner passed: ``` $ lintrunner test/test_cuda.py ok No lint issues. ``` Tests passed: ``` >python test_cuda.py -k test_graph_optims Ran 19 tests in 7.463s OK (skipped=9) >python test_cuda.py -k test_graph_scaling_fused_optimizers Ran 6 tests in 2.800s OK (skipped=3) ``` Both the functions have been moved to the newly created TestCase class ```TestCudaOptims```. The test is mostly the same except the ```@optims``` decorator is used at the top of the function to implicitly call the function using each of the optimizers mentioned in the decorator instead of explicitly using a for loop to iterate through each of the optimizers. 
I was unable to use the ```_get_optim_inputs_including_global_cliquey_kwargs``` to get all kwargs for each of the optimizers since some of the kwargs that are used in the original ```test_graph_optims``` function are not being returned by the new OptimizerInfo infrastructure, more specifically, for the ```torch.optim.rmsprop.RMSprop``` optimizer, the following kwargs are not returned whenever ```_get_optim_inputs_including_global_cliquey_kwargs``` is called: ``` {'foreach': False, 'maximize': True, 'weight_decay': 0} { 'foreach': True, 'maximize': True, 'weight_decay': 0} ``` I ran into the same issue for ```test_graph_scaling_fused_optimizers```, for the ```torch.optim.adamw.AdamW``` optimizer, whenever ```optim_info.optim_inputs_func(device=device)``` was called, the following kwarg was not returned: ``` {'amsgrad': True} ``` Due to this issue, I resorted to using a dictionary to store the kwargs for each of the optimizers, I am aware that this is less than ideal. I was wondering whether I should use the OptimizerInfo infrastructure to get all the kwargs regardless of the fact that it lacks some kwargs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125127 Approved by: https://github.com/janeyx99	2024-05-20 06:20:45 +00:00
Yu, Guangye	58378f1224	[Doc] Add deprecated autocast comments for doc (#126062 ) # Motivation We generalized the device-agnostic API `torch.amp.autocast` in [#125103](https://github.com/pytorch/pytorch/pull/125103). After that, - `torch.cpu.amp.autocast(args...)` is completely equivalent to `torch.amp.autocast('cpu', args...)`, and - `torch.cuda.amp.autocast(args...)` is completely equivalent to `torch.amp.autocast('cuda', args...)` in both eager mode and JIT mode. Based on this, we would like to deprecate `torch.cpu.amp.autocast` and `torch.cuda.amp.autocast` and strongly recommend that developers use the device-agnostic `torch.amp.autocast` API instead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126062 Approved by: https://github.com/eqy, https://github.com/albanD	2024-05-16 05:26:43 +00:00
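The equivalence described in #126062 can be sketched on CPU as follows. This is a minimal illustration, not code from the PR: the tensor shapes, the `bfloat16` dtype, and the helper name `matmul_dtype` are assumptions made for the example.

```python
import warnings

import torch


def matmul_dtype(ctx):
    """Run a small matmul under the given autocast context and return the output dtype."""
    a = torch.randn(4, 4)
    b = torch.randn(4, 4)
    with ctx:
        return torch.mm(a, b).dtype


# Device-agnostic spelling recommended by the commit message.
new_style = matmul_dtype(torch.amp.autocast("cpu", dtype=torch.bfloat16))

# Deprecated device-specific spelling. Newer releases may emit a deprecation
# warning when it is constructed, so warnings are silenced for the comparison.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    old_style = matmul_dtype(torch.cpu.amp.autocast(dtype=torch.bfloat16))

print(new_style, old_style)
```

Both spellings should autocast the matmul to the same dtype; only the device-agnostic form is recommended going forward.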
haozhe.zhu	f9d107af66	[optim] add fused_adagrad support for CPU device (#124905 ) Support fused_sgd_kernel support for CPU. ## Bench result: 32 core/sockets ICX Test Scripts: https://gist.github.com/zhuhaozhe/79e842e0a6e25d6d7fa1e4598807272c https://gist.github.com/zhuhaozhe/b4c6998a509dcea1796dd05b3005c969 ``` Tensor Size: 262144, Num Tensor 4, Num Threads: 1 _single_tensor_adagrad time: 0.2500 seconds _fused_adagrad time: 0.0933 seconds Tensor Size: 4194304, Num Tensor 32, Num Threads: 32 _single_tensor_adagrad time: 2.8819 seconds _fused_adagrad time: 1.7591 seconds ``` ## Test Plan: ``` python test_optim.py -k test_fused_matches_forloop python test_optim.py -k test_fused_large_tensor python test_optim.py -k test_can_load_older_state_dict python test_optim.py -k test_grad_scaling_autocast_fused_optimizers python test_torch.py -k test_grad_scaling_autocast_fused python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step ``` Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124905 Approved by: https://github.com/jgong5, https://github.com/janeyx99	2024-05-16 01:11:51 +00:00
PyTorch MergeBot	bd3cbdba2f	Revert "[optim] add fused_adagrad support for CPU device (#124905 )" This reverts commit `1c3fe84033`. Reverted https://github.com/pytorch/pytorch/pull/124905 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing distributed multigpu test in trunk `1c3fe84033` ([comment](https://github.com/pytorch/pytorch/pull/124905#issuecomment-2108777063))	2024-05-13 20:53:22 +00:00
haozhe.zhu	1c3fe84033	[optim] add fused_adagrad support for CPU device (#124905 ) Support fused_sgd_kernel support for CPU. ## Bench result: 32 core/sockets ICX Test Scripts: https://gist.github.com/zhuhaozhe/79e842e0a6e25d6d7fa1e4598807272c https://gist.github.com/zhuhaozhe/b4c6998a509dcea1796dd05b3005c969 ``` Tensor Size: 262144, Num Tensor 4, Num Threads: 1 _single_tensor_adagrad time: 0.2500 seconds _fused_adagrad time: 0.0933 seconds Tensor Size: 4194304, Num Tensor 32, Num Threads: 32 _single_tensor_adagrad time: 2.8819 seconds _fused_adagrad time: 1.7591 seconds ``` ## Test Plan: ``` python test_optim.py -k test_fused_matches_forloop python test_optim.py -k test_fused_large_tensor python test_optim.py -k test_can_load_older_state_dict python test_optim.py -k test_grad_scaling_autocast_fused_optimizers python test_torch.py -k test_grad_scaling_autocast_fused python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step ``` Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124905 Approved by: https://github.com/jgong5, https://github.com/janeyx99	2024-05-13 01:16:20 +00:00
Yu, Guangye	31372fa842	Support generic stream/event on CUDA/HIP backend (#125757 ) # Motivation According to [#123611](https://github.com/pytorch/pytorch/pull/123611), we support generic stream/event on CUDA backend. # Additional Context new method/attribute on `torch.Event` for cuda - torch.Event.event_id - torch.Event.elapsed_time - torch.Event.synchronize new method on `c10::Event` on cuda backend - c10.Event.event_id - c10.Event.elapsed_time - c10.Event.synchronize Pull Request resolved: https://github.com/pytorch/pytorch/pull/125757 Approved by: https://github.com/albanD, https://github.com/jgong5, https://github.com/EikanWang	2024-05-10 13:34:09 +00:00
PyTorch MergeBot	0d4fdb0bb7	Revert "[ROCm] amdsmi library integration (#119182 )" This reverts commit `85447c41e3`. Reverted https://github.com/pytorch/pytorch/pull/119182 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the ROCm failed test is legit `85447c41e3` ([comment](https://github.com/pytorch/pytorch/pull/119182#issuecomment-2103433197))	2024-05-09 21:18:21 +00:00
PyTorch MergeBot	6fd745255e	Revert "add uuid in cudaDeviceProperties (#125083 )" This reverts commit `3f36145db2`. Reverted https://github.com/pytorch/pytorch/pull/125083 on behalf of https://github.com/izaitsevfb due to Fails internal builds with: no member named 'uuid' in 'hipDeviceProp_t' ([comment](https://github.com/pytorch/pytorch/pull/125083#issuecomment-2103315320))	2024-05-09 19:52:45 +00:00
Jack Taylor	85447c41e3	[ROCm] amdsmi library integration (#119182 ) Adds monitoring support for ROCm using amdsmi in place of pynvml. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119182 Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/xw285cornell	2024-05-09 18:21:38 +00:00
Jeff Daily	3f36145db2	add uuid in cudaDeviceProperties (#125083 ) Replaces #99967. Fixes #99903. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125083 Approved by: https://github.com/pruthvistony, https://github.com/albanD, https://github.com/eqy	2024-05-08 19:15:55 +00:00
PyTorch MergeBot	5fd0b6e5f7	Revert "add uuid in cudaDeviceProperties (#125083 )" This reverts commit `f35fe4eaf1`. Reverted https://github.com/pytorch/pytorch/pull/125083 on behalf of https://github.com/clee2000 due to test_uuid is flaky. ex https://github.com/pytorch/pytorch/actions/runs/8988855916/job/24692369523 https://hud.pytorch.org/flakytest?name=test_uuid&suite=TestCuda&file=%25&limit=300 ([comment](https://github.com/pytorch/pytorch/pull/125083#issuecomment-2099029993))	2024-05-07 18:16:27 +00:00
Jeff Daily	f35fe4eaf1	add uuid in cudaDeviceProperties (#125083 ) Replaces #99967. Fixes #99903. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125083 Approved by: https://github.com/pruthvistony, https://github.com/albanD, https://github.com/eqy	2024-05-07 01:26:01 +00:00
haozhe.zhu	489b4586e9	[optim]fix ut and sgd kernel (#124904 ) - The original `test_grad_scaling_autocast_fused_optimizers` does not work since there is no "fused" in `optim_inputs` - We should use different `grad_scaler`s; they should not share one `scale`. No issue is exposed here because the default `_growth_interval` is 2000, so the scale does not grow, and no inf is found, so it is not reduced either. The one in `test_cuda.py` should also have this issue. - I set a manual seed so that any numerical failure can be reproduced - I use a Tensor tracker here because this UT failed in the dynamo case: the cpp generated code is not exactly the same as the fused/non-fused kernel. - I make it check both `cuda` and `cpu`. - I found an SGD numerical issue with `clang` and fixed it by using `fmadd` instead of `add/mul` in the fused sgd veckernel. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124904 Approved by: https://github.com/jgong5, https://github.com/janeyx99	2024-05-03 09:13:24 +00:00
Yuanhao Ji	d5182bb75b	Enable UFMT on `test/test_cuda*.py` (#124352 ) Part of: #123062 Ran lintrunner on: - test/test_cuda.py - test/test_cuda_expandable_segments.py - test/test_cuda_multigpu.py - test/test_cuda_nvml_based_avail.py - test/test_cuda_primary_ctx.py - test/test_cuda_sanitizer.py - test/test_cuda_trace.py Detail: ```bash $ lintrunner -a --take UFMT --all-files ok No lint issues. Successfully applied all patches. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/124352 Approved by: https://github.com/ezyang	2024-04-25 18:31:08 +00:00
PyTorch MergeBot	c0fd7894cc	Revert "Fast standalone symbolize for unwinding (#123966 )" This reverts commit `772ae6da1e`. Reverted https://github.com/pytorch/pytorch/pull/123966 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, check D56522678 ([comment](https://github.com/pytorch/pytorch/pull/123966#issuecomment-2076821043))	2024-04-25 10:04:48 +00:00
Tiger Huo	94af62b000	Updated test_graph_grad_scaling to use new OptimizerInfo infrastructure (#123581 ) This PR targets the issue mentioned in #123451 and handles the specific task of updating `test_graph_grad_scaling` in `test/test_cuda.py` to use the new OptimizerInfo infrastructure. `test_graph_grad_scaling` is moved to a new `TestCase` class called `TestCudaOptims` in order to use `instantiate_device_type_tests`. The test content remained the same. `@onlyCUDA` is applied to the new test; the original use of the wrapper function is also changed to a `@parametrize` decorator for better style. If we think that this migration is successful, we can delete the original test item under `TestCuda`. Currently it is left untouched to avoid any unexpected issues. Local linter passed. ``` $ lintrunner test/test_cuda.py ok No lint issues. ``` Local tests passed. ``` > python .\test\test_cuda.py -k test_graph_grad_scaling Ran 7 tests in 0.458s OK (skipped = 3) ``` Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123581 Approved by: https://github.com/janeyx99	2024-04-25 06:29:20 +00:00
Catherine Lee	4f29103749	[ez][CI] Move test_cuda off CI_SERIAL_LIST (#124649 ) Tag test cases with large tensor with serial, also tag a few more that failed on a previous iteration of this PR Move test_cuda and test_cuda_expandable_segments off the serial list Pull Request resolved: https://github.com/pytorch/pytorch/pull/124649 Approved by: https://github.com/ZainRizvi	2024-04-24 22:04:23 +00:00
zdevito	772ae6da1e	Fast standalone symbolize for unwinding (#123966 ) We've had issues using addr2line. On certain versions of CentOS it is on a version that has a performance regression making it very slow, and even normally it is not that fast, taking several seconds even when parallelized for a typical memory trace dump. Folly Symbolize or LLVMSymbolize are fast but they require PyTorch to take a dependency on those libraries, and given the number of environments we run stuff in, we end up hitting cases where we fall back to slow addr2line behavior. This adds a standalone symbolizer to PyTorch similar to the unwinder which has no external dependencies and is ~20x faster than addr2line for unwinding PyTorch frames. I've tested this on some memory profiling runs using all combinations of {gcc, clang} x {dwarf4, dwarf5} and it seems to do a good job at getting line numbers and function names right. It is also careful to route all reads of library data through the `CheckedLexer` object, which ensures it is not reading out of bounds of the section. Errors are routed through UnwindError so that those exceptions get caught and we produce a ?? frame rather than crash. I also added a fuzz test which gives all our symbolizer options random addresses in the process to make sure they do not crash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123966 Approved by: https://github.com/ezyang	2024-04-23 15:27:18 +00:00
Aaron Gokaslan	5a1216bb2e	[BE]: Update ruff to 0.4.1 (#124549 ) Update ruff to 0.4.1. This version fixes a lot of false negatives/false positives, is 20-40% faster, and has various other bug fixes. Below is a before and after table showing the execution time of ruff lint and ruff format in milliseconds courtesy of https://astral.sh/blog/ruff-v0.4.0 \| Repository \| Linter (v0.3) \| Linter (v0.4) \| Formatter (v0.3) \| Formatter (v0.4) \| \|----------------------------------------------------\|---------------\|---------------\|------------------\|------------------\| \| [pytorch/pytorch](https://github.com/pytorch/pytorch) \| 328.7 \| 251.8 \| 351.1 \| 274.9 \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/124549 Approved by: https://github.com/ezyang	2024-04-21 14:06:23 +00:00
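As a quick check of the quoted "20-40% faster" figure against the pytorch/pytorch row of the table, the relative time reduction can be computed directly. The numbers are copied from the commit message; the helper name `percent_reduction` is only illustrative.

```python
# Execution times in milliseconds, copied from the table in the commit message.
timings_ms = {
    "linter": (328.7, 251.8),     # (v0.3, v0.4)
    "formatter": (351.1, 274.9),  # (v0.3, v0.4)
}


def percent_reduction(before, after):
    """Percentage of the original execution time that was saved."""
    return (before - after) / before * 100.0


reductions = {
    name: percent_reduction(before, after)
    for name, (before, after) in timings_ms.items()
}

for name, pct in reductions.items():
    print(f"{name}: {pct:.1f}% less time")
```

Both values land in the low twenties, i.e. at the lower end of the quoted range for this repository.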
Michael Lazos	16771747c2	Add tensor step and capturable support to rprop (#122261 ) Towards fixing https://github.com/pytorch/pytorch/issues/115679 Fixes Rprop step update while compiling Also adds capturable support + testing Pull Request resolved: https://github.com/pytorch/pytorch/pull/122261 Approved by: https://github.com/janeyx99	2024-03-28 23:31:18 +00:00
Edward Z. Yang	0284bca99b	Don't cache device_count if we haven't initialized CUDA yet (#122815 ) Before CUDA is initialized, the device count can change if CUDA_VISIBLE_DEVICES is modified. Fixes https://github.com/pytorch/pytorch/issues/122085 Fixes https://github.com/pytorch/pytorch/issues/38616 Fixes https://github.com/pytorch/pytorch/issues/110000 Fixes https://github.com/pytorch/pytorch/issues/110971 Fixes https://github.com/pytorch/pytorch/issues/95073 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122815 Approved by: https://github.com/albanD	2024-03-28 13:23:45 +00:00
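The caching rule in #122815 can be illustrated with a toy model that needs no GPU. Everything here is hypothetical and is not PyTorch's implementation: `_initialized`, `_cached_count`, `init()` and the environment-variable-based `_query_visible_devices()` are stand-ins chosen only to show why the count must not be cached before initialization.

```python
import os

_initialized = False   # hypothetical stand-in for "CUDA has been initialized"
_cached_count = None   # hypothetical cache slot


def _query_visible_devices():
    """Toy query: count the entries listed in CUDA_VISIBLE_DEVICES."""
    value = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return len([d for d in value.split(",") if d.strip()])


def device_count():
    """Return the device count, caching it only once initialization has happened."""
    global _cached_count
    if _cached_count is not None:
        return _cached_count
    count = _query_visible_devices()
    if _initialized:
        _cached_count = count
    return count


def init():
    """Toy initialization: from here on the visible devices are fixed."""
    global _initialized
    _initialized = True


saved = os.environ.get("CUDA_VISIBLE_DEVICES")
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
before_change = device_count()          # not cached: not initialized yet
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
after_change = device_count()           # reflects the new environment
init()
at_init = device_count()                # cached from now on
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"
after_init_change = device_count()      # still the cached value

# Restore the caller's environment.
if saved is None:
    del os.environ["CUDA_VISIBLE_DEVICES"]
else:
    os.environ["CUDA_VISIBLE_DEVICES"] = saved
```

Had the first call cached its result, the change to `CUDA_VISIBLE_DEVICES` made before initialization would have been ignored, which is the class of bug the linked issues describe.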
Michael Lazos	caa57e4fcd	Add tensor step and capturable support to rmsprop (#122264 ) Towards fixing https://github.com/pytorch/pytorch/issues/115679 Fixes RMSprop step update while compiling Adds capturable support to RMSprop Pull Request resolved: https://github.com/pytorch/pytorch/pull/122264 Approved by: https://github.com/janeyx99	2024-03-28 03:39:28 +00:00
Frank Lin	249e65b92d	Graph-Safe RNG State Exchange for Tensor Parallelism (#114068 ) See #113541 The PR allows for registering and controlling multiple RNG states using indices, ensuring cudagraph-safe operations, and includes both C++ and Python API changes to support this functionality. cc @eellison @anijain2305 @jansel @ezyang @ptrblck @csarofeen @mcarilli Pull Request resolved: https://github.com/pytorch/pytorch/pull/114068 Approved by: https://github.com/ezyang, https://github.com/eqy, https://github.com/xuzhao9	2024-03-27 01:14:38 +00:00
PyTorch MergeBot	4dc09d6aa4	Revert "Graph-Safe RNG State Exchange for Tensor Parallelism (#114068 )" This reverts commit `e9dcda5cba`. Reverted https://github.com/pytorch/pytorch/pull/114068 on behalf of https://github.com/ezyang due to memory leak in another ci ([comment](https://github.com/pytorch/pytorch/pull/114068#issuecomment-2018044527))	2024-03-25 13:49:04 +00:00
Michael Lazos	365e89a591	Add tensor step to adadelta (#122252 ) Towards fixing https://github.com/pytorch/pytorch/issues/115679 Fixes Adadelta step update while compiling Pull Request resolved: https://github.com/pytorch/pytorch/pull/122252 Approved by: https://github.com/janeyx99	2024-03-21 07:28:47 +00:00
Frank Lin	e9dcda5cba	Graph-Safe RNG State Exchange for Tensor Parallelism (#114068 ) See #113541 The PR allows for registering and controlling multiple RNG states using indices, ensuring cudagraph-safe operations, and includes both C++ and Python API changes to support this functionality. cc @eellison @anijain2305 @jansel @ezyang @ptrblck @csarofeen @mcarilli Pull Request resolved: https://github.com/pytorch/pytorch/pull/114068 Approved by: https://github.com/ezyang	2024-03-21 01:57:08 +00:00
Andres Lugo-Reyes	e01b07e1e8	[ROCm] Autocast RNN Support (#121539 ) Fixes #116361 Implements Autocast wrapper for miopen rnn's Pull Request resolved: https://github.com/pytorch/pytorch/pull/121539 Approved by: https://github.com/albanD, https://github.com/jeffdaily	2024-03-11 21:14:43 +00:00
Natalia Gimelshein	89add71168	fix synchronization behavior for copies with type change (#121341 ) Fixes #121320 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121341 Approved by: https://github.com/albanD	2024-03-11 17:09:45 +00:00
Aidyn-A	ca9678405a	[CUDA graphs] Pool argument for make_graphed_callables (#121475 ) It is just a nice feature to have for situations when users want multiple graph captures and/or graphed callables to share the same memory pool. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121475 Approved by: https://github.com/eellison, https://github.com/eqy	2024-03-09 00:15:38 +00:00
Jane Xu	9d6c5be781	Add ASGD capturable API for forloop (#121264 ) @tfsingh I got to it first--wanted to land this stack and close the gap ASAP. This PR also fixes a discrepancy between `_init_group` and `__set_state__` because we have the constants live on params' device always. There are some next steps though: - ASGD can be made faster by making etas, mus, steps be on CPU when NOT capturable. (I had mistakenly thought foreachifying was faster and so we landed https://github.com/pytorch/pytorch/pull/107857, but it is slower). No one has complained yet though. ¯\_(ツ)_/¯ Pull Request resolved: https://github.com/pytorch/pytorch/pull/121264 Approved by: https://github.com/albanD ghstack dependencies: #121260	2024-03-08 00:00:30 +00:00
Jane Xu	24821fec26	Add RAdam capturable API for forloop (#121260 ) Implementation thanks to @MarouaneMaatouk in https://github.com/pytorch/pytorch/pull/118697, though I've since cleaned it up a lot to save perf on the rect < 5 eager case. It also just looks better now :) Added tests and the cudagraph health check. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121260 Approved by: https://github.com/mlazos	2024-03-08 00:00:30 +00:00
Jane Xu	53bdae736d	Add capturable single tensor Adamax (#121183 ) Finishes the work started in https://github.com/pytorch/pytorch/pull/118697. Thanks @MarouaneMaatouk for the attempt, but due to inactivity I have opened this PR for Adamax. Note that the new capturable implementation is much simpler and I've modified the foreach capturable impl--it now calls fewer kernels and is more easily comparable to forloop. Next steps: * This PR discovered two bugs: #121178 and #121238. * Move the now hefty graph optim tests in test_cuda to use OptimInfo. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121183 Approved by: https://github.com/albanD	2024-03-07 17:57:02 +00:00
Aaron Enye Shi	aa36821615	[Memory Snapshot] Stop clearing history when changing context (#120436 ) Summary: This change will avoid clearing the memory event history, when changing the context from `record_memory_history(context=None)` to `record_memory_history(context="python")`. Now it will continue recording memory events with changing context on the fly. Only `record_memory_history(enabled=None)` will clear the history. Test Plan: # Ran on the following local Resnet50 example: - At iteration=0, record_memory_history(context=None, stacks="python") - At iteration=3, record_memory_history(context="all", stacks="python") - After iteration=4, export_memory_snapshot() ## Before: - Only collects the last 2 iterations with python call stacks. ![image](https://github.com/pytorch/pytorch/assets/17602366/86154532-9f73-4d10-9194-19e8c96ee4f3) ## After: - Collects all 5 iterations, where first 3 iterations have no call stacks, and last 2 iterations have python call stacks. ![image](https://github.com/pytorch/pytorch/assets/17602366/c2c277d6-b400-4da2-85c8-a7f119d409f8) ![image](https://github.com/pytorch/pytorch/assets/17602366/dc9da2f8-41cc-44b0-9c32-ec3cbe79d2c4) Differential Revision: D54084017 Pulled By: aaronenyeshi Pull Request resolved: https://github.com/pytorch/pytorch/pull/120436 Approved by: https://github.com/zdevito, https://github.com/leitian	2024-02-28 22:46:26 +00:00
CaoE	113138aa55	add test cases for GradScaler on CPU (#109994 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/109994 Approved by: https://github.com/jgong5, https://github.com/ezyang	2024-02-02 21:49:07 +00:00
Michael Lazos	800e2e823f	Add compilable foreach RAdam support (#117912 ) Fixes https://github.com/pytorch/pytorch/issues/117807 This brings the number of supported optimizers with `torch.compile` to 11/13 (!) Pull Request resolved: https://github.com/pytorch/pytorch/pull/117912 Approved by: https://github.com/janeyx99	2024-01-27 04:32:27 +00:00
Aaron Shi	6ac284122b	[Memory Snapshot] Track context for SEGMENT_FREE and SEGMENT_UNMAP (#118055 ) Summary: Show the stack when SEGMENT_FREE and SEGMENT_UNMAP occur. This may be useful for debugging, such as when empty_cache() causes a segment to be freed. If the free context is unavailable, resort to the segment allocation stack. Test Plan: CI Differential Revision: D52984953 Pull Request resolved: https://github.com/pytorch/pytorch/pull/118055 Approved by: https://github.com/zdevito	2024-01-23 21:48:57 +00:00
Michael Lazos	aaae2d8bb6	Add compilable and capturable foreach adamax with tests (#117835 ) Based off of https://github.com/pytorch/pytorch/pull/110345 Fixes https://github.com/pytorch/pytorch/issues/117812 Pull Request resolved: https://github.com/pytorch/pytorch/pull/117835 Approved by: https://github.com/janeyx99	2024-01-20 05:29:05 +00:00
Masaki Kozuki	1d14adfa66	[mta] Fused SGD (#116585 ) depends on #116583 rel: - #94791 Pull Request resolved: https://github.com/pytorch/pytorch/pull/116585 Approved by: https://github.com/janeyx99	2024-01-16 23:54:38 +00:00
CaoE	29516bd2a0	add _amp_foreach_non_finite_check_and_unscale_cpu_ and _amp_update_scale_cpu_ kernels on CPU (#109281 ) Step1 of https://github.com/pytorch/pytorch/issues/111559. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109281 Approved by: https://github.com/jgong5, https://github.com/ezyang	2024-01-16 15:25:08 +00:00
Ting Lu	c167c34396	Skip unsupported tests on arm (#117344 ) Add skips to tests that involve record_context_cpp on ARM, as it is only supported on the linux x86_64 arch. The error is reported as below: ``` Traceback (most recent call last): File "/usr/lib/python3.10/unittest/case.py", line 59, in testPartExecutor yield File "/usr/lib/python3.10/unittest/case.py", line 591, in run self._callTestMethod(testMethod) File "/usr/lib/python3.10/unittest/case.py", line 549, in _callTestMethod method() File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 2674, in wrapper method(*args, **kwargs) File "/opt/pytorch/pytorch/test/test_cuda.py", line 3481, in test_direct_traceback c = gather_traceback(True, True, True) RuntimeError: record_context_cpp is not support on non-linux non-x86_64 platforms ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/117344 Approved by: https://github.com/malfet, https://github.com/drisspg	2024-01-12 21:12:11 +00:00
Doe Hyun Yoon	83c45a9931	Faster gc_count update for CUDACachingAllocator (and avoid nullptr de… (#117064 ) …reference) (#109065) Summary: Modify the way we update gc_count in CUDACachingAllocator to make it faster. Originally D48481557, but reverted due to a nullptr dereference in some cases (D49003756). This diff uses the correct constructor for the search key (avoiding the nullptr dereference). It also adds a nullptr check (returning 0 if it is null) in the gc_count functions. Differential Revision: D49068760 Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/117064 Approved by: https://github.com/zdevito	2024-01-11 19:47:05 +00:00
Nikita Shulga	a6325ad86c	Fix cuInit test on Windows (#117055 ) By changing library name from `libcuda.so.1` to `nvcuda.dll` on Windows Pull Request resolved: https://github.com/pytorch/pytorch/pull/117055 Approved by: https://github.com/Skylion007, https://github.com/huydhn, https://github.com/atalman	2024-01-10 00:45:18 +00:00
Nikita Shulga	81b7a09d27	[CI] Test that cuInit is not called during import (#117010 ) By making a driver API call in subprocess and expecting it to return `CUDA_ERROR_NOT_INITIALIZED` Test Plan: run it on nighties before https://github.com/pytorch/pytorch/pull/116201 got reverted and observe the failure This is very important for lots of distributed launchers Fixes https://github.com/pytorch/pytorch/issues/116276 Pull Request resolved: https://github.com/pytorch/pytorch/pull/117010 Approved by: https://github.com/albanD	2024-01-09 14:44:22 +00:00
Aaron Gokaslan	95041829c8	Add bfloat16 CUDA support to RNN (#116927 ) Fixes #116925 Fixes #116763 Pull Request resolved: https://github.com/pytorch/pytorch/pull/116927 Approved by: https://github.com/malfet	2024-01-06 22:55:34 +00:00
Aaron Gokaslan	3fe437b24b	[BE]: Update flake8 to v6.1.0 and fix lints (#116591 ) Updates flake8 to v6.1.0 and fixes a few lints using sed and some ruff tooling. - Replace `assert(0)` with `raise AssertionError()` - Remove extraneous parenthesis i.e. - `assert(a == b)` -> `assert a == b` - `if(x > y or y < z):`->`if x > y or y < z:` - And `return('...')` -> `return '...'` Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/116591 Approved by: https://github.com/albanD, https://github.com/malfet	2024-01-03 06:04:44 +00:00
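The commit message does not say why `assert(0)` is replaced with `raise AssertionError()`. One commonly cited reason, stated here as an assumption rather than as the PR's rationale, is that `assert` statements are removed when Python runs with optimizations enabled, while an explicit `raise` is not. The sketch below uses the built-in `compile(..., optimize=1)` to mimic `python -O`; the source strings and the `run` helper are made up for the example.

```python
# Two spellings of "this point must never be reached".
SRC_ASSERT = (
    "def unreachable():\n"
    "    assert 0\n"
    "    return 'fell through'\n"
)
SRC_RAISE = (
    "def unreachable():\n"
    "    raise AssertionError()\n"
)


def run(src, optimize):
    """Compile src at the given optimization level and call unreachable()."""
    namespace = {}
    exec(compile(src, "<sketch>", "exec", optimize=optimize), namespace)
    try:
        return namespace["unreachable"]()
    except AssertionError:
        return "raised"


assert_plain = run(SRC_ASSERT, optimize=0)      # assert is active
assert_optimized = run(SRC_ASSERT, optimize=1)  # assert is stripped
raise_optimized = run(SRC_RAISE, optimize=1)    # explicit raise survives

print(assert_plain, assert_optimized, raise_optimized)
```

The parenthesis removals listed in the same commit (`assert(a == b)` to `assert a == b`) do not change behavior; they are purely stylistic.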
Aaron Gokaslan	bd10fea79a	[BE]: Enable F821 and fix bugs (#116579 ) Fixes #112371 I tried to fix as many of the bugs as I could, a few I could not figure out what the proper fix for them was though and so I left them with noqas. Pull Request resolved: https://github.com/pytorch/pytorch/pull/116579 Approved by: https://github.com/ezyang	2024-01-01 08:40:46 +00:00
zdevito	4afe2687d5	Reland "Serve multistream graph captures from correct pool (#114647 )" (#116199 ) Fixes a variable shadowing problem that broke internal builds. This reverts commit `fe15645619`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/116199 Approved by: https://github.com/eellison	2023-12-20 21:22:34 +00:00
PyTorch MergeBot	fe15645619	Revert "Serve multistream graph captures from correct pool (#114647 )" This reverts commit `8a445f7bd5`. Reverted https://github.com/pytorch/pytorch/pull/114647 on behalf of https://github.com/jeanschmidt due to breaking multiple internal build jobs, please check internal diff in order to obtain more details ([comment](https://github.com/pytorch/pytorch/pull/114647#issuecomment-1864840724))	2023-12-20 17:11:42 +00:00
zdevito	8a445f7bd5	Serve multistream graph captures from correct pool (#114647 ) This fixes #114320 by placing the logic for determining whether to allocate to a pool inside a callback that is controlled by CUDAGraph.cpp or by the python bound api to allocate a stream directly to a pool. Pull Request resolved: https://github.com/pytorch/pytorch/pull/114647 Approved by: https://github.com/ngimel, https://github.com/eellison	2023-12-18 18:24:15 +00:00
rzou	8ddca5aeae	markDynamoStrictTest some more tests (#115857 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/115857 Approved by: https://github.com/voznesenskym ghstack dependencies: #115845, #115855, #115856	2023-12-15 01:22:38 +00:00
atalman	43e3242490	[BE] Remove test corner cases for CUDA older than supported 11.8 (#114989 ) Remove deprecated CUDA use cases from tests. Similar to: https://github.com/pytorch/pytorch/pull/112873 Pull Request resolved: https://github.com/pytorch/pytorch/pull/114989 Approved by: https://github.com/malfet	2023-12-04 21:41:03 +00:00
eqy	6a86cf00ad	[CUDA][cuBLAS] Remove explicit cuBLAS workspace allocation for CUDA 12.2+ (#113994 ) cuBLAS should be using `cudaMallocAsync` in CUDA 12.2+, which removes the need for explicit workspace allocation to avoid increasing memory usage with multiple graph captures. CC @ptrblck @malfet Pull Request resolved: https://github.com/pytorch/pytorch/pull/113994 Approved by: https://github.com/ezyang, https://github.com/malfet	2023-11-22 23:23:51 +00:00
Banit Agrawal	cc776d2186	[PyTorch Pinned Allocator] Create per thread task pool for mapping memory space (#111545 ) Differential Revision: D50443865 Pull Request resolved: https://github.com/pytorch/pytorch/pull/111545 Approved by: https://github.com/zdevito	2023-10-22 00:23:49 +00:00
Kazuaki Ishizaki	a603dcc307	Fix typo under test directory (#110826 ) This PR fixes typo `the the` of comments in files under `test` directory. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110826 Approved by: https://github.com/Skylion007	2023-10-08 20:52:38 +00:00
Banit Agrawal	64583c4d04	[CUDA Host Allocator] Add support of CudaHostRegister (#108488 ) Summary: This diff adds another option to create cuda pinned memory using cudaHostRegister. Differential Revision: D45843715 Pull Request resolved: https://github.com/pytorch/pytorch/pull/108488 Approved by: https://github.com/zdevito	2023-10-06 04:13:02 +00:00
Aidyn-A	e7bd9c5315	[CUDA][CUDA Graphs] Fix CUDAGraph::reset function (#108896 ) The following two cases fail due to a small oversight in `CUDAGraph::reset()` that causes failures in the graph destructor ```Python import torch x = torch.zeros(4, device="cuda") g = torch.cuda.CUDAGraph() with torch.cuda.graph(g): x = x + 1 g.reset() del g ``` that fails with: ``` terminate called after throwing an instance of 'c10::Error' what(): uc >= 0 INTERNAL ASSERT FAILED at ".../pytorch/c10/cuda/CUDACachingAllocator.cpp":2157, please report a bug to PyTorch. ``` and reset and subsequent re-capture ```Python import torch x = torch.zeros(4, device="cuda") g = torch.cuda.CUDAGraph() with torch.cuda.graph(g): x = x + 1 g.reset() with torch.cuda.graph(g): x = x + 1 g.replay() ``` which fails with: ``` Traceback (most recent call last): File "test_graph.py", line 11, in <module> with torch.cuda.graph(g): File ".../pytorch/torch/cuda/graphs.py", line 192, in __enter__ self.cuda_graph.capture_begin( File ".../pytorch/torch/cuda/graphs.py", line 77, in capture_begin super().capture_begin(pool=pool, capture_error_mode=capture_error_mode) RuntimeError: This CUDAGraph instance already owns a captured graph. To capture a new graph, create a new instance. ``` This PR fixes the `CUDAGraph::reset()` function for the above two use cases. Pull Request resolved: https://github.com/pytorch/pytorch/pull/108896 Approved by: https://github.com/ezyang	2023-09-11 19:49:31 +00:00
Michael Lazos	b193f295b6	Add capturable ASGD impl (#107857 ) Add capturable ASGD impl + test Pull Request resolved: https://github.com/pytorch/pytorch/pull/107857 Approved by: https://github.com/janeyx99	2023-09-07 06:30:30 +00:00
Banit Agrawal	b8af8ac784	[CUDACaching Allocator] Release the allocator lock on the slow path (#108367 ) Summary: This diff is to release the global allocator lock on the slow path when we do synchronous cudaMalloc call. Differential Revision: D48750077 Pull Request resolved: https://github.com/pytorch/pytorch/pull/108367 Approved by: https://github.com/zdevito	2023-09-02 02:52:25 +00:00
Elias Ellison	0a9778a372	Expose cudaStreamCaptureMode in CUDA Graphs, use local setting in inductor (#107407 ) > capture_error_mode (str, optional): specifies the cudaStreamCaptureMode for the graph capture stream. Can be "global", "thread_local" or "relaxed". During cuda graph capture, some actions, such as cudaMalloc, may be unsafe. "global" will error on actions in other threads, "thread_local" will only error for actions in the current thread, and "relaxed" will not error on these actions. Inductor codegen is single-threaded, so it should be safe to enable "thread_local" for inductor's cuda graph capturing. We have seen errors when inductor cudagraphs has been used concurrently with data preprocessing in other threads. Differential Revision: [D48656014](https://our.internmc.facebook.com/intern/diff/D48656014) Pull Request resolved: https://github.com/pytorch/pytorch/pull/107407 Approved by: https://github.com/albanD, https://github.com/eqy	2023-08-25 01:44:26 +00:00
Zachary DeVito	cc54448a07	[memory snapshot] add 'address' key to block (#107171 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/107171 Approved by: https://github.com/ngimel	2023-08-23 18:57:24 +00:00
Aaron Gokaslan	660e8060ad	[BE]: Update ruff to 0.285 (#107519 ) This updates ruff to 0.285, which is faster, better, and fixes a bunch of false negatives with regard to f-strings. I also enabled RUF017, which looks for accidental quadratic list summation. Luckily, it seems there are no instances of it in our codebase, so I am enabling it so that it stays that way. :) Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519 Approved by: https://github.com/ezyang	2023-08-22 23:16:38 +00:00
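For readers unfamiliar with the pattern RUF017 targets, here is a minimal sketch of quadratic list summation next to a linear-time alternative. The function names and sample data are illustrative only and do not come from the PyTorch codebase.

```python
from itertools import chain


def flatten_quadratic(lists):
    # Flagged pattern: every `+` builds a fresh list holding everything
    # accumulated so far, so total work grows quadratically.
    return sum(lists, [])


def flatten_linear(lists):
    # Linear-time alternative: each element is visited exactly once.
    return list(chain.from_iterable(lists))


nested = [[1, 2], [3], [], [4, 5]]
flat_a = flatten_quadratic(nested)
flat_b = flatten_linear(nested)
print(flat_a == flat_b)
```

Both functions return the same result; the difference only matters for performance once the number of sublists grows.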
PyTorch MergeBot	d59a6864fb	Revert "[BE]: Update ruff to 0.285 (#107519 )" This reverts commit `88ab3e4322`. Reverted https://github.com/pytorch/pytorch/pull/107519 on behalf of https://github.com/ZainRizvi due to Sorry, but this PR breaks internal tests. @ezyang, can you please hep them get unblocked? It seems like one of the strings was prob accidentally modified ([comment](https://github.com/pytorch/pytorch/pull/107519#issuecomment-1688833480))	2023-08-22 19:53:32 +00:00
Aaron Gokaslan	88ab3e4322	[BE]: Update ruff to 0.285 (#107519 ) This updates ruff to 0.285 which is faster, better, and have fixes a bunch of false negatives with regards to fstrings. I also enabled RUF017 which looks for accidental quadratic list summation. Luckily, seems like there are no instances of it in our codebase, so enabling it so that it stays like that. :) Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519 Approved by: https://github.com/ezyang	2023-08-20 01:36:18 +00:00
lcskrishna	bc662ffff9	[ROCm] Update ROCm skip decorators (#106138 ) This PR adds a msg argument for skipIfRocm and skipCUDAIfRocm. Pull Request resolved: https://github.com/pytorch/pytorch/pull/106138 Approved by: https://github.com/jataylo, https://github.com/jeffdaily, https://github.com/pruthvistony, https://github.com/albanD	2023-08-18 22:02:06 +00:00
Zachary DeVito	80988b6277	Introduce memory stacks for free (#106758 ) Previously when we recorded a free action in a memory trace, we would provide the stack for when the block was allocated. This is faster because we do not have to record stacks for free, which would otherwise double the number of stacks collected. However, sometimes knowing the location of a free is useful for figuring out why a tensor was live. So this PR adds this behavior. If performance ends up being a concern the old behavior is possible by passing "alloc" to the context argument rather than "all". Also refactors some of the glue logic to be consistent across C++ and Python and routes the Python API through the C++ version. Pull Request resolved: https://github.com/pytorch/pytorch/pull/106758 Approved by: https://github.com/albanD	2023-08-14 20:38:15 +00:00
Jane Xu	0208574db9	[NAdam] Add capturable API and tests + fix differentiable (#106615 ) This PR: - adds a capturable API for NAdam similar to Adam(W) - adds tests accordingly - discovered and fixed bugs in the differentiable implementation (now tested through the capturable codepath). Pull Request resolved: https://github.com/pytorch/pytorch/pull/106615 Approved by: https://github.com/albanD	2023-08-07 19:49:11 +00:00
Zachary DeVito	3e5a52cedd	[memory snapshot] track context for segments (#106113 ) We want to display the stack for the original cudaMalloc that created a segment. Previously we could only report the last time the segment memory was used, or the record of the segment_alloc could appear in the list of allocator actions. This PR ensures that, regardless of whether we still have the segment_alloc action, the context for a segment is still available. The visualizer is updated to be able to incorporate this information. This PR adds a new field to Block. However the previous stacked cleanup PR removed a field of the same size, making the change to Block size-neutral. Pull Request resolved: https://github.com/pytorch/pytorch/pull/106113 Approved by: https://github.com/aaronenyeshi	2023-07-28 06:45:48 +00:00
Zachary DeVito	45b564766d	[memory snapshots] removed chained history (#106079 ) For free blocks of memory in the allocator, we previously kept a linked list of the stack frames of previous allocations that lived there. This was only ever used in one flamegraph visualization and never proved useful at understanding what was going on. When memory history tracing was added, it became redundant, since we can see the history of the free space from recording the previous actions anyway. This patch removes this functionality and simplifies the snapshot format: allocated blocks directly have a 'frames' attribute rather than burying stack frames in the history. Previously the memory history tracked the real size of allocations before rounding. Since history was added, 'requested_size' has been added directly to the block which records the same information, so this patch also removes that redundancy. None of this functionality has been part of a PyTorch release with BC guarantees, so it should be safe to alter this part of the format. This patch also updates our visualization tools to work with the simplified format. Visualization tools keep support for the old format in `_legacy` functions so that during the transition old snapshot files can still be read. Pull Request resolved: https://github.com/pytorch/pytorch/pull/106079 Approved by: https://github.com/eellison	2023-07-28 06:45:48 +00:00
Justin Chu	4cc1745b13	[BE] f-stringify torch/ and scripts (#105538 ) This PR is a follow up on the pyupgrade series to convert more strings to use f-strings using `flynt`. - https://docs.python.org/3/reference/lexical_analysis.html#f-strings - https://pypi.org/project/flynt/ Command used: ``` flynt torch/ -ll 120 flynt scripts/ -ll 120 flynt tools/ -ll 120 ``` and excluded `collect_env.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/105538 Approved by: https://github.com/ezyang, https://github.com/malfet	2023-07-21 19:35:24 +00:00
Justin Chu	73e1455327	[BE] Enable ruff's UP rules and autoformat test/ (#105434 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/105434 Approved by: https://github.com/albanD	2023-07-19 20:36:06 +00:00
Nikita Shulga	c3e4a67905	Refactor multigpu tests to `test_cuda_multigpu` (#104059 ) Mostly refactor, that moves all the tests from `test_cuda` that benefit from multiGPU environment into its own file. - Add `TestCudaMallocAsync` class for Async tests ( to separate them from `TestCudaComm`) - Move individual tests from `TestCuda` to `TestCudaMultiGPU` - Move `_create_scaling_models_optimizers` and `_create_scaling_case` to `torch.testing._internal.common_cuda` - Add newly created `test_cuda_multigpu` to the multigpu periodic test <!-- copilot:summary --> ### <samp>🤖 Generated by Copilot at f4d46fa</samp> This pull request fixes a flaky test and improves the testing of gradient scaling on multiple GPUs. It adds verbose output for two CUDA tests, and refactors some common code into helper functions in `torch/testing/_internal/common_cuda.py`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/104059 Approved by: https://github.com/huydhn	2023-06-27 05:32:05 +00:00
Zachary DeVito	afc788a99c	Re-land _cycleviz.py: visualize reference cycles holding cuda memory (#104051 ) Reference cycles are freed by the cycle collector rather than being cleaned up when the objects in the cycle first become unreachable. If a cycle points to a tensor, the CUDA memory for that tensor will not be freed until garbage collection runs. Accumulation of CUDA allocations can lead to out of memory errors (OOMs), as well as non-deterministic allocation behavior which is harder to debug. This visualizer installs a garbage collection hook to look for cycles containing CUDA tensors and saves a visualization of the garbage: ``` from torch.cuda._cycleviz import warn_tensor_cycles warn_tensor_cycles() # do some work that results in a cycle getting garbage collected # ... > WARNING:root:Reference cycle includes a CUDA Tensor see visualization of cycle /tmp/tmpeideu9gl.html ``` Reland to make windows skip the test. This reverts commit `7b3b6dd426`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/104051 Approved by: https://github.com/aaronenyeshi, https://github.com/malfet	2023-06-23 13:44:58 +00:00
PyTorch MergeBot	7b3b6dd426	Revert "_cycleviz.py: visualize reference cycles holding cuda memory (#102656 )" This reverts commit `dba67f71c9`. Reverted https://github.com/pytorch/pytorch/pull/102656 on behalf of https://github.com/huydhn due to Sorry for reverting your PR. But I think the change is failing on Windows CUDA https://github.com/pytorch/pytorch/actions/runs/5341701630/jobs/9683293600 ([comment](https://github.com/pytorch/pytorch/pull/102656#issuecomment-1603035364))	2023-06-22 17:16:47 +00:00
Zachary DeVito	dba67f71c9	_cycleviz.py: visualize reference cycles holding cuda memory (#102656 ) Reference cycles are freed by the cycle collector rather than being cleaned up when the objects in the cycle first become unreachable. If a cycle points to a tensor, the CUDA memory for that tensor will not be freed until garbage collection runs. Accumulation of CUDA allocations can lead to out of memory errors (OOMs), as well as non-deterministic allocation behavior which is harder to debug. This visualizer installs a garbage collection hook to look for cycles containing CUDA tensors and saves a visualization of the garbage: ``` from torch.cuda._cycleviz import warn_tensor_cycles warn_tensor_cycles() # do some work that results in a cycle getting garbage collected # ... > WARNING:root:Reference cycle includes a CUDA Tensor see visualization of cycle /tmp/tmpeideu9gl.html ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/102656 Approved by: https://github.com/aaronenyeshi	2023-06-22 04:00:28 +00:00
Nikita Shulga	cd05c3b98c	[BE] Use `TEST_MULTIGPU` from `common_cuda.py` (#103982 ) Comment about `TEST_CUDNN` called over and over has long been alleviated by wrapping the check with `LazyVal`, that caches the results. Also, delete unused `TEST_MAGMA`. Prep change for https://github.com/pytorch/pytorch/issues/100006 <!-- copilot:poem --> ### <samp>🤖 Generated by Copilot at e3a5b39</samp> > _`common_cuda.py`_ > _Refactored for dynamo tests_ > _Winter code cleanup_ Pull Request resolved: https://github.com/pytorch/pytorch/pull/103982 Approved by: https://github.com/atalman, https://github.com/janeyx99	2023-06-22 00:07:44 +00:00
Zachary DeVito	19b3e07fe0	[memory_viz] Unified viewer (#103565 ) This replaces the individual visualization routines in _memory_viz.py with a single javascript application. The javascript application can load pickled snapshot dumps directly using drag/drop, requesting them via fetch, or by embedding them in a webpage. The _memory_viz.py commands use the embedding approach. We can also host MemoryViz.js on a webpage to use the drag/drop approach, e.g. https://zdevito.github.io/assets/viz/ (eventually this should be hosted with the pytorch docs). All views/multiple cuda devices are supported on one page. Pull Request resolved: https://github.com/pytorch/pytorch/pull/103565 Approved by: https://github.com/eellison, https://github.com/albanD	2023-06-16 03:49:48 +00:00
Xiao Wang	39f3514fa3	Add an env PYTORCH_TEST_SKIP_CUDAGRAPH to skip all cuda graph-related unit tests (#103032 ) Skip all cuda graph-related unit tests by setting env var `PYTORCH_TEST_SKIP_CUDAGRAPH=1` This PR refactors the `TEST_CUDA` python variable in test_cuda.py into common_utils.py. This PR also creates a new python variable `TEST_CUDA_GRAPH` in common_utils.py, which has an env var switch to turn off all cuda graph-related tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/103032 Approved by: https://github.com/malfet	2023-06-06 07:51:57 +00:00
Nikita Shulga	ca470fc59f	[BE] Make `test_no_triton_on_import` simple (#102674 ) Do not try to parse raised exception for no good reason Add short description Reduce script to a single line <!-- copilot:poem --> ### <samp>🤖 Generated by Copilot at ea4164e</samp> > _`test_no_triton_on_import`_ > _Cleans up the code, adds docs_ > _No hidden errors_ Pull Request resolved: https://github.com/pytorch/pytorch/pull/102674 Approved by: https://github.com/cpuhrsch, https://github.com/albanD	2023-06-01 20:31:18 +00:00
Nikita Vedeneev	d80d3b18d0	nn.Linear with BSR inputs: spare the user from explicit Triton kernel registrations (#98403 ) <!-- copilot:summary --> ### <samp>🤖 Generated by Copilot at 08f7a6a</samp> This pull request adds support for triton kernels in `torch` and `torch/cuda`, and refactors and tests the existing triton kernel for BSR matrix multiplication. It also adds a test case to ensure that importing `torch` does not implicitly import `triton`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/98403 Approved by: https://github.com/malfet, https://github.com/cpuhrsch	2023-05-31 13:09:45 +00:00
Masaki Kozuki	c8579b7374	Run `test_cpp_memory_snapshot_pickle` only when linux and x86_64 (#101366 ) On Arm, I got ``` Traceback (most recent call last): File "/opt/pytorch/pytorch/test/test_cuda.py", line 5260, in test_cpp_memory_snapshot_pickle mem = run() File "/opt/pytorch/pytorch/test/test_cuda.py", line 5257, in run t = the_script_fn() File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 496, in prof_func_call return prof_callable(func_call, args, kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 493, in prof_callable return callable(args, **kwargs) RuntimeError: The following operation failed in the TorchScript interpreter. Traceback of TorchScript (most recent call last): File "/opt/pytorch/pytorch/test/test_cuda.py", line 5254, in the_script_fn @torch.jit.script def the_script_fn(): return torch.rand(311, 411, device='cuda') ~~~~~~~~~~ <--- HERE RuntimeError: record_context_cpp is not support on non-linux non-x86_64 platforms ``` `dfe484a3b3/torch/csrc/profiler/unwind/unwind.cpp (L4-L24)` seems related Pull Request resolved: https://github.com/pytorch/pytorch/pull/101366 Approved by: https://github.com/zdevito	2023-05-17 19:44:21 +00:00
Elias Ellison	3edff6b6ec	Improve detection of workspace/non-output allocations in cudagraphs (#99985 ) When we run cudagraph trees we are not allowed to have permanent workspace allocations like in cublas because we might need to reclaim that memory for a previous cudagraph recording, and it is memory that is not accounted for in output weakrefs so it does not work with checkpointing. Previously, I would check that we didn't have any additional allocations through snapshotting. This was extremely slow so I had to turn it off. This PR first does the quick checking to see if we are in an error state, then if we are does the slow logic of creating snapshot. Also turns on history recording so we get a stacktrace of where the bad allocation came from. Pull Request resolved: https://github.com/pytorch/pytorch/pull/99985 Approved by: https://github.com/zdevito	2023-05-01 15:58:45 +00:00
Jane Xu	808267767c	Prevent grad scale from overflowing (#98876 ) Fixes #98828 by capping the growth in the kernel Pull Request resolved: https://github.com/pytorch/pytorch/pull/98876 Approved by: https://github.com/ngimel	2023-04-25 20:59:44 +00:00
Aaron Gokaslan	e2a3817dfd	[BE] Enable C419 rule for any all shortcircuiting (#99890 ) Apparently https://github.com/pytorch/pytorch/pull/78142 made torch.JIT allow for simple generator expressions which allows us to enable rules that replace unnecessary list comprehensions with generators in any/all. This was originally part of #99280 but I split it off into this PR so that it can be easily reverted should anything break. Pull Request resolved: https://github.com/pytorch/pytorch/pull/99890 Approved by: https://github.com/justinchuby, https://github.com/kit1980, https://github.com/malfet	2023-04-25 15:02:13 +00:00
Masaki Kozuki	b87c7ab6d6	Remove redundant `found_inf` recompute from `_step_supports_amp_unscaling` path (#98620 ) following https://github.com/pytorch/pytorch/pull/97415#issuecomment-1499787115. Rel: https://github.com/pytorch/pytorch/pull/98613 Pull Request resolved: https://github.com/pytorch/pytorch/pull/98620 Approved by: https://github.com/janeyx99	2023-04-20 19:24:09 +00:00
Animesh Jain	971df458db	Reland of "Python binding to set/get CUDA rng state offset" (#99565 ) Why? * To reduce the latency of hot path in https://github.com/pytorch/pytorch/pull/97377 Concern - I had to add `set_offset` in all instances of `GeneratorImpl`. I don't know if there is a better way. ~~~~ import torch torch.cuda.manual_seed(123) print(torch.cuda.get_rng_state()) torch.cuda.set_rng_state_offset(40) print(torch.cuda.get_rng_state()) tensor([123, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=torch.uint8) tensor([123, 0, 0, 0, 0, 0, 0, 0, 40, 0, 0, 0, 0, 0, 0, 0], dtype=torch.uint8) ~~~~ Reland of https://github.com/pytorch/pytorch/pull/98965 (cherry picked from commit `8214fe07e8`) Pull Request resolved: https://github.com/pytorch/pytorch/pull/99565 Approved by: https://github.com/anijain2305	2023-04-20 15:42:25 +00:00
PyTorch MergeBot	bb2cd4a107	Revert "Python binding to set/get CUDA rng state offset (#98965 )" This reverts commit `8214fe07e8`. Reverted https://github.com/pytorch/pytorch/pull/98965 on behalf of https://github.com/DanilBaibak due to Break internal build	2023-04-19 11:23:32 +00:00
Animesh Jain	8214fe07e8	Python binding to set/get CUDA rng state offset (#98965 ) Why? * To reduce the latency of hot path in https://github.com/pytorch/pytorch/pull/97377 Concern - I had to add `set_offset` in all instances of `GeneratorImpl`. I don't know if there is a better way. ~~~~ import torch torch.cuda.manual_seed(123) print(torch.cuda.get_rng_state()) torch.cuda.set_rng_state_offset(40) print(torch.cuda.get_rng_state()) tensor([123, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=torch.uint8) tensor([123, 0, 0, 0, 0, 0, 0, 0, 40, 0, 0, 0, 0, 0, 0, 0], dtype=torch.uint8) ~~~~ Pull Request resolved: https://github.com/pytorch/pytorch/pull/98965 Approved by: https://github.com/kulinseth, https://github.com/ezyang	2023-04-18 07:52:21 +00:00
Zachary DeVito	7ff1f3f3f6	Revert "Revert "Expandable blocks in allocator (#96995 )"" (#99275 ) This reverts commit `851e89c8e8`. Differential Revision: [D45034526](https://our.internmc.facebook.com/intern/diff/D45034526) Pull Request resolved: https://github.com/pytorch/pytorch/pull/99275 Approved by: https://github.com/eellison	2023-04-17 23:46:08 +00:00
PyTorch MergeBot	851e89c8e8	Revert "Expandable blocks in allocator (#96995 )" This reverts commit `6a50b83b73`. Reverted https://github.com/pytorch/pytorch/pull/96995 on behalf of https://github.com/izaitsevfb due to Breaks internal tests	2023-04-16 19:23:37 +00:00
Zachary DeVito	6a50b83b73	Expandable blocks in allocator (#96995 ) Common advice we give for handling memory fragmentation issues is to allocate a big block upfront to reserve memory which will get split up later. For programs with changing tensor sizes this can be especially helpful to avoid OOMs that happen the first time we see a new largest input and would otherwise have to allocate new segments. However the issue with allocating a block upfront is that it is nearly impossible to correctly estimate the size of that block. If too small, space in the block will run out and the allocator will allocate separate blocks anyway. Too large, and other non-PyTorch libraries might stop working because they cannot allocate any memory. This patch provides the same benefits as using a pre-allocated block but without having to choose its size upfront. Using the cuMemMap-style APIs, it adds the ability to expand the last block in a segment when more memory is needed. Compared to universally using cudaMallocAsync to avoid fragmentation, this patch can fix this common fragmentation issue while preserving most of the existing allocator behavior. This behavior can be enabled and disabled dynamically. This should allow users to, for instance, allocate long-lived parameters and state in individual buffers, and put temporary state into the large expandable blocks, further reducing fragmentation. See inline comments for information about the implementation and its limitations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/96995 Approved by: https://github.com/eellison	2023-04-14 09:49:11 +00:00
Peeyush Agarwal	ebd4c165ff	Back out "`GradScaler` recomputes `optimizer_state["found_inf_per_device"]` before `optimizer.step` (#97415 )" (#98613 ) Summary: This change causes multi-GPU job from XI team to hang after 8K steps. Differential Revision: D44797248 Pull Request resolved: https://github.com/pytorch/pytorch/pull/98613 Approved by: https://github.com/ngimel	2023-04-07 23:31:31 +00:00
Zachary DeVito	b1a83c4da4	[memory history] cleanup recording API (#97406 ) This makes the options for recording memory history easier to understand and makes the default to record the most information. <!-- copilot:summary --> ### <samp>🤖 Generated by Copilot at 4706acf</samp> This pull request enhances the memory profiling and debugging capabilities of PyTorch on CUDA devices. It introduces a new API for memory history recording in `torch/cuda/memory.py` and `test/test_cuda.py`, and adds new functions for memory snapshot management and visualization in `torch/cuda/memory.py`. Also adds a quick _dump_snapshot function to make it easier to look at the common visualizations. <!-- copilot:walkthrough --> ### <samp>🤖 Generated by Copilot at 4706acf</samp> * Modify the `_record_memory_history` function to use a new API that accepts a string argument for the `enabled` parameter and more parameters to control the stack trace collection and memory event history ([link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-80bd98caafb20d758f45a4d23711810f7e0b9ce7a6505094f9dbb0e00a657377L620-R696)) * Add a new function `_dump_snapshot` that allows users to dump a memory snapshot to a directory with HTML plots of the memory segments and events ([link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-80bd98caafb20d758f45a4d23711810f7e0b9ce7a6505094f9dbb0e00a657377R703-R713)) * Update the test cases in `test/test_cuda.py` to use the new API for memory history recording and check the expected output of the memory plots ([link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L4946-R4946), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L4984-R4984), 
[link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5000-R5000), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5015-R5015), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5035-R5038), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450R5045-R5046), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5060-R5059), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5068-R5065), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5088-R5085)) * Add missing imports and types to the `torch/cuda/memory.py` module ([link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-80bd98caafb20d758f45a4d23711810f7e0b9ce7a6505094f9dbb0e00a657377L5-R15)) Pull Request resolved: https://github.com/pytorch/pytorch/pull/97406 Approved by: https://github.com/ezyang	2023-03-28 16:31:10 +00:00
soulitzer	51c3fd39a5	Modify all calls to checkpoint to pass use_reentrant explicitly (#97376 ) Fixes #ISSUE_NUMBER This is the first step toward making use_reentrant=False the default. Pull Request resolved: https://github.com/pytorch/pytorch/pull/97376 Approved by: https://github.com/albanD	2023-03-27 13:37:42 +00:00
Masaki Kozuki	b5edf18334	`GradScaler` recomputes `optimizer_state["found_inf_per_device"]` before `optimizer.step` (#97415 ) I found a discrepancy between non-fused and fused optimizers, which is to use `optimizer_state["found_inf"]` or to recompute `found_inf`. - non fused: `e64ddd1ab9/torch/cuda/amp/grad_scaler.py (L289)` - fused: `e64ddd1ab9/torch/cuda/amp/grad_scaler.py (L353)` - where `_check_inf_per_device` is `e64ddd1ab9/torch/cuda/amp/grad_scaler.py (L564-L573)` The other way to align the behavior is to use the existing `found_inf` in `e64ddd1ab9/torch/cuda/amp/grad_scaler.py (L353)`. I'd say this PR is for the sake of "safety" and the alternative is to keep the existing behavior. I honestly have no idea if it's expected to double-check the sanity of gradients in `GradScaler.step`. --- What I've observed so far in the huggingface/transformers T5-base example is that non-fused optimizers lead to invalid parameters while fused ones do not. The cause seems to be that `gradients` become inf/nan before `GradScaler.step(optimizer)` after `GradScaler._unscale_grads_` (more precisely, the call of `torch._amp_foreach_non_finite_check_and_unscale_`) in the script of the issue linked below, i.e. the gradient clipping and/or unscaling lead to inf/nan as these happen after the grad check. See `788300cc2a/aten/src/ATen/native/cuda/AmpKernels.cu (L165-L174)`. Fixes #96755 🙏 Pull Request resolved: https://github.com/pytorch/pytorch/pull/97415 Approved by: https://github.com/ngimel, https://github.com/janeyx99	2023-03-24 17:36:47 +00:00
Tailing Yuan	63e1f12b49	Speedup bincount and histc on CUDA (#97090 ) This is to speed up torch.bincount and torch.histc on CUDA. 1. Speed up int64_t gpuAtomicAdd, 2. and optimize the histogram kernel.

# Fixes #96626

After speedup, time cost in #96626 would be

```
... (run 2 times and ignore the first run)
case 1 CPU 0.0003631114959716797 seconds
case 1 CUDA 0.0005860328674316406 seconds
case 2 CPU 0.0013742446899414062 seconds
case 2 CUDA 0.0008623600006103516 seconds
```

Note that in "case 1 CUDA", the max op takes the most time, i.e., `5ee5a164ff/aten/src/ATen/native/cuda/SummaryOps.cu (L334-L335)`, which is not to be optimized in this PR.

# Benchmark

Time is measured on i7-10700 + RTX 3080, Ubuntu 22.04 (in WSL). The baseline is PyTorch 2.0.0+cu117. My dev version of PyTorch is compiled with CUDA 11.8. Each case is measured 15 times to take the median.

## torch.bincount

#elem | nbins | distribution | CPU | PyTorch 2.0.0 | this PR | speedup
-- | -- | -- | -- | -- | -- | --
2^20 | 80 | random.uniform | 0.000834 | 0.005783 | 0.000266 | 21.8x
2^20 | 80 | narrow in 1 bin | 0.001576 | 0.003967 | 0.000563 | 7.0x
2^20 | 500 | random.uniform | 0.000852 | 0.003641 | 0.000334 | 10.9x
2^20 | 500 | narrow in 1% bins | 0.000894 | 0.001878 | 0.000349 | 5.4x
2^20 | 2048 | random.uniform | 0.000891 | 0.000820 | 0.000298 | 2.8x
2^20 | 2048 | narrow in 1% bins | 0.000958 | 1.043251 | 0.000335 | 3,116.6x
2^26 | 80 | random.uniform | 0.067715 | 0.322409 | 0.003032 | 106.3x
2^26 | 80 | narrow in 1 bin | 0.110940 | 0.194644 | 0.017651 | 11.0x
2^26 | 500 | random.uniform | 0.066666 | 0.192302 | 0.002535 | 75.8x
2^26 | 500 | narrow in 1% bins | 0.066130 | 0.092237 | 0.005462 | 16.9x
2^26 | 2048 | random.uniform | 0.066371 | 0.035308 | 0.002476 | 14.3x
2^26 | 2048 | narrow in 1% bins | 0.068453 | 72.122858 | 0.003185 | 22,644.3x

## torch.histc (float32)

#elem | nbins | distribution | CPU | PyTorch 2.0.0 | this PR | speedup
-- | -- | -- | -- | -- | -- | --
2^20 | 80 | random.uniform | 0.001261 | 0.000145 | 9.47E-05 | 1.5x
2^20 | 80 | narrow in 1 bin | 0.001074 | 0.000356 | 0.000311 | 1.1x
2^20 | 500 | random.uniform | 0.001162 | 0.000227 | 9.18E-05 | 2.5x
2^20 | 500 | narrow in 1% bins | 0.001082 | 0.000201 | 0.000152 | 1.3x
2^20 | 2048 | random.uniform | 0.001100 | 0.000203 | 0.000118 | 1.7x
2^20 | 2048 | narrow in 1% bins | 0.001089 | 0.000396 | 0.000107 | 3.7x
2^26 | 80 | random.uniform | 0.064219 | 0.001170 | 0.000786 | 1.5x
2^26 | 80 | narrow in 1 bin | 0.056471 | 0.013283 | 0.011939 | 1.1x
2^26 | 500 | random.uniform | 0.078183 | 0.003411 | 0.000562 | 6.1x
2^26 | 500 | narrow in 1% bins | 0.056711 | 0.002763 | 0.002738 | 1.0x
2^26 | 2048 | random.uniform | 0.059296 | 0.003503 | 0.000533 | 6.6x
2^26 | 2048 | narrow in 1% bins | 0.061754 | 0.015703 | 0.000962 | 16.3x

## torch.histc (int64)

#elem | nbins | distribution | CPU | PyTorch 2.0.0 | this PR | speedup
-- | -- | -- | -- | -- | -- | --
2^20 | 80 | random.uniform | N/A | 0.005614 | 9.47E-05 | 59.3x
2^20 | 80 | narrow in 1 bin | N/A | 0.003799 | 0.000395 | 9.6x
2^20 | 500 | random.uniform | N/A | 0.003665 | 9.58E-05 | 38.2x
2^20 | 500 | narrow in 1% bins | N/A | 0.001760 | 0.000178 | 9.9x
2^20 | 2048 | random.uniform | N/A | 0.000693 | 0.000111 | 6.2x
2^20 | 2048 | narrow in 1% bins | N/A | 1.082904 | 0.000123 | 8,802.4x
2^26 | 80 | random.uniform | N/A | 0.320400 | 0.001145 | 279.9x
2^26 | 80 | narrow in 1 bin | N/A | 0.193668 | 0.015229 | 12.7x
2^26 | 500 | random.uniform | N/A | 0.182897 | 0.000823 | 222.2x
2^26 | 500 | narrow in 1% bins | N/A | 0.089363 | 0.00376 | 23.8x
2^26 | 2048 | random.uniform | N/A | 0.033190 | 0.000832 | 39.9x
2^26 | 2048 | narrow in 1% bins | N/A | 71.721012 | 0.001525 | 47,017.8x

## Benchmark code

Here is the benchmark code:

```python3
import time
import torch

cases = [
    ("bincount bins=80 wide ", torch.randint(80, [2**20]), lambda x: torch.bincount(x, minlength=80)),
    ("bincount bins=80 narrow", torch.randint(1, [2**20]), lambda x: torch.bincount(x, minlength=80)),
    ("bincount bins=500 wide ", torch.randint(500, [2**20]), lambda x: torch.bincount(x, minlength=500)),
    ("bincount bins=500 narrow", torch.randint(5, [2**20]), lambda x: torch.bincount(x, minlength=500)),
    ("bincount bins=2048 wide ", torch.randint(2048, [2**20]), lambda x: torch.bincount(x, minlength=2048)),
    ("bincount bins=2048 narrow", torch.randint(20, [2**20]), lambda x: torch.bincount(x, minlength=2048)),
    ("histc_float bins=80 wide ", torch.rand(2**20), lambda x: torch.histc(x, bins=80, min=0., max=1.)),
    ("histc_float bins=80 narrow", torch.rand(2**20)*.01, lambda x: torch.histc(x, bins=80, min=0., max=1.)),
    ("histc_float bins=500 wide ", torch.rand(2**20), lambda x: torch.histc(x, bins=500, min=0., max=1.)),
    ("histc_float bins=500 narrow", torch.rand(2**20)*.01, lambda x: torch.histc(x, bins=500, min=0., max=1.)),
    ("histc_float bins=2048 wide ", torch.rand(2**20), lambda x: torch.histc(x, bins=2048, min=0., max=1.)),
    ("histc_float bins=2048 narrow", torch.rand(2**20)*.01, lambda x: torch.histc(x, bins=2048, min=0., max=1.)),
    ("histc_int bins=80 wide ", torch.randint(80, [2**20]), lambda x: torch.histc(x, bins=80, min=0., max=80.)),
    ("histc_int bins=80 narrow", torch.randint(1, [2**20]), lambda x: torch.histc(x, bins=80, min=0., max=80.)),
    ("histc_int bins=500 wide ", torch.randint(500, [2**20]), lambda x: torch.histc(x, bins=500, min=0., max=500.)),
    ("histc_int bins=500 narrow", torch.randint(5, [2**20]), lambda x: torch.histc(x, bins=500, min=0., max=500.)),
    ("histc_int bins=2048 wide ", torch.randint(2048, [2**20]), lambda x: torch.histc(x, bins=2048, min=0., max=2048.)),
    ("histc_int bins=2048 narrow", torch.randint(20, [2**20]), lambda x: torch.histc(x, bins=2048, min=0., max=2048.)),
]

def test(case, device):
    name, x, func = case
    x = x.to(device)
    time_samples = []
    for _ in range(15):
        torch.cuda.synchronize()
        t1 = time.time()
        func(x)
        torch.cuda.synchronize()
        t2 = time.time()
        time_samples.append(t2 - t1)
    median = sorted(time_samples)[len(time_samples) // 2]
    print(device, name, median)

for case in cases:
    test(case, device="cuda")

# for case in cases:
#     test(case, device="cpu")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97090 Approved by: https://github.com/ngimel	2023-03-24 00:25:34 +00:00
Masaki Kozuki	22ea21da3d	Change 1D Tensor of 1 element to 0D Tensor (#96994 ) add 0d tensor to graph adam/adamw test Affected: - `torch.cuda.amp.GradScaler`'s `found_inf`, `_scale`, and `_growth_tracker` - `step` of Adam & AdamW of `capturable` Fixes #96776 🤞 Pull Request resolved: https://github.com/pytorch/pytorch/pull/96994 Approved by: https://github.com/janeyx99	2023-03-21 18:24:19 +00:00
Elias Ellison	571f96bf59	cudagraph trees (#89146 ) CUDA Graph Trees Design doc: https://docs.google.com/document/d/1ZrxLGWz7T45MSX6gPsL6Ln4t0eZCSfWewtJ_qLd_D0E/edit Not currently implemented: - Right now, we are using weak tensor refs from outputs to check if a tensor has died. This doesn't work because of a) aliasing, and b) aot_autograd detaching tensors (see note [Detaching saved tensors in AOTAutograd]). Would need either https://github.com/pytorch/pytorch/issues/91395 to land to use storage weak refs or manually add a deleter fn that does what I want. This is doable but there are some interactions with the caching allocator checkpointing, so saving for a stacked pr. - Reclaiming memory from the inputs during model recording. This isn't terribly difficult but deferring to another PR. You would need to write over the input memory during warmup, and therefore copy the inputs to cpu. Saving for a stacked pr. - Warning on overwriting previous generation outputs, and handling nested torch.compile() calls in generation tracking Differential Revision: [D43999887](https://our.internmc.facebook.com/intern/diff/D43999887) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89146 Approved by: https://github.com/ezyang	2023-03-17 02:47:03 +00:00
Elias Ellison	ea7415087a	Expose Stream Recording Apis in python (#96384 ) Differential Revision: [D43999891](https://our.internmc.facebook.com/intern/diff/D43999891) Pull Request resolved: https://github.com/pytorch/pytorch/pull/96384 Approved by: https://github.com/zdevito	2023-03-16 23:45:43 +00:00
Zachary DeVito	e74f70d212	Revert "Revert "[memory profiling] add a facility to gather combined C++/Python/TorchScript stack traces. (#95541 )"" (#96878 ) This reverts commit `e1ea584b1c`. Adds __has_include check to fix fbcode build. Pull Request resolved: https://github.com/pytorch/pytorch/pull/96878 Approved by: https://github.com/ezyang	2023-03-16 04:12:54 +00:00
PyTorch MergeBot	e1ea584b1c	Revert "[memory profiling] add a facility to gather combined C++/Python/TorchScript stack traces. (#95541 )" This reverts commit `4e1060c609`. Reverted https://github.com/pytorch/pytorch/pull/95541 on behalf of https://github.com/DanilBaibak due to breaking internal builds	2023-03-15 13:28:41 +00:00
Zachary DeVito	4e1060c609	[memory profiling] add a facility to gather combined C++/Python/TorchScript stack traces. (#95541 ) This refactors the stack trace facility specific to memory profiling in python+cuda to make a generic facility to generate combined stack traces. The generic facility (combined_traceback.h) does not require python to be around to work, but will return python stacks if it is present. This facility is then used to add support for stack trace gathering in memory profiling that happens directly from C++. It is also used to expose a python API for gathering and symbolizing combined stacks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/95541 Approved by: https://github.com/ezyang	2023-03-14 18:26:05 +00:00
Elias Ellison	da265652d6	Return Live Data Pointers from Checkpoint, swap onto tensors (#95020 ) When we checkpoint the state of the private pool allocator, we will need to make sure that its current live allocated blocks will get properly cleaned up when the tensors they correspond to die. Return DataPtrs for these new allocated blocks that the callee can swap onto live Tensors. The exact api for setting the checkpoint can be manipulated after this as the cudagraph implementation is built out, but this at least shows its sufficiently general. This should be the last PR touching cuda caching allocator necessary for new cudagraphs integration. Differential Revision: [D43999888](https://our.internmc.facebook.com/intern/diff/D43999888) Pull Request resolved: https://github.com/pytorch/pytorch/pull/95020 Approved by: https://github.com/zdevito	2023-03-14 01:22:19 +00:00
Elias Ellison	1cc32aedb0	Handle additional live allocations not in checkpointed state (#94943 ) We choose to ignore certain blocks that are currently allocated when we set the pool to its checkpoint. For those blocks, we need to swap out the deleter function of their corresponding blocks so that a deallocation is not triggered when they die. Differential Revision: [D43999886](https://our.internmc.facebook.com/intern/diff/D43999886) Pull Request resolved: https://github.com/pytorch/pytorch/pull/94943 Approved by: https://github.com/zdevito	2023-03-14 01:00:47 +00:00
Elias Ellison	d798de2b05	Checkpoint CUDA Allocator Private Pool State (#94653 ) Copying note from cuda caching allocator:
```
 * Note [Checkpointing PrivatePoolState]
 *
 * Refer above to Note [Interaction with CUDA graph capture]. Allocations made
 * during graph capture are made from a separate private pool. During graph
 * capture allocations behave as usual. During graph replay the allocator
 * state does not change even as new tensors are created. The private pool
 * will not free its blocks to the main caching allocator until cuda graph use
 * is finished to prevent an allocation from eager clobbering the memory from
 * a live but unaccounted for tensor that was created during replay.
 *
 * `make_graphed_callables`, a series of separate callables chained in
 * successive cuda graphs, can share a memory pool because after a cuda graph
 * recording the allocations in the shared private pool exactly reflect the
 * tensors that are allocated.
 *
 * We would like to extend callable chaining to support a graphed callable
 * tree. In this scenario, we have a tree of callable chains which will be
 * captured with cuda graphs. In the diagram below, we have a tree with four
 * callables, A, B, C, and D. Suppose we have captured, and subsequently
 * replayed, A, B, and C. Then on a new invocation, we replay A and B, but
 * would now like to record D. At this point the private pool will not reflect
 * any of the live tensors created during graph replay. Allocations made
 * during a new recording with the pool could overwrite those live tensors.
 *
 * In order to record a new graph capture after replaying prior callables in
 * the tree, we need the allocator to reflect the state of the live tensors.
 * We checkpoint the state of the private pool after each recording, and then
 * reapply it when we are starting a new recording chain. Additionally, we
 * must free the allocations for any tensors that died between the end of our
 * previous graph replaying and our new recording (TODO). All of the allocated
 * segments that existed in the checkpointed state must still exist in the
 * pool. There may also exist new segments, which we will free (TODO : link
 * note [live tensors between iterations] when it exists).
 *
 *
 * ---------------> A ---------------> B ---------------> C
 *                                     |
 *                                     |
 *                                     |
 *                                     |
 *                                     ---------------> D
```
A few TODOs:
- need to add logic for freeing tensors that have died between the last replay and the current new recording
- Add logic for free that might be called on a pointer multiple times (because we are manually freeing live tensors)

The two scenarios above have not been exercised in the tests yet. Differential Revision: [D43999889](https://our.internmc.facebook.com/intern/diff/D43999889) Pull Request resolved: https://github.com/pytorch/pytorch/pull/94653 Approved by: https://github.com/zdevito	2023-03-14 00:47:30 +00:00
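The note above argues that a new recording must start from the pool state as it was after the last replayed callable. The toy model below is only an illustration of that argument: `ToyPrivatePool` and its methods are hypothetical names, and the real allocator tracks blocks and segments rather than a list of boolean slots.

```python
class ToyPrivatePool:
    """Toy private pool: a fixed number of slots, each free or in use."""

    def __init__(self, num_slots):
        self.in_use = [False] * num_slots

    def allocate(self):
        # Hand out the lowest free slot (raises ValueError if the pool is full).
        slot = self.in_use.index(False)
        self.in_use[slot] = True
        return slot

    def free(self, slot):
        self.in_use[slot] = False

    def checkpoint(self):
        # An immutable snapshot of which slots are in use.
        return tuple(self.in_use)

    def restore(self, state):
        self.in_use = list(state)


pool = ToyPrivatePool(num_slots=4)

# Record A, B, and C in sequence, checkpointing after each recording.
slot_a = pool.allocate()
after_a = pool.checkpoint()
slot_b = pool.allocate()
after_b = pool.checkpoint()
slot_c = pool.allocate()
after_c = pool.checkpoint()

# The recorded outputs die, so the pool now believes every slot is free.
for slot in (slot_a, slot_b, slot_c):
    pool.free(slot)

# Replaying A and B recreates live tensors in their slots without touching
# allocator state, so recording D from the current state would reuse A's slot.
naive_pool = ToyPrivatePool(num_slots=4)
naive_pool.restore(pool.checkpoint())
clobbering_slot = naive_pool.allocate()

# Reapplying the checkpoint taken after B makes the pool reflect the live
# tensors, so D is placed in a slot that neither A nor B occupies.
pool.restore(after_b)
slot_d = pool.allocate()

print(clobbering_slot == slot_a, slot_d)
```

In this sketch the naive recording lands on A's slot, while the checkpointed recording lands after B's slot, which is the behavior the note describes for the tree A → B → {C, D}.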
Zachary DeVito	4b372e3958	[memory profiling] C++ tracing support (#95357 ) Adds the ability to quickly generate stack traces for C++, and combine Python, TorchScript, and C++ frames into a single trace. This makes it possible for the memory tracer to record allocations inside C++ code (e.g. convolution temporaries, backward operators). The unwinder code is ~10x faster than execinfo.h's backtrace because it caches fast unwinder routines for instruction pointers that have already been seen. It is also only 1.2--2x slower than copying the entire stack (the approach perf takes), while using 2 orders of magnitude less space per stack. Pull Request resolved: https://github.com/pytorch/pytorch/pull/95357 Approved by: https://github.com/bertmaher	2023-03-12 07:24:14 +00:00
Zachary DeVito	266089a3fe	[memory snapshots] record scripted stack traces (#95356 ) Adds support for seeing both python and script stack traces in memory debugging. Pull Request resolved: https://github.com/pytorch/pytorch/pull/95356 Approved by: https://github.com/aaronenyeshi	2023-03-12 07:24:14 +00:00
Zachary DeVito	d6d8d3484e	_memory_viz.py: Visualize how blocks fit into segments. (#91336 ) Add a segment_plot command that visualizes how blocks are allocated into segments. This is similar to the 'stats' command but produces an interactive html viewer rather than text dump, allowing exploration of stack traces. It also adds the ability to see the layout at any point in the trace by starting from the snapshot and then apply the events backwards to reconstruct what memory would have looked like. Example: ![Screen Shot 2022-12-22 at 3 32 49 PM](https://user-images.githubusercontent.com/370202/209242650-b952372e-37ac-400a-a01c-13be2b5426fa.png) Pull Request resolved: https://github.com/pytorch/pytorch/pull/91336 Approved by: https://github.com/bhosmer	2023-03-07 21:07:18 +00:00
Zachary DeVito	71f369092d	Revert "Revert "memory viz: Add colors for categories and a legend (#90587 )"" (#96133 ) This reverts commit `b38b39c441`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/96133 Approved by: https://github.com/bhosmer	2023-03-07 21:07:18 +00:00
Catherine Lee	eea0733045	Reduce pytest blocklist (#96016 ) `TestCase = object` or variations of it get switched to `TestCase = NoTest`. unittest collects test based on subclassing unittest.TestCase, so setting TestCase = object removes it from unittest test collection. pytest collects based on name (https://docs.pytest.org/en/7.1.x/reference/reference.html#confval-python_classes) but can be told to ignore a class (bottom of https://docs.pytest.org/en/7.1.x/example/pythoncollection.html#changing-naming-conventions) Pull Request resolved: https://github.com/pytorch/pytorch/pull/96016 Approved by: https://github.com/ZainRizvi, https://github.com/huydhn	2023-03-07 18:30:27 +00:00
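The difference between the two collectors described in the commit above can be sketched with the standard library alone. The `NoTest` class below is a hypothetical stand-in written for this example (the commit does not show its definition); it assumes pytest's convention of skipping classes whose `__test__` attribute is `False`.

```python
import unittest


class NoTest:
    # pytest skips classes whose __test__ attribute is False, even if the
    # class name matches its naming convention.
    __test__ = False


class RealTests(unittest.TestCase):
    def test_ok(self):
        self.assertTrue(True)


# A test file that disables its tests rebinds the base class like this.
DisabledBase = NoTest


class TestDisabled(DisabledBase):
    def test_never_collected(self):
        raise AssertionError("should not run")


def count_unittest_cases(cls):
    """Number of tests unittest would collect from `cls`.

    unittest only considers subclasses of unittest.TestCase, so any other
    class contributes zero tests.
    """
    if not issubclass(cls, unittest.TestCase):
        return 0
    return unittest.defaultTestLoader.loadTestsFromTestCase(cls).countTestCases()


print(count_unittest_cases(RealTests), count_unittest_cases(TestDisabled))
```

With `TestCase = object`, unittest already ignores the class, but pytest would still match it by name; a marker class carrying `__test__ = False` covers both collectors.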
Eli Uriegas	b38b39c441	Revert "memory viz: Add colors for categories and a legend (#90587 )" This reverts commit `ee43842505`.	2023-03-06 11:38:58 -08:00
Zachary DeVito	ee43842505	memory viz: Add colors for categories and a legend (#90587 ) Adds a category legend to memory trace plots that colors allocations by their role (activation, parameter, gradient, etc.) as captured by kineto. Differential Revision: [D43757381](https://our.internmc.facebook.com/intern/diff/D43757381) Pull Request resolved: https://github.com/pytorch/pytorch/pull/90587 Approved by: https://github.com/aaronenyeshi	2023-03-03 20:42:22 +00:00
Mark Saroufim	9f707f164e	Add more GPU metric instrumentation (#91717 ) Fixes https://github.com/pytorch/serve/issues/1937 A fairly common query I see folks running while using pytorch is `nvidia-smi --format=csv,noheader,nounits --query-gpu=utilization.gpu,utilization.memory,memory.total,memory.used,temperature.gpu,power.draw,clocks.current.sm,clocks.current.memory -l 10` Existing metrics we have * For kernel utilization: `torch.cuda.utilization()` * For memory utilization, we have them under `torch.cuda.memory`, e.g. the memory allocated with `torch.cuda.memory.memory_allocated()` * For total available memory we have `torch.cuda.get_device_properties(0).total_memory` Which means the only metrics we're missing are * Temperature: now in `torch.cuda.temperature()` * Power draw: now in `torch.cuda.power()` * Clock speed: now in `torch.cuda.clock_speed()` With some important details on each * Clock speed settings: I picked the SM clock domain which is documented here https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceEnumvs.html#group__nvmlDeviceEnumvs_1g805c0647be9996589fc5e3f6ff680c64 * Temperature: I use `pynvml.nvmlDeviceGetTemperature(handle, 0)` where 0 refers to the GPU die temperature Pull Request resolved: https://github.com/pytorch/pytorch/pull/91717 Approved by: https://github.com/ngimel	2023-02-24 00:38:03 +00:00
Pearu Peterson	cece63f197	Add warn-once deprecation warning to legacy sparse constructors (#94850 ) Addresses https://github.com/pytorch/pytorch/issues/68323#issuecomment-1425174341 Pull Request resolved: https://github.com/pytorch/pytorch/pull/94850 Approved by: https://github.com/amjames, https://github.com/cpuhrsch	2023-02-23 15:05:12 +00:00
puririshi98	8aa34602f7	Jetson Update for CI Redo (#94549 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/94549 Approved by: https://github.com/ezyang, https://github.com/malfet	2023-02-21 17:13:38 +00:00
dllehr-amd	98012e4a59	[ROCm] hipGraph support for pytorch mainline (#88202 ) With the release of ROCm 5.3, HIP now supports a hipGraph implementation. All necessary backend work and hipification is done to support the same functionality as cudaGraph. Unit tests are modified to support a new TEST_GRAPH feature which allows us to create a single check for graph support instead of attempting to gather the CUDA level in annotations for every graph test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88202 Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/malfet	2023-02-14 22:18:56 +00:00
Xuehai Pan	b005ec62b9	[BE] Remove dependency on `six` and `future` (#94709 ) Remove the Python 2 and 3 compatibility library [six](https://pypi.org/project/six) and [future](https://pypi.org/project/future) and `torch._six`. We only support Python 3.8+ now. It's time to retire them. Pull Request resolved: https://github.com/pytorch/pytorch/pull/94709 Approved by: https://github.com/malfet, https://github.com/Skylion007	2023-02-14 09:14:14 +00:00
Xuehai Pan	046e88a291	[BE] [3/3] Rewrite `super()` calls in test (#94592 ) Rewrite Python built-in class `super()` calls. Only non-semantic changes should be applied. - #94587 - #94588 - #94592 Also, methods with only a `super()` call are removed:
```diff
 class MyModule(nn.Module):
-    def __init__(self):
-        super().__init__()
-
     def forward(self, ...):
         ...
```
Some cases that change the semantics should be kept unchanged. E.g.: `f152a79be9/caffe2/python/net_printer.py (L184-L190)` `f152a79be9/test/test_jit_fuser_te.py (L2628-L2635)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/94592 Approved by: https://github.com/ezyang, https://github.com/seemethere	2023-02-12 22:20:53 +00:00
c-odrin	54b7c7d5e9	Added requested_bytes to CUDA Caching Allocator Stats (#88575 ) Summary: The caching allocator can be configured to round memory allocations in order to reduce fragmentation. Sometimes however, the overhead from rounding can be higher than the fragmentation it helps reduce. We have added a new stat to CUDA caching allocator stats to help track if rounding is adding too much overhead and help tune the roundup_power2_divisions flag: - "requested_bytes.{current,peak,allocated,freed}": memory requested by client code, compare this with allocated_bytes to check if allocation rounding adds too much overhead Test Plan: Added test case in caffe2/test/test_cuda.py Differential Revision: D40810674 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88575 Approved by: https://github.com/zdevito	2023-02-09 21:37:25 +00:00
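The comparison suggested in the commit above (requested versus allocated bytes) can be sketched as a small helper. The function name `rounding_overhead` is hypothetical, and the flat key layout `"<stat>.all.<kind>"` is an assumption about the dictionary returned by `torch.cuda.memory_stats()`; the example uses a hand-written dictionary so it runs without a GPU.

```python
def rounding_overhead(stats, kind="current"):
    """Fraction of allocated bytes that is padding added by rounding.

    `stats` is a flat dict of allocator counters. The key layout
    "<stat>.all.<kind>" is an assumption and may differ between versions.
    """
    requested = stats[f"requested_bytes.all.{kind}"]
    allocated = stats[f"allocated_bytes.all.{kind}"]
    if allocated == 0:
        return 0.0
    return (allocated - requested) / allocated


# Hand-written numbers for illustration only, not measured data.
example = {
    "requested_bytes.all.current": 3_000,
    "allocated_bytes.all.current": 4_096,
}
print(f"{rounding_overhead(example):.1%}")
```

A persistently high fraction would suggest that the rounding setting costs more memory than the fragmentation it avoids.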
Masaki Kozuki	6ba041fcae	Look up `group["capturable"]`, not `defaults["capturable"]` in Adam(W) (#94149 ) We could set different values in each `param_group` when calling dunder init of `torch.optim` optimizers as in e.g. https://github.com/pytorch/pytorch/issues/89987. So check whether or not `capturable` is `True` among all the `param_group`s. Pull Request resolved: https://github.com/pytorch/pytorch/pull/94149 Approved by: https://github.com/albanD	2023-02-07 00:24:35 +00:00
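The distinction in the commit above, between the optimizer-wide defaults and the per-group settings, can be illustrated with plain dictionaries. This is a sketch of the idea only: `any_group_capturable` is a hypothetical helper, and the dictionaries merely imitate the shape of an optimizer's `defaults` and `param_groups`.

```python
def any_group_capturable(param_groups):
    """True if at least one parameter group sets `capturable` to True."""
    return any(group.get("capturable", False) for group in param_groups)


# Optimizer-wide defaults leave `capturable` off...
defaults = {"lr": 1e-3, "capturable": False}

# ...but an individual group may override it at construction time.
param_groups = [
    {**defaults, "params": ["w1"]},
    {**defaults, "params": ["w2"], "capturable": True},
]

print(defaults["capturable"], any_group_capturable(param_groups))
```

Consulting only `defaults` would miss the second group's override, which is the situation the commit addresses.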
Masaki Kozuki	4207d3c330	`FusedAdam(W)` should take `OptState` into account before unscaling grads (#94060 ) The optimizers have to consult `OptState` before unscaling gradients because we could call `GradScaler.unscale_` explicitly, e.g. for `clip_grad_norm_`, as mentioned in `e52786f3d1/torch/cuda/amp/grad_scaler.py (L235-L266)` and https://pytorch.org/docs/stable/notes/amp_examples.html#working-with-unscaled-gradients Related #90752 Pull Request resolved: https://github.com/pytorch/pytorch/pull/94060 Approved by: https://github.com/albanD	2023-02-04 05:20:13 +00:00
Masaki Kozuki	a23ed38f9a	[mta][foreach] Implement fused adamw (#88015 ) related: https://github.com/pytorch/pytorch/issues/68041, https://github.com/pytorch/pytorch/issues/71274, https://github.com/pytorch/pytorch/issues/80167 possibly related to https://github.com/pytorch/pytorch/issues/80595#issuecomment-1178519436 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88015 Approved by: https://github.com/albanD, https://github.com/ngimel	2023-02-01 19:32:29 +00:00
albanD	d8aa68c683	make sure that our error handling runs with the GIL enabled (#92848 ) Fixes https://github.com/pytorch/pytorch/issues/92684 I checked the other use case of this API and they never release the GIL Pull Request resolved: https://github.com/pytorch/pytorch/pull/92848 Approved by: https://github.com/ngimel	2023-01-24 09:30:42 +00:00
eqy	fb38b9ff2a	[cuBLAS][TF32] Fix TF32 get/set test when `TORCH_ALLOW_TF32_CUBLAS_OVERRIDE` is set (#92052 ) Follow up of #85859 to fix the test for when the environment variable is set. CC @xwang233 @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/92052 Approved by: https://github.com/ngimel	2023-01-12 05:36:06 +00:00
eqy	97ff20d722	[cuBLAS] (re-open) Fix default cuBLAS workspace size and parsing for multiple workspaces (#91564 ) re-open of #89027 Pull Request resolved: https://github.com/pytorch/pytorch/pull/91564 Approved by: https://github.com/ngimel	2023-01-03 23:48:15 +00:00
PyTorch MergeBot	39d49dbe45	Revert "[cuBLAS] Fix default cuBLAS workspace size and parsing for multiple workspaces (#89027 )" This reverts commit `b407d98dbe`. Reverted https://github.com/pytorch/pytorch/pull/89027 on behalf of https://github.com/kit1980 due to Fails test_cublas_workspace_explicit_allocation on ROCm	2022-12-31 23:04:57 +00:00
eqy	b407d98dbe	[cuBLAS] Fix default cuBLAS workspace size and parsing for multiple workspaces (#89027 ) Follow-up of #86167 ; The number of pools was mistakenly ignored and the default workspace size appears to be too small to match selected cuBLAS kernels before the explicit allocation change. CC @ptrblck @ngimel Pull Request resolved: https://github.com/pytorch/pytorch/pull/89027 Approved by: https://github.com/ngimel	2022-12-31 06:58:04 +00:00
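The commit above concerns parsing a workspace configuration that may describe several pools. As a rough sketch, assuming the `:SIZE:COUNT[:SIZE:COUNT...]` string format that cuBLAS documents for `CUBLAS_WORKSPACE_CONFIG` (sizes in KiB), the total can be computed as below. `workspace_bytes` is a hypothetical helper and not PyTorch's actual parser.

```python
def workspace_bytes(config):
    """Total bytes described by a ':SIZE_KIB:COUNT[:SIZE_KIB:COUNT...]' string.

    The format is an assumption based on the cuBLAS documentation for
    CUBLAS_WORKSPACE_CONFIG; each pair means COUNT buffers of SIZE_KIB KiB.
    """
    fields = config.split(":")
    # A well-formed string starts with ':', so the first field is empty and
    # the remaining fields come in (size, count) pairs.
    if fields[0] != "" or len(fields) < 3 or len(fields) % 2 == 0:
        raise ValueError(f"malformed workspace config: {config!r}")
    numbers = [int(field) for field in fields[1:]]
    pairs = zip(numbers[0::2], numbers[1::2])
    # Multiply by the count so that the number of pools is not ignored.
    return sum(size_kib * count * 1024 for size_kib, count in pairs)


print(workspace_bytes(":4096:8"))
print(workspace_bytes(":4096:2:16:8"))
```

Dropping the `count` factor from the sum reproduces the kind of mistake the commit message describes, where the number of pools was ignored.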
lezcano	484dd40022	Implement PReLU in a compositional way (#91238 ) The PReLU implementation was all over the place. This led to a number of bugs like https://github.com/pytorch/pytorch/issues/68760. We fix it by: - Keeping the weird broadcasting logic it has as a CompositeImplicit kernel that calls into a second kernel - This second kernel is just a good-ol' pointwise kernel. - We implement the derivative for the pointwise kernel via TI as well for speed. - We implement the second derivative for the pointwise kernel and the forward AD derivatives compositionally This fixes a number of issues: - We don't perform copies any more when the inputs are not contiguous - The derivatives are now correct - We fix vmap and many other functorch-related issues. - CPU and CUDA now share the relevant broadcasting logic - The implementation is about 1/3 the length. Fixes https://github.com/pytorch/pytorch/issues/68760 Fixes https://github.com/pytorch/pytorch/issues/89895 Pull Request resolved: https://github.com/pytorch/pytorch/pull/91238 Approved by: https://github.com/kshitij12345, https://github.com/jbschlosser, https://github.com/albanD	2022-12-30 10:42:30 +00:00
Eddie Yan	8b617f813d	[cuBLAS] Add an option to disable reduced precision reductions for BF16 GEMM (#89172 ) Essentially the same change as #67946, except that the default is to disallow reduced precision reductions in `BFloat16` GEMMs (for now). If performance is severely regressed, we can change the default, but this option appears to be necessary to pass some `addmm` `BFloat16` tests on H100. CC @ptrblck @ngimel Pull Request resolved: https://github.com/pytorch/pytorch/pull/89172 Approved by: https://github.com/ngimel	2022-12-21 18:58:28 +00:00
Eddie Yan	dabf515c18	[cuDNN][cuDNN V8 API] (re-re-re-open) cuDNN V8 API on by default (#91117 ) Re-opening following #91025 CC @ptrblck @ngimel Pull Request resolved: https://github.com/pytorch/pytorch/pull/91117 Approved by: https://github.com/ngimel	2022-12-20 18:52:29 +00:00
PyTorch MergeBot	ba7aeac37b	Revert "[cuDNN][cuDNN V8 API] (re-re-open) cuDNN V8 API on by default (#89022 )" This reverts commit `eecd621f06`. Reverted https://github.com/pytorch/pytorch/pull/89022 on behalf of https://github.com/ngimel due to breaks some convolution configurations #91025	2022-12-16 23:06:35 +00:00
Rich Zhu	4372dbb89f	use pytree to allow any input format for cuda graph (#90941 ) Summary: 1. use pytree to allow any input format for make_graphed_callables 2. add allow_unused_input argument for make_graphed_callables Test Plan: buck2 test mode/dev-nosan //caffe2/test:cuda -- --print-passing-details Differential Revision: D42077976 Pull Request resolved: https://github.com/pytorch/pytorch/pull/90941 Approved by: https://github.com/ngimel	2022-12-16 03:01:47 +00:00
Eddie Yan	eecd621f06	[cuDNN][cuDNN V8 API] (re-re-open) cuDNN V8 API on by default (#89022 ) Testing V8 on by default again after fixes have been merged for e.g., https://github.com/pytorch/torchdynamo/issues/1833 One new failure that seems to be surfaced with V8 on appears in halonext + amp ``` RuntimeError: Internal Triton PTX codegen error: Segmentation fault (core dumped) ``` But I'm not sure if this points to a V8 issue or a Triton issue CC @ngimel @ptrblck Current dynamo benchmarks on A100: v7 vs. v8 \|dev \|name \|batch_size\|abs_latency_v7\|abs_latency_v8\| \|----\|-------------------------------\|----------\|--------------\|--------------\| \|cuda\|adv_inception_v3 \|128 \|166.0240 \|165.5798 \| \|cuda\|beit_base_patch16_224 \|64 \|123.5912 \|123.0797 \| \|cuda\|botnet26t_256 \|128 \|107.7343 \|107.5948 \| \|cuda\|cait_m36_384 \|4 \|184.5038 \|184.0271 \| \|cuda\|coat_lite_mini \|128 \|142.3061 \|140.5814 \| \|cuda\|convit_base \|64 \|165.2499 \|161.0743 \| \|cuda\|convmixer_768_32 \|32 \|325.6984 \|325.7094 \| \|cuda\|convnext_base \|64 \|237.4632 \|238.0142 \| \|cuda\|crossvit_9_240 \|128 \|72.2980 \|72.4367 \| \|cuda\|cspdarknet53 \|64 \|96.6862 \|96.8308 \| \|cuda\|deit_base_distilled_patch16_224\|64 \|117.6045 \|117.9616 \| \|cuda\|dla102 \|128 \|182.3073 \|182.2304 \| \|cuda\|dm_nfnet_f0 \|128 \|133.6011 \|133.6298 \| \|cuda\|dpn107 \|32 \|148.5080 \|148.5885 \| \|cuda\|eca_botnext26ts_256 \|128 \|113.8676 \|113.1514 \| \|cuda\|eca_halonext26ts \|128 \|119.2242 \|119.1845 \| \|cuda\|ese_vovnet19b_dw \|128 \|80.0217 \|79.9438 \| \|cuda\|fbnetc_100 \|128 \|91.4548 \|91.4009 \| \|cuda\|fbnetv3_b \|128 \|115.4496 \|115.5058 \| \|cuda\|gernet_l \|128 \|114.8365 \|114.7870 \| \|cuda\|ghostnet_100 \|128 \|58.5766 \|58.5766 \| \|cuda\|gluon_inception_v3 \|128 \|165.5222 \|165.7167 \| \|cuda\|gluon_xception65 \|32 \|165.8779 \|165.7818 \| \|cuda\|gmixer_24_224 \|128 \|116.3611 \|113.4925 \| \|cuda\|gmlp_s16_224 \|128 \|121.2607 \|121.2534 \| \|cuda\|hrnet_w18 
\|128 \|246.5706 \|246.7599 \| \|cuda\|inception_v3 \|128 \|166.1096 \|166.2034 \| \|cuda\|jx_nest_base \|32 \|93.6064 \|93.4088 \| \|cuda\|lcnet_050 \|128 \|21.4156 \|21.4207 \| \|cuda\|levit_128 \|128 \|27.2901 \|27.2543 \| \|cuda\|mixer_b16_224 \|128 \|157.8992 \|158.2878 \| \|cuda\|mixnet_l \|128 \|197.3443 \|197.2125 \| \|cuda\|mnasnet_100 \|128 \|71.4604 \|71.2997 \| \|cuda\|mobilenetv2_100 \|128 \|67.6080 \|67.7515 \| \|cuda\|mobilenetv3_large_100 \|128 \|57.7224 \|57.6591 \| \|cuda\|mobilevit_s \|64 \|93.0372 \|93.0530 \| \|cuda\|nfnet_l0 \|128 \|113.1664 \|113.2853 \| \|cuda\|pit_b_224 \|64 \|133.3333 \|133.4153 \| \|cuda\|pnasnet5large \|16 \|238.9545 \|238.8122 \| \|cuda\|poolformer_m36 \|64 \|144.2353 \|144.2375 \| \|cuda\|regnety_002 \|128 \|32.8534 \|32.9069 \| \|cuda\|repvgg_a2 \|128 \|102.4150 \|102.3827 \| \|cuda\|res2net101_26w_4s \|64 \|120.8127 \|120.8322 \| \|cuda\|res2net50_14w_8s \|128 \|149.7052 \|149.8969 \| \|cuda\|res2next50 \|128 \|153.7439 \|153.8215 \| \|cuda\|resmlp_12_224 \|128 \|89.1918 \|86.9226 \| \|cuda\|resnest101e \|64 \|159.4706 \|159.3133 \| \|cuda\|rexnet_100 \|128 \|88.0032 \|88.0397 \| \|cuda\|sebotnet33ts_256 \|64 \|80.4635 \|80.0120 \| \|cuda\|selecsls42b \|128 \|70.4430 \|70.3663 \| \|cuda\|spnasnet_100 \|128 \|78.0537 \|78.1991 \| \|cuda\|swin_base_patch4_window7_224 \|64 \|212.9073 \|213.0824 \| \|cuda\|swsl_resnext101_32x16d \|32 \|193.0229 \|193.0404 \| \|cuda\|tf_efficientnet_b0 \|128 \|97.1316 \|97.0410 \| \|cuda\|tf_mixnet_l \|128 \|203.4956 \|203.5340 \| \|cuda\|tinynet_a \|128 \|82.4038 \|82.8733 \| \|cuda\|tnt_s_patch16_224 \|128 \|284.8576 \|284.8867 \| \|cuda\|twins_pcpvt_base \|64 \|118.3893 \|119.2329 \| \|cuda\|visformer_small \|128 \|126.0533 \|126.0390 \| \|cuda\|vit_base_patch16_224 \|64 \|118.2873 \|118.0573 \| \|cuda\|volo_d1_224 \|64 \|108.7764 \|108.2063 \| \|cuda\|xcit_large_24_p8_224 \|5 \|100.4656 \|100.5209 \| v7 vs. 
v8 amp \|dev \|name \|batch_size\|abs_latency_v7\|abs_latency_v8\| \|----\|-------------------------------\|----------\|--------------\|--------------\| \|cuda\|adv_inception_v3 \|128 \|104.9729 \|105.1237 \| \|cuda\|beit_base_patch16_224 \|64 \|75.4330 \|75.2039 \| \|cuda\|botnet26t_256 \|128 \|74.5149 \|74.8071 \| \|cuda\|cait_m36_384 \|4 \|110.9788 \|111.5170 \| \|cuda\|coat_lite_mini \|128 \|62.3618 \|64.4965 \| \|cuda\|convit_base \|64 \|116.4054 \|117.9129 \| \|cuda\|convmixer_768_32 \|32 \|264.4401 \|264.4491 \| \|cuda\|convnext_base \|64 \|182.9009 \|179.2136 \| \|cuda\|crossvit_9_240 \|128 \|48.8586 \|48.8359 \| \|cuda\|cspdarknet53 \|64 \|80.0245 \|80.0160 \| \|cuda\|deit_base_distilled_patch16_224\|64 \|66.5921 \|66.7448 \| \|cuda\|dla102 \|128 \|116.7780 \|117.1683 \| \|cuda\|dm_nfnet_f0 \|128 \|78.9322 \|79.1135 \| \|cuda\|dpn107 \|32 \|85.5206 \|85.7514 \| \|cuda\|eca_botnext26ts_256 \|128 \|76.3672 \|77.0050 \| \|cuda\|eca_halonext26ts \|128 \|86.2458 \| \| \|cuda\|ese_vovnet19b_dw \|128 \|43.2943 \|43.3379 \| \|cuda\|fbnetc_100 \|128 \|54.8479 \|54.9251 \| \|cuda\|fbnetv3_b \|128 \|70.7504 \|71.0188 \| \|cuda\|gernet_l \|128 \|66.1607 \|66.0379 \| \|cuda\|ghostnet_100 \|128 \|43.8882 \|43.9336 \| \|cuda\|gluon_inception_v3 \|128 \|104.9297 \|105.0204 \| \|cuda\|gluon_xception65 \|32 \|85.7118 \|85.8370 \| \|cuda\|gmixer_24_224 \|128 \|75.1214 \|76.1170 \| \|cuda\|gmlp_s16_224 \|128 \|76.4207 \|76.6641 \| \|cuda\|hrnet_w18 \|128 \|186.1326 \|186.2435 \| \|cuda\|inception_v3 \|128 \|105.0561 \|105.0783 \| \|cuda\|jx_nest_base \|32 \|65.3066 \|65.3245 \| \|cuda\|lcnet_050 \|128 \|14.7991 \|14.8687 \| \|cuda\|levit_128 \|128 \|19.2893 \|19.4772 \| \|cuda\|mixer_b16_224 \|128 \|93.9826 \|94.2056 \| \|cuda\|mixnet_l \|128 \|147.1245 \|147.0435 \| \|cuda\|mnasnet_100 \|128 \|39.1781 \|39.2565 \| \|cuda\|mobilenetv2_100 \|128 \|42.3704 \|42.3114 \| \|cuda\|mobilenetv3_large_100 \|128 \|37.2946 \|37.2816 \| \|cuda\|mobilevit_s \|64 \|55.8930 \|55.8934 \| 
\|cuda\|nfnet_l0 \|128 \|64.0448 \|64.4438 \| \|cuda\|pit_b_224 \|64 \|80.6342 \|80.2933 \| \|cuda\|pnasnet5large \|16 \|154.9611 \|154.8654 \| \|cuda\|poolformer_m36 \|64 \|101.7489 \|101.8138 \| \|cuda\|regnety_002 \|128 \|27.0939 \|27.0309 \| \|cuda\|repvgg_a2 \|128 \|60.9651 \|61.2533 \| \|cuda\|res2net101_26w_4s \|64 \|77.3291 \|77.4739 \| \|cuda\|res2net50_14w_8s \|128 \|93.6572 \|93.7221 \| \|cuda\|res2next50 \|128 \|112.4975 \|112.3248 \| \|cuda\|resmlp_12_224 \|128 \|59.5422 \|60.7644 \| \|cuda\|resnest101e \|64 \|97.9894 \|98.3358 \| \|cuda\|rexnet_100 \|128 \|55.2218 \|55.0718 \| \|cuda\|sebotnet33ts_256 \|64 \|60.4880 \|60.8113 \| \|cuda\|selecsls42b \|128 \|41.4294 \|41.5341 \| \|cuda\|spnasnet_100 \|128 \|45.0037 \|45.0304 \| \|cuda\|swin_base_patch4_window7_224 \|64 \|98.2561 \|98.6925 \| \|cuda\|swsl_resnext101_32x16d \|32 \|100.6179 \|100.9195 \| \|cuda\|tf_efficientnet_b0 \|128 \|56.5344 \|56.4591 \| \|cuda\|tf_mixnet_l \|128 \|153.0318 \|152.9367 \| \|cuda\|tinynet_a \|128 \|54.1307 \|53.9298 \| \|cuda\|tnt_s_patch16_224 \|128 \|142.4801 \|142.6589 \| \|cuda\|twins_pcpvt_base \|64 \|67.9027 \|67.8325 \| \|cuda\|visformer_small \|128 \|72.5589 \|72.9427 \| \|cuda\|vit_base_patch16_224 \|64 \|71.4885 \|71.7342 \| \|cuda\|volo_d1_224 \|64 \|69.3539 \|69.5910 \| \|cuda\|xcit_large_24_p8_224 \|5 \|59.9000 \|59.9699 \| v7 vs. 
v8 float16 \|dev \|name \|batch_size\|abs_latency\|abs_latency\| \|----\|-------------------------------\|----------\|-----------\|-----------\| \|cuda\|adv_inception_v3 \|128 \|104.2544 \|104.2677 \| \|cuda\|beit_base_patch16_224 \|64 \|85.3601 \|85.3786 \| \|cuda\|botnet26t_256 \|128 \|72.1476 \|71.8277 \| \|cuda\|cait_m36_384 \|4 \|108.3075 \|108.5941 \| \|cuda\|coat_lite_mini \|128 \|61.2382 \|61.6049 \| \|cuda\|convmixer_768_32 \|32 \|263.3818 \|263.3598 \| \|cuda\|convnext_base \|64 \|172.6821 \|173.8520 \| \|cuda\|crossvit_9_240 \|128 \|44.6321 \|44.6340 \| \|cuda\|cspdarknet53 \|64 \|79.3165 \|79.2964 \| \|cuda\|deit_base_distilled_patch16_224\|64 \|61.9816 \|62.2109 \| \|cuda\|dla102 \|128 \|115.7403 \|115.9928 \| \|cuda\|dm_nfnet_f0 \|128 \|77.5434 \|77.7440 \| \|cuda\|dpn107 \|32 \|83.6489 \|83.5605 \| \|cuda\|eca_botnext26ts_256 \|128 \|73.9953 \|74.1031 \| \|cuda\|eca_halonext26ts \|128 \|81.7951 \|81.7103 \| \|cuda\|ese_vovnet19b_dw \|128 \|42.9618 \|42.8853 \| \|cuda\|fbnetc_100 \|128 \|54.3590 \|54.3575 \| \|cuda\|fbnetv3_b \|128 \|69.7977 \|70.1696 \| \|cuda\|gernet_l \|128 \|64.8684 \|65.1726 \| \|cuda\|ghostnet_100 \|128 \|43.2054 \|43.1319 \| \|cuda\|gluon_inception_v3 \|128 \|104.1988 \|104.3030 \| \|cuda\|gluon_xception65 \|32 \|84.2245 \|84.5085 \| \|cuda\|gmixer_24_224 \|128 \|82.0418 \|82.7252 \| \|cuda\|gmlp_s16_224 \|128 \|75.4792 \|75.8374 \| \|cuda\|hrnet_w18 \|128 \|184.1450 \|184.1848 \| \|cuda\|inception_v3 \|128 \|104.1203 \|104.2536 \| \|cuda\|jx_nest_base \|32 \|58.2386 \|58.4901 \| \|cuda\|lcnet_050 \|128 \|14.6409 \|14.5616 \| \|cuda\|levit_128 \|128 \|22.3875 \|22.4680 \| \|cuda\|mixer_b16_224 \|128 \|98.9534 \|98.4730 \| \|cuda\|mixnet_l \|128 \|146.1623 \|146.1947 \| \|cuda\|mnasnet_100 \|128 \|38.9208 \|39.3463 \| \|cuda\|mobilenetv2_100 \|128 \|41.8946 \|41.9847 \| \|cuda\|mobilenetv3_large_100 \|128 \|36.7810 \|36.8264 \| \|cuda\|mobilevit_s \|64 \|55.3211 \|55.3186 \| \|cuda\|nfnet_l0 \|128 \|63.1302 \|63.5544 \| 
\|cuda\|pit_b_224 \|64 \|73.8752 \|73.4602 \| \|cuda\|pnasnet5large \|16 \|151.6806 \|151.6111 \| \|cuda\|poolformer_m36 \|64 \|86.8341 \|86.8021 \| \|cuda\|regnety_002 \|128 \|26.6798 \|26.5295 \| \|cuda\|repvgg_a2 \|128 \|61.6652 \|62.1482 \| \|cuda\|res2net101_26w_4s \|64 \|75.8037 \|75.7739 \| \|cuda\|res2net50_14w_8s \|128 \|92.6362 \|92.4338 \| \|cuda\|res2next50 \|128 \|111.5371 \|111.5832 \| \|cuda\|resmlp_12_224 \|128 \|58.2349 \|57.9807 \| \|cuda\|resnest101e \|64 \|96.1114 \|96.2742 \| \|cuda\|rexnet_100 \|128 \|54.8138 \|54.7643 \| \|cuda\|sebotnet33ts_256 \|64 \|53.1524 \|53.3823 \| \|cuda\|selecsls42b \|128 \|40.6070 \|40.7104 \| \|cuda\|spnasnet_100 \|128 \|44.5732 \|44.4318 \| \|cuda\|swin_base_patch4_window7_224 \|64 \|98.6447 \|98.8445 \| \|cuda\|swsl_resnext101_32x16d \|32 \|97.0195 \|97.2968 \| \|cuda\|tf_efficientnet_b0 \|128 \|56.0640 \|56.0278 \| \|cuda\|tf_mixnet_l \|128 \|152.0958 \|152.0874 \| \|cuda\|tinynet_a \|128 \|53.3694 \|53.3762 \| \|cuda\|tnt_s_patch16_224 \|128 \|130.2981 \|130.3726 \| \|cuda\|twins_pcpvt_base \|64 \|62.5459 \|62.6416 \| \|cuda\|visformer_small \|128 \|68.8502 \|69.1756 \| \|cuda\|vit_base_patch16_224 \|64 \|65.8587 \|66.0285 \| \|cuda\|volo_d1_224 \|64 \|64.5348 \|64.6057 \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/89022 Approved by: https://github.com/ngimel	2022-12-15 03:24:44 +00:00
PyTorch MergeBot	cba96366a2	Revert "remove torch.equal usages (#89527 )" This reverts commit `4095ef8b80`. Reverted https://github.com/pytorch/pytorch/pull/89527 on behalf of https://github.com/clee2000 due to broke periodic multigpu tests `4095ef8b80` https://github.com/pytorch/pytorch/actions/runs/3592806602/jobs/6049368502	2022-12-02 21:36:13 +00:00
Philip Meier	4095ef8b80	remove torch.equal usages (#89527 ) Preparation for the next PR in this stack: #89559. I replaced - `self.assertTrue(torch.equal(...))` with `self.assertEqual(..., rtol=0, atol=0, exact_device=True)`, - the same for `self.assertFalse(...)` with `self.assertNotEqual(...)`, and - `assert torch.equal(...)` with `torch.testing.assert_close(..., rtol=0, atol=0)` (note that we don't need to set `check_device=True` here since that is the default). There were a few instances where the result of `torch.equal` is used directly. In those cases I've replaced it with `(... == ...).all().item()`, sometimes also dropping the `.item()` depending on the context. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89527 Approved by: https://github.com/mruberry	2022-12-01 11:22:52 +00:00
Aidyn-A	0057be3361	[CUDA graphs] Add warning if captured graph is empty (#88754 ) Fixes #87894 This PR adds a warning if the captured graph is empty (consists of zero nodes). An example snippet where it would be useful: ```python import torch x = torch.randn(10) z = torch.zeros(10) g = torch.cuda.CUDAGraph() with torch.cuda.graph(g): z = x * x # Warn user ``` and in #87894 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88754 Approved by: https://github.com/ezyang	2022-11-28 23:20:19 +00:00
Nikita Shulga	da2afcb1e0	Add test for out-of-bounds Tensor access on GPU (#39211 ) Since CUDA context can not recover safely from on-device assert, use `torch.multiprocessing.spawn` to execute a method in another context and verify that it raises unrecoverable error. As those types of tests are pretty slow (6 seconds on powerful linux box with one GPU) run it only in the slow shard. Closes https://github.com/pytorch/pytorch/issues/38944 Pull Request resolved: https://github.com/pytorch/pytorch/pull/39211 Approved by: https://github.com/ezyang	2022-11-15 21:06:02 +00:00
PyTorch MergeBot	d98a884b33	Revert "[cuDNN] (re-open) Enable cuDNN Frontend v8 API by Default (#87669 )" This reverts commit `3c6bddc3f6`. Reverted https://github.com/pytorch/pytorch/pull/87669 on behalf of https://github.com/eqy due to investigating convnext benchmark regressions	2022-11-08 19:04:25 +00:00
Kurt Mohler	ee28b865ee	Deprecate TypedStorage, its derived classes, and all of their public methods (#85303 ) Part of #85302 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85303 Approved by: https://github.com/ezyang	2022-11-08 18:11:01 +00:00
Codrin Popa	5b767d404e	Modified roundup_power2_divisions to specify the number of divisions for each power of two interval (#87290 ) Summary: Improved roundup_power2_divisions knob so it allows better control of rounding in the PyTorch CUDA Caching Allocator. This new version allows setting the number of divisions per power of two interval starting from 1MB and ending at 64GB and above. An example use case is when rounding is desirable for small allocations but there are also very large allocations which are persistent, thus would not benefit from rounding and take up extra space. Test Plan: Tested locally Differential Revision: D40103909 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87290 Approved by: https://github.com/zdevito	2022-11-04 19:31:16 +00:00
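The rounding described in #87290 can be pictured with a small arithmetic sketch. This is plain Python for illustration, not the allocator's C++ code: the function name `round_up_to_division` is hypothetical, and the sketch assumes the number of divisions is a power of two.

```python
def round_up_to_division(size: int, divisions: int) -> int:
    """Round `size` up to the next boundary when each power-of-two interval
    [p, 2p) is split into `divisions` equal steps (illustrative only)."""
    if size <= 0:
        raise ValueError("size must be positive")
    floor_pow2 = 1 << (size.bit_length() - 1)  # largest power of two <= size
    if size == floor_pow2:
        return size  # already sits on an interval boundary
    step = floor_pow2 // divisions
    if step == 0:
        return floor_pow2 * 2  # interval too small to split further
    return -(-size // step) * step  # ceiling to a multiple of step


# More divisions means finer steps and less padding per allocation.
print(round_up_to_division(1200, 4))
print(round_up_to_division(1200, 1))
```

With four divisions the interval from 1024 to 2048 has steps of 256, so a request for 1200 is padded far less than with a single division, where it jumps to the next power of two. Choosing the division count per interval, as the commit describes, lets large persistent allocations use fine steps.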
eqy	3c6bddc3f6	[cuDNN] (re-open) Enable cuDNN Frontend v8 API by Default (#87669 ) #58414 Has a small tweak to a test that was breaking on A10 (CC @malfet). CC @ptrblck @ngimel Pull Request resolved: https://github.com/pytorch/pytorch/pull/87669 Approved by: https://github.com/ngimel	2022-11-02 01:36:37 +00:00
Masaki Kozuki	bc03aa6013	Store `autocast_gpu_dtype` in `custom_fwd` and `custom_bwd` for BFloat16 autocast (#88029 ) As per #87979, `custom_bwd` seems to forcefully use `torch.float16` for `torch.autograd.Function.backward` regardless of the `dtype` used in the forward. Changes: - store the `dtype` in `args[0]` - update tests to confirm the dtype of intermediate result tensors that are outputs of autocast compatible `torch` functions cc @ptrblck @ngimel Pull Request resolved: https://github.com/pytorch/pytorch/pull/88029 Approved by: https://github.com/ngimel	2022-10-31 22:45:26 +00:00
Zachary DeVito	00c91f4446	[allocator] disable tests that don't work for cudaMallocAsyncAllocator (#87250 ) Two tests were failing locally for me and don't appear to be run in our CI. Disabling them so we can otherwise refactor the allocators. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87250 Approved by: https://github.com/wconstab	2022-10-19 18:29:35 +00:00
PyTorch MergeBot	746500d58d	Revert "[cuDNN] Enable cuDNN Frontend v8 API by Default (#84948 )" This reverts commit `427e0a6b4e`. Reverted https://github.com/pytorch/pytorch/pull/84948 on behalf of https://github.com/malfet due to Broke SM86 sanity	2022-10-14 14:25:51 +00:00
Eddie Yan	427e0a6b4e	[cuDNN] Enable cuDNN Frontend v8 API by Default (#84948 ) #58414 Opening this PR for testing for now to check CI status. 🤞 CC @ptrblck @ngimel Pull Request resolved: https://github.com/pytorch/pytorch/pull/84948 Approved by: https://github.com/ngimel	2022-10-13 17:26:36 +00:00
Eddie Yan	25725fd624	(Re-open) Adds cudaMallocAsync as an alternative backend for the CUDA allocator (#82682 ) Rebased version of @mcarilli 's cudaMallocAsync #65365 for continued testing Pull Request resolved: https://github.com/pytorch/pytorch/pull/82682 Approved by: https://github.com/ngimel	2022-10-12 03:44:21 +00:00
eqy	352d926482	[CUBLAS][CUDA GRAPHS] (re-re-re-re-open of #83461 ) Explicitly set the workspace for cuBLAS handles (#86645 ) re-opening (again) in hopes of working around failed/stuck CLA check CC @ptrblck @ngimel @huydhn Pull Request resolved: https://github.com/pytorch/pytorch/pull/86645 Approved by: https://github.com/zdevito	2022-10-11 16:03:49 +00:00
Zachary DeVito	91b1bae1df	Caching allocator tracing (#86241 ) We currently can take snapshots of the state of the allocated cuda memory, but we do not have a way to correlate these snapshots with the actions that the allocator took between snapshots. This PR adds a simple fixed-sized buffer that records the major actions that the allocator takes (ALLOC, FREE, SEGMENT_ALLOC, SEGMENT_FREE, OOM, SNAPSHOT) and includes these with the snapshot information. Capturing periodic snapshots with a big enough trace buffer makes it possible to see how the allocator state changes over time. We plan to use this functionality to guide how settings in the allocator can be adjusted and eventually have a more robust overall algorithm. As a component of this functionality, we also add the ability to get a callback when the allocator will throw an OOM, primarily so that snapshots can be taken immediately to see why the program ran out of memory (most programs have some C++ state that would free tensors before the OutOfMemory exception can be caught). This PR also updates the _memory_viz.py script to pretty-print the trace information and provide a better textual summary of snapshots distinguishing between internal and external fragmentation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86241 Approved by: https://github.com/ngimel	2022-10-07 23:19:54 +00:00
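The fixed-size trace buffer from #86241 can be modelled roughly as a ring buffer that keeps only the most recent actions. The sketch below is a pure-Python analogy, not the allocator's implementation: the class `TraceBuffer` and its fields are hypothetical, and only the action names come from the commit message.

```python
from collections import deque
from enum import Enum


class Action(Enum):
    # Action names as listed in the commit message.
    ALLOC = "ALLOC"
    FREE = "FREE"
    SEGMENT_ALLOC = "SEGMENT_ALLOC"
    SEGMENT_FREE = "SEGMENT_FREE"
    OOM = "OOM"
    SNAPSHOT = "SNAPSHOT"


class TraceBuffer:
    """Hypothetical fixed-size trace: the oldest entries are overwritten."""

    def __init__(self, capacity: int) -> None:
        self._entries = deque(maxlen=capacity)

    def record(self, action: Action, addr: int, size: int) -> None:
        self._entries.append((action, addr, size))

    def snapshot(self) -> list:
        # Return a copy of the surviving entries, oldest first.
        return list(self._entries)


trace = TraceBuffer(capacity=2)
trace.record(Action.SEGMENT_ALLOC, 0x1000, 2048)
trace.record(Action.ALLOC, 0x1000, 512)
trace.record(Action.FREE, 0x1000, 512)
print([entry[0].value for entry in trace.snapshot()])
```

A bounded buffer keeps the recording cost constant, which is why the commit pairs it with periodic snapshots: each snapshot carries the actions recorded since roughly the previous one, provided the buffer is large enough.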
Edward Z. Yang	adf5919720	Add option to record C++ backtraces in _record_memory_history (#86145 ) I used this to debug https://github.com/pytorch/pytorch/issues/86136 so it is useful. The implementation is not so fast so it is not enabled by default. Signed-off-by: Edward Z. Yang <ezyang@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/86145 Approved by: https://github.com/albanD, https://github.com/zdevito	2022-10-06 04:07:37 +00:00
Zachary DeVito	736adc0808	Memory snapshots from C++ (#86190 ) Sometimes the driving process wants to save memory snapshots but isn't Python. Add a simple API to turn it on without Python stack traces. It still saves to the same format for the visualization and summary scripts, using the C++ Pickler. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86190 Approved by: https://github.com/ezyang	2022-10-05 07:36:39 +00:00
PyTorch MergeBot	71eb04403c	Revert "[CUBLAS][CUDA GRAPHS] (re-re-open of #83461 ) Explicitly set the workspace for cuBLAS handles (#85447 )" This reverts commit `b04b2fa9aa`. Reverted https://github.com/pytorch/pytorch/pull/85447 on behalf of https://github.com/seemethere due to Caused a CUDA memory leak, detected by our performance benchmark suite	2022-09-30 20:53:41 +00:00
Masaki Kozuki	5f26df0345	resubmit: "resubmit: [mta] APEX style Fused Adam (#81705 ) (#85507 )" (#85739 ) Embarrassingly move the pow implementations around [ATen/native/cuda/PowKernel.cu#L21-L66](`849b08f14b/aten/src/ATen/native/cuda/PowKernel.cu (L21-L66)`) to a new header file and let FusedAdam use them to tame MSVC, hopefully. cc @ngimel @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/85739 Approved by: https://github.com/ngimel	2022-09-29 16:58:59 +00:00
Eddie Yan	b04b2fa9aa	[CUBLAS][CUDA GRAPHS] (re-re-open of #83461 ) Explicitly set the workspace for cuBLAS handles (#85447 ) Now includes @dagitses 's optimizations and fixes for teardown CC @ngimel @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/85447 Approved by: https://github.com/malfet	2022-09-28 16:04:58 +00:00
Andres Lugo-Reyes	5709c67f1f	[ROCm] Retry loop implemented to avoid transient memory leak errors (#82607 ) ### Description Added a retry loop to the memory leak checker to avoid the rare case in which ROCm reports a false positive memory leak. ### Issue Original issue observed as part of this ticket: https://github.com/pytorch/pytorch/issues/62533 ### Testing - Applied changes and built - python test/test_cuda.py - Ensure all tests pass Pull Request resolved: https://github.com/pytorch/pytorch/pull/82607 Approved by: https://github.com/malfet	2022-09-28 15:48:24 +00:00
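The retry idea in #82607 can be sketched in a few lines: re-measure memory several times before declaring a leak, so that a transient reading does not fail the test. This is an assumption-laden illustration rather than the actual checker; `leak_detected`, `measure_bytes`, and the default retry count are all hypothetical.

```python
import time
from typing import Callable


def leak_detected(
    measure_bytes: Callable[[], int],
    baseline: int,
    retries: int = 3,
    delay_s: float = 0.0,
) -> bool:
    """Report a leak only if every retry still reads above the baseline."""
    for _ in range(retries):
        if measure_bytes() <= baseline:
            return False  # memory returned to baseline: transient reading
        time.sleep(delay_s)  # give the runtime time to release memory
    return True


# A reading that settles back to the baseline is not treated as a leak.
readings = iter([120, 110, 100])
print(leak_detected(lambda: next(readings), baseline=100))
```

The trade-off is a slower failure path: a real leak is only reported after all retries have been spent.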
PyTorch MergeBot	7167996346	Revert "resubmit: [mta] APEX style Fused Adam (#81705 ) (#85507 )" This reverts commit `4615d1bcfa`. Reverted https://github.com/pytorch/pytorch/pull/85507 on behalf of https://github.com/atalman due to Break internal windows builds	2022-09-27 16:59:35 +00:00
Masaki Kozuki	4615d1bcfa	resubmit: [mta] APEX style Fused Adam (#81705 ) (#85507 ) This PR implements an APEX style FusedAdam in PyTorch. This is different from the APEX one in that this is compatible with `torch.cuda.amp.GradScaler` by setting `_step_supports_amp_scaling` to `True` and unscales gradients inside its CUDA kernel. related: https://github.com/pytorch/pytorch/issues/68041, https://github.com/pytorch/pytorch/issues/71274, https://github.com/pytorch/pytorch/issues/80167 possibly related to https://github.com/pytorch/pytorch/issues/80595#issuecomment-1178519436 Pull Request resolved: https://github.com/pytorch/pytorch/pull/81705 Approved by: https://github.com/ngimel cc @ptrblck @ngimel Pull Request resolved: https://github.com/pytorch/pytorch/pull/85507 Approved by: https://github.com/ngimel	2022-09-23 18:56:00 +00:00
PyTorch MergeBot	e505360eb8	Revert "[mta] APEX style Fused Adam (#81705 )" This reverts commit `7a6c4d0c50`. Reverted https://github.com/pytorch/pytorch/pull/81705 on behalf of https://github.com/dagitses due to broke internal builds, details to come	2022-09-22 19:37:29 +00:00
PyTorch MergeBot	0ac6311356	Revert "[CUBLAS][CUDA GRAPHS] (re-open of #83461 ) Explicitly set the workspace for cuBLAS handles (#85292 )" This reverts commit `4012e623e8`. Reverted https://github.com/pytorch/pytorch/pull/85292 on behalf of https://github.com/dagitses due to broke an internal test during shutdown. Re-submit with #85399 in stack	2022-09-21 17:57:49 +00:00
Masaki Kozuki	7a6c4d0c50	[mta] APEX style Fused Adam (#81705 ) This PR implements an APEX style FusedAdam in PyTorch. This is different from the APEX one in that this is compatible with `torch.cuda.amp.GradScaler` by setting `_step_supports_amp_scaling` to `True` and unscales gradients inside its CUDA kernel. related: https://github.com/pytorch/pytorch/issues/68041, https://github.com/pytorch/pytorch/issues/71274, https://github.com/pytorch/pytorch/issues/80167 possibly related to https://github.com/pytorch/pytorch/issues/80595#issuecomment-1178519436 cc @ptrblck @ngimel Pull Request resolved: https://github.com/pytorch/pytorch/pull/81705 Approved by: https://github.com/ngimel	2022-09-20 17:18:33 +00:00
eqy	4012e623e8	[CUBLAS][CUDA GRAPHS] (re-open of #83461 ) Explicitly set the workspace for cuBLAS handles (#85292 ) re-open of #83461 with fix for 10.2 build CC @ngimel @malfet Pull Request resolved: https://github.com/pytorch/pytorch/pull/85292 Approved by: https://github.com/malfet	2022-09-20 16:31:54 +00:00
Hector Yuen	d23ce29761	allow changing the cuda allocator settings even after the process started (#84970 ) Summary: - expose a Python call to set the allocator settings; it uses the same format as the value for PYTORCH_CUDA_ALLOC_CONF - keep the implementation contained within the cpp file to avoid increasing build times, only expose a function to call the setting - make some of the Allocator Config methods public, now it looks more like a singleton Test Plan: added the unit test Differential Revision: D39487522 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84970 Approved by: https://github.com/zdevito	2022-09-17 09:42:42 +00:00
PyTorch MergeBot	2711b9fa63	Revert "[CUBLAS][CUDA GRAPHS] Explicitly set the workspace for cuBLAS handles (#83461 )" This reverts commit `713d8b8552`. Reverted https://github.com/pytorch/pytorch/pull/83461 on behalf of https://github.com/malfet due to Broke CUDA-10.2 builds, see `713d8b8552`	2022-09-14 22:27:30 +00:00
Eddie Yan	713d8b8552	[CUBLAS][CUDA GRAPHS] Explicitly set the workspace for cuBLAS handles (#83461 ) We're seeing an issue where repeatedly capturing graphs incurs increasing memory usage as cuBLAS internally allocates a new workspace for each graph even when the same handle is being used: https://gist.github.com/tomconerlyanth/a20c04a4a46a0f6e9ce18f5280729b36 This PR works around the issue by intercepting the `CUBLAS_WORKSPACE_CONFIG` environment variable and allocating the workspace for the cuBLAS handle explicitly. CC @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/83461 Approved by: https://github.com/ngimel	2022-09-14 21:56:48 +00:00
Aidyn-A	5271494ef2	[CUDA graphs] Fixes errors in RNG seed (#84967 ) Fixes #84614 Prior to this PR CUDAGraph did not store the RNG seed, that is why `torch.cuda.manual_seed(new_seed)` would only reset the offset but not update the seed at all keeping whatever value was used during graph capture. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84967 Approved by: https://github.com/ngimel	2022-09-14 19:56:12 +00:00
jataylo	09bcc006e9	ROCm support for test_lazy_init (#84333 ) Added ROCm support for the test_lazy_init unit test by including a condition on TEST_WITH_ROCM to switch CUDA_VISIBLE_DEVICES with HIP_VISIBLE_DEVICES. This is needed because HIP_VISIBLE_DEVICES is set when running the single-GPU tests in CI: `a47bc96fb7/.jenkins/pytorch/test.sh (L38)`, but this test sets CUDA_VISIBLE_DEVICES, which takes lower precedence than HIP_VISIBLE_DEVICES on ROCm. Testing Logs (to show behavior difference) 12:40:41 Aug 30 11:40:41 CUDA_VISIBLE_DEVICES='0': 0 12:40:41 Aug 30 11:40:41 1 12:40:41 Aug 30 11:40:41 CUDA_VISIBLE_DEVICES='32': 32 12:40:41 Aug 30 11:40:41 1 12:40:41 Aug 30 11:40:41 HIP_VISIBLE_DEVICES='0': 0 12:40:41 Aug 30 11:40:41 1 12:40:41 Aug 30 11:40:41 HIP_VISIBLE_DEVICES='32': 32 12:40:41 Aug 30 11:40:41 0 Passing UT Aug 30 17:03:15 test_lazy_init (main.TestCuda) Aug 30 17:03:17 Validate that no CUDA calls are made during import torch call ... ok (2.471s) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84333 Approved by: https://github.com/jithunnair-amd, https://github.com/malfet	2022-09-09 14:14:59 +00:00
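The variable switch described in #84333 amounts to picking the right environment variable before launching a child process that imports torch. The helper below is a hedged sketch, not the test's code: `env_with_hidden_gpu` is a hypothetical name, and the device id "32" simply mirrors the non-existent device used in the logs above.

```python
import os


def env_with_hidden_gpu(base_env: dict, rocm: bool, device_id: str = "32") -> dict:
    """Return a copy of base_env that points at a non-existent device.

    On ROCm, HIP_VISIBLE_DEVICES takes precedence over CUDA_VISIBLE_DEVICES,
    so that is the variable that has to be overridden there.
    """
    env = dict(base_env)  # copy so the caller's environment is untouched
    var = "HIP_VISIBLE_DEVICES" if rocm else "CUDA_VISIBLE_DEVICES"
    env[var] = device_id
    return env


child_env = env_with_hidden_gpu(dict(os.environ), rocm=True)
print(child_env["HIP_VISIBLE_DEVICES"])
```

The resulting dictionary could then be passed as the `env` argument of `subprocess.run` for a child that runs `import torch`.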
Fabio Rocha	88b1cc885c	Removed tri[lu]* tests, superseded by OpInfos (#84256 ) triu, tril, triu_indices and tril_indices had some tests in test_tensor_creation_ops.py and test_cuda.py that are redundant with the ones done by OpInfos for those ops. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84256 Approved by: https://github.com/Lezcano, https://github.com/ngimel	2022-09-06 18:54:10 +00:00
Aidyn-A	ce1b727e77	Disable autocast cache in torch.cuda.make_graphed_callables (#84289 ) There are conflicts between `torch.clear_autocast_cache()` and `cudaMallocAsync` from #82682. Moreover, the use of autocast caching is not reasonable during training, which is the main target of `make_graphed_callables`. cc @eqy @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/84289 Approved by: https://github.com/ngimel	2022-09-01 21:34:51 +00:00
Pruthvi Madugundu	8473e69684	[ROCm] Fixes the kernel asserts API declaration mismatch error (#81790 ) This change updates PR [#73040](https://github.com/pytorch/pytorch/pull/73040). Compilation of PyTorch with ROCm succeeds with these changes when `NDEBUG` is enabled. Solution: For HIP we keep `__device__ __assert_fail()` and for host side compilation we want to use the `__assert_fail()` from the glibc library. Tested the code by compiling with the steps below ``` python3 tools/amd_build/build_amd.py python3 setup.py develop --cmake-only cmake -DHIP_HIPCC_FLAGS_RELEASE="-DNDEBUG" build cmake --build build ``` The UT test_fixed_cuda_assert_async is still skipped due to performance overhead. cc @jithunnair-amd Pull Request resolved: https://github.com/pytorch/pytorch/pull/81790 Approved by: https://github.com/shintaro-iwasaki, https://github.com/jeffdaily, https://github.com/malfet	2022-08-16 19:22:31 +00:00
Zachary DeVito	4128712397	Propagate CUDAOutOfMemoryError to Python. (#83146 ) The intention is to make it easier to catch this situation for debugging, logging, or application-specific recovery. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83146 Approved by: https://github.com/albanD	2022-08-11 21:32:11 +00:00
Zachary DeVito	726d040692	annotated allocator snapshots (#82146 ) Record stack trace information for each allocated segment in the allocator. It takes around 1.5us to record 50 stack frames of context. Since invoking a PyTorch operator is around 8us, this adds minimal overhead but we still leave it disabled by default so that we can test it more on real workloads first. Stack information is kept both for allocated blocks and for the last allocation that used inactive blocks. We could potentially keep around the _first_ allocation that caused the block to get allocated from cuda as well. Potential Followups: * stack frame entries are small (16 bytes), but the list of Frames is not compressed even though most frames will share some entries. So far this doesn't produce huge dumps (7MB for one real workload that uses all memory on the GPU), but it can be much smaller through compression. * Code to format the information is slow (a few seconds) because it uses python and FlameGraph.pl * Things allocated during the backward pass have no stack frames because they are run on another C++ thread. Pull Request resolved: https://github.com/pytorch/pytorch/pull/82146 Approved by: https://github.com/albanD	2022-08-09 17:21:35 +00:00
Aidyn-A	da0a3fe058	[Re-land] [CUDA graphs] Clear autocast amp cache (#81896 ) Re-lands #81558 that got reverted due to failing tests. This failure happened because of the test that I poorly designed. [The loop here](https://github.com/pytorch/pytorch/pull/81558/files#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450R3837) is doing `cache_enabled=False` and then `cache_enabled=True`. By doing this loop the graph from the previous iteration (case `False`) conflicts with the next one (case `True`). I redesigned the test such that it does not do any loops. The new test does separate function calls with different argument values. Pull Request resolved: https://github.com/pytorch/pytorch/pull/81896 Approved by: https://github.com/ngimel	2022-08-02 23:22:00 +00:00
Kurt Mohler	14d0296e5c	Rename `_Typed/_UntypedStorage` to `Typed/UntypedStorage` and update docs (#82438 ) ### Description Since the major changes for `_TypedStorage` and `_UntypedStorage` are now complete, they can be renamed to be public. `TypedStorage._untyped()` is renamed to `TypedStorage.untyped()`. Documentation for storages is improved as well. ### Issue Fixes #82436 ### Testing N/A Pull Request resolved: https://github.com/pytorch/pytorch/pull/82438 Approved by: https://github.com/ezyang	2022-07-30 19:37:08 +00:00
Eddie Yan	0b2566456f	[CUDNN] Update tests and dispatching for CUDNN V8 API behavior for `bfloat16` convs (#81139 ) cuDNN via the V8 API supports `bfloat16` on Ampere (`>= (8, 0)` but not older devices) which might be unexpected given current test settings. This PR fixes some dispatching to check the device capability before dispatching `bfloat16` convs and adjusts the expected failure conditions for the autocast test. CC @xwang233 @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/81139 Approved by: https://github.com/ngimel	2022-07-29 23:28:58 +00:00
PyTorch MergeBot	f5b460b200	Revert "[CUDA graphs] Clear autocast amp cache (#81558 )" This reverts commit `e9d07bd4f0`. Reverted https://github.com/pytorch/pytorch/pull/81558 on behalf of https://github.com/janeyx99 due to Breaks windows 11.6 tests on trunk `e9d07bd4f0`	2022-07-21 12:46:36 +00:00
Aidyn-A	e9d07bd4f0	[CUDA graphs] Clear autocast amp cache (#81558 ) According to [autocast_mode.cpp](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/autocast_mode.cpp) `cached_casts` is to be cleared at the end of each forward pass. However, this was not the case in current implementation of `make_graphed_callables` so a graph created the following way: ``` with torch.cuda.amp.autocast(cache_enabled=True): graphed_foo = torch.cuda.make_graphed_callables(foo, tensors) ``` Behaves incorrectly. cc @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/81558 Approved by: https://github.com/ngimel	2022-07-21 01:44:14 +00:00
Jeff Daily	ff6655defb	[ROCm] unskip external streams tests (#80922 ) These two tests are passing for ROCm 5.1.1 and 5.2. Pull Request resolved: https://github.com/pytorch/pytorch/pull/80922 Approved by: https://github.com/cpuhrsch	2022-07-08 21:29:29 +00:00
Nikita Shulga	1ad7ef3f21	Add check for cuda lazy init (#80912 ) Validate that no CUDA calls are made during the `import torch` call, by importing torch with the visible devices limited to a non-existent device. Should prevent regressions like the ones reported in https://github.com/pytorch/pytorch/issues/80876 Pull Request resolved: https://github.com/pytorch/pytorch/pull/80912 Approved by: https://github.com/ngimel, https://github.com/atalman	2022-07-06 01:39:27 +00:00
Jeff Daily	20d56d2b32	increase sleep for TestCuda.test_caching_pinned_memory_multi_gpu (#76601 ) Fixes #68299. Fixes #70875. Test is flaky on ROCm because the HIP runtime occasionally copies asynchronously too quickly for the current sleep value of 50ms. This is not a bug. Increasing the sleep value to 1s to avoid flakiness. Pull Request resolved: https://github.com/pytorch/pytorch/pull/76601 Approved by: https://github.com/pruthvistony, https://github.com/malfet	2022-06-14 21:10:35 +00:00
Michael Carilli	ba27ee9e8f	[CUDA graphs] Allows Adam and AdamW to be capture-safe (#77862 ) Near term fix for https://github.com/pytorch/pytorch/issues/76368. Q. Why does the user need to request `capturable=True` in the optimizer constructor? Why can't capture safety be completely automatic? A. We need to set up capture-safe (device-side) state variables before capture. If we don't, and step() internally detects capture is underway, it's too late: the best we could do is create a device state variable and copy the current CPU value into it, which is not something we want baked into the graph. Q. Ok, why not just do the capture-safe approach with device-side state variables all the time? A. It incurs several more kernel launches per parameter, which could really add up and regress cpu overhead for ungraphed step()s. If the optimizer won't be captured, we should allow step() to stick with its current cpu-side state handling. Q. But cuda RNG is a stateful thing that maintains its state on the cpu outside of capture and replay, and we capture it automatically. Why can't we do the same thing here? A. The graph object can handle RNG generator increments because its capture_begin, capture_end, and replay() methods can see and access generator object. But the graph object has no explicit knowledge of or access to optimizer steps in its capture scope. We could let the user tell the graph object what optimizers will be stepped in its scope, ie something like ```python graph.will_use_optimizer(opt) graph.capture_begin() ... ``` but that seems clunkier than an optimizer constructor arg. I'm open to other ideas, but right now I think constructor arg is necessary and the least bad approach. Long term, https://github.com/pytorch/pytorch/issues/71274 is a better fix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/77862 Approved by: https://github.com/ezyang	2022-06-13 01:56:47 +00:00
Kurt Mohler	aea6e2c396	Merge torch.cuda._UntypedStorage into torch._UntypedStorage (#75459 ) Fixes #74933 Pull Request resolved: https://github.com/pytorch/pytorch/pull/75459 Approved by: https://github.com/ezyang	2022-05-19 13:54:39 +00:00
Michael Carilli	929f1d5317	[RELAND] Adds torch.cuda.is_current_stream_capturing (#77789 ) Resubmit of https://github.com/pytorch/pytorch/pull/77673, which was reverted due to Windows test failures: https://github.com/pytorch/pytorch/pull/77673#issuecomment-1130425845. I suspect these failures happened because I don't explicitly set a side stream for graph capture in the new test. Not setting a side stream explicitly is alright on Linux because cuda tests implicitly use a side stream. I think Windows cuda tests implicitly use the default stream, breaking capture and leaving the backend in a bad state. Other graphs tests explicitly set side streams and don't error in Windows builds, so i'm 95% sure doing the same for the new test will work. Pull Request resolved: https://github.com/pytorch/pytorch/pull/77789 Approved by: https://github.com/ezyang	2022-05-18 23:18:53 +00:00
Jeff Daily	de86146c61	rocblas alt impl during backward pass only (#71881 ) In preparation of adopting future rocblas library options, it is necessary to track when the backward pass of training is executing. The scope-based helper class `BackwardPassGuard` is provided to toggle state. Pull Request resolved: https://github.com/pytorch/pytorch/pull/71881 Approved by: https://github.com/albanD	2022-05-18 19:42:58 +00:00
PyTorch MergeBot	0d8a0f186b	Revert "Adds torch.cuda.is_current_stream_capturing (#77673 )" This reverts commit `d03d43df52`. Reverted https://github.com/pytorch/pytorch/pull/77673 on behalf of https://github.com/suo	2022-05-18 19:31:49 +00:00
Michael Carilli	d03d43df52	Adds torch.cuda.is_current_stream_capturing (#77673 ) Exposes a way to query if CUDA graph capture is underway on the current stream. Pull Request resolved: https://github.com/pytorch/pytorch/pull/77673 Approved by: https://github.com/ezyang	2022-05-18 16:46:35 +00:00
Eddie Yan	76b952bb35	[CUBLAS][TF32] Skip `test_cublas_allow_tf32_get_set` if `TORCH_ALLOW_TF32_CUBLAS_OVERRIDE` is set (#77298 ) Follow-up to #77114 to prevent test breakages when the environment variable is set. CC @xwang233 @ngimel @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/77298 Approved by: https://github.com/xwang233, https://github.com/ngimel	2022-05-17 21:57:09 +00:00
Eddie Yan	e838137b3e	Add high level control of fp32 matmul precision; disable TF32 for matmuls by default #76440 CC @mruberry @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/76509 Approved by: https://github.com/ngimel	2022-05-04 20:40:13 +00:00
Felipe Petroski Such	b0c5fba967	[CUDA Graphs] Fix OOM inside graph capture_begin release_cached_blocks calls this: ``` void synchronize_and_free_events() { TORCH_INTERNAL_ASSERT(captures_underway == 0); ``` Which means we can't call that function when we are capturing a cuda graph: ``` import torch with torch.cuda.graph(torch.cuda.CUDAGraph()): torch.zeros(2 ** 40, device="cuda") ``` results in: ``` RuntimeError: captures_underway == 0INTERNAL ASSERT FAILED at "/tmp/torch/c10/cuda/CUDACachingAllocator.cpp":1224, please report a bug to PyTorch. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/76247 Approved by: https://github.com/ngimel	2022-04-29 17:42:04 +00:00
Jeff Daily	e846ef8818	add rocm ciflow/slow workflow Enables additional tests that historically have been missed for ROCm CI. Pull Request resolved: https://github.com/pytorch/pytorch/pull/72686 Approved by: https://github.com/seemethere	2022-04-22 17:41:28 +00:00
Ivan Yashchuk	4bb5e6e830	Fix `test_reduce_add_coalesced` failure (#74027 ) Summary: Recent change (https://github.com/pytorch/pytorch/pull/69751) introduced the requirement of using `.coalesce()` explicitly in the tests. Unfortunately, not all tests are run in the current CI configuration and one test failure slipped through. Fixes https://github.com/pytorch/pytorch/issues/74015. Pull Request resolved: https://github.com/pytorch/pytorch/pull/74027 Reviewed By: samdow Differential Revision: D34858112 Pulled By: mruberry fbshipit-source-id: 8904fac5e2b5335684a21f95a22646469478eb81 (cherry picked from commit 06d6e6d2a796af0e8444f4c57841a07ec4f67c9f)	2022-03-15 06:29:54 +00:00
Michael Carilli	2f957f513e	Deletes unused line in test_autocast_rnn (#73195 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73195 Reviewed By: mruberry Differential Revision: D34557677 Pulled By: ngimel fbshipit-source-id: 284018b4596471332d0e90a08e2c38303fb2b3ae (cherry picked from commit bbf6913009e206c02e124c49ab80ef9596f7fcad)	2022-03-02 01:27:55 +00:00
Shintaro Iwasaki	7dc2cfa249	[c10][rocm] fix __assert_fail() declaration mismatch error (#73040 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73040 This patch fixes a compilation error in PyTorch with ROCm when `NDEBUG` is passed. ## Problem Forward declaration of `__host__ __device__ __assert_fail()` is used in `c10/macros/Macros.h` for HIP compilation when `NDEBUG` is set However, HIP has `__device__ __assert_fail()` in `hip/amd_detail/amd_device_functions.h`, causing a function type error. This issue does not appear in ROCm CI tests since it happens only when `NDEBUG` is passed. ## Solution [EDIT] After the discussion on GitHub, we chose to entirely disable `CUDA_KERNEL_ASSERT()` for ROCm. --- To solve this compilation error, this patch disables `CUDA_KERNEL_ASSERT()`, which uses `__assert_fail()` when 1. `c10/macros/Macros.h` is included for `.hip` (precisely speaking, `__HIP__` or `__HIP_ARCH__` is defined), and 2. `NDEBUG` is passed. Note that there's no impact on default compilation because, without a special compilation flag, those HIP files are compiled without `-NDEBUG`. And that's why this issue has not been found. ### Justification [1] We cannot declare one host-and-device function for two separate host and device functions. 
``` __device__ int func() { return 0; } __host__ int func() { return 0; } // Compile error (hipcc) // __device__ __host__ int func(); ``` [2] Forward declaration of a correct `__device__` only `__assert_fail()` for `__HIP__` causes the following error: ``` pytorch/c10/util/TypeCast.h:135:7: error: reference to __device__ function '__assert_fail' in __host__ __device__ function ERROR_UNSUPPORTED_CAST ^ pytorch/c10/util/TypeCast.h:118:32: note: expanded from macro 'ERROR_UNSUPPORTED_CAST' #define ERROR_UNSUPPORTED_CAST CUDA_KERNEL_ASSERT(false); ^ pytorch/c10/macros/Macros.h:392:5: note: expanded from macro 'CUDA_KERNEL_ASSERT' __assert_fail( ``` [3] Maybe there's a way to properly define `__assert_fail()` for HIP + NDEBUG, but this might be too much. Please let me just disable it. ### Technical details Error ``` pytorch/c10/macros/Macros.h:368:5: error: __host__ __device__ function '__assert_fail' cannot overload __device__ function '__assert_fail' __assert_fail( ^ /opt/rocm/hip/include/hip/amd_detail/amd_device_functions.h:1173:6: note: previous declaration is here void __assert_fail(const char *assertion, ``` CUDA definition (9.x) of `__assert_fail()` ``` #elif defined(__GNUC__) extern __host__ __device__ __cudart_builtin__ void __assert_fail( const char *, const char *, unsigned int, const char *) __THROW; ``` ROCm definition (the latest version) ``` // `2b59661f3e/include/hip/amd_detail/amd_device_functions.h (L1172-L1177)` extern "C" __device__ __attribute__((noinline)) __attribute__((weak)) void __assert_fail(const char *assertion, const char *file, unsigned int line, const char *function); ``` Test Plan: CI + reproducer ``` python3 tools/amd_build/build_amd.py python3 setup.py develop --cmake-only cmake -DHIP_HIPCC_FLAGS_RELEASE="-DNDEBUG" build cmake --build build ``` Reviewed By: xw285cornell Differential Revision: D34310555 fbshipit-source-id: 7542288912590533ced3f20afd2e704b6551991b (cherry picked from commit 9e52196e36820abe36bf6427cabc7389d3ea6cb5)	2022-03-01 04:35:30 +00:00
Philip Meier	b5f2574f36	no longer coalesce sparse COO tensors before comparison (#69751 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69751 cc nikitaved pearu cpuhrsch IvanYashchuk Test Plan: Imported from OSS Reviewed By: zou3519 Differential Revision: D34262453 Pulled By: ezyang fbshipit-source-id: e2e62d2aa03fc569d2951c880960b256f5dc4aaa (cherry picked from commit `cb6b0ef719`)	2022-02-17 02:33:08 +00:00
Kurt Mohler	8e7fe87630	Rename `Typed/UntypedStorage` to `_Typed/_UntypedStorage` (#72540 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72540 Reviewed By: jbschlosser Differential Revision: D34216823 Pulled By: bdhirsh fbshipit-source-id: 1bc9930ab582771ebf02308e035576cd1a0dbe47 (cherry picked from commit `329238f612`)	2022-02-15 23:53:01 +00:00
Louis Feng	83b3b5fb00	[PyTorch] Support NVTX range_start and range_end (#70030 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70030 range_push and range_pop do not support multiple threads; they only work when the push and the pop happen on the same thread. For process-level ranges, we should use range_start and range_end. This is important because the PyTorch forward pass runs on one thread, while autograd runs on a different thread. See the NVIDIA implementation documentation: `cab2dec760/NSight/nvToolsExt.h (L397-L407)` Test Plan: ``` buck test caffe2/test:cuda Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/8162774391483460 ✓ ListingSuccess: caffe2/test:cuda - main (19.640) Summary ListingSuccess: 1 If you need help understanding your runs, please follow the wiki: https://fburl.com/posting_in_tpx_users Finished test run: https://www.internalfb.com/intern/testinfra/testrun/8162774391483460 ``` Reviewed By: malfet Differential Revision: D33155244 fbshipit-source-id: c7d5143f6da9b6ef0e0811e2fcae03a3e76f24de (cherry picked from commit `22134e91b7`)	2022-02-07 17:31:57 +00:00
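Note on 83b3b5fb00: the difference between the two NVTX range styles can be sketched with a toy model in plain Python. This is not the NVTX or PyTorch implementation; `ToyRangeTracker` and `run_in_thread` are hypothetical names invented for illustration. It only assumes what the commit message states: push/pop ranges are tied to the calling thread, while start/end ranges are identified by a handle that any thread can close.

```python
import itertools
import threading


class ToyRangeTracker:
    """Toy stand-in for a profiler range tracker (hypothetical, not the NVTX API)."""

    def __init__(self):
        self._local = threading.local()  # per-thread stack used by push/pop
        self._open = {}                  # process-wide ranges keyed by handle
        self._ids = itertools.count(1)
        self._lock = threading.Lock()

    def range_push(self, msg):
        # The range is stored on the calling thread's own stack.
        stack = getattr(self._local, "stack", None)
        if stack is None:
            stack = self._local.stack = []
        stack.append(msg)

    def range_pop(self):
        # Only ranges pushed by the calling thread are visible here.
        stack = getattr(self._local, "stack", [])
        return stack.pop() if stack else None

    def range_start(self, msg):
        # The range is stored process-wide and identified by a handle.
        with self._lock:
            handle = next(self._ids)
            self._open[handle] = msg
        return handle

    def range_end(self, handle):
        # Any thread holding the handle can close the range.
        with self._lock:
            return self._open.pop(handle, None)


def run_in_thread(fn):
    """Run fn on a separate thread and return its result."""
    result = []
    worker = threading.Thread(target=lambda: result.append(fn()))
    worker.start()
    worker.join()
    return result[0]


tracker = ToyRangeTracker()

# Push on the main thread, try to pop on another thread: nothing to pop there.
tracker.range_push("forward")
popped_elsewhere = run_in_thread(tracker.range_pop)

# Start on the main thread, end on another thread: the handle makes this work.
handle = tracker.range_start("forward+backward")
ended_elsewhere = run_in_thread(lambda: tracker.range_end(handle))
```

In the toy model the cross-thread pop finds nothing, while the cross-thread end closes the range, which mirrors the forward-thread/autograd-thread situation the commit describes.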
Andrew Tulloch	0099796978	[CUDA Pinned Memory] [Retry] Alternative implementation of pinned memory allocator focusing on multi-threaded scalability (#69299 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69299 https://github.com/pytorch/pytorch/pull/68906 + https://github.com/pytorch/pytorch/pull/68749 plugged one correctness hole (non-blocking copies of offset pinned memory tensors) while introducing another (non-blocking copies of pinned memory tensors with a non-standard DataPtr context). In this revision, we use both the tensor data pointer and context to attempt to identify the originating block in the pinned memory allocator. Test Plan: New unit tests added to cover the missing case previously. Reviewed By: yinghai Differential Revision: D32787087 fbshipit-source-id: 0cb0d29d7c39a13f433eb1cd423dc0d2a303c955 (cherry picked from commit `297157b1a1`)	2022-01-27 01:33:55 +00:00
Mike Ruberry	e0d829a266	Kill the test_torch.py mixin and creates test_scatter_gather_ops (#71691 ) Summary: Per title. Also annotates test_torch.py with additional cleanup tasks and adds empty sample inputs to elementwise unary and binary OpInfos. Pull Request resolved: https://github.com/pytorch/pytorch/pull/71691 Reviewed By: ngimel Differential Revision: D33735126 Pulled By: mruberry fbshipit-source-id: 8cc097a7581a8b620540c95b2a5889c1165ecf23 (cherry picked from commit `5c6a245a3f`)	2022-01-24 09:32:32 +00:00
Leo Fang	67941c8a94	Document `torch.cuda.ExternalStream`, `torch.cuda.caching_allocator_alloc` and `torch.cuda.caching_allocator_delete` (#70126 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/67414. Fixes https://github.com/pytorch/pytorch/issues/70117. cc brianjo mruberry ngimel Pull Request resolved: https://github.com/pytorch/pytorch/pull/70126 Reviewed By: mruberry Differential Revision: D33542910 Pulled By: ngimel fbshipit-source-id: 4b870f4dceca6ee4cc8fba58819f1cb18ac9f857	2022-01-12 15:44:40 -08:00
Jane Xu	20489ebdc9	Increase tensor size for mem check tests (#70603 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/70226 Pull Request resolved: https://github.com/pytorch/pytorch/pull/70603 Reviewed By: mruberry Differential Revision: D33410439 Pulled By: janeyx99 fbshipit-source-id: e94615ece6d0fdf230de5297118678b70f34a18c	2022-01-05 08:27:48 -08:00
Jane Xu	c555b7bacb	GHA: Remove caffe2 check in Windows shard 1 smoke tests (#70010 ) Summary: Windows shard 1 hasn't actually been running any tests because the script that does so exited before running the python tests but did not report an error. This has been happening to all windows tests across the board, for example https://github.com/pytorch/pytorch/runs/4526170542?check_suite_focus=true Removing the caffe2.python check passes the smoke tests now. You can observe that the run_test.py file is called in the windows cpu job now https://github.com/pytorch/pytorch/runs/4541331717?check_suite_focus=true Pull Request resolved: https://github.com/pytorch/pytorch/pull/70010 Reviewed By: malfet, seemethere Differential Revision: D33161291 Pulled By: janeyx99 fbshipit-source-id: 85024b0ebb3ac42297684467ee4d0898ecf394de	2021-12-20 16:05:38 -08:00
Mike Ruberry	84b7832010	Updates CUDA memory leak check to verify against driver API and print more diagnostic information (#69556 ) Summary: Per title Pull Request resolved: https://github.com/pytorch/pytorch/pull/69556 Reviewed By: mrshenli Differential Revision: D32954770 Pulled By: mruberry fbshipit-source-id: a6c2ae6f704422c178569980ca4b9c72c4272f55	2021-12-17 23:37:49 -08:00
Mike Ruberry	dc87cf5fe1	Fixes mem_get_info when querying on a device other than the current device (#69640 ) Summary: Also fixes the documentation failing to appear and adds a test to validate that op works with multiple devices properly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/69640 Reviewed By: ngimel Differential Revision: D32965391 Pulled By: mruberry fbshipit-source-id: 4fe502809b353464da8edf62d92ca9863804f08e	2021-12-08 23:04:30 -08:00
Dennis van der Staay	cbe0a38d8c	Back out "[CUDA Pinned Memory] Event recording with non-blocking copies should track the storage context, not the tensor data pointer" (#69193 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69193 Reviewed By: xing-liu, yuchenhao Differential Revision: D32748570 fbshipit-source-id: bd73d7567f94c70daeace49d4081381b8adf2d77	2021-12-01 19:30:08 -08:00
Andrew Tulloch	d44e610efa	[CUDA Pinned Memory] Event recording with non-blocking copies should track the storage context, not the tensor data pointer (#68749 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68749 The logic for asynchronous copies (either HtoD or DtoH) using cudaMemcpyAsync relies on recording an event with the caching host allocator to notify it that a given allocation has been used on a stream - and thus it should wait for that stream to proceed before reusing the host memory. This tracking is based on the allocator maintaining a map from storage allocation pointers to some state. If we try to record an event for a pointer we don't understand, we will silently drop the event and ignore it (`9554ebe44e/aten/src/ATen/cuda/CachingHostAllocator.cpp (L171-L175)`). Thus, if we use the data_ptr of a Tensor instead of the storage allocation, then reasonable code can lead to incorrectness due to missed events. One way this can occur is simply by slicing a tensor into sub-tensors - which have different values of `data_ptr()` but share the same storage, for example: ``` image_batch = torch.randn(M, B, C, H, W).pin_memory() for m in range(M): sub_batch = image_batch[m].cuda(non_blocking=True) # sub_batch.data_ptr() != image_batch.data_ptr() except for m == 0. # however, sub_batch.storage().data_ptr() == image_batch.storage().data_ptr() always. ``` Therefore, we instead use the storage context pointer when recording events, as this is the same state that is tracked by the caching allocator itself. This is a correctness fix, although it's hard to determine how widespread this issue is. Using the storage context also allows us to use a more efficient structure internally to the caching allocator, which will be sent in future diffs. Test Plan: Test added which demonstrates the issue, although it's hard to demonstrate the race explicitly. 
Reviewed By: ngimel Differential Revision: D32588785 fbshipit-source-id: d87cc5e49ff8cbf59052c3c97da5b48dd1fe75cc	2021-11-24 13:20:22 -08:00
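Note on d44e610efa: the missed-event problem described above can be sketched with a toy allocator in plain Python. This is a simplified model under stated assumptions, not the code in `CachingHostAllocator.cpp`; `ToyHostAllocator`, `BASE`, and `SLICE_BYTES` are made-up names and values. It assumes only what the commit message says: the allocator maps allocation base pointers to state and silently drops events recorded for pointers it does not know.

```python
class ToyHostAllocator:
    """Toy model of a caching host allocator (hypothetical, not PyTorch code)."""

    def __init__(self):
        self._blocks = {}  # allocation base pointer -> streams with pending events

    def allocate(self, base_ptr):
        self._blocks[base_ptr] = []

    def record_event(self, ptr, stream):
        events = self._blocks.get(ptr)
        if events is None:
            return False  # unknown pointer: the event is silently dropped
        events.append(stream)
        return True

    def pending_events(self, base_ptr):
        return list(self._blocks[base_ptr])


BASE = 0x1000      # made-up base address of the pinned storage
SLICE_BYTES = 256  # made-up size of each sub-tensor in bytes

alloc = ToyHostAllocator()
alloc.allocate(BASE)

# Keyed on each slice's data pointer: only the slice at offset 0 is recognized.
tracked_by_data_ptr = [
    alloc.record_event(BASE + m * SLICE_BYTES, "copy_stream") for m in range(4)
]

# Keyed on the shared storage pointer: every copy is recorded.
tracked_by_storage = [alloc.record_event(BASE, "copy_stream") for _ in range(4)]
```

With data pointers as keys, three of the four copies leave no event behind, so the toy allocator would consider the block reusable too early; keying on the storage pointer records all of them.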
eqy	790763b0fe	Add an option to disable reduced precision reductions for FP16 GEMM (#67946 ) Summary: https://github.com/pytorch/pytorch/issues/67578 disabled reduced precision reductions for FP16 GEMMs. After benchmarking, we've found that this has substantial performance impacts for common GEMM shapes (e.g., those found in popular instantiations of multiheaded-attention) on architectures such as Volta. As these performance regressions may come as a surprise to current users, this PR adds a toggle to disable reduced precision reductions `torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = ` rather than making it the default behavior. CC ngimel ptrblck stas00 Note that the behavior after the previous PR can be replicated with `torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False` Pull Request resolved: https://github.com/pytorch/pytorch/pull/67946 Reviewed By: zou3519 Differential Revision: D32289896 Pulled By: ngimel fbshipit-source-id: a1ea2918b77e27a7d9b391e030417802a0174abe	2021-11-09 17:27:20 -08:00
Jane Xu	2578de4851	[skip ci] Set test owner for test_cuda* tests (#66838 ) Summary: Action following https://github.com/pytorch/pytorch/issues/66232 cc ngimel Pull Request resolved: https://github.com/pytorch/pytorch/pull/66838 Reviewed By: saketh-are Differential Revision: D31841411 Pulled By: janeyx99 fbshipit-source-id: 5cdffdef4a92f9adcef1143ae4598b052c5acc6b	2021-10-21 17:36:25 -07:00
arindamroy-eng	32e790997b	[Rocm]Reduce severity of detected possible memory leak from assertion to warning (#65973 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/62533. In very rare cases, the decorator that detects memory leaks raises an assertion even though the test passes and the memory is freed after a tiny delay. The issue does not reproduce in internal testing, but shows up occasionally in the CI environment. This change reduces the severity of such a detection to a warning so that CI does not fail: the test itself is not failing, only the check inside the decorator is. The change is limited to ROCm for now. cc jeffdaily sunway513 jithunnair-amd ROCmSupport Pull Request resolved: https://github.com/pytorch/pytorch/pull/65973 Reviewed By: anjali411 Differential Revision: D31776154 Pulled By: malfet fbshipit-source-id: 432199fca17669648463c4177c62adb553cacefd	2021-10-21 07:10:54 -07:00
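Note on 32e790997b: the idea of downgrading a leak check from an assertion to a warning can be sketched as a small decorator in plain Python. This is an illustrative sketch, not the decorator in PyTorch's test utilities; `check_leak`, `warn_only`, and the fake allocation counter are hypothetical. It assumes only that the check compares allocated memory before and after the test body.

```python
import functools
import warnings


def check_leak(get_allocated, warn_only=False):
    """Compare allocated bytes before and after the wrapped test (hypothetical helper)."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            before = get_allocated()
            result = fn(*args, **kwargs)
            after = get_allocated()
            if after > before:
                msg = f"possible memory leak: {before} -> {after} bytes"
                if warn_only:
                    # Reduced severity: report the suspicion but let the test pass.
                    warnings.warn(msg, RuntimeWarning)
                else:
                    raise AssertionError(msg)
            return result

        return wrapper

    return decorator


# Fake allocation counter standing in for a real device memory query.
_fake_allocated = [0]


@check_leak(lambda: _fake_allocated[0], warn_only=True)
def leaky_test():
    _fake_allocated[0] += 512  # memory that is not released before the check runs
    return "passed"


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    outcome = leaky_test()
```

With `warn_only=True` the test result is preserved and the suspicion is surfaced as a `RuntimeWarning`; with the default it would raise an `AssertionError` instead.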
Yanli Zhao	8173d4df69	move get_cycles_per_ms() to common_utils (#66798 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66798 get_cycles_per_ms is copied and used in a few places, move it to common_utils so that it can be used as a shared util function ghstack-source-id: 140790599 Test Plan: unit tests Reviewed By: pritamdamania87 Differential Revision: D31706870 fbshipit-source-id: e8dccecb13862646a19aaadd7bad7c8f414fd4ab	2021-10-18 14:04:09 -07:00
Kurt Mohler	5883523c1d	Remove dtype from torch.Storage and use only torch.ByteStorage (#62030 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62030 Remove dtype tracking from Python Storage interface, remove all the different `<type>Storage` classes except for `ByteStorage`, and update serialization accordingly, while maintaining as much FC/BC as possible Fixes https://github.com/pytorch/pytorch/issues/47442 * THE SERIALIZATION FORMAT IS FULLY FC/BC. We worked very hard to make sure this is the case. We will probably want to break FC at some point to make the serialization structure of tensors make more sense, but not today. * There is now only a single torch.ByteStorage class. Methods like `Tensor.set_` no longer check that the dtype of storage is appropriate. * As we no longer know what dtype of a storage is, we've removed the size method from Storage, replacing it with nbytes. This is to help catch otherwise silent errors where you confuse number of elements with number of bytes. * `Storage._new_shared` takes a `nbytes` kwarg and will reject previous positional only calls. `Storage._new_with_file` and `_set_from_file` require explicit element size arguments. * It's no longer possible to convert storages to different types using the float/double/etc methods. Instead, do the conversion using a tensor. * It's no longer possible to allocate a typed storage directly using FloatStorage/DoubleStorage/etc constructors. Instead, construct a tensor and extract its storage. The classes still exist but they are used purely for unpickling. * The preexisting serialization format stores dtype with storage, and in fact this dtype is used to determine the dtype of the tensor overall. To accommodate this case, we introduce a new TypedStorage concept that exists only during unpickling time which is used to temporarily store the dtype so we can construct a tensor. 
If you overrode the handling of pickling/unpickling, you MUST add handling for TypedStorage or your serialization code will degrade to standard file-based serialization. Original pull request: https://github.com/pytorch/pytorch/pull/59671 Reviewed By: soulitzer, ngimel Differential Revision: D29466819 Pulled By: ezyang fbshipit-source-id: 4a14e5d3c2b08e06e558683d97f7378a3180b00e	2021-10-05 13:50:34 -07:00
Michael Dagitses	b737629ff0	simplify op name determination into a single forward pass (#64261 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64261 Note that this does not preserve byte-for-byte compatibility with existing names. Test Plan: * Rely on CI to catch gross errors. * Merge after release cut to catch subtle issues. Reviewed By: albanD Differential Revision: D30700647 Pulled By: dagitses fbshipit-source-id: 7b02f34b8fae3041240cc78fbc6bcae498c3acd4	2021-09-02 07:32:11 -07:00
Michael Carilli	24e50b8453	[CUDA graphs] hotfix for test_graph_ (#64339 ) Summary: Graphed workloads that try to capture a full backward pass must do warmup on a non-default stream. If warmup happens on the default stream, AccumulateGrad functions might tag themselves to run on the default stream, and therefore won't be capturable. ngimel and I suspect some test_cuda.py tests run with the default stream as the ambient stream, which breaks `test_graph_grad_scaling` because `test_graph_grad_scaling` does warmup on the ambient stream _assuming_ the ambient stream is a non-default stream. This PR explicitly sets a side stream for the warmup in `test_graph_grad_scaling`, which is what I should have done all along because it's what the new documentation recommends. I pushed the PR branch straight to the main pytorch repo because we need to run ci-all on it, and I'm not sure what the requirements are these days. Pull Request resolved: https://github.com/pytorch/pytorch/pull/64339 Reviewed By: mruberry Differential Revision: D30690711 Pulled By: ngimel fbshipit-source-id: 91ad75f46a11f311e25bc468ea184e22acdcc25a	2021-08-31 22:34:10 -07:00
Rishi Puri	13484084a6	fix syntax error in bfloat16 PR (#64122 ) Summary: fixes prior syntax error from PR ngimel Pull Request resolved: https://github.com/pytorch/pytorch/pull/64122 Reviewed By: H-Huang Differential Revision: D30643596 Pulled By: ngimel fbshipit-source-id: 0a2d5a40fb6dc7339cd03112e57ef0e1bf8a000e	2021-08-31 14:33:12 -07:00
Michael Carilli	8d08b103be	[CUDA graphs] Prototype API and documentation (#63269 ) Summary: RFC: https://github.com/pytorch/pytorch/issues/61880 Pull Request resolved: https://github.com/pytorch/pytorch/pull/63269 Reviewed By: mruberry Differential Revision: D30596643 Pulled By: ngimel fbshipit-source-id: b1f8061406364b667e2c2d4d30fbce1f0d8456be	2021-08-31 13:34:23 -07:00
Philip Meier	57d4c6cf42	replace `self.assertTrue(torch.allclose(..))` with `self.assertEqual(…)` (#63637 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/63565 Pull Request resolved: https://github.com/pytorch/pytorch/pull/63637 Reviewed By: malfet Differential Revision: D30541266 Pulled By: mruberry fbshipit-source-id: ab461949782c6908a589ea098fcfcf5c3e081ee6	2021-08-25 16:47:40 -07:00
Shen Li	1022443168	Revert D30279364: [codemod][lint][fbcode/c*] Enable BLACK by default Test Plan: revert-hammer Differential Revision: D30279364 (`b004307252`) Original commit changeset: c1ed77dfe43a fbshipit-source-id: eab50857675c51e0088391af06ec0ecb14e2347e	2021-08-12 11:45:01 -07:00
Zsolt Dollenstein	b004307252	[codemod][lint][fbcode/c*] Enable BLACK by default Test Plan: manual inspection & sandcastle Reviewed By: zertosh Differential Revision: D30279364 fbshipit-source-id: c1ed77dfe43a3bde358f92737cd5535ae5d13c9a	2021-08-12 10:58:35 -07:00
Rishi Puri	324673a537	rebase for autocast updates to include device_type and dtype flags (#61002 ) Summary: Fixes #{55374} https://github.com/pytorch/pytorch/issues/55374 Pull Request resolved: https://github.com/pytorch/pytorch/pull/61002 Reviewed By: malfet, mruberry Differential Revision: D30016812 Pulled By: ngimel fbshipit-source-id: 6e09a29f539d28e9aea5cd9489b1e633cc588033	2021-08-10 20:03:12 -07:00
Kevin Tse	4b47ea9446	adding a skip for ROCm for a flaky test (#62664 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62664 Skipping a test for ROCm because of issue #62602 Test Plan: Imported from OSS Reviewed By: soulitzer Differential Revision: D30079534 Pulled By: NivekT fbshipit-source-id: a9cf35e5d3a8d218edc9c5a704d1f9599d2f38a6	2021-08-04 07:29:06 -07:00
Michael Carilli	9fb6b40f3e	Makes a streaming backward test try gradient stealing more directly (#60065 ) Summary: Closes https://github.com/pytorch/pytorch/issues/59846. https://github.com/pytorch/pytorch/issues/59846 is likely paranoia, and some of the test_streaming_backward_* in test_cuda.py already use gradient stealing (ie, they start with `.grad`s as None before backward). Regardless, this PR augments one of the tests to stress gradient stealing a bit more directly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/60065 Reviewed By: mrshenli Differential Revision: D29779518 Pulled By: ngimel fbshipit-source-id: ccbf278543c3adebe5f4ba0365b1dace9a14da9b	2021-07-19 20:39:55 -07:00

914 Commits