pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
FFFrog	783be8f932	[Easy] Add more check for elapsedTime of torch.xxx.Event and torch.Event (#151404 ) As the title stated Changes: - Add record, query and enable_timing check - Add related tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/151404 Approved by: https://github.com/albanD	2025-04-18 15:26:13 +00:00
PyTorch MergeBot	1ce7969e81	Revert "[Easy] Add more check for elapsedTime of torch.xxx.Event and torch.Event (#151404 )" This reverts commit `90c5b86cd8`. Reverted https://github.com/pytorch/pytorch/pull/151404 on behalf of https://github.com/clee2000 due to broke a cpp extension test? test_cpp_extensions_stream_and_event.py::TestCppExtensionStreamAndEvent::test_stream_event [GH job link](https://github.com/pytorch/pytorch/actions/runs/14519277500/job/40736981315) [HUD commit link](`90c5b86cd8`), bad TD ([comment](https://github.com/pytorch/pytorch/pull/151404#issuecomment-2813649667))	2025-04-17 17:45:41 +00:00
FFFrog	90c5b86cd8	[Easy] Add more check for elapsedTime of torch.xxx.Event and torch.Event (#151404 ) As the title stated Changes: - Add record, query and enable_timing check - Add related tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/151404 Approved by: https://github.com/albanD	2025-04-17 15:30:12 +00:00
Jeff Daily	15768cc34b	add unit test for preferred_blas_library settings (#150581 ) Follow up to #150212 that was committed without a unit test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150581 Approved by: https://github.com/atalman, https://github.com/malfet Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com> Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-04-06 01:44:07 +00:00
PyTorch MergeBot	b0e28f60df	Revert "add unit test for preferred_blas_library settings (#150581 )" This reverts commit `781d28e265`. Reverted https://github.com/pytorch/pytorch/pull/150581 on behalf of https://github.com/clee2000 due to new test broken internally D72395624 ([comment](https://github.com/pytorch/pytorch/pull/150581#issuecomment-2777228731))	2025-04-03 23:51:49 +00:00
Jeff Daily	781d28e265	add unit test for preferred_blas_library settings (#150581 ) Follow up to #150212 that was committed without a unit test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150581 Approved by: https://github.com/atalman	2025-04-03 13:27:50 +00:00
Alexander Grund	350a479146	Fix test failures on non-x86 Linux (#148445 ) The cpp contexts are only supported on x86 Linux. The tests requiring them are skipped on non-Linux but not if the architecture is not x86. In most places it is checked for ARM64 which is not enough as a check for x86 is required instead. Fix the test decorators and factor out a common one in test_cuda. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148445 Approved by: https://github.com/eellison	2025-03-28 15:27:44 +00:00
Fuzzkatt	ce3dc9e346	add some extra test oom skips for jetson due to lacking nvml support (#149587 ) Add a couple of Jetson skips for oom tests in test/test_cuda.py due to failures in nvidia CI. Jetson not having full nvml support is a known issue so this is mostly a test side fix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149587 Approved by: https://github.com/eqy	2025-03-25 20:39:10 +00:00
Fuzzkatt	b562d22772	test/test_cuda.py: rework TEST_PYNVML logic to make more sense, add not IS_JETSON condition (#149578 ) PYNVML related tests in test/test_cuda.py are failing in nvidia internal CI for Jetson devices because Jetson devices don't fully support nvml (it exists as a stub library). In addition to skipping PYNVML tests for Jetson, this PR also reworks the TEST_PYNVML logic a bit to be more consistent with the rest of TEST_{something} conditions in test/test_cuda.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/149578 Approved by: https://github.com/janeyx99, https://github.com/eqy	2025-03-25 20:38:15 +00:00
Ding, Yi1	f7d1b966c2	[Inductor] Unify the data type propagation between Triton and CPP Backend (#146970 ) Fixes #144246 Use `DtypePropagationOpsHandler` for CSE variables of CPP backend. In addition, add static type checking for the generated CPP code similar to the `config.test_configs.runtime_triton_dtype_assert`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146970 Approved by: https://github.com/jgong5, https://github.com/eellison, https://github.com/leslie-fang-intel	2025-03-21 17:52:51 +00:00
albanD	68c12ecfe2	Move get accelerator to use build time flags when possible (#146098 ) This PR does two main things (they are in a single PR to show how the newly added APIs are used). - Add isBuilt and isAvailable APIs to the AcceleratorHook interface. See inline doc for their exact semantic - Use the newly added isBuilt for accelerator check to ensure it does not poison fork Pull Request resolved: https://github.com/pytorch/pytorch/pull/146098 Approved by: https://github.com/ngimel, https://github.com/malfet, https://github.com/EikanWang, https://github.com/jeromean Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>	2025-03-10 13:17:58 +00:00
PyTorch MergeBot	b246cd7b82	Revert "Move get accelerator to use build time flags when possible (#146098 )" This reverts commit `17302b4bc8`. Reverted https://github.com/pytorch/pytorch/pull/146098 on behalf of https://github.com/albanD due to Still fails with cuda build on a non-gpu machine ([comment](https://github.com/pytorch/pytorch/pull/146098#issuecomment-2707191770))	2025-03-07 18:59:58 +00:00
albanD	17302b4bc8	Move get accelerator to use build time flags when possible (#146098 ) This PR does two main things (they are in a single PR to show how the newly added APIs are used). - Add isBuilt and isAvailable APIs to the AcceleratorHook interface. See inline doc for their exact semantic - Use the newly added isBuilt for accelerator check to ensure it does not poison fork Pull Request resolved: https://github.com/pytorch/pytorch/pull/146098 Approved by: https://github.com/ngimel, https://github.com/malfet, https://github.com/EikanWang, https://github.com/jeromean Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>	2025-03-07 15:19:34 +00:00
Syed Tousif Ahmed	5f392ae560	Throws error when using torch.cuda.MemPool with expandable segments (#148378 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148378 Approved by: https://github.com/ngimel, https://github.com/eqy ghstack dependencies: #148374	2025-03-07 05:22:03 +00:00
Marko Radmilac	c65ee728f0	Initial implementation of host memory stats (#147660 ) This is an initial attempt to provide some statistics for the pinned host memory allocations flowing through CachingHostAllocator. Many times in the past we have had inexplicable slowdowns that would be much easier to diagnose if we had some host memory characteristics. This change tries very hard not to disrupt the initial design of the allocator, and it uses existing locking mechanism, whenever possible, to gather statistics "for free". Only deviation from that is on the "slow path" where we incur CUDA calls anyway, so taking a short lock is not going to hurt the performance much, especially in the steady state where most allocations will come from cache. As mentioned before, this is the first PR, to introduce the concept and to see if it fits the right paradigm. We can always add more later. Metrics that would require more involved changes to the code base and locks, like requested memory, have been punted for now. I also tried to reuse the Stat structure used in CUDA caching allocator, in order to maintain symmetry. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147660 Approved by: https://github.com/ngimel	2025-03-05 16:13:19 +00:00
cyy	ec2805ada8	Remove outdated CUDA version check (#148142 ) Since Torch requires CUDA>=11, some checks can be removed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148142 Approved by: https://github.com/janeyx99, https://github.com/eqy	2025-03-04 03:33:44 +00:00
PyTorch MergeBot	a983b2b11a	Revert "Initial implementation of host memory stats (#147660 )" This reverts commit `945e359fc1`. Reverted https://github.com/pytorch/pytorch/pull/147660 on behalf of https://github.com/mradmila due to There is an issue with ambiguous definition of Stat structure when different C++ tools are used. Backing out for now. ([comment](https://github.com/pytorch/pytorch/pull/147660#issuecomment-2692346379))	2025-03-01 18:05:45 +00:00
Fuzzkatt	493cd97af5	add skips to test_notifies_oom and test_set_per_process_memory_fraction (#148134 ) Tests fail in NVIDIA internal CI since we do not support nvml on Jetson, but nvml is required for OOM reporting to work properly, so we are skipping the failing tests for now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148134 Approved by: https://github.com/eqy	2025-03-01 02:59:48 +00:00
Marko Radmilac	945e359fc1	Initial implementation of host memory stats (#147660 ) This is an initial attempt to provide some statistics for the pinned host memory allocations flowing through CachingHostAllocator. Many times in the past we have had inexplicable slowdowns that would be much easier to diagnose if we had some host memory characteristics. This change tries very hard not to disrupt the initial design of the allocator, and it uses existing locking mechanism, whenever possible, to gather statistics "for free". Only deviation from that is on the "slow path" where we incur CUDA calls anyway, so taking a short lock is not going to hurt the performance much, especially in the steady state where most allocations will come from cache. As mentioned before, this is the first PR, to introduce the concept and to see if it fits the right paradigm. We can always add more later. Metrics that would require more involved changes to the code base and locks, like requested memory, have been punted for now. I also tried to reuse the Stat structure used in CUDA caching allocator, in order to maintain symmetry. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147660 Approved by: https://github.com/ngimel	2025-02-28 18:36:44 +00:00
cyy	b0dfd242fa	Remove NO_MULTIPROCESSING_SPAWN checks (#146705 ) py 3.9 has spawn. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146705 Approved by: https://github.com/colesbury	2025-02-28 05:53:19 +00:00
PyTorch MergeBot	926b7b5027	Revert "Remove NO_MULTIPROCESSING_SPAWN checks (#146705 )" This reverts commit `40ad5e01df`. Reverted https://github.com/pytorch/pytorch/pull/146705 on behalf of https://github.com/cyyever due to Broke lint?, I guess land race with rufff update ([comment](https://github.com/pytorch/pytorch/pull/146705#issuecomment-2689603077))	2025-02-28 03:04:38 +00:00
cyyever	40ad5e01df	Remove NO_MULTIPROCESSING_SPAWN checks (#146705 ) py 3.9 has spawn. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146705 Approved by: https://github.com/colesbury	2025-02-28 00:15:32 +00:00
Jagadish Krishnamoorthy	0ea5d1067b	ROCm: Remove static specifier for allow_tf32 variable. (#147186 ) Since the env variable HIPBLASLT_ALLOW_TF32 can change, remove static type for allow_tf32 variable so that it captures the current value of env variable HIPBLASLT_ALLOW_TF32. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147186 Approved by: https://github.com/jeffdaily, https://github.com/naromero77amd	2025-02-26 18:24:02 +00:00
Bo Li	de80b6f0d3	Updated test_cuda.py to rerun tests (#147040 ) Initially test_cuda::TestCudaMallocAsync::test_clock_speed and test_cuda::TestCudaMallocAsync::test_power_draw are skipped in this [commit](`d4871750d9`). Pulled ROCm nightly image and verified these two tests run fine locally. Filed this PR to enable them. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147040 Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily	2025-02-25 19:58:42 +00:00
PyTorch MergeBot	fb73b0c7c5	Revert "use copy2d in h2d/d2h copy when possible (#146256 )" This reverts commit `0bc036a9e9`. Reverted https://github.com/pytorch/pytorch/pull/146256 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/146256#issuecomment-2680868627))	2025-02-25 07:06:38 +00:00
Mikayla Gawarecki	e8fbc86de0	Make torch.cuda.gds APIs public (#147120 ) Follow up to https://github.com/pytorch/pytorch/pull/145748 that turned USE_CUFILE on for CUDA 12.6 and 12.8 binaries Pull Request resolved: https://github.com/pytorch/pytorch/pull/147120 Approved by: https://github.com/albanD	2025-02-14 17:06:50 +00:00
Mikayla Gawarecki	861bf892fb	Set USE_CUFILE=1 by default and add pypi package to binary build matrix (#145748 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145748 Approved by: https://github.com/atalman	2025-02-11 15:49:01 +00:00
Eddie Yan	9ee506bd93	[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 ) Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs... Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441 Approved by: https://github.com/Chillee, https://github.com/malfet	2025-02-06 19:04:50 +00:00
PyTorch MergeBot	f27220e32a	Revert "Move get accelerator to use build time flags when possible (#146098 )" This reverts commit `157d81c201`. Reverted https://github.com/pytorch/pytorch/pull/146098 on behalf of https://github.com/atalman due to Failing internally, sorry need to revert ([comment](https://github.com/pytorch/pytorch/pull/146098#issuecomment-2637443675))	2025-02-05 16:39:37 +00:00
albanD	157d81c201	Move get accelerator to use build time flags when possible (#146098 ) This PR does two main things (they are in a single PR to show how the newly added APIs are used). - Add isBuilt and isAvailable APIs to the AcceleratorHook interface. See inline doc for their exact semantic - Use the newly added isBuilt for accelerator check to ensure it does not poison fork Pull Request resolved: https://github.com/pytorch/pytorch/pull/146098 Approved by: https://github.com/ngimel, https://github.com/malfet, https://github.com/EikanWang Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>	2025-02-04 18:23:24 +00:00
Natalia Gimelshein	0bc036a9e9	use copy2d in h2d/d2h copy when possible (#146256 ) A rewrite of #138964 In addition to rewriting the conditions for using copy2d, this PR fixes a few other problems with #138964: 1) gpu-gpu copies when peer access is disabled shouldn't rely on copy2d 2) copy2d should record even for the host pinned memory, like the regular copy does 3) copy2d shouldn't pretend that it's synchronizing (for the purposes of cuda sanitizer tracer) when it's non-blocking In this PR copy2d behaves in exactly the same way as copy does wrt to those additional syncs, except it calls a different underlying cuda call. Tests for multiple cases going through copy2d and avoiding copy2d pattern due to unsatisfied conditions are added. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/146256 Approved by: https://github.com/eqy, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-02-03 23:07:54 +00:00
PyTorch MergeBot	c39c679813	Revert "Tensor .cuda() very slow with specific array sizes (#138964 )" This reverts commit `98f87edd23`. Reverted https://github.com/pytorch/pytorch/pull/138964 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but some slow test start failing after this lands ([comment](https://github.com/pytorch/pytorch/pull/138964#issuecomment-2628455198))	2025-01-31 21:48:51 +00:00
PyTorch MergeBot	c3f71eb61b	Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 )" This reverts commit `e2917245fb`. Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/ZainRizvi due to Sorry but this still fails internally with the same error. @Chillee or @malfet, can you please help the change get tested? (See D68783351) ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2627886999))	2025-01-31 17:43:09 +00:00
Donald Tolley	98f87edd23	Tensor .cuda() very slow with specific array sizes (#138964 ) ### Pull Request: Optimized Non-Contiguous Tensor Copy for CPU to GPU in PyTorch #### Summary This PR addresses the performance issue identified in [#111570](https://github.com/pytorch/pytorch/issues/111570), where non-contiguous tensors took significantly longer to transfer from CPU to GPU. Through detailed tracing of the call flow, we identified that PyTorch was creating temporary contiguous buffers for non-contiguous tensor transfers, which introduced unnecessary overhead. #### Tracing the Issue To pinpoint the cause of the slowdown, we followed the call flow from Python’s `tensor.cuda()` method through PyTorch’s backend, ultimately identifying `copy_kernel_cuda` as the key function responsible for CPU-to-GPU tensor transfers. Here’s a summary of the tracing process: 1. Python Call: `tensor.cuda()` - Starting from Python, the `cuda()` method initiates the tensor transfer to the GPU. 2. `TensorBody.h: cuda()` - The `cuda()` method calls `to()`, specifying the target device as CUDA. 3. `Tensor.cpp: TensorBase::to()` - The `to()` function prepares device and data type options before invoking `_ops::to_dtype_layout::call()`. 4. Operator Call: `_ops::to_dtype_layout::call()` - This operator dispatches the request to the backend-specific function responsible for managing the transfer. 5. `Copy.cpp: copy_()` - The `copy_()` function performs preliminary checks (e.g., zero-tensor immutability) and proceeds to call `copy_impl()`. 6. `Copy.cpp: copy_impl()` - This function sets up a tensor iterator and dispatches the copy operation to the appropriate backend through `copy_stub`. 7. Dispatch to CUDA: `copy_stub` - The dispatch mechanism routes the call to the CUDA-specific function, `copy_kernel_cuda`. 8. `Copy.cu: copy_kernel_cuda()` - Here, we identified that PyTorch was creating temporary contiguous buffers for 1D and 2D non-contiguous tensors, which slowed down the copy process. This behavior is managed by the `copy_requires_temporaries()` function. #### Solution To address this, we modified `copy_kernel_cuda` to handle non-contiguous 1D and 2D tensors directly by using `cudaMemcpy2DAsync`, which allows efficient, stride-aware memory transfers without temporary buffers. Here’s why this approach improves performance: - Efficiency of `cudaMemcpy2DAsync`: This CUDA function is optimized for pitched (stride-based) memory transfers, allowing it to handle non-contiguous data layouts effectively by specifying memory strides for source and destination tensors. - Reduction of Overhead: By directly copying non-contiguous tensors without intermediate buffers, we eliminate extra memory allocation and achieve faster CPU-to-GPU transfers. - Asynchronous Execution: `cudaMemcpy2DAsync` enables asynchronous transfer on the CUDA stream, further improving performance by taking advantage of CUDA's optimized memory handling for non-contiguous layouts. #### Performance Results In my testing, I created tensors of size `327680 x 2000` and used slices for transfer performance measurements. The tests show that the average time for transferring a non-contiguous slice (e.g., rows 10,000 to 50,000) from CPU to GPU now closely matches the contiguous case. This improvement indicates that the updated implementation effectively addresses the performance discrepancy. Below are the measured times and validation checks: ```plaintext Average time for contiguous slice (rows 10,000-50,000): 66 ms Average time for non-contiguous slice (rows 10,000-50,000): 66 ms Validation of contiguous and non-contiguous tensor copies: ✅ PASS: Tensor shapes match. ✅ PASS: Tensor contiguity matches. ✅ PASS: Tensor contents match. ✅ PASS: Tensor data types match. ✅ Success: Both contiguous and non-contiguous tensors were copied correctly to the GPU. ``` #### Conclusion This PR resolves the identified performance issue by eliminating the need for temporary buffers in non-contiguous 1D and 2D tensor transfers, ensuring faster and more efficient copies from CPU to GPU. Future optimizations could further enhance performance for higher-dimensional non-contiguous tensors. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138964 Approved by: https://github.com/jeffdaily Co-authored-by: Natalia Gimelshein <ngimel@gmail.com> Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-01-31 17:05:02 +00:00
Eddie Yan	e2917245fb	[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 ) Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs... Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441 Approved by: https://github.com/Chillee, https://github.com/malfet	2025-01-30 22:33:50 +00:00
Natalia Gimelshein	08ff11e9d0	initialize device when pinning memory on this device, short circuit i… (#145752 ) …s_pinned if device is not initialized Do not land RFC potential fix for #144687 Now `.is_pinned(device="cuda")` does not initialize device and thus doesn't poison the fork (but it complains about `device` arg being deprecated). To not need `device=` arg we'd need to fix get_accelerator to not initialize device. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145752 Approved by: https://github.com/albanD Co-authored-by: albanD <albandes@fb.com>	2025-01-30 21:37:29 +00:00
Dmitry Nikolaev	6967ef1b07	[ROCm] fix test_cublas_workspace_explicit_allocation for gfx12 (#145227 ) gfx12 passes the condition `torch.cuda.get_device_capability() >= (9, 4)` and uses `default_workspace_size=128MB`, but it required only for MI300 Fix condition to use `("gfx94" in gcn_arch)` instead of `torch.cuda.get_device_properties()` to detect MI300. Now `default_workspace_size=32MB` is used for gfx12 and the test passes Pull Request resolved: https://github.com/pytorch/pytorch/pull/145227 Approved by: https://github.com/jeffdaily, https://github.com/eqy	2025-01-28 16:19:27 +00:00
PyTorch MergeBot	c986eba560	Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 )" This reverts commit `abf28982a8`. Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/ZainRizvi due to Sorry but this is failing internally. @Chillee can you please help change get remerged? See D68720562 ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2616726406))	2025-01-27 19:38:26 +00:00
Eddie Yan	abf28982a8	[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 ) Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs... Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441 Approved by: https://github.com/Chillee	2025-01-27 18:05:23 +00:00
FEI	615bdd9c81	Improve the caching allocator test for raw alloc (#145269 ) 1 Prevent block allocated by torch._C._cuda_cudaCachingAllocator_raw_alloc from affecting torch.cuda.empty_cache() in other unit tests 2 Additionally, tested the changes to raw_delete in https://github.com/pytorch/pytorch/pull/131114 @jeffdaily @albanD @houseroad @eqy @aaronenyeshi Pull Request resolved: https://github.com/pytorch/pytorch/pull/145269 Approved by: https://github.com/albanD, https://github.com/eqy, https://github.com/jeffdaily	2025-01-24 21:07:17 +00:00
PyTorch MergeBot	dad9bc3461	Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 )" This reverts commit `de945d78da`. Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/izaitsevfb due to unused variables again :( ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2611182461))	2025-01-23 22:59:25 +00:00
Eddie Yan	de945d78da	[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 ) Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs... Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441 Approved by: https://github.com/Chillee	2025-01-22 22:42:48 +00:00
PyTorch MergeBot	4ea189422d	Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 )" This reverts commit `a6763b7b81`. Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/kit1980 due to breaking internal builds: unused variable 'halpha' ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2596895865))	2025-01-16 21:12:41 +00:00
eqy	a6763b7b81	[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 ) Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs... Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441 Approved by: https://github.com/Chillee	2025-01-15 18:37:55 +00:00
PyTorch MergeBot	64bcf39180	Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 )" This reverts commit `388b75edec`. Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/kit1980 due to breaking internal builds: unused variable 'halpha' ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2588517060))	2025-01-14 00:48:28 +00:00
eqy	388b75edec	[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 ) Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs... Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441 Approved by: https://github.com/Chillee	2025-01-11 15:30:38 +00:00
PyTorch MergeBot	b80ecc4457	Revert "Fix poision child process issue when call getAccelerator() (#144368 )" This reverts commit `2583d831d4`. Reverted https://github.com/pytorch/pytorch/pull/144368 on behalf of https://github.com/clee2000 due to broke internal tests D68023262, probably the same problem as noted in the issue this PR is mentioned above ([comment](https://github.com/pytorch/pytorch/pull/144368#issuecomment-2584848568))	2025-01-10 23:36:43 +00:00
Yu, Guangye	2583d831d4	Fix poision child process issue when call getAccelerator() (#144368 ) # Motivation fix https://github.com/pytorch/pytorch/issues/144152 # Solution - Align `at::globalContext()::hasXXX` to determine if accelerator XXX is built with PyTorch or an extension already registered to PyTorch. - Define `at::hasXXX` to determine if accelerator XXX is available at runtime. - Use `at::globalContext()::hasXXX` in `getAccelerator` rather than `at::hasXXX` to avoid initializing the XXX runtime (which can poison child processes) while detecting the current accelerator. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144368 Approved by: https://github.com/albanD, https://github.com/atalman, https://github.com/gujinghui	2025-01-10 09:28:27 +00:00
Yu, Guangye	6de110b862	Support with statement on torch.Stream (#140138 ) # Motivation We propose to support Python with statement on `torch.Stream`. This is a benefit for all accelerators when writing device-agnostic code. The device-specific stream will also be supported because they are generally derived from `torch.Stream`. With this PR, we can do like this ```python s1= torch.Stream() # Set s1 to the current stream torch.accelerator.set_stream(s1) with torch.Stream() as s2: # Inside with statement, we set s2 to the current stream assert torch.accelerator.current_stream() == s2 # Here the current stream should be s1 assert torch.accelerator.current_stream() == s1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/140138 Approved by: https://github.com/albanD	2025-01-10 02:05:19 +00:00
Dmitry Nikolaev	d4871750d9	[ROCm] Enable post-merge trunk workflow on MI300 runners; skip and fix MI300 related failed tests (#143673 ) This PR * makes changes to the workflow files and scripts so we can run CI workflows on the MI300 runners * skips and fixes several tests, failed on MI300, observed in https://github.com/pytorch/pytorch/pull/140989 Skipped due to unsupported Float8_e4m3fn data type on MI300 (need to update test code to use datatypes supported by MI300): - distributed.tensor.parallel.test_micro_pipeline_tp.py::MicroPipelineTPTest::test_fuse_all_gather_scaled_matmul_A_dims_\_gather_dim_\ (24 tests across inductor/distributed configs) - distributed.tensor.parallel.test_micro_pipeline_tp.py::test_fuse_scaled_matmul_reduce_scatter_A_dims_\_scatter_dim_\ (12 tests across inductor/distributed configs)) - inductor.test_loop_ordering::LoopOrderingTest::test_fp8_cast_and_t - inductor.test_loop_ordering::LoopOrderingTest::test_fp8_pattern_2 Skipped due to AssertionError on MI300: - inductor.test_mkldnn_pattern_matcher.py::test_qconv2d_int8_mixed_bf16 - distributed._tools.test_sac_ilp::TestSACILP::test_sac_ilp_case1 Skipped: - test_cuda.py::TestCudaMallocAsync::test_clock_speed - test_cuda.py::TestCudaMallocAsync::test_power_draw - test_torch.py::TestTorchDeviceTypeCUDA::test_deterministic_cumsum_cuda Skipped flaky tests on MI300: - distributed.test_c10d_gloo.py::ProcessGroupGlooTest::test_gather_stress_cuda - inductor.test_cpu_repro::CPUReproTests::test_lstm_packed_unbatched_False* (256 tests) Fixed: - test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_float8_basics_cuda Features: - inductor/test_fp8.py - declare a new function to convert FP8 datatypes to ROCm supported FP8 datatypes. It keeps test names for CUDA and ROCm and allows to enable Inductor FP8 tests on CPU Pull Request resolved: https://github.com/pytorch/pytorch/pull/143673 Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/pruthvistony Co-authored-by: saienduri <saimanas.enduri@amd.com> Co-authored-by: Jithun Nair <jithun.nair@amd.com> Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-01-09 05:18:57 +00:00
Xiaodong Wang	3d3a07963f	[reland][attempt2][AMD] Turn on TF32 for aten::mm (#144145 ) Summary: https://github.com/pytorch/pytorch/pull/143549 was reverted due to some internal/oss tooling issue. Relanding. hipblaslt supports TF32, so adding the support. Original PR https://github.com/pytorch/pytorch/pull/139869 Test Plan: CI Differential Revision: D67785496 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144145 Approved by: https://github.com/jianyuh	2025-01-06 00:37:01 +00:00
cyy	df458be4e5	[4/N] Apply py39 ruff and pyupgrade fixes (#143257 ) ```torch/fx/passes/annotate_getitem_nodes.py``` was changed to support the new type hinting annotations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143257 Approved by: https://github.com/justinchuby, https://github.com/albanD	2025-01-04 10:47:51 +00:00
Tal Ben-Nun	c0d710634f	Respect ROCR_VISIBLE_DEVICES on AMD GPU device discovery (#142292 ) Reland of #140320 after failing test on trunk. Fixes potential environment clobbering in test, makes ROCr+HIP devices (if specified together) more robust to index errors. Fixes #140318 Pull Request resolved: https://github.com/pytorch/pytorch/pull/142292 Approved by: https://github.com/jataylo, https://github.com/huydhn, https://github.com/jeffdaily Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com> Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2024-12-25 02:37:11 +00:00
PyTorch MergeBot	448c16ac87	Revert "[reland][AMD] Turn on TF32 for aten::mm (#143549 )" This reverts commit `41cdc7f735`. Reverted https://github.com/pytorch/pytorch/pull/143549 on behalf of https://github.com/malfet due to It breaks ROCM testing, see `06b4b96b34/1` ([comment](https://github.com/pytorch/pytorch/pull/143549#issuecomment-2559016960))	2024-12-23 06:47:36 +00:00
Yu, Guangye	07fa6e2c8b	Fix torch.accelerator api abort when passing invaild device (#143550 ) # Motivation Fix https://github.com/pytorch/pytorch/issues/143543 # Solution We should raise python exception instead of aborting... # Additional Context without this PR: ```python >>> import torch >>> torch.accelerator.current_stream(torch.accelerator.device_count()) terminate called after throwing an instance of 'c10::Error' what(): device is out of range, device is 2, total number of device is 2. Exception raised from check_device_index at /home/dvrogozh/git/pytorch/pytorch/c10/xpu/XPUFunctions.h:36 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xac (0x7f30707eb95c in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so) frame #1: c10::detail::torchCheckFail(char const, char const, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xf3 (0x7f307078fc57 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so) frame #2: <unknown function> + 0x19a3e (0x7f3070c2ba3e in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so) frame #3: c10::xpu::getCurrentXPUStream(signed char) + 0x2f (0x7f3070c2c83f in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so) frame #4: <unknown function> + 0x1ca35 (0x7f3070c2ea35 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so) frame #5: <unknown function> + 0x653f15 (0x7f3083391f15 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so) frame #6: <unknown function> + 0x39e5f2 (0x7f30830dc5f2 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so) <omitting python frames> frame #20: <unknown function> + 0x29d90 (0x7f308b19bd90 in /lib/x86_64-linux-gnu/libc.so.6) frame #21: __libc_start_main + 0x80 (0x7f308b19be40 in /lib/x86_64-linux-gnu/libc.so.6) Aborted (core dumped) ``` with this PR: ```python >>> import torch >>> torch.accelerator.current_stream(torch.accelerator.device_count()) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/pt-gpu/4T-4652/guangyey/stock-pytorch/torch/accelerator/__init__.py", line 123, in current_stream return torch._C._accelerator_getStream(device_index) RuntimeError: The device index is out of range. It must be in [0, 2), but got 2. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143550 Approved by: https://github.com/EikanWang, https://github.com/dvrogozh, https://github.com/albanD	2024-12-23 03:44:22 +00:00
Xiaodong Wang	41cdc7f735	[reland][AMD] Turn on TF32 for aten::mm (#143549 ) Summary: hipblaslt supports TF32, so adding the support. Original PR https://github.com/pytorch/pytorch/pull/139869 Test Plan: CI Differential Revision: D67431681 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143549 Approved by: https://github.com/eqy	2024-12-22 21:05:05 +00:00
Tom Ritchford	d8c8ba2440	Fix unused Python variables in test/[e-z]* (#136964 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136964 Approved by: https://github.com/justinchuby, https://github.com/albanD	2024-12-18 23:02:30 +00:00
PyTorch MergeBot	7ab3177776	Revert "[AMD] Turn on TF32 for aten::mm (#139869 )" This reverts commit `e0bdae7884`. Reverted https://github.com/pytorch/pytorch/pull/139869 on behalf of https://github.com/jeffdaily due to causing ROCm CI failures, need to investigate, revert for now ([comment](https://github.com/pytorch/pytorch/pull/139869#issuecomment-2546127069))	2024-12-16 16:46:48 +00:00
Yu, Guangye	45ac4ebf15	[RELAND] Add UTs for accelerator device-agnostic runtime APIs (#133572 ) # Motivation This PR intends to add UTs for accelerator device-agnostic APIs. # Additional Context This PR is relanded. It is reverted because `torch.Event` doesn't support mps backend. We have fixed it in https://github.com/pytorch/pytorch/pull/142468. The previous commit is `952514f0c8` Pull Request resolved: https://github.com/pytorch/pytorch/pull/133572 Approved by: https://github.com/EikanWang, https://github.com/albanD ghstack dependencies: #143171	2024-12-16 02:18:41 +00:00
Xiaodong Wang	e0bdae7884	[AMD] Turn on TF32 for aten::mm (#139869 ) Summary: hipblaslt supports TF32, so adding the support. Test Plan: CI Differential Revision: D65435392 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139869 Approved by: https://github.com/leitian	2024-12-15 10:02:29 +00:00
Eddie Yan	0d6d29af38	[CUDA] Follow up to clean up some `set_per_process_memory_fraction` usage in tests (#142811 ) follow-up to #140852 now that #140620 has landed Pull Request resolved: https://github.com/pytorch/pytorch/pull/142811 Approved by: https://github.com/Skylion007	2024-12-13 21:09:05 +00:00
PyTorch MergeBot	1b3f8b7589	Revert "[RELAND] Add UTs for accelerator device-agnostic runtime APIs (#133572 )" This reverts commit `2091194249`. Reverted https://github.com/pytorch/pytorch/pull/133572 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the new test is still very flaky on MacOS even when it does not segfault anymore ([comment](https://github.com/pytorch/pytorch/pull/133572#issuecomment-2537256522))	2024-12-11 21:47:18 +00:00
Yu, Guangye	2091194249	[RELAND] Add UTs for accelerator device-agnostic runtime APIs (#133572 ) # Motivation This PR intends to add UTs for accelerator device-agnostic APIs. # Additional Context This PR is relanded. It is reverted because `torch.Event` doesn't support mps backend. We have fixed it in https://github.com/pytorch/pytorch/pull/142468. The previous commit is `952514f0c8` Pull Request resolved: https://github.com/pytorch/pytorch/pull/133572 Approved by: https://github.com/EikanWang, https://github.com/albanD ghstack dependencies: #142468	2024-12-11 02:04:52 +00:00
PyTorch MergeBot	a1c6cf7e9f	Revert "Add UTs for accelerator device-agnostic runtime APIs (#133572 )" This reverts commit `952514f0c8`. Reverted https://github.com/pytorch/pytorch/pull/133572 on behalf of https://github.com/malfet due to Sorry for reverting your PR, but it segfaults on MacOS ([comment](https://github.com/pytorch/pytorch/pull/133572#issuecomment-2530354401))	2024-12-10 04:42:55 +00:00
Yu, Guangye	952514f0c8	Add UTs for accelerator device-agnostic runtime APIs (#133572 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133572 Approved by: https://github.com/EikanWang, https://github.com/albanD	2024-12-07 13:14:10 +00:00
PyTorch MergeBot	40d1b5f490	Revert "Respect ROCR_VISIBLE_DEVICES on AMD GPU device discovery (#140320 )" This reverts commit `add4a42ea2`. Reverted https://github.com/pytorch/pytorch/pull/140320 on behalf of https://github.com/huydhn due to Sorry for reverting your change but test_hip_device_count is failing in trunk after this land ([comment](https://github.com/pytorch/pytorch/pull/140320#issuecomment-2524742845))	2024-12-07 01:28:51 +00:00
eqy	0a619a212f	[CUDA] Cleanup per-process-memory-fraction in `test_cuda.py` tests (#140852 ) Otherwise certain sequences of tests will fail with OOM e.g., ``` # python test/test_cuda.py -k max_split_expandable -k test_assigning_back_deleter_fns_to_tensor --repeat 100 .. ---------------------------------------------------------------------- Ran 2 tests in 0.311s OK E. ====================================================================== ERROR: test_assigning_back_deleter_fns_to_tensor (__main__.TestBlockStateAbsorption.test_assigning_back_deleter_fns_to_tensor) ---------------------------------------------------------------------- Traceback (most recent call last): File "/workspace/pytorch/torch/testing/_internal/common_utils.py", line 3058, in wrapper method(args, kwargs) File "/workspace/pytorch/test/test_cuda.py", line 4320, in test_assigning_back_deleter_fns_to_tensor graph, outputs = cudagraphify(foo, [inp]) ^^^^^^^^^^^^^^^^^^^^^^^^ File "/workspace/pytorch/test/test_cuda.py", line 4080, in cudagraphify fn(inputs) File "/workspace/pytorch/test/test_cuda.py", line 4316, in foo int8_cuda(LARGE_BUFFER) + x, ~~~~~~~~~~~~~~~~~~~~~~~~^~~ torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 160.00 MiB. GPU 0 has a total capacity of 31.73 GiB of which 31.30 GiB is free. Process 2916661 has 442.00 MiB memory in use. 120.00 MiB allowed; Of the allocated memory 52.00 MiB is allocated by PyTorch, and 6.00 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) To execute this test, run the following from the base repo dir: python test/test_cuda.py TestBlockStateAbsorption.test_assigning_back_deleter_fns_to_tensor This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 ---------------------------------------------------------------------- Ran 2 tests in 0.136s FAILED (errors=1) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/140852 Approved by: https://github.com/Skylion007	2024-12-06 21:26:54 +00:00
Tal Ben-Nun	add4a42ea2	Respect ROCR_VISIBLE_DEVICES on AMD GPU device discovery (#140320 ) Fixes #140318 Pull Request resolved: https://github.com/pytorch/pytorch/pull/140320 Approved by: https://github.com/eqy, https://github.com/jithunnair-amd, https://github.com/jataylo, https://github.com/jeffdaily Co-authored-by: Jack Taylor <jack.taylor@amd.com>	2024-12-06 20:09:56 +00:00
Benjamin Glass	4959784dac	Add API query for available per-process CUDA memory (#140620 ) Certain `cpp_wrapper`-enabled tests were OOM-ing in the CI pipeline, with error messages suggesting that sufficient memory was accessible. This ultimately resulted from an internal memory limitation that was not queryable in the API. This PR adds querying for that limit. Additionally, the failing tests had incorrect memory availability checks, and are updated with measured memory requirements. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140620 Approved by: https://github.com/malfet, https://github.com/eqy ghstack dependencies: #141367	2024-12-03 00:24:03 +00:00
eqy	9532589b53	[CUDA][64-bit indexing] Support 64-bit indexing in `distribution_elementwise_grid_stride_kernel` (#141613 ) For #141544 Overhead doesn't seem to be noticeable even on small sizes (e.g., 2**10 elements) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141613 Approved by: https://github.com/Skylion007, https://github.com/ngimel	2024-11-30 06:55:02 +00:00
Yu Guo	808da50c2d	create a new torch.cuda.device_memory_used api (#140870 ) Summary: the current torch.cuda.memory_usage returns the memory utilization, more specifically, percent of time over the past sample period global memory being read/written for Nvidia. see more details in https://github.com/pytorch/pytorch/issues/140638 Test Plan: added a new unittest Differential Revision: D65960134 Pull Request resolved: https://github.com/pytorch/pytorch/pull/140870 Approved by: https://github.com/ngimel, https://github.com/eqy	2024-11-19 06:36:30 +00:00
PyTorch MergeBot	43de32d948	Revert "create a new torch.cuda.device_memory_used api (#140870 )" This reverts commit `478204cad6`. Reverted https://github.com/pytorch/pytorch/pull/140870 on behalf of https://github.com/yuguo68 due to the test is still flaky on ROCm, test_cuda.py::TestCudaMallocAsync is not skipped with the unittest.skipIf(TEST_CUDAMALLOCASYNC ([comment](https://github.com/pytorch/pytorch/pull/140870#issuecomment-2484161914))	2024-11-18 21:26:25 +00:00
Yu Guo	478204cad6	create a new torch.cuda.device_memory_used api (#140870 ) Summary: the current torch.cuda.memory_usage returns the memory utilization, more specifically, percent of time over the past sample period global memory being read/written for Nvidia. see more details in https://github.com/pytorch/pytorch/issues/140638 Test Plan: added a new unittest Differential Revision: D65960134 Pull Request resolved: https://github.com/pytorch/pytorch/pull/140870 Approved by: https://github.com/ngimel	2024-11-18 19:13:43 +00:00
Michael Lazos	1fd4757fdc	Support tensor betas in Adam and AdamW (#134171 ) Adds support for beta1 and beta2 to be wrapped in tensor for Adam and AdamW. Fixes https://github.com/pytorch/pytorch/issues/133898 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134171 Approved by: https://github.com/janeyx99	2024-11-15 21:55:55 +00:00
PyTorch MergeBot	03b7ec9237	Revert "create a new torch.cuda.memory_usage_in_bytes api (#140719 )" This reverts commit `9febc47637`. Reverted https://github.com/pytorch/pytorch/pull/140719 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the test is flaky on ROCm ([comment](https://github.com/pytorch/pytorch/pull/140719#issuecomment-2479832082))	2024-11-15 20:05:32 +00:00
Yu Guo	9febc47637	create a new torch.cuda.memory_usage_in_bytes api (#140719 ) Summary: the current torch.cuda.memory_usage returns the memory utilization, more specifically, percent of time over the past sample period global memory being read/written for Nvidia. see more details in https://github.com/pytorch/pytorch/issues/140638 Test Plan: added a new unittest Differential Revision: D65928031 Pull Request resolved: https://github.com/pytorch/pytorch/pull/140719 Approved by: https://github.com/xw285cornell, https://github.com/hongxiayang	2024-11-15 05:59:40 +00:00
Syed Tousif Ahmed	341a28f0ce	Refactors empty_cache to return only MemPool memory to the system (#133602 ) Canonically, the empty_cache API releases all cached blocks of the CUDACachingAllocator. There is no API that can release only the cached blocks of a given pool. In this PR, we extend the functionality of empty_cache API such that it only releases the cached blocks of an active pool. When empty_cache API is called under a MemPoolContext, we only release the cached blocks that correspond to the pool id of the active pool. Part of https://github.com/pytorch/pytorch/issues/124807. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133602 Approved by: https://github.com/ezyang	2024-10-29 23:58:44 +00:00
Jeff Daily	7c7b2d89ba	[ROCm] set hipblas workspace (#138791 ) Fixes #138532. This brings hipblas behavior in line with cublas behavior with respect to setting the workspace to an allocation from the caching allocator as well as the env var HIPBLAS_WORKSPACE_CONFIG. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138791 Approved by: https://github.com/naromero77amd, https://github.com/eqy, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-10-29 01:37:55 +00:00
Syed Tousif Ahmed	1637a40796	Adds snapshot API for MemPools to get pool memory segments (#133601 ) Canonically, the snapshot API returns the entire memory state of the CUDACachingAllocator (using `get_all_blocks`). There is no API that can only return the memory state of a given pool. In this PR, we extend the functionality of snapshot API such that it can only return the memory addresses of an active pool. When snapshot API is called under a MemPoolContext, we only return the blocks that correspond to the pool id of the active pool. Part of https://github.com/pytorch/pytorch/issues/124807. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133601 Approved by: https://github.com/ezyang	2024-10-29 01:01:47 +00:00
PyTorch MergeBot	3b0f39336c	Revert "Adds snapshot API for MemPools to get pool memory segments (#133601 )" This reverts commit `00504aa6b8`. Reverted https://github.com/pytorch/pytorch/pull/133601 on behalf of https://github.com/wdvr due to reverting for now as this breaks lots of internal tests. Details below ([comment](https://github.com/pytorch/pytorch/pull/133601#issuecomment-2441864871))	2024-10-28 15:12:20 +00:00
Syed Tousif Ahmed	00504aa6b8	Adds snapshot API for MemPools to get pool memory segments (#133601 ) Canonically, the snapshot API returns the entire memory state of the CUDACachingAllocator (using `get_all_blocks`). There is no API that can only return the memory state of a given pool. In this PR, we extend the functionality of snapshot API such that it can only return the memory addresses of an active pool. When snapshot API is called under a MemPoolContext, we only return the blocks that correspond to the pool id of the active pool. Part of https://github.com/pytorch/pytorch/issues/124807. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133601 Approved by: https://github.com/ezyang	2024-10-26 03:34:59 +00:00
Kiuk Chung	940658405b	[test/test_cuda] Use temp file for test_improper_device_name (#138856 ) Use `tempfile.NamedTemporaryFile()` to have test_specify_improper_device_name save/load to a tmp file rather than the current-working-directory Pull Request resolved: https://github.com/pytorch/pytorch/pull/138856 Approved by: https://github.com/Skylion007	2024-10-26 02:42:25 +00:00
Syed Tousif Ahmed	03c72976a5	Properly uses ref-counting for torch.cuda.use_mem_pool (#133600 ) This PR refactors some ref-counting functionality out of `beginAllocateToPool` and `releasePool`. The ref-counting logic is then used in construction and destruction of `torch.cuda.MemPool`. The `use_count` variable in the CUDACachingAllocator is essentially a refcount of how many context managers are using the pool. Since we are now lifting up the MemPool abstraction to the user, the MemPool object itself now needs to hold a an extra reference as well. Part of https://github.com/pytorch/pytorch/issues/124807. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133600 Approved by: https://github.com/eqy, https://github.com/ezyang	2024-10-22 03:21:53 +00:00
Jack Taylor	966a1a971e	[ROCm] Add AMDSMI support for UUID input (#129741 ) Adds support for for using UUIDs for AMDSMI utilities in PyTorch via CUDA_VISIBLE_DEVICES/HIP_VISIBLE_DEVICES. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129741 Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily	2024-10-15 15:56:30 +00:00
eqy	cba3f4f5e3	[CUDA] Clean up asserts in `test_cuda.py` (#137034 ) Switch some `assertTrue` tests to `assertEqual` etc for debuggability in logs Pull Request resolved: https://github.com/pytorch/pytorch/pull/137034 Approved by: https://github.com/Skylion007	2024-10-08 23:16:19 +00:00
Nikita Shulga	6d0d7b6e37	[CI][BE] Restore cuda memory allocator setting (#137383 ) By adding `finally:` clause at the end of the test Might fix https://github.com/pytorch/pytorch/issues/137098#issuecomment-2389172552 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137383 Approved by: https://github.com/ngimel	2024-10-05 04:16:38 +00:00
Jeff Daily	c7b0d4b148	raw_alloc ignores PYTORCH_NO_CUDA_MEMORY_CACHING (#131114 ) raw_alloc is used by cudnn, miopen, thrust, and tunableop. Without this PR, the env var for disabling the caching allocator will only partially work. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131114 Approved by: https://github.com/eqy, https://github.com/houseroad, https://github.com/albanD Co-authored-by: Nichols A. Romero <nick.romero@amd.com>	2024-10-04 15:36:29 +00:00
cyy	6327a71880	[Environment Variable][2/N] Use thread-safe setenv wrapper (#124485 ) This follows #119449 to make setenv thread-safe. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124485 Approved by: https://github.com/eqy	2024-10-04 07:30:51 +00:00
PyTorch MergeBot	0d1701f310	Revert "raw_alloc ignores PYTORCH_NO_CUDA_MEMORY_CACHING (#131114 )" This reverts commit `7001907480`. Reverted https://github.com/pytorch/pytorch/pull/131114 on behalf of https://github.com/PaliC due to failing internal builds ([comment](https://github.com/pytorch/pytorch/pull/131114#issuecomment-2390615007))	2024-10-03 06:22:55 +00:00
Jeff Daily	7001907480	raw_alloc ignores PYTORCH_NO_CUDA_MEMORY_CACHING (#131114 ) raw_alloc is used by cudnn, miopen, thrust, and tunableop. Without this PR, the env var for disabling the caching allocator will only partially work. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131114 Approved by: https://github.com/eqy, https://github.com/houseroad, https://github.com/albanD Co-authored-by: Nichols A. Romero <nick.romero@amd.com>	2024-10-02 16:27:15 +00:00
Yu, Guangye	df5bbc09d1	Make device-specific event inherits from torch.Event (#134845 ) # Motivation This PR intends to make device-specific Event inherit from the generic torch.Event. The benefit is providing a generic abstract class `torch.Event` for different devices, like `torch.Stream`. This make it easier for Dynamo to capture the Event of different devices, like torch.cuda.Event and torch.xpu.Event. And the next PR would like to remove previous useless base class `_StreamBase` and `_EventBase` to avoid multiple Inheritance. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134845 Approved by: https://github.com/albanD, https://github.com/EikanWang	2024-10-01 06:28:41 +00:00
FFFrog	e14b58ffbd	Using device-agnostic autocast api (#136613 ) - using torch.autocast(device_str="cuda") instead of torch.cuda.amp.autocast() - using torch.autocast(device_str="cpu") instead of torch.cpu.amp.autocast() Pull Request resolved: https://github.com/pytorch/pytorch/pull/136613 Approved by: https://github.com/shink, https://github.com/cyyever, https://github.com/kwen2501	2024-09-27 07:16:24 +00:00
Fuzzkatt	d1382aaf3d	skip test_out_of_memory for jetson (#133270 ) Skip test_out_of_memory in test/test_cuda.py on Jetson as OOM reporting in Jetson has issues due to partially missing NVML support. cc @eqy Pull Request resolved: https://github.com/pytorch/pytorch/pull/133270 Approved by: https://github.com/eqy, https://github.com/albanD, https://github.com/seemethere	2024-09-27 02:36:48 +00:00
drisspg	d05645841e	Update get_device_properties to take in optional device (#136683 ) Aligns behavior with the rest of cuda's device info query methods Pull Request resolved: https://github.com/pytorch/pytorch/pull/136683 Approved by: https://github.com/eqy	2024-09-26 15:07:31 +00:00
eqy	8225e7706e	[CUDA][Expandable Segments] Account for non-gc'able memory in expandable segments tests (#136496 ) Seems like some other tests are holding onto memory that is not gc'able (e.g., cuBLAS workspaces), so these tests while working in isolation fail when run as e.g., `python test/test_cuda.py -k able` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136496 Approved by: https://github.com/ezyang	2024-09-25 01:14:45 +00:00
Yuxin Wu	663e760065	add unittest for OOM message (#129671 ) Add unittest for the bug in #123984 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129671 Approved by: https://github.com/eqy	2024-09-23 04:48:01 +00:00
Aaron Orenstein	8c356ce3da	Fix lint errors in fbcode (#135614 ) Summary: Fixed a bunch of fbcode imports that happened to work but confused autodeps. After this autodeps still suggests "improvements" to TARGETS (which breaks our builds) but at least it can find all the imports. Test Plan: ``` fbpython fbcode/tools/build/buck/linters/lint_autoformat.py --linter=autodeps --default-exec-timeout=1800 -- fbcode/caffe2/TARGETS fbcode/caffe2/test/TARGETS ``` Before: ``` ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/testing.py:229) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fbur$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export.py:87) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fburl$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_serdes.py:9) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fb$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_serdes.py:10) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fburl$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_retraceability.py:7) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https:$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_retraceability.py:6) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See ht$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export_nonstrict.py:7) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See http$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_export_nonstrict.py:6) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See $ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_export_training_ir_to_run_decomp.py:8) when processing rule "test_export". Please make sure it's listed in the srcs parameter of an$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export_training_ir_to_run_decomp.py:10) when processing rule "test_export". Please make sure it's listed in the srcs parameter of anoth$ ERROR while processing caffe2/test/TARGETS: Found "//python/typeshed_internal:typeshed_internal_library" owner for "cv2" but it is protected by visibility rules: [] (from caffe2/test/test_bundled_images.py:7) when processing rule "test_bundled_$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "caffe2.test.profiler_test_cpp_thread_lib" (from caffe2/test/profiler/test_cpp_thread.py:29) when processing rule "profiler_test_cpp_thread". Please make sure it's listed in t$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._utils_internal.get_file_path_2" (from caffe2/test/test_custom_ops.py:23) when processing rule "custom_ops". Please make sure it's listed in the srcs parameter of anoth$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._utils_internal.get_file_path_2" (from caffe2/test/test_public_bindings.py:13) when processing rule "public_bindings". Please make sure it's listed in the srcs paramete$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._C._profiler.symbolize_tracebacks" (from caffe2/test/test_cuda.py:3348) when processing rule "test_cuda". Please make sure it's listed in the srcs parameter of another $ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._C._profiler.gather_traceback" (from caffe2/test/test_cuda.py:3348) when processing rule "test_cuda". Please make sure it's listed in the srcs parameter of another rule$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for include <torch/csrc/autograd/profiler_kineto.h> (from caffe2/test/profiler/test_cpp_thread.cpp:2) when processing profiler_test_cpp_thread_lib. Some things to try: ``` Differential Revision: D62049222 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135614 Approved by: https://github.com/oulgen, https://github.com/laithsakka	2024-09-13 02:04:34 +00:00
FFFrog	80a6d60829	Moving _run_autocast_outofplace to basic class named TestAutocast to reduce redundance (#134460 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134460 Approved by: https://github.com/EikanWang, https://github.com/ezyang	2024-09-04 10:48:58 +00:00
Natalia Gimelshein	c25b64a057	expose host_emptyCache to python, fix a bug in freeing cudaHostRegist… (#134919 ) …ered memory Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/134919 Approved by: https://github.com/eqy	2024-09-01 09:07:25 +00:00
zdevito	d91b49dbaa	expandable_segments <-> other allocator options (#134338 ) Previously setting garbage_collection_threshold or max_split_size_mb along with expandable_segments:True could cause the allocator to hit assert failures when running nearly out of memory. This PR ensures garbage_collection and max_split freeing do not accidentally try to release expandable segments. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134338 Approved by: https://github.com/ezyang	2024-08-29 18:43:59 +00:00
Syed Tousif Ahmed	4655eb3ee2	Uses MemPoolContext to route allocations from CUDACachingAllocator (#134685 ) Re-open of https://github.com/pytorch/pytorch/pull/133599 that was mistakenly closed by issuing `ghstack land` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134685 Approved by: https://github.com/ezyang	2024-08-29 03:56:31 +00:00
Chien-Lin Chen	40de63be09	parameterized test_graph_optims and test_graph_scaling_fused_optimizers (#133749 ) Fixes #123451 This is a rework of a reverted pull request, https://github.com/pytorch/pytorch/pull/125127. The test failure is fixed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133749 Approved by: https://github.com/janeyx99	2024-08-28 16:34:06 +00:00
Mikayla Gawarecki	018e48c337	[Reland] Add wrappers for synchronous GPUDirect Storage APIs (#133489 ) Reland #130633 USE_CUFILE turned off by default in this version Pull Request resolved: https://github.com/pytorch/pytorch/pull/133489 Approved by: https://github.com/albanD	2024-08-15 17:11:52 +00:00
Syed Tousif Ahmed	42cd397a0e	Loads .pyd instead of .so in MemPool test for windows (#132749 ) Fixes #132650 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132749 Approved by: https://github.com/albanD	2024-08-08 14:29:56 +00:00
PyTorch MergeBot	123d9ec5bf	Revert "Loads .pyd instead of .so in MemPool test for windows (#132749 )" This reverts commit `37ab0f3385`. Reverted https://github.com/pytorch/pytorch/pull/132749 on behalf of https://github.com/syed-ahmed due to Seems like periodic is still failing: `7c79e89bc5` ([comment](https://github.com/pytorch/pytorch/pull/132749#issuecomment-2274041302))	2024-08-07 18:08:44 +00:00
Syed Tousif Ahmed	37ab0f3385	Loads .pyd instead of .so in MemPool test for windows (#132749 ) Fixes #132650 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132749 Approved by: https://github.com/albanD	2024-08-07 09:58:52 +00:00
albanD	9a1ad3345f	Fix periodic windows test (#132648 ) This test fails to clean up folders on windows for the past week, see `27f61eba58` for example Pull Request resolved: https://github.com/pytorch/pytorch/pull/132648 Approved by: https://github.com/janeyx99, https://github.com/zou3519, https://github.com/malfet	2024-08-05 20:54:20 +00:00
Xuehai Pan	4226ed1585	[BE] Format uncategorized Python files with `ruff format` (#132576 ) Remove patterns ``, `test/`, and `torch/**` in `tools/linter/adapters/pyfmt_linter.py` and run `lintrunner`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132576 Approved by: https://github.com/ezyang, https://github.com/Skylion007 ghstack dependencies: #132574	2024-08-04 17:13:31 +00:00
Oguz Ulgen	221350e3a4	Add None return type to init -- tests (#132352 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132352 Approved by: https://github.com/ezyang ghstack dependencies: #132335, #132351	2024-08-01 15:44:51 +00:00
Syed Tousif Ahmed	7c89ec0f7c	Implements torch.cuda.MemPool() API (#131152 ) In this PR: - Pool id creation logic is refactored and moved to a MemPool class. `graph_pool_handle()` API now uses `torch.cuda.MemPool()` to get a unique id for a pool. Existing tests should cover this change. - MemPool holds a pointer to a CUDAAllocator as proposed in https://github.com/pytorch/pytorch/issues/124807#issuecomment-2077506997. Tests are added to show usage with CUDAPluggableAllocator. - MemPoolContext API makes a mempool active. Tests are added to show usage of this API. This API will be used in CUDACachingAllocator to route allocations to a user provided allocator. See draft here: https://github.com/pytorch/pytorch/pull/125722/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/131152 Approved by: https://github.com/eqy, https://github.com/ezyang	2024-08-01 01:29:30 +00:00
Aidyn-A	301ec32ae8	[EASY][TEST][CUDA] Fix typo in test_graph_make_graphed_callables_same_pool (#132059 ) Per title. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132059 Approved by: https://github.com/Skylion007	2024-07-29 19:15:37 +00:00
PyTorch MergeBot	e191b83462	Revert "Add wrappers for synchronous GPUDirect Storage APIs (#130633 )" This reverts commit `709ddf7a9d`. Reverted https://github.com/pytorch/pytorch/pull/130633 on behalf of https://github.com/clee2000 due to still failing internally D60265673 ([comment](https://github.com/pytorch/pytorch/pull/130633#issuecomment-2253239607))	2024-07-26 18:08:20 +00:00
Mikayla Gawarecki	709ddf7a9d	Add wrappers for synchronous GPUDirect Storage APIs (#130633 ) Based in part on https://github.com/NVIDIA/apex/pull/1774 Differential Revision: [D60155434](https://our.internmc.facebook.com/intern/diff/D60155434) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130633 Approved by: https://github.com/albanD	2024-07-25 22:23:38 +00:00
PyTorch MergeBot	e4b5645f83	Revert "Add wrappers for synchronous GPUDirect Storage APIs (#130633 )" This reverts commit `5b5e0698a5`. Reverted https://github.com/pytorch/pytorch/pull/130633 on behalf of https://github.com/clee2000 due to breaking a lot of jobs and build rules internally D60085885, possibly needs to update some bazel build? ([comment](https://github.com/pytorch/pytorch/pull/130633#issuecomment-2245806738))	2024-07-23 17:19:34 +00:00
Mikayla Gawarecki	5b5e0698a5	Add wrappers for synchronous GPUDirect Storage APIs (#130633 ) Based in part on https://github.com/NVIDIA/apex/pull/1774 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130633 Approved by: https://github.com/albanD	2024-07-22 14:51:24 +00:00
Xuehai Pan	ba48cf6535	[BE][Easy][6/19] enforce style for empty lines in import segments in `test/` (#129757 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129757 Approved by: https://github.com/ezyang	2024-07-17 06:42:37 +00:00
Bilal Khan	54a932b0ac	Support for expandable segments with cuda graph trees (#128068 ) This PR adds support to use expandable segments with private memory pools which should unblock using it with cuda graphs and cuda graph trees. Currently, the allocator silently avoids using expandable segments when allocating in a private pool due to checkpoint saving/restoring not meshing well with how we keep track of unmapped blocks. The PR itself is pretty short, most of the logic for checkpointing and reapplying state for non-expandable segments transfers over without much work. Expandable segments reserve a virtual address space of size equal to the amount of physical memory on the GPU. Every time we want to `malloc()` or `free()` memory in a memory pool with expandable segments turned on, we map/unmap pages of physical GPU memory under the hood to create a new block that we return to the caller. This is beneficial due to the fact that each memory pool functions as a single segment of memory with a contiguous block of memory addresses that can grow and shrink as needed, avoiding fragmentation from allocating multiple non-contiguous segments that may not be merged together. The caching allocator handles this by creating an unmapped block for the entire reserved virtual address space at init, which is treated similarly to an unallocated block in a free pool. When callers call `malloc()`, it's split and mapped to create allocated blocks, and calling `free()` similarly caches and merges free blocks in a free pool to be used later. Expandable blocks are unmapped and returned back to Cuda when they are cleaned up, or when we hit an OOM and the allocator attempts to remap cached free blocks. The code paths to map, free, and unmap blocks in expandable segments is similar to that for normal blocks and does all the same work of updating stats on memory usage, moving blocks between active and free pools, and returning memory to Cuda. With Cuda Graph Trees and private memory pools, we need the ability to take checkpoints of the current state of the memory allocator after each graph capture as well as reapplying the state before capturing a new graph after replaying a captured graph so that the new cuda graph capture has access to the state of the allocator at the point after replaying a previously captured graph so it can reuse empty blocks and allocate new ones. As mentioned in a below comment, memory in a private pool is cached until the private pool is destroyed and allocations can only grow from extra graph captures, any freeing of memory would result in invalid memory addresses and would break cuda graphs. One implementation detail to note for unmapped blocks with expandable segments is that unmapped blocks are kept track in a member variable `unmapped` of a `BlockPool`. `unmapped` is not part of the checkpointed state of the caching allocator and isn't restored when reapplying checkpoints since we never free/unmap memory back to cuda and is persisted across graph captures / replays. Checkpointing the current state of the memory allocator works as expected with expandable segments. Checkpointing grabs the first block of every segment in the active and free pools of the private pool and traverses the linked list of blocks in the segment to capture the state of every segment, which is then saved and kept for when it is needed to be reapplied. For expandable blocks, the last block in every segment will be an unallocated unmapped block containing the remaining amount of unmapped memory at graph capture time, and this too is saved in the checkpoint. Reapplying the checkpoints works by freeing all allocated blocks and merging them into a single block per segment, then for each segment, we manually split and allocate all blocks from the checkpoint and then free the blocks marked as unallocated in the checkpoint state. For expandable segments, we need to make some modifications to not split unmapped blocks and avoid manually mapping then freeing unmapped blocks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128068 Approved by: https://github.com/eqy, https://github.com/eellison	2024-07-15 23:23:23 +00:00
Tobias Ringwald	e5de25896f	Fixed CUDA randint generation for large ranges. (#126066 ) Fixes #125224 For large ranges, calls to CUDA `randint` use a different `unroll_factor` to generate random ints. This `unroll_factor` was not considered correctly in the calculation of the Philox offsets. Thus, some of the random states were reused, resulting in lower entropy (see #125224). This also affects multiple other random functions, such as `torch.rand` and `torch.randn`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126066 Approved by: https://github.com/eqy, https://github.com/lezcano	2024-07-13 21:42:27 +00:00
eqy	60fc01d0ab	[CUDA] Don't double-destroy CUDA graph when debug dump is used (#130401 ) Repro from @eellison Could have sworn we had another PR with this fix floating around somewhere but I couldn't find it... Pull Request resolved: https://github.com/pytorch/pytorch/pull/130401 Approved by: https://github.com/Skylion007, https://github.com/eellison	2024-07-12 18:57:07 +00:00
PyTorch MergeBot	578388bed8	Revert "Support for expandable segments with cuda graph trees (#128068 )" This reverts commit `fdc83610f2`. Reverted https://github.com/pytorch/pytorch/pull/128068 on behalf of https://github.com/janeyx99 due to Reverting for breaking ROCm tests on trunk, I think the tests need to be qualified with @onlyCUDA ([comment](https://github.com/pytorch/pytorch/pull/128068#issuecomment-2223672381))	2024-07-11 18:58:13 +00:00
Bilal Khan	fdc83610f2	Support for expandable segments with cuda graph trees (#128068 ) This PR adds support to use expandable segments with private memory pools which should unblock using it with cuda graphs and cuda graph trees. Currently, the allocator silently avoids using expandable segments when allocating in a private pool due to checkpoint saving/restoring not meshing well with how we keep track of unmapped blocks. The PR itself is pretty short, most of the logic for checkpointing and reapplying state for non-expandable segments transfers over without much work. Expandable segments reserve a virtual address space of size equal to the amount of physical memory on the GPU. Every time we want to `malloc()` or `free()` memory in a memory pool with expandable segments turned on, we map/unmap pages of physical GPU memory under the hood to create a new block that we return to the caller. This is beneficial due to the fact that each memory pool functions as a single segment of memory with a contiguous block of memory addresses that can grow and shrink as needed, avoiding fragmentation from allocating multiple non-contiguous segments that may not be merged together. The caching allocator handles this by creating an unmapped block for the entire reserved virtual address space at init, which is treated similarly to an unallocated block in a free pool. When callers call `malloc()`, it's split and mapped to create allocated blocks, and calling `free()` similarly caches and merges free blocks in a free pool to be used later. Expandable blocks are unmapped and returned back to Cuda when they are cleaned up, or when we hit an OOM and the allocator attempts to remap cached free blocks. The code paths to map, free, and unmap blocks in expandable segments is similar to that for normal blocks and does all the same work of updating stats on memory usage, moving blocks between active and free pools, and returning memory to Cuda. With Cuda Graph Trees and private memory pools, we need the ability to take checkpoints of the current state of the memory allocator after each graph capture as well as reapplying the state before capturing a new graph after replaying a captured graph so that the new cuda graph capture has access to the state of the allocator at the point after replaying a previously captured graph so it can reuse empty blocks and allocate new ones. As mentioned in a below comment, memory in a private pool is cached until the private pool is destroyed and allocations can only grow from extra graph captures, any freeing of memory would result in invalid memory addresses and would break cuda graphs. One implementation detail to note for unmapped blocks with expandable segments is that unmapped blocks are kept track in a member variable `unmapped` of a `BlockPool`. `unmapped` is not part of the checkpointed state of the caching allocator and isn't restored when reapplying checkpoints since we never free/unmap memory back to cuda and is persisted across graph captures / replays. Checkpointing the current state of the memory allocator works as expected with expandable segments. Checkpointing grabs the first block of every segment in the active and free pools of the private pool and traverses the linked list of blocks in the segment to capture the state of every segment, which is then saved and kept for when it is needed to be reapplied. For expandable blocks, the last block in every segment will be an unallocated unmapped block containing the remaining amount of unmapped memory at graph capture time, and this too is saved in the checkpoint. Reapplying the checkpoints works by freeing all allocated blocks and merging them into a single block per segment, then for each segment, we manually split and allocate all blocks from the checkpoint and then free the blocks marked as unallocated in the checkpoint state. For expandable segments, we need to make some modifications to not split unmapped blocks and avoid manually mapping then freeing unmapped blocks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128068 Approved by: https://github.com/zdevito, https://github.com/eqy	2024-07-11 05:33:09 +00:00
Jeff Willette	5c9d5272e4	fixes #124582 (#128483 ) added check for existence of outputs requiring grad to make_graphed_callables. added new test case, updated existing test case to include parameterless modules. Fixes #124582 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128483 Approved by: https://github.com/eqy, https://github.com/ezyang	2024-07-02 08:45:59 +00:00
Jack Taylor	e1b426b345	[ROCm] CUDA_VISIBLE_DEVICES fallback option for device_count (#129650 ) Updating `_parse_visible_devices` to allow use of CUDA_VISIBLE_DEVICES if HIP_VISIBLE_DEVICES is unset, to avoid any unnecessary code changes in workloads that already rely on CUDA_VISIBLE_DEVICES. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129650 Approved by: https://github.com/hongxiayang, https://github.com/malfet	2024-07-01 11:40:09 +00:00
Jeff Daily	169b4ca07e	add uuid in cudaDeviceProperties (#125083 ) Replaces #99967. Fixes #99903. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125083 Approved by: https://github.com/pruthvistony, https://github.com/albanD, https://github.com/eqy, https://github.com/malfet	2024-06-27 23:53:13 +00:00
yousufmo	305ba62906	Add support to `GradScaler` for respecting an already set `grad_scale` value (#123429 ) Fixes #123428 Co-authored-by: Yousuf Mohamed-Ahmed <youmed.tech@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123429 Approved by: https://github.com/ezyang	2024-06-27 22:40:54 +00:00
Dmitry Rogozhkin	321bdcb372	Fix device propagation for checkpointing (#128671 ) Fixes: #128478 In backward() implementation checkpointing code was quering device type from the rng_state tensors saved on forward(). These tensors are CPU only tensors and don't carry device information with them. As a result CUDA device was assumed as a default. Which is not correct if user runs on some other device. For example, on XPU. This patch saves full device information on forward() and uses it on backward() to get device type. Previously forward save only device index. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128671 Approved by: https://github.com/guangyey, https://github.com/soulitzer	2024-06-27 17:14:13 +00:00
Fuzzkatt	4ca8eecca4	skip test_graph_capture_oom for jetson (#128661 ) On Jetson IGX, `python test/test_cuda.py -k test_graph_capture_oom` fails with the following error: ``` RuntimeError: NVML_SUCCESS == r INTERNAL ASSERT FAILED at "/opt/pytorch/pytorch/c10/cuda/CUDACachingAllocator.cpp":841, please report a bug to PyTorch. During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/usr/lib/python3.10/unittest/case.py", line 59, in testPartExecutor yield File "/usr/lib/python3.10/unittest/case.py", line 591, in run self._callTestMethod(testMethod) File "/usr/lib/python3.10/unittest/case.py", line 549, in _callTestMethod method() File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 2759, in wrapper method(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 2759, in wrapper method(args, **kwargs) File "/opt/pytorch/pytorch/test/test_cuda.py", line 2255, in test_graph_capture_oom with self.assertRaisesRegex(RuntimeError, oom_regex): File "/usr/lib/python3.10/unittest/case.py", line 239, in __exit__ self._raiseFailure('"{}" does not match "{}"'.format( File "/usr/lib/python3.10/unittest/case.py", line 163, in _raiseFailure raise self.test_case.failureException(msg) AssertionError: "out of memory" does not match "NVML_SUCCESS == r INTERNAL ASSERT FAILED at "/opt/pytorch/pytorch/c10/cuda/CUDACachingAllocator.cpp":841, please report a bug to PyTorch. " ``` This is a known issue as nvml support on Jetson is limited, and the OOM reporting in CUDACachingAllocator.cpp requires nvml to be properly loaded, which fails on Jetson. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128661 Approved by: https://github.com/eqy, https://github.com/atalman	2024-06-25 08:25:11 +00:00
Xuehai Pan	a7c596870d	[BE][Eazy] remove `torch.torch.xxx` usages (#127800 ) NB: `torch` is exposed in `torch/__init__.py`. So there can be `torch.torch.torch.xxx`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127800 Approved by: https://github.com/peterbell10, https://github.com/kit1980, https://github.com/malfet	2024-06-05 21:53:49 +00:00
Xuehai Pan	67ef2683d9	[BE] wrap deprecated function/class with `typing_extensions.deprecated` (#127689 ) Use `typing_extensions.deprecated` for deprecation annotation if possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` if the category is missing. Note that only warnings that their messages contain `[Dd]eprecat(ed\|ion)` are updated in this PR. Resolves #126888 - #126888 This PR is split from PR #126898. - #126898 ------ Pull Request resolved: https://github.com/pytorch/pytorch/pull/127689 Approved by: https://github.com/Skylion007	2024-06-02 12:30:43 +00:00
PyTorch MergeBot	033e733021	Revert "[BE] wrap deprecated function/class with `typing_extensions.deprecated` (#126898 )" This reverts commit `749a132fb0`. Reverted https://github.com/pytorch/pytorch/pull/126898 on behalf of https://github.com/fbgheith due to switching typing-extensions=4.3.0 to 4.9.0 causes internal failure ([comment](https://github.com/pytorch/pytorch/pull/126898#issuecomment-2142884456))	2024-05-31 19:47:24 +00:00
SandishKumarHN	da39461d61	[optim] Move test_grad_scaling_autocast_fused_optimizers to test_cuda.py (#126418 ) this PR address the comments in this PR #124904 - Move test_grad_scaling_autocast_fused_optimizers to test_cuda.py - Combine _grad_scaling_autocast_fused_optimizers into test_grad_scaling_autocast_fused_optimizers - Move to OptimizerInfo framework. - For failing tests test_grad_scaling_autocast_fused_optimizers AdamW_cuda_float32, Adam_cuda_float32 - Added toleranceOverride in this PR - created a issue #127000 ``` > (c2env) [sandish@devgpu166.ash6 ~/pytorch (refactoroptimizers)]$ python test/test_cuda.py -k test_grad_scaling_autocast_fused_optimizers -v /home/sandish/pytorch/torch/backends/cudnn/__init__.py:106: UserWarning: PyTorch was compiled without cuDNN/MIOpen support. To use cuDNN/MIOpen, rebuild PyTorch making sure the library is visible to the build system. warnings.warn( /home/sandish/pytorch/torch/backends/cudnn/__init__.py:106: UserWarning: PyTorch was compiled without cuDNN/MIOpen support. To use cuDNN/MIOpen, rebuild PyTorch making sure the library is visible to the build system. warnings.warn( test_grad_scaling_autocast_fused_optimizers_Adagrad_cpu_float32 (__main__.TestCudaOptimsCPU) ... {'fused': True} {'fused': True} {'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'lr': 0.1, 'fused': True} {'lr': 0.1, 'fused': True} {'initial_accumulator_value': 0.1, 'weight_decay': 0.1, 'fused': True} {'initial_accumulator_value': 0.1, 'weight_decay': 0.1, 'fused': True} {'lr': 0.1, 'lr_decay': 0.5, 'weight_decay': 0.1, 'fused': True} {'lr': 0.1, 'lr_decay': 0.5, 'weight_decay': 0.1, 'fused': True} {'lr': tensor(0.0010), 'fused': True} {'lr': tensor(0.0010), 'fused': True} ok test_grad_scaling_autocast_fused_optimizers_AdamW_cpu_float32 (__main__.TestCudaOptimsCPU) ... {'fused': True} {'fused': True} {'lr': 0.01, 'fused': True} {'lr': 0.01, 'fused': True} {'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'fused': True} ok test_grad_scaling_autocast_fused_optimizers_Adam_cpu_float32 (__main__.TestCudaOptimsCPU) ... {'fused': True} {'fused': True} {'lr': 0.01, 'fused': True} {'lr': 0.01, 'fused': True} {'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'fused': True} ok test_grad_scaling_autocast_fused_optimizers_SGD_cpu_float32 (__main__.TestCudaOptimsCPU) ... {'fused': True} {'fused': True} {'lr': 0.01, 'fused': True} {'lr': 0.01, 'fused': True} {'lr': tensor(0.0010), 'fused': True} {'lr': tensor(0.0010), 'fused': True} {'momentum': 0.9, 'fused': True} {'momentum': 0.9, 'fused': True} {'momentum': 0.9, 'dampening': 0.5, 'fused': True} {'momentum': 0.9, 'dampening': 0.5, 'fused': True} {'momentum': 0.9, 'weight_decay': 0.1, 'fused': True} {'momentum': 0.9, 'weight_decay': 0.1, 'fused': True} {'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.1, 'fused': True} {'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} ok test_grad_scaling_autocast_fused_optimizers_Adagrad_cuda_float32 (__main__.TestCudaOptimsCUDA) ... skipped 'cuda is not supported for fused on Adagrad' test_grad_scaling_autocast_fused_optimizers_AdamW_cuda_float32 (__main__.TestCudaOptimsCUDA) ... {'fused': True} {'fused': True} {'lr': 0.01, 'fused': True} {'lr': 0.01, 'fused': True} {'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'fused': True} {'capturable': True, 'fused': True} {'capturable': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'fused': True} {'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'fused': True} {'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'fused': True} ok test_grad_scaling_autocast_fused_optimizers_Adam_cuda_float32 (__main__.TestCudaOptimsCUDA) ... {'fused': True} {'fused': True} {'lr': 0.01, 'fused': True} {'lr': 0.01, 'fused': True} {'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'fused': True} {'capturable': True, 'fused': True} {'capturable': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'fused': True} {'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'fused': True} {'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'fused': True} ok test_grad_scaling_autocast_fused_optimizers_SGD_cuda_float32 (__main__.TestCudaOptimsCUDA) ... {'fused': True} {'fused': True} {'lr': 0.01, 'fused': True} {'lr': 0.01, 'fused': True} {'lr': tensor(0.0010), 'fused': True} {'lr': tensor(0.0010), 'fused': True} {'momentum': 0.9, 'fused': True} {'momentum': 0.9, 'fused': True} {'momentum': 0.9, 'dampening': 0.5, 'fused': True} {'momentum': 0.9, 'dampening': 0.5, 'fused': True} {'momentum': 0.9, 'weight_decay': 0.1, 'fused': True} {'momentum': 0.9, 'weight_decay': 0.1, 'fused': True} {'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.1, 'fused': True} {'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} ok ---------------------------------------------------------------------- Ran 8 tests in 16.117s OK (skipped=1) > lintrunner test/test_cuda.py ---------------------------------------------------------------------- ok No lint issues. > lintrunner torch/testing/_internal/common_optimizers.py ---------------------------------------------------------------------- ok No lint issues. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126418 Approved by: https://github.com/janeyx99	2024-05-30 01:47:41 +00:00
Xuehai Pan	749a132fb0	[BE] wrap deprecated function/class with `typing_extensions.deprecated` (#126898 ) Use `typing_extensions.deprecated` for deprecation annotation if possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` if the category is missing. Note that only warnings that their messages contain `[Dd]eprecat(ed\|ion)` are updated in this PR. UPDATE: Use `FutureWarning` instead of `DeprecationWarning`. Resolves #126888 - #126888 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126898 Approved by: https://github.com/albanD	2024-05-29 12:09:27 +00:00
Yu, Guangye	e7a42702f9	generalize custom_fwd&custom_bwd to be device-agnostic (#126531 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126531 Approved by: https://github.com/jgong5, https://github.com/gujinghui, https://github.com/albanD, https://github.com/EikanWang ghstack dependencies: #126527	2024-05-25 06:48:16 +00:00
Yu, Guangye	c09205a057	Deprecate device-specific GradScaler autocast API (#126527 ) # Motivation ## for `torch.amp.GradScaler`, - `torch.cpu.amp.GradScaler(args...)` is completely equivalent to `torch. amp.GradScaler("cpu", args...)`. - `torch.cuda.amp.GradScaler(args...)` is completely equivalent to `torch.amp.GradScaler("cuda", args...)`. So, we intend to depreate them and strongly recommend developer to use `torch.amp.GradScaler`. ## for `custom_fwd` and `custom_bwd`, this is a good solution to make the custom function run with or without effect even in an autocast-enabled region and can be shared by other backends, like CPU and XPU. So we generalize it to be device-agnostic and put them int `torch/amp/autocast_mode.py` and re-expose to `torch.amp.custom_fwd` and `torch.amp.custom_bwd`. Meanwhile, we deprecate `torch.cuda.amp.custom_fwd` and `torch.cuda.amp.custom_bwd`. # Additional Context Add UT to cover the deprecated warning. No need for more UTs to cover the functionality of `torch.amp.custom_f/bwd`, the existing UTs that previously covered the functionality of `torch.cuda.amp.custom_f/bwd` can cover them. To facilitate the review, we separate these code changes to two PRs. The first PR cover `torch.amp.GradScaler`. The follow-up covers `custom_fwd` and `custom_bwd`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126527 Approved by: https://github.com/jgong5, https://github.com/gujinghui, https://github.com/janeyx99, https://github.com/EikanWang	2024-05-25 06:41:34 +00:00
Catherine Lee	ef86a27dba	Mark test_set_per_process_memory_fraction serial (#127087 ) Occasionally OOMs Also should probably give the entire GPU for this anyways Pull Request resolved: https://github.com/pytorch/pytorch/pull/127087 Approved by: https://github.com/huydhn	2024-05-25 06:26:47 +00:00
Jack Taylor	d30cdc4321	[ROCm] amdsmi library integration (#119182 ) Adds monitoring support for ROCm using amdsmi in place of pynvml. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119182 Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/xw285cornell	2024-05-21 01:59:26 +00:00
PyTorch MergeBot	cb69c51b6f	Revert " Updated test_graph_optims and test_graph_scaling_fused_optimizers to use new OptimizerInfo infrastructure (#125127 )" This reverts commit `cf35a591b9`. Reverted https://github.com/pytorch/pytorch/pull/125127 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/125127#issuecomment-2120337584))	2024-05-20 12:14:22 +00:00
jayanth domalapalli	cf35a591b9	Updated test_graph_optims and test_graph_scaling_fused_optimizers to use new OptimizerInfo infrastructure (#125127 ) This PR is meant to address issue #123451, more specifically, the ```test_graph_optims``` and ```test_graph_scaling_fused_optimizers``` functions in ```test_cuda.py``` have been updated so that they now use the new OptimizerInfo infrastructure. Lintrunner passed: ``` $ lintrunner test/test_cuda.py ok No lint issues. ``` Tests passed: ``` >python test_cuda.py -k test_graph_optims Ran 19 tests in 7.463s OK (skipped=9) >python test_cuda.py -k test_graph_scaling_fused_optimizers Ran 6 tests in 2.800s OK (skipped=3) ``` Both the functions have been moved to the newly created TestCase class ```TestCudaOptims```. The test is mostly the same except the ```@optims``` decorator is used at the top of the function to implicitly call the function using each of the optimizers mentioned in the decorator instead of explicitly using a for loop to iterate through each of the optimizers. I was unable to use the ```_get_optim_inputs_including_global_cliquey_kwargs``` to get all kwargs for each of the optimizers since some of the kwargs that are used in the original ```test_graph_optims``` function are not being returned by the new OptimizerInfo infrastructure, more specifically, for the ```torch.optim.rmsprop.RMSprop``` optimizer, the following kwargs are not returned whenever ```_get_optim_inputs_including_global_cliquey_kwargs``` is called: ``` {'foreach': False, 'maximize': True, 'weight_decay': 0} { 'foreach': True, 'maximize': True, 'weight_decay': 0} ``` I ran into the same issue for ```test_graph_scaling_fused_optimizers```, for the ```torch.optim.adamw.AdamW``` optimizer, whenever ```optim_info.optim_inputs_func(device=device)``` was called, the following kwarg was not returned: ``` {'amsgrad': True} ``` Due to this issue, I resorted to using a dictionary to store the kwargs for each of the optimizers, I am aware that this is less than ideal. I was wondering whether I should use the OptimizerInfo infrastructure to get all the kwargs regardless of the fact that it lacks some kwargs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125127 Approved by: https://github.com/janeyx99	2024-05-20 06:20:45 +00:00
Yu, Guangye	58378f1224	[Doc] Add deprecated autocast comments for doc (#126062 ) # Motivation We generalize a device-agnostic API `torch.amp.autocast` in [#125103](https://github.com/pytorch/pytorch/pull/125103). After that, - `torch.cpu.amp.autocast(args...)` is completely equivalent to `torch.amp.autocast('cpu', args...)`, and - `torch.cuda.amp.autocast(args...)` is completely equivalent to `torch.amp.autocast('cuda', args...)` no matter in eager mode or JIT mode. Base on this point, we would like to deprecate `torch.cpu.amp.autocast` and `torch.cuda.amp.autocast` to strongly recommend developer to use `torch.amp.autocast` that is a device-agnostic API. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126062 Approved by: https://github.com/eqy, https://github.com/albanD	2024-05-16 05:26:43 +00:00
haozhe.zhu	f9d107af66	[optim] add fused_adagrad support for CPU device (#124905 ) Support fused_sgd_kernel support for CPU. ## Bench result: 32 core/sockets ICX Test Scripts: https://gist.github.com/zhuhaozhe/79e842e0a6e25d6d7fa1e4598807272c https://gist.github.com/zhuhaozhe/b4c6998a509dcea1796dd05b3005c969 ``` Tensor Size: 262144, Num Tensor 4, Num Threads: 1 _single_tensor_adagrad time: 0.2500 seconds _fused_adagrad time: 0.0933 seconds Tensor Size: 4194304, Num Tensor 32, Num Threads: 32 _single_tensor_adagrad time: 2.8819 seconds _fused_adagrad time: 1.7591 seconds ``` ## Test Plan: ``` python test_optim.py -k test_fused_matches_forloop python test_optim.py -k test_fused_large_tensor python test_optim.py -k test_can_load_older_state_dict python test_optim.py -k test_grad_scaling_autocast_fused_optimizers python test_torch.py -k test_grad_scaling_autocast_fused python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step ``` Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124905 Approved by: https://github.com/jgong5, https://github.com/janeyx99	2024-05-16 01:11:51 +00:00
PyTorch MergeBot	bd3cbdba2f	Revert "[optim] add fused_adagrad support for CPU device (#124905 )" This reverts commit `1c3fe84033`. Reverted https://github.com/pytorch/pytorch/pull/124905 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing distributed multigpu test in trunk `1c3fe84033` ([comment](https://github.com/pytorch/pytorch/pull/124905#issuecomment-2108777063))	2024-05-13 20:53:22 +00:00
haozhe.zhu	1c3fe84033	[optim] add fused_adagrad support for CPU device (#124905 ) Support fused_sgd_kernel support for CPU. ## Bench result: 32 core/sockets ICX Test Scripts: https://gist.github.com/zhuhaozhe/79e842e0a6e25d6d7fa1e4598807272c https://gist.github.com/zhuhaozhe/b4c6998a509dcea1796dd05b3005c969 ``` Tensor Size: 262144, Num Tensor 4, Num Threads: 1 _single_tensor_adagrad time: 0.2500 seconds _fused_adagrad time: 0.0933 seconds Tensor Size: 4194304, Num Tensor 32, Num Threads: 32 _single_tensor_adagrad time: 2.8819 seconds _fused_adagrad time: 1.7591 seconds ``` ## Test Plan: ``` python test_optim.py -k test_fused_matches_forloop python test_optim.py -k test_fused_large_tensor python test_optim.py -k test_can_load_older_state_dict python test_optim.py -k test_grad_scaling_autocast_fused_optimizers python test_torch.py -k test_grad_scaling_autocast_fused python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step ``` Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124905 Approved by: https://github.com/jgong5, https://github.com/janeyx99	2024-05-13 01:16:20 +00:00
Yu, Guangye	31372fa842	Support generic stream/event on CUDA/HIP backend (#125757 ) # Motivation According to [#123611](https://github.com/pytorch/pytorch/pull/123611), we support generic stream/event on CUDA backend. # Additional Context new method/attribute on `torch.Event` for cuda - torch.Event.event_id - torch.Event.elapsed_time - torch.Event.synchronize new method on `c10::Event` on cuda backend - c10.Event.event_id - c10.Event.elapsed_time - c10.Event.synchronize Pull Request resolved: https://github.com/pytorch/pytorch/pull/125757 Approved by: https://github.com/albanD, https://github.com/jgong5, https://github.com/EikanWang	2024-05-10 13:34:09 +00:00
PyTorch MergeBot	0d4fdb0bb7	Revert "[ROCm] amdsmi library integration (#119182 )" This reverts commit `85447c41e3`. Reverted https://github.com/pytorch/pytorch/pull/119182 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the ROCm failed test is legit `85447c41e3` ([comment](https://github.com/pytorch/pytorch/pull/119182#issuecomment-2103433197))	2024-05-09 21:18:21 +00:00
PyTorch MergeBot	6fd745255e	Revert "add uuid in cudaDeviceProperties (#125083 )" This reverts commit `3f36145db2`. Reverted https://github.com/pytorch/pytorch/pull/125083 on behalf of https://github.com/izaitsevfb due to Fails internal builds with: no member named 'uuid' in 'hipDeviceProp_t' ([comment](https://github.com/pytorch/pytorch/pull/125083#issuecomment-2103315320))	2024-05-09 19:52:45 +00:00
Jack Taylor	85447c41e3	[ROCm] amdsmi library integration (#119182 ) Adds monitoring support for ROCm using amdsmi in place of pynvml. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119182 Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/xw285cornell	2024-05-09 18:21:38 +00:00
Jeff Daily	3f36145db2	add uuid in cudaDeviceProperties (#125083 ) Replaces #99967. Fixes #99903. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125083 Approved by: https://github.com/pruthvistony, https://github.com/albanD, https://github.com/eqy	2024-05-08 19:15:55 +00:00
PyTorch MergeBot	5fd0b6e5f7	Revert "add uuid in cudaDeviceProperties (#125083 )" This reverts commit `f35fe4eaf1`. Reverted https://github.com/pytorch/pytorch/pull/125083 on behalf of https://github.com/clee2000 due to test_uuid is flaky. ex https://github.com/pytorch/pytorch/actions/runs/8988855916/job/24692369523 https://hud.pytorch.org/flakytest?name=test_uuid&suite=TestCuda&file=%25&limit=300 ([comment](https://github.com/pytorch/pytorch/pull/125083#issuecomment-2099029993))	2024-05-07 18:16:27 +00:00
Jeff Daily	f35fe4eaf1	add uuid in cudaDeviceProperties (#125083 ) Replaces #99967. Fixes #99903. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125083 Approved by: https://github.com/pruthvistony, https://github.com/albanD, https://github.com/eqy	2024-05-07 01:26:01 +00:00
haozhe.zhu	489b4586e9	[optim]fix ut and sgd kernel (#124904 ) - Original `test_grad_scaling_autocast_fused_optimizers` does not work since there is no "fused" in `optim_inputs` - We should use different `grad_scaler`, they should not share 1 `scale`, there is no issue exposed here because the default `_growth_interval` is 2000 so it will not growth and there is also no inf is found so it will not reduced. The one in `test_cuda.py` should also have this issue, - I set a manual seed to reproduce purpose if there is any numerical failure - I use Tensor tracker here because we failed this UT in dynamo case, the cpp generated code are not exactly same with fused/non fused kernel. - I make it check both `cuda` and `cpu`. - I find some SGD numerical issue with `clang`, and fixed it by using `fmadd` instead of `add/mul` in fused sgd veckernel. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124904 Approved by: https://github.com/jgong5, https://github.com/janeyx99	2024-05-03 09:13:24 +00:00

1 2 3 4 5 ...

918 Commits