cyy
29f52e3972
[2/N] Remove unnecessary once flag usage ( #145057 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145057
Approved by: https://github.com/albanD
2025-01-23 09:48:46 +00:00
Yu, Guangye
3848de55ed
Add get_stream_from_external API for CUDA backend ( #143799 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143799
Approved by: https://github.com/albanD , https://github.com/EikanWang
ghstack dependencies: #142347 , #141119 , #141123
2024-12-31 11:15:59 +00:00
Nichols A. Romero
c0a39ad35a
[ROCm] Fix TunableOp UTs: Rotating Buffer ( #143172 )
...
TunableOp's rotating buffer feature cannot be properly tested because the environment variable that controls it is sticky. A Python API is introduced to modify the value; a usage sketch follows the list below.
Additional items in this PR:
* UT for rotating buffer API
* Clean up UTs that were setting the rotating buffer via the environment variable
* Align behavior of environment variable and Python API when a negative value (< 0) is set.
* Update documentation.
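A minimal usage sketch (the setter/getter names are assumed from this description; check the torch.cuda.tunable docs for the exact spelling):
```
import torch

# Assumed API from this PR: a Python-side knob replacing the sticky env var.
torch.cuda.tunable.enable()
torch.cuda.tunable.set_rotating_buffer_size(0)       # e.g. 0 disables rotation
print(torch.cuda.tunable.get_rotating_buffer_size())

a = torch.randn(512, 512, device="cuda")
b = torch.randn(512, 512, device="cuda")
c = a @ b  # GEMM tuned using the configured rotating buffer size
```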
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143172
Approved by: https://github.com/jeffdaily
2024-12-14 06:18:11 +00:00
Peter Bell
96c3b2c388
Expose remaining sharedMem cudaDeviceProps to python ( #143226 )
...
Was a bit too fast with my earlier PR: `sharedMemPerMultiprocessor` includes some memory that is reserved for the system. The amount a kernel can actually use is limited by `sharedMemPerBlockOptin`.
I also expose `sharedMemPerBlock` for completeness.
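For reference, a sketch of reading the exposed properties (the snake_case attribute names are assumed to mirror the CUDA fields named above):
```
import torch

props = torch.cuda.get_device_properties(0)
# sharedMemPerBlock: default per-block limit
print(props.shared_memory_per_block)
# sharedMemPerBlockOptin: the most a kernel can actually opt in to use
print(props.shared_memory_per_block_optin)
# sharedMemPerMultiprocessor: per-SM total, including system-reserved memory
print(props.shared_memory_per_multiprocessor)
```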
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143226
Approved by: https://github.com/ezyang
2024-12-14 06:13:28 +00:00
Peter Bell
82a45d19b4
Expose sharedMemPerMultiprocessor device property to python ( #143119 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143119
Approved by: https://github.com/ezyang
2024-12-13 16:53:57 +00:00
Benjamin Glass
4959784dac
Add API query for available per-process CUDA memory ( #140620 )
...
Certain `cpp_wrapper`-enabled tests were OOM-ing in the CI pipeline, with error messages suggesting that sufficient memory was accessible. This ultimately resulted from an internal memory limitation that was not queryable in the API. This PR adds querying for that limit.
Additionally, the failing tests had incorrect memory availability checks, and are updated with measured memory requirements.
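The exact accessor added here is not shown in this log entry; a hedged sketch of the kind of check the fixed tests perform, using the existing free/total query:
```
import torch

# torch.cuda.mem_get_info() returns (free, total) bytes for the device; the
# per-process limit added by this PR is assumed to be queryable alongside it.
free_bytes, total_bytes = torch.cuda.mem_get_info()
required = 2 * 1024**3  # hypothetical measured requirement for a test (2 GiB)
if free_bytes < required:
    print(f"skipping test: needs {required} bytes, only {free_bytes} free")
```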
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140620
Approved by: https://github.com/malfet , https://github.com/eqy
ghstack dependencies: #141367
2024-12-03 00:24:03 +00:00
cyy
f95c71867e
[9/N] Fix extra warnings brought by clang-tidy-17 ( #139286 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139286
Approved by: https://github.com/ezyang
2024-10-31 05:20:31 +00:00
cyyever
456c87c8a2
[8/N] Fix extra warnings brought by clang-tidy-17 ( #139151 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139151
Approved by: https://github.com/ezyang
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-10-30 14:20:08 +00:00
Edward Yang
b14269dcfb
Make Context to be Device-agnostic Step by Step (1/N) ( #136519 ) ( #138155 )
...
Summary:
- make init to be device-agnostic and move it to AcceleratorHooksInterface
- refactoring context related to device initialization
Original pull request: https://github.com/pytorch/pytorch/pull/136519
Test Plan: contbuild & OSS CI, see 4a8e49389c
Reviewed By: malfet
Differential Revision: D64471142
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138155
Approved by: https://github.com/malfet , https://github.com/bobrenjc93
2024-10-17 20:58:56 +00:00
PyTorch MergeBot
d4d687ffb2
Revert "Make Context to be Device-agnostic Step by Step (1/N) ( #136519 )"
...
This reverts commit 4a8e49389c .
Reverted https://github.com/pytorch/pytorch/pull/136519 on behalf of https://github.com/clee2000 due to breaking internal tests related to MITA, @ezyang has a forward fix? ([comment](https://github.com/pytorch/pytorch/pull/136519#issuecomment-2414588302 ))
2024-10-15 17:19:16 +00:00
FFFrog
4a8e49389c
Make Context to be Device-agnostic Step by Step (1/N) ( #136519 )
...
----
- make init to be device-agnostic and move it to AcceleratorHooksInterface
- refactoring context related to device initialization
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136519
Approved by: https://github.com/ezyang , https://github.com/EikanWang , https://github.com/guangyey
2024-10-13 12:38:02 +00:00
PyTorch MergeBot
079f909263
Revert "Make Context to be Device-agnostic Step by Step (1/N) ( #136519 )"
...
This reverts commit be0b75256a .
Reverted https://github.com/pytorch/pytorch/pull/136519 on behalf of https://github.com/jovianjaison due to this pr is causing errors internally ([comment](https://github.com/pytorch/pytorch/pull/136519#issuecomment-2405781093 ))
2024-10-10 18:32:17 +00:00
Jin Zhou
5516ac5c21
[ROCm] Tunableop record untuned ( #128813 )
...
When TunableOp is enabled, it is easy to hit OOM, since applications usually need a large amount of GPU memory (for example, running an LLM for inference). So we need an offline mode to tune the GEMMs. This PR provides an offline mode for TunableOp (sketched below):
- record untuned GEMMs to a file.
- add a Python API named tune_gemm_in_file that reads the untuned file and tunes the GEMMs in it.
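A sketch of the two-step flow (the env var names are assumptions following existing PYTORCH_TUNABLEOP_* conventions, and the file name is hypothetical):
```
import os
import torch

# Step 1, recording run: enable TunableOp but record untuned GEMMs to a file
# instead of tuning them in place (set before any CUDA/GEMM use).
os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"
os.environ["PYTORCH_TUNABLEOP_RECORD_UNTUNED"] = "1"
# ... run the real workload here; untuned GEMM shapes are appended to a file ...

# Step 2, offline: tune every GEMM recorded in that file.
torch.cuda.tunable.tune_gemm_in_file("tunableop_untuned0.csv")
```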
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128813
Approved by: https://github.com/jeffdaily , https://github.com/hongxiayang , https://github.com/naromero77amd
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-10-09 21:59:03 +00:00
FFFrog
be0b75256a
Make Context to be Device-agnostic Step by Step (1/N) ( #136519 )
...
- make init to be device-agnostic and move it to AcceleratorHooksInterface
- refactoring context related to device initialization
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136519
Approved by: https://github.com/ezyang , https://github.com/EikanWang , https://github.com/guangyey
2024-10-09 02:13:36 +00:00
eellison
8893881867
Invalidate StorageImpl instances when tensor is overwritten with cudagraphs ( #125264 )
...
Fixes #104435
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125264
Approved by: https://github.com/ezyang
Co-authored-by: eellison <elias.ellison@gmail.com>
2024-10-09 00:05:52 +00:00
cyy
a2396b2dd8
[2/N] Fix extra warnings brought by clang-tidy-17 ( #137459 )
...
Follows #137407
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137459
Approved by: https://github.com/Skylion007
2024-10-08 19:05:02 +00:00
albanD
88e54de219
More nogil unsafe API fix ( #137142 )
...
Covers the PyDict APIs and confirms no update is needed for the PyModule ones.
The rest was already covered in https://github.com/pytorch/pytorch/pull/136899
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137142
Approved by: https://github.com/eqy , https://github.com/Skylion007
2024-10-04 21:56:34 +00:00
Jeff Daily
c7b0d4b148
raw_alloc ignores PYTORCH_NO_CUDA_MEMORY_CACHING ( #131114 )
...
raw_alloc is used by cudnn, miopen, thrust, and tunableop. Without this PR, the env var for disabling the caching allocator will only partially work.
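For context, the env var must be set before CUDA initializes to take effect:
```
import os

# Must be set before torch initializes CUDA.
os.environ["PYTORCH_NO_CUDA_MEMORY_CACHING"] = "1"

import torch

x = torch.empty(1024, device="cuda")  # served by cudaMalloc, not the cache
# With this PR, raw_alloc callers (cuDNN, MIOpen, thrust, TunableOp) bypass
# the caching allocator too.
```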
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131114
Approved by: https://github.com/eqy , https://github.com/houseroad , https://github.com/albanD
Co-authored-by: Nichols A. Romero <nick.romero@amd.com>
2024-10-04 15:36:29 +00:00
PyTorch MergeBot
0d1701f310
Revert "raw_alloc ignores PYTORCH_NO_CUDA_MEMORY_CACHING ( #131114 )"
...
This reverts commit 7001907480 .
Reverted https://github.com/pytorch/pytorch/pull/131114 on behalf of https://github.com/PaliC due to failing internal builds ([comment](https://github.com/pytorch/pytorch/pull/131114#issuecomment-2390615007 ))
2024-10-03 06:22:55 +00:00
Jeff Daily
7001907480
raw_alloc ignores PYTORCH_NO_CUDA_MEMORY_CACHING ( #131114 )
...
raw_alloc is used by cudnn, miopen, thrust, and tunableop. Without this PR, the env var for disabling the caching allocator will only partially work.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131114
Approved by: https://github.com/eqy , https://github.com/houseroad , https://github.com/albanD
Co-authored-by: Nichols A. Romero <nick.romero@amd.com>
2024-10-02 16:27:15 +00:00
Jack Taylor
a15774563b
[ROCm] Enable ROCm support for inductor's dynamic_rblock_scaling ( #129663 )
...
As of ROCm 6.1, [hipDeviceProp_t::regsPerMultiprocessor](https://rocm.docs.amd.com/projects/HIP/en/latest/doxygen/html/structhip_device_prop__t.html#a7390d5b180d63978c81aa971060270b4 ) is available, allowing us to enable this attribute on ROCm.
```
>>> torch.cuda.get_device_properties(0)
_CudaDeviceProperties(name='AMD Instinct MI250X/MI250', major=9, minor=0, gcnArchName='gfx90a:sramecc+:xnack-', total_memory=65520MB, multi_processor_count=104)
>>> torch.cuda.get_device_properties(0).regs_per_multiprocessor
65536
```
With https://github.com/triton-lang/triton/pull/3962 we can extract n_regs and n_spills from a Triton binary built with the AMD backend, allowing us to enable inductor's dynamic_rblock_scaling on ROCm, initially implemented in https://github.com/pytorch/pytorch/pull/115094
Leaving this in draft until the following PRs have landed:
- https://github.com/pytorch/pytorch/pull/129361 to bump the triton commit pin
- https://github.com/pytorch/pytorch/pull/128449 to allow us to grab warp_size from device properties instead of hard coding 64 on ROCm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129663
Approved by: https://github.com/jansel , https://github.com/shunting314
2024-09-13 16:45:39 +00:00
Yu, Guangye
6c1da66407
[Reland] Refactor caching device allocator utils ( #130923 )
...
# Motivation
Following [[RFC] Intel GPU Runtime Upstreaming for Allocator ](https://github.com/pytorch/pytorch/issues/116322 ), this PR aims to refactor the caching device allocator utils to improve code reuse.
This is the first PR; follow-up PRs will continue refactoring the device caching allocator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130923
Approved by: https://github.com/EikanWang , https://github.com/gujinghui , https://github.com/albanD , https://github.com/eqy
2024-09-07 11:14:17 +00:00
PyTorch MergeBot
e55c0f59e5
Revert "[Reland] Refactor caching device allocator utils ( #130923 )"
...
This reverts commit 9809080b9e .
Reverted https://github.com/pytorch/pytorch/pull/130923 on behalf of https://github.com/kit1980 due to breaking internal builds - Error: Relocation overflow has occured ([comment](https://github.com/pytorch/pytorch/pull/130923#issuecomment-2332640961 ))
2024-09-05 21:16:14 +00:00
Yu, Guangye
9809080b9e
[Reland] Refactor caching device allocator utils ( #130923 )
...
# Motivation
Following [[RFC] Intel GPU Runtime Upstreaming for Allocator ](https://github.com/pytorch/pytorch/issues/116322 ), this PR aims to refactor the caching device allocator utils to improve code reuse.
This is the first PR; follow-up PRs will continue refactoring the device caching allocator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130923
Approved by: https://github.com/EikanWang , https://github.com/gujinghui , https://github.com/albanD , https://github.com/eqy
2024-09-04 05:31:08 +00:00
Natalia Gimelshein
c25b64a057
expose host_emptyCache to python, fix a bug in freeing cudaHostRegist… ( #134919 )
...
…ered memory
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134919
Approved by: https://github.com/eqy
2024-09-01 09:07:25 +00:00
Natalia Gimelshein
29b7852dc1
drop gil in couple places (leads to deadlocks) ( #134910 )
...
Per title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134910
Approved by: https://github.com/eqy
2024-09-01 00:05:53 +00:00
Syed Tousif Ahmed
4655eb3ee2
Uses MemPoolContext to route allocations from CUDACachingAllocator ( #134685 )
...
Re-open of https://github.com/pytorch/pytorch/pull/133599 that was mistakenly closed by issuing `ghstack land`
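A hedged sketch of routing allocations through a pool, assuming the Python-side MemPool API that accompanies this work:
```
import torch

pool = torch.cuda.MemPool()  # a separate pool inside the CUDACachingAllocator

with torch.cuda.use_mem_pool(pool):
    x = torch.randn(1024, device="cuda")  # routed to `pool` via MemPoolContext

y = torch.randn(1024, device="cuda")      # back to the default pool
```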
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134685
Approved by: https://github.com/ezyang
2024-08-29 03:56:31 +00:00
PyTorch MergeBot
2c88a923a7
Revert "Refactor caching device allocator utils ( #130923 )"
...
This reverts commit c45ca8092d .
Reverted https://github.com/pytorch/pytorch/pull/130923 on behalf of https://github.com/ZainRizvi due to Sorry but this appears to be causing internal tests to fail with errors like `error: no type named 'DeviceStats' in namespace 'xxx::xxx:xxxAllocator'; did you mean 'DeviceStatus'?` ([comment](https://github.com/pytorch/pytorch/pull/130923#issuecomment-2315730155 ))
2024-08-28 15:56:08 +00:00
Yu, Guangye
c45ca8092d
Refactor caching device allocator utils ( #130923 )
...
# Motivation
Following [[RFC] Intel GPU Runtime Upstreaming for Allocator ](https://github.com/pytorch/pytorch/issues/116322 ), this PR aims to refactor the caching device allocator utils to improve code reuse.
This is the first PR; follow-up PRs will continue refactoring the device caching allocator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130923
Approved by: https://github.com/EikanWang , https://github.com/gujinghui , https://github.com/albanD , https://github.com/eqy
2024-08-28 01:35:23 +00:00
Jesse Cai
255cd75a97
[sparse] Add cuSPARSELt as a backend ( #128534 )
...
Summary:
This PR adds cuSPARSELt as a backend to PyTorch.
It is now possible to check whether cuSPARSELt is available, and which
version it is, with
```
torch.backends.cusparselt.is_available()
torch.backends.cusparselt.version()
```
Test Plan:
```
python test/test_sparse_semi_structured.py -k test_cusparselt_backend
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128534
Approved by: https://github.com/cpuhrsch , https://github.com/eqy , https://github.com/syed-ahmed
2024-08-21 22:06:07 +00:00
Mikayla Gawarecki
018e48c337
[Reland] Add wrappers for synchronous GPUDirect Storage APIs ( #133489 )
...
Reland #130633
USE_CUFILE turned off by default in this version
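A hedged sketch of the synchronous wrappers (class and method names assumed from the torch.cuda.gds module this introduces; requires a build with USE_CUFILE enabled, which is off by default here):
```
import os
import torch

t = torch.randn(1024, device="cuda")
f = torch.cuda.gds.GdsFile("/tmp/tensor.bin", os.O_CREAT | os.O_RDWR)
f.save_storage(t.untyped_storage(), offset=0)  # device-to-file via GPUDirect
f.load_storage(t.untyped_storage(), offset=0)  # file-to-device
```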
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133489
Approved by: https://github.com/albanD
2024-08-15 17:11:52 +00:00
Jez Ng
260e7cb143
Make CUDA device properties's __repr__ output actually printable ( #132863 )
...
Previously we would write the UUID bytes directly, leading to 'invalid
UTF-8 sequence' errors.
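The gist of the fix, illustrated (not the patch itself): format the 16 raw bytes as a canonical UUID string rather than writing them verbatim:
```
import uuid

raw = bytes(range(16))  # stand-in for cudaDeviceProp.uuid bytes
print(str(uuid.UUID(bytes=raw)))  # 00010203-0405-0607-0809-0a0b0c0d0e0f
```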
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132863
Approved by: https://github.com/Skylion007 , https://github.com/eqy
2024-08-07 21:08:43 +00:00
Nicolas Macchioni
527f104a69
add L2 cache size to device properties ( #132819 )
...
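A sketch of reading the new property (attribute name assumed to mirror cudaDeviceProp.l2CacheSize):
```
import torch

print(torch.cuda.get_device_properties(0).L2_cache_size)  # L2 size in bytes
```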
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132819
Approved by: https://github.com/eellison
2024-08-07 04:55:06 +00:00
PyTorch MergeBot
e191b83462
Revert "Add wrappers for synchronous GPUDirect Storage APIs ( #130633 )"
...
This reverts commit 709ddf7a9d .
Reverted https://github.com/pytorch/pytorch/pull/130633 on behalf of https://github.com/clee2000 due to still failing internally D60265673 ([comment](https://github.com/pytorch/pytorch/pull/130633#issuecomment-2253239607 ))
2024-07-26 18:08:20 +00:00
Mikayla Gawarecki
709ddf7a9d
Add wrappers for synchronous GPUDirect Storage APIs ( #130633 )
...
Based in part on https://github.com/NVIDIA/apex/pull/1774
Differential Revision: [D60155434](https://our.internmc.facebook.com/intern/diff/D60155434 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130633
Approved by: https://github.com/albanD
2024-07-25 22:23:38 +00:00
Aaron Enye Shi
fddb1bcdea
[CCA][Memory Snapshot] Move user_defined annotations to Native Caching Allocator ( #130964 )
...
Summary: Instead of embedding the user_defined TraceEntry inside of device_traces, which causes issues when some threads may not have the proper device id set, save them into an external_annotations field by using a RingBuffer<AnnotationEntry> called annotation_buffer owned by the NativeCachingAllocator.
Test Plan: CI, resnet run, and FBR model.
Differential Revision: D59703213
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130964
Approved by: https://github.com/zdevito
2024-07-25 14:06:52 +00:00
PyTorch MergeBot
e4b5645f83
Revert "Add wrappers for synchronous GPUDirect Storage APIs ( #130633 )"
...
This reverts commit 5b5e0698a5 .
Reverted https://github.com/pytorch/pytorch/pull/130633 on behalf of https://github.com/clee2000 due to breaking a lot of jobs and build rules internally D60085885, possibly needs to update some bazel build? ([comment](https://github.com/pytorch/pytorch/pull/130633#issuecomment-2245806738 ))
2024-07-23 17:19:34 +00:00
Mikayla Gawarecki
5b5e0698a5
Add wrappers for synchronous GPUDirect Storage APIs ( #130633 )
...
Based in part on https://github.com/NVIDIA/apex/pull/1774
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130633
Approved by: https://github.com/albanD
2024-07-22 14:51:24 +00:00
PyTorch MergeBot
7c299b46ca
Revert "Invalidate StorageImpl instances when tensor is overwritten with cudagraphs ( #125264 )"
...
This reverts commit 8390843eba .
Reverted https://github.com/pytorch/pytorch/pull/125264 on behalf of https://github.com/izaitsevfb due to breaks internal tests ([comment](https://github.com/pytorch/pytorch/pull/125264#issuecomment-2240516202 ))
2024-07-19 22:58:51 +00:00
PyTorch MergeBot
5f981388ec
Revert "[ROCm] Enable ROCm support for inductor's dynamic_rblock_scaling ( #129663 )"
...
This reverts commit d7a78ec8b9 .
Reverted https://github.com/pytorch/pytorch/pull/129663 on behalf of https://github.com/atalman due to Breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/129663#issuecomment-2240011143 ))
2024-07-19 19:46:26 +00:00
Jack Taylor
d7a78ec8b9
[ROCm] Enable ROCm support for inductor's dynamic_rblock_scaling ( #129663 )
...
As of ROCm 6.1, [hipDeviceProp_t::regsPerMultiprocessor](https://rocm.docs.amd.com/projects/HIP/en/latest/doxygen/html/structhip_device_prop__t.html#a7390d5b180d63978c81aa971060270b4 ) is available, allowing us to enable this attribute on ROCm.
```
>>> torch.cuda.get_device_properties(0)
_CudaDeviceProperties(name='AMD Instinct MI250X/MI250', major=9, minor=0, gcnArchName='gfx90a:sramecc+:xnack-', total_memory=65520MB, multi_processor_count=104)
>>> torch.cuda.get_device_properties(0).regs_per_multiprocessor
65536
```
With https://github.com/triton-lang/triton/pull/3962 we can extract n_regs and n_spills from a Triton binary built with the AMD backend, allowing us to enable inductor's dynamic_rblock_scaling on ROCm, initially implemented in https://github.com/pytorch/pytorch/pull/115094
Leaving this in draft until the following PRs have landed:
- https://github.com/pytorch/pytorch/pull/129361 to bump the triton commit pin
- https://github.com/pytorch/pytorch/pull/128449 to allow us to grab warp_size from device properties instead of hard coding 64 on ROCm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129663
Approved by: https://github.com/jansel , https://github.com/shunting314
2024-07-19 09:45:03 +00:00
Syed Tousif Ahmed
38b7d89aa4
Uses context pointer for deleter to enable multiple CUDAPluggableAllocator usage ( #130472 )
...
We should be able to create multiple CUDAPluggableAllocators in the same pytorch program (see https://github.com/pytorch/pytorch/issues/124807 , https://github.com/pytorch/pytorch/pull/125722 for context). When mixing CUDAPluggableAllocators in the same pytorch program, we need to make sure that the deleter passed in through the CUDAPluggableAllocator gets "attached" to the data_ptr and persist until program exit (when it's called to free the memory).
Currently, CUDAPluggableAllocator maintains a global `current_custom_allocator`. When creating a `DataPtr`, `raw_deleter` attaches `custom_raw_deleter` to it, which calls `current_custom_allocator->raw_delete(...)`. This approach is fine when using only one allocator; however, with multiple allocators, a DataPtr would use the deleter of whatever allocator `current_custom_allocator` happens to point at. For example, if allocation 1 was done with `cudaMalloc` and allocation 2 with `ncclMemAlloc`, and `current_custom_allocator` currently points to the CUDAPluggableAllocator with `ncclMemAlloc`, then cleaning up allocation 1 would use `ncclMemFree` instead of `cudaFree`.
In this PR, we solve the above problem by remembering the `free_fn_` using a deleter context. Hence, there is no need to go through an allocator object to find the deleter.
CC: @zdevito @ptrblck @eqy
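A hedged sketch of the two-allocator scenario this fixes; the .so path and symbol names are hypothetical, while the loader API is the existing CUDAPluggableAllocator one:
```
import torch

# Each allocator wraps a malloc/free pair exported from a shared library.
alloc_cuda = torch.cuda.memory.CUDAPluggableAllocator(
    "./my_alloc.so", "my_cuda_malloc", "my_cuda_free")
alloc_nccl = torch.cuda.memory.CUDAPluggableAllocator(
    "./my_alloc.so", "my_nccl_mem_alloc", "my_nccl_mem_free")

torch.cuda.memory.change_current_allocator(alloc_cuda)
x = torch.empty(1024, device="cuda")  # allocation 1, via my_cuda_malloc
# With the deleter context from this PR, freeing x always calls my_cuda_free,
# even if the current allocator later points at alloc_nccl.
```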
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130472
Approved by: https://github.com/eqy , https://github.com/ezyang
2024-07-18 11:33:21 +00:00
Yu, Guangye
f2552dcc3d
refactor cached tensor more generic ( #129359 )
...
# Motivation
Solves https://github.com/pytorch/pytorch/issues/129027 by refactoring the cached tensor to be generic.
# Additional Context
There is no API name change; this only decouples the feature from the CUDA build option.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129359
Approved by: https://github.com/eqy , https://github.com/EikanWang , https://github.com/albanD
2024-07-17 03:00:08 +00:00
Isuru Fernando
8390843eba
Invalidate StorageImpl instances when tensor is overwritten with cudagraphs ( #125264 )
...
Fixes #104435
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125264
Approved by: https://github.com/ezyang
2024-07-16 14:29:29 +00:00
PyTorch MergeBot
78799e82b0
Revert "Invalidate StorageImpl instances when tensor is overwritten with cudagraphs ( #125264 )"
...
This reverts commit 1bc390c5f5 .
Reverted https://github.com/pytorch/pytorch/pull/125264 on behalf of https://github.com/jithunnair-amd due to test test/inductor/test_cudagraph_trees.py::CudaGraphTreeTests::test_fallback_to_eager_if_recompiling_too_many_times is failing https://github.com/pytorch/pytorch/actions/runs/9933628108/job/27477785946 1bc390c5f5 . Test was introduced by fa5f572748 which is before the merge base ([comment](https://github.com/pytorch/pytorch/pull/125264#issuecomment-2229508737 ))
2024-07-15 21:59:46 +00:00
Isuru Fernando
1bc390c5f5
Invalidate StorageImpl instances when tensor is overwritten with cudagraphs ( #125264 )
...
Fixes #104435
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125264
Approved by: https://github.com/ezyang
2024-07-15 04:16:17 +00:00
Ramana Cherukuri
f6a0be5023
Add warpSize to Device properties ( #128449 )
...
Adding warp_size to CudaDeviceProperties.
```
>>> import torch
>>> prop = torch.cuda.get_device_properties(torch.cuda.current_device())
>>> prop.warp_size
64
```
@jeffdaily @pruthvistony @jithunnair-amd @ROCmSupport
Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128449
Approved by: https://github.com/eqy , https://github.com/jataylo , https://github.com/jithunnair-amd , https://github.com/malfet
2024-07-01 09:13:32 +00:00
Jeff Daily
169b4ca07e
add uuid in cudaDeviceProperties ( #125083 )
...
Replaces #99967 .
Fixes #99903 .
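Once exposed, the field is readable from Python (attribute name assumed from the cudaDeviceProp field it mirrors):
```
import torch

# A uuid value reported per device.
print(torch.cuda.get_device_properties(0).uuid)
```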
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125083
Approved by: https://github.com/pruthvistony , https://github.com/albanD , https://github.com/eqy , https://github.com/malfet
2024-06-27 23:53:13 +00:00
Aaron Enye Shi
f42d5b6dca
[Memory Snapshot] Make recordAnnotations callback initialize lazily ( #129242 )
...
Summary: Make the recordAnnotations RecordFunction callback initialize lazily, when memory history recording starts. This will help reduce the impact on the Time To First Batch metric.
Test Plan: CI and ran locally.
Differential Revision: D58875576
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129242
Approved by: https://github.com/zdevito
2024-06-22 04:05:55 +00:00
Aaron Enye Shi
b5d541609d
[Memory Snapshot] Add recordAnnotations to capture record_function annotations ( #129072 )
...
Summary:
Add new traceEvents into Memory Snapshot for record_function annotations. These will capture both the profiler's step annotations and user annotations.
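A sketch of how an annotation ends up in a snapshot, using the existing private snapshot entry points:
```
import torch

torch.cuda.memory._record_memory_history()        # start capturing history
with torch.profiler.record_function("## forward ##"):
    x = torch.randn(1024, 1024, device="cuda")    # allocation tagged by the annotation
torch.cuda.memory._dump_snapshot("snapshot.pickle")  # view at pytorch.org/memory_viz
```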
Test Plan: CI
Pulled By: aaronenyeshi
Differential Revision: D55941362
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129072
Approved by: https://github.com/zdevito
2024-06-19 18:05:41 +00:00