Canonically, the snapshot API returns the entire memory state of the CUDACachingAllocator (using `get_all_blocks`). There is no API that can only return the memory state of a given pool.
In this PR, we extend the functionality of snapshot API such that it can only return the memory addresses of an active pool. When snapshot API is called under a MemPoolContext, we only return the blocks that correspond to the pool id of the active pool.
Part of https://github.com/pytorch/pytorch/issues/124807.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133601
Approved by: https://github.com/ezyang
Canonically, the snapshot API returns the entire memory state of the CUDACachingAllocator (using `get_all_blocks`). There is no API that can only return the memory state of a given pool.
In this PR, we extend the functionality of snapshot API such that it can only return the memory addresses of an active pool. When snapshot API is called under a MemPoolContext, we only return the blocks that correspond to the pool id of the active pool.
Part of https://github.com/pytorch/pytorch/issues/124807.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133601
Approved by: https://github.com/ezyang
This PR refactors some ref-counting functionality out of `beginAllocateToPool` and `releasePool`. The ref-counting logic is then used in construction and destruction of `torch.cuda.MemPool`.
The `use_count` variable in the CUDACachingAllocator is essentially a refcount of how many context managers are using the pool. Since we are now lifting up the MemPool abstraction to the user, the MemPool object itself now needs to hold a an extra reference as well.
Part of https://github.com/pytorch/pytorch/issues/124807.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133600
Approved by: https://github.com/eqy, https://github.com/ezyang
This fixes 4 main issues:
- The way the cuda sanitizer handle it's state is weird. In particular, because the lifetime of the Mode is linked to the submodule, then this might outlive the python runtime and other modules loaded. On my current version, this even outlives the "sys" module. Given that I'm not sure the impact of changing this lifetime handling, I'm making the exit handler a no-op when python is already dying and thus no point cleaning up.
- Adds a "disable" method to be able to test after the mode is enabled.
- Fix `Tensor.as_sublass()` to properly disable modes when creating the new Tensor object just like we already do in `make_subclass` and `make_wrapper_subclass`. The change here is just to apply the exact same treatment to it.
- ~Fix `Tensor.as_subclass()` not to propagate autograd as there is no valid backward associated here.~ We have test that check that this behavior happen so I guess this is not an obvious bugfix and expected behavior. Reverted that change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138218
Approved by: https://github.com/ngimel
When enable tunableop, It is easy to have OOM since APP usually needs large video memory size, such as running a LLM for inference. So we need a offline mode to tune the GEMMs. This PR provide an offline mode for tunableOp:
- record untuned GEMMs to file.
- a python API named tune_gemm_in_file is added to read the untuned file and tune the GEMMs in file
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128813
Approved by: https://github.com/jeffdaily, https://github.com/hongxiayang, https://github.com/naromero77amd
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
https://github.com/pytorch/pytorch/pull/132990 introduced dependency on `torch.version`, which might not be imported yet, and can result in `AttributeError: partially initialized module 'torch' has no attribute 'version' (most likely due to a circular import)` if user starts its code with `import torch.cuda`
Fix it by importing `torch.version` explicitly
Test Plan: CI
Differential Revision: D61549284
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134019
Approved by: https://github.com/seemethere
This is a bugfix that was recently encountered in ROCm/Deepspeed. Currently if a library installs pynvml and runs on ROCm pytorch will break as _HAS_PYNVML is set to true and it will attempt to use amdsmi library for the device_count call which will not be installed.
This fix will set _HAS_PYNVML to false on ROCm if amdsmi is not installed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132990
Approved by: https://github.com/pruthvistony, https://github.com/eqy, https://github.com/malfet
`torch.cuda.memory.mem_get_info` allows device strings given the current type hints. However, `device = torch.device('cuda')` leads to `device.index = None`, which results in downstream problems. Setting `optional=True` will insert the default device index in such cases.
Fixes#132583
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132616
Approved by: https://github.com/soulitzer
Summary:
It is a long known pain point that if other users are running things, the call of `torch.cuda.memory.list_gpu_processes()` will error out:
```
torch.cuda.memory.list_gpu_processes()
File "torch/cuda/memory.py", line 647, in list_gpu_processes
procs = amdsmi.amdsmi_get_gpu_process_list(handle) # type: ignore[attr-defined]
File "amdsmi/py_interface/amdsmi_interface.py", line 1946, in amdsmi_get_gpu_process_list
_check_res(
File "amdsmi/py_interface/amdsmi_interface.py", line 510, in _check_res
raise AmdSmiLibraryException(ret_code)
amdsmi.py_interface.amdsmi_exception.AmdSmiLibraryException: Error code:
10 | AMDSMI_STATUS_NO_PERM - Permission Denied
```
So just catch this error
Test Plan: torch.cuda.memory.list_gpu_processes() no longer fails
Differential Revision: D59901053
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131018
Approved by: https://github.com/eqy, https://github.com/clee2000
Volta(sm_7x) do not have a HW support for bfloat16 datatype, and while it is is emulated to ted in software, so PyTorch eager can use bfloat16 tensors, but not in Triton. So if graph with either CUDA bf16 input or output tensors is used, raise warnings and skip the frame.
Add optional parameter `including_emulation` to `torch.cuda.is_bf16_supported` method and call it from `torch._inductor.compile_fx. _check_triton_bf16_support`.
Test plan: Modify `is_bf16_supported` to return False and see that warning is generated
Fixes https://github.com/pytorch/pytorch/issues/118122 and https://github.com/pytorch/pytorch/issues/118581
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129288
Approved by: https://github.com/eqy, https://github.com/jansel
Fixes#127908
## Description
Created docs to document the torch.cuda.cudart function to solve the issue #127908.
I tried to stick to the [guidelines to document a function](https://github.com/pytorch/pytorch/wiki/Docstring-Guidelines#documenting-a-function) but I was not sure if there is a consensus on how to handle the docs of a function that calls an internal function. So I went ahead and tried what the function will raise, etc. from the user endpoint and documented it (i.e. I am giving what actually _lazy_init() will raise).
Updated PR from #128298 since I made quite a big mistake in my branch. I apologize for the newbie mistake.
### Summary of Changes
- Added docs for torch.cuda.cudart
- Added the cudart function in the autosummary of docs/source/cuda.rst
## Checklist
- [X] The issue that is being fixed is referred in the description
- [X] Only one issue is addressed in this pull request
- [X] Labels from the issue that this PR is fixing are added to this pull request
- [X] No unnecesary issues are included into this pull request
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128741
Approved by: https://github.com/msaroufim
Use `typing_extensions.deprecated` for deprecation annotation if possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` if the category is missing.
Note that only warnings that their messages contain `[Dd]eprecat(ed|ion)` are updated in this PR.
Resolves#126888
- #126888
This PR is split from PR #126898.
- #126898
------
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127689
Approved by: https://github.com/Skylion007