Commit Graph

262 Commits

Author SHA1 Message Date
Marko Radmilac
c65ee728f0 Initial implementation of host memory stats (#147660)
This is an initial attempt to provide some statistics for the pinned host memory allocations flowing through CachingHostAllocator. Many times in the past we have had inexplicable slowdowns that would have been much easier to diagnose with some host memory characteristics in hand.

This change tries very hard not to disrupt the original design of the allocator, and it uses the existing locking mechanisms whenever possible to gather statistics "for free". The only deviation is on the "slow path", where we incur CUDA calls anyway, so taking a short lock there is not going to hurt performance much, especially in the steady state where most allocations come from the cache.

As mentioned, this is a first PR, meant to introduce the concept and to see whether it fits the right paradigm. We can always add more later.

Metrics that would require more involved changes to the code base and locking, such as requested memory, have been punted for now. I also tried to reuse the Stat structure from the CUDA caching allocator, in order to maintain symmetry.
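
A hedged sketch of how these counters might be consumed, assuming the stats are exposed through an accessor named after the PR title, e.g. `torch.cuda.host_memory_stats()`, with keys modeled on `torch.cuda.memory_stats()` (both the accessor name and the key are assumptions, not confirmed API):

```python
import torch

# Pinned allocations flow through CachingHostAllocator, which is where the
# new counters are gathered.
buf = torch.empty(1 << 20, pin_memory=True)

stats = torch.cuda.host_memory_stats()  # assumed accessor name
print(stats.get("allocated_bytes.current", "n/a"))  # assumed key name
```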

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147660
Approved by: https://github.com/ngimel
2025-03-05 16:13:19 +00:00
cyy
ec2805ada8 Remove outdated CUDA version check (#148142)
Since PyTorch requires CUDA >= 11, some version checks can be removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148142
Approved by: https://github.com/janeyx99, https://github.com/eqy
2025-03-04 03:33:44 +00:00
PyTorch MergeBot
a983b2b11a Revert "Initial implementation of host memory stats (#147660)"
This reverts commit 945e359fc1.

Reverted https://github.com/pytorch/pytorch/pull/147660 on behalf of https://github.com/mradmila due to There is an issue with ambiguous definition of Stat structure when different C++ tools are used. Backing out for now. ([comment](https://github.com/pytorch/pytorch/pull/147660#issuecomment-2692346379))
2025-03-01 18:05:45 +00:00
Marko Radmilac
945e359fc1 Initial implementation of host memory stats (#147660)
This is an initial attempt to provide some statistics for the pinned host memory allocations flowing through CachingHostAllocator. Many times in the past we have had inexplicable slowdowns that would have been much easier to diagnose with some host memory characteristics in hand.

This change tries very hard not to disrupt the original design of the allocator, and it uses the existing locking mechanisms whenever possible to gather statistics "for free". The only deviation is on the "slow path", where we incur CUDA calls anyway, so taking a short lock there is not going to hurt performance much, especially in the steady state where most allocations come from the cache.

As mentioned, this is a first PR, meant to introduce the concept and to see whether it fits the right paradigm. We can always add more later.

Metrics that would require more involved changes to the code base and locking, such as requested memory, have been punted for now. I also tried to reuse the Stat structure from the CUDA caching allocator, in order to maintain symmetry.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147660
Approved by: https://github.com/ngimel
2025-02-28 18:36:44 +00:00
Aaron Orenstein
db4ce78d46 PEP585: More UP006 fixes (#146392)
This should be the final PR before we can enable RUFF UP006.
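
A minimal before/after illustrating the kind of rewrite UP006 enforces (illustrative, not taken from the diff):

```python
# Before: PEP 484 typing generics
from typing import Dict, List

def squares_old(xs: List[int]) -> Dict[int, int]:
    return {x: x * x for x in xs}

# After: PEP 585 builtin generics; no typing import needed (Python 3.9+)
def squares(xs: list[int]) -> dict[int, int]:
    return {x: x * x for x in xs}
```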

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146392
Approved by: https://github.com/justinchuby, https://github.com/albanD, https://github.com/Skylion007
2025-02-20 06:18:13 +00:00
Jane Xu
c8433c2c6c [BE] correct docs for clock_rate to MHz, fixes #147098 (#147393)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147393
Approved by: https://github.com/andrewor14
2025-02-18 22:59:58 +00:00
Dan Zimmerman
6f035d8462 [torch] Make amdsmi cdll hook private (#147207)
Summary: https://github.com/pytorch/pytorch/actions/runs/13314282597/job/37186177974 yelled at me for landing a seemingly public API that's not exported. It's a private API, so let's prepend `_` to make that clear.

Test Plan: CI

Differential Revision: D69665234

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147207
Approved by: https://github.com/PaulZhang12
2025-02-14 20:30:48 +00:00
Aaron Gokaslan
6344ca1dd4 [BE][Ez]: Apply FURB188: use str remove(pre|suf)fix (#146997)
Since we are on Python 3.9, we can use these nice str builtins, which are more readable and more efficient.
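
An illustrative before/after (not from the actual diff):

```python
device = "cuda:0"

# Before: manual slicing guarded by startswith
if device.startswith("cuda:"):
    index = device[len("cuda:"):]

# After: Python 3.9+ builtin; the prefix check is built in
index = device.removeprefix("cuda:")
```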

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146997
Approved by: https://github.com/XuehaiPan, https://github.com/cyyever, https://github.com/jansel
2025-02-14 03:38:07 +00:00
Dan Zimmerman
6419076db9 [torch][amdsmi] Look for amdsmi in ROCM_HOME/ROCM_PATH before using rpath (#147117)
Summary: ROCm uses ROCM_HOME/ROCM_PATH to specify which version of ROCm the user wants to use. This is especially important in multi-version setups. Let's respect that behavior when loading amdsmi.
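
A minimal sketch of the lookup order this implies, assuming the loader resolves `libamd_smi.so` by hand (paths and helper name are illustrative):

```python
import ctypes
import os

def _load_libamd_smi():
    # Prefer the copy from the ROCm install the user selected...
    for var in ("ROCM_HOME", "ROCM_PATH"):
        root = os.environ.get(var)
        if root:
            candidate = os.path.join(root, "lib", "libamd_smi.so")
            if os.path.exists(candidate):
                return ctypes.CDLL(candidate)
    # ...and only then fall back to whatever rpath resolves to.
    return ctypes.CDLL("libamd_smi.so")
```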

Test Plan:
CI
```
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL MSCCL_ALGO_DIR=~/2fbsource/third-party/rccl/develop/tools/msccl-algorithms RCCL_MSCCLPP_THRESHOLD=(math '128*1024*1024')  RCCL_MSCCLPP_ENABLE=1 ENABLE_MSCCLPP=1 buck2 run fbcode//mode/opt-amd-gpu -m rocm621 fbcode//accelerators/workloads/microbench:bench_comm -- --shape moe_17b --comm_algo nccl_allreduce
```

Differential Revision: D69597647

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147117
Approved by: https://github.com/malfet
2025-02-14 01:11:59 +00:00
Dan Zimmerman
281249ba54 [torch][amdsmi] Avoid ODR violation when loading amdsmi (#146324)
Summary:
amdsmi bundles its own copy of `libamd_smi.so`. When you're interacting with `amdsmi` from *only* Python, that's fine, but when you try to interact with `libamd_smi.so` from native code too, this poses a problem, because from native code you'll be linking against the copy of `libamd_smi.so` from the SDK.

This means you'll end up with two copies of `libamd_smi.so` in your process and potentially (Murphy's law says you will, as does our CI) violate the ODR.

In order to avoid this issue from the PT side of the world, we can hook the `dlopen("path/to/bundled/libamd_smi.so")` call and try the already loaded/SDK version of `libamd_smi.so` first, before falling back to `path/to/bundled/libamd_smi.so`.
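
A hedged sketch of such a hook in Python terms (the hook's actual location and shape may differ; names here are illustrative):

```python
import ctypes

_real_cdll = ctypes.CDLL

def _hooked_cdll(name, *args, **kwargs):
    # If amdsmi asks for its bundled copy, try the bare soname first so the
    # dynamic linker can reuse the already-loaded SDK copy.
    if isinstance(name, str) and name.endswith("libamd_smi.so"):
        try:
            return _real_cdll("libamd_smi.so", *args, **kwargs)
        except OSError:
            pass  # no SDK copy around; fall through to the bundled path
    return _real_cdll(name, *args, **kwargs)

ctypes.CDLL = _hooked_cdll  # must be installed before `import amdsmi`
```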

Test Plan: CI, inspect process using libamd_smi.so from native + python and observe only a single copy loaded

Differential Revision: D69064038

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146324
Approved by: https://github.com/malfet
2025-02-12 00:01:02 +00:00
Benjamin Glass
5aa5a5763e [inductor triton] Disable incorrect TF32 usage on CUDA capability < 8 (#145684)
Triton 2.2 and greater have a bug where allowing TF32 generation for a GPU that does not support TF32 will cause code generation errors. Patch around this problem by:

1. Adding a function to `torch.cuda` that determines whether CUDA hardware is capable of using the TF32 format.
2. Using that function to explicitly disable TF32 generation when calling Triton, where needed.
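
A minimal sketch of the capability test item 1 describes, assuming it reduces to a compute-capability check (TF32 tensor cores arrived with Ampere, SM 8.0):

```python
import torch

def tf32_capable(device=None) -> bool:
    # TF32 is only available on compute capability >= 8.0 (Ampere and newer).
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability(device)
    return major >= 8
```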

To demonstrate that this fix works, try running `test/inductor/test_max_autotune.py` on a GPU with CUDA compute capability < 8 (e.g. any NVIDIA consumer GPU) without this fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145684
Approved by: https://github.com/eqy
2025-01-28 22:01:08 +00:00
Aaron Orenstein
805c4b597a PEP585 update - torch/_higher_order_ops torch/_subclasses torch/backends torch/compiler torch/cuda torch/masked torch/mtia torch/nested (#145202)
See #145101 for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145202
Approved by: https://github.com/bobrenjc93
2025-01-20 22:37:26 +00:00
Yu, Guangye
3848de55ed Add get_stream_from_external API for CUDA backend (#143799)
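
A usage sketch, assuming the API mirrors `torch.cuda.ExternalStream` and takes a raw `cudaStream_t` pointer plus a device (signature assumed, not verified):

```python
import ctypes
import torch

# Create a raw CUDA stream through the runtime (the soname may need a
# version suffix on some systems), then hand it to PyTorch.
libcudart = ctypes.CDLL("libcudart.so")
raw = ctypes.c_void_p()
assert libcudart.cudaStreamCreate(ctypes.byref(raw)) == 0

ext = torch.cuda.get_stream_from_external(raw.value, torch.device("cuda:0"))
with torch.cuda.stream(ext):
    y = torch.ones(4, device="cuda").sum()
torch.cuda.synchronize()
```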
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143799
Approved by: https://github.com/albanD, https://github.com/EikanWang
ghstack dependencies: #142347, #141119, #141123
2024-12-31 11:15:59 +00:00
Tal Ben-Nun
c0d710634f Respect ROCR_VISIBLE_DEVICES on AMD GPU device discovery (#142292)
Reland of #140320 after a failing test on trunk. Fixes potential environment clobbering in the test and makes ROCr+HIP device selection (if specified together) more robust to index errors.

Fixes #140318

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142292
Approved by: https://github.com/jataylo, https://github.com/huydhn, https://github.com/jeffdaily

Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2024-12-25 02:37:11 +00:00
Michael Suo
9933e59c2b [torch][cuda] fix race condition in cuda initialization (#143238)
Access to the lazy-init callbacks (`_lazy_seed_tracker` and `_queued_calls`) is not synchronized with the initialization lock.

This exposes us to the following race:
1. start `_lazy_init`
2. take `_initialization_lock`
3. flush `_queued_calls` and run them all
4. another thread comes in and uses `_lazy_call` to put something on the queue (in our case, the `manual_seed`)
5. original thread finishes initializing, but never runs that call
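
One way to close this race, as a simplified sketch (names mirror the private globals above; the actual fix may differ in detail): make `_lazy_call` decide enqueue-vs-run under the same lock `_lazy_init` holds while flushing.

```python
import threading

_initialization_lock = threading.Lock()
_initialized = False
_queued_calls = []

def _lazy_call(fn):
    with _initialization_lock:
        if _initialized:
            fn()                      # init already finished: run now
        else:
            _queued_calls.append(fn)  # init pending: queue for the flush

def _lazy_init():
    global _initialized
    with _initialization_lock:
        if _initialized:
            return
        for fn in _queued_calls:      # flushed under the lock, so no enqueue
            fn()                      # can slip in between steps 3 and 5
        _queued_calls.clear()
        _initialized = True
```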

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143238
Approved by: https://github.com/ngimel
2024-12-14 07:41:24 +00:00
Jane Xu
fd65bd755d [BE] replace incorrect .. note:: invocations (#142868)
Something I've noticed is that a lot of the distributed sites don't render on our docs at all, but if they ever do, the notes will render properly now 😛

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142868
Approved by: https://github.com/albanD
2024-12-11 19:58:18 +00:00
PyTorch MergeBot
40d1b5f490 Revert "Respect ROCR_VISIBLE_DEVICES on AMD GPU device discovery (#140320)"
This reverts commit add4a42ea2.

Reverted https://github.com/pytorch/pytorch/pull/140320 on behalf of https://github.com/huydhn due to Sorry for reverting your change but test_hip_device_count is failing in trunk after this land ([comment](https://github.com/pytorch/pytorch/pull/140320#issuecomment-2524742845))
2024-12-07 01:28:51 +00:00
Tal Ben-Nun
add4a42ea2 Respect ROCR_VISIBLE_DEVICES on AMD GPU device discovery (#140320)
Fixes #140318

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140320
Approved by: https://github.com/eqy, https://github.com/jithunnair-amd, https://github.com/jataylo, https://github.com/jeffdaily

Co-authored-by: Jack Taylor <jack.taylor@amd.com>
2024-12-06 20:09:56 +00:00
Benjamin Glass
4959784dac Add API query for available per-process CUDA memory (#140620)
Certain `cpp_wrapper`-enabled tests were OOM-ing in the CI pipeline, with error messages suggesting that sufficient memory was accessible.  This ultimately resulted from an internal memory limitation that was not queryable in the API.  This PR adds querying for that limit.
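
A hedged illustration, using existing public APIs, of the kind of arithmetic involved (the exact query API added by this PR may differ):

```python
import torch

torch.cuda.set_per_process_memory_fraction(0.5, device=0)  # impose a cap
free, total = torch.cuda.mem_get_info(0)     # device-wide free/total bytes
allocated = torch.cuda.memory_allocated(0)   # this process, via the allocator
limit = 0.5 * total                          # the per-process cap set above
usable = min(free, limit - allocated)
print(f"~{usable / 2**20:.0f} MiB still usable by this process")
```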

Additionally, the failing tests had incorrect memory availability checks, and are updated with measured memory requirements.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140620
Approved by: https://github.com/malfet, https://github.com/eqy
ghstack dependencies: #141367
2024-12-03 00:24:03 +00:00
Jack Taylor
04f569a524 [ROCm] AMDSMI memory usage unification (#139900)
Fixes https://github.com/pytorch/pytorch/issues/140638

The old implementation used vram_used, which is not the correct equivalent of the pynvml memory-utilization API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139900
Approved by: https://github.com/jeffdaily, https://github.com/eqy
2024-11-21 21:11:39 +00:00
Aaron Gokaslan
12e95aa4ee [BE]: Apply PERF401 autofixes from ruff (#140980)
* Automatically applies ruff rule PERF401, turning loops into equivalent list comprehensions, which are faster and do not leak the loop variables into the enclosing scope.
* List comprehensions not only often type-check better, they also carry 50%+ less loop overhead, preserve length information, and are easier for the interpreter to optimize.
* Manually went back and made mypy happy after the change.
* Also fixed style lints in files covered by flake8 but not by pyfmt
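
An illustrative before/after of the PERF401 rewrite (not from the actual diff):

```python
items = [-3, 1, 4, -1, 5]

# Before: explicit loop; `x` leaks into the enclosing scope
doubled = []
for x in items:
    if x > 0:
        doubled.append(x * 2)

# After: equivalent list comprehension; faster and scope-clean
doubled = [x * 2 for x in items if x > 0]
```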

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140980
Approved by: https://github.com/justinchuby, https://github.com/malfet
2024-11-20 17:52:07 +00:00
Yu Guo
808da50c2d create a new torch.cuda.device_memory_used api (#140870)
Summary:
The current torch.cuda.memory_usage returns memory utilization: more specifically, on NVIDIA, the percentage of time over the past sample period during which global memory was being read or written.
See more details in https://github.com/pytorch/pytorch/issues/140638
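
A hedged usage sketch contrasting the two, assuming the new API lands as `torch.cuda.device_memory_used` per the title:

```python
import torch

busy_pct = torch.cuda.memory_usage(0)    # % of sampled time memory was busy
used = torch.cuda.device_memory_used(0)  # bytes in use on the device (new API)
print(f"utilization={busy_pct}%  used={used / 2**20:.0f} MiB")
```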

Test Plan: added a new unittest

Differential Revision: D65960134

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140870
Approved by: https://github.com/ngimel, https://github.com/eqy
2024-11-19 06:36:30 +00:00
PyTorch MergeBot
43de32d948 Revert "create a new torch.cuda.device_memory_used api (#140870)"
This reverts commit 478204cad6.

Reverted https://github.com/pytorch/pytorch/pull/140870 on behalf of https://github.com/yuguo68 due to the test is still flaky on ROCm, test_cuda.py::TestCudaMallocAsync is not skipped with the unittest.skipIf(TEST_CUDAMALLOCASYNC ([comment](https://github.com/pytorch/pytorch/pull/140870#issuecomment-2484161914))
2024-11-18 21:26:25 +00:00
Yu Guo
478204cad6 create a new torch.cuda.device_memory_used api (#140870)
Summary:
The current torch.cuda.memory_usage returns memory utilization: more specifically, on NVIDIA, the percentage of time over the past sample period during which global memory was being read or written.
See more details in https://github.com/pytorch/pytorch/issues/140638

Test Plan: added a new unittest

Differential Revision: D65960134

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140870
Approved by: https://github.com/ngimel
2024-11-18 19:13:43 +00:00
PyTorch MergeBot
03b7ec9237 Revert "create a new torch.cuda.memory_usage_in_bytes api (#140719)"
This reverts commit 9febc47637.

Reverted https://github.com/pytorch/pytorch/pull/140719 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the test is flaky on ROCm ([comment](https://github.com/pytorch/pytorch/pull/140719#issuecomment-2479832082))
2024-11-15 20:05:32 +00:00
Yu Guo
9febc47637 create a new torch.cuda.memory_usage_in_bytes api (#140719)
Summary:
The current torch.cuda.memory_usage returns memory utilization: more specifically, on NVIDIA, the percentage of time over the past sample period during which global memory was being read or written.

See more details in https://github.com/pytorch/pytorch/issues/140638

Test Plan: added a new unittest

Differential Revision: D65928031

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140719
Approved by: https://github.com/xw285cornell, https://github.com/hongxiayang
2024-11-15 05:59:40 +00:00
Tom Ritchford
c0582fd0f8 Remove unused Python variables in torch/[b-z]* (#136963)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136963
Approved by: https://github.com/ezyang
2024-10-19 16:45:22 +00:00
Jack Taylor
966a1a971e [ROCm] Add AMDSMI support for UUID input (#129741)
Adds support for using UUIDs with the AMDSMI utilities in PyTorch via CUDA_VISIBLE_DEVICES/HIP_VISIBLE_DEVICES.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129741
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
2024-10-15 15:56:30 +00:00
Jeff Daily
c7b0d4b148 raw_alloc ignores PYTORCH_NO_CUDA_MEMORY_CACHING (#131114)
raw_alloc is used by cudnn, miopen, thrust, and tunableop.  Without this PR, the env var for disabling the caching allocator will only partially work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131114
Approved by: https://github.com/eqy, https://github.com/houseroad, https://github.com/albanD

Co-authored-by: Nichols A. Romero <nick.romero@amd.com>
2024-10-04 15:36:29 +00:00
PyTorch MergeBot
0d1701f310 Revert "raw_alloc ignores PYTORCH_NO_CUDA_MEMORY_CACHING (#131114)"
This reverts commit 7001907480.

Reverted https://github.com/pytorch/pytorch/pull/131114 on behalf of https://github.com/PaliC due to failing internal builds ([comment](https://github.com/pytorch/pytorch/pull/131114#issuecomment-2390615007))
2024-10-03 06:22:55 +00:00
Jeff Daily
7001907480 raw_alloc ignores PYTORCH_NO_CUDA_MEMORY_CACHING (#131114)
raw_alloc is used by cudnn, miopen, thrust, and tunableop.  Without this PR, the env var for disabling the caching allocator will only partially work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131114
Approved by: https://github.com/eqy, https://github.com/houseroad, https://github.com/albanD

Co-authored-by: Nichols A. Romero <nick.romero@amd.com>
2024-10-02 16:27:15 +00:00
drisspg
d05645841e Update get_device_properties to take in optional device (#136683)
Aligns behavior with the rest of CUDA's device-info query methods.
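
A brief usage sketch of the relaxed signature:

```python
import torch

props = torch.cuda.get_device_properties()  # now defaults to the current device
same = torch.cuda.get_device_properties(torch.cuda.current_device())
assert props.name == same.name
```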

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136683
Approved by: https://github.com/eqy
2024-09-26 15:07:31 +00:00
Jeff Daily
15dba021bb [ROCm][CI] upgrade CI to ROCm 6.2 (#132555)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132555
Approved by: https://github.com/pruthvistony, https://github.com/malfet
2024-09-20 17:39:31 +00:00
Dan Zimmerman
fc88ba260f [amdsmi][torch] Update amdsmi API usages (#135504)
Summary: ROCm 6.2.0 renamed some APIs; in this diff we check whether the new APIs exist and use them when available (see 7b2463abe0 for the changes).
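
A hedged sketch of the feature-detection pattern this implies; the function names below are hypothetical placeholders, not the actual pairs from the diff:

```python
import amdsmi

# Hypothetical old/new name pair, purely to illustrate the shim.
if hasattr(amdsmi, "amdsmi_get_gpu_metrics_info"):
    _get_metrics = amdsmi.amdsmi_get_gpu_metrics_info  # ROCm 6.2 name
else:
    _get_metrics = amdsmi.amdsmi_dev_get_gpu_metrics_info  # older fallback
```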

Test Plan: CI

Differential Revision: D62325661

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135504
Approved by: https://github.com/eqy, https://github.com/houseroad
2024-09-10 19:15:39 +00:00
Syed Tousif Ahmed
4655eb3ee2 Uses MemPoolContext to route allocations from CUDACachingAllocator (#134685)
Re-open of https://github.com/pytorch/pytorch/pull/133599 that was mistakenly closed by issuing `ghstack land`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134685
Approved by: https://github.com/ezyang
2024-08-29 03:56:31 +00:00
Nikita Shulga
f7c1f32803 Fix partially initialized module error (#134019)
https://github.com/pytorch/pytorch/pull/132990 introduced a dependency on `torch.version`, which might not be imported yet and can result in `AttributeError: partially initialized module 'torch' has no attribute 'version' (most likely due to a circular import)` if a user's code starts with `import torch.cuda`.

Fix it by importing `torch.version` explicitly.
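
A minimal sketch of the failure mode and the fix:

```python
# Before the fix, torch/cuda/__init__.py read `torch.version` without
# importing it, so a program whose first line was `import torch.cuda`
# could observe a partially initialized `torch` package.
import torch.version  # the fix: make the dependency explicit

print(torch.version.cuda)  # safe once torch.version is imported
```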

Test Plan: CI

Differential Revision: D61549284

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134019
Approved by: https://github.com/seemethere
2024-08-20 22:20:02 +00:00
Jack Taylor
92151c814b [ROCm] Set _HAS_PYNVML to false if amdsmi not installed (#132990)
This is a bugfix for an issue recently encountered in ROCm/DeepSpeed. Currently, if a library installs pynvml and a workload runs on ROCm, PyTorch breaks: _HAS_PYNVML is set to true, and the device_count call attempts to use the amdsmi library, which will not be installed.

This fix sets _HAS_PYNVML to false on ROCm if amdsmi is not installed.
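
A simplified sketch of the resulting guard (modeled on the description; the real code differs in detail):

```python
import torch

try:
    import pynvml  # noqa: F401
    _HAS_PYNVML = True
except ImportError:
    _HAS_PYNVML = False

if _HAS_PYNVML and torch.version.hip is not None:
    # On ROCm, the NVML-style calls are served by amdsmi, so require it too.
    try:
        import amdsmi  # noqa: F401
    except ImportError:
        _HAS_PYNVML = False
```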

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132990
Approved by: https://github.com/pruthvistony, https://github.com/eqy, https://github.com/malfet
2024-08-19 09:45:58 +00:00
Mikayla Gawarecki
018e48c337 [Reland] Add wrappers for synchronous GPUDirect Storage APIs (#133489)
Reland #130633

USE_CUFILE turned off by default in this version
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133489
Approved by: https://github.com/albanD
2024-08-15 17:11:52 +00:00
Xuehai Pan
f3fce597e9 [BE][Easy][17/19] enforce style for empty lines in import segments in torch/[a-c]*/ and torch/[e-n]*/ (#129769)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by the linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129769
Approved by: https://github.com/ezyang
2024-08-04 10:24:09 +00:00
Nikita Shulga
cd5452aace [CUDA] is_bf16_supported() should not crash if there are no GPUs (#132313)
`False` is the correct answer on a system that does not have any CUDA GPUs.
- Added regression test to TestTorch.
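
The expected behavior after the fix, as a short sketch:

```python
import torch

# On a machine with no CUDA devices this now returns False instead of raising.
print(torch.cuda.is_bf16_supported())
```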

Fixes https://github.com/pytorch/pytorch/issues/132303

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132313
Approved by: https://github.com/eqy, https://github.com/syed-ahmed
2024-08-02 02:50:43 +00:00
Syed Tousif Ahmed
7c89ec0f7c Implements torch.cuda.MemPool() API (#131152)
In this PR:
- Pool id creation logic is refactored and moved to a MemPool class. `graph_pool_handle()` API now uses `torch.cuda.MemPool()` to get a unique id for a pool. Existing tests should cover this change.
- MemPool holds a pointer to a CUDAAllocator as proposed in https://github.com/pytorch/pytorch/issues/124807#issuecomment-2077506997. Tests are added to show usage with CUDAPluggableAllocator.
- MemPoolContext API makes a mempool active. Tests are added to show usage of this API. This API will be used in CUDACachingAllocator to route allocations to a user provided allocator. See draft here: https://github.com/pytorch/pytorch/pull/125722/
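
A hedged usage sketch of the two APIs the bullets above describe (construction details may differ from the final code):

```python
import torch

pool = torch.cuda.MemPool()            # new pool with its own unique id
ctx = torch.cuda.MemPoolContext(pool)  # `pool` is active while ctx is alive
x = torch.empty(1024, device="cuda")   # to be routed to `pool` once the
del ctx                                # CUDACachingAllocator change lands
```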

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131152
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-08-01 01:29:30 +00:00
PyTorch MergeBot
e191b83462 Revert "Add wrappers for synchronous GPUDirect Storage APIs (#130633)"
This reverts commit 709ddf7a9d.

Reverted https://github.com/pytorch/pytorch/pull/130633 on behalf of https://github.com/clee2000 due to still failing internally D60265673 ([comment](https://github.com/pytorch/pytorch/pull/130633#issuecomment-2253239607))
2024-07-26 18:08:20 +00:00
Mikayla Gawarecki
709ddf7a9d Add wrappers for synchronous GPUDirect Storage APIs (#130633)
Based in part on https://github.com/NVIDIA/apex/pull/1774

Differential Revision: [D60155434](https://our.internmc.facebook.com/intern/diff/D60155434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130633
Approved by: https://github.com/albanD
2024-07-25 22:23:38 +00:00
PyTorch MergeBot
e4b5645f83 Revert "Add wrappers for synchronous GPUDirect Storage APIs (#130633)"
This reverts commit 5b5e0698a5.

Reverted https://github.com/pytorch/pytorch/pull/130633 on behalf of https://github.com/clee2000 due to breaking a lot of jobs and build rules internally D60085885, possibly needs to update some bazel build? ([comment](https://github.com/pytorch/pytorch/pull/130633#issuecomment-2245806738))
2024-07-23 17:19:34 +00:00
Mikayla Gawarecki
5b5e0698a5 Add wrappers for synchronous GPUDirect Storage APIs (#130633)
Based in part on https://github.com/NVIDIA/apex/pull/1774

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130633
Approved by: https://github.com/albanD
2024-07-22 14:51:24 +00:00
Jack Taylor
e9023d57b0 [ROCm] Return correct AMDSMI socket_power metric (#130331)
Extending the change in https://github.com/pytorch/pytorch/pull/127729

Depending on the gcnArch, the API that returns socket power changes based on the underlying gpu_metrics. This PR handles both cases.
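
A hedged sketch of the fallback described, using amdsmi's power-info call (key names assumed from the amdsmi Python bindings):

```python
import amdsmi

amdsmi.amdsmi_init()
handle = amdsmi.amdsmi_get_processor_handles()[0]
info = amdsmi.amdsmi_get_power_info(handle)
# Newer gpu_metrics populate current_socket_power; older ones only report
# average_socket_power, so fall back when the former is unavailable.
power = info.get("current_socket_power")
if power in (None, "N/A"):
    power = info.get("average_socket_power")
amdsmi.amdsmi_shut_down()
```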

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130331
Approved by: https://github.com/jeffdaily, https://github.com/eqy, https://github.com/malfet
2024-07-17 01:58:58 +00:00
Jack Taylor
e1b426b345 [ROCm] CUDA_VISIBLE_DEVICES fallback option for device_count (#129650)
Updating `_parse_visible_devices` to allow use of CUDA_VISIBLE_DEVICES if HIP_VISIBLE_DEVICES is unset, to avoid any unnecessary code changes in workloads that already rely on CUDA_VISIBLE_DEVICES.
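
A minimal sketch of the resulting fallback order on ROCm (simplified from `_parse_visible_devices`):

```python
import os
from typing import Optional

def _visible_devices_spec() -> Optional[str]:
    # HIP_VISIBLE_DEVICES still wins when set; otherwise reuse the value
    # workloads already export for CUDA.
    spec = os.environ.get("HIP_VISIBLE_DEVICES")
    if spec is None:
        spec = os.environ.get("CUDA_VISIBLE_DEVICES")
    return spec
```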

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129650
Approved by: https://github.com/hongxiayang, https://github.com/malfet
2024-07-01 11:40:09 +00:00
Nikita Shulga
14dc08ddc7 Inductor to fail gracefully on Voltas for bf16 tensors (#129288)
Volta (sm_7x) has no HW support for the bfloat16 datatype; it is emulated in software, so PyTorch eager can use bfloat16 tensors, but Triton cannot. So if a graph with either CUDA bf16 input or output tensors is used, raise a warning and skip the frame.

Add an optional parameter `including_emulation` to the `torch.cuda.is_bf16_supported` method and call it from `torch._inductor.compile_fx._check_triton_bf16_support`.
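
A short usage sketch of the new parameter:

```python
import torch

native_only = torch.cuda.is_bf16_supported(including_emulation=False)
with_emulation = torch.cuda.is_bf16_supported()  # default keeps old behavior
# On Volta (sm_7x): with_emulation is True (eager works), native_only is
# False, which is what lets Inductor skip Triton bf16 codegen there.
```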

Test plan: Modify `is_bf16_supported` to return False and see that the warning is generated

Fixes https://github.com/pytorch/pytorch/issues/118122 and https://github.com/pytorch/pytorch/issues/118581

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129288
Approved by: https://github.com/eqy, https://github.com/jansel
2024-06-25 00:04:13 +00:00
ibartol
c6b180a316 Created docs (and example) for cudart function in torch.cuda (#128741)
Fixes #127908

## Description

Created docs to document the torch.cuda.cudart function, solving issue #127908.
I tried to stick to the [guidelines for documenting a function](https://github.com/pytorch/pytorch/wiki/Docstring-Guidelines#documenting-a-function), but I was not sure whether there is a consensus on how to handle the docs of a function that calls an internal function. So I went ahead, checked what the function raises, etc., from the user's perspective, and documented that (i.e., I am reporting what _lazy_init() will actually raise).

Updated PR from #128298 since I made quite a big mistake in my branch. I apologize for the newbie mistake.

### Summary of Changes

- Added docs for torch.cuda.cudart
- Added the cudart function in the autosummary of docs/source/cuda.rst
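
A usage sketch consistent with the documented behavior; this mirrors how `torch.cuda.profiler` drives the runtime bindings internally:

```python
import torch

torch.cuda.init()       # cudart() raises if torch was built without CUDA
rt = torch.cuda.cudart()
rt.cudaProfilerStart()  # e.g. scope an nsys/nvprof capture
torch.randn(8, device="cuda").sum()
rt.cudaProfilerStop()
```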

## Checklist
- [X] The issue that is being fixed is referred in the description
- [X] Only one issue is addressed in this pull request
- [X] Labels from the issue that this PR is fixing are added to this pull request
- [X] No unnecessary issues are included in this pull request

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128741
Approved by: https://github.com/msaroufim
2024-06-17 16:50:37 +00:00
Aaron Orenstein
62bcdc0ac9 Flip default value for mypy disallow_untyped_defs [4/11] (#127841)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127841
Approved by: https://github.com/oulgen
2024-06-08 18:36:48 +00:00