Commit Graph

242 Commits

Author SHA1 Message Date
Natalia Gimelshein
401fa87ace make only current thread allocate to pool in NcclPG (#153990)
Follow-up to #153356 that fixes NCCL allocation to the pool

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153990
Approved by: https://github.com/kwen2501
2025-05-21 21:57:37 +00:00
Shivam Raikundalia
dbb4444ce3 [Memento] Add PT2 to Memory Snapshot (#152707)
Summary:
To add PT2 information to the memory snapshot, we piggyback off the Kineto implementation using record_function, similar to how the user annotations are added. To do this we add the following:

1. A stack implementation that we instantiate to keep track of which compile context we are currently in (the top element of the stack). The stack is per-device and thread-local, since different threads of a process can be in different compile contexts at a given time. For this reason, we do not need to add mutexes to our stack impl, since no two threads will touch a given stack.
2. RecordFunction hooks to properly pipe the correct events to the compile context stack. These hooks are similar to the annotation ones in that we register them lazily and DO NOT unregister them; this is done out of convenience. In the future, we should save the handles and unregister them to minimize overhead after profiling is finished. As of now, we register at the FUNCTION scope, which is wide; however, we treat any function that does not start with "Torch-Compiled Region" as a no-op, so we anticipate the performance difference during and after profiling to be negligible. We also hide this feature behind a flag that is off by default, so existing jobs will be unaffected (see the sketch after this list).
3. Piping for compile context to pickle output
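A minimal usage sketch of the feature, assuming the off-by-default flag is exposed as a `compile_context` kwarg on `_record_memory_history` (that kwarg name is an assumption of this sketch, not confirmed by the commit text):

```
import torch

# Assumed kwarg: compile_context toggles the off-by-default feature above.
torch.cuda.memory._record_memory_history(max_entries=100000, compile_context=True)

@torch.compile
def f(x):
    # runs inside a "Torch-Compiled Region" record_function annotation
    return (x * x).relu()

f(torch.randn(1024, 1024, device="cuda"))

# Compile-context info is piped into the pickled snapshot output.
torch.cuda.memory._dump_snapshot("snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
```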

Test Plan:
In D74039793, we add CompileContext to the visualizer and we see the following {F1977654658}

Differential Revision: D74028214

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152707
Approved by: https://github.com/eqy
2025-05-12 21:12:51 +00:00
Natalia Gimelshein
9ae722cdb4 allocate cuMem memory with rdma flag (#153261)
to be able to register memory with ibverbs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153261
Approved by: https://github.com/kwen2501, https://github.com/eqy, https://github.com/Skylion007
2025-05-09 21:48:48 +00:00
Dan Johnson
8e2e06b7ea Fix shadow local variables (#152429)
Summary: Fixing shadow local variables error: P1798875650

Test Plan: CI

Differential Revision: D73853605

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152429
Approved by: https://github.com/Skylion007, https://github.com/eqy
2025-04-29 18:50:18 +00:00
Dan Johnson
d22c4cc353 Add option to use mempool on OOM (#151487)
MemPool is a separate pool of memory handled by the caching allocator. This PR adds an option to let the caching allocator try to use this pool as a last resort instead of OOMing, by associating a `use_on_oom` bool with each MemPool.

Usage:
Users can optionally specify a ``use_on_oom`` bool (which is False by default) during MemPool creation. If true, then the CUDACachingAllocator will be able to use memory in this pool as a last resort instead of OOMing.

```
pool = torch.cuda.MemPool(allocator, use_on_oom=True)
with torch.cuda.use_mem_pool(pool):
    a = torch.randn(40 * 1024 * 1024, dtype=torch.uint8, device="cuda")
del a
# at the memory limit, this will succeed by using the pool's memory in order to avoid the OOM
b = torch.randn(40 * 1024 * 1024, dtype=torch.uint8, device="cuda")
```

Testing:
```
python test/test_cuda.py -k test_mempool_limited_memory_with_allocator
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151487
Approved by: https://github.com/eqy, https://github.com/syed-ahmed, https://github.com/ngimel
2025-04-26 04:04:57 +00:00
Luca Wehrstedt
99642182f2 Add mempool to allocator's trace events (#150683)
In the NCCL ProcessGroup we want to support being able to "register" with NCCL all the allocations that belong to a certain private MemPool. In order to do so on-the-fly for every new allocation, we register a hook for the CachingAllocator's TraceEvents. However, we were lacking a way to know whether a given TraceEvent belonged to the MemPool that we cared about or not. With this PR, we add a MempoolId_t field to the TraceEvents.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150683
Approved by: https://github.com/syed-ahmed, https://github.com/kwen2501
2025-04-11 17:26:37 +00:00
Ethan Wee
6cbf97ede8 [ROCm] enable HIPMallocAsyncAllocator (#149145)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149145
Approved by: https://github.com/izaitsevfb

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-03-19 23:42:35 +00:00
Shivam Raikundalia
a11538aa46 [GPU Snapshot] Add Clear History Flag (#149352)
Summary:
Oftentimes, users complain that a bunch of extra events are prepended to their desired GPU snapshot. This is because they usually attach an OOM logger without knowing it, and when they go to collect the actual snapshot, it includes all of the OOM logger contents. Since OOM logging and regular snapshots use the same backend, we currently don't have the infra in place to split these snapshots.

As a solution, we add a flag to the snapshot frontend to clear out the history when starting the auto-trace memory history recording.

A more thorough solution would be to have a user pass in a handle and to have per-handle snapshots to separate the events. However, this would likely be complicated and more work than it is worth, as we would have to change the callbacks in the caching allocator and pass these objects between Python and C++.
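A minimal usage sketch, assuming the flag is exposed as a `clear_history` kwarg on `_record_memory_history` (the kwarg name is an assumption here):

```
import torch

# An OOM logger attached earlier may already be recording history.
torch.cuda.memory._record_memory_history(max_entries=10000)
a = torch.randn(1024, device="cuda")

# Assumed kwarg: drop previously recorded events so the snapshot only
# contains what happens from this point on.
torch.cuda.memory._record_memory_history(max_entries=10000, clear_history=True)
b = torch.randn(2048, device="cuda")
torch.cuda.memory._dump_snapshot("clean_snapshot.pickle")
```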

Test Plan:
See diff below

Differential Revision: D71159720

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149352
Approved by: https://github.com/eqy, https://github.com/aaronenyeshi
2025-03-19 21:44:20 +00:00
PyTorch MergeBot
e1d143cb7b Revert "[ROCm] enable HIPMallocAsyncAllocator (#149145)"
This reverts commit ee1a2b7810.

Reverted https://github.com/pytorch/pytorch/pull/149145 on behalf of https://github.com/izaitsevfb due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/149145#issuecomment-2738115728))
2025-03-19 21:12:13 +00:00
Ethan Wee
ee1a2b7810 [ROCm] enable HIPMallocAsyncAllocator (#149145)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149145
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-03-19 03:59:55 +00:00
PyTorch MergeBot
9d37b501db Revert "[ROCm] enable HIPMallocAsyncAllocator (#149145)"
This reverts commit 2e02c07a5d.

Reverted https://github.com/pytorch/pytorch/pull/149145 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally.  @albanD, might you be able to help get this PR landed? See D71214814 for more details on the failure. To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/149145#issuecomment-2730104736))
2025-03-17 16:17:02 +00:00
Ethan Wee
2e02c07a5d [ROCm] enable HIPMallocAsyncAllocator (#149145)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149145
Approved by: https://github.com/jeffdaily
2025-03-14 18:21:27 +00:00
cyy
8fa81a6066 Enable misc-use-internal-linkage check and apply fixes (#148948)
Enables the clang-tidy rule [`misc-use-internal-linkage`](https://clang.llvm.org/extra/clang-tidy/checks/misc/use-internal-linkage.html). This check was introduced in Clang-Tidy 18 and is available due to the recent update to Clang-Tidy 19.

The check marks functions and variables used only in the translation unit as static. Therefore, undesired symbols are not leaked into other units, more link-time optimisations are possible, and the resulting binaries may be smaller.

The detected violations were mostly fixed by marking symbols static. In other cases, the symbols were indeed consumed by other files, so their declaring headers were included. Some declarations were simply wrong and have been fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148948
Approved by: https://github.com/Skylion007
2025-03-12 14:22:56 +00:00
Tristan Rice
494abeff8a CUDACachingAllocator,c10d: fixes for IPC release performance (#148805)
This has two fixes to improve IPC tensor release performance when using torchft's BabyProcessGroupNCCL.

1. release the IpcMutex when deleting the `ExpandableSegment` object to avoid synchronizing under the lock
2. release the GIL in WorkNCCL destructor since the shared tensor will be destructed there

Test plan:

Run with torchft + torchtitan

```
REPLICA_GROUP_ID=0 NGPU=2 CUDA_VISIBLE_DEVICES=0,1 CONFIG_FILE=./torchtitan/models/llama/train_configs/llama3_8b.toml ./run_train.sh --training.data_parallel_shard_degree=2 --fault_tolerance.enable --fault_tolerance.group_size=2 --fault_tolerance.replica_id=0 --metrics.log_freq=1 --training.seq_len 4096

...

[rank0]:[titan] 2025-03-07 17:51:31,387 - root - INFO - step: 61  loss:  7.4825  memory: 79.73GiB(83.89%)  tps: 317  tflops: 16.34  mfu: 1.65%
```

Check py-spy to verify there is no bottleneck on the IPC lock when creating new shared tensors:

![20250307_17h50m10s_grim](https://github.com/user-attachments/assets/fa8b359f-e337-4ed5-be22-a42ab2bee03d)
![20250307_17h50m00s_grim](https://github.com/user-attachments/assets/206f869a-f07e-4fbd-9e28-89b3da95ef6e)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148805
Approved by: https://github.com/Skylion007, https://github.com/fegin, https://github.com/zdevito
2025-03-10 19:47:04 +00:00
Syed Tousif Ahmed
5f392ae560 Throw an error when using torch.cuda.MemPool with expandable segments (#148378)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148378
Approved by: https://github.com/ngimel, https://github.com/eqy
ghstack dependencies: #148374
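The commit body is terse, but the guarded combination looks roughly like the sketch below (whether the error is raised at pool construction or at first use, and its exact message, are assumptions of this sketch):

```
import os
# Must be set before CUDA is initialized.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch
torch.cuda.init()

try:
    pool = torch.cuda.MemPool()  # expected to raise under expandable segments
except RuntimeError as e:
    print(e)
```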
2025-03-07 05:22:03 +00:00
Marko Radmilac
c65ee728f0 Initial implementation of host memory stats (#147660)
This is an initial attempt to provide some statistics for the pinned host memory allocations flowing through CachingHostAllocator. Many times in the past we have had inexplicable slowdowns that would be much easier to diagnose if we had some host memory characteristics.

This change tries very hard not to disrupt the initial design of the allocator, and it uses the existing locking mechanism, whenever possible, to gather statistics "for free". The only deviation is on the "slow path", where we incur CUDA calls anyway, so taking a short lock is not going to hurt performance much, especially in the steady state where most allocations will come from the cache.

As mentioned before, this is the first PR, to introduce the concept and to see if it fits the right paradigm. We can always add more later.

Metrics that would require more involved changes to the code base and locks, like requested memory, have been punted on for now. I also tried to reuse the Stat structure used in the CUDA caching allocator, in order to maintain symmetry.
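A minimal sketch of how these statistics might be consumed, assuming they are surfaced through a `torch.cuda.host_memory_stats()` accessor (the Python-side name is an assumption of this sketch):

```
import torch

# Pinned host allocations flow through the CachingHostAllocator.
x = torch.empty(1024, 1024, pin_memory=True)

# Assumed accessor, mirroring torch.cuda.memory_stats() for host memory.
stats = torch.cuda.host_memory_stats()
print(stats)
```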

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147660
Approved by: https://github.com/ngimel
2025-03-05 16:13:19 +00:00
cyy
09291817b2 Fix extra semicolon warning (#148291)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148291
Approved by: https://github.com/Skylion007
2025-03-03 18:51:44 +00:00
PyTorch MergeBot
a983b2b11a Revert "Initial implementation of host memory stats (#147660)"
This reverts commit 945e359fc1.

Reverted https://github.com/pytorch/pytorch/pull/147660 on behalf of https://github.com/mradmila due to There is an issue with ambiguous definition of Stat structure when different C++ tools are used. Backing out for now. ([comment](https://github.com/pytorch/pytorch/pull/147660#issuecomment-2692346379))
2025-03-01 18:05:45 +00:00
Marko Radmilac
945e359fc1 Initial implementation of host memory stats (#147660)
This is an initial attempt to provide some statistics for the pinned host memory allocations flowing through CachingHostAllocator. Many times in the past we have had inexplicable slowdowns that would be much easier to diagnose if we had some host memory characteristics.

This change tries very hard not to disrupt the initial design of the allocator, and it uses the existing locking mechanism, whenever possible, to gather statistics "for free". The only deviation is on the "slow path", where we incur CUDA calls anyway, so taking a short lock is not going to hurt performance much, especially in the steady state where most allocations will come from the cache.

As mentioned before, this is the first PR, to introduce the concept and to see if it fits the right paradigm. We can always add more later.

Metrics that would require more involved changes to the code base and locks, like requested memory, have been punted on for now. I also tried to reuse the Stat structure used in the CUDA caching allocator, in order to maintain symmetry.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147660
Approved by: https://github.com/ngimel
2025-02-28 18:36:44 +00:00
FEI
8a4dd763b8 [CCA] remove TODO for hardware_destructive_interference_size (#145591)
@zyan0 @albanD  @houseroad

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145591
Approved by: https://github.com/albanD
2025-02-06 14:41:25 +00:00
cyyever
8a6e9a88e9 Let PYTORCH_NO_CUDA_MEMORY_CACHING have effect only when its value is 1 (#145905)
Fixes #145661
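A minimal sketch of the new behavior (per the linked issue, previously any set value, including "0", disabled caching):

```
import os

# Only the literal value "1" disables the caching allocator now;
# "0" (or any other value) leaves caching enabled.
os.environ["PYTORCH_NO_CUDA_MEMORY_CACHING"] = "0"

import torch
t = torch.ones(4, device="cuda")  # still served by the caching allocator
```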

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145905
Approved by: https://github.com/eqy, https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-01-30 05:11:10 +00:00
cyy
67fcc7cf02 [3/N] Remove unnecessary once flag usage (#145672)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145672
Approved by: https://github.com/albanD
2025-01-28 04:28:18 +00:00
cyy
9a841f9321 Enable bugprone-unchecked-optional-access (#144226)
We can actually enable bugprone-unchecked-optional-access without the risk of a hang.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144226
Approved by: https://github.com/albanD
2025-01-10 03:16:56 +00:00
cyy
dca443835e Enable more readability-redundant checks (#143963)
They are helpful for simplifying code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143963
Approved by: https://github.com/albanD
2024-12-30 14:49:33 +00:00
Benjamin Glass
4959784dac Add API query for available per-process CUDA memory (#140620)
Certain `cpp_wrapper`-enabled tests were OOM-ing in the CI pipeline, with error messages suggesting that sufficient memory was accessible.  This ultimately resulted from an internal memory limitation that was not queryable in the API.  This PR adds querying for that limit.

Additionally, the failing tests had incorrect memory availability checks, and are updated with measured memory requirements.
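A minimal sketch of querying the limit, assuming the getter mirrors the existing `set_per_process_memory_fraction` setter (the getter's name is an assumption of this sketch):

```
import torch

torch.cuda.set_per_process_memory_fraction(0.5)

# Assumed getter counterpart of the setter above.
frac = torch.cuda.get_per_process_memory_fraction()
total = torch.cuda.get_device_properties(0).total_memory
print(f"per-process limit: {frac * total / 2**20:.0f} MiB")
```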

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140620
Approved by: https://github.com/malfet, https://github.com/eqy
ghstack dependencies: #141367
2024-12-03 00:24:03 +00:00
Syed Tousif Ahmed
e0482fdf95 Implements user buffer registration using MemPool (#133603)
This PR implements user buffer registration and demonstrates NVLink SHARP (NVLS) reductions, using a combination of allocating special memory via MemPool and registering it with the NCCL buffer registration APIs.

Part of https://github.com/pytorch/pytorch/issues/124807.
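A minimal sketch of the demonstrated flow, assuming the NCCL backend exposes its registerable allocator as `mem_allocator` (the attribute name is an assumption of this sketch):

```
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
backend = dist.group.WORLD._get_backend(torch.device("cuda"))

# Pool backed by NCCL-registerable memory (assumed attribute name).
pool = torch.cuda.MemPool(backend.mem_allocator)
with torch.cuda.use_mem_pool(pool):
    tensor = torch.arange(1024 * 1024, device="cuda")  # allocated from the pool

dist.all_reduce(tensor)  # eligible for NVLS reductions on registered buffers
torch.cuda.synchronize()
```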

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133603
Approved by: https://github.com/kwen2501, https://github.com/eqy
2024-11-21 01:40:11 +00:00
PyTorch MergeBot
496c1e78c5 Revert "Implements user buffer registration using MemPool (#133603)"
This reverts commit 25d9be37be.

Reverted https://github.com/pytorch/pytorch/pull/133603 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/133603#issuecomment-2486897708))
2024-11-19 22:42:26 +00:00
cyy
00b3b61076 Add and use thread-safe strerror (#140472)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140472
Approved by: https://github.com/ezyang
2024-11-19 04:24:17 +00:00
Syed Tousif Ahmed
25d9be37be Implements user buffer registration using MemPool (#133603)
This PR implements user buffer registration and demonstrates NVLink SHARP (NVLS) reductions, using a combination of allocating special memory via MemPool and registering it with the NCCL buffer registration APIs.

Part of https://github.com/pytorch/pytorch/issues/124807.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133603
Approved by: https://github.com/kwen2501, https://github.com/eqy
2024-11-15 12:47:49 +00:00
Syed Tousif Ahmed
0f06dff4d7 Restores release_lock_on_cudamalloc behavior in CUDACachingAllocator (#139430)
In https://github.com/pytorch/pytorch/pull/134685, I transformed the following code:
```CPP
      if (CUDAAllocatorConfig::release_lock_on_cudamalloc()) {
        // At scope exit, acquire the lock again. This provides safety against
        // any potential exceptions in the cudaMallocMaybeCapturing function.
        auto sg = c10::make_scope_exit([&]() { lock.lock(); });
        lock.unlock();
        p.err = cudaMallocMaybeCapturing(&ptr, size);
      } else {
        p.err = cudaMallocMaybeCapturing(&ptr, size);
      }
      if (CUDAAllocatorConfig::release_lock_on_cudamalloc()) {
        TORCH_CHECK(
            lock.owns_lock(), "Failed to acquire lock after cudaMalloc");
      }
```
into:
```CPP
      if (CUDAAllocatorConfig::release_lock_on_cudamalloc()) {
        // At scope exit, acquire the lock again. This provides safety against
        // any potential exceptions in the cudaMallocMaybeCapturing function.
        auto sg = c10::make_scope_exit([&]() { lock.lock(); });
        lock.unlock();
      }
      auto active_pool = MemPoolContext::getActiveMemPool();
      if (active_pool && active_pool->allocator() &&
          p.pool->owner_PrivatePool) {
        ptr = active_pool->allocator()->raw_alloc(size);
        p.err = ptr ? cudaSuccess : cudaErrorMemoryAllocation;
      } else {
        p.err = cudaMallocMaybeCapturing(&ptr, size);
      }
      if (CUDAAllocatorConfig::release_lock_on_cudamalloc()) {
        TORCH_CHECK(
            lock.owns_lock(), "Failed to acquire lock after cudaMalloc");
      }
```
This is wrong because I didn't realize what `c10::make_scope_exit([&]() { lock.lock(); });` does: the scope guard re-acquires the lock as soon as the enclosing `if` block exits. So my change doesn't let `release_lock_on_cudamalloc` unlock, execute the allocation, and then lock again; it just unlocks and immediately re-locks. This PR rectifies that change and, in addition, adds an ASSERT ensuring the active pool and `p.pool` are the same (mirroring the behavior of `release_cached_blocks`).

Thanks @zvon82 for reporting this!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139430
Approved by: https://github.com/ezyang
2024-11-03 00:04:30 +00:00
Syed Tousif Ahmed
341a28f0ce Refactors empty_cache to return only MemPool memory to the system (#133602)
Canonically, the empty_cache API releases all cached blocks of the CUDACachingAllocator. There is no API that can release only the cached blocks of a given pool.

In this PR, we extend the functionality of the empty_cache API such that it only releases the cached blocks of an active pool. When the empty_cache API is called under a MemPoolContext, we only release the cached blocks that correspond to the pool id of the active pool.

Part of https://github.com/pytorch/pytorch/issues/124807.
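A minimal sketch of per-pool cache release, using MemPoolContext as it existed at the time of this PR (the comments paraphrase the description above; treat them as assumptions, not verified behavior):

```
import torch

pool = torch.cuda.MemPool()
with torch.cuda.use_mem_pool(pool):
    a = torch.randn(1 << 20, device="cuda")
del a  # the block is now cached inside the pool

# With the pool's context active, empty_cache releases only this
# pool's cached blocks back to the system.
ctx = torch.cuda.MemPoolContext(pool)
torch.cuda.empty_cache()
del ctx
```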

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133602
Approved by: https://github.com/ezyang
2024-10-29 23:58:44 +00:00
cyy
3907f36808 Turn some variables and functions into static (#136847)
Re-check some files, mark variables and functions as static, and fix other warnings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136847
Approved by: https://github.com/ezyang
2024-10-29 17:01:56 +00:00
Syed Tousif Ahmed
1637a40796 Adds snapshot API for MemPools to get pool memory segments (#133601)
Canonically, the snapshot API returns the entire memory state of the CUDACachingAllocator (using `get_all_blocks`). There is no API that returns only the memory state of a given pool.

In this PR, we extend the functionality of the snapshot API such that it can return only the memory addresses of an active pool. When the snapshot API is called under a MemPoolContext, we only return the blocks that correspond to the pool id of the active pool.

Part of https://github.com/pytorch/pytorch/issues/124807.
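A minimal sketch of a per-pool snapshot, again using MemPoolContext as it existed at the time (the filtering semantics in the comments follow the description above):

```
import torch

pool = torch.cuda.MemPool()
with torch.cuda.use_mem_pool(pool):
    x = torch.randn(1 << 20, device="cuda")

# With the pool's context active, the snapshot contains only segments
# whose pool id matches the active pool.
ctx = torch.cuda.MemPoolContext(pool)
segments = torch.cuda.memory_snapshot()
del ctx
print(len(segments))
```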

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133601
Approved by: https://github.com/ezyang
2024-10-29 01:01:47 +00:00
cyy
383d9e3de6 [4/N] Fix cppcoreguidelines-special-member-functions warnings (#139027)
Follows #138796
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139027
Approved by: https://github.com/ezyang
2024-10-29 00:18:18 +00:00
PyTorch MergeBot
3b0f39336c Revert "Adds snapshot API for MemPools to get pool memory segments (#133601)"
This reverts commit 00504aa6b8.

Reverted https://github.com/pytorch/pytorch/pull/133601 on behalf of https://github.com/wdvr due to reverting for now as this breaks lots of internal tests. Details below ([comment](https://github.com/pytorch/pytorch/pull/133601#issuecomment-2441864871))
2024-10-28 15:12:20 +00:00
Syed Tousif Ahmed
00504aa6b8 Adds snapshot API for MemPools to get pool memory segments (#133601)
Canonically, the snapshot API returns the entire memory state of the CUDACachingAllocator (using `get_all_blocks`). There is no API that returns only the memory state of a given pool.

In this PR, we extend the functionality of the snapshot API such that it can return only the memory addresses of an active pool. When the snapshot API is called under a MemPoolContext, we only return the blocks that correspond to the pool id of the active pool.

Part of https://github.com/pytorch/pytorch/issues/124807.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133601
Approved by: https://github.com/ezyang
2024-10-26 03:34:59 +00:00
Syed Tousif Ahmed
03c72976a5 Properly uses ref-counting for torch.cuda.use_mem_pool (#133600)
This PR refactors some ref-counting functionality out of `beginAllocateToPool` and `releasePool`. The ref-counting logic is then used in construction and destruction of `torch.cuda.MemPool`.

The `use_count` variable in the CUDACachingAllocator is essentially a refcount of how many context managers are using the pool. Since we are now lifting the MemPool abstraction up to the user, the MemPool object itself now needs to hold an extra reference as well.

Part of https://github.com/pytorch/pytorch/issues/124807.
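A minimal sketch of the ref-counting described above (exposing the count via a `use_count()` method is an assumption of this sketch):

```
import torch

pool = torch.cuda.MemPool()
print(pool.use_count())  # 1: the MemPool object itself holds a reference

with torch.cuda.use_mem_pool(pool):
    print(pool.use_count())  # 2: the context manager adds a reference

print(pool.use_count())  # back to 1 after the context exits
```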

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133600
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-10-22 03:21:53 +00:00
cyy
8c860aef0d [Reland][Environment Variable][3/N] Use thread-safe getenv functions (#137942)
Reland of #137328, which was reverted due to reverting a dependent PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137942
Approved by: https://github.com/eqy
2024-10-15 07:47:24 +00:00
PyTorch MergeBot
df0c2f5cae Revert "[Environment Variable][3/N] Use thread-safe getenv wrapper (#137328)"
This reverts commit 25ac5652d0.

Reverted https://github.com/pytorch/pytorch/pull/137328 on behalf of https://github.com/clee2000 due to need to revert this in order to revert #133896, please rebase and reland, sorry for the churn ([comment](https://github.com/pytorch/pytorch/pull/137328#issuecomment-2412143739))
2024-10-14 20:22:26 +00:00
cyyever
25ac5652d0 [Environment Variable][3/N] Use thread-safe getenv wrapper (#137328)
Follows #124485

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137328
Approved by: https://github.com/eqy
2024-10-11 23:23:57 +00:00
cyy
0c0d8c8ff0 [1/N] Fix extra warnings brought by clang-tidy-17 (#137407)
Before we can use clang-tidy-17
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137407
Approved by: https://github.com/Skylion007, https://github.com/aaronenyeshi
2024-10-07 17:53:59 +00:00
Jeff Daily
c7b0d4b148 raw_alloc ignores PYTORCH_NO_CUDA_MEMORY_CACHING (#131114)
raw_alloc is used by cudnn, miopen, thrust, and tunableop.  Without this PR, the env var for disabling the caching allocator will only partially work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131114
Approved by: https://github.com/eqy, https://github.com/houseroad, https://github.com/albanD

Co-authored-by: Nichols A. Romero <nick.romero@amd.com>
2024-10-04 15:36:29 +00:00
PyTorch MergeBot
0d1701f310 Revert "raw_alloc ignores PYTORCH_NO_CUDA_MEMORY_CACHING (#131114)"
This reverts commit 7001907480.

Reverted https://github.com/pytorch/pytorch/pull/131114 on behalf of https://github.com/PaliC due to failing internal builds ([comment](https://github.com/pytorch/pytorch/pull/131114#issuecomment-2390615007))
2024-10-03 06:22:55 +00:00
Jeff Daily
7001907480 raw_alloc ignores PYTORCH_NO_CUDA_MEMORY_CACHING (#131114)
raw_alloc is used by cudnn, miopen, thrust, and tunableop.  Without this PR, the env var for disabling the caching allocator will only partially work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131114
Approved by: https://github.com/eqy, https://github.com/houseroad, https://github.com/albanD

Co-authored-by: Nichols A. Romero <nick.romero@amd.com>
2024-10-02 16:27:15 +00:00
Banit Agrawal
48d18fbd4c [PyTorch CUDA Allocator] Allow reuse of non-split blocks with better rounding (#136174)
Summary:
This diff adds an option to round the non-split blocks in the caching allocator so that they can be reused without causing lots of fragmentation for large memory segments.

For example, if we specify the max_split memory size as 400MB, then all allocations larger than 400MB will not be split. Let's say we allocated some 1024MB blocks and these are cached in the allocator. If we request a new 500MB block, we round it to the nearest power-of-2 division, which is 512MB, and add the default kLargeBuffer of 20MB, giving 532MB. Since 532MB is less than the existing 1024MB block, the 1024MB block will not be used for this allocation; instead a new 512MB block will be created. In this diff, we make the kLargeBuffer used for rounding configurable and expose it as an option: if 512MB + max_non_split_rounding_size is greater than 1024MB, we will use the 1024MB block and won't create a new 512MB block using cudaMalloc. This option is added so that we can pre-allocate some large blocks, reuse them as much as possible, and avoid stalling on calls to cudaMalloc.
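A minimal configuration sketch, assuming the option is exposed through `PYTORCH_CUDA_ALLOC_CONF` as `max_non_split_rounding_mb` (the key name is an assumption of this sketch):

```
import os

# Allow non-split blocks to round up by as much as 512MB so cached
# large segments (e.g. 1024MB) can be reused instead of cudaMalloc'ing.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = (
    "max_split_size_mb:400,max_non_split_rounding_mb:512"
)

import torch  # the allocator reads the config when it initializes
```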

Differential Revision: D62758758

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136174
Approved by: https://github.com/zyan0
2024-09-17 19:08:44 +00:00
Yu, Guangye
6c1da66407 [Reland] Refactor caching device allocator utils (#130923)
# Motivation
Following [[RFC] Intel GPU Runtime Upstreaming for Allocator ](https://github.com/pytorch/pytorch/issues/116322), this PR aims to refactor the caching device allocator utils to improve code reuse.
This is the first PR; we may prepare follow-up PRs to continue refactoring the device caching allocator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130923
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD, https://github.com/eqy
2024-09-07 11:14:17 +00:00
Haibo Chen
e162414963 add instrumentation of CCA stats for reserved and allocated memory size (#135231)
As titled
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135231
Approved by: https://github.com/c-p-i-o
2024-09-06 02:48:56 +00:00
PyTorch MergeBot
e55c0f59e5 Revert "[Reland] Refactor caching device allocator utils (#130923)"
This reverts commit 9809080b9e.

Reverted https://github.com/pytorch/pytorch/pull/130923 on behalf of https://github.com/kit1980 due to breaking internal builds - Error: Relocation overflow has occured ([comment](https://github.com/pytorch/pytorch/pull/130923#issuecomment-2332640961))
2024-09-05 21:16:14 +00:00
Yu, Guangye
9809080b9e [Reland] Refactor caching device allocator utils (#130923)
# Motivation
Following [[RFC] Intel GPU Runtime Upstreaming for Allocator ](https://github.com/pytorch/pytorch/issues/116322), this PR aims to refactor the caching device allocator utils to improve code reuse.
This is the first PR; we may prepare follow-up PRs to continue refactoring the device caching allocator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130923
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD, https://github.com/eqy
2024-09-04 05:31:08 +00:00
zdevito
d91b49dbaa expandable_segments <-> other allocator options (#134338)
Previously setting  garbage_collection_threshold or max_split_size_mb along with expandable_segments:True could cause the allocator to hit assert failures when running nearly out of memory. This PR ensures garbage_collection and max_split freeing do not accidentally try to release expandable segments.
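A configuration sketch of the combination this PR makes safe:

```
import os

# Previously this mix could trip allocator asserts when nearly out of
# memory; with this PR the options can be combined.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = (
    "expandable_segments:True,"
    "garbage_collection_threshold:0.8,"
    "max_split_size_mb:512"
)

import torch  # the allocator reads the config on first CUDA use
```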

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134338
Approved by: https://github.com/ezyang
2024-08-29 18:43:59 +00:00