Follow-up to #110123, removing the CUDA_VERSION check for ROCm because HIP already provides hipMallocAsync() and does not need the version check there.
Follow-up to #108488, fixing the failing unit tests by accepting either a "cuda" or "hip" attribute for the caching allocator options. This is aligned with the masquerading strategy for ROCm/HIP.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110715
Approved by: https://github.com/ezyang
Summary: This diff moves CUDAAllocatorConfig into the header file so that the same config code can also be used for CUDA pinned memory.
Test Plan: sandcastle
Differential Revision: D49653265
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110123
Approved by: https://github.com/zdevito
This is a reland of https://github.com/pytorch/pytorch/pull/108626 and #109564. We fixed the iOS build failure by changing
```
((CHECK) ? (EXPR) : ([] { assert(!#CHECK); }(), (EXPR)))
```
to
```
((CHECK) ? (EXPR) : ([] { assert(false); }(), (EXPR)))
```
in TR2_OPTIONAL_ASSERTED_EXPRESSION, since the former syntax is invalid on Apple Clang. We apply this simple fix in the hope that c10::optional will be replaced by std::optional soon.
We also enabled -Wdeprecated on c10.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110019
Approved by: https://github.com/clee2000
Summary:
There's currently a bug in `CUDACachingAllocator` (introduced in D48229150) which makes it impossible to determine whether a `malloc`ed sample has been deallocated.
It happens because we currently instrument the `malloc` SDT **before** a block of memory has been allocated by either `cudaMalloc` or the local caching allocator's `malloc` call. Since this is a static tracepoint, it receives arg values at the point of instrumentation. Currently, it receives the memory pointer, `void* p`, which is NULL.
Changes in this diff:
1) Move this SDT to right before the `allocate` function returns, so that memory has already been allocated and the `p` pointer points to a valid, non-NULL address.
2) Enable tracing of `cudaMalloc` calls, in addition to `NativeCachingAllocator::malloc`.
3) Rename a poorly named local variable: `r` --> `devPtr` (pointer to the allocated memory block).
Test Plan:
Tested with a local PyTorch script that leaks memory. Verified the following:
* prior to this fix (prod), malloc samples are **not** marked as "freed"
* with the fix (branch), samples **are** marked as "freed"
* results are comparable with the current uprobe implementation to sample PyTorch malloc events in `gpusnoop`
Reviewed By: chaekit
Differential Revision: D48873734
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108907
Approved by: https://github.com/chaekit
This change matches the behavior of _record_memory_history, which was
recently changed to enable history recording on all devices rather than
only the current one. It prevents confusing situations where the observer
was registered before the device was set for the training run.
It also ensures the allocators have been initialized in the Python binding, in case this is the first call to the CUDA API.
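For illustration, a minimal sketch of registering such an observer (not part of this change); the private hook name and callback signature follow the pattern used in PyTorch's own tests and should be treated as assumptions here:
```
import torch

# Callback invoked by the allocator when an allocation fails with OOM.
def oom_observer(device, alloc, device_alloc, device_free):
    print(f"OOM on device {device}: tried to allocate {alloc} bytes "
          f"({device_free} bytes free on the device)")

# Private binding; after this change it fires for OOMs on any device.
torch._C._cuda_attach_out_of_memory_observer(oom_observer)
```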
Fixes #107330
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107399
Approved by: https://github.com/eellison
ghstack dependencies: #107171
Summary: Adds new static tracepoints to the CUDA allocator code for tracking alloc and dealloc events.
Test Plan: This change simply adds static tracepoints to CUDA allocator code, and does not otherwise change any logic. Testing is not required.
Reviewed By: chaekit
Differential Revision: D48229150
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107322
Approved by: https://github.com/chaekit
Previously, when we recorded a free action in a memory trace, we would provide
the stack for when the block was allocated. This is faster because we do not
have to record stacks for frees, which would otherwise double the number of stacks
collected. However, sometimes knowing the location of a free is useful for
figuring out why a tensor was still live, so this PR adds the option to record
the stack of the free itself. If performance ends up being a concern, the old
behavior is still available by passing "alloc" to the context argument rather than "all".
Also refactors some of the glue logic to be consistent across C++ and Python and
routes the Python API through the C++ version.
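For illustration, a minimal sketch of the context argument described above (keyword names follow the torch.cuda.memory._record_memory_history API in this stack; treat the exact defaults as assumptions):
```
import torch

# Record stacks for allocations *and* frees (the new behavior).
torch.cuda.memory._record_memory_history(context="all")

x = torch.randn(1024, 1024, device="cuda")
del x
snapshot = torch.cuda.memory._snapshot()  # free events now carry their own stacks

# If collecting stacks on free is too costly, keep allocation-only stacks.
torch.cuda.memory._record_memory_history(context="alloc")
```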
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106758
Approved by: https://github.com/albanD
If record_history is enabled, a block is allocated, record_history is then
disabled, and the block is later freed and unmapped, we can hit
the `to_map->context_when_allocated == nullptr` assertion.
This change universally clears context_when_allocated on free, which should
prevent this sequence of events from triggering the assertion.
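A sketch of the sequence of events described above (using expandable segments to reach the unmap path is an assumption, and this is not a guaranteed repro):
```
import os
# Expandable segments so freed blocks can later be unmapped (assumption).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

torch.cuda.memory._record_memory_history("all")  # 1. recording enabled
x = torch.randn(1 << 20, device="cuda")          # 2. block allocated
torch.cuda.memory._record_memory_history(None)   # 3. recording disabled
del x                                            # 4. block freed
torch.cuda.empty_cache()                         # 5. block released/unmapped
```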
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106818
Approved by: https://github.com/eellison
Previously calling _record_memory_history would only start recording
for a single device because snapshots were also device specific.
Now that the visualizer packages all devices into a single page, snapshot
recording should also enable recording for all devices.
Verified locally that calling the method does not initialize cuda context
on devices that have not previously been used.
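For illustration, a sketch of what this enables (the "device_traces" key name follows the snapshot format consumed by the visualizer and is an assumption to the extent it is not spelled out here):
```
import torch

torch.cuda.memory._record_memory_history()  # now enables recording on every device
a = torch.ones(1 << 20, device="cuda:0")
b = torch.ones(1 << 20, device="cuda:1")     # assumes a second GPU is present

snap = torch.cuda.memory._snapshot()
print(len(snap["device_traces"]))            # one trace list per visible device
```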
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106346
Approved by: https://github.com/eellison
PR #101956 introduced additional stream priorities for CUDA streams. HIP streams have slightly different semantics.
- HIP: 1=low, 0=default, -1=high
- CUDA: 0=default, -1=high, -2=higher, etc.
This PR restricts HIP stream priorities to just 0 and -1 to match the PyTorch semantics.
This fixes a broken unit test.
```
python3 test_cuda_multigpu.py TestCudaMultiGPU.test_streams_priority -v
Test results will be stored in test-reports/python-unittest/test_cuda_multigpu
Running tests...
----------------------------------------------------------------------
test_streams_priority (__main__.TestCudaMultiGPU) ... ERROR (0.200s)
======================================================================
ERROR [0.200s]: test_streams_priority (__main__.TestCudaMultiGPU)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 2354, in wrapper
method(*args, **kwargs)
File "test_cuda_multigpu.py", line 656, in test_streams_priority
low, high = torch.cuda.Stream.priority_range()
RuntimeError: least_priority == 0 INTERNAL ASSERT FAILED at "/var/lib/jenkins/pytorch-upstream/c10/hip/HIPStream.h":184, please report a bug to PyTorch. Unexpected HIP stream priority range
```
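For reference, a small sketch of the user-visible contract this restores on ROCm (the returned values are an expectation based on the description above, not captured output):
```
import torch

low, high = torch.cuda.Stream.priority_range()
# On ROCm this is now expected to be (0, -1): 0 is the default priority and
# more negative values are higher priority, matching the CUDA-side convention.
s = torch.cuda.Stream(priority=high)  # highest-priority stream
```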
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106157
Approved by: https://github.com/malfet
We want to display the stack for the original cudaMalloc that created a segment.
Previously we could only report the last time the segment memory was used,
or the record of the segment_alloc could appear in the list of allocator actions.
This PR ensures that, regardless of whether we still have the segment_alloc action,
the context for a segment is still available. The visualizer is updated to
be able to incorporate this information.
This PR adds a new field to Block. However the previous stacked cleanup PR
removed a field of the same size, making the change to Block size-neutral.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106113
Approved by: https://github.com/aaronenyeshi
For free blocks of memory in the allocator, we previously kept a linked list
of the stack frames of previous allocations that lived there. This was only
ever used in one flamegraph visualization and never proved useful for
understanding what was going on. When memory history tracing was added, it
became redundant, since we can see the history of the free space from recording
the previous actions anyway.
This patch removes this functionality and simplifies the snapshot format:
allocated blocks directly have a 'frames' attribute rather than burying stack frames in the history.
Previously the memory history tracked the real size of allocations before rounding.
Since history was added, a 'requested_size' field recording the same information has been added directly to the block,
so this patch also removes that redundancy.
None of this functionality has been part of a PyTorch release with BC guarantees, so it should be safe to alter
this part of the format.
This patch also updates our visualization tools to work with the simplified format. Visualization tools keep
support for the old format in `_legacy` functions so that during the transition old snapshot files can still be read.
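For illustration, a minimal sketch of reading the simplified format (key names are based on the description above and should be treated as assumptions where not spelled out):
```
import torch

torch.cuda.memory._record_memory_history()
x = torch.randn(4096, 4096, device="cuda")
snap = torch.cuda.memory._snapshot()

for seg in snap["segments"]:
    for block in seg["blocks"]:
        if block["state"] == "active_allocated":
            # 'frames' now lives directly on the block; 'requested_size'
            # records the pre-rounding size the removed history used to track.
            print(block["requested_size"], len(block.get("frames", [])))
```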
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106079
Approved by: https://github.com/eellison
Summary:
## What is this?
This is a giant codemod to migrate all of fbcode from the tp2 version of gtest to the `fbsource/third-party` version.
## Why?
Various parts of the monorepo use different versions of gtest which are incompatible with each other and make maintenance of C++ testing more difficult than it should be. There also doesn't seem to be much reason for this fragmentation. Shifting all `gtest` dependencies towards `fbsource/third-party` is a big step in the right direction towards cleaning this up.
Also -- tp2 is deprecated, so we want to stop using that anyway. If we're going to make improvements to `gtest`, we should get away from tp2 as a first step.
## How?
I used a bash script to perform the majority of the codemod: P777150295
I followed up with `rg` to find additional dependencies, then simply iterated a ton until CI was (mostly) happy.
This diff also includes an update to autodeps to use the `third-party/fbsource` version of gtest rather than the `tp2` version.
#forcetdhashing
Test Plan: CI
Differential Revision: D46961576
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104255
Approved by: https://github.com/huydhn
Summary:
The `getStreamFromPool(bool, signed char)` overload doesn't initialize `max_stream_priorities`, so if we call `getStreamFromPool(true)` we hit the following error:
```
terminate called after throwing an instance of 'c10::Error'
what(): Expected cuda stream priority to be less than or equal to 0, got 1
```
Differential Revision: D46358087
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102739
Approved by: https://github.com/ngimel
Changes the StreamID encoding to use the last bit to distinguish between external and internal streams, 4 bits for IdType (DEFAULT, EXT, or user-created streams possibly with high priority), and 5 bits for the index. This allows us to expose more stream priorities to the user (I'm currently setting 4, but that's easy to change now). Note that we are pre-creating all 32 streams in the pool for each allowed priority; I don't know if that's a problem in practice. Currently CUDA 11.8/A100 GPUs allow 6 different stream priorities; the number may differ across cards and CUDA versions.
Previous call sites explicitly requesting a high-priority stream (`isHighPriority=true`) now get the highest-priority stream.
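A purely illustrative packing that matches the bit widths described above (1 bit for external vs. internal, 4 bits for the stream type, 5 bits for the pool index); the field order and exact layout in c10 are assumptions here, not the actual encoding:
```
def pack_stream_id(is_ext: bool, id_type: int, index: int) -> int:
    # 5-bit index | 4-bit type | 1-bit external flag (hypothetical layout)
    assert 0 <= id_type < 16 and 0 <= index < 32
    return (index << 5) | (id_type << 1) | int(is_ext)

def unpack_stream_id(sid: int):
    return bool(sid & 0b1), (sid >> 1) & 0b1111, (sid >> 5) & 0b11111

assert unpack_stream_id(pack_stream_id(False, 3, 17)) == (False, 3, 17)
```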
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101956
Approved by: https://github.com/ezyang
cudaGetLastError and hipGetLastError will clear any error value within CUDA and HIP, respectively. This is often done on purpose to clear benign errors. Discarding the return value should be indicated by casting to void and a nearby comment. This silences warnings from HIP:
warning: ignoring return value of function declared with 'nodiscard' attribute [-Wunused-result]
An audit of the PyTorch sources found one use of cudaGetLastError in IndexKernel.cu whose return value was incorrectly ignored.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100488
Approved by: https://github.com/ezyang
Summary: `CUDACachingAllocator::format_size` is used not only in CUDACachingAllocator.cpp but also in CUDAMallocAsyncAllocator.cpp. This caused a breakage when the compiler inlined the function and the linker couldn't find it when resolving symbols for CUDAMallocAsyncAllocator.cpp.
Differential Revision: D45612790
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100734
Approved by: https://github.com/interwq, https://github.com/kit1980
Now that expandable_segments has been merged from OSS, we can enable it in the internal build. It still defaults to off, so this should not cause any behavior changes in the allocator unless the flag is explicitly set.
Differential Revision: D45249535
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100184
When we run cudagraph trees we are not allowed to have permanent workspace allocations like in cuBLAS, because we might need to reclaim that memory for a previous cudagraph recording, and it is memory that is not accounted for in output weakrefs, so it does not work with checkpointing. Previously, I would check that we didn't have any additional allocations through snapshotting, but this was extremely slow, so I had to turn it off.
This PR first does a quick check to see whether we are in an error state, and only then runs the slow logic of creating a snapshot. It also turns on history recording so we get a stack trace of where the bad allocation came from.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99985
Approved by: https://github.com/zdevito
This PR adds calls to nvml during an OOM to find out the total memory
in use by the process and any other CUDA processes on the device.
This makes it easier to identify cases where non-PyTorch libraries have
allocated memory or another process (such as a data loader) has also
allocated memory on the device.
This also rewords the other parts of the error message to make the meaning
of the memory statistics clearer alongside this new information:
"""
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 138.00 MiB.
GPU 0 has a total capacty of 15.90 GiB of which 8.44 MiB is free.
Process 1246069 has 577.00 MiB memory in use. Including non-PyTorch memory,
this process has 15.32 GiB memory in use. Of the allocated memory
14.12 GiB is allocated by PyTorch, and 410.41 MiB is reserved
by PyTorch but unallocated. If reserved but unallocated memory is large
try setting max_split_size_mb to avoid fragmentation. See documentation
for Memory Management and PYTORCH_CUDA_ALLOC_CONF
"""
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99699
Approved by: https://github.com/ngimel
Common advice we give for handling memory fragmentation issues is to
allocate a big block upfront to reserve memory which will get split up later.
For programs with changing tensor sizes this can be especially helpful to
avoid OOMs that happen the first time we see a new largest input and would
otherwise have to allocate new segments.
However, the issue with allocating a block upfront is that it is nearly impossible
to correctly estimate the size of that block. If too small, space in the block
will run out and the allocator will allocate separate blocks anyway. Too large,
and other non-PyTorch libraries might stop working because they cannot allocate
any memory.
This patch provides the same benefits as using a pre-allocated block but
without having to choose its size upfront. Using the cuMemMap-style APIs,
it adds the ability to expand the last block in a segment when more memory is
needed.
Compared to universally using cudaMallocAsync to avoid fragmentation,
this patch can fix this common fragmentation issue while preserving most
of the existing allocator behavior. This behavior can be enabled and disabled dynamically.
This should allow users to, for instance, allocate long-lived parameters and state in individual buffers,
and put temporary state into the large expandable blocks, further reducing
fragmentation.
See inline comments for information about the implementation and its limitations.
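A minimal sketch of opting into this behavior through the allocator config (the expandable_segments knob of PYTORCH_CUDA_ALLOC_CONF); it must be set before the first CUDA allocation:
```
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

# Workloads whose tensor sizes keep growing can expand the last block of an
# existing segment instead of allocating a brand-new segment at each new maximum.
for n in (1024, 2048, 4096):
    x = torch.randn(n, n, device="cuda")
    del x
```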
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96995
Approved by: https://github.com/eellison
This method has to be accessible from `c10` to enable CUDA-12 integration.
Implemented by providing a private `c10::cuda::_internal::setHasPrimaryContext` that passes a pointer to the implementation (in `torch_cuda`) back to c10.
A global class constructor/destructor is used to guarantee RAII.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96800
Approved by: https://github.com/ngimel