Add a couple of Jetson skips for oom tests in test/test_cuda.py due to failures in nvidia CI. Jetson not having full nvml support is a known issue so this is mostly a test side fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149587
Approved by: https://github.com/eqy
PYNVML related tests in test/test_cuda.py are failing in nvidia internal CI for Jetson devices because Jetson devices don't fully support nvml (it exists as a stub library). In addition to skipping PYNVML tests for Jetson, this PR also reworks the TEST_PYNVML logic a bit to be more consistent with the rest of TEST_{something} conditions in test/test_cuda.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149578
Approved by: https://github.com/janeyx99, https://github.com/eqy
This is an initial attempt to provide some statistics for the pinned host memory allocations flowing through CachingHostAllocator. Many times in the past we have had inexplicable slowdowns that would be much easier to diagnose if we had some host memory characteristics.
This change tries very hard not to disrupt the initial design of the allocator, and it uses existing locking mechanism, whenever possible, to gather statistics "for free". Only deviation from that is on the "slow path" where we incur CUDA calls anyway, so taking a short lock is not going to hurt the performance much, especially in the steady state where most allocations will come from cache.
As mentioned before, this is the first PR, to introduce the concept and to see if it fits the right paradigm. We can always add more later.
Metrics that would require more involved changes to the code base and locks, like requested memory, have been punted for now. I also tried to reuse the Stat structure used in CUDA caching allocator, in order to maintain symmetry.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147660
Approved by: https://github.com/ngimel
Tests fail in NVIDIA internal CI since we do not support nvml on Jetson, but nvml is required for OOM reporting to work properly, so we are skipping the failing tests for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148134
Approved by: https://github.com/eqy
This is an initial attempt to provide some statistics for the pinned host memory allocations flowing through CachingHostAllocator. Many times in the past we have had inexplicable slowdowns that would be much easier to diagnose if we had some host memory characteristics.
This change tries very hard not to disrupt the initial design of the allocator, and it uses existing locking mechanism, whenever possible, to gather statistics "for free". Only deviation from that is on the "slow path" where we incur CUDA calls anyway, so taking a short lock is not going to hurt the performance much, especially in the steady state where most allocations will come from cache.
As mentioned before, this is the first PR, to introduce the concept and to see if it fits the right paradigm. We can always add more later.
Metrics that would require more involved changes to the code base and locks, like requested memory, have been punted for now. I also tried to reuse the Stat structure used in CUDA caching allocator, in order to maintain symmetry.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147660
Approved by: https://github.com/ngimel
This PR does two main things (they are in a single PR to show how the newly added APIs are used).
- Add isBuilt and isAvailable APIs to the AcceleratorHook interface. See inline doc for their exact semantic
- Use the newly added isBuilt for accelerator check to ensure it does not poison fork
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146098
Approved by: https://github.com/ngimel, https://github.com/malfet, https://github.com/EikanWang
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
A rewrite of #138964
In addition to rewriting the conditions for using copy2d, this PR fixes a few other problems with #138964:
1) gpu-gpu copies when peer access is disabled shouldn't rely on copy2d
2) copy2d should record even for the host pinned memory, like the regular copy does
3) copy2d shouldn't pretend that it's synchronizing (for the purposes of cuda sanitizer tracer) when it's non-blocking
In this PR copy2d behaves in exactly the same way as copy does wrt to those additional syncs, except it calls a different underlying cuda call.
Tests for multiple cases going through copy2d and avoiding copy2d pattern due to unsatisfied conditions are added.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146256
Approved by: https://github.com/eqy, https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
### **Pull Request: Optimized Non-Contiguous Tensor Copy for CPU to GPU in PyTorch**
#### **Summary**
This PR addresses the performance issue identified in [#111570](https://github.com/pytorch/pytorch/issues/111570), where non-contiguous tensors took significantly longer to transfer from CPU to GPU. Through detailed tracing of the call flow, we identified that PyTorch was creating temporary contiguous buffers for non-contiguous tensor transfers, which introduced unnecessary overhead.
#### **Tracing the Issue**
To pinpoint the cause of the slowdown, we followed the call flow from Python’s `tensor.cuda()` method through PyTorch’s backend, ultimately identifying `copy_kernel_cuda` as the key function responsible for CPU-to-GPU tensor transfers. Here’s a summary of the tracing process:
1. **Python Call: `tensor.cuda()`**
- Starting from Python, the `cuda()` method initiates the tensor transfer to the GPU.
2. **`TensorBody.h: cuda()`**
- The `cuda()` method calls `to()`, specifying the target device as CUDA.
3. **`Tensor.cpp: TensorBase::to()`**
- The `to()` function prepares device and data type options before invoking `_ops::to_dtype_layout::call()`.
4. **Operator Call: `_ops::to_dtype_layout::call()`**
- This operator dispatches the request to the backend-specific function responsible for managing the transfer.
5. **`Copy.cpp: copy_()`**
- The `copy_()` function performs preliminary checks (e.g., zero-tensor immutability) and proceeds to call `copy_impl()`.
6. **`Copy.cpp: copy_impl()`**
- This function sets up a tensor iterator and dispatches the copy operation to the appropriate backend through `copy_stub`.
7. **Dispatch to CUDA: `copy_stub`**
- The dispatch mechanism routes the call to the CUDA-specific function, `copy_kernel_cuda`.
8. **`Copy.cu: copy_kernel_cuda()`**
- Here, we identified that PyTorch was creating temporary contiguous buffers for 1D and 2D non-contiguous tensors, which slowed down the copy process. This behavior is managed by the `copy_requires_temporaries()` function.
#### **Solution**
To address this, we modified `copy_kernel_cuda` to handle non-contiguous 1D and 2D tensors directly by using `cudaMemcpy2DAsync`, which allows efficient, stride-aware memory transfers without temporary buffers. Here’s why this approach improves performance:
- **Efficiency of `cudaMemcpy2DAsync`**: This CUDA function is optimized for pitched (stride-based) memory transfers, allowing it to handle non-contiguous data layouts effectively by specifying memory strides for source and destination tensors.
- **Reduction of Overhead**: By directly copying non-contiguous tensors without intermediate buffers, we eliminate extra memory allocation and achieve faster CPU-to-GPU transfers.
- **Asynchronous Execution**: `cudaMemcpy2DAsync` enables asynchronous transfer on the CUDA stream, further improving performance by taking advantage of CUDA's optimized memory handling for non-contiguous layouts.
#### **Performance Results**
In my testing, I created tensors of size `327680 x 2000` and used slices for transfer performance measurements. The tests show that the average time for transferring a non-contiguous slice (e.g., rows 10,000 to 50,000) from CPU to GPU now closely matches the contiguous case. This improvement indicates that the updated implementation effectively addresses the performance discrepancy. Below are the measured times and validation checks:
```plaintext
Average time for contiguous slice (rows 10,000-50,000): 66 ms
Average time for non-contiguous slice (rows 10,000-50,000): 66 ms
Validation of contiguous and non-contiguous tensor copies:
✅ PASS: Tensor shapes match.
✅ PASS: Tensor contiguity matches.
✅ PASS: Tensor contents match.
✅ PASS: Tensor data types match.
✅ Success: Both contiguous and non-contiguous tensors were copied correctly to the GPU.
```
#### **Conclusion**
This PR resolves the identified performance issue by eliminating the need for temporary buffers in non-contiguous 1D and 2D tensor transfers, ensuring faster and more efficient copies from CPU to GPU. Future optimizations could further enhance performance for higher-dimensional non-contiguous tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138964
Approved by: https://github.com/jeffdaily
Co-authored-by: Natalia Gimelshein <ngimel@gmail.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
…s_pinned if device is not initialized
Do not land
RFC
potential fix for #144687
Now `.is_pinned(device="cuda")` does not initialize device and thus doesn't poison the fork (but it complains about `device` arg being deprecated). To not need `device=` arg we'd need to fix get_accelerator to not initialize device.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145752
Approved by: https://github.com/albanD
Co-authored-by: albanD <albandes@fb.com>
gfx12 passes the condition `torch.cuda.get_device_capability() >= (9, 4)` and uses `default_workspace_size=128MB`, but it required only for MI300
Fix condition to use `("gfx94" in gcn_arch)` instead of `torch.cuda.get_device_properties()` to detect MI300.
Now `default_workspace_size=32MB` is used for gfx12 and the test passes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145227
Approved by: https://github.com/jeffdaily, https://github.com/eqy
# Motivation
We propose to support Python with statement on `torch.Stream`. This is a benefit for all accelerators when writing device-agnostic code. The device-specific stream will also be supported because they are generally derived from `torch.Stream`.
With this PR, we can do like this
```python
s1= torch.Stream()
# Set s1 to the current stream
torch.accelerator.set_stream(s1)
with torch.Stream() as s2:
# Inside with statement, we set s2 to the current stream
assert torch.accelerator.current_stream() == s2
# Here the current stream should be s1
assert torch.accelerator.current_stream() == s1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140138
Approved by: https://github.com/albanD
This PR
* makes changes to the workflow files and scripts so we can run CI workflows on the MI300 runners
* skips and fixes several tests, failed on MI300, observed in https://github.com/pytorch/pytorch/pull/140989
Skipped due to unsupported Float8_e4m3fn data type on MI300 (need to update test code to use datatypes supported by MI300):
- distributed.tensor.parallel.test_micro_pipeline_tp.py::MicroPipelineTPTest::test_fuse_all_gather_scaled_matmul_A_dims_\*_gather_dim_\* (24 tests across inductor/distributed configs)
- distributed.tensor.parallel.test_micro_pipeline_tp.py::test_fuse_scaled_matmul_reduce_scatter_A_dims_\*_scatter_dim_\* (12 tests across inductor/distributed configs))
- inductor.test_loop_ordering::LoopOrderingTest::test_fp8_cast_and_t
- inductor.test_loop_ordering::LoopOrderingTest::test_fp8_pattern_2
Skipped due to AssertionError on MI300:
- inductor.test_mkldnn_pattern_matcher.py::test_qconv2d_int8_mixed_bf16
- distributed._tools.test_sac_ilp::TestSACILP::test_sac_ilp_case1
Skipped:
- test_cuda.py::TestCudaMallocAsync::test_clock_speed
- test_cuda.py::TestCudaMallocAsync::test_power_draw
- test_torch.py::TestTorchDeviceTypeCUDA::test_deterministic_cumsum_cuda
Skipped flaky tests on MI300:
- distributed.test_c10d_gloo.py::ProcessGroupGlooTest::test_gather_stress_cuda
- inductor.test_cpu_repro::CPUReproTests::test_lstm_packed_unbatched_False* (256 tests)
Fixed:
- test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_float8_basics_cuda
Features:
- inductor/test_fp8.py - declare a new function to convert FP8 datatypes to ROCm supported FP8 datatypes. It keeps test names for CUDA and ROCm and allows to enable Inductor FP8 tests on CPU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143673
Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/pruthvistony
Co-authored-by: saienduri <saimanas.enduri@amd.com>
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
# Motivation
Fix https://github.com/pytorch/pytorch/issues/143543
# Solution
We should raise python exception instead of aborting...
# Additional Context
without this PR:
```python
>>> import torch
>>> torch.accelerator.current_stream(torch.accelerator.device_count())
terminate called after throwing an instance of 'c10::Error'
what(): device is out of range, device is 2, total number of device is 2.
Exception raised from check_device_index at /home/dvrogozh/git/pytorch/pytorch/c10/xpu/XPUFunctions.h:36 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xac (0x7f30707eb95c in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xf3 (0x7f307078fc57 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so)
frame #2: <unknown function> + 0x19a3e (0x7f3070c2ba3e in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #3: c10::xpu::getCurrentXPUStream(signed char) + 0x2f (0x7f3070c2c83f in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #4: <unknown function> + 0x1ca35 (0x7f3070c2ea35 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #5: <unknown function> + 0x653f15 (0x7f3083391f15 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x39e5f2 (0x7f30830dc5f2 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so)
<omitting python frames>
frame #20: <unknown function> + 0x29d90 (0x7f308b19bd90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #21: __libc_start_main + 0x80 (0x7f308b19be40 in /lib/x86_64-linux-gnu/libc.so.6)
Aborted (core dumped)
```
with this PR:
```python
>>> import torch
>>> torch.accelerator.current_stream(torch.accelerator.device_count())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/pt-gpu/4T-4652/guangyey/stock-pytorch/torch/accelerator/__init__.py", line 123, in current_stream
return torch._C._accelerator_getStream(device_index)
RuntimeError: The device index is out of range. It must be in [0, 2), but got 2.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143550
Approved by: https://github.com/EikanWang, https://github.com/dvrogozh, https://github.com/albanD
Otherwise certain sequences of tests will fail with OOM e.g.,
```
# python test/test_cuda.py -k max_split_expandable -k test_assigning_back_deleter_fns_to_tensor --repeat 100 .. ---------------------------------------------------------------------- Ran 2 tests in 0.311s OK E. ====================================================================== ERROR: test_assigning_back_deleter_fns_to_tensor (__main__.TestBlockStateAbsorption.test_assigning_back_deleter_fns_to_tensor)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/workspace/pytorch/torch/testing/_internal/common_utils.py", line 3058, in wrapper
method(*args, **kwargs)
File "/workspace/pytorch/test/test_cuda.py", line 4320, in test_assigning_back_deleter_fns_to_tensor
graph, outputs = cudagraphify(foo, [inp])
^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/pytorch/test/test_cuda.py", line 4080, in cudagraphify
fn(*inputs)
File "/workspace/pytorch/test/test_cuda.py", line 4316, in foo
int8_cuda(LARGE_BUFFER) + x,
~~~~~~~~~~~~~~~~~~~~~~~~^~~
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 160.00 MiB. GPU 0 has a total capacity of 31.73 GiB of which 31.30 GiB is free. Process 2916661 has 442.00 MiB memory in use. 120.00 MiB allowed; Of the allocated memory 52.00 MiB is allocated by PyTorch, and 6.00 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
To execute this test, run the following from the base repo dir:
python test/test_cuda.py TestBlockStateAbsorption.test_assigning_back_deleter_fns_to_tensor
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
----------------------------------------------------------------------
Ran 2 tests in 0.136s
FAILED (errors=1)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140852
Approved by: https://github.com/Skylion007
Certain `cpp_wrapper`-enabled tests were OOM-ing in the CI pipeline, with error messages suggesting that sufficient memory was accessible. This ultimately resulted from an internal memory limitation that was not queryable in the API. This PR adds querying for that limit.
Additionally, the failing tests had incorrect memory availability checks, and are updated with measured memory requirements.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140620
Approved by: https://github.com/malfet, https://github.com/eqy
ghstack dependencies: #141367
Canonically, the empty_cache API releases all cached blocks of the CUDACachingAllocator. There is no API that can release only the cached blocks of a given pool.
In this PR, we extend the functionality of empty_cache API such that it only releases the cached blocks of an active pool. When empty_cache API is called under a MemPoolContext, we only release the cached blocks that correspond to the pool id of the active pool.
Part of https://github.com/pytorch/pytorch/issues/124807.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133602
Approved by: https://github.com/ezyang
Canonically, the snapshot API returns the entire memory state of the CUDACachingAllocator (using `get_all_blocks`). There is no API that can only return the memory state of a given pool.
In this PR, we extend the functionality of snapshot API such that it can only return the memory addresses of an active pool. When snapshot API is called under a MemPoolContext, we only return the blocks that correspond to the pool id of the active pool.
Part of https://github.com/pytorch/pytorch/issues/124807.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133601
Approved by: https://github.com/ezyang
Canonically, the snapshot API returns the entire memory state of the CUDACachingAllocator (using `get_all_blocks`). There is no API that can only return the memory state of a given pool.
In this PR, we extend the functionality of snapshot API such that it can only return the memory addresses of an active pool. When snapshot API is called under a MemPoolContext, we only return the blocks that correspond to the pool id of the active pool.
Part of https://github.com/pytorch/pytorch/issues/124807.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133601
Approved by: https://github.com/ezyang
This PR refactors some ref-counting functionality out of `beginAllocateToPool` and `releasePool`. The ref-counting logic is then used in construction and destruction of `torch.cuda.MemPool`.
The `use_count` variable in the CUDACachingAllocator is essentially a refcount of how many context managers are using the pool. Since we are now lifting up the MemPool abstraction to the user, the MemPool object itself now needs to hold a an extra reference as well.
Part of https://github.com/pytorch/pytorch/issues/124807.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133600
Approved by: https://github.com/eqy, https://github.com/ezyang
# Motivation
This PR intends to make device-specific Event inherit from the generic torch.Event. The benefit is providing a generic abstract class `torch.Event` for different devices, like `torch.Stream`. This make it easier for Dynamo to capture the Event of different devices, like torch.cuda.Event and torch.xpu.Event.
And the next PR would like to remove previous useless base class `_StreamBase` and `_EventBase` to avoid multiple Inheritance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134845
Approved by: https://github.com/albanD, https://github.com/EikanWang
Seems like some other tests are holding onto memory that is not gc'able (e.g., cuBLAS workspaces), so these tests while working in isolation fail when run as e.g., `python test/test_cuda.py -k able`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136496
Approved by: https://github.com/ezyang
Summary: Fixed a bunch of fbcode imports that happened to work but confused autodeps. After this autodeps still suggests "improvements" to TARGETS (which breaks our builds) but at least it can find all the imports.
Test Plan:
```
fbpython fbcode/tools/build/buck/linters/lint_autoformat.py --linter=autodeps --default-exec-timeout=1800 -- fbcode/caffe2/TARGETS fbcode/caffe2/test/TARGETS
```
Before:
```
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/testing.py:229) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fbur$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export.py:87) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fburl$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_serdes.py:9) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fb$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_serdes.py:10) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fburl$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_retraceability.py:7) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https:$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_retraceability.py:6) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See ht$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export_nonstrict.py:7) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See http$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_export_nonstrict.py:6) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See $
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_export_training_ir_to_run_decomp.py:8) when processing rule "test_export". Please make sure it's listed in the srcs parameter of an$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export_training_ir_to_run_decomp.py:10) when processing rule "test_export". Please make sure it's listed in the srcs parameter of anoth$
ERROR while processing caffe2/test/TARGETS: Found "//python/typeshed_internal:typeshed_internal_library" owner for "cv2" but it is protected by visibility rules: [] (from caffe2/test/test_bundled_images.py:7) when processing rule "test_bundled_$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "caffe2.test.profiler_test_cpp_thread_lib" (from caffe2/test/profiler/test_cpp_thread.py:29) when processing rule "profiler_test_cpp_thread". Please make sure it's listed in t$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._utils_internal.get_file_path_2" (from caffe2/test/test_custom_ops.py:23) when processing rule "custom_ops". Please make sure it's listed in the srcs parameter of anoth$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._utils_internal.get_file_path_2" (from caffe2/test/test_public_bindings.py:13) when processing rule "public_bindings". Please make sure it's listed in the srcs paramete$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._C._profiler.symbolize_tracebacks" (from caffe2/test/test_cuda.py:3348) when processing rule "test_cuda". Please make sure it's listed in the srcs parameter of another $
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._C._profiler.gather_traceback" (from caffe2/test/test_cuda.py:3348) when processing rule "test_cuda". Please make sure it's listed in the srcs parameter of another rule$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for include <torch/csrc/autograd/profiler_kineto.h> (from caffe2/test/profiler/test_cpp_thread.cpp:2) when processing profiler_test_cpp_thread_lib. Some things to try:
```
Differential Revision: D62049222
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135614
Approved by: https://github.com/oulgen, https://github.com/laithsakka
Previously setting garbage_collection_threshold or max_split_size_mb along with expandable_segments:True could cause the allocator to hit assert failures when running nearly out of memory. This PR ensures garbage_collection and max_split freeing do not accidentally try to release expandable segments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134338
Approved by: https://github.com/ezyang