Commit Graph

235 Commits

Jeff Daily
59592ce9f2 [CUDA Host Allocator][ROCm] fixes (#110715)
Follow up to #110123, removing the CUDA_VERSION check for ROCm because HIP already has hipMallocAsync() and doesn't need the version check there.

Follow up to #108488, fixing the failing unit tests by accepting either a "cuda" or "hip" attribute for the caching allocator options. This aligns with the masquerading strategy for ROCm/HIP.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110715
Approved by: https://github.com/ezyang
2023-10-06 21:42:24 +00:00
Banit Agrawal
64583c4d04 [CUDA Host Allocator] Add support of CudaHostRegister (#108488)
Summary: This diff adds another option to create cuda pinned memory using cudaHostRegister.
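
A minimal sketch of opting into the new path; the config key `pinned_use_cuda_host_register` is assumed from this family of changes, so check the PR for the exact spelling:

```python
import os

# Assumed option key; the allocator parses PYTORCH_CUDA_ALLOC_CONF once,
# so it must be set before the first pinned allocation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "pinned_use_cuda_host_register:True"

import torch

# Pinned (page-locked) host memory is now registered with cudaHostRegister
# instead of being allocated with cudaHostAlloc.
t = torch.empty(64, pin_memory=True)
```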

Differential Revision: D45843715

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108488
Approved by: https://github.com/zdevito
2023-10-06 04:13:02 +00:00
Banit Agrawal
30c4c6ff9b [PyTorch CCA] Refactor caching allocator config code (#110123)
Summary: This diff refactors the code by moving CUDAAllocatorConfig into the header file. This config refactoring is done so that we can use the same config code for CUDA pinned memory as well.

Test Plan: sandcastle

Differential Revision: D49653265

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110123
Approved by: https://github.com/zdevito
2023-10-04 14:58:23 +00:00
eqy
6b84658433 [CUDA][cudaMallocAsync] Improve PYTORCH_CUDA_ALLOC_CONF error message (#104891)
Tiny fix to improve user-facing errors for issues like #104801

CC @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104891
Approved by: https://github.com/kit1980
2023-09-30 02:59:02 +00:00
cyy
a81d083b1c [Reland] Add -Wdeprecated and related fixes (#110019)
This is a reland of PRs https://github.com/pytorch/pytorch/pull/108626 and #109564. We fixed the iOS build failure by changing
```
((CHECK) ? (EXPR) : ([] { assert(!#CHECK); }(), (EXPR)))
```
to
```
((CHECK) ? (EXPR) : ([] { assert(false); }(), (EXPR)))
```
in TR2_OPTIONAL_ASSERTED_EXPRESSION, since the former syntax was invalid on Apple Clang. For now we apply the simple fix, hoping that c10::optional will be replaced by std::optional soon.
We also enabled -Wdeprecated on c10.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110019
Approved by: https://github.com/clee2000
2023-09-28 03:34:29 +00:00
PyTorch MergeBot
1cc052bcab Revert "[1/N] Add -Wdeprecated and related fixes (#108626)"
This reverts commit a53a677b4d.

Reverted https://github.com/pytorch/pytorch/pull/108626 on behalf of https://github.com/clee2000 due to I'm getting errors internally that look like the below on x86_64-apple-ios-simulator with clang 16 ([comment](https://github.com/pytorch/pytorch/pull/108626#issuecomment-1728102447))
2023-09-20 16:49:11 +00:00
cyy
a53a677b4d [1/N] Add -Wdeprecated and related fixes (#108626)
This PR adds -Wdeprecated to CMake warnings and fixes related issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108626
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2023-09-19 09:24:04 +00:00
Vlad Scherbich
2d26364fb3 [caffe2][cuda] Fix instrumentation of malloc/free SDTs for CUDACachingAllocator (#108907)
Summary:
There's currently a bug in `CUDACachingAllocator` which makes it impossible to determine whether a `malloc`ed sample has been deallocated (introduced in D48229150).

It happens because we currently instrument the `malloc` SDT **before** a block of memory has been allocated by either `cudaMalloc` or a local caching allocator `malloc` call. Since this is a static tracepoint, it receives arg values at the point of instrumentation. Currently, it receives the memory pointer, `void* p`, which is NULL.

Changes in this diff:
1) Move this SDT to right before the `allocate` function returns, so that memory has been allocated already and `p` pointer points to a valid, non-NULL address.
2) Enable tracing of `cudaMalloc` calls, in addition to `NativeCachingAllocator::malloc`
3) Rename a poorly-named local var: `r` --> `devPtr` (the pointer to the allocated memory block)

Test Plan:
Tested with a local PyTorch script that leaks memory. Verified the following:
* prior to this fix (prod), malloc samples are **not** marked as "freed"
* with the fix (branch), samples **are** marked as "freed"
* results are comparable with the current uprobe implementation to sample PyTorch malloc events in `gpusnoop`

Reviewed By: chaekit

Differential Revision: D48873734

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108907
Approved by: https://github.com/chaekit
2023-09-13 22:15:41 +00:00
Yinghai Lu
aebb86fef7 Back out "Faster gc_count update for CUDACachingAllocator" (#108632)
Summary:
Original commit changeset: 1d04ae368fd8

Original Phabricator Diff: D48481557

block.pool is not guaranteed to be non-null

Test Plan: CI

Differential Revision: D49003756

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108632
Approved by: https://github.com/houseroad
2023-09-06 14:57:41 +00:00
Banit Agrawal
b8af8ac784 [CUDACaching Allocator] Release the allocator lock on the slow path (#108367)
Summary: This diff releases the global allocator lock on the slow path when we make a synchronous cudaMalloc call.
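
A minimal Python sketch of the locking pattern, with hypothetical helpers standing in for the allocator internals (an illustration of the idea, not the actual C++ code):

```python
import threading

lock = threading.Lock()

def allocate(size, pop_free_block, backend_malloc, register_block):
    # Fast path: serve from the cached free list while holding the lock.
    with lock:
        block = pop_free_block(size)
    if block is not None:
        return block
    # Slow path: call the synchronous backend (analogous to cudaMalloc)
    # WITHOUT holding the lock, so other threads can keep allocating.
    raw = backend_malloc(size)
    # Re-acquire the lock only to publish the new block.
    with lock:
        return register_block(raw, size)
```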

Differential Revision: D48750077

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108367
Approved by: https://github.com/zdevito
2023-09-02 02:52:25 +00:00
Doe Hyun Yoon
ad17e5ec4e Faster gc_count update for CUDACachingAllocator (#108071)
Summary: Modify the way we update gc_count in CUDACachingAllocator to make it faster.

Reviewed By: jaewonlee-fb

Differential Revision: D48481557

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108071
Approved by: https://github.com/zdevito
2023-08-30 18:51:44 +00:00
cyy
d9fb7166d6 [BE] use DeviceIndex instead of int64_t for related device interfaces (#103068)
This PR unifies the device interfaces in aten/*cpp and torch/csrc/*cpp to use **c10::DeviceIndex**.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103068
Approved by: https://github.com/malfet
2023-08-25 20:16:14 +00:00
Zachary DeVito
40cbda274b document memory snapshotting (#107660)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107660
Approved by: https://github.com/albanD
ghstack dependencies: #107171, #107399
2023-08-24 19:20:03 +00:00
Zachary DeVito
c9b5e9d7a8 [allocator] register oom observers on every device (#107399)
This change matches the behavior of _record_memory_history, which was
recently changed to enable history recording on all devices rather than
only the current one. It prevents confusing situations where the observer
was registered before the device was set for the training run.

It also ensures the allocators have been initialized in the python binding just in case this is the first call to the CUDA API.
Fixes #107330
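
A sketch of attaching an observer from Python through the private binding; the callback parameter names are assumptions based on the C++ observer signature:

```python
import torch

def on_oom(device, allocated, device_total, device_free):
    # Dump whatever memory history has been recorded so far (private API).
    torch.cuda.memory._dump_snapshot(f"oom_device{device}.pickle")

# Private binding; after this change the observer is registered on every device.
torch._C._cuda_attach_out_of_memory_observer(on_oom)
```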
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107399
Approved by: https://github.com/eellison
ghstack dependencies: #107171
2023-08-23 18:57:24 +00:00
vlad-scherbich
e740491674 [caffe2][cuda] Trace allocate and local_raw_delete events with PyTorch USDTs (#107322)
Summary: Adds new tracepoints to CUDA allocator code for tracking alloc and dealloc events in the allocator code.

Test Plan: This change simply adds static tracepoints to CUDA allocator code, and does not otherwise change any logic. Testing is not required.

Reviewed By: chaekit

Differential Revision: D48229150

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107322
Approved by: https://github.com/chaekit
2023-08-22 16:31:30 +00:00
Zachary DeVito
80988b6277 Introduce memory stacks for free (#106758)
Previously, when we recorded a free action in a memory trace, we would provide
the stack for when the block was allocated. This is faster because we do not
have to record stacks for frees, which would otherwise double the number of stacks
collected. However, sometimes knowing the location of a free is useful for
figuring out why a tensor was live, so this PR adds that behavior. If
performance ends up being a concern, the old behavior is available by passing
"alloc" to the context argument rather than "all".

Also refactors some of the glue logic to be consistent across C++ and Python and
routes the Python API through the C++ version.
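
A short usage sketch of the trade-off described above:

```python
import torch

# context="all" records stacks for both allocation and free events (the new
# behavior); pass context="alloc" to keep the cheaper alloc-only stacks.
torch.cuda.memory._record_memory_history(context="all")

x = torch.randn(1024, device="cuda")
del x

snapshot = torch.cuda.memory._snapshot()  # free events now carry their own stacks
torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
```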
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106758
Approved by: https://github.com/albanD
2023-08-14 20:38:15 +00:00
Zachary DeVito
c14cf312c9 expandable_segments fix possible assert (#106818)
If record_history is enabled, then a block is allocated, record_history
is disabled, and then the block is freed and later unmapped, we can hit
the `to_map->context_when_allocated == nullptr` assertion.

This change universally clears context_when_allocated on free, which should
prevent this sequence of events from happening.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106818
Approved by: https://github.com/eellison
2023-08-09 01:09:03 +00:00
Zachary DeVito
449f481de0 [memory snapshots] record for all devices (#106346)
Previously calling _record_memory_history would only start recording
for a single device because snapshots were also device specific.

Now that the visualizer packages all devices into a single page, snapshot
recording should also enable recording for all devices.

Verified locally that calling the method does not initialize cuda context
on devices that have not previously been used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106346
Approved by: https://github.com/eellison
2023-08-01 19:56:15 +00:00
Jeff Daily
50e3f9cbbb [ROCm] HIP stream priority fix post #101956 (#106157)
PR #101956 introduced additional stream priorities for cuda streams. HIP streams have slightly different semantics.
- HIP: 1=low, 0=default, -1=high
- CUDA: 0=default, -1=high, -2=higher, etc.

This PR forces HIP stream priority to just 0 and -1 to match the pytorch semantics.

This fixes a broken unit test.

```
python3 test_cuda_multigpu.py TestCudaMultiGPU.test_streams_priority -v

Test results will be stored in test-reports/python-unittest/test_cuda_multigpu

Running tests...
----------------------------------------------------------------------
  test_streams_priority (__main__.TestCudaMultiGPU) ... ERROR (0.200s)

======================================================================
ERROR [0.200s]: test_streams_priority (__main__.TestCudaMultiGPU)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 2354, in wrapper
    method(*args, **kwargs)
  File "test_cuda_multigpu.py", line 656, in test_streams_priority
    low, high = torch.cuda.Stream.priority_range()
RuntimeError: least_priority == 0 INTERNAL ASSERT FAILED at "/var/lib/jenkins/pytorch-upstream/c10/hip/HIPStream.h":184, please report a bug to PyTorch. Unexpected HIP stream priority range
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106157
Approved by: https://github.com/malfet
2023-07-31 16:57:20 +00:00
Zachary DeVito
3e5a52cedd [memory snapshot] track context for segments (#106113)
We want to display the stack for the original cudaMalloc that created a segment.
Previously we could only report the last time the segment memory was used,
or rely on the record of the segment_alloc appearing in the list of allocator actions.
This PR ensures that, regardless of whether we still have the segment_alloc action,
the context for a segment is still available. The visualizer is updated to
be able to incorporate this information.

This PR adds a new field to Block. However, the previous stacked cleanup PR
removed a field of the same size, making the change to Block size-neutral.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106113
Approved by: https://github.com/aaronenyeshi
2023-07-28 06:45:48 +00:00
Zachary DeVito
45b564766d [memory snapshots] removed chained history (#106079)
For free blocks of memory in the allocator, we previously kept a linked list
of the stack frames of previous allocations that lived there. This was only
ever used in one flamegraph visualization and never proved useful at
understanding what was going on. When memory history tracing was added, it
became redundant, since we can see the history of the free space from recording
the previous actions anyway.

This patch removes this functionality and simplifies the snapshot format:
allocated blocks directly have a 'frames' attribute rather than burying stack frames in the history.
Previously the memory history tracked the real size of allocations before rounding.
Since history was added, 'requested_size' has been added directly to the block which records the same information,
so this patch also removes that redundancy.

None of this functionality has been part of a PyTorch release with BC guarantees, so it should be safe to alter
this part of the format.

This patch also updates our visualization tools to work with the simplified format. Visualization tools keep
support for the old format in `_legacy` functions so that during the transition old snapshot files can still be read.
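
A sketch of walking the simplified format; the keys follow the description above, but the schema is internal and unversioned, so treat the field names as illustrative:

```python
import torch

torch.cuda.memory._record_memory_history()
y = torch.randn(1 << 20, device="cuda")

snapshot = torch.cuda.memory._snapshot()
for segment in snapshot["segments"]:
    for block in segment["blocks"]:
        if block["state"] == "active_allocated":
            # 'requested_size' is the pre-rounding size; 'frames' now holds
            # the allocating stack directly, with no chained history.
            print(block["requested_size"], len(block.get("frames", [])))
```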

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106079
Approved by: https://github.com/eellison
2023-07-28 06:45:48 +00:00
eqy
2c85f28c71 [CUDA][cudaMallocAsync] Reduce record-stream warning spam (#105015)
Addresses #104925

CC @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105015
Approved by: https://github.com/eellison
2023-07-13 02:06:14 +00:00
Huy Do
9154bbc999 Fix CUDA Bazel build to optionally include gmock after #104255 (#104308)
This reverts commit 39868b0578. Fixes https://github.com/pytorch/pytorch/issues/104279.

The change came from an internal codemod diff that we don't want to revert. AFAIK, this addition is not needed, as gmock has already been included: https://github.com/google/googletest/blob/main/BUILD.bazel

### Testing

* OSS CUDA Bazel build should be back after this revert
* Import as D47077813 to make sure that nothing breaks internally

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104308
Approved by: https://github.com/kit1980, https://github.com/malfet
2023-06-29 07:15:06 +00:00
Logan Wendholt
39868b0578 [codemod][third-party][gtest] Migrate all fbcode gtest from tp2 to fbsource/third-party (#104255)
Summary:
## What is this?
This is a giant codemod to migrate all of fbcode from the tp2 version of gtest to the `fbsource/third-party` version.

## Why?
Various parts of the monorepo use different versions of gtest which are incompatible with each other and make maintenance of C++ testing more difficult than it should be. There also doesn't seem to be much reason for this fragmentation. Shifting all `gtest` dependencies towards `fbsource/third-party` is a big step in the right direction towards cleaning this up.

Also -- tp2 is deprecated, so we want to stop using that anyway. If we're going to make improvements to `gtest`, we should get away from tp2 as a first step.

## How?

I used a bash script to perform the majority of the codemod: P777150295

I followed up with `rg` to find additional dependencies, then simply iterated a ton until CI was (mostly) happy.

This diff also includes an update to autodeps to use the `third-party/fbsource` version of gtest rather than the `tp2` version.

#forcetdhashing

Test Plan: CI

Differential Revision: D46961576

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104255
Approved by: https://github.com/huydhn
2023-06-27 19:10:08 +00:00
cyy
87cbfe957a increase clang-tidy coverage to more c10 source files (#102902)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102902
Approved by: https://github.com/Skylion007
2023-06-04 06:33:01 +00:00
Shiyan Deng
f15af19877 initialize max_stream_priorities in getStreamFromPool(bool) (#102739)
Summary:
The `getStreamFromPool(bool, signed char)` overload doesn't initialize `max_stream_priorities`, so if we call `getStreamFromPool(true)` we hit the following error:
```
terminate called after throwing an instance of 'c10::Error'
  what():  Expected cuda stream priority to be less than or equal to 0, got 1
```

Differential Revision: D46358087

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102739
Approved by: https://github.com/ngimel
2023-06-01 21:05:56 +00:00
Natalia Gimelshein
ecd79b1fef add additional stream priority for cuda streams (#101956)
Changes the StreamID encoding to use the last bit to distinguish between external and internal streams, 4 bits for IdType (DEFAULT, EXT, or user-created streams possibly with high priority), and 5 bits for index. This allows us to expose more stream priorities to users (I'm currently setting 4, but that's easy to change now). Note that we pre-create all 32 streams in the pool for each allowed priority; I don't know whether that is a problem in practice. Currently CUDA 11.8/A100 GPUs allow 6 different stream priorities; the number may differ across cards and CUDA versions.

Previous callsites explicitly requesting a high priority stream (`isHighPriority=true`) now get the highest priority stream.
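
A usage sketch; the exact range returned depends on the build and hardware, so the values in the comment are illustrative:

```python
import torch

low, high = torch.cuda.Stream.priority_range()  # e.g. (0, -3) with four levels
s = torch.cuda.Stream(priority=high)            # highest-priority stream
with torch.cuda.stream(s):
    torch.randn(8, device="cuda").sum()
```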

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101956
Approved by: https://github.com/ezyang
2023-05-27 02:36:16 +00:00
PyTorch MergeBot
6c9b94dcda Revert "add additional stream priority for cuda streams (#101956)"
This reverts commit 5da497cabb.

Reverted https://github.com/pytorch/pytorch/pull/101956 on behalf of https://github.com/osalpekar due to Broke internal builds that used -Wunused-function since this PR removed the call to StreamIdType::<< ([comment](https://github.com/pytorch/pytorch/pull/101956#issuecomment-1563875493))
2023-05-26 06:35:23 +00:00
Natalia Gimelshein
5da497cabb add additional stream priority for cuda streams (#101956)
Changes the StreamID encoding to use the last bit to distinguish between external and internal streams, 4 bits for IdType (DEFAULT, EXT, or user-created streams possibly with high priority), and 5 bits for index. This allows us to expose more stream priorities to users (I'm currently setting 4, but that's easy to change now). Note that we pre-create all 32 streams in the pool for each allowed priority; I don't know whether that is a problem in practice. Currently CUDA 11.8/A100 GPUs allow 6 different stream priorities; the number may differ across cards and CUDA versions.

Previous callsites explicitly requesting a high priority stream (`isHighPriority=true`) now get the highest priority stream.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101956
Approved by: https://github.com/ezyang
2023-05-24 23:26:47 +00:00
Jeff Daily
bf214f40d4 explicitly check or discard cudaGetLastError return value (#100488)
cudaGetLastError and hipGetLastError will clear any error value within CUDA and HIP, respectively. This is often done on purpose to clear benign errors. Discarding the return value should be indicated by casting to void and a nearby comment. This silences warnings from HIP:

warning: ignoring return value of function declared with 'nodiscard' attribute [-Wunused-result]

An audit of the pytorch sources found one use of cudaGetLastError in IndexKernel.cu whose return value was incorrectly ignored.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100488
Approved by: https://github.com/ezyang
2023-05-10 01:24:07 +00:00
Han Zhu
5ef50ef2d8 [caffe2] Remove inline keyword of function CUDACachingAllocator::format_size (#100734)
Summary: `CUDACachingAllocator::format_size` is used not only in CUDACachingAllocator.cpp but also in CUDAMallocAsyncAllocator.cpp. This caused a breakage when the compiler inlined the function and the linker couldn't find it when resolving symbols for CUDAMallocAsyncAllocator.cpp.

Differential Revision: D45612790

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100734
Approved by: https://github.com/interwq, https://github.com/kit1980
2023-05-09 01:03:39 +00:00
zdevito
0aac244680 Support expandable_segments:True in fbcode for caching allocator
Now that expandable_segments has been merged from OSS, we can enable it in the internal build. It still defaults to off, so this should not cause any behavior changes in the allocator unless the flag is explicitly set.

Differential Revision: D45249535

Pull request resolved: https://github.com/pytorch/pytorch/pull/100184
2023-05-02 11:12:39 -07:00
Elias Ellison
3edff6b6ec Improve detection of workspace/non-output allocations in cudagraphs (#99985)
When we run cudagraph trees, we are not allowed to have permanent workspace allocations like those in cublas: we might need to reclaim that memory for a previous cudagraph recording, and that memory is not accounted for in output weakrefs, so it does not work with checkpointing. Previously, I checked that we didn't have any additional allocations via snapshotting. This was extremely slow, so I had to turn it off.

This PR first does the quick check to see whether we are in an error state, and only then does the slow work of creating a snapshot. It also turns on history recording so we get a stack trace of where the bad allocation came from.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99985
Approved by: https://github.com/zdevito
2023-05-01 15:58:45 +00:00
Zachary DeVito
8548cb3dd5 Improve OOM error message (#99699)
This PR adds calls to nvml during an OOM to find out the total memory
in use by the process and any other CUDA processes on the device.

This makes it easier to identify cases where non-PyTorch libraries have
allocated memory or another process (such as a data loader) has also
allocated memory on the device.

This also rewords the other parts of the error message to make the meaning
of the memory statistics more clear with this new information:

"""
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 138.00 MiB.
GPU 0 has a total capacty of 15.90 GiB of which 8.44 MiB is free.
Process 1246069 has 577.00 MiB memory in use. Including non-PyTorch memory,
this process has 15.32 GiB memory in use. Of the allocated memory
14.12 GiB is allocated by PyTorch, and 410.41 MiB is reserved
by PyTorch but unallocated. If reserved but unallocated memory is large
try setting max_split_size_mb to avoid fragmentation.  See documentation
 for Memory Management and PYTORCH_CUDA_ALLOC_CONF
"""
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99699
Approved by: https://github.com/ngimel
2023-04-21 21:36:48 +00:00
Zachary DeVito
2402fe5210 [memory allocator] fix ifdef typo (#99553)
The first PR went in with the expandable allocator accidentally disabled,
which happened while trying to fix the build on unusual architectures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99553
Approved by: https://github.com/ezyang, https://github.com/eellison
2023-04-19 21:45:51 +00:00
mikey dagitses
1eb1911012 migrate cuda files to const_data_ptr (#99357)
Summary:
These are all going to const_data_ptr, so they ought to all be safe.

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99357
Approved by: https://github.com/ezyang
2023-04-19 12:06:25 +00:00
Zachary DeVito
7ff1f3f3f6 Revert "Revert "Expandable blocks in allocator (#96995)"" (#99275)
This reverts commit 851e89c8e8.

Differential Revision: [D45034526](https://our.internmc.facebook.com/intern/diff/D45034526)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99275
Approved by: https://github.com/eellison
2023-04-17 23:46:08 +00:00
PyTorch MergeBot
851e89c8e8 Revert "Expandable blocks in allocator (#96995)"
This reverts commit 6a50b83b73.

Reverted https://github.com/pytorch/pytorch/pull/96995 on behalf of https://github.com/izaitsevfb due to Breaks internal tests
2023-04-16 19:23:37 +00:00
Zachary DeVito
6a50b83b73 Expandable blocks in allocator (#96995)
Common advice we give for handling memory fragmentation issues is to
allocate a big block upfront to reserve memory which will get split up later.
For programs with changing tensor sizes this can be especially helpful to
avoid OOMs that happen the first time we see a new largest input and would
otherwise have to allocate new segments.

However, the issue with allocating a block upfront is that it is nearly impossible
to correctly estimate the size of that block. If too small, space in the block
will run out and the allocator will allocate separate blocks anyway. Too large,
and other non-PyTorch libraries might stop working because they cannot allocate
any memory.

This patch provides the same benefits as using a pre-allocating block but
without having to choose its size upfront. Using the cuMemMap-style APIs,
it adds the ability to expand the last block in a segment when more memory is
needed.

Compared to universally using cudaMallocAsync to avoid fragmentation,
this patch can fix this common fragmentation issue while preserving most
of the existing allocator behavior. This behavior can be enabled and disabled dynamically.
 This should allow users to, for instance, allocate long-lived parameters and state in individual buffers,
and put temporary state into the large expandable blocks, further reducing
fragmentation.

See inline comments for information about the implementation and its limitations.
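
A minimal way to try the new behavior (the flag stays off by default):

```python
import os

# Must be set before the CUDA caching allocator is initialized.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

# Growing sizes can now expand the last block of a segment via the
# cuMemMap-style APIs instead of forcing brand-new segments.
for n in (1 << 20, 1 << 22, 1 << 24):
    x = torch.empty(n, device="cuda")
```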

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96995
Approved by: https://github.com/eellison
2023-04-14 09:49:11 +00:00
Aidyn-A
69eef5a4be [CUDA12] set_device change (#94864)
This PR adds a workaround for the CUDA 12 [`cudaSetDevice` change](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g159587909ffa0791bbe4b40187a4c6bb), which always creates a primary context on the target device. So operations like this:
```Python
import torch
x = torch.randn(1, device="cuda:1")
```
would always create a primary context on device `cuda:1`, because a tensor is created on it, and on device `cuda:0`, because the destructor of the CUDA device guard calls `cudaSetDevice(0)`.
After this PR the CUDA Device guard will not call `cudaSetDevice(0)` if primary context does not exist on `cuda:0`.
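
A sketch of checking the guarded behavior with a private test helper; the binding name comes from PyTorch's test suite, but treat the expected values as an assumption:

```python
import torch

x = torch.randn(1, device="cuda:1")
# With this PR the device guard's restore no longer initializes cuda:0.
print(torch._C._cuda_hasPrimaryContext(0))  # expected: False (private API)
print(torch._C._cuda_hasPrimaryContext(1))  # True
```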

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94864
Approved by: https://github.com/malfet, https://github.com/atalman, https://github.com/ezyang
2023-04-10 17:31:12 +00:00
PyTorch MergeBot
45a2f6b70f Revert "Reduce includes of CUDACachingAllocator.h (#97072)"
This reverts commit 1bcb880894.

Reverted https://github.com/pytorch/pytorch/pull/97072 on behalf of https://github.com/weiwangmeta due to breaking internal builds
2023-04-07 06:15:11 +00:00
Zachary DeVito
1bcb880894 Reduce includes of CUDACachingAllocator.h (#97072)
On my machine the number of files that include this header goes from > 200 to ~80, making rebuilds faster.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97072
Approved by: https://github.com/wanchaol
2023-04-06 17:22:35 +00:00
Zachary DeVito
e085acc9f3 Cleanup Copy.cu logic (#97071)
Some of the logic specific to the cudaMallocAsync allocator related to peer access is placed outside of the allocator itself. This PR refactors, documents, and encapsulates it, while maintaining the same behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97071
Approved by: https://github.com/ngimel, https://github.com/eellison
2023-04-06 17:22:35 +00:00
PyTorch MergeBot
279ca5f9db Revert "[CUDA12] set_device change (#94864)"
This reverts commit c18be2b2ec.

Reverted https://github.com/pytorch/pytorch/pull/94864 on behalf of https://github.com/ezyang due to avoid affecting cuda 11
2023-04-05 14:53:00 +00:00
Aidyn-A
c18be2b2ec [CUDA12] set_device change (#94864)
This PR adds a workaround for the CUDA 12 [`cudaSetDevice` change](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g159587909ffa0791bbe4b40187a4c6bb), which always creates a primary context on the target device. So operations like this:
```Python
import torch
x = torch.randn(1, device="cuda:1")
```
would always create a primary context on device `cuda:1`, because a tensor is created on it, and on device `cuda:0`, because the destructor of the CUDA device guard calls `cudaSetDevice(0)`.
After this PR the CUDA Device guard will not call `cudaSetDevice(0)` if primary context does not exist on `cuda:0`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94864
Approved by: https://github.com/malfet, https://github.com/atalman, https://github.com/ezyang
2023-04-05 14:34:00 +00:00
mikey dagitses
2ac9086987 run buildifier on unified build files (#98141)
This is pretty tricky. buildifier by default doesn't do much to these
files. It does a little more if you tell it that they are
`BUILD.bazel` files with -type=build. But it can do even more if you
remove the target definitions from the `def define_rules()` wrapper
and dedent them.

I wrote a little wrapper that does that. I'll submit it at a later
date.

Differential Revision: [D44606558](https://our.internmc.facebook.com/intern/diff/D44606558/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D44606558/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98141
Approved by: https://github.com/ezyang, https://github.com/PaliC
2023-04-04 00:37:19 +00:00
Kazuaki Ishizaki
64b8d20a5c Fix typos under c10 directory (#98079)
This PR fixes typos in comments and messages of files under `c10` directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98079
Approved by: https://github.com/Skylion007
2023-03-31 18:31:11 +00:00
Zachary DeVito
b1a83c4da4 [memory history] cleanup recording API (#97406)
This makes the options for recording memory history
easier to understand and makes recording
the most information the default.

### <samp>🤖 Generated by Copilot at 4706acf</samp>

This pull request enhances the memory profiling and debugging capabilities of PyTorch on CUDA devices. It introduces a new API for memory history recording in `torch/cuda/memory.py` and `test/test_cuda.py`, and adds new functions for memory snapshot management and visualization in `torch/cuda/memory.py`.

Also adds a quick _dump_snapshot function to make
it easier to look at the common visualizations.

### <samp>🤖 Generated by Copilot at 4706acf</samp>

*  Modify the `_record_memory_history` function to use a new API that accepts a string argument for the `enabled` parameter and more parameters to control the stack trace collection and memory event history ([link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-80bd98caafb20d758f45a4d23711810f7e0b9ce7a6505094f9dbb0e00a657377L620-R696))
* Add a new function `_dump_snapshot` that allows users to dump a memory snapshot to a directory with HTML plots of the memory segments and events ([link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-80bd98caafb20d758f45a4d23711810f7e0b9ce7a6505094f9dbb0e00a657377R703-R713))
* Update the test cases in `test/test_cuda.py` to use the new API for memory history recording and check the expected output of the memory plots ([link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L4946-R4946), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L4984-R4984), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5000-R5000), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5015-R5015), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5035-R5038), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450R5045-R5046), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5060-R5059), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5068-R5065), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5088-R5085))
* Add missing imports and types to the `torch/cuda/memory.py` module ([link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-80bd98caafb20d758f45a4d23711810f7e0b9ce7a6505094f9dbb0e00a657377L5-R15))
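
A usage sketch of the reworked API; the argument values come from the description above, and `_dump_snapshot`'s output location is as described in this message:

```python
import torch

# enabled now takes a string; "all" records stacks plus the event history.
torch.cuda.memory._record_memory_history(enabled="all")

xs = [torch.randn(1 << 16, device="cuda") for _ in range(4)]

# Write the snapshot together with the common HTML visualizations.
torch.cuda.memory._dump_snapshot("memory_snapshot")
```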
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97406
Approved by: https://github.com/ezyang
2023-03-28 16:31:10 +00:00
Nikita Shulga
24ce3a7c34 Move hasPrimaryContext to c10::cuda (#96800)
This method has to be accessible from `c10` to enable CUDA-12 integration.
Implemented by providing a private `c10::cuda::_internal::setHasPrimaryContext` that passes the pointer to the implementation (in `torch_cuda`) back to c10.
A global class constructor/destructor is used to guarantee RAII.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96800
Approved by: https://github.com/ngimel
2023-03-17 04:50:35 +00:00
Elias Ellison
a7d2e451fd Fix build, shadowed variable (#96778)
Had an internal build error with this

Differential Revision: [D44071892](https://our.internmc.facebook.com/intern/diff/D44071892)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96778
Approved by: https://github.com/Chillee, https://github.com/voznesenskym
2023-03-15 16:41:06 +00:00