Follow-up to #110123, removing the CUDA_VERSION check for ROCm because HIP already provides hipMallocAsync() and does not need the version check there.
Follow-up to #108488, fixing the failing unit tests by accepting either a "cuda" or "hip" attribute for the caching allocator options. This is aligned with the masquerading strategy for ROCm/HIP.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110715
Approved by: https://github.com/ezyang
Summary: This diff moves CUDAAllocatorConfig into the header file so that the same config code can also be used for CUDA pinned memory.
Test Plan: sandcastle
Differential Revision: D49653265
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110123
Approved by: https://github.com/zdevito
This is a reland of https://github.com/pytorch/pytorch/pull/108626 and #109564. We fixed the iOS build failure by changing
```
((CHECK) ? (EXPR) : ([] { assert(!#CHECK); }(), (EXPR)))
```
to
```
((CHECK) ? (EXPR) : ([] { assert(false); }(), (EXPR)))
```
in TR2_OPTIONAL_ASSERTED_EXPRESSION, since the former syntax is invalid on Apple Clang. We apply this simple fix in the hope that c10::optional will be replaced by std::optional soon.
We also enabled -Wdeprecated on c10.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110019
Approved by: https://github.com/clee2000
Summary:
There's currently a bug in `CUDACachingAllocator` (introduced in D48229150) which makes it impossible to determine whether a `malloc`ed sample has been deallocated.
It happens because we currently instrument the `malloc` SDT **before** a block of memory has been allocated by either `cudaMalloc` or the local caching allocator's `malloc` call. Since this is a static tracepoint, it receives arg values at the point of instrumentation. Currently, it receives the memory pointer, `void* p`, which is NULL.
Changes in this diff:
1) Move this SDT to right before the `allocate` function returns, so that memory has already been allocated and the `p` pointer points to a valid, non-NULL address.
2) Enable tracing of `cudaMalloc` calls, in addition to `NativeCachingAllocator::malloc`.
3) Rename a poorly named local variable: `r` --> `devPtr` (pointer to the allocated memory block).
Test Plan:
Tested with a local PyTorch script that leaks memory. Verified the following:
* prior to this fix (prod), malloc samples are **not** marked as "freed"
* with the fix (branch), samples **are** marked as "freed"
* results are comparable with the current uprobe implementation to sample PyTorch malloc events in `gpusnoop`
Reviewed By: chaekit
Differential Revision: D48873734
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108907
Approved by: https://github.com/chaekit
This change matches the behavior of _record_memory_history, which was
recently changed to enable history recording on all devices rather than
only the current one. It prevents confusing situations where the observer
was registered before the device was set for the training run.
It also ensures the allocators have been initialized in the Python binding, in case this is the first call to the CUDA API.
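For illustration, a minimal sketch of registering such an observer (not part of this change); the private hook name and callback signature follow the pattern used in PyTorch's own tests and should be treated as assumptions here:
```
import torch

# Callback invoked by the allocator when an allocation fails with OOM.
def oom_observer(device, alloc, device_alloc, device_free):
    print(f"OOM on device {device}: tried to allocate {alloc} bytes "
          f"({device_free} bytes free on the device)")

# Private binding; after this change it fires for OOMs on any device.
torch._C._cuda_attach_out_of_memory_observer(oom_observer)
```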
Fixes #107330
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107399
Approved by: https://github.com/eellison
ghstack dependencies: #107171
Summary: Adds new static tracepoints to the CUDA allocator code for tracking alloc and dealloc events.
Test Plan: This change simply adds static tracepoints to CUDA allocator code, and does not otherwise change any logic. Testing is not required.
Reviewed By: chaekit
Differential Revision: D48229150
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107322
Approved by: https://github.com/chaekit
Previously, when we recorded a free action in a memory trace, we would provide
the stack for when the block was allocated. This is faster because we do not
have to record stacks for frees, which would otherwise double the number of stacks
collected. However, sometimes knowing the location of a free is useful for
figuring out why a tensor was still live, so this PR adds the option to record
the stack of the free itself. If performance ends up being a concern, the old
behavior is still available by passing "alloc" to the context argument rather than "all".
Also refactors some of the glue logic to be consistent across C++ and Python and
routes the Python API through the C++ version.
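For illustration, a minimal sketch of the context argument described above (keyword names follow the torch.cuda.memory._record_memory_history API in this stack; treat the exact defaults as assumptions):
```
import torch

# Record stacks for allocations *and* frees (the new behavior).
torch.cuda.memory._record_memory_history(context="all")

x = torch.randn(1024, 1024, device="cuda")
del x
snapshot = torch.cuda.memory._snapshot()  # free events now carry their own stacks

# If collecting stacks on free is too costly, keep allocation-only stacks.
torch.cuda.memory._record_memory_history(context="alloc")
```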
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106758
Approved by: https://github.com/albanD
If record_history is enabled, a block is allocated, record_history is then
disabled, and the block is later freed and unmapped, we can hit
the `to_map->context_when_allocated == nullptr` assertion.
This change universally clears context_when_allocated on free, which should
prevent this sequence of events from triggering the assertion.
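A sketch of the sequence of events described above (using expandable segments to reach the unmap path is an assumption, and this is not a guaranteed repro):
```
import os
# Expandable segments so freed blocks can later be unmapped (assumption).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

torch.cuda.memory._record_memory_history("all")  # 1. recording enabled
x = torch.randn(1 << 20, device="cuda")          # 2. block allocated
torch.cuda.memory._record_memory_history(None)   # 3. recording disabled
del x                                            # 4. block freed
torch.cuda.empty_cache()                         # 5. block released/unmapped
```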
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106818
Approved by: https://github.com/eellison
Previously calling _record_memory_history would only start recording
for a single device because snapshots were also device specific.
Now that the visualizer packages all devices into a single page, snapshot
recording should also enable recording for all devices.
Verified locally that calling the method does not initialize cuda context
on devices that have not previously been used.
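For illustration, a sketch of what this enables (the "device_traces" key name follows the snapshot format consumed by the visualizer and is an assumption to the extent it is not spelled out here):
```
import torch

torch.cuda.memory._record_memory_history()  # now enables recording on every device
a = torch.ones(1 << 20, device="cuda:0")
b = torch.ones(1 << 20, device="cuda:1")     # assumes a second GPU is present

snap = torch.cuda.memory._snapshot()
print(len(snap["device_traces"]))            # one trace list per visible device
```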
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106346
Approved by: https://github.com/eellison
PR #101956 introduced additional stream priorities for CUDA streams. HIP streams have slightly different semantics.
- HIP: 1=low, 0=default, -1=high
- CUDA: 0=default, -1=high, -2=higher, etc.
This PR restricts HIP stream priorities to just 0 and -1 to match the PyTorch semantics.
This fixes a broken unit test.
```
python3 test_cuda_multigpu.py TestCudaMultiGPU.test_streams_priority -v
Test results will be stored in test-reports/python-unittest/test_cuda_multigpu
Running tests...
----------------------------------------------------------------------
test_streams_priority (__main__.TestCudaMultiGPU) ... ERROR (0.200s)
======================================================================
ERROR [0.200s]: test_streams_priority (__main__.TestCudaMultiGPU)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 2354, in wrapper
method(*args, **kwargs)
File "test_cuda_multigpu.py", line 656, in test_streams_priority
low, high = torch.cuda.Stream.priority_range()
RuntimeError: least_priority == 0 INTERNAL ASSERT FAILED at "/var/lib/jenkins/pytorch-upstream/c10/hip/HIPStream.h":184, please report a bug to PyTorch. Unexpected HIP stream priority range
```
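For reference, a small sketch of the user-visible contract this restores on ROCm (the returned values are an expectation based on the description above, not captured output):
```
import torch

low, high = torch.cuda.Stream.priority_range()
# On ROCm this is now expected to be (0, -1): 0 is the default priority and
# more negative values are higher priority, matching the CUDA-side convention.
s = torch.cuda.Stream(priority=high)  # highest-priority stream
```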
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106157
Approved by: https://github.com/malfet
We want to display the stack for the original cudaMalloc that created a segment.
Previously we could only report the last time the segment memory was used,
or the record of the segment_alloc could appear in the list of allocator actions.
This PR ensures that, regardless of whether we still have the segment_alloc action,
the context for a segment is still available. The visualizer is updated to
be able to incorporate this information.
This PR adds a new field to Block. However the previous stacked cleanup PR
removed a field of the same size, making the change to Block size-neutral.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106113
Approved by: https://github.com/aaronenyeshi
For free blocks of memory in the allocator, we previously kept a linked list
of the stack frames of previous allocations that lived there. This was only
ever used in one flamegraph visualization and never proved useful for
understanding what was going on. When memory history tracing was added, it
became redundant, since we can see the history of the free space from recording
the previous actions anyway.
This patch removes this functionality and simplifies the snapshot format:
allocated blocks directly have a 'frames' attribute rather than burying stack frames in the history.
Previously the memory history tracked the real size of allocations before rounding.
Since history was added, a 'requested_size' field recording the same information has been added directly to the block,
so this patch also removes that redundancy.
None of this functionality has been part of a PyTorch release with BC guarantees, so it should be safe to alter
this part of the format.
This patch also updates our visualization tools to work with the simplified format. Visualization tools keep
support for the old format in `_legacy` functions so that during the transition old snapshot files can still be read.
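For illustration, a minimal sketch of reading the simplified format (key names are based on the description above and should be treated as assumptions where not spelled out):
```
import torch

torch.cuda.memory._record_memory_history()
x = torch.randn(4096, 4096, device="cuda")
snap = torch.cuda.memory._snapshot()

for seg in snap["segments"]:
    for block in seg["blocks"]:
        if block["state"] == "active_allocated":
            # 'frames' now lives directly on the block; 'requested_size'
            # records the pre-rounding size the removed history used to track.
            print(block["requested_size"], len(block.get("frames", [])))
```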
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106079
Approved by: https://github.com/eellison
Summary:
## What is this?
This is a giant codemod to migrate all of fbcode from the tp2 version of gtest to the `fbsource/third-party` version.
## Why?
Various parts of the monorepo use different versions of gtest which are incompatible with each other and make maintenance of C++ testing more difficult than it should be. There also doesn't seem to be much reason for this fragmentation. Shifting all `gtest` dependencies towards `fbsource/third-party` is a big step in the right direction towards cleaning this up.
Also -- tp2 is deprecated, so we want to stop using that anyway. If we're going to make improvements to `gtest`, we should get away from tp2 as a first step.
## How?
I used a bash script to perform the majority of the codemod: P777150295
I followed up with `rg` to find additional dependencies, then simply iterated a ton until CI was (mostly) happy.
This diff also includes an update to autodeps to use the `third-party/fbsource` version of gtest rather than the `tp2` version.
#forcetdhashing
Test Plan: CI
Differential Revision: D46961576
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104255
Approved by: https://github.com/huydhn
Summary:
The `getStreamFromPool(bool, signed char)` overload doesn't initialize `max_stream_priorities`, so if we call `getStreamFromPool(true)` we hit the following error:
```
terminate called after throwing an instance of 'c10::Error'
what(): Expected cuda stream priority to be less than or equal to 0, got 1
```
Differential Revision: D46358087
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102739
Approved by: https://github.com/ngimel
Changes the StreamID encoding to use the last bit to distinguish between external and internal streams, 4 bits for IdType (DEFAULT, EXT, or user-created streams possibly with high priority), and 5 bits for the index. This allows us to expose more stream priorities to the user (I'm currently setting 4, but that's easy to change now). Note that we are pre-creating all 32 streams in the pool for each allowed priority; I don't know if that's a problem in practice. Currently CUDA 11.8/A100 GPUs allow 6 different stream priorities; the number may differ across cards and CUDA versions.
Previous call sites explicitly requesting a high-priority stream (`isHighPriority=true`) now get the highest-priority stream.
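A purely illustrative packing that matches the bit widths described above (1 bit for external vs. internal, 4 bits for the stream type, 5 bits for the pool index); the field order and exact layout in c10 are assumptions here, not the actual encoding:
```
def pack_stream_id(is_ext: bool, id_type: int, index: int) -> int:
    # 5-bit index | 4-bit type | 1-bit external flag (hypothetical layout)
    assert 0 <= id_type < 16 and 0 <= index < 32
    return (index << 5) | (id_type << 1) | int(is_ext)

def unpack_stream_id(sid: int):
    return bool(sid & 0b1), (sid >> 1) & 0b1111, (sid >> 5) & 0b11111

assert unpack_stream_id(pack_stream_id(False, 3, 17)) == (False, 3, 17)
```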
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101956
Approved by: https://github.com/ezyang
cudaGetLastError and hipGetLastError will clear any error value within CUDA and HIP, respectively. This is often done on purpose to clear benign errors. Discarding the return value should be indicated by casting to void and a nearby comment. This silences warnings from HIP:
warning: ignoring return value of function declared with 'nodiscard' attribute [-Wunused-result]
An audit of the PyTorch sources found one use of cudaGetLastError in IndexKernel.cu whose return value was incorrectly ignored.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100488
Approved by: https://github.com/ezyang
Summary: `CUDACachingAllocator::format_size` is used not only in CUDACachingAllocator.cpp but also in CUDAMallocAsyncAllocator.cpp. This caused a breakage when the compiler inlined the function and the linker couldn't find it when resolving symbols for CUDAMallocAsyncAllocator.cpp.
Differential Revision: D45612790
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100734
Approved by: https://github.com/interwq, https://github.com/kit1980
Now that expandable_segments has been merged from OSS, we can enable it in the internal build. It still defaults to off, so this should not cause any behavior changes in the allocator unless the flag is explicitly set.
Differential Revision: D45249535
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100184
When we run cudagraph trees we are not allowed to have permanent workspace allocations like in cuBLAS, because we might need to reclaim that memory for a previous cudagraph recording, and it is memory that is not accounted for in output weakrefs, so it does not work with checkpointing. Previously, I would check that we didn't have any additional allocations through snapshotting, but this was extremely slow, so I had to turn it off.
This PR first does a quick check to see whether we are in an error state, and only then runs the slow logic of creating a snapshot. It also turns on history recording so we get a stack trace of where the bad allocation came from.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99985
Approved by: https://github.com/zdevito
This PR adds calls to nvml during an OOM to find out the total memory
in use by the process and any other CUDA processes on the device.
This makes it easier to identify cases where non-PyTorch libraries have
allocated memory or another process (such as a data loader) has also
allocated memory on the device.
This also rewords the other parts of the error message to make the meaning
of the memory statistics clearer alongside this new information:
"""
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 138.00 MiB.
GPU 0 has a total capacty of 15.90 GiB of which 8.44 MiB is free.
Process 1246069 has 577.00 MiB memory in use. Including non-PyTorch memory,
this process has 15.32 GiB memory in use. Of the allocated memory
14.12 GiB is allocated by PyTorch, and 410.41 MiB is reserved
by PyTorch but unallocated. If reserved but unallocated memory is large
try setting max_split_size_mb to avoid fragmentation. See documentation
for Memory Management and PYTORCH_CUDA_ALLOC_CONF
"""
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99699
Approved by: https://github.com/ngimel
Common advice we give for handling memory fragmentation issues is to
allocate a big block upfront to reserve memory which will get split up later.
For programs with changing tensor sizes this can be especially helpful to
avoid OOMs that happen the first time we see a new largest input and would
otherwise have to allocate new segments.
However, the issue with allocating a block upfront is that it is nearly impossible
to correctly estimate the size of that block. If too small, space in the block
will run out and the allocator will allocate separate blocks anyway. Too large,
and other non-PyTorch libraries might stop working because they cannot allocate
any memory.
This patch provides the same benefits as using a pre-allocated block but
without having to choose its size upfront. Using the cuMemMap-style APIs,
it adds the ability to expand the last block in a segment when more memory is
needed.
Compared to universally using cudaMallocAsync to avoid fragmentation,
this patch can fix this common fragmentation issue while preserving most
of the existing allocator behavior. This behavior can be enabled and disabled dynamically.
This should allow users to, for instance, allocate long-lived parameters and state in individual buffers,
and put temporary state into the large expandable blocks, further reducing
fragmentation.
See inline comments for information about the implementation and its limitations.
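A minimal sketch of opting into this behavior through the allocator config (the expandable_segments knob of PYTORCH_CUDA_ALLOC_CONF); it must be set before the first CUDA allocation:
```
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

# Workloads whose tensor sizes keep growing can expand the last block of an
# existing segment instead of allocating a brand-new segment at each new maximum.
for n in (1024, 2048, 4096):
    x = torch.randn(n, n, device="cuda")
    del x
```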
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96995
Approved by: https://github.com/eellison
This method has to be accessible from `c10` to enable CUDA-12 integration.
Implemented by providing a private `c10::cuda::_internal::setHasPrimaryContext` that passes a pointer to the implementation (in `torch_cuda`) back to c10.
A global class constructor/destructor is used to guarantee RAII.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96800
Approved by: https://github.com/ngimel