Commit Graph

269 Commits

Author SHA1 Message Date
Jithun Nair
333d5821ee [ROCm] Add gcnArchName to collect_env and torch.cuda.get_device_properties (#107477)
Printing just the device name is not helpful when investigating PyTorch issues filed for specific AMD GPUs, as the support/issue might depend on the gfx arch, which is part of the gcnArchName property.

`torch.cuda.get_device_properties(0).gcnArchName` will print the value of the `gcnArchName` property, e.g.:
```
>>> torch.cuda.get_device_properties(0).gcnArchName
'gfx906:sramecc+:xnack-'
```

```
root@6f064e3c19fb:/data/pytorch/test# python ../torch/utils/collect_env.py
...
GPU models and configuration: AMD Radeon Graphics(gfx906:sramecc+:xnack-)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107477
Approved by: https://github.com/albanD
2023-10-31 23:05:36 +00:00
PyTorch MergeBot
9c7391ea36 Revert " [1/N] Apply clang-tidy to c10 cuda files (#111137)"
This reverts commit 43b023694e.

Reverted https://github.com/pytorch/pytorch/pull/111137 on behalf of https://github.com/malfet due to Was reverted internally due to the failures in torch.cuda.memory_stats(device=0) (presumably) ([comment](https://github.com/pytorch/pytorch/pull/111137#issuecomment-1769274103))
2023-10-18 20:32:53 +00:00
cyy
43b023694e [1/N] Apply clang-tidy to c10 cuda files (#111137)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111137
Approved by: https://github.com/zou3519, https://github.com/Skylion007
2023-10-17 04:52:50 +00:00
cyy
a6b452dfdc [2/N] Enable Wunused-result, Wunused-variable and Wmissing-braces in torch targets (#110836)
This PR enables Wunused-result, Wunused-variable and Wmissing-braces now that our code base is clean of these warnings.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110836
Approved by: https://github.com/Skylion007
2023-10-11 23:49:15 +00:00
cyy
3ec33957eb [1/N] Enable Wunused-result and Wunused-variable in torch targets (#110722)
They are useful for checking results of function calls.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110722
Approved by: https://github.com/Skylion007
2023-10-08 23:43:45 +00:00
Banit Agrawal
30c4c6ff9b [PyTorch CCA] Refactor caching allocator config code (#110123)
Summary: This diff refactors the code by moving CUDAAllocatorConfig into the header file. This config refactoring is done so that we can use the same config code for CUDA pinned memory as well.

Test Plan: sandcastle

Differential Revision: D49653265

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110123
Approved by: https://github.com/zdevito
2023-10-04 14:58:23 +00:00
Pritam Damania
5565a29568 Release GIL in torch.cuda ops wherever possible. (#109159)
Most `torch.cuda` ops (ex: `torch.cuda.synchronize`) do not release the GIL in C++ land. This has the potential of causing deadlocks and freezing the Python process. For example, `torch.cuda.synchronize` could hold the GIL and get blocked on some operation. However, that operation might never complete in Python land since the GIL is held by `torch.cuda.synchronize`.

In this PR, I've tried to release the GIL as much as possible in `torch.cuda` ops.

See https://github.com/pytorch/pytorch/issues/109074 for an example of how holding GIL causes a deadlock.
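
For illustration, a minimal sketch (assumes a CUDA device is available) of what releasing the GIL enables: while `torch.cuda.synchronize()` blocks on the GPU, another Python thread can only keep making progress if the GIL is not held by the synchronizing thread:

```python
import threading
import torch

progress = {"ticks": 0}
stop = threading.Event()

def python_worker():
    # Pure-Python work that needs the GIL to run.
    while not stop.is_set():
        progress["ticks"] += 1

t = threading.Thread(target=python_worker)
t.start()

x = torch.randn(4096, 4096, device="cuda")
for _ in range(50):
    y = x @ x  # enqueue GPU work asynchronously

torch.cuda.synchronize()  # blocks on the GPU without holding the GIL
stop.set()
t.join()
print("worker ticks while syncing:", progress["ticks"])
```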
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109159
Approved by: https://github.com/ezyang
2023-09-25 14:35:31 +00:00
cyy
01fc6466d1 [Reland] [1/N] fix clang-tidy warnings in torch/csrc (#108114)
Reland of PR #107648 with `auto` replaced by `Py_ssize_t` in eval_frame.c. This PR fixes some issues found by clang-tidy in torch/csrc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108114
Approved by: https://github.com/Skylion007
2023-08-30 17:11:16 +00:00
PyTorch MergeBot
8cbf77585d Revert "[1/N] fix clang-tidy warnings in torch/csrc (#107648)"
This reverts commit 49eeca00d1.

Reverted https://github.com/pytorch/pytorch/pull/107648 on behalf of https://github.com/osalpekar due to This causes breakages due to underspecified type ([comment](https://github.com/pytorch/pytorch/pull/107648#issuecomment-1696372588))
2023-08-28 20:35:12 +00:00
cyy
49eeca00d1 [1/N] fix clang-tidy warnings in torch/csrc (#107648)
Fixes some issues found by clang-tidy in torch/csrc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107648
Approved by: https://github.com/Skylion007
2023-08-25 00:30:09 +00:00
Zachary DeVito
c9b5e9d7a8 [allocator] register oom observers on every device (#107399)
This change is to match the behavior of _record_memory_history which was
recently changed to enable history recording on all devices rather than
the current one. It prevents confusing situations where the observer
was registered before the device was set for the training run.

It also ensures the allocators have been initialized in the python binding just in case this is the first call to the CUDA API.
Fixes #107330
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107399
Approved by: https://github.com/eellison
ghstack dependencies: #107171
2023-08-23 18:57:24 +00:00
Zachary DeVito
cc54448a07 [memory snapshot] add 'address' key to block (#107171)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107171
Approved by: https://github.com/ngimel
2023-08-23 18:57:24 +00:00
Zachary DeVito
80988b6277 Introduce memory stacks for free (#106758)
Previously when we recorded a free action in a memory trace, we would provide
the stack for when the block was allocated. This is faster because we do not
have to record stacks for free, which would otherwise double the number of stacks
collected. However, sometimes knowing the location of a free is useful for
figuring out why a tensor was live. So this PR adds this behavior. If
performance ends up being a concern the old behavior is possible by passing
"alloc" to the context argument rather than "all".

Also refactors some of the glue logic to be consistent across C++ and Python and
routes the Python API through the C++ version.
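
For illustration, a hedged sketch of the option described above (`_record_memory_history` is a private API whose keyword spelling has shifted across versions):

```python
import torch

# context="all" also records stacks at free time; context="alloc"
# keeps the older, cheaper allocation-only behavior.
torch.cuda.memory._record_memory_history(context="all")
x = torch.randn(1024, device="cuda")
del x
snapshot = torch.cuda.memory._snapshot()  # free events now carry frames
torch.cuda.memory._record_memory_history(context="alloc")
```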
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106758
Approved by: https://github.com/albanD
2023-08-14 20:38:15 +00:00
Alexander Pivovarov
02abbb8109 Fix some typos, mostly "that that" (#106901)
Fix some typos
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106901
Approved by: https://github.com/janeyx99
2023-08-10 19:46:53 +00:00
Nikita Shulga
dfd441a12c [BE] Use nested namespaces in torch/csrc/cuda (#106928)
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 6b1dde1</samp>

> _`namespace` syntax_
> _Simplified with C++17_
> _Code is more readable_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106928
Approved by: https://github.com/huydhn, https://github.com/izaitsevfb
2023-08-10 03:56:09 +00:00
Zachary DeVito
3e5a52cedd [memory snapshot] track context for segments (#106113)
We want to display the stack for the original cudaMalloc that created a segment.
Previously we could only report the last time the segment memory was used,
or the record of the segment_alloc could appear in the list of allocator actions.
This PR ensures that, regardless of whether we still have the segment_alloc action,
the context for a segment is still available. The visualizer is updated to
be able to incorporate this information.

This PR adds a new field to Block. However, the previous stacked cleanup PR removed a field of the same size, making the change to Block size-neutral.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106113
Approved by: https://github.com/aaronenyeshi
2023-07-28 06:45:48 +00:00
Zachary DeVito
45b564766d [memory snapshots] removed chained history (#106079)
For free blocks of memory in the allocator, we previously kept a linked list
of the stack frames of previous allocations that lived there. This was only
ever used in one flamegraph visualization and never proved useful at
understanding what was going on. When memory history tracing was added, it
became redundant, since we can see the history of the free space from recording
the previous actions anyway.

This patch removes this functionality and simplifies the snapshot format:
allocated blocks directly have a 'frames' attribute rather than burying stack frames in the history.
Previously the memory history tracked the real size of allocations before rounding.
Since history was added, 'requested_size' has been added directly to the block which records the same information,
so this patch also removes that redundancy.

None of this functionality has been part of a PyTorch release with BC guarantees, so it should be safe to alter
this part of the format.

This patch also updates our visualization tools to work with the simplified format. Visualization tools keep
support for the old format in `_legacy` functions so that during the transition old snapshot files can still be read.
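As a hedged sketch of reading the simplified format described above (private snapshot layout; key names per recent versions and may change):

```python
import torch

torch.cuda.memory._record_memory_history()
x = torch.randn(1024, device="cuda")
snap = torch.cuda.memory._snapshot()
for seg in snap["segments"]:
    for block in seg["blocks"]:
        if block["state"] == "active_allocated":
            # 'frames' and 'requested_size' now live directly on the block.
            print(block["requested_size"], len(block.get("frames", [])))
```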

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106079
Approved by: https://github.com/eellison
2023-07-28 06:45:48 +00:00
Elias Ellison
a8ff647e42 Disable conv cache emptying (#101038)
We warm up cudagraph trees in the cudagraph memory pool so that if we are partway through a run, and a large majority of memory is already allocated to cudagraphs, we don't try to allocate again to eager, which would split the memory pool in half. However, this causes us to fail the following assert due to the `emptyCache` call in cuDNN benchmarking: https://github.com/pytorch/pytorch/blob/main/c10/cuda/CUDACachingAllocator.cpp#L2959.

Disable the empty-cache call during cudagraph warmup to fix the error. Disabling it did not have a significant effect on memory:

![image](https://github.com/pytorch/pytorch/assets/11477974/90513a1e-aa77-410c-a32e-2f80b99e673f)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101038
Approved by: https://github.com/shunting314, https://github.com/ngimel
2023-05-12 18:49:46 +00:00
Elias Ellison
0ec4646588 CUDA Graph Trees - error on deallocated access (#100927)
Turn warning to error if we detect tensor is accessed after its memory is overwritten/released by a new invocation of cudagraphs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100927
Approved by: https://github.com/zou3519
2023-05-11 17:17:14 +00:00
PyTorch MergeBot
cbfed470bd Revert "CUDA Graph Trees - error on deallocated access (#100927)"
This reverts commit 3941bbc5ba.

Reverted https://github.com/pytorch/pytorch/pull/100927 on behalf of https://github.com/jeanschmidt due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/100927#issuecomment-1543874258))
2023-05-11 12:07:20 +00:00
Elias Ellison
3941bbc5ba CUDA Graph Trees - error on deallocated access (#100927)
Turn warning to error if we detect tensor is accessed after its memory is overwritten/released by a new invocation of cudagraphs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100927
Approved by: https://github.com/zou3519
2023-05-10 17:15:33 +00:00
Elias Ellison
3edff6b6ec Improve detection of workspace/non-output allocations in cudagraphs (#99985)
When we run cudagraph trees, we are not allowed to have permanent workspace allocations like those in cuBLAS, because we might need to reclaim that memory for a previous cudagraph recording, and such memory is not accounted for in output weakrefs, so it does not work with checkpointing. Previously, I would check that we didn't have any additional allocations through snapshotting. This was extremely slow, so I had to turn it off.

This PR first does a quick check to see if we are in an error state, and only then runs the slower logic of creating a snapshot. It also turns on history recording so we get a stack trace of where the bad allocation came from.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99985
Approved by: https://github.com/zdevito
2023-05-01 15:58:45 +00:00
Elias Ellison
d881b2978c Make autocast cache and buffer stealing aware of cudagraph static output tensors (#99368)
In this stack of PRs we add caching to output tensors for cudagraph trees after we've done initial recording. On initial recording we do not cache tensor outputs because this prevents memory from being reclaimed. On subsequent executions we do cache them to avoid overhead. However, because there is an extra reference around, this caused divergent recording & execution behavior in both autocast caching and autograd gradient stealing. Divergent recording & execution would keep on re-recording and eventually stabilize, but it's not what you want to see happen.

This pr makes the autocast cache and buffer stealing aware of the cudagraph static output tensors.

I will add this to the other cudagraph impl in another pr.

Not sure if this should be in autograd or in autocast since it affects both, or somewhere else.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99368
Approved by: https://github.com/albanD, https://github.com/ezyang
2023-04-24 20:23:12 +00:00
Elias Ellison
472f46635e Cache output tensors on execution (#98944)
Caches output tensors for the common case when the output Tensor storage is unaliased for all graph outputs in all paths. For these persisted tensors we adjust the liveness tracking by also checking that the output tensor does not have an additional python reference.

I limit cached output tensors to be unaliased. If a descendant node discovers it has an alias of a prior output, then the aliased output will no longer be persisted in the ancestor.

The large majority of tensors are unaliased, and preserving aliased output tensors would add significant additional complexity with marginal gains. For instance, when doing checkpointing and re-recordings, we need to remove the persisted tensors; otherwise they would prevent memory from being reclaimed. If a single persisted tensor were present in multiple paths, that would create an inter-path dependence which adds complexity. Additionally, each further caching of the output would affect the reference count of the other caches, and that reference count would also need to be adjusted depending on whether a node was checkpointed.

Still need to do a complete run, but for the models I tried this makes the performance extremely close between the trees and non-trees implementations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98944
Approved by: https://github.com/jansel, https://github.com/ngimel
2023-04-18 19:44:47 +00:00
Elias Ellison
93b64f0ad3 [Easy] Remove C++ call now that it wont be on hot path (#98943)
Since we will be caching output tensors, it is no longer necessary for this logic to be in C++.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98943
Approved by: https://github.com/ezyang, https://github.com/jansel
2023-04-18 19:28:37 +00:00
Zachary DeVito
7ff1f3f3f6 Revert "Revert "Expandable blocks in allocator (#96995)"" (#99275)
This reverts commit 851e89c8e8.

Differential Revision: [D45034526](https://our.internmc.facebook.com/intern/diff/D45034526)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99275
Approved by: https://github.com/eellison
2023-04-17 23:46:08 +00:00
PyTorch MergeBot
851e89c8e8 Revert "Expandable blocks in allocator (#96995)"
This reverts commit 6a50b83b73.

Reverted https://github.com/pytorch/pytorch/pull/96995 on behalf of https://github.com/izaitsevfb due to Breaks internal tests
2023-04-16 19:23:37 +00:00
Animesh Jain
fdbc8625a1 Functionalization of torch.rand/rand_like ops (#97377)
This PR introduces the functionalization of RNG ops. Key points are

* Introduces a new `philox_rand` prim operator that accepts seed, offset.
* Adds decompositions for random operators that use these philox_rand prims
* Adds a PhiloxStateTracker to track the offset for each occurrence of rand ops
* Changes calling convention of AOT Autograd and adds <fwd_seed, fwd_base_offset> and <bwd_seed, bwd_base_offset>
* Monkeypatches set_rng_state and get_rng_state while AOT Autograd tracing to record the rng state behavior
* Raises an assertion for CPU because CPU does not have Philox RNG.

Not dealt with in this PR:
* dropout op - offset calculation is different
* other distributions like normal, poisson etc
* Inductor support
* Cudagraph support
* Dynamic shape support

An example
~~~

class Custom(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        a = torch.rand_like(x) * x
        a = torch.rand_like(x) * a
        return a

    @staticmethod
    def backward(ctx, grad_out):
        x, = ctx.saved_tensors
        return grad_out * torch.rand_like(grad_out) * torch.cos(x)

====== Forward graph 0 ======
def forward(self, fwd_seed_1: i64[], fwd_base_offset_1: i64[], primals_1: f32[16, 16]):
    # No stacktrace found for following nodes
    add: i64[] = torch.ops.aten.add.Tensor(fwd_base_offset_1, 0)
    philox_rand: f32[16, 16] = torch.ops.prims.philox_rand.default([16, 16], fwd_seed_1, add, [16, 1], device(type='cuda', index=0), torch.float32);  add = None
    mul: f32[16, 16] = torch.ops.aten.mul.Tensor(philox_rand, primals_1);  philox_rand = None
    add_1: i64[] = torch.ops.aten.add.Tensor(fwd_base_offset_1, 4);  fwd_base_offset_1 = None
    philox_rand_1: f32[16, 16] = torch.ops.prims.philox_rand.default([16, 16], fwd_seed_1, add_1, [16, 1], device(type='cuda', index=0), torch.float32);  fwd_seed_1 = add_1 = None
    mul_1: f32[16, 16] = torch.ops.aten.mul.Tensor(philox_rand_1, mul);  philox_rand_1 = mul = None
    return [mul_1, primals_1]

====== Backward graph 0 ======
def forward(self, bwd_seed_1: i64[], bwd_base_offset_1: i64[], primals_1: f32[16, 16], tangents_1: f32[16, 16]):
    # No stacktrace found for following nodes
    add_2: i64[] = torch.ops.aten.add.Tensor(bwd_base_offset_1, 0);  bwd_base_offset_1 = None
    philox_rand_2: f32[16, 16] = torch.ops.prims.philox_rand.default([16, 16], bwd_seed_1, add_2, [16, 1], device(type='cuda', index=0), torch.float32);  bwd_seed_1 = add_2 = None
    mul_2: f32[16, 16] = torch.ops.aten.mul.Tensor(tangents_1, philox_rand_2);  tangents_1 = philox_rand_2 = None
    cos: f32[16, 16] = torch.ops.aten.cos.default(primals_1);  primals_1 = None
    mul_3: f32[16, 16] = torch.ops.aten.mul.Tensor(mul_2, cos);  mul_2 = cos = None
    return [mul_3]

~~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97377
Approved by: https://github.com/ezyang
2023-04-16 09:55:56 +00:00
Zachary DeVito
6a50b83b73 Expandable blocks in allocator (#96995)
Common advice we give for handling memory fragmentation issues is to
allocate a big block upfront to reserve memory which will get split up later.
For programs with changing tensor sizes this can be especially helpful to
avoid OOMs that happen the first time we see a new largest input and would
otherwise have to allocate new segments.

However, the issue with allocating a block upfront is that it is nearly impossible
to correctly estimate the size of that block. If too small, space in the block
will run out and the allocator will allocate separate blocks anyway. Too large,
and other non-PyTorch libraries might stop working because they cannot allocate
any memory.

This patch provides the same benefits as using a pre-allocating block but
without having to choose its size upfront. Using the cuMemMap-style APIs,
it adds the ability to expand the last block in a segment when more memory is
needed.

Compared to universally using cudaMallocAsync to avoid fragmentation,
this patch can fix this common fragmentation issue while preserving most
of the existing allocator behavior. This behavior can be enabled and disabled dynamically.
 This should allow users to, for instance, allocate long-lived parameters and state in individual buffers,
and put temporary state into the large expandable blocks, further reducing
fragmentation.

See inline comments for information about the implementation and its limitations.
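
For illustration, a minimal sketch of enabling this behavior via the allocator config (knob name per the CUDA memory-management docs; it must be set before CUDA is initialized in the process):

```python
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # noqa: E402  (import after setting the config)

# The last block in a segment can now grow in place instead of forcing
# a new segment when a larger input arrives.
x = torch.empty(1 << 24, device="cuda")
```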

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96995
Approved by: https://github.com/eellison
2023-04-14 09:49:11 +00:00
Aidyn-A
69eef5a4be [CUDA12] set_device change (#94864)
This PR adds a workaround for the CUDA 12 [`cudaSetDevice` change](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g159587909ffa0791bbe4b40187a4c6bb), which will always create a primary context on the target device. So operations like this:
```Python
import torch
x = torch.randn(1, device="cuda:1")
```
would always create a primary context on device `cuda:1`, because a tensor is created on it, and on device `cuda:0`, because the destructor of the CUDA device guard calls `cudaSetDevice(0)`.
After this PR, the CUDA device guard will not call `cudaSetDevice(0)` if a primary context does not exist on `cuda:0`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94864
Approved by: https://github.com/malfet, https://github.com/atalman, https://github.com/ezyang
2023-04-10 17:31:12 +00:00
Elias Ellison
5c8fea5647 Reduce overhead in CUDAGraph Trees (#98529)
Significantly reduces overhead of constructing Tensors and Storages and checking Storage Liveness. Removes the regression for HF models that I tested and removes 75% of overhead of the extremely overhead bound resnet50 training we have in torchbench. (.91x base commit, 1.02x torchinductor default, 1.16x this PR, 1.25 previous cudagraphs impl).

This PR takes care of all of the lower hanging fruit.

- Computes storage aliasing at record time instead of at runtime. We no longer need to use a runtime storage cache, and can instead index directly into the existing alias if there is one, or construct a new Storage

- Moves the heavyweight C++ calls into a batch - getting storage weakrefs and constructing tensors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98529
Approved by: https://github.com/jansel, https://github.com/ngimel
2023-04-07 05:46:08 +00:00
PyTorch MergeBot
279ca5f9db Revert "[CUDA12] set_device change (#94864)"
This reverts commit c18be2b2ec.

Reverted https://github.com/pytorch/pytorch/pull/94864 on behalf of https://github.com/ezyang due to avoid affecting cuda 11
2023-04-05 14:53:00 +00:00
Aidyn-A
c18be2b2ec [CUDA12] set_device change (#94864)
This PR adds a workaround for the CUDA 12 [`cudaSetDevice` change](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g159587909ffa0791bbe4b40187a4c6bb), which will always create a primary context on the target device. So operations like this:
```Python
import torch
x = torch.randn(1, device="cuda:1")
```
would always create a primary context on device `cuda:1`, because a tensor is created on it, and on device `cuda:0`, because the destructor of the CUDA device guard calls `cudaSetDevice(0)`.
After this PR, the CUDA device guard will not call `cudaSetDevice(0)` if a primary context does not exist on `cuda:0`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94864
Approved by: https://github.com/malfet, https://github.com/atalman, https://github.com/ezyang
2023-04-05 14:34:00 +00:00
mikey dagitses
da28af3286 distinguish mutability of StorageImpl::data_ptr() member (#97651)
See D44409928.

Differential Revision: [D44410323](https://our.internmc.facebook.com/intern/diff/D44410323/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97651
Approved by: https://github.com/ezyang
2023-03-30 19:13:56 +00:00
Nikita Shulga
24ce3a7c34 Move hasPrimaryContext to c10::cuda (#96800)
This method has to be accessible from `c10` to enable CUDA-12 integration.
Implemented by providing a private `c10::cuda::_internal::setHasPrimaryContext` that passes the pointer to the implementation (in `torch_cuda`) back to c10.
Use global class constructor/destructor to guarantee RAII.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96800
Approved by: https://github.com/ngimel
2023-03-17 04:50:35 +00:00
Elias Ellison
571f96bf59 cudagraph trees (#89146)
CUDA Graph Trees

Design doc: https://docs.google.com/document/d/1ZrxLGWz7T45MSX6gPsL6Ln4t0eZCSfWewtJ_qLd_D0E/edit

Not currently implemented :

- Right now, we are using weak tensor refs from outputs to check if a tensor has died. This doesn't work because of a) aliasing, and b) aot_autograd detaching tensors (see note [Detaching saved tensors in AOTAutograd]). Would need either https://github.com/pytorch/pytorch/issues/91395 to land to use storage weak refs, or to manually add a deleter fn that does what I want. This is doable, but there are some interactions with the caching allocator checkpointing, so saving for a stacked PR.

- Reclaiming memory from the inputs during model recording. This isn't terribly difficult, but deferring to another PR. You would need to write over the input memory during warmup, and therefore copy the inputs to CPU. Saving for a stacked PR.

- Warning on overwriting previous generation outputs, and handling nested torch.compile() calls in generation tracking
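
For context, a hedged sketch of how cudagraph trees are typically exercised from user code, via the inductor backend's cudagraphs mode (mode name per the torch.compile docs):

```python
import torch

@torch.compile(mode="reduce-overhead")  # routes through the cudagraphs path
def step(x):
    return torch.relu(x @ x)

x = torch.randn(64, 64, device="cuda")
for _ in range(3):  # warmup, record, then replay on later calls
    y = step(x)
```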

Differential Revision: [D43999887](https://our.internmc.facebook.com/intern/diff/D43999887)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89146
Approved by: https://github.com/ezyang
2023-03-17 02:47:03 +00:00
Elias Ellison
ea7415087a Expose Stream Recording Apis in python (#96384)
Differential Revision: [D43999891](https://our.internmc.facebook.com/intern/diff/D43999891)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96384
Approved by: https://github.com/zdevito
2023-03-16 23:45:43 +00:00
Zachary DeVito
e74f70d212 Revert "Revert "[memory profiling] add a facility to gather combined C++/Python/TorchScript stack traces. (#95541)"" (#96878)
This reverts commit e1ea584b1c.
Adds a `__has_include` check to fix the fbcode build.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96878
Approved by: https://github.com/ezyang
2023-03-16 04:12:54 +00:00
PyTorch MergeBot
e1ea584b1c Revert "[memory profiling] add a facility to gather combined C++/Python/TorchScript stack traces. (#95541)"
This reverts commit 4e1060c609.

Reverted https://github.com/pytorch/pytorch/pull/95541 on behalf of https://github.com/DanilBaibak due to breaking internal builds
2023-03-15 13:28:41 +00:00
Zachary DeVito
85639c1a88 [allocator] Generalize recording to a pool (#96542)
Previously the allocator would query whether a stream was recording a graph,
and look up the pool associated with a graph. This change has the allocator
directly associate a stream with a mempool, decoupling "record this stream to a pool"
from the action of "record all actions to a cuda graph".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96542
Approved by: https://github.com/eellison
2023-03-15 04:28:49 +00:00
Zachary DeVito
4e1060c609 [memory profiling] add a facility to gather combined C++/Python/TorchScript stack traces. (#95541)
This refactors the stack trace facility specific to memory profiling in python+cuda to make a generic facility to generate combined stack traces.

The generic facility (combined_traceback.h) does not require python to be around to work, but will return python stacks if it is present.

This facility is then used to add support for stack trace gathering in memory profiling that happens directly from C++.

It is also used to expose a python API for gathering and symbolizing combined stacks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95541
Approved by: https://github.com/ezyang
2023-03-14 18:26:05 +00:00
Elias Ellison
da265652d6 Return Live Data Pointers from Checkpoint, swap onto tensors (#95020)
When we checkpoint the state of the private pool allocator, we will need to make sure that its current live allocated blocks will get properly cleaned up when the tensors they correspond to die. Return DataPtrs for these new allocated blocks that the callee can swap onto live Tensors.

The exact API for setting the checkpoint can be adjusted after this as the cudagraph implementation is built out, but this at least shows it's sufficiently general.

This should be the last PR touching cuda caching allocator necessary for new cudagraphs integration.

Differential Revision: [D43999888](https://our.internmc.facebook.com/intern/diff/D43999888)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95020
Approved by: https://github.com/zdevito
2023-03-14 01:22:19 +00:00
Elias Ellison
1cc32aedb0 Handle additional live allocations not in checkpointed state (#94943)
We choose to ignore certain blocks that are currently allocated when we set the pool to its checkpoint. For those blocks, we need to swap out their deleter function so that a deallocation is not triggered when they die.

Differential Revision: [D43999886](https://our.internmc.facebook.com/intern/diff/D43999886)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94943
Approved by: https://github.com/zdevito
2023-03-14 01:00:47 +00:00
Elias Ellison
d798de2b05 Checkpoint CUDA Allocator Private Pool State (#94653)
Copying note from cuda caching allocator:

```
   * Note [Checkpointing PrivatePoolState]
   *
   * Refer above to Note [Interaction with CUDA graph capture]. Allocations made
   * during graph capture are made from a separate private pool. During graph
   * capture allocations behave as usual. During graph replay the allocator
   * state does not change even as new tensors are created. The private pool
   * will not free its blocks to the main caching allocator until cuda graph use
   * is finished to prevent an allocation from eager clobbering the memory from
   * a live but unaccounted for tensor that was created during replay.
   *
   * `make_graphed_callables`, a series of separate callables chained in
   * successive cuda graphs, can share a memory pool because after a cuda graph
   * recording the allocations in the shared private pool exactly reflect the
   * tensors that are allocated.
   *
   * We would like to extend callable chaining to support a graphed callable
   * tree. In this scenario, we have a tree of callable chains which will be
   * captured with cuda graphs. In the diagram below, we have a tree with four
   * callables, A, B, C, and D. Suppose we have captured, and subsequently
   * replayed, A, B, and C. Then on a new invocation, we replay A and B, but
   * would now like to record D. At this point the private pool will not reflect
   * any of the live tensors created during graph replay. Allocations made
   * during a new recording with the pool could overwrite those live tensors.
   *
   * In order to record a new graph capture after replaying prior callables in
   * the tree, we need the allocator to reflect the state of the live tensors.
   * We checkpoint the state of the private pool after each recording, and then
   * reapply it when we are starting a new recording chain. Additionally, we
   * must free the allocations for any tensors that died between the end of our
   * previous graph replaying and our new recording (TODO). All of the allocated
   * segments that existed in the checkpointed state must still exist in the
   * pool. There may also exist new segments, which we will free (TODO : link
   * note [live tensors between iterations] when it exists).
   *
   *
   *  ---------------> A ---------------> B ---------------> C
   *                                |
   *                                |
   *                                |
   *                                |
   *                                  ---------------> D
```

A few TODOs:
- need to add logic for freeing tensors that have died between a last replay and current new recording
- Add logic for free that might be called on a pointer multiple times (because we are manually freeing live tensors)

The two scenarios above have not been exercised in the tests yet.

Differential Revision: [D43999889](https://our.internmc.facebook.com/intern/diff/D43999889)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94653
Approved by: https://github.com/zdevito
2023-03-14 00:47:30 +00:00
Zachary DeVito
4b372e3958 [memory profiling] C++ tracing support (#95357)
Adds the ability to quickly generate stack traces for C++,
and combine Python, TorchScript, and C++ frames into a single trace.

This makes it possible for the memory tracer to record allocations inside
C++ code (e.g. convolution temporaries, backward operators).

The unwinder code is ~10x faster than execinfo.h's `backtrace` because it
caches fast unwinder routines for instruction pointers that have already been seen.
It is also only 1.2--2x slower than copying the entire stack (the approach perf takes),
while using 2 orders of magnitude less space per stack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95357
Approved by: https://github.com/bertmaher
2023-03-12 07:24:14 +00:00
Zachary DeVito
48490cec28 [memory profiling] Move Context object to c10 (#96280)
Minor refactor so that a follow-up PR can have objects that meet the GatheredContext
interface without having to depend on CUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96280
Approved by: https://github.com/eellison
2023-03-12 07:24:14 +00:00
Zachary DeVito
266089a3fe [memory snapshots] record scripted stack traces (#95356)
Adds support for seeing both python and script stack traces in memory
debugging.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95356
Approved by: https://github.com/aaronenyeshi
2023-03-12 07:24:14 +00:00
cyy
6786a24fd2 fix some tiny code issues (#95757)
This PR tries to fix:
1. a misspelled NDEBUG preprocessing condition.
2. all writable-strings warnings, by getting rid of them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95757
Approved by: https://github.com/soulitzer
2023-03-01 23:27:32 +00:00
Zachary DeVito
4f84c57c87 Fix potential deadlock when recording memory traces (#95273)
See comment in the diff

Differential Revision: [D43490668](https://our.internmc.facebook.com/intern/diff/D43490668)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95273
Approved by: https://github.com/eellison
2023-02-27 19:04:47 +00:00
c-odrin
54b7c7d5e9 Added requested_bytes to CUDA Caching Allocator Stats (#88575)
Summary:
The caching allocator can be configured to round memory allocations in order to reduce fragmentation. Sometimes, however, the overhead from rounding can be higher than the fragmentation it helps reduce.

We have added a new stat to CUDA caching allocator stats to help track if rounding is adding too much overhead and help tune the roundup_power2_divisions flag:
    - "requested_bytes.{current,peak,allocated,freed}": memory requested by client code, compare this with allocated_bytes to check if allocation rounding adds too much overhead

Test Plan: Added test case in caffe2/test/test_cuda.py

Differential Revision: D40810674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88575
Approved by: https://github.com/zdevito
2023-02-09 21:37:25 +00:00
cyy
27efdc5eed fix writable-strings warnings (#93246)
clang reports "ISO C++11 does not allow conversion from string
literal to 'char *'"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93246
Approved by: https://github.com/malfet
2023-02-04 02:11:15 +00:00
cyy
bfe5e1258b avoid unnecessary static_cast (#93898)
avoid unnecessary static_cast
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93898
Approved by: https://github.com/Skylion007
2023-02-03 03:44:43 +00:00
cyy
e292ddff4e More clang-tidy fixes (#92944)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92944
Approved by: https://github.com/Skylion007
2023-01-25 19:11:51 +00:00
PyTorch MergeBot
523d4f2562 Revert "[cuDNN][cuDNN V8 API] Always build assuming cuDNN >= 8.0 (#91527)"
This reverts commit 4d07ad74f1.

Reverted https://github.com/pytorch/pytorch/pull/91527 on behalf of https://github.com/DanilBaibak due to Break internal build
2023-01-16 13:28:09 +00:00
Eddie Yan
4d07ad74f1 [cuDNN][cuDNN V8 API] Always build assuming cuDNN >= 8.0 (#91527)
We've been building with V8 (incl. V8 API) by default for a while now; this PR cleans up some guards for cuDNN < 8.0.

CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91527
Approved by: https://github.com/ngimel
2023-01-13 18:55:37 +00:00
Peter Bell
eece6da162 [inductor] Reduce device context manager overhead (#91045)
This adds `torch.cuda._DeviceGuard` which is a stripped down version of
`torch.cuda.device` with lower overhead. To do this, it only accepts `int` as
the device so we don't need to call `_get_device_index` and is implemented
with a new C++ helper `torch._C._cuda_exchangeDevice` that allows
`_DeviceGuard.__enter__` to be just a single function call. On my machine,
I see a drop from 3.8us of overhead to 0.94 us with this simple benchmark:

```python
def set_device():
    with torch.cuda.device(0):
        pass

%timeit set_device()
```
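
A hedged usage sketch of the new guard (private API; it accepts only an `int` device index):

```python
import torch

# Sketch only: torch.cuda._DeviceGuard is private and may change.
with torch.cuda._DeviceGuard(0):
    pass  # device 0 is current here; the previous device is restored on exit
```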

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91045
Approved by: https://github.com/ngimel, https://github.com/anijain2305
2023-01-12 16:51:59 +00:00
Eddie Yan
e096d2db5a [BC-Breaking] Separate stream_id, device_index, and device_type in pack and unpack for Streams (#81596)
#75854

A naive attempt at working around the limitations of using a single 64-bit integer to pack `stream_id`, `device_index`, and `device_type`.

Still needs sanity checks, testing, and minimization of BC-breaking changes.

Currently a Holder for the `StreamData3` struct is used for `IValue` compatibility. While doing this seems to work for `ivalue.h` and `ivalue_inl.h`, it doesn't seem to be naively working for the JIT CUDA stream wrapper (something about ambiguous calls if an `intrusive_ptr` to `c10::ivalue::StreamData3Holder` is used as the return type for `pack()`). It turns out that the methods required to access the fields for rematerializing a CUDA Stream are basically already present anyway, so `pack` is simply removed in the wrapper for now and the methods to access the required fields are called directly.

CC @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81596
Approved by: https://github.com/ezyang
2023-01-12 14:16:49 +00:00
Emilio Castillo
07e595e88a Add device_idx to free_fn in CUDAPluggableAllocator (#91398)
This was requested by NVIDIA folks: also track the device_id in the free function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91398
Approved by: https://github.com/albanD
2023-01-12 05:03:48 +00:00
Emilio Castillo
c9d4390d13 Add Pluggable CUDA allocator backend (#86786)
Fixes #43144

This uses the Backend system added by [82682](https://github.com/pytorch/pytorch/pull/82682) to change allocators dynamically during code execution. This will allow us to use RMM, to use CUDA managed memory for portions of the code that do not fit in GPU memory, to write static memory allocators to reduce fragmentation while training models, and to improve interoperability with external DL compilers/libraries.

For example, we could have the following allocator in c++

```c++
#include <sys/types.h>
#include <cuda_runtime_api.h>
#include <iostream>

extern "C" {
void* my_malloc(ssize_t size, int device, cudaStream_t stream) {
   void *ptr;
   std::cout<<"alloc "<< size<<std::endl;
   cudaMalloc(&ptr, size);
   return ptr;
}

void my_free(void* ptr) {
   std::cout<<"free "<<std::endl;
   cudaFree(ptr);
}
}
```

Compile it as a shared library
```
nvcc allocator.cc -o alloc.so -shared --compiler-options '-fPIC'
```

And use it from PyTorch as follows

```python
import torch

# Init caching
# b = torch.zeros(10, device='cuda')
new_alloc = torch.cuda.memory.CUDAPluggableAllocator('alloc.so', 'my_malloc', 'my_free')
old = torch.cuda.memory.get_current_allocator()
torch.cuda.memory.change_current_allocator(new_alloc)
b = torch.zeros(10, device='cuda')
# This will error since the current allocator was already instantiated
torch.cuda.memory.change_current_allocator(old)
```

Things to discuss
- How to test this, needs compiling external code ...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86786
Approved by: https://github.com/albanD
2022-11-23 17:54:36 +00:00
Zachary DeVito
0d2c2110f1 [allocator] Introduce the abstract class CUDACachingAllocator (#87251)
This replaces the manual function pointers, making it easier to write
new drop-in allocators.

Note that most allocation goes through the Allocator interface, which
CUDAAllocator inherits from, and this arrangement avoids adding an
additional layer of dispatch along this pathway compared to what existed before.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87251
Approved by: https://github.com/wconstab
2022-10-20 01:17:00 +00:00
Zachary DeVito
f56ce8dbad [allocator] Move getFreeMutex (#87237)
It isn't used by the allocators at all, and this change makes that clearer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87237
Approved by: https://github.com/wconstab
2022-10-19 18:00:40 +00:00
Eddie Yan
25725fd624 (Re-open) Adds cudaMallocAsync as an alternative backend for the CUDA allocator (#82682)
Rebased version of @mcarilli 's cudaMallocAsync #65365 for continued testing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82682
Approved by: https://github.com/ngimel
2022-10-12 03:44:21 +00:00
eqy
352d926482 [CUBLAS][CUDA GRAPHS] (re-re-re-re-open of #83461) Explicitly set the workspace for cuBLAS handles (#86645)
re-opening (again) in hopes of working around failed/stuck CLA check

CC @ptrblck @ngimel @huydhn
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86645
Approved by: https://github.com/zdevito
2022-10-11 16:03:49 +00:00
Zachary DeVito
91b1bae1df Caching allocator tracing (#86241)
We currently can take snapshots of the state of the allocated cuda memory, but we do not have a way to correlate these snapshots with the actions the allocator took between snapshots. This PR adds a simple fixed-sized buffer that records the major actions the allocator takes (ALLOC, FREE, SEGMENT_ALLOC, SEGMENT_FREE, OOM, SNAPSHOT) and includes these with the snapshot information. Capturing periodic snapshots with a big enough trace buffer makes it possible to see how the allocator state changes over time.

We plan to use this functionality to guide how settings in the allocator can be adjusted and eventually have a more robust overall algorithm.

As a component of this functionality, we also add the ability to get a callback when the allocator will throw an OOM, primarily so that snapshots can be taken immediately to see why the program ran out of memory (most programs have some C++ state that would free tensors before the OutOfMemory exception can be caught).

This PR also updates the _memory_viz.py script to pretty-print the trace information and provide a better textual summary of snapshots distinguishing between internal and external fragmentation.
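
As a hedged sketch of inspecting the trace alongside a snapshot (private API; entry keys per recent versions of the snapshot format and may differ across releases):

```python
import torch

torch.cuda.memory._record_memory_history(max_entries=100000)
x = torch.randn(1024, device="cuda")
del x
snap = torch.cuda.memory._snapshot()
for entry in snap["device_traces"][0]:  # one trace buffer per device
    print(entry["action"], entry.get("size"))
```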
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86241
Approved by: https://github.com/ngimel
2022-10-07 23:19:54 +00:00
Edward Z. Yang
adf5919720 Add option to record C++ backtraces in _record_memory_history (#86145)
I used this to debug https://github.com/pytorch/pytorch/issues/86136 so it is useful. The implementation is not so fast so it is not enabled by default.
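
A hedged sketch of opting in (recent private spelling; earlier versions used a different keyword for the C++ option):

```python
import torch

# stacks="all" records C++ frames in addition to Python ones.
torch.cuda.memory._record_memory_history(stacks="all")
```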

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86145
Approved by: https://github.com/albanD, https://github.com/zdevito
2022-10-06 04:07:37 +00:00
Edward Z. Yang
97d6b5bbf8 Refactor _cuda_recordMemoryHistory to use pybind11 (#86139)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86139
Approved by: https://github.com/albanD
2022-10-06 04:07:37 +00:00
Zachary DeVito
db13049b88 [allocator tracing] missing GIL acquire (#86254)
Bug where the context destructor needs to hold the GIL to free the context.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86254
Approved by: https://github.com/ezyang
2022-10-05 05:54:05 +00:00
PyTorch MergeBot
71eb04403c Revert "[CUBLAS][CUDA GRAPHS] (re-re-open of #83461) Explicitly set the workspace for cuBLAS handles (#85447)"
This reverts commit b04b2fa9aa.

Reverted https://github.com/pytorch/pytorch/pull/85447 on behalf of https://github.com/seemethere due to Caused a CUDA memory leak, detected by our performance benchmark suite
2022-09-30 20:53:41 +00:00
Eddie Yan
b04b2fa9aa [CUBLAS][CUDA GRAPHS] (re-re-open of #83461) Explicitly set the workspace for cuBLAS handles (#85447)
Now includes @dagitses 's optimizations and fixes for teardown

CC @ngimel @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85447
Approved by: https://github.com/malfet
2022-09-28 16:04:58 +00:00
PyTorch MergeBot
0ac6311356 Revert "[CUBLAS][CUDA GRAPHS] (re-open of #83461) Explicitly set the workspace for cuBLAS handles (#85292)"
This reverts commit 4012e623e8.

Reverted https://github.com/pytorch/pytorch/pull/85292 on behalf of https://github.com/dagitses due to broke an internal test during shutdown. Re-submit with #85399 in stack
2022-09-21 17:57:49 +00:00
eqy
4012e623e8 [CUBLAS][CUDA GRAPHS] (re-open of #83461) Explicitly set the workspace for cuBLAS handles (#85292)
re-open of #83461 with fix for 10.2 build

CC @ngimel @malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85292
Approved by: https://github.com/malfet
2022-09-20 16:31:54 +00:00
Hector Yuen
d23ce29761 allow changing the cuda allocator settings even after the process started (#84970)
Summary:
- expose a Python call to set the allocator settings; it uses the same format as the value of PYTORCH_CUDA_ALLOC_CONF (see the sketch below)
- keep the implementation contained within the cpp file to avoid increasing build times; only expose a function to call the setting
- make some of the Allocator Config methods public; now it looks more like a singleton
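
For illustration, a hedged sketch of the runtime call (private helper; takes the same format as the PYTORCH_CUDA_ALLOC_CONF value):

```python
import torch

# Sketch only: _set_allocator_settings is private and may change.
torch.cuda.memory._set_allocator_settings("max_split_size_mb:128")
```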

Test Plan: added the unit test

Differential Revision: D39487522

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84970
Approved by: https://github.com/zdevito
2022-09-17 09:42:42 +00:00
Zachary DeVito
726d040692 annotated allocator snapshots (#82146)
Record stack trace information for each allocated segment in the allocator.
It takes around 1.5us to record 50 stack frames of context.
Since invoking a PyTorch operator is around 8us, this adds minimal overhead, but we still leave it disabled by default so that we can test it more on real workloads first.

Stack information is kept both for allocated blocks and for the last allocation that used inactive blocks. We could potentially keep around the _first_ allocation that caused the block to get allocated from cuda as well.

Potential Followups:
* stack frame entries are small (16 bytes), but the list of Frames is not compressed even though most frames will share some entries. So far this doesn't produce huge dumps (7MB for one real workload that uses all memory on the GPU), but it could be made much smaller through compression.
* Code to format the information is slow (a few seconds) because it uses python and FlameGraph.pl
* Things allocated during the backward pass have no stack frames because they are run on another C++ thread.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82146
Approved by: https://github.com/albanD
2022-08-09 17:21:35 +00:00
Can Balioglu
56dea92d97 Fix set_requires_cuda_init (#81183)
Fixes the buggy `set_requires_cuda_init` introduced in #80788.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81183
Approved by: https://github.com/ezyang
2022-07-11 15:36:42 +00:00
Eddie Yan
ae6dd20ba7 [cuDNN V8 API] (reopen 2) Allow the number of kernels profiled under torch.backends.cudnn.benchmark = True to be limited (#78299)
Reopen of #77002 to address comments by @malfet

CC @ngimel @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78299
Approved by: https://github.com/ngimel
2022-07-07 23:25:23 +00:00
Can Balioglu
081b56fd41 Improve readability of cuda_lazy_init (#80788)
This PR cleans up the implementation of `cuda_lazy_init.cpp` and improves its readability. No behavioral changes are introduced.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80788
Approved by: https://github.com/ezyang
2022-07-04 16:47:11 +00:00
jjsjann123
9e86796fe3 simple c10 implementation for std::call_once (#78051)
A long-standing bug in std::call_once: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66146
It could hang during re-entry after exception handling.

Added a c10 implementation yielding a bulky mutex. Not the most efficient thing but at least it shouldn't hang.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78051
Approved by: https://github.com/albanD
2022-06-28 15:47:03 +00:00
Michael Suo
30fb2c4aba [lint] autoformat test/cpp and torch/csrc
Let's have some fun.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/78828

Approved by: https://github.com/ezyang
2022-06-11 21:11:16 +00:00
Sherlock Huang
6db8440f35 Python Jiterator supports multiple outputs (#78139)
This PR is part 3.
Part1: https://github.com/pytorch/pytorch/pull/77902
Part2: https://github.com/pytorch/pytorch/pull/77921

Python Jiterator now supports returning multiple outputs

```
fn = torch.cuda.jiterator._create_multi_output_jit_fn(
"""
template <typename T>
T binary_2outputs(T i0, T i1, T& out0, T& out1) {
    out0 = i0 + i1;
    out1 = i0 - i1;
}
""",
num_outputs=2)

x = torch.rand(3, device='cuda')
y = torch.rand(3, device='cuda')
out0, out1 = fn(x, y)

torch.allclose(out0, x+y)
torch.allclose(out1, x-y)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78139
Approved by: https://github.com/ngimel
2022-05-24 21:52:56 +00:00
Natalia Gimelshein
4ea176ea57 expose fast get_current_stream (#78165)
Exposes a fast, no-frills way of getting the raw `cudaStream_t` in Python (200 ns instead of 4 us).
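
A hedged sketch of the fast path (private binding; the returned int is the raw `cudaStream_t` for the given device index):

```python
import torch

# Sketch only: _cuda_getCurrentRawStream is private and may change.
raw_stream = torch._C._cuda_getCurrentRawStream(0)
```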

Pull Request resolved: https://github.com/pytorch/pytorch/pull/78165
Approved by: https://github.com/SherlockNoMad, https://github.com/soumith, https://github.com/gchanan
2022-05-24 15:54:47 +00:00
Kurt Mohler
aea6e2c396 Merge torch.cuda._UntypedStorage into torch._UntypedStorage (#75459)
Fixes #74933

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75459
Approved by: https://github.com/ezyang
2022-05-19 13:54:39 +00:00
Michael Carilli
929f1d5317 [RELAND] Adds torch.cuda.is_current_stream_capturing (#77789)
Resubmit of https://github.com/pytorch/pytorch/pull/77673, which was reverted due to Windows test failures: https://github.com/pytorch/pytorch/pull/77673#issuecomment-1130425845.

I suspect these failures happened because I don't explicitly set a side stream for graph capture in the new test.
Not setting a side stream explicitly is alright on Linux because cuda tests implicitly use a side stream.
I think Windows cuda tests implicitly use the default stream, breaking capture and leaving the backend in a bad state.
Other graph tests explicitly set side streams and don't error in Windows builds, so I'm 95% sure doing the same for the new test will work.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77789
Approved by: https://github.com/ezyang
2022-05-18 23:18:53 +00:00
Jeff Daily
de86146c61 rocblas alt impl during backward pass only (#71881)
In preparation for adopting future rocBLAS library options, it is necessary to track when the backward pass of training is executing. The scope-based helper class `BackwardPassGuard` is provided to toggle this state.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71881
Approved by: https://github.com/albanD
2022-05-18 19:42:58 +00:00
PyTorch MergeBot
0d8a0f186b Revert "Adds torch.cuda.is_current_stream_capturing (#77673)"
This reverts commit d03d43df52.

Reverted https://github.com/pytorch/pytorch/pull/77673 on behalf of https://github.com/suo
2022-05-18 19:31:49 +00:00
Michael Carilli
d03d43df52 Adds torch.cuda.is_current_stream_capturing (#77673)
Exposes a way to query if CUDA graph capture is underway on the current stream.
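
For illustration, a minimal sketch (assumes a CUDA device):

```python
import torch

assert not torch.cuda.is_current_stream_capturing()
x = torch.randn(8, device="cuda")
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):  # makes its capture stream current
    assert torch.cuda.is_current_stream_capturing()
    y = x * 2
g.replay()  # y now holds the replayed result
```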
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77673
Approved by: https://github.com/ezyang
2022-05-18 16:46:35 +00:00
Sherlockk Huang
8b6a78f39f Python Interface for Jiterator
This PR allows users to author a CUDA kernel in Python.

```
from torch.cuda.jiterator import create_jit_fn

code_string = "template <typename T> T my_kernel(T x, T y, T alpha) { return  -x * y + x - y + alpha; }"
jitted_fn = create_jit_fn(code_string, alpha=0)

a = torch.rand(3, device='cuda')
b = torch.rand(3, device='cuda')
result = jitted_fn(a, b, alpha=1.0)
```

Limitations:
- Only supports elementwise kernels
- 1~8 tensor inputs (empty input, e.g. factory methods, is not supported)
- input tensors must live on a CUDA device
- CPU scalars are not supported
- kwargs must be pre-declared when calling create_jit_fn
- kwargs must be convertible to at::Scalar, one of float64, int64_t, bool (complex is not supported for now)

TODOs:
- [x] consolidate union and c10::variant implementation
- [x] plug into existing op testing framework
- [ ] rename files, place files in the right folder
- [ ] place util functions in the right file
- [x] enforce assumptions in python interface e.g <8 inputs, kwargs types
- [x] Add user-facing documentation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76394
Approved by: https://github.com/mruberry
2022-05-06 18:44:28 +00:00
Andrey Talman
efd274bbcb Fix for windows builds with python 3.10 , getting rid of ssize_t (ssize_t is not a C++ defined type) (#71390)
Summary:
Fix for Windows builds with Python 3.10, getting rid of ssize_t.

Here is the completed bin build : https://app.circleci.com/pipelines/github/pytorch/pytorch/441527/workflows/144edb79-b398-4d70-92fe-b63158c1b439/jobs/16954881

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71390

Reviewed By: samdow

Differential Revision: D33637686

Pulled By: atalman

fbshipit-source-id: fcdfca672dc20385a3d2339c20e69bd2d1717e88
(cherry picked from commit 2ac58b0dc1)
2022-01-18 22:12:41 +00:00
Shintaro Iwasaki
5cae40c169 [pytorch][aten][cuda] move CUDAGeneratorImpl.h to ATen/cuda (#70650)
Summary:
This patch moves a CUDA-specific file, `CUDAGeneratorImpl.h` to `ATen/cuda` as the following TODO comment in  `CUDAGeneratorImpl.h` suggests:
```
// TODO: this file should be in ATen/cuda, not top level
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70650

Reviewed By: jianyuh, xw285cornell

Differential Revision: D33414890

Pulled By: shintaro-iwasaki

fbshipit-source-id: 4ff839205f4e4ea4c8767f164d583eb7072f1b8b
2022-01-10 22:27:04 -08:00
Peter Bell
b08d64202a Remove THGeneral (#69041)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69041

`TH_CONCAT_{N}` is still being used by THP, so I've moved that into
its own header, but all the compiled code is gone.

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D32872477

Pulled By: ngimel

fbshipit-source-id: 06c82d8f96dbcee0715be407c61dfc7d7e8be47a
2021-12-13 16:14:28 -08:00
Peter Bell
e279963eef Remove remaining THC code (#69039)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69039

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D32872476

Pulled By: ngimel

fbshipit-source-id: 7972aacc24aef9450fb59b707ed6396c501bcb31
2021-12-08 12:18:08 -08:00
Peter Bell
bf01cd5228 Move THC_sleep to ATen (#69038)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69038

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D32872479

Pulled By: ngimel

fbshipit-source-id: 97c7592b16eee2ecc66c42507c358aa92cc8ee50
2021-12-06 10:20:43 -08:00
Kurt Mohler
5883523c1d Remove dtype from torch.Storage and use only torch.ByteStorage (#62030)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62030

Remove dtype tracking from Python Storage interface, remove all the different `<type>Storage` classes except for `ByteStorage`, and update serialization accordingly, while maintaining as much FC/BC as possible

Fixes https://github.com/pytorch/pytorch/issues/47442

* **THE SERIALIZATION FORMAT IS FULLY FC/BC.** We worked very hard to make sure this is the case. We will probably want to break FC at some point to make the serialization structure of tensors make more sense, but not today.
* There is now only a single torch.ByteStorage class. Methods like `Tensor.set_` no longer check that the dtype of storage is appropriate.
* As we no longer know what dtype of a storage is, we've **removed** the size method from Storage, replacing it with nbytes. This is to help catch otherwise silent errors where you confuse number of elements with number of bytes.
* `Storage._new_shared` takes a `nbytes` kwarg and will reject previous positional only calls.  `Storage._new_with_file` and `_set_from_file` require explicit element size arguments.
* It's no longer possible to convert storages to different types using the float/double/etc methods. Instead, do the conversion using a tensor.
* It's no longer possible to allocate a typed storage directly using FloatStorage/DoubleStorage/etc constructors. Instead, construct a tensor and extract its storage. The classes still exist but they are used purely for unpickling.
* The preexisting serialization format stores dtype with storage, and in fact this dtype is used to determine the dtype of the tensor overall. To accommodate this case, we introduce a new TypedStorage concept that exists only during unpickling time, which is used to temporarily store the dtype so we can construct a tensor. **If you overrode the handling of pickling/unpickling, you MUST add handling for TypedStorage** or your serialization code will degrade to standard file-based serialization.

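A minimal sketch of the post-change surface described above (illustrative only; the removed `size()` vs. new `nbytes()` and the convert-via-tensor rule come from the bullets, while the concrete values are made up):

```python
import torch

t64 = torch.arange(4, dtype=torch.float64)
s = t64.storage()              # a single untyped, byte-oriented storage
print(s.nbytes())              # size() is gone; nbytes() reports raw bytes: 32

# storage.float()/double()/etc. are gone; convert through a tensor instead:
t32 = t64.to(torch.float32)
print(t32.storage().nbytes())  # 16
```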
Original pull request: https://github.com/pytorch/pytorch/pull/59671

Reviewed By: soulitzer, ngimel

Differential Revision: D29466819

Pulled By: ezyang

fbshipit-source-id: 4a14e5d3c2b08e06e558683d97f7378a3180b00e
2021-10-05 13:50:34 -07:00
Peter Bell
f6dfac6974 Migrate THCCachingHostAllocator to ATen (#65746)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65746

This also removes the cudaHostAllocator field on THCState, since there
doesn't seem to be an API anywhere for customizing it.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D31236630

Pulled By: ngimel

fbshipit-source-id: 2a8e756222ae70565e77f8e7139d60ec5be32276
2021-09-30 21:26:38 -07:00
Pruthvi Madugundu
085e2f7bdd [ROCm] Changes not to rely on CUDA_VERSION or HIP_VERSION (#65610)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65610

- Replace HIP_PLATFORM_HCC with USE_ROCM
- Don't rely on CUDA_VERSION or HIP_VERSION; use USE_ROCM and ROCM_VERSION instead.

- In the next PR
   - Will remove the mapping from CUDA_VERSION to HIP_VERSION and from CUDA to HIP in hipify.
   - HIP_PLATFORM_HCC is deprecated, so we will add HIP_PLATFORM_AMD to support HIP host code compilation on gcc.

cc jeffdaily sunway513 jithunnair-amd ROCmSupport amathews-amd

Reviewed By: jbschlosser

Differential Revision: D30909053

Pulled By: ezyang

fbshipit-source-id: 224a966ebf1aaec79beccbbd686fdf3d49267e06
2021-09-29 09:55:43 -07:00
Natalia Gimelshein
d783617216 enable warnings on cuda synchronization (#62092)
Summary:
This creates a `torch.cuda.set_warn_on_synchronization()` function that will warn or error when a synchronizing operation is performed. We could wrap it in a context manager for ease of use, but it would be a lie, because it sets global, not thread-local, state. Since it's intended for debugging, maybe that's OK though.
As with all `torch.cuda.*` functions, it goes through CPython, not pybind, so the argument is converted to a long before being passed to the c10 function. I'll make the Python argument a Python enum class, but without pybind it'll still have to go through a long conversion.

For a test script
```
import torch
torch.cuda.set_warn_on_synchronization(1)
x=torch.randn(10, device="cuda")
x.nonzero()
y=torch.randn((), device="cuda")

if y:
    print("something")
torch.multinomial(x.abs(), 10, replacement=False)
torch.randperm(20000, device="cuda")
ind = torch.randint(10, (3,), device="cuda")
mask = torch.randint(2, (10,), device="cuda", dtype=torch.bool)
val = torch.randn((), device="cuda")
x[mask]=1.
x[mask] = val
torch.cuda.synchronize()
```
the output is
```
/../playground/sync_warn_test.py:4: UserWarning: called a synchronizing operation (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:145.)
  x.nonzero()
/../playground/sync_warn_test.py:7: UserWarning: called a synchronizing operation (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:145.)
  if y:
something
/../playground/sync_warn_test.py:9: UserWarning: called a synchronizing operation (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:145.)
  torch.multinomial(x.abs(), 10, replacement=False)
/../playground/sync_warn_test.py:15: UserWarning: called a synchronizing operation (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:145.)
  x[mask] = val
```
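The summary above notes that a context-manager wrapper would be convenient but slightly dishonest, since the setting is global rather than thread-local. A minimal sketch of such a wrapper, assuming `0` restores the default silent mode:

```python
import contextlib
import torch

@contextlib.contextmanager
def warn_on_sync(mode=1):
    # Caveat from above: this flips *global*, not thread-local, state,
    # so other threads will observe the setting change as well.
    torch.cuda.set_warn_on_synchronization(mode)
    try:
        yield
    finally:
        torch.cuda.set_warn_on_synchronization(0)  # assumed "off" value
```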

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62092

Reviewed By: mruberry

Differential Revision: D29968792

Pulled By: ngimel

fbshipit-source-id: cc6f817212c164727ed99ecf6ab050dc29631b9e
2021-07-30 09:13:01 -07:00
Nikita Shulga
a9b0a921d5 Disable avoid-non-const-global-variables lint check (#62008)
Summary:
The GoogleTest `TEST` macro is non-compliant with this check, as is `DEFINE_DISPATCH`.

All changes but the ones to `.clang-tidy` were generated using the following script:
```
for i in `find . -type f -iname "*.c*" -or -iname "*.h"|xargs grep cppcoreguidelines-avoid-non-const-global-variables|cut -f1 -d:|sort|uniq`;  do sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" $i; done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008

Reviewed By: driazati, r-barnes

Differential Revision: D29838584

Pulled By: malfet

fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
2021-07-22 18:04:40 -07:00
Mike Guo
6ecc1a4c4f Make pytorch clang-tidy clean (#60649)
Summary:
This PR suppresses clang-tidy warnings in the codebase (for now) so that we can re-enable clang-tidy checks on master.

I ran this script to add the `NOLINTNEXTLINE` comments (on a devserver):
```bash
python3 setup.py develop

# Uses same script that's run on CI and adds the -j (parallel), -s (add comments), -k (continue if diagnostic errors are found) options
python3 tools/clang_tidy.py \
  -j \
  -s \
  -k \
  -v \
  --paths torch/csrc/ \
  -g"-torch/csrc/jit/passes/onnx/helper.cpp" \
  -g"-torch/csrc/jit/passes/onnx/shape_type_inference.cpp" \
  -g"-torch/csrc/jit/serialization/onnx.cpp" \
  -g"-torch/csrc/jit/serialization/export.cpp" \
  -g"-torch/csrc/jit/serialization/import.cpp" \
  -g"-torch/csrc/jit/serialization/import_legacy.cpp" \
  -g"-torch/csrc/onnx/init.cpp" \
  -g"-torch/csrc/cuda/nccl.*" \
  -g"-torch/csrc/cuda/python_nccl.cpp" \
  -g"-torch/csrc/autograd/FunctionsManual.cpp" \
  -g"-torch/csrc/generic/*.cpp" \
  -g"-torch/csrc/jit/codegen/cuda/runtime/*" \
  -g"-torch/csrc/deploy/interpreter/interpreter.cpp" \
  -g"-torch/csrc/deploy/interpreter/interpreter.h" \
  -g"-torch/csrc/deploy/interpreter/interpreter_impl.h" \
  -g"-torch/csrc/deploy/interpreter/test_main.cpp"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60649

Test Plan: Verified changes by re-running the script (without the `-s` option) and seeing no warnings/errors.

Reviewed By: walterddr, janeyx99

Differential Revision: D29504258

Pulled By: 1ntEgr8

fbshipit-source-id: 78310b30ee8213b73ddb4771ad874665323e7a4e
2021-07-01 12:21:07 -07:00
Edward Yang
85af24f52b Remove some unnecessary functions from CUDAHooks (#59655)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59655

CUDAHooks is to be used solely when you need to call into CUDA
functionality from a context where you cannot directly link to
CUDA libraries.  Neither hasPrimaryContext nor
getDevceIndexWithPrimaryContext (sic) needs to be used in such
contexts.  By moving them out of CUDAHooks and calling them
directly, a dynamic dispatch can be skipped.

I also fixed the typo in getDev(i)ceIndexWithPrimaryContext

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D28972946

Pulled By: ezyang

fbshipit-source-id: edcd7a7b62aec97928f07fbf3bf413b9fb027517
2021-06-28 10:38:51 -07:00
Michael Wootton
2f3be2735f Don't split oversize cached blocks (#44742)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/35901

This change is designed to prevent fragmentation in the Caching Allocator.  Permissive block splitting in the allocator allows very large blocks to be split into many pieces.  Once split too finely, it is unlikely that all the pieces will be 'free' at the same time, so the original allocation can never be returned.  Anecdotally, we've seen a model run out of memory failing to allocate a 50 MB block on a 32 GB card while the caching allocator was holding 13 GB of 'split free blocks'.

Approach:

- Large blocks above a certain size are designated "oversize".  This limit is currently set one decade above the 'large' block size, at 200 MB
- Oversize blocks cannot be split
- Oversize blocks must closely match the requested size (e.g. a 200 MB request will match an existing 205 MB block, but not a 300 MB block)
- In lieu of splitting oversize blocks, there is a mechanism to quickly free a single oversize block (back to the system allocator) so that an appropriately sized block can be allocated.  This is activated under memory pressure and prevents _release_cached_blocks()_ from triggering; see the sketch below.

Initial performance tests show this is similar to or quicker than the original strategy.  Additional tests are ongoing.
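The matching rule for oversize blocks can be illustrated with a small Python sketch (not the allocator's actual C++ code; the 200 MB threshold and the 205 MB/300 MB example come from the bullets above, and the slack margin is a hypothetical placeholder):

```python
OVERSIZE_THRESHOLD = 200 * 1024 * 1024  # blocks above this are "oversize"
MATCH_SLACK = 20 * 1024 * 1024          # hypothetical closeness margin

def block_can_serve(request_bytes: int, block_bytes: int) -> bool:
    if request_bytes < OVERSIZE_THRESHOLD:
        # Normal blocks may be split, so any large-enough block works.
        return block_bytes >= request_bytes
    # Oversize blocks are never split, so require a close match:
    # a 200 MB request matches a 205 MB block but not a 300 MB one.
    return request_bytes <= block_bytes <= request_bytes + MATCH_SLACK
```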

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44742

Reviewed By: zou3519

Differential Revision: D29186394

Pulled By: ezyang

fbshipit-source-id: c88918836db3f51df59de6d1b3e03602ebe306a9
2021-06-21 11:46:08 -07:00
Richard Barnes
e3d75b8475 irange for PyTorch sans jit (#59481)
Summary:
Switches most of the simple for loops outside of `jit` directories to use `c10::irange`.

Generated with D28874212.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59481

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D28909681

fbshipit-source-id: ec9ab1bd602933238d9d0f73d4d8d027b75d9d85
2021-06-09 14:46:11 -07:00