pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 00:21:07 +01:00

Author	SHA1	Message	Date
Syed Tousif Ahmed	3960f97832	Documents torch.cuda.MemPool API (#148374 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148374 Approved by: https://github.com/eqy, https://github.com/ngimel	2025-03-06 23:18:43 +00:00
PyTorch MergeBot	9a883007a2	Revert "Implement cuda graphs implementation of torch.cond and torch.while_loop (#140979 )" This reverts commit `c7515da7b0`. Reverted https://github.com/pytorch/pytorch/pull/140979 on behalf of https://github.com/huydhn due to This change has been reported to break internal code ([comment](https://github.com/pytorch/pytorch/pull/140979#issuecomment-2657361940))	2025-02-13 18:04:26 +00:00
Daniel Galvez	c7515da7b0	Implement cuda graphs implementation of torch.cond and torch.while_loop (#140979 ) This is a new PR for #130386 , which got stale and was closed. Since I force-pushed to that branch in order to rebase it on top of main, the PR can no longer be reopened, according to https://github.com/isaacs/github/issues/361 I fixed the possibly-not-warmed-up problem described here: https://github.com/pytorch/pytorch/pull/130386/files#r1690856534 Since starting this, torch.cond and torch.while_loop now apparently have support for backward passes. I will look into what it might take to support that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140979 Approved by: https://github.com/eqy, https://github.com/eellison	2025-02-11 18:16:15 +00:00
Eddie Yan	9ee506bd93	[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 ) Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs... Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441 Approved by: https://github.com/Chillee, https://github.com/malfet	2025-02-06 19:04:50 +00:00
PyTorch MergeBot	c3f71eb61b	Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 )" This reverts commit `e2917245fb`. Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/ZainRizvi due to Sorry but this still fails internally with the same error. @Chillee or @malfet, can you please help the change get tested? (See D68783351) ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2627886999))	2025-01-31 17:43:09 +00:00
Eddie Yan	e2917245fb	[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 ) Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs... Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441 Approved by: https://github.com/Chillee, https://github.com/malfet	2025-01-30 22:33:50 +00:00
PyTorch MergeBot	c986eba560	Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 )" This reverts commit `abf28982a8`. Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/ZainRizvi due to Sorry but this is failing internally. @Chillee can you please help change get remerged? See D68720562 ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2616726406))	2025-01-27 19:38:26 +00:00
Eddie Yan	abf28982a8	[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 ) Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs... Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441 Approved by: https://github.com/Chillee	2025-01-27 18:05:23 +00:00
PyTorch MergeBot	dad9bc3461	Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 )" This reverts commit `de945d78da`. Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/izaitsevfb due to unused variables again :( ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2611182461))	2025-01-23 22:59:25 +00:00
Eddie Yan	de945d78da	[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 ) Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs... Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441 Approved by: https://github.com/Chillee	2025-01-22 22:42:48 +00:00
PyTorch MergeBot	4ea189422d	Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 )" This reverts commit `a6763b7b81`. Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/kit1980 due to breaking internal builds: unused variable 'halpha' ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2596895865))	2025-01-16 21:12:41 +00:00
eqy	a6763b7b81	[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 ) Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs... Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441 Approved by: https://github.com/Chillee	2025-01-15 18:37:55 +00:00
PyTorch MergeBot	64bcf39180	Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 )" This reverts commit `388b75edec`. Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/kit1980 due to breaking internal builds: unused variable 'halpha' ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2588517060))	2025-01-14 00:48:28 +00:00
eqy	388b75edec	[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 ) Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs... Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441 Approved by: https://github.com/Chillee	2025-01-11 15:30:38 +00:00
Banit Agrawal	a575ce0dc6	[PyTorch Pinned Allocator] Add support of background thread to process events (#135524 ) Summary: Currently we process events in the regular allocation path and we call cudaEventQuery to check on the events and this path can take some locks in libcuda driver. Its not entirely needed to do process events in the allocation path, we could move this to a background thread and keep processing events regularly and put the freed block to the free list. Differential Revision: D62396585 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135524 Approved by: https://github.com/zyan0	2024-09-17 21:08:10 +00:00
Banit Agrawal	48d18fbd4c	[PyTorch CUDA Allocator] Allow reuse of non-split blocks with better rounding (#136174 ) Summary: This diff adds an option to round the non-split blocks in caching allocator so that they can be reused without causing lots of fragmentation for large memory segments. For example, if we specify max_split memory size as 400MB, then all allocations more than 400MB will not be split. Lets say, we allocated some 1024MB blocks and these are cached in the allocator blocks. If we request a new 500MB block, we round it to nearest power-2-division, thats 512MB, we add default kLargeBuffer of 20MB, that will be 532MB and since 532MB is less than existing 1024MB block, the 1024MB will not be used for this allocation, instead a new 512MB block will be created. In this diff, we provide an option to cofigure the kLargeBuffer for rounding and expose as a configurable option, so 512MB + max_non_split_rounding_size and if thats greater than 1024MB, we will use te 1024MB and we wont create a new 512MB block using cudaMalloc. This option is added so that we can pre-allocate some large blocks so that we can reuse them as much as possible and we dont stall on calling cudaMalloc. Differential Revision: D62758758 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136174 Approved by: https://github.com/zyan0	2024-09-17 19:08:44 +00:00
Frank Lin	249e65b92d	Graph-Safe RNG State Exchange for Tensor Parallelism (#114068 ) See #113541 The PR allows for registering and controlling multiple RNG states using indices, ensuring cudagraph-safe operations, and includes both C++ and Python API changes to support this functionality. cc @eellison @anijain2305 @jansel @ezyang @ptrblck @csarofeen @mcarilli Pull Request resolved: https://github.com/pytorch/pytorch/pull/114068 Approved by: https://github.com/ezyang, https://github.com/eqy, https://github.com/xuzhao9	2024-03-27 01:14:38 +00:00
PyTorch MergeBot	4dc09d6aa4	Revert "Graph-Safe RNG State Exchange for Tensor Parallelism (#114068 )" This reverts commit `e9dcda5cba`. Reverted https://github.com/pytorch/pytorch/pull/114068 on behalf of https://github.com/ezyang due to memory leak in another ci ([comment](https://github.com/pytorch/pytorch/pull/114068#issuecomment-2018044527))	2024-03-25 13:49:04 +00:00
Frank Lin	e9dcda5cba	Graph-Safe RNG State Exchange for Tensor Parallelism (#114068 ) See #113541 The PR allows for registering and controlling multiple RNG states using indices, ensuring cudagraph-safe operations, and includes both C++ and Python API changes to support this functionality. cc @eellison @anijain2305 @jansel @ezyang @ptrblck @csarofeen @mcarilli Pull Request resolved: https://github.com/pytorch/pytorch/pull/114068 Approved by: https://github.com/ezyang	2024-03-21 01:57:08 +00:00
Svetlana Karslioglu	5ae6f6cffe	Test seo torch cuda (#119324 ) Testing if this will help improve SEO of this page. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119324 Approved by: https://github.com/albanD	2024-02-07 00:39:51 +00:00
Edward Z. Yang	09df6b771b	Add a note about performant record_stream use. (#112526 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/112526 Approved by: https://github.com/albanD	2023-11-02 15:50:22 +00:00
eqy	894b9957c8	[DOCS][CUDA] Update TF32 docs for sm90 (#111337 ) For #110252. Pull Request resolved: https://github.com/pytorch/pytorch/pull/111337 Approved by: https://github.com/msaroufim	2023-10-19 09:36:13 +00:00
Banit Agrawal	64583c4d04	[CUDA Host Allocator] Add support of CudaHostRegister (#108488 ) Summary: This diff adds another option to create cuda pinned memory using cudaHostRegister. Differential Revision: D45843715 Pull Request resolved: https://github.com/pytorch/pytorch/pull/108488 Approved by: https://github.com/zdevito	2023-10-06 04:13:02 +00:00
Zachary DeVito	40cbda274b	document memory snapshotting (#107660 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/107660 Approved by: https://github.com/albanD ghstack dependencies: #107171, #107399	2023-08-24 19:20:03 +00:00
eqy	33f3dca6b5	[CUDA][CUBLAS] Fix BF16 reduced precision reduction note in docs (#101044 ) #100966 CC @ngimel @ezyang Pull Request resolved: https://github.com/pytorch/pytorch/pull/101044 Approved by: https://github.com/ngimel	2023-05-10 06:50:58 +00:00
eqy	6e2efd16d8	[CUDA][CUBLAS] Add cuBLAS workspace allocation behavior to docs (#100919 ) Adding to the docs for now, hopefully we can move to `cudaMallocAsync`-backed cuBLAS workspaces soon which should alleviate the recent confusion around `cuBLAS` "leaking" memory through workspaces. Pull Request resolved: https://github.com/pytorch/pytorch/pull/100919 Approved by: https://github.com/ngimel	2023-05-10 06:40:26 +00:00
Emilio Castillo	07e595e88a	Add `device_idx` to `free_fn` in `CUDAPluggableAllocator` (#91398 ) This was requested by nvidia folks, track also the device_id in the free function. Pull Request resolved: https://github.com/pytorch/pytorch/pull/91398 Approved by: https://github.com/albanD	2023-01-12 05:03:48 +00:00
Eddie Yan	8b617f813d	[cuBLAS] Add an option to disable reduced precision reductions for BF16 GEMM (#89172 ) Essentially the same change as #67946, except that the default is to disallow reduced precision reductions in `BFloat16` GEMMs (for now). If performance is severely regressed, we can change the default, but this option appears to be necessary to pass some `addmm` `BFloat16` tests on H100. CC @ptrblck @ngimel Pull Request resolved: https://github.com/pytorch/pytorch/pull/89172 Approved by: https://github.com/ngimel	2022-12-21 18:58:28 +00:00
Emilio Castillo	c9d4390d13	Add Pluggable CUDA allocator backend (#86786 ) Fixes #43144 This uses the Backend system added by [82682](https://github.com/pytorch/pytorch/pull/82682) to change allocators dynamically during the code execution. This will allow us to use RMM, use CUDA managed memory for some portions of the code that do not fit in GPU memory. Write static memory allocators to reduce fragmentation while training models and improve interoperability with external DL compilers/libraries. For example, we could have the following allocator in c++ ```c++ #include <sys/types.h> #include <cuda_runtime_api.h> #include <iostream> extern "C" { void* my_malloc(ssize_t size, int device, cudaStream_t stream) { void ptr; std::cout<<"alloc "<< size<<std::endl; cudaMalloc(&ptr, size); return ptr; } void my_free(void ptr) { std::cout<<"free "<<std::endl; cudaFree(ptr); } } ``` Compile it as a shared library ``` nvcc allocator.cc -o alloc.so -shared --compiler-options '-fPIC' ``` And use it from PyTorch as follows ```python import torch # Init caching # b = torch.zeros(10, device='cuda') new_alloc = torch.cuda.memory.CUDAPluggableAllocator('alloc.so', 'my_malloc', 'my_free') old = torch.cuda.memory.get_current_allocator() torch.cuda.memory.change_current_allocator(new_alloc) b = torch.zeros(10, device='cuda') # This will error since the current allocator was already instantiated torch.cuda.memory.change_current_allocator(old) ``` Things to discuss - How to test this, needs compiling external code ... Pull Request resolved: https://github.com/pytorch/pytorch/pull/86786 Approved by: https://github.com/albanD	2022-11-23 17:54:36 +00:00
Codrin Popa	5b767d404e	Modified roundup_power2_divisions to specify the number of divisions for each power of two interval (#87290 ) Summary: Improved roundup_power2_divisions knob so it allows better control of rouding in the PyTorch CUDA Caching Allocator. This new version allows setting the number of divisions per power of two interval starting from 1MB and ending at 64GB and above. An example use case is when rouding is desirable for small allocations but there are also very large allocations which are persistent, thus would not benefit from rounding and take up extra space. Test Plan: Tested locally Differential Revision: D40103909 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87290 Approved by: https://github.com/zdevito	2022-11-04 19:31:16 +00:00
Daniel Dale	ce56ee11fd	Extend torch.cuda.is_available() to attempt an NVML-based CUDA availability assessment when explicitly requested by the user (#85951 ) Fixes #83973 (This is a substitute PR for https://github.com/pytorch/pytorch/pull/85024) First of all, thanks for your invaluable contributions to PyTorch everyone! Given how extensively `torch.cuda.is_available` is used in the PyTorch ecosystem, IMHO it's worthwhile to provide downstream libraries/frameworks/users the ability to alter the default behavior of `torch.cuda.is_available` in the context of their PyTorch usage. I'm confident there are many current and future such use cases which could benefit from leveraging a weakened, NVML-based `torch.cuda.is_available` assessment at a downstream framework's explicit direction (thanks @malfet `81da50a972` !). Though one could always patch out the `torch.cuda.is_available` function with another implementation in a downstream library, I think this environmental variable based configuration option is more convenient and the cost to including the option is quite low. As discussed in https://github.com/pytorch/pytorch/pull/85024#issuecomment-1261542045, this PR gates new non-default NVML-based CUDA behavior with an environmental variable (PYTORCH_NVML_BASED_CUDA_CHK) that allows a user/framework to invoke non-default, NVML-based `is_available()` assessments if desired. Thanks again for your work everyone! @ngimel @malfet @awaelchli Pull Request resolved: https://github.com/pytorch/pytorch/pull/85951 Approved by: https://github.com/ngimel	2022-10-12 18:37:50 +00:00
Eddie Yan	25725fd624	(Re-open) Adds cudaMallocAsync as an alternative backend for the CUDA allocator (#82682 ) Rebased version of @mcarilli 's cudaMallocAsync #65365 for continued testing Pull Request resolved: https://github.com/pytorch/pytorch/pull/82682 Approved by: https://github.com/ngimel	2022-10-12 03:44:21 +00:00
Codrin Popa	d401732baa	Added roundup_bypass_threshold_mb knobs to the PyTorch Caching Allocator (#85940 ) Summary: Added an additional roundup knob( ``roundup_bypass_threshold_mb``) to bypass rounding the requested allocation size, for allocation requests larger than the threshold value (in MB). This can help reduce the memory footprint when making large allocations that are expected to be persistent or have a large lifetime. Differential Revision: D39868104 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85940 Approved by: https://github.com/zdevito	2022-10-03 16:56:22 +00:00
Christian Jauvin	089101fc82	Fix small typo in cuda.rst (#84012 ) This fixes a very minor typo in the CUDA semantics doc. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84012 Approved by: https://github.com/malfet	2022-08-26 04:53:49 +00:00
Danielle Pintz	8926b5b9c2	Fix typos in docs: Profiler and CUDA semantics (#80406 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/80406 Approved by: https://github.com/robieta	2022-07-13 18:53:02 +00:00
Mike Ruberry	1d47e0df5a	Updates TF32 docs (#79401 ) Updates TF32 docs to reflect PyTorch 1.12 updates. Pull Request resolved: https://github.com/pytorch/pytorch/pull/79401 Approved by: https://github.com/ngimel	2022-06-13 21:02:00 +00:00
Jaewon Lee	11ea09effc	[CUDACachingAlloc/GPUInference] Implement garbage collection without GPU sync (#74261 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74261 ### Goal Implement a cheap way to reclaim GPU memory (garbage collection) without incurring GPU sync. ### Why do we need this? Currently, there are only two ways to reclaim GPU memory block already assigned to a particular stream. - `release_available_cached_blocks(params)`: Free blocks exceeding the `CachingAllocatorConfig::max_split_size()` until we can satisfy the request. Issue: If the `max_split_size` is unset (default), this function is a no-op. Even if this is set, the reclamation is quite conservative (e.g., never frees blocks under max_split_size). - `release_cached_blocks()`: Waits for all the in-flight events and then reclaim blocks. Issue: 'waiting for all event' is very expensive as it will likely stall all the GPU operations. Many GPU applications without a proper handling of potential GPU throttling would suffer/crash. ### Proposed idea - If the garbage collection threshold is set, try to reclaim some memory blocks without synchronization. It should be safe to do so, as `release_available_cached_blocks` essentially does the same thing (but less aggressively). - GC is triggered only when we fail to serve a `malloc` request from the block pool. No need to free blocks when the block pool is functioning just fine. - Prioritize reclaiming blocks that weren't reused for long time. Reclamation stops once the used memory capacity < threshold. - This code path is totally optional; by default it won't be invoked. Test Plan: - Unit tests - Manually checked that the GPU memory usage stays as indicated by the garbage collector. If not the caching allocator at least tries to keep freeing the blocks. Reviewed By: jianyuh Differential Revision: D34482514 fbshipit-source-id: d5eae62ac60b94b0bca851f9d233a092d086e3c2 (cherry picked from commit 05780f1ed4b176f05e765b2411c9eaa2eaeb48b0)	2022-03-21 18:46:02 +00:00
Banit Agrawal	ac3effd150	[PyTorch GPU Allocator] Better use of blocks with rounding of allocation sizes (#74213 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74213 In the current CUDACachingAllocator, the sizes are rounded up in multiple of blocks size of 512, so this works for smaller sizes. However for large sizes, we can have lots of different size blocks in the larger pool. This is problematic when we have variable batch sizes 1001, 1021, 1023 -> all will go to different block size and will create different size of blocks. This will create lots of unused blocks and will waste GPU memory capacity. This diff adds a rounding approach to allocation size. It rounds up the size to nearest power-of-2 divisions and the power2-division can be changed with env variable setting. For example, if we need to round-up size of1200 and if number of divisions is 4, the size 1200 lies between 1024 and 2048 and if we do 4 divisions between them, the values are 1024, 1280, 1536, and 1792. So the function will return 1280 as the nearest ceiling of power-2 division. env setting: export PYTORCH_CUDA_ALLOC_CONF=roundup_power2_divisions:4 ghstack-source-id: 151446017 Reviewed By: ezyang Differential Revision: D34868036 fbshipit-source-id: 494785add16e6b37c920dcb5a2b81d4c637b554a (cherry picked from commit 548454ccacbd8700e7ffd2d762e40b4ba37abbae)	2022-03-16 02:53:53 +00:00
Mike Ruberry	9b9b878c89	Fixes jiterator cache macro include + updates CUDA note with cache variables (#71452 ) Summary: Per title. Pull Request resolved: https://github.com/pytorch/pytorch/pull/71452 Reviewed By: ngimel Differential Revision: D33646495 Pulled By: mruberry fbshipit-source-id: bbf627e6d7a724a83a3ea2ae9c0f50430f8d578e (cherry picked from commit `d1e72b144a`)	2022-01-19 03:45:05 +00:00
Michael Carilli	da023611d7	[CUDA graphs] Fixes make_graphed_callables example typos (#69379 ) Summary: cc mcarilli Pull Request resolved: https://github.com/pytorch/pytorch/pull/69379 Reviewed By: mruberry Differential Revision: D32841260 Pulled By: ngimel fbshipit-source-id: a7d0b9db0578526907547b201eddd55827812b63	2021-12-03 16:51:14 -08:00
Vansh Sharma	ff125a3624	Minor changes in documentation (#68557 ) Summary: Fixed some small typos Pull Request resolved: https://github.com/pytorch/pytorch/pull/68557 Reviewed By: mruberry Differential Revision: D32538749 Pulled By: ngimel fbshipit-source-id: 09a9cd4031463b6a40d7307bd8fcb7d364444ac3	2021-11-18 17:57:16 -08:00
eqy	790763b0fe	Add an option to disable reduced precision reductions for FP16 GEMM (#67946 ) Summary: https://github.com/pytorch/pytorch/issues/67578 disabled reduced precision reductions for FP16 GEMMs. After benchmarking, we've found that this has substantial performance impacts for common GEMM shapes (e.g., those found in popular instantiations of multiheaded-attention) on architectures such as Volta. As these performance regressions may come as a surprise to current users, this PR adds a toggle to disable reduced precision reductions `torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = ` rather than making it the default behavior. CC ngimel ptrblck stas00 Note that the behavior after the previous PR can be replicated with `torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False` Pull Request resolved: https://github.com/pytorch/pytorch/pull/67946 Reviewed By: zou3519 Differential Revision: D32289896 Pulled By: ngimel fbshipit-source-id: a1ea2918b77e27a7d9b391e030417802a0174abe	2021-11-09 17:27:20 -08:00
Michael Carilli	e3210ca184	[CUDA graphs] Beta, not prototype (#65247 ) Summary: Powers have decided this API should be listed as beta. Pull Request resolved: https://github.com/pytorch/pytorch/pull/65247 Reviewed By: malfet Differential Revision: D31057940 Pulled By: ngimel fbshipit-source-id: 137b63cbd2c7409fecdc161a22135619bfc96bfa	2021-09-20 13:32:36 -07:00
Michael Carilli	36cac2be4d	[CUDA graphs] moves memory sharing intro paragraph (#64996 ) Summary: Puts memory sharing intro under Sharing memory... header, where it should have been all along. Pull Request resolved: https://github.com/pytorch/pytorch/pull/64996 Reviewed By: mruberry Differential Revision: D30948619 Pulled By: ngimel fbshipit-source-id: 5d9dd267b34e9d3fc499d4738377b58a22da1dc2	2021-09-14 17:53:43 -07:00
Michael Carilli	8d08b103be	[CUDA graphs] Prototype API and documentation (#63269 ) Summary: RFC: https://github.com/pytorch/pytorch/issues/61880 Pull Request resolved: https://github.com/pytorch/pytorch/pull/63269 Reviewed By: mruberry Differential Revision: D30596643 Pulled By: ngimel fbshipit-source-id: b1f8061406364b667e2c2d4d30fbce1f0d8456be	2021-08-31 13:34:23 -07:00
Victor Quach	5830f122f1	Add docstrings for save_on_cpu hooks (#62410 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62410 This PR adds docstrings for CPU hooks introduced in #61928. Also uncomments the warning about pinned memory in CUDA semantics docs. Depends on: #62361. For now docstrings are an orphan page at https://docs-preview.pytorch.org/62410/generated/torch.autograd.graph.set_save_on_cpu_hooks.html#torch-autograd-graph-set-save-on-cpu-hooks Test Plan: Imported from OSS Reviewed By: soulitzer Differential Revision: D29990129 Pulled By: Varal7 fbshipit-source-id: 7a98eeee6a0abb11e2c2d9169cd1aa35ad7ba3f4	2021-08-03 17:53:45 -07:00
Michael Carilli	2fa6c7627e	[CUDA graphs][BC-breaking] Removes post-backward syncs on default stream (#60421 ) Summary: Before https://github.com/pytorch/pytorch/pull/57833, calls to backward() or grad() synced only the calling thread's default stream with autograd leaf streams at the end of backward. This made the following weird pattern safe: ```python with torch.cuda.stream(s): # imagine forward used many streams, so backward leaf nodes may run on many streams loss.backward() # no sync use grads ``` but a more benign-looking pattern was unsafe: ```python with torch.cuda.stream(s): # imagine forward used a lot of streams, so backward leaf nodes may run on many streams loss.backward() # backward() syncs the default stream with all the leaf streams, but does not sync s with anything, # so counterintuitively (even though we're in the same stream context as backward()!) # it is NOT SAFE to use grads here, and there's no easy way to make it safe, # unless you manually sync on all the streams you used in forward, # or move "use grads" back to default stream outside the context. use grads ``` mruberry ngimel and I decided backward() should have the [same user-facing stream semantics as any cuda op](https://pytorch.org/docs/master/notes/cuda.html#stream-semantics-of-backward-passes). In other words, the weird pattern should be unsafe, and the benign-looking pattern should be safe. Implementationwise, this meant backward() should sync its calling thread's current stream, not default stream, with the leaf streams. After https://github.com/pytorch/pytorch/pull/57833, backward syncs the calling thread's current stream AND default stream with all leaf streams at the end of backward. The default stream syncs were retained for temporary backward compatibility. This PR finishes https://github.com/pytorch/pytorch/pull/57833's work by deleting syncs on the default stream. With this PR, graph-capturing an entire backward() call should be possible (see the [test_graph_grad_scaling diffs](https://github.com/pytorch/pytorch/compare/master...mcarilli:streaming_backwards_remove_default_syncs?expand=1#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450R3641-R3642)). first paragraph has a formatting error which this PR should also fix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/60421 Reviewed By: albanD Differential Revision: D29370344 Pulled By: ngimel fbshipit-source-id: 3248bc5fb92fc517db0c15c897e5d7250f67d7fe	2021-06-24 17:34:02 -07:00
Luca Wehrstedt	bb9e1150ea	Revert D29342234: [pytorch][PR] [CUDA graphs][BC-breaking] Removes post-backward syncs on default stream Test Plan: revert-hammer Differential Revision: D29342234 (`675cea1adb`) Original commit changeset: 98e6be7fdd85 fbshipit-source-id: 84022973248b2254210eee57402df2c4f4bc43c6	2021-06-24 04:49:28 -07:00
Michael Carilli	675cea1adb	[CUDA graphs][BC-breaking] Removes post-backward syncs on default stream (#60421 ) Summary: Before https://github.com/pytorch/pytorch/pull/57833, calls to backward() or grad() synced only the calling thread's default stream with autograd leaf streams at the end of backward. This made the following weird pattern safe: ```python with torch.cuda.stream(s): # imagine forward used many streams, so backward leaf nodes may run on many streams loss.backward() # no sync use grads ``` but a more benign-looking pattern was unsafe: ```python with torch.cuda.stream(s): # imagine forward used a lot of streams, so backward leaf nodes may run on many streams loss.backward() # backward() syncs the default stream with all the leaf streams, but does not sync s with anything, # so counterintuitively (even though we're in the same stream context as backward()!) # it is NOT SAFE to use grads here, and there's no easy way to make it safe, # unless you manually sync on all the streams you used in forward, # or move "use grads" back to default stream outside the context. use grads ``` mruberry ngimel and I decided backward() should have the [same user-facing stream semantics as any cuda op](https://pytorch.org/docs/master/notes/cuda.html#stream-semantics-of-backward-passes). In other words, the weird pattern should be unsafe, and the benign-looking pattern should be safe. Implementationwise, this meant backward() should sync its calling thread's current stream, not default stream, with the leaf streams. After https://github.com/pytorch/pytorch/pull/57833, backward syncs the calling thread's current stream AND default stream with all leaf streams at the end of backward. The default stream syncs were retained for temporary backward compatibility. This PR finishes https://github.com/pytorch/pytorch/pull/57833's work by deleting syncs on the default stream. With this PR, graph-capturing an entire backward() call should be possible (see the [test_graph_grad_scaling diffs](https://github.com/pytorch/pytorch/compare/master...mcarilli:streaming_backwards_remove_default_syncs?expand=1#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450R3641-R3642)). first paragraph has a formatting error which this PR should also fix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/60421 Reviewed By: VitalyFedyunin, albanD Differential Revision: D29342234 Pulled By: ngimel fbshipit-source-id: 98e6be7fdd8550872f0a78f9a66cb8dfe75abf63	2021-06-23 23:35:24 -07:00
Michael Wootton	2f3be2735f	Don't split oversize cached blocks (#44742 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/35901 This change is designed to prevent fragmentation in the Caching Allocator. Permissive block splitting in the allocator allows very large blocks to be split into many pieces. Once split too finely it is unlikely all pieces will be 'free' at that same time so the original allocation can never be returned. Anecdotally, we've seen a model run out of memory failing to alloc a 50 MB block on a 32 GB card while the caching allocator is holding 13 GB of 'split free blocks' Approach: - Large blocks above a certain size are designated "oversize". This limit is currently set 1 decade above large, 200 MB - Oversize blocks can not be split - Oversize blocks must closely match the requested size (e.g. a 200 MB request will match an existing 205 MB block, but not a 300 MB block) - In lieu of splitting oversize blocks there is a mechanism to quickly free a single oversize block (to the system allocator) to allow an appropriate size block to be allocated. This will be activated under memory pressure and will prevent _release_cached_blocks()_ from triggering Initial performance tests show this is similar or quicker than the original strategy. Additional tests are ongoing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/44742 Reviewed By: zou3519 Differential Revision: D29186394 Pulled By: ezyang fbshipit-source-id: c88918836db3f51df59de6d1b3e03602ebe306a9	2021-06-21 11:46:08 -07:00

1 2

89 Commits