pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
sryap	32461ed319	Pool cudaEvents in CUDACachingAllocator (#78279) Summary: cudaEventCreate/Destroy can be expensive especially when the process is calling lots of other CUDA APIs. Pool the `cudaEvent_t` objects so that we create them once and reuse as much as possible. Test Plan: Unit tests to check the functionality. Manual performance testing shows that this diff is perf positive.
| | create_event_internal (us) | free_event_internal/destructor (us) | insert_events (us) | process_events (us) |
| --- | --- | --- | --- | --- |
| baseline | 2.411 | 2.647 | 3.968 | 0.321 |
| this diff | 0.115 | 0.147 | 2.846 | 0.262 |
| speed up | 20.9x | 18.0x | 1.4x | 1.2x |
Differential Revision: D35729059 Pull Request resolved: https://github.com/pytorch/pytorch/pull/78279 Approved by: https://github.com/jianyuh	2022-06-07 18:26:45 +00:00
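The pooling idea in the commit above can be illustrated with a minimal Python sketch. This is not the allocator's C++ code: `FakeEvent` is a hypothetical stand-in for `cudaEvent_t`, and `EventPool` with its `get`/`put` methods is an illustrative name chosen here. It only shows the pattern of creating an object once and handing it back out instead of destroying it.

```python
import threading


class FakeEvent:
    """Hypothetical stand-in for cudaEvent_t; counts how many were created."""

    created = 0

    def __init__(self):
        FakeEvent.created += 1


class EventPool:
    """Illustrative pool: reuse released events instead of recreating them."""

    def __init__(self):
        self._lock = threading.Lock()
        self._free = []

    def get(self):
        # Hand out a previously released event if one is available.
        with self._lock:
            if self._free:
                return self._free.pop()
        # Otherwise pay the creation cost once.
        return FakeEvent()

    def put(self, event):
        # Return the event to the pool rather than destroying it.
        with self._lock:
            self._free.append(event)


pool = EventPool()
first = pool.get()
pool.put(first)
second = pool.get()  # the same object comes back; nothing new is created
```

In this sketch the second `get` returns the object released by `put`, so only one event is ever constructed.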
Jaewon Lee	3e89a1d6b7	Disable GC if fraction is not set (#76648 ) Summary: If fraction is not set, don't trigger GC! In the current codebase, if you turn on the GC and do not set the fraction in the application, the GC will be triggered every time which does not make much sense -- perf will be as bad as turning off the caching allocator. With this fix, GC is invoked only when the fraction is set. Test Plan: Unit tests Differential Revision: D36026128 Pull Request resolved: https://github.com/pytorch/pytorch/pull/76648 Approved by: https://github.com/yinghai	2022-05-16 16:37:39 +00:00
Felipe Petroski Such	b0c5fba967	[CUDA Graphs] Fix OOM inside graph capture_begin release_cached_blocks calls this: ``` void synchronize_and_free_events() { TORCH_INTERNAL_ASSERT(captures_underway == 0); ``` Which means we can't call that function when we are capturing a cuda graph: ``` import torch with torch.cuda.graph(torch.cuda.CUDAGraph()): torch.zeros(2 ** 40, device="cuda") ``` results in: ``` RuntimeError: captures_underway == 0INTERNAL ASSERT FAILED at "/tmp/torch/c10/cuda/CUDACachingAllocator.cpp":1224, please report a bug to PyTorch. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/76247 Approved by: https://github.com/ngimel	2022-04-29 17:42:04 +00:00
Richard Barnes	2793cf85ec	Check all CUDA API calls for errors in caffe2/c10/ (#74918 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74918 Test Plan: Sandcastle Reviewed By: ngimel Differential Revision: D35194795 fbshipit-source-id: 8490e5497c37bab0055925ed520c2fd0c37a554c (cherry picked from commit 52697ab670e2f53c580cfd4ca82c5468ed3bb06c)	2022-03-30 17:13:02 +00:00
Richard Barnes	1249d490de	Add additional CUDA error handling macros (#74865 ) Summary: Introduces additional ways of handling CUDA errors that allow automated linters to detect if errors are being handled. Pull Request resolved: https://github.com/pytorch/pytorch/pull/74865 Test Plan: Sandcastle Reviewed By: ngimel Differential Revision: D35194530 fbshipit-source-id: f4fe61594edbfd81e97a4b605935961b893df167 (cherry picked from commit 919ddf677c5b9b46c5e493ed64346a5f2527bf08)	2022-03-29 18:03:03 +00:00
Jaewon Lee	11ea09effc	[CUDACachingAlloc/GPUInference] Implement garbage collection without GPU sync (#74261) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74261
### Goal
Implement a cheap way to reclaim GPU memory (garbage collection) without incurring a GPU sync.
### Why do we need this?
Currently, there are only two ways to reclaim a GPU memory block already assigned to a particular stream.
- `release_available_cached_blocks(params)`: Frees blocks exceeding `CachingAllocatorConfig::max_split_size()` until we can satisfy the request. Issue: If `max_split_size` is unset (default), this function is a no-op. Even if it is set, the reclamation is quite conservative (e.g., it never frees blocks under `max_split_size`).
- `release_cached_blocks()`: Waits for all in-flight events and then reclaims blocks. Issue: waiting for all events is very expensive, as it will likely stall all GPU operations. Many GPU applications without proper handling of potential GPU throttling would suffer or crash.
### Proposed idea
- If the garbage collection threshold is set, try to reclaim some memory blocks without synchronization. It should be safe to do so, as `release_available_cached_blocks` essentially does the same thing (but less aggressively).
- GC is triggered only when we fail to serve a `malloc` request from the block pool. There is no need to free blocks when the block pool is functioning just fine.
- Prioritize reclaiming blocks that haven't been reused for a long time. Reclamation stops once the used memory capacity < threshold.
- This code path is totally optional; by default it won't be invoked.
Test Plan:
- Unit tests
- Manually checked that the GPU memory usage stays as indicated by the garbage collector. If not, the caching allocator at least tries to keep freeing the blocks.
Reviewed By: jianyuh Differential Revision: D34482514 fbshipit-source-id: d5eae62ac60b94b0bca851f9d233a092d086e3c2 (cherry picked from commit 05780f1ed4b176f05e765b2411c9eaa2eaeb48b0)	2022-03-21 18:46:02 +00:00
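A rough Python sketch of the reclamation policy described in the commit above: free the longest-idle cached blocks first and stop once usage drops below the threshold. The names `CachedBlock`, `idle_rounds`, and `garbage_collect` are hypothetical, and the bookkeeping is an assumption made for illustration; the allocator's real data structures are not shown in the commit message.

```python
from dataclasses import dataclass


@dataclass
class CachedBlock:
    size: int         # bytes held by this cached (currently unused) block
    idle_rounds: int  # hypothetical age counter: larger means idle for longer


def garbage_collect(cached, in_use_bytes, capacity_bytes, threshold_fraction):
    """Free the longest-idle blocks until usage falls below the threshold.

    Returns the blocks that stay cached and the resulting usage in bytes.
    """
    limit = capacity_bytes * threshold_fraction
    used = in_use_bytes + sum(block.size for block in cached)
    kept = []
    # Oldest blocks first, matching "weren't reused for long time".
    for block in sorted(cached, key=lambda b: b.idle_rounds, reverse=True):
        if used >= limit:
            used -= block.size  # reclaim this block, no synchronization modeled
        else:
            kept.append(block)  # already below the threshold, stop reclaiming
    return kept, used


blocks = [CachedBlock(100, 5), CachedBlock(200, 1), CachedBlock(50, 9)]
kept, used = garbage_collect(
    blocks, in_use_bytes=500, capacity_bytes=1000, threshold_fraction=0.75
)
```

With these made-up numbers the two oldest blocks are reclaimed and the most recently used one stays cached.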
Banit Agrawal	ac3effd150	[PyTorch GPU Allocator] Better use of blocks with rounding of allocation sizes (#74213) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74213 In the current CUDACachingAllocator, sizes are rounded up to a multiple of the block size of 512, which works for smaller sizes. However, for large sizes, we can end up with many different block sizes in the larger pool. This is problematic when we have variable batch sizes such as 1001, 1021, 1023: each goes to a different block size and creates blocks of different sizes. This leaves many blocks unused and wastes GPU memory capacity. This diff adds a rounding approach to the allocation size. It rounds the size up to the nearest power-of-2 division, and the number of divisions can be changed with an env variable setting. For example, if we need to round up a size of 1200 and the number of divisions is 4, the size 1200 lies between 1024 and 2048, and the 4 divisions between them are 1024, 1280, 1536, and 1792. So the function will return 1280 as the nearest ceiling of the power-2 division. env setting: export PYTORCH_CUDA_ALLOC_CONF=roundup_power2_divisions:4 ghstack-source-id: 151446017 Reviewed By: ezyang Differential Revision: D34868036 fbshipit-source-id: 494785add16e6b37c920dcb5a2b81d4c637b554a (cherry picked from commit 548454ccacbd8700e7ffd2d762e40b4ba37abbae)	2022-03-16 02:53:53 +00:00
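The rounding rule in the commit above can be sketched in a few lines of Python. This is an illustration of the arithmetic only, not the allocator's C++ implementation; the function name and the handling of very small sizes (falling back to the next power of two when the step would be zero) are assumptions of this sketch.

```python
def roundup_power2_divisions(size: int, divisions: int) -> int:
    """Round `size` up to the next of `divisions` equal steps between the
    enclosing powers of two (illustrative sketch)."""
    if size <= 0 or divisions <= 0:
        raise ValueError("size and divisions must be positive")
    if size & (size - 1) == 0:
        return size  # already a power of two
    lower = 1 << (size.bit_length() - 1)  # largest power of two below size
    step = lower // divisions             # spacing between division points
    if step == 0:
        return lower * 2  # assumption: tiny sizes go to the next power of two
    return -(-size // step) * step        # ceiling to a multiple of step


# Worked example from the commit message: 1200 with 4 divisions.
# Division points between 1024 and 2048 are 1024, 1280, 1536, 1792.
rounded = roundup_power2_divisions(1200, 4)
```

Because the step (256 here) divides the lower power of two, rounding up to a multiple of the step is the same as rounding up to the next division point.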
Joe	c4af6ba173	Show friendly error message when forgetting `init` in `torch.cuda` (#72404 ) Summary: # Problem The error message `RuntimeError: Invalid device argument` is not friendly when users just forget calling `torch.cuda.init()`. This error message is shown for example by calling `torch.cuda.reset_accumulated_memory_stats`, or other methods which internally calls [assertValidDevice](`6297aa114f/c10/cuda/CUDACachingAllocator.cpp (L1561-L1566)`). # Reproduce ```python $ python Python 3.8.6 (default, Apr 1 2021, 08:23:31) [GCC 7.5.0] on linux Type "help", "copyright", "credits" or "license" for more information. >>> import torch.cuda >>> torch.cuda.reset_accumulated_memory_stats(0) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/usr/local/lib/python3.8/site-packages/torch/cuda/memory.py", line 219, in reset_accumulated_memory_stats return torch._C._cuda_resetAccumulatedMemoryStats(device) RuntimeError: Invalid device argument. >>> torch.cuda.current_device() 0 ``` # This PR Shows better error message like `RuntimeError: Invalid device argument 0: did you call init?`. I cited the error message from `6297aa114f/c10/cuda/CUDACachingAllocator.cpp (L1392-L1396)`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/72404 Reviewed By: mruberry Differential Revision: D34063268 Pulled By: ngimel fbshipit-source-id: 0775d9c83a4a0eb0eb41bf6efecca94a00692141 (cherry picked from commit `07a1a3d0b4`)	2022-02-08 22:52:26 +00:00
mikey dagitses	90458004cb	move //c10/cuda/test to shared build structure (#71429 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71429 Note that this was untested in OSS Bazel. ghstack-source-id: 148159363 Test Plan: Tested locally. Rely on CI to validate. Reviewed By: malfet Differential Revision: D33638407 fbshipit-source-id: 12ae383ccadc1375b92d9c6a12d43821e48f9dcb (cherry picked from commit `12be8c195c`)	2022-02-03 22:33:41 +00:00
mikey dagitses	6d9c0073a8	create //c10/cuda library (#70863 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70863 ghstack-source-id: 148159368 Test Plan: Ought to be a no-op: rely on CI to validate. Reviewed By: malfet Differential Revision: D33367290 fbshipit-source-id: cb550538b9eafaa0117f94077ebd4cb920688881 (cherry picked from commit `077d9578bc`)	2022-02-03 19:17:18 +00:00
CodemodService FBSourceClangFormatLinterBot	14538fa7bf	[AutoAccept][Codemod][FBSourceClangFormatLinter] Daily `arc lint --take CLANGFORMAT` Reviewed By: zertosh Differential Revision: D33962464 fbshipit-source-id: a8f0633dbd3fcb26b68e3d48886d520a46eea631 (cherry picked from commit `85f819baa3`)	2022-02-03 04:02:37 +00:00
Nelson Elhage	c585d35463	CUDACachingAllocator: Keep one event queue per stream (#71745 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/71616 This fixes the leaks in my test case. I have not tested it on our big models yet, but will report back if we can. This potentially impacts allocator performance in that it slightly increases the amount of CPU memory we allocate for data structures, and it means that `process_events` may look at a larger number of events in the case where there are multiple streams with long-running ops on them. However, I suspect that in general, either: - An application isn't using very many streams or very many long-running ops, in which case the performance is essentially the same - Or, they are, which is precisely the case where https://github.com/pytorch/pytorch/issues/71616 bites you, and so freeing memory faster is probably more valuable than the slight CPU overhead here. I'm not attached to this approach or any of its details, but figured it was worth throwing up for discussion. Pull Request resolved: https://github.com/pytorch/pytorch/pull/71745 Reviewed By: soulitzer Differential Revision: D33948288 Pulled By: ngimel fbshipit-source-id: 73e95f8a9bbe385a77de483d1c58b857b5d84e81 (cherry picked from commit `d233719c07`)	2022-02-03 01:35:19 +00:00
Scott Wolchok	4aade95029	[PyTorch] Rework stat collection in CUDACachingAllocator (#71669 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71669 This was relatively inefficient. Rather than looping for each type of stat we want to update, we now do one loop covering all the stats. ghstack-source-id: 148013645 Reviewed By: ngimel Differential Revision: D33725458 fbshipit-source-id: 39ef5d65a73d4ef67f259de8c02c7df29487d990 (cherry picked from commit `7ca46689b7`)	2022-02-01 17:24:51 +00:00
Scott Wolchok	ca2ff12ea3	[PyTorch] Remove call_once from CUDACachingAllocator (#71668 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71668 As https://en.cppreference.com/w/cpp/thread/call_once mentions, function-local statics are probably more efficient. ghstack-source-id: 148013646 Reviewed By: ngimel Differential Revision: D33722954 fbshipit-source-id: a2737c2d6dfdd23b26cbe34574b80e3da0d4b8a4 (cherry picked from commit `a6ddb24558`)	2022-02-01 17:24:51 +00:00
Scott Wolchok	da0423aa0b	[PyTorch] Use a better hash table in CUDACachingAllocator (#71667 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71667 We have flat_hash_set because it performs better than std::unordered_set. ghstack-source-id: 148013648 Reviewed By: ngimel Differential Revision: D33720595 fbshipit-source-id: aa6077c474dd6fc61ce17e24ebde4056c8bae361 (cherry picked from commit `386082eaf1`)	2022-02-01 17:24:51 +00:00
Michael Dagitses	661d10aab4	use c10/macros/cmake_macros.h in fbcode build (#70851 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70851 This is a step towards OSS/fbcode convergence since OSS uses this file in both CMake and Bazel. ghstack-source-id: 147170896 Test Plan: Relying on the extensive CI internal tests for this. Reviewed By: malfet Differential Revision: D33299102 fbshipit-source-id: c650dd4755f8d696d5fce81c583d5c73782e3990 (cherry picked from commit `741ca140c8`)	2022-01-19 20:56:12 +00:00
Richard Barnes	11aa1961c1	Use (void)error_unused to avoid unused warning (#71000 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71000 Test Plan: Sandcastle Reviewed By: ngimel Differential Revision: D33470600 fbshipit-source-id: 868a6ee33a04846bd1efbe06ab306fbaad3bf9db	2022-01-07 23:39:30 -08:00
Nelson Elhage	a813ddf5ec	CUDACachingAllocator: make an error message more accurate. (#69174 ) Summary: The `TORCH_CHECK` asserts for strictly-greater-than `kLargeBuffer`, but the exception claims `>=`. Fix the error message to match the code. Happy to open an issue if it's helpful; I was hopeful the trivial fix doesn't need a separate issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/69174 Reviewed By: zou3519 Differential Revision: D32760055 Pulled By: H-Huang fbshipit-source-id: 1a8ab68f36b326ed62d78afdcb198f4d6572d017	2021-12-03 15:04:59 -08:00
Nikita Shulga	c373387709	Update CMake and use native CUDA language support (#62445 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62445 PyTorch currently uses the old style of compiling CUDA in CMake which is just a bunch of scripts in `FindCUDA.cmake`. Newer versions support CUDA natively as a language just like C++ or C. Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D31503350 fbshipit-source-id: 2ee817edc9698531ae1b87eda3ad271ee459fd55	2021-10-11 09:05:48 -07:00
Luca Wehrstedt	bc06eefebe	[reland] Allow external CUDA streams to be set as current (#66324 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66324 Fixes https://github.com/pytorch/pytorch/issues/65822. Reland of https://github.com/pytorch/pytorch/pull/65914. ghstack-source-id: 140105651 Test Plan: Added tests Reviewed By: ngimel Differential Revision: D31506134 fbshipit-source-id: ff56203a120befdb282e974309478ac11aa56652	2021-10-11 02:41:43 -07:00
Luca Wehrstedt	201174cb91	Revert D31389480: [pytorch][PR] Allow external CUDA streams to be set as current Test Plan: revert-hammer Differential Revision: D31389480 (`61f0bb70c1`) Original commit changeset: 2b2f40e5452c fbshipit-source-id: c6631e51abcf3819732f981f646cb77b91569c7d	2021-10-08 09:20:24 -07:00
Luca Wehrstedt	61f0bb70c1	Allow external CUDA streams to be set as current (#65914 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/65822. Pull Request resolved: https://github.com/pytorch/pytorch/pull/65914 Reviewed By: dagitses Differential Revision: D31389480 Pulled By: lw fbshipit-source-id: 2b2f40e5452c5b2a0b9f0f705750d2aa9deb2ead	2021-10-08 06:09:32 -07:00
Pruthvi Madugundu	085e2f7bdd	[ROCm] Changes not to rely on CUDA_VERSION or HIP_VERSION (#65610) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65610 - Replace HIP_PLATFORM_HCC with USE_ROCM - Don't rely on CUDA_VERSION or HIP_VERSION and use USE_ROCM and ROCM_VERSION. - In the next PR - Will be removing the mapping from CUDA_VERSION to HIP_VERSION and CUDA to HIP in hipify. - HIP_PLATFORM_HCC is deprecated, so will add HIP_PLATFORM_AMD to support HIP host code compilation on gcc. cc jeffdaily sunway513 jithunnair-amd ROCmSupport amathews-amd Reviewed By: jbschlosser Differential Revision: D30909053 Pulled By: ezyang fbshipit-source-id: 224a966ebf1aaec79beccbbd686fdf3d49267e06	2021-09-29 09:55:43 -07:00
Michael Carilli	8d08b103be	[CUDA graphs] Prototype API and documentation (#63269 ) Summary: RFC: https://github.com/pytorch/pytorch/issues/61880 Pull Request resolved: https://github.com/pytorch/pytorch/pull/63269 Reviewed By: mruberry Differential Revision: D30596643 Pulled By: ngimel fbshipit-source-id: b1f8061406364b667e2c2d4d30fbce1f0d8456be	2021-08-31 13:34:23 -07:00
Pruthvi Madugundu	ab7a472980	[ROCm] Update HIP_VERSION to TORCH_HIP_VERSION (#62786 ) Summary: - HIP_VERSION semantic versioning will change in ROCm4.3. The changes essentially remove the dependency on HIP_VERSION provided in the hip header to keep code compatible with older and newer versions of ROCm. - TORCH_HIP_VERSION is derived from HIP_VERSION_MAJOR and HIP_VERSION_MINOR Pull Request resolved: https://github.com/pytorch/pytorch/pull/62786 Reviewed By: bdhirsh Differential Revision: D30281682 Pulled By: seemethere fbshipit-source-id: e41e69fb9e13de5ddd1af99ba5bbdcbb7b64b673	2021-08-13 15:00:43 -07:00
Han Guangyun	8bbcef5096	Report more information for memory profiling (#61282 ) Summary: Report pointed memory size, total allocated memory, total reserved size all in one report. `ptr` and `alloc_size` will be used for associating with op trace. `allocated_size`, `reserved_size` will be used for memory trace. Pull Request resolved: https://github.com/pytorch/pytorch/pull/61282 Reviewed By: ejguan Differential Revision: D29796282 Pulled By: chaekit fbshipit-source-id: 5314c867632d3af1fa9a3811b35eaa5e931a5d87	2021-08-04 15:03:14 -07:00
Jeff Daily	b7391f44df	cast return of cudaGetLastError() to void when discarding (#62518 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/62511. Pull Request resolved: https://github.com/pytorch/pytorch/pull/62518 Reviewed By: walterddr, janeyx99 Differential Revision: D30029858 Pulled By: malfet fbshipit-source-id: d47ce4e507ac800b4e5a5e0a8d9a6fabdfd28e6d	2021-08-03 11:17:22 -07:00
Natalia Gimelshein	d783617216	enable warnings on cuda synchronization (#62092) Summary: This creates a `torch.cuda.set_warn_on_synchronization()` function that warns or errors when a synchronizing operation is performed. We could wrap it in a context manager for ease of use, but it would be a lie, because it sets global, not thread-local, state. Since it's intended for debugging, maybe that's ok though. Like all `torch.cuda.*` functions, it goes through CPython, not pybind, so the argument is converted to long before being passed to the c10 function. I'll make the Python argument a Python enum class, but without pybind it will still have to go through the long conversion. For a test script
```
import torch
torch.cuda.set_warn_on_synchronization(1)
x=torch.randn(10, device="cuda")
x.nonzero()
y=torch.randn((), device="cuda")
if y:
    print("something")
torch.multinomial(x.abs(), 10, replacement=False)
torch.randperm(20000, device="cuda")
ind = torch.randint(10, (3,), device="cuda")
mask = torch.randint(2, (10,), device="cuda", dtype=torch.bool)
val = torch.randn((), device="cuda")
x[mask]=1.
x[mask] = val
torch.cuda.synchronize()
```
the output is
```
/../playground/sync_warn_test.py:4: UserWarning: called a synchronizing operation (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:145.)
  x.nonzero()
/../playground/sync_warn_test.py:7: UserWarning: called a synchronizing operation (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:145.)
  if y:
something
/../playground/sync_warn_test.py:9: UserWarning: called a synchronizing operation (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:145.)
  torch.multinomial(x.abs(), 10, replacement=False)
/../playground/sync_warn_test.py:15: UserWarning: called a synchronizing operation (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:145.)
  x[mask] = val
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62092 Reviewed By: mruberry Differential Revision: D29968792 Pulled By: ngimel fbshipit-source-id: cc6f817212c164727ed99ecf6ab050dc29631b9e	2021-07-30 09:13:01 -07:00
Natalia Gimelshein	6284d2a82b	wrap cudaStreamSynchronize calls (#61889 ) Summary: This is a first step towards creating context manager that errors out on synchronizing calls. Pull Request resolved: https://github.com/pytorch/pytorch/pull/61889 Reviewed By: albanD Differential Revision: D29805280 Pulled By: ngimel fbshipit-source-id: b66400fbe0941b7daa51e6b30abe27b9cccd4e8a	2021-07-21 19:30:52 -07:00
Michael Carilli	ffd2e602f4	[CUDA graphs] Make sure graph mempool cudaMalloc_count decrement pairs with cudaFree for all allocations (#61567 ) Summary: Graphs mempools aren't deleted until all their allocations are cudaFreed. `PrivatePool::cudaMalloc_count` tracks the number of outstanding (not-yet-cudaFreed) allocations. https://github.com/pytorch/pytorch/pull/44742 moves cudaFree to [release_block](https://github.com/pytorch/pytorch/pull/44742/files#diff-acc6337586bf9cdcf0a684380779300ec171897d05b8569bf439820dc8c93bd5R1160), while the `cudaMalloc_count` decrement (if needed) remains in a caller ([release_blocks](https://github.com/pytorch/pytorch/pull/44742/files#diff-acc6337586bf9cdcf0a684380779300ec171897d05b8569bf439820dc8c93bd5R1177)). But I noticed there's also a path ([release_available_cached_blocks](https://github.com/pytorch/pytorch/pull/44742/files#diff-acc6337586bf9cdcf0a684380779300ec171897d05b8569bf439820dc8c93bd5R1094)) that calls `release_block` without calling `release_blocks`, in other words, it calls cudaFree but dodges any potential `cudaMalloc_count` decrement. In practice, the way the code is currently organized, I don't _think_ this second path can cause the pool to become a zombie whose `cudaMalloc_count` will never reach zero (I think this could only happen if you call `release_available_cached_blocks` on a private pool, and the only way it would be called on a private pool is if capture is underway, and if capture is underway, the cudaFree call will hard error). Regardless, I feel much more comfortable keeping the cudaMalloc_count decrement right next to the cudaFree. Pull Request resolved: https://github.com/pytorch/pytorch/pull/61567 Reviewed By: mrshenli Differential Revision: D29765198 Pulled By: ezyang fbshipit-source-id: bcbeed656c3e0d101112aa470d8a098c73a011b1	2021-07-19 19:22:18 -07:00
Jeff Daily	15210f3b82	ignore and clear not ready errors (#61554 ) Summary: Follow-up to https://github.com/pytorch/pytorch/issues/18584. This PR covers the remaining places where event or stream query might result in not ready errors. Pull Request resolved: https://github.com/pytorch/pytorch/pull/61554 Reviewed By: mrshenli Differential Revision: D29763973 Pulled By: ezyang fbshipit-source-id: 41d988d1826b2309cc6b01a81144094b353abdf9	2021-07-19 16:03:04 -07:00
cyy	00c4897c51	use make_unique (#61272 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61272 Reviewed By: pbelevich Differential Revision: D29660354 Pulled By: ezyang fbshipit-source-id: f0aba1ea6983aec415915ed9b7dbced2e2b3b171	2021-07-12 08:09:46 -07:00
Nikita Shulga	635d864b26	Fix modernize-use-equals-default nolint failures in torch/csrcs (#61142 ) Summary: Test-plan: Compile + clang-tidy Pull Request resolved: https://github.com/pytorch/pytorch/pull/61142 Reviewed By: VitalyFedyunin Differential Revision: D29529372 Pulled By: malfet fbshipit-source-id: 2ccde7712a51c28243b16bbb4d1d68086e0414a6	2021-07-06 09:46:46 -07:00
Michael Wootton	2f3be2735f	Don't split oversize cached blocks (#44742) Summary: Fixes https://github.com/pytorch/pytorch/issues/35901 This change is designed to prevent fragmentation in the Caching Allocator. Permissive block splitting in the allocator allows very large blocks to be split into many pieces. Once split too finely, it is unlikely that all pieces will be 'free' at the same time, so the original allocation can never be returned. Anecdotally, we've seen a model run out of memory failing to alloc a 50 MB block on a 32 GB card while the caching allocator is holding 13 GB of 'split free blocks'. Approach:
- Large blocks above a certain size are designated "oversize". This limit is currently set 1 decade above large, at 200 MB
- Oversize blocks cannot be split
- Oversize blocks must closely match the requested size (e.g. a 200 MB request will match an existing 205 MB block, but not a 300 MB block)
- In lieu of splitting oversize blocks, there is a mechanism to quickly free a single oversize block (to the system allocator) to allow an appropriate size block to be allocated. This will be activated under memory pressure and will prevent _release_cached_blocks()_ from triggering
Initial performance tests show this is similar or quicker than the original strategy. Additional tests are ongoing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/44742 Reviewed By: zou3519 Differential Revision: D29186394 Pulled By: ezyang fbshipit-source-id: c88918836db3f51df59de6d1b3e03602ebe306a9	2021-06-21 11:46:08 -07:00
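The matching rule for oversize blocks described in the commit above might look roughly like the following Python sketch. The 200 MB limit comes from the commit message; the 20 MB slack (`MATCH_SLACK`) and the function name `can_reuse` are assumptions made purely for illustration, since the message only says that sizes must "closely match".

```python
MB = 1024 * 1024
OVERSIZE_THRESHOLD = 200 * MB  # limit quoted in the commit message
MATCH_SLACK = 20 * MB          # hypothetical tolerance, not from the commit


def can_reuse(block_size: int, request_size: int) -> bool:
    """Decide whether a cached block may serve a request (illustrative)."""
    if block_size < request_size:
        return False  # block is too small
    if block_size < OVERSIZE_THRESHOLD:
        return True   # regular blocks may be split to fit the request
    # Oversize blocks are never split, so only accept a close size match.
    return block_size - request_size < MATCH_SLACK


close_match = can_reuse(205 * MB, 200 * MB)  # 205 MB block, 200 MB request
loose_match = can_reuse(300 * MB, 200 * MB)  # 300 MB block, 200 MB request
```

Under these assumed constants the sketch reproduces the example in the message: the 205 MB block is accepted for a 200 MB request and the 300 MB block is not.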
Emilio Castillo	f9ec86a6c6	External stream (#59527) Summary: The previous PR is https://github.com/pytorch/pytorch/issues/57781. We now add two CUDA bindings that avoid using ctypes, to fix a Windows issue. However, we use ctypes to allocate the stream and create its pointer (we can do this with a 0-dim tensor too if it feels better). CC. ezyang rgommers ngimel mruberry Pull Request resolved: https://github.com/pytorch/pytorch/pull/59527 Reviewed By: albanD Differential Revision: D29053062 Pulled By: ezyang fbshipit-source-id: 661e7e58de98b1bdb7a0871808cd41d91fe8f13f	2021-06-14 13:46:11 -07:00
Richard Barnes	10a3a3d363	Fix bad change in a CUDACachingAllocator loop (#59903 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59903 D29034650 (`cf0c4ac258`) probably breaks something because it changes a `for` loop on ~Line 1200 from `[size,max)` to `[0,max)`. This fixes that Test Plan: Sandcastle Reviewed By: ngimel Differential Revision: D29081688 fbshipit-source-id: 21f08e3f244fc02cf97d137b3cc80d4378d17185	2021-06-11 18:20:07 -07:00
Richard Barnes	60eb22e45e	Build an -Wextra around c10 (#59853 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59853 Test Plan: Sandcastle Reviewed By: ngimel Differential Revision: D29016682 fbshipit-source-id: f6c5f32464d57dbd60b59b5f9e2234ef2c39f1c1	2021-06-11 16:12:21 -07:00
Richard Barnes	cf0c4ac258	Fix some issues in CUDACachingAllocator (#59819 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59819 Test Plan: Sandcastle Reviewed By: ngimel Differential Revision: D29034650 fbshipit-source-id: 7e9689fc1ae121432e9421fa4a9ae00f7f78caca	2021-06-11 13:15:27 -07:00
Luca Wehrstedt	e7cccc23b9	Add query and synchronize to c10::Stream (#59560 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59560 `at::cuda::CUDAStream` has the `query` and `synchronize` methods, but `c10::Stream` does not, and I couldn't find any generic way to accomplish this. Hence I added helpers to do this to the DeviceGuardImpl interface, and then defined these methods on `c10::Stream`. (I had to do it out-of-line to circumvent a circular dependency). ghstack-source-id: 130932249 Test Plan: CI Reviewed By: ezyang Differential Revision: D28931377 fbshipit-source-id: cd0c19cf021e305d0c0cf9af364afb445d010248	2021-06-10 01:42:40 -07:00
Nikita Shulga	d125694d0b	Move CUDA async warning to suffix (#59467 ) Summary: After the change async error warnings look as follows: ``` $ python -c "import torch;torch.eye(3,3,device='cuda:777')" Traceback (most recent call last): File "<string>", line 1, in <module> RuntimeError: CUDA error: invalid device ordinal CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/59467 Reviewed By: ngimel Differential Revision: D28904360 Pulled By: malfet fbshipit-source-id: 2a8fa5affed5b4ffcaa602c8ab2669061cde7db0	2021-06-04 17:26:28 -07:00
Rong Rong (AI Infra)	689a5edd0a	Revert D28326365: [pytorch][PR] Add `torch.cuda.streams.ExternalStream` Test Plan: revert-hammer Differential Revision: D28326365 (`d7ef9b73fb`) Original commit changeset: b67858c80339 fbshipit-source-id: 337588d40b96cf04e46e554fa481ae7fd4254478	2021-06-04 11:19:36 -07:00
Emilio Castillo	d7ef9b73fb	Add `torch.cuda.streams.ExternalStream` (#57781 ) Summary: This is required in https://github.com/pytorch/pytorch/pull/57110#issuecomment-828357947 We need to provide means to synchronize on externally allocated streams for dlpack support in python array data api. cc mruberry rgommers leofang asi1024 kmaehashi Pull Request resolved: https://github.com/pytorch/pytorch/pull/57781 Reviewed By: mrshenli Differential Revision: D28326365 Pulled By: ezyang fbshipit-source-id: b67858c8033949951b49a3d319f649884dfd0a91	2021-06-04 08:47:09 -07:00
Atul Jangra	3948ce2fd9	[Caffe2] Introduce c10::CudaError for CUDA Exceptions (#57609 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57609 Throw c10::CudaError for CUDA Exceptions for better classification of errors Test Plan: Test locally by running some workflows Reviewed By: dzhulgakov Differential Revision: D28209356 fbshipit-source-id: 19a5fc8548433238dc224ea81a5f63a945fc5cc3	2021-05-06 14:28:45 -07:00
Gao, Xiang	db7b31358f	Fix internal assert in CUDA caching allocator when trying to allocate ~2^64 memory (#57571 ) Summary: When the memory requested is huge, some internal logic in CUDA caching allocator could overflow. The result of the overflow is the caching allocator gives a confusing error message. For example: ```python import torch import torch.nn as nn from torch.utils import cpp_extension cuda_source = """ #include <c10/cuda/CUDACachingAllocator.h> void my_fun(void) { size_t temp_storage_bytes = 18446744073708433663UL; auto& caching_allocator = ::c10::cuda::CUDACachingAllocator::get(); auto temp_storage = caching_allocator.allocate(temp_storage_bytes); return; } """ cpp_source = """ void my_fun(void); """ module = torch.utils.cpp_extension.load_inline( name="cuda_test_extension", cpp_sources=cpp_source, cuda_sources=cuda_source, functions="my_fun", extra_cuda_cflags=["--extended-lambda"], verbose=True, ) module.my_fun() print('done') ``` gives ``` Traceback (most recent call last): File "/home/gaoxiang/misc/caching-allocator.py", line 26, in <module> module.my_fun() RuntimeError: p.block != nullptr && p.block->ptr != nullptrINTERNAL ASSERT FAILED at "../c10/cuda/CUDACachingAllocator.cpp":991, please report a bug to PyTorch. 
Exception raised from alloc_block at ../c10/cuda/CUDACachingAllocator.cpp:991 (most recent call first): frame #0: <unknown function> + 0x83e93 (0x7f424f05ee93 in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10.so) frame https://github.com/pytorch/pytorch/issues/1: <unknown function> + 0x83bf9 (0x7f424f05ebf9 in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10.so) frame https://github.com/pytorch/pytorch/issues/2: <unknown function> + 0x839bd (0x7f424f05e9bd in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10.so) frame https://github.com/pytorch/pytorch/issues/3: std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>::operator()() const + 0x4c (0x7f428a3350a2 in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so) frame https://github.com/pytorch/pytorch/issues/4: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x40 (0x7f424f05dc34 in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10.so) frame https://github.com/pytorch/pytorch/issues/5: c10::detail::torchCheckFail(char const, char const, unsigned int, char const) + 0x97 (0x7f424f05c42f in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10.so) frame https://github.com/pytorch/pytorch/issues/6: <unknown function> + 0x6948b4 (0x7f42978fd8b4 in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libtorch_python.so) frame https://github.com/pytorch/pytorch/issues/7: <unknown function> + 0x22373 (0x7f424f0e2373 in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10_cuda.so) frame https://github.com/pytorch/pytorch/issues/8: <unknown function> + 0x1fa6c (0x7f424f0dfa6c in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10_cuda.so) frame https://github.com/pytorch/pytorch/issues/9: <unknown function> + 0x2337a (0x7f424f0e337a in 
/home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10_cuda.so) frame #10: <unknown function> + 0x23f18 (0x7f424f0e3f18 in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10_cuda.so) frame #11: my_fun() + 0x4b (0x7f4200338f74 in /home/gaoxiang/.cache/torch_extensions/cuda_test_extension/cuda_test_extension.so) frame #12: torch::detail::wrap_pybind_function_impl_<void (&)()>(void (&)(), std::integer_sequence<unsigned long>)::{lambda()#1}::operator()() const + 0x3f (0x7f420031e575 in /home/gaoxiang/.cache/torch_extensions/cuda_test_extension/cuda_test_extension.so) frame #13: <unknown function> + 0x570f2 (0x7f42003350f2 in /home/gaoxiang/.cache/torch_extensions/cuda_test_extension/cuda_test_extension.so) frame #14: <unknown function> + 0x536e2 (0x7f42003316e2 in /home/gaoxiang/.cache/torch_extensions/cuda_test_extension/cuda_test_extension.so) frame #15: <unknown function> + 0x4ef2f (0x7f420032cf2f in /home/gaoxiang/.cache/torch_extensions/cuda_test_extension/cuda_test_extension.so) frame #16: <unknown function> + 0x4ef93 (0x7f420032cf93 in /home/gaoxiang/.cache/torch_extensions/cuda_test_extension/cuda_test_extension.so) frame #17: <unknown function> + 0x3e7f2 (0x7f420031c7f2 in /home/gaoxiang/.cache/torch_extensions/cuda_test_extension/cuda_test_extension.so) <omitting python frames> frame #30: __libc_start_main + 0xd5 (0x7f42c60bab25 in /usr/lib/libc.so.6) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/57571 Reviewed By: VitalyFedyunin Differential Revision: D28224574 Pulled By: ezyang fbshipit-source-id: 
df440961f6eaf58048af36ae2a06c59f3c18baec	2021-05-06 01:36:58 -07:00
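The overflow described in the entry above can be illustrated with a small sketch. It emulates 64-bit unsigned arithmetic in Python and rounds a request up to a multiple of a block size. The 2 MiB granularity and the function names `round_up_wrapping` and `round_up_checked` are assumptions chosen for illustration; they are not taken from the commit or from the allocator source.

```python
MASK64 = (1 << 64) - 1       # emulate a 64-bit unsigned size_t
ROUND_TO = 2 * 1024 * 1024   # hypothetical rounding granularity (2 MiB)


def round_up_wrapping(size: int, multiple: int = ROUND_TO) -> int:
    """Round up the way unchecked size_t arithmetic would (may wrap around)."""
    bumped = (size + multiple - 1) & MASK64   # this addition can wrap past 2**64
    return (multiple * (bumped // multiple)) & MASK64


def round_up_checked(size: int, multiple: int = ROUND_TO) -> int:
    """Round up, but refuse requests whose rounded size cannot fit in 64 bits."""
    if size > MASK64 - (multiple - 1):
        raise OverflowError("requested size is too large to round up")
    return multiple * ((size + multiple - 1) // multiple)


requested = 18446744073708433663   # the size used in the commit message
print(round_up_wrapping(requested))   # wraps around to 0
```

With the wrapped value, a huge request turns into a tiny one, which is one plausible way an allocator could end up with a confusing internal assertion rather than a clear out-of-memory error. The checked variant rejects the request up front.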
Luca Wehrstedt	0c3e79b5b9	Rename DeviceGuardImplInterface's getStreamFromPool method (#57345 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57345 Already back in https://github.com/pytorch/pytorch/pull/57046 we realized that calling this method `getStreamFromPool` could cause issues because that name gets HIPified and thus in some callsites we'd end up calling a method that doesn't exist. In the end we got away with it because the places where we were calling that method weren't HIPified. However in the next PR we'll use this method inside RPC, and that will start causing problems, hence here I rename it to something that should not cause conflicts. This is a private API (since it's inside `impl`) thus there are no backwards-compatibility concerns. ghstack-source-id: 127916484 Test Plan: CI Reviewed By: mrshenli Differential Revision: D28114923 fbshipit-source-id: e027ad08a8e02090c08c6407c2db5a7fde104812	2021-05-01 16:12:53 -07:00
Scott Wolchok	44cc873fba	[PyTorch] Autoformat c10 (#56830 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56830 Opt into formatting on GitHub and format everything. This is a trial run before turning on formatting for more and eventually all of the codebase. Test Plan: CI Reviewed By: zertosh Differential Revision: D27979080 fbshipit-source-id: a80f0c48691c08ae8ca0af06377b87e6a2351151	2021-04-30 21:23:28 -07:00
Luca Wehrstedt	682476022f	Introduce generic MultiStreamGuard (#57049 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57049 There was a comment above CUDAMultiStreamGuard which said "TODO: Implement this generically in c10". This is what I'm doing here. The new generic MultiStreamGuard class is able to take a vector of device-agnostic c10::Streams and is able to support any device type (CUDA, but also ROCm and others) by using a VirtualGuardImpl. A class called CUDAMultiStreamGuard is still kept around, for convenience, and slightly for performance as it avoids a vtable lookup. ghstack-source-id: 127713139 (Note: this ignores all push blocking failures!) Test Plan: CI Reviewed By: mrshenli Differential Revision: D28029158 fbshipit-source-id: 2f3181371f8cb0d77a3b2e6aa510f1dd74e8f69b	2021-04-29 09:31:47 -07:00
Luca Wehrstedt	ea64c90ecc	Add recordDataPtrOnStream to DeviceGuardImplInterface (#57047 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57047 We intend to merge CUDAFuture into ivalue::Future by using DeviceGuardImplInterface to avoid explicitly referring to CUDA. For that we need to add two methods to DeviceGuardImplInterface. In this PR, we add a method to record a DataPtr onto a stream with the caching allocator. ghstack-source-id: 127713135 (Note: this ignores all push blocking failures!) Test Plan: Used later in this stack Reviewed By: ezyang Differential Revision: D28029161 fbshipit-source-id: ff337ab8ccc98437b5594b2f263476baa1ae93e7	2021-04-29 09:31:43 -07:00
Luca Wehrstedt	6fdf092cad	Add getStreamFromPool to DeviceGuardImplInterface (#57046 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57046 We intend to merge CUDAFuture into ivalue::Future by using DeviceGuardImplInterface to avoid explicitly referring to CUDA. For that we need to add two methods to DeviceGuardImplInterface. In this PR, we add a method to get a stream from the global ATen pool. ghstack-source-id: 127713137 (Note: this ignores all push blocking failures!) Test Plan: Used later in this stack Reviewed By: ezyang Differential Revision: D28029159 fbshipit-source-id: 5055d84c1f3c2a4d86442f3149455c5ebd976dea	2021-04-29 09:30:41 -07:00
Michael Carilli	ffdecc1ac4	[CUDA graphs] Allows DeviceCachingAllocator to capture cross-stream memory use (#55860 ) Summary: Safely deallocating and repurposing memory used across streams relies on recording end-of-life events in all of an allocation's usage streams beyond its original allocation stream. The events are later queried to see if all GPU work in those extra streams that could have used the allocation is done (from the CPU's perspective) before repurposing the allocation for use in its original stream. The trouble is, calling EventQuery on an ordinary event recorded in a capturing stream is illegal. Calling EventQuery while capture is underway is also illegal. So when we call `tensor.record_stream` (or `c10::cuda::CUDACachingAllocator::recordStream`) on any tensor that's used or deleted in or around a capture, we often end up with a confusing error thrown from the cudaEventQuery in DeviceCachingAllocator::process_events(). This PR enables hopefully-safe deletion of tensors used across streams in or around capture with a conservative but simple approach: don't record or process end-of-life events for such tensors until the allocator's sure no captures are underway. You could whiteboard cases where this causes cross-stream-used allocations to be unavailable for reuse longer than absolutely necessary, but cross-stream-used allocations are uncommon, so for practical purposes this approach's impact on the memory footprint of captured sequences should be small. Pull Request resolved: https://github.com/pytorch/pytorch/pull/55860 Reviewed By: ejguan Differential Revision: D27822557 Pulled By: ezyang fbshipit-source-id: b2e18a19d83ed05bad67a8157a14a606ed14d04e	2021-04-18 20:32:10 -07:00
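The deferral strategy described in the entry above can be modeled without any CUDA calls. The following is a toy sketch only: the class `ToyAllocator`, its fields, and the idea of passing in a set of finished streams are all hypothetical stand-ins for the real event recording and querying, and it does not reproduce the actual DeviceCachingAllocator code.

```python
class ToyAllocator:
    """Toy model that defers end-of-life events while a capture is underway."""

    def __init__(self):
        self.captures_underway = 0
        self.needs_events = []   # (block, streams) freed during a capture
        self.outstanding = []    # (block, streams whose events are unfinished)
        self.reusable = []       # blocks that are safe to hand out again

    def free(self, block, extra_streams=()):
        if not extra_streams:
            # Only ever used on its allocation stream: reusable right away.
            self.reusable.append(block)
        elif self.captures_underway:
            # Recording an ordinary event now would be illegal, so defer it.
            self.needs_events.append((block, set(extra_streams)))
        else:
            # Stand-in for recording one event per extra usage stream.
            self.outstanding.append((block, set(extra_streams)))

    def process_events(self, finished_streams):
        if self.captures_underway:
            return  # querying events is not allowed during capture
        # Record the events that were deferred, then poll everything.
        self.outstanding.extend(self.needs_events)
        self.needs_events.clear()
        still_waiting = []
        for block, streams in self.outstanding:
            streams -= set(finished_streams)
            if streams:
                still_waiting.append((block, streams))
            else:
                self.reusable.append(block)
        self.outstanding = still_waiting


alloc = ToyAllocator()
alloc.captures_underway = 1
alloc.free("A", extra_streams=["s1"])   # deferred: a capture is underway
alloc.process_events({"s1"})            # no-op during capture
alloc.captures_underway = 0
alloc.process_events({"s1"})            # block "A" becomes reusable now
```

The sketch shows the trade-off the commit message mentions: block "A" stays unavailable until the capture ends, even though its extra stream had already finished.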

141 Commits