pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 00:21:07 +01:00

Author	SHA1	Message	Date
PyTorch MergeBot	e505360eb8	Revert "[mta] APEX style Fused Adam (#81705 )" This reverts commit `7a6c4d0c50`. Reverted https://github.com/pytorch/pytorch/pull/81705 on behalf of https://github.com/dagitses due to broke internal builds, details to come	2022-09-22 19:37:29 +00:00
PyTorch MergeBot	0ac6311356	Revert "[CUBLAS][CUDA GRAPHS] (re-open of #83461 ) Explicitly set the workspace for cuBLAS handles (#85292 )" This reverts commit `4012e623e8`. Reverted https://github.com/pytorch/pytorch/pull/85292 on behalf of https://github.com/dagitses due to broke an internal test during shutdown. Re-submit with #85399 in stack	2022-09-21 17:57:49 +00:00
Masaki Kozuki	7a6c4d0c50	[mta] APEX style Fused Adam (#81705 ) This PR implements an APEX style FusedAdam in PyTorch. This is different from the APEX one in that this is compatible with `torch.cuda.amp.GradScaler` by setting `_step_supports_amp_scaling` to `True` and unscales gradients inside its CUDA kernel. related: https://github.com/pytorch/pytorch/issues/68041, https://github.com/pytorch/pytorch/issues/71274, https://github.com/pytorch/pytorch/issues/80167 possibly related to https://github.com/pytorch/pytorch/issues/80595#issuecomment-1178519436 cc @ptrblck @ngimel Pull Request resolved: https://github.com/pytorch/pytorch/pull/81705 Approved by: https://github.com/ngimel	2022-09-20 17:18:33 +00:00
eqy	4012e623e8	[CUBLAS][CUDA GRAPHS] (re-open of #83461 ) Explicitly set the workspace for cuBLAS handles (#85292 ) re-open of #83461 with fix for 10.2 build CC @ngimel @malfet Pull Request resolved: https://github.com/pytorch/pytorch/pull/85292 Approved by: https://github.com/malfet	2022-09-20 16:31:54 +00:00
Hector Yuen	d23ce29761	allow changing the cuda allocator settings even after the process started (#84970 ) Summary: - expose a python call to set the allocator settings, it uses the same format as the value for PYTORCH_CUDA_ALLOCATOR - keep the implementation contained within the cpp file to avoid increasing build times, only expose a function to call the setting - make some of the Allocator Config methods public, now it looks more like a singleton Test Plan: added the unit test Differential Revision: D39487522 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84970 Approved by: https://github.com/zdevito	2022-09-17 09:42:42 +00:00
PyTorch MergeBot	2711b9fa63	Revert "[CUBLAS][CUDA GRAPHS] Explicitly set the workspace for cuBLAS handles (#83461 )" This reverts commit `713d8b8552`. Reverted https://github.com/pytorch/pytorch/pull/83461 on behalf of https://github.com/malfet due to Broke CUDA-10.2 builds, see `713d8b8552`	2022-09-14 22:27:30 +00:00
Eddie Yan	713d8b8552	[CUBLAS][CUDA GRAPHS] Explicitly set the workspace for cuBLAS handles (#83461 ) We're seeing an issue where repeatedly capturing graphs incurs increasing memory usage as cuBLAS internally allocates a new workspace for each graph even when the same handle is being used: https://gist.github.com/tomconerlyanth/a20c04a4a46a0f6e9ce18f5280729b36 This PR works around the issue by intercepting the `CUBLAS_WORKSPACE_CONFIG` environment variable and allocating the workspace for the cuBLAS handle explicitly. CC @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/83461 Approved by: https://github.com/ngimel	2022-09-14 21:56:48 +00:00
Aidyn-A	5271494ef2	[CUDA graphs] Fixes errors in RNG seed (#84967 ) Fixes #84614 Prior to this PR CUDAGraph did not store the RNG seed, that is why `torch.cuda.manual_seed(new_seed)` would only reset the offset but not update the seed at all keeping whatever value was used during graph capture. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84967 Approved by: https://github.com/ngimel	2022-09-14 19:56:12 +00:00
jataylo	09bcc006e9	ROCm support for test_lazy_init (#84333 ) Added ROCm support for the test_lazy_init unit test by including a condition on TEST_WITH_ROCM to switch CUDA_VISIBLE_DEVICES with HIP_VISIBLE_DEVICES. This is needed because HIP_VISIBLE_DEVICES is set when running the single-GPU tests in CI: `a47bc96fb7/.jenkins/pytorch/test.sh (L38)`, but this test sets CUDA_VISIBLE_DEVICES, which takes lower precedence than HIP_VISIBLE_DEVICES on ROCm. Testing Logs (to show behavior difference) 12:40:41 Aug 30 11:40:41 CUDA_VISIBLE_DEVICES='0': 0 12:40:41 Aug 30 11:40:41 1 12:40:41 Aug 30 11:40:41 CUDA_VISIBLE_DEVICES='32': 32 12:40:41 Aug 30 11:40:41 1 12:40:41 Aug 30 11:40:41 HIP_VISIBLE_DEVICES='0': 0 12:40:41 Aug 30 11:40:41 1 12:40:41 Aug 30 11:40:41 HIP_VISIBLE_DEVICES='32': 32 12:40:41 Aug 30 11:40:41 0 Passing UT Aug 30 17:03:15 test_lazy_init (main.TestCuda) Aug 30 17:03:17 Validate that no CUDA calls are made during import torch call ... ok (2.471s) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84333 Approved by: https://github.com/jithunnair-amd, https://github.com/malfet	2022-09-09 14:14:59 +00:00
Fabio Rocha	88b1cc885c	Removed tri[lu]* tests, superseded by OpInfos (#84256 ) triu, tril, triu_indices and tril_indices had some tests in test_tensor_creation_ops.py and test_cuda.py that are redundant with the ones done by OpInfos for those ops. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84256 Approved by: https://github.com/Lezcano, https://github.com/ngimel	2022-09-06 18:54:10 +00:00
Aidyn-A	ce1b727e77	Disable autocast cache in torch.cuda.make_graphed_callables (#84289 ) There are conflicts between `torch.clear_autocast_cache()` and `cudaMallocAsync` from #82682. Moreover, the use of autocast caching is not reasonable during training, which is the main target of `make_graphed_callables`. cc @eqy @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/84289 Approved by: https://github.com/ngimel	2022-09-01 21:34:51 +00:00
Pruthvi Madugundu	8473e69684	[ROCm] Fixes the kernel asserts API declaration mismatch error (#81790 ) This PR updates PR [#73040](https://github.com/pytorch/pytorch/pull/73040). With these changes, PyTorch compiles successfully with ROCm when `NDEBUG` is enabled. Solution: For HIP we keep `__device__ __assert_fail()` and for host side compilation we want to use the `__assert_fail()` from the glibc library. Tested the code by compiling with the steps below: ``` python3 tools/amd_build/build_amd.py python3 setup.py develop --cmake-only cmake -DHIP_HIPCC_FLAGS_RELEASE="-DNDEBUG" build cmake --build build ``` The UT test_fixed_cuda_assert_async is still skipped due to performance overhead. cc @jithunnair-amd Pull Request resolved: https://github.com/pytorch/pytorch/pull/81790 Approved by: https://github.com/shintaro-iwasaki, https://github.com/jeffdaily, https://github.com/malfet	2022-08-16 19:22:31 +00:00
Zachary DeVito	4128712397	Propagate CUDAOutOfMemoryError to Python. (#83146 ) The intention is to make it easier to catch this situation for debugging, logging, or application-specific recovery. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83146 Approved by: https://github.com/albanD	2022-08-11 21:32:11 +00:00
Zachary DeVito	726d040692	annotated allocator snapshots (#82146 ) Record stack trace information for each allocated segment in the allocator. It takes around 1.5us to record 50 stack frames of context. Since invoking a PyTorch operator is around 8us, this adds minimal overhead but we still leave it disabled by default so that we can test it more on real workloads first. Stack information is kept both for allocated blocks and for the last allocation that used each inactive block. We could potentially keep around the _first_ allocation that caused the block to get allocated from cuda as well. Potential Followups: * stack frame entries are small (16 bytes), but the list of Frames is not compressed even though most frames will share some entries. So far this doesn't produce huge dumps (7MB for one real workload that uses all memory on the GPU), but it can be much smaller through compression. * Code to format the information is slow (a few seconds) because it uses python and FlameGraph.pl * Things allocated during the backward pass have no stack frames because they are run on another C++ thread. Pull Request resolved: https://github.com/pytorch/pytorch/pull/82146 Approved by: https://github.com/albanD	2022-08-09 17:21:35 +00:00
Aidyn-A	da0a3fe058	[Re-land] [CUDA graphs] Clear autocast amp cache (#81896 ) Re-lands #81558 that got reverted due to failing tests. This failure happened because of the test that I poorly designed. [The loop here](https://github.com/pytorch/pytorch/pull/81558/files#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450R3837) is doing `cache_enabled=False` and then `cache_enabled=True`. By doing this loop the graph from the previous iteration (case `False`) conflicts with the next one (case `True`). I redesigned the test such that it does not do any loops. The new test does separate function calls with different argument values. Pull Request resolved: https://github.com/pytorch/pytorch/pull/81896 Approved by: https://github.com/ngimel	2022-08-02 23:22:00 +00:00
Kurt Mohler	14d0296e5c	Rename `_Typed/_UntypedStorage` to `Typed/UntypedStorage` and update docs (#82438 ) ### Description Since the major changes for `_TypedStorage` and `_UntypedStorage` are now complete, they can be renamed to be public. `TypedStorage._untyped()` is renamed to `TypedStorage.untyped()`. Documentation for storages is improved as well. ### Issue Fixes #82436 ### Testing N/A Pull Request resolved: https://github.com/pytorch/pytorch/pull/82438 Approved by: https://github.com/ezyang	2022-07-30 19:37:08 +00:00
Eddie Yan	0b2566456f	[CUDNN] Update tests and dispatching for CUDNN V8 API behavior for `bfloat16` convs (#81139 ) cuDNN via the V8 API supports `bfloat16` on Ampere (`>= (8, 0)` but not older devices) which might be unexpected given current test settings. This PR fixes some dispatching to check the device capability before dispatching `bfloat16` convs and adjusts the expected failure conditions for the autocast test. CC @xwang233 @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/81139 Approved by: https://github.com/ngimel	2022-07-29 23:28:58 +00:00
PyTorch MergeBot	f5b460b200	Revert "[CUDA graphs] Clear autocast amp cache (#81558 )" This reverts commit `e9d07bd4f0`. Reverted https://github.com/pytorch/pytorch/pull/81558 on behalf of https://github.com/janeyx99 due to Breaks windows 11.6 tests on trunk `e9d07bd4f0`	2022-07-21 12:46:36 +00:00
Aidyn-A	e9d07bd4f0	[CUDA graphs] Clear autocast amp cache (#81558 ) According to [autocast_mode.cpp](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/autocast_mode.cpp) `cached_casts` is to be cleared at the end of each forward pass. However, this was not the case in current implementation of `make_graphed_callables` so a graph created the following way: ``` with torch.cuda.amp.autocast(cache_enabled=True): graphed_foo = torch.cuda.make_graphed_callables(foo, tensors) ``` Behaves incorrectly. cc @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/81558 Approved by: https://github.com/ngimel	2022-07-21 01:44:14 +00:00
Jeff Daily	ff6655defb	[ROCm] unskip external streams tests (#80922 ) These two tests are passing for ROCm 5.1.1 and 5.2. Pull Request resolved: https://github.com/pytorch/pytorch/pull/80922 Approved by: https://github.com/cpuhrsch	2022-07-08 21:29:29 +00:00
Nikita Shulga	1ad7ef3f21	Add check for cuda lazy init (#80912 ) Validate that no CUDA calls are made during `import torch` call, by importing torch with visible devices limited to a non-existing device. Should prevent regressions like ones reported in https://github.com/pytorch/pytorch/issues/80876 Pull Request resolved: https://github.com/pytorch/pytorch/pull/80912 Approved by: https://github.com/ngimel, https://github.com/atalman	2022-07-06 01:39:27 +00:00
Jeff Daily	20d56d2b32	increase sleep for TestCuda.test_caching_pinned_memory_multi_gpu (#76601 ) Fixes #68299. Fixes #70875. Test is flaky on ROCm because the HIP runtime occasionally copies asynchronously too quickly for the current sleep value of 50ms. This is not a bug. Increasing the sleep value to 1s to avoid flakiness. Pull Request resolved: https://github.com/pytorch/pytorch/pull/76601 Approved by: https://github.com/pruthvistony, https://github.com/malfet	2022-06-14 21:10:35 +00:00
Michael Carilli	ba27ee9e8f	[CUDA graphs] Allows Adam and AdamW to be capture-safe (#77862 ) Near term fix for https://github.com/pytorch/pytorch/issues/76368. Q. Why does the user need to request `capturable=True` in the optimizer constructor? Why can't capture safety be completely automatic? A. We need to set up capture-safe (device-side) state variables before capture. If we don't, and step() internally detects capture is underway, it's too late: the best we could do is create a device state variable and copy the current CPU value into it, which is not something we want baked into the graph. Q. Ok, why not just do the capture-safe approach with device-side state variables all the time? A. It incurs several more kernel launches per parameter, which could really add up and regress cpu overhead for ungraphed step()s. If the optimizer won't be captured, we should allow step() to stick with its current cpu-side state handling. Q. But cuda RNG is a stateful thing that maintains its state on the cpu outside of capture and replay, and we capture it automatically. Why can't we do the same thing here? A. The graph object can handle RNG generator increments because its capture_begin, capture_end, and replay() methods can see and access generator object. But the graph object has no explicit knowledge of or access to optimizer steps in its capture scope. We could let the user tell the graph object what optimizers will be stepped in its scope, ie something like ```python graph.will_use_optimizer(opt) graph.capture_begin() ... ``` but that seems clunkier than an optimizer constructor arg. I'm open to other ideas, but right now I think constructor arg is necessary and the least bad approach. Long term, https://github.com/pytorch/pytorch/issues/71274 is a better fix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/77862 Approved by: https://github.com/ezyang	2022-06-13 01:56:47 +00:00
Kurt Mohler	aea6e2c396	Merge torch.cuda._UntypedStorage into torch._UntypedStorage (#75459 ) Fixes #74933 Pull Request resolved: https://github.com/pytorch/pytorch/pull/75459 Approved by: https://github.com/ezyang	2022-05-19 13:54:39 +00:00
Michael Carilli	929f1d5317	[RELAND] Adds torch.cuda.is_current_stream_capturing (#77789 ) Resubmit of https://github.com/pytorch/pytorch/pull/77673, which was reverted due to Windows test failures: https://github.com/pytorch/pytorch/pull/77673#issuecomment-1130425845. I suspect these failures happened because I don't explicitly set a side stream for graph capture in the new test. Not setting a side stream explicitly is alright on Linux because cuda tests implicitly use a side stream. I think Windows cuda tests implicitly use the default stream, breaking capture and leaving the backend in a bad state. Other graphs tests explicitly set side streams and don't error in Windows builds, so i'm 95% sure doing the same for the new test will work. Pull Request resolved: https://github.com/pytorch/pytorch/pull/77789 Approved by: https://github.com/ezyang	2022-05-18 23:18:53 +00:00
Jeff Daily	de86146c61	rocblas alt impl during backward pass only (#71881 ) In preparation of adopting future rocblas library options, it is necessary to track when the backward pass of training is executing. The scope-based helper class `BackwardPassGuard` is provided to toggle state. Pull Request resolved: https://github.com/pytorch/pytorch/pull/71881 Approved by: https://github.com/albanD	2022-05-18 19:42:58 +00:00
PyTorch MergeBot	0d8a0f186b	Revert "Adds torch.cuda.is_current_stream_capturing (#77673 )" This reverts commit `d03d43df52`. Reverted https://github.com/pytorch/pytorch/pull/77673 on behalf of https://github.com/suo	2022-05-18 19:31:49 +00:00
Michael Carilli	d03d43df52	Adds torch.cuda.is_current_stream_capturing (#77673 ) Exposes a way to query if CUDA graph capture is underway on the current stream. Pull Request resolved: https://github.com/pytorch/pytorch/pull/77673 Approved by: https://github.com/ezyang	2022-05-18 16:46:35 +00:00
Eddie Yan	76b952bb35	[CUBLAS][TF32] Skip `test_cublas_allow_tf32_get_set` if `TORCH_ALLOW_TF32_CUBLAS_OVERRIDE` is set (#77298 ) Follow-up to #77114 to prevent test breakages when the environment variable is set. CC @xwang233 @ngimel @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/77298 Approved by: https://github.com/xwang233, https://github.com/ngimel	2022-05-17 21:57:09 +00:00
Eddie Yan	e838137b3e	Add high level control of fp32 matmul precision; disable TF32 for matmuls by default #76440 CC @mruberry @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/76509 Approved by: https://github.com/ngimel	2022-05-04 20:40:13 +00:00
Felipe Petroski Such	b0c5fba967	[CUDA Graphs] Fix OOM inside graph capture_begin release_cached_blocks calls this: ``` void synchronize_and_free_events() { TORCH_INTERNAL_ASSERT(captures_underway == 0); ``` Which means we can't call that function when we are capturing a cuda graph: ``` import torch with torch.cuda.graph(torch.cuda.CUDAGraph()): torch.zeros(2 ** 40, device="cuda") ``` results in: ``` RuntimeError: captures_underway == 0INTERNAL ASSERT FAILED at "/tmp/torch/c10/cuda/CUDACachingAllocator.cpp":1224, please report a bug to PyTorch. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/76247 Approved by: https://github.com/ngimel	2022-04-29 17:42:04 +00:00
Jeff Daily	e846ef8818	add rocm ciflow/slow workflow Enables additional tests that historically have been missed for ROCm CI. Pull Request resolved: https://github.com/pytorch/pytorch/pull/72686 Approved by: https://github.com/seemethere	2022-04-22 17:41:28 +00:00
Ivan Yashchuk	4bb5e6e830	Fix `test_reduce_add_coalesced` failure (#74027 ) Summary: Recent change (https://github.com/pytorch/pytorch/pull/69751) introduced the requirement of using `.coalesce()` explicitly in the tests. Unfortunately, not all tests are run in the current CI configuration and one test failure slipped through. Fixes https://github.com/pytorch/pytorch/issues/74015. Pull Request resolved: https://github.com/pytorch/pytorch/pull/74027 Reviewed By: samdow Differential Revision: D34858112 Pulled By: mruberry fbshipit-source-id: 8904fac5e2b5335684a21f95a22646469478eb81 (cherry picked from commit 06d6e6d2a796af0e8444f4c57841a07ec4f67c9f)	2022-03-15 06:29:54 +00:00
Michael Carilli	2f957f513e	Deletes unused line in test_autocast_rnn (#73195 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73195 Reviewed By: mruberry Differential Revision: D34557677 Pulled By: ngimel fbshipit-source-id: 284018b4596471332d0e90a08e2c38303fb2b3ae (cherry picked from commit bbf6913009e206c02e124c49ab80ef9596f7fcad)	2022-03-02 01:27:55 +00:00
Shintaro Iwasaki	7dc2cfa249	[c10][rocm] fix __assert_fail() declaration mismatch error (#73040 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73040 This patch fixes a compilation error in PyTorch with ROCm when `NDEBUG` is passed. ## Problem Forward declaration of `__host__ __device__ __assert_fail()` is used in `c10/macros/Macros.h` for HIP compilation when `NDEBUG` is set However, HIP has `__device__ __assert_fail()` in `hip/amd_detail/amd_device_functions.h`, causing a function type error. This issue does not appear in ROCm CI tests since it happens only when `NDEBUG` is passed. ## Solution [EDIT] After the discussion on GitHub, we chose to entirely disable `CUDA_KERNEL_ASSERT()` for ROCm. --- To solve this compilation error, this patch disables `CUDA_KERNEL_ASSERT()`, which uses `__assert_fail()` when 1. `c10/macros/Macros.h` is included for `.hip` (precisely speaking, `__HIP__` or `__HIP_ARCH__` is defined), and 2. `NDEBUG` is passed. Note that there's no impact on default compilation because, without a special compilation flag, those HIP files are compiled without `-NDEBUG`. And that's why this issue has not been found. ### Justification [1] We cannot declare one host-and-device function for two separate host and device functions. 
``` __device__ int func() {return 0}; __host__ int func() {return 0}; // Compile error (hipcc) // __device__ __host__ int func(); ``` [2] Forward declaration of a correct `__device__` only `__assert_fail()` for `__HIP__` causes the following error: ``` pytorch/c10/util/TypeCast.h:135:7: error: reference to __device__ function '__assert_fail' in __host__ __device__ function ERROR_UNSUPPORTED_CAST ^ pytorch/c10/util/TypeCast.h:118:32: note: expanded from macro 'ERROR_UNSUPPORTED_CAST' #define ERROR_UNSUPPORTED_CAST CUDA_KERNEL_ASSERT(false); ^ pytorch/c10/macros/Macros.h:392:5: note: expanded from macro 'CUDA_KERNEL_ASSERT' __assert_fail( ``` [3] Maybe there's a way to properly define `__assert_fail()` for HIP + NDEBUG, but this might be too much. Please let me just disable it. ### Technical details Error ``` pytorch/c10/macros/Macros.h:368:5: error: __host__ __device__ function '__assert_fail' cannot overload __device__ function '__assert_fail' __assert_fail( ^ /opt/rocm/hip/include/hip/amd_detail/amd_device_functions.h:1173:6: note: previous declaration is here void __assert_fail(const char *assertion, ``` CUDA definition (9.x) of `__assert_fail()` ``` #elif defined(__GNUC__) extern __host__ __device__ __cudart_builtin__ void __assert_fail( const char *, const char *, unsigned int, const char *) __THROW; ``` ROCm definition (the latest version) ``` // `2b59661f3e/include/hip/amd_detail/amd_device_functions.h (L1172-L1177)` extern "C" __device__ __attribute__((noinline)) __attribute__((weak)) void __assert_fail(const char *assertion, const char *file, unsigned int line, const char *function); ``` Test Plan: CI + reproducer ``` python3 tools/amd_build/build_amd.py python3 setup.py develop --cmake-only cmake -DHIP_HIPCC_FLAGS_RELEASE="-DNDEBUG" build cmake --build build ``` Reviewed By: xw285cornell Differential Revision: D34310555 fbshipit-source-id: 7542288912590533ced3f20afd2e704b6551991b (cherry picked from commit 9e52196e36820abe36bf6427cabc7389d3ea6cb5)	2022-03-01 04:35:30 +00:00
Philip Meier	b5f2574f36	no longer coalesce sparse COO tensors before comparison (#69751 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69751 cc nikitaved pearu cpuhrsch IvanYashchuk Test Plan: Imported from OSS Reviewed By: zou3519 Differential Revision: D34262453 Pulled By: ezyang fbshipit-source-id: e2e62d2aa03fc569d2951c880960b256f5dc4aaa (cherry picked from commit `cb6b0ef719`)	2022-02-17 02:33:08 +00:00
Kurt Mohler	8e7fe87630	Rename `Typed/UntypedStorage` to `_Typed/_UntypedStorage` (#72540 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72540 Reviewed By: jbschlosser Differential Revision: D34216823 Pulled By: bdhirsh fbshipit-source-id: 1bc9930ab582771ebf02308e035576cd1a0dbe47 (cherry picked from commit `329238f612`)	2022-02-15 23:53:01 +00:00
Louis Feng	83b3b5fb00	[PyTorch] Support NVTX range_start and range_end (#70030 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70030 range_push and range_pop do not support multi-thread. It only works for push and pop range in the same thread. For process level ranges, we should use range_start and range_end. This is important because PyTorch forward is on one thread, while the autograd is on a different thread. See NVidia implementation documentation: `cab2dec760/NSight/nvToolsExt.h (L397-L407)` Test Plan: ``` buck test caffe2/test:cuda Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/8162774391483460 ✓ ListingSuccess: caffe2/test:cuda - main (19.640) Summary ListingSuccess: 1 If you need help understanding your runs, please follow the wiki: https://fburl.com/posting_in_tpx_users Finished test run: https://www.internalfb.com/intern/testinfra/testrun/8162774391483460 ``` Reviewed By: malfet Differential Revision: D33155244 fbshipit-source-id: c7d5143f6da9b6ef0e0811e2fcae03a3e76f24de (cherry picked from commit `22134e91b7`)	2022-02-07 17:31:57 +00:00
Andrew Tulloch	0099796978	[CUDA Pinned Memory] [Retry] Alternative implementation of pinned memory allocator focusing on multi-threaded scalability (#69299 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69299 https://github.com/pytorch/pytorch/pull/68906 + https://github.com/pytorch/pytorch/pull/68749 plugged one correctness hole (non-blocking copies of offset pinned memory tensors) while introducing another (non-blocking copies of pinned memory tensors with a non-standard DataPtr context). In this revision, we use both the tensor data pointer and context to attempt to identify the originating block in the pinned memory allocator. Test Plan: New unit tests added to cover the missing case previously. Reviewed By: yinghai Differential Revision: D32787087 fbshipit-source-id: 0cb0d29d7c39a13f433eb1cd423dc0d2a303c955 (cherry picked from commit `297157b1a1`)	2022-01-27 01:33:55 +00:00
Mike Ruberry	e0d829a266	Kill the test_torch.py mixin and creates test_scatter_gather_ops (#71691 ) Summary: Per title. Also annotates test_torch.py with additional cleanup tasks and adds empty sample inputs to elementwise unary and binary OpInfos. Pull Request resolved: https://github.com/pytorch/pytorch/pull/71691 Reviewed By: ngimel Differential Revision: D33735126 Pulled By: mruberry fbshipit-source-id: 8cc097a7581a8b620540c95b2a5889c1165ecf23 (cherry picked from commit `5c6a245a3f`)	2022-01-24 09:32:32 +00:00
Leo Fang	67941c8a94	Document `torch.cuda.ExternalStream`, `torch.cuda.caching_allocator_alloc` and `torch.cuda.caching_allocator_delete` (#70126 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/67414. Fixes https://github.com/pytorch/pytorch/issues/70117. cc brianjo mruberry ngimel Pull Request resolved: https://github.com/pytorch/pytorch/pull/70126 Reviewed By: mruberry Differential Revision: D33542910 Pulled By: ngimel fbshipit-source-id: 4b870f4dceca6ee4cc8fba58819f1cb18ac9f857	2022-01-12 15:44:40 -08:00
Jane Xu	20489ebdc9	Increase tensor size for mem check tests (#70603 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/70226 Pull Request resolved: https://github.com/pytorch/pytorch/pull/70603 Reviewed By: mruberry Differential Revision: D33410439 Pulled By: janeyx99 fbshipit-source-id: e94615ece6d0fdf230de5297118678b70f34a18c	2022-01-05 08:27:48 -08:00
Jane Xu	c555b7bacb	GHA: Remove caffe2 check in Windows shard 1 smoke tests (#70010 ) Summary: Windows shard 1 hasn't actually been running any tests because the script that does so exited before running the python tests but did not report an error. This has been happening to all windows tests across the board, for example https://github.com/pytorch/pytorch/runs/4526170542?check_suite_focus=true Removing the caffe2.python check passes the smoke tests now. You can observe that the run_test.py file is called in the windows cpu job now https://github.com/pytorch/pytorch/runs/4541331717?check_suite_focus=true Pull Request resolved: https://github.com/pytorch/pytorch/pull/70010 Reviewed By: malfet, seemethere Differential Revision: D33161291 Pulled By: janeyx99 fbshipit-source-id: 85024b0ebb3ac42297684467ee4d0898ecf394de	2021-12-20 16:05:38 -08:00
Mike Ruberry	84b7832010	Updates CUDA memory leak check to verify against driver API and print more diagnostic information (#69556 ) Summary: Per title Pull Request resolved: https://github.com/pytorch/pytorch/pull/69556 Reviewed By: mrshenli Differential Revision: D32954770 Pulled By: mruberry fbshipit-source-id: a6c2ae6f704422c178569980ca4b9c72c4272f55	2021-12-17 23:37:49 -08:00
Mike Ruberry	dc87cf5fe1	Fixes mem_get_info when querying on a device other than the current device (#69640 ) Summary: Also fixes the documentation failing to appear and adds a test to validate that op works with multiple devices properly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/69640 Reviewed By: ngimel Differential Revision: D32965391 Pulled By: mruberry fbshipit-source-id: 4fe502809b353464da8edf62d92ca9863804f08e	2021-12-08 23:04:30 -08:00
Dennis van der Staay	cbe0a38d8c	Back out "[CUDA Pinned Memory] Event recording with non-blocking copies should track the storage context, not the tensor data pointer" (#69193 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69193 Reviewed By: xing-liu, yuchenhao Differential Revision: D32748570 fbshipit-source-id: bd73d7567f94c70daeace49d4081381b8adf2d77	2021-12-01 19:30:08 -08:00
Andrew Tulloch	d44e610efa	[CUDA Pinned Memory] Event recording with non-blocking copies should track the storage context, not the tensor data pointer (#68749 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68749 The logic for asynchronous copies (either HtoD or DtoH) using cudaMemcpyAsync relies on recording an event with the caching host allocator to notify it that a given allocation has been used on a stream - and thus it should wait for that stream to proceed before reusing the host memory. This tracking is based on the allocator maintaining a map from storage allocation pointers to some state. If we try to record an event for a pointer we don't understand, we will silently drop the event and ignore it (`9554ebe44e/aten/src/ATen/cuda/CachingHostAllocator.cpp (L171-L175)`). Thus, if we use the data_ptr of a Tensor instead of the storage allocation, then reasonable code can lead to incorrectness due to missed events. One way this can occur is simply by slicing a tensor into sub-tensors - which have different values of `data_ptr()` but share the same storage, for example: ``` image_batch = torch.randn(M, B, C, H, W).pin_memory() for m in range(M): sub_batch = image_batch[m].cuda(non_blocking=True) # sub_batch.data_ptr() != image_batch.data_ptr() except for m == 0. # however, sub_batch.storage().data_ptr() == image_batch.storage().data_ptr() always. ``` Therefore, we instead use the storage context pointer when recording events, as this is the same state that is tracked by the caching allocator itself. This is a correctness fix, although it's hard to determine how widespread this issue is. Using the storage context also allows us to use a more efficient structure internally to the caching allocator, which will be sent in future diffs. Test Plan: Test added which demonstrates the issue, although it's hard to demonstrate the race explicitly. 
Reviewed By: ngimel Differential Revision: D32588785 fbshipit-source-id: d87cc5e49ff8cbf59052c3c97da5b48dd1fe75cc	2021-11-24 13:20:22 -08:00
eqy	790763b0fe	Add an option to disable reduced precision reductions for FP16 GEMM (#67946 ) Summary: https://github.com/pytorch/pytorch/issues/67578 disabled reduced precision reductions for FP16 GEMMs. After benchmarking, we've found that this has substantial performance impacts for common GEMM shapes (e.g., those found in popular instantiations of multiheaded-attention) on architectures such as Volta. As these performance regressions may come as a surprise to current users, this PR adds a toggle to disable reduced precision reductions `torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = ` rather than making it the default behavior. CC ngimel ptrblck stas00 Note that the behavior after the previous PR can be replicated with `torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False` Pull Request resolved: https://github.com/pytorch/pytorch/pull/67946 Reviewed By: zou3519 Differential Revision: D32289896 Pulled By: ngimel fbshipit-source-id: a1ea2918b77e27a7d9b391e030417802a0174abe	2021-11-09 17:27:20 -08:00
Jane Xu	2578de4851	[skip ci] Set test owner for test_cuda* tests (#66838 ) Summary: Action following https://github.com/pytorch/pytorch/issues/66232 cc ngimel Pull Request resolved: https://github.com/pytorch/pytorch/pull/66838 Reviewed By: saketh-are Differential Revision: D31841411 Pulled By: janeyx99 fbshipit-source-id: 5cdffdef4a92f9adcef1143ae4598b052c5acc6b	2021-10-21 17:36:25 -07:00
arindamroy-eng	32e790997b	[Rocm]Reduce severity of detected possible memory leak from assertion to warning (#65973 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/62533. In very rare cases, the decorator that detects memory leaks throws an assertion even when the test passes and the memory is freed after a tiny delay. The issue does not reproduce in internal testing, but sometimes shows up in the CI environment. This PR reduces the severity of such a detection to a warning so that it does not fail the CI tests, since the actual test is not failing; only the check inside the decorator is. The change is limited to ROCm for now. cc jeffdaily sunway513 jithunnair-amd ROCmSupport Pull Request resolved: https://github.com/pytorch/pytorch/pull/65973 Reviewed By: anjali411 Differential Revision: D31776154 Pulled By: malfet fbshipit-source-id: 432199fca17669648463c4177c62adb553cacefd	2021-10-21 07:10:54 -07:00
Yanli Zhao	8173d4df69	move get_cycles_per_ms() to common_utils (#66798 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66798 get_cycles_per_ms is copied and used in a few places, move it to common_utils so that it can be used as a shared util function ghstack-source-id: 140790599 Test Plan: unit tests Reviewed By: pritamdamania87 Differential Revision: D31706870 fbshipit-source-id: e8dccecb13862646a19aaadd7bad7c8f414fd4ab	2021-10-18 14:04:09 -07:00
Kurt Mohler	5883523c1d	Remove dtype from torch.Storage and use only torch.ByteStorage (#62030 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62030 Remove dtype tracking from Python Storage interface, remove all the different `<type>Storage` classes except for `ByteStorage`, and update serialization accordingly, while maintaining as much FC/BC as possible Fixes https://github.com/pytorch/pytorch/issues/47442 * THE SERIALIZATION FORMAT IS FULLY FC/BC. We worked very hard to make sure this is the case. We will probably want to break FC at some point to make the serialization structure of tensors make more sense, but not today. * There is now only a single torch.ByteStorage class. Methods like `Tensor.set_` no longer check that the dtype of storage is appropriate. * As we no longer know what dtype of a storage is, we've removed the size method from Storage, replacing it with nbytes. This is to help catch otherwise silent errors where you confuse number of elements with number of bytes. * `Storage._new_shared` takes a `nbytes` kwarg and will reject previous positional only calls. `Storage._new_with_file` and `_set_from_file` require explicit element size arguments. * It's no longer possible to convert storages to different types using the float/double/etc methods. Instead, do the conversion using a tensor. * It's no longer possible to allocate a typed storage directly using FloatStorage/DoubleStorage/etc constructors. Instead, construct a tensor and extract its storage. The classes still exist but they are used purely for unpickling. * The preexisting serialization format stores dtype with storage, and in fact this dtype is used to determine the dtype of the tensor overall. To accommodate this case, we introduce a new TypedStorage concept that exists only during unpickling time which is used to temporarily store the dtype so we can construct a tensor. 
If you overrode the handling of pickling/unpickling, you MUST add handling for TypedStorage or your serialization code will degrade to standard file-based serialization. Original pull request: https://github.com/pytorch/pytorch/pull/59671 Reviewed By: soulitzer, ngimel Differential Revision: D29466819 Pulled By: ezyang fbshipit-source-id: 4a14e5d3c2b08e06e558683d97f7378a3180b00e	2021-10-05 13:50:34 -07:00
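The element-count versus byte-count confusion that the commit above guards against can be shown with the standard library alone. This is only an analogy, assuming a four-element 32-bit float buffer; it uses Python's `array` module rather than `torch.Storage`, and the variable names are illustrative.

```python
from array import array

# A buffer of four single-precision floats, standing in for an untyped storage.
buf = array("f", [1.0, 2.0, 3.0, 4.0])

num_elements = len(buf)              # what a dtype-aware size() would report
num_bytes = len(buf) * buf.itemsize  # what a byte-oriented nbytes reports

# Once the dtype is forgotten, only the byte count is still well defined.
raw = buf.tobytes()
print(num_elements, num_bytes, len(raw))
```

Passing `num_elements` where a byte count is expected would under-allocate by a factor of `buf.itemsize`, which is the kind of silent error a method named `nbytes` makes harder to write.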
Michael Dagitses	b737629ff0	simplify op name determination into a single forward pass (#64261 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64261 Note that this does not preserve byte-for-byte compatibility with existing names. Test Plan: * Rely on CI to catch gross errors. * Merge after release cut to catch subtle issues. Reviewed By: albanD Differential Revision: D30700647 Pulled By: dagitses fbshipit-source-id: 7b02f34b8fae3041240cc78fbc6bcae498c3acd4	2021-09-02 07:32:11 -07:00
Michael Carilli	24e50b8453	[CUDA graphs] hotfix for test_graph_ (#64339 ) Summary: Graphed workloads that try to capture a full backward pass must do warmup on a non-default stream. If warmup happens on the default stream, AccumulateGrad functions might tag themselves to run on the default stream, and therefore won't be capturable. ngimel and I suspect some test_cuda.py tests run with the default stream as the ambient stream, which breaks `test_graph_grad_scaling` because `test_graph_grad_scaling` does warmup on the ambient stream _assuming_ the ambient stream is a non-default stream. This PR explicitly sets a side stream for the warmup in `test_graph_grad_scaling`, which is what I should have done all along because it's what the new documentation recommends. I pushed the PR branch straight to the main pytorch repo because we need to run ci-all on it, and I'm not sure what the requirements are these days. Pull Request resolved: https://github.com/pytorch/pytorch/pull/64339 Reviewed By: mruberry Differential Revision: D30690711 Pulled By: ngimel fbshipit-source-id: 91ad75f46a11f311e25bc468ea184e22acdcc25a	2021-08-31 22:34:10 -07:00
Rishi Puri	13484084a6	fix syntax error in bfloat16 PR (#64122 ) Summary: fixes prior syntax error from PR ngimel Pull Request resolved: https://github.com/pytorch/pytorch/pull/64122 Reviewed By: H-Huang Differential Revision: D30643596 Pulled By: ngimel fbshipit-source-id: 0a2d5a40fb6dc7339cd03112e57ef0e1bf8a000e	2021-08-31 14:33:12 -07:00
Michael Carilli	8d08b103be	[CUDA graphs] Prototype API and documentation (#63269 ) Summary: RFC: https://github.com/pytorch/pytorch/issues/61880 Pull Request resolved: https://github.com/pytorch/pytorch/pull/63269 Reviewed By: mruberry Differential Revision: D30596643 Pulled By: ngimel fbshipit-source-id: b1f8061406364b667e2c2d4d30fbce1f0d8456be	2021-08-31 13:34:23 -07:00
Philip Meier	57d4c6cf42	replace `self.assertTrue(torch.allclose(..))` with `self.assertEqual(…)` (#63637 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/63565 Pull Request resolved: https://github.com/pytorch/pytorch/pull/63637 Reviewed By: malfet Differential Revision: D30541266 Pulled By: mruberry fbshipit-source-id: ab461949782c6908a589ea098fcfcf5c3e081ee6	2021-08-25 16:47:40 -07:00
Shen Li	1022443168	Revert D30279364: [codemod][lint][fbcode/c*] Enable BLACK by default Test Plan: revert-hammer Differential Revision: D30279364 (`b004307252`) Original commit changeset: c1ed77dfe43a fbshipit-source-id: eab50857675c51e0088391af06ec0ecb14e2347e	2021-08-12 11:45:01 -07:00
Zsolt Dollenstein	b004307252	[codemod][lint][fbcode/c*] Enable BLACK by default Test Plan: manual inspection & sandcastle Reviewed By: zertosh Differential Revision: D30279364 fbshipit-source-id: c1ed77dfe43a3bde358f92737cd5535ae5d13c9a	2021-08-12 10:58:35 -07:00
Rishi Puri	324673a537	rebase for autocast updates to include device_type and dtype flags (#61002 ) Summary: Fixes #{55374} https://github.com/pytorch/pytorch/issues/55374 Pull Request resolved: https://github.com/pytorch/pytorch/pull/61002 Reviewed By: malfet, mruberry Differential Revision: D30016812 Pulled By: ngimel fbshipit-source-id: 6e09a29f539d28e9aea5cd9489b1e633cc588033	2021-08-10 20:03:12 -07:00
Kevin Tse	4b47ea9446	adding a skip for ROCm for a flaky test (#62664 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62664 Skipping a test for ROCm because of issue #62602 Test Plan: Imported from OSS Reviewed By: soulitzer Differential Revision: D30079534 Pulled By: NivekT fbshipit-source-id: a9cf35e5d3a8d218edc9c5a704d1f9599d2f38a6	2021-08-04 07:29:06 -07:00
Michael Carilli	9fb6b40f3e	Makes a streaming backward test try gradient stealing more directly (#60065 ) Summary: Closes https://github.com/pytorch/pytorch/issues/59846. https://github.com/pytorch/pytorch/issues/59846 is likely paranoia, and some of the test_streaming_backward_* in test_cuda.py already use gradient stealing (ie, they start with `.grad`s as None before backward). Regardless, this PR augments one of the tests to stress gradient stealing a bit more directly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/60065 Reviewed By: mrshenli Differential Revision: D29779518 Pulled By: ngimel fbshipit-source-id: ccbf278543c3adebe5f4ba0365b1dace9a14da9b	2021-07-19 20:39:55 -07:00
Michael Carilli	2fa6c7627e	[CUDA graphs][BC-breaking] Removes post-backward syncs on default stream (#60421 ) Summary: Before https://github.com/pytorch/pytorch/pull/57833, calls to backward() or grad() synced only the calling thread's default stream with autograd leaf streams at the end of backward. This made the following weird pattern safe: ```python with torch.cuda.stream(s): # imagine forward used many streams, so backward leaf nodes may run on many streams loss.backward() # no sync use grads ``` but a more benign-looking pattern was unsafe: ```python with torch.cuda.stream(s): # imagine forward used a lot of streams, so backward leaf nodes may run on many streams loss.backward() # backward() syncs the default stream with all the leaf streams, but does not sync s with anything, # so counterintuitively (even though we're in the same stream context as backward()!) # it is NOT SAFE to use grads here, and there's no easy way to make it safe, # unless you manually sync on all the streams you used in forward, # or move "use grads" back to default stream outside the context. use grads ``` mruberry ngimel and I decided backward() should have the [same user-facing stream semantics as any cuda op](https://pytorch.org/docs/master/notes/cuda.html#stream-semantics-of-backward-passes). In other words, the weird pattern should be unsafe, and the benign-looking pattern should be safe. Implementationwise, this meant backward() should sync its calling thread's current stream, not default stream, with the leaf streams. After https://github.com/pytorch/pytorch/pull/57833, backward syncs the calling thread's current stream AND default stream with all leaf streams at the end of backward. The default stream syncs were retained for temporary backward compatibility. This PR finishes https://github.com/pytorch/pytorch/pull/57833's work by deleting syncs on the default stream. 
With this PR, graph-capturing an entire backward() call should be possible (see the [test_graph_grad_scaling diffs](https://github.com/pytorch/pytorch/compare/master...mcarilli:streaming_backwards_remove_default_syncs?expand=1#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450R3641-R3642)). first paragraph has a formatting error which this PR should also fix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/60421 Reviewed By: albanD Differential Revision: D29370344 Pulled By: ngimel fbshipit-source-id: 3248bc5fb92fc517db0c15c897e5d7250f67d7fe	2021-06-24 17:34:02 -07:00
Luca Wehrstedt	bb9e1150ea	Revert D29342234: [pytorch][PR] [CUDA graphs][BC-breaking] Removes post-backward syncs on default stream Test Plan: revert-hammer Differential Revision: D29342234 (`675cea1adb`) Original commit changeset: 98e6be7fdd85 fbshipit-source-id: 84022973248b2254210eee57402df2c4f4bc43c6	2021-06-24 04:49:28 -07:00
Michael Carilli	675cea1adb	[CUDA graphs][BC-breaking] Removes post-backward syncs on default stream (#60421 ) Summary: Before https://github.com/pytorch/pytorch/pull/57833, calls to backward() or grad() synced only the calling thread's default stream with autograd leaf streams at the end of backward. This made the following weird pattern safe: ```python with torch.cuda.stream(s): # imagine forward used many streams, so backward leaf nodes may run on many streams loss.backward() # no sync use grads ``` but a more benign-looking pattern was unsafe: ```python with torch.cuda.stream(s): # imagine forward used a lot of streams, so backward leaf nodes may run on many streams loss.backward() # backward() syncs the default stream with all the leaf streams, but does not sync s with anything, # so counterintuitively (even though we're in the same stream context as backward()!) # it is NOT SAFE to use grads here, and there's no easy way to make it safe, # unless you manually sync on all the streams you used in forward, # or move "use grads" back to default stream outside the context. use grads ``` mruberry ngimel and I decided backward() should have the [same user-facing stream semantics as any cuda op](https://pytorch.org/docs/master/notes/cuda.html#stream-semantics-of-backward-passes). In other words, the weird pattern should be unsafe, and the benign-looking pattern should be safe. Implementationwise, this meant backward() should sync its calling thread's current stream, not default stream, with the leaf streams. After https://github.com/pytorch/pytorch/pull/57833, backward syncs the calling thread's current stream AND default stream with all leaf streams at the end of backward. The default stream syncs were retained for temporary backward compatibility. This PR finishes https://github.com/pytorch/pytorch/pull/57833's work by deleting syncs on the default stream. 
With this PR, graph-capturing an entire backward() call should be possible (see the [test_graph_grad_scaling diffs](https://github.com/pytorch/pytorch/compare/master...mcarilli:streaming_backwards_remove_default_syncs?expand=1#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450R3641-R3642)). first paragraph has a formatting error which this PR should also fix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/60421 Reviewed By: VitalyFedyunin, albanD Differential Revision: D29342234 Pulled By: ngimel fbshipit-source-id: 98e6be7fdd8550872f0a78f9a66cb8dfe75abf63	2021-06-23 23:35:24 -07:00
Michael Carilli	56481f9762	Ensure proper syncs for out-of-place grad creation (torch.autograd.grad) when backward ops run on side streams (#60127 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/59844. Streaming backwards collects "leaf streams" for AccumulateGrad functions that stash or accumulate .grad attributes for autograd leaf tensors, and syncs those streams with some ambient stream(s) so later ops can safely consume the grads on the ambient stream(s). But, currently, streaming backwards does not collect leaf streams for grads produced out-of-place (ie, not stashed onto a .grad attribute) by `torch.autograd.grad`, because these out-of-place grads are "captured" and returned before they reach an AccumulateGrad function. Some out-of-place grads might not even have an AccumulateGrad function to go to, because `torch.autograd.grad` can be told to make grads for non-leaf temporaries.[1] The upshot is, when streaming backwards makes ops that produce out-of-place gradients run on side streams, no ambient stream is told to sync on these side streams, so `torch.autograd.grad` doesn't offer the same post-call safe-use guarantees for grads as the leaf accumulation of `torch.autograd.backward`. This PR ensures `torch.autograd.grad` gives the same safe-use guarantees as `torch.autograd.backward` by also stashing leaf streams for grads created out-of-place. I augmented a streaming backwards test to include a torch.autograd.grad attempt. The test fails on current master[2] and passes with the engine.cpp diffs. I have no idea if this bug or its fix matter to distributed autograd. pritamdamania mrshenli should take a look before it's merged. [1] example: ```python leaf = torch.tensor(..., requires_grad=True) tmp = leaf * 2 loss = tmp.sum() torch.autograd.grad(loss, inputs=(tmp, leaf)) ``` Technically, because `torch.autograd.grad` can be told to produce grads for non-leaf temporaries, these streams might NOT be "leaf streams". Maybe I should rename `leaf_streams`? 
[2] the way the test currently fails is fun: it reports ``` AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 0 element(s) (out of 25) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.0 (5.0 vs. 5.0), which occurred at index (0, 0). ``` I suspect this [kafka trap](https://en.wiktionary.org/wiki/Kafkatrap) happens because assertEqual does a comparison test on the device, syncs on some bool result, sees failure and prints the tensors post-sync, at which point it IS safe to access the values. Pull Request resolved: https://github.com/pytorch/pytorch/pull/60127 Reviewed By: mrshenli Differential Revision: D29276581 Pulled By: albanD fbshipit-source-id: a9f797e2fd76e2f884cce5a32ecf5d9b704c88ee	2021-06-23 07:14:01 -07:00
Alexander Grund	3846cef2d7	Increase tolerance for test_grad_scaling_clipping (#60458 ) Summary: This makes it pass on A100 and with e.g. torch.manual_seed(6) called before running this test. Fixes https://github.com/pytorch/pytorch/issues/60455 Pull Request resolved: https://github.com/pytorch/pytorch/pull/60458 Reviewed By: mrshenli Differential Revision: D29309618 Pulled By: ngimel fbshipit-source-id: 72584087bcc949f7bc96b0644b701e69ae1fa025	2021-06-22 23:43:25 -07:00
Emilio Castillo	f9ec86a6c6	External stream (#59527 ) Summary: The previous PR is https://github.com/pytorch/pytorch/issues/57781 We now add two CUDA bindings that avoid using ctypes, in order to fix a Windows issue. However, we use ctypes to allocate the stream and create its pointer (we can do this with a 0-dim tensor too if it feels better). CC. ezyang rgommers ngimel mruberry Pull Request resolved: https://github.com/pytorch/pytorch/pull/59527 Reviewed By: albanD Differential Revision: D29053062 Pulled By: ezyang fbshipit-source-id: 661e7e58de98b1bdb7a0871808cd41d91fe8f13f	2021-06-14 13:46:11 -07:00
Michael Carilli	be038d8989	[CUDA graphs] Make stream semantics of backward calls consistent with other cuda ops (ci-all edition) (#57833 ) Summary: ci-all resubmit of https://github.com/pytorch/pytorch/pull/54227. Tests look good except for a few distributed autograd failures (pytorch_linux_xenial_cuda10_2_cudnn7_py3_multigpu_test) and rocm failures (pr/pytorch-linux-bionic-rocm4.1-py3.6). The common denominator in rocm failures appears to be multi-gpu activity: some [multiprocess DDP failures](https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-bionic-rocm4.1-py3.6-test1/8115/console), some [single-process failures](https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-bionic-rocm4.1-py3.6-test2/8115/console) where the single process has autograd ops that span devices. jeffdaily jithunnair-amd sunway513, could one of you take a look? The streaming backward change is also beneficial to rocm, I expect. For debugging rocm failures, I think we should ignore the multiprocess/DDP tests and focus on the single process cases. The root cause is probably the same and the single process cases are simpler. ---------------------------------- Update: Rocm failures are due to https://github.com/pytorch/pytorch/issues/59750. `2718a54032` is a workaround, to be updated once https://github.com/pytorch/pytorch/issues/59750 is fixed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/57833 Reviewed By: mruberry Differential Revision: D28942391 Pulled By: ngimel fbshipit-source-id: d6047e971c5f1c6386334bf3641402a92f12e2f8	2021-06-13 12:09:56 -07:00
Jeff Daily	24e27af683	[ROCm] enable kernel asserts (#49624 ) Summary: Addresses missing ROCm feature indicated in https://github.com/pytorch/pytorch/issues/38943. Pull Request resolved: https://github.com/pytorch/pytorch/pull/49624 Reviewed By: agolynski Differential Revision: D28902459 Pulled By: malfet fbshipit-source-id: 29c9b552770241a0ec52cd057ea45efc4389d838	2021-06-07 13:43:07 -07:00
Mike Ruberry	de40c8e495	Adds remaining OpInfos and removes redundant test generators (#55558 ) Summary: Per title. Pull Request resolved: https://github.com/pytorch/pytorch/pull/55558 Reviewed By: ngimel Differential Revision: D28922522 Pulled By: mruberry fbshipit-source-id: 89cefd93788bc8aa0683f4583cf5caa81aa2dc93	2021-06-06 14:52:26 -07:00
Rong Rong (AI Infra)	689a5edd0a	Revert D28326365: [pytorch][PR] Add `torch.cuda.streams.ExternalStream` Test Plan: revert-hammer Differential Revision: D28326365 (`d7ef9b73fb`) Original commit changeset: b67858c80339 fbshipit-source-id: 337588d40b96cf04e46e554fa481ae7fd4254478	2021-06-04 11:19:36 -07:00
Emilio Castillo	d7ef9b73fb	Add `torch.cuda.streams.ExternalStream` (#57781 ) Summary: This is required in https://github.com/pytorch/pytorch/pull/57110#issuecomment-828357947 We need to provide means to synchronize on externally allocated streams for dlpack support in python array data api. cc mruberry rgommers leofang asi1024 kmaehashi Pull Request resolved: https://github.com/pytorch/pytorch/pull/57781 Reviewed By: mrshenli Differential Revision: D28326365 Pulled By: ezyang fbshipit-source-id: b67858c8033949951b49a3d319f649884dfd0a91	2021-06-04 08:47:09 -07:00
Michael Carilli	3efefc4016	[CUDA graphs] Makes sure all graphs tests call empty_cache() at some point before capture (#59233 ) Summary: Graphs tests are sometimes flaky in CI ([example](https://app.circleci.com/pipelines/github/pytorch/pytorch/328930/workflows/0311199b-a0be-4802-a286-cf1e73f96c70/jobs/13793451)) because when the GPU runs near its max memory capacity (which is not unusual during a long test), sometimes, to satisfy new allocations that don't match any existing unused blocks, the caching allocator may call `synchronize_and_free_events` to wait on block end-of-life events and cudaFree unused blocks, then re-cudaMalloc a new block. For ungraphed ops this isn't a problem, but synchronizing or calling cudaFree while capturing is illegal, so `synchronize_and_free_events` raises an error if called during capture. The graphs tests themselves don't use much memory, so calling torch.cuda.empty_cache() at some point before their captures should ensure memory is available and the captures never need `synchronize_and_free_events`. I was already calling empty_cache() near the beginning of several graphs tests. This PR extends it to the ones I forgot. Pull Request resolved: https://github.com/pytorch/pytorch/pull/59233 Reviewed By: mruberry Differential Revision: D28816691 Pulled By: ngimel fbshipit-source-id: 5cd83e48e43b1107daed5cfa2efff0fdb4f99dff	2021-06-01 21:05:46 -07:00
Masaki Kozuki	7eade660c6	[PyTorch] Reduce errors of `foreach` functions (#56993 ) Summary: This is based on https://github.com/pytorch/pytorch/issues/48224. To make `foreach` more flexible, this PR pushes unsupported cases to slow path. Also, this adds some tests to verify that - `foreach` functions work with tensors of different dtypes and/or memory layouts in `7bd4b2c89f` - `foreach` functions work with tensors on different devices in a list, but are on the same device if the indices are the same: `def4b9b5a1` Future plans: 1. Improve the coverage of unittests using `ops` decorator & updating `foreach_unary_op_db` and creating `foreach_(binary\|pointwise\|minmax)_db`. 2. Support broadcasting in slow path. Ref: https://github.com/pytorch/pytorch/pull/52448 3. Support type promotion in fast path. Ref https://github.com/pytorch/pytorch/pull/52449 CC: ngimel mcarilli ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/56993 Reviewed By: zou3519 Differential Revision: D28630580 Pulled By: ngimel fbshipit-source-id: e26ee74a39a591025e18c1ead48948cb7ec53c19	2021-05-25 10:50:20 -07:00
Michael Carilli	dbedb1fa1c	[CUDA graphs] Sync after replay (#57556 ) Summary: Right now there's a bug in libcuda.so that triggers sometimes when graphs with certain topologies are replayed back to back without a sync in between. Replays that hit this bug turn into spaghetti: kernels reordered ignoring dependencies, kernels elided, corrupted results. Currently, the only workaround I know that fixes all our repros is a manual sync between replays. I'll remove the sync (or special case it based on cuda version) in a later PR, as soon as a fixed libcuda.so is available. The only substantive change is the cudaDeviceSynchronize, other lines changed are de-indenting an unneeded scope. The bug is in current and semi-recent public versions of libcuda.so. We discovered the bug recently and we're not sure yet which public release was first affected. The version that ships with 11.3 is definitely affected, versions that shipped with 11.1 and earlier are likely not affected. Pull Request resolved: https://github.com/pytorch/pytorch/pull/57556 Reviewed By: mruberry Differential Revision: D28343043 Pulled By: ngimel fbshipit-source-id: 3b907241aebdb8ad47ae96a6314a8b02de7bfa77	2021-05-11 09:38:47 -07:00
Gao, Xiang	db7b31358f	Fix internal assert in CUDA caching allocator when trying to allocate ~2^64 memory (#57571 ) Summary: When the memory requested is huge, some internal logic in CUDA caching allocator could overflow. The result of the overflow is the caching allocator gives a confusing error message. For example: ```python import torch import torch.nn as nn from torch.utils import cpp_extension cuda_source = """ #include <c10/cuda/CUDACachingAllocator.h> void my_fun(void) { size_t temp_storage_bytes = 18446744073708433663UL; auto& caching_allocator = ::c10::cuda::CUDACachingAllocator::get(); auto temp_storage = caching_allocator.allocate(temp_storage_bytes); return; } """ cpp_source = """ void my_fun(void); """ module = torch.utils.cpp_extension.load_inline( name="cuda_test_extension", cpp_sources=cpp_source, cuda_sources=cuda_source, functions="my_fun", extra_cuda_cflags=["--extended-lambda"], verbose=True, ) module.my_fun() print('done') ``` gives ``` Traceback (most recent call last): File "/home/gaoxiang/misc/caching-allocator.py", line 26, in <module> module.my_fun() RuntimeError: p.block != nullptr && p.block->ptr != nullptrINTERNAL ASSERT FAILED at "../c10/cuda/CUDACachingAllocator.cpp":991, please report a bug to PyTorch. 
Exception raised from alloc_block at ../c10/cuda/CUDACachingAllocator.cpp:991 (most recent call first): frame #0: <unknown function> + 0x83e93 (0x7f424f05ee93 in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10.so) frame #1: <unknown function> + 0x83bf9 (0x7f424f05ebf9 in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10.so) frame #2: <unknown function> + 0x839bd (0x7f424f05e9bd in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10.so) frame #3: std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>::operator()() const + 0x4c (0x7f428a3350a2 in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so) frame #4: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x40 (0x7f424f05dc34 in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10.so) frame #5: c10::detail::torchCheckFail(char const, char const, unsigned int, char const) + 0x97 (0x7f424f05c42f in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10.so) frame #6: <unknown function> + 0x6948b4 (0x7f42978fd8b4 in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libtorch_python.so) frame #7: <unknown function> + 0x22373 (0x7f424f0e2373 in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10_cuda.so) frame #8: <unknown function> + 0x1fa6c (0x7f424f0dfa6c in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10_cuda.so) frame #9: <unknown function> + 0x2337a (0x7f424f0e337a in 
/home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10_cuda.so) frame #10: <unknown function> + 0x23f18 (0x7f424f0e3f18 in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10_cuda.so) frame #11: my_fun() + 0x4b (0x7f4200338f74 in /home/gaoxiang/.cache/torch_extensions/cuda_test_extension/cuda_test_extension.so) frame #12: torch::detail::wrap_pybind_function_impl_<void (&)()>(void (&)(), std::integer_sequence<unsigned long>)::{lambda()#1}::operator()() const + 0x3f (0x7f420031e575 in /home/gaoxiang/.cache/torch_extensions/cuda_test_extension/cuda_test_extension.so) frame #13: <unknown function> + 0x570f2 (0x7f42003350f2 in /home/gaoxiang/.cache/torch_extensions/cuda_test_extension/cuda_test_extension.so) frame #14: <unknown function> + 0x536e2 (0x7f42003316e2 in /home/gaoxiang/.cache/torch_extensions/cuda_test_extension/cuda_test_extension.so) frame #15: <unknown function> + 0x4ef2f (0x7f420032cf2f in /home/gaoxiang/.cache/torch_extensions/cuda_test_extension/cuda_test_extension.so) frame #16: <unknown function> + 0x4ef93 (0x7f420032cf93 in /home/gaoxiang/.cache/torch_extensions/cuda_test_extension/cuda_test_extension.so) frame #17: <unknown function> + 0x3e7f2 (0x7f420031c7f2 in /home/gaoxiang/.cache/torch_extensions/cuda_test_extension/cuda_test_extension.so) <omitting python frames> frame #30: __libc_start_main + 0xd5 (0x7f42c60bab25 in /usr/lib/libc.so.6) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/57571 Reviewed By: VitalyFedyunin Differential Revision: D28224574 Pulled By: ezyang fbshipit-source-id: 
df440961f6eaf58048af36ae2a06c59f3c18baec	2021-05-06 01:36:58 -07:00
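The commit above does not say which computation overflowed. The sketch below shows one way a request near 2^64 bytes can wrap, assuming the size is rounded up to a fixed granularity using 64-bit unsigned arithmetic. `GRANULARITY`, `round_up_u64` and `round_up_checked` are illustrative names and values, not the allocator's actual constants or functions.

```python
MASK64 = (1 << 64) - 1
GRANULARITY = 2 * 1024 * 1024  # assumed rounding granularity, for illustration only


def round_up_u64(size, granularity=GRANULARITY):
    """Round up, truncating every intermediate to 64 bits as size_t math would."""
    bumped = (size + granularity - 1) & MASK64  # this addition can wrap around
    return (granularity * (bumped // granularity)) & MASK64


def round_up_checked(size, granularity=GRANULARITY):
    """Same rounding, but reject sizes whose rounded value does not fit in 64 bits."""
    if size > MASK64 - (granularity - 1):
        raise OverflowError(f"cannot round {size} bytes up to a multiple of {granularity}")
    return round_up_u64(size, granularity)


requested = 18446744073708433663  # the size used in the commit's reproducer
wrapped = round_up_u64(requested)
print(requested, "->", wrapped)

try:
    round_up_checked(requested)
    rejected = False
except OverflowError:
    rejected = True
print("rejected:", rejected)
```

Under these assumptions the unchecked rounding wraps to a value far smaller than the request, which is the kind of inconsistent state that surfaces later as a confusing internal assert; an explicit range check reports the problem at the point where it occurs.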
Michael Carilli	e841f335aa	[RELAND] [CUDA graphs] Avoid sync errors when graph capturing cudnn rnn calls that use cudnn dropout (#57373 ) Summary: https://github.com/pytorch/pytorch/pull/56433 was reverted because the test perceived internal dropout state creation as a memory leak. This PR resubmits with the leak check skipped. Pull Request resolved: https://github.com/pytorch/pytorch/pull/57373 Reviewed By: anjali411 Differential Revision: D28152186 Pulled By: ezyang fbshipit-source-id: 9a593fcdbbabbb09dc4e4221191663e94b697503	2021-05-03 11:41:40 -07:00
Wenlei Xie	20085f6d23	Support auto generation of device check (#56872 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56872 ghstack-source-id: 127914018 Test Plan: auto test Reviewed By: ezyang Differential Revision: D27986429 fbshipit-source-id: 0da8413b0b8e6810fcea27ed1de499f11f68bd1f	2021-05-01 12:02:09 -07:00
Michael Carilli	bbc3cc6718	[CUDA graphs] [BC-breaking] Makes torch.cuda.amp.GradScaler scale updates in-place for better composability with graph capture (#55562 ) Summary: I'd like the following pattern (a natural composition of Amp with full fwd+bwd capture) to work: ```python # Create "static_input" with dummy data, run warmup iterations, # call optimizer.zero_grad(set_to_none=True), then g = torch.cuda._Graph() s.wait_stream(torch.cuda.current_stream()) with torch.cuda.stream(s): optimizer.zero_grad(set_to_none=True) g.capture_begin() with autocast(): out = model(static_input) loss = loss_fn(out) scaler.scale(loss).backward() g.capture_end() torch.cuda.current_stream().wait_stream(s) # Training loop: for b in data: # optimizer.zero_grad() deliberately omitted, replay()'s baked-in backward will refill statically held .grads static_input.copy_(b) g.replay() scaler.step(optimizer) scaler.update() ``` Right now `GradScaler` can't work with this pattern because `update()` creates the scale tensor for the next iteration out of place. This PR changes `update()` to act in place on a long-lived scale tensor that stays static across iterations. I'm not sure how this change affects XLA (see https://github.com/pytorch/pytorch/pull/48570), so we shouldn't merge without approval from ailzhang yaochengji. Tagged bc-breaking because it's a change to the amp update utility function in native_functions.yaml. The function was never meant to be user-facing though. Pull Request resolved: https://github.com/pytorch/pytorch/pull/55562 Reviewed By: zou3519 Differential Revision: D28046159 Pulled By: ngimel fbshipit-source-id: 02018c221609974546c562f691e20ab6ac611910	2021-04-30 13:03:05 -07:00
Nikita Shulga	0a30d64c83	Revert D27966444: [pytorch][PR] [CUDA graphs] Avoid sync errors when graph capturing cudnn rnn calls that use cudnn dropout Test Plan: revert-hammer Differential Revision: D27966444 (`610c984d2e`) Original commit changeset: fe0df843c521 fbshipit-source-id: 8223b7f8b7183f0e7c9df6a7aa8f6b164e5634db	2021-04-28 14:51:10 -07:00
Michael Carilli	610c984d2e	[CUDA graphs] Avoid sync errors when graph capturing cudnn rnn calls that use cudnn dropout (#56433 ) Summary: Cudnn rnn calls that use cudnn dropout maintain a "state" buffer across calls. [DropoutState](`fe3f6f2da2/aten/src/ATen/native/cudnn/RNN.cpp (L1388-L1402)`)'s lock() and unlock() ensure the current call's use of the state buffer syncs with the end of the previous call's use of the state buffer (in case the previous call was on a different stream). Telling a capturing stream to wait on an event recorded in a non-capturing stream is an error (1). Telling a non-capturing stream to wait on an event recorded during capture is also an error (2). So DropoutState's flow can error in either of two simple use cases:
```python
rnn = nn.LSTM(512, 512, 2, dropout=0.5).cuda()

out1 = rnn(in1)

# calling cudnn rnn with dropout in capture after calling it uncaptured triggers 1
capture_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(capture_stream):
    graph.capture_begin()
    out2 = rnn(in2)
    graph.capture_end()
torch.cuda.current_stream().wait_stream(capture_stream)

# calling cudnn rnn with dropout uncaptured after calling it in capture triggers 2
out3 = rnn(in3)
```
This PR fixes both cases by telling `DropoutState::lock()`: "if the most recent end-of-usage event was in a different capture state (i.e., we crossed a capturing<->noncapturing border) or in a different capture, don't sync on it." While considering the fix I had two assumptions in mind: - only one capture using the RNN can be underway at a time in this process - no noncapturing ops in this process are issuing RNN calls while the capture using the RNN is underway. That second assumption seems brittle if, for example, someone wants to capture an internal region of the forward method of a model wrapped with DataParallel: multiple threads could be issuing RNN calls with some currently capturing and some not. We should talk about whether that use case seems realistic.
(Bigger-picture thoughts: I don't know if forcing calls to serialize on using the shared state buffer is the best design. And if we want to do it that way, we might as well run all cudnn rnns with dropout on a dedicated side stream synced with the surrounding stream (capturing or not), in which case I don't think this PR's event-handling diffs would be needed.) Pull Request resolved: https://github.com/pytorch/pytorch/pull/56433 Reviewed By: heitorschueroff Differential Revision: D27966444 Pulled By: ezyang fbshipit-source-id: fe0df843c521e0d48d7f2c81a17aff84c5497e20	2021-04-28 12:52:03 -07:00
Michael Carilli	ffdecc1ac4	[CUDA graphs] Allows DeviceCachingAllocator to capture cross-stream memory use (#55860 ) Summary: Safely deallocating and repurposing memory used across streams relies on recording end-of-life events in all an allocation's usage streams beyond its original allocation stream. The events are later queried to see if all GPU work in those extra streams that could have used the allocation is done (from the CPU's perspective) before repurposing the allocation for use in its original stream. The trouble is, calling EventQuery on an ordinary event recorded in a capturing stream is illegal. Calling EventQuery while capture is underway is also illegal. So when we call `tensor.record_stream` (or `c10::cuda::cudaCachingAllocator::recordStream`) on any tensor that's used or deleted in or around a capture, we often end up with a confusing error thrown from the cudaEventQuery in DeviceCachingAllocator::process_events(). This PR enables hopefully-safe deletion of tensors used across streams in or around capture with a conservative but simple approach: don't record or process end of life events for such tensors until the allocator's sure no captures are underway. You could whiteboard cases where this causes cross-stream-used allocations to be unavailable for reuse longer than absolutely necessary, but cross-stream-used allocations are uncommon, so for practical purposes this approach's impact on the memory footprint of captured sequences should be small. Pull Request resolved: https://github.com/pytorch/pytorch/pull/55860 Reviewed By: ejguan Differential Revision: D27822557 Pulled By: ezyang fbshipit-source-id: b2e18a19d83ed05bad67a8157a14a606ed14d04e	2021-04-18 20:32:10 -07:00
Arindam Roy	4cfbb2401f	[ROCM] Re-enable 3 previously failing tests in test_cuda.py (#55813 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/53190 The following tests are passing in ROCM 4.1. Hence re-enabling them. test_grad_scaling_multigpu test_streaming_backwards_device_transfer test_streaming_backwards_multiple_streams Pull Request resolved: https://github.com/pytorch/pytorch/pull/55813 Reviewed By: yinghai Differential Revision: D27725547 Pulled By: ngimel fbshipit-source-id: d8b3ed69fa44c2086f0666b4db0fabb30ad59439	2021-04-13 01:09:11 -07:00
Yukio Siraichi	93bf0ae6fc	Remove legacy constructor calls from pytorch codebase. (#54142 ) Summary: Follow up from https://github.com/pytorch/pytorch/issues/53889 Related to https://github.com/pytorch/pytorch/issues/47112 Removing every occurrence of the legacy constructor call present in PyTorch at: - _docs_ - _benchmarks_ - _test_ - _caffe2_ - _CONTRIBUTING.md_ Pull Request resolved: https://github.com/pytorch/pytorch/pull/54142 Reviewed By: ngimel Differential Revision: D27699450 Pulled By: mruberry fbshipit-source-id: 530aa3f5746cc8bc1407d5d51b2bbd8075e30546	2021-04-11 15:45:17 -07:00
Heitor Schueroff	5d68b3695c	[Relanding] Implemented torch.linalg.multi_dot (#52859 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52859 This reverts commit `92a4ee1cf6`. Added support for bfloat16 for CUDA 11 and removed fast-path for empty input tensors that was affecting autograd graph. Test Plan: Imported from OSS Reviewed By: H-Huang Differential Revision: D27402390 Pulled By: heitorschueroff fbshipit-source-id: 73c5ccf54f3da3d29eb63c9ed3601e2fe6951034	2021-04-01 04:49:05 -07:00
Kurt Mohler	6c235ef267	Allow `std=0` in `torch.normal`, and error if `std<0` (#51317 ) Summary: Part of https://github.com/pytorch/pytorch/issues/49998 Pull Request resolved: https://github.com/pytorch/pytorch/pull/51317 Reviewed By: bdhirsh Differential Revision: D27253939 Pulled By: mruberry fbshipit-source-id: af7a72c3d91549b1a88b73849b6973e7619dc50b	2021-03-31 21:06:07 -07:00
Kurt Mohler	3ddc6174da	Raise error in clip_grad_norm_ if norm is non-finite (#53843 ) Summary: BC-breaking note: This change throws errors for cases that used to silently pass. The old behavior can be obtained by setting `error_if_nonfinite=False` Fixes https://github.com/pytorch/pytorch/issues/46849 Pull Request resolved: https://github.com/pytorch/pytorch/pull/53843 Reviewed By: malfet Differential Revision: D27291838 Pulled By: jbschlosser fbshipit-source-id: 216d191b26e1b5919a44a3af5cde6f35baf825c4	2021-03-29 08:41:21 -07:00
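The non-finite check described in the BC-breaking note above can be pictured with a small pure-Python sketch. This is not the PyTorch implementation: `clip_grad_norm_sketch` is a hypothetical stand-in that works on a plain list of floats, and only the `error_if_nonfinite` name is taken from the commit message.

```python
import math


def clip_grad_norm_sketch(grads, max_norm, error_if_nonfinite=True):
    """Hypothetical stand-in: clip a list of floats to a maximum L2 norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # The behavior the commit describes: fail loudly instead of silently
    # continuing when the norm is inf or nan.
    if error_if_nonfinite and not math.isfinite(total_norm):
        raise RuntimeError(
            "total norm is non-finite; pass error_if_nonfinite=False to skip this check"
        )
    clip_coef = max_norm / (total_norm + 1e-6)
    if clip_coef < 1.0:
        # Scale the gradients in place, mirroring the trailing-underscore convention.
        grads[:] = [g * clip_coef for g in grads]
    return total_norm


grads = [3.0, 4.0]
norm = clip_grad_norm_sketch(grads, max_norm=1.0)  # L2 norm of (3, 4)

try:
    clip_grad_norm_sketch([1.0, float("inf")], max_norm=1.0)
    raised = False
except RuntimeError:
    raised = True
```

With `error_if_nonfinite=False` the sketch falls through and scales by a coefficient computed from the non-finite norm, which is the kind of silent pass the note refers to.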
albanD	1126d51de9	Remove useless contiguous calls from torch.matmul (#54616 ) Summary: This reduces the memory usage of matmul significantly for expanded batch size. This reduces the peak memory usage of ``` a = torch.rand(1, 1024, 1024, device="cuda") b = torch.rand(1024, 1024, 1, device="cuda") out = torch.matmul(a, b) ``` From 4GB to 16MB which is not too bad. It also fixes the same problem when `b` is not batched. Pull Request resolved: https://github.com/pytorch/pytorch/pull/54616 Reviewed By: ailzhang Differential Revision: D27327056 Pulled By: albanD fbshipit-source-id: 4bb5f4015aeab4174148512f3c5b8d1ffa97bf54	2021-03-26 06:34:24 -07:00
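A back-of-the-envelope calculation suggests where the 4GB figure above comes from. The sketch assumes float32 elements (the default dtype of `torch.rand`) and assumes the old peak was dominated by materializing `a` expanded to the broadcast batch size; that second point is an inference from the commit message, not something it states. `nbytes` is a hypothetical helper.

```python
def nbytes(shape, itemsize=4):
    """Bytes needed for a dense tensor of this shape (float32 by default)."""
    total = itemsize
    for dim in shape:
        total *= dim
    return total


a_shape = (1, 1024, 1024)
b_shape = (1024, 1024, 1)

# matmul broadcasts the leading (batch) dimension: 1 vs 1024 -> 1024.
batch = max(a_shape[0], b_shape[0])
expanded_a = (batch,) + a_shape[1:]          # (1024, 1024, 1024)
out_shape = (batch, a_shape[1], b_shape[2])  # (1024, 1024, 1)

MiB = 2**20
materialized_mib = nbytes(expanded_a) / MiB
operands_and_out_mib = (nbytes(a_shape) + nbytes(b_shape) + nbytes(out_shape)) / MiB

print(f"expanded copy of a: {materialized_mib:.0f} MiB")
print(f"a + b + out:        {operands_and_out_mib:.0f} MiB")
```

A contiguous copy of the expanded `a` alone costs 4096 MiB, while the two operands plus the output need only 12 MiB, the same order of magnitude as the 16MB the commit reports.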
Nikita Vedeneev	61b074581c	`torch.prod` backward for complex types. (#48125 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/53511 torch.det does depend on torch.prod, which in turn depends on several other functions, and they also depend on torch.prod, so there is a circular relationship, hence this PR will enable complex backward support for several functions at once. Pull Request resolved: https://github.com/pytorch/pytorch/pull/48125 Reviewed By: pbelevich Differential Revision: D27188589 Pulled By: anjali411 fbshipit-source-id: bbb80f8ecb83a0c3bea2b917627d3cd3b84eb09a	2021-03-19 09:44:08 -07:00
Michael Carilli	b27e678dfb	[RELAND] [CUDA graphs] Private mempools for CUDA graphs (#54038 ) Summary: Resubmit of https://github.com/pytorch/pytorch/pull/51436. Apparently some non-public windows builds run cuda tests on the default stream, so I changed a few capture tests to manually ensure all captures happen on non-default streams. Pull Request resolved: https://github.com/pytorch/pytorch/pull/54038 Reviewed By: mruberry Differential Revision: D27068649 Pulled By: ngimel fbshipit-source-id: 4284475fa40ee38c0f8faff05a2faa310cf8a207	2021-03-16 12:13:33 -07:00
Natalia Gimelshein	76129c7cdf	Revert D26993790: [pytorch][PR] [CUDA graphs] Private mempools for CUDA graphs Test Plan: revert-hammer Differential Revision: D26993790 (`90dfdef226`) Original commit changeset: a992eaee1b8c fbshipit-source-id: 6ddb4aedd6154d7d89847aa5a34181158d06a309	2021-03-12 13:07:28 -08:00
Michael Carilli	90dfdef226	[CUDA graphs] Private mempools for CUDA graphs (#51436 ) Summary: Implements https://github.com/pytorch/pytorch/issues/51075#issuecomment-768884685 and additions discussed offline with ezyang ngimel . (Calling it "simple" is charitable but it's not too bad). [High level strategy](https://github.com/pytorch/pytorch/pull/51436/files#diff-acc6337586bf9cdcf0a684380779300ec171897d05b8569bf439820dc8c93bd5R57-R82) The current design aggregates stats from private pools with the ordinary pools, which may or may not be what we want. Instead of adding PrivatePools as an internal feature of DeviceAllocator, I could inherit from DeviceAllocator (eg `DevicePrivateAllocator : public DeviceAllocator`) and create separate per-graph instances of the inherited class. I'm not sure if that would be better. Graph bindings in Python are almost unchanged from https://github.com/pytorch/pytorch/pull/48875: ```python # Same bindings as 48875, but now implicitly grabs a private mempool graph1.capture_begin() graph1.capture_end() # pool=... is new. It hints that allocations during graph2's capture may share graph1's mempool graph2.capture_begin(pool=graph1.pool()) graph2.capture_end() # graph3 also implicitly creates its own mempool graph3.capture_begin() graph3.capture_end() ``` Test plan (other suggestions appreciated): - [x] Stop maintaining manual references for all the tensors in my existing graphs+RNG tests. If private pools somehow give bad allocations, they should start failing intermittently. They run eager ops and eager allocations mixed with graph replays, so they may expose if eager ops and replays corrupt each other. - [x] `test_graph_two_successive`: Capture successive graphs, with the second graph using the first graph's result. Try with and without sharing a pool. Check results, also check memory stats to confirm sharing a pool saves memory. 
- [x] `test_graph_concurrent_replay`: Capture some graphs in separate private pools, replay them concurrently in different streams, check the results to make sure they don't corrupt each other's memory. Capture some graphs with a shared pool, replay them concurrently in different streams, check results, confirm they DO corrupt each other's memory. - [x] `test_graph_three_successive`: A three-graph case, checking the safe and unsafe replay patterns in [Restrictions of the Strawman API](https://github.com/pytorch/pytorch/issues/51075)). - [x] `test_graph_memory_stats_and_use_result_after_destroy_graph`: Comprehensively check torch.cuda.memory_stats() changes that result from graph capture and delete. Check that a tensor ref created during capture and held after graph delete stays valid until the tensor itself is deleted. Pull Request resolved: https://github.com/pytorch/pytorch/pull/51436 Reviewed By: mruberry Differential Revision: D26993790 Pulled By: ngimel fbshipit-source-id: a992eaee1b8c23628e7b388a5a3c26e0f80e54da	2021-03-12 11:07:47 -08:00
Jagadish Krishnamoorthy	ec6a7cace3	[ROCm] Fix the flaky test test_stream_event_nogil (#53850 ) Summary: Fix the flaky test in https://github.com/pytorch/pytorch/issues/53192 properly. Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/53850 Reviewed By: albanD Differential Revision: D26993582 Pulled By: malfet fbshipit-source-id: b0aefb188a236a5e94ee31a30ede7e8175443ff5	2021-03-11 16:07:41 -08:00
Jagadish Krishnamoorthy	0a549f9412	[ROCm] Disable flaky tests on ROCm (#53192 ) Summary: The disabled tests are tracked by https://github.com/pytorch/pytorch/issues/53190 Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/53192 Reviewed By: zhangguanheng66 Differential Revision: D26782204 Pulled By: mrshenli fbshipit-source-id: bc90b182c236249961da1f0d4894d29f6b44fa27	2021-03-11 08:29:12 -08:00
Edward Yang	758fb94fcb	Prefix assert_async with underscore, fix some bugs in assert_async CUDA testing (#53276 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53276 - One of the tests had a syntax error (but the test wasn't fine grained enough to catch this; any error was a pass) - Doesn't work on ROCm Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: D26820048 Test Plan: Imported from OSS Reviewed By: mruberry Pulled By: ezyang fbshipit-source-id: b02c4252d10191c3b1b78f141d008084dc860c45	2021-03-05 17:36:01 -08:00
Edward Yang	cfd9360d09	Revert D26837780: Revert D26819810: Revert D26815021: Revert D26744062: Add assert_async Test Plan: revert-hammer Differential Revision: D26837780 Original commit changeset: 21567cab5c0f fbshipit-source-id: 8ea735e5fdc97e32ae3fafd40297a1b8a7cd34b0	2021-03-04 20:45:35 -08:00
Edward Yang	1accffe450	Revert D26819810: Revert D26815021: Revert D26744062: Add assert_async Test Plan: revert-hammer Differential Revision: D26819810 Original commit changeset: e528260e1aa9 fbshipit-source-id: 21567cab5c0ff5f5e60a699d4d4678773a567c30	2021-03-04 18:48:56 -08:00
Edward Yang	9e5e5a7d96	Revert D26815021: Revert D26744062: Add assert_async Test Plan: revert-hammer Differential Revision: D26815021 Original commit changeset: 972eaafcdf14 fbshipit-source-id: e528260e1aa91df1873c73af00aa57addd671607	2021-03-04 09:28:25 -08:00
Mike Ruberry	b864457743	Revert D26744062: Add assert_async Test Plan: revert-hammer Differential Revision: D26744062 (`12d63cc2f5`) Original commit changeset: be6d2653afe5 fbshipit-source-id: 972eaafcdf14d96abdec3dea6bcbd5cac1f3d759	2021-03-04 04:11:25 -08:00
Edward Yang	12d63cc2f5	Add assert_async (#53086 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53086 Fixes #36853 Signed-off-by: Edward Z. Yang <ezyang@fb.com> Test Plan: Imported from OSS Reviewed By: albanD Differential Revision: D26744062 Pulled By: ezyang fbshipit-source-id: be6d2653afe584adf67a05b5d43185b40764650d	2021-03-03 16:18:07 -08:00
Kyle Chen	f2657d2e4f	[ROCm] Enable test cases in test_cuda.py for ROCm (#52739 ) Summary: Enabling four test cases in test_cuda.py for ROCm because they are passing. Signed-off-by: Kyle Chen <kylechen@amd.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/52739 Reviewed By: H-Huang Differential Revision: D26706321 Pulled By: ngimel fbshipit-source-id: 6907c548c4ac4e387f0eb7c646e8a01f0d036c8a	2021-03-01 12:54:40 -08:00
AJ San Joaquin	578f0a04c7	fix torch.nn.parallel.scatter_gather.gather to handle NamedTuples and handle moving output to CPU (#51104 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/50510 Allows `torch.nn.parallel.scatter_gather.gather` to accept a list of NamedTuples as input and return a NamedTuple whose elements are tensors. I added the author's fix using the `is_namedtuple` function. While testing this fix, I encountered a deprecation warning instructing me to use `'cpu'` instead of `-1` to move the outputs to the CPU. However, doing this causes an assertion error in the `_get_device_index` function. I solved this by handling the CPU case in the affected `forward` function. rohan-varma Pull Request resolved: https://github.com/pytorch/pytorch/pull/51104 Reviewed By: albanD Differential Revision: D26395578 Pulled By: rohan-varma fbshipit-source-id: 6e98c9ce1d9f1725973c18d24a6554c1bceae465	2021-02-11 15:50:28 -08:00
Chester Liu	58eb23378f	Clean up usage of torch._six partially (#49785 ) Summary: See https://github.com/pytorch/pytorch/issues/42919 Pull Request resolved: https://github.com/pytorch/pytorch/pull/49785 Reviewed By: mruberry Differential Revision: D25963833 Pulled By: bugra fbshipit-source-id: 11c90d6b8d3f206c9d0a4d8621b773beb10c6ba2	2021-02-08 13:58:34 -08:00
Jagadish Krishnamoorthy	506fdf9abf	[ROCm] disable tests for ROCm 4.0.1 (#51510 ) Summary: These tests are failing for ROCm 4.0/4.0.1 release. Disable the tests until they are fixed. - TestCuda.test_cudnn_multiple_threads_same_device - TestCudaFuser.test_reduction Pull Request resolved: https://github.com/pytorch/pytorch/pull/51510 Reviewed By: H-Huang Differential Revision: D26205179 Pulled By: seemethere fbshipit-source-id: 0c3d29989d711deab8b5046b458c772a1543d8ed	2021-02-02 14:39:08 -08:00
Nikita Shulga	43f0ccd1ec	torch.cuda.memory_allocated to return `{}` if not initialized (#51179 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/49952 Pull Request resolved: https://github.com/pytorch/pytorch/pull/51179 Reviewed By: ngimel Differential Revision: D26094932 Pulled By: malfet fbshipit-source-id: 0ec28ef9b0604245753d3f2b0e3536286700668d	2021-01-28 20:38:17 -08:00
Jeffrey Wan	6e3e57095c	Add complex support for torch.nn.L1Loss (#49912 ) Summary: Building on top of the work of anjali411 (https://github.com/pytorch/pytorch/issues/46640) Things added in this PR: 1. Modify backward and double-backward formulas 2. Add complex support for `new module tests` and criterion tests (and add complex tests for L1) 3. Modify some existing tests to support complex Pull Request resolved: https://github.com/pytorch/pytorch/pull/49912 Reviewed By: zhangguanheng66 Differential Revision: D25853036 Pulled By: soulitzer fbshipit-source-id: df619f1b71c450ab2818eb17804e0c55990aa8ad	2021-01-15 15:53:15 -08:00
Nikita Shulga	bf4fcab681	Fix SyncBatchNorm usage without stats tracking (#50126 ) Summary: In `batch_norm_gather_stats_with_counts_cuda` use `input.scalar_type()` if `running_mean` is not defined In `SyncBatchNorm` forward function create count tensor with `torch.float32` type if `running_mean` is None Fix a few typos Pull Request resolved: https://github.com/pytorch/pytorch/pull/50126 Test Plan: ``` python -c "import torch;print(torch.batch_norm_gather_stats_with_counts( torch.randn(1, 3, 3, 3, device='cuda'), mean = torch.ones(2, 3, device='cuda'), invstd = torch.ones(2, 3, device='cuda'), running_mean = None, running_var = None , momentum = .1, eps = 1e-5, counts = torch.ones(2, device='cuda')))" ``` Fixes https://github.com/pytorch/pytorch/issues/49730 Reviewed By: ngimel Differential Revision: D25797930 Pulled By: malfet fbshipit-source-id: 22a91e3969b5e9bbb7969d9cc70b45013a42fe83	2021-01-07 18:31:13 -08:00
Michael Carilli	ee271047b5	torch.utils.checkpoint.checkpoint + torch.cuda.amp (#49757 ) Summary: Adds a test to orphaned original PR (https://github.com/pytorch/pytorch/pull/40221). Should fix https://github.com/pytorch/pytorch/issues/49738 and https://github.com/pytorch/pytorch/issues/47183 Pull Request resolved: https://github.com/pytorch/pytorch/pull/49757 Reviewed By: mruberry Differential Revision: D25689609 Pulled By: ngimel fbshipit-source-id: 0a6adc11eb98382048ef9a9775e185dcdeff6010	2020-12-22 22:25:11 -08:00
Nikita Shulga	befe337072	Fix test_cuda_init_race skip rules (#49693 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/49432 Pull Request resolved: https://github.com/pytorch/pytorch/pull/49693 Reviewed By: walterddr, janeyx99 Differential Revision: D25668027 Pulled By: malfet fbshipit-source-id: 802cbd39e4ebe585709179f332b680f5f7978814	2020-12-21 14:30:00 -08:00
Michael Carilli	c068180a17	[CUDA graphs] Cuda RNG-safe graph capture and replay bindings (#48875 ) Summary: Part 2 of https://github.com/pytorch/pytorch/pull/46148 refactor. (part 1 was https://github.com/pytorch/pytorch/pull/48694.) Contains - a few more CUDAGeneratorImpl diffs to clean up graph capture interaction - Capture and replay bindings that interact correctly with CUDAGeneratorImpl - Tests. Diffs compile and tests pass on my machine (ubuntu 20.04, cuda 11.0) but it needs finetuning for many CI builds. See [Note [CUDA Graph-safe RNG states]](`02d89f9f1d/aten/src/ATen/CUDAGeneratorImpl.h (L13-L85)`) for the strategy, based on https://github.com/pytorch/pytorch/pull/46148#issuecomment-724414794. Pull Request resolved: https://github.com/pytorch/pytorch/pull/48875 Reviewed By: zou3519 Differential Revision: D25482654 Pulled By: ngimel fbshipit-source-id: 634dbc4c6c9d7d0d9a62dc81a52d430561f905fe	2020-12-14 10:51:58 -08:00
Jeff Daily	d5c4a80cfd	Allow ROCm CI to use non-default stream. (#48424 ) Summary: Revert https://github.com/pytorch/pytorch/issues/26394. Fixes https://github.com/pytorch/pytorch/issues/27356. Not all MIOpen handles were setting their stream to the current stream prior to running the op. Pull Request resolved: https://github.com/pytorch/pytorch/pull/48424 Reviewed By: H-Huang Differential Revision: D25420384 Pulled By: mruberry fbshipit-source-id: 051683ba9e3d264b71162bd344031a0c58bf6a41	2020-12-10 09:55:11 -08:00
x00480351	47aa253632	[Feature] Allow user to specify a fraction of the GPU memory. (#48172 ) Summary: Add a new function, torch.cuda.set_per_process_memory_fraction(fraction, device), to torch.cuda. Related: https://github.com/pytorch/pytorch/issues/18626 The fraction (a float from 0 to 1) limits the memory that the caching allocator may use on a GPU device. It can be set on any visible GPU. The allowed memory equals total memory * fraction. An OOM error is raised when the process tries to allocate more GPU memory than the allowed value. This function is similar to TensorFlow's per_process_gpu_memory_fraction. Note: this setting only limits the caching allocator in one process. If you are using multiprocessing, you need to apply this setting in each subprocess to limit its GPU memory, because each subprocess can have its own allocator.
## usage
In some cases, one needs to split a GPU device into two parts. The limit can be set before any GPU memory is used. E.g., for device 0, with each part taking half of the memory, the code is as follows:
```
torch.cuda.set_per_process_memory_fraction(0.5, 0)
```
The following example shows the behavior.
```python
import torch
torch.cuda.set_per_process_memory_fraction(0.5, 0)
torch.cuda.empty_cache()
total_memory = torch.cuda.get_device_properties(0).total_memory
# less than 0.5 will be ok:
tmp_tensor = torch.empty(int(total_memory * 0.499), dtype=torch.int8, device='cuda')
del tmp_tensor
torch.cuda.empty_cache()
# this allocation will raise an OOM:
torch.empty(total_memory // 2, dtype=torch.int8, device='cuda')

"""
It raises an error as follows:
RuntimeError: CUDA out of memory. Tried to allocate 5.59 GiB (GPU 0; 11.17 GiB total capacity; 0 bytes already allocated; 10.91 GiB free; 5.59 GiB allowed; 0 bytes reserved in total by PyTorch)
"""
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48172 Reviewed By: bdhirsh Differential Revision: D25275381 Pulled By: VitalyFedyunin fbshipit-source-id: d8e7af31902c2eb795d416b57011cc8a22891b8f	2020-12-03 11:45:56 -08:00
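The rule "allowed memory equals total memory * fraction" from the commit above can be checked without a GPU. The sketch below is only arithmetic: `allowed_bytes` and `fits` are hypothetical helpers, the 12 GiB device size is made up, and the real caching allocator also rounds request sizes, which this sketch ignores.

```python
GiB = 2**30


def allowed_bytes(total_memory, fraction):
    """Memory budget under a per-process fraction (hypothetical helper)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return int(total_memory * fraction)


def fits(request, already_reserved, total_memory, fraction):
    """True if the request stays within the budget (ignores allocator rounding)."""
    return already_reserved + request <= allowed_bytes(total_memory, fraction)


total = 12 * GiB  # assumed device size, for illustration only
budget = allowed_bytes(total, 0.5)

ok = fits(int(total * 0.499), 0, total, 0.5)   # just under half: within budget
too_big = fits(total // 2 + 1, 0, total, 0.5)  # just over half: exceeds budget
```

In the commit's own error message the same relationship is visible: the "allowed" figure is half of the reported total capacity.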
pbialecki	22c3ae8b57	Disable autocast cache for tensor views as fix for #48049 (#48696 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/48049 Root cause of the issue explained [here](https://github.com/pytorch/pytorch/issues/48049#issuecomment-736701769). This PR implements albanD's suggestion to add the `!t.is_view()` check and disable autocast caching for views of tensors. The added test checks for an increase in memory usage by comparing the initially allocated memory with the memory after 3 iterations using a single `nn.Linear` layer in a `no_grad` and `autocast` context. After this PR the memory usage in the original issue doesn't grow anymore and yields: ```python autocast: True 0: 0MB (peak 1165MB) 1: 0MB (peak 1264MB) 2: 0MB (peak 1265MB) 3: 0MB (peak 1265MB) 4: 0MB (peak 1265MB) 5: 0MB (peak 1265MB) 6: 0MB (peak 1265MB) 7: 0MB (peak 1265MB) 8: 0MB (peak 1265MB) 9: 0MB (peak 1265MB) ``` CC ngimel mcarilli Pull Request resolved: https://github.com/pytorch/pytorch/pull/48696 Reviewed By: bdhirsh Differential Revision: D25276231 Pulled By: ngimel fbshipit-source-id: e2571e9f166c0a6f6f569b0c28e8b9ca34132743	2020-12-02 20:25:13 -08:00
Jeff Daily	5dfced3b0d	work around #47028 until a proper fix is identified (#48405 ) Summary: Otherwise, this test will appear flaky for ROCm even though it is a generic PyTorch issue. CC albanD Pull Request resolved: https://github.com/pytorch/pytorch/pull/48405 Reviewed By: mrshenli Differential Revision: D25183473 Pulled By: ngimel fbshipit-source-id: 0fa19b5497a713cc6c5d251598e57cc7068604be	2020-11-26 18:33:19 -08:00
Gao, Xiang	315122ce15	Bump up the CUDA OOM test memory size (#48029 ) Summary: 80GB is no longer large any more https://nvidianews.nvidia.com/news/nvidia-doubles-down-announces-a100-80gb-gpu-supercharging-worlds-most-powerful-gpu-for-ai-supercomputing Hopefully, the new size could be OK until the end of Moore's Law :) Pull Request resolved: https://github.com/pytorch/pytorch/pull/48029 Reviewed By: linbinyu Differential Revision: D25003603 Pulled By: zou3519 fbshipit-source-id: 626b9c031daee950df8453be4d7643dd67647213	2020-11-17 11:16:31 -08:00
Jeff Daily	6906701bde	[ROCm] enable stream priorities (#47136 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47136 Reviewed By: mruberry Differential Revision: D24672457 Pulled By: ngimel fbshipit-source-id: 54f60c32df87cbd40fccd7fb1ecf0437905f01a3	2020-11-02 11:25:44 -08:00
Michael Carilli	3c643d112e	Pin destination memory for cuda_tensor.to("cpu", non_blocking=True) (#46878 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/39694. [`torch.cuda._sleep(int(100 * get_cycles_per_ms()))`](https://github.com/pytorch/pytorch/pull/46878/files#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450R511-R513) in the test helps avoid flakiness noted by ngimel (https://github.com/pytorch/pytorch/pull/35144#issuecomment-602103631). Pull Request resolved: https://github.com/pytorch/pytorch/pull/46878 Reviewed By: izdeby Differential Revision: D24550403 Pulled By: xw285cornell fbshipit-source-id: 1ecc35ef75f9a38ab332aacdf4835955105edafc	2020-10-29 15:42:55 -07:00
Jeff Daily	151f31ba27	remove event not ready assertion from TestCuda.test_copy_non_blocking (#46857 ) Summary: It is incorrect to assume that a newly recorded event will immediately query as False. This test is flaky on ROCm due to this incorrect assumption. Pull Request resolved: https://github.com/pytorch/pytorch/pull/46857 Reviewed By: albanD Differential Revision: D24565581 Pulled By: mrshenli fbshipit-source-id: 0e9ba02cf52554957b29dbeaa5093696dc914b67	2020-10-27 14:21:40 -07:00
anjali411	d94bd998ec	Update backward formulas (Re #44444 ) (#46275 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46275 Re #44444 Test Plan: Imported from OSS Reviewed By: zou3519 Differential Revision: D24285785 Pulled By: anjali411 fbshipit-source-id: c60ecd4fe4f144132085f2c91d3b950e92b2a491	2020-10-25 19:40:59 -07:00
ashish	88e94da580	Enable softmax and tiny norm FP16 tests on ROCm (#46363 ) Summary: This pull request enables the following tests on ROCm: * TestCuda.test_tiny_half_norm_ * TestNNDeviceTypeCUDA.test_softmax_cuda_float16 * TestNNDeviceTypeCUDA.test_softmax_cuda_float32 * TestNNDeviceTypeCUDA.test_softmax_results_cuda_float16 * TestNNDeviceTypeCUDA.test_softmax_results_cuda_float32 The earlier failures, because of which the tests were skipped, were because of a precision issue for FP16 compute on MI25 hardware with ROCm 3.7 and older. The fix was delivered in the compiler in ROCm 3.8. The pull request fixes https://github.com/pytorch/pytorch/issues/37493 cc: jeffdaily ezyang malfet mruberry Pull Request resolved: https://github.com/pytorch/pytorch/pull/46363 Reviewed By: heitorschueroff Differential Revision: D24325639 Pulled By: ezyang fbshipit-source-id: a7dbb238cf38c04b6592baad40b4d71725a358c9	2020-10-22 19:40:00 -07:00
Richard Barnes	52a970bac9	Minor cleaning of `test_cuda.py` (#46617 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46617 Sort includes, fix deprecated test warning Test Plan: ``` buck run mode/dev-nosan //caffe2/test:cuda ``` Reviewed By: drdarshan Differential Revision: D24429247 fbshipit-source-id: 65f53d7c904032e5c8f8ca45d1d2bb437358ffdd	2020-10-22 09:03:30 -07:00
Alexander Grund	5b0f400488	Replace list(map(...)) constructs by list comprehensions (#46461 ) Summary: As discussed in https://github.com/pytorch/pytorch/issues/46392 this makes the code more readable and possibly more performant. It also fixes a bug detected by this where the argument order of `map` was confused: `030a24906e (diff-5bb26bd3a23ee3bb540aeadcc0385df2a4e48de39f87ed9ea76b21990738fe98L1537-R1537)` Fixes https://github.com/pytorch/pytorch/issues/46392 Pull Request resolved: https://github.com/pytorch/pytorch/pull/46461 Reviewed By: ailzhang Differential Revision: D24367015 Pulled By: ezyang fbshipit-source-id: d55a67933cc22346b00544c9671f09982ad920e7	2020-10-19 18:42:49 -07:00
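The rewrite described above is mechanical, and it also makes a swapped `map` call harder to write by accident. The snippet is an illustration with made-up data, not code from the PR; it shows one way such a swap can surface, since `map` expects the function first and the iterable second.

```python
values = ["1", "2", "3"]

# Equivalent results: function first, iterable second.
via_map = list(map(int, values))
via_comprehension = [int(v) for v in values]

# Swapping the arguments asks Python to iterate over `int`, which fails.
try:
    list(map(values, int))
    swapped_error = None
except TypeError as exc:
    swapped_error = exc

print(via_comprehension, type(swapped_error).__name__)
```

A comprehension has no argument order to confuse, because the expression and the iterable occupy fixed syntactic positions.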
Michael Carilli	5640b79bf8	Allow consumer ops to sync on GraphRoot's gradient (#45787 ) Summary: Currently, a GraphRoot instance doesn't have an associated stream. Streaming backward synchronization logic assumes the instance ran on the default stream, and tells consumer ops to sync with the default stream. If the gradient the GraphRoot instance passes to consumer backward ops was populated on a non-default stream, we have a race condition. The race condition can exist even if the user doesn't give a manually populated gradient:
```python
with torch.cuda.stream(side_stream):
    # loss.backward() implicitly synthesizes a one-element 1.0 tensor on side_stream
    # GraphRoot passes it to consumers, but consumers first sync on default stream, not side_stream.
    loss.backward()

    # Internally to backward(), streaming-backward logic takes over, stuff executes on the same stream it ran on in forward,
    # and the side_stream context is irrelevant. GraphRoot's interaction with its first consumer(s) is the spot where
    # the side_stream context causes a problem.
```
This PR fixes the race condition by associating a GraphRoot instance, at construction time, with the current stream(s) on the device(s) of the grads it will pass to consumers. (I think this relies on GraphRoot executing in the main thread, before backward thread(s) fork, because the grads were populated on the main thread.) The test demonstrates the race condition. It fails reliably without the PR's GraphRoot diffs and passes with the GraphRoot diffs.
With the GraphRoot diffs, manually populating an incoming-gradient arg for `backward` (or `torch.autograd.grad`) and the actual call to `autograd.backward` will have the same stream-semantics relationship as any other pair of ops:
```python
# implicit population is safe
with torch.cuda.stream(side_stream):
    loss.backward()

# explicit population in side stream then backward in side stream is safe
with torch.cuda.stream(side_stream):
    kickoff_grad = torch.ones_like(loss)
    loss.backward(gradient=kickoff_grad)

# explicit population in one stream then backward kickoff in another stream
# is NOT safe, even with this PR's diffs, but that unsafety is consistent with
# stream-semantics relationship of any pair of ops
kickoff_grad = torch.ones_like(loss)
with torch.cuda.stream(side_stream):
    loss.backward(gradient=kickoff_grad)

# Safe, as you'd expect for any pair of ops
kickoff_grad = torch.ones_like(loss)
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream):
    loss.backward(gradient=kickoff_grad)
```
This PR also adds the last three examples above to cuda docs and references them from autograd docstrings. Pull Request resolved: https://github.com/pytorch/pytorch/pull/45787 Reviewed By: nairbv Differential Revision: D24138376 Pulled By: albanD fbshipit-source-id: bc4cd9390f9f0358633db530b1b09f9c1080d2a3	2020-10-07 08:53:53 -07:00
Rohan Varma	f8c1ca5dd8	Enable NamedTuple data type to work with DDP (#44220 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44220 Closes https://github.com/pytorch/pytorch/issues/44009 Currently if a dataloader returns objects created with a collections.namedtuple, this will incorrectly be cast to a tuple. As a result, if we have data of these types, there can be runtime errors during the forward pass if the module is expecting a named tuple. Fix this in `scatter_gather.py` to resolve the issue reported in https://github.com/pytorch/pytorch/issues/44009 ghstack-source-id: 113423287 Test Plan: CI Reviewed By: colesbury Differential Revision: D23536752 fbshipit-source-id: 3838e60162f29ebe424e83e474c4350ae838180b	2020-10-02 13:33:08 -07:00
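The namedtuple problem described in the commit above can be illustrated without PyTorch. The following is a minimal sketch, not the code in `scatter_gather.py`: `is_namedtuple` and `map_leaves` are hypothetical helper names, and the duck-typing check for `_fields` is an assumption about how a namedtuple might be detected. It shows that rebuilding a container with `type(obj)(*items)` keeps the named fields, whereas rebuilding with plain `tuple(...)` would drop them.

```python
from collections import namedtuple


def is_namedtuple(obj):
    # Assumption: treat any tuple that exposes _fields and _asdict as a namedtuple.
    return isinstance(obj, tuple) and hasattr(obj, "_fields") and hasattr(obj, "_asdict")


def map_leaves(fn, obj):
    """Apply fn to every leaf of a nested structure, preserving container types."""
    if is_namedtuple(obj):
        # Rebuild with the original class so field names survive.
        return type(obj)(*(map_leaves(fn, item) for item in obj))
    if isinstance(obj, tuple):
        return tuple(map_leaves(fn, item) for item in obj)
    if isinstance(obj, list):
        return [map_leaves(fn, item) for item in obj]
    return fn(obj)


Batch = namedtuple("Batch", ["inputs", "labels"])
batch = Batch(inputs=[1, 2], labels=(3, 4))

# Stand-in for "move every tensor to a device": here we just scale integers.
moved = map_leaves(lambda x: x * 10, batch)
print(type(moved).__name__, moved.inputs, moved.labels)
```

A module that reads `batch.inputs` in its forward pass keeps working with `moved`, which is the failure mode the commit message describes when the type is lost.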
Michael Carilli	72bc3d9de4	Use MTA for amp grad unscaling, enforce op math type in MTA functors, and allow op lambdas (#44778 ) Summary: Amp gradient unscaling is a great use case for multi tensor apply (in fact it's the first case I wrote it for). This PR adds an MTA unscale+infcheck functor. Really excited to have it for `torch.cuda.amp`. izdeby your interface was clean and straightforward to use, great work! Labeled as bc-breaking because the native_functions.yaml exposure of unscale+infcheck changes from [`_amp_non_finite_check_and_unscale_` to `_amp_foreach_non_finite_check_and_unscale_`]( https://github.com/pytorch/pytorch/pull/44778/files#diff-f1e4b2c15de770d978d0eb77b53a4077L6289-L6293). The PR also modifies Unary/Binary/Pointwise Functors to - do ops' internal math in FP32 for FP16 or bfloat16 inputs, which improves precision ([and throughput, on some architectures!](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions)) and has no downside for the ops we care about. - accept an instantiated op functor rather than an op functor template (`template<class> class Op`). This allows calling code to pass lambdas. Open question: As written now, the PR has MTA Functors take care of pre- and post-casting FP16/bfloat16 inputs to FP32 before running the ops. However, alternatively, the pre- and post-math casting could be deferred/written into the ops themselves, which gives them a bit more control. I can easily rewrite it that way if you prefer. Pull Request resolved: https://github.com/pytorch/pytorch/pull/44778 Reviewed By: gchanan Differential Revision: D23944102 Pulled By: izdeby fbshipit-source-id: 22b25ccad5f69b413c77afe8733fa9cacc8e766d	2020-10-01 07:51:16 -07:00
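As a rough model of what an "unscale + inf check" pass computes, the sketch below works on plain Python lists instead of CUDA tensors. It is not the multi-tensor-apply functor from the commit: `unscale_and_check` is a hypothetical name, and the single Python loop only stands in for one fused kernel launch over many gradient tensors.

```python
import math


def unscale_and_check(grads, inv_scale):
    """Multiply every gradient in place by inv_scale and report non-finite values.

    grads is a list of lists of floats, standing in for a list of tensors.
    Returns True if any incoming value was inf or NaN.
    """
    found_inf = False
    for grad in grads:
        for i, value in enumerate(grad):
            if not math.isfinite(value):
                found_inf = True
            grad[i] = value * inv_scale
    return found_inf


scaled_grads = [[2.0, 4.0], [8.0]]
overflowed = unscale_and_check(scaled_grads, inv_scale=0.5)
print(scaled_grads, overflowed)
```

Handling the whole list in one pass is the point of the multi-tensor approach: the per-tensor launch overhead is paid once rather than once per gradient.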
Nikita Shulga	c3a5aed5f7	Run pytorch_core CUDA tests on GPU using TPX Summary: Modify contbuild to disable sanitizers, add option to run "cuda" test using TPX RE (Note: this ignores all push blocking failures!) Test Plan: CI Reviewed By: walterddr, cspanda Differential Revision: D23854578 fbshipit-source-id: 327d7cc3655c17034a6a7bc78f69967403290623	2020-09-24 12:12:23 -07:00
Edward Yang	da4033d32a	Make cudaHostRegister actually useful on cudart. (#45159 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45159 By default, pybind11 binds void* to be capsules. After a lot of Googling, I have concluded that this is not actually useful: you can't actually create a capsule from Python land, and our data_ptr() function returns an int, which means that the function is effectively unusable. It didn't help that we had no tests exercising it. I've replaced the void* with uintptr_t, so that we now accept int (and you can pass data_ptr() in directly). I'm not sure if we should make these functions accept ctypes types; unfortunately, pybind11 doesn't seem to have any easy way to do this. Fixes #43006 Also added cudaHostUnregister which was requested. Signed-off-by: Edward Z. Yang <ezyang@fb.com> Test Plan: Imported from OSS Reviewed By: lw Differential Revision: D23849731 Pulled By: ezyang fbshipit-source-id: 8a79986f3aa9546abbd2a6a5828329ae90fd298f	2020-09-23 11:05:44 -07:00
Xiao Wang	d75c402755	Add cusolver to build, rewrite MAGMA inverse with cusolver (#42403 )

Summary: Fixes https://github.com/pytorch/pytorch/issues/42265

This PR adds cusolver to the pytorch build, and enables the use of cusolver/cublas library functions on GPU `torch.inverse` on certain tensor shapes. Specifically, when

* the tensor is two dimensional (single batch), or
* has >2 dimensions (multiple batches) and `batch_size <= 2`, or
* magma is not linked,

cusolver/cublas will be used. In other conditions, the current implementation of MAGMA will still be used.

`8c0949ae45/aten/src/ATen/native/cuda/BatchLinearAlgebra.cu (L742-L752)`

The reason for this is that for tensors with large batch_size, `cublasXgetrfBatched` and `cublasXgetriBatched` don't perform very well. For `batch_size > 1`, we launch cusolver functions in multiple streams. This lets cusolver functions run in parallel, and can greatly increase the performance. When `batch_size > 2`, the parallel launched cusolver functions are slightly slower than the current magma implementation, so we still use the current magma impl.

On CUDA 9.2, there were some numerical issues detected, so cusolver impl will not be used. The cusolver impl will also not be used on platforms other than Nvidia CUDA.

`060769feaf/aten/src/ATen/native/cuda/BatchLinearAlgebraLib.h (L10-L13)`

Note that there is a new heuristic used before cusolver/cublas calls here:

`8c0949ae45/aten/src/ATen/native/cuda/MiscUtils.h (L113-L121)`

where `use_loop_launch = true` means launch single batch cusolver functions in parallel, and `use_loop_launch = false` means use cublas_X_batched functions. When magma is enabled (only `batch_size <= 2` will be dispatched to cusolver/cublas), the heuristic will always return `true` and the cusolver calls are faster than small batch_size magma calls. When magma is disabled, this adds the functionality of `torch.inverse`, which was disabled before for all shapes (though large batch_size cublas performance may not be as good as magma's).

Checklist:
- [X] Add benchmark, cpu, gpu-before (magma), gpu-after (cusolver)
- [X] Rewrite single inverse (ndim == 2) with cusolver
- [X] Rewrite batched inverse (ndim > 2) with cublas
- [X] Add cusolver to build
- [x] Clean up functions related to `USE_MAGMA` define guard
- [x] Workaround for non-cuda platform
- [x] Workaround for cuda 9.2
- [x] Add zero size check
- [x] Add tests

Next step: If cusolver doesn't cause any problem in pytorch build, and there are no major performance regressions reported after this PR being merged, I will start porting other cusolver/cublas functions for linear algebra to improve the performance.

<details>
<summary> benchmark 73499c6 </summary>

benchmark code: https://github.com/xwang233/code-snippet/blob/master/torch.inverse/inverse-cusolver.ipynb

shape meaning:
* `[] 2 torch.float32 -> torch.randn(2, 2, dtype=torch.float32)`
* `[2] 4 torch.float32 -> torch.randn(2, 4, 4, dtype=torch.float32)`

| shape | cpu_time (ms) | gpu_time_before (magma) (ms) | gpu_time_after (ms) |
| --- | --- | --- | --- |
| [] 2 torch.float32 | 0.095 | 7.534 | 0.129 |
| [] 4 torch.float32 | 0.009 | 7.522 | 0.129 |
| [] 8 torch.float32 | 0.011 | 7.647 | 0.138 |
| [] 16 torch.float32 | 0.075 | 7.582 | 0.135 |
| [] 32 torch.float32 | 0.073 | 7.573 | 0.191 |
| [] 64 torch.float32 | 0.134 | 7.694 | 0.288 |
| [] 128 torch.float32 | 0.398 | 8.073 | 0.491 |
| [] 256 torch.float32 | 1.054 | 11.860 | 1.074 |
| [] 512 torch.float32 | 5.218 | 14.130 | 2.582 |
| [] 1024 torch.float32 | 19.010 | 18.780 | 6.936 |
| [1] 2 torch.float32 | 0.009 | 0.113 | 0.128 regressed |
| [1] 4 torch.float32 | 0.009 | 0.113 | 0.131 regressed |
| [1] 8 torch.float32 | 0.011 | 0.116 | 0.129 regressed |
| [1] 16 torch.float32 | 0.015 | 0.122 | 0.135 regressed |
| [1] 32 torch.float32 | 0.032 | 0.177 | 0.178 regressed |
| [1] 64 torch.float32 | 0.070 | 0.420 | 0.281 |
| [1] 128 torch.float32 | 0.328 | 0.816 | 0.490 |
| [1] 256 torch.float32 | 1.125 | 1.690 | 1.084 |
| [1] 512 torch.float32 | 4.344 | 4.305 | 2.576 |
| [1] 1024 torch.float32 | 16.510 | 16.340 | 6.928 |
| [2] 2 torch.float32 | 0.009 | 0.113 | 0.186 regressed |
| [2] 4 torch.float32 | 0.011 | 0.115 | 0.184 regressed |
| [2] 8 torch.float32 | 0.012 | 0.114 | 0.184 regressed |
| [2] 16 torch.float32 | 0.019 | 0.119 | 0.173 regressed |
| [2] 32 torch.float32 | 0.050 | 0.170 | 0.240 regressed |
| [2] 64 torch.float32 | 0.120 | 0.429 | 0.375 |
| [2] 128 torch.float32 | 0.576 | 0.830 | 0.675 |
| [2] 256 torch.float32 | 2.021 | 1.748 | 1.451 |
| [2] 512 torch.float32 | 9.070 | 4.749 | 3.539 |
| [2] 1024 torch.float32 | 33.655 | 18.240 | 12.220 |
| [4] 2 torch.float32 | 0.009 | 0.112 | 0.318 regressed |
| [4] 4 torch.float32 | 0.010 | 0.115 | 0.319 regressed |
| [4] 8 torch.float32 | 0.013 | 0.115 | 0.320 regressed |
| [4] 16 torch.float32 | 0.027 | 0.120 | 0.331 regressed |
| [4] 32 torch.float32 | 0.085 | 0.173 | 0.385 regressed |
| [4] 64 torch.float32 | 0.221 | 0.431 | 0.646 regressed |
| [4] 128 torch.float32 | 1.102 | 0.834 | 1.055 regressed |
| [4] 256 torch.float32 | 4.042 | 1.811 | 2.054 regressed |
| [4] 512 torch.float32 | 18.390 | 4.884 | 5.087 regressed |
| [4] 1024 torch.float32 | 69.025 | 19.840 | 20.000 regressed |

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42403
Reviewed By: ailzhang, mruberry
Differential Revision: D23717984
Pulled By: ngimel
fbshipit-source-id: 54cbd9ea72a97989cff4127089938e8a8e29a72b	2020-09-18 20:43:29 -07:00
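The dispatch conditions listed in the cusolver commit above can be written as a small predicate. This is a sketch under stated assumptions, not the C++ logic in `BatchLinearAlgebra.cu`: `uses_cusolver` and `magma_linked` are hypothetical names, `batch_size` is assumed to be the product of all leading (non-matrix) dimensions, and the CUDA 9.2 and non-NVIDIA exclusions mentioned in the message are left out.

```python
from math import prod


def uses_cusolver(shape, magma_linked=True):
    """Return True when the stated conditions would select cusolver/cublas."""
    if len(shape) < 2:
        raise ValueError("torch.inverse needs at least a 2-D input")
    # Assumption: batch_size is the product of the leading dims (1 for a 2-D input).
    batch_size = prod(shape[:-2])
    if not magma_linked:
        return True  # no MAGMA available, so cusolver/cublas is the only option
    if len(shape) == 2:
        return True  # single matrix
    return batch_size <= 2  # small batches only; larger ones stay on MAGMA


for shape in [(4, 4), (2, 4, 4), (4, 8, 8)]:
    print(shape, uses_cusolver(shape))
```

The `batch_size <= 2` cutoff matches the benchmark in the commit, where the `[4]` rows are all marked as regressed relative to MAGMA.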
Michael Carilli	2a87742ffa	Autocast wrappers for RNN cell apis (#44296 ) Summary: Should fix https://github.com/pytorch/pytorch/issues/42605. Pull Request resolved: https://github.com/pytorch/pytorch/pull/44296 Reviewed By: izdeby Differential Revision: D23580447 Pulled By: ezyang fbshipit-source-id: 86027b693fd2b648f043ab781b84ffcc1f72854d	2020-09-09 09:44:59 -07:00
Gao, Xiang	5e97f251a8	Enable TF32 support for cuDNN (#40737 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40737 Reviewed By: mruberry Differential Revision: D22801525 Pulled By: ngimel fbshipit-source-id: ac7f7e728b4b3e01925337e8c9996f26a6433fd2	2020-09-01 15:34:24 -07:00
Peter Bell	42f6c3b1f4	Raise error on device mismatch in addmm (#43505 ) Summary: Fixes gh-42282 This adds a device-mismatch check to `addmm` on CPU and CUDA. Although it seems like the dispatcher is always selecting the CUDA version here if any of the inputs are on GPU. So in theory the CPU check is unnecessary, but probably better to err on the side of caution. Pull Request resolved: https://github.com/pytorch/pytorch/pull/43505 Reviewed By: mruberry Differential Revision: D23331651 Pulled By: ngimel fbshipit-source-id: 8eb2f64f13d87e3ca816bacec9d91fe285d83ea0	2020-08-26 09:37:57 -07:00
Michael Carilli	fbf274f5a7	Autocast support for cudnn RNNs (#42385 ) Summary: Should close https://github.com/pytorch/pytorch/issues/36428. The cudnn RNN API expects weights to occupy a flat buffer in memory with a particular layout. This PR implements a "speed of light" fix: [`_cudnn_rnn_cast_reflatten`](https://github.com/pytorch/pytorch/pull/42385/files#diff-9ef93b6a4fb5a06a37c562b83737ac6aR327) (the autocast wrapper assigned to `_cudnn_rnn`) copies weights to the right slices of a flat FP16 buffer with a single read/write per weight (as opposed to casting them to FP16 individually then reflattening the individual FP16 weights, which would require 2 read/writes per weight). It isn't pretty but IMO it doesn't make rnn bindings much more tortuous than they already are. The [test](https://github.com/pytorch/pytorch/pull/42385/files#diff-e68a7bc6ba14f212e5e7eb3727394b40R2683) tries a forward under autocast and a backward for the full cross product of RNN options and input/weight/hidden dtypes. As for all FP16list autocast tests, forward output and backward grads are checked against a control where inputs (including RNN module weights in this case) are precasted to FP16 on the python side. Not sure who to ask for review, tagging ezyang and ngimel because Ed wrote this file (almost 2 years ago) and Natalia did the most recent major [surgery](https://github.com/pytorch/pytorch/pull/12600). Side quests discovered: - Should we update [persistent RNN heuristics](`dbdd28207c/aten/src/ATen/native/cudnn/RNN.cpp (L584)`) to include compute capability 8.0? Could be another PR but seems easy enough to include. - Many (maybe all?!) the raw cudnn API calls in [RNN.cpp](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cudnn/RNN.cpp) are deprecated in cudnn 8. I don't mind taking the AI to update them since my mental cache is full of rnn stuff, but that would be a substantial separate PR. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42385 Reviewed By: zhangguanheng66 Differential Revision: D23077782 Pulled By: ezyang fbshipit-source-id: a2afb1bdab33ba0442879a703df13dc87f03ec2e	2020-08-18 13:37:42 -07:00
Pritam Damania	872237c1f2	Output to stderr in distributed tests. (#42139 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42139 A bunch of tests were failing with buck since we would output to stdout and buck would fail parsing stdout in some cases. Moving these print statements to stderr fixes this issue. ghstack-source-id: 108606579 Test Plan: Run the offending unit tests. Reviewed By: mrshenli Differential Revision: D22779135 fbshipit-source-id: 789af3b16a03b68a6cb12377ed852e5b5091bbad	2020-07-29 19:23:34 -07:00
Mike Ruberry	4b6e5f42a4	Creates spectral ops test suite (#42157 ) Summary: In preparation for creating the new torch.fft namespace and NumPy-like fft functions, as well as supporting our goal of refactoring and reducing the size of test_torch.py, this PR creates a test suite for our spectral ops. The existing spectral op tests from test_torch.py and test_cuda.py are moved to test_spectral_ops.py and updated to run under the device generic test framework. Pull Request resolved: https://github.com/pytorch/pytorch/pull/42157 Reviewed By: albanD Differential Revision: D22811096 Pulled By: mruberry fbshipit-source-id: e5c50f0016ea6bb8b093cd6df2dbcef6db9bb6b6	2020-07-29 11:36:18 -07:00
lcskrishna	1f11e930d0	[ROCm] skip test_streams on rocm. (#41697 ) Summary: Skipping the test test_streams as it is flaky on rocm. cc: jeffdaily sunway513 Pull Request resolved: https://github.com/pytorch/pytorch/pull/41697 Reviewed By: zhangguanheng66 Differential Revision: D22644600 Pulled By: malfet fbshipit-source-id: b1b16d496e58a91c44c40d640851fd62a5d7393d	2020-07-21 08:55:07 -07:00
Xiang Gao	23174ca71b	[reland] Enable TF32 support for cuBLAS (#41498 ) Summary: fix rocm Pull Request resolved: https://github.com/pytorch/pytorch/pull/41498 Reviewed By: mruberry Differential Revision: D22560572 Pulled By: ngimel fbshipit-source-id: 5ee79e96cb29e70d9180830d058efb53d1c6c041	2020-07-15 21:00:55 -07:00
Alexander Grund	563b60b890	Fix flaky test_stream_event_nogil due to missing event sync (#41398 ) Summary: The test asserts that the stream is "ready" but doesn't wait for the event to be "executed" which makes it fail on some platforms where the `query` call occurs "soon enough". Fixes https://github.com/pytorch/pytorch/issues/38807 Pull Request resolved: https://github.com/pytorch/pytorch/pull/41398 Reviewed By: zhangguanheng66 Differential Revision: D22540012 Pulled By: ezyang fbshipit-source-id: 6f56d951e48133ce4f6a9a54534298b7d2877c80	2020-07-15 11:03:35 -07:00
Shen Li	3a63a939d4	Revert D22517785: [pytorch][PR] Enable TF32 support for cuBLAS Test Plan: revert-hammer Differential Revision: D22517785 (`288ece89e1`) Original commit changeset: 87334c893561 fbshipit-source-id: 0a0674f49c1bcfc98f7f88af5a8c7de93b76e458	2020-07-15 08:15:48 -07:00
Xiang Gao	288ece89e1	Enable TF32 support for cuBLAS (#40800 )

Summary: Benchmark on a fully connected network and torchvision models (time in seconds) on GA100:

| model | batch size | forward(TF32) | forward(FP32) | backward(TF32) | backward(FP32) |
|--------------------|------------|---------------|---------------|----------------|----------------|
| FC 512-128-32-8 | 512 | 0.000211 | 0.000321 | 0.000499 | 0.000532 |
| alexnet | 512 | 0.0184 | 0.0255 | 0.0486 | 0.0709 |
| densenet161 | 128 | 0.0665 | 0.204 | 0.108 | 0.437 |
| googlenet | 256 | 0.0925 | 0.110 | 0.269 | 0.326 |
| inception_v3 | 256 | 0.155 | 0.214 | 0.391 | 0.510 |
| mnasnet1_0 | 512 | 0.108 | 0.137 | 0.298 | 0.312 |
| mobilenet_v2 | 512 | 0.114 | 0.294 | 0.133 | 0.303 |
| resnet18 | 512 | 0.0722 | 0.100 | 0.182 | 0.228 |
| resnext50_32x4d | 256 | 0.170 | 0.237 | 0.373 | 0.479 |
| shufflenet_v2_x1_0 | 512 | 0.0463 | 0.0473 | 0.125 | 0.123 |
| squeezenet1_0 | 512 | 0.0870 | 0.0948 | 0.205 | 0.214 |
| vgg16 | 256 | 0.167 | 0.234 | 0.401 | 0.502 |
| wide_resnet50_2 | 512 | 0.186 | 0.310 | 0.415 | 0.638 |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40800
Reviewed By: mruberry
Differential Revision: D22517785
Pulled By: ngimel
fbshipit-source-id: 87334c8935616f72a6af5abbd3ae69f76923dc3e	2020-07-14 13:21:10 -07:00
Luca Wehrstedt	c20426f86d	Fix torch.cuda.check_error type errors (#41330 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41330 `torch.cuda.check_error` is annotated as taking an `int` as argument but when running `torch.cuda.check_error(34)` one would get: ``` TypeError: cudaGetErrorString(): incompatible function arguments. The following argument types are supported: 1. (arg0: torch._C._cudart.cudaError) -> str Invoked with: 34 ``` Even if one explicitly casted the argument, running `torch.cuda.check_error(torch._C._cudart.cudaError(34))` would give: ``` AttributeError: 'str' object has no attribute 'decode' ``` This PR fixes both issues (thus allowing `check_error` to be called with a un-casted int) and adds a test. ghstack-source-id: 107628709 Test Plan: Unit tests Reviewed By: ezyang Differential Revision: D22500549 fbshipit-source-id: 9170c1e466dd554d471e928b26eb472a712da9e1	2020-07-14 00:47:14 -07:00
SsnL	de7ac60cf4	Add out= variants for cuda.comm.broadcast/gather/scatter (#39681 ) Summary: Partially fixes https://github.com/pytorch/pytorch/issues/38911 Pull Request resolved: https://github.com/pytorch/pytorch/pull/39681 Differential Revision: D22161342 Pulled By: mrshenli fbshipit-source-id: 60295077159b02087823e93bb6ebac9d70adea0a	2020-06-24 12:58:19 -07:00
Michael Carilli	b4ccdef090	Allow torch.cuda.amp.GradScaler to support sparse gradients (#36786 )

Summary: Should close https://github.com/pytorch/pytorch/issues/35810.

I decided to keep sparse handling on the Python side for clarity, although it could be moved to the C++ side (into `_amp_non_finite_check_and_unscale_`) without much trouble.

For non-fp16 sparse grads the logic is simple (call `_amp_non_finite_check_and_unscale_` on `grad._values()` instead of `grad` itself). At least I hope it's that easy.

For fp16 sparse grads, it's trickier. Sparse tensors can be uncoalesced. From the [Note](https://pytorch.org/docs/master/sparse.html#torch.sparse.FloatTensor):

> Our sparse tensor format permits uncoalesced sparse tensors, where there may be duplicate coordinates in the indices; in this case, the interpretation is that the value at that index is the sum of all duplicate value entries.

An uncoalesced scaled fp16 grad may have values at duplicate coordinates that are all finite but large, such that adding them to make the coalesced version WOULD cause overflows. If I checked `_values()` on the uncoalesced version, it might not report overflows, but I think it should.

So, if the grad is sparse, fp16, and uncoalesced, I still call `_amp_non_finite_check_and_unscale_` to unscale `grad._values()` in-place, but I also double-check the coalesced version by calling a second `_amp_non_finite_check_and_unscale_` on `grad.coalesce()._values()`. `coalesce()` is out-of-place, so this call doesn't redundantly affect `grad._values()`, but it does have the power to populate the same `found_inf` tensor. The `is_coalesced()` check and `coalesce()` probably aren't great for performance, but if someone needs a giant embedding table in FP16, they're better than nothing and memorywise, they'll only create a copy of nnz gradient values+indices, which is still way better than changing the whole table to FP32.

An `unscale` variant with liberty to create unscaled grads out-of-place, and replace `param.grad` instead of writing through it, could get away with just one `_amp_non_finite_check_and_unscale_`. It could say `coalesced = grad.coalesce()`, do only the stronger `_amp_non_finite_check_and_unscale_` on `coalesced._values()`, and set `param.grad = coalesced`. I could even avoid replacing `param.grad` itself by going one level deeper and setting `param.grad`'s indices and values to `coalesced`'s, but that seems brittle and still isn't truly "in place".

You could whiteboard an uncoalesced fp32 grad with the same property, but fp32's range is big enough that I don't think it's realistic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/36786
Reviewed By: ezyang
Differential Revision: D22202832
Pulled By: ngimel
fbshipit-source-id: b70961a4b6fc3a4c1882f65e7f34874066435735	2020-06-24 09:10:49 -07:00
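The overflow scenario in the sparse-gradient commit above can be reproduced with the standard library alone. This sketch is illustrative rather than the `GradScaler` code: `fits_fp16` and `coalesce` are hypothetical helpers, a sparse gradient is modeled as parallel lists of indices and values, and `struct`'s half-precision format `"e"` is used as the fp16 range check.

```python
import math
import struct
from collections import defaultdict


def fits_fp16(value):
    """True if value is finite and can be stored as an IEEE half-precision float."""
    if not math.isfinite(value):
        return False
    try:
        struct.pack("<e", value)  # raises OverflowError beyond the fp16 range
        return True
    except OverflowError:
        return False


def coalesce(indices, values):
    """Sum values that share an index, as coalescing a sparse tensor would."""
    totals = defaultdict(float)
    for index, value in zip(indices, values):
        totals[index] += value
    return dict(totals)


# Index 3 appears twice; each entry is finite in fp16, but their sum is not.
indices = [0, 3, 3]
values = [1.0, 40000.0, 40000.0]

uncoalesced_ok = all(fits_fp16(v) for v in values)
coalesced = coalesce(indices, values)
coalesced_ok = all(fits_fp16(v) for v in coalesced.values())
print(uncoalesced_ok, coalesced_ok)
```

Checking only the uncoalesced values would miss the overflow at index 3, which is why the commit runs a second check on the coalesced values.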
Michael Carilli	3b040c478a	Make custom_fwd a no-op when not executed under autocast (#36171 )

Summary: Currently, a custom autograd function written with

```
@torch.cuda.amp.custom_fwd(cast_inputs=dtype)
def forward(ctx, *args):
    ...
```

casts incoming floating-point CUDA tensors to `dtype` unconditionally, regardless of whether the function executes in an autocast-enabled region. I think I had the wrong idea there. Autocast-disabled regions should give the user control of input types. Also, `custom_fwd(cast_inputs=dtype)`-decorated functions' behavior should align with native fp32list/fp16list functions. C++-side casting wrappers have no effect when autocast is disabled, and `custom_fwd`'s casting should behave the same way.

The present PR changes `custom_fwd` so it only casts in autocast-enabled regions (also updates custom_fwd to ignore fp64 inputs, like the C++ wrappers).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/36171
Differential Revision: D22179511
Pulled By: ngimel
fbshipit-source-id: 5a93d070179a43206066bce19da0a5a19ecaabbd	2020-06-23 10:23:02 -07:00
Michael Carilli	8066fba226	[RELAND2] Change AccumulateGrad to yield `.grad`s that match weights' memory layout (#40358 ) Summary: https://github.com/pytorch/pytorch/pull/40129 fixed the error responsible for the first revert, but exposed another error in the same test. This PR is intended as the "master copy" for merge, and it runs on full CI. Two other PRs (restricted to run on a small subset of CI) supporting debugging DDP failures/hangs with multiple devices per process (`test_c10d.py:DistributedDataParallelTest.test_grad_layout_1devicemodule_2replicaperprocess`). - https://github.com/pytorch/pytorch/pull/40290 tries the test with purely rowmajor contiguous params on an untouched master. In other words https://github.com/pytorch/pytorch/pull/40290 contains none of this PR's diffs aside from the test itself. - https://github.com/pytorch/pytorch/pull/40178, for comparison, tries the test with this PR's diffs. Both fail the same way, indicating failure is unrelated to this PR's other diffs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/40358 Differential Revision: D22165785 Pulled By: albanD fbshipit-source-id: ac7cdd79af5c080ab74341671392dca8e717554e	2020-06-22 17:13:21 -07:00
Alban Desmaison	08227fea4f	Revert D22079377: [pytorch][PR] [RELAND] Change AccumulateGrad to yield `.grad`s that match weights' memory layout Test Plan: revert-hammer Differential Revision: D22079377 Original commit changeset: 9bd2b7e0c34f fbshipit-source-id: c22cc349d790caa574eace0d63980854c33e5a59	2020-06-17 10:17:27 -07:00
Michael Carilli	1ec8ece2b9	[RELAND] Change AccumulateGrad to yield `.grad`s that match weights' memory layout (#40129 ) Summary: https://github.com/pytorch/pytorch/pull/34904 was reverted because it had a misconfigured 4 GPU test that for some reason wasn't caught by external CI ([example failure](https://app.circleci.com/pipelines/github/pytorch/pytorch/181719/workflows/cfb37cd9-9a0c-4738-898b-d683934cd308/jobs/5868948/steps)). This PR reverts the revert, and adds diffs that should repair the misconfigured test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/40129 Differential Revision: D22079377 Pulled By: albanD fbshipit-source-id: 9bd2b7e0c34fdaf887497b52037cfe82cba709c1	2020-06-17 09:02:54 -07:00
Alban Desmaison	f1e575a0bf	Revert D20496044: [pytorch][PR] Change AccumulateGrad to yield `.grad`s that match weights' memory layout Test Plan: revert-hammer Differential Revision: D20496044 Original commit changeset: 248d680f4b1b fbshipit-source-id: 6462b25e3fb9c8596c1da443389089f09c32df4d	2020-06-16 10:38:40 -07:00
Michael Carilli	2beb9690c3	Change AccumulateGrad to yield `.grad`s that match weights' memory layout (#34904 )

Summary: Currently, whether `AccumulateGrad` [steals](`67cb018462/torch/csrc/autograd/functions/accumulate_grad.h (L42)`) or [clones](`67cb018462/torch/csrc/autograd/functions/accumulate_grad.h (L80)`) an incoming gradient, the gradient ends up rowmajor contiguous, regardless of its param's layout. If the param's layout is channels last, or otherwise not rowmajor contiguous, later kernels that apply gradients to params are forced into an uncoalesced memory access pattern for either the param or the gradient. This may not sound like a big deal but for any binary op on large tensors it's a >3X increase in gmem traffic => 3X slowdown.

The present PR changes `AccumulateGrad` to prefer, where possible, stashing gradients that match their params' layouts (["Gradient Layout Contract"](https://github.com/pytorch/pytorch/pull/34904/files#diff-ef1a56d24f66b280dcdb401502d6a796R29-R38)).

Allowing `AccumulateGrad` to stash non-rowmajor-contiguous grads means DDP allreduces and DP reduces must allow non-rowmajor-contiguous grads. This PR extends DDP and DP to allow gradients with non-rowmajor-contiguous strides as long as their layout is nonoverlapping and dense.

For good measure, I include changes that allow all five nccl primitives (allreduce, reduce, broadcast, allgather, reducescatter) to act on non-rowmajor-contiguous tensors (again as long as each input's layout is nonoverlapping and dense, and as long as all tensors participating in a given collective have the same layout). The primitive comm changes aren't necessary to enable the DDP changes, but I wasn't sure this would end up true until I had written both sets of changes. I think primitive comm enablement is reasonable to keep in the PR, especially since the code for it is simple.

Channels last params will be a major beneficiary of this PR, but I don't see it as a channels-last-specific fix. The spirit is layout matching in general:
- Grads should be stashed with memory layouts matching their params.
- Src and dst tensors on opposite ends of collectives should have matching dense layouts.

This PR also updates autograd docs to describe potential BC-breaking changes below.

## BC notes
ngimel albanD gchanan

#### BC-breaking
In the common case where the user lets AccumulateGrad decide grad layouts, strides for grads of dense but non-rowmajor-contiguous params will change. Any user code that was accustomed to `view(-1)`ing these grads will break.

Also, the circumstances under which a grad can be stolen directly from the backward function that created it, as opposed to deep-copied by AccumulateGrad, have changed. In most cases we expect silent performance improvement, because we expect channels-last-aware backward kernels will create channels last gradients for channels last params. Now those can be stolen, whereas before this PR they were cloned and made rowmajor contiguous. IMO this is a mild BC breakage. Param backward hooks still see grads come in with whatever format the backward kernel gave them. The only BC breakage potential I see is if user code relies somehow on a grad in a hook having or not having the same deep memory as the eventual `param.grad`. Any such users hopefully know they're off the edge of the map and understand how to update their expectations.

#### BC escape hatches
At albanD's recommendation, this PR's changes to AccumulateGrad do not alter the pre-PR code's decisions about whether grad is accumulated in or out of place. Accumulations of new grads onto an existing `.grad` attribute were (usually) in-place before this PR and remain in-place after this PR, keeping the existing `.grad`'s layout. After this PR, if the user wants to force accumulation into a grad with a particular layout, they can preset `param.grad` to a zeroed tensor with the desired strides or call `grad.contiguous(desired format)`. This likely won't be as performant as letting AccumulateGrad establish grad layouts by cloning or stealing grads with contract-compliant strides, but at least users have a control point.

One limitation (present before this PR and unchanged by this PR): Presetting `param.grad` does not ensure in-place accumulation all the time. For example, if `create_graph=True`, or if incoming `new_grad` is dense and existing `variable_grad` is sparse, accumulation occurs out of place, and the out-of-place result may not match the existing grad's strides.

----------------------------

I also noticed some potential DDP improvements that I considered out of scope but want to mention for visibility:
1. make sure Reducer's ops sync with AccumulateGrad streams
2. ~to reduce CPU overhead and incur fewer kernel launches, lazily create flat `contents` tensors by a single `cat` kernel only when a bucket is full, instead of `copy_`ing grads into `contents` individually as soon as they are received.~ PR includes a [minor change](https://github.com/pytorch/pytorch/pull/34904/files#diff-c269190a925a4b0df49eda8a8f6c5bd3R312-R315) to divide grads while copying them into flat buffers, instead of copying them in, then dividing separately. Without cat+div fusion, div-while-copying is the best we can do.
3. https://github.com/pytorch/pytorch/issues/38942

Pull Request resolved: https://github.com/pytorch/pytorch/pull/34904
Differential Revision: D20496044
Pulled By: albanD
fbshipit-source-id: 248d680f4b1bf77b0a986451844ec6e254469217	2020-06-16 08:43:31 -07:00
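To make the layout terms in the AccumulateGrad commit concrete, the sketch below computes strides in plain Python for a 4-D `(N, C, H, W)` shape. None of these helpers come from PyTorch: `contiguous_strides`, `channels_last_strides`, and `is_non_overlapping_and_dense` are hypothetical names, and the density check is a simplified version that assumes all sizes are positive.

```python
def contiguous_strides(shape):
    """Row-major strides (in elements) for the given shape."""
    strides = []
    step = 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def channels_last_strides(shape):
    """Strides for an (N, C, H, W) shape stored in NHWC memory order."""
    n, c, h, w = shape
    return (h * w * c, 1, w * c, c)


def is_non_overlapping_and_dense(shape, strides):
    """Simplified check: some permutation of the dims is gap-free and row-major."""
    expected = 1
    for size, stride in sorted(zip(shape, strides), key=lambda pair: pair[1]):
        if size == 1:
            continue  # the stride of a size-1 dim never affects addressing
        if stride != expected:
            return False
        expected *= size
    return True


shape = (2, 3, 4, 5)
print(contiguous_strides(shape))     # (60, 20, 5, 1)
print(channels_last_strides(shape))  # (60, 1, 15, 3)
```

Both layouts are dense, so under the contract described above a gradient can match either one; a padded layout such as strides `(6, 1)` for shape `(2, 3)` would not qualify.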
kshitij12345	97dfdaaad8	torch.multinomial : fast-path for replacement=False (#39742 )

Summary: Benchmark with same build settings on same system.
gcc : version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
CUDA : 10.1
GPU : 1050ti

```python
import time

import torch
import numpy as np

for n, t in [(500_000, 10), (1_000_000, 10)]:
    for dtype in (torch.half, torch.float, torch.double):
        # Input Setup
        p = torch.from_numpy(np.random.rand(n)).to(dtype)
        want = 1000
        print(f'torch.multinomial(a) a.numel() == {n} for {t} times {dtype}')
        start = time.time()
        # Iterate
        for _ in range(t):
            torch.multinomial(p, want, replacement=False)
        print(f'Took:', time.time() - start)

print('***' * 10)

for n, t in [(50_000, 100), (100_000, 100)]:
    for dtype in (torch.half, torch.float, torch.double):
        # Input Setup
        p = torch.rand(n, device='cuda', dtype=dtype)
        want = 1000
        print(f'torch.multinomial(a) a.numel() == {n} for {t} times {dtype}')
        start = time.time()
        # torch.cuda.synchronize()
        # Iterate
        for _ in range(t):
            torch.multinomial(p, want, replacement=False)
        # torch.cuda.synchronize()
        print(f'CUDA Took:', time.time() - start)
```

Before:

```
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float16
Took: 80.64455389976501
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float32
Took: 3.7778031826019287
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float64
Took: 5.045570611953735
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float16
Took: 161.53191947937012
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float32
Took: 7.640851736068726
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float64
Took: 10.399673461914062
**************************************
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float16
CUDA Took: 4.873984098434448
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float32
CUDA Took: 4.713594436645508
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float64
CUDA Took: 11.167185068130493
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float16
CUDA Took: 7.195427417755127
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float32
CUDA Took: 7.669712066650391
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float64
CUDA Took: 20.20938801765442
```

After:

```
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float16
Took: 81.09321522712708
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float32
Took: 0.06062650680541992
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float64
Took: 0.0862889289855957
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float16
Took: 161.85304307937622
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float32
Took: 0.13271093368530273
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float64
Took: 0.17215657234191895
**************************************
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float16
CUDA Took: 0.035035133361816406
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float32
CUDA Took: 0.03631949424743652
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float64
CUDA Took: 0.05507040023803711
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float16
CUDA Took: 0.05105161666870117
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float32
CUDA Took: 0.05449223518371582
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float64
CUDA Took: 0.09161853790283203
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39742
Differential Revision: D21976915
Pulled By: ngimel
fbshipit-source-id: 34431f814f31b6dfd6179a89f8e4fa574da7a306	2020-06-10 20:42:55 -07:00
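The multinomial commit message above reports timings but does not describe the algorithm behind the fast path. For orientation only, here is one standard technique for weighted sampling without replacement in a single pass, the "exponential race": draw an Exp(1) variate per item, divide by the item's weight, and keep the `k` smallest results. Whether the PR uses exactly this is an assumption, and `sample_without_replacement` is a hypothetical name.

```python
import random


def sample_without_replacement(weights, k, rng=random):
    """Pick k distinct indices, favouring larger weights (exponential-race sketch)."""
    candidates = [(i, w) for i, w in enumerate(weights) if w > 0]
    if k > len(candidates):
        raise ValueError("not enough items with non-zero weight")
    # Each item "arrives" at time Exp(1) / weight; heavier items tend to arrive sooner.
    arrivals = [(rng.expovariate(1.0) / w, i) for i, w in candidates]
    arrivals.sort()
    return [i for _, i in arrivals[:k]]


random.seed(0)
picks = sample_without_replacement([0.0, 0.5, 0.3, 0.2], k=3)
print(picks)
```

The appeal of this formulation is that it replaces a sequential draw-and-renormalize loop with one elementwise operation followed by a top-k selection, both of which parallelize well.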


626 Commits