Summary:
Resubmit of https://github.com/pytorch/pytorch/pull/51436.
Apparently some non-public Windows builds run CUDA tests on the default stream, so I changed a few capture tests to manually ensure all captures happen on non-default streams.
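For reference, a minimal sketch of the pattern the updated tests enforce, written against the C++ `at::cuda::CUDAGraph` API for concreteness (the tests themselves are Python, and the captured work here is a placeholder):
```cpp
#include <ATen/ATen.h>
#include <ATen/cuda/CUDAGraph.h>
#include <c10/cuda/CUDAGuard.h>
#include <c10/cuda/CUDAStream.h>

int main() {
  // Captures are not allowed on the default (legacy) stream, so pin the
  // capture to a side stream explicitly instead of assuming the ambient
  // stream is already a non-default one.
  c10::cuda::CUDAStream side_stream = c10::cuda::getStreamFromPool();
  at::cuda::CUDAGraph graph;
  {
    c10::cuda::CUDAStreamGuard guard(side_stream);
    graph.capture_begin();
    auto x = at::ones({8}, at::device(at::kCUDA));  // placeholder captured work
    x = x * 2;
    graph.capture_end();
  }
  graph.replay();  // re-launches the captured kernels
  return 0;
}
```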
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54038
Reviewed By: mruberry
Differential Revision: D27068649
Pulled By: ngimel
fbshipit-source-id: 4284475fa40ee38c0f8faff05a2faa310cf8a207
Summary:
Implements https://github.com/pytorch/pytorch/issues/51075#issuecomment-768884685 and additions discussed offline with ezyang and ngimel. (Calling it "simple" is charitable, but it's not too bad.)
[High level strategy](https://github.com/pytorch/pytorch/pull/51436/files#diff-acc6337586bf9cdcf0a684380779300ec171897d05b8569bf439820dc8c93bd5R57-R82)
The current design aggregates stats from private pools with the ordinary pools, which may or may not be what we want.
Instead of adding PrivatePools as an internal feature of DeviceAllocator, I could inherit from DeviceAllocator (e.g., `DevicePrivateAllocator : public DeviceAllocator`) and create separate per-graph instances of the inherited class. I'm not sure if that would be better.
Graph bindings in Python are almost unchanged from https://github.com/pytorch/pytorch/pull/48875:
```python
# Same bindings as 48875, but now implicitly grabs a private mempool
graph1.capture_begin()
graph1.capture_end()
# pool=... is new. It hints that allocations during graph2's capture may share graph1's mempool
graph2.capture_begin(pool=graph1.pool())
graph2.capture_end()
# graph3 also implicitly creates its own mempool
graph3.capture_begin()
graph3.capture_end()
```
Test plan (other suggestions appreciated):
- [x] Stop maintaining manual references for all the tensors in my existing graphs+RNG tests. If private pools somehow give bad allocations, they should start failing intermittently. They run eager ops and eager allocations mixed with graph replays, so they may expose whether eager ops and replays corrupt each other.
- [x] `test_graph_two_successive`: Capture successive graphs, with the second graph using the first graph's result. Try with and without sharing a pool. Check results, also check memory stats to confirm sharing a pool saves memory.
- [x] `test_graph_concurrent_replay`: Capture some graphs in separate private pools, replay them concurrently in different streams, check the results to make sure they don't corrupt each other's memory. Capture some graphs with a shared pool, replay them concurrently in different streams, check results, confirm they DO corrupt each other's memory.
- [x] `test_graph_three_successive`: A three-graph case, checking the safe and unsafe replay patterns in [Restrictions of the Strawman API](https://github.com/pytorch/pytorch/issues/51075).
- [x] `test_graph_memory_stats_and_use_result_after_destroy_graph`: Comprehensively check torch.cuda.memory_stats() changes that result from graph capture and delete. Check that a tensor ref created during capture and held after graph delete stays valid until the tensor itself is deleted.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51436
Reviewed By: mruberry
Differential Revision: D26993790
Pulled By: ngimel
fbshipit-source-id: a992eaee1b8c23628e7b388a5a3c26e0f80e54da
Summary:
This frequently happens when PyTorch compiled with CUDA support is installed on a machine that does not have NVIDIA GPUs.
Fixes https://github.com/pytorch/pytorch/issues/47038
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51806
Reviewed By: ezyang
Differential Revision: D26285827
Pulled By: malfet
fbshipit-source-id: 9fd5e690d0135a2b219c1afa803fb69de9729f5e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50843
AT_ASSERTM is deprecated and should be replaced by either TORCH_CHECK or
TORCH_INTERNAL_ASSERT, depending on the situation.
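For illustration, a hedged sketch of the replacement rule (the function, condition, and messages are hypothetical): `TORCH_CHECK` is for validating user input, while `TORCH_INTERNAL_ASSERT` is for invariants whose failure indicates a bug in PyTorch itself.
```cpp
#include <c10/util/Exception.h>
#include <cstdint>

void resize_to(int64_t new_size, int64_t capacity) {
  // Before: AT_ASSERTM(new_size >= 0, "sizes must be non-negative");
  // User-facing argument validation -> TORCH_CHECK:
  TORCH_CHECK(new_size >= 0, "sizes must be non-negative, got ", new_size);
  // Internal invariant -> TORCH_INTERNAL_ASSERT:
  TORCH_INTERNAL_ASSERT(capacity >= new_size, "capacity tracking is corrupted");
}
```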
Test Plan: Imported from OSS
Reviewed By: ailzhang
Differential Revision: D26074365
Pulled By: ezyang
fbshipit-source-id: 46e13588fad4e24828f3cc99635e9cb2223a6c2c
Summary:
Currently, `alloc_block`'s error handling has a couple of (IMO) minor flaws. It might clear the error state even if the error had nothing to do with memory allocation. It might also clear the error state even if it didn't attempt a cudaMalloc, meaning it might clear an error state that came from some completely unrelated earlier CUDA call.
The diffs and comments are the best explanation of my preferred (new) error-checking policy.
The diffs add very little work to the common (successful, allocation satisfied by existing block) hot path. Most of the additional logic occurs in `alloc_block`, which is a slow path anyway (it tries cudaMalloc).
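A hedged sketch of the resulting policy (the helper is illustrative, not the actual `alloc_block` code): clear the sticky error state only when the error demonstrably came from our own cudaMalloc, and swallow it only when it is a plain out-of-memory failure.
```cpp
#include <c10/cuda/CUDAException.h>
#include <cuda_runtime.h>

// Returns true on success. On OOM, clears the error state we just caused and
// returns false so the caller can free cached blocks and retry.
bool try_cuda_malloc(void** ptr, size_t size) {
  cudaError_t err = cudaMalloc(ptr, size);
  if (err == cudaSuccess) {
    return true;
  }
  if (err == cudaErrorMemoryAllocation) {
    (void)cudaGetLastError();  // clear *our* OOM, nothing else
    return false;
  }
  // Any other error (possibly from an unrelated earlier call) is not ours to
  // swallow: report it without touching the sticky error state.
  C10_CUDA_CHECK(err);
  return false;  // unreachable; C10_CUDA_CHECK throws above
}
```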
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51158
Reviewed By: malfet, heitorschueroff
Differential Revision: D26101515
Pulled By: ezyang
fbshipit-source-id: 6b447f1770974a04450376afd9726be87af83c48
Summary:
All pretty minor. I avoided renaming `class DestructableMock` to `class DestructibleMock` and similar such symbol renames (in this PR).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49815
Reviewed By: VitalyFedyunin
Differential Revision: D25734507
Pulled By: mruberry
fbshipit-source-id: bbe8874a99d047e9d9814bf92ea8c036a5c6a3fd
Summary:
Since caffe2 and torch have been consolidated, CAFFE2_API should be merged with TORCH_API. Addresses a TODO.
Manually edited some references of the removed `CAFFE2_API`:
* `CONTRIBUTING.md`
* `caffe2/proto/CMakeLists.txt`
* `cmake/ProtoBuf.cmake`
* `c10/macros/Export.h`
* `torch/csrc/WindowsTorchApiMacro.h`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49496
Reviewed By: malfet, samestep
Differential Revision: D25600726
Pulled By: janeyx99
fbshipit-source-id: 7e068d959e397ac183c097d7e9a9afeca5ddd782
Summary:
Add a new function, torch.cuda.set_per_process_memory_fraction(fraction, device), to torch.cuda. Related: https://github.com/pytorch/pytorch/issues/18626
The fraction (a float from 0 to 1) limits the memory available to the caching allocator on the given GPU device. One can set it on any visible GPU. The allowed memory equals total memory * fraction, and an OOM error is raised when an allocation attempt would push GPU usage past the allowed value. This function is similar to TensorFlow's per_process_gpu_memory_fraction.
Note, this setting only limits the caching allocator within one process. If you use multiprocessing, you need to apply the setting in each subprocess to limit its GPU memory, because each subprocess can have its own allocator.
## Usage
In some cases, one needs to split a GPU device into two parts and set the limit before any GPU memory is used. E.g., on device 0, to let each part take half the memory:
```python
torch.cuda.set_per_process_memory_fraction(0.5, 0)
```
A fuller example:
```python
import torch
torch.cuda.set_per_process_memory_fraction(0.5, 0)
torch.cuda.empty_cache()
total_memory = torch.cuda.get_device_properties(0).total_memory
# less than 0.5 will be ok:
tmp_tensor = torch.empty(int(total_memory * 0.499), dtype=torch.int8, device='cuda')
del tmp_tensor
torch.cuda.empty_cache()
# this allocation will raise an OOM:
torch.empty(total_memory // 2, dtype=torch.int8, device='cuda')
"""
It raises an error as follows:
RuntimeError: CUDA out of memory. Tried to allocate 5.59 GiB (GPU 0; 11.17 GiB total capacity; 0 bytes already allocated; 10.91 GiB free; 5.59 GiB allowed; 0 bytes reserved in total by PyTorch)
"""
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48172
Reviewed By: bdhirsh
Differential Revision: D25275381
Pulled By: VitalyFedyunin
fbshipit-source-id: d8e7af31902c2eb795d416b57011cc8a22891b8f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48615
Convert the macro from `TORCH_CUDA_KERNEL_LAUNCH_CHECK` to `C10_CUDA_KERNEL_LAUNCH_CHECK`, since it is now accessible through c10, not just torch.
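For illustration, a hedged sketch of a call site after the rename (the kernel and launcher are placeholders):
```cpp
#include <c10/cuda/CUDAException.h>

__global__ void fill_kernel(float* out, float value, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = value;
}

void launch_fill(float* out, float value, int n, cudaStream_t stream) {
  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;
  fill_kernel<<<blocks, threads, 0, stream>>>(out, value, n);
  // Surfaces bad launch configurations and other launch errors that the
  // <<<...>>> syntax silently drops:
  C10_CUDA_KERNEL_LAUNCH_CHECK();
}
```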
Test Plan:
```
buck build //caffe2/caffe2:caffe2_cu
buck build //caffe2/aten:ATen-cu
buck test //caffe2/test:kernel_launch_checks -- --print-passing-details
```
Reviewed By: jianyuh
Differential Revision: D25228727
fbshipit-source-id: 9c65feb3d0ea3fbd31f1dcaecdb88ef0534f9121
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48277
We move `TORCH_CUDA_KERNEL_LAUNCH_CHECK` from `//caffe2/aten/src/ATen/cuda/Exceptions.h` to `//caffe2/c10/cuda/CUDAException.h`.
The primary reason is to allow us to use this macro in other subdirectories of //caffe2, not just in ATen. Refer to D24309971 (353e7f940f) for context.
An example of this use case is D24868557, where we add these checks to `//caffe2/caffe2/sgd`.
Also, this should not affect current files, because `Exceptions.h` includes `CUDAException.h`.
Test Plan:
```
buck build //caffe2/aten:ATen-cu
```
- https://fburl.com/buck/oq3rxbir
Also wait for sandcastle tests.
Reviewed By: ngimel
Differential Revision: D25101720
fbshipit-source-id: e2b05b39ff1413a21e64949e26ca24c8f7d0400f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47023
DeviceType pretty clearly only needs 1 byte. DeviceIndex only needs 1 byte given that machines don't have anywhere near 255 GPUs in them as far as I know.
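A hedged sketch of the resulting layout (types mirror c10; the static_assert plays the role of the safety assertion mentioned in the test plan):
```cpp
#include <cstdint>

// One byte each is plenty: a handful of device types, and far fewer than
// 127 GPUs per machine in practice.
enum class DeviceType : int8_t { CPU = 0, CUDA = 1 /* ... */ };
using DeviceIndex = int8_t;

struct Device {
  DeviceType type;
  DeviceIndex index;  // -1 means "the current/default device"
};

static_assert(sizeof(Device) == 2, "Device should now fit in two bytes");
```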
ghstack-source-id: 116901430
Test Plan: Existing tests, added assertion to catch if my assumption about DeviceIndex is incorrect
Reviewed By: dzhulgakov
Differential Revision: D24605460
fbshipit-source-id: 7c9a89027fcf8eebd623b7cdbf6302162c981cd2
Summary:
In `assertValidDevice()`, compare the device index against `caching_allocator.device_allocator` rather than against `device_no`.
This fixes potential crashes when the caching allocator is accessed before being initialized, for example by calling something like:
`python -c "import torch;print(torch.cuda.memory_stats(0))"`
Fixes https://github.com/pytorch/pytorch/issues/46437
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46439
Reviewed By: ngimel
Differential Revision: D24350717
Pulled By: malfet
fbshipit-source-id: 714e6e74f7c2367a9830b0292478270192f07a7f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45294
While tracking down a recent memory corruption bug we found that
cuda-memcheck wasn't finding the bad accesses, and ngimel pointed out that
it's because we use a caching allocator so a lot of "out of bounds" accesses
land in a valid slab.
This PR adds a runtime knob (`PYTORCH_NO_CUDA_MEMORY_CACHING`) that, when set,
bypasses the caching allocator's caching logic so that allocations go straight
to cudaMalloc. This way, cuda-memcheck will actually work.
Test Plan:
Insert some memory errors and run a test under cuda-memcheck;
observe that cuda-memcheck flags an error where expected.
Specifically I removed the output-masking logic here:
https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/tensorexpr/cuda_codegen.cpp#L819-L826
And ran:
```
PYTORCH_NO_CUDA_MEMORY_CACHING=1 cuda-memcheck pytest -k test_superslomo test_jit_fuser_te.py
```
Reviewed By: ngimel
Differential Revision: D23964734
Pulled By: bertmaher
fbshipit-source-id: 04efd11e8aff037b9edde80c70585cb820ee6e39
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42249
Main change is to bring Caffe2's superior error messages for CUDA initialization into c10 and use them in all code paths.
Basic logic:
| Case | Call to device_count() | init_cuda, e.g. allocating tensor |
| -- | -- | -- |
| all good | non-zero | just works |
| no gpus | 0, no warning | throw exception with good message |
| driver issues | 0, produce warning | throw exception with good message |
| out of memory with ASAN | 0, produce warning | throw exception with ASAN message |
Previously, the error thrown from init_cuda was very generic and the ASAN warning (if any) was buried in the logs.
Other clean up changes:
* cache device_count() always in a static variable
* move all ASAN macros into c10
Test Plan:
Hard to unittest because of build modes. Verified manually that the behavior from the table above holds by running the following script in different modes (ASAN/no-ASAN, CUDA_VISIBLE_DEVICES=):
```
print('before import')
import torch
print('after import')
print('devices: ', torch.cuda.device_count())
x = torch.tensor([1,2,3])
print('tensor creation')
x = x.cuda()
print('moved to cuda')
```
Reviewed By: ngimel
Differential Revision: D22824329
fbshipit-source-id: 5314007313a3897fc955b02f8b21b661ae35fdf5
Summary:
* Make c10::cuda functions regular non-inlined functions
* Add driver_version() and device_synchronize() functions
With this change I no longer see direct calls to the CUDA API when looking at Modules.cpp.obj.
FYI malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42251
Reviewed By: malfet
Differential Revision: D22826505
Pulled By: ziab
fbshipit-source-id: 8dc2f3e209d3710e2ce78411982a10e8c727573c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39759
Caffe2 has a mode where it uses PT's caching allocator. Somehow we were not calling the initialization explicitly.
Now, I have no idea why it worked before. Probably worth running a bisect separately.
Reviewed By: houseroad
Differential Revision: D21962331
fbshipit-source-id: f16ad6b27a67dbe0bda93939cca8c94620d22a09
Summary:
Reduces lock contention and BlockPool management costs by tracking applicable state in per-device structures.
`THCCachingAllocator` now maintains a set of `DeviceCachingAllocator` objects (one per device) each of which maintains its own allocator state and operations.
Only global state remains in the top-level THCCachingAllocator object -- namely, `allocated_blocks`, the mapping between the raw storage pointers and the allocator's underlying Block structure. Global operations deal mostly with this translation and then pass the bulk of the work on to the device-specific allocator.
Conversely, device-specific state and operations are comprised mostly of managing the device's underlying blocks; a structural sketch follows the list below.
This has the following benefits:
- Performance: Access to the global pointer map is serialized independently of the per-device state -- reducing lock contention between operations on different devices.
- Simplicity: Managing the block pools in separate device-specific objects is conceptually more intuitive, simplifies the code and makes certain operations more efficient -- even in the absence of contention (e.g. free_cached_blocks, synchronize_and_free_events, emptyCache, get_all_blocks, etc.)
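A hedged structural sketch of the split described above (member names follow the summary; the details are illustrative):
```cpp
#include <memory>
#include <mutex>
#include <unordered_map>
#include <vector>

struct Block;  // the allocator's per-allocation bookkeeping record

// Owns one device's block pools, stats, and lock; never touches other devices.
class DeviceCachingAllocator {
  std::recursive_mutex mutex;  // per-device lock: no cross-device contention
  // ... block pools, cudaMalloc/cudaFree logic, per-device events ...
};

class THCCachingAllocator {
  // The only truly global state: raw storage pointer -> Block translation.
  std::mutex mutex;
  std::unordered_map<void*, Block*> allocated_blocks;

  // One allocator per device; global operations mostly translate the pointer
  // and delegate the bulk of the work to the right device allocator.
  std::vector<std::unique_ptr<DeviceCachingAllocator>> device_allocator;
};
```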
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37567
Differential Revision: D21458556
Pulled By: colesbury
fbshipit-source-id: ef56cb373797b180df72f0998ebc35972c892288
Summary:
`THCudaMemGetInfo` is only used in `aten/src/ATen/native/cudnn/Conv.cpp`. We can extract `c10::cuda::CUDACachingAllocator::cacheInfo` out of it, use it in `aten/src/ATen/native/cudnn/Conv.cpp` directly, and drop the parts of `THCudaMemGetInfo` that are no longer used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37447
Differential Revision: D21302770
Pulled By: ezyang
fbshipit-source-id: 41ad68b8fd5ecc7bc666a6861789c6c1f743f420
Summary:
Setting the device can be expensive, especially when a debugger is present. We should check that the target device differs from the current one before setting it.
cc: ptrblck
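A minimal sketch of the check-before-set pattern (a hypothetical helper, not the exact diff):
```cpp
#include <c10/cuda/CUDAException.h>
#include <cuda_runtime.h>

// Skip the (potentially expensive) cudaSetDevice call when the requested
// device is already the current one.
void maybe_set_device(int device) {
  int current = -1;
  C10_CUDA_CHECK(cudaGetDevice(&current));
  if (current != device) {
    C10_CUDA_CHECK(cudaSetDevice(device));
  }
}
```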
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35438
Differential Revision: D20664084
Pulled By: ngimel
fbshipit-source-id: 2440b4c9d96c41b4a19d5b1e8e1756fa40f090f0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34723
Add min function to cuda math compat
Test Plan: unittest
Reviewed By: houseroad
Differential Revision: D20444517
fbshipit-source-id: 1a93343cc57249ef1101eeb7ef373266f6a2873a
Summary:
This PR aims to improve the interoperability with [CuPy](https://github.com/cupy/cupy/pulls).
Instead of having two separate and conflicting memory pools, with this PR CuPy can directly allocate memory from the PyTorch allocator by means of this proposal: https://github.com/cupy/cupy/pull/3126
We would like to gather feedback on whether this approach makes sense for PyTorch, or whether an alternative design would be preferable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33860
Differential Revision: D20212788
Pulled By: ngimel
fbshipit-source-id: bc1e08a66da1992d26021147bf645dc65239581c
Summary:
There are cases when we want to recover from CUDA OOM. For example, some cuDNN algorithms use huge workspaces, and we want to recover from OOM to pick a different algorithm; in such cases, there is no reason to catch all errors.
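A hedged sketch of the motivating fallback pattern at a call site (sizes and options are placeholders):
```cpp
#include <ATen/ATen.h>

// Try the fast algorithm's huge workspace first; on CUDA OOM, fall back to a
// slower algorithm's small workspace instead of treating the OOM as fatal.
at::Tensor allocate_workspace(int64_t huge_bytes, int64_t small_bytes) {
  auto options = at::device(at::kCUDA).dtype(at::kByte);
  try {
    return at::empty({huge_bytes}, options);
  } catch (const c10::Error&) {
    // Only an allocation failure should be recoverable here, which is why
    // the allocator shouldn't be catching unrelated CUDA errors.
    return at::empty({small_bytes}, options);
  }
}
```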
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33056
Differential Revision: D19795359
Pulled By: ezyang
fbshipit-source-id: a34e23bf6d172dc0257389251dafef5b38d27d2b
Summary:
After several discussions, we agreed not to put any extra safety check into recordStream, as either the check would cause failures in certain scenarios or there is no need to throw for user errors.
In summary, it simply does what is described in https://github.com/pytorch/pytorch/issues/27405: it checks whether a tensor was indeed allocated by a CUDACachingAllocator instance and, if it was, throws an internal error if the corresponding block cannot be retrieved.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30870
Differential Revision: D18851669
Pulled By: yxia11
fbshipit-source-id: c2f01798cd24f1fd0f35db8764057d5d333dab95
Summary:
Adds comprehensive memory instrumentation to the CUDA caching memory allocator.
# Counters
Added comprehensive instrumentation for the following stats:
- Allocation requests (`allocation`)
- Allocated memory (`allocated_bytes`)
- Reserved segments from cudaMalloc (`segment`)
- Reserved memory (`reserved_bytes`)
- Active memory blocks (`active`)
- Active memory (`active_bytes`)
- Inactive, non-releasable blocks (`inactive_split`)
- Inactive, non-releasable memory (`inactive_split_bytes`)
- Number of failed cudaMalloc calls that result in a cache flush and retry (`cuda_malloc_retries`)
- Number of OOMs (`num_ooms`)
Except for the last two, these stats are segmented between all memory, large blocks, and small blocks. Along with the current value of each stat, historical counts of allocs/frees as well as peak usage are tracked by the allocator.
# Snapshots
Added the capability to get a "memory snapshot" – that is, to generate a complete dump of the allocator block/segment state.
# Implementation: major changes
- Added `torch.cuda.memory_stats()` (and associated C++ changes) which returns all instrumented stats as a dictionary (a C++-side sketch follows this list).
- Added `torch.cuda.snapshot()` (and associated C++ changes) which returns a complete dump of the allocator block/segment state as a list of segments.
- Added memory summary generator in `torch.cuda.memory_summary()` for ease of client access to the instrumentation stats. Potentially useful to dump when catching OOMs. Sample output here: https://pastebin.com/uKZjtupq
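A hedged sketch of reading the counters from the C++ side (accessor and field spellings as in c10/cuda/CUDACachingAllocator.h; `torch.cuda.memory_stats()` is the Python-facing wrapper):
```cpp
#include <c10/cuda/CUDACachingAllocator.h>
#include <iostream>

void print_memory_stats(int device) {
  namespace alloc = c10::cuda::CUDACachingAllocator;
  alloc::DeviceStats stats = alloc::getDeviceStats(device);
  // Each stat is tracked for the aggregate pool as well as the large and
  // small pools; index by StatType to pick one.
  const auto agg = static_cast<size_t>(alloc::StatType::AGGREGATE);
  std::cout << "allocated_bytes current: " << stats.allocated_bytes[agg].current
            << ", peak: " << stats.allocated_bytes[agg].peak << "\n"
            << "reserved_bytes current: " << stats.reserved_bytes[agg].current
            << "\nnum_ooms: " << stats.num_ooms << "\n";
}
```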
# Implementation: minor changes
- Add error-checking helper functions for Python dicts and lists in `torch/csrc/utils/`.
- Existing memory management functions in `torch.cuda` moved from `__init__.py` to `memory.py` and star-imported to the main CUDA module.
- Add various helper functions to `torch.cuda` to return individual items from `torch.cuda.memory_stats()`.
- `torch.cuda.reset_max_memory_cached()` and `torch.cuda.reset_max_memory_allocated()` are deprecated in favor of `reset_peak_stats`. It's a bit difficult to think of a case where only one of those stats should be reset, and IMO this makes the peak stats collectively more consistent.
- `torch.cuda.memory_cached()` and `torch.cuda.max_memory_cached()` are deprecated in favor of `*memory_reserved()`.
- Style (add access modifiers in the allocator class, random nit fixes, etc.)
# Testing
- Added consistency check for stats in `test_cuda.py`. This verifies that the data from `memory_stats()` is faithful to the data from `snapshot()`.
- Ran on various basic workflows (toy example, CIFAR)
# Performance
Running the following speed benchmark: https://pastebin.com/UNndQg50
- Before this PR: 45.98 microseconds per tensor creation
- After this PR: 46.65 microseconds per tensor creation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27361
Differential Revision: D17758747
Pulled By: jma127
fbshipit-source-id: 5a84e82d696c40c505646b9a1b4e0c3bba38aeb6
Summary:
This PR addresses issue https://github.com/pytorch/pytorch/issues/7601.
Currently models that use streams explicitly in forward have to do a lot of extra work to make backwards respect those streams. This PR extends the (recently added) input tracing (see TypeAndShape) to record the devices and streams of inputs. The autograd engine then uses this metadata to enact the expected stream parallelism without extra work from the user.
For example, a model with forward declared like (original example courtesy of ngimel):
```python
def forward(self, x):
    x0 = x.clone()
    torch._C._cuda_setStream(self.stream1._cdata)
    y0 = self.fc1(x0)
    self.event1.record(stream=torch.cuda.current_stream())
    torch._C._cuda_setStream(self.stream2._cdata)
    y1 = self.fc2(x)
    self.event2.record(stream=torch.cuda.current_stream())
    self.stream2.wait_event(self.event1)
    return y0 + y1
```
currently runs backward on a single stream. With this change the kernels will go on the streams they were assigned in forward, and both forward and backward will (for appropriate sizes) run the fc1 and fc2 kernels simultaneously.
The crux of this change is, as mentioned, an expansion of the TypeAndShape tracing and a relatively simple change to the autograd engine to use CUDA events for stream synchronization. To make this efficient I also added a new AutoGPUAndStream class, exposed getting and setting streams on devices, and removed InputBuffer's AutoGPU (it's now redundant). While making these modifications I also fixed AutoGPU to check before setting the GPU when it's destroyed and to use THCudaCheck instead of its custom error handler. These changes mean that an often-unnecessary cudaSetDevice() call is avoided when inputs are added to a buffer.
In addition to allowing users to easily set and use streams that are respected in both forward and backward, this change may encourage modules to do the same and the expanded tracing might allow further optimizations in the autograd engine. (apaszke, for example, now after initial enumeration we know the number of devices that will be used by a graph task, which might help provide a sense of the "level of parallelism" we should expect.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8354
Test Plan: Two tests were added specifically for this behavior.
Differential Revision: D17275980
Pulled By: mruberry
fbshipit-source-id: 92bd50ac782ffa973b159fcbbadb7a083802e45d
Summary:
This is achieved by using `cuDevicePrimaryCtxGetState` as a way to check whether a primary context exists on a device. It is not too slow; here is a benchmark of a single call to it on CUDA 10.1, Titan Xp, driver 415.27:
```
---------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------
BM_cuDevicePrimaryCtxGetState 301 ns 301 ns 2319746
```
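For reference, a hedged sketch of the driver-API probe (the signature is from the CUDA driver documentation; the helper itself is illustrative):
```cpp
#include <cuda.h>

// Returns true iff `device` already has an active primary context, without
// creating one as a side effect (which a runtime-API call would do).
bool has_primary_context(CUdevice device) {
  unsigned int flags = 0;
  int active = 0;
  if (cuDevicePrimaryCtxGetState(device, &flags, &active) != CUDA_SUCCESS) {
    return false;
  }
  return active != 0;
}
```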
Commits:
1. Add `CUDAHooks::getDeviceWithPrimaryContext` which returns a device index with primary context (if exists).
Link `c10/cuda` against `libcuda` for device API calls.
2. Use `getDeviceWithPrimaryContext` to check primary context in `pin_memory`.
Fix `OptionalDeviceGuard` doc.
3. Refactor `test_cuda_primary_ctx.py` to support multiple tests.
Add test for this in that file.
Fixes https://github.com/pytorch/pytorch/issues/21081.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22229
Differential Revision: D16170194
Pulled By: zou3519
fbshipit-source-id: 485a45f211b7844c9e69c63f3b3b75194a796c5d
Summary:
Having the NVRTC stub in ATen is necessary to call driver APIs in ATen. This is currently blocking https://github.com/pytorch/pytorch/pull/22229.
`DynamicLibrary` is also moved as it is used in the stub code, and seems general enough.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22362
Differential Revision: D16131787
Pulled By: ezyang
fbshipit-source-id: add2ee8a8865229578aa00001a00d5a6671e0e73
Summary:
This renames the CMake `caffe2` target to `torch`, as well as renaming `caffe2_gpu` to `torch_gpu` (and likewise for other gpu target variants). Many intermediate variables that don't manifest as artifacts of the build remain for now with the "caffe2" name; a complete purge of `caffe2` from CMake variable names is beyond the scope of this PR.
The shell `libtorch` library that had been introduced as a stopgap in https://github.com/pytorch/pytorch/issues/17783 is again flattened in this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20774
Differential Revision: D15769965
Pulled By: kostmo
fbshipit-source-id: b86e8c410099f90be0468e30176207d3ad40c821
Summary:
Ops on a Process Group (pg) instance will hit an error when input/output tensors are created in a different process, because the pg calls `recordStream` on the `CUDACachingAllocator`, which only knows about tensors created within the same process.
The proposed solution is to add a `suppressError` arg (suggestions for better names?) to `recordStream`. See the comments in the code for the reasoning.
CC pichuang1984
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21449
Differential Revision: D15689736
Pulled By: mrshenli
fbshipit-source-id: e7fc81b167868f8666536067eaa7ae2c8584d88e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20665
Add GELU activation forward on CPU in PyTorch.
Compared to the current Python implementation of GELU used in BERT-style models, e.g.:
```python
def gelu(self, x):
    return x * 0.5 * (1.0 + torch.erf(x / self.sqrt_two))
```
The torch.nn.functional.gelu function can reduce the forward time from 333ms to 109ms (with MKL) / 112ms (without MKL) for input size = [64, 128, 56, 56] on a devvm.
Reviewed By: zheng-xq
Differential Revision: D15400974
fbshipit-source-id: f606b43d1dd64e3c42a12c4991411d47551a8121
Summary:
Fixes #20651
Communication collectives in `torch.distributed` call `CUDACachingAllocator::recordStream()` on input and output tensors to prevent their memory blocks being freed too early. `CUDACachingAllocator` uses tensor's data pointer to track memory blocks, which does not accept null pointers. However, empty tensor's `storage().data()` might be null. In this case, as there is no associated memory block for the empty tensor, it should be fine to make `recordStream()` a no-op.
Tests only cover broadcasting empty tensors with the GLOO backend, because GLOO does not support empty inputs (facebookincubator/gloo/issues/179). That can be addressed in either `ProcessGroupGloo` or GLOO itself. Will add more tests when that gap is filled.
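A hedged sketch of the guard (the surrounding lookup code is elided):
```cpp
#include <c10/cuda/CUDAStream.h>

// Empty tensors may have a null storage pointer, and no memory block was
// ever allocated for them, so there is nothing to track: just return.
void recordStream(void* ptr, c10::cuda::CUDAStream stream) {
  if (ptr == nullptr) {
    return;  // no-op for empty tensors
  }
  // ... look up the Block for `ptr` and stamp it with `stream` ...
}
```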
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20658
Differential Revision: D15399371
Pulled By: mrshenli
fbshipit-source-id: d29ebd1c72fddae49531f32695f81b89e42e5a4d
Summary:
The caching allocator tries to free all blocks on an out-of-memory
error. Previously, it did not free blocks that still had outstanding
stream uses. This change synchronizes on the outstanding events and
frees those blocks.
See #19219
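A hedged sketch of the new OOM path (the Block fields and helper are illustrative, not the actual allocator internals):
```cpp
#include <cuda_runtime.h>
#include <vector>

struct Block {
  void* ptr;
  std::vector<cudaEvent_t> outstanding_events;  // recorded stream uses
};

// On OOM, instead of skipping blocks with pending stream uses, wait for
// their events and release them too, giving the retry a better chance.
void synchronize_and_free(Block& block) {
  for (cudaEvent_t event : block.outstanding_events) {
    cudaEventSynchronize(event);
    cudaEventDestroy(event);
  }
  block.outstanding_events.clear();
  cudaFree(block.ptr);
}
```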
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19222
Differential Revision: D14925071
Pulled By: colesbury
fbshipit-source-id: a2e9fe957ec11b00ea8e6c0468436c519667c558
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18485
I don't know (1) how we landed the wrong version of the patch and (2) how this passed the push-blocking test.
Reviewed By: pjh5
Differential Revision: D14621961
fbshipit-source-id: 0a3953d7adcdc79727a61c2acff65f436dcafe55
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18445
ghimport-source-id: 30d018737bf6989bc68b7e3676f44e0ca6141fde
Stack from [ghstack](https://github.com/ezyang/ghstack):
* #18242 Test running a CUDA build on CPU machine.
* **#18445 Unify cudaGetDeviceCount implementations.**
I went about doing this by searching for calls to cudaGetDeviceCount,
and then methodically replacing them with references to c10::cuda::device_count()
or at::cuda::device_count().
There is a point to doing this: the various implementations wildly differed
in their handling of what to do when cudaGetDeviceCount returns an error.
The final standardized behavior is that **all errors are swallowed** and
we return device count of zero. This indirectly fixes running CUDA builds
on CPU, which was broken in #17847.
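A hedged sketch of the standardized behavior (not the literal implementation):
```cpp
#include <cuda_runtime.h>

// Every cudaGetDeviceCount failure (missing driver, no devices, init error)
// is swallowed and reported as "zero devices", so CUDA builds keep working
// on CPU-only machines.
int device_count() noexcept {
  int count = 0;
  cudaError_t err = cudaGetDeviceCount(&count);
  if (err != cudaSuccess) {
    (void)cudaGetLastError();  // clear the sticky error state
    return 0;
  }
  return count;
}
```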
I added 'noexcept' to the 'deviceCount' virtual method on DeviceGuardImpl.
This is a BC-breaking change for anyone inheriting from DeviceGuardImpl
but all you need to do is put 'noexcept' on your method and it is backwards
compatible with older libtorch.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D14612189
fbshipit-source-id: 3c8d186e3dd623c0e27625212c7ce30f75d943cb
Summary:
This is to fix #16141 and similar issues.
The idea is to track a reference to every shared CUDA Storage and deallocate memory only after a consumer process deallocates received Storage.
ezyang: done with cleanup. Same (insignificantly better) performance as the file-per-share solution, but it handles millions of shared tensors easily. Note: [ ] documentation in progress.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16854
Differential Revision: D13994490
Pulled By: VitalyFedyunin
fbshipit-source-id: 565148ec3ac4fafb32d37fde0486b325bed6fbd1
Summary:
```
NVIDIA changed the CUDA allocation behavior on Pascal GPUs. The
page size increased from 1MB to 2MB and allocations larger than 1MB
are now always page-aligned. Previously, allocations larger than 1MB
were aligned to 128KB boundaries.
This interacted poorly with the caching allocator. The remaining
memory in a page could only be filled by small cudaMalloc calls, but
the caching allocator never cudaMalloc's a chunk smaller than 1MB.
This behavior could also cause a large discrepancy between the memory
usage reported by nvidia-smi and the memory usage reported by
PyTorch, because nvidia-smi counts a partially used page as "full",
while PyTorch only counts the actual memory requested.
This PR makes a few changes to the caching allocator to better support
Pascal and Volta GPUs:
- All cudaMalloc calls are now multiples of 2MB (the page size)
- Requests between 1-10MB allocate (and split) a 20MB block to
reduce wasted space due to rounding
- Small requests are now packed into 2MB blocks (instead of 1MB)
This improves Mask R-CNN memory usage by 10-20% in internal tests on
Volta GPUs. Maxwell performance seems to be largely unchanged, but
it's possible that some use cases suffer slightly.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17120
Differential Revision: D14301536
Pulled By: colesbury
fbshipit-source-id: a8282315ea8f7b8ca149b5066fdeaecd0d404edf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16399
Catching cudaError_t return values in a few places, because it's `nodiscard` in ROCm. Unless we add -Wno-unused-result, it ends up as a compilation error.
Also, in c10/cuda/test, check whether the host has a GPU or not. We were silently throwing out the error before (so not really testing the CUDA API).
Reviewed By: bddppq
Differential Revision: D13828281
fbshipit-source-id: 587d1cc31c20b836ce9594e3c18f067d322b2934
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16721
The key line is that we have to set the stream to the default stream before calling the allocator. This is very interesting: it shouldn't be necessary, but seemingly is!
Reviewed By: dzhulgakov
Differential Revision: D13943193
fbshipit-source-id: c21014917d9fe504fab0ad8abbc025787f559287
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16720
I'm taking the deduplication slowly because there is something here
that is causing problems, and I want to figure out what it is.
Reviewed By: dzhulgakov
Differential Revision: D13943194
fbshipit-source-id: cbc08fee5862fdcb393b9dd5b1d2ac7250f77c4b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16615
This is another go at landing https://github.com/pytorch/pytorch/pull/16226
Now that the caching allocator is moved to c10_cuda, we can
delete the duplicate copy from Caffe2.
The difference between this and the previous PR is that this
version faithfully maintains the binding code; in particular,
we end up with a SECOND copy of the caching allocator in
this patch. I verified that this code does NOT cause a crash
in the workflow we canaried last time.
In further diffs, I plan to eliminate the second copy, and then
adjust the binding code.
Reviewed By: dzhulgakov
Differential Revision: D13901067
fbshipit-source-id: 66331fd4eadffd0a5defb3cea532d5cd07287872
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16510
This diff was supposed to be memory usage neutral, but based on
some internal flows involving cuDNN, it was not. Reverting pending
further investigation.
Original commit changeset: 03f1ebf7f11c
Reviewed By: xw285cornell
Differential Revision: D13863610
fbshipit-source-id: 15517e255fd6b0c064b65fb99f0ef19742236cfd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16226
Now that the caching allocator is moved to c10_cuda, we can
delete the duplicate copy from Caffe2.
Reviewed By: dzhulgakov, smessmer
Differential Revision: D13762540
fbshipit-source-id: 03f1ebf7f11c68c19aa0d66110156fe228da6138
Summary:
Some renaming and renamespacing also took place. I was originally planning not to do anything, but it turns out that it was easier to make HIPify work by using a namespace CUDACachingAllocator:: rather than THCCachingAllocator_, since :: is a word boundary but _ is not.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16119
Reviewed By: smessmer
Differential Revision: D13718768
fbshipit-source-id: 884a481d99027fd3e34471c020f826aa12225656
Summary:
1. Added `torch/csrc/cuda/Event.h` and `torch/csrc/cuda/Event.cpp` to bind the Python Event class to the C++ implementation.
2. Move all CUDA runtime invocations from `torch/cuda/streams.py` to C++ (a sketch of the C++ side follows this list).
3. Added tests to cover Stream and Event APIs. ~(event IPC handle tests are introduced in #15974)~
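A hedged sketch of the C++ layer these bindings sit on (at::cuda::CUDAEvent and c10::cuda::CUDAStream as in ATen; the function is illustrative):
```cpp
#include <ATen/cuda/CUDAEvent.h>
#include <c10/cuda/CUDAStream.h>

void overlap_with_event() {
  c10::cuda::CUDAStream producer = c10::cuda::getStreamFromPool();
  c10::cuda::CUDAStream consumer = c10::cuda::getStreamFromPool();

  at::cuda::CUDAEvent event;  // lazily creates the underlying cudaEvent_t
  // ... enqueue work on `producer` ...
  event.record(producer);     // mark the point the consumer must wait for
  event.block(consumer);      // consumer waits on the event; host not blocked
}
```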
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15937
Differential Revision: D13649001
Pulled By: mrshenli
fbshipit-source-id: 84ca58f35f6ba679a4ba33150ceba678d760d240
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15692
It was leading to occasional crashes with dynamically linked CUDA because the runtime was already destroyed.
Also, unique_ptr<T[]> is more suitable than deque<T> for the purpose.
Reviewed By: Yangqing
Differential Revision: D13571988
fbshipit-source-id: 37eb26dfbe361c49160367b53f87bd037c6c0e46
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15316
This starts cleaning up the files in c10 according to the module structure we decided on.
Move to c10/util:
- Half.h, Half-inl.h, Half.cpp, bitcasts.h
Move to c10/core:
- Device.h, Device.cpp
- DeviceType.h, DeviceType.cpp
i-am-not-moving-c2-to-c10
Reviewed By: dzhulgakov
Differential Revision: D13498493
fbshipit-source-id: dfcf1c490474a12ab950c72ca686b8ad86428f63
Summary:
See #15682
Pushing up this small PR to check if I am doing the right thing. If correct, more will follow for other Stream APIs. Questions will be added inline.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15737
Differential Revision: D13581400
Pulled By: mrshenli
fbshipit-source-id: 24afed7847b89b62f0692c79a101ec7ff9d9ee4d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15125
I realized that it is really bad juju if you fake a StreamId
out of thin air, because in general this isn't going to work.
So, make the constructor a lot scarier.
Most "faking StreamId out of thin air" happens because someone
just wants to put something on the default stream.
Reviewed By: dzhulgakov
Differential Revision: D13432800
fbshipit-source-id: a86991d6fc1d8aa4e54e8175e5f06f90856238e6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14248
This diff also introduces a horrifying hack to override CUDA's DeviceGuardImpl
with a HIPGuardImplMasqueradingAsCUDA, to accommodate PyTorch's current
behavior of pretending CUDA is HIP when you build with ROCm enabled.
Reviewed By: bddppq
Differential Revision: D13145293
fbshipit-source-id: ee0e207b6fd132f0d435512957424a002d588f02
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13920
I want to move CUDAStream and CUDAGuard to c10_cuda without also
bringing along CUDAContext or CUDAEvent for the ride (at least for
now). To do this, I need to eliminate those dependencies.
There are a few functions in CUDAContext.h that don't really need THCState, so they're separated out and put in the general-purpose c10/cuda/CUDAFunctions.h.
Reviewed By: smessmer
Differential Revision: D13047468
fbshipit-source-id: 7ed9d5e660f95805ab39d7af25892327edae050e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13900
Add c10 cuda library.
Right now, this is not used by anything, and only tests if the CUDA
headers are available (and not, e.g., that linking works.)
Extra changes:
- cmake/public/cuda.cmake is now correctly include-guarded, so you
can include it multiple times without trouble.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Reviewed By: smessmer
Differential Revision: D13025313
fbshipit-source-id: fda85b4c35783ffb48ddd6bbb98dbd9154119d86
Summary:
- Speed up hipify_python.py by blacklisting useless (and quite large)
directory trees that it would otherwise recurse into
- Pass around relative paths instead of absolute paths. This makes it
easier to do filename matches based on the root of the tree.
- Redo the streaming output to contain more useful information
- Make it handle c10/cuda correctly, rewrite c10::cuda to
c10::hip, and the header name from CUDAMathCompat.h to
CUDAHIPCompat.h
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13973
Differential Revision: D13062374
Pulled By: ezyang
fbshipit-source-id: f0858dd18c94d449ff5dbadc22534c695dc0f8fb