Summary:
Older versions of MIOpen (<=2.2) don't have the `miopenGetVersion` api, but MIOpen is always a part of the ROCm builds, so do NOT set `lib` to None for ROCm builds. `__cudnn_version` will be `None` for older versions of MIOpen.
Setting `lib` to `None` ends up printing the following erroneous warning when running unit tests:
```
/root/.local/lib/python3.6/site-packages/torch/backends/cudnn/__init__.py:120: UserWarning: cuDNN/MIOpen library not found. Check your LD_LIBRARY_PATH
}.get(sys.platform, 'LD_LIBRARY_PATH')))
```
Eg.: https://ci.pytorch.org/jenkins/job/pytorch-builds/job/py3.6-clang7-rocmdeb-ubuntu16.04-test2/18387/consoleFull
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33837
Differential Revision: D20369285
Pulled By: xw285cornell
fbshipit-source-id: e82e6f8f5bccb486213cf868f40aece41ce11f98
Summary:
Because `torch_python` is supposed to be thin wrapper around `torch`
In this PR, all invocation of functions from nccl library are moved from python_nccl.cpp (which is part of torch_python) to nccl.cpp (which is part of torch_cuda)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36193
Test Plan: CI
Differential Revision: D20930047
Pulled By: malfet
fbshipit-source-id: 7f278610077df6ac5dc3471c1a1b5d51e653ef9c
Summary:
This PR aims to improve the interoperability with [CuPy](https://github.com/cupy/cupy/pulls).
Instead of having two separate and conflicting memory pools. With this PR, CuPy can directly alloc memory from the PyTorch allocator by means of this proposal https://github.com/cupy/cupy/pull/3126
We would like to gather feedback to know if this approach makes sense for PyTorch, or other alternative designs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33860
Differential Revision: D20212788
Pulled By: ngimel
fbshipit-source-id: bc1e08a66da1992d26021147bf645dc65239581c
Summary:
Given that pybind11 implements these gil functions, I don't think it makes sense for Pytorch to have its own bespoke versions.
Fixes https://github.com/pytorch/pytorch/issues/29065
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29095
Differential Revision: D18301806
Pulled By: ezyang
fbshipit-source-id: 03da6a26c41ee65aaadf7b67b9f0b14d2def2a5a
Summary:
Fixes https://github.com/pytorch/pytorch/issues/23401
We cannot rely on `multiprocessing.util.register_after_fork` since it is only
called for processes created by the `multiprocessing` module and not `os.fork()`.
Moving to `pthread_atfork` does always get called. However, I don't think it's safe to call python functions inside of the `atfork` handler so the python code has to be a bit more careful when checking `_initialized`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29101
Differential Revision: D18355451
Pulled By: ezyang
fbshipit-source-id: 4d4253a3669796212c099dad4e5bdfdb0df40469
Summary:
Fixes https://github.com/pytorch/pytorch/issues/6962
The PR implements the handle pool mechanism for cublas as suggested by mcarilli in https://github.com/pytorch/pytorch/issues/6962#issuecomment-530563872.
~~I didn't add any unit test here yet because as mcarilli mentioned:~~
> ~~On my local machine, out of curiosity I also rewrote that test to use gemms instead of convolutions. The race condition seemed rarer, but the test did show that cublas use is not thread safe. I can share the script if you want.~~
~~Please share your script with me mcarilli. And if the race condition is rare, would it still be possible for the CI to detect it?~~
cc: colesbury
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29233
Differential Revision: D18372007
Pulled By: ezyang
fbshipit-source-id: 3492bf13410598e8452e89cf4e3e63e8df9c8c3d
Summary:
Adds comprehensive memory instrumentation to the CUDA caching memory allocator.
# Counters
Added comprehensive instrumentation for the following stats:
- Allocation requests (`allocation`)
- Allocated memory (`allocated_bytes`)
- Reserved segments from cudaMalloc (`segment`)
- Reserved memory (`reserved_bytes`)
- Active memory blocks (`active`)
- Active memory (`active_bytes`)
- Inactive, non-releasable blocks (`inactive_split`)
- Inactive, non-releasable memory (`inactive_split_bytes`)
- Number of failed cudaMalloc calls that result in a cache flush and retry (`cuda_malloc_retries`)
- Number of OOMs (`num_ooms`)
Except for the last two, these stats are segmented between all memory, large blocks, and small blocks. Along with the current value of each stat, historical counts of allocs/frees as well as peak usage are tracked by the allocator.
# Snapshots
Added the capability to get a "memory snapshot" – that is, to generate a complete dump of the allocator block/segment state.
# Implementation: major changes
- Added `torch.cuda.memory_stats()` (and associated C++ changes) which returns all instrumented stats as a dictionary.
- Added `torch.cuda.snapshot()` (and associated C++ changes) which returns a complete dump of the allocator block/segment state as a list of segments.
- Added memory summary generator in `torch.cuda.memory_summary()` for ease of client access to the instrumentation stats. Potentially useful to dump when catching OOMs. Sample output here: https://pastebin.com/uKZjtupq
# Implementation: minor changes
- Add error-checking helper functions for Python dicts and lists in `torch/csrc/utils/`.
- Existing memory management functions in `torch.cuda` moved from `__init__.py` to `memory.py` and star-imported to the main CUDA module.
- Add various helper functions to `torch.cuda` to return individual items from `torch.cuda.memory_stats()`.
- `torch.cuda.reset_max_memory_cached()` and `torch.cuda.reset_max_memory_allocated()` are deprecated in favor of `reset_peak_stats`. It's a bit difficult to think of a case where only one of those stats should be reset, and IMO this makes the peak stats collectively more consistent.
- `torch.cuda.memory_cached()` and `torch.cuda.max_memory_cached()` are deprecated in favor of `*memory_reserved()`.
- Style (add access modifiers in the allocator class, random nit fixes, etc.)
# Testing
- Added consistency check for stats in `test_cuda.py`. This verifies that the data from `memory_stats()` is faithful to the data from `snapshot()`.
- Ran on various basic workflows (toy example, CIFAR)
# Performance
Running the following speed benchmark: https://pastebin.com/UNndQg50
- Before this PR: 45.98 microseconds per tensor creation
- After this PR: 46.65 microseconds per tensor creation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27361
Differential Revision: D17758747
Pulled By: jma127
fbshipit-source-id: 5a84e82d696c40c505646b9a1b4e0c3bba38aeb6
Summary:
Follow-up to gh-25483, more of the same fixes for warnings like:
```
../torch/csrc/autograd/python_variable.cpp:503:31: warning: cast between incompatible function types from ‘PyObject* (*)(THPVariable*)’ {aka ‘_object* (*)(THPVariable*)’} to ‘getter’ {aka ‘_object* (*)(_object*, void*)’} [-Wcast-function-type]
503 | {"_backward_hooks", (getter)THPVariable_get_backwards_hooks, (setter)THPVariable_set_backwards_hooks, nullptr, nullptr},
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
This takes the build log output for a full rebuild with GCC 9.1 from ~10,000 to ~7,000 lines.
`clang-tidy` is going to complain, no way around that - see discussion at the end of gh-25483.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26104
Differential Revision: D17396831
Pulled By: ezyang
fbshipit-source-id: d71696bfe4dbe25519e4bcb7753151c118bd39f7
Summary:
This is achieved by using `cuDevicePrimaryCtxGetState` as a way to check whether a primary context exists on a device. It is not too slow, from this benchmark of a single call to it on CUDA 10.1, Titan Xp, driver 415.27:
```
---------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------
BM_cuDevicePrimaryCtxGetState 301 ns 301 ns 2319746
```
Commits:
1. Add `CUDAHooks::getDeviceWithPrimaryContext` which returns a device index with primary context (if exists).
Link `c10/cuda` against `libcuda` for device API calls.
2. Use `getDeviceWithPrimaryContext` to check primary context in `pin_memory`.
Fix `OptionalDeviceGuard` doc.
3. Refactor `test_cuda_primary_ctx.py` to support multiple tests.
Add test for this in that file.
Fixes https://github.com/pytorch/pytorch/issues/21081.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22229
Differential Revision: D16170194
Pulled By: zou3519
fbshipit-source-id: 485a45f211b7844c9e69c63f3b3b75194a796c5d
Summary:
As part of the Variable/Tensor merge work: https://github.com/pytorch/pytorch/issues/13638, we make the following changes in this PR:
1. Remove the `Variable::Impl` class and the `DifferentiableViewImpl` class
2. Change all `Variable.data()` call sites to either use `Variable` directly, or use `Variable.tensor_data()`
3. Remove `Variable.data()` API
3. Add `Variable.variable_data()` that matches `tensor.data` in Python API, which creates a new `Variable` that shares the same storage and tensor metadata with the original `Variable`, but with a completely new autograd history.
After this PR, Variable doesn't wrap a Tensor internally anymore, and both Variable and Tensor use the same TensorImpl class as its `impl_`. The only difference is that Variable always has AutogradMeta in its TensorImpl, but Tensor doesn't.
**Note that this PR is BC-breaking in the following use cases:**
**Use Case 1:**
Previously, `x.data = y` works even if `x` and `y` are of different TensorImpl type (e.g. `x` is a CPU dense tensor whose impl is of type TensorImpl, while `y` is a CPU sparse tensor whose impl is of type SparseTensorImpl). However, after this PR, `x.data = y` doesn't work anymore if `x` and `y` are of different TensorImpl type, because the underlying implementation `variable.set_data(tensor)` no longer works if `variable` and `tensor` have different TensorImpl type.
**Use Case 2:**
If a tensor `x`'s `grad` is sparse, accumulating dense gradients to `x` will change the tensor that `x.grad` is pointing to. This is better illustrated with the following example:
```python
params = torch.tensor([1.5, 1.5]).requires_grad_()
with torch.no_grad():
# Change gradient to a sparse tensor
params.grad = torch.sparse_coo_tensor(torch.tensor([[1, 1]]).long(), torch.tensor([1., 1.]))
grad_saved = params.grad
params.backward(torch.tensor([1.5, 1.5]))
assert id(grad_saved) == id(params.grad) # This will fail after this PR
```
The assertion in the last line will fail after this PR, because adding dense gradients to sparse gradients will change the `params.grad` tensor reference.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17072
Differential Revision: D14075257
Pulled By: yf225
fbshipit-source-id: 0e681df641270dea586042dd26db59f2e76b5957
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18445
ghimport-source-id: 30d018737bf6989bc68b7e3676f44e0ca6141fde
Stack from [ghstack](https://github.com/ezyang/ghstack):
* #18242 Test running a CUDA build on CPU machine.
* **#18445 Unify cudaGetDeviceCount implementations.**
I went about doing this by searching for calls to cudaGetDeviceCount,
and then methodically replacing them with references to c10::cuda::device_count()
or at::cuda::device_count().
There is a point to doing this: the various implementations wildly differed
in their handling of what to do when cudaGetDeviceCount returns an error.
The final standardized behavior is that **all errors are swallowed** and
we return device count of zero. This indirectly fixes running CUDA builds
on CPU, which was broken in #17847.
I added 'noexcept' to the 'deviceCount' virtual method on DeviceGuardImpl.
This is a BC-breaking change for anyone inheriting from DeviceGuardImpl
but all you need to do is put 'noexcept' on your method and it is backwards
compatible with older libtorch.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D14612189
fbshipit-source-id: 3c8d186e3dd623c0e27625212c7ce30f75d943cb
Summary:
This is to fix#16141 and similar issues.
The idea is to track a reference to every shared CUDA Storage and deallocate memory only after a consumer process deallocates received Storage.
ezyang Done with cleanup. Same (insignificantly better) performance as in file-per-share solution, but handles millions of shared tensors easily. Note [ ] documentation in progress.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16854
Differential Revision: D13994490
Pulled By: VitalyFedyunin
fbshipit-source-id: 565148ec3ac4fafb32d37fde0486b325bed6fbd1
Summary:
This is the first commit from a series of planned changes in order to add boolean tensors to PyTorch. The whole plan looks like this:
0. Storage Implementation (this change)
1. Tensor Creation.
2. Tensor Conversions.
3. Tensor Indexing.
4. Tensor Operations.
5. Back compatibility related changes.
This feature was requested by the community:
https://github.com/pytorch/pytorch/issues/4764https://github.com/pytorch/pytorch/issues/4219https://github.com/pytorch/pytorch/issues/4288
**Change**:
Added boolean type to the Storage class for CPU and CUDA backends.
**Tested via**:
1. unit tests
2. running this:
-> import torch
-> torch.BoolStorage
<class 'torch.BoolStorage'>
-> torch.cuda.BoolStorage
<class 'torch.cuda.BoolStorage'>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16810
Reviewed By: gchanan
Differential Revision: D14087246
Pulled By: izdeby
fbshipit-source-id: 042642ced1cb0fd1bb6bff05f9ca871a5c54ee5e
Summary:
When Variable and Tensor are merged, the dynamic type of the tensors passed to certain functions will become variables, and expecting `type()` on those variables to still return non-Variable types will cause type mismatch error.
One way to fix this problem is to use the thread-local guard `at::AutoNonVariableTypeMode` to force `type()` to return non-Variable type, but ideally we want to limit the use of `at::AutoNonVariableTypeMode` to be only in VariableType.cpp. Another way to fix the problem is to use `at::globalContext().getNonVariableType()` instead to get the non-Variable type of the tensor, which is what this PR is trying to achieve.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16325
Differential Revision: D14012022
Pulled By: yf225
fbshipit-source-id: 77ef1d2a02f78bff0063bdd72596e34046f1e00d
Summary:
Some renaming and renamespacing also took place. I was originally planning not to do anything, but it turns out that it was easier to make HIPify work by using a namespace CUDACachingAllocator:: rather than THCCachingAllocator_, since :: is a word boundary but _ is not.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16119
Reviewed By: smessmer
Differential Revision: D13718768
fbshipit-source-id: 884a481d99027fd3e34471c020f826aa12225656
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16117
This means I can move it to c10_cuda with minimal fuss.
Reviewed By: smessmer
Differential Revision: D13717836
fbshipit-source-id: a94c7dc649af64542480fc1c226b289588886c00
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16061
I discovered I needed to delete these names in preparation of moving
THCCachingAllocator to c10_cuda; might as well also fix all the other
sites too.
Reviewed By: dzhulgakov
Differential Revision: D13686869
fbshipit-source-id: e8cc55d39ac4bfd3e3a22c761f89a7a111ce5f5e
Summary:
Anywhere we used #include "foo.h", we now say #include <foo.h>
Paths are adjusted to be rooted out of aten/src, torch/lib, or
the root level directory.
I modified CMakeLists.txt by hand to remove TH and THC from
the include paths.
I used the following script to do the canonicalization:
```
import subprocess
import re
import os.path
files = subprocess.check_output(['git', 'ls-files']).decode('utf-8').rstrip().split('\n')
for fn in files:
if not any(fn.endswith(suff) for suff in ['.cu', '.cpp', '.in', '.h', '.hpp', '.cu', '.cuh', '.cc']):
continue
if not any(fn.startswith(pref) for pref in ["aten/", "torch/"]):
continue
with open(fn, 'r') as f:
c = f.read()
def fmt(p):
return "#include <{}>".format(p)
def repl(m):
p = m.group(1)
if p in ["dlfcn.h", "unistd.h", "nvrtc.h", "cuda.h", "cuda_runtime.h", "cstdint", "cudnn.h", "Python.h", "cusparse.h", "cuda_runtime_api.h", "cuda_fp16.h", "cublas_v2.h", "stdint.h", "curand_kernel.h"]:
return fmt(p)
if any(p.startswith(pref) for pref in ["torch/csrc", "c10/", "ATen/", "caffe2/", "TH/", "THC/", "Eigen/", "gtest/", "zdl/", "gloo/", "onnx/", "miopen/"]):
return fmt(p)
for root in ["aten/src", "torch/lib", ""]:
for bad_root in [os.path.dirname(fn), "aten/src/TH", "aten/src/THC", "torch/csrc"]:
new_p = os.path.relpath(os.path.join(bad_root, p), root)
if not new_p.startswith("../") and (os.path.exists(os.path.join(root, new_p)) or os.path.exists(os.path.join(root, new_p + ".in"))):
return fmt(new_p)
print("ERROR: ", fn, p)
return m.group(0)
new_c = re.sub(r'#include "([^"]+)"', repl, c)
if new_c != c:
print(fn)
with open(fn, 'w') as f:
f.write(new_c)
```
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14849
Reviewed By: dzhulgakov
Differential Revision: D13363445
Pulled By: ezyang
fbshipit-source-id: 52361f878a672785f9306c9e9ab2513128092b68
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14246
This commit systematically eliminates THCStream entirely from THC, replacing it
with at::cuda::CUDAStream. In places where the previous pointer type showed up
in a public API signature, those functions are now only available to C++
clients. (It would not be too difficult to make a C-compatible version of
CUDAStream, as it's really just a simple struct, but we leave this for
future work.)
All functions in THC that referred to THCStream were expunged in favor of their
modern counterparts.
One annoyance was that I didn't feel like redoing how the torch.cuda.Stream
binding code worked, but I really wanted to get rid of the stored THCStream*
pointer. So I repurposed the bit-packing code I implemented for Stream hashing,
and used that to (reversibly) store streams in a uint64_t cdata field. A perhaps
more future proof solution would be to get rid of cdata entirely, and store the
device and stream ID directly.
Billing of changes:
- All CUDAStream_ pointer API functions are now hidden and anonymously
namespaced (instead of being in the impl namespace). All use sites
rewritten to use the modern C++ API. Since CUDAStreamInternals is no
longer part of the public API, the CUDAStreamInternals constructor and
internals() method have been removed, and replaced with anonymous
functions in the C++ file.
- device_index() returns DeviceIndex rather than int64_t now
- Stream and CUDAStream now have pack/unpack methods. (CUDAStream checks
that the unpacked bit-pattern is for a CUDA device.)
- THCStream.h header is removed entirely
- Most THCStream handling functions in THC API are removed
Reviewed By: gchanan
Differential Revision: D13121531
fbshipit-source-id: 48873262cc0a37c3eec75a7ba1c93c800da40222
Summary:
How did we get so many uses of `NULL` again?
ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11047
Differential Revision: D9566799
Pulled By: goldsborough
fbshipit-source-id: 83469f352ac69aa65bdaf1a1a21f922d892e0db3
Summary:
Currently our `skipIfLapack` has uses a try-catch block and regex match the error message. It is highly unreliable. This PR adds `hasLAPACK` and `hasMAGMA` on ATen context, and expose the flags to python.
Also fixes refcounting bug with `PyModule_AddObject`. The method steals reference, but we didn't `Py_INCREF` in some places before calling it with `Py_True` or `Py_False`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11024
Differential Revision: D9564898
Pulled By: SsnL
fbshipit-source-id: f46862ec3558d7e0058ef48991cd9c720cb317e2
Summary:
This PR removes couple of macros throughout TH* as part of the re-factoring effort for ATen. Removing these macros should avoid confusion among developers who are trying to move things from TH* to ATen. This PR is part of the THCNumerics deprecation that I have been working on following up on mruberry's https://github.com/pytorch/pytorch/pull/9318. I am separating these two commits to see if removal of these macros doesn't upset the pytorch public CI, as well as internal builds.
- Commit 1248de7baf removes the code paths guarded by `CUDA_HALF_INSTRUCTIONS` macro. Since the macro was removed in commit 2f186df52d, `ifdef CUDA_HALF_INSTRUCTIONS` would return false and hence the code path that is kept after this change is for the false case of `ifdef CUDA_HALF_INSTRUCTIONS`
- Commit 520c99b057 removes the code paths guarded by `CUDA_HALF_TENSOR` macro. Since Pytorch now provides support for only CUDA 8.0 and above, `CUDA_HALF_TENSOR` is always true since CUDA 8.0 satisfies `CUDA_HAS_FP16` and hence, the code path that is kept after this change is for the true case of `ifdef CUDA_HALF_TENSOR`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10147
Differential Revision: D9345940
Pulled By: soumith
fbshipit-source-id: c9392261dd432d304f1cdaf961760cbd164a59d0
Summary:
ezyang noticed that the CUDAStream files lived under ATen/ despite being CUDA-specific, and suggested porting them to ATen/cuda and exposing them with a new CUDAContext. This PR does that. It also:
- Moves ATen's CUDA-specific exceptions for ATen/cudnn to ATen/cuda for consistency
- Moves getDeviceProperties() and getCurrentCUDASparseHandle() to CUDAContext from CUDAHooks
The separation between CUDAContext and CUDAHooks is straightforward. Files that are in CUDA-only builds should rely on CUDAContext, while CUDAHooks is for runtime dispatch in files that can be included in CPU-only builds. A comment in CUDAContext.h explains this pattern. Acquiring device properties and CUDA-specific handles is something only done in builds with CUDA, for example, so I moved them from CUDAHooks to CUDAContext.
This PR will conflict with #9277 and I will merge with master after #9277 goes in.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9435
Reviewed By: soumith
Differential Revision: D8917236
Pulled By: ezyang
fbshipit-source-id: 219718864234fdd21a2baff1dd3932ff289b5751
Summary:
See Note [Supervisor deleter] for how SupervisedPtr works.
This design is not the obvious one, but there were a lot of
constraints feeding into it:
- It must support the reallocation usage-pattern, where, given
an existing Storage, we allocate a new region of memory,
copy the existing data to it, and then deallocate the old
region of memory.
- Creation of a deleter for memory MUST avoid dynamic allocations
in the common case. We've done some benchmarking in Caffe2
where dynamic allocation for deleters is ruinously expensive,
and it's really hard to avoid these performance tarpits in
very general function wrappers like std::function or
folly::Function (while benchmarking this, we discovered that
folly::Function's move constructor was way more expensive
than it should be).
- We need to be able to deallocate data that comes from external
sources, e.g., dlpack and numpy tensors. Most notably,
you often cannot deallocate these with merely the void*
data pointer; you need some extra, out-of-band information
(e.g., the managing struct) to deallocate it. Sometimes,
you may even want to resize data living in an external source!
- The "core" allocators need to support being wrapped in a Thrust
allocator, so you need to be implement the following two functions:
char* allocate(size_t);
void deallocate(char*, size_t);
- We need to support tensors which contain non-POD, non-trivially
copyable data; specifically tensors of std::string. This is
an upcoming requirement from Caffe2. It's dirty AF, but
it's really useful.
- It should use C++ standard library types like std::unique_ptr
(which is hugely problematic because std::unique_ptr doesn't
call the deleter when the pointer is null.)
Here is the billing of changes:
- Built-in support for realloc() has been DROPPED ENTIRELY.
Instead, you're expected to allocate and then copy from
the old memory to the new memory if you want to do a
reallocation. This is what you'd generally have expected
to occur; and axing realloc() from the design lets us avoid
some tricky correctness issues with std::realloc(), namely
the fact that we must refuse the realloc if the type of the
elements are not trivially copyeable. If it really matters,
we can add this back, but there really needs to be a good
explanation WHY you need fast resizing reallocations (by in
large, people don't resize their storages, and it should
be acceptable to have a performance degradation when they
do).
- TH_STORAGE_FREEMEM is no more; instead, if you want a
storage which doesn't free its result, you just give it
an empty deleter.
- What we used to call an "allocator" (really, a combined
object for allocating/deleting) has been split into two
concepts, an allocator, and a smart pointer (SupervisedPtr)
which knows how to delete data.
- Unlike previously, where THAllocator/THCDeviceAllocator
could have a per-tensor context storing extra information
(e.g., a pointer to the metadata you need to actually
free the tensor), there is no context in the allocator or
the deleter of the smart pointer; instead, the smart
pointer directly holds an owning reference to the
metadata necessary to free the data. This metadata
is *freshly manufactured* upon every allocation, which
permits us to resize tensors even in the absence of
built-in support for realloc().
- By default, allocators don't support "raw" allocations
and deallocations with raw pointers. This is because
some allocations may return a different context every
time, in which case you need to reconstruct the context
at delete time (because all you got was a void*, not
a unique_ptr that carries the deleter).
- The diff between at::Allocator and THCDeviceAllocator is a
bit larger:
- It used to return a cudaError_t. Now, allocators
are expected to check the error status immediately and throw
an exception if there was an error. It turns out that this
is what was immediately done after all occurrences of
allocate/release, so it wasn't a big deal (although some
subsidiary interfaces had to themselves be converted to
not return cudaError_t).
There is one notable exception to this, and it is how
we handle CUDA OOM: if this occurs, we attempt to return
unused memory to the system and try again. This is now
handled by a catch-all try-catch block. The cost of
catching the exception is probably the least of your worries
if you're about to OOM.
- It used to take the CUDA stream to perform the allocation
on as an argument. However, it turned out that all call
sites, this stream was the stream for the current device.
So we can push this into the allocator (and the choice,
in the future, could be made explicitly by twiddling
thread local state.)
- It held two extra methods, emptyCache and cacheInfo, specifically
for interacting with some state in THCCachingAllocator.
But this "generality" was a lie, since THCCachingAllocator
was the only allocator that actually implemented these
methods, and there is actually a bunch of code in THC
which assumes that it is the caching allocator that is
the underlying allocator for CUDA allocations. So I
folded these two methods into this interface as
THCCachingAllocator_emptyCache and THCCachingAllocator_cacheInfo.
- It held its context directly inside the THCDeviceAllocator
struct. This context has been moved out into whatever
is holding the at::Allocator*.
- The APIs for getting at allocators/deleters is now a little different.
- Previously there were a bunch of static variables you could get
the address of (e.g., &THDefaultAllocator); now there is a
function getTHDefaultAllocator().
- Some "allocators" didn't actually know how to allocate (e.g.,
the IPC "allocator"). These have been deleted; instead, you
can wrap the produced pointers into SupervisedPtr using
an appropriate makeSupervisedPtr() static method.
- Storage sharing was a lot of work to wrangle, but I think I've
tamed the beast.
- THMapAllocator and its "subclasses" have been refactored to
be proper, honest to goodness C++ classes. I used the enum
argument trick to get "named" constructors. We use inheritance
to add refcounting and management (in libshm). What we previously
called the "Context" class (Context has been dropped from the name)
is now the supervisor for the data.
- Sometimes, we need to pull out the file descriptor from a
tensor. Previously, it was pulled out of the allocator context.
Now, we pull it out of the supervisor of the SupervisorPtr,
using the static method fromSupervisedPtr(), which uses the
deleter as the typeid, and refines the type if it matches.
- I renamed the std::function deleter into
InefficientStdFunctionSupervisor, to emphasize the fact that it does
a dynamic allocation to save the std::function deleter.
TODO:
- Windows libshm is in shambles and needs to be fixed.
Perhaps for the future:
- newFromFd is now unconditionally calling cudaPointerGetAttributes
even though this is unnecessary, because we know what the device
is from higher up in the callstack. We can fix this by making
newWithDataAndAllocator also take an explicit device argument.
- Consider statically distinguishing between allocators that
support raw_allocate/raw_deallocate, and those which don't.
The Thrust constraint applies only to the CUDA device allocator;
you never need to allocate CPU memory this way
- Really want to get rid of storage views. Ugh.
Nontrivial bugs I noticed when preparing this patch:
- I forgot to placement-new unique pointers and attempted to
assign them directly on uninitialized memory; very bad! Sam
Gross has encouraged me to replace this with a proper constructor
but I keep putting it off, because once everything goes in
StorageImpl there really will be a proper constructor.
- I rewrote a number of APIs to use newWithDataAndAllocator
instead of newWithAllocator, calling the allocator at the
call site (because they required "allocation context" which
we no longer give to "allocators"). When I did this, I forgot
to insert the multiplication with sizeof(real) to scale from
numels to number of bytes.
- The implementation of swap on storages was missing it for
scalarType and backend. It was benign (because the only case
we call swap is when these are the same), but I fixed it anyway.
- I accidentally returned a nullptr unique_ptr with no deleter,
even though there was a legitimate one. This matters, because
some code still shoves its hands in the deleter context to
get extra metadata about the function.
- I used std::move() on a unique_ptr, and then did a boolean
test on the pointer aftewards (always false!)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9358
Reviewed By: SsnL
Differential Revision: D8811822
Pulled By: ezyang
fbshipit-source-id: 4befe2d12c3e7fd62bad819ff52b054a9bf47c75
Getting CUDA device property struct with cudaGetDeviceProperties is expensive. THC caches CUDA device property, which is available via THCState_getDeviceProperties, which is available via at::globalContext().getDeviceProperties(device), which is available via torch.cuda.get_device_properties. This PR changes the two methods that previously calls cudaGetDeviceProperties to directly using torch.cuda.get_device_properties in Python.
Also fixes ATen compile error when it can't find CUDA.
Fixes#4908. Using the script from that issue, we get roughly 18x speed-up.
[ssnl@ ~] python dev.py # master
0.2826697587966919
0.00034999847412109375
0.0003493785858154297
0.000356292724609375
0.00036025047302246094
0.0003629922866821289
0.00036084651947021484
0.00035686492919921874
0.00036056041717529296
0.0003606319427490234
[ssnl@ ~] python dev.py # this PR
0.27275662422180175
2.1147727966308594e-05
1.9598007202148438e-05
1.94549560546875e-05
1.9359588623046876e-05
1.938343048095703e-05
2.0074844360351563e-05
1.952648162841797e-05
1.9311904907226562e-05
1.938343048095703e-05
This deletes most of the dead Tensor code paths, including the TensorMethods cwrap and generic/Tensor.cpp.
This also moves the THNN.cwrap/.cpp generation to generate_code which can use ninja if installed.
This replaces the torch.Tensor constructors with factories that produce
Variables. Similarly, functions on the torch module (e.g. torch.randn)
now return Variables.
To keep the PR to a reasonable size, I've left most of the unused tensor
code. Subsequent PRs will remove the dead code, clean-up calls to
torch.autograd.Variable, and rename Variable to Tensor everywhere.
There are some breaking changes because Variable and Tensors had
slightly different semantics. There's a list of those changes here:
https://github.com/pytorch/pytorch/wiki/Breaking-Changes-from-Variable-and-Tensor-merge
Adds streams and comms as optional arguments to the NCCL calls in
torch.cuda.nccl. Also exposes ncclUniqueId and ncclCommInitRank for
multi-process mode.
Moves Py_RETURN_NONE statements after the GIL is re-acquired.
Fixes#1267
This fixes a number of issues when PyTorch was compiled with CUDA
support but run on a machine without any GPUs. Now, we treat all errors
from cudaGetDeviceCount() as if the machine has no devices.
This ensures that we use the same library at the C++ level and with
Python ctypes. It moves the searching for the correct library from
run-time to compile-time.
The core autograd Variable, Function, and Engine no longer depend on the
Python API. This let's us implement functions in C++. In the future, we
can also multithread engine and release the GIL for most of the
non-Python backwards.