Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51124
Original commit changeset: 1c7133627da2
Test Plan: Test locally with interpreter_test and on CI
Reviewed By: suo
Differential Revision: D26077905
fbshipit-source-id: fae83bf9822d79e9a9b5641bc5191a7f3fdea78d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50458
libinterpreter.so contains a frozen python distribution including
torch-python bindings.
Freezing refers to serializing bytecode of python standard library modules as
well as the torch python library and embedding them in the library code. This
library can then be dlopened multiple times in one process context, each
interpreter having its own python state and GIL. In addition, each python
environment is sealed off from the filesystem and can only import the frozen
modules included in the distribution.
This change relies on newly added frozenpython, a cpython 3.8.6 fork built for this purpose. Frozenpython provides libpython3.8-frozen.a which
contains frozen bytecode and object code for the python standard library.
Building on top of frozen python, the frozen torch-python bindings are added in
this diff, providing each embedded interpreter with a copy of the torch
bindings. Each interpreter is intended to share one instance of libtorch and
the underlying tensor libraries.
Known issues
- Autograd is not expected to work with the embedded interpreter currently, as it manages
its own python interactions and needs to coordinate with the duplicated python
states in each of the interpreters.
- Distributed and CUDA functionality is disabled in the libinterpreter.so build and needs to be revisited.
- __file__ is not supported in the context of embedded python, since there are no
files backing the library modules; code relying on __file__ will not work.
- __version__ is not properly supported in the embedded torch-python; only a
workaround is in place for now.
Test Plan: tested locally and on CI with cmake and buck builds running torch::deploy interpreter_test
Reviewed By: ailzhang
Differential Revision: D25850783
fbshipit-source-id: a4656377caff25b73913daae7ae2f88bcab8fd88
Summary:
And the underlying torch._C._cuda_canDeviceAccessPeer, which is a wrapper around cudaDeviceCanAccessPeer
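For reference, a hedged usage sketch; it assumes the binding is surfaced as `torch.cuda.can_device_access_peer`, the Python-level wrapper available in current releases:
```python
import torch

# Hedged sketch: torch.cuda.can_device_access_peer is assumed to be the public
# wrapper that ends up calling this binding.
if torch.cuda.device_count() >= 2:
    # True if device 0 can directly read/write memory on device 1 (peer-to-peer).
    print(torch.cuda.can_device_access_peer(0, 1))
```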
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50446
Reviewed By: mrshenli
Differential Revision: D25890405
Pulled By: malfet
fbshipit-source-id: ef09405f115bbe73ba301d608d56cd8f8453201b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49486
Remove code for Python 3.5 and lower.
There's more that can be removed/modernised, but sticking mainly to redundant version checks here, to keep the diff/PR smaller.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46579
Reviewed By: zou3519
Differential Revision: D24453571
Pulled By: ezyang
fbshipit-source-id: c2cfcf05d6c5f65df64d89c331692c9aec09248e
Summary:
Add a new function, torch.cuda.set_per_process_memory_fraction(fraction, device), to torch.cuda. Related: https://github.com/pytorch/pytorch/issues/18626
The fraction (a float between 0 and 1) limits the memory available to the caching allocator on the given GPU device. One can set it on any visible GPU. The allowed memory equals total memory * fraction. Trying to allocate more GPU memory than the allowed value raises an OOM error. This function is similar to TensorFlow's per_process_gpu_memory_fraction.
Note that this setting only limits the caching allocator within one process. If you are using multiprocessing, you need to apply the setting in each subprocess as well, because every subprocess has its own allocator.
## Usage
In some cases one needs to split a GPU device into two parts, which requires setting the limit before any GPU memory is used.
E.g. on device 0, to let each part take half of the memory:
```
torch.cuda.set_per_process_memory_fraction(0.5, 0)
```
Here is an example showing the behavior:
```python
import torch
torch.cuda.set_per_process_memory_fraction(0.5, 0)
torch.cuda.empty_cache()
total_memory = torch.cuda.get_device_properties(0).total_memory
# less than 0.5 will be ok:
tmp_tensor = torch.empty(int(total_memory * 0.499), dtype=torch.int8, device='cuda')
del tmp_tensor
torch.cuda.empty_cache()
# this allocation will raise an OOM:
torch.empty(total_memory // 2, dtype=torch.int8, device='cuda')
"""
It raises an error as follows:
RuntimeError: CUDA out of memory. Tried to allocate 5.59 GiB (GPU 0; 11.17 GiB total capacity; 0 bytes already allocated; 10.91 GiB free; 5.59 GiB allowed; 0 bytes reserved in total by PyTorch)
"""
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48172
Reviewed By: bdhirsh
Differential Revision: D25275381
Pulled By: VitalyFedyunin
fbshipit-source-id: d8e7af31902c2eb795d416b57011cc8a22891b8f
Summary:
Small quality-of-life improvement to NVTX Python bindings, that we're using internally and that would be useful to other folks using NVTX annotations via PyTorch. (And my first potential PyTorch contribution.)
Instead of needing to be careful with try/finally to make sure all your range_push'es are range_pop'ed:
```
nvtx.range_push("Some event")
try:
    # Code here...
finally:
    nvtx.range_pop()
```
you can simply do:
```
with nvtx.range("Some event"):
    # Code here...
```
or even use it as a decorator:
```
class MyModel(nn.Module):
    # Other methods here...
    @nvtx.range("MyModel.forward()")
    def forward(self, *input):
        # Forward pass code here...
```
A couple small open questions:
1. I also added the ability to call `msg.format()` inside `range()`, with the intention that, if there is nothing listening to NVTX events, we should skip the string formatting, to lower the overhead in that case. If you like that idea, I could add the actual "skip string formatting if nobody is listening to events" parts. We can also just leave it as is. Or I can remove that if you folks don't like it. (In the first two cases, should we add that to `range_push()` and `mark()` too?) Just let me know which one it is, and I'll update the pull request.
2. I don't think there are many places for bugs to hide in that function, but I can certainly add a quick test, if you folks want.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42925
Reviewed By: gchanan
Differential Revision: D24476977
Pulled By: ezyang
fbshipit-source-id: 874882818d958e167e624052e42d52fae3c4abf1
Summary:
Follow-up of https://github.com/pytorch/pytorch/issues/46461 with a similar goal
Makes them more readable and possibly faster. Care has to be taken because `map` applies the function immediately while `(x for x in xs)` is a generator expression which gets evaluated later. This is a benefit in some cases where it is not required to actually create the list of values in memory (e.g. when passing to `tuple` or `extend` or `join`)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46462
Reviewed By: zou3519
Differential Revision: D24422343
Pulled By: ezyang
fbshipit-source-id: 252e33499c92ac0b15238f2df32681dbbda2b237
Summary:
Amp gradient unscaling is a great use case for multi tensor apply (in fact it's the first case I wrote it for). This PR adds an MTA unscale+infcheck functor. Really excited to have it for `torch.cuda.amp`. izdeby your interface was clean and straightforward to use, great work!
Labeled as bc-breaking because the native_functions.yaml exposure of unscale+infcheck changes from [`_amp_non_finite_check_and_unscale_` to `_amp_foreach_non_finite_check_and_unscale_`]( https://github.com/pytorch/pytorch/pull/44778/files#diff-f1e4b2c15de770d978d0eb77b53a4077L6289-L6293).
The PR also modifies Unary/Binary/Pointwise Functors to
- do ops' internal math in FP32 for FP16 or bfloat16 inputs, which improves precision ([and throughput, on some architectures!](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions)) and has no downside for the ops we care about.
- accept an instantiated op functor rather than an op functor template (`template<class> class Op`). This allows calling code to pass lambdas.
Open question: As written now, the PR has MTA Functors take care of pre- and post-casting FP16/bfloat16 inputs to FP32 before running the ops. However, alternatively, the pre- and post-math casting could be deferred/written into the ops themselves, which gives them a bit more control. I can easily rewrite it that way if you prefer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44778
Reviewed By: gchanan
Differential Revision: D23944102
Pulled By: izdeby
fbshipit-source-id: 22b25ccad5f69b413c77afe8733fa9cacc8e766d
Summary:
Fix `torch._C._autocast_*_nesting` declarations in __init__.pyi
Fix iterable constructor logic: not every iterable can be constructed using the `type(val)(val)` trick; for example, it does not work for `val = range(10)` even though `isinstance(val, Iterable)` is True (see the sketch below)
Change optional resolution logic to meet mypy expectations
Fixes https://github.com/pytorch/pytorch/issues/45436
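A standalone sketch (plain Python, not PyTorch code) of the `type(val)(val)` failure mode described above:
```python
# Illustration of why type(val)(val) is not a general way to rebuild an iterable.
from collections.abc import Iterable

val = [1, 2, 3]
assert type(val)(val) == [1, 2, 3]  # works for list

val = range(10)
print(isinstance(val, Iterable))    # True
try:
    type(val)(val)                  # range(range(10)) -> TypeError
except TypeError as err:
    print("cannot rebuild via type(val)(val):", err)
```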
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45480
Reviewed By: walterddr
Differential Revision: D23982822
Pulled By: malfet
fbshipit-source-id: 6418a28d04ece1b2427dcde4b71effb67856a872
Summary:
NVIDIA GPUs are binary compatible within the same major compute capability revision.
This prevents "GeForce RTX 3080 with CUDA capability sm_86 is not compatible with the current PyTorch installation." messages from appearing, since CUDA-11 does not support code generation for sm_86.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45130
Reviewed By: ngimel
Differential Revision: D23841556
Pulled By: malfet
fbshipit-source-id: bcfc9e8da63dfe62cdec06909b6c049aaed6a18a
Summary:
No need for compatibility wrapper in Python3+ world
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43981
Reviewed By: seemethere
Differential Revision: D23458325
Pulled By: malfet
fbshipit-source-id: 00f822895625f4867c22376fe558c50316f5974d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43247
`torch.cuda.nccl` APIs didn't throw appropriate errors when called
with inputs/outputs that were of the wrong type and it resulted in some cryptic
errors instead.
Adding some error checks with explicit error messages for these APIs.
ghstack-source-id: 110683546
Test Plan: waitforbuildbot
Reviewed By: rohan-varma
Differential Revision: D23206069
fbshipit-source-id: 8107b39d27f4b7c921aa238ef37c051a9ef4d65b
Summary:
A small clarity improvement to the cuda init docstring
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42923
Reviewed By: zhangguanheng66
Differential Revision: D23080693
Pulled By: mrshenli
fbshipit-source-id: aad5ed9276af3b872c1def76c6175ee30104ccb2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42249
Main change is to bring Caffe2's superior error messages for cuda initialization into c10 and use them in all code paths.
Basic logic:
| Case | Call to device_count() | init_cuda, e.g. allocating tensor |
| -- | -- | -- |
| all good | non-zero | just works |
| no gpus | 0, no warning | throw exception with good message |
| driver issues | 0, produce warning | throw exception with good message |
| out of memory with ASAN | 0, produce warning | throw exception with ASAN message |
Previously, the error thrown from init_cuda was very generic and the ASAN warning (if any) was buried in the logs.
Other clean up changes:
* cache device_count() always in a static variable
* move all asan macros in c10
Test Plan:
Hard to unittest because of build modes. Verified manually that the behavior from the table above holds by running the following script in different modes (ASAN/no-ASAN, CUDA_VISIBLE_DEVICES=):
```
print('before import')
import torch
print('after import')
print('devices: ', torch.cuda.device_count())
x = torch.tensor([1,2,3])
print('tensor creation')
x = x.cuda()
print('moved to cuda')
```
Reviewed By: ngimel
Differential Revision: D22824329
fbshipit-source-id: 5314007313a3897fc955b02f8b21b661ae35fdf5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41330
`torch.cuda.check_error` is annotated as taking an `int` as argument but when running `torch.cuda.check_error(34)` one would get:
```
TypeError: cudaGetErrorString(): incompatible function arguments. The following argument types are supported:
1. (arg0: torch._C._cudart.cudaError) -> str
Invoked with: 34
```
Even if one explicitly casted the argument, running `torch.cuda.check_error(torch._C._cudart.cudaError(34))` would give:
```
AttributeError: 'str' object has no attribute 'decode'
```
This PR fixes both issues (thus allowing `check_error` to be called with an un-casted int) and adds a test.
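A hedged sketch of the call after the fix; the `torch.cuda.CudaError` exception type is assumed from the current `torch.cuda` module:
```python
# A nonzero error code now raises torch.cuda.CudaError (a RuntimeError subclass)
# with the human-readable CUDA error string, even when passed as a plain int.
import torch

try:
    torch.cuda.check_error(34)
except torch.cuda.CudaError as err:
    print(err)
```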
ghstack-source-id: 107628709
Test Plan: Unit tests
Reviewed By: ezyang
Differential Revision: D22500549
fbshipit-source-id: 9170c1e466dd554d471e928b26eb472a712da9e1
Summary:
Add `torch._C._cuda_getArchFlags()`, which returns the list of architectures `torch_cuda` was compiled for.
Add `torch.cuda.get_arch_list()` and `torch.cuda.get_gencode_flags()` methods, which return the architecture list and gencode flags PyTorch was compiled with.
Print a warning if any of the GPUs is not compatible with any of the CUBINs.
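A hedged usage sketch of the new methods; the printed values depend entirely on how the local binary was built:
```python
import torch

print(torch.cuda.get_arch_list())      # e.g. ['sm_60', 'sm_70', 'sm_75', 'sm_80']
print(torch.cuda.get_gencode_flags())  # the matching '-gencode ...' flags as one string
```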
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41173
Differential Revision: D22459998
Pulled By: malfet
fbshipit-source-id: 65d40ae29e54a0ba0f3f2da11b821fdb4d452d95
Summary:
Decouple DataParallel/DistributedDataParallel from CUDA to support more device types.
- Move torch/cuda/comm.py to torch/nn/parallel/comm.py with minor changes for common device support. torch.cuda.comm is kept as-is for backward compatibility.
- Provide common APIs to arbitrary device types without changing existing CUDA APIs in torch.cuda space.
- Replace the torch.cuda calls in DataParellel/DistributedDataParallel with the new APIs.
Related RFC: [https://github.com/pytorch/pytorch/issues/36160](https://github.com/pytorch/pytorch/issues/36160)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38454
Differential Revision: D22051557
Pulled By: mrshenli
fbshipit-source-id: 7842dad0e5d3ca0f6fb760bda49182dcf6653af8
Summary:
I ran `make linkcheck` using `sphinx.builders.linkcheck` on the documentation and noticed a few links weren't using HTTPS so I quickly updated them all.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40878
Differential Revision: D22404647
Pulled By: ngimel
fbshipit-source-id: 9c9756db59197304023fddc28f252314f6cf4af3
Summary:
Should close https://github.com/pytorch/pytorch/issues/35810.
I decided to keep sparse handling on the Python side for clarity, although it could be moved to the C++ side (into `_amp_non_finite_check_and_unscale_`) without much trouble.
For non-fp16 sparse grads the logic is simple (call `_amp_non_finite_check_and_unscale_` on `grad._values()` instead of `grad` itself). At least I hope it's that easy.
For fp16 sparse grads, it's trickier. Sparse tensors can be uncoalesced. From the [Note](https://pytorch.org/docs/master/sparse.html#torch.sparse.FloatTensor):
> Our sparse tensor format permits uncoalesced sparse tensors, where there may be duplicate coordinates in the indices; in this case, the interpretation is that the value at that index is the sum of all duplicate value entries.
An uncoalesced scaled fp16 grad may have values at duplicate coordinates that are all finite but large, such that adding them to make the coalesced version WOULD cause overflows.** If I checked `_values()` on the uncoalesced version, it might not report overflows, but I think it should.
So, if the grad is sparse, fp16, and uncoalesced, I still call `_amp_non_finite_check_and_unscale_` to unscale `grad._values()` in-place, but I also double-check the coalesced version by calling a second `_amp_non_finite_check_and_unscale_` on `grad.coalesce()._values()`. `coalesce()` is out-of-place, so this call doesn't redundantly affect `grad._values()`, but it does have the power to populate the same `found_inf` tensor. The `is_coalesced()` check and `coalesce()` probably aren't great for performance, but if someone needs a giant embedding table in FP16, they're better than nothing and memorywise, they'll only create a copy of nnz gradient values+indices, which is still way better than changing the whole table to FP32.
An `unscale` variant with liberty to create unscaled grads out-of-place, and replace `param.grad` instead of writing through it, could get away with just one `_amp_non_finite_check_and_unscale_`. It could say `coalesced = grad.coalesced()`, do only the stronger `_amp_non_finite_check_and_unscale_` on `coalesced._values()`, and set `param.grad = coalesced`. I could even avoid replacing `param.grad` itself by going one level deeper and setting `param.grad`'s indices and values to `coalesced`'s, but that seems brittle and still isn't truly "in place".
** you could whiteboard an uncoalesced fp32 grad with the same property, but fp32's range is big enough that I don't think it's realistic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36786
Reviewed By: ezyang
Differential Revision: D22202832
Pulled By: ngimel
fbshipit-source-id: b70961a4b6fc3a4c1882f65e7f34874066435735
Summary:
Currently, a custom autograd function written with
```
@torch.cuda.amp.custom_fwd(cast_inputs=dtype)
def forward(ctx, *args):
    ...
```
casts incoming floating-point CUDA tensors to `dtype` unconditionally, regardless of whether the function executes in an autocast-enabled region. I think I had the wrong idea there. Autocast-disabled regions should give the user control of input types. Also, `custom_fwd(cast_inputs=dtype)`-decorated functions' behavior should align with native fp32list/fp16list functions. C++-side casting wrappers have no effect when autocast is disabled, and `custom_fwd`'s casting should behave the same way.
The present PR changes `custom_fwd` so it only casts in autocast-enabled regions (also updates custom_fwd to ignore fp64 inputs, like the C++ wrappers).
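A hedged sketch of the intended behavior, modeled on the `custom_fwd`/`custom_bwd` pattern the amp docs later adopted; `custom_bwd` is assumed to come from the same module:
```python
import torch
from torch.cuda.amp import autocast, custom_fwd, custom_bwd

# Casting happens only inside autocast-enabled regions; outside them, inputs
# keep whatever dtypes the caller supplied.
class ForceFP32MM(torch.autograd.Function):
    @staticmethod
    @custom_fwd(cast_inputs=torch.float32)
    def forward(ctx, a, b):
        ctx.save_for_backward(a, b)
        return a.mm(b)

    @staticmethod
    @custom_bwd
    def backward(ctx, grad):
        a, b = ctx.saved_tensors
        return grad.mm(b.t()), a.t().mm(grad)

a = torch.randn(4, 4, device='cuda', requires_grad=True)
b = torch.randn(4, 4, device='cuda', requires_grad=True)

with autocast():
    out = ForceFP32MM.apply(a, b)    # floating CUDA inputs are cast to float32
out_no_ac = ForceFP32MM.apply(a, b)  # outside autocast: no casting is performed
out.sum().backward()
```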
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36171
Differential Revision: D22179511
Pulled By: ngimel
fbshipit-source-id: 5a93d070179a43206066bce19da0a5a19ecaabbd
Summary:
I.e. do not accept `bytes` as possible type of `device` argument in
`torch.cuda._get_device_index`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40322
Differential Revision: D22176885
Pulled By: malfet
fbshipit-source-id: 2f3a46174161f1cdcf6a6ad94a31e54b18ad6186
Summary:
Use it from both __init__ and streams to define dummy types when CUDA is missing
Fix accidental reference of global `storage_name` from `_dummy_type`
Add type annotations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40177
Differential Revision: D22106922
Pulled By: malfet
fbshipit-source-id: 52fbfd91d70a78eb14d7ffda109c02ad1231497e
Summary:
While working on https://github.com/pytorch/pytorch/issues/38911, I realized that `nccl.reduce` only needs a single output tensor, while our current implementation requires a list of output tensors. This, along with a TODO I fixed in reduce_add, should give some speed-up for data parallel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39688
Differential Revision: D22034547
Pulled By: mrshenli
fbshipit-source-id: e74d54d673ebbb062474b1bb5cc93a095a3a5f6c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39483
I fixed all of the new errors that occurred because of the upgrade.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D21884575
Pulled By: ezyang
fbshipit-source-id: 45c8e1f1ecb410c8d7c46dd3922ad70e982a0685
Summary:
Following up on this: https://github.com/pytorch/pytorch/pull/35851 cross dtype storage copy is not being used internally, so I have not included cross dtype copy for complex.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35771
Differential Revision: D21319650
Pulled By: anjali411
fbshipit-source-id: 07c72996ee598eba0cf401ad61534494d6f5b5b3
Summary:
- added tests that showcase the problems
- fixed the problems
These changes would allow me to remove many "# type: ignore" comments in my codebase.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36358
Differential Revision: D21230704
Pulled By: ezyang
fbshipit-source-id: e6d475a0aa1fb40258fa0231ade28c38108355fb
Summary:
Several people have asked me about proper Amp usage with gradient accumulation. In particular, it's [unclear to people](https://github.com/NVIDIA/apex/issues/439#issuecomment-610351482) that you should only call `scaler.unscale_()` (if desired) and `scaler.update()` in iterations where you actually plan to step. This PR adds a minimal accumulation example.
I built the docs locally and it looks free from sphinx errors, at least.
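For reference, a minimal self-contained sketch of the accumulation pattern (my paraphrase, not the exact example added to the docs):
```python
import torch

model = torch.nn.Linear(10, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()
data = [(torch.randn(4, 10, device='cuda'), torch.randn(4, 10, device='cuda'))
        for _ in range(8)]
accum_steps = 4

scaler = torch.cuda.amp.GradScaler()
for i, (input, target) in enumerate(data):
    loss = loss_fn(model(input), target) / accum_steps
    scaler.scale(loss).backward()        # accumulate scaled gradients
    if (i + 1) % accum_steps == 0:
        scaler.unscale_(optimizer)       # optional; only in iterations that step
        scaler.step(optimizer)           # skipped internally if infs/NaNs are found
        scaler.update()                  # update the scale only when stepping
        optimizer.zero_grad()
```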
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36601
Differential Revision: D21082295
Pulled By: ngimel
fbshipit-source-id: b2faa6c02b9f7e1972618a0f1d5360a03f0450ac
Summary:
Initial integration of eager autocasting, supporting out-of-place ops only for easier review.
Relevant issue/RFC: https://github.com/pytorch/pytorch/issues/25081
In-place ops and ops with user-supplied `out=...` can certainly be supported as well (my initial WIP https://github.com/pytorch/pytorch/pull/29552 handled many) but require substantially more complex special casing in the autocasting backend and tests. Support for these ops (much of which has already been written) will be broken into later PRs.
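A hedged sketch of out-of-place autocasting under the `torch.cuda.amp.autocast` name the feature shipped under:
```python
import torch

model = torch.nn.Linear(16, 16).cuda()
data = torch.randn(8, 16, device='cuda')

with torch.cuda.amp.autocast():
    out = model(data)          # matmul-heavy ops run in float16
    loss = out.float().sum()   # explicitly cast back for the fp32 reduction

loss.backward()                # backward runs outside the autocast region
```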
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32140
Differential Revision: D20346700
Pulled By: ezyang
fbshipit-source-id: 12d77b3917310186fbddf11c59b2794dc859131f
Summary:
This PR aims to improve the interoperability with [CuPy](https://github.com/cupy/cupy/pulls).
Instead of keeping two separate and conflicting memory pools, with this PR CuPy can allocate memory directly from the PyTorch allocator by means of this proposal: https://github.com/cupy/cupy/pull/3126
We would like to gather feedback on whether this approach makes sense for PyTorch, or whether alternative designs would be preferable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33860
Differential Revision: D20212788
Pulled By: ngimel
fbshipit-source-id: bc1e08a66da1992d26021147bf645dc65239581c
Summary:
Hard to get right locally... I can build the docs, but the local build never quite matches what it looks like live. The bullet-point indentation was just an oversight.
Removing the `Returns:` formatting tabs because they take up a lot of space when rendered and add no clarity. Some functions in PyTorch [do use them](https://pytorch.org/docs/master/torch.html#torch.eye), but [many don't bother](https://pytorch.org/docs/master/torch.html#torch.is_tensor), so apparently some people share my feelings (not using them is in line with existing practice).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33832
Differential Revision: D20135581
Pulled By: ngimel
fbshipit-source-id: bc788a7e57b142f95c4fa5baf3fe01f94c45abd8
Summary:
Also, the Windows memory failures responsible for the earlier reversion have been fixed.
This PR (initially) contains 2 commits:
* a revert of the revert
* all changes to implement the original Apex scale update heuristic, squashed into a single commit for easier diff review
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33366
Differential Revision: D20099026
Pulled By: ngimel
fbshipit-source-id: 339b9b6bd5134bf055057492cd1eedb7e4461529
Summary:
This PR implements the gradient scaling API that mruberry, jjsjann123, ngimel, zdevito, gchanan and I have been discussing. Relevant issue/RFC: https://github.com/pytorch/pytorch/issues/25081.
Volume-wise, this PR is mostly documentation and tests. The Python API (found entirely in `torch/cuda/amp/amp_scaler.py`) is lightweight. The exposed functions are intended to make the implementation and control flow of gradient scaling convenient, intuitive, and performant.
The API is probably easiest to digest by looking at the documentation and examples. `docs/source/amp.rst` is the homepage for the Automatic Mixed Precision package. `docs/source/notes/amp_examples.rst` includes several examples demonstrating common but not-immediately-obvious use cases. Examples are backed by tests in `test_cuda.py` (and thankfully the tests pass :P).
Two small utility kernels have been added in `native/cuda/AmpKernels.cu` to improve performance and avoid host-device synchronizations wherever possible.
Existing optimizers, both in the wild and in Pytorch core, do not need to change to use the scaling API.
However, the API was also designed to establish a contract between user scripts and optimizers such that writers of _new_ custom optimizers have the control points they need to implement fast, optionally sync-free updates. User scripts that obey the scaling API can drop such custom optimizers in and reap performance benefits without having to change anything aside from the optimizer constructor itself. [I know what the contract with custom optimizers should be](35829f24ef/torch/cuda/amp/amp_scaler.py (L179-L184)), but I'm waiting for review on the rest of the API before I go about documenting it (it will be given a dedicated section in `docs/source/notes/amp_examples.rst`).
Currently, the gradient scaling examples do not include the auto-casting API as discussed in https://github.com/pytorch/pytorch/issues/25081. The gradient scaling API is intended to be orthogonal/modular relative to autocasting. Without auto-casting the gradient scaling API is fully use-_able_, but not terribly use-_ful_, so it's up to you guys whether you want to wait until auto-casting is ready before merging the scaling API as well.
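A hedged sketch of the core update loop the API is built around, assuming the scaler is exposed as `torch.cuda.amp.GradScaler` (the working file in this PR is `amp_scaler.py`):
```python
import torch

model = torch.nn.Linear(8, 8).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):
    input = torch.randn(4, 8, device='cuda')
    optimizer.zero_grad()
    loss = model(input).pow(2).mean()
    scaler.scale(loss).backward()   # backward on the scaled loss
    scaler.step(optimizer)          # unscales grads; skips the step on inf/NaN grads
    scaler.update()                 # grows/shrinks the scale for the next iteration
```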
### Todo
- [ ] How do I get c10 registered status for my two custom kernels? They're very simple.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26512
Differential Revision: D19859905
Pulled By: mruberry
fbshipit-source-id: bb8ae6966214718dfee11345db824389e4286923
Summary:
Fixes https://github.com/pytorch/pytorch/issues/23401
We cannot rely on `multiprocessing.util.register_after_fork` since it is only
called for processes created by the `multiprocessing` module and not `os.fork()`.
Moving to `pthread_atfork` ensures the handler is always called. However, I don't think it's safe to call python functions inside of the `atfork` handler, so the python code has to be a bit more careful when checking `_initialized`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29101
Differential Revision: D18355451
Pulled By: ezyang
fbshipit-source-id: 4d4253a3669796212c099dad4e5bdfdb0df40469
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27850
Many of these are real problems in the documentation (i.e., link or
bullet point doesn't display correctly).
Test Plan: - built and viewed the documentation for each change locally.
Differential Revision: D17908123
Pulled By: zou3519
fbshipit-source-id: 65c92a352c89b90fb6b508c388b0874233a3817a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27782
Warnings show up when running `make html` to build documentation. All of
the warnings are very reasonable and point to bugs in our docs. This PR
attempts to fix most of those warnings.
In the future we will add something to the CI that asserts that there
are no warnings in our docs.
Test Plan: - build and view changes locally
Differential Revision: D17887067
Pulled By: zou3519
fbshipit-source-id: 6bf4d08764759133b20983d6cd7f5d27e5ee3166
Summary:
Adds comprehensive memory instrumentation to the CUDA caching memory allocator.
# Counters
Added comprehensive instrumentation for the following stats:
- Allocation requests (`allocation`)
- Allocated memory (`allocated_bytes`)
- Reserved segments from cudaMalloc (`segment`)
- Reserved memory (`reserved_bytes`)
- Active memory blocks (`active`)
- Active memory (`active_bytes`)
- Inactive, non-releasable blocks (`inactive_split`)
- Inactive, non-releasable memory (`inactive_split_bytes`)
- Number of failed cudaMalloc calls that result in a cache flush and retry (`cuda_malloc_retries`)
- Number of OOMs (`num_ooms`)
Except for the last two, these stats are segmented between all memory, large blocks, and small blocks. Along with the current value of each stat, historical counts of allocs/frees as well as peak usage are tracked by the allocator.
# Snapshots
Added the capability to get a "memory snapshot" – that is, to generate a complete dump of the allocator block/segment state.
# Implementation: major changes
- Added `torch.cuda.memory_stats()` (and associated C++ changes) which returns all instrumented stats as a dictionary.
- Added `torch.cuda.snapshot()` (and associated C++ changes) which returns a complete dump of the allocator block/segment state as a list of segments.
- Added memory summary generator in `torch.cuda.memory_summary()` for ease of client access to the instrumentation stats. Potentially useful to dump when catching OOMs. Sample output here: https://pastebin.com/uKZjtupq
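A hedged sketch of reading the new instrumentation; the dotted key names (stat, pool, metric) follow the current `torch.cuda.memory_stats()` output and may differ in detail:
```python
import torch

x = torch.empty(1024, 1024, device='cuda')

stats = torch.cuda.memory_stats()
print(stats["allocated_bytes.all.current"])  # bytes currently allocated
print(stats["reserved_bytes.all.peak"])      # peak bytes reserved via cudaMalloc
print(torch.cuda.memory_summary())           # human-readable report of all stats
```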
# Implementation: minor changes
- Add error-checking helper functions for Python dicts and lists in `torch/csrc/utils/`.
- Existing memory management functions in `torch.cuda` moved from `__init__.py` to `memory.py` and star-imported to the main CUDA module.
- Add various helper functions to `torch.cuda` to return individual items from `torch.cuda.memory_stats()`.
- `torch.cuda.reset_max_memory_cached()` and `torch.cuda.reset_max_memory_allocated()` are deprecated in favor of `reset_peak_stats`. It's a bit difficult to think of a case where only one of those stats should be reset, and IMO this makes the peak stats collectively more consistent.
- `torch.cuda.memory_cached()` and `torch.cuda.max_memory_cached()` are deprecated in favor of `*memory_reserved()`.
- Style (add access modifiers in the allocator class, random nit fixes, etc.)
# Testing
- Added consistency check for stats in `test_cuda.py`. This verifies that the data from `memory_stats()` is faithful to the data from `snapshot()`.
- Ran on various basic workflows (toy example, CIFAR)
# Performance
Running the following speed benchmark: https://pastebin.com/UNndQg50
- Before this PR: 45.98 microseconds per tensor creation
- After this PR: 46.65 microseconds per tensor creation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27361
Differential Revision: D17758747
Pulled By: jma127
fbshipit-source-id: 5a84e82d696c40c505646b9a1b4e0c3bba38aeb6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25788
Previously, I thought that _lazy_init held the GIL throughout initialization, so
I could write the code in a single-threaded manner. This is not true; it
releases the GIL at various points, which makes it possible for another thread to
race with initialization.
The correct fix is to add locking for the initialization section, so other
threads wait until the first thread finishes initializing before being let
in. There is some subtlety with how to handle lazy calls, which will call
_lazy_init reentrantly; this is handled using TLS that lets you know if you
are the initializing thread (and therefore reentrant calls are OK.)
Fixes #16559
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D17366348
Pulled By: ezyang
fbshipit-source-id: 99b982709323e2370d03c127c46d87be97495916
Summary:
Currently, set_rng_state and get_rng_state do not accept a string as their device parameter. This commit lets them accept strings.
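A hedged sketch of the new flexibility:
```python
import torch

# The device may now be passed as a plain string instead of a torch.device.
state = torch.cuda.get_rng_state('cuda:0')
torch.cuda.set_rng_state(state, 'cuda:0')
```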
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23448
Differential Revision: D16527172
Pulled By: soumith
fbshipit-source-id: 8f9a2129979706e16877cc110f104770fbbe952c
Summary:
Added stubs for:
* The `device` module
* The `cuda` module
* Parts of the `optim` module
* Began adding stubs for the `autograd` module. I'll annotate more later but `no_grad` and friends are probably the most used exports from it so it seemed like a good place to start.
This would close #16996, although comments on that issue reference other missing stubs so maybe it's worth keeping open as an umbrella issue.
The big remaining missing package is `nn`.
Also added a `py.typed` file so mypy will pick up on the type stubs. That closes #17639.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18511
Differential Revision: D14715053
Pulled By: ezyang
fbshipit-source-id: 9e4882ac997063650e6ce47604b3eaf1232c61c9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18598
ghimport-source-id: c74597e5e7437e94a43c163cee0639b20d0d0c6a
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18598 Turn on F401: Unused import warning.**
This was requested by someone at Facebook; this lint is turned
on for Facebook by default. "Sure, why not."
I had to noqa a number of imports in __init__. Hypothetically
we're supposed to use __all__ in this case, but I was too lazy
to fix it. Left for future work.
Be careful! flake8-2 and flake8-3 behave differently with
respect to import resolution for # type: comments. flake8-3 will
report an import unused; flake8-2 will not. For now, I just
noqa'd all these sites.
All the changes were done by hand.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D14687478
fbshipit-source-id: 30d532381e914091aadfa0d2a5a89404819663e3
Summary:
This is to fix #16141 and similar issues.
The idea is to track a reference to every shared CUDA Storage and deallocate memory only after a consumer process deallocates received Storage.
ezyang Done with cleanup. Same (insignificantly better) performance as in the file-per-share solution, but it handles millions of shared tensors easily. Note: documentation is still in progress.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16854
Differential Revision: D13994490
Pulled By: VitalyFedyunin
fbshipit-source-id: 565148ec3ac4fafb32d37fde0486b325bed6fbd1
Summary:
When switching back to `d0` from a stream on a different device `d1`, we need to restore the current streams on both `d0` and `d1`. The current implementation only does that for `d0`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17439
Differential Revision: D14208919
Pulled By: mrshenli
fbshipit-source-id: 89f2565b9977206256efbec42adbd789329ccad8
Summary:
This is the first commit from a series of planned changes in order to add boolean tensors to PyTorch. The whole plan looks like this:
0. Storage Implementation (this change)
1. Tensor Creation.
2. Tensor Conversions.
3. Tensor Indexing.
4. Tensor Operations.
5. Back compatibility related changes.
This feature was requested by the community:
https://github.com/pytorch/pytorch/issues/4764
https://github.com/pytorch/pytorch/issues/4219
https://github.com/pytorch/pytorch/issues/4288
**Change**:
Added boolean type to the Storage class for CPU and CUDA backends.
**Tested via**:
1. unit tests
2. running this:
-> import torch
-> torch.BoolStorage
<class 'torch.BoolStorage'>
-> torch.cuda.BoolStorage
<class 'torch.cuda.BoolStorage'>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16810
Reviewed By: gchanan
Differential Revision: D14087246
Pulled By: izdeby
fbshipit-source-id: 042642ced1cb0fd1bb6bff05f9ca871a5c54ee5e
Summary:
1. Added `torch/csrc/cuda/Event.h` and `torch/csrc/cuda/Event.cpp` to bind Python Event class to C++ implementation.
2. Move all CUDA runtime invocations from `torch/cuda/streams.py` to C++
3. Added tests to cover Stream and Event APIs. ~(event IPC handle tests is introduced in #15974)~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15937
Differential Revision: D13649001
Pulled By: mrshenli
fbshipit-source-id: 84ca58f35f6ba679a4ba33150ceba678d760d240
Summary:
See #15682
Pushing up this small PR to check if I am doing the right thing. If correct, more will follow for other Stream APIs. Questions will be added inline.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15737
Differential Revision: D13581400
Pulled By: mrshenli
fbshipit-source-id: 24afed7847b89b62f0692c79a101ec7ff9d9ee4d
Summary:
see #15682
This is a quick fix implementing the simpler solution suggested by colesbury. As the benchmark result shows, it slows down `Stream.query()` by ~20%. I would be happy to further pursue a more complex solution by implementing this in C++/ATen, but I would still vote to merge this quick fix first just to get rid of the bug sooner.
~Test TBA~ Added
FYI jeffreyksmithjr
now
```python
In [1]: def f():
   ...:     d0 = torch.device('cuda:0')
   ...:     d1 = torch.device('cuda:1')
   ...:     with torch.cuda.device(d0):
   ...:         s0 = torch.cuda.current_stream()
   ...:     with torch.cuda.device(d1):
   ...:         s1 = torch.cuda.current_stream()
   ...:     s0.query()
   ...:     s1.query()
In [4]: %timeit f()
38.1 µs ± 4.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [5]: %timeit f()
37.6 µs ± 2.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
before
```python
In [4]: %timeit f()
28.5 µs ± 1.74 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [5]: %timeit f()
35.3 µs ± 2.91 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15689
Differential Revision: D13571697
Pulled By: mrshenli
fbshipit-source-id: 4fe697f91248c6419136d37bb5b7147e612e2f4c
Summary:
Now that `cuda.get/set_rng_state` accept `device` objects, the default value should be a device object, and the doc should mention so.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14324
Reviewed By: ezyang
Differential Revision: D13528707
Pulled By: soumith
fbshipit-source-id: 32fdac467dfea6d5b96b7e2a42dc8cfd42ba11ee
Summary:
Addresses #918; interpolation results should be similar to TensorFlow's.
* Adds bicubic interpolation operator to `nn.functional.interpolate`
* Corresponding test in `test_nn.py`
The operator is added in legacy `TH` to be aligned with the other upsampling operators; they can be refactored/moved to ATen all at once when #10482 is resolved
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9849
Differential Revision: D9007525
Pulled By: driazati
fbshipit-source-id: 93ef49a34ce4e5ffd4bda94cd9a6ddc939f0a4cc
Summary: Record unit of time for torch.cuda.Event's elapsed_time
Differential Revision: D13467646
Pulled By: zou3519
fbshipit-source-id: 4f1f4ef5fa4bc5a1b4775dfcec6ab155e5bf8d6e
Summary:
In `broadcast_coalesced`, since multiple variables can be "views" of a big flattened tensor, they can share the same version counter. However, this base flat tensor is not exposed and they don't share any memory locations, so this is not necessary. Furthermore, it can cause problems, e.g., when two buffers are broadcast together in `DataParallel` and one of them is modified in-place during `forward` but the other is needed in backward, autograd engine will complain.
Fixing the bug discovered at https://github.com/pytorch/pytorch/pull/13350#issuecomment-436011370
edit: This is a very real problem. E.g., consider using Spectral Norm + Batch Norm together.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13594
Differential Revision: D12967311
Pulled By: SsnL
fbshipit-source-id: 52998dbabe149f575cf0fb79e7016f0b95e4b9e5
Summary:
The pytorch.org site redirects all of the http:// requests to the https:// site anyway, so the comments and error messages might as well refer directly to the https:// site. The GitHub project description should also be updated to point to https://pytorch.org
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12636
Differential Revision: D10377099
Pulled By: soumith
fbshipit-source-id: f47eaba1dd3eecc5dbe62afaf7022573dc3fd039
Summary:
Background: we run PyTorch in embedded C++ pipelines, running in C++ GUIs in https://github.com/Kitware/VIAME, and without this addition the call was failing with the below error, but only on certain Windows platforms/configurations:
OSError: [WinError 6] The handle is invalid
At:
C:\Program Files\VIAME\Python36\site-packages\torch\cuda\__init__.py(162): _lazy_init
C:\Program Files\VIAME\Python36\site-packages\torch\nn\modules\module.py(249): <lambda>
C:\Program Files\VIAME\Python36\site-packages\torch\nn\modules\module.py(182): _apply
C:\Program Files\VIAME\Python36\site-packages\torch\nn\modules\module.py(176): _apply
C:\Program Files\VIAME\Python36\site-packages\torch\nn\modules\module.py(249): cuda
C:\Program Files\VIAME\lib\python3.6None\site-packages\kwiver\arrows\pytorch\pytorch_resnet_f_extractor.py(74): __init__
C:\Program Files\VIAME\lib\python3.6None\site-packages\kwiver\processes\resnet_descriptors.py(132): _configure
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10379
Differential Revision: D9330772
Pulled By: ezyang
fbshipit-source-id: 657ae7590879004558158d3c4abef2ec11d9ed57
Summary:
As I try to replicate DP in C++, I need to move some functions into C++ from Python. This PR ports the scatter and gather primitives from Python in torch/cuda/comm.py to C++ in torch/csrc/cuda/comm.cpp. The basic infrastructure was already there, since apaszke had rewritten broadcast in C++ already.
I'm not very familiar with this code, so let me know if I'm doing something wrong. I largely just literally translated the code.
I don't know how "public" `torch.cuda.comm` is, but I feel like the `destination_index` parameter for `gather` should be changed from -1 indicating CPU to `None` indicating CPU, and `-1` indicating the default CUDA device. That would make the code clearer IMO.
apaszke colesbury teng-li pietern
Closes https://github.com/pytorch/pytorch/pull/9117
Differential Revision: D8721729
Pulled By: goldsborough
fbshipit-source-id: 1844a488079d21fa209b32e2c73e48632cbe9e68
* fix type mismatch while calling torch._C._cuda_setDevice
* fix type mismatch in scatter
* fix type mismatch in scatter
* fix type mismatch while calling torch._C._cuda_setDevice
* fix type mismatch while calling torch._C._cuda_setDevice
* fix type mismatch while calling torch._C._cuda_setDevice
Getting the CUDA device property struct with cudaGetDeviceProperties is expensive. THC caches CUDA device properties, which are available via THCState_getDeviceProperties, which is available via at::globalContext().getDeviceProperties(device), which is available via torch.cuda.get_device_properties. This PR changes the two methods that previously called cudaGetDeviceProperties to directly use torch.cuda.get_device_properties in Python (a usage sketch follows the timings below).
Also fixes an ATen compile error when it can't find CUDA.
Fixes #4908. Using the script from that issue, we get roughly an 18x speed-up.
[ssnl@ ~] python dev.py # master
0.2826697587966919
0.00034999847412109375
0.0003493785858154297
0.000356292724609375
0.00036025047302246094
0.0003629922866821289
0.00036084651947021484
0.00035686492919921874
0.00036056041717529296
0.0003606319427490234
[ssnl@ ~] python dev.py # this PR
0.27275662422180175
2.1147727966308594e-05
1.9598007202148438e-05
1.94549560546875e-05
1.9359588623046876e-05
1.938343048095703e-05
2.0074844360351563e-05
1.952648162841797e-05
1.9311904907226562e-05
1.938343048095703e-05
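A hedged sketch of the Python-side call both methods now route through; the returned properties object is cached, so repeated lookups skip cudaGetDeviceProperties:
```python
import torch

props = torch.cuda.get_device_properties(0)
print(props.name, props.major, props.minor, props.total_memory)
```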
This deletes most of the dead Tensor code paths, including the TensorMethods cwrap and generic/Tensor.cpp.
This also moves the THNN.cwrap/.cpp generation to generate_code which can use ninja if installed.
This replaces the torch.Tensor constructors with factories that produce
Variables. Similarly, functions on the torch module (e.g. torch.randn)
now return Variables.
To keep the PR to a reasonable size, I've left most of the unused tensor
code. Subsequent PRs will remove the dead code, clean-up calls to
torch.autograd.Variable, and rename Variable to Tensor everywhere.
There are some breaking changes because Variable and Tensors had
slightly different semantics. There's a list of those changes here:
https://github.com/pytorch/pytorch/wiki/Breaking-Changes-from-Variable-and-Tensor-merge
The Tensor and Variable classes are being merged.
autograd.Function.forward is now called on Variables, but with "no-grad"
mode (torch.no_grad()) enabled.
One benefit is that we no longer have to explicitly track shared
storages.
* Replace async with non_blocking for Python 3.7 upgrade
* Remove trailing whitespace
* Give _cuda and _type kwargs and accept async for compatibility
* Rename async to non_blocking in all C++ code
* Add entries for async in python_variable_methods
* Friendlier backward compatibility for cuda and type
Adds streams and comms as optional arguments to the NCCL calls in
torch.cuda.nccl. Also exposes ncclUniqueId and ncclCommInitRank for
multi-process mode.
Moves Py_RETURN_NONE statements after the GIL is re-acquired.
* Avoid casting integer params and buffers to float(), double() and half()
* Add test for immune integer buffers
* Fix documentation for float(), double() and half()
* Fix test