Summary:
Add `torch._C._cuda_getArchFlags()` that returns list of architecture `torch_cuda` were compiled with
Add `torch.cuda.get_arch_list()` and `torch.cuda.get_gencode_flags()` methods that returns architecture list and gencode flags PyTorch were compiled with
Print warning if some of GPUs is not compatible with any of the CUBINs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41173
Differential Revision: D22459998
Pulled By: malfet
fbshipit-source-id: 65d40ae29e54a0ba0f3f2da11b821fdb4d452d95
Summary:
Currently, a custom autograd function written with
```
torch.cuda.amp.custom_fwd(cast_inputs=dtype)
def forward(ctx, *args):
...
```
casts incoming floating-point CUDA tensors to `dtype` unconditionally, regardless of whether the function executes in an autocast-enabled region. I think I had the wrong idea there. Autocast-disabled regions should give the user control of input types. Also, `custom_fwd(cast_inputs=dtype)`-decorated functions' behavior should align with native fp32list/fp16list functions. C++-side casting wrappers have no effect when autocast is disabled, and `custom_fwd`'s casting should behave the same way.
The present PR changes `custom_fwd` so it only casts in autocast-enabled regions (also updates custom_fwd to ignore fp64 inputs, like the C++ wrappers).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36171
Differential Revision: D22179511
Pulled By: ngimel
fbshipit-source-id: 5a93d070179a43206066bce19da0a5a19ecaabbd
Summary:
I.e. do not accept `bytes` as possible type of `device` argument in
`torch.cuda._get_device_index`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40322
Differential Revision: D22176885
Pulled By: malfet
fbshipit-source-id: 2f3a46174161f1cdcf6a6ad94a31e54b18ad6186
Summary:
Use it from both __init__ and streams to define dummy types when CUDA is missing
Fix accidental reference of global `storage_name` from `_dummy_type`
Add type annotations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40177
Differential Revision: D22106922
Pulled By: malfet
fbshipit-source-id: 52fbfd91d70a78eb14d7ffda109c02ad1231497e
Summary:
While working on https://github.com/pytorch/pytorch/issues/38911, I realized that `nccl.reduce` only needs a single output tensor, while our current implementation requires a list of output tensors. This, along with a TODO I fixed in reduce_add, should have some speed up for data parallel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39688
Differential Revision: D22034547
Pulled By: mrshenli
fbshipit-source-id: e74d54d673ebbb062474b1bb5cc93a095a3a5f6c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39483
I fixed all of the new errors that occurred because of the upgrade.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D21884575
Pulled By: ezyang
fbshipit-source-id: 45c8e1f1ecb410c8d7c46dd3922ad70e982a0685
Summary:
Following up on this: https://github.com/pytorch/pytorch/pull/35851 cross dtype storage copy is not being used internally, so I have not included cross dtype copy for complex.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35771
Differential Revision: D21319650
Pulled By: anjali411
fbshipit-source-id: 07c72996ee598eba0cf401ad61534494d6f5b5b3
Summary:
- added tests that showcase the problems
- fixed the problems
These changes would allow me to remove many "# type: ignore" comments in my codebase.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36358
Differential Revision: D21230704
Pulled By: ezyang
fbshipit-source-id: e6d475a0aa1fb40258fa0231ade28c38108355fb
Summary:
Several people have asked me about proper Amp usage with gradient accumulation. In particular, it's [unclear to people](https://github.com/NVIDIA/apex/issues/439#issuecomment-610351482) that you should only call `scaler.unscale_()` (if desired) and `scaler.update()` in iterations where you actually plan to step. This PR adds a minimal accumulation example.
I built the docs locally and it looks free from sphinx errors, at least.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36601
Differential Revision: D21082295
Pulled By: ngimel
fbshipit-source-id: b2faa6c02b9f7e1972618a0f1d5360a03f0450ac
Summary:
Initial integration of eager autocasting, supporting out-of-place ops only for easier review.
Relevant issue/RFC: https://github.com/pytorch/pytorch/issues/25081
In-place ops and ops with user-supplied `out=...` can certainly be supported as well (my initial WIP https://github.com/pytorch/pytorch/pull/29552 handled many) but require substantially more complex special casing in the autocasting backend and tests. Support for these ops (much of which has already been written) will be broken into later PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32140
Differential Revision: D20346700
Pulled By: ezyang
fbshipit-source-id: 12d77b3917310186fbddf11c59b2794dc859131f
Summary:
This PR aims to improve the interoperability with [CuPy](https://github.com/cupy/cupy/pulls).
Instead of having two separate and conflicting memory pools. With this PR, CuPy can directly alloc memory from the PyTorch allocator by means of this proposal https://github.com/cupy/cupy/pull/3126
We would like to gather feedback to know if this approach makes sense for PyTorch, or other alternative designs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33860
Differential Revision: D20212788
Pulled By: ngimel
fbshipit-source-id: bc1e08a66da1992d26021147bf645dc65239581c
Summary:
hard to get right locally...I can build the docs but never quite match what it looks like live. the bullet point indentation was just an oversight.
Removing `Returns:` formatting tabs because they take up a lot of space when rendered and add no clarity. Some functions in Pytorch [do use them](https://pytorch.org/docs/master/torch.html#torch.eye), but [many don't bother](https://pytorch.org/docs/master/torch.html#torch.is_tensor), so apparently some people shared my feelings (Not using them is in line with existing practice).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33832
Differential Revision: D20135581
Pulled By: ngimel
fbshipit-source-id: bc788a7e57b142f95c4fa5baf3fe01f94c45abd8
Summary:
Also, windows memory failures responsible for the earlier reversion have been fixed.
This PR (initially) contains 2 commits:
* a revert of the revert
* all changes to implement the original Apex scale update heuristic, squashed into a single commit for easier diff review
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33366
Differential Revision: D20099026
Pulled By: ngimel
fbshipit-source-id: 339b9b6bd5134bf055057492cd1eedb7e4461529
Summary:
This PR implements the gradient scaling API that mruberry, jjsjann123, ngimel, zdevito, gchanan and I have been discussing. Relevant issue/RFC: https://github.com/pytorch/pytorch/issues/25081.
Volume-wise, this PR is mostly documentation and tests. The Python API (found entirely in `torch/cuda/amp/amp_scaler.py`) is lightweight . The exposed functions are intended to make the implementation and control flow of gradient scaling convenient, intuitive, and performant.
The API is probably easiest to digest by looking at the documentation and examples. `docs/source/amp.rst` is the homepage for the Automatic Mixed Precision package. `docs/source/notes/amp_examples.rst` includes several examples demonstrating common but not-immediately-obvious use cases. Examples are backed by tests in `test_cuda.py` (and thankfully the tests pass :P).
Two small utility kernels have been added in `native/cuda/AmpKernels.cu` to improve performance and avoid host-device synchronizations wherever possible.
Existing optimizers, both in the wild and in Pytorch core, do not need to change to use the scaling API.
However, the API was also designed to establish a contract between user scripts and optimizers such that writers of _new_ custom optimizers have the control points they need to implement fast, optionally sync-free updates. User scripts that obey the scaling API can drop such custom optimizers in and reap performance benefits without having to change anything aside from the optimizer constructor itself. [I know what the contract with custom optimizers should be](35829f24ef/torch/cuda/amp/amp_scaler.py (L179-L184)), but I'm waiting for review on the rest of the API before I go about documenting it (it will be given a dedicated section in `docs/source/notes/amp_examples.rst`.
Currently, the gradient scaling examples do not include the auto-casting API as discussed in https://github.com/pytorch/pytorch/issues/25081. The gradient scaling API is intended to be orthogonal/modular relative to autocasting. Without auto-casting the gradient scaling API is fully use-_able_, but not terribly use-_ful_, so it's up to you guys whether you want to wait until auto-casting is ready before merging the scaling API as well.
### Todo
- [ ] How do I get c10 registered status for my two custom kernels? They're very simple.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26512
Differential Revision: D19859905
Pulled By: mruberry
fbshipit-source-id: bb8ae6966214718dfee11345db824389e4286923
Summary:
Fixes https://github.com/pytorch/pytorch/issues/23401
We cannot rely on `multiprocessing.util.register_after_fork` since it is only
called for processes created by the `multiprocessing` module and not `os.fork()`.
Moving to `pthread_atfork` does always get called. However, I don't think it's safe to call python functions inside of the `atfork` handler so the python code has to be a bit more careful when checking `_initialized`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29101
Differential Revision: D18355451
Pulled By: ezyang
fbshipit-source-id: 4d4253a3669796212c099dad4e5bdfdb0df40469
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27850
Many of these are real problems in the documentation (i.e., link or
bullet point doesn't display correctly).
Test Plan: - built and viewed the documentation for each change locally.
Differential Revision: D17908123
Pulled By: zou3519
fbshipit-source-id: 65c92a352c89b90fb6b508c388b0874233a3817a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27782
Warnings show up when running `make html` to build documentation. All of
the warnings are very reasonable and point to bugs in our docs. This PR
attempts to fix most of those warnings.
In the future we will add something to the CI that asserts that there
are no warnings in our docs.
Test Plan: - build and view changes locally
Differential Revision: D17887067
Pulled By: zou3519
fbshipit-source-id: 6bf4d08764759133b20983d6cd7f5d27e5ee3166
Summary:
Adds comprehensive memory instrumentation to the CUDA caching memory allocator.
# Counters
Added comprehensive instrumentation for the following stats:
- Allocation requests (`allocation`)
- Allocated memory (`allocated_bytes`)
- Reserved segments from cudaMalloc (`segment`)
- Reserved memory (`reserved_bytes`)
- Active memory blocks (`active`)
- Active memory (`active_bytes`)
- Inactive, non-releasable blocks (`inactive_split`)
- Inactive, non-releasable memory (`inactive_split_bytes`)
- Number of failed cudaMalloc calls that result in a cache flush and retry (`cuda_malloc_retries`)
- Number of OOMs (`num_ooms`)
Except for the last two, these stats are segmented between all memory, large blocks, and small blocks. Along with the current value of each stat, historical counts of allocs/frees as well as peak usage are tracked by the allocator.
# Snapshots
Added the capability to get a "memory snapshot" – that is, to generate a complete dump of the allocator block/segment state.
# Implementation: major changes
- Added `torch.cuda.memory_stats()` (and associated C++ changes) which returns all instrumented stats as a dictionary.
- Added `torch.cuda.snapshot()` (and associated C++ changes) which returns a complete dump of the allocator block/segment state as a list of segments.
- Added memory summary generator in `torch.cuda.memory_summary()` for ease of client access to the instrumentation stats. Potentially useful to dump when catching OOMs. Sample output here: https://pastebin.com/uKZjtupq
# Implementation: minor changes
- Add error-checking helper functions for Python dicts and lists in `torch/csrc/utils/`.
- Existing memory management functions in `torch.cuda` moved from `__init__.py` to `memory.py` and star-imported to the main CUDA module.
- Add various helper functions to `torch.cuda` to return individual items from `torch.cuda.memory_stats()`.
- `torch.cuda.reset_max_memory_cached()` and `torch.cuda.reset_max_memory_allocated()` are deprecated in favor of `reset_peak_stats`. It's a bit difficult to think of a case where only one of those stats should be reset, and IMO this makes the peak stats collectively more consistent.
- `torch.cuda.memory_cached()` and `torch.cuda.max_memory_cached()` are deprecated in favor of `*memory_reserved()`.
- Style (add access modifiers in the allocator class, random nit fixes, etc.)
# Testing
- Added consistency check for stats in `test_cuda.py`. This verifies that the data from `memory_stats()` is faithful to the data from `snapshot()`.
- Ran on various basic workflows (toy example, CIFAR)
# Performance
Running the following speed benchmark: https://pastebin.com/UNndQg50
- Before this PR: 45.98 microseconds per tensor creation
- After this PR: 46.65 microseconds per tensor creation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27361
Differential Revision: D17758747
Pulled By: jma127
fbshipit-source-id: 5a84e82d696c40c505646b9a1b4e0c3bba38aeb6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25788
Previously, I thought that _lazy_init held the GIL throughout initialization, so
I could write the code in a single-threaded manner. This is not true; it
releases the GIL at various points, which make it possible for another thread to
race with initialization.
The correct fix is to add locking for the initialization section, so other
threads wait until the first thread finishes initializing before being let
in. There is some subtlety with how to handle lazy calls, which will call
_lazy_init reentrantly; this is handled using TLS that lets you know if you
are the initializing thread (and therefore reentrant calls are OK.)
Fixes#16559
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D17366348
Pulled By: ezyang
fbshipit-source-id: 99b982709323e2370d03c127c46d87be97495916
Summary:
Currently set_rng_state and get_rng_state do not accept string as their parameters. This commit let them accept strings.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23448
Differential Revision: D16527172
Pulled By: soumith
fbshipit-source-id: 8f9a2129979706e16877cc110f104770fbbe952c
Summary:
Added stubs for:
* The `device` module
* The `cuda` module
* Parts of the `optim` module
* Began adding stubs for the `autograd` module. I'll annotate more later but `no_grad` and friends are probably the most used exports from it so it seemed like a good place to start.
This would close#16996, although comments on that issue reference other missing stubs so maybe it's worth keeping open as an umbrella issue.
The big remaining missing package is `nn`.
Also added a `py.typed` file so mypy will pick up on the type stubs. That closes#17639.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18511
Differential Revision: D14715053
Pulled By: ezyang
fbshipit-source-id: 9e4882ac997063650e6ce47604b3eaf1232c61c9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18598
ghimport-source-id: c74597e5e7437e94a43c163cee0639b20d0d0c6a
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18598 Turn on F401: Unused import warning.**
This was requested by someone at Facebook; this lint is turned
on for Facebook by default. "Sure, why not."
I had to noqa a number of imports in __init__. Hypothetically
we're supposed to use __all__ in this case, but I was too lazy
to fix it. Left for future work.
Be careful! flake8-2 and flake8-3 behave differently with
respect to import resolution for # type: comments. flake8-3 will
report an import unused; flake8-2 will not. For now, I just
noqa'd all these sites.
All the changes were done by hand.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D14687478
fbshipit-source-id: 30d532381e914091aadfa0d2a5a89404819663e3
Summary:
This is to fix#16141 and similar issues.
The idea is to track a reference to every shared CUDA Storage and deallocate memory only after a consumer process deallocates received Storage.
ezyang Done with cleanup. Same (insignificantly better) performance as in file-per-share solution, but handles millions of shared tensors easily. Note [ ] documentation in progress.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16854
Differential Revision: D13994490
Pulled By: VitalyFedyunin
fbshipit-source-id: 565148ec3ac4fafb32d37fde0486b325bed6fbd1