Commit Graph

180 Commits

Author SHA1 Message Date
Nikita Shulga
43d746305c
Preserve CUDA gencode flags (#41212)
Summary:
Add `torch._C._cuda_getArchFlags()` that returns list of architecture `torch_cuda` were compiled with
Add `torch.cuda.get_arch_list()` and `torch.cuda.get_gencode_flags()` methods that returns architecture list and gencode flags PyTorch were compiled with
Print warning if some of GPUs is not compatible with any of the CUBINs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41173

Differential Revision: D22459998

Pulled By: malfet

fbshipit-source-id: 65d40ae29e54a0ba0f3f2da11b821fdb4d452d95
2020-07-09 17:34:50 -07:00
Michael Carilli
3b040c478a Make custom_fwd a no-op when not executed under autocast (#36171)
Summary:
Currently, a custom autograd function written with
```
torch.cuda.amp.custom_fwd(cast_inputs=dtype)
def forward(ctx, *args):
    ...
```
casts incoming floating-point CUDA tensors to `dtype` unconditionally, regardless of whether the function executes in an autocast-enabled region.  I think I had the wrong idea there.  Autocast-disabled regions should give the user control of input types.  Also, `custom_fwd(cast_inputs=dtype)`-decorated functions' behavior should align with native fp32list/fp16list functions.  C++-side casting wrappers have no effect when autocast is disabled, and  `custom_fwd`'s casting should behave the same way.

The present PR changes `custom_fwd` so it only casts in autocast-enabled regions (also updates custom_fwd to ignore fp64 inputs, like the C++ wrappers).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36171

Differential Revision: D22179511

Pulled By: ngimel

fbshipit-source-id: 5a93d070179a43206066bce19da0a5a19ecaabbd
2020-06-23 10:23:02 -07:00
Nikita Shulga
5766da503b Device name should be a string, not bytes (#40322)
Summary:
I.e. do not accept `bytes` as possible type of `device` argument in
`torch.cuda._get_device_index`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40322

Differential Revision: D22176885

Pulled By: malfet

fbshipit-source-id: 2f3a46174161f1cdcf6a6ad94a31e54b18ad6186
2020-06-22 19:27:25 -07:00
Nikita Shulga
8b5732e8ad Move torch.cuda annotations inline (#40075)
Summary:
Also enable `torch.cuda` typechecking
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40075

Differential Revision: D22121275

Pulled By: malfet

fbshipit-source-id: dbecef09911334e8f3d87f5ecab66349da9f2325
2020-06-18 15:52:29 -07:00
Nikita Shulga
76fbfba644 Move _dummy_type to _utils.py (#40177)
Summary:
Use it from both __init__ and streams to define dummy types when CUDA is missing
Fix accidental reference of global `storage_name` from `_dummy_type`
Add type annotations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40177

Differential Revision: D22106922

Pulled By: malfet

fbshipit-source-id: 52fbfd91d70a78eb14d7ffda109c02ad1231497e
2020-06-17 22:50:02 -07:00
SsnL
d5236f8517 Avoid initializing unnecessary tensors in nccl.reduce (#39688)
Summary:
While working on https://github.com/pytorch/pytorch/issues/38911, I realized that `nccl.reduce` only needs a single output tensor, while our current implementation requires a list of output tensors. This, along with a TODO I fixed in reduce_add, should have some speed up for data parallel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39688

Differential Revision: D22034547

Pulled By: mrshenli

fbshipit-source-id: e74d54d673ebbb062474b1bb5cc93a095a3a5f6c
2020-06-14 10:11:32 -07:00
Edward Yang
da2004e132 Upgrade lint. (#39483)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39483

I fixed all of the new errors that occurred because of the upgrade.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D21884575

Pulled By: ezyang

fbshipit-source-id: 45c8e1f1ecb410c8d7c46dd3922ad70e982a0685
2020-06-04 12:56:43 -07:00
Michael Carilli
25f918548d Allow GradScaler to be pickled (#38296)
Summary:
Should unblock https://github.com/PyTorchLightning/pytorch-lightning/issues/1782.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38296

Differential Revision: D21553296

Pulled By: albanD

fbshipit-source-id: 9041a72d7cf8833e4b01bc767fd2321f17c7c5f2
2020-05-14 09:14:28 -07:00
Jerry Ma
88c447bf71 Change DeprecationWarning to UserWarning in torch.cuda (#32142)
Summary:
Follow-up of https://github.com/pytorch/pytorch/issues/27361 .

Addresses https://github.com/pytorch/pytorch/issues/32141 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32142

Differential Revision: D19404540

Pulled By: gchanan

fbshipit-source-id: f0b230a3224004286064da2b617ff471ba272f47
2020-05-06 08:28:43 -07:00
anjali411
1f09f7ea44 Python API for Complex Storage and storage copy logic (#35771)
Summary:
Following up on this: https://github.com/pytorch/pytorch/pull/35851 cross dtype storage copy is not being used internally, so I have not included cross dtype copy for complex.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35771

Differential Revision: D21319650

Pulled By: anjali411

fbshipit-source-id: 07c72996ee598eba0cf401ad61534494d6f5b5b3
2020-05-01 11:47:22 -07:00
Bartosz Gasiorzewski
867e05921f Fix multiple issues with type annotations (#36358)
Summary:
- added tests that showcase the problems
- fixed the problems

These changes would allow me to remove many "# type: ignore" comments in my codebase.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36358

Differential Revision: D21230704

Pulled By: ezyang

fbshipit-source-id: e6d475a0aa1fb40258fa0231ade28c38108355fb
2020-04-29 11:16:39 -07:00
Thomas Viehmann
dc4d888193 ROCm: don't warn about CUDA compute capabilities (#35949)
Summary:
The warning doesn't apply for ROCm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35949

Differential Revision: D21050540

Pulled By: ezyang

fbshipit-source-id: 23b13bddd13f132c2017ddea12b2dc54f611fba6
2020-04-20 11:53:56 -07:00
Michael Carilli
e6bc34f549 Amp gradient accumulation example (#36601)
Summary:
Several people have asked me about proper Amp usage with gradient accumulation.  In particular, it's [unclear to people](https://github.com/NVIDIA/apex/issues/439#issuecomment-610351482) that you should only call `scaler.unscale_()` (if desired) and `scaler.update()` in iterations where you actually plan to step.  This PR adds a minimal accumulation example.

I built the docs locally and it looks free from sphinx errors, at least.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36601

Differential Revision: D21082295

Pulled By: ngimel

fbshipit-source-id: b2faa6c02b9f7e1972618a0f1d5360a03f0450ac
2020-04-17 09:56:36 -07:00
Michael Carilli
0f0271e255 [RELAND2] Eager autocasting, out-of-place ops only (with MSVC 2017 fix) (#35102)
Summary:
This is the second reland attempt for https://github.com/pytorch/pytorch/pull/32140.

The first reland attempt https://github.com/pytorch/pytorch/pull/35011 failed due a [small incompatible change](https://github.com/pytorch/pytorch/pull/35011#issuecomment-601754216) in recent master (`skipIfRocm` was removed from `test_data_parallel.py`).

The present PR restores skipIfRocm.

Description from first reland attempt https://github.com/pytorch/pytorch/pull/35011:

> https://github.com/pytorch/pytorch/pull/32140 was approved and merged, but [reverted](d0577e19f0) because it broke builds with versions of Visual Studio older than 15.8 that were not represented in public CI.  The build failures were caused by a [known VS bug](https://developercommunity.visualstudio.com/content/problem/27729/allow-function-with-internal-linkage-as-template-n.html), fixed in versions 15.8 and newer.
>
> The present PR reverts the revert (restoring https://github.com/pytorch/pytorch/pull/32140 's diffs) and adds a workaround to enable compilation with VS < 15.8.  The workaround isn't pretty, but it's guarded by macros such that it's only used when compiling with VS < 15.8.  All other builds compile with the same code/control flow as was merged in https://github.com/pytorch/pytorch/pull/32140.
>
> Original description of https://github.com/pytorch/pytorch/pull/32140:
> > Initial integration of eager autocasting, supporting out-of-place ops only for easier review.
> Relevant issue/RFC: https://github.com/pytorch/pytorch/issues/25081
>
> > In-place ops and ops with user-supplied out=... can certainly be supported as well (my initial WIP https://github.com/pytorch/pytorch/issues/29552 handled many) but require substantially more complex special casing in the autocasting backend and tests. Support for these ops (much of which has already been written) will be broken into later PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35102

Differential Revision: D20596918

Pulled By: ezyang

fbshipit-source-id: 60caa279bb0ce4a9bb0b28c1d585d42cf1cc7e50
2020-03-24 09:08:04 -07:00
Mike Ruberry
fe276d541e Revert D20541921: [pytorch][PR] [RELAND] Eager autocasting, out-of-place ops only (with MSVC 2017 fix)
Test Plan: revert-hammer

Differential Revision:
D20541921

Original commit changeset: abb5488dca86

fbshipit-source-id: d2c6038978f80e5429632f8b49107090a8a247f4
2020-03-19 22:39:12 -07:00
Michael Carilli
991b97277a [RELAND] Eager autocasting, out-of-place ops only (with MSVC 2017 fix) (#35011)
Summary:
https://github.com/pytorch/pytorch/pull/32140 was approved and merged, but [reverted](d0577e19f0) because it broke builds with versions of Visual Studio older than 15.8 that were not represented in public CI.  The build failures were caused by a [known VS bug](https://developercommunity.visualstudio.com/content/problem/27729/allow-function-with-internal-linkage-as-template-n.html), fixed in versions 15.8 and newer.

The present PR reverts the revert (restoring https://github.com/pytorch/pytorch/pull/32140 's diffs) and adds a workaround to enable compilation with VS < 15.8.  The workaround isn't pretty, but it's guarded by macros such that it's only used when compiling with VS < 15.8.  All other builds compile with the same code/control flow as was merged in https://github.com/pytorch/pytorch/pull/32140.

Original description of https://github.com/pytorch/pytorch/pull/32140:
> Initial integration of eager autocasting, supporting out-of-place ops only for easier review.
Relevant issue/RFC: https://github.com/pytorch/pytorch/issues/25081

> In-place ops and ops with user-supplied out=... can certainly be supported as well (my initial WIP https://github.com/pytorch/pytorch/issues/29552 handled many) but require substantially more complex special casing in the autocasting backend and tests. Support for these ops (much of which has already been written) will be broken into later PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35011

Differential Revision: D20541921

Pulled By: ezyang

fbshipit-source-id: abb5488dca8620b0daac4306ebf2bb47fc36e4f5
2020-03-19 20:18:18 -07:00
Edward Yang
d0577e19f0 Revert D20346700: [pytorch][PR] Eager autocasting, out-of-place ops only
Test Plan: revert-hammer

Differential Revision:
D20346700

Original commit changeset: 12d77b391731

fbshipit-source-id: 108d72bf24232f443c0be293ec932c0c478d6a60
2020-03-18 11:42:51 -07:00
Michael Carilli
aaa8f02156 Eager autocasting, out-of-place ops only (#32140)
Summary:
Initial integration of eager autocasting, supporting out-of-place ops only for easier review.
Relevant issue/RFC: https://github.com/pytorch/pytorch/issues/25081

In-place ops and ops with user-supplied `out=...` can certainly be supported as well (my initial WIP https://github.com/pytorch/pytorch/pull/29552 handled many) but require substantially more complex special casing in the autocasting backend and tests.  Support for these ops (much of which has already been written) will be broken into later PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32140

Differential Revision: D20346700

Pulled By: ezyang

fbshipit-source-id: 12d77b3917310186fbddf11c59b2794dc859131f
2020-03-18 10:28:21 -07:00
Peter Bell
5fc5cf6571 Stop using ctypes to interface with CUDA libraries. (#33678)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/33016, Continuation of https://github.com/pytorch/pytorch/issues/31160
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33678

Differential Revision: D20249187

Pulled By: ezyang

fbshipit-source-id: 172ce4a0fee7fbe01436a421d1af22ef6173b6ed
2020-03-11 07:22:46 -07:00
Emilio Castillo
31cc311143 Expose CUDACachingAllocator raw_alloc and raw_delete to python (#33860)
Summary:
This PR aims to improve the interoperability with [CuPy](https://github.com/cupy/cupy/pulls).

Instead of having two separate and conflicting memory pools. With this PR, CuPy can directly alloc memory from the PyTorch allocator by means of this proposal https://github.com/cupy/cupy/pull/3126

We would like to gather feedback to know if this approach makes sense for PyTorch, or other alternative designs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33860

Differential Revision: D20212788

Pulled By: ngimel

fbshipit-source-id: bc1e08a66da1992d26021147bf645dc65239581c
2020-03-03 17:50:11 -08:00
Michael Carilli
a726827ec8 Formatting changes for gradient scaling (#33832)
Summary:
hard to get right locally...I can build the docs but never quite match what it looks like live.  the bullet point indentation was just an oversight.

Removing `Returns:` formatting tabs because they take up a lot of space when rendered and add no clarity.  Some functions in Pytorch [do use them](https://pytorch.org/docs/master/torch.html#torch.eye), but [many don't bother](https://pytorch.org/docs/master/torch.html#torch.is_tensor), so apparently some people shared my feelings (Not using them is in line with existing practice).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33832

Differential Revision: D20135581

Pulled By: ngimel

fbshipit-source-id: bc788a7e57b142f95c4fa5baf3fe01f94c45abd8
2020-02-28 11:40:48 -08:00
Michael Carilli
fc6a153688 [WIP] Reanimate gradient scaling API with original scale update heuristic (#33366)
Summary:
Also, windows memory failures responsible for the earlier reversion have been fixed.

This PR (initially) contains 2 commits:
* a revert of the revert
* all changes to implement the original Apex scale update heuristic, squashed into a single commit for easier diff review
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33366

Differential Revision: D20099026

Pulled By: ngimel

fbshipit-source-id: 339b9b6bd5134bf055057492cd1eedb7e4461529
2020-02-25 19:00:34 -08:00
Edward Yang
ae53f8dd25 Revert D19859905: [pytorch][PR] Gradient scaling API
Test Plan: revert-hammer

Differential Revision:
D19859905

Original commit changeset: bb8ae6966214

fbshipit-source-id: 28f1c93e8a00e3a4bbe8cc981499b15468f0b970
2020-02-14 11:03:27 -08:00
Michael Carilli
40246fa63c Gradient scaling API (#26512)
Summary:
This PR implements the gradient scaling API that mruberry, jjsjann123, ngimel, zdevito, gchanan and I have been discussing.  Relevant issue/RFC: https://github.com/pytorch/pytorch/issues/25081.

Volume-wise, this PR is mostly documentation and tests.  The Python API (found entirely in `torch/cuda/amp/amp_scaler.py`) is lightweight .  The exposed functions are intended to make the implementation and control flow of gradient scaling convenient, intuitive, and performant.

The API is probably easiest to digest by looking at the documentation and examples. `docs/source/amp.rst` is the homepage for the Automatic Mixed Precision package.  `docs/source/notes/amp_examples.rst` includes several examples demonstrating common but not-immediately-obvious use cases.  Examples are backed by tests in `test_cuda.py` (and thankfully the tests pass :P).

Two small utility kernels have been added in `native/cuda/AmpKernels.cu` to improve performance and avoid host-device synchronizations wherever possible.

Existing optimizers, both in the wild and in Pytorch core, do not need to change to use the scaling API.

However, the API was also designed to establish a contract between user scripts and optimizers such that writers of _new_ custom optimizers have the control points they need to implement fast, optionally sync-free updates.  User scripts that obey the scaling API can drop such custom optimizers in and reap performance benefits without having to change anything aside from the optimizer constructor itself.  [I know what the contract with custom optimizers should be](35829f24ef/torch/cuda/amp/amp_scaler.py (L179-L184)), but I'm waiting for review on the rest of the API before I go about documenting it (it will be given a dedicated section in `docs/source/notes/amp_examples.rst`.

Currently, the gradient scaling examples do not include the auto-casting API as discussed in https://github.com/pytorch/pytorch/issues/25081.  The gradient scaling API is intended to be orthogonal/modular relative to autocasting.  Without auto-casting the gradient scaling API is fully use-_able_, but not terribly use-_ful_, so it's up to you guys whether you want to wait until auto-casting is ready before merging the scaling API as well.

### Todo
- [ ] How do I get c10 registered status for my two custom kernels?  They're very simple.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26512

Differential Revision: D19859905

Pulled By: mruberry

fbshipit-source-id: bb8ae6966214718dfee11345db824389e4286923
2020-02-13 11:06:06 -08:00
peter
b77c25dec0 Fix dll load logic for Python 3.8 on Windows (#32215)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/31181 and https://github.com/pytorch/pytorch/pull/31162#discussion_r362495611.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32215

Differential Revision: D19501869

Pulled By: ezyang

fbshipit-source-id: 363824e52d2592ad968ecf1df345aa4c0daff915
2020-01-22 08:33:34 -08:00
Brian Wignall
e7fe64f6a6 Fix typos (#30606)
Summary:
Should be non-semantic.

Uses https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines to find likely typos.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30606

Differential Revision: D18763028

Pulled By: mrshenli

fbshipit-source-id: 896515a2156d062653408852e6c04b429fc5955c
2019-12-02 20:17:42 -08:00
Peter Bell
bb119d957e Move torch.cuda's atfork handler into C++ (#29101)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/23401

We cannot rely on `multiprocessing.util.register_after_fork` since it is only
called for processes created by the `multiprocessing` module and not `os.fork()`.

Moving to `pthread_atfork` does always get called. However, I don't think it's safe to call python functions inside of the `atfork` handler so the python code has to be a bit more careful when checking `_initialized`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29101

Differential Revision: D18355451

Pulled By: ezyang

fbshipit-source-id: 4d4253a3669796212c099dad4e5bdfdb0df40469
2019-11-11 07:34:27 -08:00
Igor Fedan
75309b45f3 explicitly provide memory format when calling to clone() at Indexing.cpp
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/28660

Test Plan: Imported from OSS

Differential Revision: D18333346

Pulled By: ifedan

fbshipit-source-id: 06590205d883a5096388a4ae318389244130972d
2019-11-07 05:38:32 -08:00
なるみ
d83389d327 Ignore F401 in all __init__.py without putting noqa (#25823)
Summary:
By adding `per-file-ignores = __init__.py: F401` into `.flake8` with `flake8>=3.7`, we can ignore F410 in all `__init__.py` without putting `# noqa: F401` line by line.

http://flake8.pycqa.org/en/latest/user/options.html?highlight=per-file-ignores#cmdoption-flake8-per-file-ignores
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25823

Differential Revision: D17252182

Pulled By: soumith

fbshipit-source-id: 87b174075b79e4078953a7521bd1a8f82405646b
2019-10-23 15:28:13 -07:00
zou3519
e5d6b75319 Bag of documentation fixes; fix more sphinx warnings (#27850)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27850

Many of these are real problems in the documentation (i.e., link or
bullet point doesn't display correctly).

Test Plan: - built and viewed the documentation for each change locally.

Differential Revision: D17908123

Pulled By: zou3519

fbshipit-source-id: 65c92a352c89b90fb6b508c388b0874233a3817a
2019-10-15 07:31:14 -07:00
zou3519
23bffc4f14 Fix most documentation warnings (#27782)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27782

Warnings show up when running `make html` to build documentation. All of
the warnings are very reasonable and point to bugs in our docs. This PR
attempts to fix most of those warnings.

In the future we will add something to the CI that asserts that there
are no warnings in our docs.

Test Plan: - build and view changes locally

Differential Revision: D17887067

Pulled By: zou3519

fbshipit-source-id: 6bf4d08764759133b20983d6cd7f5d27e5ee3166
2019-10-13 10:34:01 -07:00
Jerry Ma
1610ea8ef8 Comprehensive-ish instrumentation for CUDA memory allocator (#27361)
Summary:
Adds comprehensive memory instrumentation to the CUDA caching memory allocator.

# Counters

Added comprehensive instrumentation for the following stats:
  - Allocation requests (`allocation`)
  - Allocated memory (`allocated_bytes`)
  - Reserved segments from cudaMalloc (`segment`)
  - Reserved memory (`reserved_bytes`)
  - Active memory blocks (`active`)
  - Active memory (`active_bytes`)
  - Inactive, non-releasable blocks (`inactive_split`)
  - Inactive, non-releasable memory (`inactive_split_bytes`)
  - Number of failed cudaMalloc calls that result in a cache flush and retry (`cuda_malloc_retries`)
  - Number of OOMs (`num_ooms`)

Except for the last two, these stats are segmented between all memory, large blocks, and small blocks. Along with the current value of each stat, historical counts of allocs/frees as well as peak usage are tracked by the allocator.

# Snapshots

Added the capability to get a "memory snapshot" – that is, to generate a complete dump of the allocator block/segment state.

# Implementation: major changes

- Added `torch.cuda.memory_stats()` (and associated C++ changes) which returns all instrumented stats as a dictionary.
- Added `torch.cuda.snapshot()` (and associated C++ changes) which returns a complete dump of the allocator block/segment state as a list of segments.
- Added memory summary generator in `torch.cuda.memory_summary()` for ease of client access to the instrumentation stats. Potentially useful to dump when catching OOMs. Sample output here: https://pastebin.com/uKZjtupq

# Implementation: minor changes

- Add error-checking helper functions for Python dicts and lists in `torch/csrc/utils/`.
- Existing memory management functions in `torch.cuda` moved from `__init__.py` to `memory.py` and star-imported to the main CUDA module.
- Add various helper functions to `torch.cuda` to return individual items from `torch.cuda.memory_stats()`.
- `torch.cuda.reset_max_memory_cached()` and `torch.cuda.reset_max_memory_allocated()` are deprecated in favor of `reset_peak_stats`. It's a bit difficult to think of a case where only one of those stats should be reset, and IMO this makes the peak stats collectively more consistent.
- `torch.cuda.memory_cached()` and `torch.cuda.max_memory_cached()` are deprecated in favor of `*memory_reserved()`.
- Style (add access modifiers in the allocator class, random nit fixes, etc.)

# Testing

- Added consistency check for stats in `test_cuda.py`. This verifies that the data from `memory_stats()` is faithful to the data from `snapshot()`.
- Ran on various basic workflows (toy example, CIFAR)

# Performance

Running the following speed benchmark: https://pastebin.com/UNndQg50

- Before this PR: 45.98 microseconds per tensor creation
- After this PR: 46.65 microseconds per tensor creation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27361

Differential Revision: D17758747

Pulled By: jma127

fbshipit-source-id: 5a84e82d696c40c505646b9a1b4e0c3bba38aeb6
2019-10-08 15:42:48 -07:00
Edward Yang
925131a85e Fix race in CUDA initialization (#25788)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25788

Previously, I thought that _lazy_init held the GIL throughout initialization, so
I could write the code in a single-threaded manner.  This is not true; it
releases the GIL at various points, which make it possible for another thread to
race with initialization.

The correct fix is to add locking for the initialization section, so other
threads wait until the first thread finishes initializing before being let
in.  There is some subtlety with how to handle lazy calls, which will call
_lazy_init reentrantly; this is handled using TLS that lets you know if you
are the initializing thread (and therefore reentrant calls are OK.)

Fixes #16559

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D17366348

Pulled By: ezyang

fbshipit-source-id: 99b982709323e2370d03c127c46d87be97495916
2019-09-17 07:40:29 -07:00
Yaroslav Bulatov
7f3c423541 Add type hint for cuda.set_rng_state (#26200)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/26199
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26200

Differential Revision: D17386885

Pulled By: soumith

fbshipit-source-id: 9da03aae29281b2ed691cbfdd7b85fde55e5b7ef
2019-09-14 19:29:42 -07:00
Hong Xu
97f129bf0a Let set_rng_state and get_rng_state accept string parameter (#23448)
Summary:
Currently set_rng_state and get_rng_state do not accept string as their parameters. This commit let them accept strings.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23448

Differential Revision: D16527172

Pulled By: soumith

fbshipit-source-id: 8f9a2129979706e16877cc110f104770fbbe952c
2019-07-29 08:08:39 -07:00
Guanheng Zhang
b22adeb007 Fix error message for a wrong fork CUDA (#23322)
Summary:
Re-land https://github.com/pytorch/pytorch/pull/23030
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23322

Differential Revision: D16469442

Pulled By: zhangguanheng66

fbshipit-source-id: 70b63ab6265efa3f289111ef0ce46bb3c0d353bc
2019-07-25 12:58:14 -07:00
Edward Yang
1f608d09cf Revert D16440000: [pytorch][PR] Re-land "Fix error message for a wrong fork CUDA"
Differential Revision:
D16440000

Original commit changeset: e05683275522

fbshipit-source-id: b688f24c1e6d3d8f63c2d415262a3f0ab1b85914
2019-07-24 12:05:36 -07:00
Guanheng Zhang
aa660b8eb7 Re-land "Fix error message for a wrong fork CUDA" (#23209)
Summary:
Re-land https://github.com/pytorch/pytorch/pull/23030
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23209

Differential Revision: D16440000

Pulled By: zhangguanheng66

fbshipit-source-id: e05683275522835a33d5a7e6d76b7e94774e4d98
2019-07-24 07:01:04 -07:00
Jesse Hellemn
06d11f0434 Revert D16368004: [pytorch][PR] Fix error message for a wrong fork CUDA
Differential Revision:
D16368004

Original commit changeset: 44b6977790ce

fbshipit-source-id: c81a232bd52219e56a19c64650c4b6dedeb167cb
2019-07-22 18:46:48 -07:00
Guanheng Zhang
a6e45a69a8 Fix error message for a wrong fork CUDA (#23030)
Summary:
Fix https://github.com/pytorch/pytorch/issues/17357
Unblock 1.2 release.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23030

Differential Revision: D16368004

Pulled By: zhangguanheng66

fbshipit-source-id: 44b6977790ce768efa4777bae41d4b26dae5f288
2019-07-22 15:04:32 -07:00
Iurii Zdebskyi
3a8d7463bd Enabled BFloat16 storage (#21523)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21523
ghimport-source-id: 698b3cbd6b21c09b9ff8bf8011980df8e35c33b0

Test Plan: Imported from OSS

Differential Revision: D15819368

Pulled By: izdeby

fbshipit-source-id: f6b3bba7b3ca8ee677bd80a231dbb3920c07d61c
2019-07-09 21:51:06 -07:00
Syed Tousif Ahmed
effcc398c4 Refactor Random Number Generators in ATen (#21555)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21555
ghimport-source-id: dd900a8c3e1ef9ef1e011b8bb5476626d18cc462

Test Plan: Imported from OSS

Differential Revision: D15875780

Pulled By: ezyang

fbshipit-source-id: 6e04e90af62ab9c9593d74f344a3a084aaaf6f43
2019-06-19 13:54:09 -07:00
ptrblck
bad67015fe Add warning for Turing GPUs and CUDA <= 9000 (#21468)
Summary:
Turing GPUs (compute capability 7.5) require CUDA10 to work properly.
We've seen some issues for these GPUs using PyTorch binaries with CUDA9 or older:
[Discussion Board #1](https://discuss.pytorch.org/t/cudnn-status-execution-failed-error/38575)
[Discussion Board #2](https://discuss.pytorch.org/t/cublas-runtime-error-on-gpu-running-but-works-on-cpu/46545/6)

Tested on using CUDA9 with an RTX 2080Ti.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21468

Differential Revision: D15696170

Pulled By: ezyang

fbshipit-source-id: ed43f4e4948d3f97ec8e7d7952110cbbfeafef2a
2019-06-06 19:33:02 -07:00
Syed Tousif Ahmed
0e3c4a054b Remove curandStateMTGP32 usage (#21301)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21301
ghimport-source-id: d4516237a8fb46d1f74c47532e849e5926fc6a79

Differential Revision: D15632929

Pulled By: ezyang

fbshipit-source-id: b5147edb95dc3d71f87581aa2ab002e48c3fef30
2019-06-05 14:06:25 -07:00
Syed Tousif Ahmed
5268b7dfaf Remove support for CUDA 8 (#20298)
Summary:
1.1.0 stopped support for CUDA 8
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20298

Differential Revision: D15294639

Pulled By: ezyang

fbshipit-source-id: b9411bfe456f93f1529b745dc83b7d6310df684d
2019-05-13 11:24:22 -07:00
peter
d6f62b70f3 Fix cuda and cudnn libraries search process on Windows (#20205)
Summary:
Fixes #20202
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20205

Differential Revision: D15258626

Pulled By: ezyang

fbshipit-source-id: 855ad457a8bb7a46accc7cf6ec5cb09e98f6e770
2019-05-08 06:08:47 -07:00
SsnL
dce3d74dfb add torch.cuda.synchronize(device=None) (#19573)
Summary:
fixes https://github.com/pytorch/pytorch/issues/19509
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19573

Differential Revision: D15045730

Pulled By: ezyang

fbshipit-source-id: 732721b4b360fc4348ca7c87d4cd1386e7651bdd
2019-04-23 08:40:38 -07:00
Jon Malmaud
1b25fdbcd0 More type stubs (#18511)
Summary:
Added stubs for:

* The `device` module
* The `cuda` module
* Parts of the `optim` module
* Began adding stubs for the `autograd` module. I'll annotate more later but `no_grad` and friends are probably the most used exports from it so it seemed like a good place to start.

This would close #16996, although comments on that issue reference other missing stubs so maybe it's worth keeping open as an umbrella issue.

The big remaining missing package is `nn`.

Also added a `py.typed` file so mypy will pick up on the type stubs. That closes #17639.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18511

Differential Revision: D14715053

Pulled By: ezyang

fbshipit-source-id: 9e4882ac997063650e6ce47604b3eaf1232c61c9
2019-04-01 16:03:58 -07:00
Edward Yang
173f224570 Turn on F401: Unused import warning. (#18598)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18598
ghimport-source-id: c74597e5e7437e94a43c163cee0639b20d0d0c6a

Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18598 Turn on F401: Unused import warning.**

This was requested by someone at Facebook; this lint is turned
on for Facebook by default.  "Sure, why not."

I had to noqa a number of imports in __init__.  Hypothetically
we're supposed to use __all__ in this case, but I was too lazy
to fix it.  Left for future work.

Be careful!  flake8-2 and flake8-3 behave differently with
respect to import resolution for # type: comments.  flake8-3 will
report an import unused; flake8-2 will not.  For now, I just
noqa'd all these sites.

All the changes were done by hand.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: D14687478

fbshipit-source-id: 30d532381e914091aadfa0d2a5a89404819663e3
2019-03-30 09:01:17 -07:00
Vitaly Fedyunin
5653a914f7 Implement reference counting for shared IPC CUDA tensors (#16854)
Summary:
This is to fix #16141 and similar issues.

The idea is to track a reference to every shared CUDA Storage and deallocate memory only after a consumer process deallocates received Storage.

ezyang Done with cleanup. Same (insignificantly better) performance as in file-per-share solution, but handles millions of shared tensors easily. Note [ ] documentation in progress.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16854

Differential Revision: D13994490

Pulled By: VitalyFedyunin

fbshipit-source-id: 565148ec3ac4fafb32d37fde0486b325bed6fbd1
2019-03-25 10:24:38 -07:00