Commit Graph

251 Commits

Pritam Damania
2b221a9599 Remove PyCFunction casts as much as possible. (#46227)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46227

Follow up from https://github.com/pytorch/pytorch/issues/45419, in
this PR I've removed as many PyCFunction casts as I could from the codebase.

The only ones I didn't remove were the ones with `METH_VARARGS | METH_KEYWORDS`,
which have 3 parameters instead of 2 and had to be cast. Example:
```
{"copy_", (PyCFunction)(void(*)(void))THPStorage_(copy_), METH_VARARGS |
METH_KEYWORDS, nullptr},
```
ghstack-source-id: 114632704

Test Plan: waitforbuildbot

Reviewed By: albanD

Differential Revision: D24269435

fbshipit-source-id: 025cfd43a9a2a3e59f6b2951c1a78749193d77cf
2020-10-20 15:01:51 -07:00
chengjun
5741de883a Define the record_stream method in native_functions.yaml (#44301)
Summary:
The record_stream method was hard-coded for the CUDA device. Define record_stream in native_functions.yaml to enable dynamic dispatch to different backend devices.

Fixes https://github.com/pytorch/pytorch/issues/36556
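The per-device routing this commit enables can be sketched in pure Python with a dispatch table keyed by device type (all names below are illustrative stand-ins, not the real dispatcher):

```python
# Hypothetical sketch of per-device dispatch, analogous to what registering
# record_stream in native_functions.yaml enables: the call is routed by the
# tensor's device type instead of being hard-coded for CUDA.
_record_stream_impls = {}

def register_record_stream(device_type):
    def decorator(fn):
        _record_stream_impls[device_type] = fn
        return fn
    return decorator

@register_record_stream("cuda")
def _cuda_record_stream(tensor, stream):
    return f"cuda: recorded {tensor!r} on {stream!r}"

@register_record_stream("xla")
def _xla_record_stream(tensor, stream):
    return f"xla: recorded {tensor!r} on {stream!r}"

def record_stream(tensor, stream, device_type):
    # Dynamic dispatch: pick the implementation registered for the device.
    try:
        impl = _record_stream_impls[device_type]
    except KeyError:
        raise NotImplementedError(
            f"record_stream not implemented for {device_type}") from None
    return impl(tensor, stream)
```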

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44301

Reviewed By: glaringlee

Differential Revision: D23763954

Pulled By: ezyang

fbshipit-source-id: e6d24f5e7892b56101fa858a6cad2abc5cdc4293
2020-10-13 09:15:22 -07:00
Mingzhe Li
8cd3857bc7 [NCCL] Add torch::cuda::nccl::send/recv (#45926)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45926

torch/csrc/cuda/nccl.cpp is compiled as part of the torch_cuda library, so calling this function from ProcessGroupNCCL.cpp avoids linking a second instance of libnccl.a into torch_python.
Fixes a similar issue to https://github.com/pytorch/pytorch/issues/42517

ghstack-source-id: 113910530

Test Plan: waitforsandcastle

Reviewed By: jiayisuse

Differential Revision: D24147802

fbshipit-source-id: d8901fdb31bdc22ddca2364f8050844639a1beb3
2020-10-08 19:20:40 -07:00
Nikita Shulga
c19b9cd18d Add torch::cuda::ncll::all2all (#45900)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45900

Use `torch::cuda::nccl::all2all` from `ProcessGroupNCCL.cpp`

Fixes https://github.com/pytorch/pytorch/issues/42517

Here is a NCCL dependency graph:
```
libnccl.a --> libtorch_cuda.so ---> libtorch_python.so
    |                                   ^
    |                                   |
    --------> libc10d.a -----------------
```
When a static library is linked into a dynamic library or an executable, the linker removes all unused/duplicate symbols from that library, unless the `-whole-archive` option is used. Before https://github.com/pytorch/pytorch/pull/42514, all nccl calls made from `ProcessGroupNCCL.cpp` were also made from `torch/csrc/cuda/nccl.cpp`, which is compiled as part of `libtorch_cuda.so`.
But adding `ncclSend`/`ncclRecv` to `ProcessGroupNCCL.cpp` forced the linker to embed those into `libtorch_python.so`, which also resulted in linking other dependent symbols into that library.

This PR adds the `nccl[Send|Recv]` calls to `torch_cuda.so` by implementing `all2all` in `torch_cuda`, and thus avoids double-linking the static library.

A more involved, but more robust, solution would be to use the wrappers exported in the `torch::cuda::nccl` namespace instead of making direct NCCL API calls.

Test Plan: Imported from OSS

Reviewed By: mingzhe09088

Differential Revision: D24138011

Pulled By: malfet

fbshipit-source-id: 33305197fc7d8707b7fd3a66b543f7733b9241a1
2020-10-07 23:56:31 -07:00
Nikita Shulga
930bddd403 Cleanup nccl.cpp (#45899)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45899

Use function polymorphism to avoid repeated casts
I.e., instead of using `NCCL_CHECK(from_nccl_result(...))`, add a variant of the function that takes `ncclResult_t` as an input argument.
Add a non-pointer variant of `to_nccl_comm` to avoid the `*to_nccl_comm(&comm)` pattern.
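A Python analogue of this overloading trick (all names below are illustrative stand-ins, not real PyTorch or NCCL APIs): instead of converting the raw result at every call site, register an overload that accepts the raw type and converts once internally.

```python
import enum
import functools

class ncclResult(enum.Enum):
    # Illustrative subset of result codes.
    Success = 0
    UnhandledCudaError = 1

@functools.singledispatch
def nccl_check(result):
    # Base variant: takes the already-converted enum value.
    if result is not ncclResult.Success:
        raise RuntimeError(f"NCCL error: {result.name}")

@nccl_check.register
def _(raw: int):
    # Overload variant: takes the raw value and converts once, internally,
    # so call sites no longer repeat the conversion themselves.
    nccl_check(ncclResult(raw))
```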

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D24138012

Pulled By: malfet

fbshipit-source-id: 7f62a03e108cbe455910e86e894afdd1c27e8ff1
2020-10-06 11:26:14 -07:00
Nikita Shulga
8ab2ad306d Enable torch.cuda.nccl typechecking (#45344)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45336

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45344

Reviewed By: walterddr

Differential Revision: D23935306

Pulled By: malfet

fbshipit-source-id: dd09d4f8ff7a327131764487158675027a13bf69
2020-09-25 17:02:47 -07:00
Edward Yang
da4033d32a Make cudaHostRegister actually useful on cudart. (#45159)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45159

By default, pybind11 binds void* to be capsules.  After a lot of
Googling, I have concluded that this is not actually useful:
you can't actually create a capsule from Python land, and our
data_ptr() function returns an int, which means that the
function is effectively unusable.  It didn't help that we had no
tests exercising it.

I've replaced the void* with uintptr_t, so that we now accept int
(and you can pass data_ptr() in directly).  I'm not sure if we
should make these functions accept ctypes types; unfortunately,
pybind11 doesn't seem to have any easy way to do this.

Fixes #43006

Also added cudaHostUnregister which was requested.
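The "address as plain int" convention the commit adopts can be illustrated with `ctypes` (the `fake_host_register` function is a made-up stand-in for a `uintptr_t`-accepting binding, not the real API):

```python
import ctypes

# A raw allocation whose address we pass around as a plain Python int,
# the same convention Tensor.data_ptr() uses. A binding that accepts
# uintptr_t (rather than a pybind11 capsule) can take this int directly.
buf = ctypes.create_string_buffer(b"hello", 16)
addr = ctypes.addressof(buf)          # plain Python int

def fake_host_register(ptr: int, size: int) -> bool:
    # Stand-in for a cudaHostRegister-style binding taking uintptr_t.
    return isinstance(ptr, int) and ptr != 0 and size > 0

ok = fake_host_register(addr, 16)

# Round-trip: reinterpret the int back into a typed ctypes pointer.
back = ctypes.cast(addr, ctypes.POINTER(ctypes.c_char * 16)).contents
```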

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D23849731

Pulled By: ezyang

fbshipit-source-id: 8a79986f3aa9546abbd2a6a5828329ae90fd298f
2020-09-23 11:05:44 -07:00
Nikita Shulga
4066022146 Do not use PRId64 in torch/csrc (#44767)
Summary:
Instead, use `fmt::format()` or `%lld` and cast the argument to `(long long)`.
Also fix typos and add a helper `PyErr_SetString()` method in torch/csrc/Exceptions.h.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44767

Reviewed By: ezyang

Differential Revision: D23723671

Pulled By: malfet

fbshipit-source-id: c0101aed222184aa436b1e8768480d1531dff232
2020-09-17 14:00:02 -07:00
Nikita Shulga
683380fc91 Use compile time cudnn version if linking with it statically (#44402)
Summary:
This should prevent torch_python from linking the entire cudnn library statically just to query its version

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44402

Reviewed By: seemethere

Differential Revision: D23602720

Pulled By: malfet

fbshipit-source-id: 185b15b789bd48b1df178120801d140ea54ba569
2020-09-09 11:33:41 -07:00
Ashkan Aliabadi
4e39c310eb Move torch/csrc/utils/hash.h to c10/util/hash.h. (#42503)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42503

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D23252331

Pulled By: AshkanAliabadi

fbshipit-source-id: 3c4c0e27b9a7eec8560e374c2a3ba5f1c65dae48
2020-08-29 17:47:00 -07:00
Rong Rong
3eb31325fc refactor torch/cuda/nccl.h to remove direct dependency on NCCL in libtorch_python (#42687)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42687

Reviewed By: malfet

Differential Revision: D23145834

Pulled By: walterddr

fbshipit-source-id: c703a953a54a638852f6e5a1479ca95ae6a10529
2020-08-19 20:16:53 -07:00
Dmytro Dzhulgakov
06d978a9ad [c10/cuda] Reorganize device_count() and robustly surface ASAN warnings (#42249)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42249

Main change is to bring Caffe2's superior error messages for cuda initialization into c10 and use them in all code paths.

Basic logic:

| Case | Call to device_count() | init_cuda, e.g. allocating tensor |
| -- | -- | -- |
| all good | non-zero | just works |
| no gpus | 0, no warning | throw exception with good message |
| driver issues | 0, produce warning | throw exception with good message |
| out of memory with ASAN | 0, produce warning| throw exception with ASAN message |

Previously, the error thrown from init_cuda was very generic and the ASAN warning (if any) was buried in the logs.

Other clean up changes:
* cache device_count() always in a static variable
* move all asan macros in c10
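The "cache in a static variable, warn on driver issues, return 0" behavior from the table can be sketched in Python (a simplified illustration with hypothetical names, not the c10 code; `_raw_device_count` simulates a failing driver query):

```python
import functools
import warnings

def _raw_device_count():
    # Stand-in for the real driver query; here we simulate a broken driver.
    raise RuntimeError("driver error")

@functools.lru_cache(maxsize=1)
def device_count():
    # Cached, like the static variable in c10: the (possibly expensive,
    # warning-producing) query runs at most once per process.
    try:
        return _raw_device_count()
    except RuntimeError as e:
        warnings.warn(f"CUDA initialization failed: {e}")
        return 0
```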

Test Plan:
Hard to unittest because of build modes. Verified manually that the behavior from the table above holds by running the following script in different modes (ASAN/no-ASAN, CUDA_VISIBLE_DEVICES=):

```
print('before import')
import torch
print('after import')
print('devices: ', torch.cuda.device_count())
x = torch.tensor([1,2,3])
print('tensor creation')
x = x.cuda()
print('moved to cuda')
```

Reviewed By: ngimel

Differential Revision: D22824329

fbshipit-source-id: 5314007313a3897fc955b02f8b21b661ae35fdf5
2020-08-05 11:39:31 -07:00
ziab
1c8217a7a6 Abstract cuda calls made from torch_python (#42251)
Summary:
* Make c10::cuda functions regular non-inlined functions
* Add driver_version() and device_synchronize() functions

With this change I no longer see direct calls to the CUDA API when looking at Modules.cpp.obj

FYI malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42251

Reviewed By: malfet

Differential Revision: D22826505

Pulled By: ziab

fbshipit-source-id: 8dc2f3e209d3710e2ce78411982a10e8c727573c
2020-07-30 19:18:33 -07:00
maokaiyu
9ed825746a Use c10::cuda:: primitives rather than make CUDA runtime calls directly (#41405)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41405

Test Plan:
**Imported from GitHub: all checks have passed**

{F244195355}

**The Intern Builds & Tests have 127 successes, 5 no-signals, and 1 failure. Double-checking the failed test's log file, the failures are result differences:**
- AssertionError: 0.435608434677124 != 0.4356083869934082
- AssertionError: 0.4393022060394287 != 0.4393021583557129
- AssertionError: 0.44707541465759276 != 0.44707536697387695

These are all very small numerical errors (within 0.0000001).
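The mismatches above are within floating-point noise; a tolerant comparison of the kind such tests typically use can be sketched as:

```python
import math

# The three failing assertions above compare floats exactly; with the small
# absolute tolerance the text mentions (1e-7), all of them pass.
pairs = [
    (0.435608434677124, 0.4356083869934082),
    (0.4393022060394287, 0.4393021583557129),
    (0.44707541465759276, 0.44707536697387695),
]
all_close = all(math.isclose(a, b, abs_tol=1e-7) for a, b in pairs)
```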

Reviewed By: malfet

Differential Revision: D22531486

Pulled By: threekindoms

fbshipit-source-id: 21543ec76bb9b502885b5146c8ba5ede719be9ff
2020-07-16 15:11:57 -07:00
Luca Wehrstedt
c20426f86d Fix torch.cuda.check_error type errors (#41330)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41330

`torch.cuda.check_error` is annotated as taking an `int` as argument but when running `torch.cuda.check_error(34)` one would get:
```
TypeError: cudaGetErrorString(): incompatible function arguments. The following argument types are supported:
    1. (arg0: torch._C._cudart.cudaError) -> str

Invoked with: 34
```
Even if one explicitly cast the argument, running `torch.cuda.check_error(torch._C._cudart.cudaError(34))` would give:
```
AttributeError: 'str' object has no attribute 'decode'
```

This PR fixes both issues (thus allowing `check_error` to be called with a plain, uncast int) and adds a test.
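The int-to-enum coercion the fix allows can be sketched in pure Python (the enum values and names here are placeholders; the real enum is `torch._C._cudart.cudaError`):

```python
import enum

class cudaError(enum.Enum):
    # Placeholder members for illustration only.
    success = 0
    error_34 = 34

def check_error(err):
    # Accept an uncast int and coerce it into the enum, which is the
    # behavior this PR restores for torch.cuda.check_error.
    if not isinstance(err, cudaError):
        err = cudaError(err)
    if err is not cudaError.success:
        raise RuntimeError(f"CUDA error {err.value}: {err.name}")
```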
ghstack-source-id: 107628709

Test Plan: Unit tests

Reviewed By: ezyang

Differential Revision: D22500549

fbshipit-source-id: 9170c1e466dd554d471e928b26eb472a712da9e1
2020-07-14 00:47:14 -07:00
Nikita Shulga
b952eaf668 Preserve CUDA gencode flags (#41173)
Summary:
Add `torch._C._cuda_getArchFlags()` that returns the list of architectures `torch_cuda` was compiled with
Add `torch.cuda.get_arch_list()` and `torch.cuda.get_gencode_flags()` methods that return the architecture list and gencode flags PyTorch was compiled with
Print a warning if any of the GPUs is not compatible with any of the CUBINs
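As a rough illustration of how an architecture list maps to gencode flags: the sketch below follows the usual nvcc `-gencode` spelling (an assumption, not the verbatim PyTorch output), with cubins for `sm_XX` entries and forward-compatible PTX for `compute_XX` entries.

```python
def gencode_flags(arch_list):
    # nvcc convention: "sm_XX" -> device cubin, "compute_XX" -> PTX that
    # can be JIT-compiled on newer GPUs. Each arch yields one -gencode flag.
    flags = []
    for arch in arch_list:
        kind, num = arch.split("_")
        flags.append(f"-gencode arch=compute_{num},code={kind}_{num}")
    return flags
```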

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41173

Differential Revision: D22459998

Pulled By: malfet

fbshipit-source-id: 65d40ae29e54a0ba0f3f2da11b821fdb4d452d95
2020-07-09 14:59:35 -07:00
Jithun Nair
eea535742f Add bfloat16 support for nccl path (#38515)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38515

Differential Revision: D22420896

Pulled By: ezyang

fbshipit-source-id: 80d2d0c2052c91c9035e1e025ebb14e210cb0100
2020-07-07 18:07:06 -07:00
lixinyu
4a235b87be pop warning message for cuda module when asan is built in (#35088)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35088

Test Plan: Imported from OSS

Differential Revision: D20552708

Pulled By: glaringlee

fbshipit-source-id: 0b809712378596ccf83211bf8ae39cd71c27dbba
2020-06-30 08:00:37 -07:00
SsnL
de7ac60cf4 Add out= variants for cuda.comm.broadcast/gather/scatter (#39681)
Summary:
Partially fixes https://github.com/pytorch/pytorch/issues/38911
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39681

Differential Revision: D22161342

Pulled By: mrshenli

fbshipit-source-id: 60295077159b02087823e93bb6ebac9d70adea0a
2020-06-24 12:58:19 -07:00
Alban Desmaison
02ae9a1583 add TypeError to c10 and fix segfault in error checking in Tensor constructor (#40106)
Summary:
As per title.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40106

Differential Revision: D22137193

Pulled By: albanD

fbshipit-source-id: 11d059263c00a834211f016bd9a9e18fdc0437ef
2020-06-22 13:42:44 -07:00
Nikita Shulga
8b5732e8ad Move torch.cuda annotations inline (#40075)
Summary:
Also enable `torch.cuda` typechecking
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40075

Differential Revision: D22121275

Pulled By: malfet

fbshipit-source-id: dbecef09911334e8f3d87f5ecab66349da9f2325
2020-06-18 15:52:29 -07:00
SsnL
d5236f8517 Avoid initializing unnecessary tensors in nccl.reduce (#39688)
Summary:
While working on https://github.com/pytorch/pytorch/issues/38911, I realized that `nccl.reduce` only needs a single output tensor, while our current implementation requires a list of output tensors. This, along with a TODO I fixed in `reduce_add`, should give some speedup for data parallel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39688

Differential Revision: D22034547

Pulled By: mrshenli

fbshipit-source-id: e74d54d673ebbb062474b1bb5cc93a095a3a5f6c
2020-06-14 10:11:32 -07:00
anjali411
8e07b75cef Have DeviceType available in torch namespace (#38036)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38036

Resolves: https://github.com/pytorch/pytorch/issues/36946

Test Plan: Imported from OSS

Differential Revision: D21463610

Pulled By: anjali411

fbshipit-source-id: c4aabfac2cd1f05f8b66745aae0a17c2af4d9c9b
2020-05-11 16:06:52 -07:00
anjali411
1f09f7ea44 Python API for Complex Storage and storage copy logic (#35771)
Summary:
Following up on this: https://github.com/pytorch/pytorch/pull/35851 cross dtype storage copy is not being used internally, so I have not included cross dtype copy for complex.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35771

Differential Revision: D21319650

Pulled By: anjali411

fbshipit-source-id: 07c72996ee598eba0cf401ad61534494d6f5b5b3
2020-05-01 11:47:22 -07:00
HC Zhu
ea97fa1f2a [PyTorch][Dist] Trigger pre/post hooks of output function nodes under distributed autograd (#34501)
Summary:
# Goals
Do the following things during a distributed backward pass.
1. Accumulate the gradient of a variable to RPC context once the gradient is ready instead of at the very end of the backward pass.
2. Run pre/post hooks installed in `AccumulateGrad` nodes once the gradient is ready for the variable. Currently, the hooks in `AccumulateGrad` are not executed because the `AccumulateGrad` function itself is not even evaluated by the local engine.
3. Make it extensible to support post hooks installed by DDP's reducer.

# Introduce GradCapturePreHook

## Why do we need this?

### Root issue:

* The dist engine uses the autograd.grad-like API on the vanilla engine and then, in the Future callback, populates the context with the gradients. This is a poor emulation of the .backward() call on the vanilla engine.

### Practical issue:

* The leaf's hooks are not called (because they are associated with the `AccumulateGrad` node, which is not called in the autograd.grad-like API). Modules like DDP rely on these hooks.
* The Future is marked as completed before the context is actually populated with the grads, leading to unexpected behavior on the user side.
* The Future callback is only called at the very end of the backward pass, which is too late for DDP if it wants to overlap compute/transfer.

### Proposed solution:

* Provide hooks in the autograd.grad-like API that allow the distributed engine to populate the context and call the hooks, to better emulate the .backward() call.

## Who can install a grad capture pre-hook?

This will be an internal hook at the C++ level and it won't be exposed to Python code. Only call sites directly interacting with the local engine can install such hooks.

## Signature
The returned `grad` will be captured.
```
virtual const torch::Tensor& operator()(const torch::Tensor& grads) = 0;
```

## Where are hooks installed?

Grad capture pre-hooks are installed in `GraphTask::ExecInfo::Capture`. `ExecInfo` is per node. Every backward run has its own `GraphTask` instance.

## When/How will hooks be called?

When the local engine captures the grads for a node, all grad capture pre-hooks are called one by one in the order they were added. The output grads of each hook replace the original grads.
The output of the last hook is used for grad capturing.
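The chaining rule described above can be sketched in pure Python (`Capture` here merely mirrors the role of the C++ `GraphTask::ExecInfo::Capture`; it is not the real class, and integers stand in for gradient tensors):

```python
class Capture:
    def __init__(self):
        self.pre_hooks = []

    def add_hook(self, hook):
        # Hooks run in insertion order.
        self.pre_hooks.append(hook)

    def capture(self, grad):
        for hook in self.pre_hooks:
            grad = hook(grad)       # each hook's output replaces the grad
        return grad                 # last hook's output is what is captured

cap = Capture()
cap.add_hook(lambda g: g * 2)       # e.g. a scaling hook
cap.add_hook(lambda g: g + 1)       # e.g. a hook accumulating into the context
```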
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34501

Test Plan:
All existing tests should pass.

```

python setup.py develop

python test/distributed/rpc/test_dist_autograd_spawn.py DistAutogradTestWithSpawn.test_post_hooks

```

Differential Revision: D20953673

Pulled By: hczhu

fbshipit-source-id: 543b3844823330ea9f9856bab7c5cb2679290a53
2020-04-21 13:23:18 -07:00
Jithun Nair
dc1f9eee53 Avoid printing erroneous warning about "MIOpen not found" for ROCm builds (#33837)
Summary:
Older versions of MIOpen (<=2.2) don't have the `miopenGetVersion` API, but MIOpen is always part of ROCm builds, so do NOT set `lib` to `None` for ROCm builds. `__cudnn_version` will be `None` for older versions of MIOpen.

Setting `lib` to `None` ends up printing the following erroneous warning when running unit tests:
```
/root/.local/lib/python3.6/site-packages/torch/backends/cudnn/__init__.py:120: UserWarning: cuDNN/MIOpen library not found. Check your LD_LIBRARY_PATH
  }.get(sys.platform, 'LD_LIBRARY_PATH')))
```
Eg.: https://ci.pytorch.org/jenkins/job/pytorch-builds/job/py3.6-clang7-rocmdeb-ubuntu16.04-test2/18387/consoleFull
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33837

Differential Revision: D20369285

Pulled By: xw285cornell

fbshipit-source-id: e82e6f8f5bccb486213cf868f40aece41ce11f98
2020-04-17 20:31:01 -07:00
Nikita Shulga
2458f6c63e Move all nccl from torch_python to torch_cuda (#36193)
Summary:
Because `torch_python` is supposed to be a thin wrapper around `torch`.

In this PR, all invocations of functions from the nccl library are moved from python_nccl.cpp (which is part of torch_python) to nccl.cpp (which is part of torch_cuda).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36193

Test Plan: CI

Differential Revision: D20930047

Pulled By: malfet

fbshipit-source-id: 7f278610077df6ac5dc3471c1a1b5d51e653ef9c
2020-04-08 18:01:47 -07:00
Pavel Belevich
3328a2f903 Rename CPUGenerator to CPUGeneratorImpl and CUDAGenerator to CUDAGeneratorImpl (#36026)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36026

Differential Revision: D20856458

Pulled By: pbelevich

fbshipit-source-id: 6d105593dca67640d508a4aebf7edf028d52af32
2020-04-07 08:05:23 -07:00
Edward Yang
940e678da9 Add back cudaHostRegister to cudart API. (#34665)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34665

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20493861

Pulled By: ezyang

fbshipit-source-id: 4215e3037a16be460f20cfc2859be5ee074128d3
2020-03-17 13:30:39 -07:00
Peter Bell
5fc5cf6571 Stop using ctypes to interface with CUDA libraries. (#33678)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/33016, Continuation of https://github.com/pytorch/pytorch/issues/31160
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33678

Differential Revision: D20249187

Pulled By: ezyang

fbshipit-source-id: 172ce4a0fee7fbe01436a421d1af22ef6173b6ed
2020-03-11 07:22:46 -07:00
Emilio Castillo
31cc311143 Expose CUDACachingAllocator raw_alloc and raw_delete to python (#33860)
Summary:
This PR aims to improve the interoperability with [CuPy](https://github.com/cupy/cupy/pulls).

Instead of having two separate and conflicting memory pools, with this PR CuPy can directly allocate memory from the PyTorch allocator by means of this proposal: https://github.com/cupy/cupy/pull/3126

We would like to gather feedback to know if this approach makes sense for PyTorch, or other alternative designs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33860

Differential Revision: D20212788

Pulled By: ngimel

fbshipit-source-id: bc1e08a66da1992d26021147bf645dc65239581c
2020-03-03 17:50:11 -08:00
Andreas Koepf
22044c6f7c Use TORCH_CHECK instead of AT_ASSERT in torch::cuda::gather() (#27456)
Summary:
The error message produced by `AT_ASSERT()` in `gather()` encouraged users to file a bug report ("please report a bug to PyTorch..."). The assertion should be a regular argument check, since it can be triggered by passing tensors of different dimensionality, e.g. `torch.cuda.comm.gather([torch.rand(1, device='cuda'), torch.rand(1, 1, device='cuda')])`.
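The kind of user-facing argument check this commit introduces can be sketched in Python with shape tuples standing in for tensors (`check_gather_inputs` is a hypothetical name, not the PyTorch API):

```python
def check_gather_inputs(shapes):
    # TORCH_CHECK-style user-facing validation: mismatched dimensionality
    # is a user error with a plain message, not an internal assertion that
    # asks users to file a bug report.
    expected = len(shapes[0])
    for i, shape in enumerate(shapes):
        if len(shape) != expected:
            raise ValueError(
                f"Expected all tensors to have {expected} dimensions, "
                f"but tensor {i} has {len(shape)}")
```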

See: https://github.com/pytorch/pytorch/issues/26400
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27456

Differential Revision: D19300270

Pulled By: ezyang

fbshipit-source-id: ec87d225e23445020b377521e0daccceb4748215
2020-01-07 10:04:24 -08:00
Edward Yang
58cffbff91 Add missing TORCH_CUDA_API annotation to throw_nccl_error (#31157)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31157

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D19262583

Pulled By: ezyang

fbshipit-source-id: 8fb87b41ab53770329b38e1e2fe679fb868fee12
2020-01-06 14:39:51 -08:00
Edward Yang
1111a6b810 Use pybind11::gil_scoped_* functions instead of AutoGIL/AutoNoGIL (#30274)
Summary:
Reland of https://github.com/pytorch/pytorch/pull/29095
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30274

Differential Revision: D18762293

Pulled By: ezyang

fbshipit-source-id: d3d50c2dd12bcb678ab25fa708eb6587cc4b66f9
2019-12-02 12:19:58 -08:00
Mike Ruberry
eff4c4d7c1 Revert D18301806: Use pybind11::gil_scoped_* functions instead of AutoGIL/AutoNoGIL
Test Plan: revert-hammer

Differential Revision:
D18301806

Original commit changeset: 03da6a26c41e

fbshipit-source-id: c1324ee8d154e7e16f5dd4f1cf3625aaa566cd39
2019-11-21 14:50:07 -08:00
Alan Du
f4b9690f2d Use pybind11::gil_scoped_* functions instead of AutoGIL/AutoNoGIL (#29095)
Summary:
Given that pybind11 implements these GIL functions, I don't think it makes sense for PyTorch to have its own bespoke versions.

Fixes https://github.com/pytorch/pytorch/issues/29065
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29095

Differential Revision: D18301806

Pulled By: ezyang

fbshipit-source-id: 03da6a26c41ee65aaadf7b67b9f0b14d2def2a5a
2019-11-21 13:44:40 -08:00
xiaobing.zhang
c2c835dd95 Port sigmoid backward to Aten(CPU+CUDA) (#29185)
Summary:
VitalyFedyunin, this PR ports sigmoid backward to ATen:
**Test script:**
```
import torch
import torch.nn as nn
import time

torch.manual_seed(0)

def _time():
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.time()

device = "cpu"
if torch.cuda.is_available():
    device = "cuda"

#warm up
for n in [100, 10000]:
    input = torch.randn(128, n, requires_grad=True, device=device)
    for i in range(1000):
        output = input.sigmoid().sum()
        output.backward()

#get running time
for n in [100, 10000]:
    bwd_t = 0
    input = torch.randn(128, n, requires_grad=True, device=device)
    for i in range(10000):
        output = input.sigmoid().sum()
        t1 = _time()
        output.backward()
        t2 = _time()
        bwd_t = bwd_t + (t2 - t1)
    bwd_avg = bwd_t / 10000 * 1000
    print("input size(128, %d), backward avg time is %.2f (ms)." % (n, bwd_avg))
```
Test Device: CPU: skx-8280, GPU: Tesla P40

**Performance**:
Before:
```
GPU:
input size(128, 100), backward avg time is 0.14 (ms).
input size(128, 10000), backward avg time is 0.17 (ms).
CPU:
OMP_NUM_THREADS=56
input size(128, 100), backward avg time is 0.06 (ms).
input size(128, 10000), backward avg time is 4.21 (ms).
OMP_NUM_THREADS=1
input size(128, 100), backward avg time is 0.06 (ms).
input size(128, 10000), backward avg time is 2.30 (ms).
```
After:
```
GPU:
input size(128, 100), backward avg time is 0.14 (ms).
input size(128, 10000), backward avg time is 0.17 (ms).
CPU:
OMP_NUM_THREADS=56
input size(128, 100), backward avg time is 0.05 (ms).
input size(128, 10000), backward avg time is 0.48 (ms).
OMP_NUM_THREADS=1
input size(128, 100), backward avg time is 0.04 (ms).
input size(128, 10000), backward avg time is 0.86 (ms).
```
How to set number thread? using following script:
```
num_threads=$1
script=$2
last_core=`expr $num_threads - 1`
echo "using $num_threads OMP threads"
echo "bind cores to 0~$last_core"
export OMP_NUM_THREADS=$num_threads
export KMP_AFFINITY=granularity=fine,compact,1,0
numactl --physcpubind=0-$last_core --membind=0 python $script
```
and run **./run.sh num_threads test.py**.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29185

Differential Revision: D18587352

Pulled By: VitalyFedyunin

fbshipit-source-id: 8167ca261960399f795d35a83fa8c4be365bc4da
2019-11-20 07:31:42 -08:00
Vitaly Fedyunin
b80c4f60fb Add channels last support to cuda.comm.scatter and gather
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/28077

Test Plan: Imported from OSS

Differential Revision: D17980305

Pulled By: VitalyFedyunin

fbshipit-source-id: e4741194baac3d93f2d53724582dc4c38f82ee84
2019-11-18 05:35:35 -08:00
Edward Yang
65bb34d885 Remove TensorImpl::is_variable, deprecate Tensor::is_variable (#29653)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29653

I didn't remove is_variable from Tensor for BC reasons, but I did
remove as many uses as I could from the codebase.
at::impl::variable_excluded_from_dispatch got moved to TensorBody.h
so that it's more widely accessible.

This diff is NOT semantics preserving.  Here are the major differences:

- In a number of native operator implementations, we tested that arguments
  are not variable.  I replaced these with asserts that variable is
  excluded from dispatch.  I actually don't think these asserts are really
  necessary now (they should certainly be true, but it's hard to get
  it wrong), but I've kept them for old time's sake.  At least, they'll detect
  if you call these functions before you've processed variable (indicating
  a bug in your kernel.)

- There are a number of places where we do a per-tensor test for being a
  variable, for better error reporting when someone commits Tensor/Variable
  confusion.  Although these tests are substantively the same as the
  tests above, in these cases I decided to *delete* the test entirely.
  The reasoning is that in these cases, we didn't really care about
  dispatch (also, see above; I'm not too sure we really need the dispatch
  asserts), we cared about Tensor/Variable confusion.  Since Tensor/Variable
  confusion is impossible now, we don't need the tests.  One of the key
  factors which pushed me one way or another was whether or not a function
  was doing per-tensor validation; if I kept the assert in such functions,
  I'd repeatedly access the TLS.  Even if we want to bring back the asserts,
  they would have to go somewhere else.

  Another similar idiom is the number of places we do !x.defined() ||
  x.is_variable(); I treated this equivalently.

- nuclear_norm's computation of compute_uv is a bit weird, but I think
  it's OK to just delete the is_variable case (I *suspect* that it is
  always the case that self.is_variable(), but it doesn't really matter.)

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D18496168

Pulled By: ezyang

fbshipit-source-id: 5a1ded931e0c10a6b758ba64a8380d34110e0c3e
2019-11-14 11:41:02 -08:00
Edward Yang
0c91ebb694 Delete all trivial uses of make_variable. (#29213)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29213

A trivial use of make_variable is one where requires_grad=False.  This
transformation is not technically semantics preserving, as make_variable
will create a shallow copy of the tensor in question; however, I
am guessing that we have the invariant that we don't actually make
use of this shallow copy in a nontrivial way.

There were some cases where the surrounding code expected a Variable proper
to be returned; I retained those sites.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D18353503

Pulled By: ezyang

fbshipit-source-id: 57fe34d82e009c0cc852266fb0b79d6d9c62bb03
2019-11-13 07:43:41 -08:00
Peter Bell
bb119d957e Move torch.cuda's atfork handler into C++ (#29101)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/23401

We cannot rely on `multiprocessing.util.register_after_fork` since it is only
called for processes created by the `multiprocessing` module and not `os.fork()`.

Moving to `pthread_atfork` means the handler does always get called. However, I don't think it's safe to call Python functions inside the `atfork` handler, so the Python code has to be a bit more careful when checking `_initialized`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29101

Differential Revision: D18355451

Pulled By: ezyang

fbshipit-source-id: 4d4253a3669796212c099dad4e5bdfdb0df40469
2019-11-11 07:34:27 -08:00
Xiang Gao
02921e7985 Use cuDNN's handle pool mechanism to manage cublas handles (#29233)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/6962

The PR implements the handle pool mechanism for cublas as suggested by mcarilli  in https://github.com/pytorch/pytorch/issues/6962#issuecomment-530563872.

~~I didn't add any unit test here yet because as mcarilli mentioned:~~
> ~~On my local machine, out of curiosity I also rewrote that test to use gemms instead of convolutions. The race condition seemed rarer, but the test did show that cublas use is not thread safe. I can share the script if you want.~~

~~Please share your script with me mcarilli. And if the race condition is rare, would it still be possible for the CI to detect it?~~

cc: colesbury
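A minimal sketch of the per-thread handle pool idea (simplified: `HandlePool` and the integer "handles" are stand-ins, and the real pool also reclaims handles when threads exit):

```python
import threading

class HandlePool:
    # Each thread leases its own handle, so concurrent gemms never share
    # one (cublas handles are not thread-safe). Handles are created lazily
    # and reused by the same thread on subsequent calls.
    def __init__(self, create):
        self._create = create
        self._local = threading.local()

    def get(self):
        if not hasattr(self._local, "handle"):
            self._local.handle = self._create()
        return self._local.handle

_counter = iter(range(1000))
pool = HandlePool(lambda: next(_counter))   # ints stand in for cublas handles
```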
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29233

Differential Revision: D18372007

Pulled By: ezyang

fbshipit-source-id: 3492bf13410598e8452e89cf4e3e63e8df9c8c3d
2019-11-07 12:50:18 -08:00
vishwakftw
86c64440c9 Make PyTorch Python 3.8 compatible (#29302)
Summary:
PEP 590 changes the `tp_print` slot to `tp_vectorcall_offset`, which requires a `Py_ssize_t` value.
Passing a `nullptr` caused compatibility issues for Python 3.8.

Changelog:
- Modify all occurrences of `nullptr /* tp_print */` to `0 /* tp_vectorcall_offset */`
- Minor formatting changes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29302

Test Plan:
- Local fresh build with Python 3.8 completed successfully.

Fixes https://github.com/pytorch/pytorch/issues/28060.
Fixes https://github.com/pytorch/pytorch/issues/29162.

Supersedes https://github.com/pytorch/pytorch/pull/28364

Differential Revision: D18372022

Pulled By: ezyang

fbshipit-source-id: 8e9a15b0d0f72101ccc69bd489f5efa216b880bb
2019-11-07 09:20:19 -08:00
Edward Yang
adb7df7117 Consistently use TORCH_CUDA_API for all files that live in cuda targets. (#29158)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29158

My plan is to split out libtorch_cuda.so from libtorch.so.  To do this,
I need accurate _API annotations for files in these directories.

I determined the correct set of annotations by looking at
tools/build_variables.py and making sure every file that was a member
of the libtorch_cuda/ATen-cu targets had these annotations.  (torch-cpp-cuda
doesn't count since that's going to be where the stuff that has explicit
USE_CUDA lives, so it's going to be in a separate dynamic library).

As future work, it would be good to setup a lint rule to help people
understand what the correct _API annotation to use in a file is; it
would also be good to reorganize folder structure so that the library
structure is clearer.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D18309593

Pulled By: ezyang

fbshipit-source-id: de710e721b6013a09dad17b35f9a358c95a91030
2019-11-06 15:02:07 -08:00
Edward Yang
a5d356cb39 Delete THP_CORE macro; partially replace with THP_BUILD_MAIN_LIB (#29143)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29143

THP_CORE macro is a very old macro that appeared to have served
two purposes:

1. The torch-python equivalent of CAFFE2_BUILD_MAIN_LIB, to toggle
   symbol visibility headers

2. Some sort of ad hoc way of hiding certain definitions from headers
   so external clients can't get at them.

It did (2) in a very confusing manner, because we set THP_CORE in both
torch and torch-python (it shouldn't do anything in torch).  In this
PR I just get rid of use case (2) entirely (so everything shows up in
headers all the time), and then redo (1) using a new THP_BUILD_MAIN_LIB
macro.  This cleans up some of the macro definitions and makes my life
easier for working on #27215.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D18309594

Pulled By: ezyang

fbshipit-source-id: adcb6d7cb387cd818480137e2b94e5e761dbfefc
2019-11-06 15:02:02 -08:00
Michael Carilli
314066bd74 Making torch/csrc/cuda nccl usage safe for nccl 2.5 (#29014)
Summary:
Thanks to AddyLaddy and ptrblck for tracking this fix down.

In torch/csrc/cuda/nccl.cpp and torch/csrc/cuda/python_nccl.cpp, construction of the `AutoNcclGroup` guard (which calls `ncclGroupStart()`) [precedes](https://github.com/pytorch/pytorch/pull/29014/files#diff-3b6a42619dd44000cf58c0328b679a1cL239-L241) a possible call to `get_communicators`, which may call `ncclCommInitAll()`.  Calling `ncclCommInitAll()` within a `ncclGroupStart()`/`ncclGroupEnd()` section is incorrect, according to our NCCL people.
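The ordering constraint can be sketched in pure Python (the `nccl_group` guard and `get_communicators` below are illustrative stand-ins for the C++ `AutoNcclGroup` and communicator lookup, not the actual torch internals):

```python
from contextlib import contextmanager

# Records the order of the underlying NCCL calls for illustration.
calls = []

@contextmanager
def nccl_group():
    # Stand-in for AutoNcclGroup: ncclGroupStart() on entry, ncclGroupEnd() on exit.
    calls.append("ncclGroupStart")
    try:
        yield
    finally:
        calls.append("ncclGroupEnd")

def get_communicators():
    # May lazily call ncclCommInitAll() on first use.
    calls.append("ncclCommInitAll")
    return ["comm0", "comm1"]

# Buggy order (pre-PR): communicators are initialized inside the open group,
# i.e. ncclCommInitAll() lands between ncclGroupStart() and ncclGroupEnd().
with nccl_group():
    get_communicators()
assert calls == ["ncclGroupStart", "ncclCommInitAll", "ncclGroupEnd"]

# Fixed order (as in this PR): initialize communicators first, then open the group.
calls.clear()
comms = get_communicators()
with nccl_group():
    pass  # collective calls would go here
assert calls == ["ncclCommInitAll", "ncclGroupStart", "ncclGroupEnd"]
```

With NCCL 2.4.x the buggy ordering happened to pass tests; under 2.5.x it hangs, which is why the guard construction was moved after communicator initialization.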

It seemed ok (relevant tests were silently passing) as long as PyTorch was compiled/linked against NCCL 2.4.x (which is currently what's locked into your third_party/nccl subrepo).  However, when we tried to compile and link against NCCL 2.5.x in internal builds, we began to see test hangs (TestAutogradDeviceTypeCUDA.test_unused_output_device_cuda was what initially brought it to our attention).

The present PR fixes those hangs, as far as we know, and will prevent a nasty future surprise when you start building against NCCL 2.5.

The backend affected by this PR is exposed via https://github.com/pytorch/pytorch/blob/master/torch/cuda/nccl.py.  I'm not sure if the exposure is actually used anywhere (I think the distributed frontend is now backed by ProcessGroupNCCL in torch/lib/c10d).  So this PR may affect code that is already dead or dying, but still tested, it seems.

I skimmed ProcessGroupNCCL.cpp for potential similar vulnerabilities and didn't spot anything obvious.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29014

Differential Revision: D18274799

Pulled By: ezyang

fbshipit-source-id: c5f88cf187960d61736be14458be01e3675c6702
2019-11-01 13:53:31 -07:00
Gao, Xiang
2d2fe14a60 Install CUDA for clang-tidy (#27967)
Summary:
fixes: https://github.com/pytorch/pytorch/issues/28009

clang-tidy reports `'cuda_runtime_api.h' file not found` whenever a PR modifies a file that includes this header.

Installation script taken from the official site:
https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=debnetwork
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27967

Differential Revision: D17952383

Pulled By: ezyang

fbshipit-source-id: 85807d93bd46eb902a84b2126784349ce3a01cfa
2019-10-16 10:02:19 -07:00
Jerry Ma
1610ea8ef8 Comprehensive-ish instrumentation for CUDA memory allocator (#27361)
Summary:
Adds comprehensive memory instrumentation to the CUDA caching memory allocator.

# Counters

Added comprehensive instrumentation for the following stats:
  - Allocation requests (`allocation`)
  - Allocated memory (`allocated_bytes`)
  - Reserved segments from cudaMalloc (`segment`)
  - Reserved memory (`reserved_bytes`)
  - Active memory blocks (`active`)
  - Active memory (`active_bytes`)
  - Inactive, non-releasable blocks (`inactive_split`)
  - Inactive, non-releasable memory (`inactive_split_bytes`)
  - Number of failed cudaMalloc calls that result in a cache flush and retry (`cuda_malloc_retries`)
  - Number of OOMs (`num_ooms`)

Except for the last two, these stats are segmented between all memory, large blocks, and small blocks. Along with the current value of each stat, historical counts of allocs/frees as well as peak usage are tracked by the allocator.
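A minimal sketch of the per-stat bookkeeping described above — current value, cumulative alloc/free counts, and peak — in plain Python (field names are illustrative, not the allocator's actual members):

```python
class Stat:
    """Tracks one allocator statistic: current, peak, allocated, freed."""

    def __init__(self):
        self.current = 0    # current value of the stat
        self.peak = 0       # high-water mark since the last peak reset
        self.allocated = 0  # cumulative increases (historical alloc count/bytes)
        self.freed = 0      # cumulative decreases (historical free count/bytes)

    def increase(self, amount):
        self.current += amount
        self.allocated += amount
        self.peak = max(self.peak, self.current)

    def decrease(self, amount):
        self.current -= amount
        self.freed += amount

    def reset_peak(self):
        # In the PR, peaks are reset collectively (reset_peak_stats),
        # not one stat at a time.
        self.peak = self.current

# e.g. an allocated_bytes stat for the "all" pool
allocated_bytes = Stat()
allocated_bytes.increase(512)
allocated_bytes.increase(1024)
allocated_bytes.decrease(512)
assert (allocated_bytes.current, allocated_bytes.peak) == (1024, 1536)
assert (allocated_bytes.allocated, allocated_bytes.freed) == (1536, 512)
```

In the real allocator each such stat is additionally segmented into all/large/small pools.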

# Snapshots

Added the capability to get a "memory snapshot" – that is, to generate a complete dump of the allocator block/segment state.

# Implementation: major changes

- Added `torch.cuda.memory_stats()` (and associated C++ changes) which returns all instrumented stats as a dictionary.
- Added `torch.cuda.snapshot()` (and associated C++ changes) which returns a complete dump of the allocator block/segment state as a list of segments.
- Added memory summary generator in `torch.cuda.memory_summary()` for ease of client access to the instrumentation stats. Potentially useful to dump when catching OOMs. Sample output here: https://pastebin.com/uKZjtupq

# Implementation: minor changes

- Add error-checking helper functions for Python dicts and lists in `torch/csrc/utils/`.
- Existing memory management functions in `torch.cuda` moved from `__init__.py` to `memory.py` and star-imported to the main CUDA module.
- Add various helper functions to `torch.cuda` to return individual items from `torch.cuda.memory_stats()`.
- `torch.cuda.reset_max_memory_cached()` and `torch.cuda.reset_max_memory_allocated()` are deprecated in favor of `reset_peak_stats`. It's a bit difficult to think of a case where only one of those stats should be reset, and IMO this makes the peak stats collectively more consistent.
- `torch.cuda.memory_cached()` and `torch.cuda.max_memory_cached()` are deprecated in favor of `*memory_reserved()`.
- Style (add access modifiers in the allocator class, random nit fixes, etc.)

# Testing

- Added consistency check for stats in `test_cuda.py`. This verifies that the data from `memory_stats()` is faithful to the data from `snapshot()`.
- Ran on various basic workflows (toy example, CIFAR)
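The stats-vs-snapshot consistency check can be sketched roughly like this (the dict shapes are loosely modeled on the flattened `memory_stats()` keys and the `snapshot()` segment list, but are assumptions, not the exact schema):

```python
def check_consistency(stats, snapshot):
    """Verify the counter data is faithful to the block-level snapshot."""
    active_bytes = sum(
        block["size"]
        for segment in snapshot
        for block in segment["blocks"]
        if block["state"] == "active_allocated"
    )
    reserved_bytes = sum(segment["total_size"] for segment in snapshot)
    assert stats["active_bytes.all.current"] == active_bytes
    assert stats["reserved_bytes.all.current"] == reserved_bytes

# One reserved segment of 2 KiB containing one active block and one free block.
snapshot = [
    {"total_size": 2048, "blocks": [
        {"size": 512, "state": "active_allocated"},
        {"size": 1536, "state": "inactive"},
    ]},
]
stats = {"active_bytes.all.current": 512, "reserved_bytes.all.current": 2048}
check_consistency(stats, snapshot)  # passes
```

The test in `test_cuda.py` does this cross-check over all pools after exercising the allocator.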

# Performance

Running the following speed benchmark: https://pastebin.com/UNndQg50

- Before this PR: 45.98 microseconds per tensor creation
- After this PR: 46.65 microseconds per tensor creation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27361

Differential Revision: D17758747

Pulled By: jma127

fbshipit-source-id: 5a84e82d696c40c505646b9a1b4e0c3bba38aeb6
2019-10-08 15:42:48 -07:00
Pritam Damania
24242e86fa Ensure NCCL error handling code is disabled for NCCL versions < 2.4 (#27124)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27124

ncclCommAbort() and ncclGetAsyncError() were two APIs added in NCCL
2.4 to detect errors in NCCL communicators. These were used as part of
ProcessGroupNCCL, and we also enforced that only NCCL versions 2.4+ were
supported. However, there is still legitimate use for older NCCL versions, and
hence we should still support those.

For that purpose, in this change I've ensured we disable NCCL error checking
for versions < 2.4.
ghstack-source-id: 91452959

Test Plan:
1) Test with 2.4.8
2) Test with 2.2.13
3) unit tests.

Differential Revision: D17178988

fbshipit-source-id: 5dc44b5f7b4b00466c67fd452315f1d4f5c47698
2019-10-07 17:39:32 -07:00
Ralf Gommers
1b4951d3a5 Fix remaining invalid function cast warnings that show up with GCC 8/9 (#26104)
Summary:
Follow-up to gh-25483, more of the same fixes for warnings like:

```
../torch/csrc/autograd/python_variable.cpp:503:31: warning: cast between incompatible function types from ‘PyObject* (*)(THPVariable*)’ {aka ‘_object* (*)(THPVariable*)’} to ‘getter’ {aka ‘_object* (*)(_object*, void*)’} [-Wcast-function-type]
  503 |   {"_backward_hooks", (getter)THPVariable_get_backwards_hooks, (setter)THPVariable_set_backwards_hooks, nullptr, nullptr},
      |                               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```

This takes the build log output for a full rebuild with GCC 9.1 from ~10,000 to ~7,000 lines.

`clang-tidy` is going to complain, no way around that - see discussion at the end of gh-25483.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26104

Differential Revision: D17396831

Pulled By: ezyang

fbshipit-source-id: d71696bfe4dbe25519e4bcb7753151c118bd39f7
2019-09-17 07:43:37 -07:00