Summary:
Background: we run PyTorch in embedded C++ pipelines, inside C++ GUIs in https://github.com/Kitware/VIAME, and without this addition the call was failing with the error below, but only on certain Windows platforms/configurations:
OSError: [WinError 6] The handle is invalid
At:
C:\Program Files\VIAME\Python36\site-packages\torch\cuda\__init__.py(162): _lazy_init
C:\Program Files\VIAME\Python36\site-packages\torch\nn\modules\module.py(249): <lambda>
C:\Program Files\VIAME\Python36\site-packages\torch\nn\modules\module.py(182): _apply
C:\Program Files\VIAME\Python36\site-packages\torch\nn\modules\module.py(176): _apply
C:\Program Files\VIAME\Python36\site-packages\torch\nn\modules\module.py(249): cuda
C:\Program Files\VIAME\lib\python3.6None\site-packages\kwiver\arrows\pytorch\pytorch_resnet_f_extractor.py(74): __init__
C:\Program Files\VIAME\lib\python3.6None\site-packages\kwiver\processes\resnet_descriptors.py(132): _configure
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10379
Differential Revision: D9330772
Pulled By: ezyang
fbshipit-source-id: 657ae7590879004558158d3c4abef2ec11d9ed57
Summary:
As I try to replicate DataParallel in C++, I need to move some functions from Python into C++. This PR ports the scatter and gather primitives from Python in torch/cuda/comm.py to C++ in torch/csrc/cuda/comm.cpp. The basic infrastructure was already there, since apaszke had already rewritten broadcast in C++.
I'm not very familiar with this code, so let me know if I'm doing something wrong. I largely just literally translated the code.
I don't know how "public" `torch.cuda.comm` is, but I feel the `destination_index` parameter for `gather` should be changed so that `None` (rather than `-1`) indicates the CPU, with `-1` indicating the default CUDA device. That would make the code clearer IMO.
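For reference, here is a minimal Python-level sketch of the two primitives being ported (assuming at least two visible CUDA devices); the C++ versions in torch/csrc/cuda/comm.cpp are intended to mirror this behavior:

```python
import torch
import torch.cuda.comm as comm

# Split a tensor along dim 0 and place the chunks on devices 0 and 1.
x = torch.randn(8, 4, device="cuda:0")
chunks = comm.scatter(x, devices=[0, 1])

# Concatenate the chunks along dim 0 onto device 0.
# (As discussed above, destination=-1 currently means "gather to CPU".)
y = comm.gather(chunks, dim=0, destination=0)
assert y.shape == x.shape
```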
apaszke colesbury teng-li pietern
Closes https://github.com/pytorch/pytorch/pull/9117
Differential Revision: D8721729
Pulled By: goldsborough
fbshipit-source-id: 1844a488079d21fa209b32e2c73e48632cbe9e68
* fix type mismatch when calling torch._C._cuda_setDevice
* fix type mismatch in scatter
Getting the CUDA device property struct with cudaGetDeviceProperties is expensive. THC caches the device properties, which are available via THCState_getDeviceProperties, exposed through at::globalContext().getDeviceProperties(device), and in Python as torch.cuda.get_device_properties. This PR changes the two methods that previously called cudaGetDeviceProperties to use torch.cuda.get_device_properties directly in Python.
Also fixes ATen compile error when it can't find CUDA.
Fixes #4908. Using the script from that issue, we get roughly an 18x speed-up.
[ssnl@ ~] python dev.py # master
0.2826697587966919
0.00034999847412109375
0.0003493785858154297
0.000356292724609375
0.00036025047302246094
0.0003629922866821289
0.00036084651947021484
0.00035686492919921874
0.00036056041717529296
0.0003606319427490234
[ssnl@ ~] python dev.py # this PR
0.27275662422180175
2.1147727966308594e-05
1.9598007202148438e-05
1.94549560546875e-05
1.9359588623046876e-05
1.938343048095703e-05
2.0074844360351563e-05
1.952648162841797e-05
1.9311904907226562e-05
1.938343048095703e-05
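For context, a minimal sketch of the cached lookup on the Python side (device index 0 assumed):

```python
import torch

if torch.cuda.is_available():
    # Served from THC's cached device property struct, so repeated
    # calls avoid the expensive cudaGetDeviceProperties round trip.
    props = torch.cuda.get_device_properties(0)
    print(props.name, props.major, props.minor, props.total_memory)
```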
This deletes most of the dead Tensor code paths, including the TensorMethods cwrap and generic/Tensor.cpp.
This also moves the THNN.cwrap/.cpp generation to generate_code, which can use ninja if installed.
This replaces the torch.Tensor constructors with factories that produce
Variables. Similarly, functions on the torch module (e.g. torch.randn)
now return Variables.
To keep the PR to a reasonable size, I've left most of the unused tensor
code. Subsequent PRs will remove the dead code, clean-up calls to
torch.autograd.Variable, and rename Variable to Tensor everywhere.
There are some breaking changes because Variable and Tensors had
slightly different semantics. There's a list of those changes here:
https://github.com/pytorch/pytorch/wiki/Breaking-Changes-from-Variable-and-Tensor-merge
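As a small illustration of the new factory behavior (a sketch against the post-merge API):

```python
import torch

# torch.randn now returns a Variable (to be renamed Tensor), so it can
# participate in autograd directly without wrapping it ourselves.
x = torch.randn(3, requires_grad=True)
y = (x * 2).sum()
y.backward()
print(x.grad)  # tensor([2., 2., 2.])
```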
The Tensor and Variable classes are being merged.
autograd.Function.forward is now called on Variables, but with "no-grad"
mode (torch.no_grad()) enabled.
One benefit is that we no longer have to explicitly track shared
storages.
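To illustrate the forward/no-grad point, here is a sketch of a custom autograd.Function under the merged API (the Scale example itself is made up for illustration):

```python
import torch
from torch.autograd import Function

class Scale(Function):
    @staticmethod
    def forward(ctx, x, factor):
        # forward now receives ordinary (former Variable) tensors, and per
        # this change it runs with grad recording disabled (torch.no_grad()),
        # so nothing done here is tracked by autograd.
        ctx.factor = factor
        return x * factor

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output * ctx.factor, None

x = torch.randn(3, requires_grad=True)
y = Scale.apply(x, 2.0)
y.sum().backward()
print(x.grad)  # all elements equal 2.0
```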
* Replace async with non_blocking for Python 3.7 upgrade
* Remove trailing whitespace
* Give _cuda and _type kwargs and accept async for compatibility
* Rename async to non_blocking in all C++ code
* Add entries for async in python_variable_methods
* Friendlier backward compatibility for cuda and type
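A quick sketch of the renamed keyword (the old async spelling is kept only for backward compatibility, as noted above); assumes a CUDA-capable machine:

```python
import torch

if torch.cuda.is_available():
    x = torch.randn(1024, 1024).pin_memory()
    # New spelling: "async" is a reserved word as of Python 3.7.
    y = x.cuda(non_blocking=True)
    z = x.to("cuda", non_blocking=True)
```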
Adds streams and comms as optional arguments to the NCCL calls in
torch.cuda.nccl. Also exposes ncclUniqueId and ncclCommInitRank for
multi-process mode.
Moves Py_RETURN_NONE statements after the GIL is re-acquired.
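A minimal single-process sketch of one of these entry points with the new optional arguments left at their defaults (assumes at least two GPUs; explicit streams/comms follow the same call shape):

```python
import torch
import torch.cuda.nccl as nccl

# One tensor per device; with no outputs given, all_reduce writes the
# summed result back into the inputs.
tensors = [torch.ones(4, device=f"cuda:{i}") for i in range(2)]
nccl.all_reduce(tensors, streams=None, comms=None)
print(tensors[0])  # tensor([2., 2., 2., 2.], device='cuda:0')
```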
* Avoid casting integer params and buffers to float(), double() and half()
* Add test for immune integer buffers
* Fix documentation for float(), double() and half()
* Fix test
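A short sketch of the intended behavior (the module and buffer names are made up for illustration):

```python
import torch
import torch.nn as nn

class Counter(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(4))
        self.register_buffer("steps", torch.zeros(1, dtype=torch.long))

m = Counter().half()
print(m.weight.dtype)  # torch.float16 -- floating point params are cast
print(m.steps.dtype)   # torch.int64   -- integer buffers are left untouched
```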
Instead of initializing CUDA immediately and executing the seeding/state calls, we now queue them and wait until CUDA is actually initialized before executing them.
To keep things debuggable, we also keep track of the original
backtrace when these functions are called, so we can inform
users where they actually called the seeding/state functions
(as opposed to the first time they actually initialized the
RNG).
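The resulting behavior, sketched (torch.cuda.is_initialized is used purely to illustrate and may postdate this change):

```python
import torch

# Seeding no longer forces CUDA initialization; the call is queued.
torch.cuda.manual_seed_all(123)
print(torch.cuda.is_initialized())  # False -- nothing has touched the GPU yet

# The first real CUDA use initializes the context and replays the queued seed.
x = torch.randn(2, device="cuda")
print(torch.cuda.is_initialized())  # True
```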
Fixes #2517
Signed-off-by: Edward Z. Yang <ezyang@fb.com>