Commit Graph

110 Commits

Author SHA1 Message Date
Tongzhou Wang
8e33451e2e Make torch.cuda.* take device objects; Update distributed docs (#10833)
Summary:
Commits:

1. Make `torch.cuda.*` take device objects
2. Update `torch.distributed` docs to emphasize calling `torch.cuda.set_device` before `init_process_group`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10833

Differential Revision: D9514241

Pulled By: SsnL

fbshipit-source-id: 2497464305fb1e63d6c495291a5744aaa7e2696e
2018-08-27 15:24:42 -07:00
Matt Dawkins
e41528a5cc Also set stdin to subprocess pipe in FindCUDA windows popen call (#10379)
Summary:
Background: we run PyTorch in embedded C++ pipelines, inside C++ GUIs in https://github.com/Kitware/VIAME. Without this addition, the call failed with the error below, but only on certain Windows platforms/configurations:

OSError: [WinError 6] The handle is invalid
At:
C:\Program Files\VIAME\Python36\site-packages\torch\cuda\__init__.py(162): _lazy_init
C:\Program Files\VIAME\Python36\site-packages\torch\nn\modules\module.py(249): <lambda>
C:\Program Files\VIAME\Python36\site-packages\torch\nn\modules\module.py(182): _apply
C:\Program Files\VIAME\Python36\site-packages\torch\nn\modules\module.py(176): _apply
C:\Program Files\VIAME\Python36\site-packages\torch\nn\modules\module.py(249): cuda
C:\Program Files\VIAME\lib\python3.6\site-packages\kwiver\arrows\pytorch\pytorch_resnet_f_extractor.py(74): __init__
C:\Program Files\VIAME\lib\python3.6\site-packages\kwiver\processes\resnet_descriptors.py(132): _configure
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10379

Differential Revision: D9330772

Pulled By: ezyang

fbshipit-source-id: 657ae7590879004558158d3c4abef2ec11d9ed57
2018-08-14 23:10:20 -07:00
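The fix itself is a one-argument change to the `Popen` call; a minimal standalone sketch (generic command, not the actual FindCUDA invocation):

```python
import subprocess
import sys

# When the parent process has no usable console (e.g. a Windows GUI app),
# a child inheriting stdin can fail with "The handle is invalid"; setting
# stdin to a pipe, like stdout and stderr, avoids inheriting it.
proc = subprocess.Popen(
    [sys.executable, "-c", "print('ok')"],
    stdin=subprocess.PIPE,   # the addition from this commit
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)
out, err = proc.communicate()
```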
Peter Goldsborough
f1ce15b50c Move nccl scatter and gather to C++ (#9117)
Summary:
As I try to replicate DP in C++, I need to move some functions into C++ from Python. This PR ports the scatter and gather primitives from Python in torch/cuda/comm.py to C++ in torch/csrc/cuda/comm.cpp. The basic infrastructure was already there, since apaszke had already rewritten broadcast in C++.

I'm not very familiar with this code, so let me know if I'm doing something wrong. I largely just literally translated the code.

I don't know how "public" `torch.cuda.comm` is, but I feel the `destination_index` parameter of `gather` should be changed so that `None` indicates the CPU and `-1` indicates the default CUDA device, instead of `-1` indicating the CPU. That would make the code clearer IMO.

apaszke colesbury teng-li pietern
Closes https://github.com/pytorch/pytorch/pull/9117

Differential Revision: D8721729

Pulled By: goldsborough

fbshipit-source-id: 1844a488079d21fa209b32e2c73e48632cbe9e68
2018-07-06 11:10:33 -07:00
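Conceptually (in plain Python, not the C++ port), scatter splits an input into per-device chunks and gather concatenates them back; a toy sketch over lists:

```python
# Illustrative sketch of the two primitives, with lists standing in for
# tensors and chunk count standing in for the number of devices.
def scatter(seq, num_devices):
    """Split seq into num_devices contiguous chunks (last may be shorter)."""
    chunk = (len(seq) + num_devices - 1) // num_devices
    return [seq[i * chunk:(i + 1) * chunk] for i in range(num_devices)]

def gather(chunks):
    """Concatenate chunks produced by scatter back into one sequence."""
    out = []
    for c in chunks:
        out.extend(c)
    return out
```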
LaiyuanGong
f5cd479b59 fix type mismatch while call torch._C._cuda_setDevice (#8065)
* fix type mismatch while call torch._C._cuda_setDevice

* fix type mismatch in scatter

* fix type mismatch in scatter

* fix type mismatch while call torch._C._cuda_setDevice

* fix type mismatch while call torch._C._cuda_setDevice

* fix type mismatch while call torch._C._cuda_setDevice
2018-06-05 09:53:22 -04:00
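The usual shape of such a fix (illustrative only; `set_device` here is a hypothetical stand-in, not the real wrapper) is to coerce int-like arguments to a builtin `int` before handing them to the C extension, which parses them as a C long:

```python
def set_device(device):
    """Stand-in wrapper: coerce int-like values (numpy integers, etc.)
    to a plain int before the C-extension call."""
    device = int(device)
    # in PyTorch, torch._C._cuda_setDevice(device) would be called here
    return device
```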
Soumith Chintala
50e92a3085 Static linkage for CUDA (#6807)
* add static linkage option for CUDA libs

* add CuFFT linking via fakelink

* remove warning for 5.0 cuda architecture
2018-04-22 13:57:17 -04:00
Tongzhou Wang
4563e190c4 Use THC cached CUDA device property when get_device_name and get_device_capability (#6027)
Getting the CUDA device property struct with cudaGetDeviceProperties is expensive. THC caches the device properties, which are exposed through THCState_getDeviceProperties, then via at::globalContext().getDeviceProperties(device), and finally via torch.cuda.get_device_properties. This PR changes the two methods that previously called cudaGetDeviceProperties to use torch.cuda.get_device_properties directly in Python.

Also fixes an ATen compile error when it can't find CUDA.

Fixes #4908. Using the script from that issue, we get roughly an 18x speed-up.

[ssnl@ ~] python dev.py  # master
0.2826697587966919
0.00034999847412109375
0.0003493785858154297
0.000356292724609375
0.00036025047302246094
0.0003629922866821289
0.00036084651947021484
0.00035686492919921874
0.00036056041717529296
0.0003606319427490234
[ssnl@ ~] python dev.py  # this PR
0.27275662422180175
2.1147727966308594e-05
1.9598007202148438e-05
1.94549560546875e-05
1.9359588623046876e-05
1.938343048095703e-05
2.0074844360351563e-05
1.952648162841797e-05
1.9311904907226562e-05
1.938343048095703e-05
2018-03-30 16:39:22 -04:00
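The caching pattern behind the speed-up can be sketched in plain Python (the CUDA query is simulated; `get_device_properties` here is a stand-in, not the real function):

```python
import functools

call_count = 0  # counts how often the "expensive" query actually runs

@functools.lru_cache(maxsize=None)
def get_device_properties(device):
    """Cached stand-in for the expensive cudaGetDeviceProperties query."""
    global call_count
    call_count += 1
    return {"device": device, "name": "FakeGPU"}

props = get_device_properties(0)
props_again = get_device_properties(0)  # served from the cache
```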
Sam Gross
48a3349c29
Delete dead Tensor code paths (#5417)
This deletes most of the dead Tensor code paths, including the TensorMethods cwrap and generic/Tensor.cpp.

This also moves the THNN.cwrap/.cpp generation to generate_code, which can use ninja if installed.
2018-02-27 17:58:09 -05:00
Carl Lemaire
6b95ca4eda DataParallel: GPU imbalance warning (#5376) 2018-02-27 21:30:41 +01:00
Sam Gross
30ec06c140
Merge Variable and Tensor classes (#5225)
This replaces the torch.Tensor constructors with factories that produce
Variables. Similarly, functions on the torch module (e.g. torch.randn)
now return Variables.

To keep the PR to a reasonable size, I've left most of the unused tensor
code. Subsequent PRs will remove the dead code, clean-up calls to
torch.autograd.Variable, and rename Variable to Tensor everywhere.

There are some breaking changes because Variable and Tensors had
slightly different semantics. There's a list of those changes here:

 https://github.com/pytorch/pytorch/wiki/Breaking-Changes-from-Variable-and-Tensor-merge
2018-02-23 18:03:31 -05:00
Soumith Chintala
2d84cb4b04
warn that CUDA capability 3.0 and 5.0 is no longer supported (#5125) 2018-02-08 00:07:53 -05:00
Sam Gross
895aebac08
Use Variable instead of Tensor in Function.forward (#4786)
The Tensor and Variable classes are being merged.
autograd.Function.forward is now called on Variables, but with "no-grad"
mode (torch.no_grad()) enabled.

One benefit is that we no longer have to explicitly track shared
storages.
2018-02-06 17:24:27 -05:00
Peter Goldsborough
86fd5fd524 Replace async with non_blocking for Python 3.7 (#4999)
* Replace async with non_blocking for Python 3.7 upgrade

* Remove trailing whitespace

* Give _cuda and _type kwargs and accept async for compatibility

* Rename async to non_blocking in all C++ code

* Add entries for async in python_variable_methods

* Friendlier backward compatibility for cuda and type
2018-02-02 09:23:51 -05:00
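`async` is a reserved word from Python 3.7 on, so it cannot remain a named parameter; a minimal sketch of the compatibility shim (the `cuda` function here is a hypothetical stand-in, not the real method):

```python
def cuda(non_blocking=False, **kwargs):
    """Accept the new non_blocking kwarg while tolerating the legacy
    'async' spelling, which can no longer be a named parameter."""
    if "async" in kwargs:
        non_blocking = kwargs.pop("async")
    if kwargs:
        raise TypeError("unexpected keyword arguments: %r" % sorted(kwargs))
    return non_blocking
```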
Christian Sarofeen
ef4cf860ac Lazy init in set device, also should not be called in getDevCount (#4918) 2018-01-30 16:24:31 +01:00
albanD
ee8bcdca79 make torch.cuda.empty_cache() a no-op when cuda is not initialized (#4936) 2018-01-30 16:22:17 +01:00
albanD
7a47790c27 Add missing _lazy_init in cuda python functions 2018-01-29 18:19:03 +01:00
SsnL
3ecd25b065 fix indentation 2018-01-28 20:56:57 +01:00
Tongzhou Wang
6420c6b224 Improve torch.cuda.empty_cache documentation (#4879)
* add doc about empty_cache wont increase amount of memory available

* typo
2018-01-27 04:54:25 -05:00
Yongjik Kim
dd5c195646 More documentation for CUDA stream functions. (#4756) 2018-01-21 12:58:51 +01:00
Sam Gross
f1c616418d
Fix Python docs for broadcast and braodcast_coalesced (#4727) 2018-01-19 10:57:20 -05:00
Adam Paszke
1061d7970d Move broadcast and broadcast_coalesced to C++ 2018-01-18 11:16:45 +01:00
Tongzhou Wang
5918243b0c Methods for checking CUDA memory usage (#4511)
* gpu mem allocated

* add test

* addressed some of @apaszke 's comments

* cache stats

* add more comments about test
2018-01-09 11:47:48 -05:00
Edward Z. Yang
c6381c6d44 Add function to explicitly initialize PyTorch CUDA state. (#4180)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-12-14 17:48:05 -05:00
Richard Zou
d450895a74 fix typo (#4175) 2017-12-14 12:31:58 -05:00
Sam Gross
bcfe259f83
Add streams and comms as optional arguments (#3968)
Adds streams and comms as optional arguments to the NCCL calls in
torch.cuda.nccl. Also exposes ncclUniqueId and ncclCommInitRank for
multi-process mode.

Moves Py_RETURN_NONE statements after the GIL is re-acquired.
2017-12-04 13:51:22 -05:00
Luca Antiga
af58bfbb1b Make integer parameters and buffers immune to float(), double() and half() (#3820)
* Avoid casting integer params and buffers to float(), double() and half()

* Add test for immune integer buffers

* Fix documentation for float(), double() and half()

* Fix test
2017-11-22 18:34:53 -05:00
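The idea can be sketched with plain Python values standing in for tensors (`cast_floats` is illustrative, not the real `_apply` machinery): the cast is applied only to floating-point entries, so integer parameters and buffers pass through unchanged:

```python
def cast_floats(params, cast):
    """Apply `cast` only to floating-point entries of a name -> value dict,
    leaving integer entries (e.g. step counters) untouched."""
    return {name: cast(value) if isinstance(value, float) else value
            for name, value in params.items()}
```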
Soumith Chintala
50009144c0
add warnings if device capability is less than ideal (#3601) 2017-11-09 11:48:59 -05:00
Ozan Çağlayan
dd6d04ddf2 doc: Normalize all true/false in docstrings to `True|False` (#3593)
* doc: Normalize all true/false in docstrings to ``True|False``

This makes them more apparent in the documentation.

* doc: fix flake8
2017-11-09 08:12:29 -05:00
peterjc123
aa911939a3 Improve Windows Compatibility (for csrc/scripts) (#2941) 2017-11-08 19:51:35 +01:00
SsnL
bb1b826cdc Exposing emptyCache from allocator (#3518)
* Add empty_cache binding

* cuda.empty_cache document

* update docs
2017-11-07 17:00:38 -05:00
SsnL
fa5efab669 comments and case where not all sparse (#3370) 2017-11-01 06:05:17 -04:00
SsnL
01be4d6b20 sparse broadcast_coalesce and reduce_add_coalesced 2017-10-28 18:52:35 -04:00
Adam Paszke
76abc06b1f Fix nvprof mode in autograd profiler 2017-10-20 10:22:54 -04:00
SsnL
fce3ed19e5 Change device_id to device in python land (#3133)
* change device_id to device in python land

* cuda/random.py
2017-10-17 00:54:26 +02:00
Soumith Chintala
efe91fb9c1 delete redundant python nccl code 2017-10-09 22:24:18 -04:00
Soumith Chintala
e9dccb3156 implement all_reduce, broadcast, all_gather, reduce_scatter 2017-10-09 22:24:18 -04:00
Soumith Chintala
4d62933529 add initial NCCL C bindings 2017-10-09 22:24:18 -04:00
Edward Z. Yang
2dcaa40425 Add get_rng_state_all and set_rng_state_all.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-30 16:21:04 -04:00
Adam Paszke
833bedc77d Add CUDA profiler bindings 2017-09-25 23:21:30 -04:00
Edward Z. Yang
450379256c Don't call is_available() in manual_seed, it initializes CUDA.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-22 12:37:06 -04:00
Edward Z. Yang
b17dfa07ba Make CUDA seeding/RNG state functions even lazier
Instead of initializing CUDA immediately and executing these calls,
we wait until CUDA is actually initialized before executing them.

To keep things debuggable, we also keep track of the original
backtrace when these functions are called, so we can inform
users where they actually called the seeding/state functions
(as opposed to the first time they actually initialized the
RNG).

Fixes #2517

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-22 12:37:06 -04:00
Edward Z. Yang
06d7a0b1bc Write docs for RNG seeding on GPU more carefully.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-22 12:37:06 -04:00
ngimel
3d7459ff6c fix indices for data_parallel and add parameter gradient tests (#2632) 2017-09-05 17:29:27 -04:00
Justin Johnson
94b5990201 Add torch.cuda.get_device_name function (#2540) 2017-08-26 15:06:37 -04:00
Zhou Mo
2c07f88ea3 Fix typos. 2017-08-25 14:27:07 -04:00
Christian Sarofeen
ec86d0b2ba Updates for CUDA 9 2017-08-25 07:32:05 -04:00
Gregory Chanan
50c208a50b Revert "Fix typos."
This reverts commit 4622b33952.
2017-08-10 13:57:00 -04:00
Zhou Mo
4622b33952 Fix typos. 2017-08-08 11:05:38 -04:00
Adam Paszke
8ab3d214d5 Fixes for DistributedDataParallel (#2168) 2017-07-21 16:00:46 -04:00
Alykhan Tejani
f814a892cf don't re-seed cuda device if in bad fork (#1923) 2017-06-27 13:24:52 -04:00
Adam Paszke
12813b88f6 Add DistributedDataParallel 2017-06-12 22:00:22 -04:00