Commit Graph

123 Commits

Author SHA1 Message Date
Richard Zou
e60a7c2c88 codemod tensor.type().is_cuda(), tensor.type().is_sparse() (#13590)
Summary:
Followup to #12841

Changed these to not require type dispatch:
tensor.type().is_cuda() -> tensor.is_cuda()
tensor.type().is_sparse() -> tensor.is_sparse()
isVariable(tensor.type()) -> tensor.is_variable()

This probably does not affect performance
very much in most cases but it is nice to have.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13590

Reviewed By: ezyang

Differential Revision: D12929301

Pulled By: zou3519

fbshipit-source-id: 8ac5c6200c579dd7a44fb4ee58fc9bb170feb1d7
2018-11-07 07:27:42 -08:00
Edward Yang
0aaff5eaf9 Replace CUDA-specific set_index(_from) method from DeviceGuard with set_device. (#13275)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13275

This resulted in a bunch of knock-on changes, which I will now
describe:

- s/original_index/original_device/
- s/last_index/last_device/
- A bunch of places that used set_index, now use CUDAGuard (which does have
  set_index) because they were CUDA-specific code.

Major caveat: DeviceGuard doesn't *actually* work non-CUDA/CPU devices, To make
that happen, I plan on totally replacing the implementation of DeviceGuard; what
I mostly care about here is wrangling the API into an acceptable state.

Reviewed By: gchanan

Differential Revision: D12832080

fbshipit-source-id: 7de068c7cec35663dc8a533026a626331336e61d
2018-10-31 07:55:13 -07:00
Teng Li
e475d3ede3 DDP multi-GPU segfault fix (#13291)
Summary:
Fix https://github.com/pytorch/pytorch/issues/13200

Tested on 8 GPU machines since CI doesn't have this many GPUs, so multi-GPU test won't be triggered

```
tengli@learnfair096:~/pytorch/test$ python run_test.py -i distributed --verbose
Selected tests: distributed
Running test_distributed ... [2018-10-29 20:32:46.355858]
/public/apps/openmpi/2.1.1/gcc.5.4.0/bin/mpiexec
Running distributed tests for the gloo backend
test_DistBackend (__main__.TestDistBackend) ... ok
test_DistributedDataParallel (__main__.TestDistBackend) ... ok
test_DistributedDataParallelCPU (__main__.TestDistBackend) ... ok
```

Also I would like to bump up the bucket size of broadcast to higher for performance reasons
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13291

Differential Revision: D12842840

Pulled By: teng-li

fbshipit-source-id: e8c50f15ebf2ab3e2cd1b51d365e41a6106b98fe
2018-10-31 00:43:42 -07:00
Edward Yang
1a4473bbd7 Rewrite THPUtils_PySequence_to_CUDAStreamList to return vector<optional<CUDAStream>> (#13125)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13125

Previously, it returned a vector of THCStream*, which we eventually turned
into CUDAStream.  No need to spatter the conversion code everywhere: just
do it correctly to begin with.  An important side effect of doing it this
way is that we no longer pass nullptr to CUDAStream; instead, we create
the default stream.  I will rely on this in a later patch.

Reviewed By: gchanan

Differential Revision: D10853224

fbshipit-source-id: f6bd6594eba4626eb41a4a5e67fc64c9bbb46a1a
2018-10-29 08:27:23 -07:00
Teng Li
b4d0dc77be Eliminate CUDAStream nullptr in NCCL (#13089)
Summary:
As the title says, we should always use the current stream on device in NCCL.

This can unblock ezyang on his further work
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13089

Reviewed By: ezyang

Differential Revision: D10847172

Pulled By: teng-li

fbshipit-source-id: 7fc7c4248b5efa1971d2af4d43f62d3379debfe4
2018-10-24 20:04:41 -07:00
Teng Li
8d3e7e2fcb Move DDP queue_reduction to C++ (#12852)
Summary:
fully working version by using continuing on goldsborough 's initial version.

waiting on the stream guard to be merged before adding more stream perf logics into the c++ version
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12852

Differential Revision: D10468696

Pulled By: teng-li

fbshipit-source-id: 8e46d408796973817abfd9dbd6566e0ca5b7a13f
2018-10-22 16:07:46 -07:00
Edward Yang
1d399a80a0 Handle pollution of MAX, MIN and CHECK macros. (#11805)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11805

Some of our headers in Caffe2 pollute the macro namespace with things like MAX,
MIN, CHECK, so I renamed these in places where this is a problem.

This patch courtesy of gchanan, extracted out of #11721

Reviewed By: Yangqing

Differential Revision: D9917757

fbshipit-source-id: 17fc692ca04b208dcb8ae00731ed60e393284f7c
2018-09-18 13:18:31 -07:00
Tongzhou Wang
70d93f4777 Check for maximum numel in NCCL broadcasting (#11466)
Summary:
NCCL1 uses `int` as its numerical type for fields like `count`, which makes broadcasting tensors larger than `2 << 31 - 1` impossible, and raises opaque error `invalid arguments`. NCCL2 greatly increase the limit on many platforms by using `size_t`. This patch statically detects this type, and raises properly if the broadcast tensor exceeds the limit.

No test because I don't think our test suite should broadcast big tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11466

Differential Revision: D9754753

Pulled By: SsnL

fbshipit-source-id: 73506450cae047e06b5b225b39efdb42d5d26685
2018-09-10 14:39:15 -07:00
Peter Goldsborough
7ddc6f84c4 NULL -> nullptr (#11047)
Summary:
How did we get so many uses of `NULL` again?

ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11047

Differential Revision: D9566799

Pulled By: goldsborough

fbshipit-source-id: 83469f352ac69aa65bdaf1a1a21f922d892e0db3
2018-08-30 16:25:42 -07:00
Edward Yang
6bdbad93b9 Refactor Device to not depend on Backend. (#10478)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10478

- Removed Backend constructor from Device, and fixed all
  use-sites to use DeviceType::CPU instead of kCPU, or
  use a new function backendToDeviceType to perform
  the conversion.
- New method device_type() on Type; it gives you the
  underlying device type, e.g., CPU for SparseCPU.
- We add backward compatibility for kCPU/kCUDA uses,
  by introducing a new special type which is implicitly
  convertible to both DeviceType and Backend.  As long as
  you don't define a function that's overloaded on both
  DeviceType and Backend (but not on BackendOrDeviceType),
  the implicit conversions will ensure that uses
  of at::Device(at::kCPU) keep working. We fixed use-sites in
  the library, but did NOT fix sites in the test code, so that
  we can exercise this BC code.

Reviewed By: Yangqing

Differential Revision: D9301861

fbshipit-source-id: 9a9d88620500715c7b37e655b4fd761f6dd72716
2018-08-18 17:39:14 -07:00
mruberry
d6f21fc663 Ports Streams to ATen (#8997)
Summary:
This PR moves the THCStream logic (from both the THCStream and THCState APIs) to ATen. In particular, it:

+ Creates a new (THC free) at::CUDAStream class and API
+ Extends the at::Context API to expose it
+ Stubs the current THCStream and THCState APIs to use it
+ Updates THC to no longer violate stream encapsulation (stream.hpp is dead)
+ Adds an ATen cpp test of the API
+ Bonus: Removes some debug spew in test_nn.py

The new API has several advantages over the old one:

(1) It comes with an easy to use RAII, the CUDAStream. CUDAStreams have the expected copy and move semantics and are implicitly convertible to cudaStream_t.
(2) It does not depend on THCState, THCThreadLocal, or CUDA (thanks to goldsborough for suggesting the dynamic registration technique)
(3) It provides one consistent API/place for all stream operations, instead of having them split between THCStream and THCState
(4) The internals are completely encapsulated, unlike the historic THCStream
(5) It has getAndRetain semantics, which are safer than the historic gets (which allowed a gap between acquisition and retention)

There are a couple things this PR does not do, however, which are left for future work:

- It leaves the c10d:CUDAStream class as a THCStream wrapper (which now really wraps an at::CUDAStream).
- It leaves historic users of THCStream mostly untouched, except where they violated encapsulation (by using stream.hpp). A couple forward declarations were also changed.

I hope this PR allows easy usage of streams from ATen and is a useful pattern for porting more of the THCState API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8997

Differential Revision: D8683375

Pulled By: soumith

fbshipit-source-id: 2e48ad85f1f9c8817684fe63a267938e80eafdcf
2018-07-08 16:25:09 -07:00
Peter Goldsborough
372d1d6735
Create ATen tensors via TensorOptions (#7869)
* Created TensorOptions

Storing the type in TensorOptions to solve the Variable problem

Created convenience creation functions for TensorOptions and added tests

Converted zeros to TensorOptions

Converted rand to TensorOptions

Fix codegen for TensorOptions and multiple arguments

Put TensorOptions convenience functions into torch namespace too

All factory functions except *_like support TensorOptions

Integrated with recent JIT changes

Support *_like functions

Fix in place modification

Some cleanups and fixes

Support sparse_coo_tensor

Fix bug in Type.cpp

Fix .empty calls in C++ API

Fix bug in Type.cpp

Trying to fix device placement

Make AutoGPU CPU compatible

Remove some auto_gpu.h uses

Fixing some headers

Fix some remaining CUDA/AutoGPU issues

Fix some AutoGPU uses

Fixes to dispatch_tensor_conversion

Reset version of new variables to zero

Implemented parsing device strings

Random fixes to tests

Self review cleanups

flake8

Undo changes to variable.{h,cpp} because they fail on gcc7.2

Add [cuda] tag to tensor_options_cuda.cpp

Move AutoGPU::set_index_from into .cpp file because Windows is stupid and sucks

Fix linker error in AutoGPU.cpp

Fix bad merge conflict in native_functions.yaml

Fixed caffe2/contrib/aten

Fix new window functions added to TensorFactories.cpp

* Removed torch::TensorOptions

Added code to generate wrapper functions for factory methods

Add implicit constructor from Backend to TensorOptions

Remove Var() from C++ API and use torch:: functions

Use torch:: functions more subtly in C++ API

Make AutoGPU::set_device more exception safe

Check status directly in DynamicCUDAHooksInterface

Rename AutoGPU to DeviceGuard

Removed set_requires_grad from python_variables.h and warn appropriately in Variable::set_requires_grad

remove python_default_init: self.type()

Add back original factory functions, but with deprecation warnings

Disable DeviceGuard for a couple functions in ATen

Remove print statement

Fix DeviceGuard construction from undefined tensor

Fixing CUDA device compiler issues

Moved as many methods as possible into header files

Dont generate python functions for deprecated factories

Remove merge conflict artefact

Fix tensor_options_cuda.cpp

Fix set_requires_grad not being checked

Fix tensor_new.h

TEMPORARILY put some methods in .cpp files to see if it solves issues on windows and mac

Fix bug in DeviceGuard.h

Missing includes

TEMPORARILY moving a few more methods into .cpp to see if it fixes windows

Fixing linker errors

* Fix up SummaryOps to use new factories

Undo device agnostic behavior of DeviceGuard

Use -1 instead of optional for default device index

Also move DeviceGuard methods into header

Fixes around device index after optional -> int32_t switch

Fix use of DeviceGuard in new_with_tensor_copy

Fix tensor_options.cpp

* Fix Type::copy(

* Remove test_non_float_params from ONNX tests

* Set requires_grad=False in ONNX tests that use ints

* Put layout/dtype/device on Tensor

* Post merge fixes

* Change behavior of DeviceGuard to match AutoGPU

* Fix C++ API integration tests

* Fix flip functions
2018-06-16 00:40:35 -07:00
Soumith Chintala
dc186cc9fe
Remove NO_* and WITH_* across codebase, except in setup.py (#8555)
* remove legacy options from CMakeLists

* codemod WITH_ to USE_ for WITH_CUDA, WITH_CUDNN, WITH_DISTRIBUTED, WITH_DISTRIBUTED_MW, WITH_GLOO_IBVERBS, WITH_NCCL, WITH_ROCM, WITH_NUMPY

* cover SYSTEM_NCCL, MKLDNN, NNPACK, C10D, NINJA

* removed NO_* variables and hotpatch them only in setup.py

* fix lint
2018-06-15 12:29:48 -04:00
Xiaodong Wang
922adf8d09 Skip calling ncclCommDestroy in destructor (#8352)
There is a bug in NCCL that causing seg faults while calling ncclCommDestroy() in the destructor during program exit. According to Nvidia, "Whether the NCCL destructor will be called before or after the CUDA runtime destructor is undefined, which can lead to crashes."

For the immediate workaround, skip calling ncclCommDestroy ihe NCCL destructor. This is UGLY and we'll follow up with Nvidia to solve this ASAP.
2018-06-12 13:11:09 -04:00
Edward Z. Yang
4caea64d72
Make all of TH and THC C++. (#6913)
Changelist:

- Move *.c to *.cpp
- Change includes of ".c" to ".cpp"
- A bunch of cmake configuration modifying CMAKE_C_FLAGS changed
to CMAKE_CXX_FLAGS or add_compile_options, because if you do CMAKE_C_FLAGS it only applies when you compile C code
- Explicitly cast void* to T* in a number of places
- Delete extern "C" { ... } blocks; instead, properly apply TH_API to everything that should have it (TH_API handles extern "C")
- Stop using stdatomic.h, instead, use <atomic>. This resulted in a bunch of placement-new/delete to be "totally properly correct"
- Refactor of THLongStorageView to not have static constructor methods (since it no longer has a copy/move constructor)
- Documentation about how the TH C interface (and extern C business) works
- Note that THD master_worker mode is dead
- C++ headers in TH libraries are given .hpp suffix, to make it less likely that you'll confuse them with the C-compatible headers (now suffixed .h)
- New function THCStream_stream and THCStream_device to project out fields of THCStream instead of accessing fields directly
- New function THStorage_(retainIfLive), which is equivalent to a retain but only if the refcount is greater than zero.
- In general, I tried to avoid using hpp headers outside of ATen/TH. However, there were a few places where I gave up and depended on the headers for my own sanity. See Note [TH abstraction violation] for all the sites where this occurred. All other sites were refactored to use functions
- Some extra Werror fixes (char* versus const char*)
2018-04-28 07:45:02 -04:00
Sam Gross
23fc2b7e06
Define CHECK in torch/csrc/cuda/nccl.h (#4721)
The CHECK function was used but not defined in the nccl.h header file.
2018-01-18 13:08:06 -05:00
Adam Paszke
de5f7b725e Base for pure C++ NCCL interface 2018-01-18 11:16:45 +01:00
Sam Gross
bcfe259f83
Add streams and comms as optional arguments (#3968)
Adds streams and comms as optional arguments to the NCCL calls in
torch.cuda.nccl. Also exposes ncclUniqueId and ncclCommInitRank for
multi-process mode.

Moves Py_RETURN_NONE statements after the GIL is re-acquired.
2017-12-04 13:51:22 -05:00
Gregory Chanan
6dee02923c [ATen] Rename isSparse -> is_sparse. 2017-11-15 18:33:07 -08:00
Gregory Chanan
9a2b54e08b [ATen] Rename isCuda -> is_cuda. 2017-11-15 18:33:07 -08:00
soumith
5a96037810 skip ncclCommDestroy if CUDA driver is already unloaded 2017-10-13 08:50:00 -07:00
Soumith Chintala
e9dccb3156 implement all_reduce, broadcast, all_gather, reduce_scatter 2017-10-09 22:24:18 -04:00
Soumith Chintala
4d62933529 add initial NCCL C bindings 2017-10-09 22:24:18 -04:00