Commit Graph

9 Commits

Author SHA1 Message Date
Edward Yang
1a4473bbd7 Rewrite THPUtils_PySequence_to_CUDAStreamList to return vector<optional<CUDAStream>> (#13125)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13125

Previously, it returned a vector of THCStream*, which we eventually turned
into CUDAStream.  No need to spatter the conversion code everywhere: just
do it correctly to begin with.  An important side effect of doing it this
way is that we no longer pass nullptr to CUDAStream; instead, we create
the default stream.  I will rely on this in a later patch.

Reviewed By: gchanan

Differential Revision: D10853224

fbshipit-source-id: f6bd6594eba4626eb41a4a5e67fc64c9bbb46a1a
2018-10-29 08:27:23 -07:00
Yangqing Jia
713e706618 Move exception to C10 (#12354)
Summary:
There are still a few work to be done:

- Move logging and unify AT_WARN with LOG(ERROR).
- A few header files are still being plumbed through, need cleaning.
- caffe2::EnforceNotMet aliasing is not done yet.
- need to unify the macros. See c10/util/Exception.h

This is mainly a codemod and not causing functional changes. If you find your job failing and trace back to this diff, usually it can be fixed by the following approaches:

(1) add //caffe2/c10:c10 to your dependency (or transitive dependency).
(2) change objects such as at::Error, at::Optional to the c10 namespace.
(3) change functions to the c10 namespace. Especially, caffe2::MakeString is not overridden by the unified c10::str function. Nothing else changes.

Please kindly consider not reverting this diff - it involves multiple rounds of rebasing and the fix is usually simple. Contact jiayq@ or AI Platform Dev for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/12354

Reviewed By: orionr

Differential Revision: D10238910

Pulled By: Yangqing

fbshipit-source-id: 7794d5bf2797ab0ca6ebaccaa2f7ebbd50ff8f32
2018-10-15 13:33:18 -07:00
mruberry
9d4360c060 Creates stream pool (#9938)
Summary:
This PR creates a stream pool per issue #9646. When a new stream is requested, that device it's requested on lazily creates two pools, one low priority and one high priority, of 32 streams each. Streams are returned from these pools round-robin. That is, stream 0 is returned, then stream 1... then stream 31, then stream 0... This PR also takes the opportunity to clean up the stream API, reducing its complexity and verbosity.

Change notes:

- There are now 3 sets of streams per device, the default stream, the low priority streams, and the high priority streams. These streams live in lazily initialized pools and are destroyed on shutdown.
- All stream refcounting has been removed (the pools pattern replaces it).
- Setting a stream now sets it on its device. Streams are associated with a device and the previous
requirement to specify that device was unnecessary.
- There is no exposure for setting the flags on a stream. This may also seem like a regression but the flag was always set to cudaStreamNonBlocking.
- Streams are now low or high priority whereas previously the priority could be set with an integer. In practice, however, the range for priorities is -1 to 0 on the latest hardware. -1 is high priority, 0 is low priority (aka default priority). Low vs. high actually clarifies this behavior if people were trying finer separations. (E.g., if someone tried streams with priorities 0, 1, and 2, they would actually all have priority 0, historically, and the intended behavior would not be respected.)
- Unused THCStream and THCState stream-related functions were removed.
- A new test of pooling behavior was added in stream_test.

fyi: colesbury, apaszke, goldsborough
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9938

Reviewed By: SsnL

Differential Revision: D9569036

Pulled By: ezyang

fbshipit-source-id: 12ed673fe373170d0cf4d65cb570de016c53ee7d
2018-08-30 12:40:23 -07:00
Mike Ruberry
1003ccfa15 Creates CUDAContext (#9435)
Summary:
ezyang noticed that the CUDAStream files lived under ATen/ despite being CUDA-specific, and suggested porting them to ATen/cuda and exposing them with a new CUDAContext. This PR does that. It also:

- Moves ATen's CUDA-specific exceptions for ATen/cudnn to ATen/cuda for consistency
- Moves getDeviceProperties() and getCurrentCUDASparseHandle() to CUDAContext from CUDAHooks

The separation between CUDAContext and CUDAHooks is straightforward. Files that are in CUDA-only builds should rely on CUDAContext, while CUDAHooks is for runtime dispatch in files that can be included in CPU-only builds. A comment in CUDAContext.h explains this pattern. Acquiring device properties and CUDA-specific handles is something only done in builds with CUDA, for example, so I moved them from CUDAHooks to CUDAContext.

This PR will conflict with #9277 and I will merge with master after #9277 goes in.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9435

Reviewed By: soumith

Differential Revision: D8917236

Pulled By: ezyang

fbshipit-source-id: 219718864234fdd21a2baff1dd3932ff289b5751
2018-07-20 12:56:15 -07:00
Peter Goldsborough
b770156a7a Functional DataParallel (#9234)
Summary:
This PR adds the functional version of `DataParallel` (i.e. `data_parallel`) to the C++ frontend.

For this, I had to:
1. Add "differentiable" versions of scatter and gather, which perform their inverse operation in the backward pass, to C++. I've added them under `torch/csrc/autograd/functions/comm.{h,cpp}`. I had to move some utilities from `VariableType.cpp` into `torch/csrc/autograd/functions/utils.h`, and changed them a bit to fix the `const_cast`s for which there were `TODO`s,
2. Implement the `replicate`, `parallel_apply` and the combining `data_parallel` functions in C++.

`replicate` is implemented based on our existing `clone()` interface, along with the ability to set the current device via `at::OptionsGuard` (so nice).

`parallel_apply` is implemented using `at::parallel_for` (CC cpuhrsch) and [follows the code from PyTorch](https://github.com/pytorch/pytorch/blob/master/torch/nn/parallel/parallel_apply.py).

Added lots of tests for these things.

apaszke ezyang ebetica colesbury
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9234

Differential Revision: D8865182

Pulled By: goldsborough

fbshipit-source-id: 4f1fecf2b3f3bc1540c071dfb2d23dd45de433e4
2018-07-19 16:12:04 -07:00
Peter Goldsborough
f1ce15b50c Move nccl scatter and gather to C++ (#9117)
Summary:
As I try to replicate DP in C++, I need to move some functions into C++ from Python. This PR ports the scatter and gather primitives from Python in torch/cuda/comm.py to C++ in torch/csrc/cuda/comm.cpp. The basic infrastructure was already there, since apaszke had rewritten broadcast in C++ already.

I'm not very familiar with this code, so let me know if I'm doing something wrong. I largely just literally translated the code.

I don't know how "public" `torch.cuda.comm` is, but I feel like the `destination_index` parameter for `gather` should be changed from -1 indicating CPU to `None` indicating CPU, and `-1` indicating the default CUDA device. That would make the code clearer IMO.

apaszke colesbury teng-li pietern
Closes https://github.com/pytorch/pytorch/pull/9117

Differential Revision: D8721729

Pulled By: goldsborough

fbshipit-source-id: 1844a488079d21fa209b32e2c73e48632cbe9e68
2018-07-06 11:10:33 -07:00
Peter Goldsborough
04a3616de0 Replace std::size_t with size_t (#8093) 2018-06-04 11:10:44 -04:00
Sam Gross
f1c616418d
Fix Python docs for broadcast and braodcast_coalesced (#4727) 2018-01-19 10:57:20 -05:00
Adam Paszke
1061d7970d Move broadcast and broadcast_coalesced to C++ 2018-01-18 11:16:45 +01:00