Commit Graph

96 Commits

Author SHA1 Message Date
Mark Richardson
8bce88d9de [caffe2] dont call cudnnDestroy on thread exit (crashes on windows with cuda 11/12) (#95382)
Summary:
My team has been hitting a mysterious crash for a few months on a windows binary that uses Caffe2 inside a worker thread.

When this thread gets destroyed, there is an error at this line in context_gpu.h where the state of this operation gives CUDNN_STATUS_INTERNAL_ERROR instead of CUDNN_STATUS_SUCCESS.

When enabling cudnn debug logs (via the env variables nvidia specifies), I can see that the context is destroyed twice, even though this code only destroys it once, so something mysterious is causing a double free.

This seems very very similar to the issue/fix described here for pytorch:
https://github.com/pytorch/pytorch/issues/17658
https://github.com/apache/tvm/pull/8267

And pytorch handles this in the same way, by just not calling cudnnDestroy

This seems to have become an issue with cuda11, but I tested cuda12 as well and found that the issue persists so this needs to be somehow fixed.

Test Plan:
CI

I checked that the specific windows binary I am using is able to create and drestroy caffe2-invoking threads without causing the application to crash.

buck run arvr/mode/win/cuda11/opt //arvr/projects/nimble/prod/tools/MonoHandTrackingVis

Differential Revision: D43538017

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95382
Approved by: https://github.com/malfet
2023-03-10 06:42:51 +00:00
Richard Barnes
6f749fd171 Fixes to DSA infra (#91835)
Differential Revision: D42397325

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91835
Approved by: https://github.com/soumith
2023-01-12 21:54:26 +00:00
Richard Barnes
7a3afe61d2 Check all CUDA API calls for errors in caffe2/ (#81816)
Test Plan: Sandcastle

Differential Revision: D35194868

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81816
Approved by: https://github.com/ezyang
2022-10-28 00:41:06 +00:00
Will Constable
4f34cd6d1e Replace all CHECK_ and DCHECK_ with TORCH_* macros (#82032)
Avoid exposing defines that conflict with google logging, since this blocks external usage of libtorch in certain cases.

All the 'interesting' changes should be in these two files, and the rest should just be mechanical changes via sed.
c10/util/logging_is_not_google_glog.h
c10/util/logging_is_google_glog.h

Fixes https://github.com/pytorch/pytorch/issues/81415

cc @miladm @malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82032
Approved by: https://github.com/soumith, https://github.com/miladm
2022-07-26 01:20:44 +00:00
Jeff Daily
b7391f44df cast return of cudaGetLastError() to void when discarding (#62518)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/62511.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62518

Reviewed By: walterddr, janeyx99

Differential Revision: D30029858

Pulled By: malfet

fbshipit-source-id: d47ce4e507ac800b4e5a5e0a8d9a6fabdfd28e6d
2021-08-03 11:17:22 -07:00
Jeff Daily
15210f3b82 ignore and clear not ready errors (#61554)
Summary:
Follow-up to https://github.com/pytorch/pytorch/issues/18584. This PR covers the remaining places where event or stream query might result in not ready errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61554

Reviewed By: mrshenli

Differential Revision: D29763973

Pulled By: ezyang

fbshipit-source-id: 41d988d1826b2309cc6b01a81144094b353abdf9
2021-07-19 16:03:04 -07:00
Edward Yang
778f9eab6c Don't switch streams when running Caffe2 ops from c10. (#55121)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55121

This is done by allow -1 as a stream ID, meaning "don't change
the stream", in SwitchToDevice

Fixes #54830

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D27527544

Pulled By: ezyang

fbshipit-source-id: c54983d6fc79a8fa1c65a71559a57425e40ba717
2021-04-08 13:21:11 -07:00
Basil Hosmer
f05b66b70d pass TypeMeta by value (#45026)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45026

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23802943

Pulled By: bhosmer

fbshipit-source-id: 81b06ef00bf8eb4375c0e0ff2032e03bd1d1188a
2020-10-30 10:14:17 -07:00
Nikita Shulga
a3de359464 Do not throw from CUDAContext destructor (#34756)
Summary:
Throwing from destructor leads to undefined behaviour (most often to segault)
So it's better to leak memory then segault
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34756

Test Plan: Run `test_pytorch_onnx_caffe2`

Differential Revision: D20504228

Pulled By: malfet

fbshipit-source-id: 7a05776fea9036f602e95b8182f8493cb5886dab
2020-03-18 00:13:18 -07:00
Brian Wignall
e7fe64f6a6 Fix typos (#30606)
Summary:
Should be non-semantic.

Uses https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines to find likely typos.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30606

Differential Revision: D18763028

Pulled By: mrshenli

fbshipit-source-id: 896515a2156d062653408852e6c04b429fc5955c
2019-12-02 20:17:42 -08:00
Edward Yang
1e6acc676f Replace caffe2::DeviceGuard with c10::cuda::CUDAGuard (#17623)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17623

Despite it's generic sounding name, caffe2::DeviceGuard actually
only worked on CUDA devices.  Rename it to something that more
clearly spells out its applicability.

I'm not sure if it's the right call, but in this patch I added
'using CUDAGuard = c10::cuda::CUDAGuard', as this seems to be more
in-line with how the Caffe2 codebase is currently written.  More
idiomatic c10 namespace style would be to say cuda::CUDAGuard.
Willing to change this if people shout.

This is a respin of D13156470 (#14284)

Reviewed By: dzhulgakov

Differential Revision: D14285504

fbshipit-source-id: 93b8ab938b064572b3b010c307e1261fde0fff3d
2019-03-06 10:48:15 -08:00
Dmytro Dzhulgakov
51dd2000cd unify c2 and TH allocator (#16892)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16892

Replaces https://github.com/pytorch/pytorch/pull/14517

Merged caffe2 and TH CPU Allocators. Mostly using the code from caffe2 allocators.
`memset` of caffe2 allocator is gone now. These two allocators should be almost the same.

Baseline:
```
Running ./tensor_allocation
Run on (48 X 2501 MHz CPU s)
CPU Caches:
  L1 Data 32K (x24)
  L1 Instruction 32K (x24)
  L2 Unified 256K (x24)
  L3 Unified 30720K (x2)
-------------------------------------------------------------------------
Benchmark                                  Time           CPU Iterations
-------------------------------------------------------------------------
BM_MakeStorageImpl                       148 ns        148 ns    4676594
BM_StorageImplCtor                        54 ns         54 ns   12957810
BM_MallocStorageImpl                      62 ns         62 ns   11254745
BM_TensorImplCtor                         22 ns         22 ns   31939472
BM_MallocTensorImpl                      105 ns        105 ns    6505661
BM_Malloc_1                               43 ns         43 ns   16464905
BM_MakeTensorFromStorage                 126 ns        126 ns    5586116
BM_MakeVariableFromTensor                236 ns        236 ns    2995528
BM_ATenCPUTensorAllocationSmall1         319 ns        319 ns    2268884
BM_ATenCPUTensorAllocationSmall2         318 ns        318 ns    2163332
BM_ATenCPUTensorAllocationMedium1        403 ns        403 ns    1663228
BM_ATenCPUTensorAllocationMedium2        448 ns        448 ns    1595004
BM_ATenCPUTensorAllocationBig1           532 ns        532 ns    1352634
BM_ATenCPUTensorAllocationBig2          4486 ns       4486 ns     160978
```
Changed:
```
Running ./tensor_allocation
Run on (48 X 2501 MHz CPU s)
CPU Caches:
  L1 Data 32K (x24)
  L1 Instruction 32K (x24)
  L2 Unified 256K (x24)
  L3 Unified 30720K (x2)
-------------------------------------------------------------------------
Benchmark                                  Time           CPU Iterations
-------------------------------------------------------------------------
BM_MakeStorageImpl                       141 ns        141 ns    4803576
BM_StorageImplCtor                        55 ns         55 ns   13129391
BM_MallocStorageImpl                      64 ns         64 ns   11088143
BM_TensorImplCtor                         23 ns         23 ns   31616273
BM_MallocTensorImpl                      101 ns        101 ns    7017585
BM_Malloc_1                               39 ns         39 ns   18523954
BM_MakeTensorFromStorage                 118 ns        118 ns    5877919
BM_MakeVariableFromTensor                452 ns        452 ns    1565722
BM_ATenCPUTensorAllocationSmall1         384 ns        384 ns    1819763
BM_ATenCPUTensorAllocationSmall2         389 ns        389 ns    1857483
BM_ATenCPUTensorAllocationMedium1        425 ns        425 ns    1646284
BM_ATenCPUTensorAllocationMedium2        430 ns        430 ns    1561319
BM_ATenCPUTensorAllocationBig1           508 ns        508 ns    1309969
BM_ATenCPUTensorAllocationBig2          3799 ns       3799 ns     173674
```

lstm benchmark:
Before:
```
INFO:lstm_bench:Iter: 1 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 21 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 41 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 61 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 81 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 101 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 121 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 141 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 161 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 181 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 201 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 221 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 241 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 261 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 281 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 301 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 321 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 341 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 361 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 381 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Done. Total EPS excluding 1st iteration: 0.8k
```

After:
```
INFO:lstm_bench:Iter: 1 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 21 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 41 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 61 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 81 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 101 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 121 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 141 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 161 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 181 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 201 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 221 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 241 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 261 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 281 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 301 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 321 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 341 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 361 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 381 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Done. Total EPS excluding 1st iteration: 0.8k
```

Reviewed By: ezyang

Differential Revision: D13202632

fbshipit-source-id: db6d2ec756ed15b0732b15396c82ad42302bb79d
2019-02-12 21:16:34 -08:00
Xiaodong Wang
af0c79eed4 Catch cudaError_t return val (nodiscard in rocm) (#16399)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16399

Catching cudaError_t return values in a few places, because it's nodiscard in rocm. Unless we add -Wno-unused-result, it'll end up with a compilation error.

Also in c10/cuda/test, check whether a host has GPU or not. We were silently throwing out the error before (so not really testing the cuda api).

Reviewed By: bddppq

Differential Revision: D13828281

fbshipit-source-id: 587d1cc31c20b836ce9594e3c18f067d322b2934
2019-02-11 13:18:36 -08:00
Edward Yang
b9b0be7af2 Remove Legacy entry point. (#16721)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16721

The very key line is we have to set the stream to the default
stream before calling the allocator.  This is very interesting.
It shouldn't be necessary, but seemingly is!

Reviewed By: dzhulgakov

Differential Revision: D13943193

fbshipit-source-id: c21014917d9fe504fab0ad8abbc025787f559287
2019-02-08 09:33:58 -08:00
Dmytro Dzhulgakov
96ea2594d8 Don't call cudaStreamDestroy at destruction time (#15692)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15692

It was leading to ocassional crashes with dynamically linked CUDA because runtime was already destroyed.

Also, unique_ptr<T[]> is more suitable than deque<T> for the purpose.

Reviewed By: Yangqing

Differential Revision: D13571988

fbshipit-source-id: 37eb26dfbe361c49160367b53f87bd037c6c0e46
2019-01-11 12:36:41 -08:00
Sebastian Messmer
d408324350 Move files to/from c10/core and c10/util (#15316)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15316

This starts cleaning up the files in c10 according to the module structure we decided on.

Move to c10/util:
- Half.h, Half-inl.h, Half.cpp, bitcasts.h

Move to c10/core:
- Device.h, Device.cpp
- DeviceType.h, DeviceType.cpp

i-am-not-moving-c2-to-c10

Reviewed By: dzhulgakov

Differential Revision: D13498493

fbshipit-source-id: dfcf1c490474a12ab950c72ca686b8ad86428f63
2019-01-10 16:22:22 -08:00
Edward Yang
26b04523b1 Record Caffe2's current stream ID in c10_cuda. (#15174)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15174

Previously, Caffe2 maintained a separate per-thread per-device
current logical CUDA stream ID.  In this PR, we switch Caffe2 over
to using c10::Stream to manage the current stream, and also
manage the allocation of cudaStream_t objects.

This results in a slight behavior change: previously, Caffe2
would have been willing to allocate an arbitrary number of
CUDA streams, depending on how high the logical stream IDs
went.  The c10::Stream pool has a fixed number of streams, once
you exceed it, it wraps around.

Reviewed By: dzhulgakov

Differential Revision: D13451550

fbshipit-source-id: da6cf33ee026932a2d873835f6e090f7b8a7d8dc
2018-12-20 21:54:05 -08:00
Edward Yang
f4c59c5fdf Replace SwitchToDevice(0) with SwitchToDevice() (#15126)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15126

I want to make people stop manufacturing StreamId from thin air,
and a first step is to make people use the default stream.

Reviewed By: dzhulgakov

Differential Revision: D13432922

fbshipit-source-id: 9f0d8d70646c50d979bde5ba3c3addeebac48a3d
2018-12-17 15:15:00 -08:00
Edward Yang
8617b780cf Replace use of 'int' with more descriptive 'DeviceIndex' or 'StreamId'. (#14282)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14282

This also is a substantive change, as 'DeviceIndex' and 'StreamId' are
narrower types than 'int'.

Reviewed By: Yangqing, smessmer

Differential Revision: D13156471

fbshipit-source-id: 08aa0f70c4142415b6bd4d17c57da0641c1d0e9a
2018-11-29 18:33:21 -08:00
Dmytro Dzhulgakov
0cfbbceac3 Change Tensor::CopyFrom to a simple double dispatch (#14268)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14268

Removes the need for Context in Tensor by doing simple dispatch for CopyBytes. It'd eventually be subsumed by Roy Li's changes of proper copy_ op, but before that is done, let's get a clear logic of how copies are implemented and clean up some craft in CopyFrom implementation.

Note, that with these changes, one can probably can get rid of Context::CopyFromCPU/CopyToCPU, but it's a matter for follow up diffs.

This diff doesn't change the API of Tensor yet, but relies on the fact that passing `Context` to CopyFrom makes copy async if the device is CUDA and doesn't have any effect otherwise (that's how Context methods are implemented).

This doesn't change semantics of copy async implementation - as before it blindly calls cudaMemcpyAsync which probably means that it can be misused if invoked separately outside of operator body. I'll leave it for the follow up copy_ unification.

For Extend() we always do async copy - it makes sense as it's an in-place device-device operation and only any further op would be observable.

Note: there are now three ways of invoking copy in C2 code - templated CopyBytes, virtual CopyFromCPU/etc, and double-dispatch free method here. Hopefully we can get rid of the second one.

Also, please advise whether it's c10-worthy :)

Reviewed By: ezyang

Differential Revision: D13117987

fbshipit-source-id: a6772d6dcf3effaf06717da3a656fc9873b310b5
2018-11-28 15:45:37 -08:00
Dmytro Dzhulgakov
f72f91610f Move stream to thread local (#13080)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13080

This is the first step to untangle this logic:
- moves stream id to thread local mechanically
- relies on the fact that the value of thread local is valid in conjunction with CUDAContext only until the next SwitchToDevice is called - we should move to proper RAII in the following diffs

Follow up diffs are going to move more stuff outside of CUDAContext (by making gpu_id thread local too) and simplify the CopyFrom.

The only expected change in behavior is that before CopyFrom would do copy on stream logical id 0 if the context was created on the fly and now it'd do so on the current stream. Since it'd block explicitly, I don't think it matters much.

Also, observers were semi-broken by waiting on the potentially wrong stream. It can be fixed later - I renamed the method to avoid abuse.

Reviewed By: ezyang

Differential Revision: D10525134

fbshipit-source-id: 5d495a21490bebe060a76389f1b47bdf12cbc59e
2018-10-26 00:04:32 -07:00
Jerry Zhang
b89a3b50fb Remove StaticContext (#12547)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12547

Pull Request resolved: https://github.com/pytorch/pytorch/pull/12305

Remove StaticContext from context_base.h

Reviewed By: dzhulgakov

Differential Revision: D10073519

fbshipit-source-id: 350beec3c54365edef338318ce58229ccb825a98
2018-10-10 19:41:03 -07:00
Jerry Zhang
7724807551 Remove ExtractDeviceOption from StaticContext (#12304)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12304

- make ExtractDeviceOption to be a free function.
- Add a Strorage(at::Device) constructor in order to preserve the device_id.

Reviewed By: dzhulgakov

Differential Revision: D10069839

fbshipit-source-id: a5f3994a39bdf1b7503b39bb42c228e438b52bfa
2018-10-10 14:12:16 -07:00
Junjie Bai
f54ab540af Rename cuda_gpu_id to device_id in DeviceOption (#12456)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12456

codemod with 'Yes to all'
codemod -d . --extensions h,cc,cpp,cu,py,proto,pbtxt,pb.txt,config cuda_gpu_id device_id

Overload TextFormat::ParseFromString to do string replace when parsing from protobuf format

Reviewed By: Yangqing

Differential Revision: D10240535

fbshipit-source-id: 5e6992bec961214be8dbe26f16f5794154a22b25
2018-10-09 15:54:04 -07:00
Jerry Zhang
1c69d368e1 Remove New with Allocator Registry (#12111)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12111

Setup allocator registry keyed by at::DeviceType, and remove New from StaticContext.

Reviewed By: ezyang

Differential Revision: D10022853

fbshipit-source-id: 3e88a181fe5df24f33f49b88be1f75284a185588
2018-10-09 10:53:52 -07:00
Jerry Zhang
faab6ea922 Split Allocator (#12105)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12105

Split CUDA/OpenCL/xxx Allocator from xxxStaticContext::New and rewrite it under at::Allocator interface.

Reviewed By: dzhulgakov

Differential Revision: D10001033

fbshipit-source-id: e1ffbc04c18d1dcb1f8d4ef2cbbb321967de5ccc
2018-10-03 19:10:10 -07:00
Jerry Zhang
74dc4460eb New in StaticContext returns at::DataPtr (#12029)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12029

In order to remove New() function in StaticContext(to remove StaticContext) and converge to the Allocator design, we'll first change the return type of New to at::DataPtr.

Reviewed By: ezyang

Differential Revision: D9889990

fbshipit-source-id: 3257c763530b987025f428741bdd2e089d11bad4
2018-10-03 19:10:07 -07:00
Junjie Bai
ff608a9ff3 Back out "Revert D10123245: Back out "codemod cuda_gpu_id to device_id"" (#12232)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12232

Original commit changeset: fca91fea58b7

This adds proper modifications to the DeviceType <->DeviceOption conversion code added in D10033396

Reviewed By: jerryzh168

Differential Revision: D10132473

fbshipit-source-id: 801ef777e2950982cb47b48051b1471a0a91e64b
2018-10-01 21:54:52 -07:00
Rick Ratmansky
3010dc4208 Revert D10123245: Back out "codemod cuda_gpu_id to device_id"
Differential Revision:
D10123245

Original commit changeset: d83da8e00a12

fbshipit-source-id: fca91fea58b7df208edc2e218a1d514f9821ec7b
2018-10-01 12:22:36 -07:00
Yang Liu
7d7d336c45 Back out "codemod cuda_gpu_id to device_id"
Summary:
Original commit changeset: f5614a5d2607

D9986213 is causing Multifeed Aggregator a [huge performance different](https://our.intern.facebook.com/intern/ads/analyze_canary/412951953278781781/) and is blocking aggregator push since last Friday night: https://fburl.com/feedtools/b6izvwjz
We need to land this revert ASAP to unblock aggregator push.

Reviewed By: orionr

Differential Revision: D10123245

fbshipit-source-id: d83da8e00a1250f5d09811a0a587c127e377aab2
2018-10-01 11:31:14 -07:00
Jerry Zhang
006171fffc Back out "[pytorch][PR] Revert "Move CreateContext to global registry (#11688)"" (#12121)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12121

Pull Request resolved: https://github.com/pytorch/pytorch/pull/12055

Original commit changeset: 6ca9de65b707

Reviewed By: ezyang

Differential Revision: D10033396

fbshipit-source-id: ca9f4b2f7ef0561f619b833415d394a8b9972bf4
2018-10-01 11:10:46 -07:00
Junjie Bai
3eb5940cf5 codemod cuda_gpu_id to device_id (#12022)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12022

codemod -d . --extensions h,cc,cpp,cu,py,proto,pbtxt,pb.txt,config cuda_gpu_id device_id

codemod with 'Yes to all'

Reviewed By: orionr

Differential Revision: D9986213

fbshipit-source-id: f5614a5d26078817aee8caf79a494abfd6a95ff1
2018-09-27 20:24:53 -07:00
Edward Yang
d7e11e3aae Revert "Move CreateContext to global registry (#11688)" (#12049)
Summary:
This reverts commit 3ae6ee4ebd.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12049

Differential Revision: D10030954

Pulled By: ezyang

fbshipit-source-id: 6ca9de65b707c5b4c68280fc6f1b8e5ad7251efc
2018-09-25 10:13:43 -07:00
Jerry Zhang
3ae6ee4ebd Move CreateContext to global registry (#11688)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11688

As a first step to remove static context(merge with allocator), we'll create a
global registries for context constructors, and remove CreateContext function from tensor.

Reviewed By: ezyang, dzhulgakov

Differential Revision: D9779821

fbshipit-source-id: 8b239ea50af7a0556fde2382f58f79194f0e3dc1
2018-09-24 17:07:50 -07:00
Yinghai Lu
a26ad5a332 Remove unnecessary check on device option pointer (#11845)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11845

The device pointer will be used by cudaPointerGetAttributes, which handles nullptr already. So this check is not necessary.

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__UNIFIED.html#group__CUDART__UNIFIED_1gd89830e17d399c064a2f3c3fa8bb4390

Reviewed By: salexspb

Differential Revision: D9929828

fbshipit-source-id: d862f7e5590998ffafe9bfc7754b0f83d2ae4af4
2018-09-18 21:16:18 -07:00
Edward Yang
7607b49538 s/GetDevicetype/device_type/ (#11656)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11656

The mis-capitalization really sticks up my craw.  I know why (we
already have a static function named GetDeviceType), but let's
name it differently.

```
codemod -d . --extensions cc,cpp,cu,cuh,h,py,hpp,TARGETS GetDevicetype device_type
```

Reviewed By: jerryzh168

Differential Revision: D9813544

fbshipit-source-id: fe462f4bc40b03e74921f8cf5ebd9cfc52e7e636
2018-09-13 16:32:51 -07:00
Yinghai Lu
316c167940 Add checking of nullptrs in GetTensorInfo (#11587)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11587

To help debug the issue in T33295362, we add some checks in the function.

Possible crashing site in `GetTensorInfo`
1. tc is nullptr, which is checked.
2. tc->capacity_nbytes() hits nullptr, this is unlikely because storage is not a pointer and compute of capacity_nbytes doesn't involve pointers. It's numel * itermsize().
3. tc->ExtractDeviceOption hits nullpt. One possibility raw_data() is nullptr because tc->ExtractDeviceOption will use that. This is checked.
4. Tensor itself which is not a reference. This is also checked.

Reviewed By: salexspb

Differential Revision: D9793484

fbshipit-source-id: 3fc72746fc310a23ae45553bbe0d269a4b9edb72
2018-09-12 16:25:46 -07:00
Yangqing Jia
68613cf5a2 Windows DLL build with Caffe2 code (#11266)
Summary:
This is an experimental build on top of what orionr and mingzhe09088 built.

Essentially, the idea is that we will need separate *_API versions for different shared libraries. If this theory is right, I'll try to clean up the design a bit and document it properly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11266

Reviewed By: orionr

Differential Revision: D9682942

Pulled By: Yangqing

fbshipit-source-id: c79653199e67a1500c9174f39f8b0357324763f3
2018-09-06 15:12:20 -07:00
Jerry Zhang
9f4bcdf075 caffe2::DeviceType -> at::DeviceType (#11254)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11254
Previously we use DeviceType in caffe2.proto directly, but it's an `enum` and have implicit conversion to int, which does not have type safety, e.g. we have to explicitly check for a device type is valid in event.h:
```
template <int d>
struct EventCreateFunctionRegisterer {
  explicit EventCreateFunctionRegisterer(EventCreateFunction f) {
    static_assert(d < MaxDeviceTypes, "");
    Event::event_creator_[d] = f;
  }
};
```
at::DeviceType is an `enum class`, and it does not have implicit conversion to int, and provides better type safety guarantees. In this diff we have done the following refactor(taking CPU as an example):

    1. caffe2::DeviceType → caffe2::DeviceTypeProto
    2. caffe2::CPU → caffe2::PROTO_CPU
    3. caffe2::DeviceType = at::DeviceType
    4. caffe2::CPU = at::DeviceType::CPU

codemod -d caffe2/caffe2 --extensions h,cc,cpp 'device_type\(\), ' 'device_type(), PROTO_'
+ some manual changes

In short, after this diff, in c++, caffe2::CPU refers to the at::DeviceType::CPU and the old proto caffe2::CPU will be caffe2::PROTO_CPU.
In python side, we have a temporary workaround that alias `caffe2_pb2.CPU = caffe2_pb2.PROOT_CPU` to make the change easier to review and this will be removed later.

Reviewed By: ezyang

Differential Revision: D9545704

fbshipit-source-id: 461a28a4ca74e616d3ee183a607078a717fd38a7
2018-09-05 16:28:09 -07:00
Orion Reblitz-Richardson
6508db7421 Remove BUILD_CAFFE2 and build everything (#8338)
Summary:
This completely removes BUILD_CAFFE2 from CMake. There is still a little bit of "full build" stuff in setup.py that enables USE_CUDNN and BUILD_PYTHON, but otherwise everything should be enabled for PyTorch as well as Caffe2. This gets us a lot closer to full unification.

cc mingzhe09088, pjh5, ezyang, smessmer, Yangqing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8338

Reviewed By: mingzhe09088

Differential Revision: D9600513

Pulled By: orionr

fbshipit-source-id: 9f6ca49df35b920d3439dcec56e7b26ad4768b7d
2018-08-31 13:10:24 -07:00
Edward Yang
91797c0672 Replace direct include of caffe2.pb.h with an intermediary header caffe2_pb.h (#10946)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10946

```
codemod -d . --extensions cc,cpp,cu,cuh,h caffe2/proto/caffe2.pb.h caffe2/proto/caffe2_pb.h
```

Reviewed By: houseroad

Differential Revision: D9539945

fbshipit-source-id: 497d04720e8e7e61c05ffe1b23733d0cb774de7e
2018-08-28 11:57:08 -07:00
Yangqing Jia
0a809fc8b1 build changes to make cpu unified build working. (#10504)
Summary:
Properly annotated all apis for cpu front. Checked with cmake using

cmake -DUSE_ATEN=ON -DUSE_CUDA=OFF -DBUILD_ATEN=ON

and resulting libcaffe2.so has about 11k symbols.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10504

Reviewed By: ezyang

Differential Revision: D9316491

Pulled By: Yangqing

fbshipit-source-id: 215659abf350af7032e9a4b0f28a856babab2454
2018-08-15 17:22:36 -07:00
Jerry Zhang
aebf3b47ae Remove template parameter from Tensor (#9939)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9939

Pull Request resolved: https://github.com/facebookresearch/weakly-supervised-action-detection/pull/13

Pull Request resolved: https://github.com/pytorch/translate/pull/166

Pull Request resolved: https://github.com/pytorch/pytorch/pull/9125

Closes https://github.com/pytorch/pytorch/pull/9125

Use inheritance for polymorphism, and remove template parameter
This is to change the templating in call sites, the core implementations will change later

Before Caffe2 Tensor class was compile-time fixed to bind to a particular device/context. With this change, we're making it a runtime property (stored inside the tensor), but preserve the same semantics. For example, one has to specify device type in order to create a Tensor - there are no uninitialized tensors. More specifically the changes are:

1. We added an extra argument *DeviceType* to most of the constructors of the tensor, e.g. (Tensor(DeviceType type)),
2. Semantics of constructor Tensor(const Tensor<SrcContext>& src, ContextForCopy* context); is changed, in this constructor, the second context is passed in to enable us to call the templated Copy function, it could be in a different context as source and target previously, now we'll enforce that the context should have same device type as src, if it is provided.
3. To preserve 'get-or-construct' semantics of Blob, we added specialized getter Blob::GetMutableTensor that verifies both that Blob contains a Tensor and that it's of a correct type
4. Specifically, Tensor type is not default-constructible any more (as we don't have unknown device tensors) and thus some of the code handling STL containers needs to change

Note: Some changes are postponed just to keep this diff a bit smaller. Please see `TODO`s.

Reviewed By: ezyang, houseroad

Differential Revision: D9024330

fbshipit-source-id: e0b8295d2dc6ebe2963383ded5af799ad17164ba
2018-07-27 10:56:39 -07:00
Jerry Zhang
969b62f276 Revert D8121878: Remove template parameter from Tensor
Differential Revision:
D8121878

Original commit changeset: 4a5e9a677ba4

fbshipit-source-id: d8e2c0bb145b52fbcca323b22d1d3346f0b3249e
2018-07-26 14:02:04 -07:00
Jerry Zhang
cd5adc7b5f Remove template parameter from Tensor (#13)
Summary:
Pull Request resolved: https://github.com/facebookresearch/weakly-supervised-action-detection/pull/13

Pull Request resolved: https://github.com/pytorch/translate/pull/166

Pull Request resolved: https://github.com/pytorch/pytorch/pull/9125

Closes https://github.com/pytorch/pytorch/pull/9125

Use inheritance for polymorphism, and remove template parameter
This is to change the templating in call sites, the core implementations will change later

Before Caffe2 Tensor class was compile-time fixed to bind to a particular device/context. With this change, we're making it a runtime property (stored inside the tensor), but preserve the same semantics. For example, one has to specify device type in order to create a Tensor - there are no uninitialized tensors. More specifically the changes are:

1. We added an extra argument *DeviceType* to most of the constructors of the tensor, e.g. (Tensor(DeviceType type)),
2. Semantics of constructor Tensor(const Tensor<SrcContext>& src, ContextForCopy* context); is changed, in this constructor, the second context is passed in to enable us to call the templated Copy function, it could be in a different context as source and target previously, now we'll enforce that the context should have same device type as src, if it is provided.
3. To preserve 'get-or-construct' semantics of Blob, we added specialized getter Blob::GetMutableTensor that verifies both that Blob contains a Tensor and that it's of a correct type
4. Specifically, Tensor type is not default-constructible any more (as we don't have unknown device tensors) and thus some of the code handling STL containers needs to change

Note: Some changes are postponed just to keep this diff a bit smaller. Please see `TODO`s.

Reviewed By: xw285cornell

Differential Revision: D8121878

fbshipit-source-id: 4a5e9a677ba4ac82095df959851a054c81eccf81
2018-07-26 10:25:23 -07:00
Yangqing Jia
20c516ac18
[cmake] Make cudnn optional (#8265)
* Make cudnn optional

* Remove cudnn file from cpu file
2018-06-08 02:04:27 -07:00
Orion Reblitz-Richardson
d1bdb3b10a Remove core and util warnings (#8239)
* Fix some signed/unsigned mismatches

* Skip unused result warning

* Explict fallthrough for murmur hash

* Enable aligned new support to eliminate warning

* Switch to int instead of unsigned in some cases
2018-06-07 09:10:33 -07:00
Orion Reblitz-Richardson
6223bfdb1d Update from Facebook (#6692)
* [GanH][Easy]: Add assertion to adaptive weighting layer

0 weight causes numeric instability and exploding ne

* [Easy] Add cast op before computing norm in diagnose options

As LpNorm only takes floats we add a manual casting here.

* Introduce a new caching device allocator

`cudaMalloc` and `cudaFree` calls are slow, and become slower the
more GPUs there are. Essentially, they grab a host-wide (not device-wide) lock
because GPU memory is transparently shared across all GPUs. Normally, this
isn't much of a concern since workloads allocate memory upfront, and reuse it
during later computation.

However, under some computation models (specifically, memory conserving
approaches like checkpoint-and-recompute, see
https://medium.com/@yaroslavvb/fitting-larger-networks-into-memory-583e3c758ff9)
this assumption is no longer true. In these situations, `cudaMalloc` and
`cudaFree` are common and frequent. Furthermore, in data parallel contexts,
these calls happen at nearly the same time from all GPUs worsening lock
contention.

A common solution to this problem is to add a custom allocator. In fact,
nVIDIA provides one out of the box: CUB, which Caffe2 already supports.
Unfortunately, the CUB allocator suffers from very high fragmentation. This is
primarily because it is a "buddy" allocator which neither splits nor merges
free cached blocks. Study
https://github.com/NVlabs/cub/blob/1.8.0/cub/util_allocator.cuh#L357 if you
want to convince yourself.

This diff adapts a caching allocator from the Torch codebase
https://github.com/torch/cutorch/blob/master/lib/THC/THCCachingAllocator.cpp
which does splitting and merging and ends up working really well, at least for
workloads like the checkpoint-and-recompute computation models noted above.

I simplified the implementation a little bit, made it a bit more C++-like. I
also removed a bunch of stream synchronization primitives for this diff. I
plan to add them back in subsequent diffs.

* Report reader progress in fblearner workflows

Integrate with fblearner progress reporting API and add support to report training progress from reader nodes.
If reader is constructed with batch limits, report based on finished batch vs total batch. The finished batch may be more than total batch because we evaludate if we should stop processing everytime we dequeue a split.
If no limit for the reader, report based on finished splits (Hive files) vs total splits. This is fairly accurate.

* [GanH][Diagnose]: fix plotting

1. ganh diagnose needs to set plot options
2. modifier's blob name is used for metric field can need to be fixed before
generating net

* Automatic update of fbcode/onnx to 985af3f5a0f7e7d29bc0ee6b13047e7ead9c90c8

* Make CompositeReader stops as soon as one reader finishes

Previously, CompositeReader calls all readers before stopping. It results in flaky test since the last batch may be read by different threads; resulting in dropped data.

* [dper] make sure loss is not nan

as desc.

* [rosetta2] [mobile-vision] Option to export NHWC order for RoIWarp/RoIAlign

Thanks for finding this @stzpz and @wangyanghan. Looks like NHWC is more
optimized. For OCR though it doesn't yet help since NHWC uses more mem b/w but
will soon become important.

* Intra-op parallel FC operator

Intra-op parallel FC operator

* [C2 Proto] extra info in device option

passing extra information in device option

design doc: https://fb.quip.com/yAiuAXkRXZGx

* Unregister MKL fallbacks for NCHW conversions

* Tracing for more executors

Modified Tracer to work with other executors and add more tracing

* Remove ShiftActivationDevices()

* Check for blob entry iff it is present

When processing the placeholders ops, ignore if the blob is not present in the blob_to_device.

* Internalize use of eigen tensor

Move use of eigen tensor out of the header file so we don't get template partial specialization errors when building other libraries.

* feature importance for transformed features.

* - Fix unused parameter warnings

The changes in this diff comments out unused parameters.
This will allow us to enable -Wunused-parameter as error.

#accept2ship

* add opencv dependencies to caffe2

The video input op requires additional opencv packages. This is to add them to
cmake so that it can build

* Add clip_by_value option in gradient clipping

Add clip_by_value option in gradient clipping

when the value is bigger than max or smaller than min, do the clip

* std::round compat
2018-04-17 23:36:40 -07:00
Orion Reblitz-Richardson
1d5780d42c Remove Apache headers from source.
* LICENSE file contains details, so removing from individual source files.
2018-03-27 13:10:18 -07:00
Dmytro Dzhulgakov
496c999f7d [core] NUMA-aware pinned allocator
Using cudaHostRegister/Unregister instead of cudaMallocHost to move memory to a
specific NUMA node
2018-03-06 00:33:11 -08:00