Commit Graph

68 Commits

Author SHA1 Message Date
David Goodwin
c855e04d5f Caffe2 shouldn't fail if CUDA peer access is already enabled
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/19586

Differential Revision: D15061544

Pulled By: dzhulgakov

fbshipit-source-id: 6a5f9f4fe45259d689671f58ad5206cdaf15c5bd
2019-04-24 13:22:27 -07:00
Gregory Chanan
b6ee83a5b4 Materialize a non-default device for C2 legacy storage. (#18605)
Summary:
It's not intended that Storages have 'default' CUDA devices, but this is allowable via the Storage::create_legacy codepath.

This also messages with device_caching, because the initial cache is obtained from the Storage, which may have a 'default' device.

Instead, we materialize a device by allocating 0 bytes via the allocator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18605

Differential Revision: D14680620

Pulled By: gchanan

fbshipit-source-id: 6d43383d836e90beaf12bfe37c3f0506843f5432
2019-04-11 13:50:41 -07:00
Edward Yang
1e6acc676f Replace caffe2::DeviceGuard with c10::cuda::CUDAGuard (#17623)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17623

Despite it's generic sounding name, caffe2::DeviceGuard actually
only worked on CUDA devices.  Rename it to something that more
clearly spells out its applicability.

I'm not sure if it's the right call, but in this patch I added
'using CUDAGuard = c10::cuda::CUDAGuard', as this seems to be more
in-line with how the Caffe2 codebase is currently written.  More
idiomatic c10 namespace style would be to say cuda::CUDAGuard.
Willing to change this if people shout.

This is a respin of D13156470 (#14284)

Reviewed By: dzhulgakov

Differential Revision: D14285504

fbshipit-source-id: 93b8ab938b064572b3b010c307e1261fde0fff3d
2019-03-06 10:48:15 -08:00
Brian W. Hart
fbd690c1fe caffe2: fix PinnedCPUAllocator cudaHostRegister() leak (#16340)
Summary:
In the NUMA case, PinnedCPUAllocator's allocate() would return a
DataPtr constructed by DefaultCPUAllocator, which would reference
the Default... Delete() rather than the Pinned... Delete(). That
meant Pinned... Delete() would never run, so cudaHostUnregister()
would never be called when regions were freed.

See: https://github.com/pytorch/pytorch/issues/16280

This change adds a 'naked_allocate()' method to the Default allocator
that just returns a pointer to the allocated memory rather than
wrapping it in a DataPtr. Pinned allocator uses that then constructs
a DataPtr with reference to its own Delete().
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16340

Reviewed By: dzhulgakov

Differential Revision: D13843206

Pulled By: ezyang

fbshipit-source-id: 9efb572e5a01b49ef2a4aceeccc13cd0b1066528
2019-02-15 07:02:33 -08:00
Dmytro Dzhulgakov
51dd2000cd unify c2 and TH allocator (#16892)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16892

Replaces https://github.com/pytorch/pytorch/pull/14517

Merged caffe2 and TH CPU Allocators. Mostly using the code from caffe2 allocators.
`memset` of caffe2 allocator is gone now. These two allocators should be almost the same.

Baseline:
```
Running ./tensor_allocation
Run on (48 X 2501 MHz CPU s)
CPU Caches:
  L1 Data 32K (x24)
  L1 Instruction 32K (x24)
  L2 Unified 256K (x24)
  L3 Unified 30720K (x2)
-------------------------------------------------------------------------
Benchmark                                  Time           CPU Iterations
-------------------------------------------------------------------------
BM_MakeStorageImpl                       148 ns        148 ns    4676594
BM_StorageImplCtor                        54 ns         54 ns   12957810
BM_MallocStorageImpl                      62 ns         62 ns   11254745
BM_TensorImplCtor                         22 ns         22 ns   31939472
BM_MallocTensorImpl                      105 ns        105 ns    6505661
BM_Malloc_1                               43 ns         43 ns   16464905
BM_MakeTensorFromStorage                 126 ns        126 ns    5586116
BM_MakeVariableFromTensor                236 ns        236 ns    2995528
BM_ATenCPUTensorAllocationSmall1         319 ns        319 ns    2268884
BM_ATenCPUTensorAllocationSmall2         318 ns        318 ns    2163332
BM_ATenCPUTensorAllocationMedium1        403 ns        403 ns    1663228
BM_ATenCPUTensorAllocationMedium2        448 ns        448 ns    1595004
BM_ATenCPUTensorAllocationBig1           532 ns        532 ns    1352634
BM_ATenCPUTensorAllocationBig2          4486 ns       4486 ns     160978
```
Changed:
```
Running ./tensor_allocation
Run on (48 X 2501 MHz CPU s)
CPU Caches:
  L1 Data 32K (x24)
  L1 Instruction 32K (x24)
  L2 Unified 256K (x24)
  L3 Unified 30720K (x2)
-------------------------------------------------------------------------
Benchmark                                  Time           CPU Iterations
-------------------------------------------------------------------------
BM_MakeStorageImpl                       141 ns        141 ns    4803576
BM_StorageImplCtor                        55 ns         55 ns   13129391
BM_MallocStorageImpl                      64 ns         64 ns   11088143
BM_TensorImplCtor                         23 ns         23 ns   31616273
BM_MallocTensorImpl                      101 ns        101 ns    7017585
BM_Malloc_1                               39 ns         39 ns   18523954
BM_MakeTensorFromStorage                 118 ns        118 ns    5877919
BM_MakeVariableFromTensor                452 ns        452 ns    1565722
BM_ATenCPUTensorAllocationSmall1         384 ns        384 ns    1819763
BM_ATenCPUTensorAllocationSmall2         389 ns        389 ns    1857483
BM_ATenCPUTensorAllocationMedium1        425 ns        425 ns    1646284
BM_ATenCPUTensorAllocationMedium2        430 ns        430 ns    1561319
BM_ATenCPUTensorAllocationBig1           508 ns        508 ns    1309969
BM_ATenCPUTensorAllocationBig2          3799 ns       3799 ns     173674
```

lstm benchmark:
Before:
```
INFO:lstm_bench:Iter: 1 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 21 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 41 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 61 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 81 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 101 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 121 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 141 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 161 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 181 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 201 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 221 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 241 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 261 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 281 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 301 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 321 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 341 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 361 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 381 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Done. Total EPS excluding 1st iteration: 0.8k
```

After:
```
INFO:lstm_bench:Iter: 1 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 21 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 41 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 61 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 81 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 101 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 121 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 141 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 161 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 181 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 201 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 221 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 241 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 261 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 281 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 301 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 321 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 341 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 361 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 381 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Done. Total EPS excluding 1st iteration: 0.8k
```

Reviewed By: ezyang

Differential Revision: D13202632

fbshipit-source-id: db6d2ec756ed15b0732b15396c82ad42302bb79d
2019-02-12 21:16:34 -08:00
Edward Yang
b9b0be7af2 Remove Legacy entry point. (#16721)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16721

The very key line is we have to set the stream to the default
stream before calling the allocator.  This is very interesting.
It shouldn't be necessary, but seemingly is!

Reviewed By: dzhulgakov

Differential Revision: D13943193

fbshipit-source-id: c21014917d9fe504fab0ad8abbc025787f559287
2019-02-08 09:33:58 -08:00
Edward Yang
5c982622b0 Delete duplicate copy of THCCachingAllocator (round two). (#16615)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16615

This is another go at landing https://github.com/pytorch/pytorch/pull/16226
Now that the caching allocator is moved to c10_cuda, we can
delete the duplicate copy from Caffe2.

The difference between this and the previous PR is that this
version faithfully maintains the binding code; in particular,
we end up with a SECOND copy of the caching allocator in
this patch.  I verified that this code does NOT cause a crash
in the workflow we canaried last time.

In further diffs, I plan to eliminate the second copy, and then
adjust the binding code.

Reviewed By: dzhulgakov

Differential Revision: D13901067

fbshipit-source-id: 66331fd4eadffd0a5defb3cea532d5cd07287872
2019-02-08 09:33:55 -08:00
Edward Yang
279238f0b8 Back out "Delete duplicate copy of THCCachingAllocator." (#16510)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16510

This diff was supposed to be memory usage neutral, but based on
some internal flows involving cuDNN, it was not. Reverting pending
further investigation.

Original commit changeset: 03f1ebf7f11c

Reviewed By: xw285cornell

Differential Revision: D13863610

fbshipit-source-id: 15517e255fd6b0c064b65fb99f0ef19742236cfd
2019-01-29 15:44:19 -08:00
Edward Yang
792cb774f1 Delete duplicate copy of THCCachingAllocator. (#16226)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16226

Now that the caching allocator is moved to c10_cuda, we can
delete the duplicate copy from Caffe2.

Reviewed By: dzhulgakov, smessmer

Differential Revision: D13762540

fbshipit-source-id: 03f1ebf7f11c68c19aa0d66110156fe228da6138
2019-01-24 12:06:57 -08:00
Dmytro Dzhulgakov
96ea2594d8 Don't call cudaStreamDestroy at destruction time (#15692)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15692

It was leading to ocassional crashes with dynamically linked CUDA because runtime was already destroyed.

Also, unique_ptr<T[]> is more suitable than deque<T> for the purpose.

Reviewed By: Yangqing

Differential Revision: D13571988

fbshipit-source-id: 37eb26dfbe361c49160367b53f87bd037c6c0e46
2019-01-11 12:36:41 -08:00
Ashwin Bharambe
5f0bff9639 Kill GPU memory logs in normal runs (#14838)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14838

The GPU memory tracking logs are incredibly annoying and merely serve
to pollute output. I `VLOG(1)`ed them. Hopefully, this is non-controversial.

Reviewed By: kuttas

Differential Revision: D13343290

fbshipit-source-id: b3cae99346c97b66e97ea660061e15dc5c99b9fc
2018-12-06 13:51:14 -08:00
Edward Yang
8617b780cf Replace use of 'int' with more descriptive 'DeviceIndex' or 'StreamId'. (#14282)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14282

This also is a substantive change, as 'DeviceIndex' and 'StreamId' are
narrower types than 'int'.

Reviewed By: Yangqing, smessmer

Differential Revision: D13156471

fbshipit-source-id: 08aa0f70c4142415b6bd4d17c57da0641c1d0e9a
2018-11-29 18:33:21 -08:00
Dmytro Dzhulgakov
0cfbbceac3 Change Tensor::CopyFrom to a simple double dispatch (#14268)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14268

Removes the need for Context in Tensor by doing simple dispatch for CopyBytes. It'd eventually be subsumed by Roy Li's changes of proper copy_ op, but before that is done, let's get a clear logic of how copies are implemented and clean up some craft in CopyFrom implementation.

Note, that with these changes, one can probably can get rid of Context::CopyFromCPU/CopyToCPU, but it's a matter for follow up diffs.

This diff doesn't change the API of Tensor yet, but relies on the fact that passing `Context` to CopyFrom makes copy async if the device is CUDA and doesn't have any effect otherwise (that's how Context methods are implemented).

This doesn't change semantics of copy async implementation - as before it blindly calls cudaMemcpyAsync which probably means that it can be misused if invoked separately outside of operator body. I'll leave it for the follow up copy_ unification.

For Extend() we always do async copy - it makes sense as it's an in-place device-device operation and only any further op would be observable.

Note: there are now three ways of invoking copy in C2 code - templated CopyBytes, virtual CopyFromCPU/etc, and double-dispatch free method here. Hopefully we can get rid of the second one.

Also, please advise whether it's c10-worthy :)

Reviewed By: ezyang

Differential Revision: D13117987

fbshipit-source-id: a6772d6dcf3effaf06717da3a656fc9873b310b5
2018-11-28 15:45:37 -08:00
Yangqing Jia
7d5f7ed270 Using c10 namespace across caffe2. (#12714)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12714

This is a short change to enable c10 namespace in caffe2. We did not enable
it before due to gflags global variable confusion, but it should have been
mostly cleaned now. Right now, the plan on record is that namespace caffe2 and
namespace aten will fully be supersets of namespace c10.

Most of the diff is codemod, and only two places of non-codemod is in caffe2/core/common.h, where

```
using namespace c10;
```

is added, and in Flags.h, where instead of creating aliasing variables in c10 namespace, we directly put it in the global namespace to match gflags (and same behavior if gflags is not being built with).

Reviewed By: dzhulgakov

Differential Revision: D10390486

fbshipit-source-id: 5e2df730e28e29a052f513bddc558d9f78a23b9b
2018-10-17 12:57:19 -07:00
Jerry Zhang
b89a3b50fb Remove StaticContext (#12547)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12547

Pull Request resolved: https://github.com/pytorch/pytorch/pull/12305

Remove StaticContext from context_base.h

Reviewed By: dzhulgakov

Differential Revision: D10073519

fbshipit-source-id: 350beec3c54365edef338318ce58229ccb825a98
2018-10-10 19:41:03 -07:00
Jerry Zhang
7724807551 Remove ExtractDeviceOption from StaticContext (#12304)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12304

- make ExtractDeviceOption to be a free function.
- Add a Strorage(at::Device) constructor in order to preserve the device_id.

Reviewed By: dzhulgakov

Differential Revision: D10069839

fbshipit-source-id: a5f3994a39bdf1b7503b39bb42c228e438b52bfa
2018-10-10 14:12:16 -07:00
Junjie Bai
f54ab540af Rename cuda_gpu_id to device_id in DeviceOption (#12456)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12456

codemod with 'Yes to all'
codemod -d . --extensions h,cc,cpp,cu,py,proto,pbtxt,pb.txt,config cuda_gpu_id device_id

Overload TextFormat::ParseFromString to do string replace when parsing from protobuf format

Reviewed By: Yangqing

Differential Revision: D10240535

fbshipit-source-id: 5e6992bec961214be8dbe26f16f5794154a22b25
2018-10-09 15:54:04 -07:00
Jerry Zhang
1c69d368e1 Remove New with Allocator Registry (#12111)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12111

Setup allocator registry keyed by at::DeviceType, and remove New from StaticContext.

Reviewed By: ezyang

Differential Revision: D10022853

fbshipit-source-id: 3e88a181fe5df24f33f49b88be1f75284a185588
2018-10-09 10:53:52 -07:00
Yangqing Jia
38f3d1fc40 move flags to c10 (#12144)
Summary:
still influx.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12144

Reviewed By: smessmer

Differential Revision: D10140176

Pulled By: Yangqing

fbshipit-source-id: 1a313abed022039333e3925d19f8b3ef2d95306c
2018-10-04 02:09:56 -07:00
Jerry Zhang
faab6ea922 Split Allocator (#12105)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12105

Split CUDA/OpenCL/xxx Allocator from xxxStaticContext::New and rewrite it under at::Allocator interface.

Reviewed By: dzhulgakov

Differential Revision: D10001033

fbshipit-source-id: e1ffbc04c18d1dcb1f8d4ef2cbbb321967de5ccc
2018-10-03 19:10:10 -07:00
Jerry Zhang
74dc4460eb New in StaticContext returns at::DataPtr (#12029)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12029

In order to remove New() function in StaticContext(to remove StaticContext) and converge to the Allocator design, we'll first change the return type of New to at::DataPtr.

Reviewed By: ezyang

Differential Revision: D9889990

fbshipit-source-id: 3257c763530b987025f428741bdd2e089d11bad4
2018-10-03 19:10:07 -07:00
Junjie Bai
ff608a9ff3 Back out "Revert D10123245: Back out "codemod cuda_gpu_id to device_id"" (#12232)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12232

Original commit changeset: fca91fea58b7

This adds proper modifications to the DeviceType <->DeviceOption conversion code added in D10033396

Reviewed By: jerryzh168

Differential Revision: D10132473

fbshipit-source-id: 801ef777e2950982cb47b48051b1471a0a91e64b
2018-10-01 21:54:52 -07:00
Rick Ratmansky
3010dc4208 Revert D10123245: Back out "codemod cuda_gpu_id to device_id"
Differential Revision:
D10123245

Original commit changeset: d83da8e00a12

fbshipit-source-id: fca91fea58b7df208edc2e218a1d514f9821ec7b
2018-10-01 12:22:36 -07:00
Yang Liu
7d7d336c45 Back out "codemod cuda_gpu_id to device_id"
Summary:
Original commit changeset: f5614a5d2607

D9986213 is causing Multifeed Aggregator a [huge performance different](https://our.intern.facebook.com/intern/ads/analyze_canary/412951953278781781/) and is blocking aggregator push since last Friday night: https://fburl.com/feedtools/b6izvwjz
We need to land this revert ASAP to unblock aggregator push.

Reviewed By: orionr

Differential Revision: D10123245

fbshipit-source-id: d83da8e00a1250f5d09811a0a587c127e377aab2
2018-10-01 11:31:14 -07:00
Jerry Zhang
006171fffc Back out "[pytorch][PR] Revert "Move CreateContext to global registry (#11688)"" (#12121)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12121

Pull Request resolved: https://github.com/pytorch/pytorch/pull/12055

Original commit changeset: 6ca9de65b707

Reviewed By: ezyang

Differential Revision: D10033396

fbshipit-source-id: ca9f4b2f7ef0561f619b833415d394a8b9972bf4
2018-10-01 11:10:46 -07:00
Junjie Bai
3eb5940cf5 codemod cuda_gpu_id to device_id (#12022)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12022

codemod -d . --extensions h,cc,cpp,cu,py,proto,pbtxt,pb.txt,config cuda_gpu_id device_id

codemod with 'Yes to all'

Reviewed By: orionr

Differential Revision: D9986213

fbshipit-source-id: f5614a5d26078817aee8caf79a494abfd6a95ff1
2018-09-27 20:24:53 -07:00
Edward Yang
d7e11e3aae Revert "Move CreateContext to global registry (#11688)" (#12049)
Summary:
This reverts commit 3ae6ee4ebd.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12049

Differential Revision: D10030954

Pulled By: ezyang

fbshipit-source-id: 6ca9de65b707c5b4c68280fc6f1b8e5ad7251efc
2018-09-25 10:13:43 -07:00
Jerry Zhang
3ae6ee4ebd Move CreateContext to global registry (#11688)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11688

As a first step to remove static context(merge with allocator), we'll create a
global registries for context constructors, and remove CreateContext function from tensor.

Reviewed By: ezyang, dzhulgakov

Differential Revision: D9779821

fbshipit-source-id: 8b239ea50af7a0556fde2382f58f79194f0e3dc1
2018-09-24 17:07:50 -07:00
Jerry Zhang
9f4bcdf075 caffe2::DeviceType -> at::DeviceType (#11254)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11254
Previously we use DeviceType in caffe2.proto directly, but it's an `enum` and have implicit conversion to int, which does not have type safety, e.g. we have to explicitly check for a device type is valid in event.h:
```
template <int d>
struct EventCreateFunctionRegisterer {
  explicit EventCreateFunctionRegisterer(EventCreateFunction f) {
    static_assert(d < MaxDeviceTypes, "");
    Event::event_creator_[d] = f;
  }
};
```
at::DeviceType is an `enum class`, and it does not have implicit conversion to int, and provides better type safety guarantees. In this diff we have done the following refactor(taking CPU as an example):

    1. caffe2::DeviceType → caffe2::DeviceTypeProto
    2. caffe2::CPU → caffe2::PROTO_CPU
    3. caffe2::DeviceType = at::DeviceType
    4. caffe2::CPU = at::DeviceType::CPU

codemod -d caffe2/caffe2 --extensions h,cc,cpp 'device_type\(\), ' 'device_type(), PROTO_'
+ some manual changes

In short, after this diff, in c++, caffe2::CPU refers to the at::DeviceType::CPU and the old proto caffe2::CPU will be caffe2::PROTO_CPU.
In python side, we have a temporary workaround that alias `caffe2_pb2.CPU = caffe2_pb2.PROOT_CPU` to make the change easier to review and this will be removed later.

Reviewed By: ezyang

Differential Revision: D9545704

fbshipit-source-id: 461a28a4ca74e616d3ee183a607078a717fd38a7
2018-09-05 16:28:09 -07:00
Orion Reblitz-Richardson
6508db7421 Remove BUILD_CAFFE2 and build everything (#8338)
Summary:
This completely removes BUILD_CAFFE2 from CMake. There is still a little bit of "full build" stuff in setup.py that enables USE_CUDNN and BUILD_PYTHON, but otherwise everything should be enabled for PyTorch as well as Caffe2. This gets us a lot closer to full unification.

cc mingzhe09088, pjh5, ezyang, smessmer, Yangqing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8338

Reviewed By: mingzhe09088

Differential Revision: D9600513

Pulled By: orionr

fbshipit-source-id: 9f6ca49df35b920d3439dcec56e7b26ad4768b7d
2018-08-31 13:10:24 -07:00
root
c3fe071483 Update hip files (#9826)
Summary:
The goal of this PR is to update the hip files to reflect relevant changes in cuda source files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9826

Differential Revision: D9032840

Pulled By: bddppq

fbshipit-source-id: 504e55c46308eebfee3c9a7beea1f294fe03470f
2018-07-27 16:54:39 -07:00
Jerry Zhang
aebf3b47ae Remove template parameter from Tensor (#9939)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9939

Pull Request resolved: https://github.com/facebookresearch/weakly-supervised-action-detection/pull/13

Pull Request resolved: https://github.com/pytorch/translate/pull/166

Pull Request resolved: https://github.com/pytorch/pytorch/pull/9125

Closes https://github.com/pytorch/pytorch/pull/9125

Use inheritance for polymorphism, and remove template parameter
This is to change the templating in call sites, the core implementations will change later

Before Caffe2 Tensor class was compile-time fixed to bind to a particular device/context. With this change, we're making it a runtime property (stored inside the tensor), but preserve the same semantics. For example, one has to specify device type in order to create a Tensor - there are no uninitialized tensors. More specifically the changes are:

1. We added an extra argument *DeviceType* to most of the constructors of the tensor, e.g. (Tensor(DeviceType type)),
2. Semantics of constructor Tensor(const Tensor<SrcContext>& src, ContextForCopy* context); is changed, in this constructor, the second context is passed in to enable us to call the templated Copy function, it could be in a different context as source and target previously, now we'll enforce that the context should have same device type as src, if it is provided.
3. To preserve 'get-or-construct' semantics of Blob, we added specialized getter Blob::GetMutableTensor that verifies both that Blob contains a Tensor and that it's of a correct type
4. Specifically, Tensor type is not default-constructible any more (as we don't have unknown device tensors) and thus some of the code handling STL containers needs to change

Note: Some changes are postponed just to keep this diff a bit smaller. Please see `TODO`s.

Reviewed By: ezyang, houseroad

Differential Revision: D9024330

fbshipit-source-id: e0b8295d2dc6ebe2963383ded5af799ad17164ba
2018-07-27 10:56:39 -07:00
Jerry Zhang
969b62f276 Revert D8121878: Remove template parameter from Tensor
Differential Revision:
D8121878

Original commit changeset: 4a5e9a677ba4

fbshipit-source-id: d8e2c0bb145b52fbcca323b22d1d3346f0b3249e
2018-07-26 14:02:04 -07:00
Jerry Zhang
cd5adc7b5f Remove template parameter from Tensor (#13)
Summary:
Pull Request resolved: https://github.com/facebookresearch/weakly-supervised-action-detection/pull/13

Pull Request resolved: https://github.com/pytorch/translate/pull/166

Pull Request resolved: https://github.com/pytorch/pytorch/pull/9125

Closes https://github.com/pytorch/pytorch/pull/9125

Use inheritance for polymorphism, and remove template parameter
This is to change the templating in call sites, the core implementations will change later

Before Caffe2 Tensor class was compile-time fixed to bind to a particular device/context. With this change, we're making it a runtime property (stored inside the tensor), but preserve the same semantics. For example, one has to specify device type in order to create a Tensor - there are no uninitialized tensors. More specifically the changes are:

1. We added an extra argument *DeviceType* to most of the constructors of the tensor, e.g. (Tensor(DeviceType type)),
2. Semantics of constructor Tensor(const Tensor<SrcContext>& src, ContextForCopy* context); is changed, in this constructor, the second context is passed in to enable us to call the templated Copy function, it could be in a different context as source and target previously, now we'll enforce that the context should have same device type as src, if it is provided.
3. To preserve 'get-or-construct' semantics of Blob, we added specialized getter Blob::GetMutableTensor that verifies both that Blob contains a Tensor and that it's of a correct type
4. Specifically, Tensor type is not default-constructible any more (as we don't have unknown device tensors) and thus some of the code handling STL containers needs to change

Note: Some changes are postponed just to keep this diff a bit smaller. Please see `TODO`s.

Reviewed By: xw285cornell

Differential Revision: D8121878

fbshipit-source-id: 4a5e9a677ba4ac82095df959851a054c81eccf81
2018-07-26 10:25:23 -07:00
Sebastian Messmer
d2610fb379 Constexpr Type Ids -> 6.5% caffe2 perf improvement (#9603)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9603

Using constexpr for some heavily queried type ids gives us a 6.5% perf improvement for caffe2.

Benchmark results: P59829647

Also ad canaries (but they don't show a significant difference):
- adfinder:
  - https://our.intern.facebook.com/intern/ads/canary/411346509423301481
  - https://our.intern.facebook.com/intern/ads/canary/411346563021753557
- adindexer:
  - https://our.intern.facebook.com/intern/ads/canary/411346517006038367
  - https://our.intern.facebook.com/intern/ads/canary/411346571387258927
- multifeed_predictor:
  - https://our.intern.facebook.com/intern/ads/canary/411346526631282941
  - https://our.intern.facebook.com/intern/ads/canary/411346583141009531

Reviewed By: dzhulgakov

Differential Revision: D8841577

fbshipit-source-id: 1a0ce7f2bee1ae54b723caefe5bc7f85a20935b4
2018-07-24 17:24:55 -07:00
Lin Li
0fe980c748 Memory usage measurement -- Caffe2 (#9017)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9017

Closes https://github.com/pytorch/pytorch/pull/9017

Added "get_blob_size_bytes" to "pybind_state.cc" in Caffe2 to expose the size of blob in bytes.

Reviewed By: kuttas

Differential Revision: D8685696

fbshipit-source-id: 9a9d38f207c8c59ef534217181e8ce1514617628
2018-07-17 16:40:23 -07:00
Yangqing Jia
03f7289fcf Add CAFFE2_USE_CUDNN guard on context_gpu.cu (#8657) 2018-06-19 13:49:06 -07:00
Yangqing Jia
20c516ac18
[cmake] Make cudnn optional (#8265)
* Make cudnn optional

* Remove cudnn file from cpu file
2018-06-08 02:04:27 -07:00
Orion Reblitz-Richardson
6223bfdb1d Update from Facebook (#6692)
* [GanH][Easy]: Add assertion to adaptive weighting layer

0 weight causes numeric instability and exploding ne

* [Easy] Add cast op before computing norm in diagnose options

As LpNorm only takes floats we add a manual casting here.

* Introduce a new caching device allocator

`cudaMalloc` and `cudaFree` calls are slow, and become slower the
more GPUs there are. Essentially, they grab a host-wide (not device-wide) lock
because GPU memory is transparently shared across all GPUs. Normally, this
isn't much of a concern since workloads allocate memory upfront, and reuse it
during later computation.

However, under some computation models (specifically, memory conserving
approaches like checkpoint-and-recompute, see
https://medium.com/@yaroslavvb/fitting-larger-networks-into-memory-583e3c758ff9)
this assumption is no longer true. In these situations, `cudaMalloc` and
`cudaFree` are common and frequent. Furthermore, in data parallel contexts,
these calls happen at nearly the same time from all GPUs worsening lock
contention.

A common solution to this problem is to add a custom allocator. In fact,
nVIDIA provides one out of the box: CUB, which Caffe2 already supports.
Unfortunately, the CUB allocator suffers from very high fragmentation. This is
primarily because it is a "buddy" allocator which neither splits nor merges
free cached blocks. Study
https://github.com/NVlabs/cub/blob/1.8.0/cub/util_allocator.cuh#L357 if you
want to convince yourself.

This diff adapts a caching allocator from the Torch codebase
https://github.com/torch/cutorch/blob/master/lib/THC/THCCachingAllocator.cpp
which does splitting and merging and ends up working really well, at least for
workloads like the checkpoint-and-recompute computation models noted above.

I simplified the implementation a little bit, made it a bit more C++-like. I
also removed a bunch of stream synchronization primitives for this diff. I
plan to add them back in subsequent diffs.

* Report reader progress in fblearner workflows

Integrate with fblearner progress reporting API and add support to report training progress from reader nodes.
If reader is constructed with batch limits, report based on finished batch vs total batch. The finished batch may be more than total batch because we evaludate if we should stop processing everytime we dequeue a split.
If no limit for the reader, report based on finished splits (Hive files) vs total splits. This is fairly accurate.

* [GanH][Diagnose]: fix plotting

1. ganh diagnose needs to set plot options
2. modifier's blob name is used for metric field can need to be fixed before
generating net

* Automatic update of fbcode/onnx to 985af3f5a0f7e7d29bc0ee6b13047e7ead9c90c8

* Make CompositeReader stops as soon as one reader finishes

Previously, CompositeReader calls all readers before stopping. It results in flaky test since the last batch may be read by different threads; resulting in dropped data.

* [dper] make sure loss is not nan

as desc.

* [rosetta2] [mobile-vision] Option to export NHWC order for RoIWarp/RoIAlign

Thanks for finding this @stzpz and @wangyanghan. Looks like NHWC is more
optimized. For OCR though it doesn't yet help since NHWC uses more mem b/w but
will soon become important.

* Intra-op parallel FC operator

Intra-op parallel FC operator

* [C2 Proto] extra info in device option

passing extra information in device option

design doc: https://fb.quip.com/yAiuAXkRXZGx

* Unregister MKL fallbacks for NCHW conversions

* Tracing for more executors

Modified Tracer to work with other executors and add more tracing

* Remove ShiftActivationDevices()

* Check for blob entry iff it is present

When processing the placeholders ops, ignore if the blob is not present in the blob_to_device.

* Internalize use of eigen tensor

Move use of eigen tensor out of the header file so we don't get template partial specialization errors when building other libraries.

* feature importance for transformed features.

* - Fix unused parameter warnings

The changes in this diff comments out unused parameters.
This will allow us to enable -Wunused-parameter as error.

#accept2ship

* add opencv dependencies to caffe2

The video input op requires additional opencv packages. This is to add them to
cmake so that it can build

* Add clip_by_value option in gradient clipping

Add clip_by_value option in gradient clipping

when the value is bigger than max or smaller than min, do the clip

* std::round compat
2018-04-17 23:36:40 -07:00
Orion Reblitz-Richardson
1d5780d42c Remove Apache headers from source.
* LICENSE file contains details, so removing from individual source files.
2018-03-27 13:10:18 -07:00
Yangqing Jia
ced2c7e2b2 Remove Set/GetDefaultGPUID and move to use current gpu id instead.
Summary:
Reason for this change:

(1) Setting/Getting default gpu id doesn't seem to be used at all.
(2) It actually is confusing compared to the CUDA_VISIBLE_DEVICES options etc.
(3) When setting cuda_gpu_id=-1 in the CUDAContext arg, it used to use the
default gpu id but probably we should use the current gpu - so that the caller
will be able to control the device placement.

One use case is for TensorRT - if we have a custom callback layer, then it would
be easier for TRT or whatever caller to set the running device.

Reviewed By: dzhulgakov

Differential Revision: D6740357

fbshipit-source-id: 2ea710e434b10220d5a198e31c93847304636863
2018-01-19 18:03:21 -08:00
Yangqing Jia
59b2654544 reapply header change after xplat move
Summary: This is a reapplication of the earlier PR due to xplat move. Original author is Christoph Conrads <christoph.conrads@fluent.ai> christoph-conrads .

Reviewed By: houseroad

Differential Revision: D6379736

fbshipit-source-id: b7482ecf3b9487a528c15e92976e915791210002
2017-11-22 13:04:37 -08:00
Yangqing Jia
8286ce1e3a Re-license to Apache
Summary: Closes https://github.com/caffe2/caffe2/pull/1260

Differential Revision: D5906739

Pulled By: Yangqing

fbshipit-source-id: e482ba9ba60b5337d9165f28f7ec68d4518a0902
2017-09-28 16:22:00 -07:00
Ahmed Taei
c3a3d6ceba Add an option to use dynamic memory optimizer.
Reviewed By: akyrola

Differential Revision: D5869664

fbshipit-source-id: ab11bc27395bf10e8381ebf97e6afb83ae9af81f
2017-09-20 12:52:55 -07:00
Aapo Kyrola
60c17a34b5 better default settings for CUB
Summary: The default CUB settings led to very slow execution in practice when using "dynamic" memory allocation with C2 (i.e freeing blobs after their use). After some tinkering, I arrived to these numbers, that work with resnet-50 and NVIDIA M40 GPU much better than the origianal defaults. Also made the maximum allocated memory configurable.

Reviewed By: Yangqing

Differential Revision: D5732930

fbshipit-source-id: 9ff34f49d5a3eb138bc6f44c82918731a35325a6
2017-08-29 19:11:08 -07:00
Yangqing Jia
26f0943130 Do CaffeCudaSetDevice and CaffeCudaGetDevice
Summary:
These are wrapper functions so that if we run in a Caffe2-only mode, we can
turn the flag on and get some small speedup on cuda device switches.

The purpose of the diff is to allow us to quickly assess the overhead of cuda
device switch functions. Ideally, the caching behavior shall live in the cuda
driver, which is the only safe place to ensure correctness.

If other code is running aside Caffe2 and does not properly do device guard,
this functionality will fail as separate cudaSetDevice() calls will not update
Caffe2's thread local device id. As a result, the functionality is only enabled
when/if one explicitly sets the flag.

This might not be safe, so use with caution.

- cudaGetDevice can go from 90ns to 2ns
- when setting the same device, we can go from 100ns to 2 ns
- when setting a different device, things are the same (1ns overhead on top of 143ns)

Reviewed By: azzolini

Differential Revision: D5709398

fbshipit-source-id: 6255f17a3d41f59a30327436383f306a2287896e
2017-08-25 18:20:14 -07:00
Yangqing Jia
578adbe9c0 Adios CNMEM. You will be remembered.
Summary:
As part of the cuda 9 move we have decided to deprecate the cnmem path
as it seems to be superceded by cub if one needs a memory pool.
Closes https://github.com/caffe2/caffe2/pull/1104

Differential Revision: D5647672

Pulled By: Yangqing

fbshipit-source-id: 988af5bf63e24efa1b631fd91ddb58e798ffc5c6
2017-08-17 00:05:57 -07:00
Aapo Kyrola
07e745408b revert D5528436
Summary: Users are reporting CUDA illegal access errors happening on some configurations after D5528436 introduced lazy peer connections. Will debug later, but this diff is to revert that change.

Reviewed By: pietern

Differential Revision: D5581673

fbshipit-source-id: ef8e367160a38fc62434d6f5905892db274d9f06
2017-08-07 23:07:50 -07:00
Aapo Kyrola
3628cd30f0 initialize peer access only when (might be) needed
Summary:
Currently Caffe2 enables peer access between all 8 gpus, even if only 1 gpu would be used. This adds several seconds to the startup time, but also takes a lot of memory (110 MB per GPU).

This diff makes the peer access initialization "lazy". When GPU X is first used, pairwise peer access is set between GPUs 0 to X-1 with X. A lookup table is used to ensure no double peer access initialization.

Reviewed By: pietern

Differential Revision: D5528436

fbshipit-source-id: 8f3c2c8154291a7d3a99ee2882e4834ef5e38b66
2017-08-03 09:24:14 -07:00
Dmytro Dzhulgakov
0d833590c1 Change Allocator interface to return deleter
Summary:
This is in preparation for adding huge pages. There we want to remember for the pointer how we got it - via mmap() or alloc(). One option is to store gigantic map of void* -> destructor, but luckily usages of Context::New are all inside Tensor which already uses shared_ptr with custom deleter.

This diff could have used unique_ptr as the return type but then it's easy to accidentally call release() and loose the deleter. Thus going with std::pair<void*, MemoryDeleter> to be explicit.

Also, now CPUAllocator can be effectively changed to std::function. Haven't done it yet, but can do if necessary.

Let me know whether it's a bad idea to proceed like this.

Reviewed By: Yangqing

Differential Revision: D5429830

fbshipit-source-id: 8382ab7b81592d51272056c05c122894bb203827
2017-07-17 15:26:27 -07:00