pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
David Goodwin	c855e04d5f	Caffe2 shouldn't fail if CUDA peer access is already enabled Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/19586 Differential Revision: D15061544 Pulled By: dzhulgakov fbshipit-source-id: 6a5f9f4fe45259d689671f58ad5206cdaf15c5bd	2019-04-24 13:22:27 -07:00
Gregory Chanan	b6ee83a5b4	Materialize a non-default device for C2 legacy storage. (#18605 ) Summary: It's not intended that Storages have 'default' CUDA devices, but this is allowable via the Storage::create_legacy codepath. This also messages with device_caching, because the initial cache is obtained from the Storage, which may have a 'default' device. Instead, we materialize a device by allocating 0 bytes via the allocator. Pull Request resolved: https://github.com/pytorch/pytorch/pull/18605 Differential Revision: D14680620 Pulled By: gchanan fbshipit-source-id: 6d43383d836e90beaf12bfe37c3f0506843f5432	2019-04-11 13:50:41 -07:00
Edward Yang	1e6acc676f	Replace caffe2::DeviceGuard with c10::cuda::CUDAGuard (#17623 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/17623 Despite it's generic sounding name, caffe2::DeviceGuard actually only worked on CUDA devices. Rename it to something that more clearly spells out its applicability. I'm not sure if it's the right call, but in this patch I added 'using CUDAGuard = c10::cuda::CUDAGuard', as this seems to be more in-line with how the Caffe2 codebase is currently written. More idiomatic c10 namespace style would be to say cuda::CUDAGuard. Willing to change this if people shout. This is a respin of D13156470 (#14284) Reviewed By: dzhulgakov Differential Revision: D14285504 fbshipit-source-id: 93b8ab938b064572b3b010c307e1261fde0fff3d	2019-03-06 10:48:15 -08:00
Brian W. Hart	fbd690c1fe	caffe2: fix PinnedCPUAllocator cudaHostRegister() leak (#16340 ) Summary: In the NUMA case, PinnedCPUAllocator's allocate() would return a DataPtr constructed by DefaultCPUAllocator, which would reference the Default... Delete() rather than the Pinned... Delete(). That meant Pinned... Delete() would never run, so cudaHostUnregister() would never be called when regions were freed. See: https://github.com/pytorch/pytorch/issues/16280 This change adds a 'naked_allocate()' method to the Default allocator that just returns a pointer to the allocated memory rather than wrapping it in a DataPtr. Pinned allocator uses that then constructs a DataPtr with reference to its own Delete(). Pull Request resolved: https://github.com/pytorch/pytorch/pull/16340 Reviewed By: dzhulgakov Differential Revision: D13843206 Pulled By: ezyang fbshipit-source-id: 9efb572e5a01b49ef2a4aceeccc13cd0b1066528	2019-02-15 07:02:33 -08:00
Dmytro Dzhulgakov	51dd2000cd	unify c2 and TH allocator (#16892 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/16892 Replaces https://github.com/pytorch/pytorch/pull/14517 Merged caffe2 and TH CPU Allocators. Mostly using the code from caffe2 allocators. `memset` of caffe2 allocator is gone now. These two allocators should be almost the same. Baseline: ``` Running ./tensor_allocation Run on (48 X 2501 MHz CPU s) CPU Caches: L1 Data 32K (x24) L1 Instruction 32K (x24) L2 Unified 256K (x24) L3 Unified 30720K (x2) ------------------------------------------------------------------------- Benchmark Time CPU Iterations ------------------------------------------------------------------------- BM_MakeStorageImpl 148 ns 148 ns 4676594 BM_StorageImplCtor 54 ns 54 ns 12957810 BM_MallocStorageImpl 62 ns 62 ns 11254745 BM_TensorImplCtor 22 ns 22 ns 31939472 BM_MallocTensorImpl 105 ns 105 ns 6505661 BM_Malloc_1 43 ns 43 ns 16464905 BM_MakeTensorFromStorage 126 ns 126 ns 5586116 BM_MakeVariableFromTensor 236 ns 236 ns 2995528 BM_ATenCPUTensorAllocationSmall1 319 ns 319 ns 2268884 BM_ATenCPUTensorAllocationSmall2 318 ns 318 ns 2163332 BM_ATenCPUTensorAllocationMedium1 403 ns 403 ns 1663228 BM_ATenCPUTensorAllocationMedium2 448 ns 448 ns 1595004 BM_ATenCPUTensorAllocationBig1 532 ns 532 ns 1352634 BM_ATenCPUTensorAllocationBig2 4486 ns 4486 ns 160978 ``` Changed: ``` Running ./tensor_allocation Run on (48 X 2501 MHz CPU s) CPU Caches: L1 Data 32K (x24) L1 Instruction 32K (x24) L2 Unified 256K (x24) L3 Unified 30720K (x2) ------------------------------------------------------------------------- Benchmark Time CPU Iterations ------------------------------------------------------------------------- BM_MakeStorageImpl 141 ns 141 ns 4803576 BM_StorageImplCtor 55 ns 55 ns 13129391 BM_MallocStorageImpl 64 ns 64 ns 11088143 BM_TensorImplCtor 23 ns 23 ns 31616273 BM_MallocTensorImpl 101 ns 101 ns 7017585 BM_Malloc_1 39 ns 39 ns 18523954 BM_MakeTensorFromStorage 118 ns 118 ns 5877919 BM_MakeVariableFromTensor 452 ns 452 ns 1565722 BM_ATenCPUTensorAllocationSmall1 384 ns 384 ns 1819763 BM_ATenCPUTensorAllocationSmall2 389 ns 389 ns 1857483 BM_ATenCPUTensorAllocationMedium1 425 ns 425 ns 1646284 BM_ATenCPUTensorAllocationMedium2 430 ns 430 ns 1561319 BM_ATenCPUTensorAllocationBig1 508 ns 508 ns 1309969 BM_ATenCPUTensorAllocationBig2 3799 ns 3799 ns 173674 ``` lstm benchmark: Before: ``` INFO:lstm_bench:Iter: 1 / 390. Entries Per Second: 0.7k. INFO:lstm_bench:Iter: 21 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 41 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 61 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 81 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 101 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 121 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 141 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 161 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 181 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 201 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 221 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 241 / 390. Entries Per Second: 0.7k. INFO:lstm_bench:Iter: 261 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 281 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 301 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 321 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 341 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 361 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 381 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Done. Total EPS excluding 1st iteration: 0.8k ``` After: ``` INFO:lstm_bench:Iter: 1 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 21 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 41 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 61 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 81 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 101 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 121 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 141 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 161 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 181 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 201 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 221 / 390. Entries Per Second: 0.7k. INFO:lstm_bench:Iter: 241 / 390. Entries Per Second: 0.7k. INFO:lstm_bench:Iter: 261 / 390. Entries Per Second: 0.7k. INFO:lstm_bench:Iter: 281 / 390. Entries Per Second: 0.7k. INFO:lstm_bench:Iter: 301 / 390. Entries Per Second: 0.7k. INFO:lstm_bench:Iter: 321 / 390. Entries Per Second: 0.7k. INFO:lstm_bench:Iter: 341 / 390. Entries Per Second: 0.7k. INFO:lstm_bench:Iter: 361 / 390. Entries Per Second: 0.7k. INFO:lstm_bench:Iter: 381 / 390. Entries Per Second: 0.7k. INFO:lstm_bench:Done. Total EPS excluding 1st iteration: 0.8k ``` Reviewed By: ezyang Differential Revision: D13202632 fbshipit-source-id: db6d2ec756ed15b0732b15396c82ad42302bb79d	2019-02-12 21:16:34 -08:00
Edward Yang	b9b0be7af2	Remove Legacy entry point. (#16721 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/16721 The very key line is we have to set the stream to the default stream before calling the allocator. This is very interesting. It shouldn't be necessary, but seemingly is! Reviewed By: dzhulgakov Differential Revision: D13943193 fbshipit-source-id: c21014917d9fe504fab0ad8abbc025787f559287	2019-02-08 09:33:58 -08:00
Edward Yang	5c982622b0	Delete duplicate copy of THCCachingAllocator (round two). (#16615 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/16615 This is another go at landing https://github.com/pytorch/pytorch/pull/16226 Now that the caching allocator is moved to c10_cuda, we can delete the duplicate copy from Caffe2. The difference between this and the previous PR is that this version faithfully maintains the binding code; in particular, we end up with a SECOND copy of the caching allocator in this patch. I verified that this code does NOT cause a crash in the workflow we canaried last time. In further diffs, I plan to eliminate the second copy, and then adjust the binding code. Reviewed By: dzhulgakov Differential Revision: D13901067 fbshipit-source-id: 66331fd4eadffd0a5defb3cea532d5cd07287872	2019-02-08 09:33:55 -08:00
Edward Yang	279238f0b8	Back out "Delete duplicate copy of THCCachingAllocator." (#16510 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/16510 This diff was supposed to be memory usage neutral, but based on some internal flows involving cuDNN, it was not. Reverting pending further investigation. Original commit changeset: 03f1ebf7f11c Reviewed By: xw285cornell Differential Revision: D13863610 fbshipit-source-id: 15517e255fd6b0c064b65fb99f0ef19742236cfd	2019-01-29 15:44:19 -08:00
Edward Yang	792cb774f1	Delete duplicate copy of THCCachingAllocator. (#16226 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/16226 Now that the caching allocator is moved to c10_cuda, we can delete the duplicate copy from Caffe2. Reviewed By: dzhulgakov, smessmer Differential Revision: D13762540 fbshipit-source-id: 03f1ebf7f11c68c19aa0d66110156fe228da6138	2019-01-24 12:06:57 -08:00
Dmytro Dzhulgakov	96ea2594d8	Don't call cudaStreamDestroy at destruction time (#15692 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/15692 It was leading to ocassional crashes with dynamically linked CUDA because runtime was already destroyed. Also, unique_ptr<T[]> is more suitable than deque<T> for the purpose. Reviewed By: Yangqing Differential Revision: D13571988 fbshipit-source-id: 37eb26dfbe361c49160367b53f87bd037c6c0e46	2019-01-11 12:36:41 -08:00
Ashwin Bharambe	5f0bff9639	Kill GPU memory logs in normal runs (#14838 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14838 The GPU memory tracking logs are incredibly annoying and merely serve to pollute output. I `VLOG(1)`ed them. Hopefully, this is non-controversial. Reviewed By: kuttas Differential Revision: D13343290 fbshipit-source-id: b3cae99346c97b66e97ea660061e15dc5c99b9fc	2018-12-06 13:51:14 -08:00
Edward Yang	8617b780cf	Replace use of 'int' with more descriptive 'DeviceIndex' or 'StreamId'. (#14282 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14282 This also is a substantive change, as 'DeviceIndex' and 'StreamId' are narrower types than 'int'. Reviewed By: Yangqing, smessmer Differential Revision: D13156471 fbshipit-source-id: 08aa0f70c4142415b6bd4d17c57da0641c1d0e9a	2018-11-29 18:33:21 -08:00
Dmytro Dzhulgakov	0cfbbceac3	Change Tensor::CopyFrom to a simple double dispatch (#14268 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14268 Removes the need for Context in Tensor by doing simple dispatch for CopyBytes. It'd eventually be subsumed by Roy Li's changes of proper copy_ op, but before that is done, let's get a clear logic of how copies are implemented and clean up some craft in CopyFrom implementation. Note, that with these changes, one can probably can get rid of Context::CopyFromCPU/CopyToCPU, but it's a matter for follow up diffs. This diff doesn't change the API of Tensor yet, but relies on the fact that passing `Context` to CopyFrom makes copy async if the device is CUDA and doesn't have any effect otherwise (that's how Context methods are implemented). This doesn't change semantics of copy async implementation - as before it blindly calls cudaMemcpyAsync which probably means that it can be misused if invoked separately outside of operator body. I'll leave it for the follow up copy_ unification. For Extend() we always do async copy - it makes sense as it's an in-place device-device operation and only any further op would be observable. Note: there are now three ways of invoking copy in C2 code - templated CopyBytes, virtual CopyFromCPU/etc, and double-dispatch free method here. Hopefully we can get rid of the second one. Also, please advise whether it's c10-worthy :) Reviewed By: ezyang Differential Revision: D13117987 fbshipit-source-id: a6772d6dcf3effaf06717da3a656fc9873b310b5	2018-11-28 15:45:37 -08:00
Yangqing Jia	7d5f7ed270	Using c10 namespace across caffe2. (#12714 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/12714 This is a short change to enable c10 namespace in caffe2. We did not enable it before due to gflags global variable confusion, but it should have been mostly cleaned now. Right now, the plan on record is that namespace caffe2 and namespace aten will fully be supersets of namespace c10. Most of the diff is codemod, and only two places of non-codemod is in caffe2/core/common.h, where ``` using namespace c10; ``` is added, and in Flags.h, where instead of creating aliasing variables in c10 namespace, we directly put it in the global namespace to match gflags (and same behavior if gflags is not being built with). Reviewed By: dzhulgakov Differential Revision: D10390486 fbshipit-source-id: 5e2df730e28e29a052f513bddc558d9f78a23b9b	2018-10-17 12:57:19 -07:00
Jerry Zhang	b89a3b50fb	Remove StaticContext (#12547 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/12547 Pull Request resolved: https://github.com/pytorch/pytorch/pull/12305 Remove StaticContext from context_base.h Reviewed By: dzhulgakov Differential Revision: D10073519 fbshipit-source-id: 350beec3c54365edef338318ce58229ccb825a98	2018-10-10 19:41:03 -07:00
Jerry Zhang	7724807551	Remove ExtractDeviceOption from StaticContext (#12304 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/12304 - make ExtractDeviceOption to be a free function. - Add a Strorage(at::Device) constructor in order to preserve the device_id. Reviewed By: dzhulgakov Differential Revision: D10069839 fbshipit-source-id: a5f3994a39bdf1b7503b39bb42c228e438b52bfa	2018-10-10 14:12:16 -07:00
Junjie Bai	f54ab540af	Rename cuda_gpu_id to device_id in DeviceOption (#12456 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/12456 codemod with 'Yes to all' codemod -d . --extensions h,cc,cpp,cu,py,proto,pbtxt,pb.txt,config cuda_gpu_id device_id Overload TextFormat::ParseFromString to do string replace when parsing from protobuf format Reviewed By: Yangqing Differential Revision: D10240535 fbshipit-source-id: 5e6992bec961214be8dbe26f16f5794154a22b25	2018-10-09 15:54:04 -07:00
Jerry Zhang	1c69d368e1	Remove New with Allocator Registry (#12111 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/12111 Setup allocator registry keyed by at::DeviceType, and remove New from StaticContext. Reviewed By: ezyang Differential Revision: D10022853 fbshipit-source-id: 3e88a181fe5df24f33f49b88be1f75284a185588	2018-10-09 10:53:52 -07:00
Yangqing Jia	38f3d1fc40	move flags to c10 (#12144 ) Summary: still influx. Pull Request resolved: https://github.com/pytorch/pytorch/pull/12144 Reviewed By: smessmer Differential Revision: D10140176 Pulled By: Yangqing fbshipit-source-id: 1a313abed022039333e3925d19f8b3ef2d95306c	2018-10-04 02:09:56 -07:00
Jerry Zhang	faab6ea922	Split Allocator (#12105 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/12105 Split CUDA/OpenCL/xxx Allocator from xxxStaticContext::New and rewrite it under at::Allocator interface. Reviewed By: dzhulgakov Differential Revision: D10001033 fbshipit-source-id: e1ffbc04c18d1dcb1f8d4ef2cbbb321967de5ccc	2018-10-03 19:10:10 -07:00
Jerry Zhang	74dc4460eb	New in StaticContext returns at::DataPtr (#12029 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/12029 In order to remove New() function in StaticContext(to remove StaticContext) and converge to the Allocator design, we'll first change the return type of New to at::DataPtr. Reviewed By: ezyang Differential Revision: D9889990 fbshipit-source-id: 3257c763530b987025f428741bdd2e089d11bad4	2018-10-03 19:10:07 -07:00
Junjie Bai	ff608a9ff3	Back out "Revert D10123245: Back out "codemod cuda_gpu_id to device_id"" (#12232 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/12232 Original commit changeset: fca91fea58b7 This adds proper modifications to the DeviceType <->DeviceOption conversion code added in D10033396 Reviewed By: jerryzh168 Differential Revision: D10132473 fbshipit-source-id: 801ef777e2950982cb47b48051b1471a0a91e64b	2018-10-01 21:54:52 -07:00
Rick Ratmansky	3010dc4208	Revert D10123245: Back out "codemod cuda_gpu_id to device_id" Differential Revision: D10123245 Original commit changeset: d83da8e00a12 fbshipit-source-id: fca91fea58b7df208edc2e218a1d514f9821ec7b	2018-10-01 12:22:36 -07:00
Yang Liu	7d7d336c45	Back out "codemod cuda_gpu_id to device_id" Summary: Original commit changeset: f5614a5d2607 D9986213 is causing Multifeed Aggregator a [huge performance different](https://our.intern.facebook.com/intern/ads/analyze_canary/412951953278781781/) and is blocking aggregator push since last Friday night: https://fburl.com/feedtools/b6izvwjz We need to land this revert ASAP to unblock aggregator push. Reviewed By: orionr Differential Revision: D10123245 fbshipit-source-id: d83da8e00a1250f5d09811a0a587c127e377aab2	2018-10-01 11:31:14 -07:00
Jerry Zhang	006171fffc	Back out "[pytorch][PR] Revert "Move CreateContext to global registry (#11688 )"" (#12121 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/12121 Pull Request resolved: https://github.com/pytorch/pytorch/pull/12055 Original commit changeset: 6ca9de65b707 Reviewed By: ezyang Differential Revision: D10033396 fbshipit-source-id: ca9f4b2f7ef0561f619b833415d394a8b9972bf4	2018-10-01 11:10:46 -07:00
Junjie Bai	3eb5940cf5	codemod cuda_gpu_id to device_id (#12022 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/12022 codemod -d . --extensions h,cc,cpp,cu,py,proto,pbtxt,pb.txt,config cuda_gpu_id device_id codemod with 'Yes to all' Reviewed By: orionr Differential Revision: D9986213 fbshipit-source-id: f5614a5d26078817aee8caf79a494abfd6a95ff1	2018-09-27 20:24:53 -07:00
Edward Yang	d7e11e3aae	Revert "Move CreateContext to global registry (#11688 )" (#12049 ) Summary: This reverts commit `3ae6ee4ebd`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/12049 Differential Revision: D10030954 Pulled By: ezyang fbshipit-source-id: 6ca9de65b707c5b4c68280fc6f1b8e5ad7251efc	2018-09-25 10:13:43 -07:00
Jerry Zhang	3ae6ee4ebd	Move CreateContext to global registry (#11688 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/11688 As a first step to remove static context(merge with allocator), we'll create a global registries for context constructors, and remove CreateContext function from tensor. Reviewed By: ezyang, dzhulgakov Differential Revision: D9779821 fbshipit-source-id: 8b239ea50af7a0556fde2382f58f79194f0e3dc1	2018-09-24 17:07:50 -07:00
Jerry Zhang	9f4bcdf075	caffe2::DeviceType -> at::DeviceType (#11254 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/11254 Previously we use DeviceType in caffe2.proto directly, but it's an `enum` and have implicit conversion to int, which does not have type safety, e.g. we have to explicitly check for a device type is valid in event.h: ``` template <int d> struct EventCreateFunctionRegisterer { explicit EventCreateFunctionRegisterer(EventCreateFunction f) { static_assert(d < MaxDeviceTypes, ""); Event::event_creator_[d] = f; } }; ``` at::DeviceType is an `enum class`, and it does not have implicit conversion to int, and provides better type safety guarantees. In this diff we have done the following refactor(taking CPU as an example): 1. caffe2::DeviceType → caffe2::DeviceTypeProto 2. caffe2::CPU → caffe2::PROTO_CPU 3. caffe2::DeviceType = at::DeviceType 4. caffe2::CPU = at::DeviceType::CPU codemod -d caffe2/caffe2 --extensions h,cc,cpp 'device_type\(\), ' 'device_type(), PROTO_' + some manual changes In short, after this diff, in c++, caffe2::CPU refers to the at::DeviceType::CPU and the old proto caffe2::CPU will be caffe2::PROTO_CPU. In python side, we have a temporary workaround that alias `caffe2_pb2.CPU = caffe2_pb2.PROOT_CPU` to make the change easier to review and this will be removed later. Reviewed By: ezyang Differential Revision: D9545704 fbshipit-source-id: 461a28a4ca74e616d3ee183a607078a717fd38a7	2018-09-05 16:28:09 -07:00
Orion Reblitz-Richardson	6508db7421	Remove BUILD_CAFFE2 and build everything (#8338 ) Summary: This completely removes BUILD_CAFFE2 from CMake. There is still a little bit of "full build" stuff in setup.py that enables USE_CUDNN and BUILD_PYTHON, but otherwise everything should be enabled for PyTorch as well as Caffe2. This gets us a lot closer to full unification. cc mingzhe09088, pjh5, ezyang, smessmer, Yangqing Pull Request resolved: https://github.com/pytorch/pytorch/pull/8338 Reviewed By: mingzhe09088 Differential Revision: D9600513 Pulled By: orionr fbshipit-source-id: 9f6ca49df35b920d3439dcec56e7b26ad4768b7d	2018-08-31 13:10:24 -07:00
root	c3fe071483	Update hip files (#9826 ) Summary: The goal of this PR is to update the hip files to reflect relevant changes in cuda source files. Pull Request resolved: https://github.com/pytorch/pytorch/pull/9826 Differential Revision: D9032840 Pulled By: bddppq fbshipit-source-id: 504e55c46308eebfee3c9a7beea1f294fe03470f	2018-07-27 16:54:39 -07:00
Jerry Zhang	aebf3b47ae	Remove template parameter from Tensor (#9939 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/9939 Pull Request resolved: https://github.com/facebookresearch/weakly-supervised-action-detection/pull/13 Pull Request resolved: https://github.com/pytorch/translate/pull/166 Pull Request resolved: https://github.com/pytorch/pytorch/pull/9125 Closes https://github.com/pytorch/pytorch/pull/9125 Use inheritance for polymorphism, and remove template parameter This is to change the templating in call sites, the core implementations will change later Before Caffe2 Tensor class was compile-time fixed to bind to a particular device/context. With this change, we're making it a runtime property (stored inside the tensor), but preserve the same semantics. For example, one has to specify device type in order to create a Tensor - there are no uninitialized tensors. More specifically the changes are: 1. We added an extra argument DeviceType to most of the constructors of the tensor, e.g. (Tensor(DeviceType type)), 2. Semantics of constructor Tensor(const Tensor<SrcContext>& src, ContextForCopy* context); is changed, in this constructor, the second context is passed in to enable us to call the templated Copy function, it could be in a different context as source and target previously, now we'll enforce that the context should have same device type as src, if it is provided. 3. To preserve 'get-or-construct' semantics of Blob, we added specialized getter Blob::GetMutableTensor that verifies both that Blob contains a Tensor and that it's of a correct type 4. Specifically, Tensor type is not default-constructible any more (as we don't have unknown device tensors) and thus some of the code handling STL containers needs to change Note: Some changes are postponed just to keep this diff a bit smaller. Please see `TODO`s. Reviewed By: ezyang, houseroad Differential Revision: D9024330 fbshipit-source-id: e0b8295d2dc6ebe2963383ded5af799ad17164ba	2018-07-27 10:56:39 -07:00
Jerry Zhang	969b62f276	Revert D8121878: Remove template parameter from Tensor Differential Revision: D8121878 Original commit changeset: 4a5e9a677ba4 fbshipit-source-id: d8e2c0bb145b52fbcca323b22d1d3346f0b3249e	2018-07-26 14:02:04 -07:00
Jerry Zhang	cd5adc7b5f	Remove template parameter from Tensor (#13 ) Summary: Pull Request resolved: https://github.com/facebookresearch/weakly-supervised-action-detection/pull/13 Pull Request resolved: https://github.com/pytorch/translate/pull/166 Pull Request resolved: https://github.com/pytorch/pytorch/pull/9125 Closes https://github.com/pytorch/pytorch/pull/9125 Use inheritance for polymorphism, and remove template parameter This is to change the templating in call sites, the core implementations will change later Before Caffe2 Tensor class was compile-time fixed to bind to a particular device/context. With this change, we're making it a runtime property (stored inside the tensor), but preserve the same semantics. For example, one has to specify device type in order to create a Tensor - there are no uninitialized tensors. More specifically the changes are: 1. We added an extra argument DeviceType to most of the constructors of the tensor, e.g. (Tensor(DeviceType type)), 2. Semantics of constructor Tensor(const Tensor<SrcContext>& src, ContextForCopy* context); is changed, in this constructor, the second context is passed in to enable us to call the templated Copy function, it could be in a different context as source and target previously, now we'll enforce that the context should have same device type as src, if it is provided. 3. To preserve 'get-or-construct' semantics of Blob, we added specialized getter Blob::GetMutableTensor that verifies both that Blob contains a Tensor and that it's of a correct type 4. Specifically, Tensor type is not default-constructible any more (as we don't have unknown device tensors) and thus some of the code handling STL containers needs to change Note: Some changes are postponed just to keep this diff a bit smaller. Please see `TODO`s. Reviewed By: xw285cornell Differential Revision: D8121878 fbshipit-source-id: 4a5e9a677ba4ac82095df959851a054c81eccf81	2018-07-26 10:25:23 -07:00
Sebastian Messmer	d2610fb379	Constexpr Type Ids -> 6.5% caffe2 perf improvement (#9603 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/9603 Using constexpr for some heavily queried type ids gives us a 6.5% perf improvement for caffe2. Benchmark results: P59829647 Also ad canaries (but they don't show a significant difference): - adfinder: - https://our.intern.facebook.com/intern/ads/canary/411346509423301481 - https://our.intern.facebook.com/intern/ads/canary/411346563021753557 - adindexer: - https://our.intern.facebook.com/intern/ads/canary/411346517006038367 - https://our.intern.facebook.com/intern/ads/canary/411346571387258927 - multifeed_predictor: - https://our.intern.facebook.com/intern/ads/canary/411346526631282941 - https://our.intern.facebook.com/intern/ads/canary/411346583141009531 Reviewed By: dzhulgakov Differential Revision: D8841577 fbshipit-source-id: 1a0ce7f2bee1ae54b723caefe5bc7f85a20935b4	2018-07-24 17:24:55 -07:00
Lin Li	0fe980c748	Memory usage measurement -- Caffe2 (#9017 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/9017 Closes https://github.com/pytorch/pytorch/pull/9017 Added "get_blob_size_bytes" to "pybind_state.cc" in Caffe2 to expose the size of blob in bytes. Reviewed By: kuttas Differential Revision: D8685696 fbshipit-source-id: 9a9d38f207c8c59ef534217181e8ce1514617628	2018-07-17 16:40:23 -07:00
Yangqing Jia	03f7289fcf	Add CAFFE2_USE_CUDNN guard on context_gpu.cu (#8657 )	2018-06-19 13:49:06 -07:00
Yangqing Jia	20c516ac18	[cmake] Make cudnn optional (#8265 ) * Make cudnn optional * Remove cudnn file from cpu file	2018-06-08 02:04:27 -07:00
Orion Reblitz-Richardson	6223bfdb1d	Update from Facebook (#6692 ) * [GanH][Easy]: Add assertion to adaptive weighting layer 0 weight causes numeric instability and exploding ne * [Easy] Add cast op before computing norm in diagnose options As LpNorm only takes floats we add a manual casting here. * Introduce a new caching device allocator `cudaMalloc` and `cudaFree` calls are slow, and become slower the more GPUs there are. Essentially, they grab a host-wide (not device-wide) lock because GPU memory is transparently shared across all GPUs. Normally, this isn't much of a concern since workloads allocate memory upfront, and reuse it during later computation. However, under some computation models (specifically, memory conserving approaches like checkpoint-and-recompute, see https://medium.com/@yaroslavvb/fitting-larger-networks-into-memory-583e3c758ff9) this assumption is no longer true. In these situations, `cudaMalloc` and `cudaFree` are common and frequent. Furthermore, in data parallel contexts, these calls happen at nearly the same time from all GPUs worsening lock contention. A common solution to this problem is to add a custom allocator. In fact, nVIDIA provides one out of the box: CUB, which Caffe2 already supports. Unfortunately, the CUB allocator suffers from very high fragmentation. This is primarily because it is a "buddy" allocator which neither splits nor merges free cached blocks. Study https://github.com/NVlabs/cub/blob/1.8.0/cub/util_allocator.cuh#L357 if you want to convince yourself. This diff adapts a caching allocator from the Torch codebase https://github.com/torch/cutorch/blob/master/lib/THC/THCCachingAllocator.cpp which does splitting and merging and ends up working really well, at least for workloads like the checkpoint-and-recompute computation models noted above. I simplified the implementation a little bit, made it a bit more C++-like. I also removed a bunch of stream synchronization primitives for this diff. I plan to add them back in subsequent diffs. * Report reader progress in fblearner workflows Integrate with fblearner progress reporting API and add support to report training progress from reader nodes. If reader is constructed with batch limits, report based on finished batch vs total batch. The finished batch may be more than total batch because we evaludate if we should stop processing everytime we dequeue a split. If no limit for the reader, report based on finished splits (Hive files) vs total splits. This is fairly accurate. * [GanH][Diagnose]: fix plotting 1. ganh diagnose needs to set plot options 2. modifier's blob name is used for metric field can need to be fixed before generating net * Automatic update of fbcode/onnx to 985af3f5a0f7e7d29bc0ee6b13047e7ead9c90c8 * Make CompositeReader stops as soon as one reader finishes Previously, CompositeReader calls all readers before stopping. It results in flaky test since the last batch may be read by different threads; resulting in dropped data. * [dper] make sure loss is not nan as desc. * [rosetta2] [mobile-vision] Option to export NHWC order for RoIWarp/RoIAlign Thanks for finding this @stzpz and @wangyanghan. Looks like NHWC is more optimized. For OCR though it doesn't yet help since NHWC uses more mem b/w but will soon become important. * Intra-op parallel FC operator Intra-op parallel FC operator * [C2 Proto] extra info in device option passing extra information in device option design doc: https://fb.quip.com/yAiuAXkRXZGx * Unregister MKL fallbacks for NCHW conversions * Tracing for more executors Modified Tracer to work with other executors and add more tracing * Remove ShiftActivationDevices() * Check for blob entry iff it is present When processing the placeholders ops, ignore if the blob is not present in the blob_to_device. * Internalize use of eigen tensor Move use of eigen tensor out of the header file so we don't get template partial specialization errors when building other libraries. * feature importance for transformed features. * - Fix unused parameter warnings The changes in this diff comments out unused parameters. This will allow us to enable -Wunused-parameter as error. #accept2ship * add opencv dependencies to caffe2 The video input op requires additional opencv packages. This is to add them to cmake so that it can build * Add clip_by_value option in gradient clipping Add clip_by_value option in gradient clipping when the value is bigger than max or smaller than min, do the clip * std::round compat	2018-04-17 23:36:40 -07:00
Orion Reblitz-Richardson	1d5780d42c	Remove Apache headers from source. * LICENSE file contains details, so removing from individual source files.	2018-03-27 13:10:18 -07:00
Yangqing Jia	ced2c7e2b2	Remove Set/GetDefaultGPUID and move to use current gpu id instead. Summary: Reason for this change: (1) Setting/Getting default gpu id doesn't seem to be used at all. (2) It actually is confusing compared to the CUDA_VISIBLE_DEVICES options etc. (3) When setting cuda_gpu_id=-1 in the CUDAContext arg, it used to use the default gpu id but probably we should use the current gpu - so that the caller will be able to control the device placement. One use case is for TensorRT - if we have a custom callback layer, then it would be easier for TRT or whatever caller to set the running device. Reviewed By: dzhulgakov Differential Revision: D6740357 fbshipit-source-id: 2ea710e434b10220d5a198e31c93847304636863	2018-01-19 18:03:21 -08:00
Yangqing Jia	59b2654544	reapply header change after xplat move Summary: This is a reapplication of the earlier PR due to xplat move. Original author is Christoph Conrads <christoph.conrads@fluent.ai> christoph-conrads . Reviewed By: houseroad Differential Revision: D6379736 fbshipit-source-id: b7482ecf3b9487a528c15e92976e915791210002	2017-11-22 13:04:37 -08:00
Yangqing Jia	8286ce1e3a	Re-license to Apache Summary: Closes https://github.com/caffe2/caffe2/pull/1260 Differential Revision: D5906739 Pulled By: Yangqing fbshipit-source-id: e482ba9ba60b5337d9165f28f7ec68d4518a0902	2017-09-28 16:22:00 -07:00
Ahmed Taei	c3a3d6ceba	Add an option to use dynamic memory optimizer. Reviewed By: akyrola Differential Revision: D5869664 fbshipit-source-id: ab11bc27395bf10e8381ebf97e6afb83ae9af81f	2017-09-20 12:52:55 -07:00
Aapo Kyrola	60c17a34b5	better default settings for CUB Summary: The default CUB settings led to very slow execution in practice when using "dynamic" memory allocation with C2 (i.e freeing blobs after their use). After some tinkering, I arrived to these numbers, that work with resnet-50 and NVIDIA M40 GPU much better than the origianal defaults. Also made the maximum allocated memory configurable. Reviewed By: Yangqing Differential Revision: D5732930 fbshipit-source-id: 9ff34f49d5a3eb138bc6f44c82918731a35325a6	2017-08-29 19:11:08 -07:00
Yangqing Jia	26f0943130	Do CaffeCudaSetDevice and CaffeCudaGetDevice Summary: These are wrapper functions so that if we run in a Caffe2-only mode, we can turn the flag on and get some small speedup on cuda device switches. The purpose of the diff is to allow us to quickly assess the overhead of cuda device switch functions. Ideally, the caching behavior shall live in the cuda driver, which is the only safe place to ensure correctness. If other code is running aside Caffe2 and does not properly do device guard, this functionality will fail as separate cudaSetDevice() calls will not update Caffe2's thread local device id. As a result, the functionality is only enabled when/if one explicitly sets the flag. This might not be safe, so use with caution. - cudaGetDevice can go from 90ns to 2ns - when setting the same device, we can go from 100ns to 2 ns - when setting a different device, things are the same (1ns overhead on top of 143ns) Reviewed By: azzolini Differential Revision: D5709398 fbshipit-source-id: 6255f17a3d41f59a30327436383f306a2287896e	2017-08-25 18:20:14 -07:00
Yangqing Jia	578adbe9c0	Adios CNMEM. You will be remembered. Summary: As part of the cuda 9 move we have decided to deprecate the cnmem path as it seems to be superceded by cub if one needs a memory pool. Closes https://github.com/caffe2/caffe2/pull/1104 Differential Revision: D5647672 Pulled By: Yangqing fbshipit-source-id: 988af5bf63e24efa1b631fd91ddb58e798ffc5c6	2017-08-17 00:05:57 -07:00
Aapo Kyrola	07e745408b	revert D5528436 Summary: Users are reporting CUDA illegal access errors happening on some configurations after D5528436 introduced lazy peer connections. Will debug later, but this diff is to revert that change. Reviewed By: pietern Differential Revision: D5581673 fbshipit-source-id: ef8e367160a38fc62434d6f5905892db274d9f06	2017-08-07 23:07:50 -07:00
Aapo Kyrola	3628cd30f0	initialize peer access only when (might be) needed Summary: Currently Caffe2 enables peer access between all 8 gpus, even if only 1 gpu would be used. This adds several seconds to the startup time, but also takes a lot of memory (110 MB per GPU). This diff makes the peer access initialization "lazy". When GPU X is first used, pairwise peer access is set between GPUs 0 to X-1 with X. A lookup table is used to ensure no double peer access initialization. Reviewed By: pietern Differential Revision: D5528436 fbshipit-source-id: 8f3c2c8154291a7d3a99ee2882e4834ef5e38b66	2017-08-03 09:24:14 -07:00
Dmytro Dzhulgakov	0d833590c1	Change Allocator interface to return deleter Summary: This is in preparation for adding huge pages. There we want to remember for the pointer how we got it - via mmap() or alloc(). One option is to store gigantic map of void* -> destructor, but luckily usages of Context::New are all inside Tensor which already uses shared_ptr with custom deleter. This diff could have used unique_ptr as the return type but then it's easy to accidentally call release() and loose the deleter. Thus going with std::pair<void*, MemoryDeleter> to be explicit. Also, now CPUAllocator can be effectively changed to std::function. Haven't done it yet, but can do if necessary. Let me know whether it's a bad idea to proceed like this. Reviewed By: Yangqing Differential Revision: D5429830 fbshipit-source-id: 8382ab7b81592d51272056c05c122894bb203827	2017-07-17 15:26:27 -07:00

1 2

68 Commits