pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Mark Richardson	8bce88d9de	[caffe2] dont call cudnnDestroy on thread exit (crashes on windows with cuda 11/12) (#95382 ) Summary: My team has been hitting a mysterious crash for a few months on a windows binary that uses Caffe2 inside a worker thread. When this thread gets destroyed, there is an error at this line in context_gpu.h where the state of this operation gives CUDNN_STATUS_INTERNAL_ERROR instead of CUDNN_STATUS_SUCCESS. When enabling cudnn debug logs (via the env variables nvidia specifies), I can see that the context is destroyed twice, even though this code only destroys it once, so something mysterious is causing a double free. This seems very very similar to the issue/fix described here for pytorch: https://github.com/pytorch/pytorch/issues/17658 https://github.com/apache/tvm/pull/8267 And pytorch handles this in the same way, by just not calling cudnnDestroy This seems to have become an issue with cuda11, but I tested cuda12 as well and found that the issue persists so this needs to be somehow fixed. Test Plan: CI I checked that the specific windows binary I am using is able to create and drestroy caffe2-invoking threads without causing the application to crash. buck run arvr/mode/win/cuda11/opt //arvr/projects/nimble/prod/tools/MonoHandTrackingVis Differential Revision: D43538017 Pull Request resolved: https://github.com/pytorch/pytorch/pull/95382 Approved by: https://github.com/malfet	2023-03-10 06:42:51 +00:00
Richard Barnes	6f749fd171	Fixes to DSA infra (#91835 ) Differential Revision: D42397325 Pull Request resolved: https://github.com/pytorch/pytorch/pull/91835 Approved by: https://github.com/soumith	2023-01-12 21:54:26 +00:00
Richard Barnes	7a3afe61d2	Check all CUDA API calls for errors in caffe2/ (#81816 ) Test Plan: Sandcastle Differential Revision: D35194868 Pull Request resolved: https://github.com/pytorch/pytorch/pull/81816 Approved by: https://github.com/ezyang	2022-10-28 00:41:06 +00:00
Will Constable	4f34cd6d1e	Replace all CHECK_ and DCHECK_ with TORCH_* macros (#82032 ) Avoid exposing defines that conflict with google logging, since this blocks external usage of libtorch in certain cases. All the 'interesting' changes should be in these two files, and the rest should just be mechanical changes via sed. c10/util/logging_is_not_google_glog.h c10/util/logging_is_google_glog.h Fixes https://github.com/pytorch/pytorch/issues/81415 cc @miladm @malfet Pull Request resolved: https://github.com/pytorch/pytorch/pull/82032 Approved by: https://github.com/soumith, https://github.com/miladm	2022-07-26 01:20:44 +00:00
Jeff Daily	b7391f44df	cast return of cudaGetLastError() to void when discarding (#62518 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/62511. Pull Request resolved: https://github.com/pytorch/pytorch/pull/62518 Reviewed By: walterddr, janeyx99 Differential Revision: D30029858 Pulled By: malfet fbshipit-source-id: d47ce4e507ac800b4e5a5e0a8d9a6fabdfd28e6d	2021-08-03 11:17:22 -07:00
Jeff Daily	15210f3b82	ignore and clear not ready errors (#61554 ) Summary: Follow-up to https://github.com/pytorch/pytorch/issues/18584. This PR covers the remaining places where event or stream query might result in not ready errors. Pull Request resolved: https://github.com/pytorch/pytorch/pull/61554 Reviewed By: mrshenli Differential Revision: D29763973 Pulled By: ezyang fbshipit-source-id: 41d988d1826b2309cc6b01a81144094b353abdf9	2021-07-19 16:03:04 -07:00
Edward Yang	778f9eab6c	Don't switch streams when running Caffe2 ops from c10. (#55121 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55121 This is done by allow -1 as a stream ID, meaning "don't change the stream", in SwitchToDevice Fixes #54830 Signed-off-by: Edward Z. Yang <ezyang@fb.com> Test Plan: Imported from OSS Reviewed By: malfet Differential Revision: D27527544 Pulled By: ezyang fbshipit-source-id: c54983d6fc79a8fa1c65a71559a57425e40ba717	2021-04-08 13:21:11 -07:00
Basil Hosmer	f05b66b70d	pass TypeMeta by value (#45026 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45026 Test Plan: Imported from OSS Reviewed By: ezyang Differential Revision: D23802943 Pulled By: bhosmer fbshipit-source-id: 81b06ef00bf8eb4375c0e0ff2032e03bd1d1188a	2020-10-30 10:14:17 -07:00
Nikita Shulga	a3de359464	Do not throw from CUDAContext destructor (#34756 ) Summary: Throwing from destructor leads to undefined behaviour (most often to segault) So it's better to leak memory then segault Pull Request resolved: https://github.com/pytorch/pytorch/pull/34756 Test Plan: Run `test_pytorch_onnx_caffe2` Differential Revision: D20504228 Pulled By: malfet fbshipit-source-id: 7a05776fea9036f602e95b8182f8493cb5886dab	2020-03-18 00:13:18 -07:00
Brian Wignall	e7fe64f6a6	Fix typos (#30606 ) Summary: Should be non-semantic. Uses https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines to find likely typos. Pull Request resolved: https://github.com/pytorch/pytorch/pull/30606 Differential Revision: D18763028 Pulled By: mrshenli fbshipit-source-id: 896515a2156d062653408852e6c04b429fc5955c	2019-12-02 20:17:42 -08:00
Edward Yang	1e6acc676f	Replace caffe2::DeviceGuard with c10::cuda::CUDAGuard (#17623 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/17623 Despite it's generic sounding name, caffe2::DeviceGuard actually only worked on CUDA devices. Rename it to something that more clearly spells out its applicability. I'm not sure if it's the right call, but in this patch I added 'using CUDAGuard = c10::cuda::CUDAGuard', as this seems to be more in-line with how the Caffe2 codebase is currently written. More idiomatic c10 namespace style would be to say cuda::CUDAGuard. Willing to change this if people shout. This is a respin of D13156470 (#14284) Reviewed By: dzhulgakov Differential Revision: D14285504 fbshipit-source-id: 93b8ab938b064572b3b010c307e1261fde0fff3d	2019-03-06 10:48:15 -08:00
Dmytro Dzhulgakov	51dd2000cd	unify c2 and TH allocator (#16892 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/16892 Replaces https://github.com/pytorch/pytorch/pull/14517 Merged caffe2 and TH CPU Allocators. Mostly using the code from caffe2 allocators. `memset` of caffe2 allocator is gone now. These two allocators should be almost the same. Baseline: ``` Running ./tensor_allocation Run on (48 X 2501 MHz CPU s) CPU Caches: L1 Data 32K (x24) L1 Instruction 32K (x24) L2 Unified 256K (x24) L3 Unified 30720K (x2) ------------------------------------------------------------------------- Benchmark Time CPU Iterations ------------------------------------------------------------------------- BM_MakeStorageImpl 148 ns 148 ns 4676594 BM_StorageImplCtor 54 ns 54 ns 12957810 BM_MallocStorageImpl 62 ns 62 ns 11254745 BM_TensorImplCtor 22 ns 22 ns 31939472 BM_MallocTensorImpl 105 ns 105 ns 6505661 BM_Malloc_1 43 ns 43 ns 16464905 BM_MakeTensorFromStorage 126 ns 126 ns 5586116 BM_MakeVariableFromTensor 236 ns 236 ns 2995528 BM_ATenCPUTensorAllocationSmall1 319 ns 319 ns 2268884 BM_ATenCPUTensorAllocationSmall2 318 ns 318 ns 2163332 BM_ATenCPUTensorAllocationMedium1 403 ns 403 ns 1663228 BM_ATenCPUTensorAllocationMedium2 448 ns 448 ns 1595004 BM_ATenCPUTensorAllocationBig1 532 ns 532 ns 1352634 BM_ATenCPUTensorAllocationBig2 4486 ns 4486 ns 160978 ``` Changed: ``` Running ./tensor_allocation Run on (48 X 2501 MHz CPU s) CPU Caches: L1 Data 32K (x24) L1 Instruction 32K (x24) L2 Unified 256K (x24) L3 Unified 30720K (x2) ------------------------------------------------------------------------- Benchmark Time CPU Iterations ------------------------------------------------------------------------- BM_MakeStorageImpl 141 ns 141 ns 4803576 BM_StorageImplCtor 55 ns 55 ns 13129391 BM_MallocStorageImpl 64 ns 64 ns 11088143 BM_TensorImplCtor 23 ns 23 ns 31616273 BM_MallocTensorImpl 101 ns 101 ns 7017585 BM_Malloc_1 39 ns 39 ns 18523954 BM_MakeTensorFromStorage 118 ns 118 ns 5877919 BM_MakeVariableFromTensor 452 ns 452 ns 1565722 BM_ATenCPUTensorAllocationSmall1 384 ns 384 ns 1819763 BM_ATenCPUTensorAllocationSmall2 389 ns 389 ns 1857483 BM_ATenCPUTensorAllocationMedium1 425 ns 425 ns 1646284 BM_ATenCPUTensorAllocationMedium2 430 ns 430 ns 1561319 BM_ATenCPUTensorAllocationBig1 508 ns 508 ns 1309969 BM_ATenCPUTensorAllocationBig2 3799 ns 3799 ns 173674 ``` lstm benchmark: Before: ``` INFO:lstm_bench:Iter: 1 / 390. Entries Per Second: 0.7k. INFO:lstm_bench:Iter: 21 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 41 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 61 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 81 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 101 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 121 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 141 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 161 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 181 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 201 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 221 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 241 / 390. Entries Per Second: 0.7k. INFO:lstm_bench:Iter: 261 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 281 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 301 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 321 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 341 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 361 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 381 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Done. Total EPS excluding 1st iteration: 0.8k ``` After: ``` INFO:lstm_bench:Iter: 1 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 21 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 41 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 61 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 81 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 101 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 121 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 141 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 161 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 181 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 201 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 221 / 390. Entries Per Second: 0.7k. INFO:lstm_bench:Iter: 241 / 390. Entries Per Second: 0.7k. INFO:lstm_bench:Iter: 261 / 390. Entries Per Second: 0.7k. INFO:lstm_bench:Iter: 281 / 390. Entries Per Second: 0.7k. INFO:lstm_bench:Iter: 301 / 390. Entries Per Second: 0.7k. INFO:lstm_bench:Iter: 321 / 390. Entries Per Second: 0.7k. INFO:lstm_bench:Iter: 341 / 390. Entries Per Second: 0.7k. INFO:lstm_bench:Iter: 361 / 390. Entries Per Second: 0.7k. INFO:lstm_bench:Iter: 381 / 390. Entries Per Second: 0.7k. INFO:lstm_bench:Done. Total EPS excluding 1st iteration: 0.8k ``` Reviewed By: ezyang Differential Revision: D13202632 fbshipit-source-id: db6d2ec756ed15b0732b15396c82ad42302bb79d	2019-02-12 21:16:34 -08:00
Xiaodong Wang	af0c79eed4	Catch cudaError_t return val (nodiscard in rocm) (#16399 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/16399 Catching cudaError_t return values in a few places, because it's nodiscard in rocm. Unless we add -Wno-unused-result, it'll end up with a compilation error. Also in c10/cuda/test, check whether a host has GPU or not. We were silently throwing out the error before (so not really testing the cuda api). Reviewed By: bddppq Differential Revision: D13828281 fbshipit-source-id: 587d1cc31c20b836ce9594e3c18f067d322b2934	2019-02-11 13:18:36 -08:00
Edward Yang	b9b0be7af2	Remove Legacy entry point. (#16721 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/16721 The very key line is we have to set the stream to the default stream before calling the allocator. This is very interesting. It shouldn't be necessary, but seemingly is! Reviewed By: dzhulgakov Differential Revision: D13943193 fbshipit-source-id: c21014917d9fe504fab0ad8abbc025787f559287	2019-02-08 09:33:58 -08:00
Dmytro Dzhulgakov	96ea2594d8	Don't call cudaStreamDestroy at destruction time (#15692 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/15692 It was leading to ocassional crashes with dynamically linked CUDA because runtime was already destroyed. Also, unique_ptr<T[]> is more suitable than deque<T> for the purpose. Reviewed By: Yangqing Differential Revision: D13571988 fbshipit-source-id: 37eb26dfbe361c49160367b53f87bd037c6c0e46	2019-01-11 12:36:41 -08:00
Sebastian Messmer	d408324350	Move files to/from c10/core and c10/util (#15316 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/15316 This starts cleaning up the files in c10 according to the module structure we decided on. Move to c10/util: - Half.h, Half-inl.h, Half.cpp, bitcasts.h Move to c10/core: - Device.h, Device.cpp - DeviceType.h, DeviceType.cpp i-am-not-moving-c2-to-c10 Reviewed By: dzhulgakov Differential Revision: D13498493 fbshipit-source-id: dfcf1c490474a12ab950c72ca686b8ad86428f63	2019-01-10 16:22:22 -08:00
Edward Yang	26b04523b1	Record Caffe2's current stream ID in c10_cuda. (#15174 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/15174 Previously, Caffe2 maintained a separate per-thread per-device current logical CUDA stream ID. In this PR, we switch Caffe2 over to using c10::Stream to manage the current stream, and also manage the allocation of cudaStream_t objects. This results in a slight behavior change: previously, Caffe2 would have been willing to allocate an arbitrary number of CUDA streams, depending on how high the logical stream IDs went. The c10::Stream pool has a fixed number of streams, once you exceed it, it wraps around. Reviewed By: dzhulgakov Differential Revision: D13451550 fbshipit-source-id: da6cf33ee026932a2d873835f6e090f7b8a7d8dc	2018-12-20 21:54:05 -08:00
Edward Yang	f4c59c5fdf	Replace SwitchToDevice(0) with SwitchToDevice() (#15126 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/15126 I want to make people stop manufacturing StreamId from thin air, and a first step is to make people use the default stream. Reviewed By: dzhulgakov Differential Revision: D13432922 fbshipit-source-id: 9f0d8d70646c50d979bde5ba3c3addeebac48a3d	2018-12-17 15:15:00 -08:00
Edward Yang	8617b780cf	Replace use of 'int' with more descriptive 'DeviceIndex' or 'StreamId'. (#14282 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14282 This also is a substantive change, as 'DeviceIndex' and 'StreamId' are narrower types than 'int'. Reviewed By: Yangqing, smessmer Differential Revision: D13156471 fbshipit-source-id: 08aa0f70c4142415b6bd4d17c57da0641c1d0e9a	2018-11-29 18:33:21 -08:00
Dmytro Dzhulgakov	0cfbbceac3	Change Tensor::CopyFrom to a simple double dispatch (#14268 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14268 Removes the need for Context in Tensor by doing simple dispatch for CopyBytes. It'd eventually be subsumed by Roy Li's changes of proper copy_ op, but before that is done, let's get a clear logic of how copies are implemented and clean up some craft in CopyFrom implementation. Note, that with these changes, one can probably can get rid of Context::CopyFromCPU/CopyToCPU, but it's a matter for follow up diffs. This diff doesn't change the API of Tensor yet, but relies on the fact that passing `Context` to CopyFrom makes copy async if the device is CUDA and doesn't have any effect otherwise (that's how Context methods are implemented). This doesn't change semantics of copy async implementation - as before it blindly calls cudaMemcpyAsync which probably means that it can be misused if invoked separately outside of operator body. I'll leave it for the follow up copy_ unification. For Extend() we always do async copy - it makes sense as it's an in-place device-device operation and only any further op would be observable. Note: there are now three ways of invoking copy in C2 code - templated CopyBytes, virtual CopyFromCPU/etc, and double-dispatch free method here. Hopefully we can get rid of the second one. Also, please advise whether it's c10-worthy :) Reviewed By: ezyang Differential Revision: D13117987 fbshipit-source-id: a6772d6dcf3effaf06717da3a656fc9873b310b5	2018-11-28 15:45:37 -08:00
Dmytro Dzhulgakov	f72f91610f	Move stream to thread local (#13080 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/13080 This is the first step to untangle this logic: - moves stream id to thread local mechanically - relies on the fact that the value of thread local is valid in conjunction with CUDAContext only until the next SwitchToDevice is called - we should move to proper RAII in the following diffs Follow up diffs are going to move more stuff outside of CUDAContext (by making gpu_id thread local too) and simplify the CopyFrom. The only expected change in behavior is that before CopyFrom would do copy on stream logical id 0 if the context was created on the fly and now it'd do so on the current stream. Since it'd block explicitly, I don't think it matters much. Also, observers were semi-broken by waiting on the potentially wrong stream. It can be fixed later - I renamed the method to avoid abuse. Reviewed By: ezyang Differential Revision: D10525134 fbshipit-source-id: 5d495a21490bebe060a76389f1b47bdf12cbc59e	2018-10-26 00:04:32 -07:00
Jerry Zhang	b89a3b50fb	Remove StaticContext (#12547 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/12547 Pull Request resolved: https://github.com/pytorch/pytorch/pull/12305 Remove StaticContext from context_base.h Reviewed By: dzhulgakov Differential Revision: D10073519 fbshipit-source-id: 350beec3c54365edef338318ce58229ccb825a98	2018-10-10 19:41:03 -07:00
Jerry Zhang	7724807551	Remove ExtractDeviceOption from StaticContext (#12304 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/12304 - make ExtractDeviceOption to be a free function. - Add a Strorage(at::Device) constructor in order to preserve the device_id. Reviewed By: dzhulgakov Differential Revision: D10069839 fbshipit-source-id: a5f3994a39bdf1b7503b39bb42c228e438b52bfa	2018-10-10 14:12:16 -07:00
Junjie Bai	f54ab540af	Rename cuda_gpu_id to device_id in DeviceOption (#12456 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/12456 codemod with 'Yes to all' codemod -d . --extensions h,cc,cpp,cu,py,proto,pbtxt,pb.txt,config cuda_gpu_id device_id Overload TextFormat::ParseFromString to do string replace when parsing from protobuf format Reviewed By: Yangqing Differential Revision: D10240535 fbshipit-source-id: 5e6992bec961214be8dbe26f16f5794154a22b25	2018-10-09 15:54:04 -07:00
Jerry Zhang	1c69d368e1	Remove New with Allocator Registry (#12111 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/12111 Setup allocator registry keyed by at::DeviceType, and remove New from StaticContext. Reviewed By: ezyang Differential Revision: D10022853 fbshipit-source-id: 3e88a181fe5df24f33f49b88be1f75284a185588	2018-10-09 10:53:52 -07:00
Jerry Zhang	faab6ea922	Split Allocator (#12105 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/12105 Split CUDA/OpenCL/xxx Allocator from xxxStaticContext::New and rewrite it under at::Allocator interface. Reviewed By: dzhulgakov Differential Revision: D10001033 fbshipit-source-id: e1ffbc04c18d1dcb1f8d4ef2cbbb321967de5ccc	2018-10-03 19:10:10 -07:00
Jerry Zhang	74dc4460eb	New in StaticContext returns at::DataPtr (#12029 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/12029 In order to remove New() function in StaticContext(to remove StaticContext) and converge to the Allocator design, we'll first change the return type of New to at::DataPtr. Reviewed By: ezyang Differential Revision: D9889990 fbshipit-source-id: 3257c763530b987025f428741bdd2e089d11bad4	2018-10-03 19:10:07 -07:00
Junjie Bai	ff608a9ff3	Back out "Revert D10123245: Back out "codemod cuda_gpu_id to device_id"" (#12232 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/12232 Original commit changeset: fca91fea58b7 This adds proper modifications to the DeviceType <->DeviceOption conversion code added in D10033396 Reviewed By: jerryzh168 Differential Revision: D10132473 fbshipit-source-id: 801ef777e2950982cb47b48051b1471a0a91e64b	2018-10-01 21:54:52 -07:00
Rick Ratmansky	3010dc4208	Revert D10123245: Back out "codemod cuda_gpu_id to device_id" Differential Revision: D10123245 Original commit changeset: d83da8e00a12 fbshipit-source-id: fca91fea58b7df208edc2e218a1d514f9821ec7b	2018-10-01 12:22:36 -07:00
Yang Liu	7d7d336c45	Back out "codemod cuda_gpu_id to device_id" Summary: Original commit changeset: f5614a5d2607 D9986213 is causing Multifeed Aggregator a [huge performance different](https://our.intern.facebook.com/intern/ads/analyze_canary/412951953278781781/) and is blocking aggregator push since last Friday night: https://fburl.com/feedtools/b6izvwjz We need to land this revert ASAP to unblock aggregator push. Reviewed By: orionr Differential Revision: D10123245 fbshipit-source-id: d83da8e00a1250f5d09811a0a587c127e377aab2	2018-10-01 11:31:14 -07:00
Jerry Zhang	006171fffc	Back out "[pytorch][PR] Revert "Move CreateContext to global registry (#11688 )"" (#12121 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/12121 Pull Request resolved: https://github.com/pytorch/pytorch/pull/12055 Original commit changeset: 6ca9de65b707 Reviewed By: ezyang Differential Revision: D10033396 fbshipit-source-id: ca9f4b2f7ef0561f619b833415d394a8b9972bf4	2018-10-01 11:10:46 -07:00
Junjie Bai	3eb5940cf5	codemod cuda_gpu_id to device_id (#12022 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/12022 codemod -d . --extensions h,cc,cpp,cu,py,proto,pbtxt,pb.txt,config cuda_gpu_id device_id codemod with 'Yes to all' Reviewed By: orionr Differential Revision: D9986213 fbshipit-source-id: f5614a5d26078817aee8caf79a494abfd6a95ff1	2018-09-27 20:24:53 -07:00
Edward Yang	d7e11e3aae	Revert "Move CreateContext to global registry (#11688 )" (#12049 ) Summary: This reverts commit `3ae6ee4ebd`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/12049 Differential Revision: D10030954 Pulled By: ezyang fbshipit-source-id: 6ca9de65b707c5b4c68280fc6f1b8e5ad7251efc	2018-09-25 10:13:43 -07:00
Jerry Zhang	3ae6ee4ebd	Move CreateContext to global registry (#11688 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/11688 As a first step to remove static context(merge with allocator), we'll create a global registries for context constructors, and remove CreateContext function from tensor. Reviewed By: ezyang, dzhulgakov Differential Revision: D9779821 fbshipit-source-id: 8b239ea50af7a0556fde2382f58f79194f0e3dc1	2018-09-24 17:07:50 -07:00
Yinghai Lu	a26ad5a332	Remove unnecessary check on device option pointer (#11845 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/11845 The device pointer will be used by cudaPointerGetAttributes, which handles nullptr already. So this check is not necessary. https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__UNIFIED.html#group__CUDART__UNIFIED_1gd89830e17d399c064a2f3c3fa8bb4390 Reviewed By: salexspb Differential Revision: D9929828 fbshipit-source-id: d862f7e5590998ffafe9bfc7754b0f83d2ae4af4	2018-09-18 21:16:18 -07:00
Edward Yang	7607b49538	s/GetDevicetype/device_type/ (#11656 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/11656 The mis-capitalization really sticks up my craw. I know why (we already have a static function named GetDeviceType), but let's name it differently. ``` codemod -d . --extensions cc,cpp,cu,cuh,h,py,hpp,TARGETS GetDevicetype device_type ``` Reviewed By: jerryzh168 Differential Revision: D9813544 fbshipit-source-id: fe462f4bc40b03e74921f8cf5ebd9cfc52e7e636	2018-09-13 16:32:51 -07:00
Yinghai Lu	316c167940	Add checking of nullptrs in GetTensorInfo (#11587 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/11587 To help debug the issue in T33295362, we add some checks in the function. Possible crashing site in `GetTensorInfo` 1. tc is nullptr, which is checked. 2. tc->capacity_nbytes() hits nullptr, this is unlikely because storage is not a pointer and compute of capacity_nbytes doesn't involve pointers. It's numel * itermsize(). 3. tc->ExtractDeviceOption hits nullpt. One possibility raw_data() is nullptr because tc->ExtractDeviceOption will use that. This is checked. 4. Tensor itself which is not a reference. This is also checked. Reviewed By: salexspb Differential Revision: D9793484 fbshipit-source-id: 3fc72746fc310a23ae45553bbe0d269a4b9edb72	2018-09-12 16:25:46 -07:00
Yangqing Jia	68613cf5a2	Windows DLL build with Caffe2 code (#11266 ) Summary: This is an experimental build on top of what orionr and mingzhe09088 built. Essentially, the idea is that we will need separate *_API versions for different shared libraries. If this theory is right, I'll try to clean up the design a bit and document it properly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/11266 Reviewed By: orionr Differential Revision: D9682942 Pulled By: Yangqing fbshipit-source-id: c79653199e67a1500c9174f39f8b0357324763f3	2018-09-06 15:12:20 -07:00
Jerry Zhang	9f4bcdf075	caffe2::DeviceType -> at::DeviceType (#11254 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/11254 Previously we use DeviceType in caffe2.proto directly, but it's an `enum` and have implicit conversion to int, which does not have type safety, e.g. we have to explicitly check for a device type is valid in event.h: ``` template <int d> struct EventCreateFunctionRegisterer { explicit EventCreateFunctionRegisterer(EventCreateFunction f) { static_assert(d < MaxDeviceTypes, ""); Event::event_creator_[d] = f; } }; ``` at::DeviceType is an `enum class`, and it does not have implicit conversion to int, and provides better type safety guarantees. In this diff we have done the following refactor(taking CPU as an example): 1. caffe2::DeviceType → caffe2::DeviceTypeProto 2. caffe2::CPU → caffe2::PROTO_CPU 3. caffe2::DeviceType = at::DeviceType 4. caffe2::CPU = at::DeviceType::CPU codemod -d caffe2/caffe2 --extensions h,cc,cpp 'device_type\(\), ' 'device_type(), PROTO_' + some manual changes In short, after this diff, in c++, caffe2::CPU refers to the at::DeviceType::CPU and the old proto caffe2::CPU will be caffe2::PROTO_CPU. In python side, we have a temporary workaround that alias `caffe2_pb2.CPU = caffe2_pb2.PROOT_CPU` to make the change easier to review and this will be removed later. Reviewed By: ezyang Differential Revision: D9545704 fbshipit-source-id: 461a28a4ca74e616d3ee183a607078a717fd38a7	2018-09-05 16:28:09 -07:00
Orion Reblitz-Richardson	6508db7421	Remove BUILD_CAFFE2 and build everything (#8338 ) Summary: This completely removes BUILD_CAFFE2 from CMake. There is still a little bit of "full build" stuff in setup.py that enables USE_CUDNN and BUILD_PYTHON, but otherwise everything should be enabled for PyTorch as well as Caffe2. This gets us a lot closer to full unification. cc mingzhe09088, pjh5, ezyang, smessmer, Yangqing Pull Request resolved: https://github.com/pytorch/pytorch/pull/8338 Reviewed By: mingzhe09088 Differential Revision: D9600513 Pulled By: orionr fbshipit-source-id: 9f6ca49df35b920d3439dcec56e7b26ad4768b7d	2018-08-31 13:10:24 -07:00
Edward Yang	91797c0672	Replace direct include of caffe2.pb.h with an intermediary header caffe2_pb.h (#10946 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/10946 ``` codemod -d . --extensions cc,cpp,cu,cuh,h caffe2/proto/caffe2.pb.h caffe2/proto/caffe2_pb.h ``` Reviewed By: houseroad Differential Revision: D9539945 fbshipit-source-id: 497d04720e8e7e61c05ffe1b23733d0cb774de7e	2018-08-28 11:57:08 -07:00
Yangqing Jia	0a809fc8b1	build changes to make cpu unified build working. (#10504 ) Summary: Properly annotated all apis for cpu front. Checked with cmake using cmake -DUSE_ATEN=ON -DUSE_CUDA=OFF -DBUILD_ATEN=ON and resulting libcaffe2.so has about 11k symbols. Pull Request resolved: https://github.com/pytorch/pytorch/pull/10504 Reviewed By: ezyang Differential Revision: D9316491 Pulled By: Yangqing fbshipit-source-id: 215659abf350af7032e9a4b0f28a856babab2454	2018-08-15 17:22:36 -07:00
Jerry Zhang	aebf3b47ae	Remove template parameter from Tensor (#9939 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/9939 Pull Request resolved: https://github.com/facebookresearch/weakly-supervised-action-detection/pull/13 Pull Request resolved: https://github.com/pytorch/translate/pull/166 Pull Request resolved: https://github.com/pytorch/pytorch/pull/9125 Closes https://github.com/pytorch/pytorch/pull/9125 Use inheritance for polymorphism, and remove template parameter This is to change the templating in call sites, the core implementations will change later Before Caffe2 Tensor class was compile-time fixed to bind to a particular device/context. With this change, we're making it a runtime property (stored inside the tensor), but preserve the same semantics. For example, one has to specify device type in order to create a Tensor - there are no uninitialized tensors. More specifically the changes are: 1. We added an extra argument DeviceType to most of the constructors of the tensor, e.g. (Tensor(DeviceType type)), 2. Semantics of constructor Tensor(const Tensor<SrcContext>& src, ContextForCopy* context); is changed, in this constructor, the second context is passed in to enable us to call the templated Copy function, it could be in a different context as source and target previously, now we'll enforce that the context should have same device type as src, if it is provided. 3. To preserve 'get-or-construct' semantics of Blob, we added specialized getter Blob::GetMutableTensor that verifies both that Blob contains a Tensor and that it's of a correct type 4. Specifically, Tensor type is not default-constructible any more (as we don't have unknown device tensors) and thus some of the code handling STL containers needs to change Note: Some changes are postponed just to keep this diff a bit smaller. Please see `TODO`s. Reviewed By: ezyang, houseroad Differential Revision: D9024330 fbshipit-source-id: e0b8295d2dc6ebe2963383ded5af799ad17164ba	2018-07-27 10:56:39 -07:00
Jerry Zhang	969b62f276	Revert D8121878: Remove template parameter from Tensor Differential Revision: D8121878 Original commit changeset: 4a5e9a677ba4 fbshipit-source-id: d8e2c0bb145b52fbcca323b22d1d3346f0b3249e	2018-07-26 14:02:04 -07:00
Jerry Zhang	cd5adc7b5f	Remove template parameter from Tensor (#13 ) Summary: Pull Request resolved: https://github.com/facebookresearch/weakly-supervised-action-detection/pull/13 Pull Request resolved: https://github.com/pytorch/translate/pull/166 Pull Request resolved: https://github.com/pytorch/pytorch/pull/9125 Closes https://github.com/pytorch/pytorch/pull/9125 Use inheritance for polymorphism, and remove template parameter This is to change the templating in call sites, the core implementations will change later Before Caffe2 Tensor class was compile-time fixed to bind to a particular device/context. With this change, we're making it a runtime property (stored inside the tensor), but preserve the same semantics. For example, one has to specify device type in order to create a Tensor - there are no uninitialized tensors. More specifically the changes are: 1. We added an extra argument DeviceType to most of the constructors of the tensor, e.g. (Tensor(DeviceType type)), 2. Semantics of constructor Tensor(const Tensor<SrcContext>& src, ContextForCopy* context); is changed, in this constructor, the second context is passed in to enable us to call the templated Copy function, it could be in a different context as source and target previously, now we'll enforce that the context should have same device type as src, if it is provided. 3. To preserve 'get-or-construct' semantics of Blob, we added specialized getter Blob::GetMutableTensor that verifies both that Blob contains a Tensor and that it's of a correct type 4. Specifically, Tensor type is not default-constructible any more (as we don't have unknown device tensors) and thus some of the code handling STL containers needs to change Note: Some changes are postponed just to keep this diff a bit smaller. Please see `TODO`s. Reviewed By: xw285cornell Differential Revision: D8121878 fbshipit-source-id: 4a5e9a677ba4ac82095df959851a054c81eccf81	2018-07-26 10:25:23 -07:00
Yangqing Jia	20c516ac18	[cmake] Make cudnn optional (#8265 ) * Make cudnn optional * Remove cudnn file from cpu file	2018-06-08 02:04:27 -07:00
Orion Reblitz-Richardson	d1bdb3b10a	Remove core and util warnings (#8239 ) * Fix some signed/unsigned mismatches * Skip unused result warning * Explict fallthrough for murmur hash * Enable aligned new support to eliminate warning * Switch to int instead of unsigned in some cases	2018-06-07 09:10:33 -07:00
Orion Reblitz-Richardson	6223bfdb1d	Update from Facebook (#6692 ) * [GanH][Easy]: Add assertion to adaptive weighting layer 0 weight causes numeric instability and exploding ne * [Easy] Add cast op before computing norm in diagnose options As LpNorm only takes floats we add a manual casting here. * Introduce a new caching device allocator `cudaMalloc` and `cudaFree` calls are slow, and become slower the more GPUs there are. Essentially, they grab a host-wide (not device-wide) lock because GPU memory is transparently shared across all GPUs. Normally, this isn't much of a concern since workloads allocate memory upfront, and reuse it during later computation. However, under some computation models (specifically, memory conserving approaches like checkpoint-and-recompute, see https://medium.com/@yaroslavvb/fitting-larger-networks-into-memory-583e3c758ff9) this assumption is no longer true. In these situations, `cudaMalloc` and `cudaFree` are common and frequent. Furthermore, in data parallel contexts, these calls happen at nearly the same time from all GPUs worsening lock contention. A common solution to this problem is to add a custom allocator. In fact, nVIDIA provides one out of the box: CUB, which Caffe2 already supports. Unfortunately, the CUB allocator suffers from very high fragmentation. This is primarily because it is a "buddy" allocator which neither splits nor merges free cached blocks. Study https://github.com/NVlabs/cub/blob/1.8.0/cub/util_allocator.cuh#L357 if you want to convince yourself. This diff adapts a caching allocator from the Torch codebase https://github.com/torch/cutorch/blob/master/lib/THC/THCCachingAllocator.cpp which does splitting and merging and ends up working really well, at least for workloads like the checkpoint-and-recompute computation models noted above. I simplified the implementation a little bit, made it a bit more C++-like. I also removed a bunch of stream synchronization primitives for this diff. I plan to add them back in subsequent diffs. * Report reader progress in fblearner workflows Integrate with fblearner progress reporting API and add support to report training progress from reader nodes. If reader is constructed with batch limits, report based on finished batch vs total batch. The finished batch may be more than total batch because we evaludate if we should stop processing everytime we dequeue a split. If no limit for the reader, report based on finished splits (Hive files) vs total splits. This is fairly accurate. * [GanH][Diagnose]: fix plotting 1. ganh diagnose needs to set plot options 2. modifier's blob name is used for metric field can need to be fixed before generating net * Automatic update of fbcode/onnx to 985af3f5a0f7e7d29bc0ee6b13047e7ead9c90c8 * Make CompositeReader stops as soon as one reader finishes Previously, CompositeReader calls all readers before stopping. It results in flaky test since the last batch may be read by different threads; resulting in dropped data. * [dper] make sure loss is not nan as desc. * [rosetta2] [mobile-vision] Option to export NHWC order for RoIWarp/RoIAlign Thanks for finding this @stzpz and @wangyanghan. Looks like NHWC is more optimized. For OCR though it doesn't yet help since NHWC uses more mem b/w but will soon become important. * Intra-op parallel FC operator Intra-op parallel FC operator * [C2 Proto] extra info in device option passing extra information in device option design doc: https://fb.quip.com/yAiuAXkRXZGx * Unregister MKL fallbacks for NCHW conversions * Tracing for more executors Modified Tracer to work with other executors and add more tracing * Remove ShiftActivationDevices() * Check for blob entry iff it is present When processing the placeholders ops, ignore if the blob is not present in the blob_to_device. * Internalize use of eigen tensor Move use of eigen tensor out of the header file so we don't get template partial specialization errors when building other libraries. * feature importance for transformed features. * - Fix unused parameter warnings The changes in this diff comments out unused parameters. This will allow us to enable -Wunused-parameter as error. #accept2ship * add opencv dependencies to caffe2 The video input op requires additional opencv packages. This is to add them to cmake so that it can build * Add clip_by_value option in gradient clipping Add clip_by_value option in gradient clipping when the value is bigger than max or smaller than min, do the clip * std::round compat	2018-04-17 23:36:40 -07:00
Orion Reblitz-Richardson	1d5780d42c	Remove Apache headers from source. * LICENSE file contains details, so removing from individual source files.	2018-03-27 13:10:18 -07:00
Dmytro Dzhulgakov	496c999f7d	[core] NUMA-aware pinned allocator Using cudaHostRegister/Unregister instead of cudaMallocHost to move memory to a specific NUMA node	2018-03-06 00:33:11 -08:00

1 2

96 Commits