pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Dmytro Dzhulgakov	51dd2000cd	unify c2 and TH allocator (#16892 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/16892 Replaces https://github.com/pytorch/pytorch/pull/14517 Merged caffe2 and TH CPU Allocators. Mostly using the code from caffe2 allocators. `memset` of caffe2 allocator is gone now. These two allocators should be almost the same. Baseline: ``` Running ./tensor_allocation Run on (48 X 2501 MHz CPU s) CPU Caches: L1 Data 32K (x24) L1 Instruction 32K (x24) L2 Unified 256K (x24) L3 Unified 30720K (x2) ------------------------------------------------------------------------- Benchmark Time CPU Iterations ------------------------------------------------------------------------- BM_MakeStorageImpl 148 ns 148 ns 4676594 BM_StorageImplCtor 54 ns 54 ns 12957810 BM_MallocStorageImpl 62 ns 62 ns 11254745 BM_TensorImplCtor 22 ns 22 ns 31939472 BM_MallocTensorImpl 105 ns 105 ns 6505661 BM_Malloc_1 43 ns 43 ns 16464905 BM_MakeTensorFromStorage 126 ns 126 ns 5586116 BM_MakeVariableFromTensor 236 ns 236 ns 2995528 BM_ATenCPUTensorAllocationSmall1 319 ns 319 ns 2268884 BM_ATenCPUTensorAllocationSmall2 318 ns 318 ns 2163332 BM_ATenCPUTensorAllocationMedium1 403 ns 403 ns 1663228 BM_ATenCPUTensorAllocationMedium2 448 ns 448 ns 1595004 BM_ATenCPUTensorAllocationBig1 532 ns 532 ns 1352634 BM_ATenCPUTensorAllocationBig2 4486 ns 4486 ns 160978 ``` Changed: ``` Running ./tensor_allocation Run on (48 X 2501 MHz CPU s) CPU Caches: L1 Data 32K (x24) L1 Instruction 32K (x24) L2 Unified 256K (x24) L3 Unified 30720K (x2) ------------------------------------------------------------------------- Benchmark Time CPU Iterations ------------------------------------------------------------------------- BM_MakeStorageImpl 141 ns 141 ns 4803576 BM_StorageImplCtor 55 ns 55 ns 13129391 BM_MallocStorageImpl 64 ns 64 ns 11088143 BM_TensorImplCtor 23 ns 23 ns 31616273 BM_MallocTensorImpl 101 ns 101 ns 7017585 BM_Malloc_1 39 ns 39 ns 18523954 BM_MakeTensorFromStorage 118 ns 118 ns 5877919 BM_MakeVariableFromTensor 452 ns 452 ns 1565722 BM_ATenCPUTensorAllocationSmall1 384 ns 384 ns 1819763 BM_ATenCPUTensorAllocationSmall2 389 ns 389 ns 1857483 BM_ATenCPUTensorAllocationMedium1 425 ns 425 ns 1646284 BM_ATenCPUTensorAllocationMedium2 430 ns 430 ns 1561319 BM_ATenCPUTensorAllocationBig1 508 ns 508 ns 1309969 BM_ATenCPUTensorAllocationBig2 3799 ns 3799 ns 173674 ``` lstm benchmark: Before: ``` INFO:lstm_bench:Iter: 1 / 390. Entries Per Second: 0.7k. INFO:lstm_bench:Iter: 21 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 41 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 61 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 81 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 101 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 121 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 141 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 161 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 181 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 201 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 221 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 241 / 390. Entries Per Second: 0.7k. INFO:lstm_bench:Iter: 261 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 281 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 301 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 321 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 341 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 361 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 381 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Done. Total EPS excluding 1st iteration: 0.8k ``` After: ``` INFO:lstm_bench:Iter: 1 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 21 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 41 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 61 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 81 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 101 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 121 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 141 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 161 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 181 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 201 / 390. Entries Per Second: 0.8k. INFO:lstm_bench:Iter: 221 / 390. Entries Per Second: 0.7k. INFO:lstm_bench:Iter: 241 / 390. Entries Per Second: 0.7k. INFO:lstm_bench:Iter: 261 / 390. Entries Per Second: 0.7k. INFO:lstm_bench:Iter: 281 / 390. Entries Per Second: 0.7k. INFO:lstm_bench:Iter: 301 / 390. Entries Per Second: 0.7k. INFO:lstm_bench:Iter: 321 / 390. Entries Per Second: 0.7k. INFO:lstm_bench:Iter: 341 / 390. Entries Per Second: 0.7k. INFO:lstm_bench:Iter: 361 / 390. Entries Per Second: 0.7k. INFO:lstm_bench:Iter: 381 / 390. Entries Per Second: 0.7k. INFO:lstm_bench:Done. Total EPS excluding 1st iteration: 0.8k ``` Reviewed By: ezyang Differential Revision: D13202632 fbshipit-source-id: db6d2ec756ed15b0732b15396c82ad42302bb79d	2019-02-12 21:16:34 -08:00
Sebastian Messmer	507ed9032e	Fix include paths for Allocator.h Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14060 Reviewed By: ezyang Differential Revision: D13081605 fbshipit-source-id: 02f23af174c0f0c38fb0163c2dfef3873ff5635d	2018-11-27 12:59:46 -08:00
Xiaoqiang Zheng	de41d1ae0b	Enable junk fill for the default CPU allocator (#13377 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/13377 * Enable junk fill for the default CPU allocator. The first diff only enables this for the tests. A second diff will change the default of zero-fill to false. * Fix tests to use 64-bit counters that IterOp and LearningRateOp demands. * Fix kernels that uses uninitialized memory. Reviewed By: salexspb Differential Revision: D10866512 fbshipit-source-id: 17860e77e63a203edf46d0da0335608f77884821	2018-11-08 00:02:37 -08:00
Yangqing Jia	7d5f7ed270	Using c10 namespace across caffe2. (#12714 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/12714 This is a short change to enable c10 namespace in caffe2. We did not enable it before due to gflags global variable confusion, but it should have been mostly cleaned now. Right now, the plan on record is that namespace caffe2 and namespace aten will fully be supersets of namespace c10. Most of the diff is codemod, and only two places of non-codemod is in caffe2/core/common.h, where ``` using namespace c10; ``` is added, and in Flags.h, where instead of creating aliasing variables in c10 namespace, we directly put it in the global namespace to match gflags (and same behavior if gflags is not being built with). Reviewed By: dzhulgakov Differential Revision: D10390486 fbshipit-source-id: 5e2df730e28e29a052f513bddc558d9f78a23b9b	2018-10-17 12:57:19 -07:00
Yangqing Jia	38f3d1fc40	move flags to c10 (#12144 ) Summary: still influx. Pull Request resolved: https://github.com/pytorch/pytorch/pull/12144 Reviewed By: smessmer Differential Revision: D10140176 Pulled By: Yangqing fbshipit-source-id: 1a313abed022039333e3925d19f8b3ef2d95306c	2018-10-04 02:09:56 -07:00
Jerry Zhang	74dc4460eb	New in StaticContext returns at::DataPtr (#12029 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/12029 In order to remove New() function in StaticContext(to remove StaticContext) and converge to the Allocator design, we'll first change the return type of New to at::DataPtr. Reviewed By: ezyang Differential Revision: D9889990 fbshipit-source-id: 3257c763530b987025f428741bdd2e089d11bad4	2018-10-03 19:10:07 -07:00
Yangqing Jia	9c49bb9ddf	Move registry fully to c10 (#12077 ) Summary: This does 6 things: - add c10/util/Registry.h as the unified registry util - cleaned up some APIs such as export condition - fully remove aten/core/registry.h - fully remove caffe2/core/registry.h - remove a bogus aten/registry.h - unifying all macros - set up registry testing in c10 Also, an important note that we used to mark the templated Registry class as EXPORT - this should not happen, because one should almost never export a template class. This PR fixes that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/12077 Reviewed By: ezyang Differential Revision: D10050771 Pulled By: Yangqing fbshipit-source-id: 417b249b49fed6a67956e7c6b6d22374bcee24cf	2018-09-27 03:09:54 -07:00
Jongsoo Park	29610621ec	64B align for avx512 (#11748 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/11748 For avx512, we need to align at a multiple of 64B not 32B Regardless of avx512, it's in general a good idea to be cache line aligned. Reviewed By: ilia-cher Differential Revision: D9845056 fbshipit-source-id: b1d3ed67749c0c1a64acd5cc230a1279e8023512	2018-09-17 14:08:31 -07:00
Yangqing Jia	0a809fc8b1	build changes to make cpu unified build working. (#10504 ) Summary: Properly annotated all apis for cpu front. Checked with cmake using cmake -DUSE_ATEN=ON -DUSE_CUDA=OFF -DBUILD_ATEN=ON and resulting libcaffe2.so has about 11k symbols. Pull Request resolved: https://github.com/pytorch/pytorch/pull/10504 Reviewed By: ezyang Differential Revision: D9316491 Pulled By: Yangqing fbshipit-source-id: 215659abf350af7032e9a4b0f28a856babab2454	2018-08-15 17:22:36 -07:00
Orion Reblitz-Richardson	1d5780d42c	Remove Apache headers from source. * LICENSE file contains details, so removing from individual source files.	2018-03-27 13:10:18 -07:00
Dmytro Dzhulgakov	9e71de398b	[core] Graph-level NUMA awareness in Caffe2 Adding NUMA awareness through numa_node_id in DeviceOption. Blobs of operators with numa_node_id are allocated on corr. memory banks, using CPU pools with NUMA affinity set to run operators.	2018-03-06 00:33:11 -08:00
Yangqing Jia	8286ce1e3a	Re-license to Apache Summary: Closes https://github.com/caffe2/caffe2/pull/1260 Differential Revision: D5906739 Pulled By: Yangqing fbshipit-source-id: e482ba9ba60b5337d9165f28f7ec68d4518a0902	2017-09-28 16:22:00 -07:00
Dmytro Dzhulgakov	97e733615c	Use simple function pointers for memory allocation and deallocation. Reviewed By: Yangqing Differential Revision: D5822238 fbshipit-source-id: 9624e6494ea6be10221aa75c7f22aa8721946af2	2017-09-13 14:26:04 -07:00
Aapo Kyrola	47fd6cc255	Revert D5801013: [caffe2] Use simple function pointers for memory allocation and deallocation. Summary: This reverts commit 7068207a43400fa3902bbb3689b3c729e839456c bypass-lint Differential Revision: D5801013 fbshipit-source-id: ca2bd9aaf61c20ce1935a007ab7b34f5d37f5033	2017-09-12 16:36:36 -07:00
Yangqing Jia	10a032de67	Use simple function pointers for memory allocation and deallocation. Summary: During the team meeting today Dima and Alex mentioned that the current lambda function causes slowdown in performance when a large number of alloc and dealloc happen. My observation is that most of the Delete are actually direct Delete() function pointers, so I gave it a shot to see if we can reduce the overhead. RawAllocDealloc is much fast already, and we observe another 5ns reduction (12.5%). For TensorAllocDealloc of 32x32 tensors, we are observing 57ns saving (26%). This is measured on Xeon(R) CPU E5-2660. Also cleaned up the function interfaces of ShareExternalPointer so we have 2 functions only. Reviewed By: salexspb, dzhulgakov Differential Revision: D5801013 fbshipit-source-id: 7068207a43400fa3902bbb3689b3c729e839456c	2017-09-10 22:47:26 -07:00
Alexander Sidorov	5d85a6753a	Caffe2 BlackBoxPredictor Reviewed By: dzhulgakov Differential Revision: D5720775 fbshipit-source-id: e3b37ce5a4aa63807825937ec5f9c0aea76c2aba	2017-09-07 23:16:02 -07:00
Yangqing Jia	367106f591	move memory allocators to allocator.{h,cc} Summary: (no major functionality change, purely cosmetic) Closes https://github.com/caffe2/caffe2/pull/1098 Reviewed By: akyrola Differential Revision: D5638664 Pulled By: Yangqing fbshipit-source-id: bbae0589ec1afe938a186ccfce9f6ff1a986a5db	2017-08-16 01:35:20 -07:00

17 Commits