This is an initial attempt to provide some statistics for the pinned host memory allocations flowing through CachingHostAllocator. Many times in the past we have had inexplicable slowdowns that would have been much easier to diagnose if we had some host memory characteristics.
This change tries very hard not to disrupt the original design of the allocator, and it uses the existing locking mechanism, whenever possible, to gather statistics "for free". The only deviation from that is on the "slow path", where we incur CUDA calls anyway, so taking a short lock is not going to hurt performance much, especially in the steady state where most allocations come from the cache.
As mentioned before, this is the first PR, intended to introduce the concept and to see whether it fits the right paradigm. We can always add more later.
Metrics that would require more involved changes to the code base and locking, such as requested memory, have been punted for now. I also tried to reuse the Stat structure from the CUDA caching allocator, in order to maintain symmetry.
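As a rough illustration, the per-metric bookkeeping mirrors the Stat record used by the CUDA caching allocator; the sketch below is only an approximation of its shape (the field names follow the CUDA allocator's Stat, and the update helpers are illustrative, not the exact code in this PR).
```cpp
#include <algorithm>
#include <cstdint>

// Minimal sketch of a CUDACachingAllocator-style Stat record, reused here for
// pinned host memory. Exact fields/helpers in the PR may differ.
struct Stat {
  int64_t current = 0;   // bytes (or blocks) currently outstanding
  int64_t peak = 0;      // high-water mark of `current`
  int64_t allocated = 0; // cumulative allocations
  int64_t freed = 0;     // cumulative frees
};

// Hypothetical helpers showing how the counters would be maintained while the
// allocator already holds its existing lock.
inline void increase(Stat& s, int64_t amount) {
  s.current += amount;
  s.peak = std::max(s.peak, s.current);
  s.allocated += amount;
}

inline void decrease(Stat& s, int64_t amount) {
  s.current -= amount;
  s.freed += amount;
}
```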
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147660
Approved by: https://github.com/ngimel
Summary: Make the recordAnnotations RecordFunction callback initialize lazily, when memory history recording starts. This helps reduce the impact on the Time To First Batch metric.
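A minimal sketch of the lazy-initialization pattern this describes; the function and flag names below are hypothetical stand-ins, not the actual profiler internals.
```cpp
#include <mutex>

// Hypothetical stand-in for the real (and relatively expensive) step of
// registering the profiler's RecordFunction callback for annotations.
static void registerRecordAnnotationsCallback() { /* ... */ }

static std::once_flag g_annotations_once;

// Instead of installing the callback eagerly at start-up, defer it until
// memory history recording is actually turned on, so programs that never
// record memory history pay nothing toward Time To First Batch.
void recordMemoryHistoryStart() {
  std::call_once(g_annotations_once, [] { registerRecordAnnotationsCallback(); });
  // ... begin recording allocation/free events ...
}
```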
Test Plan: CI and ran locally.
Differential Revision: D58875576
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129242
Approved by: https://github.com/zdevito
Summary:
Add new traceEvents into the Memory Snapshot for record_function annotations. These will capture both the profiler's step annotation and user annotations.
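For context, a user annotation of the kind these trace events capture can be emitted from C++ with the RECORD_FUNCTION macro; the snippet below is a generic illustration, not code from this PR.
```cpp
#include <torch/torch.h>
#include <ATen/record_function.h>

// Work wrapped in RECORD_FUNCTION shows up as a named user annotation; when
// memory history recording is on, allocations made inside the scope can be
// associated with this annotation among the snapshot's trace events.
torch::Tensor annotated_step(const torch::Tensor& input) {
  RECORD_FUNCTION("my_user_annotation", std::vector<c10::IValue>({input}));
  return torch::relu(input) + 1.0;
}
```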
Test Plan:
CI
Pulled By: aaronenyeshi
Differential Revision: D55941362
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129072
Approved by: https://github.com/zdevito
Summary:
When using a custom deleter, InefficientStdFunctionContext was using a
std::unique_ptr<> to store the pointer and call the deleter, but this failed to
call the deleter if the pointer was null. Since we have a separate holder class
anyway, take out the std::unique_ptr<> and call the deleter directly.
Fixes #117273
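The underlying std::unique_ptr behavior is illustrated below; the names are generic, but the pitfall (a unique_ptr never invokes its deleter for a null pointer) is standard C++.
```cpp
#include <cstdio>
#include <memory>

// Standard behavior: ~unique_ptr only calls the deleter when get() != nullptr,
// so a deleter that must always run (e.g. to release a side resource) is
// silently skipped when the stored pointer is null.
int main() {
  auto noisy_deleter = [](void* /*p*/) { std::puts("deleter ran"); };

  {
    std::unique_ptr<void, decltype(noisy_deleter)> p(nullptr, noisy_deleter);
  } // prints nothing: the deleter is skipped for a null pointer

  {
    void* data = nullptr;
    // Holding the raw pointer and invoking the deleter directly (as the fixed
    // holder class does) guarantees the callback runs even for null.
    noisy_deleter(data); // prints "deleter ran"
  }
  return 0;
}
```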
Test Plan:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117418
Approved by: https://github.com/wjakob, https://github.com/yanboliang
Summary: Similar to reporting alloc and dealloc events in the PyTorch profiler, we are now reporting Out of Memory events as well. This is useful for performance troubleshooting.
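A rough sketch of the kind of reporting hook involved; it assumes a c10::MemoryReportingInfoBase subclass with reportMemoryUsage/reportOutOfMemory methods of roughly this shape, which may not match the exact signatures in this PR.
```cpp
#include <c10/core/Allocator.h>
#include <iostream>

// Sketch of a debug reporter that would receive OOM events alongside the
// usual allocation/deallocation reports. Signatures are assumed, not exact.
struct LoggingReporter : public c10::MemoryReportingInfoBase {
  void reportMemoryUsage(void* ptr, int64_t alloc_size, size_t total_allocated,
                         size_t total_reserved, c10::Device device) override {
    std::cout << "alloc/free " << alloc_size << "B at " << ptr
              << " on " << device << "\n";
  }
  void reportOutOfMemory(int64_t alloc_size, size_t total_allocated,
                         size_t total_reserved, c10::Device device) override {
    std::cout << "OOM: failed to allocate " << alloc_size << "B on " << device
              << " (allocated=" << total_allocated
              << ", reserved=" << total_reserved << ")\n";
  }
  bool memoryProfilingEnabled() const override { return true; }
};
```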
Test Plan: Added test_oom_tracing to test/test_profiler.py
Differential Revision: D36268132
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80050
Approved by: https://github.com/robieta
Summary:
Tensor.is_alias_of relies on Storage to perform its check. However, LTCTensorImpl was
not implemented with that in mind. This commit adds a fake storage to LazyTensor
as a marker for LazyTensors that point to the same storage. The reason
it is not done in LTCTensorImpl is that LazyTensor keeps the view-op/alias
logic in the LazyTensor class instead of relying on TensorImpl to do the check.
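The behavior the fake storage marker enables looks roughly like the sketch below (a gtest-style illustration, not the actual LazyOpsTest.IsAliasOf body; it assumes the lazy backend is initialized by the test harness).
```cpp
#include <gtest/gtest.h>
#include <torch/torch.h>

// Rough illustration of the aliasing relationship the fake storage marker
// makes visible to Tensor::is_alias_of.
TEST(LazyAliasSketch, IsAliasOf) {
  torch::Device lazy("lazy");
  torch::Tensor a = torch::rand({4, 4}, torch::TensorOptions().device(lazy));
  torch::Tensor b = a.view({16});   // a view shares the same (fake) storage
  torch::Tensor c = torch::rand({4, 4}, torch::TensorOptions().device(lazy));

  EXPECT_TRUE(a.is_alias_of(b));    // same storage marker -> alias
  EXPECT_FALSE(a.is_alias_of(c));   // independent allocation -> not an alias
}
```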
Test Plan:
./build/bin/test_lazy --gtest_filter=LazyOpsTest.IsAliasOf
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75246
Approved by: https://github.com/bdhirsh
Summary:
Report the pointer's memory size, the total allocated memory, and the total reserved size all in one report.
`ptr` and `alloc_size` will be used for associating with the op trace.
`allocated_size` and `reserved_size` will be used for the memory trace.
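On the reporting side, the combined report roughly corresponds to a single profiler call carrying all four pieces of information; the sketch below assumes a helper of the shape of c10::reportMemoryUsageToProfiler, which may not match the exact code path in this PR.
```cpp
#include <c10/core/Allocator.h>
#include <c10/core/Device.h>

// Sketch: when an allocator hands out (or frees) a block, it emits one report
// containing the pointer, the signed delta, and the allocator-wide totals.
void onAllocate(void* ptr, int64_t alloc_size,
                size_t total_allocated, size_t total_reserved) {
  c10::reportMemoryUsageToProfiler(
      ptr,              // associates the event with the op trace
      alloc_size,       // positive for alloc, negative for free
      total_allocated,  // running total of live allocations
      total_reserved,   // running total reserved from the system
      c10::Device(c10::DeviceType::CPU));
}
```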
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61282
Reviewed By: ejguan
Differential Revision: D29796282
Pulled By: chaekit
fbshipit-source-id: 5314c867632d3af1fa9a3811b35eaa5e931a5d87
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56830
Opt into formatting on GitHub and format everything. This is a trial run before turning on formatting for more and eventually all of the codebase.
Test Plan: CI
Reviewed By: zertosh
Differential Revision: D27979080
fbshipit-source-id: a80f0c48691c08ae8ca0af06377b87e6a2351151
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37640
Enable oversize arena to reduce memory fragmentation. Memory requests with a large size (configurable with FLAGS_caffe2_oversize_threshold) are fulfilled from a dedicated arena, separate from the existing huge page arena.
Two additional parameters are introduced to configure the 2-phase decay of the memory arena:
- caffe2_dirty_decay_ms
- caffe2_muzzy_decay_ms
In the current JEMalloc implementation, oversized allocations are purged immediately regardless of whether they are placed in an arena, so we need to extend the decay time to indefinite; currently we set the default for caffe2_muzzy_decay_ms to -1 (see the mallctl sketch below).
We now enable the arena allocator statically. To ensure it is correctly installed regardless of static initialization order, we add a priority flag to c10::SetAllocator, and only higher-priority allocators can overwrite existing ones.
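For reference, jemalloc exposes these decay times through its mallctl interface, where -1 means the pages are never purged; the sketch below shows the underlying knobs the caffe2_dirty_decay_ms / caffe2_muzzy_decay_ms flags correspond to (how this diff wires them up is an assumption).
```cpp
#include <jemalloc/jemalloc.h>
#include <sys/types.h>
#include <cstdio>

// Sketch: set an arena's dirty/muzzy decay times via jemalloc's mallctl.
// A value of -1 disables purging, matching the "indefinite decay" described
// above. Arena index 0 is used here purely for illustration.
static void set_decay_ms(const char* name, ssize_t decay_ms) {
  if (mallctl(name, nullptr, nullptr, &decay_ms, sizeof(decay_ms)) != 0) {
    std::fprintf(stderr, "mallctl(%s) failed\n", name);
  }
}

int main() {
  set_decay_ms("arena.0.dirty_decay_ms", -1);  // never purge dirty pages
  set_decay_ms("arena.0.muzzy_decay_ms", -1);  // never purge muzzy pages
  return 0;
}
```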
ghstack-source-id: 103276877
Test Plan:
buck test mode/dev //caffe2/caffe2/fb/init:huge_pages_allocator_test
Benchmarking a known CV model that benefits from the page arena:
```
PyTorchModelBench.cpp:183] test / base : 86.9532%
```
By adjusting ```dirty_decay_ms``` and ```muzzy_decay_ms```, we have the following plots:
https://pxl.cl/15SWW
https://pxl.cl/15TnL
From the figures above we can see that performance does not change much until the dirty decay time is made indefinite (set to -1). Setting either the muzzy decay time or the dirty decay time to -1 reaches the best performance, regardless of which one it is. Even setting the decay time to a very long value (100 s, which is longer than the run) does not change the performance by much.
## Observe performance difference in production with a variety of models (WIP)
Reviewed By: dzhulgakov
Differential Revision: D21258581
fbshipit-source-id: c006f8b94f28aef0666e52f48d4e82cf0d3a48af
Summary:
Fixes the bug where a tensor is created on the Caffe2 side, then passed to PT and resized. Now we just initialize the allocator correctly.
Note that the code in raw_mutable_data() is still necessary because of non-resizable tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16857
Reviewed By: houseroad
Differential Revision: D14019469
Pulled By: dzhulgakov
fbshipit-source-id: 14d3a3b946d718bbab747ea376903646b885706a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16513
compare_exchange_deleter makes it easier to replace a deleter on a DataPtr
with a new one, without allocating another closure to hold the old deleter.
See comment for details.
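A small usage sketch of the idea, assuming a c10::DataPtr::compare_exchange_deleter(expected, new) API of roughly this shape (the deleter is swapped only when the current one matches the expected function pointer):
```cpp
#include <c10/core/Allocator.h>
#include <cstdlib>

// Plain function-pointer deleters: compare_exchange_deleter works on these
// directly, so no extra closure is needed to remember the old deleter.
static void freeDeleter(void* p) { std::free(p); }
static void noopDeleter(void* /*p*/) {}

int main() {
  void* raw = std::malloc(64);
  c10::DataPtr ptr(raw, raw, &freeDeleter, c10::Device(c10::DeviceType::CPU));

  // Swap the deleter only if it is currently freeDeleter; afterwards the
  // DataPtr no longer frees the buffer, so this sketch takes ownership back.
  if (ptr.compare_exchange_deleter(&freeDeleter, &noopDeleter)) {
    std::free(raw);
  }
  return 0;
}
```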
This diff was originally landed as part of D13762540 (#16226),
but that diff was reverted in D13863610 (#16510).
Reviewed By: smessmer
Differential Revision: D13864245
fbshipit-source-id: 56eda4748238dd3a5130ba6434fda463fe7c690e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16510
This diff was supposed to be memory usage neutral, but based on
some internal flows involving cuDNN, it was not. Reverting pending
further investigation.
Original commit changeset: 03f1ebf7f11c
Reviewed By: xw285cornell
Differential Revision: D13863610
fbshipit-source-id: 15517e255fd6b0c064b65fb99f0ef19742236cfd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16226
Now that the caching allocator is moved to c10_cuda, we can
delete the duplicate copy from Caffe2.
Reviewed By: dzhulgakov, smessmer
Differential Revision: D13762540
fbshipit-source-id: 03f1ebf7f11c68c19aa0d66110156fe228da6138
Summary:
Some renaming and renamespacing also took place. I was originally planning not to do anything, but it turns out that it was easier to make HIPify work by using a namespace CUDACachingAllocator:: rather than THCCachingAllocator_, since :: is a word boundary but _ is not.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16119
Reviewed By: smessmer
Differential Revision: D13718768
fbshipit-source-id: 884a481d99027fd3e34471c020f826aa12225656
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15316
This starts cleaning up the files in c10 according to the module structure we decided on.
Move to c10/util:
- Half.h, Half-inl.h, Half.cpp, bitcasts.h
Move to c10/core:
- Device.h, Device.cpp
- DeviceType.h, DeviceType.cpp
i-am-not-moving-c2-to-c10
Reviewed By: dzhulgakov
Differential Revision: D13498493
fbshipit-source-id: dfcf1c490474a12ab950c72ca686b8ad86428f63
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14434
The referenced classes live now in c10, so we don't need to specify their namespace.
Reviewed By: ezyang
Differential Revision: D13224015
fbshipit-source-id: 6d154b8e3f9a1e38ff0407dbb1151f5c1d5df260