Applies some more fixes to headers that may have been missed before, for performance optimization.
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @EikanWang @ezyang since this is more in the series of the clang-tidy fixups
This PR fixes three main issues:
1. Use emplacement more in headers.
1. Avoid unnecessary copies and use const references where possible.
1. Default special member functions where possible to make them potentially trivial and more readable.
1. There is also one change in this PR that tries to prevent unnecessary math promotion; the rest of those changes are in another PR (a short illustrative sketch of these patterns follows below).
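For concreteness, here is a minimal sketch of these patterns; the types and functions are illustrative only, not code from this PR:

```
#include <string>
#include <utility>
#include <vector>

// Defaulted special member functions read clearly and stay trivial when the
// member types allow it.
struct Entry {
  std::string name;
  Entry() = default;
  Entry(const Entry&) = default;
  Entry& operator=(const Entry&) = default;
};

// Take a const reference instead of a by-value copy, and construct the
// element in place instead of pushing a temporary pair.
void collect(std::vector<std::pair<std::string, int>>& out,
             const std::string& key,
             int value) {
  out.emplace_back(key, value);
}

// A float literal keeps the arithmetic in float instead of promoting it to
// double.
float half_of(float x) {
  return 0.5f * x;
}
```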
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91445
Approved by: https://github.com/ezyang
Summary: Similar to the alloc and dealloc events already reported by the PyTorch profiler, we now report out-of-memory (OOM) events as well. This is useful for performance troubleshooting.
Test Plan: Added test_oom_tracing to test/test_profiler.py
Differential Revision: D36268132
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80050
Approved by: https://github.com/robieta
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70859
ghstack-source-id: 147642534
Test Plan: Extracting code unmodified to a new library; relying on CI to validate.
Reviewed By: malfet
Differential Revision: D33329688
fbshipit-source-id: f60327467d197ec1862fb3554f8b83e6c84cab5c
(cherry picked from commit f82e7c0e9b)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70858
ghstack-source-id: 147642533
Test Plan: Extracted a constant to a new header, trusting the CI build to validate.
Reviewed By: malfet
Differential Revision: D33329689
fbshipit-source-id: 8697bb81a5cc3366462ebdf1f214b62d478fa77c
(cherry picked from commit 16663847e1)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58254
Don't synchronize CUDA when profiling in CPU-only mode.
Minor fixes (a clarification in a docstring, a fix for spammy logging).
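A minimal sketch of the gating this describes, with a hypothetical `profile_cuda` flag standing in for the profiler's actual configuration check:

```
#include <cuda_runtime.h>

// Only synchronize the device when CUDA activity is actually being profiled,
// so that GPU timestamps line up with the CPU-side event; in CPU-only mode
// the synchronization is skipped entirely.
void mark_profiler_event(bool profile_cuda) {
  if (profile_cuda) {
    cudaDeviceSynchronize();
  }
  // ...record the CPU-side event here...
}
```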
(Note: this ignores all push blocking failures!)
Test Plan: manual + CI
Reviewed By: gdankel, chaekit
Differential Revision: D28423667
Pulled By: ilia-cher
fbshipit-source-id: 04c71727f528ae8e2e0ff90e88271608d291bc69
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48161
- Register BlackBoxPredictor AllocationArenaPool as CPUCachingAllocator
- Use the AllocationArenaPool in both BlackBoxPredictor and StaticRuntime (a stand-in sketch of the adapter idea follows below)
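A stand-in sketch of the adapter idea; ArenaPool and CachingAllocator below are illustrative placeholders, not the internal AllocationArenaPool or CPUCachingAllocator classes:

```
#include <cstddef>
#include <cstdlib>

// Placeholder pool; the real arena pool reuses memory instead of malloc/free.
struct ArenaPool {
  void* acquire(std::size_t bytes) { return std::malloc(bytes); }
  void release(void* ptr) { std::free(ptr); }
};

// Placeholder caching-allocator interface.
struct CachingAllocator {
  virtual ~CachingAllocator() = default;
  virtual void* allocate(std::size_t bytes) = 0;
  virtual void free(void* ptr) = 0;
};

// Adapter exposing the arena pool through the caching-allocator interface, so
// predictor and StaticRuntime allocations can share one pool.
struct ArenaPoolCachingAllocator final : CachingAllocator {
  explicit ArenaPoolCachingAllocator(ArenaPool* pool) : pool_(pool) {}
  void* allocate(std::size_t bytes) override { return pool_->acquire(bytes); }
  void free(void* ptr) override { pool_->release(ptr); }

 private:
  ArenaPool* pool_;
};
```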
Test Plan:
```
buck run //caffe2/caffe2/fb/predictor:black_box_predictor_test
buck run //caffe2/caffe2/fb/predictor:pytorch_predictor_test
```
AF canary:
https://www.internalfb.com/intern/ads/canary/431021257540238874/
Reviewed By: dzhulgakov
Differential Revision: D24977611
fbshipit-source-id: 33ba596b43c1e558c3ab237a0feeae93565b2d35
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37640
Enable an oversize arena to reduce memory fragmentation. Memory requests with a large size (configurable with FLAGS_caffe2_oversize_threshold) are fulfilled from a dedicated arena separate from the existing huge page arena.
Two additional parameters are introduced to configure the two-phase decay of the memory arena:
- caffe2_dirty_decay_ms
- caffe2_muzzy_decay_ms
In the current JEMalloc implementation, oversized allocations are purged immediately regardless of whether they are placed in an arena, so we need to extend the decay time to indefinite. Currently the default for caffe2_muzzy_decay_ms is set to -1.
We now enable the arena allocator statically. To ensure it is installed correctly regardless of static initialization order, we add a priority argument to c10::SetAllocator, and only higher-priority allocators can overwrite existing ones.
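A minimal, self-contained sketch of that priority gating, with stand-in names rather than the real c10 entry points, and assuming a registration wins when its priority is at least as high as the current one:

```
#include <cstdint>

struct Allocator {
  virtual ~Allocator() = default;
};

namespace {
Allocator* g_cpu_allocator = nullptr;
uint8_t g_cpu_allocator_priority = 0;
}  // namespace

// A new registration replaces the current allocator only if its priority is
// at least as high, so the outcome does not depend on static initialization
// order.
void SetCPUAllocator(Allocator* alloc, uint8_t priority) {
  if (priority >= g_cpu_allocator_priority) {
    g_cpu_allocator = alloc;
    g_cpu_allocator_priority = priority;
  }
}

Allocator* GetCPUAllocator() {
  return g_cpu_allocator;
}

// Static registration of the arena allocator above the default priority of 0.
struct ArenaAllocator final : Allocator {};
static ArenaAllocator g_arena_allocator;
static const bool g_arena_registered = [] {
  SetCPUAllocator(&g_arena_allocator, /*priority=*/1);
  return true;
}();
```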
ghstack-source-id: 103276877
Test Plan:
buck test mode/dev //caffe2/caffe2/fb/init:huge_pages_allocator_test
Benchmarking a known CV model that benefits from the page arena:
```
PyTorchModelBench.cpp:183] test / base : 86.9532%
```
By adjusting ```dirty_decay_ms``` and ```muzzy_decay_ms```, we have the following plots:
https://pxl.cl/15SWW
https://pxl.cl/15TnL
From the figures above, performance does not change much until the dirty decay time is made indefinite (set to -1). Setting either the muzzy decay or the dirty decay time to -1 reaches the best performance, regardless of which one it is. Even setting the decay time to something very long (100 s, longer than the run) does not change performance by much.
## Observe performance difference in production with a variety of models (WIP)
Reviewed By: dzhulgakov
Differential Revision: D21258581
fbshipit-source-id: c006f8b94f28aef0666e52f48d4e82cf0d3a48af
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36032
QNNPACK and XNNPACK may access the input and/or output tensors out of bounds.
This is by design, chosen to make the implementation of micro-kernels
both simpler and faster as a result of not having to individually handle the
corner cases where the number of processed elements is not a multiple of SIMD
register width. This behavior will trigger ASAN though, and may result in a
segfault if the accessed memory location just so happens to fall on a page
the current process has no read access to. Here we define a custom allocator
that allocates the extra storage required to keep this behavior safe. This
allocator could have been restricted to QNNPACK and XNNPACK only, but that
would have negative performance ramifications, as input tensors must now be
reallocated, and copied over, if the tensor is not allocated with this
allocator to begin with. Making this allocator the default on mobile builds
minimizes the probability of unnecessary reallocations and copies, and
also enables acceleration of operations where the output tensor is allocated
outside of the function doing the implementation, wherein the implementation
cannot simply re-allocate the output with the guarding allocator.
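A minimal sketch of the padding idea; the guard size and function names below are illustrative, not the allocator this change actually adds:

```
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Assumed guard size; the real allocator pads by enough bytes to cover the
// widest vector access the micro-kernels can issue.
constexpr std::size_t kGuardBytes = 64;

void* guarded_alloc(std::size_t nbytes) {
  // Over-allocate so a SIMD load that reads past the logical end of the
  // buffer still lands in memory owned by this allocation.
  void* p = std::malloc(nbytes + kGuardBytes);
  if (p != nullptr) {
    // Zero the tail so any out-of-bound reads are deterministic.
    std::memset(static_cast<char*>(p) + nbytes, 0, kGuardBytes);
  }
  return p;
}

void guarded_free(void* p) {
  std::free(p);
}
```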
Test Plan: Imported from OSS
Differential Revision: D20970217
Pulled By: AshkanAliabadi
fbshipit-source-id: 65cca2d38d7c0cef63c732f393016f50f1fa5199