Applies some more fixes to headers that may have been missed before, for performance optimization.
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @EikanWang @ezyang since this is more in the series of the clang-tidy fixups
This PR fixes three main issues:
1. Use emplacement more in headers.
1. Avoid unnecessary copies and use const references where possible.
1. Default special member functions where possible to make them potentially trivial and more readable.
1. There is also one change in this PR that tries to prevent unnecessary math promotion; the rest of those changes are in another PR (a short illustrative sketch of these patterns follows below).
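For concreteness, here is a minimal sketch of these patterns; the types and functions are illustrative only, not code from this PR:

```
#include <string>
#include <utility>
#include <vector>

// Defaulted special member functions read clearly and stay trivial when the
// member types allow it.
struct Entry {
  std::string name;
  Entry() = default;
  Entry(const Entry&) = default;
  Entry& operator=(const Entry&) = default;
};

// Take a const reference instead of a by-value copy, and construct the
// element in place instead of pushing a temporary pair.
void collect(std::vector<std::pair<std::string, int>>& out,
             const std::string& key,
             int value) {
  out.emplace_back(key, value);
}

// A float literal keeps the arithmetic in float instead of promoting it to
// double.
float half_of(float x) {
  return 0.5f * x;
}
```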
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91445
Approved by: https://github.com/ezyang
Summary: Similar to the alloc and dealloc events already reported by the PyTorch profiler, we now report out-of-memory (OOM) events as well. This is useful for performance troubleshooting.
Test Plan: Added test_oom_tracing to test/test_profiler.py
Differential Revision: D36268132
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80050
Approved by: https://github.com/robieta
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70859
ghstack-source-id: 147642534
Test Plan: Extracting code unmodified to a new library; relying on CI to validate.
Reviewed By: malfet
Differential Revision: D33329688
fbshipit-source-id: f60327467d197ec1862fb3554f8b83e6c84cab5c
(cherry picked from commit f82e7c0e9b)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70858
ghstack-source-id: 147642533
Test Plan: Extracted a constant to a new header, trusting the CI build to validate.
Reviewed By: malfet
Differential Revision: D33329689
fbshipit-source-id: 8697bb81a5cc3366462ebdf1f214b62d478fa77c
(cherry picked from commit 16663847e1)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58254
Don't synchronize CUDA when profiling in CPU-only mode.
Minor fixes (a clarification in a docstring, a fix for spammy logging).
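A minimal sketch of the gating this describes, with a hypothetical `profile_cuda` flag standing in for the profiler's actual configuration check:

```
#include <cuda_runtime.h>

// Only synchronize the device when CUDA activity is actually being profiled,
// so that GPU timestamps line up with the CPU-side event; in CPU-only mode
// the synchronization is skipped entirely.
void mark_profiler_event(bool profile_cuda) {
  if (profile_cuda) {
    cudaDeviceSynchronize();
  }
  // ...record the CPU-side event here...
}
```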
(Note: this ignores all push blocking failures!)
Test Plan: manual + CI
Reviewed By: gdankel, chaekit
Differential Revision: D28423667
Pulled By: ilia-cher
fbshipit-source-id: 04c71727f528ae8e2e0ff90e88271608d291bc69
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48161
- Register BlackBoxPredictor AllocationArenaPool as CPUCachingAllocator
- Use the AllocationArenaPool in both BlackBoxPredictor and StaticRuntime (a stand-in sketch of the adapter idea follows below)
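A stand-in sketch of the adapter idea; ArenaPool and CachingAllocator below are illustrative placeholders, not the internal AllocationArenaPool or CPUCachingAllocator classes:

```
#include <cstddef>
#include <cstdlib>

// Placeholder pool; the real arena pool reuses memory instead of malloc/free.
struct ArenaPool {
  void* acquire(std::size_t bytes) { return std::malloc(bytes); }
  void release(void* ptr) { std::free(ptr); }
};

// Placeholder caching-allocator interface.
struct CachingAllocator {
  virtual ~CachingAllocator() = default;
  virtual void* allocate(std::size_t bytes) = 0;
  virtual void free(void* ptr) = 0;
};

// Adapter exposing the arena pool through the caching-allocator interface, so
// predictor and StaticRuntime allocations can share one pool.
struct ArenaPoolCachingAllocator final : CachingAllocator {
  explicit ArenaPoolCachingAllocator(ArenaPool* pool) : pool_(pool) {}
  void* allocate(std::size_t bytes) override { return pool_->acquire(bytes); }
  void free(void* ptr) override { pool_->release(ptr); }

 private:
  ArenaPool* pool_;
};
```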
Test Plan:
```
buck run //caffe2/caffe2/fb/predictor:black_box_predictor_test
buck run //caffe2/caffe2/fb/predictor:pytorch_predictor_test
```
AF canary:
https://www.internalfb.com/intern/ads/canary/431021257540238874/
Reviewed By: dzhulgakov
Differential Revision: D24977611
fbshipit-source-id: 33ba596b43c1e558c3ab237a0feeae93565b2d35
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37640
Enable an oversize arena to reduce memory fragmentation. Memory requests with a large size (configurable with FLAGS_caffe2_oversize_threshold) are fulfilled from a dedicated arena separate from the existing huge page arena.
Two additional parameters are introduced to configure the two-phase decay of the memory arena:
- caffe2_dirty_decay_ms
- caffe2_muzzy_decay_ms
In the current JEMalloc implementation, oversized allocations are purged immediately regardless of whether they are placed in an arena, so we need to extend the decay time to indefinite. Currently the default for caffe2_muzzy_decay_ms is set to -1.
We now enable the arena allocator statically. To ensure it is installed correctly regardless of static initialization order, we add a priority argument to c10::SetAllocator, and only higher-priority allocators can overwrite existing ones.
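A minimal, self-contained sketch of that priority gating, with stand-in names rather than the real c10 entry points, and assuming a registration wins when its priority is at least as high as the current one:

```
#include <cstdint>

struct Allocator {
  virtual ~Allocator() = default;
};

namespace {
Allocator* g_cpu_allocator = nullptr;
uint8_t g_cpu_allocator_priority = 0;
}  // namespace

// A new registration replaces the current allocator only if its priority is
// at least as high, so the outcome does not depend on static initialization
// order.
void SetCPUAllocator(Allocator* alloc, uint8_t priority) {
  if (priority >= g_cpu_allocator_priority) {
    g_cpu_allocator = alloc;
    g_cpu_allocator_priority = priority;
  }
}

Allocator* GetCPUAllocator() {
  return g_cpu_allocator;
}

// Static registration of the arena allocator above the default priority of 0.
struct ArenaAllocator final : Allocator {};
static ArenaAllocator g_arena_allocator;
static const bool g_arena_registered = [] {
  SetCPUAllocator(&g_arena_allocator, /*priority=*/1);
  return true;
}();
```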
ghstack-source-id: 103276877
Test Plan:
buck test mode/dev //caffe2/caffe2/fb/init:huge_pages_allocator_test
Benchmarking a known CV model that benefits from the page arena:
```
PyTorchModelBench.cpp:183] test / base : 86.9532%
```
By adjusting ```dirty_decay_ms``` and ```muzzy_decay_ms```, we have the following plots:
https://pxl.cl/15SWW
https://pxl.cl/15TnL
From the figures above, performance does not change much until the dirty decay time is made indefinite (set to -1). Setting either the muzzy decay or the dirty decay time to -1 reaches the best performance, regardless of which one it is. Even setting the decay time to something very long (100 s, longer than the run) does not change performance by much.
## Observe performance difference in production with a variety of models (WIP)
Reviewed By: dzhulgakov
Differential Revision: D21258581
fbshipit-source-id: c006f8b94f28aef0666e52f48d4e82cf0d3a48af
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36032
QNNPACK and XNNPACK may access the input and/or output tensors out of bounds.
This is by design, chosen to make the implementation of micro-kernels
both simpler and faster as a result of not having to individually handle the
corner cases where the number of processed elements is not a multiple of SIMD
register width. This behavior will trigger ASAN though, and may result in a
segfault if the accessed memory location just so happens to fall on a page
the current process has no read access to. Here we define a custom allocator
that allocates the extra storage required to keep this behavior safe. This
allocator could have been restricted to QNNPACK and XNNPACK only, but that
would have negative performance ramifications, as input tensors must now be
reallocated, and copied over, if the tensor is not allocated with this
allocator to begin with. Making this allocator the default on mobile builds
minimizes the probability of unnecessary reallocations and copies, and
also enables acceleration of operations where the output tensor is allocated
outside of the function doing the implementation, wherein the implementation
cannot simply re-allocate the output with the guarding allocator.
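A minimal sketch of the padding idea; the guard size and function names below are illustrative, not the allocator this change actually adds:

```
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Assumed guard size; the real allocator pads by enough bytes to cover the
// widest vector access the micro-kernels can issue.
constexpr std::size_t kGuardBytes = 64;

void* guarded_alloc(std::size_t nbytes) {
  // Over-allocate so a SIMD load that reads past the logical end of the
  // buffer still lands in memory owned by this allocation.
  void* p = std::malloc(nbytes + kGuardBytes);
  if (p != nullptr) {
    // Zero the tail so any out-of-bound reads are deterministic.
    std::memset(static_cast<char*>(p) + nbytes, 0, kGuardBytes);
  }
  return p;
}

void guarded_free(void* p) {
  std::free(p);
}
```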
Test Plan: Imported from OSS
Differential Revision: D20970217
Pulled By: AshkanAliabadi
fbshipit-source-id: 65cca2d38d7c0cef63c732f393016f50f1fa5199