Enables the clang-tidy check [`misc-use-internal-linkage`](https://clang.llvm.org/extra/clang-tidy/checks/misc/use-internal-linkage.html). This check was introduced in Clang-Tidy 18 and became available here with the recent update to Clang-Tidy 19.
The check flags functions and variables that are used only within their translation unit so that they can be marked static. This keeps unwanted symbols from leaking into other units, enables more link-time optimizations, and may shrink the resulting binaries.
Most of the detected violations were fixed by marking the symbols static. In other cases the symbols were in fact consumed by other files, so their declaring headers were included instead. A few declarations were simply wrong and have been corrected.
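For illustration, a minimal sketch of the kind of change this check drives (the function name is hypothetical):
```
// helper.cpp -- computeStride (a hypothetical helper) is used only in
// this translation unit.

// Before: external linkage. The symbol leaks out of this file, other
// translation units could link against it, and the linker must keep it.
// int computeStride(int size, int step) { return (size + step - 1) / step; }

// After: internal linkage. The symbol stays private to this file, name
// collisions with other units are impossible, and the optimizer is free
// to inline or drop it.
static int computeStride(int size, int step) {
  return (size + step - 1) / step;
}
```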
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148948
Approved by: https://github.com/Skylion007
Summary: Similar to the alloc and dealloc events already reported by the PyTorch profiler, we now report Out of Memory events as well. This is useful for performance troubleshooting.
Test Plan: Added test_oom_tracing to test/test_profiler.py
Differential Revision: D36268132
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80050
Approved by: https://github.com/robieta
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70859
ghstack-source-id: 147642534
Test Plan: Extracting code unmodified to a new library: relying on CI to validate.
Reviewed By: malfet
Differential Revision: D33329688
fbshipit-source-id: f60327467d197ec1862fb3554f8b83e6c84cab5c
(cherry picked from commit f82e7c0e9b)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70858
ghstack-source-id: 147642533
Test Plan: Extracted a constant to a new header, trusting CI build to validate.
Reviewed By: malfet
Differential Revision: D33329689
fbshipit-source-id: 8697bb81a5cc3366462ebdf1f214b62d478fa77c
(cherry picked from commit 16663847e1)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66746
Modified loops in files under fbsource/fbcode/caffe2/ from the format
`for (TYPE var = x0; var < x_max; var++)`
to the format
`for (const auto var : irange(x0, x_max))`
This was achieved by running r-barnes's loop upgrader script (D28874212), with some modifications to exclude all files under /torch/jit, plus a number of hand-applied reversions and unused-variable warning suppressions.
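For reference, a minimal before/after example of the transformation (with `x0 = 0`, the common case):
```
#include <cstdint>
#include <iostream>
#include <c10/util/irange.h>

int main() {
  const int64_t x_max = 4;
  // Before: a classic index loop with a mutable counter.
  for (int64_t var = 0; var < x_max; var++) {
    std::cout << var << '\n';
  }
  // After: c10::irange yields the same indices; the loop variable is
  // const and its type is deduced from the bound.
  for (const auto var : c10::irange(x_max)) {
    std::cout << var << '\n';
  }
}
```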
Test Plan: Sandcastle
Reviewed By: malfet
Differential Revision: D31705361
fbshipit-source-id: 33fd22eb03086d114e2c98e56703e8ec84460268
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66234
Modified loops in files under fbsource/fbcode/caffe2/ from the format
`for (TYPE var = x0; var < x_max; var++)`
to the format
`for (const auto var : irange(x0, x_max))`
This was achieved by running r-barnes's loop upgrader script (D28874212), with some modifications to exclude all files under /torch/jit, plus a number of hand-applied reversions and unused-variable warning suppressions.
bypass_size_limit
allow-large-files
Test Plan: Sandcastle
Reviewed By: ngimel
Differential Revision: D30652629
fbshipit-source-id: 0ae6c4bbbb554bad42e372792a6430e1acf15e3e
Summary:
Report the pointer, the allocation size, the total allocated memory, and the total reserved memory all in one report.
`ptr` and `alloc_size` will be used for associating events with the op trace.
`allocated_size` and `reserved_size` will be used for the memory trace.
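As a hedged sketch, the reported fields could be grouped like this (illustrative only, not the actual profiler struct):
```
#include <cstdint>

// Hypothetical grouping of the per-event fields named above.
struct MemoryEvent {
  void* ptr;              // address of the (de)allocation, for op-trace association
  int64_t alloc_size;     // size of this allocation (or free) in bytes
  int64_t allocated_size; // total memory currently allocated, for the memory trace
  int64_t reserved_size;  // total memory currently reserved, for the memory trace
};
```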
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61282
Reviewed By: ejguan
Differential Revision: D29796282
Pulled By: chaekit
fbshipit-source-id: 5314c867632d3af1fa9a3811b35eaa5e931a5d87
Summary:
The GoogleTest `TEST` macro is non-compliant with the `cppcoreguidelines-avoid-non-const-global-variables` check, as is `DEFINE_DISPATCH`.
All changes except the ones to `.clang-tidy` were generated using the following script:
```
# Strip the NOLINTNEXTLINE suppressions for this check from every C/C++
# source and header that mentions it.
for i in $(find . -type f -iname "*.c*" -or -iname "*.h" \
             | xargs grep cppcoreguidelines-avoid-non-const-global-variables \
             | cut -f1 -d: | sort | uniq); do
  sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" "$i"
done
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008
Reviewed By: driazati, r-barnes
Differential Revision: D29838584
Pulled By: malfet
fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58254
Don't use CUDA synchronization when profiling in CPU-only mode.
Also includes minor fixes: a clarification in a docstring and a fix for spammy logging.
(Note: this ignores all push blocking failures!)
Test Plan: manual + CI
Reviewed By: gdankel, chaekit
Differential Revision: D28423667
Pulled By: ilia-cher
fbshipit-source-id: 04c71727f528ae8e2e0ff90e88271608d291bc69
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51421
Explicitly mark, in the profiler output, memory events that did not happen
within an operator context.
Test Plan: python test/test_profiler.py -k test_memory_profiler
Reviewed By: ngimel
Differential Revision: D26166518
Pulled By: ilia-cher
fbshipit-source-id: 3c14d3ac25a7137733ea7cc65f0eb48693a98f5e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48161
- Register BlackBoxPredictor AllocationArenaPool as CPUCachingAllocator
- Use the AllocationArenaPool in both BlackBoxPredictor and StaticRuntime
Test Plan:
```
buck run //caffe2/caffe2/fb/predictor:black_box_predictor_test
buck run //caffe2/caffe2/fb/predictor:pytorch_predictor_test
```
AF canary:
https://www.internalfb.com/intern/ads/canary/431021257540238874/
Reviewed By: dzhulgakov
Differential Revision: D24977611
fbshipit-source-id: 33ba596b43c1e558c3ab237a0feeae93565b2d35
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43951
AllocationPlan: stores the sequence of allocations, their sizes, and their lifetimes. It also stores `total_size`, the total size of the single memory blob required to satisfy all of the allocations, along with the offset of each allocation within that blob. Thus the allocation plan contains:
- allocation sizes
- allocation lifetimes
- allocation offsets
- total size

AllocationPlanner: takes a pointer to an AllocationPlan and fills it with the plan, i.e. the sizes, lifetimes, offsets, and total size. This is done via WithProfileAllocationsGuard, which takes an `AllocationPlan*`, constructs an `AllocationPlanner*`, and sets the thread-local `allocation_planner` to it. MobileCPUAllocator then profiles allocations via `allocation_planner`. In WithValidateAllocationsGuard, the allocations profiled into the allocation plan are validated.

CPUProfilingAllocator: owned by the application. Using WithProfilingAllocatorGuard, the application passes in both the CPUProfilingAllocator and the AllocationPlan created earlier; the CPUProfilingAllocator then manages allocations and frees according to the plan. Allocations not managed by the CPUProfilingAllocator are routed through c10::alloc_cpu and c10::free_cpu.
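A hedged usage sketch of the three phases, using the guard names from the description above; the header path, the exact constructor signatures, and the `run_inference` helper are assumptions:
```
#include <c10/mobile/CPUProfilingAllocator.h> // assumed header path

void run_inference(); // stand-in for one deterministic inference pass

void profile_validate_replay() {
  c10::AllocationPlan plan;
  {
    // Phase 1: record allocation sizes and lifetimes into the plan.
    c10::WithProfileAllocationsGuard profile_guard(&plan);
    run_inference();
  }
  bool plan_valid = false;
  {
    // Phase 2: check that a fresh run matches the recorded plan
    // (the bool out-parameter is an assumption of this sketch).
    c10::WithValidateAllocationsGuard validate_guard(&plan, &plan_valid);
    run_inference();
  }
  c10::CPUProfilingAllocator profiling_allocator; // owned by the application
  {
    // Phase 3: replay. Allocations are served at the precomputed offsets
    // within one blob of total_size bytes; unmanaged allocations fall
    // back to c10::alloc_cpu / c10::free_cpu.
    c10::WithProfilingAllocatorGuard allocator_guard(&profiling_allocator, &plan);
    run_inference();
  }
}
```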
Test Plan:
cpu_profiling_allocator_test on mobile.
Imported from OSS
Reviewed By: dreiss
Differential Revision: D23451019
fbshipit-source-id: 98bf1dbcfa8fcfb83d505ac01095e84a3f5b778d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45364
Also adds some more comments about the usage, limitations, and drawbacks.
Test Plan: Build and run benchmark binary.
Reviewed By: gchanan
Differential Revision: D23944193
fbshipit-source-id: 30d4f4991d2185a0ab768d94c846d73730fc0835
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42006
This PR introduces a simple CPU caching allocator, intended specifically
for mobile use cases and for inference. Nothing in the implementation
prevents other use cases, but its simplicity may not be suitable
everywhere.
It simply tracks allocations by size and relies on deterministic,
repeatable behavior in which allocations of the same sizes are made on
every inference.
Thus, when a pointer is freed after the first allocation, instead of
returning the memory to the system, the allocator caches it for
subsequent use.
Memory is freed automatically at the end of the process, or it can be
explicitly freed.
At the moment this is enabled in DefaultMobileCPUAllocator only.
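A minimal sketch of the size-bucketed caching idea described above, under the stated assumption of deterministic, repeatable allocation sizes (names and details are illustrative, not the actual DefaultMobileCPUAllocator):
```
#include <cstdlib>
#include <mutex>
#include <unordered_map>
#include <vector>

class SimpleCachingAllocator {
 public:
  void* allocate(size_t size) {
    std::lock_guard<std::mutex> guard(mutex_);
    auto& bucket = free_blocks_[size];
    if (!bucket.empty()) {
      void* ptr = bucket.back(); // reuse a cached block of the same size
      bucket.pop_back();
      return ptr;
    }
    void* ptr = std::malloc(size); // cache miss: fall back to the system
    allocation_sizes_[ptr] = size;
    return ptr;
  }

  void free(void* ptr) {
    std::lock_guard<std::mutex> guard(mutex_);
    // Instead of returning the memory to the system, cache it for reuse.
    free_blocks_[allocation_sizes_.at(ptr)].push_back(ptr);
  }

  // Explicitly release all currently cached blocks back to the system.
  void free_cached() {
    std::lock_guard<std::mutex> guard(mutex_);
    for (auto& bucket : free_blocks_) {
      for (void* ptr : bucket.second) {
        std::free(ptr);
      }
    }
    free_blocks_.clear();
  }

 private:
  std::mutex mutex_;
  std::unordered_map<size_t, std::vector<void*>> free_blocks_;
  std::unordered_map<void*, size_t> allocation_sizes_;
};
```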
Test Plan:
android test: cpu_caching_allocator_test
Imported from OSS
Reviewed By: dreiss
Differential Revision: D22726976
fbshipit-source-id: 9a38b1ce34059d5653040a1c3d035bfc97609e6c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37640
Enable an oversize arena to reduce memory fragmentation. Memory requests above a threshold (configurable with FLAGS_caffe2_oversize_threshold) are fulfilled from a dedicated arena, separate from the existing huge-page arena.
Two additional parameters are introduced to configure the two-phase decay of the memory arena:
- caffe2_dirty_decay_ms
- caffe2_muzzy_decay_ms
In the current JEMalloc implementation, oversized allocations are purged immediately regardless of which arena they are placed in, so we need to extend the decay time indefinitely; the default for caffe2_muzzy_decay_ms is therefore set to -1.
We now enable the arena allocator statically. To ensure it is correctly installed regardless of static initialization order, we add a priority flag to c10::SetAllocator so that only higher-priority allocators can overwrite existing ones.
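A minimal sketch of that priority mechanism (illustrative, not the exact c10::SetAllocator code):
```
#include <cstdint>

struct Allocator; // opaque here

namespace {
Allocator* g_cpu_allocator = nullptr;
uint8_t g_cpu_allocator_priority = 0;
} // namespace

// Only an allocator registered with equal or higher priority may replace
// the currently installed one, so the final winner is independent of
// static initialization order.
void SetAllocator(Allocator* alloc, uint8_t priority = 0) {
  if (priority >= g_cpu_allocator_priority) {
    g_cpu_allocator = alloc;
    g_cpu_allocator_priority = priority;
  }
}
```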
ghstack-source-id: 103276877
Test Plan:
buck test mode/dev //caffe2/caffe2/fb/init:huge_pages_allocator_test
Benchmarking known CV model that benefits from page arena:
```
PyTorchModelBench.cpp:183] test / base : 86.9532%
```
By adjusting `dirty_decay_ms` and `muzzy_decay_ms`, we obtained the following plots:
https://pxl.cl/15SWW
https://pxl.cl/15TnL
The figures show that performance does not change much until the dirty decay time is made indefinite (set to -1). Setting either muzzy decay or dirty decay to -1 reaches the best performance, regardless of which one it is. Even setting the decay time very long (100s, longer than the run) does not change performance by much.
## Observe performance difference in production with a variety of models (WIP)
Reviewed By: dzhulgakov
Differential Revision: D21258581
fbshipit-source-id: c006f8b94f28aef0666e52f48d4e82cf0d3a48af
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36032
QNNPACK and XNNPACK may access the input and/or output tensors out of bounds.
This is by design, chosen to make the implementation of micro-kernels
both simpler and faster as a result of not having to individually handle the
corner cases where the number of processed elements is not a multiple of SIMD
register width. This behavior will trigger ASAN though, and may result in a
segfault if the accessed memory location just so happens to fall on a page
the current process has no read access to. Here we define a custom allocator
that allocates the extra storage required to keep this behavior safe. This
allocator could have been restricted to QNNPACK and XNNPACK only, but that
would have negative performance ramifications, as input tensors must now be
reallocated, and copied over, if the tensor is not allocated with this
allocator to begin with. Making this allocator the default on mobile builds
minimizes the probability of unnecessary reallocations and copies, and
also enables acceleration of operations where the output tensor is allocated
outside of the function doing the implementation, wherein the implementation
cannot simply re-allocate the output with the guarding allocator.
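A hedged sketch of the guarding idea (the constant and helper names are illustrative, not the actual mobile allocator):
```
#include <cstddef>
#include <cstdlib>

// Widest access the micro-kernels are assumed to make past the logical
// end of a tensor; the real value would be chosen to cover all kernels.
constexpr size_t kGuardBytes = 64;

// Over-allocate so that out-of-bound reads by QNNPACK / XNNPACK
// micro-kernels land in memory this process owns, avoiding the ASAN
// reports and page-fault scenario described above.
void* guarded_alloc(size_t nbytes) {
  return std::malloc(nbytes + kGuardBytes);
}

void guarded_free(void* ptr) {
  std::free(ptr);
}
```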
Test Plan: Imported from OSS
Differential Revision: D20970217
Pulled By: AshkanAliabadi
fbshipit-source-id: 65cca2d38d7c0cef63c732f393016f50f1fa5199
Summary:
Some legacy TH code was relying on alloc to throw when called with a negative number, e.g. `torch.linspace(0, 1, -1)`, and this breaks the ASAN build. I still believe alloc should receive a size_t, but I added a safety enforce inside.
This should fix ASAN. I'll follow up separately with a proper fix for empty_cpu (which is probably the right place to do it).
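A sketch of the kind of safety enforce described (illustrative; the actual code and error mechanism differ):
```
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Reject negative sizes before they are implicitly converted into an
// enormous unsigned value and handed to the system allocator.
void* alloc_cpu(std::ptrdiff_t nbytes) {
  if (nbytes < 0) {
    throw std::runtime_error(
        "alloc_cpu() called with negative size: " + std::to_string(nbytes));
  }
  return std::malloc(static_cast<size_t>(nbytes));
}
```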
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17071
Differential Revision: D14074157
Pulled By: dzhulgakov
fbshipit-source-id: 3ed3bdb873e446edecb558e1df491310fd7179e3