It turns out that we're invoking the copy assignment operator in AppendOnlyList. While copy elision is expected to mostly hide any cost, this does present issues for types with deleted copy assignment operators. (It also seems to produce slightly worse assembly: https://godbolt.org/z/o4Gvz1fKs)
Calling placement new at the correct position seems to be a better way to go about this, at least judging by other high performance containers such as SmallVector.
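As a rough illustration of the difference (this is a simplified sketch, not the actual AppendOnlyList; the class and member names are hypothetical and there is no bounds checking), constructing elements into raw storage with placement new avoids ever calling the assignment operator:

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <string>
#include <utility>

// Hypothetical fixed-capacity append buffer. The key point: elements are
// built directly in uninitialized storage via placement new, so types with
// deleted copy assignment operators still work.
template <typename T, std::size_t N>
class FixedAppendBuffer {
 public:
  FixedAppendBuffer() : next_(data()) {}
  ~FixedAppendBuffer() {
    for (T* p = data(); p != next_; ++p) {
      p->~T();  // storage is raw bytes, so we destroy elements manually
    }
  }

  template <typename... Args>
  T& emplace_back(Args&&... args) {
    // Placement new constructs in uninitialized memory. Writing
    // `*next_ = T(...)` instead would invoke the copy/move assignment
    // operator on an object that was never constructed.
    T* out = new (next_) T(std::forward<Args>(args)...);
    ++next_;
    return *out;
  }

  std::size_t size() const { return static_cast<std::size_t>(next_ - data()); }
  const T& operator[](std::size_t i) const { return data()[i]; }

 private:
  T* data() { return reinterpret_cast<T*>(buf_); }
  const T* data() const { return reinterpret_cast<const T*>(buf_); }

  alignas(T) unsigned char buf_[N * sizeof(T)];  // uninitialized storage
  T* next_;
};
```

Because nothing is default-constructed up front, this also sidesteps the cost of value-initializing the whole buffer.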
Differential Revision: [D39852804](https://our.internmc.facebook.com/intern/diff/D39852804/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85716
Approved by: https://github.com/chaekit
It is apparently undefined behavior to do pointer arithmetic on nullptr. In the case of AppendOnlyList, `next_` will only be null if `end_` is also null, and thus the `memcpy` path will only be triggered when `n == 0`. Nonetheless, it is UB to call `memcpy(0, 0, 0)`.
The extra null check is in a `C10_LIKELY` block, so the extra cost should be negligible; indeed, after dusting off the component microbenchmarks there's no observable difference.
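The shape of the fix is roughly the following (a sketch with an illustrative function name, not the PyTorch code; in PyTorch the condition would be wrapped in the `C10_LIKELY` branch-hint macro):

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Guard the memcpy so it is never called with null pointers. Even
// memcpy(nullptr, nullptr, 0) is undefined behavior per the C standard,
// although in practice the null case here only arises when n == 0.
inline void safe_copy(void* dst, const void* src, std::size_t n) {
  if (dst != nullptr && src != nullptr) {  // C10_LIKELY in the real code
    std::memcpy(dst, src, n);
  }
}
```

The branch is almost always taken, so it predicts perfectly and costs essentially nothing.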
Differential Revision: [D38969443](https://our.internmc.facebook.com/intern/diff/D38969443/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83963
Approved by: https://github.com/slgong-fb
Summary:
- Use an atomic counter in the Block storage constructor, plus the offset within the block, to calculate the correlation_id.
- Deduce the correlation id implicitly, so there is no need to store it while profiling.
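The scheme can be sketched as follows (an illustrative toy, not the profiler code; `Block`, `kBlockSize`, and the counter name are assumptions): each block claims a contiguous id range from a global atomic counter, and an event's correlation id is the block's base plus the event's offset, so nothing is stored per event.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kBlockSize = 1024;  // ids reserved per block
std::atomic<std::uint64_t> next_block_id{0};

struct Block {
  std::uint64_t base_id;
  // Each block atomically claims [base_id, base_id + kBlockSize).
  Block() : base_id(next_block_id.fetch_add(1) * kBlockSize) {}

  // Correlation id is deduced from position in the block; no per-event
  // storage and no synchronization on the hot path.
  std::uint64_t correlation_id(std::uint64_t offset_in_block) const {
    return base_id + offset_in_block;
  }
};
```

Since no two blocks share a base and offsets stay below kBlockSize, ids are unique across threads by construction.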
Test Plan:
Added test_profiler_correlation_id() in test_profiler.py to check the uniqueness of correlation ids.
To run the test:
python test_profiler.py
Differential Revision: D36793803
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78643
Approved by: https://github.com/robieta
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74484
In my first attempt at this in December I stamped out specializations using variadic templates. However, I'm able to get comparable performance using simple conditionals, since the branch is very predictable and AppendOnlyList::emplace_back is low enough overhead that multiple calls don't cause an issue.
This is also a chance to do some BE (better engineering): rather than forcing ops and backend events to use the same fields (which in practice means setting a bunch of default values when reporting backend events), I just split them and use a variant.
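The split looks roughly like this (a hypothetical illustration; the struct and field names here are invented, not the actual profiler types): two dedicated payload structs held in a `std::variant`, so neither kind of event carries dummy fields for the other.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <variant>

// Only the fields each event kind actually needs.
struct OpEvent {
  std::string name;
  std::uint64_t sequence_number;
};
struct BackendEvent {
  std::string backend;
  std::string name;
};

using EventPayload = std::variant<OpEvent, BackendEvent>;

// Shared fields can still be accessed uniformly via std::visit.
inline const std::string& event_name(const EventPayload& e) {
  return std::visit(
      [](const auto& ev) -> const std::string& { return ev.name; }, e);
}
```

Consumers that care which kind they have use `std::holds_alternative` / `std::get`; consumers that don't can visit.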
Test Plan: The single threaded benchmark (with no extra options set) improved considerably from ~0.88 us to ~0.62 us. The stress test benchmark improved modestly from ~6.1 us to ~5.8 us. So the bottleneck for multi-threading is somewhere else, but doing less wasted work is still able to move the needle a little bit.
Reviewed By: swolchok
Differential Revision: D34779994
fbshipit-source-id: 392bc7c6f12797fa5e18777063aa21210d9d2067
(cherry picked from commit f0a49ff7be8aa65bab2f6952cc2e6306c1edc24b)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73409
We can do better than `vector` or `deque`, and it's sufficiently important to the hot path to justify a custom container. (This is part of the larger queue refactor, but this is a standalone drop-in replacement so we don't need to wait.)
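A toy sketch of why append-only access lets us beat the general-purpose containers (this is a simplification, not the PyTorch implementation, though `next_`/`end_` mirror the member names mentioned in the later commits): since elements are never erased or moved, storage can be a chain of fixed-size blocks with a bump pointer, so there is no reallocation-and-copy as in `vector` and less per-push bookkeeping than `deque`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <forward_list>

template <typename T, std::size_t BlockSize>
class AppendOnlySketch {
 public:
  void push_back(const T& value) {
    if (next_ == end_) {  // current block exhausted (or no block yet)
      blocks_.emplace_front();  // old blocks never move, so no copies
      next_ = blocks_.front().data();
      end_ = next_ + BlockSize;
    }
    // Simplified: std::array elements are already constructed, so plain
    // assignment is fine here (the real container uses raw storage).
    *next_++ = value;
    ++size_;
  }

  std::size_t size() const { return size_; }

 private:
  std::forward_list<std::array<T, BlockSize>> blocks_;
  T* next_ = nullptr;  // bump pointer into the current block
  T* end_ = nullptr;   // one past the current block
  std::size_t size_ = 0;
};
```

The hot path is a single predictable compare plus a pointer bump, which is what matters for profiler overhead.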
Test Plan: It's a pretty simple container type, so I just added a few cpp tests for emplace and read back. I also ran the overhead benchmark (replicates=9) with both `--stressTestKineto` (0.99 -> 0.94 us) and `--stressTestKineto --kinetoProfileMemory` (1.36 -> 1.27 us).
Reviewed By: swolchok
Differential Revision: D34231072
fbshipit-source-id: ed57299729d444d59cf843a0d38a3ee2240eeec1
(cherry picked from commit 43907948f3a8d2137244e7bb59f43999bd660917)