mirror of
https://github.com/zebrajr/pytorch.git
synced 2025-12-07 12:21:27 +01:00
Summary: Resubmit of https://github.com/pytorch/pytorch/issues/58811, Closes gh-24745

The existing PR (gh-50655) has been stalled because `TensorIterator` doesn't guarantee iteration order in the same way that `TH_TENSOR_APPLY` does. For contiguous test cases this isn't an issue, but it breaks down, for example, with channels-last format. I resolve this by adding a new `TensorIteratorConfig` parameter, `enforce_linear_iteration`, which disables dimension reordering. I've also added a test case for non-contiguous tensors to verify this works.

This PR also significantly improves performance by adding multithreading support to the algorithm. As part of this, I wrote a custom `count_nonzero` that gives per-thread counts, which are necessary to write the outputs in the right location.

|   Shape    |  Before | After (1 thread) | After (8 threads) |
|:----------:|--------:|-----------------:|------------------:|
| 256,128,32 | 2610 us |          2150 us |            551 us |
| 128,128,32 | 1250 us |          1020 us |            197 us |
|  64,128,32 |  581 us |           495 us |             99 us |
|  32,128,32 |  292 us |           255 us |             83 us |
|  16,128,32 |  147 us |           126 us |             75 us |
|   8,128,32 |   75 us |            65 us |             65 us |
|   4,128,32 |   39 us |            33 us |             33 us |
|   2,128,32 |   20 us |            18 us |             18 us |
|   1,128,32 |   11 us |             9 us |              9 us |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59149

Reviewed By: mruberry

Differential Revision: D28817466

Pulled By: ngimel

fbshipit-source-id: f08f6c003c339368fd53dabd28e9ada9e59de732
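The role of the per-thread counts can be illustrated with a minimal sketch. This is not the code from the PR: `chunked_nonzero` and its parameters are hypothetical names, the input is a flat `std::vector<int>` rather than a tensor, and the chunks are processed sequentially here where a real implementation would hand each chunk to a thread. The idea it shows is that once every chunk knows how many nonzero elements it holds, an exclusive prefix sum over those counts gives each chunk a private, non-overlapping output range, so the output order matches a single linear pass.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (an assumption of this sketch, not PyTorch's API):
// returns the indices of nonzero elements, computed chunk by chunk the way
// independent worker threads could.
inline std::vector<int64_t> chunked_nonzero(
    const std::vector<int>& data, std::size_t num_chunks) {
  const std::size_t n = data.size();
  if (num_chunks == 0) {
    num_chunks = 1;
  }
  const std::size_t chunk = (n + num_chunks - 1) / num_chunks;

  // Pass 1: each chunk counts its own nonzero elements.
  std::vector<std::size_t> counts(num_chunks, 0);
  for (std::size_t c = 0; c < num_chunks; ++c) {
    const std::size_t begin = std::min(c * chunk, n);
    const std::size_t end = std::min(begin + chunk, n);
    for (std::size_t i = begin; i < end; ++i) {
      if (data[i] != 0) {
        ++counts[c];
      }
    }
  }

  // Exclusive prefix sum: offsets[c] is where chunk c starts writing.
  std::vector<std::size_t> offsets(num_chunks, 0);
  for (std::size_t c = 1; c < num_chunks; ++c) {
    offsets[c] = offsets[c - 1] + counts[c - 1];
  }

  // Pass 2: each chunk writes only into its own output range.
  std::vector<int64_t> out(offsets.back() + counts.back());
  for (std::size_t c = 0; c < num_chunks; ++c) {
    const std::size_t begin = std::min(c * chunk, n);
    const std::size_t end = std::min(begin + chunk, n);
    std::size_t pos = offsets[c];
    for (std::size_t i = begin; i < end; ++i) {
      if (data[i] != 0) {
        out[pos++] = static_cast<int64_t>(i);
      }
    }
  }
  return out;
}
```

Because the iterations of both outer loops touch disjoint data, each could run on its own thread without synchronization beyond a barrier between the two passes.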
30 lines
667 B
C++
#pragma once
#include <c10/macros/Macros.h>

// Utility to guarantee complete unrolling of a loop where the bounds are known
// at compile time. Various pragmas achieve similar effects, but are not as
// portable across compilers.

// Example: c10::ForcedUnroll<4>{}(f); is equivalent to f(0); f(1); f(2); f(3);

namespace c10 {

template <int n>
struct ForcedUnroll {
  template <typename Func>
  C10_ALWAYS_INLINE void operator()(const Func& f) const {
    ForcedUnroll<n - 1>{}(f);
    f(n - 1);
  }
};

template <>
struct ForcedUnroll<1> {
  template <typename Func>
  C10_ALWAYS_INLINE void operator()(const Func& f) const {
    f(0);
  }
};

} // namespace c10
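A short usage sketch may help show how the recursion expands. To keep it compilable without PyTorch, it uses a local copy of the template in a hypothetical `sketch` namespace and drops `C10_ALWAYS_INLINE`; `sum4` and `visit_order` are illustrative helpers that do not exist in c10.

```cpp
#include <array>
#include <cassert>

// Local stand-in for c10::ForcedUnroll so this sketch builds on its own.
// Assumption: C10_ALWAYS_INLINE is omitted, so inlining is left to the compiler.
namespace sketch {

template <int n>
struct ForcedUnroll {
  template <typename Func>
  void operator()(const Func& f) const {
    ForcedUnroll<n - 1>{}(f); // expands f(0) ... f(n - 2) first
    f(n - 1);
  }
};

template <>
struct ForcedUnroll<1> {
  template <typename Func>
  void operator()(const Func& f) const {
    f(0); // base case ends the recursion
  }
};

} // namespace sketch

// Hypothetical helper: sums a fixed-size array via the unrolled calls.
inline int sum4(const std::array<int, 4>& values) {
  int total = 0;
  sketch::ForcedUnroll<4>{}([&](int i) { total += values[i]; });
  return total;
}

// Hypothetical helper: records the order in which indices are visited.
inline std::array<int, 3> visit_order() {
  std::array<int, 3> order{};
  int next = 0;
  sketch::ForcedUnroll<3>{}([&](int i) { order[next++] = i; });
  return order;
}
```

Note that the functor is taken by `const Func&`, so a lambda passed in must have a const-callable `operator()`; capturing state by reference, as above, is one way to accumulate results. The template also has no specialization for `n == 0`, so instantiating `ForcedUnroll<0>` would recurse without bound at compile time.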