pytorch/c10/util/Unroll.h
Peter Bell 99f2000a99 Migrate nonzero from TH to ATen (CPU) (#59149)
Summary:
Resubmit of https://github.com/pytorch/pytorch/issues/58811, Closes gh-24745

The existing PR (gh-50655) has stalled because `TensorIterator` doesn't guarantee iteration order in the same way that `TH_TENSOR_APPLY` does. For contiguous test cases this isn't an issue, but it breaks down, for example, with channels-last format. I resolve this by adding a new `TensorIteratorConfig` parameter, `enforce_linear_iteration`, which disables dimension reordering. I've also added a test case for non-contiguous tensors to verify this works.
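Roughly, the new flag slots into the usual `TensorIteratorConfig` builder pattern as below. This is a sketch, not code from the PR: `make_linear_iter` is a made-up helper, and the actual nonzero kernel configures its iterator differently.

```cpp
#include <ATen/ATen.h>
#include <ATen/native/TensorIterator.h>

// Sketch: build an iterator that keeps plain row-major (linear) iteration
// order instead of letting TensorIterator reorder dimensions for locality,
// so outputs can be written in sequential index order.
at::TensorIterator make_linear_iter(at::Tensor& out, const at::Tensor& self) {
  return at::TensorIteratorConfig()
      .add_output(out)
      .add_input(self)
      .enforce_linear_iteration() // new in this PR: disable dim reordering
      .build();
}
```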

This PR also significantly improves performance by adding multithreading support to the algorithm. As part of this, I wrote a custom `count_nonzero` that gives per-thread counts, which are needed to write each thread's outputs to the correct locations.
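The scheme is a standard two-pass parallel compaction: each thread counts the nonzeros in its chunk, an exclusive prefix sum over those counts gives each thread its write offset into the output, and a second pass writes the indices. Below is a self-contained plain-C++ illustration of the idea (not the ATen kernel itself, which iterates via `TensorIterator`):

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

// Illustration only: 1-D input, returns the indices of nonzero elements.
// Assumes num_threads >= 1.
std::vector<int64_t> nonzero_indices_parallel(const std::vector<float>& data,
                                              int num_threads) {
  const int64_t n = static_cast<int64_t>(data.size());
  const int64_t chunk = (n + num_threads - 1) / num_threads;

  // Pass 1: per-thread nonzero counts.
  std::vector<int64_t> counts(num_threads, 0);
  {
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
      workers.emplace_back([&, t] {
        const int64_t begin = t * chunk;
        const int64_t end = std::min(n, begin + chunk);
        for (int64_t i = begin; i < end; ++i) {
          counts[t] += (data[i] != 0.0f);
        }
      });
    }
    for (auto& w : workers) w.join();
  }

  // Exclusive prefix sum: offsets[t] is where thread t starts writing.
  std::vector<int64_t> offsets(num_threads, 0);
  std::partial_sum(counts.begin(), counts.end() - 1, offsets.begin() + 1);

  // Pass 2: each thread writes its chunk's nonzero indices at its offset,
  // so the result stays in ascending index order with no synchronization.
  std::vector<int64_t> out(offsets.back() + counts.back());
  {
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
      workers.emplace_back([&, t] {
        int64_t pos = offsets[t];
        const int64_t begin = t * chunk;
        const int64_t end = std::min(n, begin + chunk);
        for (int64_t i = begin; i < end; ++i) {
          if (data[i] != 0.0f) {
            out[pos++] = i;
          }
        }
      });
    }
    for (auto& w : workers) w.join();
  }
  return out;
}
```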

|    Shape   |  Before | After (1 thread) | After (8 threads) |
|:----------:|--------:|-----------------:|------------------:|
| 256,128,32 | 2610 us |          2150 us |            551 us |
| 128,128,32 | 1250 us |          1020 us |            197 us |
|  64,128,32 |  581 us |           495 us |             99 us |
|  32,128,32 |  292 us |           255 us |             83 us |
|  16,128,32 |  147 us |           126 us |             75 us |
|  8,128,32  |   75 us |            65 us |             65 us |
|  4,128,32  |   39 us |            33 us |             33 us |
|  2,128,32  |   20 us |            18 us |             18 us |
|  1,128,32  |   11 us |             9 us |              9 us |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59149

Reviewed By: mruberry

Differential Revision: D28817466

Pulled By: ngimel

fbshipit-source-id: f08f6c003c339368fd53dabd28e9ada9e59de732
2021-06-02 12:26:29 -07:00

#pragma once

#include <c10/macros/Macros.h>

// Utility to guarantee complete unrolling of a loop where the bounds are known
// at compile time. Various pragmas achieve similar effects, but are not as
// portable across compilers.
// Example: c10::ForcedUnroll<4>{}(f); is equivalent to f(0); f(1); f(2); f(3);

namespace c10 {

template <int n>
struct ForcedUnroll {
  template <typename Func>
  C10_ALWAYS_INLINE void operator()(const Func& f) const {
    ForcedUnroll<n - 1>{}(f);
    f(n - 1);
  }
};

template <>
struct ForcedUnroll<1> {
  template <typename Func>
  C10_ALWAYS_INLINE void operator()(const Func& f) const {
    f(0);
  }
};

} // namespace c10
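
For reference, a minimal usage sketch (illustrative only, not part of the header): summing four floats with the loop fully unrolled at compile time. The functor receives the iteration index as a plain `int`.

```cpp
#include <c10/util/Unroll.h>

// Expands at compile time to acc += x[0]; acc += x[1]; acc += x[2]; acc += x[3];
float sum4(const float* x) {
  float acc = 0.0f;
  c10::ForcedUnroll<4>{}([&](int i) { acc += x[i]; });
  return acc;
}
```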