Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20390
Duc Ngo (duc0) implemented observing floating point exceptions, but there were a couple of places where "benign" floating point exceptions led to false positives. This diff eliminates one such source of false positives: calling _mm256_cvtph_ps and _mm256_cvtps_ph on a partially uninitialized array in the remainder loop.
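For context, a minimal sketch of the pattern (not the exact PyTorch code, and the names are illustrative): zero-filling the staging buffer used by the remainder loop means _mm256_cvtph_ps only ever sees valid fp16 bit patterns, so the unused lanes cannot raise spurious exception flags.

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstring>

// Convert the trailing `rem` (< 8) fp16 values to fp32 via a staging buffer.
// Zero-initializing the buffer means the unused lanes hold a valid fp16 zero,
// so _mm256_cvtph_ps never operates on garbage bits (build with -mf16c -mavx).
static void cvt_fp16_to_fp32_tail(const std::uint16_t* src, float* dst, int rem) {
  alignas(16) std::uint16_t in[8] = {0};
  alignas(32) float out[8];
  std::memcpy(in, src, rem * sizeof(std::uint16_t));
  __m256 v = _mm256_cvtph_ps(_mm_load_si128(reinterpret_cast<const __m128i*>(in)));
  _mm256_store_ps(out, v);
  std::memcpy(dst, out, rem * sizeof(float));
}
```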
Reviewed By: hx89
Differential Revision: D15307358
fbshipit-source-id: 38f57dfdd90c70bc693292d2f9c33c7ba558e2c9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15388
This is another pass to make the perfkernels code safer from illegal instruction errors.
Removed the dependency on c10/util/Logging.h.
We err on the safe side at the expense of some verbosity.
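As an aside, the general safeguard against illegal instruction crashes in perfkernels is runtime dispatch. A minimal, hypothetical sketch of the idea (not the actual Caffe2 dispatch code):

```cpp
#include <cstdio>

// Hypothetical kernels: in perfkernels the AVX2 body lives in a separate
// *_avx2.cc translation unit that is compiled with -mavx2.
static void embedding_lookup_avx2() { std::puts("avx2 path"); }
static void embedding_lookup_base() { std::puts("scalar fallback"); }

// Dispatch at runtime so AVX2 instructions are never executed on a CPU
// that lacks them -- the source of "illegal instruction" crashes.
void embedding_lookup() {
#if defined(__GNUC__) || defined(__clang__)
  if (__builtin_cpu_supports("avx2")) {
    embedding_lookup_avx2();
    return;
  }
#endif
  embedding_lookup_base();
}

int main() { embedding_lookup(); }
```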
Reviewed By: dskhudia
Differential Revision: D13502902
fbshipit-source-id: 4f833115df885c5b4f8c1ca83b9badea1553f944
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15389
SparseLengthsMean was generating uninitialized data for empty inputs (lengths == 0). We should return zeros.
The unit tests also did not cover this special case; this diff adds coverage for it.
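A simplified scalar sketch of the required behavior (names and signature are illustrative, not the operator's actual interface):

```cpp
#include <cstring>

// Hypothetical, simplified SparseLengthsMean: for each output segment,
// average the selected rows; a segment with lengths[s] == 0 must produce an
// all-zero row instead of leaving the output uninitialized.
void sparse_lengths_mean(const float* data, int block_size,
                         const int* indices, const int* lengths,
                         int num_segments, float* out) {
  int pos = 0;
  for (int s = 0; s < num_segments; ++s) {
    float* row = out + s * block_size;
    std::memset(row, 0, block_size * sizeof(float));  // zero even when lengths[s] == 0
    for (int i = 0; i < lengths[s]; ++i) {
      const float* src = data + indices[pos++] * block_size;
      for (int j = 0; j < block_size; ++j) row[j] += src[j];
    }
    if (lengths[s] > 0) {
      for (int j = 0; j < block_size; ++j) row[j] /= lengths[s];
    }
  }
}
```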
Reviewed By: salexspb
Differential Revision: D13515970
fbshipit-source-id: 3c35265638f64f13f0262cee930c94f8628005da
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14950
Minimize the number of headers included from _avx2.cc files, to avoid accidentally compiling (with AVX2 flags) functions defined in header files that are reused by other translation units, which can lead to illegal instruction errors.
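Roughly, the pattern being enforced looks like this hypothetical *_avx2.cc (file and identifier names are illustrative):

```cpp
// embedding_lookup_avx2.cc -- compiled with -mavx2 (hypothetical file).
//
// Include only what the kernel itself needs (here just intrinsics and
// <cstdint>). Any inline function pulled in from a shared header could be
// emitted here with AVX2 codegen; if the linker then keeps this copy instead
// of the one from a non-AVX2 translation unit, other callers crash with an
// illegal instruction on older CPUs.
#include <immintrin.h>
#include <cstdint>

void EmbeddingLookup_fp32_avx2(
    std::int64_t block_size,
    std::int64_t index_size,
    const float* input,
    const std::int64_t* indices,
    float* out) {
  // ... AVX2 implementation of the lookup goes here ...
  (void)block_size; (void)index_size; (void)input; (void)indices; (void)out;
}
```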
Reviewed By: dskhudia
Differential Revision: D13394483
fbshipit-source-id: 67149a6fb51f7f047e745bfe395cb6dd4ae7c1ae
Summary:
Updates the perfkernel codebase to implement embedding lookup for our new fused storage format, where each row in the data matrix stores the quantized values *and* the scale and bias.
msmelyan, see this as my best-effort attempt at updating the perfkernel code for the fused storage. Let me know if any of this is grossly wrong. I also don't know whether we need to update any of the prefetching operations or anything like that.
Note that we have to keep the old code around for a bit until we get rid of the old operations with separate `scale_bias` storage.
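For reference, a sketch of how a fused row is laid out and dequantized: the quantized bytes followed by a per-row float scale and bias at the end of the row (exact field order and widths in the production format may differ; names are illustrative).

```cpp
#include <cstdint>
#include <cstring>

// A fused 8-bit row: `block_size` quantized bytes, then float scale, then
// float bias. Dequantization is x = q * scale + bias.
float fused_row_element(const std::uint8_t* row, int block_size, int j) {
  float scale, bias;
  std::memcpy(&scale, row + block_size, sizeof(float));
  std::memcpy(&bias, row + block_size + sizeof(float), sizeof(float));
  return row[j] * scale + bias;
}
```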
Reviewed By: kennyhorror
Differential Revision: D6710843
fbshipit-source-id: b485ef2389f526c5db1260cac9d4be3fc8df0979
Summary: 8 bytes is 64 bits. Fixes an out-of-range access caught by ASAN.
Reviewed By: Yangqing
Differential Revision: D6219576
fbshipit-source-id: f7c418b12fa211890abcb5aef800bd456390b73a
Summary: Previously, the boundary check happened after the first access in the 8-bit ops.
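The fixed ordering amounts to the usual check-then-access pattern; an illustrative (not actual) sketch:

```cpp
#include <cstdint>
#include <cstring>

// Validate the index *before* touching the row, rather than loading the row
// first and checking the bound afterwards.
bool gather_row(const std::uint8_t* data, std::int64_t num_rows,
                std::int64_t row_bytes, std::int64_t idx, std::uint8_t* out) {
  if (idx < 0 || idx >= num_rows) {
    return false;  // reject out-of-range indices up front
  }
  std::memcpy(out, data + idx * row_bytes, row_bytes);
  return true;
}
```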
Reviewed By: Yangqing
Differential Revision: D6206753
fbshipit-source-id: 07ab240cae8c67b3048f03aa79af0b6399b9940b
Summary:
Adding uint8 support to the code generator for high-performance embedding look-up kernels, supporting the Sum, WeightedSum, and Mean reducers. Added a number of unit tests for these operators.
Performance Results
===================
Performance results are below for the old code (sparse_lengths_sum_benchmark.old.par), which uses the code in lengths_reducer_rowwise_8bit_ops.h, and for our new code (sparse_lengths_sum_benchmark.new.par), optimized via the code generator. Block size was 128 in all cases.
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.old.par --iteration 10000 --dtype uint8
I0912 02:49:58.773259 2640913 net_simple.cc:162] Time per operator type:
I0912 02:49:58.773264 2640913 net_simple.cc:171] 0.75769 SparseLengthsSum8BitsRowwise
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype uint8
I0912 02:50:33.981832 2642102 net_simple.cc:162] Time per operator type:
I0912 02:50:33.981837 2642102 net_simple.cc:171] 0.233322 SparseLengthsSum8BitsRowwise
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype float16
I0912 02:51:26.748972 2643925 net_simple.cc:162] Time per operator type:
I0912 02:51:26.748977 2643925 net_simple.cc:171] 0.106591 SparseLengthsSum
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype float
I0913 01:39:22.372238 1076874 net_simple.cc:162] Time per operator type:
I0913 01:39:22.372244 1076874 net_simple.cc:171] 0.211041 SparseLengthsSum
Analysis
========
Our optimized generated code is ~3.5x faster than the original code in lengths_reducer_rowwise_8bit_ops.h, as shown above.
However, our uint8 is about 2x slower than float16 and is on par with float32. There are several reasons for that:
1. uint8 introduces extra instructions to multiply by the scale and add the bias (see the sketch after this analysis).
2. In addition to the embedding blocks, we are now also reading scale_bias. For every pair of scale and bias, we bring in an entire 64-byte cache line while only using 8 bytes. A 128-wide uint8 input block occupies only 2 cache lines, so reading a nearly entire extra cache line of useless data adds to the bandwidth wastage.
3. The hardware prefetcher also runs past the end of the input block and the scale_bias cache line, trying to prefetch more useless lines. This effect was characterized in the Appendix section of
https://fb.facebook.com/notes/jason-lu/sparse-adagrad-performance-optimization-in-model-training/10214810437360961/
To get deeper insight into what is going on, we isolated the SparseLengthsSum and SparseLengthsSum8BitsRowwise code, for float32, float16, and uint8, into a microbenchmark where we varied the block size while keeping the table size constant (256MB):
block_size   time(uint8)   time(float16)   time(float32)
64           0.19          0.09            0.17
128          0.12          0.09            0.17
256          0.70          0.09            0.14
1024         0.50          0.06            0.10
The pattern for block sizes of 64 and 128 is similar to what we observed in sparse_lengths_sum_benchmark. However, as block_size increases (for a fixed table size), the time to perform the embedding lookups decreases quite drastically. For block_size of 256 and beyond, uint8 starts achieving a speedup over float16. A longer block better amortizes the bandwidth wastage due to scale_bias and the hardware prefetcher running past the end of the block.
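The sketch referenced in item 1 above: scalar versions of the inner accumulation loops (the generated kernels are vectorized AVX2 code, but the extra per-element work is the same). Names are illustrative.

```cpp
#include <cstdint>

// fp32 path: one fused multiply-add per element.
void accumulate_fp32_row(const float* row, float w, float* out, int block_size) {
  for (int j = 0; j < block_size; ++j) {
    out[j] += w * row[j];
  }
}

// uint8 path: an extra multiply and add per element to apply scale/bias,
// on top of the extra scale_bias load per row.
void accumulate_uint8_row(const std::uint8_t* row, float scale, float bias,
                          float w, float* out, int block_size) {
  for (int j = 0; j < block_size; ++j) {
    out[j] += w * (row[j] * scale + bias);
  }
}
```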
Reviewed By: kennyhorror
Differential Revision: D5870907
fbshipit-source-id: 445321b96f1b5801ef91f296f6063c35673ee11b
Summary:
Adding uint8 support to the code generator for high-performance embedding look-up kernels, supporting the Sum, WeightedSum, and Mean reducers. Added a number of unit tests for these operators.
Performance Results
===================
Performance results are below for the old code (sparse_lengths_sum_benchmark.old.par), which uses the code in lengths_reducer_rowwise_8bit_ops.h, and for our new code (sparse_lengths_sum_benchmark.new.par), optimized via the code generator. Block size was 128 in all cases.
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.old.par --iteration 10000 --dtype uint8
I0912 02:49:58.773259 2640913 net_simple.cc:162] Time per operator type:
I0912 02:49:58.773264 2640913 net_simple.cc:171] 0.75769 SparseLengthsSum8BitsRowwise
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype uint8
I0912 02:50:33.981832 2642102 net_simple.cc:162] Time per operator type:
I0912 02:50:33.981837 2642102 net_simple.cc:171] 0.233322 SparseLengthsSum8BitsRowwise
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype float16
I0912 02:51:26.748972 2643925 net_simple.cc:162] Time per operator type:
I0912 02:51:26.748977 2643925 net_simple.cc:171] 0.106591 SparseLengthsSum
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype float
I0913 01:39:22.372238 1076874 net_simple.cc:162] Time per operator type:
I0913 01:39:22.372244 1076874 net_simple.cc:171] 0.211041 SparseLengthsSum
Analysis
========
Our optimized generated code is ~3.5x faster than the original code in lengths_reducer_rowwise_8bit_ops.h, as shown above.
However, our uint8 is about 2x slower than float16 and is on par with float32. There are several reasons for that:
1. uint8 introduces extra instructions to multiply by the scale and add the bias.
2. In addition to the embedding blocks, we are now also reading scale_bias. For every pair of scale and bias, we bring in an entire 64-byte cache line while only using 8 bytes. A 128-wide uint8 input block occupies only 2 cache lines, so reading a nearly entire extra cache line of useless data adds to the bandwidth wastage.
3. The hardware prefetcher also runs past the end of the input block and the scale_bias cache line, trying to prefetch more useless lines. This effect was characterized in the Appendix section of
https://fb.facebook.com/notes/jason-lu/sparse-adagrad-performance-optimization-in-model-training/10214810437360961/
To get deeper insight into what is going on, we isolated the SparseLengthsSum and SparseLengthsSum8BitsRowwise code, for float32, float16, and uint8, into a microbenchmark where we varied the block size while keeping the table size constant (256MB):
block_size   time(uint8)   time(float16)   time(float32)
64           0.19          0.09            0.17
128          0.12          0.09            0.17
256          0.70          0.09            0.14
1024         0.50          0.06            0.10
The pattern for block sizes of 64 and 128 is similar to what we observed in sparse_lengths_sum_benchmark. However, as block_size increases (for a fixed table size), the time to perform the embedding lookups decreases quite drastically. For block_size of 256 and beyond, uint8 starts achieving a speedup over float16. A longer block better amortizes the bandwidth wastage due to scale_bias and the hardware prefetcher running past the end of the block.
Reviewed By: dzhulgakov
Differential Revision: D5824641
fbshipit-source-id: 3a5c020294d84874da78c6943e596423393473d6