Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35763
Adds inference function and test for ScatterAssign
Test Plan: Updated unit test
Reviewed By: yyetim, shunting1986
Differential Revision: D20501079
fbshipit-source-id: 7ec6ef0127a151250dd699c90c2b80c35cfb1fe4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35857
This fixes shape and type inference for a number of common ops in InferBlobShapesAndTypes, and adds support for testing the inferred shapes and types of gradient ops.
Ops:
* Concat
* Split
* LeakyReLU
* Relu
* Prelu
* Gelu
* Elu
* Sinh, Tanh, Cosh
* Abs
* ... and a number of other simple element-wise ops
Test Plan:
Added support to hypothesis test to check the shape and type of gradient ops.
Enabled it for all the ops I fixed the shape and type inference for.
buck test caffe2/caffe2/python/operator_test:
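As a rough illustration of the kind of check involved, a minimal sketch using the public `workspace.InferShapesAndTypes` helper (not the actual hypothesis-test hook; the op and blob names are illustrative):
```
import numpy as np
from caffe2.python import core, workspace

workspace.ResetWorkspace()
workspace.FeedBlob("X", np.random.randn(4, 8).astype(np.float32))

net = core.Net("relu_net")
net.Relu(["X"], ["Y"])

# Ask Caffe2 to infer shapes/types for every blob in the net...
shapes, types = workspace.InferShapesAndTypes([net])

# ...then run the net and compare the inferred shape against the real output.
workspace.RunNetOnce(net)
assert tuple(shapes["Y"]) == workspace.FetchBlob("Y").shape
```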
Reviewed By: pradeepd24
Differential Revision: D20806284
fbshipit-source-id: 77f796d9ff208e09e871bdbadf9a0a7c196b77f2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35555
Att. So that we can lower the SparseLengthsSum* part of SparseLengthsSum*Sparse. We update the tying policy between Gather and SparseLengthsWeightedSum* so that we don't bother lowering a single Gather into the backend, which is inefficient to execute on the card and creates bubbles between consecutive lowered graphs.
Test Plan:
```
buck test glow/fb/test:test_onnxifinnpi
```
Reviewed By: ipiszy
Differential Revision: D20688525
fbshipit-source-id: cb8e38239057ff13a8d385ed09d0d019421de78b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35507
We want to split the SparseLengthsSumSparse op into an indirection op and the SparseLengthsSum op so that we can lower the latter part. The indirection part is a plain implementation for now.
Test Plan:
```
for i in `seq 10`; do buck test caffe2/caffe2/python/operator_test:lengths_reducer_fused_nbit_rowwise_ops_test -- test_sparse_lengths_sum_rowwise_sparse; done
```
Reviewed By: jspark1105
Differential Revision: D20683478
fbshipit-source-id: 509effe88719d20aa0c4783bbe0ce1f183ee473c
Summary:
## Motivation
This PR upgrades MKL-DNN from v0.20 to DNNL v1.2 and resolves https://github.com/pytorch/pytorch/issues/30300.
DNNL (Deep Neural Network Library) is the new brand of MKL-DNN, which improves performance, quality, and usability over the old version.
This PR focuses on the migration of all existing functionality, including minor fixes, performance improvements, and code cleanup. It serves as the cornerstone of our future efforts to accommodate new features like OpenCL support, BF16 training, INT8 inference, etc., and to let the PyTorch community derive more benefits from the Intel Architecture.
## What's included?
Even though DNNL has many breaking changes to its API, we managed to absorb most of them in ideep. This PR contains minimal changes to the integration code in PyTorch. Below is a summary of the changes:
**General:**
1. Replace op-level allocator with global-registered allocator
```
// before
ideep::sum::compute<AllocForMKLDNN>(scales, {x, y}, z);
// after
ideep::sum::compute(scales, {x, y}, z);
```
The allocator is now registered at `aten/src/ATen/native/mkldnn/IDeepRegistration.cpp`. Thereafter, all tensors derived from the `cpu_engine` (by default) will use the c10 allocator.
```
RegisterEngineAllocator cpu_alloc(
ideep::engine::cpu_engine(),
[](size_t size) {
return c10::GetAllocator(c10::DeviceType::CPU)->raw_allocate(size);
},
[](void* p) {
c10::GetAllocator(c10::DeviceType::CPU)->raw_deallocate(p);
}
);
```
------
2. Simplify group convolution
We had a scenario in convolution where the ideep tensor shape mismatched the aten tensor shape: when `groups > 1`, DNNL expects weight tensors to be 5-d with an extra group dimension, e.g. `goihw` instead of `oihw` in the 2d conv case.
As shown below, a lot of extra checks came with this difference in shape. Now we've completely hidden this difference inside ideep, and all tensors align with PyTorch's definition, so we can safely remove these checks from both the aten and c2 integration code.
```
// aten/src/ATen/native/mkldnn/Conv.cpp
if (w.ndims() == x.ndims() + 1) {
AT_ASSERTM(
groups > 1,
"Only group _mkldnn_conv2d weights could have been reordered to 5d");
kernel_size[0] = w.get_dim(0) * w.get_dim(1);
std::copy_n(
w.get_dims().cbegin() + 2, x.ndims() - 1, kernel_size.begin() + 1);
} else {
std::copy_n(w.get_dims().cbegin(), x.ndims(), kernel_size.begin());
}
```
------
3. Enable DNNL built-in cache
Previously, we stored DNNL jitted kernels along with intermediate buffers inside ideep using an LRU cache. Now we are switching to the newly added DNNL built-in cache and are **no longer** caching buffers, in order to reduce the memory footprint.
This change will mainly show up as lower memory usage in memory profiling results. On the code side, we removed the couple of lines of `op_key_` bookkeeping that the ideep cache depended on before.
------
4. Use 64-bit integer to denote dimensions
We changed the type of `ideep::dims` from `vector<int32_t>` to `vector<int64_t>`. This renders ideep dims no longer compatible with the 32-bit dims used by caffe2, so we use something like `{stride_.begin(), stride_.end()}` to cast a parameter such as `stride_` into an int64 vector.
**Misc changes in each commit:**
**Commit:** change build options
Some build options were slightly changed, mainly to avoid name collisions with other projects that include DNNL as a subproject. In addition, DNNL built-in cache is enabled by option `DNNL_ENABLE_PRIMITIVE_CACHE`.
Old | New
-- | --
WITH_EXAMPLE | MKLDNN_BUILD_EXAMPLES
WITH_TEST | MKLDNN_BUILD_TESTS
MKLDNN_THREADING | MKLDNN_CPU_RUNTIME
MKLDNN_USE_MKL | N/A (not use MKL anymore)
------
**Commit:** aten reintegration
- aten/src/ATen/native/mkldnn/BinaryOps.cpp
Implement binary ops using new operation `binary` provided by DNNL
- aten/src/ATen/native/mkldnn/Conv.cpp
Clean up group convolution checks
Simplify conv backward integration
- aten/src/ATen/native/mkldnn/MKLDNNConversions.cpp
Simplify prepacking convolution weights
- test/test_mkldnn.py
Fixed an issue in the conv2d unit test: previously it did not actually compare conv results between the mkldnn and aten implementations; it compared mkldnn with mkldnn, since the default CPU path also goes through mkldnn. We now use `torch.backends.mkldnn.flags` to disable mkldnn for the reference path (see the sketch after this list).
- torch/utils/mkldnn.py
Prepack the weight tensor in the module's `__init__` to significantly improve performance
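A minimal sketch of the comparison pattern referenced above, assuming the `torch.utils.mkldnn.to_mkldnn` module converter and `torch.backends.mkldnn.flags`; illustrative, not the actual test code:
```
import copy
import torch
import torch.utils.mkldnn as mkldnn_utils

conv = torch.nn.Conv2d(3, 6, kernel_size=3)
x = torch.randn(1, 3, 8, 8)

# Reference path: disable MKL-DNN so the plain ATen kernel runs.
with torch.backends.mkldnn.flags(enabled=False):
    y_ref = conv(x)

# MKL-DNN path: prepacked module operating on an MKL-DNN layout tensor.
y_mkldnn = mkldnn_utils.to_mkldnn(copy.deepcopy(conv))(x.to_mkldnn()).to_dense()

assert torch.allclose(y_ref, y_mkldnn, atol=1e-5)
```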
------
**Commit:** caffe2 reintegration
- caffe2/ideep/ideep_utils.h
Clean up unused type definitions
- caffe2/ideep/operators/adam_op.cc & caffe2/ideep/operators/momentum_sgd_op.cc
Unify tensor initialization with `ideep::tensor::init`. Obsolete `ideep::tensor::reinit`
- caffe2/ideep/operators/conv_op.cc & caffe2/ideep/operators/quantization/int8_conv_op.cc
Clean up group convolution checks
Revamp convolution API
- caffe2/ideep/operators/conv_transpose_op.cc
Clean up group convolution checks
Clean up deconv workaround code
------
**Commit:** custom allocator
- Register c10 allocator as mentioned above
## Performance
We tested inference on some common models based on user scenarios, and most performance numbers are either better than or on par with DNNL 0.20.
ratio: new / old | Latency (batch=1 4T) | Throughput (batch=64 56T)
-- | -- | --
pytorch resnet18 | 121.4% | 99.7%
pytorch resnet50 | 123.1% | 106.9%
pytorch resnext101_32x8d | 116.3% | 100.1%
pytorch resnext50_32x4d | 141.9% | 104.4%
pytorch mobilenet_v2 | 163.0% | 105.8%
caffe2 alexnet | 303.0% | 99.2%
caffe2 googlenet-v3 | 101.1% | 99.2%
caffe2 inception-v1 | 102.2% | 101.7%
caffe2 mobilenet-v1 | 356.1% | 253.7%
caffe2 resnet101 | 100.4% | 99.8%
caffe2 resnet152 | 99.8% | 99.8%
caffe2 shufflenet | 141.1% | 69.0% †
caffe2 squeezenet | 98.5% | 99.2%
caffe2 vgg16 | 136.8% | 100.6%
caffe2 googlenet-v3 int8 | 100.0% | 100.7%
caffe2 mobilenet-v1 int8 | 779.2% | 943.0%
caffe2 resnet50 int8 | 99.5% | 95.5%
_Configuration:
Platform: Skylake 8180
Latency Test: 4 threads, warmup 30, iteration 500, batch size 1
Throughput Test: 56 threads, warmup 30, iteration 200, batch size 64_
† Shufflenet is one of the few models that require temp buffers during inference. The performance degradation is an expected issue, since we no longer cache any buffers in ideep. As a remedy, we suggest users opt for a caching allocator like **jemalloc** as a drop-in replacement for the system allocator in such heavy workloads.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32422
Test Plan:
Perf results: https://our.intern.facebook.com/intern/fblearner/details/177790608?tab=Experiment%20Results
10% improvement for ResNext with avx512, neutral on avx2
More results: https://fb.quip.com/ob10AL0bCDXW#NNNACAUoHJP
Reviewed By: yinghai
Differential Revision: D20381325
Pulled By: dzhulgakov
fbshipit-source-id: 803b906fd89ed8b723c5fcab55039efe3e4bcb77
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35430
This fixes and adds tests for several commonly used operators.
There are some formatting differences due to running clang-format on one of the files.
Test Plan: buck test //caffe2/caffe2/fb/operators:hypothesis_test //caffe2/caffe2/python/operator_test:utility_ops_test //caffe2/caffe2/python/operator_test:concat_split_op_test
Reviewed By: yyetim
Differential Revision: D20657405
fbshipit-source-id: 51d86d0834003b8ac8d6acb5149ae13d7bbfc6ab
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35346
The weight scale op doesn't have a GPU implementation, which is breaking OSS CI from D20506032. Making it CPU-only.
Test Plan: OSS CI
Reviewed By: ustctf
Differential Revision: D20637440
fbshipit-source-id: 9aa6cce63ce637ab7856788e5d02f527decb2a26
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34394
# SWA operator
In this diff, we added a new operator `SWA` which will be used in `AdaGradOptimizer`.
The algorithm looks like:
{F230902995}
# Background
In our tests, we found that this operator considerably improves our models' reproducibility (KT: 0.86 -> 0.92).
So we hope to land this operator and, in the future, enable it by default in our models.
Test Plan:
Local build `aml.dper3:30f068668cfb408fbb40141fb17129f2` and bento kernel.
- Local test: n215857
- f174600345
Reviewed By: chocjy
Differential Revision: D20165239
fbshipit-source-id: c03cdd048cb10b091e5f06323f4c0f3999f95d8a
Summary: Add transfer_learning_blob_name_mappings into layer_model_helper to support layer model transfer learning
Reviewed By: mraway
Differential Revision: D20286298
fbshipit-source-id: de3e029611d843f38d3f42ecd4148358f7e14a2b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34903
Reattempt of D20461609
Moving 2/4-bit SLS and row-wise 2/4-bit conversion operator to open source to be used by DLRM
Test Plan: CI
Reviewed By: jianyuh
Differential Revision: D20495304
fbshipit-source-id: 66a99677583f50fd40e29c514710c7b1a8cdbc29
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34783
Moving 2/4-bit SLS and row-wise 2/4-bit conversion operator to open source to be used by DLRM
Test Plan: CI
Reviewed By: yinghai
Differential Revision: D20461609
fbshipit-source-id: b3ef73ff10f2433afe06ffa73fe1145282d9ec4c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34515
Once upon a time we thought this was necessary. In reality it is not, so we are removing it.
For backcompat, our public interface (defined in `api/`) still has
typedefs to the old `script::` names.
There was only one collision: `Pass` as a `Stmt` and `Pass` as a graph
transform. I renamed one of them.
Test Plan: Imported from OSS
Differential Revision: D20353503
Pulled By: suo
fbshipit-source-id: 48bb911ce75120a8c9e0c6fb65262ef775dfba93
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34318
Stop checking whether we have AMD GPU devices on the host, because we may be constructing a net on a machine without a GPU and running the net on another one with a GPU.
Reviewed By: ajauhri
Differential Revision: D20269562
fbshipit-source-id: 1f561086cacdcead3ce7c03c2d02c25336c8b11a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33977
Removing python2 from operator_test so we can retire python2 support for PyTorch.
Test Plan: waitforsandcastle
Reviewed By: seemethere
Differential Revision: D20129500
fbshipit-source-id: d4c82e4acfc795be9bec6a162c713e37ffb9f5ff
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33431
Some elementwise operators don't have shape and type inference specified for the output tensor: `BitwiseOr`, `BitwiseAnd`, `BitwiseXor`, `Not`, `Sign`.
This change fixes this issue:
- For `Not` and `Sign` operators, the output has the same type and shape as the input, so `IdenticalTypeAndShapeOfInput` function is used to specify that.
- For bitwise operators created by `CAFFE2_SCHEMA_FOR_BINARY_BITWISE_OP` macro, the type and shape inference rules should be the same as for other binary element-wise operators, so `TensorInferenceFunction(ElementwiseOpShapeInference)` is used to specify that.
Also, some tests were modified to ensure that the shape and type are inferred (via the `ensure_outputs_are_inferred` parameter).
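A hedged sketch of what such a test can look like with the hypothesis test utilities (the op choice and names are illustrative):
```
import numpy as np
from caffe2.python import core
import caffe2.python.hypothesis_test_util as hu

class TestSignShapeInference(hu.HypothesisTestCase):
    def test_sign(self):
        X = np.random.randn(4, 5).astype(np.float32)
        op = core.CreateOperator("Sign", ["X"], ["Y"])
        # With ensure_outputs_are_inferred=True (and CAFFE2_ASSERT_SHAPEINFERENCE=1),
        # the check fails if shape/type inference does not match the actual output.
        self.assertReferenceChecks(
            hu.cpu_do, op, [X], lambda x: [np.sign(x)],
            ensure_outputs_are_inferred=True,
        )
```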
Test Plan:
```
CAFFE2_ASSERT_SHAPEINFERENCE=1 buck test caffe2/caffe2/python/operator_test:elementwise_ops_test
CAFFE2_ASSERT_SHAPEINFERENCE=1 buck test caffe2/caffe2/python/operator_test:math_ops_test
```
Note that the tests have to be executed with `CAFFE2_ASSERT_SHAPEINFERENCE=1` in order to fail upon shape inference failure.
Reviewed By: idning
Differential Revision: D19880164
fbshipit-source-id: 5d7902e045d79e5669e5e98dfb13a39711294939
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33426
Make 2/4/8-bit fused rowwise conversion operators more general to work for N-dim tensors
Test Plan: CI
Reviewed By: ellie-wen
Differential Revision: D19943136
fbshipit-source-id: 47008544dd7e1d11a346d34f35449e0fcc0e7ee0
Summary: In dper2, the local net is hard-coded by whitelisting some layers. Add SparseFeatureGating-related layers to the local net explicitly.
Test Plan:
* workflow: f167812211
* QRT: fall back looks normal
{F228442018}
Differential Revision: D19852280
fbshipit-source-id: 6fecc3d745c3f742d029575a7b9fe320618f1863
Summary:
For both the Caffe2 and PyTorch backends, enable 3D convolutions through MIOpen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33067
Reviewed By: BIT-silence
Differential Revision: D19880495
Pulled By: bddppq
fbshipit-source-id: 8f6f970910654c1c5aa871b48a04c1054875691c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32271
Use the 2-stage EmbeddingSpMDM interface in D19425982 to reduce the overhead of code cache lookup and lock contention.
Fix an issue in sparse_lengths_sum_benchmarks that generated empty indices when the average length is small, e.g. 1.
Test Plan: CI
Reviewed By: dskhudia
Differential Revision: D19425987
fbshipit-source-id: d5c5f0d46e0072403901809c31d516fa0f4b9b31
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32448
Using binary search to compute the value for the given quantile among the input tensors.
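A generic sketch of the bisection idea in plain numpy (not the operator's exact semantics or tie-breaking):
```
import numpy as np

def quantile_by_binary_search(values, q, tol=1e-6):
    # Bisect on the value range instead of sorting: find v such that
    # roughly a fraction q of the inputs are <= v.
    lo, hi = float(np.min(values)), float(np.max(values))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if np.mean(values <= mid) < q:
            lo = mid
        else:
            hi = mid
    return hi
```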
Test Plan: Newly added unittests;
Reviewed By: jspark1105
Differential Revision: D19487604
fbshipit-source-id: 0dc6627b78d1310ac35b3f1d53b89cc89a697ece
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32475
As title
Test Plan: CI
Reviewed By: houseroad
Differential Revision: D19508778
fbshipit-source-id: fd9ad63607535980505d155f3e3c3b7c6b95daf7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32086
np.clip(1, num_indices // 2, 10) -> np.clip(num_indices // 2, 1, 10)
Also rename batchsize -> num_rows to match what the variable actually does.
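For context, the difference (`np.clip` takes the value to clip first, then the lower and upper bounds):
```
import numpy as np

num_indices = 100
# Buggy: the value being clipped is the constant 1, and num_indices // 2
# is (mis)used as the lower bound.
np.clip(1, num_indices // 2, 10)
# Fixed: clip num_indices // 2 into the range [1, 10].
np.clip(num_indices // 2, 1, 10)
```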
Test Plan: CI
Reviewed By: hx89
Differential Revision: D19361521
fbshipit-source-id: 9ce864c7d7da046dc606afa5207da677ccf80f52
Summary:
Per discussion with Fei Tian, we need to add a `scale_init_value` to scale down the output of normalization such as batch-norm and layer-norm.
Currently we have `sparse_normalization_options` to normalize embedding pooling output. By default scale = 1.0; we found it's better to set scale between 0.025 and 0.1 https://fb.quip.com/MiKUAibEaYhH
Besides, I am removing the tags from normalizers because it makes more sense to compute the norm ops on distributed trainers, not on the PS.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31983
Test Plan:
Testing LN and BN after sum-pooling --
baseline f160348514
LN: f160348609
BN: f160348710
{F226106518}
Layer norm after sum-pooling fwd_net https://fburl.com/sa4j207n
Layer norm after dot-prod fwd_net https://fburl.com/twggwyvb
## Unit Tests
Testing normalization after pooling
```
buck test caffe2/caffe2/fb/dper/layer_models/tests/split_1:sparse_nn_test_4 -- test_sparse_pooling_batch_normalization
buck test caffe2/caffe2/fb/dper/layer_models/tests/split_1:sparse_nn_test_4 -- test_dense_sparse_pooling_batch_normalization
buck test caffe2/caffe2/fb/dper/layer_models/tests/split_1:sparse_nn_test_4 -- test_sparse_pooling_layer_normalization
buck test caffe2/caffe2/fb/dper/layer_models/tests/split_1:sparse_nn_test_4 -- test_dense_sparse_pooling_layer_normalization
```
Testing normalization after dot-prod
```
buck test caffe2/caffe2/fb/dper/layer_models/tests/split_1:sparse_nn_test -- test_last_layer_use_batch_norm
buck test caffe2/caffe2/fb/dper/layer_models/tests/split_1:sparse_nn_test -- test_last_layer_use_layer_norm
```
Differential Revision: D19277618
Pulled By: SilunWang
fbshipit-source-id: ea323e33e3647ba55d2e808ef09d94ad7b45b934
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31707
Change the initialization value for FC weight init and sparse embedding lookup init.
The previous default initialization is uniform(-sqrt(1/input_dim), sqrt(1/input_dim)). Now we pass in a flexible hyperparameter, say alpha, to change it to uniform(-sqrt(alpha/input_dim), sqrt(alpha/input_dim)).
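A small numpy sketch of the parameterized initializer described above (the function name and signature are illustrative):
```
import numpy as np

def uniform_init(shape, input_dim, alpha=1.0):
    # alpha = 1.0 reproduces the previous default
    # uniform(-sqrt(1/input_dim), sqrt(1/input_dim)).
    bound = np.sqrt(alpha / input_dim)
    return np.random.uniform(-bound, bound, size=shape).astype(np.float32)
```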
Reviewed By: chonglinsun
Differential Revision: D18825615
fbshipit-source-id: 4c5f2e07f2b3f5d642fd96d64dbf68892ebeb30b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31676
Facebook:
Previously we assumed the mask is passed in as a tensor, which is not feasible for sparse parameters.
Here we allow the mask to be passed in through a DB path, which requires the masks to be stored in some DB first.
Test Plan: unit tests
Reviewed By: ellie-wen
Differential Revision: D18928753
fbshipit-source-id: 75ca894de0f0dcd64ce17b13652484b3550cbdac
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31641
Assuming mask is provided as a tensor
Test Plan: unit test
Reviewed By: ellie-wen
Differential Revision: D18928737
fbshipit-source-id: a4f3dd51769c2b56e5890043e91c18e6128be082
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31612
Count the number of recent updates on rows. Exponential decay is applied to the counter with decay rate r, such that
r^{counter_halflife} = 0.5;
If counter_halflife is nonpositive, this operator is turned off.
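For reference, the implied decay rate (simple algebra from the relation above; the counter_halflife value is illustrative):
```
# r ** counter_halflife == 0.5  =>  r == 0.5 ** (1.0 / counter_halflife)
counter_halflife = 10  # illustrative value
r = 0.5 ** (1.0 / counter_halflife)
```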
Test Plan: added unittest
Reviewed By: chocjy
Differential Revision: D19217921
fbshipit-source-id: 96d850123e339212cc0e0ef352ea8a1b1bf61dfa
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31260
1. Update the LiteLM dataset conversion script (fbcode/pytext/fb/tools/lite_lm_dataset_to_tensorproto.py)
2. Created a benchmark json file for byte-aware lstm word model (xplat/aibench/specifications/models/caffe2/assistant/lite_lm_len5.json)
3. In order to run the model -- created an int64 Tensor for the model, added batch gather ops to the BUCK file
Test Plan:
```
1. Create tensorproto of the model input
buck run mode/opt //pytext/fb/tools:byte_lm_dataset_to_tensorproto -- --in-path /mnt/vol/pytext/smart_keyboard/aibench/test_5.txt --out-path /mnt/vol/pytext/smart_keyboard/aibench/byteAwareWordLM/ --hidden_dim 203 --layers_num 2 --max_seq_len 64 --max_byte_len 15
2. Run the aibench command
buck run fbsource//xplat/aibench:run_bench -- -b aibench/specifications/models/caffe2/assistant/lm_byte_lstm_len5.json --remote --devices SM-G960U-8.0.0-26
```
Reviewed By: gardenia22
Differential Revision: D17785682
fbshipit-source-id: 351c3c8bae16449e72ac641522803b23a83349be
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31470
Optimize performance of these two operators.
Additionally use nearbyint instead of round to be consistent with 4-bit embedding table quantization.
Reviewed By: hyuen
Differential Revision: D19072103
fbshipit-source-id: efe96f14aeff7958cceb453ed625d3fd693891ff
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24341
ConvTransposeOp doesn't crash for a zero batch, but it doesn't modify the output blob. This leads to buggy behaviour, especially when running the same network twice with different inputs, or during backprop in training.
`ConvTransposeUnpoolBase<Context>::GetOutputSize` seems to work for a zero batch, so I remove the check for `input.numel() > 0` and reshape the output blob before returning.
For CudnnConvTransposeGradientOp, it's a bit verbose to set `dfilter` and `dbias`, and it seems cuDNN can handle it, so simply remove the `X.numel() == 0` branch.
Test Plan: buck test mode/dev-nosan caffe2/caffe2/python/operator_test:conv_transpose_test -- --run-disabled
Reviewed By: BIT-silence
Differential Revision: D16807606
fbshipit-source-id: 0d72c5bd8f2e03c34465e7b530cca548d9bdd5e1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30915
Since we now have C++14, we don't need these c10::guts helpers anymore
ghstack-source-id: 95777609
Test Plan: waitforsandcastle
Differential Revision: D18869639
fbshipit-source-id: 97716f932297c64c6e814410ac47b444c33d4e2e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19705
Optimizing for the case where a run of consecutive dims that are not broadcast is followed by a run of consecutive dims that are broadcast.
For example, MulGradient(["dC", "A", "B"], ["dA", "dB"], broadcast=True, axis=0) where A.shape == dC.shape == [9508, 80] and B.shape == [80] .
Test Plan:
On SKL T6, running mul_gradient_benchmark without this optimization:
Operator #0 (dA, MulGradient) 11.9119 ms/iter
After this optimization:
Operator #0 (dA, MulGradient) 0.672759 ms/iter
D15291800 needs to land first to fix the unit test error.
Reviewed By: dmudiger
Differential Revision: D15075415
fbshipit-source-id: 0f97be17cf8f1dacbafa34cd637fb8bc1c5e5387
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29766
Add FbgemmPackTranspose op to support the packing on FCTransposed weights
Add FCTransposed to FbFCPacked transformation to Dper fp16 exporter
Test Plan:
```
buck test mode/opt caffe2/caffe2/fb/fbgemm:fb_fc_packed_op_test
```
```
buck test mode/opt caffe2/caffe2/python:layers_test
```
Differential Revision: D18482306
fbshipit-source-id: e8f1947b3d0d04892293509ebf88742f5f0f5997
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30802
Change shape_hints from map<string, TensorShape> to ShapeInfoMap to catch dimType info from model file.
Reviewed By: ipiszy
Differential Revision: D18821486
fbshipit-source-id: c5d9ed72e158d3698aba38900aeda00f776745b4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30679
Caffe2 expects quantized ops to be in NHWC format, while PyTorch inputs are in NCHW.
Add a JIT pass that inserts an NCHW-to-NHWC permute before each conv op and an NHWC-to-NCHW permute after it.
A graph rewriter then finds consecutive redundant permutes and removes them from the graph.
Test Plan:
python test/onnx/test_pytorch_onnx_caffe2_quantized.py TestQuantizedOps
Imported from OSS
Differential Revision: D18790518
fbshipit-source-id: 4dd39cf0b31b21f5586c0edfdce2260d4e245112
Summary: We added a Caffe2 Python wrapper and a unit test for the SparseRAdam C++ operator.
Test Plan:
Unit test is constructed following the design pattern of [Wngrad optimizer](https://our.intern.facebook.com/intern/diff/D8655724/). Test passed smoothly.
buck test //caffe2/caffe2/python:optimizer_test -- TestSparseRAdam
Test result:
{F221144048}
Reviewed By: wx1988
Differential Revision: D18330650
fbshipit-source-id: e0f4724c2b616b665e2a0fe2e5c3430696cca7ee
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29469
The original approach was to save both fp16 and fp32 for all models, which increased the file size and memory usage.
This diff saves only the 'used' blobs into the predictor file.
Test Plan:
fc clone workflow :
f149878151
ctr mbl feed test with fc fp16 quantization:
f149996395
No fp32 in local file
{F221750392}
QRT after the fix:
https://fburl.com/qrt/cp8r8263
Reviewed By: wx1988
Differential Revision: D18382503
fbshipit-source-id: 231c41668f25b1d35ca8d4358ce9b12ba60a4f91
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29707
In D17885977, Linearizable label (a multi-class classification) was implemented in MTML.
In this diff, we add several items for Linearizable label:
- Assigning different weights to each class through ```model_def.tasks[i].class_weights```.
- This option is a dictionary, the keys of which are indices of the classes and the values of which are weights for each class.
- For example, if a linearizable-label task has 4 classes and its ```class_weights = {"0": 1, "1": 0.1, "2": 0.1, "3": 0.01}```, it means that in the loss function of this task, we assign weight 1 to its first class, weight 0.1 to its second and third class, and weight 0.01 to its forth class. The index/order of classes follows the logic of linearizable label.
- Note that when you assign different weights to different classes, you need to correct the calibration by setting an appropriate ```model_def.tasks[i].calibration.linearizable_class_weight```. Basically, the class weights in calibration should be the reciprocals of the class weights in loss function. So the ```calibration.linearizable_class_weight = {"0": 1, "1": 10, "2": 10, "3": 100}``` for the example above.
- Example FBLearner job: f150763093
- We also support ```model_def.allow_missing_label_with_zero_weight``` for linearizable label, which will ignore those examples with first label missing, by assigning zero weights to them in loss function.
- We need to set ```allow_missing_label_with_zero_weight = true``` to enable it.
- Example FBLearner job: f150763093
- Last but not least, we update caffe2 operator ```SoftmaxWithLoss``` to support loss averaged by batch size.
- We need to set ```model_def.tasks[i].loss.softmaxLoss.average_by_batch_size = true``` to enable it.
- Previously, the loss was averaged by weight sum of examples in batch, which is still the default behavior now (when ```average_by_batch_size = null``` or ```average_by_batch_size = false```).
- Without this new feature, the calibration will be incorrect when applying non-equal-weight training among different classes to a linearizable task.
- Example FBLearner job with ```average_by_batch_size = true``` results in a correct calibration: f150763093
- Example FBLearner job with ```average_by_batch_size = null``` results in an incorrect calibration: f150762990
Test Plan:
buck test caffe2/caffe2/fb/dper/layer_models/tests:mtml_test_2 -- test_linearizable_label_task_with_class_weights
buck test caffe2/caffe2/fb/dper/layer_models/tests:mtml_test_2 -- test_linearizable_label_task_with_zero_weight
buck test caffe2/caffe2/fb/dper/layer_models/tests:mtml_test_2 -- test_linearizable_label_task_average_by_batch_size
All tests passed.
full canary: https://fburl.com/fblearner/troznfgh
Reviewed By: chenshouyuan
Differential Revision: D18461163
fbshipit-source-id: aaf3df031406ae94f74e2e365b57e47409ef0bfe
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29167
As titled.
This fix is crucial as multi_channel splitting would create history that has no items (i.e., D == 0), which leads to flow failure.
Test Plan:
Unittest
flow test:
before fix: f148783160
after fix: f149082299
buck test mode/dev-nosan caffe2/caffe2/python/operator_test:softmax_ops_test
Reviewed By: xianjiec
Differential Revision: D18296081
fbshipit-source-id: e0bb2dc2c4e5b465e213f31e5c5ced3a7e1fd574
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29152
Bootstrapping uncertainty approach: bootstrap the last layer before the last fully-connected layer. FCWithBootstrap is a new layer to handle the logic for the bootstrapping process.
Goal:
- return a struct with the bootstrapped indices and bootstrapped predictions from this layer
- separate the functionality in the train_net and eval_net
- save the bootstrapped FC in this object so that the eval_net can use them during prediction time
Reviewed By: wx1988
Differential Revision: D17822429
fbshipit-source-id: 15dec501503d581aeb69cb9ae9e8c3a3fbc7e7b5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29155
Update the L0 norm regularizer with a budget feature to penalize features over this limit
Formula and summary:
{F212248495}
Test Plan: * Unit test located in: ~/fbsource/fbcode/caffe2/caffe2/fb/dper/layer_models/tests/split_1/fsparse_nn_test.py
Reviewed By: un-disclosed, wx1988
Differential Revision: D17458138
fbshipit-source-id: 2ed9ce6f55573b0bfc0fefbfd392f90c7542a0fd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29131
caffe2_pb2.CUDA --> workspace.GpuDeviceType
workspace.NumCudaDevices() --> workspace.NumGpuDevices()
Also added the totalGlobalMem into get_device_properties(), which is needed by multi_gpu_utils.py
Test Plan:
sandcastle
f148921769
Reviewed By: bddppq
Differential Revision: D18290090
fbshipit-source-id: bde7c175d1fb6ff59a062266c1b17de39d113b24
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28620
All Tensors are Variables now, they just happen to have requires_grad=False. Tensors ALWAYS have `VariableTensorId` in their type set.
When constructing this patch, I had to make decisions about what I would fix in this patch, and what I would leave for follow up PRs. Here is the cleanup that happens in this patch:
- The `is_variable` property is removed from TensorOptions. I removed this immediately because unlike Tensor::is_variable, TensorOptions::is_variable doesn't respect our VariableTensorId thread-local state. This means that there were a bunch of places where TensorOptions::is_variable was false, which is obviously bogus in the world when tensor and variable are merged. Instead of keeping the method as a function that always returns true, I just opted to remove it entirely (it's not public API.) All places we set `is_variable` are deleted.
- Knock on effect: there is no longer a separate DeprecatedTypeProperties for the variable and non-variable versions of type.
- Knock on effect: instead of asserting on TensorOptions::is_variable, instead we just test `at::impl::variable_is_excluded()`
- There is now only one copy of the cuDNN RNN dropout cache, not two (I'm not sure why we had two to begin with)
Some cleanup that doesn't happen in this patch:
- Eliminating unnecessary uses of `make_variable`
- Eliminating `Tensor::is_variable`
The most subtle part of this patch is retaining tracing behavior: the fact that everything is a Variable means that more code gets routed to VariableType than before; this can change traces. I identified two places where we didn't appropriately turn off VariableType, mostly factory functions:
- `torch.tensor` must turn off VariableType before invoking `at::empty` to construct the tensor, as it subsequently does direct data access
- `tensor_slow` (invoked when you pass a Python scalar to a tensor argument) must turn off VariableType before calling `scalar_to_tensor` so the scalar gets traced as constant, rather than as a call to `scalar_to_tensor`.
Honestly, these are all giant hacks, and should be replaced with a more specialized guard that just toggles tracing.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: dreiss
Differential Revision: D18171156
Pulled By: ezyang
fbshipit-source-id: 5b6a045beba37492647e350190f495114e86504d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29045
Addressing an issue seen in GitHub https://github.com/pytorch/pytorch/issues/28958
It seems sometimes the workers in this test don't stop cleanly. The purpose of this test is to check that the init_fun in init_workers works as expected, which is captured by the assertEqual in the for loop in the test. The behavior of stop() is not really important here.
The fact it's returning false is probably indicative that a worker is getting blocked but that doesn't affect the correctness of the test.
Test Plan: Ran the test 100 times, it consistently succeeds.
Reviewed By: akyrola
Differential Revision: D18273064
fbshipit-source-id: 5fdff8cf80ec7ba04acf4666a3116e081d96ffec
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28717
Make HasElements support multiple inputs: if any input has elements, return true.
Test Plan: to be added
Reviewed By: BIT-silence
Differential Revision: D17972759
fbshipit-source-id: 3ecdea74a30fcfaaa6490fef1debc6cde68db922
Summary:
This PR makes Caffe2 compatible with TensorRT 6. To make sure it works well, a new unit test is added. This test checks the PyTorch->ONNX->TRT6 inference flow for all classification models from the TorchVision Zoo.
Note on CMake changes: it has to be done in order to import onnx-tensorrt project. See https://github.com/pytorch/pytorch/issues/18524 for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26426
Reviewed By: hl475
Differential Revision: D17495965
Pulled By: houseroad
fbshipit-source-id: 3e8dbe8943f5a28a51368fd5686c8d6e86e7f693
Summary:
Codemod to remove all thread.isAlive() calls, since isAlive() throws a warning that is breaking some tests that monitor the output of their CLIs.
is_alive() was added in Python 2.6, so this is super safe.
This is a codemod; I don't care whether the code supports Python 3, just that it's Python code.
Test Plan: unittests
Reviewed By: cooperlees
Differential Revision: D18069520
fbshipit-source-id: 4ca4dcb541c0b0debeb194aba5d060152ad0ef0e
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/28327
Test Plan:
Failed as expected and the full protobuf is logged
f145060005
Reviewed By: ffjiang, wx1988
Differential Revision: D17975560
fbshipit-source-id: 5375acffc1f9dede16622b06eb58b6c3a26ebe5a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28265
Fix the difference between dper3 and dper2 when regressionLoss is used.
Test Plan:
test using dper2 model id f134632386
Comparison tool output before change:
```
FOUND OP DIFFERENT WITH DPER2!!!
OP is of type ExpandDims
OP inputs ['supervision:label']
OP outputs ['sparse_nn/regression_loss/mean_squared_error_loss/ExpandDims:0']
===============================
Finished all dper3 ops, number of good ops 11, bad ops 1, skipped 26
run_comparison for dper2 / dper3 nets running time: 0.0020143985748291016
result type: <class 'NoneType'> result: None
```
After change:
```
FOUND OP DIFFERENT WITH DPER2!!!
OP is of type ExpandDims
OP inputs ['sparse_nn_2/regression_loss_2/mean_squared_error_loss_8/Squeeze:0_grad']
OP outputs ['sparse_nn_2/over_arch_2/linear_2/FC_grad']
===============================
Finished all dper3 ops, number of good ops 19, bad ops 1, skipped 16
run_comparison for dper2 / dper3 nets running time: 0.0017991065979003906
result type: <class 'NoneType'> result: None
```
dper2 label part of net P111794577
dper3 label part of net after change P116817194
Reviewed By: kennyhorror
Differential Revision: D17795740
fbshipit-source-id: 9faf96f5140f5a1efdf2985820bda3ca400f61fa
Summary: Previously, loss_weight was not used correctly for the self-supervision branch.
Test Plan: buck test mode/dev-nosan //caffe2/caffe2/fb/dper/layer_models/models/experimental/tests:tum_test
Reviewed By: xianjiec
Differential Revision: D17862312
fbshipit-source-id: 554b793a5caa3886946c54333c81a0d8a10230d9
Summary:
We are seeing error "[enforce fail at BlackBoxPredictor.cpp:134] ! !parameter_workspace->HasBlob(out). Net REMOTE of type predict_net writes to blob cat/NGRAM_QRT_VERSIONS_x_EVENT_TYPE_AUTO_FIRST_X/Pool_Option_0/Repeat_0/sparse_lookup/w which exists in the parameter workspace" in online testing for calibration models.
I suspect it's due to the CopyRowsToTensorOp op being used in prediction.
Test Plan:
f143080108 offline predict net does not contain CopyRowsToTensorNet, which looks right.
Waiting for Olga to test online behavior
dper2 canary:
https://fburl.com/fblearner/sv3o3yj1
Differential Revision: D17741823
fbshipit-source-id: 19721b632b5ea9ebfa1ef9ae0e99d3a10c926287
Summary: Currently accelerators do not have a concept of fp32; they only understand fp16 and int8 as data input. To fix the issue here, we want to make sure unaries are turned into fp16 when the int8 exporter is turned on.
Reviewed By: kennyhorror
Differential Revision: D17743791
fbshipit-source-id: 7322d23eb12ac3f813b525fc0ddd066f95c8ca85
Test Plan:
The notebook showed no diff for id score list
https://our.intern.facebook.com/intern/anp/view/?id=154764
Reviewed By: alyssawangqq
Differential Revision: D17649974
fbshipit-source-id: 84cb4ae372fc215295c2d0b139d65f4eacafae4a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27756
Implement approximate L0 norm for use in the dense feature regularizer that will be used for feature importance. The formula is as follows:
{F212246801}
Reviewed By: wx1988
Differential Revision: D17432708
fbshipit-source-id: 57d6c9c3dd1b4e210b9f10264075c57dbc9c8cb6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27592
The caffe2 data reader test `test_time_limit_reader_with_short_limit` is flaky as-written because it places an upper bound on how much can be read, but under stress it is possible for fewer records to be read. The fix is to make the assertion check a fuzzy/range check rather than exact equality, since there's not a straightforward way to precisely test a timer-based feature.
ghstack-source-id: 91543898
Test Plan:
`buck test mode/dev-tsan //caffe2/caffe2/python:dataio_test-2.7 -- --stress-runs 20` -> P117156924 (with fix, 100% pass)
P117158750 - without fix, lots of failures in this test
Reviewed By: boryiingsu
Differential Revision: D17816775
fbshipit-source-id: 2ab0d3304fbd9c9806d37a4fe2912c840616db61
Summary: This test was failing in 3.7; it turns out it was omitted by the test director in 3.6, so I added a skip for both versions.
Test Plan: The unit test is skipped in 3.7 and 3.6; all other tests pass.
Reviewed By: tomdz
Differential Revision: D17820967
fbshipit-source-id: 571f0ec7fe1b0cb50ead4e0d18c00151a701f36a
Summary:
Support attention weights input to SparseLookup. In attention sum pooling, if attention weights can be pre-calculated before embedding lookup, they can be passed to SparseLookup and processed by SparseLengthsWeightedSum op. One example is id_score attention sum pooling.
Essentially the net is converted from:
LengthsSum(Mul(Gather(keys, w), att_weight))
to:
SparseLengthsWeightedSum(keys, w, att_weight)
It unblocks potential efficiency gain with distributed training.
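A small numerical illustration of the equivalence, assuming the standard Caffe2 `SparseLengthsWeightedSum` input order (data, weights, indices, lengths); the reference side is computed directly in numpy and the blob names are illustrative:
```
import numpy as np
from caffe2.python import core, workspace

D = np.random.randn(10, 4).astype(np.float32)           # embedding table
indices = np.array([1, 3, 5, 7], dtype=np.int64)
att_weights = np.array([0.5, 1.0, 0.25, 2.0], dtype=np.float32)
lengths = np.array([2, 2], dtype=np.int32)               # two examples, two ids each

for name, blob in [("D", D), ("I", indices), ("W", att_weights), ("L", lengths)]:
    workspace.FeedBlob(name, blob)

# Fused form: one operator does the weighted gather + pooling.
workspace.RunOperatorOnce(
    core.CreateOperator("SparseLengthsWeightedSum", ["D", "W", "I", "L"], ["Y"]))

# Reference: LengthsSum(Mul(Gather(D, I), W)) computed directly in numpy.
ref = np.stack([
    (D[indices[:2]] * att_weights[:2, None]).sum(axis=0),
    (D[indices[2:]] * att_weights[2:, None]).sum(axis=0),
])
np.testing.assert_allclose(workspace.FetchBlob("Y"), ref, rtol=1e-5)
```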
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26748
Test Plan: unit test
Reviewed By: chocjy
Differential Revision: D17553345
Pulled By: wheatkit
fbshipit-source-id: 60cc3c4b0bc1eade5459ac598e85286f3849a412
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27508
Implemented a simple exponential decay of the weight of lr loss function, with a lower bound.
Test Plan:
buck test //caffe2/caffe2/fb/dper/layer_models/tests:mtml_test -- test_task_weight_decay
https://our.intern.facebook.com/intern/testinfra/testrun/3377699729136308
canary: f140103452
Reviewed By: chenshouyuan
Differential Revision: D17524101
fbshipit-source-id: 9a653e21a4ecb74dfc4ac949c9e3388f36ef3a20
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26376
* Create the new dense_feature_reg (FCInputLpNorm) for feature importance, applied to the fully-connected layer.
Test Plan: * Unit test located in: `caffe2/caffe2/fb/dper/layer_models/tests/split_1/sparse_nn_test.py`
Reviewed By: un-disclosed
Differential Revision: D17360361
fbshipit-source-id: 1a0e119eeb17199a13dfffe58b3036ea4255e301
Summary:
In some versions of Python, then_net and else_net may switch order. Let's make sure we iterate over the right arg node.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26931
Reviewed By: hl475
Differential Revision: D17614829
Pulled By: houseroad
fbshipit-source-id: 3f1b4eb91ecf4d808f58c34896d3e628aa2e0af0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26654
As per the Python contract, __getattr__ may only raise AttributeError. Throwing something else breaks hasattr() and causes upstream issues.
A similar bug existed in PyTorch earlier.
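For reference, the contract in question (a toy example, not the dper3 code):
```
class Config(object):
    def __getattr__(self, name):
        # hasattr() only treats AttributeError as "attribute not present";
        # raising anything else here propagates out of hasattr() and
        # breaks callers that merely probe for optional attributes.
        raise AttributeError(name)

assert not hasattr(Config(), "missing_field")
```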
Test Plan: builds
Differential Revision: D17529471
fbshipit-source-id: bb6ac6c9e3be8b80fa2967e6a2e293afd1594cf9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25426
Add embedding table 4bit quantization support.
* add the conversion from fp32 to int4.
* use brew to pass the context so that the 4-bit operators are added when generating the predictor net.
Reviewed By: kennyhorror, chocjy
Differential Revision: D16859892
fbshipit-source-id: a06c3f0b56a7eabf9ca4a2b2cb6c63735030d70b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26227
In the previous implementation of composite lr, the lr_scale for each sub-policy was overwritten by the last lr_scale.
Due to another bug in the unit test (policy_lr_scale was the same for all sub-policies), this bug was not detected by the unit test...
Fix: add an additional field in CompositeLearningRateItem so that we store lr_scale values for all sub-policies.
With the unit test fixed, the error in the previous implementation:
https://fburl.com/testinfra/ikdbnmey
With the fix,
https://fburl.com/testinfra/m694ehl1
Test Plan:
unittest
buck test caffe2/caffe2/python/operator_test:learning_rate_op_test -- test_composite_learning_rate_op
Reviewed By: chocjy, alex1o1o7cloud
Differential Revision: D17380363
fbshipit-source-id: 161e9cb71bb2ea7f0734a3361e270616057a08e4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26147
We may try to unpickle a byte string in py3 that was pickled from py2. Therefore we need to add encoding latin1.
Reviewed By: kennyhorror
Differential Revision: D17305677
fbshipit-source-id: c0c8a51909629a65eb72bb81cccfbabaee9f8d01
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25908
Original commit changeset: f6e961e88c01
device_option propagation is completely broken in Caffe2 when pass-through operators are used. As an example, the Gather operator doesn't have a gradient and passes through its inputs, which results in incorrect detection of the components for sparse parameter aggregation (the component will be empty instead of the real device).
This diff is trying to fix this issue.
The original diff had a problem: Caffe2 does not handle cases where a device option is present but contains only metadata (for example, the one for auto-generated reduction ops in the backward pass). This diff addresses that by merging device options during the backward pass.
Test Plan:
1. net_transform is finally working with Gather + FloatToHalf transformed model instead of failing because of incorrect number of components.
2. New unit-test.
3. Verify that previously broken benchmark is now passing
ezyang do you have suggestions what else I should test?
Reviewed By: ezyang
Differential Revision: D17281528
fbshipit-source-id: 4a1bc386f29f6a34fbf8008effde9d4890abebfa
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26153
I suspect that our multithreaded test system causes issues with dyndep if two places try to InitOpsLibrary concurrently. So perhaps we just guard this with a lock. This is just a guess fix, as it is impossible to repro.
Test Plan: sandcastle
Reviewed By: bddppq
Differential Revision: D17361310
fbshipit-source-id: 596634a2098b18881abbd26a5a727a5ba0d03b6e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26080
Will be used in c2 ctr_mbl_feed model to PyTorch conversion
Test Plan: Unit test
Reviewed By: yinghai
Differential Revision: D17337604
fbshipit-source-id: a90d9f5dc38301608d1562c6f2418e7f4616e753
Summary:
Just a tiny fix to make debugging easier (output errors to stderr and include in the exception message)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25809
Reviewed By: zrphercule
Differential Revision: D17329957
Pulled By: houseroad
fbshipit-source-id: 0d73dd9f62c735fbc5096e6a7c0e5f58e4cd90ae
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25782
Enable variable size embedding for the dot processor. We split the embedding matrix into multiple towers based on embedding size, perform the dot product in a loop over each tower, and finally concatenate all the dot product outputs.
Test Plan:
buck test //caffe2/caffe2/fb/dper/layer_models/tests/split_1:
https://our.intern.facebook.com/intern/testinfra/testrun/3659174703037560
Specific unit tests --
buck test //caffe2/caffe2/fb/dper/layer_models/tests/split_1:sparse_nn_test -- test_per_feature_emb_dim
https://our.intern.facebook.com/intern/testinfra/testrun/3377699726358808
Reviewed By: chenshouyuan
Differential Revision: D16690811
fbshipit-source-id: 8f5bce5aa5b272f5f795d4ac32bba814cc55210b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25203
device_option propagation is completely broken in Caffe2 when pass-through operators are used. As an example, the Gather operator doesn't have a gradient and passes through its inputs, which results in incorrect detection of the components for sparse parameter aggregation (the component will be empty instead of the real device).
This diff is trying to fix this issue.
Test Plan:
net_transform is finally working with Gather + FloatToHalf transformed model
instead of failing because of incorrect number of components.
Reviewed By: dzhulgakov
Differential Revision: D16936041
fbshipit-source-id: 916551b933469f04e32ddf86ec4b2c07f76c9176
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24863
Add the sparse feature name in logging for ease of debugging
Test Plan:
./buck-out/gen/caffe2/caffe2/fb/dper/layer_models/sparse_nn/pooling_test#binary.par -r test_simple_sum_pooling_named_exception
Another test for id_score_list. the original sparse_key is equivalent to get_key(self.input_record)()
P98343716
./buck-out/gen/caffe2/caffe2/python/layers_test-2.7#binary.par -r test_get_key
Reviewed By: chocjy
Differential Revision: D16901964
fbshipit-source-id: 2523de2e290aca20afd0b909111541d3d152a588
Summary:
[Not in need of review at this time]
Support focal loss in MTML (effectively dper2 in general) as described in https://arxiv.org/pdf/1708.02002.pdf. Adopt an approach similar to Yuchen He's WIP diff D14008545.
Test Plan:
Passed the following unit tests
buck test //caffe2/caffe2/fb/dper/layer_models/tests/split_1:sparse_nn_test -- test_lr_loss_based_focal_loss
buck test //caffe2/caffe2/fb/dper/layer_models/tests:mtml_test_2 -- test_mtml_with_lr_loss_based_focal_loss
buck test //caffe2/caffe2/fb/dper/layer_models/tests/split_1:sparse_nn_test -- test_lr_loss_based_focal_loss_with_stop_grad_in_focal_factor
Passed ./fblearner/flow/projects/dper/canary.sh; URL to track workflow runs: https://fburl.com/fblearner/446ix5q6
Model based on V10 of this diff
f133367092
Baseline model
f133297603
Protobuf of train_net_1 https://our.intern.facebook.com/intern/everpaste/?color=0&handle=GEq30QIFW_7HJJoCAAAAAABMgz4Jbr0LAAAz
Reviewed By: hychyc90, ellie-wen
Differential Revision: D16795972
fbshipit-source-id: 7bacae3e2255293d337951c896e9104208235f33
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24357
SparseNormalize does not need to know the gradient values for the lookup table, only the indices of the embeddings that need to be updated. By removing this input, we allow SparseNormalize to be used alongside SparseAdagradFusion.
Differential Revision: D16809919
fbshipit-source-id: cc19692ba4dea8854663ae1ed8cf9365e90c99bc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24439
Many papers mention that BPR is useful for improving recommendation quality. Add a BPR loss so that we can train TTSN with it. We would like to see if it can improve retrieval models.
reference: https://arxiv.org/pdf/1205.2618.pdf
Reviewed By: dragonxlwang
Differential Revision: D16812513
fbshipit-source-id: 74488c714a37ccd10e0666d225751a845019eb94
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23983
While testing I realized that model layers can extract different types of features from the same column. For example, MultifeedFeaturesTransform uses float and ID list features from the "features" column.
get_accessed_features returns a map from column to AccessedFeatures, and AccessedFeatures only has the feature IDs for one feature type. This is incompatible with having multiple types of features per column: one type ends up overwriting another in the map.
To fix this, I've modified get_accessed_features to return a map from column to a list of AccessedFeatures objects.
Reviewed By: itomatik
Differential Revision: D16693845
fbshipit-source-id: 2099aac8dc3920dd61de6b6ad5cf343c864803bc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23848
Problem:
In an experiment running feed model 127607201 (/mnt/public/tracelog/feed_repro2/127607201_0.predictor), we encountered a blob dimensionality mismatch error when running the onnxified net. This is due to the model initializing input blobs in the current workspace with blob size 0, and onnxifi() falsely identifying those input blobs as weight blobs and assigning the wrong dimensions.
Solution:
Add option to pass correct weight blob names to onnxifi() instead of using all blobs in current workspace.
Reviewed By: yinghai
Differential Revision: D16661396
fbshipit-source-id: cabe44db6b64e6538bef4b65e380312214b3ba9f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23784
The backward path does nothing when the input is empty; as a result, the workspace can preserve gradient values from the previous iteration and feed inconsistent inputs to some of the backward-pass operators. This diff fixes this discrepancy by always reinitializing the output during the backward path.
Reviewed By: dzhulgakov
Differential Revision: D16646096
fbshipit-source-id: 8ca68dfad17a63fc87c033cce7b36b40bd77245c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23679
Full Canary: https://fburl.com/fblearner/sa1pkpya
Add LambdaRank DCG Loss Option
* when use_idcg_normalization == true, regular LambdaRank with NDCG loss
* when use_idcg_normalization == false, gradient and loss functions are not normalized by idcg.
Differential Revision: D16605459
fbshipit-source-id: a16f071e69516974e48d27bef4ca179019ca4ae7
Summary:
We need a way to get a complete list of features that are used in training a model. One way to do this is to make it possible to get the list of features used in each model layer. Then once the model is complete we can go through the layers and aggregate the features.
I've introduced a function to expose that information here, get_accessed_features, and implemented it in the FeatureSparseToDense layer to start with.
I've tried to include the minimum amount of information to make this useful, while making it easy to integrate into the variety of model layers. This is, for example, why AccessedFeatures does not contain feature_names which is not always present in a model layer. I debated whether or not to include feature_type, but I think that's useful enough, and easy enough to figure out in a model layer, that it's worth including.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23036
Test Plan:
Added a unit test to verify the behavior of get_accessed_features in FeatureSparseToDense.
aml_dper2-fblearner-flow-integration-tests failed due to a known issue D16355865
aml_dper3-fblearner-flow-integration-tests failed due to a known issue T47197113
I verified no tests in the integration tests failed to issues other than those known ones.
DPER2 canaries: https://fburl.com/fblearner/1217voga
Reviewed By: volkhin
Differential Revision: D16365380
Pulled By: kevinwilfong
fbshipit-source-id: 2dbb4d832628180336533f29f7d917cbad171950
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22966
We want to implement "trimmed lasso" for feature selection with learnable and regularizable weights. Trimmed lasso is a simple yet powerful improved version from traditional lasso. More reference can be found at https://arxiv.org/abs/1708.04527 and http://proceedings.mlr.press/v97/yun19a.html. For quick and necessary intro, please refer to P1-3 of the paper at https://arxiv.org/abs/1708.04527.
Given n weights, traditional lasso sums up all weights' l1 norms. The trimmed lasso takes an input integer k (how many weights you want to select from n) and only sums over the smallest n - k weights. Given lambda as the regularization constant, the penalty term is only on the smallest n - k weights, but not other larger weights. If lambda becomes larger than certain threshold, the smallest n - k weights are shrunk to zero. That means we have those weights "dropped". With this property, the number k is the number of weights left after lasso, which we can easily control.
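A minimal numpy sketch of the trimmed L1 penalty described above (the function name and arguments are illustrative):
```
import numpy as np

def trimmed_l1(weights, k, lam):
    # Penalize only the n - k smallest |w_i|; the k largest weights are left
    # untouched, so exactly k weights can survive a sufficiently large lambda.
    mags = np.sort(np.abs(weights))
    return lam * mags[: max(len(weights) - k, 0)].sum()
```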
Meanwhile, we further support all available regularization in a single interface. Current supported regularizers on weights include no reg, l1, l2, elastic, trimmed l1, elastic with trimmed l1, group l1, and logbarrier.
Differential Revision: D16326492
fbshipit-source-id: 6e1fd75606005d9bc09d6650435c96a7984ba69c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22958
When we use `extension_loader.DlopenGuard()` to dyndep or import modules, it sets the `RTLD_GLOBAL` flag and restores the original flags after the `yield`. However, if the module is not there, the yield will fail and the flags won't be restored, creating all kinds of symbol conflict problems.
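The fix amounts to restoring the flags in a `finally` block; a hedged sketch of the pattern (not the actual `extension_loader` code):
```
import contextlib
import ctypes
import sys

@contextlib.contextmanager
def dlopen_guard():
    old_flags = sys.getdlopenflags()
    sys.setdlopenflags(old_flags | ctypes.RTLD_GLOBAL)
    try:
        yield
    finally:
        # Restore the original flags even if the import inside the guard raises.
        sys.setdlopenflags(old_flags)
```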
Reviewed By: bddppq
Differential Revision: D16311949
fbshipit-source-id: 7b9ec6d60423ec5e78cae694b66c2f17493840b0
Summary:
As part of the Variable/Tensor merge, we want to be able to pass Variables into Caffe2 without doing extra shallow copy, to improve performance and also allow for in-place mutations in Caffe2 ops. There are a few approaches outlined in https://github.com/pytorch/pytorch/pull/22418, and this PR is the chosen approach.
Specifically, we can have the assumption that we won't be connecting autograd to C2 gradients at any point (as it's too tricky and not that useful). Therefore, we can pass Variable into Caffe2 ops by requiring that all Variables in Caffe2 don't require grad. For code paths in Caffe2 that might potentially track gradients (e.g. `ScriptModuleOp` and `call_caffe2_op_from_c10`), we use the `torch::NoGradGuard` to make sure gradients are not tracked.
This supersedes https://github.com/pytorch/pytorch/pull/22418.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22473
Differential Revision: D16099042
Pulled By: yf225
fbshipit-source-id: 57efc3c7cfb3048d9abe90e63759acc14ebd2972
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22348
This is the last step of LRU hash eviction weight re-init. This diff checks if there are evicted values in sparse_lookup; if so, it calls the op created in D15709866 to re-init the values for the indices in evicted_values. Also created a gradient op for the operator. The gradient op just passes the output gradient through as the input gradient.
Reviewed By: itomatik
Differential Revision: D16044736
fbshipit-source-id: 9afb85209b0de1038c5153bcb7dfc5f52e0b2abb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21927
Add `OUTPUT_PROB` output to CTCBeamSearchDecoderOp to return a probability for each sequence.
Add argument to output top-k instead of top-1 decoded sequences.
Reviewed By: SuperIRabbit
Differential Revision: D15797371
fbshipit-source-id: 737ca5cc4f90a0bcc3660ac9f58519a175977b69
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22279
This new operator is used for embedding table weight re-init. After we get the evicted indices, they are the rows that need resetting in the embedding table. Then we can create a 1d tensor with default values and apply this operator to copy the tensor to all evicted rows in the embedding table.
Will add gradient op in next diff
Reviewed By: itomatik
Differential Revision: D15709866
fbshipit-source-id: 2297b70a7326591524d0be09c73a588da245cc08
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21389
As titled. To do weight re-init on evicted rows in the embedding table, we need to pass the info about the evicted hashed values to SparseLookup, which is the model layer responsible for constructing the embedding table and doing the pooling.
To pass evicted values, we need to adjust the output record of lru_sparse_hash to include the evicted values, and add an optional input to all processors that need to take in sparse segments. For SparseLookup to get the evicted values, its input record needs to be adjusted. Now the input record can have type IdList/IdScoreList, or a struct of feature + evicted values.
Reviewed By: itomatik
Differential Revision: D15590307
fbshipit-source-id: e493881909830d5ca5806a743a2a713198c100c2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22334
Improve the function signatures of save_to_db and load_from_db in predictor_exporter.
Reviewed By: akyrola
Differential Revision: D16047208
fbshipit-source-id: a4e947f86e00ef3b3dd32c57efe58f76a38fcec7
Summary:
In order to select the more important features in a dot product among a list of candidate sparse features, we can assign one learnable weight to each feature and reweight each feature by multiplying the weight onto its embedding before the dot product. We finally select features based on the weight magnitude after training.
We can perform L1 and/or L2 regularization on the weights. To summarize, the weights tend to shrink (avoiding overfitting) due to L2 regularization, and some weights vanish to zero due to L1. To avoid sparse feature embeddings being ignored because of an early collapse of the weights, a piecewise lr warm-up policy is used when optimizing the regularization term, so that regularization is weak in the first stage and gets stronger afterwards (a small lr constant for iters below threshold 1, a medium lr constant in stage 2, and a final, reasonably large lr constant for all iters after threshold 2). The features with nonzero and relatively large weights (in absolute value) will be selected for the module.
We can also apply softmax to the original weights to make them sum to 1. We can even boost the softmaxed weights by multiplying by the number of softmax components, which essentially makes them sum to the number of components and average to 1. With this approach, all the weights are positive and sum to a constant. Regularization is not a must, since we can count on the competition between the softmax weights themselves to achieve reasonable re-weighting. We expect those weights to be more dense, compared with the sparse ones from L1 regularization, and we can select features based on the top K weights.
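A small numpy illustration of the boosted-softmax reweighting described above (illustrative only):
```
import numpy as np

def boosted_softmax_weights(theta):
    e = np.exp(theta - np.max(theta))   # numerically stable softmax
    w = e / e.sum()
    return len(theta) * w               # sums to n, averages to 1

w = boosted_softmax_weights(np.array([0.2, -1.0, 0.5, 0.0]))
assert np.isclose(w.sum(), len(w))
```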
Overall, we aim to demonstrate that the selected feature set outperforms the current v0 feature set in experiments. Special acknowledgement goes to Shouyuan Chen, who initiated the work of regularizable weighting.
---
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22176
The diff will export updates to the GitHub repository, as stated below.
{F162787228}
Basically, the updates on the files are summarized as below:
- adding logger messages
`caffe2/python/layer_model_helper.py`
- add ElasticNet regularizer, which combines both L1 and L2 regularization
`caffe2/python/regularizer.py`
- implement piecewarmup, specifically warm up with three constant pieces
`caffe2/sgd/learning_rate_functors.h, caffe2/sgd/learning_rate_op.cc, caffe2/sgd/learning_rate_op.h`
Differential Revision: D15923430
fbshipit-source-id: ee18902cb88c23b1b7b367cc727d690a21e4cda9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20673
Add option to bucket-weighted pooling to hash the bucket so that any cardinality score can be used.
Reviewed By: huginhuangfb
Differential Revision: D15003509
fbshipit-source-id: 575a149de395f18fd7759f3edb485619f8aa5363
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21080
Add Huber loss as a new option for regression training (refer to TensorFlow implementation: https://fburl.com/9va71wwo)
# huber loss (pseudocode)
def huber(true, pred, delta):
    error = abs(true - pred)
    loss = 0.5 * min(error, delta) ** 2 + delta * max(error - delta, 0)
    return mean(loss)
As a combination of MSE loss (`x < delta`) and MAE loss (`x >= delta`), the advantage of Huber loss is to reduce the training's dependence on outliers.
One thing worth noting is that the Huber loss is not twice differentiable at `x = delta`. To further address this, one could consider adopting a `log(cosh(x))` loss.
Reviewed By: chintak
Differential Revision: D15524377
fbshipit-source-id: 73acbe2728ce160c075f9acc65a1c21e3eb64e84
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20924
I found a Python 3 bug when deserializing Caffe2 code. The exception thrown is a Unicode-related error instead of just a decode error, and we need to catch that as well.
Reviewed By: ipiszy
Differential Revision: D15293221
fbshipit-source-id: 29820800d1b4cbe5bf3f5a189fe2023e655d0508
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21718
Adding a detection method for whether the package is built for AMD.
Reviewed By: bddppq
Differential Revision: D15795893
fbshipit-source-id: 91a21ee76b2273b1032507bdebe57e016717181d