Summary:
There is a tool called `2to3` whose `future` fixer you can target specifically to remove these redundant `__future__` imports; the `caffe2` directory has the most of them:
```
2to3 -f future -w caffe2
```
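For context, these are the kinds of lines the fixer strips. A minimal before/after sketch (the module content below is made up for illustration):
```
# before: a typical module still carrying Python 2 compatibility imports
from __future__ import absolute_import, division, print_function, unicode_literals

def half(x):
    return x / 2  # true division is already the default on Python 3

# after `2to3 -f future -w caffe2`, the __future__ import line is removed
# and only the function body remains.
```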
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45033
Reviewed By: seemethere
Differential Revision: D23808648
Pulled By: bugra
fbshipit-source-id: 38971900f0fe43ab44a9168e57f2307580d36a38
Summary:
Per discussion with Fei Tian, we need to add a `scale_init_value` option to scale down the output of normalization such as batch-norm and layer-norm.
Currently we have `sparse_normalization_options` to normalize the embedding pooling output. By default scale = 1.0; we found it works better to set the scale between 0.025 and 0.1 (https://fb.quip.com/MiKUAibEaYhH).
In addition, I am removing the tags from the normalizers, because it makes more sense to compute the norm ops on the distributed trainers, not on the parameter servers.
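A minimal sketch of the idea (illustrative names only, not the dper API): the normalization output is multiplied by a scale parameter initialized to `scale_init_value` instead of the usual 1.0, so the normalized activations start out damped.
```
import numpy as np

def layer_norm_scaled(x, scale_init_value=0.05, eps=1e-5):
    # Normalize the last dimension, then apply a scale whose initial value is
    # scale_init_value rather than 1.0. In training this scale would be a
    # learnable parameter; here we only show its initialization effect.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    normalized = (x - mean) / np.sqrt(var + eps)
    return scale_init_value * normalized

pooled = np.random.randn(4, 16)   # stand-in for an embedding-pooling output
out = layer_norm_scaled(pooled, scale_init_value=0.05)
```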
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31983
Test Plan:
Testing LN and BN after sum-pooling --
baseline f160348514
LN: f160348609
BN: f160348710
{F226106518}
Layer norm after sum-pooling fwd_net https://fburl.com/sa4j207n
Layer norm after dot-prod fwd_net https://fburl.com/twggwyvb
## Unit Tests
Testing normalization after pooling
```
buck test caffe2/caffe2/fb/dper/layer_models/tests/split_1:sparse_nn_test_4 -- test_sparse_pooling_batch_normalization
buck test caffe2/caffe2/fb/dper/layer_models/tests/split_1:sparse_nn_test_4 -- test_dense_sparse_pooling_batch_normalization
buck test caffe2/caffe2/fb/dper/layer_models/tests/split_1:sparse_nn_test_4 -- test_sparse_pooling_layer_normalization
buck test caffe2/caffe2/fb/dper/layer_models/tests/split_1:sparse_nn_test_4 -- test_dense_sparse_pooling_layer_normalization
```
Testing normalization after dot-prod
```
buck test caffe2/caffe2/fb/dper/layer_models/tests/split_1:sparse_nn_test -- test_last_layer_use_batch_norm
buck test caffe2/caffe2/fb/dper/layer_models/tests/split_1:sparse_nn_test -- test_last_layer_use_layer_norm
```
Differential Revision: D19277618
Pulled By: SilunWang
fbshipit-source-id: ea323e33e3647ba55d2e808ef09d94ad7b45b934
Summary:
To achieve this, I modified the blob name scheme defined in a layer.
Before, it was scope/fc_w and scope/fc_w_auto_0 (if there is another fc
within the same scope).
Now I have changed it to scope/fc/w and scope/fc_auto_0/w.
That is, we rely on the uniqueness of the scoped layer name to define
the names of the blobs.
I also overrode the create_param method in LayerModelHelper to make it
use the resolved blob name given the parameter-sharing context.
There are still some details to finalize, such as making the
initializer more structured.
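A toy illustration of the naming change (purely illustrative; this is not the actual caffe2 scoping code): the blob name is now derived from the already-unique scoped layer name, so the parameter name itself never needs a de-duplication suffix.
```
def old_blob_name(scope, layer, param, dup_index=None):
    # old scheme: scope/fc_w, and scope/fc_w_auto_0 for a second fc in the same scope
    name = "{}/{}_{}".format(scope, layer, param)
    return name if dup_index is None else "{}_auto_{}".format(name, dup_index)

def new_blob_name(scoped_layer_name, param):
    # new scheme: scope/fc/w and scope/fc_auto_0/w; uniqueness comes from
    # the scoped layer name, so the param name is simply appended
    return "{}/{}".format(scoped_layer_name, param)

assert old_blob_name("scope", "fc", "w") == "scope/fc_w"
assert old_blob_name("scope", "fc", "w", 0) == "scope/fc_w_auto_0"
assert new_blob_name("scope/fc", "w") == "scope/fc/w"
assert new_blob_name("scope/fc_auto_0", "w") == "scope/fc_auto_0/w"
```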
Reviewed By: kennyhorror
Differential Revision: D5435132
fbshipit-source-id: a0525f5ea0977e255dd5ea765b38913f5951d455
Summary:
In this revision, I mainly implemented the DRelu activation. See https://arxiv.org/pdf/1706.06978v1.pdf for details.
To sum up, unlike standard ReLU and PReLU, which split the input range into two parts with the boundary fixed at zero, DRelu computes another value p to decide how the activation is split. p is the softmax value of the output of batch normalization. The f(x) = x part of ReLU has its analogue in f(x) = px, and the f(x) = 0 part has its analogue in f(x) = a(1-p)x, where a is a parameter to tune. The DRelu activation is the sum of these two parts: f(x) = a(1-p)x + px.
To implement DRelu, I take BatchNormalization as the super class and then use the above formula for the computation. To allow users to choose the activation method, which usually happens when calling the add_mlp function in processor_util.py, I pass the parameter through model_option from the UI down to the implementation, just as dropout does. Currently I place it in extra_option, but I can change that if the AML team needs to redesign the UI.
I also added unit tests for DRelu. We check the shape of the output and also run numeric tests.
For the numeric tests, I first check the value of BatchNormalization, since there was no similar test before. I then compute the expected DRelu outputs and compare the results with the current DRelu layer.
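A minimal numpy sketch of the formula above (illustrative only; the real layer subclasses the caffe2 BatchNormalization layer and uses its statistics, and the sigmoid gate over the normalized input here is my stand-in for the gating described above):
```
import numpy as np

def drelu(x, a=0.1, eps=1e-5):
    # Normalize per feature (no learnable affine, for simplicity), turn the
    # normalized value into a gate p in (0, 1), and blend the two linear
    # pieces: f(x) = p * x + a * (1 - p) * x.
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    normalized = (x - mean) / np.sqrt(var + eps)
    p = 1.0 / (1.0 + np.exp(-normalized))  # gate; sigmoid used here as an assumption
    return p * x + a * (1.0 - p) * x

x = np.random.randn(8, 4)
y = drelu(x)
assert y.shape == x.shape  # the unit tests above check the output shape similarly
```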
Reviewed By: chocjy
Differential Revision: D5341464
fbshipit-source-id: 896b4dcc49cfd5493d97a8b448401b19e9c80630
Summary: For NHWC, the input-channel dimension should be the last dimension, C. Since the batch dimension is omitted, its index should be 2 instead of 3.
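For context, a small illustration of the indexing (hypothetical shapes, not the actual op code): with the batch dimension dropped, an NHWC tensor is laid out as (H, W, C), so the channel axis sits at index 2.
```
import numpy as np

# Full NHWC layout: (N, H, W, C) -> the channel axis is index 3.
full = np.zeros((8, 32, 32, 16))
assert full.shape[3] == 16

# With the batch dimension omitted, the layout is (H, W, C),
# so the channel axis index becomes 2, not 3.
single = np.zeros((32, 32, 16))
assert single.shape[2] == 16
```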
Reviewed By: chocjy
Differential Revision: D5418538
fbshipit-source-id: a6939a863817b7566198ea2a665a1d236a2cf63d