Summary:
When output blob names are specified while load_all=1, they are ignored. However, this behavior is not documented. In this diff, we simply disallow providing output blob names when load_all=1.
See discussion at https://fb.workplace.com/groups/1405155842844877/permalink/2714909788536136/
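For illustration, a minimal sketch of the new check, assuming it lives in the Load op's constructor (`load_all_` and `operator_def` are stand-ins for the real members, and the wording is not the actual error message):
```
// Reject output blob names when load_all=1, since they would otherwise
// be silently ignored; with load_all=1 the blob names come from the db.
CAFFE_ENFORCE(
    !load_all_ || operator_def.output_size() == 0,
    "Cannot pass output blob names to Load when load_all=1; "
    "the loaded blob names are taken from the db itself.");
```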
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19133
Reviewed By: dzhulgakov
Differential Revision: D14883698
Pulled By: chandlerzuo
fbshipit-source-id: 6e4171e36c4ccc4f857e79da98b858a06b7d8ad6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19751
This has probably never been tested on Windows, but destruction of WorkersPool crashes because it uses _aligned_malloc to allocate and `free` to deallocate, which is not symmetric. The fix is to use _aligned_free for deallocation.
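For context, the required symmetry looks roughly like this (a generic sketch of the allocation pattern, not the exact WorkersPool code):
```
#include <cstdlib>
#ifdef _MSC_VER
#include <malloc.h>
#endif

// Allocation and deallocation must come from the same family.
void* AlignedAlloc(size_t size, size_t alignment) {
#ifdef _MSC_VER
  return _aligned_malloc(size, alignment);  // MSVC aligned allocator
#else
  void* ptr = nullptr;
  posix_memalign(&ptr, alignment, size);
  return ptr;
#endif
}

void AlignedFree(void* ptr) {
#ifdef _MSC_VER
  _aligned_free(ptr);  // not free(): that asymmetry is the crash fixed here
#else
  free(ptr);  // posix_memalign memory is released with plain free
#endif
}
```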
Reviewed By: hlu1
Differential Revision: D15083472
fbshipit-source-id: 42243fce8f2dfea7554b52e6b289d9fea81d7681
Summary:
The input shape checkers in the conv/int8_conv operators aim to avoid an issue when running with MKL-DNN's Winograd algorithm: the weights have to be reordered each time the input shape changes.
However, the checkers result in a big performance regression due to frequent reorders.
Meanwhile, this case has already been fixed in mkldnn-bridge by correcting the prop_kind.
Therefore, we remove the now-useless checkers to fix the performance regression.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19608
Differential Revision: D15061169
Pulled By: yinghai
fbshipit-source-id: 649a43ae6fce989e84939210f6dffb143ec3d350
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19585
Originally, we would unroll every If op into many different subnets.
Now we no longer unroll it; instead, we add all external inputs of its subnets to the If op itself and SSA-rewrite all external inputs/outputs. That is enough.
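A hedged sketch of the input-hoisting part (field names follow caffe2.proto; duplicate handling and the SSA rewrite itself are elided):
```
#include "caffe2/proto/caffe2_pb.h"

// Add every external input of the If op's subnets as an input of the
// If op itself, instead of unrolling the subnets.
void HoistSubnetInputs(caffe2::OperatorDef* if_op) {
  for (const auto& arg : if_op->arg()) {
    if ((arg.name() == "then_net" || arg.name() == "else_net") &&
        arg.has_n()) {
      for (const auto& name : arg.n().external_input()) {
        if_op->add_input(name);
      }
    }
  }
}
```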
Reviewed By: yinghai
Differential Revision: D15038139
fbshipit-source-id: 8532216d8749068acd5558ad0d8cb1d98463a063
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19621
The comments for group_norm_op are not accurate (specifically, the math part); this diff fixes them.
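For reference, the standard GroupNorm computation that the comments should describe, where the mean \mu_g and variance \sigma_g^2 are taken over the channels within each group g together with the spatial dimensions:
```
y = \gamma \cdot \frac{x - \mu_g}{\sqrt{\sigma_g^2 + \epsilon}} + \beta
```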
Reviewed By: BIT-silence
Differential Revision: D15048695
fbshipit-source-id: 27d41d3ae21054257967815254134849944d56ca
Summary:
Oftentimes, we want to experiment with the loss per element (per image, etc.). This changeset allows getting the per-element loss as well. This output is optional.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19579
Reviewed By: jerryzh168
Differential Revision: D15035797
Pulled By: prigoyal
fbshipit-source-id: 562dea514f49c1f2f1cbbc083a1938dc019a75c4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18499
If the init op is not fp16 compatible, it should throw.
However, in the special case where the original init op is UniformFill, we replace it with Float16UniformFill.
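A hedged sketch of the rewrite rule (the whitelist and helper names are assumptions, not the actual code):
```
#include <set>
#include <string>
#include "caffe2/core/logging.h"
#include "caffe2/proto/caffe2_pb.h"

// Hypothetical whitelist of init ops that are already fp16 compatible.
const std::set<std::string> kFp16CompatibleInitOps = {"ConstantFill"};

void ConvertInitOpToFp16(caffe2::OperatorDef* op) {
  if (op->type() == "UniformFill") {
    // Special case: swap in the fp16 equivalent instead of throwing.
    op->set_type("Float16UniformFill");
  } else if (kFp16CompatibleInitOps.count(op->type()) == 0) {
    CAFFE_THROW("Init op ", op->type(), " is not fp16 compatible");
  }
}
```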
Reviewed By: kennyhorror
Differential Revision: D14627209
fbshipit-source-id: eb427772874a732ca8b3a25d06670d119ce8ac14
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19442
For use cases like CV, some ops such as Transpose and Tile mangle the batch dimension, so we don't know how to adjust the output batch size. In this case, the current solution is to fix the input batch size statically and not adjust the output batch size.
Reviewed By: zrphercule
Differential Revision: D15007237
fbshipit-source-id: a21b943a52ee5462d9d7804dfae44360f579f8cf
Summary:
In this PR, the fusion algorithms are improved to support DNNLOWP:
1. Enabled conv fusions for DNNLOWP.
2. Fused the order-switch op into the following quantize op.
3. Improved conv+sum fusion to consider a larger scope/window.
4. Reorganized the fusion code to fix a random crash caused by changing the graph during traversal.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18843
Differential Revision: D15021030
Pulled By: yinghai
fbshipit-source-id: 88d2199d9fc69f392de9bfbe1f291e0ebf78ab08
Summary:
There are two corrections in this pull request.
The first is specific to gcc 7.4.0: compiled with -std=c++14, gcc 7.4.0 has __cplusplus = 201402L.
This does not meet the check set in Deprecated.h, which asks for > 201402L.
The compiler therefore falls through to the __GNUC__ check, which passes and sets C10_DEPRECATED_MESSAGE to a value that C++14 does not appear to support or even recognize, leading to a compile-time error.
My recommended solution, which worked for my case, was to change the > into a >=.
The second correction comes in response to this error:
```
caffe2/operators/crash_op.cc: In member function ‘virtual bool caffe2::CrashOp::RunOnDevice()’:
caffe2/operators/crash_op.cc:14:11: error: ‘SIGABRT’ was not declared in this scope
```
I am merely committing to the repository the solution suggested here (which worked for me):
https://discuss.pytorch.org/t/building-pytorch-from-source-in-conda-fails-in-pytorch-caffe2-operators-crash-op-cc/42859
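For reference, a minimal sketch of the two fixes, assuming the forum's suggested <csignal> include (the macro below is simplified from c10/util/Deprecated.h):
```
// Simplified sketch of the corrected check in c10/util/Deprecated.h:
// gcc 7.4.0 with -std=c++14 defines __cplusplus as exactly 201402L,
// so the comparison must be >= rather than > for this branch to fire.
#if defined(__cplusplus) && __cplusplus >= 201402L
#define C10_DEPRECATED_MESSAGE(message) [[deprecated(message)]]
#endif

// And in caffe2/operators/crash_op.cc, including <csignal> brings
// SIGABRT into scope:
#include <csignal>
```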
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19470
Differential Revision: D15019529
Pulled By: ailzhang
fbshipit-source-id: 9ce9d713c860ee5fd4266e5c2a7f336a97d7a90d
Summary:
This was actually getting pretty poor throughput with respect to memory bandwidth. I used this test to measure the memory bandwidth specifically for the AXPY call: https://gist.github.com/jamesr66a/b27ff9ecbe036eed5ec310c0a3cc53c5
And I got ~8 GB/s before this change, but ~14 GB/s after this change.
This seems to speed up the operator overall by around 1.3x (benchmark: https://gist.github.com/jamesr66a/c533817c334d0be432720ef5e54a4166):
```
== Before ==
time_per_iter 0.0001298875093460083
GB/s 3.082544287868467

== After ==
time_per_iter 0.00010104801654815674
GB/s 3.9623142905451076
```
The large difference between the local BW increase and the full-op BW increase likely indicates significant time is being spent elsewhere in the op, so I will investigate that.
EDIT: I updated this PR to include a call into caffe2/perfkernels. This is the progression:
```
Before:
time_per_iter 8.983819484710693e-05
GB/s 4.456723564864611

After (no axpy):
time_per_iter 7.19951868057251e-05
GB/s 5.56126065872172

After (perfkernels):
time_per_iter 5.6699180603027346e-05
GB/s 7.061548257694262

After (perfkernels, no grad):
time_per_iter 4.388842582702637e-05
GB/s 9.122769670026413
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19329
Reviewed By: dzhulgakov
Differential Revision: D14969630
Pulled By: jamesr66a
fbshipit-source-id: 42d1015772c87bedd119e33c0aa2c8105160a738
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19458
The algorithm in https://fburl.com/ggh9iyvc fails to truly ensure a topological ordering of the nodes. The fix is ugly but effective; I think we need a real topological sort (as sketched below) to fix this issue more nicely. cc Mikhail Zolotukhin, Bram Wasti.
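For reference, a real topological sort here would be the standard Kahn's algorithm (a generic sketch, not the code in this diff):
```
#include <cassert>
#include <cstddef>
#include <queue>
#include <unordered_map>
#include <vector>

// Kahn's algorithm: repeatedly emit nodes whose remaining indegree is 0.
std::vector<int> TopoSort(
    int num_nodes,
    const std::unordered_map<int, std::vector<int>>& successors) {
  std::vector<int> indegree(num_nodes, 0);
  for (const auto& kv : successors) {
    for (int m : kv.second) {
      ++indegree[m];
    }
  }
  std::queue<int> ready;
  for (int n = 0; n < num_nodes; ++n) {
    if (indegree[n] == 0) {
      ready.push(n);
    }
  }
  std::vector<int> order;
  while (!ready.empty()) {
    int n = ready.front();
    ready.pop();
    order.push_back(n);
    auto it = successors.find(n);
    if (it == successors.end()) {
      continue;
    }
    for (int m : it->second) {
      if (--indegree[m] == 0) {
        ready.push(m);
      }
    }
  }
  assert(order.size() == static_cast<std::size_t>(num_nodes) && "cycle");
  return order;
}
```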
Differential Revision: D15011893
fbshipit-source-id: 130c3aa442f5d578adfb14fbe5f16aa722434942
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19388
The old implementation forced a refcount bump when converting at::Tensor to caffe2::Tensor.
Now, it is possible to move it without a refcount bump.
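A hedged illustration of the difference (constructor signatures are assumptions based on this description, not the verified API):
```
#include <ATen/ATen.h>
#include "caffe2/core/tensor.h"

void Example() {
  at::Tensor at_tensor = at::empty({2, 3});
  // Copy-converting bumps the refcount on the underlying TensorImpl:
  caffe2::Tensor c2_copy(at_tensor);
  // Move-converting now transfers ownership without a refcount bump:
  caffe2::Tensor c2_moved(std::move(at_tensor));
}
```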
Reviewed By: dzhulgakov
Differential Revision: D14986815
fbshipit-source-id: 92b4b0a6f323ed38376ffad75f960cad250ecd9b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19287
Since we now have a string-schema-based op registration API, we can also use it when exposing caffe2 operators.
Reviewed By: dzhulgakov
Differential Revision: D14931925
fbshipit-source-id: ec162469d2d94965e8c99d431c801ae7c43849c8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19362
The `float` type is never used in OnnxifiOp.
Reviewed By: bddppq
Differential Revision: D14977970
fbshipit-source-id: 8fee02659dbe408e5a3e0ff95d74c04836c5c281
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19083
As we have discussed, there are too many AdjustBatch ops; they incur reallocation overhead and affect performance. We will eliminate these ops by
- inlining the input adjust-batch op into Glow, and
- inlining the output adjust-batch op into OnnxifiOp, and doing so only conditionally.
This is the C2 part of the change and requires a change on the Glow side to work e2e.
Reviewed By: rdzhabarov
Differential Revision: D14860582
fbshipit-source-id: ac2588b894bac25735babb62b1924acc559face6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19170
As title
The quantized resnext3d model in production hit the following failure without the fix:
```
Caffe2 operator Int8ConvRelu logging error: [enforce fail at conv_pool_op_base.h:463] order == StorageOrder::NCHW. 1 vs 2. Conv3D only supports NCHW
```
Reviewed By: jspark1105
Differential Revision: D14894276
fbshipit-source-id: ef97772277f322ed45215e382c3b4a3702e47e59
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19118
A bug introduced by D14700576 and reported by Yufei (fixed by D14778810 and D14785256) was not detected by our unit tests.
This diff improves the unit tests to catch such errors (with this diff and without D14778810, we can reproduce the bug Yufei reported).
The improvement also revealed a bug that affects accuracy when we pre-pack weight and bias together and the pre-packed weight/bias are used by multiple nets: we were modifying the pre-packed bias, which was supposed to be constant, in place.
Reviewed By: csummersea
Differential Revision: D14806077
fbshipit-source-id: aa9049c74b6ea98d21fbd097de306447a662a46d
Summary:
Currently, a TensorImpl's `is_variable_` is true if and only if the TensorImpl has AutogradMeta. This PR unifies the two concepts by removing `is_variable_` and changing `is_variable()` to check for the existence of AutogradMeta instead.
Removing `is_variable_` is part of the Variable/Tensor merge work.
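A minimal sketch of the unified check (simplified; the real accessor lives on TensorImpl and the member name may differ):
```
// After this change, variable-ness is derived from the presence of
// AutogradMeta rather than from a separate is_variable_ flag:
bool is_variable() const {
  return autograd_meta_ != nullptr;
}
```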
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19139
Differential Revision: D14893339
Pulled By: yf225
fbshipit-source-id: ceb5e22c3c01f79b5d21d5bdbf4a7d1bc397796a
Summary:
It's not intended that Storages have 'default' CUDA devices, but this is allowable via the Storage::create_legacy codepath.
This also messes with device caching, because the initial cache is obtained from the Storage, which may have a 'default' device.
Instead, we materialize a device by allocating 0 bytes via the allocator.
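A hedged illustration of the materialization step (the helper name is an assumption; the exact call site differs):
```
#include <c10/core/Allocator.h>
#include <c10/core/Device.h>

// Allocating 0 bytes forces the allocator to report a concrete device,
// instead of trusting a Storage that may carry a 'default' device.
c10::Device MaterializeDevice(c10::Allocator* allocator) {
  c10::DataPtr data_ptr = allocator->allocate(0);  // 0-byte allocation
  return data_ptr.device();  // the allocator's real device
}
```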
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18605
Differential Revision: D14680620
Pulled By: gchanan
fbshipit-source-id: 6d43383d836e90beaf12bfe37c3f0506843f5432
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19154
I recently saw some weird workflow errors due to a net_type that was set but empty. Maybe we should just fall back to the simple net in this case.
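A hedged sketch of the fallback (simplified; the actual check lives in net creation, and the helper name is an assumption):
```
#include "caffe2/proto/caffe2_pb.h"

// If net_type is set but empty, fall back to the simple net rather than
// trying to look up a net class registered under the empty string.
void NormalizeNetType(caffe2::NetDef* net_def) {
  if (net_def->type().empty()) {
    net_def->set_type("simple");
  }
}
```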
Reviewed By: dzhulgakov
Differential Revision: D14890072
fbshipit-source-id: 4e9edf8232298000713bebb0bfdec61e9c5df17d