Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43205
A number of tests that forward to `TestLoadSaveBase.load_save` are marked as flaky because they regularly take much longer to start up than hypothesis' default deadline of 200ms. This diff fixes the problem by removing the deadline for `load_save`. This is alright since these tests aren't meant to test the performance of these operators.
I would set the deadline to 60s if I could, but it appears that the caffe2 GitHub CI uses a different version of hypothesis that doesn't allow using `datetime.timedelta`, so instead of trying to figure out an approach that works on both I've just removed the deadline entirely.
I've also tagged all existing tasks WRT these failures.
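For reference, a minimal sketch (not the actual diff; the test name and strategy are placeholders) of how a hypothesis deadline is disabled for a slow test:
```
from hypothesis import given, settings, strategies as st

@settings(deadline=None)  # no per-example deadline; these tests don't measure performance
@given(st.integers(min_value=1, max_value=8))
def test_load_save(num_blobs):
    ...  # exercise the load/save round-trip
```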
Differential Revision: D23175752
fbshipit-source-id: 324f9ff034df1ac4874797f04f50067149a6ba48
Summary:
Pull Request resolved: https://github.com/pytorch/glow/pull/4812
- if no compilation options are passed, default to C-step
- fixed the FC and batchmatmul implementations to match C-step
- fixed the fakelowp map calling to make sure we use the fp32 substitution of operators
- updated the accumulator test to make it pass with fp32
Test Plan:
fakelowp tests
glow/test/numerics
net_runner
Reviewed By: jfix71
Differential Revision: D23086534
fbshipit-source-id: 3fbb8c4055bb190becb39ce8cdff6671f8558734
Summary:
Had a bunch of merged commits that shouldn't have been there, reverted them to prevent conflicts. Lots of new features, highlights listed below.
**Overall:**
- Enables pointwise fusion, single (but N-D) broadcast -> pointwise fusion, and single (but N-D) broadcast -> pointwise -> single (but N-D) reduction fusion (an illustrative example of such a chain follows the feature list below).
**Integration:**
- Separate "magic scheduler" logic that takes a fusion and generates code generator schedule
- Reduction fusion scheduling with heuristics closely matching eagermode (unrolling supported, but no vectorize support)
- 2-Stage caching mechanism, one on contiguity, device, type, and operations, the other one is input size->reduction heuristic
**Code Generation:**
- More generic support in code generation for computeAt
- Full rework of loop nest generation and Indexing to more generically handle broadcast operations
- Code generator has automatic kernel launch configuration (including automatic allocation of grid reduction buffers)
- Symbolic (runtime) tiling on grid/block dimensions is supported
- Simplified index generation based on user-defined input contiguity
- Automatic broadcast support (similar to numpy/pytorch semantics)
- Support for compile time constant shared memory buffers
- Parallelized broadcast support (i.e. block reduction -> block broadcast support)
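For illustration only (not part of the PR; shapes and ops are made up), the kind of pointwise -> broadcast -> reduction chain described above that the fuser targets:
```
import torch

def fused_pattern(x, bias):
    y = torch.relu(x)        # pointwise
    y = y + bias             # N-D broadcast of bias over x's leading dims
    return y.sum(dim=-1)     # single reduction

x = torch.randn(8, 16, 32)
bias = torch.randn(32)
out = fused_pattern(x, bias)   # candidate for a single generated kernel
```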
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43129
Reviewed By: mrshenli
Differential Revision: D23162207
Pulled By: soumith
fbshipit-source-id: 16deee4074c64de877eed7c271d6a359927111b2
Summary:
Since OpenMP is not available on some platforms, or might be disabled by the user, set the default `ATEN_THREADING` based on the USE_OPENMP and USE_TBB options.
Fixes https://github.com/pytorch/pytorch/issues/43036
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43067
Reviewed By: houseroad
Differential Revision: D23138856
Pulled By: malfet
fbshipit-source-id: cc8f9ee59a5559baeb3f19bf461abbc08043b71c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42927
added fp16 fusion to net transforms
refactored the transforms as well as glow_transform to get out of opt/custom so that the OSS builds passed
Test Plan: added net runner tests for this
Reviewed By: yinghai
Differential Revision: D23080881
fbshipit-source-id: ee6451811fedfd07c6560c178229854bca29301f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43037
In the previous version of mish_op.cc, the output would be 'nan' for large inputs. We rewrote mish_op.cc to solve this problem.
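As a hedged sketch of the numerical issue (illustrative Python, not the actual mish_op.cc change): mish(x) = x * tanh(softplus(x)); a naive evaluation overflows in the intermediate exponentials for large x and can end up with inf/inf = nan, while a stable softplus keeps intermediates bounded.
```
import numpy as np

def mish_naive(x):
    sp = np.log(1.0 + np.exp(x))   # exp(x) overflows to inf for large x
    # tanh expanded naively: inf/inf -> nan for large inputs
    return x * (np.exp(sp) - np.exp(-sp)) / (np.exp(sp) + np.exp(-sp))

def mish_stable(x):
    # softplus rewritten as max(x, 0) + log1p(exp(-|x|)) to avoid overflow
    sp = np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))
    return x * np.tanh(sp)

x = np.array([1.0, 50.0, 1000.0])
print(mish_naive(x))    # nan for the large input
print(mish_stable(x))   # finite everywhere
```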
Test Plan:
Unit test
buck test //dper3/dper3/modules/tests:core_modules_test -- test_linear_compress_embedding_with_attention_with_activation_mish
{F284052906}
buck test mode/opt //dper3/dper3_models/ads_ranking/tests:model_paradigm_e2e_tests -- test_sparse_nn_with_mish
{F284224158}
## Workflow
f212113434
{F285281318}
Differential Revision: D23102644
fbshipit-source-id: 98f1ea82f8c8e05b655047b4520c600fc1a826f4
Summary:
1. Fix an illegal memory access issue for the SplitByLengths operator in the CUDA context.
2. Add support for scaling lengths vectors in the SplitByLengths operator.
3. Add a test for the SplitByLengths operator in the CUDA context.
Example of the SplitByLengths operator processing a scaling lengths vector:
value vector A = [1, 2, 3, 4, 5, 6]
length vector B = [1, 2]
After execution of the SplitByLengths operator, the output should be [1, 2] and [3, 4, 5, 6].
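A small illustrative sketch of the scaling-lengths behavior described above (plain Python, not the CUDA kernel; assumes sum(lengths) evenly divides the value vector's length):
```
import numpy as np

def split_by_lengths(values, lengths):
    # scale each length so the pieces cover the whole value vector
    scale = len(values) // sum(lengths)   # assumes exact division
    out, offset = [], 0
    for l in lengths:
        out.append(values[offset:offset + l * scale])
        offset += l * scale
    return out

print(split_by_lengths(np.array([1, 2, 3, 4, 5, 6]), [1, 2]))
# -> [array([1, 2]), array([3, 4, 5, 6])]
```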
Test Plan: buck test mode/dev-nosan caffe2/caffe2/python/operator_test:concat_split_op_test
Reviewed By: kennyhorror
Differential Revision: D23079841
fbshipit-source-id: 3700e7f2ee0a5a2791850071fdc16e5b054f8400
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43014
Changing this behavior mimics the behavior of the old hypothesis testing library.
Test Plan: ran all tests on devserver
Reviewed By: hl475
Differential Revision: D23085949
fbshipit-source-id: 433fdfbb04b6a609b738eb7c319365049a49579b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42691
fix quantization of FC bias to match nnpi
quantize biases to fp16
Test Plan: improved the unit test to have input tensors in fp32
Reviewed By: tracelogfb
Differential Revision: D22941521
fbshipit-source-id: 00afb70610f8a149110344d52595c39e3fc988ab
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42871
The old version of hypothesis.testing was not enforcing deadlines. After the library got updated, the default deadline is 200ms, but even with 1s or more, tests are flaky. Changing the deadline to non-enforced, which is the same behavior as the old version.
Test Plan: tested fakelowp/tests
Reviewed By: hl475
Differential Revision: D23059033
fbshipit-source-id: 79b6aec39a2714ca5d62420c15ca9c2c1e7a8883
Summary: I found out that without exporting to public format, the IDEEP transpose operator in the middle of a convolution net produces incorrect results (probably reading some out-of-bounds memory). Exporting to public format might not be the most efficient solution, but at least it ensures correct behavior.
Test Plan: Running ConvFusion followed by transpose should give identical results on CPU and IDEEP
Reviewed By: bwasti
Differential Revision: D22970872
fbshipit-source-id: 1ddca16233e3d7d35a367c93e72d70632d28e1ef
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42763
add the fp16 fusions as net transforms:
- layernorm fused with mul+add
- swish int8
Test Plan: added unit test, ran flows
Reviewed By: yinghai
Differential Revision: D23002043
fbshipit-source-id: f0b13d51d68c240b05d2a237a7fb8273e996328b
Summary:
add a fuse path for deq->swish->quant
update swish fake op interface to take arguments accordingly
Test Plan:
net_runner passes
unit tests need to be updated
Reviewed By: venkatacrc
Differential Revision: D22962064
fbshipit-source-id: cef79768db3c8af926fca58193d459d671321f80
Summary:
Backout D22800959 (f30ac66e79). This one is causing the timeout (machine stuck) issues for dedup kernels. Reverting it makes the unit test pass. Still need to investigate why this is the culprit...
Original commit changeset: 641d52a51070
Test Plan:
```
buck test mode/dev-nosan //caffe2/caffe2/fb/net_transforms/tests:fuse_sparse_ops_test -- 'test_fuse_sparse_adagrad_with_sparse_lengths_sum_gradient \(caffe2\.caffe2\.fb\.net_transforms\.tests\.fuse_sparse_ops_test\.TestFuseSparseOps\)' --print-passing-details
```
Reviewed By: jspark1105
Differential Revision: D23008389
fbshipit-source-id: 4f1b9a41c78eaa5541d57b9d8aa12401e1d495f2
Summary: Add Python type annotations for the `caffe2.distributed.python` module.
Test Plan: Will check sandcastle results.
Reviewed By: jeffdunn
Differential Revision: D22994012
fbshipit-source-id: 30565cc41dd05b5fbc639ae994dfe2ddd9e56cb1
Summary:
This PR adds the `torch.linalg` namespace as part of our continued effort to be more compatible with NumPy. The namespace is tested by adding a single function, `torch.linalg.outer`, and testing it in a new test suite, test_linalg.py (a usage sketch follows the list below). It follows the same pattern as https://github.com/pytorch/pytorch/pull/41911, which added the `torch.fft` namespace.
Future PRs will likely:
- add more functions to torch.linalg
- expand the testing done in test_linalg.py, including legacy functions, like torch.ger
- deprecate existing linalg functions outside of `torch.linalg` in preference to the new namespace
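A possible usage sketch of the new function (assuming a build carrying this PR):
```
import torch

a = torch.tensor([1., 2., 3.])
b = torch.tensor([4., 5.])
print(torch.linalg.outer(a, b))   # 3x2 outer product, same result as torch.ger(a, b)
```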
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42664
Reviewed By: ngimel
Differential Revision: D22991019
Pulled By: mruberry
fbshipit-source-id: 39258d9b116a916817b3588f160b141f956e5d0b
Summary:
Pull Request resolved: https://github.com/pytorch/glow/pull/4787
Resurrect ONNX as a backend through onnxifiGlow (was killed as part of D16215878). Then look for the `use_glow_aot` argument in the Onnxifi op. If it's there and true, then we override whatever `backend_id` is set and use the ONNX backend.
Reviewed By: yinghai, rdzhabarov
Differential Revision: D22762123
fbshipit-source-id: abb4c3458261f8b7eeae3016dda5359fa85672f0
Summary: Put user embedding before ads embedding in blobReorder, for flash verification reasons.
Test Plan:
```
buck run mode/opt-clang -c python.package_style=inplace sigrid/predictor/scripts:enable_large_model_loading -- --model_path_src="/home/$USER/models/" --model_path_dst="/home/$USER/models_modified/" --model_file_name="182560549_0.predictor"
```
https://www.internalfb.com/intern/anp/view/?id=320921 to check blobsOrder
Reviewed By: yinghai
Differential Revision: D22964332
fbshipit-source-id: 78b4861476a3c889a5ff62492939f717c307a8d2
Summary:
Previously, when inferring Int8FC, we failed to carry over the scale and zero point properly.
Also fixed the int8 FC weight data type to be int8 instead of uint8, as that's what C2 actually uses.
Test Plan: Use net_runner to lower a single Int8Dequantize op. Previously, scale and bias would always be 1 and 0. Now the proper values are set.
Reviewed By: yinghai
Differential Revision: D22912186
fbshipit-source-id: a6620c3493e492bdda91da73775bfc9117db12d1
Summary:
This diff NVMifies the NE Eval Flow.
- It defines a `LoadNVM` operator which:
  - either receives a list of NVM blobs, or extracts the blobs that could be NVMified from the model,
  - dumps NVMified blobs into NVM,
  - and deallocates them from DRAM.
- NVMifies the Eval net on the dper and C2 backends.
Specific NVMOp for SLS is pushed through different diffs.
Test Plan: flow-cli test-locally dper.workflows.evaluation.eval_workflow --parameters-file=/mnt/public/ehsaardestani/temp/small_model.json 2>&1 | tee log
Reviewed By: yinghai, amylittleyang
Differential Revision: D22469973
fbshipit-source-id: ed8379ad404e96d04ac05e580176d3aca984575b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42522
Main changes:
- Consolidated CMake files to have a single entry point, rather than having a specialized one for PyTorch.
- Changed the way the preprocessor flags are provided, and changed their name.
There were a few instances in PyTorch's CMake files where we were directly adding TensorPipe's source directory as an include path, which however doesn't contain the auto-generated header we now added. We fix that by linking those targets against the `tensorpipe` CMake target, so that the include paths defined by TensorPipe, which contain that auto-generated header, are picked up.
I'm turning off SHM and CMA for now because they have never been covered by the CI. I'll enable them in a separate PR so that if they turn out to be flaky we can revert that change without reverting this one.
Test Plan: CI
Reviewed By: malfet
Differential Revision: D22959472
fbshipit-source-id: 1959a41c4a66ef78bf0f3bd5e3964969a2a1bf67
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42570
ProfiledType doesn't do anything and is not used atm, removing
Test Plan: CI
Reviewed By: ezyang
Differential Revision: D22938664
Pulled By: ilia-cher
fbshipit-source-id: 037c512938028f44258b702bbcde3f8c144f4aa0
Summary:
This PR creates a new namespace, torch.fft (torch::fft) and puts a single function, fft, in it. This function is a simplified version of NumPy's [numpy.fft.fft](https://numpy.org/doc/1.18/reference/generated/numpy.fft.fft.html?highlight=fft#numpy.fft.fft) that accepts no optional arguments. It is intended to demonstrate how to add and document functions in the namespace, and is not intended to deprecate the existing torch.fft function.
Adding this namespace was complicated by the existence of the torch.fft function in Python. Creating a torch.fft Python module makes this name ambiguous: does it refer to a function or module? If the JIT didn't exist, a solution to this problem would have been to make torch.fft refer to a callable class that mimicked both the function and module. The JIT, however, cannot understand this pattern. As a workaround it's required to explicitly `import torch.fft` to access the torch.fft.fft function in Python:
```
import torch.fft
t = torch.randn(128, dtype=torch.cdouble)
torch.fft.fft(t)
```
See https://github.com/pytorch/pytorch/issues/42175 for future work. Another possible future PR is to get the JIT to understand torch.fft as a callable class so it need not be imported explicitly to be used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41911
Reviewed By: glaringlee
Differential Revision: D22941894
Pulled By: mruberry
fbshipit-source-id: c8e0b44cbe90d21e998ca3832cf3a533f28dbe8d
Summary:
Enforce counter value to double type in rowwise_counter.
**Context:**
The existing implementation uses float type for the counter value. But due to the precision limit of single-precision floating point [1], we observed that the counter value can't increment beyond 16777216.0 (i.e., the max value is 16777216.0) in our earlier experiments. We decided to enforce double type to avoid this issue.
[1] https://stackoverflow.com/questions/12596695/why-does-a-float-variable-stop-incrementing-at-16777216-in-c
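A quick demonstration of the limit referenced above (illustrative numpy, not the operator code): incrementing a float32 counter by 1 has no effect once it reaches 2**24 = 16777216, while a double keeps counting.
```
import numpy as np

c32 = np.float32(16777216.0)
print(c32 + np.float32(1.0))   # 16777216.0 -- the increment is lost
c64 = np.float64(16777216.0)
print(c64 + 1.0)               # 16777217.0 -- double precision keeps counting
```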
Test Plan:
op test
```
ruixliu@devvm1997:~/fbsource/fbcode/caffe2/caffe2/python/operator_test(f0b0b48c)$ buck test :rowwise_counter_test
Trace available for this run at /tmp/testpilot.20200728-083200.729292.log
TestPilot test runner for Facebook. See https://fburl.com/testpilot for details.
Testpilot build revision cd2638f1f47250eac058b8c36561760027d16add fbpkg f88726c8ebde4ba288e1172a348c7f46 at Mon Jul 27 18:11:43 2020 by twsvcscm from /usr/local/fbprojects/packages/testinfra.testpilot/887/t.par
Discovering tests
Running 1 test
Started new test run: https://our.intern.facebook.com/intern/testinfra/testrun/7881299364977047
✓ caffe2/caffe2/python/operator_test:rowwise_counter_test - test_rowwise_counter (caffe2.caffe2.python.operator_test.rowwise_counter_test.TestRowWiseCounter) 0.265 1/1 (passed)
✓ caffe2/caffe2/python/operator_test:rowwise_counter_test - main 14.414 (passed)
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/7881299364977047
Summary (total time 18.51s):
PASS: 2
FAIL: 0
SKIP: 0
FATAL: 0
TIMEOUT: 0
OMIT: 0
```
optimizer test
```
ruixliu@devvm1997:~/fbsource/fbcode/caffe2/caffe2/python(7d66fbb9)$ buck test :optimizer_test
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/7036874434841896
Summary (total time 64.87s):
PASS: 48
FAIL: 0
SKIP: 24
caffe2/caffe2/python:optimizer_test - testGPUDense (caffe2.caffe2.python.optimizer_test.TestMomentumSgd)
caffe2/caffe2/python:optimizer_test - testGPUDense (caffe2.caffe2.python.optimizer_test.TestGFtrl)
caffe2/caffe2/python:optimizer_test - test_caffe2_cpu_vs_numpy (caffe2.caffe2.python.optimizer_test.TestYellowFin)
caffe2/caffe2/python:optimizer_test - testGPUDense (caffe2.caffe2.python.optimizer_test.TestSparseRAdam)
caffe2/caffe2/python:optimizer_test - testGPUDense (caffe2.caffe2.python.optimizer_test.TestRowWiseAdagradWithCounter)
caffe2/caffe2/python:optimizer_test - testGPUDense (caffe2.caffe2.python.optimizer_test.TestAdagrad)
caffe2/caffe2/python:optimizer_test - test_caffe2_gpu_vs_numpy (caffe2.caffe2.python.optimizer_test.TestYellowFin)
caffe2/caffe2/python:optimizer_test - testDense (caffe2.caffe2.python.optimizer_test.TestRowWiseAdagrad)
caffe2/caffe2/python:optimizer_test - testGPUDense (caffe2.caffe2.python.optimizer_test.TestFtrl)
caffe2/caffe2/python:optimizer_test - testSparse (caffe2.caffe2.python.optimizer_test.TestRmsProp)
...and 14 more not shown...
FATAL: 0
TIMEOUT: 0
OMIT: 0
```
param download test
```
ruixliu@devvm1997:~/fbsource/fbcode/caffe2/caffe2/fb/net_transforms/tests(7ef20a38)$ sudo buck test :param_download_test
Finished test run: Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/6473924481526935
```
e2e flow:
f208394929
f207991149
f207967273
ANP notebook to check the counter value loaded from the flows
https://fburl.com/anp/5fdcbnoi
screenshot of the loaded counter (note that counter max is larger than 16777216.0)
{F250926501}
Reviewed By: ellie-wen
Differential Revision: D22711514
fbshipit-source-id: 426fed7415270aa3f276dda8141907534734337f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41934
The model exported from online training workflow with int8 quantization contains FCs with 4 inputs. The extra input is the quant_param blob. This diff is to adjust the bound_shape_inferencer and int8 op schema to get shape info for the quant_param input.
Test Plan:
```
buck test caffe2/caffe2/opt:bound_shape_inference_test
```
Reviewed By: yinghai
Differential Revision: D22683554
fbshipit-source-id: 684d1433212a528120aba1c37d27e26b6a31b403
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41603
Pull Request resolved: https://github.com/pytorch/glow/pull/4704
Previously in the glow onnxifi path, when an error is encountered, we log it to stderr then just return ONNXIFI_STATUS_INTERNAL_ERROR to C2. C2 then does CAFFE2_ENFORCE_EQUAL(return_code, ONNXIFI_STATUS_SUCCESS). The error message that eventually went to the user is something like
[enforce fail at onnxifi_op.cc:545] eventStatus == ONNXIFI_STATUS_SUCCESS. 1030 vs 0
This diff adds plumbing to get human readable error message out of glow into C2.
Test Plan:
Run ads replayer. Overload it with traffic. Now the error message sent back to the client used to be
E0707 00:57:45.697196 3709559 Caffe2DisaggAcceleratorTask.cpp:493] During running REMOTE_OTHER net: [enforce fail at onnxifi_op.cc:545] eventStatus == ONNXIFI_STATUS_SUCCESS. 1030 vs 0 (Error from operator:....
Now it's
```
E0707 16:46:48.366263 1532943 Client.cpp:966] Exception when calling caffe2_run_disagg_accelerator on remote predictor for model 190081310_0 : apache::thrift::TApplicationException: c10::Error: [enforce fail at onnxifi_op.cc:556] .
Error code: RUNTIME_REQUEST_REFUSED
Error message: The number of allowed queued requests has been exceeded. queued requests: 100 allowed requests: 100
Error return stack:
glow/glow/lib/Runtime/HostManager/HostManager.cpp:673
glow/glow/lib/Onnxifi/HostMana (Error from operator:...
```
Reviewed By: gcatron, yinghai
Differential Revision: D22416857
fbshipit-source-id: 564bc7644d9666eb660725c2dca5637affae9b73
Summary: this breaks if we cut the net at certain int8 op boundaries.
Test Plan: with net_runner to lower a single Int8Quantize op. It used to break. Now it works.
Reviewed By: yinghai
Differential Revision: D22912178
fbshipit-source-id: ca306068c9768df84c1cfa8b34226a1330e19912
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42591
We don't support lowering with 2-input Int8Quantize and 4-input Int8FC. Just do a conversion to absorb the quantization params into the op itself.
Test Plan:
```
buck test caffe2/caffe2/quantization/server:quantize_dnnlowp_op_test
```
Reviewed By: benjibc
Differential Revision: D22942673
fbshipit-source-id: a392ba2afdfa39c05c5adcb6c4dc5f814c95e449
Summary:
1. Fix an illegal memory access issue for the SplitByLengths operator in the CUDA context.
2. Add support for scaling lengths vectors in the SplitByLengths operator.
3. Add a test for the SplitByLengths operator in the CUDA context.
Example of the SplitByLengths operator processing a scaling lengths vector:
value vector A = [1, 2, 3, 4, 5, 6]
length vector B = [1, 2]
After execution of the SplitByLengths operator, the output should be [1, 2] and [3, 4, 5, 6].
Test Plan: buck test mode/dev-nosan caffe2/caffe2/python/operator_test:concat_split_op_test
Reviewed By: kennyhorror
Differential Revision: D22780307
fbshipit-source-id: c5ca60ae16b24032cedfa045a421503b713daa6c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42249
Main change is to bring Caffe2's superior error messages for cuda initialization into c10 and use them in all code paths.
Basic logic:
| Case | Call to device_count() | init_cuda, e.g. allocating tensor |
| -- | -- | -- |
| all good | non-zero | just works |
| no gpus | 0, no warning | throw exception with good message |
| driver issues | 0, produce warning | throw exception with good message |
| out of memory with ASAN | 0, produce warning| throw exception with ASAN message |
Previously, the error thrown from init_cuda was very generic and the ASAN warning (if any) was buried in the logs.
Other clean up changes:
* cache device_count() always in a static variable
* move all asan macros in c10
Test Plan:
Hard to unittest because of build modes. Verified manually that the behavior from the table above holds by running the following script in different modes (ASAN/no-ASAN, CUDA_VISIBLE_DEVICES=):
```
print('before import')
import torch
print('after import')
print('devices: ', torch.cuda.device_count())
x = torch.tensor([1,2,3])
print('tensor creation')
x = x.cuda()
print('moved to cuda')
```
Reviewed By: ngimel
Differential Revision: D22824329
fbshipit-source-id: 5314007313a3897fc955b02f8b21b661ae35fdf5
Summary: The current OutputColumnMaxHistogramObserver outputs 2048 bins for each column, so the file is extremely large and the dumping time is quite long, even though in the end we only use the min and max. This diff makes the number of bins configurable by adding an argument, with the default set to 16 to reduce dumping overhead. When we need more bins to analyze the results, we only need to change this argument.
Test Plan:
buck run caffe2/caffe2/quantization/server:observer_test
{F263843430}
Reviewed By: hx89
Differential Revision: D22918202
fbshipit-source-id: bda34449355b269b24c55802012450ebaa4d280c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42516
att. We need it for some scripts.
Reviewed By: houseroad
Differential Revision: D22918112
fbshipit-source-id: 8a1696ceeeda67a34114bc57cb52c925711cfb4c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42137
This PR implements an SGD optimizer class similar to torch::optim::SGD, but it doesn't inherit from torch::optim::Optimizer, for use on mobile devices (or other lightweight use case).
Adding Martin's comment for visibility: "SGD may be the only optimizer used in near future. If more client optimizers are needed, refactoring the full optim codes and reusing the existing code would be an option."
Test Plan: Imported from OSS
Reviewed By: iseeyuan
Differential Revision: D22846514
Pulled By: ann-ss
fbshipit-source-id: f5f46804aa021e7ada7c0cd3f16e24404d10c7eb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42421
Previously, when doing onnxifi from Python, we could only feed shape info with float dtype and batch-based dim type. This diff removes that limitation and uses the TensorBoundShapes protobuf as a generic shape info struct. This will make the onnxifi interface in Python more flexible.
Reviewed By: ChunliF
Differential Revision: D22889781
fbshipit-source-id: 1a89f3a68c215a0409738c425b4e0d0617d58245
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42381
Introduce new tag to support distributed hogwild.
Reviewed By: boryiingsu
Differential Revision: D20484099
fbshipit-source-id: 5973495589e0a7ab185d3867b37437aa747f408a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42397
Since the autograd registration is unified to code-gen, we don't need to keep a manual registration file for mobile.
Remove it to avoid extra maintenance.
Test Plan: Imported from OSS
Reviewed By: ljk53
Differential Revision: D22883153
Pulled By: iseeyuan
fbshipit-source-id: 6db0bd89369beab9eed6e9a9692dd46f5bd1ff48
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42380
[Caffe2] Remove explicit divide-by-zero in SpatialBN training mode
Test Plan: buck test mode/dev-nosan //caffe2/caffe2/python/operator_test:spatial_bn_op_test
Reviewed By: houseroad
Differential Revision: D22873214
fbshipit-source-id: 70b505391b5db02b45fc46ecd7feb303e50c6280
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42255
Changes to match Fused Op: Dequantize->Swish->Quantize
* Changes to scale handling
Results showing matching intermediate and final Swish_Int8 Op.
P137389801
Test Plan: test case test_deq_swish_quant_nnpi.py
Reviewed By: hyuen
Differential Revision: D22827499
fbshipit-source-id: b469470ca66f6405ccc89696694af372ce6ce89e
Summary:
This diff provides an option for the DC++ module to use squeezed sparse feature embeddings to generate attention weights, with the purpose of reducing the network size to achieve QPS gains. There are three squeeze options along the embedding dimension: sum, max, and mean; they are provided for both the attention weight and resnet generation (a small illustrative sketch follows below).
Example workflow: f208474456
{F257199459}
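An illustrative sketch of the squeeze options described above (plain PyTorch; shapes and names are assumptions, not the DC++ module code):
```
import torch

emb = torch.randn(4, 10, 64)             # batch x num_features x embedding_dim
squeezed_sum = emb.sum(dim=-1)           # "sum" option  -> (4, 10)
squeezed_max = emb.max(dim=-1).values    # "max" option  -> (4, 10)
squeezed_mean = emb.mean(dim=-1)         # "mean" option -> (4, 10)
# the squeezed features then feed the attention-weight / resnet branches
```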
Test Plan:
1. Test single ops
buck test dper3/dper3/modules/low_level_modules/tests:single_operators_test -- test_reduce_back_mean
buck test dper3/dper3/modules/low_level_modules/tests:single_operators_test -- test_reduce_back_max
2. Test DC++ module
buck test dper3/dper3/modules/tests:core_modules_test -- test_dc_pp_arch_one_layer_compressed_embeddings_only_squeeze_input
buck test dper3/dper3/modules/tests:core_modules_test -- test_dc_pp_arch_shared_input_squeeze_input
buck test dper3/dper3/modules/tests:core_modules_test -- test_dc_pp_input_compress_embeddings_squeeze_input
3. Test Arch
buck test dper3/dper3_models/ads_ranking/model_impl/sparse_nn/tests:sparse_nn_lib_test -- test_dense_sparse_interaction_compress_dot_arch_dot_compress_pp_squeezed_input
4. e2e test
buck test dper3/dper3_models/ads_ranking/tests:model_paradigm_e2e_tests -- test_sparse_nn_compress_dot_attention_fm_max_fc_size_squeeze_input
Reviewed By: taiqing
Differential Revision: D22825069
fbshipit-source-id: 29269ea22cb47d487a1c92a1f6daae1055f54cfc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42225
Main changes:
- Consolidated CMake files to have a single entry point, rather than having a specialized one for PyTorch.
- Changed the way the preprocessor flags are provided, and changed their name.
There were a few instances in PyTorch's CMake files where we were directly adding TensorPipe's source directory as an include path, which however doesn't contain the auto-generated header we now added. We fix that by adding the `tensorpipe` CMake target as a dependency, so that the include paths defined by TensorPipe are used, which contain that auto-generated header.
I'm turning off SHM and CMA for now because they have never been covered by the CI. I'll enable them in a separate PR so that if they turn out to be flaky we can revert that change without reverting this one.
Test Plan: CircleCI is all green.
Reviewed By: beauby
Differential Revision: D22812445
fbshipit-source-id: e6d824bb28f5afe75fd765de0430968174f3531f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42286
One more bug to fix. Operators such as If and AsyncIf need special treatment not just in `onnx::SsaRewrite`, but also in `RemoveOpsByType`. The solution needs two steps:
1) add external inputs/outputs of the subnets of If/AsyncIf op to the inputs/outputs of the op
2) if the inputs/outputs of the If/AsyncIf op need to be renamed as a result, the same inputs/outputs of the subnets need to be renamed as well.
I also added unit tests to cover this corner case.
Test Plan:
```
buck test //caffe2/caffe2/fb/predictor:black_box_predictor_test
mkdir /tmp/models
rm -rf /tmp/$USER/snntest
rm -rf /tmp/snntest
buck run mode/opt admarket/lib/ranking/prediction_replayer/snntest_replayer_test/tools:snntest_replay_test -- --serving_paradigm=USER_AD_PRECOMPUTATION_DSNN
```
Differential Revision: D22834028
fbshipit-source-id: c070707316cac694f452a96e5c80255abf4014bc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42287
We shouldn't use block_size for thread dimensions in linear_index_weight_offsets_dedup_kernel, since the kernel doesn't iterate the embedding dimensions.
ghstack-source-id: 108834058
Test Plan:
```
buck test mode/dev-nosan //caffe2/caffe2/fb/net_transforms/tests:fuse_sparse_ops_test -- 'test_fuse_sparse_adagrad_with_sparse_lengths_sum_gradient \(caffe2\.caffe2\.fb\.net_transforms\.tests\.fuse_sparse_ops_test\.TestFuseSparseOps\)' --print-passing-details
```
Reviewed By: jspark1105
Differential Revision: D22800959
fbshipit-source-id: 641d52a51070715c04f9fd286e7e22ac62001f61
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42219
Introduce a new extra info tag on the forward net for operators sharing the same input. The effect is that the auto-generated sum of gradients for that input will not follow the device tags of the operators in the forward net. This allows more flexible device allocation.
Test Plan:
# unit test
`./buck-out/gen/caffe2/caffe2/python/core_gradients_test#binary.par -r testMultiUseInputAutoGenSumDevice`
Reviewed By: xianjiec, boryiingsu
Differential Revision: D22609080
fbshipit-source-id: d558145e5eb36295580a70e1ee3a822504dd439a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42151
Previously our Caffe2 SpatialBN op implementation computed running_var incorrectly, without the unbiasing coefficient. This should have failed the test because the output differs from CuDNN's output, but our tests were too weak to find this bug. This diff fixes all of them.
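For reference, a small sketch of the unbiasing coefficient mentioned above (illustrative numpy, not the SpatialBN kernel): running_var should accumulate the unbiased batch variance, i.e. the biased variance scaled by n / (n - 1).
```
import numpy as np

x = np.random.randn(32, 8)                 # n = 32 samples per channel
n = x.shape[0]
biased_var = x.var(axis=0)                 # divides by n
unbiased_var = biased_var * n / (n - 1)    # what running_var should accumulate
momentum = 0.9
running_var = np.ones(8)
running_var = momentum * running_var + (1 - momentum) * unbiased_var
```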
Test Plan: buck test mode/dev-nosan //caffe2/caffe2/python/operator_test:spatial_bn_op_test
Reviewed By: houseroad
Differential Revision: D22786127
fbshipit-source-id: db80becb67d60c44faae180c7e4257cb136a266d
Summary: Sometimes first dim of X in FC is BATCH_OF_FEATURE_MAX instead of BATCH. This caused an issue in f207899183 (when first dim of X is 64 but is set to 1 in inferFC). Change the check from `!= BATCH` to `== UNKNOWN`
Test Plan: unit test
Reviewed By: yinghai
Differential Revision: D22784691
fbshipit-source-id: eb66ba361d6fe75672b13edbac2fbd269a7e7a00
Summary:
Found while trying to get RocM Caffe2 CI green
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42168
Reviewed By: seemethere
Differential Revision: D22791879
Pulled By: malfet
fbshipit-source-id: 8f7ef9711bdc5941b2836e4c8943bb95c72ef8af
Summary:
Found while trying to get RocM Caffe2 job green
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42169
Reviewed By: seemethere
Differential Revision: D22791896
Pulled By: malfet
fbshipit-source-id: 9df6233876aec5ead056365499bab970aa7e8bdc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42118
We toggle tracing on with a certain probability. In the case of 3 inferences with trace on/off/on, we leak the trace from the first inference. Always cleaning up the trace will fix it.
Test Plan:
predictor
I created a tiny repro here: D22786551
With this fix, this issue is gone.
Reviewed By: gcatron
Differential Revision: D22768382
fbshipit-source-id: 9ee0bbcb2bc5f76107dae385759fe578909a683d
Summary:
the onnxifi path didn't handle the input/output name rewrite for ssa correctly for AsyncIf op. Add support for it.
Also fixed a place where we lose the net type while doing onnxifi transform.
Test Plan: Load 163357582_593 which is a multi feed model that uses AsyncIf. This used to fail with c2 not finding some blobs in workspace. Now it works.
Reviewed By: dhe95
Differential Revision: D21268230
fbshipit-source-id: ce7ec0e952513d0f251df1bfcfb2b0250f51fd94
Summary: we need this op to avoid splicing a dense tensor and then using the Mergesinglescaler op
Test Plan: integrated test with dper2
Differential Revision: D22677523
fbshipit-source-id: f4f9a1f06841b0906ec8cbb435482ae0a89e1721
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42114
Remove settings for the logit test case.
(Note: this ignores all push blocking failures!)
Test Plan: test_op_nnpi_fp16.py test case.
Reviewed By: hyuen
Differential Revision: D22766728
fbshipit-source-id: 2fe8404b103c613524cf1beddf1a0eb9068caf8a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41482
This adds a new tag for use with pipeline parallelism.
Test Plan: CI
Reviewed By: heslami
Differential Revision: D22551487
fbshipit-source-id: 90910f458a9bce68f7ef684773322a49aa24494a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41687
Specifically, this makes a new library (lazy), which can be used from both core and workspace.
This allows workspace.CreateNet to trigger lazy loading of dyndep dependencies.
Test Plan: Added a unit test specifically for workspace.CreateNet
Reviewed By: dzhulgakov
Differential Revision: D22441877
fbshipit-source-id: 3a9d1af9962585d08ea2566c9c85bec7377d39f2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41820
Pull Request resolved: https://github.com/pytorch/glow/pull/4721
In order to support int8 quantized tensor as an input to OnnxifiOp, we need to
- Add support to recognize and extract shape meta from int8 tensor at input of OnnxifiOp
- Make a copy of the input data and shift it by 128 in Glow if the input data is a uint8 quantized tensor, to get the correct result, because Glow uses int8 to represent quantized data regardless.
- Propagate correct quantization parameters through shape info in C2.
This diff implements the above.
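An illustrative numpy sketch of the uint8 -> int8 shift described above (not the Glow code): shifting the quantized values by 128 and adjusting the zero point by the same amount leaves the dequantized values unchanged.
```
import numpy as np

u8 = np.array([0, 128, 255], dtype=np.uint8)
zero_point_u8 = 128
i8 = (u8.astype(np.int16) - 128).astype(np.int8)   # [-128, 0, 127]
zero_point_i8 = zero_point_u8 - 128                # 0
# scale * (q - zero_point) is identical before and after the shift
```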
Test Plan:
```
buck test caffe2/caffe2/contrib/fakelowp/test:test_int8_quantnnpi
```
Reviewed By: jackm321
Differential Revision: D22650584
fbshipit-source-id: 5e867f7ec7ce98bb066ec4128ceb7cad321b3392
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41693
Add non zero offset test cases for Quantize and Dequantize Ops.
Test Plan: Added new test case test_int8_non_zero_offset_quantize part of the test_int8_ops_nnpi.py test file.
Reviewed By: hyuen
Differential Revision: D22633796
fbshipit-source-id: be17ee7a0caa6e9bc7b175af539be2e6625ad47a
Summary:
## TLDR
Support using a NaN default value for missing dense features in RawInputProcessor for DPER2, in preparation for subsequent support for null flag features in compute meta. For train_eval this is already supported in DPER3, and we do not plan to support this in DPER2 train_eval.
Differential Revision: D22439142
fbshipit-source-id: 99ae9755bd41a5d5f43bf5a9a2819d64f3883005
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41618
More LayerNorm Vectorization in calcMeanStd function.
Test Plan: test covered in test_layernorm_nnpi_fp16.py
Reviewed By: hyuen
Differential Revision: D22606585
fbshipit-source-id: be773e62f0fc479dbc2d6735f60c2e98441916e9
Summary:
A minor spell check!
I have gone through a dozen .md files to fix the typos.
zou3519 take a look!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41599
Reviewed By: ezyang
Differential Revision: D22601629
Pulled By: zou3519
fbshipit-source-id: 68d8f77ad18edc1e77874f778b7dadee04b393ef
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41606
The previous diff (D22220798 (59294fbbb9) and D22220797) was recently reverted (D22492356 (28291d3cf8), D22492355) because of a bug associated with the op AsyncIf. The AsyncIf op has net_defs as args and the SSA rewriting didn't take that into account. It has a special path for the op If, but not for AsyncIf. Several changes I made to fix the bug:
1) Add op AsyncIf to the special path for If op in SSA rewriting
2) clear inputs/outputs of the netdefs that are args in If/AsyncIf ops because they're no longer valid
3) revert renamed inputs/outputs in the arg netdefs that are in the external_outputs in the parent netdef
2) and 3) are existing bugs in the `SsaRewrite` function that were just never exposed before.
The algorithm for `RemoveOpsByType` is the same as in my previous diff D22220798 (59294fbbb9). The only new changes in this diff are in `onnx::SsaRewrite` and a few newly added unit tests.
(Note: this ignores all push blocking failures!)
Reviewed By: yinghai
Differential Revision: D22588652
fbshipit-source-id: ebb68ecd1662ea2bae14d4be8f61a75cd8b7e3e6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41577
* Remove skipping test
* Use fma_avx_emulation
* Increase test examples to 100
(Note: this ignores all push blocking failures!)
Test Plan: Tests are covered in test_sls_8bit_nnpi.py
Reviewed By: hyuen
Differential Revision: D22585742
fbshipit-source-id: e1f62f47eb10b402b11893ffca7a6786e31daa79
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41575
Fixes https://github.com/pytorch/pytorch/issues/34294
This updates the C++ argument parser to correctly handle `TensorList` operands. I've also included a number of updates to the testing infrastructure; this is because we're now doing a much more careful job of testing the signatures of aten kernels, using the type information about the arguments as read in from `Declarations.yaml`. The changes to the tests are required because we're now only checking for `__torch_function__` attributes on `Tensor`, `Optional[Tensor]` and elements of `TensorList` operands, whereas before we were checking for `__torch_function__` on all operands, so the relatively simplistic approach the tests were using before -- assuming all positional arguments might be tensors -- doesn't work anymore. I now think that checking for `__torch_function__` on all operands was a mistake in the original design.
The updates to the signatures of the `lambda` functions are to handle this new, more stringent checking of signatures.
I also added override support for `torch.nn.functional.threshold` and `torch.nn.functional.layer_norm`, which did not yet have python-level support.
Benchmarks are still WIP.
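A minimal illustration of the `__torch_function__` protocol that the argument parser now checks on `TensorList` elements (a sketch, not the PR's test code; assumes a torch version with Tensor-subclass `__torch_function__` support):
```
import torch

class LoggingTensor(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        print(f"intercepted {func.__name__}")
        return super().__torch_function__(func, types, args, kwargs)

t = torch.randn(2, 2).as_subclass(LoggingTensor)
out = torch.cat([t, t])   # element of a TensorList operand -> override is honored
```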
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34725
Reviewed By: mruberry
Differential Revision: D22357738
Pulled By: ezyang
fbshipit-source-id: 0e7f4a58517867b2e3f193a0a8390e2ed294e1f3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41464
If the input is int8 rowwise quantized, currently we cannot lower it to Glow. And previously, we had some error when running with in-batch broadcast. The main issue is that the Tile op doesn't support the uint8_t type, which is very easily added here. However, this will result in the non-ideal situation where we leave Tile -> Fused8BitRowwiseQuantizedToFloat on the host side, which probably hurts the memory bandwidth a lot. Even if we later add support for Fused8BitRowwiseQuantizedToFloat in Glow, it's still not ideal because we are doing redundant compute on identical columns. So the solution here is to swap the order of Fused8BitRowwiseQuantizedToFloat and Tile to make it Tile -> Fused8BitRowwiseQuantizedToFloat. In this way, it will resolve the error we saw immediately. For the short term, we can still run Tile on card. And for the longer term, things run faster on card.
The optimization is a heuristic. If the net doesn't contain such a pattern, in-batch broadcast will work as before.
(Note: this ignores all push blocking failures!)
Test Plan:
```
buck test caffe2/caffe2/opt/custom:in_batch_broadcast_test
```
Reviewed By: benjibc
Differential Revision: D22544162
fbshipit-source-id: b6dd36a5925a9c8103b80f034e7730a7a085a6ff
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41343
Currently caffe2.InitOpLibrary does the dll import unilaterally. Instead, if we make a lazy version and use it, then many pieces of code which do not need the caffe2 operators get a lot faster.
On a real test, the import time went from 140s to 68.8s.
This also cleans up the algorithm slightly (although it makes a very minimal
difference), by parsing the list of operators once, rather than every time a
new operator is added, since we defer the RefreshCall until after we've
imported all the operators.
The key way we maintain safety, is that as soon as someone does an operation
which requires a operator (or could), we force importing of all available
operators.
Future work could include trying to identify which code is needed for which
operator and only import the needed ones. There may also be wins available by
playing with dlmopen (which opens within a namespace), or seeing if the dl
flags have an impact (I tried this and didn't see an impact, but dlmopen may
make it better).
Note that this was previously landed and reverted. The issue was that if an import failed and raised an exception, the specific library would not be removed from the lazy imports. This caused our tests which had libraries that failed to poison all other tests that ran after it. This has been fixed and a unit test has been added for this case (to help make it obvious what failed).
Test Plan:
I added a new test a lazy_dyndep_test.py (copied from all_compare_test.py).
I'm a little concerned that I don't see any explicit tests for dyndep, but this
should provide decent coverage.
I've added a specific test to handle the poisoning issues mentioned above, which caused the previous version to get reverted.
Differential Revision: D22506369
fbshipit-source-id: 7395df4778e8eb0220630c570360b99a7d60eb83
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41505
fix the dequantization to match the fixes from quantization
Test Plan:
test is not conclusive, since only comparing emulation with reference collected from Amy's run
running an evaluation workflow at the moment
Reviewed By: venkatacrc
Differential Revision: D22558092
fbshipit-source-id: 3ff00ea15eac76007e194659c3b4949f07ff02a4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41494
revert back to the changes from amylittleyang to make quantization work
Test Plan:
ran against a dump from ctr_instagram, and verified that:
- nnpi and fakelowp match bitwise
- nnpi differs by at most 1 vs fbgemm, most likely due to the type of rounding
Reviewed By: venkatacrc
Differential Revision: D22555276
fbshipit-source-id: 7074521d181f15ef6270985bb71c4b44d25d1c30
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41476
deleted this test by default, re-adding it in its own file to make it
more explicit
Test Plan: ran the test
Reviewed By: yinghai
Differential Revision: D22550217
fbshipit-source-id: 758e279b2bab3b23452a3d0ce75fb366f7afb7be
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41461
capacity is misleading, and we have many wrong uses internally. Let's rename it to nbytes to avoid the confusion in the future. Ultimately, we could remove this parameter if possible.
So far I haven't seen any case where this capacity is necessary.
Test Plan: oss ci
Differential Revision: D22544189
fbshipit-source-id: f310627f2ab8f4ebb294e0dd5eabc380926991eb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36893
Adding an end to end test for running a simple training loop in C++
for the distributed RPC framework.
The goal of this change is to enable LeakSanitizer and potentially catch memory
leaks in the Future. Enabling LSAN with python multiprocessing is tricky and we
haven't found a solution for this. As a result, adding a C++ test that triggers
most of the critical codepaths would be good for now.
As an example, this unit test would've caught the memory leak fixed by:
https://github.com/pytorch/pytorch/pull/31030
ghstack-source-id: 107781167
Test Plan:
1) Verify the test catches memory leaks.
2) waitforbuildbot
Reviewed By: mrshenli
Differential Revision: D21112208
fbshipit-source-id: 4eb2a6b409253108f6b6e14352e593d250c7a64d
Summary: Adding epsilon input argument to the Logit Op
Test Plan: Added test_logit test case.
Reviewed By: hyuen
Differential Revision: D22537133
fbshipit-source-id: d6f89afd1589fda99f09550a9d1b850cfc0b9ee1
Summary:
Add support for including pytorch via an add_subdirectory()
This requires using PROJECT_* instead of CMAKE_*, which refers to the top-most project including pytorch.
TEST=add_subdirectory() into a pytorch checkout and build.
There are still some hardcoded references to TORCH_SRC_DIR; I will fix them in a follow-on commit. For now you can create a symlink to <pytorch>/torch/ in your project.
Change-Id: Ic2a8aec3b08f64e2c23d9e79db83f14a0a896abc
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41387
Reviewed By: zhangguanheng66
Differential Revision: D22539944
Pulled By: ezyang
fbshipit-source-id: b7e9631021938255f0a6ea897a7abb061759093d
Summary: Adding shape inference for SparseToDense. The proposed shape inference impl only works when data_to_infer_dim is given; otherwise the SparseToDense output dimension depends on the max value of the input tensor.
Test Plan:
buck test //caffe2/caffe2/python:sparse_to_dense_test
buck test //caffe2/caffe2/python:hypothesis_test -- test_sparse_to_dense
Dper3 Changes:
f204594813
buck test dper3/dper3_models/ads_ranking/model_impl/sparse_nn/tests:sparse_nn_lib_test
Reviewed By: zhongyx12, ChunliF
Differential Revision: D22479511
fbshipit-source-id: 8983a9baea8853deec53ad6f795c874c3fb93de0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41452
The model exported from online training workflow with int8 quantization contains FCs with 4 inputs. The extra input is the quant_param blob. This diff is to adjust the bound_shape_inferencer to get shape info for the quant_param input.
Test Plan:
```
buck test caffe2/caffe2/opt:bound_shape_inference_test
```
Reviewed By: anurag16
Differential Revision: D22543215
fbshipit-source-id: 0977fca06630e279d47292e6b44f3d8180a767a5
Summary:
1. Support SparseAdagradFusedWithSparseLengthsMeanGradient and RowWiseSparseAdagradFusedWithSparseLengthsMeanGradient on CPU and GPU
2. Add the dedup implementation of fused RowWiseAdagrad op on GPUs for mean pooling
Reviewed By: xianjiec
Differential Revision: D22165603
fbshipit-source-id: 743fa55ed5893c34bc6406ddfbbbb347b88091d1
Summary:
remove layernorm templates and make them float since that's the only variant
minor fixes in logging and testing
Test Plan: ran the test
Reviewed By: venkatacrc
Differential Revision: D22527359
fbshipit-source-id: d6eec362a6e88e1c12fddf820ae629ede13fb2b8
Summary:
nccl tests and parallelize_bmuf_distributed test are failing on rocm3.5.1. Skipping these tests to upgrade the CI to rocm3.5.1
jeffdaily sunway513
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41409
Reviewed By: orionr
Differential Revision: D22528928
Pulled By: seemethere
fbshipit-source-id: 928196b7a62a441d391e69f54b278313ecc75d77
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40179
- Pass no-psabi to suppress GCC's "The ABI for passing parameters with 64-byte alignment has changed in GCC 4.6" warning
- Fix use of deprecated data() accessor (and minor optimization: hoist
accessor out of loop)
- Undeprecate NetDef.num_workers, no one is serious about fixing these
- Suppress warnings about deprecated pthreadpool types
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D22234138
Pulled By: ezyang
fbshipit-source-id: 6a1601b6d7551a7e6487a44ae65b19acdcb7b849
Summary: add logit and swish to this list
Test Plan: f203925461
Reviewed By: amylittleyang
Differential Revision: D22506814
fbshipit-source-id: b449e4ea16354cb76915adb01cf317cffb494733
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41215
To unblock int8 model productization on accelerators, we need the shape and type info for all the blobs after int8 quantization. This diff added shape inference functions for int8 quantization related ops.
Test Plan:
```
buck test caffe2/caffe2/quantization/server:int8_gen_quant_params_test
buck test caffe2/caffe2/quantization/server:fully_connected_dnnlowp_op_test
```
Reviewed By: hx89
Differential Revision: D22467487
fbshipit-source-id: 8298abb0df3457fcb15df81f423f557c1a11f530
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41313
This diff backs out the backout diff. The failure was due to C++ `or`
not being supported in MSVC. This is now replaced with ||
Original commit changeset: fc7f3f8c968d
Test Plan: Existing unit tests, check github CI.
Reviewed By: malfet
Differential Revision: D22494777
fbshipit-source-id: 3271288919dc3a6bfb82508ab9d021edc910ae45
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41305
added a warning message when layernorm under/overflows, which is what
nnpi does, reducing the frequency of the logging to every 1000
Test Plan: compilation
Reviewed By: yinghai
Differential Revision: D22492726
fbshipit-source-id: 9343beeae6e65bf3846c6b3d2edd2a08dac85ed6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41315
We should pass the number of indices rather than the embedding size in the SparseAdagrad fused PyTorch operator
Reviewed By: jianyuh
Differential Revision: D22495422
fbshipit-source-id: ec5d3a5c9547fcd8f95106d912b71888217a5af0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41303
the error came from I0710 18:02:48.025024 1780875 NNPIOptions.cpp:49]
[NNPI_LOG][D] [KS] convert_base_kernel_ivp.cpp(524): Output Scale 108240.101562
is out of valid range +-(Min 0.000061 Max 65504.000000)!!!
Seems like the weights we are using are too small, thus generating scaling
factors out of the range of fp16 (>65k). I am tentatively increasing this
factor to a higher value to avoid this. (10x bigger)
Also increased max_examples to 100
Test Plan: ran this test
Reviewed By: yinghai
Differential Revision: D22492481
fbshipit-source-id: c0f9e59b0e70895ab787868ef1d87e6e80106554
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41299
When using `cub::DeviceRadixSort::SortPairs` (https://nvlabs.github.io/cub/structcub_1_1_device_radix_sort.html), the `end_bit` argument, or the most-significant bit index (exclusive) needed for key comparison, should be passed with `int(log2(float(num_rows)) + 1)` instead of `int(log2(float(num_indice)) + 1)`. This is because all the values in indices array are guaranteed to be less than num_rows (hash_size), not num_indices. Thanks ngimel for pointing this point and thanks malfet for quickly fixing the log2() compilation issues.
Note:
An optional bit subrange [begin_bit, end_bit) of differentiating key bits can be specified. This can reduce overall sorting overhead and yield a corresponding performance improvement.
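An illustrative computation of the two end_bit values discussed above (plain Python; sizes are made up): the bit count must cover the largest possible key, which is bounded by the number of rows (hash_size), not by how many indices are sorted.
```
import math

num_rows = 1_000_000       # hash_size; every index value is < num_rows
num_indices = 50_000_000   # how many index entries get sorted

end_bit_correct = int(math.log2(float(num_rows))) + 1      # 20 bits
end_bit_previous = int(math.log2(float(num_indices))) + 1  # 26 bits, from the wrong bound
```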
Test Plan:
```
buck test mode/dev-nosan //caffe2/caffe2/fb/net_transforms/tests:fuse_sparse_ops_test -- 'test_fuse_sparse_adagrad_with_sparse_lengths_sum_gradient \(caffe2\.caffe2\.fb\.net_transforms\.tests\.fuse_sparse_ops_test\.TestFuseSparseOps\)' --print-passing-details
```
Reviewed By: malfet
Differential Revision: D22491662
fbshipit-source-id: 4fdabe86244c948af6244f9bd91712844bf1dec1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40875
This op uses the given num_bins and a spacing strategy to automatically bin and compute the histogram of given matrices.
Test Plan: Unit tests.
Reviewed By: neha26shah
Differential Revision: D22329069
fbshipit-source-id: 28406b94e284d52d875f73662fc82f93dbc00064
Summary:
unique op test failure in caffe2 blocks upgrading CI to rocm3.5.1. Skipping the test to unblock; will re-enable after root-causing and fixing the issue.
jeffdaily sunway513
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41219
Differential Revision: D22471452
Pulled By: xw285cornell
fbshipit-source-id: 9e503c8b37c0a4b92632f77b2f8a90281a9889c3
Summary:
the current quantization rounding function uses fbgemm, which defaults to round-to-nearest. The current hardware implementation uses round flush to infinity. Adding such an option to switch the rounding mode.
Test Plan: ran against test_fc_int8
Reviewed By: venkatacrc
Differential Revision: D22452306
fbshipit-source-id: d2a1fbfc695612fe07caaf84f52669643507cc9c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39341
This PR introduces neon backend for vec256 class for float datatype.
For now only aarch64 is enabled due to a few issues with enabling it on aarch32.
Test Plan:
vec256_test
Imported from OSS
Differential Revision: D21822399
fbshipit-source-id: 3851c4336d93d1c359c85b38cf19904f82bc7b8d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40059
This benchmark is added specifically for mobile to see if the compiler is auto-vectorizing, in which case the NEON backend for vec256 gives no advantage for the add op.
Test Plan:
CI
Imported from OSS
Differential Revision: D22055146
fbshipit-source-id: 43ba6c4ae57c6f05d84887c2750ce21ae1b0f0b5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40806
When the input is empty, the operator will crash on "runtime error: division by zero". This has been causing Inference platform server crashes.
Example crash logs:
{P134526683}
Test Plan:
Unit test
See reproducing steps in the Test Plan of D22300135
Reviewed By: houseroad
Differential Revision: D22302089
fbshipit-source-id: aaa5391fddc86483b0f3aba3efa7518e54913635
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41096
The spark spot model had some issues in tensor conversion, see P134598596. It happens when we convert an undefined c10 tensor to caffe2 tensor.
This diff added a null check.
Test Plan: spark spot model runs without problem
Reviewed By: smessmer
Differential Revision: D22330705
fbshipit-source-id: dfe0f29a48019b6611cad3fd8f2ae49e8db5427e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39488
Currently caffe2.InitOpLibrary does the dll import unilaterally. Instead, if we make a lazy version and use it, then many pieces of code which do not need the caffe2 operators get a lot faster.
On a real test, the import time went from 140s to 68.8s.
This also cleans up the algorithm slightly (although it makes a very minimal
difference), by parsing the list of operators once, rather than every time a
new operator is added, since we defer the RefreshCall until after we've
imported all the operators.
The key way we maintain safety, is that as soon as someone does an operation
which requires a operator (or could), we force importing of all available
operators.
Future work could include trying to identify which code is needed for which
operator and only import the needed ones. There may also be wins available by
playing with dlmopen (which opens within a namespace), or seeing if the dl
flags have an impact (I tried this and didn't see an impact, but dlmopen may
make it better).
Test Plan:
I added a new test a lazy_dyndep_test.py (copied from all_compare_test.py).
I'm a little concerned that I don't see any explicit tests for dyndep, but this
should provide decent coverage.
Differential Revision: D21870844
fbshipit-source-id: 3f65fedb65bb48663670349cee5e1d3e22d560ed