Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42927
added fp16 fusion to net transforms
refactored the transforms as well as glow_transform to move them out of opt/custom so that the OSS builds pass
Test Plan: added net runner tests for this
Reviewed By: yinghai
Differential Revision: D23080881
fbshipit-source-id: ee6451811fedfd07c6560c178229854bca29301f
Summary:
add a fuse path for deq->swish->quant
update the swish fake op interface to take the corresponding arguments
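A minimal sketch of the kind of rewrite this adds, over a simplified list-of-dicts stand-in for a Caffe2 NetDef; the op names here (Int8Dequantize, Swish, SwishInt8) are illustrative, not the exact registered names.
```
# Hedged sketch, not the actual pass: collapse Dequantize -> Swish -> Quantize
# into one op that carries the output quantization params.
def fuse_deq_swish_quant(ops):
    fused, i = [], 0
    while i < len(ops):
        if (i + 2 < len(ops)
                and ops[i]["type"] == "Int8Dequantize"
                and ops[i + 1]["type"] == "Swish"
                and ops[i + 2]["type"] == "Int8Quantize"
                # intermediate outputs must feed directly into the next op
                and ops[i]["output"] == ops[i + 1]["input"]
                and ops[i + 1]["output"] == ops[i + 2]["input"]):
            fused.append({
                "type": "SwishInt8",                    # hypothetical fused op name
                "input": ops[i]["input"],
                "output": ops[i + 2]["output"],
                # absorb the quantization args of the trailing Int8Quantize
                "args": dict(ops[i + 2].get("args", {})),
            })
            i += 3
        else:
            fused.append(ops[i])
            i += 1
    return fused

net = [
    {"type": "Int8Dequantize", "input": "x_int8", "output": "x_fp"},
    {"type": "Swish", "input": "x_fp", "output": "y_fp"},
    {"type": "Int8Quantize", "input": "y_fp", "output": "y_int8",
     "args": {"Y_scale": 0.05, "Y_zero_point": 128}},
]
print(fuse_deq_swish_quant(net))  # -> one SwishInt8 op from x_int8 to y_int8
```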
Test Plan:
net_runner passes
unit tests need to be updated
Reviewed By: venkatacrc
Differential Revision: D22962064
fbshipit-source-id: cef79768db3c8af926fca58193d459d671321f80
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42591
We don't support lowering with 2-input Int8Quantize and 4-input Int8FC. Just do a conversion to absorb the quantization params into the op itself.
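Roughly, the conversion looks like the sketch below (plain dicts standing in for OperatorDef; the Y_scale/Y_zero_point argument names, and the assumption that the qparam blob's values are known at transform time, are mine for illustration):
```
# Hedged sketch: fold the quantization-params input blob into op arguments so
# the op has the canonical input count and can be lowered.
def absorb_qparams(op, qparam_values):
    qparam_input_count = {"Int8Quantize": 2, "Int8FC": 4}
    n = qparam_input_count.get(op["type"])
    if n is None or len(op["inputs"]) != n:
        return op  # nothing to convert
    scale, zero_point = qparam_values[op["inputs"][-1]]
    return {
        **op,
        "inputs": op["inputs"][:-1],  # drop the qparam blob
        "args": {**op.get("args", {}), "Y_scale": scale, "Y_zero_point": zero_point},
    }

op = {"type": "Int8Quantize", "inputs": ["x", "qparam"], "outputs": ["x_int8"]}
print(absorb_qparams(op, {"qparam": (0.02, 0)}))
```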
Test Plan:
```
buck test caffe2/caffe2/quantization/server:quantize_dnnlowp_op_test
```
Reviewed By: benjibc
Differential Revision: D22942673
fbshipit-source-id: a392ba2afdfa39c05c5adcb6c4dc5f814c95e449
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41464
If the input is int8 rowwise quantized, we currently cannot lower it to Glow, and previously we hit an error when running with in-batch broadcast. The main issue is that the Tile op doesn't support the uint8_t type, which is easily added here. However, that alone would leave Tile -> Fused8BitRowwiseQuantizedToFloat on the host side, which probably hurts memory bandwidth a lot. Even if we later add Fused8BitRowwiseQuantizedToFloat support to Glow, it is still not ideal because we would be doing redundant compute on identical columns. So the solution here is to swap the order of Tile and Fused8BitRowwiseQuantizedToFloat to make it Fused8BitRowwiseQuantizedToFloat -> Tile. This immediately resolves the error we saw. In the short term we can still run Tile on card, and in the longer term things run faster on card.
The optimization is a heuristic: if the net doesn't contain such a pattern, in-batch broadcast works as it did before.
(Note: this ignores all push blocking failures!)
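To make the intent concrete, here is a hedged sketch of the reordering over simplified op dicts (not the actual caffe2/opt/custom pass), assuming the desired final order is Fused8BitRowwiseQuantizedToFloat -> Tile:
```
# Dequantize the small batch-1 blob first, then Tile its float output, instead
# of tiling the uint8 blob and dequantizing N identical copies.
def reorder_tile_and_dequant(ops):
    out, i = [], 0
    while i < len(ops):
        if (i + 1 < len(ops)
                and ops[i]["type"] == "Tile"
                and ops[i + 1]["type"] == "Fused8BitRowwiseQuantizedToFloat"
                and ops[i]["output"] == ops[i + 1]["input"]):
            tile, deq = ops[i], ops[i + 1]
            fp_blob = tile["input"] + "_fp"
            out.append({**deq, "input": tile["input"], "output": fp_blob})
            out.append({**tile, "input": fp_blob, "output": deq["output"]})
            i += 2
        else:
            out.append(ops[i])
            i += 1
    return out

ops = [
    {"type": "Tile", "input": "X_q", "output": "X_q_tiled", "args": {"tiles": 32}},
    {"type": "Fused8BitRowwiseQuantizedToFloat", "input": "X_q_tiled", "output": "X_fp"},
]
print(reorder_tile_and_dequant(ops))
```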
Test Plan:
```
buck test caffe2/caffe2/opt/custom:in_batch_broadcast_test
```
Reviewed By: benjibc
Differential Revision: D22544162
fbshipit-source-id: b6dd36a5925a9c8103b80f034e7730a7a085a6ff
Summary: add logit and swish to this list
Test Plan: f203925461
Reviewed By: amylittleyang
Differential Revision: D22506814
fbshipit-source-id: b449e4ea16354cb76915adb01cf317cffb494733
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40318
rename the layernorm fakefp16 op to follow the right naming convention
add it to the map of replacement ops
this can be done even if the operator is not complete because we are blacklisting it anyway
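For illustration only, the replacement map amounts to something like the sketch below (the op names and map contents are assumptions, not the real table):
```
# Ops found in the map get swapped for their fake-fp16 emulation counterparts;
# anything whose net_pos is blacklisted is left untouched.
FAKE_FP16_REPLACEMENTS = {
    "FC": "Fp16FCAcc16NNPI",               # illustrative entry
    "LayerNorm": "LayerNormFakeFP16NNPI",  # the renamed op this diff adds
}

def replace_with_fake_fp16(ops, blacklist_pos=()):
    for op in ops:
        if op.get("args", {}).get("net_pos") in blacklist_pos:
            continue
        op["type"] = FAKE_FP16_REPLACEMENTS.get(op["type"], op["type"])
    return ops
```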
Test Plan: net_runner and inspected the log that replacement happened
Reviewed By: venkatacrc
Differential Revision: D22145900
fbshipit-source-id: f19794ec05234b877f7697ed8b05dd8f46606c47
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40081
Add the ability to time out an OnnxifiOp run so that, if the backend hangs, the op can error out quickly.
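The contract is easiest to show with a small Python sketch (the real change lives in the C++ OnnxifiOp; the helper below is purely illustrative):
```
# If the backend run does not finish within the deadline, fail fast instead of
# hanging the caller. The hung worker thread is abandoned, not killed.
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def run_with_timeout(run_backend, timeout_s):
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(run_backend)
    try:
        return future.result(timeout=timeout_s)
    except TimeoutError:
        raise RuntimeError("Onnxifi backend run timed out after %ss" % timeout_s)
    finally:
        pool.shutdown(wait=False)  # don't block on a hung worker
```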
Test Plan:
```
buck test glow/fb/test:test_onnxifinnpi -- test_timeout
```
Reviewed By: jackm321
Differential Revision: D22064533
fbshipit-source-id: 25487287c10ab217eb95692f09d48e13e19436ab
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40078
ATT. It's good to have net_pos on all the ops so that we can distinguish each op in the minimizer in net_runner.
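In spirit, the annotation is just the sketch below (simplified op dicts; the real helper operates on the NetDef):
```
# Give every op a unique net_pos argument so later passes (minimizer,
# blacklisting) can refer to ops by position rather than by order.
def annotate_net_pos(ops):
    for pos, op in enumerate(ops):
        op.setdefault("args", {}).setdefault("net_pos", pos)
    return ops
```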
Test Plan: unittest
Reviewed By: ipiszy, ChunliF
Differential Revision: D22062748
fbshipit-source-id: 5266abdb6dde63055fdffdba6e8d65bd0f221d7b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39112
Allow int8 packed weights in the int8 model to deserialize to the original format. Set the default deserialization behavior in eval workflows to the original format.
Test Plan: Tested with workflow: f192797187
Reviewed By: yinghai
Differential Revision: D21737940
fbshipit-source-id: 7afaf307b16cb4e85e61f019356f83fdab772c57
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38507
With `--merge_fp32_inputs_into_fp16` we added some ops to the net without net_pos, which makes the cardinality of the blacklist positions smaller than the number of ops in the net. Previously, the updateInternalState() function of the minimizer would just enter an infinite loop. This diff fixes it by changing the loop condition.
Reviewed By: tracelogfb
Differential Revision: D21578777
fbshipit-source-id: 0d5373fa0a417ded1c80a2dc03248c07b1e0a320
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37675
Original commit changeset: 2c2481e3d497
(Note: this ignores all push blocking failures!)
Test Plan: Back out D21262085 due to ASAN crash P130123493
Differential Revision: D21353550
fbshipit-source-id: c43c8764322f7e58aca0c1360b1d03966b1d9798
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37535
Fuse ClipRanges + GatherRanges + SigridHash -> ClipRangesGatherSigridHash
The dpa_product_ctr model's dper2-to-dper3 migration is blocked by 3.6% higher prospector CPU usage. The root cause was traced down to the sigrid transforms, where ClipRanges, GatherRanges, and SigridHash are called separately instead of fused, as is the case in dper2.
Further context:
https://fb.quip.com/GijaAZtX5mav
https://fb.quip.com/pIDdAjJP2uiG
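For intuition, the fusion amounts to the hedged sketch below (simplified op dicts; the exact input/output wiring and argument handling of the real ClipRangesGatherSigridHash op are assumptions):
```
# Collapse an adjacent ClipRanges -> GatherRanges -> SigridHash chain into one
# op that takes the chain's external inputs and carries all three ops' args.
def fuse_sigrid_chain(ops):
    fused, i = [], 0
    while i < len(ops):
        chain = ops[i:i + 3]
        if [o["type"] for o in chain] == ["ClipRanges", "GatherRanges", "SigridHash"]:
            produced = {b for o in chain for b in o["outputs"]}
            ext_inputs = [b for o in chain for b in o["inputs"] if b not in produced]
            fused.append({
                "type": "ClipRangesGatherSigridHash",
                "inputs": ext_inputs,
                "outputs": chain[-1]["outputs"],
                "args": {k: v for o in chain for k, v in o.get("args", {}).items()},
            })
            i += 3
        else:
            fused.append(ops[i])
            i += 1
    return fused
```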
Test Plan:
Local benchmarking with small model 181513584_0
(Dper3 full model is 178772812, dper2 refresh is 178770392)
Transform turned on: P129799373
Iters per second: 609.291
Transform turned off: P129799397
Iters per second: 519.088
We also want to confirm this performance on the full model in canary and in qrt.
`buck build mode/opt-clang mode/no-gpu caffe2/caffe2/fb/predictor:ptvsc2_predictor_bench`
`MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --pred_net=/data/users/ansha/tmp/dpa/small_pred_net.pb --c2_model=/data/users/ansha/tmp/dpa/181513584_0.predictor --c2_inputs=/data/users/ansha/tmp/dpa/c2_inputs_small.pb --iters=3000 --warmup_iters=100 --num_threads=32 --c2_apply_nomnigraph_passes=1 --caffe2_predictor_enable_preproc_fusion=1`
Prospector canary:
https://our.intern.facebook.com/intern/ads/canary/426280288521552095/
Check that ClipRangesGatherSigridHash is used: https://fburl.com/scuba/caffe2_operator_stats_canary/e6qfdsat
Reviewed By: yinghai
Differential Revision: D21262085
fbshipit-source-id: 2c2481e3d4977abb8abe6e9ef0c9999382320ab2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35555
ATT. This lets us lower the SparseLengthsSum* part of SparseLengthsSum*Sparse. We update the tying policy between Gather and SparseLengthsWeightSum* so that we don't bother lowering a single Gather into the backend, which is inefficient to execute on card and creates bubbles between contiguous lowered graphs.
Test Plan:
```
buck test glow/fb/test:test_onnxifinnpi
```
Reviewed By: ipiszy
Differential Revision: D20688525
fbshipit-source-id: cb8e38239057ff13a8d385ed09d0d019421de78b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34836
Once the SigridHashOp argument is supplied, I realized the shape inference is still wrong because the argument is not carried into the debug_ssa. Thanks to yinghai for catching that the converter wasn't fixed; fixing it in this diff.
Test Plan:
Ran the binary and checked the exported op:
op {
  input: "sequential_250/parallel/normalization/dper_feature_normalization/sparse_features_processor/sparse_feature_transform/gather_ranges_GSF_IDLIST_COOCCUR_APP_ID_NEKO_ORGANIC_1D_7D_INSTALL_V1/gathered_values_0"
  output: "sequential_250/parallel/normalization/dper_feature_normalization/sparse_features_processor/sparse_feature_transform/sequential_1/hash_feature_ids/SigridHash:0_0"
  type: "SigridHash"
  arg {
    name: "salt"
    i: 0
  }
  arg {
    name: "maxValue"
    i: 100000
  }
  arg {
    name: "hashIntoInt32"
    i: 1
  }
  arg {
    name: "net_pos"
    i: 3
  }
}
It now has hashIntoInt32.
Reviewed By: yinghai
Differential Revision: D20457057
fbshipit-source-id: 023ade5e66df82037a8f2da3174383dda8aff230
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34976
Previously, we were dropping the original device option info when overriding the operator conversion function.
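The intent of the fix, as a tiny hedged sketch (simplified structures, not the real converter_nomnigraph code):
```
# When an op-specific conversion override rebuilds the OperatorDef, carry the
# original op's device_option over instead of silently dropping it.
def convert_op(op, override):
    new_op = override(op)
    if "device_option" in op and "device_option" not in new_op:
        new_op["device_option"] = op["device_option"]
    return new_op
```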
Test Plan:
```
buck test caffe2/caffe2/opt:converter_nomigraph_test
```
Reviewed By: ipiszy
Differential Revision: D20507277
fbshipit-source-id: 66b5eab07d18651eff27dab2a809cd04872ac224
Summary: make use of springhill's fma on SpatialBatchnorm
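For reference, the arithmetic being targeted is the per-channel multiply-add below (NumPy sketch; the springhill FMA mapping itself is hardware-specific and not shown here):
```
# Inference-time SpatialBN reduces to y = x * scale + bias per channel, i.e.
# one fused multiply-add per element.
import numpy as np

def spatial_bn_inference(x, gamma, beta, mean, var, eps=1e-5):
    # x: (N, C, H, W); gamma/beta/mean/var: (C,)
    scale = gamma / np.sqrt(var + eps)
    bias = beta - mean * scale
    return x * scale.reshape(1, -1, 1, 1) + bias.reshape(1, -1, 1, 1)
```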
Test Plan:
re-enabled the unit test, ran it a couple of times
pending: net runner
Reviewed By: amylittleyang
Differential Revision: D20227767
fbshipit-source-id: 7c601f185940249c0a32bdf95d74a20552cd2625
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34092
Disable the op in the transform map until we get bitwise matching to ice-ref.
Test Plan: CI
Reviewed By: hyuen
Differential Revision: D20177936
fbshipit-source-id: e316384184cb264852e63e5edce721a8614742d1
Summary: update this mapping with the int4 SLS ops so we can run net_runner
Test Plan: testing with net_runner
Reviewed By: jfix71
Differential Revision: D19879826
fbshipit-source-id: eac84b10e2365c21cb8a7cfbf3123e26a9945deb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32935
Mock away the content of the onnxified net with some low-cost ops so that we can still mimic the input/output transfer while doing minimal work on the card.
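Conceptually, the loop-test mock does something like the sketch below (simplified; the exact pairing of inputs to outputs in the real implementation is an assumption here):
```
# Replace the body of the onnxified subnet with cheap Copy ops so the
# input/output transfer still happens while almost no compute runs on card.
def mock_onnxified_net(input_blobs, output_blobs):
    ops = []
    for idx, out in enumerate(output_blobs):
        src = input_blobs[idx % len(input_blobs)]  # pair each output with some input
        ops.append({"type": "Copy", "inputs": [src], "outputs": [out]})
    return ops
```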
Test Plan:
```
buck run glow/fb/test:sparsenn_test -- --gtest_filter='SparseNNTest.vanillaC2' --onnxifi_debug_mode --onnxifi_loop_test_mode --nocaffe2_predictor_use_memonger
```
Differential Revision: D19631971
fbshipit-source-id: f970c55ccb410702f479255eeb750e01e3f8c2ae
Summary:
SpatialBNFakeLoweredFp16NNPI
this is the fake operator for SpatialBN that gets lowered into add/mul/div, etc.
Test Plan: test_spatialbn
Reviewed By: tracelogfb, amylittleyang
Differential Revision: D19658680
fbshipit-source-id: 2abddbcd9a2023ac75c494f20eaac2051b7139dc
Summary: ATT. Since the infra is there.
Test Plan: run it
Reviewed By: amylittleyang
Differential Revision: D19605250
fbshipit-source-id: c68be4d7963afa4fa5f8f60c90f1913605eae516
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32675
It's good to have one location to do the mapping.
Test Plan: Everything still runs.
Reviewed By: amylittleyang
Differential Revision: D19590354
fbshipit-source-id: d8c0d14e4bdf27da3e13bd4d161cd135d6e3822b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30802
Change shape_hints from map<string, TensorShape> to ShapeInfoMap to capture dimType info from the model file.
Reviewed By: ipiszy
Differential Revision: D18821486
fbshipit-source-id: c5d9ed72e158d3698aba38900aeda00f776745b4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30367
use the SLS emulations that match the hardware
Test Plan: replayer test
Differential Revision: D18667605
fbshipit-source-id: 89aee630184737b86ecfb09717437e5c7473e42c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28866
While working on the fix for int32 instead of int64, we also need to take care of ClipRangesGatherSigridHash, since this is the operator that actually gets used during inference.
Test Plan: Added unittest to cover for the new case
Reviewed By: ipiszy
Differential Revision: D17147237
fbshipit-source-id: 2b562b72a6ae8f7282e54d822467b8204fb1055e
Summary: To test the int8 ads models on CPU and accelerators with the ads replayer, we need to load the PREPACKING_INIT_NET_TYPE in the int8 model to initialize the int8 w_packed blobs.
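In caffe2 Python terms, "loading" the prepacking init net amounts to something like this hedged sketch (how the serialized net is obtained from the model file is omitted):
```
# Run the prepacking init net once so the packed int8 weight blobs (w_packed)
# exist in the workspace before the predict net runs.
from caffe2.proto import caffe2_pb2
from caffe2.python import workspace

def load_prepacked_weights(prepacking_init_net_bytes):
    init_net = caffe2_pb2.NetDef()
    init_net.ParseFromString(prepacking_init_net_bytes)
    workspace.RunNetOnce(init_net)
```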
Test Plan:
Ads replayer test.
P74811059
Reviewed By: zrphercule
Differential Revision: D16518888
fbshipit-source-id: cee212710ad37d9e491c970b25b2fe484373e5e4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24384
So that we can use them in other functions.
Reviewed By: yinghai
Differential Revision: D16824289
fbshipit-source-id: 3cb33cfa9a5c479a63db6438aef518209bdfb1f4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24262
Previously, for the onnxifi_blacklist_ops option, we figured out the net_pos based on the order of ops in the net. This logic is wrong if the net already has net_pos assigned, and we may end up blacklisting unintended ops. Fix this by always assigning net_pos before computing any blacklist.
Reviewed By: yinghai
Differential Revision: D16789166
fbshipit-source-id: 2d08a7737d417822f2209adb4dcb24dbb258ff90
Summary:
Overall context: open-source BlackBoxPredictor as the entry point for inference in Caffe2 (a thread-safe abstraction for Caffe2 inference). This should be used in ThroughputBenchmark for the purpose of framework comparison.
This specific diff:
There should be no harm in moving the transformation code to OSS. On the advantages side, we will be able to compare the production Caffe2 setup with PyTorch in the fairest way via ThroughputBenchmark. This approach avoids any complicated transformation registries; building those properly would be a significant engineering effort as well as a production risk. In the past we had SEVs related to transforms being turned off due to various refactors. Given that we don't plan any other significant investments into transformation logic beyond the existing ones (like TVM and Glow), and those also relate to open-source technologies, I came to the conclusion of moving the whole thing to OSS.
Reviewed By: bertmaher
Differential Revision: D16367134
fbshipit-source-id: fc6bacc1be3ff6336beb57cdad58168d3a2b8c28