Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45401
Added a DeleteKey API for the TCP Store
ghstack-source-id: 112997162
Test Plan:
Modified the existing get/set test to use delete. Verified that the
correct keys were deleted and that the numKeys API returned the right values.
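For reference, a minimal sketch of the exercised API surface from Python (assuming the c10d TCPStore bindings; host/port are placeholders):
```
from datetime import timedelta

import torch.distributed as dist

store = dist.TCPStore("127.0.0.1", 29500, 1, True, timedelta(seconds=30))
store.set("first_key", "first_value")
store.set("second_key", "second_value")
assert store.get("first_key") == b"first_value"

store.delete_key("first_key")  # the new DeleteKey API
# Note: the count may include the store's internal init key.
print(store.num_keys())
```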
Reviewed By: mrshenli
Differential Revision: D23955730
fbshipit-source-id: 5c9f82be34ff4521c59f56f8d9c1abf775c67f9f
Summary: As title.
Test Plan:
FBL job without this diff failed:
f221545832
Error message:
```
NonRetryableException: AssertionError: Label is missing in training stage for HistogramBinningCalibration
```
FBL job with canary package built in this diff is running without failure:
f221650379
Reviewed By: chenshouyuan
Differential Revision: D23959508
fbshipit-source-id: c077230de29f7abfd092c84747eaabda0b532bcc
Summary: Adding support for type double to caffe2 MeanOp and MeanGradientOp.
Test Plan:
All tests passed.
Example FBL job failed without this diff:
f221169563
Error message:
```
c10::Error: [enforce fail at mean_op.h:72] . Mean operator only supports 32-bit float, but input was of type double (Error from operator:
input: "dpsgd_8/Copy_3" input: "dpsgd_8/Copy_4" output: "dpsgd_8/Mean_2" name: "" type: "Mean" device_option { device_type: 0 device_id: 0 })
```
Example FBL job is running without failure with the canary package built from this diff:
f221468723
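For illustration, a minimal sketch (not part of the diff) of the newly supported case:
```
import numpy as np
from caffe2.python import core, workspace

# Elementwise mean across inputs; double previously hit the enforce above.
workspace.FeedBlob("a", np.array([1.0, 2.0], dtype=np.float64))
workspace.FeedBlob("b", np.array([3.0, 4.0], dtype=np.float64))
workspace.RunOperatorOnce(core.CreateOperator("Mean", ["a", "b"], ["m"]))
print(workspace.FetchBlob("m"))  # [2. 3.]
```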
Reviewed By: chenshouyuan
Differential Revision: D23956222
fbshipit-source-id: 6c81bbc390d812ae0ac235e7d025141c8402def1
Summary: Currently, GetSingleArgument overflows because it expects an int instead of an int64 when using a 1cycle (hill policy) annealing schedule.
Test Plan:
unittest
buck test caffe2/caffe2/python/operator_test:learning_rate_op_test
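A hypothetical repro sketch; the hill-policy argument names and values are assumptions, not taken from the test:
```
import numpy as np
from caffe2.python import core, workspace

workspace.FeedBlob("iter", np.array([0], dtype=np.int64))
op = core.CreateOperator(
    "LearningRate", ["iter"], ["lr"],
    policy="hill", base_lr=0.1,
    # Exceeds int32 range; previously truncated when read via
    # GetSingleArgument<int>.
    num_iter=3_000_000_000,
    start_multiplier=0.1, gamma=0.999, power=1.0, end_multiplier=0.01,
)
workspace.RunOperatorOnce(op)
print(workspace.FetchBlob("lr"))
```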
Differential Revision: D23938169
fbshipit-source-id: 20d65df800d7a0f1dd9520705af31f63ae716463
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43963
Added a DeleteKey API for the TCP Store
ghstack-source-id: 112939762
Test Plan:
Modified the existing get/set test to use delete. Verified that the
correct keys were deleted and that the numKeys API returned the right values.
Reviewed By: jiayisuse
Differential Revision: D23009117
fbshipit-source-id: 1a0d95b43d79e665a69b2befbaa059b2b50a1f66
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43962
TCPStore needs a getNumKeys API for our logging needs.
ghstack-source-id: 112939761
Test Plan: Adding tests to C++ Store Tests
Reviewed By: pritamdamania87
Differential Revision: D22985085
fbshipit-source-id: 8a0d286fbd6fd314dcc997bae3aad0e62b51af83
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44062
Previously, BackendSelect kernels were still written in the legacy way, i.e. they took one TensorOptions argument instead of scattered dtype, layout, device, pin_memory, and they used hacky_wrapper to be callable. This caused a re-wrapping step: calling into a BackendSelect kernel required taking the individual scattered arguments, packing them into a TensorOptions, and the kernel itself then gathered them again for redispatch.
Now with this PR, BackendSelect kernels are written in the new way and no hacky_wrapper or rewrapping is needed for them.
ghstack-source-id: 112825789
Test Plan:
vs master: https://www.internalfb.com/intern/fblearner/details/216117032/
vs previous diff: https://www.internalfb.com/intern/fblearner/details/216170194/
Reviewed By: ezyang
Differential Revision: D23484192
fbshipit-source-id: e8fb49c4692404b6b775d18548b990c4cdddbada
Summary:
A lot of changes are in this update; some highlights:
- Added Doxygen config file
- Split the fusion IR (higher level TE like IR) from kernel IR (lower level CUDA like IR)
- Improved latency with dynamic shape handling for the fusion logic
- Prevent recompilation for pointwise + reduction fusions when not needed
- Improvements to inner dimension reduction performance
- Added input -> kernel + kernel launch parameters cache, added eviction policy
- Added reduction fusions with multiple outputs (still single reduction stage)
- Fixed code generation bugs for symbolic tiled GEMM example
- Added thread predicates to prevent shared memory from being loaded multiple times
- Improved syncthreads placement with shared memory and removed a read-before-write race
- Fixes to FP16 reduction fusions where output would come back as FP32
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45218
Reviewed By: ezyang
Differential Revision: D23905183
Pulled By: soumith
fbshipit-source-id: 12f5ad4cbe03e9a25043bccb89e372f8579e2a79
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45315
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45314
In D23858329 (721cfbf842), we put the PriorCorrectionCalibrationPrediction unit test in an OSS file, which caused a test failure in the public trunk.
This diff moves it to the FB-only test file.
Test Plan:
```
buck test //caffe2/caffe2/python/operator_test:torch_integration_test -- test_gather_ranges_to_dense_op
buck test //caffe2/caffe2/fb/python/operator_test:torch_integration_test -- test_prior_correct_calibration_prediction_op
```
All pass.
Reviewed By: houseroad
Differential Revision: D23899012
fbshipit-source-id: 1ed97d8702e2765991e6caf5695d4c49353dae82
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45096
Add operator to compute the equalization scale. This will be used in the integration of equalization into dper int8 fixed quant scheme quantization flow.
Design docs:
https://fb.quip.com/bb7SAGBxPGNC
https://fb.quip.com/PDAOAsgoLfRr
Test Plan: buck test caffe2/caffe2/quantization/server:compute_equalization_scale_test
Reviewed By: jspark1105
Differential Revision: D23779870
fbshipit-source-id: 5e6a8c220935a142ecf8e61100a8c71932afa8d7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45178
## Motivation
* To be able to make C2 ops cancellable so we can safely exit.
* Some C2 operators are blocking and thus non-cancellable. If an error
occurs, we need to be able to safely stop all net execution so we can throw
the exception to the caller.
## Summary
* Adds a hypothesis test for queue ops cancellation.
Test Plan:
## Unit test added to verify that queue ops propagate errors
```
buck test caffe2/caffe2/python:hypothesis_test
buck test caffe2/caffe2/python:hypothesis_test -- test_safe_dequeue_blob__raises_exception_when_hang --stress-runs 1000
```
```
Summary
Pass: 1000
ListingSuccess: 1
```
Reviewed By: d4l3k
Differential Revision: D23847576
fbshipit-source-id: 2fc351e1ee13ea8b32d976216d2d01dfb6fcc1ad
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45177
## Motivation
* To be able to make C2 ops cancellable so we can safely exit.
* Some C2 operators are blocking and thus non-cancellable. If an error
occurs, we need to be able to safely stop all net execution so we can throw
the exception to the caller.
## Summary
* When an error occurs in a net or the net is cancelled, running ops will have
their `Cancel` method called.
* This diff adds a `Cancel` method to `SafeEnqueueBlobsOp` and
`SafeDequeueBlobsOp` that calls queue->close() to force all the blocking ops
to return (see the sketch below).
* Adds a unit test that verifies the error propagation.
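A minimal sketch of the mechanism (op names from caffe2.python; exact status semantics assumed):
```
import numpy as np
from caffe2.python import core, workspace

workspace.RunOperatorOnce(core.CreateOperator(
    "CreateBlobsQueue", [], ["queue"], capacity=2, num_blobs=1))
workspace.FeedBlob("x", np.array([1.0], dtype=np.float32))
workspace.RunOperatorOnce(core.CreateOperator(
    "SafeEnqueueBlobs", ["queue", "x"], ["x", "status"]))
# Cancel relies on this: closing the queue forces any blocked
# Safe{En,De}queueBlobs call to return with `status` set.
workspace.RunOperatorOnce(core.CreateOperator(
    "CloseBlobsQueue", ["queue"], []))
workspace.RunOperatorOnce(core.CreateOperator(
    "SafeDequeueBlobs", ["queue"], ["y", "status"]))
print(workspace.FetchBlob("status"))
```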
Test Plan:
## Unit test added to verify that queue ops propagate errors
```
buck test caffe2/caffe2/python:hypothesis_test -- test_safe_dequeue_blob__raises_exception_when_hang --stress-runs 1000
```
```
Summary
Pass: 1000
ListingSuccess: 1
```
Reviewed By: d4l3k
Differential Revision: D23846967
fbshipit-source-id: c7ddd63259e033ed0bed9df8e1b315f87bf59394
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45231
There are two operators, `PriorCorrectionCalibrationPrediction` and
`GatherRangesToDense`, that are not supported in PT, which prevents GLOW from working.
To unblock, we first try a C2->PT conversion; in the long term, we need to implement PT custom ops.
This diff does the conversion to unblock the current project.
Test Plan:
Ran the unit tests; the test input is from the current DPER example.
All pass.
```buck test //caffe2/caffe2/python/operator_test:torch_integration_test -- test_prior_correct_calibration_prediction_op --print-passing-details
> c2 reference output
> [0.14285715 0.27272728 0.39130434 0.5 ]
> PT converted output
> tensor([0.1429, 0.2727, 0.3913, 0.5000])
buck test //caffe2/caffe2/python/operator_test:torch_integration_test -- test_gather_ranges_to_dense_op --print-passing-details
c2 reference output
> [array([[6, 5, 4, 3], [0, 0, 0, 0]], dtype=int64)]
> PT converted output
> [tensor([[6, 5, 4, 3], [0, 0, 0, 0]])]
```
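For context, a minimal sketch reproducing the C2 reference output above directly through the caffe2 op (the PT path goes through the conversion added in this diff):
```
import numpy as np
from caffe2.python import core, workspace

workspace.FeedBlob("data", np.array([1, 2, 3, 4, 5, 6], dtype=np.int64))
# One feature with an (offset, length) range per example; (0, 0) is empty.
workspace.FeedBlob("ranges", np.array([[[2, 4]], [[0, 0]]], dtype=np.int32))
workspace.FeedBlob("key", np.array([0, 1, 3, 2, 1, 0], dtype=np.int64))
workspace.RunOperatorOnce(core.CreateOperator(
    "GatherRangesToDense", ["data", "ranges", "key"], ["out"], lengths=[4]))
print(workspace.FetchBlob("out"))  # [[6 5 4 3] [0 0 0 0]]
```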
Reviewed By: allwu, qizzzh
Differential Revision: D23858329
fbshipit-source-id: ed37118ca7f09e1cd0ad1fdec3d37f66dce60dd9
Summary:
There is a tool called `2to3` whose `future` fixer specifically removes these redundant imports; the `caffe2` directory has the most:
```2to3 -f future -w caffe2```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45033
Reviewed By: seemethere
Differential Revision: D23808648
Pulled By: bugra
fbshipit-source-id: 38971900f0fe43ab44a9168e57f2307580d36a38
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45110
A recent change in DSNN quantizes the ad embedding to 8 bits. Ad embeddings are part of the inputs to the DSNN merge net. To correctly pass shape hints of input tensors including quantized ad embeddings, we need to be able to annotate the data types in shape hints.
A note on the corner cases: if the type is omitted or is not a valid type (e.g., white space), I decided to return the default type, float, instead of throwing an exception.
Test Plan:
```
buck test caffe2/caffe2/fb/opt:shape_info_utils_test
```
Reviewed By: yinghai
Differential Revision: D23834091
fbshipit-source-id: 5e072144a7a7ff4b5126b618062dfc4041851dd3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44639
As title; this will unblock migration of several modules that need learning rate functionality.
Test Plan:
```
buck test //dper3/dper3/modules/low_level_modules/tests:learning_rate_test
```
Reviewed By: yf225
Differential Revision: D23681733
fbshipit-source-id: 1d98cb35bf6a4ff0718c9cb6abf22401980b523c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44936
Need to provide the max sequence size and max element size instead of the total.
Added a check that onnxifi was successful.
Test Plan: sls tests
Reviewed By: yinghai
Differential Revision: D23779437
fbshipit-source-id: 5048d6536ca00f0a3b0b057c4e2cf6584b1329d6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44840
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44762
Move CostInferenceForFCGradient to fc_inference.cc/h to be used in multiple .cc files.
Test Plan: CI
Reviewed By: qizzzh
Differential Revision: D23714877
fbshipit-source-id: d27f33e270a93b0e053f2af592dc4a24e35526cd
Summary:
## Motivation
* To be able to make C2 ops cancellable so we can safely exit.
* Some C2 operators are blocking and thus non-cancellable. If an error
occurs, we need to be able to safely stop all net execution so we can throw
the exception to the caller.
* When an error occurs in a net or the net is cancelled, running ops will have
their `Cancel` method called.
* This diff adds a `Cancel` method to `SafeEnqueueBlobsOp` and
`SafeDequeueBlobsOp` that calls queue->close() to force all the blocking ops
to return.
* Adds a unit test that verifies the error propagation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44495
Test Plan:
## Unit Test added to verify that queue ops propagate errors
```
buck test caffe2/caffe2/python:hypothesis_test
```
Reviewed By: dzhulgakov
Differential Revision: D23236088
Pulled By: dahsh
fbshipit-source-id: daa90d9ee32483fb51195e269a52cf5987bb0a5a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44819
Addresses these issues that our validation team found:
A) test_op_nnpi_fp16: hypothesis triggers max_examples * max_examples runs.
B) batchnorm: the batchnorm test is derived from a unit test that doesn't have the settings required for hypothesis, hence the default value of 100 examples gets used.
Test Plan:
buck test //caffe2/caffe2/contrib/fakelowp/test/...
https://our.intern.facebook.com/intern/testinfra/testrun/5910974543950859
Reviewed By: hyuen
Differential Revision: D23740970
fbshipit-source-id: 16fcc49f7bf84a5d7342786f671cd0b4e0fc87d3
Summary:
Make `gcs_cuda_only` and `gcs_gpu_only` return empty device lists if CUDA/GPU (CUDA or ROCm) is not available.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44578
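A sketch of how these strategy dicts are typically consumed in operator tests (standard hypothesis_test_util pattern; the test body is illustrative):
```
from hypothesis import given

import caffe2.python.hypothesis_test_util as hu


class TestMyOp(hu.HypothesisTestCase):
    @given(X=hu.tensor(), **hu.gcs_cuda_only)
    def test_my_op(self, X, gc, dc):
        # With this change, the device lists are empty on machines without
        # CUDA/ROCm instead of containing devices that would fail at runtime.
        ...
```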
Reviewed By: walterddr
Differential Revision: D23664227
Pulled By: malfet
fbshipit-source-id: 176b5d964c0b02b8379777cd9a38698c11818690
Summary:
Tests for Vec256 classes (https://github.com/pytorch/pytorch/issues/15676)
Testing
Current list:
- [x] Blends
- [x] Memory: UnAlignedLoadStore
- [x] Arithmetic: Plus, Minus, Multiplication, Division
- [x] Bitwise: BitAnd, BitOr, BitXor
- [x] Comparison: Equal, NotEqual, Greater, Less, GreaterEqual, LessEqual
- [x] MinMax: Minimum, Maximum, ClampMin, ClampMax, Clamp
- [x] SignManipulation: Absolute, Negate
- [x] Interleave: Interleave, DeInterleave
- [x] Rounding: Round, Ceil, Floor, Trunc
- [x] Mask: ZeroMask
- [x] SqrtAndReciprocal: Sqrt, RSqrt, Reciprocal
- [x] Trigonometric: Sin, Cos, Tan
- [x] Hyperbolic: Tanh, Sinh, Cosh
- [x] InverseTrigonometric: Asin, ACos, ATan, ATan2
- [x] Logarithm: Log, Log2, Log10, Log1p
- [x] Exponents: Exp, Expm1
- [x] ErrorFunctions: Erf, Erfc, Erfinv
- [x] Pow: Pow
- [x] LGamma: LGamma
- [x] Quantization: quantize, dequantize, requantize_from_int
- [x] Quantization: widening_subtract, relu, relu6
Missing:
- [ ] Constructors, initializations
- [ ] Conversion , Cast
- [ ] Additional: imag, conj, angle (note: imag and conj only checked for float complex)
#### Notes on tests and testing framework
- some math functions are tested within a domain range
- mostly the testing framework randomly tests against the std implementation within the domain, or within the implementation domain for some math functions
- some functions are tested against a local version. ~~For example, std::round and the vector version of round differ, so it was tested against the local version~~
- round was tested against pytorch at::native::round_impl. ~~for double type on **VSX vec_round failed for (even)+0.5 values**~~. It was solved by using vec_rint
- ~~**complex types are not tested**~~ **After enabling complex testing, due to precision and domain issues some of the complex functions failed for VSX and x86 AVX as well. I will either test them against a local implementation or check within the accepted domain**
- ~~quantizations are not tested~~ Added tests for quantize, dequantize, requantize_from_int, relu, relu6, widening_subtract functions
- the testing framework should be improved further
- ~~For now `-DBUILD_MOBILE_TEST=ON `will be used for Vec256Test too~~
Vec256 Test cases will be built for each CPU_CAPABILITY
Fixes: https://github.com/pytorch/pytorch/issues/15676
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42685
Reviewed By: malfet
Differential Revision: D23034406
Pulled By: glaringlee
fbshipit-source-id: d1bf03acdfa271c88744c5d0235eeb8b77288ef8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44540
Support fp16 as the output type for UniformFill
Reviewed By: jianyuh
Differential Revision: D23558030
fbshipit-source-id: 53a5b2c92cfe78cd11f55e6ee498e1bd682fe4a1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44089
Add support for fp16 as the input type in the SparseLengthsSum/Mean caffe2 operators
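A minimal sketch (example data assumed) of the newly supported fp16 input:
```
import numpy as np
from caffe2.python import core, workspace

workspace.FeedBlob("data", np.random.rand(5, 2).astype(np.float16))
workspace.FeedBlob("indices", np.array([0, 2, 4, 1], dtype=np.int64))
workspace.FeedBlob("lengths", np.array([2, 2], dtype=np.int32))
workspace.RunOperatorOnce(core.CreateOperator(
    "SparseLengthsSum", ["data", "indices", "lengths"], ["out"]))
print(workspace.FetchBlob("out"))  # two pooled rows
```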
Reviewed By: xianjiec
Differential Revision: D23436877
fbshipit-source-id: 02fbef2fde17d4b0abea9ca5d17a36aa989f98a0
Summary: As title; this will unblock migration of several modules that need learning rate functionality.
Test Plan:
```
buck test //dper3/dper3/modules/low_level_modules/tests:learning_rate_test
```
WIP: need to add more learning rate tests for the different policies
Reviewed By: yf225
Differential Revision: D23584071
fbshipit-source-id: f6656531b1caba38c3e3a7d6e16d9591563391e2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44145
## Motivation
* To be able to make C2 ops cancellable so we can safely exit.
* Some C2 operators are now blocking thus being non-cancellable. If an error
occurs we need to be able to safely stop all net execution so we can throw
the exception to the caller.
## Summary
* Adds `NetBase::Cancel()` to NetBase, which iterates over the entire list of
operators and calls Cancel on each.
* Cancel on all ops was added to Net since there's nothing async-specific about it.
* `AsyncSchedulingNet` calls the parent Cancel.
* To preserve backwards compatibility, `AsyncSchedulingNet`'s Cancel still calls
`CancelAndFinishAsyncTasks`.
* Adds `Cancel()` to `OperatorBase`.
Reviewed By: dzhulgakov
Differential Revision: D23279202
fbshipit-source-id: e1bb0ff04a4e1393f935dbcac7c78c0baf728550
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44406
This fix makes fakelowp identical to hw:
- mask out the floating point number with 0x7fff so we are always dealing
with positive numbers
- the dsp implementation is correct; ice-ref suffers from this same problem
Test Plan: Tested with test_fusions.py; can't enable the test until the fix in ice-ref appears.
Reviewed By: venkatacrc
Differential Revision: D23603878
fbshipit-source-id: a72d93a4bc811f98d1b5e82ddb204be028addfeb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44440
`aten-op.cc` takes a long time to compile due to the large generated constructor. For each case, the `std::function` constructor and the initialization functions are inlined, producing a huge amount of intermediate code that takes a long time to optimize, given that many compiler optimization passes are superlinear in the function size.
This diff moves each case to a separate function, so that each one is cheap to optimize, and the constructor is just a large jump table, which is easy to optimize.
Reviewed By: dzhulgakov
Differential Revision: D23593741
fbshipit-source-id: 1ce7a31cda10d9b0c9d799716ea312a291dc0d36
Summary:
Expose the `nesterov` option of the caffe2 SGD optimizer to dper.
The dper SGD optimizer (https://fburl.com/diffusion/chpobg0h) already refers to the NAG SgdOptimizer in caffe2 (https://fburl.com/diffusion/uat2lnan), so we just need to add the 'nesterov' parameter to the dper SGD optimizer.
Analysis of run results: N345540.
- train_ne increases as momentum (m) decreases.
- for m=0.95, 0.9: eval_ne is lower with NAG than production (no NAG, m = 0.95).
- for m=0.99: eval_ne with or without NAG is higher than production. It indicates larger variance in validation and overfit in training (lower train_ne).
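For context, a minimal sketch of the underlying caffe2 op that the dper parameter plumbs through to (values illustrative):
```
import numpy as np
from caffe2.python import core, workspace

for name, val in [("grad", [0.1]), ("moment", [0.0]),
                  ("lr", [0.01]), ("param", [1.0])]:
    workspace.FeedBlob(name, np.array(val, dtype=np.float32))

# nesterov=1 switches the update to Nesterov accelerated gradient.
workspace.RunOperatorOnce(core.CreateOperator(
    "MomentumSGDUpdate",
    ["grad", "moment", "lr", "param"],
    ["grad", "moment", "param"],
    momentum=0.95, nesterov=1))
print(workspace.FetchBlob("param"))
```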
Test Plan:
1. unit tests:
`buck test caffe2/caffe2/fb/dper/layer_models/tests/split_1:sparse_nn_test -- test_sgd_without_nesterov`
`buck test caffe2/caffe2/fb/dper/layer_models/tests/split_1:sparse_nn_test -- test_sgd_with_nesterov`
.
2. Build the dper front end package: `flow-cli canary ads.dper3.workflows.sparse_nn.train --mode opt --entitlement ads_global --run-as-secure-group team_ads_ml_ranking`. The build result (refreshed) is here https://www.internalfb.com/intern/buck/build/2a368b55-d94b-45c1-8617-2753fbce994b. Flow package version is ads_dper3.canary:856b545cc6b249c0bd328f845adeb0d2.
.
3. Build the dper back end package: `flow-cli canary dper.workflows.dper3.train --mode opt --entitlement ads_global --run-as-secure-group team_ads_ml_ranking`. The build result (refreshed) is here: https://www.internalfb.com/intern/buck/build/70fa91cd-bf6e-4a08-8a4d-41e41a77fb52. Flow package version is aml.dper2.canary:84123a34be914dfe86b1ffd9925869de.
.
4. Compare prod with NAG-enabled runs:
a) refreshed prod run (m=0.95): f213877098
NAG enabled run (m=0.95): f213887113
.
b) prod run (m=0.9): f214065288
NAG enabled run (m=0.9): f214066319
.
c) prod run (m=0.99): f214065804
NAG enabled run (m=0.99): f214066725
.
d) changed the data type of nesterov to `bool` and launched a validation run
NAG enabled (m=0.95): f214500597
Reviewed By: ustctf
Differential Revision: D23152229
fbshipit-source-id: 61703ef6b4e72277f4c73171640fb8afc6d31f3c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44043
To invoke `cancel` from the net instance in Python, we expose it through pybind state.
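Hypothetical usage of the new binding (the pybind accessor and the `cancel` method name are assumptions based on the description above, not verified against the diff):
```
from caffe2.python import core, workspace

net = core.Net("example")
net.ConstantFill([], "x", shape=[1], value=1.0)

ws = workspace.C.Workspace.current
py_net = ws.create_net(net.Proto().SerializeToString())  # pybind net instance
py_net.cancel()  # newly exposed binding
```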
Reviewed By: dzhulgakov
Differential Revision: D23249660
fbshipit-source-id: 45a1e9062dca811746fcf2e5e42199da8f76bb54
Summary: Exporting the Bucketize operator on CUDA. Also adding a unit test.
Test Plan: buck test mode/dev-nosan caffe2/torch/fb/sparsenn:gpu_test -- test_bucketize
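A minimal sketch of the underlying caffe2 operator on CUDA (the internal sparsenn binding used in the test is not shown; example data assumed):
```
import numpy as np
from caffe2.proto import caffe2_pb2
from caffe2.python import core, workspace

with core.DeviceScope(core.DeviceOption(caffe2_pb2.CUDA, 0)):
    workspace.FeedBlob("x", np.array([0.5, 1.5, 2.5], dtype=np.float32))
    workspace.RunOperatorOnce(core.CreateOperator(
        "Bucketize", ["x"], ["out"], boundaries=[1.0, 2.0]))
print(workspace.FetchBlob("out"))  # [0 1 2]
```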
Differential Revision: D23581321
fbshipit-source-id: 7f21862984c04d840410b8718db93006f526938a