Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46134
Make sure in-place ops stay in-place after SsaRewrite. This seems to break the premise of SSA, but it's necessary to ensure correctness. Note that we only preserve the in-place ops that *enforce* in-place; ops like `Relu` don't enforce in-place, they merely allow it.
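A minimal sketch of the distinction, using the caffe2 Python API (the op choices here are illustrative assumptions, not code from this diff):
```
from caffe2.python import core

net = core.Net("inplace_example")
# Relu merely *allows* in-place: a rewrite may legally rename its
# output to a fresh blob without changing semantics.
net.Relu(["X"], ["X"])
# Optimizer ops like SparseAdagrad *enforce* in-place: the updated
# param/moment outputs must alias the corresponding inputs, so a
# rewrite has to keep the output names identical to the input names
# or the parameter update is silently lost.
net.SparseAdagrad(
    ["param", "moment", "indices", "grad", "lr"], ["param", "moment"]
)
```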
(Note: this ignores all push blocking failures!)
Reviewed By: yinghai
Differential Revision: D24234957
fbshipit-source-id: 274bd3ad6227fce6a98e615aad7e57cd2696aec3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46560
Follow-up for D24236604 (16c52d918b).
For nets that pass the schema check, memonger makes sure to preserve the in-placeness of operators that are already in-place, so we can safely enable it for correct input nets.
(Note: this ignores all push blocking failures!)
Differential Revision: D24402482
fbshipit-source-id: a7e95cb0e3eb87adeac79b9b69eef207957b0bd5
Summary:
Follow-up of https://github.com/pytorch/pytorch/issues/46461 with a similar goal
Makes them more readable and possibly faster. Care has to be taken because a list comprehension `[f(x) for x in xs]` materializes the whole result immediately, while a generator expression `(f(x) for x in xs)` is evaluated lazily. The lazy form is a benefit in cases where the list of values never needs to exist in memory (e.g. when passing to `tuple`, `extend`, or `join`).
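A minimal illustration of the eager/lazy distinction (not code from this PR):
```
xs = range(5)

# Eager: the full list of results is materialized in memory up front.
eager = [str(x * x) for x in xs]

# Lazy: values are produced one at a time as the consumer pulls them,
# so no intermediate list is ever built.
lazy = (str(x * x) for x in xs)

print(",".join(eager))  # 0,1,4,9,16
print(",".join(lazy))   # 0,1,4,9,16 -- join consumes the generator
```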
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46462
Reviewed By: zou3519
Differential Revision: D24422343
Pulled By: ezyang
fbshipit-source-id: 252e33499c92ac0b15238f2df32681dbbda2b237
Summary: It creates CPU overload issues when OpenMP is enabled and OMP_NUM_THREADS=1 is not set.
Test Plan: buck test //caffe2/caffe2/quantization/server:quantize_dnnlowp_op_test
Reviewed By: jspark1105
Differential Revision: D24437305
fbshipit-source-id: 426209fc33ce0d4680c478f584716837ee62cb5e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46244
- What does the generated binding code do?
The Python binding codegen produces code that takes the input list of
PyObjects, finds the matching ATen C++ function using PythonArgParser,
converts the PyObjects into C++ types and calls the ATen C++ function:
```
+--------+  parsing   +------------------------+  binding   +-----------------------+
| PyObjs | ---------> | PythonArgParser Output | ---------> | Cpp Function Dispatch |
+--------+            +------------------------+            +-----------------------+
```
- Are Python arguments 1-1 mapped to C++ arguments?
Python arguments might be reordered, packed, or unpacked when binding to
C++ arguments, as illustrated below:
```
// Binding - Reorder & Packing
// aten::empty.names(int[] size, *, Dimname[]? names, ScalarType? dtype=None, Layout? layout=None,
//                   Device? device=None, bool? pin_memory=None, MemoryFormat? memory_format=None) -> Tensor

         Python Args               Cpp Args
-----------------------------------------------------------
      0: size                      size
      1: names                     names
      2: memory_format -------+
      3: dtype         -----+-|--> options
      4: layout            / |
      5: device           /  +--> memory_format
      6: pin_memory      /
      7: requires_grad -+

// Binding - Unpacking
// aten::max.names_dim(Tensor self, Dimname dim, bool keepdim=False) -> (Tensor values, Tensor indices)

         Python Args               Cpp Args
-----------------------------------------------------------
                                        +----> max
                                       /-----> max_values
      0: input            self        /
      1: dim              dim        /
      2: keepdim          keepdim   /
      3: out      -----------------+
```
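The reorder-and-pack step can be sketched in Python pseudocode as follows (`cpp_empty` is a hypothetical stand-in for the dispatched C++ function; the real generated code is C++):
```
def cpp_empty(size, names, options, memory_format):
    # Placeholder for the ATen C++ function the binding dispatches to.
    raise NotImplementedError

def empty_names_binding(size, names, memory_format=None, dtype=None,
                        layout=None, device=None, pin_memory=None,
                        requires_grad=False):
    # Scattered Python keyword arguments are packed into a single
    # options struct (TensorOptions in the real binding), while
    # memory_format is reordered to the end and passed through as-is.
    options = {
        "dtype": dtype,
        "layout": layout,
        "device": device,
        "pin_memory": pin_memory,
        "requires_grad": requires_grad,
    }
    return cpp_empty(size, names, options, memory_format)
```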
- Why do we want to rewrite the python binding codegen?
The old codegen takes Declarations.yaml as input. It doesn't distinguish
between Python arguments and C++ arguments - they are all mixed together
as a bag of untyped dict objects. Different methods process these arg
objects and add new attributes for various purposes, so the semantics of
these attributes are not obvious. The complicated binding logic happens
implicitly and is scattered across the codegen.
```
           +--------------------+
           |  Native Functions  |
           +--------------------+
                     |
                     |
                     v
           +--------------------+
           |   Cpp Signatures   |
           +--------------------+
                     |
                     |
                     v
           +--------------------+
           |  Declarations.yaml |
           +--------------------+
                     |                +-------------------------------------+
                     | +------------> |        PythonArgParser Schema       |
                     | |              +-------------------------------------+
                     | |                                 .
                     | |                                 .
                     v |                                 .
           +--------------------+     +-------------------------------------+
           | NonTyped Args Objs | --> | PythonArgParser -> Cpp Args Binding |
           +--------------------+     +-------------------------------------+
                     |                                   .
                     |                                   .
                     |                                   .
                     |                +-------------------------------------+
                     +--------------> |        Cpp Function Dispatch        |
                                      +-------------------------------------+
```
This PR leverages the new immutable data models introduced in the new
ATen codegen. It introduces dedicated data models for the Python schema.
This way, we not only avoid subtle Declarations.yaml conversions but
also decouple the generation of the Python schema, the Python-to-C++
binding, and the C++ function call.
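A minimal sketch of what "dedicated data models" means here (the names are assumed for illustration, not the exact classes in tools/codegen/api/python.py):
```
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class PythonArgument:
    name: str
    type: str               # e.g. 'Tensor', 'int[]'
    default: Optional[str]  # None if the argument is required

@dataclass(frozen=True)
class PythonSignature:
    name: str
    arguments: Tuple[PythonArgument, ...]

    def schema_str(self) -> str:
        # Render the PythonArgParser schema from typed fields instead
        # of probing ad-hoc attributes on untyped dicts.
        args = ", ".join(
            f"{a.type} {a.name}" + (f"={a.default}" if a.default else "")
            for a in self.arguments
        )
        return f"{self.name}({args})"
```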
The ultimate state will be like the following diagram:
```
                +-------------------+      +-------------------------------------+
      +-------> | Python Signatures | ---> |        PythonArgParser Schema       |
      |         +-------------------+      +-------------------------------------+
      |                   |                                    .
      |                   |                                    .
      |                   |                                    .
+------------------+      |                +-------------------------------------+
| Native Functions |      +--------------> | PythonArgParser -> Cpp Args Binding |
+------------------+      |                +-------------------------------------+
      |                   |                                    .
      |                   |                                    .
      |                   |                                    .
      |         +-------------------+      +-------------------------------------+
      +-------> |  Cpp Signatures   | ---> |        Cpp Function Dispatch        |
                +-------------------+      +-------------------------------------+
```
This PR has migrated the core binding logic from
tools/autograd/gen_python_functions.py to tools/codegen/api/python.py.
It produces byte-for-byte identical results (tested with #46243).
Will migrate the rest of gen_python_functions.py in subsequent PRs.
Test Plan: Imported from OSS
Reviewed By: bhosmer
Differential Revision: D24388874
Pulled By: ljk53
fbshipit-source-id: f88b6df4e917cf90d868a2bbae2d5ffb680d1841
Summary:
1. Added CudaFusionGuard as a custom TypeCheck for nvfuser; enabled dynamic shape support with the profiling executor (see the sketch after this list);
2. Dropped support for the legacy fuser;
3. Re-enabled nvfuser tests;
4. Added registration of profiling records to allow profiling on user-specified nodes.
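A rough sketch of how dynamic shapes interact with the profiling executor (an assumed illustration: `_jit_set_profiling_executor` is an internal toggle, and CudaFusionGuard is inserted by the fuser, not by user code):
```
import torch

torch._C._jit_set_profiling_executor(True)  # profiling-based executor

@torch.jit.script
def f(x):
    return torch.relu(x) * 2.0

# Warm-up runs let the executor profile shapes; the fuser then guards
# its fused kernel with a runtime check (CudaFusionGuard) and falls
# back to the unfused graph when an input violates the profile.
for _ in range(3):
    f(torch.randn(8, 16, device="cuda"))
f(torch.randn(4, 32, device="cuda"))  # new shape -> guard decides
```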
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46452
Reviewed By: zou3519, anjali411
Differential Revision: D24364642
Pulled By: ngimel
fbshipit-source-id: daf53a9a6b6636e1ede420a3a6d0397d4a8b450b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43987
This replaces the Caffe2 CPU random number generator (std::mt19937) with at::mt19937, the one currently used in PyTorch. The ATen RNG is 10x faster than the std one and appears to be more robust, given bugs in the std implementation (https://fburl.com/diffusion/uhro7lqb).
For large embedding tables (10GB+) we see UniformFillOp taking upwards of 10 minutes because we're bottlenecked on the single-threaded RNG. Swapping to at::mt19937 cuts that time to 10% of the current value.
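A minimal sketch of the bottleneck being addressed (the fill size is illustrative):
```
import time
from caffe2.python import core, workspace

# A single large UniformFill is dominated by the throughput of the
# single-threaded RNG, which is what the at::mt19937 swap speeds up.
op = core.CreateOperator("UniformFill", [], ["X"], shape=[10000000])
start = time.time()
workspace.RunOperatorOnce(op)
print("UniformFill took %.2fs" % (time.time() - start))
```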
Test Plan: Ran all relevant tests + CI. This doesn't introduce new features (and is a core change), so existing tests + CI should be sufficient to catch regressions.
Reviewed By: dzhulgakov
Differential Revision: D23219710
fbshipit-source-id: bd16ed6415b2933e047bcb283a013d47fb395814
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46424
Currently, if an exception occurs in a reporter thread, the process is killed via std::terminate. This adds support for handling the reporter exception if FLAGS_caffe2_handle_executor_threads_exceptions is set to true.
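From Python, opting in would look roughly like this (a sketch; the flag name is taken from the summary above):
```
from caffe2.python import workspace

# Propagate reporter-thread exceptions to the caller instead of
# letting them kill the process via std::terminate.
workspace.GlobalInit([
    "caffe2",
    "--caffe2_handle_executor_threads_exceptions=true",
])
```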
Test Plan: buck test mode/opt -c python.package_style=inplace //caffe2/caffe2/python:hypothesis_test //caffe2/caffe2:caffe2_test_cpu -- --stress-runs 100
Reviewed By: dahsh
Differential Revision: D24345027
fbshipit-source-id: 0659495c9e27680ebae41fe5a3cf26ce2f455cb3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46457
Wanted to see whether a float-specialized CopyMatrix that uses mkl_somatcopy would be faster, but it wasn't. Still checking in the benchmark so it can be used later.
Test Plan: .
Reviewed By: dskhudia
Differential Revision: D24345901
fbshipit-source-id: d3e68dbb560e3138fda11c55789cd41bc0715c6d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46449
modifies `ComputeEqualizationScale` to have a single output `S`
Test Plan:
```
buck test caffe2/caffe2/quantization/server:compute_equalization_scale_test
```
plus e2e tests
Reviewed By: hx89
Differential Revision: D23946768
fbshipit-source-id: 137c2d7a58bb858db411248606a5784b8066ab23
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45551
The FP16 version of the SparseNormalize op in Caffe2 is missing. This diff adds FP16 support to unblock the MC process of adding FP16 to Dper3.
Check https://fb.quip.com/L0T2AXGwUY3n#EReACAeifk3 .
One open question is whether a pure FP16 SparseNormalize op will affect accuracy; maybe we should do the computation in the FP32 domain.
ghstack-source-id: 114184398
Test Plan:
```
buck run mode/opt //caffe2/caffe2/python/operator_test:sparse_normalize_test
```
```
buck run mode/opt -c python.package_style=inplace mode/no-gpu //caffe2/caffe2/python/benchmarks:sparse_normalize_benchmark -- --fp16
```
Reviewed By: jspark1105
Differential Revision: D24005618
fbshipit-source-id: 8b918ec4063fdaafa444779b95206ba2b7b38537
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46110
## Motivation
* `Cancel` is now added to `OperatorBase` and `NetBase` (https://github.com/pytorch/pytorch/pull/44145).
* We need a test that covers and exhibits that the plan executor can cancel a stuck net and propagate the error.
## Summary
* Added PlanExecutorTest `ErrorPlanWithCancellableStuckNet` for plan executor.
* Set cancelCount to zero at the beginning of tests to avoid global state being carried over in some test environments.
Test Plan:
## Unit Test Added
```
buck test caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest
buck test caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest --stress-runs 1000
```
Reviewed By: d4l3k
Differential Revision: D24226577
fbshipit-source-id: c834383bfe6ab50747975c229eb42a363eed3458
Summary: Added an OpSchema::NeedsAllInputShapes wrapper around the TensorInferenceFunction to fix an exception when referencing the dim array while the input shape was unknown. There may be other operators that could use a similar change; these are just the ones that were causing InferShapesAndTypes to throw an exception in my examples.
Test Plan: Tested with notebook n352716
Differential Revision: D23745442
fbshipit-source-id: d63eddea47d7ba595e73c4693d34c790f3a329cc
Summary: I think this preprocessor check is incorrect. The fused multiply-add (FMA) instructions are not part of AVX2.
Test Plan: CI
Reviewed By: jspark1105
Differential Revision: D24237836
fbshipit-source-id: 44f9b9179918332eb85ac087827726300f56224e
Summary: Adds a new flag, shape_is_set, to the shape-inference structs for in-place ops to prevent duplicated inference.
Test Plan:
buck test mode/opt-clang caffe2/caffe2/opt:bound_shape_inference_test
buck test mode/opt-clang caffe2/caffe2/fb/opt:shape_info_utils_test
Reviewed By: ChunliF
Differential Revision: D24134767
fbshipit-source-id: 5142e749fd6d1b1092a45425ff7b417a8086f215
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46080
Temporary removal of ErrorPlanWithCancellableStuckNet; will fill it out more later.
Test Plan:
```
buck test caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest
```
remove a test
Reviewed By: fegin
Differential Revision: D24213971
fbshipit-source-id: e6e600bad00b45c726311193b4b3238f1700526e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45319
## Motivation
* `Cancel` is now added to `OperatorBase` and `NetBase` (https://github.com/pytorch/pytorch/pull/44145)
* We need a test that covers and exhibits that the plan executor can cancel a stuck net and propagate the error.
## Summary
* Added `ErrorPlanWithCancellableStuckNet` for plan executor.
* We set up a plan with two nets: one stuck net with a blocking operator that never returns, and one error net with an op that throws; the test verifies that the plan throws and cancels.
Test Plan:
## Unit Test added
```
buck test caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest
buck test caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest --stress-runs 100
```
```
Summary
Pass: 400
ListingSuccess: 2
```
Reviewed By: d4l3k
Differential Revision: D23920548
fbshipit-source-id: feff41f73698bd6ea9b744f920e0fece4ee44438
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45981
This is a recommit of previously reverted D20850851 (3fbddb92b1).
TL;DR: combining condition variables and atomics is a bad idea; see
https://stackoverflow.com/questions/49622713/c17-atomics-and-condition-variable-deadlock
This also adds some ifdefs to disable the death test for mobile, xplat, and TSAN builds, since forking doesn't play nicely with them.
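The lost-wakeup hazard behind the TL;DR, sketched in Python for illustration (the linked issue describes the C++17 atomics case, but the shape of the bug is the same):
```
import threading

cond = threading.Condition()
done = False  # plays the role of the atomic flag

def signaller():
    global done
    done = True            # flag flipped outside the lock...
    with cond:
        cond.notify_all()

def buggy_waiter():
    if not done:           # ...and checked outside the lock: the notify
        with cond:         # can fire in this gap, so wait() sleeps forever
            cond.wait()

def correct_waiter():
    with cond:
        while not done:    # check and wait under the same lock: no gap
            cond.wait()
```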
Test Plan:
buck test mode/opt //caffe2/caffe2/python:hypothesis_test -- --stress-runs 1000 test_atomic_iter_with_concurrent_steps --timeout 120
buck test mode/opt //caffe2/caffe2/python:hypothesis_test -- --stress-runs 100
buck test mode/opt caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest --stress-runs 100
no timeouts https://www.internalfb.com/intern/testinfra/testconsole/testrun/7036874440059883/
will ensure no timeouts in OSS
Reviewed By: walterddr, dahsh
Differential Revision: D24165505
fbshipit-source-id: 17cd23bfbcd9c2826a4067a387023d5186353196
Summary:
This enables the cuda fuser on ROCm and enables tests for them.
Part of this patch is based on work by Rohith Nallamaddi; thank you.
Errors are my own, of course.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45965
Reviewed By: seemethere
Differential Revision: D24170457
Pulled By: walterddr
fbshipit-source-id: 3dd25b3501a41d2f00acba3ce8642ce51c49c9a6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45952
Pull Request resolved: https://github.com/pytorch/glow/pull/4967
When Glow compilation hits a nonrecoverable fatal error (the hardware is busted), we would like to throw a special exception, distinct from the normal caffe2::EnforceNotMet, so that we can signal the upper-layer application to handle it differently.
Test Plan: Manually code some error and add LOG(FATAL) in the special exception path and wait for application to fatal.
Reviewed By: ipiszy
Differential Revision: D24156792
fbshipit-source-id: 4ae21bb0d36c89eac331fc52dd4682826b3ea180
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45979
For some reason, we sometimes cannot write out the debug files. This shouldn't block the whole service. Hence, we opt to log an error instead of throwing.
Test Plan: Run net_runner test at `/` and observe error being printed out but the test passes.
Reviewed By: ipiszy
Differential Revision: D24165081
fbshipit-source-id: a4e1d0479d54d741e615e3a00b3003f512394fd4
Summary:
The CPU implementation of `torch.symeig` uses `[zc]heev`, but MAGMA only has the d-suffixed (divide-and-conquer) flavors of those functions.
Fixes https://github.com/pytorch/pytorch/issues/45922
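An assumed repro, not copied from the issue: symeig on a complex CUDA tensor, which is the path that now routes to MAGMA's d-suffixed divide-and-conquer routines:
```
import torch

a = torch.randn(4, 4, dtype=torch.complex128, device="cuda")
a = a + a.conj().transpose(-2, -1)  # make the matrix Hermitian
# On CUDA this now dispatches to MAGMA's zheevd (cheevd for complex64),
# since plain zheev has no MAGMA flavor.
w, v = torch.symeig(a, eigenvectors=True)
```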
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46002
Reviewed By: walterddr
Differential Revision: D24177730
Pulled By: malfet
fbshipit-source-id: 0e9aeb60a83f8a4b8ac2a86288721bd362b6040b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45992
Created a template version of AddFakeFp16 to take both float and int inputs.
Test Plan: notebook with local bento kernel: N369049
Reviewed By: amylittleyang
Differential Revision: D24169720
fbshipit-source-id: 679de391224f65f6c5b3ca890eb0d157f09712f6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45986
Recurrent networks have subnets that are not well supported by `RemoveOpsByType`. Here we exclude recurrent networks by adding the same check as in memonger.
Test Plan:
```
buck test //caffe2/caffe2/fb/predictor:black_box_predictor_test
```
AdIndexer canary for sanity check:
https://www.internalfb.com/intern/ads/canary/430059485214766620
Differential Revision: D24167284
fbshipit-source-id: fa90d1c1f34af334a599d879af09d4c0bf7c27bd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45875
Adds a googlebenchmark harness for perf testing programs generated by
tensorexpr, sans any pytorch wrappings (for python-level benchmarks of
tensorexpr, see benchmarks/tensorexpr).
Currently there's a harness for gemm that sets up the problem using torch (and
also measures the perf of a torch::mm to give a baseline).
Right now there's just an unoptimized implementation that is expected not to be
very fast. More optimized versions are coming.
Sample output from my dev box:
```
Run on (48 X 2501 MHz CPU s)
CPU Caches:
  L1 Data 32K (x24)
  L1 Instruction 32K (x24)
  L2 Unified 256K (x24)
  L3 Unified 30720K (x2)
--------------------------------------------------------------------------------------------
Benchmark                                     Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------
Gemm/Torch/128/128/128                    73405 ns        73403 ns         8614 GFLOPS=57.1411G/s
Gemm/TensorExprNoopt/128/128/128        3073003 ns      3072808 ns          229 GFLOPS=1.36497G/s
```
Test Plan: Imported from OSS
Reviewed By: SplitInfinity
Differential Revision: D24142403
Pulled By: bertmaher
fbshipit-source-id: 3354aaa56868a43a553acd1ad9a192f28d8e3597
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45297
If we have two concurrent substeps and one of them throws an exception while the other is blocking, we'll currently hang. This change waits up to one minute for the blocking substep to complete before terminating the process.
Test Plan: buck test caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest --stress-runs 100
Reviewed By: dahsh
Differential Revision: D20850851
fbshipit-source-id: 330503775d8062a34645ba55fe38e6770de5e3c7
Summary: This diff adds a string equality checking operator.
Test Plan: Unit tests
Differential Revision: D24042344
fbshipit-source-id: c8997c6130e3438f2ae95dae69f76978e2e95527
Summary:
The torchbind tests didn't work because we somehow missed the rename of caffe2_gpu to torch_... (hip for us) in https://github.com/pytorch/pytorch/issues/20774 (merged 2019-06-13, oops) and still tried to link against the old library.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45426
Reviewed By: VitalyFedyunin
Differential Revision: D24112439
Pulled By: walterddr
fbshipit-source-id: a66a574e63714728183399c543d2dafbd6c028f7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45649
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44275
* This diff applies the WarpReduce optimization to the dedup version of the RowWiseSparseAdagrad fused op. We achieve roughly a 1.33x performance improvement with this diff.
* Port the approach from D23948802 for finding num_dup.
* Fix a likely FP16 bug in the dedup kernel.
Reviewed By: jianyuh
Differential Revision: D23561994
fbshipit-source-id: 1a633fcdc924593063a67f9ce0d36eadb19a7efb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45610
Also document in the usual places that this option exists.
Test Plan: Imported from OSS
Reviewed By: gmagogsfm
Differential Revision: D24058199
Pulled By: suo
fbshipit-source-id: 81574fbd042f47587e2c7820c726fac0f68af2a7
Summary: `__repr__` calling self.tasks() ends up marking the instance as "used", which doesn't seem appropriate. I was debugging a value being passed around and then ran into `Cannot add Task to an already used TaskGroup.` because the value had been logged once.
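A sketch of the failure mode (API names per caffe2.python.task; the exact repro is hypothetical):
```
from caffe2.python import core
from caffe2.python.task import Task, TaskGroup

tg = TaskGroup()
print(tg)  # __repr__ used to call self.tasks(), marking tg as "used"

# Before this fix, adding a task after the group had merely been
# logged raised: "Cannot add Task to an already used TaskGroup."
with tg:
    net = core.Net("work")
    net.ConstantFill([], ["x"], shape=[1], value=1.0)
    Task(step=core.execution_step("step", [net]))
```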
Test Plan:
Added a unit test -- didn't see a clean public method to test it, but I'm happy to add one if that makes sense.
Will wait for sandcastle to trigger everything else; I'm not at all familiar with this code so any other recommendations would be great!
Reviewed By: cryptopic
Differential Revision: D23541198
fbshipit-source-id: 5d1ec674a1ddaedf113140133b90e0da6afa7270