Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35913
The pass itself is still disabled by default, but with this change we
don't need to register it as a custom pass anymore. It allows us to
control its behavior with env variables more easily.
Test Plan: Imported from OSS
Reviewed By: suo
Differential Revision: D20827189
Pulled By: ZolotukhinM
fbshipit-source-id: e74d90b5e46422e7ab7bc40974a805220da50fbc
Summary:
**Summary:** This PR contains the infrastructure of a new CUDA fuser. This CUDA fuser is based on many of the same principles of TensorExpressions and Halide, however the implementation is ground up. The fusion pass itself is similar to the default CUDA fuser, however, it has undergone some refactoring and is using the new code generation infrastructure. For those who are interested in how the code generation in this PR works, I would recommend reviewing _test/cpp/jit/test_gpu_fusion.cpp_ as well as the long comment section at the beginning of _torch/csrc/jit/codegen/cuda/transform_replay.h_ One of the largest differences between our approach and that of TVM/Halide, is the concept of "TensorView". TensorView from a high level should be thought of similarly to how we think of working with Tensors in PyTorch. It's an N-D object which can undergo transformations that change its dimensionality. Dimensionality changes are done through the operations split/merge/reorder/computeAt. These transformations are similar to split/fuse/reorder/compute_at of TVM, they modify how a tensor is iterated over to generate GPU code. Interestingly, in our scheme these transformations are applied to tensors and only impact how that tensor is generated.
**Warning:** This PR is purposefully not feature complete with the current fuser. We wanted to separate out the infrastructure from the fusion capabilities. Once in, smaller incremental PRs will be submitted to expand capabilities of the fuser.
**Short term goals:**
Parity with current CUDA fuser (including performance):
- Dynamic shapes (no recompilation)
- Implicit handling of braodcast (broadcasted tensors are treated as tensors of the braodcasted size in the generated code)
- Dropout
**Mid-term goals:**
- Transposes fused with pointwise operations where transpose involves only 2 axes (across the fused operation).
- 1-D reductions fused with pointwise operations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34785
Reviewed By: ZolotukhinM
Differential Revision: D20650977
Pulled By: soumith
fbshipit-source-id: ee39c95a880e1b9822e874ed4cc180971572bf63
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35586
This pass fuses the choose_qparams-quant-dequant sequence
Fusion for weight tensor is the same as static quant.
Test Plan:
python test/test_quantize_script.py
Imported from OSS
Differential Revision: D20755680
fbshipit-source-id: b7443770642b6e6fa0fa9da8a44637e9b2d4df70
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35448
Add _choose_qparams_per_tensor which returns scale and zero_point similar to the dynamic quantization in the operator
Test Plan:
python test/test_quantize_script.py
Imported from OSS
Differential Revision: D20755679
fbshipit-source-id: c9066d8f1bb3e331809be26c4be806faafc9b981
Summary:
This commit allows one to use an environment variable to enable the fuser in torch/csrc/jit/tensorexpr/
```
PYTORCH_TENSOREXPR=1 python benchmark.py
```
This commit also changes the registration to happen by default, removing the requirement for the python exposed "_jit_register_tensorexpr_fuser"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35341
Reviewed By: ZolotukhinM
Differential Revision: D20676348
Pulled By: bwasti
fbshipit-source-id: 4c997cdc310e7567c03905ebff72b3e8a4c2f464
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35115
This commit runs the newly added tools/clang_format.py on the JIT
codebase and includes all of the formatting changes thus produced.
Testing:
Ran the script, CI.
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D20568523
Pulled By: SplitInfinity
fbshipit-source-id: e09bdb982ccf090eecfb7c7b461b8d0681eef82b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33186
This helps create larger functional graphs. It has the potential to increase memory use, so in order to land this on by default we would probably also do a reuse of buffers pass.
This is currently O(n * | Removed Nodes | ) because we have to rebuild the alias Db each time we make a change. This pass is critical to creating functional graphs, so this might be a compelling use case to build incremental updates to alias Db.
Test Plan: Imported from OSS
Differential Revision: D20603189
Pulled By: eellison
fbshipit-source-id: 105db52bf38e02188ca6df6d36294466d3309a0a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33020
This is a pass to create functional blocks. The other PRs in the stack help avoid some of the limitations that are are often found in graphs. It's possible that this would work well with a graph that is frozen. Follow up work items that will help this pass:
- We don't currently have any capacity in alias analysis to tell whether a Value that came from the wildcard set "re-escapes" back into the wildcard set.
- More comments on the semantics of the graph and correctness conditions
- We could consider using dynamic dag if the perf of this is a limitation.
- potential make Functional Graphs Functional Blocks instead, so that we do not repeatedly copy constants, also to make IR read easier.
Test Plan: Imported from OSS
Differential Revision: D20603188
Pulled By: eellison
fbshipit-source-id: 6822a6e65f4cc2676f8f6445fe8aa1cb858ebeeb
Summary:
Stacked PRs
* #34938 - [jit] Remove stray `script`
* **#34935 - [jit] Add lazy script decorator**
Some users maintain libraries of code that is largely trace-able but not
script-able. However, some functions may need to be `torch.jit.script`ed if
they contain control flow so the tracer will use the compiler version.
This however impacts library start up time as in #33418, so this PR adds
a workaround in the form of a `torch.jit._lazy_script_while_tracing`
that will only initialize the compiler if the function is called while
actually tracing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34935
Pulled By: driazati
Differential Revision: D20569778
fbshipit-source-id: d87c88c02b1abc86b283729ab8db94285d7d4853
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35010
semantics.
This PR moves all the xnnpack specific interfces to a generic interface.
Accordingly removes xnnpac specific reference from API and some variable
names.
What has not yet changed:
TODO:
USE_XNNPACK is still used. This can be removed where no XNNPACK
specific things are done. e.g., RegisterOpContext.cpp and
xnnpack_rewrite.cpp.
Also the filename and structure also remains. Some of the generic class
definition can be moved non-XNNPACK specific folder.
Test Plan:
python test/test_xnnpack_integration.py
Imported from OSS
Differential Revision: D20526416
fbshipit-source-id: 2e1725345c44bbb26bdc448097a7384eca121387
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35039
This is the initial step towards merging ivalue future and rpc future
Test Plan: Imported from OSS
Differential Revision: D20537164
Pulled By: wanchaol
fbshipit-source-id: d4f148c88e49ed6b0881ca4b4dd945ea24166183
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34842
This PR (hopefully the last one of such kind) is merging changes from a
side branch where tensor expessions based fuser work has been done so
far. This PR is is a squashed version of changes in the side branch,
which is available here: https://github.com/bertmaher/pytorch
Differential Revision: D20478208
Test Plan: Imported from OSS
Pulled By: ZolotukhinM
fbshipit-source-id: 21556e009f1fd88099944732edba72ac40e9b9c0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34227
This PR adds a CUDA support to tensor expressions.
Differential Revision: D20251836
Test Plan: Imported from OSS
Pulled By: ZolotukhinM
fbshipit-source-id: ab36a55834cceff30c8371fef6cca1054a32f017
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34226
LLVM and Cuda backends are added in subsequent PRs, so at this point the fuser is pretty useless, but it still can be tested and its logic is not going to change with addition of the codegens.
Differential Revision: D20251838
Test Plan: Imported from OSS
Pulled By: ZolotukhinM
fbshipit-source-id: 82b0d221fa89904ed526689d02a6c7676a8ce8de
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34319
Removes prepacking ops and install them as attributes of the top level
module. Needs to run freezing as the first pass.
Test Plan:
python test/test_xnnpack_integration.py
Imported from OSS
Differential Revision: D20290726
fbshipit-source-id: 633ceaa867ff7d5c8e69bd814c0362018394cb3a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34048
Rewrites the graph to insert xnnpack prepack and packed run ops for
conv2d and linear.
Test Plan:
python test/test_xnnpack_integration.py
Imported from OSS
Differential Revision: D20185658
fbshipit-source-id: c4c073c912ad33e822e7beb4ed86c9f895129d55
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34671
Like the python arg parser, this tries to convert to the schema in order.
It introduces schema_match_exception which gets thrown when the schema doesn't match,
allowing the overload handler to try the next option.
Behavior will not 100% match the schema argument parser but should work for
simple cases using custom binding.
Test Plan: Imported from OSS
Differential Revision: D20432206
Pulled By: zdevito
fbshipit-source-id: 280839a2205ea3497db3a9b5741fccc1e2bff9a8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34677
1. Remove remaining uses of `script::` namespace from the codebase,
2. Add one more typedef for `script::ExtraFilesMap` which is part of the
public interface.
Pull Request resolved: #34580
Test Plan: Imported from OSS
Reviewed By: zdevito
Differential Revision: D20431739
Pulled By: suo
fbshipit-source-id: a29d369c755b6506c53447ca1f286b6339222c9a
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33927
Test Plan:
test will be added in later PRs
Imported from OSS
Differential Revision: D20354879
fbshipit-source-id: 03976f4b86c46dbdc4e45764a1e72f1a3855a404
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34515
Once upon a time we thought this was necessary. In reality it is not, so
removing it.
For backcompat, our public interface (defined in `api/`) still has
typedefs to the old `script::` names.
There was only one collision: `Pass` as a `Stmt` and `Pass` as a graph
transform. I renamed one of them.
Test Plan: Imported from OSS
Differential Revision: D20353503
Pulled By: suo
fbshipit-source-id: 48bb911ce75120a8c9e0c6fb65262ef775dfba93
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33853
Quant fusion relies on inline, but inline will break the CallFunction("linaer", ...) into a if block
it will be hard to recognize this block and swap it with quantized::linear, in order to
preserve the op, we will swap all quantized functional linear into aten::linear.
They might produce different backward graph, but this is called in the step before we get quantized
model, so it shouldn't affect anything.
We'll integrate this with convert_script later in the new "finalize_quant" API
Test Plan:
python test/test_jit.py
Imported from OSS
Differential Revision: D20343873
fbshipit-source-id: 423e03bf893b79267d2dc97bc997ee1bfe54ec0f
Summary:
Separating CUDA fuser from CPU fuser.
1. New node in IR - prim::CudaFusionGroup:
This enables the cuda fuser to co-exist along side the old fuser. Allows us
to incrementally build and expand cuda fuser.
2. copied FuseGraph optimization passes to CudaFuserGraph:
We will re-factor & reuse Chunk/Concat in the old fuser logic, which is
handled in the optimization pass at this moment. Unfortunately many code in
the pass is tightly binded with the legacy fuser, which makes code sharing
difficult.
The CudaFusionGraph will support only a subset of operations comparing to
legacy fuser (CUDA only). It is registered as a custom pass post fusion via
```torch._C._jit_register_cuda_fuser()```
To have it in effect, you should also turn off fusion on GPU via
```torch._C._jit_override_can_fuse_on_gpu(False)```
3. We don't have codegen in this PR yet (WIP). Currently we just fall back to
the old fuser.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33527
Differential Revision: D20171598
Pulled By: ZolotukhinM
fbshipit-source-id: 9a3c0f06f46da7eaa80ae7551c04869f5b03ef71
Summary:
This patch enables folding GetAttr nodes with their corresponding
values. _jit_pass_freeze_module API returns a new TorchScipt module
where all function calls and get attributes are inlined.
Usage:
frozen_model = torch._C._freeze_module(scrited_model._c)
frozen_model.forward(...)
This API currently optimizes the forward method. We will follow up to
to preserve and optimize methods and attributes that are annotated as
torch.jit.interface.
Several future improvements to JIT optimizations are required to maximize
clean up/de-sugar the graph and eliminate redundancies.
Ideally, we want to produce a graph that can easily be lowered to
GLOW and other low-level backends.
__
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32178
Differential Revision: D19419640
Pulled By: bzinodev
fbshipit-source-id: 52baffaba9bca2cd60a8e747baa68d57711ad42b