This PR allows user to author a CUDA kernel in python.
```
from torch.cuda.jiterator import create_jit_fn
code_string = "template <typename T> T my_kernel(T x, T y, T alpha) { return -x * y + x - y + alpha; }"
jitted_fn = create_jit_fn(code_string, alpha=0)
a = torch.rand(3, device='cuda')
b = torch.rand(3, device='cuda')
result = jitted_fn(a, b, alpha=1.0)
```
Limitations:
- Only supports elementwise kernel
- 1~8 tensor inputs (empty input, e.g. factory methods, is not supported)
- inputs tensors must live in cuda device
- cpu Scalar is not supported
- kwargs must be pre-declared when calling create_jit_fn
- kwargs must be convertible to at::Scalar, one of float64, int64_t, bool. (complex not support for now)
TODOs:
- [x] consolidate union and c10::variant implementation
- [x] plug into existing op testing framework
- [ ] rename files, place files in the right folder
- [ ] place util functions in the right file
- [x] enforce assumptions in python interface e.g <8 inputs, kwargs types
- [x] Add user-facing documentation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76394
Approved by: https://github.com/mruberry
This PR adds `linalg.vander`, the linalg version of `torch.vander`.
We add autograd support and support for batched inputs.
We also take this chance to improve the docs (TODO: Check that they
render correctly!) and add an OpInfo.
**Discussion**: The current default for the `increasing` kwargs is extremely
odd as it is the opposite of the classical definition (see
[wiki](https://en.wikipedia.org/wiki/Vandermonde_matrix)). This is
reflected in the docs, where I explicit both the odd defaults that we
use and the classical definition. See also [this stackoverflow
post](https://stackoverflow.com/a/71758047/5280578), which shows how
people are confused by this defaults.
My take on this would be to correct the default to be `increasing=True`
and document the divergence with NumPy (as we do for other `linalg`
functions) as:
- It is what people expect
- It gives the correct determinant called "the Vandermonde determinant" rather than (-1)^{n-1} times the Vandermonde det (ugh).
- [Minor] It is more efficient (no `flip` needed)
- Since it's under `linalg.vander`, it's strictly not a drop-in replacement for `np.vander`.
We will deprecate `torch.vander` in a PR after this one in this stack
(once we settle on what's the correct default).
Thoughts? mruberry
cc kgryte rgommers as they might have some context for the defaults of
NumPy.
Fixes https://github.com/pytorch/pytorch/issues/60197
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76303
Approved by: https://github.com/albanD, https://github.com/mruberry
This PR adds `linalg.lu_solve`. While doing so, I found a bug in MAGMA
when calling the batched MAGMA backend with trans=True. We work around
that by solving the system solving two triangular systems.
We also update the heuristics for this function, as they were fairly
updated. We found that cuSolver is king, so luckily we do not need to
rely on the buggy backend from magma for this function.
We added tests testing this function left and right. We also added tests
for the different backends. We also activated the tests for AMD, as
those should work as well.
Fixes https://github.com/pytorch/pytorch/issues/61657
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72935
Approved by: https://github.com/IvanYashchuk, https://github.com/mruberry
Re-landing #68111/#74596
## Description
v0.5 PR of this [RFC](https://github.com/pytorch/pytorch/issues/49444).
On the basis of #50256, the below improvements are included:
* The [v0.5 release branch](https://github.com/oneapi-src/oneDNN/releases/tag/graph-v0.5) of the oneDNN Graph API is used
* The fuser now works with the profiling graph executor. We have inserted type check nodes to guard the profiled tensor properties.
### User API:
The optimization pass is disabled by default. Users could enable it by:
```
torch.jit.enable_onednn_fusion(True)
```
`torch.jit.freeze` should be used after tracing (recommended) or scripting a model.
### Performance:
[pytorch/benchmark](https://github.com/pytorch/benchmark) tool is used to compare the performance:
* SkyLake 8180 (1 socket of 28 cores):

* SkyLake 8180 (single thread):

* By mapping hardswish to oneDNN Graph, it’s 8% faster than PyTorch JIT (NNC + OFI)
** We expect performance gain after mapping transpose, contiguous & view to oneDNN graph ops
### Directory structure of the integration code
Fuser-related code is placed under:
```
torch/csrc/jit/codegen/onednn/
```
Optimization pass registration is done in:
```
torch/csrc/jit/passes/onednn_graph_fuser.h
```
CMake for the integration code is in:
```
caffe2/CMakeLists.txt
cmake/public/mkldnn.cmake
cmake/Modules/FindMKLDNN.cmake
```
## Limitations
* In this PR, we only support Pytorch-oneDNN-Graph integration on Linux platform. Support on Windows and MacOS will be enabled as a next step.
* We have only optimized the inference use-case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76622
Approved by: https://github.com/eellison
This PR modifies `lu_unpack` by:
- Using less memory when unpacking `L` and `U`
- Fuse the subtraction by `-1` with `unpack_pivots_stub`
- Define tensors of the correct types to avoid copies
- Port `lu_unpack` to be a strucutred kernel so that its `_out` version
does not incur on extra copies
Then we implement `linalg.lu` as a structured kernel, as we want to
compute its derivative manually. We do so because composing the
derivatives of `torch.lu_factor` and `torch.lu_unpack` would be less efficient.
This new function and `lu_unpack` comes with all the things it can come:
forward and backward ad, decent docs, correctness tests, OpInfo, complex support,
support for metatensors and support for vmap and vmap over the gradients.
I really hope we don't continue adding more features.
This PR also avoids saving some of the tensors that were previously
saved unnecessarily for the backward in `lu_factor_ex_backward` and
`lu_backward` and does some other general improvements here and there
to the forward and backward AD formulae of other related functions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67833
Approved by: https://github.com/IvanYashchuk, https://github.com/nikitaved, https://github.com/mruberry
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76538
when running the example from the docs, I found that these steps were not working.
These are the updates necessary to get the example working.
Test Plan: n/a
Reviewed By: PaliC
Differential Revision: D35998155
fbshipit-source-id: d78bb2886f94889abae5a3af5239fcd306cd5e09
(cherry picked from commit 6893812efe7443b437ccafb7b1ff6bc7bd2e6670)
This PR adds `linalg.vander`, the linalg version of `torch.vander`.
We add autograd support and support for batched inputs.
We also take this chance to improve the docs (TODO: Check that they
render correctly!) and add an OpInfo.
**Discussion**: The current default for the `increasing` kwargs is extremely
odd as it is the opposite of the classical definition (see
[wiki](https://en.wikipedia.org/wiki/Vandermonde_matrix)). This is
reflected in the docs, where I explicit both the odd defaults that we
use and the classical definition. See also [this stackoverflow
post](https://stackoverflow.com/a/71758047/5280578), which shows how
people are confused by this defaults.
My take on this would be to correct the default to be `increasing=True`
and document the divergence with NumPy (as we do for other `linalg`
functions) as:
- It is what people expect
- It gives the correct determinant called "the Vandermonde determinant" rather than (-1)^{n-1} times the Vandermonde det (ugh).
- [Minor] It is more efficient (no `flip` needed)
- Since it's under `linalg.vander`, it's strictly not a drop-in replacement for `np.vander`.
We will deprecate `torch.vander` in a PR after this one in this stack
(once we settle on what's the correct default).
Thoughts? mruberry
cc kgryte rgommers as they might have some context for the defaults of
NumPy.
Fixes https://github.com/pytorch/pytorch/issues/60197
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76303
Approved by: https://github.com/albanD
This PR adds a function for computing the LDL decomposition and a function that can solve systems of linear equations using this decomposition. The result of `torch.linalg.ldl_factor_ex` is in a compact form and it's required to use it only through `torch.linalg.ldl_solve`. In the future, we could provide `ldl_unpack` function that transforms the compact representation into explicit matrices.
Fixes https://github.com/pytorch/pytorch/issues/54847.
cc @jianyuh @nikitaved @pearu @mruberry @walterddr @IvanYashchuk @xwang233 @Lezcano
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69828
Approved by: https://github.com/Lezcano, https://github.com/mruberry, https://github.com/albanD
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76223
Small formatting fixes that was missed because I didn't check the generated doc last time
Test Plan:
visual inspection of the generated docs for this PR
Imported from OSS
Reviewed By: HDCharles
Differential Revision: D35853174
fbshipit-source-id: 4454a4bf5d0c998d866bbae1d6b5286827082033
(cherry picked from commit 125f60356ccc9cd6888c515889bd27ff9860ec74)
Fixes https://github.com/pytorch/pytorch/issues/75464 Adds a context manager that will throw if the ops in the context are not fused.
API is :
```
with torch.jit.strict_fusion():
...
```
A few TODOs:
[+] Compose/figure out how to do with autodiff - right now it will run on autodiff as well
[+] Support all of the nvfuser operators that are added in guarding
[+] Figure out what to do with control flow that isn't taken (right now it will just error). this is probably a source of the original issue :/ - will just error
[+] (After those are figured out) add to docs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75777
Approved by: https://github.com/davidberard98
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75998
Add more details to user facing docs quantization.rst, which will be displayed in the official quantization doc page: https://pytorch.org/docs/stable/quantization.html
This includes:
* docs for quantization stack (quantized tensor, quantized operator and modules, observer, fake_quantize, QConfig, quantization flow)
* Added support table for quantization mode, quantization flow mode and backend, (also moved around operator support table)
* restructured eager mode and fx mode docs as well
Test Plan:
inspect the doc that's built by github ci
Imported from OSS
Reviewed By: dzdang
Differential Revision: D35739111
fbshipit-source-id: 3762d387479bdd37472cb17d5c49da2f520effbb
(cherry picked from commit db5e6411c52c08dd9c45f841ab86713d36a75d51)
Summary:
Following https://github.com/pytorch/rfcs/blob/master/RFC-0019-Extending-PyTorch-Quantization-to-Custom-Backends.md we implemented
the backend configuration for fbgemm/qnnpack backend, currently it was under fx folder, but we'd like to use this for all different
workflows, including eager, fx graph and define by run quantization, this PR moves it to torch.ao.quantization namespace so that
it can be shared by different workflows
Also moves some utility functions specific to fx to fx/backend_config_utils.py and some files are kept in fx folder (quantize_handler.py and fuse_handler.py)
Test Plan:
python test/teset_quantization.py TestQuantizeFx
python test/teset_quantization.py TestQuantizeFxOps
python test/teset_quantization.py TestQuantizeFxModels
python test/test_quantization.py TestAOMigrationQuantization
python test/test_quantization.py TestAOMigrationQuantizationFx
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75823
Approved by: https://github.com/vkuzo
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75126
Quantization has a high volume of configurations of how to quantize an
op for a reference model representation which is useful for a lowering
step for a backend. An example of this is
```
{'dtype_configs': [{'input_dtype': torch.quint8,
'output_dtype': torch.quint8}],
'observation_type': <ObservationType.OUTPUT_USE_DIFFERENT_OBSERVER_AS_INPUT: 0>,
'pattern': <class 'torch.nn.modules.conv.ConvTranspose1d'>},
```
These configs are checked into master, and they are created with Python functions.
Therefore, there is no easy way for the user to see what the configs actually
are without running some Python code.
This PR is one approach to document these configs. Here is what this is doing:
1. during documentation build, write a text file of the configs
2. render that text file on a quantization page, with some additional context
In the future, this could be extended to autogenerate better looking tables
such as: op support per backend and dtype, op support per valid quantization settings per backend,
etc.
Test Plan:
```
cd docs
make html
cd html
python -m http.server 8000
// render http://[::]:8000/quantization-backend-configuration.html
// it renders correctly
```
Reviewed By: ejguan
Differential Revision: D35365461
Pulled By: vkuzo
fbshipit-source-id: d60f776ccb57da9db3d09550e4b27bd5e725635a
(cherry picked from commit 14865c0e23bc080120342c8f9278f0fae8eb8fbd)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74563
This is used inconsistently in all the generate_code program
invocations. Nevertheless, nothing consumes this flag, so we can
safely remove it.
This was removed in #25353.
ghstack-source-id: 152249818
Test Plan: Should be a no-op, rely on CI.
Reviewed By: malfet
Differential Revision: D35053096
fbshipit-source-id: 3ad19e83ca14649b514dc163c3caff6cbd118e14
(cherry picked from commit a43f05bb43553249caac3c3479986cbc45d286ae)