Commit Graph

102 Commits

Author SHA1 Message Date
chilli
6b1007b2a7 Fix error in div lowering with integers (#102809)
Fixes https://github.com/pytorch/pytorch/issues/101016
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102809
Approved by: https://github.com/ngimel
ghstack dependencies: #110501, #110504, #110591, #110668, #110687
2023-10-06 23:21:40 +00:00
Edward Z. Yang
3262c5358f Use _check_is_size for validate_dim_length (#109849)
_check_is_size has some extra juice for unbacked SymInts; use it.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109849
Approved by: https://github.com/yanboliang
2023-09-26 23:33:31 +00:00
Mwiza Kunda
5c4b5baf21 Fix python decomps for OpOverloadPackets and add tests (#107707)
- Extend `test_torch_dispatch_meta_outplace` to test torch ops that do not have an out parameter but have aten op overloads that do. Additionally, Python decompositions may register `OpOverloadPacket`s, so decompositions need to be tested to ensure all `OpOverloads` still function for the `Meta` key (e.g. if a python decomposition is registered for an aten op `aten.foo` with overloads `[default, out]`, the python function needs to support receiving out arguments)

- Add out parameter wrappers to python decomps for aten ops that have out overloads
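
A generic sketch of what such an out-parameter wrapper does (illustrative only; `with_out_arg` is a hypothetical helper, not the actual `torch._prims_common.wrappers.out_wrapper`):

```python
import torch

def with_out_arg(decomp):
    # wraps a functional decomposition and adds an optional out= argument
    def wrapper(*args, out=None, **kwargs):
        result = decomp(*args, **kwargs)
        if out is None:
            return result
        out.resize_(result.shape)  # sketch only; a real wrapper also checks dtype/device
        return out.copy_(result)
    return wrapper

scaled = with_out_arg(lambda x: x * 2)  # toy decomposition used for illustration
```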

CC. @ezyang @albanD @lezcano

Fixes #107713

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107707
Approved by: https://github.com/lezcano
2023-09-25 20:53:30 +00:00
Ying Zhang
bbdce93571 Basic fp8 support in Inductor (#109168)
Add basic fp8 support in Inductor, including:
* Fix fp8 Triton codegen issues;
* Add a min_elements_per_thread requirement for fp8-related dtype conversions. More details on the Triton implementation can be found at 10f59d8ce0/lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp (L10).

Note that the current implementation only works for Pointwise. Will create follow-up PRs for Reduction.
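
For illustration, a hedged sketch of the pointwise path this enables (assumes a CUDA build where the float8 dtypes are available):

```python
import torch

def f(x):
    y = x.to(torch.float8_e4m3fn)      # fp8 dtype conversion handled by the fixed codegen
    return y.to(torch.float32) * 2.0   # pointwise math after converting back up

compiled = torch.compile(f)
out = compiled(torch.randn(1024, device="cuda"))
```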

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109168
Approved by: https://github.com/drisspg
2023-09-23 04:41:41 +00:00
Mwiza Kunda
6b7b9c796e Fix registering jit decompositions for jvp for out wrapped decomps (#109367)
Python decompositions wrapped by `out_wrapper` need to be unwrapped before compiling with TorchScript since:
- `out_wrapper` extends the decomposition's signature with an out parameter; however, this `out` parameter is not present in the source code of the original decomposition, so the resulting `ScriptFunction` will not have an `out` parameter
- `out_wrapper` is in the `torch._prims_common.wrappers` module, so its `globals()` are different from the globals of the decomposition to be wrapped. This may cause symbol resolution to fail in the TorchScript compiler, since it is compiling the unwrapped decomp's source code rather than the wrapper

The python decomposition for `aten.trace` is wrapped as an example, other decompositions are to be fixed in https://github.com/pytorch/pytorch/pull/107707
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109367
Approved by: https://github.com/lezcano
2023-09-21 16:36:51 +00:00
Edward Z. Yang
55f956f1d2 optests improvements based on torchvision usage on nms (#108929)
- Update cross-ref FakeMode test to use ShapeEnv.  Dynamic ops can now
  return an unbacked SymInt.  We always accept this as equal to whatever
  the real value was.
- Relax test so it works on all classes, not just unittest.TestCase
- Properly wrap the original method, so things like
  pytree.mark.parametrize are carried over
- Support dynamic shapes by default for make_fx `tracing_mode="fake"` without symbolifying everything else

Fixes https://github.com/pytorch/pytorch/issues/108927

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108929
Approved by: https://github.com/zou3519
2023-09-13 13:26:15 +00:00
Guilherme Leobas
7e878c9d10 Add decomposition for aten.take_along_dim (#108185)
xref #107875
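
A hedged sketch of the core of such a decomposition (the real one also handles broadcasting and the `dim=None` flatten case): for same-rank inputs, `take_along_dim` reduces to `gather` along `dim`.

```python
import torch

def take_along_dim_sketch(x, indices, dim):
    # same-rank case only; illustrative, not the PR's exact decomposition
    return torch.gather(x, dim, indices)

x = torch.randn(3, 4)
idx = torch.argsort(x, dim=1)
assert torch.equal(take_along_dim_sketch(x, idx, 1), torch.take_along_dim(x, idx, 1))
```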

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108185
Approved by: https://github.com/lezcano
2023-09-04 13:49:53 +00:00
Ivan Yashchuk
c913f3857f Remove dynamo+nvfuser (#105789)
This PR removes unmaintained Dynamo+nvFuser.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105789
Approved by: https://github.com/jansel, https://github.com/jjsjann123, https://github.com/albanD
2023-08-08 22:29:32 +00:00
PyTorch MergeBot
891bb259f8 Revert "Remove dynamo+nvfuser (#105789)"
This reverts commit 6030151d37.

Reverted https://github.com/pytorch/pytorch/pull/105789 on behalf of https://github.com/DanilBaibak due to Break a lot of tests on main. ([comment](https://github.com/pytorch/pytorch/pull/105789#issuecomment-1669710571))
2023-08-08 14:20:32 +00:00
Ivan Yashchuk
6030151d37 Remove dynamo+nvfuser (#105789)
This PR removes unmaintained Dynamo+nvFuser.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105789
Approved by: https://github.com/jansel, https://github.com/jjsjann123, https://github.com/albanD
2023-08-08 13:29:31 +00:00
dilililiwhy
5a9e82fa02 let torch.device be overrideable by TorchFunctionMode (#106514)
Fixes #103828
let torch.device be overrideable by TorchFunctionMode
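
A hedged sketch of what this enables (the redirection below is purely illustrative):

```python
import torch
from torch.overrides import TorchFunctionMode

class DeviceInterposer(TorchFunctionMode):
    def __torch_function__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        if func is torch.device and args and args[0] == "cuda":
            return func("cpu")  # reroute requests for "cuda" while the mode is active
        return func(*args, **kwargs)

with DeviceInterposer():
    d = torch.device("cuda")  # now dispatches through __torch_function__
print(d)  # device(type='cpu') under this sketch
```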
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106514
Approved by: https://github.com/ezyang
2023-08-04 10:47:43 +00:00
Edward Z. Yang
3bf922a6ce Apply UFMT to low traffic torch modules (#106249)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106249
Approved by: https://github.com/Skylion007
2023-07-29 23:37:30 +00:00
Justin Chu
4cc1745b13 [BE] f-stringify torch/ and scripts (#105538)
This PR is a follow-up in the pyupgrade series, converting more strings to f-strings using `flynt`.

- https://docs.python.org/3/reference/lexical_analysis.html#f-strings
- https://pypi.org/project/flynt/

Command used:

```
flynt torch/ -ll 120
flynt scripts/ -ll 120
flynt tools/ -ll 120
```

and excluded `collect_env.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105538
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-07-21 19:35:24 +00:00
Justin Chu
8a688277a2 [BE] Enable ruff's UP rules and autoformat dynamo / functorch and refs (#105432)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105432
Approved by: https://github.com/ezyang
2023-07-19 13:48:44 +00:00
Nikita Shulga
5837e95d30 [Reland] Update mypy to 1.4.1 (#105227)
This PR re-lands
- [Typing] Fix PEP 484 Violation (#105022)
- Update mypy to 1.4.1 (#91983)

That were reverted due to the conflict with internal source repo.

Mostly fixes for PEP-484 violations (i.e. when a default arg is set to None but the type is not annotated as Optional)
Plus a few real fixes:
  - Add missing `_get_upgraders_entry_map` to `torch/_C/__init__.pyi`
  - Add missing return statement to `torch._export.deserialize_graph`
  - Fix error message in `torch.ao.ns.fx.weight_utils.get_lstm_mod_weights`
  - Add an assert in `torch/optim/optimizer.py` that the Optional list is not None
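
A typical PEP-484 fix of the kind described above looks like this (illustrative):

```python
from typing import Optional

def pad(x, value: int = None): ...                    # implicit Optional: rejected by newer mypy
def pad_fixed(x, value: Optional[int] = None): ...    # the None default is reflected in the annotation
```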
TODO (in followup PR):
  - Fix erroneous `isinstance` check in `torch/ao/quantization/_pt2e/qat_utils.py`

Unrelated, to bypass CI failures due to the gcc9 dependency update in Ubuntu-18.04:
- Add a hack to squash the older libstdc++ from the conda environment in favor of the one from the OS to `.ci/docker/install_conda.sh`
- Update bazel cuda builds to focal, as with libstdc++-6.0.32 bazel builds lose the ability to catch exceptions (probably because they link with cupti statically, but I could not find where it is done)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105227
Approved by: https://github.com/atalman, https://github.com/albanD, https://github.com/Skylion007
2023-07-15 20:30:20 +00:00
PyTorch MergeBot
15fd1ea118 Revert "[Reland] Update mypy to 1.4.1 (#105227)"
This reverts commit c9c4f8efc3.

Reverted https://github.com/pytorch/pytorch/pull/105227 on behalf of https://github.com/atalman due to trying to mitigate ci sev #105248 ([comment](https://github.com/pytorch/pytorch/pull/105227#issuecomment-1636510935))
2023-07-14 22:28:35 +00:00
Nikita Shulga
c9c4f8efc3 [Reland] Update mypy to 1.4.1 (#105227)
This PR re-lands
- [Typing] Fix PEP 484 Violation (#105022)
- Update mypy to 1.4.1 (#91983)

That were reverted due to the conflict with internal source repo.

Mostly fixes for PEP-484 violations (i.e. when a default arg is set to None but the type is not annotated as Optional)
Plus a few real fixes:
  - Add missing `_get_upgraders_entry_map` to `torch/_C/__init__.pyi`
  - Add missing return statement to `torch._export.deserialize_graph`
  - Fix error message in `torch.ao.ns.fx.weight_utils.get_lstm_mod_weights`
  - Add an assert in `torch/optim/optimizer.py` that the Optional list is not None
TODO (in followup PR):
  - Fix erroneous `isinstance` check in `torch/ao/quantization/_pt2e/qat_utils.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105227
Approved by: https://github.com/atalman, https://github.com/albanD, https://github.com/Skylion007
2023-07-14 20:45:12 +00:00
PyTorch MergeBot
3c5a494d7a Revert "Update mypy to 1.4.1 (#91983)"
This reverts commit 634659e262.

Reverted https://github.com/pytorch/pytorch/pull/91983 on behalf of https://github.com/malfet due to It's dependent change was reverted, so reverting this one as well, to keep CI clean ([comment](https://github.com/pytorch/pytorch/pull/91983#issuecomment-1636059709))
2023-07-14 15:59:16 +00:00
Nikita Shulga
634659e262 Update mypy to 1.4.1 (#91983)
Mostly fixes for PEP-484 violations (i.e. when a default arg is set to None but the type is not annotated as Optional)
Plus a few real fixes:
  - Add missing `_get_upgraders_entry_map` to `torch/_C/__init__.pyi`
  - Add missing return statement to `torch._export.deserialize_graph`
  - Fix error message in `torch.ao.ns.fx.weight_utils.get_lstm_mod_weights`
TODO (in followup PR):
  - Fix erroneous `isinstance` check in `torch/ao/quantization/_pt2e/qat_utils.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91983
Approved by: https://github.com/kit1980, https://github.com/ZainRizvi, https://github.com/huydhn, https://github.com/thiagocrepaldi, https://github.com/aaronenyeshi
2023-07-13 16:30:36 +00:00
Nikita Karetnikov
c4a6f86062 [pt2] add metas for max_unpool2d and max_unpool3d (#103821)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103821
Approved by: https://github.com/Skylion007, https://github.com/Chillee
2023-07-01 01:33:35 +00:00
Peter Bell
8b418f197c [decomp] Add decomposition for torch.renorm (#103858)
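A hedged sketch of the decomposition idea (illustrative; the actual implementation may differ): clamp the p-norm of every slice along `dim` to at most `maxnorm`.

```python
import torch

def renorm_sketch(x, p, dim, maxnorm, eps=1e-7):
    reduce_dims = [d for d in range(x.dim()) if d != dim]
    norms = torch.linalg.vector_norm(x, ord=p, dim=reduce_dims, keepdim=True)
    factor = torch.where(norms > maxnorm, maxnorm / (norms + eps), torch.ones_like(norms))
    return x * factor

x = torch.randn(5, 8)
print(torch.allclose(renorm_sketch(x, 2, 0, 1.0), torch.renorm(x, 2, 0, 1.0), atol=1e-5))
```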
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103858
Approved by: https://github.com/ezyang, https://github.com/nkaretnikov
2023-06-21 20:57:43 +00:00
Kurt Mohler
ee83c646bb Replace _prims_common.check with torch._check* (#103240)
This relands most of the changes from #102219, which were backed out by #103128. However, instead of removing `_prims_common.check`, it adds a warning and a comment mentioning that it will be removed in the future and that `torch._check*` should be used instead. As mentioned in https://github.com/pytorch/pytorch/pull/103128#pullrequestreview-1466414415, `_prims_common.check` cannot yet be removed because of some internal usage.
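
For reference, illustrative usage of the `torch._check*` family:

```python
import torch

def lookup(weight, index: int):
    # raises a RuntimeError with the given message when the condition is False
    torch._check(index >= 0, lambda: f"index must be non-negative, got {index}")
    return weight[index]

print(lookup(torch.arange(4), 2))  # tensor(2)
```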

Part of #72948

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103240
Approved by: https://github.com/albanD
2023-06-21 00:46:17 +00:00
Ivan Zaitsev
821493715c Back out "Remove check from _prims_common, replace with torch._check* (#102219)", Back out "Forwatd fix for D46427687" (#103128)
Test Plan: revertitparrot

Reviewed By: malfet

Differential Revision: D46506433

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103128
Approved by: https://github.com/malfet
2023-06-07 01:41:41 +00:00
Kurt Mohler
a84bb2709a Remove check from _prims_common, replace with torch._check* (#102219)
Part of #72948

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102219
Approved by: https://github.com/lezcano, https://github.com/albanD
2023-06-03 02:23:21 +00:00
PyTorch MergeBot
a7efa0ce35 Revert "Remove check from _prims_common, replace with torch._check* (#102219)"
This reverts commit fb79d43649.

Reverted https://github.com/pytorch/pytorch/pull/102219 on behalf of https://github.com/malfet due to Broke lint, see https://github.com/pytorch/pytorch/actions/runs/5158949959/jobs/9293466925 ([comment](https://github.com/pytorch/pytorch/pull/102219#issuecomment-1574245414))
2023-06-02 20:00:48 +00:00
Kurt Mohler
fb79d43649 Remove check from _prims_common, replace with torch._check* (#102219)
Part of #72948

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102219
Approved by: https://github.com/lezcano, https://github.com/albanD
2023-06-02 19:13:45 +00:00
Xia, Weiwen
ce9923a1cb [Quant][PT2E][Inductor] Lower quantized conv to Inductor (#101164)
**Summary**
Enable the lowering path for reference quantized conv after PT2E to Inductor.

The pattern `decomposed dequantize -> aten.convolution -> decomposed quantize` is fused to `quantized.functional.conv1d/2d/3d` and Inductor makes external calls to these ops.

This PR focuses on functionality only. The implementation is expected to have low performance.

Code example:
```Python
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 6, 2, stride=2, padding=0, dilation=1)

    def forward(self, x):
        return nn.functional.gelu(self.conv(x))

m = M().eval()
example_inputs = (torch.randn(2, 3, 6, 6),)
exported_model, guards = torchdynamo.export(
    m,
    *copy.deepcopy(example_inputs),
    aten_graph=True,
    tracing_mode="real",
)

qconfig = get_default_qconfig("x86")
qconfig_mapping = QConfigMapping().set_global(qconfig)
backend_config_inductor = get_x86_inductor_pt2e_backend_config()
prepared_model = prepare_pt2e(
    exported_model,
    qconfig_mapping,
    example_inputs,
    backend_config_inductor
)
prepared_model(*example_inputs)
converted_model = convert_pt2e(prepared_model)
run = compile_fx(converted_model, example_inputs)
```
Output code by Inductor
```python
from ctypes import c_void_p, c_long
import torch
import math
import random
import os
import tempfile
from torch._inductor.hooks import run_intermediate_hooks
from torch._inductor.utils import maybe_profile

from torch import empty_strided, as_strided, device
from torch._inductor.codecache import AsyncCompile
from torch._inductor.select_algorithm import extern_kernels

aten = torch.ops.aten
assert_size_stride = torch._C._dynamo.guards.assert_size_stride
async_compile = AsyncCompile()

kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_weiwen/5d/c5dsrjrcd4jlzryilhxl5hdvcrzsoek52xzzqqy57hcoezvxxxwm.h"
extern "C" void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const long* in_ptr2,
                       unsigned char* out_ptr0)
{
    {
        #pragma GCC ivdep
        for(long i0=static_cast<long>(0L); i0<static_cast<long>(2L); i0+=static_cast<long>(1L))
        {
            #pragma GCC ivdep
            for(long i1=static_cast<long>(0L); i1<static_cast<long>(3L); i1+=static_cast<long>(1L))
            {
                #pragma GCC ivdep
                for(long i2=static_cast<long>(0L); i2<static_cast<long>(36L); i2+=static_cast<long>(1L))
                {
                    auto tmp0 = in_ptr0[static_cast<long>(i2 + (36L*i1) + (108L*i0))];
                    auto tmp1 = in_ptr1[static_cast<long>(0L)];
                    auto tmp7 = in_ptr2[static_cast<long>(0L)];
                    auto tmp2 = 1 / tmp1;
                    auto tmp3 = static_cast<float>(1.0);
                    auto tmp4 = decltype(tmp2)(tmp2 * tmp3);
                    auto tmp5 = decltype(tmp0)(tmp0 * tmp4);
                    auto tmp6 = std::nearbyint(tmp5);
                    auto tmp8 = static_cast<float>(tmp7);
                    auto tmp9 = tmp6 + tmp8;
                    auto tmp10 = static_cast<float>(0);
                    auto tmp11 = max_propagate_nan(tmp9, tmp10);
                    auto tmp12 = static_cast<float>(127);
                    auto tmp13 = min_propagate_nan(tmp11, tmp12);
                    auto tmp14 = static_cast<unsigned char>(tmp13);
                    out_ptr0[static_cast<long>(i1 + (3L*i2) + (108L*i0))] = tmp14;
                }
            }
        }
    }
}
''')

kernel_cpp_1 = async_compile.cpp('''
#include "/tmp/torchinductor_weiwen/5d/c5dsrjrcd4jlzryilhxl5hdvcrzsoek52xzzqqy57hcoezvxxxwm.h"
extern "C" void kernel(const unsigned char* in_ptr0,
                       const long* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0)
{
    {
        #pragma GCC ivdep
        for(long i0=static_cast<long>(0L); i0<static_cast<long>(2L); i0+=static_cast<long>(1L))
        {
            #pragma GCC ivdep
            for(long i1=static_cast<long>(0L); i1<static_cast<long>(6L); i1+=static_cast<long>(1L))
            {
                #pragma GCC ivdep
                for(long i2=static_cast<long>(0L); i2<static_cast<long>(9L); i2+=static_cast<long>(1L))
                {
                    auto tmp0 = in_ptr0[static_cast<long>(i1 + (6L*i2) + (54L*i0))];
                    auto tmp2 = in_ptr1[static_cast<long>(0L)];
                    auto tmp5 = in_ptr2[static_cast<long>(0L)];
                    auto tmp1 = static_cast<float>(tmp0);
                    auto tmp3 = static_cast<float>(tmp2);
                    auto tmp4 = tmp1 - tmp3;
                    auto tmp6 = decltype(tmp4)(tmp4 * tmp5);
                    auto tmp7 = static_cast<float>(0.5);
                    auto tmp8 = decltype(tmp6)(tmp6 * tmp7);
                    auto tmp9 = static_cast<float>(0.7071067811865476);
                    auto tmp10 = decltype(tmp6)(tmp6 * tmp9);
                    auto tmp11 = std::erf(tmp10);
                    auto tmp12 = static_cast<float>(1);
                    auto tmp13 = tmp11 + tmp12;
                    auto tmp14 = decltype(tmp8)(tmp8 * tmp13);
                    out_ptr0[static_cast<long>(i2 + (9L*i1) + (54L*i0))] = tmp14;
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1, arg6_1, arg7_1, arg8_1 = args
    args.clear()
    buf0 = torch.ops.quantized_decomposed.quantize_per_channel.default(arg0_1, arg4_1, arg5_1, 0, -128, 127, torch.int8)
    del arg0_1
    buf1 = buf0
    assert_size_stride(buf1, (6, 3, 2, 2), (12, 4, 2, 1))
    del buf0
    buf2 = empty_strided((2, 3, 6, 6), (108, 1, 18, 3), device='cpu', dtype=torch.uint8)
    kernel_cpp_0(c_void_p(arg8_1.data_ptr()), c_void_p(arg2_1.data_ptr()), c_void_p(arg3_1.data_ptr()), c_void_p(buf2.data_ptr()))
    del arg8_1
    buf2 = torch._make_per_tensor_quantized_tensor(buf2, arg2_1, arg3_1)
    buf1 = torch._make_per_channel_quantized_tensor(buf1, arg4_1, arg5_1, 0)
    buf3 = torch.ao.nn.quantized.functional.conv2d(buf2, buf1, arg1_1, (2, 2), (0, 0), (1, 1), 1, 'zeros', arg6_1, arg7_1, torch.uint8)
    assert_size_stride(buf3, (2, 6, 3, 3), (54, 1, 18, 6))
    del arg1_1
    del arg2_1
    del arg3_1
    del arg4_1
    del arg5_1
    del buf1
    del buf2
    buf4 = empty_strided((2, 6, 3, 3), (54, 9, 3, 1), device='cpu', dtype=torch.float32)
    kernel_cpp_1(c_void_p(buf3.data_ptr()), c_void_p(arg7_1.data_ptr()), c_void_p(arg6_1.data_ptr()), c_void_p(buf4.data_ptr()))
    del arg6_1
    del arg7_1
    return (buf4, )

def benchmark_compiled_module(times=10, repeat=10):
    from torch._dynamo.testing import rand_strided
    from torch._inductor.utils import print_performance
    arg0_1 = rand_strided((6, 3, 2, 2), (12, 4, 2, 1), device='cpu', dtype=torch.float32)
    arg1_1 = rand_strided((6, ), (1, ), device='cpu', dtype=torch.float32)
    arg2_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
    arg3_1 = rand_strided((), (), device='cpu', dtype=torch.int64)
    arg4_1 = rand_strided((6, ), (1, ), device='cpu', dtype=torch.float32)
    arg5_1 = rand_strided((6, ), (1, ), device='cpu', dtype=torch.int64)
    arg6_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
    arg7_1 = rand_strided((), (), device='cpu', dtype=torch.int64)
    arg8_1 = rand_strided((2, 3, 6, 6), (108, 36, 6, 1), device='cpu', dtype=torch.float32)
    return print_performance(lambda: call([arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1, arg6_1, arg7_1, arg8_1]), times=times, repeat=repeat)

if __name__ == "__main__":
    from torch._inductor.utils import compiled_module_main
    compiled_module_main('None', benchmark_compiled_module)
```

**Test plan**
python test/test_quantization.py TestQuantizePT2EFXX86Inductor.test_inductor_qconv_lowering

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101164
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-06-01 10:22:02 +00:00
Richard Zou
fc31b3a106 Allow existing "Python RAII guards" to be used as context managers (#102579)
This PR adds a `py_context_manager_DEPRECATED` that converts a C++ RAII
guard to an object that may be used either as a Python context manager or
as a "Python RAII guard".

We don't convert all of them to Python context managers purely for BC
reasons; people in OSS and internally actually rely on these APIs and I
don't want to break them. We would be justified in breaking BC if we wanted
to, but it seemed like too much work for not a lot of gain.

The API is postfixed with "DEPRECATED" to indicate that people should
really use `py_context_manager` (converts C++ RAII guard to Python
context manager) instead.
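
A hedged illustration of the two styles, using `_AutoDispatchBelowAutograd` (the guard mentioned in the test plan) as the example:

```python
import torch

# "Python RAII guard" style: the guard stays active for the lifetime of the object
guard = torch._C._AutoDispatchBelowAutograd()
# ... work that should redispatch below autograd ...
del guard

# context-manager style made possible by this PR
with torch._C._AutoDispatchBelowAutograd():
    pass  # ... work that should redispatch below autograd ...
```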

Test Plan:
- this PR converts all PyTorch usages of _AutoDispatchBelowAutograd to the
context-manager form. I can do the rest in follow-ups.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102579
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2023-05-31 19:55:38 +00:00
Nikita Karetnikov
1e591a8b64 [pt2] add meta function for solve_triangular (#100829)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100829
Approved by: https://github.com/ezyang
2023-05-08 13:48:15 +00:00
Edward Z. Yang
54c0edf6da Track exact origin_node on best effort basis (#100110)
Currently, we track 'origins' on IR nodes so that we have some idea about what FX IR nodes contributed to any given fused kernel. However, the origins are dumped into an undifferentiated set, so if you have, e.g., multiple outputs, you cannot easily tell which output corresponds to which FX node.

This PR introduces a more precise notion of tracking "origin_node", which says that the contents of this Buffer/Loop node correspond EXACTLY to the output of a particular FX node; e.g., if you serialized each intermediate when running the generated inductor code, you could compare them with the corresponding intermediates from the original FX graph.

Tracking origin_node in all cases requires quite a bit of effort, so this PR introduces the tracking on a strictly best effort basis. The logic in torch/_inductor/graph.py sets up the associations, but only when it is "obvious" which IR node should get the assignment, and there is work in torch/_inductor/ir.py for propagating this information around as necessary. Like origins, origin_node is not a true dataclass field (as this would break all existing positional arg call sites); instead, it is added post facto via `__post_init__`. At the moment, it is only valid for Buffer/Loop to have an origin_node, but we could imagine relaxing this in the future.

The payoff is in torch/_inductor/codegen/wrapper.py and torch/_inductor/codegen/triton.py where we currently just print the FX node name and the tensor (but a more useful integration will be coming later.)

I also introduce a debugging tool `debug_ir_traceback` which tracks tracebacks of where IRNodes were allocated, to help you understand why a node doesn't have an `origin_node`.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100110
Approved by: https://github.com/voznesenskym
2023-04-28 04:15:27 +00:00
Animesh Jain
6bc4651193 [philox_rand] Dynamic shape support (#99290)
Extends the RNG functionalization work to dynamic shapes. An example of the generated graph looks like this:

~~~

[2023-04-24 21:41:37,446] torch._functorch.aot_autograd.__aot_graphs: [INFO] TRACED GRAPH
 ===== Forward graph 1 =====
 <eval_with_key>.7 class <lambda>(torch.nn.Module):
    def forward(self, arg0_1: i64[], arg1_1: i64[], arg2_1: Sym(s0), arg3_1: Sym(s1), arg4_1: f32[s0, s1]):
        # File: /scratch/anijain/work/pytorch/test/test_functionalization_of_rng_ops.py:46, code: a = torch.rand_like(x) * x
        add: i64[] = torch.ops.aten.add.Tensor(arg1_1, 0)
        philox_rand = torch.ops.rngprims.philox_rand.default([arg2_1, arg3_1], arg0_1, add, None, device(type='cuda', index=0), torch.float32);  add = None
        getitem: f32[s0, s1] = philox_rand[0]
        getitem_1: i64[] = philox_rand[1];  philox_rand = None
        add_1: i64[] = torch.ops.aten.add.Tensor(getitem_1, 0);  getitem_1 = None
        mul: f32[s0, s1] = torch.ops.aten.mul.Tensor(getitem, arg4_1);  getitem = arg4_1 = None

        # File: /scratch/anijain/work/pytorch/test/test_functionalization_of_rng_ops.py:47, code: a = torch.rand_like(x) * a
        add_2: i64[] = torch.ops.aten.add.Tensor(arg1_1, add_1)
        philox_rand_1 = torch.ops.rngprims.philox_rand.default([arg2_1, arg3_1], arg0_1, add_2, None, device(type='cuda', index=0), torch.float32);  arg2_1 = arg3_1 = arg0_1 = add_2 = None
        getitem_2: f32[s0, s1] = philox_rand_1[0]
        getitem_3: i64[] = philox_rand_1[1];  philox_rand_1 = None
        add_3: i64[] = torch.ops.aten.add.Tensor(add_1, getitem_3);  add_1 = getitem_3 = None
        mul_1: f32[s0, s1] = torch.ops.aten.mul.Tensor(getitem_2, mul);  getitem_2 = mul = None

        # No stacktrace found for following nodes
        add_4: i64[] = torch.ops.aten.add.Tensor(arg1_1, add_3);  arg1_1 = add_3 = None
        return (mul_1, add_4)

 ~~~

Each rand op is accompanied by its offset calculation op.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99290
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
2023-04-25 22:40:28 +00:00
Elias Ellison
638feec4e3 Turn on meta converter for complex (#98869)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98869
Approved by: https://github.com/ngimel
2023-04-20 16:42:38 +00:00
Animesh Jain
fdbc8625a1 Functionalization of torch.rand/rand_like ops (#97377)
This PR introduces the functionalization of RNG ops. Key points are

* Introduces a new `philox_rand` prim operator that accepts seed, offset.
* Adds decompositions for random operators that use these philox_rand prims
* Adds a PhiloxStateTracker to track the offset for each occurrence of rand ops
* Changes calling convention of AOT Autograd and adds <fwd_seed, fwd_base_offset> and <bwd_seed, bwd_base_offset>
* Monkeypatches set_rng_state and get_rng_state while AOT Autograd tracing to record the rng state behavior
* Raises an assertion for CPU because CPU does not use Philox RNG.

Not dealt with in this PR:
* dropout op - offset calculation is different
* other distributions like normal, poisson etc
* Inductor support
* Cudagraph support
* Dynamic shape support

An example
~~~

class Custom(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        a = torch.rand_like(x) * x
        a = torch.rand_like(x) * a
        return a

    @staticmethod
    def backward(ctx, grad_out):
        x, = ctx.saved_tensors
        return grad_out * torch.rand_like(grad_out) * torch.cos(x)

====== Forward graph 0 ======
def forward(self, fwd_seed_1: i64[], fwd_base_offset_1: i64[], primals_1: f32[16, 16]):
    # No stacktrace found for following nodes
    add: i64[] = torch.ops.aten.add.Tensor(fwd_base_offset_1, 0)
    philox_rand: f32[16, 16] = torch.ops.prims.philox_rand.default([16, 16], fwd_seed_1, add, [16, 1], device(type='cuda', index=0), torch.float32);  add = None
    mul: f32[16, 16] = torch.ops.aten.mul.Tensor(philox_rand, primals_1);  philox_rand = None
    add_1: i64[] = torch.ops.aten.add.Tensor(fwd_base_offset_1, 4);  fwd_base_offset_1 = None
    philox_rand_1: f32[16, 16] = torch.ops.prims.philox_rand.default([16, 16], fwd_seed_1, add_1, [16, 1], device(type='cuda', index=0), torch.float32);  fwd_seed_1 = add_1 = None
    mul_1: f32[16, 16] = torch.ops.aten.mul.Tensor(philox_rand_1, mul);  philox_rand_1 = mul = None
    return [mul_1, primals_1]

====== Backward graph 0 ======
def forward(self, bwd_seed_1: i64[], bwd_base_offset_1: i64[], primals_1: f32[16, 16], tangents_1: f32[16, 16]):
    # No stacktrace found for following nodes
    add_2: i64[] = torch.ops.aten.add.Tensor(bwd_base_offset_1, 0);  bwd_base_offset_1 = None
    philox_rand_2: f32[16, 16] = torch.ops.prims.philox_rand.default([16, 16], bwd_seed_1, add_2, [16, 1], device(type='cuda', index=0), torch.float32);  bwd_seed_1 = add_2 = None
    mul_2: f32[16, 16] = torch.ops.aten.mul.Tensor(tangents_1, philox_rand_2);  tangents_1 = philox_rand_2 = None
    cos: f32[16, 16] = torch.ops.aten.cos.default(primals_1);  primals_1 = None
    mul_3: f32[16, 16] = torch.ops.aten.mul.Tensor(mul_2, cos);  mul_2 = cos = None
    return [mul_3]

~~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97377
Approved by: https://github.com/ezyang
2023-04-16 09:55:56 +00:00
Nikita Karetnikov
9585a7ffd3 [inductor] support non-tensor ops with dynamic shapes (#97519)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97519
Approved by: https://github.com/jansel
2023-03-26 00:38:50 +00:00
Edward Z. Yang
4833e47feb Add support for nonzero, some improvements to reduce guards (#95387)
This takes the strategy described in https://docs.google.com/document/d/1lFRYAJo5nrfxRhwIzGnfi2pbLpU6T4ytSRSuLJ5qebI/edit#

It is essentially https://github.com/pytorch/pytorch/pull/95222 but squashed and with changes that are unnecessary given that we assume nonzero returns > 1.

What's in the PR:

* nonzero now supports meta propagation. When `capture_dynamic_output_shape_ops`, it will return a tensor with an unbacked SymInt representing the size in question.
* The unbacked SymInt is UNSOUNDLY assumed to be not equal to 0/1. We will still error if you guard otherwise.
* PrimTorch pointwise operators are updated to use empty_permuted, to avoid guarding on unbacked SymInt from empty_strided (tested in `test_dynamic_pointwise_scalar`)
* Convolution is updated to skip backend selection if batch is unbacked, to avoid guarding on unbacked SymInt (tested in `test_unbacked_batch_resnet`)
* I kept the helper utilities like `definitely_true` for working with possibly unbacked SymInts. They're not used right now but maybe someone will find them useful.
* Added `constrain_unify` to let you specify two unbacked SymInts must have the same value

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95387
Approved by: https://github.com/voznesenskym
2023-02-24 00:27:45 +00:00
Peter Bell
bc438af6fe std/var: support floating point correction value (#94073)
Ref https://github.com/pytorch/pytorch/issues/61492#issuecomment-1413003480

The array API specifies correction to be `Union[int, float]` while we currently only support integers.
https://data-apis.org/array-api/latest/API_specification/generated/array_api.std.html

As std/var are currently calculated, the final element count is already handled
in floating point, so we can make the correction floating point without any loss
of precision or generality.
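
Illustrative usage once this lands:

```python
import torch

x = torch.randn(100)
v_int = torch.var(x, correction=1)     # Bessel's correction, as before
v_half = torch.var(x, correction=0.5)  # fractional correction now accepted
```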

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94073
Approved by: https://github.com/ezyang
2023-02-23 05:50:45 +00:00
jjsjann123
21eb7f70f1 Nvfuser python API import fix (#94036)
1. Make the nvfuser python API import work with both devel and upstream;
2. Add an environment variable to allow a custom nvfuser code base to be built with upstream pytorch core.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94036
Approved by: https://github.com/malfet, https://github.com/davidberard98
2023-02-16 20:10:40 +00:00
Edward Z. Yang
9d5fcd37a2 sym_max/sym_min introduce guard if hinted (#94400)
This patch started with only the change in `torch/_prims_common/__init__.py`. Unfortunately, this change by itself fails tests. The reason it fails tests is sym_max produces sympy.Max expression, which impedes our ability to actually reason symbolically about the resulting expressions. We much prefer to insert a guard on `l > 1`  and get a Sympy expression without Max in it, if we can. In the upcoming unbacked SymInts PR, we can't necessarily do this, but without unbacked SymInts, we always can.

To do this, we introduce `alternate_impl_if_hinted_methods`. The idea is that if all of the arguments into max/min have hints, we will just go ahead and introduce a guard and then return one argument or the other, depending on the result. This is done by rewrapping the SymNode into SymInt/SymFloat and then running builtins.min/max, but we also could have just manually done the guarding (see also https://github.com/pytorch/pytorch/pull/94365 )

However, a very subtle problem emerges when you do this. When we do builtins min/max, we return the argument SymNode directly, without actually allocating a fresh SymNode. Suppose we do a min-max with a constant (as is the case in `sym_max(l, 1)`). This means that we can return a constant SymNode as the result of the computation. Constant SymNodes get transformed into regular integers, which then subsequently trigger the assert at https://github.com/pytorch/pytorch/pull/94400/files#diff-03557db7303b8540f095b4f0d9cd2280e1f42f534f67d8695f756ec6c02d3ec7L620

After thinking about this a bit, I think the assert is wrong. It should be OK for SymNode methods to return constants. The reason the assert was originally added was that ProxyTensorMode cannot trace a constant return. But this is fine: if you return a constant, no tracing is necessary; you know you have enough guards that it is guaranteed to be a constant no matter what the input arguments are, so you can burn it in. You might also be wondering why a change to SymNode method affects the assert from the dispatch mode dispatch: the call stack typically looks like SymNode.binary_magic_impl -> SymProxyTensorMode -> SymNode.binary_magic_impl again; so you hit the binary_magic_impl twice!

No new tests, the use of sym_max breaks preexisting tests and then the rest of the PR makes the tests pass again.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94400
Approved by: https://github.com/Chillee
2023-02-13 23:36:21 +00:00
Aaron Gokaslan
67d9790985 [BE] Apply almost all remaining flake8-comprehension checks (#94676)
Applies the remaining flake8-comprehension fixes and checks. This change replaces all remaining unnecessary generator expressions with list/dict/set comprehensions, which are more succinct, performant, and better supported by our torch.jit compiler. It also removes useless generators such as `set(a for a in b)`, resolving it into just the set call.
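
The kind of rewrite these checks enforce (illustrative):

```python
b = range(10)

s = set(a for a in b)        # unnecessary generator passed to set()
s = {a for a in b}           # equivalent set comprehension

d = dict((k, 0) for k in b)  # unnecessary generator passed to dict()
d = {k: 0 for k in b}        # equivalent dict comprehension
```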

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94676
Approved by: https://github.com/ezyang
2023-02-12 01:01:25 +00:00
Aaron Gokaslan
3ce1ebb6fb Apply some safe comprehension optimizations (#94323)
Optimize unnecessary collection cast calls, unnecessary calls to list, tuple, and dict, and simplify calls to the sorted builtin. This should strictly improve speed and improve readability.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94323
Approved by: https://github.com/albanD
2023-02-07 23:53:46 +00:00
Aaron Gokaslan
8fce9a09cd [BE]: pyupgrade Python to 3.8 - imports and object inheritance only (#94308)
Apply parts of pyupgrade to torch (starting with the safest changes).
This PR only does two things: removes the need to inherit from object and removes unused future imports.
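
Both changes, illustrated on a hypothetical module:

```python
from __future__ import unicode_literals  # unused future import: removed

class Linear(object):  # Python-2-style explicit base class: `(object)` dropped
    pass

class LinearAfter:     # Python-3 idiom after pyupgrade
    pass
```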

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94308
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-02-07 21:10:56 +00:00
Peter Bell
5817695bfa [pt2] Fix arange to match ATen behavior (#93353)
Fixes #92676

`arange` infers the output dtype from the argument types, but in order to reduce
falling back to ATen, inductor preferred to cast whole-number float arguments to
int, which gave the wrong output dtype. Instead, this decomposes floating point
arange into the prim equivalent for integers.

This also changes the signature of `prims.arange` to

```python
prims.iota(length, *, start, step, **factory_kwargs)
```

which only supports integer arguments. This is done because calculating the
output size from `start, end, step` is surprisingly complex and liable to off-by-one
errors, so it should not be duplicated in each backend.
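
A hedged sketch of this direction (illustrative only; `float_arange_sketch` is hypothetical and uses an integer `torch.arange` as a stand-in for `prims.iota`):

```python
import math
import torch

def float_arange_sketch(start, end, step, dtype=torch.float32):
    # compute the length once, in one place, instead of in every backend
    length = max(math.ceil((end - start) / step), 0)
    ramp = torch.arange(length, dtype=torch.int64)  # integer iota
    return (start + step * ramp).to(dtype)

print(float_arange_sketch(0.0, 1.0, 0.3))  # tensor([0.0000, 0.3000, 0.6000, 0.9000])
```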

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93353
Approved by: https://github.com/ngimel, https://github.com/lezcano
2023-02-03 00:44:32 +00:00
Sherlock Huang
438f12d91a Rewrite some decomps to allow producing aten ops (#93099)
This introduces a new stop to the decomposition train.
Before reaching prims.view_of, it will stop at aten.alias. The export path wants to get off the train at aten ops.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93099
Approved by: https://github.com/ngimel
2023-01-31 17:46:20 +00:00
jjsjann123
c11b301bcd [NVFUSER] refactor nvfuser build (#89621)
This PR is the first step towards refactoring the build for nvfuser, in order to make the codegen a standalone library.

Contents inside this PR:
1. nvfuser code base has been moved to `./nvfuser`, from `./torch/csrc/jit/codegen/cuda/`, except for registration code for integration (interface.h/interface.cpp)
2. splits the build system so nvfuser is generating its own `.so` files. Currently there are:
    - `libnvfuser_codegen.so`, which contains the integration, codegen and runtime system of nvfuser
    - `nvfuser.so`, which is nvfuser's python API via pybind. Python frontend is now exposed via `nvfuser._C.XXX` instead of `torch._C._nvfuser`
3. nvfuser's cpp tests are currently compiled into `nvfuser_tests`
4. cmake is refactored so that:
    - nvfuser now has its own `CMakeLists.txt`, which is under `torch/csrc/jit/codegen/cuda/`.
    - nvfuser backend code is not compiled inside `libtorch_cuda_xxx` any more
    - nvfuser is added as a subdirectory under `./CMakeLists.txt` at the very end after torch is built.
    - since nvfuser has dependency on torch, the registration of nvfuser at runtime is done via dlopen (`at::DynamicLibrary`). This avoids circular dependency in cmake, which will be a nightmare to handle. For details, look at `torch/csrc/jit/codegen/cuda/interface.cpp::LoadingNvfuserLibrary`

Future work that's scoped in following PR:
- Currently since nvfuser codegen has dependency on torch, we need to refactor that out so we can move nvfuser into a submodule and not rely on dlopen to load the library. @malfet
- Since we moved nvfuser into a cmake build, we effectively disabled bazel build for nvfuser. This could impact internal workload at Meta, so we need to put support back. cc'ing @vors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89621
Approved by: https://github.com/davidberard98
2023-01-26 02:50:44 +00:00
Peter Bell
f0b592dae7 Make masked_fill reference traceable (#90972)
As the comment states, `item()` cannot be used since you can't trace through a
scalar.
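
A hedged sketch of the traceable formulation: keep the fill value as a tensor and let `where` broadcast it, instead of extracting it with `item()`.

```python
import torch

def masked_fill_sketch(x, mask, value):
    # `value` may be a 0-d tensor; broadcasting through where keeps the graph traceable
    return torch.where(mask, value, x)

x = torch.zeros(4)
mask = torch.tensor([True, False, True, False])
print(masked_fill_sketch(x, mask, torch.tensor(1.0)))  # tensor([1., 0., 1., 0.])
```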

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90972
Approved by: https://github.com/ngimel
2023-01-18 10:54:42 +00:00
joncrall
ad782ff7df Enable xdoctest runner in CI for real this time (#83816)
Builds on #83317 and enables running the doctests. Just need to figure out what is causing the failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83816
Approved by: https://github.com/ezyang, https://github.com/malfet
2022-12-29 05:32:42 +00:00
Joel Schlosser
8b55b86dbd Move sym_int and sym_float alongside SymInt / SymFloat in base torch package (#91317)
This PR moves the definitions for:
* `sym_int`
* `sym_ceil` (used only for `sym_int`)
* `sym_floor` (used only for `sym_int`)
* `sym_float`

from `torch/fx/experimental/symbolic_shapes.py` to `torch/__init__.py`, where `SymInt` and `SymFloat` are already defined.

This removes the need for several in-line imports, and enables proper JIT script gating for #91318. I'm very open to doing this in a better way!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91317
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2022-12-28 16:08:16 +00:00
Brian Hirsh
c47bdd7522 *_scatter ops should preserve input stride/storage_offset (#91029)
It turns out that we *do* need to update *_scatter ops to return the exact same strides as their inputs. I added a test to `test/test_functionalization.py`, which now trips thanks to Ed's functionalization stride debugging check. It only actually ends up causing silent correctness problems if you try to .backward() on that function.
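
A hedged illustration of the invariant (the stride expectation reflects this PR's intended behavior, not a pre-existing guarantee):

```python
import torch

base = torch.randn(4, 8).transpose(0, 1)   # shape (8, 4), non-contiguous input
src = torch.randn(2, 4)
out = torch.slice_scatter(base, src, dim=0, start=0, end=2)
print(out.stride(), base.stride())          # expected to match once this lands
```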

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91029
Approved by: https://github.com/ezyang
2022-12-22 19:41:53 +00:00
Nikita Shulga
fd3a7264ae [MPS] Add group_norm[fwd+backward] and mean_var (take 2) (#91190)
Use Prims to implement group_norm, group_norm_backward and mean_var

Use `torch._ops.ops` instead of `torch.ops` in numerous subpackages in
order to make them importable from `torch/backends/mps/__init__.py`, as this alias is defined in
15af4b1cee/torch/__init__.py (L1095), which
is executed last during the init process.

Add `__all__` to `torch/backends/mps/__init__.py` as well as alias all imports as private

Add `TestNNMPS.test_group_norm_backward` that validates no NaNs are generated during the backward pass

Fixes https://github.com/pytorch/pytorch/issues/88331
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91190
Approved by: https://github.com/albanD
2022-12-22 08:54:37 +00:00
PyTorch MergeBot
645eda0a00 Revert "[MPS] Add group_norm[fwd+backward] and mean_var (#91190)"
This reverts commit 371716eb36.

Reverted https://github.com/pytorch/pytorch/pull/91190 on behalf of https://github.com/kit1980 due to Broke test_correct_module_names because of underscore _ops
2022-12-21 19:37:43 +00:00