Commit Graph

76242 Commits

Author SHA1 Message Date
PyTorch MergeBot
e448f32944 Revert "[BE] typing for decorators - signal/windows/windows (#131582)"
This reverts commit 8689d377f9.

Reverted https://github.com/pytorch/pytorch/pull/131582 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))
2024-07-28 03:29:31 +00:00
PyTorch MergeBot
d90f6b45c0 Revert "[inductor] Add type hints to functions in mkldnn_fusion.py (#131820)"
This reverts commit fb3ddafbcf.

Reverted https://github.com/pytorch/pytorch/pull/131820 on behalf of https://github.com/clee2000 due to reverting this to revert something else, only action you should need to do is to rebase and merge again, sorry for the churn ([comment](https://github.com/pytorch/pytorch/pull/131820#issuecomment-2254327833))
2024-07-28 03:26:14 +00:00
PyTorch MergeBot
8f5cf46405 Revert "Fix public API tests (#131386)"
This reverts commit 91fcfd8760.

Reverted https://github.com/pytorch/pytorch/pull/131386 on behalf of https://github.com/clee2000 due to reverting this to revert something else, only action you should need to do is to rebase and merge again, sorry for the churn ([comment](https://github.com/pytorch/pytorch/pull/131386#issuecomment-2254327487))
2024-07-28 03:23:04 +00:00
cyy
7be0ce51b6 Fix handle serialization error (#131871)
It is a bug to try to serialize a std::string through the C API.
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131871
Approved by: https://github.com/Skylion007
2024-07-28 00:33:20 +00:00
Aaron Orenstein
3e0ccb3a9f Fixing fake tensor SymInt caching (#131966)
Summary: Some tests are failing because of a weird interaction between the symbolic sizes and the `set()` - back it out for now.

Differential Revision: D60320595

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131966
Approved by: https://github.com/oulgen
2024-07-27 22:43:57 +00:00
Shuo Ding
d07a125af2 [Inductor] supporting pointwise intermediate nodes in B2B-GEMM (#131685)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131685
Approved by: https://github.com/eellison
2024-07-27 20:11:20 +00:00
Xuehai Pan
14158d892a [BE][tests] show local variables on failure in tests (#131151)
------

As per the title, add the `--locals` argument for `unittest` and `--showlocals --tb=long` for `pytest` in CI.

Some failures cannot be reproduced on a local machine but exist on cloud CI. This change allows us to investigate such test failures more easily.
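For illustration, a minimal sketch of what the `unittest` flag changes when run locally (the test case and names below are made up; the actual CI wiring lives in the test scripts):

```python
# Illustrative only: a deliberately failing test showing that `--locals`
# prints each frame's local variables in the failure traceback.
import unittest

class ExampleTest(unittest.TestCase):
    def test_sum(self):
        expected = 6
        actual = sum([1, 2, 4])  # wrong on purpose
        self.assertEqual(actual, expected)

if __name__ == "__main__":
    # Equivalent to `python -m unittest --locals`; the failure traceback now
    # includes `expected` and `actual` next to the assertion error.
    unittest.main(argv=["prog", "--locals"], exit=False)
```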

Example output: https://github.com/pytorch/pytorch/actions/runs/9961546996/job/27523888353?pr=130710#step:20:3361

```text
/opt/conda/envs/py_3.8/lib/python3.8/site-packages/sympy/core/function.py:307:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

cls = FloorDiv, base = -1.00000000000000, divisor = -1.00000000000000

    @classmethod
    def eval(cls, base, divisor):
        # python test/test_dynamic_shapes.py -k TestDimConstraints.test_dim_constraints_solve_full
        # Assert triggered by inequality solver
        # assert base.is_integer, base
        # assert divisor.is_integer, divisor

        # We don't provide the same error message as in Python because SymPy
        # makes it difficult to check the types.
        if divisor.is_zero:
            raise ZeroDivisionError("division by zero")
        if base in (int_oo, -int_oo, sympy.oo, -sympy.oo) and divisor in (
            int_oo,
            -int_oo,
            sympy.oo,
            -sympy.oo,
        ):
            return sympy.nan
        if base is sympy.nan or divisor is sympy.nan:
            return sympy.nan

        if base.is_zero:
            return sympy.S.Zero
        if base.is_integer and divisor == 1:
            return base
        if base.is_integer and divisor == -1:
            return sympy.Mul(base, -1)
        if (
            isinstance(base, sympy.Number)
            and isinstance(divisor, sympy.Number)
            and (
                base in (int_oo, -int_oo, sympy.oo, -sympy.oo)
                or divisor in (int_oo, -int_oo, sympy.oo, -sympy.oo)
            )
        ):
            r = float(base) / float(divisor)
            if r == math.inf:
                return int_oo
            elif r == -math.inf:
                return -int_oo
            elif math.isnan(r):
                return sympy.nan
            else:
                return sympy.Integer(math.floor(r))
        if isinstance(base, sympy.Integer) and isinstance(divisor, sympy.Integer):
            return sympy.Integer(int(base) // int(divisor))
        if isinstance(base, FloorDiv):
            return FloorDiv(base.args[0], base.args[1] * divisor)

        # Expands (x + y) // b into x // b + y // b.
        # This only works if floor is an identity, i.e. x / b is an integer.
        for term in sympy.Add.make_args(base):
            quotient = term / divisor
            if quotient.is_integer and isinstance(divisor, sympy.Integer):
                # NB: this is correct even if the divisor is not an integer, but it
                # creates rational expressions that cause problems with dynamic
                # shapes.
                return FloorDiv(base - term, divisor) + quotient

        try:
            gcd = sympy.gcd(base, divisor)
            if gcd != 1:
>               return FloorDiv(
                    sympy.simplify(base / gcd), sympy.simplify(divisor / gcd)
                )

base       = -1.00000000000000
cls        = FloorDiv
divisor    = -1.00000000000000
gcd        = 1.00000000000000
quotient   = 1.00000000000000
term       = -1.00000000000000

/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/utils/_sympy/functions.py:159:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

args = (FloorDiv, -1.00000000000000, -1.00000000000000), kwargs = {}

    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
>           retval = cfunc(*args, **kwargs)
E           RecursionError: maximum recursion depth exceeded in comparison
E
E           To execute this test, run the following from the base repo dir:
E               python test/test_sympy_utils.py -k TestValueRanges.test_binary_ref_fn_floordiv_dtype_float
E
E           This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

args       = (FloorDiv, -1.00000000000000, -1.00000000000000)
cfunc      = <functools._lru_cache_wrapper object at 0x7fc5303173a0>
func       = <function Function.__new__ at 0x7fc530317280>
kwargs     = {}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131151
Approved by: https://github.com/ezyang
2024-07-27 19:39:40 +00:00
albanD
466ea8ce54 Add fallback() to torch.library (#131707)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131707
Approved by: https://github.com/zou3519
2024-07-27 18:02:35 +00:00
cyy
8e5a367311 [5/N] Fix clang-tidy warnings in jit (#131969)
Follows #131903
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131969
Approved by: https://github.com/ezyang
2024-07-27 17:54:20 +00:00
Xuehai Pan
918ece4f4d [BE][Easy][11/19] enforce style for empty lines in import segments in test/dy*/ (#129762)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by the linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129762
Approved by: https://github.com/anijain2305
2024-07-27 17:43:53 +00:00
Angela Yi
ae9f17a821 [aoti] Rename OSS DynamicArg and OpKernel (#131862)
Summary: Fixes P1495466240, which I think is due to the internal codebase also having an "OpKernel" in the same namespace that uses Thrift instead of JSON.

Test Plan: https://www.internalfb.com/intern/testinfra/testrun/4785074844896831

Differential Revision: D60273354

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131862
Approved by: https://github.com/desertfire
2024-07-27 17:34:50 +00:00
PyTorch MergeBot
8cdfdb41bc Revert "[NestedTensor] Integrate the layer normalization operator along the jagged dimension into NestedTensor (#131519)"
This reverts commit f862f45730.

Reverted https://github.com/pytorch/pytorch/pull/131519 on behalf of https://github.com/atalman due to broke CI: test_nestedtensor.py::TestNestedTensorSubclassCPU::test_layer_norm_with_lengths_requires_grad_False_components_require_grad_False_cpu_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10121747545/job/27996722731) [HUD commit link](f862f45730) ([comment](https://github.com/pytorch/pytorch/pull/131519#issuecomment-2254167994))
2024-07-27 14:45:47 +00:00
Nikita Shulga
07389163f0 [C10][BE] Use range loop (#131922)
Non-functional change that uses a range loop to iterate over entries in `getCollectiveTraceJson` and relies on `C10_UNUSED` rather than the `(void)i;` trick.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131922
Approved by: https://github.com/XilunWu
2024-07-27 11:26:27 +00:00
cyy
f83ef69b84 Fix typo in assignment operators (#131890)
Most typos were introduced in #131077
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131890
Approved by: https://github.com/Skylion007
2024-07-27 11:13:42 +00:00
cyy
c82441e07a Fix std::optional checking bug (#131874)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131874
Approved by: https://github.com/Skylion007
2024-07-27 11:08:10 +00:00
Yifu Wang
93a4671746 Add out_dtypes to fused_all_gather_scaled_matmul's args (#131831)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131831
Approved by: https://github.com/weifengpy
ghstack dependencies: #131410
2024-07-27 11:07:43 +00:00
Yifu Wang
12cd040edd [micro_pipeline_tp] exclude simple overlappable collectives as micro-pipeline TP candidates when reorder_for_compute_comm_overlap is enabled (#131410)
When a collective can be hidden through either simple overlapping or micro-pipeline TP, we prefer simple overlapping to avoid the overhead associated with decomposition. If `reorder_for_compute_comm_overlap` is enabled, we identify collectives that can be hidden through simple overlapping and exclude them from micro-pipeline TP candidates.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131410
Approved by: https://github.com/weifengpy
2024-07-27 11:07:43 +00:00
Animesh Jain
36d24925c6 [inline_inbuilt_nn_modules][inductor-cpu] More skips for dynamic shapes when inlining enabled (#131948)
The issue is tracked here - https://github.com/pytorch/pytorch/issues/131929

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131948
Approved by: https://github.com/eellison, https://github.com/leslie-fang-intel
ghstack dependencies: #131744, #131928
2024-07-27 10:03:49 +00:00
Will Feng
aee6bcdba4 [Traceable FSDP2][Inductor] Apply compute/comm reordering passes to achieve overlap (#131614)
This PR applies the Inductor compute/comm reordering passes to Traceable FSDP2 to achieve overlap. Note that the overlap is not yet maximally optimized; follow-up work will be done in subsequent PRs.

Test commands:
- `pytest -rA  test/distributed/test_compute_comm_reordering.py::TestComputeCommReorderingMultiProc`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131614
Approved by: https://github.com/yifuwang
ghstack dependencies: #131510
2024-07-27 08:39:58 +00:00
Will Feng
9e06572704 [Traceable FSDP2][Inductor] Create grouped nodes for FSDP2 all-gather code block and reduce-scatter code block (after Buffer/Operation split) (#131510)
This PR creates these `GroupedSchedulerNode`s:
- One for each all-gather code block (cast + copy-in + all-gather)
- One for each all-gather-wait code block (all-gather-wait + copy-out)
- One for each reduce-scatter code block (copy-in + reduce-scatter)
- One for each reduce-scatter-wait code block (reduce-scatter-wait)

This serves two goals:
- Prevent outside ops from being fused into these op groups, in order to have more predictable memory usage.
- Make it easier to specify dependencies, e.g. from the `i+1`-th all-gather group node to the `i`-th all-gather-wait group node, to enforce FSDP2 comm ordering (i.e. "serialization of comms").

The actual "reorder-for-FSDP-compute-comm-overlap" PR will come next.

Test commands:
- `pytest -rA  test/distributed/test_compute_comm_reordering.py::TestComputeCommReorderingMultiProc`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131510
Approved by: https://github.com/yifuwang
2024-07-27 08:39:58 +00:00
cyy
99e13e68e9 [4/N] Fix clang-tidy warnings in jit (#131903)
Follows #131830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131903
Approved by: https://github.com/Skylion007
2024-07-27 08:08:14 +00:00
Janani Sriram
f862f45730 [NestedTensor] Integrate the layer normalization operator along the jagged dimension into NestedTensor (#131519)
Modify the existing `layer normalization` operator in PyTorch, invoked by `torch.layer_norm`, to allow for reductions along the jagged dimension of a nested tensor. The function originally had a basic implementation for reducing along 1 non-ragged dimension. This diff, which uses the `aten` padding operator, enables PyTorch users to invoke `torch.nn.functional.layer_norm` on a nested tensor when reducing along the ragged dimension, e.g. `*` in a `(B, *, M)` or `(B, *, M, N)` nested tensor.

Write unit tests based on the `softmax` jagged operator to verify the accuracy of the ragged reduction implementation for `torch.nn.functional.layer_norm`. Add unit tests to verify error handling for unsupported features.

Note that this implementation is limited to nested tensors with `ragged_idx == 1`, i.e. the ragged dimension is not transposed. The layer normalization operator also operates on a 2-dimensional input; for nested tensors with 4 or more dimensions, I flatten the extra dimensions, then unflatten them after performing layer normalization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131519
Approved by: https://github.com/davidberard98
ghstack dependencies: #131518
2024-07-27 07:09:10 +00:00
Janani Sriram
bcf5c68c18 [NestedTensor] Integrate the softmax operator along the jagged dimension into NestedTensor (#131518)
Modify the existing `softmax` operator in PyTorch, invoked by `torch.softmax`, to allow for reductions along the jagged dimension of a nested tensor. The function originally had a basic implementation for reducing along 1 non-ragged dimension. This diff, which uses the aten padding operator, enables PyTorch users to invoke `torch.softmax` on a nested tensor when reducing along the ragged dimension, e.g. `*` in a `(B, *, M)` nested tensor.

Write unit tests based on the `sum` and `mean` jagged operators to verify the accuracy of the ragged reduction implementation for `torch.softmax`. Add unit tests to verify error handling for unsupported features in `NestedTensor` `torch.softmax`.

Note that this implementation is limited to nested tensors with `ragged_idx == 1`, i.e. the ragged dimension is not transposed. In addition, the `softmax` operator requires an integer reduction dimension `dim` as input, which motivated new unit tests heavily inspired by the `sum` and `mean` jagged operator unit tests. `Softmax` also allows reducing along the batch dimension.
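A minimal usage sketch, assuming a build that includes this change; the shapes and values are illustrative:

```python
import torch

# Two ragged "rows" of lengths 3 and 5, each with M = 4 features: a (B, *, M)
# jagged-layout nested tensor.
nt = torch.nested.nested_tensor(
    [torch.randn(3, 4), torch.randn(5, 4)],
    layout=torch.jagged,
)
# Softmax along the ragged dimension (dim=1), which this change enables.
out = torch.softmax(nt, dim=1)
print(out.is_nested, out.shape)
```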
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131518
Approved by: https://github.com/davidberard98
2024-07-27 07:09:10 +00:00
Avik Chaudhuri
c49e857d32 [pt] immutable accessors in graph signature (#131940)
Summary: Splitting out the PT part of D60253955.

Test Plan: existing tests

Differential Revision: D60296909

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131940
Approved by: https://github.com/angelayi, https://github.com/zhxchen17
2024-07-27 05:32:53 +00:00
Oguz Ulgen
96c1862e0b Remove mypy ignore from torch/_dynamo/variables/__init__.py (#131784)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131784
Approved by: https://github.com/aorenste, https://github.com/zou3519, https://github.com/Skylion007
2024-07-27 05:07:33 +00:00
drisspg
1bfe7eb7e6 Update how we do sdpa testing (#131743)
## Motivation

This refactor aligns our testing methodology with the Flash Attention upstream repository while addressing several key issues:

1. **Standardized comparison**: We now compare fused kernels against float64 references, using the maximum of a calculated tolerance (based on same-precision math implementation) or standard float32 `atol`.

2. **Reduced redundancy**: Utilizing the same tensors for both same-precision math and fused kernel runs eliminates duplication.

3. **Improved maintainability**: The new approach simplifies tolerance adjustments across all affected tests.

4. **Consistency**: Standardizing tensor comparisons ensures a more uniform and reliable testing suite.

These changes collectively simplify our testing code, improve its maintainability, and provide a more robust framework for validating our attention mechanisms.
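A hedged sketch of the comparison idea in point 1; the function and parameter names are illustrative, not the actual test-suite API:

```python
import torch

def check_fused_against_refs(fused_out, lp_ref, fp64_ref,
                             default_atol=1e-5, default_rtol=1.3e-6):
    # Tolerance "earned" by the same-precision math reference: how far it
    # already strays from the float64 reference.
    ref_err = (lp_ref.double() - fp64_ref.double()).abs().max().item()
    atol = max(2 * ref_err, default_atol)
    # Compare the fused kernel output against the float64 reference using the
    # larger of the calculated tolerance and a standard float32 atol.
    torch.testing.assert_close(
        fused_out.double(), fp64_ref.double(), atol=atol, rtol=default_rtol
    )
```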

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131743
Approved by: https://github.com/jainapurva, https://github.com/jbschlosser
2024-07-27 03:58:49 +00:00
Vishwa Raj Singh
bcdba9f91d Added hpu backend support in fsdp utils (#127757)
In FSDP's init_utils, add support for the HPU backend device in the `_get_device` API.

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127757
Approved by: https://github.com/wconstab, https://github.com/jgong5, https://github.com/awgu
2024-07-27 03:30:59 +00:00
Xu Han
28fd2e905d [inductor] enhance cpp_builder lint check. (#131752)
Enhance the cpp_builder `mypy` check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131752
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-27 02:46:27 +00:00
Xu Han
a90b8b967a [inductor] enable windows inductor UTs (#131767)
Changes:
1. Add a `skipIfWindows` function (see the sketch after this list).
2. Fix `fresh_inductor_cache` raising an error on Windows because loaded modules cannot be deleted.
3. Disable some UTs that do not pass on Windows.
4. Enable test_torchinductor in Windows CI.
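
A minimal sketch of what such a helper could look like; the actual helper added to the test internals may differ:

```python
import sys
import unittest

def skipIfWindows(reason="does not run on Windows"):
    # Skip the decorated test when running under a Windows interpreter.
    return unittest.skipIf(sys.platform == "win32", reason)

class ExampleTest(unittest.TestCase):
    @skipIfWindows("temp files of loaded modules cannot be deleted on Windows")
    def test_posix_only(self):
        self.assertTrue(True)
```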

I have verified that the tests pass on my dev machine:
<img width="864" alt="image" src="https://github.com/user-attachments/assets/91d5a62f-7383-44b3-b614-99940f196fdb">

TODO: review and fix the skipped cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131767
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-27 02:46:03 +00:00
Avik Chaudhuri
3768faec2f carry cond in data-dependent error (#131932)
Test Plan: existing

Differential Revision: D60302877

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131932
Approved by: https://github.com/zhxchen17
2024-07-27 02:13:04 +00:00
Xu Han
9606d61e0c [reland][inductor] switch AotCodeCompiler to new cpp_builder (#130127)
Changes:
1. Switch `AotCodeCompiler` to the new cpp_builder.
2. Only use `deprecated_cpp_compile_command` for `fb_code`, since I no longer have access to the Meta internal environment and cannot debug there.
3. Add `TODO` comments asking a Meta employee to help continue this work.
4. Given item 3, only the `fb_code` use of `deprecated_cpp_compile_command` remains to be fixed, so remove `validate_new_cpp_commands`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130127
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-27 01:46:13 +00:00
Matthew Hoffman
fdf1451bfa Add __all__ to torch.optim to define public interface (#131959)
There was a regression in the public interface for `torch.optim` introduced in #125452 when `torch/optim/__init__.pyi` was merged into `torch/optim/__init__.py`. [The import aliases were not preserved and so now `pyright` thinks that these classes are not publicly exported from `torch/optim/__init__.py`.](https://github.com/pytorch/pytorch/pull/125452/files#diff-941595c1e1aa06bec94578499dd3654532a5183d0bc1bcd94d1f33b47e0d0adfL1-L15)

```
error: "SGD" is not exported from module "torch.optim"
```

Adding these classes/modules to `__all__` fixes this.
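
A simplified sketch of the pattern, not the actual contents of `torch/optim/__init__.py`:

```python
# Names listed in __all__ are treated as intentional re-exports by type
# checkers such as pyright, so `from torch.optim import SGD` is accepted again.
from torch.optim.adam import Adam
from torch.optim.sgd import SGD

__all__ = ["Adam", "SGD"]
```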

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131959
Approved by: https://github.com/ezyang
2024-07-27 01:03:25 +00:00
Sergii Dymchenko
8458980bbf Move benchmarks/dynamo/huggingface configuration to YAML (#131724)
Similar to https://github.com/pytorch/pytorch/pull/120299

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131724
Approved by: https://github.com/shunting314
2024-07-27 00:55:04 +00:00
Zain Rizvi
ef8d118c67 Sync with changes to test-infra's scale-config.yml (#131955)
This synchronizes lf-canary-scale-config and lf-scale-config with the one in test-infra.

This really needs some automatic validation to prevent it from drifting out of sync over and over again (coming soon...)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131955
Approved by: https://github.com/malfet
2024-07-27 00:25:40 +00:00
Nikita Shulga
8b04edcac1 Delete unused yml files (#131298)
To be landed at least 3 days after the previous commit
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131298
Approved by: https://github.com/ZainRizvi
ghstack dependencies: #130762
2024-07-27 00:21:22 +00:00
Zain Rizvi
1e00f055a4 Move distributed experimental jobs back to the amazon2 for now (#131963)
Something about the new Amazon2023 AMI is making some distributed tests fail. Moving them back to the old AMI until the issue is fixed

These particular jobs are causing this test to fail:
https://github.com/pytorch/pytorch/issues/129539

More details in https://github.com/pytorch/pytorch/issues/131962
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131963
Approved by: https://github.com/clee2000
2024-07-26 23:44:56 +00:00
Joel Schlosser
91fcfd8760 Fix public API tests (#131386)
This PR fixes a bug in `test_correct_module_names` introduced in #130497. It also addresses post-fix test failures in:
* `torch/ao/quantization/__init__.py` - set the correct `__module__` for several public API helpers (see the sketch after this list)
* `torch/library.py` - add `register_vmap` to `__all__`
* `torch/nn/attention/flex_attention.py` - make `round_up_to_multiple` private by prepending an underscore
* `torch/storage.py` - introduce `__all__` to avoid `Self` being re-exported as a public API
* `torch/distributed/pipelining/schedules.py` - add `ZeroBubbleAlgorithm` to `__all__`
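
For the first bullet, an illustrative pattern; the helper name is hypothetical, not one of the actual affected functions:

```python
def some_public_helper():
    """A helper that is re-exported as part of the public API."""

# Report the public package rather than the private defining submodule so the
# public API tests attribute the helper to the intended module.
some_public_helper.__module__ = "torch.ao.quantization"
```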

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131386
Approved by: https://github.com/albanD
2024-07-26 23:38:43 +00:00
Shangdi Yu
02b922900b [aoti] Fix float16 and bfloat16 for generated GPU code (#131437)
Fixes #131333

Summary:
- Add a header to define `float16` and `bfloat16` as `at::Half` and `at::BFloat16`.
- Change `float16` and `bfloat16` to `float` before passing to the kernel.

code generated before:
```cpp
.....
    half var_1;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_item_float16(convert_arrayref_tensor_to_tensor(arg1_1), &var_1));
....
```

code generated now:
```cpp
typedef at::Half half;
typedef at::BFloat16 bfloat16;
.....
    half var_1_tmp;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_item_float16(convert_arrayref_tensor_to_tensor(arg1_1), &var_1_tmp));
    float var_1 = float(var_1_tmp);
....
```

Test plan: `TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k GPUTests.test_unspec_inputs_cuda`
Work in progress.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131437
Approved by: https://github.com/desertfire
2024-07-26 23:36:11 +00:00
Bin Bao
0272934238 [Inductor][CPU] Fix an InvalidVecISA issue on CI (#131812)
Summary: CPU CI nodes failed to find a valid VecISA because importing torch under the default pytorch directory fails with the following message, so switch the cwd to a tmp directory (see the sketch after the traceback below).

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/var/lib/jenkins/workspace/torch/__init__.py", line 66, in <module>
    from torch.torch_version import __version__ as __version__
  File "/var/lib/jenkins/workspace/torch/torch_version.py", line 4, in <module>
    from torch.version import __version__ as internal_version
ModuleNotFoundError: No module named 'torch.version'
```
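
A hedged sketch of the workaround; the function name is illustrative, and the real probe lives in Inductor's VecISA checking code:

```python
import subprocess
import sys
import tempfile

def torch_imports_cleanly() -> bool:
    # Run the import probe from a temporary cwd so the checkout's own `torch/`
    # source directory does not shadow the installed package.
    with tempfile.TemporaryDirectory() as tmp:
        proc = subprocess.run(
            [sys.executable, "-c", "import torch"],
            cwd=tmp,
            capture_output=True,
        )
    return proc.returncode == 0
```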

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131812
Approved by: https://github.com/eellison, https://github.com/malfet
2024-07-26 22:31:44 +00:00
Sergii Dymchenko
5489ff8e94 Use Mermaid for the diagram in torch/ao/quantization/fx/README.md (#131412)
preview 3a0efcdfa3/torch/ao/quantization/fx/README.md
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131412
Approved by: https://github.com/jerryzh168
2024-07-26 22:01:21 +00:00
Peter Bell
16cd1aaa1d [inductor] Improve sort kernel perf (#131719)
Closes #129507

This makes two changes to the sort kernel:
1. Use int16 for the indices since we only operate on small dims anyway
2. Instead of passing an explicit mask, pass rnumel and derive the mask from it,
   which saves an additional reduction in the sort kernel's inner loop (see the
   sketch below).

In my benchmarks, this gives enough of a perf improvement to bump up the
max rblock to 512.
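
A hedged sketch of the masking idea only (point 2); the kernel name and signature are illustrative rather than the actual Inductor-generated code, and the sort itself is omitted:

```python
import triton
import triton.language as tl

@triton.jit
def masked_row_kernel(in_ptr, out_ptr, rnumel, RBLOCK: tl.constexpr):
    rindex = tl.arange(0, RBLOCK)
    # The mask is derived from rnumel inside the kernel instead of being
    # passed in explicitly.
    rmask = rindex < rnumel
    vals = tl.load(in_ptr + rindex, mask=rmask, other=float("inf"))
    # ... the sorting of `vals` (with int16 indices) would happen here ...
    tl.store(out_ptr + rindex, vals, mask=rmask)
```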

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131719
Approved by: https://github.com/eellison
2024-07-26 21:56:47 +00:00
Luca Wehrstedt
b90bc66766 Enable FlashAttention on Windows (#131906)
Let's just give this a try.

Reland of https://github.com/pytorch/pytorch/pull/131875.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131906
Approved by: https://github.com/drisspg
2024-07-26 21:41:56 +00:00
rzou
d73b55d64b Support meta tensors as inputs to the triton_kernel_wrapper HOPs (#131896)
We automatically generate FakeTensor support for them (the FakeTensor
kernel for a triton kernel is "return None"). The same thing should
apply to the meta kernel.

Tests:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131896
Approved by: https://github.com/oulgen
2024-07-26 21:41:03 +00:00
Animesh Jain
fb98cd33f1 [inline_inbuilt_nn_modules][inductor-cpu] Skip test_quantized_linear_amx (#131928)
The issue is tracked here - https://github.com/pytorch/pytorch/issues/131929

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131928
Approved by: https://github.com/eellison
ghstack dependencies: #131744
2024-07-26 21:28:17 +00:00
Shunting Zhang
c8626a4e1f [BE] add a list of inductor test files to skip resetting dynamo (#131551)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131551
Approved by: https://github.com/zou3519
2024-07-26 21:08:15 +00:00
Catherine Lee
fde577702d [TD] More synonyms for filepath (#131838)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131838
Approved by: https://github.com/PaliC, https://github.com/ZainRizvi
2024-07-26 21:02:42 +00:00
Zain Rizvi
1bda3a3135 Migrate nightly.yml workflow & docs to Amazon 2023 (#131821)
A continuation of the migration started in
- https://github.com/pytorch/pytorch/pull/131250

Migrates nightly jobs and the linux-docs job in pull.yml

To preserve reusability, I'm switching to a new format here that allows one to specify only the runner prefix instead of the full runner name, letting multiple jobs continue to use the same base runner type as they did before.

**Validation:**
- Nightly builds passed in the prev commit: https://github.com/pytorch/pytorch/actions/runs/10102118461/job/27937632823?pr=131821
- Latest commit only updated the docs job in pull.yml, and that has already passed: https://github.com/pytorch/pytorch/actions/runs/10114635537/job/27974392472?pr=131821

The other in-progress jobs are irrelevant
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131821
Approved by: https://github.com/atalman, https://github.com/seemethere
2024-07-26 20:54:43 +00:00
James Wu
0e6df1e0fb Disable remote cache on test (#131908)
Summary: Fixes test internally

Test Plan:
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:cudagraph_trees -- --exact 'caffe2/test/inductor:cudagraph_trees - test_cache_hit_forward_miss_backward (caffe2.test.inductor.test_cudagraph_trees.CudaGraphTreeTests)'

Passes

Differential Revision: D60293177

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131908
Approved by: https://github.com/clee2000
2024-07-26 20:19:02 +00:00
Brian Hirsh
071ac38141 fast-path FakeTensor detach (#131899)
Fixes https://github.com/pytorch/pytorch/issues/128281, see investigation at https://github.com/pytorch/pytorch/issues/128281#issuecomment-2252976926.

benchmark:
```
python benchmarks/dynamo/huggingface.py --performance --timing --explain --backend aot_eager --device cuda --training --float32 --only BertForMaskedLM
```

time before:
```
TIMING: entire_frame_compile:30.85435 backend_compile:23.98599 total_wall_time:30.85435
```

time after:
```
TIMING: entire_frame_compile:24.35898 backend_compile:18.15235 total_wall_time:24.35898
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131899
Approved by: https://github.com/ezyang, https://github.com/zou3519, https://github.com/albanD
2024-07-26 20:16:08 +00:00
Catherine Lee
2ec8312a28 Add rerun_disabled_tests for inductor (#131681)
Test in prod?

This also turns on the mem leak check.

Briefly checked that
```
 python3 ".github/scripts/filter_test_configs.py" \
    --workflow "inductor" \
    --job-name "cuda12.1-py3.10-gcc9-sm86 / build" \
    --test-matrix "{ include: [
    { config: "inductor", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_distributed", shard: 1, num_shards: 1, runner: "linux.g5.12xlarge.nvidia.gpu" },
    { config: "inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "dynamic_inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "dynamic_inductor_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "dynamic_inductor_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "dynamic_inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "dynamic_inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "aot_inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "aot_inductor_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "aot_inductor_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "aot_inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "aot_inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_cpp_wrapper_abi_compatible", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
  ]}
  " \
    --selected-test-configs "" \
    --pr-number "${PR_NUMBER}" \
    --tag "${TAG}" \
    --event-name "schedule" \
    --schedule "29 8 * * *" \
    --branch "${HEAD_BRANCH}"
```
has the rerun-disabled-tests option in the test matrix

I don't think all these things need to run but I'm not sure which ones (probably just inductor?)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131681
Approved by: https://github.com/zou3519
2024-07-26 20:05:24 +00:00