449 Commits

Author SHA1 Message Date
Jiang, Yanbing
3fabd10c40 Add AOT inductor support for _scaled_mm for CPU (#141961)
This PR is to add AOT inductor support for _scaled_mm for CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141961
Approved by: https://github.com/malfet
ghstack dependencies: #139975
2024-12-28 05:57:35 +00:00
Animesh Jain
969415885d [inductor][invoke_subgraph] Support None/int as input/output of invoke_subgraph (#139373)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139373
Approved by: https://github.com/eellison
2024-12-27 06:46:09 +00:00
Michael Lazos
8960cb5809 Add support for bfloat16 atomic adds in fbcode (#143629)
Reland https://github.com/pytorch/pytorch/pull/141857 and fall back on A100, which doesn't have bfloat16 atomic add instructions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143629
Approved by: https://github.com/eellison
2024-12-20 23:05:13 +00:00
Michael Lazos
b4e0e3bfa3 Backout D66648013 (#143433)
Summary:
backing out https://www.internalfb.com/diff/D66648013 (see comments there for justification)

I will reland and disallow the bfloat16 atomics behavior on A100 because it causes a pretty significant performance regression.

Test Plan: This is a revert

Differential Revision: D67357485

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143433
Approved by: https://github.com/davidberard98
2024-12-19 00:53:49 +00:00
Rachel Guo
9275091d6e [provenance_tracking] Dump inductor_triton_kernel_to_post_grad_nodes.json info in debug_trace (#143055)
Summary:
This diff mainly adds code changes to dump `inductor_triton_kernel_to_post_grad_nodes.json` artifact which contains mapping info from post_grad -> inductor kernel code:
`{"inductor_triton_kernel_name": [post_grad_node_0, post_grad_node_1, ..., ], "..."}.`
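
As a rough illustration of the artifact's shape, the mapping could be built, serialized, and queried as sketched below. The kernel and node names are made up for the example and are not taken from a real trace; `kernels_for_node` is a hypothetical helper, not part of inductor.

```python
import json

# Hypothetical kernel and node names, for illustration only.
kernel_to_post_grad_nodes = {
    "triton_poi_fused_add_0": ["add", "relu"],
    "triton_red_fused_sum_1": ["sum_1"],
}

# Serialize in the same "kernel name -> list of post_grad nodes" shape.
artifact_text = json.dumps(kernel_to_post_grad_nodes, indent=2)


def kernels_for_node(mapping, node_name):
    """Invert the mapping: find the kernels that cover a given post_grad node."""
    return [kernel for kernel, nodes in mapping.items() if node_name in nodes]


print(kernels_for_node(json.loads(artifact_text), "relu"))
```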

Example paste: P1695235000 verified on the test model.  See "Test Plan":

We use this artifact to demonstrate provenance tracking in the frontend 3-tab highlighter tool:
https://github.com/YUNQIUGUO/compiler_explorer (copy/pasted the input files for demo purpose for now and will integrate with Shangdi's tool to 4-tab)

https://pxl.cl/66BzK

Note: Currently this only supports mapping for inductor's `TritonKernel` type. TODO: add support for `ExternKernel` and other inductor-generated kernel types.

Test Plan:
test_model_coverage.sh:
```
#!/bin/sh
MODEL_ENTITY_ID=644688112
SNAPSHOT_ID=32
MODULE=merge

# buck2 build --show-output mode/opt -c=python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.split-dwarf=true -c fbcode.nvcc_arch=a100,h100 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark

TORCH_COMPILE_DEBUG=1 CUDA_VISIBLE_DEVICES=0 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCH_LOGS="+inductor, schedule, fusion, output_code" TORCH_TRACE="tmp/guorachel_tt" TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 ../buck-out/v2/gen/fbcode/d29ee94b913014f1/caffe2/torch/fb/model_transform/experimental/benchmark/__mts_gpu_benchmark__/mts_gpu_benchmark.par --model-path manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend AOT_INDUCTOR_EP --gpu-trace --aot-inductor-config="{'max_autotune': True}" 2>&1 | tee output.txt
```
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:provenance_tracing -- --exact 'caffe2/test/inductor:provenance_tracing - test_triton_kernel_post_grad_mapping_aot_inductor (caffe2.test.inductor.test_provenance_tracing.TestProvenanceTracingArtifact)'
```

```
TORCH_LOGS="+inductor, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:provenance_tracing -- -r test_triton_kernel_post_grad_mapping_aot_inductor
```

Differential Revision: D66967510

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143055
Approved by: https://github.com/chenyang78
2024-12-18 06:51:50 +00:00
eellison
f3ec59d44c Fix non-dense inductor effn attn bias (#141905)
Didn't have any luck making a local repro, partially because of https://github.com/pytorch/pytorch/issues/141888, which will be fixed when we update to triton 3.2. But I verified locally that it fixes https://github.com/pytorch/pytorch/issues/139424 with the triton pin update that is landing soon.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141905
Approved by: https://github.com/drisspg
ghstack dependencies: #143315
2024-12-17 18:55:50 +00:00
Tom Ritchford
da67a6a7bb [inductor] Replace set by OrderedSet (#138466)
Uses the set_linter from https://github.com/pytorch/pytorch/pull/138454
and considerable manual editing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138466
Approved by: https://github.com/eellison
2024-12-13 16:08:45 +00:00
Tom Ritchford
dc23f1944a Remove unused Python variables in torch/[_-a]* (#133492)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133492
Approved by: https://github.com/albanD
2024-12-12 17:39:14 +00:00
PyTorch MergeBot
5c97ac9721 Revert "Remove unused Python variables in torch/[_-a]* (#133492)"
This reverts commit fda975a7b3.

Reverted https://github.com/pytorch/pytorch/pull/133492 on behalf of https://github.com/clee2000 due to Sorry, I need to revert this in order to revert something else.  The only thing you need to do is rebase and remerge ([comment](https://github.com/pytorch/pytorch/pull/133492#issuecomment-2536635516))
2024-12-11 17:29:12 +00:00
Tom Ritchford
fda975a7b3 Remove unused Python variables in torch/[_-a]* (#133492)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133492
Approved by: https://github.com/albanD
2024-12-10 21:48:44 +00:00
Michael Lazos
a3abe1a5ae Add support for bfloat16 atomic adds in fbcode (#141857)
This adds support for bfloat16 atomic add in fbcode (OSS will have to wait until those changes are upstreamed to triton)

Originally I attempted to write inline asm, but the triton API was not flexible enough to support this use case. In the long run the right answer is to implement this properly in OSS triton.

relevant issues:
* https://github.com/pytorch/pytorch/issues/137425 in fbcode only
* https://github.com/pytorch/pytorch/issues/97016

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141857
Approved by: https://github.com/eellison
2024-12-10 11:40:15 +00:00
Alex Denisov
539286a67b Inductor annotations (#130429)
Add NVTX annotations around training phases and buffer computations

RFC/discussion: https://dev-discuss.pytorch.org/t/rfc-performance-profiling-at-scale-with-details-nvtx-annotations/2224

<img width="2160" alt="Screenshot 2024-07-10 at 11 48 04" src="https://github.com/pytorch/pytorch/assets/1175576/9ade139c-d393-473f-9b68-6c25da367dc4">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130429
Approved by: https://github.com/aorenste, https://github.com/eellison, https://github.com/albanD

Co-authored-by: Cedric GESTES <cedric.gestes@flex.ai>
2024-12-10 08:53:39 +00:00
blzheng
b9e253cb72 [inductor] update numbytes_hint for NoneLayout to allow more fusions (#141766)
We found that [this commit](6eca0aee76) caused a ~6% performance drop in ViT INT8. This was due to changes to the `numbytes_hint` for `NoneLayout`. In this PR, we reverted the changes in `numbytes_hint` to allow more fusions.

```
import torch

class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.dense = torch.nn.Linear(768, 768)
        self.layernorm = torch.nn.LayerNorm(768, eps=1e-12)

    def forward(self, context_layer, hidden_states):
        # Linear projection, residual add, then layer norm.
        attention_output = self.dense(context_layer)
        hidden_states = attention_output + hidden_states
        layer_output = self.layernorm(hidden_states)
        return layer_output
```
The generated code before (left) and after (right) this PR is as follows:
![image](https://github.com/user-attachments/assets/0ec65ae5-103e-4e2c-bf7c-e8bed24fc179)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141766
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel
2024-12-10 06:45:07 +00:00
Bin Bao
5fc9f419ef [AOTI] Fix multi-kernel codegen when using one-pass (#142333)
Summary: Update multi-kernel codegen to one-pass, following https://github.com/pytorch/pytorch/pull/141980.

Differential Revision: [D66936717](https://our.internmc.facebook.com/intern/diff/D66936717)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142333
Approved by: https://github.com/chenyang78
ghstack dependencies: #141980
2024-12-09 14:49:10 +00:00
Bin Bao
4d43ec2189 [AOTI] Switch GPU codegen to one-pass (#141980)
Summary: With autotune_at_compile_time enabled, AOTI now can perform CUDA codegen in one pass. CUDA kernel related code is generated in a deferred way, after autotuning is done. This one-pass implementation will eliminate any issue caused by disparity between passes in the previous two-pass implementation (which caused multiple bug reports in the past). One-pass implementation also avoids cloning mutated inputs needed in the two-pass implementation, which will reduce GPU memory consumption.

Differential Revision: [D66739414](https://our.internmc.facebook.com/intern/diff/D66739414)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141980
Approved by: https://github.com/chenyang78
2024-12-09 14:40:34 +00:00
drisspg
75e72e1408 Adding lowering to persistent-tma device kernel for _scaled_mm (#142045)
# Summary
This PR adds an alternative triton lowering for _scaled_mm. This uses an updated mm template that utilizes persistent scheduling + TMAs on A and B matrices.

Limitations:
* This implementation does not work with bias values: 0602676c8d/torch/_inductor/kernel/mm_scaled.py (L106). The plan is to remove this workaround and enforce that both scaling and bias are properly done as epilogues on the existing templates
* K dim must be 32 or greater for these to take effect
* Gated by a config flag (currently defaults to off, maybe should be on)
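
For readers unfamiliar with persistent scheduling: the general idea is that a fixed pool of programs (typically one per SM) each loops over several output tiles, instead of launching one program per tile. The pure-Python sketch below only illustrates that tile assignment. The function name and the strided assignment are illustrative assumptions, not the template's actual code.

```python
import math


def persistent_tile_schedule(M, N, block_m, block_n, num_programs):
    """Assign output tiles to a fixed pool of programs in a strided loop."""
    tiles_m = math.ceil(M / block_m)
    tiles_n = math.ceil(N / block_n)
    total_tiles = tiles_m * tiles_n
    schedule = {pid: [] for pid in range(num_programs)}
    for pid in range(num_programs):
        # Each program starts at its own id and strides by the pool size.
        for tile_id in range(pid, total_tiles, num_programs):
            schedule[pid].append((tile_id // tiles_n, tile_id % tiles_n))
    return schedule


# 512x512 output with 128x128 tiles gives 16 tiles shared by 4 programs.
schedule = persistent_tile_schedule(M=512, N=512, block_m=128, block_n=128, num_programs=4)
```

Every tile is visited exactly once, and the number of launched programs stays constant regardless of problem size.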

## Testing
We don't have any tests exercising this code in CI/CD, but I updated the relevant tests in test_fp8 and they are all green:
<img width="1680" alt="Screenshot 2024-12-05 at 7 24 07 PM" src="https://github.com/user-attachments/assets/9c520541-d97a-416f-9af7-e68b366ec90f">

## Follow Ups
* Work to update the base mm triton templates and utilize the same template from mm/addmm/scaled_mm w/ respective epilogues
* Tuning on Persistent kernel configs. I found ones that work for my problem shapes but need to do some more NCU work

### Some profiling code I was using

Code I am using to iterate w/
```Python
import torch
from dataclasses import dataclass
from jsonargparse import CLI
import logging
from pathlib import Path

from transformer_nuggets.utils.benchmark import ProfileConfig, profile_function
from torchao.float8.inference import (
    addmm_float8_unwrapped_inference,
    preprocess_data,
    Float8MMConfig,
)
from transformer_nuggets.fp8.fp8_matmul import (
    matmul_persistent,
    matmul_tma_persistent,
    matmul_device_tma_persistent,
)
from enum import Enum

logging.getLogger("transformer_nuggets").setLevel(logging.INFO)

class FP8Kernel(Enum):
    PERSISTENT = "Persistent"
    PERSISTENT_TMA = "Persistent-TMA"
    DEVICE_TMA = "Device-TMA"
    SCALED_MM = "Scaled-MM"

class ScalingStrategy(Enum):
    PER_TENSOR = "PerTensor"
    PER_ROW = "PerRow"

@dataclass(frozen=True)
class ExperimentConfig:
    M: int
    K: int
    N: int
    scaling_strategy: ScalingStrategy
    fp8_kernel: FP8Kernel
    compile: bool

def get_fp8_matmul(
    A: torch.Tensor,
    B: torch.Tensor,
    scaling_strategy: ScalingStrategy,
    fp8_kernel: FP8Kernel,
):
    A_fp8 = A.to(torch.float8_e4m3fn)
    B_fp8 = B.to(torch.float8_e4m3fn)
    A_fp8, B_fp8 = preprocess_data(A_fp8, B_fp8, Float8MMConfig(use_fast_accum=True))

    if scaling_strategy == ScalingStrategy.PER_TENSOR:
        a_scale = torch.tensor(1, device="cuda", dtype=torch.float32)
        b_scale = torch.tensor(1, device="cuda", dtype=torch.float32)
    elif scaling_strategy == ScalingStrategy.PER_ROW:
        a_scale = torch.ones((A_fp8.size(0), 1), device="cuda", dtype=torch.float32)
        b_scale = torch.ones((B_fp8.size(1), 1), device="cuda", dtype=torch.float32).T
    else:
        raise ValueError(f"Invalid scaling strategy: {scaling_strategy}")

    assert fp8_kernel == FP8Kernel.SCALED_MM
    return lambda: addmm_float8_unwrapped_inference(
        A_fp8, a_scale, B_fp8, b_scale, output_dtype=torch.bfloat16, use_fast_accum=True
    )

def run_matmul(config: ExperimentConfig):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    A = torch.randn(config.M, config.K, device=device, dtype=torch.bfloat16)
    B = torch.randn(config.K, config.N, device=device, dtype=torch.bfloat16)

    fp8_matmul = get_fp8_matmul(A, B, config.scaling_strategy, config.fp8_kernel)

    if config.compile and config.fp8_kernel == FP8Kernel.SCALED_MM:
        fp8_matmul = torch.compile(fp8_matmul, mode="max-autotune-no-cudagraphs")

    _ = fp8_matmul()

    return

def main():
    torch.random.manual_seed(123)

    # Define your experiment configuration here
    config = ExperimentConfig(
        M=8192,
        K=8192,
        N=8192,
        scaling_strategy=ScalingStrategy.PER_TENSOR,
        fp8_kernel=FP8Kernel.SCALED_MM,
        compile=True,
    )

    run_matmul(config)

if __name__ == "__main__":
    CLI(main)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142045
Approved by: https://github.com/eellison
2024-12-09 01:48:40 +00:00
Bin Bao
2c6d094869 [AOTI] Assert misaligned input (#142136)
Summary: Fixes https://github.com/pytorch/pytorch/issues/141891. JIT Inductor relies on copy_misaligned_inputs to fix misaligned inputs. For AOTInductor's use scenario, this is an unacceptable performance hit, so we codegen an input alignment check at the entry point and throw an error if any misalignment exists.
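
A minimal sketch of the kind of entry-point check described above, written in plain Python rather than the generated C++. The 16-byte alignment value, the helper name, and the example addresses are all assumptions made for illustration.

```python
ALIGNMENT_BYTES = 16  # assumed alignment requirement, for illustration only


def assert_inputs_aligned(input_addresses, alignment=ALIGNMENT_BYTES):
    """Raise if any input's base address is not a multiple of `alignment`."""
    for name, address in input_addresses.items():
        if address % alignment != 0:
            raise RuntimeError(
                f"{name} is not {alignment}-byte aligned (address {address:#x})"
            )


# In practice the addresses would come from something like tensor.data_ptr();
# the values here are made up.
assert_inputs_aligned({"arg0_1": 0x7F0000000000, "arg1_1": 0x7F0000000100})
```

Failing fast at the entry point avoids the silent copy that `copy_misaligned_inputs` performs in the JIT path.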

Differential Revision: [D66881038](https://our.internmc.facebook.com/intern/diff/D66881038)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142136
Approved by: https://github.com/eellison, https://github.com/ezyang
ghstack dependencies: #142133
2024-12-08 15:13:01 +00:00
Boyuan Feng
61a7c83c64 [Inductor] fix device error for NopKernelSchedulerNode (#141372)
This PR adds device guard support for NopKernelSchedulerNode, which may create a tensor. Prior to this PR, we did not codegen a device guard for NopKernelSchedulerNode, leading to errors.

Prior to the PR:
```python
def call(args):
    arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1, arg6_1, arg7_1, arg8_1, arg9_1, arg10_1 = args
    args.clear()
    assert_size_stride(arg0_1, (1, 1, 2048, 128), (262144, 262144, 128, 1))
    assert_size_stride(arg1_1, (1, 1, 2048, 128), (262144, 262144, 128, 1))
    assert_size_stride(arg2_1, (1, 1, 2048, 128), (262144, 262144, 128, 1))
    assert_size_stride(arg3_1, (1, 1, 16), (16, 16, 1))
    assert_size_stride(arg4_1, (1, 1, 16, 16), (256, 256, 16, 1))
    assert_size_stride(arg5_1, (1, 1, 16), (16, 16, 1))
    assert_size_stride(arg6_1, (1, 1, 16, 16), (256, 256, 16, 1))
    assert_size_stride(arg7_1, (1, 1, 16), (16, 16, 1))
    assert_size_stride(arg8_1, (1, 1, 16, 16), (256, 256, 16, 1))
    assert_size_stride(arg9_1, (1, 1, 16), (16, 16, 1))
    assert_size_stride(arg10_1, (1, 1, 16, 16), (256, 256, 16, 1))
    buf0 = empty_strided_cuda((1, 1, 2048), (2048, 2048, 1), torch.float32) # TODO: ERROR here. Should be cuda:1
    with torch.cuda._DeviceGuard(1):
        torch.cuda.set_device(1)
        buf1 = empty_strided_cuda((1, 1, 2048, 128), (262144, 262144, 128, 1), torch.bfloat16)
        # Topologically Sorted Source Nodes: [flex_attention], Original ATen: []
        stream1 = get_raw_stream(1)
        breakpoint()
        triton_tem_fused_0.run(arg0_1, arg1_1, arg2_1, buf0, arg3_1, arg4_1, arg5_1, arg6_1, buf1, grid=torch._inductor.kernel.flex_attention.flex_attention_grid(1, 1, 2048, 128, meta0), stream=stream1)
        del arg0_1
        del arg1_1
        del arg2_1
        del arg3_1
        del arg4_1
        del arg5_1
        del arg6_1
        del buf0
    return (buf1, )
```

After the PR:
```python
def call(args):
    arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1, arg6_1, arg7_1, arg8_1, arg9_1, arg10_1 = args
    args.clear()
    assert_size_stride(arg0_1, (1, 1, 2048, 128), (262144, 262144, 128, 1))
    assert_size_stride(arg1_1, (1, 1, 2048, 128), (262144, 262144, 128, 1))
    assert_size_stride(arg2_1, (1, 1, 2048, 128), (262144, 262144, 128, 1))
    assert_size_stride(arg3_1, (1, 1, 16), (16, 16, 1))
    assert_size_stride(arg4_1, (1, 1, 16, 16), (256, 256, 16, 1))
    assert_size_stride(arg5_1, (1, 1, 16), (16, 16, 1))
    assert_size_stride(arg6_1, (1, 1, 16, 16), (256, 256, 16, 1))
    assert_size_stride(arg7_1, (1, 1, 16), (16, 16, 1))
    assert_size_stride(arg8_1, (1, 1, 16, 16), (256, 256, 16, 1))
    assert_size_stride(arg9_1, (1, 1, 16), (16, 16, 1))
    assert_size_stride(arg10_1, (1, 1, 16, 16), (256, 256, 16, 1))
    with torch.cuda._DeviceGuard(1):
        torch.cuda.set_device(1)
        buf0 = empty_strided_cuda((1, 1, 2048), (2048, 2048, 1), torch.float32) # New: move into device guard
        buf1 = empty_strided_cuda((1, 1, 2048, 128), (262144, 262144, 128, 1), torch.bfloat16)
        # Topologically Sorted Source Nodes: [flex_attention], Original ATen: []
        stream1 = get_raw_stream(1)
        triton_tem_fused_0.run(arg0_1, arg1_1, arg2_1, buf0, arg3_1, arg4_1, arg5_1, arg6_1, buf1, grid=torch._inductor.kernel.flex_attention.flex_attention_grid(1, 1, 2048, 128, meta0), stream=stream1)
        del arg0_1
        del arg1_1
        del arg2_1
        del arg3_1
        del arg4_1
        del arg5_1
        del arg6_1
        del buf0
    return (buf1, )
```

Fixes #141010

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141372
Approved by: https://github.com/eellison
2024-12-06 19:27:50 +00:00
Aaron Gokaslan
08db735629 [BE]: Update mypy to 1.13.0 (#140808)
Update mypy to 1.13.0. Should hopefully reduce linting time. Has support for orjson cache serialization, which should improve mypy cache perf if orjson is installed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140808
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-12-03 02:50:10 +00:00
PyTorch MergeBot
daa77f3d9f Revert "[BE]: Update mypy to 1.13.0 (#140808)"
This reverts commit 00134d68af.

Reverted https://github.com/pytorch/pytorch/pull/140808 on behalf of https://github.com/huydhn due to This is failing a distributed test in trunk, target determination missed this test and did not run it on PR ([comment](https://github.com/pytorch/pytorch/pull/140808#issuecomment-2512788426))
2024-12-02 20:47:43 +00:00
Aaron Gokaslan
00134d68af [BE]: Update mypy to 1.13.0 (#140808)
Update mypy to 1.13.0. Should hopefully reduce linting time. Has support for orjson cache serialization, which should improve mypy cache perf if orjson is installed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140808
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-12-02 18:47:54 +00:00
Edward Z. Yang
b97a786125 Inline compile_to_fn at its only call site (#141691)
Stacked on https://github.com/pytorch/pytorch/pull/141689

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141691
Approved by: https://github.com/jansel
ghstack dependencies: #141681, #141683, #141685, #141688, #141689
2024-11-29 01:15:38 +00:00
eellison
f83361b274 inductor dtype propagation fixes (#141495)
- Add in upcast_compute_type on creation of new tensors (loads, constants)
- Fixes index_expr. Right now we are somewhat inconsistent in dtype and don't always respect the dtype specified. It would be nice to fix, but that is not done in this PR.
- Bug fix in view dtype, where we were always upcasting back to fp32 when the input was in bf16/fp16. We should only do that if the output is also in bf16/fp16.
- For masked, avoid calling dtype propagation and just use the output dtype.

Turns on the runtime dtype verification for opinfo tests. The separate test file is still useful because we can use it for testing turning off codegen_upcast_to_fp32.

Follow ups:

- We could consider requiring fewer explicit upcast_compute_types calls and doing it automatically. That would potentially make things easier but be less flexible in the future. Maybe I should have done it in this PR.
- Be more consistent on our index expr dtype printing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141495
Approved by: https://github.com/blaine-rister, https://github.com/arui-meta, https://github.com/ezyang
ghstack dependencies: #139945, #140057
2024-11-28 11:39:38 +00:00
Boyuan Feng
17fd53d8e5 [Inductor] Inplacing with Donated Buffer (#140113)
Currently, inductor does not inplace update a buffer if it is an input buffer, because we don't know if an input will be used by other functions.

A donated buffer provides the additional information that an input buffer will not be used by other functions, so we can inplace update a donated buffer when possible.
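
The rule above can be sketched as a small predicate. This is a deliberately simplified illustration: `BufferInfo` and `can_update_inplace` are hypothetical names, and the sketch ignores every other inplacing condition inductor checks (such as remaining users of the buffer).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BufferInfo:
    # Hypothetical record, not an inductor class.
    name: str
    is_graph_input: bool
    is_donated: bool = False


def can_update_inplace(buf: BufferInfo) -> bool:
    """An input may be overwritten only if it is known to be donated."""
    if not buf.is_graph_input:
        # Simplification: other conditions for intermediate buffers are ignored.
        return True
    return buf.is_donated


plain_input = BufferInfo("primals_1", is_graph_input=True)
donated_input = BufferInfo("tangents_1", is_graph_input=True, is_donated=True)
```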

[Dashboard](https://hud.pytorch.org/benchmark/torchbench/inductor_dynamic?dashboard=torchinductor&startTime=Mon,%2011%20Nov%202024%2018:14:36%20GMT&stopTime=Mon,%2018%20Nov%202024%2018:14:36%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=bf/donated-buffer-inplace&lCommit=5df0769c00e6f9000caeb10fd5cbf0b165f69c2a&rBranch=main&rCommit=2b39a8db7741b816b03677a9c6fec1af05640dee)

![image](https://github.com/user-attachments/assets/f19d961f-7973-418e-9de8-5c2a97950478)
![image](https://github.com/user-attachments/assets/df3bd6a9-58b8-4e8a-8397-9e3b1de9adfe)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140113
Approved by: https://github.com/eellison
2024-11-27 18:51:52 +00:00
PyTorch MergeBot
65dbd5cc2d Revert "[Inductor] Inplacing with Donated Buffer (#140113)"
This reverts commit eecc8e362c.

Reverted https://github.com/pytorch/pytorch/pull/140113 on behalf of https://github.com/BoyuanFeng due to break test_donated_buffer_inplace internally since donated_buffer = False if is_fbcode() else True ([comment](https://github.com/pytorch/pytorch/pull/140113#issuecomment-2501954300))
2024-11-26 21:20:59 +00:00
Yidi Wu
aae581d921 [hop free symbols][inductor] remove un-used add_symbol_graph_inputs (#141172)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141172
Approved by: https://github.com/Chillee
ghstack dependencies: #140334
2024-11-26 17:32:50 +00:00
Boyuan Feng
eecc8e362c [Inductor] Inplacing with Donated Buffer (#140113)
Currently, inductor does not inplace update a buffer if it is an input buffer, because we don't know if an input will be used by other functions.

A donated buffer provides the additional information that an input buffer will not be used by other functions, so we can inplace update a donated buffer when possible.

[Dashboard](https://hud.pytorch.org/benchmark/torchbench/inductor_dynamic?dashboard=torchinductor&startTime=Mon,%2011%20Nov%202024%2018:14:36%20GMT&stopTime=Mon,%2018%20Nov%202024%2018:14:36%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=bf/donated-buffer-inplace&lCommit=5df0769c00e6f9000caeb10fd5cbf0b165f69c2a&rBranch=main&rCommit=2b39a8db7741b816b03677a9c6fec1af05640dee)

![image](https://github.com/user-attachments/assets/f19d961f-7973-418e-9de8-5c2a97950478)
![image](https://github.com/user-attachments/assets/df3bd6a9-58b8-4e8a-8397-9e3b1de9adfe)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140113
Approved by: https://github.com/eellison
2024-11-26 17:19:50 +00:00
Edward Z. Yang
612122af8f Fix type-safety of torch.nn.Module instances (#141240)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141240
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-11-22 00:05:05 +00:00
Jason Ansel
6eca0aee76 [inductor] Refactor ir.Layout into ir.OutputSpec (#140910)
This separates the concepts of a Layout (size/stride/etc) and an OutputSpec (which includes multiple outputs), which should make typing easier.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140910
Approved by: https://github.com/ezyang
ghstack dependencies: #140895
2024-11-21 20:01:57 +00:00
Jason Ansel
808f0f656d [inductor] Refactor MutableBox to make IRNode typing easier (#140895)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140895
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2024-11-20 19:50:46 +00:00
Aaron Gokaslan
12e95aa4ee [BE]: Apply PERF401 autofixes from ruff (#140980)
* Automatically applies ruff rule 401. Turns loops into equivalent list comprehensions which are faster and do not leak the scope of the loop variables.
* List comprehensions not only often have better typing, but are 50+% faster than for loops on overhead. They also preserve length information etc. and are better for the interpreter to optimize.
* Manually went back and made mypy happy after the change.
* Also fixed style lints in files covered by flake8 but not by pyfmt
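
For reference, the kind of rewrite PERF401 suggests looks like this. The variable names are illustrative and the snippet is not taken from the PR.

```python
values = [1, 2, 3, 4]

# Before: building a list with an explicit loop (the pattern PERF401 flags).
squares_loop = []
for v in values:
    if v % 2 == 0:
        squares_loop.append(v * v)

# After: the equivalent list comprehension.
squares_comp = [v * v for v in values if v % 2 == 0]
```

Both forms produce the same list; in the comprehension the loop variable is local to the expression instead of remaining bound in the enclosing scope.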

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140980
Approved by: https://github.com/justinchuby, https://github.com/malfet
2024-11-20 17:52:07 +00:00
Shunting Zhang
7e9e83a8c6 [inductor] force contiguous layout for implicit fallback (#140996)
Fix https://github.com/pytorch/pytorch/issues/140462 .

Horace found that when we implicitly fall back to eager, some eager kernels may not work correctly if Inductor provides non-contiguous inputs (due to padding etc.). The original issue was found for the backward op of weight_norm. The fix in this PR is a general one: we force inputs to all implicit fallback kernels to be contiguous.

I had to refactor the code a bit to make it work. Previously we applied layout constraints in `GraphLowering.run_node`, while we look for implicit fallbacks in `call_function`. The problem is that when we set up the implicit fallback in `call_function` with a layout constraint, we don't have a chance to apply the constraint. The refactor moves the code that applies layout constraints to `call_function`.
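
To make "non-contiguous due to padding" concrete, here is a small pure-Python sketch of row-major strides. The helper names are hypothetical, and the check ignores the special cases real tensor libraries apply to size-1 and empty dimensions.

```python
def contiguous_strides(shape):
    """Row-major strides (in elements) for a dense tensor of this shape."""
    strides = []
    running = 1
    for size in reversed(shape):
        strides.append(running)
        running *= size
    return tuple(reversed(strides))


def is_contiguous(shape, strides):
    # Simplification: ignores the special cases for size-1 and empty dims.
    return tuple(strides) == contiguous_strides(shape)


# A (4, 30) tensor whose rows were padded out to 32 elements is not contiguous.
shape = (4, 30)
padded_strides = (32, 1)
needs_copy = not is_contiguous(shape, padded_strides)
```

A fallback kernel that assumes dense strides would read the padding as data, which is why forcing a contiguous layout is the conservative fix.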

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140996
Approved by: https://github.com/jansel
2024-11-20 06:41:17 +00:00
Sam Larsen
ff17d2b83e [easy][logging] Remove dynamo_timed fwd_only param (#140993)
Summary: It's ignored; remove it

Test Plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140993
Approved by: https://github.com/ezyang
2024-11-20 02:31:51 +00:00
Henry Tsang
4f2543c31d [logs] Add dynamo_timed to get better compilation time breakdown for AOTI (#140198)
Adding some dynamo_timed annotations to better understand AOTI compilation time.

This will probably require a few more passes. A lot of time is spent in Scheduler.__init__, and there are not enough annotations there.

run_command_and_check takes a lot of time as well, but there is probably not much we can do. Maybe we can add a config to tune the C++ optimization level?

traces:
<img width="1205" alt="Screenshot 2024-11-08 at 4 41 10 PM" src="https://github.com/user-attachments/assets/61645264-b3af-4d4a-804d-700b0f831c7c">

Differential Revision: D65554141

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140198
Approved by: https://github.com/desertfire
2024-11-19 18:54:17 +00:00
PyTorch MergeBot
d472a5f680 Revert "[inductor] Refactor MutableBox to make IRNode typing easier (#140895)"
This reverts commit c79e78b503.

Reverted https://github.com/pytorch/pytorch/pull/140895 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think test_torchbind_inductor is failing in trunk after this lands ([comment](https://github.com/pytorch/pytorch/pull/140895#issuecomment-2484679319))
2024-11-19 04:25:41 +00:00
Jason Ansel
c79e78b503 [inductor] Refactor MutableBox to make IRNode typing easier (#140895)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140895
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2024-11-19 00:24:35 +00:00
Bin Bao
62fb6fd8bd Fix broken AOTInductor node and kernel counts (#139435)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139435
Approved by: https://github.com/desertfire
ghstack dependencies: #139411, #139412

Co-authored-by: Bin Bao <binbao@meta.com>
2024-11-17 04:17:07 +00:00
Angela Yi
baf756a785 [reland] [aoti] Selectively package AOTI generated files (#140675)
Summary: Reland  https://github.com/pytorch/pytorch/pull/140022

Test Plan: CI

Differential Revision: D65929964

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140675
Approved by: https://github.com/desertfire
2024-11-15 23:48:34 +00:00
PyTorch MergeBot
14641c0393 Revert "Fix broken AOTInductor node and kernel counts (#139435)"
This reverts commit 8cb0b932a1.

Reverted https://github.com/pytorch/pytorch/pull/139435 on behalf of https://github.com/kit1980 due to breaking internal inductor test ([comment](https://github.com/pytorch/pytorch/pull/139411#issuecomment-2477235367))
2024-11-14 19:25:46 +00:00
Sam Larsen
b11ff3cf60 [logging] Overhaul dynamo_timed and CompilationMetrics logging. (#139849)
Here's the overview:

There's a new contextmanager singleton called MetricsContext. Entering the MetricsContext is how we demarcate the boundary on which we'll create a single CompilationMetrics object, and therefore, a single dynamo_compile log entry. While we're inside the MetricsContext, we can update/set many different metrics. Most importantly: `dynamo_timed` can also update the in-progress MetricsContext. In the proposal here, we tell `dynamo_timed` that we want it to do so by providing the name of the MetricsContext field to increment. There can be many `dynamo_timed` calls in different parts of the code updating different fields. Then when the MetricsContext exits, that's when the logging of everything gathered finally happens. One potential footgun is trying to use `dynamo_timed` when we haven't entered the MetricsContext, but we assert on that problem. Another problem is that we re-enter the context recursively, but we watch for that and do the logging only when the outermost exits.

Some specifics:
* Introduce MetricsContext - a context manager that on exit, records the CompilationMetrics (which also logs to dynamo_compile).
* Completely remove the concept of frame_phase_timing. Instead, update the MetricsContext during compilation, either directly or via dynamo_timed.
* Remove some globals we previously used to accumulate counters to later populate a CompilationMetrics. We use CompilationMetrics set/update/increment APIs instead.
* `record_compilation_metrics` is now called on exit from MetricsContext.
* Populate legacy CompilationMetrics fields right before logging, inside `record_compilation_metrics`.
* Remove the one-off `add_remote_cache_time_saved` helper; capture that timing directly into the MetricsContext.

And specifically, several changes to dynamo_timed:
* "Modernize" the parameters and update all callsites accordingly.
* Move the backwards logging of the CompilationMetrics to the backwards compile location.
* Add a parameter for which CompilationMetrics field to update
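
The behavior described above (one log entry per outermost context, timed regions incrementing a named field, an assertion when used outside the context) can be sketched as follows. This is not the PyTorch implementation: `MetricsContextSketch`, `timed`, and the `inductor_time_us` field name are hypothetical stand-ins.

```python
import time
from contextlib import contextmanager


class MetricsContextSketch:
    """Collects metrics between the outermost enter and exit, then reports once."""

    def __init__(self, on_exit):
        self._on_exit = on_exit
        self._level = 0
        self._metrics = {}

    def __enter__(self):
        if self._level == 0:
            self._metrics = {}
        self._level += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        self._level -= 1
        if self._level == 0:
            # Only the outermost exit triggers logging.
            self._on_exit(dict(self._metrics))
        return False

    def increment(self, field, value):
        assert self._level > 0, "metrics updated outside of the context"
        self._metrics[field] = self._metrics.get(field, 0) + value


@contextmanager
def timed(ctx, field):
    """Time a region and add the elapsed microseconds to `field`."""
    start = time.perf_counter_ns()
    try:
        yield
    finally:
        ctx.increment(field, (time.perf_counter_ns() - start) // 1000)


logged = []
ctx = MetricsContextSketch(on_exit=logged.append)
with ctx:
    with ctx:  # re-entering does not produce a second log entry
        with timed(ctx, "inductor_time_us"):
            sum(range(1000))
    with timed(ctx, "inductor_time_us"):
        sum(range(1000))
```

After the outer `with` exits, `logged` holds a single dictionary with the accumulated time for both regions.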

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139849
Approved by: https://github.com/ezyang
2024-11-14 19:11:20 +00:00
PyTorch MergeBot
d63eb3c46c Revert "[logging] Overhaul dynamo_timed and CompilationMetrics logging. (#139849)"
This reverts commit cb15c15157.

Reverted https://github.com/pytorch/pytorch/pull/139849 on behalf of https://github.com/kit1980 due to Breaking an internal tests + there is a bug according to the author ([comment](https://github.com/pytorch/pytorch/pull/139849#issuecomment-2474459094))
2024-11-13 18:47:51 +00:00
PyTorch MergeBot
b4cc5d38b4 Revert "[aoti] Remove dir after packaging (#140022)"
This reverts commit ba136a78ba.

Reverted https://github.com/pytorch/pytorch/pull/140022 on behalf of https://github.com/angelayi due to sorry I realized I need to land from internal ([comment](https://github.com/pytorch/pytorch/pull/140022#issuecomment-2473814720))
2024-11-13 14:43:15 +00:00
angelayi
ba136a78ba [aoti] Remove dir after packaging (#140022)
Update AOTI to return a list of files that it generates when `aot_inductor.package=True`. Then we will only package the files that are in that list.

This should fix the [caching issue](https://fb.workplace.com/groups/1028545332188949/permalink/1081702043539944/) and hopefully https://github.com/pytorch/pytorch/issues/140053.
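
As a rough sketch of the idea (package only the files the compiler reports, rather than everything found in the output directory), assuming a plain zip archive and made-up file names; `package_generated_files` is a hypothetical helper, not an AOTI API:

```python
import os
import tempfile
import zipfile
from typing import List


def package_generated_files(generated_files: List[str], archive_path: str) -> List[str]:
    """Write only the explicitly listed files into the archive (hypothetical helper)."""
    with zipfile.ZipFile(archive_path, "w") as archive:
        for path in generated_files:
            archive.write(path, arcname=os.path.basename(path))
    # Re-open the archive to report what actually ended up inside it.
    with zipfile.ZipFile(archive_path) as archive:
        return sorted(archive.namelist())


with tempfile.TemporaryDirectory() as build_dir:
    # Two files from the current compilation plus a stale leftover from an earlier one.
    for name in ("model.cpp", "model.so", "stale_kernel.cubin"):
        with open(os.path.join(build_dir, name), "w") as f:
            f.write("placeholder")

    generated = [os.path.join(build_dir, n) for n in ("model.cpp", "model.so")]
    packaged = package_generated_files(generated, os.path.join(build_dir, "model.zip"))
```

Because the archive is built from an explicit list, the stale file sitting in the same directory is not picked up.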

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140022
Approved by: https://github.com/larryliu0820, https://github.com/desertfire, https://github.com/malfet
2024-11-13 12:17:19 +00:00
PyTorch MergeBot
d48ea29b9a Revert "[aoti] Remove dir after packaging (#140022)"
This reverts commit 8c6abe5a8c.

Reverted https://github.com/pytorch/pytorch/pull/140022 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the lint failure is legit ([comment](https://github.com/pytorch/pytorch/pull/140022#issuecomment-2471847439))
2024-11-12 23:35:27 +00:00
Wei Wei
70a223cce6 [aotinductor] fix a few issues in bandwidth profiler (#139607)
Summary:
Recent attempts to use the bandwidth profiler did not behave as expected. I observed a few issues and tried to fix them in this diff:
1. The return value of the DebugAutotuner class.
2. Profiling results show a very large overhead.
DebugAutotuner.run() returns a benchmark time of around 45ms, while CachingAutotuner.run() returns around 0.45ms.
The `_find_names` and `re.match` calls take 45ms: P1669186358
After commenting out the above `_find_names` and `re.match` calls, the benchmark time becomes consistent with non-profiling mode: P1669185589
3. Introduce a variable `bandwidth_info` to control the path in DebugAutotuner.run(). During benchmarking for configuration selection, `bandwidth_info` should be turned off.

After applying this diff, the profiling issues mentioned above are fixed: P1669273172
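
The shape of the fix in items 2 and 3 can be illustrated with a small sketch. It assumes a simplified runner class; `Runner`, `slow_name_lookup`, and the kernel names are hypothetical, and only the `bandwidth_info` flag name comes from this diff.

```python
import re
import time
from typing import Callable, List


def slow_name_lookup(candidates: List[str]) -> List[str]:
    """Stand-in for costly bookkeeping such as name lookups and regex matching."""
    return [c for c in candidates if re.match(r"^triton_", c)]


class Runner:
    """Hypothetical simplified runner; not the DebugAutotuner implementation."""

    def __init__(self, kernel: Callable[[], None], names: List[str]) -> None:
        self.kernel = kernel
        self.names = names
        self.bandwidth_info = True  # turned off while benchmarking candidate configs
        self.lookups = 0

    def run(self) -> float:
        start = time.perf_counter()
        self.kernel()
        elapsed = time.perf_counter() - start  # time only the kernel itself
        if self.bandwidth_info:
            # Reporting work stays outside the timed region and is skipped
            # entirely during configuration selection.
            self.lookups += 1
            slow_name_lookup(self.names)
        return elapsed


runner = Runner(kernel=lambda: None, names=["triton_poi_fused_0", "extern_mm"])

runner.bandwidth_info = False  # configuration selection: no reporting overhead
for _ in range(3):
    runner.run()

runner.bandwidth_info = True  # final profiled run: reporting enabled
runner.run()
```

Here the expensive lookup runs once, for the profiled run, instead of on every benchmark iteration.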

Test Plan:
```
TORCHINDUCTOR_FORCE_DISABLE_CACHES=1   TORCHINDUCTOR_PROFILE=1 TORCHINDUCTOR_PROFILE_OUTPUT=~/tmp/profile.txt TORCH_LOGS='+inductor,+schedule,output_code' TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 TORCHINDUCTOR_BENCHMARK_KERNEL=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 CUDA_VISIBLE_DEVICES=5  buck run mode/{opt,inplace} scripts/wwei6/triton_examples:test_mat 2>&1 | tee profiling-5.log
```
To disable the ATen backend, add `TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS="TRITON"`.

Differential Revision: D64883079

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139607
Approved by: https://github.com/chenyang78
2024-11-12 23:26:47 +00:00
angelayi
8c6abe5a8c [aoti] Remove dir after packaging (#140022)
Update AOTI to return a list of files that it generates when `aot_inductor.package=True`. Then we will only package the files that are in that list.

This should fix the [caching issue](https://fb.workplace.com/groups/1028545332188949/permalink/1081702043539944/) and hopefully https://github.com/pytorch/pytorch/issues/140053.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140022
Approved by: https://github.com/larryliu0820, https://github.com/desertfire, https://github.com/malfet
2024-11-12 21:36:24 +00:00
Benjamin Glass
8cb0b932a1 Fix broken AOTInductor node and kernel counts (#139435)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139435
Approved by: https://github.com/desertfire
ghstack dependencies: #139411, #139412
2024-11-12 15:22:46 +00:00
Sam Larsen
cb15c15157 [logging] Overhaul dynamo_timed and CompilationMetrics logging. (#139849)
Here's the overview:

There's a new context manager singleton called MetricsContext. Entering the MetricsContext demarcates the boundary within which we create a single CompilationMetrics object and, therefore, a single dynamo_compile log entry. While inside the MetricsContext, we can update or set many different metrics. Most importantly, `dynamo_timed` can also update the in-progress MetricsContext: we tell `dynamo_timed` to do so by providing the name of the MetricsContext field to increment. There can be many `dynamo_timed` calls in different parts of the code updating different fields. When the MetricsContext exits, everything gathered is finally logged. One potential footgun is using `dynamo_timed` before entering the MetricsContext, but we assert on that. Another is that the context can be re-entered recursively; we watch for that and log only when the outermost context exits.

Some specifics:
* Introduce MetricsContext - a context manager that on exit, records the CompilationMetrics (which also logs to dynamo_compile).
* Completely remove the concept of frame_phase_timing. Instead, update the MetricsContext during compilation, either directly or via dynamo_timed.
* Remove some globals we previously used to accumulate counters to later populate a CompilationMetrics. We use CompilationMetrics set/update/increment APIs instead.
* `record_compilation_metrics` is now called on exit from MetricsContext.
* Populate legacy CompilationMetrics fields right before logging, inside `record_compilation_metrics`.
* Remove the one-off `add_remote_cache_time_saved` helper; capture that timing directly into the MetricsContext.

And specifically, several changes to dynamo_timed:
* "Modernize" the parameters and update all callsites accordingly.
* Move the backwards logging of the CompilationMetrics to the backwards compile location.
* Add a parameter for which CompilationMetrics field to update

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139849
Approved by: https://github.com/ezyang
ghstack dependencies: #140094
2024-11-11 14:24:23 +00:00
Aaron Orenstein
06f619d999 typing ir.py - part 2 (#131846)
See #131852

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131846
Approved by: https://github.com/eellison
ghstack dependencies: #139238
2024-11-06 00:01:15 +00:00
Aaron Orenstein
c2109ec479 typing ir.py - Disallow untyped defs for ir.py (#139238)
- Remove "mypy: allow-untyped-defs" and mark functions individually with "no-untyped-def"
- Mark some trivial functions with the proper return types (`None` and `torch.dtype`)
- Fixed a type bug in the signature of supported_dtype_of_cpp_wrapper()
- `ruff check torch/_inductor/ir.py --select ANN --fix --unsafe-fixes` and then fixed up things that looked incorrectly applied.
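
For illustration, the first two bullets amount to a change of the following shape. The function names are made up for the example and are not taken from `ir.py`; the file-level comment shown is mypy's inline configuration syntax.

```python
# Before: a file-level inline config silenced the check for every function:
#   # mypy: allow-untyped-defs
#
# After: the file-level comment is gone and each remaining untyped def is
# marked individually, so any newly added function must be annotated.
from typing import List


def legacy_helper(x, y):  # type: ignore[no-untyped-def]
    # Not yet annotated; the per-function ignore covers this def only.
    return x + y


def reset(buffers: List[int]) -> None:
    # A trivial function given its proper return type.
    buffers.clear()
```

The per-function markers make the remaining untyped surface visible and easy to search for, which suits an incremental cleanup.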

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139238
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2024-11-06 00:01:15 +00:00