pytorch

OSSForks/pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 00:21:07 +01:00

Commit Graph

Author

SHA1

Message

Date

Yuanyuan Chen

fc8ac1216c

[4/N] Remove unused loop variables in tests (#166690)

This PR removes unused loop variables in tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166690
Approved by: https://github.com/justinchuby, https://github.com/mlazos

2025-10-31 10:20:48 +00:00

Tianren Gao

24b6eb7727

[Inductor] Enable Custom Op Autotune Decompositions and Parameter Tuning (#164212)

This PR introduces CustomOp autotuning. It allows users to provide a CustomOpConfig:
(1) to register (optionally) multiple decomposition implementations for custom operations and
(2) to register the parameter tuning knobs and values they want to tune for the decompositions,
so that Inductor automatically selects the best-performing variant through its autotune benchmarking.

Example:
```python
register_custom_op_autotuning(
    custom_op=my_attention_op,
    configs=[
        CustomOpConfig(attention_impl, head_dim=32, method='chunked'),
        CustomOpConfig(attention_impl, head_dim=64, method='tiled'),
        CustomOpConfig(head_dim=128),  # no decomposition
    ],
    input_gen_fns={
        "query": lambda fake: torch.randn_like(fake, device='cuda'),
        "key": lambda fake: torch.randn_like(fake, device='cuda'),
        "value": lambda fake: torch.randn_like(fake, device='cuda'),
    },
)
```

**CustomOpConfig**: Each CustomOpConfig defines exactly one autotuning variant: a set of specific parameter values plus an optional decomposition implemented with PyTorch aten ops. Users can register several configs, each with its own tuning knobs and optional decomposition function, for the same custom operation. The system automatically benchmarks all variants and selects the best-performing one. If a config provides no decomposition, the CustomOp's default implementation is used.
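
To make the "one config, one variant" idea concrete, here is a minimal pure-Python sketch. It does not use Inductor: `VariantConfig`, `resolve_variants`, `wall_clock`, and `pick_best` are hypothetical names invented for illustration, and the naming scheme and timing loop are assumptions rather than what this PR implements.

```python
import functools
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional


@dataclass
class VariantConfig:
    """Hypothetical stand-in for a CustomOpConfig: one (implementation, params) pair."""
    decomposition: Optional[Callable[..., Any]] = None
    params: Dict[str, Any] = field(default_factory=dict)


def resolve_variants(default_impl, configs):
    """Turn configs into (name, callable) pairs, falling back to default_impl."""
    variants = []
    for cfg in configs:
        impl = cfg.decomposition if cfg.decomposition is not None else default_impl
        suffix = "_".join(f"{k}_{v}" for k, v in sorted(cfg.params.items()))
        name = f"{impl.__name__}_{suffix}" if suffix else impl.__name__
        # Bind the tuning parameters so every variant takes the same arguments.
        variants.append((name, functools.partial(impl, **cfg.params)))
    return variants


def wall_clock(fn, *args, repeats=5):
    """Naive timer (assumption): best of a few wall-clock runs, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


def pick_best(variants, args, benchmark=wall_clock):
    """Benchmark every variant and return (best_name, all_timings)."""
    timings = {name: benchmark(fn, *args) for name, fn in variants}
    return min(timings, key=timings.get), timings


def scale_loop(xs, factor=1):
    out = []
    for x in xs:
        out.append(x * factor)
    return out


def scale_comprehension(xs, factor=1):
    return [x * factor for x in xs]


variants = resolve_variants(
    scale_loop,
    [
        VariantConfig(scale_comprehension, {"factor": 2}),
        VariantConfig(params={"factor": 3}),  # no decomposition: default impl
    ],
)
best_name, timings = pick_best(variants, ([1.0, 2.0, 3.0],))
```

The second config carries only a parameter value, so it resolves to the default implementation, mirroring the fallback rule described above.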

**Custom Input Generation**: Users can provide custom input generators via the optional `input_gen_fns` argument to control how synthetic inputs are created during benchmarking. Each generator is keyed by argument name, which enables more realistic performance testing: the generated inputs can match the expected data distribution and characteristics of each tensor argument.
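
The per-argument lookup can be sketched as follows. This is a pure-Python illustration under stated assumptions: `FakeArg`, `default_gen`, and `build_inputs` are hypothetical names, lists of floats stand in for tensors, and the default generator is a guess at reasonable behavior, not Inductor's actual one.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class FakeArg:
    """Hypothetical placeholder that carries only metadata (here: element count)."""
    numel: int


def default_gen(fake: FakeArg) -> List[float]:
    """Assumed default: standard-normal values of the right size."""
    return [random.gauss(0.0, 1.0) for _ in range(fake.numel)]


def build_inputs(
    fakes: Dict[str, FakeArg],
    input_gen_fns: Optional[Dict[str, Callable[[FakeArg], List[float]]]] = None,
) -> Dict[str, List[float]]:
    """Use the user's generator for an argument if one is given, else the default."""
    input_gen_fns = input_gen_fns or {}
    return {
        name: input_gen_fns.get(name, default_gen)(fake)
        for name, fake in fakes.items()
    }


inputs = build_inputs(
    {"a": FakeArg(4), "b": FakeArg(3)},
    input_gen_fns={"a": lambda fake: [0.1] * fake.numel},  # only "a" is customized
)
```

Arguments without an entry in `input_gen_fns` still get benchmark inputs, which is why the argument can stay optional.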

**More examples with autotune logs:**
1. Allow users to register CustomOp decompositions with tuning parameters for autotuning. Example usage:
```python
import torch
from torch._inductor.kernel.custom_op import CustomOpConfig, register_custom_op_autotuning

def decompose_k_implementation(a: torch.Tensor, b: torch.Tensor, k_splits: int = 4) -> torch.Tensor:
    """Matrix multiply with k-way decomposition."""
    # Implementation...with k_splits

@torch.library.custom_op("my_lib::decompose_k", mutates_args=())
def test_decompose_k_op(
    a: torch.Tensor, b: torch.Tensor, k_splits: int
) -> torch.Tensor:
    return decompose_k_implementation(a, b, k_splits)

# Register autotuning with different k_splits values
register_custom_op_autotuning(
    custom_op=test_decompose_k_op,
    configs=[
        CustomOpConfig(decompose_k_implementation, k_splits=2),
        CustomOpConfig(decompose_k_implementation, k_splits=32),
        CustomOpConfig(decompose_k_implementation, k_splits=64),
        # The decomposition is optional; without one, the default impl
        # (test_decompose_k_op) is used.
        CustomOpConfig(k_splits=128),
        CustomOpConfig(k_splits=256),
    ],
    input_gen_fns={
        "a": lambda fake: torch.randn_like(fake, device='cuda') * 0.1,
        "b": lambda fake: torch.randn_like(fake, device='cuda') * 0.1,
    }
)
```

Example result:
```
{"num_choices": 6, "num_triton_choices": 0, "best_kernel": "test_decompose_k_autotuned_fallback_default", "best_time": 0.09980800002813339}
AUTOTUNE test_decompose_k_autotuned(256x65536, 65536x1024)
strides: [65536, 1], [1024, 1]
dtypes: torch.float16, torch.float16
  test_decompose_k_autotuned_fallback_default 0.0998 ms 100.0%
  test_decompose_k_autotuned_decompose_k_implementation_k_splits_2_0 0.1096 ms 91.0% CustomOp decompose_k_implementation_k_splits_2
  test_decompose_k_autotuned_decompose_k_implementation_k_splits_32_1 0.1277 ms 78.2% CustomOp decompose_k_implementation_k_splits_32
  test_decompose_k_autotuned_decompose_k_implementation_k_splits_64_2 0.1454 ms 68.6% CustomOp decompose_k_implementation_k_splits_64
  test_decompose_k_autotuned_decompose_k_implementation_k_splits_128_3 0.1536 ms 65.0% CustomOp decompose_k_implementation_k_splits_128
  test_decompose_k_autotuned_decompose_k_implementation_k_splits_256_4 0.2084 ms 47.9% CustomOp decompose_k_implementation_k_splits_256
```
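
The body of `decompose_k_implementation` is elided above. As general background, a k-way (split-K) matrix multiply slices the shared K dimension into `k_splits` chunks, multiplies each pair of slices, and sums the partial products. The sketch below shows that idea on plain nested lists; it is an assumption about what "k-way decomposition" means here, not the PR's implementation, and `matmul` and `split_k_matmul` are hypothetical helper names.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Reference (n x k) @ (k x m) product."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)] for i in range(n)]


def split_k_matmul(a: Matrix, b: Matrix, k_splits: int) -> Matrix:
    """Split the K dimension into k_splits equal chunks and sum the partial products."""
    k = len(b)
    if k % k_splits != 0:
        raise ValueError("K must be divisible by k_splits in this sketch")
    chunk = k // k_splits
    n, m = len(a), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for s in range(k_splits):
        lo, hi = s * chunk, (s + 1) * chunk
        # Columns lo:hi of a pair with rows lo:hi of b.
        partial = matmul([row[lo:hi] for row in a], b[lo:hi])
        for i in range(n):
            for j in range(m):
                out[i][j] += partial[i][j]
    return out


a = [[1, 2, 3, 4], [5, 6, 7, 8]]
b = [[1, 0], [0, 1], [1, 1], [2, 3]]
result = split_k_matmul(a, b, k_splits=2)
```

Every value of `k_splits` produces the same result, so the knob only changes how the work is partitioned. That is what makes it a candidate for autotuning on shapes with a large K, such as the 256x65536 by 65536x1024 case in the log above.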

2. Allow users to tune a parameter knob by passing the parameter and its candidate values in the CustomOpConfig.
**Example**
```python
import torch
from torch._inductor.kernel.custom_op import CustomOpConfig, register_custom_op_autotuning

def mlp_variants(input_tensor, gate_weight, up_weight, down_weight, method):
    """MLP implementation with different computational approaches."""
    if method == 0:
        # Standard separate matmuls
        ...  # implementation
    elif method == 1:
        # Batched approach with torch.mm
        ...  # implementation
    elif method == 2:
        # Fused weights approach
        ...  # implementation

@torch.library.custom_op("my_lib::mlp_op", mutates_args=())
def mlp_op(
    input_tensor: torch.Tensor,
    gate_weight: torch.Tensor,
    up_weight: torch.Tensor,
    down_weight: torch.Tensor,
    method: int,
) -> torch.Tensor:
    return mlp_variants(
        input_tensor, gate_weight, up_weight, down_weight, method=method
    )

register_custom_op_autotuning(
    custom_op=mlp_op,
    configs=[
        CustomOpConfig(method=0),
        CustomOpConfig(method=1),
        CustomOpConfig(method=2),
        # method=0 is the default fallback in the original op
    ],
    input_gen_fns={
        "input_tensor": lambda fake: torch.randn_like(fake, device='cuda') * 0.1,
        "gate_weight": lambda fake: torch.randn_like(fake, device='cuda') * 0.05,
        # ... other input generators
    },
)
```

Example result:
```
AUTOTUNE test_mlp_autotuned(4x32x512, 512x1024, 512x1024, 1024x256)
  test_mlp_autotuned_mlp_variants_method_2 0.0181 ms 100.0% CustomOp mlp_variants_method_2
  test_mlp_autotuned_mlp_variants_method_1 0.0185 ms 97.8% CustomOp mlp_variants_method_1
  test_mlp_autotuned_mlp_default_fallback_method_0 0.0198 ms 91.4% CustomOp fallback
```
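
The percentage column in these logs is consistent with the best time divided by each variant's time. That reading is inferred from the numbers shown here, not stated in the PR. A short check using the MLP timings above, where `relative_speed` is a hypothetical helper:

```python
from typing import Dict


def relative_speed(timings_ms: Dict[str, float]) -> Dict[str, str]:
    """Map each variant to best_time / its_time, formatted like the log, fastest first."""
    best = min(timings_ms.values())
    ordered = sorted(timings_ms.items(), key=lambda kv: kv[1])
    return {name: f"{best / t:.1%}" for name, t in ordered}


mlp_timings = {
    "mlp_variants_method_2": 0.0181,
    "mlp_variants_method_1": 0.0185,
    "mlp_default_fallback_method_0": 0.0198,
}
percentages = relative_speed(mlp_timings)
```

Under that reading, 91.4% for the fallback means the best variant takes about 91.4% of the fallback's time.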

### Test Suite (`test/inductor/test_custom_op_autotune.py`)

*   **RMSNorm autotuning**: Tests different RMSNorm implementations with dynamic input shapes
*   **MLP autotuning**: Tests different MLP decompositions and tuning of the `method` parameter
*   **DecomposeK**: Tests different `k_splits` values for matrix multiplication decomposed along the K dimension
*   **Multi-parameter tuning**: Tests configs with multiple tuning parameters (`scale_mode`, `chunk_size`)

### Next Steps
- Enable max-autotune with a user-provided max-autotune config. https://github.com/pytorch/pytorch/pull/165526/files
- Support inline epilogue fusion of the selected best CustomOp decomposition with surrounding elementwise ops. https://github.com/pytorch/pytorch/pull/165952/files
- Support CustomOp autotuning that considers fusion with multiTemplateBuffer. WIP

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164212
Approved by: https://github.com/zou3519

2025-10-31 02:28:00 +00:00

2 Commits