Our three main users are OK with this, with two of them (foreach_map,
invoke_quant) preferring it like this.
I was originally worried about BC issues (this now means you cannot add
any positional args) but I think that's not a concern -- one can always
add kwonly args.
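A generic illustration (not the PyTorch API) of that reasoning: once a signature takes `*operands`, new options must be keyword-only, and adding one keeps existing call sites working.
```python
# Hypothetical function names; the point is only the calling convention.
def invoke(subgraph, *operands, scheme=None):
    return subgraph(*operands)

# A later revision adds another keyword-only argument; old calls are unaffected.
def invoke_v2(subgraph, *operands, scheme=None, precision=None):
    return subgraph(*operands)

print(invoke(lambda a, b: a + b, 1, 2))
print(invoke_v2(lambda a, b: a + b, 1, 2, precision="bf16"))
```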
Test Plan
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146730
Approved by: https://github.com/ydwu4, https://github.com/mlazos
In a fusion model, users might change the state_dict keys via a state_dict_hook.
The load_state_dict APIs here do not call model.state_dict(), so the hooks are never invoked to change the keys, causing a mismatch between the FQNs and the state_dict keys.
This PR suggests users declare how they change the state_dict key prefix (they can name the attribute; here we call it "fqn_modifiers" by default).
During state_dict loading, we apply the prefix change while computing the FQNs, so the keys can be processed the same way as if they had gone through the state_dict hook.
For example:
There's a state_dict_hook:
```
def _state_dict_hook(self, destination, prefix, keep_vars):
    """Remove "embedding" from the original embedding in the state_dict
    name. This keeps the original state dict name for the embedding
    from before fusing with the FusionEmbedding.

    [!Note] This update changes the order of the OrderedDict
    """
    key = prefix + "embedding.weight"
    new_key = prefix + "weight"
    destination[new_key] = destination[key]
    del destination[key]
```
In DSD after this PR, we would skip "embedding." before "weight" if we find the "fqn_modifiers" attribute on that module:
```
def fqn_modifiers(self) -> Dict[str, str]:
    return {
        "weight": "embedding",
    }
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146557
Approved by: https://github.com/fegin
This PR adds `torch._scaled_mm` for the CPU backend.
`_scaled_mm_out_cpu` and `_scaled_mm_cpu` are newly added and included in the `torch._scaled_mm` CPU dispatch. We also add `_scaled_mm_out_cpu_emulated` as a fallback when the current platform cannot run FP8 matmul via oneDNN. This PR also updates the various FP8-related UTs to support CPU tests.
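As a hedged illustration (not code from the PR), the call below assumes the existing CUDA-style `torch._scaled_mm` signature and FP8 dtypes carry over unchanged to the new CPU dispatch:
```python
import torch

# Hypothetical per-tensor-scaled FP8 matmul on CPU; shapes and scale values are placeholders.
a = torch.randn(16, 32).to(torch.float8_e4m3fn)
b = torch.randn(64, 32).to(torch.float8_e4m3fn).t()  # column-major mat2, mirroring the CUDA requirement
scale_a = torch.tensor(1.0)
scale_b = torch.tensor(1.0)

out = torch._scaled_mm(a, b, scale_a, scale_b, out_dtype=torch.bfloat16)
print(out.shape)  # torch.Size([16, 64])
```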
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139975
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet
Use `__setattr__` and `__getattribute__` to wrap the existing `dtypesIfXYZ` attributes, which will allow subsequent incremental elimination of them.
Also, the type annotation for OpInfo is a sham: it claims that `dtypes` and `dtypesIfXYZ` must be of type `_dispatch_dtypes`, but in reality they are converted to sets in post-init.
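A minimal sketch (not the actual OpInfo implementation) of how `__setattr__`/`__getattribute__` can route the legacy `dtypesIfXYZ` attributes through a single per-backend mapping; all names here are hypothetical:
```python
class OpInfoLike:
    def __init__(self):
        # Bypass the custom __setattr__ for internal storage.
        object.__setattr__(self, "_per_backend_dtypes", {})

    def __setattr__(self, name, value):
        if name.startswith("dtypesIf"):
            # Store as a set, mirroring the post-init conversion mentioned above.
            self._per_backend_dtypes[name[len("dtypesIf"):]] = set(value)
        else:
            object.__setattr__(self, name, value)

    def __getattribute__(self, name):
        if name.startswith("dtypesIf"):
            store = object.__getattribute__(self, "_per_backend_dtypes")
            return store.get(name[len("dtypesIf"):], set())
        return object.__getattribute__(self, name)

info = OpInfoLike()
info.dtypesIfCUDA = ["float32", "bfloat16"]  # stored in the per-backend dict
print(info.dtypesIfCUDA)                     # read back through __getattribute__
```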
Test Plan:
- Check that `op_db[0].dtypesIfCUDA` and others shows the same values as before, by running the following script
```python
from torch.testing._internal.common_methods_invocations import op_db
print({name: getattr(op_db[0], f'dtypesIf{name}') for name in ['CUDA', 'ROCM', 'XPU', 'Hpu']})
```
- CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146905
Approved by: https://github.com/janeyx99
Adds an `invoke_quant` higher-order operator as proposed [here](https://docs.google.com/document/d/1s2PfJlq6Q1F8l11CkTIC69BW1rEnGEgs6YmBC7hu8rA/edit?tab=t.0).
The primary motivations are
- Unifying scattered reasoning for quant operators throughout the code base
- Ease of pattern matching - see this very large pattern-match expression [here](949fdd2997/torch/_inductor/fx_passes/post_grad.py) (L390-L426), compared to the pattern I have in the tests:
```
@register_graph_pattern(
    CallFunction(
        torch.ops.aten.mm,
        CallFunction(
            torch.ops.higher_order.invoke_quant,
            Ignored(),
            Ignored(),
            Ignored(),
            scheme="nf4",
        ),
        Arg(),
    ),
    pass_dict=test_pass,
)
```
- Ability to specify inductor-specific logic, like codegen'ing the operators in lower precision, or forcing fusion to a matmul.
Example graph:
``` Python
===== AFTER POST GRAD =====
/data/users/eellison/pytorch/torch/fx/_lazy_graph_module.py class <lambda>(torch.nn.Module):
    def forward(self, arg0_1: "f32[8][1]cpu", arg1_1: "f32[8][1]cpu"):
        # File: /data/users/eellison/pytorch/torch/_higher_order_ops/invoke_quant.py:87 in __call__, code: return invoke_quant_tracer(*args, **kwargs, quant_options=self) # type: ignore[call-arg]
        repeated_subgraph0 = self.repeated_subgraph0
        invoke_quant: "f32[8][1]cpu" = torch.ops.higher_order.invoke_quant(repeated_subgraph0, arg0_1, arg1_1, scheme = 'nf4'); repeated_subgraph0 = arg0_1 = arg1_1 = None
        return (invoke_quant,)

    class repeated_subgraph0(torch.nn.Module):
        def forward(self, arg0_1: "f32[8][1]cpu", arg1_1: "f32[8][1]cpu"):
            # File: /data/users/eellison/pytorch/torch/_higher_order_ops/invoke_quant.py:87 in __call__, code: return invoke_quant_tracer(*args, **kwargs, quant_options=self) # type: ignore[call-arg]
            mul: "f32[8][1]cpu" = torch.ops.aten.mul.Tensor(arg0_1, arg1_1); arg0_1 = None
            add: "f32[8][1]cpu" = torch.ops.aten.add.Tensor(mul, arg1_1); mul = arg1_1 = None
            return add
```
The schema for `invoke_quant` is `torch.ops.higher_order.invoke_quant(subgraph, *args, scheme=None)` where the scheme will not always be present.
I wasn't sure exactly how the inductor-specific configurations like `codgen_in_low_precision` should be passed through. I didn't want to stuff them all in as kwargs, and I didn't want them to affect pattern matching. So they will be stored in the meta of the node itself. And, following that, I wanted the invocation of the HOP to match how it will show up in the graph, so I decided to make it an object that is then invoked for the tracing.
```
invoke_quant = InvokeQuant(codegen_low_precision=True)
invoke_quant(gn, (x, y), scheme="nf4")
```
TODO: stop requiring the args to be packed in a tuple; will do this following https://github.com/pytorch/pytorch/pull/139162.
Feedback welcome.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139102
Approved by: https://github.com/Chillee
# Feature
Inductor sometimes uses `Identity` functions to group various terms of an expression. While this is convenient in some scenarios, it can frustrate pattern matching. For example, when we're matching an indexing expression to tell whether it can be represented as a block pointer, that analysis should be invariant to `Identity` wrappers.
This PR adds a few features to achieve this invariance.
- Create a new expansion mode `expr.expand(identity=True)`, which removes all `Identity` functions from the expression (see the sketch after this list).
- Preprocess the expression with this expansion prior to pattern matching.
- Bonus: create a new test utility function called `dummy_graph()`, which creates a simple `GraphLowering`. This is useful for testing the pattern matcher, as we need to initialize `V.graph` before we can access `V.graph.sizevars`.
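Here is a minimal sketch of the new expansion mode from the first bullet, assuming `Identity` lives in `torch.utils._sympy.functions` and that the `identity=True` hint strips the wrappers as described:
```python
import sympy
from torch.utils._sympy.functions import Identity

x, y = sympy.symbols("x y")
expr = Identity(x) + 2 * Identity(y)

# The identity=True hint is expected to remove the Identity wrappers,
# leaving a plain expression that the pattern matcher can analyze.
print(expr.expand(identity=True))  # expected: x + 2*y
```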
# Test plan
This PR adds a few new unit tests:
- Added a unit test specifically for `expr.expand(identity=True)`.
- Added a new unit test module for the block pattern matcher. Tested that we can correctly match some example patterns containing Identity ops.
I originally intended to add an end-to-end test compiling pointwise cat and mapping the corresponding memory accesses to block pointers. However, it looks like that will take more work, since the [relevant code path](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/codegen/triton.py#L1306) disables block pointer analysis. It might be better to defer that to a future PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146000
Approved by: https://github.com/eellison, https://github.com/jansel
Fixes #146404
Adds changes to the matmul and matmul_backward operations for nested jagged tensors, to support backpropagation when the output is a regular strided tensor.
This required adding support for the nested matmul operation when the nested tensor isn't 'self', i.e.
`A@B` where `A` isn't nested but `B` is.
The operation schemas had to be updated to reflect that either input (and the gradient) can be a strided tensor instead, so an extra assertion is added for the edge case where neither input is nested.
Unit tests are also added.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146405
Approved by: https://github.com/soulitzer, https://github.com/jbschlosser
Not yet ready to set HAS_GPU to true, but we can unskip tests that require a GPU.
(Noticed while running test_mps_basics.py that `test_scalar_cpu_tensor_arg` is getting skipped.)
- Replace `GPU_TYPE` with `self.device` in `test_custom_op_fixed_layout_sequential`, `test_inductor_layout_optimization_input_mutations`, and `test_mutable_custom_op_fixed_layout2`, otherwise the GPU tests just run as the `_cpu`-suffixed variants.
- Tweak `test_tmp_not_defined_issue3` to work correctly on CPU, by defining `test_device` and `test_device_0`.
- Un-XFail `test_mutable_custom_op_fixed_layout2_dynamic_shapes`, as it should just work on CPU.
- Add a `skip_if_no_triton` decorator and decorate `test_reduction_config_limit` with it, as it needs neither CPU nor GPU, but rather a Triton backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145156
Approved by: https://github.com/dcci, https://github.com/Skylion007, https://github.com/jansel
This PR improves opcheck to:
1. directly use torch.testing.assert_close (without a msg override).
This allows it to print the absolute and relative differences and the
number of mismatched elements.
2. take in an atol/rtol tolerance (for when someone just wants to use
opcheck in their testing); see the sketch below.
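A hedged sketch of item 2, assuming the new tolerances land as `atol`/`rtol` keyword arguments on `torch.library.opcheck`; the custom op below is purely illustrative:
```python
import torch
from torch.library import custom_op, opcheck

@custom_op("mylib::my_sin", mutates_args=())
def my_sin(x: torch.Tensor) -> torch.Tensor:
    return torch.sin(x)

@my_sin.register_fake
def _(x):
    return torch.empty_like(x)

x = torch.randn(8)
# Loosened tolerances for ops with expected numerical noise; values are placeholders.
opcheck(my_sin, (x,), atol=1e-4, rtol=1e-4)
```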
Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146488
Approved by: https://github.com/williamwen42
In Python 3.12, the error message has changed from "Can't pickle local object" to "Can't get local object".
The old regex would no longer catch the error.
This PR makes it compatible with Python 3.12 while remaining backward compatible.
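For illustration, a regex along these lines (not necessarily the exact one in the PR) accepts both messages:
```python
import re

# Matches the pre-3.12 "Can't pickle local object" and 3.12+ "Can't get local object" messages.
pattern = re.compile(r"Can't (pickle|get) local object")

assert pattern.search("Can't pickle local object 'f.<locals>.g'")  # Python <= 3.11 style
assert pattern.search("Can't get local object 'f.<locals>.g'")     # Python >= 3.12 style
```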
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145945
Approved by: https://github.com/H-Huang
This enables a check that a class which only inherits from immutable classes like str, tuple, and NamedTuple also defines `__slots__`, so instances don't allocate memory unnecessarily. This also ensures contributors think about how they define classes that subclass NamedTuple and str, of which we have many in our codebase.
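For illustration (hypothetical classes, not from the codebase), the check asks for definitions like these:
```python
from typing import NamedTuple

class Point(NamedTuple):
    x: int
    y: int

class LabeledPoint(Point):
    __slots__ = ()  # without this, each instance would also allocate a __dict__

class Tag(str):
    __slots__ = ()  # str subclass that stays as lean as str itself

print(LabeledPoint(1, 2), Tag("relu"))
```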
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146276
Approved by: https://github.com/aorenste
This PR:
- adds pytree.register_constant for registering a class to be treated as
a constant by torch.compile/torch.fx (see the sketch after this list)
- adds a very barebones flat_apply HOP. This should be sufficient to get
mark_traceable working. A lot more work is necessary to get the custom
operator case working (when make_fx sees a custom operator with PyTree
arg types, it needs to emit a call to the flat_apply HOP).
- I expect the flat_apply HOP to change a lot, I want to ship this in
the current state to unblock the mark_traceable and custom ops
work.
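A hedged sketch of the first item; the import path `torch.utils._pytree.register_constant` and its requirements are assumptions based on the PR text:
```python
import torch.utils._pytree as pytree

class ScaleConfig:
    """Hypothetical config object that should be treated as an opaque constant."""

    def __init__(self, factor: float):
        self.factor = factor

# After registration, pytree flattening (and the torch.compile / torch.fx
# machinery built on it) treats ScaleConfig instances as constants rather
# than trying to flatten them into leaves.
pytree.register_constant(ScaleConfig)
```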
Test Plan:
- It's kind of difficult to test the barebones flat_apply HOP "works" so
I added a really simple test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146060
Approved by: https://github.com/StrongerXi, https://github.com/yanboliang
ghstack dependencies: #146059
Summary:
Previously, the aoti compile node was represented as a kernel-less custom op in the exported program. The node was not runnable in eager mode, which is a common practice for numerical validation during lowering.
I introduce a new HOP to address this.
The schema is the following:
```
aoti_call_delegate(lowered_module: AOTInductorEPModule, original_gm: fx.GraphModule, weights: List[Tensor], inputs: List[Tensor])
```
There are a few problems exposed by the HOP:
- AOTI expects an FX graph with weights as getattr nodes, aka a stateful graph, while the HOP expects graph_module arguments to be stateless. The export serializer also expects a stateless graph. Currently, to make AOTI happy, I am keeping `original_gm` stateful and bypassing serialization for `original_gm`.
- As a result, the HOP is not re-traceable, as functionalization on a stateful graph module argument will fail.
Test Plan: buck2 test 'fbcode//mode/opt' fbcode//deeplearning/aot_inductor/cpu/test:cpu_lowering_utils_test
Reviewed By: zhxchen17
Differential Revision: D68359391
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145630
Approved by: https://github.com/zou3519
Summary:
This allows us to use environment variables to set string values. We've added
tests for the specific functionality implemented here. Note that we already
accidentally started setting up configs to use this, so we're just adding the
feature.
Additionally, we're not fully validating the underlying type when we set the
value (and in general, it's more difficult than we would like to do this). Let
me know if people feel strongly, and we can add a PR to do this.
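As a generic illustration only (not the actual config-module code), the behavior amounts to reading a string default from the environment; the variable name below is hypothetical:
```python
import os

def string_from_env(env_name: str, default: str) -> str:
    # Unset variable -> fall back to the default; any set value wins.
    value = os.environ.get(env_name)
    return default if value is None else value

backend = string_from_env("HYPOTHETICAL_TORCH_BACKEND", "inductor")
print(backend)
```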
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145980
Approved by: https://github.com/yushangdi, https://github.com/oulgen
E.g. torch.ops.higher_order.cond does not exist until it is imported,
which is bad if it shows up in an FX graph or is used in some code
somewhere.
This PR also makes some more HOPs get imported at `import torch` time.
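For illustration, after this change the following is expected to work right after `import torch`, with no explicit import of the HOP's defining module (this shows the described behavior, not a new API):
```python
import torch

# Previously this could fail until the module defining the HOP was imported.
print(torch.ops.higher_order.cond)
```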
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145939
Approved by: https://github.com/ydwu4
ghstack dependencies: #145938
Triton 2.2 and greater have a bug where allowing TF32 generation for a GPU that does not support TF32 will cause code generation errors. Patch around this problem by:
1. Adding a function to `torch.cuda` that determines whether CUDA hardware is capable of using the TF32 format.
2. Using that function to explicitly disable TF32 generation when calling Triton, where needed.
To demonstrate that this fix works, try running `test/inductor/test_max_autotune.py` without this fix on a GPU with CUDA compute capability < 8 (i.e., a pre-Ampere NVIDIA GPU).
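A minimal sketch of the capability check from step 1, using the public `torch.cuda.get_device_capability` API (the actual helper name added by this PR may differ):
```python
import torch

def tf32_supported(device: int = 0) -> bool:
    # TF32 requires Ampere (compute capability 8.0) or newer.
    if not torch.cuda.is_available():
        return False
    major, _ = torch.cuda.get_device_capability(device)
    return major >= 8
```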
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145684
Approved by: https://github.com/eqy