[relanding again after fixing internal build]
Summary:
This might cause new DDEs (data-dependent errors) at call sites that do not use is_contiguous_or_false() or sym_is_contiguous().
We want to find those call sites and handle them properly by explicitly calling is_contiguous_or_false() instead of is_contiguous() where appropriate.
I had to fix one issue after removing the implicit size-oblivious reasoning. Here is the context:
In https://github.com/pytorch/pytorch/pull/157472 we defined sym_is_contiguous as the function that computes contiguity for dynamic shapes in C++. It returns a symbolic expression that represents contiguity and is guaranteed not to throw a DDE.
When people call is_contiguous, we do sym_is_contiguous().guard_bool().
When people call is_contiguous_or_false, we do sym_is_contiguous().guard_or_false().
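The relationship between the three entry points can be sketched in plain Python. This is a toy model only: `ToySymBool`, `ToyTensor`, and `DataDependentError` are hypothetical names invented for illustration and are not the PyTorch classes; `None` stands in for a contiguity expression that cannot be decided at trace time.

```python
class DataDependentError(RuntimeError):
    """Hypothetical stand-in for a data-dependent error (DDE)."""


class ToySymBool:
    """Toy symbolic boolean: True, False, or None (undecidable)."""

    def __init__(self, value):
        self.value = value

    def guard_bool(self):
        # A concrete answer is required, so an undecidable value raises.
        if self.value is None:
            raise DataDependentError("cannot decide contiguity")
        return self.value

    def guard_or_false(self):
        # Falls back to False when the value cannot be decided.
        return False if self.value is None else self.value


class ToyTensor:
    def __init__(self, contiguity):
        self._contiguity = contiguity

    def sym_is_contiguous(self):
        # Never raises: it only returns the symbolic expression.
        return ToySymBool(self._contiguity)

    def is_contiguous(self):
        return self.sym_is_contiguous().guard_bool()

    def is_contiguous_or_false(self):
        return self.sym_is_contiguous().guard_or_false()
```

In this model, only `is_contiguous()` can raise; callers that can tolerate a conservative "not contiguous" answer avoid the error by using `is_contiguous_or_false()`.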
One issue that was not handled well was this path:
```
c10::SymBool TensorImpl::sym_is_contiguous_custom(
    at::MemoryFormat memory_format) const {
  if (C10_UNLIKELY(matches_python_custom(SizesStridesPolicy::CustomStrides))) {
    return pyobj_slot_.load_pyobj_interpreter()->is_contiguous(
        this, memory_format);
  }
  return sym_is_contiguous_default(memory_format);
}
```
Namely, if we call sym_is_contiguous_custom and matches_python_custom(SizesStridesPolicy::CustomStrides) returns true, we used to call is_contiguous(this, memory_format).
That call went through load_pyobj_interpreter and ended up in the Python is_contiguous call, which used implicit size-oblivious reasoning.
Once we removed that implicit size-oblivious reasoning, the right thing to call is
return pyobj_slot_.load_pyobj_interpreter()->sym_is_contiguous(this, memory_format);
Otherwise we would get a DDE even if the caller is using sym_is_contiguous.
So I had to define it for the pyinterpreter and then override it for nested tensors.
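A rough sketch of why the routing matters, again as a toy model in plain Python. `ToyInterpreter` and the two `sym_is_contiguous_custom_*` functions are hypothetical illustrations, not the real pyinterpreter API; `None` means the contiguity cannot be decided.

```python
class DataDependentError(RuntimeError):
    """Hypothetical stand-in for a data-dependent error (DDE)."""


class ToyInterpreter:
    """Toy model of the Python-side hooks for a custom-strides tensor."""

    def __init__(self, contiguity):
        self.contiguity = contiguity  # True, False, or None (undecidable)

    def is_contiguous(self):
        # Guarding hook: needs a concrete bool, so an unknown value raises.
        if self.contiguity is None:
            raise DataDependentError("undecidable contiguity")
        return self.contiguity

    def sym_is_contiguous(self):
        # Symbolic hook: returns the possibly unknown value without guarding.
        return self.contiguity


def sym_is_contiguous_custom_old(interp):
    # Old routing: the symbolic entry point called the guarding hook.
    return interp.is_contiguous()


def sym_is_contiguous_custom_new(interp):
    # New routing: the symbolic entry point calls the symbolic hook.
    return interp.sym_is_contiguous()


def is_contiguous_or_false(interp, sym_fn):
    # Caller that tolerates an unknown answer by treating it as False.
    result = sym_fn(interp)
    return False if result is None else result
```

With the old routing, even a caller that only wants `is_contiguous_or_false` semantics hits the error, because the guard happens before the caller gets a chance to fall back.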
Approved by: https://github.com/ezyang
Test Plan:
contbuild & OSS CI, see e444cd24d4
Rollback Plan:
Differential Revision: D80435179
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160869
Approved by: https://github.com/ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159197
Approved by: https://github.com/ezyang
## Summary
Adds the missing batching rule for `torch.matrix_exp` to enable efficient `vmap` support.
Previously, using `vmap` with `matrix_exp` would trigger a performance warning and fall back to a slow loop-based implementation, even though `matrix_exp` natively supports batched inputs.
Fixes #115992
## Details
`torch.matrix_exp` is an alias for `torch.linalg.matrix_exp`. This PR adds vmap support by registering `matrix_exp` with `OP_DECOMPOSE`, which reuses the existing CompositeImplicitAutograd decomposition to automatically generate batching behavior from the operation's simpler component operations.
## Testing
The existing test suite for vmap and matrix_exp should cover this change. The fix enables:
- No performance warning when using `vmap(torch.matrix_exp)`
- Efficient native batched execution instead of loop-based fallback
**Edit:** Updated Details section to accurately reflect the implementation approach (decomposition rather than batch rule registration)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155202
Approved by: https://github.com/zou3519
Summary:
This diff adds two new operators, torch.ops._quantized.wrapped_linear_prepack and torch.ops._quantized.wrapped_quantized_linear_prepacked. Together they decompose the op torch.ops._quantized.wrapped_quantized_linear added in the previous diff.
We decomposed it this way because the packed weight can be computed early, so we don't need to recompute it in every forward pass in AOTI.
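The idea of splitting a fused op into a one-time prepack step and a per-call step can be sketched with plain Python lists. The functions below are hypothetical toys and do not reflect the real operators' signatures or packing format; here "packing" is just a transpose of the weight.

```python
def toy_prepack(weight, bias):
    """Hypothetical prepack: transpose the [out, in] weight once, up front."""
    weight_t = [list(col) for col in zip(*weight)]  # shape [in, out]
    return {"weight_t": weight_t, "bias": list(bias)}


def toy_linear_prepacked(x, packed):
    """Per-call step: computes x @ weight.T + bias from the packed form."""
    weight_t, bias = packed["weight_t"], packed["bias"]
    return [
        [
            sum(xi * w_row[j] for xi, w_row in zip(row, weight_t)) + bias[j]
            for j in range(len(bias))
        ]
        for row in x
    ]


def toy_linear_fused(x, weight, bias):
    """Fused form: repacks the weight on every call."""
    return toy_linear_prepacked(x, toy_prepack(weight, bias))


# Pack once, then reuse the packed weight across forward calls.
packed = toy_prepack([[1, 2], [3, 4], [5, 6]], [0.5, 0, -1])
first = toy_linear_prepacked([[1, 1]], packed)
second = toy_linear_prepacked([[2, 0]], packed)
```

Both forms give the same result; the decomposed form simply moves the packing cost out of the per-call path.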
Reviewed By: jerryzh168
Differential Revision: D61395887
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134232
Approved by: https://github.com/houseroad
Continuation of #79979.
Fixes #79161
This PR does the following:
* Expands the `parametrize_fn()` signature from returning a 3-tuple of `(test, test_name, param_kwargs)` to returning a 4-tuple of `(test, test_name, param_kwargs, decorator_fn)`. Expected signature for the addition is `decorator_fn(param_kwargs) -> List[decorator]` i.e. given the full set of test params, return a list of decorators to apply.
* `modules`, `ops`, and `parametrize` now fit the new signature, returning `decorator_fn`s instead of applying decorators themselves.
* `instantiate_parametrized_tests()` and `instantiate_device_type_tests()` now call the returned `decorator_fn`, passing in the full set of `param_kwargs` (after composition + `device` / `dtype` additions) and applying the returned decorators.
* Composing multiple `parametrize_fn`s also composes the corresponding `decorator_fn`s; the composed `decorator_fn` simply concatenates the decorator lists returned by the constituents.
* Expands `DecorateInfo.is_active` to support callables:
```python
DecorateInfo(
    unittest.expectedFailure, "TestOps", "test_python_ref_executor",
    device_type='cuda', active_if=lambda params: params['executor'] == 'nvfuser'
),
```
* Adds several tests to `test/test_testing.py` ensuring proper decoration using `@parametrize`, `@modules`, and `@ops`.
* (minor) Fixes a couple of `ModuleInfo` naming oddities uncovered during testing.
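The `decorator_fn` composition described in the list above can be sketched as follows. This is a minimal illustration, not the code in the PR: `compose_decorator_fns`, `apply_decorators`, `xfail_on_nvfuser`, and `skip_on_cpu` are hypothetical helpers, and the order in which decorators are applied is an assumption.

```python
import unittest


def compose_decorator_fns(*decorator_fns):
    """Return a decorator_fn that concatenates the constituents' lists."""
    def composed(param_kwargs):
        decorators = []
        for fn in decorator_fns:
            decorators.extend(fn(param_kwargs))
        return decorators
    return composed


def apply_decorators(test, decorator_fn, param_kwargs):
    """Apply the decorators chosen for this parameter set, in list order."""
    for decorator in decorator_fn(param_kwargs):
        test = decorator(test)
    return test


def xfail_on_nvfuser(param_kwargs):
    # Active only when the full parameter set selects the nvfuser executor.
    if param_kwargs.get("executor") == "nvfuser":
        return [unittest.expectedFailure]
    return []


def skip_on_cpu(param_kwargs):
    # Active only when the test is instantiated for the cpu device.
    if param_kwargs.get("device") == "cpu":
        return [unittest.skip("toy skip on cpu")]
    return []


composed = compose_decorator_fns(xfail_on_nvfuser, skip_on_cpu)


def sample_test():
    pass


decorated = apply_decorators(
    sample_test, composed, {"executor": "nvfuser", "device": "cuda"}
)
```

Because each `decorator_fn` sees the full `param_kwargs`, a decorator can depend on parameters contributed by a different `parametrize_fn`, which matches the callable `active_if` case shown in the `DecorateInfo` example.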
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91658
Approved by: https://github.com/malfet
Fixes https://github.com/pytorch/functorch/issues/1087
It looks like there are `306` rules that should be looked into
```
test/functorch/test_vmap_registrations.py .x.....xxxxxxx.x.x.x.x.x.x.x.x........xx.x.x..x.x.xxx...xxxx.x.x.x........x.........xxxxx..x..x.....xx...xx.....xxx.xxxxxxxxxxxxxxxxx.. [ 24%]
.........x.x......x.xxxxxx..x..xx.x.xxx.x.......x.xxx.xx..xxx.xxx...xxxxx.x....xxxxxxxxxxxxxxx....xx.xxx.xx.x...xx...xx...xxxxxx...xxxxx..x...xxxxxxxxxxxx..xx..xx.xx.x..xxxx..xx [ 56%]
.xx..x.x....xxxxxx.x.xx...xxxxx.xx...x..x.x.xx...xx.xxxxxx.xxxxxx..x........xxxxxxxx..xxxxxxxx..xx.xxxxxxxxxxxxxxxxxxxxxxx..........xxxx.xxxx.........xxxxxxxx..xxx..xxx.x.x.x.xx [ 88%]
xx.xxx.x......xxx.x.xxxxxxxx....x......xxxxxxxxx.xx.x.x.x.......xx [100%]
=================================================================== 249 passed, 1185 deselected, 306 xfailed in 3.17s ===================================================================
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91367
Approved by: https://github.com/zou3519