Fixes #161010 by making `clone_meta` match the stride semantics of eager mode.
The rule is:
* Case 1: The tensor is non-overlapping and dense (`is_non_overlapping_and_dense`); in this case, the output stride should match the input tensor's stride.
* Case 2: Otherwise, the output stride should be contiguous, computed from the input tensor using `compute_elementwise_output_strides`.
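The two cases can be sketched in plain Python. This is a simplified illustration, not the code in `clone_meta`: the helper names are hypothetical, `dense_strides_like` is only a stand-in for `compute_elementwise_output_strides`, and the sketch assumes positive sizes and strides (no broadcast dimensions with stride 0).

```python
def is_non_overlapping_and_dense(shape, strides):
    """Return True if the strides describe a gap-free, non-overlapping layout."""
    # Size-1 dimensions never affect the layout, so they are ignored.
    dims = sorted((st, sz) for sz, st in zip(shape, strides) if sz != 1)
    expected = 1
    for st, sz in dims:
        if st != expected:
            return False
        expected *= sz
    return True


def dense_strides_like(shape, strides):
    """Hypothetical stand-in: dense strides that keep the input's dim ordering."""
    # Visit dimensions from innermost (smallest stride) to outermost.
    order = sorted(range(len(shape)), key=lambda d: (strides[d], -d))
    out = [0] * len(shape)
    running = 1
    for d in order:
        out[d] = running
        running *= shape[d]
    return tuple(out)


def clone_strides(shape, strides):
    """Pick the output strides for a clone, following the two cases above."""
    if is_non_overlapping_and_dense(shape, strides):
        return tuple(strides)                   # Case 1: keep input strides
    return dense_strides_like(shape, strides)   # Case 2: recompute dense strides
```

For example, a transposed tensor with shape `(3, 2)` and strides `(1, 3)` is dense, so its strides are kept, while a sliced tensor with shape `(2, 2)` and strides `(4, 1)` has gaps and gets fresh dense strides `(2, 1)` in this sketch.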
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163017
Approved by: https://github.com/williamwen42, https://github.com/xmfan
Co-authored-by: morrison-turnansky <mturnans@redhat.com>
- This pull request introduces support for the [OCP Micro-scaling (MX) format](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf), with a focus on compatibility with AMD **ROCm 7.0** and the **gfx950** architecture.
This PR also establishes the foundation for enabling MX-FPX features in [TorchAO](https://github.com/pytorch/ao/issues/2229) on the AMD platform.
- Validation (**ROCm 7.0** + **gfx950** required):
`111 relevant tests passing.`
> PYTORCH_TEST_WITH_ROCM=1 python test/test_matmul_cuda.py -k test_blockwise -v
Co-author: @jagadish-amd — Thank you for the efforts leading validation on gfx950 with ROCm 7.0.
-----------------------------------
This pull request introduces support for new scalar types and scaling methods, particularly for ROCm 7.0 and gfx950, and refines testing for these features. Key changes include adding constraints for matrix dimensions, enabling block-wise scaling, and updating tests to accommodate new data types.
### Support for new scalar types and scaling methods:
* [`aten/src/ATen/cuda/CUDABlas.cpp`](diffhunk://#diff-74fcb26047c1df4024105d36ce22a36b77cf8cc93c28631d743e639b3d6066aeR1876-R1885): Added constraints for matrix dimensions when using `Float8_e8m0fnu` with block-wise scaling, ensuring dimensions are multiples of 32. Updated compatibility checks to support ROCm 7.0 for `Float8_e8m0fnu` and `Float8_e4m3fn`. [[1]](diffhunk://#diff-74fcb26047c1df4024105d36ce22a36b77cf8cc93c28631d743e639b3d6066aeR1876-R1885) [[2]](diffhunk://#diff-74fcb26047c1df4024105d36ce22a36b77cf8cc93c28631d743e639b3d6066aeL1913-R1934)
* [`aten/src/ATen/native/cuda/Blas.cpp`](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abR1276-R1290): Introduced block-wise scaling for `Float8_e8m0fnu`, with checks for ROCm 7.0 and GPU architecture `gfx950`. Added validation for supported scalar types and matrix dimensions. [[1]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abR1276-R1290) [[2]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abR1349-R1364)
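The dimension constraint can be illustrated with a small check. This is a sketch only: the helper name `check_blockwise_dims` and its error message are hypothetical, and the block size of 32 is taken from the "multiples of 32" requirement described above. The actual checks live in the C++ files listed.

```python
BLOCK_SIZE = 32  # assumed from the "multiples of 32" constraint above


def check_blockwise_dims(*dims):
    """Hypothetical helper: reject dimensions that are not multiples of 32."""
    bad = [d for d in dims if d % BLOCK_SIZE != 0]
    if bad:
        raise ValueError(f"dimensions {bad} are not multiples of {BLOCK_SIZE}")
    # Number of 32-element blocks along each dimension.
    return tuple(d // BLOCK_SIZE for d in dims)
```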
### Updates to scalar type mappings:
* [`aten/src/ATen/cuda/CUDADataType.h`](diffhunk://#diff-9188bb13b1a49f459141f5f9b875593d1c5ce2beb5ad711fdbaf5bc7089ec015L93-R93): Extended scalar type mappings to support `Float4_e2m1fn_x2` for ROCm 7.0.
* [`aten/src/ATen/cuda/tunable/GemmHipblaslt.h`](diffhunk://#diff-bfa1a3b5d4bef1892bf50338775f3b0fd8cd31fc1868148f3968b98aefb68e3fR88-R96): Added a constexpr mapping for `Float4_e2m1fn_x2` based on ROCm version.
### Enhancements to testing (@jagadish-amd):
* [`test/test_matmul_cuda.py`](diffhunk://#diff-3f31c52b48cfddf8f4617d809f7695b2e4a1c78656f8c4b5143a4b45d01fcf23R765-R766): Updated tests to include new scalar types (`Float4_e2m1fn_x2`) and recipes (`mxfp4`). Added logic to handle different scaling recipes and validate compatibility with ROCm and CUDA versions. [[1]](diffhunk://#diff-3f31c52b48cfddf8f4617d809f7695b2e4a1c78656f8c4b5143a4b45d01fcf23R765-R766) [[2]](diffhunk://#diff-3f31c52b48cfddf8f4617d809f7695b2e4a1c78656f8c4b5143a4b45d01fcf23L1331-R1356)
These changes improve compatibility with newer hardware and software versions, enhance functionality for matrix operations, and ensure robust testing for the added features.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151360
Approved by: https://github.com/drisspg, https://github.com/malfet
Fixes #157452
Test with
```
python test/dynamo/test_repros.py ReproTests.test_nn_parameter_ctor_graph_breaks
```
### Release Notes
Change to nn.Parameter Constructor Behavior in Dynamo
This PR introduces a semantic change in how Dynamo handles the nn.Parameter constructor. Previously, if the constructor lacked a clean source, the system would attempt to infer arguments to construct a clone and lift this synthetic proxy into the computation graph. This approach had many potential edge cases and was difficult to reason about. The new behavior defaults to graph breaking when the nn.Parameter constructor does not have a clean source. In such cases, users are advised to manually move the constructor out of the graph. This change improves clarity and reduces complexity in graph construction and debugging. If moving the constructor is not possible, users can fall back to the old semantics with `torch._dynamo.config.graph_break_on_nn_param_ctor=False`.
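A minimal sketch of the suggested workaround, assuming a hypothetical function `scale` and the `eager` backend to keep the example light: the parameter is constructed in eager code and passed into the compiled function, so no constructor call appears inside the graph.

```python
import torch


@torch.compile(backend="eager")
def scale(x, weight):
    # No nn.Parameter is constructed inside the compiled region.
    return x * weight


# Construct the parameter outside the graph and pass it in as an input.
weight = torch.nn.Parameter(torch.full((3,), 2.0))
out = scale(torch.ones(3), weight)
```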
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158800
Approved by: https://github.com/anijain2305
Fixes https://github.com/pytorch/pytorch/issues/158164
This was fixed by applying `skip_code_recursive` to any function registered to `sys.monitoring` (via `PyThreadState_GET()->interp->monitoring_callables`). This check is done whenever we attempt to set the eval frame callback from Python.
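For context, a callable ends up among the monitoring callables when it is registered through the standard-library `sys.monitoring` API (Python 3.12+). The sketch below shows only that registration, not the Dynamo check; the function names and the choice of `PROFILER_ID` are illustrative assumptions.

```python
import sys


def record_py_start_events(fn):
    """Run fn() with a PY_START callback registered; return seen code names."""
    seen = []
    if not hasattr(sys, "monitoring"):  # sys.monitoring needs Python 3.12+
        fn()
        return seen

    mon = sys.monitoring
    tool_id = mon.PROFILER_ID

    def on_py_start(code, instruction_offset):
        # This is the kind of callable that gets registered with sys.monitoring.
        seen.append(code.co_name)

    mon.use_tool_id(tool_id, "demo")
    try:
        mon.register_callback(tool_id, mon.events.PY_START, on_py_start)
        mon.set_events(tool_id, mon.events.PY_START)
        fn()
    finally:
        # Always undo the registration so later code is not monitored.
        mon.set_events(tool_id, mon.events.NO_EVENTS)
        mon.register_callback(tool_id, mon.events.PY_START, None)
        mon.free_tool_id(tool_id)
    return seen


def traced():
    return 1


names = record_py_start_events(traced)
```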
Microbenchmark: `benchmarks/dynamo/microbenchmarks/overheads.py`:
BEFORE:
```
requires_grad=False
eager 7.1us (warmup=0.0s)
compiled 24.6us (warmup=10.0s)
requires_grad=True
eager 8.9us (warmup=0.0s)
compiled 57.8us (warmup=0.1s)
inference_mode()
eager 6.5us (warmup=0.0s)
compiled 23.4us (warmup=0.1s)
```
AFTER:
```
requires_grad=False
eager 7.0us (warmup=0.0s)
compiled 23.2us (warmup=15.2s)
requires_grad=True
eager 9.0us (warmup=0.0s)
compiled 55.1us (warmup=0.1s)
inference_mode()
eager 6.4us (warmup=0.0s)
compiled 22.2us (warmup=0.1s)
```
Followup thought: how do we let users know that a frame is skipped because the code object is a callable registered to sys.monitoring? (or any other reason?)
Differential Revision: [D78530528](https://our.internmc.facebook.com/intern/diff/D78530528)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158171
Approved by: https://github.com/jansel
I feel uneasy about touching `__warningregistry__` since it is undocumented and private surface. The only public API hook that doesn't increment warnings version seems to be https://docs.python.org/3/library/warnings.html#warnings.showwarning.
So we could play whack-a-mole with all the warning muters in compile so that they just don't display warnings, and we wouldn't invalidate the warnings cache. This PR does that for torch/_dynamo; I didn't find any mutation of the warnings version in torch/_inductor.
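A rough sketch of the idea, using only the standard library: swap out `warnings.showwarning` for the duration of a block instead of editing filters. The helper name `mute_warning_display` is hypothetical and this is not the code added to torch/_dynamo.

```python
import contextlib
import warnings


@contextlib.contextmanager
def mute_warning_display():
    """Hypothetical helper: drop warning output without touching the filters."""
    dropped = []
    original = warnings.showwarning

    def _swallow(message, category, filename, lineno, file=None, line=None):
        # Record the message instead of printing it.
        dropped.append(str(message))

    warnings.showwarning = _swallow
    try:
        yield dropped
    finally:
        warnings.showwarning = original


before = warnings.showwarning
with mute_warning_display() as dropped:
    warnings.warn("noisy message")  # not printed if the filters let it through
after = warnings.showwarning
```

Because the filters are left alone, a caller who has set `simplefilter("error")` still gets an exception from inside the block.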
There is a behavior change if someone calls a compiled graph with simplefilter("error"):
```python
# e.g. test/dynamo_expected_failures/TestAutogradFallback.test_no_autograd_kernel_inplace_mode_nothing
import warnings

with warnings.catch_warnings():
    warnings.simplefilter("error")  # turns all warnings into errors
    compiled_fn()  # will throw if any of the muted warnings fire
```
FIXES https://github.com/pytorch/pytorch/issues/128427
A note for the future: the warnings module doesn't offer a thread-safe way of using it. Even regular filters have this problem. Directly editing `__warningregistry__` would be very bad, and this PR would mute warnings on all threads. Someone will need to build a thread-safe warnings interface.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158520
Approved by: https://github.com/anijain2305, https://github.com/zou3519
Summary: This test is failing internally because the number of underlying calls to the RNG differs, owing to various library initializations that get pulled in with an internal build.
Test Plan: `buck test '@fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_repros.py::ReproTests::test_longtensor_list' --run-disabled`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158458
Approved by: https://github.com/jansel
Otherwise it turns the test into a trivial one (that always succeeds), as the following example demonstrates:
```python
import torch
from torch.testing._internal.common_utils import serialTest, run_tests, TestCase


class MegaTest(TestCase):
    @serialTest
    def test_foo(self):
        if hasattr(self.test_foo, "pytestmark"):
            print("foo has attr and it is", self.test_foo.pytestmark)
        print("foo")

    @serialTest()
    def test_bar(self):
        if hasattr(self.test_bar, "pytestmark"):
            print("bar has attr and it is", self.test_bar.pytestmark)
        print("bar")


if __name__ == "__main__":
    run_tests()
```
That will print
```
test_bar (__main__.MegaTest.test_bar) ... bar has attr and it is [Mark(name='serial', args=(), kwargs={})]
bar
ok
test_foo (__main__.MegaTest.test_foo) ... ok
----------------------------------------------------------------------
Ran 2 tests in 0.013s
```
Added an assert in the decorator that the argument is a boolean, to prevent such silent skips in the future.
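The effect of such an assert can be sketched with a stand-in decorator factory. `serial_test` below is hypothetical and only mimics the shape of a decorator that takes an optional boolean; it is not the implementation of `serialTest`.

```python
def serial_test(condition=True):
    """Hypothetical stand-in for a decorator factory such as `serialTest`."""
    # Guard: `@serial_test` without parentheses passes the test function here.
    assert isinstance(condition, bool), "use @serial_test() or @serial_test(cond)"

    def decorator(fn):
        fn.serial = condition  # stand-in for attaching a pytest mark
        return fn

    return decorator


@serial_test()
def test_bar():
    return "bar ran"


try:
    @serial_test  # missing parentheses: now fails loudly at definition time
    def test_foo():
        return "foo ran"
except AssertionError as exc:
    misuse_error = str(exc)
```

Without the assert, `test_foo` would be replaced by the inner `decorator`, and calling it would return without running the test body.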
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157388
Approved by: https://github.com/clee2000
Calling `at::native::nansum_out` causes the fake kernel to dispatch to a
`make_reduction` call and then segfault later due to the
`mutable_data_ptr` call in `TensorIteratorBase::build`. It also causes
a fake tensor propagation issue in Dynamo. The added tests demonstrate
both issues.
This patch fixes it by dispatching to `at::nansum_out` instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156642
Approved by: https://github.com/zou3519
I had to create a new PR for this because @atalman requested temporarily reverting the previous PR to restore diff train sync. Nothing has changed between this PR and the original one.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156639
Approved by: https://github.com/atalman
This PR refers to the issue: https://github.com/pytorch/pytorch/issues/155352
This PR:
* uses `torch._dynamo.utils.warn_once` so that this warning is emitted only once,
* clarifies in the warning that silent incorrectness is a potential risk, not an observed one,
* doesn't warn for functions that come from `torch.*`.
With this code change, the terminal outputs the following.
If the code came from `torch.*`:
Nothing, as we shouldn't warn for functions that come from `torch.*`.
Otherwise:
/data/users/ssubbarao8/pytorch/torch/_dynamo/variables/functions.py:1565: UserWarning: Dynamo detected a call to a `functools.lru_cache`-wrapped function. Dynamo ignores the cache wrapper and directly traces the wrapped function. Silent incorrectness is only a *potential* risk, not something we have observed. Enable TORCH_LOGS="+dynamo" for a DEBUG stack trace.
torch._dynamo.utils.warn_once(msg)
If the user runs the command `TORCH_LOGS="+dynamo" python foo4.py`, the debug logs show the following (this log is based on chillee's repro):
/data/users/ssubbarao8/pytorch/torch/_dynamo/variables/functions.py:1565: UserWarning: Dynamo detected a call to a `functools.lru_cache`-wrapped function. Dynamo ignores the cache wrapper and directly traces the wrapped function. Silent incorrectness is only a *potential* risk, not something we have observed. Enable TORCH_LOGS="+dynamo" for a DEBUG stack trace.
torch._dynamo.utils.warn_once(msg)
V0619 21:00:16.504000 956424 torch/_dynamo/variables/functions.py:1575] [0/0] call to a lru_cache` wrapped function from user code at: /data/users/ssubbarao8/pytorch/foo4.py:9
V0619 21:00:16.504000 956424 torch/_dynamo/variables/functions.py:1575] [0/0] File "/data/users/ssubbarao8/pytorch/foo4.py", line 9, in <module>
V0619 21:00:16.504000 956424 torch/_dynamo/variables/functions.py:1575] [0/0] torch.compile(foo, backend="eager")(torch.randn(4))
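To illustrate why bypassing the cache wrapper is only a potential source of divergence, the plain-Python sketch below compares a cached call with a direct call to the wrapped function. The names are hypothetical and no Dynamo code is involved; results differ only when the function depends on state that the cache key does not capture.

```python
import functools

config = {"scale": 2}


@functools.lru_cache(maxsize=None)
def scaled(x):
    return x * config["scale"]


first = scaled(3)               # computed and stored in the cache
config["scale"] = 10            # state that the cache key does not capture
cached = scaled(3)              # served from the cache, still the old value
direct = scaled.__wrapped__(3)  # bypasses the cache wrapper, sees the new state
```

For a function whose result depends only on its arguments, the cached and direct calls agree, which is the common case.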
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156463
Approved by: https://github.com/williamwen42
See the added test for the case that this PR handles. In particular, the semantics of nested torch.compile with toggled fullgraph settings were strange before: `@torch.compile(fullgraph=True)` overrides the existing fullgraph setting, while `@torch.compile(fullgraph=False)` does not.
Note that this change will add an extra frame to any inlined torch.compile'd function (which I don't expect to happen frequently).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155166
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782
Fixes #148950.
While the graph is being constructed and the `add` node is run under the [interpreter](https://github.com/pytorch/pytorch/blob/d68d4d31f4824f1d1e0d1d6899e9879ad19b0754/torch/fx/interpreter.py#L301), the functional argument holding the conj complex tensor gets cloned. As a result, `is_conj()` always evaluates to false in the decomposition function.
The proposed fix is to call `resolve_conj()` in the decomposition of complex tensor add.
Test as below
`python test/dynamo/test_repros.py ReproTests.test_add_complex_conj`
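As background, the sketch below shows what `resolve_conj()` does on an ordinary eager tensor. It illustrates the two tensor methods involved and is not the decomposition code changed by this PR.

```python
import torch

x = torch.tensor([1 + 2j, 3 - 1j])
lazy = x.conj()                  # conjugation is recorded as a bit on the tensor
resolved = lazy.resolve_conj()   # materializes the conjugated values

# The values agree, but only `lazy` still carries the conj bit.
same_values = torch.equal(resolved, torch.tensor([1 - 2j, 3 + 1j]))
```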
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153945
Approved by: https://github.com/jansel