Summary: In non-strict mode of torch.export(), we didn't set the `is_compiling()` flags to `True`, which some models rely on.
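As a hedged illustration of why this matters (the module is made up, and the helper queried here is just one of the internal `is_compiling()` flags), some models branch on the flag and expect the traced/compiled path to be taken under export:
```
import torch

class M(torch.nn.Module):
    def forward(self, x):
        # torch._utils.is_compiling() is one of the internal is_compiling()
        # helpers; non-strict export should make such checks return True,
        # matching strict export and torch.compile.
        if torch._utils.is_compiling():
            return x + 1  # path the model expects while being traced/exported
        return x - 1      # eager-only path
```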
Test Plan: Unit tests and manual testing.
Differential Revision: D53624452
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119602
Approved by: https://github.com/suo
Summary: Previously, we omitted `equal_to_1` from the `triton_meta` part of the `@user_autotune` decorator. For user-written Triton kernels, this could lead to perf regressions, as the kernel compiled via Inductor's codegen lacks the `equal_to_1` specialization.
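As a hedged sketch of the kind of user-written kernel affected (kernel and names are illustrative, not the repro from the issue below): a scalar argument that equals 1 at the call site, such as a contiguous stride, normally gets Triton's `equal_to_1` specialization, and dropping it in the Inductor-compiled path can cost performance.
```
import torch
import triton
import triton.language as tl

@triton.jit
def scale_kernel(x_ptr, out_ptr, n_cols, stride_row, stride_col, BLOCK: tl.constexpr):
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK)
    mask = cols < n_cols
    offs = row * stride_row + cols * stride_col
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x * 2.0, mask=mask)

def scale(x):
    out = torch.empty_like(x)
    n_rows, n_cols = x.shape
    # For a contiguous input, x.stride(1) == 1: Triton specializes such scalar
    # arguments (equal_to_1), which the Inductor-generated call previously lost.
    scale_kernel[(n_rows,)](
        x, out, n_cols, x.stride(0), x.stride(1),
        BLOCK=triton.next_power_of_2(n_cols),
    )
    return out
```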
Fixes #120478. The repro from the issue, on A100:
Before this PR:
```
Triton matmul: 0.0167 seconds
Triton matmul compiled: 0.0751 seconds
```
After this PR:
```
Triton matmul: 0.0168 seconds
Triton matmul compiled: 0.0072 seconds
```
Test Plan:
```
$ python test/dynamo/test_triton_kernels.py -k test_triton_kernel_equal_to_1_arg
...
----------------------------------------------------------------------
Ran 3 tests in 3.545s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120579
Approved by: https://github.com/oulgen, https://github.com/jansel, https://github.com/chenyang78
Summary: Previously, `memory_plan_reuse` assumed that the generated code is flat, in the sense that it cannot contain nested scopes. With nested control flow codegen, this is no longer the case, which caused bugs where buffers were reused across visibility boundaries in different nested scopes.
In this PR, we add nested planning states in `memory_plan_reuse` on entering and exiting a scope in the codegen. This restricts buffer reusability to the currently active (peak) scope / planning state.
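A minimal, hypothetical sketch of the scope-aware planning described above (names and structure are illustrative, not Inductor's actual `memory_plan_reuse` code): a stack of planning states is pushed and popped with the codegen scope, so a freed buffer is only offered for reuse within the scope that freed it.
```
class ScopedReusePlanner:
    def __init__(self):
        self.scopes = [{}]  # stack of {size: [names of freed buffers]}

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        # Buffers freed inside the scope stop being reuse candidates
        # once the scope is exited.
        self.scopes.pop()

    def free(self, name, size):
        self.scopes[-1].setdefault(size, []).append(name)

    def try_reuse(self, size):
        # Only the currently active planning state is consulted.
        pool = self.scopes[-1].get(size, [])
        return pool.pop() if pool else None
```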
Test Plan:
```
python test/inductor/test_control_flow.py -k test_subgraphs_with_parameters
...
----------------------------------------------------------------------
Ran 27 tests in 149.413s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120777
Approved by: https://github.com/chenyang78, https://github.com/desertfire, https://github.com/jansel
ghstack dependencies: #120665
Previously, we parametrized some tests to run with both native and py funcol by flipping a global variable. However, some of these tests are multi-threaded, and this parametrization mechanism could lead to race conditions.
This PR changes the mechanism to use `mock.patch`, which is applied on a per-thread basis.
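A hedged sketch of the pattern (the module and flag names are stand-ins, not the real funcol switch): each test variant patches the flag around its own body instead of flipping a shared global.
```
import types
import unittest
from unittest import mock

# Stand-in for the module holding the implementation switch; the real
# module/flag names in PyTorch differ.
funcol_impl = types.SimpleNamespace(USE_NATIVE_FUNCOL=False)

class FuncolTest(unittest.TestCase):
    @mock.patch.object(funcol_impl, "USE_NATIVE_FUNCOL", True)
    def test_native_variant(self):
        # The patch is scoped to this test body rather than being a global
        # flip that concurrently running tests could observe mid-flight.
        self.assertTrue(funcol_impl.USE_NATIVE_FUNCOL)

    def test_py_variant(self):
        self.assertFalse(funcol_impl.USE_NATIVE_FUNCOL)
```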
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120833
Approved by: https://github.com/wconstab
Summary: The current machinery of Inductor's `compile_fx` assumes that the incoming fx graph is flat. As a result, everything before `graph.run` is applied only to the outermost graph. This assumption was valid before #119759, but control flow now brings (arbitrarily deeply) nested fx subgraphs into `compile_fx`.
In this PR, we start extending the `compile_fx` machinery to deal with nested fx subgraphs. Namely, we recursively apply Inductor's `pre_grad`, `joint_graph`, and `post_grad` passes to the nested subgraphs in the incoming fx graph.
The recursive application of the `pre_grad` passes requires example inputs per subgraph, which we don't have for the nested subgraphs; a few different attempts to infer them via fake tensor propagation have led to side effects in the model. Therefore, to the nested subgraphs we only apply the subset of `pre_grad` passes that doesn't require example inputs.
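A hedged sketch of the recursion (a simplification, not the actual `compile_fx` code): nested fx subgraphs, e.g. the branches of control flow ops, live as child `GraphModule`s of the outer graph module, so a pass can be applied to the outer graph and then recursively to each child.
```
import torch
from torch import fx

def apply_pass_recursively(gm: fx.GraphModule, run_pass) -> None:
    # Run the pass on the current graph, then on every nested fx subgraph
    # (e.g. true/false branches of torch.cond registered as submodules).
    run_pass(gm)
    for _name, child in gm.named_children():
        if isinstance(child, fx.GraphModule):
            apply_pass_recursively(child, run_pass)

# Example usage with a side-effect-free pass:
# apply_pass_recursively(gm, lambda g: g.graph.eliminate_dead_code())
```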
Test Plan:
```
$ python test/inductor/test_control_flow.py
...
----------------------------------------------------------------------
Ran 26 tests in 59.252s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120665
Approved by: https://github.com/eellison
Fix and test issues with both coalesced and individual send/recv ops
Considered an alternate approach and then ditched it:
- alternate approach: #119757
- reason ditched: we prefer recording individual collective events inside the
coalescing region instead of just the event at the end of the region,
which also would not have tensor sizes or op names without additional
state variables added
Another approach was also ditched:
- record events on workEnqueue instead of initWork
- reason ditched: too messy to get input/output shapes tagged onto the
recording when recording in workEnqueue. Adding the info onto the
Work obj would be possible, but it adds to the overhead of copying Works,
which we do on every collective. We can get the info off the input/output
tensors directly in initWork, but we don't want to keep refs to those
tensors alive while the work is enqueued, so we'd have to specifically
copy size lists or something.
This PR instead avoids creating a work inside pointToPoint() while
coalescing is active; only at endCoalescing() is a work finally
initialized and enqueued. In place of creating a work during coalescing,
pointToPoint() now makes a record() call, which picks up tensor shapes
and op names.
It also changes initWork() to accept a 'record' argument. This defaults to
false, and should only be set to true if the caller ensures the work
will be enqueued by workEnqueue, so that its CUDA events are still live
when used by the flight recorder's update_state().
The testing uncovers some odd pre-existing behaviors and leaves them
alone for now. We could change some of these later:
- seq starts off at 1, not 0, for the first op (but this is inconsistent)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120270
Approved by: https://github.com/shuqiangzhang
ghstack dependencies: #120724
In cases where a sequence number is shared between events (e.g. coalesced
collectives), we want to ensure a unique (and ordered) ID per record.
Note: the records are already in a list, so their ID could be implicitly
observed. But (1) it's a ring buffer, so the absolute ID is lost once the
buffer rolls over, and (2) users may sort, process, or filter their
flight records, so having the ID as an explicit member of an entry is
still useful.
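An illustrative Python sketch of the idea (the actual flight recorder is C++ inside ProcessGroupNCCL): entries in a fixed-size ring buffer carry an explicit, monotonically increasing id, so ordering survives rollover and any user-side sorting or filtering.
```
from collections import deque
from itertools import count

class RingRecorder:
    def __init__(self, capacity=1024):
        self.entries = deque(maxlen=capacity)  # old entries drop off on rollover
        self._next_id = count()

    def record(self, seq, op_name, sizes):
        self.entries.append({
            "id": next(self._next_id),  # unique and ordered, even if seq repeats
            "seq": seq,                 # may be shared, e.g. coalesced collectives
            "op": op_name,
            "sizes": sizes,
        })
```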
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120724
Approved by: https://github.com/zdevito
As reported in https://github.com/pytorch/pytorch/issues/119434, `pyhpc_isoneutral_mixing`, `pyhpc_equation_of_state` and `pyhpc_turbulent_kinetic_energy` fail with dynamic shape testing, so we propose to skip the dynamic batch size testing of these 3 models in this PR.
* Error message:
```
File "/localdisk/leslie/torch_inductor_community/pytorch/benchmarks/dynamo/common.py", line 3879, in run
assert marked, f"nothing in example_inputs had a dim with {batch_size}"
AssertionError: nothing in example_inputs had a dim with 1048576
```
* Root cause:
* The benchmark code only annotates an input dim as dynamic when its size equals the batch size: c617e7b407/benchmarks/dynamo/common.py (L3867-L3871). If it fails to find any dim equal to the batch size, the above error is thrown.
* However, for these 3 models, none of the input dims will equal the input batch size because of the [relationship of dim sizes](26b85eadde/torchbenchmark/models/pyhpc_equation_of_state/__init__.py (L12-L16)):
```
shape = (
math.ceil(2 * size ** (1/3)),
math.ceil(2 * size ** (1/3)),
math.ceil(0.25 * size ** (1/3)),
)
```
* Another note: `pyhpc_isoneutral_mixing` and `pyhpc_equation_of_state` can pass the dynamic batch size accuracy testing because the batch size is set to 4 in accuracy testing (c617e7b407/benchmarks/dynamo/common.py (L3456)) and `math.ceil(2 * size ** (1/3))` happens to equal 4.
* Since the input dim sizes have the above relationship, running these models with dynamic shapes would require annotating `dim[0](s0) = dim[2](s1) * 8`; per the discussion in https://github.com/pytorch/pytorch/issues/117477#issuecomment-1897108756 @avikchaudhuri, this relationship does not look expressible today. So I think we need to skip the dynamic batch size testing for these 3 models (see the worked example below).
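A worked example for the failing batch size above (1048576): none of the resulting dims equals the batch size, so the benchmark's dynamic-dim annotation finds nothing to mark.
```
import math

size = 1048576  # the batch size from the assertion error above
shape = (
    math.ceil(2 * size ** (1 / 3)),
    math.ceil(2 * size ** (1 / 3)),
    math.ceil(0.25 * size ** (1 / 3)),
)
print(shape)  # (204, 204, 26) -- no dim equals 1048576
```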
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120599
Approved by: https://github.com/jgong5, https://github.com/desertfire
Previously, torch.export in non-strict mode failed on str inputs both while creating fake inputs for tracing (fakify()) and while using graph nodes to create constraints. This PR fixes those two stages so that str inputs pass through.
Failing test case:
```
class Foo(torch.nn.Module):
    def forward(self, a, b, mode):
        return torch.div(a, b, rounding_mode=mode)

foo = Foo()
inps = (torch.randn(4, 4), torch.randn(4), "trunc")
exported = export(foo, inps)
with self.assertRaisesRegex(
    RuntimeError, "to be equal to trunc, but got floor"
):
    _ = exported.module()(torch.randn(4, 4), torch.randn(4), "floor")
self.assertTrue(torch.allclose(exported.module()(*inps), foo(*inps)))
```
Before:
```
(pytorch-local) pianpwk@pianpwk-mbp pytorch % python test/export/test_export_nonstrict.py -k test_runtime_assert_for_prm_str
E
======================================================================
ERROR: test_runtime_assert_for_prm_str_non_strict (__main__.NonStrictExportTestExport.test_runtime_assert_for_prm_str_non_strict)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/pianpwk/Documents/pytorch/torch/testing/_internal/common_utils.py", line 2744, in wrapper
method(*args, **kwargs)
File "/Users/pianpwk/Documents/pytorch/test/export/testing.py", line 40, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/Users/pianpwk/Documents/pytorch/test/export/test_export.py", line 1588, in test_runtime_assert_for_prm_str
exported = export(foo, inps)
^^^^^^^^^^^^^^^^^
File "/Users/pianpwk/Documents/pytorch/test/export/test_export_nonstrict.py", line 16, in mocked_non_strict_export
return export(*args, **kwargs, strict=False)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/pianpwk/Documents/pytorch/torch/export/__init__.py", line 186, in export
return _export(
^^^^^^^^
File "/Users/pianpwk/Documents/pytorch/torch/export/_trace.py", line 541, in wrapper
raise e
File "/Users/pianpwk/Documents/pytorch/torch/export/_trace.py", line 527, in wrapper
ep = fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/Users/pianpwk/Documents/pytorch/torch/export/exported_program.py", line 83, in wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/Users/pianpwk/Documents/pytorch/torch/export/_trace.py", line 707, in _export
) = make_fake_inputs(f, args, kwargs, constraints)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/pianpwk/Documents/pytorch/torch/_export/non_strict_utils.py", line 133, in make_fake_inputs
fake_args, fake_kwargs = tree_map_with_path(
^^^^^^^^^^^^^^^^^^^
File "/Users/pianpwk/Documents/pytorch/torch/utils/_pytree.py", line 1519, in tree_map_with_path
return treespec.unflatten(func(*xs) for xs in zip(*all_keypath_leaves))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/pianpwk/Documents/pytorch/torch/utils/_pytree.py", line 734, in unflatten
leaves = list(leaves)
^^^^^^^^^^^^
File "/Users/pianpwk/Documents/pytorch/torch/utils/_pytree.py", line 1519, in <genexpr>
return treespec.unflatten(func(*xs) for xs in zip(*all_keypath_leaves))
^^^^^^^^^
File "/Users/pianpwk/Documents/pytorch/torch/_export/non_strict_utils.py", line 134, in <lambda>
lambda kp, val: fakify(fake_mode, kp, val, t_constraints, sources),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/pianpwk/Documents/pytorch/torch/_export/non_strict_utils.py", line 68, in fakify
raise ValueError("Only tensors allowed as input")
ValueError: Only tensors allowed as input
To execute this test, run the following from the base repo dir:
python test/export/test_export_nonstrict.py -k test_runtime_assert_for_prm_str_non_strict
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
----------------------------------------------------------------------
Ran 1 test in 0.008s
FAILED (errors=1)
```
After:
```
(pytorch-local) pianpwk@pianpwk-mbp pytorch % python test/export/test_export_nonstrict.py -k test_runtime_assert_for_prm_str
.
----------------------------------------------------------------------
Ran 1 test in 0.237s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120536
Approved by: https://github.com/tugsbayasgalan, https://github.com/zhxchen17, https://github.com/avikchaudhuri, https://github.com/gmagogsfm
The special-case code for handling SUPPORTED_NODES was producing a guard that looked like:
```
"G['torch'].utils._pytree.SUPPORTED_NODES[<class '__main__.CausalLMOutputWithPast'>].type"
```
resulting in an eval error when trying to evaluate the guard.
This change adds a new source type (`ClassSource`) which is given a class type (in this case `CausalLMOutputWithPast`) and attempts to fetch it from its defining module. It then uses that to build the `SUPPORTED_NODES` guards instead of referring to the type directly.
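Conceptually (this is an illustration in plain Python, not the actual `ClassSource` implementation), the guard can name the class by looking it up in its defining module instead of embedding a `<class '...'>` repr:
```
import sys

def class_from_defining_module(cls: type) -> type:
    # Rebuild a reference to `cls` via its defining module and qualified name,
    # which is something a guard expression can actually evaluate.
    obj = sys.modules[cls.__module__]
    for part in cls.__qualname__.split("."):
        obj = getattr(obj, part)
    assert obj is cls
    return obj
```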
Also added a unit test which fails before this change and passes after.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120798
Approved by: https://github.com/anijain2305
This PR is mostly just code movement to make the code review easier - AFAIK it should not change any functionality. The final goal is to remove the xfails for some of the test_fake opinfos for these ops. The opinfos are failing because the outputs can have mixed devices - we need to move them to fake_impls first before we can support mixed device returns.
This PR:
* Move the `_meta_registrations.py` implementations to `fake_impls.py`
* Change the function signature from taking explicit named variables to taking `{args, kwargs}` and normalizing them (sketched below)
* Wrap all the returned tensors in FakeTensors
Tests: relying on opinfos. I also checked `test_fake_*` for these tests (by removing x-fails and patching things until they passed) to verify general correctness.
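A hedged sketch of the `{args, kwargs}` signature style mentioned above (parameter names are illustrative and this is not the real fake_impls registration code): the impl receives the raw call and binds it against a known signature rather than declaring explicit named parameters.
```
import inspect

def _reference_signature(x, y, alpha=1.0, out_dtype=None):
    # Only used to describe the expected parameters; names are illustrative.
    pass

def normalized_impl(args, kwargs):
    bound = inspect.signature(_reference_signature).bind(*args, **kwargs)
    bound.apply_defaults()
    params = bound.arguments  # e.g. {"x": ..., "alpha": 1.0, ...}
    return params
```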
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120682
Approved by: https://github.com/drisspg
Summary: Vulkan gtests were segfaulting on Mac because the memory for barriers can get destroyed after the local function where it is created (CommandBuffer::insert_barrier) exits. Since we provide this barrier pointer to the Vulkan library, it needs to remain alive even after the function exits; otherwise we get crashes.
Test Plan:
See that there is no segfault on Mac with the fix and the tests can run.
Compile the gtests:
buck2 build --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output
Crash without the diff:
bash-3.2$ buck-out//v2/gen/fbsource/xplat/caffe2/pt_vulkan_quantized_api_test_binAppleMac
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
[==========] Running 85 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 85 tests from VulkanAPITest
[ RUN ] VulkanAPITest.uniform_buffer_copy
[ OK ] VulkanAPITest.uniform_buffer_copy (88 ms)
[ RUN ] VulkanAPITest.copy_to_buffer
Segmentation fault: 11
With the diff there is no crash:
bash-3.2$ buck-out//v2/gen/fbsource/xplat/caffe2/pt_vulkan_quantized_api_test_binAppleMac
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
[==========] Running 85 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 85 tests from VulkanAPITest
[ RUN ] VulkanAPITest.uniform_buffer_copy
[ OK ] VulkanAPITest.uniform_buffer_copy (296 ms)
.....
[ FAILED ] VulkanAPITest.gelu_quint8_self (23 ms)
[----------] 85 tests from VulkanAPITest (1494 ms total)
[----------] Global test environment tear-down
[==========] 85 tests from 1 test suite ran. (1494 ms total)
[ PASSED ] 72 tests.
[ FAILED ] 13 tests, listed below:
[ FAILED ] VulkanAPITest.linear_2d_flat
[ FAILED ] VulkanAPITest.linear_2d_small
[ FAILED ] VulkanAPITest.linear_2d_large
[ FAILED ] VulkanAPITest.linear_3d_flat
[ FAILED ] VulkanAPITest.linear_3d_small
[ FAILED ] VulkanAPITest.linear_3d_large
[ FAILED ] VulkanAPITest.linear_4d_flat
[ FAILED ] VulkanAPITest.linear_4d_small
[ FAILED ] VulkanAPITest.linear_4d_large
[ FAILED ] VulkanAPITest.gelu_qint8
[ FAILED ] VulkanAPITest.gelu_qint8_self
[ FAILED ] VulkanAPITest.gelu_quint8
[ FAILED ] VulkanAPITest.gelu_quint8_self
The above failing tests were failing before as well and are being worked on.
Differential Revision: D54023146
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120337
Approved by: https://github.com/SS-JIA
This adds support for backward hooks that are *both*:
1) Interior to the graph; and
2) Dynamically generated (e.g. lambdas)
We do this by creating a BackwardState object that is used to register the hooks in the forward; the object is then populated by dynamo *after* the forward runs.
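A user-level sketch of the pattern this targets (backend and details are illustrative; whether a given hook is traced end-to-end can depend on configuration such as compiled autograd): a hook that is both interior to the graph and created dynamically as a lambda.
```
import torch

@torch.compile(backend="aot_eager")
def fn(x):
    y = x.sin()
    # Interior (y is an intermediate) and dynamically generated (a lambda).
    y.register_hook(lambda grad: grad * 2.0)
    return y.cos()

x = torch.randn(4, requires_grad=True)
fn(x).sum().backward()
```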
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120382
Approved by: https://github.com/xmfan
Summary:
`nullptr` is typesafe. `0` and `NULL` are not. In the future, only `nullptr` will be allowed.
This diff helps us embrace the future _now_ in service of enabling `-Wzero-as-null-pointer-constant`.
Test Plan: Sandcastle
Reviewed By: meyering
Differential Revision: D54163060
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120740
Approved by: https://github.com/Skylion007
With the current `Dim`-based dynamic shapes API for export, one can express that the shapes of different inputs must be equal by reusing the same `Dim`. However, non-trivial relationships between such input shapes cannot be expressed.
Recently we have been seeing more and more examples of code that require this additional expressiveness, e.g., where a pair of shapes might differ by one, or one shape might be double another (or simply even).
This PR introduces the concept of a "derived" `Dim`, i.e., a linear arithmetic expression over a `Dim`. By using a combination of `Dim`s and derived `Dim`s to specify input shapes, the desired relationships can be expressed naturally. E.g., a pair of shapes might be `dim` and `dim + 1`, or `dim` and `2*dim`, or even `2*dim` and `dim + 1`.
We extend the current infrastructure that translates `Dim`s to deprecated `dynamic_dim`-based constraints to work with derived `Dim`s. As usual, we raise constraint violation errors when shape guards cannot be verified given a dynamic shapes spec; suggest fixes; and raise runtime errors when future inputs violate the spec.
Importantly, some guards that used to cause forced specializations in the constraint solver because they were deemed "too complex" now do not do so, because they can now be specified as constraints. Since this was what motivated the introduction of a `disable_constraint_solver` flag to some internal APIs, we may not need that flag any more.
Note that shapes of placeholders in exported programs can now contain symbolic expressions and not just symbols.
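A hedged sketch of the kind of spec this enables (module and names are illustrative): one input's first dimension is constrained to be exactly one larger than the other's.
```
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x, y):
        # Valid only when x.shape[0] == y.shape[0] + 1.
        return x[1:] + y

dim = Dim("dim", min=1)
ep = export(
    M(),
    (torch.randn(5), torch.randn(4)),
    dynamic_shapes={"x": {0: dim + 1}, "y": {0: dim}},
)
```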
Differential Revision: D53254587
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118729
Approved by: https://github.com/ezyang