Summary: Implement an on-disk cache to save and reuse compiled FX Graphs. This implementation does not handle tensors with symbolic shapes. This needs to be done in a follow-up PR.
Test Plan:
* New unit tests exercising saving and load from the cache.
* New unit tests to exercise the cache key calculations.
* Ran several benchmarks to see cache hit and resulting compilation times.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103453
Approved by: https://github.com/eellison
The motivation for removing this is already present in the pre-PR comments. Copying it
~~~
# NB - SuperSource is a weird one.
# it is our only source with 2 bases, so we use the objec
# as the base, rather than the type, since an invocation
# like super(Foo, foo) is represented here, the source object base is more spiritually
# aligned with the instance, rather than the type.
# This whole construction is questionable tho, and we should probably find a way to
# avoid this exception to our otherwise nice source parentage invariant.
~~~
Instead of using super(a, b), we can use `type(b).__mro__[index]`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110475
Approved by: https://github.com/jansel
Fixes#88919
@mruberry @peterbell10
This PR adds a warning to the .cpp STFT and ISTFT functions if a window is not provided.
It also describes the warning in the documentation on `functional.py`.
Finally, it adds unit tests to check if the warning is being produced.
I have audited for internal calls of `stft` and `istft` on Pytorch and haven't found any.
Thank you for the opportunity to contribute!
Eric
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110695
Approved by: https://github.com/ezyang
Summary: `torch.nonzero` doesn't have inductor lowering (yet). To invoke the operator in AOT Inductor's ABI compatibility mode we need a dedicated shim function.
Test Plan:
```
$ python test/inductor/test_aot_inductor.py -k test_zero_grid_with_unbacked_symbols
...
----------------------------------------------------------------------
Ran 4 tests in 78.650s
OK
```
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110766
Approved by: https://github.com/chenyang78
ghstack dependencies: #110713, #110745, #110764
Summary: `repeat_interleave.Tensor` doesn't have inductor lowering. To invoke the operator in AOT Inductor's ABI compatibility mode we need a dedicated shim function.
Test Plan:
```
$ python test/inductor/test_aot_inductor.py -k test_repeat_interleave
...
----------------------------------------------------------------------
Ran 4 tests in 70.526s
OK
```
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110745
Approved by: https://github.com/chenyang78
ghstack dependencies: #110713
Summary:
In some scenarios, we want to update constants at runtime.
In such cases, we have to keep the original constants in
the generated code without applying any constant-inlining
optimizations.
This PR adds a config to force us to add tensor constants.
Differential Revision: D49895154
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110491
Approved by: https://github.com/mikekgfb
Summary:
## Context
Both `aten.sum` and `aten.squeeze` have a "most generic" variant in the form of `aten.sum.dim_IntList` and `aten.squeeze.dims` respectively. Add decompositions for other non generic variants of these operators to express them using the most generic variant.
Note that to register these decomps, the reference implementation under `_refs` had to be removed as registered decompositions. cc: @lezcano @peterbell10
Test Plan: Github CI + Meta Internal CI
Differential Revision: D49965952
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110645
Approved by: https://github.com/peterbell10, https://github.com/digantdesai, https://github.com/manuelcandales
We had the option but never used cpu_offload as optimizer state_dict offloads the tensors to CPU by default. And this is usually most users want as the tensors are required to be moved to CPU eventually. However, we may want to disable offloading to CPU in some cases, epsecially for the debugging purpose. This PR lets optimizer state_dict read the flag.
Differential Revision: [D48913340](https://our.internmc.facebook.com/intern/diff/D48913340/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108434
Approved by: https://github.com/wz337
This PR adds support for aten.where and support implicit scalar
promotion, basically when we meet scalar tensors in dispatching logic,
we implicitly convert it those to replicated dtensor
The latter also enables bunch of ops in op db to pass
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110584
Approved by: https://github.com/fduwjj
Summary:
As title, this diff hides the contiguous requirement for user input mesh when initializing DeviceMesh.
In the current implementation, when testing with inter-node model parallelism, an exception is thrown during mesh validation when the following input is provided:
```
mesh = torch.arange(0, world_size).view(mp_size, dp_size).transpose(0, 1)
device_mesh = DeviceMesh(
"cuda",
mesh.contiguous(),
mesh_dim_names=("dp", "mp")
)
```
Test Plan:
**Unit Test**:
```
buck2 test mode/dev-nosan //caffe2/test/distributed/_tensor:device_mesh -- test_validate_device_mesh
Test UI: https://www.internalfb.com/intern/testinfra/testrun/3940649876878399
Network: Up: 0B Down: 0B
Jobs completed: 6. Time elapsed: 1:58.7s.
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0
```
**Test with MP**
```
mesh = torch.arange(0, world_size).view(mp_size, dp_size).transpose(0, 1)
device_mesh = DeviceMesh(
"cuda",
mesh.contiguous(),
mesh_dim_names=("dp", "mp")
)
```
Without the change: exception.
After this change: initialzied sucessfully.
Differential Revision: D49942839
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110628
Approved by: https://github.com/wanchaol, https://github.com/xw285cornell, https://github.com/fduwjj
Previously,`Dim` definitions that shared the same name but had different ranges were allowed to appear in the `dynamic_shapes` argument of an `export` call. They would correspond to the *same* dynamic dimension (identified by the shared name) with an effective range would be the *intersection* of the different ranges.
However this behavior can be confusing, because having different definitions with the same name is more likely than not unintentional. Therefore, this PR makes it a user error.
We still allow different definitions with the same name to exist at the same time (no global uniqueness) as long as they are not confused in the same `export` call. Redefinitions with the same bounds are also allowed, in case they are accidentally created by executing the same code multiple times.
Differential Revision: [D49965944](https://our.internmc.facebook.com/intern/diff/D49965944/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110638
Approved by: https://github.com/zhxchen17
Summary: The runtime assertions inserted in the `torch._export.export` by the `_AddRuntimeAssertionsForInlineConstraintsPass` lead to errors in AOT Inductor like #109884. In `torch._export.aot_compile` export and AOT compilation are run consecutively which would lead to the above issue if any assertions are inserted.
In this PR, we're adding a new parameter / flag to `torch._export.aot_compile`, `remove_runtime_assertions`, to remove the assertions inserted during export before AOT compilation. The flag is set to `False` for BC.
Additionally, we remove the flag `add_runtime_assertions_for_inline_constraints` recently added to `torch._dynamo.config`, as it can lead to undesirable `torch._export` behavior and is 's no longer required for the AOT Inductor testing purposes.
Test Plan: CI
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110710
Approved by: https://github.com/zhxchen17, https://github.com/chenyang78
Expose a set of observability hooks into C10D such that our users can
detect collectives failure both faster and more easily.
The design is similar to NCCL desync debug that it minimized the
overhead by doing most of the work out of the main thread.
This PR introduces a new module torch.distributed.hooks that exposes the following set of methods:
register_collective_start_hook
register_collective_end_hook
register_process_group_hook
The process group hook exposes PG creation on the member ranks and call them inline from the
the PG creation code. This is fine since this happens during initialization and a limited number of times.
The collective start/end hooks are fired from a single background thread. It reads
events from a C++ queue and dispatches over.
Queue notification is oddly done using a pipe, this is needed so python can abort the thread on shutdown
and have it as background thread. This is not possible with more reasonable choices like a condvar.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108815
Approved by: https://github.com/wconstab, https://github.com/fduwjj
Summary:
With the release of cuSPARSELt v0.5.0, we now have support for
int8 int8 -> int32 matmul.
This PR adds support for this via out_dtype.
Test Plan:
```
python test/test_sparse_semi_structured.py -k int32
```
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110499
Approved by: https://github.com/cpuhrsch
Fixes recent broken unit tests caused by PR #109908 because cudnn and miopen have separate batch norm functions.
```
2023-10-05T09:35:01.6606614Z _______________ TestQuantizePT2EQAT.test_qat_conv_bn_fusion_cuda _______________
2023-10-05T09:35:01.6606948Z Traceback (most recent call last):
2023-10-05T09:35:01.6607362Z File "/var/lib/jenkins/pytorch/test/quantization/pt2e/test_quantize_pt2e_qat.py", line 323, in test_qat_conv_bn_fusion_cuda
2023-10-05T09:35:01.6607767Z self._verify_symmetric_xnnpack_qat_graph(
2023-10-05T09:35:01.6608217Z File "/var/lib/jenkins/pytorch/test/quantization/pt2e/test_quantize_pt2e_qat.py", line 130, in _verify_symmetric_xnnpack_qat_graph
2023-10-05T09:35:01.6608658Z self._verify_symmetric_xnnpack_qat_graph_helper(
2023-10-05T09:35:01.6609105Z File "/var/lib/jenkins/pytorch/test/quantization/pt2e/test_quantize_pt2e_qat.py", line 173, in _verify_symmetric_xnnpack_qat_graph_helper
2023-10-05T09:35:01.6609623Z m = prepare_qat_pt2e(m, quantizer)
2023-10-05T09:35:01.6610171Z File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/ao/quantization/quantize_pt2e.py", line 178, in prepare_qat_pt2e
2023-10-05T09:35:01.6610561Z _fuse_conv_bn_qat(model)
2023-10-05T09:35:01.6611072Z File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/ao/quantization/pt2e/qat_utils.py", line 501, in _fuse_conv_bn_qat
2023-10-05T09:35:01.6611497Z m = _fuse_conv_bn_qat_helper(m, is_cuda=True)
2023-10-05T09:35:01.6612065Z File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/ao/quantization/pt2e/qat_utils.py", line 575, in _fuse_conv_bn_qat_helper
2023-10-05T09:35:01.6612492Z _get_conv_bn_getitem_nodes(r.replacements)
2023-10-05T09:35:01.6613058Z File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/ao/quantization/pt2e/qat_utils.py", line 383, in _get_conv_bn_getitem_nodes
2023-10-05T09:35:01.6613465Z assert bn_node is not None
2023-10-05T09:35:01.6613716Z AssertionError
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110653
Approved by: https://github.com/jerryzh168, https://github.com/pruthvistony
Move print to the beginning instead because putting it at the end makes it so you have to scroll through when debugging, and nothing in that function indicates that it should be printing anything
Also the line for printing disabled issues out of the for loop
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110621
Approved by: https://github.com/huydhn
Summary:
Since we changed IR that we are working with to pre autograd aten IR, it's easier
to use plain pattern match instead of relying on source_matcher_utils now, this
PR refactors the annotation for conv to use aten ops directly.
Also fixed reentrant test after this change.
Test Plan:
python test/test_quantization.py TestQuantizePT2E
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110308
Approved by: https://github.com/kimishpatel
When we convert to local tensor, dtensor can't track autograd or
gradient layout of the local tensor anymore, if user do sth not expected, there
needs to be a way for user to hint about the gradient layout of the
local tensor
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110629
Approved by: https://github.com/zdevito
Summary:
Original commit changeset: 03980fb054d5
Original Phabricator Diff: D49519512
Bisecting shows that this diff is the cause of S369683. Since this affects Ads production, need to back out this diff immediately.
Test Plan: See S369683
Reviewed By: ezyang
Differential Revision: D49958638
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110622
Approved by: https://github.com/yanboliang
To reduce the amount of logs
* for successes, only print the part that says what tests ran and don't print the rest. Zip the log into an artifact. The line listing al the test names is really long, but if you view source of the raw logs, it will not wrap so it will only be one line. The log classifier can also be configured to ignored this line. Gets rid of lines like `test_ops.py::TestCommonCPU::test_multiple_devices_round_cpu_int64 SKIPPED [0.0010s] (Only runs on cuda) [ 9%]`
* for failures/reruns, print logs. Do not zip.
Also
* change log artifact name
Examples of various logs:
a074db0f7f failures
1b439e24c4 failures
possibly controversial haha
should i include an option for always printing?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110033
Approved by: https://github.com/huydhn
Summary:
D49187352 caused our model conversion and loading of QAT checkpoint to be stuck with thrift time out.
we are actively checking in final code and model for static quant HTP prod model, and encountered this breakage at head Thursday.
Thrift timeout is a not failing, and because of that, it's hard to bisect and find this culprit. It is also hard to set up unit test, because the job simply time-out. Better test is needed to guard downstream model conversion against upstream changes.
Our suspicion of why this diff broke us is that we create a lot of modules with qat (in a recursive manner) but our model is not a qat traceable module (it is a graph with many qat modules and floating point modules). With fuctools.partial as in the original diff, we will be caching modules in the memory and causing the memory of the machine to be taken up completely.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110392
Approved by: https://github.com/junesg, https://github.com/jerryzh168
Defining kernels as static vars is problematic for subsequent model loading on non-default CUDA devices.
Assuming those kernels were loaded in context of the device #0, so, they are not nullptr anymore, therefore kernels won't work on devices other than the device #0.
This change makes devices remembered at model level in AOT mode.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110554
Approved by: https://github.com/chenyang78, https://github.com/desertfire
Summary: Today, we get different batch norm ops depending on
the device the model is placed on at export time. Exporting
`model.cpu()` gives `_native_batch_norm_legit`, while exporting
`model.cuda()` gives `cudnn_batch_norm`. QAT fusion currently
only supports the former and silently ignores the latter. This
commit fixes this by additionally matching on the latter op
during QAT fusion.
Test Plan:
python test/test_quantization.py TestQuantizePT2EQAT.test_qat_conv_bn_fusion
python test/test_quantization.py TestQuantizePT2EQAT.test_qat_conv_bn_relu_fusion
Reviewers: jerryzh168, kimishpatel
Subscribers: jerryzh168, kimishpatel, supriyar
Differential Revision: [D49615145](https://our.internmc.facebook.com/intern/diff/D49615145)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109908
Approved by: https://github.com/jerryzh168
There is an issue with float8 type promotion, because _promoteTypesLookup doesn't contain records for few types between bfloat16 and float8.
I have simply moved float8 types just after bfloat16, however I'm not sure if it doesn't break serialization.
Please, decide if it can stay like this, or should I insert missing records filled with "ud" into _promoteTypesLookup instead of moving types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110279
Approved by: https://github.com/albanD
Part of #109684
- #109684
Changes:
- Add new functions `tree_structure`, `tree_leaves`, `tree_map_` and `tree_map_only_` to Python pytree.
- Extract reusable tests for pytree to `TestGenericPytree`.
- Change `treespec_dumps` and `treespec_loads` in C++ pytree to call Python pytree and use JSON string as serialization type.
- Rename `torch.utils.pytree` -> `torch.utils._cxx_pytree`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110395
Approved by: https://github.com/zou3519
We want to be able to use SingletonSymNode to represent strides for Jagged layout tensor. The following is for 3D, but easily generalizable to higher dimensions.
Constraints:
- [B, x, D] (where x represents the "variably lengthed dim") can be strided in two ways [x, 1, sum(x)] and [dx, d, 1]. We need two different placeholder values depending on how the jagged tensor is strided.
- When doing operations we need the strides of output tensors to be expressable in terms of the strides and sizes of the inner tensors. Given [B, x, D] @ [D, D'], the output strides is [x * D', D', 1] rather than some opaque [x2, D', 1]. This constraint exists because if I'm tracing, I need a symint to represent the output stride. This symint needs to come from somewhere; I get it in several ways: (1) create a constant, (2) unbacked symint, (3) create a new input using a source, (4) output of an operation on an existing symint. It is clear that (4) is what we want here, which brings us to the design below.
Design:
Given the two constraints, the most straightforward way to implement this is actually to update SingletonSymNode to include some scalar factor, i.e. Morally, SingletonSymNode represents `factor * [s_0, s_1, …, s_n]` This enables us to symbolically compute strides from sizes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110369
Approved by: https://github.com/ezyang
ghstack dependencies: #110044
Previously, something like j0 >= 3, would return False. In sympy however, it is not possible to make it so that both j0 >= 3 and j0 < 3 return False. In sympy, you only get to dispatch on Ge, and the remaining are derived, e.g. defining Ge(j0 >= 3) to be False would force Lt(j0, 3) to be True, which is not what we want.
In this PR, we make it so that both j0 >=3 and j0 < 3 error, so that in a future PR when we create the symbolic counterpart of this singleton, the behaviors can be the same.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110044
Approved by: https://github.com/ezyang
When mapping between the original signature of a program and the graph-captured signature of its exported program, we emit errors when we see unexpected original or graph-captured inputs or outputs.
These errors can arise because of various reasons, e.g.:
1. some input or output has been lifted because of mutation
2. some type is not pytree-registered for flattening / unflattening
3. some type cannot be realized with graph operations
(This is probably not an exhaustive list.)
Previously we used to emit errors based on a vanilla id-based membership check between the two sides, mostly anticipating (1) as the reason for errors. But this does not do justice to errors because of (2) or (3).
This PR emits a different error when it finds (3) to be a probable cause. Specifically, it considers only Tensor and Sym* types to be "supported": no other type seems to be realizable by graph operations.
When (2) is a probable cause, we sometimes also hit the same error because we would expect the supported types to show through upon registration. But this kind of error may need some more work in the future.
Differential Revision: [D49885828](https://our.internmc.facebook.com/intern/diff/D49885828/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110472
Approved by: https://github.com/ydwu4
Summary:
This PR removed several APIs from the AOTInductor interface,
which are not used by the client.
It also simplified AOTInductor's model class by removing
the dim info for input/output tensors. We included dim info
before to return max output shapes, which was used by the client
to allocate memory for output tensors. Now, we allocate output
tensor memory from the .so so that we don't need to maintain
such information any more. The deletion of dim info from
the model class also simplified the codegen quite a bit.
Test Plan: ci
Reviewed By: khabinov
Differential Revision: D49835430
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110411
Approved by: https://github.com/khabinov, https://github.com/desertfire, https://github.com/jansel
This PR inlines the expecteds strings onto the `assertExpectedInline` calls, so that, when
change is needed, we may do that by using the `expectedtest` machinery: setting the
environment variable `EXPECTTEST_ACCEPT=1`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110290
Approved by: https://github.com/zou3519