Summary:
Since we changed the IR we are working with to the pre-autograd ATen IR, it is now
easier to use plain pattern matching instead of relying on source_matcher_utils, so this
PR refactors the annotation for conv to use ATen ops directly.
Also fixed the reentrant test after this change.
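For illustration, a minimal sketch of what matching conv nodes directly on a pre-autograd ATen graph can look like (the helper name and the exact op overloads below are illustrative, not the code in this PR):
```python
import torch

# Hypothetical helper: pick out conv nodes by their ATen op target instead of
# going through source_matcher_utils.
_CONV_OPS = {
    torch.ops.aten.conv1d.default,
    torch.ops.aten.conv2d.default,
}

def find_conv_nodes(gm: torch.fx.GraphModule):
    # In a pre-autograd ATen graph, call_function nodes carry the aten op as
    # their target, so a plain membership check is enough.
    return [
        n for n in gm.graph.nodes
        if n.op == "call_function" and n.target in _CONV_OPS
    ]
```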
Test Plan:
python test/test_quantization.py TestQuantizePT2E
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110308
Approved by: https://github.com/kimishpatel
When we convert to a local tensor, DTensor can no longer track the autograd or
gradient layout of the local tensor. If the user does something unexpected, there
needs to be a way for the user to hint at the gradient layout of the
local tensor.
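A rough usage sketch of the idea (assuming the hint is exposed as a `grad_placements` argument to `to_local`; treat the exact argument name and the setup code as assumptions):
```python
import torch
import torch.distributed as dist
from torch.distributed._tensor import DeviceMesh, Shard, distribute_tensor

# Assumes a process group is already initialized (e.g. via torchrun).
mesh = DeviceMesh("cuda", list(range(dist.get_world_size())))
dt = distribute_tensor(torch.randn(8, 8, requires_grad=True), mesh, [Shard(0)])

# Once we drop to the local tensor, DTensor can no longer see what happens to
# it; if we know the gradient flowing back is still sharded on dim 0, we hint
# that layout so backward reconstructs the DTensor gradient correctly.
local = dt.to_local(grad_placements=[Shard(0)])
```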
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110629
Approved by: https://github.com/zdevito
Summary:
Original commit changeset: 03980fb054d5
Original Phabricator Diff: D49519512
Bisecting shows that this diff is the cause of S369683. Since this affects Ads production, we need to back out this diff immediately.
Test Plan: See S369683
Reviewed By: ezyang
Differential Revision: D49958638
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110622
Approved by: https://github.com/yanboliang
To reduce the amount of logs:
* for successes, only print the part that says what tests ran and don't print the rest. Zip the log into an artifact. The line listing all the test names is really long, but if you view the source of the raw logs, it will not wrap, so it will only be one line. The log classifier can also be configured to ignore this line. Gets rid of lines like `test_ops.py::TestCommonCPU::test_multiple_devices_round_cpu_int64 SKIPPED [0.0010s] (Only runs on cuda) [ 9%]`
* for failures/reruns, print logs. Do not zip.
Also
* change log artifact name
Examples of various logs:
a074db0f7f failures
1b439e24c4 failures
Possibly controversial, haha.
Should I include an option for always printing?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110033
Approved by: https://github.com/huydhn
Summary:
D49187352 caused our model conversion and loading of the QAT checkpoint to get stuck with a Thrift timeout.
We are actively checking in the final code and model for the static-quant HTP prod model, and encountered this breakage at head on Thursday.
A Thrift timeout is not a hard failure, and because of that, it's hard to bisect and find the culprit. It is also hard to set up a unit test, because the job simply times out. A better test is needed to guard downstream model conversion against upstream changes.
Our suspicion of why this diff broke us is that we create a lot of modules with QAT (in a recursive manner), but our model is not a QAT-traceable module (it is a graph with many QAT modules and floating-point modules). With functools.partial as in the original diff, we end up caching modules in memory and causing the memory of the machine to be taken up completely.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110392
Approved by: https://github.com/junesg, https://github.com/jerryzh168
Defining kernels as static vars is problematic for subsequent model loading on non-default CUDA devices.
Once those kernels have been loaded in the context of device #0, they are no longer nullptr, so they won't work on devices other than device #0.
This change makes the device remembered at the model level in AOT mode.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110554
Approved by: https://github.com/chenyang78, https://github.com/desertfire
Fixes #106698
Also added a check for the Python API, because the current error message
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/sehoon/pytorch-latest/torch/nn/modules/adaptive.py", line 128, in __init__
or (min(cutoffs) <= 0) \
ValueError: min() arg is an empty sequence
```
is not very comprehensible.
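A sketch of the kind of up-front validation that makes this clearer (illustrative wording, not necessarily the exact check landed in the PR):
```python
def _check_cutoffs(cutoffs):
    # Hypothetical validation mirroring AdaptiveLogSoftmaxWithLoss.__init__:
    # fail early with a readable message instead of letting min([]) blow up.
    cutoffs = list(cutoffs)
    if len(cutoffs) == 0:
        raise ValueError(
            "cutoffs should be a sequence of unique, positive integers "
            "sorted in an increasing order; got an empty sequence"
        )
    if min(cutoffs) <= 0:
        raise ValueError("cutoffs should only contain positive integers")

_check_cutoffs([2, 5, 10])   # ok
# _check_cutoffs([])         # raises the readable ValueError above
```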
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106777
Approved by: https://github.com/albanD
Ideally all `_dynamo.exc.UserError`s should have "case names", i.e., link to examples in `exportdb`.
This PR adds case names to several instances of `_dynamo.exc.UserError`. In particular, looking at coverage based on `UserErrorType`:
* `DYNAMIC_CONTROL_FLOW`, `ANTI_PATTERN`, and `STANDARD_LIBRARY` are fully covered.
* `CONSTRAINT_VIOLATION` and `DYNAMIC_DIM` have no coverage. We don't seem to have any dedicated examples of specifying dynamic shapes in `exportdb` (although they are used in some other examples without explanation, to avoid some specialization that would make such examples moot).
* `INVALID_INPUT` is only partly covered. Frankly this is tedious to cover via examples.
Differential Revision: [D49928518](https://our.internmc.facebook.com/intern/diff/D49928518/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110555
Approved by: https://github.com/angelayi, https://github.com/ydwu4
Summary: Today, we get different batch norm ops depending on
the device the model is placed on at export time. Exporting
`model.cpu()` gives `_native_batch_norm_legit`, while exporting
`model.cuda()` gives `cudnn_batch_norm`. QAT fusion currently
only supports the former and silently ignores the latter. This
commit fixes this by additionally matching on the latter op
during QAT fusion.
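Conceptually, the fusion pattern is now instantiated for both batch norm variants; a hedged sketch of the idea (the real pattern-matching code is more involved, and the argument details here are illustrative):
```python
import torch

# Both exported batch norm variants return a tuple whose first element is the
# normalized output, so one pattern template parameterized over the op can
# cover them; the QAT fusion pass would instantiate it for each op.
_BN_OPS = (
    torch.ops.aten._native_batch_norm_legit.default,  # model.cpu() export
    torch.ops.aten.cudnn_batch_norm.default,          # model.cuda() export
)

def conv_bn_pattern(bn_op):
    def pattern(x, conv_weight, bn_weight, bn_bias, bn_mean, bn_var):
        x = torch.ops.aten.conv2d.default(x, conv_weight)
        x = bn_op(x, bn_weight, bn_bias, bn_mean, bn_var, True, 0.1, 1e-5)[0]
        return x
    return pattern

patterns = [conv_bn_pattern(op) for op in _BN_OPS]
```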
Test Plan:
python test/test_quantization.py TestQuantizePT2EQAT.test_qat_conv_bn_fusion
python test/test_quantization.py TestQuantizePT2EQAT.test_qat_conv_bn_relu_fusion
Reviewers: jerryzh168, kimishpatel
Subscribers: jerryzh168, kimishpatel, supriyar
Differential Revision: [D49615145](https://our.internmc.facebook.com/intern/diff/D49615145)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109908
Approved by: https://github.com/jerryzh168
There is an issue with float8 type promotion, because _promoteTypesLookup doesn't contain records for a few types between bfloat16 and float8.
I have simply moved the float8 types to just after bfloat16; however, I'm not sure whether this breaks serialization.
Please decide whether it can stay like this, or whether I should instead insert the missing records filled with "ud" into _promoteTypesLookup instead of moving the types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110279
Approved by: https://github.com/albanD
Part of #109684
- #109684
Changes:
- Add new functions `tree_structure`, `tree_leaves`, `tree_map_` and `tree_map_only_` to Python pytree (see the usage sketch after this list).
- Extract reusable tests for pytree to `TestGenericPytree`.
- Change `treespec_dumps` and `treespec_loads` in C++ pytree to call Python pytree and use JSON string as serialization type.
- Rename `torch.utils.pytree` -> `torch.utils._cxx_pytree`.
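A quick usage sketch of the newly added Python pytree functions (via `torch.utils._pytree`; the in-place variants are assumed to apply the function for its side effects and return the original tree):
```python
import torch
import torch.utils._pytree as pytree

tree = {"a": [1, 2], "b": (3, torch.tensor([4.0]))}

leaves = pytree.tree_leaves(tree)    # [1, 2, 3, tensor([4.])]
spec = pytree.tree_structure(tree)   # the TreeSpec, with no leaf values

# In-place variants return the original tree instead of building a new one.
pytree.tree_map_(print, tree)                              # visits every leaf
pytree.tree_map_only_(int, lambda x: print(x + 1), tree)   # visits only int leaves
```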
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110395
Approved by: https://github.com/zou3519
We want to be able to use SingletonSymNode to represent strides for Jagged layout tensor. The following is for 3D, but easily generalizable to higher dimensions.
Constraints:
- [B, x, D] (where x represents the variable-length dim) can be strided in two ways: [x, 1, sum(x)] and [dx, d, 1]. We need two different placeholder values depending on how the jagged tensor is strided.
- When doing operations we need the strides of output tensors to be expressible in terms of the strides and sizes of the inner tensors. Given [B, x, D] @ [D, D'], the output strides are [x * D', D', 1] rather than some opaque [x2, D', 1]. This constraint exists because if I'm tracing, I need a symint to represent the output stride. This symint needs to come from somewhere; I can get it in several ways: (1) create a constant, (2) create an unbacked symint, (3) create a new input using a source, (4) take the output of an operation on an existing symint. It is clear that (4) is what we want here, which brings us to the design below.
Design:
Given the two constraints, the most straightforward way to implement this is actually to update SingletonSymNode to include a scalar factor, i.e., morally, SingletonSymNode represents `factor * [s_0, s_1, …, s_n]`. This enables us to symbolically compute strides from sizes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110369
Approved by: https://github.com/ezyang
ghstack dependencies: #110044
When mapping between the original signature of a program and the graph-captured signature of its exported program, we emit errors when we see unexpected original or graph-captured inputs or outputs.
These errors can arise because of various reasons, e.g.:
1. some input or output has been lifted because of mutation
2. some type is not pytree-registered for flattening / unflattening
3. some type cannot be realized with graph operations
(This is probably not an exhaustive list.)
Previously we used to emit errors based on a vanilla id-based membership check between the two sides, mostly anticipating (1) as the reason for errors. But this does not do justice to errors because of (2) or (3).
This PR emits a different error when it finds (3) to be a probable cause. Specifically, it considers only Tensor and Sym* types to be "supported": no other type seems to be realizable by graph operations.
When (2) is a probable cause, we sometimes also hit the same error because we would expect the supported types to show through upon registration. But this kind of error may need some more work in the future.
Differential Revision: [D49885828](https://our.internmc.facebook.com/intern/diff/D49885828/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110472
Approved by: https://github.com/ydwu4
Summary:
This PR removed several APIs from the AOTInductor interface,
which are not used by the client.
It also simplified AOTInductor's model class by removing
the dim info for input/output tensors. We included dim info
before to return max output shapes, which was used by the client
to allocate memory for output tensors. Now, we allocate output
tensor memory from the .so so that we don't need to maintain
such information any more. The deletion of dim info from
the model class also simplified the codegen quite a bit.
Test Plan: ci
Reviewed By: khabinov
Differential Revision: D49835430
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110411
Approved by: https://github.com/khabinov, https://github.com/desertfire, https://github.com/jansel
Summary: Registering a param/buffer writes into a vector inside Object; we need to maintain thread safety if we have threads reading from and writing to the vector at the same time.
Test Plan: CI
Differential Revision: D49882601
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110488
Approved by: https://github.com/davidberard98
Summary: This diff refactors the code by moving CUDAAllocatorConfig into the header file. This config refactoring is done so that we can use the same config code for CUDA pinned memory as well.
Test Plan: sandcastle
Differential Revision: D49653265
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110123
Approved by: https://github.com/zdevito
Summary:
See wrapper.codegen_reinterpret_view(): it returns a temporary handle for the tensor, which has the following problem.
```
# NB, the return handle here represents a temporary tensor, which will be automatically
# released.
# Here's a sample usage in the cpp wrapper code:
# ```
# aoti_torch_addmm_out(
# buf1,
# arg1_1,
# RAIIAtenTensorHandle(tmp_tensor_handle_0),
# buf0,
# 1L,
# 1L));
# ```
# RAIIAtenTensorHandle(tmp_tensor_handle_0) will be released after the call to addmm_out.
# This could be problematic when it's used in a different pattern, for example:
# ````
# AtenTensorHandle tensor_args[] = {RAIIAtenTensorHandle(tmp_tensor_handle_2), buf5, buf6};
# aoti_torch_proxy_executor_call_function(..., tensor_args);
# ````
# RAIIAtenTensorHandle(tmp_tensor_handle_2) will be invalid when it's used in the latter
# kernel call.
return f"RAIIAtenTensorHandle({tmp_name})"
```
As a result, ProxyExecutor would generate the following code, which causes invalid memory access.
Before:
```
// Source Nodes: [fn_with_tuple_output], Original ATen: [fb.fn_with_tuple_output]
AtenTensorHandle tmp_tensor_handle_2;
AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch__reinterpret_tensor(buf3, 2, int_array_0, int_array_1, 0L, &tmp_tensor_handle_2));
...
AtenTensorHandle tensor_args[] = {RAIIAtenTensorHandle(tmp_tensor_handle_2), buf5, buf6};
int64_t int_args[] = {1};
aoti_torch_proxy_executor_call_function(proxy_executor, 1, 1, int_args, 3, tensor_args);
buf3.reset();
```
With the fix in this diff, ProxyExecutor generates the following code.
After:
```
// Source Nodes: [fn_with_tuple_output], Original ATen: [fb.fn_with_tuple_output]
AtenTensorHandle tmp_tensor_handle_2;
AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch__reinterpret_tensor(buf3, 2, int_array_0, int_array_1, 0L, &tmp_tensor_handle_2));
...
aoti_torch_proxy_executor_call_function(proxy_executor, 1, 1, std::vector<int64_t>{1}.data(), 3, std::vector<AtenTensorHandle>{RAIIAtenTensorHandle(tmp_tensor_handle_2), buf5, buf6}.data());
buf3.reset();
```
I am not exactly a big fan of such `std::vector{...}.data()` for creating a temp array, but I can't think of another fix.
Test Plan: buck2 run mode/dev-nosan deeplearning/aot_inductor/test:test_custom_ops
Reviewed By: desertfire
Differential Revision: D49758764
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110451
Approved by: https://github.com/desertfire
Summary: Point-to-point ops don't enqueue their work into `workMetaList_`, which means that the NCCL watchdog does not watch over them, hence they do not respect the collective timeouts.
Test Plan:
While trying to add a test I found we don't have tests which validate the NCCL watchdog. It looks like this is because we don't have a good way to detect when the NCCL watchdog has thrown an error (the exception is thrown in a side thread) in our testing framework / `MultiprocessTestCase`.
I manually tested this change with the script in https://github.com/pytorch/pytorch/issues/109401, but need to look more closely at how to automate a test for the NCCL watchdog.
Differential Revision: D49418976
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109611
Approved by: https://github.com/wconstab
Triplet Margin Loss takes in a Callable `distance_function` parameter which is not supported as an argument on the fx graph. See previous error:
> File "/scratch/eellison/work/pytorch/torch/_dynamo/symbolic_convert.py", line 562, in call_function
self.push(fn.call_function(self, args, kwargs))
File "/scratch/eellison/work/pytorch/torch/_dynamo/variables/torch.py", line 723, in call_function
*proxy_args_kwargs(args, kwargs),
File "/scratch/eellison/work/pytorch/torch/_dynamo/utils.py", line 504, in proxy_args_kwargs
f"call_function args: {typestr(*args)} {typestr(*list(kwargs.values()))}"
File "/scratch/eellison/work/pytorch/torch/_dynamo/exc.py", line 143, in unimplemented
raise Unsupported(msg)
torch._dynamo.exc.Unsupported: call_function args: TensorVariable() TensorVariable() TensorVariable() ConstantVariable(float) NNModuleVariable()
This is fixable by just inlining into `triplet_margin_loss` and continuing to compile it. This required support for `has_torch_function_variadic`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110302
Approved by: https://github.com/mlazos
The first reland broke internal (failing diff: D49617462).
The major error looks like it's because there's an internal-only higher order op that needs a new functionalization rule. I'm going to land an internal diff for that and confirm tests pass before relanding this PR.
Also confirmed that the issue from https://github.com/pytorch/pytorch/issues/110121 is fixed, and added a test.
This reverts commit 1b90f07f5a.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110079
Approved by: https://github.com/ezyang
Summary: This diff fixes a heap underflow found by fuzzing in torch/csrc/jit/runtime/vararg_functions.cpp
Test Plan:
CI and
```
arc lionhead crash reproduce 1753074381791061
```
doesn't crash anymore.
Differential Revision: D49537535
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110441
Approved by: https://github.com/Skylion007
Summary:
Previously, we linked against CUDA libs even for the pure C++ backend.
This caused issues for cases where the inference platform does not
have GPUs. This diff removes the CUDA dependency for the C++ backend.
Reviewed By: bertmaher, muchulee8, mikekgfb
Differential Revision: D49800712
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110409
Approved by: https://github.com/bertmaher, https://github.com/desertfire
Description:
- Fixed misleading test sample case
Context: the sample input is composed of an input tensor `(N, C, iH, iW)` and a grid tensor `(N, oH, oW, 2)`; however, the grid was defined as `(N, C, oW, 2)`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110383
Approved by: https://github.com/peterbell10
**Background**: recordStream calls can result in memory spikes, so we don't want them to appear in FSDP (https://dev-discuss.pytorch.org/t/fsdp-cudacachingallocator-an-outsider-newb-perspective/1486). @awgu is working on fixing this, but it turns out the profiler was causing recordStream to get called when it is enabled.
Why profiler was causing recordStream to get called: NCCL calls add profiler events manually; they register a callback to be executed when the future for the collective is completed; this indicates the end of the CPU-side profiler event for the callback:
c2c7c4035f/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (L1822-L1824)
In order to guarantee safety, ivalue::Future::invokeCallback calls `recordStream` on the future's storage buffers; this marks the fact that other streams (e.g. the one that the callback runs on) may need to use the storage.
c2c7c4035f/aten/src/ATen/core/ivalue_inl.h (L1171-L1173)
**Change**: The end-profiler-event callback doesn't actually use the future, so we don't need to recordStream on it. This PR introduces an optional parameter `uses_future` for adding callbacks; a user can set this variable to "false" to unsafely skip the recordStream, if the user knows that the future will not be used in the lambda.
**Tests**: (a) unit tests; (b) added an assert in recordStream: c2c7c4035f/c10/cuda/CUDACachingAllocator.cpp (L3260) and verified that it doesn't get triggered when running basic distributed tests w/ profiler enabled
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109933
Approved by: https://github.com/wconstab
Summary: after converting nn.MultiheadAttention we weren't deleting the
old in_proj_weight and in_proj_bias, despite not (really) using them.
Test Plan: python test/test_quantization.py -k
"test_custom_module_multi_head_attention"
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110407
Approved by: https://github.com/jerryzh168
This PR makes `backend` a property of DTensorTestbase and adds "cpu:gloo,cuda:nccl" support in DTensorTestbase so that we can use the `cpu:gloo,cuda:nccl` backend for checkpoint unit tests.
cc. @wanchaol, @fduwjj, @XilunWu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110397
Approved by: https://github.com/wanchaol
Partially fixes `test_memory_format_factory_like_functions_preserve` with PYTORCH_TEST_WITH_INDUCTOR. Inductor preserves memory layouts for user-visible outputs as annotated on the fx graph that it is passed in. That graph is generated from running aot_autograd with decompositions. If the decompositions give incorrect strides, so will inductor.
This preserves the layout of `_like` operators when it corresponds to a `torch.memory_format`. It doesn't fix a) arbitrary permutations, b) striding of non-dense outputs. Both of these are lower-pri compared to preserving channels last. We would need either https://github.com/pytorch/pytorch/issues/92920 or a `to` variant that takes in a physical layout to handle arbitrary permutations. I converted the output of rand to the correct layout instead of passing the layout in, so that this would compose with the `replace_random` pass, and because the two pointwise ops will get fused anyway.
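For example, the behavior being preserved looks roughly like this (a hedged repro sketch, not a test from this PR):
```python
import torch

@torch.compile
def f(x):
    # A `_like` factory inside a compiled region: its output layout should
    # match eager, i.e. stay channels-last when the input is channels-last.
    return torch.rand_like(x) + x

x = torch.randn(2, 3, 8, 8).contiguous(memory_format=torch.channels_last)
out = f(x)
assert out.is_contiguous(memory_format=torch.channels_last)
```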
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110242
Approved by: https://github.com/int3
The original implementation of `_gather_orig_param_state` is naive. It performs one allgather_object and two allgathers (if the optimizer is Adam) per FQN. This can be slow and makes `_optim_state_dict` a bottleneck.
This PR rewrites the implementation and fuses all the `allgather_object`s into one. As for `allgather`, it is fused based on the information of FlatParameters, so there will be 2N `allgather`s, where N is the number of FlatParameters and the factor of 2 is due to Adam having 2 states per FQN.
One experiment on 8 A100 GPUs shows that the execution time of the gathering improved from 3 seconds to 0.3 seconds.
Differential Revision: [D48835138](https://our.internmc.facebook.com/intern/diff/D48835138/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108298
Approved by: https://github.com/awgu
Summary:
Adding back D46578700 / PR https://github.com/pytorch/pytorch/pull/108426
Note: the changes were originally reverted due to a memory regression; these changes put the code behind a gflag so it is only used by binaries that require the expanded stack for BPF profiling.
Original Diff comment:
To get a Node's call stack we currently loop on the InlinedCallStack graph and follow the "callee" chain. Since the node's inlined stack does not change, we can optimize this by expanding the node's inlined stack once and reusing it. This is particularly useful when reading the node's stack from another process (e.g. BPF) as it simplifies the memory traversal process.
The new data structure (NodeSourceInfo) only holds pointers to the function name and file name variables, and assumes these objects will be alive throughout the lifetime of the process.
Each Node has an extended attribute that holds an index into a vector of stack frames, expanded_node_stacks_.
node_stack_attr_symbol_ is only needed to make accessing the stack vector index attribute easier from BPF.
Test Plan:
- Verified using BPF Program in subsequent diffs
- Perf testing for loading large model: P822455246
Differential Revision: D49565461
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110229
Approved by: https://github.com/zdevito
If the `XLA_HLO_DEBUG` flag is enabled, generate a minified HLO graph when using the minifier. This function enables HLO minification support by porting the minified FX graph to StableHLO via the `save_torch_model_as_stablehlo` function.
This allows users to port the minified graph to compilers that are not compatible with the TorchDynamo/Inductor workflow and use XLA instead. The purpose of this PR is to help XLA users debug accuracy and compilation errors. It will also be helpful for the existing TorchDynamo/XLA workflow on the `torchxla_trace_once` backend.
Fixes [#5461](https://github.com/pytorch/xla/issues/5461) in Torch XLA repo. CC @GleasonK @qihqi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109084
Approved by: https://github.com/anijain2305
pytree is a great tool, but it is sometimes considered harmful for
tensor subclasses: it is useful for implementing a subclass quickly, but it
* exposes non-trivial CPU overhead
* is unnecessary for many ops; only the ones with list/dict arguments need it
* has semantic issues for inplace/out ops when used to blindly re-wrap outputs
This PR avoids using pytree for most ops during torch_dispatch and only
enables it for certain ops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110132
Approved by: https://github.com/fduwjj
Summary:
Previously we used export's constraints to specify all batch-size dimensions being dynamic. This is done by creating 1 constraint `dynamic_dim(inp[0][0], lower, upper)`, followed by `dynamic_dim(inp[0][0]) == dynamic_dim(inp[i][0])` for every input `i`.
Through the new `dynamic_shapes` API, we can use `Dims("batch_size")` on every dimension to specify which dimensions are dynamic and equal to each other, and `None` otherwise: `{i: [Dims("batch_size", lower, upper), None] for every input i}`
Note: `dynamic_shapes` and `constraints` utilize the same "constraints" backend so this diff should be idempotent.
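A rough OSS-flavored sketch of the two styles (using the public `torch.export` `dynamic_dim`/`Dim` APIs; the internal benchmark's helper names may differ, and the model/inputs here are stand-ins):
```python
import torch
from torch.export import export, Dim, dynamic_dim

class M(torch.nn.Module):
    def forward(self, a, b):
        return a + b

model = M()
inputs = (torch.randn(8, 16), torch.randn(8, 16))

# Old style: a range constraint on the first input's batch dim, plus explicit
# equalities tying every other input's batch dim to it.
constraints = [dynamic_dim(inputs[0], 0) >= 2, dynamic_dim(inputs[0], 0) <= 1024]
constraints += [dynamic_dim(inputs[0], 0) == dynamic_dim(inp, 0) for inp in inputs[1:]]
ep_old = export(model, inputs, constraints=constraints)

# New style: a single named Dim shared across inputs marks those dims as
# dynamic and equal; None (or omission) leaves the remaining dims static.
batch = Dim("batch_size", min=2, max=1024)
ep_new = export(model, inputs, dynamic_shapes=({0: batch}, {0: batch}))
```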
Test Plan: `buck2 run @//mode/dev-nosan //caffe2/torch/fb/model_transform/experimental/benchmark/test/aotinductor:test_aot_inductor_benchmark`
Reviewed By: chenyang78, aakhundov
Differential Revision: D49784351
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110360
Approved by: https://github.com/desertfire
When running on the "gloo" or "cpu:gloo,cuda:nccl" backend, it runs into the following error.
```
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/data/users/irisz/pytorch/torch/multiprocessing/spawn.py", line 74, in _wrap
fn(i, *args)
File "/data/users/irisz/pytorch/torch/distributed/checkpoint/examples/fsdp_checkpoint_example.py", line 105, in run_fsdp_checkpoint_example
optim_state = load_sharded_optimizer_state_dict(
File "/data/users/irisz/pytorch/torch/distributed/checkpoint/optimizer.py", line 295, in load_sharded_optimizer_state_dict
_alloc_tensor(value.properties, value.size, dp_pg_device_type), sharding_spec
File "/data/users/irisz/pytorch/torch/distributed/checkpoint/optimizer.py", line 109, in _alloc_tensor
device=cast(torch.device, _get_device_module(device_type).current_device()),
AttributeError: module 'torch.cpu' has no attribute 'current_device'
```
This PR fixes the error in optimizer.py. A follow-up will add "cpu:gloo,cuda:nccl" support in DTensorBase so we can update the unit test to include this backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110299
Approved by: https://github.com/kumpera
Summary:
https://docs.google.com/document/d/1QJJEGnj2nHGPODlw38BEG3KLLCOTfdOVjPrNQbz_LM8/edit#bookmark=id.lp80wfshq130
`exported_program.run_decompositions(decomposition_table)` will optionally take a decomposition table, and run decompositions on the exported program, returning a new exported program. By default we will run the Core ATen decomposition table.
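A small usage sketch of the new API (assuming the OSS entry points `torch.export.export` and `torch._decomp.core_aten_decompositions`):
```python
import torch
from torch._decomp import core_aten_decompositions

class M(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.silu(x)

ep = torch.export.export(M(), (torch.randn(4),))

# Returns a new ExportedProgram with the decompositions applied; passing the
# core ATen table decomposes ops like silu into core ATen ops.
core_ep = ep.run_decompositions(core_aten_decompositions())
print(core_ep.graph_module.code)
```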
Splitting up this diff with the following one (D49742989) to make migrating Executorch easier:
1. Land this diff
2. Wait for a pytorch nightly to include this diff
3. Update executorch's pytorch nightly
4. Land the following diff to have export() return no decomps
Test Plan: Tested in following diff
Differential Revision: D49743208
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110236
Approved by: https://github.com/gmagogsfm
Example of when the `evict_first` heuristic helps.
```
@torch.compile
def f(a, b):
return (a * b).sum(dim=-1)
N = 512
inps = (torch.randn(N, N, N).permute(2, 1, 0), torch.randn(N, N, N).permute(1, 2, 0))
from torch._inductor.utils import do_bench
print(do_bench(lambda: f(*inps)))
```
This generates code like this: http://ix.io/4HFs
```
Original: 3.8 ms
This PR: 3.54 ms
Always evict_first: 5.4 ms
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108841
Approved by: https://github.com/lezcano, https://github.com/jansel
When users define a customized attention mask using `dtype=torch.float16`, e.g.
```
from torch.nn import functional as F
float_min = torch.finfo(torch.float16).min
attention_mask_fp16 = (attention_mask * 1.0).masked_fill(attention_mask, float_min).to(torch.float16)
attn_output = F.scaled_dot_product_attention(
query_layer_, key_layer_, value_layer_, attention_mask_fp16, 0.0, is_causal=False
)
```
the ONNX graph cannot be exported.
When q, k, v have the fp16 type, we can also support an fp16 `attn_mask` by relaxing the dtype check from equality with FLOAT to membership in (FLOAT, HALF):
```
elif (
    _type_utils.JitScalarType.from_value(attn_mask)
    in (_type_utils.JitScalarType.FLOAT, _type_utils.JitScalarType.HALF)
):
```
With this, the `.onnx` graph can be exported.
Fixes #109336
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110306
Approved by: https://github.com/titaiwangms
Summary:
## Why?
The .so and .h files are compiled separately with different flags. The .so is compiled by AOTInductor and the .h files (e.g. c10/core/TensorImpl.h) are compiled by buck2.
Let's make sure the .so is also compiled with this macro in fbcode.
Differential Revision: D49664078
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110122
Approved by: https://github.com/chenyang78, https://github.com/khabinov
This PR enables the misc-XX checks in clang-tidy. Meanwhile, I excluded some of them that require a lot of code changes and have no immediate benefits. Some additional fixes and suppression were also given.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110283
Approved by: https://github.com/albanD
Summary: with the grid computed in terms of unbacked `SymInt`s, it can happen that the grid is zero-sized. This causes a CUDA error on `cuLaunchKernel` in the AOT Inductor codegen.
In this PR, when the grid contains unbacked `SymInt`s, a check is added around the `launchKernel` call in the AOT Inductor C++ wrapper codegen to make sure that the grid is not zero-sized.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110312
Approved by: https://github.com/chenyang78
**Summary**
Follow-up to https://github.com/pytorch/pytorch/pull/109893, which has an issue with CPU support as reported in https://github.com/pytorch/pytorch/issues/109897. This fix mainly includes 2 changes:
- The current implementation of `rename_indexing`
10c646295d/torch/_inductor/codegen/common.py (L1023) only adds symbol names starting with `s` or `ps` into `kernel.args.sizevars`. However, unbacked symint names start with `i`, so we extend the implementation of `rename_indexing` to support symbols starting with `i`.
- Currently, the internal loop index names also start with `i`. Since `i` is now used for unbacked symints, change the name to start with `x`, which aligns with Triton.
**Test Plan**
```
python -u -m pytest -s -v test_torchinductor_dynamic_shapes.py -k test_bool_mask_nobreak
python -u -m pytest -s -v test_torchinductor_dynamic_shapes.py -k test_nonzero_size_factory_nobreak
python -u -m pytest -s -v test_torchinductor_dynamic_shapes.py -k test_item_zeros_nobreak
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110262
Approved by: https://github.com/ezyang, https://github.com/jgong5
Summary: This diff fixes a heap UAF found by fuzzing in torch/csrc/jit/mobile/interpreter.cpp
Test Plan:
CI and
```
arc lionhead crash reproduce 1009060456885023
```
doesn't crash anymore.
Reviewed By: malfet
Differential Revision: D49538326
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110289
Approved by: https://github.com/malfet
An `ExportedProgram`'s `__call__` signature is different from the original module, so `dynamic_shapes` that follow the original signature would fail when applied to re-export an `ExportedProgram`.
This PR fixes this issue, in other words, the original `dynamic_shapes` should now work when re-exporting.
Differential Revision: D49764011
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110276
Approved by: https://github.com/tugsbayasgalan
generate_output may return non-list/tuple outputs. Let's force
those to be lists, because we will enumerate kernel.outputs
later in the codegen.
Also fixed a minor issue in an assertion message.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110145
Approved by: https://github.com/aakhundov
Generating reference outputs sometimes fails because of type mismatches in the graph,
an issue which was noticed previously for `prims.convert_element_type` and fixed in #92036
but the same issue happens with other functions such as tensor constructors.
This expands the fix from #92036 to all dtype keyword arguments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110232
Approved by: https://github.com/ezyang
Removing the functionality from the nvFuser Python APIs.
Since the use of nvFuser was deprecated before the last release cut, we are removing TorchScript support.
I'll have the next PR actually remove the code base.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110124
Approved by: https://github.com/davidberard98
Summary:
Also added annotation support for conv1d_relu and conv1d in XNNPACKQuantizer; the quantized results still
match the FX quant path (which didn't quantize conv1d), so tests are not disabled.
Test Plan: with-proxy buck2 run executorch/examples/quantization:example -- -m=w2l --verify
Differential Revision: D49479546
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109830
Approved by: https://github.com/kimishpatel
Summary:
Add the test to make sure we can call the quantize API multiple times
Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_reentrant
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110125
Approved by: https://github.com/kimishpatel
ghstack dependencies: #110097
Opaque pointer support is disabled in LLVM 14 and enabled by default from LLVM 15 and above.
The setOpaquePointers API is deprecated from LLVM 16, so this API usage has been removed.
Also update the CreateMalloc and CreateFree APIs for the latest LLVM release.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110200
Approved by: https://github.com/Skylion007
Make `np.arange` respect an explicitly provided dtype.
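A minimal check of the intended behavior (via the `torch._numpy` compat layer these tests exercise; treat the exact import path as an assumption):
```python
import torch._numpy as np

# With the fix, an explicitly provided dtype wins over inference from the
# start/stop/step arguments.
print(np.arange(5, dtype=np.float32).dtype)  # float32, not an inferred int dtype
```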
Also remove duplicated tests:
- torch_np/test_function_base.py::TestArange is a dupe of
- torch_np/numpy_tests/core/test_multiarray.py::TestArange
Fixes #109975
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110005
Approved by: https://github.com/lezcano
A resubmit of https://github.com/pytorch/pytorch/pull/108447. Copy over the descriptions:
This is a follow-up of the discussion in https://github.com/pytorch/pytorch/pull/108356, where we want to replace source_fn with source_fn_stack.
Before this PR, for the following example:
```python
backend = EagerAndRecordGraphs()
@torch.compile(backend=backend, fullgraph=True)
def cond_f(pred, pred2, x, y):
def true_fn(pred2, x, y):
return x + y
def false_fn(pred2, x, y):
def true_fn2(x, y):
return x.sin() - y.cos()
def false_fn2(x, y):
return x.cos() - y.sin()
return control_flow.cond(pred2, true_fn2, false_fn2, (x, y))
return control_flow.cond(pred, true_fn, false_fn, (pred2, x, y))
```
The graph captured is shown below:
```python
class GraphModule(torch.nn.Module):
def forward(self, L_pred_ : torch.Tensor, L_pred2_ : torch.Tensor, L_x_ : torch.Tensor, L_y_ : torch.Tensor):
l_pred_ = L_pred_
l_pred2_ = L_pred2_
l_x_ = L_x_
l_y_ = L_y_
cond_true_1 = self.cond_true_1
cond_false_1 = self.cond_false_1
cond = torch.ops.higher_order.cond(l_pred_, cond_true_1, cond_false_1, [l_pred2_, l_x_, l_y_]); l_pred_ = cond_true_1 = cond_false_1 = l_pred2_ = l_x_ = l_y_ = None
return (cond,)
class GraphModule(torch.nn.Module):
def forward(self, l_pred2_, l_x_, l_y_):
add = l_x_ + l_y_; l_x_ = l_y_ = None
return add
class GraphModule(torch.nn.Module):
def forward(self, l_pred2_, l_x_, l_y_):
cond_true_0 = self.cond_true_0
cond_false_0 = self.cond_false_0
cond = torch.ops.higher_order.cond(l_pred2_, cond_true_0, cond_false_0, [l_x_, l_y_]); l_pred2_ = cond_true_0 = cond_false_0 = l_x_ = l_y_ = None
return cond
class GraphModule(torch.nn.Module):
def forward(self, l_x_, l_y_):
sin = l_x_.sin(); l_x_ = None
cos = l_y_.cos(); l_y_ = None
sub = sin - cos; sin = cos = None
return sub
class GraphModule(torch.nn.Module):
def forward(self, l_x_, l_y_):
cos = l_x_.cos(); l_x_ = None
sin = l_y_.sin(); l_y_ = None
sub = cos - sin; cos = sin = None
return sub
```
the source_fn for inner cond, sin, cos will be a (name, target) tuple:
```
('cond', <torch._ops.HigherOrderOperator object at xxx>)
('sin', 'sin')
('cos', 'cos')
('sub', <built-in function sub>)
```
After this PR, the source_fn_stack will be a list of (name, target) tuples. The bottom of the stack is the end of the list.
```
[('cond', <torch._ops.HigherOrderOperator object at xxx>), ('cond', <torch._ops.HigherOrderOperator object at xxx>)],
[('cond', <torch._ops.HigherOrderOperator object at xxx>), ('cond', <torch._ops.HigherOrderOperator object at xxx>), ('sin', 'sin')],
[('cond', <torch._ops.HigherOrderOperator object at xxx>), ('cond', <torch._ops.HigherOrderOperator object at xxx>), ('cos', 'cos')]
[('cond', <torch._ops.HigherOrderOperator object at xxx>), ('cond', <torch._ops.HigherOrderOperator object at xxx>), ('sub', <built-in function sub>)]
```
Test Plan:
See added tests in test_higher_order_ops.py and modify existing test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108595
Approved by: https://github.com/angelayi, https://github.com/zou3519
This PR allows us to use the same failures_dict for multiple test
classes. This is helpful if you have a bunch of small TestCases and
want to centralize all the failures dicts into one big one.
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110164
Approved by: https://github.com/williamwen42
When the sharding_strategy is set to SHARD_GRAD_OP and forward_prefetch=True, during a validation-only run self.is_first_iter will always be True (because training=False, so iter+1 is not executed). Additionally, the _pre_forward_order_index of the first handle entering the record_pre_forward function is 0. This causes the handle to get a False result in the if condition at line 166 when entering the record_pre_forward function again (the expected value should be True, because _pre_forward_order_index has actually been assigned a value). As a result, the first handle is repeatedly added to handles_pre_forward_order, leading to an incorrect prefetching order.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110138
Approved by: https://github.com/awgu
It's useful to have a simple, lightweight way to run a model that adds
essentially no overhead to calling the model's generated `run_impl` method.
This C API is a super thin wrapper around AOTInductorModel: Create, Run, and
Delete are provided, and do very little work beyond dispatch to the appropriate
helpers.
Note the Create function also provides additional functionality beyond the
Container API; it allows the user to pass in a weight map defined in userland,
which is a requirement for several serving use cases.
Differential Revision: [D49670711](https://our.internmc.facebook.com/intern/diff/D49670711/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110158
Approved by: https://github.com/desertfire, https://github.com/chenyang78
Summary:
Use the upstream `torch._dynamo.same` function in accuracy checking and remove the self-hosted version in torchbench.
Now cmf_10x and ads_dhen_5x can run in deterministic mode, so we enable deepcopy and deterministic mode.
Test Plan:
```
$ buck2 run mode/opt //pytorch/benchmark:run -- cmf_10x -d cuda -t train --accuracy
Running train method from cmf_10x on cuda in eager mode with input batch size 4 and precision tf32.
Accuracy: pass
```
```
$ buck2 run mode/opt //pytorch/benchmark:run -- cmf_10x -d cuda -t train --torchdynamo inductor --torchinductor_enable_batch_fusion --torchinductor_enable_split_cat_fx_pass --accuracy
Running train method from cmf_10x on cuda in dynamo inductor mode with input batch size 4 and precision tf32.
Accuracy: pass
```
Without this PR, it will print:
```
File "/mnt/xarfuse/uid-234232/9aa53cfe-seed-nspid4026531836_cgpid9238070-ns-4026531840/torch/_dynamo/utils.py", line 190, in time_wrapper
r = func(*args, **kwargs)
File "/mnt/xarfuse/uid-234232/9aa53cfe-seed-nspid4026531836_cgpid9238070-ns-4026531840/torch/_inductor/graph.py", line 464, in run
return super().run(*args)
File "/mnt/xarfuse/uid-234232/9aa53cfe-seed-nspid4026531836_cgpid9238070-ns-4026531840/torch/fx/interpreter.py", line 138, in run
self.env[node] = self.run_node(node)
File "/mnt/xarfuse/uid-234232/9aa53cfe-seed-nspid4026531836_cgpid9238070-ns-4026531840/torch/_inductor/graph.py", line 826, in run_node
result.realize_hint()
File "/mnt/xarfuse/uid-234232/9aa53cfe-seed-nspid4026531836_cgpid9238070-ns-4026531840/torch/_inductor/ir.py", line 5273, in realize_hint
and self.is_pointwise_non_scalar_tensor_num_reads_larger_than_one()
File "/mnt/xarfuse/uid-234232/9aa53cfe-seed-nspid4026531836_cgpid9238070-ns-4026531840/torch/_inductor/utils.py", line 343, in wrapper
setattr(self, key, fn(self))
File "/mnt/xarfuse/uid-234232/9aa53cfe-seed-nspid4026531836_cgpid9238070-ns-4026531840/torch/_inductor/ir.py", line 5332, in is_pointwise_non_scalar_tensor_num_reads_larger_than_one
(sum(read.index != 0 for read in self.data.get_reads()) > 1)
File "/mnt/xarfuse/uid-234232/9aa53cfe-seed-nspid4026531836_cgpid9238070-ns-4026531840/torch/_inductor/ir.py", line 5332, in <genexpr>
(sum(read.index != 0 for read in self.data.get_reads()) > 1)
File "/mnt/xarfuse/uid-234232/9aa53cfe-seed-nspid4026531836_cgpid9238070-ns-4026531840/torch/_inductor/dependencies.py", line 74, in index
raise NotImplementedError("StarDep does not have an index")
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
NotImplementedError: StarDep does not have an index
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True
```
Reviewed By: jackiexu1992, mengluy0125
Differential Revision: D49639733
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110189
Approved by: https://github.com/desertfire, https://github.com/jansel
Recently we updated the `export` API to take an experimental `dynamic_shapes` argument that was meant to subsume the existing `constraints` argument.
This PR deprecates `constraints` (with a warning on its use, but without actually removing it). Simultaneously it replaces all uses of `constraints` in docs, examples, and tests with corresponding uses of `dynamic_shapes` (preserving behavior). This exercise fortunately revealed some minor bugs in the implementation which have also been fixed in this PR.
Some uses of `constraints` still remain, e.g., when `torch._dynamo.export` is called directly. (Meta-internal uses will be updated in a separate diff.)
Differential Revision: D49676049
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110143
Approved by: https://github.com/tugsbayasgalan
Summary:
Resolving error:
AttributeError: Can't pickle local object '_add_module_to_qconfig_obs_ctr.<locals>.get_factory_kwargs_based_on_module_device'
by moving the nested function out to the main module.
Test Plan: Added test to CI
Reviewed By: andrewor14
Differential Revision: D49187352
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109288
Approved by: https://github.com/andrewor14
The commit message of #107862 says it enabled mypy checking for
sizevars.py, but it seems that it neglected to update .lintrunner.toml.
New type errors appear to have crept in since then, so I've fixed them
accordingly.
A similar mistake happened with #109955 for joint_graph.py, though that
one is more recent and so hasn't had any new type errors to fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110109
Approved by: https://github.com/Skylion007
This is a reland of PRs https://github.com/pytorch/pytorch/pull/108626 and #109564. We fixed the iOS build failure by changing
```
((CHECK) ? (EXPR) : ([] { assert(!#CHECK); }(), (EXPR)))
```
to
```
((CHECK) ? (EXPR) : ([] { assert(false); }(), (EXPR)))
```
in TR2_OPTIONAL_ASSERTED_EXPRESSION, since the former syntax was invalid on Apple Clang. Anyway, we could apply the simple fix hoping that c10::optional would be replaced by std::optional soon.
We also enabled -Wdeprecated on c10.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110019
Approved by: https://github.com/clee2000
## Context
Add existing decomps for `lift_fresh`, `split.Tensor`, and `unbind` to the core ATen decomposition table. Do not use them in inductor, since Inductor currently lowers these directly.
One note though is that `lift_fresh`'s decomposition has a note saying it's not correct under autograd. However, my understanding is that these decompositions are registered to the `"post_autograd"` decomposition table, meaning autograd wouldn't be a factor. Would like some confirmation that this premise is correct.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110102
Approved by: https://github.com/jansel
PR #108360 uses the same default `last_dim_size` formula from complex-to-real (C2R) transforms for
complex-to-complex (C2C) and real-to-complex (R2C). However, this is not correct because for C2R
the input is only half the size of the full tensor, which is not the case for C2C and R2C.
This error is mostly benign since `last_dim_size` was only used for the `>= 1` condition which is
almost always met anyway.
For this PR I now use it as the argument to `_apply_norm`, which makes it load-bearing for correctness,
and so it is thoroughly tested now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109083
Approved by: https://github.com/lezcano
First error
```
Traceback (most recent call last):
File "/home/ubuntu/exporty.py", line 8, in <module>
ep = torch.export.export(MyModule(), torch.randn(5))
File "/opt/conda/envs/sam/lib/python3.10/site-packages/torch/export/__init__.py", line 509, in export
return export(f, args, kwargs, constraints)
File "/opt/conda/envs/sam/lib/python3.10/site-packages/torch/_export/__init__.py", line 314, in export
raise UserError(UserErrorType.INVALID_INPUT,
torch._dynamo.exc.UserError: Expecting `args` to be a tuple of example positional inputs, got <class 'torch.Tensor'>
```
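For reference, the call that avoids the first error passes the example inputs as a tuple (minimal sketch):
```python
import torch

class MyModule(torch.nn.Module):
    def forward(self, x):
        return x * 2

# `args` must be a tuple of example positional inputs, not a bare tensor.
ep = torch.export.export(MyModule(), (torch.randn(5),))
```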
Second error
```
(sam) ubuntu@ip-172-31-9-217:~$ python exporty.py
Traceback (most recent call last):
File "/home/ubuntu/exporty.py", line 13, in <module>
torch.export.save(ep, 'exported_program.pt2', extra_files=extra_files)
File "/opt/conda/envs/sam/lib/python3.10/site-packages/torch/export/__init__.py", line 566, in save
save(ep, f, extra_files=extra_files, opset_version=opset_version)
File "/opt/conda/envs/sam/lib/python3.10/site-packages/torch/_export/__init__.py", line 595, in save
encoded_content = content.encode('utf-8')
AttributeError: 'bytes' object has no attribute 'encode'. Did you mean: 'decode'?
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110169
Approved by: https://github.com/angelayi