…with large index
Fixes #130806
When an output of 2147483648 (= 131072 * 16384) elements is expected, as in the above issue, it threw the following error:
RuntimeError: HIP error: invalid configuration argument
What happened was that the second parameter passed to hipLaunchKernel was an absurd {2147483648,1,1}.
Found two issues in Indexing.cu:
1: `ptrdiff_t` was used, but it is a signed int; `outTotalSize >= 2147483648` can cause overflow when doing [this](39493aa934/aten/src/ATen/native/cuda/Indexing.cu (L1367)):
2: On ROCm, `std::min -> ::min` did not work as expected when `outTotalSize >= 2147483648`
As a result, 2147483648 was sent to hipLaunchKernel, which the GPU does not support, since this number specifies the number of threads per block. The original code intended to set 128 threads per block; this is debatable, as the perf would not be good for the latest powerful GPUs (maybe a TODO item for a perf update?), but at least it would not cause the `invalid configuration argument` error.
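For context, here is a minimal repro sketch of the failure mode, assuming an embedding lookup whose output has 131072 * 16384 elements; the exact snippet lives in the linked issue, so the shapes below are illustrative only:
```py
# Hedged repro sketch; shapes are assumed from the 131072 * 16384 = 2147483648 output size.
import torch

emb = torch.nn.Embedding(32, 16384, device="cuda")
idx = torch.randint(0, 32, (131072,), device="cuda")
out = emb(idx)                   # output has 2**31 elements
print(out.dim(), out.numel())    # expect: 2 2147483648
```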
[Test]
Run the same code snippet as in the [issue](https://github.com/pytorch/pytorch/issues/130806) and print the output, its dim, and its numel(), which now looks like this:
```
output=tensor([[ 0.4044, -0.0244, -0.6865, ..., -0.7800, 0.1175, 1.6726],
[-1.0866, -0.1609, 0.3538, ..., 1.9105, 0.7882, 1.1583],
[-2.2079, 0.3736, 0.3610, ..., -0.2658, -0.0459, 1.3077],
...,
[ 0.8753, -0.7482, -0.1978, ..., 0.9016, 1.1501, -0.5178],
[-1.5845, -0.6277, 1.4520, ..., 0.5733, -2.1198, -0.0915],
[-0.6310, -1.0239, -0.1910, ..., 0.4309, 0.1630, 0.3239]],
device='cuda:0'), dim=2, numel=2147483648
```
Added a large tensor unit test too.
```
/pytorch# pytest test/nn/test_embedding.py -k test_large_tensors
================================================================================== test session starts ===================================================================================
platform linux -- Python 3.9.19, pytest-7.3.2, pluggy-1.4.0
rootdir: /dockerx/development/pytorch
configfile: pytest.ini
plugins: flakefinder-1.1.0, rerunfailures-14.0, xdist-3.3.1, xdoctest-1.1.0, cpp-2.3.0, hypothesis-5.35.1
collected 288 items / 287 deselected / 1 selected
Running 1 items in this shard
test/nn/test_embedding.py . [100%]
=========================================================================== 1 passed, 287 deselected in 3.16s ============================================================================
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130994
Approved by: https://github.com/jeffdaily, https://github.com/xw285cornell
Currently we require `n % register_block_n == 0`, which typically brings good perf when `n` is a multiple of 8, 16, 32, etc., but falls back to the reference micro-gemm otherwise (where `register_block_n == 1`). This PR optimizes this by padding `n` of the packed weight to a multiple of `register_block_n` (8, 16, 32, etc.), so the micro-gemm can work as-is on the padded `n`. When the weight is padded, we use a local accumulation buffer to hold the micro-gemm result and then unpad (slice) it before storing back to the output buffer.
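A rough Python sketch of the padding/slicing idea (the actual change lives in the C++ micro-gemm template codegen; the function and shapes here are illustrative, not the real implementation):
```py
import torch

def gemm_with_padded_n(x, w, register_block_n=16):
    # x: (m, k), w: (k, n) packed weight; illustrative shapes only.
    n = w.shape[1]
    n_padded = (n + register_block_n - 1) // register_block_n * register_block_n
    # Pad the weight columns up to a multiple of register_block_n:
    w_padded = torch.nn.functional.pad(w, (0, n_padded - n))
    acc = x @ w_padded   # micro-gemm works as-is on the padded n (local accumulation buffer)
    return acc[:, :n]    # unpad (slice) before storing back to the output buffer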
Performance numbers measured on "Intel (R) Xeon (R) CPU Max 9480", single core, bf16.
Before
```
AUTOTUNE linear_unary(512x768, 3073x768, 3073)
  _linear_pointwise 2.3563 ms 100.0%
  cpp_packed_gemm_0 710.5902 ms 0.3%
```
After
```
AUTOTUNE linear_unary(512x768, 3073x768, 3073)
  cpp_packed_gemm_0 1.8909 ms 100.0%
  _linear_pointwise 2.1016 ms 90.0%
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130690
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
ghstack dependencies: #130675
0.12.0 Major Updates:
- Add context manager to temporarily set the dictionary sorting mode
- Add accessor APIs
- Use `stable` tag for `pybind11` for Python 3.13 support
- Fix potential segmentation fault for pickling support
0.12.1 Updates:
- Fix warning regression during import when launched with strict warning filters
Closes #130155
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130139
Approved by: https://github.com/zou3519
ghstack dependencies: #130895
Fixed TrainingIRToRunDecomp failures for test_tensor_attribute_zero_args, and also a few re-traceability failures, because run_decomposition does a retracing.
**edit:** also removed the eliminate_dead_code() call in _unlift because of one onnx test failure:
a constant tensor attr was lifted as a constant_tensor input, but it is not used in the graph after aot_autograd due to a shortcut in its decomposition. This caused the setattr to be removed by eliminate_dead_code, while the graph signature still contains the name of that buffer, which leads to an inconsistency between the transformed graph and the ep's original signature after _unlift. It seems this has happened a few times, where some nodes are accidentally removed and we end up in an inconsistent state.
The alternative to removing it would be: every time we call eliminate_dead_code, we verify the consistency of the graph against 1. the graph before transformation and 2. all the metadata, but I think this deserves a complete design.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130990
Approved by: https://github.com/pianpwk
There's no reason to ban them for vmap or jvp, because without the
{grad, vjp} transforms those just act above PyTorch autograd, which will
end up saving regular Tensors.
Test Plan:
- some tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131191
Approved by: https://github.com/drisspg
#### Issue
Model parameters sometimes do not appear in the `named_parameters()` function, for example when trying to jit.trace an already jit.scripted model. This PR fixes that by relying on `state_dict` to get both parameters (`requires_grad=True`) and buffers.
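A minimal sketch of the idea, assuming a scripted module whose `named_parameters()` does not surface its parameters (the module below is illustrative):
```py
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(2, 2)
        self.register_buffer("scale", torch.ones(2))

    def forward(self, x):
        return self.linear(x) * self.scale

scripted = torch.jit.script(M())
# state_dict() exposes both parameters (requires_grad=True) and buffers,
# even when named_parameters() on the scripted module comes back incomplete:
params_and_buffers = dict(scripted.state_dict())
```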
#### Test Plan
* `pytest test/export/test_converter.py -s -k test_convert_retrace_nested_scripted_modules`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129787
Approved by: https://github.com/angelayi
Summary:
In the export workflow, we always have a lifted graph that doesn't fetch constants through get_attr nodes. This causes compatibility issues when we try to use inductor's split_const_gm function with a lifted graph.
This diff makes an additive change to split_const_gm's interface: when the pass sees a placeholder node that is present in the lifted_constants table, it will also use that as a source of constness.
This change won't break existing code, and the lifted_constants table can be used orthogonally to the existing const-folding mechanisms.
Also, as requested by the MTIA team, we introduce a small callback function used to skip certain nodes during const folding.
For the internal followup counterpart, see D59685145
Test Plan: buck run mode/opt caffe2/test:test_export -- -r split_const_gm
Differential Revision: D59692790
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130743
Approved by: https://github.com/desertfire, https://github.com/SherlockNoMad
Earlier, the signature of dequantize ops for decomposed quantized tensors was changed for wider use cases where the output dtype can be different from torch.float and needs to be passed during dequantization.
Please refer to: https://github.com/pytorch/pytorch/pull/121450
However, setting the correct output dtype for dequantize ops was still missing in the convert_pt2e flow.
This change enables users to use the PT2E quantization flow with a non-torch.float unquantized dtype, such as torch.bfloat16.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128953
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
beartype has served us well in identifying type errors and ensuring we call internal functions with the correct arguments (thanks!). However, the value of having beartype is diminished for the following reasons:
1. When beartype improved its support for Dict[] type checking, it discovered typing mistakes in some functions that were previously uncaught. This caused the exporter to fail with newer versions of beartype where it used to succeed. Since we cannot fix PyTorch and release a new version just because of this, it creates confusion for users who have beartype in their environment when they use torch.onnx.
2. beartype adds an additional call line to the traceback, which makes the already thick dynamo stack even larger, hurting readability when users diagnose errors from the traceback.
3. Since the typing annotations need to be evaluated, we cannot use new syntax like `|` because we need to maintain compatibility with Python 3.8. We don't want to wait for PyTorch to adopt Python 3.10 as the lowest supported version before using the new typing syntax.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130484
Approved by: https://github.com/titaiwangms
While for optimizations like pad_mm there are always only two possible choices, for other decision procedures, like kernel choice selection, the set of "available" choices depends on the input. Instead of storing the choices as metadata, we can look at all choices for which we have collected data (i.e. `df[CHOICE_COL].unique()`).
In this PR, I also try to replace "choice" and "feedback" with the global constants CHOICE_COL and FEEDBACK_COL.
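A minimal illustration of deriving the choice set from the collected data rather than from metadata; the column names are assumed to match the constants mentioned above:
```py
import pandas as pd

CHOICE_COL = "choice"      # global constant names per this PR
FEEDBACK_COL = "feedback"

df = pd.DataFrame(
    {CHOICE_COL: ["triton", "extern", "triton"], FEEDBACK_COL: [1.2, 0.8, 1.1]}
)
# The "available" choices are whatever we actually collected data for:
available_choices = df[CHOICE_COL].unique()
```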
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130304
Approved by: https://github.com/eellison
This is an updated PR to equip cond with the autograd feature; it replaces the old [PR](https://github.com/pytorch/pytorch/pull/126007).
@ydwu4 I tried to incorporate your requests already.
Currently there are two problems that I am struggling to solve:
1. There seems to be an import issue when trying to import cond in `torch/__init__.py`, see [here](8a704035c9/torch/__init__.py (L1914-L1916)). Therefore, I had to comment out those lines, which resolved the import issues, but I believe cond is not properly exposed as torch.cond.
2. I am not entirely sure how to deal with the opinfo test in `hop_db.py`.
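For reference, a hedged sketch of the usage this PR targets once the exposure issue above is sorted out (the import path is an assumption, since torch.cond exposure is problem 1):
```py
import torch
from torch._higher_order_ops.cond import cond  # import path assumed; torch.cond is problem 1 above

def f(x):
    return cond(x.sum() > 0, lambda t: t.sin(), lambda t: t.cos(), (x,))

x = torch.randn(3, requires_grad=True)
out = torch.compile(f)(x)
out.sum().backward()   # with this PR, gradients flow through cond
```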
Co-authored-by: Yidi Wu <yidi@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126911
Approved by: https://github.com/ydwu4
Summary:
Add three top-level APIs for the numeric debugger in the pt2e flow that can log intermediate outputs in the model
and calculate a summary of metric comparisons between nodes in two graphs (a rough usage sketch follows the list):
* `prepare_for_propagation_comparison`
* `extract_results_from_loggers`
* `compare_results`
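A rough end-to-end usage sketch of the three APIs; the import path and argument order are assumptions, and `m_ref` / `m_quant` / `example_inputs` are placeholders for the two graph modules being compared and their inputs:
```py
from torch.ao.quantization import (  # import path assumed for this sketch
    compare_results,
    extract_results_from_loggers,
    prepare_for_propagation_comparison,
)

# m_ref, m_quant: placeholder graph modules to compare; example_inputs: their inputs.
m_ref_logged = prepare_for_propagation_comparison(m_ref)       # insert output loggers
m_quant_logged = prepare_for_propagation_comparison(m_quant)
m_ref_logged(*example_inputs)                                  # run both to record intermediates
m_quant_logged(*example_inputs)

ref_results = extract_results_from_loggers(m_ref_logged)
quant_results = extract_results_from_loggers(m_quant_logged)
summary = compare_results(ref_results, quant_results)          # per-node metric summary
```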
Test Plan:
python test/test_quantization.py -k test_prepare_for_propagation_comparison
python test/test_quantization.py -k test_extract_results_from_loggers
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130643
Approved by: https://github.com/dulinriley, https://github.com/tarun292
We currently can't generate split scans when there are multiple scan
values, so we normally fall back to ATen. However, for the higher order
scan op, we can't fall back, so it makes sense to just generate the slower
kernel anyway. This avoids having special shapes where we fail to
codegen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130936
Approved by: https://github.com/lezcano
Sets `prefer_deferred_runtime_asserts_over_guards=True` for export, so any guards emitted from `SymNode.expect_true` (for example, guards that are implicitly required to be true for an op to succeed) won't lead to constraint violations. Instead these should appear in the graph as runtime asserts, or potentially as replacement expressions for placeholder shapes.
For example, this reshape op should emit s0 * s1 = s2, deferred as a runtime assert.
```
x = torch.randn(4, 8) # [s0, s1]
y = torch.randn(32) # [s2]
out = x.reshape(-1) + y
# this emits Eq(s0 * s1, s2), and we represent y's shape as [s0*s1] in the graph.
```
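A hedged export sketch of the snippet above (the module and dim names are illustrative): with `prefer_deferred_runtime_asserts_over_guards=True`, the `Eq(s0*s1, s2)` guard is deferred instead of raising a constraint violation.
```py
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x, y):
        return x.reshape(-1) + y

dynamic_shapes = {
    "x": {0: Dim("s0"), 1: Dim("s1")},
    "y": {0: Dim("s2")},
}
ep = export(M(), (torch.randn(4, 8), torch.randn(32)), dynamic_shapes=dynamic_shapes)
print(ep)  # expect s0*s1 == s2 to show up as a runtime assert / placeholder replacement
```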
However, other complex guards can still cause export to fail, for instance guards emitted from `SymNode.guard_bool/guard_size_oblivious` (e.g. explicit if-else conditions in user code or lower-level op implementations hit during tracing) can still raise constraint violations. These can be deferred with `allow_complex_guards_as_runtime_asserts=True`. We don't yet make this default, because while this makes export more likely to succeed, it results in non-trivial asserts being emitted that often represent specialization to a variant of the op, or checks related to 0/1 specialization.
We also remove forced specializations for export and kill the `_disable_forced_specializations` flag; now any guard we can't express with Dims/DerivedDims is either handled with hybrid SymInts or should be resolved by rewriting or deferring.
Follow up:
Currently, `ShapeEnv._set_replacement()` is called for complex equality expressions (e.g. s2 -> s0*s1 in the example above), and the ExportedProgram stores `s0*s1` in the input placeholder. This isn't checked for validity when the program is run, so an option is to avoid replacement and/or runtime assert on equality.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130775
Approved by: https://github.com/avikchaudhuri
Fixes #128745
Solves the conflict when users use full_state_dict while the model is wrapped in FSDP.
Currently solves the issue for `full_state_dict=True`, which failed with the error
`'aten.copy_.default: got mixed torch.Tensor and DTensor, need to convert all torch.Tensor to DTensor before calling distributed operators!',).`
TODO: for `broadcast_from_rank0=True, full_state_dict=True`, the error is
`NotImplementedError: c10d::broadcast_: attempted to run this operator with Meta tensors`
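For reference, a hedged sketch of the call path this fixes; `fsdp_model` and `full_sd` are placeholders for an FSDP-wrapped module and a full, unsharded state dict:
```py
from torch.distributed.checkpoint.state_dict import StateDictOptions, set_model_state_dict

# fsdp_model: placeholder FSDP-wrapped module; full_sd: placeholder full state dict.
set_model_state_dict(
    fsdp_model,
    model_state_dict=full_sd,
    options=StateDictOptions(full_state_dict=True),  # the case fixed here
)
```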
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129635
Approved by: https://github.com/fegin
Summary: Finishing up the mechanism to "register" certain types of operators to a registry so that the serializer can handle them correctly. This is expected to be used first by executorch.
Test Plan: buck run mode/opt caffe2/test:test_export -- -r test_export_with_extension_op_serialization
Differential Revision: D59825148
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130851
Approved by: https://github.com/angelayi
- More conservative estimation of plannable inputs
- Consider constant_pad_nd as a pointwise node in concat lowering
- Use aten.cat instead of constant_pad_nd when padding just a single dimension, because it can be memory-planned away (see the sketch below)
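A small sketch of the equivalence exploited by the last point:
```py
import torch

x = torch.randn(4, 6)
# Padding only the last dimension with constant_pad_nd ...
padded = torch.nn.functional.pad(x, (0, 2))
# ... is the same as concatenating a zero block, which can be memory-planned away:
cat_version = torch.cat([x, x.new_zeros(4, 2)], dim=1)
assert torch.equal(padded, cat_version)
```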
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128909
Approved by: https://github.com/Chillee
Might fix #127660; need to test some more cases.
We update the reinplacing pass. If we have something like the following,
where "sin" is a custom op (this situation should also apply to triton
kernels)
```py
def graph(x):
y = sin(x)
z = sin(y)
x.copy_(z)
```
then the reinplacer used to produce the following:
```py
"""step 1: reinplaces the first sin"""
def graph(x):
x_clone = x.clone()
sin_out(x, out=x_clone)
z = sin(x_clone)
x.copy_(z)
"""step 2: reinplaces the second sin"""
def graph(x):
x_clone = x.clone()
sin_out(x, out=x_clone)
sin_out(x_clone, out=x_clone)
x.copy_(x_clone)
```
However, the first clone is unnecessary. It is safe to reinplace
the first sin into the following:
```py
def graph(x):
sin_out(x, out=x)
z = sin(x)
x.copy_(z)
```
because there are no users of `x`'s original value (the copy_ node
doesn't actually use the original value of x!)
This PR updates the reinplacing pass to ignore copy_ in its computation
of whether the original value of the mutated argument is still needed.
NB: this also applies to triton kernels, but it was easier for me to
reason about custom ops (and my repros were all for custom ops).
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130866
Approved by: https://github.com/oulgen
# Summary
- This removes a bunch of example score mods that were primarily used for testing and places them directly in the test file. We should follow up by merging test_flex_decode and test_flash when the velocity slows down a little
- Fixes a bug with indexing on block mask
- Adds some docstrings to helper functions and fixes some misc typing things
- Forces functions passed to `create_block_mask` to be mask_mods and updates test files (a sketch of a mask_mod follows below)
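A hedged sketch of what a mask_mod looks like; the signature is assumed from the flex attention helpers, and the point is that it returns a boolean keep/drop decision rather than modifying scores like a score_mod:
```py
def causal_mask(b, h, q_idx, kv_idx):
    # True means the (q_idx, kv_idx) pair is kept (not masked out).
    return q_idx >= kv_idx

# block_mask = create_block_mask(causal_mask, B, H, Q_LEN, KV_LEN)  # usage sketch
```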
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130871
Approved by: https://github.com/joydddd, https://github.com/Chillee
This PR re-implements pin memory, aiming to get rid of the optional `device` argument and make all related APIs device-agnostic. We add two new abstract APIs in [AcceleratorHooksInterface](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/detail/AcceleratorHooksInterface.h#L12) and redefine pin memory as: "pin memory is always pinned for the current accelerator device". In detail, pin_memory/is_pinned use [getAcceleratorHooksInterface](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/Context.h#L61) to get an appropriate device and invoke the corresponding overridden interfaces, instead of using BackendSelect and then dispatching to CUDA or other specific backends' implementations.
Note: new backends that want to implement and use pin memory just need to inherit from AcceleratorHooksInterface and override the `isPinnedPtr` and `getPinnedMemoryAllocator` methods.
Additional context: to avoid BC-breaking changes, this PR preserves the `device` arg of the related APIs and throws a deprecation warning if the `device` arg is passed. Another PR will be submitted, based on this one, to update all PT callers (`Tensor.is_pinned()`, `Tensor.pin_memory()`, ...) so they stop passing this arg. In the future, the `device` arg will actually be removed.
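A quick sketch of the resulting user-facing behavior:
```py
import torch

t = torch.randn(4)
pinned = t.pin_memory()      # pinned for the current accelerator; no device argument needed
print(pinned.is_pinned())    # True
# t.pin_memory(device="cuda")  # still accepted for now, but emits a deprecation warning per this PR
```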
Relates #124908
Relates #14560
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126376
Approved by: https://github.com/albanD
Summary: Modify the existing `sum` operator in PyTorch, invoked by `torch.sum`, to allow for reductions along the ragged dimension of a nested tensor. This diff enables PyTorch users to invoke `torch.sum` on a nested tensor with `dim=1`, where `ragged_idx=1`.
Functions modified in `caffe2/torch/nested/_internal/ops.py`:
- `sum_dim_IntList()`: The function assumes that `ragged_idx=1`; in the case that `dim=1` as well, where `dim` is the dimension on which we reduce, this diff invokes the PyTorch benchmark found in D58423489. Specifically, this diff pads a nested tensor, e.g. of logical shape `(B, *, M)`, using [`torch.ops.aten._jagged_to_padded_dense_forward`](https://www.internalfb.com/code/fbsource/[92c2a067ab04e3eebc999254fed4ae2fbea6def3]/fbcode/deeplearning/fbgemm/fbgemm_gpu/fb/inductor_lowerings/elementwise_ops.py?lines=26), then reduces across the `*` dimension (`dim == 1`) to a `(B, M)` output tensor.
- `_wrap_jagged_dims()`: This diff adds special handling to allow for the case where `dim` contains `1` and not `0`, but to continue disallowing the case where `dim` contains `0` and not `1`. In this function's creation, I created a helper function, `_get_condition_for_invalid_jagged_reductions()`, which makes it clearer which conditions apply to which operators. Specifically, operators which are enabled with jagged reductions are specified at the top of the file in `SUPPORTED_JAGGED_REDUCTIONS` and have a different set of conditions that need to be tested, as reducing along `dim == 1` without `dim == 0` is now possible.
Functions modified in `caffe2/test/test_nestedtensor.py`:
- `test_sum_int_DimList()`: This diff adds special handling in the `sum` unit test to allow for the case where `dim` contains `1` and not `0`, but to continue disallowing the case where `dim` contains `0` and not `1`.
- `test_sum_int_DimList_ragged_dim_1()`: This diff adds a new unit test which verifies the accuracy and feasibility of reducing along the jagged dimension of a nested tensor.
Notes:
- This diff solely adds functionality for the case in which we reduce only along the ragged dimension. Cases in which we reduce along both the ragged and another dimension, like `dim == (1, 2)`, are not permitted, as this set of diffs focuses primarily on the former.
- The `sum` operator is the only operator which uses the function `_wrap_jagged_dims()`; all other operators use `_wrap_jagged_dim()`. I would like to later look into why this is the case and if we can consolidate this!
- I modified some of the comments in the `sum` function as well as the unit tests for more clarity.
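A hedged usage sketch of the reduction this diff enables, using the public nested-tensor API; the output follows the `(B, M)` description above:
```py
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(3, 8), torch.randn(5, 8)], layout=torch.jagged
)                      # logical shape (B=2, *, M=8), ragged_idx=1
out = nt.sum(dim=1)    # reduce across the ragged dimension -> (B, M) output tensor
print(out.shape)       # torch.Size([2, 8])
```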
Test Plan:
Verify that existing (`test_sum_int_DimList`) and new (`test_sum_int_DimList_ragged_dim_1`) unit tests pass via the following command:
```
buck2 run mode/{opt,inplace} //caffe2/test:nested -- --regex test_sum_int_DimList
```
Differential Revision: D59571209
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130425
Approved by: https://github.com/davidberard98
Reland of: https://github.com/pytorch/pytorch/pull/128016
Summary from previous PR:
We assume only two possible mutually exclusive scenarios:
1. Running the compiled region for training (any of the inputs has requires_grad): produced differentiable outputs should have requires_grad.
2. Running the compiled region for inference (none of the inputs has requires_grad): all outputs do not have requires_grad.
Even if the user runs the region under no_grad() but has an input Tensor with requires_grad, we go with training scenario (1).
With the current state, that means:
1/ needs_autograd should not check torch.is_grad_enabled(), only whether any of the inputs requires_grad
2/ if needs_autograd => trace_joint (we are in training scenario 1) => always run the compiled region under enable_grad() (sketched in the snippet below)
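A hedged pseudocode sketch of the rule in 1/ and 2/; the function name and shape are illustrative, not the actual implementation:
```py
import torch

def needs_autograd(inputs):
    # Intentionally does NOT consult torch.is_grad_enabled():
    return any(isinstance(t, torch.Tensor) and t.requires_grad for t in inputs)

# If needs_autograd(inputs) is True, we trace the joint (training) graph and always run
# the compiled region under torch.enable_grad(); otherwise we compile the inference graph.
```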
Changes in partitioner?
Inference and training graphs differed in the return container (list vs. tuple).
The changes in the partitioner unify this so it always returns a tuple.
As a result, there are some changes in test_aotdispatch.py where graph contents go from list to tuple.
Why was it reverted?
There was an inference regression for the hf_Reformer model.
```
TORCHINDUCTOR_FX_GRAPH_CACHE=0 python benchmarks/dynamo/torchbench.py --performance --inference --bfloat16 --backend inductor --device cuda --only hf_Reformer --cold-start-latency --use-eval-mode
```
One of the compiled graphs contained outputs that are aliases of inputs that are nn.Parameter(requires_grad=True).
Even though the torchbench inference benchmarks run inside `torch.no_grad()`, alias ops (specifically expand, for hf_Reformer) preserve requires_grad.
As a result, we started compiling a training graph instead of an inference graph.
Fix for view ops:
If we have outputs that are aliases of inputs that require grad, those outputs having requires_grad is not a reason to generate a training graph.
This is handled in aot_autograd.py, where output_and_mutation_safe is calculated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128890
Approved by: https://github.com/bdhirsh