This PR updates the public API for NJT construction `torch.nested.nested_tensor_from_jagged()` to accept values for min / max sequence length. It's useful to provide these ahead of time to avoid GPU -> CPU syncs from on-demand computation later on.
NB: The test changes are extensive because I reworked the existing `_validate_nt()` helper function used throughout our NJT construction tests to verify more (specifically: expected cached min / max seq len and contiguity).
API design question: should we additionally provide an option to compute these from `offsets` at construction time? I can think of three possible cases during construction:
1. Min / max seq len has already been obtained from *somewhere* (manual calculation, static values, etc.) and they should be used in the cache
2. Min / max seq len should be computed immediately at construction time for use in the cache (ideally, the caller wouldn't have to do this computation manually)
3. Min / max seq len are not needed at all (i.e. SDPA isn't ever called) and computation should be skipped
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130175
Approved by: https://github.com/davidberard98, https://github.com/soulitzer
In this PR:
- Ensure that if a tensor not requiring grad is saved for backward unpacking does not trigger a detach (unless the user installs a saved tensor pack hook that returns a tensor requiring grad).
- Update non-reentrant checkpoint to also no longer detach for this case.
Alternatives:
- For custom autograd Function, you could directly save on ctx to work around this, but that would not work for when we switch to using custom ops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127959
Approved by: https://github.com/YuqingJ
ghstack dependencies: #125795, #128545, #129262
This PR does 3 things:
1. Adds a copy-free strided->jagged layout conversion for NT
2. Adds a copy-free jagged->strided layout conversion for NT
3. Modifies and expands the .to() API to support the layout argument for the specific case of NT layout conversion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115749
Approved by: https://github.com/jbschlosser
Idea: close over min / max sequence length in the main NJT view func (`_nested_view_from_jagged`) so that view replay during fake-ification propagates these correctly in torch.compile.
For dynamic shapes support for min / max sequence length, this PR uses a hack that stores the values in `(val, 0)` shaped tensors.
**NB: This PR changes SDPA to operate on real views instead of using `buffer_from_jagged()` / `ViewNestedFromBuffer`, which may impact the internal FIRST model. That is, it undoes the partial revert from #123215 alongside a fix to the problem that required the partial revert. We need to verify that there are no regressions there before landing.**
Differential Revision: [D55448636](https://our.internmc.facebook.com/intern/diff/D55448636)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122836
Approved by: https://github.com/soulitzer
Idea: close over min / max sequence length in the main NJT view func (`_nested_view_from_jagged`) so that view replay during fake-ification propagates these correctly in torch.compile.
For dynamic shapes support for min / max sequence length, this PR uses a hack that stores the values in `(val, 0)` shaped tensors.
**NB: This PR changes SDPA to operate on real views instead of using `buffer_from_jagged()` / `ViewNestedFromBuffer`, which may impact the internal FIRST model. That is, it undoes the partial revert from #123215 alongside a fix to the problem that required the partial revert. We need to verify that there are no regressions there before landing.**
Differential Revision: [D55448636](https://our.internmc.facebook.com/intern/diff/D55448636)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122836
Approved by: https://github.com/soulitzer
ghstack dependencies: #127007, #128057
This PR introduces naive CPU impls for:
* `_jagged_to_padded_dense_forward()`
* `_padded_dense_to_jagged_forward()`
On the CUDA side, these are backed by lifted FBGEMM kernels. We may want to revisit the CPU versions with higher-performance implementations at a later time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127007
Approved by: https://github.com/davidberard98
Before: `softmax` definition uses `jagged_unary_pointwise()` (wrong)
After: `softmax` impl adjusts the `dim` arg to account for the difference in dimensionality between the outer NT and the NT's `_values`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119459
Approved by: https://github.com/soulitzer
Summary:
Add the following error checks for the `unbind` operator on `NestedTensor`s when `ragged_idx != 1`:
- The current implementation allows the creation of `NestedTensor` instances from the class definition with an `offsets` tensor that applies to a dimension other than the jagged dimension. This diff ensures that `unbind` fails when the `offsets` exceed the length of the jagged dimension.
Test Plan:
Added the following unit tests:
`test_unbind_with_lengths_ragged_idx_equals_2_bad_dim_cpu` verifies that `unbind` fails when there is a mismatch between the offsets and the jagged dimension, for `NestedTensor`s with `lengths`.
```
test_unbind_with_lengths_ragged_idx_equals_2_bad_dim_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
```
Reviewed By: davidberard98
Differential Revision: D57989082
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128058
Approved by: https://github.com/davidberard98
PyTorch can't depend on `fbgemm_gpu` as a dependency because `fbgemm_gpu` already has a dependency on PyTorch. So this PR copy / pastes kernels from `fbgemm_gpu`:
* `dense_to_jagged_forward()` as CUDA registration for new ATen op `_padded_dense_to_jagged_forward()`
* `jagged_to_padded_dense_forward()` as CUDA registration for new ATen op `_jagged_to_padded_dense_forward()`
CPU impls for these new ATen ops will be added in a follow-up PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125946
Approved by: https://github.com/davidberard98
Summary:
Extend coverage for the `NestedTensor` `unbind` operator to cases in which `ragged_idx != 1`.
Currently, the `unbind` operator in the `NestedTensor` class splits a tensor along the 0-th dimension, where the `ragged_idx` property, which controls the jagged dimension upon which `unbind` splits, is 1. This diff extends support for `ragged_idx != 1` in `NestedTensor`s, allowing `unbind` to split a tensor along a jagged dimension greater than 0 for `NestedTensor`s with and without the `lengths` property.
Test Plan:
Added the following unit tests:
`test_unbind_ragged_idx_equals_2_cpu`, `test_unbind_ragged_idx_equals_3_cpu`, and `test_unbind_ragged_idx_equals_last_dim_cpu` verify that `unbind` works for all jagged dimensions greater than 1, for `NestedTensor`s without `lengths`.
```
test_unbind_ragged_idx_equals_2_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
test_unbind_ragged_idx_equals_3_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
test_unbind_ragged_idx_equals_last_dim_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
```
`test_unbind_with_lengths_cpu` and `test_unbind_with_lengths_ragged_idx_equals_1_cpu` verify that `unbind` works when the jagged dimension is 1, for `NestedTensor`s with `lengths`.
```
test_unbind_with_lengths_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
test_unbind_with_lengths_ragged_idx_equals_1_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
```
`test_unbind_with_lengths_ragged_idx_equals_2_cpu` and `test_unbind_with_lengths_ragged_idx_equals_3_cpu` verify that `unbind` works when the jagged dimension is greater than 1, for `NestedTensor`s with `lengths`.
```
test_unbind_with_lengths_ragged_idx_equals_2_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
test_unbind_with_lengths_ragged_idx_equals_3_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
```
`test_unbind_with_lengths_ragged_idx_equals_0_cpu` verifies that `unbind` fails when the jagged dimension is 0 (the batch dimension), for `NestedTensor`s with `lengths`.
```
test_unbind_with_lengths_ragged_idx_equals_0_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
```
`test_unbind_with_lengths_ragged_idx_equals_2_bad_dim_cpu` verifies that `unbind` fails when there is a mismatch between the offsets and the jagged dimension, for `NestedTensor`s with `lengths`.
```
test_unbind_with_lengths_ragged_idx_equals_2_bad_dim_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
```
`test_unbind_with_wrong_lengths_cpu` verifies that `unbind` fails when the lengths exceed the limitations set by offsets, for `NestedTensor`s with `lengths`.
```
test_unbind_with_wrong_lengths_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
```
Differential Revision: D57942686
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127493
Approved by: https://github.com/davidberard98
Fixes#127097
**TL;DR**: dimensions marked with mark_dynamic can result in assertion failures if the marked-dynamic dimensions get specialized. In NJT, we don't care _that_ much that a dimension is marked as dynamic. So instead, mark with `maybe_mark_dynamic` which suggests that a dimension should be dynamic, but doesn't fail if the dimension gets specialized.
**Background**:
NJT marks the values tensor as dynamic:
49ad90349d/torch/nested/_internal/nested_tensor.py (L122)
It does this for two reasons:
1. **Conceptual**: We know that this dimension _should_ be dynamic; it's a nested tensor, so the sequence lengths will _probably_ vary between batches in the common case. Therefore, we should compile it as dynamic to prevent needing a recompile to trigger automatic dynamic shapes.
2. **Implementation detail**: Right now we run into issues with torch.compile / tensor_unflatten / other details when the dimensions are not marked as dynamic. We have some attempts to remove this (e.g. https://github.com/pytorch/pytorch/pull/126563) but while testing this I wasn't able to get all tests to pass, so there could be potential regressions here if we removed the mark_dynamic.
**Justification for this change**
1. **Conceptual**: AFAIK, we don't care enough about the dynamism of this dimension to error out if we specialize. We'd prefer that we don't have to recompile to get automatic dynamic shapes, but it's also better to not have this issue (and not to force the user to go hunt down all the other equivalent shapes to mark them as dynamic as well). This solution allows us to suggest the dynamism but not force it.
2. **Implementation detail**: This still marks the dimension as symbolic at the beginning of dynamo tracing, so we will (probably) avoid a lot of the issues we run into when we completely remove the `mark_dynamic` decorators.
Differential Revision: [D57933779](https://our.internmc.facebook.com/intern/diff/D57933779)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127453
Approved by: https://github.com/soulitzer, https://github.com/YuqingJ
PyTorch can't depend on `fbgemm_gpu` as a dependency because `fbgemm_gpu` already has a dependency on PyTorch. So this PR copy / pastes kernels from `fbgemm_gpu`:
* `dense_to_jagged_forward()` as CUDA registration for new ATen op `_padded_dense_to_jagged_forward()`
* `jagged_to_padded_dense_forward()` as CUDA registration for new ATen op `_jagged_to_padded_dense_forward()`
CPU impls for these new ATen ops will be added in a follow-up PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125946
Approved by: https://github.com/davidberard98
PyTorch can't depend on `fbgemm_gpu` as a dependency because `fbgemm_gpu` already has a dependency on PyTorch. So this PR copy / pastes kernels from `fbgemm_gpu`:
* `dense_to_jagged_forward()` as CUDA registration for new ATen op `_padded_dense_to_jagged_forward()`
* `jagged_to_padded_dense_forward()` as CUDA registration for new ATen op `_jagged_to_padded_dense_forward()`
CPU impls for these new ATen ops will be added in a follow-up PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125946
Approved by: https://github.com/davidberard98
Current tolerances fail on RTX 6000 (Ada) with `Mismatched elements: 2 / 144 (1.4%)`
```
AssertionError: Tensor-likes are not close!
Mismatched elements: 2 / 144 (1.4%)
Greatest absolute difference: 0.002197265625 at index (5, 0, 0) (up to 0.001 allowed)
Greatest relative difference: 0.08203125 at index (3, 0, 0) (up to 0.016 allowed)
To execute this test, run the following from the base repo dir:
python test/test_nestedtensor.py -k test_sdpa_with_packed_in_proj_cuda_bfloat16
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
----------------------------------------------------------------------
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126356
Approved by: https://github.com/drisspg
Fixes#123698
This PR makes TensorImpl::has_symbolic_sizes_strides return false for NestedTensors.
1. It passes in the actual sizes when we call `_make_wrapper_subclass` - this is the change that makes the subclass register as `has_symbolic_sizes_strides() == True`
2. It adds a field to `_make_wrapper_subclass` where an explicit `numel` can be provided. This allows us to skip the numel computation for the storage, which previously fails due to arithmetic on NestedInts.
3. Implements `aten::numel` for NJT - this is separate from the overridden numel in `make_wrapper_subclass` for now. Note also that this means that we leave `dispatch_sizes_strides_policy="sizes"`, so that we call into the custom `numel` implementation (as well as `sizes` and `strides`), because `numel` cannot currently be computed from `sizes` for NJT.
Note also that this depends on #121361, because calling TensorImpl::set_sizes_and_strides() tries to clone the sizes into the tensor, which means that we need `clone` to be implemented on NestedInt.
Differential Revision: [D57225736](https://our.internmc.facebook.com/intern/diff/D57225736)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124687
Approved by: https://github.com/albanD
Fixes#121758
**TL;DR**: When profiling is turned on, the dispatcher will sometimes attach the autograd sequence number to the recorded profiler event. This PR expands the set of profiler events onto which we attach sequence numbers. Before, we'd only attach a sequence number if the current dispatch key was an Autograd dispatch key. Now, we attach a sequence number if the current dispatch key **set** contains Autograd.
**Context**:
The use case for this is torch.profiler for python subclasses.
Autograd attaches a "sequence number" to all ops that it encounters during the forward pass. Then, the corresponding sequence number can be associated with a backward kernel when backward is executed. This is used by the profiler to associate the forward ops to the backward ops; a forward op and a backward op with the same sequence number are "linked" in some post-processing step.
Prior to this PR, this profiler feature didn't work for python subclasses. The reason is that we don't collect profiler information for all the dispatches for a given kernel; we only dispatch the initial `call`, and not the subsequent `redispatch` invocations. Normally, an Autograd key (if we're running with autograd) is the highest dispatch key, so the initial `call` that we profile is an Autograd key, and we collect the sequence number. But when we're dealing with a python subclass, the first dispatch key is PythonTLSSnapshot, which eventually redispatches into Autograd. We don't record the Autograd sequence number in that case because we don't record redispatches.
To fix this, this PR collects a sequence number whenever the dispatch key **set** contains an Autograd key. That means we might sometimes collect multiple events with the same sequence number, or possibly attach sequence numbers when we won't actually use them? (e.g. maybe if the initial dispatch key handler removes Autograd for some reason). Although this might be a bit confusing for users looking directly at the sequence_nr directly, I think the main use case is for the profiler to create fwd-bwd links; and those should still be generated correctly in these cases.
Differential Revision: [D55724190](https://our.internmc.facebook.com/intern/diff/D55724190)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123304
Approved by: https://github.com/soulitzer
For internal purposes, this PR reverts the use of real views in SDPA -> autograd.Function "views" (i.e. `ViewBufferFromNested` and `ViewNestedFromBuffer`). This is a temporary fix to get the FIRST model launched and working.
**Note: this breaks some other Dynamo tests related to SDPA that rely on real views, but the breakage there isn't expected to be likely in a real-world scenario.**
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123215
Approved by: https://github.com/YuqingJ
For internal purposes, this PR reverts the use of real views in SDPA -> autograd.Function "views" (i.e. `ViewBufferFromNested` and `ViewNestedFromBuffer`). This is a temporary fix to get the FIRST model launched and working.
**Note: this breaks some other Dynamo tests related to SDPA that rely on real views, but the breakage there isn't expected to be likely in a real-world scenario.**
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123215
Approved by: https://github.com/YuqingJ
This PR introduces `torch.nested.nested_tensor_from_jagged(values, offsets=None, lengths=None, jagged_dim=1)` (bikeshedding welcome). This is intended to be the main entrypoint for getting an NJT from the `(values, offsets, lengths)` components. The returned NJT is a view of the `values` component.
Note that `torch.nested.nested_tensor()` / `torch.nested.as_nested_tensor()` already exist for constructing an NJT from a list of tensors.
TODO:
* Some doc formatting; suggestions welcome there
* Tests / examples using `jagged_dim != 1`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121518
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #113279, #113280
This PR adds support for tensor inputs to `as_nested_tensor()`. The tensor is treated as a batch of consistently-sized constituents. It utilizes `_nested_view_from_values_offsets()` to return a real view that allows for propagating gradients into inputs.
Co-authored-by: voznesenskym <voznesenskym@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113280
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
ghstack dependencies: #113279
This PR:
* Introduces an ATen op for creating true jagged views from a dense values buffer
* `_nested_view_from_jagged(values, offsets, lengths, ragged_idx, dummy)`
* This ops is implemented on the Python side using torch.library so we can return a subclass instance
* `jagged_from_list()` now uses this instead of the old autograd.Function `NestedViewFromBuffer`
* The latter op is used for non-contiguous JTs returned via `torch.nested.narrow()`
* `dummy` is an awful hack to ensure that `NestedTensor.__torch_dispatch__()` is invoked for our view
* Introduces an ATen op for accessing the `values` component of an NT via a view
* `_nested_get_values(nt)`
* **Removes** the autograd.Functions `ViewNestedFromBuffer` and `ViewBufferFromNested` in favor of `nested_from_values_offsets()` / `nested_from_values_offsets_lengths()` and `nt.values()`, respectively.
* Changes test code to prefer `as_nested_tensor()` over `jagged_from_list()` directly
* Similarly, avoid `buffer_from_jagged()`, preferring `values()`
* Depends on general subclass view fake-ification on the PT2 side (handled solely in previous PRs in the stack)
With these changes, the semantics of jagged layout NTs are such that they are considered a true view of the underlying `values` buffer. This means views of jagged NTs are views of the underlying buffer as well, simplifying some handling.
Differential Revision: [D54269922](https://our.internmc.facebook.com/intern/diff/D54269922)
Co-authored-by: voznesenskym <voznesenskym@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113279
Approved by: https://github.com/ezyang
This PR introduces `torch.nested.nested_tensor_from_jagged(values, offsets=None, lengths=None, jagged_dim=1)` (bikeshedding welcome). This is intended to be the main entrypoint for getting an NJT from the `(values, offsets, lengths)` components. The returned NJT is a view of the `values` component.
Note that `torch.nested.nested_tensor()` / `torch.nested.as_nested_tensor()` already exist for constructing an NJT from a list of tensors.
TODO:
* Some doc formatting; suggestions welcome there
* Tests / examples using `jagged_dim != 1`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121518
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #113280
This PR adds support for tensor inputs to `as_nested_tensor()`. The tensor is treated as a batch of consistently-sized constituents. It utilizes `_nested_view_from_values_offsets()` to return a real view that allows for propagating gradients into inputs.
Co-authored-by: voznesenskym <voznesenskym@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113280
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
This PR:
* Introduces an ATen op for creating true jagged views from a dense values buffer
* `_nested_view_from_jagged(values, offsets, lengths, ragged_idx, dummy)`
* This ops is implemented on the Python side using torch.library so we can return a subclass instance
* `jagged_from_list()` now uses this instead of the old autograd.Function `NestedViewFromBuffer`
* The latter op is used for non-contiguous JTs returned via `torch.nested.narrow()`
* `dummy` is an awful hack to ensure that `NestedTensor.__torch_dispatch__()` is invoked for our view
* Introduces an ATen op for accessing the `values` component of an NT via a view
* `_nested_get_values(nt)`
* **Removes** the autograd.Functions `ViewNestedFromBuffer` and `ViewBufferFromNested` in favor of `nested_from_values_offsets()` / `nested_from_values_offsets_lengths()` and `nt.values()`, respectively.
* Changes test code to prefer `as_nested_tensor()` over `jagged_from_list()` directly
* Similarly, avoid `buffer_from_jagged()`, preferring `values()`
* Depends on general subclass view fake-ification on the PT2 side (handled solely in previous PRs in the stack)
With these changes, the semantics of jagged layout NTs are such that they are considered a true view of the underlying `values` buffer. This means views of jagged NTs are views of the underlying buffer as well, simplifying some handling.
Differential Revision: [D54269922](https://our.internmc.facebook.com/intern/diff/D54269922)
Co-authored-by: voznesenskym <voznesenskym@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113279
Approved by: https://github.com/ezyang
This PR:
* Uses reified ViewFuncs to swap in fake tensors / symbolic SymInts for view replay during subclass view fake-ification
* Enables automatic dynamic on view bases -> fakeifies according to the resultant symbolic context instead of the old "all-static" approach
* Covers the following view types:
* subclass -> dense
* dense -> subclass
* subclass -> subclass
* Dense -> dense views are handled the old way via an `as_strided()` call, as it's likely there is no view func available
Differential Revision: [D54269082](https://our.internmc.facebook.com/intern/diff/D54269082)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118405
Approved by: https://github.com/ezyang
Meta registration wrongly assumes 4D inputs, while the underlying op allows 3D inputs for the `mha_varlen_fwd()` case.
Testing: I added `detach()`es so the NJT test `test_sdpa_compile()` won't fail for a view-related reason. It should pass now with this fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119812
Approved by: https://github.com/drisspg
Before: `softmax` definition uses `jagged_unary_pointwise()` (wrong)
After: `softmax` impl adjusts the `dim` arg to account for the difference in dimensionality between the outer NT and the NT's `_values`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119459
Approved by: https://github.com/soulitzer
It should usually be safe to run pointwise binary ops with >2 inputs. e.g. threshold_backward(tensor, tensor, scalar): we just operate on the values of the nested tensors, and pass in the other args as-is.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119419
Approved by: https://github.com/soulitzer
* The TODOs in `test/test_nestedtensor.py` has been mitigated, I keep the issue for reference.
* ~~The TODOs in `test/test_ops_fwd_gradients.py` doesn't apply anymore~~
* The TODOs in `run_test.py` to support disabling C++ tests is probably not going to happen. I have never seen a flaky C++ test that needs to be disabled before.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119113
Approved by: https://github.com/kit1980
Uses case: `_unsafe_view` is used in aot_autograd to create a view that doesn't register as a view:
eebe7e1d37/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py (L470-L476)
If a transposed nested tensor (i.e. NT with ragged_idx != 1) encounters this code path, it previously would fail for two reasons: 1) because `_unsafe_view` isn't registered, and 2) because ragged_idx != 1 is not supported. This PR adds support for `_unsafe_view` (completely reusing the implementation of `view`; this just registers `_unsafe_view` as another op using the same implementation). It also adds support for ragged_idx != 1, but only for trivial cases where inp._size == size (the use case used by aot_autograd).
Tests: verify that the result of `_unsafe_view` doesn't have a `_base`, and that simple views on transposed NTs work.
Differential Revision: [D53096814](https://our.internmc.facebook.com/intern/diff/D53096814)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118317
Approved by: https://github.com/soulitzer
This PR allows pointwise ops to operate on tensors with ragged_idx != 1. It does this by passing the ragged_idx metadata into the construction of the returned NestedTensor when computing pointwise ops. The assumption is that: pointwise ops can operate directly on the values tensors, and the resulting tensor should have all the same metadata properties as the input tensors. For binary ops, a test is added to verify that adding two tensors with different ragged_idx cannot be added.
Previously:
* unary pointwise ops would error out when performed on nested tensors with ragged_idx != 1
* binary pointwise ops would produce tensors with nonsense shapes
Differential Revision: [D53032641](https://our.internmc.facebook.com/intern/diff/D53032641)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118157
Approved by: https://github.com/jbschlosser
Support this fallback by converting the jagged layout NT to strided layout NT, and the convert the result back to jagged layout NT.
This fallback might not be efficient since it uses unbind, contiguous and split.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116445
Approved by: https://github.com/soulitzer
Support this fallback by converting the jagged layout NT to strided layout NT, and the convert the result back to jagged layout NT.
This fallback might not be efficient since it uses unbind, contiguous and split.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116445
Approved by: https://github.com/soulitzer
Part 2 of implementation for general [subclass view fake-ification](https://docs.google.com/document/d/1C5taWiplmX7nKiURXDOAZG2W5VNJ2iV0fQFq92H0Cxw).
Details:
* Codegen `rev_view_func()` alongside `view_func()`
* Reverse view_func gives you a "base" from a "view": `rev_view_func(new_view) -> new_base` AKA it plays the original view backwards
* Utilizes the functional inverses defined in `FunctionalInverses.cpp`, passing `InverseReturnMode::AlwaysView`
* Manually implements functional inverses for `narrow()` and `chunk()`
* **NB: Multi-output views now set view_func() / rev_view_func() for each of the output views!**
* Due to this, the `as_view()` overload that operates on a list of views is scrapped in favor of iteration via codegen
Example codegen in `ADInplaceOrViewTypeN.cpp`:
```cpp
at::Tensor narrow(c10::DispatchKeySet ks, const at::Tensor & self, int64_t dim, c10::SymInt start, c10::SymInt length) {
auto _tmp = ([&]() {
at::AutoDispatchBelowADInplaceOrView guard;
return at::_ops::narrow::redispatch(ks & c10::after_ADInplaceOrView_keyset, self, dim, start, length);
})();
std::function<at::Tensor(const at::Tensor&)> func=nullptr;
std::function<at::Tensor(const at::Tensor&)> rev_func=nullptr;
if (false || !self.unsafeGetTensorImpl()->support_as_strided() ||
c10::AutogradState::get_tls_state().get_view_replay_enabled()) {
func = [=](const at::Tensor& input_base) {
return at::_ops::narrow::call(input_base, dim, start, length);
};
rev_func = [=](const at::Tensor& input_view) {
// NB: args from narrow() signature are passed along to the inverse
return at::functionalization::FunctionalInverses::narrow_copy_inverse(self, input_view, at::functionalization::InverseReturnMode::AlwaysView, dim, start, length);
};
}
auto result = as_view(/* base */ self, /* output */ _tmp, /* is_bw_differentiable */ true, /* is_fw_differentiable */ true, /* view_func */ func, /* rev_view_func */ rev_func, /* creation_meta */ InferenceMode::is_enabled() ? CreationMeta::INFERENCE_MODE : (at::GradMode::is_enabled() ? CreationMeta::DEFAULT : CreationMeta::NO_GRAD_MODE));
return result;
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115894
Approved by: https://github.com/soulitzer
Decorates all NT tests with `@markDynamoStrictTest` to ensure we get the correct signal. Adds xfails where needed to get things passing.
Includes a fix in meta_utils.py for a bug that was breaking several python 3.11 tests. In particular, a dense tensor graph input that is a view of a strided NT would slip past Dynamo's check and break in meta-ification.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116111
Approved by: https://github.com/soulitzer, https://github.com/zou3519
ghstack dependencies: #115192
This PR removes the need for passing `ragged_size` into the `NestedTensor` constructor. This was an artifact of fake-ification, where sometimes we needed the NT to have a symbolic singleton symint shape for the ragged dimension. The new way of achieving this is to also store mappings between fake / functional tensors -> symbolic symints in the ragged structure registry. Now the `NestedTensor` constructor can just query this registry for the `ragged_size`.
Old: `NestedTensor(values, offsets, *, ragged_size=None, **kwargs)`
New: `NestedTensor(values, offsets, **kwargs)`
This makes it possible to have a `_nested_view_from_values_offsets(values, offsets)` without needing to pass a `ragged_size`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113491
Approved by: https://github.com/ezyang, https://github.com/soulitzer
Summary:
Add split and layer_norm_backward.
Note: It is non trivial to support split_with_sizes backward so adding the split operation to support the use case in the model.
Test Plan: unit tests
Differential Revision: D51052966
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113108
Approved by: https://github.com/soulitzer