Idea: close over min / max sequence length in the main NJT view func (`_nested_view_from_jagged`) so that view replay during fake-ification propagates these correctly in torch.compile.
For dynamic shapes support for min / max sequence length, this PR uses a hack that stores the values in `(val, 0)` shaped tensors.
**NB: This PR changes SDPA to operate on real views instead of using `buffer_from_jagged()` / `ViewNestedFromBuffer`, which may impact the internal FIRST model. That is, it undoes the partial revert from #123215 alongside a fix to the problem that required the partial revert. We need to verify that there are no regressions there before landing.**
Differential Revision: [D55448636](https://our.internmc.facebook.com/intern/diff/D55448636)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122836
Approved by: https://github.com/soulitzer
ghstack dependencies: #127007, #128057
This PR introduces naive CPU impls for:
* `_jagged_to_padded_dense_forward()`
* `_padded_dense_to_jagged_forward()`
On the CUDA side, these are backed by lifted FBGEMM kernels. We may want to revisit the CPU versions with higher-performance implementations at a later time.
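As a rough illustration of what the two ops compute, here is a minimal pure-Python sketch over a flat `values` list and an `offsets` list. It is a model of the semantics only, not the ATen kernels; the function names, the 1-D `values` layout, and the truncate-to-`max_length` behavior are assumptions made for the example.

```python
from typing import List, Sequence


def jagged_to_padded_dense(
    values: Sequence[float],
    offsets: Sequence[int],
    max_length: int,
    padding_value: float = 0.0,
) -> List[List[float]]:
    """Pad each variable-length slice of `values` out to `max_length`."""
    padded = []
    for start, end in zip(offsets[:-1], offsets[1:]):
        # Assumption: sequences longer than max_length are truncated.
        seq = list(values[start:end])[:max_length]
        padded.append(seq + [padding_value] * (max_length - len(seq)))
    return padded


def padded_dense_to_jagged(
    padded: Sequence[Sequence[float]],
    offsets: Sequence[int],
) -> List[float]:
    """Drop the padding again, using `offsets` to recover each true length."""
    values: List[float] = []
    for row, start, end in zip(padded, offsets[:-1], offsets[1:]):
        values.extend(row[: end - start])
    return values


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
offsets = [0, 2, 3, 6]  # three sequences of lengths 2, 1, 3
dense = jagged_to_padded_dense(values, offsets, max_length=3)
roundtrip = padded_dense_to_jagged(dense, offsets)
```

When no sequence is truncated, the round trip returns the original `values`.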
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127007
Approved by: https://github.com/davidberard98
Before: `softmax` definition uses `jagged_unary_pointwise()` (wrong)
After: `softmax` impl adjusts the `dim` arg to account for the difference in dimensionality between the outer NT and the NT's `_values`
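A minimal sketch of the `dim` adjustment, assuming the default `ragged_idx == 1` layout where the batch and ragged dims of the outer NT are folded into dim 0 of `_values`. The helper name `values_dim_for` is hypothetical and only models the index arithmetic:

```python
def values_dim_for(outer_dim: int, outer_ndim: int) -> int:
    """Map a dim of the outer NT to the matching dim of its `_values` buffer.

    Assumes ragged_idx == 1, i.e. outer shape (B, j, D, ...) is backed by
    values of shape (sum(j), D, ...).
    """
    if not -outer_ndim <= outer_dim < outer_ndim:
        raise IndexError(f"dim {outer_dim} out of range for {outer_ndim}-D NT")
    wrapped = outer_dim % outer_ndim  # normalize negative dims
    if wrapped in (0, 1):
        # Batch and ragged dims share values dim 0, so a reduction over
        # either one has no single corresponding dim in `_values`.
        raise ValueError("batch / ragged dims are not handled by this sketch")
    return wrapped - 1


# Outer NT of shape (B, j, D): softmax over the last dim maps to values dim 1.
last_dim = values_dim_for(-1, outer_ndim=3)
```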
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119459
Approved by: https://github.com/soulitzer
Summary:
Add the following error checks for the `unbind` operator on `NestedTensor`s when `ragged_idx != 1`:
- The current implementation allows the creation of `NestedTensor` instances from the class definition with an `offsets` tensor that applies to a dimension other than the jagged dimension. This diff ensures that `unbind` fails when the `offsets` exceed the length of the jagged dimension.
Test Plan:
Added the following unit tests:
`test_unbind_with_lengths_ragged_idx_equals_2_bad_dim_cpu` verifies that `unbind` fails when there is a mismatch between the offsets and the jagged dimension, for `NestedTensor`s with `lengths`.
```
test_unbind_with_lengths_ragged_idx_equals_2_bad_dim_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
```
Reviewed By: davidberard98
Differential Revision: D57989082
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128058
Approved by: https://github.com/davidberard98
PyTorch can't depend on `fbgemm_gpu` as a dependency because `fbgemm_gpu` already has a dependency on PyTorch. So this PR copy / pastes kernels from `fbgemm_gpu`:
* `dense_to_jagged_forward()` as CUDA registration for new ATen op `_padded_dense_to_jagged_forward()`
* `jagged_to_padded_dense_forward()` as CUDA registration for new ATen op `_jagged_to_padded_dense_forward()`
CPU impls for these new ATen ops will be added in a follow-up PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125946
Approved by: https://github.com/davidberard98
Summary:
Extend coverage for the `NestedTensor` `unbind` operator to cases in which `ragged_idx != 1`.
Currently, the `unbind` operator in the `NestedTensor` class splits a tensor along the 0-th dimension, where the `ragged_idx` property, which controls the jagged dimension upon which `unbind` splits, is 1. This diff extends support for `ragged_idx != 1` in `NestedTensor`s, allowing `unbind` to split a tensor along a jagged dimension greater than 0 for `NestedTensor`s with and without the `lengths` property.
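The behavior described above can be modeled in plain Python. This sketch treats `values` as a 2-D list of lists and uses a hypothetical `ragged_dim` argument (corresponding to `ragged_idx - 1`, the jagged axis of `values`); the two error checks are assumptions modeled on the test descriptions in this stack, not the actual implementation.

```python
def unbind_jagged(values, offsets, lengths=None, ragged_dim=0):
    """Split 2-D `values` into constituents along its jagged axis."""
    if ragged_dim not in (0, 1):
        raise ValueError("this sketch only handles 2-D values")
    size = len(values) if ragged_dim == 0 else len(values[0])
    if offsets[-1] > size:
        raise ValueError("offsets exceed the length of the jagged dimension")
    if lengths is None:
        # Without explicit lengths, each piece spans consecutive offsets.
        lengths = [end - start for start, end in zip(offsets[:-1], offsets[1:])]
    pieces = []
    for i, (start, length) in enumerate(zip(offsets[:-1], lengths)):
        if start + length > offsets[i + 1]:
            raise ValueError("lengths exceed the limits set by offsets")
        if ragged_dim == 0:
            pieces.append(values[start:start + length])
        else:
            pieces.append([row[start:start + length] for row in values])
    return pieces


# values has shape (2, 5); the jagged axis is dim 1 of values (ragged_idx == 2).
values = [[0, 1, 2, 3, 4], [10, 11, 12, 13, 14]]
without_lengths = unbind_jagged(values, [0, 2, 5], ragged_dim=1)
with_lengths = unbind_jagged(values, [0, 2, 5], lengths=[1, 2], ragged_dim=1)
```

Applying the same `offsets` along the wrong axis (`ragged_dim=0`, which has only 2 rows) trips the first check.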
Test Plan:
Added the following unit tests:
`test_unbind_ragged_idx_equals_2_cpu`, `test_unbind_ragged_idx_equals_3_cpu`, and `test_unbind_ragged_idx_equals_last_dim_cpu` verify that `unbind` works for all jagged dimensions greater than 1, for `NestedTensor`s without `lengths`.
```
test_unbind_ragged_idx_equals_2_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
test_unbind_ragged_idx_equals_3_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
test_unbind_ragged_idx_equals_last_dim_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
```
`test_unbind_with_lengths_cpu` and `test_unbind_with_lengths_ragged_idx_equals_1_cpu` verify that `unbind` works when the jagged dimension is 1, for `NestedTensor`s with `lengths`.
```
test_unbind_with_lengths_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
test_unbind_with_lengths_ragged_idx_equals_1_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
```
`test_unbind_with_lengths_ragged_idx_equals_2_cpu` and `test_unbind_with_lengths_ragged_idx_equals_3_cpu` verify that `unbind` works when the jagged dimension is greater than 1, for `NestedTensor`s with `lengths`.
```
test_unbind_with_lengths_ragged_idx_equals_2_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
test_unbind_with_lengths_ragged_idx_equals_3_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
```
`test_unbind_with_lengths_ragged_idx_equals_0_cpu` verifies that `unbind` fails when the jagged dimension is 0 (the batch dimension), for `NestedTensor`s with `lengths`.
```
test_unbind_with_lengths_ragged_idx_equals_0_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
```
`test_unbind_with_lengths_ragged_idx_equals_2_bad_dim_cpu` verifies that `unbind` fails when there is a mismatch between the offsets and the jagged dimension, for `NestedTensor`s with `lengths`.
```
test_unbind_with_lengths_ragged_idx_equals_2_bad_dim_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
```
`test_unbind_with_wrong_lengths_cpu` verifies that `unbind` fails when the lengths exceed the limitations set by offsets, for `NestedTensor`s with `lengths`.
```
test_unbind_with_wrong_lengths_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
```
Differential Revision: D57942686
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127493
Approved by: https://github.com/davidberard98
Fixes #127097
**TL;DR**: dimensions marked with mark_dynamic can result in assertion failures if the marked-dynamic dimensions get specialized. In NJT, we don't care _that_ much that a dimension is marked as dynamic. So instead, mark with `maybe_mark_dynamic` which suggests that a dimension should be dynamic, but doesn't fail if the dimension gets specialized.
**Background**:
NJT marks the values tensor as dynamic:
49ad90349d/torch/nested/_internal/nested_tensor.py (L122)
It does this for two reasons:
1. **Conceptual**: We know that this dimension _should_ be dynamic; it's a nested tensor, so the sequence lengths will _probably_ vary between batches in the common case. Therefore, we should compile it as dynamic to prevent needing a recompile to trigger automatic dynamic shapes.
2. **Implementation detail**: Right now we run into issues with torch.compile / tensor_unflatten / other details when the dimensions are not marked as dynamic. We have some attempts to remove this (e.g. https://github.com/pytorch/pytorch/pull/126563) but while testing this I wasn't able to get all tests to pass, so there could be potential regressions here if we removed the mark_dynamic.
**Justification for this change**
1. **Conceptual**: AFAIK, we don't care enough about the dynamism of this dimension to error out if we specialize. We'd prefer that we don't have to recompile to get automatic dynamic shapes, but it's also better to not have this issue (and not to force the user to go hunt down all the other equivalent shapes to mark them as dynamic as well). This solution allows us to suggest the dynamism but not force it.
2. **Implementation detail**: This still marks the dimension as symbolic at the beginning of dynamo tracing, so we will (probably) avoid a lot of the issues we run into when we completely remove the `mark_dynamic` decorators.
Differential Revision: [D57933779](https://our.internmc.facebook.com/intern/diff/D57933779)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127453
Approved by: https://github.com/soulitzer, https://github.com/YuqingJ
Current tolerances fail on RTX 6000 (Ada) with `Mismatched elements: 2 / 144 (1.4%)`
```
AssertionError: Tensor-likes are not close!
Mismatched elements: 2 / 144 (1.4%)
Greatest absolute difference: 0.002197265625 at index (5, 0, 0) (up to 0.001 allowed)
Greatest relative difference: 0.08203125 at index (3, 0, 0) (up to 0.016 allowed)
To execute this test, run the following from the base repo dir:
python test/test_nestedtensor.py -k test_sdpa_with_packed_in_proj_cuda_bfloat16
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
----------------------------------------------------------------------
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126356
Approved by: https://github.com/drisspg
Fixes #123698
This PR makes `TensorImpl::has_symbolic_sizes_strides` return true for NestedTensors.
1. It passes in the actual sizes when we call `_make_wrapper_subclass` - this is the change that makes the subclass register as `has_symbolic_sizes_strides() == True`
2. It adds a field to `_make_wrapper_subclass` where an explicit `numel` can be provided. This allows us to skip the numel computation for the storage, which previously failed due to arithmetic on NestedInts.
3. Implements `aten::numel` for NJT - this is separate from the overridden numel in `make_wrapper_subclass` for now. Note also that this means that we leave `dispatch_sizes_strides_policy="sizes"`, so that we call into the custom `numel` implementation (as well as `sizes` and `strides`), because `numel` cannot currently be computed from `sizes` for NJT.
Note also that this depends on #121361, because calling TensorImpl::set_sizes_and_strides() tries to clone the sizes into the tensor, which means that we need `clone` to be implemented on NestedInt.
Differential Revision: [D57225736](https://our.internmc.facebook.com/intern/diff/D57225736)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124687
Approved by: https://github.com/albanD
Fixes #121758
**TL;DR**: When profiling is turned on, the dispatcher will sometimes attach the autograd sequence number to the recorded profiler event. This PR expands the set of profiler events onto which we attach sequence numbers. Before, we'd only attach a sequence number if the current dispatch key was an Autograd dispatch key. Now, we attach a sequence number if the current dispatch key **set** contains Autograd.
**Context**:
The use case for this is torch.profiler for python subclasses.
Autograd attaches a "sequence number" to all ops that it encounters during the forward pass. Then, the corresponding sequence number can be associated with a backward kernel when backward is executed. This is used by the profiler to associate the forward ops to the backward ops; a forward op and a backward op with the same sequence number are "linked" in some post-processing step.
Prior to this PR, this profiler feature didn't work for python subclasses. The reason is that we don't collect profiler information for all the dispatches for a given kernel; we only profile the initial `call`, and not the subsequent `redispatch` invocations. Normally, an Autograd key (if we're running with autograd) is the highest dispatch key, so the initial `call` that we profile is an Autograd key, and we collect the sequence number. But when we're dealing with a python subclass, the first dispatch key is PythonTLSSnapshot, which eventually redispatches into Autograd. We don't record the Autograd sequence number in that case because we don't record redispatches.
To fix this, this PR collects a sequence number whenever the dispatch key **set** contains an Autograd key. That means we might sometimes collect multiple events with the same sequence number, or possibly attach sequence numbers when we won't actually use them (e.g. if the initial dispatch key handler removes Autograd for some reason). Although this might be a bit confusing for users looking directly at the sequence_nr, I think the main use case is for the profiler to create fwd-bwd links, and those should still be generated correctly in these cases.
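The change in the attach rule can be sketched as follows. This is a simplified model, not dispatcher code: dispatch key sets are represented as plain lists of key names ordered from highest to lowest priority, and the helper names are hypothetical.

```python
# Illustrative subset of Autograd dispatch key names.
AUTOGRAD_KEYS = frozenset({"AutogradCPU", "AutogradCUDA", "AutogradOther"})


def attaches_seq_nr_before(key_set):
    """Old rule: only the current (highest-priority) key is checked."""
    return key_set[0] in AUTOGRAD_KEYS


def attaches_seq_nr_after(key_set):
    """New rule: any Autograd key anywhere in the key set is enough."""
    return any(key in AUTOGRAD_KEYS for key in key_set)


dense_keys = ["AutogradCPU", "CPU"]
# For a python subclass, PythonTLSSnapshot sits above Autograd.
subclass_keys = ["PythonTLSSnapshot", "AutogradCPU", "Python", "CPU"]
```

Under the old rule the subclass call gets no sequence number because its first key is `PythonTLSSnapshot`; under the new rule both calls get one.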
Differential Revision: [D55724190](https://our.internmc.facebook.com/intern/diff/D55724190)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123304
Approved by: https://github.com/soulitzer
For internal purposes, this PR reverts the use of real views in SDPA -> autograd.Function "views" (i.e. `ViewBufferFromNested` and `ViewNestedFromBuffer`). This is a temporary fix to get the FIRST model launched and working.
**Note: this breaks some other Dynamo tests related to SDPA that rely on real views, but the breakage there isn't expected to be likely in a real-world scenario.**
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123215
Approved by: https://github.com/YuqingJ
This PR introduces `torch.nested.nested_tensor_from_jagged(values, offsets=None, lengths=None, jagged_dim=1)` (bikeshedding welcome). This is intended to be the main entrypoint for getting an NJT from the `(values, offsets, lengths)` components. The returned NJT is a view of the `values` component.
Note that `torch.nested.nested_tensor()` / `torch.nested.as_nested_tensor()` already exist for constructing an NJT from a list of tensors.
TODO:
* Some doc formatting; suggestions welcome there
* Tests / examples using `jagged_dim != 1`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121518
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #113279, #113280
This PR adds support for tensor inputs to `as_nested_tensor()`. The tensor is treated as a batch of consistently-sized constituents. It utilizes `_nested_view_from_values_offsets()` to return a real view that allows for propagating gradients into inputs.
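To make "a batch of consistently-sized constituents" concrete, here is a small pure-Python sketch of the `(values, offsets)` pair that a dense `(B, S, ...)` input corresponds to. The helper names are hypothetical and nested lists stand in for tensors; the real path returns a view rather than a copy.

```python
def offsets_for_dense_batch(batch_size: int, seq_len: int) -> list:
    """Every constituent has the same length, so offsets are evenly spaced."""
    return [i * seq_len for i in range(batch_size + 1)]


def flatten_batch(dense: list) -> list:
    """Fold the batch and sequence dims together, as `values` does."""
    return [row for sample in dense for row in sample]


# Dense input of shape (B=3, S=2, D=2).
dense = [
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
    [[9, 10], [11, 12]],
]
values = flatten_batch(dense)
offsets = offsets_for_dense_batch(batch_size=len(dense), seq_len=len(dense[0]))
```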
Co-authored-by: voznesenskym <voznesenskym@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113280
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
ghstack dependencies: #113279
This PR:
* Introduces an ATen op for creating true jagged views from a dense values buffer
* `_nested_view_from_jagged(values, offsets, lengths, ragged_idx, dummy)`
* This op is implemented on the Python side using torch.library so we can return a subclass instance
* `jagged_from_list()` now uses this instead of the old autograd.Function `ViewNestedFromBuffer`
* The latter op is used for non-contiguous JTs returned via `torch.nested.narrow()`
* `dummy` is an awful hack to ensure that `NestedTensor.__torch_dispatch__()` is invoked for our view
* Introduces an ATen op for accessing the `values` component of an NT via a view
* `_nested_get_values(nt)`
* **Removes** the autograd.Functions `ViewNestedFromBuffer` and `ViewBufferFromNested` in favor of `nested_from_values_offsets()` / `nested_from_values_offsets_lengths()` and `nt.values()`, respectively.
* Changes test code to prefer `as_nested_tensor()` over `jagged_from_list()` directly
* Similarly, avoid `buffer_from_jagged()`, preferring `values()`
* Depends on general subclass view fake-ification on the PT2 side (handled solely in previous PRs in the stack)
With these changes, the semantics of jagged layout NTs are such that they are considered a true view of the underlying `values` buffer. This means views of jagged NTs are views of the underlying buffer as well, simplifying some handling.
Differential Revision: [D54269922](https://our.internmc.facebook.com/intern/diff/D54269922)
Co-authored-by: voznesenskym <voznesenskym@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113279
Approved by: https://github.com/ezyang
This PR:
* Uses reified ViewFuncs to swap in fake tensors / symbolic SymInts for view replay during subclass view fake-ification
* Enables automatic dynamic on view bases -> fakeifies according to the resultant symbolic context instead of the old "all-static" approach
* Covers the following view types:
* subclass -> dense
* dense -> subclass
* subclass -> subclass
* Dense -> dense views are handled the old way via an `as_strided()` call, as it's likely there is no view func available
Differential Revision: [D54269082](https://our.internmc.facebook.com/intern/diff/D54269082)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118405
Approved by: https://github.com/ezyang
Meta registration wrongly assumes 4D inputs, while the underlying op allows 3D inputs for the `mha_varlen_fwd()` case.
Testing: I added `detach()`es so the NJT test `test_sdpa_compile()` won't fail for a view-related reason. It should pass now with this fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119812
Approved by: https://github.com/drisspg
It should usually be safe to run pointwise binary ops with >2 inputs. e.g. threshold_backward(tensor, tensor, scalar): we just operate on the values of the nested tensors, and pass in the other args as-is.
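A minimal sketch of that idea, using a hypothetical `JaggedNT` stand-in with list-backed `values`: nested-tensor arguments are unwrapped to their values, everything else is handed to the op unchanged, and the result is re-wrapped with the shared `offsets`. Requiring identical `offsets` across nested inputs is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class JaggedNT:
    """Toy stand-in for a jagged-layout nested tensor."""
    values: list
    offsets: list


def jagged_pointwise(fn, *args):
    """Apply an n-ary pointwise `fn` across the values of the NT arguments."""
    nts = [a for a in args if isinstance(a, JaggedNT)]
    if not nts:
        raise TypeError("expected at least one JaggedNT argument")
    offsets = nts[0].offsets
    if any(nt.offsets != offsets for nt in nts):
        raise ValueError("nested inputs must share the same ragged structure")
    n = len(nts[0].values)
    # Non-NT args (e.g. scalars) pass through as-is; repeating them here
    # only models elementwise application over Python lists.
    columns = [a.values if isinstance(a, JaggedNT) else [a] * n for a in args]
    return JaggedNT([fn(*elems) for elems in zip(*columns)], offsets)


def threshold_backward(grad, inp, threshold):
    """Scalar form: the gradient flows only where the input exceeds threshold."""
    return grad if inp > threshold else 0.0


grad = JaggedNT([1.0, 2.0, 3.0], [0, 1, 3])
inp = JaggedNT([-1.0, 0.5, 2.0], [0, 1, 3])
out = jagged_pointwise(threshold_backward, grad, inp, 0.0)
```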
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119419
Approved by: https://github.com/soulitzer
* The TODOs in `test/test_nestedtensor.py` have been mitigated; I'm keeping the issue for reference.
* ~~The TODOs in `test/test_ops_fwd_gradients.py` don't apply anymore~~
* The TODOs in `run_test.py` to support disabling C++ tests are probably not going to be addressed. I have never seen a flaky C++ test that needs to be disabled before.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119113
Approved by: https://github.com/kit1980
Use case: `_unsafe_view` is used in aot_autograd to create a view that doesn't register as a view:
eebe7e1d37/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py (L470-L476)
If a transposed nested tensor (i.e. NT with ragged_idx != 1) encounters this code path, it previously would fail for two reasons: 1) because `_unsafe_view` isn't registered, and 2) because ragged_idx != 1 is not supported. This PR adds support for `_unsafe_view` (completely reusing the implementation of `view`; this just registers `_unsafe_view` as another op using the same implementation). It also adds support for ragged_idx != 1, but only for trivial cases where inp._size == size (the use case used by aot_autograd).
Tests: verify that the result of `_unsafe_view` doesn't have a `_base`, and that simple views on transposed NTs work.
Differential Revision: [D53096814](https://our.internmc.facebook.com/intern/diff/D53096814)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118317
Approved by: https://github.com/soulitzer
This PR allows pointwise ops to operate on tensors with ragged_idx != 1. It does this by passing the ragged_idx metadata into the construction of the returned NestedTensor when computing pointwise ops. The assumption is that pointwise ops can operate directly on the values tensors, and the resulting tensor should have all the same metadata properties as the input tensors. For binary ops, a test is added to verify that adding two tensors with different ragged_idx values fails.
Previously:
* unary pointwise ops would error out when performed on nested tensors with ragged_idx != 1
* binary pointwise ops would produce tensors with nonsense shapes
Differential Revision: [D53032641](https://our.internmc.facebook.com/intern/diff/D53032641)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118157
Approved by: https://github.com/jbschlosser
Support this fallback by converting the jagged layout NT to a strided layout NT, and then converting the result back to a jagged layout NT.
This fallback might not be efficient since it uses unbind, contiguous and split.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116445
Approved by: https://github.com/soulitzer
Part 2 of implementation for general [subclass view fake-ification](https://docs.google.com/document/d/1C5taWiplmX7nKiURXDOAZG2W5VNJ2iV0fQFq92H0Cxw).
Details:
* Codegen `rev_view_func()` alongside `view_func()`
* Reverse view_func gives you a "base" from a "view": `rev_view_func(new_view) -> new_base` AKA it plays the original view backwards
* Utilizes the functional inverses defined in `FunctionalInverses.cpp`, passing `InverseReturnMode::AlwaysView`
* Manually implements functional inverses for `narrow()` and `chunk()`
* **NB: Multi-output views now set view_func() / rev_view_func() for each of the output views!**
* Due to this, the `as_view()` overload that operates on a list of views is scrapped in favor of iteration via codegen
Example codegen in `ADInplaceOrViewTypeN.cpp`:
```cpp
at::Tensor narrow(c10::DispatchKeySet ks, const at::Tensor & self, int64_t dim, c10::SymInt start, c10::SymInt length) {
  auto _tmp = ([&]() {
    at::AutoDispatchBelowADInplaceOrView guard;
    return at::_ops::narrow::redispatch(ks & c10::after_ADInplaceOrView_keyset, self, dim, start, length);
  })();
  std::function<at::Tensor(const at::Tensor&)> func=nullptr;
  std::function<at::Tensor(const at::Tensor&)> rev_func=nullptr;
  if (false || !self.unsafeGetTensorImpl()->support_as_strided() ||
      c10::AutogradState::get_tls_state().get_view_replay_enabled()) {
    func = [=](const at::Tensor& input_base) {
      return at::_ops::narrow::call(input_base, dim, start, length);
    };
    rev_func = [=](const at::Tensor& input_view) {
      // NB: args from narrow() signature are passed along to the inverse
      return at::functionalization::FunctionalInverses::narrow_copy_inverse(self, input_view, at::functionalization::InverseReturnMode::AlwaysView, dim, start, length);
    };
  }
  auto result = as_view(/* base */ self, /* output */ _tmp, /* is_bw_differentiable */ true, /* is_fw_differentiable */ true, /* view_func */ func, /* rev_view_func */ rev_func, /* creation_meta */ InferenceMode::is_enabled() ? CreationMeta::INFERENCE_MODE : (at::GradMode::is_enabled() ? CreationMeta::DEFAULT : CreationMeta::NO_GRAD_MODE));
  return result;
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115894
Approved by: https://github.com/soulitzer
Decorates all NT tests with `@markDynamoStrictTest` to ensure we get the correct signal. Adds xfails where needed to get things passing.
Includes a fix in meta_utils.py for a bug that was breaking several python 3.11 tests. In particular, a dense tensor graph input that is a view of a strided NT would slip past Dynamo's check and break in meta-ification.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116111
Approved by: https://github.com/soulitzer, https://github.com/zou3519
ghstack dependencies: #115192
This PR removes the need for passing `ragged_size` into the `NestedTensor` constructor. This was an artifact of fake-ification, where sometimes we needed the NT to have a symbolic singleton symint shape for the ragged dimension. The new way of achieving this is to also store mappings between fake / functional tensors -> symbolic symints in the ragged structure registry. Now the `NestedTensor` constructor can just query this registry for the `ragged_size`.
Old: `NestedTensor(values, offsets, *, ragged_size=None, **kwargs)`
New: `NestedTensor(values, offsets, **kwargs)`
This makes it possible to have a `_nested_view_from_values_offsets(values, offsets)` without needing to pass a `ragged_size`.
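The registry idea can be sketched in plain Python. Everything here is a hypothetical stand-in (`Offsets`, `get_ragged_size`, `NestedTensorSketch`, string-valued sizes); it only illustrates how a constructor can look up the ragged size from a weak-keyed registry instead of taking it as an argument.

```python
import itertools
import weakref


class Offsets:
    """Weak-referenceable stand-in for an offsets tensor."""

    def __init__(self, data):
        self.data = list(data)


# Maps an offsets object to the symbolic size of its ragged dimension.
# Weak keys let entries disappear once the offsets object is collected.
_ragged_registry = weakref.WeakKeyDictionary()
_next_id = itertools.count()


def get_ragged_size(offsets: Offsets) -> str:
    """Return the ragged size for `offsets`, allocating one on first use."""
    if offsets not in _ragged_registry:
        _ragged_registry[offsets] = f"j{next(_next_id)}"
    return _ragged_registry[offsets]


class NestedTensorSketch:
    def __init__(self, values, offsets: Offsets, **kwargs):
        self.values = values
        self.offsets = offsets
        # No `ragged_size` kwarg: the constructor queries the registry.
        self.ragged_size = get_ragged_size(offsets)


shared = Offsets([0, 2, 5])
a = NestedTensorSketch([1, 2, 3, 4, 5], shared)
b = NestedTensorSketch([6, 7, 8, 9, 10], shared)
c = NestedTensorSketch([1, 2, 3], Offsets([0, 1, 3]))
```

Two NTs built from the same offsets object report the same ragged size, while a different offsets object gets a fresh one.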
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113491
Approved by: https://github.com/ezyang, https://github.com/soulitzer
Summary:
Add split and layer_norm_backward.
Note: It is non-trivial to support the backward of split_with_sizes, so this adds the split operation to support the use case in the model.
Test Plan: unit tests
Differential Revision: D51052966
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113108
Approved by: https://github.com/soulitzer
This PR:
* Adds support for the `layout` kwarg to `torch.nested.as_nested_tensor()`
* Fixes `torch.nested.nested_tensor()`
* It should accept a list of lists of scalars
* It should not preserve autograd history
* Adds extensive testing for these two functions
Semantics for the two functions follow those of the strided layout:
* `torch.nested.nested_tensor(tensor_list, layout=torch.jagged)`: Creates a new jagged layout NT **with no autograd history**
* `tensor_list` can be a list of Tensors or list of lists of scalars
* `torch.nested.as_nested_tensor(tensor_list, layout=torch.jagged)`: Creates a new jagged layout NT **preserving autograd history of `tensor_list`**
* `tensor_list` must be a list of Tensors
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112304
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
This PR has a number of changes that improve subclass support for AOTAutograd/Inductor in general:
- Previously, if a subclass did extra aliasing between graph outputs/inputs, the partitioner would complain because grad_outputs are the outputs reused as-is. Now we do a view_as(self) to work around this.
- Use dense -> dense metadata when working with fwd_output_strides during backward. This is important since the stride information comes from inductor which sees the dense to dense graph.
- Inductor requires that the inputs to the compiled backward match some expected strides computed during compilation. We make sure to make the inner tensors of the subclass contiguous (previously, we only made the subclass itself contiguous).
Changes specific to NestedTensor relevant to compilation:
- Properly handle the case where `__tensor_unflatten__` is passed non-symbolic dense tensors and with meta extracted from fake subclasses.
- Skip var_to_range logic for singleton int
- Skip size hint logic in inductor for singleton int
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110529
Approved by: https://github.com/bdhirsh
In this PR:
- Adds support for strides for jagged tensor (design doc for this coming soon)
- NestedTensor skips automatic dynamic
- Make use of @bdhirsh's subclass fakification logic by adding the __tensor_{un,}flatten__ functions.
- Additional logic for fakification: the existing subclass fakification logic does not handle the case where the outer tensor has an additional dimension, so we insert one-off logic to (1) insert an extra SingletonSymInt onto the fakified NestedTensor and (2) make sure we call track_symint on the sizes of both the inner and outer tensors during guard creation.
Remaining things that are weird:
- Still need to skip some logic in meta utils for some reason (I was going to write this up more, but decided not to since we're not able to do this anyway for an immediate reason: we cannot arbitrarily compare singleton ints). For now I'm just following Brian's advice from [here](https://github.com/pytorch/pytorch/pull/109171#discussion_r1328137070).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109171
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
We want to be able to use SingletonSymNode to represent strides for Jagged layout tensor. The following is for 3D, but easily generalizable to higher dimensions.
Constraints:
- [B, x, D] (where x represents the variable-length dim) can be strided in two ways: [x, 1, sum(x)] and [dx, d, 1]. We need two different placeholder values depending on how the jagged tensor is strided.
- When doing operations we need the strides of output tensors to be expressible in terms of the strides and sizes of the inner tensors. Given [B, x, D] @ [D, D'], the output strides are [x * D', D', 1] rather than some opaque [x2, D', 1]. This constraint exists because if I'm tracing, I need a symint to represent the output stride. This symint needs to come from somewhere; I can get it in several ways: (1) create a constant, (2) unbacked symint, (3) create a new input using a source, (4) output of an operation on an existing symint. It is clear that (4) is what we want here, which brings us to the design below.
Design:
Given the two constraints, the most straightforward way to implement this is to update SingletonSymNode to include a scalar factor, i.e. morally, SingletonSymNode represents `factor * [s_0, s_1, …, s_n]`. This enables us to symbolically compute strides from sizes.
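To make the "scalar factor" idea concrete, here is a minimal pure-Python sketch. It is not the real SingletonSymNode: the class `ScaledSingleton` and the helper `contiguous_strides` are hypothetical names invented for illustration, and only multiplication by a plain int is modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaledSingleton:
    """Hypothetical stand-in for a singleton int with a scalar factor: coeff * j."""

    coeff: int = 1

    def __mul__(self, other):
        # Only scaling by a plain int is modeled in this sketch.
        if not isinstance(other, int):
            return NotImplemented
        return ScaledSingleton(self.coeff * other)

    __rmul__ = __mul__


def contiguous_strides(sizes):
    """Row-major strides; a size may be an int or a ScaledSingleton."""
    strides = []
    running = 1
    for size in reversed(sizes):
        strides.append(running)
        running = size * running
    return list(reversed(strides))


j = ScaledSingleton()                          # the ragged dimension "x"
input_strides = contiguous_strides([3, j, 8])  # [B, x, D]  -> [x * D, D, 1]
output_strides = contiguous_strides([3, j, 5]) # [B, x, D'] -> [x * D', D', 1]
```

Because the leading stride is derived by an operation on the existing singleton (option 4 in the constraints above), no fresh opaque symbol is needed for the output.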
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110369
Approved by: https://github.com/ezyang
ghstack dependencies: #110044
# Summary
Previously we disallowed non-contiguous NTs from being passed into empty_like. This was done out of an abundance of caution. However, it should be safe to create an empty NT from a non-contiguous NT: empty_like accounts for offsets, strides, and sizes when constructing the result.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98383
Approved by: https://github.com/cpuhrsch
This is needed for the HSTU model.
Details:
* ~~NT `chunk` now calls into NT `split_with_sizes` since the latter is more general~~ (removed; they're totally separate)
* Throws for backward
* Only operates over the last dim (`dim=-1`)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97446
Approved by: https://github.com/cpuhrsch
# Summary
NestedTensors currently don't support addition of tensors with non-identical strides. When accumulating grad, it is possible to try to accumulate a grad with different striding than the old var, and there is no way to change this in user code. This is a solution, though we should probably support strided addition for NT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97195
Approved by: https://github.com/albanD, https://github.com/cpuhrsch
# Summary
In preparation for the pt 2.0 launch, this PR updates SDPA's API and makes the function a public nn.functional function.
## Changes
### API
Previously the function signature was:
`scaled_dot_product_attention(query, key, value, attn_mask=None, need_attn_weights=False, dropout_p=0.0, is_causal=False) -> (Tensor, Tensor)`
Updated signature:
`scaled_dot_product_attention(query, key, value, attn_mask=None, dropout_p=0.0, is_causal=False) -> Tensor`
This PR removes the need_attn_weights optional boolean variable and updates the return type to a singular tensor.
#### Reasoning:
The main goal of this function is to provide an easy interface for users to call into fused attention kernels (e.g. FlashAttention). The fused kernels do not currently support arbitrary attn_mask or dropout, but there is a PR to mem-efficient attention to enable these. We want to have the API surface ready for when the backing kernels get updated.
The fused kernels save on memory usage by not materializing the weights, and it is unlikely that a fast fused implementation will enable this feature, so we are removing it.
Discussed with folks at FAIR/Xformers and +1 this API change.
#### Make function Public
In preparation for the pt 2.0 launch we make the function public to start to generate user feedback
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92189
Approved by: https://github.com/cpuhrsch
Summary: This diff modifies the implementation of the select operator so slices of the irregular dimension can be selected (e.g. nt[:,0,:]).
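As a rough picture of what selecting along the irregular dimension means, the sketch below models a nested tensor as a plain list of 2-D components (lists of rows) and picks one row from each. The helper `select_ragged` is a hypothetical name, and raising `IndexError` when a component is too short is an assumption of this sketch, not a statement about the actual operator.

```python
def select_ragged(components, index):
    """Pick row `index` from every component along its variable-length dim."""
    selected = []
    for i, component in enumerate(components):
        if not -len(component) <= index < len(component):
            raise IndexError(
                f"index {index} out of range for component {i} "
                f"with length {len(component)}"
            )
        selected.append(component[index])
    return selected


# Two components with 2 and 1 rows respectively, each row of width 2.
components = [[[1, 2], [3, 4]], [[5, 6]]]
first_rows = select_ragged(components, 0)  # analogous to nt[:, 0, :]
```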
Test Plan:
Added new unit tests to test that the new functions work as intended (see them in diff). To test,
`buck test mode/dev-nosan //caffe2/test:nested`
Differential Revision: D41083993
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88585
Approved by: https://github.com/cpuhrsch
Summary: This diff merges both previous implementations of constructors for nested tensors, the one from lists of tensors and the one with arbitrary python lists, and implements it in pytorch core so no extensions are needed to construct NT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88213
Approved by: https://github.com/cpuhrsch
Fixes #87713
BMM for CPU supports non-contiguous nested tensor inputs, while BMM for CUDA does not currently support non-contiguous inputs.
The derivative for BMM:
```
- name: bmm(Tensor self, Tensor mat2) -> Tensor
  self: grad.bmm(mat2.transpose(1, 2).conj())
  mat2: self.transpose(1, 2).conj().bmm(grad)
  result: self_t.bmm(mat2_p) + self_p.bmm(mat2_t)
```
When calling backward, it was impossible for this function to succeed since the inputs were always discontiguous, regardless of the user input. This PR adds contiguous calls to the BMM_cuda implementation for nested tensors.
This was not caught by tests because grad_check is currently only done on CPU in test_nestedtensors. This PR updates the autograd test to also be run on GPU.
As a result I found one more issue with the backward for to_padded_tensor erroring instead of calling the generic version.
cc @cpuhrsch @jbschlosser @bhosmer @mikaylagawarecki
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88108
Approved by: https://github.com/cpuhrsch
Summary: This diff implements copy_ in order to allow pinned memory transfers for nested tensors, as well as fill_ and ones_like, to test whether nested tensors can be created with other factory functions.
Test Plan: Pass all CI and sandcastle jobs.
Reviewed By: mikekgfb
Differential Revision: D40689594
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87728
Approved by: https://github.com/cpuhrsch
Summary: This commit adds support for moving NestedTensors from CPU to GPU and back. The implementation requires implementing empty_like(), which is based on PR #83140.
Test Plan: Added a new unit test based on the unit test for the main .to() implementation. All unit tests must pass, as well as every sandcastle job.
Differential Revision: D40437585
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87146
Approved by: https://github.com/drisspg
Summary: In order to make the layer normalization implementation for nested tensors public, it needs to be generalized to accept a normalized_shape argument instead of assuming it to be the last dimension of the nested_tensor. This commit does that, as well as adding extra unit tests to ensure the implementation is correct.
Test Plan:
All unit tests designed to test different ways of using the function work:
`buck test //caffe2/test:nested -- test_layer_norm`
Differential Revision: D40105207
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86295
Approved by: https://github.com/drisspg
### this effectively means that we only allow reshaping/viewing of nt with ONE ragged dimension
Behavior before this PR:
1. `-1` allowed for implicit batch dimension
2. multiple `-1`s allowed for pre-existing dimensions
3. for new dimensions, `-1` is not allowed
it is worth noting that for the most part 3 is basically unreachable because assuming a nested tensor has at least 1 ragged dimension, you would expect at least one -1 to be in the proposed shape for the pre-existing dimensions
Behavior after this PR:
1. batch dimension **must be specified**
2. **only one** `-1` allowed for pre-existing dimensions **this effectively means that we only allow reshaping/viewing of nt with ONE ragged dimension**
3. unchanged
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85691
Approved by: https://github.com/cpuhrsch
Previously indexing a nested tensor when it requires_grad would raise an error because the backward formula for `select.int` uses `self.sizes()`. This PR fixes that by temporarily registering a _nested_select_backward function which can be removed when we start using the symint approach to register kernels. For now this functionality is needed for creating a POC that nested tensor can be an API to `segment_coo` and `segment_csr` in the torch_scatter repo
```
a = torch.arange(10).reshape(2, 5).float()
b = torch.arange(12).reshape(2, 6).float()
nt = torch.nested_tensor([a, b], dtype=torch.float).requires_grad_(True)
nt[0]
# RuntimeError: Internal error: NestedTensorImpl doesn't support sizes. Please file an issue on https://github.com/pytorch/nestedtensor
```
whereas
```
nt = torch.nested_tensor([a, b], dtype=torch.float).requires_grad_(False)
nt[0]
```
would succeed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83875
Approved by: https://github.com/albanD, https://github.com/drisspg
# Summary
Trying to clean up the code structure for nested tensors. This introduces a utility header and cpp file that implement helper functions.
This is the initial PR of a larger cleanup. The next step would be separating all the native functions that create nested tensors into their own file, since they do not in fact do math on nested tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84385
Approved by: https://github.com/mikaylagawarecki
## Summary
Add detach op for nested tensors. Nested tensors are not part of the composite explicit dispatch key set and therefore need to be added manually.
The detach test is failing only for dtype=torch.float32 and torch.float16 with device=cuda. The chain of ops called is sum.backward() -> from_padded() -> unbind(). This populates the grad for a and b.
Does this potentially indicate that the CUDA implementation of one of these ops, likely from_padded(), is incorrect?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84078
Approved by: https://github.com/albanD
When the initial version came out, `NestedTensor` was not included in the `CompositeImplicitAutograd` key set, so we had to register dropout_nested to dropout and make it forward-only. Now is the time to improve it!
This pr removes dropout_nested; instead native_dropout_nested is implemented along with native_dropout_backward_nested.
Side change: remove dropout__nested since @cpuhrsch suggested to leave out nested in-place ops for now
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83338
Approved by: https://github.com/jbschlosser
Per offline discussion, this will be updated to use expand once expand semantics for nested tensor have been fleshed out.
Next steps will be to add support for other features for forward sum mentioned on #82387 and likewise update the backward
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82625
Approved by: https://github.com/albanD
# Summary
This PR pulls out all the changes from #81838 specific to properly creating nested_tensor views. I will update this comment with a design doc once that has been made. This should enable proper creation of NestedTensor views: two nested_tensors sharing the same buffer_ but with different NestedTensor metadata.
The function `create_nested_tensor_view` is a helper function for creating a new nested tensor whose storage aliases the base causing the underlying storage to be shared - and is therefore a view.
This function by itself is not differentiable and therefore autograd does not track its uses. If a nested tensor function uses this helper in its implementation, the aten_op must meet two requirements:
- The function must return a view of the input
- The function must be explicit and define its backward
## Testing
A bug was found when creating a base tensor out of inference mode and then creating a view in inference mode. This test has been added to this PR in order to show the effect of the change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82658
Approved by: https://github.com/albanD
# Summary
This change fixes a bug that was encountered when trying to add more backward formulas for nested tensor ops. If a derivative is defined that stores the "result" for use in the backward the output of the forward op is saved using:
```
if (grad_fn) {
  grad_fn->result_ = SavedVariable(result, true);
}
```
SavedVariable calls a series of functions which in turn call shallow_copy_and_detach, and when c179597753/c10/core/TensorImpl.cpp (L533) is hit this calls sizes_custom(), which is not implemented and errors. I also noticed that, since the storage format for nested_tensor is different (not `storage_` but two tensors), we should actually be calling the NestedTensorImpl constructor.
This PR overrides shallow_copy_and_detach from the derived class and ensures that shallow copy works correctly.
## Update
- Added the softmax derivative in this PR because that is a direct use case that was blocked by not having shallow_copy_and_detach work correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83002
Approved by: https://github.com/albanD
### Description
The nested_tensor impl for `contiguous` was disabled. Prior to the work on nested_tensor transpose, only contiguous nested tensors could be created from Python. However, it is now possible to create nested tensors that are non-contiguous. This PR links up the existing function used at the C++ level to the Python function.
### Tests
Updated Test in `test/test_nestedtensor.py`
### Notes
The inference mode had to be removed for this test. This is because the func `.contiguous` is a composite implicit function. Currently this does not work in inference mode. However: https://github.com/pytorch/pytorch/pull/81838 should fix that issue.
### Why
When writing kernels in Triton for nested tensors I exposed a helper function that returned the "Buffer" tensor to python. Now contiguity can be checked before running any triton kernel. Also a good follow up would be making `nt.contiguous` on non contiguous nested tensors return a contiguous nested tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82147
Approved by: https://github.com/jbschlosser
Adds an initial private API version of the SDP interface.
Signature:
```
_scaled_dot_product_attention(Tensor query, Tensor key, Tensor value, Tensor? attn_mask=None,
float dropout_p=0.0, bool need_attn_weights=True, bool is_causal=False) -> (Tensor, Tensor)
```
Returns a tuple of `(output, attn_weights)`.
Note the following:
* `need_attn_weights`: flag indicating that attention weights should be computed. This is useful to toggle off for flash attention as it does not materialize the weights by default, making it more expensive to return them.
* Boolean attention mask support only; `True` values within `attn_mask` indicate that the element should take part in attention (notably, this is reverse of MHA, which uses `True` to mask *out* values). Mask is optional.
* `is_causal`: Temporary flag indicating whether to use a causal attention weighting. If this is set to `True`, it takes precedence over any value passed in for `attn_mask`. Longer term, the `is_causal` flagging can be subsumed into the `attn_mask` arg via tensor subclassing (see e.g. [CausalTensor](https://github.com/facebookresearch/xformers/blob/sparse_cleanup/xformers/sparse/causal_tensor.py) in xFormers).
* Testing is currently done via reference with the existing Python impl of `F._scaled_dot_product_attention`.
* This PR does not yet drop-in the new SDP anywhere. A future PR can hook it up in BT or MHA.
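The mask and `is_causal` semantics listed above can be illustrated with a small pure-Python reference. This is only a sketch, not the ATen implementation or a fused kernel: the name `sdp_attention_reference`, the list-of-lists layout, and the lower-triangular definition of "causal" are assumptions made here, and every query row is assumed to keep at least one unmasked position.

```python
import math


def sdp_attention_reference(query, key, value, attn_mask=None, is_causal=False):
    """Scaled dot product attention on lists of row vectors.

    query is L x E, key is S x E, value is S x Ev. attn_mask is an L x S grid
    of booleans where True means the position takes part in attention.
    """
    num_queries, num_keys = len(query), len(key)
    scale = 1.0 / math.sqrt(len(query[0]))
    if is_causal:
        # The causal mask takes precedence over any attn_mask passed in.
        attn_mask = [[j <= i for j in range(num_keys)] for i in range(num_queries)]
    output, weights = [], []
    for i in range(num_queries):
        scores = []
        for j in range(num_keys):
            keep = attn_mask is None or attn_mask[i][j]
            dot = sum(q * k for q, k in zip(query[i], key[j]))
            scores.append(dot * scale if keep else -math.inf)
        # Numerically stable softmax over the key positions.
        peak = max(scores)
        exps = [math.exp(score - peak) for score in scores]
        total = sum(exps)
        row = [e / total for e in exps]
        weights.append(row)
        output.append(
            [sum(w * v[c] for w, v in zip(row, value)) for c in range(len(value[0]))]
        )
    return output, weights
```

Returning `(output, weights)` mirrors the tuple described above; a fused kernel would avoid materializing `weights` at all.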
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81956
Approved by: https://github.com/drisspg, https://github.com/erichan1
There was a discussion on whether to let nested tensor `reshape` support collapsing and splitting dimension 0. The conclusion was to keep reshape simple, so we need a tweaked `matmul`, which only supports the 3+ dimension non-broadcast case, i.e. a generalized `bmm`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81957
Approved by: https://github.com/jbschlosser
Nested tensor used to assume the buffer memory to be contiguous. However, some operations can break that assumption:
* reshape
* transpose
* slice
To access the underlying tensors from a discontiguous buffer, we need three pieces of metadata:
* sizes of each tensor (`nested_size_tensor_`)
* strides of each tensor (`nested_stride_tensor_`)
* offset of each tensor (`offsets_`)
so we access each tensor by `buffer.as_strided(size, stride, offset)`
This pull request introduces the offsets metadata, then adds reshape and transpose so that we can create discontiguous cases for testing. Unbind, select, dropout, softmax, and bmm are refactored to provide tests.
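The `buffer.as_strided(size, stride, offset)` access pattern can be sketched without any tensor library. The function below is a hypothetical pure-Python stand-in that reads a view out of a flat list; the sizes, strides, and offsets in the example are made up for illustration.

```python
def as_strided(buffer, size, stride, offset):
    """Read an n-d view out of a flat buffer, returned as nested lists."""
    if not size:
        return buffer[offset]
    return [
        as_strided(buffer, size[1:], stride[1:], offset + i * stride[0])
        for i in range(size[0])
    ]


buffer = list(range(10))
# Two components packed back to back: a 2x3 block and a 2x2 block.
sizes = [(2, 3), (2, 2)]
strides = [(3, 1), (2, 1)]
offsets = [0, 6]
components = [
    as_strided(buffer, size, stride, offset)
    for size, stride, offset in zip(sizes, strides, offsets)
]
# Transposing the first component only changes metadata; the buffer is untouched.
transposed = as_strided(buffer, (3, 2), (1, 3), 0)
```

The transposed view shows why per-tensor strides and offsets are needed: its elements are no longer laid out contiguously in the buffer.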
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80981
Approved by: https://github.com/jbschlosser
A first step towards adding dimension-wise reductions to NestedTensor:
- Assumes tensors in the nested tensor as well as the buffer of the nested tensor are contiguous
- Always enforces `keepdim=True`
- Only supports reduction across the last dimension
- No support for acctype (`dtype` argument)
- No autograd support
- CPU only
Next steps would be to add support for the above. For now this basic support is for prototyping to make sure `NestedTensor` can be used as an API for segment reductions.
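For intuition, a last-dimension sum with `keepdim=True` over a nested structure can be sketched as follows. This models components as plain nested lists and is not the kernel added here; `sum_last_dim` and `nested_sum_last_dim` are hypothetical helper names.

```python
def sum_last_dim(component):
    """Sum over the innermost dimension, keeping it as a length-1 dimension."""
    if not component or not isinstance(component[0], list):
        return [sum(component)]
    return [sum_last_dim(row) for row in component]


def nested_sum_last_dim(components):
    """Apply the last-dim reduction to every component independently."""
    return [sum_last_dim(component) for component in components]


# 1-D components of different lengths behave like a segment reduction.
segment_sums = nested_sum_last_dim([[1, 2, 3], [4, 5]])
# 2-D components with a ragged first dimension.
row_sums = nested_sum_last_dim([[[1, 2], [3, 4]], [[5, 6]]])
```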
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82387
Approved by: https://github.com/jbschlosser
# Summary
This change fixes a bug that was encountered when trying to add more backward formulas for nested tensor ops. If a derivative is defined that stores the "result" for use in the backward the output of the forward op is saved using:
```
if (grad_fn) {
  grad_fn->result_ = SavedVariable(result, true);
}
```
SavedVariable calls a series of functions which in turn call shallow_copy_and_detach, and when c179597753/c10/core/TensorImpl.cpp (L533) is hit this calls sizes_custom(), which is not implemented and errors. I also noticed that, since the storage format for nested_tensor is different (not `storage_` but two tensors), we should actually be calling the NestedTensorImpl constructor.
This PR overrides shallow_copy_and_detach from the derived class and ensures that shallow copy works correctly.
## Update
- Added the softmax derivative in this PR because that is a direct use case that was blocked by not having shallow_copy_and_detach work correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81838
Approved by: https://github.com/soulitzer
Add numel implementation for Nested Tensor. Currently the construction of nested size and nested_strides assume contiguous. This implementation was based off of the safe_compute_numel(). Having a TORCH_CHECK in a for loop kinda feels bad but I don't really know how performant numel needs to be.
Since nested size is stored as a tensor, `nested_size_tensor().cumprod(dim=1).sum(dim=0)[1].item()` would also get the job done.
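The quantity being computed is simply the sum, over components, of the product of each component's sizes. A minimal pure-Python sketch (the helper name `nested_numel` is hypothetical, and the overflow checking done by `safe_compute_numel()` is not modeled):

```python
import math


def nested_numel(nested_sizes):
    """Total element count: sum over components of the product of their sizes."""
    return sum(math.prod(sizes) for sizes in nested_sizes)


# Components of shape (2, 3) and (4, 3) hold 6 + 12 elements.
total = nested_numel([(2, 3), (4, 3)])
```

In the one-liner above, index `[1]` picks the last column of the row-wise cumulative product, which matches this sum when every component is 2-D.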
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80424
Approved by: https://github.com/cpuhrsch
This allows subclasses such as NestedTensorImpl to provide special behavior for `int64_t size(int64_t d)` that'll also be accessible by our Python frontend.
It follows the same pattern as sizes_custom.
Currently getting CI before asking for a review.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80236
Approved by: https://github.com/ezyang
In transformer, the scale step in attention has a `nested_tensor / scalar` operation. There are two ways to support that:
1. directly support `nested_tensor / scalar`:
* pro: straightforward, good UX
* con: is dispatching `mul(nested tensor, regular tensor)` a good practice?
2. let user manually convert `scalar` to `nested_scalar = torch.nested_tensor([broadcast_scalar])`
* pro: dispatcher only has to deal with `mul(nested tensor, nested tensor)`
* con: confusing manual conversions, bad UX
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80284
Approved by: https://github.com/cpuhrsch
2 reasons to add metadata `nested_stride`:
1. it will be used later in `reshape` and `transpose`
2. it reduces the computation to get offsets and shapes necessary in `unbind`-like codes, which will be used again and again in nested tensor operations
`unbind` and `select` are refactored to make use of `nested_stride`
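To show what caching strides saves, the sketch below derives per-component strides and starting offsets from the nested sizes, assuming the components are packed back to back in one contiguous buffer. The helper name `contiguous_metadata` is hypothetical; without stored metadata, `unbind`-like code would have to redo this kind of computation on every call.

```python
import math


def contiguous_metadata(nested_sizes):
    """Return (strides, offsets) for components packed back to back."""
    strides, offsets = [], []
    start = 0
    for sizes in nested_sizes:
        row, running = [], 1
        # Row-major strides are built from the innermost dimension outward.
        for size in reversed(sizes):
            row.append(running)
            running *= size
        strides.append(tuple(reversed(row)))
        offsets.append(start)
        start += math.prod(sizes)
    return strides, offsets


strides, offsets = contiguous_metadata([(2, 3), (2, 2)])
```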
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79831
Approved by: https://github.com/cpuhrsch, https://github.com/jbschlosser
This PR adds support for `SymInt`s in python. Namely,
* `THPVariable_size` now returns `sym_sizes()`
* python arg parser is modified to parse PyObjects into ints and `SymbolicIntNode`s
* pybind11 bindings for `SymbolicIntNode` are added, so size expressions can be traced
* a large number of tests added to demonstrate how to implement python symints.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78135
Approved by: https://github.com/ezyang