**Background:** It's common to use `scalar_tensor()` in the input to `where()` to convert any scalars present to compatible tensors with matching options, *including layout*. This shows up in various places, notably including derivative formulas ([example](78491d6afc/tools/autograd/derivatives.yaml (L432-L434))). It causes problems for NJTs because they have `layout=torch.jagged` and it never makes sense to create a scalar tensor with this layout. Some of the breakage only seems to happen in CI for reasons I don't fully understand (see the revert of #140736 due to softshrink's derivative formula).
**This PR:**
* Allows non-contiguous NJT inputs to `where()` + adds tests for this
* Handles scalar tensor / dense tensor inputs for `condition` / `other` + adds tests for this
* Uses limited `broadcast_tensors()` / `broadcast_to()` support
* Improves `expand()` to work on non-contig NJTs
* Changes `scalar_tensor()` to use `torch.strided` instead of `torch.jagged` in both eager and torch.compile (i.e. meta registration)
* Changes backward formulas for `sinc`, `pow`, `special.i1`, and `special.i1e` to uses `scalar_tensor()` instead of e.g. `zeros({})`
**Alternative approach:** Update all problematic usages of `scalar_tensor()` to avoid ever passing `layout=torch.jagged`. This is an extensive change and includes `torch.where()` logic, a bunch of derivative formulas, and likely other places not yet discovered.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141500
Approved by: https://github.com/malfet, https://github.com/cpuhrsch, https://github.com/soulitzer
This PR contains three `unsqueeze()`-related fixes for NJT:
1. Adjusts the output's `_ragged_idx` when `unsqueeze()` inserts a dim before the ragged dim
2. Corrects the unbind reference for `unsqueeze()` after the last input dim. For this case, the dim kwarg canonicalization logic needs to be applied wrt `inp.dim() + 1` to account for `dim=-1` properly
3. Adds ragged dim support to `unsqueeze()`, allowing for e.g. `(B, j1, D) -> (B, 1, j1, D)`. This is okay now after #137125
Note that `unsqueeze()` still doesn't support batch dim operation, and arguably should never support this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141392
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #140736, #140161
This PR introduces `ExtraOpData`, a structure that contains op metadata regarding whether the op is a view and the dim-related args it accepts. It also populates a huge database for dim-wise / view ops with this info.
Test logic (sample input generation, references) have been updated to utilize this data. It allows for a fairly generic set of sample inputs & a reference for the class of ops that accept a single NJT and operate dim-wise (AKA "unary dimwise ops").
Testing is added over the following ops:
* `chunk()`
* `narrow()`
* `select()`
* `split()`
* `split_with_sizes()`
* `squeeze()`
* `unflatten()`
* `unsqueeze()`
Most of the above do not operate on the ragged / batch dims or on non-contiguous NJTs, so the proper xfails are added as needed.
I also slipped in a couple minor fixes (sorry):
1. The `_wrap_jagged_dim()` helper now avoids assuming the `nt._ragged_idx == 1` and allows for a batch dim to be a valid input, disambiguating the converted inner dim as necessary through an additional `operating_on_batch` return value (i.e. both dim=0 and dim=1 map to dim=0 on the inner values tensor, since that dim represents a packed ragged dim for all batch items)
2. Padded dense -> NJT conversion requires shape gymnastics to operate with the restrictive FBGEMM kernel. The gymnastics were slightly wrong for the transposed NJT case, and this PR fixes that
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140161
Approved by: https://github.com/Skylion007, https://github.com/cpuhrsch
ghstack dependencies: #140736
This PR contains several fixes related to non-contiguous NJTs:
1. Propagates `lengths` through op calls appropriately (see desc of #138098)
* SDPA now calls `nested_view_from_values_offsets_lengths()` instead of `nested_view_from_values_offsets()`
2. Allows non-contig NJTs in unsqueeze / transpose / select
3. Expands padded dense -> NJT conversion to support non-contig NJTs
4. (unrelated sorry) Updates `split` / `split_with_sizes` to allow for optional `dim`, matching the ATen signature
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140160
Approved by: https://github.com/cpuhrsch
Fixes#137512
Relaxes the restriction that the ragged dim is immediately next to the batch dim e.g. `(B, *, D_0, ..., D_N)`. This allows for constructing NJTs of shape e.g. `(B, D, j0)` directly. It's possible before this PR to get an NJT of e.g. shape `(B, D, j0)` by constructing an NJT of shape `(B, j0, D)` and transposing it. This PR allows a user to go straight there without the transpose. The standard `torch.nested.nested_tensor(list)` constructor has been updated to support this.
At the very least, this is useful for testing on transposed NJTs. I'm willing to make this functionality private if needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137125
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
I'm sick of reductions not working properly - spotty dim coverage, missing backwards, etc. This PR fixes quite a bit.
It applies to the following ops:
* `sum` / `mean` / `prod`
* `all` / `any`
* `amin` / `amax`
* `min` / `max`
* `argmin` / `argmax`
The general reduction logic has been factored out into a helper `_apply_reduction(func, func_name, identity_element, *args, **kwargs)`. The idea is that by providing a valid identity element, we can utilize conversions to padded dense when needed for reducing over the ragged dim.
Extensive test coverage includes:
* reductions across ragged dim
* reductions across non-batch, non-ragged dims
* reductions across both batch and ragged dims
* multiple dim reductions (for ops that support this)
* full reduction -> scalar
Bonus: the PR includes backwards fixes for `sum` and `mean`, which have never worked.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139317
Approved by: https://github.com/cpuhrsch
This PR adds FlexAttention + NJT support. In particular:
* To handle raggedness, treats the packed sequence dim of input NJTs as a giant "stacked sequence". To ensure user `score_mod` / `mask_mod` functions can still be written in the original NJT sequence space, this PR handles conversions for indices within the giant "stacked sequence" -> sequence relative indices automatically.
* Provides `py_impls` for `NestedTensor` to the HOPs for flex attention forward / backward that simply wrap / unwrap NJTs appropriately
* Adds barebones `new_empty()` support to NJT since FlexAttention utilizes this repeatedly; right now, only `new_empty()` with a shape of `()` is supported
* Tests that FlexAttention with a causal mask matches causal SDPA
* Adds a new public API for FlexAttention usage:
* `create_nested_block_mask(mask_mod, B, H, njt, BLOCK_SIZE, _compile)` - NJT analogue for `create_block_mask()` that utilizes the `njt`'s ragged structure to create an appropriately-sized block mask (e.g. `(1, 1, total_seqlen, total_seqlen)`). This function handles the index conversion from "stacked sequence" space -> relative sequence space.
* Minor note: as this is a public API, this function is purposefully named with "nested" instead of "njt" to keep the latter as an informal, mostly internal-only term.
Example usage:
```python
def causal_mask(b, h, q_idx, kv_idx):
return q_idx >= kv_idx
query = ... # NJT of shape (B, H, S*, D)
key = ... # NJT of shape (B, H, S*, D)
value = ... # NJT of shape (B, H, S*, D)
# create_nested_block_mask() automatically converts indices from "stacked sequence" space -> relative sequence space
block_mask = create_nested_block_mask(causal_mask, 1, 1, query) # block mask conceptual shape is (B, H, sum(S*), sum(S*))
output = flex_attention(query, key, value, block_mask=block_mask)
def causal_score_mod(score, b, h, q_idx, kv_idx):
return torch.where(q_idx >= kv_idx, score, float("-inf"))
# flex_attention() automatically converts indices from "stacked sequence" space -> relative sequence space for NJT inputs
output2 = flex_attention(query, key, value, score_mod=causal_score_mod)
```
TODO:
* ~~Determine the right level of abstraction for public API helpers + move them alongside other helpers~~ Verify this with others though
* ~~Some cleanup~~
* ~~`njt_score_mod_adapter`~~
* ~~Q: should `create_njt_block_mask()` call `njt_mask_mod_adapter()` so we don't need two calls?~~
* Can we avoid materializing the `sum(s)` length `seq_idx` used for conversion between stacked sequence -> sequence relative indices?
* Not for now, although future work may deepen the integration between Flex + NJT (possibly requiring custom templates). We should try to cache this though.
* ~~Demonstrate non-causal mask~~
* Support non-contiguous NJTs with holes (**booted to future PR**)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136792
Approved by: https://github.com/drisspg
ghstack dependencies: #138841
Before this PR, NJT would dispatch e.g. `NJT * nested_int` to `mul.Tensor`, wrongly interpreting the SymInt as a tensor and outputting garbage. This PR verifies that there are no nested ints in the list of args before dispatching for pointwise ops.
I originally tried checking that `the number of passed tensor args == the number of func schema tensor args`, but this wrongly disallows `nt * 2`, which (non-intuitively to me at least at first) dispatches via the `mul.Tensor` overload.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138602
Approved by: https://github.com/soulitzer
Does what it says on the tin. I believe the right behavior here is to ensure that `record_stream()` is called on all tensor components of the NJT to ensure they all live until stream computation is complete.
This is an ask from torchrec as the op is used there.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137099
Approved by: https://github.com/ngimel
There's an annoying pattern emerging for pulling out the NJT min / max seqlen ints if they exist without computing / caching if they don't. This PR introduces private convenience functions to simplify handling this and avoiding redundant checks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138130
Approved by: https://github.com/soulitzer
Our matmul support is abysmal - let's at least get this working and do it performantly later.
Bonus: implements `bmm` as well.
jagged <-> padded dense conversions are utilized when possible, and an unbind-based fallback otherwise (the former works with torch.compile and the latter doesn't). Some testing is missing because we don't have factory function support yet :(
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138121
Approved by: https://github.com/cpuhrsch
Fixes#129366
Since NJT has custom serialization logic, we need an NJT-specific fix to clear out cached sizes / strides PyCapsules. Eventually, we should switch NJT to use the default serialization logic, but this depends on #125622 being addressed.
This PR also makes serialization more complete by explicitly handling `lengths`, `ragged_idx`, and the `metadata_cache`, ensuring working operation for both contiguous and non-contiguous NJTs,
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137031
Approved by: https://github.com/soulitzer
ghstack dependencies: #137030
Prior to this PR, calling `reshape()` under `inference_mode()` would throw a `NotImplementedError`. This is because `inference_mode()` disables autograd key dispatch, incidentally preventing the decomposition of reshape for NJT.
This PR fixes this by redispatching on the `CompositeImplicitAutogradNestedTensor` key whenever a composite implicit op is encountered in `NJT.__torch_dispatch__()`. This fixes reshape and any other composite implicit ops underneath `inference_mode()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134683
Approved by: https://github.com/soulitzer, https://github.com/albanD
ghstack dependencies: #136566
Related: #132695
This PR uses padded dense <-> jagged conversions to handle binary pointwise broadcasting of (NT, T) and (T, NT). This includes:
* `(B, j0, D) + (1, 1, 1)`
* `(B, j0, D) + (B, 1, 1)`
* `(B, j0, D) + (B, 1, D)`
* etc.
This PR also adds (hacky) support for bool inputs to the jagged <-> padded dense conversions. The underlying CUDA kernels do not support integer / bool inputs; so the following workaround is employed: `convert input -> half, run conversion kernel, convert output -> bool`. Note that this bool support is needed specifically for the backward formula of `fmax`, and likely others.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133021
Approved by: https://github.com/cpuhrsch
`rms_norm()` is a nice-to-have for ViT :)
This PR:
* SymInt-ifies `rms_norm()`, allowing NJT to use the same decomp.
* Adds torch_function-based input validation logic for nested-specific stuff (no normalization supported over the ragged dim for now) on the python NJT side.
* Adds multi-dim support (on non-ragged, non-batch dims) to `mean()` for NJT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135872
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #125947
This PR solves two problems with `sum()` support in NJT:
* `sum()` over a dim with `keepdim=True` returns the wrong shape (i.e. it'll keep the wrong dim). This is a long-standing bug from way back in #112519.
* Historically, we've only supported `sum()` over a dim and not a full reduction. This PR adds the full reduction form (forward only, backward still fails).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131945
Approved by: https://github.com/davidberard98, https://github.com/jananisriram
This PR:
* Implements the pre-existing `nt.to_padded_tensor(padding_val)` ATen op via the FBGEMM kernel + appropriate view gymnastics (since that kernel only handles 2D values)
* Introduces a new `_nested_from_padded_tensor` op for the reverse conversion, implemented via the reverse FBGEMM kernel + view gymnastics
* Note: there is currently no public API for this; design booted to a future PR
TODO:
* ~~Propagate min / max sequence length via the new factory function `_nested_from_padded_tensor`~~
* ~~Verify that Inductor does computation fusion via test logic~~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125947
Approved by: https://github.com/soulitzer
This PR:
* Implements the pre-existing `nt.to_padded_tensor(padding_val)` ATen op via the FBGEMM kernel + appropriate view gymnastics (since that kernel only handles 2D values)
* Introduces a new `_nested_from_padded_tensor` op for the reverse conversion, implemented via the reverse FBGEMM kernel + view gymnastics
* Note: there is currently no public API for this; design booted to a future PR
TODO:
* ~~Propagate min / max sequence length via the new factory function `_nested_from_padded_tensor`~~
* ~~Verify that Inductor does computation fusion via test logic~~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125947
Approved by: https://github.com/soulitzer
A user wants to use the flop counter with meta devices. This previously caused problems for SDPA+NJT:
1. autocast check: `torch.is_autocast_enabled("meta")` fails because `meta` is not valid for autocasting. If we skip this, we run into the next error
2. math backend: conversion to NST requires getting concrete offsets in a list of python integers, which doesn't work on a meta tensor b2eb0e8c6a/torch/nested/_internal/sdpa.py (L809-L815)
3. (fixed in the previous PR, #134288) - if we force using flash attention backend for flop counting, `_flash_attention_forward` previously didn't support meta tensors.
In this PR, we check specifically for FlopCounterMode, and, if it's enabled and combined with meta tensors, (a) skip autocasting and (b) force it down the flash attention path. This isn't generally safe for tracing (e.g. if you actually care which kernels you are running), but in the absence of actual device information, we have to make some assumptions. By specifically checking for FlopCounterMode, this should reduce the chance of unintended side effects for other meta tensor users.
Note: fake tensor would solve a bunch of these issues, but it's not a viable solution right now for the user.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134289
Approved by: https://github.com/soulitzer
ghstack dependencies: #134288
Part of #134054.
This corresponds to the pytorch mypy changes from D61493706. Updating takes so
long and touches so many files that it's impossible to land as a whole without conflicting with some other intermediate change.
So landing these 'type: ignore' for pytorch in advance of them actually being needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134202
Approved by: https://github.com/Skylion007
It's possible to construct an NJT with "holes" by specifying both `offsets` and `lengths` metadata. When `nt.clone(memory_format=torch.contiguous_format)` is called on such an NJT, the result should be an NJT without holes.
This PR fixes this in simplistic way using `unbind()`, which isn't really supported in `torch.compile`. The longer term solution involves writing a proper kernel to support this.
NB: Another limitation is that the returned NJT does not have the same ragged structure as the input. While we could manually hack the nested int registry (or update the union find when that lands), this is the first instance where a NJT with holes and an NJT without holes could have the same ragged structure, and getting those to play nicely together requires some fairly involved updates. For now, this PR punts on these updates until we can clean this up.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132776
Approved by: https://github.com/ani300, https://github.com/soulitzer
ghstack dependencies: #131898, #131704, #131937
Summary:
Modify `softmax` on the ragged dimension, where `ragged_idx == 1`, to allow for 2D nested tensors. This diff now enables a `softmax` operation on tensors of shape `(B, *)`, where `*` is the ragged dimension.
Extend existing `softmax` unit tests to include 2D nested tensors using the `include_2d_tensor=True` keyword argument.
Test Plan:
Verify that existing and modified unit tests pass using the following commands:
```
buck2 run mode/{opt,inplace} //caffe2/test:nested -- --regex test_softmax
```
```
buck2 run mode/{opt,inplace} //caffe2/test:nested -- --regex test_jagged_op
```
Reviewed By: davidberard98
Differential Revision: D60780975
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132812
Approved by: https://github.com/davidberard98
The goal of this PR is to avoid stack overflow when we create extremely long chains of thunks, and then evaluate them (e.g., as occurs if you sum(long list of symint)). The basic idea behind this PR is to only thunkify proxies if they're being created in places where they may or may not be used--crucially, symint operations that occur in user code we are tracing are eagerly placed into the graph, even if they may eventually be dead.
I annotated the PR with explanation of changes.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132421
Approved by: https://github.com/Skylion007, https://github.com/zou3519
ghstack dependencies: #132674, #132675
When autocasting is turned on, right now SDPA w/ NJT won't be autocasted. This PR adds manual "autocasting" logic in sdpa.py - at the beginning, it just checks if autocasting is enabled, and if so, it casts the inputs in the way you would expect if autocasting was actually running.
Why normal autocasting won't work:
* NJT intercepts the `__torch_function__` call for scaled_dot_product_attention, which, AFAIK, happens before we get to any dispatcher logic, and then calls efficient attention or flash attention. So autocasting the scaled_dot_product_attention op won't work; we never call the aten op for scaled_dot_product_attention, so we won't ever run autocasting for it.
* If we try to add autocasting handling for `_flash_attention_forward` or `_efficient_attention_forward`, then autocasting will _run_, but it will have the wrong semantics: sdpa.py's handling will run first, and it will do backend selection based on the uncasted inputs to SDPA. This also means that if the inputs to the SDPA call don't have uniform types, the sdpa.py implementation will fail checks (this is the specific issue we're targeting).
Alternative: "just change the backend selection logic for NJT to be autocast aware, but don't actually do the autocast; then, add `_(flash|efficient)_attention_forward` to autocasting rules". I think this would work too. But it's arguably better to make the backend-selection logic and actual-autocast-behavior use the same implementation, in case the implementations are different.
Differential Revision: [D60879916](https://our.internmc.facebook.com/intern/diff/D60879916)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132835
Approved by: https://github.com/soulitzer
This PR does 3 things:
1. Adds a copy-free strided->jagged layout conversion for NT
2. Adds a copy-free jagged->strided layout conversion for NT
3. Modifies and expands the .to() API to support the layout argument for the specific case of NT layout conversion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115749
Approved by: https://github.com/jbschlosser
Combines contributions from https://github.com/pytorch/pytorch/pull/130505
Some context can be found in this large comment block:
a5b64d39fd/test/dynamo/test_subclasses.py (L1667-L1681)
Changes in this PR
- For each tensor fakified, check the nested int registry in eager, and eagerly symbolicize if that tensor has already been associated with nested int in eager.
- Adds a separate counter stored on FakeTensorMode as a fake analog to _tensor_id_counter (which keeps track of unique tensors). This counter is initialized to the global eager tensor id counter upon creation of the FakeTensorMode, and needs to be reset when the same FakeTensorMode is reused to trace again (in this PR, we piggyback on the epoch incrementing logic).
- (refactor) Today, we store FakeTensor -> symbolic nested int in the global registry. With this PR, symbolic nested int is stored directly on the FakeTensor. (Eager still caches nested int in the registry, though we should avoid this at some point.)
Basically unchanged, but worth noting:
- `__tensor_unflatten__` is still responsible for determining whether we should cache for now. The logic is somewhat simplified.
- to_copy is still using the trick of updating two different tensors in the registry to point to the same nested int. This is kind of broken, but we try to leave it as is, and plan a better fix with the UnionFind stack.
Differential Revision: [D60406772](https://our.internmc.facebook.com/intern/diff/D60406772)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130292
Approved by: https://github.com/bdhirsh
ghstack dependencies: #131916, #131803
This PR does 3 things:
1. Adds a copy-free strided->jagged layout conversion for NT
2. Adds a copy-free jagged->strided layout conversion for NT
3. Modifies and expands the .to() API to support the layout argument for the specific case of NT layout conversion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115749
Approved by: https://github.com/jbschlosser
**Background:** NJT utilizes a `jagged_unary_pointwise()` fallback that historically has assumed blindly that the first arg is an NJT. This assumption breaks certain ops; for example `pow(scalar, Tensor)` has an NJT as the second arg.
This PR expands `jagged_unary_pointwise()` and the associated schema validation logic to handle an NJT in args other than the first position.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131937
Approved by: https://github.com/soulitzer
ghstack dependencies: #131898, #131704
Modify the existing `layer normalization` operator in PyTorch, invoked by `torch.layer_norm`, to allow for reductions along the jagged dimension of a nested tensor. The function originally had a basic implementation for reducing along 1 non-ragged dimension. This diff, which uses the `aten` padding operator, enables PyTorch users to invoke `torch.nn.functional.layer_norm` on a nested tensor when reducing along the ragged dimension, e.g. `*` in a `(B, *, M)` or `(B, *, M, N)` nested tensor.
Write unit tests based on the `softmax` jagged operator to verify the accuracy of the ragged reduction implementation for `torch.nn.functional.layer_norm`. Add unit tests to verify error handling for unsupported features.
Note that this implementation is limited to nested tensors with `ragged_idx == 1`, i.e. the ragged dimension is not transposed. The layer normalization operator also requires an operation on a 2-dimensional layer; for nested tensors with 4 or more dimensions, I flatten the extra dimensions, then unflatten them after performing layer normalization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132172
Approved by: https://github.com/davidberard98
ghstack dependencies: #132170
Modify the existing `softmax` operator in PyTorch, invoked by `torch.softmax`, to allow for reductions along the jagged dimension of a nested tensor. The function originally had a basic implementation for reducing along 1 non-ragged dimension. This diff, which uses the aten padding operator, enables PyTorch users to invoke `torch.softmax` on a nested tensor when reducing along the ragged dimension, e.g. `*` in a `(B, *, M)` nested tensor.
Write unit tests based on the `sum` and `mean` jagged operators to verify the accuracy of the ragged reduction implementation for `torch.softmax`. Add unit tests to verify error handling for unsupported features in `NestedTensor` `torch.softmax`.
Note that this implementation is limited to nested tensors with `ragged_idx == 1`, i.e. the ragged dimension is not transposed. In addition, the `softmax` operator is required to take in as input an integer for the reduction dimension `dim`, requiring new unit tests heavily inspired by the `sum` and `mean` jagged operator unit tests. `Softmax` also allows for reducing along the batch dimension.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132170
Approved by: https://github.com/davidberard98
This PR utilizes the info from the existing OpInfo database `op_db` to contribute to general NJT testing.
* New tests in `TestNestedTensorOpInfo`
* `test_forward()` - compares forward output to an unbind-based reference
* `test_backward()` - compares forward output and grads to an unbind-based reference
* `test_forward_compile()` - compares forward compile output (`backend="aot_eager_decomp_partition"`) to eager
* `test_backward_compile()` - compares forward compile output (`backend="aot_eager_decomp_partition"`) and grads to eager
* To avoid adding a bunch of NJT-specific stuff to the `OpInfo` structure, this PR translates `op_db` -> a NJT-specific `njt_op_db`.
* `UnaryUfuncInfo`s utilize a new `sample_inputs_unary_njt_pointwise()` which iterates through a comprehensive list of NJTs: contiguous / non-contiguous, dims 2, 3, and 4, transposed / not, etc.
* `BinaryUfuncInfo`s utilize a new `sample_inputs_binary_njt_pointwise()` which iterates through a comprehensive list of NJTs: contiguous / non-contiguous, dims 2, 3, and 4, transposed / not, etc.
* `ReductionOpInfo`s utilize a new `sample_inputs_njt_reduction()` which covers full reductions, reductions over the jagged dim, and reductions over the non-jagged dim
* Several xfails were added to get things passing
TODO (future PRs):
* Pass non-contiguous / non-contiguous with holes NJTs (maybe we should have separate tests for these? most ops don't support NJTs with holes today)
* Mixed (NT, T), (T, NT) inputs for binary ops
* Handle other types of OpInfos (beyond unary pointwise, binary pointwise, and reduction) by manually by writing sample_inputs_funcs
* Address all xfails via fixes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131704
Approved by: https://github.com/soulitzer
ghstack dependencies: #131898
**Background**: `TestCase.assertEqual()` is commonly used during test case validation. Historically, to support NSTs, the logic was written to compare two nested tensors by unbinding them and comparing their components. This logic applied to NJTs as well, which in practice meant that two NJTs with different nested ints in their shapes could compare equal if their components were equal.
This PR changes the above logic so that NJTs are no longer unbound during comparison, allowing them to receive full shape validation. This makes `TestCase.assertEqual()` stricter for NJTs, requiring them to have the same nested ints in their shapes to compare equal.
Note that some tests rely on the old, looser behavior. To address this, the PR introduces a base `NestedTensorTestCase` that defines a helper function `assertEqualIgnoringNestedInts()` so that these tests can explicitly opt in to the looser comparison behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131898
Approved by: https://github.com/soulitzer
Modify the existing `layer normalization` operator in PyTorch, invoked by `torch.layer_norm`, to allow for reductions along the jagged dimension of a nested tensor. The function originally had a basic implementation for reducing along 1 non-ragged dimension. This diff, which uses the `aten` padding operator, enables PyTorch users to invoke `torch.nn.functional.layer_norm` on a nested tensor when reducing along the ragged dimension, e.g. `*` in a `(B, *, M)` or `(B, *, M, N)` nested tensor.
Write unit tests based on the `softmax` jagged operator to verify the accuracy of the ragged reduction implementation for `torch.nn.functional.layer_norm`. Add unit tests to verify error handling for unsupported features.
Note that this implementation is limited to nested tensors with `ragged_idx == 1`, i.e. the ragged dimension is not transposed. The layer normalization operator also requires an operation on a 2-dimensional layer; for nested tensors with 4 or more dimensions, I flatten the extra dimensions, then unflatten them after performing layer normalization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131519
Approved by: https://github.com/davidberard98
ghstack dependencies: #131518
Modify the existing `layer normalization` operator in PyTorch, invoked by `torch.layer_norm`, to allow for reductions along the jagged dimension of a nested tensor. The function originally had a basic implementation for reducing along 1 non-ragged dimension. This diff, which uses the `aten` padding operator, enables PyTorch users to invoke `torch.nn.functional.layer_norm` on a nested tensor when reducing along the ragged dimension, e.g. `*` in a `(B, *, M)` or `(B, *, M, N)` nested tensor.
Write unit tests based on the `softmax` jagged operator to verify the accuracy of the ragged reduction implementation for `torch.nn.functional.layer_norm`. Add unit tests to verify error handling for unsupported features.
Note that this implementation is limited to nested tensors with `ragged_idx == 1`, i.e. the ragged dimension is not transposed. The layer normalization operator also requires an operation on a 2-dimensional layer; for nested tensors with 4 or more dimensions, I flatten the extra dimensions, then unflatten them after performing layer normalization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131519
Approved by: https://github.com/davidberard98
ghstack dependencies: #131518
Modify the existing `softmax` operator in PyTorch, invoked by `torch.softmax`, to allow for reductions along the jagged dimension of a nested tensor. The function originally had a basic implementation for reducing along 1 non-ragged dimension. This diff, which uses the aten padding operator, enables PyTorch users to invoke `torch.softmax` on a nested tensor when reducing along the ragged dimension, e.g. `*` in a `(B, *, M)` nested tensor.
Write unit tests based on the `sum` and `mean` jagged operators to verify the accuracy of the ragged reduction implementation for `torch.softmax`. Add unit tests to verify error handling for unsupported features in `NestedTensor` `torch.softmax`.
Note that this implementation is limited to nested tensors with `ragged_idx == 1`, i.e. the ragged dimension is not transposed. In addition, the `softmax` operator is required to take in as input an integer for the reduction dimension `dim`, requiring new unit tests heavily inspired by the `sum` and `mean` jagged operator unit tests. `Softmax` also allows for reducing along the batch dimension.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131518
Approved by: https://github.com/davidberard98
Add support for transposed, non-contiguous `NestedTensor`s, where `ragged_idx > 1`, for the aten operators `sum` and `mean`. This diff enables reducing along the jagged dimension for non-contiguous `NestedTensor`s, transposed between non-batch dimensions as well as between a ragged and a non-batch dimension. For example, users can now reduce a `NestedTensor` of shape `(B, M, *, N)` along `*` or `(B, N, M, *)` along `*`.
Parametrize existing unit tests and add new unit tests verifying the accuracy of implementations on `NestedTensor`s that transpose between 2 non-batch dimensions as well as between a ragged and a non-batch dimension.
Differential Revision: [D59847927](https://our.internmc.facebook.com/intern/diff/D59847927/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131517
Approved by: https://github.com/davidberard98
Summary:
Modify the existing `mean` operator in PyTorch, invoked by `torch.mean`, to allow for reductions along the jagged dimension of a nested tensor. The function originally had a basic implementation for reducing along 1 non-ragged dimension. This diff enables PyTorch users to invoke `torch.mean` on a nested tensor when reducing along the ragged dimension, e.g. `*` in a `(B, *, M)` nested tensor.
Parametrize unit tests from `sum` to verify the accuracy of the ragged reduction implementation for `torch.mean`. Add unit tests and parametrize `sum` unit tests to verify error handling for unsupported features in `NestedTensor` `torch.mean`.
Test Plan:
Verify that the new unit test passes via the following command:
```
buck2 run mode/{opt,inplace} //caffe2/test:nested -- --regex test_mean
```
```
buck2 run mode/{opt,inplace} //caffe2/test:nested -- --regex test_jagged_op
```
Differential Revision: D59654668
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131132
Approved by: https://github.com/davidberard98, https://github.com/jbschlosser
Summary: Modify the existing `sum` operator in PyTorch, invoked by `torch.sum`, to allow for reductions along the ragged dimension of a nested tensor. This diff enables PyTorch users to invoke `torch.sum` on a nested tensor with `dim=1`, where `ragged_idx=1`.
Functions modified in `caffe2/torch/nested/_internal/ops.py`:
- `sum_dim_IntList()`: The function assumes that `ragged_idx=1`; in the case that `dim=1` as well, where `dim` is the dimension on which we reduce, this diff invokes the PyTorch benchmark found in D58423489. Specifically, this diff pads a nested tensor, e.g. of logical shape `(B, *, M)`, using [`torch.ops.aten._jagged_to_padded_dense_forward`](https://www.internalfb.com/code/fbsource/[92c2a067ab04e3eebc999254fed4ae2fbea6def3]/fbcode/deeplearning/fbgemm/fbgemm_gpu/fb/inductor_lowerings/elementwise_ops.py?lines=26), then reduces across the `*` dimension (`dim == 1`) to a `(B, M)` output tensor.
- `_wrap_jagged_dims()`: This diff adds special handling to allow for the case where `dim` contains `1` and not `0`, but to continue disallowing the case where `dim` contains `0` and not `1`. In this function's creation, I created a helper function, `_get_condition_for_invalid_jagged_reductions()`, which makes it clearer which conditions apply to which operators. Specifically, operators which are enabled with jagged reductions are specified at the top of the file in `SUPPORTED_JAGGED_REDUCTIONS` and have a different set of conditions that need to be tested, as reducing along `dim == 1` without `dim == 0` is now possible.
Functions modified in `caffe2/test/test_nestedtensor.py`:
- `test_sum_int_DimList()`: This diff adds special handling in the `sum` unit test to allow for the case where `dim` contains `1` and not `0`, but to continue disallowing the case where `dim` contains `0` and not `1`.
- `test_sum_int_DimList_ragged_dim_1()`: This diff adds a new unit test which verifies the accuracy and feasibility of reducing along the jagged dimension of a nested tensor.
Notes:
- This diff solely adds functionality for the case in which we reduce only along the ragged dimension. Cases in which we reduce along both the ragged and another dimension, like `dim == (1, 2)`, are not permitted, as this set of diffs focuses primarily on the former.
- The `sum` operator is the only operator which uses the function `_wrap_jagged_dims()`; all other operators use `_wrap_jagged_dim()`. I would like to later look into why this is the case and if we can consolidate this!
- I modified some of the comments in the `sum` function as well as the unit tests for more clarity.
Test Plan:
Verify that existing (`test_sum_int_DimList`) and new (`test_sum_int_DimList_ragged_dim_1`) unit tests pass via the following command:
```
buck2 run mode/{opt,inplace} //caffe2/test:nested -- --regex test_sum_int_DimList
```
Differential Revision: D59571209
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130425
Approved by: https://github.com/davidberard98
Background: this bug was triggering DEBUG=1 asserts in the backward for `unbind()`, which calls `empty_like()`. I found that the NJT implementation of `empty_like()` was redispatching on `values` while blindly passing along all kwargs. This resulted in `empty_like(values, ..., layout=torch.jagged)`, which is incorrect since `values` is strided, tripping the debug assert here:
433b691f98/aten/src/ATen/EmptyTensor.cpp (L305)
This PR explicitly sets `layout=torch.strided` when redispatching `*_like()` factories on `values`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129879
Approved by: https://github.com/soulitzer
This PR does 3 things:
1. Adds a copy-free strided->jagged layout conversion for NT
2. Adds a copy-free jagged->strided layout conversion for NT
3. Modifies and expands the .to() API to support the layout argument for the specific case of NT layout conversion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115749
Approved by: https://github.com/jbschlosser
Idea: close over min / max sequence length in the main NJT view func (`_nested_view_from_jagged`) so that view replay during fake-ification propagates these correctly in torch.compile.
For dynamic shapes support for min / max sequence length, this PR uses a hack that stores the values in `(val, 0)` shaped tensors.
**NB: This PR changes SDPA to operate on real views instead of using `buffer_from_jagged()` / `ViewNestedFromBuffer`, which may impact the internal FIRST model. That is, it undoes the partial revert from #123215 alongside a fix to the problem that required the partial revert. We need to verify that there are no regressions there before landing.**
Differential Revision: [D55448636](https://our.internmc.facebook.com/intern/diff/D55448636)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122836
Approved by: https://github.com/soulitzer
Idea: close over min / max sequence length in the main NJT view func (`_nested_view_from_jagged`) so that view replay during fake-ification propagates these correctly in torch.compile.
For dynamic shapes support for min / max sequence length, this PR uses a hack that stores the values in `(val, 0)` shaped tensors.
**NB: This PR changes SDPA to operate on real views instead of using `buffer_from_jagged()` / `ViewNestedFromBuffer`, which may impact the internal FIRST model. That is, it undoes the partial revert from #123215 alongside a fix to the problem that required the partial revert. We need to verify that there are no regressions there before landing.**
Differential Revision: [D55448636](https://our.internmc.facebook.com/intern/diff/D55448636)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122836
Approved by: https://github.com/soulitzer
ghstack dependencies: #127007, #128057
Before: `softmax` definition uses `jagged_unary_pointwise()` (wrong)
After: `softmax` impl adjusts the `dim` arg to account for the difference in dimensionality between the outer NT and the NT's `_values`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119459
Approved by: https://github.com/soulitzer
Summary:
Add the following error checks for the `unbind` operator on `NestedTensor`s when `ragged_idx != 1`:
- The current implementation allows the creation of `NestedTensor` instances from the class definition with an `offsets` tensor that applies to a dimension other than the jagged dimension. This diff ensures that `unbind` fails when the `offsets` exceed the length of the jagged dimension.
Test Plan:
Added the following unit tests:
`test_unbind_with_lengths_ragged_idx_equals_2_bad_dim_cpu` verifies that `unbind` fails when there is a mismatch between the offsets and the jagged dimension, for `NestedTensor`s with `lengths`.
```
test_unbind_with_lengths_ragged_idx_equals_2_bad_dim_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
```
Reviewed By: davidberard98
Differential Revision: D57989082
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128058
Approved by: https://github.com/davidberard98