# Summary
In preparation for the PyTorch 2.0 launch, this PR updates SDPA's API and makes the function a public `nn.functional` function.
## Changes
### API
Previously the function signature was:
`scaled_dot_product_attention(query, key, value, attn_mask=None, need_attn_weights=False, dropout_p=0.0, is_causal=False) -> (Tensor, Tensor)`
Updated signature:
`scaled_dot_product_attention(query, key, value, attn_mask=None, dropout_p=0.0, is_causal=False) -> Tensor`
This PR removes the `need_attn_weights` optional boolean argument and updates the return type to a single tensor.
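The semantics of the updated single-tensor return can be sketched in plain Python (lists of vectors standing in for tensors; the helper names here are illustrative, not the actual fused kernels):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def scaled_dot_product_attention(query, key, value):
    """Reference semantics of the updated API: softmax(QK^T / sqrt(d)) V,
    returning only the attention output, never an (output, weights) tuple."""
    d = len(query[0])
    out = []
    for q in query:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in key]
        weights = softmax(scores)  # materialized here; fused kernels avoid this
        out.append([sum(w * v[j] for w, v in zip(weights, value))
                    for j in range(len(value[0]))])
    return out  # single result, matching the new signature
```

With a single key, the weights collapse to `[1.0]`, so the output equals the value row, which makes the reduced return type easy to check by hand.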
#### Reasoning:
The main goal of this function is to provide an easy interface for users to call into fused attention kernels, e.g. FlashAttention. The fused kernels do not currently support arbitrary `attn_mask` or dropout, but there is a PR to mem-efficient attention to enable these. We want to have the API surface ready for when the backing kernels get updated.
The fused kernels save on memory usage by not materializing the attention weights, and it is unlikely that a fast fused implementation will ever support returning them, so we are removing that option.
Discussed with folks at FAIR/xFormers, who +1'd this API change.
#### Make function Public
In preparation for the PyTorch 2.0 launch, we make the function public to start generating user feedback.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92189
Approved by: https://github.com/cpuhrsch
Summary: This diff modifies the implementation of the select operator so slices of the irregular dimension can be selected (e.g. nt[:,0,:]).
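A plain-Python sketch of what selecting along the irregular dimension means, with a nested tensor modeled as a list of variable-length 2-D components (`nested_select` is an illustrative name, not the actual operator):

```python
def nested_select(components, index):
    """Select `index` along the irregular dimension of a nested tensor
    represented as a list of 2-D components (lists of rows).
    Mirrors nt[:, index, :]: one row is taken from every component."""
    out = []
    for i, comp in enumerate(components):
        if not -len(comp) <= index < len(comp):
            # Components have different lengths, so the index must be
            # validated per component, not against a single shape.
            raise IndexError(f"index {index} out of range for component {i}")
        out.append(comp[index])
    return out
```

Because each component has its own length along the ragged dimension, bounds checking happens per component rather than against one global size.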
Test Plan:
Added new unit tests to test that the new functions work as intended (see them in diff). To test,
`buck test mode/dev-nosan //caffe2/test:nested`
Differential Revision: D41083993
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88585
Approved by: https://github.com/cpuhrsch
Summary: This diff merges both previous implementations of constructors for nested tensors (the one from lists of tensors and the one from arbitrary Python lists) and implements the merged constructor in PyTorch core, so no extensions are needed to construct NTs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88213
Approved by: https://github.com/cpuhrsch
Fixes #87713
bmm for CPU supports non-contiguous nested tensor inputs, while bmm for CUDA currently does not.
The derivative for BMM:
```
- name: bmm(Tensor self, Tensor mat2) -> Tensor
self: grad.bmm(mat2.transpose(1, 2).conj())
mat2: self.transpose(1, 2).conj().bmm(grad)
result: self_t.bmm(mat2_p) + self_p.bmm(mat2_t)
```
When calling backward, it was impossible for this function to succeed, since the transposed inputs were always non-contiguous regardless of the user input. This PR adds contiguous() calls to the bmm_cuda implementation for nested tensors.
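The derivative formula above can be sketched in plain Python to show where the non-contiguous operands come from (real inputs, so the `conj()` calls are omitted; all names here are illustrative):

```python
def matmul(a, b):
    """Plain-Python 2-D matrix multiply on lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    """Transpose of a 2-D list. On real tensors this is a view, and it is
    exactly the operand that arrives non-contiguous at the CUDA kernel."""
    return [list(row) for row in zip(*a)]

def bmm(batch_a, batch_b):
    """Batched matmul over matching lists of matrices."""
    return [matmul(a, b) for a, b in zip(batch_a, batch_b)]

def bmm_backward(grad, batch_self, batch_mat2):
    """Backward of bmm per the derivative formula:
    grad_self = grad @ mat2^T, grad_mat2 = self^T @ grad."""
    grad_self = bmm(grad, [transpose(m) for m in batch_mat2])
    grad_mat2 = bmm([transpose(s) for s in batch_self], grad)
    return grad_self, grad_mat2
```

Since both backward inputs are transposes, every backward call feeds the kernel non-contiguous data, which is why materializing contiguous copies inside the CUDA path fixes the failure.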
This was not caught by tests because grad_check is currently only run on CPU in test_nestedtensor. This PR updates the autograd tests to also run on GPU.
As a result, I found one more issue: the backward for to_padded_tensor errors instead of calling the generic version.
cc @cpuhrsch @jbschlosser @bhosmer @mikaylagawarecki
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88108
Approved by: https://github.com/cpuhrsch
Summary: This diff implements copy_ in order to allow pinned memory transfers for nested tensors, as well as fill_ and ones_like, to test whether nested tensors can be created with other factory functions.
Test Plan: Pass all CI and sandcastle jobs.
Reviewed By: mikekgfb
Differential Revision: D40689594
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87728
Approved by: https://github.com/cpuhrsch
Summary: This commit adds support for moving NestedTensors from CPU to GPU and back. The implementation requires implementing empty_like(), which is based on PR #83140.
Test Plan: Added a new unit test based on the unit test for the main .to() implementation. All unit tests must pass, as well as every sandcastle job.
Differential Revision: D40437585
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87146
Approved by: https://github.com/drisspg
Summary: In order to make the layer normalization implementation for nested tensors public, it needs to be generalized to accept a normalized_shape argument instead of assuming it to be the last dimension of the nested_tensor. This commit does that, as well as adding extra unit tests to ensure the implementation is correct.
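A plain-Python sketch of the generalization, for the 1-D `normalized_shape` case: the trailing dimension is validated against the requested shape instead of being silently assumed (illustrative only, not the real kernel):

```python
import math

def layer_norm(rows, normalized_shape, eps=1e-5):
    """Normalize each row over its trailing dimension, validating that
    dimension against the caller-supplied normalized_shape rather than
    assuming the last dimension of the nested tensor."""
    (dim,) = normalized_shape  # this sketch handles 1-D normalized_shape only
    out = []
    for row in rows:
        if len(row) != dim:
            raise ValueError(f"normalized_shape {normalized_shape} does not "
                             f"match trailing dimension {len(row)}")
        mean = sum(row) / dim
        var = sum((x - mean) ** 2 for x in row) / dim
        out.append([(x - mean) / math.sqrt(var + eps) for x in row])
    return out
```

For a nested tensor the components may have different numbers of rows, but they share the trailing dimension, so a single `normalized_shape` check covers all of them.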
Test Plan:
All unit tests designed to test different ways of using the function work:
`buck test //caffe2/test:nested -- test_layer_norm`
Differential Revision: D40105207
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86295
Approved by: https://github.com/drisspg
### this effectively means that we only allow reshaping/viewing of nt with ONE ragged dimension
Behavior before this PR:
1. `-1` allowed for implicit batch dimension
2. multiple `-1`s allowed for pre-existing dimensions
3. for new dimensions, `-1` is not allowed
It is worth noting that case 3 is basically unreachable: assuming a nested tensor has at least one ragged dimension, you would expect at least one `-1` in the proposed shape for the pre-existing dimensions.
Behavior after this PR:
1. batch dimension **must be specified**
2. **only one** `-1` allowed for pre-existing dimensions **this effectively means that we only allow reshaping/viewing of nt with ONE ragged dimension**
3. unchanged
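The post-PR rule for a single component can be sketched in plain Python: the batch dimension must already be given explicitly, and at most one `-1` (the single ragged dimension) is inferred from the element count (`resolve_shape` is an illustrative helper, not the actual implementation):

```python
def resolve_shape(numel, proposed):
    """Resolve a proposed per-component shape in which at most one entry
    is -1, mirroring the post-PR rule: only one -1, i.e. one ragged
    dimension, may be inferred."""
    if proposed.count(-1) > 1:
        raise ValueError("only one -1 (one ragged dimension) is allowed")
    known = 1
    for d in proposed:
        if d != -1:
            known *= d
    if -1 in proposed:
        if numel % known:
            raise ValueError(f"cannot infer -1: {numel} elements is not "
                             f"divisible by {known}")
        return [numel // known if d == -1 else d for d in proposed]
    if known != numel:
        raise ValueError("shape does not match number of elements")
    return list(proposed)
```

Allowing a second `-1` would make the factorization ambiguous per component, which is exactly why multiple ragged dimensions are rejected.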
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85691
Approved by: https://github.com/cpuhrsch
Previously, indexing a nested tensor when it requires_grad would raise an error because the backward formula for `select.int` uses `self.sizes()`. This PR fixes that by temporarily registering a `_nested_select_backward` function, which can be removed once we start using the SymInt approach to register kernels. For now, this functionality is needed for a POC that nested tensor can be an API for `segment_coo` and `segment_csr` in the torch_scatter repo.
```
a = torch.arange(10).reshape(2, 5).float()
b = torch.arange(12).reshape(2, 6).float()
nt = torch.nested_tensor([a, b], dtype=torch.float).requires_grad_(True)
nt[0]
# RuntimeError: Internal error: NestedTensorImpl doesn't support sizes. Please file an issue on https://github.com/pytorch/nestedtensor
```
whereas
```
nt = torch.nested_tensor([a, b], dtype=torch.float).requires_grad_(False)
nt[0]
```
would succeed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83875
Approved by: https://github.com/albanD, https://github.com/drisspg
# Summary
Trying to do some cleanup of the code structure for nested tensors. This introduces a utility header and .cpp file that implement helper functions.
This is the initial PR in a larger cleanup. The next step would be separating out all the native functions that create nested tensors into their own file, since they do not in fact do math on nested tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84385
Approved by: https://github.com/mikaylagawarecki
## Summary
Add detach op for nested tensors. Nested tensors are not part of the composite explicit dispatch key set and therefore need to be added manually.
The detach test is failing only for dtype=torch.float32/torch.float16 and device=cuda. The chain of ops called is sum.backward() -> from_padded() -> unbind(), which populates the grad for a and b.
Does this potentially indicate that the CUDA implementation for one of these ops, likely from_padded(), is incorrect?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84078
Approved by: https://github.com/albanD
When the initial version came out, `NestedTensor` was not included in the `CompositeImplicitAutograd` key set, so we had to register dropout_nested to dropout and make it forward-only. Now is the time to improve it!
This PR removes dropout_nested; instead, native_dropout_nested is implemented along with native_dropout_backward_nested.
Side change: remove dropout__nested, since @cpuhrsch suggested leaving out nested in-place ops for now.
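The forward/backward pairing that makes dropout differentiable can be sketched in plain Python on flat lists (names and the seeded RNG are illustrative, not the real op):

```python
import random

def native_dropout(xs, p, seed=0):
    """Sketch of dropout forward: zero each element with probability p and
    scale survivors by 1/(1-p). Returns (output, mask) so the backward
    pass can reuse the same mask, as a native_dropout-style op does."""
    rng = random.Random(seed)  # seeded here only to keep the sketch deterministic
    scale = 1.0 / (1.0 - p)
    mask = [rng.random() >= p for _ in xs]
    out = [x * scale if keep else 0.0 for x, keep in zip(xs, mask)]
    return out, mask

def native_dropout_backward(grad, mask, scale):
    """Backward: route gradients only through kept elements, same scale."""
    return [g * scale if keep else 0.0 for g, keep in zip(grad, mask)]
```

Carrying the mask from forward to backward is what a forward-only registration cannot do, which is why the paired native ops replace the old dropout_nested.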
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83338
Approved by: https://github.com/jbschlosser
Per offline discussion, this will be updated to use expand once expand semantics for nested tensor have been fleshed out.
Next steps will be to add support for other features for forward sum mentioned on #82387 and likewise update the backward
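The expand-style backward being discussed can be sketched in plain Python, with the nested tensor as a list of 2-D components: summing over the ragged dimension forward, and broadcasting the gradient row back to each component's ragged length on the way back (illustrative helpers only):

```python
def nested_sum_dim1(components):
    """Forward: sum each 2-D component over its ragged dimension,
    producing one row per component."""
    return [[sum(col) for col in zip(*comp)] for comp in components]

def nested_sum_dim1_backward(grad_rows, components):
    """Backward: broadcast each gradient row back to the component's
    ragged length (the expand-like step referred to above)."""
    return [[list(g) for _ in comp] for g, comp in zip(grad_rows, components)]
```

Each component needs its own broadcast length, which is why proper expand semantics for nested tensors are a prerequisite for replacing this hand-rolled step.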
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82625
Approved by: https://github.com/albanD