244 Commits

Author SHA1 Message Date
Driss Guessous
07bc6b9587 [SDPA] Update dispatch logic to check for sm86 and head_size == 128 for flash attention (#94921)
Fixes #94883

Backward for flash_attention on sm86 hardware with head_size == 128 is not supported, so the dispatch logic now checks for this combination.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94921
Approved by: https://github.com/cpuhrsch, https://github.com/albanD
2023-02-16 03:11:16 +00:00
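A minimal sketch of the kind of check #94921 describes, assuming the restriction only matters when gradients are required. The helper name `flash_backward_supported` and its arguments are hypothetical and are not the names used in sdp_utils; on a real system the capability tuple could come from `torch.cuda.get_device_capability()`.

```python
def flash_backward_supported(capability, head_dim, requires_grad):
    """Hypothetical predicate: reject only the combination the commit excludes.

    capability: (major, minor) compute capability, e.g. (8, 6) for sm86.
    """
    is_sm86 = capability == (8, 6)
    # Assumption: the forward pass is fine; only backward on sm86 with
    # head_dim == 128 is unsupported.
    if requires_grad and is_sm86 and head_dim == 128:
        return False
    return True


print(flash_backward_supported((8, 6), 128, requires_grad=True))
print(flash_backward_supported((8, 0), 128, requires_grad=True))
```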
Driss Guessous
81bbee7d7e [SDPA] Adds basic correctness checks (#94274)
# Summary
Add more checks around shape constraints, and update sdp_utils to properly catch differing head_dims between q/k and v for flash_attention, which is not supported.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94274
Approved by: https://github.com/cpuhrsch
2023-02-09 08:05:26 +00:00
Driss Guessous
e0950fccfa [SDPA] Add expanded autograd testing for fused kernels and disable head_dim128 sm86 mem-efficient (#94009)
# Summary
- Adds a large parameter sweep for testing the various configs a user can call sdpa with and compares the deviation of the fused kernels vs the eager math fallback to test for correctness.
- Sm86 + head_dim==128 is throwing an IMA (illegal memory access) for memory efficient attention. We add a filter for use_mem_efficient_attention(). This has since been fixed in the upstream Xformers version but will likely not make it for branch cut.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94009
Approved by: https://github.com/cpuhrsch
2023-02-07 18:04:48 +00:00
Driss Guessous
653dc73df0 [SDPA] Wire up FlashAttention's backward (#92917)
# Summary
This PR creates _flash_attention_backward and _scaled_dot_product_flash_attention_backward native functions and registers them to the respective derivatives.yaml.

The goal is to replicate the torch.autograd.Function defined in the FlashAttention repo [here](33e0860c9c/flash_attn/flash_attn_interface.py (L126)) natively in PyTorch. One thing that we don't have access to in native PyTorch is ctx.save_for_backward, so in order to save these variables I extended the returned objects from the forward functions.

### MetaFunctions
I also updated the FlashAttention meta functions to mirror the real outputs now. I also added a meta registration for backwards. I have an XLMR training script, and while eager training now works with FlashAttention, compiling this module fails with the inductor error down below.

### Questions?
Performance issues vs mem efficient when using torch.nn.mha_forward

TorchCompile -> See proposed solution below.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92917
Approved by: https://github.com/cpuhrsch
2023-02-02 04:02:30 +00:00
Driss Guessous
a3715efd8b Remove windows check for cmake to build Fused kernels (#91909)
# Summary
Add support for fused attention kernels (FlashAttention and memory-efficient attention) on Windows. Previously we could not do this because the fixes required C++17, but we have since updated the PyTorch standard.

This PR:
- Changes invocations of unsigned long to the fixed width integer type
- Adds in the #define FP16_SWITCH(COND, ...) which has been added to the flash_attention main branch
- Changes some macros used within the mem-efficient attention code in order to work around the VA_ARG discrepancy between clang/gcc and msvc. An alternative would be setting the global flag /Zc:preprocessor
- Selectively applies /Zc:lambda to only the mem-efficient sources since applying this globally caused quantization files to not compile

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91909
Approved by: https://github.com/cpuhrsch
2023-01-25 01:21:12 +00:00
Michael Gschwind
7265f60ad0 Regularize mask handling for attn_mask and key_padding_mask (#92733)
Summary:
Regularize mask handling for attn_mask and key_padding_mask
* Update documentation to remove reference to byte masks (which were deprecated long ago)
* Introduce check and warn about deprecation if attn_mask and key_padding_mask types mismatch
* Convert all masks to float before combining
* Combine by adding

Test Plan: sandcastle & github CI

Differential Revision: D42653215

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92733
Approved by: https://github.com/ngimel, https://github.com/drisspg
2023-01-24 14:12:05 +00:00
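The "convert all masks to float, then combine by adding" step from #92733 can be sketched in plain Python. This is an illustration only: the helper names are hypothetical, masks are nested lists rather than tensors, and it assumes the MHA convention in which a boolean `True` marks a position to mask out.

```python
NEG_INF = float("-inf")


def to_float_mask(mask):
    """Map booleans to additive floats (True -> -inf, False -> 0.0).

    Values that are already floats are passed through unchanged.
    """
    return [
        [(NEG_INF if v else 0.0) if isinstance(v, bool) else float(v) for v in row]
        for row in mask
    ]


def combine_masks(attn_mask, key_padding_row):
    """Add an (L, S) attention mask and one batch element's (S,) padding mask."""
    attn = to_float_mask(attn_mask)
    pad = to_float_mask([key_padding_row])[0]
    return [[a + pad[j] for j, a in enumerate(row)] for row in attn]


combined = combine_masks([[False, True], [False, False]], [True, False])
print(combined)
```

Because masked positions are `-inf` and unmasked positions are `0.0`, addition behaves like a logical OR of the two masks once the result is added to the attention scores.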
Driss Guessous
df14650f0b [SDPA] Update SDPA API and make function Public (#92189)
# Summary
In preparation for the pt 2.0 launch, this PR updates SDPA's API and makes the function a public nn.functional function.

## Changes
### API
Previously the function signature was:
`scaled_dot_product_attention(query, key, value, attn_mask=None, need_attn_weights=False, dropout_p=0.0, is_causal=False) -> (Tensor, Tensor)`
Updated signature:
`scaled_dot_product_attention(query, key, value, attn_mask=None, dropout_p=0.0, is_causal=False) -> Tensor`

This PR removes the need_attn_weights optional boolean variable and updates the return type to a single tensor.

#### Reasoning:
The main goal of this function is to provide an easy interface for users to call into fused attention kernels (e.g. FlashAttention). The fused kernels do not currently support arbitrary attn_mask or dropout, but there is a PR to mem-efficient attention to enable these. We want to have the API surface ready for when the backing kernels get updated.

The fused kernels save on memory usage by not materializing the weights, and it is unlikely that a fast fused implementation will enable this feature, so we are removing it.

Discussed with folks at FAIR/Xformers, who +1 this API change.

#### Make function Public
In preparation for the pt 2.0 launch we make the function public to start to generate user feedback

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92189
Approved by: https://github.com/cpuhrsch
2023-01-23 20:50:46 +00:00
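To make the updated single-return signature from #92189 concrete, here is an unfused reference computation in plain Python. It is a sketch under stated assumptions, not the PyTorch implementation: `sdpa_reference` is a hypothetical name, inputs are nested lists for a single head, `dropout_p` is omitted to keep the result deterministic, a boolean `attn_mask` uses `True` to mean "takes part in attention", and every query row is assumed to keep at least one position.

```python
import math


def sdpa_reference(query, key, value, attn_mask=None, is_causal=False):
    """Compute softmax(Q K^T / sqrt(d)) V and return only the output.

    query: (L, d), key: (S, d), value: (S, dv) as nested lists.
    attn_mask: optional (L, S) booleans, True = position is attended to.
    """
    scale = 1.0 / math.sqrt(len(query[0]))
    output = []
    for i, q in enumerate(query):
        scores = []
        for j, k in enumerate(key):
            keep = not (is_causal and j > i)
            if attn_mask is not None and not attn_mask[i][j]:
                keep = False
            score = sum(a * b for a, b in zip(q, k)) * scale
            scores.append(score if keep else float("-inf"))
        # Numerically stable softmax over the key dimension.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # The weights are used here but never returned, mirroring the
        # removal of need_attn_weights.
        output.append(
            [sum(w * v[c] for w, v in zip(weights, value)) for c in range(len(value[0]))]
        )
    return output


q = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
print(sdpa_reference(q, q, v, is_causal=True)[0])  # first row sees only key 0: [1.0, 2.0]
```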
John Crousse
0b90ddacd9 Unit test for is_causal Better Transformers (#91900) (#92102)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/91900

Test Plan:
buck test  :test_transformers -- -r test_train_with_is_causal
buck test mode/opt :test_transformers -- -r test_is_causal_gpu
flake8 test_transformers.py

Differential Revision: D42453642

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92102
Approved by: https://github.com/drisspg
2023-01-16 17:25:06 +00:00
Driss Guessous
92855a215b [SDPA] Guard mem efficient attention in deterministic mode (#91979)
# Summary
Memory efficient attention is a non-deterministic algorithm.

This PR ensures that sdp_choice will allow mem-efficient attention to be used as the backend to SDPA if we are in warn-only mode. Otherwise, if we have enabled determinism and set warn_only to False, sdp_choice will not return memory efficient attention as the backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91979
Approved by: https://github.com/cpuhrsch
2023-01-11 07:40:31 +00:00
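The rule in #91979 reduces to a small truth table, sketched below. The function name is hypothetical and the real check lives in the native dispatch code; in Python the two inputs could be read from `torch.are_deterministic_algorithms_enabled()` and `torch.is_deterministic_algorithms_warn_only_enabled()`.

```python
def mem_efficient_allowed(deterministic_enabled, warn_only):
    """Hypothetical predicate for whether mem-efficient attention may be chosen."""
    if not deterministic_enabled:
        # Determinism was not requested, so nothing is restricted.
        return True
    # Determinism requested: allowed only in warn-only mode.
    return warn_only


for det in (False, True):
    for warn in (False, True):
        print(det, warn, mem_efficient_allowed(det, warn))
```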
Michael Gschwind
26beb46da4 Reduce #iters to make test run always (#91837)
Summary: Reduce #iters to make test run always

Test Plan: sandcastle

Reviewed By: drisspg

Differential Revision: D42397999

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91837
Approved by: https://github.com/drisspg
2023-01-09 21:38:18 +00:00
Driss Guessous
f219970990 Return empty attention weights when need_atten_weights = False (#91782)
# Summary
This PR updates the second return value from SDPA to be an empty tensor of size 0 when need_attn_weights is False, rather than what it would be if need_attn_weights were True. Also updates the meta function to account for this change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91782
Approved by: https://github.com/cpuhrsch
2023-01-06 19:06:48 +00:00
Michael Gschwind
af589b3d1f switch causal mask for is_causal flag (#91171)
Summary: switch causal mask for is_causal flag

Test Plan: sandcastle & github

Differential Revision: D42089340

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91171
Approved by: https://github.com/wushirong, https://github.com/drisspg
2022-12-30 17:24:58 +00:00
Michael Gschwind
d1772aff60 Autocast support for scaled_dot_product_attention (#91066)
Summary: Autocast support for scaled_dot_product_attention

Test Plan: sandcastle and github CI/CD

Differential Revision: D42085525

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91066
Approved by: https://github.com/ngimel, https://github.com/drisspg
2022-12-19 23:42:26 +00:00
Driss Guessous
912748e3b7 [SDP] Fix alignment check for efficient_attention (#90413)
Fixes a bug found using head_dim_size==100 on an a100 gpu. This PR contains stricter guards on the input shape. These constraints are taken from xformers: https://github.com/facebookresearch/xformers/blob/gh/danthe3rd/60/orig/xformers/ops/fmha/cutlass.py#L23
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90413
Approved by: https://github.com/mikekgfb
2022-12-09 21:09:25 +00:00
Driss Guessous
1d9e1fca97 Update sdp dispatch logic to enable fused backward (#89154)
# Summary
Reorganizes how the sdp dispatch logic is done in order to enable backwards for fused kernels

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89154
Approved by: https://github.com/cpuhrsch
2022-11-21 20:02:09 +00:00
PyTorch MergeBot
e1d58b1928 Revert "Update sdp dispatch logic to enable fused backward (#89154)"
This reverts commit 2e72ec7982.

Reverted https://github.com/pytorch/pytorch/pull/89154 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but the new test_sdp_math_gradcheck test breaks periodic slow gradcheck, i.e. 419ef2cdcf
2022-11-20 22:14:38 +00:00
Driss Guessous
2e72ec7982 Update sdp dispatch logic to enable fused backward (#89154)
# Summary
Reorganizes how the sdp dispatch logic is done in order to enable backwards for fused kernels

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89154
Approved by: https://github.com/cpuhrsch
2022-11-19 02:06:27 +00:00
Driss Guessous
b291c1213a Create native function for determining which implementation of SDP to call (#89029)
# Summary
Creates a callable native function that can determine which implementation of scaled dot product will get called. This allows us to re-order the runtime dispatch of SDP to enable autograd.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89029
Approved by: https://github.com/cpuhrsch
2022-11-16 03:07:54 +00:00
Driss Guessous
ff6d2a6d1b Add mem efficient backward (#88856)
# Registers the derivative for mem efficient backward

- Use gradcheck to test correctness. The kernel is not implemented for fp64 so run checks with bumped tolerances in fp32
- I also made updates based off of Xformer main branch and flash-attention cutlass branch.
- This will enable the fused backward to be called for scaled dot product attention

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88856
Approved by: https://github.com/cpuhrsch
2022-11-15 20:22:57 +00:00
PyTorch MergeBot
50c18217a3 Revert "Add mem efficient backward (#88856)"
This reverts commit 35e668b5ce.

Reverted https://github.com/pytorch/pytorch/pull/88856 on behalf of https://github.com/DanilBaibak due to breaking internal builds
2022-11-15 09:37:09 +00:00
Michael Gschwind
1f88b208ac Fix cuda/cpu check on NoneType (Unit test) (#88970)
Summary: Fix cuda/cpu check on NoneType (unit test)

Test Plan: sandcastle / github CI/CD

Differential Revision: D41208798

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88970
Approved by: https://github.com/Skylion007, https://github.com/cpuhrsch
2022-11-15 01:25:19 +00:00
Driss Guessous
35e668b5ce Add mem efficient backward (#88856)
# Registers the derivative for mem efficient backward

- Use gradcheck to test correctness. The kernel is not implemented for fp64 so run checks with bumped tolerances in fp32
- I also made updates based off of Xformer main branch and flash-attention cutlass branch.
- This will enable the fused backward to be called for scaled dot product attention

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88856
Approved by: https://github.com/cpuhrsch
2022-11-15 01:10:35 +00:00
Grigory Sizov
7ad87f63e2 Support src_mask and src_key_padding_mask for Better Transformer (#88488)
Fixes T135842750 (follow-up for #87377)

## Description

At present, having both `src_key_padding_mask` and `src_mask` at the same time is not supported on the fastpath in Transformer and Multi-Head Attention.

This PR enables using both masks on the fastpath on CPU and GPU: if both masks are passed, we merge them into a 4D mask in Python and change mask type to 2 before passing downstream.

Downstream processing in native code is not changed, as it already supports 4D mask. Indeed, it is done depending on the device:
- on CUDA, by `SoftMax.cu::masked_softmax_cuda`. When mask type is 2, it calls either `dispatch_softmax_forward` -> `softmax_warp_forward` or `at::softmax` (depending on the input size). In both cases 4D mask is supported.
- on CPU, by `SoftMax.cpp::masked_softmax_cpu`. It calls `host_softmax` which supports 4D mask.

## Tests
- Extended `test_mask_check_fastpath` to check that fast path is indeed taken in Transformer when two masks are passed
- Added `test_multihead_self_attn_two_masks_fast_path_mock` to check that fast path is taken in MHA when two masks are passed
- Added `test_multihead_self_attn_two_masks_fast_path` to check that fast and slow paths give the same result when two masks are passed in MHA
- `test_masked_softmax_mask_types` now covers mask type 2
- `test_transformerencoderlayer_fast_path` (CPU smoke test) is expanded to the case of both masks provided simultaneously
- `test_masked_softmax_devices_parity` checks that mask type 2 is accepted by CPU and CUDA paths

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88488
Approved by: https://github.com/mikekgfb
2022-11-10 08:12:56 +00:00
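The merge step described in #88488 can be sketched with nested lists. This is an assumption-laden illustration: `merge_masks` is a hypothetical name, both masks are taken to be boolean with `True` meaning "masked out", and combining them with a logical OR is an assumption, since the commit message only says the masks are merged into a 4D mask.

```python
def merge_masks(src_mask, src_key_padding_mask, num_heads):
    """Broadcast an (L, L) mask and a (B, L) padding mask to (B, num_heads, L, L)."""
    seq_len = len(src_mask)
    return [
        [
            [
                # A position is masked if either mask excludes it.
                [src_mask[i][j] or pad_row[j] for j in range(seq_len)]
                for i in range(seq_len)
            ]
            for _ in range(num_heads)
        ]
        for pad_row in src_key_padding_mask
    ]


src_mask = [[False, True], [False, False]]
padding = [[False, False], [False, True]]  # batch of 2
merged = merge_masks(src_mask, padding, num_heads=2)
print(len(merged), len(merged[0]), len(merged[0][0]), len(merged[0][0][0]))  # 2 2 2 2
```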
kshitij12345
fe3a226d74 [minor] use set_default_dtype instead of try and finally (#88295)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88295
Approved by: https://github.com/mruberry
2022-11-03 19:28:33 +00:00
Grigory Sizov
4c78c7c82a Enable src_mask in fast path of TransformerEncoderLayer (#87377)
## Issues
Fixes https://github.com/pytorch/pytorch/issues/81129#issuecomment-1179435674

## Description

Passing a 2D attention mask `src_mask` into the fast path of `TransformerEncoderLayer` on CPU was causing an error and so was disabled in https://github.com/pytorch/pytorch/pull/81277. This PR rolls back that fix, enabling `src_mask` on the fast path:

- Either attention mask `src_mask` of shape `(L, L)` or padding mask `src_key_padding_mask` of shape `(B, L)` are now allowed on the CPU fast path. If softmax is applied along the last dimension (as in multi-head attention), these masks are processed without expanding them to 4D. Instead, when iterating through the input, `Softmax.cpp::host_softmax` converts the index to match the mask dimensions, depending on the type.
- If softmax is applied along the dimension other than the last, `Softmax.cpp::masked_softmax_cpu` expands masks to 4D, converting them to `mask_type=2`. Theoretically one could also add special optimized cases for `dim=0, 1, 2` and process them without mask expansion, but I don't know how often that is used

## Tests:
- `test_transformerencoderlayer_fast_path` is extended to cover both attention mask and padding mask
- `test_masked_softmax_mask_types_0_1` is added to ensure results from CPU softmax with attention and padding masks match the explicit slow calculation
- `test_masked_softmax_devices_parity` is added to ensure results from masked softmax on CPU and CUDA match

## Note
I had to replace `float` with `torch.get_default_dtype()` in a couple of tests for the following reason:
- `test_nn.py` [sets the default type to `torch.double`](https://github.com/pytorch/pytorch/blob/master/test/test_nn.py#L24-L26)
- If I execute `test_nn.py` and `test_transformers.py` in one `pytest` run, this default still holds for transformer tests
- Some tests in `test_transformers.py` which were previously following the slow path now switched to fast path, and hard-coded `float` started clashing with default `double`

Let me know if there is a better way around it - or maybe I'm not supposed to run tests with `pytest` like this

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87377
Approved by: https://github.com/mikekgfb, https://github.com/weiwangmeta, https://github.com/malfet
2022-10-31 19:59:36 +00:00
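The idea in #87377 of converting the index to match the mask dimensions, instead of expanding the mask to 4D, can be sketched as follows. This is not the `host_softmax` code: the function names are hypothetical, the numbering (type 0 for an `(L, L)` attention mask, type 1 for a `(B, L)` padding mask) is an assumption inferred from the test names, `True` is taken to mean "masked out", and no row is assumed to be fully masked.

```python
import math


def mask_at(mask, mask_type, b, i, j):
    """Look up the mask entry for input position (b, head, i, j) without expansion."""
    if mask_type == 0:  # attention mask, shape (L, L): ignore batch and head
        return mask[i][j]
    if mask_type == 1:  # padding mask, shape (B, L): ignore head and query row
        return mask[b][j]
    raise ValueError("unsupported mask_type")


def masked_softmax(x, mask, mask_type):
    """Softmax over the last dimension of a (B, H, L, L) nested list."""
    out = []
    for b, heads in enumerate(x):
        out.append([])
        for rows in heads:
            out[b].append([])
            for i, row in enumerate(rows):
                scores = [
                    float("-inf") if mask_at(mask, mask_type, b, i, j) else v
                    for j, v in enumerate(row)
                ]
                top = max(scores)
                exps = [math.exp(s - top) for s in scores]
                total = sum(exps)
                out[b][-1].append([e / total for e in exps])
    return out


x = [[[[0.0, 0.0], [0.0, 0.0]]]]  # B=1, H=1, L=2
print(masked_softmax(x, [[False, True], [False, False]], mask_type=0))
print(masked_softmax(x, [[False, True]], mask_type=1))
```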
Driss Guessous
e24ce484ed Use scaled_dot_product_attention within attention.cpp (#87312)
# Summary
Use the private _scaled_dot_product_attention to support _native_multiheaded_attention. _SDP provides access to fused kernels when certain conditions are met, enabling a speed up for MHA.

cc @cpuhrsch @jbschlosser @bhosmer @mikaylagawarecki
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87312
Approved by: https://github.com/cpuhrsch
2022-10-31 04:06:31 +00:00
Driss Guessous
35c611d30f Add mem efficient backend flag (#87946)
# Summary
Add in a torch.backends.cuda flag and update the context manager to pick between the three implementations of scaled_dot_product_attention.

cc @cpuhrsch @jbschlosser @bhosmer @mikaylagawarecki
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87946
Approved by: https://github.com/cpuhrsch
2022-10-28 15:51:10 +00:00
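The flag-plus-context-manager pattern mentioned in #87946 can be illustrated in plain Python. Everything named here (`_flags`, `sdp_backends`, `enabled_backends`) is hypothetical and stands in for the actual torch.backends.cuda settings; the sketch only shows how a context manager can temporarily restrict which of the three implementations is eligible and then restore the previous state.

```python
from contextlib import contextmanager

# Hypothetical global flags, one per SDPA implementation.
_flags = {"flash": True, "mem_efficient": True, "math": True}


@contextmanager
def sdp_backends(enable_flash=True, enable_mem_efficient=True, enable_math=True):
    """Temporarily override the backend flags, restoring them on exit."""
    previous = dict(_flags)
    _flags.update(flash=enable_flash, mem_efficient=enable_mem_efficient, math=enable_math)
    try:
        yield
    finally:
        _flags.update(previous)


def enabled_backends():
    return [name for name, on in _flags.items() if on]


with sdp_backends(enable_flash=False, enable_math=False):
    inside = enabled_backends()
print(inside)              # ['mem_efficient']
print(enabled_backends())  # ['flash', 'mem_efficient', 'math']
```

Restoring the flags in a `finally` block keeps the global state consistent even if the code inside the `with` body raises.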
Rui Zhu
4b757f4633 Assert if padding mask type is unexpected (#86353) (#87106)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86353

Fix the issue described in
https://github.com/pytorch/pytorch/issues/86120

Test Plan: buck test mode/opt caffe2/test:test_transformers -- test_train_with_long_type_pad

Differential Revision: D40129968

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87106
Approved by: https://github.com/malfet
2022-10-20 16:01:54 +00:00
Driss Guessous
5fb687182d Enable sdp_forward for NestedTensors (#86720)
# Summary
This PR implements a sdp_forward for NestedTensors. This impl will call into flash and mem_efficient_attention when possible.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86720
Approved by: https://github.com/cpuhrsch
2022-10-18 02:00:04 +00:00
Driss Guessous
c5a4844085 Xformer SDP forward/backward kernel (#86157)
# Summary
Include xformer kernel code and make header updates to successfully build. Need to update the kernel calling code and dispatch system to clean this up.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86157
Approved by: https://github.com/cpuhrsch
2022-10-07 03:52:46 +00:00
Driss Guessous
cd6477617c Custom sdp implementations dense (#85984)
# Summary

- This code creates the runtime dispatch system for choosing a performant fused SDP kernel. The only choice of fused kernel is flash_attention. It also creates python flags and a context manager that can be used to turn off and on behavior for dispatch.
- This also adds support for flash_attention with dense tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85984
Approved by: https://github.com/cpuhrsch
2022-10-03 17:36:37 +00:00
Mikayla Gawarecki
afaee00fec Add python nested_tensor and as_nested_tensor constructors in torch.nested (#85593)
Remove `torch.nested_tensor` which has erroneous behavior wrt gradients (could be either leaf or not leaf). Introduce `torch.nested.nested_tensor` and `torch.nested.as_nested_tensor` in the vein of `torch.tensor` and `torch.as_tensor`. Done in nested `__init__.py` for now but can move to pybind in future (when we want to load from numpy/nested lists ).

Discussed offline with @cpuhrsch and pybind constructor (https://github.com/pytorch/pytorch/pull/85536) was more gnarly than expected, so we can move to that when we do need loading from numpy etc.

Differential Revision: [D39806622](https://our.internmc.facebook.com/intern/diff/D39806622)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85593
Approved by: https://github.com/drisspg, https://github.com/cpuhrsch
2022-09-28 20:15:02 +00:00
Driss Guessous
253ffbf28b Exposing native _scaled_dot_product_attention to torch.nn (#85044)
# Summary
This exposes the _scaled_dot_product_attention function to python in the nn namespace. It is still underscored because the api for args, and kwargs is still in flux for the next few weeks and will eventually land as a prototype feature.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85044
Approved by: https://github.com/cpuhrsch
2022-09-22 16:30:16 +00:00
PyTorch MergeBot
a3dc338ee1 Revert "Exposing native _scaled_dot_product_attention to torch.nn (#85044)"
This reverts commit 9fdd8a8b7f.

Reverted https://github.com/pytorch/pytorch/pull/85044 on behalf of https://github.com/huydhn due to This breaks CUDA 10.2 in trunk. We are deprecating CUDA 10.2, but it is still here in the mean time
2022-09-21 08:34:51 +00:00
Driss Guessous
9fdd8a8b7f Exposing native _scaled_dot_product_attention to torch.nn (#85044)
# Summary
This exposes the _scaled_dot_product_attention function to python in the nn namespace. It is still underscored because the api for args, and kwargs is still in flux for the next few weeks and will eventually land as a prototype feature.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85044
Approved by: https://github.com/cpuhrsch
2022-09-21 03:09:08 +00:00
Eric Han
7a5d5a0020 Disable Transformer/MHA fast path when autocast is enabled (#84722)
Differential Revision: D39362298

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84722
Approved by: https://github.com/cpuhrsch
2022-09-09 01:15:24 +00:00
Eric Han
b182f08135 Fix issue in softmax.cu with transformer error when mask seqlen > 1024 (#83639)
Fixes #83142

Adds
- test to catch this issue.
- fix to softmax.cu that broadcasts src_key_padding_mask to regular attention_mask shape
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83639
Approved by: https://github.com/ngimel
2022-08-30 18:06:27 +00:00
Rui Zhu
e0f2eba93d Move odd num_head in TransformerEncoder to slow_path (#83483)
Summary: odd nhead is not supported for masked softmax, therefore we just move it to use old slow_path

Test Plan: CI

Differential Revision: D38720086

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83483
Approved by: https://github.com/erichan1
2022-08-20 10:02:08 +00:00
Yoav Navon
dfc97df64d Add fastpath test for mask check flag (#82999)
Summary: Check that the fastpath is taken, and which type (sparsity fastpath or normal), for a mask that is aligned and for one that is not.

Test Plan: buck test caffe2/test:test_transformers

Differential Revision: D38259928

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82999
Approved by: https://github.com/jbschlosser
2022-08-12 00:04:45 +00:00
Joel Benjamin Schlosser
6ca95547ac Initial private SDP interface and naive composite impl (#81956)
Adds an initial private API version of the SDP interface.

Signature:
```
_scaled_dot_product_attention(Tensor query, Tensor key, Tensor value, Tensor? attn_mask=None,
    float dropout_p=0.0, bool need_attn_weights=True, bool is_causal=False) -> (Tensor, Tensor)
```

Returns a tuple of `(output, attn_weights)`.

Note the following:
* `need_attn_weights`: flag indicating that attention weights should be computed. This is useful to toggle off for flash attention as it does not materialize the weights by default, making it more expensive to return them.
* Boolean attention mask support only; `True` values within `attn_mask` indicate that the element should take part in attention (notably, this is reverse of MHA, which uses `True` to mask *out* values). Mask is optional.
* `is_causal`: Temporary flag indicating whether to use a causal attention weighting. If this is set to `True`, it takes precedence over any value passed in for `attn_mask`. Longer term, the `is_causal` flagging can be subsumed into the `attn_mask` arg via tensor subclassing (see e.g. [CausalTensor](https://github.com/facebookresearch/xformers/blob/sparse_cleanup/xformers/sparse/causal_tensor.py) in xFormers).
* Testing is currently done via reference with the existing Python impl of `F._scaled_dot_product_attention`.
* This PR does not yet drop-in the new SDP anywhere. A future PR can hook it up in BT or MHA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81956
Approved by: https://github.com/drisspg, https://github.com/erichan1
2022-08-01 22:26:18 +00:00
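The two mask conventions noted in #81956 are easy to mix up, so here is a small sketch. The helper names are hypothetical and masks are nested lists of booleans; it shows the inversion needed to move from the MHA convention (`True` masks a position out) to the SDP convention (`True` lets a position take part), and the lower-triangular mask that `is_causal=True` corresponds to.

```python
def mha_to_sdp_mask(mha_mask):
    """Invert an MHA-style mask (True = mask out) into SDP style (True = keep)."""
    return [[not m for m in row] for row in mha_mask]


def causal_keep_mask(seq_len):
    """SDP-style causal mask: position i may attend to positions j <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


print(mha_to_sdp_mask([[False, True], [False, False]]))
print(causal_keep_mask(3))
```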
PyTorch MergeBot
26776d628c Revert "Initial private SDP interface and naive composite impl (#81956)"
This reverts commit f15c5bf133.

Reverted https://github.com/pytorch/pytorch/pull/81956 on behalf of https://github.com/janeyx99 due to broke all configs on test_scaled_dot_product_attention (__main__.TestNestedTensorAutograd) f15c5bf133
2022-07-27 18:36:54 +00:00
Joel Benjamin Schlosser
f15c5bf133 Initial private SDP interface and naive composite impl (#81956)
Adds an initial private API version of the SDP interface.

Signature:
```
_scaled_dot_product_attention(Tensor query, Tensor key, Tensor value, Tensor? attn_mask=None,
    float dropout_p=0.0, bool need_attn_weights=True, bool is_causal=False) -> (Tensor, Tensor)
```

Returns a tuple of `(output, attn_weights)`.

Note the following:
* `need_attn_weights`: flag indicating that attention weights should be computed. This is useful to toggle off for flash attention as it does not materialize the weights by default, making it more expensive to return them.
* Boolean attention mask support only; `True` values within `attn_mask` indicate that the element should take part in attention (notably, this is reverse of MHA, which uses `True` to mask *out* values). Mask is optional.
* `is_causal`: Temporary flag indicating whether to use a causal attention weighting. If this is set to `True`, it takes precedence over any value passed in for `attn_mask`. Longer term, the `is_causal` flagging can be subsumed into the `attn_mask` arg via tensor subclassing (see e.g. [CausalTensor](https://github.com/facebookresearch/xformers/blob/sparse_cleanup/xformers/sparse/causal_tensor.py) in xFormers).
* Testing is currently done via reference with the existing Python impl of `F._scaled_dot_product_attention`.
* This PR does not yet drop-in the new SDP anywhere. A future PR can hook it up in BT or MHA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81956
Approved by: https://github.com/drisspg, https://github.com/erichan1
2022-07-27 15:41:45 +00:00
Eric Han
23088fcfdf disable src mask for transformer and multiheadattention fastpath (#81277)
Disable fastpath if src_mask passed to TransformerEncoderLayer and MultiheadAttention.
- Refactored test_transformerencoder from test_nn.py to test_transformers.py. Added a src_mask test there.
- Added a specific src_mask test in test_transformers.py

Fixes https://github.com/pytorch/pytorch/issues/81129

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81277
Approved by: https://github.com/zrphercule
2022-07-15 20:55:17 +00:00
Eric Han
06274d7a48 Add test for torchscripting nn.TransformerEncoder, including fast path (#79796)
Summary:
Add test just to check if TransformerEncoder will crash when enumerating over params [with_no_grad, use_torchscript, training].

Motivation for this was that TransformerEncoder fast path (so with_no_grad=True) and use_torchscript=True would crash with the issue that NestedTensor doesn't have size. This was caused because the TransformerEncoder fast path generates a NestedTensor automatically as a perf optimization and torchscript attempts to find intermediate tensor sizes while it optimizes. But NestedTensor has not implemented a size method, so things fail.

This test goes together with this fix https://github.com/pytorch/pytorch/pull/79480

Test Plan:
```
buck build --show-output mode/opt -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=a100 mode/inplace  //caffe2/test:transformers

./fbcode/buck-out/gen/caffe2/test/transformers#binary.par
```
Test runs and passes together with the changes from the PR above (I made another diff on top of this with those changes). Does not pass without the fix.

Reviewed By: mikekgfb

Differential Revision: D37222923

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79796
Approved by: https://github.com/zrphercule
2022-06-17 22:00:49 +00:00