pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 00:21:07 +01:00

Author	SHA1	Message	Date
Driss Guessous	07bc6b9587	[SDPA] Update dispatch logic to check for sm86 and head_size == 128 for flash attention (#94921 ) Fixes #94883, where backward for flash_attention on sm86 hardware with head_size == 128 is not supported. Pull Request resolved: https://github.com/pytorch/pytorch/pull/94921 Approved by: https://github.com/cpuhrsch, https://github.com/albanD	2023-02-16 03:11:16 +00:00
Driss Guessous	81bbee7d7e	[SDPA] Adds basic correctness checks (#94274 ) # Summary Add more checks around shape constraints as well as update the sdp_utils to properly catch different head_dims between qk and v for flash_attention which is not supported. Pull Request resolved: https://github.com/pytorch/pytorch/pull/94274 Approved by: https://github.com/cpuhrsch	2023-02-09 08:05:26 +00:00
Driss Guessous	e0950fccfa	[SDPA] Add expanded autograd testing for fused kernels and disable head_dim128 sm86 mem-efficient (#94009 ) # Summary - Adds a large parameter sweep for testing the various configs a user can call sdpa with and compares the deviation of the fused kernels vs the eager math fallback to test for correctness. - Sm86 + head_dim==128 is throwing an IMA for memory efficient attention. We add a filter for use_mem_efficient_attention(). This has since been fixed in the upstream Xformers version but will likely not make it for branch cut. Pull Request resolved: https://github.com/pytorch/pytorch/pull/94009 Approved by: https://github.com/cpuhrsch	2023-02-07 18:04:48 +00:00
Driss Guessous	653dc73df0	[SDPA] Wire up FlashAttention's backward (#92917 ) # Summary This PR creates _flash_attention_backward and _scaled_dot_product_flash_attention_backward native functions and registers them to the respective derivatives.yaml. The goal is to replicate the torch.autograd.Function defined in the FlashAttention repo [here](`33e0860c9c/flash_attn/flash_attn_interface.py (L126)`) natively in PyTorch. One thing that we don't have access to is ctx.save_for_backward in native PyTorch so in order to save these variables I extended the returned objects from the forward functions. ### MetaFunctions I also updated the FlashAttention meta functions to mirror the real outputs now. I also added a meta registration for backwards. I have an XLMR training script and while eager training now works with FlashAttention, compiling this module fails with the inductor error down below. ### Questions? Performance issues vs mem efficient when using torch.nn.mha_forward TorchCompile -> See proposed solution below. Pull Request resolved: https://github.com/pytorch/pytorch/pull/92917 Approved by: https://github.com/cpuhrsch	2023-02-02 04:02:30 +00:00
Driss Guessous	a3715efd8b	Remove windows check for cmake to build Fused kernels (#91909 ) # Summary Add support for fused attention kernels (FlashAttention and memory-efficient attention) on Windows. Previously we could not do this because the fixes required C++17, but we have since updated the PyTorch standard. This PR: - Changes invocations of unsigned long to the fixed width integer type - Adds in the #define FP16_SWITCH(COND, ...) which has been added to the flash_attention main branch - Changes some macros used within mem-efficient attention code in order to work around the VA_ARG discrepancy between clang/gcc and msvc. An alternative would be setting the global flag Zc:preprocessor - Selectively applies /Zc:lambda to only the mem-efficient sources since applying this globally caused quantization files to not compile Pull Request resolved: https://github.com/pytorch/pytorch/pull/91909 Approved by: https://github.com/cpuhrsch	2023-01-25 01:21:12 +00:00
Michael Gschwind	7265f60ad0	Regularize mask handling for attn_mask and key_padding_mask (#92733 ) Summary: Regularize mask handling for attn_mask and key_padding_mask * Update documentation to remove reference to byte masks (which were deprecated long ago) * Introduce check and warn about deprecation if attn_mask and key_padding_mask types mismatch * Convert all masks to float before combining * Combine by adding Test Plan: sandcastle & github CI Differential Revision: D42653215 Pull Request resolved: https://github.com/pytorch/pytorch/pull/92733 Approved by: https://github.com/ngimel, https://github.com/drisspg	2023-01-24 14:12:05 +00:00
Driss Guessous	df14650f0b	[SDPA] Update SDPA API and make function Public (#92189 ) # Summary In preparation for pt 2.0 launch this PR updates SDPA's API and makes the function a nn.functional public function. ## Changes ### API Previously the function signature was: `scaled_dot_product_attention(query, key, value, attn_mask=None, need_attn_weights=False, dropout_p=0.0, is_causal=False) -> (Tensor, Tensor)` Updated signature: `scaled_dot_product_attention(query, key, value, attn_mask=None, dropout_p=0.0, is_causal=False) -> Tensor` This PR removes the need_attn_weights optional boolean variable and updates the return type to a single tensor. #### Reasoning: The main goal of this function is to provide an easy interface for users to call into fused attention kernels (e.g. FlashAttention). The fused kernels do not currently support arbitrary attn_mask or dropout but there is a PR to mem-efficient attention to enable these. We want to have the API surface ready for when the backing kernels get updated. The fused kernels save on memory usage by not materializing the weights and it is unlikely that a fast fused implementation will enable this feature so we are removing it. Discussed with folks at FAIR/Xformers and +1 this API change. #### Make function Public In preparation for the pt 2.0 launch we make the function public to start to generate user feedback Pull Request resolved: https://github.com/pytorch/pytorch/pull/92189 Approved by: https://github.com/cpuhrsch	2023-01-23 20:50:46 +00:00
John Crousse	0b90ddacd9	Unit test for is_causal Better Transformers (#91900 ) (#92102 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/91900 Test Plan: buck test :test_transformers -- -r test_train_with_is_causal buck test mode/opt :test_transformers -- -r test_is_causal_gpu flake8 test_transformers.py Differential Revision: D42453642 Pull Request resolved: https://github.com/pytorch/pytorch/pull/92102 Approved by: https://github.com/drisspg	2023-01-16 17:25:06 +00:00
Driss Guessous	92855a215b	[SDPA] Guard mem efficient attention in deterministic mode (#91979 ) # Summary Memory efficient attention is a non-deterministic algorithm. This PR ensures that sdp_choice will allow mem-efficient to be used as the backend to SDPA if we are in warn-only mode. Otherwise, if we have enabled determinism and set warn_only to False, sdp_choice will not return memory efficient attention as the backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/91979 Approved by: https://github.com/cpuhrsch	2023-01-11 07:40:31 +00:00
Michael Gschwind	26beb46da4	Reduce #iters to make test run always (#91837 ) Summary: Reduce #iters to make test run always Test Plan: sandcastle Reviewed By: drisspg Differential Revision: D42397999 Pull Request resolved: https://github.com/pytorch/pytorch/pull/91837 Approved by: https://github.com/drisspg	2023-01-09 21:38:18 +00:00
Driss Guessous	f219970990	Return empty attention weights when need_atten_weights = False (#91782 ) # Summary This PR updates the second return value from SDPA to return an empty tensor of size 0 not what it would be if need_attn_weights is True. Also updates the meta function to account for this change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/91782 Approved by: https://github.com/cpuhrsch	2023-01-06 19:06:48 +00:00
Michael Gschwind	af589b3d1f	switch causal mask for is_causal flag (#91171 ) Summary: switch causal mask for is_causal flag Test Plan: sandcastle & github Differential Revision: D42089340 Pull Request resolved: https://github.com/pytorch/pytorch/pull/91171 Approved by: https://github.com/wushirong, https://github.com/drisspg	2022-12-30 17:24:58 +00:00
Michael Gschwind	d1772aff60	Autocast support for scaled_dot_product_attention (#91066 ) Summary: Autocast support for scaled_dot_product_attention Test Plan: sandcastle and GitHub CI/CD Differential Revision: D42085525 Pull Request resolved: https://github.com/pytorch/pytorch/pull/91066 Approved by: https://github.com/ngimel, https://github.com/drisspg	2022-12-19 23:42:26 +00:00
Driss Guessous	912748e3b7	[SDP] Fix alignment check for efficient_attention (#90413 ) Fixes a bug found using head_dim_size==100 on an a100 gpu. This PR contains stricter guards on the input shape. These constraints are taken from xformers: https://github.com/facebookresearch/xformers/blob/gh/danthe3rd/60/orig/xformers/ops/fmha/cutlass.py#L23 Pull Request resolved: https://github.com/pytorch/pytorch/pull/90413 Approved by: https://github.com/mikekgfb	2022-12-09 21:09:25 +00:00
Driss Guessous	1d9e1fca97	Update sdp dispatch logic to enable fused backward (#89154 ) # Summary Reorganizes how the sdp dispatch logic is done in order to enable backwards for fused kernels Pull Request resolved: https://github.com/pytorch/pytorch/pull/89154 Approved by: https://github.com/cpuhrsch	2022-11-21 20:02:09 +00:00
PyTorch MergeBot	e1d58b1928	Revert "Update sdp dispatch logic to enable fused backward (#89154 )" This reverts commit `2e72ec7982`. Reverted https://github.com/pytorch/pytorch/pull/89154 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but the new test_sdp_math_gradcheck test breaks periodic slow gradcheck, i.e. `419ef2cdcf`	2022-11-20 22:14:38 +00:00
Driss Guessous	2e72ec7982	Update sdp dispatch logic to enable fused backward (#89154 ) # Summary Reorganizes how the sdp dispatch logic is done in order to enable backwards for fused kernels Pull Request resolved: https://github.com/pytorch/pytorch/pull/89154 Approved by: https://github.com/cpuhrsch	2022-11-19 02:06:27 +00:00
Driss Guessous	b291c1213a	Create native function for determining which implementation of SDP to call (#89029 ) # Summary Creates a callable native function that can determine which implementation of scaled dot product will get called. This allows us to re-order the runtime dispatch of SDP to enable autograd. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89029 Approved by: https://github.com/cpuhrsch	2022-11-16 03:07:54 +00:00
Driss Guessous	ff6d2a6d1b	Add mem efficient backward (#88856 ) # Registers the derivative for mem efficient backward - Use gradcheck to test correctness. The kernel is not implemented for fp64 so run checks with bumped tolerances in fp32 - I also made updates based off of Xformer main branch and flash-attention cutlass branch. - This will enable the fused backward to be called for scaled dot product attention Pull Request resolved: https://github.com/pytorch/pytorch/pull/88856 Approved by: https://github.com/cpuhrsch	2022-11-15 20:22:57 +00:00
PyTorch MergeBot	50c18217a3	Revert "Add mem efficient backward (#88856 )" This reverts commit `35e668b5ce`. Reverted https://github.com/pytorch/pytorch/pull/88856 on behalf of https://github.com/DanilBaibak due to breaking internal builds	2022-11-15 09:37:09 +00:00
Michael Gschwind	1f88b208ac	Fix cuda/cpu check on NoneType (Unit test) (#88970 ) Summary: Fix cuda/cpu check on NoneType (unit test) Test Plan: sandcastle / GitHub CI/CD Differential Revision: D41208798 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88970 Approved by: https://github.com/Skylion007, https://github.com/cpuhrsch	2022-11-15 01:25:19 +00:00
Driss Guessous	35e668b5ce	Add mem efficient backward (#88856 ) # Registers the derivative for mem efficient backward - Use gradcheck to test correctness. The kernel is not implemented for fp64 so run checks with bumped tolerances in fp32 - I also made updates based off of Xformer main branch and flash-attention cutlass branch. - This will enable the fused backward to be called for scaled dot product attention Pull Request resolved: https://github.com/pytorch/pytorch/pull/88856 Approved by: https://github.com/cpuhrsch	2022-11-15 01:10:35 +00:00
Grigory Sizov	7ad87f63e2	Support src_mask and src_key_padding_mask for Better Transformer (#88488 ) Fixes T135842750 (follow-up for #87377) ## Description At present, having both `src_key_padding_mask` and `src_mask` at the same time is not supported on the fastpath in Transformer and Multi-Head Attention. This PR enables using both masks on the fastpath on CPU and GPU: if both masks are passed, we merge them into a 4D mask in Python and change mask type to 2 before passing downstream. Downstream processing in native code is not changed, as it already supports 4D mask. Indeed, it is done depending on the device: - on CUDA, by `SoftMax.cu::masked_softmax_cuda`. When mask type is 2, it calls either `dispatch_softmax_forward` -> `softmax_warp_forward` or `at::softmax` (depending on the input size). In both cases 4D mask is supported. - on CPU, by `SoftMax.cpp::masked_softmax_cpu`. It calls `host_softmax` which supports 4D mask. ## Tests - Extended `test_mask_check_fastpath` to check that fast path is indeed taken in Transformer when two masks are passed - Added `test_multihead_self_attn_two_masks_fast_path_mock` to check that fast path is taken in MHA when two masks are passed - Added `test_multihead_self_attn_two_masks_fast_path` to check that fast and slow paths give the same result when two masks are passed in MHA - `test_masked_softmax_mask_types` now covers mask type 2 - `test_transformerencoderlayer_fast_path` (CPU smoke test) is expanded to the case of both masks provided simultaneously - `test_masked_softmax_devices_parity` checks that mask type 2 is accepted by CPU and CUDA paths Pull Request resolved: https://github.com/pytorch/pytorch/pull/88488 Approved by: https://github.com/mikekgfb	2022-11-10 08:12:56 +00:00
kshitij12345	fe3a226d74	[minor] use set_default_dtype instead of try and finally (#88295 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88295 Approved by: https://github.com/mruberry	2022-11-03 19:28:33 +00:00
Grigory Sizov	4c78c7c82a	Enable `src_mask` in fast path of `TransformerEncoderLayer` (#87377 ) ## Issues Fixes https://github.com/pytorch/pytorch/issues/81129#issuecomment-1179435674 ## Description Passing a 2D attention mask `src_mask` into the fast path of `TransformerEncoderLayer` in CPU was causing an error and so was disabled in https://github.com/pytorch/pytorch/pull/81277. This PR unrolls this fix, enabling `src_mask` on the fast path: - Either attention mask `src_mask` of shape `(L, L)` or padding mask `src_key_padding_mask` of shape `(B, L)` are now allowed on the CPU fast path. If softmax is applied along the last dimension (as in multi-head attention), these masks are processed without expanding them to 4D. Instead, when iterating through the input, `Softmax.cpp::host_softmax` converts the index to match the mask dimensions, depending on the type. - If softmax is applied along the dimension other than the last, `Softmax.cpp::masked_softmax_cpu` expands masks to 4D, converting them to `mask_type=2`. 
Theoretically one could also add special optimized cases for `dim=0, 1, 2` and process them without mask expansion, but I don't know how often is that used ## Tests: - `test_transformerencoderlayer_fast_path` is extended to cover both attention mask and padding mask - `test_masked_softmax_mask_types_0_1` is added to ensure results from CPU softmax with attention and padding masks match the explicit slow calculation - `test_masked_softmax_devices_parity` is added to ensure results from masked softmax on CPU and CUDA match ## Note I had to replace `float` with `torch.get_default_dtype()` in a couple of tests for the following reason: - `test_nn.py` [sets the default type to `torch.double`](https://github.com/pytorch/pytorch/blob/master/test/test_nn.py#L24-L26) - If I execute `test_nn.py` and `test_transformers.py` in one `pytest` run, this default still holds for transformer tests - Some tests in `test_transformers.py` which were previously following the slow path now switched to fast path, and hard-coded `float` started clashing with default `double` Let me know if there is a better way around it - or maybe I'm not supposed to run tests with `pytest` like this Pull Request resolved: https://github.com/pytorch/pytorch/pull/87377 Approved by: https://github.com/mikekgfb, https://github.com/weiwangmeta, https://github.com/malfet	2022-10-31 19:59:36 +00:00
Driss Guessous	e24ce484ed	Use scaled_dot_product_attention within attention.cpp (#87312 ) # Summary Use the private _scaled_dot_product_attention to support _native_multiheaded_attention. _SDP provides access to fused kernels when certain conditions are met, enabling a speed up for MHA. cc @cpuhrsch @jbschlosser @bhosmer @mikaylagawarecki Pull Request resolved: https://github.com/pytorch/pytorch/pull/87312 Approved by: https://github.com/cpuhrsch	2022-10-31 04:06:31 +00:00
Driss Guessous	35c611d30f	Add mem efficient backend flag (#87946 ) # Summary Add in a torch.backends.cuda flag and update context manager to pick between the three implementations of the scaled_dot_product_attention. cc @cpuhrsch @jbschlosser @bhosmer @mikaylagawarecki Pull Request resolved: https://github.com/pytorch/pytorch/pull/87946 Approved by: https://github.com/cpuhrsch	2022-10-28 15:51:10 +00:00
Rui Zhu	4b757f4633	Assert if padding mask type is unexpected (#86353 ) (#87106 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/86353 Fix the issue described in https://github.com/pytorch/pytorch/issues/86120 Test Plan: buck test mode/opt caffe2/test:test_transformers -- test_train_with_long_type_pad Differential Revision: D40129968 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87106 Approved by: https://github.com/malfet	2022-10-20 16:01:54 +00:00
Driss Guessous	5fb687182d	Enable sdp_forward for NestedTensors (#86720 ) # Summary This PR implements a sdp_forward for NestedTensors. This impl will call into flash and mem_efficient_attention when possible. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86720 Approved by: https://github.com/cpuhrsch	2022-10-18 02:00:04 +00:00
Driss Guessous	c5a4844085	Xformer SDP forward/backward kernel (#86157 ) # Summary Include xformer kernel code and make header updates to successfully build. Need to update the kernel calling code and dispatch system to clean this up. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86157 Approved by: https://github.com/cpuhrsch	2022-10-07 03:52:46 +00:00
Driss Guessous	cd6477617c	Custom sdp implementations dense (#85984 ) # Summary - This code creates the runtime dispatch system for choosing a performant fused SDP kernel. The only choice of fused kernel is flash_attention. It also creates python flags and a context manager that can be used to turn off and on behavior for dispatch. - This also adds support for flash_attention with dense tensors. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85984 Approved by: https://github.com/cpuhrsch	2022-10-03 17:36:37 +00:00
Mikayla Gawarecki	afaee00fec	Add python `nested_tensor` and `as_nested_tensor` constructors in `torch.nested` (#85593 ) Remove `torch.nested_tensor` which has erroneous behavior wrt gradients (could be either leaf or not leaf). Introduce `torch.nested.nested_tensor` and `torch.nested.as_nested_tensor` in the vein of `torch.tensor` and `torch.as_tensor`. Done in nested `__init__.py` for now but can move to pybind in future (when we want to load from numpy/nested lists ). Discussed offline with @cpuhrsch and pybind constructor (https://github.com/pytorch/pytorch/pull/85536) was more gnarly than expected, so we can move to that when we do need loading from numpy etc. Differential Revision: [D39806622](https://our.internmc.facebook.com/intern/diff/D39806622) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85593 Approved by: https://github.com/drisspg, https://github.com/cpuhrsch	2022-09-28 20:15:02 +00:00
Driss Guessous	253ffbf28b	Exposing native _scaled_dot_product_attention to torch.nn (#85044 ) # Summary This exposes the _scaled_dot_product_attention function to python in the nn namespace. It is still underscored because the api for args, and kwargs is still in flux for the next few weeks and will eventually land as a prototype feature. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85044 Approved by: https://github.com/cpuhrsch	2022-09-22 16:30:16 +00:00
PyTorch MergeBot	a3dc338ee1	Revert "Exposing native _scaled_dot_product_attention to torch.nn (#85044 )" This reverts commit `9fdd8a8b7f`. Reverted https://github.com/pytorch/pytorch/pull/85044 on behalf of https://github.com/huydhn due to This breaks CUDA 10.2 in trunk. We are deprecating CUDA 10.2, but it is still here in the mean time	2022-09-21 08:34:51 +00:00
Driss Guessous	9fdd8a8b7f	Exposing native _scaled_dot_product_attention to torch.nn (#85044 ) # Summary This exposes the _scaled_dot_product_attention function to python in the nn namespace. It is still underscored because the api for args, and kwargs is still in flux for the next few weeks and will eventually land as a prototype feature. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85044 Approved by: https://github.com/cpuhrsch	2022-09-21 03:09:08 +00:00
Eric Han	7a5d5a0020	Disable Transformer/MHA fast path when autocast is enabled (#84722 ) Differential Revision: D39362298 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84722 Approved by: https://github.com/cpuhrsch	2022-09-09 01:15:24 +00:00
Eric Han	b182f08135	Fix issue in softmax.cu with transformer error when mask seqlen > 1024 (#83639 ) Fixes #83142 Adds - test to catch this issue. - fix to softmax.cu that broadcasts src_key_padding_mask to regular attention_mask shape Pull Request resolved: https://github.com/pytorch/pytorch/pull/83639 Approved by: https://github.com/ngimel	2022-08-30 18:06:27 +00:00
Rui Zhu	e0f2eba93d	Move odd num_head in TransformerEncoder to slow_path (#83483 ) Summary: odd nhead is not supported for masked softmax, therefore we just move it to use old slow_path Test Plan: CI Differential Revision: D38720086 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83483 Approved by: https://github.com/erichan1	2022-08-20 10:02:08 +00:00
Yoav Navon	dfc97df64d	Add fastpath test for mask check flag (#82999 ) Summary: Check that fastpath is taken, which type (sparsity fastpath or normal) for mask that is aligned and one that is not. Test Plan: buck test caffe2/test:test_transformers Differential Revision: D38259928 Pull Request resolved: https://github.com/pytorch/pytorch/pull/82999 Approved by: https://github.com/jbschlosser	2022-08-12 00:04:45 +00:00
Joel Benjamin Schlosser	6ca95547ac	Initial private SDP interface and naive composite impl (#81956 ) Adds an initial private API version of the SDP interface. Signature: ``` _scaled_dot_product_attention(Tensor query, Tensor key, Tensor value, Tensor? attn_mask=None, float dropout_p=0.0, bool need_attn_weights=True, bool is_causal=False) -> (Tensor, Tensor) ``` Returns a tuple of `(output, attn_weights)`. Note the following: * `need_attn_weights`: flag indicating that attention weights should be computed. This is useful to toggle off for flash attention as it does not materialize the weights by default, making it more expensive to return them. * Boolean attention mask support only; `True` values within `attn_mask` indicate that the element should take part in attention (notably, this is reverse of MHA, which uses `True` to mask out values). Mask is optional. * `is_causal`: Temporary flag indicating whether to use a causal attention weighting. If this is set to `True`, it takes precedence over any value passed in for `attn_mask`. Longer term, the `is_causal` flagging can be subsumed into the `attn_mask` arg via tensor subclassing (see e.g. [CausalTensor](https://github.com/facebookresearch/xformers/blob/sparse_cleanup/xformers/sparse/causal_tensor.py) in xFormers). * Testing is currently done via reference with the existing Python impl of `F._scaled_dot_product_attention`. * This PR does not yet drop-in the new SDP anywhere. A future PR can hook it up in BT or MHA. Pull Request resolved: https://github.com/pytorch/pytorch/pull/81956 Approved by: https://github.com/drisspg, https://github.com/erichan1	2022-08-01 22:26:18 +00:00
PyTorch MergeBot	26776d628c	Revert "Initial private SDP interface and naive composite impl (#81956 )" This reverts commit `f15c5bf133`. Reverted https://github.com/pytorch/pytorch/pull/81956 on behalf of https://github.com/janeyx99 due to broke all configs on test_scaled_dot_product_attention (__main__.TestNestedTensorAutograd) `f15c5bf133`	2022-07-27 18:36:54 +00:00
Joel Benjamin Schlosser	f15c5bf133	Initial private SDP interface and naive composite impl (#81956 ) Adds an initial private API version of the SDP interface. Signature: ``` _scaled_dot_product_attention(Tensor query, Tensor key, Tensor value, Tensor? attn_mask=None, float dropout_p=0.0, bool need_attn_weights=True, bool is_causal=False) -> (Tensor, Tensor) ``` Returns a tuple of `(output, attn_weights)`. Note the following: * `need_attn_weights`: flag indicating that attention weights should be computed. This is useful to toggle off for flash attention as it does not materialize the weights by default, making it more expensive to return them. * Boolean attention mask support only; `True` values within `attn_mask` indicate that the element should take part in attention (notably, this is reverse of MHA, which uses `True` to mask out values). Mask is optional. * `is_causal`: Temporary flag indicating whether to use a causal attention weighting. If this is set to `True`, it takes precedence over any value passed in for `attn_mask`. Longer term, the `is_causal` flagging can be subsumed into the `attn_mask` arg via tensor subclassing (see e.g. [CausalTensor](https://github.com/facebookresearch/xformers/blob/sparse_cleanup/xformers/sparse/causal_tensor.py) in xFormers). * Testing is currently done via reference with the existing Python impl of `F._scaled_dot_product_attention`. * This PR does not yet drop-in the new SDP anywhere. A future PR can hook it up in BT or MHA. Pull Request resolved: https://github.com/pytorch/pytorch/pull/81956 Approved by: https://github.com/drisspg, https://github.com/erichan1	2022-07-27 15:41:45 +00:00
Eric Han	23088fcfdf	disable src mask for transformer and multiheadattention fastpath (#81277 ) Disable fastpath if src_mask passed to TransformerEncoderLayer and MultiheadAttention. - Refactored test_transformerencoder from test_nn.py to test_transformers.py. Added a src_mask test there. - Added a specific src_mask test in test_transformers.py Fixes https://github.com/pytorch/pytorch/issues/81129 Pull Request resolved: https://github.com/pytorch/pytorch/pull/81277 Approved by: https://github.com/zrphercule	2022-07-15 20:55:17 +00:00
Eric Han	06274d7a48	Add test for torchscripting nn.TransformerEncoder, including fast path (#79796 ) Summary: Add test just to check if TransformerEncoder will crash when enumerating over params [with_no_grad, use_torchscript, training]. Motivation for this was that TransformerEncoder fast path (so with_no_grad=True) and use_torchscript=True would crash with the issue that NestedTensor doesn't have size. This was caused because the TransformerEncoder fast path generates a NestedTensor automatically as a perf optimization and torchscript attempts to find intermediate tensor sizes while it optimizes. But NestedTensor has not implemented a size method, so things fail. This test goes together with this fix https://github.com/pytorch/pytorch/pull/79480 Test Plan: ``` buck build --show-output mode/opt -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=a100 mode/inplace //caffe2/test:transformers ./fbcode/buck-out/gen/caffe2/test/transformers#binary.par ``` Test runs and passes together with the changes from the PR above (I made another diff on top of this with those changes). Does not pass without the fix. Reviewed By: mikekgfb Differential Revision: D37222923 Pull Request resolved: https://github.com/pytorch/pytorch/pull/79796 Approved by: https://github.com/zrphercule	2022-06-17 22:00:49 +00:00
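
The masking semantics described in the message for #81956 (boolean `attn_mask` where `True` means the element takes part in attention, and `is_causal` taking precedence over `attn_mask`) can be illustrated with a small pure-Python reference. This is only a sketch under stated assumptions, not the ATen implementation: the name `sdpa_reference` is hypothetical, inputs are nested lists instead of tensors, there is no batch or head dimension, dropout is omitted, and only the output is returned (in line with the single-tensor return described in #92189).

```python
import math


def sdpa_reference(query, key, value, attn_mask=None, is_causal=False):
    """Naive scaled dot product attention over nested lists (hypothetical helper).

    query: L x E, key: S x E, value: S x Ev, all lists of lists of floats.
    attn_mask: optional L x S list of bools; True means "takes part in attention".
    is_causal: if True, a lower-triangular mask is used and attn_mask is ignored.
    Assumes every query row has at least one allowed key position.
    """
    embed_dim = len(query[0])
    scale = 1.0 / math.sqrt(embed_dim)
    value_dim = len(value[0])
    output = []
    for i, q_row in enumerate(query):
        scores = []
        for j, k_row in enumerate(key):
            if is_causal:
                allowed = j <= i  # causal flag takes precedence over attn_mask
            elif attn_mask is not None:
                allowed = attn_mask[i][j]
            else:
                allowed = True
            if allowed:
                scores.append(scale * sum(q * k for q, k in zip(q_row, k_row)))
            else:
                scores.append(-math.inf)  # excluded positions get zero weight
        # Numerically stable softmax over the key dimension.
        row_max = max(scores)
        exps = [math.exp(s - row_max) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        output.append(
            [sum(w * v_row[d] for w, v_row in zip(weights, value)) for d in range(value_dim)]
        )
    return output


q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]

# With is_causal=True the first query position can only see the first key.
causal_out = sdpa_reference(q, k, v, is_causal=True)
# A diagonal boolean mask makes each query attend only to its own key.
diag_out = sdpa_reference(q, k, v, attn_mask=[[True, False], [False, True]])
```

The fused kernels referenced throughout this history avoid materializing the `weights` rows that this sketch builds explicitly, which is the stated reason the public API stopped returning them.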


244 Commits