Commit Graph

1187 Commits

Author SHA1 Message Date
chilli
e0c5113dec Add support for capturing tensors with score_mod (#124444)
```
import torch
import torch.nn.functional as F
from torch.nn.attention._templated_attention import _templated_attention as templated_attention
from triton.testing import do_bench

torch.manual_seed(0)

B = 16
H = 16
S = 2048
D = 64

# score_mod that captures an external tensor (head_scale), the feature added here
head_scale = torch.randn(H, device="cuda")
def alibi(score, batch, head, token_q, token_kv):
    return score + torch.ops.aten.index(head_scale, [head]) * (token_q - token_kv)

bias = torch.randn(H, S, S, dtype=torch.float16, device="cuda")

query = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
key = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
value = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)

compiled = torch.compile(templated_attention)
out = compiled(query, key, value, score_mod=alibi)
out2 = templated_attention(query, key, value, score_mod=alibi)
print((out - out2).abs().mean())
assert (out - out2).abs().mean() < 1e-3
print("Flash (no mask): ", do_bench(lambda: F.scaled_dot_product_attention(query, key, value)))
print("Flash (mask): ", do_bench(lambda: F.scaled_dot_product_attention(query, key, value, attn_mask=bias)))
print("flexattention: ", do_bench(lambda: compiled(query, key, value, score_mod=alibi)))
```
<img width="324" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/18c175d0-2720-4dfd-8747-85b8a8f609f5">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124444
Approved by: https://github.com/jansel, https://github.com/drisspg
2024-04-23 06:20:13 +00:00
Yanbo Liang
72a34eeb99 Dynamo x autograd.Function supports non-{Tensor, symnode, constant} inputs (#124360)
Fixes #118395

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124360
Approved by: https://github.com/zou3519
2024-04-22 23:32:54 +00:00
Aaron Gokaslan
5a1216bb2e [BE]: Update ruff to 0.4.1 (#124549)
Update ruff to 0.4.1.
This version fixes a lot of false negatives/false positives, is 20-40% faster, and has various other bug fixes.

Below is a before-and-after table showing the execution time of ruff lint and ruff format in milliseconds, courtesy of https://astral.sh/blog/ruff-v0.4.0

| Repository                                         | Linter (v0.3) | Linter (v0.4) | Formatter (v0.3) | Formatter (v0.4) |
|----------------------------------------------------|---------------|---------------|------------------|------------------|
| [pytorch/pytorch](https://github.com/pytorch/pytorch) | 328.7         | 251.8         | 351.1            | 274.9            |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124549
Approved by: https://github.com/ezyang
2024-04-21 14:06:23 +00:00
Yanbo Liang
0d90d4d613 [Dynamo] Fix NamedTuple hasattr bug (#124531)
Fixes #124402

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124531
Approved by: https://github.com/jansel
2024-04-21 04:36:22 +00:00
Animesh Jain
febc4d8759 [dynamo][easy] forbid_in_graph check to use getattr_static (#124445)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124445
Approved by: https://github.com/yanboliang, https://github.com/jansel
2024-04-20 14:11:05 +00:00
Yanbo Liang
a3e3693afc [Dynamo] Fix missing bracket in ListVariable (#124532)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124532
Approved by: https://github.com/williamwen42
2024-04-20 08:26:30 +00:00
drisspg
f1cbaf1764 Adds LSE output for templated-attention-hop if inputs require grad (#124308)
Prep PR for adding autograd support to the templated-attention HOP. The kernel needs to output the LSE (log-sum-exp of the attention scores) during the forward pass, which will be used during the backward pass.
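
For reference (not from the PR text): the saved quantity is the row-wise log-sum-exp of the scores, which lets the backward pass recompute the softmax probabilities without materializing the full score matrix:

```math
\mathrm{LSE}_i = \log \sum_j \exp(s_{ij}), \qquad p_{ij} = \exp\left(s_{ij} - \mathrm{LSE}_i\right)
```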

### Output code
https://gist.github.com/drisspg/2aea3ce5db75811e7e143eeecb774d8a

## Before
| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod     | dtype          |
|---------|-----------|--------------|-------------|-------------|-------------|------------|---------------|----------------|
| Average |     1.159 |              |             |             |             |            |               |                |
| Max     |     1.342 |           16 |          16 |         512 |         512 |         64 | noop          | torch.bfloat16 |
| Min     |     1.016 |            1 |          16 |         512 |         512 |         64 | relative_bias | torch.bfloat16 |

## After
| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod   | dtype          |
|---------|-----------|--------------|-------------|-------------|-------------|------------|-------------|----------------|
| Average |     1.155 |              |             |             |             |            |             |                |
| Max     |     1.339 |           16 |          16 |         512 |         512 |         64 | noop        | torch.bfloat16 |
| Min     |     1.009 |            1 |          16 |         512 |         512 |         64 | head_bias   | torch.bfloat16 |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124308
Approved by: https://github.com/Chillee
2024-04-20 05:45:56 +00:00
JackCaoG
7ae835eee4 Enable SourcelessBuilder to build GraphModule generated by make_fx (#123673)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123673
Approved by: https://github.com/ezyang, https://github.com/anijain2305
ghstack dependencies: #123680
2024-04-19 17:23:51 +00:00
Michael Lazos
5050e627dc Defer marking_static_address (#124309)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124309
Approved by: https://github.com/anijain2305
ghstack dependencies: #123324, #123404, #123405
2024-04-19 17:20:57 +00:00
Xuehai Pan
a6f044a490 [dynamo, 3.8-3.9] support dataclass with frozen=True in Python 3.8/3.9 (#124393)
Closes #114966

Frozen field assignment in `__init__` in Python 3.8-3.9:

f5bd65ed37/Lib/dataclasses.py (L402-L411)

```python
import builtins

BUILTINS = builtins

def _field_assign(frozen, name, value, self_name):
    # If we're a frozen class, then assign to our fields in __init__
    # via object.__setattr__.  Otherwise, just use a simple
    # assignment.
    #
    # self_name is what "self" is called in this function: don't
    # hard-code "self", since that might be a field name.
    if frozen:
        return f'BUILTINS.object.__setattr__({self_name},{name!r},{value})'
    return f'{self_name}.{name}={value}'
```

Frozen field assignment in `__init__` in Python 3.10+:

812245ecce/Lib/dataclasses.py (L436-L445)

```python
__dataclass_builtins_object__ = object

def _field_assign(frozen, name, value, self_name):
    # If we're a frozen class, then assign to our fields in __init__
    # via object.__setattr__.  Otherwise, just use a simple
    # assignment.
    #
    # self_name is what "self" is called in this function: don't
    # hard-code "self", since that might be a field name.
    if frozen:
        return f'__dataclass_builtins_object__.__setattr__({self_name},{name!r},{value})'
    return f'{self_name}.{name}={value}'
```
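
A minimal sketch (not from the PR; hypothetical `Config` class) of the pattern this enables under `torch.compile` on Python 3.8/3.9:

```python
import dataclasses

import torch


@dataclasses.dataclass(frozen=True)
class Config:
    scale: float


@torch.compile
def fn(x):
    # Constructing the frozen dataclass runs the generated __init__, which on
    # 3.8/3.9 assigns fields via BUILTINS.object.__setattr__ as quoted above.
    cfg = Config(scale=2.0)
    return x * cfg.scale


print(fn(torch.ones(3)))
```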

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124393
Approved by: https://github.com/jansel
2024-04-19 05:10:33 +00:00
ydwu4
2e48b39603 Fix example_value of map (#124203)
Previously, we didn't expand the shape of the example_value of map to match the inputs (edit: the first mapped dimension). This PR fixes that bug. To make this easier, we change `_call_function_and_unflatten_output` to accept example values directly instead of retrieving them from the variable trackers.

Also remove a redundant call_function node in the strict_mode higher-order op in Dynamo.

Test Plan:
existing tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124203
Approved by: https://github.com/ezyang, https://github.com/zou3519
2024-04-18 19:18:36 +00:00
Andrew M. James
64f42bfd52 [dynamo] Support list.reverse (#124210)
fixes #123974
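
A minimal sketch (not from the PR) of the construct this supports:

```python
import torch


@torch.compile(fullgraph=True)
def fn(x):
    vals = [x, x + 1, x + 2]
    vals.reverse()  # in-place list.reverse is traced rather than graph-breaking
    return vals[0]  # x + 2 after the reverse


print(fn(torch.zeros(2)))
```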

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124210
Approved by: https://github.com/peterbell10
2024-04-17 23:33:32 +00:00
Edward Z. Yang
bebdbb63ce Introduce set_example_value and use it throughout Dynamo (#124176)
I'm going to set up some extra behavior when we set the example value, so
I need a convenient place to interpose. I cannot easily do it on
meta itself because it's a generic dict with no interposition point.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124176
Approved by: https://github.com/oulgen
ghstack dependencies: #124105, #124059
2024-04-17 22:57:11 +00:00
Xuehai Pan
93e249969b [BE] enable ruff rule RSE and remove useless parentheses in raise statements (#124261)
Remove useless parentheses in `raise` statements if the exception type is raised with no argument.
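
For illustration (not from the PR), the pattern the RSE rule flags and the preferred form:

```python
# Flagged: redundant parentheses on a zero-argument raise
def bad():
    raise NotImplementedError()


# Preferred after the cleanup
def good():
    raise NotImplementedError
```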

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124261
Approved by: https://github.com/albanD
2024-04-17 19:29:34 +00:00
Animesh Jain
f433517181 [dynamo][decorator] Support disable on nn modules (#124185)
Fixes https://github.com/pytorch/pytorch/issues/123979
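
A sketch of the usage this enables (hypothetical module names; assuming `torch._dynamo.disable` wraps the module instance):

```python
import torch
import torch._dynamo
import torch.nn as nn


class Inner(nn.Module):
    def forward(self, x):
        return x.sin()


class Outer(nn.Module):
    def __init__(self):
        super().__init__()
        # Skip compiling this submodule; calls into it run eagerly.
        self.inner = torch._dynamo.disable(Inner())

    def forward(self, x):
        return self.inner(x) + 1


print(torch.compile(Outer())(torch.randn(4)))
```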

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124185
Approved by: https://github.com/weifengpy, https://github.com/yoyoyocmu
2024-04-17 16:20:34 +00:00
Xuehai Pan
7e1c98c171 [dynamo] support object.__setattr__(obj, name, value) (#124068)
Resolves #114964
Resolves #114966

- #114964
- #114966
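
A minimal sketch (not from the PR; hypothetical `Box` class) of the construct this supports:

```python
import torch


class Box:
    pass


@torch.compile
def fn(x):
    b = Box()
    object.__setattr__(b, "value", x + 1)  # traced instead of graph-breaking
    return b.value


print(fn(torch.ones(2)))
```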

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124068
Approved by: https://github.com/jansel
2024-04-17 15:57:14 +00:00
Jason Ansel
f3fd280238 [dynamo] Relax strict_mode for autograd.Function forward inputs (#123910)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123910
Approved by: https://github.com/oulgen
2024-04-13 19:41:59 +00:00
Animesh Jain
58afcd7b61 [dynamo][dict] Add UnspecializedNNModuleVariable to dict keys (#122812)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122812
Approved by: https://github.com/jansel
ghstack dependencies: #122943, #123877, #123878
2024-04-13 02:07:35 +00:00
Xuehai Pan
7b11fb4695 [Dynamo] fix opcode YIELD_FROM and SEND (#123912)
This PR is split from #120300.

- #120300

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123912
Approved by: https://github.com/anijain2305
2024-04-12 21:57:47 +00:00
Jason Ansel
8069469081 [dynamo] Support Tuple[int] args to autograd.Function (#123887)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123887
Approved by: https://github.com/anijain2305
ghstack dependencies: #123700, #123705, #123786, #123790, #123803, #123804, #123896
2024-04-12 19:03:13 +00:00
Jason Ansel
70b8c58f84 [dynamo] Emit warning to turn on capture_scalar_outputs (#123896)
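A sketch (not from the PR) of the flag the new warning points at:

```python
import torch
import torch._dynamo.config as dynamo_config

# Without this flag, .item() on a tensor causes a graph break; the warning
# added here suggests turning it on.
dynamo_config.capture_scalar_outputs = True


@torch.compile
def fn(x):
    n = x.sum().item()  # captured as a scalar instead of graph-breaking
    return x + n


print(fn(torch.ones(3)))
```
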
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123896
Approved by: https://github.com/anijain2305
ghstack dependencies: #123700, #123705, #123786, #123790, #123803, #123804
2024-04-12 19:03:13 +00:00
Jason Ansel
e3935783f7 [dynamo] Fix @property on user-defined nn.Module (#123804)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123804
Approved by: https://github.com/anijain2305
ghstack dependencies: #123700, #123705, #123786, #123790, #123803
2024-04-12 19:03:13 +00:00
Jason Ansel
6bac183dc2 [dynamo] Support numpy.iinfo/finfo (#123803)
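A minimal sketch (not from the PR) of the now-supported calls, which Dynamo can constant-fold:

```python
import numpy as np
import torch


@torch.compile(fullgraph=True)
def fn(x):
    hi = np.iinfo(np.int32).max      # constant-folded
    eps = np.finfo(np.float32).eps   # constant-folded
    return x.clamp(max=hi) + eps


print(fn(torch.arange(3, dtype=torch.float32)))
```
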
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123803
Approved by: https://github.com/anijain2305
ghstack dependencies: #123700, #123705, #123786, #123790
2024-04-12 19:03:13 +00:00
Jason Ansel
11e6f84ad8 [dynamo] Graph break on uninitialized nn.Module (#123790)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123790
Approved by: https://github.com/anijain2305
ghstack dependencies: #123700, #123705, #123786
2024-04-12 19:03:13 +00:00
Jason Ansel
6b0ba6bbd3 [dynamo] Improve constant-prop for regex/torch.__version__ (#123705)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123705
Approved by: https://github.com/anijain2305
ghstack dependencies: #123700
2024-04-12 19:03:13 +00:00
Simon Fan
7fc3aa5f81 [compiled autograd][aot] Trim runtime refs for list inputs from dynamo (#122535)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122535
Approved by: https://github.com/bdhirsh
ghstack dependencies: #123630, #123674, #122353, #123359
2024-04-12 10:29:09 +00:00
Simon Fan
d274d57037 [compiled autograd][dynamo] Make compiled graph take in boxed inputs (#122353)
### Context
In today's Dynamo, we lift all tensors encountered during tracing to be individual graph inputs, even when they were in a container.

And [Dynamo generates](fdc281f258/torch/_dynamo/codegen.py (L371)) the runtime function's signature using the graph's graphargs.

This means that the generated function will have each grapharg as an argument, which is problematic if we want to free the inputs in inductor codegen. See [python function arguments are kept alive for the duration of the function call](https://github.com/pytorch/pytorch/pull/83137#issuecomment-1211320670).

```python
# original code
def forward(inputs):
  a, b, c, d, e = inputs
  inputs.clear()
  out = a
  out += b
  del b  # frees memory
  out += c
  del c  # frees memory
  out += d
  del d  # frees memory
  out += e
  del e  # frees memory
  return out

# compiled code:
def forward(a, b, c, d, e):
  # b, c, d, e can't be freed before end of function
```

This isn't a concern when compiling the forward, because a, b, c, d, e all come from user code and should be kept alive. But when compiling backwards, a, b, c, d, e may be intermediate results (i.e. activations) that we DO want to clear ASAP to remain on par with eager peak memory.

### Solution

We have encountered similar memory problems in AOTAutograd before, where we adopted the boxed calling convention (wrapping to-be-freed objects in a list), added list clearing to inductor codegen, and were careful about holding references to elements in the input list. We need to do something similar, but for inputs from the user program (the compiled autograd FX graph in this case).

This PR supports lists as graphargs/placeholder nodes. When tracing a list of tensors, we create a node for it, and pre-emptively initialize variable trackers for its elements before they are used in the user program. Subsequent uses of those variables will find hits in the lookup table `input_source_to_var`.

With the inputs as a list in the graph args, our compiled code can free inputs just like in the eager case.
```python
def forward(inputs):
  # a, b, c, d, e can be freed within the function now
```
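
A toy illustration (not from the PR) of why the boxed convention allows early frees while ordinary positional arguments stay alive for the whole call:

```python
import weakref


class Activation:  # stand-in for a saved tensor
    pass


def boxed_forward(inputs):
    a = inputs.pop()      # take the only remaining reference out of the box
    ref = weakref.ref(a)
    del a                 # the object is freed here, mid-function
    assert ref() is None
    return "freed early"


print(boxed_forward([Activation()]))
```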

Currently, AOT/Inductor flattens list inputs via the [flatten_graph_inputs wrapper](597f479643/torch/_inductor/compile_fx.py (L1454-L1478)), which is why this PR's CI can be green. Additional changes are needed to its runtime wrapper, done in the next PR. The next step is to ensure that we are careful in forwarding the list to Inductor codegen without holding additional references.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122353
Approved by: https://github.com/jansel
ghstack dependencies: #123630, #123674
2024-04-12 10:29:09 +00:00
Animesh Jain
ede9e8237a [dynamo] Bug fix for GET_YIELD_FROM_ITER (#122943)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122943
Approved by: https://github.com/jansel
2024-04-12 02:32:58 +00:00
Brian Hirsh
96fe3c5d46 fix correctness for dynamo inlining RangeVariable __contains__ (#122751)
Fixes https://github.com/pytorch/pytorch/issues/122379

It looks like `iter_contains()` in dynamo expects to take in something like `iter_contains(List[VariableTracker], VariableTracker)`. Previously, when we called this function where the list in question was a `RangeVariable`, we would pass in `RangeVariable.items` as our list.

This is wrong, though, since `RangeVariable.items` just contains the underlying [start, stop, step]. It looks like `unpack_var_sequence` does the right thing of "materializing" the range into a list of `VariableTracker`s, so I used that instead.
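
A minimal sketch (hypothetical; not the repro from #122379) of the kind of construct affected:

```python
import torch


@torch.compile(fullgraph=True)
def fn(x, n: int):
    # Membership must be checked against the materialized range values,
    # not the raw [start, stop, step] triple.
    if n in range(0, 10, 2):
        return x + 1
    return x - 1


print(fn(torch.zeros(2), 4))
```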

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122751
Approved by: https://github.com/anijain2305, https://github.com/jansel
ghstack dependencies: #122502
2024-04-12 01:12:23 +00:00
Jason Ansel
5a7fd20aa1 [dynamo] Support autograd.FunctionCtx.needs_input_grad (#123700)
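A sketch (not from the PR) of an autograd.Function whose backward reads `ctx.needs_input_grad`:

```python
import torch


class Scale(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, w):
        ctx.save_for_backward(w)
        return x * w

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        grad_x = grad_out * w if ctx.needs_input_grad[0] else None
        return grad_x, None  # no gradient for w in this sketch


@torch.compile
def fn(x, w):
    return Scale.apply(x, w)


x = torch.randn(3, requires_grad=True)
fn(x, torch.randn(3)).sum().backward()
```
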
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123700
Approved by: https://github.com/anijain2305
2024-04-11 19:30:55 +00:00
Edward Z. Yang
b36b523c05 Fix guard_size_oblivious on non-symbolic expression (#123743)
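For context, a sketch (not from the PR) of the non-symbolic case: outside of tracing, shape comparisons are plain Python bools rather than SymBool expressions, and the helper should simply return the bool:

```python
import torch
from torch.fx.experimental.symbolic_shapes import guard_size_oblivious

x = torch.ones(3)
print(guard_size_oblivious(x.shape[0] == 3))  # True
```
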
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123743
Approved by: https://github.com/avikchaudhuri
2024-04-10 22:45:54 +00:00
Michael Lazos
bff321716c Remove special handling of step with closure (#123620)
Implements https://github.com/pytorch/pytorch/issues/123479

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123620
Approved by: https://github.com/anijain2305
ghstack dependencies: #123496, #123497, #123551, #123552, #123618
2024-04-09 21:15:24 +00:00
Thiago Crepaldi
1b5944358e Ignore logging.Logger.* calls during dynamo export (#123402)
Follow up for https://github.com/pytorch/pytorch/pull/123368

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123402
Approved by: https://github.com/williamwen42
2024-04-09 18:51:00 +00:00
Will Feng
7a78534468 [Compile FSDP2][1/n] Support using user-defined object instance method as hook (#123399)
FSDP2 has this pattern of using a user-defined object's instance method as a hook, which throws this error under compile:
`torch._dynamo.exc.Unsupported: call_function UserDefinedObjectVariable(_pre_forward) [FSDPManagedNNModuleVariable(), TupleVariable(), ConstDictVariable()] {}`

This PR adds support for it by always allowing Dynamo to trace into a UserDefinedObjectVariable that is an instance method (i.e. `MethodType`).
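
A minimal sketch (hypothetical class and hook names, not FSDP2's) of the pattern: a bound method registered as a module hook, then compiled:

```python
import torch
import torch.nn as nn


class Manager:
    def _pre_forward(self, module, args):
        # bookkeeping would go here; returning args leaves them unchanged
        return args


mgr = Manager()
lin = nn.Linear(4, 4)
lin.register_forward_pre_hook(mgr._pre_forward)  # instance method (MethodType) as hook

print(torch.compile(lin)(torch.randn(2, 4)).shape)
```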

Supersedes https://github.com/pytorch/pytorch/pull/123320.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123399
Approved by: https://github.com/jansel
2024-04-09 17:29:08 +00:00
Michael Lazos
7c23fed12c Move step to cpu if state is already initialized (#123618)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123618
Approved by: https://github.com/anijain2305
ghstack dependencies: #123496, #123497, #123551, #123552
2024-04-09 09:04:18 +00:00
PyTorch MergeBot
d04957c0c6 Revert "Ignore logging.Logger.* calls during dynamo export (#123402)"
This reverts commit 75933ff523.

Reverted https://github.com/pytorch/pytorch/pull/123402 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/123402#issuecomment-2044236088))
2024-04-09 06:28:12 +00:00
Thiago Crepaldi
75933ff523 Ignore logging.Logger.* calls during dynamo export (#123402)
Follow up for https://github.com/pytorch/pytorch/pull/123368

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123402
Approved by: https://github.com/williamwen42
2024-04-08 22:50:54 +00:00
Jason Ansel
d8e0c26e64 [dynamo] Support warnings.catch_warnings (#123511)
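A minimal sketch (not from the PR) of the construct this supports:

```python
import warnings

import torch


@torch.compile
def fn(x):
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        return x + 1


print(fn(torch.ones(2)))
```
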
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123511
Approved by: https://github.com/anijain2305
2024-04-08 22:27:46 +00:00
Michael Lazos
89e6292d48 Defer setting capturable in optimizer variable (#123497)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123497
Approved by: https://github.com/anijain2305
ghstack dependencies: #123496
2024-04-08 19:31:25 +00:00
Michael Lazos
73e235f0a6 Swap to ID guard for optimizer Variable (#123496)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123496
Approved by: https://github.com/anijain2305
2024-04-08 19:28:25 +00:00
PyTorch MergeBot
3e8d3577be Revert "Swap to ID guard for optimizer Variable (#123496)"
This reverts commit 26bf05ccac.

Reverted https://github.com/pytorch/pytorch/pull/123496 on behalf of https://github.com/PaliC due to seems to have broken distributed/fsdp/test_fsdp_hybrid_shard.py as per 26bf05ccac ([comment](https://github.com/pytorch/pytorch/pull/123496#issuecomment-2043251234))
2024-04-08 17:06:05 +00:00
PyTorch MergeBot
d9ac80f80c Revert "Defer setting capturable in optimizer variable (#123497)"
This reverts commit 76b290344f.

Reverted https://github.com/pytorch/pytorch/pull/123497 on behalf of https://github.com/PaliC due to seems to have broken distributed/fsdp/test_fsdp_hybrid_shard.py as per 26bf05ccac ([comment](https://github.com/pytorch/pytorch/pull/123496#issuecomment-2043251234))
2024-04-08 17:06:05 +00:00
Michael Lazos
76b290344f Defer setting capturable in optimizer variable (#123497)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123497
Approved by: https://github.com/anijain2305
ghstack dependencies: #123496
2024-04-08 08:34:19 +00:00
Michael Lazos
26bf05ccac Swap to ID guard for optimizer Variable (#123496)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123496
Approved by: https://github.com/anijain2305
2024-04-08 05:03:34 +00:00
Jason Ansel
212e460dce [dynamo] Support custom __setattr__ on UserDefinedObjectVariable (#123318)
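A minimal sketch (not from the PR; hypothetical `Tracked` class) of a user-defined class with a custom `__setattr__` used inside a compiled function:

```python
import torch


class Tracked:
    def __setattr__(self, name, value):
        object.__setattr__(self, name, value)
        object.__setattr__(self, "dirty", True)


@torch.compile
def fn(x):
    t = Tracked()
    t.scale = 2.0  # routed through the custom __setattr__
    return x * t.scale


print(fn(torch.ones(3)))
```
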
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123318
Approved by: https://github.com/anijain2305
2024-04-07 21:06:52 +00:00
Oguz Ulgen
287680176b Use graph.find_nodes in dynamo (#122257)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122257
Approved by: https://github.com/jansel
ghstack dependencies: #121565, #122255, #122256
2024-04-07 18:51:18 +00:00
Will Feng
7b02910163 [Compile FSDP2][2/n] Support streams created outside of compile region (#123487)
FSDP2 creates CUDA streams outside of the compile region during its first-iteration eager run, and torch.compile then attempts to record method calls on these streams (e.g. `stream.record_event()`) in later compiled iterations.

Before this PR, the stream proxy was None, which caused a "None doesn't have attribute record_event" error when calling `record_event()` on it. After this PR, the stream proxy has the correct value, which makes calling methods on it possible.
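
A minimal sketch (not from the PR; requires a CUDA device) of the pattern:

```python
import torch

stream = torch.cuda.Stream()  # created eagerly, outside the compiled region


@torch.compile
def step(x):
    with torch.cuda.stream(stream):
        y = x + 1
    stream.record_event()  # method call on the externally created stream
    return y


print(step(torch.ones(4, device="cuda")))
```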

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123487
Approved by: https://github.com/jansel
2024-04-06 08:42:42 +00:00
drisspg
f4e2a226aa ScoreMod API (#121845)
# Summary

This PR adds a new higher-order op: `templated_attention`. This op is designed to extend the functionality of torch.nn.functional.scaled_dot_product_attention. PyTorch has efficient pre-written fused-attention kernels. However, users want to modify how scores are computed (a substep inside attention), which traditionally requires them to write their own attention kernel. One such modification to attention scores that is not currently supported by the top-level SDPA op is [Attention with Linear Biases (ALiBi)](https://arxiv.org/abs/2108.12409).

This higher-order op instead accepts a callable (`score_mod`) that, through torch.compile, is used to create an efficient attention kernel instantiation.

### Details

This HOP utilizes the existing FX and HOP infra to capture the user's `score_mod` function and convert it to an FX graph module. Inductor then consumes this HOP, which has an `ir.Subgraph` input, and inlines the lowered subgraph into a Triton kernel that performs fused attention with the score modification applied inline.

### API

The API for a score_mod function should be as follows:

```Python
def score_mod(score: torch.Tensor, batch: torch.Tensor, head: torch.Tensor, token_q: torch.Tensor, token_kv: torch.Tensor) -> torch.Tensor
```

This function receives five parameters:

- `score`: A scalar tensor representing the attention score, with the same data type and device as the query, key, and value tensors.
- `batch`, `head`, `token_q`, `token_kv`: Scalar tensors indicating the batch index, head index, query index, and key/value index, respectively, with torch.int data type and located on the same device as the score tensor.

Consider inputs query, key, value of shapes (2, 4, 16, 8), leading to an intermediate attention score matrix of shape (2, 4, 16, 16).

The score_mod function will be vectorized over each element of this matrix. For instance, modifying the score at the position corresponding to the 0th batch, 2nd head, 8th query element, and 9th key element would be invoked as:

```Python
score_mod(score[0,2,8,9], torch.tensor(0), torch.tensor(2), torch.tensor(8), torch.tensor(9))
```
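
A rough eager reference for these semantics (a sketch, not the actual implementation; assumes SDPA-style 1/sqrt(head_dim) scaling and a score_mod written with broadcast-friendly ops):

```python
import torch


def eager_templated_attention(query, key, value, score_mod):
    B, H, M, D = query.shape
    N = key.shape[-2]
    scores = query @ key.transpose(-2, -1) / D**0.5
    b = torch.arange(B, device=query.device)[:, None, None, None]
    h = torch.arange(H, device=query.device)[None, :, None, None]
    m = torch.arange(M, device=query.device)[None, None, :, None]
    n = torch.arange(N, device=query.device)[None, None, None, :]
    scores = score_mod(scores, b, h, m, n)  # applied pointwise to every score
    return torch.softmax(scores, dim=-1) @ value
```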

### Examples
```Python
import torch
from torch.nn.attention.templated_attention import templated_attention

torch.manual_seed(0)

# Let's create some input tensors
# The input tensor has shape (batch_size, num_heads, seq_len, head_dim)
query = torch.randn(8, 8, 2048, 64, device="cuda", dtype=torch.float32)
key = torch.randn(8, 8, 2048, 64, device="cuda", dtype=torch.float32)
value = torch.randn(8, 8, 2048, 64, device="cuda", dtype=torch.float32)

# Let's create a fun new score modification! I will call this
# Checkerboard. It will reduce the score for neighboring tokens (1 step apart)
# in the sequence and increase the score for tokens 2 steps apart. For everything
# else, the score will remain the same.

def checkerboard(score, batch, head, token_q, token_kv):
    score = torch.where(torch.abs(token_kv - token_q) == 1, score * 0.5, score)
    score = torch.where(torch.abs(token_kv - token_q) == 2, score * 2.0, score)
    return score

# Let's call templated_attention with this new score modification
output = templated_attention(query, key, value, score_mod=checkerboard)

compiled_templated_attention = torch.compile(templated_attention)
out_compiled = compiled_templated_attention(query, key, value, score_mod=checkerboard)

torch.testing.assert_close(output, out_compiled, atol=2e-2, rtol=2e-2)
```

### Future Work
- This PR currently supports forward only. However, the Triton kernel for backwards, where score modifications do not rely on external buffers, has been explored here: https://github.com/drisspg/transformer_nuggets/blob/main/transformer_nuggets/flash/flash_attention.py
- Kernel improvements: there have been some larger updates to the fused attention implementation that Triton uses in its tutorials. The implementation of this kernel is based on a prior version and should be updated.
- We may want to unify this API under the top-level SDPA API; we leave that as a follow-up once this is more stable
- Should we error on CPU?
- There are some issues with dynamic shapes
- Capturing free variables and lifting them to inputs of the subgraph is not working correctly today

### Performance
Comparisons generated by this benchmark:

| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod     | dtype          |
|---------|-----------|--------------|-------------|-------------|-------------|------------|---------------|----------------|
| Average |     5.412 |              |             |             |             |            |               |                |
| Max     |     8.882 |           16 |          16 |        4096 |        4096 |         64 | relative_bias | torch.bfloat16 |
| Min     |     3.645 |            8 |          16 |         512 |         512 |         64 | causal_mask   | torch.bfloat16 |
| Min     |     0.345 |            1 |          16 |        1024 |        1024 |         64 | pathological  | torch.bfloat16 |

For reference

| Configuration                                 | Forward Time (µ seconds) | Backend          | Speedup |
|-----------------------------------------------|--------------------------|------------------|---------|
| Fastest Config in Sweep (`8 16 4096 4096 64 relative_bias torch.bfloat16`) | 3608                   | Templated Attention                | 1.0  |
| Compiled SDPA (No Mask)                       | 9928                   | Math             | 2.75x   |
| Compiled SDPA (With Mask)                     | 11898                    | Math             | 3.29x   |
| Compiled SDPA (With Mask) | 8704                      | Memory Efficient Attention | 2.42x   |
| Compiled SDPA (No Mask) | 2548                     | FlashAttention2 | 0.706x   |

The speedups measure compiled templated attention against different calls to torch.nn.functional.scaled_dot_product_attention.

<details>

<summary> FULL PERFORMANCE SWEEP NUMBERS </summary>

|   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod     | dtype          |   eager_time |   compiled_time |   speedup |
|--------------|-------------|-------------|-------------|------------|---------------|----------------|--------------|-----------------|-----------|
|            1 |          16 |         512 |         512 |         64 | causal_mask   | torch.bfloat16 |      331.444 |          67.221 |     4.931 |
|            1 |          16 |         512 |         512 |         64 | relative_bias | torch.bfloat16 |      335.300 |          64.187 |     5.224 |
|            1 |          16 |         512 |         512 |         64 | head_bias     | torch.bfloat16 |      352.039 |          63.806 |     5.517 |
|            1 |          16 |         512 |         512 |         64 | pathological  | torch.bfloat16 |      371.699 |         711.349 |     0.523 |
|            1 |          16 |        1024 |        1024 |         64 | causal_mask   | torch.bfloat16 |      333.488 |          86.455 |     3.857 |
|            1 |          16 |        1024 |        1024 |         64 | relative_bias | torch.bfloat16 |      322.363 |          82.469 |     3.909 |
|            1 |          16 |        1024 |        1024 |         64 | head_bias     | torch.bfloat16 |      349.967 |          82.233 |     4.256 |
|            1 |          16 |        1024 |        1024 |         64 | pathological  | torch.bfloat16 |      486.359 |        1412.453 |     0.344 |
|            1 |          16 |        4096 |        4096 |         64 | causal_mask   | torch.bfloat16 |     2794.597 |         551.188 |     5.070 |
|            1 |          16 |        4096 |        4096 |         64 | relative_bias | torch.bfloat16 |     3965.150 |         513.101 |     7.728 |
|            1 |          16 |        4096 |        4096 |         64 | head_bias     | torch.bfloat16 |     2408.013 |         504.759 |     4.771 |
|            1 |          16 |        4096 |        4096 |         64 | pathological  | torch.bfloat16 |     6850.531 |       16733.675 |     0.409 |
|            8 |          16 |         512 |         512 |         64 | causal_mask   | torch.bfloat16 |      441.939 |         123.576 |     3.576 |
|            8 |          16 |         512 |         512 |         64 | relative_bias | torch.bfloat16 |      560.379 |         116.710 |     4.801 |
|            8 |          16 |         512 |         512 |         64 | head_bias     | torch.bfloat16 |      421.172 |         115.825 |     3.636 |
|            8 |          16 |         512 |         512 |         64 | pathological  | torch.bfloat16 |      994.492 |        2132.806 |     0.466 |
|            8 |          16 |        1024 |        1024 |         64 | causal_mask   | torch.bfloat16 |     1436.430 |         309.495 |     4.641 |
|            8 |          16 |        1024 |        1024 |         64 | relative_bias | torch.bfloat16 |     1892.216 |         290.186 |     6.521 |
|            8 |          16 |        1024 |        1024 |         64 | head_bias     | torch.bfloat16 |     1360.665 |         282.956 |     4.809 |
|            8 |          16 |        1024 |        1024 |         64 | pathological  | torch.bfloat16 |     3525.532 |        8359.702 |     0.422 |
|            8 |          16 |        4096 |        4096 |         64 | causal_mask   | torch.bfloat16 |    22026.839 |        3864.604 |     5.700 |
|            8 |          16 |        4096 |        4096 |         64 | relative_bias | torch.bfloat16 |    31262.746 |        3609.551 |     8.661 |
|            8 |          16 |        4096 |        4096 |         64 | head_bias     | torch.bfloat16 |    20219.079 |        3480.402 |     5.809 |
|            8 |          16 |        4096 |        4096 |         64 | pathological  | torch.bfloat16 |    54654.647 |      116652.357 |     0.469 |
|           16 |          16 |         512 |         512 |         64 | causal_mask   | torch.bfloat16 |      820.606 |         188.683 |     4.349 |
|           16 |          16 |         512 |         512 |         64 | relative_bias | torch.bfloat16 |     1058.362 |         179.295 |     5.903 |
|           16 |          16 |         512 |         512 |         64 | head_bias     | torch.bfloat16 |      784.372 |         175.714 |     4.464 |
|           16 |          16 |         512 |         512 |         64 | pathological  | torch.bfloat16 |     1890.792 |        4212.877 |     0.449 |
|           16 |          16 |        1024 |        1024 |         64 | causal_mask   | torch.bfloat16 |     2781.830 |         557.017 |     4.994 |
|           16 |          16 |        1024 |        1024 |         64 | relative_bias | torch.bfloat16 |     3694.050 |         525.249 |     7.033 |
|           16 |          16 |        1024 |        1024 |         64 | head_bias     | torch.bfloat16 |     2634.164 |         507.613 |     5.189 |
|           16 |          16 |        1024 |        1024 |         64 | pathological  | torch.bfloat16 |     6959.917 |       15331.116 |     0.454 |
|           16 |          16 |        4096 |        4096 |         64 | causal_mask   | torch.bfloat16 |    43889.096 |        7582.018 |     5.789 |
|           16 |          16 |        4096 |        4096 |         64 | relative_bias | torch.bfloat16 |    62784.293 |        7075.846 |     8.873 |
|           16 |          16 |        4096 |        4096 |         64 | head_bias     | torch.bfloat16 |    40308.606 |        6829.587 |     5.902 |
|           16 |          16 |        4096 |        4096 |         64 | pathological  | torch.bfloat16 |   108892.137 |      233090.953 |     0.467 |
</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121845
Approved by: https://github.com/Chillee, https://github.com/zou3519
2024-04-06 01:10:44 +00:00
Animesh Jain
8e98fda7a9 [dynamo][easy] Add AC test and improve graph break message (#121394)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121394
Approved by: https://github.com/yanboliang
2024-04-06 01:02:45 +00:00
Michael Lazos
d9d25076fe Reduce guards of optimizer state dict to guard once per param group (#123413)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123413
Approved by: https://github.com/anijain2305
2024-04-06 00:12:59 +00:00