masked_scatter_backward was previously implemented as a CompositeExplicitAutograd, whose decomposition calls masked_select, and masked_select in general produces data-dependent shapes that Inductor doesn't support. However, masked_scatter_backward reshapes the return value of masked_select such that the end result has a static shape again.
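For illustration, a rough sketch of the shape of the old decomposition (simplified, not the actual PyTorch code): the intermediate from masked_select has a data-dependent length, but padding and reshaping to the known sizes makes the final output statically shaped.
```python
import torch
import torch.nn.functional as F

def masked_scatter_backward_sketch(grad_output, mask, sizes):
    selected = torch.masked_select(grad_output, mask)  # 1-D, data-dependent length
    numel = 1
    for s in sizes:
        numel *= s
    pad = numel - selected.numel()                     # pad back to a static numel
    return F.pad(selected, (0, pad)).view(sizes)       # final shape is static
```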
I have converted masked_scatter_backward into an aten op to avoid this
issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109642
Approved by: https://github.com/ezyang
ghstack dependencies: #108170
This PR adds Inductor support for [native c10d_functional ops](https://github.com/pytorch/pytorch/pull/110570).
The Inductor IRs introduced in this PR will replace the existing `CollectiveKernel` IR hierarchy. Compared to the existing collective IRs, the new IRs:
- Are target language agnostic and support AOTInductor.
- Express the constraints solely with read/write deps. This maximizes the potential for buffer reuse.
- Address an issue where an out-of-place collective's input buffers could be mutated while still being (volatilely) read.
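For context, a minimal usage sketch of the kind of program these IRs are meant to compile (hypothetical example; it assumes a process group has already been initialized and uses the `torch.distributed._functional_collectives` frontend):
```python
import torch
import torch.distributed as dist
import torch.distributed._functional_collectives as funcol

def allreduce_then_mul(x):
    # Out-of-place functional collective followed by a pointwise op.
    y = funcol.all_reduce(x, "sum", group=dist.group.WORLD)
    return y * 2

# Requires torch.distributed.init_process_group(...) to have been called.
compiled = torch.compile(allreduce_then_mul)
```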
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112439
Approved by: https://github.com/Chillee
We spend somewhere on the order of 1% of time in `sympy.Expr.free_symbols`, as it is called millions of times.
Most of the time we actually just want to know "is this a constant?", but `e.is_constant()` is
horribly slow. It turns out, though, that there is another property, `is_number`, that does what we want.
> property is_number:
>
> Returns True if self has no free symbols and no undefined functions (AppliedUndef, to be precise). It will be faster
> than if not self.free_symbols, however, since is_number will fail as soon as it hits a free symbol or undefined
> function.
Furthermore, we also avoid the overhead of building an unnecessary set object.
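A small standalone SymPy example of the check being replaced:
```python
import sympy

x = sympy.Symbol("x")
constant = sympy.Integer(2) * 3
symbolic = x + 1

# Before: build the full set of free symbols just to test emptiness.
assert not constant.free_symbols
assert symbolic.free_symbols

# After: is_number short-circuits on the first free symbol and never
# allocates the set.
assert constant.is_number
assert not symbolic.is_number
```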
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112688
Approved by: https://github.com/lezcano
When I ran Stable Diffusion in [Huggingface/Diffusers](https://github.com/huggingface/diffusers), an error occurred:
```
LoweringException: AssertionError: should have been handled in replace_random.py.
target: aten.randn.generator
args[0]: [1, 4, 64, 64]
kwargs: {'generator': None, 'dtype': torch.float16, 'layout': torch.strided, 'device': device(type='cuda', index=0), 'pin_memory': False}
```
It looks like a bug in dynamo, and you can reproduce it like this:
```python
import torch
def model(shape, generator):
    return torch.randn(shape, generator=generator, device="cuda:0")
model = torch.compile(model)
x = model((1, 3, 64, 64), None)
print(x)
```
The error occurs because `None` is passed as `generator`, and dynamo traces `torch.randn` into the FX node `torch.ops.aten.randn.generator`.
`aten.randn.generator` is not handled by a decomposition; instead it is handled by a lowering in [torch/_inductor/lowering.py](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/lowering.py#L1815), where randn.generator is processed like this:
```python
@register_lowering(aten.randn)
def randn(*args, **kwargs):
    if kwargs.get("generator", None) is not None:
        return fallback_randn_generator(*args, **kwargs)
    elif config.fallback_random:
        return fallback_randn_default(*args, **kwargs)
    raise AssertionError("should have been handled in replace_random.py")
```
As you can see, because `generator` is None, it does not go into `fallback_randn_generator`, and unless `config.fallback_random` is enabled it does not go into `fallback_randn_default` either. Actually, when `generator` is None the call could simply be handled as `aten.randn.default`. As it stands, the AssertionError is thrown; I won't discuss here how to fix that and will open a separate issue.
`config.fallback_random` offers a way to debug random ops (see [config.py](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/config.py#L190)), so I tried enabling it to debug my model. But when I enable it with:
```python
# fallback to eager for random/dropout, this is slow but useful for debugging
fallback_random = True
```
Another error occurs!
```
LoweringException: RuntimeError: Unknown keyword argument 'generator' for operator 'aten::randn'. Schema: aten::randn(SymInt[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor
```
Obviously, `aten::randn` does not accept a `generator` keyword argument, so it should be popped from `kwargs` before they are fed into `fallback_randn_default`.
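A minimal sketch of that fix, reusing the names from the lowering snippet above (not necessarily the exact patch that landed):
```python
@register_lowering(aten.randn)
def randn(*args, **kwargs):
    if kwargs.get("generator", None) is not None:
        return fallback_randn_generator(*args, **kwargs)
    # A None generator carries no information; drop it so the eager
    # fallback (aten::randn) doesn't see an unknown keyword argument.
    kwargs.pop("generator", None)
    if config.fallback_random:
        return fallback_randn_default(*args, **kwargs)
    raise AssertionError("should have been handled in replace_random.py")
```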
That's all I'm going to say. Thanks for reading carefully.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112240
Approved by: https://github.com/jansel
This PR:
- Moves TrueDiv, LShift, RShift, IsNonOverlappingAndDenseIndicator to `_sympy.functions.py`
- Moves SymNode to `fx.experimental.sym_node`.
  - This file does not have any SymPy dependencies at import time.
  - It installs the magic methods in Sym{Bool,Int,Float}.
  - N.b. With this split, we may be able to move Sym{Bool,Int,Float} to this file, and remove quite a few of the hacks around these classes.
- Imports `sym_node` in `torch/__init__.py` rather than the whole `symbolic_shapes.py`. This breaks the import-time dependency between torch and SymPy.
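A quick sanity check of the last point (assuming no other part of torch still imports SymPy eagerly):
```python
import sys
import torch  # noqa: F401

# With the import-time dependency broken, SymPy should only be loaded
# lazily, once symbolic shapes are actually used.
print("sympy" in sys.modules)  # expected: False
```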
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112037
Approved by: https://github.com/peterbell10
ghstack dependencies: #112035, #112036
Tracks https://github.com/pytorch/pytorch/issues/98161
Complex number support in PyTorch isn't ideal today, as complex operations mostly end up handled by the aten runtime, except for `torch.angle`, which is handled in [105609](https://github.com/pytorch/pytorch/pull/105609). In general, a better way to handle this could be to first decompose complex operations, so that more opportunities for fusion are unveiled, and then have Triton take care of non-contiguous (strided) tensor operations more efficiently. This change adds support for decomposing complex additions; the resulting Triton kernel for a complex add is a plain elementwise add over the underlying real values:
```
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 6
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask)
    tmp1 = tl.load(in_ptr1 + (x0), xmask)
    tmp2 = tmp0 + tmp1
    tl.store(out_ptr0 + (x0), tmp2, xmask)
```
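Hypothetically, a repro along these lines should produce a kernel of the shape above (two complex64 tensors of 3 elements, i.e. 6 real values, hence `xnumel = 6`):
```python
import torch

def f(a, b):
    return a + b  # complex addition, decomposed into real/imag adds

a = torch.randn(3, dtype=torch.complex64, device="cuda")
b = torch.randn(3, dtype=torch.complex64, device="cuda")
out = torch.compile(f)(a, b)
```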
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110740
Approved by: https://github.com/jansel
Improves perf of llama_v2 locally from 1.55 -> 1.57
The initial heuristic is to lower to pointwise if the number of inputs is <= 4 and every input is either pointwise or cannot be memory-planned away, or if all of the outputs are pointwise.
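One reading of that heuristic as pseudocode (the predicates are illustrative names, not real Inductor APIs):
```python
def should_lower_cat_to_pointwise(inputs, outputs):
    # Hypothetical helpers: is_pointwise(...) and can_be_memory_planned_away(...)
    few_inputs = len(inputs) <= 4
    inputs_ok = all(
        is_pointwise(i) or not can_be_memory_planned_away(i) for i in inputs
    )
    outputs_pointwise = all(is_pointwise(o) for o in outputs)
    return (few_inputs and inputs_ok) or outputs_pointwise
```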
The perf run was +3% on inference. There are definitely instances where we should be lowering to foreach kernels instead, but those are less flexible for fusion. The motivating example was:
```
def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin):
    iota = torch.ops.prims.iota.default(512, start = 0, step = 1, dtype = torch.int64, device = device(type='cuda', index=0), requires_grad = False)
    # File: /scratch/eellison/work/torchdynamo/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py:657, code: position_ids = position_ids.unsqueeze(0).view(-1, seq_length)
    unsqueeze = torch.ops.aten.unsqueeze.default(iota, 0)
    position_ids = torch.ops.aten.reshape.default(unsqueeze, [-1, 512]); unsqueeze = None
    # The first two dimensions of cos and sin are always 1, so we can `squeeze` them.
    cos = cos.squeeze(1).squeeze(0)  # [seq_len, dim]
    sin = sin.squeeze(1).squeeze(0)  # [seq_len, dim]
    cos = cos[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim]
    sin = sin[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim]
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
```
I'm also not sure whether I should be more worried about concatenating reduction -> pointwise inputs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111233
Approved by: https://github.com/Chillee
This PR optimizes fusion for cases like layer_norm + fp8 quant (which includes the amax computation and the fp8 cast) when the amax reduction is split into multiple reduction kernels.
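For context, a rough sketch of the pattern being fused (assumed shapes, dtypes, and a simplified scaling recipe, not the exact test code):
```python
import torch
import torch.nn.functional as F

def ln_fp8(x, weight, bias):
    y = F.layer_norm(x, x.shape[-1:], weight, bias)
    amax = y.abs().amax()  # the reduction that may be split into multiple kernels
    fp8_max = torch.finfo(torch.float8_e5m2).max
    scale = fp8_max / amax
    y_fp8 = (y * scale).clamp(-fp8_max, fp8_max).to(torch.float8_e5m2)
    return y_fp8, scale
```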
Benchmark:
```
python test/inductor/test_fp8.py -k test_layernorm_fp8_quant_benchmark
Before this PR:
Config: float8_dtype=torch.float8_e5m2, shape=(4, 2048, 4096).
Benchmark results: Inductor: 0.13262102689486555ms, Eager: 0.8211962616822429ms, LN only Inductor: 0.09606276150627614ms.
After this PR:
Config: float8_dtype=torch.float8_e5m2, shape=(4, 2048, 4096).
Benchmark results: Inductor: 0.08281274131274131ms, Eager: 0.8217452830188678ms, LN only Inductor: 0.09586902286902287ms.
```
LN + fp8 quant is even faster than LN itself. The reason could be that LN + fp8 outputs fp8 while LN outputs fp16.
From the Inductor nightly benchmark test:
There are perf differences in the cuda_graph / cuda_graph_dynamic / default runs, but no difference in inductor_max_autotune, so the differences are most likely just run-to-run fluctuation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111122
Approved by: https://github.com/jansel
Summary: Unbacked SymInts can't get a `sizevars.size_hint` because they are data-dependent. #109893 added a new `fallback` parameter to `sizevars.size_hint` to specify the fallback value in cases like unbacked SymInts; in this PR we add the fallback to more call sites.
Test Plan: CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110520
Approved by: https://github.com/jansel, https://github.com/ezyang