pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Aaron Gokaslan	5a1216bb2e	[BE]: Update ruff to 0.4.1 (#124549 ) Update ruff to 0.4.1. This version fixes a lot of false negatives/false positives, is 20-40% faster, and has various other bug fixes. Below is a before and after table showing the execution time of ruff lint and ruff format in milliseconds, courtesy of https://astral.sh/blog/ruff-v0.4.0 \| Repository \| Linter (v0.3) \| Linter (v0.4) \| Formatter (v0.3) \| Formatter (v0.4) \| \|----------------------------------------------------\|---------------\|---------------\|------------------\|------------------\| \| [pytorch/pytorch](https://github.com/pytorch/pytorch) \| 328.7 \| 251.8 \| 351.1 \| 274.9 \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/124549 Approved by: https://github.com/ezyang	2024-04-21 14:06:23 +00:00
Edward Z. Yang	f34905f61d	Assert that TracingContext is available when set_example_value is called (#124284 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124284 Approved by: https://github.com/Chillee ghstack dependencies: #124105, #124059, #124176, #124283	2024-04-21 11:23:13 +00:00
Edward Z. Yang	0e6367dd44	Factor var_to_range assignments to _update_var_to_range helper (#124283 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124283 Approved by: https://github.com/IvanKobzarev ghstack dependencies: #124105, #124059, #124176	2024-04-21 11:23:13 +00:00
Colin Peppler	cbf420b67a	[inductor] for UserDefinedTritonKernels don't mark all inputs as mutating (#124425 ) Take this example: ``` def _mul2(x): y = torch.empty_like(x) mul2_kernel[(10,)]( in_ptr0=x, out_ptr=y, n_elements=x.numel(), BLOCK_SIZE=1, ) return y def f(x): for _ in range(4): x = _mul2(x) return x + 1 ``` Currently, the codegen will show up like this. Notice, how we allocate 5 buffers of the same size. ``` # Source Nodes: [triton_kernel_wrapper_mutation], Original ATen: [] buf0 = empty_strided_cuda((10, ), (1, ), torch.float32) mul2_kernel_0.run(in_ptr0=arg0_1, out_ptr=reinterpret_tensor(buf0, (10, ), (1, ), 0) ...) # Source Nodes: [triton_kernel_wrapper_mutation_1], Original ATen: [] buf4 = empty_strided_cuda((10, ), (1, ), torch.float32) mul2_kernel_0.run(in_ptr0=reinterpret_tensor(buf0, (10, ), (1, ), 0), out_ptr=reinterpret_tensor(buf4, (10, ), (1, ), 0) ...) # Source Nodes: [triton_kernel_wrapper_mutation_2], Original ATen: [] buf8 = empty_strided_cuda((10, ), (1, ), torch.float32) mul2_kernel_0.run(in_ptr0=reinterpret_tensor(buf4, (10, ), (1, ), 0), out_ptr=reinterpret_tensor(buf8, (10, ), (1, ), 0) ...) # Source Nodes: [triton_kernel_wrapper_mutation_3], Original ATen: [] buf12 = empty_strided_cuda((10, ), (1, ), torch.float32) mul2_kernel_0.run(in_ptr0=reinterpret_tensor(buf8, (10, ), (1, ), 0), out_ptr=reinterpret_tensor(buf12, (10, ), (1, ), 0) ...) # Source Nodes: [add], Original ATen: [aten.add] buf16 = empty_strided_cuda((10, ), (1, ), torch.float32) triton_poi_fused_add_0.run(buf12, buf16, 10, grid=grid(10), stream=stream0)...) return (buf16, ) ``` With this PR, we want to see this. Notice, how we only allocate 2 buffers this time. The other 3 buffers are re-used. ``` # Source Nodes: [triton_kernel_wrapper_mutation], Original ATen: [] buf0 = empty_strided_cuda((10, ), (1, ), torch.float32) mul2_kernel_0.run(in_ptr0=arg0_1, out_ptr=reinterpret_tensor(buf0, (10, ), (1, ), 0), ...) 
del arg0_1 # Source Nodes: [triton_kernel_wrapper_mutation_1], Original ATen: [] buf2 = empty_strided_cuda((10, ), (1, ), torch.float32) mul2_kernel_0.run(in_ptr0=reinterpret_tensor(buf0, (10, ), (1, ), 0), out_ptr=reinterpret_tensor(buf2, (10, ), (1, ), 0) ...) # Source Nodes: [triton_kernel_wrapper_mutation_2], Original ATen: [] buf4 = buf0; del buf0 # reuse mul2_kernel_0.run(in_ptr0=reinterpret_tensor(buf2, (10, ), (1, ), 0), out_ptr=reinterpret_tensor(buf4, (10, ), (1, ), 0) ...) # Source Nodes: [triton_kernel_wrapper_mutation_3], Original ATen: [] buf6 = buf2; del buf2 # reuse mul2_kernel_0.run(in_ptr0=reinterpret_tensor(buf4, (10, ), (1, ), 0), out_ptr=reinterpret_tensor(buf6, (10, ), (1, ), 0) ...) del buf4 # Source Nodes: [add], Original ATen: [aten.add] buf8 = buf6; del buf6 # reuse triton_poi_fused_add_0.run(buf8, 10, grid=grid(10), stream=stream0) return (buf8, ) ``` Differential Revision: [D56379307](https://our.internmc.facebook.com/intern/diff/D56379307) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124425 Approved by: https://github.com/oulgen	2024-04-21 06:00:14 +00:00
Yanbo Liang	0d90d4d613	[Dynamo] Fix NamedTuple hasattr bug (#124531 ) Fixes #124402 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124531 Approved by: https://github.com/jansel	2024-04-21 04:36:22 +00:00
FFFrog	d6f88105ce	Fix the problem about load_state_dict with unexpected key whose prefix matches a valid key (#124385 ) Fixes https://github.com/pytorch/pytorch/issues/123510 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124385 Approved by: https://github.com/mikaylagawarecki	2024-04-20 23:19:25 +00:00
Edward Z. Yang	afa78ad08c	Call writeline from writelines (#124515 ) This makes it more convenient to add a breakpoint here. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124515 Approved by: https://github.com/albanD	2024-04-20 15:45:30 +00:00
Animesh Jain	a32eac345f	[dynamo] Return gm.forward for eager backend (#124109 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124109 Approved by: https://github.com/yanboliang, https://github.com/jansel ghstack dependencies: #124445	2024-04-20 14:11:05 +00:00
Animesh Jain	febc4d8759	[dynamo][easy] forbid_in_graph check to use getattr_static (#124445 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124445 Approved by: https://github.com/yanboliang, https://github.com/jansel	2024-04-20 14:11:05 +00:00
Yanbo Liang	a3e3693afc	[Dynamo] Fix missing bracket in ListVariable (#124532 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/124532 Approved by: https://github.com/williamwen42	2024-04-20 08:26:30 +00:00
Michael Lazos	0d0b5b2655	Enable dynamo rosenbrock sparse tests (#124542 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124542 Approved by: https://github.com/yf225 ghstack dependencies: #124540, #124541	2024-04-20 05:54:41 +00:00
Michael Lazos	184f16016e	Enable dynamo-traced deepcopy test for RMSprop (#124541 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124541 Approved by: https://github.com/yf225 ghstack dependencies: #124540	2024-04-20 05:54:41 +00:00
Michael Lazos	6a730698e2	Enable dynamo-traced Adamax tests (#124540 ) Enabling tests related to https://github.com/pytorch/pytorch/issues/121178 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124540 Approved by: https://github.com/yf225	2024-04-20 05:54:41 +00:00
drisspg	f1cbaf1764	Adds LSE output for templated-attention-hop if inputs require grad (#124308 ) Adds LSE output for templated-attention-hop if inputs require grad Prep PR for adding autograd support to templated-attention-hop. The kernel needs to output the LSE during the forward which will be used during backwards. ### Output code https://gist.github.com/drisspg/2aea3ce5db75811e7e143eeecb774d8a ## Before \| Type \| Speedup \| batch_size \| num_heads \| q_seq_len \| k_seq_len \| head_dim \| score_mod \| dtype \| \|---------\|-----------\|--------------\|-------------\|-------------\|-------------\|------------\|---------------\|----------------\| \| Average \| 1.159 \| \| \| \| \| \| \| \| \| Max \| 1.342 \| 16 \| 16 \| 512 \| 512 \| 64 \| noop \| torch.bfloat16 \| \| Min \| 1.016 \| 1 \| 16 \| 512 \| 512 \| 64 \| relative_bias \| torch.bfloat16 \| ## After Type \| Speedup \| batch_size \| num_heads \| q_seq_len \| k_seq_len \| head_dim \| score_mod \| dtype \| \|---------\|-----------\|--------------\|-------------\|-------------\|-------------\|------------\|-------------\|----------------\| \| Average \| 1.155 \| \| \| \| \| \| \| \| \| Max \| 1.339 \| 16 \| 16 \| 512 \| 512 \| 64 \| noop \| torch.bfloat16 \| \| Min \| 1.009 \| 1 \| 16 \| 512 \| 512 \| 64 \| head_bias \| torch.bfloat16 \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/124308 Approved by: https://github.com/Chillee	2024-04-20 05:45:56 +00:00
Oguz Ulgen	0d64b82f0b	Make CompiledFxGraph portable between machines (#124438 ) As we prepare FxGraphCache to move to remote, we need to make sure there's no data that is on the disk. Differential Revision: [D56363808](https://our.internmc.facebook.com/intern/diff/D56363808) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124438 Approved by: https://github.com/jansel	2024-04-20 05:26:14 +00:00
Shunting Zhang	c5a4ba2257	[inductor] consider pointwise nodes when deciding reduction hint (#124131 ) In certain rare scenarios, inductor can generate a reduction kernel with really bad perf. E.g., if - the reduction kernel contains a reduction node followed by a pointwise node - the pointwise node uses a transposed layout - the reduction node is an inner reduction - and rnumel <= 1024, then inductor will generate a persistent reduction kernel, which causes really bad perf when doing tl.store for the pointwise node since we use a very skinny tile `(XBLOCK=1, RBLOCK=next_power_of_2(rnumel))`. I've tried a few versions of the fix. - In the first version, if any pointwise node in a reduction kernel uses a non-contiguous dependency, we use ReductionHint.DEFAULT. This causes an 8s compilation time increase for huggingface with no perf wins... The reason is that ReductionHint.DEFAULT does more autotuning. - Then I changed the code to be more specific. We change the hint from INNER to DEFAULT if we are sure that the pointwise kernel can use a >1 stride for the lowest dimension. Kernels that meet this condition should mostly have really bad perf anyway. The situation mentioned above is rare, but it was reported by internal users. I'll also run one more perf test. Testing script: https://gist.github.com/shunting314/9d3389891fa43633b49b8b7564ad6d8b . Something equivalent is also added as a unit test. For this specific test from user reports, we improve the mentioned reduction kernel's perf by 4.14x (451us -> 109us) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124131 Approved by: https://github.com/jansel	2024-04-20 05:07:56 +00:00
Xiaodong Wang	57f64197f3	Reduce warning msg in torch.profiler (#124469 ) Summary: This is actually quite noisy and my logs are full of this soft assertion msg. Maybe making it log once? Test Plan: On AMD GPU side, I got a lot of those warnings: ``` W0415 01:40:45.109864 917160 collection.cpp:602] Warning: Memcpy ? (? -> ?) (function operator())” ``` So just suppress the excessive logs Reviewed By: aaronenyeshi, yoyoyocmu Differential Revision: D55602788 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124469 Approved by: https://github.com/aaronenyeshi	2024-04-20 04:45:12 +00:00
Scott Wolchok	3d8b903d95	[PyTorch] Remove ArrayRefTensor::numel_ (#124516 ) ArrayRefTensor::numel_ is redundant with the size of the contained MiniArrayRef. Reclaiming the space entirely would break ABI compatibility, but at least we have 4-8 bytes for future expansion. Differential Revision: [D56366829](https://our.internmc.facebook.com/intern/diff/D56366829/) NOTE FOR REVIEWERS: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D56366829/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/124516 Approved by: https://github.com/chenyang78, https://github.com/desertfire	2024-04-20 02:44:20 +00:00
Andrew Gu	f9fce110af	[FSDP2][ez] Removed error check for swap tensors flag (#124513 ) Since `DTensor` uses `swap_tensors` path automatically now, we can remove this check for the global flag. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124513 Approved by: https://github.com/weifengpy ghstack dependencies: #124319, #120256	2024-04-20 00:46:36 +00:00
Andrew Gu	1c2cb36811	[FSDP2] Added CPU offloading (#120256 ) #### Overview This PR adds CPU offloading via the `offload_policy: OffloadPolicy` argument. - We incur one H2D copy for each parameter before all-gather. - We incur one D2H copy for each gradient after reduce-scatter. - We run optimizer on CPU. #### Example (Mixed Precision and CPU Offloading) This example uses a small 125M numel model, which is not too representative. We can try to run with a larger model like Llama-7B. However, since the current optimizer step is already too slow, we may want to patch a faster CPU optimizer. Forward ![Screenshot 2024-02-21 at 10 36 29 AM](https://github.com/pytorch/pytorch/assets/31054793/00ed95db-3a55-49bb-ac98-9b9162feaacd) ![Screenshot 2024-02-21 at 10 39 12 AM](https://github.com/pytorch/pytorch/assets/31054793/10e29854-1907-4001-b3dc-aab6c3bf153c) Backward ![Screenshot 2024-02-21 at 10 37 47 AM](https://github.com/pytorch/pytorch/assets/31054793/7039cb2e-eb78-4f53-b83f-67bae61ebddd) ![Screenshot 2024-02-21 at 10 38 44 AM](https://github.com/pytorch/pytorch/assets/31054793/e34615d6-6b6b-4995-aef1-9c7563034799) Overall CPU (CPU optimizer step dominates) ![Screenshot 2024-02-21 at 10 39 47 AM](https://github.com/pytorch/pytorch/assets/31054793/7a2a929a-3a40-4b35-891b-016cf57e8079) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120256 Approved by: https://github.com/weifengpy ghstack dependencies: #124319	2024-04-20 00:42:58 +00:00
soulitzer	cf5ca58e7f	[NJT] Inline through torch.nested.nested_tensor_from_jagged instead of graph break (#124343 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124343 Approved by: https://github.com/jbschlosser	2024-04-19 23:13:59 +00:00
Laith Sakka	acbf888a13	rename sl to strobelight (#124455 ) Summary: TORCH_COMPILE_SL_PROFILE ->TORCH_COMPILE_STROBELIGHT SL_MAX_STACK_LENGTH -> COMPILE_STROBELIGHT_MAX_STACK_LENGTH SL_MAX_PROFILE_TIME -> COMPILE_STROBELIGHT_MAX_PROFILE_TIME profile_with_sl() -> strobelight() compiletime_sl_profile_meta() -> compiletime_strobelight_meta() Test Plan: 1. run and verify ``` TORCH_COMPILE_STROBELIGHT=TRUE buck2 run @//mode/inplace @//mode/opt //caffe2/fb/strobelight:compiletime_profiler_example ``` 2. run and verify ``` buck2 run @//mode/inplace @//mode/opt //caffe2/fb/strobelight:function_profiler_example --local-only ``` 3. run and verify truncated stack for ``` TORCH_COMPILE_STROBELIGHT=TRUE COMPILE_STROBELIGHT_MAX_STACK_LENGTH=1 buck2 run @//mode/inplace @//mode/opt //caffe2/fb/strobelight:compiletime_profiler_example ``` 4. add infinite loop in _verify and verify samples for ``` COMPILE_STROBELIGHT_MAX_PROFILE_TIME=30 TORCH_COMPILE_STROBELIGHT=TRUE buck2 run @//mode/inplace @//mode/opt //caffe2/fb/strobelight:compiletime_profiler_example ``` Reviewed By: oulgen Differential Revision: D56327139 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124455 Approved by: https://github.com/oulgen	2024-04-19 22:50:13 +00:00
PyTorch MergeBot	0feab7d6c3	Revert "Build device generic torch.Stream and torch.Event based on c10::Stream/Event (#123611 )" This reverts commit `cb17721899`. Reverted https://github.com/pytorch/pytorch/pull/123611 on behalf of https://github.com/jeffdaily due to This broke ROCm. see test_overrides.py ([comment](https://github.com/pytorch/pytorch/pull/123611#issuecomment-2067363780))	2024-04-19 22:44:26 +00:00
PyTorch MergeBot	929242a15c	Revert "torch.mtia module for MTIA device backend (#123612 )" This reverts commit `d7e1bf9ff9`. Reverted https://github.com/pytorch/pytorch/pull/123612 on behalf of https://github.com/jeffdaily due to This broke ROCm. see test_overrides.py ([comment](https://github.com/pytorch/pytorch/pull/123611#issuecomment-2067363780))	2024-04-19 22:44:26 +00:00
cdzhan	f8f7cfbeee	Add __torch_function__ support for generated tensor methods/property of PrivateUse1 (#121723 ) support following case: ```python import torch ... class CustomFooTensor(torch.Tensor): @classmethod def __torch_function__(cls, func, types, args=(), kwargs=None): ... a = CustomFooTensor([3]) print(a.is_foo) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121723 Approved by: https://github.com/albanD	2024-04-19 22:34:34 +00:00
rzou	f0560f7b3b	[opcheck] Stop doing test_aot_dispatch_static by default (#124495 ) Motivations: - this is pretty redundant with test_aot_dispatch_dynamic. - The user story for opcheck is that a user should use opcheck to see if their operator was "registered correctly". If a user's custom op only supports dynamic shapes, then it's a bit awkward for one of the tests (e.g. `test_aot_dispatch_static`) to fail. - We've already stopped running test_aot_dispatch_static in all of our opcheck tests. Test Plan: - wait for CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/124495 Approved by: https://github.com/williamwen42 ghstack dependencies: #124180, #124200, #124299, #124134, #124199, #124403, #124414	2024-04-19 21:57:22 +00:00
rzou	37d18966ea	[custom_op] set some tags when constructing the op (#124414 ) - the op is automatically "pt2-compliant" - In general we want to turn on needs_fixed_stride_order for all custom ops, but this needs some more work, so we're just going to turn it on for the new custom op API. Test Plan: - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/124414 Approved by: https://github.com/albanD ghstack dependencies: #124180, #124200, #124299, #124134, #124199, #124403	2024-04-19 21:57:22 +00:00
Andrew Gu	1900f79b72	[FSDP2] Added `set_reshard_after_backward` (#124319 ) This PR adds a `set_reshard_after_backward` method to allow disabling resharding after backward. `reshard_after_backward=False` can be used with `reshard_after_forward=False` to implement "ZeRO-1", where there is only all-gather on the first microbatch forward and reduce-scatter on the last microbatch backward. ``` for microbatch_idx, microbatch in enumerate(dataloader): is_last_microbatch = microbatch_idx == num_microbatches - 1 model.set_requires_gradient_sync(is_last_microbatch) model.set_reshard_after_backward(is_last_microbatch) model.set_is_last_backward(is_last_microbatch) microbatch_fwd_bwd(model, microbatch, microbatch_idx) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/124319 Approved by: https://github.com/weifengpy	2024-04-19 21:49:35 +00:00
Pian Pawakapan	10b9d4d19c	[export] handle Dim.lower = 0, 1 for ep.run_decompositions() (#123602 ) Summary: With pre-dispatch export and ep.run_decompositions(), range constraints are updated by looking at ShapeEnv.var_to_range. However, the lower bounds on these may be incorrect: analysis of un-specialized symbols is done with a lower bound of 2, which can mismatch user-specified bounds (which may be 0 or 1). This updates `_get_updated_range_constraints()` to use the old range constraints if possible. Test Plan: Existing pre-dispatch/dynamic shapes test case. Differential Revision: D55899872 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123602 Approved by: https://github.com/tugsbayasgalan	2024-04-19 21:29:36 +00:00
eellison	000d55870a	Enable in oss (#124031 ) Biggest movement is 4% HF inference, 9% TIMM inference. Note, this is max-autotune mode so we are more tolerant of compilation increases. We could improve compilation time by limiting: ``` # Take how many of the top triton kernels to benchmark epilogue max_epilogue_benchmarked_choices = 3 ``` There is a hf_Whisper failure which you can repro on main without this stack with `TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS=TRITON TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --backend inductor --amp --accuracy --training --only hf_Whisper`. When you turn off epilogue fusion, it fixes the accuracy. I bisected the failure to an epilogue, however when you compare the results of that epilogue with the corresponding separate kernels the results of the output are equivalent. Inference: <img width="1686" alt="image" src="https://github.com/pytorch/pytorch/assets/11477974/0b240080-cd33-4c08-89d3-583103b1fb0c"> Training: <img width="1329" alt="Screenshot 2024-04-16 at 6 16 30 PM" src="https://github.com/pytorch/pytorch/assets/11477974/db0afcc9-7288-4c27-84ce-4fc1a5690788"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124031 Approved by: https://github.com/Chillee, https://github.com/shunting314 ghstack dependencies: #124030, #122642, #123229, #122825	2024-04-19 20:28:55 +00:00
eellison	179108f14d	Use separate flags for MultiTemplates from BenchmarkFusion (#122825 ) Two changes: - Make the flag for multi template buffer independent from benchmark fusion. While benchmark fusion can be useful, the compilation time/performance trade-offs are different than for just templates, which we'd like to enable by default. - Don't do MultiTemplateBuffers/benchmark-fusion for templates that have custom input gen fns (which currently exist only internally). Threading the custom input gen fns to benchmark fusion is NYI. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122825 Approved by: https://github.com/shunting314 ghstack dependencies: #124030, #122642, #123229	2024-04-19 19:50:42 +00:00
IvanKobzarev	73f56e1e81	[sym_shapes][perf] Do not calculate hint in advice_is_size (#124472 ) Differential Revision: [D56352412](https://our.internmc.facebook.com/intern/diff/D56352412) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124472 Approved by: https://github.com/ezyang	2024-04-19 19:10:24 +00:00
PyTorch MergeBot	f87c788a34	Revert "Capture triton kernel in execution trace (#124140 )" This reverts commit `89407eca3b`. Reverted https://github.com/pytorch/pytorch/pull/124140 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/124140#issuecomment-2067137104))	2024-04-19 19:05:44 +00:00
IvanKobzarev	761de37ab7	[sym_shape][perf] eval_static: guards, unbacked compute once (#124217 ) Differential Revision: [D56212345](https://our.internmc.facebook.com/intern/diff/D56212345) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124217 Approved by: https://github.com/ezyang	2024-04-19 19:03:04 +00:00
Zhuoran Zhao	b0d83726bd	[5/x][AMD][Lowering Enablement] Hipifying aoti code_wrapper (#124241 ) Summary: as title Test Plan: CI & unit test patch on top of https://www.internalfb.com/phabricator/paste/view/P1214895953 to test Differential Revision: D56223917 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124241 Approved by: https://github.com/jansel, https://github.com/desertfire	2024-04-19 18:57:38 +00:00
rzou	25c65d6642	Change register_autograd to reflect ordering of setup_context and backward (#124403 ) old: `register_autograd(setup_context, backward, /)` new: `register_autograd(backward, /, *, setup_context=None)` Motivations: - We introduce these APIs as "give us a backward and use setup_context to save things for backward". - setup_context isn't always necessary. Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/124403 Approved by: https://github.com/albanD ghstack dependencies: #124180, #124200, #124299, #124134, #124199	2024-04-19 17:56:30 +00:00
rzou	a8e17b2d4d	Move schema inference to torch._library (#124199 ) After this PR, we can delete torch._custom_op/torch._custom_ops (except that external libraries depend on them). Pull Request resolved: https://github.com/pytorch/pytorch/pull/124199 Approved by: https://github.com/albanD ghstack dependencies: #124180, #124200, #124299, #124134	2024-04-19 17:56:30 +00:00
rzou	a78450a00b	Excise uses of the old custom ops APIs (#124134 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124134 Approved by: https://github.com/albanD ghstack dependencies: #124180, #124200, #124299	2024-04-19 17:56:26 +00:00
eellison	9489019085	Small fixes for deferred epilogue (#123229 ) Two small fixes: - preserve rng around compile_fx_inner - Now that we precompile in the background while lowering multiple templates in parallel, we can no longer allocate inputs at the beginning of the function because we would have multiple sets of inputs allocated at the same time. Instead, allocate them when needed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123229 Approved by: https://github.com/shunting314 ghstack dependencies: #124030, #122642	2024-04-19 17:41:29 +00:00
eellison	39fc280dce	Dont precompile already seen keys, limit epilogue choices (#122642 ) Two changes: - in epilogue benchmark fusion, only take top 6 choices. There were basically no choices taken after this in HF. - Share a single precompilation function among matmuls with same key. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122642 Approved by: https://github.com/shunting314 ghstack dependencies: #124030	2024-04-19 17:34:22 +00:00
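The two changes above can be pictured with a short sketch. Everything here is an assumption made for illustration (the names `precompile_once`, `top_choices`, and `MAX_EPILOGUE_CHOICES`, and the use of a plain set of seen keys); it does not reproduce inductor's actual code.

```python
MAX_EPILOGUE_CHOICES = 6  # "top 6" comes from the commit; the name is ours

_seen_keys = set()


def precompile_once(key, compile_fn):
    """Run compile_fn(key) only the first time a key is seen."""
    if key in _seen_keys:
        return False  # already handled by an earlier matmul with the same key
    _seen_keys.add(key)
    compile_fn(key)
    return True


def top_choices(timings):
    """Keep the fastest choices; `timings` maps choice name -> milliseconds."""
    return sorted(timings, key=timings.get)[:MAX_EPILOGUE_CHOICES]


compiled = []
first = precompile_once(("mm", 64, 64), compiled.append)
second = precompile_once(("mm", 64, 64), compiled.append)

timings = {f"choice{i}": float(i) for i in range(8)}
kept = top_choices(timings)
```

Here the second matmul with the same key triggers no extra compilation, and only the six fastest of the eight candidate choices go on to epilogue benchmarking.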
JackCaoG	7ae835eee4	Enable SourcelessBuilder to build GraphModule generated by make_fx (#123673 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123673 Approved by: https://github.com/ezyang, https://github.com/anijain2305 ghstack dependencies: #123680	2024-04-19 17:23:51 +00:00
Michael Lazos	68a027f144	Fixes for 123400 (#123406 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123406 Approved by: https://github.com/janeyx99 ghstack dependencies: #123324, #123404, #123405, #124309	2024-04-19 17:20:57 +00:00
Michael Lazos	5050e627dc	Defer marking_static_address (#124309 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124309 Approved by: https://github.com/anijain2305 ghstack dependencies: #123324, #123404, #123405	2024-04-19 17:20:57 +00:00
Michael Lazos	1531a29fb9	Enable tests related to 116061 (#123405 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123405 Approved by: https://github.com/janeyx99 ghstack dependencies: #123324, #123404	2024-04-19 17:20:54 +00:00
Michael Lazos	406d99e46c	Fix for 117147 (#123404 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123404 Approved by: https://github.com/Skylion007, https://github.com/janeyx99 ghstack dependencies: #123324	2024-04-19 17:20:50 +00:00
Michael Lazos	203d111c54	Enable dynamo test_forloop_goes_right_direction_multi_gpu (#123324 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123324 Approved by: https://github.com/janeyx99	2024-04-19 17:20:41 +00:00
ydwu4	293f756cdc	Support aot_export torchbind op (#123370 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123370 Approved by: https://github.com/zou3519 ghstack dependencies: #123367	2024-04-19 17:17:27 +00:00
ydwu4	e62169a8fa	Support torchbind op dispatch in python (#123367 ) We override the `__call__` method and register fake, functional, proxy default dispatch mode implementations in its python_key_mode_table. The idea is: 1. when the inputs contain a FakeScriptObject, we dispatch through the _get_dispatch mechanism. We implement dispatch mode keys automatically in the operator's constructor. 2. when the inputs are not fakified, we dispatch through the original c++ dispatcher. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123367 Approved by: https://github.com/zou3519	2024-04-19 17:17:27 +00:00
eellison	136f8378e1	Re-land precompile triton templates (#124030 ) Re-land precompile triton templates. This got reverted because we were precompiling templates without checking the cache. I have since added logic and a test to ensure we do not precompile if there is a cache hit. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124030 Approved by: https://github.com/shunting314, https://github.com/nmacchioni, https://github.com/yoyoyocmu	2024-04-19 17:03:33 +00:00
rzou	bad8d25881	Add torch.library.register_kernel (#124299 ) This mirrors the .register_kernel method on the object produced by the custom_op decorator. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/124299 Approved by: https://github.com/albanD ghstack dependencies: #124180, #124200	2024-04-19 13:54:21 +00:00
rzou	3918dfedc5	[custom_op] Rename register_impl to register_kernel (#124200 ) Motivation: - The API is used for registering an implementation for a specific device type. - "impl" is ambiguous and can be confused with Library.impl. Test Plan: - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/124200 Approved by: https://github.com/albanD ghstack dependencies: #124180	2024-04-19 13:54:21 +00:00
rzou	22a2f676c3	[custom_op] add ability to provide manual schema (#124180 ) Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/124180 Approved by: https://github.com/albanD	2024-04-19 13:54:13 +00:00
GdoongMathew	8b1ad51881	Better Error Message in `ChainedScheduler` and `SequentialLR` (#121633 ) Fixes #121577 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121633 Approved by: https://github.com/janeyx99	2024-04-19 13:37:41 +00:00
Jesse Cai	c9db59e9e4	[sparse] Add fast semi-structured sparsification kernels (#122350 ) This PR adds fast semi-structured sparsification kernels to PyTorch. These kernels allow for accelerated semi-structured sparsification in PyTorch. The kernels have been added as aten native functions. In particular, three new functions have been added: * `torch._sparse_semi_structured_tile` This function will return the packed representation and metadata for both X and X', as well as the thread masks. Note that this applies 2:4 sparsity in a 4x4 tile instead of a 1x4 strip as usual. * `torch._sparse_semi_structured_apply` This function takes in an input tensor and thread masks from the above function and returns a packed representation and metadata from applying thread masks to the input tensor. * `torch._sparse_semi_structured_apply_dense` This function does the same thing as above, but instead of returning the tensor in the sparse representation it returns it in the dense representation. The subclasses have also been updated to add a new `prune_dense_static_sort` classmethod to create sparse tensors with this format. I've added some additional documentation on how to calculate the compressed tensors needed to create a SparseSemiStructuredTensor oneself. To this end, there are two new helper functions added: `sparse_semi_structured_tile` `compute_compressed_swizzled_bitmask` Differential Revision: [D56190801](https://our.internmc.facebook.com/intern/diff/D56190801) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122350 Approved by: https://github.com/cpuhrsch	2024-04-19 13:31:58 +00:00
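For readers unfamiliar with 2:4 sparsity, the usual 1x4-strip pattern mentioned above can be sketched in plain Python. This is only an illustration of the pattern, not of the new kernels (which work on 4x4 tiles and produce packed data and metadata). The helper name `prune_2_of_4` is hypothetical, and keeping the two largest-magnitude values per group is an assumed selection rule.

```python
def prune_2_of_4(row):
    """Zero the two smallest-magnitude values in each group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    out = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out


pruned = prune_2_of_4([0.1, -0.9, 0.5, 0.2, 3.0, 0.0, -1.0, 2.0])
# pruned == [0.0, -0.9, 0.5, 0.0, 3.0, 0.0, 0.0, 2.0]
```

Each group of four ends up with exactly two non-zero slots, which is the structure that semi-structured sparse matmul hardware relies on.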
Cen Zhao	96724a769b	[ptd] drop ncclGroupStart/end for ncclCommInit (#124363 ) (#124416 ) Summary: ``` ncclGroupStart() ncclCommInit(..) ncclGroupEnd() ``` The above pattern is only needed when a single thread manages multiple GPUs. In our case, we always have 1 process managing 1 GPU, so we don't need the group operation. Test Plan: CI Differential Revision: D56274975 Co-authored-by: Cen Zhao <cenzhao@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124416 Approved by: https://github.com/shuqiangzhang	2024-04-19 13:12:42 +00:00
chilli	8e280862ff	Add custom joint graph passes (#124443 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124443 Approved by: https://github.com/aorenste, https://github.com/malfet	2024-04-19 11:54:46 +00:00
Jane Xu	b412b75b42	[optim] add fused_adam/adamw_kernel support for CPU device (#123074 ) On par with `CUDA` implementation. The `autocast` logic is the same as for `CUDA` + `Fused Adam`: - check inf in `gradscalar.step` - In fused kernel, if there is `inf`, do nothing. If not, unscale the grad (also writing it back) and update the param. TestPlan: ``` # extend CUDA only test for CPU fused adagrad python test_optim.py -k test_fused_matches_forloop python test_optim.py -k test_fused_large_tensor python test_torch.py -k test_grad_scaling_autocast_fused # extend fused test python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step python test_optim.py -k test_can_load_older_state_dict # newly added test (follow `6b1f13ea2f/test/test_cuda.py (L1108)`) python test_optim.py -k test_grad_scaling_autocast_fused_optimizers ``` Benchmark: 5.1x on 56 core SPR Parameter-size=1M Nparams=10 [test script](https://gist.github.com/zhuhaozhe/ef9a290ad3f8f4067b3373a3bdaa33e7) ``` numactl -C 0-55 -m 0 python bench_adam.py non-fused 6.0174267292022705 s fused 1.1787631511688232 s ``` Note: Fused kernel accuracy The accuracy failure in CI shows a difference slightly above the default tolerance ``` 2024-04-02T06:09:16.2213887Z Mismatched elements: 21 / 64 (32.8%) 2024-04-02T06:09:16.2214339Z Greatest absolute difference: 1.5735626220703125e-05 at index (6, 6) (up to 1e-05 allowed) 2024-04-02T06:09:16.2214813Z Greatest relative difference: 1.0073336852656212e-05 at index (4, 1) (up to 1.3e-06 allowed) ``` I have debugged it step by step and unfortunately we may not be able to make the `fused kernel` exactly the same as the `non fused` one due to compiler optimizations. 
For example, in non-fused impl ``` exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2) ``` and in fused impl ``` exp_avg_sq_ptr[d] = scalar_t(beta2) * exp_avg_sq_ptr[d]; // std::cout << "exp_avg_sq " << exp_avg_sq_ptr[d] << std::endl; exp_avg_sq_ptr[d] = exp_avg_sq_ptr[d] + scalar_t(exp_avg_sq_grad_coefficient) * grad_val * grad_val; ``` If I keep `std::cout`, I can get exactly same results in UT ``` ===============param 0.6796758770942688 0.6796758770942688 ``` But when I comment out it, there will be a difference ``` ===============param 0.6796758770942688 0.6796759366989136 ``` So I will make the tolerance a little higher than default one. Co-authored-by: Jane Xu <janeyx@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123074 Approved by: https://github.com/jgong5, https://github.com/janeyx99	2024-04-19 11:14:04 +00:00
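The accuracy note in the commit above (the same `exp_avg_sq` update giving slightly different float32 results depending on how the compiler schedules the arithmetic) can be illustrated without PyTorch. This is a minimal sketch, assuming the difference comes from when intermediate values are rounded to float32. The helper names `f32`, `update_rounded_each_step`, and `update_rounded_once` are hypothetical and not part of the PR.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def update_rounded_each_step(exp_avg_sq: float, grad: float, beta2: float) -> float:
    # Assumption: every intermediate result is stored as float32,
    # as if each operation were a separate float32 instruction.
    scaled = f32(f32(beta2) * f32(exp_avg_sq))
    grad_sq = f32(f32(grad) * f32(grad))
    term = f32(f32(1.0 - beta2) * grad_sq)
    return f32(scaled + term)


def update_rounded_once(exp_avg_sq: float, grad: float, beta2: float) -> float:
    # Assumption: intermediates are kept at higher precision and the
    # result is rounded to float32 only once, at the end.
    b, v, g = f32(beta2), f32(exp_avg_sq), f32(grad)
    return f32(b * v + f32(1.0 - beta2) * g * g)


a = update_rounded_each_step(0.3, 0.7, 0.999)
b = update_rounded_once(0.3, 0.7, 0.999)
# The two schedules may disagree in the last bits, which is why an
# exact-equality test can fail while a slightly looser tolerance passes.
print(a, b, abs(a - b))
```

Both functions compute the same formula; only the rounding points differ. That is one plausible way a change as small as an added `std::cout` could shift the last digits, since it may change which intermediates the compiler keeps in registers.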
Boyuan Feng	9a71d12d92	[CUDAGraphTree] Support mutated inputs from prior cudagraph pool (#123231 ) # PR This PR supports mutating inputs in cudagraph trees, if these inputs are outputs from a previous cudagraph. Please check #121861 for more details. # Note on Optimistic Mutation Check To determine whether to apply cudagraph, we need to check input mutations, which fall into four categories: a) no mutation, b) mutation on parameters/buffers, c) mutation on cudagraph recorded tensors, d) mutation on non-cudagraph recorded tensors. We can apply cudagraph for type a,b,c but cannot for type d. The input mutation type depends on the function, current_node, and inputs. Since `check_for_mutation` is slow, there is a trade-off between making type c or d faster. - To make type d) faster, we want to `check_for_mutation` and call eager function early. However, this adds unnecessary overhead to type a, b, c due to the extra check. - To make type c) faster, we want to skip `check_for_mutation` at the beginning and only `check_for_mutation` before `record_function` for a new function. This removes the overhead of `check_for_mutation` for type a, b, c. However, this adds extra overhead to type d due to `check_invariants` for all children nodes. Instead, we design an optimistic mutation check. The assumption is that, given a function and a node, the input mutation types usually remain the same across inputs. So, if we have ever detected a function on a node as type d, we will never detect it as type c. The detailed design is: - [Slow Path] On the first invocation of a function on a node, we run `check_for_mutation` once and cache the input mutation type as `non_cudagraph_managed_mutation[node_id][func_id]`. - [Fast Path] On the subsequent invocations of a function on a node, we skip `check_for_mutation`. For `non_cudagraph_managed_mutation[node_id][func_id]` as true, we directly call eager function. Otherwise, we `check_invariants` and call cudagraph function. 
- [Slow Path] Before `record_function`, we run `check_for_mutation` again. Q1: Would there be overhead for type a,b,c,d? A: No. We only check input mutation types for the first invocation of a function on a node. Q2: If a function happens to be type c during the first invocation on a node, could we detect it as type d in the future? A: Yes. This is done by `check_invariants` and guarantees the correctness. Q3: If a function happens to be type d during the first invocation on a node, could it still be recognized as type c in the future? A: No. But this should happen rarely according to our assumption. In the rare case that it happens, there would not be any correctness issues and the performance is the same as the eager (or inductor optimized) function. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123231 Approved by: https://github.com/eellison	2024-04-19 10:32:12 +00:00
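The slow-path/fast-path caching described in the commit above can be sketched in plain Python. This is only an illustration under stated assumptions: the class `OptimisticMutationCache`, its `should_run_eager` method, and the `slow_checks` counter are hypothetical names, and `check_for_mutation` is passed in as a stand-in callable that returns `True` for a type d mutation. The real logic lives in the cudagraph trees code and also involves `check_invariants`, which is not modeled here.

```python
from collections import defaultdict
from typing import Callable, Dict


class OptimisticMutationCache:
    """Caches, per (node_id, func_id), whether a function mutates inputs
    that are not managed by cudagraph (type d in the commit message)."""

    def __init__(self, check_for_mutation: Callable[[int, int], bool]) -> None:
        # Stand-in for the slow check; returns True for a type d mutation.
        self._check_for_mutation = check_for_mutation
        self.non_cudagraph_managed_mutation: Dict[int, Dict[int, bool]] = defaultdict(dict)
        self.slow_checks = 0  # counts how often the slow check actually ran

    def should_run_eager(self, node_id: int, func_id: int) -> bool:
        cached = self.non_cudagraph_managed_mutation[node_id]
        if func_id not in cached:
            # Slow path: first invocation of this function on this node.
            self.slow_checks += 1
            cached[func_id] = self._check_for_mutation(node_id, func_id)
        # Fast path: reuse the cached answer without re-checking.
        return cached[func_id]


# Pretend that only func_id 1 performs a type d mutation.
cache = OptimisticMutationCache(lambda node_id, func_id: func_id == 1)
print(cache.should_run_eager(0, 1))  # slow path, then cached
print(cache.should_run_eager(0, 1))  # fast path
print(cache.slow_checks)
```

Once a function is cached as type d on a node, the sketch never re-checks it, which mirrors the answer to Q3: such a function keeps running eagerly even if a later input would have qualified as type c.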
Tobias Ringwald	58e403c739	Added a docstring for torch.Size.numel. (#124186 ) Fixes #61231. Fixes #124167. This PR documents a rather long-standing issue w.r.t. unexpected behavior of `torch.Size.numel`, first reported almost 5 years ago. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124186 Approved by: https://github.com/janeyx99	2024-04-19 09:23:02 +00:00
PyTorch MergeBot	520bc1080e	Revert "[Profiler] Unify the device(CUDA, XPU, PrivateUse1) in torch profiler post processing (#123247 )" This reverts commit `768ce2cdda`. Reverted https://github.com/pytorch/pytorch/pull/123247 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/123247#issuecomment-2066152611))	2024-04-19 09:09:03 +00:00
Xuehai Pan	a6f044a490	[dynamo, 3.8-3.9] support dataclass with `frozen=True` in Python 3.8/3.9 (#124393 ) Closes #114966 Frozen field assignment in `__init__` in Python 3.8-3.9: `f5bd65ed37/Lib/dataclasses.py (L402-L411)` ```python import builtins BUILTINS = builtins def _field_assign(frozen, name, value, self_name): # If we're a frozen class, then assign to our fields in __init__ # via object.__setattr__. Otherwise, just use a simple # assignment. # # self_name is what "self" is called in this function: don't # hard-code "self", since that might be a field name. if frozen: return f'BUILTINS.object.__setattr__({self_name},{name!r},{value})' return f'{self_name}.{name}={value}' ``` Frozen field assignment in `__init__` in Python 3.10+: `812245ecce/Lib/dataclasses.py (L436-L445)` ```python __dataclass_builtins_object__ = object def _field_assign(frozen, name, value, self_name): # If we're a frozen class, then assign to our fields in __init__ # via object.__setattr__. Otherwise, just use a simple # assignment. # # self_name is what "self" is called in this function: don't # hard-code "self", since that might be a field name. if frozen: return f'__dataclass_builtins_object__.__setattr__({self_name},{name!r},{value})' return f'{self_name}.{name}={value}' ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/124393 Approved by: https://github.com/jansel	2024-04-19 05:10:33 +00:00
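For context on the commit above, the generated `__init__` of a frozen dataclass has to bypass the class's own `__setattr__`, which is why both CPython snippets route through `object.__setattr__`. A minimal sketch of that behavior in plain Python follows; the `Point` class is a made-up example and the snippet does not involve Dynamo itself.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class Point:
    x: int
    y: int


p = Point(1, 2)

# Ordinary attribute assignment goes through the frozen __setattr__
# and is rejected.
try:
    p.x = 10
    blocked = False
except FrozenInstanceError:
    blocked = True

# Calling object.__setattr__ directly skips the frozen check; this is the
# mechanism the generated __init__ relies on to populate the fields.
object.__setattr__(p, "x", 10)
print(blocked, p)
```

The only difference between the two CPython versions quoted in the commit is the name through which `object` is reached in the generated code (`BUILTINS.object` versus `__dataclass_builtins_object__`).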
Nikita Shulga	1ba85b34dd	[AOTI] Enable mmapped weights when CUDA is used (#124346 ) This is done by refactoring the logic that returns the pointer to the start of the constants into a `_get_constants_start()` method and calling it from both the CUDA and CPU readers. It has no runtime impact, but export time drops from 10m to 3m if mmapped weights are used on AWS p4d.24xlarge. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124346 Approved by: https://github.com/mikekgfb, https://github.com/desertfire	2024-04-19 04:47:27 +00:00
Kiuk Chung	87f44d70b1	[torch/distributed] Check gloo availability when doing isinstance(pg,… (#124233 ) Fixes a bug where a reference to `_ProcessGroupWrapper` is used without first checking whether gloo is available. This fails on pytorch builds that do not include gloo because `_ProcessGroupWrapper` is only pybinded when building with `USE_GLOO=1`. Therefore, creation of a new process group fails with a `NameError` when only NCCL is available as the backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124233 Approved by: https://github.com/rohan-varma, https://github.com/d4l3k	2024-04-19 04:07:00 +00:00
Chen, Zejun	768ce2cdda	[Profiler] Unify the device(CUDA, XPU, PrivateUse1) in torch profiler post processing (#123247 ) This PR unifies the CUDA, XPU and PrivateUse1 in the torch profiler. Now CUDA, XPU and PrivateUse1 can together use string object `use_device` to distinguish each other and share one device path for calculating kineto time durations and memory statistics for post processing. #suppress-api-compatibility-check Co-authored-by: Aaron Enye Shi <enye.shi@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123247 Approved by: https://github.com/aaronenyeshi, https://github.com/gujinghui	2024-04-19 03:31:13 +00:00
rraminen	803a08f8ae	[ROCm] Add cublasGemmAlgo_t -> hipblasGemmAlgo_t (#121030 ) This PR is to add cublasGemmAlgo_t -> hipblasGemmAlgo_t to cuda_to_hip_mappings.py. It is required for DeepSpeed transformer extension build on ROCm. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121030 Approved by: https://github.com/jeffdaily, https://github.com/ezyang	2024-04-19 02:57:16 +00:00
rzou	889e3eeed3	Avoid cuda init to FakeTensorMode (#124413 ) Also partially fixes #122109 This PR: - We add a C++ flag (only_lift_cpu_tensors) to toggle the torch.tensor(1, device='cuda') ctor strategy. When false (default), it does the current PyTorch behavior of unconditionally constructing a concrete CUDA tensor then calling lift_fresh on it. When true, we instead construct a concrete CPU tensor, call lift_fresh, and then call Tensor.to(device) (under any ambient modes). - FakeTensorMode flips this flag depending on if CUDA is available or not. We don't unconditionally set the flag to True because that is likely BC-breaking. Test Plan: - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/124413 Approved by: https://github.com/eellison	2024-04-19 02:39:35 +00:00
chilli	e620c3e814	Optimized templated attention to use exp2 (#124356 ) 0.705 (vs. FA2) to 0.860 after this change. <img width="1270" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/d58f57ba-e50e-44ea-8a8a-4f13b8650adf"> to <img width="1277" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/f1945b67-0cfc-463c-a2f6-5812b90677fe"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124356 Approved by: https://github.com/drisspg	2024-04-19 01:58:19 +00:00
Tristan Rice	ddd0ed1b43	distributed: templated ring attention (#124215 ) This adds a templated version of the ring attention forwards function and tests it with memory efficient attention. This doesn't add support for memory efficient attention in DTensor. That will be added in a follow up PR. This templating is also a POC of how to support other attention ops such as Jagged/nested tensor, as well as how to implement striped attention in a scalable way. Misc changes: * Fixes all_to_all_single autograd implementation with CUDA + adds NCCL test * Adds compile support to the ring attention implementations (required some tweaks to process groups) Test plan: ``` pytest test/distributed/_tensor/test_attention.py pytest test/distributed/test_functional_api.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/124215 Approved by: https://github.com/wanchaol	2024-04-19 00:57:08 +00:00
Bin Bao	4946638f06	[AOTI] Add ABI-compatibility tests (#123848 ) Summary: In AOTInductor generated CPU model code, there can be direct references to some aten/c10 utility functions and data structures, e.g. at::vec and c10::Half. These are performance critical and thus it doesn't make sense to create C shims for them. Instead, we make sure they are implemented in a header-only way, and use this set of tests to guard future changes. There are more header files to be updated, but we will do it in other followup PRs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123848 Approved by: https://github.com/jansel ghstack dependencies: #123847	2024-04-19 00:51:24 +00:00
JackCaoG	9ed9b22ec0	Implement efficient_conv_bn_eval_decomp_graph_transform to handle conv and bn fusion after decomp (#123680 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123680 Approved by: https://github.com/ezyang, https://github.com/youkaichao	2024-04-19 00:22:25 +00:00
Shuqiang Zhang	ca6a0e1348	[c10d] remove the env of TORCH_NCCL_ABORT_IN_DESTROY_PG (#124334 ) Summary: This ENV was introduced to safely roll out the behavior change in destroy process group (e.g., call ncclCommsAbort). Now that this behavior change has already been rolled out, we no longer need this env and we should clean it up to keep our code cleaner. Test Plan: Modified/existing ut pass Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/124334 Approved by: https://github.com/wconstab	2024-04-18 23:42:55 +00:00
eellison	e4f6340f21	realize inputs to mem bound mm decomposition (#123165 ) Differential Revision: [D55639709](https://our.internmc.facebook.com/intern/diff/D55639709) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123165 Approved by: https://github.com/jackiexu1992	2024-04-18 23:10:04 +00:00
Mikayla Gawarecki	5ba6bb7b2f	Add swap_tensors path to nn parametrizations (#124130 ) Fixes #123859 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124130 Approved by: https://github.com/albanD	2024-04-18 22:22:08 +00:00
Wei Wei	87f651c7e7	fix cpu test errors (#124116 ) Similar fix is from @int3 but not landed. Credit to @int3 too. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124116 Approved by: https://github.com/chenyang78	2024-04-18 20:30:58 +00:00
ydwu4	2e48b39603	Fix example_value of map (#124203 ) Previously, we didn't expand the shape of example_value of map to the same as inputs (edit: the first mapped dimension). This pr fixes this bug. To make this easier, we change _call_function_and_unflatten_output to accept example_values directly instead of retrieving them from the variable trackers. Also remove a redundant call function node in strict_mode higher order op in dynamo. Test Plan: existing tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124203 Approved by: https://github.com/ezyang, https://github.com/zou3519	2024-04-18 19:18:36 +00:00
PyTorch MergeBot	4a0900d04b	Revert "[NJT] Inline through torch.nested.nested_tensor_from_jagged instead of graph break (#124343 )" This reverts commit `ef93402f61`. Reverted https://github.com/pytorch/pytorch/pull/124343 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/124343#issuecomment-2064937192))	2024-04-18 18:55:48 +00:00
Sheng Fu	89407eca3b	Capture triton kernel in execution trace (#124140 ) Summary: This DIFF is to capture triton kernels in execution trace. Test Plan: buck test mode/dev-nosan caffe2/test:profiler -- test_execution_trace_with_pt2 Differential Revision: D56162599 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124140 Approved by: https://github.com/briancoutinho	2024-04-18 18:38:26 +00:00
angelayi	74bedbb9e1	[export] Serialize rational symint ranges (#123884 ) Some symints result in rational ranges like 10/3 which runs into an error ([example](https://www.internalfb.com/intern/everpaste/?handle=GMG2AxkeoFUrh-UDAFcE8pKPgjoUbsIXAAAB)). Ed will eventually get rid(?) of these rational ranges but as a workaround export can just clamp the results during serialization time Pull Request resolved: https://github.com/pytorch/pytorch/pull/123884 Approved by: https://github.com/zhxchen17	2024-04-18 18:20:11 +00:00
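The workaround in the commit above (clamping a rational range such as one with a bound of 10/3 before serialization) can be sketched as follows. The commit message does not say which way each bound is rounded, so the rounding direction here is an assumption: the lower bound is rounded up and the upper bound down, which keeps exactly the integers inside the original range. The function name `clamp_range_to_ints` is hypothetical.

```python
import math
from fractions import Fraction
from typing import Tuple


def clamp_range_to_ints(lower: Fraction, upper: Fraction) -> Tuple[int, int]:
    """Tighten a rational range [lower, upper] to integer bounds.

    Assumption: any integer s with lower <= s <= upper also satisfies
    ceil(lower) <= s <= floor(upper), so no valid integer value is lost.
    """
    return math.ceil(lower), math.floor(upper)


print(clamp_range_to_ints(Fraction(10, 3), Fraction(22, 3)))
print(clamp_range_to_ints(Fraction(2), Fraction(5)))
```

Ranges that already have integer bounds pass through unchanged, so the clamp only affects the rational case the commit describes.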
Aaron Orenstein	37215a4fa2	Fix memory leak in pattern_matcher (#124345 ) #121313 changed precompiled patterns so they are more integrated with the pattern matching code. This resulted in a list of "known" patterns (with their example data) being stored globally. Unfortunately, since small FakeTensors store a constant of the original tensor, it meant that we leaked cuda tensors in the example data. Fix this by clearing out the constant storage for the example data that we keep around. Fixes #124081 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124345 Approved by: https://github.com/xuzhao9	2024-04-18 17:38:12 +00:00
egienvalue	d7e1bf9ff9	torch.mtia module for MTIA device backend (#123612 ) The MTIA device has its own module in PyTorch now. torch.mtia has the following APIs, similar to other backends. lazy_init is also supported. ``` __all__ = [ "init", "is_available", "synchronize", "device_count", "current_device", "current_stream", "default_stream", "set_stream", "stream", "device", ] ``` ------------ For device management, we expand AcceleratorHooksInterface to support generic device management; it can be used in both C++ and Python. ``` def _accelerator_hooks_device_count() -> _int: ... def _accelerator_hooks_set_current_device(device_index: _int) -> None: ... def _accelerator_hooks_get_current_device() -> _int : ... def _accelerator_hooks_exchange_device(device_index: _int) -> _int : ... def _accelerator_hooks_maybe_exchange_device(device_index: _int) -> _int : ... ``` --------- Adding get_device_module API to retrieve device modules for different device types. ``` def get_device_module(device: Optional[Union[torch.device, str]] = None) ``` --------- @exported-using-ghexport Differential Revision: [D52923602](https://our.internmc.facebook.com/intern/diff/D52923602/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123612 Approved by: https://github.com/albanD ghstack dependencies: #123611	2024-04-18 17:38:06 +00:00
egienvalue	cb17721899	Build device generic torch.Stream and torch.Event based on c10::Stream/Event (#123611 ) This diff intends to build device generic torch.Stream and torch.Event for newly added accelerators in PyTorch. ------------ torch.Stream APIs ``` # Defined in torch/csrc/Stream.cpp class Stream(_StreamBase): stream_id: _int # Stream id device_index: _int device_type: _int device: _device # The device of the stream @overload def __new__(self, device: Optional[DeviceLikeType] = None, priority: _int = 0) -> Stream: ... @overload def __new__(self, stream_id: _int, device_index: _int, device_type: _int, priority: _int = 0) -> Stream: ... def query(self) -> _bool: ... def synchronize(self) -> None: ... def wait_event(self, event: Event) -> None: ... def wait_stream(self, other: Stream) -> None: ... def record_event(self, event: Optional[Event] = None) -> Event: ... def query(self) -> None: ... def synchronize(self) -> None: ... def __hash__(self) -> _int: ... def __repr__(self) -> str: ... def __eq__(self, other: object) -> _bool: ... ``` ------------------ torch.Event APIs: - IPC related APIs are not implemented, since many device backends don't support it, but we leave interfaces there for future adaption of torch.cuda.Stream. - currently only the enable_timing is supported, since it is the most common one used in other device backends. We have to refactor the event flag system in PyTorch to support more fancy flag. - elapsedTime API is added to c10::Event ``` # Defined in torch/csrc/Event.cpp class Event(_EventBase): device: _device # The device of the Event event_id: _int # The raw event created by device backend def __new__(self, device: Optional[DeviceLikeType] = None, enable_timing: _bool = False, blocking: _bool = False, interprocess: _bool = False) -> Event: ... @classmethod def from_ipc_handle(self, device: DeviceLikeType, ipc_handle: bytes) -> Event: ... def record(self, stream: Optional[Stream] = None) -> None: ... 
def wait(self, stream: Optional[Stream] = None) -> None: ... def query(self) -> _bool: ... def elapsed_time(self, other: Event) -> _float: ... def synchronize(self) -> None: ... def ipc_handle(self) -> bytes: ... def __repr__(self) -> str: ... ``` ----------- c10::Event provides new APIs - calculate elapsedTime. - Get raw event id - Synchronize event. ``` double elapsedTime(const Event& event) const { return impl_.elapsedTime(event.impl_); } void* eventId() const { return impl_.eventId(); } void synchronize() const { return impl_.synchronize(); } ``` ---------- TODO: need to find a good way to test them in PyTorch with API mocks. Differential Revision: [D55351839](https://our.internmc.facebook.com/intern/diff/D55351839/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123611 Approved by: https://github.com/albanD	2024-04-18 17:35:09 +00:00
Jason Ansel	7a6edb0b66	Possible fix for einops warning (#124084 ) See https://github.com/arogozhnikov/einops/issues/315 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124084 Approved by: https://github.com/peterbell10	2024-04-18 17:09:50 +00:00
Zhengxu Chen	e1062f5738	[export] Add a printer to unflattened module. (#124315 ) Summary: add a helper method to print graph in every level of unflattened module. Test Plan: {F1489609684} Differential Revision: D56263195 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124315 Approved by: https://github.com/tugsbayasgalan	2024-04-18 16:35:51 +00:00
Boyuan Feng	aa2da0cdd2	[Export] Add runtime assert to non-strict export (#123681 ) This PR moves insert_deferred_runtime_asserts from dynamo to torch.fx.passes and uses it to add runtime assertion for non-strict export. Differential Revision: D55944267 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123681 Approved by: https://github.com/tugsbayasgalan, https://github.com/angelayi	2024-04-18 16:13:27 +00:00
soulitzer	ef93402f61	[NJT] Inline through torch.nested.nested_tensor_from_jagged instead of graph break (#124343 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124343 Approved by: https://github.com/jbschlosser	2024-04-18 14:42:54 +00:00
Andrew Gu	bbb6e36495	[FSDP2] Fixed `set_requires_gradient_sync`'s `recurse` arg (#124318 ) The `recurse` argument was not being respected for `set_requires_gradient_sync`. This PR fixes that. The previous unit test did not have nested FSDP modules with managed parameters, so the `recurse=False` was not being exercised. We augment the unit test to try only disabling gradient sync for the root module and not children. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124318 Approved by: https://github.com/weifengpy ghstack dependencies: #120952, #124293	2024-04-18 14:21:57 +00:00
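The intended meaning of the `recurse` argument in the commit above can be shown with a toy module tree. This is a sketch only: `ToyModule` is a hypothetical stand-in rather than the FSDP2 implementation, and it models just the behavior the unit test is described as exercising (disabling gradient sync for the root but not its children).

```python
from typing import Iterator, List, Optional


class ToyModule:
    """Hypothetical stand-in for a module with nested, managed children."""

    def __init__(self, name: str, children: Optional[List["ToyModule"]] = None) -> None:
        self.name = name
        self.children = list(children or [])
        self.requires_gradient_sync = True

    def modules(self) -> Iterator["ToyModule"]:
        # Yield this module first, then all descendants depth-first.
        yield self
        for child in self.children:
            yield from child.modules()

    def set_requires_gradient_sync(self, value: bool, recurse: bool = True) -> None:
        # With recurse=False only this module is changed; with recurse=True
        # the setting is applied to this module and every descendant.
        targets = self.modules() if recurse else [self]
        for module in targets:
            module.requires_gradient_sync = value


child = ToyModule("child")
root = ToyModule("root", [child])

root.set_requires_gradient_sync(False, recurse=False)
print(root.requires_gradient_sync, child.requires_gradient_sync)
```

A tree without nested children cannot distinguish the two settings, which matches the commit's explanation of why the earlier test did not catch the bug.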
rzou	1542874311	Delete qualname from custom_op decorator (#124092 ) I forgot to delete this in an earlier PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124092 Approved by: https://github.com/albanD ghstack dependencies: #123937, #124064, #124065, #124066, #124071, #124089	2024-04-18 12:48:04 +00:00
rzou	648c39c47d	Add OpOverload.redispatch; use it in new custom ops API (#124089 ) A kernel has "dispatcher convention" if there is an additional keyset arg at the beginning of the argument list. This PR: - adds a way to register kernels with dispatcher_convention using Library.impl (pass dispatcher_convention = True) - adds OpOverload.redispatch We use both of the above in the new custom ops API: we register the autograd kernel in dispatcher convention so that we can actually call redispatch like how pytorch built-in ops do it. Test Plan: - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/124089 Approved by: https://github.com/albanD ghstack dependencies: #123937, #124064, #124065, #124066, #124071	2024-04-18 12:48:04 +00:00
rzou	645173a0b5	Add torch.library.register_autograd (#124071 ) Allows registering autograd for all custom op entry points: - the new-style custom op API (custom_op) - the old-style torch.library APIs - C++ operator registration Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/124071 Approved by: https://github.com/albanD ghstack dependencies: #123937, #124064, #124065, #124066	2024-04-18 12:47:59 +00:00
rzou	8135c4b921	torch.library.register_fake now accepts more types (#124066 ) We allow it to accept: - a string with the op name - an opoverload - a new-style custom op If any of these are referring to a new-style custom op (created with the custom_op decorator), then we dispatch to CustomOpDef.register_fake. Otherwise, we do what we previously did. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/124066 Approved by: https://github.com/albanD ghstack dependencies: #123937, #124064, #124065	2024-04-18 12:47:55 +00:00
xinan.lin	6fcbeb3489	[ATen] Add CPU fp16 support for nll_loss and cross_entropy_loss (#123256 ) Add CPU FP16 support for nll_loss and cross_entropy_loss. Resolve issue #123328. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123256 Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/malfet	2024-04-18 11:44:38 +00:00
IvanKobzarev	d59f1da62f	[sym_shapes][perf] _find not update unchanged replacements (#124274 ) Differential Revision: [D56236380](https://our.internmc.facebook.com/intern/diff/D56236380) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124274 Approved by: https://github.com/ezyang	2024-04-18 08:32:02 +00:00
IvanKobzarev	9eba1995d0	[sym_shapes][perf] Use sympy xreplace instead of subs (#124208 ) https://github.com/sympy/sympy/issues/22240 Differential Revision: [D56207553](https://our.internmc.facebook.com/intern/diff/D56207553) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124208 Approved by: https://github.com/ezyang, https://github.com/lezcano	2024-04-18 08:19:03 +00:00
PyTorch MergeBot	2b82345e48	Revert "Re-land precompile triton templates (#124030 )" This reverts commit `030bb13fe8`. Reverted https://github.com/pytorch/pytorch/pull/124030 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/124030#issuecomment-2063191117))	2024-04-18 07:21:41 +00:00
Animesh Jain	704fac5618	[dynamo][cpp-guard] Reland Attempt 1 - Enable cpp guard manager (#124231 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124231 Approved by: https://github.com/jansel ghstack dependencies: #124230, #124237	2024-04-18 06:36:20 +00:00
PyTorch MergeBot	6e86a40694	Revert "[Dynamo] Check for __bool__ attribute before accessing it (#120943 )" This reverts commit `dd7aeedb72`. Reverted https://github.com/pytorch/pytorch/pull/120943 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/120943#issuecomment-2063098295))	2024-04-18 06:34:32 +00:00
PyTorch MergeBot	8ff85b42f9	Revert "Add swap_tensors path to nn parametrizations (#124130 )" This reverts commit `64f6ddf12c`. Reverted https://github.com/pytorch/pytorch/pull/124130 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/124130#issuecomment-2063074856))	2024-04-18 06:12:54 +00:00
Zhuoran Zhao	8ad66e05d2	[4/x][AMD][Lowering Enablement] Enabling meta internal AOTInductor compilation on ROCM (#124123 ) Summary: as title Test Plan: CI & unit test Differential Revision: D56163334 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124123 Approved by: https://github.com/chenyang78, https://github.com/jansel	2024-04-18 04:19:37 +00:00
xinan.lin	c9ab9248ce	[Inductor Intel GPU backend Upstream] Generalize device-bias code in (#124249 ) Generalize device-bias code in triton_utils.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/124249 Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/jansel	2024-04-18 03:54:31 +00:00
Yanan Cao (PyTorch)	27daa110c8	Back out "Refresh OpOverloadPacket if a new OpOverload gets added (#123578 )" (#124324 ) Summary: Original commit changeset: 528276bc8a92 Original Phabricator Diff: D56057952 Differential Revision: D56271240 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124324 Approved by: https://github.com/davidberard98	2024-04-18 03:33:54 +00:00

37529 Commits