pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Oguz Ulgen	5033d3ba6d	Disable fb_memcache for MTIA (#125658 ) Differential Revision: [D57035819](https://our.internmc.facebook.com/intern/diff/D57035819/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125658 Approved by: https://github.com/jamesjwu	2024-05-07 07:00:26 +00:00
Oguz Ulgen	22bcfc25ef	Initial implementation of Inductor FX Graph Remote Cache (#124669 ) This diff implements a remote caching strategy (memcache for internal and redis for external) that caches the mapping from an Inductor FX Graph to the Inductor-generated wrapper file. It uses the same idea as the autotuning result cache that is currently live. This will land turned off; before turning it on by default, I will do more testing, including looking at the dynamic shape guards added by inductor. Differential Revision: [D56441624](https://our.internmc.facebook.com/intern/diff/D56441624/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124669 Approved by: https://github.com/jansel, https://github.com/eellison	2024-05-06 22:10:27 +00:00
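A minimal sketch of the lookup order that a remote cache like the one described above implies: check a local store, then a remote backend, and only then compile. Everything here is an assumption for illustration. `InMemoryBackend`, `cache_key`, and `lookup_or_compile` are hypothetical names, not Inductor APIs, and the in-memory class merely stands in for a memcache or redis client.

```python
import hashlib
from typing import Callable, Dict, Optional


class InMemoryBackend:
    """Hypothetical stand-in for a remote store such as redis or memcache."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value


def cache_key(graph_repr: str) -> str:
    # Assumption: the key is a hash of some stable text form of the FX graph.
    return hashlib.sha256(graph_repr.encode("utf-8")).hexdigest()


def lookup_or_compile(
    graph_repr: str,
    local: Dict[str, bytes],
    remote: InMemoryBackend,
    compile_fn: Callable[[str], bytes],
) -> bytes:
    key = cache_key(graph_repr)
    if key in local:                 # 1. local hit
        return local[key]
    hit = remote.get(key)
    if hit is not None:              # 2. remote hit, populate the local store
        local[key] = hit
        return hit
    artifact = compile_fn(graph_repr)  # 3. miss everywhere, compile and publish
    local[key] = artifact
    remote.put(key, artifact)
    return artifact
```

With this shape, a second process that starts with an empty local store can still reuse an artifact another process published to the shared backend.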
Daohang Shi	2ea1e84d40	log pt2 config dict to signpost from inductor post grad (#124593 ) Summary: Previous attempts did not work in the end. D49720297 caused an online training SEV due to an extra import. D56299408 mitigated a tricky bug in the Distributed Shampoo constructor but unfortunately did not fix the scuba logging either. See f552546983 Test Plan: {F1491621504} Differential Revision: D56378270 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124593 Approved by: https://github.com/anijain2305	2024-04-26 18:57:11 +00:00
Simon Fan	14430564ce	[cudagraphs] add cudagraph_skips counter (#124804 ) used in tests and benchmark csv Pull Request resolved: https://github.com/pytorch/pytorch/pull/124804 Approved by: https://github.com/eellison ghstack dependencies: #119729, #124700	2024-04-26 03:22:29 +00:00
David Berard	4259e5d0e0	[inductor] Specialize on unguarded alignment of example inputs (#123319 ) When inductor generates triton code, the triton code can either assume that the inputs given to it are aligned or unaligned. If they are aligned, triton can use more efficient instructions (like vectorized loads or tensor cores). However, if we generate "aligned" code and pass in unaligned inputs, the triton code will error out; to fix this, we clone unaligned inputs that are passed to triton kernels that expect aligned inputs. This can lead to excessive clones if we have inputs that are not expected to be aligned. In this PR, we use the example input to decide whether the generated triton code should assume alignment or not. If the example input is aligned, then we will generate triton code that assumes alignment; if at runtime we receive an unaligned input, we'll make a clone. Meanwhile, if the example input is not aligned, the generated triton code will not assume inputs are aligned and we won't ever need to clone. Note that the alignment of the inputs is not guarded on; we found that adding guards on tensor offsets (a) was slow in cases where we do a lot of comparisons on tensor offsets, and (b) led to a lot of recompilations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123319 Approved by: https://github.com/eellison	2024-04-25 22:28:15 +00:00
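The decision described above can be sketched in a few lines. This is only an illustration under stated assumptions: addresses are plain integers, `ALIGNMENT = 16` is an assumed byte alignment, and `compile_kernel` and `needs_clone` are hypothetical helpers rather than Inductor functions.

```python
ALIGNMENT = 16  # assumed byte alignment; the real requirement is backend-specific


def is_aligned(address: int) -> bool:
    return address % ALIGNMENT == 0


def compile_kernel(example_address: int) -> dict:
    """Decide once, from the example input, whether generated code assumes alignment."""
    return {"assumes_aligned": is_aligned(example_address)}


def needs_clone(kernel: dict, runtime_address: int) -> bool:
    """A clone is needed only when aligned code receives an unaligned input."""
    return kernel["assumes_aligned"] and not is_aligned(runtime_address)
```

A kernel compiled from an unaligned example never triggers a clone, which is the case the PR is trying to make cheap; the decision is taken once from the example input and is not guarded on.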
PyTorch MergeBot	154157416c	Revert "[cudagraphs] add cudagraph_skips counter (#124804 )" This reverts commit `fdad16b851`. Reverted https://github.com/pytorch/pytorch/pull/124804 on behalf of https://github.com/jeanschmidt due to one PR in this stack seems to have broken linux pull cuda12 tests ([comment](https://github.com/pytorch/pytorch/pull/119729#issuecomment-2076750595))	2024-04-25 09:26:25 +00:00
Simon Fan	fdad16b851	[cudagraphs] add cudagraph_skips counter (#124804 ) used in tests and benchmark csv Pull Request resolved: https://github.com/pytorch/pytorch/pull/124804 Approved by: https://github.com/eellison ghstack dependencies: #119729, #124700	2024-04-25 03:38:09 +00:00
Boyuan Feng	b91f83f181	[cudagraph] add config for cudagraph managed input mutation support (#124754 ) Summary: [#123231](https://github.com/pytorch/pytorch/pull/123231) adds cudagraph support for more types of functions (i.e., cudagraph managed input mutation). These newly supported functions may have mutated static inputs, leading to assertion errors in some workloads that previously skipped cudagraph. This diff adds a config to opt in to the new feature. Test Plan: ci Differential Revision: D56481353 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124754 Approved by: https://github.com/eellison	2024-04-24 04:23:53 +00:00
Laith Sakka	8cf54929e3	compiletime->compile_time (#124579 ) Summary: title. Test Plan: run strobelight profiler. Reviewed By: oulgen Differential Revision: D56395415 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124579 Approved by: https://github.com/oulgen	2024-04-23 20:50:53 +00:00
Laith Sakka	acbf888a13	rename sl to strobelight (#124455 ) Summary: TORCH_COMPILE_SL_PROFILE ->TORCH_COMPILE_STROBELIGHT SL_MAX_STACK_LENGTH -> COMPILE_STROBELIGHT_MAX_STACK_LENGTH SL_MAX_PROFILE_TIME -> COMPILE_STROBELIGHT_MAX_PROFILE_TIME profile_with_sl() -> strobelight() compiletime_sl_profile_meta() -> compiletime_strobelight_meta() Test Plan: 1. run and verify ``` TORCH_COMPILE_STROBELIGHT=TRUE buck2 run @//mode/inplace @//mode/opt //caffe2/fb/strobelight:compiletime_profiler_example ``` 2. run and verify ``` buck2 run @//mode/inplace @//mode/opt //caffe2/fb/strobelight:function_profiler_example --local-only ``` 3. run and verify truncated stack for ``` TORCH_COMPILE_STROBELIGHT=TRUE COMPILE_STROBELIGHT_MAX_STACK_LENGTH=1 buck2 run @//mode/inplace @//mode/opt //caffe2/fb/strobelight:compiletime_profiler_example ``` 4. add infinite loop in _verify and verify samples for ``` COMPILE_STROBELIGHT_MAX_PROFILE_TIME=30 TORCH_COMPILE_STROBELIGHT=TRUE buck2 run @//mode/inplace @//mode/opt //caffe2/fb/strobelight:compiletime_profiler_example ``` Reviewed By: oulgen Differential Revision: D56327139 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124455 Approved by: https://github.com/oulgen	2024-04-19 22:50:13 +00:00
eellison	9489019085	Small fixes for deferred epilogue (#123229 ) Two small fixes: - preserve rng around compile_fx_inner - Now that we precompile in the background while lowering multiple templates in parallel, we can no longer allocate inputs at the beginning of the function because we would have multiple sets of inputs allocated at the same time. Instead, allocate them when needed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123229 Approved by: https://github.com/shunting314 ghstack dependencies: #124030, #122642	2024-04-19 17:41:29 +00:00
Boyuan Feng	9a71d12d92	[CUDAGraphTree] Support mutated inputs from prior cudagraph pool (#123231 ) # PR This PR supports mutating inputs in cudagraph trees, if these inputs are outputs from a previous cudagraph. Please check #121861 for more details. # Note on Optimistic Mutation Check To determine whether to apply cudagraph, we need to check input mutations, which fall into four categories: a) no mutation, b) mutation on parameters/buffers, c) mutation on cudagraph recorded tensors, d) mutation on non-cudagraph recorded tensors. We can apply cudagraph for type a,b,c but cannot for type d. The input mutation type depends on function, current_node, and inputs. Since `check_for_mutation` is slow, there is a trade-off between making type c or d faster. - To make type d) faster, we want to `check_for_mutation` and call eager function early. However, this adds unnecessary overhead to type a, b, c due to the extra check. - To make type c) faster, we want to skip `check_for_mutation` at the beginning and only `check_for_mutation` before `record_function` for a new function. This removes the overhead of `check_for_mutation` for type a, b, c. However, this adds extra overhead to type d due to `check_invariants` for all children nodes. Instead, we design an optimistic mutation check. The assumption is that, given a function and a node, the input mutation types usually remain the same across inputs. So, if we have ever detected a function on a node as type d, we will never detect it as type c. The detailed design is: - [Slow Path] On the first invocation of a function on a node, we run `check_for_mutation` once and cache the input mutation type as `non_cudagraph_managed_mutation[node_id][func_id]`. - [Fast Path] On the subsequent invocations of a function on a node, we skip `check_for_mutation`. If `non_cudagraph_managed_mutation[node_id][func_id]` is true, we directly call the eager function. Otherwise, we run `check_invariants` and call the cudagraph function. 
- [Slow Path] Before `record_function`, we run `check_for_mutation` again. Q1: Would there be overhead for type a,b,c,d? A: No. We only check input mutation types for the first invocation of a function on a node. Q2: If a function happens to be type c during the first invocation on a node, could we detect it as type d in the future? A: Yes. This is done by `check_invariants` and guarantees the correctness. Q3: If a function happens to be type d during the first invocation on a node, could it still be recognized as type c in the future? A: No. But this should happen rarely according to our assumption. In the rare case that it happens, there would not be any correctness issues and the performance is the same as the eager (or inductor optimized) function. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123231 Approved by: https://github.com/eellison	2024-04-19 10:32:12 +00:00
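The slow-path/fast-path caching in the commit above can be sketched roughly as follows. This is a simplified illustration, not the CUDAGraph Trees implementation: the class name and the `check_for_mutation` callable signature are hypothetical, and the `check_invariants` step and the re-check before `record_function` are deliberately left out.

```python
from collections import defaultdict


class OptimisticMutationCheck:
    def __init__(self, check_for_mutation):
        # check_for_mutation(func_id, inputs) -> True when the function mutates
        # a non-cudagraph-managed input (type d). Assumed to be slow.
        self._check = check_for_mutation
        # Mirrors non_cudagraph_managed_mutation[node_id][func_id] from the text.
        self.non_cudagraph_managed_mutation = defaultdict(dict)
        self.slow_checks = 0

    def should_run_eager(self, node_id, func_id, inputs) -> bool:
        per_node = self.non_cudagraph_managed_mutation[node_id]
        if func_id not in per_node:
            # Slow path: first invocation of this function on this node.
            self.slow_checks += 1
            per_node[func_id] = self._check(func_id, inputs)
        # Fast path: later invocations reuse the cached answer.
        return per_node[func_id]
```

Because the cache is keyed by both node and function, the same function is checked again when it first appears on a different node, which matches the per-node wording of the design notes.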
Shunting Zhang	fb6f6270d6	[inductor] comprehensive padding (#120758 ) This PR adds the ability to pad tensor strides during lowering. The goal is to make sure (if possible) tensors with bad shape can have aligned strides so GPU can access the memory more efficiently. By testing BlenderbotSmallForConditionalGeneration I already see 2.5ms speedup. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120758 Approved by: https://github.com/jansel	2024-04-15 19:05:51 +00:00
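To make the idea of stride padding concrete, here is a small sketch that computes row-major strides and rounds each outer stride up to a multiple of an alignment. The function `pad_strides`, the default `alignment=32` (in elements), and the rule of padding only extents larger than the alignment are all assumptions chosen for illustration; they are not the heuristics Inductor actually uses.

```python
def pad_strides(sizes, alignment=32):
    """Row-major strides where large extents are rounded up to `alignment` elements."""
    strides = [0] * len(sizes)
    running = 1  # number of elements spanned by the dimensions processed so far
    for dim in range(len(sizes) - 1, -1, -1):
        strides[dim] = running
        running *= sizes[dim]
        if running > alignment:
            # Round up so the next outer stride is a multiple of `alignment`.
            running = -(-running // alignment) * alignment
    return strides
```

For a tensor of shape `(8, 1500)` this yields strides `(1504, 1)` instead of the contiguous `(1500, 1)`, so every row starts at an aligned offset at the cost of four unused elements per row.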
chunyuan	cdc47ad991	fix amp for AOTInductor (#122883 ) ## Pitch This PR disables the amp when calling the inference_compiler in AOTInductor path (after having exported the model graph), following the way we disable AMP in Inductor path in https://github.com/pytorch/pytorch/pull/86515. ## Description When testing AOTInductor AMP accuracy on CPU using the dynamo benchmark suites, multiple workloads will fail in this assertion: [assert pattern_repr not in _seen_patterns](`1d52c2d985/torch/_inductor/pattern_matcher.py (L1095)`) which is called when registering SDPA patterns. The `inference_compiler` ([fw_compiler_base](`1d52c2d985/torch/_inductor/compile_fx.py (L1234)`)) will call into [_recursive_joint_graph_passe](`1d52c2d985/torch/_inductor/compile_fx.py (L1241)`) and then [_sfdp_init](`1d52c2d985/torch/_inductor/fx_passes/fuse_attention.py (L847)`). When testing accuracy, we'll set [inductor_config.fallback_random = True](`1d52c2d985/benchmarks/dynamo/common.py (L3526)`), which will make the `search_fn` to be `None` [here](`1d52c2d985/torch/_inductor/fx_passes/serialized_patterns/central_index.py (L117-L118)`), thus the pattern will be generated runtime [here](`1d52c2d985/torch/_inductor/pattern_matcher.py (L1083-L1084)`). When AMP is on, the generated pattern for SDPA FP32 will be the same as that of FP16, which makes the assertion fail. Inductor path disables amp inside [aot_dispatch_base](`1d52c2d985/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py (L124-L128)`). We follow the same way to disable for AOTInductor path here (after having exported the model graph) to fix this issue. ## UT For the added UT, there's one case `python test/inductor/test_aot_inductor.py -k test_amp_fallback_random_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface` fails with the below error which is not caused by this PR itself. Marked it as skipped for now. 
``` RuntimeError: Error in dlopen: /tmp/torchinductor_user/cf5vk3gqkbvud56qeotdxqvns4wbk3sjnlnuadolt7b6g7a6kspb/cfzjo5ackvrth2gp6oq4lfpdyfafoagodfpjvbzhsi2u64hza2vn.so: undefined symbol: _Z16aoti_torch_dtypeIN3c108BFloat16EEiv ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122883 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-04-10 06:20:17 +00:00
PyTorch MergeBot	b3eb1b2f74	Revert "fix amp for AOTInductor (#122883 )" This reverts commit `a4a49f77b8`. Reverted https://github.com/pytorch/pytorch/pull/122883 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/122883#issuecomment-2046026363))	2024-04-09 20:51:53 +00:00
chunyuan	a4a49f77b8	fix amp for AOTInductor (#122883 ) ## Pitch This PR disables the amp when calling the inference_compiler in AOTInductor path (after having exported the model graph), following the way we disable AMP in Inductor path in https://github.com/pytorch/pytorch/pull/86515. ## Description When testing AOTInductor AMP accuracy on CPU using the dynamo benchmark suites, multiple workloads will fail in this assertion: [assert pattern_repr not in _seen_patterns](`1d52c2d985/torch/_inductor/pattern_matcher.py (L1095)`) which is called when registering SDPA patterns. The `inference_compiler` ([fw_compiler_base](`1d52c2d985/torch/_inductor/compile_fx.py (L1234)`)) will call into [_recursive_joint_graph_passe](`1d52c2d985/torch/_inductor/compile_fx.py (L1241)`) and then [_sfdp_init](`1d52c2d985/torch/_inductor/fx_passes/fuse_attention.py (L847)`). When testing accuracy, we'll set [inductor_config.fallback_random = True](`1d52c2d985/benchmarks/dynamo/common.py (L3526)`), which will make the `search_fn` to be `None` [here](`1d52c2d985/torch/_inductor/fx_passes/serialized_patterns/central_index.py (L117-L118)`), thus the pattern will be generated runtime [here](`1d52c2d985/torch/_inductor/pattern_matcher.py (L1083-L1084)`). When AMP is on, the generated pattern for SDPA FP32 will be the same as that of FP16, which makes the assertion fail. Inductor path disables amp inside [aot_dispatch_base](`1d52c2d985/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py (L124-L128)`). We follow the same way to disable for AOTInductor path here (after having exported the model graph) to fix this issue. ## UT For the added UT, there's one case `python test/inductor/test_aot_inductor.py -k test_amp_fallback_random_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface` fails with the below error which is not caused by this PR itself. Marked it as skipped for now. 
``` RuntimeError: Error in dlopen: /tmp/torchinductor_user/cf5vk3gqkbvud56qeotdxqvns4wbk3sjnlnuadolt7b6g7a6kspb/cfzjo5ackvrth2gp6oq4lfpdyfafoagodfpjvbzhsi2u64hza2vn.so: undefined symbol: _Z16aoti_torch_dtypeIN3c108BFloat16EEiv ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122883 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-04-09 12:08:33 +00:00
Menglu Yu	6d005ca590	[PT2][Observability] Add model_type and global_rank for the scuba log for the dashboard Optimus pattern frequency monitor (#123398 ) Summary: We also log the model type and global rank to make scuba queries easier when developing the dashboard monitor. More context: https://docs.google.com/document/d/1RuUCOBOgVt9pp-Jgoo4oEXWvoYv6GN0DljypsqgVTp4/edit Test Plan: # local reproduce ``` buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "cmf" --flow_id 524546542 ``` optimus parameter sent to the scuba: ``` {'before_recompile_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GO3rCxej_mk0RV0DAPE1wtdadgNkbr0LAAAz', 'BatchLayernormFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GESWQhm_XNqiIYYCAJ2nCcg9PPwnbr0LAAAz', 'BatchSigmoidPreGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GAOv_RZb5hEwKIQBAPc7kNFDN2kEbr0LAAAz', 'normalization_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GMUFOxkqRm1ellcDAFLjROHAy4NXbr0LAAAz', 'remove_split_with_size_one_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GAuEghZfCNtAVtcCACAqgBH3h4R0br0LAAAz', 'merge_getitem_cat_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GL_q2xZIJ9gRUp4GAAnBc-_frnUpbr0LAAAz', 'merge_splits_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GGIerBZMJvpn5moBAH4lzgkY5_Rjbr0LAAAz', 'after_recompile_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GKPngRabKVDgodEHAJNTi6H37kwbbr0LAAAz', 'before_recompile_post_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GBuDPxmwQPFoGJkCAOsLt_QwVNxvbr0LAAAz', 'BatchMulPostGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GJQypRaGi3AMr3MBAMWUDs5rHztkbr0LAAAz', 'after_recompile_post_grad': 
'https://www.internalfb.com/intern/everpaste/?color=0&handle=GMce9xaOaCu3l9YCAM41j-H0hWZMbr0LAAAz', 'inductor': Counter({'pattern_matcher_nodes': 2281, 'pattern_matcher_count': 2081, 'normalization_pass': 864, 'remove_split_with_size_one_pass': 748, 'merge_splits_pass': 82, 'merge_getitem_cat_pass': 11, 'scmerge_split_sections_removed': 4, 'batch_layernorm': 1, 'batch_sigmoid': 1, 'scmerge_split_added': 1, 'scmerge_cat_added': 1, 'scmerge_split_removed': 1, 'scmerge_cat_removed': 1, 'batch_aten_mul': 1}), 'model_type': None, 'global_rank': None} ``` # e2e test I have no resources to run the test right now due to the MC proposal deadline. Will add it next week. It should be OK based on the local reproduce results. Differential Revision: D55777055 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123398 Approved by: https://github.com/Yuzhen11	2024-04-09 03:28:10 +00:00
angelayi	493478db4a	[effects] Add inductor support for tokens (#122347 ) Given the following code/dynamo graph: ``` class GraphModule(torch.nn.Module): def forward(self, L_x_ : torch.Tensor): l_x_ = L_x_ _print = torch.ops.aten._print('moo') res = l_x_ + l_x_; l_x_ = None _print_1 = torch.ops.aten._print('moo') return (res,) ``` AOTAutograd will trace the following program, threading tokens from the inputs, through the effectful operator calls (torch.ops.aten._print), and as an output: ``` class <lambda>(torch.nn.Module): def forward(self, arg0_1: "f32[0]", arg1_1: "f32[2, 3]"): with_effects = torch._higher_order_ops.effects.with_effects(arg0_1, torch.ops.aten._print.default, 'moo'); arg0_1 = None getitem: "f32[0]" = with_effects[0]; with_effects = None add: "f32[2, 3]" = torch.ops.aten.add.Tensor(arg1_1, arg1_1); arg1_1 = None with_effects_1 = torch._higher_order_ops.effects.with_effects(getitem, torch.ops.aten._print.default, 'moo'); getitem = None getitem_2: "f32[0]" = with_effects_1[0]; with_effects_1 = None return (getitem_2, add) ``` However when we get to inductor, since we want the inductor generated code to not have any token inputs/outputs for better readability, we want to modify the aten graph by removing the tokens from inputs, and creating them through `torch.ops.aten._make_dep_token`, and sinking them through the `torch.ops.aten._sink_tokens` operators. This has to be done after the partitioner, otherwise the partitioner will add the make_token/sink_token operators to the backwards graph. 
``` class <lambda>(torch.nn.Module): def forward(self, arg1_1: "f32[2, 3]"): _make_dep_token_default: "f32[0]" = torch.ops.aten._make_dep_token.default() with_effects = torch._higher_order_ops.effects.with_effects(_make_dep_token_default, torch.ops.aten._print.default, 'moo'); _make_dep_token_default = None getitem: "f32[0]" = with_effects[0]; with_effects = None add: "f32[2, 3]" = torch.ops.aten.add.Tensor(arg1_1, arg1_1); arg1_1 = None with_effects_1 = torch._higher_order_ops.effects.with_effects(getitem, torch.ops.aten._print.default, 'moo'); getitem = None getitem_2: "f32[0]" = with_effects_1[0]; with_effects_1 = None _sink_tokens_default = torch.ops.aten._sink_tokens.default((getitem_2,)); getitem_2 = None return (add,) ``` When doing inductor lowering, we convert `with_effects` calls to an `EffectfulKernel`, which just a `FallbackKernel` but with a pointer to previous effectful operator's call. During scheduling, we will create a `StarDep` between the EffectfulKernel and its previous EffectfulKernel so that they don't get reordered. The inductor generated python code looks like: ``` def call(args): arg1_1, = args args.clear() assert_size_stride(arg1_1, (2, 3), (3, 1)) # Source Nodes: [_print], Original ATen: [] buf2 = aten._print.default('moo') # Source Nodes: [_print_1], Original ATen: [] buf3 = aten._print.default('moo') buf4 = empty_strided_cpu((2, 3), (3, 1), torch.float32) cpp_fused_add_0(arg1_1, buf4) del arg1_1 return (buf4, ) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122347 Approved by: https://github.com/bdhirsh	2024-04-09 03:22:32 +00:00
Oguz Ulgen	f8465df9f0	Use graph.find_nodes in inductor (#122256 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122256 Approved by: https://github.com/jansel ghstack dependencies: #121565, #122255	2024-04-07 18:51:14 +00:00
Laith Sakka	caed7f6727	profile pt2 compile time with strobelight (#123311 ) For OSS, this diff adds a decorator @profile_sb_fbcode that is a no-op for non-Meta workloads. Facebook: With this diff someone can generate a strobelight profile for pt2 compilation. Users need to set the env variable TORCH_COMPILE_SL_PROFILE =TRUE . For example: ``` TORCH_COMPILE_SL_PROFILE =TRUE buck2 run @//mode/inplace @//mode/opt //caffe2/fb/strobelight:compiletime_profile_example ``` See the sample output below, at the end of the summary. The way this works is that a unique id is generated and associated with all samples that are collected for functions that are decorated with profile_sb_fbcode. This id can then be used to combine different strobelight profiles into one (for example, three compilation events happen in the output below). Right now the following two functions are annotated with profile_sb_fbcode: bw_compiler and _compile. If profile_sl_fbcode is called recursively, the recursive invocations are ignored and a log is printed. 
The output is: ``` Strobelight is enabled for pt2 compilation Unique user-id for this run is: 2024-04-03-13:59:49147091devvm4561.ash0.facebook.com You can use the following link to access the strobelight profile at the end of the run: https://www.internalfb.com/intern/scuba/query/?dataset=pyperf_experimental%2Fon_demand&drillstate=%7B%22purposes%22%3A[]%2C%22end%22%3A%22now%22%2C%22start%22%3A%22-30%20days%22%2C%22filterMode%22%3A%22DEFAULT%22%2C%22modifiers%22%3A[]%2C%22sampleCols%22%3A[]%2C%22cols%22%3A[%22namespace_id%22%2C%22namespace_process_id%22]%2C%22derivedCols%22%3A[]%2C%22mappedCols%22%3A[]%2C%22enumCols%22%3A[]%2C%22return_remainder%22%3Afalse%2C%22should_pivot%22%3Afalse%2C%22is_timeseries%22%3Afalse%2C%22hideEmptyColumns%22%3Afalse%2C%22timezone%22%3A%22America%2FLos_Angeles%22%2C%22compare%22%3A%22none%22%2C%22samplingRatio%22%3A%221%22%2C%22metric%22%3A%22count%22%2C%22aggregation_field%22%3A%22async_stack_complete%22%2C%22top%22%3A10000%2C%22aggregateList%22%3A[]%2C%22param_dimensions%22%3A[%7B%22dim%22%3A%22py_async_stack%22%2C%22op%22%3A%22edge%22%2C%22param%22%3A%220%22%2C%22anchor%22%3A%220%22%7D]%2C%22order%22%3A%22weight%22%2C%22order_desc%22%3Atrue%2C%22constraints%22%3A[[%7B%22column%22%3A%22run_user%22%2C%22op%22%3A%22eq%22%2C%22value%22%3A[%22[%5C%222024-04-03-13:59:49147091devvm4561.ash0.facebook.com%5C%22]%22]%7D]]%2C%22c_constraints%22%3A[[]]%2C%22b_constraints%22%3A[[]]%2C%22ignoreGroupByInComparison%22%3Afalse%7D&view=GraphProfilerView&&pool=uber&graphprofiler_filter=&graphprofiler_column_to_sort_by=exclusive the link below takes you to the collected strobelight profile 
https://www.internalfb.com/intern/scuba/query/?dataset=pyperf_experimental%2Fon_demand&drillstate=%7B%22dimensions%22%3A%5B%5D%2C%22param_dimensions%22%3A%5B%7B%22anchor%22%3A%220%22%2C%22param%22%3A%220%22%2C%22op%22%3A%22edge%22%2C%22dim%22%3A%22py_async_stack%22%7D%5D%2C%22constraints%22%3A%5B%5B%7B%22value%22%3A%5B%22%5B%5C%22-6800545191281321%5C%22%5D%22%5D%2C%22op%22%3A%22eq%22%2C%22column%22%3A%22run_id%22%7D%2C%7B%22value%22%3A%5B%22%5B%5C%222024-04-03-13%3A59%3A49147091devvm4561.ash0.facebook.com%5C%22%5D%22%5D%2C%22op%22%3A%22eq%22%2C%22column%22%3A%22run_user%22%7D%5D%5D%2C%22top%22%3A10000%2C%22end%22%3A%221712181610%22%2C%22start%22%3A%221712174410%22%7D&view=GraphProfilerView& 1 storbelight success runs out of 1 non-ignored runs. strobelight run id is: 6181728288420687 the link below takes you to the collected strobelight profile https://www.internalfb.com/intern/scuba/query/?dataset=pyperf_experimental%2Fon_demand&drillstate=%7B%22dimensions%22%3A%5B%5D%2C%22param_dimensions%22%3A%5B%7B%22anchor%22%3A%220%22%2C%22param%22%3A%220%22%2C%22op%22%3A%22edge%22%2C%22dim%22%3A%22py_async_stack%22%7D%5D%2C%22constraints%22%3A%5B%5B%7B%22value%22%3A%5B%22%5B%5C%226181728288420687%5C%22%5D%22%5D%2C%22op%22%3A%22eq%22%2C%22column%22%3A%22run_id%22%7D%2C%7B%22value%22%3A%5B%22%5B%5C%222024-04-03-13%3A59%3A49147091devvm4561.ash0.facebook.com%5C%22%5D%22%5D%2C%22op%22%3A%22eq%22%2C%22column%22%3A%22run_user%22%7D%5D%5D%2C%22top%22%3A10000%2C%22end%22%3A%221712181621%22%2C%22start%22%3A%221712174421%22%7D&view=GraphProfilerView& 2 storbelight success runs out of 2 non-ignored runs. 
strobelight run id is: -1026103682715688 the link below takes you to the collected strobelight profile https://www.internalfb.com/intern/scuba/query/?dataset=pyperf_experimental%2Fon_demand&drillstate=%7B%22dimensions%22%3A%5B%5D%2C%22param_dimensions%22%3A%5B%7B%22anchor%22%3A%220%22%2C%22param%22%3A%220%22%2C%22op%22%3A%22edge%22%2C%22dim%22%3A%22py_async_stack%22%7D%5D%2C%22constraints%22%3A%5B%5B%7B%22value%22%3A%5B%22%5B%5C%22-1026103682715688%5C%22%5D%22%5D%2C%22op%22%3A%22eq%22%2C%22column%22%3A%22run_id%22%7D%2C%7B%22value%22%3A%5B%22%5B%5C%222024-04-03-13%3A59%3A49147091devvm4561.ash0.facebook.com%5C%22%5D%22%5D%2C%22op%22%3A%22eq%22%2C%22column%22%3A%22run_user%22%7D%5D%5D%2C%22top%22%3A10000%2C%22end%22%3A%221712181647%22%2C%22start%22%3A%221712174447%22%7D&view=GraphProfilerView& 3 storbelight success runs out of 3 non-ignored runs. ``` Test Plan: Was tested on buck2 run @//mode/inplace @//mode/opt //caffe2/fb/strobelight:compiletime_profile_example This was also tested in one of the ads benchmarks ``` TORCH_COMPILE_SL_PROFILE =TRUE buck2 run mode/opt mode/inplace //pytorch/benchmark:run -- ads_mc_igctr_mc3_v0 -d cuda -t train --torchdynamo inductor ``` The results matches the results reported in https://fb.workplace.com/groups/257735836456307/permalink/657458576484029 Differential Revision: D55672271 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123311 Approved by: https://github.com/aorenste	2024-04-06 18:57:44 +00:00
Shunting Zhang	6890333e3d	[inductor] fix tensor overlap detection that cause cudagraphs being disabled (#123327 ) If any graph input has overlapping memory, inductor disables cudagraphs. But the function `complex_memory_overlap`, which detects memory overlap, can produce false positives. E.g., for the tensor `rand_strided((8, 1500, 1), (1504, 1, 1), device=self.device)` the function previously reported an overlap. This is caused by the size=1 dimension. The fix is to squeeze before running the detection algorithm. This fixes the perf regression for hf_Whisper and timm_efficientdet when we do padding. For these models, cudagraphs were dynamically disabled when doing padding due to the issue discussed here, which caused the perf regression. This may help the dashboard if this is a common thing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123327 Approved by: https://github.com/Chillee	2024-04-04 08:55:00 +00:00
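The false positive and the squeeze fix can be reproduced with a toy detector. The sketch below is not PyTorch's `complex_memory_overlap`; `naive_overlap` is a simplified stride-sorted heuristic written for illustration, and `brute_force_overlap` is a slow ground-truth check that enumerates every offset. All function names are hypothetical.

```python
from itertools import product


def naive_overlap(sizes, strides):
    """Stride-sorted heuristic; can report false positives for size-1 dims."""
    dims = sorted(zip(sizes, strides), key=lambda d: d[1])
    for (prev_size, prev_stride), (_, stride) in zip(dims, dims[1:]):
        if stride < prev_size * prev_stride:
            return True
    return False


def squeeze(sizes, strides):
    """Drop size-1 dimensions, which cannot contribute to an overlap."""
    kept = [(s, st) for s, st in zip(sizes, strides) if s != 1]
    return [s for s, _ in kept], [st for _, st in kept]


def overlap_after_squeeze(sizes, strides):
    return naive_overlap(*squeeze(sizes, strides))


def brute_force_overlap(sizes, strides):
    """Ground truth: do two distinct indices map to the same storage offset?"""
    seen = set()
    for index in product(*(range(s) for s in sizes)):
        offset = sum(i * st for i, st in zip(index, strides))
        if offset in seen:
            return True
        seen.add(offset)
    return False
```

For sizes `(8, 1500, 1)` and strides `(1504, 1, 1)`, the heuristic compares the size-1 dimension (stride 1) against the size-1500 dimension (also stride 1) and flags an overlap, while the squeezed version and the brute-force check both report none.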
Simon Fan	8ac0f072e6	[aot eager] Support frontend graphs with list arguments (#123212 ) We already support bumpy inputs for 3rd party frontends and the compiled backward graph; we should add the behavior to aot_eager too. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123212 Approved by: https://github.com/jansel ghstack dependencies: #122691, #122746, #123007	2024-04-03 17:07:52 +00:00
Simon Fan	12e36dc1df	[dynamo] Fix torch._dynamo.disable on flatten_graph_inputs wrapper (#123007 ) Existing `innermost_fn` handling of `functools.wraps` is not ideal, but I'm not sure if there's a good fix. This can manifest for GmWrapper (used to handle list inputs from Dynamo -> AOTAutograd) where we don't call the unflatten wrapper at runtime. Since core parts of Dynamo rely on an attribute check for `_torchdynamo_orig_callable`, I'm adding a test to cover it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123007 Approved by: https://github.com/jansel ghstack dependencies: #122691, #122746	2024-04-02 21:39:44 +00:00
Bin Bao	0ff6155eee	[AOTI] Support module buffer mutation (#123164 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/120424. Because in a forward pass module buffers may be mutated, we need to allow that in AOTI. In addition, this will be a necessary step if we want to extend AOTI to training. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123164 Approved by: https://github.com/digantdesai, https://github.com/malfet, https://github.com/chenyang78, https://github.com/khabinov	2024-04-02 20:25:26 +00:00
eellison	5f46312dbb	Reapply "Switch cudagraph backend to cudagraph trees (#121019 )" and "Add Cudagraphs disable checking (#121018 )" (#121864 ) (#122713 ) This reverts commit `92ed8553a6`. No longer importing codecache or boxed_nop at top level, both of which caused issues. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122713 Approved by: https://github.com/anijain2305	2024-04-02 16:11:00 +00:00
Menglu Yu	c40f386afd	[Inductor][1/n]Split cat customization (#123045 ) Summary: Change the config and revise the group batch fusion so that it does not reuse the existing pre_grad and post_grad fusion options Test Plan: # unit test ``` buck2 test @mode/dev-nosan //caffe2/test/inductor:split_cat_fx_passes ``` Test UI: https://www.internalfb.com/intern/testinfra/testrun/17732923560510096 Network: Up: 15MiB Down: 155MiB (reSessionID-6a577a14-1772-42d9-9ae8-bfdc62f406a3) Jobs completed: 267487. Time elapsed: 2:39.7s. Cache hits: 99%. Commands: 104465 (cached: 104457, remote: 8, local: 0) Tests finished: Pass 11. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` buck2 test @mode/dev-nosan //caffe2/test/inductor/fb:split_cat_fx_passes_fb ``` Test UI: https://www.internalfb.com/intern/testinfra/testrun/9007199283031382 Network: Up: 28MiB Down: 177MiB (reSessionID-a3081518-7cba-4c83-b442-c16655ecb2cd) Jobs completed: 183164. Time elapsed: 1:41.4s. Cache hits: 99%. Commands: 75875 (cached: 75862, remote: 12, local: 1) Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` buck2 test @mode/dev-nosan //caffe2/test/inductor:group_batch_fusion ``` Test UI: https://www.internalfb.com/intern/testinfra/testrun/10133099189612276 Network: Up: 1.3MiB Down: 3.1MiB (reSessionID-0d312a2d-e19e-4ba6-9f96-7eb5863734e7) Discovered 9. Pass 0. Fail 0. Fatal 0. Skip 0. Timeout 0 Network: Up: 1.4MiB Down: 3.2MiB (reSessionID-0d312a2d-e19e-4ba6-9f96-7eb5863734e7) Jobs completed: 68. Time elapsed: 2:19.9s. Cache hits: 0%. Commands: 13 (cached: 0, remote: 1, local: 12) Tests finished: Pass 9. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` buck2 test @mode/dev-nosan //caffe2/test/inductor:perf ``` Test UI: https://www.internalfb.com/intern/testinfra/testrun/5066549804623287 Network: Up: 1.5MiB Down: 1.1MiB (reSessionID-8d912a20-fceb-4698-89c3-d28e0708831f) Jobs completed: 164. Time elapsed: 1:42.2s. Cache hits: 0%. Commands: 13 (cached: 0, remote: 1, local: 12) Tests finished: Pass 57. Fail 0. 
Fatal 0. Skip 0. Build failure 0 # local reproduce case 1: with split cat ``` buck2 run @mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "cmf" --flow_id 524546542 ``` optimus parameter sent to the scuba: ``` {'before_recompile_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GLL6RBZb-ssXJYcBAMzw0oaKtp80br0LAAAz', 'BatchLayernormFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GH1LAxcxv0Ae_BkFAHVav3K3oosDbr0LAAAz', 'BatchSigmoidPreGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GNb0jwR-Ukkqns4CAGRmOqucfedDbr0LAAAz', 'normalization_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GHsIQxm-hn3SPrgCAKq1E-HBsoZHbr0LAAAz', 'remove_split_with_size_one_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GOrJORmbMTV_xlQDAOwolqclPsIAbr0LAAAz', 'merge_getitem_cat_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GCqkmRblvVKybGUDACVxkwVIrWxLbr0LAAAz', 'merge_splits_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GCB1QBfko_kVN0wFAKGjSZv4DJULbr0LAAAz', 'after_recompile_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GMwJPRmu4ry88swDAO1gdA5RCKIXbr0LAAAz', 'before_recompile_post_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GLXCORnNiKeQFmoDABR93CRKmP8Sbr0LAAAz', 'BatchMulPostGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GBMIPRnlwQyjSD4BANPuaMhV7MUjbr0LAAAz', 'after_recompile_post_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GJ9KPxkOv4LL8_0DAA65D4kh4JYDbr0LAAAz', 'inductor': Counter({'pattern_matcher_nodes': 2844, 'pattern_matcher_count': 2604, 'normalization_pass': 886, 'remove_split_with_size_one_pass': 748, 'merge_splits_pass': 82, 'merge_getitem_cat_pass': 11, 'scmerge_split_sections_removed': 4, 'batch_aten_mul': 4, 'batch_sigmoid': 2, 'batch_aten_sub': 2, 
'batch_layernorm': 1, 'scmerge_split_added': 1, 'scmerge_cat_added': 1, 'scmerge_split_removed': 1, 'scmerge_cat_removed': 1, 'batch_aten_add': 1}), 'BatchAddPostGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GEcvPxmxBj-pd8gCABE1QgB-d6N6br0LAAAz', 'BatchSubPostGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GEvQxhYomJGj2FMBAEXXAI8Vgzhmbr0LAAAz'} ``` P1202819405 case 2: without split cat ``` buck2 run @mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch --model_type "cmf" --flow_id 524546542 ``` optimus parameter sent to the scuba: ``` {'before_recompile_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GAY7PxmGthuyjSwEAHF_A767YbMkbr0LAAAz', 'BatchLayernormFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GLDPtBacXyybEOICAKaGCPatq5oabr0LAAAz', 'BatchSigmoidPreGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GBu7ORkiDJu42QAEAGmlVTgO_Mpbbr0LAAAz', 'after_recompile_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GC893BZNl99ftY4BAHm5Z8sM4ptSbr0LAAAz', 'before_recompile_post_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GCAeuRYgzPO5RcsCAPO3Z7tdMNMKbr0LAAAz', 'BatchMulPostGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GHBIQxm1jlU-xhsFAONkzhh2mgknbr0LAAAz', 'after_recompile_post_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GDoUPhmZ0noiaGMDAJHYuuiwHEAUbr0LAAAz', 'inductor': Counter({'pattern_matcher_nodes': 1189, 'pattern_matcher_count': 757, 'batch_aten_mul': 9, 'batch_aten_sub': 3, 'batch_sigmoid': 2, 'batch_aten_add': 2, 'batch_layernorm': 1, 'batch_linear_post_grad': 1}), 'BatchAddPostGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GAluthYxi8uxpI4BAIQDzn3OyywUbr0LAAAz', 'BatchSubPostGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GDjsJhTK5VAcot4CADIcAixghrYibr0LAAAz', 
'PostGradBatchLinearFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GEPfJxfJwktC7wsEAA0QbkqYNuVAbr0LAAAz'} ``` P1202823734 # e2e training_platform:fd4f02cd855f5cc0ccb49317a5a6c8bb with split cat f546646358 without split cat f546647159 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123045 Approved by: https://github.com/jackiexu1992	2024-04-02 14:36:22 +00:00
Jason Ansel	6c0911f1d9	[inductor] Skip cudagraphs warning on CPU (#123009 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123009 Approved by: https://github.com/shunting314	2024-03-30 05:46:09 +00:00
Menglu Yu	9693797491	[PT2][Inductor][Observability] Improve the optimus scuba log (#122361 ) Summary: Titled Test Plan: ``` buck2 test mode/dev-nosan //caffe2/test/inductor:group_batch_fusion ``` Test UI: https://www.internalfb.com/intern/testinfra/testrun/18014398535709463 Network: Up: 113KiB Down: 480KiB (reSessionID-1d2e3558-15b5-4a4e-8c5d-10c983afb389) Discovered 9. Pass 0. Fail 0. Fatal 0. Skip 0. Timeout 0 Command: test. Remaining: 9/24. Cache hits: 0%. Time elapsed: 44.3s Command: test. Remaining: 9/24. Cache hits: 0%. Time elapsed: 44.4s Command: test. Remaining: 9/24. Cache hits: 0%. Time elapsed: 44.5s Network: Up: 117KiB Down: 507KiB (reSessionID-1d2e3558-15b5-4a4e-8c5d-10c983afb389) Jobs completed: 24. Time elapsed: 1:48.3s. Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2) Tests finished: Pass 9. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` buck2 test mode/dev-nosan //caffe2/test/inductor:split_cat_fx_passes ``` Test UI: https://www.internalfb.com/intern/testinfra/testrun/16044073698893554 Network: Up: 120KiB Down: 60KiB (reSessionID-57f2c21b-3f4e-462b-9e5b-fe3dd15f6b7d) Jobs completed: 28. Time elapsed: 1:47.5s. Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2) Tests finished: Pass 11. Fail 0. Fatal 0. Skip 0. 
Build failure 0 optimus_scuba_log: ``` {'before_recompile_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GIbj2haUwKx69H8BAKXdGqXZSpoybr0LAAAz', 'group_batch_fusion_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GFqhiRYcJ_C4JFoDABKPTsfpzjJ_br0LAAAz', 'normalization_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GIvswhaiAVyipcoGAJZ5sUi8Bb5qbr0LAAAz', 'remove_split_with_size_one_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GFneTxcVBPaqVuwCADCiI4q1mEwlbr0LAAAz', 'merge_getitem_cat_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GJc0Phn87ljuMO0CADBPGqqehKp2br0LAAAz', 'merge_splits_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GLWB_BbvLyT7D_0DABmygDYPDjJ_br0LAAAz', 'after_recompile_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GO6eQBeIj6oV3o4JAFLzQ3ECMTIrbr0LAAAz', 'inductor_pre_grad': Counter({'pattern_matcher_nodes': 2006, 'pattern_matcher_count': 1806, 'normalization_pass': 861, 'remove_split_with_size_one_pass': 748, 'merge_splits_pass': 82, 'merge_getitem_cat_pass': 11, 'scmerge_split_sections_removed': 4, 'batch_layernorm': 1, 'batch_sigmoid': 1, 'scmerge_split_added': 1, 'scmerge_cat_added': 1, 'scmerge_split_removed': 1, 'scmerge_cat_removed': 1}), 'before_recompile_post_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GMoKmxYg6AUeQ40KAMDaJ4EVDwYmbr0LAAAz', 'group_batch_fusion_post_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GHIvQxkrV1PMBggEACv7786a2bE8br0LAAAz', 'after_recompile_post_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GIpBNxXupQTHWx8BALSiVrKgDbtfbr0LAAAz', 'inductor_post_grad': Counter({'pattern_matcher_nodes': 2093, 'pattern_matcher_count': 1893, 'normalization_pass': 861, 'remove_split_with_size_one_pass': 748, 'merge_splits_pass': 82, 'merge_getitem_cat_pass': 11, 
'scmerge_split_sections_removed': 4, 'batch_layernorm': 1, 'batch_sigmoid': 1, 'scmerge_split_added': 1, 'scmerge_cat_added': 1, 'scmerge_split_removed': 1, 'scmerge_cat_removed': 1, 'batch_aten_mul': 1})} ``` Differential Revision: D55107000 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122361 Approved by: https://github.com/jackiexu1992	2024-03-28 17:13:32 +00:00
eellison	df724153c1	Add option to skip cudagraphing on dynamic shape graphs (#122520 ) This was requested internally. Differential Revision: [D55264528](https://our.internmc.facebook.com/intern/diff/D55264528) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122520 Approved by: https://github.com/mlazos, https://github.com/shunting314	2024-03-26 21:49:21 +00:00
Honglin Zhu	adeedc060f	[Inductor] Fix unbacked symbol in stride when using item() (#122298 ) Fixes #122296 Test: python test/inductor/test_torchinductor_dynamic_shapes.py -k test_item_unbacked_stride_nobreak_cuda Co-authored-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122298 Approved by: https://github.com/ezyang	2024-03-24 06:27:15 +00:00
David Berard	8013c4409f	[inductor] config to control whether we assume inputs are aligned (#122158 ) Motivation: https://github.com/pytorch/pytorch/issues/112771 Summary: Inductor generates triton that assumes that inputs are going to be 16-byte aligned. If the inputs aren't aligned, Inductor clones the inputs. This PR introduces a config option to not do this: when assume_aligned_inputs=False, Inductor will _not_ pass inputs as being divisible_by_16, and Inductor will not make clones. This can generate code that might be a bit slower, but this tradeoff can be worth it in some scenarios where you might otherwise make a lot of clones. Ideally, we could do this on a per-tensor basis. But this would be a lot of work, and attempts to add guards on storage offsets to do this automatically have run into issues: recompilations and excessive time to generate/check guards. Tests https://github.com/pytorch/pytorch/pull/122159 flips this to False. It didn't run through all errors, but the ones we see are all expected failures: divisible_by_16 changes; triton kernel caching fails if we call the same triton kernel multiple times (this makes sense because the first call will have unaligned inputs, but subsequent calls have aligned inputs); and some xfailed tests start passing. Alternatives/RFC: * Is this the right thing to do with cudagraphs? * Elias and Jason mentioned that we probably still want to make clones if we're dealing with unaligned inputs to matmuls. Is this something we should add in this config option? (In the use case I'm targeting, it seems like we don't need this optimization right now) Differential Revision: [D55079094](https://our.internmc.facebook.com/intern/diff/D55079094) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122158 Approved by: https://github.com/ezyang	2024-03-22 20:03:38 +00:00
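The clone-or-not decision described in the #122158 message above can be sketched in plain Python. This is a minimal illustration, not Inductor's code: `is_aligned` and `prepare_input` are hypothetical helpers, and a raw integer stands in for a tensor's data pointer. Only the 16-byte alignment and the `assume_aligned_inputs` name come from the commit message.

```python
ALIGNMENT = 16  # bytes; the commit message says Inductor assumes 16-byte alignment


def is_aligned(data_ptr: int, alignment: int = ALIGNMENT) -> bool:
    """Return True if the (hypothetical) data pointer is a multiple of `alignment`."""
    return data_ptr % alignment == 0


def prepare_input(data_ptr: int, assume_aligned_inputs: bool) -> tuple:
    """Return (needs_clone, mark_divisible_by_16) for one kernel input.

    Hypothetical helper: with the option off, the input is never marked
    as divisible by 16 and is never cloned; with it on, the kernel is
    compiled for aligned inputs and an unaligned input must be cloned.
    """
    if not assume_aligned_inputs:
        return (False, False)
    return (not is_aligned(data_ptr), True)


if __name__ == "__main__":
    for ptr in (32, 36):
        for flag in (True, False):
            print(ptr, flag, prepare_input(ptr, flag))
```

The tradeoff in the message shows up directly: turning the option off removes every clone, at the cost of never telling the kernel that a pointer is divisible by 16.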
Adnan Akhundov	e419011471	[inductor] Add torch.while_loop support to JIT Inductor (#122069 ) Summary: `torch.while_loop` HOP support is added to JIT Inductor. The test coverage is limited due to the functionality constraints of the upstream `torch.while_loop` op in Dynamo / Export. When those are lifted, we'll add more tests (see TODO-s in the test file). AOT Inductor support will be added in a follow-up PR. Test Plan: ``` $ python test/inductor/test_control_flow.py ... ---------------------------------------------------------------------- Ran 38 tests in 159.387s OK ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122069 Approved by: https://github.com/jansel, https://github.com/eellison	2024-03-22 02:45:27 +00:00
angelayi	99055ae165	[aoti] Fix compilation bug for buffer mutations (#121688 ) I realized there's a bug when unlifting buffer mutations in AOTI. However there seems to be a bug during tracing where AOTI mutates the buffer. I didn't take the time to investigate, so I left it as a TODO for now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121688 Approved by: https://github.com/chenyang78, https://github.com/bdhirsh	2024-03-21 21:51:32 +00:00
Sam Larsen	6f4fa8e9a1	[inductor] FX graph cache: simplify "current callable" logic (#121903 ) Summary: The handling of the current_callable and compiled_artifact fields in the CompiledFxGraph object is unnecessarily complicated and confusing. We can simplify by storing only the callable. That field is not serializable, so the caching approach is to store a path to the generated artifact and reload from disk on a cache hit. We can just reload inline in the FX cache hit path. This change has the added benefit that it makes it easier to fallback to a "cache miss" if the path somehow doesn't exist. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121903 Approved by: https://github.com/eellison	2024-03-15 20:00:08 +00:00
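The "store a path, reload on a cache hit, fall back to a miss if the path is gone" idea from the #121903 message above can be sketched as follows. This is an assumption-laden toy, not the FX graph cache: `ArtifactCache`, its methods, and the convention that the generated file defines a function named `call` are all hypothetical.

```python
import os
import tempfile


class ArtifactCache:
    """Toy cache mapping a key to the path of a generated Python file (hypothetical)."""

    def __init__(self):
        self._paths = {}

    def save(self, key: str, source: str, directory: str) -> str:
        """Write the generated source to disk and remember only its path."""
        path = os.path.join(directory, f"{key}.py")
        with open(path, "w") as f:
            f.write(source)
        self._paths[key] = path
        return path

    def lookup(self, key: str):
        """Reload the callable from disk, or return None to signal a cache miss."""
        path = self._paths.get(key)
        if path is None or not os.path.exists(path):
            return None  # unknown key or missing file: treat as a miss
        namespace = {}
        with open(path) as f:
            exec(compile(f.read(), path, "exec"), namespace)
        return namespace["call"]  # assumed entry-point name


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cache = ArtifactCache()
        path = cache.save("graph_key", "def call(x):\n    return x + 1\n", tmp)
        print(cache.lookup("graph_key")(1))
        os.remove(path)
        print(cache.lookup("graph_key"))
```

Because only the path is stored, nothing unserializable has to live in the cache entry, and a deleted artifact degrades to a recompile rather than an error.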
PyTorch MergeBot	0cd094a4fd	Revert "[aoti] Fix compilation bug for buffer mutations (#121688 )" This reverts commit `9f314d4aa8`. Reverted https://github.com/pytorch/pytorch/pull/121688 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/121688#issuecomment-1998740094))	2024-03-15 01:34:04 +00:00
Animesh Jain	92ed8553a6	Revert "Switch cudagraph backend to cudagraph trees (#121019 )" and "Add Cudagraphs disable checking (#121018 )" (#121864 ) This reverts commit `9373ad0bb8`. Revert "Add Cudagraphs disable checking (#121018)" This reverts commit `4af0e634bf`. Causes compilation time increase. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121864 Approved by: https://github.com/eellison	2024-03-15 00:03:09 +00:00
angelayi	9f314d4aa8	[aoti] Fix compilation bug for buffer mutations (#121688 ) I realized there's a bug when unlifting buffer mutations in AOTI. However there seems to be a bug during tracing where AOTI mutates the buffer. I didn't take the time to investigate, so I left it as a TODO for now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121688 Approved by: https://github.com/chenyang78	2024-03-14 19:35:26 +00:00
eellison	9373ad0bb8	Switch cudagraph backend to cudagraph trees (#121019 ) Switch torch.compile(..., backend="cudagraphs") to use cudagraph trees. Enabled a few tests in cudagraph_trees and note that there is another test suite existing for cudagraphs backend: https://github.com/pytorch/pytorch/blob/main/test/dynamo/test_cudagraphs.py. This is basically the inductor cudagraphs without inductor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121019 Approved by: https://github.com/ezyang, https://github.com/jansel ghstack dependencies: #121017, #121018	2024-03-08 22:56:26 +00:00
Elias Ellison	937e89f252	cudagraphs backend refactoring (#121017 ) This is just some refactoring.. no functional changes Pull Request resolved: https://github.com/pytorch/pytorch/pull/121017 Approved by: https://github.com/ezyang	2024-03-08 19:47:41 +00:00
Sam Larsen	72dd9b2430	[inductor] Make some improvements to FX graph caching (#117888 ) Summary: This is in preparation to enable FX graph caching by default. First fix some bugs uncovered by running all unit tests under `test/inductor/`. I'll enable in a separate diff in case we need to revert. Summary of changes: * Turn off caching for tests that require a compilation, e.g., when checking that a relevant counter was incremented * Bypass caching when we see mkldnn tensors as constants (they currently don't serialize, so we can't save to disk) * Include various global settings that could affect compilation in the cache key calculation. * Handle a few config settings that break key calculation. * Handle code paths where no ShapeEnv is available (the cache impl requires a shape env as part of handling guards) * Skip caching when freezing is enabled (Freezing can embed constants that wouldn't be static across runs). * Fix the clear() method to not throw when the cache /tmp dir doesn't exist Test Plan: Ran all tests under `test/inductor/` twice with TORCHINDUCTOR_FX_GRAPH_CACHE=1 to exercise any test that might be affected by caching. Pull Request resolved: https://github.com/pytorch/pytorch/pull/117888 Approved by: https://github.com/eellison	2024-03-08 02:30:49 +00:00
Adnan Akhundov	0dbef1618f	[inductor] Apply fx passes recursively to nested subgraphs (#120665 ) Summary: The current machinery of Inductor's `compile_fx` assumes that the incoming fx graph is flat. As a result, everything before `graph.run` is applied to the outermost graph. This assumption was valid before #119759, but now there is control flow bringing (arbitrarily deeply) nested fx subgraphs to `compile_fx`. In this PR, we start extending the `compile_fx` machinery to deal with nested fx subgraphs. Namely, we recursively apply Inductor's `pre_grad`, `joint_graph`, and `post_grad` passes to the nested subgraphs in the incoming fx graph. For the recursive application of the `pre_grad` passes (which require example inputs per subgraph), we don't pass example inputs for the nested subgraphs. A few different attempts to infer the latter via fake tensor prop has led to different side effects in the model. Therefore, to the nested subgraphs, we only apply a subset of `pre_grad` passes that doesn't require example inputs. Test Plan: ``` $ python test/inductor/test_control_flow.py ... ---------------------------------------------------------------------- Ran 26 tests in 59.252s OK ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/120665 Approved by: https://github.com/eellison	2024-02-29 02:34:54 +00:00
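The recursion described in the #120665 message above (all passes on the outermost graph, only the passes that need no example inputs on nested subgraphs) can be sketched as below. This is a toy model under stated assumptions: `Graph`, `PASSES`, and the individual pass names are hypothetical stand-ins, not Inductor's `compile_fx` structures.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Hypothetical stand-in for an fx graph that may contain nested subgraphs."""
    name: str
    subgraphs: list = field(default_factory=list)
    applied: list = field(default_factory=list)  # names of passes run on this graph


# (pass name, needs_example_inputs); the names are illustrative only.
PASSES = [
    ("pre_grad_needs_inputs", True),
    ("pre_grad_structural", False),
    ("joint_graph", False),
    ("post_grad", False),
]


def apply_passes(graph: Graph, have_example_inputs: bool = True) -> None:
    """Run the passes on `graph`, then recurse into its nested subgraphs.

    Nested subgraphs get no example inputs, so any pass that requires
    them is skipped there.
    """
    for name, needs_inputs in PASSES:
        if needs_inputs and not have_example_inputs:
            continue
        graph.applied.append(name)
    for sub in graph.subgraphs:
        apply_passes(sub, have_example_inputs=False)


if __name__ == "__main__":
    inner = Graph("while_body")
    outer = Graph("outer", subgraphs=[Graph("cond_branch", subgraphs=[inner])])
    apply_passes(outer)
    print(outer.applied)
    print(inner.applied)
```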
Edward Z. Yang	1a1fc1047d	Add structured trace logs (#120289 ) Overall design: https://docs.google.com/document/d/1CX_hJ0PNy9f3R1y8TJrfkSeLkvGjjjLU84BSXgS2AZ8/edit How to read the diff: * Most files are me augmenting pre-existing logging with structured variants. For the most part it's simple (esp FX graphs, which have a canonical string representation); it gets more complicated when I decided to JSON-ify some data structure instead of keeping the ad hoc printing (notably, guards and dynamo output graph sizes) * torch/_functorch/_aot_autograd/collect_metadata_analysis.py is some unrelated fixes I noticed while auditing artifact logs * torch/_logging/_internal.py has the actual trace log implementation. The trace logger is implemented as a logger named torch.__trace which is disconnected from the logging hierarchy. It gets its own handler and formatter (TorchLogsFormatter with _is_trace True). `trace_structured` is the main way to emit a trace log. Unusually, there's a separate "metadata" and "payload" field. The metadata field should not be too long (as it is serialized as a single line) and is always JSON (we put contextual things like compile id in it); the payload field can be long and is emitted after the metadata log line and can span multiple lines. * torch/_logging/structured.py contains some helpers for converting Python data structures into JSON form. Notably, we have a string interning implementation here, which helps reduce the cost of serializing filenames into the log. * test/dynamo/test_structured_trace.py the tests are cribbed from test_logging.py, but all rewritten to use expect tests on munged versions of what we'd actually output. Payloads are never tested, since they tend not to be very stable. https://github.com/ezyang/tlparse is a POC Rust program that can interpret these logs. Signed-off-by: Edward Z. 
Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/120289 Approved by: https://github.com/Skylion007 ghstack dependencies: #120712	2024-02-28 01:01:41 +00:00
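The metadata/payload split described in the #120289 message above can be sketched with the standard `logging` module. This is a minimal sketch, not the `torch._logging` implementation: the logger name `example.__trace`, the helpers `format_trace` and `emit_trace`, and the tab-indented payload lines are all assumptions made for illustration. Only the ideas of a logger detached from the hierarchy, a single-line JSON metadata record, and a multi-line payload after it come from the message.

```python
import io
import json
import logging

# Hypothetical logger name; propagate=False detaches it from parent handlers.
trace_log = logging.getLogger("example.__trace")
trace_log.propagate = False
trace_log.setLevel(logging.DEBUG)


def format_trace(name: str, metadata: dict, payload: str = "") -> str:
    """Serialize metadata as one JSON line, followed by indented payload lines."""
    lines = [json.dumps({name: metadata}, sort_keys=True)]
    # Assumed convention: payload lines are tab-indented so a parser can
    # tell them apart from the next metadata line.
    lines.extend("\t" + line for line in payload.splitlines())
    return "\n".join(lines)


def emit_trace(name: str, metadata: dict, payload: str = "") -> None:
    """Send one structured record to the dedicated trace logger."""
    trace_log.debug(format_trace(name, metadata, payload))


if __name__ == "__main__":
    buffer = io.StringIO()
    trace_log.addHandler(logging.StreamHandler(buffer))
    emit_trace("graph_dump", {"compile_id": 0}, "def forward(x):\n    return x")
    print(buffer.getvalue(), end="")
```

Keeping the metadata on one line means a consumer can scan a log line by line and only read payloads for the records it cares about.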
PyTorch MergeBot	f3dd2a544c	Revert "Add structured trace logs (#120289 )" This reverts commit `9dfaef962c`. Reverted https://github.com/pytorch/pytorch/pull/120289 on behalf of https://github.com/kit1980 due to breaking internal builds, see D54230697 ([comment](https://github.com/pytorch/pytorch/pull/120289#issuecomment-1967477120))	2024-02-27 19:49:05 +00:00
Edward Z. Yang	9dfaef962c	Add structured trace logs (#120289 ) Overall design: https://docs.google.com/document/d/1CX_hJ0PNy9f3R1y8TJrfkSeLkvGjjjLU84BSXgS2AZ8/edit How to read the diff: * Most files are me augmenting pre-existing logging with structured variants. For the most part it's simple (esp FX graphs, which have a canonical string representation); it gets more complicated when I decided to JSON-ify some data structure instead of keeping the ad hoc printing (notably, guards and dynamo output graph sizes) * torch/_functorch/_aot_autograd/collect_metadata_analysis.py is some unrelated fixes I noticed while auditing artifact logs * torch/_logging/_internal.py has the actual trace log implementation. The trace logger is implemented as a logger named torch.__trace which is disconnected from the logging hierarchy. It gets its own handler and formatter (TorchLogsFormatter with _is_trace True). There's a teensy bit of FB specific code to automatically enable trace logging if a /logs directory exists. `trace_structured` is the main way to emit a trace log. Unusually, there's a separate "metadata" and "payload" field. The metadata field should not be too long (as it is serialized as a single line) and is always JSON (we put contextual things like compile id in it); the payload field can be long and is emitted after the metadata log line and can span multiple lines. * torch/_logging/structured.py contains some helpers for converting Python data structures into JSON form. Notably, we have a string interning implementation here, which helps reduce the cost of serializing filenames into the log. * test/dynamo/test_structured_trace.py the tests are cribbed from test_logging.py, but all rewritten to use expect tests on munged versions of what we'd actually output. Payloads are never tested, since they tend not to be very stable. https://github.com/ezyang/tlparse is a POC Rust program that can interpret these logs. 
Testing that the fbcode detection works at https://www.internalfb.com/mlhub/pipelines/runs/fblearner/534553450 (Meta-only) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/120289 Approved by: https://github.com/Skylion007	2024-02-27 00:04:23 +00:00
Michael Lazos	56203fc407	Add profiling for backward (#120540 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/120540 Approved by: https://github.com/anijain2305	2024-02-24 16:53:28 +00:00
Menglu Yu	7b1f5c874f	[PT2][Optimus][Observability] Log the optimus graph transformation to the scuba (#119745 ) Summary: Current everstore upload logging may cause excessive compilation time when the model has lots of graph breaks (post: https://fb.workplace.com/groups/257735836456307/permalink/633533465543207/); here we log the transformation only when the graph has changed Test Plan: timeout flows: f528209775 f530084719 Differential Revision: D53692344 Pull Request resolved: https://github.com/pytorch/pytorch/pull/119745 Approved by: https://github.com/jackiexu1992	2024-02-16 21:32:04 +00:00
Yanbo Liang	7f5b87c953	[torch.compile] Log more compilation time breakdown (#119865 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/119865 Approved by: https://github.com/ezyang	2024-02-15 02:20:07 +00:00
Mu-Chu Lee	2b48891e62	[AOTInductor] Add Runtime Constant-folding for AOTInductor (#118765 ) Summary: Add Runtime Constant-folding for AOTInductor. This also includes the invocation of constant folding at load time. The constant folding lowering is a 2-step process. First, we split the graph into 2 modules. One of them is the constant module, which doesn't depend on any input, so the whole module can be inferred (constant-folded) once and reused. The constant module is lowered and codegen-ed as usual and cached (let's call this the constant code). The constant code reuses the whole lowering/profiling/etc. process; the only difference is that we do not generate any headers or initialization for the constant code. Second, after handling the constant module, we take care of the main module (the part that depends on the user input). For the main module, we take in one additional component, the constant code, compared with a normal lowering. The additional step here is that we inject the constant code into the codegen-ed main module and create the caller for the main module to consume the result of the constant module. Test Plan: Unit tests included in commit. Differential Revision: D53274382 Pull Request resolved: https://github.com/pytorch/pytorch/pull/118765 Approved by: https://github.com/chenyang78	2024-02-01 04:54:25 +00:00
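The first step in the #118765 message above, splitting a graph into a constant module and a main module, can be sketched on a toy dependency graph. This is an illustration under stated assumptions, not AOTInductor's implementation: `split_constant_nodes` is a hypothetical helper, and the graph is a plain dict from node name to dependency names, assumed to be listed in topological order.

```python
def split_constant_nodes(graph: dict, inputs: set) -> tuple:
    """Partition node names into (constant, main).

    `graph` maps each node to the nodes it reads, in topological order
    (an assumption of this sketch). A node belongs to the main part if
    it is a user input or reads, directly or transitively, from one.
    Everything else can be evaluated once and reused.
    """
    depends_on_input = set(inputs)
    constant, main = [], []
    for name, deps in graph.items():
        if name in depends_on_input or any(d in depends_on_input for d in deps):
            depends_on_input.add(name)
            main.append(name)
        else:
            constant.append(name)
    return constant, main


if __name__ == "__main__":
    # y = x @ (w + b): only `x` is a user input, so `w`, `b`, and `w_plus_b`
    # form the constant part.
    toy_graph = {
        "x": [],
        "w": [],
        "b": [],
        "w_plus_b": ["w", "b"],
        "y": ["x", "w_plus_b"],
    }
    print(split_constant_nodes(toy_graph, {"x"}))
```

In the message's terms, the first list is what would be lowered once as the constant code, and the second is the main module that consumes its result.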
suo	d0627cc2af	[export] do not rewrite state dict when unlifting (#118611 ) This is Very Bad; changing state dict keys violates one of the key contracts we have, which is "do not mess with the state dict". Change unlift to use a similar `_assign_attr` approach that fx.GraphModule and unflatten do. Also took the opportunity to improve the interface of `_assign_attr` to be more general. Differential Revision: [D53139277](https://our.internmc.facebook.com/intern/diff/D53139277/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/118611 Approved by: https://github.com/zhxchen17 ghstack dependencies: #118607, #118608, #118609, #118610	2024-01-30 19:14:19 +00:00
Daohang Shi	5dfcf07449	Reland PR117393 [inductor/fb] log config dict when compilation finishes (#118552 ) Summary: Reverted due to merge conflict Differential Revision: D53188124 Pull Request resolved: https://github.com/pytorch/pytorch/pull/118552 Approved by: https://github.com/mengluy0125	2024-01-30 04:34:22 +00:00

239 Commits