Now that remote caching has expanded into various parts of PT2, we want to separate Triton and PT2 caching, since changes to one have caused SEVs in the other.
Differential Revision: D60047752
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131345
Approved by: https://github.com/aorenste
Summary:
In the export workflow, we always have a lifted graph, which doesn't fetch constants through get_attr nodes. This causes compatibility issues when we try to use Inductor's split_const_gm function with a lifted graph.
This diff makes an additive change to split_const_gm's interface: when the pass sees a placeholder node that is present in the lifted_constants table, it also treats that node as a source of constness.
This change doesn't break existing code, and the lifted_constants table can be used orthogonally to the existing const-folding mechanisms.
Also, as requested by the MTIA team, we introduce a small callback function to skip certain nodes during const folding.
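A minimal usage sketch of the extended interface (the import location and the keyword names below are assumptions for illustration, not necessarily the exact signature landed by this diff):
```python
# Sketch only: the import location, the `lifted_constants` keyword, and the
# callback parameter name are assumptions for illustration.
from typing import Dict

import torch
from torch.fx import GraphModule, Node
from torch._inductor.constant_folding import split_const_gm  # assumed location

def split_lifted_graph(gm: GraphModule, lifted_constants: Dict[str, torch.Tensor]):
    def skip_folding(node: Node) -> bool:
        # MTIA-style callback: return True to keep this node out of const folding.
        return node.target is torch.ops.aten.clone.default

    # Placeholders whose names appear in `lifted_constants` are treated as
    # constant, in addition to the usual get_attr-based constants.
    return split_const_gm(
        gm,
        lifted_constants=lifted_constants,
        skip_folding_node_fn=skip_folding,  # assumed parameter name
    )
```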
For the internal followup counterpart, see D59685145
Test Plan: buck run mode/opt caffe2/test:test_export -- -r split_const_gm
Differential Revision: D59692790
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130743
Approved by: https://github.com/desertfire, https://github.com/SherlockNoMad
Summary: Similar to the handling of metrics, save inductor counter deltas in the FX graph cache entry and increment the counters appropriately on a cache hit
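For illustration, a rough sketch of the save/replay idea (the cache-entry plumbing here is made up; only `torch._dynamo.utils.counters` is the real global table):
```python
# Illustrative sketch only; the actual FX graph cache internals differ.
from collections import Counter
from torch._dynamo.utils import counters  # PyTorch's global counter table

def capture_counter_delta(compile_fn):
    before = {k: Counter(v) for k, v in counters.items()}
    result = compile_fn()
    delta = {
        k: Counter({name: v[name] - before.get(k, Counter())[name] for name in v})
        for k, v in counters.items()
    }
    return result, delta  # `delta` would be stored alongside the cache entry

def replay_counter_delta(delta):
    # On a cache hit, bump the global counters as if compilation had run.
    for group, values in delta.items():
        counters[group].update(values)
```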
Test Plan: new unit test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130635
Approved by: https://github.com/eellison
Volta (sm_7x) does not have HW support for the bfloat16 datatype. It is emulated in software, so PyTorch eager mode can use bfloat16 tensors, but Triton cannot. So if a graph has either CUDA bf16 input or output tensors, raise a warning and skip the frame.
Add an optional parameter `including_emulation` to the `torch.cuda.is_bf16_supported` method and call it from `torch._inductor.compile_fx._check_triton_bf16_support`.
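Roughly, the check could look like the sketch below (the helper and warning text are paraphrased assumptions, not the exact code in `compile_fx.py`):
```python
# Rough sketch; the real _check_triton_bf16_support inspects graph inputs and
# outputs inside torch/_inductor/compile_fx.py.
import warnings
import torch

def _graph_uses_cuda_bf16(example_inputs) -> bool:
    return any(
        isinstance(t, torch.Tensor) and t.is_cuda and t.dtype is torch.bfloat16
        for t in example_inputs
    )

def check_triton_bf16_support(example_inputs) -> bool:
    # `including_emulation=False` asks only about native HW support, so Volta
    # (sm_7x) reports False even though eager-mode emulation works.
    if _graph_uses_cuda_bf16(example_inputs) and not torch.cuda.is_bf16_supported(
        including_emulation=False
    ):
        warnings.warn("bfloat16 is not natively supported on this GPU; skipping the frame.")
        return False
    return True
```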
Test plan: Modify `is_bf16_supported` to return False and check that the warning is generated.
Fixes https://github.com/pytorch/pytorch/issues/118122 and https://github.com/pytorch/pytorch/issues/118581
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129288
Approved by: https://github.com/eqy, https://github.com/jansel
Summary: Currently AOTI does a two-pass compilation for the CUDA backend. In the first pass AOTI generates Python code, runs the generated code once with real example inputs to trigger Triton kernel compilation and tuning, and then AOTI runs the second pass to generate cpp code and compiles that into a shared library.
There are several problems with this approach when we want to enable the cpp wrapper mode for JIT Inductor:
* Compilation time: JIT compilation is more sensitive to compilation time than AOT compilation. The two-pass approach does add extra overhead for compilation.
* Peak memory size: when executing the first-pass generated code with real inputs, some inputs need to be cloned to avoid side effects coming from input mutation. This can raise the high-water mark for memory consumption.
* Missing Triton kernel autotuning: because kernel autotuning depends on the kernel being executed in the two-pass approach, some kernels will not be autotuned when a model contains control flow such as torch.cond or torch.while_loop.
This PR is the first step towards solving these problems by moving Triton kernel autotuning to compile time and using random inputs for tuning. The cpp wrapper codegen still has two passes, but in the first pass Inductor generates separate code just for kernel autotuning, with https://gist.github.com/desertfire/606dc772b3e989b5e2edc66d76593070 as an example, and we no longer need to execute the model after the first pass finishes. After that we rerun a second pass to generate cpp code. This reduces peak memory consumption and enables kernel autotuning when there is control flow. Truly making the codegen one-pass will come later, once this solution has proven stable and generates kernels as performant as before.
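As a loose illustration of "autotune at compile time with random inputs" (all names below are hypothetical stand-ins; see the linked gist for what Inductor actually generates):
```python
# Hypothetical illustration only; `kernel` stands in for an autotunable Triton
# kernel wrapper and `arg_metadata` for recorded size/dtype/device info.
import torch

def autotune_at_compile_time(kernel, arg_metadata):
    # Build arbitrary tensors matching each argument's recorded metadata
    # instead of running the whole model on real inputs.
    random_args = [
        torch.empty(meta["size"], dtype=meta["dtype"], device=meta["device"]).random_()
        for meta in arg_metadata
    ]
    # Running once here benchmarks the kernel's configs, so the second (cpp)
    # pass only needs to emit a call to the winning config.
    kernel.run(*random_args)
```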
Differential Revision: [D58782766](https://our.internmc.facebook.com/intern/diff/D58782766)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129057
Approved by: https://github.com/jansel, https://github.com/eellison
When caching is enabled, an internal model fails with
```
assert_size_stride(bmm_9, (17, s0, 512), (54784, 512, 1))
AssertionError: expected size 17==17, stride 57344==54784 at dim=0
```
Looking at this model, the exact problem is that when the cache is hit on the forward graph, the generated code for the backward graph fails, because the strides of the forward outputs, passed to the backward as inputs, are not what we expected.
This PR changes the evaluation logic so that we defer evaluation of the output stride exprs to the load path, as opposed to eagerly doing it on the save path.
I have not been able to come up with a unit test repro for this problem.
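Conceptually the change looks like the sketch below; the helper names and entry fields are made up for illustration.
```python
# Conceptual sketch with made-up names; the real logic lives in the FX graph
# cache save/load paths.
import sympy

def save_entry(entry, output_stride_exprs):
    # Save path: keep the symbolic stride expressions instead of evaluating
    # them against the hints available at compile time.
    entry.output_stride_exprs = output_stride_exprs

def load_entry(entry, shape_env_hints):
    # Load path: evaluate the stride exprs against the *current* hints, so a
    # cache hit on the forward graph still hands the backward graph strides
    # that match this run's inputs.
    return [
        [sympy.sympify(e).subs(shape_env_hints) for e in strides]
        for strides in entry.output_stride_exprs
    ]
```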
Differential Revision: [D58796503](https://our.internmc.facebook.com/intern/diff/D58796503)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128997
Approved by: https://github.com/ezyang
This PR intends to support aten operations that take an `out` tensor.
Currently, AOT compile always does **NOT** keep input tensor mutations. According to the comments, this is because such a use case had not been encountered yet.
> For now there's no use case involving keeping input mutations in the graph (which we can only do in the inference case anyway). We can add this later if we need to.
However, for aten operations it is common for the `out` tensor to be an input parameter that needs to be mutated. This PR supports this by adding a `keep_inference_input_mutations` flag at `aot_inductor.keep_inference_input_mutations`. This flag gives the caller the flexibility to decide whether AOT compile needs to keep input tensor mutations in the graph.
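A hedged sketch of how the flag might be toggled; the config path comes from the description above, while the `aot_compile` entry point and helper below are just one illustrative way to use it:
```python
# Sketch only: the config path is described above; the aot_compile entry point
# shown here is one possible way this flag would be consumed.
import torch
import torch._inductor.config as inductor_config

def aot_compile_keeping_out_mutation(fn, example_args):
    # Keep mutations of inputs (e.g. an `out=` tensor) in the inference graph
    # instead of functionalizing them away.
    inductor_config.aot_inductor.keep_inference_input_mutations = True
    return torch._export.aot_compile(fn, example_args)
```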
Take `clamp` as an example as follows.
```python
out_tensor = torch.randn(128, dtype=torch.float, device=device).fill_(-2.0)
inp_tensor = torch.randn(128, dtype=torch.float, device=device).fill_(1.0)
min_tensor = inp_tensor - 0.05
max_tensor = inp_tensor + 0.05
torch.clamp(input=inp_tensor, min=min_tensor, max=max_tensor, out=out_tensor)
```
W/O this PR
```python
def forward(self):
arg0_1: "f32[128]"; arg1_1: "f32[128]"; arg2_1: "f32[128]"; arg3_1: "f32[128]";
arg0_1, arg1_1, arg2_1, arg3_1, = fx_pytree.tree_flatten_spec([], self._in_spec)
clamp_min: "f32[128]" = torch.ops.aten.clamp_min.Tensor(arg0_1, arg1_1); arg0_1 = arg1_1 = None
clamp_max: "f32[128]" = torch.ops.aten.clamp_max.Tensor(clamp_min, arg2_1); clamp_min = arg2_1 = None
return (clamp_max, clamp_max)
```
W/ this PR
```python
def forward(self):
arg0_1: "f32[128]"; arg1_1: "f32[128]"; arg2_1: "f32[128]"; arg3_1: "f32[128]";
arg0_1, arg1_1, arg2_1, arg3_1, = fx_pytree.tree_flatten_spec([], self._in_spec)
clamp_min: "f32[128]" = torch.ops.aten.clamp_min.Tensor(arg0_1, arg1_1); arg0_1 = arg1_1 = None
clamp_max: "f32[128]" = torch.ops.aten.clamp_max.Tensor(clamp_min, arg2_1); clamp_min = arg2_1 = None
copy_: "f32[128]" = torch.ops.aten.copy_.default(arg3_1, clamp_max); arg3_1 = clamp_max = None
return (copy_,)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124926
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/angelayi
CPU scalar inputs are most commonly used for the philox random seed. Right now, any CPU input will cause cudagraphs to skip the entire graph. We need both the traced graph and the runtime inputs to be cudaified.
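A minimal sketch of the "cudaify" idea for CPU scalars (names are illustrative, not the cudagraph trees implementation):
```python
# Illustrative sketch: move CPU scalar tensors (e.g. a philox seed) into
# cudagraph-owned CUDA storage so the graph can be replayed instead of skipped.
import torch

def cudaify_scalar_inputs(inputs, static_slots):
    out = []
    for i, t in enumerate(inputs):
        if isinstance(t, torch.Tensor) and t.device.type == "cpu" and t.dim() == 0:
            if i not in static_slots:
                static_slots[i] = t.cuda()               # allocate a static CUDA copy once
            static_slots[i].copy_(t, non_blocking=True)  # refresh the value before replay
            out.append(static_slots[i])
        else:
            out.append(t)
    return out
```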
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125382
Approved by: https://github.com/jansel
### Introduction/Problem
Today, when dynamo traces a builtin nn module (nn.Linear, for example), it specially handles the parameters of that module by storing them as constant attributes of the graph. This requires dynamo to guard on the ID of the NNModule, because if the instance of the module changes, we need to retrace and recollect the new parameters as graph attributes. This creates a 1:1 compiled-graph-to-cudagraph relationship.
With hierarchical compilation, dynamo will treat builtin nn modules like any other code. This reduces complexity and, critically, if there are multiple identical layers in a model, we only need to compile one of them once and can reuse the same compiled artifact for each layer. This introduces a problem for the current approach to parameter handling: since the parameters can now change across calls to the compiled artifact, they need to be inputs to the graph instead of attributes.
This in turn is a problem for cudagraphs. Previously cudagraphs was guaranteed that the parameters of builtin NN Modules would be constant across calls, but now, since the compiled artifact needs to be agnostic to the actual instance of the NN module being used, these parameter memory locations may vary. Cudagraphs ordinarily copies varying inputs into cudagraph-owned memory, but since the parameters are quite large, doing that here would be catastrophic for performance.
### Solution
To avoid this performance cliff, this PR allows cudagraphs to re-record a new cudagraph if only parameters change. Metadata about which arguments are parameters is propagated from AOT Autograd to compile_fx, and these indices are passed to cudagraphs. If these memory locations change, a new graph is recorded, whereas previously this would be an error (because it should not have been possible). This enables a 1:many compiled-graph-to-cudagraph relationship. Across similar modules we re-record cudagraphs and dispatch to the correct graph by matching parameter pointers when the cudagraph is executed.
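A simplified sketch of this dispatch (the bookkeeping below is illustrative only; the real logic lives in Inductor's cudagraph trees and also handles warmup, output ownership, and invalidation):
```python
# Simplified illustration of the "1:many compiled graph to cudagraph" dispatch.
import torch

class ParamAwareCudagraphs:
    def __init__(self, run_fn, static_input_idxs):
        self.run_fn = run_fn
        self.static_input_idxs = static_input_idxs  # indices of nn.Module parameters
        self.recordings = {}  # tuple of param data_ptrs -> (graph, recorded outputs)

    def __call__(self, *args):
        key = tuple(args[i].data_ptr() for i in self.static_input_idxs)
        if key not in self.recordings:
            # Parameters moved (e.g. a different instance of the same layer):
            # record a new cudagraph instead of erroring out.
            g = torch.cuda.CUDAGraph()
            with torch.cuda.graph(g):
                outputs = self.run_fn(*args)
            self.recordings[key] = (g, outputs)
        graph, outputs = self.recordings[key]
        graph.replay()
        return outputs
```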
### Next steps (if needed)
It is theoretically possible that a user passes Parameters that change frequently as inputs to model code. If that turns out to be a common issue, this design allows dynamo to pass metadata indicating which parameters were created in a builtin NN Module context, so that only those parameters get the multi-cudagraph behavior; this PR does not implement that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126822
Approved by: https://github.com/eellison
ghstack dependencies: #126820, #126821
Summary: We observed differences in these fields, and Inductor does not specialize on them, so it is safe to remove them from the cache key.
Test Plan: CI
Reviewed By: masnesral
Differential Revision: D57871276
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127319
Approved by: https://github.com/masnesral
This is to prevent the import from being removed as an unused import. What's annoying is that the lint doesn't run consistently: lintrunner doesn't warn me on this PR even without the comment, but it does on other PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127545
Approved by: https://github.com/masnesral
By moving AsyncCompile to its own file, we can import codecache without running the side effects of AsyncCompile. This will be important for AOTAutogradCaching, where we want to share some implementation details with codecache.py without spawning new processes.
To conservatively maintain the same behavior elsewhere, everywhere we import codecache I've also added an import of torch._inductor.async_compile (except in autograd_cache.py, where the explicit goal is to avoid this).
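For example, a call site could look like the snippet below (the `noqa` marker is one way to spell the comment discussed in the previous PR):
```python
# Representative call-site pattern: importing the new module alongside
# codecache keeps AsyncCompile's worker-spawning side effects, and the noqa
# marker stops linters from removing the seemingly unused import.
from torch._inductor import codecache  # noqa: F401
import torch._inductor.async_compile  # noqa: F401
```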
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127235
Approved by: https://github.com/aorenste, https://github.com/oulgen, https://github.com/masnesral
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127125
Approved by: https://github.com/Skylion007
ghstack dependencies: #127122, #127123, #127124
This diff implements a remote caching strategy (Memcache internally, Redis externally) for the mapping from an Inductor FX graph to the Inductor-generated wrapper file.
It uses the same idea as the autotuning result cache that is currently live.
This will land turned off; before turning it on by default, I will do more testing, including looking at the dynamic-shape guards added by Inductor.
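A toy sketch of the OSS (Redis) path; the key format and serialization below are placeholders, not the actual cache protocol:
```python
# Toy sketch of the Redis-backed path; key format and serialization are
# placeholders, not the actual FX graph cache protocol.
import pickle
import redis

class RemoteFxGraphCache:
    def __init__(self, host="localhost", port=6379):
        self._client = redis.Redis(host=host, port=port)

    def get(self, graph_hash: str):
        blob = self._client.get(f"fx-graph:{graph_hash}")
        return pickle.loads(blob) if blob is not None else None

    def put(self, graph_hash: str, compiled_wrapper) -> None:
        # Value is the Inductor-generated wrapper artifact for this graph hash.
        self._client.set(f"fx-graph:{graph_hash}", pickle.dumps(compiled_wrapper))
```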
Differential Revision: [D56441624](https://our.internmc.facebook.com/intern/diff/D56441624/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124669
Approved by: https://github.com/jansel, https://github.com/eellison
Summary:
Previous attempts ultimately didn't work. D49720297 caused an online training SEV due to extra importing. D56299408 mitigated a tricky bug in the Distributed Shampoo constructor but unfortunately didn't fix the Scuba logging either.
see f552546983
Test Plan: {F1491621504}
Differential Revision: D56378270
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124593
Approved by: https://github.com/anijain2305
When inductor generates triton code, the triton code can either assume that the inputs given to it are aligned or unaligned. If they are aligned, triton can use more efficient instructions (like vectorized loads or tensor cores). However, if we generate "aligned" code and pass in unaligned inputs, the triton code will error out; to fix this, we clone unaligned inputs that are passed to triton kernels that expect aligned inputs. This can lead to excessive clones if we have inputs that are not expected to be aligned.
In this PR, we use the example input to decide whether the generated triton code should assume alignment or not. If the example input is aligned, then we will generate triton code that assumes alignment; if at runtime we receive an unaligned input, we'll make a clone. Meanwhile, if the example input is not aligned, the generated triton code will not assume inputs are aligned and we won't ever need to clone.
Note that the alignment of the inputs is not guarded on; we found that adding guards on tensor offsets (a) was slow in cases where we do a lot of comparisons on tensor offsets, and (b) led to a lot of recompilations.
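A small sketch of the decision rule; the 16-byte threshold and helper names are assumptions for illustration:
```python
# Sketch of the decision rule; ALIGNMENT (16 bytes here) and the helper names
# are illustrative assumptions.
import torch

ALIGNMENT = 16  # bytes

def is_aligned(t: torch.Tensor) -> bool:
    return t.data_ptr() % ALIGNMENT == 0

def prepare_runtime_input(t: torch.Tensor, codegen_assumed_aligned: bool) -> torch.Tensor:
    # Compile time: codegen_assumed_aligned was decided from the *example* input.
    # Runtime: clone only when the generated code assumes alignment but this
    # particular input does not satisfy it.
    if codegen_assumed_aligned and not is_aligned(t):
        return t.clone()
    return t
```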
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123319
Approved by: https://github.com/eellison