* Automatically applies ruff rule 401, turning loops into equivalent list comprehensions, which are faster and do not leak the loop variable into the enclosing scope (see the sketch after this list).
* List comprehensions not only often type-check better, but also have 50+% less overhead than for loops. They preserve length information and are easier for the interpreter to optimize.
* Manually went back and made mypy happy after the change.
* Also fixed style lints in files covered by flake8 but not by pyfmt
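As a rough illustration of the rewrite this rule performs (a minimal sketch; the function names are invented for the example and do not come from the PyTorch codebase):
```python
# Before: building a list by appending inside a for loop.
def squares_loop(values: list[int]) -> list[int]:
    result = []
    for v in values:
        result.append(v * v)
    return result

# After: the equivalent list comprehension. It avoids the repeated
# `result.append` attribute lookup and call overhead, can use the length
# hint of `values` to pre-size the result, and does not leak the loop
# variable `v` into the enclosing scope.
def squares_comprehension(values: list[int]) -> list[int]:
    return [v * v for v in values]
```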
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140980
Approved by: https://github.com/justinchuby, https://github.com/malfet
Fix https://github.com/pytorch/pytorch/issues/140462 .
Horace found that when we implicitly fall back to eager, some eager kernels may not work correctly if Inductor provides non-contiguous inputs (due to padding, etc.). The original issue was found in the backward op of weight_norm. The fix in this PR is a general one: we force the inputs to all implicit fallback kernels to be contiguous.
I had to refactor the code a bit to make this work. Previously we applied layout constraints in `GraphLowering.run_node`, but we look for implicit fallbacks in `call_function`. The problem is that when we set up an implicit fallback with a layout constraint in `call_function`, we never get a chance to apply the constraint. The refactor moves the code that applies layout constraints into `call_function`.
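A minimal sketch of the constraint being enforced (the helper names here are hypothetical, not the actual Inductor internals):
```python
import torch

# Hypothetical sketch: before calling an implicitly registered eager
# fallback, force every tensor argument to be contiguous so kernels that
# assume contiguous inputs (e.g. weight_norm backward) behave correctly
# even when Inductor has padded or otherwise re-laid-out the buffers.
def require_contiguous(args):
    return tuple(
        a.contiguous() if isinstance(a, torch.Tensor) else a
        for a in args
    )

def call_implicit_fallback(eager_kernel, args):
    # The layout constraint is applied at the call site, mirroring the
    # refactor that moves constraint application into `call_function`.
    return eager_kernel(*require_contiguous(args))
```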
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140996
Approved by: https://github.com/jansel
Adds some `dynamo_timed` annotations for the purpose of better understanding AOTI compilation time.
This will probably require a few more passes: a lot of time is spent in `Scheduler.__init__`, and there aren't enough annotations there yet.
`run_command_and_check` takes a lot of time as well, but there is probably not much we can do about it. Maybe we can add a config to tune the C++ optimization level?
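For context, the annotations look roughly like this (a sketch under the assumption that `dynamo_timed` is used as a context manager, as described elsewhere in this log; the region name and class body are placeholders):
```python
from torch._dynamo.utils import dynamo_timed

class Scheduler:
    def __init__(self, nodes):
        # Wrap the expensive region so it shows up as its own span in the
        # compile-time traces instead of being lumped into the parent event.
        with dynamo_timed("Scheduler.__init__"):
            self._init(nodes)

    def _init(self, nodes):
        ...  # the actual scheduling work being measured
```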
traces:
<img width="1205" alt="Screenshot 2024-11-08 at 4 41 10 PM" src="https://github.com/user-attachments/assets/61645264-b3af-4d4a-804d-700b0f831c7c">
Differential Revision: D65554141
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140198
Approved by: https://github.com/desertfire
Summary:
Recent tries of the bandwidth profiler did not behave as expected. I have observed a few issues and try to fix them in this diff:
1. The return value of the DebugAutotuner class.
2. Profiling results show a really large overhead: DebugAutotuner.run() returns a benchmark time of around 45ms while CachingAutotuner.run() returns around 0.45ms. The `_find_names` and `re.match` calls account for the 45ms: P1669186358. After commenting out `_find_names` and `re.match`, the benchmark time becomes consistent with non-profiling mode: P1669185589.
3. Introduce a variable `bandwidth_info` to control the code path in DebugAutotuner.run(); during benchmarking for configuration selection, `bandwidth_info` should be turned off (sketched below).
After applying this diff, the profiling issues mentioned above are fixed: P1669273172
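A rough sketch of the `bandwidth_info` gating (simplified and not the actual DebugAutotuner code; method names below are placeholders):
```python
class DebugAutotunerSketch:
    def __init__(self, bandwidth_info: bool = True):
        # When False (during config selection), skip the expensive
        # bandwidth bookkeeping so benchmark times stay comparable to
        # CachingAutotuner.run().
        self.bandwidth_info = bandwidth_info

    def run(self, *args, **kwargs):
        bench_time = self._benchmark(*args, **kwargs)
        if self.bandwidth_info:
            # _find_names + re.match were the ~45ms overhead; only pay for
            # them when we actually want bandwidth info reported.
            self._report_bandwidth(bench_time)
        return bench_time

    def _benchmark(self, *args, **kwargs):
        ...  # placeholder for the real benchmarking path

    def _report_bandwidth(self, bench_time):
        ...  # placeholder for name lookup + bandwidth logging
```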
Test Plan:
```
TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_PROFILE=1 TORCHINDUCTOR_PROFILE_OUTPUT=~/tmp/profile.txt TORCH_LOGS='+inductor,+schedule,output_code' TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 TORCHINDUCTOR_BENCHMARK_KERNEL=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 CUDA_VISIBLE_DEVICES=5 buck run mode/{opt,inplace} scripts/wwei6/triton_examples:test_mat 2>&1 | tee profiling-5.log
```
If we want to disable the ATen backend, just add TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS="TRITON".
Differential Revision: D64883079
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139607
Approved by: https://github.com/chenyang78
Here's the overview:
There's a new contextmanager singleton called MetricsContext. Entering the MetricsContext is how we demarcate the boundary on which we'll create a single CompilationMetrics object, and therefore, a single dynamo_compile log entry. While we're inside the MetricsContext, we can update/set many different metrics. Most importantly: `dynamo_timed` can also update the in-progress MetricsContext. In the proposal here, we tell `dynamo_timed` that we want it to do so by providing the name of the MetricsContext field to increment. There can be many `dynamo_timed` calls in different parts of the code updating different fields. Then when the MetricsContext exits, that's when the logging of everything gathered finally happens. One potential footgun is trying to use `dynamo_timed` when we haven't entered the MetricsContext, but we assert on that problem. Another problem is that we re-enter the context recursively, but we watch for that and do the logging only when the outermost exits.
Some specifics:
* Introduce MetricsContext - a context manager that on exit, records the CompilationMetrics (which also logs to dynamo_compile).
* Completely remove the concept of frame_phase_timing. Instead, update the MetricsContext during compilation, either directly or via dynamo_timed.
* Remove some globals we previously used to accumulate counters to later populate a CompilationMetrics. We use CompilationMetrics set/update/increment APIs instead.
* `record_compilation_metrics` is now called on exit from MetricsContext.
* Populate legacy CompilationMetrics fields right before logging, inside `record_compilation_metrics`.
* Remove the one-off `add_remote_cache_time_saved` helper; capture that timing directly into the MetricsContext.
And specifically, several changes to dynamo_timed:
* "Modernize" the parameters and update all callsites accordingly.
* Move the backwards logging of the CompilationMetrics to the backwards compile location.
* Add a parameter for which CompilationMetrics field to update
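A condensed sketch of the shape described above (field names, parameters, and logging are illustrative, not the exact torch._dynamo API):
```python
import contextlib
import time

class MetricsContextSketch:
    """Singleton-ish context; only the outermost exit records metrics."""

    def __init__(self):
        self._depth = 0
        self._metrics = {}

    def __enter__(self):
        self._depth += 1
        return self

    def __exit__(self, *exc):
        self._depth -= 1
        if self._depth == 0:
            # Stand-in for record_compilation_metrics(), which would also
            # log the dynamo_compile entry.
            print("record_compilation_metrics:", self._metrics)
            self._metrics = {}

    def increment(self, field, value):
        # Footgun guard: timed regions must run inside the MetricsContext.
        assert self._depth > 0, "timed region outside MetricsContext"
        self._metrics[field] = self._metrics.get(field, 0) + value

METRICS_CONTEXT = MetricsContextSketch()

@contextlib.contextmanager
def dynamo_timed_sketch(name, metrics_field=None):
    start = time.time()
    try:
        yield
    finally:
        if metrics_field is not None:
            METRICS_CONTEXT.increment(metrics_field, time.time() - start)
```
Usage would wrap a whole compilation in `with METRICS_CONTEXT:` and individual phases in `with dynamo_timed_sketch("phase", metrics_field="phase_time_s"):`.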
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139849
Approved by: https://github.com/ezyang
ghstack dependencies: #140094
- Remove "mypy: allow-untyped-defs" and mark functions individually with "no-untyped-def"
- Mark some trivial functions with the proper return types (`None` and `torch.dtype`)
- Fixed a type bug in the signature of supported_dtype_of_cpp_wrapper()
- `ruff check torch/_inductor/ir.py --select ANN --fix --unsafe-fixes` and then fixed up things that looked incorrectly applied.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139238
Approved by: https://github.com/Skylion007, https://github.com/ezyang
Summary:
When we bypass the cache write in Inductor, we were also forgetting to reset the bundle. This moves the bundle reset into the post_compile step so it gets uniformly reset.
This diff also turns on the cache for internal so that we can do a code rollout.
Test Plan: updated tests
Differential Revision: D65457224
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139698
Approved by: https://github.com/ezyang
Adds a few more dynamo_timed() to measure triton compilation and load_by_key_path times.
In the case of async compilation with multiple threads, we'll generate a single `kernel_compile` event that occurs when waiting on all the parallel compiles to finish.
In the case where async parallel compilation is disabled (or, compile threads are warming up), we'll generate a `triton_compile` event for each kernel.
The `triton_compile` events are a bit questionable: do we need a row for each triton compile? It might eat into our already low retention, so I might just remove them. Will discuss with @slarsen.
Differential Revision: [D65215707](https://our.internmc.facebook.com/intern/diff/D65215707/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139402
Approved by: https://github.com/oulgen
This PR fixes a compilation time regression manifested in timm_models/hrnet_w18 caused by https://github.com/pytorch/pytorch/pull/136732.
The regression is reproducible locally. The compilation time is a bit noisy, but it's still possible to tell the difference.
```
Before the offending PR
compilation_latency mean=176.022 seconds
compilation_latency mean=176.564 seconds
On the offending PR
compilation_latency mean=180.096 seconds
compilation_latency mean=179.101 seconds
On the fix
compilation_latency mean=173.153 seconds
compilation_latency mean=174.182 seconds
```
(I think the fix being faster than the baseline is due to noise)
The cause of the regression is an inefficiency in `is_user_visible_output()`. Specifically, it used `output_node.args[0].index(node)` to obtain the output index for each node (and we called this twice per node). The offending PR assumed that `len(output_node.args[0])` is rather small, but the benchmark proved that assumption false (it was 1900+ for timm_models/hrnet_w18).
The fix is to precompute `user_visible_output_strides` once by iterating only over the nodes in `output_node.args[0]`.
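In code terms, the fix replaces a per-node `list.index` lookup with a single pass over the outputs (a sketch with simplified names):
```python
# Before (quadratic): each call scans the whole output list.
#   idx = output_node.args[0].index(node)   # O(len(outputs)) per node
#
# After (linear): walk the outputs once and build the stride table up front.
def compute_user_visible_output_strides(output_node, user_visible_idxs):
    user_visible_output_strides = {}
    for idx, out in enumerate(output_node.args[0]):
        if idx not in user_visible_idxs:
            continue
        val = out.meta.get("val") if hasattr(out, "meta") else None
        if val is not None and hasattr(val, "stride"):
            user_visible_output_strides[out] = val.stride()
    return user_visible_output_strides
```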
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139420
Approved by: https://github.com/ezyang
This diff considerably changes the column format of PT2 Compile Events:
- Now, instead of logging one new column for every piece of metadata, we just log a single "metadata" column (sketched below). This vastly decreases the number of columns we need to log, which should help with retention.
- Now, we only log to scuba for a set of dynamo_timed() events that we actually care about aggregating. To do so, we add a boolean to dynamo_timed() that decides whether or not to log a pt2_compile_event. We'll always log a chromium_event for every dynamo_timed(), but only log a subset of those to scuba.
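Roughly, the logging shape becomes something like this (illustrative only, not the actual internal logger):
```python
import json
import time

def emit_compile_event(event_name, metadata, log_pt2_compile_event=False):
    row = {
        "event": event_name,
        "time": time.time(),
        # One JSON-encoded "metadata" column instead of a column per key.
        "metadata": json.dumps(metadata),
    }
    print("chromium_event:", row)          # always emitted
    if log_pt2_compile_event:
        print("pt2_compile_event:", row)   # only for opted-in timed regions
```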
Differential Revision: [D65225598](https://our.internmc.facebook.com/intern/diff/D65225598/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139309
Approved by: https://github.com/oulgen
Summary:
This diff/PR attempts to consolidate Triton caching into the Inductor caching so that there can be just one cache that unifies them both, reducing network requests and increasing success rate.
Implementation details can be found via reading the code or the post: https://fb.workplace.com/groups/1553867532149891/posts/1605037517032892
I did not use the Autotune bundler code at all since I want to simplify that and merge it into this on the next diff/PR.
In terms of instrumentation:
1) Dynamo compile: `triton_bundler_time_saved_s` is the sum of the time of all triton.compile calls. We don't have to use the specific number; we can use it as a binary value.
2) Events table: I used dynamo_timed to measure how much time we spend in the bundler collect and write functions, which is all the work we do in this diff.
3) TLParse: I emitted the number of kernels and triton_bundler_time_saved_s into tlparse as well.
Test Plan: Updated unit tests
Adhoc running
```
TORCHINDUCTOR_BUNDLE_TRITON_INTO_FX_GRAPH_CACHE=1 buck2 run @mode/opt //scripts/oulgen:runner
```
gives
https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpmTZt6b/0_0_0/fx_graph_cache_hit_4.json
<img width="771" alt="image" src="https://github.com/user-attachments/assets/478782a2-ee47-40cb-b723-fcac2bf9dd93">
Differential Revision: D64504909
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138239
Approved by: https://github.com/ezyang
```
NOTE [lowering-time collective optimization]
In collective communication libraries such as NCCL, every rank maintains
communication buffers that are remotely accessible by some peers. Depending
on the underlying transport, remote accessibility may be established via
mechanisms such as ib_reg_mr, CUDA P2P, or CUDA multicast. Typically, these
buffers are private to the communication library by default, and
communication ops copy user data in and out of these buffers.
To prevent these copies, an optimization commonly known as "user buffer
registration" can be employed. This allows direct establishment of remote
accessibility on user buffers, eliminating the need for copying. However,
this optimization introduces stringent usage requirements, which are
typically hard to satisfy without being intrusive to the user code:
- Establishing remote accessibility is expensive and often done ahead of
time. In such implementations, all ranks must agree on the set of allocations
used for every collective op. Failing to meet this requirement can
lead to runtime errors or even silent correctness issues.
- Even if the collective communication library supports gracefully falling
back to "unregistered" implementations, the fallback mechanism would nullify
the optimization.
- Some communication mechanisms impose stricter requirements than others. For
example, CUDA's multicast + multi-mem instructions require all ranks to agree
not only on the allocations used for every collective but also on the offsets
within these allocations.
To support all different mechanisms with optimal results, we aim to satisfy
the strictest requirement for this family of optimizations - we ensure that
every collective op invocation is guaranteed to operate on the same
allocation, at the same offset, in every iteration.
For eligible collective ops, we identify communication buffers at lowering
time and optionally choose to lower the op to a different kernel
(communication libraries like NCCL handle both registered and non-registered
buffers transparently within the same op, though some may require different
ops for different cases). Later, the codegen will perform "persistent
allocation" to satisfy the aforementioned constraints, and optionally,
perform buffer planning to optimize overall memory usage.
```
### Changes
- Created `comm_lowering.py` for the lowerings of `_c10d_functional` ops. This is to prevent cluttering `lowering.py` as we add more lowering-time collective optimizations. This PR moves the lowerings for `all_reduce` and `all_reduce_` to the file.
- Added `comm_buffer_type: Dict[str, str]` to `GraphLowering` to track whether a buffer is a comm buffer and the type of the comm buffer.
- Added codegen allocation support for comm buffers of type "symm_mem".
- Added support for auto-lowering `_c10d_functional.all_reduce_` to `symm_mem.one_shot_all_reduce`.
- Added an Inductor config for collective optimizations in general (`config._collective`).
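A tiny sketch of the new bookkeeping on the graph (simplified; not the full GraphLowering class):
```python
from typing import Dict

class GraphLoweringSketch:
    def __init__(self):
        # Maps buffer name -> comm buffer kind (e.g. "symm_mem") so codegen
        # can later give these buffers persistent allocations.
        self.comm_buffer_type: Dict[str, str] = {}

    def mark_comm_buffer(self, buffer_name: str, kind: str = "symm_mem"):
        self.comm_buffer_type[buffer_name] = kind

    def is_comm_buffer(self, buffer_name: str) -> bool:
        return buffer_name in self.comm_buffer_type
```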
### Limitation
Currently, each persistently allocated comm buffer is dedicated to a single callsite. This is not viable in terms of memory usage. However, it is a necessary intermediate state before we tackle memory planning for comm buffers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138029
Approved by: https://github.com/Chillee
ghstack dependencies: #138028
## Context
Previously, the stride preservation of user-visible nodes worked as follows:
- After joint-graph tracing, we recorded the **names** of user-visible nodes and passed them to GraphLowering.
- In GraphLowering, we determined whether we needed to preserve the striding for a certain node by checking if the node's name was in `user_visible_outputs`.
- We obtained the original strides by checking `node.meta["val"].stride()`.
However, there's a problem with this approach: the nodes in output_node.args[0] and their strides could change between the completion of joint-graph tracing and the consumption of `user_visible_outputs` (e.g., during post-grad passes), making it unreliable.
## This PR
- After joint graph tracing:
- Record the original strides of all nodes in `output_node.args[0]` as `output_node.meta["original_output_strides"]` (we record them for all nodes in case we need the info for other purposes, such as debugging).
- Record the indices of user-visible outputs as `output_node.meta["user_visible_output_idxs"]`.
- Remove the original plumbing of `user_visible_outputs`.
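A short sketch of the recording step (simplified; `gm` is the traced fx.GraphModule and `is_user_visible` stands in for whatever predicate marks an output as user-visible):
```python
def record_output_metadata(gm, is_user_visible):
    output_node = next(n for n in gm.graph.nodes if n.op == "output")
    original_strides = []
    user_visible_idxs = []
    for idx, out in enumerate(output_node.args[0]):
        val = out.meta.get("val") if hasattr(out, "meta") else None
        original_strides.append(
            val.stride() if val is not None and hasattr(val, "stride") else None
        )
        if is_user_visible(idx, out):
            user_visible_idxs.append(idx)
    # Indices survive post-grad passes even if node names/strides change.
    output_node.meta["original_output_strides"] = original_strides
    output_node.meta["user_visible_output_idxs"] = user_visible_idxs
```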
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136732
Approved by: https://github.com/Chillee
Type annotations for compile_fx.
- Some of the stuff here is pretty complicated (functions which return functions that take functions) so I bailed on those and used `Any` just to get the rest landed.
- There are also changes to type signatures in other files which I did just to let mypy know more about the types in compile_fx.py.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138033
Approved by: https://github.com/Skylion007
Add an additional check that scalars wrapped to 0-D tensors by dynamo are actually 0-D. This fixes a bug where a 1-D tensor was mistakenly converted to a scalar value rather than passed as a pointer.
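A minimal illustration of the check (hypothetical helper, not the actual codegen code):
```python
import torch

def unwrap_wrapped_scalar(arg):
    # Only a genuinely 0-D tensor may be converted back to a Python scalar;
    # anything with ndim >= 1 must keep being passed as a tensor (pointer).
    if isinstance(arg, torch.Tensor) and arg.ndim == 0:
        return arg.item()
    return arg
```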
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137303
Approved by: https://github.com/eellison
ghstack dependencies: #135701
Earlier the subgraphs were getting inlined into the output code. This PR lifts the subgraphs into a function, and then we just call the function in the output code.
This is the output code for test `test_cond_reintepret_view_inputs_outputs`
Before this PR - https://www.internalfb.com/intern/paste/P1632948905/
With this PR - https://www.internalfb.com/intern/paste/P1632946348/
A relevant snippet from the above paste is
~~~
def false_graph_0(args):
false_graph_0_arg0_1, false_graph_0_arg1_1, s0 = args
args.clear()
s0 = s0
with torch.cuda._DeviceGuard(0):
torch.cuda.set_device(0)
false_graph_0_buf0 = empty_strided_cuda(((-1) + s0, 20), (20, 1), torch.float32)
false_graph_0_buf1 = empty_strided_cuda(((-1) + s0, 20), (20, 1), torch.float32)
# Unsorted Source Nodes: [cond, z1, z2], Original ATen: [aten.sub, aten.add]
triton_poi_fused_add_sub_1_xnumel = (-20) + (20*s0)
stream0 = get_raw_stream(0)
triton_poi_fused_add_sub_1.run(false_graph_0_arg0_1, false_graph_0_arg1_1, false_graph_0_buf0, false_graph_0_buf1, triton_poi_fused_add_sub_1_xnumel, grid=grid(triton_poi_fused_add_sub_1_xnumel), stream=stream0)
del false_graph_0_arg0_1
del false_graph_0_arg1_1
return (reinterpret_tensor(false_graph_0_buf0, ((-3) + s0, 20), (20, 1), 40), reinterpret_tensor(false_graph_0_buf1, ((-1) + s0, 16), (20, 1), 4), )
async_compile.wait(globals())
del async_compile
def call(args):
arg0_1, arg1_1, arg2_1, arg3_1 = args
args.clear()
s0 = arg0_1
assert_size_stride(arg1_1, (s0, 20), (20, 1))
assert_size_stride(arg2_1, (s0, 20), (20, 1))
assert_size_stride(arg3_1, (), ())
with torch.cuda._DeviceGuard(0):
torch.cuda.set_device(0)
buf0 = [None] * 2
buf0 = [None] * 2
if arg3_1.item():
# subgraph: true_graph_0
true_graph_0_arg0_1 = reinterpret_tensor(arg1_1, ((-1) + s0, 20), (20, 1), 0)
true_graph_0_arg1_1 = reinterpret_tensor(arg2_1, ((-1) + s0, 20), (20, 1), 0)
(true_graph_0_buf0, true_graph_0_buf1) = true_graph_0([true_graph_0_arg0_1, true_graph_0_arg1_1, s0])
buf0[0] = true_graph_0_buf0
buf0[1] = true_graph_0_buf1
else:
# subgraph: false_graph_0
false_graph_0_arg0_1 = reinterpret_tensor(arg1_1, ((-1) + s0, 20), (20, 1), 0)
false_graph_0_arg1_1 = reinterpret_tensor(arg2_1, ((-1) + s0, 20), (20, 1), 0)
(false_graph_0_buf0, false_graph_0_buf1) = false_graph_0([false_graph_0_arg0_1, false_graph_0_arg1_1, s0])
buf0[0] = false_graph_0_buf0
buf0[1] = false_graph_0_buf1
del arg1_1
del arg2_1
del arg3_1
buf1 = buf0[0]
buf2 = buf0[1]
del buf0
return (buf1, buf2, )
~~~
The key change is to recursively call `codegen` for the subgraph and rely on `SubgraphPythonWrapper` to generate just the subgraph `fn`. The resulting subgraph_code is then inserted into the parent wrapper.
Note that this PR only works for the python wrapper. For the cpp wrapper, we need a lot of refactoring to ensure that we don't duplicate global variables in the output_code. So, for now, I fall back to the old way of inlining for the cpp wrapper. I am hoping someone with more familiarity with the cpp wrapper can support subgraph lifting (cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov).
This work will unblock hierarchical compilation (or cold start compile time work).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137200
Approved by: https://github.com/desertfire, https://github.com/eellison
**Summary**
Previously, we assumed the packed weight for (`MKL/MKLDNN`) linear operations was at `new_input_nodes[1]`. However, this is not the case for `MKL linear`, where `new_input_nodes[1]` contains the original weight instead of the packed weight. To generalize the code, in this PR we identify nodes that are present in `input_nodes` but not in `new_input_nodes`, indicating they are no longer used by the GEMM template and can be considered candidates for deletion (sketched below).
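In code terms, the candidate selection boils down to a difference over the two node lists (a sketch; the real code operates on the template's IR nodes):
```python
def deletion_candidates(input_nodes, new_input_nodes):
    # Nodes that fed the original op but are not referenced by the GEMM
    # template's new inputs (e.g. the unpacked weight for MKL linear) are
    # no longer needed and can be considered for removal.
    remaining = {id(n) for n in new_input_nodes}
    return [n for n in input_nodes if id(n) not in remaining]
```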
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135101
Approved by: https://github.com/jgong5, https://github.com/jansel
This is a utility to aid torch.compile debugging. You provide a function that returns True on success and False on failure, or you do something out of process and run `bisect_helper` with `good` or `bad`.
The bisector will first go through backends - `eager`, `aot_eager`, `aot_eager_decomp_partition`, `inductor` to find the first failing backend. Then, it will go through subsystems within the backend - currently limited but could be expanded - and try to find the first subsystem for which disabling fixes the problem. Once it has found the failing subsystem, it will find the number of times the subsystem is applied, and then bisect through it.
An example usage of how to hook it up for aot_eager_decomp_partition and decomposition subsystem is :
```
from torch._inductor.bisect_helper import BisectionManager
if op in CURRENT_DECOMPOSITION_TABLE:
if BisectionManager.disable_subsystem("aot_eager_decomp_partition", "decomposition", lambda: repr(op)):
return NotImplemented
```
Once it has discovered the problematic change, it will print out the associated debug info, and you can set the same limits with `TORCH_BISECT_BACKEND`, `TORCH_BISECT_SUBSYSTEM`, and `TORCH_BISECT_MAX`.
We could add further options as an automated way of going through a checklist for checking divergence, e.g. the mode to emulate amp casts.
Fix for https://github.com/pytorch/pytorch/issues/126546
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131936
Approved by: https://github.com/ezyang