pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 00:21:07 +01:00

Author	SHA1	Message	Date
Gabriel Ferns	254293b777	Add `flag _metrics_log_runtime` to disable runtime metric logging by default (#153506 ) https://github.com/pytorch/pytorch/pull/152708 expanded support of `get_estimated_runtime` to many more types of `SchedulerNodes`. This caused an increase in compile time because we're always calling `get_estimated_runtime` to populate the metrics table. This PR adds a flag for this logging, which reduces the instruction count by 8%. Long term, we should probably merge metrics.py with TORCH_LOGS/tlparse (suggestion from @xmfan). Update: added support for TORCH_LOGS for the metrics logging. Test Plan: mm_loop.py and many existing tests cover. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153506 Approved by: https://github.com/eellison	2025-05-22 01:02:11 +00:00
Aaron Gokaslan	3555ebb63d	[BE]: Update ruff to 0.11.8 (#153249 ) Fixes a ton of false negatives throughout the codebase. RUFF also properly validates NOQA comments now and most of the changes are fixing typos there or removing filewide flake8 suppressions that were also silencing ruff issues. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153249 Approved by: https://github.com/cyyever, https://github.com/albanD, https://github.com/seemethere	2025-05-12 18:30:52 +00:00
Xuehai Pan	c73a92fbf5	[BE][CI] bump `ruff` to 0.9.2: multiline `assert` statements (#144546 ) Reference: https://docs.astral.sh/ruff/formatter/black/#assert-statements > Unlike Black, Ruff prefers breaking the message over breaking the assertion, similar to how both Ruff and Black prefer breaking the assignment value over breaking the assignment target: > > ```python > # Input > assert ( > len(policy_types) >= priority + num_duplicates > ), f"This tests needs at least {priority+num_duplicates} many types." > > > # Black > assert ( > len(policy_types) >= priority + num_duplicates > ), f"This tests needs at least {priority+num_duplicates} many types." > > # Ruff > assert len(policy_types) >= priority + num_duplicates, ( > f"This tests needs at least {priority + num_duplicates} many types." > ) > ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144546 Approved by: https://github.com/malfet	2025-02-27 20:46:16 +00:00
Aaron Orenstein	07669ed960	PEP585 update - benchmarks tools torchgen (#145101 ) This is one of a series of PRs to update us to PEP585 (changing Dict -> dict, List -> list, etc). Most of the PRs were completely automated with RUFF as follows: Since RUFF UP006 is considered an "unsafe" fix first we need to enable unsafe fixes: ``` --- a/tools/linter/adapters/ruff_linter.py +++ b/tools/linter/adapters/ruff_linter.py @@ -313,6 +313,7 @@ "ruff", "check", "--fix-only", + "--unsafe-fixes", "--exit-zero", *([f"--config={config}"] if config else []), "--stdin-filename", ``` Then we need to tell RUFF to allow UP006 (as a final PR once all of these have landed this will be made permanent): ``` --- a/pyproject.toml +++ b/pyproject.toml @@ -40,7 +40,7 @@ [tool.ruff] -target-version = "py38" +target-version = "py39" line-length = 88 src = ["caffe2", "torch", "torchgen", "functorch", "test"] @@ -87,7 +87,6 @@ "SIM116", # Disable Use a dictionary instead of consecutive `if` statements "SIM117", "SIM118", - "UP006", # keep-runtime-typing "UP007", # keep-runtime-typing ] select = [ ``` Finally running `lintrunner -a --take RUFF` will fix up the deprecated uses. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145101 Approved by: https://github.com/bobrenjc93	2025-01-18 05:05:07 +00:00
Xuehai Pan	dcc3cf7066	[BE] fix ruff rule E226: add missing whitespace around operator in f-strings (#144415 ) The fixes are generated by: ```bash ruff check --fix --preview --unsafe-fixes --select=E226 . lintrunner -a --take "RUFF,PYFMT" --all-files ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144415 Approved by: https://github.com/huydhn, https://github.com/Skylion007	2025-01-08 21:55:00 +00:00
bobrenjc93	fcf9dc3b11	Migrate from Tuple -> tuple in benchmarks (#144259 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144259 Approved by: https://github.com/yanboliang	2025-01-07 04:09:52 +00:00
Oguz Ulgen	dc55704b48	Rename cache limit to recompile limit in configs (#143709 ) This PR renames every cache_limit to recompile_limit via sed. Old config options are maintained via Config(alias='xyz') Pull Request resolved: https://github.com/pytorch/pytorch/pull/143709 Approved by: https://github.com/jansel	2024-12-22 10:03:57 +00:00
William Wen	c04f0bb7b9	[dynamo] add benchmark for guard eval (#142430 ) Benchmarks: - 713.2us (3.10) - 598.8us (3.12) Pull Request resolved: https://github.com/pytorch/pytorch/pull/142430 Approved by: https://github.com/jansel ghstack dependencies: #142117	2024-12-17 18:54:27 +00:00
Xuehai Pan	267f82b860	[BE] Format `.ci/` / `.github/` / `benchmarks/` / `functorch/` / `tools/` / `torchgen/` with `ruff format` (#132577 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132577 Approved by: https://github.com/malfet	2024-10-11 18:30:26 +00:00
Oguz Ulgen	034af88c2d	Add a microbechmark for cache read path (#137607 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137607 Approved by: https://github.com/jamesjwu	2024-10-10 16:36:18 +00:00
Oguz Ulgen	ae03c0cff3	Add microbenchmark for FxGraphHashDetails.debug_lines (#137506 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137506 Approved by: https://github.com/jamesjwu	2024-10-09 16:15:05 +00:00
Jason Ansel	8da9c4178c	[inductor] Benchmark Halide in operatorbench.py (#136809 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136809 Approved by: https://github.com/eellison ghstack dependencies: #136808	2024-09-28 19:26:04 +00:00
Jason Ansel	375921b755	[inductor] Improve operatorbench.py (#136808 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136808 Approved by: https://github.com/eellison	2024-09-28 06:22:02 +00:00
Yueming Hao	a71e5509bc	[inductor]Add profiler to operatorbench (#135515 ) Add profiling to operatorbench. The new argument `--profile` is added and the profiling trace is like the following figure. <img width="954" alt="image" src="https://github.com/user-attachments/assets/5b00d6e3-4905-4a77-a5e9-9f62620a5fd5"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/135515 Approved by: https://github.com/shunting314	2024-09-10 02:33:30 +00:00
Nicolas Macchioni	5cb05a82b4	[BC breaking] move benchmarking + prefer inductor path (#132827 ) move benchmarking out of `torch._inductor.runtime.runtime_utils` and into `torch._inductor.runtime.benchmarking`, and prefer this path over directly accessing Triton's benchmarking Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/132827 Approved by: https://github.com/eellison	2024-08-08 00:47:45 +00:00
Xuehai Pan	c0ed38e644	[BE][Easy][3/19] enforce style for empty lines in import segments in `benchmarks/` (#129754 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129754 Approved by: https://github.com/ezyang	2024-07-17 14:34:42 +00:00
Peter Bell	e2e624a02f	[AOTAutograd] Micro-optimize runtime_wrapper (#128188 ) This moves a bunch of runtime inspection of the `output_info` for alias handling into the construction of fixed output handlers that are created during compilation and captured by the runtime wrapper. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128188 Approved by: https://github.com/bdhirsh	2024-07-04 03:53:06 +00:00
Xuehai Pan	26f4f10ac8	[5/N][Easy] fix typo for `usort` config in `pyproject.toml` (`kown` -> `known`): sort torch (#127126 ) The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo make `usort` do more and generate the changes in the PR. Except `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126 Approved by: https://github.com/kit1980	2024-05-27 14:49:57 +00:00
PyTorch MergeBot	55c0ab2887	Revert "[5/N][Easy] fix typo for `usort` config in `pyproject.toml` (`kown` -> `known`): sort torch (#127126 )" This reverts commit `7763c83af6`. Reverted https://github.com/pytorch/pytorch/pull/127126 on behalf of https://github.com/XuehaiPan due to Broken CI ([comment](https://github.com/pytorch/pytorch/pull/127126#issuecomment-2133044286))	2024-05-27 09:22:08 +00:00
Xuehai Pan	7763c83af6	[5/N][Easy] fix typo for `usort` config in `pyproject.toml` (`kown` -> `known`): sort torch (#127126 ) The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo make `usort` do more and generate the changes in the PR. Except `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126 Approved by: https://github.com/kit1980 ghstack dependencies: #127122, #127123, #127124, #127125	2024-05-27 04:22:18 +00:00
eellison	d777685ef9	Script for choosing template configurations (#126560 ) This adds logging that will mark any invocation of a matmul for a particular input shapes, and record every template configs performance on it. Then, we can parse that into a script which will minimize the total mm execution time given N allowed templates. And in future, other experiments.. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126560 Approved by: https://github.com/nmacchioni, https://github.com/jansel	2024-05-21 02:28:39 +00:00
Jason Ansel	7fd8870e6b	[inductor] Refactor runtime files into torch._inductor.runtime (part 3) (#124557 ) I am planning to make the compile_worker process not import torch so it can start up much faster. This stack is prep for that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124557 Approved by: https://github.com/yanboliang ghstack dependencies: #124552, #124553	2024-04-22 18:46:24 +00:00
PyTorch MergeBot	0b90af0bf5	Revert "[inductor] Refactor runtime files into torch._inductor.runtime (part 3) (#124557 )" This reverts commit `fcf28b0ad5`. Reverted https://github.com/pytorch/pytorch/pull/124557 on behalf of https://github.com/jeanschmidt due to There are internal breakages, already discussed with author and he'll FF ([comment](https://github.com/pytorch/pytorch/pull/124552#issuecomment-2070548223))	2024-04-22 18:28:05 +00:00
Jason Ansel	fcf28b0ad5	[inductor] Refactor runtime files into torch._inductor.runtime (part 3) (#124557 ) I am planning to make the compile_worker process not import torch so it can start up much faster. This stack is prep for that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124557 Approved by: https://github.com/yanboliang ghstack dependencies: #124552, #124553	2024-04-22 04:51:15 +00:00
Xuehai Pan	93e249969b	[BE] enable `ruff` rule `RSE` and remove useless parentheses in `raise` statements (#124261 ) Remove useless parentheses in `raise` statements if the exception type is raised with no argument. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124261 Approved by: https://github.com/albanD	2024-04-17 19:29:34 +00:00
Jason Ansel	5a10b56083	[dynamo] Small microbenchmark changes (#122032 ) Used to generate numbers in #122029 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122032 Approved by: https://github.com/yanboliang	2024-03-18 18:08:06 +00:00
Jason Ansel	1e9a7df8fe	[dynamo] Compile time optimizations in tx.step() (#121790 ) `python benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py` - Before: `symbolic_convert_overhead_stress_test: 10.7s` - After: `symbolic_convert_overhead_stress_test: 8.6s` `tx.step()` is a small part of that benchmark, so likely the speedup in that isolated function is larger than the top line. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121790 Approved by: https://github.com/oulgen	2024-03-15 01:01:05 +00:00
Jason Ansel	9aa3fedb75	Slightly faster FX graph iterator (#121611 ) Before: ``` iterating over 100000000 FX nodes took 5.9s (16830686 nodes/s) ``` After: ``` iterating over 100000000 FX nodes took 5.0s (19937698 nodes/s) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121611 Approved by: https://github.com/oulgen	2024-03-11 20:00:19 +00:00
Jason Ansel	c5702a0891	[dynamo] Optimize BACKEND_MATCH guard (#118065 ) As measured by `benchmarks/dynamo/microbenchmarks/overheads.py`: - Before `22.5us` - After `18.1us` Pull Request resolved: https://github.com/pytorch/pytorch/pull/118065 Approved by: https://github.com/ydwu4	2024-01-24 07:47:52 +00:00
Jason Ansel	a669319450	[inductor] Faster C++ kernel python bindings (#117500 ) Calling C++ from Python via ctypes is notoriously slow. This switches to generating our own C++ bindings directly, which is a >5x speedup on this kernel-launch-bound microbenchmark: ```python from ctypes import c_void_p import torch from torch import empty from torch._inductor.codecache import AsyncCompile from torch._dynamo.testing import rand_strided from torch._inductor.utils import print_performance from torch._inductor.wrapper_benchmark import compiled_module_main async_compile = AsyncCompile() src = ''' #include "/tmp/torchinductor_jansel/gb/cgbau5vlj6cetmcjbjbtw6x4rrivaln6f45s5d72gy2bfx5foz3k.h" extern "C" void kernel(const float* in_ptr0, float* out_ptr0) { { auto tmp0 = in_ptr0[static_cast<long>(0L)]; auto tmp1 = static_cast<float>(1.0); auto tmp2 = decltype(tmp0)(tmp0 + tmp1); out_ptr0[static_cast<long>(0L)] = tmp2; } } ''' cpp_fused_add_ctypes = async_compile.cpp(src) cpp_fused_add_cpython = async_compile.cpp_pybinding(["const float", "float"], src) async_compile.wait(globals()) del async_compile def call(arg0_1): buf0 = empty((1,), device='cpu', dtype=torch.float32) if use_ctypes: for _ in range(100): cpp_fused_add_ctypes(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr())) else: for _ in range(100): cpp_fused_add_cpython(arg0_1, buf0) del arg0_1 return (buf0,) def benchmark_compiled_module(times=1000, repeat=100): arg0_1 = rand_strided((1,), (1,), device='cpu', dtype=torch.float32) return print_performance(lambda: call(arg0_1), times=times, repeat=repeat) print("old ctypes bindings: ", end='') use_ctypes = True compiled_module_main('None', benchmark_compiled_module) print("new bindings: ", end='') use_ctypes = False compiled_module_main('None', benchmark_compiled_module) ``` Output: ``` old ctypes bindings: 0.000073 new bindings: 0.000013 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/117500 Approved by: https://github.com/desertfire	2024-01-18 16:20:12 +00:00
Nikita Shulga	a1afd1b195	Revert "[inductor] Faster C++ kernel python bindings (#117500 )" It should have never been landed, but was landed again, thanks to ghstack grafting/ungrafting see discussion on https://github.com/pytorch/pytorch/pull/116910 This reverts commit `e457b6fb18`.	2024-01-17 17:06:32 -08:00
titaiwangms	e457b6fb18	[inductor] Faster C++ kernel python bindings (#117500 ) Calling C++ from Python via ctypes is notoriously slow. This switches to generating our own C++ bindings directly, which is a >5x speedup on this kernel-launch-bound microbenchmark: ```python from ctypes import c_void_p import torch from torch import empty from torch._inductor.codecache import AsyncCompile from torch._dynamo.testing import rand_strided from torch._inductor.utils import print_performance from torch._inductor.wrapper_benchmark import compiled_module_main async_compile = AsyncCompile() src = ''' #include "/tmp/torchinductor_jansel/gb/cgbau5vlj6cetmcjbjbtw6x4rrivaln6f45s5d72gy2bfx5foz3k.h" extern "C" void kernel(const float* in_ptr0, float* out_ptr0) { { auto tmp0 = in_ptr0[static_cast<long>(0L)]; auto tmp1 = static_cast<float>(1.0); auto tmp2 = decltype(tmp0)(tmp0 + tmp1); out_ptr0[static_cast<long>(0L)] = tmp2; } } ''' cpp_fused_add_ctypes = async_compile.cpp(src) cpp_fused_add_cpython = async_compile.cpp_pybinding(["const float", "float"], src) async_compile.wait(globals()) del async_compile def call(arg0_1): buf0 = empty((1,), device='cpu', dtype=torch.float32) if use_ctypes: for _ in range(100): cpp_fused_add_ctypes(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr())) else: for _ in range(100): cpp_fused_add_cpython(arg0_1, buf0) del arg0_1 return (buf0,) def benchmark_compiled_module(times=1000, repeat=100): arg0_1 = rand_strided((1,), (1,), device='cpu', dtype=torch.float32) return print_performance(lambda: call(arg0_1), times=times, repeat=repeat) print("old ctypes bindings: ", end='') use_ctypes = True compiled_module_main('None', benchmark_compiled_module) print("new bindings: ", end='') use_ctypes = False compiled_module_main('None', benchmark_compiled_module) ``` Output: ``` old ctypes bindings: 0.000073 new bindings: 0.000013 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/117500 Approved by: https://github.com/desertfire ghstack dependencies: #117409, #116667, #117591	2024-01-17 23:03:15 +00:00
PyTorch MergeBot	da6abaeeac	Revert "[inductor] Faster C++ kernel python bindings (#117500 )" This reverts commit `bb0fd1bd3c`. Reverted https://github.com/pytorch/pytorch/pull/117500 on behalf of https://github.com/PaliC due to breaking internal discussed with author offline ([comment](https://github.com/pytorch/pytorch/pull/117500#issuecomment-1896516512))	2024-01-17 19:34:26 +00:00
titaiwangms	bb0fd1bd3c	[inductor] Faster C++ kernel python bindings (#117500 ) Calling C++ from Python via ctypes is notoriously slow. This switches to generating our own C++ bindings directly, which is a >5x speedup on this kernel-launch-bound microbenchmark: ```python from ctypes import c_void_p import torch from torch import empty from torch._inductor.codecache import AsyncCompile from torch._dynamo.testing import rand_strided from torch._inductor.utils import print_performance from torch._inductor.wrapper_benchmark import compiled_module_main async_compile = AsyncCompile() src = ''' #include "/tmp/torchinductor_jansel/gb/cgbau5vlj6cetmcjbjbtw6x4rrivaln6f45s5d72gy2bfx5foz3k.h" extern "C" void kernel(const float* in_ptr0, float* out_ptr0) { { auto tmp0 = in_ptr0[static_cast<long>(0L)]; auto tmp1 = static_cast<float>(1.0); auto tmp2 = decltype(tmp0)(tmp0 + tmp1); out_ptr0[static_cast<long>(0L)] = tmp2; } } ''' cpp_fused_add_ctypes = async_compile.cpp(src) cpp_fused_add_cpython = async_compile.cpp_pybinding(["const float", "float"], src) async_compile.wait(globals()) del async_compile def call(arg0_1): buf0 = empty((1,), device='cpu', dtype=torch.float32) if use_ctypes: for _ in range(100): cpp_fused_add_ctypes(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr())) else: for _ in range(100): cpp_fused_add_cpython(arg0_1, buf0) del arg0_1 return (buf0,) def benchmark_compiled_module(times=1000, repeat=100): arg0_1 = rand_strided((1,), (1,), device='cpu', dtype=torch.float32) return print_performance(lambda: call(arg0_1), times=times, repeat=repeat) print("old ctypes bindings: ", end='') use_ctypes = True compiled_module_main('None', benchmark_compiled_module) print("new bindings: ", end='') use_ctypes = False compiled_module_main('None', benchmark_compiled_module) ``` Output: ``` old ctypes bindings: 0.000073 new bindings: 0.000013 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/117500 Approved by: https://github.com/desertfire ghstack dependencies: #117409, #116667, #117591	2024-01-17 19:12:24 +00:00
PyTorch MergeBot	9da01affd3	Revert "[inductor] Faster C++ kernel python bindings (#117500 )" This reverts commit `3a52147cc5`. Reverted https://github.com/pytorch/pytorch/pull/117500 on behalf of https://github.com/PaliC due to breaking internal discussed with author offline ([comment](https://github.com/pytorch/pytorch/pull/117500#issuecomment-1896426304))	2024-01-17 18:42:39 +00:00
Jason Ansel	3a52147cc5	[inductor] Faster C++ kernel python bindings (#117500 ) Calling C++ from Python via ctypes is notoriously slow. This switches to generating our own C++ bindings directly, which is a >5x speedup on this kernel-launch-bound microbenchmark: ```python from ctypes import c_void_p import torch from torch import empty from torch._inductor.codecache import AsyncCompile from torch._dynamo.testing import rand_strided from torch._inductor.utils import print_performance from torch._inductor.wrapper_benchmark import compiled_module_main async_compile = AsyncCompile() src = ''' #include "/tmp/torchinductor_jansel/gb/cgbau5vlj6cetmcjbjbtw6x4rrivaln6f45s5d72gy2bfx5foz3k.h" extern "C" void kernel(const float* in_ptr0, float* out_ptr0) { { auto tmp0 = in_ptr0[static_cast<long>(0L)]; auto tmp1 = static_cast<float>(1.0); auto tmp2 = decltype(tmp0)(tmp0 + tmp1); out_ptr0[static_cast<long>(0L)] = tmp2; } } ''' cpp_fused_add_ctypes = async_compile.cpp(src) cpp_fused_add_cpython = async_compile.cpp_pybinding(["const float", "float"], src) async_compile.wait(globals()) del async_compile def call(arg0_1): buf0 = empty((1,), device='cpu', dtype=torch.float32) if use_ctypes: for _ in range(100): cpp_fused_add_ctypes(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr())) else: for _ in range(100): cpp_fused_add_cpython(arg0_1, buf0) del arg0_1 return (buf0,) def benchmark_compiled_module(times=1000, repeat=100): arg0_1 = rand_strided((1,), (1,), device='cpu', dtype=torch.float32) return print_performance(lambda: call(arg0_1), times=times, repeat=repeat) print("old ctypes bindings: ", end='') use_ctypes = True compiled_module_main('None', benchmark_compiled_module) print("new bindings: ", end='') use_ctypes = False compiled_module_main('None', benchmark_compiled_module) ``` Output: ``` old ctypes bindings: 0.000073 new bindings: 0.000013 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/117500 Approved by: https://github.com/desertfire	2024-01-16 22:30:04 +00:00
Aaron Gokaslan	d9f2cf9974	[BE]: Enable ruff rule PIE800 - unnecessary nested dict expansion (#113880 ) Adds an additional list which removes unnecessary dict literal unpacking, also applies the fixes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113880 Approved by: https://github.com/albanD	2023-11-16 22:34:38 +00:00
Peter Bell	bbd5b935e4	Use `pytree.tree_leaves` everywhere (#112324 ) This changes all the instances I could find of `tree_flatten(...)[0]` or `x, _ = tree_flatten` to use `tree_leaves`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/112324 Approved by: https://github.com/lezcano ghstack dependencies: #112327, #112323	2023-10-30 03:39:04 +00:00
Justin Chu	5ef023b05a	[BE] Enable ruff's UP rules and autoformat benchmarks/ (#105429 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/105429 Approved by: https://github.com/malfet	2023-07-19 04:46:37 +00:00
Aaron Gokaslan	2f95a3d0fc	[BE]: Apply ruff PERF fixes to torch (#104917 ) Applies automated ruff fixes in the PERF modules and enables all automatic ones. I also updated ruff which applied some additional fixes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/104917 Approved by: https://github.com/ezyang, https://github.com/albanD	2023-07-11 20:45:21 +00:00
Elias Ellison	2baadc2ade	Small operatorbench improvements (#103110 ) - Don't copy inputs in cudagraphs wrapping, since the copies will distorts timing and triton do_bench will clear cache anyway - Don't skip op if there is a fallback, since we have both fallbacks and lowerings for some ops - Add option for channels last Pull Request resolved: https://github.com/pytorch/pytorch/pull/103110 Approved by: https://github.com/desertfire	2023-06-07 22:04:59 +00:00
xndcn	bebb8b7c1e	[inductor] use native fetch_add function for trivial types (#101931 ) floating-point is supported by std::atomic::fetch_add since C++20. However, this code path is not activated yet because cpp_flags in codecache.py is hard-coded to "-std=c++17" Pull Request resolved: https://github.com/pytorch/pytorch/pull/101931 Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel	2023-06-01 03:47:56 +00:00
lezcano	8b4e28d65d	Fix microbenchmarks (#101065 ) As per title Pull Request resolved: https://github.com/pytorch/pytorch/pull/101065 Approved by: https://github.com/jansel	2023-05-11 09:14:22 +00:00
Shunting Zhang	68bc0fc012	[inductor] a script to benchmark the perf impact from tensor layout (#99583 ) Follow up on Jason's idea of tensor layout tuning. Add a script to show the perf impact of layout to convolution (will add more cases like batch/layer norm, reduction to the scripts). For convolution, a quick test shows using channels last layout, we get 1.4x speedup for convolution: ``` baseline 4.509183883666992 test 3.178528070449829 speedup 1.419x ``` The speedup definitely also depends on input/weight shapes. E.g., change input channel from 3 in the test to 8, we see speedup to be 2.1x The trace shows cudnn calls different kernels when input layout changes to channels last. <img width="997" alt="Screenshot 2023-04-19 at 5 27 54 PM" src="https://user-images.githubusercontent.com/52589240/233228656-4bdcac0a-7633-416a-82e1-17d8dc8ea9a6.png"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/99583 Approved by: https://github.com/jansel	2023-04-20 06:26:10 +00:00
PyTorch MergeBot	629377ea8b	Revert "Replace _dynamo.config with an object instead of module (#96455 )" This reverts commit `420104a886`. Reverted https://github.com/pytorch/pytorch/pull/96455 on behalf of https://github.com/jansel due to BC breaking, was landed prematurely	2023-04-12 15:06:14 +00:00
Han Qi	420104a886	Replace _dynamo.config with an object instead of module (#96455 ) Summary: Replace _dynamo.config with an object instead of module Current usage patterns of setting and reading fields on config will work unchanged. Only changes needed going forward: 1. import torch._dynamo.config will not work. However, just doing import torch._dynamo is sufficient to access dynamo config as torch._dynamo.config. 2. Files inside of _dynamo folder need to access config via from torch._dynamo.config_util import config instead of from torch._dynamo import config. Because _dynamo/__init__.py imports some of the files so it would be circular import. Test Plan: Reviewers: Subscribers: Tasks: Tags: Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/96455 Approved by: https://github.com/williamwen42	2023-04-11 21:23:32 +00:00
Edward Z. Yang	9a8f71f23e	Convert logging f-strings to use % format (#98697 ) Codemod done with https://gist.github.com/ezyang/2e8b0463cdc6be278478495b23ff0530 with assistance from ChatGPT. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/98697 Approved by: https://github.com/voznesenskym	2023-04-10 12:19:31 +00:00
Jason Ansel	9370f253e3	[inductor] Rewrite convolution triton templates (#95556 ) Fixes #95775 Pull Request resolved: https://github.com/pytorch/pytorch/pull/95556 Approved by: https://github.com/Chillee, https://github.com/ngimel	2023-03-22 18:12:23 +00:00
BowenBao	60a68477a6	Bump black version to 23.1.0 (#96578 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/96578 Approved by: https://github.com/ezyang	2023-03-15 06:27:59 +00:00
Aaron Gokaslan	3d82d8d0ed	[BE] Enable more flake8-comprehensions checks (#94601 ) I applied some flake8 fixes and enabled checking for them in the linter. I also enabled some checks for my previous comprehensions PR. This is a follow up to #94323 where I enable the flake8 checkers for the fixes I made and fix a few more of them. Pull Request resolved: https://github.com/pytorch/pytorch/pull/94601 Approved by: https://github.com/ezyang	2023-02-10 23:40:29 +00:00

60 Commits