This is one of a series of PRs to update us to PEP 585 (changing `Dict` -> `dict`, `List` -> `list`, etc.). Most of the PRs were completely automated with RUFF as follows:
Since RUFF UP006 is considered an "unsafe" fix, first we need to enable unsafe fixes:
```
--- a/tools/linter/adapters/ruff_linter.py
+++ b/tools/linter/adapters/ruff_linter.py
@@ -313,6 +313,7 @@
"ruff",
"check",
"--fix-only",
+ "--unsafe-fixes",
"--exit-zero",
*([f"--config={config}"] if config else []),
"--stdin-filename",
```
Then we need to tell RUFF to allow UP006 (in a final PR, once all of these have landed, this will be made permanent):
```
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -40,7 +40,7 @@
[tool.ruff]
-target-version = "py38"
+target-version = "py39"
line-length = 88
src = ["caffe2", "torch", "torchgen", "functorch", "test"]
@@ -87,7 +87,6 @@
"SIM116", # Disable Use a dictionary instead of consecutive `if` statements
"SIM117",
"SIM118",
- "UP006", # keep-runtime-typing
"UP007", # keep-runtime-typing
]
select = [
```
Finally, running `lintrunner -a --take RUFF` will fix up the deprecated uses.
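For reference, the kind of rewrite UP006 performs looks like this (an illustrative sketch, not code from this PR):
```python
# Before: PEP 484 style, requires imports from typing
from typing import Dict, List

def count_tokens_old(lines: List[str]) -> Dict[str, int]: ...

# After: PEP 585 style (Python 3.9+), builtin generics used directly
def count_tokens(lines: list[str]) -> dict[str, int]: ...
```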
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145101
Approved by: https://github.com/bobrenjc93
This PR adds the most basic custom benchmarker (i.e. a benchmarker that is not provided by Triton), which we call `InductorBenchmarker`. This new benchmarker is very basic in principle and closely follows Triton's `do_bench` implementation, with slight changes such as flushing the exact L2 cache size (Triton defaults to 256MB), using a buffer zero for warmup (Triton uses the benchmarked kernel itself; I found that buffer zeroes are more consistent), and returning the min runtime (Triton can return min, among other things; currently Inductor picks median).
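A minimal sketch of the approach, assuming a CUDA device (the names and loop structure here are illustrative, not the actual `InductorBenchmarker` implementation):
```python
import torch

def benchmark_kernel(fn, rep: int = 100) -> float:
    # Flush the device's exact L2 cache size rather than a fixed 256MB.
    props = torch.cuda.get_device_properties(torch.cuda.current_device())
    cache = torch.empty(props.L2_cache_size, dtype=torch.int8, device="cuda")
    cache.zero_()  # buffer-zero warmup instead of running the kernel itself

    timings = []
    for _ in range(rep):
        cache.zero_()  # flush L2 before every timed run
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn()
        end.record()
        end.synchronize()
        timings.append(start.elapsed_time(end))  # milliseconds
    return min(timings)  # min runtime, where Inductor previously picked median
```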
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133058
Approved by: https://github.com/eellison
ghstack dependencies: #144315
Changes, in apply order (an illustrative sketch follows the list):
1. Replace all `".."` and `os.pardir` usage with `os.path.dirname(...)`.
2. Replace nested `os.path.dirname(os.path.dirname(...))` call with `str(Path(...).parent.parent)`.
3. Reorder `.absolute()` ~/ `.resolve()`~ and `.parent`: always resolve the path first.
`.parent{...}.absolute()` -> `.absolute().parent{...}`
4. Replace chained `.parent x N` with `.parents[${N - 1}]`: the code is easier to read (see 5.)
`.parent.parent.parent.parent` -> `.parents[3]`
5. ~Replace `.parents[${N - 1}]` with `.parents[${N} - 1]`: the code is easier to read and does not introduce any runtime overhead.~
~`.parents[3]` -> `.parents[4 - 1]`~
6. ~Replace `.parents[2 - 1]` with `.parent.parent`: because the code is shorter and easier to read.~
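An illustrative before/after for rules 1-4 (the file and variable names are hypothetical):
```python
import os
from pathlib import Path

# Before: string-based traversal with os.path.dirname chains
repo_root = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))

# After: pathlib, resolving the path first, then taking parents
repo_root = str(Path(__file__).absolute().parents[1])  # == .parent.parent
```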
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129374
Approved by: https://github.com/justinchuby, https://github.com/malfet
Fixes #130559
* Intro
This PR adds support for `@contextmanager` in Dynamo. We chose to limit the
scope of this work to only `@contextmanager` and plan to handle generators fully
in #141055 (still in draft).
* Motivation
Dynamo lacks support for generator functions. When it encounters one, it traces
it as if it were a regular function. This is problematic because it can lead to
incorrect behavior. To illustrate, consider the test case below:
```python
import torch
import contextlib

@contextlib.contextmanager
def set_default_dtype(dtype):
    old_dtype = torch.get_default_dtype()
    try:
        torch.set_default_dtype(dtype)
        yield
    finally:
        torch.set_default_dtype(old_dtype)

@torch.compile(backend="eager", fullgraph=True)
def fn():
    with set_default_dtype(torch.float64):
        x = torch.tensor([3.0, 3.0 + 5.0j])
    return x
```
Before this work, Dynamo would not stop at the `yield`, and the graph produced
would contain both calls to `set_default_dtype` executed one after the other.
This is incorrect because the context manager should execute code before and
after the `yield`.
* List of changes
`YIELD_VALUE` now raises an exception (`YieldValueOp`) to signal that control
flow must be suspended and returned to the caller. Additionally, `RETURN_VALUE`
behaves differently in a generator function. Unlike regular functions, where
`RETURN_VALUE` indicates the final result, in generators it signifies that the
generator is exhausted and implicitly raises `StopIteration`.
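For reference, this mirrors plain Python generator semantics, where a `return` raises `StopIteration` carrying the value:
```python
def gen():
    yield 1
    return "done"  # RETURN_VALUE in a generator: raises StopIteration("done")

g = gen()
print(next(g))  # 1
try:
    next(g)
except StopIteration as e:
    print(e.value)  # "done" -- the generator is exhausted
```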
A new `VariableTracker` named `FunctionDecoratedByContextlibContextManagerVariable`
was introduced to handle `@contextmanager`. This variable tracker acts not just
as a wrapper for the original function but also maintains an internal `tx`
(InstructionTranslator) object to suspend and return control flow to the parent
tracer when a `yield` is encountered.
* Corner cases
Returning a context manager from a compiled function is not supported. This
would require PyTorch to synchronize the generator state between Dynamo and the
interpreter. Any attempt to return it will result in an `IncorrectUsage`
exception.
Graph breaks require special handling as well. In the event of a graph break,
the frame associated with the context manager is skipped, and the context
manager runs in eager mode.
* This PR is breaking my code
There is a configuration flag (`enable_trace_contextlib`) that can be set to
`False` to disable tracing context managers. If this still causes crashes,
please revert this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136033
Approved by: https://github.com/zou3519
I discovered this issue when trying to search for the accuracy results on the database and couldn't find any. It turns out that the results are there in the JSON file, for example `"metric": {"name": "accuracy", "benchmark_values": ["pass_due_to_skip"]}`, but inserting them into the database fails because benchmark values here is a list of strings while the expectation is that it's a list of numbers.
ClickHouse doesn't support mixed types atm. It has a Variant type https://clickhouse.com/docs/en/sql-reference/data-types/variant, but this isn't recommended by the CH team themselves. So, the remaining option is to store this in the `extra_info` field. This field is a dictionary, so it can go there.
### Testing
https://github.com/pytorch/pytorch/actions/runs/12421747715
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143611
Approved by: https://github.com/kit1980
- Add in upcast_compute_type on creation of new tensors (loads, constants)
- Fixes index_expr - right now we are sort of inconsistent in dtype and don't always respect the dtype specified. Would be nice to fix fully, but not doing that in this PR.
- Bug fix in view dtype where we were always upcasting back to fp32 when the input was in bf16/fp16. We should only be doing that if the output is also in bf16/fp16.
- For masked, avoid calling dtype propagation and just use the output dtype.
Turns on the runtime dtype verification for opinfo tests. The separate test file is still useful because we can use it for testing turning off codegen_upcast_to_fp32.
Follow ups:
- We could consider requiring fewer explicit upcast_compute_type calls and doing it automatically. That would potentially make things easier but be less flexible in the future. Maybe I should have done that in this PR.
- Be more consistent in our index expr dtype printing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141495
Approved by: https://github.com/blaine-rister, https://github.com/arui-meta, https://github.com/ezyang
ghstack dependencies: #139945, #140057
This turns on AOTAutogradCache for all inductor tests. It clears AOTAutogradCache on each test as well, by virtue of the local cache using the same directory to store cache entries.
I've also tested with INDUCTOR_TEST_DISABLE_FRESH_CACHE=1, running all the tests. AOTAutogradCache successfully caches 99% of these. There are a few tests that use view_replay and therefore save functional tensors, which cause AOTAutogradCache to fail to pickle its result. Will look into next steps there, but for now, it seems okay if the cache just misses on those cases where it can't serialize the result. It would be better to check before pickling, though.
I've made the following small bugfixes to get this working:
- Inductor is sometimes used in a standalone mode without dynamo, which leads to attribute errors in check_can_cache. In general, we should *never* crash in cache checking, only bypass. So I changed a try/except to catch Exception instead of just a specific exception.
- Add extra structured logging for metadata on cache hits
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140890
Approved by: https://github.com/bdhirsh
A couple of changes:
- Tries to reuse dtype propagation rules that were already registered in inductor. These were present both in `pointwise_overrides_data` and the `boolean_ops` list. Additionally, the registration of pointwise ops already specified dtype propagation rules. Saves those registrations and reuses them later.
- Factors out `get_promoted_dtype`, which uses functools.lru_cache, to take in non-CSEVariable args, because CSEVariables will not work with the functools cache (see the sketch below).
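A hedged sketch of that split (the real signature differs; this only illustrates why per-node CSEVariable objects must stay outside the cache):
```python
import functools
import torch

@functools.lru_cache(maxsize=None)
def get_promoted_dtype(*dtypes: torch.dtype) -> torch.dtype:
    # torch.dtype values are hashable, so they work as lru_cache keys;
    # CSEVariable objects are per-node and would defeat the cache.
    return functools.reduce(torch.promote_types, dtypes)

# Callers extract plain dtypes from their CSEVariables before calling:
promoted = get_promoted_dtype(torch.float16, torch.float32)  # torch.float32
```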
Tests get added later in the stack when everything is implemented.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139945
Approved by: https://github.com/blaine-rister, https://github.com/arui-meta, https://github.com/ezyang
Differential Revision: [D65362160](https://our.internmc.facebook.com/intern/diff/D65362160)
State after this PR:
1. For the tests that require inference IR, they are replaced with ep.run_decomp({}), so export_for_training_run_decomp is sort of redundant, but I guess it is still nice that multiple rounds of retracing keep working. In general, we need some auditing to reduce our redundant testing coverage.
2. After this PR has landed and not been reverted for a week or so, I will replace the export_for_training calls with export, as they are the same thing now.
3. Added more tests to also cover the now-"deprecated" old IR by patching export to use the old export. For reviewers, please look at the internal version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139511
Approved by: https://github.com/ydwu4, https://github.com/angelayi, https://github.com/avikchaudhuri
Summary:
**Wins**
On the torchrec benchmark, for 2K nodes it saves 40 seconds.
With the recent sympy changes (https://www.internalfb.com/diff/D65883538) we save around 13 seconds (with the max opt on).
```
buck2 run fbcode//mode/opt fbcode//torchrec/distributed/tests:pt2_compile_benchmark -- --num-features=200
```
This diff optimizes the construction of expressions of the form a+b+c... (all unique symbols), which are very common in torchrec models.
**How**
Expressions of the form a+b+c are not optimized by add; the only needed optimization is sorting them.
If we have a+b+c and we are adding (d) to it, we can do a binary search to find
the position of (d) and avoid re-optimizing the new expression by passing the new order.
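A hedged sketch of the idea, using string keys as stand-ins for sympy's term ordering:
```python
import bisect

# Sorted term keys of an existing sum a + b + c (all unique symbols).
terms = ["a", "b", "c"]

# Adding d: binary-search its position instead of re-canonicalizing the
# whole expression, then build the new Add with the known-sorted order.
bisect.insort(terms, "d")  # O(log n) search + insert
print(terms)               # ['a', 'b', 'c', 'd']
```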
**Extensions**:
1. Support constant terms.
2. Support 10a+10b+... (this will give even more wins; support will be extended in a second PR).
Differential Revision: D66008482
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140822
Approved by: https://github.com/ezyang
As per the title, plus we avoid calling defer_assert when we statically know the guard results.
Timing for pnasnet5large:
```
TIMING: code_gen:21.79672 inductor_compile:39.57726 backend_compile:65.30649 entire_frame_compile:95.22052 total_wall_time:95.22052
```
matches without the diff:
```
TIMING: code_gen:21.89314 inductor_compile:39.72298 backend_compile:65.38539 entire_frame_compile:95.0854 total_wall_time:95.0854
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138804
Approved by: https://github.com/ezyang
1) Always generate expected_results.csv up to an accuracy of the first three significant digits,
ex: 112313212312 --> 112000000000, etc.
2) Regenerate all records in expected_results.csv and not just the failed ones. Why? Because if we change something
by 1.3% and the noise threshold is 1.5%, we want to reflect that.
3) Add "please update all results that changed significantly, and not only the failed ones".
```
(myenv) [lsakka@devgpu005.nha1 ~/pytorch/benchmarks/dynamo/pr_time_benchmarks (check_result_ehancements)]$ python check_results.py test_check_result/expected_test.csv test_check_result/result_test.csv out
WIN: benchmark ('a', 'instruction count') failed, actual result 9011111111 is -18.16% lower than expected 11011111111 ±1.00% please update the expected results.
please update all results that changed significantly, and not only the failed ones
REGRESSION: benchmark ('b', 'memory') failed, actual result 20011111111 is 99.89% higher than expected 10011111111 ±+10.00% if this is an expected regression, please update the expected results.
please update all results that changed significantly, and not only the failed ones
REGRESSION: benchmark ('c', 'something') failed, actual result 107111111111 is 969.92% higher than expected 10011111111 ±+10.00% if this is an expected regression, please update the expected results.
please update all results that changed significantly, and not only the failed ones
MISSING REGRESSION TEST: benchmark ('d', 'missing-test') does not have a regression test enabled for it.
new expected results file content if needed:
a,instruction count,9011000000,0.01
b,memory,20010000000,0.1
c,something,107100000000,0.1
There was some failures you can use the new reference expected result stored at path:out and printed above
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137925
Approved by: https://github.com/aorenste
update_hint_regression has been behaving, so I am setting a 2% noise threshold for it, and 1.5% for sum_floordiv_regression.
I have one concern with the way we do the regression detection: small changes below the threshold level will accumulate and eventually trigger a failure. To avoid those, we would have to keep an eye on the dashboard and potentially refresh the expected results file regularly even when there are no failures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137548
Approved by: https://github.com/aorenste
We found that during `export() -> _inductor.aot_compile()` lowering, 3 different ShapeEnvs get created, leading to errors when one ShapeEnv processes expressions created by another ShapeEnv. This plumbs the 2 places where ShapeEnv creation happens, detecting the original ShapeEnv from the GraphModule example values, so the original ShapeEnv is just reused.
Differential Revision: D64613290
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138362
Approved by: https://github.com/angelayi
Tested internally here: https://www.internalfb.com/diff/D64057744
This is a reland after previous internal failures.
The main change is:
```
if min is None and max is None:
    torch._check_is_size(size)
    return
```
Partially addresses https://github.com/pytorch/pytorch/issues/128150
When you have big sums of values, we end up computing long chains of
binary addition in our FX graph representation. Not only is this ugly,
it also is quadratic, as the sympy.Add constructor is O(N) in number
of arguments. Instead, ensure that we maintain the summation as a
single FX node so we can do the entire addition all in one go.
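A small illustration of the quadratic chain versus a single N-ary construction:
```python
import sympy

xs = sympy.symbols("x0:1000")

# Chained binary adds: every `+` rebuilds an Add over all terms so far,
# so the total work is O(N^2).
acc = sympy.Integer(0)
for x in xs:
    acc = acc + x

# Single N-ary construction: one O(N) call.
acc = sympy.Add(*xs)
```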
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138660
Approved by: https://github.com/ezyang, https://github.com/bobrenjc93
Instead of calling `safe_expand` right after symbolic expression construction, we invoke it in `ShapeEnv.simplify`. This enables more simplification with product form, e.g.,
```
(a + b)^2 / (a + b) --> (a + b)
```
which won't happen if we expand eagerly during product construction:
```
(a^2 + 2ab + b^2) / (a + b) --> no change
```
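A small sympy illustration of the difference:
```python
import sympy

a, b = sympy.symbols("a b")

print((a + b)**2 / (a + b))                # a + b: powers of one base combine
print(sympy.expand((a + b)**2) / (a + b))  # (a**2 + 2*a*b + b**2)/(a + b): stuck
```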
Fixes #136044.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138235
Approved by: https://github.com/ezyang
Summary: From experiments, it seems that aten.constant_pad_nd has better QPS compared to torch.cat. The QPS gain for ig ctr is ~10%, and ~5% for oc.
Test Plan:
```
buck2 run mode/opt -c fbcode.nvcc_arch=a100 //caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/585279927/480/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend=AOT_INDUCTOR
```
```
buck2 run mode/opt //caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/588102397/1500/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend=AOT_INDUCTOR
```
Differential Revision: D64271583
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137820
Approved by: https://github.com/eellison
Fixes https://github.com/pytorch/pytorch/issues/136640
Today, inductor has some logic to figure out when it needs to do broadcasting during lowering, which just checks if any of the input shapes have sizes equal to 1.
We should already have this information by the time we get to inductor, because our FakeTensor compute will have branched/guarded on whether any ops performed broadcasting, appropriately.
For example, if we have a tensor with a size value of `(64//((2048//(s3*((s2//s3)))))))`, and it happens to be equal to one (and it is used in an op that requires this dim to be broadcasted), FakeTensorProp will have generated a guard:
```
Eq((64//((2048//(s3*((s2//s3))))))), 1)
```
I chose the simplest possible way to beef up inductor's checks to know when a given size is equal to 1: loop over the existing shape env guards, and if our current size is a sympy expression on the LHS of one of our `Eq(LHS, 1)` guards, then return True (sketched below).
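A hedged sketch of that check (the guard representation is simplified; the real ShapeEnv stores guards differently):
```python
import sympy

def statically_one(size_expr: sympy.Expr, guards) -> bool:
    # `guards` is assumed to be an iterable of sympy relational expressions
    # recorded by the shape env; match our expr against Eq(LHS, 1) guards.
    for g in guards:
        if isinstance(g, sympy.Equality) and g.rhs == 1 and g.lhs == size_expr:
            return True
    return False
```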
I'm hoping for feedback on whether or not this approach is reasonable. One better option I could imagine is that our symbolic reasoning should have automatically simplified the size of our tensor down to a constant as part of evaluating that guard. I was originally going to try to do this directly in the shape env, but I ran into a few issues:
(1) I wanted to call some version of `set_replacement(expr, 1)`. But `set_replacement()` only accepts plain symbols on the LHS, not expressions
(2) in theory I could get this to work if I could rework the above expression to move everything that is not a free variable to the RHS, e.g. `Eq(s2, 32)`. It looks like our existing `try_solve()` logic is... [not quite able](https://github.com/pytorch/pytorch/blob/main/torch/utils/_sympy/solve.py#L27) to do this generally though.
Checking the guards feels pretty simple and easy. Are we worried that it is too slow to iterate over all the guards? I could also cache the lookup so we only need to iterate over guards that are of the form `Eq(LHS, 1)`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136670
Approved by: https://github.com/ezyang
### Context
Fixes a CUDA IMA (illegal memory access) in autotune_at_compile_time, where we would generate an example tensor with an incorrect stride.
In the case below, the stride should be (u0 * 128, 128, 1). However, we apply the fallback to the entire expr (i.e. u0 * 128).
```
# buf817 = tensor(size=(s0, u0, 128), stride=(u0 * 128, 128, 1))
buf812 = generate_example_value(
(64, 8192, 128), (8192, 128, 1), "cuda:0", torch.bfloat16, 0
)
```
The fix is to apply the fallback on each symbol.
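A hedged sketch of the difference (the hint value 64 is hypothetical):
```python
import sympy

u0 = sympy.Symbol("u0")
stride_expr = u0 * 128

# Buggy: apply the fallback to the whole expression, losing the u0-dependence.
stride = 8192

# Fixed: apply the fallback per symbol, then evaluate the expression.
stride = stride_expr.subs({u0: 64})  # still derived from u0 * 128
```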
### Test
```
PYTORCH_NO_CUDA_MEMORY_CACHING=1 compute-sanitizer python test_aot_inductor.py -k test_stride_with_unbacked_expr_abi_compatible_cuda
========= Invalid __global__ write of size 2 bytes
```
Differential Revision: [D64074561](https://our.internmc.facebook.com/intern/diff/D64074561)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137537
Approved by: https://github.com/jingsh
The test also adds a signpost log for the benchmarks that pass.
To test, I ran `python check_results.py test_check_result/expected_test.csv test_check_result/result_test.csv out.csv`.
Results:
```
WIN: benchmark ('a', 'instruction count') failed, actual result 90 is -18.18% lower than expected 110 ±1.00% please update the expected results.
REGRESSION: benchmark ('b', 'memory') failed, actual result 200 is 100.00% higher than expected 100 ±+10.00% if this is an expected regression, please update the expected results.
PASS: benchmark ('c', 'something') pass, actual result 107 +7.00% is within expected 100 ±10.00%
MISSING REGRESSION TEST: benchmark ('d', 'missing-test') does not have a regression test enabled for it.
You can use the new reference expected result stored at path: out.csv.
a,instruction count,90,0.01
b,memory,200,0.1
c,something,100,0.1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137551
Approved by: https://github.com/aorenste
Compile time benchmark for the min cut partitioner. I'm hoping that this is a reasonable benchmark because:
(1) it consists of a single input + many weights that are used sequentially
(2) it contains a mix of recomputed vs non-recomputed ops (matmul + sin)
(3) it is relatively simple
From running locally:
```
collecting compile time instruction count for aotdispatcher_partitioner_cpu
compile time instruction count for iteration 0 is 21764219181
compile time instruction count for iteration 1 is 12475020009
compile time instruction count for iteration 2 is 12463710140
compile time instruction count for iteration 3 is 12455676489
compile time instruction count for iteration 4 is 12451344330
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136760
Approved by: https://github.com/ezyang
ghstack dependencies: #136759
This adds a few compile time benchmarks for some disjoint paths in AOTDispatcher:
(1) inference vs training code paths
(2) "subclasses" vs "no subclasses" codepaths
Also see https://github.com/pytorch/pytorch/pull/136760 for a partitioner benchmark (I'm not sure why ghstack didn't display the stack nicely)
I ran locally, and got these numbers on the 4 paths:
```
collecting compile time instruction count for aotdispatcher_inference_nosubclass_cpu
compile time instruction count for iteration 0 is 11692348671
compile time instruction count for iteration 1 is 3026287204
compile time instruction count for iteration 2 is 3011467318
compile time instruction count for iteration 3 is 3004485935
compile time instruction count for iteration 4 is 3003087410
collecting compile time instruction count for aotdispatcher_training_nosubclass_cpu
compile time instruction count for iteration 0 is 6068003223
compile time instruction count for iteration 1 is 5585418102
compile time instruction count for iteration 2 is 5581856618
compile time instruction count for iteration 3 is 5581651794
compile time instruction count for iteration 4 is 5578742619
collecting compile time instruction count for aotdispatcher_inference_subclass_cpu
compile time instruction count for iteration 0 is 8634984264
compile time instruction count for iteration 1 is 8633467573
compile time instruction count for iteration 2 is 8632182092
compile time instruction count for iteration 3 is 8632056925
compile time instruction count for iteration 4 is 8632543871
collecting compile time instruction count for aotdispatcher_training_subclass_cpu
compile time instruction count for iteration 0 is 14737239311
compile time instruction count for iteration 1 is 14734346427
compile time instruction count for iteration 2 is 14736493730
compile time instruction count for iteration 3 is 14734121272
compile time instruction count for iteration 4 is 14733852882
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136759
Approved by: https://github.com/laithsakka
Partially addresses https://github.com/pytorch/pytorch/issues/128150
When you have big sums of values, we end up computing long chains of
binary addition in our FX graph representation. Not only is this ugly,
it also is quadratic, as the sympy.Add constructor is O(N) in number
of arguments. Instead, ensure that we maintain the summation as a
single FX node so we can do the entire addition all in one go.
update_hint_regression benchmark, before and after:
```
update_hint_regression,compile_time_instruction_count,2648328980
update_hint_regression,compile_time_instruction_count,2563748678
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136429
Approved by: https://github.com/isuruf
1. Example of a failing diff:
https://github.com/pytorch/pytorch/pull/136740
2. Test this by running:
python check_results.py test_check_result/expected_test.csv test_check_result/result_test.csv
Results:
```
WIN: benchmark ('a', ' instruction count') failed, actual result 90 is 18.18% lower than expected 110 ±1.00% please update the expected results.
REGRESSION: benchmark ('b', ' memory') failed, actual result 200 is 100.00% higher than expected 100 ±10.00% if this is an expected regression, please update the expected results.
MISSING REGRESSION TEST: benchmark ('d', ' missing-test') does not have a regression test enabled for it
```
MISSING REGRESSION TEST does not fail, but it's logged.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136551
Approved by: https://github.com/ezyang
ghstack dependencies: #136383
Summary:
To facilitate the PSS-2 upgrade, this uses `ndt.NDArray` instead of `np.ndarray` in type annotations. In Numpy-1.19 (PSS-1) it's an alias to `np.ndarray` -- a noop.
In Numpy-1.24, `ndt.NDArray` is a proper generic type, and without this change uses of `np.ndarray` generate this Pyre type error:
```counterexample
Invalid type parameters [24]: Generic type `np.ndarray` expects 2 type parameters.
```
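An illustrative sketch of the annotation change (the function is hypothetical):
```python
import numpy as np
import numpy.typing as ndt

# Before: bare np.ndarray, which Pyre on Numpy-1.24 flags as missing its
# 2 type parameters.
def scale_old(x: np.ndarray) -> np.ndarray:
    return x * 2.0

# After: ndt.NDArray, which parameterizes np.ndarray correctly.
def scale(x: ndt.NDArray[np.float64]) -> ndt.NDArray[np.float64]:
    return x * 2.0
```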
Test Plan: Sandcastle plus visual inspection
Differential Revision: D62977370
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136288
Approved by: https://github.com/kit1980
Running Torchbench llama with dynamic size failed with
```
File "/localdisk/leslie/torch_inductor_community/pytorch/torch/fx/experimental/symbolic_shapes.py", line 4182, in produce_guards
raise ConstraintViolationError(
torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (L['inputs'][0].size()[0])! For more information, run with TORCH_LOGS="+dynamic".
- Not all values of RelaxedUnspecConstraint(L['inputs'][0].size()[0]) are valid because L['inputs'][0].size()[0] was inferred to be a constant (32).
```
Skip marking dynamic dims for this model.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135960
Approved by: https://github.com/ezyang
All of the previous benchmarks are similar; ListOfLinears should be representative enough.
I copied the previous benchmarks from unit tests without any particular intention; I was just trying to create a large
number of benchmarks to better observe noise.
This PR keeps only one; we can add more as we see value and regressions in the future.
Also, this diff adds a GPU version.
```
collecting compile time instruction count for basic_modules_ListOfLinears_eager
compile time instruction count for iteration 0 is 6479525851
compile time instruction count for iteration 1 is 1024432680
compile time instruction count for iteration 2 is 1019417317
compile time instruction count for iteration 3 is 1013603566
compile time instruction count for iteration 4 is 1008853980
compile time instruction count for iteration 5 is 1009541481
compile time instruction count for iteration 6 is 1005025533
compile time instruction count for iteration 7 is 1004116323
compile time instruction count for iteration 8 is 1000828633
compile time instruction count for iteration 9 is 999788323
collecting compile time instruction count for basic_modules_ListOfLinears_inductor
compile time instruction count for iteration 0 is 40837529730
compile time instruction count for iteration 1 is 18411921909
compile time instruction count for iteration 2 is 18383665161
compile time instruction count for iteration 3 is 18348983522
compile time instruction count for iteration 4 is 18349276590
compile time instruction count for iteration 5 is 18353046274
compile time instruction count for iteration 6 is 18346818581
compile time instruction count for iteration 7 is 18340057998
compile time instruction count for iteration 8 is 18331267320
compile time instruction count for iteration 9 is 18328381338
collecting compile time instruction count for basic_modules_ListOfLinears_inductor_gpu
compile time instruction count for iteration 0 is 15408870979
compile time instruction count for iteration 1 is 10949520859
compile time instruction count for iteration 2 is 11058786167
compile time instruction count for iteration 3 is 11003606719
compile time instruction count for iteration 4 is 10896406770
compile time instruction count for iteration 5 is 10982875189
compile time instruction count for iteration 6 is 10931848275
compile time instruction count for iteration 7 is 10956345008
compile time instruction count for iteration 8 is 11045384499
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135730
Approved by: https://github.com/ezyang, https://github.com/anijain2305
When we measure compile time instruction count, in most cases we probably do not want to measure GC instructions, so we disable GC here by default.
If it is needed, we can add an option to allow it, or someone can use the regular total instruction count instead of compile time instruction count.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135768
Approved by: https://github.com/ezyang, https://github.com/anijain2305
There was some recent strange noise: +5%, -5%.
Using only compile time:
1) avoids GC time;
2) avoids other operations that are not what we are trying to measure. ==> less probable noise.
```
collecting compile time instruction count for sum_floordiv_regression
compile time instruction count for iteration 0 is 8899290248
compile time instruction count for iteration 1 is 1188830489
compile time instruction count for iteration 2 is 1180579615
compile time instruction count for iteration 3 is 1176263131
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135785
Approved by: https://github.com/avikchaudhuri, https://github.com/anijain2305
I am thinking maybe 3 iterations are enough for this one?
- So I am keeping eager and inductor, since inductor is 2X eager time.
- Eager dynamic is 2X eager, so keeping this as well.
- Inductor has three tests (dynamic gpu, gpu and cpu).
I am unsure if I am over-profiling here; happy to trim if anyone has suggestions.
```
collecting compile time instruction count for add_loop_eager
compile time instruction count for iteration 0 is 8213664211
compile time instruction count for iteration 1 is 2798628246
compile time instruction count for iteration 2 is 2796811362
compile time instruction count for iteration 3 is 2794438188
compile time instruction count for iteration 4 is 2794634117
collecting compile time instruction count for add_loop_eager_dynamic
compile time instruction count for iteration 0 is 5724108021
compile time instruction count for iteration 1 is 5499908609
compile time instruction count for iteration 2 is 5569101366
compile time instruction count for iteration 3 is 5493806364
compile time instruction count for iteration 4 is 5493169851
collecting compile time instruction count for add_loop_inductor
compile time instruction count for iteration 0 is 49789381222
compile time instruction count for iteration 1 is 25769347393
compile time instruction count for iteration 2 is 25772594322
compile time instruction count for iteration 3 is 25768695952
compile time instruction count for iteration 4 is 25768032314
collecting compile time instruction count for add_loop_inductor_gpu
compile time instruction count for iteration 0 is 23966942581
compile time instruction count for iteration 1 is 23771950919
compile time instruction count for iteration 2 is 23770784286
compile time instruction count for iteration 3 is 23780160875
compile time instruction count for iteration 4 is 23774634465
collecting compile time instruction count for add_loop_inductor_dynamic_gpu
compile time instruction count for iteration 0 is 41505055086
compile time instruction count for iteration 1 is 41293654089
compile time instruction count for iteration 2 is 41301016100
compile time instruction count for iteration 3 is 41306056207
compile time instruction count for iteration 4 is 41308171566
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135809
Approved by: https://github.com/ezyang, https://github.com/anijain2305
Previously we were accommodating `torch._dynamo.mark_dynamic()` for export's dynamic shapes. Here we clean things up and ignore it, requiring users to specify an export input for `dynamic_shapes`.
Note: there are 4 decorators relevant to export: `mark_dynamic, maybe_mark_dynamic, mark_static, mark_unbacked`. User calls that involve export have only been `mark_dynamic()`, and we use `maybe_mark_dynamic` under the hood for `Dim.AUTO`, but we could start using the others. One reason I decided not to warn and just silently ignore is that these decorators cause the tensors to carry dynamic info, and it'll be hard to tell whether the markers are from export or user calls when re-exporting with the same inputs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135536
Approved by: https://github.com/avikchaudhuri
Add a relative path to the search paths in the benchmark. This enables users to run `torchbench.py` inside the `pytorch/benchmark/dynamo` folder when the `torchbench` repo is cloned at the same level as `pytorch`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134871
Approved by: https://github.com/FindHao
Benchmarks several shapes of basic nn modules, in both eager and inductor.
```
collecting compile time instruction count for basic_modules_ListOfLinears_inductor
compile time instruction count for iteration 0 is 48602516013
compile time instruction count for iteration 1 is 20424350269
compile time instruction count for iteration 2 is 20440350455
compile time instruction count for iteration 3 is 20419269999
compile time instruction count for iteration 4 is 20430782200
compile time instruction count for iteration 5 is 20455049622
compile time instruction count for iteration 6 is 20157290712
compile time instruction count for iteration 7 is 20455324001
compile time instruction count for iteration 8 is 20450158317
compile time instruction count for iteration 9 is 20492987748
collecting compile time instruction count for basic_modules_ListOfLinears_eager
compile time instruction count for iteration 0 is 961328334
compile time instruction count for iteration 1 is 958887896
compile time instruction count for iteration 2 is 958792214
compile time instruction count for iteration 3 is 958375977
compile time instruction count for iteration 4 is 958568525
compile time instruction count for iteration 5 is 958152305
compile time instruction count for iteration 6 is 959322800
compile time instruction count for iteration 7 is 958332703
compile time instruction count for iteration 8 is 958092100
compile time instruction count for iteration 9 is 958095277
collecting compile time instruction count for basic_modules_ModuleForwardHasGraphBreak_inductor
compile time instruction count for iteration 0 is 3572145793
compile time instruction count for iteration 1 is 3503323973
compile time instruction count for iteration 2 is 3501962432
compile time instruction count for iteration 3 is 3501746084
compile time instruction count for iteration 4 is 3500687361
compile time instruction count for iteration 5 is 3822254676
compile time instruction count for iteration 6 is 3498356846
compile time instruction count for iteration 7 is 3499019157
compile time instruction count for iteration 8 is 3500780314
compile time instruction count for iteration 9 is 3500257458
collecting compile time instruction count for basic_modules_ModuleForwardHasGraphBreak_eager
compile time instruction count for iteration 0 is 1844838754
compile time instruction count for iteration 1 is 1843476862
compile time instruction count for iteration 2 is 1844761450
compile time instruction count for iteration 3 is 1845371742
compile time instruction count for iteration 4 is 1845159665
compile time instruction count for iteration 5 is 1845035802
compile time instruction count for iteration 6 is 1844895007
compile time instruction count for iteration 7 is 1844697922
compile time instruction count for iteration 8 is 1844780885
compile time instruction count for iteration 9 is 1844493990
collecting compile time instruction count for basic_modules_SequentialWithDuplicatedModule_inductor
compile time instruction count for iteration 0 is 1597839479
compile time instruction count for iteration 1 is 1348225351
compile time instruction count for iteration 2 is 1347340818
compile time instruction count for iteration 3 is 1348170800
compile time instruction count for iteration 4 is 1348637747
compile time instruction count for iteration 5 is 1678366444
compile time instruction count for iteration 6 is 1348412420
compile time instruction count for iteration 7 is 1348461578
compile time instruction count for iteration 8 is 1347420149
compile time instruction count for iteration 9 is 1349748195
collecting compile time instruction count for basic_modules_SequentialWithDuplicatedModule_eager
compile time instruction count for iteration 0 is 137721777
compile time instruction count for iteration 1 is 139065517
compile time instruction count for iteration 2 is 137130552
compile time instruction count for iteration 3 is 137506030
compile time instruction count for iteration 4 is 137089838
compile time instruction count for iteration 5 is 137477395
compile time instruction count for iteration 6 is 138550452
compile time instruction count for iteration 7 is 137568409
compile time instruction count for iteration 8 is 136968468
compile time instruction count for iteration 9 is 137481664
collecting compile time instruction count for basic_modules_ModuleComparison_inductor
compile time instruction count for iteration 0 is 917209684
compile time instruction count for iteration 1 is 899154426
compile time instruction count for iteration 2 is 898145079
compile time instruction count for iteration 3 is 899817018
compile time instruction count for iteration 4 is 899184687
compile time instruction count for iteration 5 is 898172885
compile time instruction count for iteration 6 is 899958951
compile time instruction count for iteration 7 is 899348186
compile time instruction count for iteration 8 is 897745404
compile time instruction count for iteration 9 is 899581123
collecting compile time instruction count for basic_modules_ModuleComparison_eager
compile time instruction count for iteration 0 is 113165302
compile time instruction count for iteration 1 is 112724376
compile time instruction count for iteration 2 is 112774611
compile time instruction count for iteration 3 is 114465211
compile time instruction count for iteration 4 is 112689572
compile time instruction count for iteration 5 is 112726465
compile time instruction count for iteration 6 is 112853691
compile time instruction count for iteration 7 is 112295238
compile time instruction count for iteration 8 is 114022136
compile time instruction count for iteration 9 is 112664932
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134658
Approved by: https://github.com/anijain2305
ghstack dependencies: #133834, #134635, #134649, #134652
This benchmark measures the cost of compiling the following function in eager and inductor.
It's basically two benchmarks.
```
@torch.compile(backend=self.backend, fullgraph=True)
def f(a, b):
    result = a.clone()
    for i in range(1000):
        if i % 3 == 0:
            result = result + b
        elif i % 3 == 1:
            result = result + 8 * b
        else:
            result = result.sin()
    return result
```
PYTHONPATH=$(pwd) python benchmarks/add_loop.py out
```
collecting compile time instruction count for add_loop_eager
compile time instruction count for iteration 0 is 8286649663
compile time instruction count for iteration 1 is 2838971338
compile time instruction count for iteration 2 is 2834263023
compile time instruction count for iteration 3 is 2829447493
compile time instruction count for iteration 4 is 2830904231
compile time instruction count for iteration 5 is 2830281077
compile time instruction count for iteration 6 is 2831466595
compile time instruction count for iteration 7 is 2830732164
compile time instruction count for iteration 8 is 2831088056
compile time instruction count for iteration 9 is 2831204407
collecting compile time instruction count for add_loop_inductor
compile time instruction count for iteration 0 is 32585687849
compile time instruction count for iteration 1 is 11747553436
compile time instruction count for iteration 2 is 11746959875
compile time instruction count for iteration 3 is 11749479461
compile time instruction count for iteration 4 is 11750053711
compile time instruction count for iteration 5 is 11750793958
compile time instruction count for iteration 6 is 11751673576
compile time instruction count for iteration 7 is 11754552912
compile time instruction count for iteration 8 is 11753723127
compile time instruction count for iteration 9 is 11759059942
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134652
Approved by: https://github.com/anijain2305
ghstack dependencies: #133834, #134635, #134649
```
compile time instruction count for iteration 1 is 10732129038
compile time instruction count for iteration 2 is 10719776783
compile time instruction count for iteration 3 is 10729546868
compile time instruction count for iteration 4 is 10737655132
compile time instruction count for iteration 5 is 10732564252
compile time instruction count for iteration 6 is 10728721234
compile time instruction count for iteration 7 is 10733354271
compile time instruction count for iteration 8 is 10719588972
compile time instruction count for iteration 9 is 10706311856
```
1. Add torch.manual_seed(0); inputs were not the same across iterations.
2. Disable gc.
3. Remove the loop (not needed since compilation happens only once).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134649
Approved by: https://github.com/aorenste
ghstack dependencies: #133834, #134635
Summary: benchmarks/dynamo/ci_expected_accuracy/update_expected.py expects a benchmark run config to be named {config}_{benchmark}, and CPU tests should follow the same naming convention.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134639
Approved by: https://github.com/huydhn
PYTHONPATH=$(pwd) python benchmarks/update_hint_benchmark.py out
As of this diff, compile_time_instruction_count counts the number of instructions executed within
convert_frame.compile_inner.
```
update_hint_regression,compile_time_instruction_count,10522459165
```
will add result from CI once populated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133834
Approved by: https://github.com/aorenste
Summary: Fixes https://github.com/pytorch/pytorch/issues/134133
Test Plan:
Tested on the small repro in the linked issue with different lengths N (replacing 100), recording N vs. time taken in nanoseconds:
| N | time (ns) |
| --- | --- |
| 10 | 127268319 |
| 20 | 220839662 |
| 30 | 325463125 |
| 40 | 429259441 |
| 50 | 553136055 |
| 60 | 670799769 |
| 70 | 999170514 |
| 80 | 899014103 |
| 90 | 997168902 |
| 100 | 1168202035 |
| 110 | 1388556619 |
| 120 | 1457488235 |
| 130 | 1609816470 |
| 140 | 2177889877 |
| 150 | 1917560313 |
| 160 | 2121096113 |
| 170 | 2428502334 |
| 180 | 4117450755 |
| 190 | 4003068224 |
So N ~ 200 takes ~5s. Previously even smaller N would go for >1 min.
Didn't add a perf test because ezyang is planning to build a benchmark.
Also tested on https://www.internalfb.com/diff/D61560171, which now gets past the stuck point.
Differential Revision: D61619660
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134150
Approved by: https://github.com/ezyang
This PR only adds the execution of the benchmarks on this PR and prints results; following diffs will add checking out head~1, running it, and comparing.
To access the results, go to the pr_time_benchmarks test and inspect the logs.
You should see:
```
+ echo 'benchmark results on current PR: '
benchmark results on current PR:
+ cat /var/lib/jenkins/workspace/test/test-reports/pr_time_benchmarks_before.txt
update_hint_regression,instruction_count,27971461254
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131475
Approved by: https://github.com/ezyang
Move benchmarking out of `torch._inductor.runtime.runtime_utils` and into `torch._inductor.runtime.benchmarking`, and prefer this path over directly accessing Triton's benchmarking.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132827
Approved by: https://github.com/eellison
Summary:
By default, performance tests (speedup experiments) will run the baseline and test backend alternately.
However, this does not work for the torchao backend, which will change the model in-place, therefore the baseline run will also run with torchao backend since the model has already been quantized.
Add a new experiment "latency_experiment" to run performance tests non-alternately (first run baseline for a few iterations, then run the test backend).
Other changes:
- We need to add torch.compiler.cudagraph_mark_step_begin() to avoid the slowdown from "Unable to hit fast path of CUDAGraphs because of pending, uninvoked backwards".
- Also updated the torchao APIs to the current versions.
X-link: https://github.com/pytorch/benchmark/pull/2394
Test Plan:
python run_benchmark.py torchao --only AlbertForMaskedLM --quantization noquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune
python run_benchmark.py torchao --only BartForCausalLM --quantization noquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune
python run_benchmark.py torchao --only timm_efficientnet --quantization noquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune
(should all be ~1.0)
0.997x
1.006x
0.994x
Reviewed By: xuzhao9
Differential Revision: D60252821
Pulled By: HDCharles
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131935
Approved by: https://github.com/xuzhao9
Summary: CPU CI nodes failed to find valid VecISA because importing torch under the default pytorch directory will fail with the following msg, so switch cwd to a tmp directory.
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/var/lib/jenkins/workspace/torch/__init__.py", line 66, in <module>
from torch.torch_version import __version__ as __version__
File "/var/lib/jenkins/workspace/torch/torch_version.py", line 4, in <module>
from torch.version import __version__ as internal_version
ModuleNotFoundError: No module named 'torch.version'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131812
Approved by: https://github.com/eellison, https://github.com/malfet
Summary:
X-link: https://github.com/pytorch/benchmark/pull/2388
We can enable accuracy checks at Diff time since it is not a performance metric.
* Refactor the existing diff time test to use the new PT2 Benchmark Runner.
* Deprecate the speedup tests and enable the accuracy tests only. We rely on ServiceLab to perform performance testing and regression detection.
Test Plan:
Sandcastle CI
Or buck test command:
```
buck2 test 'fbcode//mode/opt' fbcode//pytorch/benchmark/fb/test_gpu:run_test_gpu -- test_training_resnet50_accuracy
```
Test UI: https://www.internalfb.com/intern/testinfra/testrun/1688850102375429
Reviewed By: oulgen
Differential Revision: D59825601
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131266
Approved by: https://github.com/oulgen
Summary:
By default, performance tests (speedup experiments) will run the baseline and test backend alternately.
However, this does not work for the torchao backend, which will change the model in-place, therefore the baseline run will also run with torchao backend since the model has already been quantized.
Add a new experiment "latency_experiment" to run performance tests non-alternately (first run baseline for a few iterations, then run the test backend).
Test Plan:
```
buck2 run mode/opt //pytorch/benchmark:pt2 -- --only AlbertForMaskedLM --quantization noquant --performance --inference --bfloat16
```
```
buck2 run mode/opt //pytorch/benchmark:pt2 -- --only AlbertForMaskedLM --quantization autoquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune
```
Differential Revision: D59332736
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130136
Approved by: https://github.com/jerryzh168
Extend constant folding to dynamic shape nodes; only pointwise ops and some restricted ops are supported.
We support dynamic shapes by limiting constant folding of ops that are guaranteed to have uniform values (full, pointwise ops, and views) and running these operators with tensors of shape 1. This also eliminates the possibility of memory overhead of constant folding.
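A hedged sketch of the shape-1 evaluation trick (helper names hypothetical):
```python
import torch

def fold_uniform_chain(chain, fill_value: float) -> float:
    # Ops whose outputs are guaranteed uniform (full, pointwise ops, views)
    # can be folded by evaluating them on a shape-(1,) stand-in, so the
    # dynamic shape never materializes and no large constant is stored.
    t = torch.full((1,), fill_value)
    for op in chain:
        t = op(t)
    return t.item()  # the single uniform value

val = fold_uniform_chain([lambda t: t * 2, torch.sin], 0.5)  # sin(1.0)
```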
Taken over from https://github.com/pytorch/pytorch/pull/128937
joint work with @imzhuhl
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129686
Approved by: https://github.com/Chillee
ghstack dependencies: #130367
This model's accuracy test recently regressed. I have a quite smooth debugging process to figure out the cause. So I'd like to write it down just in case it can be helpful.
Clicking the model name beit_base_patch16_224 on the dashboard, we are able to see the pass status of the model in e.g. the past month. For this model, we can see that it starts to fail on June 08:
<img width="1118" alt="Screenshot 2024-07-02 at 5 36 35 PM" src="https://github.com/pytorch/pytorch/assets/52589240/32f27ccd-3ec7-4431-88b3-febeff831f8e">
What's nice is the dashboard shows the nightly commits for each run.
Running
```
git log --oneline a448b3ae9537c0ae233fb9199a4a221fdffbb..0e6c204642a571d5a7cd60be0caeb9b50faca030 torch/_inductor/
```
Gives us the list of Inductor PRs between the good and bad commit: https://gist.github.com/shunting314/eb57965688fc9e1746fcfa9b7b6b02df
Roughly looking through the PRs, I felt
```
ffc202a1b9 Added remove_noop_ops to joint_graph_passes (#124451)
```
could change numerics, so I disabled it locally with this one-line change: https://gist.github.com/shunting314/13aec768bda986056d0fb40dce53396e . And then the accuracy test passes. (Command: time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only beit_base_patch16_224 )
Horace's PR (https://github.com/pytorch/pytorch/pull/124451) itself is valid. It removes no-op ops in the joint graph. I think maybe the graph gets changed, causing the partitioner to make different recomputation decisions. That can cause some numerics change.
Since this is not a real issue, I'll raise the tolerance to make it pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130005
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #129996, #129941
This PR batches fixes for a few accuracy failure issues during training by raising tolerances. I do that only for models that I think fail not due to a real issue.
## sebotnet33ts_256
The accuracy test for this model started to fail around June 05 [link](https://hud.pytorch.org/benchmark/timm_models/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Sun%2C%2002%20Jun%202024%2007%3A19%3A38%20GMT&stopTime=Tue%2C%2002%20Jul%202024%2007%3A19%3A38%20GMT&granularity=day&mode=training&dtype=amp&lBranch=main&lCommit=04a0d856207d83c2031e4b9cb6825ba3e0092850&rBranch=main&rCommit=e62925930f6a62f6aeeb1fe1a661a9bd3352b53d&model=sebotnet33ts_256).
I can not repro locally, but from the log from the dashboard:
```
RMSE (res-fp64): 0.09441, (ref-fp64): 0.02971 and shape=torch.Size([1536]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
```
raising the tolerance should fix it.
## DebertaForQuestionAnswering
This model fails accuracy test on the dashboard only in max-autotune mode. I can not repro locally by command:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/huggingface.py --accuracy --no-translation-validation --training --amp --backend inductor --device cuda --only DebertaForQuestionAnswering
```
From error message on the dashboard:
```
RMSE (res-fp64): 0.01803, (ref-fp64): 0.00537 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
```
0.02 tolerance should suppress this error.
## gluon_inception_v3
This model fails on the dashboard in max-autotune mode. I cannot repro locally by command
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only gluon_inception_v3
```
From error message on the dashboard
```
RMSE (res-fp64): 0.02798, (ref-fp64): 0.00730 and shape=torch.Size([384]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
Accuracy failed for key name Mixed_7c.branch3x3dbl_3a.bn.running_var
```
Raising the tolerance should suppress this error.
## mobilenetv3_large_100
Fails in max-autotune mode. I cannot repro locally with this command:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only
```
The error message on the dashboard is
```
RMSE (res-fp64): 0.29754, (ref-fp64): 0.05205 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
```
The tensor is so small that the noise can be high. I use a larger multiplier for smaller tensors in `torch._dynamo.utils.same`.
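Note that `torch.Size([])` in the log above is a scalar, where a single noisy element is the entire RMSE. A sketch of the size-aware adjustment (the threshold and scale are illustrative, not the committed values):
```
import torch

# Give small tensors a larger error multiplier: with few elements, one
# noisy entry dominates the RMSE. Threshold and scale are illustrative.
def multiplier_for(fp64_ref: torch.Tensor, base: float = 3.0) -> float:
    return base * 2.0 if fp64_ref.numel() < 10 else base
```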
## yolov3
Fails on the dashboard with error:
```
Error on the dashboard: RMSE (res-fp64): 0.01278, (ref-fp64): 0.00246 and shape=torch.Size([256]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```
Fix it by using a larger multiplier for smaller tensors and raising the tolerance.
## timm_efficientdet
Fails on the dashboard with error:
```
E0623 18:37:43.638000 139924418725056 torch/_dynamo/utils.py:1468] RMSE (res-fp64): 0.00096, (ref-fp64): 0.00009 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```
But I cannot repro locally with this command:
```
time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --only timm_efficientdet --training
```
Raising the tolerance should fix it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129941
Approved by: https://github.com/jansel
ghstack dependencies: #129996
This moves a bunch of the runtime inspection of `output_info` for alias handling into the construction of fixed output handlers, which are created during compilation and captured by the runtime wrapper.
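Roughly, the pattern is to do all per-output branching once at compile time and leave the hot path a flat list of callables. A minimal sketch with hypothetical names (`OutputInfo`, `make_handler`), not the real AOTAutograd structures:
```
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class OutputInfo:  # hypothetical stand-in for the real output_info entries
    base_input_idx: Optional[int] = None  # set when the output aliases an input

def make_handler(info: OutputInfo) -> Callable:
    # All branching on output_info happens here, once, at compile time.
    if info.base_input_idx is not None:
        idx = info.base_input_idx
        return lambda args, out: args[idx]  # regenerate the alias from its base
    return lambda args, out: out  # ordinary output: pass through unchanged

def make_runtime_wrapper(compiled_fn: Callable, output_infos: List[OutputInfo]) -> Callable:
    handlers = [make_handler(info) for info in output_infos]  # fixed at compile time
    def wrapper(*args):
        outs = compiled_fn(*args)
        # Hot path: apply the precomputed handlers, no per-call inspection.
        return [h(args, o) for h, o in zip(handlers, outs)]
    return wrapper
```
The payoff is that the wrapper's hot path no longer re-derives aliasing decisions on every call.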
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128188
Approved by: https://github.com/bdhirsh
Fixes #128510.
https://github.com/pytorch/pytorch/pull/124451 makes LayoutLMForSequenceClassification hit SDPA pattern 1, which then triggers the accuracy issue. The issue only happens with BF16 inference on a single thread. This PR increases the model's tolerance to make the check pass. Note that even the math-version SDPA could show the issue because of small implementation differences.
The test log:
Single thread
```
correct_result: SequenceClassifierOutput(loss=tensor(0.5998), logits=tensor([[0.3301, 0.1338]], dtype=torch.bfloat16), hidden_states=None, attentions=None)
new_result: SequenceClassifierOutput(loss=tensor(0.6016), logits=tensor([[0.3281, 0.1357]], dtype=torch.bfloat16), hidden_states=None, attentions=None)
E0627 01:09:16.762789 140281313759104 torch/_dynamo/utils.py:1476] RMSE (res-fp64): 0.00151, (ref-fp64): 0.00046 and shape=torch.Size([1, 2]). res.dtype: torch.bfloat16, multiplier: 3.000000, tol: 0.001000
E0627 01:09:16.762972 140281313759104 torch/_dynamo/utils.py:1390] Accuracy failed for key name logits
fail_accuracy
```
Multiple threads
```
correct_result: SequenceClassifierOutput(loss=tensor(0.6007), logits=tensor([[0.3301, 0.1357]], dtype=torch.bfloat16), hidden_states=None, attentions=None)
new_result: SequenceClassifierOutput(loss=tensor(0.6016), logits=tensor([[0.3281, 0.1357]], dtype=torch.bfloat16), hidden_states=None, attentions=None)
pass
```
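As an aside, the backend-to-backend difference mentioned above is easy to observe directly. A sketch (the delta depends on dtype, shapes, and which kernels a given build dispatches to):
```
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

q = torch.randn(1, 8, 128, 64, dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)

with sdpa_kernel(SDPBackend.MATH):
    out_math = F.scaled_dot_product_attention(q, k, v)
out_default = F.scaled_dot_product_attention(q, k, v)

# Nonzero whenever the default dispatch picks a different kernel than
# the math backend; zero if both paths run the same implementation.
print((out_math - out_default).abs().max().item())
```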
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129728
Approved by: https://github.com/jgong5, https://github.com/jansel
**Performance mode issue**: when a dynamo benchmark's performance warm-up fails, the result is not written into the CSV file, but the accuracy result is still written as `fail_to_run` even when the dynamo pass failed. As a result, the number of models in the accuracy CSV does not match the number in the performance CSV.

- **Fix**: models that fail warm-up are now recorded in the CSV file, as sketched below:
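A sketch of the idea behind the fix (the column names are illustrative; the real harness writes more fields):
```
import csv

# When warm-up fails, still append a row so the performance CSV stays
# row-aligned with the accuracy CSV instead of silently dropping models.
def record_warmup_failure(csv_path: str, model_name: str) -> None:
    with open(csv_path, "a", newline="") as f:
        csv.writer(f).writerow([model_name, "fail_to_run", 0.0])
```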

**Accuracy mode issue**: the `detectron2_fasterrcnn_r` models failed in accuracy mode but ran successfully in performance mode. The accuracy failure is the same as in PR ee557d8f61.
```
Dynamic Shape:
Traceback (most recent call last):
File "benchmarks/dynamo/torchbench.py", line 449, in <module>
torchbench_main()
File "benchmarks/dynamo/torchbench.py", line 445, in torchbench_main
main(TorchBenchmarkRunner(), original_dir)
File "/workspace/pytorch/benchmarks/dynamo/common.py", line 3650, in main
process_entry(0, runner, original_dir, args)
File "/workspace/pytorch/benchmarks/dynamo/common.py", line 3582, in process_entry
return run(runner, args, original_dir)
File "/workspace/pytorch/benchmarks/dynamo/common.py", line 4163, in run
assert marked, f"nothing in example_inputs had a dim with {batch_size}"
AssertionError: nothing in example_inputs had a dim with 4
```

- **Fix**: as in PR ee557d8f61, we skip forcing batch_size to 4 when testing dynamic shapes.
The dynamic shapes pass rate improved from 89% to **95%**:
| Comp Item | Compiler | Suite | Before | After fix |
|-----------|----------|------------|------------|------------|
| Pass Rate | Inductor | torchbench | 89%, 73/82 | 95%, 79/83 |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126764
Approved by: https://github.com/jansel