Commit Graph

1093 Commits

Author SHA1 Message Date
Aaron Orenstein
07669ed960 PEP585 update - benchmarks tools torchgen (#145101)
This is one of a series of PRs to update us to PEP585 (changing Dict -> dict, List -> list, etc).  Most of the PRs were completely automated with RUFF as follows:

Since RUFF UP006 is considered an "unsafe" fix first we need to enable unsafe fixes:

```
--- a/tools/linter/adapters/ruff_linter.py
+++ b/tools/linter/adapters/ruff_linter.py
@@ -313,6 +313,7 @@
                     "ruff",
                     "check",
                     "--fix-only",
+                    "--unsafe-fixes",
                     "--exit-zero",
                     *([f"--config={config}"] if config else []),
                     "--stdin-filename",
```

Then we need to tell RUFF to allow UP006 (as a final PR once all of these have landed this will be made permanent):

```
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -40,7 +40,7 @@

 [tool.ruff]
-target-version = "py38"
+target-version = "py39"
 line-length = 88
 src = ["caffe2", "torch", "torchgen", "functorch", "test"]

@@ -87,7 +87,6 @@
     "SIM116", # Disable Use a dictionary instead of consecutive `if` statements
     "SIM117",
     "SIM118",
-    "UP006", # keep-runtime-typing
     "UP007", # keep-runtime-typing
 ]
 select = [
```

Finally running `lintrunner -a --take RUFF` will fix up the deprecated uses.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145101
Approved by: https://github.com/bobrenjc93
2025-01-18 05:05:07 +00:00
Nicolas Macchioni
2f51d06210 basic InductorBenchmarker (#133058)
This PR adds the most basic custom benchmarker (i.e. a benchmarker that is not provided by Triton), which we call `InductorBenchmarker`. This new benchmarker is very basic in principal, and very closely follows Triton's `do_bench` implementation with slight changes such as flushing the exact L2 cache size (Triton defaults to 256mb), using a buffer zero for warmup (Triton uses the benchmarked kernel itself, I found that buffer zeroes are more consistent),  and returning the min runtime (Triton can return min, among other things, currently Inductor picks median).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133058
Approved by: https://github.com/eellison
ghstack dependencies: #144315
2025-01-18 02:35:00 +00:00
Huy Do
8e4539245e Update ci_expected_accuracy for TIMM levit_128 for further investigation (#145112)
TSIA, it looks like an upstream change, but I'm not sure from where yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145112
Approved by: https://github.com/izaitsevfb, https://github.com/malfet
2025-01-18 01:55:34 +00:00
Laith Sakka
96c0dbbe97 Enhance running pr time benchmarks locally experience. (#144838)
Summary: title

Test Plan: NA

Differential Revision: D68195894

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144838
Approved by: https://github.com/huydhn
2025-01-17 07:57:40 +00:00
Laith Sakka
62ce3e6e84 refresh benchmarks results after recent recent regressions (#143075)
refresh data after !5 regression by https://github.com/pytorch/pytorch/pull/144319

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143075
Approved by: https://github.com/bobrenjc93, https://github.com/huydhn
2025-01-15 09:11:57 +00:00
Bin Bao
2683691237 [AOTI] Add a boxed_run API (#142213)
Summary: Fixes https://github.com/pytorch/pytorch/issues/141696. Add a new C++ runner API (boxed_run) following dynamo's boxed calling convention, which steals tensors' ownership from the input tensor list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142213
Approved by: https://github.com/ezyang
2025-01-14 18:47:42 +00:00
PyTorch MergeBot
4f74864c94 Revert "[AOTI] Add a boxed_run API (#142213)"
This reverts commit 868984c3e3.

Reverted https://github.com/pytorch/pytorch/pull/142213 on behalf of https://github.com/kit1980 due to breaking lots of internal builds, see D68036023 ([comment](https://github.com/pytorch/pytorch/pull/142213#issuecomment-2588378262))
2025-01-13 22:43:47 +00:00
Huy Do
396630ed78 Update the accuracy results for moco and llama (#144523)
This has been failing in trunk for sometimes, let's just update the accuracy results first.  The command I run `python benchmarks/dynamo/ci_expected_accuracy/update_expected.py 127f836881e75e0c688619b54a35b018a69d7ee7`.  I also fix the update script a bit to make it working after https://github.com/pytorch/pytorch/pull/139337

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144523
Approved by: https://github.com/kit1980, https://github.com/Skylion007
2025-01-10 19:40:49 +00:00
Bin Bao
868984c3e3 [AOTI] Add a boxed_run API (#142213)
Summary: Fixes https://github.com/pytorch/pytorch/issues/141696. Add a new C++ runner API (boxed_run) following dynamo's boxed calling convention, which steals tensors' ownership from the input tensor list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142213
Approved by: https://github.com/ezyang
2025-01-10 18:27:00 +00:00
Guilherme Leobas
6bc17b0725 Update #graph breaks for moco benchmark (#144266)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144266
Approved by: https://github.com/zou3519
2025-01-09 18:51:13 +00:00
Xuehai Pan
dcc3cf7066 [BE] fix ruff rule E226: add missing whitespace around operator in f-strings (#144415)
The fixes are generated by:

```bash
ruff check --fix --preview --unsafe-fixes --select=E226 .
lintrunner -a --take "RUFF,PYFMT" --all-files
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144415
Approved by: https://github.com/huydhn, https://github.com/Skylion007
2025-01-08 21:55:00 +00:00
Animesh Jain
2ac41404a8 [dynamo][dicts] Guarding lazily on dict keys (#143997)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143997
Approved by: https://github.com/jansel
2025-01-08 03:56:33 +00:00
bobrenjc93
fcf9dc3b11 Migrate from Tuple -> tuple in benchmarks (#144259)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144259
Approved by: https://github.com/yanboliang
2025-01-07 04:09:52 +00:00
Laith Sakka
5ccbfffd11 update expected results (#144274)
this PR f6488d85a0 made it +1.3% < 1.5%.
once we have the API from dev infra and change the test this wont be happening.

<img width="364" alt="Screenshot 2025-01-06 at 11 01 15 AM" src="https://github.com/user-attachments/assets/401b2d11-e400-49d6-b6f9-8e10ca141cb0" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144274
Approved by: https://github.com/oulgen, https://github.com/anijain2305
2025-01-06 23:18:21 +00:00
Guilherme Leobas
4c8d661348 Set enable_trace_contextlib_contextmanager flag to True (#140604)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140604
Approved by: https://github.com/zou3519
ghstack dependencies: #136033
2025-01-06 16:56:22 +00:00
PyTorch MergeBot
b01556bd8a Revert "[dynamo][dicts] Guarding lazily on dict keys (#143997)"
This reverts commit f5df082fab.

Reverted https://github.com/pytorch/pytorch/pull/143997 on behalf of https://github.com/jeanschmidt due to Seems to have introduced internal ci redness in some tests, D67828366 ([comment](https://github.com/pytorch/pytorch/pull/143997#issuecomment-2571587599))
2025-01-05 11:09:45 +00:00
Animesh Jain
f5df082fab [dynamo][dicts] Guarding lazily on dict keys (#143997)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143997
Approved by: https://github.com/jansel
ghstack dependencies: #144129, #144130, #144141, #144158, #144163, #144160
2025-01-04 18:13:00 +00:00
PyTorch MergeBot
8d63a4a409 Revert "Set enable_trace_contextlib_contextmanager flag to True (#140604)"
This reverts commit 1c817fe671.

Reverted https://github.com/pytorch/pytorch/pull/140604 on behalf of https://github.com/guilhermeleobas due to breaking one of the benchmarks (moco) ([comment](https://github.com/pytorch/pytorch/pull/140604#issuecomment-2569640837))
2025-01-03 18:23:53 +00:00
Xuehai Pan
b6bdb67f82 [BE][Easy] use pathlib.Path instead of dirname / ".." / pardir (#129374)
Changes by apply order:

1. Replace all `".."` and `os.pardir` usage with `os.path.dirname(...)`.
2. Replace nested `os.path.dirname(os.path.dirname(...))` call with `str(Path(...).parent.parent)`.
3. Reorder `.absolute()` ~/ `.resolve()`~ and `.parent`: always resolve the path first.

    `.parent{...}.absolute()` -> `.absolute().parent{...}`

4. Replace chained `.parent x N` with `.parents[${N - 1}]`: the code is easier to read (see 5.)

    `.parent.parent.parent.parent` -> `.parents[3]`

5. ~Replace `.parents[${N - 1}]` with `.parents[${N} - 1]`: the code is easier to read and does not introduce any runtime overhead.~

    ~`.parents[3]` -> `.parents[4 - 1]`~

6. ~Replace `.parents[2 - 1]` with `.parent.parent`: because the code is shorter and easier to read.~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129374
Approved by: https://github.com/justinchuby, https://github.com/malfet
2024-12-29 17:23:13 +00:00
Animesh Jain
c3c27aef34 [dynamo] Remove HFPretrained config hack (#143698)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143698
Approved by: https://github.com/williamwen42, https://github.com/jansel
ghstack dependencies: #143888
2024-12-28 02:03:13 +00:00
Aaron Orenstein
3df12d38cf dynamo tracing perf: cache cleaned_instructions: 33.7 -> 30.0 (#143070)
See #143056 for overall docs.

This PR: Cache the interesting/expensive bits of `cleaned_instructions()`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143070
Approved by: https://github.com/jansel
2024-12-26 19:02:08 +00:00
PyTorch MergeBot
475656fd9c Revert "[BE][Easy] use pathlib.Path instead of dirname / ".." / pardir (#129374)"
This reverts commit 2293fe1024.

Reverted https://github.com/pytorch/pytorch/pull/129374 on behalf of https://github.com/malfet due to failing internal ROCM builds with error: ModuleNotFoundError: No module named hipify ([comment](https://github.com/pytorch/pytorch/pull/129374#issuecomment-2562973920))
2024-12-26 17:32:23 +00:00
Oguz Ulgen
dc55704b48 Rename cache limit to recompile limit in configs (#143709)
This PR renames every cache_limit to recompile_limit via sed.

Old config options are maintained via Config(alias='xyz')

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143709
Approved by: https://github.com/jansel
2024-12-22 10:03:57 +00:00
Aaron Orenstein
f2b744b9ca dynamo tracing perf: import_module: 59.92 -> 52.9 (#143057)
See #143056 for overall docs.

This PR: Using `importlib.import_module()` within the hot path of
symbolic_convert is slow. Memoize it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143057
Approved by: https://github.com/jansel
2024-12-22 06:38:38 +00:00
Xuehai Pan
2293fe1024 [BE][Easy] use pathlib.Path instead of dirname / ".." / pardir (#129374)
Changes by apply order:

1. Replace all `".."` and `os.pardir` usage with `os.path.dirname(...)`.
2. Replace nested `os.path.dirname(os.path.dirname(...))` call with `str(Path(...).parent.parent)`.
3. Reorder `.absolute()` ~/ `.resolve()`~ and `.parent`: always resolve the path first.

    `.parent{...}.absolute()` -> `.absolute().parent{...}`

4. Replace chained `.parent x N` with `.parents[${N - 1}]`: the code is easier to read (see 5.)

    `.parent.parent.parent.parent` -> `.parents[3]`

5. ~Replace `.parents[${N - 1}]` with `.parents[${N} - 1]`: the code is easier to read and does not introduce any runtime overhead.~

    ~`.parents[3]` -> `.parents[4 - 1]`~

6. ~Replace `.parents[2 - 1]` with `.parent.parent`: because the code is shorter and easier to read.~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129374
Approved by: https://github.com/justinchuby, https://github.com/malfet
2024-12-21 22:08:01 +00:00
Yanan Cao (PyTorch)
0666347fc4 [Codemod][AddExplicitStrictExportArg] caffe2/benchmarks/dynamo (#143686)
Reviewed By: avikchaudhuri

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143686
Approved by: https://github.com/tugsbayasgalan
2024-12-21 19:56:56 +00:00
Guilherme Leobas
1c817fe671 Set enable_trace_contextlib_contextmanager flag to True (#140604)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140604
Approved by: https://github.com/zou3519
ghstack dependencies: #136033
2024-12-20 12:02:27 +00:00
Guilherme Leobas
673cc88fd6 Add support for contextmanager in Dynamo (#136033)
Fixes #130559

* Intro

This PR adds support for `@contextmanager` in Dynamo. We chose to limit the
scope of this work to only `@contextmanager` and plan to handle generators fully
in #141055 (still in draft).

* Motivation

Dynamo lacks support for generator functions. When it encounters one, it traces
it as if it were a regular function. This is problematic because it can lead to
incorrect behavior. To illustrate, consider the test case below:

```python
import torch
import contextlib

@contextlib.contextmanager
def set_default_dtype(dtype):
    old_dtype = torch.get_default_dtype()
    try:
        torch.set_default_dtype(dtype)
        yield
    finally:
        torch.set_default_dtype(old_dtype)

@torch.compile(backend="eager", fullgraph=True)
def fn():
    with set_default_dtype(torch.float64):
        x = torch.tensor([3.0, 3.0 + 5.0j])
    return x
```

Before this work, Dynamo would not stop at the `yield`, and the graph produced
would contain both calls to `set_default_dtype` executed one after the other.
This is incorrect because the context manager should execute code before and
after the `yield`.

* List of changes

`YIELD_VALUE` now raises an exception (`YieldValueOp`) to signal that control
flow must be suspended and returned to the caller. Additionally, `RETURN_VALUE`
behaves differently in a generator function. Unlike regular functions, where
`RETURN_VALUE` indicates the final result, in generators it signifies that the
generator is exhausted and implicitly raises `StopIteration`.

A new `VariableTracker` named `FunctionDecoratedByContextlibContextManagerVariable`
was introduced to handle `@contextmanager`. This variable tracker acts not just
as a wrapper for the original function but also maintains an internal `tx`
(InstructionTranslator) object to suspend and return control flow to the parent
tracer when a `yield` is encountered.

* Corner cases

Returning a context manager from a compiled function is not supported. This
would require PyTorch to synchronize the generator state between Dynamo and the
interpreter. Any attempt to return it will result in an `IncorrectUsage`
exception.

Graph breaks require special handling as well. In the event of a graph break,
the frame associated with the context manager is skipped, and the context
manager runs in eager mode.

* This PR is breaking my code

There is a configuration flag (`enable_trace_contextlib`) that can be set to
`False` to disable tracing context managers. If this still causes crashes,
please revert this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136033
Approved by: https://github.com/zou3519
2024-12-20 12:02:20 +00:00
Huy Do
fe0f20615c [DynamoBench] Handle accuracy results in benchmark records (#143611)
I discovered this issue when trying to search for the accuracy results on the database and couldn't find any.  It turns out that the results is there on the JSON file, for example `"metric": {"name": "accuracy", "benchmark_values": ["pass_due_to_skip"]}`, but inserting them into the database fails because benchmark values is a list of strings here while the expectation is that it's a list of numbers.

ClickHouse doesn't support mix types atm. It has a Variant type https://clickhouse.com/docs/en/sql-reference/data-types/variant, but this isn't recommended by CH team themselves.  So, the remaining option is to store this in the `extra_info` field.  This field is a dictionary, so it can goes there.

### Testing

https://github.com/pytorch/pytorch/actions/runs/12421747715

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143611
Approved by: https://github.com/kit1980
2024-12-20 06:43:38 +00:00
Laith Sakka
2a11472f46 update expected results (#143586)
update results based on small regression added by
17b71e5d6a

the max we was 1.25%. for sum_floor_div
<img width="842" alt="Screenshot 2024-12-19 at 9 04 30 AM" src="https://github.com/user-attachments/assets/6ce913cd-110d-4837-af59-08fb6a0dd12d" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143586
Approved by: https://github.com/bobrenjc93
2024-12-19 18:43:27 +00:00
William Wen
c04f0bb7b9 [dynamo] add benchmark for guard eval (#142430)
Benchmarks:
- 713.2us (3.10)
- 598.8us (3.12)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142430
Approved by: https://github.com/jansel
ghstack dependencies: #142117
2024-12-17 18:54:27 +00:00
Aaron Orenstein
63e1f97f4b dynamo tracing perf: don't unnecessarily call getframeinfo on the hot path: 47.26 -> 37.66 (#143066)
See #143056 for overall docs.

This PR: Stop using `getframeinfo()` when we only care about the function name
and throw the rest away.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143066
Approved by: https://github.com/jansel
2024-12-13 18:20:48 +00:00
Tom Ritchford
da67a6a7bb [inductor] Replace set by OrderedSet (#138466)
Uses the set_linter from https://github.com/pytorch/pytorch/pull/138454
and considerable manual editing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138466
Approved by: https://github.com/eellison
2024-12-13 16:08:45 +00:00
bobrenjc93
b5d8d2444a add README.md for compile time benchmarks (#143145)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143145
Approved by: https://github.com/laithsakka
ghstack dependencies: #141517, #143143
2024-12-13 05:12:26 +00:00
bobrenjc93
ceb664aca6 add float_args benchmark (#143143)
71% improvement with automatic dynamic float arguments

with specialize_float=False
```
float_args,compile_time_instruction_count,346293869
```

with specialize_float=True
```
float_args,compile_time_instruction_count,1198546486
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143143
Approved by: https://github.com/laithsakka
ghstack dependencies: #141517
2024-12-13 03:35:59 +00:00
James Wu
fbbafd0320 Turn on AOTAutogradCache by default on open source (#141981)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141981
Approved by: https://github.com/bdhirsh, https://github.com/oulgen
2024-12-12 04:21:11 +00:00
Tom Ritchford
498a7808ff Fix unused Python variables outside torch/ and test/ (#136359)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136359
Approved by: https://github.com/albanD
2024-12-11 17:10:23 +00:00
Edward Z. Yang
c29b4edbb9 Remove no-op aot_compilation_time (#142490)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142490
Approved by: https://github.com/xuzhao9
2024-12-11 10:37:25 +00:00
Edward Z. Yang
646024e823 Add convnext_base to higher tolerance (#142159)
See https://github.com/pytorch/pytorch/issues/141498 https://github.com/pytorch/pytorch/issues/141703

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142159
Approved by: https://github.com/bertmaher, https://github.com/huydhn
2024-12-06 04:00:13 +00:00
eellison
f83361b274 inductor dtype propagation fixes (#141495)
- Add in upcast_compute_type on creation of new tensors (loads, constants)
- Fixes index_expr - right now we are sort of inconsistent in dtype and dont always respect the dtype specified. would be nice to fix but not doing in this pr.
- bug fix in view dtype where we were always upcasting back to fp32 when input was in bf16/fp16. we should only be doing that if the output is also in bf16/fp16.
- for masked, avoid calling dtype propagation and just use output dtype.

Turns on the runtime dtype verification for opinfo tests. The separate test file is still useful because we can use it for testing turning off codegen_upcast_to_fp32.

Follow ups:

- We could consider requiring less explicit upcast_compute_types calls and do it automatically. That would potentially make things easier but be less flexible in the future. Maybe I should have done it this pr.
- Be more consistent on our index expr dtype printing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141495
Approved by: https://github.com/blaine-rister, https://github.com/arui-meta, https://github.com/ezyang
ghstack dependencies: #139945, #140057
2024-11-28 11:39:38 +00:00
James Wu
a7ca6a9113 Enable autograd cache on inductor tests (#140890)
This turns on AOTAutogradCache for all inductor tests. It clears AOTAutogradCache on each test as well, by virtue of the local cache using the same directory to store cache entries.

I've also tested with INDUCTOR_TEST_DISABLE_FRESH_CACHE=1, running all the tests. AOTAutogradCache successfully caches 99% of these. There are a few tests that use view_replay and therefore save functional tensors, which cause AOTAutogradCache to fail to pickle its result. Will look into next steps there, but for now, it seems okay if the cache just misses on those cases where it can't serialize the result. It would be better to check before pickling, though.

I've made the following small bugfixes to get this working:
- Inductor is sometimes used in a standalone mode without dynamo, which leads to attribute errors in check_can_cache. In general, we should *never* crash in cache checking, only bypass. So I change a try catch to check Exception instead of just a specific exception.
- Add extra structured logging for metadata on cache hits

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140890
Approved by: https://github.com/bdhirsh
2024-11-27 20:41:43 +00:00
eellison
fd553b9817 Add remaining method and tests for dtype propagation (#140057)
Adds the remaining unimplemented ops as well as an assertion failure if someone adds a new op without a dtype rule.

We test all unique pointwise operators registered as lowerings which have an opinfo. There will be some follow ups for this to work well with both `codegen_upcast_to_fp32` as True and False.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140057
Approved by: https://github.com/arui-meta, https://github.com/blaine-rister, https://github.com/ezyang
ghstack dependencies: #139945
2024-11-27 17:06:44 +00:00
eellison
566ceb3e7e Refactor dtype propagation (#139945)
A couple changes.

- Tries to reuse dtype propagation rules that were already registered in inductor. These were present both with `pointwise_overrides_data` and the `boolean_ops` list. Additionally, the registration of pointwise ops already specified dtype propagation rules. Saves those registrations and reuses them later.

- Factors out `get_promoted_dtype` which uses functools.lru_cache to take in non - CSEVariable args because those will not work with the functools cache.

Tests get added later in the stack when everything is implemented.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139945
Approved by: https://github.com/blaine-rister, https://github.com/arui-meta, https://github.com/ezyang
2024-11-27 16:57:02 +00:00
Tugsbayasgalan Manlaibaatar
87f9c1abe5 Change export IR to non-functional pre-dispatch IR (#139511)
Differential Revision: [D65362160](https://our.internmc.facebook.com/intern/diff/D65362160)

State after this IR:
1. For the tests that require inference IR, they are replaced with ep.run_decomp({}) so export_for_training_run_decomp is sort of redundant but i guess it is still nice that multiple round of retracing still working. In general, we need some auditing to reduce our redundant testing coverages.
2. After this PR landed and not get reverted for a week or so, i will replace the export_for_training calls with export as they are the same thing now.
3. Added more tests to also cover now "deprecated" old IR by patching export to use old export. For reviewers, please look at the internal version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139511
Approved by: https://github.com/ydwu4, https://github.com/angelayi, https://github.com/avikchaudhuri
2024-11-20 21:47:55 +00:00
Laith Sakka
caa3a3e12c Only compute new_untracked_symbols and new_unbacked_bindings if needed. (#140083)
Summary:
237s -> 198..
buck2 run fbcode//mode/opt fbcode//torchrec/distributed/tests:pt2_compile_benchmark -- --num-features=2000

Test Plan: NA

Differential Revision: D65638637

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140083
Approved by: https://github.com/ezyang, https://github.com/isuruf, https://github.com/anijain2305
2024-11-20 19:28:18 +00:00
Huy Do
b5db3cb61c Skip uploading benchmark records when there is no model name (#141145)
A small fix I just realize after https://github.com/pytorch/pytorch/pull/141087.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141145
Approved by: https://github.com/malfet
2024-11-20 19:05:47 +00:00
Huy Do
1a7055cb73 Record PR time benchmark results in JSON format (#140493)
I'm trying to make this benchmark results available on OSS benchmark database, so that people can query it from outside.  The first step is to also record the results in the JSON format compatible with the database schema defined in https://github.com/pytorch/test-infra/pull/5839.

Existing CSV files remain unchanged.

### Testing

The JSON results are uploaded as artifacts to S3 https://github.com/pytorch/pytorch/actions/runs/11809725848/job/32901411180#step:26:13, for example https://gha-artifacts.s3.amazonaws.com/pytorch/pytorch/11809725848/1/artifact/test-jsons-test-pr_time_benchmarks-1-1-linux.g4dn.metal.nvidia.gpu_32901411180.zip

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140493
Approved by: https://github.com/laithsakka
2024-11-20 18:54:01 +00:00
Huy Do
4acd56eb53 Upload MPS benchmark results (#141087)
This uploads the MPS benchmark results to benchmark database.  The data can then be queried, for example:

```
select benchmark, model, metric from oss_ci_benchmark_v3 where head_sha = '99a133116fee15aa1467165f2b209b37da53f189' and metric.name in ['eager_peak_mem', 'dynamo_peak_mem', 'speedup'] and model.name = 'BERT_pytorch'
```

I'm documenting the JSON format at https://github.com/pytorch/pytorch/wiki/How-to-integrate-with-PyTorch-OSS-benchmark-database

### Testing

Locally,

```
PYTHONPATH=/Users/huydo/Storage/mine/benchmark python benchmarks/dynamo/torchbench.py --performance --only resnet152 --backend eager --training --devices mps --output test/test-reports/torchbench_training.csv
```

Workflow dispatch https://github.com/pytorch/pytorch/actions/runs/11927990520

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141087
Approved by: https://github.com/malfet
2024-11-20 18:18:21 +00:00
Laith Sakka
8d708090c0 Optimize increment summations [Latest Nov 15] (#140822)
Summary:
**wins**
on torchrec benchmark, for 2K nodes it save 40seconds
with the recent sympy changes (https://www.internalfb.com/diff/D65883538) we save around 13 second ( with the max opt on).
```
buck2 run fbcode//mode/opt fbcode//torchrec/distributed/tests:pt2_compile_benchmark -- --num-features=200
```
This diff optimizes construction expressions of the form
a+b+c...  (all unique symbols).
which are very common in torchrec models.

**How**
Expressions of the form a+b+c are not optimized by add, the only needed optimization is sorting them.
If we have  a+b+c and we are adding (d) to it, we can do a binary search to know
the position of (d) and avoid optimizing the new expression by passing the new order.

**Extensions**:
1. support constant terms.
2. support 10a+10b+.. (this will give even more wins will extend the support in second PR)

Differential Revision: D66008482

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140822
Approved by: https://github.com/ezyang
2024-11-20 16:48:20 +00:00
PyTorch MergeBot
a4e8ca789a Revert "Record PR time benchmark results in JSON format (#140493)"
This reverts commit 783cd9c8dd.

Reverted https://github.com/pytorch/pytorch/pull/140493 on behalf of https://github.com/huydhn due to I think I missed something in the workflow setup as the test is failing in non-test CI jobs ([comment](https://github.com/pytorch/pytorch/pull/140493#issuecomment-2487360455))
2024-11-20 04:04:07 +00:00
angelayi
878a849c92 [aoti] Remove example inputs from aoti_compile_and_package (#140991)
Differential Revision: [D66136724](https://our.internmc.facebook.com/intern/diff/D66136724)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140991
Approved by: https://github.com/yushangdi, https://github.com/desertfire
ghstack dependencies: #140990
2024-11-20 02:49:47 +00:00
Huy Do
783cd9c8dd Record PR time benchmark results in JSON format (#140493)
I'm trying to make this benchmark results available on OSS benchmark database, so that people can query it from outside.  The first step is to also record the results in the JSON format compatible with the database schema defined in https://github.com/pytorch/test-infra/pull/5839.

Existing CSV files remain unchanged.

### Testing

The JSON results are uploaded as artifacts to S3 https://github.com/pytorch/pytorch/actions/runs/11809725848/job/32901411180#step:26:13, for example https://gha-artifacts.s3.amazonaws.com/pytorch/pytorch/11809725848/1/artifact/test-jsons-test-pr_time_benchmarks-1-1-linux.g4dn.metal.nvidia.gpu_32901411180.zip

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140493
Approved by: https://github.com/laithsakka
2024-11-20 01:48:00 +00:00
Catherine Lee
fc813df120 Benchmarks dynamo update script to use ClickHouse instead of Rockset (#140574)
Query works but the part where it parses the job name is broken

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140574
Approved by: https://github.com/huydhn
2024-11-15 22:17:35 +00:00
Laith Sakka
500ce29e4c Use has_free_unbacked_symbols instead of bool(free_unbacked_symbols) (#140027)
with 20K features saves 20 seconds.
257.021589517593-> 237.8304626941681
buck2 run @fbcode//mode/opt fbcode//torchrec/distributed/tests:pt2_compile_benchmark -- --num-features=2000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140027
Approved by: https://github.com/ezyang
2024-11-15 19:01:06 +00:00
Oguz Ulgen
65518fd9ef Turn on triton bundler in OSS (#140600)
Its been enabled internally, lets also push it out to OSS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140600
Approved by: https://github.com/masnesral
2024-11-14 20:02:15 +00:00
Laith Sakka
aaefa48441 reduce the threshold to change exisiting data suggestion to noise/3 (#140623)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140623
Approved by: https://github.com/bobrenjc93
2024-11-14 06:29:25 +00:00
Michael Lazos
ea0f60ecfa [Dynamo] allow dynamic callables on tensor variables (#137940)
Fixes https://github.com/pytorch/pytorch/issues/134844

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137940
Approved by: https://github.com/williamwen42
2024-11-08 23:49:34 +00:00
Laith Sakka
d1a45800a3 refresh numbers after accepted less than noise regression (#140029)
https://github.com/pytorch/pytorch/pull/138363 regressed some benchmarks but less than noise level updating values to avoid flakiness.
<img width="803" alt="Screenshot 2024-11-07 at 10 31 29 AM" src="https://github.com/user-attachments/assets/31326452-a6ad-44b8-b324-25e953355fcf">

PASS: benchmark ('add_loop_eager', 'compile_time_instruction_count') pass, actual result 3073605220 +1.21% is within expected 3037000000 ±1.50%

PASS: benchmark ('add_loop_eager_dynamic', 'compile_time_instruction_count') pass, actual result 5700849667 +1.37% is within expected 5624000000 ±2.50%

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140029
Approved by: https://github.com/bobrenjc93
2024-11-07 22:27:00 +00:00
Laith Sakka
de4216bfda increase add_loop benchmark and refresh all results! (#139703)
see comments end of https://github.com/pytorch/pytorch/pull/138756
I am also refreshing all values

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139703
Approved by: https://github.com/bobrenjc93
2024-11-05 05:41:21 +00:00
Bin Bao
740054ffe6 [AOTI][reland] Switch OSS dashboard to use aoti_compile_and_package (#139597)
Summary: Reland https://github.com/pytorch/pytorch/pull/139154

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139597
Approved by: https://github.com/angelayi
2024-11-04 18:53:17 +00:00
PyTorch MergeBot
709752e0bb Revert "[AOTI] Switch OSS dashboard to use aoti_compile_and_package (#139154)"
This reverts commit 293fbb42d2.

Reverted https://github.com/pytorch/pytorch/pull/139154 on behalf of https://github.com/desertfire due to cpu_aot_inductor_amp_freezing fails ([comment](https://github.com/pytorch/pytorch/pull/139154#issuecomment-2452983651))
2024-11-02 13:04:00 +00:00
Bin Bao
293fbb42d2 [AOTI] Switch OSS dashboard to use aoti_compile_and_package (#139154)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139154
Approved by: https://github.com/angelayi
ghstack dependencies: #139153
2024-11-02 03:10:05 +00:00
Laith Sakka
6a1c451479 Don't uselessly recompute axiom dict every static eval call (#138967)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138967
Approved by: https://github.com/ezyang
2024-10-31 21:16:55 +00:00
Laith Sakka
c056dc4cb8 In Inductor, be willing to generate deferred runtime asserts when unbacked (#138804)
Title + we avoid calling defer_assert when we statically know the guard results.
timing for pnasnet5large

```
TIMING: code_gen:21.79672 inductor_compile:39.57726 backend_compile:65.30649 entire_frame_compile:95.22052 total_wall_time:95.22052
```
matches with out the diff
```
TIMING: code_gen:21.89314 inductor_compile:39.72298 backend_compile:65.38539 entire_frame_compile:95.0854 total_wall_time:95.0854
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138804
Approved by: https://github.com/ezyang
2024-10-28 02:19:55 +00:00
Laith Sakka
705f5b3489 Several enhancements for check_results.py (#137925)
1) always generate expected_results.csv up to accuracy of first three digits
ex: 112313212312 --> 1120000000 .. etc
2) regenerate all record in  expected_results.csv and not just failed ones , why? because if we change something
by 1.3% and noise 1.5% we want to reflect that.
3) add "please update all results that changed significantly, and not only the failed ones"

```
(myenv) [lsakka@devgpu005.nha1 ~/pytorch/benchmarks/dynamo/pr_time_benchmarks (check_result_ehancements)]$ python check_results.py test_check_result/expected_test.csv te
st_check_result/result_test.csv out
WIN: benchmark ('a', 'instruction count') failed, actual result 9011111111 is -18.16% lower than expected 11011111111 ±1.00% please update the expected results.

please update all results that changed significantly, and not only the failed ones
REGRESSION: benchmark ('b', 'memory') failed, actual result 20011111111 is 99.89% higher than expected 10011111111 ±+10.00% if this is an expected regression, please update the expected results.

please update all results that changed significantly, and not only the failed ones
REGRESSION: benchmark ('c', 'something') failed, actual result 107111111111 is 969.92% higher than expected 10011111111 ±+10.00% if this is an expected regression, please update the expected results.

please update all results that changed significantly, and not only the failed ones
MISSING REGRESSION TEST: benchmark ('d', 'missing-test') does not have a regression test enabled for it.

new expected results file content if needed:
a,instruction count,9011000000,0.01
b,memory,20010000000,0.1
c,something,107100000000,0.1

There was some failures you can use the new reference expected result stored at path:out and printed above

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137925
Approved by: https://github.com/aorenste
2024-10-26 16:27:55 +00:00
Laith Sakka
10e2840ce3 Enable failing diffs on update_hint_regression and sum_floordiv_regression and autograd benchmarks regression (#137548)
update_hint_regression has been behaving, so I am setting 2% noise threshold for it. 1.5% for sum_floordiv_regression.

I have one concern, with the way we do the regression detection. small or changes <threshold level  will accumulate and eventually trigger failure. to avoid those would have to keep any eye on the dashboard and potentially refresh the expected result file regularly even when there is no faluires. .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137548
Approved by: https://github.com/aorenste
2024-10-26 07:28:49 +00:00
Pian Pawakapan
09848c892a [aot_compile] propagate ShapeEnv during lowering (#138362)
We found that `export() -> _inductor.aot_compile()` lowering, 3 different ShapeEnvs get created, leading to errors when one ShapeEnv processes expressions created by another ShapeEnv. This plumbs the 2 places where ShapeEnv creation happens, detecting the original ShapeEnv from the GraphModule example values, so the original ShapeEnv is just reused.

Differential Revision: D64613290

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138362
Approved by: https://github.com/angelayi
2024-10-24 22:22:14 +00:00
Laith Sakka
ed313a5ca2 Introduce torch.sym_add, variadic add (#138660)
Tested internally here: https://www.internalfb.com/diff/D64057744
This is a reland after previous internal failures.
main change is
```
 if min is None and max is None:
        torch._check_is_size(size)
        return
```

Partially addresses https://github.com/pytorch/pytorch/issues/128150

When you have big sums of values, we end up computing long chains of
binary addition in our FX graph representation.  Not only is this ugly,
it also is quadratic, as the sympy.Add constructor is O(N) in number
of arguments.  Instead, ensure that we maintain the summation as a
single FX node so we can do the entire addition all in one go.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138660
Approved by: https://github.com/ezyang, https://github.com/bobrenjc93
2024-10-23 17:42:41 +00:00
Ryan Guo
0a4197490c Delay mul/pow expansion for _SympyT to enable more folding (#138235)
Instead of calling `safe_expand` right after symbolic expression construction, we invoke it in `ShapeEnv.simplify`. This enables more simplification with product form, e.g.,
```
(a + b)^2 / (a + b) --> (a + b)
```
which won't happen if we expand eagerly during product construction:
```
(a^2 + 2ab + b^2) / (a + b) --> no change
```

Fixes #136044.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138235
Approved by: https://github.com/ezyang
2024-10-21 16:38:47 +00:00
Animesh Jain
0a2407b93c [dynamo] Support omegaconf DictConfig (#138378)
Fixes https://github.com/pytorch/pytorch/issues/138224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138378
Approved by: https://github.com/jansel
ghstack dependencies: #138359
2024-10-20 02:43:17 +00:00
Chong Gu
d512d0e227 Always use aten.constant_pad_nd for mm padding (#137820)
Summary: From experiment, it seems like aten.constant_pad_nd has better QPS compared to torch.cat. The qps gain for ig ctr is ~10%, and ~5% for oc.

Test Plan:
```
buck2 run mode/opt -c fbcode.nvcc_arch=a100 //caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/585279927/480/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend=AOT_INDUCTOR
```
```
buck2 run mode/opt //caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/588102397/1500/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend=AOT_INDUCTOR
```

Differential Revision: D64271583

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137820
Approved by: https://github.com/eellison
2024-10-18 19:35:03 +00:00
Brian Hirsh
a682194a11 inductor: use previous guards to know if a size is 1 for broadcasting (#136670)
Fixes https://github.com/pytorch/pytorch/issues/136640

Today, inductor has some logic to figure out when it needs to do broadcasting during lowering, which just checks if any of the input shapes have sizes equal to 1.

In particular: we should already have this information by the time we get to inductor, because our FakeTensor compute will have branched/guarded on whether any ops performed broadcasting, appropriately.

In particular, if we have a tensor with a size value of `(64//((2048//(s3*((s2//s3)))))))`, and it happens to be equal to one (and it is used in an op that requires this dim to be broadcasted), FakeTensorProp will have generated a guard:
```
Eq((64//((2048//(s3*((s2//s3))))))), 1)
```

I chose the simplest possible way to beef up inductor's checks to know when a given size is equal to 1: loop over the existing shape env guards, and if our current size is a sympy expression on the LHS of one of our `Eq(LHS, 1)` guards, then return True.

I'm hoping for feedback on whether or not this approach is reasonable. One better option I could imagine is that our symbolic reasoning should have automatically simplified the size of our tensor down to a constant as part of evaluating that guard. I was originally going to try to do this directly in the shape env, but I ran into a few issues:

(1) I wanted to call some version of `set_replacement(expr, 1)`. But `set_replacement()` only accepts plain symbols on the LHS, not expressions

(2) in theory I could get this to work if I could rework the above expression to move everything that is not a free variable to the RHS, e.g. `Eq(s2, 32)`. It looks like our existing  `try_solve()` logic is... [not quite able](https://github.com/pytorch/pytorch/blob/main/torch/utils/_sympy/solve.py#L27) to do this generally though.

Checking the guards feels pretty simple-and-easy. Are we worried that it is too slow to iterate over all the guards? I could also cache the lookup so we only need to iterate over guards that are of the form `Eq(LHS, 1)`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136670
Approved by: https://github.com/ezyang
2024-10-16 22:41:39 +00:00
Isuru Fernando
120fbe9caa Update inductor benchmark time to avoid flakiness (#137900)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137900
Approved by: https://github.com/laithsakka
2024-10-15 16:17:04 +00:00
Edward Z. Yang
5c3ba6faff Add fbscribelogger to Dynamo benchmark runner (#137867)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137867
Approved by: https://github.com/bobrenjc93
2024-10-15 04:36:41 +00:00
Isuru Fernando
08ce3aac62 Cache some ValueRanges (#137438)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137438
Approved by: https://github.com/ezyang
2024-10-13 19:23:34 +00:00
Bin Bao
cfc5d18aad [AOTI] Turn on the ABI-compatible mode as default (#136534)
Summary: Make AOTI generate ABI-compatible code as default for OSS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136534
Approved by: https://github.com/chenyang78
ghstack dependencies: #137660
2024-10-13 14:42:58 +00:00
Valentine233
67883e70c0 change GPT2ForSequenceClassification inference accuracy tolerance (#136749)
Fixes https://github.com/pytorch/pytorch/issues/123503.

https://github.com/pytorch/pytorch/pull/121866 makes GPT2ForSequenceClassification hit the SDPA pattern 18 and then encounter the accuracy issue. The issue only happens with BF16 inference single thread. This PR tends to increase the model tolerance from 4e-3 to 5e-3 and make the check pass. Note that the issue is due to some small implementation diff. For example, the sdpa math backend scales q, k before matmul for stability; the flash attention backend has more diffs as a new algorithm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136749
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-10-12 01:12:28 +00:00
PyTorch MergeBot
c58e5c4efa Revert "[AOTI] Turn on the ABI-compatible mode as default (#136534)"
This reverts commit b0da076f0c.

Reverted https://github.com/pytorch/pytorch/pull/136534 on behalf of https://github.com/desertfire due to The dependent PR https://github.com/pytorch/pytorch/pull/137660 fails in fbcode ([comment](https://github.com/pytorch/pytorch/pull/136534#issuecomment-2408211238))
2024-10-11 22:50:58 +00:00
Xuehai Pan
267f82b860 [BE] Format .ci/ / .github/ / benchmarks/ / functorch/ / tools/ / torchgen/ with ruff format (#132577)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132577
Approved by: https://github.com/malfet
2024-10-11 18:30:26 +00:00
Laith Sakka
a06d49a9f9 bump up add_loop_inductor_gpu expected instruction count. (#137672)
diff https://github.com/pytorch/pytorch/pull/137117/files increased instruction count for add_loop_inductor_gpu
but not enough to fail in that diff, but now its kind of flaky test .

it failed on recent merge:
<img width="1351" alt="Screenshot 2024-10-09 at 5 25 57 PM" src="https://github.com/user-attachments/assets/27178f76-c08e-4d13-9ac4-4cd70f146611">

and here is the history
<img width="1047" alt="Screenshot 2024-10-09 at 5 26 07 PM" src="https://github.com/user-attachments/assets/bd563e34-6f7f-461a-ae54-8a616be9bf09">
<img width="777" alt="Screenshot 2024-10-09 at 5 30 19 PM" src="https://github.com/user-attachments/assets/d0a1ca81-2bdb-4cf6-8ac8-ba5971d447bf">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137672
Approved by: https://github.com/masnesral
2024-10-11 16:46:38 +00:00
Bin Bao
b0da076f0c [AOTI] Turn on the ABI-compatible mode as default (#136534)
Summary: Make AOTI generate ABI-compatible code as default for OSS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136534
Approved by: https://github.com/chenyang78
ghstack dependencies: #137660
2024-10-10 23:44:57 +00:00
Colin Peppler
9690cacd61 [aotinductor] Add helper fn to atomically apply size_hint to an expr w/ unbacked symints (#137537)
### Context
Fixes CUDA IMA in autotune_at_compile_time, where we would generate an example tensor with an incorrect stride.

In the case below, the stride should be (u0 * 128, 128, 1). However, we apply the fallback on the entire expr (i.e. u0 * 128).
```
# buf817 = tensor(size=(s0, u0, 128), stride=(u0 * 128, 128, 1))

buf812 = generate_example_value(
    (64, 8192, 128), (8192, 128, 1), "cuda:0", torch.bfloat16, 0
)
```

The fix is to apply the fallback on each symbol.

### Test
```
PYTORCH_NO_CUDA_MEMORY_CACHING=1 compute-sanitizer python test_aot_inductor.py -k test_stride_with_unbacked_expr_abi_compatible_cuda

========= Invalid __global__ write of size 2 bytes
```

Differential Revision: [D64074561](https://our.internmc.facebook.com/intern/diff/D64074561)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137537
Approved by: https://github.com/jingsh
2024-10-10 17:11:24 +00:00
Oguz Ulgen
034af88c2d Add a microbechmark for cache read path (#137607)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137607
Approved by: https://github.com/jamesjwu
2024-10-10 16:36:18 +00:00
Laith Sakka
f394fb554b Enable failing diffs for regressions on basic_modules_ListOfLinears benchmarks (#137541)
Note that basic_modules_ListOfLinears_inductor_gpu_force_shape_pad is flay with 8% detected variance,
I set it up with 20% threshold (8*2)++
others are stable within +-1.5%

<img width="611" alt="Screenshot 2024-10-08 at 4 19 03 PM" src="https://github.com/user-attachments/assets/103c4bc7-6be8-41bf-ac31-4b8909fabfcf">

<img width="1581" alt="Screenshot 2024-10-08 at 4 18 56 PM" src="https://github.com/user-attachments/assets/56006f7a-e7de-4966-9a05-9263195adc68">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137541
Approved by: https://github.com/aorenste
2024-10-10 02:47:38 +00:00
Laith Sakka
361046718d Generate new expected results file when there is failures in diff time benchmarks (#137551)
The test also add singpost log for the benchmarks that pass.
to test run I ran python check_results.py test_check_result/expected_test.csv test_check_result/result_test.csv out.csv
results
```
WIN: benchmark ('a', 'instruction count') failed, actual result 90 is -18.18% lower than expected 110 ±1.00% please update the expected results.

REGRESSION: benchmark ('b', 'memory') failed, actual result 200 is 100.00% higher than expected 100 ±+10.00% if this is an expected regression, please update the expected results.

PASS: benchmark ('c', 'something') pass, actual result 107 +7.00% is within expected 100 ±10.00%

MISSING REGRESSION TEST: benchmark ('d', 'missing-test') does not have a regression test enabled for it.

You can use the new reference expected result stored at path: out.csv.

a,instruction count,90,0.01
b,memory,200,0.1
c,something,100,0.1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137551
Approved by: https://github.com/aorenste
2024-10-10 01:09:15 +00:00
PyTorch MergeBot
16a2c2cfd4 Revert "Introduce torch.sym_sum (#136429)"
This reverts commit 90bed32b98.

Reverted https://github.com/pytorch/pytorch/pull/136429 on behalf of https://github.com/ezyang due to fails internal stuff ([comment](https://github.com/pytorch/pytorch/pull/136429#issuecomment-2403335147))
2024-10-09 20:08:01 +00:00
Oguz Ulgen
ae03c0cff3 Add microbenchmark for FxGraphHashDetails.debug_lines (#137506)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137506
Approved by: https://github.com/jamesjwu
2024-10-09 16:15:05 +00:00
Michael Lazos
27dee935af [Dynamo] Ensure torch function modes are dispatched on builtin ops (#137117)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137117
Approved by: https://github.com/yanboliang, https://github.com/williamwen42
ghstack dependencies: #137114, #137115, #137116
2024-10-09 02:29:40 +00:00
PyTorch MergeBot
2d18c2d5e7 Revert "[Dynamo] Ensure torch function modes are dispatched on builtin ops (#137117)"
This reverts commit 941be418d8.

Reverted https://github.com/pytorch/pytorch/pull/137117 on behalf of https://github.com/huydhn due to The top of the stack has been reverted but it leaves trunk in a broken state, so I try to revert the rest of the stack ([comment](https://github.com/pytorch/pytorch/pull/137114#issuecomment-2400765603))
2024-10-08 20:33:17 +00:00
Brian Hirsh
b41fc14072 compile time benchmarks for AOTDispatcher (partitioner) (#136760)
compile time benchmark for the min cut partitioner. I'm hoping that this is a reasonable benchmark because:

(1) it consists of a single input + many weights that are used sequentially
(2) contains a mix of recompute vs non-recomputed ops (matmul + sin)
(3) it is relatively simple

from running locally:
```
collecting compile time instruction count for aotdispatcher_partitioner_cpu
compile time instruction count for iteration 0 is 21764219181
compile time instruction count for iteration 1 is 12475020009
compile time instruction count for iteration 2 is 12463710140
compile time instruction count for iteration 3 is 12455676489
compile time instruction count for iteration 4 is 12451344330
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136760
Approved by: https://github.com/ezyang
ghstack dependencies: #136759
2024-10-08 18:44:13 +00:00
Brian Hirsh
48b8f818b2 compile time benchmarks for AOTDispatcher (inference/training/subclasses) (#136759)
this adds a few compile time benchmarks for some disjoint paths in AOTDispatcher:

(1) inference vs training code paths
(2) "subclasses" vs "no subclasses" codepaths

Also see https://github.com/pytorch/pytorch/pull/136760 for a partitioner benchmark (I'm not sure why ghstack didn't display the stack nicely)

I ran locally, and got these numbers on the 4 paths:
```
collecting compile time instruction count for aotdispatcher_inference_nosubclass_cpu
compile time instruction count for iteration 0 is 11692348671
compile time instruction count for iteration 1 is 3026287204
compile time instruction count for iteration 2 is 3011467318
compile time instruction count for iteration 3 is 3004485935
compile time instruction count for iteration 4 is 3003087410
collecting compile time instruction count for aotdispatcher_training_nosubclass_cpu
compile time instruction count for iteration 0 is 6068003223
compile time instruction count for iteration 1 is 5585418102
compile time instruction count for iteration 2 is 5581856618
compile time instruction count for iteration 3 is 5581651794
compile time instruction count for iteration 4 is 5578742619
collecting compile time instruction count for aotdispatcher_inference_subclass_cpu
compile time instruction count for iteration 0 is 8634984264
compile time instruction count for iteration 1 is 8633467573
compile time instruction count for iteration 2 is 8632182092
compile time instruction count for iteration 3 is 8632056925
compile time instruction count for iteration 4 is 8632543871
collecting compile time instruction count for aotdispatcher_training_subclass_cpu
compile time instruction count for iteration 0 is 14737239311
compile time instruction count for iteration 1 is 14734346427
compile time instruction count for iteration 2 is 14736493730
compile time instruction count for iteration 3 is 14734121272
compile time instruction count for iteration 4 is 14733852882
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136759
Approved by: https://github.com/laithsakka
2024-10-08 18:44:13 +00:00
Edward Z. Yang
90bed32b98 Introduce torch.sym_sum (#136429)
Partially addresses https://github.com/pytorch/pytorch/issues/128150

When you have big sums of values, we end up computing long chains of
binary addition in our FX graph representation.  Not only is this ugly,
it also is quadratic, as the sympy.Add constructor is O(N) in number
of arguments.  Instead, ensure that we maintain the summation as a
single FX node so we can do the entire addition all in one go.

update_hint_regression benchmark, before and after:

```
update_hint_regression,compile_time_instruction_count,2648328980
update_hint_regression,compile_time_instruction_count,2563748678
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136429
Approved by: https://github.com/isuruf
2024-10-08 18:12:57 +00:00
Michael Lazos
941be418d8 [Dynamo] Ensure torch function modes are dispatched on builtin ops (#137117)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137117
Approved by: https://github.com/yanboliang, https://github.com/williamwen42
ghstack dependencies: #137114, #137115, #137116
2024-10-07 18:55:26 +00:00
Laith Sakka
8b9cbf22c2 Enable regression test for add loop benchmarks (#136573)
The red dotted line is 1.5

<img width="1607" alt="Screenshot 2024-09-24 at 11 50 41 AM" src="https://github.com/user-attachments/assets/719a9a86-89af-4c58-8723-80a28f9bb517">

expected taken from the average.
<img width="850" alt="Screenshot 2024-09-24 at 2 33 27 PM" src="https://github.com/user-attachments/assets/0f25e855-35ae-4031-86ef-1452ef6598de">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136573
Approved by: https://github.com/ezyang
2024-10-04 18:12:08 +00:00
PyTorch MergeBot
951107e8c2 Revert "compile time benchmarks for AOTDispatcher (inference/training/subclasses) (#136759)"
This reverts commit b17cd264d3.

Reverted https://github.com/pytorch/pytorch/pull/136759 on behalf of https://github.com/ZainRizvi due to Something in this stack seems to be causing tests to fail on trunk. See functorch/test_control_flow.py::TestControlFlow::test_associative_scan_dim_reverse_True_combine_mode_generic_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/11107079955/job/30872132411) [HUD commit link](c010c6099b) ([comment](https://github.com/pytorch/pytorch/pull/136670#issuecomment-2386303362))
2024-10-01 15:23:55 +00:00
PyTorch MergeBot
923410193b Revert "compile time benchmarks for AOTDispatcher (partitioner) (#136760)"
This reverts commit c010c6099b.

Reverted https://github.com/pytorch/pytorch/pull/136760 on behalf of https://github.com/ZainRizvi due to Something in this stack seems to be causing tests to fail on trunk. See functorch/test_control_flow.py::TestControlFlow::test_associative_scan_dim_reverse_True_combine_mode_generic_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/11107079955/job/30872132411) [HUD commit link](c010c6099b) ([comment](https://github.com/pytorch/pytorch/pull/136670#issuecomment-2386303362))
2024-10-01 15:23:55 +00:00
Bin Bao
a15f3f51bc [AOTI] Update sam_fast from timeout to fail_to_run (#136996)
Summary: sam_fast changes from timeout to fail_to_run after https://github.com/pytorch/pytorch/pull/136591, which "regressed" in a good way. Update the expected result file and continue investigating.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136996
Approved by: https://github.com/ezyang
2024-09-30 14:05:49 +00:00
Brian Hirsh
c010c6099b compile time benchmarks for AOTDispatcher (partitioner) (#136760)
compile time benchmark for the min cut partitioner. I'm hoping that this is a reasonable benchmark because:

(1) it consists of a single input + many weights that are used sequentially
(2) contains a mix of recompute vs non-recomputed ops (matmul + sin)
(3) it is relatively simple

from running locally:
```
collecting compile time instruction count for aotdispatcher_partitioner_cpu
compile time instruction count for iteration 0 is 21764219181
compile time instruction count for iteration 1 is 12475020009
compile time instruction count for iteration 2 is 12463710140
compile time instruction count for iteration 3 is 12455676489
compile time instruction count for iteration 4 is 12451344330
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136760
Approved by: https://github.com/ezyang
ghstack dependencies: #136670, #136759
2024-09-30 13:25:02 +00:00
Brian Hirsh
b17cd264d3 compile time benchmarks for AOTDispatcher (inference/training/subclasses) (#136759)
this adds a few compile time benchmarks for some disjoint paths in AOTDispatcher:

(1) inference vs training code paths
(2) "subclasses" vs "no subclasses" codepaths

Also see https://github.com/pytorch/pytorch/pull/136760 for a partitioner benchmark (I'm not sure why ghstack didn't display the stack nicely)

I ran locally, and got these numbers on the 4 paths:
```
collecting compile time instruction count for aotdispatcher_inference_nosubclass_cpu
compile time instruction count for iteration 0 is 11692348671
compile time instruction count for iteration 1 is 3026287204
compile time instruction count for iteration 2 is 3011467318
compile time instruction count for iteration 3 is 3004485935
compile time instruction count for iteration 4 is 3003087410
collecting compile time instruction count for aotdispatcher_training_nosubclass_cpu
compile time instruction count for iteration 0 is 6068003223
compile time instruction count for iteration 1 is 5585418102
compile time instruction count for iteration 2 is 5581856618
compile time instruction count for iteration 3 is 5581651794
compile time instruction count for iteration 4 is 5578742619
collecting compile time instruction count for aotdispatcher_inference_subclass_cpu
compile time instruction count for iteration 0 is 8634984264
compile time instruction count for iteration 1 is 8633467573
compile time instruction count for iteration 2 is 8632182092
compile time instruction count for iteration 3 is 8632056925
compile time instruction count for iteration 4 is 8632543871
collecting compile time instruction count for aotdispatcher_training_subclass_cpu
compile time instruction count for iteration 0 is 14737239311
compile time instruction count for iteration 1 is 14734346427
compile time instruction count for iteration 2 is 14736493730
compile time instruction count for iteration 3 is 14734121272
compile time instruction count for iteration 4 is 14733852882
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136759
Approved by: https://github.com/laithsakka
ghstack dependencies: #136670
2024-09-30 13:25:02 +00:00
Laith Sakka
e205193e1c Enable failing diffs on regression (#136551)
1. example of failing diff
https://github.com/pytorch/pytorch/pull/136740

2. test this by running
python check_results.py test_check_result/expected_test.csv   test_check_result/result_test.csv

results
```
WIN: benchmark ('a', ' instruction count') failed, actual result 90 is 18.18% lower than expected 110 ±1.00% please update the expected results.
REGRESSION: benchmark ('b', ' memory') failed, actual result 200 is 100.00% higher than expected 100 ±10.00% if this is an expected regression, please update the expected results.
MISSING REGRESSION TEST: benchmark ('d', ' missing-test') does not have a regression test enabled for it
```
MISSING REGRESSION TEST does not fail but its logged.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136551
Approved by: https://github.com/ezyang
ghstack dependencies: #136383
2024-09-29 22:31:26 +00:00
Jason Ansel
8da9c4178c [inductor] Benchmark Halide in operatorbench.py (#136809)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136809
Approved by: https://github.com/eellison
ghstack dependencies: #136808
2024-09-28 19:26:04 +00:00
Jason Ansel
375921b755 [inductor] Improve operatorbench.py (#136808)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136808
Approved by: https://github.com/eellison
2024-09-28 06:22:02 +00:00
William Wen
2157e396a3 [dynamo] attempt run only mode when dynamo cache limit is hit (#136655)
Implement https://github.com/pytorch/pytorch/issues/135458.

Try run-only mode when dynamo cache limit is hit. If no valid cache entries are found, then skip code recursively.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136655
Approved by: https://github.com/jansel
2024-09-27 17:15:05 +00:00
Zain Rizvi
37f340c1e5 [EZ] Remove remaining amz2023 runner variant references (#136540)
Validated no jobs use the amz2023 runner variant anymore ([proof](https://github.com/search?type=code&q=org%3Apytorch+%2F%5Cbamz2023%5Cb%2F+&p=1)) so removing all references to it

Explicit references to the amz2023 runner type variants were removed in the following PRs:
- https://github.com/pytorch/ignite/pull/3285
- https://github.com/pytorch/ao/pull/887
- https://github.com/pytorch/fbscribelogger/pull/1
- https://github.com/pytorch/pytorch/pull/134355

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136540
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-09-25 19:01:00 +00:00
PyTorch MergeBot
064093a4d6 Revert "Increase update_hint_regression problem size to 1000 (#136434)"
This reverts commit 3116fbda0f.

Reverted https://github.com/pytorch/pytorch/pull/136434 on behalf of https://github.com/ezyang due to whoops, this is too slow ([comment](https://github.com/pytorch/pytorch/pull/136434#issuecomment-2371847842))
2024-09-24 17:05:20 +00:00
Edward Z. Yang
3116fbda0f Increase update_hint_regression problem size to 1000 (#136434)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136434
Approved by: https://github.com/laithsakka
2024-09-23 18:51:44 +00:00
Laith Sakka
0b91e7e2dc Remove duplicate line (#136383)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136383
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-09-21 01:35:13 +00:00
Laith Sakka
b71802fa79 add basic_modules_ListOfLinears_inductor_gpu_force_shape_pad (#136175)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136175
Approved by: https://github.com/ezyang
2024-09-19 19:15:50 +00:00
Igor Sugak
bce52d0b60 [CODEMOD][caffe2] use npt.NDArray instead of np.ndarray in type annotations (#136288)
Summary:
To facilitate PSS-2 upgrade, this uses `ndt.NDArray` instead of `nd.ndarray` in type annotations. In Numpy-1.19 (PSS-1) it's an alias to `nd.ndarray` -- a noop.
In Numpy-1.24, `ndt.NDArray` a proper generic type, and without this change uses of `nd.ndarray` generate this Pyre type error:
```counterexample
 Invalid type parameters [24]: Generic type `np.ndarray` expects 2 type parameters.
```

Test Plan: Sandcastle plus visual inspection

Differential Revision: D62977370

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136288
Approved by: https://github.com/kit1980
2024-09-19 12:40:36 +00:00
leslie-fang-intel
8072ebc36c SKIP llama for dynamic size testing (#135960)
Running Torchbench llama with dynamic size failed with
```
  File "/localdisk/leslie/torch_inductor_community/pytorch/torch/fx/experimental/symbolic_shapes.py", line 4182, in produce_guards
    raise ConstraintViolationError(
torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (L['inputs'][0].size()[0])! For more information, run with TORCH_LOGS="+dynamic".
  - Not all values of RelaxedUnspecConstraint(L['inputs'][0].size()[0]) are valid because L['inputs'][0].size()[0] was inferred to be a constant (32).
```
Skip this model for marking dynamic dim.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135960
Approved by: https://github.com/ezyang
2024-09-15 00:06:49 +00:00
Laith Sakka
b82122beef Only keep ListOfLinears module in basic_modules_benchmarks and add gpu version. (#135730)
All of the previous benchmarks are similar, ListOfLinears should be representative enough.
I copied the previous benchmarks from unit tests without an intention, was just trying to create a large
number of benchmarks to better observe noise.

This PR keeps only one, we can add more as we see value and regressions in the future.
Also this diff adds a GPU version.
```
collecting compile time instruction count for basic_modules_ListOfLinears_eager
compile time instruction count for iteration 0 is 6479525851
compile time instruction count for iteration 1 is 1024432680
compile time instruction count for iteration 2 is 1019417317
compile time instruction count for iteration 3 is 1013603566
compile time instruction count for iteration 4 is 1008853980
compile time instruction count for iteration 5 is 1009541481
compile time instruction count for iteration 6 is 1005025533
compile time instruction count for iteration 7 is 1004116323
compile time instruction count for iteration 8 is 1000828633
compile time instruction count for iteration 9 is 999788323
collecting compile time instruction count for basic_modules_ListOfLinears_inductor
compile time instruction count for iteration 0 is 40837529730
compile time instruction count for iteration 1 is 18411921909
compile time instruction count for iteration 2 is 18383665161
compile time instruction count for iteration 3 is 18348983522
compile time instruction count for iteration 4 is 18349276590
compile time instruction count for iteration 5 is 18353046274
compile time instruction count for iteration 6 is 18346818581
compile time instruction count for iteration 7 is 18340057998
compile time instruction count for iteration 8 is 18331267320
compile time instruction count for iteration 9 is 18328381338
collecting compile time instruction count for basic_modules_ListOfLinears_inductor_gpu
compile time instruction count for iteration 0 is 15408870979
compile time instruction count for iteration 1 is 10949520859
compile time instruction count for iteration 2 is 11058786167
compile time instruction count for iteration 3 is 11003606719
compile time instruction count for iteration 4 is 10896406770
compile time instruction count for iteration 5 is 10982875189
compile time instruction count for iteration 6 is 10931848275
compile time instruction count for iteration 7 is 10956345008
compile time instruction count for iteration 8 is 11045384499
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135730
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2024-09-14 16:45:52 +00:00
Laith Sakka
44dd218a61 Disable garbage collection during compile_time_instructions count in benchmark base by default. (#135768)
When we measure compile time instruction count, probably we do want in most cases to measure gc instructions
disabling it here by default.
if it is needed we can add an option to allow it, or someone can use the regular total instruction count instead of compile time instruction count.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135768
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2024-09-14 06:15:28 +00:00
Laith Sakka
46935c8241 Reduce default iterations to 5 . (#135773)
running all benchmarks takes around 15 mins rn, this is the data
https://www.internalfb.com/phabricator/paste/view/P1583590240
the data looks mostly stable, and 5 iterations should be good, specially with our 1.5% threshold.
that said, the diff also add a way to increase the number of iterations for a specific benchmark.

after the change results
https://www.internalfb.com/phabricator/paste/view/P1583618969
time is down to half (7 mins)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135773
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2024-09-13 21:16:38 +00:00
Laith Sakka
4f407c1884 Only measure compile time instruction count for sum_floordiv benchmark (#135785)
there was a recent strange noise +5%, -5%.
using only compile time :
1) avoid gc time .
2) avoid other operations that are not what we try to measure by this. ==> less probable noise.
```
collecting compile time instruction count for sum_floordiv_regression
compile time instruction count for iteration 0 is 8899290248
compile time instruction count for iteration 1 is 1188830489
compile time instruction count for iteration 2 is 1180579615
compile time instruction count for iteration 3 is 1176263131
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135785
Approved by: https://github.com/avikchaudhuri, https://github.com/anijain2305
2024-09-13 21:14:10 +00:00
Laith Sakka
2e461e54e8 Add gpu and gpu_dynamic versions of add_loop (#135809)
I am thinking maybe 3 iterations are enough for this one?
- so I am keeping eager and inductor since inductor is 2X eager time
- Eager dynamic is 2X eager so keeping this as well.
- inductor have three tests. (dynamic gpu, gpu and cpu)
I am unsure if am over profiling here happy to trim if anyone have suggestions.
```
collecting compile time instruction count for add_loop_eager
compile time instruction count for iteration 0 is 8213664211
compile time instruction count for iteration 1 is 2798628246
compile time instruction count for iteration 2 is 2796811362
compile time instruction count for iteration 3 is 2794438188
compile time instruction count for iteration 4 is 2794634117
collecting compile time instruction count for add_loop_eager_dynamic
compile time instruction count for iteration 0 is 5724108021
compile time instruction count for iteration 1 is 5499908609
compile time instruction count for iteration 2 is 5569101366
compile time instruction count for iteration 3 is 5493806364
compile time instruction count for iteration 4 is 5493169851
collecting compile time instruction count for add_loop_inductor
compile time instruction count for iteration 0 is 49789381222
compile time instruction count for iteration 1 is 25769347393
compile time instruction count for iteration 2 is 25772594322
compile time instruction count for iteration 3 is 25768695952
compile time instruction count for iteration 4 is 25768032314
collecting compile time instruction count for add_loop_inductor_gpu
compile time instruction count for iteration 0 is 23966942581
compile time instruction count for iteration 1 is 23771950919
compile time instruction count for iteration 2 is 23770784286
compile time instruction count for iteration 3 is 23780160875
compile time instruction count for iteration 4 is 23774634465
collecting compile time instruction count for add_loop_inductor_dynamic_gpu
compile time instruction count for iteration 0 is 41505055086
compile time instruction count for iteration 1 is 41293654089
compile time instruction count for iteration 2 is 41301016100
compile time instruction count for iteration 3 is 41306056207
compile time instruction count for iteration 4 is 41308171566
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135809
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2024-09-13 20:42:31 +00:00
Pian Pawakapan
b897ab0540 [export] ignore mark_dynamic() in export (#135536)
Previously we were accomodating `torch._dynamo.mark_dynamic()` for export's dynamic shapes. Here we clean things up and ignore it, requiring users to specify an export input for `dynamic_shapes`.

Note: there's 4 decorators relevant to export, `mark_dynamic, maybe_mark_dynamic, mark_static, mark_unbacked`. User calls that involve export have only been `mark_dynamic()`, and we use `maybe_mark_dynamic` under the hood for `Dim.AUTO`, but we could start using others. One reason I decided to not warn and just silently ignore is these decorators cause the tensors to carry dynamic info, and it'll be hard to tell whether the markers are from export or user calls when re-exporting with the same inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135536
Approved by: https://github.com/avikchaudhuri
2024-09-12 21:22:19 +00:00
zengxian
7ec17b49cf Fix dynamo benchmark skip logic for cpu device (#135193)
Fixes #132380, adjust torchbench and huggingface skip models list, then we can remove `--no-skip` when running benchmarks on 3 suites.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135193
Approved by: https://github.com/chuanqi129, https://github.com/jansel
2024-09-10 03:02:19 +00:00
Yueming Hao
a71e5509bc [inductor]Add profiler to operatorbench (#135515)
Add profiling to operatorbench. The new argument `--profile` is added and the profiling trace is like the following figure.
<img width="954" alt="image" src="https://github.com/user-attachments/assets/5b00d6e3-4905-4a77-a5e9-9f62620a5fd5">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135515
Approved by: https://github.com/shunting314
2024-09-10 02:33:30 +00:00
Yanbo Liang
d81731615f [Dynamo] Adding CallFunctionNoArgsSource and (#135425)
CallFunctionNoArgsGuardAccessor to support torch.cuda.current_device()

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135425
Approved by: https://github.com/anijain2305
2024-09-09 22:46:00 +00:00
leslie-fang-intel
07689a38bf [Inductor] Fix AOT weight alignment issue on CPU (#135205)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/135027. On CPU, the `consts_size` used to generate `_binary_constants_bin_start` is not padded to `ALIGN_BYTES`, while `serialized_weights` is, causing a failure in the 16K alignment check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135205
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-09-06 03:06:51 +00:00
Yiming Zhou
050ad925f3 [benchmark] Add to torchbench relative path search (#134871)
Add to relative path search in benchmark. This enables user to run `torchbench.py` inside the `pytorch/benchmark/dynamo` folder when `torchbench` repo is cloned in the same level as `pytorch`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134871
Approved by: https://github.com/FindHao
2024-08-31 00:28:22 +00:00
Animesh Jain
577a93514f [dynamo][dynamic][heuristic] Mark tuple getitem integers as static (#134734)
Fixes issue seen in https://github.com/pytorch/pytorch/issues/132872#issuecomment-2314574656

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134734
Approved by: https://github.com/jansel
ghstack dependencies: #134653, #134713
2024-08-30 17:06:57 +00:00
Laith Sakka
0d5f978795 add basic nn modules diff time benchmarks (#134658)
benchmarks several shapes of basic nn modules. in both eager and inductor

```
collecting compile time instruction count for basic_modules_ListOfLinears_inductor
compile time instruction count for iteration 0 is 48602516013
compile time instruction count for iteration 1 is 20424350269
compile time instruction count for iteration 2 is 20440350455
compile time instruction count for iteration 3 is 20419269999
compile time instruction count for iteration 4 is 20430782200
compile time instruction count for iteration 5 is 20455049622
compile time instruction count for iteration 6 is 20157290712
compile time instruction count for iteration 7 is 20455324001
compile time instruction count for iteration 8 is 20450158317
compile time instruction count for iteration 9 is 20492987748
collecting compile time instruction count for basic_modules_ListOfLinears_eager
compile time instruction count for iteration 0 is 961328334
compile time instruction count for iteration 1 is 958887896
compile time instruction count for iteration 2 is 958792214
compile time instruction count for iteration 3 is 958375977
compile time instruction count for iteration 4 is 958568525
compile time instruction count for iteration 5 is 958152305
compile time instruction count for iteration 6 is 959322800
compile time instruction count for iteration 7 is 958332703
compile time instruction count for iteration 8 is 958092100
compile time instruction count for iteration 9 is 958095277
collecting compile time instruction count for basic_modules_ModuleForwardHasGraphBreak_inductor
compile time instruction count for iteration 0 is 3572145793
compile time instruction count for iteration 1 is 3503323973
compile time instruction count for iteration 2 is 3501962432
compile time instruction count for iteration 3 is 3501746084
compile time instruction count for iteration 4 is 3500687361
compile time instruction count for iteration 5 is 3822254676
compile time instruction count for iteration 6 is 3498356846
compile time instruction count for iteration 7 is 3499019157
compile time instruction count for iteration 8 is 3500780314
compile time instruction count for iteration 9 is 3500257458
collecting compile time instruction count for basic_modules_ModuleForwardHasGraphBreak_eager
compile time instruction count for iteration 0 is 1844838754
compile time instruction count for iteration 1 is 1843476862
compile time instruction count for iteration 2 is 1844761450
compile time instruction count for iteration 3 is 1845371742
compile time instruction count for iteration 4 is 1845159665
compile time instruction count for iteration 5 is 1845035802
compile time instruction count for iteration 6 is 1844895007
compile time instruction count for iteration 7 is 1844697922
compile time instruction count for iteration 8 is 1844780885
compile time instruction count for iteration 9 is 1844493990
collecting compile time instruction count for basic_modules_SequentialWithDuplicatedModule_inductor
compile time instruction count for iteration 0 is 1597839479
compile time instruction count for iteration 1 is 1348225351
compile time instruction count for iteration 2 is 1347340818
compile time instruction count for iteration 3 is 1348170800
compile time instruction count for iteration 4 is 1348637747
compile time instruction count for iteration 5 is 1678366444
compile time instruction count for iteration 6 is 1348412420
compile time instruction count for iteration 7 is 1348461578
compile time instruction count for iteration 8 is 1347420149
compile time instruction count for iteration 9 is 1349748195
collecting compile time instruction count for basic_modules_SequentialWithDuplicatedModule_eager
compile time instruction count for iteration 0 is 137721777
compile time instruction count for iteration 1 is 139065517
compile time instruction count for iteration 2 is 137130552
compile time instruction count for iteration 3 is 137506030
compile time instruction count for iteration 4 is 137089838
compile time instruction count for iteration 5 is 137477395
compile time instruction count for iteration 6 is 138550452
compile time instruction count for iteration 7 is 137568409
compile time instruction count for iteration 8 is 136968468
compile time instruction count for iteration 9 is 137481664
collecting compile time instruction count for basic_modules_ModuleComparison_inductor
compile time instruction count for iteration 0 is 917209684
compile time instruction count for iteration 1 is 899154426
compile time instruction count for iteration 2 is 898145079
compile time instruction count for iteration 3 is 899817018
compile time instruction count for iteration 4 is 899184687
compile time instruction count for iteration 5 is 898172885
compile time instruction count for iteration 6 is 899958951
compile time instruction count for iteration 7 is 899348186
compile time instruction count for iteration 8 is 897745404
compile time instruction count for iteration 9 is 899581123
collecting compile time instruction count for basic_modules_ModuleComparison_eager
compile time instruction count for iteration 0 is 113165302
compile time instruction count for iteration 1 is 112724376
compile time instruction count for iteration 2 is 112774611
compile time instruction count for iteration 3 is 114465211
compile time instruction count for iteration 4 is 112689572
compile time instruction count for iteration 5 is 112726465
compile time instruction count for iteration 6 is 112853691
compile time instruction count for iteration 7 is 112295238
compile time instruction count for iteration 8 is 114022136
compile time instruction count for iteration 9 is 112664932
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134658
Approved by: https://github.com/anijain2305
ghstack dependencies: #133834, #134635, #134649, #134652
2024-08-30 02:13:52 +00:00
Laith Sakka
496e57283d add add_loop benchmarks (#134652)
This benchmark measure the cost of compiling the following function in eager and inductor
its basically two benchmarks.

```
        @torch.compile(backend=self.backend, fullgraph=True)
        def f(a, b):
            result = a.clone()
            for i in range(1000):
                if i % 3 == 0:
                    result = result + b
                elif i % 3 == 1:
                    result = result + 8 * b
                else:
                    result = result.sin()
            return result
```

 PYTHONPATH=$(pwd) python benchmarks/add_loop.py out
 ```
collecting compile time instruction count for add_loop_eager
compile time instruction count for iteration 0 is 8286649663
compile time instruction count for iteration 1 is 2838971338
compile time instruction count for iteration 2 is 2834263023
compile time instruction count for iteration 3 is 2829447493
compile time instruction count for iteration 4 is 2830904231
compile time instruction count for iteration 5 is 2830281077
compile time instruction count for iteration 6 is 2831466595
compile time instruction count for iteration 7 is 2830732164
compile time instruction count for iteration 8 is 2831088056
compile time instruction count for iteration 9 is 2831204407

collecting compile time instruction count for add_loop_inductor
compile time instruction count for iteration 0 is 32585687849
compile time instruction count for iteration 1 is 11747553436
compile time instruction count for iteration 2 is 11746959875
compile time instruction count for iteration 3 is 11749479461
compile time instruction count for iteration 4 is 11750053711
compile time instruction count for iteration 5 is 11750793958
compile time instruction count for iteration 6 is 11751673576
compile time instruction count for iteration 7 is 11754552912
compile time instruction count for iteration 8 is 11753723127
compile time instruction count for iteration 9 is 11759059942
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134652
Approved by: https://github.com/anijain2305
ghstack dependencies: #133834, #134635, #134649
2024-08-29 23:04:01 +00:00
Bin Bao
387d3fc296 [AOTI] Switch benchmarking to use export non-strict mode (#130977)
Summary: Switch the export part used by AOTInductor benchmarking from strict to non-strict, and switch it from producing torch IR to aten IR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130977
Approved by: https://github.com/angelayi
ghstack dependencies: #134639
2024-08-29 16:08:52 +00:00
Laith Sakka
c5f114747e fix flakiness in update_hint_benchmark.py (#134649)
```
compile time instruction count for iteration 1 is 10732129038
compile time instruction count for iteration 2 is 10719776783
compile time instruction count for iteration 3 is 10729546868
compile time instruction count for iteration 4 is 10737655132
compile time instruction count for iteration 5 is 10732564252
compile time instruction count for iteration 6 is 10728721234
compile time instruction count for iteration 7 is 10733354271
compile time instruction count for iteration 8 is 10719588972
compile time instruction count for iteration 9 is 10706311856
```
1. add torch.manual_seed(0), inputs was not the same across iterations
2. disable gc.
3. remove loop (not needed since compilation happen once only)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134649
Approved by: https://github.com/aorenste
ghstack dependencies: #133834, #134635
2024-08-29 02:22:05 +00:00
Laith Sakka
633a9a3b13 add back sum_floordiv benchmark. (#134635)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134635
Approved by: https://github.com/avikchaudhuri, https://github.com/oulgen
ghstack dependencies: #133834
2024-08-28 17:38:24 +00:00
Bin Bao
e6bf1710ff [Inductor][Refactor] Rename CPU benchmark test configs (#134639)
Summary: benchmarks/dynamo/ci_expected_accuracy/update_expected.py expects a benchmark run config is named as {config}_{benchmark}, and CPU tests should follow the same naming convention.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134639
Approved by: https://github.com/huydhn
2024-08-28 14:49:55 +00:00
Laith Sakka
d6091c8726 Add compile time instruction count metric (#133834)
PYTHONPATH=$(pwd) python benchmarks/update_hint_benchmark.py out
as of this diff, compile_time_instruction_count counts the number of instruction from within
convert_frame.compile_inner
```
update_hint_regression,compile_time_instruction_count,10522459165
```
 will add result from CI once populated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133834
Approved by: https://github.com/aorenste
2024-08-27 23:29:02 +00:00
PyTorch MergeBot
5b392d22c6 Revert "fix stuck floordiv (#134150)"
This reverts commit 92c4771853.

Reverted https://github.com/pytorch/pytorch/pull/134150 on behalf of https://github.com/anijain2305 due to compile time regression internal ([comment](https://github.com/pytorch/pytorch/pull/134150#issuecomment-2313230404))
2024-08-27 18:23:44 +00:00
Avik Chaudhuri
92c4771853 fix stuck floordiv (#134150)
Summary: Fixes https://github.com/pytorch/pytorch/issues/134133

Test Plan:
Tested on the small repro in the linked issue with different lengths N (replacing 100), recording N vs. time taken in nanoseconds:
10 127268319
20 220839662
30 325463125
40 429259441
50 553136055
60 670799769
70 999170514
80 899014103
90 997168902
100 1168202035
110 1388556619
120 1457488235
130 1609816470
140 2177889877
150 1917560313
160 2121096113
170 2428502334
180 4117450755
190 4003068224

So N ~ 200 takes ~5s. Previously even smaller N would go for >1 min.

Didn't add a perf test because ezyang is planning to build a benchmark.

Also tested on https://www.internalfb.com/diff/D61560171, which now gets past the stuck point.

Differential Revision: D61619660

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134150
Approved by: https://github.com/ezyang
2024-08-26 07:27:59 +00:00
Nikita Shulga
5f0bd98767 Increase max total number of dynamo partitions to 15 (#134153)
Needed to be able to split some of the aarch64 workflows to 15 shards

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134153
Approved by: https://github.com/seemethere, https://github.com/kit1980, https://github.com/ZainRizvi
2024-08-21 23:10:12 +00:00
Bin Bao
5d5a45dc85 [CI][dashboard] Collect Export pass rate separately (#134076)
Summary: Collect Export pass rate separately when running AOTInduction, so that we can have a better isolated signal.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134076
Approved by: https://github.com/angelayi
2024-08-21 21:18:55 +00:00
Edward Z. Yang
32e057636c Enable scribe environment for compile-time benchmarks if requested. (#133891)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133891
Approved by: https://github.com/malfet
2024-08-21 18:02:54 +00:00
Weizhuo Zhang
5153550e4b [CI] Add FP32 dynamic, AMP static, AMP dynamic for AOT inductor accuracy CPU CI test (#132836)
This PR added 3 more accuracy test for AOT inductor CPU side.
1. FP32 dynamic shape accuracy test, torchbench suite
2. AMP static shape accuracy test, torchbench suite
3. AMP dynamic shape accuracy test, torchbench suite

**Test Time cost:**
| Precision 	| Shape Type 	| Suite      	| Time cost 	|
|-----------	|------------	|------------	|-----------	|
| FP32      	|    dynamic 	| Torchbench 	|  1h40m         	|
| AMP       	|     Static 	| Torchbench 	|  1h38m        	|
| AMP       	|    dynamic 	| Torchbench 	|  1h48m        	|

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132836
Approved by: https://github.com/desertfire
2024-08-19 14:26:48 +00:00
laithsakka
7673ee5456 remove benchmarks/__init__.py (#133390)
trying to address https://github.com/pytorch/pytorch/issues/133377

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133390
Approved by: https://github.com/kit1980, https://github.com/malfet, https://github.com/ezyang
2024-08-15 19:08:10 +00:00
eellison
4bb1650ca3 Bump maxinum num warps (#132458)
Fix for https://github.com/pytorch/pytorch/issues/129104

Our heuristic for num_warps was giving the optimal number, but we were capping maximum num_warps at 8. Gives 1% speedup on HF and TIMM in inference, 2% speedup in TIMM training, neutral otherwise.

ultimately, I think we want live var analysis for register usage.. still worth landing this now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132458
Approved by: https://github.com/Chillee, https://github.com/shunting314
2024-08-14 16:51:05 +00:00
laithsakka
f5e704a6f2 Add instruction count benchmark to run on pull requests (#131475)
This PR only adds the execution of the benchmarks on this PR and print results, following diffs will add checking out head~1 and running it and comparing.

to access results goto test pr_time_benchmarks and inspect logs:
you should see
```
+ echo 'benchmark results on current PR: '
benchmark results on current PR:
+ cat /var/lib/jenkins/workspace/test/test-reports/pr_time_benchmarks_before.txt
update_hint_regression,instruction_count,27971461254
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131475
Approved by: https://github.com/ezyang
2024-08-12 05:20:26 +00:00
Shunting Zhang
10c2168b31 [pt2-bench] use larger multiplier for smaller tensors for a few models (#132952)
Fix https://github.com/pytorch/pytorch/issues/132922  and https://github.com/pytorch/pytorch/issues/132924

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132952
Approved by: https://github.com/eellison, https://github.com/jansel
2024-08-09 00:09:21 +00:00
leslie-fang-intel
ac960dced1 Skip Reformer for Dynamic size testing (#132468)
**Summary**

As discussed in https://github.com/pytorch/pytorch/issues/132286, `Reformer` has specialized the batch size dim which will fails the API  `mark_dynamic` 3a355c1891/torch/_dynamo/decorators.py (L228-L230)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132468
Approved by: https://github.com/ezyang
2024-08-08 08:25:53 +00:00
Nicolas Macchioni
5cb05a82b4 [BC breaking] move benchmarking + prefer inductor path (#132827)
move benchmarking out of `torch._inductor.runtime.runtime_utils` and into `torch._inductor.runtime.benchmarking`, and prefer this path over directly accessing Triton's benchmarking

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132827
Approved by: https://github.com/eellison
2024-08-08 00:47:45 +00:00
HDCharles
374747818d Run performance test non-alternately (#131935)
Summary:
By default, performance tests (speedup experiments) will run the baseline and test backend alternately.

However, this does not work for the torchao backend, which will change the model in-place, therefore the baseline run will also run with torchao backend since the model has already been quantized.

Add a new experiment "latency_experiment" to run performance tests non-alternately (first run baseline for a few iterations, then run the test backend).

other changes:

need to add torch.compiler.cudagraph_mark_step_begin() to avoid the
slowdown from             # Unable to hit fast path of CUDAGraphs because of pending, uninvoked backwards

also updated the torchao APIs to the current versions

X-link: https://github.com/pytorch/benchmark/pull/2394

Test Plan:
python run_benchmark.py torchao --only AlbertForMaskedLM --quantization noquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune python run_benchmark.py torchao --only BartForCausalLM --quantization noquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune python run_benchmark.py torchao --only timm_efficientnet --quantization noquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune

(should all be ~1.0
0.997x
1.006x
0.994x

Reviewed By: xuzhao9

Differential Revision: D60252821

Pulled By: HDCharles

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131935
Approved by: https://github.com/xuzhao9
2024-08-08 00:23:20 +00:00
Justin Chu
6966d44eda [ONNX] Rename _internal/exporter to _exporter_legacy (#132429)
The next PR will be creating an `exporter` directory to house logic from `torch-onnx`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132429
Approved by: https://github.com/titaiwangms
2024-08-03 04:23:05 +00:00
zengxian
f936e68506 [CI] Update CPU inductor smoke test model list and target (#132221)
Fixes #132097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132221
Approved by: https://github.com/desertfire
2024-08-02 07:09:54 +00:00
Sergii Dymchenko
8458980bbf Move benchmarks/dynamo/huggingface configuration to YAML (#131724)
Similar to https://github.com/pytorch/pytorch/pull/120299

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131724
Approved by: https://github.com/shunting314
2024-07-27 00:55:04 +00:00
Bin Bao
0272934238 [Inductor][CPU] Fix an InvalidVecISA issue on CI (#131812)
Summary: CPU CI nodes failed to find valid VecISA because importing torch under the default pytorch directory will fail with the following msg, so switch cwd to a tmp directory.

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/var/lib/jenkins/workspace/torch/__init__.py", line 66, in <module>
    from torch.torch_version import __version__ as __version__
  File "/var/lib/jenkins/workspace/torch/torch_version.py", line 4, in <module>
    from torch.version import __version__ as internal_version
ModuleNotFoundError: No module named 'torch.version'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131812
Approved by: https://github.com/eellison, https://github.com/malfet
2024-07-26 22:31:44 +00:00
Sergii Dymchenko
da1a1fa55f Move load_yaml_file to common (#131924)
This is for https://github.com/pytorch/pytorch/pull/131724 and future timm_models.py refactoring.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131924
Approved by: https://github.com/shunting314, https://github.com/huydhn
2024-07-26 19:47:52 +00:00
zengxian
d3e932dc10 [CI] Add inductor cpu accuracy test running on AVX2 runners (#128682)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128682
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-07-26 13:24:41 +00:00
Animesh Jain
246e32055a [benchmark] Add hf_T5_generate to inline_inbuilt_nn_modules (#131804)
Fixes https://github.com/pytorch/pytorch/issues/121989

We are turning on the flag by default in another PR. But that PR can go
through reverts. So, forcibly adding the benchmark to prevent dashboard
fluctuation in case of reverts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131804
Approved by: https://github.com/yanboliang, https://github.com/shunting314
ghstack dependencies: #131795, #131801
2024-07-26 00:20:42 +00:00
Animesh Jain
01bc2a8165 [inline-inbuilt-nn-modules] Skip mobilenet_v2 test for cpu inductor (#131694)
Related issue https://github.com/pytorch/pytorch/issues/131693

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131694
Approved by: https://github.com/eellison
2024-07-25 02:49:16 +00:00
Justin Chu
9db567f17d [ONNX] Set dump_exported_program to True in bench (#131670)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131670
Approved by: https://github.com/titaiwangms
2024-07-24 20:02:03 +00:00
zengxian
8a591da3e7 [CI] Enable AOT inductor in cpu performance smoke test (#130097)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130097
Approved by: https://github.com/chuanqi129, https://github.com/desertfire
2024-07-23 03:44:13 +00:00
Xu Zhao
e3eaa22126 [torchbench][multisect] Run accuracy check at Diff time (#131266)
Summary:
X-link: https://github.com/pytorch/benchmark/pull/2388

We can enable accuracy checks at Diff time since it is not a performance metric.

* Refactor the existing diff time test to use the new PT2 Benchmark Runner.
* Deprecate the speedup tests and enable the accuracy tests only. We rely on ServiceLab to perform performance testing and regression detection.

Test Plan:
Sandcastle CI

Or buck test command:

```
buck2 test 'fbcode//mode/opt' fbcode//pytorch/benchmark/fb/test_gpu:run_test_gpu -- test_training_resnet50_accuracy
```

Test UI: https://www.internalfb.com/intern/testinfra/testrun/1688850102375429

Reviewed By: oulgen

Differential Revision: D59825601

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131266
Approved by: https://github.com/oulgen
2024-07-22 20:14:28 +00:00
Xuehai Pan
c0ed38e644 [BE][Easy][3/19] enforce style for empty lines in import segments in benchmarks/ (#129754)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129754
Approved by: https://github.com/ezyang
2024-07-17 14:34:42 +00:00
Xu Zhao
1d8baa4df2 [torchbench][servicelab] Fix servicelab test failures (#130781)
Fix servicelab test failures
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130781
Approved by: https://github.com/desertfire
2024-07-16 17:35:13 +00:00
Xu Zhao
213685ba97 [torchao][pt2 benchmark runner] Run performance test non-alternately (#130136)
Summary:
By default, performance tests (speedup experiments) will run the baseline and test backend alternately.

However, this does not work for the torchao backend, which will change the model in-place, therefore the baseline run will also run with torchao backend since the model has already been quantized.

Add a new experiment "latency_experiment" to run performance tests non-alternately (first run baseline for a few iterations, then run the test backend).

Test Plan:
```
buck2 run mode/opt //pytorch/benchmark:pt2 -- --only AlbertForMaskedLM --quantization noquant --performance --inference --bfloat16
```

```
buck2 run mode/opt //pytorch/benchmark:pt2 -- --only AlbertForMaskedLM --quantization autoquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune
```

Differential Revision: D59332736

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130136
Approved by: https://github.com/jerryzh168
2024-07-16 13:38:17 +00:00
eellison
9ab8d47f9d Constant folding for dynamic shape node (#129686)
Extend constant folding for dynamic shape node, only support pointwise op and some restricted ops

We support dynamic shapes by limiting constant folding of ops that are guaranteed to have uniform values (full, pointwise ops, and views) and running these operators with tensors of shape 1. This also eliminates the possibility of memory overhead of constant folding.

Taken over from https://github.com/pytorch/pytorch/pull/128937

joint work with @imzhuhl

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129686
Approved by: https://github.com/Chillee
ghstack dependencies: #130367
2024-07-16 00:17:11 +00:00
PyTorch MergeBot
9df4bc6a0d Revert "Constant folding for dynamic shape node (#129686)"
This reverts commit b7d287fbec.

Reverted https://github.com/pytorch/pytorch/pull/129686 on behalf of https://github.com/atalman due to Failing internally.  Test: https://github.com/pytorch/ao/blob/main/test/prototype/mx_formats/test_mx_linear.py ([comment](https://github.com/pytorch/pytorch/pull/129686#issuecomment-2228755295))
2024-07-15 15:19:24 +00:00
Xuehai Pan
4d7bf72d93 [BE][Easy] fix ruff rule needless-bool (SIM103) (#130206)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130206
Approved by: https://github.com/malfet
2024-07-14 08:17:52 +00:00
titaiwangms
18418a7dbb [ONNX] Fix torch_onnx patch accuracy bug in benchmark (#130586)
The ONNX related compilers have another route of accuracy check, and this PR brings torch_onnx compiler to the right measurement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130586
Approved by: https://github.com/justinchuby
2024-07-12 15:47:59 +00:00
eellison
b7d287fbec Constant folding for dynamic shape node (#129686)
Extend constant folding for dynamic shape node, only support pointwise op and some restricted ops

We support dynamic shapes by limiting constant folding of ops that are guaranteed to have uniform values (full, pointwise ops, and views) and running these operators with tensors of shape 1. This also eliminates the possibility of memory overhead of constant folding.

Taken over from https://github.com/pytorch/pytorch/pull/128937

joint work with @imzhuhl

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129686
Approved by: https://github.com/Chillee
ghstack dependencies: #130367
2024-07-12 03:44:29 +00:00
Xuehai Pan
973037be6a [BE][Easy] apply autofix for ruff rules unnecessary-collection-call (C408): list() / tuple() / dict() (#130199)
This PR changes the empty collection factory call to Python literals:

- `list()` -> `[]`
- `tuple()` -> `()`
- `dict()` -> `{}`

The Python literals are more performant and safer. For example, the bytecode for building an empty dictionary:

```bash
$ python3 -m dis - <<EOS
import collections

d1 = {}
d2 = dict()

dict = collections.OrderedDict
d3 = dict()
EOS
```

```text
  0           0 RESUME                   0

  1           2 LOAD_CONST               0 (0)
              4 LOAD_CONST               1 (None)
              6 IMPORT_NAME              0 (collections)
              8 STORE_NAME               0 (collections)

  3          10 BUILD_MAP                0
             12 STORE_NAME               1 (d1)

  4          14 PUSH_NULL
             16 LOAD_NAME                2 (dict)
             18 CALL                     0
             26 STORE_NAME               3 (d2)

  6          28 LOAD_NAME                0 (collections)
             30 LOAD_ATTR                8 (OrderedDict)
             50 STORE_NAME               2 (dict)

  7          52 PUSH_NULL
             54 LOAD_NAME                2 (dict)
             56 CALL                     0
             64 STORE_NAME               5 (d3)
             66 RETURN_CONST             1 (None)
```

The dict literal `{}` only has one bytecode `BUILD_MAP`, while the factory call `dict()` has three `PUSH_NULL + LOAD_NAME + CALL`. Also, the factory call is not safe if users override the `dict` name in `locals` or `globals` (see the example of replacing with `OrderedDict` above).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130199
Approved by: https://github.com/malfet
2024-07-11 17:30:28 +00:00
Shunting Zhang
c5ede865c4 [pt2-bench] raise tolerance for squeezenet1_1 (#130165)
The training accuracy for this model starts to regress. It does not show up on the weekly run yet but
1. it shows up in my MA runs [here](https://hud.pytorch.org/benchmark/torchbench/inductor_max_autotune?dashboard=torchinductor&startTime=Fri,%2028%20Jun%202024%2006:53:45%20GMT&stopTime=Fri,%2005%20Jul%202024%2006:53:45%20GMT&granularity=hour&mode=training&dtype=amp&lBranch=gh/shunting314/162/head&lCommit=cb236e8c198b54901e4fb19698f91be786f72e25&rBranch=main&rCommit=4ee1cb9b955fcc5d75a421b19393998122136f2c)
2. I can repro it locally

Command:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/torchbench.py --accuracy --training --amp --backend
 inductor --device cuda --only squeezenet1_1
```

Raise the tolerance to fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130165
Approved by: https://github.com/jansel
ghstack dependencies: #129996, #129941, #130005, #130163
2024-07-06 00:49:15 +00:00
Shunting Zhang
0fcbca9adb [pt2-bench] use eval mode for vision_maskrcnn (#130163)
Try to fix https://github.com/pytorch/pytorch/issues/130161

The reason that `--accuracy` works is we use eval mode. While `--training` does not work since we use training mode but TorchBench does not return targets tenors. In training mode, vision_maskrcnn requires targets tensors

I fix that to always use eval mode for vision_maskrcnn training.

With the fix, I start see a segfault: https://gist.github.com/shunting314/5a70df3463b2a4421b2c34aa88e78d1f

I'm not sure if that's due to my local setup but I think the fix in this PR is something we need any way. We can check the dashboard after the PR is in.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130163
Approved by: https://github.com/jansel
ghstack dependencies: #129996, #129941, #130005
2024-07-06 00:49:15 +00:00
Shunting Zhang
8f6765f7a7 [pt2-bench] fix accuracy failure for beit_base_patch16_224 during training (#130005)
This model's accuracy test recently regressed. I have a quite smooth debugging process to figure out the cause. So I'd like to write it down just in case it can be helpful.

Clicking the model name beit_base_patch16_224 on the dashboard, we are able to see the pass status of the model in e.g. the past month. For this model, we can see that it starts to fail on June 08:

<img width="1118" alt="Screenshot 2024-07-02 at 5 36 35 PM" src="https://github.com/pytorch/pytorch/assets/52589240/32f27ccd-3ec7-4431-88b3-febeff831f8e">

What's nice is the dashboard shows the nightly commits for each run.

Running
```
git log --oneline a448b3ae9537c0ae233fb9199a4a221fdffbb..0e6c204642a571d5a7cd60be0caeb9b50faca030 torch/_inductor/
```
Gives us the list of Inductor PRs between the good and bad commit: https://gist.github.com/shunting314/eb57965688fc9e1746fcfa9b7b6b02df

Roughly looking thru the PRs, I feel
```
ffc202a1b9 Added remove_noop_ops to joint_graph_passes (#124451)
```
can change numerics so I disable it locally by this one line change: https://gist.github.com/shunting314/13aec768bda986056d0fb40dce53396e  . And then the accuracy test pass. (Command: time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only beit_base_patch16_224  )

Horace's PR (https://github.com/pytorch/pytorch/pull/124451) itself is valid. It removes no-op ops in joint-graph. I think maybe the graph get changed and cause the partitioner do different recomputation decisions. That can cause some numerics change.

Since this is not a real issue, I'll raise the tolerance to make it pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130005
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #129996, #129941
2024-07-05 10:26:39 +00:00
Shunting Zhang
c0735a3dd3 [pt2-bench] fix accuracy failure for a few models (#129941)
This PR batch the fix for a few accuracy failures issues during training by raising tolerance. I do that only for models that I think it fails not due to real issue.

## sebotnet33ts_256

The accuracy test for this model start to fail around June 05 [link](https://hud.pytorch.org/benchmark/timm_models/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Sun%2C%2002%20Jun%202024%2007%3A19%3A38%20GMT&stopTime=Tue%2C%2002%20Jul%202024%2007%3A19%3A38%20GMT&granularity=day&mode=training&dtype=amp&lBranch=main&lCommit=04a0d856207d83c2031e4b9cb6825ba3e0092850&rBranch=main&rCommit=e62925930f6a62f6aeeb1fe1a661a9bd3352b53d&model=sebotnet33ts_256).

I can not repro locally, but from the log from the dashboard:
```
RMSE (res-fp64): 0.09441, (ref-fp64): 0.02971 and shape=torch.Size([1536]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
```
raising the tolerance should fix it.

## DebertaForQuestionAnswering

This model fails accuracy test on the dashboard only in max-autotune mode. I can not repro locally by command:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/huggingface.py --accuracy --no-translation-validation --training --amp --backend inductor --device cuda --only DebertaForQuestionAnswering
```

From error message on the dashboard:
```
RMSE (res-fp64): 0.01803, (ref-fp64): 0.00537 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
```

0.02 tolerance should suppress this error.

## gluon_inception_v3

This model fail on the dashboard in max-autotune mode. I can not repro locally by command
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only gluon_inception_v3
```

From error message on the dashboard
```
RMSE (res-fp64): 0.02798, (ref-fp64): 0.00730 and shape=torch.Size([384]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
Accuracy failed for key name Mixed_7c.branch3x3dbl_3a.bn.running_var
```
raising tolerance should suppress this error.

# mobilenetv3_large_100
Fail in MA model. I can not repro locally by command
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only
```
The error message on the dashboard is
```
RMSE (res-fp64): 0.29754, (ref-fp64): 0.05205 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
```

The tensor is so small that the noise can be high. I use larger multiplier for smaller tensor in torch._dynamo.utils.same.

# yolov3

Fail on dashboard with error
```
Error on the dashboard: RMSE (res-fp64): 0.01278, (ref-fp64): 0.00246 and shape=torch.Size([256]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```

Fix it by using a larger multiplier for smaller tensors and raising the tolereance.

# timm_efficientdet

Fail on the dashboard with error
```
E0623 18:37:43.638000 139924418725056 torch/_dynamo/utils.py:1468] RMSE (res-fp64): 0.00096, (ref-fp64): 0.00009 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```
But I can not repro locally with command
```
time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --only timm_efficientdet  --training
```

Raise the tolerance should fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129941
Approved by: https://github.com/jansel
ghstack dependencies: #129996
2024-07-05 10:26:39 +00:00
Jason Ansel
3240bff56a [benchmarking] Add join_results.py (#129202)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129202
Approved by: https://github.com/yanboliang, https://github.com/shunting314
2024-07-05 06:55:30 +00:00
PyTorch MergeBot
fa3953a2e1 Revert "[pt2-bench] fix accuracy failure for a few models (#129941)"
This reverts commit dafbd603ee.

Reverted https://github.com/pytorch/pytorch/pull/129941 on behalf of https://github.com/jeanschmidt due to Seems to have introduced breakages in main cuda12 focal jobs ([comment](https://github.com/pytorch/pytorch/pull/129996#issuecomment-2209175516))
2024-07-04 14:55:38 +00:00
PyTorch MergeBot
54da35a2e0 Revert "[pt2-bench] fix accuracy failure for beit_base_patch16_224 during training (#130005)"
This reverts commit 0af8c8a981.

Reverted https://github.com/pytorch/pytorch/pull/130005 on behalf of https://github.com/jeanschmidt due to Seems to have introduced breakages in main cuda12 focal jobs ([comment](https://github.com/pytorch/pytorch/pull/129996#issuecomment-2209175516))
2024-07-04 14:55:38 +00:00
titaiwangms
bffb278700 [ONNX] Add artifacts_dir to torch-onnx-patch in benchmark (#130069)
Add `artifacts_dir` to torch-onnx-patch to save error report for debugging.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130069
Approved by: https://github.com/justinchuby
2024-07-04 07:11:02 +00:00
Peter Bell
e2e624a02f [AOTAutograd] Micro-optimize runtime_wrapper (#128188)
This moves a bunch of runtime inspection of the `output_info` for alias handling into the construction of fixed output handlers that are created during compilation and captured by the runtime wrapper.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128188
Approved by: https://github.com/bdhirsh
2024-07-04 03:53:06 +00:00
Shunting Zhang
0af8c8a981 [pt2-bench] fix accuracy failure for beit_base_patch16_224 during training (#130005)
This model's accuracy test recently regressed. I have a quite smooth debugging process to figure out the cause. So I'd like to write it down just in case it can be helpful.

Clicking the model name beit_base_patch16_224 on the dashboard, we are able to see the pass status of the model in e.g. the past month. For this model, we can see that it starts to fail on June 08:

<img width="1118" alt="Screenshot 2024-07-02 at 5 36 35 PM" src="https://github.com/pytorch/pytorch/assets/52589240/32f27ccd-3ec7-4431-88b3-febeff831f8e">

What's nice is the dashboard shows the nightly commits for each run.

Running
```
git log --oneline a448b3ae9537c0ae233fb9199a4a221fdffbb..0e6c204642a571d5a7cd60be0caeb9b50faca030 torch/_inductor/
```
Gives us the list of Inductor PRs between the good and bad commit: https://gist.github.com/shunting314/eb57965688fc9e1746fcfa9b7b6b02df

Roughly looking thru the PRs, I feel
```
ffc202a1b9 Added remove_noop_ops to joint_graph_passes (#124451)
```
can change numerics so I disable it locally by this one line change: https://gist.github.com/shunting314/13aec768bda986056d0fb40dce53396e  . And then the accuracy test pass. (Command: time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only beit_base_patch16_224  )

Horace's PR (https://github.com/pytorch/pytorch/pull/124451) itself is valid. It removes no-op ops in joint-graph. I think maybe the graph get changed and cause the partitioner do different recomputation decisions. That can cause some numerics change.

Since this is not a real issue, I'll raise the tolerance to make it pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130005
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #129996, #129941
2024-07-04 01:14:29 +00:00
Shunting Zhang
dafbd603ee [pt2-bench] fix accuracy failure for a few models (#129941)
This PR batch the fix for a few accuracy failures issues during training by raising tolerance. I do that only for models that I think it fails not due to real issue.

## sebotnet33ts_256

The accuracy test for this model start to fail around June 05 [link](https://hud.pytorch.org/benchmark/timm_models/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Sun%2C%2002%20Jun%202024%2007%3A19%3A38%20GMT&stopTime=Tue%2C%2002%20Jul%202024%2007%3A19%3A38%20GMT&granularity=day&mode=training&dtype=amp&lBranch=main&lCommit=04a0d856207d83c2031e4b9cb6825ba3e0092850&rBranch=main&rCommit=e62925930f6a62f6aeeb1fe1a661a9bd3352b53d&model=sebotnet33ts_256).

I can not repro locally, but from the log from the dashboard:
```
RMSE (res-fp64): 0.09441, (ref-fp64): 0.02971 and shape=torch.Size([1536]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
```
raising the tolerance should fix it.

## DebertaForQuestionAnswering

This model fails accuracy test on the dashboard only in max-autotune mode. I can not repro locally by command:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/huggingface.py --accuracy --no-translation-validation --training --amp --backend inductor --device cuda --only DebertaForQuestionAnswering
```

From error message on the dashboard:
```
RMSE (res-fp64): 0.01803, (ref-fp64): 0.00537 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
```

0.02 tolerance should suppress this error.

## gluon_inception_v3

This model fail on the dashboard in max-autotune mode. I can not repro locally by command
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only gluon_inception_v3
```

From error message on the dashboard
```
RMSE (res-fp64): 0.02798, (ref-fp64): 0.00730 and shape=torch.Size([384]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
Accuracy failed for key name Mixed_7c.branch3x3dbl_3a.bn.running_var
```
raising tolerance should suppress this error.

# mobilenetv3_large_100
Fail in MA model. I can not repro locally by command
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only
```
The error message on the dashboard is
```
RMSE (res-fp64): 0.29754, (ref-fp64): 0.05205 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
```

The tensor is so small that the noise can be high. I use larger multiplier for smaller tensor in torch._dynamo.utils.same.

# yolov3

Fail on dashboard with error
```
Error on the dashboard: RMSE (res-fp64): 0.01278, (ref-fp64): 0.00246 and shape=torch.Size([256]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```

Fix it by using a larger multiplier for smaller tensors and raising the tolereance.

# timm_efficientdet

Fail on the dashboard with error
```
E0623 18:37:43.638000 139924418725056 torch/_dynamo/utils.py:1468] RMSE (res-fp64): 0.00096, (ref-fp64): 0.00009 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```
But I can not repro locally with command
```
time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --only timm_efficientdet  --training
```

Raise the tolerance should fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129941
Approved by: https://github.com/jansel
ghstack dependencies: #129996
2024-07-04 01:14:29 +00:00
Valentine233
62b710782d change LayoutLMForSequenceClassification inference accuracy tolerance (#129728)
Fixes #128510.

https://github.com/pytorch/pytorch/pull/124451 makes LayoutLMForSequenceClassification hit the SDPA pattern 1 and then encounter the accuracy issue. The issue only happens with BF16 inference single thread. This PR tends to increase the model tolerance and make the check pass. Note that even the math-version SDPA could have the issue because of some small implementation diff.

The test log:
Single thread
```
correct_result:  SequenceClassifierOutput(loss=tensor(0.5998), logits=tensor([[0.3301, 0.1338]], dtype=torch.bfloat16), hidden_states=None, attentions=None)
new_result:  SequenceClassifierOutput(loss=tensor(0.6016), logits=tensor([[0.3281, 0.1357]], dtype=torch.bfloat16), hidden_states=None, attentions=None)
E0627 01:09:16.762789 140281313759104 torch/_dynamo/utils.py:1476] RMSE (res-fp64): 0.00151, (ref-fp64): 0.00046 and shape=torch.Size([1, 2]). res.dtype: torch.bfloat16, multiplier: 3.000000, tol: 0.001000
E0627 01:09:16.762972 140281313759104 torch/_dynamo/utils.py:1390] Accuracy failed for key name logits
fail_accuracy
```

Multiple threads
```
correct_result:  SequenceClassifierOutput(loss=tensor(0.6007), logits=tensor([[0.3301, 0.1357]], dtype=torch.bfloat16), hidden_states=None, attentions=None)
new_result:  SequenceClassifierOutput(loss=tensor(0.6016), logits=tensor([[0.3281, 0.1357]], dtype=torch.bfloat16), hidden_states=None, attentions=None)
pass
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129728
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-03 06:28:27 +00:00
Aaron Gokaslan
6c2a8b6b38 [Ez][BE]: Enable new stable ruff rules (#129825)
Applies a bunch of new ruff lint rules that are now stable. Some of these improve efficiency or readability. Since I already did passes on the codebase for these when they were in preview, there should be relatively few changes to the codebase. This is just more for future hardening of it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129825
Approved by: https://github.com/XuehaiPan, https://github.com/jansel, https://github.com/malfet
2024-07-02 14:47:10 +00:00
Xu Zhao
cc9b005bf2 Enable torchao nightly workflow (#129779)
Summary:
Make the following improvements:
* Schedule the torchao benchmark nightly
* Enable torchbench, timm, and huggingface models
* Refactor the benchmarking script to better arrange the benchmarking groups

Test workflow: https://github.com/pytorch/benchmark/actions/runs/9705589352

X-link: https://github.com/pytorch/benchmark/pull/2336

Differential Revision: D59074571

Pulled By: xuzhao9

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129779
Approved by: https://github.com/jerryzh168
2024-07-01 14:28:38 +00:00
Weizhuo Zhang
fff633f087 [CI] Enable AOT inductor FP32 accuracy test for CPU (#129040)
This PR enabled AOT inductor backend FP32 accuracy check for CPU in CI workflow, which could catch AOT inductor issue at early stage.

**Test Time cost:**
| Suite       	| Precision 	| Time cost 	|
|-------------	|-----------	|-----------	|
| Huggingface 	| FP32      	|   1h12m   	|
| Timm models 	| FP32      	|   1h32m   	|
|  Torchbench 	| FP32      	|   1h40m   	|

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129040
Approved by: https://github.com/chuanqi129, https://github.com/desertfire, https://github.com/malfet
2024-06-30 14:00:09 +00:00
Xuehai Pan
4ee1cb9b95 [BE][Easy] replace import pathlib with from pathlib import Path (#129426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129426
Approved by: https://github.com/malfet
2024-06-30 01:36:07 +00:00
PyTorch MergeBot
2effbcfcd8 Revert "[BE][Easy] replace import pathlib with from pathlib import Path (#129426)"
This reverts commit 6d75604ef1.

Reverted https://github.com/pytorch/pytorch/pull/129426 on behalf of https://github.com/XuehaiPan due to recognize `Path` as new exported API ([comment](https://github.com/pytorch/pytorch/pull/129426#issuecomment-2198371625))
2024-06-29 23:24:06 +00:00
Xuehai Pan
6d75604ef1 [BE][Easy] replace import pathlib with from pathlib import Path (#129426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129426
Approved by: https://github.com/malfet
2024-06-29 15:42:09 +00:00
Animesh Jain
a676b7c5f3 Add XGLMForCausalLM to the flaky model list (#129776)
Not failing on devGPU. Went to CI machine ... flaky. So adding to the flaky list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129776
Approved by: https://github.com/mlazos
ghstack dependencies: #129583, #129610, #129775
2024-06-29 05:47:28 +00:00
Animesh Jain
5d1763d159 Add lcnet to the inline_inbuilt_nn_module list (#129775)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129775
Approved by: https://github.com/mlazos
ghstack dependencies: #129583, #129610
2024-06-29 05:47:28 +00:00
PyTorch MergeBot
3d96217891 Revert "[BE][Easy] use pathlib.Path instead of dirname / ".." / pardir (#129374)"
This reverts commit 9e1f3ecaa7.

Reverted https://github.com/pytorch/pytorch/pull/129374 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is still failing with the same error ([comment](https://github.com/pytorch/pytorch/pull/129374#issuecomment-2197801405))
2024-06-29 00:47:15 +00:00
Xuehai Pan
9e1f3ecaa7 [BE][Easy] use pathlib.Path instead of dirname / ".." / pardir (#129374)
Changes by apply order:

1. Replace all `".."` and `os.pardir` usage with `os.path.dirname(...)`.
2. Replace nested `os.path.dirname(os.path.dirname(...))` call with `str(Path(...).parent.parent)`.
3. Reorder `.absolute()` ~/ `.resolve()`~ and `.parent`: always resolve the path first.

    `.parent{...}.absolute()` -> `.absolute().parent{...}`

4. Replace chained `.parent x N` with `.parents[${N - 1}]`: the code is easier to read (see 5.)

    `.parent.parent.parent.parent` -> `.parents[3]`

5. ~Replace `.parents[${N - 1}]` with `.parents[${N} - 1]`: the code is easier to read and does not introduce any runtime overhead.~

    ~`.parents[3]` -> `.parents[4 - 1]`~

6. ~Replace `.parents[2 - 1]` with `.parent.parent`: because the code is shorter and easier to read.~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129374
Approved by: https://github.com/justinchuby, https://github.com/malfet
2024-06-28 00:35:15 +00:00
Wei Wang
9b5b93c58f [CUDA][Inductor][CI] Revert PR#127150 since cu124 is now behaving similar enough to cu121 (#128423)
Pre-requisite: close https://github.com/pytorch/pytorch/issues/126692 first.

This PR also gives a current read on cu121 and cu124 parity.

Essentially reverting #127150

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128423
Approved by: https://github.com/atalman, https://github.com/eqy
2024-06-27 05:22:18 +00:00
PyTorch MergeBot
895316119d Revert "[BE][Easy] use pathlib.Path instead of dirname / ".." / pardir (#129374)"
This reverts commit 0314c4c101.

Reverted https://github.com/pytorch/pytorch/pull/129374 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it causes lots of internal build failures where they fail to find hipify module ([comment](https://github.com/pytorch/pytorch/pull/129374#issuecomment-2192437052))
2024-06-26 19:03:57 +00:00
PyTorch MergeBot
e9aefad641 Revert "[CUDA][Inductor][CI] Revert PR#127150 since cu124 is now behaving similar enough to cu121 (#128423)"
This reverts commit 551e412718.

Reverted https://github.com/pytorch/pytorch/pull/128423 on behalf of https://github.com/nWEIdia due to Sorry for reverting your change but I need to revert it to cleanly revert https://github.com/pytorch/pytorch/pull/129374 ([comment](https://github.com/pytorch/pytorch/pull/128423#issuecomment-2192423840))
2024-06-26 18:54:41 +00:00
Xu Zhao
474d743dba [torchao][benchmark] Skip all accuracy tests by returning pass_due_to_skip (#129545)
Summary: As the title says.

Test Plan:
```
buck2 run mode/opt //pytorch/benchmark:pt2 -- --only BERT_pytorch --quantization noquant --inference --bfloat16 --accuracy
```

Differential Revision: D59040593

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129545
Approved by: https://github.com/HDCharles
2024-06-26 14:21:53 +00:00
Wei Wang
551e412718 [CUDA][Inductor][CI] Revert PR#127150 since cu124 is now behaving similar enough to cu121 (#128423)
Pre-requisite: close https://github.com/pytorch/pytorch/issues/126692 first.

This PR also gives a current read on cu121 and cu124 parity.

Essentially reverting #127150

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128423
Approved by: https://github.com/atalman, https://github.com/eqy
2024-06-25 20:59:49 +00:00
DiweiSun
ae0f84d89c [CI] Enable amp accuracy check for inductor cpu (#127758)
This is to enable inductor AMP accuracy check for on CPU in CI workflow to capture issue early. Three suites are included: timms, huggingface as well as torchbench.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127758
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-06-25 20:34:18 +00:00
Weizhuo Zhang
53f462c506 Write dynamo benchmarks performance result to csv when throw exceptions (#126764)
**Performance mode Issue**: When dynamo benchmarks performance warm-up failed, the result will be not written into csv file. But the accuracy will be written as `fail_to_run` even when dynamo pass failed. So the accuracy model number is not aligned with performance model number for each of their csv files.
![image](https://github.com/pytorch/pytorch/assets/84730719/9043d215-130b-46b4-a835-f148c225947c)

- **Fix**: The warm-up failed models will be recorded into csv file shown as following:
![image](https://github.com/pytorch/pytorch/assets/84730719/7907a3c2-c942-42bb-b31c-55424a0e8117)

**Accuracy mode issue**: `detectron2_fasterrcnn_r` models failed on accuracy mode, but was tested successfully on performance mode. The accuracy failure is same as PR ee557d8f61.
```
Dynamic Shape:
Traceback (most recent call last):
  File "benchmarks/dynamo/torchbench.py", line 449, in <module>
    torchbench_main()
  File "benchmarks/dynamo/torchbench.py", line 445, in torchbench_main
    main(TorchBenchmarkRunner(), original_dir)
  File "/workspace/pytorch/benchmarks/dynamo/common.py", line 3650, in main
    process_entry(0, runner, original_dir, args)
  File "/workspace/pytorch/benchmarks/dynamo/common.py", line 3582, in process_entry
    return run(runner, args, original_dir)
  File "/workspace/pytorch/benchmarks/dynamo/common.py", line 4163, in run
    assert marked, f"nothing in example_inputs had a dim with {batch_size}"
AssertionError: nothing in example_inputs had a dim with 4
```
![image](https://github.com/pytorch/pytorch/assets/84730719/f25392f0-f982-46c8-8e2c-a8a25d85a21a)

- **Fix**: same as PR ee557d8f61, the batch_size will be skipped to set as 4 when testing dynamic shapes.

Dynamic shapes passrate improved from 89% -> **95%**
| Comp Item | Compiler | suite      | before     | After fix  |
|-----------|----------|------------|------------|------------|
| Pass Rate | Inductor | torchbench | 89%, 73/82 | 95%, 79/83 |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126764
Approved by: https://github.com/jansel
2024-06-25 17:49:04 +00:00
Xuehai Pan
0314c4c101 [BE][Easy] use pathlib.Path instead of dirname / ".." / pardir (#129374)
Changes by apply order:

1. Replace all `".."` and `os.pardir` usage with `os.path.dirname(...)`.
2. Replace nested `os.path.dirname(os.path.dirname(...))` call with `str(Path(...).parent.parent)`.
3. Reorder `.absolute()` ~/ `.resolve()`~ and `.parent`: always resolve the path first.

    `.parent{...}.absolute()` -> `.absolute().parent{...}`

4. Replace chained `.parent x N` with `.parents[${N - 1}]`: the code is easier to read (see 5.)

    `.parent.parent.parent.parent` -> `.parents[3]`

5. ~Replace `.parents[${N - 1}]` with `.parents[${N} - 1]`: the code is easier to read and does not introduce any runtime overhead.~

    ~`.parents[3]` -> `.parents[4 - 1]`~

6. ~Replace `.parents[2 - 1]` with `.parent.parent`: because the code is shorter and easier to read.~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129374
Approved by: https://github.com/justinchuby, https://github.com/malfet
2024-06-25 08:28:38 +00:00
titaiwangms
0e1e289033 [ONNX] Benchmark refactored ONNX export (#129427)
Reuse torch.onnx.export with torch_onnx patch to test ExportedProgram -> ONNX IR exporter

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129427
Approved by: https://github.com/justinchuby
2024-06-25 04:47:53 +00:00
Huy Do
b72ef9df0d Update torchbench model expected accuracy values after pinning numpy (#129213)
After pinning numpy on torchbench, we need to move torchbench inductor benchmark jobs out of unstable state asap, so that more failures don't sneak it.  I'm updating the expected values here to make trunk green.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129213
Approved by: https://github.com/xuzhao9, https://github.com/malfet, https://github.com/desertfire
2024-06-22 04:59:50 +00:00
Jason Ansel
bdc39eef3b [inductor] Add --inductor-config benchmark flag (#129034)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129034
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #129024, #129033
2024-06-21 16:53:42 +00:00
Simon Fan
123812790b [compiled autograd] update benchmarks to use cli flags for fullgraph/dynamic (#127960)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127960
Approved by: https://github.com/jansel
2024-06-21 08:16:33 +00:00
Deng Weishi
b542825066 Enable deterministic support for oneDNN (#127277)
This PR is a part of RFC https://github.com/pytorch/pytorch/issues/114848.
For the request for Torchbenchmark models, this PR enables the deterministic attribute for the oneDNN operators for XPU backends, like convolution, deconvolution and matmult.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127277
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/desertfire, https://github.com/gujinghui
2024-06-21 05:21:24 +00:00
Xu Zhao
1835e3beab Fix the inductor ci (#128879)
Fix the torchbench+inductor ci on trunk due to recent upgrade to numpy 2.0.0rc1.
We have to remove DALLE2_pytorch model, since it depends on embedding-reader, which is not compatible with numpy>2: https://github.com/rom1504/embedding-reader/blob/main/requirements.txt#L3

Fixes #128845

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128879
Approved by: https://github.com/eellison
2024-06-17 22:20:33 +00:00
PyTorch MergeBot
c172b58fe0 Revert "Update DALLE2_pytorch expected accuracy result on CPU (#128718)"
This reverts commit fd27138c4a.

Reverted https://github.com/pytorch/pytorch/pull/128718 on behalf of https://github.com/huydhn due to This has reverted back to the previous expected value for some reason 153362fbc9 ([comment](https://github.com/pytorch/pytorch/pull/128718#issuecomment-2174194219))
2024-06-17 18:49:15 +00:00
eellison
5344c41d43 Use forked torchbench branch with pinned numpy (#128856)
Adds pinned numpy commit to yolov3 dependencies to the existing pinned commit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128856
Approved by: https://github.com/huydhn, https://github.com/PaliC
2024-06-17 18:41:42 +00:00