Commit Graph

1797 Commits

Author SHA1 Message Date
Laith Sakka
6a1c451479 Don't uselessly recompute axiom dict every static eval call (#138967)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138967
Approved by: https://github.com/ezyang
2024-10-31 21:16:55 +00:00
Laith Sakka
c056dc4cb8 In Inductor, be willing to generate deferred runtime asserts when unbacked (#138804)
As the title says; in addition, we avoid calling defer_assert when the guard result is statically known.
Timing for pnasnet5large:

```
TIMING: code_gen:21.79672 inductor_compile:39.57726 backend_compile:65.30649 entire_frame_compile:95.22052 total_wall_time:95.22052
```
This matches the timing without the diff:
```
TIMING: code_gen:21.89314 inductor_compile:39.72298 backend_compile:65.38539 entire_frame_compile:95.0854 total_wall_time:95.0854
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138804
Approved by: https://github.com/ezyang
2024-10-28 02:19:55 +00:00
Aaron Gokaslan
5d074746e9 [BE]: Add better optional typing (#138426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138426
Approved by: https://github.com/XuehaiPan, https://github.com/malfet
2024-10-27 14:19:00 +00:00
Laith Sakka
705f5b3489 Several enhancements for check_results.py (#137925)
1) Always generate expected_results.csv with only the first three digits of each value kept,
e.g. 112313212312 --> 112000000000.
2) Regenerate all records in expected_results.csv, not just the failed ones. Why? Because if a change shifts a result
by 1.3% and the noise margin is 1.5%, we still want the file to reflect that shift.
3) Add the message "please update all results that changed significantly, and not only the failed ones".
```
(myenv) [lsakka@devgpu005.nha1 ~/pytorch/benchmarks/dynamo/pr_time_benchmarks (check_result_ehancements)]$ python check_results.py test_check_result/expected_test.csv test_check_result/result_test.csv out
WIN: benchmark ('a', 'instruction count') failed, actual result 9011111111 is -18.16% lower than expected 11011111111 ±1.00% please update the expected results.

please update all results that changed significantly, and not only the failed ones
REGRESSION: benchmark ('b', 'memory') failed, actual result 20011111111 is 99.89% higher than expected 10011111111 ±+10.00% if this is an expected regression, please update the expected results.

please update all results that changed significantly, and not only the failed ones
REGRESSION: benchmark ('c', 'something') failed, actual result 107111111111 is 969.92% higher than expected 10011111111 ±+10.00% if this is an expected regression, please update the expected results.

please update all results that changed significantly, and not only the failed ones
MISSING REGRESSION TEST: benchmark ('d', 'missing-test') does not have a regression test enabled for it.

new expected results file content if needed:
a,instruction count,9011000000,0.01
b,memory,20010000000,0.1
c,something,107100000000,0.1

There was some failures you can use the new reference expected result stored at path:out and printed above

```
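
Item 1 above (keeping only the leading digits of each expected value) can be sketched as follows. This is a minimal illustration, not the code in check_results.py: the helper name `keep_leading_digits` is hypothetical, and truncating rather than rounding is an assumption. The sample output above appears to retain four leading digits, so the digit count is left as a parameter.

```python
def keep_leading_digits(value: int, digits: int = 3) -> int:
    """Zero out everything after the first `digits` digits of a non-negative integer."""
    if value < 0:
        raise ValueError("expected a non-negative count")
    length = len(str(value))
    if length <= digits:
        return value
    scale = 10 ** (length - digits)
    # Integer division drops the trailing digits (truncation, an assumption).
    return (value // scale) * scale


print(keep_leading_digits(112313212312))          # 112000000000
print(keep_leading_digits(9011111111, digits=4))  # 9011000000
```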

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137925
Approved by: https://github.com/aorenste
2024-10-26 16:27:55 +00:00
Laith Sakka
10e2840ce3 Enable failing diffs on update_hint_regression and sum_floordiv_regression and autograd benchmarks regression (#137548)
update_hint_regression has been behaving well, so I am setting a 2% noise threshold for it, and 1.5% for sum_floordiv_regression.

I have one concern with the way we do regression detection: small changes below the threshold will accumulate and eventually trigger a failure. To avoid that, we would have to keep an eye on the dashboard and potentially refresh the expected results file regularly, even when there are no failures.
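
The concern can be made concrete with a small sketch. It assumes the check compares each result against a fixed expected value with a symmetric relative noise margin, as the WIN / REGRESSION / PASS messages elsewhere in this log suggest. The function name `classify` and the 0.7% step are hypothetical.

```python
def classify(expected: float, actual: float, noise_margin: float) -> str:
    """Compare a result against a fixed baseline with a relative noise margin."""
    if actual > expected * (1 + noise_margin):
        return "REGRESSION"
    if actual < expected * (1 - noise_margin):
        return "WIN"
    return "PASS"


expected = 100.0  # fixed baseline from the expected results file
actual = expected
statuses = []
for _ in range(3):
    actual *= 1.007  # each change adds 0.7%, well under the 2% margin
    statuses.append(classify(expected, actual, noise_margin=0.02))

print(statuses)  # ['PASS', 'PASS', 'REGRESSION']
```

The third change is flagged even though it is no larger than the first two, which is why refreshing the baseline periodically would help.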

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137548
Approved by: https://github.com/aorenste
2024-10-26 07:28:49 +00:00
Pian Pawakapan
09848c892a [aot_compile] propagate ShapeEnv during lowering (#138362)
We found that during `export() -> _inductor.aot_compile()` lowering, 3 different ShapeEnvs get created, leading to errors when one ShapeEnv processes expressions created by another ShapeEnv. This plumbs through the 2 places where ShapeEnv creation happens, detecting the original ShapeEnv from the GraphModule example values, so that the original ShapeEnv is reused.

Differential Revision: D64613290

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138362
Approved by: https://github.com/angelayi
2024-10-24 22:22:14 +00:00
PyTorch MergeBot
8197e4c70d Revert "[sparse] add search for optimal alg_id to torch.compile (#137427)"
This reverts commit 39bfba3f56.

Reverted https://github.com/pytorch/pytorch/pull/137427 on behalf of https://github.com/jcaip due to this PR breaks AO tests ([comment](https://github.com/pytorch/pytorch/pull/137427#issuecomment-2435906592))
2024-10-24 17:27:06 +00:00
Laith Sakka
ed313a5ca2 Introduce torch.sym_add, variadic add (#138660)
Tested internally here: https://www.internalfb.com/diff/D64057744
This is a reland after previous internal failures.
The main change is:
```
if min is None and max is None:
    torch._check_is_size(size)
    return
```

Partially addresses https://github.com/pytorch/pytorch/issues/128150

When you have big sums of values, we end up computing long chains of
binary addition in our FX graph representation.  Not only is this ugly,
it also is quadratic, as the sympy.Add constructor is O(N) in number
of arguments.  Instead, ensure that we maintain the summation as a
single FX node so we can do the entire addition all in one go.
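
To see where the quadratic cost comes from, the sketch below uses a toy flattening sum node whose constructor visits every argument, standing in for the O(N) constructor described above. The `Add` class and the helper names are hypothetical; they are not sympy or PyTorch APIs.

```python
class Add:
    """Toy stand-in for a flattening n-ary sum node (not sympy.Add)."""

    work = 0  # total number of arguments visited across all constructions

    def __init__(self, *args):
        flat = []
        for arg in args:
            # Flatten nested sums, as a canonicalizing constructor would.
            flat.extend(arg.args if isinstance(arg, Add) else [arg])
        Add.work += len(flat)
        self.args = tuple(flat)


def chained_sum(terms):
    """Build the sum as a chain of binary additions."""
    acc = terms[0]
    for term in terms[1:]:
        acc = Add(acc, term)
    return acc


def variadic_sum(terms):
    """Build the sum with a single n-ary construction."""
    return Add(*terms)


def measure(build, n):
    """Return how many arguments were visited while summing n terms."""
    Add.work = 0
    build([f"s{i}" for i in range(n)])
    return Add.work


print(measure(chained_sum, 100))   # 5049
print(measure(variadic_sum, 100))  # 100
```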

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138660
Approved by: https://github.com/ezyang, https://github.com/bobrenjc93
2024-10-23 17:42:41 +00:00
Jesse Cai
39bfba3f56 [sparse] add search for optimal alg_id to torch.compile (#137427)
Summary:

This PR adds a lowering for `torch._cslt_sparse_mm` to find the optimal
alg_id and cache it when running with `torch.compile`

Seeing speedups on both bfloat16 and float8 dtypes:
<img width="641" alt="Screenshot 2024-10-17 at 2 10 38 PM" src="https://github.com/user-attachments/assets/b928cd11-32a3-43e5-b209-8e4028896f0b">
<img width="1274" alt="Screenshot 2024-10-17 at 1 39 03 PM" src="https://github.com/user-attachments/assets/d9edd684-a8ec-46fd-b3da-2e76dbcb7bb6">

* `torch._cslt_sparse_mm_search` has been modified to return optimal
  split-k parameters as well as max alg_id.

* max_id is now available in `torch.backends.cusparselt` via
  `torch.backends.cusparselt.get_max_alg_id()`

* fixed meta registrations for float8

Test Plan:

python test/test_sparse_semi_structured.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137427
Approved by: https://github.com/cpuhrsch
2024-10-22 22:39:42 +00:00
Ryan Guo
0a4197490c Delay mul/pow expansion for _SympyT to enable more folding (#138235)
Instead of calling `safe_expand` right after symbolic expression construction, we invoke it in `ShapeEnv.simplify`. This enables more simplification with product form, e.g.,
```
(a + b)^2 / (a + b) --> (a + b)
```
which won't happen if we expand eagerly during product construction:
```
(a^2 + 2ab + b^2) / (a + b) --> no change
```
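
The effect of keeping the product form can be illustrated with a toy model in which a product is stored as a map from factor to exponent and division subtracts exponents. This is only a sketch of the idea; it is not how sympy or `ShapeEnv.simplify` represent expressions, and the `divide` helper is hypothetical.

```python
from collections import Counter


def divide(numerator: Counter, denominator: Counter) -> dict:
    """Divide two products by subtracting exponents of matching factors."""
    exponents = Counter(numerator)
    exponents.subtract(denominator)
    # Factors whose exponent reaches zero cancel out entirely.
    return {factor: power for factor, power in exponents.items() if power != 0}


# Product form: (a + b)^2 is stored as the factor "(a + b)" with exponent 2.
product_form = Counter({"(a + b)": 2})
# Eagerly expanded form: a single opaque factor with exponent 1.
expanded_form = Counter({"(a^2 + 2ab + b^2)": 1})
denominator = Counter({"(a + b)": 1})

print(divide(product_form, denominator))   # {'(a + b)': 1}
print(divide(expanded_form, denominator))  # {'(a^2 + 2ab + b^2)': 1, '(a + b)': -1}
```

In the expanded case the shared factor is no longer visible, so nothing cancels.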

Fixes #136044.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138235
Approved by: https://github.com/ezyang
2024-10-21 16:38:47 +00:00
Animesh Jain
0a2407b93c [dynamo] Support omegaconf DictConfig (#138378)
Fixes https://github.com/pytorch/pytorch/issues/138224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138378
Approved by: https://github.com/jansel
ghstack dependencies: #138359
2024-10-20 02:43:17 +00:00
Chong Gu
d512d0e227 Always use aten.constant_pad_nd for mm padding (#137820)
Summary: In experiments, aten.constant_pad_nd gives better QPS than torch.cat. The QPS gain is ~10% for ig ctr and ~5% for oc.

Test Plan:
```
buck2 run mode/opt -c fbcode.nvcc_arch=a100 //caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/585279927/480/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend=AOT_INDUCTOR
```
```
buck2 run mode/opt //caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/588102397/1500/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend=AOT_INDUCTOR
```

Differential Revision: D64271583

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137820
Approved by: https://github.com/eellison
2024-10-18 19:35:03 +00:00
Brian Hirsh
a682194a11 inductor: use previous guards to know if a size is 1 for broadcasting (#136670)
Fixes https://github.com/pytorch/pytorch/issues/136640

Today, inductor has some logic to figure out when it needs to do broadcasting during lowering, which just checks if any of the input shapes have sizes equal to 1.

However, we should already have this information by the time we get to inductor, because our FakeTensor compute will have branched/guarded appropriately on whether any ops performed broadcasting.

In particular, if we have a tensor with a size value of `(64//((2048//(s3*((s2//s3))))))`, and it happens to be equal to one (and it is used in an op that requires this dim to be broadcasted), FakeTensorProp will have generated a guard:
```
Eq((64//((2048//(s3*((s2//s3)))))), 1)
```

I chose the simplest possible way to beef up inductor's checks to know when a given size is equal to 1: loop over the existing shape env guards, and if our current size is a sympy expression on the LHS of one of our `Eq(LHS, 1)` guards, then return True.
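
A minimal sketch of that guard lookup is shown below. Guards are modeled as plain `(op, lhs, rhs)` tuples with string expressions; the real shape env stores sympy expressions, and the function name `is_known_one` is hypothetical.

```python
def is_known_one(size: str, guards: list) -> bool:
    """Return True if `size` is literally 1 or an existing guard says Eq(size, 1)."""
    if size == "1":
        return True
    for op, lhs, rhs in guards:
        # Only guards of the form Eq(LHS, 1) whose LHS is this exact size qualify.
        if op == "Eq" and rhs == 1 and lhs == size:
            return True
    return False


guards = [
    ("Eq", "64//(2048//(s3*(s2//s3)))", 1),
    ("Ne", "s0", 1),
]

print(is_known_one("64//(2048//(s3*(s2//s3)))", guards))  # True
print(is_known_one("s0", guards))                         # False
```

Each call scans every guard, which is the cost the question about caching `Eq(LHS, 1)` guards refers to.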

I'm hoping for feedback on whether or not this approach is reasonable. One better option I could imagine is that our symbolic reasoning should have automatically simplified the size of our tensor down to a constant as part of evaluating that guard. I was originally going to try to do this directly in the shape env, but I ran into a few issues:

(1) I wanted to call some version of `set_replacement(expr, 1)`. But `set_replacement()` only accepts plain symbols on the LHS, not expressions

(2) in theory I could get this to work if I could rework the above expression to move everything that is not a free variable to the RHS, e.g. `Eq(s2, 32)`. It looks like our existing  `try_solve()` logic is... [not quite able](https://github.com/pytorch/pytorch/blob/main/torch/utils/_sympy/solve.py#L27) to do this generally though.

Checking the guards feels pretty simple-and-easy. Are we worried that it is too slow to iterate over all the guards? I could also cache the lookup so we only need to iterate over guards that are of the form `Eq(LHS, 1)`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136670
Approved by: https://github.com/ezyang
2024-10-16 22:41:39 +00:00
Isuru Fernando
120fbe9caa Update inductor benchmark time to avoid flakiness (#137900)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137900
Approved by: https://github.com/laithsakka
2024-10-15 16:17:04 +00:00
Edward Z. Yang
5c3ba6faff Add fbscribelogger to Dynamo benchmark runner (#137867)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137867
Approved by: https://github.com/bobrenjc93
2024-10-15 04:36:41 +00:00
Isuru Fernando
08ce3aac62 Cache some ValueRanges (#137438)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137438
Approved by: https://github.com/ezyang
2024-10-13 19:23:34 +00:00
Bin Bao
cfc5d18aad [AOTI] Turn on the ABI-compatible mode as default (#136534)
Summary: Make AOTI generate ABI-compatible code as default for OSS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136534
Approved by: https://github.com/chenyang78
ghstack dependencies: #137660
2024-10-13 14:42:58 +00:00
Valentine233
67883e70c0 change GPT2ForSequenceClassification inference accuracy tolerance (#136749)
Fixes https://github.com/pytorch/pytorch/issues/123503.

https://github.com/pytorch/pytorch/pull/121866 makes GPT2ForSequenceClassification hit SDPA pattern 18 and then encounter the accuracy issue. The issue only happens with single-threaded BF16 inference. This PR increases the model tolerance from 4e-3 to 5e-3 so that the check passes. Note that the issue is due to small implementation differences. For example, the SDPA math backend scales q and k before the matmul for stability; the flash attention backend differs more, as it is a different algorithm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136749
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-10-12 01:12:28 +00:00
PyTorch MergeBot
c58e5c4efa Revert "[AOTI] Turn on the ABI-compatible mode as default (#136534)"
This reverts commit b0da076f0c.

Reverted https://github.com/pytorch/pytorch/pull/136534 on behalf of https://github.com/desertfire due to The dependent PR https://github.com/pytorch/pytorch/pull/137660 fails in fbcode ([comment](https://github.com/pytorch/pytorch/pull/136534#issuecomment-2408211238))
2024-10-11 22:50:58 +00:00
Xuehai Pan
267f82b860 [BE] Format .ci/ / .github/ / benchmarks/ / functorch/ / tools/ / torchgen/ with ruff format (#132577)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132577
Approved by: https://github.com/malfet
2024-10-11 18:30:26 +00:00
Laith Sakka
a06d49a9f9 bump up add_loop_inductor_gpu expected instruction count. (#137672)
The diff https://github.com/pytorch/pytorch/pull/137117/files increased the instruction count for add_loop_inductor_gpu,
but not enough to fail in that diff. Now, however, the test is somewhat flaky.

it failed on recent merge:
<img width="1351" alt="Screenshot 2024-10-09 at 5 25 57 PM" src="https://github.com/user-attachments/assets/27178f76-c08e-4d13-9ac4-4cd70f146611">

and here is the history
<img width="1047" alt="Screenshot 2024-10-09 at 5 26 07 PM" src="https://github.com/user-attachments/assets/bd563e34-6f7f-461a-ae54-8a616be9bf09">
<img width="777" alt="Screenshot 2024-10-09 at 5 30 19 PM" src="https://github.com/user-attachments/assets/d0a1ca81-2bdb-4cf6-8ac8-ba5971d447bf">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137672
Approved by: https://github.com/masnesral
2024-10-11 16:46:38 +00:00
Bin Bao
b0da076f0c [AOTI] Turn on the ABI-compatible mode as default (#136534)
Summary: Make AOTI generate ABI-compatible code as default for OSS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136534
Approved by: https://github.com/chenyang78
ghstack dependencies: #137660
2024-10-10 23:44:57 +00:00
Colin Peppler
9690cacd61 [aotinductor] Add helper fn to atomically apply size_hint to an expr w/ unbacked symints (#137537)
### Context
Fixes CUDA IMA in autotune_at_compile_time, where we would generate an example tensor with an incorrect stride.

In the case below, the stride should be (u0 * 128, 128, 1). However, we apply the fallback on the entire expr (i.e. u0 * 128).
```
# buf817 = tensor(size=(s0, u0, 128), stride=(u0 * 128, 128, 1))

buf812 = generate_example_value(
    (64, 8192, 128), (8192, 128, 1), "cuda:0", torch.bfloat16, 0
)
```

The fix is to apply the fallback on each symbol.
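
The difference between the two strategies can be sketched with a toy model in which an expression is a list of factors (symbol names or integers) multiplied together. The fallback value 8192 and the hint `s0 = 64` are inferred from the generated example above; the helper names are hypothetical and this is not the inductor implementation.

```python
import math

FALLBACK = 8192  # assumed hint used for unbacked symbols


def hint_whole_expr(factors, hints):
    """Buggy variant: fall back for the entire expression if any symbol is unbacked."""
    if any(isinstance(f, str) and f not in hints for f in factors):
        return FALLBACK
    return math.prod(f if isinstance(f, int) else hints[f] for f in factors)


def hint_per_symbol(factors, hints):
    """Fixed variant: substitute the fallback for each unbacked symbol, then evaluate."""
    return math.prod(
        f if isinstance(f, int) else hints.get(f, FALLBACK) for f in factors
    )


hints = {"s0": 64}     # backed symbols have hints; u0 is unbacked
stride0 = ["u0", 128]  # the product u0 * 128

print(hint_whole_expr(stride0, hints))  # 8192
print(hint_per_symbol(stride0, hints))  # 1048576
```

With the per-symbol fallback, the example stride is consistent with a size of 8192 in the second dimension.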

### Test
```
PYTORCH_NO_CUDA_MEMORY_CACHING=1 compute-sanitizer python test_aot_inductor.py -k test_stride_with_unbacked_expr_abi_compatible_cuda

========= Invalid __global__ write of size 2 bytes
```

Differential Revision: [D64074561](https://our.internmc.facebook.com/intern/diff/D64074561)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137537
Approved by: https://github.com/jingsh
2024-10-10 17:11:24 +00:00
Oguz Ulgen
034af88c2d Add a microbenchmark for cache read path (#137607)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137607
Approved by: https://github.com/jamesjwu
2024-10-10 16:36:18 +00:00
Laith Sakka
f394fb554b Enable failing diffs for regressions on basic_modules_ListOfLinears benchmarks (#137541)
Note that basic_modules_ListOfLinears_inductor_gpu_force_shape_pad is flaky, with 8% detected variance,
so I set it up with a 20% threshold (a bit more than 8*2).
The others are stable within ±1.5%.

<img width="611" alt="Screenshot 2024-10-08 at 4 19 03 PM" src="https://github.com/user-attachments/assets/103c4bc7-6be8-41bf-ac31-4b8909fabfcf">

<img width="1581" alt="Screenshot 2024-10-08 at 4 18 56 PM" src="https://github.com/user-attachments/assets/56006f7a-e7de-4966-9a05-9263195adc68">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137541
Approved by: https://github.com/aorenste
2024-10-10 02:47:38 +00:00
Laith Sakka
361046718d Generate new expected results file when there is failures in diff time benchmarks (#137551)
This also adds a signpost log for the benchmarks that pass.
To test, I ran python check_results.py test_check_result/expected_test.csv test_check_result/result_test.csv out.csv
Results:
```
WIN: benchmark ('a', 'instruction count') failed, actual result 90 is -18.18% lower than expected 110 ±1.00% please update the expected results.

REGRESSION: benchmark ('b', 'memory') failed, actual result 200 is 100.00% higher than expected 100 ±+10.00% if this is an expected regression, please update the expected results.

PASS: benchmark ('c', 'something') pass, actual result 107 +7.00% is within expected 100 ±10.00%

MISSING REGRESSION TEST: benchmark ('d', 'missing-test') does not have a regression test enabled for it.

You can use the new reference expected result stored at path: out.csv.

a,instruction count,90,0.01
b,memory,200,0.1
c,something,100,0.1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137551
Approved by: https://github.com/aorenste
2024-10-10 01:09:15 +00:00
PyTorch MergeBot
16a2c2cfd4 Revert "Introduce torch.sym_sum (#136429)"
This reverts commit 90bed32b98.

Reverted https://github.com/pytorch/pytorch/pull/136429 on behalf of https://github.com/ezyang due to fails internal stuff ([comment](https://github.com/pytorch/pytorch/pull/136429#issuecomment-2403335147))
2024-10-09 20:08:01 +00:00
Oguz Ulgen
ae03c0cff3 Add microbenchmark for FxGraphHashDetails.debug_lines (#137506)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137506
Approved by: https://github.com/jamesjwu
2024-10-09 16:15:05 +00:00
Michael Lazos
27dee935af [Dynamo] Ensure torch function modes are dispatched on builtin ops (#137117)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137117
Approved by: https://github.com/yanboliang, https://github.com/williamwen42
ghstack dependencies: #137114, #137115, #137116
2024-10-09 02:29:40 +00:00
PyTorch MergeBot
2d18c2d5e7 Revert "[Dynamo] Ensure torch function modes are dispatched on builtin ops (#137117)"
This reverts commit 941be418d8.

Reverted https://github.com/pytorch/pytorch/pull/137117 on behalf of https://github.com/huydhn due to The top of the stack has been reverted but it leaves trunk in a broken state, so I try to revert the rest of the stack ([comment](https://github.com/pytorch/pytorch/pull/137114#issuecomment-2400765603))
2024-10-08 20:33:17 +00:00
Brian Hirsh
b41fc14072 compile time benchmarks for AOTDispatcher (partitioner) (#136760)
compile time benchmark for the min cut partitioner. I'm hoping that this is a reasonable benchmark because:

(1) it consists of a single input + many weights that are used sequentially
(2) contains a mix of recompute vs non-recomputed ops (matmul + sin)
(3) it is relatively simple

from running locally:
```
collecting compile time instruction count for aotdispatcher_partitioner_cpu
compile time instruction count for iteration 0 is 21764219181
compile time instruction count for iteration 1 is 12475020009
compile time instruction count for iteration 2 is 12463710140
compile time instruction count for iteration 3 is 12455676489
compile time instruction count for iteration 4 is 12451344330
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136760
Approved by: https://github.com/ezyang
ghstack dependencies: #136759
2024-10-08 18:44:13 +00:00
Brian Hirsh
48b8f818b2 compile time benchmarks for AOTDispatcher (inference/training/subclasses) (#136759)
this adds a few compile time benchmarks for some disjoint paths in AOTDispatcher:

(1) inference vs training code paths
(2) "subclasses" vs "no subclasses" codepaths

Also see https://github.com/pytorch/pytorch/pull/136760 for a partitioner benchmark (I'm not sure why ghstack didn't display the stack nicely)

I ran locally, and got these numbers on the 4 paths:
```
collecting compile time instruction count for aotdispatcher_inference_nosubclass_cpu
compile time instruction count for iteration 0 is 11692348671
compile time instruction count for iteration 1 is 3026287204
compile time instruction count for iteration 2 is 3011467318
compile time instruction count for iteration 3 is 3004485935
compile time instruction count for iteration 4 is 3003087410
collecting compile time instruction count for aotdispatcher_training_nosubclass_cpu
compile time instruction count for iteration 0 is 6068003223
compile time instruction count for iteration 1 is 5585418102
compile time instruction count for iteration 2 is 5581856618
compile time instruction count for iteration 3 is 5581651794
compile time instruction count for iteration 4 is 5578742619
collecting compile time instruction count for aotdispatcher_inference_subclass_cpu
compile time instruction count for iteration 0 is 8634984264
compile time instruction count for iteration 1 is 8633467573
compile time instruction count for iteration 2 is 8632182092
compile time instruction count for iteration 3 is 8632056925
compile time instruction count for iteration 4 is 8632543871
collecting compile time instruction count for aotdispatcher_training_subclass_cpu
compile time instruction count for iteration 0 is 14737239311
compile time instruction count for iteration 1 is 14734346427
compile time instruction count for iteration 2 is 14736493730
compile time instruction count for iteration 3 is 14734121272
compile time instruction count for iteration 4 is 14733852882
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136759
Approved by: https://github.com/laithsakka
2024-10-08 18:44:13 +00:00
Edward Z. Yang
90bed32b98 Introduce torch.sym_sum (#136429)
Partially addresses https://github.com/pytorch/pytorch/issues/128150

When you have big sums of values, we end up computing long chains of
binary addition in our FX graph representation.  Not only is this ugly,
it also is quadratic, as the sympy.Add constructor is O(N) in number
of arguments.  Instead, ensure that we maintain the summation as a
single FX node so we can do the entire addition all in one go.

update_hint_regression benchmark, before and after:

```
update_hint_regression,compile_time_instruction_count,2648328980
update_hint_regression,compile_time_instruction_count,2563748678
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136429
Approved by: https://github.com/isuruf
2024-10-08 18:12:57 +00:00
Michael Lazos
941be418d8 [Dynamo] Ensure torch function modes are dispatched on builtin ops (#137117)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137117
Approved by: https://github.com/yanboliang, https://github.com/williamwen42
ghstack dependencies: #137114, #137115, #137116
2024-10-07 18:55:26 +00:00
Laith Sakka
8b9cbf22c2 Enable regression test for add loop benchmarks (#136573)
The red dotted line is 1.5

<img width="1607" alt="Screenshot 2024-09-24 at 11 50 41 AM" src="https://github.com/user-attachments/assets/719a9a86-89af-4c58-8723-80a28f9bb517">

The expected values are taken from the average.
<img width="850" alt="Screenshot 2024-09-24 at 2 33 27 PM" src="https://github.com/user-attachments/assets/0f25e855-35ae-4031-86ef-1452ef6598de">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136573
Approved by: https://github.com/ezyang
2024-10-04 18:12:08 +00:00
PyTorch MergeBot
951107e8c2 Revert "compile time benchmarks for AOTDispatcher (inference/training/subclasses) (#136759)"
This reverts commit b17cd264d3.

Reverted https://github.com/pytorch/pytorch/pull/136759 on behalf of https://github.com/ZainRizvi due to Something in this stack seems to be causing tests to fail on trunk. See functorch/test_control_flow.py::TestControlFlow::test_associative_scan_dim_reverse_True_combine_mode_generic_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/11107079955/job/30872132411) [HUD commit link](c010c6099b) ([comment](https://github.com/pytorch/pytorch/pull/136670#issuecomment-2386303362))
2024-10-01 15:23:55 +00:00
PyTorch MergeBot
923410193b Revert "compile time benchmarks for AOTDispatcher (partitioner) (#136760)"
This reverts commit c010c6099b.

Reverted https://github.com/pytorch/pytorch/pull/136760 on behalf of https://github.com/ZainRizvi due to Something in this stack seems to be causing tests to fail on trunk. See functorch/test_control_flow.py::TestControlFlow::test_associative_scan_dim_reverse_True_combine_mode_generic_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/11107079955/job/30872132411) [HUD commit link](c010c6099b) ([comment](https://github.com/pytorch/pytorch/pull/136670#issuecomment-2386303362))
2024-10-01 15:23:55 +00:00
Bin Bao
a15f3f51bc [AOTI] Update sam_fast from timeout to fail_to_run (#136996)
Summary: sam_fast changes from timeout to fail_to_run after https://github.com/pytorch/pytorch/pull/136591, which "regressed" in a good way. Update the expected result file and continue investigating.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136996
Approved by: https://github.com/ezyang
2024-09-30 14:05:49 +00:00
Brian Hirsh
c010c6099b compile time benchmarks for AOTDispatcher (partitioner) (#136760)
compile time benchmark for the min cut partitioner. I'm hoping that this is a reasonable benchmark because:

(1) it consists of a single input + many weights that are used sequentially
(2) contains a mix of recompute vs non-recomputed ops (matmul + sin)
(3) it is relatively simple

from running locally:
```
collecting compile time instruction count for aotdispatcher_partitioner_cpu
compile time instruction count for iteration 0 is 21764219181
compile time instruction count for iteration 1 is 12475020009
compile time instruction count for iteration 2 is 12463710140
compile time instruction count for iteration 3 is 12455676489
compile time instruction count for iteration 4 is 12451344330
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136760
Approved by: https://github.com/ezyang
ghstack dependencies: #136670, #136759
2024-09-30 13:25:02 +00:00
Brian Hirsh
b17cd264d3 compile time benchmarks for AOTDispatcher (inference/training/subclasses) (#136759)
this adds a few compile time benchmarks for some disjoint paths in AOTDispatcher:

(1) inference vs training code paths
(2) "subclasses" vs "no subclasses" codepaths

Also see https://github.com/pytorch/pytorch/pull/136760 for a partitioner benchmark (I'm not sure why ghstack didn't display the stack nicely)

I ran locally, and got these numbers on the 4 paths:
```
collecting compile time instruction count for aotdispatcher_inference_nosubclass_cpu
compile time instruction count for iteration 0 is 11692348671
compile time instruction count for iteration 1 is 3026287204
compile time instruction count for iteration 2 is 3011467318
compile time instruction count for iteration 3 is 3004485935
compile time instruction count for iteration 4 is 3003087410
collecting compile time instruction count for aotdispatcher_training_nosubclass_cpu
compile time instruction count for iteration 0 is 6068003223
compile time instruction count for iteration 1 is 5585418102
compile time instruction count for iteration 2 is 5581856618
compile time instruction count for iteration 3 is 5581651794
compile time instruction count for iteration 4 is 5578742619
collecting compile time instruction count for aotdispatcher_inference_subclass_cpu
compile time instruction count for iteration 0 is 8634984264
compile time instruction count for iteration 1 is 8633467573
compile time instruction count for iteration 2 is 8632182092
compile time instruction count for iteration 3 is 8632056925
compile time instruction count for iteration 4 is 8632543871
collecting compile time instruction count for aotdispatcher_training_subclass_cpu
compile time instruction count for iteration 0 is 14737239311
compile time instruction count for iteration 1 is 14734346427
compile time instruction count for iteration 2 is 14736493730
compile time instruction count for iteration 3 is 14734121272
compile time instruction count for iteration 4 is 14733852882
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136759
Approved by: https://github.com/laithsakka
ghstack dependencies: #136670
2024-09-30 13:25:02 +00:00
Laith Sakka
e205193e1c Enable failing diffs on regression (#136551)
1. example of failing diff
https://github.com/pytorch/pytorch/pull/136740

2. test this by running
python check_results.py test_check_result/expected_test.csv   test_check_result/result_test.csv

results
```
WIN: benchmark ('a', ' instruction count') failed, actual result 90 is 18.18% lower than expected 110 ±1.00% please update the expected results.
REGRESSION: benchmark ('b', ' memory') failed, actual result 200 is 100.00% higher than expected 100 ±10.00% if this is an expected regression, please update the expected results.
MISSING REGRESSION TEST: benchmark ('d', ' missing-test') does not have a regression test enabled for it
```
MISSING REGRESSION TEST does not fail, but it is logged.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136551
Approved by: https://github.com/ezyang
ghstack dependencies: #136383
2024-09-29 22:31:26 +00:00
Jason Ansel
8da9c4178c [inductor] Benchmark Halide in operatorbench.py (#136809)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136809
Approved by: https://github.com/eellison
ghstack dependencies: #136808
2024-09-28 19:26:04 +00:00
Jason Ansel
375921b755 [inductor] Improve operatorbench.py (#136808)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136808
Approved by: https://github.com/eellison
2024-09-28 06:22:02 +00:00
William Wen
2157e396a3 [dynamo] attempt run only mode when dynamo cache limit is hit (#136655)
Implement https://github.com/pytorch/pytorch/issues/135458.

Try run-only mode when dynamo cache limit is hit. If no valid cache entries are found, then skip code recursively.
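
The control flow described above can be sketched with a toy per-frame cache. This is a simplified model under stated assumptions, not Dynamo's implementation: `CACHE_LIMIT`, `ToyFrameCache`, and the type-based guard are hypothetical, and the recursive part of skipping is not modeled.

```python
CACHE_LIMIT = 2  # hypothetical per-frame cache limit


def eager(x):
    return x + x


class ToyFrameCache:
    def __init__(self):
        self.entries = []  # (guard, compiled function) pairs

    def call(self, x):
        # Run-only lookup: reuse any existing entry whose guard accepts the input.
        for guard, compiled in self.entries:
            if guard(x):
                return "cached", compiled(x)
        if len(self.entries) < CACHE_LIMIT:
            # Below the limit: "compile" a new entry specialized on the input type.
            expected_type = type(x)
            self.entries.append((lambda v, t=expected_type: type(v) is t, eager))
            return "compiled", eager(x)
        # Limit hit and no valid entry: skip compilation and run eagerly.
        return "skipped", eager(x)


cache = ToyFrameCache()
statuses = [cache.call(x)[0] for x in (1, 1.5, "a", 2)]
print(statuses)  # ['compiled', 'compiled', 'skipped', 'cached']
```

The last call still uses a cached entry even though the limit was already hit by the third call.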

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136655
Approved by: https://github.com/jansel
2024-09-27 17:15:05 +00:00
Zain Rizvi
37f340c1e5 [EZ] Remove remaining amz2023 runner variant references (#136540)
Validated no jobs use the amz2023 runner variant anymore ([proof](https://github.com/search?type=code&q=org%3Apytorch+%2F%5Cbamz2023%5Cb%2F+&p=1)) so removing all references to it

Explicit references to the amz2023 runner type variants were removed in the following PRs:
- https://github.com/pytorch/ignite/pull/3285
- https://github.com/pytorch/ao/pull/887
- https://github.com/pytorch/fbscribelogger/pull/1
- https://github.com/pytorch/pytorch/pull/134355

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136540
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-09-25 19:01:00 +00:00
PyTorch MergeBot
064093a4d6 Revert "Increase update_hint_regression problem size to 1000 (#136434)"
This reverts commit 3116fbda0f.

Reverted https://github.com/pytorch/pytorch/pull/136434 on behalf of https://github.com/ezyang due to whoops, this is too slow ([comment](https://github.com/pytorch/pytorch/pull/136434#issuecomment-2371847842))
2024-09-24 17:05:20 +00:00
Edward Z. Yang
3116fbda0f Increase update_hint_regression problem size to 1000 (#136434)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136434
Approved by: https://github.com/laithsakka
2024-09-23 18:51:44 +00:00
Laith Sakka
0b91e7e2dc Remove duplicate line (#136383)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136383
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-09-21 01:35:13 +00:00
Laith Sakka
b71802fa79 add basic_modules_ListOfLinears_inductor_gpu_force_shape_pad (#136175)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136175
Approved by: https://github.com/ezyang
2024-09-19 19:15:50 +00:00
Igor Sugak
bce52d0b60 [CODEMOD][caffe2] use npt.NDArray instead of np.ndarray in type annotations (#136288)
Summary:
To facilitate PSS-2 upgrade, this uses `npt.NDArray` instead of `np.ndarray` in type annotations. In Numpy-1.19 (PSS-1) it's an alias to `np.ndarray` -- a noop.
In Numpy-1.24, `npt.NDArray` is a proper generic type, and without this change uses of `np.ndarray` generate this Pyre type error:
```counterexample
 Invalid type parameters [24]: Generic type `np.ndarray` expects 2 type parameters.
```

Test Plan: Sandcastle plus visual inspection

Differential Revision: D62977370

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136288
Approved by: https://github.com/kit1980
2024-09-19 12:40:36 +00:00