Commit Graph

1507 Commits

Shunting Zhang
aaba3a87b1 tune down batch-size for res2net to avoid OOM (#122977)
The batch size for this model was previously 64. It was later changed to 256, which causes OOM in the cudagraphs setting. This PR tunes the batch size down to 128.

Sharing more logs from my local run:
```
cuda,res2net101_26w_4s,128,1.603578,110.273572,335.263494,1.042566,11.469964,11.001666,807,2,7,6,0,0
cuda,res2net101_26w_4s,256,1.714980,207.986155,344.013071,1.058278,22.260176,21.034332,807,2,7,6,0,0
```

The log shows that torch.compile uses 11GB for batch size 128 and 21GB for batch size 256. I guess the benchmark script has extra overhead that causes the model to OOM at batch size 256 in the dashboard run.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122977
Approved by: https://github.com/Chillee
2024-03-30 03:54:53 +00:00
Angela Yi
482d8bf1ea [aoti] Change aot_compile callsites (#122225)
Summary:
Replacing `torch._export.aot_compile` callsites with
```
ep = torch.export._trace._export(.., predispatch=True)   # Traces the given program into predispatch IR
so_path = torch._inductor.aot_compile_ep(ep, ...)  # Takes an exported program and compiles it into a .so
```

This allows us to explicitly split up the export step from AOTInductor. We can later modify tests to do `export + serialize + deserialize + inductor` to mimic internal production use cases better.

Test Plan: CI

Differential Revision: D54808612

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122225
Approved by: https://github.com/SherlockNoMad, https://github.com/khabinov
2024-03-29 21:34:20 +00:00
chilli
ed37fbdf60 made gpt_fast benchmark run faster (#122872)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122872
Approved by: https://github.com/msaroufim, https://github.com/yifuwang
ghstack dependencies: #122848
2024-03-29 03:49:19 +00:00
chilli
52b1d2a73d Increase timm batch sizes to make less overhead-bound and less noisy (#122581)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122581
Approved by: https://github.com/ezyang
ghstack dependencies: #122686, #122688, #121692, #122841
2024-03-28 02:34:32 +00:00
Oguz Ulgen
a697d972b1 Fix torchbench errors (#122735)
Summary: It looks like this target has stopped working; let's fix it.

Test Plan:
```
buck2 run mode/opt //caffe2/benchmarks/dynamo/:test
```
now works

Differential Revision: D55389546

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122735
Approved by: https://github.com/xmfan
2024-03-27 06:59:16 +00:00
Jason Ansel
069270db60 [dynamo] Fix list comparison ops (#122559)
Fixes #122376

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122559
Approved by: https://github.com/Skylion007
2024-03-25 07:03:23 +00:00
Shunting Zhang
152fa9ecc2 skip moondream for training (#122483)
The model shows as a failed model on the dashboard for training, but it is not implemented for training (at least for now):
2196021e9b/torchbenchmark/models/moondream/__init__.py (L6)

Skip it in the dashboard.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122483
Approved by: https://github.com/eellison
2024-03-22 17:35:52 +00:00
drisspg
7fa1be506b Add an option to sdpa benchmark to specify backend (#122368)
# Summary
Adds the ability to specify sdpa backend
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122368
Approved by: https://github.com/cpuhrsch
2024-03-21 07:00:40 +00:00
Huy Do
a1d02b423c XFAIL detectron2_maskrcnn_r_101_c4 CPU inductor accuracy (#122263)
This starts to fail in trunk after the stack https://github.com/pytorch/pytorch/pull/122066 lands

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122263
Approved by: https://github.com/jansel
2024-03-20 08:03:34 +00:00
Shunting Zhang
1c4887d52b fix dlrm accuracy test in max-autotune (#122012)
torchrec_dlrm training fails the accuracy check when max-autotune is enabled.

I found there is no real issue in PT2; we simply fail to get fp64 reference results for the accuracy check. In max-autotune mode the numerics may change a bit, causing the cosine-similarity check to fail. Using an fp64 baseline is more reliable and makes the test pass.

The reason we were not using an fp64 baseline earlier is that torchrec uses a dataclass [Batch](99e6e669b5/torchrec/datasets/utils.py (L28)) to represent the input. We use pytree to cast the model and inputs to fp64, but pytree cannot look into a dataclass. My fix is to convert the dataclass to a namedtuple to be more pytree-friendly.
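
A minimal pure-Python sketch (a toy `tree_map`, not PyTorch's actual pytree implementation) of why a dataclass blocks the cast while a namedtuple does not:

```python
import dataclasses
from collections import namedtuple

def tree_map(fn, obj):
    """Toy pytree map: recurses into known containers, treats everything
    else (including dataclasses) as an opaque leaf."""
    if isinstance(obj, tuple):  # namedtuples are tuple subclasses, so they recurse
        mapped = [tree_map(fn, x) for x in obj]
        return type(obj)(*mapped) if hasattr(obj, "_fields") else tuple(mapped)
    if isinstance(obj, list):
        return [tree_map(fn, x) for x in obj]
    if isinstance(obj, dict):
        return {k: tree_map(fn, v) for k, v in obj.items()}
    return fn(obj)

@dataclasses.dataclass
class Batch:          # stands in for torchrec's Batch dataclass
    dense: int
    label: int

BatchNT = namedtuple("BatchNT", ["dense", "label"])

leaves = []
probe = lambda x: (leaves.append(x), x)[1]   # records every leaf it sees

tree_map(probe, BatchNT(1, 2))   # fields are visible: leaves gets 1 and 2
n_namedtuple_leaves = len(leaves)

leaves.clear()
tree_map(probe, Batch(1, 2))     # the whole dataclass is one opaque leaf
```

So a pytree-style fp64 cast reaches the values inside a namedtuple, but would be handed the `Batch` object itself as a single leaf.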

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122012
Approved by: https://github.com/jansel, https://github.com/eellison
2024-03-19 22:23:42 +00:00
Jason Ansel
07caea5c12 [dynamo] Refactor COMPARE_OP and comparison builtins (#122043)
This removes the duplicate handling of comparison ops between symbolic_convert and builtin and refactors the handling to use the binop infrastructure. This change regresses overheads a bit, but that is fixed in the next PR.

The new test skips are variants of `type(e) is np.ndarray` that previously fell back to eager.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122043
Approved by: https://github.com/anijain2305
ghstack dependencies: #122039
2024-03-19 04:23:17 +00:00
eellison
ba69dc6675 [Easy] add option to print compilation time (#121996)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121996
Approved by: https://github.com/davidberard98
2024-03-18 22:42:41 +00:00
Jason Ansel
5a10b56083 [dynamo] Small microbenchmark changes (#122032)
Used to generate numbers in #122029

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122032
Approved by: https://github.com/yanboliang
2024-03-18 18:08:06 +00:00
Jason Ansel
1e9a7df8fe [dynamo] Compile time optimizations in tx.step() (#121790)
`python benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py`
- Before: `symbolic_convert_overhead_stress_test: 10.7s`
- After: `symbolic_convert_overhead_stress_test: 8.6s`

`tx.step()` is a small part of that benchmark, so likely the speedup in that isolated function is larger than the top line.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121790
Approved by: https://github.com/oulgen
2024-03-15 01:01:05 +00:00
Yanbo Liang
43e243180b Add gpt-fast as a static benchmark (#121886)
Run:
```
python benchmarks/gpt_fast/benchmark.py
```
It generates a csv file ```gpt_fast_benchmark.csv``` with content like:
```
name,mode,target,actual,percentage
Llama-2-7b-chat-hf,bfloat16,104,103.458618,99.48%
Llama-2-7b-chat-hf,int8,155,158.964615,102.56%
Mixtral-8x7B-v0.1,int8,97,99.760132,102.85%
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121886
Approved by: https://github.com/Chillee
2024-03-14 21:46:59 +00:00
Animesh Jain
cd1751b14f [dynamo] Measure Dynamo cache latency lookup (#121604)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121604
Approved by: https://github.com/jansel
ghstack dependencies: #121614, #121622
2024-03-12 17:09:11 +00:00
Jason Ansel
9aa3fedb75 Slightly faster FX graph iterator (#121611)
Before:
```
iterating over 100000000 FX nodes took 5.9s (16830686 nodes/s)
```

After:
```
iterating over 100000000 FX nodes took 5.0s (19937698 nodes/s)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121611
Approved by: https://github.com/oulgen
2024-03-11 20:00:19 +00:00
James Wu
ae22bdaefe Update torchbench commit pin, add sam_fast benchmark (#121420)
After this, the sam_fast benchmark can now be run in the pytorch repo:
```
SEGMENT_ANYTHING_FAST_USE_FLASH_4=0 benchmarks/dynamo/torchbench.py --inference --amp --performance --backend=inductor --explain --only sam_fast
```

sam_fast is designed for inference only, with CUDA and AMP on. The code adds these restrictions to the benchmark.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121420
Approved by: https://github.com/oulgen, https://github.com/msaroufim
2024-03-11 19:48:53 +00:00
Yifu Wang
d7a5e59647 [dynamo] support group=None when rewriting collectives (#121043)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121043
Approved by: https://github.com/awgu
2024-03-06 21:37:19 +00:00
angelayi
58ac4a2007 Remove llava from ci_expected_accuracy as it's flaky (#121322)
https://github.com/pytorch/pytorch/pull/121029 added it into CI, but the test is flaky on HUD: it alternates between fail_accuracy and fail_to_run.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121322
Approved by: https://github.com/desertfire
2024-03-06 20:47:01 +00:00
angelayi
ae4c85960f Add Deberta pass (#121206)
Adding DebertaForQuestionAnswering to the inductor benchmark pass, as it did not show up before.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121206
Approved by: https://github.com/desertfire
2024-03-05 17:56:25 +00:00
Sun, Jiayi
ee557d8f61 skip detectron2_fcos_r_50_fpn in dynamic shape test (#120697)
As reported in https://github.com/pytorch/pytorch/issues/119434, `detectron2_fcos_r_50_fpn` failed with dynamic shape testing; we propose to skip the dynamic batch size testing of this model in this PR.

* Error msg is
```
  File "/home/jiayisun/pytorch/benchmarks/dynamo/common.py", line 3877, in run
    assert marked, f"nothing in example_inputs had a dim with {batch_size}"
AssertionError: nothing in example_inputs had a dim with 4
```

* Root Cause is
The benchmark code only annotates an input dim as dynamic when its size equals the batch size c617e7b407/benchmarks/dynamo/common.py (L3867-L3871). If no dim equals the batch size, the above error is thrown.
However, the inputs of `detectron2_fcos_r_50_fpn` are as follows:

```
([{'file_name': '/home/jiayisun/benchmark/torchbenchmark/data/.data/coco2017-minimal/coco/val2017/000000001268.jpg', 'height': 427, 'width': 640, 'image_id': 1268, 'image': tensor([[[147., 124.,  82.,  ...,   3.,   4.,   5.],
         [125., 104.,  65.,  ...,   3.,   3.,   4.],
         [ 87.,  68.,  34.,  ...,   2.,   2.,   2.],
         ...,
         [ 47.,  45.,  41.,  ...,  45.,  45.,  45.],
         [ 46.,  44.,  40.,  ...,  44.,  45.,  46.],
         [ 46.,  44.,  40.,  ...,  43.,  45.,  46.]],

        [[154., 129.,  84.,  ...,   3.,   4.,   5.],
         [133., 110.,  69.,  ...,   3.,   3.,   4.],
         [ 95.,  76.,  43.,  ...,   2.,   2.,   2.],
         ...,
         [ 44.,  42.,  38.,  ...,  34.,  37.,  39.],
         [ 43.,  41.,  37.,  ...,  35.,  39.,  41.],
         [ 43.,  41.,  37.,  ...,  35.,  40.,  43.]],

        [[171., 140.,  85.,  ...,   3.,   4.,   5.],
         [147., 120.,  71.,  ...,   3.,   3.,   4.],
         [103.,  83.,  47.,  ...,   2.,   2.,   2.],
         ...,
         [ 46.,  44.,  40.,  ...,  16.,  20.,  22.],
         [ 45.,  43.,  39.,  ...,  17.,  22.,  26.],
         [ 45.,  43.,  39.,  ...,  18.,  24.,  28.]]])}, ... ],)
```

None of the input dims equals the input batch size, so I think we need to skip the dynamic batch size testing for this model.
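
The marking logic described above can be sketched (with a stand-in tensor type, not the real harness) as:

```python
from collections import namedtuple

FakeTensor = namedtuple("FakeTensor", ["shape"])  # stand-in for a real tensor

def mark_dynamic_batch_dim(example_inputs, batch_size):
    """Mimics the benchmark check: a dim is only marked dynamic when its
    size equals the batch size; if nothing matches, the run asserts out."""
    marked = False
    for t in example_inputs:
        for dim, size in enumerate(t.shape):
            if size == batch_size:
                # the real harness marks this dim as dynamic here
                marked = True
    assert marked, f"nothing in example_inputs had a dim with {batch_size}"

# A typical batched input works: dim 0 equals the batch size.
mark_dynamic_batch_dim([FakeTensor(shape=(4, 3, 224, 224))], batch_size=4)

# But a detectron2-style image input (C x H x W, e.g. 3 x 427 x 640) has no
# dim equal to a batch size of 4, so the assertion fires.
image = FakeTensor(shape=(3, 427, 640))
```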

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120697
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/desertfire
2024-03-05 12:12:18 +00:00
angelayi
c3c618c750 Update torchbench pin (#121029)
Fixes https://github.com/pytorch/pytorch/issues/117280 after bumping the HF version in https://github.com/pytorch/benchmark/pull/2179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121029
Approved by: https://github.com/desertfire
2024-03-05 03:21:32 +00:00
PyTorch MergeBot
368f242e37 Revert "[PT2D] Make the speedup benchmark works with DDP + CompiledAutograd (#120454)"
This reverts commit 8c2e569928.

Reverted https://github.com/pytorch/pytorch/pull/120454 on behalf of https://github.com/desertfire due to breaks nightly dashboard cudagraphs run ([comment](https://github.com/pytorch/pytorch/pull/120454#issuecomment-1975001824))
2024-03-03 02:58:47 +00:00
Shunting Zhang
c4ed456fc3 [inductor] fix accuracy failure for a few models under freezing (#121054)
Fix https://github.com/pytorch/pytorch/issues/120545. These models fail the accuracy test with freezing because of conv-batchnorm fusion, which causes relatively large numerical churn.

For the failed TIMM models, raising the tolerance to `8 * 1e-2` can make the test pass.

For the failed TB models, the numerical difference is too large. After discussing with @eellison, we decided to skip them with freezing for now.

On the other hand, we should probably dig into why conv-bn fusion causes such a large numerical difference.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121054
Approved by: https://github.com/eellison
2024-03-02 04:53:59 +00:00
Chien-Chin Huang
8c2e569928 [PT2D] Make the speedup benchmark works with DDP + CompiledAutograd (#120454)
With DDP + CompiledAutograd, we could not use the same parallelized model to do the test. This PR copies the model.

Differential Revision: [D54094257](https://our.internmc.facebook.com/intern/diff/D54094257/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120454
Approved by: https://github.com/yf225, https://github.com/xmfan
2024-03-01 08:35:22 +00:00
leslie-fang-intel
950b484356 skip three pyhpc models with dynamic shape test (#120599)
As reported in https://github.com/pytorch/pytorch/issues/119434, `pyhpc_isoneutral_mixing`, `pyhpc_equation_of_state`, and `pyhpc_turbulent_kinetic_energy` failed with dynamic shape testing; we propose to skip the dynamic batch size testing of these 3 models in this PR.

* Error msg is
```
  File "/localdisk/leslie/torch_inductor_community/pytorch/benchmarks/dynamo/common.py", line 3879, in run
    assert marked, f"nothing in example_inputs had a dim with {batch_size}"
AssertionError: nothing in example_inputs had a dim with 1048576
```

* Root Cause is
  * The benchmark code only annotates an input dim as dynamic when its size equals the batch size c617e7b407/benchmarks/dynamo/common.py (L3867-L3871). If no dim equals the batch size, the above error is thrown.
  * However, for these 3 models, none of the input dims equals the input batch size, given the [relationship of dim sizes](26b85eadde/torchbenchmark/models/pyhpc_equation_of_state/__init__.py (L12-L16))
  ```
    shape = (
        math.ceil(2 * size ** (1/3)),
        math.ceil(2 * size ** (1/3)),
        math.ceil(0.25 * size ** (1/3)),
    )
  ```
  * Another thing: `pyhpc_isoneutral_mixing` and `pyhpc_equation_of_state` can still pass the dynamic batch size accuracy testing, because the batch size is set to 4 in accuracy testing (c617e7b407/benchmarks/dynamo/common.py (L3456)) and `math.ceil(2 * size ** (1/3))` happens to equal 4.

* Since the input dim sizes have the above relationship, running these models with dynamic shapes would require annotating `dim[0](s0) = dim[2](s1) * 8`. Per the discussion in https://github.com/pytorch/pytorch/issues/117477#issuecomment-1897108756 with @avikchaudhuri, this constraint does not appear to be expressible today, so I think we need to skip the dynamic batch size testing for these 3 models.
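
The shape arithmetic above can be checked directly with the quoted formula:

```python
import math

def pyhpc_shape(size):
    # the dim-size relationship quoted above
    return (
        math.ceil(2 * size ** (1 / 3)),
        math.ceil(2 * size ** (1 / 3)),
        math.ceil(0.25 * size ** (1 / 3)),
    )

# At the accuracy-test batch size of 4, dim 0 happens to equal the batch
# size (math.ceil(2 * 4 ** (1/3)) == 4), so marking succeeds by coincidence.
accuracy_shape = pyhpc_shape(4)

# At a large performance-test batch size, no dim equals the batch size,
# so the "nothing in example_inputs had a dim" assertion fires.
perf_shape = pyhpc_shape(1048576)
```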

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120599
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-02-29 00:38:06 +00:00
Scott Wolchok
98d1529474 [PyTorch] fix mixed int32/int64 indices/offsets for embedding_bag_out (#120752)
This was an oversight in D27482738 (#55189) -- it only patched the regular embedding_bag operator, but static runtime uses the out variant.

Differential Revision: [D54285460](https://our.internmc.facebook.com/intern/diff/D54285460/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120752
Approved by: https://github.com/houseroad
2024-02-28 20:13:30 +00:00
Sergii Dymchenko
d341b66e96 Revert [dynamo] support group=None when rewriting collectives (#12018) (#120677)
This reverts commit 298c686d3f.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120677
Approved by: https://github.com/yifuwang, https://github.com/huydhn
2024-02-27 00:33:35 +00:00
Shunting Zhang
b381a4372b make GPT2ForSequenceClassification pass inference accuracy check (#120537)
We need a higher tolerance for GPT2ForSequenceClassification, since if I change --bfloat16 in
```
time python benchmarks/dynamo/huggingface.py --accuracy --inference --bfloat16 --backend inductor --disable-cudagraphs --only GPT2ForSequenceClassification
```
to --float16 or --float32, the accuracy check passes.

Adding --freezing can also make the test pass for this model. I think that may be due to a different fusion output being generated (depending on whether constant propagation happens, which is controlled by freezing), causing some small numerical difference.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120537
Approved by: https://github.com/jansel
2024-02-26 11:02:57 +00:00
leslie-fang-intel
c617e7b407 Add resnet50/mobilenet_v2_quantized_qat in into deterministic_algorithms exclusive list (#120384)
After PR https://github.com/pytorch/pytorch/pull/120026, 2 `Torchbench` test cases, `resnet50_quantized_qat` and `mobilenet_v2_quantized_qat`, pass the performance test but fail the accuracy test. The failure msg is: `mobilenet_v2_quantized_qat, RuntimeError: quantized_resize_cpu_ does not have a deterministic implementation but you set 'torch.use_deterministic_algorithms(True)'.`

- `torch.use_deterministic_algorithms(True)` is only set for the accuracy test. fff9d98e58/benchmarks/dynamo/common.py (L3480)
- However, `quantized_resize_cpu_` only supports nondeterministic algorithms, because the resized output memory may be uninitialized. fff9d98e58/aten/src/ATen/native/quantized/cpu/TensorOperators.cpp (L85-L87)

Add these 2 models into the deterministic_algorithms exclusive model list in this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120384
Approved by: https://github.com/desertfire, https://github.com/jgong5
2024-02-26 05:05:43 +00:00
Yifu Wang
298c686d3f [dynamo] support group=None when rewriting collectives (#120118)
Resolves case 2 in #120082.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120118
Approved by: https://github.com/wconstab
ghstack dependencies: #120370
2024-02-25 03:12:10 +00:00
Yukio Siraichi
cef9f70f4b Move torchbench model configuration into a YAML file. (#120299)
This PR moves other aspects of torchbench's model configuration (e.g. batch size,
tolerance requirements, etc.) into a new YAML file: `torchbench.yaml`. It also merges the
recently added `torchbench_skip_models.yaml` file inside the `skip` key.

This is an effort so that external consumers are able to easily replicate the performance
results and coverage results from the PyTorch HUD.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120299
Approved by: https://github.com/jansel
2024-02-23 14:00:14 +00:00
baocheny
edd03f975f highlight readme code block (#120228)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120228
Approved by: https://github.com/mikaylagawarecki
2024-02-22 21:23:08 +00:00
xiangdong
e06978be4b [CI] Add initial inductor cpu smoketest for performance (#116456)
Co-authored-by: chuanqiw <chuanqi.wang@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116456
Approved by: https://github.com/jgong5, https://github.com/atalman
2024-02-21 20:04:50 +00:00
Yukio Siraichi
92bf2a4550 [torchbench] Update skipped models. (#120117)
This PR updates the list of benchmarks that should (not) be skipped. Here's a summary of
the changes:

- `detectron2_maskrcnn`: #120115
- `fambench_xlmr`: moved to canary models
- `hf_Bert` and `hf_Bert_large`: pass
- `maml`: pass
- `clip`: renamed to `hf_clip`
- `gat`, `gcn`, and `sage`: moved to canary models

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120117
Approved by: https://github.com/ezyang, https://github.com/lezcano
2024-02-19 18:08:32 +00:00
Eddie Yan
cd380c794f [CUDNN][SDPA] Experimental cuDNN Flash Attention v2 Inference (#115663)
#113713

Going to clean up some of the checks and will remove draft status after.
Can be tested on SM80+ with `TORCH_CUDNN_MHA_ENABLED=1`.

CC @drisspg @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115663
Approved by: https://github.com/drisspg
2024-02-14 22:02:06 +00:00
Chien-Chin Huang
c0e5cca4f8 [DDP] Change the --no-optimize-ddp flag to reflect the latest usage (#119437)
Compiled DDP now has 4 different optimization modes. This PR changes the Dynamo benchmark flag to reflect that change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119437
Approved by: https://github.com/wconstab, https://github.com/xmfan
2024-02-13 16:53:56 +00:00
chuanqiw
074f2bb5ce Fix dynamo benchmark runner for torchbench skip sets (#118615)
Fix the dynamo benchmark runner for the torchbench skip sets, which were introduced by PR #118032.

This runner.py script is still used in the [Inductor CPU Performance Dashboard](https://github.com/pytorch/pytorch/issues/93531) regular tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118615
Approved by: https://github.com/jgong5, https://github.com/ysiraichi, https://github.com/ezyang
2024-02-06 02:06:54 +00:00
PyTorch MergeBot
966db82c9d Revert "Remove extra graph breaks (#118987)"
This reverts commit 9a8e3b07d7.

Reverted https://github.com/pytorch/pytorch/pull/118987 on behalf of https://github.com/eellison due to reverting because it causes regression ([comment](https://github.com/pytorch/pytorch/pull/118987#issuecomment-1928224447))
2024-02-05 22:19:37 +00:00
Michael Lazos
9a8e3b07d7 Remove extra graph breaks (#118987)
Fixes https://github.com/pytorch/pytorch/issues/104053

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118987
Approved by: https://github.com/janeyx99
2024-02-03 05:55:09 +00:00
BowenBao
30f43e3d89 [ONNX][bench] Deepcopy model to another device before export to avoid OOM (#118710)
Prior to ONNX export, the model is deepcopied to avoid modifications that may affect later performance profiling. However, this increases the memory requirement on the device.
This PR modifies the script to deepcopy and export the model on another device when possible.
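
A rough sketch of the idea (hypothetical helper names, not the benchmark's actual code; a real script would move the copy with `.to(device)` before calling `torch.onnx.export`):

```python
import copy

def pick_export_device(model_device, candidates=("cuda:1", "cpu")):
    """Hypothetical helper: pick a device other than the one the model
    already occupies, falling back to CPU."""
    for d in candidates:
        if d != model_device:
            return d
    return "cpu"

def deepcopy_for_export(model, model_device):
    # deepcopy so export-time mutations don't leak into the profiled model;
    # doing the copy/export on the spare device keeps headroom on model_device
    target = pick_export_device(model_device)
    exported = copy.deepcopy(model)
    return exported, target
```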

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118710
Approved by: https://github.com/thiagocrepaldi
2024-01-31 23:03:39 +00:00
Aaron Gokaslan
1562dae62c [BE]: Apply RUF025 dict.fromkeys preview rule (#118637)
Simplifies and optimizes dict construction using the `fromkeys` classmethod ctor. This also makes it really obvious when all the keys will have the same static value, which could be a bug if unintentional. It is also significantly faster than using a dict comprehension. The rule is in preview, but I am adding a forward fix for when it becomes stable.
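
A quick illustration of the rewrite (and its one sharp edge):

```python
keys = ["a", "b", "c"]

# dict comprehension with a static value -- the pattern RUF025 flags
d1 = {k: 0 for k in keys}

# the fromkeys classmethod ctor: faster, and it makes the shared static
# value immediately obvious
d2 = dict.fromkeys(keys, 0)
assert d1 == d2

# caveat: with a mutable default, every key shares the SAME object
shared = dict.fromkeys(keys, [])
shared["a"].append(1)   # now visible under "b" and "c" too
```

This is why the rewrite is only safe for static, immutable values; a comprehension is still the right form when each key needs a fresh mutable value.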

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118637
Approved by: https://github.com/albanD
2024-01-30 20:46:54 +00:00
Edward Z. Yang
119b66ba16 Use strict to toggle strict options in MYPYSTRICT (#118479)
As we force a specific version of mypy, it's OK to use the agglomerated flag.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118479
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #118414, #118418, #118432, #118467, #118468, #118469, #118475
2024-01-28 19:22:22 +00:00
Yukio Siraichi
2f6fc33c20 Move skip sets into a new file. (#118032)
This PR moves the skip sets that lived in benchmarks/dynamo/torchbench.py into a more
readable YAML file, so that it is consumable from other projects (e.g. XLA).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118032
Approved by: https://github.com/lezcano, https://github.com/ezyang
2024-01-24 19:22:01 +00:00
Jason Ansel
c5702a0891 [dynamo] Optimize BACKEND_MATCH guard (#118065)
As measured by `benchmarks/dynamo/microbenchmarks/overheads.py`:
- Before `22.5us`
- After `18.1us`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118065
Approved by: https://github.com/ydwu4
2024-01-24 07:47:52 +00:00
Simon Fan
ed0ec2e0be Remove dynamo runner's dependency on distributed build (#117903)
This lets us bisect faster without needing to rebuild the distributed module. We remove the annotation to avoid the flake8 undefined-name lint.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117903
Approved by: https://github.com/xuzhao9
2024-01-24 06:51:14 +00:00
Jane Xu
13d2cdffa2 Remove optimizer.step patching for profiler hook (#115772)
1. I'd like to remove the patching that avoids the profiler hook, but it adds an additional graph break due to nested wrappers. #117767 if interested, see (internal only) paste for [before](P996529232) and [after](P997507449) this PR.

I've locally run perf benchmarks for yolov3: before, the speedup is 4.183x; after, it is 4.208x.
I've also run it for resnet50: before, the speedup is 3.706x; after, it is 3.924x.

2. @mlazos I now unwrap twice in the dynamo and inductor tests. This feels like we're testing deficiently: should we add tests verifying that tracing through the profiler hook and the use_grad hook functions as expected? (I know there's at least one graph break in one.)
3. There's a strange memory thing going on...what is happening? This has been resolved with @voznesenskym's [change](https://github.com/pytorch/pytorch/pull/116169). (for details see below)

<details>
This PR will fail the test_static_address_finalizer test due to a mysterious thing that is happening (idk what, but maybe the dynamo cache or a frame _expecting_ the patching to have been done).

There is no Python refcycle, as the backrefs for `p_ref()` look like:
![image](https://github.com/pytorch/pytorch/assets/31798555/4d6cbf50-3924-4efe-b578-d93389eebec8)
(so 5 backrefs but none of them python)

And the refs:
![image](https://github.com/pytorch/pytorch/assets/31798555/25e01105-bcb9-44ca-997a-2cf1670a6d42)
</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115772
Approved by: https://github.com/jansel, https://github.com/mlazos
2024-01-23 20:15:41 +00:00
Bin Bao
4d625c1c92 [AOTI] Fix a bug in the torch._export.aot_load API (#118039)
Summary:
tree_flatten_spec should use args instead of *args
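
A minimal stand-in (not the real AOTI code) for the args-vs-`*args` pitfall:

```python
def flatten_spec(args, spec=None):
    """Hypothetical stand-in for a tree_flatten_spec-style helper that
    expects the argument tuple itself as its first parameter."""
    return list(args)

example_args = (1, 2)

flat = flatten_spec(example_args)   # correct: pass the tuple as one argument
# flatten_spec(*example_args) would instead bind 1 to `args` and 2 to
# `spec`, then fail (or silently flatten the wrong thing)
```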

clone of https://github.com/pytorch/pytorch/pull/117948 but with some fbcode specific changes

Test Plan: CI

Differential Revision: D52982401

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118039
Approved by: https://github.com/angelayi
2024-01-23 14:54:02 +00:00
Michael Lazos
f302a0d380 Re-enable SGD (#117434)
Re-enables the SGD optimizer now that compile times are more reasonable. [Benchmark run](https://github.com/pytorch/pytorch/actions/runs/7511073761)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117434
Approved by: https://github.com/anijain2305, https://github.com/janeyx99
2024-01-19 04:28:50 +00:00