After this, the sam_fast benchmark can now be run in the pytorch repo:
```
SEGMENT_ANYTHING_FAST_USE_FLASH_4=0 benchmarks/dynamo/torchbench.py --inference --amp --performance --backend=inductor --explain --only sam_fast
```
sam_fast is designed for inference only, with cuda and amp on. The code adds these restrictions to the benchmark.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121420
Approved by: https://github.com/oulgen, https://github.com/msaroufim
As reported in https://github.com/pytorch/pytorch/issues/119434, `detectron2_fcos_r_50_fpn` failed dynamic shape testing. This PR proposes skipping dynamic batch size testing for this model.
* Error message:
```
File "/home/jiayisun/pytorch/benchmarks/dynamo/common.py", line 3877, in run
assert marked, f"nothing in example_inputs had a dim with {batch_size}"
AssertionError: nothing in example_inputs had a dim with 4
```
* Root cause:
The benchmark code only annotates an input dim as dynamic when its size equals the batch size c617e7b407/benchmarks/dynamo/common.py (L3867-L3871). If it finds no dim equal to the batch size, the error above is raised.
However, the inputs of `detectron2_fcos_r_50_fpn` are as follows:
```
([{'file_name': '/home/jiayisun/benchmark/torchbenchmark/data/.data/coco2017-minimal/coco/val2017/000000001268.jpg', 'height': 427, 'width': 640, 'image_id': 1268, 'image': tensor([[[147., 124., 82., ..., 3., 4., 5.],
[125., 104., 65., ..., 3., 3., 4.],
[ 87., 68., 34., ..., 2., 2., 2.],
...,
[ 47., 45., 41., ..., 45., 45., 45.],
[ 46., 44., 40., ..., 44., 45., 46.],
[ 46., 44., 40., ..., 43., 45., 46.]],
[[154., 129., 84., ..., 3., 4., 5.],
[133., 110., 69., ..., 3., 3., 4.],
[ 95., 76., 43., ..., 2., 2., 2.],
...,
[ 44., 42., 38., ..., 34., 37., 39.],
[ 43., 41., 37., ..., 35., 39., 41.],
[ 43., 41., 37., ..., 35., 40., 43.]],
[[171., 140., 85., ..., 3., 4., 5.],
[147., 120., 71., ..., 3., 3., 4.],
[103., 83., 47., ..., 2., 2., 2.],
...,
[ 46., 44., 40., ..., 16., 20., 22.],
[ 45., 43., 39., ..., 17., 22., 26.],
[ 45., 43., 39., ..., 18., 24., 28.]]])}, ... ],)
```
None of the input dims equals the batch size, so I think we need to skip dynamic batch size testing for this model.
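The check described above can be sketched roughly as follows. This is a simplified illustration, not the code in `common.py`: `FakeTensor` and `find_batch_dims` are hypothetical names, plain shapes stand in for real tensors, and the `(3, 427, 640)` image shape is an assumption based on the dump above.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple


@dataclass
class FakeTensor:
    """Hypothetical stand-in for a tensor; only the shape matters here."""
    shape: Tuple[int, ...]


def find_batch_dims(example_inputs: Any, batch_size: int) -> List[Tuple[Tuple[int, ...], int]]:
    """Collect (shape, dim_index) for the first dim of each tensor equal to batch_size."""
    found = []

    def visit(obj: Any) -> None:
        if isinstance(obj, FakeTensor):
            for index, size in enumerate(obj.shape):
                if size == batch_size:
                    found.append((obj.shape, index))
                    break
        elif isinstance(obj, dict):
            for value in obj.values():
                visit(value)
        elif isinstance(obj, (list, tuple)):
            for value in obj:
                visit(value)
        # Non-tensor leaves (file names, ints, ...) are ignored.

    visit(example_inputs)
    return found


# A conventional batched input: dim 0 equals the batch size.
batched = (FakeTensor((4, 3, 224, 224)),)
# A detectron2-style input: a list of per-image dicts, no batch dim anywhere.
detectron_like = ([{"height": 427, "width": 640, "image": FakeTensor((3, 427, 640))}],)

print(find_batch_dims(batched, 4))
print(find_batch_dims(detectron_like, 4))
```

In the second case nothing is collected, which corresponds to the situation where the benchmark's `assert marked` fails.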
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120697
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/desertfire
Fix https://github.com/pytorch/pytorch/issues/120545. These models fail the accuracy test with freezing because of conv-batchnorm fusion, which causes relatively large numerical churn.
For the failed TIMM models, raising the tolerance to `8 * 1e-2` makes the test pass.
For the failed TB models, the numerical difference is too large. After a discussion with @eellison, we decided to skip them with freezing for now.
On the other hand, we should probably dig further into why the conv-bn fusion causes such a large numerical difference.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121054
Approved by: https://github.com/eellison
As reported in https://github.com/pytorch/pytorch/issues/119434, `pyhpc_isoneutral_mixing`, `pyhpc_equation_of_state` and `pyhpc_turbulent_kinetic_energy` failed dynamic shape testing. This PR proposes skipping dynamic batch size testing for these 3 models.
* Error message:
```
File "/localdisk/leslie/torch_inductor_community/pytorch/benchmarks/dynamo/common.py", line 3879, in run
assert marked, f"nothing in example_inputs had a dim with {batch_size}"
AssertionError: nothing in example_inputs had a dim with 1048576
```
* Root cause:
* The benchmark code only annotates an input dim as dynamic when its size equals the batch size c617e7b407/benchmarks/dynamo/common.py (L3867-L3871). If it finds no dim equal to the batch size, the error above is raised.
* However, for these 3 models, none of the input dims equals the batch size because of the [relationship of dim sizes](26b85eadde/torchbenchmark/models/pyhpc_equation_of_state/__init__.py (L12-L16)):
```
shape = (
math.ceil(2 * size ** (1/3)),
math.ceil(2 * size ** (1/3)),
math.ceil(0.25 * size ** (1/3)),
)
```
* Note that `pyhpc_isoneutral_mixing` and `pyhpc_equation_of_state` can pass the dynamic batch size accuracy testing, because the batch size is set to 4 in accuracy testing (c617e7b407/benchmarks/dynamo/common.py (L3456)) and `math.ceil(2 * size ** (1/3))` happens to equal 4.
* Because the input dim sizes have the relationship above, running these models with dynamic shapes would require annotating `dim[0](s0) = dim[2](s1) * 8`. Per the discussion with @avikchaudhuri in https://github.com/pytorch/pytorch/issues/117477#issuecomment-1897108756, this case does not appear to be expressible. So I think we need to skip dynamic batch size testing for these 3 models.
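To make the arithmetic concrete, the shape formula quoted above can be evaluated for the two batch sizes mentioned in this description. The helper name `pyhpc_shape` is hypothetical; the formula itself is copied from the snippet above.

```python
import math


def pyhpc_shape(size: int) -> tuple:
    """Evaluate the shape formula quoted from pyhpc_equation_of_state."""
    return (
        math.ceil(2 * size ** (1 / 3)),
        math.ceil(2 * size ** (1 / 3)),
        math.ceil(0.25 * size ** (1 / 3)),
    )


for size in (4, 1048576):
    shape = pyhpc_shape(size)
    # The last column reports whether any dim equals the batch size.
    print(size, shape, size in shape)
```

For `size = 4` the shape is `(4, 4, 1)`, so a dim matching the batch size exists by coincidence. For `size = 1048576` the shape is `(204, 204, 26)`, so no dim matches and the assertion fires.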
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120599
Approved by: https://github.com/jgong5, https://github.com/desertfire
We need a higher tolerance for GPT2ForSequenceClassification: the accuracy check passes if I change --bfloat16 in
```
time python benchmarks/dynamo/huggingface.py --accuracy --inference --bfloat16 --backend inductor --disable-cudagraphs --only GPT2ForSequenceClassification
```
to --float16 or --float32.
Adding --freezing also makes the test pass for this model. I think that may be because different fused output is generated (depending on whether constant propagation, which is controlled by freezing, happens), causing a small numerical difference.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120537
Approved by: https://github.com/jansel
This PR moves other aspects of torchbench's model configuration (e.g. batch size,
tolerance requirements, etc.) into a new YAML file: `torchbench.yaml`. It also merges the
recently added `torchbench_skip_models.yaml` file into the `skip` key.
The goal is to let external consumers easily replicate the performance
and coverage results from the PyTorch HUD.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120299
Approved by: https://github.com/jansel
This PR updates the list of benchmarks that should (not) be skipped. Here's a summary of
the changes:
- `detectron2_maskrcnn`: #120115
- `fambench_xlmr`: moved to canary models
- `hf_Bert` and `hf_Bert_large`: pass
- `maml`: pass
- `clip`: renamed to `hf_clip`
- `gat`, `gcn`, and `sage`: moved to canary models
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120117
Approved by: https://github.com/ezyang, https://github.com/lezcano
Prior to ONNX export, the model is deepcopied to avoid modifications that may affect later performance profiling. However, this increases the memory requirement on the device.
This PR modifies the script to deepcopy and export the model on another device when possible.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118710
Approved by: https://github.com/thiagocrepaldi
1. I'd like to remove the patching that avoids the profiler hook, but it adds an additional graph break due to nested wrappers. See #117767 if interested, and the (internal only) pastes for [before](P996529232) and [after](P997507449) this PR.
```
I've locally run perf benchmarks for yolov3: Before the speedup is 4.183x, and after it is 4.208x.
I've also run it for resnet50: before, speedup is 3.706x and now it is 3.924x.
```
2. @mlazos I now unwrap twice in the dynamo and inductor tests. This feels like deficient testing. Should we add tests to check that tracing through the profiler hook and the use_grad hook behaves as expected? (I know there's at least one graph break in one.)
3. There's a strange memory thing going on...what is happening? This has been resolved with @voznesenskym's [change](https://github.com/pytorch/pytorch/pull/116169). (for details see below)
<details>
This PR will fail the test_static_address_finalizer test due to a mysterious thing that is happening (idk what, but maybe the dynamo cache or a frame _expecting_ the patching to have been done).
There is no Python refcycle, as the backrefs for `p_ref()` look like:
[image: backrefs for `p_ref()`]
(so 5 backrefs, but none of them Python)
And the refs:
[image: refs]
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115772
Approved by: https://github.com/jansel, https://github.com/mlazos
- Removes an outdated assert that prevents perf tests from running DDP. We now have single-node --multiprocess, and perf tests already wrap the model using `deepcopy_and_maybe_ddp`
- Appends the rank name to traces to avoid all ranks trying to create the same file
- Renames `deepcopy_and_maybe_ddp` to `deepcopy_and_maybe_parallelize` to include FSDP
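One way to keep ranks from writing to the same trace file is to put the rank in the file name. The following is a minimal sketch under stated assumptions: the helper `rank_trace_path` and the `_rank_N` suffix are hypothetical, and the naming scheme the PR actually uses is not shown in this description.

```python
import os


def rank_trace_path(base_path: str, rank: int) -> str:
    """Insert a per-rank suffix before the file extension (hypothetical scheme)."""
    root, ext = os.path.splitext(base_path)
    return f"{root}_rank_{rank}{ext}"


# Each rank derives its own path from the shared base name.
for rank in range(2):
    print(rank_trace_path("trace.json", rank))
```

Because the suffix depends only on the rank, every process can compute its path independently without coordinating with the others.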
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113332
Approved by: https://github.com/H-Huang, https://github.com/wconstab
Adds a `--compile-autograd` flag to the benchmark suite to run accuracy and performance tests. Also adds autograd_captures and autograd_compiles to the dynamo stats.
e.g. accuracy_inductor.csv
```
dev,name,batch_size,accuracy,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles
cuda,BERT_pytorch,4,pass,2655,2,8,7,1,1
cuda,Background_Matting,4,pass_due_to_skip,0,0,0,0,0,0
cuda,DALLE2_pytorch,0,eager_fail_to_run,0,0,0,0,0,0
cuda,LearningToPaint,4,pass,639,2,8,7,1,1
...
```
e.g. speedup_inductor.csv
```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles
cuda,hf_T5,8,1.214311,136.236793,88.350570,0.751322,18.754706,24.962275,3298,2,8,8,1,1
cuda,hf_T5,8,1.226645,135.431856,52.461461,1.040973,18.754706,18.016508,795,1,7,7,0,0
...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117196
Approved by: https://github.com/jansel
Sometimes the first statement in the try block, which sets this variable, fails due to out-of-memory issues. The finally block then tries to delete the variable even though it was never assigned.
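The failure mode can be reproduced in isolation. This is a sketch, not the benchmark code: the function names, the variable name `result`, and the simulated `MemoryError` are all hypothetical, and binding the variable before the `try` is just one common way to avoid the problem, not necessarily the fix applied in this PR.

```python
def run_without_guard(allocate):
    try:
        result = allocate()  # if this raises, `result` is never bound
        return "ok"
    finally:
        del result  # raises UnboundLocalError when allocate() failed


def run_with_guard(allocate):
    result = None  # bind the name first so the cleanup is always valid
    try:
        result = allocate()
        return "ok"
    finally:
        del result


def failing_allocation():
    # Stand-in for an allocation that runs out of memory.
    raise MemoryError("simulated OOM")


for runner in (run_without_guard, run_with_guard):
    try:
        runner(failing_allocation)
    except Exception as exc:
        # Report which exception type escapes each variant.
        print(runner.__name__, type(exc).__name__)
```

In the unguarded variant the cleanup error replaces the original out-of-memory error, which hides the real cause of the failure.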
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116260
Approved by: https://github.com/lezcano