Run each `(batch_size, compile)` benchmark 10 times in `./runner.sh` and report the mean and standard deviation of each metric in the output table
Only report `warmup_latency`, `average_latency`, `throughput`, and `gpu_util`
Split `output.md` into one markdown file per `(batch_size, compile)` configuration. Subsequent runs of `./runner.sh` append one row to the table in each file for easy comparison
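
The aggregation and per-configuration append described above could look roughly like the sketch below. It is illustrative only and is not the code in this PR: the helper names `summarize` and `append_row`, the `+/-` cell format, the metric key spellings, and the use of the sample standard deviation are all assumptions.

```python
import statistics
from pathlib import Path

# Assumed metric keys; the real benchmark output may spell these differently.
METRICS = ["warmup_latency", "average_latency", "throughput", "gpu_util"]


def summarize(runs):
    """Return {metric: (mean, stdev)} over a list of per-run metric dicts."""
    summary = {}
    for name in METRICS:
        values = [run[name] for run in runs]
        # Sample standard deviation needs at least two runs (assumed choice).
        std = statistics.stdev(values) if len(values) > 1 else 0.0
        summary[name] = (statistics.mean(values), std)
    return summary


def append_row(path, label, summary):
    """Append one table row to a per-configuration markdown file (hypothetical layout)."""
    path = Path(path)
    if not path.exists():
        # First run for this configuration: write the table header once.
        header = "| run | " + " | ".join(METRICS) + " |\n"
        divider = "| --- | " + " | ".join("---" for _ in METRICS) + " |\n"
        path.write_text(header + divider)
    cells = ["{:.3f} +/- {:.3f}".format(*summary[name]) for name in METRICS]
    with path.open("a") as f:
        f.write("| " + label + " | " + " | ".join(cells) + " |\n")
```

Calling `append_row` once per invocation of the runner, with one file per `(batch_size, compile)` pair, keeps earlier rows in place so results from different runs sit side by side.
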
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113309
Approved by: https://github.com/albanD