1351 Commits

Author SHA1 Message Date
angelayi
b126adcdee [aotinductor] Pass TorchIR to AOTInductor (#110020)
Updates `_export.aot_compile` to pass a torch IR graph to inductor, allowing inductor to now run the pre_grad_passes, and reuse more of inductor's code.
Also updates the API to return only the `so_path`, not the exported program. The pytree call spec is now serialized and placed inside the generated model code. When calling the model, because there is no C++ pytree implementation linked yet, we can access the call specs through `get_call_spec()` and call pytree flatten/unflatten in Python.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110020
Approved by: https://github.com/desertfire
2023-10-26 15:54:31 +00:00
Simon Fan
28ebe5df7a yolov3: reduce batch size due to OOM (#111959)
yolov3 w/ cudagraphs (known to use more memory) is failing perf test due to OOM (https://hud.pytorch.org/benchmark/torchbench/inductor_with_cudagraphs?startTime=Mon,%2016%20Oct%202023%2020:19:47%20GMT&stopTime=Mon,%2023%20Oct%202023%2020:19:47%20GMT&granularity=hour&mode=training&dtype=amp&lBranch=main&lCommit=0b424ee0b7bfe09e0a438a63e8336e95eea85901&rBranch=main&rCommit=29048be41ca3aa8974795d93b9ea9fd6dee415fc)

I'm reducing the batch size from 16 to 8 to keep the same batch size for all yolov3 HUD benchmarks

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111959
Approved by: https://github.com/xuzhao9
2023-10-25 06:18:53 +00:00
Simon Fan
9e6c97890b Dynamo runner: add FSDP handcrafted module wrapping policy (#111505)
The default size based auto wrap policy may not be representative of actual usage of the models. We add support for a few handpicked models, and fall back to the size based policy for the rest.

sample command:
`PYTHONPATH=~/benchmark/ python benchmarks/dynamo/torchbench.py -dcuda --training --backend=inductor --multiprocess --performance --only nanogpt --fsdp`

1.257x
1.256x
1.257x
1.252x
1.257x
1.262x
1.258x
1.272x

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111505
Approved by: https://github.com/H-Huang, https://github.com/xuzhao9
2023-10-25 03:05:31 +00:00
BowenBao
ad4971c0b1 Delete deepcopied model after use in benchmark to reduce memory consumption (#111868)
As title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111868
Approved by: https://github.com/msaroufim, https://github.com/thiagocrepaldi
ghstack dependencies: #111867, #111593
2023-10-24 23:44:14 +00:00
BowenBao
4839f319da Apply same 'pick_grad' on generating fp64 reference outputs (#111593)
To lower memory consumption for inference mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111593
Approved by: https://github.com/msaroufim, https://github.com/thiagocrepaldi
ghstack dependencies: #111867
2023-10-24 20:16:53 +00:00
BowenBao
ec2e0712db [ONNX] Enable onnx inlining in benchmark for >2GB models (#111867)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111867
Approved by: https://github.com/thiagocrepaldi
2023-10-24 20:16:53 +00:00
Pearu Peterson
6382011843 Add NVIDIA A100 optimized meta parameters to bsr_dense_mm (#111760)
As in the title.

The figures below illustrate the performance differences of bsr_dense_mm with optimized parameters and bsr_dense_mm with default parameters (GPU: NVIDIA A100-SXM4-80GB). The first figure represents the performance equilibrium point in BSR tensor sparsity at which bsr_dense_mm has the same performance characteristics as torch.matmul. The second figure represents speedups from using optimized meta parameters in bsr_dense_mm at its performance equilibrium points with respect to bsr_dense_mm with default meta parameters.

In sum, this PR speeds up `bsr_dense_mm` about 50 % depending on the bsr tensor shape and blocksize and lowers the performance equilibrium points of BSR tensor sparsity and strided tensor for matmul operations.

<img src="https://github.com/pytorch/pytorch/assets/402156/6fe9d35f-dd21-4aa0-bb01-6ee257254453" width="48%"> <img src="https://github.com/pytorch/pytorch/assets/402156/506921c6-3770-4209-ad3d-498d2ae4989d" width="48%">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111760
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #110396, #111470, #111489
2023-10-23 23:52:49 +00:00
Pearu Peterson
6078ed95cc Use lru_cache to cache indices data for bsr_scatter_mm. (#111470)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111470
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #110396
2023-10-23 23:52:49 +00:00
Pearu Peterson
d4708a6da7 Add scatter_mm and bsr_scatter_mm operations. (#110396)
This PR introduces `scatter_mm` operation (compute `mm` of arbitrary pairs of tensors given in batches of tensors) that is used to implement `bsr_scatter_mm` that is equivalent to `bsr_dense_mm` (the `mm` operation on bsr and strided tensors). The implementation is provided both in Triton (when tensor dimensions are multiples of 16) and in PyTorch (otherwise).

The figures below illustrate the performance differences of `bsr_scatter_mm` and `bsr_dense_mm` (GPU: `NVIDIA GeForce RTX 2060 SUPER`). The first figure represents the performance equilibrium point in BSR tensor sparsity at which value `bsr_scatter_mm` or `bsr_dense_mm` have the same performance characteristics as `torch.matmul`. The second figure represents speedups from using `bsr_scatter_mm` at its performance equilibrium points with respect to `bsr_dense_mm`.

<img src="https://github.com/pytorch/pytorch/assets/402156/526d182e-937f-4812-a6c4-904f52d6d5ab" width="48%"> <img src="https://github.com/pytorch/pytorch/assets/402156/ccb606ab-1f3f-4133-887c-b56285f4f168" width="48%">

The same figures for GPU card `NVIDIA A100-SXM4-80GB`:

<img src="https://github.com/pytorch/pytorch/assets/402156/25466f1d-df34-4d1c-a975-afb478e4d9f0" width="48%"> <img src="https://github.com/pytorch/pytorch/assets/402156/6ada91f0-a20f-4f0d-8a48-1f4ccc60d08e" width="48%">

In sum:
- `bsr_scatter_mm` is about 2x faster than `bsr_dense_mm` for small block sizes of 16 and 32 and large tensors [GPU: `NVIDIA GeForce RTX 2060 SUPER`].
- `bsr_scatter_mm` is up to 2x faster than `bsr_dense_mm` for small block sizes of 16 and large tensors [GPU: `NVIDIA A100-SXM4-80GB`].
- `bsr_dense_mm` is up to 20 % faster than `bsr_scatter_mm` for block sizes of 64 or larger [GPU: `NVIDIA GeForce RTX 2060 SUPER`].
- However, `bsr_dense_mm` fails with `OutOfResources` exception for block sizes of 256 or larger whereas `bsr_scatter_mm` succeeds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110396
Approved by: https://github.com/cpuhrsch
2023-10-23 19:45:30 +00:00
Bin Bao
ce48d36324 [aotinductor] Update test utility to use AOTIModelRunner (#111657)
Summary: Use AOTIModelRunner provided by libtorch instead of the custom written RAIIModelContainer for testing. This change also makes running AOTInductor benchmarks on CPU possible.

Differential Revision: [D50560764](https://our.internmc.facebook.com/intern/diff/D50560764)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111657
Approved by: https://github.com/chenyang78
2023-10-23 18:21:27 +00:00
Jon Chuang
c4ab229a82 [dynamo] Implement set.__contains__ for Tensor as object match of FakeTensor (#111738)
Fixes https://github.com/pytorch/pytorch/issues/111556

Dynamo implementation of `set.__contains__` previously used `__eq__` match.

But this is wrong when `__eq__` match does not imply `__hash__` match, as is the case for `torch.Tensor`, leading to inconsistent results. See: https://github.com/pytorch/pytorch/issues/111542

Hence implement as Tensor object match i.e. proxy node `'example_value'` FakeTensor match.
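
To illustrate why an `__eq__`-based membership test can disagree with real `set` semantics, here is a minimal sketch in plain Python. `IdentityHashed` is a hypothetical, simplified stand-in for an object whose hash is identity-based while `__eq__` compares values; it is not Dynamo code and it does not reproduce `torch.Tensor`'s elementwise `__eq__`.

```python
class IdentityHashed:
    """Hypothetical stand-in: value-based __eq__, identity-based __hash__."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return isinstance(other, IdentityHashed) and self.value == other.value

    def __hash__(self):
        return id(self)


a = IdentityHashed(1)
b = IdentityHashed(1)  # equal to `a` by value, but a different object
members = {a}

# Equality-based scan: reports a match because a == b.
eq_match = any(b == m for m in members)

# Real set lookup compares hashes first, so the distinct object `b` is not found.
set_match = b in members
```

The two answers differ, which is the kind of inconsistency an object-identity match avoids.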

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111738
Approved by: https://github.com/lezcano
2023-10-22 17:40:34 +00:00
Aaron Gokaslan
cb856b08b2 [BE]: Attach cause to some exceptions and enable RUFF TRY200 (#111496)
Did some easy fixes from enabling TRY200. Most of these seem like oversights instead of intentional. The proper way to silence intentional errors is with `from None` to note that you thought about whether it should contain the cause and decided against it.
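
As a short illustration of the two patterns mentioned above, the sketch below uses made-up helper functions (`parse_port` and `lookup` are not from the PyTorch codebase): one attaches the cause with `from err`, the other deliberately suppresses it with `from None`.

```python
def parse_port(text):
    """Chain the original error so the traceback shows the root cause."""
    try:
        return int(text)
    except ValueError as err:
        raise RuntimeError(f"invalid port: {text!r}") from err


def lookup(table, key):
    """Deliberately drop the cause with `from None`."""
    try:
        return table[key]
    except KeyError:
        raise LookupError(f"unknown key: {key!r}") from None
```

With `from err`, the original exception is available as `__cause__`; with `from None`, `__cause__` is `None` and the implicit context is hidden from the traceback.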

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111496
Approved by: https://github.com/malfet
2023-10-19 21:56:36 +00:00
BowenBao
e3463fe4ca [ONNX] Benchmark to store test data along exported model (#111095)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111095
Approved by: https://github.com/justinchuby, https://github.com/thiagocrepaldi
2023-10-19 03:20:52 +00:00
BowenBao
0b14ec8ca6 [ONNX] Add dynamo_onnx_aot_inline to bench (#110183)
An option that applies onnx.inliner post model export.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110183
Approved by: https://github.com/thiagocrepaldi
2023-10-18 00:43:04 +00:00
Shunting Zhang
cc9b7bb85c [reland] [inductor] fix a max-autotune rng state related bug (#111381)
reland https://github.com/pytorch/pytorch/pull/109828

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111381
Approved by: https://github.com/lezcano
2023-10-17 19:16:36 +00:00
Evgeni Burovski
48989bc820 trace frames with np.ndarray (#110512)
Fixes #109604

Resubmit gh-109715 + several skips and small fixes to make tests pass.

The main fix here is by @ysiraichi : previously, dynamo did not resume tracing numpy ndarrays after a graph break.
While at it, fix several small issues Yukio's fix uncovers:

- graph break gracefully on numpy dtypes which do not map to torch.dtypes (uint16 etc)
- recognize array scalars in dynamo, treat them as 0D ndarrays
- make sure that iterating over torch.ndarray generates arrays not bare tensors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110512
Approved by: https://github.com/lezcano
2023-10-15 00:56:10 +00:00
Jesse Cai
8db72a430d [sparse] Add padding for dense matrices in semi-structured sparse (#110583)
Summary:

Currently we have shape constraints in semi-structured sparsity for both
CUTLASS and cuSPARSELt

These shape constraints unfortunately apply to both the dense and sparse
matrices in sparse-dense matmul.

This PR adds support for calling `F.pad` to pad dense
matrices to the right size with zeros and then pull out the
corresponding rows from the result matrix.

We also throw a warning in this case.
The tests have also been updated to take in a dense_input_shape
parameter.
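
The pad-then-slice idea can be sketched in plain Python on lists of lists. This is only an illustration: the PR itself uses `F.pad` on tensors, and the multiple of 4 below is an arbitrary assumption, not the actual CUTLASS or cuSPARSELt constraint.

```python
def pad_rows(matrix, multiple):
    """Append zero rows until the row count is a multiple of `multiple`."""
    rows, cols = len(matrix), len(matrix[0])
    padded_rows = -(-rows // multiple) * multiple  # round up to the next multiple
    return matrix + [[0] * cols for _ in range(padded_rows - rows)]


def matmul(a, b):
    """Naive dense matmul on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


dense = [[1, 2], [3, 4], [5, 6]]  # 3 rows: not a multiple of 4
other = [[1, 0], [0, 1]]          # identity, so the product equals `dense`

padded = pad_rows(dense, multiple=4)          # 4 rows, the last one all zeros
result = matmul(padded, other)[: len(dense)]  # drop the rows added by padding
```

Because the padding rows are all zeros, they only produce extra zero rows in the product, which the final slice discards.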

Test Plan:
```
python test/test_sparse_semi_structured.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110583
Approved by: https://github.com/alexsamardzic, https://github.com/cpuhrsch
2023-10-13 20:04:23 +00:00
PaliC
af05fbb84a Linter to avoid csv merge conflicts (#111163)
This PR addresses the persistent issue of merge conflicts in the benchmarks/dynamo/ci_expected_accuracy/ directory, specifically those arising from frequently updated CSV files. Based on @malfet's suggestion, the solution adds three blank lines between consecutive rows in the CSV files. This approach has proven effective in preventing merge conflicts, as evidenced in [D50239634](https://www.internalfb.com/intern/diff/D50239634/). Despite these changes, the extra newlines should still allow the CSVs to be ingested as normal.

If you have access to the diff:
Normally, modifying a line that is later altered in the stack results in a merge conflict during restacking. With this new spacing strategy, lines that are not modified further down the stack will not trigger merge conflicts, achieving our intended outcome.
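
A quick way to check the "ingested as normal" claim, assuming the padding takes the form of blank lines between rows, is Python's `csv.DictReader`, which skips empty rows. The column and model names below are made up for illustration and are not copied from the real files.

```python
import csv
import io

# Illustrative content only: rows separated by three blank lines.
text = "name,accuracy\n\n\n\nalexnet,pass\n\n\n\nresnet50,pass\n"

rows = list(csv.DictReader(io.StringIO(text)))
```

Other CSV readers may treat blank lines differently, so this only demonstrates the behaviour of the standard-library reader.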

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111163
Approved by: https://github.com/malfet, https://github.com/huydhn
2023-10-13 09:35:34 +00:00
angelayi
577e3dff88 [aotinductor] Fail models temporarily (#111100)
Temporarily mark these models as fail. Failures are due to https://github.com/pytorch/pytorch/pull/111030 which is needed for ExecuTorch's release so it can't be reverted. Will forward fix the failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111100
Approved by: https://github.com/desertfire
2023-10-12 00:48:44 +00:00
Bin Bao
4abfa22812 [aotinductor] Add a perf smoke test for AOTInductor (#110972)
Summary: To prevent perf regression like the one caused by #110510

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110972
Approved by: https://github.com/chenyang78
2023-10-11 13:30:05 +00:00
Michael Voznesensky
1e7947b3e0 Revert "Reland 3rd try [finishing colesbury's PR 100642] Guard on nn.Module dicts and type (#109323)" + Forward fixes + test (#110964)
This reverts commit f786fbdebd.

Forward fixes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110964
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2023-10-11 05:16:47 +00:00
angelayi
83061ee177 [aotinductor] Fix benchmarks with self.autocast (#110490)
Fixes https://github.com/pytorch/pytorch/issues/108173

The original error was that there was a type mismatch between the output of eager mode (float16) and from aot_compile (float32). This is because when we run the model eagerly in the benchmarks, we call [self.model_iter_fn](https://github.com/pytorch/pytorch/blob/main/benchmarks/dynamo/common.py#L2072-L2076) to run the model, rather than directly calling the model. In the case of timm models, it calls the model with [self.autocast()](https://github.com/pytorch/pytorch/blob/main/benchmarks/dynamo/timm_models.py#L321-L323), causing the eager model to return a float16 value. However, the model we export with aot_compile does not have the self.autocast context, so it returns float32.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110490
Approved by: https://github.com/desertfire
2023-10-06 02:13:47 +00:00
Xu Zhao
2e31fae5c5 Cleanup the code in the dynamo userbenchmark (#110519)
Summary:
Skip importing the modules that are only available in the pytorch source code, not pytorch nightly release.

Make dynamo benchmark work on both OSS and internal.

X-link: https://github.com/pytorch/benchmark/pull/1960

Test Plan:
```
$ python run_benchmark.py dynamo --only alexnet --training --performance --inductor
loading model: 0it [00:05, ?it/s]
cuda train alexnet
running benchmark: 100%|█████████████████| 30/30 [00:00<00:00, 41.46it/s]
1.129x
```

```
$ buck2 run mode/opt //pytorch/benchmark:run_benchmark -- dynamo --only alexnet --training --inductor --performance --output-directory $HOME
loading model: 0it [00:16, ?it/s]
running benchmark: 100%|█████████████████| 30/30 [00:00<00:00, 37.94it/s]
cuda train alexnet
1.120x
```

Differential Revision: D49912006

Pulled By: xuzhao9

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110519
Approved by: https://github.com/desertfire, https://github.com/jansel
2023-10-04 23:26:30 +00:00
Bin Bao
06e88d2cfc [aotinductor] Remove output_spec from AOTInductorModelCache (#110462)
Summary: No need to store output_spec as the returned exported.call_spec already contains that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110462
Approved by: https://github.com/angelayi
2023-10-03 22:29:36 +00:00
BowenBao
6b2c52278e Benchmark flag to include slowdowns when computing gmean of speedups over eager (#108375)
`clip(1)` excludes slowdowns by treating them as 1x.
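
The effect of clipping on the geometric mean can be seen with a small sketch. The speedup values are made up, and `gmean` is a local helper rather than the benchmark's own implementation.

```python
import math


def gmean(values):
    """Geometric mean of positive numbers."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


speedups = [2.0, 0.5, 1.0]  # made-up data: one speedup, one slowdown, one tie

# Same effect as clip(1): every slowdown is counted as 1x.
clipped = [max(s, 1.0) for s in speedups]

gmean_with_slowdowns = gmean(speedups)  # the slowdown cancels the speedup
gmean_clipped = gmean(clipped)          # the slowdown is ignored
```

With the slowdown included, the 2x and 0.5x entries cancel out; with clipping, only the 2x entry moves the mean above 1.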

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108375
Approved by: https://github.com/jansel
2023-10-02 20:35:08 +00:00
atalman
b253fc9c93 Revert "[1/N] Dynamo skipfiles refactor (#109567)" (#110296)
This reverts commit 84c5435b29.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110296
Approved by: https://github.com/yanboliang
2023-09-29 20:35:46 +00:00
Simon Fan
88ef126a93 rename nanogpt_generate to nanogpt to also support train (#109746)
Differential Revision: [D49522940](https://our.internmc.facebook.com/intern/diff/D49522940)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109746
Approved by: https://github.com/msaroufim, https://github.com/malfet, https://github.com/xuzhao9
2023-09-29 17:36:48 +00:00
Bin Bao
f82a29e32b [inductor] Add CI jobs to test AOTInductor (#108419)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108419
Approved by: https://github.com/angelayi, https://github.com/jansel
2023-09-28 20:19:25 +00:00
Yanbo Liang
84c5435b29 [1/N] Dynamo skipfiles refactor (#109567)
This is 1/N of the dynamo skipfiles/allowed_functions refactor, the major change in this PR includes:
* Refactor & define the [skipfiles rules](https://github.com/pytorch/pytorch/pull/109567/files#diff-5aa3ce9db729bf0901ea97a5d3cc51924cc8575d9c516c1c8f572a35de92544aR56) and interface
* For every ```skipfiles.check```, we return both the check result and the skip/inline reason and log them for debugging.
* We found several latent issues/bugs and incorrect implementations in the codebase, but I'm planning to fix them in follow-up PRs to make the refactor decoupled with bug fixes.
* More details in the inline comments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109567
Approved by: https://github.com/ezyang, https://github.com/jansel, https://github.com/anijain2305
2023-09-28 18:36:46 +00:00
PyTorch MergeBot
75462fd870 Revert "[1/N] Dynamo skipfiles refactor (#109567)"
This reverts commit f8e0ebec8c.

Reverted https://github.com/pytorch/pytorch/pull/109567 on behalf of https://github.com/huydhn due to Many jobs are failing in trunk after this with FILENAME_ALLOWLIST is not defined error f8e0ebec8c. This looks like a landrace ([comment](https://github.com/pytorch/pytorch/pull/109567#issuecomment-1738344950))
2023-09-28 02:22:22 +00:00
Yanbo Liang
f8e0ebec8c [1/N] Dynamo skipfiles refactor (#109567)
This is 1/N of the dynamo skipfiles/allowed_functions refactor, the major change in this PR includes:
* Refactor & define the [skipfiles rules](https://github.com/pytorch/pytorch/pull/109567/files#diff-5aa3ce9db729bf0901ea97a5d3cc51924cc8575d9c516c1c8f572a35de92544aR56) and interface
* For every ```skipfiles.check```, we return both the check result and the skip/inline reason and log them for debugging.
* We found several latent issues/bugs and incorrect implementations in the codebase, but I'm planning to fix them in follow-up PRs to make the refactor decoupled with bug fixes.
* More details in the inline comments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109567
Approved by: https://github.com/ezyang, https://github.com/jansel, https://github.com/anijain2305
2023-09-28 01:21:59 +00:00
BowenBao
85e408217a [ONNX] Move out onnx bench bash scripts (#103983)
Summary:
- Remove onnx bench related scripts and `_onnx` folder.
- Update `common.py` to include onnx related patches previously under `_onnx` folder.
- Update `merge_rules.json` to include bench files.
- Added quick sanity onnx bench test to onnx CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103983
Approved by: https://github.com/kit1980
2023-09-27 23:54:26 +00:00
rzou
7dbdf3be1e Fix inductor CI (by updating graph break count) (#110160)
There was a vision hash update which led to fewer graph breaks. This
seems expected to me (because the hash update included
https://github.com/pytorch/vision/pull/7944 and nms is used in maskrcnn).

Test Plan:
- wait for ci

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110160
Approved by: https://github.com/ezyang, https://github.com/Chillee
2023-09-27 14:37:36 +00:00
angelayi
57cdad2396 [aotinductor] Update benchmark to include compilation time (#109998)
Fixes [comment](https://github.com/pytorch/pytorch/pull/109820#pullrequestreview-1638629777)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109998
Approved by: https://github.com/desertfire
2023-09-25 21:30:22 +00:00
angelayi
a565f1bee6 [aotinductor] Skip benchmarks with control flow (#109661)
Since AOTInductor doesn't support control flow yet, we will skip over tests that are currently failing due to containing control flow in the code. Logs taken from https://hud.pytorch.org/benchmark/compilers?startTime=Tue%2C%2012%20Sep%202023%2022%3A56%3A40%20GMT&stopTime=Tue%2C%2019%20Sep%202023%2022%3A56%3A40%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&lBranch=main&lCommit=2c1554a0323107d821be3ff13df7833b9f0b960d&rBranch=main&rCommit=47be61e12bd51df27182343d312dc3df485d5559

Errors documented in https://github.com/pytorch/pytorch/issues/105217

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109661
Approved by: https://github.com/desertfire
2023-09-25 18:49:06 +00:00
PyTorch MergeBot
d9627c4264 Revert "[inductor] fix a max-autotune rng state related bug (#109828)"
This reverts commit 3663436db3.

Reverted https://github.com/pytorch/pytorch/pull/109828 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the rocm failure looks legit. There is also another numpy import error when running dynamo test on CPU ([comment](https://github.com/pytorch/pytorch/pull/109828#issuecomment-1732423883))
2023-09-23 22:35:37 +00:00
Shunting Zhang
3663436db3 [inductor] fix a max-autotune rng state related bug (#109828)
Fix https://github.com/pytorch/pytorch/issues/109736 .

The HF pin move causes an accuracy-check regression for HF models on the dashboard. Manually reverting the HF PR ( https://github.com/huggingface/transformers/pull/24696/files ) could recover, but this may hide some real issue. I happened to find that using a warm matmul max-autotune cache can work around the issue. Or, putting it another way:
- making every call to check_cache a cache miss reproduces the issue
- making every call to check_cache a cache hit works around the issue

I did some sort of 'bisect', halving the number of cache misses each time while still making sure we can repro. Luckily, reducing to a single cache miss still reproduces the issue. With more debugging, it turns out that the call to `torch.randn` on the cuda device is causing the problem.

The fix is to make sure  we restore the rng state when we generate random inputs for max-autotune benchmarking.
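
The save-and-restore pattern can be sketched with the standard-library `random` module as an analogy. This is not the inductor code: `preserve_rng_state` and `make_benchmark_inputs` are hypothetical names, and the real fix deals with PyTorch RNG state rather than Python's.

```python
import contextlib
import random


@contextlib.contextmanager
def preserve_rng_state():
    """Save the global RNG state and restore it on exit."""
    state = random.getstate()
    try:
        yield
    finally:
        random.setstate(state)


def make_benchmark_inputs(n):
    """Draw throwaway random inputs without disturbing the caller's RNG stream."""
    with preserve_rng_state():
        return [random.random() for _ in range(n)]


random.seed(0)
expected = random.random()  # what the model would draw with no benchmarking

random.seed(0)
make_benchmark_inputs(16)   # benchmarking happens in between
actual = random.random()    # unchanged, because the state was restored
```

Without the context manager, the 16 extra draws would advance the generator and `actual` would differ from `expected`.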

TBH, I cannot fully explain the root cause, although I know it's caused by an rng state change. AOTAutograd already has some logic to preserve rng state, and I cannot repro the issue in unit tests. I have a few guesses as to why the RNG state is not restored in the first place after we generate random inputs for max-autotune:
- maybe AOTAutograd misses some corner case when preserving the rng state
- maybe for the failed models there are some eager fallbacks that are not handled by inductor. If those fallbacks call random-number-related APIs, we will see the issue. But again, I haven't found a good way to simulate this.

Repro:

```
TORCHINDUCTOR_BENCHMARK_KERNEL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 CUDA_VISIBLE_DEVICES=3 time python benchmarks/dynamo/huggingface.py --backend inductor --amp --accuracy --only PLBartForCausalLM --training --cold-start-latency
```

We always repro the issue without the PR but pass the accuracy check with the PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109828
Approved by: https://github.com/eellison
2023-09-23 00:58:10 +00:00
Mark Saroufim
e2cfbca5ab Add clip to dynamo runners (#109840)
CLIP was moved to canary models because we use the multimodal version which depends on torchtext which torchbench deprecated https://github.com/pytorch/benchmark/pull/1837

This issue didn't show up before because we hadn't updated the torchbench pin

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109840
Approved by: https://github.com/cpuhrsch
2023-09-22 20:50:57 +00:00
Bin Bao
8856c1628e [inductor] Change AOTInductor to return output tensors (#109790)
Summary:
Change AOTInductor to directly return output tensors instead of taking pre-allocated output tensors to return the results. This gives several benefits:

* It makes sure AOTInductor has the same behavior when managing the output tensors as the default Inductor, which is widely tested and thus more reliable.
* As we have debugged before, there are cases we still have to codegen extra copy_ ops to fill the pre-allocated output tensors which doesn't make sense for performance.
* With the coming enhanced memory planning, this again will make sure the memory planning logic is the same between AOTInductor and Inductor, which will greatly simplify the problem and improve the reliability.

This change also combines D49494954 from Yang and https://github.com/pytorch/pytorch/pull/109560 from Angela.

Differential Revision: D49502318

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109790
Approved by: https://github.com/chenyang78
2023-09-22 02:31:52 +00:00
Angela Yi
f7ddc54503 [aotinductor] Update performance benchmark code (109560) (#109820)
Summary: Same as #109560, made a new PR because we need to land from internal

Previously during performance benchmark testing, we would create an AOTInductorModelContainerHandle every time the compiled function is run with new inputs. However after https://github.com/pytorch/pytorch/pull/108473 we now load the constants needed in the runtime when initializing the AOTInductorModelContainerHandle. This resulted in our benchmarks displaying a ~0.4x speedup.

This diff moves the initialization of AOTInductorModelContainerHandle outside of the code where we run the compiled function with different inputs.

For example,
```
python benchmarks/dynamo/huggingface.py --performance --cold-start-latency --inference --bfloat16 --export-aot-inductor --disable-cudagraphs --device cuda --total-partitions 3 --partition-id 0 --only AlbertForMaskedLM
```
results in `1.359x` speedup.

Specifically, this adds `create_container_handle` and `delete_container_handle` functions, which need to be called around `run`. We call `create_container_handle` to initialize the AOTInductorModelContainerHandle, call `run` to run the compiled .so with different inputs, and then call `delete_container_handle` to delete it.
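
The call order can be sketched with a toy class. `FakeContainer` is a hypothetical stand-in that only borrows the three method names from the description above; it does not load a real compiled `.so`.

```python
class FakeContainer:
    """Hypothetical stand-in for a compiled model with an expensive setup step."""

    def __init__(self):
        self.loads = 0      # how many times the expensive setup ran
        self.handle = None

    def create_container_handle(self):
        self.loads += 1     # stands in for loading constants
        self.handle = object()

    def run(self, x):
        assert self.handle is not None, "create_container_handle() must be called first"
        return x * 2

    def delete_container_handle(self):
        self.handle = None


model = FakeContainer()
model.create_container_handle()  # once, outside the timed loop
try:
    outputs = [model.run(x) for x in range(5)]  # every run reuses the handle
finally:
    model.delete_container_handle()
```

Keeping the setup outside the loop means its cost is paid once rather than on every timed run.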

[Updated dashboard results](https://hud.pytorch.org/benchmark/compilers?startTime=Wed%2C%2013%20Sep%202023%2021%3A03%3A55%20GMT&stopTime=Wed%2C%2020%20Sep%202023%2021%3A03%3A55%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&lBranch=angelayi/aot_inductor_benchmark&lCommit=f9aa49c4c9a1a140b6f0c4520d1d6d99b57e12fa&rBranch=main&rCommit=015be4cedba357eb931e24bf188479235db7c5c8)

Test Plan: CI

Differential Revision: D49513934

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109820
Approved by: https://github.com/desertfire
2023-09-21 20:49:41 +00:00
Simon Fan
ef8d461b09 Fix torchbench --multiprocess (#109657)
`python benchmarks/dynamo/torchbench.py --multiprocess` currently fails due to initializing distributed multiple times:

```
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:6789 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:6789
 (errno: 98 - Address already in use).
```

Because torchbench calls itself via mp.spawn, there is the parent run (with `--multiprocess`) and child runs (with `--multiprocess --only <model>`).

This PR addresses this by fixing two issues:
1) distributed is initialized once in parent run and once in child runs, it should be initialized only in child runs where we have accurate rank and world size info
2) torchbench overrides CUDA_VISIBLE_DEVICES/world_size sometimes, but it shouldn't for distributed use cases where we want to use all available gpus

I am also adding a CI test to cover this type of issue in #109311

### Test plan
parent run test: `python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --inference --bfloat16 --output /home/xmfan/local/pytorch/test/test-reports/inference_torchbench.csv --multiprocess`
child run test: `python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --inference --bfloat16 --output /home/xmfan/local/pytorch/test/test-reports/inference_torchbench.csv --multiprocess --only simple_gpt`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109657
Approved by: https://github.com/H-Huang
2023-09-21 16:53:07 +00:00
eellison
d24ba7a634 Add 3d Attn Pattern to match HF Whisper (#109156)
Adds a 3d pattern that improves perf of HF Whisper from 1.3 -> 4.1. We could be matching more generally on 3d, but i'll leave that for another pr.

Thanks to @drisspg for helping me write the pattern.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109156
Approved by: https://github.com/yanboliang
ghstack dependencies: #109663, #108894, #108917, #109142
2023-09-20 16:39:31 +00:00
Edward Z. Yang
964b79c813 [EASY] Update dynamo dependency installing Makefile (#107229)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107229
Approved by: https://github.com/bdhirsh
2023-09-19 18:58:37 +00:00
Mark Saroufim
0ec9f59f70 Loudly Error in dynamo bench if eager fails (#109536)
Helps debug https://github.com/pytorch/benchmark/issues/1901

I will wait until the ONNX beartype sev is fixed before merging

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109536
Approved by: https://github.com/xuzhao9
2023-09-19 00:40:42 +00:00
angelayi
5b13f74e9b [export] Update how we input kwargs (#109160)
Previously, the code for passing inputs to exported program was:
```
if kwargs:
    return (args, kwargs)
else:
    return args
```

However, this causes some inconsistency where if the original input contains args and kwargs, the treespec would be a tuple containing a tuple of arguments, and a dictionary of keyword arguments. But if the original input only contained args, the treespec would just be a tuple of arguments. This inconsistency causes some inconveniences in the runtime.

So I updated the code to just always keep the kwargs around.
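
A minimal sketch of the difference, using hypothetical helper names (`pack_inputs_old` and `pack_inputs_new` are not the actual export functions):

```python
def pack_inputs_old(args, kwargs):
    """Previous behaviour quoted above: the shape depends on whether kwargs exist."""
    if kwargs:
        return (args, kwargs)
    return args


def pack_inputs_new(args, kwargs):
    """Always return an (args, kwargs) pair, using an empty dict if needed."""
    return (args, kwargs or {})


with_kwargs = pack_inputs_new((1, 2), {"scale": 3})
without_kwargs = pack_inputs_new((1, 2), None)
```

Both results now have the same two-element structure, so a consumer can always unpack `args, kwargs` without checking which case it received.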

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109160
Approved by: https://github.com/zhxchen17, https://github.com/avikchaudhuri
2023-09-19 00:04:32 +00:00
Justin Chu
050c56d0a5 [dynamo][ci] Pin beartype to 0.15.0 (#109510)
CIs are failing because of https://github.com/beartype/beartype/issues/282

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109510
Approved by: https://github.com/thiagocrepaldi
2023-09-18 19:08:32 +00:00
Aaron Gokaslan
6d725e7d66 [BE]: enable ruff rules PLR1722 and PLW3301 (#109461)
Enables two ruff rules derived from pylint:
* PLR1722 replaces any exit() calls with sys.exit(). exit() is only designed to be used in REPL contexts, as it may not always be imported by default. This rule always uses the version in the sys module, which is better.
* PLW3301 replaces nested min / max calls with simplified versions (i.e. `min(a, min(b, c))` => `min(a, b, c)`). The new version is more idiomatic and more efficient.
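
For illustration, code that satisfies both rules might look like the sketch below; `smallest` and `main` are made-up functions, not taken from the PR.

```python
import sys


def smallest(a, b, c):
    # PLW3301: prefer the flat call over min(a, min(b, c)).
    return min(a, b, c)


def main(argv):
    if not argv:
        # PLR1722: sys.exit() instead of the REPL-oriented exit() builtin.
        sys.exit(2)
    return smallest(*map(int, argv))
```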

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109461
Approved by: https://github.com/ezyang
2023-09-18 02:07:21 +00:00
Animesh Jain
f786fbdebd Reland 3rd try [finishing colesbury's PR 100642] Guard on nn.Module dicts and type (#109323)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109323
Approved by: https://github.com/huydhn, https://github.com/voznesenskym
2023-09-15 08:44:14 +00:00
Simon Fan
54c5f474a7 Forward rank and world size info to Torchbench models when using dynamo runner (#108438)
Adding support to pass rank and world_size to torchbench model, via its extra_args parameter: https://github.com/pytorch/benchmark/blob/main/torchbenchmark/util/model.py#L83C80-L83C90

This is used for models which distribute over multiple GPUs e.g. simple_gpt https://github.com/pytorch/benchmark/pull/1867

Also add an option to skip multiprocess only gpu models

Testing via `python benchmarks/dynamo/torchbench.py -d cuda --output=benchmark_logs/performance.csv --inference --performance --timing --print-memory --multiprocess --only simple_gpt`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108438
Approved by: https://github.com/Chillee
2023-09-14 21:01:20 +00:00
Nakul Camsamudram
109ab6a0df Support str() on user defined functions (#108973)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108973
Approved by: https://github.com/anijain2305
2023-09-14 01:32:02 +00:00