The general idea is to do a separate CUDA graph for each size. Because of cuda graph trees, these graphs will all share the same memory pool, so your memory usage will only be the worst case memory usage of the biggest dynamic size you want. This requires an extra dispatch in the cudagraphified callable. You must pay for a CUDA graph recording for every dynamic size you encounter, but this is MUCH cheaper than running the entire PT2 compile stack, so I expect you to still see benefits.
This was surprisingly easy to do.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105064
Approved by: https://github.com/voznesenskym
This PR disables translation validation (TV) when running the benchmark suits on
performance workflows: inductor with A100s.
In summary, the changes are:
- Add flag for turning TV on and off on _benchmarks/dynamo/common.py_
- Turn TV on only on CI accuracy builds
- Add `--no-translation-validation` target flag to _.ci/pytorch/test.sh_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104887
Approved by: https://github.com/ezyang
As of now, translation validation runs to its completion. However, Z3 is time
consuming. PR #104464, for example, disables translation validation for a few benchmarks.
Instead, this PR introduces a timeout for translation validation. In that case, Z3 will
return `unknown`, since it wasn't able to prove or disprove the assertions. Then, we log
it as a warning, but don't stop execution.
Here's a summary of the changes:
- Added an environment variable for turning translation validation on and off
- Added an environment variable for setting the translation validation timeout
- Possibly reverts the changes in #104464
- ~~Move from "QF_NRA" to "QF_NIRA" logic~~
- ~~It makes more sense, given the nature of the problems~~
- "QF_NRA" seems to solve more instances of _dynamo/test_dynamic_shapes.py_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104654
Approved by: https://github.com/ezyang
issues resolved: https://github.com/pytorch/pytorch/issues/104294
local test on TB and TIMM
* python benchmarks/dynamo/torchbench.py -d cuda --inference --accuracy --progress --export --print-dataframe-summary
* python benchmarks/dynamo/timm_models.py -d cuda --inference --accuracy --progress --export --print-dataframe-summary
why not HF
* huggingface use kwargs (dict) to torch.nn.module
* we will need to support kwargs in torch._export.export, which is in progress
local test result
timm 95% pass rate (58 ouf of 61 passed) P781702926
* 1 x [export specific]1 x ERROR:common:Mutating module attribute rel_indices during export
* 1 x[not relevant to export] Unknown model (SelecSls42b)
* 1 x [not relevant to export] Failed to load model: HTTP Error 409: Public access is not permitted on this storage account
torchbench 54% pass rate (41 out of 75 passed) P781690552
* 7 x ERROR:common:Dynamo input and output is a strict subset of traced input/output
* 3 x ERROR:common:call_method NNModuleVariable() / UserDefinedObjectVariable
* 3 x ERROR:common:Mutating module attribute {xx} during export.
* 2 x ERROR:common:inline in skipfiles
* 2 x ERROR:common:Consider annotating your code using constrain_as_*(). It appears that you're trying
* 1 x ERROR:common:guard on data-dependent symbolic int/float
* 1 x ERROR:common:Tensor.tolist
* 1 x ERROR:common:Tensor.numpy. Turn on config.numpy_ndarray_as_tensor and install torch_np to support tensor.numpy(). [may be dev * env?]
* 1 x ERROR:common:missing: BUILD_SET
* 1 x ERROR:common:whole graph export entails exactly one guard export
* 1 x ERROR:common:call_function BuiltinVariable(str) [GetAttrVariable(UserMethodVariable(<function
* 1 x ERROR:common:Dynamic slicing on data-dependent value is not supported
* 1 x ERROR:common:Failed running call_function <function interpolate at 0x7f60a8361ea0>(*(FakeTensor(..., device='cuda:0', size=(1, 3, * 427,
* 1 x ERROR:common:Dynamo attempts to add additional input during export: value=0.6177528500556946, source=RandomValueSource(random_call_index=0)
* 1 x Found following user inputs located at [16, 17, 18, 19, 20, 21, 22] are mutated. This is currently banned in the aot_export workflow.
* 1 x RuntimeError: cumsum_cuda_kernel does not have a deterministic implementation
* 4 x pass_due_to_skip
* 1 x eager_2nd_run_OOM
* 1 x fail_accuracy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104382
Approved by: https://github.com/zhxchen17
Some notes:
* I now manually turn off `_generate` jobs from running with cudagraphs, as it is unrealistic to expect to cudagraph autoregressive generation up to max sequence length, this would imply compiling the entire unrolled sequence generation. Concretely, cm3leon_generate was timing out post this change, likely due to the compile time slowdown of dynamic shapes ON TOP OF accidentally unrolling all the loops
* A few torch._dynamo.reset tactically inserted to force recompiles on tests that expected it
* expectedFailureAutomaticDynamic flip into patching automatic_dynamic_shapes=False
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103623
Approved by: https://github.com/voznesenskym
This PR turns translation validation on by default for tests and accuracy benchmark
runs. It also installs Z3 on CI.
The main changes are:
- Add `--no-translation-validation` as an option in _test/run_tests.py_
- Set `PYTORCH_TEST_WITH_TV` environment variable
- Add `TEST_WITH_TV` variable in _torch/testing/_internal/common_utils.py_
- Turn translation validation on for accuracy benchmarks in _benchmarks/dynamo/common.py_
- Add Z3 installation on CI scripts
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103611
Approved by: https://github.com/ezyang
This PR adds in support for semi-structured sparsity via a tensor
subclass. It currently uses the CUTLASS kernels merged in PR #100881.
In the future we plan to add in cuSPARSELt support (see the other PRs in
the stack), which will give us larger performance gains.
This PR adds in 2 things:
- a Tensor subclass, `SparseSemiStructuredTensor` to store the
sparse tensor in copmressed form and override `__torch_dispatch__`.
- a conversion function that takes in a dense tensor and a
semi-structured sparse bool mask and creates an instance of the
subclass.
**SparseSemiStructuredTensor**
The subclass stores the dense tensor in a contiguous flattened tensor
for future compatability with cuSPARSELt, which expects this format.
Note that the CUTLASS kernels do not have this limitation, as the
specified values and the metadata are passed separately in
`_structured_sparse_linear`. In the future we can use the cuSPARSELT bindings
[here](https://github.com/pytorch/pytorch/pull/103700) for faster matmul, better dtype converage, and relaxed shape
constraints.
Since we currently don't have a way to go back from the sparse
representation to the dense representation, and we store the weights in
compressed form, we don't have a great way to handle .t().
Instead, we keep track of how often we've called transpose on our
tensor, and if it's an unexpected number we throw an error. When the first
argument is sparse, we expect an even number of calls to transpose,
while when the second argument is sparse, we expect an odd number of
calls. This is because we support second argument sparse matrix
multiplications by using transpose properties.
**to_sparse_semi_structured**
This is a conversion function to convert a dense tensor and a
semi-structured sparse bool mask into a subclass. Currently, we must
pass in a bool mask, since we can't infer it becuase there may be
additional zero elements in the dense tensor, so `tensor !=0` is not 2:4
sparse.
Once we add either a method to derive the mask from the dense tensor or
cuSPARSELt, we no longer need to pass in the mask. cuSPARSELt has it's
own helper functions to create the metadata mask.
**User Details**
We have implemented support for the following ops for `torch.float16`
and `torch.int8`:
```
torch.addmm(bias, dense, sparse.t())
torch.mm(dense, sparse)
torch.mm(sparse, dense)
aten.linear.default
aten.t.default
aten.t.detach
```
The end user interface to accelerate a nn.Linaer module with the
subclass would look like this:
```
from torch.sparse import to_sparse_semi_structured
mask = torch.Tensor([0, 0, 1, 1]).tile(128, 32).cuda().bool()
linear = Model(128, 128).half().cuda()
linear.weight = nn.Parameter(to_sparse_semi_structured(linear.weight,
mask=linear.weight.bool())
```
This also updates tests and the `torch.sparse` module docstring to
reflect these changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102135
Approved by: https://github.com/albanD
This PR adds in support for semi-structured sparsity via a tensor
subclass. It currently uses the CUTLASS kernels merged in PR #100881.
In the future we plan to add in cuSPARSELt support (see the other PRs in
the stack), which will give us larger performance gains.
This PR adds in 2 things:
- a Tensor subclass, `SparseSemiStructuredTensor` to store the
sparse tensor in copmressed form and override `__torch_dispatch__`.
- a conversion function that takes in a dense tensor and a
semi-structured sparse bool mask and creates an instance of the
subclass.
**SparseSemiStructuredTensor**
The subclass stores the dense tensor in a contiguous flattened tensor
for future compatability with cuSPARSELt, which expects this format.
Note that the CUTLASS kernels do not have this limitation, as the
specified values and the metadata are passed separately in
`_structured_sparse_linear`. In the future we can use the cuSPARSELT bindings
[here](https://github.com/pytorch/pytorch/pull/103700) for faster matmul, better dtype converage, and relaxed shape
constraints.
Since we currently don't have a way to go back from the sparse
representation to the dense representation, and we store the weights in
compressed form, we don't have a great way to handle .t().
Instead, we keep track of how often we've called transpose on our
tensor, and if it's an unexpected number we throw an error. When the first
argument is sparse, we expect an even number of calls to transpose,
while when the second argument is sparse, we expect an odd number of
calls. This is because we support second argument sparse matrix
multiplications by using transpose properties.
**to_sparse_semi_structured**
This is a conversion function to convert a dense tensor and a
semi-structured sparse bool mask into a subclass. Currently, we must
pass in a bool mask, since we can't infer it becuase there may be
additional zero elements in the dense tensor, so `tensor !=0` is not 2:4
sparse.
Once we add either a method to derive the mask from the dense tensor or
cuSPARSELt, we no longer need to pass in the mask. cuSPARSELt has it's
own helper functions to create the metadata mask.
**User Details**
We have implemented support for the following ops for `torch.float16`
and `torch.int8`:
```
torch.addmm(bias, dense, sparse.t())
torch.mm(dense, sparse)
torch.mm(sparse, dense)
aten.linear.default
aten.t.default
aten.t.detach
```
The end user interface to accelerate a nn.Linaer module with the
subclass would look like this:
```
from torch.sparse import to_sparse_semi_structured
mask = torch.Tensor([0, 0, 1, 1]).tile(128, 32).cuda().bool()
linear = Model(128, 128).half().cuda()
linear.weight = nn.Parameter(to_sparse_semi_structured(linear.weight,
mask=linear.weight.bool())
```
This also updates tests and the `torch.sparse` module docstring to
reflect these changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102135
Approved by: https://github.com/albanD
This PR adds in support for semi-structured sparsity via a tensor
subclass. It currently uses the CUTLASS kernels merged in PR #100881.
In the future we plan to add in cuSPARSELt support (see the other PRs in
the stack), which will give us larger performance gains.
This PR adds in 2 things:
- a Tensor subclass, `SparseSemiStructuredTensor` to store the
sparse tensor in copmressed form and override `__torch_dispatch__`.
- a conversion function that takes in a dense tensor and a
semi-structured sparse bool mask and creates an instance of the
subclass.
**SparseSemiStructuredTensor**
The subclass stores the dense tensor in a contiguous flattened tensor
for future compatability with cuSPARSELt, which expects this format.
Note that the CUTLASS kernels do not have this limitation, as the
specified values and the metadata are passed separately in
`_structured_sparse_linear`. In the future we can use the cuSPARSELT bindings
[here](https://github.com/pytorch/pytorch/pull/103700) for faster matmul, better dtype converage, and relaxed shape
constraints.
Since we currently don't have a way to go back from the sparse
representation to the dense representation, and we store the weights in
compressed form, we don't have a great way to handle .t().
Instead, we keep track of how often we've called transpose on our
tensor, and if it's an unexpected number we throw an error. When the first
argument is sparse, we expect an even number of calls to transpose,
while when the second argument is sparse, we expect an odd number of
calls. This is because we support second argument sparse matrix
multiplications by using transpose properties.
**to_sparse_semi_structured**
This is a conversion function to convert a dense tensor and a
semi-structured sparse bool mask into a subclass. Currently, we must
pass in a bool mask, since we can't infer it becuase there may be
additional zero elements in the dense tensor, so `tensor !=0` is not 2:4
sparse.
Once we add either a method to derive the mask from the dense tensor or
cuSPARSELt, we no longer need to pass in the mask. cuSPARSELt has it's
own helper functions to create the metadata mask.
**User Details**
We have implemented support for the following ops for `torch.float16`
and `torch.int8`:
```
torch.addmm(bias, dense, sparse.t())
torch.mm(dense, sparse)
torch.mm(sparse, dense)
aten.linear.default
aten.t.default
aten.t.detach
```
The end user interface to accelerate a nn.Linaer module with the
subclass would look like this:
```
from torch.sparse import to_sparse_semi_structured
mask = torch.Tensor([0, 0, 1, 1]).tile(128, 32).cuda().bool()
linear = Model(128, 128).half().cuda()
linear.weight = nn.Parameter(to_sparse_semi_structured(linear.weight,
mask=linear.weight.bool())
```
This also updates tests and the `torch.sparse` module docstring to
reflect these changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102135
Approved by: https://github.com/albanD
- Extend dynamo bench interface with '--compilers onnx' and '--compilers dynamo-onnx'
- ONNX bench exports model to onnx and runs in ONNX Runtime.
- Introduce error aggregation and report.
- Scripts to build ONNX deps and running ONNX bench.
- Huggingface accuracy check workaround for ONNX.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103135
Approved by: https://github.com/thiagocrepaldi, https://github.com/jansel
Previously, cudagraphs and dynamic_shapes were incompatible and enabling
dynamic shapes would forcibly disable cudagraphs. This new strategy
I think is better. The idea is essentially that cudagraphs is an
"optimization" that happens to guard on every input. When cudagraphs
is on, we force everything static, and this automatically does the right
thing because we will force a recompile if sizes change.
This obsoletes https://github.com/pytorch/pytorch/pull/101813
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103290
Approved by: https://github.com/voznesenskym, https://github.com/eellison
Originally, my goal for this PR was to remove the `dynamic_shapes` tests in torch/_dynamo/variables/builder.py. However, one thing lead to another, and it turns out that it was easiest to do all of the following in one go:
* Unconditionally allocate a ShapeEnv, no matter if dynamic_shapes is enabled or not (torch/_dynamo/output_graph.py). There is a small adjustment to export torch/_dynamo/eval_frame.py to account for the fact that a ShapeEnv always exists, even if you're not doing symbolic export.
* Remove dynamic_shapes test from unspec logic (torch/_dynamo/variables/builder.py), the original goal
* Specialize strides and storage offset if all sizes are dynamic (torch/fx/experimental/symbolic_shapes.py). This is required to deal with unconditional ShapeEnv: if a ShapeEnv exist, fake tensor-ification may choose to allocate symbols. The idea is that with `automatic_dynamic_shapes == False`, Dynamo should never request dynamic sizes, but this invariant was not upheld for nontrivial strides/offset.
The rest are just auxiliary fixups from the above:
* Workaround bug in FakeTensorProp where sometimes it doesn't return a FakeTensor (torch/fx/passes/fake_tensor_prop.py), see https://github.com/pytorch/pytorch/pull/103395 for follow up
* Make ShapeProp correctly handle int inputs (torch/fx/passes/shape_prop.py)
* Disable indexing strength reduction if `assume_static_by_default` is False (torch/_inductor/codegen/triton.py)
* Fix hf_T5_generate to NOT toggle `assume_static_by_default` if dynamic shapes is not enabled (benchmarks/dynamo/common.py); technically this is not necessary anymore but it's in for safety.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103302
Approved by: https://github.com/voznesenskym
Previously, cudagraphs and dynamic_shapes were incompatible and enabling
dynamic shapes would forcibly disable cudagraphs. This new strategy
I think is better. The idea is essentially that cudagraphs is an
"optimization" that happens to guard on every input. When cudagraphs
is on, we force everything static, and this automatically does the right
thing because we will force a recompile if sizes change.
This obsoletes https://github.com/pytorch/pytorch/pull/101813
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103290
Approved by: https://github.com/voznesenskym
Even if you passed in --amp we would run inference in float32.
`AlbertForMaskedLM` goes from 1.305 float32 to 1.724x amp, and then again to 1.910x with freezing. Benchmark numbers for amp are about to go way up lol.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103220
Approved by: https://github.com/desertfire
- Don't copy inputs in cudagraphs wrapping, since the copies will distorts timing and triton do_bench will clear cache anyway
- Don't skip op if there is a fallback, since we have both fallbacks and lowerings for some ops
- Add option for channels last
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103110
Approved by: https://github.com/desertfire
The "tolerance" option evaluates the model on the baseline device in eager mode (default: CPU) compared to the test device (e.g., CUDA, XLA, etc.) and compares the output tensors to determine the absolute tolerance value based on the [formula](https://pytorch.org/docs/stable/generated/torch.allclose.html). It then saves the results in a CSV file. This comparison highlights the tolerance/accuracy difference between XLA and GPU/CPU devices and can also be used to evaluate newer accelerators. This feature aims to identify accuracy failures on the test device (e.g., XLA) and facilitate quick bug triaging.
This feature enables the following capabilities:
1. Ability to monitor accuracy issues of backends
2. Provide more informative picture on accuracy beyond pass/ fail status
3. Having a dump of accuracy information will help triage models accordingly
The data generated using this feature is in the [spreadsheet](https://docs.google.com/spreadsheets/d/1A8BAzSqfAw0Q5rgzK5Gk__Uy7qhuynh8tedxKnH-t94/edit#gid=0).
The spreadsheet data can be used to compile the below summary table:
| Suite | Max Tolerance | | No. of models with high inaccuracy(>=0.005) | | Mean Tolerance | |
|------------------ |:-------------:|:--------:|:-------------------------------------------:|:--------:|:--------------:|:--------:|
| | xla | inductor | xla | inductor | xla | inductor |
| huggingface | 0.1169 | 0.0032 | 1 | 0 | 0.0022 | 0.0005 |
| timm_models | 0.0373 | 2.8892 | 10 | 8 | 0.0028 | 0.7044 |
| torchbench | 3.013 | 3.0381 | 6 | 2 | 0.0016 | 0.0016 |
| All models | 3.013 | 3.0381 | 17 | 10 | 0.0028 | 0.7044 |
I used PyTorch release/2.0 branch and corresponding [commit_pin](https://github.com/pytorch/pytorch/blob/release/2.0/.github/ci_commit_pins/xla.txt) for XLA to generate the above data.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102218
Approved by: https://github.com/jansel
Summary:
Otherwise we get
```
Traceback (most recent call last):
File "<string>", line 49, in <module>
File "<string>", line 47, in __run
File "/usr/local/fbcode/platform010/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/local/fbcode/platform010/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/data/users/jongsoo/fbsource/buck-out/v2/gen/fbcode/ef4169ac7f95fb74/caffe2/benchmarks/transformer/__sdp__/sdp#link-tree/caffe2/benchmarks/transformer/sdp.py", line 346, in <module>
main(save_path)
File "/data/users/jongsoo/fbsource/buck-out/v2/gen/fbcode/ef4169ac7f95fb74/caffe2/benchmarks/transformer/__sdp__/sdp#link-tree/caffe2/benchmarks/transformer/sdp.py", line 328, in main
experiment = run_single_experiment(experiment_config)
File "/data/users/jongsoo/fbsource/buck-out/v2/gen/fbcode/ef4169ac7f95fb74/caffe2/benchmarks/transformer/__sdp__/sdp#link-tree/caffe2/benchmarks/transformer/sdp.py", line 229, in run_single_experiment
assert_close_tensors(nn_mha_output, composite_mha_output)
File "/data/users/jongsoo/fbsource/buck-out/v2/gen/fbcode/ef4169ac7f95fb74/caffe2/benchmarks/transformer/__sdp__/sdp#link-tree/caffe2/benchmarks/transformer/sdp.py", line 196, in assert_close_tensors
assert torch.allclose(a, b, atol=1e-3, rtol=1e-3)
AssertionError
```
Test Plan: buck run mode/dev-nosan //caffe2/benchmarks/transformer:sdp
Differential Revision: D45843836
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101965
Approved by: https://github.com/drisspg
The memory compression for these models is at parity, but because we interleave timings between torch.compile and eager run memory is duplicated between between eager and cudagraphs pool and causes OOM.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101837
Approved by: https://github.com/anijain2305
This PR adds support for tracing autograd.Function with grad.
A few important bullet points outlining our approach:
1) Our goal is to verify soundness in order to add a call_function to the autograd.Function's `apply` to the graph.
2) We achieve (1) by either verifying soundness or rejecting soundness, by ensuring that both forward and backward of the autograd.Function are sound.
3) For the forward, if we verify soundness, we install its guards into the graph.
4) For the backward, if we verify soundness, we throw it out. However, backwards soundness verification is more onerous, and has a config driven set of banned attrs and methods for tensors.
1-4 above are achieved by turning the forward and backward into UserDefinedFunctionVariables, and inlining through them, relying on dynamo's soundness detection. If we graph break in these, we raise and treat them as unsound. As noted above, backwards is stricter yet.
For the tracing, the safety comes from dynamo's HigherOrderOperator system. That system ensures that not only do we trace soundly, but that no new variables are lifted into inputs during the tracing, and that the forward and backwards are entirely self contained.
Whenever we reject a function as unsound, we restore back, as usual.
Due to some limitations in the lifting logic, we have an escape hatch we implemented for tensors that are known in forward, but cross into backwards through save_tensors (save) /saved_tensors (load). We escape hatch here to avoid having the known saved tensors coming from forward end up being accidentally treated as lifted variables (and rejected). This is sound, but a little hacky feeling.
Additionally, due to some limitations in fx node removal, combined with how we produce subgraphs for the traces installed from HigherOrderOperators, we had to improve our node removal logic. In the event of a restore, we remove the old nodes from the graph, as usual in dynamo. However, because the references to these nodes may exist in subgraphs, we traverse any nodes users and remove them first if and only if they are in another graph. This is always sound, because removal should only be downstream of restoration at this point.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99483
Approved by: https://github.com/zou3519
This pass does a limited form of constant propagation, as well as propagation of
sympy indexing expressions. For example, say you have the function:
```python
def flip(x):
i = torch.arange(x.size(0) - 1, -1, -1, device=x.device)
return x[i]
```
On current main this results in indirect indexing:
```python
class buf0_loop_body:
var_ranges = {z0: 4, z1: 3}
index0 = 3 - z0
index1 = 3*indirect0 + z1
index2 = 3*z0 + z1
def body(self, ops):
get_index = self.get_index('index0')
index_expr = ops.index_expr(get_index, torch.int64)
set_indirect0 = self.set_indirect0(index_expr)
get_index_1 = self.get_index('index1')
load = ops.load('arg0_1', get_index_1)
get_index_2 = self.get_index('index2')
store = ops.store('buf0', get_index_2, load, None)
return store
```
With this PR the indexing is propagated through the computation and into direct
indexing:
```python
class buf0_loop_body:
var_ranges = {z0: 4, z1: 3}
index0 = -3*z0 + z1 + 9
index1 = 3*z0 + z1
def body(self, ops):
get_index = self.get_index('index0')
load = ops.load('arg0_1', get_index)
get_index_1 = self.get_index('index1')
store = ops.store('buf0', get_index_1, load, None)
return store
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101077
Approved by: https://github.com/lezcano, https://github.com/ngimel
With the TQDM changes in #100969 -- the models names ended up getting hidden from the benchmark printouts. We would print the model name with no newline, then tqdm would print a `\r` and overwrite the name of the running model.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101627
Approved by: https://github.com/ezyang
This pr accomplishes
1) Enables retries for downloading torchbenchmark and huggingface models in a similar method to how we do it for timm models right now.
2) creates a `_download_model` function for the hugging face and TIMM runners whose output I plan to use to preload the models somewhere if possible (please double check I'll be saving the right thing). Instead of retries, we plan to just add torchbench to a docker image as it is relatively small.
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 3361a4c</samp>
> _We're the brave and bold coders of the `common.py` module_
> _We've made a handy function for downloading models_
> _We've shared it with our mates in the other runners_
> _So pull and push and try again, we'll get them all in time_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101019
Approved by: https://github.com/huydhn, https://github.com/desertfire
Summary:
Otherwise we get
```
Traceback (most recent call last):
File "<string>", line 49, in <module>
File "<string>", line 47, in __run
File "/usr/local/fbcode/platform010/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/local/fbcode/platform010/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/data/users/jongsoo/fbsource/buck-out/v2/gen/fbcode/ef4169ac7f95fb74/caffe2/benchmarks/transformer/__sdp_backwards__/sdp_backwards#link-tree/caffe2/benchmarks/transformer/sdp_backwards.py", line 188, in <module>
main()
File "/data/users/jongsoo/fbsource/buck-out/v2/gen/fbcode/ef4169ac7f95fb74/caffe2/benchmarks/transformer/__sdp_backwards__/sdp_backwards#link-tree/caffe2/benchmarks/transformer/sdp_backwards.py", line 184, in main
run_timing(min_run_time, batch_size, embed_dim, num_heads, max_seq_len, dtype)
File "/data/users/jongsoo/fbsource/buck-out/v2/gen/fbcode/ef4169ac7f95fb74/caffe2/benchmarks/transformer/__sdp_backwards__/sdp_backwards#link-tree/caffe2/benchmarks/transformer/sdp_backwards.py", line 105, in run_timing
rand_fused_upward = cpt(x, x, x, mask).clone().detach()
File "/data/users/jongsoo/fbsource/buck-out/v2/gen/fbcode/ef4169ac7f95fb74/caffe2/benchmarks/transformer/__sdp_backwards__/sdp_backwards#link-tree/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data/users/jongsoo/fbsource/buck-out/v2/gen/fbcode/ef4169ac7f95fb74/caffe2/benchmarks/transformer/__sdp_backwards__/sdp_backwards#link-tree/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/data/users/jongsoo/fbsource/buck-out/v2/gen/fbcode/ef4169ac7f95fb74/caffe2/benchmarks/transformer/__sdp_backwards__/sdp_backwards#link-tree/caffe2/benchmarks/transformer/sdp_backwards.py", line 39, in forward
attn, _ = torch.nn.functional.scaled_dot_product_attention(
ValueError: too many values to unpack (expected 2)
```
Test Plan: buck run mode/dev-nosan //caffe2/benchmarks/transformer:sdp_backwards
Differential Revision: D45843838
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101341
Approved by: https://github.com/drisspg
Currently, we return `unimplemented` w/o a graph break on seeing a x.unsqueeze_ when x is input. This essentially means we fall back to the original frame.
This PR actually graph breaks so that we can generate the continuation frame for the rest of the function. Instead of graph breaking at LOAD_ATTR, we delay the graph break to the actual CALL_FUNCTION, where its cleaner to graph break.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99986
Approved by: https://github.com/jansel
Follow up on Jason's idea of tensor layout tuning. Add a script to show the perf impact of layout to convolution (will add more cases like batch/layer norm, reduction to the scripts).
For convolution, a quick test shows using channels last layout, we get 1.4x speedup for convolution:
```
baseline 4.509183883666992 test 3.178528070449829 speedup 1.419x
```
The speedup definitely also depends on input/weight shapes. E.g., change input channel from 3 in the test to 8, we see speedup to be 2.1x
The trace shows cudnn calls different kernels when input layout changes to channels last.
<img width="997" alt="Screenshot 2023-04-19 at 5 27 54 PM" src="https://user-images.githubusercontent.com/52589240/233228656-4bdcac0a-7633-416a-82e1-17d8dc8ea9a6.png">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99583
Approved by: https://github.com/jansel
Previously, we had a problem when partitioning forward-backward dynamic graphs, which is that we could end up with a backward graph that mentions a symbol in an input tensor (e.g., `f32[s0 + s1]`), but without this symbol being otherwise bound elsewhere. When this happens, we have no way of actually deriving the values of `s0` and `s1`. Our fix for this in https://github.com/pytorch/pytorch/pull/93059 was to just retrace the graph, so that s0 + s1 got allocated a new symbol s2 and everything was happy. However, this strategy had other problems, namely (1) we lost all information from the previous ShapeEnv, including guards and (2) we end up allocating a LOT of fresh new symbols in backwards.
With this change, we preserve the same ShapeEnv between forward and backwards. How do we do this? We simply require that every symbol which may be present inside tensors, ALSO be a plain SymInt input to the graph. This invariant is enforced by Dynamo. Once we have done this, we can straightforwardly modify the partitioner to preserve these SymInt as saved for backwards, if they are needed in the backwards graph to preserve the invariant as well.
This apparently breaks yolov3, but since everything else is OK I'm merging this as obviously good and investigating later.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99089
Approved by: https://github.com/voznesenskym
Dynamo benchmark --verbose is broken:
```
Traceback (most recent call last):
File "/scratch/ybliang/work/repos/pytorch/benchmarks/dynamo/torchbench.py", line 400, in <module>
torchbench_main()
File "/scratch/ybliang/work/repos/pytorch/benchmarks/dynamo/torchbench.py", line 396, in torchbench_main
main(TorchBenchmarkRunner(), original_dir)
File "/scratch/ybliang/work/repos/pytorch/benchmarks/dynamo/common.py", line 1967, in main
return maybe_fresh_cache(
File "/scratch/ybliang/work/repos/pytorch/benchmarks/dynamo/common.py", line 993, in inner
return fn(*args, **kwargs)
File "/scratch/ybliang/work/repos/pytorch/benchmarks/dynamo/common.py", line 2135, in run
torch._dynamo.config.log_level = logging.DEBUG
File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/config_utils.py", line 67, in __setattr__
raise AttributeError(f"{self.__name__}.{name} does not exist")
AttributeError: torch._dynamo.config.log_level does not exist
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99224
Approved by: https://github.com/voznesenskym
Allowed modules are stuck into dynamo's fx graph as call_module
nodes, without dynamo doing any tracing of the module. This means
during AOT trace time, hooks will fire during tracing when the
call_module is executed, but the hooks themselves will disappear
after that and not be present in the compiled program.
(worse, if they performed any tensor operations, those would get
traced so you could end up with part of the hook's functionality).
To circumvent this, there are two options for 'allowed modules' with hooks.
1) don't treat them as 'allowed' - trace into them
2) graph-break, so the module is no longer part of the dynamo trace at all
(1) will fail for users that opted into allowed modules becuase they know
their module has problems being traced by dynamo.
(2) causes graph breaks on common modules such as nn.Linear, just because they
are marked as 'allowed'.
It would help matters if we could differentiate between types of allowed modules
(A) allowed to avoid overheads - used for common ops like nn.Linear
(B) allowed to avoid dynamo graphbreaks caused by unsupported code
Ideally, we'd use method (1) for group (A) and (2) for (B).
For now, graph-break on all cases of allowed modules.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97184
Approved by: https://github.com/jansel
[perf-compare](https://github.com/pytorch/pytorch/actions/workflows/inductor-perf-compare.yml) has a different structure than that of the nightlies.
For these files, the script now generates:
```
# cuda float32 training performance results
## Geometric mean speedup
huggingface timm_models torchbench
-------- ------------- ------------- ------------
inductor 1.46 1.4 1.17
## Mean compilation time
huggingface timm_models torchbench
-------- ------------- ------------- ------------
inductor 57.85 97.63 60.18
## Peak memory compression ratio
huggingface timm_models torchbench
-------- ------------- ------------- ------------
inductor 1.06 1.01 0.83
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99095
Approved by: https://github.com/ezyang
Summary:
Replace _dynamo.config with an object instead of module
Current usage patterns of setting and reading fields on config will work
unchanged.
Only changes needed going forward:
1. import torch._dynamo.config will not work. However, just doing
import torch._dynamo is sufficient to access dynamo config
as torch._dynamo.config.
2. Files inside of _dynamo folder need to access config via
from torch._dynamo.config_util import config instead of
from torch._dynamo import config. Because _dynamo/__init__.py
imports some of the files so it would be circular import.
Test Plan:
Reviewers:
Subscribers:
Tasks:
Tags:
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96455
Approved by: https://github.com/williamwen42