Fixes https://github.com/pytorch/pytorch/issues/119238
Here's what it looks like now:
```
$ TORCH_LOGS=+torch._dynamo.convert_frame python a.py
[2024-02-05 18:52:07,248] [0/0] torch._dynamo.convert_frame: [DEBUG] torchdynamo start compiling f /data/users/ezyang/b/pytorch/a.py:3, stack (elided 5 frames):
[2024-02-05 18:52:07,248] [0/0] torch._dynamo.convert_frame: [DEBUG] File "/data/users/ezyang/b/pytorch/a.py", line 7, in <module>
[2024-02-05 18:52:07,248] [0/0] torch._dynamo.convert_frame: [DEBUG] f(torch.randn(2))
[2024-02-05 18:52:07,248] [0/0] torch._dynamo.convert_frame: [DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/eval_frame.py", line 453, in _fn
[2024-02-05 18:52:07,248] [0/0] torch._dynamo.convert_frame: [DEBUG] return fn(*args, **kwargs)
[2024-02-05 18:52:07,248] [0/0] torch._dynamo.convert_frame: [DEBUG]
$ cat a.py
import torch
@torch.compile
def f(x):
    return x * 2
f(torch.randn(2))
```
The eval_frame frame is intentionally present: you first run the torch.compile wrapper, and only then hit the user frame that is actually compiled.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119251
Approved by: https://github.com/yanboliang, https://github.com/mlazos
The `initial` argument in `functools.reduce` can be `None`, so `None` cannot be used as the "missing initial" sentinel; a dedicated sentinel object is needed, as in CPython's pure-Python equivalent:
```python
initial_missing = object()

def reduce(function, iterable, initial=initial_missing, /):
    it = iter(iterable)
    if initial is initial_missing:
        value = next(it)
    else:
        value = initial
    for element in it:
        value = function(value, element)
    return value
```
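For example, a quick check that `None` is a legitimate initial value (and therefore cannot double as the "missing" sentinel):
```python
import functools

# With initial=None, the first call is function(None, 1); a polyfill therefore
# needs a dedicated sentinel object, not None, to detect a missing initial.
assert functools.reduce(lambda acc, x: (acc or 0) + x, [1, 2, 3], None) == 6
```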
Reference:
- python/cpython#102759
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116398
Approved by: https://github.com/Skylion007
Fixes https://github.com/pytorch/pytorch/issues/109736.
The HF pin move caused an accuracy-check regression for HF models on the dashboard. Manually reverting the HF PR ( https://github.com/huggingface/transformers/pull/24696/files ) recovers the accuracy, but that may hide a real issue. I happened to find that a warm matmul max-autotune cache works around the issue. Put another way:
- making all calls to check_cache miss the cache reproduces the issue
- making all calls to check_cache hit the cache works around the issue
I did a sort of 'bisect', halving the number of cache misses each time while still making sure the issue reproduced. Luckily, even a single cache miss still reproduces the issue. With more debugging, it turned out to be the call to `torch.randn` on a cuda device that causes the problem.
The fix is to make sure we restore the RNG state when we generate random inputs for max-autotune benchmarking.
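A minimal sketch of the idea (the helper name here is made up, not the code in this PR): fork and restore the RNG state around generating the random benchmarking inputs, so the RNG stream seen by the rest of compilation and execution is untouched.
```python
import torch

def make_autotune_inputs(shapes, device="cuda"):
    # Fork (and afterwards restore) the CPU and CUDA RNG state so that
    # materializing random inputs for autotune benchmarking does not perturb
    # the RNG stream the model later observes.
    with torch.random.fork_rng(devices=[device] if device != "cpu" else []):
        return [torch.randn(shape, device=device) for shape in shapes]
```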
TBH, I cannot fully explain the root cause, although I know it's caused by the RNG state change. AOTAutograd already has some logic to preserve RNG state, and I cannot reproduce the issue in unit tests. I have a few guesses as to why the RNG state is not restored in the first place after we generate random inputs for max-autotune:
- maybe AOTAutograd misses some corner case when preserving the RNG state
- maybe for the failing models there are some eager fallbacks not handled by inductor; if those fallbacks call random-number-related APIs, we would see the issue. But again, I haven't found a good way to simulate this.
Repro:
```
TORCHINDUCTOR_BENCHMARK_KERNEL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 CUDA_VISIBLE_DEVICES=3 time python benchmarks/dynamo/huggingface.py --backend inductor --amp --accuracy --only PLBartForCausalLM --training --cold-start-latency
```
We can consistently reproduce the issue without the PR, and the accuracy check passes with the PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109828
Approved by: https://github.com/eellison
RFC: https://github.com/pytorch/rfcs/pull/54
The first commit is the contents of https://github.com/Quansight-Labs/numpy_pytorch_interop/
We have already been using this in core for the last few months as an external dependency. This PR pulls it all into core.
In the subsequent commits, I do a number of things, in this order:
- Fix a few small issues
- Make the tests that this PR adds pass
- Bend backwards until lintrunner passes
- Remove the optional dependency on `torch_np` and simply rely on the upstreamed code
- Fix a number of dynamo tests that were passing before (they were not testing anything, I think) and are not passing now.
Missing from this PR (but not blocking):
- Have a flag that deactivates tracing NumPy functions and simply breaks. There used to be one, but it stopped working after the merge and I removed it. @lezcano to investigate.
- https://github.com/pytorch/pytorch/pull/106431#issuecomment-1667079543. @voznesenskym to submit a fix after we merge.
All the tests in `tests/torch_np` take about 75s to run.
This was work by @ev-br, @rgommers, @honno, and me. I did not create this PR via ghstack (which would have been convenient) as this is a collaboration, and ghstack doesn't allow for shared contributions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106211
Approved by: https://github.com/ezyang
`TensorMeta.from_irnodes` handles either a single `IRNode` or a tuple or list of them. I tried to express this with overloading, but because this file is in MYPYNOFOLLOW, the `IRNode` subclasses become `Any`, which causes the overloads to be overlapping.
This changes the type of the argument to `benchmark_in_sub_process` to the more specific `TritonTemplateCaller`, since that one has the `bmreq` member and existing docstrings indicate that only the triton template benchmark is handled.
The `rand_strided` call caused a mypy error because the default value for device was a string. This is fixed by adding type hints to `rand_strided` in `torch/_dynamo/testing.py`. Likewise, the return value of `PyCodeCache.load_by_key_path` can be inferred from the type hint on `PyCodeCache.cache`.
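A hedged sketch of the kind of annotation meant here (the real `rand_strided` has additional parameters and a different body):
```python
from typing import Sequence, Union

import torch

def rand_strided(
    size: Sequence[int],
    stride: Sequence[int],
    dtype: torch.dtype = torch.float32,
    device: Union[str, torch.device] = "cpu",
) -> torch.Tensor:
    # Annotating device as Union[str, torch.device] makes the string default
    # type-check; this sketch only supports floating-point dtypes.
    return torch.empty_strided(size, stride, dtype=dtype, device=device).normal_()
```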
Fixes one part of #105230
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105791
Approved by: https://github.com/jansel, https://github.com/Skylion007
Some notes:
* I now manually stop `_generate` jobs from running with cudagraphs, as it is unrealistic to expect to cudagraph autoregressive generation up to the max sequence length; that would imply compiling the entire unrolled sequence generation. Concretely, cm3leon_generate was timing out after this change, likely due to the compile-time slowdown of dynamic shapes ON TOP OF accidentally unrolling all the loops.
* A few `torch._dynamo.reset` calls were tactically inserted to force recompiles in tests that expected them.
* `expectedFailureAutomaticDynamic` tests were flipped into patching `automatic_dynamic_shapes=False` (see the sketch after this list).
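A hedged sketch of the last two bullets combined (the test itself is made up):
```python
import torch
import torch._dynamo

# Hypothetical test: instead of marking it expectedFailureAutomaticDynamic,
# patch the config so this test runs with automatic dynamic shapes disabled.
@torch._dynamo.config.patch(automatic_dynamic_shapes=False)
def test_no_automatic_dynamic():
    torch._dynamo.reset()  # force a fresh compile rather than reusing caches
    fn = torch.compile(lambda x: x * 2)
    assert torch.equal(fn(torch.ones(3)), torch.ones(3) * 2)
```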
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103623
Approved by: https://github.com/voznesenskym
Added two signpost_event calls to torch.fx.experimental.symbolic_shapes: one for produce_guards (where we can give stats like how many free symbols and how many guards were produced) and the other for evaluate_expr after freeze (so we can look for cases where we're improperly discarding guards in backwards).
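Roughly what such a call looks like (the category, name, and payload keys here are illustrative assumptions, not necessarily the exact ones used by this PR):
```python
from torch._utils_internal import signpost_event

# Report stats at produce_guards time, e.g. number of free symbols and guards.
signpost_event("dynamic", "produce_guards", {"free_symbols": 3, "guards": 7})
```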
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103882
Approved by: https://github.com/Skylion007
First, infra improvements: a new combinator `expectedFailureDynamic`, which subsumes the expectedFailure calls in test_dynamic_shapes.py. It's just nicer to have these right next to the test. The implementation is in torch/_dynamo/testing.py: it works by putting an attr on the test, which is then converted into a real expectedFailure when we actually generate the dynamic shapes test class.
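A minimal sketch of that mechanism (assumed shape, not the exact implementation in torch/_dynamo/testing.py):
```python
import unittest

def expectedFailureDynamic(fn):
    # In the static test class this is a no-op marker.
    fn._expected_failure_dynamic = True
    return fn

def make_dynamic_cls(cls):
    # When generating the dynamic-shapes variant, turn the marker into a real
    # unittest.expectedFailure by overriding the marked tests in a subclass.
    overrides = {
        name: unittest.expectedFailure(value)
        for name, value in vars(cls).items()
        if getattr(value, "_expected_failure_dynamic", False)
    }
    return type("DynamicShapes" + cls.__name__, (cls,), overrides)
```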
Next, some housekeeping:
* test/dynamo/test_unspec.py was accidentally running mostly statically due to the `assume_static_by_default` config flip. Stop assuming static by default and xfail some tests which regressed in that time.
* New test file test/dynamo/test_config.py, for testing permutations of configuration options. `test_dynamic_shapes` got moved there.
Finally, grinding through tests in a way that will make them more compatible with dynamic by default:
* If the test explicitly requires dynamic_shapes=False, remove that patch (and probably xfail it)
* If the test checks dynamic_shapes internally, remove that check and patch the test so it ALWAYS runs with dynamic_shapes (this is not a coverage loss because we're going to switch the default)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103542
Approved by: https://github.com/anijain2305
Issue: #93684
# Problem
Reduce graph breaks when dynamo compiles python functions containing numpy functions and ndarray operations.
# Design (as I know it)
* Use torch_np.ndarray (a wrapper around a tensor) to back a `VariableTracker`: `NumpyTensorVariable`.
* Translate all attribute and method calls on ndarray to their torch_np.ndarray equivalents.
This PR adds `NumpyTensorVariable` and supports:
1. tensor to ndarray, ndarray to tensor
2. numpy functions such as numpy.meshgrid()
3. ndarray attributes such as `itemsize`, `stride`
The next PR will handle returning `np.ndarray` and add support for ndarray methods.
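A hedged illustration (a made-up function) of the kind of mixed tensor/ndarray code this line of work aims to trace with fewer graph breaks:
```python
import numpy as np
import torch

@torch.compile
def f(x):
    a = x.numpy()                     # tensor -> ndarray
    xx, yy = np.meshgrid(a, a)        # numpy function on ndarrays
    return torch.from_numpy(xx + yy)  # ndarray -> tensor

f(torch.arange(3, dtype=torch.float32))
```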
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95849
Approved by: https://github.com/ezyang
Summary of changes:
- Add CPython exceptiontable parsing/assembling functions in torch/_dynamo/bytecode_transformation.py, based on https://github.com/python/cpython/blob/3.11/Objects/exception_handling_notes.txt (a decoding sketch is included after this list).
- Add optional `exn_tab_entry` field to dynamo `Instruction`s in torch/_dynamo/bytecode_transformation.py in order to virtualize exception table entries (start, end, target instructions).
- Add checks guarding against duplicate instructions in dynamo, so that jump/exceptiontable targets are unambiguous. See `get_indexof` in torch/_dynamo/bytecode_analysis.py. Ensure that bytecode generation throughout dynamo does not generate duplicate instructions.
- Allow dynamo bytecode generation logic to generate nested exception table entries for developer convenience. CPython expects entries to not overlap, so we flatten nested entries during assembly in torch/_dynamo/bytecode_transformation.py:compute_exception_table.
- Simulate the block stack in torch/_dynamo/symbolic_convert.py. CPython removed the block stack in 3.11, but dynamo needs it in order to keep track of active contexts. So we simulate the block stack as before by looking at exceptiontable entries in order to determine the current blocks.
- Update context codegen in torch/_dynamo/resume_execution.py. The `SETUP_FINALLY` bytecode, which conveniently had a jump target to the finally block, was removed in 3.11, so we need to keep track of the jump target of the finally block using exceptiontables. Generating resume functions is more difficult since the original exceptiontable entries pointing to old cleanup code need to be modified to point to new cleanup code.
- Fix a push_null bug in torch/_dynamo/variables/functions.py introduced by https://github.com/pytorch/pytorch/pull/98699
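For reference (per the first bullet), the co_exceptiontable format described in that CPython notes file can be decoded roughly like this on Python 3.11+: each entry is (start, size, target, depth, lasti), stored as varints with 6 value bits per byte, bit 6 as the continuation flag, and offsets in 2-byte code units.
```python
def parse_varint(iterator):
    b = next(iterator)
    val = b & 63
    while b & 64:  # bit 6 set: more bytes follow
        b = next(iterator)
        val = (val << 6) | (b & 63)
    return val

def parse_exceptiontable(code):
    # Decode code.co_exceptiontable (Python 3.11+) into
    # (start, end, target, depth, lasti) tuples, with byte offsets.
    iterator = iter(code.co_exceptiontable)
    entries = []
    try:
        while True:
            start = parse_varint(iterator) * 2
            length = parse_varint(iterator) * 2
            end = start + length - 2  # inclusive end offset
            target = parse_varint(iterator) * 2
            dl = parse_varint(iterator)
            entries.append((start, end, target, dl >> 1, bool(dl & 1)))
    except StopIteration:
        return entries
```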
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96511
Approved by: https://github.com/jansel, https://github.com/yanboliang, https://github.com/albanD
Summary:
Replace _dynamo.config with an object instead of a module.
Current usage patterns of setting and reading fields on config will work unchanged.
The only changes needed going forward:
1. `import torch._dynamo.config` will not work. However, just doing `import torch._dynamo` is sufficient to access the dynamo config as `torch._dynamo.config`.
2. Files inside the _dynamo folder need to access the config via `from torch._dynamo.config_util import config` instead of `from torch._dynamo import config`, because _dynamo/__init__.py imports some of those files, so importing config from there would be a circular import.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96455
Approved by: https://github.com/williamwen42
Previously, the generated test classes printed like `torch._dynamo.testing.make_test_cls_with_patches.<locals>.DummyTestClass`; now they print as `torch._dynamo.testing.StaticDefaultDynamicShapesUnspecTests`.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95806
Approved by: https://github.com/dagitses
Cleaning FQNs for `FullyShardedDataParallel(use_orig_params=True)` can cause some discrepancies in the FQNs compared to manually looping over `named_modules()` and `named_parameters()` together.
There is no requirement for the FQNs to be clean when using wrapper FSDP + `use_orig_params=True`. We can leave clean FQNs to `fully_shard`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91767
Approved by: https://github.com/zhaojuanmao
This PR introduces a new function we can pass to torch._dynamo.optimize: guard_failure_fn. Usage is in this PR, and in the one stacked on top of it, but the gist is that it emits failed-guard reason strings alongside the offending code. This is useful for tests and debugging, as it gives far finer-grained assertions and control than the compile counter alone.
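A hedged usage sketch (the keyword argument name and the shape of the failure object here are assumptions layered on top of this description):
```python
import torch
import torch._dynamo

failures = []

def guard_failure_fn(failure):
    # Collect the failed-guard reason strings (plus the associated code) so a
    # test can assert on exactly why a recompile happened.
    failures.append(failure)

@torch._dynamo.optimize("eager", guard_fail_fn=guard_failure_fn)
def f(x):
    return x * 2
```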
This is a resubmit of https://github.com/pytorch/pytorch/pull/90129
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90371
Approved by: https://github.com/ezyang