Commit Graph

164 Commits

Author SHA1 Message Date
Bin Bao
577d930c39 [CI] Revert https://github.com/pytorch/pytorch/pull/96195 (#96897)
Summary: https://github.com/pytorch/pytorch/pull/96195 was an experiment
for debugging flaky failures on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96897
Approved by: https://github.com/ngimel
2023-03-16 06:28:18 +00:00
Edward Z. Yang
3606f59366 Default specialize_int to False (#96624)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96624
Approved by: https://github.com/janeyx99
2023-03-16 02:54:18 +00:00
Will Constable
54cd4a67d0 Output peak memory stats from dynamo torchbench perf CI (#95666)
Adds absolute memory usage numbers (in addition to compression ratio) to performance jobs.

Example output:
<img width="1211" alt="image" src="https://user-images.githubusercontent.com/4984825/225419950-500908c5-00ce-4711-afa2-c995bf90d35d.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95666
Approved by: https://github.com/ezyang, https://github.com/williamwen42
2023-03-15 19:24:47 +00:00
Bin Bao
33c7be360f [reland][CI] switch torchbench to a pinned version (#96782)
Summary: This is reland of https://github.com/pytorch/pytorch/pull/96553

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96782
Approved by: https://github.com/huydhn
2023-03-15 12:46:36 +00:00
Edward Z. Yang
037acd5a22 Update CI skips (#96745)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96745
Approved by: https://github.com/wconstab
2023-03-14 22:19:10 +00:00
PyTorch MergeBot
be4eaa69c2 Revert "[CI] switch torchbench to a pinned version (#96553)"
This reverts commit 61d6ccd29a.

Reverted https://github.com/pytorch/pytorch/pull/96553 on behalf of https://github.com/desertfire due to land race
2023-03-14 21:39:45 +00:00
PyTorch MergeBot
ba4fb9b6ad Revert "Default specialize_int to False (#96624)"
This reverts commit 1ac8782db2.

Reverted https://github.com/pytorch/pytorch/pull/96624 on behalf of https://github.com/kit1980 due to Broke inductor/test_torchinductor_dynamic_shapes.py
2023-03-14 19:43:47 +00:00
Bin Bao
61d6ccd29a [CI] switch torchbench to a pinned version (#96553)
Summary: Previously we were using a torchbench branch that skips
torchaudio. We should switch to a pinned version to ensure good test
coverage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96553
Approved by: https://github.com/huydhn, https://github.com/ezyang
2023-03-14 18:42:22 +00:00
Edward Z. Yang
1ac8782db2 Default specialize_int to False (#96624)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96624
Approved by: https://github.com/janeyx99
2023-03-14 18:37:47 +00:00
David Berard
6e3d51b08a [inductor][CI] also skip rexnet_100 on non-dynamic shapes (#96691)
Recent failures show that rexnet_100 accuracy is also flaky on non-dynamic shapes (it was already disabled for dynamic shapes in #96474). The failure occurs for the same reason (stem.bn.weight.grad).
e.g. https://github.com/pytorch/pytorch/actions/runs/4402868441/jobs/7710977874

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96691
Approved by: https://github.com/desertfire
2023-03-14 18:11:59 +00:00
Edward Z. Yang
ff7e510d1e Correctly use PythonPrinter for generating wrapper code referencing sympy (#96710)
Otherwise you get stuff like ceiling(s0) which is not valid Python code. Fixes volo_d1_224

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96710
Approved by: https://github.com/ngimel, https://github.com/jansel
2023-03-14 14:35:52 +00:00
Wang, Eikan
3cad8d23d0 [Inductor] Skip the hf_T5_base due to intermittent failure on CI (#96649)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96649
Approved by: https://github.com/desertfire
2023-03-14 07:40:20 +00:00
Edward Z. Yang
507feb805f Don't specialize torch.Size with specialize_int = False (#96419)
Fixes https://github.com/pytorch/pytorch/issues/95868

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96419
Approved by: https://github.com/jansel, https://github.com/ngimel
2023-03-14 01:32:58 +00:00
Edward Z. Yang
c7f39c0820 Update CI skips (#96554)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96554
Approved by: https://github.com/janeyx99
2023-03-13 13:40:45 +00:00
David Berard
29cd60dfb7 [CI] handle more dynamo benchmark models that are not expected to be deterministic (#96324)
Follow-up to #96245. alexnet, Background_Matting, vision_maskrcnn, and vgg16 all have the same problem; but on float32 they were also failing on the previous day so I missed this. Once the amp jobs became available I could see that these have the same issue (on both float32 and amp).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96324
Approved by: https://github.com/desertfire
2023-03-10 18:15:34 +00:00
Bin Bao
a651e6253a [CI] Change compile_threads to 1 when running benchmark accuracy test on CI (#96195)
Summary: This is not a pretty solution, but it is a way to verify whether the flakiness is coming from parallel compilation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96195
Approved by: https://github.com/ngimel
2023-03-10 17:39:38 +00:00
Edward Z. Yang
ff2e14f200 Skip rexnet_100 in dynamic CI (#96474)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96474
Approved by: https://github.com/yanboliang, https://github.com/msaroufim
2023-03-10 01:23:19 +00:00
Edward Z. Yang
c988de1040 [EASY] Update inductor training dynamic skips (#96298)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96298
Approved by: https://github.com/Chillee, https://github.com/janeyx99
2023-03-08 19:31:46 +00:00
Bin Bao
b3a079810e [CI] Add a workflow for quick perf comparison (#96166)
Summary: ciflow/inductor-perf-test-nightly now contains the full dashboard
run, which takes a very long time. Ed proposed a simplification of the
perf run there, but it is still worth having a set of fast perf tests
that includes only one configuration (--training --amp).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96166
Approved by: https://github.com/huydhn, https://github.com/weiwangmeta
2023-03-08 19:09:04 +00:00
Bin Bao
664381b293 [CI] Avoid calling torch.use_deterministic_algorithms for some models (#96245)
tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96245
Approved by: https://github.com/davidberard98
2023-03-08 03:35:32 +00:00
Edward Z. Yang
d0641ed247 [TEST] Turn on unspecialize int dynamic training inductor CI (#96058)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96058
Approved by: https://github.com/janeyx99, https://github.com/voznesenskym
2023-03-07 16:08:45 +00:00
Edward Z. Yang
a6e3e7905e Turn on unspecialize int dynamic inductor CI (#96034)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96034
Approved by: https://github.com/voznesenskym
2023-03-07 12:39:55 +00:00
Jason Ansel
95d17dc93d [inductor] Reland #95567 part 1 (#96023)
This is the non-problematic part of #95567.  The errors were coming from
IR printing changes which will be next in the stack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96023
Approved by: https://github.com/ngimel, https://github.com/mlazos
2023-03-06 22:57:22 +00:00
Edward Z. Yang
1fd7ea1ba8 Update skips for RecursionError (#96109)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96109
Approved by: https://github.com/huydhn
2023-03-06 17:55:38 +00:00
Bin Bao
60cf95610d [CI] Skip xcit_large_24_p8_224 in TIMM (#96048)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96048
Approved by: https://github.com/jansel
2023-03-05 14:54:46 +00:00
Bin Bao
1359d16fe8 [CI] Further tighten the checking of two eager runs (#95902)
Summary: To catch nondeterminism in eager if there is any.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95902
Approved by: https://github.com/jansel
2023-03-05 14:53:02 +00:00
Edward Z. Yang
c7c4a20321 Update dynamic skips (#95966)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95966
Approved by: https://github.com/janeyx99, https://github.com/voznesenskym
2023-03-04 23:01:58 +00:00
Jason Ansel
43dd043ea7 Revert "[inductor] Improve error messages (#95567)" (#96014)
This reverts commit 62b775583f.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96014
Approved by: https://github.com/Chillee
2023-03-04 04:03:31 +00:00
Edward Z. Yang
d303665d33 Make int unspecialization actually work (#95621)
OK, so this PR used to be about reducing the number of constants we specialize on, but it turns out that unspecialization was ~essentially never used (because we still constant specialized way too aggressively) and I ended up having to fix a bunch of issues to actually get tests to pass. So this PR is now "make int unspecialization actually work". As part of this, I have to turn off unspecialization by default, as there are still latent bugs in inductor.

The general strategy is that an unspecialized int is represented as a SymInt. Representing it as a 0d tensor (which is what the code used to do) is untenable: (1) we often need unspecialized ints to participate in size computations, but we have no way of propagating sympy expressions through tensor compute, and (2) a lot of APIs work when passed SymInt, but not when passed a Tensor. However, I continue to represent Numpy scalars as Tensors, as they are rarely used for size computation and they have an explicit dtype, so they are more accurately modeled as 0d tensors.

* I folded in the changes from https://github.com/pytorch/pytorch/pull/95099 as I cannot represent unspecialized ints as SymInts without also turning on dynamic shapes. This also eliminates the necessity for test_unspec.py, as toggling specialization without dynamic shapes doesn't do anything. As dynamic shapes defaults to unspecializing, I just deleted this entirely; for the specialization case, I rely on regular static shape tests to catch it. (Hypothetically, we could also rerun all the tests with dynamic shapes, but WITH int/float specialization, but this seems... not that useful? I mean, I guess export wants it, but I'd kind of like our Source heuristic to improve enough that export doesn't have to toggle this either.)
* Only 0/1 integers get specialized by default now
* A hodgepodge of fixes. I'll comment on the PR about them.

Fixes https://github.com/pytorch/pytorch/issues/95469

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95621
Approved by: https://github.com/jansel, https://github.com/Chillee
2023-03-04 01:22:08 +00:00
Jason Ansel
62b775583f [inductor] Improve error messages (#95567)
Example error message before/after (710 to 131 lines):
https://gist.github.com/jansel/6fecad057738089fa95bf08c3de9fc8a

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95567
Approved by: https://github.com/mlazos
2023-03-02 02:20:55 +00:00
Bin Bao
879f0c3fee [CI] Increase the timeout limit for benchmark test (#95787)
Summary: xcit_large_24_p8_224 occasionally hits TIMEOUT on CI. Bump up
the limit to reduce flakiness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95787
Approved by: https://github.com/ezyang, https://github.com/ZainRizvi
2023-03-01 19:54:25 +00:00
Bin Bao
e79b2b7792 [CI] Force clear triton cache between running each test (#95729)
Summary: The idea is to see if this reduces some of the flakiness
we have seen on CI. If it does help, then we have a problem in our
caching implementation.
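
A minimal sketch of the "clear the cache between tests" idea, not the CI change itself. The cache location, `clear_cache_dir`, `run_with_fresh_cache`, and `fake_test` are all hypothetical names for illustration; the sketch uses a temporary directory rather than any real Triton cache path.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, List


def clear_cache_dir(cache_dir: Path) -> None:
    """Delete everything under cache_dir and recreate it empty."""
    shutil.rmtree(cache_dir, ignore_errors=True)
    cache_dir.mkdir(parents=True, exist_ok=True)


def run_with_fresh_cache(
    tests: List[Callable[[Path], int]], cache_dir: Path
) -> List[int]:
    """Run each test against an empty cache directory."""
    results = []
    for test in tests:
        clear_cache_dir(cache_dir)  # no artifacts survive from the previous test
        results.append(test(cache_dir))
    return results


def fake_test(cache_dir: Path) -> int:
    """Stand-in for a benchmark: count stale entries, then write a new one."""
    stale = len(list(cache_dir.iterdir()))
    (cache_dir / "kernel.bin").write_text("compiled artifact")
    return stale


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical cache location inside a throwaway directory.
    stale_counts = run_with_fresh_cache([fake_test, fake_test], Path(tmp) / "cache")
```

If flakiness disappears when every test starts from an empty directory like this, stale or corrupted cache entries become the likely suspect.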

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95729
Approved by: https://github.com/ngimel
2023-03-01 04:10:03 +00:00
Will Constable
1a72712645 Add dynamo graph break stats to CI (#95635)
Adds columns to csv produced by accuracy job including dynamo graph break stats.

Example output from torchbench CI job:
<img width="771" alt="image" src="https://user-images.githubusercontent.com/4984825/221716236-9276684e-1be8-43e1-837e-f41671d4e0e3.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95635
Approved by: https://github.com/ezyang
2023-02-28 16:17:46 +00:00
Edward Z. Yang
3762e801ba Update dynamic skips (#95587)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95587
Approved by: https://github.com/voznesenskym
2023-02-28 03:26:55 +00:00
Bin Bao
fa5a4b0dfc [CI] Do not compare two eager run results against fp64 result (#95616)
Summary: When running the benchmark test with --accuracy, two eager runs
should return the same result. If not, we want to detect it early, but
comparing against fp64_output may hide the non-determinism in eager.
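
A rough illustration of why a tolerance check against a reference can mask run-to-run differences. This is a plain-Python sketch with hypothetical helpers (`same_within`, `eager_is_deterministic`, `make_flaky_model`); it does not reproduce the benchmark's actual comparison code.

```python
import math
from typing import Callable, Sequence


def same_within(a: Sequence[float], b: Sequence[float], tol: float) -> bool:
    """Tolerance-based comparison, as one might do against a reference."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=tol, abs_tol=tol) for x, y in zip(a, b)
    )


def eager_is_deterministic(run: Callable[[], Sequence[float]]) -> bool:
    """Compare two runs with each other, exactly, with no reference involved."""
    first, second = run(), run()
    return len(first) == len(second) and all(x == y for x, y in zip(first, second))


def make_flaky_model() -> Callable[[], Sequence[float]]:
    """Hypothetical model whose first output drifts slightly on every call."""
    calls = {"n": 0}

    def run() -> Sequence[float]:
        calls["n"] += 1
        return [1.0 + 1e-6 * calls["n"], 2.0]

    return run


reference = [1.0, 2.0]  # stand-in for a higher-precision result
flaky = make_flaky_model()
run_a, run_b = flaky(), flaky()

# Both runs are "close enough" to the reference, so the drift goes unnoticed.
hidden = same_within(run_a, reference, 1e-3) and same_within(run_b, reference, 1e-3)
# Comparing the two runs directly exposes it.
detected = not eager_is_deterministic(make_flaky_model())
```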

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95616
Approved by: https://github.com/ZainRizvi
2023-02-27 20:11:21 +00:00
Bin Bao
ab1ab3ab19 [CI] Specify more torch.backends.cudnn options to reduce non-determinism (#95478)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95478
Approved by: https://github.com/ezyang
2023-02-25 18:54:12 +00:00
Bin Bao
4c8ad93a7c [Inductor][CI] Remove hf_GPT2_large from CPU inference test (#95473)
Summary: hf_GPT2_large fails randomly on CI for CPU inference. Created https://github.com/pytorch/pytorch/issues/95474 for the Intel team to investigate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95473
Approved by: https://github.com/anijain2305
2023-02-24 18:21:36 +00:00
Will Constable
8de4238a31 Add dynamo bench arg --per_process_memory_fraction (#95260)
Simply pipes the arg to the existing torch.cuda API by the same name.

Useful for locally debugging OOMs that happened on a smaller GPU.
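
A sketch of how such an argument could be wired up, assuming the flag is a float that is forwarded to `torch.cuda.set_per_process_memory_fraction`. The `apply_memory_fraction` helper, its `setter` parameter, the range check, and the "1.0 means no cap" convention are assumptions made here so the example runs without a GPU.

```python
import argparse
from typing import Callable, List


def _torch_setter(fraction: float) -> None:
    # Imported lazily so the sketch can run without torch installed.
    import torch

    torch.cuda.set_per_process_memory_fraction(fraction)


def apply_memory_fraction(
    argv: List[str], setter: Callable[[float], None] = _torch_setter
) -> float:
    """Parse the flag and forward it to the setter when a cap is requested."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--per_process_memory_fraction", type=float, default=1.0)
    args = parser.parse_args(argv)

    fraction = args.per_process_memory_fraction
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if fraction != 1.0:  # assumption: 1.0 means "no cap", so nothing to set
        setter(fraction)
    return fraction


# Usage with a fake setter that just records what it was asked to do.
recorded: List[float] = []
capped = apply_memory_fraction(
    ["--per_process_memory_fraction", "0.5"], setter=recorded.append
)
uncapped = apply_memory_fraction([], setter=recorded.append)
```

Capping the fraction on a large GPU is one way to approximate the memory budget of a smaller one when reproducing an OOM locally.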

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95260
Approved by: https://github.com/davidberard98
2023-02-22 05:11:18 +00:00
Edward Z. Yang
08370ddad8 Update model skips (#95089)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95089
Approved by: https://github.com/albanD
2023-02-20 13:24:49 +00:00
Wang, Eikan
954c767bc6 [Inductor] Enable accuracy test for CPPBackend (#94898)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94898
Approved by: https://github.com/jgong5, https://github.com/desertfire
2023-02-20 05:02:15 +00:00
Edward Z. Yang
a2f44d82f8 Flag guard unbacked SymInt/SymFloat support (#94987)
I believe this fixes the AllenaiLongformerBase problem in periodic.

The longer version of the problem is that we are currently optimistically converting all item() calls into unbacked SymInt/SymFloat, but sometimes this results in a downstream error due to a data-dependent guard. Fallbacks for this case are non-existent; this will just crash the model. This is bad. So we flag guard until we get working fallbacks.

What could these fallbacks look like? One idea I have is to optimistically make data-dependent calls unbacked, but then, if that results in a crash, restart Dynamo analysis with the plan of graph breaking immediately at the item() call.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94987
Approved by: https://github.com/Skylion007, https://github.com/malfet
2023-02-17 00:25:05 +00:00
Edward Z. Yang
7aaebe00ee Fail dynamic_aot_eager AllenaiLongformerBase model (#94986)
```
GuardOnDataDependentSymNode: It appears that you're trying to get a value out of symbolic int/float whose value is data-dependent (and thus we do not know the true value.)  The expression we were trying to evaluate is Eq(i3, -1).  Scroll up to see where each of these data-dependent accesses originally occurred.

While executing %as_strided : [#users=1] = call_method[target=as_strided](args = (%pad,), kwargs = {size: (12, %add, 768, 64), stride: (%getitem, %mul, %getitem_1, %getitem_2)})
Original traceback:
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/transformers/models/longformer/modeling_longformer.py", line 928, in <graph break in _sliding_chunks_matmul_attn_probs_value>
    chunked_value = padded_value.as_strided(size=chunked_value_size, stride=chunked_value_stride)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94986
Approved by: https://github.com/albanD
2023-02-16 20:02:46 +00:00
Aaron Gokaslan
0444a6c90a [BE] Remove deprecated logging warn method (#94708)
Swaps all logging.warn calls to logging.warning since the former is deprecated and even raises a deprecation warning now.
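
For illustration, the replacement looks like the following. The logger name and the message are made up for this sketch; only the `warn` to `warning` swap reflects the change described.

```python
import io
import logging

stream = io.StringIO()
logger = logging.getLogger("demo_logger")  # hypothetical logger name
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s:%(message)s"))
logger.addHandler(handler)
logger.propagate = False  # keep the demo output out of the root logger

# Before: logger.warn("cache miss for %s", "resnet50")  # deprecated alias
logger.warning("cache miss for %s", "resnet50")

output = stream.getvalue()
```

The arguments and the resulting log record are the same; only the method name changes.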

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94708
Approved by: https://github.com/ezyang
2023-02-13 18:24:52 +00:00
Edward Z. Yang
ae7a628b03 Dynamic shapes CI updates (#94690)
Data from https://github.com/pytorch/pytorch/pull/94683

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94690
Approved by: https://github.com/cpuhrsch
2023-02-13 18:20:12 +00:00
PyTorch MergeBot
10c430ba0a Revert "Set torch.backends.cudnn.enabled to false when testing accuracy (#94363)"
This reverts commit 2a5851735a.

Reverted https://github.com/pytorch/pytorch/pull/94363 on behalf of https://github.com/desertfire due to TIMM models start to show flaky failures after this PR, need more investigation
2023-02-10 04:40:32 +00:00
Bin Bao
2a5851735a Set torch.backends.cudnn.enabled to false when testing accuracy (#94363)
Summary: It looks like setting torch.backends.cudnn.deterministic to
True is not enough for eliminating non-determinism when testing
benchmarks with --accuracy, so let's turn off cudnn completely.
With this change, mobilenet_v3_large does not show random failure on my
local environment. Also take this chance to clean up CI skip lists.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94363
Approved by: https://github.com/ezyang
2023-02-09 23:43:13 +00:00
Xuehai Pan
a229b4526f [BE] Prefer dash over underscore in command-line options (#94505)
Preferring dash over underscore in command-line options. Add `--command-arg-name` to the argument parser. The old arguments with underscores `--command_arg_name` are kept for backward compatibility.

Both dashes and underscores are used in the PyTorch codebase. Some argument parsers only have dashes or only have underscores in arguments. For example, the `torchrun` utility for distributed training only accepts underscore arguments (e.g., `--master_port`). The dashes are more common in other command-line tools. And it looks to be the default choice in the Python standard library:

`argparse.BooleanOptionalAction`: 4a9dff0e5a/Lib/argparse.py (L893-L895)

```python
class BooleanOptionalAction(Action):
    def __init__(...):
            if option_string.startswith('--'):
                option_string = '--no-' + option_string[2:]
                _option_strings.append(option_string)
```

It adds `--no-argname`, not `--no_argname`. Also, unlike `-`, typing `_` requires pressing the shift or caps-lock key.
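
A small sketch of the pattern, assuming both spellings are registered on one argument so old invocations keep working. The `--verbose` option and the `demo` program name are hypothetical; `BooleanOptionalAction` requires Python 3.9 or later.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="demo")
    # The dashed spelling comes first, so argparse derives the destination
    # "command_arg_name" from it; the underscore spelling stays as an alias.
    parser.add_argument("--command-arg-name", "--command_arg_name", default=None)
    # BooleanOptionalAction generates the "--no-verbose" form automatically.
    parser.add_argument(
        "--verbose", action=argparse.BooleanOptionalAction, default=False
    )
    return parser


parser = build_parser()
new_style = parser.parse_args(["--command-arg-name", "x"])
old_style = parser.parse_args(["--command_arg_name", "x"])
negated = parser.parse_args(["--no-verbose"])
enabled = parser.parse_args(["--verbose"])
```

Both spellings populate the same attribute, so code that reads `args.command_arg_name` does not need to change.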

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94505
Approved by: https://github.com/ezyang, https://github.com/seemethere
2023-02-09 20:16:49 +00:00
Edward Z. Yang
c028fc4e25 Decouple PT2 dynamic shapes from the functorch setting (#94469)
The functorch setting still exists, but it is no longer necessary for
PT2: we infer use of the Python dispatcher by checking whether the
ambient FakeTensorMode has a ShapeEnv. The setting now only controls
direct AOTAutograd use; for PT2, it's sufficient to use
torch._dynamo.config.dynamic_shapes.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94469
Approved by: https://github.com/Chillee, https://github.com/voznesenskym, https://github.com/jansel
2023-02-09 06:41:41 +00:00
PyTorch MergeBot
ca63040d2b Revert "Set torch.backends.cudnn.enabled to false when testing accuracy (#94363)"
This reverts commit 7bfc59993d.

Reverted https://github.com/pytorch/pytorch/pull/94363 on behalf of https://github.com/huydhn due to This change fails in trunk 7bfc59993d running out of memory.  Mark this as weird because it was green in PR
2023-02-09 01:24:35 +00:00
Bin Bao
7bfc59993d Set torch.backends.cudnn.enabled to false when testing accuracy (#94363)
Summary: It looks like setting torch.backends.cudnn.deterministic to
True is not enough for eliminating non-determinism when testing
benchmarks with --accuracy, so let's turn off cudnn completely.
With this change, mobilenet_v3_large does not show random failure on my
local environment. Also take this chance to clean up CI skip lists.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94363
Approved by: https://github.com/ezyang
2023-02-08 23:30:10 +00:00