Commit Graph

653 Commits

Author SHA1 Message Date
Xuehai Pan
ba48cf6535 [BE][Easy][6/19] enforce style for empty lines in import segments in test/ (#129757)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129757
Approved by: https://github.com/ezyang
2024-07-17 06:42:37 +00:00
Xuehai Pan
4d7bf72d93 [BE][Easy] fix ruff rule needless-bool (SIM103) (#130206)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130206
Approved by: https://github.com/malfet
2024-07-14 08:17:52 +00:00
Yuanhao Ji
312652c325 [RFC] Add support for device extension autoloading (#127074)
Fixes #122468

- Load device extensions at the end of `torch/__init__.py`
- Enabled by default, or you can disable it with `TORCH_DEVICE_BACKEND_AUTOLOAD=0`

run test:

```python
python test/run_test.py -i test_autoload_enable
python test/run_test.py -i test_autoload_disable
```

doc:

https://docs-preview.pytorch.org/pytorch/pytorch/127074/miscellaneous_environment_variables.html

co-author:  @jgong5 @bsochack @bkowalskiINTEL @jczaja @FFFrog @hipudding

Co-authored-by: albanD <desmaison.alban@gmail.com>
Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127074
Approved by: https://github.com/albanD, https://github.com/jgong5
2024-07-09 06:14:13 +00:00
Catherine Lee
91a8376d47 run_test: Unset cpp stacktraces after reruns (#129004)
Rerun the failing test singly with the env var set.  If it succeeds, start a new process without the cpp stack traces env var

We don't want to waste time generating these if we don't have to

They can also show up in assertion errors, which may cause unexpected failures if a test wants to check these

Adds new --rs (run single) to be used the same way --scs and --sc are.  It will only run the single test in the step current file

https://hud.pytorch.org/pytorch/pytorch/pull/129004?sha=2c349d3557d399020bf1f6a8b7045e2e4957ba46 has some examples of logs

In the above:
* test_checkpoint_valid failed, then passed in another subprocess.  The testing continued in a different new subprocess from the test right after it (test_checkpointing_without_reentrant_early_free)
* test_format_traceback_short failed consistently, but it continued to run because keep-going was set

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129004
Approved by: https://github.com/PaliC
2024-07-03 01:50:15 +00:00
Xuehai Pan
4ee1cb9b95 [BE][Easy] replace import pathlib with from pathlib import Path (#129426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129426
Approved by: https://github.com/malfet
2024-06-30 01:36:07 +00:00
PyTorch MergeBot
2effbcfcd8 Revert "[BE][Easy] replace import pathlib with from pathlib import Path (#129426)"
This reverts commit 6d75604ef1.

Reverted https://github.com/pytorch/pytorch/pull/129426 on behalf of https://github.com/XuehaiPan due to recognize `Path` as new exported API ([comment](https://github.com/pytorch/pytorch/pull/129426#issuecomment-2198371625))
2024-06-29 23:24:06 +00:00
Xuehai Pan
6d75604ef1 [BE][Easy] replace import pathlib with from pathlib import Path (#129426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129426
Approved by: https://github.com/malfet
2024-06-29 15:42:09 +00:00
Catherine Lee
8892ddaacc [TD] Test removal on sm86 (#127131)
Yolo

I'm excited to break CI :')
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127131
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-06-07 20:19:18 +00:00
Howard Huang
baaa914bf7 [small] test clean up (#128079)
remove unnecessary line: https://github.com/pytorch/pytorch/issues/123733
add main so test can be run `python ...`: https://github.com/pytorch/pytorch/issues/124906

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128079
Approved by: https://github.com/awgu
2024-06-06 21:21:40 +00:00
chuanqiw
627d2cd87d [CI] disable td for xpu ci test by default (#127611)
Due to the xpu ci test has been enabled td by default, a lot of test cases (75%) have been skipped in CI tests. It caused some ci failures escaped from the ci tests, for example issue #127539. This PR depends on PR #127595 landed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127611
Approved by: https://github.com/etaf, https://github.com/atalman
2024-06-04 17:15:10 +00:00
Catherine Lee
a31a60d85b Change run_test.py arg parsing to handle additional args better (#126709)
Do not inherit parser from common_utils
* I don't think we use any variables in run_test that depend on those, and I think all tests except doctests run in a subprocess so they will parse the args in common_utils and set the variables.  I don't think doctests wants any of those variables?

Parse known args, add the extra args as extra, pass the extra ones along to the subprocess
Removes the first instance of `--`

I think I will miss run_test telling me if an arg is valid or not

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126709
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn, https://github.com/Flamefire
2024-05-23 21:08:12 +00:00
Catherine Lee
ac2c547838 [TD] Upload names of failures to s3 for pytest cache (#126315)
Some tests don't get run through pytest and pytest crashes when a test segfaults, so in both caess, the pytest cache won't have an entry (similar to https://github.com/pytorch/test-infra/pull/5205).

Instead, manually upload/download an extra file that lists the failing test files

Technically this would be more general than the pytest cache
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126315
Approved by: https://github.com/ZainRizvi
2024-05-21 16:29:31 +00:00
PyTorch MergeBot
8bca0847c2 Revert "[TD] Upload names of failures to s3 for pytest cache (#126315)"
This reverts commit 655038687a.

Reverted https://github.com/pytorch/pytorch/pull/126315 on behalf of https://github.com/clee2000 due to broke inductor ([comment](https://github.com/pytorch/pytorch/pull/126315#issuecomment-2121133045))
2024-05-20 20:15:08 +00:00
Catherine Lee
655038687a [TD] Upload names of failures to s3 for pytest cache (#126315)
Some tests don't get run through pytest and pytest crashes when a test segfaults, so in both caess, the pytest cache won't have an entry (similar to https://github.com/pytorch/test-infra/pull/5205).

Instead, manually upload/download an extra file that lists the failing test files

Technically this would be more general than the pytest cache
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126315
Approved by: https://github.com/ZainRizvi
2024-05-20 17:36:30 +00:00
drisspg
762ce6f062 Add Lowering for FlexAttention Backwards (#125515)
# Summary
#### What does this PR do?
It enables Inductor to actually generate the fused flex attention kernel for the backwards

I did some other things along the way:
- Abstract out the 'build_subgraph_buffer' subroutine and make it reusable between flex attention and flex_attention backwards. In total we need too build 3 subgraphs for fwd + bwd. 1 for the fwd graph and then 2 in the bwd. The FAv2 algorithm recomputes the parts of the forward (more efficiently since we already have the row_max via logsumexp), therefore we need to inline both the fwd graph and the joint graph in the bwds kernel.
- The version of the backwards kernel is from a somewhat older version of the triton tutorial implementation. I think that we should update in a follow up to a newer version. Notably the blocks need to be square for this to work as currently implemented. I am sure there are many opportunities for optimization.
- I didnt correctly register the decomp table + IndexMode when I landed: https://github.com/pytorch/pytorch/pull/123902, this remedies that.
- The rel_bias helper func was reversed in terms of causality. I updated and then add a test specific for "future causal" attention.
- This PRs but the main point that I think still needs to be worked out is the store_output call. I have it hacked up to be 'fake' but I dont think we want to land that and likely want to just have a mutated 'dq' and a stored_output 'dk'
- I also needed to update the `TritonTemplateKernel` to actually accept multiple subgraphs (modifications)
- I updated the benchmark to also profile bwds performance

### Benchmark Numbers:
_The current implementation is not parallelizing over ctx length in the bwd_
FWD Speedups

| Type    |   Speedup | shape              | score_mod   | dtype          |
|---------|-----------|--------------------|-------------|----------------|
| Average |     0.991 |                    |             |                |
| Max     |     1.182 | (16, 16, 4096, 64) | noop        | torch.bfloat16 |
| Min     |     0.796 | (2, 16, 512, 256)  | head_bias   | torch.bfloat16 |

BWD Speedups

| Type    |   Speedup | shape              | score_mod   | dtype          |
|---------|-----------|--------------------|-------------|----------------|
| Average |     0.291 |                    |             |                |
| Max     |     0.652 | (8, 16, 512, 64)   | head_bias   | torch.bfloat16 |
| Min     |     0.073 | (2, 16, 4096, 128) | head_bias   | torch.bfloat16 |

<details>

<summary>Full Data</summary>

| shape               | score_mod     | dtype          |   fwd_eager_time |   fwd_compiled_time |   bwd_eager_time |   bwd_compiled_time |   fwd_speedup |   bwd_speedup |
|---------------------|---------------|----------------|------------------|---------------------|------------------|---------------------|---------------|---------------|
| (2, 16, 512, 64)    | noop          | torch.bfloat16 |           19.936 |              19.092 |           57.851 |             193.564 |         1.044 |         0.299 |
| (2, 16, 512, 64)    | causal_mask   | torch.bfloat16 |           19.955 |              19.497 |           57.662 |             206.278 |         1.024 |         0.280 |
| (2, 16, 512, 64)    | relative_bias | torch.bfloat16 |           19.455 |              21.297 |           57.674 |             195.219 |         0.913 |         0.295 |
| (2, 16, 512, 64)    | head_bias     | torch.bfloat16 |           19.958 |              21.289 |           57.674 |             193.859 |         0.938 |         0.298 |
| (2, 16, 512, 128)   | noop          | torch.bfloat16 |           28.157 |              28.615 |           82.831 |             454.211 |         0.984 |         0.182 |
| (2, 16, 512, 128)   | causal_mask   | torch.bfloat16 |           28.154 |              28.444 |           83.091 |             432.083 |         0.990 |         0.192 |
| (2, 16, 512, 128)   | relative_bias | torch.bfloat16 |           28.722 |              27.897 |           83.175 |             446.789 |         1.030 |         0.186 |
| (2, 16, 512, 128)   | head_bias     | torch.bfloat16 |           28.299 |              27.673 |           83.052 |             459.179 |         1.023 |         0.181 |
| (2, 16, 512, 256)   | noop          | torch.bfloat16 |           41.167 |              50.504 |          175.019 |            1083.545 |         0.815 |         0.162 |
| (2, 16, 512, 256)   | causal_mask   | torch.bfloat16 |           41.656 |              51.933 |          175.078 |            1171.176 |         0.802 |         0.149 |
| (2, 16, 512, 256)   | relative_bias | torch.bfloat16 |           41.697 |              50.722 |          175.159 |            1097.312 |         0.822 |         0.160 |
| (2, 16, 512, 256)   | head_bias     | torch.bfloat16 |           41.690 |              52.387 |          175.184 |            1097.336 |         0.796 |         0.160 |
| (2, 16, 1024, 64)   | noop          | torch.bfloat16 |           39.232 |              37.454 |          127.847 |             612.430 |         1.047 |         0.209 |
| (2, 16, 1024, 64)   | causal_mask   | torch.bfloat16 |           39.930 |              39.599 |          127.755 |             665.359 |         1.008 |         0.192 |
| (2, 16, 1024, 64)   | relative_bias | torch.bfloat16 |           39.417 |              41.304 |          127.902 |             614.990 |         0.954 |         0.208 |
| (2, 16, 1024, 64)   | head_bias     | torch.bfloat16 |           39.965 |              42.034 |          127.953 |             613.273 |         0.951 |         0.209 |
| (2, 16, 1024, 128)  | noop          | torch.bfloat16 |           63.964 |              71.024 |          226.510 |            1637.669 |         0.901 |         0.138 |
| (2, 16, 1024, 128)  | causal_mask   | torch.bfloat16 |           63.843 |              72.451 |          226.750 |            1558.949 |         0.881 |         0.145 |
| (2, 16, 1024, 128)  | relative_bias | torch.bfloat16 |           64.301 |              70.487 |          226.651 |            1610.063 |         0.912 |         0.141 |
| (2, 16, 1024, 128)  | head_bias     | torch.bfloat16 |           64.033 |              71.394 |          226.676 |            1668.511 |         0.897 |         0.136 |
| (2, 16, 1024, 256)  | noop          | torch.bfloat16 |          129.348 |             141.390 |          507.337 |            4405.175 |         0.915 |         0.115 |
| (2, 16, 1024, 256)  | causal_mask   | torch.bfloat16 |          129.538 |             145.680 |          507.178 |            4768.874 |         0.889 |         0.106 |
| (2, 16, 1024, 256)  | relative_bias | torch.bfloat16 |          129.438 |             142.782 |          507.004 |            4401.002 |         0.907 |         0.115 |
| (2, 16, 1024, 256)  | head_bias     | torch.bfloat16 |          129.058 |             146.242 |          507.547 |            4434.251 |         0.883 |         0.114 |
| (2, 16, 4096, 64)   | noop          | torch.bfloat16 |          481.606 |             409.120 |         1440.890 |           14147.269 |         1.177 |         0.102 |
| (2, 16, 4096, 64)   | causal_mask   | torch.bfloat16 |          480.227 |             438.847 |         1434.419 |           14973.386 |         1.094 |         0.096 |
| (2, 16, 4096, 64)   | relative_bias | torch.bfloat16 |          480.831 |             458.104 |         1432.935 |           14193.253 |         1.050 |         0.101 |
| (2, 16, 4096, 64)   | head_bias     | torch.bfloat16 |          480.749 |             452.497 |         1437.040 |           14084.869 |         1.062 |         0.102 |
| (2, 16, 4096, 128)  | noop          | torch.bfloat16 |          872.534 |             848.275 |         2600.895 |           35156.849 |         1.029 |         0.074 |
| (2, 16, 4096, 128)  | causal_mask   | torch.bfloat16 |          872.647 |             868.279 |         2587.581 |           31919.531 |         1.005 |         0.081 |
| (2, 16, 4096, 128)  | relative_bias | torch.bfloat16 |          871.484 |             827.644 |         2593.989 |           34805.634 |         1.053 |         0.075 |
| (2, 16, 4096, 128)  | head_bias     | torch.bfloat16 |          871.422 |             856.437 |         2602.482 |           35708.591 |         1.017 |         0.073 |
| (2, 16, 4096, 256)  | noop          | torch.bfloat16 |         1904.497 |            1758.183 |         6122.416 |           66754.593 |         1.083 |         0.092 |
| (2, 16, 4096, 256)  | causal_mask   | torch.bfloat16 |         1911.174 |            1762.821 |         6113.207 |           72759.392 |         1.084 |         0.084 |
| (2, 16, 4096, 256)  | relative_bias | torch.bfloat16 |         1911.254 |            1727.108 |         6123.530 |           66577.988 |         1.107 |         0.092 |
| (2, 16, 4096, 256)  | head_bias     | torch.bfloat16 |         1916.977 |            1801.804 |         6118.158 |           67359.680 |         1.064 |         0.091 |
| (8, 16, 512, 64)    | noop          | torch.bfloat16 |           44.984 |              43.974 |          170.276 |             262.259 |         1.023 |         0.649 |
| (8, 16, 512, 64)    | causal_mask   | torch.bfloat16 |           45.001 |              46.265 |          170.509 |             274.893 |         0.973 |         0.620 |
| (8, 16, 512, 64)    | relative_bias | torch.bfloat16 |           45.466 |              48.211 |          170.606 |             262.759 |         0.943 |         0.649 |
| (8, 16, 512, 64)    | head_bias     | torch.bfloat16 |           45.481 |              48.435 |          170.267 |             261.265 |         0.939 |         0.652 |
| (8, 16, 512, 128)   | noop          | torch.bfloat16 |           72.565 |              74.736 |          313.220 |             773.126 |         0.971 |         0.405 |
| (8, 16, 512, 128)   | causal_mask   | torch.bfloat16 |           72.015 |              75.755 |          313.311 |             775.513 |         0.951 |         0.404 |
| (8, 16, 512, 128)   | relative_bias | torch.bfloat16 |           72.105 |              74.189 |          313.806 |             769.238 |         0.972 |         0.408 |
| (8, 16, 512, 128)   | head_bias     | torch.bfloat16 |           72.005 |              74.364 |          313.509 |             775.237 |         0.968 |         0.404 |
| (8, 16, 512, 256)   | noop          | torch.bfloat16 |          138.656 |             165.453 |          663.707 |            2672.067 |         0.838 |         0.248 |
| (8, 16, 512, 256)   | causal_mask   | torch.bfloat16 |          139.096 |             172.613 |          663.593 |            2926.538 |         0.806 |         0.227 |
| (8, 16, 512, 256)   | relative_bias | torch.bfloat16 |          139.500 |             168.417 |          663.938 |            2658.629 |         0.828 |         0.250 |
| (8, 16, 512, 256)   | head_bias     | torch.bfloat16 |          139.776 |             173.549 |          662.920 |            2667.266 |         0.805 |         0.249 |
| (8, 16, 1024, 64)   | noop          | torch.bfloat16 |          134.883 |             125.004 |          484.706 |            1195.254 |         1.079 |         0.406 |
| (8, 16, 1024, 64)   | causal_mask   | torch.bfloat16 |          134.297 |             132.875 |          485.420 |            1234.953 |         1.011 |         0.393 |
| (8, 16, 1024, 64)   | relative_bias | torch.bfloat16 |          134.839 |             139.231 |          485.470 |            1198.556 |         0.968 |         0.405 |
| (8, 16, 1024, 64)   | head_bias     | torch.bfloat16 |          133.822 |             136.449 |          485.608 |            1189.198 |         0.981 |         0.408 |
| (8, 16, 1024, 128)  | noop          | torch.bfloat16 |          235.470 |             234.765 |          886.094 |            2662.944 |         1.003 |         0.333 |
| (8, 16, 1024, 128)  | causal_mask   | torch.bfloat16 |          236.305 |             241.382 |          886.293 |            2646.984 |         0.979 |         0.335 |
| (8, 16, 1024, 128)  | relative_bias | torch.bfloat16 |          236.414 |             233.980 |          885.250 |            2642.178 |         1.010 |         0.335 |
| (8, 16, 1024, 128)  | head_bias     | torch.bfloat16 |          237.176 |             239.040 |          885.754 |            2665.242 |         0.992 |         0.332 |
| (8, 16, 1024, 256)  | noop          | torch.bfloat16 |          504.445 |             517.855 |         1978.956 |            9592.906 |         0.974 |         0.206 |
| (8, 16, 1024, 256)  | causal_mask   | torch.bfloat16 |          502.428 |             536.002 |         1978.611 |           10607.342 |         0.937 |         0.187 |
| (8, 16, 1024, 256)  | relative_bias | torch.bfloat16 |          503.396 |             523.960 |         1977.993 |            9539.284 |         0.961 |         0.207 |
| (8, 16, 1024, 256)  | head_bias     | torch.bfloat16 |          503.818 |             536.014 |         1980.131 |            9576.262 |         0.940 |         0.207 |
| (8, 16, 4096, 64)   | noop          | torch.bfloat16 |         1970.139 |            1674.930 |         5750.940 |           16724.134 |         1.176 |         0.344 |
| (8, 16, 4096, 64)   | causal_mask   | torch.bfloat16 |         1959.036 |            1775.056 |         5780.512 |           17390.350 |         1.104 |         0.332 |
| (8, 16, 4096, 64)   | relative_bias | torch.bfloat16 |         1947.198 |            1773.869 |         5780.643 |           16779.699 |         1.098 |         0.345 |
| (8, 16, 4096, 64)   | head_bias     | torch.bfloat16 |         1963.935 |            1829.502 |         5780.018 |           16703.259 |         1.073 |         0.346 |
| (8, 16, 4096, 128)  | noop          | torch.bfloat16 |         3582.711 |            3362.623 |        10436.069 |           36415.565 |         1.065 |         0.287 |
| (8, 16, 4096, 128)  | causal_mask   | torch.bfloat16 |         3581.504 |            3499.472 |        10346.869 |           36164.959 |         1.023 |         0.286 |
| (8, 16, 4096, 128)  | relative_bias | torch.bfloat16 |         3589.779 |            3337.849 |        10529.621 |           36261.696 |         1.075 |         0.290 |
| (8, 16, 4096, 128)  | head_bias     | torch.bfloat16 |         3602.265 |            3436.444 |        10458.660 |           36507.790 |         1.048 |         0.286 |
| (8, 16, 4096, 256)  | noop          | torch.bfloat16 |         7695.923 |            7126.275 |        24643.009 |          140949.081 |         1.080 |         0.175 |
| (8, 16, 4096, 256)  | causal_mask   | torch.bfloat16 |         7679.939 |            7186.252 |        24538.105 |          157156.067 |         1.069 |         0.156 |
| (8, 16, 4096, 256)  | relative_bias | torch.bfloat16 |         7681.374 |            6994.832 |        24549.713 |          140077.179 |         1.098 |         0.175 |
| (8, 16, 4096, 256)  | head_bias     | torch.bfloat16 |         7679.822 |            7212.278 |        24627.823 |          140675.003 |         1.065 |         0.175 |
| (16, 16, 512, 64)   | noop          | torch.bfloat16 |           80.126 |              78.291 |          333.719 |             541.165 |         1.023 |         0.617 |
| (16, 16, 512, 64)   | causal_mask   | torch.bfloat16 |           80.065 |              81.696 |          333.779 |             551.113 |         0.980 |         0.606 |
| (16, 16, 512, 64)   | relative_bias | torch.bfloat16 |           80.138 |              86.715 |          333.364 |             542.118 |         0.924 |         0.615 |
| (16, 16, 512, 64)   | head_bias     | torch.bfloat16 |           80.415 |              85.204 |          333.294 |             536.840 |         0.944 |         0.621 |
| (16, 16, 512, 128)  | noop          | torch.bfloat16 |          134.964 |             138.025 |          607.093 |            1333.102 |         0.978 |         0.455 |
| (16, 16, 512, 128)  | causal_mask   | torch.bfloat16 |          134.192 |             141.523 |          606.269 |            1424.318 |         0.948 |         0.426 |
| (16, 16, 512, 128)  | relative_bias | torch.bfloat16 |          135.711 |             138.639 |          606.283 |            1327.974 |         0.979 |         0.457 |
| (16, 16, 512, 128)  | head_bias     | torch.bfloat16 |          135.552 |             140.555 |          607.107 |            1347.370 |         0.964 |         0.451 |
| (16, 16, 512, 256)  | noop          | torch.bfloat16 |          275.113 |             315.144 |         1301.583 |            5268.153 |         0.873 |         0.247 |
| (16, 16, 512, 256)  | causal_mask   | torch.bfloat16 |          274.867 |             328.106 |         1302.513 |            5770.594 |         0.838 |         0.226 |
| (16, 16, 512, 256)  | relative_bias | torch.bfloat16 |          276.052 |             321.770 |         1302.904 |            5241.920 |         0.858 |         0.249 |
| (16, 16, 512, 256)  | head_bias     | torch.bfloat16 |          271.409 |             328.839 |         1302.142 |            5266.037 |         0.825 |         0.247 |
| (16, 16, 1024, 64)  | noop          | torch.bfloat16 |          260.489 |             237.463 |          955.884 |            1817.558 |         1.097 |         0.526 |
| (16, 16, 1024, 64)  | causal_mask   | torch.bfloat16 |          262.378 |             254.350 |          955.280 |            1843.807 |         1.032 |         0.518 |
| (16, 16, 1024, 64)  | relative_bias | torch.bfloat16 |          261.338 |             268.253 |          956.038 |            1820.036 |         0.974 |         0.525 |
| (16, 16, 1024, 64)  | head_bias     | torch.bfloat16 |          262.153 |             264.156 |          956.023 |            1810.076 |         0.992 |         0.528 |
| (16, 16, 1024, 128) | noop          | torch.bfloat16 |          476.475 |             461.413 |         1760.578 |            4306.521 |         1.033 |         0.409 |
| (16, 16, 1024, 128) | causal_mask   | torch.bfloat16 |          473.794 |             479.178 |         1761.277 |            4619.439 |         0.989 |         0.381 |
| (16, 16, 1024, 128) | relative_bias | torch.bfloat16 |          473.839 |             463.282 |         1758.692 |            4290.562 |         1.023 |         0.410 |
| (16, 16, 1024, 128) | head_bias     | torch.bfloat16 |          472.979 |             472.896 |         1763.086 |            4367.931 |         1.000 |         0.404 |
| (16, 16, 1024, 256) | noop          | torch.bfloat16 |         1014.184 |            1026.764 |         3922.997 |           19104.147 |         0.988 |         0.205 |
| (16, 16, 1024, 256) | causal_mask   | torch.bfloat16 |         1013.217 |            1039.046 |         3928.382 |           21086.281 |         0.975 |         0.186 |
| (16, 16, 1024, 256) | relative_bias | torch.bfloat16 |         1008.519 |            1015.278 |         3922.133 |           18980.652 |         0.993 |         0.207 |
| (16, 16, 1024, 256) | head_bias     | torch.bfloat16 |         1011.360 |            1047.542 |         3931.245 |           19069.172 |         0.965 |         0.206 |
| (16, 16, 4096, 64)  | noop          | torch.bfloat16 |         3929.850 |            3325.667 |        11411.704 |           23344.280 |         1.182 |         0.489 |
| (16, 16, 4096, 64)  | causal_mask   | torch.bfloat16 |         3885.262 |            3581.544 |        11390.515 |           23725.639 |         1.085 |         0.480 |
| (16, 16, 4096, 64)  | relative_bias | torch.bfloat16 |         3865.737 |            3537.308 |        11489.901 |           23406.330 |         1.093 |         0.491 |
| (16, 16, 4096, 64)  | head_bias     | torch.bfloat16 |         3880.530 |            3665.249 |        11484.411 |           23299.496 |         1.059 |         0.493 |
| (16, 16, 4096, 128) | noop          | torch.bfloat16 |         7030.306 |            6745.715 |        20621.264 |           57464.096 |         1.042 |         0.359 |
| (16, 16, 4096, 128) | causal_mask   | torch.bfloat16 |         7095.414 |            7034.385 |        20410.656 |           61660.511 |         1.009 |         0.331 |
| (16, 16, 4096, 128) | relative_bias | torch.bfloat16 |         7084.779 |            6686.497 |        20315.161 |           57243.969 |         1.060 |         0.355 |
| (16, 16, 4096, 128) | head_bias     | torch.bfloat16 |         7075.367 |            6863.305 |        20494.385 |           58481.953 |         1.031 |         0.350 |
| (16, 16, 4096, 256) | noop          | torch.bfloat16 |        15612.741 |           14297.482 |        55306.847 |          281161.865 |         1.092 |         0.197 |
| (16, 16, 4096, 256) | causal_mask   | torch.bfloat16 |        15326.592 |           14263.878 |        55227.806 |          313063.232 |         1.075 |         0.176 |
| (16, 16, 4096, 256) | relative_bias | torch.bfloat16 |        15297.963 |           14007.379 |        54558.029 |          279529.175 |         1.092 |         0.195 |
| (16, 16, 4096, 256) | head_bias     | torch.bfloat16 |        15216.160 |           14276.027 |        55081.581 |          280996.826 |         1.066 |         0.196 |

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125515
Approved by: https://github.com/Chillee
2024-05-17 00:41:55 +00:00
Jithun Nair
14d8e3aec0 Add distributed/_tensor/test_attention to ROCM_BLOCKLIST (#126336)
Fixes #125504
Fixes #126252
Fixes #126296
Fixes #126330

This PR doesn't really fix the RingAttentionTest tests for ROCm, but explicitly adds the whole test file to ROCM_BLOCKLIST to get a clean signal on ROCm distributed CI. We will enable these tests in a follow-up PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126336
Approved by: https://github.com/huydhn, https://github.com/pruthvistony
2024-05-16 16:38:09 +00:00
Catherine Lee
48f98bcdfc [TD] Enable test removal on most default configs + distributed CUDA for everyone (#125931)
yolo

Add the longest jobs in pull:
* default cpu configs
* non sm86 cuda
* distributed cuda for everyone

Still excluding
* slow, inductor, rocm, onnx, mac, dynamo
* distributed cpu
* windows cuda
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125931
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-05-14 17:35:12 +00:00
Catherine Lee
6f619cc727 [ez] functorch/test_vmap and test_dataloader to run in parallel (#125597)
Also mark test_svd serial in linalg to see if it helps with the flakiness
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125597
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-05-08 15:37:29 +00:00
Huy Do
0e57bbb6d7 Set timeout for C++ tests (#125517)
Looking at the unrelated Windows timeout failure on https://github.com/pytorch/pytorch/pull/125199, it looks like we don't have a timeout value set for C++ tests atm.  In this case, a C++ test on Windows timed out after 2+ hours.

```
2024-05-02T23:35:34.0639067Z Running cpp/c10_TypeList_test 1/1 ... [2024-05-02 23:35:34.059021]
2024-05-02T23:35:34.0641108Z Executing ['pytest', 'C:\\actions-runner\\_work\\pytorch\\pytorch\\build\\win_tmp\\build\\torch\\test\\c10_TypeList_test.exe', '-m', 'not serial', '-v', '-vv', '-rfEX', '-n', '2', '--junit-xml-reruns', 'test-reports\\python-pytest\\test\\run_test\\test\\run_test-c898ddeff8f33cbf.xml', '-x', '--reruns=2'] ... [2024-05-02 23:35:34.062137]
2024-05-03T02:45:33.7862004Z Process SpawnPoolWorker-2:
2024-05-03T02:45:33.7927201Z Traceback (most recent call last):
2024-05-03T02:45:33.7928032Z   File "C:\Jenkins\Miniconda3\lib\multiprocessing\process.py", line 315, in _bootstrap
2024-05-03T02:45:33.7928722Z     self.run()
2024-05-03T02:45:33.7929722Z   File "C:\Jenkins\Miniconda3\lib\multiprocessing\process.py", line 108, in run
2024-05-03T02:45:33.7931639Z     self._target(*self._args, **self._kwargs)
2024-05-03T02:45:33.7932435Z   File "C:\Jenkins\Miniconda3\lib\multiprocessing\pool.py", line 114, in worker
2024-05-03T02:45:33.7933338Z     task = get()
2024-05-03T02:45:33.7933946Z   File "C:\Jenkins\Miniconda3\lib\multiprocessing\queues.py", line 365, in get
2024-05-03T02:45:33.7935219Z     res = self._reader.recv_bytes()
2024-05-03T02:45:33.7935897Z   File "C:\Jenkins\Miniconda3\lib\multiprocessing\connection.py", line 221, in recv_bytes
2024-05-03T02:45:33.7936609Z     buf = self._recv_bytes(maxlength)
2024-05-03T02:45:33.7937302Z   File "C:\Jenkins\Miniconda3\lib\multiprocessing\connection.py", line 310, in _recv_bytes
2024-05-03T02:45:33.7938316Z     waitres = _winapi.WaitForMultipleObjects(
2024-05-03T02:45:33.7938766Z KeyboardInterrupt
```

Retrying was working, but it was already too late to finish the job.  I'm setting the same default `THRESHOLD * 3` timeout value here for C++ tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125517
Approved by: https://github.com/clee2000
2024-05-07 16:41:38 +00:00
Catherine Lee
848fce35b5 [CI][ez] Don't retry when it says don't retry (#125643)
default arg for retry_shell is retries=1
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125643
Approved by: https://github.com/huydhn
2024-05-07 16:20:00 +00:00
Catherine Lee
1b3fd83ab2 [TD] Enable TD on AVX related configs (#125482)
On test configs `nogpu_AVX512` and `nogpu_NO_AVX2`, which are the next longest jobs on trunk after windows
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125482
Approved by: https://github.com/huydhn
2024-05-06 22:02:16 +00:00
Catherine Lee
d4727fd4eb [TD][ez] Better check for is pr or not (#125485)
You can trigger ciflow tags on main branch commits, so we should be more conservative when checking to see if a workflow is a PR/on the main branch.

get_pr_number checks for the pr number based on the PR_NUMBER env var or a tag of the for `ciflow/workflow/pr number`

If we fail to find something like this, then assume it is on the main branch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125485
Approved by: https://github.com/huydhn
2024-05-04 03:08:44 +00:00
Catherine Lee
e16f1ee4cc [ez][CI] Move test_modules and test_schema_check off CI_SERIAL_LIST (#125193)
* Related https://github.com/pytorch/pytorch/pull/124085

As in title, move test_modules and test_schema_check off CI_SERIAL_LIST
If things fail, they can get the serialTest decorator instead
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125193
Approved by: https://github.com/huydhn
2024-05-01 15:48:48 +00:00
PyTorch MergeBot
e7631d6eae Revert "CI: add aarch64 linux workflow (#121284)"
This reverts commit 32cf04cb7f.

Reverted https://github.com/pytorch/pytorch/pull/121284 on behalf of https://github.com/malfet due to Test only changes has not been reverted ([comment](https://github.com/pytorch/pytorch/pull/121284#issuecomment-2083925890))
2024-04-30 00:24:11 +00:00
Catherine Lee
4d717cd7c3 [TD] Enable td on cpu windows (#125049)
yolo

Also
* Ensure that at least 1 test always gets run (`//` does truncation which results in 0 if you have too few tests discovered)
* Don't run test removal on slow tests - I'm not touching that yet

I am avoid everything other than pull + trunk workflows, so not doing this on windows CUDA, which runs on periodic
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125049
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-04-29 23:39:54 +00:00
Catherine Lee
faee0e5ee8 [ez][CI] Move test_linalg and test_sparse_csr off CI_SERIAL_LIST (#125068)
* https://github.com/pytorch/pytorch/pull/124649 for context

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125068
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-04-29 21:22:35 +00:00
Sunita Nadampalli
32cf04cb7f CI: add aarch64 linux workflow (#121284)
aarch64 linux workflow is triggered for ciflow/aarch64 tags.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121284
Approved by: https://github.com/atalman, https://github.com/malfet
2024-04-29 18:25:40 +00:00
egienvalue
8461e7ed9e Add test_cpp_extensions tests for stream_and_event and mita_backend (#123614)
Test the generic torch.Stream/Event with fake device gurad and hooks. Since we added a fake device backend, it is mutual exclusive to other backends. Tests will be skipped if TEST_CUDA or TEST_ROCM is true.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123614
Approved by: https://github.com/albanD
ghstack dependencies: #123611, #123612
2024-04-26 16:17:54 +00:00
PyTorch MergeBot
4a1299cc0e Revert "Add test_cpp_extensions tests for stream_and_event and mita_backend (#123614)"
This reverts commit 355dc34f86.

Reverted https://github.com/pytorch/pytorch/pull/123614 on behalf of https://github.com/jeffdaily due to this PR broke ROCm with message RuntimeError: Cannot have MTIA with other devices ([comment](https://github.com/pytorch/pytorch/pull/123612#issuecomment-2077649762))
2024-04-25 16:06:46 +00:00
Catherine Lee
4f29103749 [ez][CI] Move test_cuda off CI_SERIAL_LIST (#124649)
Tag test cases with large tensor with serial, also tag a few more that failed on a previous iteration of this PR

Move test_cuda and test_cuda_expandable_segments off the serial list
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124649
Approved by: https://github.com/ZainRizvi
2024-04-24 22:04:23 +00:00
egienvalue
355dc34f86 Add test_cpp_extensions tests for stream_and_event and mita_backend (#123614)
Test the generic torch.Stream/Event with fake device gurad and hooks.

Differential Revision: [D56443358](https://our.internmc.facebook.com/intern/diff/D56443358)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123614
Approved by: https://github.com/albanD
ghstack dependencies: #123611, #123612
2024-04-24 20:51:20 +00:00
Catherine Lee
8fe0b8b6a8 No CPP or xdist process level reruns (#124798)
xdist doesn't play well with current process level rerun scheme
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124798
Approved by: https://github.com/huydhn
2024-04-24 19:44:51 +00:00
PyTorch MergeBot
52da03edeb Revert "Add test_cpp_extensions tests for stream_and_event and mita_backend (#123614)"
This reverts commit b6f0159db0.

Reverted https://github.com/pytorch/pytorch/pull/123614 on behalf of https://github.com/jeffdaily due to This broke ROCm. see test_overrides.py ([comment](https://github.com/pytorch/pytorch/pull/123611#issuecomment-2067363780))
2024-04-19 22:44:26 +00:00
egienvalue
b6f0159db0 Add test_cpp_extensions tests for stream_and_event and mita_backend (#123614)
Test the generic torch.Stream/Event with fake device gurad and hooks.
@exported-using-ghexport

Differential Revision: [D55902506](https://our.internmc.facebook.com/intern/diff/D55902506/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123614
Approved by: https://github.com/albanD
ghstack dependencies: #123611, #123612
2024-04-18 17:40:13 +00:00
Catherine Lee
025387f4dd [ez][CI] Reduce CI_SERIAL_LIST pt2 (#124298)
#124085

Add @serialTest() to some tests

slow gradcheck already runs serially

Doing this slowly so its easier to check flaky issues that might get made

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124298
Approved by: https://github.com/kit1980
2024-04-18 00:13:36 +00:00
Catherine Lee
0abd3f60fd [CI] Reduce CI_SERIAL_LIST list (#124085)
Add serial marker for individual tests so the test file can be removed from the ci serial list
Run serial marked tests first in serial
Run all other tests afterwards in parallel

Slowly reduce list and mark individual tests as serial instead

Hope # of serial tests is small so sharding evenness doesn't get too messed up

Hopefully can do 3 procs for sm86 and cpu?

serial no longer looks like a real word to me

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124085
Approved by: https://github.com/seemethere, https://github.com/malfet
2024-04-17 00:23:47 +00:00
Catherine Lee
946b50c788 [ez][TD] Increase logging (#124082)
increase logging during td
generate an artifact that says which tests got excluded
fix minor bug where filter test configs couldnt get commit messages

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124082
Approved by: https://github.com/seemethere, https://github.com/malfet
2024-04-17 00:18:28 +00:00
Catherine Lee
3cd06f56b1 [ez] test_profiler in serial (#123665)
Add test_profiler to the serial list since we keep needing to reopen disable issues and I think its due to being incompatible with parallelism
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123665
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn
2024-04-11 20:24:47 +00:00
William Wen
4bee4c7c25 [3.12] enable inductor unittests (#123654)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123654
Approved by: https://github.com/jansel
2024-04-10 20:51:43 +00:00
Catherine Lee
61be8843c9 [TD] Use label to configure td on distributed for rollout (#122976)
Gate TD on distributed behind label

TODO:
auto add label to certain people's prs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122976
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-04-08 15:53:55 +00:00
William Wen
d59c5d7353 [dynamo, 3.12] enable dynamo on 3.12, enable most dynamo unittests on 3.12 (#123216)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123216
Approved by: https://github.com/jansel, https://github.com/malfet
2024-04-04 20:00:54 +00:00
Catherine Lee
b5bef9bbfd Fix cpp tests not running + failing to surface (#122845)
The comment in the code should have the information
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122845
Approved by: https://github.com/huydhn
2024-03-29 22:41:45 +00:00
Catherine Lee
03184a82dd [TD] TD on ASAN PR jobs (#122332)
Low impact CPU jobs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122332
Approved by: https://github.com/huydhn
2024-03-22 22:32:51 +00:00
eellison
cbbed46377 Defer selection of triton template (#120275)
Our prior approach to epilogue fusion was to select from a choice from a set of triton templates and extern calls based on benchmarking inputs, then unconditionally fuse epilogues. This can be sub-optimal in following ways:

- We select an extern kernel, however an epilogue like relu() exists such that choosing a triton template + relu would have been faster
- We select a triton template, epilogue fuse, and register spilling occurs causing it to be slower than not epilogue fusing.

In this PR we wait to select either the Triton Template or Extern Kernel based on benchmarking results from the kernel itself and its epilogue. As soon as a successful fusion occurs where a fused Triton Template + epilogue is faster than the unfused choice we finalize the MultiTemplateBuffer as a specific template. If no fusion occurs we'll finalize the MultiTemplateBuffer after fusion.

Note: if there are multiple epilogue fusions (not super likely), even though we select a template after the first fusion, we will still benchmark to see if subsequent epilogue are worth fusing. We could potentially defer choosing template in this case in a follow up at expense of compile time.

Gives 4% HF training win, 10% TIMM inference win. Increases compilation time which I will be trying to address more in follow up prs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120275
Approved by: https://github.com/jansel
ghstack dependencies: #121996
2024-03-20 01:40:33 +00:00
Kai Londenberg
a5ec45f2ec [Inductor Cutlass backend] Move tests to separate file (#121489)
Move Cutlass backend related tests to test/inductor/test_cutlass_backend.py - no changes to the tests themselves.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121489
Approved by: https://github.com/jansel
2024-03-12 21:59:48 +00:00
Catherine Lee
fac06a12c8 CI sanity check test for env vars (#120519)
Make a test that fails on purpose to trigger retries.  Check the opposite of success (that env vars exist)

It's bit hacky because I want it to fail on the normal flow in order to trigger reruns but I don't want to expose the failures to users since it's confusing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120519
Approved by: https://github.com/huydhn
2024-03-11 15:35:45 +00:00
PyTorch MergeBot
2c2d6ce515 Revert "CI sanity check test for env vars (#120519)"
This reverts commit f43b9c56c5.

Reverted https://github.com/pytorch/pytorch/pull/120519 on behalf of https://github.com/clee2000 due to broken on slow d27509c384 https://github.com/pytorch/pytorch/actions/runs/8208843198/job/22453617568 ([comment](https://github.com/pytorch/pytorch/pull/120519#issuecomment-1986480624))
2024-03-08 22:01:35 +00:00
Catherine Lee
f43b9c56c5 CI sanity check test for env vars (#120519)
Make a test that fails on purpose to trigger retries.  Check the opposite of success (that env vars exist)

It's bit hacky because I want it to fail on the normal flow in order to trigger reruns but I don't want to expose the failures to users since it's confusing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120519
Approved by: https://github.com/huydhn
2024-03-08 20:28:50 +00:00
Catherine Lee
06b52dd103 TD outside of test job (#118250)
Give TD it's own job so that each shard can get the results from this one job artifact and they will always be in sync with each other/no longer need to worry about consistently issues

* Move test discovery to its own file that is not dependent on torch so it can be run without building torch
  * Cannot do cpp test discovery before building pytorch
* Move TD calculation to own file that will create a json file with the final results
* TD is now job/build env agnostic
* TD will rank all tests, including those that test jobs may not want to run (ex it will rank distributed tests along with default tests, even though these tests are never run on the same machine together)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118250
Approved by: https://github.com/huydhn
2024-03-01 23:08:10 +00:00
Catherine Lee
0290fe65bd Test TD (test removal) on crossref (#119426)
Current threshold is to cut the bottom 75% of test files, which results in 13 min of tests getting cut.
test_ops, functorch/test_ops, and test_decomp and other really long running test files are not getting cut and make the top 25% to take really long (still 90+ min)

The original plan was to test on rocm but I'm worried about queuing given that cutting 75% of test files only cuts off 13 min, and crossref is rarely referenced by others and people keep talking about getting rid of it, so it's a good alternative

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119426
Approved by: https://github.com/huydhn
2024-02-29 18:53:43 +00:00