Commit Graph

4312 Commits

Author SHA1 Message Date
Sun, Jiayi
c173a9d9b3 add Half support for layer_norm on CPU (#99590)
### Testing
Single socket (icx, 32 cores):
| shape | fp32 forward (ms) | fp16 forward (ms) | mixed fp32 fp16 forward (ms) | fp32 backward (ms) | fp16 backward (ms) | mixed fp32 fp16 backward (ms) |
| -- | -- | -- | -- | -- | -- | -- |
| (1, 8, 16) | 0.012 | 0.011 | 0.011 | 0.051 | 0.051 | 0.050 |
| (8, 8, 16) | 0.013 | 0.013 | 0.013 | 0.054 | 0.053 | 0.051 |
| (32, 8, 16) | 0.015 | 0.014 | 0.014 | 0.059 | 0.054 | 0.052 |
| (64, 128, 56, 56) | 1.875 | 0.790 | 1.016 | 12.845 | 7.151 | 6.985 |
| (64, 128, 256, 256) | 50.226 | 25.462 | 35.736 | 328.957 | 179.615 | 175.618 |

Single core (icx):

| shape | fp32 forward (ms) | fp16 forward (ms) | mixed fp32 fp16 forward (ms) | fp32 backward (ms) | fp16 backward (ms) | mixed fp32 fp16 backward (ms) |
| -- | -- | -- | -- | -- | -- | -- |
| (1, 8, 16) | 0.012 | 0.011 | 0.011 | 0.040 | 0.041 | 0.041 |
| (8, 8, 16) | 0.012 | 0.012 | 0.012 | 0.042 | 0.042 | 0.042 |
| (32, 8, 16) | 0.027 | 0.014 | 0.014 | 0.048 | 0.048 | 0.046 |
| (64, 128, 56, 56) | 58.054 | 11.034 | 17.928 | 108.603 | 48.816 | 50.244 |
| (64, 128, 256, 256) | 1327.758 | 352.394 | 496.994 | 2846.182 | 1224.247 | 1218.422 |
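For reference, a minimal timing sketch of this kind of measurement (shapes, iteration counts, and thread settings here are assumptions, not the exact harness used above):
```
import torch
import torch.nn.functional as F
from torch.utils.benchmark import Timer

def bench_layer_norm(shape, dtype):
    # Normalize over the last dimension; weights in the same dtype as the input.
    x = torch.randn(shape, dtype=dtype, requires_grad=True)
    w = torch.randn(shape[-1], dtype=dtype, requires_grad=True)
    b = torch.randn(shape[-1], dtype=dtype, requires_grad=True)
    fwd = Timer("F.layer_norm(x, (x.shape[-1],), w, b)",
                globals={"F": F, "x": x, "w": w, "b": b}).timeit(100)
    out = F.layer_norm(x, (x.shape[-1],), w, b)
    g = torch.ones_like(out)
    bwd = Timer("out.backward(g, retain_graph=True)",
                globals={"out": out, "g": g}).timeit(100)
    print(f"{shape} {dtype}: fwd {fwd.median * 1e3:.3f} ms, bwd {bwd.median * 1e3:.3f} ms")

for dtype in (torch.float32, torch.float16):
    bench_layer_norm((32, 8, 16), dtype)
```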

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99590
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/cpuhrsch
2023-12-20 01:11:15 +00:00
Oguz Ulgen
c55210b4f0 [Inductor] Deduplicate grid wrapper statements for user defined triton kernels (#115849)
Noticed that on many MRS kernels the grid wrapper for autotuning is huge, with a bunch of duplicates, because num_warps and num_stages are not needed for the grid calculation. Let's deduplicate these entries.

Previously, we would see a wrapper like
```
    def grid_wrapper_for_add_kernel_2d_autotuned_0(meta):
        if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
        if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
        if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
        if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
```
now it looks like
```
    def grid_wrapper_for_add_kernel_2d_autotuned_0(meta):
        if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
        if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
```
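A rough sketch of the deduplication idea (hypothetical helper; the actual Inductor wrapper codegen differs): collect the generated condition/grid lines in an order-preserving set before emitting them.
```
def dedupe_grid_wrapper_lines(lines):
    # dict preserves insertion order, so the first occurrence of each
    # "if <meta condition>: return <grid>" line wins and duplicates are dropped.
    return list(dict.fromkeys(lines))

lines = [
    "if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)",
    "if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)",
    "if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)",
]
print(dedupe_grid_wrapper_lines(lines))  # only the two unique entries remain
```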

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115849
Approved by: https://github.com/jansel
2023-12-20 00:25:32 +00:00
aaitzhan
f88c9af98e [TEST] Skip scaled_dot_product_attention test on sm < 80 (#115760)
According to the [functionality](https://github.com/NVIDIA/cutlass/blob/main/media/docs/functionality.md) page, CUTLASS supports `bfloat16` (aka `bf16`) only on compute capability 80+ devices.
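A sketch of that kind of capability guard (test and helper names here are illustrative, not the actual test file):
```
import unittest
import torch

def sm80_or_newer():
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 0)

class TestSDPA(unittest.TestCase):
    @unittest.skipIf(not sm80_or_newer(), "CUTLASS bf16 kernels need compute capability >= 8.0")
    def test_scaled_dot_product_attention_bf16(self):
        q = k = v = torch.randn(2, 4, 8, 16, device="cuda", dtype=torch.bfloat16)
        out = torch.nn.functional.scaled_dot_product_attention(q, k, v)
        self.assertEqual(out.shape, q.shape)
```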

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115760
Approved by: https://github.com/drisspg
2023-12-19 22:00:33 +00:00
rzou
5ba87a31bc Unflake test_reference_numerics_large__refs_special_multigammaln_mvlgamma_p_1_cpu_bfloat16 (#116058)
Run the test under markDynamoStrict mode and record an expected failure
under the Dynamo CI shard.

Test Plan:
- wait for CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116058
Approved by: https://github.com/atalman
2023-12-19 16:42:29 +00:00
PyTorch MergeBot
c539f7df10 Revert "[Inductor] Deduplicate grid wrapper statements for user defined triton kernels (#115849)"
This reverts commit 21b8127f1c.

Reverted https://github.com/pytorch/pytorch/pull/115849 on behalf of https://github.com/jeanschmidt due to Breaking internal tests, please check internal diff for more details ([comment](https://github.com/pytorch/pytorch/pull/115849#issuecomment-1863012933))
2023-12-19 15:47:55 +00:00
PyTorch MergeBot
a7bfa04da6 Revert "More markDynamoStrictTest (#115870)"
This reverts commit 7f686c8fe1.

Reverted https://github.com/pytorch/pytorch/pull/115870 on behalf of https://github.com/jeanschmidt due to Breaking internal tests and builds, please check diff ([comment](https://github.com/pytorch/pytorch/pull/115870#issuecomment-1862997125))
2023-12-19 15:40:57 +00:00
PyTorch MergeBot
24af118e55 Revert "markDynamoStrictTest more tests (#115871)"
This reverts commit 478f0e96dc.

Reverted https://github.com/pytorch/pytorch/pull/115871 on behalf of https://github.com/jeanschmidt due to Breaking internal tests and builds, please check diff, this is required to revert #115870 ([comment](https://github.com/pytorch/pytorch/pull/115871#issuecomment-1862992931))
2023-12-19 15:36:27 +00:00
Jeff Daily
e3aefe2970 Revert "Initial Flash Attention support on ROCM (#114309)" (#115975)
This reverts commit 5bddbed399.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115975
Approved by: https://github.com/atalman, https://github.com/malfet
2023-12-16 03:40:14 +00:00
Jane Xu
056a882cb9 add markDynamoStrictTest to TestOptimRenewed, removing flakiness (#115947)
fixes #115406 fixes #115394 fixes #115393 fixes #115392 fixes #115391

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115947
Approved by: https://github.com/albanD, https://github.com/zou3519
2023-12-16 01:33:32 +00:00
PyTorch MergeBot
91b848bf81 Revert "markDynamoStrictTest on more tests (#115879)"
This reverts commit 8b650cdd3c.

Reverted https://github.com/pytorch/pytorch/pull/115879 on behalf of https://github.com/atalman due to OSSCI oncall, broke inductor ([comment](https://github.com/pytorch/pytorch/pull/115879#issuecomment-1858418921))
2023-12-15 20:00:09 +00:00
PyTorch MergeBot
c006c8b50e Revert "markDynamoStrictTest some more (#115885)"
This reverts commit 55ce4693ff.

Reverted https://github.com/pytorch/pytorch/pull/115885 on behalf of https://github.com/atalman due to OSSCI oncall, broke inductor ([comment](https://github.com/pytorch/pytorch/pull/115885#issuecomment-1858409669))
2023-12-15 19:51:24 +00:00
rzou
55ce4693ff markDynamoStrictTest some more (#115885)
Featuring
test_native_mha.py
test_nn.py
test_prims.py
test_schema_check.py
test_serialization.py
test_show_pickle.py
test_sort_and_select.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115885
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845, #115855, #115856, #115857, #115858, #115870, #115871, #115879
2023-12-15 13:19:52 +00:00
rzou
8b650cdd3c markDynamoStrictTest on more tests (#115879)
Featuring:
test_mobile_optimizer.py
test_module_init.py
test_modules.py
test_multiprocessing.py
test_multiprocessing_spawn.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115879
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845, #115855, #115856, #115857, #115858, #115870, #115871
2023-12-15 13:19:52 +00:00
Aidyn-A
cd47e335d1 [TEST] Skip test_schema_correctness for float8 dtype (#115757)
According to https://github.com/pytorch/pytorch/issues/107256#issuecomment-1705341870, the ops tested in `test_schema_correctness` are not supported with `torch.float8_e4m3fn` yet. Until they are supported, it is best to skip the test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115757
Approved by: https://github.com/drisspg
2023-12-15 06:26:46 +00:00
rzou
478f0e96dc markDynamoStrictTest more tests (#115871)
For:
test_dispatch.py
test_fake_tensor.py
test_indexing.py
test_linalg.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115871
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845, #115855, #115856, #115857, #115858, #115870
2023-12-15 05:26:54 +00:00
rzou
7f686c8fe1 More markDynamoStrictTest (#115870)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115870
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845, #115855, #115856, #115857, #115858
2023-12-15 05:26:54 +00:00
rzou
85262b0a9e markDynamoStrictTest some test_cpp_extensions.* (#115858)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115858
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845, #115855, #115856, #115857
2023-12-15 01:22:38 +00:00
rzou
4ccd8eb613 Add Dynamo test expected failure mechanism (#115845)
Tests that are added to a list in dynamo_test_failures.py will
automatically be marked as expectedFailure when run with
PYTORCH_TEST_WITH_DYNAMO=1. I'm splitting this PR off on its own so that
I can test various things on top of it.

Also added an unMarkDynamoStrictTest that is not useful until we turn
on strict mode by default.
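A simplified sketch of how such a mechanism can work (the real dynamo_test_failures.py list and hook points differ):
```
import os
import unittest

# Hypothetical stand-in for the list kept in dynamo_test_failures.py.
dynamo_expected_failures = {
    "TestFoo.test_bar",
}

def maybe_mark_expected_failure(test_cls):
    if os.environ.get("PYTORCH_TEST_WITH_DYNAMO") != "1":
        return test_cls
    for name in list(vars(test_cls)):
        if not name.startswith("test_"):
            continue
        if f"{test_cls.__name__}.{name}" in dynamo_expected_failures:
            setattr(test_cls, name, unittest.expectedFailure(getattr(test_cls, name)))
    return test_cls

@maybe_mark_expected_failure
class TestFoo(unittest.TestCase):
    def test_bar(self):
        self.assertTrue(False)  # becomes an expected failure under PYTORCH_TEST_WITH_DYNAMO=1
```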

Test Plan:
- code reading
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115845
Approved by: https://github.com/voznesenskym
2023-12-15 01:22:17 +00:00
Oguz Ulgen
21b8127f1c [Inductor] Deduplicate grid wrapper statements for user defined triton kernels (#115849)
Noticed that on many MRS kernels the grid wrapper for autotuning is huge, with a bunch of duplicates, because num_warps and num_stages are not needed for the grid calculation. Let's deduplicate these entries.

Previously, we would see a wrapper like
```
    def grid_wrapper_for_add_kernel_2d_autotuned_0(meta):
        if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
        if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
        if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
        if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
```
now it looks like
```
    def grid_wrapper_for_add_kernel_2d_autotuned_0(meta):
        if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
        if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115849
Approved by: https://github.com/jansel
2023-12-14 23:26:04 +00:00
Xinya Zhang
5bddbed399
Initial Flash Attention support on ROCM (#114309)
This pull request adds initial Flash Attention support for the AMD/ROCM platform. It adds a specialized Triton repository/branch as a compile-time dependency for the Flash Attention math library on AMD/ROCM. This Triton submodule is not used at runtime and will not be shipped in the final PyTorch package. We plan to release this specialized Triton as a separate project.

Known limitations:

- [ ] Only supports MI200 series GPUs (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`).
- [ ] Only supports power-of-two sequence lengths.
- [ ] No support for varlen APIs.
- [ ] Only supports head dimensions 16, 32, 64, and 128 (see the usage sketch after this list).
- [ ] Performance is still being optimized.
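A minimal call that stays within those limits, assuming a ROCm build on an MI200 GPU (power-of-two sequence length, head dimension 64):
```
import torch
import torch.nn.functional as F

# batch=2, heads=8, seq_len=128 (power of two), head_dim=64 (one of 16/32/64/128)
q = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)

# Force the flash backend so the new kernels are exercised.
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 8, 128, 64])
```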

Fixes https://github.com/pytorch/pytorch/issues/112997

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114309

Approved by: https://github.com/jeffdaily, https://github.com/malfet

---------

Co-authored-by: Joseph Groenenboom <joseph.groenenboom@amd.com>
2023-12-14 08:52:57 -08:00
Mikayla Gawarecki
ac60a70e06 Migrated loss functions to ModuleInfos (#115584)
Migrates most tests in `common_nn.py:criterion_tests` to ModuleInfos.

**I can split this up if it is too large to review**

What this PR does not include:
- [`no_batch_dim` tests](https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/common_nn.py#L3995-L4112)
- [tests that use the functional variant of the loss function and `wrap_functional`](https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/common_nn.py#L1079-L1128)

#### On test times
This PR increases test time by ~58s locally
Before this PR:
```
>>> python test/test_nn.py -k Loss
Ran 1003 tests in 28.977s
```
After this PR
```
>>> python test/test_nn.py -k Loss
Ran 368 tests in 23.073s
```

```
>>> python test/test_modules.py -k Loss
Ran 836 tests in 63.900s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115584
Approved by: https://github.com/janeyx99
ghstack dependencies: #115617
2023-12-14 16:21:05 +00:00
eqy
7e1542b938 [CUDA][FP8] Skip test_dtypes on FP8 _scaled_mm (#115661)
This test isn't actually parametrized by `dtype`, so it seems to surface bogus failures where "unsupported" types "work" while in reality fp8 is used every time.

CC @drisspg I'm guessing this doesn't surface in upstream CI because there are no SM9.0 runners yet?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115661
Approved by: https://github.com/drisspg
2023-12-14 05:12:33 +00:00
Fuzzkatt
ef01e78fd9 disable test_ddp_profiling_autograd_profiler in distributed_test.py (#115704)
The test was previously disabled upstream (https://github.com/pytorch/pytorch/issues/77342) and is currently failing in NVIDIA internal CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115704
Approved by: https://github.com/soulitzer
2023-12-14 01:41:37 +00:00
PyTorch MergeBot
626b7dc847 Revert "Migrated loss functions to ModuleInfos (#115584)"
This reverts commit f138b08d2e.

Reverted https://github.com/pytorch/pytorch/pull/115584 on behalf of https://github.com/atalman due to OSS CI oncall, breaks slow test ([comment](https://github.com/pytorch/pytorch/pull/115584#issuecomment-1854855080))
2023-12-13 23:34:30 +00:00
atalman
3807fc690f [OSSCI oncall] fix lint (#115737)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115737
Approved by: https://github.com/DanilBaibak
2023-12-13 14:15:26 +00:00
Chien-Chin Huang
cc28f61fa3 [DCP][BE] Move DCP._state_dict_utils out from DCP (#115523)
DCP._state_dict_utils is also used by FSDP. This can sometimes cause a circular import. Move it out of DCP to avoid the circular import.

Differential Revision: [D52022440](https://our.internmc.facebook.com/intern/diff/D52022440/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115523
Approved by: https://github.com/wz337
2023-12-13 08:59:48 +00:00
Lucas Pasqualin
ffb2a28a67 Fixes expected behavior when no_dist=True in state_dict_loader.load (#115660)
Fixes expected behavior when `no_dist=True` in `state_dict_loader.load`

Fixes #115591

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115660
Approved by: https://github.com/wz337, https://github.com/fegin
2023-12-12 22:21:16 +00:00
Mikayla Gawarecki
f138b08d2e Migrated loss functions to ModuleInfos (#115584)
Migrates most tests in `common_nn.py:criterion_tests` to ModuleInfos.

**I can split this up if it is too large to review**

What this PR does not include:
- [`no_batch_dim` tests](https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/common_nn.py#L3995-L4112)
- [tests that use the functional variant of the loss function and `wrap_functional`](https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/common_nn.py#L1079-L1128)

#### On test times
This PR increases test time by ~58s locally
Before this PR:
```
>>> python test/test_nn.py -k Loss
Ran 1003 tests in 28.977s
```
After this PR
```
>>> python test/test_nn.py -k Loss
Ran 368 tests in 23.073s
```

```
>>> python test/test_modules.py -k Loss
Ran 836 tests in 63.900s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115584
Approved by: https://github.com/janeyx99
ghstack dependencies: #115617
2023-12-12 22:20:20 +00:00
soulitzer
8885128dcc Fix backward for SDPA NT jagged layout (#115576)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115576
Approved by: https://github.com/jbschlosser, https://github.com/ani300
2023-12-12 18:35:40 +00:00
mingfeima
a8acd6c410 Add Half support for AvgPool2d on CPU (#109578)
Add Half support for AvgPool2d (both channels last and channels first) on CPU
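A small usage sketch exercising both memory formats on CPU (shapes are arbitrary):
```
import torch
import torch.nn as nn

pool = nn.AvgPool2d(kernel_size=3, stride=2)
x = torch.randn(8, 16, 56, 56, dtype=torch.half)          # channels first
y_cf = pool(x)
y_cl = pool(x.to(memory_format=torch.channels_last))      # channels last
# Compare against a float32 reference to sanity-check the half kernels.
ref = pool(x.float())
print(torch.allclose(y_cf.float(), ref, atol=1e-2), y_cl.shape)
```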

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109578
Approved by: https://github.com/mingfeima, https://github.com/albanD
2023-12-12 12:59:47 +00:00
David Berard
89ee3af076 [Reland][Dynamo] Don't log compilation metrics for PyTorch unit tests (#115571)
Reland #115452, which was reverted to simplify a merge conflict with #115386

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115571
Approved by: https://github.com/yanboliang
2023-12-12 01:15:54 +00:00
Isuru Fernando
505574c46a Add decomposition for torch.block_diag (#115096)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115096
Approved by: https://github.com/peterbell10
2023-12-11 20:04:22 +00:00
Catherine Lee
b5578cb08b [ez] Remove unittest retries (#115460)
Pytest is now used in CI for reruns, and I doubt people are using the env vars when running locally. IMO, removing this code makes the run function easier to read.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115460
Approved by: https://github.com/malfet, https://github.com/huydhn
2023-12-11 19:46:09 +00:00
David Berard
5c0976fa04 Revert "[dynamo] guarded config (#111299)" (#115386)
This reverts commit 5927e9cbf2.

Differential Revision: [D51959266](https://our.internmc.facebook.com/intern/diff/D51959266)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115386
Approved by: https://github.com/yanboliang, https://github.com/malfet
ghstack dependencies: #115384, #115401, #115385
2023-12-11 19:35:42 +00:00
PyTorch MergeBot
f06f51b152 Revert "[Dynamo] Don't log compilation metrics for PyTorch unit tests (#115452)"
This reverts commit cd444aa075.

Reverted https://github.com/pytorch/pytorch/pull/115452 on behalf of https://github.com/davidberard98 due to Merge conflict with #115385, which already landed in fbcode ([comment](https://github.com/pytorch/pytorch/pull/115452#issuecomment-1850729965))
2023-12-11 19:21:40 +00:00
Nikita Shulga
100c466bff [CI][Inductor] Skip CPU tests when running on GPU (#115430)
This just follows the standard practice for CI: when one specifies `PYTORCH_TESTING_DEVICE_ONLY_FOR=cuda`, only tests targeting that device should be run.

This is done by refactoring part of `instantiate_device_type_tests` into `get_desired_device_type_test_bases` and using it from test_torchinductor.py to skip CPU tests.
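A rough sketch of the filtering idea (names are illustrative; the actual helper lives in the device-type test framework):
```
import os

def desired_device_types(all_device_types=("cpu", "cuda")):
    # PYTORCH_TESTING_DEVICE_ONLY_FOR=cuda -> keep only CUDA test bases.
    only_for = os.environ.get("PYTORCH_TESTING_DEVICE_ONLY_FOR", "")
    if not only_for:
        return list(all_device_types)
    wanted = {d.strip() for d in only_for.split(",")}
    return [d for d in all_device_types if d in wanted]

# In a test file like test_torchinductor.py one could then guard CPU-only suites:
RUN_CPU = "cpu" in desired_device_types()
RUN_GPU = "cuda" in desired_device_types()
```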

Fixes https://github.com/pytorch/pytorch/issues/115423

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115430
Approved by: https://github.com/seemethere
2023-12-10 15:21:24 +00:00
Wang, Xiao
d7705f325d Patch --save-xml when TEST_IN_SUBPROCESS (#115463)
Patch `--save-xml` when `TEST_IN_SUBPROCESS`

When `--save-xml` is given as a unit test argument and the test is handled by a `TEST_IN_SUBPROCESS` handler (e.g., `run_test_with_subprocess` for `distributed/test_c10d_nccl`), the `--save-xml` args are first "consumed" by the argparser in `common_utils.py`. When a subsequent subprocess in this `if TEST_IN_SUBPROCESS:` section starts, there are no `--save-xml` args, leaving `args.save_xml` as `None`.

Since the argparser for the `--save-xml` option defaults to `_get_test_report_path()` when the arg is `None`, this is not a problem for GitHub CI runs. It can be an issue when people run those tests without `CI=1`: test reports won't be saved in that case even if they passed `--save-xml=xxx`.
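A sketch of the kind of fix this implies: forward the already-parsed `--save-xml` value back onto the child's command line (the helper below is illustrative, not the actual run_test plumbing):
```
import subprocess
import sys

def launch_in_subprocess(test_argv, save_xml):
    cmd = [sys.executable, *test_argv]
    # Re-append the option that argparse already consumed in the parent,
    # so the child process sees it too and still writes a report.
    if save_xml is not None:
        cmd.append(f"--save-xml={save_xml}")
    return subprocess.run(cmd, check=False).returncode
```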

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115463
Approved by: https://github.com/clee2000
2023-12-09 02:38:31 +00:00
Yanbo Liang
cd444aa075 [Dynamo] Don't log compilation metrics for PyTorch unit tests (#115452)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115452
Approved by: https://github.com/zou3519
2023-12-09 01:39:36 +00:00
Wongboo
68f74dd162 Add python and C++ support for LPPool3d (#114199)
Add Python and C++ support for LPPool3d. Fixes #114114
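A quick usage sketch of the new module (values are arbitrary):
```
import torch
import torch.nn as nn

# norm_type=2 computes an L2 pooling over 2x2x2 windows.
pool = nn.LPPool3d(norm_type=2, kernel_size=2, stride=2)
x = torch.randn(1, 3, 8, 8, 8)
print(pool(x).shape)  # torch.Size([1, 3, 4, 4, 4])
```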

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114199
Approved by: https://github.com/mikaylagawarecki
2023-12-08 18:18:44 +00:00
Iris Zhang (PyTorch)
23fa9621e4 [DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099) (#115193)
Summary:

Rename _device_mesh.py to device_mesh.py, update all callsites, add documentation.
We created stubs for the public class and methods in torch.distributed.device_mesh so that torch.distributed.device_mesh can be imported whether or not distributed is available.

Original diff reverted: D51629761
Original PR reverted: https://github.com/pytorch/pytorch/pull/115099
Prior to landing, CI signals all passed. Shipit added the "ci/trunk" label to the PR, DID NOT wait for it, and went ahead committing. More context can be found in the reverted PR above.

Test Plan: CI.

Differential Revision: D51861018

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115193
Approved by: https://github.com/fegin
2023-12-08 08:44:32 +00:00
Nikita Shulga
6c585de076 [CUDA] baddmm should fall back to addmm for batch=1 (#114992)
I.e. it feels reasonable to always call `at::cuda::gemm` rather than `at::cuda::bgemm` when num_batches == 1
After the change, benchmark results for torch built with CUDA-12, using the [following perf script](https://gist.github.com/malfet/6a17156d7f5663b8b12054a1beff3fe1) on A100, are as follows:
|      Shape     |  bmm_time |  mm_time  | slow down (%) |
| -------------- | --------- | --------- | ------------- |
|    1x1x4096    |   14.18   |   14.31   |     -0.89     |
|    1x1x8192    |   14.37   |   14.37   |     -0.05     |
|   1x1x16384    |   14.03   |   14.12   |     -0.68     |
|   1x1x32768    |   14.19   |   14.24   |     -0.35     |
|   1x1x65536    |   14.85   |   14.52   |     2.30      |
|   1x1x131072   |   14.03   |   14.07   |     -0.33     |
|  128x128x128   |   11.34   |   11.06   |     2.56      |
|  256x256x256   |   14.85   |   14.40   |     3.15      |
|  512x512x512   |   27.22   |   27.22   |     -0.01     |
| 1024x1024x1024 |  129.66   |  129.50   |     0.12      |
| 2048x2048x2048 |  972.18   |  973.24   |     -0.11     |
|  129x127x129   |   11.21   |   11.25   |     -0.39     |
|  257x255x257   |   14.50   |   14.43   |     0.44      |
|  513x511x513   |   29.01   |   29.01   |     0.01      |
| 1025x1023x1025 |  137.65   |  137.64   |     0.01      |
| 2049x2047x2049 |  982.58   |  982.65   |     -0.01     |
|  4097x3x4097   |   86.65   |   86.64   |     0.01      |
|  8193x3x8193   |  384.02   |  383.96   |     0.02      |
| 16385x3x16385  |  1106.73  |  1107.32  |     -0.05     |
| 32769x3x32769  |  4739.49  |  4739.48  |     0.00      |
| 65537x3x65537  | 17377.78  | 17378.74  |     -0.01     |
|  4097x5x4097   |   87.09   |   87.12   |     -0.03     |
|  8193x5x8193   |  301.38   |  301.36   |     0.01      |
| 16385x5x16385  |  1107.38  |  1108.04  |     -0.06     |
| 32769x5x32769  |  4743.73  |  4744.07  |     -0.01     |
| 65537x5x65537  | 17392.32  | 17395.42  |     -0.02     |
|  4097x7x4097   |   87.17   |   87.19   |     -0.02     |
|  8193x7x8193   |  301.94   |  302.00   |     -0.02     |
| 16385x7x16385  |  1107.17  |  1106.79  |     0.03      |
| 32769x7x32769  |  4747.15  |  4747.13  |     0.00      |
| 65537x7x65537  | 17403.85  | 17405.02  |     -0.01     |

Fixes perf problem reported in https://github.com/pytorch/pytorch/issues/114911
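A minimal Python-level version of that comparison (not the linked perf script; shapes and iteration counts are arbitrary):
```
import torch
from torch.utils.benchmark import Timer

def compare(n, k, m, device="cuda", dtype=torch.float16):
    a = torch.randn(1, n, k, device=device, dtype=dtype)
    b = torch.randn(1, k, m, device=device, dtype=dtype)
    t_bmm = Timer("torch.bmm(a, b)", globals={"torch": torch, "a": a, "b": b}).timeit(200)
    t_mm = Timer("torch.mm(a0, b0)", globals={"torch": torch, "a0": a[0], "b0": b[0]}).timeit(200)
    print(f"{n}x{k}x{m}: bmm {t_bmm.median * 1e6:.2f} us, mm {t_mm.median * 1e6:.2f} us")

if torch.cuda.is_available():
    for shape in [(1, 1, 4096), (4097, 3, 4097)]:
        compare(*shape)
```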
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114992
Approved by: https://github.com/Skylion007, https://github.com/eqy
2023-12-08 07:53:17 +00:00
Jane Xu
21cca2494d Move test_multi_tensor_optimizers to use OptimizerInfos (#114797)
This PR aims for parity+ compared to the old testing for the simplest foreach test case.

Test coverage increase: we now test foreach optimizers with CPU as well as on GPU.

Before:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (19136605)]$ python test/test_optim.py -v -k test_multi_tensor_optimizers
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_multi_tensor_optimizers (optim.test_optim.TestOptim) ... ok

----------------------------------------------------------------------
Ran 1 test in 7.253s

OK
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (19136605)]$
```

Now, we get granular test cases at the cost of overhead!
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (19136605)]$ python test/test_optim.py -v -k test_foreach
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_foreach_ASGD_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Adadelta_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Adagrad_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_AdamW_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Adam_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Adamax_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_NAdam_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_RAdam_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_RMSprop_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Rprop_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_SGD_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_ASGD_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Adadelta_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Adagrad_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_AdamW_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Adam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Adamax_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_NAdam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_RAdam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_RMSprop_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Rprop_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_SGD_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok

----------------------------------------------------------------------
Ran 22 tests in 30.954s

OK
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (19136605)]$
```

Why the increase in time?
Two reasons:
1. overhead. Any _CUDA_ *Info test (OpInfo, ModuleInfo, OptimizerInfo) will wrap itself with the `CudaNonDefaultStream` policy, and `CudaNonDefaultStream.__enter__` when called for the first time will go through all visible CUDA devices and synchronize each of them, thus forcing the CUDAContext to be init'd. Doing this for all 8 devices takes ~10-15s. Also, test parametrization costs a little overhead too, but not to the level init'ing CUDA context does.
2. We test more! Now, we have 72 configs (in the foreach optimizer world) whereas we only had 59 before.

Next steps for the future:
- consider adding more Tensor LR configs (like a Tensor LR without capturable in the single tensor case)
- this is likely the next PR or 2: migrate all uses of _test_derived_optimizers in test_optim to TestOptimRenewed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114797
Approved by: https://github.com/albanD
2023-12-07 19:37:56 +00:00
youkaichao
16373bbc1f fix error message in pytorch (#115349)
Fixes https://dev-discuss.pytorch.org/t/typo-in-error-message/1709 .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115349
Approved by: https://github.com/Skylion007
2023-12-07 19:27:29 +00:00
rzou
a1bfaf75dc markDynamoStrictTest: add nopython flag, set default to False (#115276)
Default should be False because in general, we're interested
in reliability and composability: we want to check that
running PyTorch with and without Dynamo has the same semantics (with
graph breaks allowed).
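A hedged usage sketch, assuming `markDynamoStrictTest` is importable from `torch.testing._internal.common_utils` and can be applied as a test-class decorator with the new optional `nopython` flag:
```
from torch.testing._internal.common_utils import TestCase, markDynamoStrictTest, run_tests

@markDynamoStrictTest  # nopython defaults to False: graph breaks are still allowed
class TestMyOp(TestCase):
    def test_add(self):
        import torch
        self.assertEqual(torch.add(1, 2), torch.tensor(3))

if __name__ == "__main__":
    run_tests()
```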

Test Plan:
Existing tests?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115276
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115267
2023-12-07 18:42:21 +00:00
Jerry Zhang
a93b9ee9d8 [quant][be] Add a test for per channel quant for groupwise conv (#115224)
Summary:
just making sure this works

Test Plan:
python test/test_quantization.py -k test_groupwise_per_channel_quant

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115224
Approved by: https://github.com/andrewor14
2023-12-07 04:46:20 +00:00
y-sq
233ce0d24b Support GPU annotations for auto-trace jobs similar on-demand support (#114638)
Summary: When using auto_trace, gpu_user_annotation is not shown in the results. Fixing this by including `GPU_USER_ANNOTATION` in `kCudaTypes`.

Differential Revision: D51597995

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114638
Approved by: https://github.com/aaronenyeshi
2023-12-06 09:38:13 +00:00
Jane Xu
d78fe039eb Introduce OptimizerInfos + add a test_errors (#114178)
Introduce OptimizerInfos + use them to refactor out the error testing.

Why OptimizerInfos?
- cleaner, easier way to test all configs of optimizers
- would plug in well with devicetype to auto-enable tests for devices like MPS, meta
- would allow for more granular testing. currently, lots of functionality is tested in `_test_basic_cases` and some of that should be broken down more.

What did I do for error testing?
- I moved out some error cases from `_test_basic_cases` into a new test_errors parametrized test.
- The new test has to live in TestOptimRenewed (bikeshedding welcome) because the parametrized tests need to take in device and dtype and hook correctly, and not all tests in TestOptim do that.
- TestOptimRenewed is also migrating to the toplevel test/test_optim.py now because importing TestOptimRenewed does not work (because of test instantiation, TestOptimRenewed gets replaced with TestOptimRenewedDevice for CPU, CUDA, and whatever other device).

Is there any change in test coverage?
- INCREASE: The error case where a single Parameter (vs a container of them) is passed in has now been expanded to all optims instead of only LBFGS
- DECREASE: Not much. The only thing is we no longer test two error cases for foreach=True AND foreach=False, which I think is redundant. (Highlighted in comments)

Possible but not urgent next step: test ALL possible error cases by going through all the constructors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114178
Approved by: https://github.com/albanD
2023-12-05 22:58:36 +00:00
Joel Schlosser
22704426c3 Expand dynamic dims support for traceable subclasses (#114311)
Continuation of #112185, following the design in this [doc](https://docs.google.com/document/d/1ipSxcTzEMMOAPvxP-YJlD5JBZZmIGgh8Q34ixtOUCRo).

Summary:
* Introduce `SubclassSymbolicPolicy` containing separate dynamic dim / constraint policies for the outer and inner tensors
    * Expand the automatic dynamic algorithm to recurse into inner tensors and produce one of these for a subclass instance
    * Maintain legacy behavior for subclasses by recursively calling `mark_dynamic()` on inner tensors *of the same dim as outer* when `mark_dynamic(outer, ...)` is called
    * Addresses this: 6a86cf00ad/torch/_dynamo/variables/builder.py (L1750)
* Add `outer_size` and `outer_stride` arguments to `__tensor_unflatten__()` so that you can find out what symbols were allocated for the outer size / stride (you are expected to return a tensor that compares equal to the outer symbols)
    * Signatures now:
    ```python
    # attrs is a list of inner tensor attributes on x; inner_tensor = getattr(x, attr)
    # ctx is anything useful for rebuilding the class we want to guard on
    attrs, ctx = x.__tensor_flatten__()
    ...
    # inner_tensors is a dict of {attr -> tensor}
    # ctx is taken unmodified from flattening and (eventually) guarded on
    # outer_size is the expected size of the output; possibly symbolic
    # outer_stride is the expected strides of the output; possibly symbolic
    y = MySubclass.__tensor_unflatten__(inner_tensors, ctx, outer_size, outer_stride)

    # at the __tensor_unflatten__() call-site in PT2, we assert y.shape == outer_size and y.stride() == outer_stride
    # the assert simplifies symbols when there are relationships between outer and inner symbols
    ```
    * Size info needed for `NestedTensor` at least, stride info needed for `DTensor` at least
    * Punting on `outer_storage_offset` because storage_offset handling is horribly broken in PT2 right now
* ~~Add new `__tensor_mark_dynamic__()` to allow overriding the behavior of mark_dynamic on a per-subclass basis~~ (booted to future work)
* ~~Add guards for tensor subclasses by calling `__tensor_flatten__()` in the guard to test equality on `ctx`~~
    * Now handled in #114469
* Next PR: add TENSOR_MATCH guards on inner tensors
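A bare-bones subclass sketch of the updated protocol (a toy wrapper, not NestedTensor or DTensor; `__torch_dispatch__` and autograd handling are omitted):
```
import torch

class TwoTensor(torch.Tensor):
    # Toy subclass holding two inner tensors of the same shape.
    @staticmethod
    def __new__(cls, a, b):
        return torch.Tensor._make_wrapper_subclass(cls, a.shape, dtype=a.dtype, device=a.device)

    def __init__(self, a, b):
        self.a, self.b = a, b

    def __tensor_flatten__(self):
        # attrs names the inner tensor attributes; ctx carries anything needed to rebuild/guard.
        return ["a", "b"], None

    @staticmethod
    def __tensor_unflatten__(inner_tensors, ctx, outer_size, outer_stride):
        # outer_size/outer_stride are the (possibly symbolic) expected output size/strides.
        out = TwoTensor(inner_tensors["a"], inner_tensors["b"])
        assert out.shape == outer_size
        return out
```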

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114311
Approved by: https://github.com/ezyang, https://github.com/drisspg, https://github.com/voznesenskym, https://github.com/bdhirsh
2023-12-05 21:09:25 +00:00
Nikita Shulga
a827ac71f2 Revert "[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099)"
This reverts commit eaa64339d6.
2023-12-05 08:59:36 -08:00
Iris Zhang (PyTorch)
eaa64339d6 [DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099)
Summary:
Rename _device_mesh.py to device_mesh.py, update all callsites, adds documentation.

Original diff reverted: D51629761
Original PR reverted: https://github.com/pytorch/pytorch/pull/114991
It was failing a public module binding test on macOS due to the change in import order for torch/distributed/fsdp/_common_utils.py. Since the original import would still work, we removed the changes in this file.

Test Plan: CI.

Differential Revision: D51825114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115099
Approved by: https://github.com/wanchaol, https://github.com/fegin
2023-12-05 05:44:52 +00:00