Commit Graph

4312 Commits

Author SHA1 Message Date
Sun, Jiayi
c173a9d9b3 add Half support for layer_norm on CPU (#99590)
### Testing
Single socket (icx, 32 cores):
| shape | fp32 forward (ms) | fp16 forward (ms) | mixed fp32 fp16 forward (ms) | fp32 backward (ms) | fp16 backward (ms) | mixed fp32 fp16 backward (ms) |
| -- | -- | -- | -- | -- | -- | -- |
| (1, 8, 16) | 0.012 | 0.011 | 0.011 | 0.051 | 0.051 | 0.050 |
| (8, 8, 16) | 0.013 | 0.013 | 0.013 | 0.054 | 0.053 | 0.051 |
| (32, 8, 16) | 0.015 | 0.014 | 0.014 | 0.059 | 0.054 | 0.052 |
| (64, 128, 56, 56) | 1.875 | 0.790 | 1.016 | 12.845 | 7.151 | 6.985 |
| (64, 128, 256, 256) | 50.226 | 25.462 | 35.736 | 328.957 | 179.615 | 175.618 |

Single core (icx):

| shape | fp32 forward (ms) | fp16 forward (ms) | mixed fp32 fp16 forward (ms) | fp32 backward (ms) | fp16 backward (ms) | mixed fp32 fp16 backward (ms) |
| -- | -- | -- | -- | -- | -- | -- |
| (1, 8, 16) | 0.012 | 0.011 | 0.011 | 0.040 | 0.041 | 0.041 |
| (8, 8, 16) | 0.012 | 0.012 | 0.012 | 0.042 | 0.042 | 0.042 |
| (32, 8, 16) | 0.027 | 0.014 | 0.014 | 0.048 | 0.048 | 0.046 |
| (64, 128, 56, 56) | 58.054 | 11.034 | 17.928 | 108.603 | 48.816 | 50.244 |
| (64, 128, 256, 256) | 1327.758 | 352.394 | 496.994 | 2846.182 | 1224.247 | 1218.422 |
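For reference, a minimal timing sketch of this kind of measurement (shapes, iteration counts, and thread settings here are assumptions, not the exact harness used above):
```
import torch
import torch.nn.functional as F
from torch.utils.benchmark import Timer

def bench_layer_norm(shape, dtype):
    # Normalize over the last dimension; weights in the same dtype as the input.
    x = torch.randn(shape, dtype=dtype, requires_grad=True)
    w = torch.randn(shape[-1], dtype=dtype, requires_grad=True)
    b = torch.randn(shape[-1], dtype=dtype, requires_grad=True)
    fwd = Timer("F.layer_norm(x, (x.shape[-1],), w, b)",
                globals={"F": F, "x": x, "w": w, "b": b}).timeit(100)
    out = F.layer_norm(x, (x.shape[-1],), w, b)
    g = torch.ones_like(out)
    bwd = Timer("out.backward(g, retain_graph=True)",
                globals={"out": out, "g": g}).timeit(100)
    print(f"{shape} {dtype}: fwd {fwd.median * 1e3:.3f} ms, bwd {bwd.median * 1e3:.3f} ms")

for dtype in (torch.float32, torch.float16):
    bench_layer_norm((32, 8, 16), dtype)
```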

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99590
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/cpuhrsch
2023-12-20 01:11:15 +00:00
Oguz Ulgen
c55210b4f0 [Inductor] Deduplicate grid wrapper statements for user defined triton kernels (#115849)
Noticed that on many MRS kernels the grid wrapper for autotuning is huge, with a bunch of duplicates, because num_warps and num_stages are not needed for the grid calculation. Let's deduplicate these entries.

Previously, we would see a wrapper like
```
    def grid_wrapper_for_add_kernel_2d_autotuned_0(meta):
        if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
        if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
        if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
        if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
```
now it looks like
```
    def grid_wrapper_for_add_kernel_2d_autotuned_0(meta):
        if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
        if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
```
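A rough sketch of the deduplication idea (hypothetical helper; the actual Inductor wrapper codegen differs): collect the generated condition/grid lines in an order-preserving set before emitting them.
```
def dedupe_grid_wrapper_lines(lines):
    # dict preserves insertion order, so the first occurrence of each
    # "if <meta condition>: return <grid>" line wins and duplicates are dropped.
    return list(dict.fromkeys(lines))

lines = [
    "if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)",
    "if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)",
    "if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)",
]
print(dedupe_grid_wrapper_lines(lines))  # only the two unique entries remain
```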

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115849
Approved by: https://github.com/jansel
2023-12-20 00:25:32 +00:00
aaitzhan
f88c9af98e [TEST] Skip scaled_dot_product_attention test on sm < 80 (#115760)
According to the [functionality](https://github.com/NVIDIA/cutlass/blob/main/media/docs/functionality.md) page, CUTLASS supports `bfloat16` (aka `bf16`) only on compute capability 80+ devices.
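A sketch of that kind of capability guard (test and helper names here are illustrative, not the actual test file):
```
import unittest
import torch

def sm80_or_newer():
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 0)

class TestSDPA(unittest.TestCase):
    @unittest.skipIf(not sm80_or_newer(), "CUTLASS bf16 kernels need compute capability >= 8.0")
    def test_scaled_dot_product_attention_bf16(self):
        q = k = v = torch.randn(2, 4, 8, 16, device="cuda", dtype=torch.bfloat16)
        out = torch.nn.functional.scaled_dot_product_attention(q, k, v)
        self.assertEqual(out.shape, q.shape)
```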

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115760
Approved by: https://github.com/drisspg
2023-12-19 22:00:33 +00:00
rzou
5ba87a31bc Unflake test_reference_numerics_large__refs_special_multigammaln_mvlgamma_p_1_cpu_bfloat16 (#116058)
Run the test under markDynamoStrict mode and record an expected failure
under the Dynamo CI shard.

Test Plan:
- wait for CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116058
Approved by: https://github.com/atalman
2023-12-19 16:42:29 +00:00
PyTorch MergeBot
c539f7df10 Revert "[Inductor] Deduplicate grid wrapper statements for user defined triton kernels (#115849)"
This reverts commit 21b8127f1c.

Reverted https://github.com/pytorch/pytorch/pull/115849 on behalf of https://github.com/jeanschmidt due to Breaking internal tests, please check internal diff for more details ([comment](https://github.com/pytorch/pytorch/pull/115849#issuecomment-1863012933))
2023-12-19 15:47:55 +00:00
PyTorch MergeBot
a7bfa04da6 Revert "More markDynamoStrictTest (#115870)"
This reverts commit 7f686c8fe1.

Reverted https://github.com/pytorch/pytorch/pull/115870 on behalf of https://github.com/jeanschmidt due to Breaking internal tests and builds, please check diff ([comment](https://github.com/pytorch/pytorch/pull/115870#issuecomment-1862997125))
2023-12-19 15:40:57 +00:00
PyTorch MergeBot
24af118e55 Revert "markDynamoStrictTest more tests (#115871)"
This reverts commit 478f0e96dc.

Reverted https://github.com/pytorch/pytorch/pull/115871 on behalf of https://github.com/jeanschmidt due to Breaking internal tests and builds, please check diff, this is required to revert #115870 ([comment](https://github.com/pytorch/pytorch/pull/115871#issuecomment-1862992931))
2023-12-19 15:36:27 +00:00
Jeff Daily
e3aefe2970 Revert "Initial Flash Attention support on ROCM (#114309)" (#115975)
This reverts commit 5bddbed399.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115975
Approved by: https://github.com/atalman, https://github.com/malfet
2023-12-16 03:40:14 +00:00
Jane Xu
056a882cb9 add markDynamoStrictTest to TestOptimRenewed, removing flakiness (#115947)
fixes #115406 fixes #115394 fixes #115393 fixes #115392 fixes #115391

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115947
Approved by: https://github.com/albanD, https://github.com/zou3519
2023-12-16 01:33:32 +00:00
PyTorch MergeBot
91b848bf81 Revert "markDynamoStrictTest on more tests (#115879)"
This reverts commit 8b650cdd3c.

Reverted https://github.com/pytorch/pytorch/pull/115879 on behalf of https://github.com/atalman due to OSSCI oncall, broke inductor ([comment](https://github.com/pytorch/pytorch/pull/115879#issuecomment-1858418921))
2023-12-15 20:00:09 +00:00
PyTorch MergeBot
c006c8b50e Revert "markDynamoStrictTest some more (#115885)"
This reverts commit 55ce4693ff.

Reverted https://github.com/pytorch/pytorch/pull/115885 on behalf of https://github.com/atalman due to OSSCI oncall, broke inductor ([comment](https://github.com/pytorch/pytorch/pull/115885#issuecomment-1858409669))
2023-12-15 19:51:24 +00:00
rzou
55ce4693ff markDynamoStrictTest some more (#115885)
Featuring
test_native_mha.py
test_nn.py
test_prims.py
test_schema_check.py
test_serialization.py
test_show_pickle.py
test_sort_and_select.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115885
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845, #115855, #115856, #115857, #115858, #115870, #115871, #115879
2023-12-15 13:19:52 +00:00
rzou
8b650cdd3c markDynamoStrictTest on more tests (#115879)
Featuring:
test_mobile_optimizer.py
test_module_init.py
test_modules.py
test_multiprocessing.py
test_multiprocessing_spawn.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115879
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845, #115855, #115856, #115857, #115858, #115870, #115871
2023-12-15 13:19:52 +00:00
Aidyn-A
cd47e335d1 [TEST] Skip test_schema_correctness for float8 dtype (#115757)
According to https://github.com/pytorch/pytorch/issues/107256#issuecomment-1705341870, the ops tested in `test_schema_correctness` are not supported with `torch.float8_e4m3fn` yet. Until they are supported, it is best to skip the test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115757
Approved by: https://github.com/drisspg
2023-12-15 06:26:46 +00:00
rzou
478f0e96dc markDynamoStrictTest more tests (#115871)
For:
test_dispatch.py
test_fake_tensor.py
test_indexing.py
test_linalg.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115871
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845, #115855, #115856, #115857, #115858, #115870
2023-12-15 05:26:54 +00:00
rzou
7f686c8fe1 More markDynamoStrictTest (#115870)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115870
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845, #115855, #115856, #115857, #115858
2023-12-15 05:26:54 +00:00
rzou
85262b0a9e markDynamoStrictTest some test_cpp_extensions.* (#115858)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115858
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845, #115855, #115856, #115857
2023-12-15 01:22:38 +00:00
rzou
4ccd8eb613 Add Dynamo test expected failure mechanism (#115845)
Tests that are added to a list in dynamo_test_failures.py will
automatically be marked as expectedFailure when run with
PYTORCH_TEST_WITH_DYNAMO=1. I'm splitting this PR off on its own so that
I can test various things on top of it.

Also added an unMarkDynamoStrictTest that is not useful until we turn
on strict mode by default.
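A simplified sketch of how such a mechanism can work (the real dynamo_test_failures.py list and hook points differ):
```
import os
import unittest

# Hypothetical stand-in for the list kept in dynamo_test_failures.py.
dynamo_expected_failures = {
    "TestFoo.test_bar",
}

def maybe_mark_expected_failure(test_cls):
    if os.environ.get("PYTORCH_TEST_WITH_DYNAMO") != "1":
        return test_cls
    for name in list(vars(test_cls)):
        if not name.startswith("test_"):
            continue
        if f"{test_cls.__name__}.{name}" in dynamo_expected_failures:
            setattr(test_cls, name, unittest.expectedFailure(getattr(test_cls, name)))
    return test_cls

@maybe_mark_expected_failure
class TestFoo(unittest.TestCase):
    def test_bar(self):
        self.assertTrue(False)  # becomes an expected failure under PYTORCH_TEST_WITH_DYNAMO=1
```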

Test Plan:
- code reading
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115845
Approved by: https://github.com/voznesenskym
2023-12-15 01:22:17 +00:00
Oguz Ulgen
21b8127f1c [Inductor] Deduplicate grid wrapper statements for user defined triton kernels (#115849)
Noticed that on many MRS kernels the grid wrapper for autotuning is huge, with a bunch of duplicates, because num_warps and num_stages are not needed for the grid calculation. Let's deduplicate these entries.

Previously, we would see a wrapper like
```
    def grid_wrapper_for_add_kernel_2d_autotuned_0(meta):
        if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
        if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
        if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
        if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
```
now it looks like
```
    def grid_wrapper_for_add_kernel_2d_autotuned_0(meta):
        if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
        if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115849
Approved by: https://github.com/jansel
2023-12-14 23:26:04 +00:00
Xinya Zhang
5bddbed399
Initial Flash Attention support on ROCM (#114309)
This pull request adds initial Flash Attention support for the AMD/ROCM platform. It adds a specialized Triton repository/branch as a compile-time dependency for the Flash Attention math library on AMD/ROCM. This Triton submodule is not used at runtime and will not be shipped in the final PyTorch package. We plan to release this specialized Triton as a separate project.

Known limitations:

- [ ] Only supports MI200 series GPUs (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`).
- [ ] Only supports power-of-two sequence lengths.
- [ ] No support for varlen APIs.
- [ ] Only supports head dimensions 16, 32, 64, and 128 (see the usage sketch after this list).
- [ ] Performance is still being optimized.
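A minimal call that stays within those limits, assuming a ROCm build on an MI200 GPU (power-of-two sequence length, head dimension 64):
```
import torch
import torch.nn.functional as F

# batch=2, heads=8, seq_len=128 (power of two), head_dim=64 (one of 16/32/64/128)
q = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)

# Force the flash backend so the new kernels are exercised.
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 8, 128, 64])
```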

Fixes https://github.com/pytorch/pytorch/issues/112997

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114309

Approved by: https://github.com/jeffdaily, https://github.com/malfet

---------

Co-authored-by: Joseph Groenenboom <joseph.groenenboom@amd.com>
2023-12-14 08:52:57 -08:00
Mikayla Gawarecki
ac60a70e06 Migrated loss functions to ModuleInfos (#115584)
Migrates most tests in `common_nn.py:criterion_tests` to ModuleInfos.

**I can split this up if it is too large to review**

What this PR does not include:
- [`no_batch_dim` tests](https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/common_nn.py#L3995-L4112)
- [tests that use the functional variant of the loss function and `wrap_functional`](https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/common_nn.py#L1079-L1128)

#### On test times
This PR increases test time by ~58s locally
Before this PR:
```
>>> python test/test_nn.py -k Loss
Ran 1003 tests in 28.977s
```
After this PR
```
>>> python test/test_nn.py -k Loss
Ran 368 tests in 23.073s
```

```
>>> python test/test_modules.py -k Loss
Ran 836 tests in 63.900s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115584
Approved by: https://github.com/janeyx99
ghstack dependencies: #115617
2023-12-14 16:21:05 +00:00
eqy
7e1542b938 [CUDA][FP8] Skip test_dtypes on FP8 _scaled_mm (#115661)
This test isn't actually parametrized by `dtype`, so it seems to surface bogus failures where "unsupported" types "work" while in reality fp8 is used every time.

CC @drisspg I'm guessing this doesn't surface in upstream CI because there are no SM9.0 runners yet?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115661
Approved by: https://github.com/drisspg
2023-12-14 05:12:33 +00:00
Fuzzkatt
ef01e78fd9 disable test_ddp_profiling_autograd_profiler in distributed_test.py (#115704)
The test was previously disabled upstream (https://github.com/pytorch/pytorch/issues/77342) and is currently failing in NVIDIA internal CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115704
Approved by: https://github.com/soulitzer
2023-12-14 01:41:37 +00:00
PyTorch MergeBot
626b7dc847 Revert "Migrated loss functions to ModuleInfos (#115584)"
This reverts commit f138b08d2e.

Reverted https://github.com/pytorch/pytorch/pull/115584 on behalf of https://github.com/atalman due to OSS CI oncall, breaks slow test ([comment](https://github.com/pytorch/pytorch/pull/115584#issuecomment-1854855080))
2023-12-13 23:34:30 +00:00
atalman
3807fc690f [OSSCI oncall] fix lint (#115737)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115737
Approved by: https://github.com/DanilBaibak
2023-12-13 14:15:26 +00:00
Chien-Chin Huang
cc28f61fa3 [DCP][BE] Move DCP._state_dict_utils out from DCP (#115523)
DCP._state_dict_utils is also used by FSDP. This can sometimes cause a circular import. Move it out of DCP to avoid the circular import.

Differential Revision: [D52022440](https://our.internmc.facebook.com/intern/diff/D52022440/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115523
Approved by: https://github.com/wz337
2023-12-13 08:59:48 +00:00
Lucas Pasqualin
ffb2a28a67 Fixes expected behavior when no_dist=True in state_dict_loader.load (#115660)
Fixes expected behavior when `no_dist=True` in `state_dict_loader.load`

Fixes #115591

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115660
Approved by: https://github.com/wz337, https://github.com/fegin
2023-12-12 22:21:16 +00:00
Mikayla Gawarecki
f138b08d2e Migrated loss functions to ModuleInfos (#115584)
Migrates most tests in `common_nn.py:criterion_tests` to ModuleInfos.

**I can split this up if it is too large to review**

What this PR does not include:
- [`no_batch_dim` tests](https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/common_nn.py#L3995-L4112)
- [tests that use the functional variant of the loss function and `wrap_functional`](https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/common_nn.py#L1079-L1128)

#### On test times
This PR increases test time by ~58s locally
Before this PR:
```
>>> python test/test_nn.py -k Loss
Ran 1003 tests in 28.977s
```
After this PR
```
>>> python test/test_nn.py -k Loss
Ran 368 tests in 23.073s
```

```
>>> python test/test_modules.py -k Loss
Ran 836 tests in 63.900s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115584
Approved by: https://github.com/janeyx99
ghstack dependencies: #115617
2023-12-12 22:20:20 +00:00
soulitzer
8885128dcc Fix backward for SDPA NT jagged layout (#115576)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115576
Approved by: https://github.com/jbschlosser, https://github.com/ani300
2023-12-12 18:35:40 +00:00
mingfeima
a8acd6c410 Add Half support for AvgPool2d on CPU (#109578)
Add Half support for AvgPool2d (both channels last and channels first) on CPU
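A small usage sketch exercising both memory formats on CPU (shapes are arbitrary):
```
import torch
import torch.nn as nn

pool = nn.AvgPool2d(kernel_size=3, stride=2)
x = torch.randn(8, 16, 56, 56, dtype=torch.half)          # channels first
y_cf = pool(x)
y_cl = pool(x.to(memory_format=torch.channels_last))      # channels last
# Compare against a float32 reference to sanity-check the half kernels.
ref = pool(x.float())
print(torch.allclose(y_cf.float(), ref, atol=1e-2), y_cl.shape)
```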

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109578
Approved by: https://github.com/mingfeima, https://github.com/albanD
2023-12-12 12:59:47 +00:00
David Berard
89ee3af076 [Reland][Dynamo] Don't log compilation metrics for PyTorch unit tests (#115571)
Reland #115452, which was reverted to simplify a merge conflict with #115386

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115571
Approved by: https://github.com/yanboliang
2023-12-12 01:15:54 +00:00
Isuru Fernando
505574c46a Add decomposition for torch.block_diag (#115096)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115096
Approved by: https://github.com/peterbell10
2023-12-11 20:04:22 +00:00
Catherine Lee
b5578cb08b [ez] Remove unittest retries (#115460)
Pytest is now used in CI for reruns, and I doubt people are using the env vars when running locally. IMO, removing this code makes the run function easier to read.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115460
Approved by: https://github.com/malfet, https://github.com/huydhn
2023-12-11 19:46:09 +00:00
David Berard
5c0976fa04 Revert "[dynamo] guarded config (#111299)" (#115386)
This reverts commit 5927e9cbf2.

Differential Revision: [D51959266](https://our.internmc.facebook.com/intern/diff/D51959266)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115386
Approved by: https://github.com/yanboliang, https://github.com/malfet
ghstack dependencies: #115384, #115401, #115385
2023-12-11 19:35:42 +00:00
PyTorch MergeBot
f06f51b152 Revert "[Dynamo] Don't log compilation metrics for PyTorch unit tests (#115452)"
This reverts commit cd444aa075.

Reverted https://github.com/pytorch/pytorch/pull/115452 on behalf of https://github.com/davidberard98 due to Merge conflict with #115385, which already landed in fbcode ([comment](https://github.com/pytorch/pytorch/pull/115452#issuecomment-1850729965))
2023-12-11 19:21:40 +00:00
Nikita Shulga
100c466bff [CI][Inductor] Skip CPU tests when running on GPU (#115430)
This just follows the standard practice for CI: when one specifies `PYTORCH_TESTING_DEVICE_ONLY_FOR=cuda`, only tests targeting that device should be run.

This is done by refactoring part of `instantiate_device_type_tests` into `get_desired_device_type_test_bases` and using it from test_torchinductor.py to skip CPU tests.
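A rough sketch of the filtering idea (names are illustrative; the actual helper lives in the device-type test framework):
```
import os

def desired_device_types(all_device_types=("cpu", "cuda")):
    # PYTORCH_TESTING_DEVICE_ONLY_FOR=cuda -> keep only CUDA test bases.
    only_for = os.environ.get("PYTORCH_TESTING_DEVICE_ONLY_FOR", "")
    if not only_for:
        return list(all_device_types)
    wanted = {d.strip() for d in only_for.split(",")}
    return [d for d in all_device_types if d in wanted]

# In a test file like test_torchinductor.py one could then guard CPU-only suites:
RUN_CPU = "cpu" in desired_device_types()
RUN_GPU = "cuda" in desired_device_types()
```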

Fixes https://github.com/pytorch/pytorch/issues/115423

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115430
Approved by: https://github.com/seemethere
2023-12-10 15:21:24 +00:00
Wang, Xiao
d7705f325d Patch --save-xml when TEST_IN_SUBPROCESS (#115463)
Patch `--save-xml` when `TEST_IN_SUBPROCESS`

When `--save-xml` is given as a unit test argument and the test is handled by a `TEST_IN_SUBPROCESS` handler (e.g., `run_test_with_subprocess` for `distributed/test_c10d_nccl`), the `--save-xml` args are first "consumed" by the argparser in `common_utils.py`. When a subsequent subprocess in this `if TEST_IN_SUBPROCESS:` section starts, there are no `--save-xml` args, leaving `args.save_xml` as `None`.

Since the argparser for the `--save-xml` option defaults to `_get_test_report_path()` when the arg is `None`, this is not a problem for GitHub CI runs. It can be an issue when people run those tests without `CI=1`: test reports won't be saved in that case even if they passed `--save-xml=xxx`.
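A sketch of the kind of fix this implies: forward the already-parsed `--save-xml` value back onto the child's command line (the helper below is illustrative, not the actual run_test plumbing):
```
import subprocess
import sys

def launch_in_subprocess(test_argv, save_xml):
    cmd = [sys.executable, *test_argv]
    # Re-append the option that argparse already consumed in the parent,
    # so the child process sees it too and still writes a report.
    if save_xml is not None:
        cmd.append(f"--save-xml={save_xml}")
    return subprocess.run(cmd, check=False).returncode
```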

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115463
Approved by: https://github.com/clee2000
2023-12-09 02:38:31 +00:00
Yanbo Liang
cd444aa075 [Dynamo] Don't log compilation metrics for PyTorch unit tests (#115452)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115452
Approved by: https://github.com/zou3519
2023-12-09 01:39:36 +00:00
Wongboo
68f74dd162 Add python and C++ support for LPPool3d (#114199)
Add Python and C++ support for LPPool3d. Fixes #114114
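A quick usage sketch of the new module (values are arbitrary):
```
import torch
import torch.nn as nn

# norm_type=2 computes an L2 pooling over 2x2x2 windows.
pool = nn.LPPool3d(norm_type=2, kernel_size=2, stride=2)
x = torch.randn(1, 3, 8, 8, 8)
print(pool(x).shape)  # torch.Size([1, 3, 4, 4, 4])
```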

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114199
Approved by: https://github.com/mikaylagawarecki
2023-12-08 18:18:44 +00:00
Iris Zhang (PyTorch)
23fa9621e4 [DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099) (#115193)
Summary:

Rename _device_mesh.py to device_mesh.py, update all callsites, add documentation.
We created stubs for the public class and methods in torch.distributed.device_mesh so that torch.distributed.device_mesh can be imported whether or not distributed is available.

Original diff reverted: D51629761
Original PR reverted: https://github.com/pytorch/pytorch/pull/115099
Prior to landing, CI signals all passed. Shipit added the "ci/trunk" label to the PR, DID NOT wait for it, and went ahead committing. More context can be found in the reverted PR above.

Test Plan: CI.

Differential Revision: D51861018

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115193
Approved by: https://github.com/fegin
2023-12-08 08:44:32 +00:00
Nikita Shulga
6c585de076 [CUDA] baddmm should fall back to addmm for batch=1 (#114992)
I.e. it feels reasonable to always call `at::cuda::gemm` rather than `at::cuda::bgemm` when num_batches == 1
After the change, benchmark results for torch built with CUDA-12, using the [following perf script](https://gist.github.com/malfet/6a17156d7f5663b8b12054a1beff3fe1) on A100, are as follows:
|      Shape     |  bmm_time |  mm_time  | slow down (%) |
| -------------- | --------- | --------- | ------------- |
|    1x1x4096    |   14.18   |   14.31   |     -0.89     |
|    1x1x8192    |   14.37   |   14.37   |     -0.05     |
|   1x1x16384    |   14.03   |   14.12   |     -0.68     |
|   1x1x32768    |   14.19   |   14.24   |     -0.35     |
|   1x1x65536    |   14.85   |   14.52   |     2.30      |
|   1x1x131072   |   14.03   |   14.07   |     -0.33     |
|  128x128x128   |   11.34   |   11.06   |     2.56      |
|  256x256x256   |   14.85   |   14.40   |     3.15      |
|  512x512x512   |   27.22   |   27.22   |     -0.01     |
| 1024x1024x1024 |  129.66   |  129.50   |     0.12      |
| 2048x2048x2048 |  972.18   |  973.24   |     -0.11     |
|  129x127x129   |   11.21   |   11.25   |     -0.39     |
|  257x255x257   |   14.50   |   14.43   |     0.44      |
|  513x511x513   |   29.01   |   29.01   |     0.01      |
| 1025x1023x1025 |  137.65   |  137.64   |     0.01      |
| 2049x2047x2049 |  982.58   |  982.65   |     -0.01     |
|  4097x3x4097   |   86.65   |   86.64   |     0.01      |
|  8193x3x8193   |  384.02   |  383.96   |     0.02      |
| 16385x3x16385  |  1106.73  |  1107.32  |     -0.05     |
| 32769x3x32769  |  4739.49  |  4739.48  |     0.00      |
| 65537x3x65537  | 17377.78  | 17378.74  |     -0.01     |
|  4097x5x4097   |   87.09   |   87.12   |     -0.03     |
|  8193x5x8193   |  301.38   |  301.36   |     0.01      |
| 16385x5x16385  |  1107.38  |  1108.04  |     -0.06     |
| 32769x5x32769  |  4743.73  |  4744.07  |     -0.01     |
| 65537x5x65537  | 17392.32  | 17395.42  |     -0.02     |
|  4097x7x4097   |   87.17   |   87.19   |     -0.02     |
|  8193x7x8193   |  301.94   |  302.00   |     -0.02     |
| 16385x7x16385  |  1107.17  |  1106.79  |     0.03      |
| 32769x7x32769  |  4747.15  |  4747.13  |     0.00      |
| 65537x7x65537  | 17403.85  | 17405.02  |     -0.01     |

Fixes perf problem reported in https://github.com/pytorch/pytorch/issues/114911
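A minimal Python-level version of that comparison (not the linked perf script; shapes and iteration counts are arbitrary):
```
import torch
from torch.utils.benchmark import Timer

def compare(n, k, m, device="cuda", dtype=torch.float16):
    a = torch.randn(1, n, k, device=device, dtype=dtype)
    b = torch.randn(1, k, m, device=device, dtype=dtype)
    t_bmm = Timer("torch.bmm(a, b)", globals={"torch": torch, "a": a, "b": b}).timeit(200)
    t_mm = Timer("torch.mm(a0, b0)", globals={"torch": torch, "a0": a[0], "b0": b[0]}).timeit(200)
    print(f"{n}x{k}x{m}: bmm {t_bmm.median * 1e6:.2f} us, mm {t_mm.median * 1e6:.2f} us")

if torch.cuda.is_available():
    for shape in [(1, 1, 4096), (4097, 3, 4097)]:
        compare(*shape)
```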
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114992
Approved by: https://github.com/Skylion007, https://github.com/eqy
2023-12-08 07:53:17 +00:00
Jane Xu
21cca2494d Move test_multi_tensor_optimizers to use OptimizerInfos (#114797)
This PR aims for parity+ compared to the old testing for the simplest foreach test case.

Test coverage increase: we now test foreach optimizers with CPU as well as on GPU.

Before:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (19136605)]$ python test/test_optim.py -v -k test_multi_tensor_optimizers
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_multi_tensor_optimizers (optim.test_optim.TestOptim) ... ok

----------------------------------------------------------------------
Ran 1 test in 7.253s

OK
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (19136605)]$
```

Now, we get granular test cases at the cost of overhead!
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (19136605)]$ python test/test_optim.py -v -k test_foreach
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_foreach_ASGD_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Adadelta_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Adagrad_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_AdamW_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Adam_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Adamax_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_NAdam_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_RAdam_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_RMSprop_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Rprop_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_SGD_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_ASGD_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Adadelta_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Adagrad_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_AdamW_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Adam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Adamax_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_NAdam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_RAdam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_RMSprop_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Rprop_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_SGD_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok

----------------------------------------------------------------------
Ran 22 tests in 30.954s

OK
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (19136605)]$
```

Why the increase in time?
Two reasons:
1. overhead. Any _CUDA_ *Info test (OpInfo, ModuleInfo, OptimizerInfo) will wrap itself with the `CudaNonDefaultStream` policy, and `CudaNonDefaultStream.__enter__` when called for the first time will go through all visible CUDA devices and synchronize each of them, thus forcing the CUDAContext to be init'd. Doing this for all 8 devices takes ~10-15s. Also, test parametrization costs a little overhead too, but not to the level init'ing CUDA context does.
2. We test more! Now, we have 72 configs (in the foreach optimizer world) whereas we only had 59 before.

Next steps for the future:
- consider adding more Tensor LR configs (like a Tensor LR without capturable in the single tensor case)
- this is likely the next PR or 2: migrate all uses of _test_derived_optimizers in test_optim to TestOptimRenewed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114797
Approved by: https://github.com/albanD
2023-12-07 19:37:56 +00:00
youkaichao
16373bbc1f fix error message in pytorch (#115349)
Fixes https://dev-discuss.pytorch.org/t/typo-in-error-message/1709 .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115349
Approved by: https://github.com/Skylion007
2023-12-07 19:27:29 +00:00
rzou
a1bfaf75dc markDynamoStrictTest: add nopython flag, set default to False (#115276)
Default should be False because in general, we're interested
in reliability and composability: we want to check that
running PyTorch with and without Dynamo has the same semantics (with
graph breaks allowed).
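A hedged usage sketch, assuming `markDynamoStrictTest` is importable from `torch.testing._internal.common_utils` and can be applied as a test-class decorator with the new optional `nopython` flag:
```
from torch.testing._internal.common_utils import TestCase, markDynamoStrictTest, run_tests

@markDynamoStrictTest  # nopython defaults to False: graph breaks are still allowed
class TestMyOp(TestCase):
    def test_add(self):
        import torch
        self.assertEqual(torch.add(1, 2), torch.tensor(3))

if __name__ == "__main__":
    run_tests()
```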

Test Plan:
Existing tests?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115276
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115267
2023-12-07 18:42:21 +00:00
Jerry Zhang
a93b9ee9d8 [quant][be] Add a test for per channel quant for groupwise conv (#115224)
Summary:
just making sure this works

Test Plan:
python test/test_quantization.py -k test_groupwise_per_channel_quant

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115224
Approved by: https://github.com/andrewor14
2023-12-07 04:46:20 +00:00
y-sq
233ce0d24b Support GPU annotations for auto-trace jobs similar on-demand support (#114638)
Summary: When using auto_trace, gpu_user_annotation is not shown in the results. Fixing this by including `GPU_USER_ANNOTATION` in `kCudaTypes`.

Differential Revision: D51597995

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114638
Approved by: https://github.com/aaronenyeshi
2023-12-06 09:38:13 +00:00
Jane Xu
d78fe039eb Introduce OptimizerInfos + add a test_errors (#114178)
Introduce OptimizerInfos + use them to refactor out the error testing.

Why OptimizerInfos?
- cleaner, easier way to test all configs of optimizers
- would plug in well with devicetype to auto-enable tests for devices like MPS, meta
- would allow for more granular testing. currently, lots of functionality is tested in `_test_basic_cases` and some of that should be broken down more.

What did I do for error testing?
- I moved out some error cases from `_test_basic_cases` into a new test_errors parametrized test.
- The new test has to live in TestOptimRenewed (bikeshedding welcome) because the parametrized tests need to take in device and dtype and hook correctly, and not all tests in TestOptim do that.
- TestOptimRenewed is also migrating to the toplevel test/test_optim.py now because importing TestOptimRenewed does not work (because of test instantiation, TestOptimRenewed gets replaced with TestOptimRenewedDevice for CPU, CUDA, and whatever other device).

Is there any change in test coverage?
- INCREASE: The error case where a single Parameter (vs a container of them) is passed in has now been expanded to all optims instead of only LBFGS
- DECREASE: Not much. The only thing is we no longer test two error cases for foreach=True AND foreach=False, which I think is redundant. (Highlighted in comments)

Possible but not urgent next step: test ALL possible error cases by going through all the constructors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114178
Approved by: https://github.com/albanD
2023-12-05 22:58:36 +00:00
Joel Schlosser
22704426c3 Expand dynamic dims support for traceable subclasses (#114311)
Continuation of #112185, following the design in this [doc](https://docs.google.com/document/d/1ipSxcTzEMMOAPvxP-YJlD5JBZZmIGgh8Q34ixtOUCRo).

Summary:
* Introduce `SubclassSymbolicPolicy` containing separate dynamic dim / constraint policies for the outer and inner tensors
    * Expand the automatic dynamic algorithm to recurse into inner tensors and produce one of these for a subclass instance
    * Maintain legacy behavior for subclasses by recursively calling `mark_dynamic()` on inner tensors *of the same dim as outer* when `mark_dynamic(outer, ...)` is called
    * Addresses this: 6a86cf00ad/torch/_dynamo/variables/builder.py (L1750)
* Add `outer_size` and `outer_stride` arguments to `__tensor_unflatten__()` so that you can find out what symbols were allocated for the outer size / stride (you are expected to return a tensor that compares equal to the outer symbols)
    * Signatures now:
    ```python
    # attrs is a list of inner tensor attributes on x; inner_tensor = getattr(x, attr)
    # ctx is anything useful for rebuilding the class we want to guard on
    attrs, ctx = x.__tensor_flatten__()
    ...
    # inner_tensors is a dict of {attr -> tensor}
    # ctx is taken unmodified from flattening and (eventually) guarded on
    # outer_size is the expected size of the output; possibly symbolic
    # outer_stride is the expected strides of the output; possibly symbolic
    y = MySubclass.__tensor_unflatten__(inner_tensors, ctx, outer_size, outer_stride)

    # at the __tensor_unflatten__() call-site in PT2, we assert y.shape == outer_size and y.stride() == outer_stride
    # the assert simplifies symbols when there are relationships between outer and inner symbols
    ```
    * Size info needed for `NestedTensor` at least, stride info needed for `DTensor` at least
    * Punting on `outer_storage_offset` because storage_offset handling is horribly broken in PT2 right now
* ~~Add new `__tensor_mark_dynamic__()` to allow overriding the behavior of mark_dynamic on a per-subclass basis~~ (booted to future work)
* ~~Add guards for tensor subclasses by calling `__tensor_flatten__()` in the guard to test equality on `ctx`~~
    * Now handled in #114469
* Next PR: add TENSOR_MATCH guards on inner tensors
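A bare-bones subclass sketch of the updated protocol (a toy wrapper, not NestedTensor or DTensor; `__torch_dispatch__` and autograd handling are omitted):
```
import torch

class TwoTensor(torch.Tensor):
    # Toy subclass holding two inner tensors of the same shape.
    @staticmethod
    def __new__(cls, a, b):
        return torch.Tensor._make_wrapper_subclass(cls, a.shape, dtype=a.dtype, device=a.device)

    def __init__(self, a, b):
        self.a, self.b = a, b

    def __tensor_flatten__(self):
        # attrs names the inner tensor attributes; ctx carries anything needed to rebuild/guard.
        return ["a", "b"], None

    @staticmethod
    def __tensor_unflatten__(inner_tensors, ctx, outer_size, outer_stride):
        # outer_size/outer_stride are the (possibly symbolic) expected output size/strides.
        out = TwoTensor(inner_tensors["a"], inner_tensors["b"])
        assert out.shape == outer_size
        return out
```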

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114311
Approved by: https://github.com/ezyang, https://github.com/drisspg, https://github.com/voznesenskym, https://github.com/bdhirsh
2023-12-05 21:09:25 +00:00
Nikita Shulga
a827ac71f2 Revert "[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099)"
This reverts commit eaa64339d6.
2023-12-05 08:59:36 -08:00
Iris Zhang (PyTorch)
eaa64339d6 [DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099)
Summary:
Rename _device_mesh.py to device_mesh.py, update all callsites, adds documentation.

Original diff reverted: D51629761
Original PR reverted: https://github.com/pytorch/pytorch/pull/114991
It was failing a public module binding test on macOS due to the change in import order for torch/distributed/fsdp/_common_utils.py. Since the original import would still work, we removed the changes in this file.

Test Plan: CI.

Differential Revision: D51825114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115099
Approved by: https://github.com/wanchaol, https://github.com/fegin
2023-12-05 05:44:52 +00:00