pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 00:21:07 +01:00

Author	SHA1	Message	Date
PyTorch MergeBot	24af118e55	Revert "markDynamoStrictTest more tests (#115871 )" This reverts commit `478f0e96dc`. Reverted https://github.com/pytorch/pytorch/pull/115871 on behalf of https://github.com/jeanschmidt due to Breaking internal tests and builds, please check diff, this is required to revert #115870 ([comment](https://github.com/pytorch/pytorch/pull/115871#issuecomment-1862992931))	2023-12-19 15:36:27 +00:00
rzou	478f0e96dc	markDynamoStrictTest more tests (#115871 ) For: test_dispatch.py test_fake_tensor.py test_indexing.py test_linalg.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/115871 Approved by: https://github.com/voznesenskym ghstack dependencies: #115845, #115855, #115856, #115857, #115858, #115870	2023-12-15 05:26:54 +00:00
voznesenskym	081c5b3adc	Add Stateful/Stateless symbolic contexts, use fresh fake mode for dynamo backends (#113926 ) (#114526 ) Summary: The primary problem we are setting out to solve here is fake tensor freshness. Before this PR, fake tensors after dynamo represented fake tensors at the end of trace, so subsequent retraces like aot_autograd would start off with fake tensors in the wrong (end result) state, rather than their expected fresh state. The solution here is to start a fresh fake mode, and re-fakify the tensors. The nuance comes from ensuring that symbols are uniformly created for the symbolic sizes and strides of the tensor. This PR is the result of a lot of back and forth with ezyang and eellison. Initially, the first pass at this was not super different from what we have in the PR - the broad strokes were the same: 1) We cache source->symbol in shape_env 2) We pass policy objects around, stored at dynamo fakificaiton time, and reused for later fakification 3) We create a new fake mode for backends (from https://github.com/pytorch/pytorch/pull/113605/files) This is ugly, and has some layering violations. We detoured our decision making through a few other alternatives. Immutable/mutable fake tensor mode was the most interesting alternative, https://github.com/pytorch/pytorch/pull/113653, and was struck down on concerns of complexity in fake mode combined with it not covering all edge cases. We also detoured on what to do about tensor memoization returning back potentially different tensors than requested, and if that was an anti pattern (it is) we want to hack in with the symbol cache (we don't). We went back to the drawing board here, but with a few concessions: 1) the cache for source->symbol must live outside of shape_env, for both lifecycle, and layering reasons 2) A good amount of work needs to be done to pipe policy around fake_mode and meta_utils correctly, to cover all the cases (ezyang did this) cc penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 aakhundov kadeng imported-using-ghimport Test Plan: Imported from OSS Reviewed By: huydhn, Chillee Differential Revision: D51566250 Pulled By: voznesenskym Pull Request resolved: https://github.com/pytorch/pytorch/pull/114526 Approved by: https://github.com/Chillee, https://github.com/huydhn	2023-11-26 23:40:32 +00:00
PyTorch MergeBot	2f3beb715c	Revert "Add Stateful/Stateless symbolic contexts, use fresh fake mode for dynamo backends (#113926 )" This reverts commit `2ca1119d53`. Reverted https://github.com/pytorch/pytorch/pull/113926 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/113926#issuecomment-1822713852))	2023-11-22 12:52:33 +00:00
voznesenskym	2ca1119d53	Add Stateful/Stateless symbolic contexts, use fresh fake mode for dynamo backends (#113926 ) The primary problem we are setting out to solve here is fake tensor freshness. Before this PR, fake tensors after dynamo represented fake tensors at the end of trace, so subsequent retraces like aot_autograd would start off with fake tensors in the wrong (end result) state, rather than their expected fresh state. The solution here is to start a fresh fake mode, and re-fakify the tensors. The nuance comes from ensuring that symbols are uniformly created for the symbolic sizes and strides of the tensor. This PR is the result of a lot of back and forth with @ezyang and @eellison. Initially, the first pass at this was not super different from what we have in the PR - the broad strokes were the same: 1) We cache source->symbol in shape_env 2) We pass policy objects around, stored at dynamo fakificaiton time, and reused for later fakification 3) We create a new fake mode for backends (from https://github.com/pytorch/pytorch/pull/113605/files) This is ugly, and has some layering violations. We detoured our decision making through a few other alternatives. Immutable/mutable fake tensor mode was the most interesting alternative, https://github.com/pytorch/pytorch/pull/113653, and was struck down on concerns of complexity in fake mode combined with it not covering all edge cases. We also detoured on what to do about tensor memoization returning back potentially different tensors than requested, and if that was an anti pattern (it is) we want to hack in with the symbol cache (we don't). We went back to the drawing board here, but with a few concessions: 1) the cache for source->symbol must live outside of shape_env, for both lifecycle, and layering reasons 2) A good amount of work needs to be done to pipe policy around fake_mode and meta_utils correctly, to cover all the cases (@ezyang did this) Pull Request resolved: https://github.com/pytorch/pytorch/pull/113926 Approved by: https://github.com/ezyang, https://github.com/eellison	2023-11-20 23:06:37 +00:00
Edward Z. Yang	e2b114ab9f	[BE] Package dynamic_dims/constraint_dims into CreateSymbolicPolicy (#113802 ) This will make it more convenient to propagate more information through all of these functions in the future (e.g., for storage offset information.) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/113802 Approved by: https://github.com/davidberard98, https://github.com/voznesenskym	2023-11-17 18:22:46 +00:00
eellison	f8eb46d623	index put device error checking (#113729 ) Fix for https://github.com/pytorch/pytorch/issues/101371 Pull Request resolved: https://github.com/pytorch/pytorch/pull/113729 Approved by: https://github.com/bdhirsh	2023-11-16 00:39:04 +00:00
voznesenskym	b8b3c26d3d	If we re-fakeify a FakeTensor with the same ShapeEnv, preserve symbols (#113651 ) Subsumes half of https://github.com/pytorch/pytorch/pull/113605 We support fakeifying an already fake tensor, which will give you a new fake tensor mirroring the same structure as the original fake tensor, which is what is needed by https://github.com/pytorch/pytorch/issues/113643 . However, when this refakeification happens, we will naively reallocate all new sizes for all of the fake tensor. This is the right thing to do if you are re-fakeifying on a fresh ShapeEnv (because you're reparametrizing the sizes or something), but if you have two fake tensor modes which are sharing a shape environment, you would actually rather just reuse the original sizes/strides/offset from the original fake tensor. This ends up being pretty simple. I recommend viewing with whitespace diff turned off. There's some fuzz around jagged tensor handling; that code is probably not quite right, but I fixed it for this particular case in the most straightforward way. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/113651 Approved by: https://github.com/albanD, https://github.com/eellison, https://github.com/bdhirsh	2023-11-15 00:36:04 +00:00
soulitzer	3b58755c1c	Fix FakeTensor tolist when size is not symbolic (#112206 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/112206 Approved by: https://github.com/ezyang ghstack dependencies: #112205	2023-10-30 19:25:10 +00:00
Peter Bell	bbd5b935e4	Use `pytree.tree_leaves` everywhere (#112324 ) This changes all the instances I could find of `tree_flatten(...)[0]` or `x, _ = tree_flatten` to use `tree_leaves`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/112324 Approved by: https://github.com/lezcano ghstack dependencies: #112327, #112323	2023-10-30 03:39:04 +00:00
Matthew Hoffman	68b0db1274	Define the public API for torch.distributed.fsdp (#109922 ) Related: https://github.com/pytorch/pytorch/wiki/Public-API-definition-and-documentation Related: https://github.com/microsoft/pylance-release/issues/2953 This fixes pylance issues for these classes: ``` "FullyShardedDataParallel" is not exported from module "torch.distributed.fsdp" ``` These classes all have public docs: * [`BackwardPrefetch`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.BackwardPrefetch) * [`CPUOffload`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.CPUOffload) * [`FullyShardedDataParallel`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.FullyShardedDataParallel) * [`MixedPrecision`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.MixedPrecision) * [`ShardingStrategy`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.ShardingStrategy) And it seems like all the newly added classes will have docs once they are released. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109922 Approved by: https://github.com/wanchaol	2023-09-28 02:15:58 +00:00
eellison	ad53b53518	Generate patterns in fp16 and fp32 (#109142 ) aten.softmax will generate a different decomposition for fp16/bf16 and fp32 because when invoked in lower precision it will upcast the inputs to fp32 and then downcast after. This has been causing us to miss bf16 patterns. For example, Camembert improves 20% with this PR (as do I'm sure many other models). Pull Request resolved: https://github.com/pytorch/pytorch/pull/109142 Approved by: https://github.com/yanboliang ghstack dependencies: #109663, #108894, #108917	2023-09-20 06:38:02 +00:00
PyTorch MergeBot	c2f5d4d8f0	Revert "Generate patterns in fp16 and fp32 (#109142 )" This reverts commit `14994cc978`. Reverted https://github.com/pytorch/pytorch/pull/109142 on behalf of https://github.com/eellison due to MESSAGE ([comment](https://github.com/pytorch/pytorch/pull/109142#issuecomment-1726641232))	2023-09-19 22:52:05 +00:00
eellison	14994cc978	Generate patterns in fp16 and fp32 (#109142 ) aten.softmax will generate a different decomposition for fp16/bf16 and fp32 because when invoked in lower precision it will upcast the inputs to fp32 and then downcast after. This has been causing us to miss bf16 patterns. For example, Camembert improves 20% with this PR (as do I'm sure many other models). Pull Request resolved: https://github.com/pytorch/pytorch/pull/109142 Approved by: https://github.com/yanboliang ghstack dependencies: #108894, #108917	2023-09-19 20:59:42 +00:00
drisspg	ad90ab31f2	Flash Attention v2 (#105602 ) # Summary ## PR Dependencies I don't use ghstack :( this is a PR where it would have been helpful. That beings said I am going to peel off some PRs to make reviewing this easier: - [x] Separate build flags for Flash and MemEff: #107985 ### Description This pull request updates the version of _scaled_dot_product_flash_attention from version 1 to version 2. The changes are based on the flash attention code originally authored by @tridao ### Changes Made The majority of the changes in this pull request involve: - Copying over the flash_attention sources. - Updating header files. - Removing padding and slicing code from within the flash_attention kernel and relocating it to the composite implicit region of the SDPA. This was need to make the kernel functional and appease autograd. - Introducing a simple kernel generator to generate different instantiations of the forward and backward flash templates. - Adding conditional compilation (ifdef) to prevent building when nvcc is invoked with gencode < sm80. - Introducing a separate dependent option for mem_eff_attention, as flash_attention v2 lacks support for Windows and cannot be built for sm50 generation codes. - Modifying build.sh to reduce parallelization on sm86 runners and to lower the maximum parallelization on the manywheel builds. This adjustment was made to address out-of-memory issues during the compilation of FlashAttentionV2 sources. - Adding/Updating tests. ### Notes for Reviewers This is not a fun review, and I apologize in advance. Most of the files-changed are in the flash_attn/ folder. The only files of interest here IMO: - aten/src/ATen/native/transformers/cuda/flash_attn/flash_api.cpp - aten/src/ATen/native/transformers/cuda/flash_attn/kernels/generate_kernels.py ( this has been incorporated upstream to flash-attention github) There are a number of files all related to avoiding OOMs in CI/CD. These are typically shell scripts. ### Follow up items - Include the updates from `e07aa036db` and `9e5e8bc91e` \| https://github.com/pytorch/pytorch/issues/108108 ### Work Items - [x] I don't think Windows will be supported for 3.1.0 - Need to update cmakee - [x] Let multi_query/attention pass through and test \| UPDATE: I have the fast path implemented here: https://github.com/pytorch/pytorch/pull/106730 but since this will require changes to semantics of math to call repeat_interleave, I think this should be done as a followup. - [x] Had to drop cutlass back to 3.0.0 to get it to compile. Need to figure out how to upgrade to 3.1.0 and later. Spoke with Tri and he is going to be taking a look. Note: compiling with clang currently errors for the cute headers. - [x] Update test exercise above codepath - [x] Still need to disable on seq_len % 128 != 0 for backward( Tri beat me to it `a4f148b6ab`) - [x] Add determinism warning to BWD, Tri got to this one as well: 1c41d2b - [x] Update dispatcher to universally prefer FlashV2 - [x] Update tests to exercise new head_dims - [x] Move the head_dim padding from kernel to top level composite implicit function in order to make it purely functional - [x] Create template generator script - [x] Initial cmake support for building kernels/ folder - [x] Replay CudaGraph changes ### Results #### Forward only The TFlops are reported here are on a100 that is underclocked. ![flashv2_tflops_vs_seq_len](https://github.com/pytorch/pytorch/assets/32754868/152de46d-8fa6-42f0-9a9c-ef1eb7ae29e7) #### Forward+Backward Ran a sweep and for large compute bound sizes we do see a ~2x performance increase for forw+back. <img width="1684" alt="Screenshot 2023-07-20 at 3 47 47 PM" src="https://github.com/pytorch/pytorch/assets/32754868/fdd26e07-0077-4878-a417-f3a418b6fb3b"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/105602 Approved by: https://github.com/huydhn, https://github.com/cpuhrsch	2023-09-13 13:59:05 +00:00
Edward Z. Yang	55f956f1d2	optests improvements based on torchvision usage on nms (#108929 ) - Update cross-ref FakeMode test to use ShapeEnv. Dynamic ops can now return an unbacked SymInt. We always accept this as equal to whatever the real value was. - Relax test so it works on all classes, not just unittest.TestCase - Properly wrap the original method, so things like pytree.mark.parametrize are carried over - Support dynamic shapes by default for make_fx `tracing_mode="fake"` without symbolifying everything else Fixes https://github.com/pytorch/pytorch/issues/108927 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/108929 Approved by: https://github.com/zou3519	2023-09-13 13:26:15 +00:00
Huy Do	a9c663c269	Revert "Flash Attention v2 (#105602 )" (#108827 ) This reverts commit `add45aea1c`. There are some conflicts on some benchmark csv file https://github.com/pytorch/pytorch/pull/105602#issuecomment-1710988951 so I need to revert this manually. The diff has been reverted internally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/108827 Approved by: https://github.com/kit1980	2023-09-08 07:43:04 +00:00
PyTorch MergeBot	e45b290127	Revert "Revert "Flash Attention v2 (#105602 )" (#108827 )" This reverts commit `24e9bbe22a`. Reverted https://github.com/pytorch/pytorch/pull/108827 on behalf of https://github.com/huydhn due to I need to land this revert properly as there are new failures showing up on trunk ([comment](https://github.com/pytorch/pytorch/pull/108827#issuecomment-1711020924))	2023-09-08 03:25:45 +00:00
Huy Do	24e9bbe22a	Revert "Flash Attention v2 (#105602 )" (#108827 ) This reverts commit `add45aea1c`. There are some conflicts on some benchmark csv file https://github.com/pytorch/pytorch/pull/105602#issuecomment-1710988951 so I need to revert this manually. The diff has been reverted internally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/108827 Approved by: https://github.com/kit1980	2023-09-08 02:54:20 +00:00
drisspg	add45aea1c	Flash Attention v2 (#105602 ) # Summary ## PR Dependencies I don't use ghstack :( this is a PR where it would have been helpful. That beings said I am going to peel off some PRs to make reviewing this easier: - [x] Separate build flags for Flash and MemEff: #107985 ### Description This pull request updates the version of _scaled_dot_product_flash_attention from version 1 to version 2. The changes are based on the flash attention code originally authored by @tridao ### Changes Made The majority of the changes in this pull request involve: - Copying over the flash_attention sources. - Updating header files. - Removing padding and slicing code from within the flash_attention kernel and relocating it to the composite implicit region of the SDPA. This was need to make the kernel functional and appease autograd. - Introducing a simple kernel generator to generate different instantiations of the forward and backward flash templates. - Adding conditional compilation (ifdef) to prevent building when nvcc is invoked with gencode < sm80. - Introducing a separate dependent option for mem_eff_attention, as flash_attention v2 lacks support for Windows and cannot be built for sm50 generation codes. - Modifying build.sh to reduce parallelization on sm86 runners and to lower the maximum parallelization on the manywheel builds. This adjustment was made to address out-of-memory issues during the compilation of FlashAttentionV2 sources. - Adding/Updating tests. ### Notes for Reviewers This is not a fun review, and I apologize in advance. Most of the files-changed are in the flash_attn/ folder. The only files of interest here IMO: - aten/src/ATen/native/transformers/cuda/flash_attn/flash_api.cpp - aten/src/ATen/native/transformers/cuda/flash_attn/kernels/generate_kernels.py ( this has been incorporated upstream to flash-attention github) There are a number of files all related to avoiding OOMs in CI/CD. These are typically shell scripts. ### Follow up items - Include the updates from `e07aa036db` and `9e5e8bc91e` \| https://github.com/pytorch/pytorch/issues/108108 ### Work Items - [x] I don't think Windows will be supported for 3.1.0 - Need to update cmakee - [x] Let multi_query/attention pass through and test \| UPDATE: I have the fast path implemented here: https://github.com/pytorch/pytorch/pull/106730 but since this will require changes to semantics of math to call repeat_interleave, I think this should be done as a followup. - [x] Had to drop cutlass back to 3.0.0 to get it to compile. Need to figure out how to upgrade to 3.1.0 and later. Spoke with Tri and he is going to be taking a look. Note: compiling with clang currently errors for the cute headers. - [x] Update test exercise above codepath - [x] Still need to disable on seq_len % 128 != 0 for backward( Tri beat me to it `a4f148b6ab`) - [x] Add determinism warning to BWD, Tri got to this one as well: 1c41d2b - [x] Update dispatcher to universally prefer FlashV2 - [x] Update tests to exercise new head_dims - [x] Move the head_dim padding from kernel to top level composite implicit function in order to make it purely functional - [x] Create template generator script - [x] Initial cmake support for building kernels/ folder - [x] Replay CudaGraph changes ### Results #### Forward only The TFlops are reported here are on a100 that is underclocked. ![flashv2_tflops_vs_seq_len](https://github.com/pytorch/pytorch/assets/32754868/152de46d-8fa6-42f0-9a9c-ef1eb7ae29e7) #### Forward+Backward Ran a sweep and for large compute bound sizes we do see a ~2x performance increase for forw+back. <img width="1684" alt="Screenshot 2023-07-20 at 3 47 47 PM" src="https://github.com/pytorch/pytorch/assets/32754868/fdd26e07-0077-4878-a417-f3a418b6fb3b"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/105602 Approved by: https://github.com/huydhn, https://github.com/cpuhrsch	2023-09-01 22:14:44 +00:00
ydwu4	c71828b097	Lift non-FakeTensor restriction for compile (#107042 ) Currently, we have the assertion that dynamo won't accept FakeTensor input unless we're exporting. This PR try to remove this restriction to finish https://github.com/pytorch/pytorch/pull/105679. Pull Request resolved: https://github.com/pytorch/pytorch/pull/107042 Approved by: https://github.com/ezyang, https://github.com/zou3519	2023-08-15 20:58:56 +00:00
Elias Ellison	e61558b5fe	Test fixes (#106586 ) Fix for https://github.com/pytorch/pytorch/issues/106548 and https://github.com/pytorch/pytorch/issues/106299. The fallback was not actually testing fallback anymore now that we have a fake tensor rule for conv. Memory format fallback testing is also now exercised in test_ops.py `TestFakeTensor`. Gc collect fixes the list_clearing test. I suspect their was a refcycle introduced which is making it flakey. Pull Request resolved: https://github.com/pytorch/pytorch/pull/106586 Approved by: https://github.com/wconstab	2023-08-04 23:23:17 +00:00
drisspg	f533791cd0	[SDPA] Mirror c++ implementation in FlashAttention meta func (#106477 ) # Summary Test edge case and update meta function to match the c++ implementation Pull Request resolved: https://github.com/pytorch/pytorch/pull/106477 Approved by: https://github.com/eellison	2023-08-03 00:28:27 +00:00
PyTorch MergeBot	fdd4b3aaa8	Revert "faketensor: prevent deepcopy from cloning FakeTensorMode (#104476 )" This reverts commit `c54afea6ee`. Reverted https://github.com/pytorch/pytorch/pull/104476 on behalf of https://github.com/jeanschmidt due to sadly it is breaking internal tests, and I can't coordinate a FF due to timezone differences ([comment](https://github.com/pytorch/pytorch/pull/104476#issuecomment-1661808343))	2023-08-02 08:56:33 +00:00
Brian Hirsh	c54afea6ee	faketensor: prevent deepcopy from cloning FakeTensorMode (#104476 ) fixes https://github.com/pytorch/pytorch/issues/104465 A more detailed repro is here, which uses `nn.TransformerLayer` (this breaks with AOTAutograd today, due to the presence of multiple FakeTensorMode objects lying around) https://github.com/pytorch/pytorch/issues/103505#issuecomment-1614817132 Pull Request resolved: https://github.com/pytorch/pytorch/pull/104476 Approved by: https://github.com/ezyang	2023-07-31 15:49:08 +00:00
Jason Ansel	3ecd05d9f3	Fix FakeTensor issues with copy_ between devices (#106172 ) Used to fail with: ``` RuntimeError: Unhandled FakeTensor Device Propagation for aten.copy_.default, found two different devices cpu, cuda:0 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/106172 Approved by: https://github.com/eellison	2023-07-29 15:55:32 +00:00
Edward Z. Yang	4af9a914ab	Improve FakeTensor to work with mixed meta-cpu embedding bag arguments (#105924 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/105924 Approved by: https://github.com/mikaylagawarecki, https://github.com/eellison	2023-07-26 01:19:08 +00:00
Justin Chu	73e1455327	[BE] Enable ruff's UP rules and autoformat test/ (#105434 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/105434 Approved by: https://github.com/albanD	2023-07-19 20:36:06 +00:00
Yanbo Liang	77642da3b8	Fix broken meta registration for torch.full (#104451 ) Fixes #104117 Pull Request resolved: https://github.com/pytorch/pytorch/pull/104451 Approved by: https://github.com/eellison	2023-06-30 05:14:52 +00:00
Richard Zou	08a054649c	[operator_compile_check] Add FakeTensor testing (#103595 ) This PR adds dedicated FakeTensor testing to operator_compile_check. We reuse CrossRefFakeMode to do this and improve the error messages on it. Note that this only really runs detailed tests for operators that do not have data-dependent output shape. In the future we should add something like a dynamic CrossRefFakeMode. Test Plan: - existing tests (these now have improved error messages). Pull Request resolved: https://github.com/pytorch/pytorch/pull/103595 Approved by: https://github.com/ezyang, https://github.com/soulitzer	2023-06-16 16:55:51 +00:00
Richard Zou	5b700fc914	Disable fallback for custom kernels (#101131 ) Previous failed attempt was here: https://github.com/pytorch/pytorch/pull/97715. Basically we tried to disable fallback for all ops (aten + custom) but hit many CI failures due to missing fake tensor coverage. Let's just disable it for custom kernels for now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/101131 Approved by: https://github.com/zou3519	2023-06-06 23:25:29 +00:00
Shunting Zhang	86c7652503	[inductor] layout optimization for conv (#99773 ) convolution kernel with channels last runs much faster then kernel with contiguous inputs. The PR leverage that to optimize tensor layouts so we provide 'channels last' inputs to convolution. Some care need to be taken to not convert tensor layout between contiguous and channels last back and forth. Those extra copies hurt performance quite much. Latest perf number [here](https://hud.pytorch.org/benchmark/compilers?startTime=Wed%2C%2024%20May%202023%2023%3A40%3A37%20GMT&stopTime=Wed%2C%2031%20May%202023%2023%3A40%3A37%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=shunting-layout-opt-19&lCommit=baa797fc100688dfb044fbcbdebcfd2591710f78&rBranch=main&rCommit=999bae0f54108ffc5b7cf2524a02a83901554b16) - TB: 1.64x -> 1.69x - HF: 1.79x -> 1.78x (random noise) - TIMM: 1.51x -> 1.65x Right now we disable layout optimization for dynamic shape since there is perf loss in that combination. Here is a GH issue to followup: https://github.com/pytorch/pytorch/issues/102670 Pull Request resolved: https://github.com/pytorch/pytorch/pull/99773 Approved by: https://github.com/jansel	2023-06-02 21:08:18 +00:00
William Wen	da963d793b	Fix aten.copy device mismatch bug in FakeTensor (#102664 ) Fixes `pytest ./generated/test_yizhou_wang_RODNet.py -k test_000` failure in https://github.com/pytorch/pytorch/issues/92670. FakeTensor would raise an error upon trying to run `aten.copy` with inputs with different devices, although this is allowed behavior. Also fix `aten.slice_scatter`, since it also takes args with different devices. Pull Request resolved: https://github.com/pytorch/pytorch/pull/102664 Approved by: https://github.com/yanboliang	2023-06-01 23:05:20 +00:00
Edward Z. Yang	f65732552e	Support FakeTensor with FlatParameter (#101987 ) In this PR we turn FlatParameter into a virtual tensor subclass which doesn't actually ever get instantiated: __new__ will create a Parameter instead (or a FakeTensor, if necessary). Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/101987 Approved by: https://github.com/awgu, https://github.com/eellison	2023-05-23 23:12:08 +00:00
Elias Ellison	e9246b290f	Initialize cuda tensor in fake tensor (#102027 ) Fix for https://github.com/pytorch/pytorch/issues/92627 Pull Request resolved: https://github.com/pytorch/pytorch/pull/102027 Approved by: https://github.com/ngimel	2023-05-23 06:24:50 +00:00
Elias Ellison	f99eeb5bdf	Check devices on meta functions that return inputs (#101807 ) FakeTensor has a default device logic that wraps meta tensors to the right device after running meta kernels and throws on multiple devices. This logic was only running on the wrapping from meta kernels -> fake. For out variants, where the output of the meta kernel was already a fake tensor because it was an input, the device logic wasn't running. Pull Request resolved: https://github.com/pytorch/pytorch/pull/101807 Approved by: https://github.com/ngimel	2023-05-19 16:13:39 +00:00
Richard Zou	e6f9bc500b	CustomOp simple abstract implementation registration (#99439 ) This PR: - adds an abstract registration API for CustomOp (CustomOp.impl_abstract) that is used for both FakeTensor and meta tensors - deletes CustomOp.impl_meta The user story behind this API is that it is the one-stop shop for registering implementations for data-less Tensors, i.e. FakeTensor and Meta tensor. The abstract implementation provided by the user: - gets registered as the FakeTensor implementation AND the meta formula - can be written like a regular meta formula. If the user decides that they need something more special (i.e. data-dependent output shape), then they are able to query a current context object (FakeTensorImplCtx) that has methods to construct new unbacked symints. Caveats: - we really need to make FakeTensor/FakeTensorMode public. Otherwise, there isn't a way for the user to interactively test that their abstract implementation is correct without running through large pieces of the PT2 stack (make_fx or torch.compile). - We do not memoize the symints produced by ctx.create_unbacked_symint(). It is possible to do this in the future, but it is difficult to do soundly and I am not convinced of the utility outside of the nonzero() usecase mentioned in #95399 Public API: - More docs will come when we actually expose this API to users by putting it in a public namespace, unless you folks want it now. - The APIs mentioned in `__all__` are the ones that are intended to be public. Test Plan: - Updated existing custom_op_db operators - Added new numpy_nonzero and numpy_nms operations that test operations that have data-dependendent output shape. Pull Request resolved: https://github.com/pytorch/pytorch/pull/99439 Approved by: https://github.com/ezyang	2023-04-28 13:45:39 +00:00
Natalia Gimelshein	48d112c431	Fix fake tracing of cross entropy with label smoothing and weight (#99830 ) Fixes #99726 Adds a special path in cross entropy implementation for tensor subclasses, we don't always use it as it requires slightly more memory and is a bit slower. Pull Request resolved: https://github.com/pytorch/pytorch/pull/99830 Approved by: https://github.com/ezyang	2023-04-24 04:07:23 +00:00
Michael Voznesensky	5e73569ab4	Add memoized_only mode to tensor conversion (#99741 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/99741 Approved by: https://github.com/ezyang	2023-04-22 19:19:39 +00:00
PyTorch MergeBot	bce21ee06a	Revert "Fix bug in check required output size in _as_strided_scatter_meta (#98483 )" This reverts commit `5b692fd819`. Reverted https://github.com/pytorch/pytorch/pull/98483 on behalf of https://github.com/malfet due to Broke inductor, see https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=inductor%2C%201%2C%201	2023-04-18 18:59:47 +00:00
Richard Zou	57e1a50da3	Fix FakeTensor printing (#99205 ) I got too confused by the FakeTensor printing, so this PR fixes it to print normally. Before: ``` with FakeTensorMode(): x = torch.empty(2, 2, device="cpu") print(x) # FakeTensor(FakeTensor(..., device='meta', shape=(2, 2)), cpu) ``` After (Tensor printing doesn't print the default device): ``` FakeTensor(..., shape=(2, 2)) ``` Test Plan: - new test Pull Request resolved: https://github.com/pytorch/pytorch/pull/99205 Approved by: https://github.com/eellison	2023-04-18 13:26:27 +00:00
lantiankaikai	5b692fd819	Fix bug in check required output size in _as_strided_scatter_meta (#98483 ) Original Issue from #92670 pytest ./generated/test_XuyangBai_PointDSC.py -k test_004 ==> RuntimeError: as_strided_scatter: sizes [4], strides [85], storage offset 256 and itemsize 4 requiring a storage size of 2048 are out of bounds for storage of size 1024 Repro: ``` class Model(nn.Module): def __init__(self): super(Model, self).__init__() def forward(self, x): x[1].fill_diagonal_(0) # this check size failed device = torch.device("cpu") model = Model() model.to(device) torch._dynamo.reset() compiled_model = torch._dynamo.optimize("inductor")(model) arg = [torch.rand([4, 1, 1])] compiled_model(*arg) ``` The error was raised at the checking required size in as_strided_scatter. https://github.com/pytorch/pytorch/blob/master/torch/_prims/__init__.py#L1818 In the case of input is a tensor with storage offset(a view), when compute input's storage length, should also take input's base tensor's size/stride/offset into account instead of compare it with number of element of input. This diff fix the bug and add test. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/98483 Approved by: https://github.com/ngimel	2023-04-18 05:07:57 +00:00
Natalia Gimelshein	888c65b6a4	fix fake tensor propagation for cross_entropy with smoothing (#99255 ) Fixes #99250, unfortunately I haven't figured out how to handle cross-entropy with smooth loss and weights. Pull Request resolved: https://github.com/pytorch/pytorch/pull/99255 Approved by: https://github.com/jansel, https://github.com/malfet	2023-04-17 00:31:26 +00:00
andrewor14	651c1be885	Recompute flat_arg_fake_tensors after fakification (#98769 ) Summary: This fixes the case when some of the input tensors were real tensors and fakified in `validate_and_convert_non_fake_tensors`, but `flat_arg_fake_tensors` would not contain all the inputs because it was computed before the fakification. We fix this by recomputing `flat_arg_fake_tensors` after fakification as well. Test Plan: python test/dynamo/test_export.py ExportTests.test_mixed_real_and_fake_inputs Reviewers: Chillee, voznesenskym Pull Request resolved: https://github.com/pytorch/pytorch/pull/98769 Approved by: https://github.com/voznesenskym	2023-04-14 19:14:29 +00:00
Elias Ellison	fc53472ce4	Move/Fix FakeTensor logic for detecting multiple fake modes (#97186 ) This was leftover for when we had more logic in the FakeTensor and not FakeTensorMode, and wasn't firing correctly. It also makes more sense for it to be in the other validation function. Pull Request resolved: https://github.com/pytorch/pytorch/pull/97186 Approved by: https://github.com/bdhirsh	2023-04-13 19:20:01 +00:00
PyTorch MergeBot	4828585019	Revert "Move/Fix FakeTensor logic for detecting multiple fake modes (#97186 )" This reverts commit `8a057c445d`. Reverted https://github.com/pytorch/pytorch/pull/97186 on behalf of https://github.com/huydhn due to This breaks ONNX test in trunk and it looks like a landrace as the CI signal is green	2023-04-12 19:24:54 +00:00
Elias Ellison	8a057c445d	Move/Fix FakeTensor logic for detecting multiple fake modes (#97186 ) This was leftover for when we had more logic in the FakeTensor and not FakeTensorMode, and wasn't firing correctly. It also makes more sense for it to be in the other validation function. Pull Request resolved: https://github.com/pytorch/pytorch/pull/97186 Approved by: https://github.com/bdhirsh	2023-04-12 17:40:41 +00:00
Elias Ellison	445863128b	Use .to instead of contiguous to generate channels last tensor (#96791 ) Fix for https://github.com/pytorch/pytorch/issues/95693. From https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html: > There are minor difference between the two APIs to and contiguous. We suggest to stick with to when explicitly converting memory format of tensor. For general cases the two APIs behave the same. However in special cases for a 4D tensor with size NCHW when either: C==1 or H==1 && W==1, only to would generate a proper stride to represent channels last memory format. We hit this case in convolution_backward in calling `contiguous()`. Even though we were determining that we should run the backward in channels_last forward, as FakeTensor had gathered from the output of [determine_backend_memory_format](https://github.com/pytorch/pytorch/blob/master/torch/_subclasses/fake_tensor.py#L559), we were still outputting a contiguous tensor. That led to the mismatch in strides in the issue. Should we be calling `to` instead of `contiguous` more liberally throughout the codebase, especially in convolution related code ? Not sure if there are reasons not to do this. Another fix would be to update `cudnn_conv_suggest_memory_format` so that it would output a contiguous_format in this case. Pull Request resolved: https://github.com/pytorch/pytorch/pull/96791 Approved by: https://github.com/ngimel	2023-03-15 19:12:04 +00:00
Xuehai Pan	046e88a291	[BE] [3/3] Rewrite `super()` calls in test (#94592 ) Rewrite Python built-in class `super()` calls. Only non-semantic changes should be applied. - #94587 - #94588 - #94592 Also, methods with only a `super()` call are removed: ```diff class MyModule(nn.Module): - def __init__(self): - super().__init__() - def forward(self, ...): ... ``` Some cases that change the semantics should be kept unchanged. E.g.: `f152a79be9/caffe2/python/net_printer.py (L184-L190)` `f152a79be9/test/test_jit_fuser_te.py (L2628-L2635)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/94592 Approved by: https://github.com/ezyang, https://github.com/seemethere	2023-02-12 22:20:53 +00:00
Wei-Sheng Chin	c5c7687b74	Allow FakeTensorProp to run on graphs traced with some None inputs (#94569 ) Without this tiny change in `torch/_subclasses/fake_tensor.py`, the added test may fail with ``` TypeError: cannot create weak reference to 'NoneType' object ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/94569 Approved by: https://github.com/ezyang	2023-02-10 20:38:22 +00:00

1 2 3

124 Commits