pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Tugsbayasgalan Manlaibaatar	d85314c95c	Support Predispatch functionalization (#113728 ) In this PR, we are implementing Functionalization on the pre-dispatch graph. Today, every dispatch key except for DispatchKey.Python has a dedicated mode stack in python. PreDispatch tracing relies on this behaviour by pushing ProxyTorchDispatchMode to the DispatchKey.PreDispatch mode stack and handling the dispatching logic in python. To make pre-dispatch functionalization work, we now need to push FunctionalTensorMode on the DispatchKey.PreDispatch mode stack and make sure it runs before ProxyTorchDispatchMode (this is very similar to how post-dispatch tracing works). Here are some design decisions we made for this flow to work: 1. FunctionalTensorMode internally calls the C++ Functionalize key. Since C++ functionalization goes after PreDispatch, if we are not careful, we will keep re-entering into the PreDispatch key. We solve this by directly dispatching to the C++ Functionalize key. 2. We delete the mode_stack_per_key logic because the only realistic time it is exercised is for PreDispatch, and it is in general not safe to have a plain list because the ordering of FunctionalTensorMode and ProxyTorchDispatchMode matters and it is hard to enforce on a plain list. Instead, we now have a private class that tracks the PreDispatch mode stack. 3. We will still run CompositeImplicitAutograd decomps in this PR, and disable this logic later as a followup. Some missing bits after this PR: 1. Preserving autograd ops in a functional form. Right now they still show up in the graph but in a "non-functional" way. 2. Turn off CompositeImplicitAutograd decomps 3. Functionalizing HOO Pull Request resolved: https://github.com/pytorch/pytorch/pull/113728 Approved by: https://github.com/bdhirsh	2023-12-19 20:28:35 +00:00
cdzhan	99554112d3	[pytorch] add namespace for optTypeMetaToScalarType in codegen to avoid not declared when compile (#115623 ) Fixes a compilation failure in some environments. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115623 Approved by: https://github.com/albanD	2023-12-13 00:59:01 +00:00
Jesse Cai	4471fe6c39	[sparse][semi-structured] add alg_id to _cslt_sparse_mm and _cslt_sparse_mm_search (#115178 ) Summary: cuSPARSELt has support for different alg_ids, which are set via `cusparseLtMatmulAlgSetAttribute`; in total there are 4 different alg_ids, 0 - 3. Previously we were just using the default alg_id, as from our initial experiments we found that for most shapes the default alg_id is the fastest and that the choice made no difference to numerical correctness, just performance. From our previous experiments the fastest alg_id seemed to differ only on small matmul shapes. danthe3rd found a performance regression when running with cuSPARSELt v0.4.0 vs v0.5.0, on LLM shapes, which match these characteristics (activations are small, weights are large). However it's likely that this is due to the alg_id ordering changing, as mentioned in the release notes for v0.5.0. ``` cusparseLtMatmulAlgSelectionInit() does not ensure the same ordering of algorithm id alg as in v0.4.0. ``` This PR adds in the following: - support for passing in alg_id to _cslt_sparse_mm - a new op, _cslt_sparse_mm_search, which returns the optimal alg_id for a given matmul _cslt_sparse_mm_search has the same function signature as _cslt_sparse_mm, minus the alg_id parameter. We are able to achieve v0.4.0 performance with alg_id=1 on the shapes that daniel provided. We will address autoselecting the best alg_id in a future PR, possibly with torch.compile. Test Plan: ``` python test/test_sparse_semi_structured.py -k cslt ``` Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/115178 Approved by: https://github.com/cpuhrsch	2023-12-11 23:08:51 +00:00
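The search op described in the commit above can be pictured as a brute-force sweep over the four alg_ids. The following is only a minimal sketch in plain Python, not the cuSPARSELt-backed kernel: `search_best_alg_id` and `cost_fn` are hypothetical names, `cost_fn` stands in for timing one sparse matmul with a given alg_id, and the timings are made up for illustration.

```python
from typing import Callable, Iterable


def search_best_alg_id(cost_fn: Callable[[int], float],
                       alg_ids: Iterable[int] = range(4)) -> int:
    """Return the alg_id with the lowest cost reported by cost_fn.

    cost_fn is a hypothetical stand-in for "run the matmul with this
    alg_id and measure how long it took".
    """
    best_id, best_cost = -1, float("inf")
    for alg_id in alg_ids:
        cost = cost_fn(alg_id)
        if cost < best_cost:
            best_id, best_cost = alg_id, cost
    if best_id < 0:
        raise ValueError("no usable alg_id found")
    return best_id


# Made-up timings (milliseconds) for one matmul shape; not measured data.
fake_timings_ms = {0: 1.9, 1: 1.2, 2: 2.4, 3: 1.7}
best = search_best_alg_id(fake_timings_ms.__getitem__)
print(best)
```

Because the ordering of algorithm ids is not guaranteed to be stable across cuSPARSELt versions, a result found this way would presumably need to be re-searched after a library upgrade rather than cached indefinitely.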
PyTorch MergeBot	40a14e07ef	Revert "[sparse][semi-structured] add alg_id to _cslt_sparse_mm and _cslt_spasre_mm_search (#115178 )" This reverts commit `1e5636f791`. Reverted https://github.com/pytorch/pytorch/pull/115178 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the Windows build failure looks legit `1e5636f791` ([comment](https://github.com/pytorch/pytorch/pull/115178#issuecomment-1850605711))	2023-12-11 18:07:17 +00:00
Jesse Cai	1e5636f791	[sparse][semi-structured] add alg_id to _cslt_sparse_mm and _cslt_spasre_mm_search (#115178 ) Summary: cuSPARSELt has support for different alg_ids, which are set via `cusparseLtMatmulAlgSetAttribute`; in total there are 4 different alg_ids, 0 - 3. Previously we were just using the default alg_id, as from our initial experiments we found that for most shapes the default alg_id is the fastest and that the choice made no difference to numerical correctness, just performance. From our previous experiments the fastest alg_id seemed to differ only on small matmul shapes. danthe3rd found a performance regression when running with cuSPARSELt v0.4.0 vs v0.5.0, on LLM shapes, which match these characteristics (activations are small, weights are large). However it's likely that this is due to the alg_id ordering changing, as mentioned in the release notes for v0.5.0. ``` cusparseLtMatmulAlgSelectionInit() does not ensure the same ordering of algorithm id alg as in v0.4.0. ``` This PR adds in the following: - support for passing in alg_id to _cslt_sparse_mm - a new op, _cslt_sparse_mm_search, which returns the optimal alg_id for a given matmul _cslt_sparse_mm_search has the same function signature as _cslt_sparse_mm, minus the alg_id parameter. We are able to achieve v0.4.0 performance with alg_id=1 on the shapes that daniel provided. We will address autoselecting the best alg_id in a future PR, possibly with torch.compile. Test Plan: ``` python test/test_sparse_semi_structured.py -k cslt ``` Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/115178 Approved by: https://github.com/cpuhrsch	2023-12-11 15:47:28 +00:00
Mengwei Liu	898554a3a3	[torchgen] Add logic in custom ops to return empty tensor (#114143 ) Summary: Add two pieces of logic: 1. If the custom op returns a `Tensor` but doesn't have an out tensor as input, return an empty tensor. 2. If the custom op returns more than one Tensor and the number of out tensors does not match the number of returned Tensors, return a tuple of empty tensors. Test Plan: Rely on new unit tests Differential Revision: D51471651 Pull Request resolved: https://github.com/pytorch/pytorch/pull/114143 Approved by: https://github.com/cccclai	2023-12-08 17:03:44 +00:00
voznesenskym	ddf1cb7870	AOTAutograd: handle set_(), detect metadata mutations that cancel out (#111554 ) This should be enough to get @voznesenskym's FSDP branch to plumb `set_()` through AOTAutograd properly and have everything properly no-op out. Main changes are: (1) graph break on `aten::set_.source_Tensor_storage_offset` (we could support it but it isn't needed, seems safer to graph break) (2) Functionalization: add a "proper" functionalization kernel for `aten::set_.source_Tensor`. The previous one we had was codegen'd and it was wrong (it would just clone() and call set_(), which does not do the right thing). I also manually mark on the `FunctionalTensorWrapper` when a given tensor has been mutated by a `set_()` call. (3) AOTAutograd: I added a new field, `InputAliasInfo.mutates_storage_metadata`, so we can distinguish between "regular" metadata mutations, and metadata mutations due to `set_()` calls. This is mainly because at runtime, one requires calling `as_strided_()` to fix up metadata, while the other requires calling `set_()`. (4) Made AOTAutograd's detection for metadata mutations / set_() mutations smarter and detect no-ops (if the storage and metadata are all the same). I also killed `was_updated()` and `was_metadata_updated()`, and replaced them with (existing) `has_data_mutation()` and (new) `has_metadata_mutation()`, which can more accurately distinguish between data-mutation vs. `set_()` calls vs. metadata-mutation. This PR is still silently incorrect in one case though, which I'd like to discuss more. In particular, this example: ``` def f(x): x_view = x.view(-1) x.set_(torch.ones(2)) x_view.mul_(2) return ``` If you have an input that experiences both a data-mutation and a `x_old.set_(x_new)` call, there are two cases: (a) the data mutation happened on the storage of `x_new`. This case should be handled automatically: if x_new is a graph intermediate then we will functionalize the mutation. 
If x_new is a different graph input, then we will perform the usual `copy_()` on that other graph input (b) the data mutation happened on the storage of `x_old`. This is more of a pain to handle, and doesn't currently work. At runtime, the right thing to do is probably something like: ``` def functionalized_f(x): x_view = x.view(-1) # set_() desugars into a no-op; later usages of x will use x_output x_output = torch.ones(2) # functionalize the mutation on x_view x_view_updated = x.mul(2) x_updated = x_view_updated.view(x.shape) # x experienced TWO TYPES of mutations; a data mutation and a metadata mutation # We need to return both updated tensors in our graph return x_updated, x_output def runtime_wrapper(x): x_data_mutation_result, x_set_mutation_result = compiled_graph(x) # First, perform the data mutation on x's old storage x.copy_(x_data_mutation_result) # Then, swap out the storage of x with the new storage x.set_(x_set_mutation_result) ``` There are two things that make this difficult to do though: (1) Functionalization: the functionalization rule for `set_()` will fully throw away the old `FunctionalStorageImpl` on the graph input. So if there are any mutations to that `FunctionalStorageImpl` later on in the graph, the current graph input won't know about it. Maybe we can have a given `FunctionalTensorWrapper` remember all previous storages that it had, and track mutations on all of them - although this feels pretty complicated. (2) AOTAutograd now needs to know that we might have two graph outputs that correspond to a single "mutated input", which is annoying. It's worth pointing out that this issue is probably extremely unlikely for anyone to run into - can we just detect it and error? This feels slightly easier than solving it, although not significantly easier. We would still need `FunctionalTensorWrapper` to keep track of mutations on any of its "previous" storages, so it can report this info back to AOTAutograd so we can raise an error. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111554 Approved by: https://github.com/ezyang ghstack dependencies: #113926	2023-11-28 19:33:35 +00:00
Tarun Karuturi	39f16c221e	Adding event_tracer evalue logging calls in codegen (#114584 ) Summary: This diff adds support in the ExecuTorch codegen layer to log the outputs of kernels to event_tracer. It does this by calling the `event_tracer_log_evalue` API. When the `ET_EVENT_TRACER_ENABLED` flag is disabled this is essentially a no-op and will add no overhead. Test Plan: CI Reviewed By: larryliu0820 Differential Revision: D51534590 Pull Request resolved: https://github.com/pytorch/pytorch/pull/114584 Approved by: https://github.com/larryliu0820	2023-11-28 18:32:05 +00:00
Nikita Shulga	7c98bac4a0	[BE] Speedup register schema compilation (#114438 ) For some reason, inlining an initializer list into a std::vector takes a lot of time using clang-15. But considering that there are only a dozen or so distinct tags, creating them once and passing them as the def argument should not affect runtime speed at all, while significantly improving compilation time. On Mac M1 it reduces the time needed to compile RegisterSchema.cpp from 50 to 3 seconds. Special case empty tags, to keep torch_gen tests happy Before ``` % /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ -ftime-report -DAT_PER_OPERATOR_HEADERS -DCAFFE2_BUILD_MAIN_LIB -DCPUINFO_SUPPORTED_PLATFORM=1 -DFMT_HEADER_ONLY=1 -DFXDIV_USE_INLINE_ASSEMBLY=0 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DNNP_CONVOLUTION_ONLY=0 -DNNP_INFERENCE_ONLY=0 -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DUSE_EXTERNAL_MZCRC -D_FILE_OFFSET_BITS=64 -Dtorch_cpu_EXPORTS -I/Users/nshulga/git/pytorch/pytorch/build/aten/src -I/Users/nshulga/git/pytorch/pytorch/aten/src -I/Users/nshulga/git/pytorch/pytorch/build -I/Users/nshulga/git/pytorch/pytorch -I/Users/nshulga/git/pytorch/pytorch/cmake/../third_party/benchmark/include -I/Users/nshulga/git/pytorch/pytorch/third_party/onnx -I/Users/nshulga/git/pytorch/pytorch/build/third_party/onnx -I/Users/nshulga/git/pytorch/pytorch/third_party/foxi -I/Users/nshulga/git/pytorch/pytorch/build/third_party/foxi -I/Users/nshulga/git/pytorch/pytorch/torch/csrc/api -I/Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include -I/Users/nshulga/git/pytorch/pytorch/caffe2/aten/src/TH -I/Users/nshulga/git/pytorch/pytorch/build/caffe2/aten/src/TH -I/Users/nshulga/git/pytorch/pytorch/build/caffe2/aten/src -I/Users/nshulga/git/pytorch/pytorch/build/caffe2/../aten/src -I/Users/nshulga/git/pytorch/pytorch/torch/csrc -I/Users/nshulga/git/pytorch/pytorch/third_party/miniz-2.1.0 
-I/Users/nshulga/git/pytorch/pytorch/third_party/kineto/libkineto/include -I/Users/nshulga/git/pytorch/pytorch/third_party/kineto/libkineto/src -I/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/.. -I/Users/nshulga/git/pytorch/pytorch/third_party/FXdiv/include -I/Users/nshulga/git/pytorch/pytorch/c10/.. -I/Users/nshulga/git/pytorch/pytorch/third_party/pthreadpool/include -I/Users/nshulga/git/pytorch/pytorch/third_party/cpuinfo/include -I/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/quantized/cpu/qnnpack/include -I/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/quantized/cpu/qnnpack/src -I/Users/nshulga/git/pytorch/pytorch/third_party/cpuinfo/deps/clog/include -I/Users/nshulga/git/pytorch/pytorch/third_party/NNPACK/include -I/Users/nshulga/git/pytorch/pytorch/third_party/FP16/include -I/Users/nshulga/git/pytorch/pytorch/third_party/fmt/include -I/Users/nshulga/git/pytorch/pytorch/third_party/flatbuffers/include -isystem /Users/nshulga/git/pytorch/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /Users/nshulga/git/pytorch/pytorch/cmake/../third_party/googletest/googletest/include -isystem /Users/nshulga/git/pytorch/pytorch/third_party/protobuf/src -isystem /Users/nshulga/git/pytorch/pytorch/third_party/XNNPACK/include -isystem /Users/nshulga/git/pytorch/pytorch/cmake/../third_party/eigen -isystem /Users/nshulga/git/pytorch/pytorch/build/include -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=braced-scalar-init -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wvla-extension -Wsuggest-override -Wnewline-eof 
-Winconsistent-missing-override -Winconsistent-missing-destructor-override -Wno-pass-failed -Wno-error=pedantic -Wno-error=old-style-cast -Wno-error=inconsistent-missing-override -Wno-error=inconsistent-missing-destructor-override -Wconstant-conversion -Wno-invalid-partial-specialization -Wno-missing-braces -Qunused-arguments -fcolor-diagnostics -faligned-new -Werror -Wno-unused-but-set-variable -fno-math-errno -fno-trapping-math -Werror=format -DUSE_MPS -Wno-unused-private-field -Wno-missing-braces -O3 -DNDEBUG -DNDEBUG -arch arm64 -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.0.sdk -fPIC -D__NEON__ -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-unused-function -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-type-limits -Wno-array-bounds -Wno-strict-overflow -Wno-strict-aliasing -fvisibility=hidden -O2 -Wmissing-prototypes -Werror=missing-prototypes -Xpreprocessor -fopenmp -I/Users/nshulga/miniforge3/include -std=gnu++17 -Wno-missing-prototypes -Wno-error=missing-prototypes -o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/RegisterSchema.cpp.o -c /Users/nshulga/git/pytorch/pytorch/build/aten/src/ATen/RegisterSchema.cpp ===-------------------------------------------------------------------------=== ... Pass execution timing report ... 
===-------------------------------------------------------------------------=== Total Execution Time: 131.8054 seconds (132.5540 wall clock) ---User Time--- --System Time-- --User+System-- ---Wall Time--- ---Instr--- --- Name --- 43.6364 ( 33.2%) 0.0919 ( 30.1%) 43.7282 ( 33.2%) 43.9658 ( 33.2%) 536345245380 ModuleInlinerWrapperPass 43.6291 ( 33.2%) 0.0891 ( 29.2%) 43.7182 ( 33.2%) 43.9549 ( 33.2%) 536264096394 DevirtSCCRepeatedPass 42.3766 ( 32.2%) 0.0185 ( 6.1%) 42.3951 ( 32.2%) 42.6198 ( 32.2%) 523040901767 GVNPass 0.4085 ( 0.3%) 0.0040 ( 1.3%) 0.4125 ( 0.3%) 0.4195 ( 0.3%) 4106085945 SimplifyCFGPass 0.3611 ( 0.3%) 0.0115 ( 3.8%) 0.3726 ( 0.3%) 0.3779 ( 0.3%) 4864696407 InstCombinePass 0.1607 ( 0.1%) 0.0088 ( 2.9%) 0.1695 ( 0.1%) 0.1720 ( 0.1%) 1780986175 InlinerPass 0.0865 ( 0.1%) 0.0024 ( 0.8%) 0.0889 ( 0.1%) 0.0914 ( 0.1%) 1489982961 SROAPass 0.0750 ( 0.1%) 0.0013 ( 0.4%) 0.0763 ( 0.1%) 0.0764 ( 0.1%) 620016338 SCCPPass 0.0661 ( 0.1%) 0.0040 ( 1.3%) 0.0701 ( 0.1%) 0.0735 ( 0.1%) 592027163 EarlyCSEPass ... ===-------------------------------------------------------------------------=== Clang front-end time report ===-------------------------------------------------------------------------=== Total Execution Time: 48.2802 seconds (48.8638 wall clock) ... 
``` After ``` % /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ -ftime-report -DAT_PER_OPERATOR_HEADERS -DCAFFE2_BUILD_MAIN_LIB -DCPUINFO_SUPPORTED_PLATFORM=1 -DFMT_HEADER_ONLY=1 -DFXDIV_USE_INLINE_ASSEMBLY=0 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DNNP_CONVOLUTION_ONLY=0 -DNNP_INFERENCE_ONLY=0 -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DUSE_EXTERNAL_MZCRC -D_FILE_OFFSET_BITS=64 -Dtorch_cpu_EXPORTS -I/Users/nshulga/git/pytorch/pytorch/build/aten/src -I/Users/nshulga/git/pytorch/pytorch/aten/src -I/Users/nshulga/git/pytorch/pytorch/build -I/Users/nshulga/git/pytorch/pytorch -I/Users/nshulga/git/pytorch/pytorch/cmake/../third_party/benchmark/include -I/Users/nshulga/git/pytorch/pytorch/third_party/onnx -I/Users/nshulga/git/pytorch/pytorch/build/third_party/onnx -I/Users/nshulga/git/pytorch/pytorch/third_party/foxi -I/Users/nshulga/git/pytorch/pytorch/build/third_party/foxi -I/Users/nshulga/git/pytorch/pytorch/torch/csrc/api -I/Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include -I/Users/nshulga/git/pytorch/pytorch/caffe2/aten/src/TH -I/Users/nshulga/git/pytorch/pytorch/build/caffe2/aten/src/TH -I/Users/nshulga/git/pytorch/pytorch/build/caffe2/aten/src -I/Users/nshulga/git/pytorch/pytorch/build/caffe2/../aten/src -I/Users/nshulga/git/pytorch/pytorch/torch/csrc -I/Users/nshulga/git/pytorch/pytorch/third_party/miniz-2.1.0 -I/Users/nshulga/git/pytorch/pytorch/third_party/kineto/libkineto/include -I/Users/nshulga/git/pytorch/pytorch/third_party/kineto/libkineto/src -I/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/.. -I/Users/nshulga/git/pytorch/pytorch/third_party/FXdiv/include -I/Users/nshulga/git/pytorch/pytorch/c10/.. 
-I/Users/nshulga/git/pytorch/pytorch/third_party/pthreadpool/include -I/Users/nshulga/git/pytorch/pytorch/third_party/cpuinfo/include -I/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/quantized/cpu/qnnpack/include -I/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/quantized/cpu/qnnpack/src -I/Users/nshulga/git/pytorch/pytorch/third_party/cpuinfo/deps/clog/include -I/Users/nshulga/git/pytorch/pytorch/third_party/NNPACK/include -I/Users/nshulga/git/pytorch/pytorch/third_party/FP16/include -I/Users/nshulga/git/pytorch/pytorch/third_party/fmt/include -I/Users/nshulga/git/pytorch/pytorch/third_party/flatbuffers/include -isystem /Users/nshulga/git/pytorch/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /Users/nshulga/git/pytorch/pytorch/cmake/../third_party/googletest/googletest/include -isystem /Users/nshulga/git/pytorch/pytorch/third_party/protobuf/src -isystem /Users/nshulga/git/pytorch/pytorch/third_party/XNNPACK/include -isystem /Users/nshulga/git/pytorch/pytorch/cmake/../third_party/eigen -isystem /Users/nshulga/git/pytorch/pytorch/build/include -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=braced-scalar-init -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wvla-extension -Wsuggest-override -Wnewline-eof -Winconsistent-missing-override -Winconsistent-missing-destructor-override -Wno-pass-failed -Wno-error=pedantic -Wno-error=old-style-cast -Wno-error=inconsistent-missing-override -Wno-error=inconsistent-missing-destructor-override -Wconstant-conversion -Wno-invalid-partial-specialization -Wno-missing-braces 
-Qunused-arguments -fcolor-diagnostics -faligned-new -Werror -Wno-unused-but-set-variable -fno-math-errno -fno-trapping-math -Werror=format -DUSE_MPS -Wno-unused-private-field -Wno-missing-braces -O3 -DNDEBUG -DNDEBUG -arch arm64 -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.0.sdk -fPIC -D__NEON__ -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-unused-function -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-type-limits -Wno-array-bounds -Wno-strict-overflow -Wno-strict-aliasing -fvisibility=hidden -O2 -Wmissing-prototypes -Werror=missing-prototypes -Xpreprocessor -fopenmp -I/Users/nshulga/miniforge3/include -std=gnu++17 -Wno-missing-prototypes -Wno-error=missing-prototypes -o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/RegisterSchema.cpp.o -c /Users/nshulga/git/pytorch/pytorch/build/aten/src/ATen/RegisterSchema.cpp ===-------------------------------------------------------------------------=== ... Pass execution timing report ... 
===-------------------------------------------------------------------------=== Total Execution Time: 1.2920 seconds (1.3187 wall clock) ---User Time--- --System Time-- --User+System-- ---Wall Time--- ---Instr--- --- Name --- 0.3070 ( 27.6%) 0.0547 ( 30.2%) 0.3617 ( 28.0%) 0.3654 ( 27.7%) 3719690895 ModuleInlinerWrapperPass 0.3024 ( 27.2%) 0.0525 ( 29.0%) 0.3549 ( 27.5%) 0.3585 ( 27.2%) 3653363330 DevirtSCCRepeatedPass 0.0619 ( 5.6%) 0.0073 ( 4.0%) 0.0692 ( 5.4%) 0.0711 ( 5.4%) 868136227 InstCombinePass 0.0601 ( 5.4%) 0.0065 ( 3.6%) 0.0666 ( 5.2%) 0.0679 ( 5.1%) 696430647 InlinerPass 0.0363 ( 3.3%) 0.0033 ( 1.8%) 0.0396 ( 3.1%) 0.0425 ( 3.2%) 535426974 SimplifyCFGPass 0.0280 ( 2.5%) 0.0069 ( 3.8%) 0.0348 ( 2.7%) 0.0358 ( 2.7%) 378716394 BlockFrequencyAnalysis 0.0208 ( 1.9%) 0.0049 ( 2.7%) 0.0257 ( 2.0%) 0.0262 ( 2.0%) 283689627 BranchProbabilityAnalysis 0.0239 ( 2.1%) 0.0002 ( 0.1%) 0.0241 ( 1.9%) 0.0241 ( 1.8%) 219122704 OpenMPOptCGSCCPass 0.0174 ( 1.6%) 0.0015 ( 0.8%) 0.0189 ( 1.5%) 0.0192 ( 1.5%) 215583965 GVNPass 0.0153 ( 1.4%) 0.0025 ( 1.4%) 0.0178 ( 1.4%) 0.0187 ( 1.4%) 184232295 EarlyCSEPass ... ===-------------------------------------------------------------------------=== Clang front-end time report ===-------------------------------------------------------------------------=== Total Execution Time: 2.9128 seconds (3.1027 wall clock) ... 
``` And the generated schema file looks as follows: ```cpp TORCH_LIBRARY(aten, m) { const std::vector<at::Tag> tags_0 = {at::Tag::pt2_compliant_tag}; m.def("_cast_Byte(Tensor self, bool non_blocking=False) -> Tensor", tags_0); m.def("_cast_Char(Tensor self, bool non_blocking=False) -> Tensor", tags_0); m.def("_cast_Double(Tensor self, bool non_blocking=False) -> Tensor", tags_0); m.def("_cast_Float(Tensor self, bool non_blocking=False) -> Tensor", tags_0); m.def("_cast_Int(Tensor self, bool non_blocking=False) -> Tensor", tags_0); m.def("_cast_Long(Tensor self, bool non_blocking=False) -> Tensor", tags_0); m.def("_cast_Short(Tensor self, bool non_blocking=False) -> Tensor", tags_0); m.def("_cast_Half(Tensor self, bool non_blocking=False) -> Tensor", tags_0); m.def("_backward(Tensor self, Tensor[] inputs, Tensor? gradient=None, bool? retain_graph=None, bool create_graph=False) -> ()", tags_0); m.def("set_data(Tensor(a!) self, Tensor new_data) -> ()", tags_0); m.def("data(Tensor self) -> Tensor", tags_0); m.def("is_leaf(Tensor self) -> bool", tags_0); m.def("output_nr(Tensor self) -> int", tags_0); m.def("_version(Tensor self) -> int", tags_0); m.def("requires_grad_(Tensor(a!) self, bool requires_grad=True) -> Tensor(a!)", tags_0); m.def("retain_grad(Tensor(a!) self) -> ()", tags_0); m.def("retains_grad(Tensor self) -> bool", tags_0); m.def("_fw_primal(Tensor(a) self, int level) -> Tensor(a)", tags_0); m.def("_make_dual(Tensor(a) primal, Tensor tangent, int level) -> Tensor(a)", tags_0); m.def("_unpack_dual(Tensor(a) dual, int level) -> (Tensor(a) primal, Tensor tangent)", tags_0); m.def("_new_zeros_with_same_feature_meta(Tensor self, Tensor other, *, int self_num_batch_dims=0) -> Tensor", tags_0); m.def("_has_same_storage_numel(Tensor self, Tensor other) -> bool", tags_0); const std::vector<at::Tag> tags_1 = {at::Tag::inplace_view, at::Tag::pt2_compliant_tag}; m.def("rename_(Tensor(a!) self, Dimname[]? 
names) -> Tensor(a!)", tags_1); m.def("rename(Tensor(a) self, Dimname[]? names) -> Tensor(a)", tags_0); m.def("align_to(Tensor(a) self, Dimname[] names) -> Tensor(a)", tags_0); m.def("align_to.ellipsis_idx(Tensor(a) self, Dimname[] order, int ellipsis_idx) -> Tensor(a)", tags_0); m.def("align_as(Tensor self, Tensor other) -> Tensor", tags_0); ... ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/114438 Approved by: https://github.com/zou3519	2023-11-27 23:33:04 +00:00
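The generated output in the commit above suggests a simple emission strategy: declare each distinct tag combination once, then reference it by name in every `m.def` call. The following is a rough sketch of that idea in plain Python, not the actual torchgen code; `emit_schema_defs` is a hypothetical helper, and emitting `{}` for schemas without tags is an assumption about how the empty-tags special case might look.

```python
from typing import Dict, List, Sequence, Tuple


def emit_schema_defs(
    schemas: Sequence[Tuple[str, Tuple[str, ...]]]
) -> List[str]:
    """Emit C++ lines that declare each distinct tag tuple only once.

    `schemas` is a sequence of (schema string, tuple of tag names).
    A tag vector is declared right before its first use, mirroring
    the layout of the generated file quoted in the commit message.
    """
    tag_vars: Dict[Tuple[str, ...], str] = {}
    lines: List[str] = []
    for schema, tags in schemas:
        if not tags:
            # Assumed special case: no shared vector for empty tags.
            lines.append(f'm.def("{schema}", {{}});')
            continue
        if tags not in tag_vars:
            name = f"tags_{len(tag_vars)}"
            tag_vars[tags] = name
            init = ", ".join(f"at::Tag::{t}" for t in tags)
            lines.append(f"const std::vector<at::Tag> {name} = {{{init}}};")
        lines.append(f'm.def("{schema}", {tag_vars[tags]});')
    return lines


example = emit_schema_defs([
    ("data(Tensor self) -> Tensor", ("pt2_compliant_tag",)),
    ("rename_(Tensor(a!) self, Dimname[]? names) -> Tensor(a!)",
     ("inplace_view", "pt2_compliant_tag")),
    ("is_leaf(Tensor self) -> bool", ("pt2_compliant_tag",)),
    ("hypothetical_untagged() -> ()", ()),
])
print("\n".join(example))
```

The point of the design is that the compiler sees a dozen or so vector constructions instead of one per operator, which is what the timing reports above attribute the speedup to.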
PyTorch MergeBot	3e1abde46d	Revert "AOTAutograd: handle set_(), detect metadata mutations that cancel out (#111554 )" This reverts commit `a911b4db9d`. Reverted https://github.com/pytorch/pytorch/pull/111554 on behalf of https://github.com/DanilBaibak due to The lower PR in the stack #113926 breaks the internal build ([comment](https://github.com/pytorch/pytorch/pull/111554#issuecomment-1822472206))	2023-11-22 10:13:48 +00:00
Antonio Kim	7fc292930c	Add support for `torch.Generator` type in TorchScript (#110413 ) - Add support for `torch.Generator` type in TorchScript - Add `generator` args to all `torch.nn.init` functions that call `uniform_` or `normal_` - Add support for `torch.Generator` in LTC's TorchScript backend (CC: @wconstab) CC: @eellison @davidberard98 @GlebKazantaev @behzad-a Pull Request resolved: https://github.com/pytorch/pytorch/pull/110413 Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/glebk-cerebras, https://github.com/davidberard98	2023-11-21 23:07:21 +00:00
voznesenskym	a911b4db9d	AOTAutograd: handle set_(), detect metadata mutations that cancel out (#111554 ) This should be enough to get @voznesenskym's FSDP branch to plumb `set_()` through AOTAutograd properly and have everything properly no-op out. Main changes are: (1) graph break on `aten::set_.source_Tensor_storage_offset` (we could support it but it isn't needed, seems safer to graph break) (2) Functionalization: add a "proper" functionalization kernel for `aten::set_.source_Tensor`. The previous one we had was codegen'd and it was wrong (it would just clone() and call set_(), which does not do the right thing). I also manually mark on the `FunctionalTensorWrapper` when a given tensor has been mutated by a `set_()` call. (3) AOTAutograd: I added a new field, `InputAliasInfo.mutates_storage_metadata`, so we can distinguish between "regular" metadata mutations, and metadata mutations due to `set_()` calls. This is mainly because at runtime, one requires calling `as_strided_()` to fix up metadata, while the other requires calling `set_()`. (4) Made AOTAutograd's detection for metadata mutations / set_() mutations smarter and detect no-ops (if the storage and metadata are all the same). I also killed `was_updated()` and `was_metadata_updated()`, and replaced them with (existing) `has_data_mutation()` and (new) `has_metadata_mutation()`, which can more accurately distinguish between data-mutation vs. `set_()` calls vs. metadata-mutation. This PR is still silently incorrect in one case though, which I'd like to discuss more. In particular, this example: ``` def f(x): x_view = x.view(-1) x.set_(torch.ones(2)) x_view.mul_(2) return ``` If you have an input that experiences both a data-mutation and a `x_old.set_(x_new)` call, there are two cases: (a) the data mutation happened on the storage of `x_new`. This case should be handled automatically: if x_new is a graph intermediate then we will functionalize the mutation. 
If x_new is a different graph input, then we will perform the usual `copy_()` on that other graph input (b) the data mutation happened on the storage of `x_old`. This is more of a pain to handle, and doesn't currently work. At runtime, the right thing to do is probably something like: ``` def functionalized_f(x): x_view = x.view(-1) # set_() desugars into a no-op; later usages of x will use x_output x_output = torch.ones(2) # functionalize the mutation on x_view x_view_updated = x.mul(2) x_updated = x_view_updated.view(x.shape) # x experienced TWO TYPES of mutations; a data mutation and a metadata mutation # We need to return both updated tensors in our graph return x_updated, x_output def runtime_wrapper(x): x_data_mutation_result, x_set_mutation_result = compiled_graph(x) # First, perform the data mutation on x's old storage x.copy_(x_data_mutation_result) # Then, swap out the storage of x with the new storage x.set_(x_set_mutation_result) ``` There are two things that make this difficult to do though: (1) Functionalization: the functionalization rule for `set_()` will fully throw away the old `FunctionalStorageImpl` on the graph input. So if there are any mutations to that `FunctionalStorageImpl` later on in the graph, the current graph input won't know about it. Maybe we can have a given `FunctionalTensorWrapper` remember all previous storages that it had, and track mutations on all of them - although this feels pretty complicated. (2) AOTAutograd now needs to know that we might have two graph outputs that correspond to a single "mutated input", which is annoying. It's worth pointing out that this issue is probably extremely unlikely for anyone to run into - can we just detect it and error? This feels slightly easier than solving it, although not significantly easier. We would still need `FunctionalTensorWrapper` to keep track of mutations on any of its "previous" storages, so it can report this info back to AOTAutograd so we can raise an error. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111554 Approved by: https://github.com/ezyang ghstack dependencies: #113926	2023-11-21 01:52:46 +00:00
Edward Z. Yang	8c4812be80	Replace expect_int with guard_int (#113921 ) The idea is that instead of erroring, we will just specialize at these sites. Fixes https://github.com/pytorch/pytorch/issues/113142 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/113921 Approved by: https://github.com/zou3519	2023-11-20 21:27:48 +00:00
Brian Vaughan	dbb96ef30d	improve annotation device parameters where a device ordinal is allowed (#113647 ) Using mypy in code that depends on pytorch, I noticed that the type annotation doesn't allow a device ordinal. `error: Argument "device" to "to_empty" of "Module" has incompatible type "int"; expected "str | device" [arg-type]` Pull Request resolved: https://github.com/pytorch/pytorch/pull/113647 Approved by: https://github.com/albanD	2023-11-17 14:41:22 +00:00
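To illustrate the kind of annotation the commit above is about, here is a small self-contained sketch of a "device-like" union that also admits a bare integer ordinal. It deliberately avoids importing torch: `Device`, `DeviceLike`, and `normalize_device` are hypothetical stand-ins, and treating a bare ordinal as a CUDA index is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Device:
    """Toy stand-in for a device object (hypothetical, not torch.device)."""
    type: str
    index: Optional[int] = None


# Accepting `int` here is what lets a type checker allow a device ordinal.
DeviceLike = Union[str, int, Device]


def normalize_device(device: DeviceLike, default_type: str = "cuda") -> Device:
    """Convert a string, ordinal, or Device into a Device."""
    if isinstance(device, Device):
        return device
    if isinstance(device, bool):
        # bool is a subclass of int; reject it explicitly.
        raise TypeError("device ordinal must be an int, not bool")
    if isinstance(device, int):
        # Assumption for this sketch: a bare ordinal means default_type.
        return Device(default_type, device)
    kind, _, index = device.partition(":")
    return Device(kind, int(index) if index else None)


print(normalize_device(1))
print(normalize_device("cuda:0"))
print(normalize_device("cpu"))
```

With an annotation shaped like `DeviceLike`, a call that passes an `int` no longer triggers the `arg-type` error quoted in the commit message.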
Jane Xu	deec2380c7	Add 0dim Tensor overload for _foreach_div (#113688 ) This PR is ALMOST basically just following the steps from #106677 EXCEPT! We do add one feature. Similar to fused_adam(w), for the CUDA dispatches: when the scalar tensor is on CPU, we .item and redispatch to the normal scalar overload. Otherwise, the cuda kernel will complain about mismatch in devices between the scalar and the tensors. Why do we add this feature? Our optimizers want to allow lr as a tensor, and lr could be a CPU tensor. lr is used with foreach_div_ in Adam, so our CI will break otherwise. After this PR, `_foreach_mul` and `_foreach_div` will accept either a CPU or a GPU tensor for the scalar tensor (vs only a GPU tensor). They join the ranks of `fused_adam(w)` in this characteristic. I did not yet do the same thing for foreach_add (the only other foreach op with a .Tensor overload) because there is no use case and will be more involved. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113688 Approved by: https://github.com/mlazos, https://github.com/albanD	2023-11-15 20:59:32 +00:00
George White	6c187246d6	Add support for float8_e4m3fnuz and _e5m2fnuz (#107586 ) This PR relates to the feature in [this feature submission](https://docs.google.com/document/d/1pF2T1xz54IPg1jG7FhykbrpbcJZVelQw0v8vBaoLkfs/edit). It has been based on #104242 which adds similar float8 types. These new types added in this PR are described in the paper at https://arxiv.org/abs/2206.02915. A brief description and comparison of the types with other float8 types can be also found in the [OpenXLA RFC](https://github.com/openxla/stablehlo/blob/main/rfcs/20230321-fp8_fnuz.md). Pull Request resolved: https://github.com/pytorch/pytorch/pull/107586 Approved by: https://github.com/seemethere, https://github.com/malfet	2023-11-15 15:01:11 +00:00
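As a rough illustration of what the two new dtypes encode, here is a pure-Python decoder sketch. It assumes the layout summarized in the linked OpenXLA RFC: 1 sign bit, then 4 or 5 exponent bits with bias 8 or 16, no infinities, no negative zero, and a single NaN at bit pattern `0x80`. The helper name `decode_fnuz` is hypothetical and is not part of PyTorch.

```python
import math


def decode_fnuz(byte: int, exp_bits: int, man_bits: int) -> float:
    """Decode one 8-bit fnuz value (hypothetical helper, not a PyTorch API)."""
    if byte == 0x80:
        # Sign bit set with all other bits zero is the only NaN encoding.
        return math.nan
    bias = 1 << (exp_bits - 1)  # 8 for e4m3fnuz, 16 for e5m2fnuz
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    frac = man / (1 << man_bits)
    if exp == 0:
        # Subnormal values (and zero) have no implicit leading one.
        return sign * frac * 2.0 ** (1 - bias)
    return sign * (1.0 + frac) * 2.0 ** (exp - bias)


# Largest finite values of the two formats under the assumed layout.
e4m3fnuz_max = decode_fnuz(0x7F, exp_bits=4, man_bits=3)
e5m2fnuz_max = decode_fnuz(0x7F, exp_bits=5, man_bits=2)
```

Because no bit patterns are reserved for infinities, the all-ones exponent still encodes ordinary finite values, which is where the extra range comes from.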
PyTorch MergeBot	252e68a83b	Revert "Add support for `torch.Generator` type in TorchScript (#110413 )" This reverts commit `54493fe8c4`. Reverted https://github.com/pytorch/pytorch/pull/110413 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is, unfortunately, still breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/110413#issuecomment-1811625557))	2023-11-15 00:51:23 +00:00
Aaron Gokaslan	18d7b8e4f7	[BE]: ruff apply rule PLW1510 to find silent subprocess errors (#113644 ) Reopens #111682 that I messed up due to a bad rebase and triggered some issues with CLA. This explicitly adds check=True or False to any subprocess calls where appropriate. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113644 Approved by: https://github.com/ezyang, https://github.com/kit1980	2023-11-14 20:59:40 +00:00
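For context, a minimal sketch (not taken from the PR) of why an explicit `check` matters. The child command here is a made-up example that always exits with status 3.

```python
import subprocess
import sys

# Hypothetical child process that always fails with exit status 3.
FAILING_CMD = [sys.executable, "-c", "import sys; sys.exit(3)"]

# check=False: the failure is only visible if the caller inspects returncode.
silent = subprocess.run(FAILING_CMD, check=False)

# check=True: a non-zero exit status raises CalledProcessError.
try:
    subprocess.run(FAILING_CMD, check=True)
    raised = False
except subprocess.CalledProcessError as exc:
    raised = True
    failing_status = exc.returncode
```

Spelling out `check=False` documents that ignoring the exit status is intentional rather than an oversight.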
Aaron Gokaslan	b7b2178204	[BE]: Remove useless lambdas (#113602 ) Applies PLW0108 which removes useless lambda calls in Python, the rule is in preview so it is not ready to be enabled by default just yet. These are the autofixes from the rule. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113602 Approved by: https://github.com/albanD	2023-11-14 20:06:48 +00:00
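The kind of pattern this rule targets looks roughly like the following. It is an illustrative example, not code from the PR.

```python
names = ["b", "A", "c"]

# Flagged pattern: the lambda only forwards its argument to another callable.
sorted_with_lambda = sorted(names, key=lambda s: str.lower(s))

# Equivalent fix: pass the callable directly.
sorted_direct = sorted(names, key=str.lower)
```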
Antonio Kim	54493fe8c4	Add support for `torch.Generator` type in TorchScript (#110413 ) - Add support for `torch.Generator` type in TorchScript - Add `generator` args to all `torch.nn.init` functions that call `uniform_` or `normal_` - Add support for `torch.Generator` in LTC's TorchScript backend (CC: @wconstab) CC: @eellison @davidberard98 @GlebKazantaev @behzad-a Pull Request resolved: https://github.com/pytorch/pytorch/pull/110413 Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/glebk-cerebras, https://github.com/davidberard98	2023-11-13 23:18:14 +00:00
PyTorch MergeBot	0e6b6a2483	Revert "AOTAutograd: handle set_(), detect metadata mutations that cancel out (#111554 )" This reverts commit `3afb4e5cf7`. Reverted https://github.com/pytorch/pytorch/pull/111554 on behalf of https://github.com/clee2000 due to the xla failure is real sorry, log classifier is showing the wrong line ([comment](https://github.com/pytorch/pytorch/pull/111554#issuecomment-1809177978))	2023-11-13 21:46:57 +00:00
Brian Hirsh	3afb4e5cf7	AOTAutograd: handle set_(), detect metadata mutations that cancel out (#111554 ) This should be enough to get @voznesenskym 's FSDP branch to plumb `set_()` through AOTAutograd properly and have everything properly no-op out. Main changes are: (1) graph break on `aten::set_.source_Tensor_storage_offset` (we could support it but it isn't needed, seems safer to graph break) (2) Functionalization: add a "proper" functionalization kernel for `aten::set_.source_Tensor`. The previous one we had was codegen'd and it was wrong (it would just clone() and call set_(), which does not do the right thing). I also manually mark on the `FunctionalTensorWrapper` when a given tensor has been mutated by a `set_()` call. (3) AOTAutograd: I added a new field, `InputAliasInfo.mutates_storage_metadata`, so we can distinguish between "regular" metadata mutations, and metadata mutations due to `set_()` calls. This is mainly because at runtime, one requires calling `as_strided_()` to fix up metadata, while the other requires calling `set_()`. (4) Made AOTAutograd's detection for metadata mutations / set_() mutations smarter and detect no-ops (if the storage and metadata are all the same). I also killed `was_updated()` and `was_metadata_updated()`, and replaced them with (existing) `has_data_mutation()` and (new) `has_metadata_mutation()`, which can more accurately distinguish between data-mutation vs. `set_()` calls vs. metadata-mutation. This PR is still silently incorrect in one case though, which I'd like to discuss more. In particular, this example:
```
import torch

def f(x):
    x_view = x.view(-1)
    x.set_(torch.ones(2))
    x_view.mul_(2)
    return
```
If you have an input that experiences both a data-mutation and a `x_old.set_(x_new)` call, there are two cases: (a) the data mutation happened on the storage of `x_new`. This case should be handled automatically: if x_new is a graph intermediate then we will functionalize the mutation.
If x_new is a different graph input, then we will perform the usual `copy_()` on that other graph input. (b) the data mutation happened on the storage of `x_old`. This is more of a pain to handle, and doesn't currently work. At runtime, the right thing to do is probably something like:
```
import torch

def functionalized_f(x):
    x_view = x.view(-1)
    # set_() desugars into a no-op; later usages of x will use x_output
    x_output = torch.ones(2)
    # functionalize the mutation on x_view
    x_view_updated = x_view.mul(2)
    x_updated = x_view_updated.view(x.shape)
    # x experienced TWO TYPES of mutations: a data mutation and a metadata mutation
    # We need to return both updated tensors in our graph
    return x_updated, x_output

def runtime_wrapper(x):
    # compiled_graph stands for the compiled form of functionalized_f
    x_data_mutation_result, x_set_mutation_result = compiled_graph(x)
    # First, perform the data mutation on x's old storage
    x.copy_(x_data_mutation_result)
    # Then, swap out the storage of x with the new storage
    x.set_(x_set_mutation_result)
```
There are two things that make this difficult to do though: (1) Functionalization: the functionalization rule for `set_()` will fully throw away the old `FunctionalStorageImpl` on the graph input. So if there are any mutations to that `FunctionalStorageImpl` later on in the graph, the current graph input won't know about it. Maybe we can have a given `FunctionalTensorWrapper` remember all previous storages that it had, and track mutations on all of them - although this feels pretty complicated. (2) AOTAutograd now needs to know that we might have two graph outputs that correspond to a single "mutated input", which is annoying. It's worth pointing out that this issue is probably extremely unlikely for anyone to run into - can we just detect it and error? This feels slightly easier than solving it, although not significantly easier. We would still need `FunctionalTensorWrapper` to keep track of mutations on any of its "previous" storages, so it can report this info back to AOTAutograd so we can raise an error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111554 Approved by: https://github.com/ezyang	2023-11-13 16:39:25 +00:00
PyTorch MergeBot	9a28a7b498	Revert "Add support for `torch.Generator` type in TorchScript (#110413 )" This reverts commit `27e31ab6e8`. Reverted https://github.com/pytorch/pytorch/pull/110413 on behalf of https://github.com/PaliC due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/110413#issuecomment-1799003164))	2023-11-07 15:53:32 +00:00
rzou	71dca16610	Grandfather autogen'ed ops as pt2_compliant (#113036 ) Summary: I missed this when I grandfathered torchgen'ed aten ops as pt2_compliant. Test Plan: New test. Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/113036 Approved by: https://github.com/williamwen42	2023-11-06 23:43:17 +00:00
Antonio Kim	27e31ab6e8	Add support for `torch.Generator` type in TorchScript (#110413 ) - Add support for `torch.Generator` type in TorchScript - Add `generator` args to all `torch.nn.init` functions that call `uniform_` or `normal_` - Add support for `torch.Generator` in LTC's TorchScript backend (CC: @wconstab) CC: @eellison @davidberard98 @GlebKazantaev @behzad-a Pull Request resolved: https://github.com/pytorch/pytorch/pull/110413 Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/glebk-cerebras, https://github.com/davidberard98	2023-11-06 21:27:02 +00:00
Mengwei Liu	19e9f5cc7b	[torchgen] Add support for optional tensor (#112938 ) Summary: As titled Test Plan: rely on CI Differential Revision: D50997957 Pull Request resolved: https://github.com/pytorch/pytorch/pull/112938 Approved by: https://github.com/Skylion007	2023-11-06 20:03:05 +00:00
rzou	d91a18c433	Grandfather in torchgen'ed aten ops to torch.Tag.pt2_compliant_tag (#112053 ) In torchgen, we add the pt2_compliant_tag to all aten ops. Test Plan: - new test Pull Request resolved: https://github.com/pytorch/pytorch/pull/112053 Approved by: https://github.com/soulitzer	2023-10-26 21:21:09 +00:00
Brian Hirsh	2aaa7e542c	AOTAutograd: avoid intermediate_base logic when all aliased outputs came from a multi_output_view (#111411 ) Partially addresses https://github.com/pytorch/pytorch/issues/111081 This fixes the majority of the slowness from https://fb.workplace.com/groups/1405155842844877/permalink/7491314274228973/. In particular, the type of example that suffers the most perf-wise in AOTAutograd looks like this:
```
import torch

@torch.compile
def f(x):
    intermediate = x.mul(2)
    outs = intermediate.unbind(0)
    return outs

x = torch.randn(50, 50, requires_grad=True)
outs = f(x)
sum(outs).sum().backward()
```
There are 50 output tensors in the above function, that all alias each other. AOTAutograd will dutifully exercise its intermediate base [logic](https://github.com/pytorch/pytorch/blob/main/torch/_functorch/aot_autograd.py#L294), and try to regenerate the aliases outside of the compiled `autograd.Function` at runtime, to ensure that the autograd engine is aware of the aliasing. In this case, this will result in 50 AsStridedBackward nodes in the backward, because we will fall back to using as_strided to generate each of those 50 outputs. The current PR as is (somewhat unsafely) ensures that the backward graph consists of a single `UnbindBackward`, or a call to `aten.cat()`. I left a long comment in the code describing the situation, but the core idea is that autograd does not let you mutate grad_fn of tensor aliases that come from multi-output views. So if we have `k` outputs that alias each other, but `k-1` of them are aliases that came from multi-output views, then in eager mode, it would not be possible to mutate one of the aliases in a way that would change the grad_fn of any of the other aliases, without causing an error in the backward.
So the claim I'm making is that if we hide this aliasing from the autograd engine, then it is impossible for the user to perform any mutations that would cause autograd metadata to diverge between torch.compile and eager in a way that isn't an error in eager mode. To be fair, I think that taking the approach outlined in https://docs.google.com/document/d/1DlfFq8TKbuAn2zyJxLfoW-X1qkkm5PLdHFtySo03QAk/edit would also help us avoid the as_strided calls in this particularly egregious case, *and* keep the autograd error messages. This relies on both pre-dispatch functionalization being fully hardened and adding some pretty invasive changes to AOTAutograd though, and is probably at least several months out. Pull Request resolved: https://github.com/pytorch/pytorch/pull/111411 Approved by: https://github.com/ezyang	2023-10-26 02:54:50 +00:00
Richard Zou	66b74d231a	Change torch.library.impl to accept a device string (#111659 ) torch.library.impl now accepts a device string (e.g. "cpu", "cuda"). It still accepts DispatchKey strings, but we no longer document this, because using arbitrary DispatchKeys is more for the power users. We map the device string to a DispatchKey and then register the impl for said DispatchKey. A user may also specify multiple device strings at once or specify "types=default" to get a CompositeExplicitAutograd registration. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/111659 Approved by: https://github.com/soulitzer ghstack dependencies: #111380	2023-10-23 23:02:41 +00:00
Aaron Gokaslan	1ad0f0b308	[BE]: remove unnecessary enumerate calls (#111690 ) Remove unnecessary enumerate calls, entirely automated fixes so probably reasonably low risk. Pull Request resolved: https://github.com/pytorch/pytorch/pull/111690 Approved by: https://github.com/malfet	2023-10-20 23:20:29 +00:00
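An illustrative before/after for this kind of cleanup. The data is made up and the snippet is not taken from the PR.

```python
shapes = [(2, 3), (4, 5)]

# Before: enumerate() is called but the index is never used.
numel_before = [rows * cols for _, (rows, cols) in enumerate(shapes)]

# After: iterate over the sequence directly.
numel_after = [rows * cols for rows, cols in shapes]
```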
Jane Xu	ca7d084ff9	Add ScalarTensor or 0dim overload for _foreach_add (#111079 ) Adding a Tensor overload will allow us to: - optimize in more cases than before - increase coverage for scalarTensor instead of just scalars in our foreach APIs The main complication in this PR was that add.Tensor has a scalar overload, so I've now built out support for that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/111079 Approved by: https://github.com/albanD	2023-10-20 01:34:07 +00:00
isdanni	382327bd0e	[BE] Enable Ruff's Flake8 PYI034 (#111105 ) Enable [non-self-return-type (PYI034)](https://docs.astral.sh/ruff/rules/non-self-return-type/#non-self-return-type-pyi034) Link: #110950 EDIT: to newly added reviewers, please ignore the request, it's due to a rebase error 😅 Pull Request resolved: https://github.com/pytorch/pytorch/pull/111105 Approved by: https://github.com/Skylion007	2023-10-13 21:19:53 +00:00
Kazuaki Ishizaki	ac48c11ab7	Fix typo under torchgen directory (#111154 ) This PR fixes typo in comments and messages in files under `torchgen` directory. Pull Request resolved: https://github.com/pytorch/pytorch/pull/111154 Approved by: https://github.com/rajveer43, https://github.com/Skylion007	2023-10-13 16:43:46 +00:00
isdanni	dede1e96e2	[BE] Enable Ruff's Flake8 PYI018 (#111101 ) Enable [unused-private-type-var (PYI018)](https://docs.astral.sh/ruff/rules/unused-private-type-var/#unused-private-type-var-pyi018) Link: #110950 Pull Request resolved: https://github.com/pytorch/pytorch/pull/111101 Approved by: https://github.com/albanD	2023-10-12 16:26:21 +00:00
isdanni	2f53085f3f	[BE] Enable Ruff's Flake8 PYI030 (#111103 ) Enable [unnecessary-literal-union (PYI030)](https://docs.astral.sh/ruff/rules/unnecessary-literal-union/) Link: #110950 Pull Request resolved: https://github.com/pytorch/pytorch/pull/111103 Approved by: https://github.com/albanD	2023-10-12 13:31:59 +00:00
Kazuaki Ishizaki	50bd252863	Fix typo `the the` (#110869 ) This PR fixes typo `the the` of comments and exception message in files. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110869 Approved by: https://github.com/soulitzer	2023-10-09 19:32:45 +00:00
Aaron Gokaslan	144cda7f06	[BE]: Enable ruff's flake8-PYI rules (#110830 ) Enable Flake8-PYI rules codebase wide. Most of the rules already match our codebase style, the remaining ones that were not autofixed I have added to the pyproject.toml to be enabled in a later PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110830 Approved by: https://github.com/albanD	2023-10-09 16:37:26 +00:00
Edward Z. Yang	6a974bec5d	Change flash attention outputs to be SymInt instead of int (#110533 ) Fixes https://github.com/pytorch/pytorch/issues/110322 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/110533 Approved by: https://github.com/albanD	2023-10-05 01:00:07 +00:00
Fabrice Pont	053367b1ed	fix: flake8-bugbear code B024 (#107265 ) See #106571 item B024 This fix concerns the addition of `abstractmethod` to methods declared inside abstract classes. Should I also include PEP8 compliant reformatting on the files I had to modify ? Pull Request resolved: https://github.com/pytorch/pytorch/pull/107265 Approved by: https://github.com/kit1980	2023-10-04 23:52:52 +00:00
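A small sketch of what B024 guards against. The class names are hypothetical. An `ABC` subclass with no abstract methods can be instantiated like any other class, whereas marking a method with `abstractmethod` makes instantiation fail.

```python
from abc import ABC, abstractmethod


class LooseBase(ABC):
    # No @abstractmethod anywhere: B024 flags this class.
    def run(self):
        raise NotImplementedError


class StrictBase(ABC):
    @abstractmethod
    def run(self):
        ...


# The "abstract" class without abstract methods instantiates without complaint.
loose = LooseBase()

# With @abstractmethod, Python refuses to instantiate the base class.
try:
    StrictBase()
    strict_instantiated = True
except TypeError:
    strict_instantiated = False
```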
Mengwei Liu	0721a394b6	[executorch][kernel reg] Allow kernel manual registration (#110086 ) Summary: Exposing a codegen mode for generating a hook for user to register their kernels. If we pass `--manual-registration` flag to `gen_executorch.py`, we will generate the following files: 1. RegisterKernels.h which declares a `register_all_kernels()` API inside `torch::executor` namespace. 2. RegisterKernelsEverything.cpp which implements `register_all_kernels()` by defining an array of generated kernels. This way user can depend on the library declared by `executorch_generated_lib` macro (with `manual_registration=True`) and be able to include `RegisterKernels.h`. Then they can manually call `register_all_kernels()` instead of relying on C++ static initialization mechanism which is not available in some embedded systems. Test Plan: Rely on the unit test: ``` buck2 test fbcode//executorch/runtime/kernel/test:test_kernel_manual_registration ``` Reviewed By: cccclai Differential Revision: D49439673 Pull Request resolved: https://github.com/pytorch/pytorch/pull/110086 Approved by: https://github.com/cccclai	2023-09-27 16:04:20 +00:00
Tarun Karuturi	a51b8df261	Add support for event_tracer in codegen layer (#109990 ) Summary: Split out from D48975975, this handles the pytorch specific changes to add support for event_tracer in codegen layer. Test Plan: CI Reviewed By: dbort Differential Revision: D49487710 Pull Request resolved: https://github.com/pytorch/pytorch/pull/109990 Approved by: https://github.com/Jack-Khuu	2023-09-27 09:09:03 +00:00
Brian Hirsh	63526a63f5	Make FunctionalTensor subclass to be more like functorch (interaction with ZeroTensor + Conjugate key) (#109023 ) I added some tests for Conj, Neg and ZeroTensor for both python and C++ functionalization. This also fixes a nasty segfault when running a functorch `jacfwd` test with `torch.compile`, once AOTAutograd is using `FunctionalTensor`. Changes: (1) I use Jeffrey's `make_wrapper_subclass(extra_dispatch_keys)` kwarg to plumb extra dispatch keys onto the wrapper, mirroring what C++ functionalization does (C++ functionalization will mirror all dispatch keys from the inner tensor to the wrapper, except for python and functorch keys). (2) FunctionalTensorMode will decompose CompositeImplicitAutograd ops, since (for example) ZeroTensor kernels can send ops like `.to()` directly to the Python key. We'll need a way to toggle this later for pre-dispatch functionalization. (3) Bound `_ForceDispatchKeyGuard` and BatchedTensorImpl's dispatch keyset to python. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109023 Approved by: https://github.com/zou3519 ghstack dependencies: #108654, #109662, #109632	2023-09-22 07:09:04 +00:00
eellison	067f172930	Serialize Remaining Patterns (#108917 ) Serializes the remaining traced patterns. Pull Request resolved: https://github.com/pytorch/pytorch/pull/108917 Approved by: https://github.com/davidberard98 ghstack dependencies: #109663, #108894	2023-09-20 05:39:23 +00:00
eellison	16d608d70d	Add Python serialization to Pattern Matcher patterns (#108894 ) Adds a Python Pretty Printer to the pattern matcher that serializes patterns as python. Generating our fuse attention patterns was taking 4 seconds of compile time, which will only get worse as we add more variants (which I will do in the rest of this stack). To write out patterns, build pytorch, then run `gen_attention_patterns.py`. Since there is a line limit for PRs i'm only including the _sdpa_pattern1 in this first diff. Pull Request resolved: https://github.com/pytorch/pytorch/pull/108894 Approved by: https://github.com/yanboliang ghstack dependencies: #109663	2023-09-20 05:36:52 +00:00
PyTorch MergeBot	8705fc1bbd	Revert "Add Python serialization to Pattern Matcher patterns (#108894 )" This reverts commit `7db175b6f6`. Reverted https://github.com/pytorch/pytorch/pull/108894 on behalf of https://github.com/eellison due to land race ([comment](https://github.com/pytorch/pytorch/pull/108894#issuecomment-1726649151))	2023-09-19 23:00:03 +00:00
PyTorch MergeBot	8b4b1817c8	Revert "Serialize Remaining Patterns (#108917 )" This reverts commit `7bf08b77f3`. Reverted https://github.com/pytorch/pytorch/pull/108917 on behalf of https://github.com/eellison due to land race ([comment](https://github.com/pytorch/pytorch/pull/108917#issuecomment-1726646267))	2023-09-19 22:54:52 +00:00
eellison	7bf08b77f3	Serialize Remaining Patterns (#108917 ) Serializes the remaining traced patterns. Pull Request resolved: https://github.com/pytorch/pytorch/pull/108917 Approved by: https://github.com/davidberard98 ghstack dependencies: #108894	2023-09-19 20:45:52 +00:00
eellison	7db175b6f6	Add Python serialization to Pattern Matcher patterns (#108894 ) Adds a Python Pretty Printer to the pattern matcher that serializes patterns as python. Generating our fuse attention patterns was taking 4 seconds of compile time, which will only get worse as we add more variants (which I will do in the rest of this stack). To write out patterns, build pytorch, then run `gen_attention_patterns.py`. Since there is a line limit for PRs i'm only including the _sdpa_pattern1 in this first diff. Pull Request resolved: https://github.com/pytorch/pytorch/pull/108894 Approved by: https://github.com/yanboliang	2023-09-19 20:36:52 +00:00
Brian Hirsh	f22b303f65	Add TorchDispatch version of functionalization (#106404 ) This PR adds a new `FunctionalTensor` subclass, and `FunctionalTensorMode` torch dispatch mode. Together, this class/mode are a lightweight wrapper around our existing C++ functionalization logic. This idea came from Ed - later in the stack, I want to be able to run functionalization underneath torch_dispatch, when performing tracing in AOTAutograd. I can't do this easily with vanilla C++ functionalization, because it has a dedicated dispatch key that always runs before TorchDispatch. However, by adding a torch_dispatch mode shim around functionalization, we can use functionalization as a torch_dispatch mode, which will make it easier to run underneath other modes later. This PR provides the basic new classes, and some light testing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/106404 Approved by: https://github.com/ezyang	2023-09-15 20:19:25 +00:00
soulitzer	2bcff92540	Add NestedTensor python subclass (#108314 ) Description coming soon Pull Request resolved: https://github.com/pytorch/pytorch/pull/108314 Approved by: https://github.com/jbschlosser ghstack dependencies: #108808	2023-09-11 18:29:20 +00:00
Joel Schlosser	b928e08f3d	Initial vmap + NT support with unbind fallback (#106786 ) PoC demonstrating vmap + NT based on the [design doc](https://docs.google.com/document/d/1dVVk6TOqz93PLTIneU2T3xaxCs9qZ0MaJyCvOAp_bC0). This PR: * Allows `BatchedTensorImpl`s to contain NTs * Introduces a `BatchedNestedTensor` dispatch key for NT-specific batching rules * Provides a batching rule fallback that unbinds the NTs -> performs computation on constituent -> rebinds results into NT Restrictions: * Only supports one level of vmap * Only supports vmapping over dim=0 for NTs * For operations with mixed NT / dense inputs, support is also limited to dim=0 for the dense inputs Pull Request resolved: https://github.com/pytorch/pytorch/pull/106786 Approved by: https://github.com/zou3519	2023-09-07 13:53:20 +00:00
Brian Hirsh	da54f3c519	reorder proxy / fake modes so they always run last (#104482 ) Update: Made refactor of the original PR. See the original description below, but here I'll describe the updates: (1) TLS changes in `TorchDispatchModeTLS.h/cpp`. I added a `TorchDispatchModeKey` enum, that (for now) just contains PROXY and FAKE. The ModeTLS used to just contain a `std::vector<std::shared_ptr<c10::SafePyObject>>` corresponding to the mode stack. It now also contains a separate array of "infra modes", indexed by mode key (PROXY and FAKE, with a new addition, FUNCTIONAL, coming later in the stack). `TorchDispatchModeTLS::push_onto_stack` and `TorchDispatchModeTLS::pop_stack` are now a bit more complicated. Pushing accepts an optional mode_key, which if set, tells us to add the given mode directly to our "infra_modes" array. Popping will first check the "user mode" stack, before trying to pop anything from the infra mode stack. It also optionally returns the mode key of the mode we popped if there was one - that way if we push that same mode back onto the TLS later, we know where it goes. `TorchDispatchModeTLS::dispatch_mode_enabled()` now accepts an optional `skip_infra_modes` param, so you can separately query if there are "any modes at all", or if there are "any user modes". `TorchDispatchModeTLS::get/set/unset_mode()` all take in a mode key, and get/set/unset the mode at that particular mode key (meaning they are only meant to be used for infra modes). There were also some mild codegen changes to support the new enum (2) `fake_tensor.py/proxy_tensor.py/_python_dispatch.py` The way I tell the infra that certain subclasses/modes are "infra" is through the enum: I gave `FakeTensor` and `FakeTensorMode` a `self._mode_key = torch._C.TorchDispatchModeKey.FAKE`. 
`TorchDispatchMode.__enter/exit__()` (in `_python_dispatch.py`) now check if the current mode has a mode key, and if so they plumb it into any `push_onto_stack()` calls (which eventually instructs `TorchDispatchModeTLS` where to put the mode). Same thing for `ProxyTorchDispatchMode`. I also had to change both of these modes' enter/exit, to handle the fact that there can no longer be multiple proxy/fake modes on the mode stack at once. I updated them both to have a `self.enter_stack: List[Optional[TorchDispatchMode]]` - whenever we push a given mode in `__enter__`, we remove the current ambient fake/proxy mode from the mode stack, and save it in `enter_stack`, so that on exit we can reset the state properly. (3) dispatching logic in `python_arg_parser.cpp` This is where the core dispatching logic changes are. I added two helpers, `dispatch_on_subclass()` and `dispatch_on_mode()`. The overall dispatching order is now:
```
(a) dispatch_on_mode()      # try user modes first (where the mode stack automatically considers infra modes last)
(b) dispatch_on_subclass()  # try user subclasses next (skipping infra subclasses)
(c) dispatch_on_subclass()  # try infra subclasses next (skipping user subclasses)
```
Note that we still want "user subclasses" to run before "infra modes". As Ed helped me realize, this will work today: if proxy/fake modes run in step (a), they'll return NotImplemented if they see a user subclass, allowing us to redispatch to the user subclass. How do (b) and (c) distinguish between user and infra subclasses? Infra subclasses (FakeTensor, and later FunctionalTensor) are required to have a `_mode_key` hidden on the subclass - so we filter via arguments that do/don't have the `_mode_key`. (4) I also changed `DoubleTensor` to `TwoTensor` to minimize confusion (@albanD pointed out that DoubleTensor would be easily confused with `torch.FloatTensor` and friends).
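The push/pop behaviour of `TorchDispatchModeTLS` described in (1) above can be modelled in a few lines of Python. This is a toy sketch, not the C++ implementation. The class names `ModeKey` and `ModeTLS` are stand-ins, and the order in which infra modes are popped is an assumption.

```python
from enum import Enum
from typing import Dict, List, Optional, Tuple


class ModeKey(Enum):
    """Stand-in for the TorchDispatchModeKey enum (FAKE and PROXY for now)."""
    FAKE = 0
    PROXY = 1


class ModeTLS:
    """Toy model: a user mode stack plus one slot per infra mode key."""

    def __init__(self) -> None:
        self.stack: List[object] = []
        self.infra_modes: Dict[ModeKey, Optional[object]] = {k: None for k in ModeKey}

    def push_onto_stack(self, mode: object, mode_key: Optional[ModeKey] = None) -> None:
        if mode_key is None:
            self.stack.append(mode)            # ordinary user mode
        else:
            self.infra_modes[mode_key] = mode  # infra mode goes into its own slot

    def pop_stack(self) -> Tuple[object, Optional[ModeKey]]:
        # User modes are popped first; infra modes only once the stack is empty.
        if self.stack:
            return self.stack.pop(), None
        # Assumption: infra slots are checked from the highest key downwards.
        for key in reversed(list(ModeKey)):
            mode = self.infra_modes[key]
            if mode is not None:
                self.infra_modes[key] = None
                return mode, key
        raise RuntimeError("no dispatch modes to pop")

    def dispatch_mode_enabled(self, skip_infra_modes: bool = False) -> bool:
        if self.stack:
            return True
        if skip_infra_modes:
            return False
        return any(m is not None for m in self.infra_modes.values())


tls = ModeTLS()
tls.push_onto_stack("user_mode")
tls.push_onto_stack("fake_mode", ModeKey.FAKE)  # pushed last, but still popped last
first = tls.pop_stack()
only_infra_left = tls.dispatch_mode_enabled(skip_infra_modes=True)
second = tls.pop_stack()
```

Returning the mode key from `pop_stack` is what lets a caller push the same mode back into the right slot later.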
----- original description below ----- The main purpose of this PR is to fix the "ordering problem" between torch_dispatch modes, where we want to ensure that our Fake and Proxy dispatch modes always run after any dispatch modes created by the user, regardless of where they are in the stack. See this doc for more details: https://docs.google.com/document/d/1COQ291nOZvtFnzGTQMJqoYZ3sttEYFw_7HbfSyL8gcA/edit Full set of changes below. I ended up including a few semi-related changes in this PR that I documented - but if folks would rather I separate them out, happy to try to do that. (1) Add dedicated TLS slots for FakeTensorMode and ProxyTensorMode This is the main component of this PR. There are two new slots, `TorchDispatchModeTLS.fake_mode_` and `TorchDispatchModeTLS.proxy_mode_`, which correspond to a single "global" fake and proxy mode. There is now an invariant that `torchDispatchModeState.stack_` can never contain either of these modes. I also added a `TorchDispatchModeTLS::maybe_highest_mode()` helper that consults the `stack_` as well as both the proxy and fake slots, and returns the highest priority mode - this is because there are a few places in the codebase where we legitimately want to get the highest priority mode, including fake or proxy, if one is set. This also made the implementations of the existing `disable_proxy_modes_tracing()` and `get_innermost_proxy_mode()` marginally simpler. (2) Updated the dispatching logic in handle_torch_function_no_python_arg_parser() This is the function that actually figures out which torch_dispatch implementation to call, given the current mode stack and tensor subclass inputs. This function got marginally more complicated as part of the refactor: First we inspect the mode stack and any non-fake subclass inputs. Then we check for the proxy mode slot. Then we check for the Fake mode slot, before finally checking for any fake subclass inputs. 
(3) new python `_get_fake_tensor_mode()` and `_get_proxy_tensor_mode()` API's Before, if you wanted to see if proxy or fake modes were active in python, you would have to consult the mode stack. Since these two modes are no longer part of the actual mode stack, I added two new API's to directly check if either proxy or fake modes are active. (4) Allow traceable tensor subclasses to access storages from python This is convenient later in the stack, where AOTAutograd needs to detect aliasing of inputs and outputs, where those inputs and outputs might be tensor subclasses. Previously, `x.untyped_storage()` would raise an error if `x` was a subclass. In this PR, I tried to relax this constraint as little as possible: `THPVariable_storage()` will only try to return a storage to python if the tensor subclass that you are passing in is "traceable" (5) Fixed subclass fakeification @wanchaol recently added support to be able to fakeify tensor subclasses. That fakeification logic works in most cases, but there is one case it doesn't handle: autograd metadata. In particular, since autograd sees our tensor subclasses and not their desugared tensors, we need to make sure that our fakeified subclass has the same autograd metadata as the original subclass. I updated `meta_utils.py` to make sure that the autograd metadata is correct. (6) make tensor subclasses resizeable Previously we didn't allow tensor subclasses to be resizeable. I ran into an issue where fakeifying a tensor subclass occasionally requires swapping out its storage, which can involve resizing the tensor. Mechanically, this required updating `at::for_blob()` to expose a way to request that the tensor that you create has resizeable storage, and then using this new API in `_make_wrapper_tensor()`. (7) Added a basic DoubleTensor subclass for testing I use this subclass more later in this stack in my AOTAutograd tests - but it serves as a simple subclass example to test the dispatch ordering in this PR. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104482 Approved by: https://github.com/ezyang ghstack dependencies: #107415	2023-08-29 02:36:48 +00:00
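The lookup order described in item (2) of the commit above can be sketched as a toy model in plain Python. This is only an illustration under stated assumptions: `ModeState`, `pick_handler`, and the string labels are hypothetical names, not PyTorch APIs, and the real logic lives in C++ in `handle_torch_function_no_python_arg_parser()`.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ModeState:
    """Hypothetical stand-in for the TLS state: a user stack plus two dedicated slots."""
    stack: List[str] = field(default_factory=list)  # user modes only, never fake/proxy
    proxy_mode: Optional[str] = None
    fake_mode: Optional[str] = None


def pick_handler(state: ModeState,
                 non_fake_subclass_args: List[str],
                 fake_subclass_args: List[str]) -> Optional[str]:
    """Return the label of the handler that would be tried first."""
    # 1. User modes on the stack, then non-fake subclass inputs
    #    (assumption: modes are consulted before subclass inputs).
    if state.stack:
        return state.stack[-1]
    if non_fake_subclass_args:
        return non_fake_subclass_args[0]
    # 2. The dedicated proxy slot.
    if state.proxy_mode is not None:
        return state.proxy_mode
    # 3. The dedicated fake slot.
    if state.fake_mode is not None:
        return state.fake_mode
    # 4. Finally, fake subclass inputs.
    if fake_subclass_args:
        return fake_subclass_args[0]
    return None
```

In this model a user mode always wins over the proxy and fake slots, regardless of the order in which the modes were entered, which is the invariant the commit message sets out to guarantee.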
cyy	d9fb7166d6	[BE] use DeviceIndex instead of int64_t for related device interfaces (#103068 ) This PR unifies the device interfaces in aten/cpp and torch/csrc/cpp to use c10::DeviceIndex. Pull Request resolved: https://github.com/pytorch/pytorch/pull/103068 Approved by: https://github.com/malfet	2023-08-25 20:16:14 +00:00
Masaki Kozuki	5814380e7b	Revert "Revert "Reland "Add forward mode AD to out-place foreach functions (#102409 ) (#106043 )""" (#106320 ) Fixes a typo in the number of tensors and elements specified in the test that failed in slow gradcheck. Pull Request resolved: https://github.com/pytorch/pytorch/pull/106320 Approved by: https://github.com/soulitzer	2023-08-18 23:01:42 +00:00
Jun Luo	2abcfc40b0	Enable torchgen for MTIA dispatch key (#107046 ) Summary: As title. Test Plan: See diff D48258693 Differential Revision: D48273743 Pull Request resolved: https://github.com/pytorch/pytorch/pull/107046 Approved by: https://github.com/albanD	2023-08-15 07:56:18 +00:00
Tugsbayasgalan Manlaibaatar	20c5add133	[export] Refactor `constrain_as_value` and `constrain_as_size` (#106591 ) Some notable changes: 1. `constrain_as_size` allows the min value to be less than 2, as it will unconditionally assume min >= 2 for compiler purposes. Instead, we add an additional check to make sure the max value is always greater than 2. 2. Previously, we used to runtime-assert on the unbacked symint's value range, which would always be [2, max]. I modified this logic to assert on [0, max] unless the user explicitly specifies the min range. Pull Request resolved: https://github.com/pytorch/pytorch/pull/106591 Approved by: https://github.com/gmagogsfm, https://github.com/ezyang	2023-08-15 05:41:43 +00:00
Mengwei Liu	ddd2f682b9	[executorch] Let custom ops registration code only import ATen headers (#107064 ) Summary: We generate `CustomOpsNativeFunctions.h` for registering custom ops into the PyTorch JIT runtime. This header needs to hook up with the C++ kernel implementations of all the custom ops. For this reason it should include ATen headers instead of Executorch headers. This PR changes it. Test Plan: Rely on existing CI jobs Differential Revision: D48282828 Pull Request resolved: https://github.com/pytorch/pytorch/pull/107064 Approved by: https://github.com/kirklandsign	2023-08-13 00:34:34 +00:00
David Watson	c9cdcb299a	Remove ExclusivelyOwned from register_dispatch_key (#106791 ) This fixes a bug that could occur with Python decompositions. When an operation is intercepted in the C++ code in PyTorch, the outputs are created as `ExclusivelyOwned<at::Tensor>`s. Later on, when it dispatches back to Python for the decomposition, these tensors have their ownership shared with Python. In a normal use case the exclusively owned tensor is released and its value returned as a non-exclusively owned tensor from the operation. However, if the Python decomposition throws an error, the `ExclusivelyOwned` wrapper destroys the `at::Tensor`, leading to a Python reference to a tensor which isn't alive (and meaning PyTorch falls over in debug mode). Note this will be a performance hit when handling errors. Fixes #106790 Pull Request resolved: https://github.com/pytorch/pytorch/pull/106791 Approved by: https://github.com/ezyang	2023-08-11 21:04:33 +00:00
PyTorch MergeBot	745d29b0cc	Revert "[export] Refactor `constrain_as_value` and `constrain_as_size` (#106591 )" This reverts commit `18989890bf`. Reverted https://github.com/pytorch/pytorch/pull/106591 on behalf of https://github.com/izaitsevfb due to Breaks inductor test on trunk ([comment](https://github.com/pytorch/pytorch/pull/106591#issuecomment-1675069091))	2023-08-11 16:37:47 +00:00
Tugsbayasgalan Manlaibaatar	18989890bf	[export] Refactor `constrain_as_value` and `constrain_as_size` (#106591 ) Some notable changes: 1. `constrain_as_size` allows the min value to be less than 2, as it will unconditionally assume min >= 2 for compiler purposes. Instead, we add an additional check to make sure the max value is always greater than 2. 2. Previously, we used to runtime-assert on the unbacked symint's value range, which would always be [2, max]. I modified this logic to assert on [0, max] unless the user explicitly specifies the min range. Pull Request resolved: https://github.com/pytorch/pytorch/pull/106591 Approved by: https://github.com/gmagogsfm, https://github.com/ezyang	2023-08-11 05:29:22 +00:00
Alexander Pivovarov	02abbb8109	Fix some typos, mostly "that that" (#106901 ) Fix some typos Pull Request resolved: https://github.com/pytorch/pytorch/pull/106901 Approved by: https://github.com/janeyx99	2023-08-10 19:46:53 +00:00
PyTorch MergeBot	2b427ae3a7	Revert "Reland "Add forward mode AD to out-place foreach functions (#102409 ) (#106043 )" This reverts commit `e773f28ee3`. Reverted https://github.com/pytorch/pytorch/pull/106043 on behalf of https://github.com/DanilBaibak due to Break slow tests ([comment](https://github.com/pytorch/pytorch/pull/106043#issuecomment-1658642734))	2023-07-31 15:50:36 +00:00
Aaron Gokaslan	52d4b1ae31	[BE]: Enable ruff rules PIE807 and PIE810 (#106218 ) * Enables PIE807 + PIE810. PIE807 flags lambdas that reimplement the `list` builtin, and PIE810 requires fusing multiple `startswith` / `endswith` calls into one (I applied the autofixes for this before we had ruff enabled). Pull Request resolved: https://github.com/pytorch/pytorch/pull/106218 Approved by: https://github.com/albanD	2023-07-28 22:35:56 +00:00
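A minimal sketch of the two patterns these rules target. The `Config` dataclass and the `is_image` helper are hypothetical examples, not code from the PR:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Config:
    # PIE807 would flag `default_factory=lambda: []`;
    # the `list` builtin already does the same job.
    tags: List[str] = field(default_factory=list)


def is_image(name: str) -> bool:
    # PIE810 would flag `name.endswith(".png") or name.endswith(".jpg")`;
    # str.endswith accepts a tuple of suffixes, so one call is enough.
    return name.endswith((".png", ".jpg"))
```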
Masaki Kozuki	e773f28ee3	Reland "Add forward mode AD to out-place foreach functions (#102409 ) (#106043 ) forward-mode AD of out-of-place foreach functions, finally. rel: - #102409 - #105504 - #58833 - #100695 --- # Generated Foreach ```c++ ::std::vector<at::Tensor> _foreach_sinh(c10::DispatchKeySet ks, at::TensorList self) { auto self_ = unpack(self, "self", 0); [[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self ); std::vector<bool> _any_has_forward_grad_result(self.size()); for (const auto& i : c10::irange(self.size())) { _any_has_forward_grad_result[i] = isFwGradDefined(self[i]); } std::shared_ptr<ForeachSinhBackward0> grad_fn; if (_any_requires_grad) { grad_fn = std::shared_ptr<ForeachSinhBackward0>(new ForeachSinhBackward0(), deleteNode); grad_fn->set_next_edges(collect_next_edges( self )); grad_fn->self_ = make_saved_variable_list(self); grad_fn->self_size_ = self.size(); } #ifndef NDEBUG std::vector<c10::optional<Storage>> self__storage_saved(self_.size()); for (const Tensor& tensor : self_) self__storage_saved.push_back( tensor.has_storage() ? 
c10::optional<Storage>(tensor.storage()) : c10::nullopt); std::vector<c10::intrusive_ptr<TensorImpl>> self__impl_saved(self_.size()); for (size_t i=0; i<self_.size(); i++) if (self_[i].defined()) self__impl_saved[i] = self_[i].getIntrusivePtr(); #endif auto _tmp = ([&]() { at::AutoDispatchBelowADInplaceOrView guard; return at::redispatch::_foreach_sinh(ks & c10::after_autograd_keyset, self_); })(); auto result = std::move(_tmp); #ifndef NDEBUG for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) { if (self__storage_saved[i].has_value() && !at::impl::tensorlist_has_dispatch(self_)) TORCH_INTERNAL_ASSERT(self__storage_saved[i].value().is_alias_of(self_[i].storage())); } for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) { if (self__impl_saved[i] && !at::impl::tensorlist_has_dispatch(self_)) TORCH_INTERNAL_ASSERT(self__impl_saved[i] == self_[i].getIntrusivePtr()); } #endif if (grad_fn) { set_history(flatten_tensor_args( result ), grad_fn); } std::vector<c10::optional<at::Tensor>> result_new_fw_grad_opts(self.size(), c10::nullopt); for (const auto& i : c10::irange(result_new_fw_grad_opts.size())) { if (_any_has_forward_grad_result[i]) { auto self_t_raw = toNonOptFwGrad(self[i]); auto self_tensor = toNonOptTensor(self[i]); auto self_t = (self_t_raw.defined() || !self_tensor.defined()) ? self_t_raw : at::_efficientzerotensor(self_tensor.sizes(), self_tensor.options()); auto self_p = toNonOptPrimal(self[i]); result_new_fw_grad_opts[i] = (self_t.conj() * self_p.cosh().conj()).conj(); } } for (const auto& i : c10::irange(result_new_fw_grad_opts.size())) { auto& result_new_fw_grad_opt = result_new_fw_grad_opts[i]; if (result_new_fw_grad_opt.has_value() && result_new_fw_grad_opt.value().defined() && result[i].defined()) { // The hardcoded 0 here will need to be updated once we support multiple levels. 
result[i]._set_fw_grad(result_new_fw_grad_opt.value(), /* level */ 0, /* is_inplace_op */ false); } } return result; } ::std::vector<at::Tensor> _foreach_norm_Scalar(c10::DispatchKeySet ks, at::TensorList self, const at::Scalar & ord) { auto self_ = unpack(self, "self", 0); [[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self ); std::vector<bool> _any_has_forward_grad_result(self.size()); for (const auto& i : c10::irange(self.size())) { _any_has_forward_grad_result[i] = isFwGradDefined(self[i]); } std::shared_ptr<ForeachNormBackward0> grad_fn; if (_any_requires_grad) { grad_fn = std::shared_ptr<ForeachNormBackward0>(new ForeachNormBackward0(), deleteNode); grad_fn->set_next_edges(collect_next_edges( self )); grad_fn->ord = ord; grad_fn->self_ = make_saved_variable_list(self); grad_fn->self_size_ = self.size(); } #ifndef NDEBUG std::vector<c10::optional<Storage>> self__storage_saved(self_.size()); for (const Tensor& tensor : self_) self__storage_saved.push_back( tensor.has_storage() ? 
c10::optional<Storage>(tensor.storage()) : c10::nullopt); std::vector<c10::intrusive_ptr<TensorImpl>> self__impl_saved(self_.size()); for (size_t i=0; i<self_.size(); i++) if (self_[i].defined()) self__impl_saved[i] = self_[i].getIntrusivePtr(); #endif auto _tmp = ([&]() { at::AutoDispatchBelowADInplaceOrView guard; return at::redispatch::_foreach_norm(ks & c10::after_autograd_keyset, self_, ord); })(); auto result = std::move(_tmp); #ifndef NDEBUG for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) { if (self__storage_saved[i].has_value() && !at::impl::tensorlist_has_dispatch(self_)) TORCH_INTERNAL_ASSERT(self__storage_saved[i].value().is_alias_of(self_[i].storage())); } for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) { if (self__impl_saved[i] && !at::impl::tensorlist_has_dispatch(self_)) TORCH_INTERNAL_ASSERT(self__impl_saved[i] == self_[i].getIntrusivePtr()); } #endif if (grad_fn) { set_history(flatten_tensor_args( result ), grad_fn); } std::vector<c10::optional<at::Tensor>> result_new_fw_grad_opts(self.size(), c10::nullopt); for (const auto& i : c10::irange(result_new_fw_grad_opts.size())) { if (_any_has_forward_grad_result[i]) { auto self_t_raw = toNonOptFwGrad(self[i]); auto self_tensor = toNonOptTensor(self[i]); auto self_t = (self_t_raw.defined() || !self_tensor.defined()) ? self_t_raw : at::_efficientzerotensor(self_tensor.sizes(), self_tensor.options()); auto self_p = toNonOptPrimal(self[i]); result_new_fw_grad_opts[i] = norm_jvp(self_p, self_t, ord, result[i]); } } for (const auto& i : c10::irange(result_new_fw_grad_opts.size())) { auto& result_new_fw_grad_opt = result_new_fw_grad_opts[i]; if (result_new_fw_grad_opt.has_value() && result_new_fw_grad_opt.value().defined() && result[i].defined()) { // The hardcoded 0 here will need to be updated once we support multiple levels. 
result[i]._set_fw_grad(result_new_fw_grad_opt.value(), /* level */ 0, /* is_inplace_op */ false); } } if (grad_fn) { grad_fn->result = result; } return result; } ``` # Reference ```c++ at::Tensor sinh(c10::DispatchKeySet ks, const at::Tensor & self) { auto& self_ = unpack(self, "self", 0); [[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self ); [[maybe_unused]] auto _any_has_forward_grad_result = (isFwGradDefined(self)); std::shared_ptr<SinhBackward0> grad_fn; if (_any_requires_grad) { grad_fn = std::shared_ptr<SinhBackward0>(new SinhBackward0(), deleteNode); grad_fn->set_next_edges(collect_next_edges( self )); grad_fn->self_ = SavedVariable(self, false); } #ifndef NDEBUG c10::optional<Storage> self__storage_saved = self_.has_storage() ? c10::optional<Storage>(self_.storage()) : c10::nullopt; c10::intrusive_ptr<TensorImpl> self__impl_saved; if (self_.defined()) self__impl_saved = self_.getIntrusivePtr(); #endif auto _tmp = ([&]() { at::AutoDispatchBelowADInplaceOrView guard; return at::redispatch::sinh(ks & c10::after_autograd_keyset, self_); })(); auto result = std::move(_tmp); #ifndef NDEBUG if (self__storage_saved.has_value() && !at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(self_)) TORCH_INTERNAL_ASSERT(self__storage_saved.value().is_alias_of(self_.storage())); if (self__impl_saved && !at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(self_)) TORCH_INTERNAL_ASSERT(self__impl_saved == self_.getIntrusivePtr()); if (result.has_storage() && !at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(result)) { TORCH_INTERNAL_ASSERT(result.storage().use_count() == 1, "function: sinh"); } if (!at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(result)) TORCH_INTERNAL_ASSERT(result.use_count() <= 1, "function: sinh"); #endif if (grad_fn) { set_history(flatten_tensor_args( result ), grad_fn); } c10::optional<at::Tensor> result_new_fw_grad_opt = c10::nullopt; if (_any_has_forward_grad_result 
&& (result.defined())) { auto self_t_raw = toNonOptFwGrad(self); auto self_tensor = toNonOptTensor(self); auto self_t = (self_t_raw.defined() || !self_tensor.defined()) ? self_t_raw : at::_efficientzerotensor(self_tensor.sizes(), self_tensor.options()); auto self_p = toNonOptPrimal(self); result_new_fw_grad_opt = (self_t.conj() * self_p.cosh().conj()).conj(); } if (result_new_fw_grad_opt.has_value() && result_new_fw_grad_opt.value().defined() && result.defined()) { // The hardcoded 0 here will need to be updated once we support multiple levels. result._set_fw_grad(result_new_fw_grad_opt.value(), /* level */ 0, /* is_inplace_op */ false); } return result; } at::Tensor norm_Scalar(c10::DispatchKeySet ks, const at::Tensor & self, const at::Scalar & p) { auto& self_ = unpack(self, "self", 0); [[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self ); [[maybe_unused]] auto _any_has_forward_grad_result = (isFwGradDefined(self)); std::shared_ptr<NormBackward0> grad_fn; if (_any_requires_grad) { grad_fn = std::shared_ptr<NormBackward0>(new NormBackward0(), deleteNode); grad_fn->set_next_edges(collect_next_edges( self )); grad_fn->p = p; grad_fn->self_ = SavedVariable(self, false); } #ifndef NDEBUG c10::optional<Storage> self__storage_saved = self_.has_storage() ? 
c10::optional<Storage>(self_.storage()) : c10::nullopt; c10::intrusive_ptr<TensorImpl> self__impl_saved; if (self_.defined()) self__impl_saved = self_.getIntrusivePtr(); #endif auto _tmp = ([&]() { at::AutoDispatchBelowADInplaceOrView guard; return at::redispatch::norm(ks & c10::after_autograd_keyset, self_, p); })(); auto result = std::move(_tmp); #ifndef NDEBUG if (self__storage_saved.has_value() && !at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(self_)) TORCH_INTERNAL_ASSERT(self__storage_saved.value().is_alias_of(self_.storage())); if (self__impl_saved && !at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(self_)) TORCH_INTERNAL_ASSERT(self__impl_saved == self_.getIntrusivePtr()); if (result.has_storage() && !at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(result)) { TORCH_INTERNAL_ASSERT(result.storage().use_count() == 1, "function: norm_Scalar"); } if (!at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(result)) TORCH_INTERNAL_ASSERT(result.use_count() <= 1, "function: norm_Scalar"); #endif if (grad_fn) { set_history(flatten_tensor_args( result ), grad_fn); } throw_error_for_complex_autograd(result, "norm"); c10::optional<at::Tensor> result_new_fw_grad_opt = c10::nullopt; if (_any_has_forward_grad_result && (result.defined())) { auto self_t_raw = toNonOptFwGrad(self); auto self_tensor = toNonOptTensor(self); auto self_t = (self_t_raw.defined() || !self_tensor.defined()) ? self_t_raw : at::_efficientzerotensor(self_tensor.sizes(), self_tensor.options()); auto self_p = toNonOptPrimal(self); result_new_fw_grad_opt = norm_jvp(self_p, self_t, p, result); } if (result_new_fw_grad_opt.has_value() && result_new_fw_grad_opt.value().defined() && result.defined()) { // The hardcoded 0 here will need to be updated once we support multiple levels. 
result._set_fw_grad(result_new_fw_grad_opt.value(), /* level */ 0, /* is_inplace_op */ false); } if (grad_fn) { grad_fn->result_ = SavedVariable(result, true); } return result; } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/106043 Approved by: https://github.com/soulitzer	2023-07-27 03:13:24 +00:00
Alan Ji	70b0f1b248	fix some typos (#106018 ) Fixes #ISSUE_NUMBER Fix typos in `test_static_module.cc`, `backend_cutting_test.cc` and `types_base.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/106018 Approved by: https://github.com/awgu	2023-07-26 18:14:44 +00:00
Aaron Gokaslan	6d43c89f37	[BE]: Update Ruff to 0.0.280 (#105724 ) Removes unused loop values in Python dictionary iteration. Automated fix from Ruff master Pull Request resolved: https://github.com/pytorch/pytorch/pull/105724 Approved by: https://github.com/ezyang, https://github.com/janeyx99	2023-07-22 23:03:34 +00:00
Justin Chu	4cc1745b13	[BE] f-stringify torch/ and scripts (#105538 ) This PR is a follow up on the pyupgrade series to convert more strings to use f-strings using `flynt`. - https://docs.python.org/3/reference/lexical_analysis.html#f-strings - https://pypi.org/project/flynt/ Command used: ``` flynt torch/ -ll 120 flynt scripts/ -ll 120 flynt tools/ -ll 120 ``` and excluded `collect_env.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/105538 Approved by: https://github.com/ezyang, https://github.com/malfet	2023-07-21 19:35:24 +00:00
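For illustration, the kind of rewrite such a conversion performs looks roughly like this. The variable names are made up for the example and are not taken from the PR:

```python
name, count = "torch", 3

# Older formatting styles that an f-string conversion tool rewrites:
percent_style = "%s has %d files" % (name, count)
format_style = "{} has {} files".format(name, count)

# Equivalent f-string form:
f_style = f"{name} has {count} files"
```

All three expressions build the same string; the f-string keeps each value next to the place where it is interpolated.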
Amadeusz Skrzypczak	b64bd4a5dd	Add torch.float8_e5m2 and torch.float8_e4m3 data types (#104242 ) Proposal of two float8 variants - e5m2 and e4m3 - based on https://arxiv.org/pdf/2209.05433.pdf Hide all Float8 operator implementations behind `#if !defined(C10_MOBILE)` guard to keep Android build size almost unchanged TODO: - Refactor duplicated code - Clean up unbalanced pragma pop in dtype utils - Add native implementation on the CUDA side Co-authored-by: Nikita Shulga <nshulga@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/104242 Approved by: https://github.com/albanD	2023-07-20 16:09:11 +00:00
PyTorch MergeBot	f2b15772ff	Revert "Add torch.float8_e5m2 and torch.float8_e4m3 data types (#104242 )" This reverts commit `a9804130e5`. Reverted https://github.com/pytorch/pytorch/pull/104242 on behalf of https://github.com/PaliC due to breaks lint (run lintrunner and remerge) ([comment](https://github.com/pytorch/pytorch/pull/104242#issuecomment-1644150284))	2023-07-20 15:37:53 +00:00
Amadeusz Skrzypczak	a9804130e5	Add torch.float8_e5m2 and torch.float8_e4m3 data types (#104242 ) Proposal of two float8 variants - e5m2 and e4m3 - based on https://arxiv.org/pdf/2209.05433.pdf Hide all Float8 operator implementations behind `#if !defined(C10_MOBILE)` guard to keep Android build size almost unchanged TODO: - Refactor duplicated code - Clean up unbalanced pragma pop in dtype utils - Add native implementation on the CUDA side Co-authored-by: Nikita Shulga <nshulga@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/104242 Approved by: https://github.com/albanD	2023-07-20 09:45:45 +00:00
Justin Chu	964d29f312	[BE] Enable ruff's UP rules and autoformat torchgen/ (#105423 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/105423 Approved by: https://github.com/Skylion007	2023-07-18 06:44:20 +00:00
Nikita Shulga	5837e95d30	[Reland] Update mypy to 1.4.1 (#105227 ) This PR re-lands - [Typing] Fix PEP 484 Violation (#105022) - Update mypy to 1.4.1 (#91983) which were reverted due to a conflict with the internal source repo. Mostly fixes for PEP-484 violations (i.e. when default arg is set to None, but type is not annotated as optional) Plus a few real fixes: - Add missing `_get_upgraders_entry_map` to `torch/_C/__init__.pyi` - Add missing return statement to `torch._export. deserialize_graph` - Fix error message in `torch.ao.ns.fx.weight_utils.get_lstm_mod_weights` - Add assert in `torch/optim/optimizer.py` that Optional list is not None TODO (in followup PR): - Fix erroneous `isinstance` check in `torch/ao/quantization/_pt2e/qat_utils.py` Unrelated, to bypass CI failures due to the gcc9 dependency update in Ubuntu-18.04: - Add hack to squash older libstdc++ from conda environment in favor of the one from the OS to `.ci/docker/install_conda.sh` - Update bazel cuda builds to focal, as with libstdc++-6.0.32 bazel builds lose the ability to catch exceptions (probably because they link with cupti statically, but I could not find where it is done) Pull Request resolved: https://github.com/pytorch/pytorch/pull/105227 Approved by: https://github.com/atalman, https://github.com/albanD, https://github.com/Skylion007	2023-07-15 20:30:20 +00:00
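The PEP 484 violation mentioned above (a `None` default on a parameter whose annotation is not optional) can be sketched as follows. The `scale` function is a hypothetical example, not code from the PR:

```python
from typing import Optional


# Rejected by recent mypy versions by default: the annotation says `int`,
# yet the default value is None (an "implicit Optional").
#     def scale(x: float, factor: int = None) -> float: ...

def scale(x: float, factor: Optional[int] = None) -> float:
    """Multiply x by factor, treating a missing factor as 1."""
    return x * (1 if factor is None else factor)
```

Spelling out `Optional[int]` makes the `None` case visible to both the type checker and the reader, and forces the function body to handle it explicitly.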
PyTorch MergeBot	15fd1ea118	Revert "[Reland] Update mypy to 1.4.1 (#105227 )" This reverts commit `c9c4f8efc3`. Reverted https://github.com/pytorch/pytorch/pull/105227 on behalf of https://github.com/atalman due to trying to mitigate ci sev #105248 ([comment](https://github.com/pytorch/pytorch/pull/105227#issuecomment-1636510935))	2023-07-14 22:28:35 +00:00
Dave Bort	d06e1df1aa	[torchgen] Rename executorch's RuntimeContext to KernelRuntimeContext (#104892 ) Rename the context type to match changes in executorch. Differential Revision: [D46977359](https://our.internmc.facebook.com/intern/diff/D46977359/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/104892 Approved by: https://github.com/larryliu0820	2023-07-14 21:15:50 +00:00
Nikita Shulga	c9c4f8efc3	[Reland] Update mypy to 1.4.1 (#105227 ) This PR re-lands - [Typing] Fix PEP 484 Violation (#105022) - Update mypy to 1.4.1 (#91983) which were reverted due to a conflict with the internal source repo. Mostly fixes for PEP-484 violations (i.e. when default arg is set to None, but type is not annotated as optional) Plus a few real fixes: - Add missing `_get_upgraders_entry_map` to `torch/_C/__init__.pyi` - Add missing return statement to `torch._export. deserialize_graph` - Fix error message in `torch.ao.ns.fx.weight_utils.get_lstm_mod_weights` - Add assert in `torch/optim/optimizer.py` that Optional list is not None TODO (in followup PR): - Fix erroneous `isinstance` check in `torch/ao/quantization/_pt2e/qat_utils.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/105227 Approved by: https://github.com/atalman, https://github.com/albanD, https://github.com/Skylion007	2023-07-14 20:45:12 +00:00
PyTorch MergeBot	3c5a494d7a	Revert "Update mypy to 1.4.1 (#91983 )" This reverts commit `634659e262`. Reverted https://github.com/pytorch/pytorch/pull/91983 on behalf of https://github.com/malfet due to It's dependent change was reverted, so reverting this one as well, to keep CI clean ([comment](https://github.com/pytorch/pytorch/pull/91983#issuecomment-1636059709))	2023-07-14 15:59:16 +00:00
Nikita Shulga	634659e262	Update mypy to 1.4.1 (#91983 ) Mostly fixes for PEP-484 violations (i.e. when default arg is set to None, but type is not annotated as optional) Plus a few real fixes: - Add missing `_get_upgraders_entry_map` to `torch/_C/__init__.pyi` - Add missing return statement to `torch._export. deserialize_graph` - Fix error message in `torch.ao.ns.fx.weight_utils.get_lstm_mod_weights` - TODO (in followup PR): - Fix erroneous `isinstance` check in `torch/ao/quantization/_pt2e/qat_utils.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/91983 Approved by: https://github.com/kit1980, https://github.com/ZainRizvi, https://github.com/huydhn, https://github.com/thiagocrepaldi, https://github.com/aaronenyeshi	2023-07-13 16:30:36 +00:00
Jane Xu	038cb4075a	Add capturable/maximize tests to Adam(W) optim configs (#104669 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/104669 Approved by: https://github.com/albanD	2023-07-10 17:38:46 +00:00
PyTorch MergeBot	8958f041be	Revert "Add forward mode AD to out-place foreach functions (#102409 )" This reverts commit `e2ec0ba404`. Reverted https://github.com/pytorch/pytorch/pull/102409 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but it is failing some tests in trunk `e799f565eb` ([comment](https://github.com/pytorch/pytorch/pull/102409#issuecomment-1615254393))	2023-06-30 22:46:57 +00:00
Masaki Kozuki	e2ec0ba404	Add forward mode AD to out-place foreach functions (#102409 ) The major difference from in-place support is that some out-place functions have their derivatives spelled out in derivatives.yaml, which requires some changes in `load_derivatives.py` and some handlings in various places due to the others whose derivatives are generated by `torchgen`. rel: - #58833 - #100695 --- # Generated Foreach ```c++ ::std::vector<at::Tensor> _foreach_sinh(c10::DispatchKeySet ks, at::TensorList self) { auto self_ = unpack(self, "self", 0); [[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self ); std::vector<bool> _any_has_forward_grad_result(self.size()); for (const auto& i : c10::irange(self.size())) { _any_has_forward_grad_result[i] = isFwGradDefined(self[i]); } std::shared_ptr<ForeachSinhBackward0> grad_fn; if (_any_requires_grad) { grad_fn = std::shared_ptr<ForeachSinhBackward0>(new ForeachSinhBackward0(), deleteNode); grad_fn->set_next_edges(collect_next_edges( self )); grad_fn->self_ = make_saved_variable_list(self); grad_fn->self_size_ = self.size(); } #ifndef NDEBUG std::vector<c10::optional<Storage>> self__storage_saved(self_.size()); for (const Tensor& tensor : self_) self__storage_saved.push_back( tensor.has_storage() ? 
c10::optional<Storage>(tensor.storage()) : c10::nullopt); std::vector<c10::intrusive_ptr<TensorImpl>> self__impl_saved(self_.size()); for (size_t i=0; i<self_.size(); i++) if (self_[i].defined()) self__impl_saved[i] = self_[i].getIntrusivePtr(); #endif auto _tmp = ([&]() { at::AutoDispatchBelowADInplaceOrView guard; return at::redispatch::_foreach_sinh(ks & c10::after_autograd_keyset, self_); })(); auto result = std::move(_tmp); #ifndef NDEBUG for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) { if (self__storage_saved[i].has_value() && !at::impl::tensorlist_has_dispatch(self_)) TORCH_INTERNAL_ASSERT(self__storage_saved[i].value().is_alias_of(self_[i].storage())); } for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) { if (self__impl_saved[i] && !at::impl::tensorlist_has_dispatch(self_)) TORCH_INTERNAL_ASSERT(self__impl_saved[i] == self_[i].getIntrusivePtr()); } #endif if (grad_fn) { set_history(flatten_tensor_args( result ), grad_fn); } std::vector<c10::optional<at::Tensor>> result_new_fw_grad_opts(self.size(), c10::nullopt); for (const auto& i : c10::irange(result_new_fw_grad_opts.size())) { if (_any_has_forward_grad_result[i]) { auto self_t_raw = toNonOptFwGrad(self[i]); auto self_tensor = toNonOptTensor(self[i]); auto self_t = (self_t_raw.defined() || !self_tensor.defined()) ? self_t_raw : at::_efficientzerotensor(self_tensor.sizes(), self_tensor.options()); auto self_p = toNonOptPrimal(self[i]); result_new_fw_grad_opts[i] = (self_t.conj() * self_p.cosh().conj()).conj(); } } for (const auto& i : c10::irange(result_new_fw_grad_opts.size())) { auto& result_new_fw_grad_opt = result_new_fw_grad_opts[i]; if (result_new_fw_grad_opt.has_value() && result_new_fw_grad_opt.value().defined() && result[i].defined()) { // The hardcoded 0 here will need to be updated once we support multiple levels. 
result[i]._set_fw_grad(result_new_fw_grad_opt.value(), /* level */ 0, /* is_inplace_op */ false); } } return result; } ::std::vector<at::Tensor> _foreach_norm_Scalar(c10::DispatchKeySet ks, at::TensorList self, const at::Scalar & ord) { auto self_ = unpack(self, "self", 0); [[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self ); std::vector<bool> _any_has_forward_grad_result(self.size()); for (const auto& i : c10::irange(self.size())) { _any_has_forward_grad_result[i] = isFwGradDefined(self[i]); } std::shared_ptr<ForeachNormBackward0> grad_fn; if (_any_requires_grad) { grad_fn = std::shared_ptr<ForeachNormBackward0>(new ForeachNormBackward0(), deleteNode); grad_fn->set_next_edges(collect_next_edges( self )); grad_fn->ord = ord; grad_fn->self_ = make_saved_variable_list(self); grad_fn->self_size_ = self.size(); } #ifndef NDEBUG std::vector<c10::optional<Storage>> self__storage_saved(self_.size()); for (const Tensor& tensor : self_) self__storage_saved.push_back( tensor.has_storage() ? 
c10::optional<Storage>(tensor.storage()) : c10::nullopt); std::vector<c10::intrusive_ptr<TensorImpl>> self__impl_saved(self_.size()); for (size_t i=0; i<self_.size(); i++) if (self_[i].defined()) self__impl_saved[i] = self_[i].getIntrusivePtr(); #endif auto _tmp = ([&]() { at::AutoDispatchBelowADInplaceOrView guard; return at::redispatch::_foreach_norm(ks & c10::after_autograd_keyset, self_, ord); })(); auto result = std::move(_tmp); #ifndef NDEBUG for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) { if (self__storage_saved[i].has_value() && !at::impl::tensorlist_has_dispatch(self_)) TORCH_INTERNAL_ASSERT(self__storage_saved[i].value().is_alias_of(self_[i].storage())); } for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) { if (self__impl_saved[i] && !at::impl::tensorlist_has_dispatch(self_)) TORCH_INTERNAL_ASSERT(self__impl_saved[i] == self_[i].getIntrusivePtr()); } #endif if (grad_fn) { set_history(flatten_tensor_args( result ), grad_fn); } std::vector<c10::optional<at::Tensor>> result_new_fw_grad_opts(self.size(), c10::nullopt); for (const auto& i : c10::irange(result_new_fw_grad_opts.size())) { if (_any_has_forward_grad_result[i]) { auto self_t_raw = toNonOptFwGrad(self[i]); auto self_tensor = toNonOptTensor(self[i]); auto self_t = (self_t_raw.defined() || !self_tensor.defined()) ? self_t_raw : at::_efficientzerotensor(self_tensor.sizes(), self_tensor.options()); auto self_p = toNonOptPrimal(self[i]); result_new_fw_grad_opts[i] = norm_jvp(self_p, self_t, ord, result[i]); } } for (const auto& i : c10::irange(result_new_fw_grad_opts.size())) { auto& result_new_fw_grad_opt = result_new_fw_grad_opts[i]; if (result_new_fw_grad_opt.has_value() && result_new_fw_grad_opt.value().defined() && result[i].defined()) { // The hardcoded 0 here will need to be updated once we support multiple levels. 
result[i]._set_fw_grad(result_new_fw_grad_opt.value(), /* level */ 0, /* is_inplace_op */ false); } } if (grad_fn) { grad_fn->result = result; } return result; } ``` # Reference ```c++ at::Tensor sinh(c10::DispatchKeySet ks, const at::Tensor & self) { auto& self_ = unpack(self, "self", 0); [[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self ); [[maybe_unused]] auto _any_has_forward_grad_result = (isFwGradDefined(self)); std::shared_ptr<SinhBackward0> grad_fn; if (_any_requires_grad) { grad_fn = std::shared_ptr<SinhBackward0>(new SinhBackward0(), deleteNode); grad_fn->set_next_edges(collect_next_edges( self )); grad_fn->self_ = SavedVariable(self, false); } #ifndef NDEBUG c10::optional<Storage> self__storage_saved = self_.has_storage() ? c10::optional<Storage>(self_.storage()) : c10::nullopt; c10::intrusive_ptr<TensorImpl> self__impl_saved; if (self_.defined()) self__impl_saved = self_.getIntrusivePtr(); #endif auto _tmp = ([&]() { at::AutoDispatchBelowADInplaceOrView guard; return at::redispatch::sinh(ks & c10::after_autograd_keyset, self_); })(); auto result = std::move(_tmp); #ifndef NDEBUG if (self__storage_saved.has_value() && !at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(self_)) TORCH_INTERNAL_ASSERT(self__storage_saved.value().is_alias_of(self_.storage())); if (self__impl_saved && !at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(self_)) TORCH_INTERNAL_ASSERT(self__impl_saved == self_.getIntrusivePtr()); if (result.has_storage() && !at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(result)) { TORCH_INTERNAL_ASSERT(result.storage().use_count() == 1, "function: sinh"); } if (!at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(result)) TORCH_INTERNAL_ASSERT(result.use_count() <= 1, "function: sinh"); #endif if (grad_fn) { set_history(flatten_tensor_args( result ), grad_fn); } c10::optional<at::Tensor> result_new_fw_grad_opt = c10::nullopt; if (_any_has_forward_grad_result 
&& (result.defined())) { auto self_t_raw = toNonOptFwGrad(self); auto self_tensor = toNonOptTensor(self); auto self_t = (self_t_raw.defined() || !self_tensor.defined()) ? self_t_raw : at::_efficientzerotensor(self_tensor.sizes(), self_tensor.options()); auto self_p = toNonOptPrimal(self); result_new_fw_grad_opt = (self_t.conj() * self_p.cosh().conj()).conj(); } if (result_new_fw_grad_opt.has_value() && result_new_fw_grad_opt.value().defined() && result.defined()) { // The hardcoded 0 here will need to be updated once we support multiple levels.
result._set_fw_grad(result_new_fw_grad_opt.value(), /* level */ 0, /* is_inplace_op */ false); } return result; } at::Tensor norm_Scalar(c10::DispatchKeySet ks, const at::Tensor & self, const at::Scalar & p) { auto& self_ = unpack(self, "self", 0); [[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self ); [[maybe_unused]] auto _any_has_forward_grad_result = (isFwGradDefined(self)); std::shared_ptr<NormBackward0> grad_fn; if (_any_requires_grad) { grad_fn = std::shared_ptr<NormBackward0>(new NormBackward0(), deleteNode); grad_fn->set_next_edges(collect_next_edges( self )); grad_fn->p = p; grad_fn->self_ = SavedVariable(self, false); } #ifndef NDEBUG c10::optional<Storage> self__storage_saved = self_.has_storage() ?
c10::optional<Storage>(self_.storage()) : c10::nullopt; c10::intrusive_ptr<TensorImpl> self__impl_saved; if (self_.defined()) self__impl_saved = self_.getIntrusivePtr(); #endif auto _tmp = ([&]() { at::AutoDispatchBelowADInplaceOrView guard; return at::redispatch::norm(ks & c10::after_autograd_keyset, self_, p); })(); auto result = std::move(_tmp); #ifndef NDEBUG if (self__storage_saved.has_value() && !at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(self_)) TORCH_INTERNAL_ASSERT(self__storage_saved.value().is_alias_of(self_.storage())); if (self__impl_saved && !at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(self_)) TORCH_INTERNAL_ASSERT(self__impl_saved == self_.getIntrusivePtr()); if (result.has_storage() && !at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(result)) { TORCH_INTERNAL_ASSERT(result.storage().use_count() == 1, "function: norm_Scalar"); } if (!at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(result)) TORCH_INTERNAL_ASSERT(result.use_count() <= 1, "function: norm_Scalar"); #endif if (grad_fn) { set_history(flatten_tensor_args( result ), grad_fn); } throw_error_for_complex_autograd(result, "norm"); c10::optional<at::Tensor> result_new_fw_grad_opt = c10::nullopt; if (_any_has_forward_grad_result && (result.defined())) { auto self_t_raw = toNonOptFwGrad(self); auto self_tensor = toNonOptTensor(self); auto self_t = (self_t_raw.defined() || !self_tensor.defined()) ? self_t_raw : at::_efficientzerotensor(self_tensor.sizes(), self_tensor.options()); auto self_p = toNonOptPrimal(self); result_new_fw_grad_opt = norm_jvp(self_p, self_t, p, result); } if (result_new_fw_grad_opt.has_value() && result_new_fw_grad_opt.value().defined() && result.defined()) { // The hardcoded 0 here will need to be updated once we support multiple levels.
result._set_fw_grad(result_new_fw_grad_opt.value(), /* level */ 0, /* is_inplace_op */ false); } if (grad_fn) { grad_fn->result_ = SavedVariable(result, true); } return result; } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/102409 Approved by: https://github.com/soulitzer	2023-06-30 04:51:43 +00:00
Jack Khuu	18dacf7e79	[Specialized Kernel] Update yaml syntax to use kernel instead of dispatch (#104070 ) Based on this [code search](https://fburl.com/code/gjcnw8ly) (.yaml with `dispatch: CPU:`), update all files found to use ``` kernels: - arg_meta: None kernel_name: ``` instead of ``` dispatch: CPU: ``` --- ## Code changes: - `fbcode/executorch/codegen/tools/gen_oplist.py` - Strip ET specific fields prior to calling parse_native_yaml_struct --- ## Files edited that are not `functions.yaml` or `custom_ops.yaml` - fbcode/executorch/kernels/optimized/optimized.yaml - fbcode/executorch/kernels/quantized/quantized.yaml - fbcode/executorch/kernels/test/custom_kernel_example/my_functions.yaml --- ## Found Files that were not edited Dispatched to more than just CPU - fbcode/caffe2/aten/src/ATen/native/native_functions.yaml - xplat/caffe2/aten/src/ATen/native/native_functions.yaml - xros/third-party/caffe2/caffe2/aten/src/ATen/native/native_functions.yaml Grouped ops.yaml path - fbcode/on_device_ai/Assistant/Jarvis/min_runtime/operators/ops.yaml --- Design Doc: https://docs.google.com/document/d/1gq4Wz2R6verKJ2EFseLyPdAF0wqomnCrVDDJpRkYsRw/edit?kh_source=GDOCS#heading=h.8raqyft9y50 Differential Revision: [D46952067](https://our.internmc.facebook.com/intern/diff/D46952067/) NOTE FOR REVIEWERS: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D46952067/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/104070 Approved by: https://github.com/larryliu0820	2023-06-27 09:53:20 +00:00
Driss Guessous	4a008d268a	REDO of dropout support for mem eff #102038 (#103704 ) This is a new PR with the changes from #102038 and #103201, plus namespacing changes to fix a bug. # Summary This PR builds off of: - https://github.com/pytorch/pytorch/pull/101847 - https://github.com/pytorch/pytorch/pull/100583 It specifically adds dropout support to the memory efficient attention kernel. In the process of doing so roughly 3 changes were made: - Update sdpa dispatching to allow for inputs requiring grad to be sent to efficient attention - Update how memory efficient attention handles passing the rng state from forward to backward in order to enable cuda_graph support - Fix a bug in the kernel that was causing incorrect gradients to be produced for num_keys > 64 with dropout and causal masking set. https://github.com/facebookresearch/xformers/pull/755 Pull Request resolved: https://github.com/pytorch/pytorch/pull/103704 Approved by: https://github.com/cpuhrsch	2023-06-26 23:05:03 +00:00
Jack Khuu	d1c367470b	[Specialized Kernel] Remove requirement for type_alias and dim_order_alias to be present (#104006 ) These fields are not required when the kernels provided do not use aliases (e.g. only a default kernel). Differential Revision: [D46916099](https://our.internmc.facebook.com/intern/diff/D46916099/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/104006 Approved by: https://github.com/larryliu0820	2023-06-23 16:49:57 +00:00
Mengwei Liu	ce845dfe49	[Reland][ET] Select used et_kernel_metadata only (#104005 ) Summary: Currently we rely on root operator, but we also need to check for et_kernel_metadata for used specialized kernels. Test Plan: contbuild & OSS CI Reviewed By: Jack-Khuu Differential Revision: D46882119 Pull Request resolved: https://github.com/pytorch/pytorch/pull/104005 Approved by: https://github.com/Jack-Khuu	2023-06-23 14:38:45 +00:00
PyTorch MergeBot	08a7d60a46	Revert "[Reland][ET] Select used et_kernel_metadata only (#103705 )" This reverts commit `59a01c49ee`. Reverted https://github.com/pytorch/pytorch/pull/103705 on behalf of https://github.com/osalpekar due to large number of internal failures in executorch contbuild. See [D46882119](https://www.internalfb.com/diff/D46882119) for more details ([comment](https://github.com/pytorch/pytorch/pull/103705#issuecomment-1601789900))	2023-06-21 22:51:38 +00:00
Brian Hirsh	3cfd677b1f	fix inference mode / PyDispatcher / Functionalize interaction (#103275 ) Fixes https://github.com/pytorch/pytorch/issues/103132 This is kind of annoying: Functionalization (and also vmap, I think?) manually figures out which ops have C++ CompositeImplicit decomps, and directly registers them to the Functionalize key. This is a problem for the PyDispatcher: We normally want the PyDispatcher to take precedence over the regular dispatcher. But in this case, we have a python decomp registered to `CompositeImplicitAutograd`, and a C++ decomp registered directly to the `Functionalize` key, so the C++ decomp gets precedence over the python decomp. The way this showed up was that a model was running `matmul()` under inference mode, so we never hit the autograd dispatch key, and go straight to the functionalize dispatch key. Matmul has both a python decomp and a c++ decomp, but we were running the C++ decomp. That C++ decomp isn't meant to be used with dynamic shapes, so we were failing with the "tried to call `.sizes()` on a tensor with dynamic shapes" error. For now, I had the PyDispatcher mimic the behavior of functionalization codegen: when you register a python decomp to the `CompositeImplicitAutograd` key, this PR just automatically registers that decomp to the `Functionalize` key at the same time. I'm trying to remember now why we didn't just add `Functionalize` (and all of the other functorch transform keys) directly to the `CompositeImplicitAutograd` alias keyset, but I couldn't remember (@zou3519 any chance you remember?). Pull Request resolved: https://github.com/pytorch/pytorch/pull/103275 Approved by: https://github.com/ezyang, https://github.com/zou3519	2023-06-21 15:19:55 +00:00
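The precedence problem described in #103275 can be illustrated with a toy model in plain Python. Nothing here is PyTorch API: `ToyOp`, `py_impl`, `resolve`, and the string keys are hypothetical stand-ins, and the lookup order (Python kernel for the exact key, then C++ kernel for the exact key, then the alias key) is an assumed simplification of the real dispatcher.

```python
ALIAS = "CompositeImplicitAutograd"
FUNCTIONALIZE = "Functionalize"


class ToyOp:
    """Hypothetical operator with separate Python and C++ kernel tables."""

    def __init__(self, cpp_kernels):
        self.cpp_kernels = dict(cpp_kernels)
        self.py_kernels = {}

    def py_impl(self, key, fn, mirror=True):
        self.py_kernels[key] = fn
        # Mirroring mimics what the commit message describes: a Python decomp
        # registered to the alias key is also registered to Functionalize.
        if mirror and key == ALIAS:
            self.py_kernels.setdefault(FUNCTIONALIZE, fn)

    def resolve(self, key):
        # Assumed order: exact-key Python kernel, exact-key C++ kernel, alias.
        for table in (self.py_kernels, self.cpp_kernels):
            if key in table:
                return table[key]
        return self.py_kernels.get(ALIAS, self.cpp_kernels.get(ALIAS))


def resolved_decomp(mirror):
    """Return which decomp the toy op picks at the Functionalize key."""
    op = ToyOp({FUNCTIONALIZE: "cpp_decomp", ALIAS: "cpp_decomp"})
    op.py_impl(ALIAS, "py_decomp", mirror=mirror)
    return op.resolve(FUNCTIONALIZE)
```

In this model, without mirroring the C++ decomp registered directly to `Functionalize` shadows the Python one; with mirroring, the Python decomp is found first.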
Hansong Zhang	59a01c49ee	[Reland][ET] Select used et_kernel_metadata only (#103705 ) Currently we rely on root operator, but we also need to check for et_kernel_metadata for used specialized kernels. Pull Request resolved: https://github.com/pytorch/pytorch/pull/103705 Approved by: https://github.com/larryliu0820	2023-06-18 00:33:28 +00:00
xuanqi	b27c3558a4	[RFC]: Create aten native op for constrain_range (#103346 ) At a high level, the current implementation of the constraint functions (`constrain_as_*`) raises an exception for the following code snippet: ``` def f(x): a = x.item() constrain_as_size(a, 4, 7) return torch.empty((a, 4)) inp = torch.tensor([5]) ep = torch._export.export(f, (inp,)) ``` The reason is that the current constraint logic: 1) Is purely Python, so it won't survive AOT export (the whole node is gone after AOT export, since AOT export only keeps aten-level ops). 2) Uses a side effect to add range constraints to the traced symbol's shape env ([code](`9591e52880/torch/fx/experimental/symbolic_shapes.py (L370-L372)`)). 3) If runtime assertions are turned on (the default), [`_AddRuntimeAssertionsForConstraintsPass`](`9591e52880/torch/_export/passes/add_runtime_assertions_for_constraints_pass.py (L98-L100)`) tries to append assertion nodes based on the range constraints extracted from the symbol's shape env during another interpretation round. 4) However, because of 1), the range constraint logic does not run for symbols generated during the AOT export round, so no range constraint information is available later for the assertion round, which causes the issue. 5) As a result, it fails at `torch.empty((a, 4))` (there is no constraint stating that `a` must be positive). The fix here is to implement the range constraint logic as a native aten op (with a no-op CPU implementation) so that it survives AOT export. **NOTE:** [Logic](`2d745b95d7/torch/fx/experimental/symbolic_shapes.py (L350-L365C15)`) within [`constrain_range`](`2d745b95d7/torch/fx/experimental/symbolic_shapes.py (LL313C74-L313C74)`) is split out as `constrain_range_int` to handle the case where a non-`SymInt` is passed in, and is reused in the new `_constrain_range`. The reason is that when a non-`SymInt` is provided: * If it directly calls `sym_constrain_range`, the C++ version will be called, which is a no-op.
* So in this case it calls `constrain_range_int` instead, to catch issues such as a user providing an input whose tensor shape could be out of range during export, as in the following variant of the code example above: ``` ... inp = torch.tensor([10]) ep = torch._export.export(f, (inp,)) # immediately raises an error ``` Differential Revision: [D46734204](https://our.internmc.facebook.com/intern/diff/D46734204) Pull Request resolved: https://github.com/pytorch/pytorch/pull/103346 Approved by: https://github.com/tugsbayasgalan	2023-06-16 14:55:40 +00:00
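The eager check on a concrete (non-`SymInt`) value described in #103346 can be sketched in plain Python. This is not the body of `constrain_range_int`: the helper names `check_int_in_range` and `fails_eagerly` are hypothetical, and inclusive bounds are an assumption made for illustration.

```python
def check_int_in_range(value, min_val=None, max_val=None):
    """Raise ValueError if a concrete int violates the requested range.

    Bounds are treated as inclusive (an assumption); None means unbounded.
    """
    if min_val is not None and value < min_val:
        raise ValueError(f"{value} is below the minimum {min_val}")
    if max_val is not None and value > max_val:
        raise ValueError(f"{value} is above the maximum {max_val}")
    return value


def fails_eagerly(value, min_val, max_val):
    """Report whether the check rejects the value right away."""
    try:
        check_int_in_range(value, min_val, max_val)
    except ValueError:
        return True
    return False
```

With the range `[4, 7]` from the commit message, the value 5 passes, while 10 is rejected immediately rather than at some later assertion pass.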
PyTorch MergeBot	8553f9c896	Revert "[ET] Select used et_kernel_metadata only (#103658 )" This reverts commit `480d20cac1`. Reverted https://github.com/pytorch/pytorch/pull/103658 on behalf of https://github.com/malfet due to Broke Windows builds ([comment](https://github.com/pytorch/pytorch/pull/103658#issuecomment-1593696503))	2023-06-15 20:41:45 +00:00
Hansong Zhang	480d20cac1	[ET] Select used et_kernel_metadata only (#103658 ) Currently we rely on root operator, but we also need to check for et_kernel_metadata for used specialized kernels. Pull Request resolved: https://github.com/pytorch/pytorch/pull/103658 Approved by: https://github.com/larryliu0820	2023-06-15 19:05:04 +00:00
cyy	f2900420da	fix missing-prototypes warnings in torch_cpu (Part 6) (#101845 ) This PR fixes more missing-prototypes violations in the torch_cpu source following PRs #100053, #100147, #100245, #100849 and #101788 Pull Request resolved: https://github.com/pytorch/pytorch/pull/101845 Approved by: https://github.com/albanD	2023-06-15 16:48:28 +00:00
Jack Khuu	e9674d146c	[Specialized Kernel] Propagate Specialized Kernel Support through ComputeCodegenUnboxedKernels (#103113 ) Updating ComputeCodegenUnboxedKernels to accept and write out kernel information to RegisterCodegenUnboxedKernels.cpp Differential Revision: [D46486195](https://our.internmc.facebook.com/intern/diff/D46486195/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/103113 Approved by: https://github.com/larryliu0820, https://github.com/kirklandsign	2023-06-14 10:18:16 +00:00
Yinghai Lu	4c3799447f	Back out "Dropout support for memory efficient attention (#102038 )" & "Two small mem_eff bug fixes (#103201 )" (#103464 ) Summary: Original commit changeset: 04c4473d8510 Original Phabricator Diff: D46584152 & D46582033 Test Plan: Already explained in summary. Reviewed By: yinghai Differential Revision: D46633283 fbshipit-source-id: c23c2945408988f3c4339dfd5cd40ae46261716c Co-authored-by: Shenxiu Liu <shenxiu@meta.com>	2023-06-12 18:56:48 -07:00
Nikita Vedeneev	056d92e2a0	sparse.mm backward: performance improvements (#94991 ) `torch.sparse.mm` - faster and without syncs in "most" cases. Pull Request resolved: https://github.com/pytorch/pytorch/pull/94991 Approved by: https://github.com/Skylion007, https://github.com/pearu, https://github.com/cpuhrsch	2023-06-12 20:57:29 +00:00
SherlockNoMad	d997969b8b	[Reland] Add sym_size/stride/numel/storage_offset to native_function.yaml (#103107 ) Differential Revision: D46459100 Pull Request resolved: https://github.com/pytorch/pytorch/pull/103107 Approved by: https://github.com/angelayi, https://github.com/soulitzer	2023-06-12 19:18:49 +00:00
Jack Khuu	d0c0e13b69	[Specialized Kernel] Translate Kernel Assignment Logic from function.yaml to native_functions.yaml (#102576 ) Updating `gen_executorch.translate_native_yaml()` to translate kernel assignments when converting `functions.yaml` to `native_functions.yaml` --- Functions.yaml format: ``` - func: add.out type_alias: T0: [<Type>, <Type>] T1: [<Type>] dim_order_alias: D0: [0, 1, 2, 3] D1: [0, 3, 2, 1] kernels: - arg_meta: null kernel_name: default_impl - arg_meta: self: [T0, D0] other: [T0, D0] out: [T0, D0] kernel_name: test_impl ``` native_functions.yaml format ``` func: add.out(Tensor self, Tensor other, *, Scalar alpha=1, Tensor(a!) out) -> Tensor(a!) kernel: default: default_impl v<Version>/<TYPE Enum>;<DIM Order>|<TYPE Enum>;<DIM Order>|<TYPE Enum>;<DIM Order>: test_impl ``` Example: 'v1/6;0,1,2,3|3;0,1,2,3|6;0,1,2,3' : 'test_impl' ## Note: - If a "kernels" field is not present in functions.yaml (as it currently is), the output is unaffected --- Design Doc: https://docs.google.com/document/d/1gq4Wz2R6verKJ2EFseLyPdAF0wqomnCrVDDJpRkYsRw/edit?kh_source=GDOCS# Differential Revision: [D45971107](https://our.internmc.facebook.com/intern/diff/D45971107/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/102576 Approved by: https://github.com/larryliu0820	2023-06-08 23:42:24 +00:00
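The key format `v<Version>/<TYPE Enum>;<DIM Order>|...` from #102576 can be sketched as a small string builder. This is an illustration only, not the code in `gen_executorch.translate_native_yaml()`: the function `expand_kernel_keys`, the `TYPE_ENUM` table and its values, and the choice to emit one key per combination of type-alias values are all assumptions.

```python
from itertools import product

# Hypothetical dtype-name -> enum table; the values are assumed for illustration.
TYPE_ENUM = {"Int": 3, "Float": 6}


def expand_kernel_keys(arg_meta, type_alias, dim_order_alias, version=1):
    """Return one key string per combination of type-alias values.

    arg_meta maps an argument name to a (type alias, dim-order alias) pair,
    in schema order.
    """
    alias_names = sorted({type_name for type_name, _ in arg_meta.values()})
    keys = []
    for choice in product(*(type_alias[name] for name in alias_names)):
        chosen = dict(zip(alias_names, choice))
        parts = []
        for type_name, dim_name in arg_meta.values():
            dims = ",".join(str(d) for d in dim_order_alias[dim_name])
            parts.append(f"{TYPE_ENUM[chosen[type_name]]};{dims}")
        keys.append(f"v{version}/" + "|".join(parts))
    return keys


example_keys = expand_kernel_keys(
    arg_meta={"self": ("T0", "D0"), "other": ("T1", "D0"), "out": ("T0", "D0")},
    type_alias={"T0": ["Float"], "T1": ["Int"]},
    dim_order_alias={"D0": [0, 1, 2, 3]},
)
```

Under these assumptions, `example_keys` holds the single string shown in the commit message's example; listing two types under `T0` would yield two keys.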
Driss Guessous	606fb882c4	Dropout support for memory efficient attention (#102038 ) # Summary This PR builds off of: - https://github.com/pytorch/pytorch/pull/101847 - https://github.com/pytorch/pytorch/pull/100583 It specifically adds dropout support to the memory efficient attention kernel. In the process of doing so roughly 3 changes were made: - Update sdpa dispatching to allow for inputs requiring grad to be sent to efficient attention - Update how memory efficient attention handles passing the rng state from forward to backward in order to enable cuda_graph support - Fix a bug in the kernel that was causing incorrect gradients to be produced for num_keys > 64 with dropout and causal masking set. https://github.com/facebookresearch/xformers/pull/755 Pull Request resolved: https://github.com/pytorch/pytorch/pull/102038 Approved by: https://github.com/cpuhrsch	2023-06-08 21:50:12 +00:00
Hansong Zhang	47cfcf566a	Add selector.is_et_kernel_key_selected (#103184 ) Summary: This API is used by the gen_executorch.py to check whether a kernel with specified kernel key is used or not. Test Plan: ``` buck test xplat/caffe2/tools:test_torchgen_executorch buck run fbcode//executorch/codegen/tools:test_gen_oplist_real_model ``` Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/103184 Approved by: https://github.com/larryliu0820	2023-06-08 01:10:20 +00:00
Mengwei Liu	eebe0ee141	[Executorch][codegen] Add ETKernelIndex for aggregating all kernels for kernel (#102874 ) Summary: keys and change codegen to take ETKernelIndex We are adding support for dtype and dim order specialized kernel registration. This requires us to reorganize `BackendIndex` (which is a `Dict[DispatchKey, Dict[OperatorName, BackendMetadata]]`) to be `Dict[OperatorName, Dict[ETKernelKey, BackendMetadata]]`. This PR adds new data structures in order to support this change: * `ETKernelKey` to retrieve a certain kernel from the registry. * `ETKernelIndex`, the dictionary from operator name to kernel key to kernel mapping. Note that the codegen logic is not changed yet, we need subsequent diffs to actually generate code for different kernel keys. Test Plan: Added tests Reviewed By: Jack-Khuu Differential Revision: D46407096 Pull Request resolved: https://github.com/pytorch/pytorch/pull/102874 Approved by: https://github.com/Jack-Khuu, https://github.com/kirklandsign	2023-06-03 17:23:42 +00:00
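The reorganization described in #102874, from a mapping keyed by dispatch key to one keyed by operator name and then kernel key, can be sketched with plain dictionaries. The types below are simplified stand-ins, not the torchgen classes: `KernelKey`, `regroup`, and `lookup` are hypothetical names, kernels are plain strings, and mapping every existing entry to a "default" key is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class KernelKey:
    """Hypothetical stand-in for ETKernelKey; frozen so it is hashable."""

    arg_meta: Tuple[str, ...] = ()
    default: bool = False


def regroup(backend_index: Dict[str, Dict[str, str]]) -> Dict[str, Dict[KernelKey, str]]:
    """Turn {dispatch key: {op: kernel}} into {op: {kernel key: kernel}}."""
    kernel_index: Dict[str, Dict[KernelKey, str]] = {}
    for _dispatch_key, ops in backend_index.items():
        for op_name, kernel in ops.items():
            # Assumption: existing entries become the op's default kernel.
            kernel_index.setdefault(op_name, {})[KernelKey(default=True)] = kernel
    return kernel_index


def lookup(kernel_index, op_name, key):
    """Prefer a specialized kernel, falling back to the default one."""
    kernels = kernel_index[op_name]
    return kernels.get(key, kernels.get(KernelKey(default=True)))


index = regroup({"CPU": {"add.out": "add_out_default"}})
float_key = KernelKey(arg_meta=("6;0,1", "6;0,1", "6;0,1"))
index["add.out"][float_key] = "add_out_float"
```

Keying the outer dictionary by operator name keeps all kernels for one operator together, which is what a per-operator selection step needs.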
Nikita Shulga	fb0729054b	Revert "[Executorch][codegen] Add ETKernelIndex for aggregating all kernels for kernel (#102565 )" This reverts commit `019c38624c` / https://github.com/pytorch/pytorch/pull/102565 as it breaks ExecutorchBuilds.	2023-06-01 12:35:23 -07:00
Larry Liu	019c38624c	[Executorch][codegen] Add ETKernelIndex for aggregating all kernels for kernel (#102565 ) keys and change codegen to take ETKernelIndex We are adding support for dtype and dim order specialized kernel registration. This requires us to reorganize `BackendIndex` (which is a `Dict[DispatchKey, Dict[OperatorName, BackendMetadata]]`) to be `Dict[OperatorName, Dict[ETKernelKey, BackendMetadata]]`. This PR adds new data structures in order to support this change: * `ETKernelKey` to retrieve a certain kernel from the registry. * `ETKernelIndex`, the dictionary from operator name to kernel key to kernel mapping. Note that the codegen logic is not changed yet, we need subsequent diffs to actually generate code for different kernel keys. Differential Revision: [D46206339](https://our.internmc.facebook.com/intern/diff/D46206339/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/102565 Approved by: https://github.com/Jack-Khuu	2023-05-31 09:41:36 +00:00
mikey dagitses	9bbee245fe	update rules_python and let bazel install its own pip dependencies (#101405 ) update rules_python and let bazel install its own pip dependencies Summary: This is the official way of doing Python in Bazel. Test Plan: Rely on CI. --- Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/101405). * #101406 * __->__ #101405 Pull Request resolved: https://github.com/pytorch/pytorch/pull/101405 Approved by: https://github.com/vors, https://github.com/huydhn	2023-05-23 06:20:33 +00:00
Masaki Kozuki	ba2bc7df8f	Enable `backward` on `_foreach_zero_` (#101149 ) Currently torchgen cannot find an appropriate `DifferentiabilityInfo` for `_foreach_zero_` because `gen_foreach_derivativeinfo` doesn't correctly make use of `functional_info_by_signature` and `differentiability_infos`, and `is_reference_for_foreach` is a bit too strict for `_foreach_zero_`. Generated code in `VariableType` ```c++ void _foreach_zero_(c10::DispatchKeySet ks, at::TensorList self) { auto self_ = unpack(self, "self", 0); [[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self ); std::vector<c10::optional<at::Tensor>> original_selfs(self.size()); std::vector<std::shared_ptr<ZeroBackward0>> grad_fns; if (_any_requires_grad) { for (const auto& i : c10::irange( self.size() )) { const auto ith_requires_grad = compute_requires_grad(self[i]); check_inplace(self[i], ith_requires_grad); grad_fns.push_back([&]() -> std::shared_ptr<ZeroBackward0> { if (!ith_requires_grad) { return nullptr; } else { auto grad_fn = std::shared_ptr<ZeroBackward0>(new ZeroBackward0(), deleteNode); grad_fn->set_next_edges(collect_next_edges( self[i] )); return grad_fn; } }()); } } #ifndef NDEBUG std::vector<c10::optional<Storage>> self__storage_saved(self_.size()); for (const Tensor& tensor : self_) self__storage_saved.push_back( tensor.has_storage() ?
c10::optional<Storage>(tensor.storage()) : c10::nullopt); std::vector<c10::intrusive_ptr<TensorImpl>> self__impl_saved(self_.size()); for (size_t i=0; i<self_.size(); i++) if (self_[i].defined()) self__impl_saved[i] = self_[i].getIntrusivePtr(); #endif { at::AutoDispatchBelowAutograd guard; at::redispatch::_foreach_zero_(ks & c10::after_autograd_keyset, self_); } #ifndef NDEBUG for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) { if (self__storage_saved[i].has_value() && !at::impl::tensorlist_has_dispatch(self_)) TORCH_INTERNAL_ASSERT(self__storage_saved[i].value().is_alias_of(self_[i].storage())); } for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) { if (self__impl_saved[i] && !at::impl::tensorlist_has_dispatch(self_)) TORCH_INTERNAL_ASSERT(self__impl_saved[i] == self_[i].getIntrusivePtr()); } #endif if (!grad_fns.empty()) { auto differentiable_outputs = flatten_tensor_args( self ); TORCH_INTERNAL_ASSERT(differentiable_outputs.size() == grad_fns.size()); for (const auto& i : c10::irange(grad_fns.size())) { auto grad_fn = grad_fns[i]; if (grad_fn != nullptr) { rebase_history(differentiable_outputs[i], grad_fns[i]); } } } } ``` Rel: - #58833 - #96405 Pull Request resolved: https://github.com/pytorch/pytorch/pull/101149 Approved by: https://github.com/soulitzer	2023-05-17 03:10:13 +00:00
Nikita Shulga	20cf42de2c	Revert "[Reland] Add sym_size/stride/numel/storage_offset to native_function.… (#100749 )" This reverts commit `bb454891ed`.	2023-05-16 18:17:02 -07:00
Edward Z. Yang	b94f143ace	SymIntify convNd and conv_transposeNd, fix inductor symint handling (#101488 ) Fixes https://github.com/pytorch/pytorch/issues/101014 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/101488 Approved by: https://github.com/ngimel	2023-05-16 17:46:52 +00:00
Andrew Gallagher	3b82298265	[caffe2/torchgen] Fix codegen non-determinism (#101286 ) Summary: Fix several cases of leaking set-iteration-order to generated sources, causing non-determinism in generated code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/101286 Approved by: https://github.com/Skylion007, https://github.com/albanD	2023-05-15 18:45:19 +00:00
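The class of bug fixed in #101286 can be shown with a minimal sketch: iterating a Python `set` of strings directly can produce a different order from one interpreter run to the next (string hashing is randomized unless `PYTHONHASHSEED` is fixed), so generated text should be built from a sorted sequence. The function `emit_includes` and the header names are hypothetical, not taken from torchgen.

```python
def emit_includes(headers):
    """Render include lines in a stable order.

    set() removes duplicates; sorted() pins the order so that the output
    does not depend on the set's iteration order.
    """
    return "\n".join(f"#include <{name}>" for name in sorted(set(headers)))


generated = emit_includes(["b.h", "a.h", "b.h"])
```

Sorting at the point of emission keeps the fix local: callers can keep using sets for membership tests without affecting the generated files.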
Aaron Gokaslan	dfe484a3b3	[BE]: Bugfix functorch and some generic typing improvements (#101337 ) Fixes some typing bugs found with newer versions of mypy Pull Request resolved: https://github.com/pytorch/pytorch/pull/101337 Approved by: https://github.com/ezyang	2023-05-14 14:20:56 +00:00
blzheng	ab74744522	add inplace_view tag to resize_as_() (#100786 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/100786 Approved by: https://github.com/jgong5, https://github.com/bdhirsh, https://github.com/eellison	2023-05-13 13:49:14 +00:00
Sherlock Huang	bb454891ed	[Reland] Add sym_size/stride/numel/storage_offset to native_function.… (#100749 ) …yaml (#91… (#91919) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/91919 Approved by: https://github.com/ezyang Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/92402 Reviewed By: ezyang Differential Revision: D42565586 Pulled By: SherlockNoMad fbshipit-source-id: 1c2986e45307e076d239836a1b45441a9fa3c9d9 ghstack-source-id: 969f4928486e04c57aaf98e20e3c3ca946c51613 Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/100749 Approved by: https://github.com/zhxchen17, https://github.com/albanD	2023-05-12 22:57:42 +00:00
Hansong Zhang	fa6df34d30	[ET selective build] add kernel metadata section to selective_build.yaml (#100665 ) Summary: For each op, we have a List[List[dtype;dim-order]]: - the inner list contains the `dtype;dim-order` info for each arg if we have a Tensor/TensorList/OptionalTensorList - the outer list contains different occurrences of dtype/dim-order combinations for that op in the program Example: ``` et_kernel_metadata: aten::add.out: # A list of different dtype/dim-order combinations used in model - # Each contains the list of args of Tensor dtype and dim order if applicable - FLOAT;0,1 - FLOAT;0,1 - NON_TENSOR_ARG - FLOAT;0,1 - FLOAT;0,1 - - INT;0,1 - INT;0,1 - NON_TENSOR_ARG - INT;0,1 - INT;0,1 aten::mul.out: - - FLOAT;0,1 - FLOAT;0,1 - FLOAT;0,1 - FLOAT;0,1 ``` We don't have the arg name so far; we need to parse the schema (functions.yaml) to get that info. We depend on the order of args from that file. Test Plan: `buck run fbcode//executorch/codegen/tools:test_gen_oplist_real_model` Differential Revision: D45551409 Pull Request resolved: https://github.com/pytorch/pytorch/pull/100665 Approved by: https://github.com/larryliu0820	2023-05-09 21:30:01 +00:00
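The `dtype;dim-order` entries in the `et_kernel_metadata` example of #100665 can be decoded with a few lines of Python. This is a sketch over already-loaded data (a plain dict standing in for the parsed YAML); `parse_arg_entry` and `parse_metadata` are hypothetical helpers, and representing `NON_TENSOR_ARG` as `None` is an assumption.

```python
def parse_arg_entry(entry):
    """Split 'FLOAT;0,1' into ('FLOAT', (0, 1)); non-tensor args become None."""
    if entry == "NON_TENSOR_ARG":
        return None
    dtype, dim_order = entry.split(";")
    return dtype, tuple(int(d) for d in dim_order.split(","))


def parse_metadata(et_kernel_metadata):
    """Decode every dtype/dim-order combination listed for every op."""
    return {
        op_name: [[parse_arg_entry(arg) for arg in combo] for combo in combos]
        for op_name, combos in et_kernel_metadata.items()
    }


# Stand-in for the loaded YAML section, shortened from the commit message.
metadata = {
    "aten::mul.out": [["FLOAT;0,1", "FLOAT;0,1", "FLOAT;0,1", "FLOAT;0,1"]],
    "aten::add.out": [["INT;0,1", "INT;0,1", "NON_TENSOR_ARG", "INT;0,1", "INT;0,1"]],
}
parsed = parse_metadata(metadata)
```

Because the entries carry no argument names, the position in each inner list is the only link back to the schema, as the commit message notes.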
PyTorch MergeBot	de02c8bed4	Revert "Rename DispatchKey.PrivateUse1 to custom device in torchgen. (#99406 )" This reverts commit `c0ecd98958`. Reverted https://github.com/pytorch/pytorch/pull/99406 on behalf of https://github.com/ezyang due to we're doing it another way ([comment](https://github.com/pytorch/pytorch/pull/99406#issuecomment-1540295309))	2023-05-09 15:04:16 +00:00
Elias Ellison	16a4075327	Throw if 'dropout' argument name but func does not have nondeterministic_seeded (#100771 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/100771 Approved by: https://github.com/ezyang	2023-05-08 23:34:28 +00:00
Natalia Gimelshein	bfe5f5bbe1	[WIP] enable cuda graphs support for flash attention with dropout (#100196 ) Fixes #99905 Pull Request resolved: https://github.com/pytorch/pytorch/pull/100196 Approved by: https://github.com/drisspg	2023-05-08 16:19:18 +00:00
PyTorch MergeBot	c3aa59c8f5	Revert "[WIP] enable cuda graphs support for flash attention with dropout (#100196 )" This reverts commit `32615618e4`. Reverted https://github.com/pytorch/pytorch/pull/100196 on behalf of https://github.com/clee2000 due to broke no ops build `32615618e4` https://github.com/pytorch/pytorch/actions/runs/4866578063/jobs/8678258318 ([comment](https://github.com/pytorch/pytorch/pull/100196#issuecomment-1532352810))	2023-05-03 01:41:56 +00:00
Natalia Gimelshein	32615618e4	[WIP] enable cuda graphs support for flash attention with dropout (#100196 ) Fixes #99905 Pull Request resolved: https://github.com/pytorch/pytorch/pull/100196 Approved by: https://github.com/drisspg	2023-05-02 23:05:31 +00:00
Masaki Kozuki	6c934a89a7	Skip invalid grads in outplace foreachs' backward (#100256 ) Fixes #100248 Pull Request resolved: https://github.com/pytorch/pytorch/pull/100256 Approved by: https://github.com/soulitzer, https://github.com/albanD	2023-04-29 22:45:26 +00:00
Richard Zou	4135295a76	Excise yaml dependency in torchgen.model (#100203 ) The problem: - The new CustomOp API depends on torchgen.model - torchgen.model imports `yaml` - `yaml` is not a PyTorch runtime dependency To unblock myself, because I'm not sure how long it'll take to convince people yaml should be a PyTorch runtime dependency (unless one of you wants to approve #100166), this PR removes the yaml dependency from torchgen.model. It does so by splitting torchgen.utils (the offender) into torchgen.utils (no yaml) and torchgen.yaml (which uses yaml). Test Plan: - CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/100203 Approved by: https://github.com/ezyang, https://github.com/Skylion007	2023-04-28 13:45:39 +00:00
Tugsbayasgalan Manlaibaatar	d4bf76c2a4	Persist torch.assert in aten graph (#100101 ) This PR introduces a new operator called aten._assert_async.msg, which allows passing a tensor value and assertion message as inputs. As part of TorchDynamo, we're replacing the use of torch._assert with this new operator so that make_fx also knows how to handle assertions. This is a subset of https://github.com/pytorch/pytorch/pull/98878; refer there for historic reviews. Pull Request resolved: https://github.com/pytorch/pytorch/pull/100101 Approved by: https://github.com/jansel	2023-04-28 07:31:43 +00:00
Masaki Kozuki	674018903d	per-Tensor `grad_fn` for in-place foreach functions (#96405 ) Generate a `grad_fn` for each (tuple of) `Tensor`(s) of the same index for `_foreach_foo_` and each `grad_fn` is `FooBackward`. The current status of foreach functions' backward support for the record: - out-place: Implemented, but no optimized implementations like their forward path - in-place: not implemented. I think this check `7eaaefafb3/torchgen/api/autograd.py (L309-L311)` is partly responsible but the difference of signature between out-place and in-place (see https://github.com/pytorch/pytorch/pull/96405#discussion_r1154690940) would prevent in-place from using out-place versions (the logic is around `7eaaefafb3/torchgen/api/autograd.py (L495-L500)`) ```c++ void _foreach_abs_(c10::DispatchKeySet ks, at::TensorList self) { auto self_ = unpack(self, "self", 0); #ifndef NDEBUG std::vector<c10::optional<Storage>> self__storage_saved(self_.size()); for (const Tensor& tensor : self_) self__storage_saved.push_back( tensor.has_storage() ? 
c10::optional<Storage>(tensor.storage()) : c10::nullopt); std::vector<c10::intrusive_ptr<TensorImpl>> self__impl_saved(self_.size()); for (size_t i=0; i<self_.size(); i++) if (self_[i].defined()) self__impl_saved[i] = self_[i].getIntrusivePtr(); #endif { at::AutoDispatchBelowAutograd guard; at::redispatch::_foreach_abs_(ks & c10::after_autograd_keyset, self_); } #ifndef NDEBUG for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) { if (self__storage_saved[i].has_value() && !at::impl::tensorlist_has_dispatch(self_)) AT_ASSERT(self__storage_saved[i].value().is_alias_of(self_[i].storage())); } for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) { if (self__impl_saved[i] && !at::impl::tensorlist_has_dispatch(self_)) AT_ASSERT(self__impl_saved[i] == self_[i].getIntrusivePtr()); } #endif } ``` Related: - #95431 - #95765 for multiple `grad_fn`s logic --- Examples: outputs of `_foreach_add_.List`, `_foreach_addcmul_.ScalarList`, and `_foreach_exp` ```c++ void _foreach_addcmul__ScalarList(c10::DispatchKeySet ks, at::TensorList self, at::TensorList tensor1, at::TensorList tensor2, at::ArrayRef<at::Scalar> scalars) { auto self_ = unpack(self, "self", 0); auto tensor1_ = unpack(tensor1, "tensor1", 1); auto tensor2_ = unpack(tensor2, "tensor2", 2); auto _any_requires_grad = compute_requires_grad( self, tensor1, tensor2 ); (void)_any_requires_grad; std::vector<c10::optional<at::Tensor>> original_selfs(self.size()); std::vector<std::shared_ptr<AddcmulBackward0>> grad_fns; if (_any_requires_grad) { for (const auto& i : c10::irange( self.size() )) { const auto ith_requires_grad = compute_requires_grad(self[i], tensor1[i], tensor2[i]); check_inplace(self[i], ith_requires_grad); grad_fns.push_back([&]() -> std::shared_ptr<AddcmulBackward0> { if (!ith_requires_grad) { return nullptr; } else { auto grad_fn = std::shared_ptr<AddcmulBackward0>(new AddcmulBackward0(), deleteNode); grad_fn->set_next_edges(collect_next_edges( self[i], 
tensor1[i], tensor2[i] )); return grad_fn; } }()); } if (!grad_fns.empty()) { for (const auto& i : c10::irange(grad_fns.size())) { auto grad_fn = grad_fns[i]; if (grad_fn != nullptr) { grad_fn->self_scalar_type = self[i].scalar_type(); grad_fn->tensor1_scalar_type = tensor1[i].scalar_type(); if (grad_fn->should_compute_output(1)) { grad_fn->tensor2_ = SavedVariable(tensor2[i], false); } grad_fn->value = scalars[i]; if (grad_fn->should_compute_output(2)) { grad_fn->tensor1_ = SavedVariable(tensor1[i], false); } grad_fn->tensor2_scalar_type = tensor2[i].scalar_type(); } } } } #ifndef NDEBUG std::vector<c10::optional<Storage>> self__storage_saved(self_.size()); for (const Tensor& tensor : self_) self__storage_saved.push_back( tensor.has_storage() ? c10::optional<Storage>(tensor.storage()) : c10::nullopt); std::vector<c10::intrusive_ptr<TensorImpl>> self__impl_saved(self_.size()); for (size_t i=0; i<self_.size(); i++) if (self_[i].defined()) self__impl_saved[i] = self_[i].getIntrusivePtr(); std::vector<c10::optional<Storage>> tensor1__storage_saved(tensor1_.size()); for (const Tensor& tensor : tensor1_) tensor1__storage_saved.push_back( tensor.has_storage() ? c10::optional<Storage>(tensor.storage()) : c10::nullopt); std::vector<c10::intrusive_ptr<TensorImpl>> tensor1__impl_saved(tensor1_.size()); for (size_t i=0; i<tensor1_.size(); i++) if (tensor1_[i].defined()) tensor1__impl_saved[i] = tensor1_[i].getIntrusivePtr(); std::vector<c10::optional<Storage>> tensor2__storage_saved(tensor2_.size()); for (const Tensor& tensor : tensor2_) tensor2__storage_saved.push_back( tensor.has_storage() ? 
c10::optional<Storage>(tensor.storage()) : c10::nullopt); std::vector<c10::intrusive_ptr<TensorImpl>> tensor2__impl_saved(tensor2_.size()); for (size_t i=0; i<tensor2_.size(); i++) if (tensor2_[i].defined()) tensor2__impl_saved[i] = tensor2_[i].getIntrusivePtr(); #endif { at::AutoDispatchBelowAutograd guard; at::redispatch::_foreach_addcmul_(ks & c10::after_autograd_keyset, self_, tensor1_, tensor2_, scalars); } #ifndef NDEBUG for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) { if (self__storage_saved[i].has_value() && !at::impl::tensorlist_has_dispatch(self_)) TORCH_INTERNAL_ASSERT(self__storage_saved[i].value().is_alias_of(self_[i].storage())); } for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) { if (self__impl_saved[i] && !at::impl::tensorlist_has_dispatch(self_)) TORCH_INTERNAL_ASSERT(self__impl_saved[i] == self_[i].getIntrusivePtr()); } for (size_t i=0; i<tensor1_.size() && !at::impl::dispatch_mode_enabled(); i++) { if (tensor1__storage_saved[i].has_value() && !at::impl::tensorlist_has_dispatch(tensor1_)) TORCH_INTERNAL_ASSERT(tensor1__storage_saved[i].value().is_alias_of(tensor1_[i].storage())); } for (size_t i=0; i<tensor1_.size() && !at::impl::dispatch_mode_enabled(); i++) { if (tensor1__impl_saved[i] && !at::impl::tensorlist_has_dispatch(tensor1_)) TORCH_INTERNAL_ASSERT(tensor1__impl_saved[i] == tensor1_[i].getIntrusivePtr()); } for (size_t i=0; i<tensor2_.size() && !at::impl::dispatch_mode_enabled(); i++) { if (tensor2__storage_saved[i].has_value() && !at::impl::tensorlist_has_dispatch(tensor2_)) TORCH_INTERNAL_ASSERT(tensor2__storage_saved[i].value().is_alias_of(tensor2_[i].storage())); } for (size_t i=0; i<tensor2_.size() && !at::impl::dispatch_mode_enabled(); i++) { if (tensor2__impl_saved[i] && !at::impl::tensorlist_has_dispatch(tensor2_)) TORCH_INTERNAL_ASSERT(tensor2__impl_saved[i] == tensor2_[i].getIntrusivePtr()); } #endif if (!grad_fns.empty()) { auto differentiable_outputs = 
flatten_tensor_args( self ); TORCH_INTERNAL_ASSERT(differentiable_outputs.size() == grad_fns.size()); for (const auto& i : c10::irange(grad_fns.size())) { auto grad_fn = grad_fns[i]; if (grad_fn != nullptr) { rebase_history(differentiable_outputs[i], grad_fns[i]); } } } } ``` ```c++ void _foreach_add__List(c10::DispatchKeySet ks, at::TensorList self, at::TensorList other, const at::Scalar & alpha) { auto self_ = unpack(self, "self", 0); auto other_ = unpack(other, "other", 1); auto _any_requires_grad = compute_requires_grad( self, other ); (void)_any_requires_grad; std::vector<c10::optional<at::Tensor>> original_selfs(self.size()); std::vector<std::shared_ptr<AddBackward0>> grad_fns; if (_any_requires_grad) { for (const auto& i : c10::irange( self.size() )) { const auto ith_requires_grad = compute_requires_grad(self[i], other[i]); check_inplace(self[i], ith_requires_grad); grad_fns.push_back([&]() -> std::shared_ptr<AddBackward0> { if (!ith_requires_grad) { return nullptr; } else { auto grad_fn = std::shared_ptr<AddBackward0>(new AddBackward0(), deleteNode); grad_fn->set_next_edges(collect_next_edges( self[i], other[i] )); return grad_fn; } }()); } if (!grad_fns.empty()) { for (const auto& i : c10::irange(grad_fns.size())) { auto grad_fn = grad_fns[i]; if (grad_fn != nullptr) { grad_fn->other_scalar_type = other[i].scalar_type(); grad_fn->alpha = alpha; grad_fn->self_scalar_type = self[i].scalar_type(); } } } } #ifndef NDEBUG std::vector<c10::optional<Storage>> self__storage_saved(self_.size()); for (const Tensor& tensor : self_) self__storage_saved.push_back( tensor.has_storage() ? 
c10::optional<Storage>(tensor.storage()) : c10::nullopt); std::vector<c10::intrusive_ptr<TensorImpl>> self__impl_saved(self_.size()); for (size_t i=0; i<self_.size(); i++) if (self_[i].defined()) self__impl_saved[i] = self_[i].getIntrusivePtr(); std::vector<c10::optional<Storage>> other__storage_saved(other_.size()); for (const Tensor& tensor : other_) other__storage_saved.push_back( tensor.has_storage() ? c10::optional<Storage>(tensor.storage()) : c10::nullopt); std::vector<c10::intrusive_ptr<TensorImpl>> other__impl_saved(other_.size()); for (size_t i=0; i<other_.size(); i++) if (other_[i].defined()) other__impl_saved[i] = other_[i].getIntrusivePtr(); #endif { at::AutoDispatchBelowAutograd guard; at::redispatch::_foreach_add_(ks & c10::after_autograd_keyset, self_, other_, alpha); } #ifndef NDEBUG for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) { if (self__storage_saved[i].has_value() && !at::impl::tensorlist_has_dispatch(self_)) TORCH_INTERNAL_ASSERT(self__storage_saved[i].value().is_alias_of(self_[i].storage())); } for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) { if (self__impl_saved[i] && !at::impl::tensorlist_has_dispatch(self_)) TORCH_INTERNAL_ASSERT(self__impl_saved[i] == self_[i].getIntrusivePtr()); } for (size_t i=0; i<other_.size() && !at::impl::dispatch_mode_enabled(); i++) { if (other__storage_saved[i].has_value() && !at::impl::tensorlist_has_dispatch(other_)) TORCH_INTERNAL_ASSERT(other__storage_saved[i].value().is_alias_of(other_[i].storage())); } for (size_t i=0; i<other_.size() && !at::impl::dispatch_mode_enabled(); i++) { if (other__impl_saved[i] && !at::impl::tensorlist_has_dispatch(other_)) TORCH_INTERNAL_ASSERT(other__impl_saved[i] == other_[i].getIntrusivePtr()); } #endif if (!grad_fns.empty()) { auto differentiable_outputs = flatten_tensor_args( self ); TORCH_INTERNAL_ASSERT(differentiable_outputs.size() == grad_fns.size()); for (const auto& i : c10::irange(grad_fns.size())) { auto 
grad_fn = grad_fns[i]; if (grad_fn != nullptr) { rebase_history(differentiable_outputs[i], grad_fns[i]); } } } } ... void _foreach_exp_(c10::DispatchKeySet ks, at::TensorList self) { auto self_ = unpack(self, "self", 0); auto _any_requires_grad = compute_requires_grad( self ); (void)_any_requires_grad; std::vector<c10::optional<at::Tensor>> original_selfs(self.size()); std::vector<std::shared_ptr<ExpBackward0>> grad_fns; if (_any_requires_grad) { for (const auto& i : c10::irange( self.size() )) { const auto ith_requires_grad = compute_requires_grad(self[i]); check_inplace(self[i], ith_requires_grad); grad_fns.push_back([&]() -> std::shared_ptr<ExpBackward0> { if (!ith_requires_grad) { return nullptr; } else { auto grad_fn = std::shared_ptr<ExpBackward0>(new ExpBackward0(), deleteNode); grad_fn->set_next_edges(collect_next_edges( self[i] )); return grad_fn; } }()); } } #ifndef NDEBUG std::vector<c10::optional<Storage>> self__storage_saved(self_.size()); for (const Tensor& tensor : self_) self__storage_saved.push_back( tensor.has_storage() ? 
c10::optional<Storage>(tensor.storage()) : c10::nullopt); std::vector<c10::intrusive_ptr<TensorImpl>> self__impl_saved(self_.size()); for (size_t i=0; i<self_.size(); i++) if (self_[i].defined()) self__impl_saved[i] = self_[i].getIntrusivePtr(); #endif { at::AutoDispatchBelowAutograd guard; at::redispatch::_foreach_exp_(ks & c10::after_autograd_keyset, self_); } #ifndef NDEBUG for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) { if (self__storage_saved[i].has_value() && !at::impl::tensorlist_has_dispatch(self_)) TORCH_INTERNAL_ASSERT(self__storage_saved[i].value().is_alias_of(self_[i].storage())); } for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) { if (self__impl_saved[i] && !at::impl::tensorlist_has_dispatch(self_)) TORCH_INTERNAL_ASSERT(self__impl_saved[i] == self_[i].getIntrusivePtr()); } #endif if (!grad_fns.empty()) { auto differentiable_outputs = flatten_tensor_args( self ); TORCH_INTERNAL_ASSERT(differentiable_outputs.size() == grad_fns.size()); for (const auto& i : c10::irange(grad_fns.size())) { auto grad_fn = grad_fns[i]; if (grad_fn != nullptr) { rebase_history(differentiable_outputs[i], grad_fns[i]); } } } if (!grad_fns.empty()) { for (const auto& i : c10::irange(grad_fns.size())) { auto grad_fn = grad_fns[i]; if (grad_fn != nullptr) { grad_fn->result_ = SavedVariable(self[i], true, self[i].is_view()); } } } } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/96405 Approved by: https://github.com/soulitzer	2023-04-28 00:55:04 +00:00
feifan	c0ecd98958	Rename DispatchKey.PrivateUse1 to custom device in torchgen. (#99406 ) I want to use torchgen to generate code, and my yaml file format is the same as `native_functions.yaml`. I will use PrivateUse1, but I don't want to show PrivateUse1 to the user in my yaml file. So I want to achieve the following result (e.g. my device is `YPU`): ``` >>> from torchgen.model import DispatchKey >>> str(DispatchKey.PrivateUse1) "YPU" >>> DispatchKey.parse("YPU") DispatchKey.PrivateUse1 ``` I also thought that not everyone would need this feature, so I added a new function to handle this scenario. Pull Request resolved: https://github.com/pytorch/pytorch/pull/99406 Approved by: https://github.com/ezyang	2023-04-27 03:30:48 +00:00
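The behaviour requested in the message above can be sketched with a stand-alone enum. This is only an illustration, not torchgen's implementation: the `DispatchKey` class below is a toy, and `rename_privateuse1` and `_custom_names` are hypothetical names.

```python
from enum import Enum, auto

# Hypothetical registry of display aliases; torchgen may store this differently.
_custom_names = {}


class DispatchKey(Enum):
    CPU = auto()
    CUDA = auto()
    PrivateUse1 = auto()

    def __str__(self):
        # Show the user-facing alias when one is registered.
        return _custom_names.get(self.name, self.name)

    @staticmethod
    def parse(value):
        # Accept either a registered alias or the canonical member name.
        for canonical, alias in _custom_names.items():
            if value == alias:
                return DispatchKey[canonical]
        return DispatchKey[value]


def rename_privateuse1(alias):
    """Register a display name for PrivateUse1 (hypothetical helper)."""
    _custom_names["PrivateUse1"] = alias


rename_privateuse1("YPU")
shown = str(DispatchKey.PrivateUse1)
parsed = DispatchKey.parse("YPU")
```

Keeping the alias in a separate opt-in registry matches the stated intent that users who do not need the feature see no change.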
Edward Z. Yang	3a5427baf4	Add torch.utils._content_store (#99809 ) Implements a simple content-addressable store for storages (with tensors implemented as cheap references on top), enabling incremental serialization of tensors to disk, which I intend to use in the accuracy repro extractor. Check the comment at the top of torch/utils/_content_store.py for more details on the intended use case. One major piece of this PR is implementing the content hash for tensors. For our prospective use case, we may need to repeatedly hash up to 80 GB of tensor data every time we snapshot (and we may snapshot multiple times). Using a conventional cryptographic hash and hashing each snapshot would likely take on the order of minutes, which seemed too slow to me. So instead, I implemented a crappy hash function that can be run on GPU. It is at least somewhat theoretically grounded: using random parameters generated by Philox, we use the standard shift-multiply and xor sum universal hash family. The hash function is a bit dorky though; instead of properly doing 160-bit math, it just runs 32-bit hash five times and cats them together. By the way, this sets the first precedent for a kernel in the PyTorch library which MUST be torch.compile'd to run (in fact, this kernel does not run in eager mode because of the use of xor_sum, which doesn't actually exist in ATen.) I had to add a few more primitives to inductor, namely randint (over the entire int range) and xor_sum. Fortunately, these primitives are natively supported by Triton/C++, and so they were very easy to plumb through. xor_sum is exposed as a prim, while randint special cases on when low/high span the entire 32-bit signed integer range. Thanks to Jeff Johnson for letting me bounce ideas off him on a Saturday morning lol. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/99809 Approved by: https://github.com/voznesenskym	2023-04-26 18:02:59 +00:00
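The hashing scheme described above (multiply-shift with random parameters, XOR-reduced, five 32-bit results concatenated) can be sketched in plain Python. This is an assumption-laden illustration, not the kernel from the PR: `random.Random` stands in for Philox, the input is a list of 32-bit words rather than tensor storage, and both function names are made up for the example.

```python
import random

MASK64 = 0xFFFFFFFFFFFFFFFF


def multiply_shift_xor_hash(words, seed):
    """32-bit hash: multiply each word by a random odd 64-bit number,
    keep the top 32 bits, and XOR the per-word results together."""
    rng = random.Random(seed)  # stand-in for the Philox generator
    acc = 0
    for w in words:  # each w is assumed to fit in 32 bits
        a = rng.getrandbits(64) | 1  # random odd multiplier per position
        acc ^= ((a * w) & MASK64) >> 32
    return acc


def content_hash(words, rounds=5):
    """Concatenate `rounds` independent 32-bit hashes into one hex digest."""
    return "".join(
        f"{multiply_shift_xor_hash(words, seed):08x}" for seed in range(rounds)
    )


digest = content_hash([1, 2, 3])
```

With five rounds the digest is 160 bits (40 hex characters), which is the "run the 32-bit hash five times and concatenate" shortcut the message mentions.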
Aaron Gokaslan	e2a3817dfd	[BE] Enable C419 rule for any all shortcircuiting (#99890 ) Apparently https://github.com/pytorch/pytorch/pull/78142 made torch.JIT allow for simple generator expressions which allows us to enable rules that replace unnecessary list comprehensions with generators in any/all. This was originally part of #99280 but I split it off into this PR so that it can be easily reverted should anything break. Pull Request resolved: https://github.com/pytorch/pytorch/pull/99890 Approved by: https://github.com/justinchuby, https://github.com/kit1980, https://github.com/malfet	2023-04-25 15:02:13 +00:00
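A small sketch of what the C419 rewrite buys. With a list comprehension every element is evaluated before `any` runs, while a generator lets `any` stop at the first truthy value. The `check` function and `calls` list are illustrative only.

```python
calls = []


def check(x):
    # Record every evaluation so the two forms can be compared.
    calls.append(x)
    return x > 1


calls.clear()
any([check(x) for x in range(5)])  # list is fully built before any() runs
list_calls = len(calls)

calls.clear()
any(check(x) for x in range(5))  # generator: any() stops at the first True
gen_calls = len(calls)
```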
Nikita Karetnikov	42921fc801	[torchgen] accept scalars for unary `SymInt` arrays (#99921 ) Fixes https://github.com/pytorch/pytorch/issues/99907 Pull Request resolved: https://github.com/pytorch/pytorch/pull/99921 Approved by: https://github.com/malfet	2023-04-25 00:49:15 +00:00
Luca Wehrstedt	24bf15fe8d	Support record_stream in dispatch mode (#99529 ) Summary: Issuing a `t.record_stream(s)` call while a `TorchDispatchMode` is active was throwing because PyTorch was unable to convert a c10::Stream back to a Python object. It's now fixed. Fixes https://github.com/pytorch/pytorch/issues/94403 Test Plan: Added a unit test Differential Revision: D45117566 Pull Request resolved: https://github.com/pytorch/pytorch/pull/99529 Approved by: https://github.com/albanD	2023-04-21 07:17:19 +00:00
Ikko Eltociear Ashimine	99c6d46cf7	fix typo in gen_functionalization_type.py (#99303 ) propogate -> propagate Pull Request resolved: https://github.com/pytorch/pytorch/pull/99303 Approved by: https://github.com/kit1980	2023-04-17 22:59:40 +00:00
Nikita Shulga	0be65069d3	[BE] Use `Literal` from `typing` (#98846 ) Since PyTorch is Python-3.8+ compatible framework Pull Request resolved: https://github.com/pytorch/pytorch/pull/98846 Approved by: https://github.com/janeyx99, https://github.com/ZainRizvi, https://github.com/Neilblaze	2023-04-12 05:49:37 +00:00
Edward Z. Yang	16ec7efa49	Don't use f-strings in logging calls (1/X) (#98591 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/98591 Approved by: https://github.com/albanD	2023-04-07 15:52:50 +00:00
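A minimal sketch of why logging calls should avoid f-strings: an f-string is formatted before the logger decides whether to emit the record, whereas `%s`-style arguments are only formatted if the record is actually handled. The `Expensive` class and the logger name are made up for the demonstration.

```python
import logging


class Expensive:
    """Counts how often it is converted to a string."""

    def __init__(self):
        self.formatted = 0

    def __str__(self):
        self.formatted += 1
        return "expensive"


log = logging.getLogger("lazy_logging_demo")
log.setLevel(logging.WARNING)  # DEBUG records are discarded

eager, lazy = Expensive(), Expensive()
log.debug(f"value: {eager}")  # formatted even though nothing is logged
log.debug("value: %s", lazy)  # formatting is deferred, then skipped
```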
Richard Zou	f21a176c03	Python Dispatcher should respect FuncTorchBatchedDecomposition key (#98328 ) Fixes https://github.com/pytorch/pytorch/issues/97425. Python Dispatcher's resolve_key function should be equivalent to computeDispatchTableEntryWithDebug. We added a section to computeDispatchTableEntryWithDebug but forgot to add it to resolve_key. This PR fixes that discrepancy. Test Plan: - new test Pull Request resolved: https://github.com/pytorch/pytorch/pull/98328 Approved by: https://github.com/Chillee, https://github.com/kshitij12345, https://github.com/Neilblaze	2023-04-05 20:32:53 +00:00
dilililiwhy	526b564fa0	Uniformly use elem when checking ListType (#97873 ) An initial attempt to make the arg parser code more readable (made while going through torchgen as a newcomer to understand its logic). Pull Request resolved: https://github.com/pytorch/pytorch/pull/97873 Approved by: https://github.com/ezyang	2023-04-05 12:06:03 +00:00
mikey dagitses	2ac9086987	run buildifier on unified build files (#98141 ) This is pretty tricky. buildifier by default doesn't do much to these files. It does a little more if you tell it that they are `BUILD.bazel` files with -type=build. But it can do even more if you remove the target definitions from the `def define_rules()` wrapper and dedent them. I wrote a little wrapper that does that. I'll submit it at a later date. Differential Revision: [D44606558](https://our.internmc.facebook.com/intern/diff/D44606558/) NOTE FOR REVIEWERS: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D44606558/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/98141 Approved by: https://github.com/ezyang, https://github.com/PaliC	2023-04-04 00:37:19 +00:00
Aaron Gokaslan	9c3fbe7475	[BE] Enable flake8-simplify checks (#97984 ) Enable some sensible flake8-simplify rules. Mainly wanted to enable the SIM101, and `yield from` SIM103 checks. @kit1980 since you wanted to be tagged on this CI check. Enabling this check also helped flag one logical bug so it's definitely beneficial (also fixed in this PR). Pull Request resolved: https://github.com/pytorch/pytorch/pull/97984 Approved by: https://github.com/ezyang	2023-03-31 03:40:21 +00:00
Aaron Gokaslan	47dca20d80	[BE] Enable flake8-comprehension rule C417 (#97880 ) Enables flake8-comprehension rule C417. Ruff autogenerated these fixes to the codebase. Pull Request resolved: https://github.com/pytorch/pytorch/pull/97880 Approved by: https://github.com/ezyang, https://github.com/kit1980, https://github.com/albanD	2023-03-30 14:34:24 +00:00
Sergii Dymchenko	477f3f555f	Simplify by using yield from (#97831 ) The issues were found by the flake8-simplify SIM104 check in a local run. I'll look into adding the check to CI separately. Pull Request resolved: https://github.com/pytorch/pytorch/pull/97831 Approved by: https://github.com/Skylion007	2023-03-29 19:15:24 +00:00
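The kind of rewrite this check suggests, sketched on a toy generator (the function names are illustrative, not taken from the PR):

```python
def flatten_loop(nested):
    # Before: an inner loop that only re-yields each item.
    for inner in nested:
        for item in inner:
            yield item


def flatten_yield_from(nested):
    # After: delegate to the inner iterable directly.
    for inner in nested:
        yield from inner


data = [[1, 2], [], [3]]
before = list(flatten_loop(data))
after = list(flatten_yield_from(data))
```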
Aaron Gokaslan	597b558c51	[BE]: Update flake8 and plugins and fix bugs (#97795 ) Update flake8 and flake8-plugins in lintrunner to a modern version. Enables more checks and makes flake8 checks significantly faster. Added a few additional rule ignores that will need to be fixed in the future. Pull Request resolved: https://github.com/pytorch/pytorch/pull/97795 Approved by: https://github.com/alexsio27444, https://github.com/janeyx99, https://github.com/ezyang	2023-03-28 23:51:55 +00:00
Brian Hirsh	35c9ea89fa	dont bake in defaults when tracing *_like factories (#97564 ) quick fix for https://github.com/pytorch/pytorch/issues/97541. letting CI run to see if there's any fallout Pull Request resolved: https://github.com/pytorch/pytorch/pull/97564 Approved by: https://github.com/ezyang	2023-03-27 22:53:44 +00:00
Mengwei Liu	a524123c91	[torchgen] Bump native function max namespace levels due for internal use case (#97381 ) Summary: As titled. Should be trivial Test Plan: Rely on unit test Differential Revision: D44314834 Pull Request resolved: https://github.com/pytorch/pytorch/pull/97381 Approved by: https://github.com/cccclai	2023-03-23 00:40:37 +00:00
Joel Schlosser	77e73b9b7a	Refactor NT offsets metadata to be a Tensor (#96909 ) It's tedious work, but somebody's gotta do it. Benefits: * Enable access to offsets metadata from Python via private API (for validation, etc.) * Consistency with nested sizes / strides metadata * Needed for SymInt-ifying offsets metadata * more TBD Bonus: * Remove `_tensor` suffixes from metadata / getter names Pull Request resolved: https://github.com/pytorch/pytorch/pull/96909 Approved by: https://github.com/drisspg	2023-03-21 18:51:35 +00:00
Chung-chieh Shan	2c588b3ad5	Allow new_full's fill_value argument type to be complex (#91345 ) It seems that this code should type-check but doesn't: ```python torch.zeros((2,3),dtype=torch.cdouble).new_full((4,5),complex(6,7)) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/91345 Approved by: https://github.com/zou3519, https://github.com/ezyang	2023-03-21 12:34:00 +00:00
Aaron Gokaslan	5471621497	[BE] Remove unnecessary dict comprehensions (#97116 ) Removes unnecessary dict comprehensions that optimize creation of dicts from iterables Pull Request resolved: https://github.com/pytorch/pytorch/pull/97116 Approved by: https://github.com/kit1980	2023-03-20 00:56:57 +00:00
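A sketch of the pattern being removed, assuming the usual flake8-comprehensions case where a dict comprehension merely copies key/value pairs from an iterable (the variable names are illustrative):

```python
pairs = [("a", 1), ("b", 2)]

# Before: a comprehension that only unpacks and repacks each pair.
verbose = {k: v for k, v in pairs}

# After: pass the iterable of pairs straight to dict().
direct = dict(pairs)
```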
BowenBao	60a68477a6	Bump black version to 23.1.0 (#96578 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/96578 Approved by: https://github.com/ezyang	2023-03-15 06:27:59 +00:00
Hansong Zhang	93ff71ec37	[ET] Add RuntimeContext to ET Aten mode (#96084 ) Summary: In ATen mode, we add the RuntimeContext arg, so we have something like ``` TORCH_API inline at::Tensor & gelu_outf(torch::executor::RuntimeContext & context, const at::Tensor & self, c10::string_view approximate, at::Tensor & out) { return at::gelu_outf(self, approximate, out); } ``` and users can call `<namespace like aten>::gelu_outf`, which automatically dispatches to the registered function in the ATen kernel via `at::gelu_outf` (dispatched by the ATen/Functions.h header). In optimized kernel tests, we can now automatically switch between the ATen kernel and the optimized kernel. The implication is that the test must depend on the correctness of codegen; an error in codegen can break the kernel tests. Test Plan: CI Differential Revision: D43777848 Pull Request resolved: https://github.com/pytorch/pytorch/pull/96084 Approved by: https://github.com/larryliu0820	2023-03-08 02:51:47 +00:00
Masaki Kozuki	49f6849f58	Fix codegen logic for foreach derivatives (#95263 ) follow-up https://github.com/pytorch/pytorch/pull/93901. Unexpected numerical mismatches observed in some foreach functions' backward result seemed to be caused by the wrong order of `IndexRangeGenerator::range` call. This pr has `args_with_derivatives` have the same or similar order of `foreach_native_function.func.arguments.flat_non_out` --- what the current master generates for `_foreach_mul.List`: ```cpp variable_list ForeachMulBackward0List::apply(variable_list&& grads) { std::lock_guard<std::mutex> lock(mutex_); TORCH_CHECK(!other_released_, ERR_BACKWARD_TWICE); TORCH_CHECK(!self_released_, ERR_BACKWARD_TWICE); IndexRangeGenerator gen; auto other_ix = gen.range(other_size_); auto self_ix = gen.range(self_size_); variable_list grad_inputs(gen.size()); auto other = unpack_list(other_); auto self = unpack_list(self_); if (task_should_compute_output({ other_ix })) { std::vector<Tensor> grad_result; grad_result.reserve(grads.size()); for (const auto & i : c10::irange(grads.size())) { grad_result.emplace_back(mul_tensor_backward(grads[i], self[i], other[i].scalar_type())); } copy_range(grad_inputs, other_ix, grad_result); } if (task_should_compute_output({ self_ix })) { std::vector<Tensor> grad_result; grad_result.reserve(grads.size()); for (const auto & i : c10::irange(grads.size())) { grad_result.emplace_back(mul_tensor_backward(grads[i], other[i], self[i].scalar_type())); } copy_range(grad_inputs, self_ix, grad_result); } return grad_inputs; } ``` with this PR the generated backward is ```cpp variable_list ForeachMulBackward0List::apply(variable_list&& grads) { std::lock_guard<std::mutex> lock(mutex_); TORCH_CHECK(!self_released_, ERR_BACKWARD_TWICE); TORCH_CHECK(!other_released_, ERR_BACKWARD_TWICE); IndexRangeGenerator gen; auto self_ix = gen.range(self_size_); <----- diff auto other_ix = gen.range(other_size_); <----- diff variable_list grad_inputs(gen.size()); auto self = 
unpack_list(self_); auto other = unpack_list(other_); if (task_should_compute_output({ other_ix })) { std::vector<Tensor> grad_result; grad_result.reserve(grads.size()); for (const auto & i : c10::irange(grads.size())) { grad_result.emplace_back(mul_tensor_backward(grads[i], self[i], other[i].scalar_type())); } copy_range(grad_inputs, other_ix, grad_result); } if (task_should_compute_output({ self_ix })) { std::vector<Tensor> grad_result; grad_result.reserve(grads.size()); for (const auto & i : c10::irange(grads.size())) { grad_result.emplace_back(mul_tensor_backward(grads[i], other[i], self[i].scalar_type())); } copy_range(grad_inputs, self_ix, grad_result); } return grad_inputs; } ``` The change is to fix the order of `self_ix` and `other_ix`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/95263 Approved by: https://github.com/soulitzer	2023-03-04 20:03:54 +00:00
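Why the order of the `gen.range(...)` calls matters can be sketched in Python. The class below is a port of the assumed behaviour of the C++ `IndexRangeGenerator` (each call hands out the next block of gradient slots); `assign_ranges` is a hypothetical helper added for illustration.

```python
class IndexRangeGenerator:
    """Hands out consecutive [start, end) slots in the order requested."""

    def __init__(self):
        self._next = 0

    def range(self, size):
        start = self._next
        self._next += size
        return (start, self._next)

    def size(self):
        return self._next


def assign_ranges(arg_sizes):
    """Reserve one range per argument, following the dict's insertion order."""
    gen = IndexRangeGenerator()
    return {name: gen.range(n) for name, n in arg_sizes.items()}


# The forward signature order is (self, other); each list holds 3 tensors here.
matching = assign_ranges({"self": 3, "other": 3})
swapped = assign_ranges({"other": 3, "self": 3})
```

In the swapped case the gradients for `other` land in the slots that belong to `self`, which would be consistent with the numerical mismatches the message describes.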
Xuehai Pan	22d3ac79d2	[torchgen] Prettify generated type annotations (#95877 ) Changes: 1. Use class inheritance for `torch/return_types.pyi`: Before: ```python max = NamedTuple("max", [("values", Tensor), ("indices", Tensor)]) ``` After: ```python class max(NamedTuple): values: Tensor indices: Tensor ``` ------ 2. Add missing spaces in generated type annotations. 1. Always has a space after `,`. 2. If an argument is annotated, then there need spaces around `=` when it has a default value. ```diff - def func(..., out: Optional[Tensor]=None, ...) -> Tensor: + def func(..., out: Optional[Tensor] = None, ...) -> Tensor: ``` 3. If an argument is not annotated, then there should be no spaces around `=` when it has a default value. ```python def contiguous(self, memory_format=torch.contiguous_format) -> Tensor: ... ``` ------ 3. ~Remove redundant import alias in `torch/nn/functional.pyi`:~ (Reverted) UPDATE: `mypy` needs the alias to work. Before: ```python from .. import conv1d as conv1d from .. import conv2d as conv2d from .. import conv3d as conv3d from .. import conv_transpose1d as conv_transpose1d from .. import conv_transpose2d as conv_transpose2d from .. import conv_transpose3d as conv_transpose3d from .. import conv_tbc as conv_tbc from .. import avg_pool1d as avg_pool1d from .. import relu_ as relu_ from .. import selu_ as selu_ from .. import celu_ as celu_ from .. import rrelu_ as rrelu_ from .. import pixel_shuffle as pixel_shuffle from .. import pixel_unshuffle as pixel_unshuffle from .. import channel_shuffle as channel_shuffle from .. import native_channel_shuffle as native_channel_shuffle from .. import pdist as pdist from .. import cosine_similarity as cosine_similarity ``` After: ```python from .. 
import ( conv1d, conv2d, conv3d, conv_transpose1d, conv_transpose2d, conv_transpose3d, conv_tbc, avg_pool1d, relu_, selu_, celu_, rrelu_, pixel_shuffle, pixel_unshuffle, channel_shuffle, native_channel_shuffle, pdist, cosine_similarity, ) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/95877 Approved by: https://github.com/ezyang	2023-03-03 07:08:40 +00:00
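The two `NamedTuple` spellings from item 1 define equivalent types, which a short sketch can confirm. Here `int` stands in for `Tensor` so the example runs without torch, and the class names are illustrative.

```python
from typing import NamedTuple

# Functional syntax, as previously generated.
MaxFunctional = NamedTuple("MaxFunctional", [("values", int), ("indices", int)])


# Class syntax, as generated after the change.
class MaxClass(NamedTuple):
    values: int
    indices: int


old_style = MaxFunctional(values=1, indices=0)
new_style = MaxClass(values=1, indices=0)
```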
Wonjoo Lee	3095c95828	Fixes for PyTorch/XLA functionalization integration (#94537 ) Some notable changes include: - More asserts in `FunctionalTensorWrapper`, so bugs show up more cleanly in cases where we e.g. forget to wrap an output - Make the *_scatter ops `CompositeExplicitAutogradNonFunctional`, so we get a better error message and XLA doesn't accidentally try to use them - Fix LTC/XLA codegen in core to handle multi-tensor out= ops with no returns - Better erroring: Allow XLA to use the CPU fallback from core in such a way that it always errors on view ops, which XLA should no longer see. - Update MetaConverter to exclude XLA tensors in raising NotImplemented… - Add `_propagate_xla_data` op - Add meta tensor support for some ops Pull Request resolved: https://github.com/pytorch/pytorch/pull/94537 Approved by: https://github.com/bdhirsh	2023-03-02 23:02:34 +00:00
alexdremov	b9e95158d5	[MPS] Fix LSTM backward and forward pass (#95137 ) Fixes #91694 Fixes #92615 Several transpositions were missing from the backward graph when `batch_first=True`. #91694 does not reproduce with `batch_first=False`. After fixing the transpose issue, I thought I could finally use LSTM freely in my project, and then I got horrific results in training, which seems related to #92615. After that I decided to fix LSTM's backward step completely. I collected all my findings in this thread, and it seems I succeeded. Funny enough, the backward tests were completely disabled before and were not passing: ```python @unittest.skipIf(True, "Backward of lstm returns wrong result") def test_lstm_2(self, device="mps", dtype=torch.float32): ``` UPD: the forward pass of the multi-layer version was also wrong due to incorrect `initState, initCell` slices. Tests were passing because the states were initialized with zeros. This is accidentally fixed too. Pull Request resolved: https://github.com/pytorch/pytorch/pull/95137 Approved by: https://github.com/jhavukainen, https://github.com/kulinseth, https://github.com/soulitzer	2023-02-23 17:32:42 +00:00
Peter Bell	bc438af6fe	std/var: support floating point correction value (#94073 ) Ref https://github.com/pytorch/pytorch/issues/61492#issuecomment-1413003480 The array API specifies correction to be `Union[int, float]` while we currently only support integers. https://data-apis.org/array-api/latest/API_specification/generated/array_api.std.html As std/var is calculated currently, the final count of elements is already done in floating point so we can make the correction floating point without any loss of precision or generality. Pull Request resolved: https://github.com/pytorch/pytorch/pull/94073 Approved by: https://github.com/ezyang	2023-02-23 05:50:45 +00:00
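A plain-Python sketch of standard deviation with a floating-point correction, following the usual definition in which the sum of squared deviations is divided by `N - correction`. The function name is illustrative, and the sketch assumes `correction < N`; it does not reproduce how the ATen kernels handle edge cases.

```python
import math


def std_with_correction(values, correction=1.0):
    """Standard deviation with divisor N - correction (correction may be a float)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((v - mean) ** 2 for v in values)
    # The divisor is already a float, so a fractional correction costs nothing.
    return math.sqrt(squared_deviations / (n - correction))


sample = [1.0, 2.0, 3.0, 4.0]
bessel = std_with_correction(sample, correction=1)
population = std_with_correction(sample, correction=0)
fractional = std_with_correction(sample, correction=0.5)
```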
Masaki Kozuki	f54233e273	[foreach] bump tensor's version and define backward via torchgen (as possible) (#93901 ) ## summary - increment tensor versions in inplace foreach functions - add a logic to take care of `ArrayRef<Scalar>` rel: https://github.com/pytorch/pytorch/issues/58833, https://github.com/pytorch/pytorch/pull/89591 Pull Request resolved: https://github.com/pytorch/pytorch/pull/93901 Approved by: https://github.com/albanD	2023-02-20 23:18:07 +00:00
Mengwei Liu	679e5dbfa1	[executorch] Always generate CustomOpsNativeFunctions.h if custom_ops.yaml is present (#95084 ) To match the build system logic, enforce CustomOpsNativeFunctions.h to be generated if we have custom_ops.yaml, even if we don't select any custom ops. Added unit test. Differential Revision: [D43402718](https://our.internmc.facebook.com/intern/diff/D43402718) Pull Request resolved: https://github.com/pytorch/pytorch/pull/95084 Approved by: https://github.com/iseeyuan	2023-02-20 18:54:41 +00:00
Mengwei Liu	41865bd8ed	[executorch] Add RuntimeContext to generated C++ API Signature (#94570 ) Summary: Pass runtime context all the way to kernel level. RegisterCodegenUnboxedKernels.cpp: ``` static Operator operators_to_register[] = { Operator( "aten::add.out", [](torch::executor::RuntimeContext & context, EValue** stack) { EValue& self = stack[0]; EValue& other = stack[1]; EValue& alpha = stack[2]; EValue& out = stack[3]; const torch::executor::Tensor & self_base = self.to<torch::executor::Tensor>(); const torch::executor::Tensor & other_base = other.to<torch::executor::Tensor>(); const torch::executor::Scalar & alpha_base = alpha.to<torch::executor::Scalar>(); torch::executor::Tensor & out_base = out.to<torch::executor::Tensor>(); EXECUTORCH_SCOPE_PROF("native_call_add.out"); torch::executor::aten::add_outf(context, self_base, other_base, alpha_base, out_base); } ), } ``` Functions.h ``` // aten::add.out(Tensor self, Tensor other, *, Scalar alpha=1, Tensor(a!) out) -> Tensor(a!) TORCH_API inline at::Tensor & add_outf(torch::executor::RuntimeContext & context, const at::Tensor & self, const at::Tensor & other, const at::Scalar & alpha, at::Tensor & out) { return torch::executor::native::add_out(self, other, alpha, out); } ``` Test Plan: TBD Differential Revision: D41325633 Pull Request resolved: https://github.com/pytorch/pytorch/pull/94570 Approved by: https://github.com/cccclai	2023-02-16 02:43:18 +00:00
Li-Huai (Allan) Lin	e8dc34eaeb	[MPS] Move max_pool2d to mps dispatch key (#90772 ) Related issue: #77394 This PR also modifies some assertions in the codegen, an explanatory comment for it has been added. Pull Request resolved: https://github.com/pytorch/pytorch/pull/90772 Approved by: https://github.com/albanD	2023-02-16 01:13:08 +00:00
Larry Liu	79783a51da	[torchgen] Loosen the restriction for only allowing 2 nested namespaces for kernels (#94834 ) As titled. We still want to have some restriction to avoid misuse but for internal use case we want to change the limit from 2 to 3. Pull Request resolved: https://github.com/pytorch/pytorch/pull/94834 Approved by: https://github.com/SS-JIA	2023-02-14 21:50:12 +00:00
Aaron Gokaslan	67d9790985	[BE] Apply almost all remaining flake8-comprehension checks (#94676 ) Applies the remaining flake8-comprehension fixes and checks. This change replaces all remaining unnecessary generator expressions with list/dict/set comprehensions, which are more succinct, performant, and better supported by our torch.jit compiler. It also removes useless generators such as `set(a for a in b)`, resolving them into just the set call. Pull Request resolved: https://github.com/pytorch/pytorch/pull/94676 Approved by: https://github.com/ezyang	2023-02-12 01:01:25 +00:00
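A rough illustration of the kind of rewrite described in the flake8-comprehensions commit above. The variable names are made up for this sketch and are not taken from the PR:

```python
# Hypothetical data, only for illustration.
words = ["add", "mul", "add", "sub"]

# Flagged style: a generator expression passed to set() or dict().
unique_before = set(w for w in words)
lengths_before = dict((w, len(w)) for w in words)

# Preferred style: the equivalent set/dict comprehension.
unique_after = {w for w in words}
lengths_after = {w: len(w) for w in words}

# An identity generator such as set(w for w in words) adds nothing;
# calling set() on the iterable directly gives the same result.
unique_direct = set(words)
```

Both spellings build the same objects; the comprehension form simply avoids the intermediate generator.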
Aaron Gokaslan	3d82d8d0ed	[BE] Enable more flake8-comprehensions checks (#94601 ) I applied some flake8 fixes and enabled checking for them in the linter. I also enabled some checks for my previous comprehensions PR. This is a follow up to #94323 where I enable the flake8 checkers for the fixes I made and fix a few more of them. Pull Request resolved: https://github.com/pytorch/pytorch/pull/94601 Approved by: https://github.com/ezyang	2023-02-10 23:40:29 +00:00
Xuehai Pan	a229b4526f	[BE] Prefer dash over underscore in command-line options (#94505 ) Preferring dash over underscore in command-line options. Add `--command-arg-name` to the argument parser. The old arguments with underscores `--command_arg_name` are kept for backward compatibility. Both dashes and underscores are used in the PyTorch codebase. Some argument parsers only have dashes or only have underscores in arguments. For example, the `torchrun` utility for distributed training only accepts underscore arguments (e.g., `--master_port`). Dashes are more common in other command-line tools, and they appear to be the default choice in the Python standard library: `argparse.BooleanOptionalAction`: `4a9dff0e5a/Lib/argparse.py (L893-L895)` ```python class BooleanOptionalAction(Action): def __init__(...): if option_string.startswith('--'): option_string = '--no-' + option_string[2:] _option_strings.append(option_string) ``` It adds `--no-argname`, not `--no_argname`. Also, typing `_` requires pressing the shift or caps-lock key, unlike `-`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/94505 Approved by: https://github.com/ezyang, https://github.com/seemethere	2023-02-09 20:16:49 +00:00
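A minimal sketch of how one `argparse` option can accept both spellings mentioned in the commit above. The `build_parser` helper is hypothetical, and the PR itself may register the aliases differently:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # The dashed spelling is listed first; the underscored spelling is kept
    # as an alias so that existing invocations keep working.
    parser.add_argument(
        "--command-arg-name",
        "--command_arg_name",
        dest="command_arg_name",
        default=None,
    )
    return parser


# Both spellings populate the same attribute on the parsed namespace.
new_style = build_parser().parse_args(["--command-arg-name", "x"])
old_style = build_parser().parse_args(["--command_arg_name", "x"])
```

Passing several option strings to a single `add_argument` call keeps one destination and one help entry, so the two spellings cannot drift apart.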
Xuehai Pan	69e0bda999	[BE] Import `Literal`, `Protocol`, and `Final` from standard library `typing` as of Python 3.8+ (#94490 ) Changes: 1. `typing_extensions -> typing-extensions` in dependency. Use dash rather than underscore to fit the [PEP 503: Normalized Names](https://peps.python.org/pep-0503/#normalized-names) convention. ```python import re def normalize(name): return re.sub(r"[-_.]+", "-", name).lower() ``` 2. Import `Literal`, `Protocol`, and `Final` from standard library as of Python 3.8+ 3. Replace `Union[Literal[XXX], Literal[YYY]]` with `Literal[XXX, YYY]`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/94490 Approved by: https://github.com/ezyang, https://github.com/albanD	2023-02-09 19:17:49 +00:00
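The changes listed in the commit above can be sketched as follows. `normalize` is the PEP 503 helper quoted in the message; `Mode`, `ModeVerbose`, and `describe` are hypothetical names used only for illustration:

```python
import re
from typing import Literal, Union  # in the standard library since Python 3.8


def normalize(name: str) -> str:
    """PEP 503 name normalization, as quoted in the commit message."""
    return re.sub(r"[-_.]+", "-", name).lower()


# Before: one Literal per value, joined with Union.
ModeVerbose = Union[Literal["train"], Literal["eval"]]
# After: a single Literal listing every allowed value.
Mode = Literal["train", "eval"]


def describe(mode: Mode) -> str:
    # Static type checkers treat Mode and ModeVerbose as equivalent.
    return f"running in {mode} mode"
```

Under this normalization `typing_extensions` and `typing-extensions` name the same distribution, so the rename in the dependency list is purely cosmetic.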
albanD	496c0a207b	Make segment_reduce properly private. (#93166 ) I am attempting not to change the aten function to reduce the amount of BC issues on the torchscript side. Pull Request resolved: https://github.com/pytorch/pytorch/pull/93166 Approved by: https://github.com/ngimel	2023-02-06 18:32:23 +00:00
PyTorch MergeBot	f7bd5d0ccb	Revert "[Reland] Add sym_size/stride/numel/storage_offset to native_function.yaml (#91… (#92402 )" This reverts commit `965f4ea3ba`. Reverted https://github.com/pytorch/pytorch/pull/92402 on behalf of https://github.com/zhxchen17 due to Caused a regression for an export model.	2023-02-03 03:12:43 +00:00
Nikita Vedeneev	84187399fc	retire sparse_mask_helper (#91714 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/91714 Approved by: https://github.com/albanD, https://github.com/amjames, https://github.com/cpuhrsch	2023-02-02 13:53:02 +00:00
Driss Guessous	653dc73df0	[SDPA] Wire up FlashAttention's backward (#92917 ) # Summary This PR creates _flash_attention_backward and _scaled_dot_product_flash_attention_backward native functions and registers them to the respective derivatives.yaml. The goal is to replicate the torch.autograd.Function defined in the FlashAttention repo [here](`33e0860c9c/flash_attn/flash_attn_interface.py (L126)`) natively in PyTorch. One thing that we don't have access to is ctx.save_for_backward in native PyTorch, so in order to save these variables I extended the returned objects from the forward functions. ### MetaFunctions I also updated the FlashAttention meta functions to mirror the real outputs now, and added a meta registration for backwards. I have an XLMR training script, and while eager training now works with FlashAttention, compiling this module fails with the inductor error below. ### Questions? Performance issues vs mem efficient when using torch.nn.mha_forward TorchCompile -> See proposed solution below. Pull Request resolved: https://github.com/pytorch/pytorch/pull/92917 Approved by: https://github.com/cpuhrsch	2023-02-02 04:02:30 +00:00
Sherlock Huang	965f4ea3ba	[Reland] Add sym_size/stride/numel/storage_offset to native_function.yaml (#91… (#92402 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/91919 Approved by: https://github.com/ezyang Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/92402 Approved by: https://github.com/ezyang	2023-02-01 04:47:49 +00:00
Jacob Szwejbka	2e9107ec1e	[Pytorch][Executorch] Handwritten view copy out ops should resize out (#91194 ) Summary: Handwritten out ops should have feature parity with the codegen'd ones. This means they should resize out to the appropriate size. Q. Why are these handwritten instead of codegen'd anyway? Q2. Where's a good spot to put the resize and copy helpers, since they are reused in the codegen'd out kernels? Test Plan: ci. Differential Revision: D42177051 Pull Request resolved: https://github.com/pytorch/pytorch/pull/91194 Approved by: https://github.com/ezyang	2023-01-30 23:07:14 +00:00
Larry Liu	a0ca9dc8ca	[torchgen] Small fix for empty yaml file edge case (#92938 ) Rely on CI. Avoid issues such as: ``` Traceback (most recent call last): File "<string>", line 38, in <module> File "<string>", line 36, in __run File "/usr/local/fbcode/platform010/lib/python3.8/runpy.py", line 194, in _run_module_as_main return _run_code(code, main_globals, None, File "/usr/local/fbcode/platform010/lib/python3.8/runpy.py", line 87, in _run_code exec(code, run_globals) File "/re_cwd/buck-out/v2/gen/fbcode/2841b324ed9b88dd/caffe2/torchgen/__gen_executorch__/gen_executorch#link-tree/torchgen/gen_executorch.py", line 690, in <module> main() File "/re_cwd/buck-out/v2/gen/fbcode/2841b324ed9b88dd/caffe2/torchgen/__gen_executorch__/gen_executorch#link-tree/torchgen/gen_executorch.py", line 626, in main parsed_yaml, custom_ops_parsed_yaml = parse_yaml_files( File "/re_cwd/buck-out/v2/gen/fbcode/2841b324ed9b88dd/caffe2/torchgen/__gen_executorch__/gen_executorch#link-tree/torchgen/gen_executorch.py", line 505, in parse_yaml_files translate_native_yaml( File "/re_cwd/buck-out/v2/gen/fbcode/2841b324ed9b88dd/caffe2/torchgen/__gen_executorch__/gen_executorch#link-tree/torchgen/gen_executorch.py", line 448, in translate_native_yaml for e in native_es: TypeError: 'NoneType' object is not iterable ``` Differential Revision: [D42729435](https://our.internmc.facebook.com/intern/diff/D42729435) Pull Request resolved: https://github.com/pytorch/pytorch/pull/92938 Approved by: https://github.com/JacobSzwejbka	2023-01-27 22:45:21 +00:00
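The traceback in the commit above ends in `TypeError: 'NoneType' object is not iterable`, which is what happens when a loader such as PyYAML's `yaml.safe_load` returns `None` for an empty file and the result is iterated directly. A minimal sketch of one way to guard against that follows. The helper names are hypothetical, and this is not necessarily how the PR fixes it:

```python
from typing import Any, Dict, List, Optional


def entries_or_empty(parsed: Optional[List[Dict[str, Any]]]) -> List[Dict[str, Any]]:
    # An empty YAML document parses to None rather than to an empty list,
    # so normalize that case before anything tries to iterate over it.
    return [] if parsed is None else parsed


def collect_op_names(parsed: Optional[List[Dict[str, Any]]]) -> List[str]:
    # Without the guard, "for e in parsed" raises TypeError when parsed is None.
    return [e["func"] for e in entries_or_empty(parsed) if "func" in e]
```

With the guard in place, `collect_op_names(None)` yields an empty list instead of raising.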
Han Qi	1f352f7c1f	Update flatbuffer test models to match pkl models (#93022 ) Also regenerate upgrader with ``` python torchgen/operator_versions/gen_mobile_upgraders.py ``` Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/93022 Approved by: https://github.com/tugsbayasgalan	2023-01-26 21:17:57 +00:00
Masaki Kozuki	30876229a7	[mta] Backward of unary foreach functions (#89591 ) as per title, this PR defines backward of those. This doesn't implement forward-mode automatic differentiation as [the current codegen](`a747326423/tools/autograd/gen_variable_type.py (L1513)`) doesn't seem to handle `ArrayRef<Tensor>`. Rel: - https://github.com/pytorch/pytorch/issues/53796 - https://github.com/pytorch/pytorch/issues/58833 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89591 Approved by: https://github.com/albanD	2023-01-23 08:28:06 +00:00
PyTorch MergeBot	befe815466	Revert "Add sym_size/stride/numel/storage_offset to native_function.yaml (#91919 )" This reverts commit `0388400f3f`. Reverted https://github.com/pytorch/pytorch/pull/91919 on behalf of https://github.com/atalman due to Break internal build	2023-01-17 21:03:18 +00:00
Sherlock Huang	0388400f3f	Add sym_size/stride/numel/storage_offset to native_function.yaml (#91919 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/91919 Approved by: https://github.com/ezyang	2023-01-17 03:39:57 +00:00
Larry Liu	7568484d54	[torchgen] Add CI job to cover custom ops registration for Executorch (#91291 ) As titled. To register a custom op into Executorch, we need: * `custom_ops.yaml`, defines the operator schema and the corresponding native function. * `custom_ops.cpp`, defines the kernel. * `RegisterDispatchKeyCustomOps.cpp`, a template to register operator into PyTorch. Added a new test for custom ops. The custom op `custom::add_3.out` takes 3 tensors and add them together. The test makes sure it is registered correctly and then verifies the outcome is correct. Differential Revision: [D42204263](https://our.internmc.facebook.com/intern/diff/D42204263/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/91291 Approved by: https://github.com/ezyang	2023-01-14 02:30:54 +00:00
Luca Lumetti	a4a0195c6c	Fix torch.where signature mismatch that was caused by torchgen (#91627 ) Fixes #91003 Pull Request resolved: https://github.com/pytorch/pytorch/pull/91627 Approved by: https://github.com/albanD	2023-01-13 16:17:55 +00:00
Brian Hirsh	c47bdd7522	_scatter ops should preserve input stride/storage_offset (#91029 ) It turns out that we do* need to update *_scatter ops to return the exact same strides as their inputs. I added a test to `test/test_functionalization.py`, which now trips thanks to Ed's functionalization stride debugging check. It only actually ends up tripping silent correctness if you try to .backward() on that function. Pull Request resolved: https://github.com/pytorch/pytorch/pull/91029 Approved by: https://github.com/ezyang	2022-12-22 19:41:53 +00:00
Mengwei Liu	2f154f68ea	[torchgen] Add CI job to make sure torchgen works for Executorch op registration (#89596 ) ## Job Test running on most CI jobs. ## Test binary * `test_main.cpp`: entry for gtest * `test_operator_registration.cpp`: test cases for gtest ## Helper sources * `operator_registry.h/cpp`: simple operator registry for testing purpose. * `Evalue.h`: a boxed data type that wraps ATen types, for testing purpose. * `selected_operators.yaml`: operators Executorch care about so far, we should cover all of them. ## Templates * `NativeFunctions.h`: for generating headers for native functions. (not compiled in the test, since we will be using `libtorch`) * `RegisterCodegenUnboxedKernels.cpp`: for registering boxed operators. * `Functions.h`: for declaring operator C++ APIs. Generated `Functions.h` merely wraps `ATen/Functions.h`. ## Build files * `CMakeLists.txt`: generate code to register ops. * `build.sh`: driver file, to be called by CI job. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89596 Approved by: https://github.com/ezyang	2022-12-21 03:07:32 +00:00
Edward Z. Yang	1c46a32b67	Minor typing improvements (#91068 ) Signed-off-by: Edward Z. Yang <ezyang@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/91068 Approved by: https://github.com/Skylion007, https://github.com/soumith	2022-12-20 23:43:11 +00:00
Larry Liu	909b7ca92a	[torchgen] Move Executorch codegen logic into torchgen (#90806 ) ## Codegen entry point Main logic and Executorch codegen entry: `gen_executorch.py`. `RegisterCodegenUnboxedKernels.cpp`: ```cpp register_operators({ Operator( "aten::add.out", [](EValue** stack) { EValue& self = stack[0]; EValue& other = stack[1]; EValue& alpha = stack[2]; EValue& out = stack[3]; const at::Tensor & self_base = self.to<at::Tensor>(); const at::Tensor & other_base = other.to<at::Tensor>(); const at::Scalar & alpha_base = alpha.to<at::Scalar>(); at::Tensor & out_base = out.to<at::Tensor>(); EXECUTORCH_SCOPE_PROF("native_call_add.out"); torch::executor::aten::add_outf(self_base, other_base, alpha_base, out_base); }) ); ``` `Functions.h`: ```cpp namespace torch { namespace executor { namespace aten { // aten::add_outf(Tensor self, Tensor other, Scalar alpha, , Tensor(a!) out) -> Tensor(a!) TORCH_API inline at::Tensor & add_outf(const at::Tensor & self, const at::Tensor & other, at::Scalar alpha, at::Tensor & out) { return at::add_outf(self, other, alpha, out); } } // namespace aten } // namespace executor } // namespace torch ``` Unit tests: `test_executorch_gen.py` CI job in next PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/90806 Approved by: https://github.com/ezyang	2022-12-19 21:58:43 +00:00
Larry Liu	679da8bd89	[torchgen] Move Executorch custom ops logic into torchgen (#90099 ) ## Logic to handle custom ops We generate files for custom ops, so that they can be registered into PyTorch. Generated files: * `Register{dispatch_key}CustomOps.cpp` (dispatch_key = CPU), it's basically the same as vanilla PyTorch `RegisterCPU.cpp`. The only difference is that we bind to native functions directly. * `Register{dispatch_key}Stub.cpp` (dispatch_key = CPU), register placeholder kernels for custom ops. Only used when there's no custom op kernel available. As an example: ```cpp namespace { at::Tensor & wrapper_out_unsqueeze_out(const at::Tensor & self, int64_t dim, at::Tensor & out) { // No device check // DeviceGuard omitted return torch::executor::native::unsqueeze_out(self, dim, out); } } // anonymous namespace TORCH_LIBRARY_IMPL(aten, CPU, m) { m.impl("unsqueeze.out", TORCH_FN(wrapper_out_unsqueeze_out)); } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/90099 Approved by: https://github.com/ezyang	2022-12-19 21:58:43 +00:00
Larry Liu	ca52f63fc0	[torchgen] Move Executorch unboxing logic into torchgen (#90098 ) This PR adds `unboxing.py` which converts a `EValue` (similar to `IValue`) to its corresponding C++ type, based on the `ExecutorchCppSignature`. Added unit tests to it in `test_executorch_unboxing.py`. Notice that this unboxing logic should work for both ATen types and Executorch types, hence the unit tests are parametrized. Pull Request resolved: https://github.com/pytorch/pytorch/pull/90098 Approved by: https://github.com/ezyang	2022-12-19 21:58:43 +00:00
Brian Hirsh	440a3f2398	fix set_() with functionalization (#90722 ) This should fix https://github.com/pytorch/pytorch/issues/90573 Pull Request resolved: https://github.com/pytorch/pytorch/pull/90722 Approved by: https://github.com/ezyang	2022-12-19 16:11:06 +00:00
Edward Z. Yang	4fa8d774b8	Add macro C10_AS_INTARRAYREF_SLOW (#90675 ) This makes it easier to narrow down who is throwing the error, instead of having to use gdb. Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: [D42088781](https://our.internmc.facebook.com/intern/diff/D42088781)	2022-12-16 15:10:35 -08:00
Edward Z. Yang	68805b565a	Include dispatch key in wrapper symbol name (#90674 ) When looking at gdb traces, this makes it easier to tell that you're looking at the CPU wrapper vs CUDA wrapper, etc. Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: [D42088744](https://our.internmc.facebook.com/intern/diff/D42088744) Pull Request resolved: https://github.com/pytorch/pytorch/pull/90674 Approved by: https://github.com/ngimel, https://github.com/malfet	2022-12-16 19:36:32 +00:00
PyTorch MergeBot	750576a50a	Revert "Include dispatch key in wrapper symbol name (#90674 )" This reverts commit `e87370133c`. Reverted https://github.com/pytorch/pytorch/pull/90674 on behalf of https://github.com/osalpekar due to executorch breakage internally, more details in [D42051698](https://www.internalfb.com/diff/D42051698)	2022-12-16 01:05:57 +00:00
PyTorch MergeBot	140a3139d6	Revert "Add macro C10_AS_INTARRAYREF_SLOW (#90675 )" This reverts commit `8090cb5386`. Reverted https://github.com/pytorch/pytorch/pull/90675 on behalf of https://github.com/osalpekar due to broke internal acc_tensor implementation in training_platform contbuild. See [D42052101](https://www.internalfb.com/diff/D42052101) for details.	2022-12-16 00:30:50 +00:00
Edward Z. Yang	8090cb5386	Add macro C10_AS_INTARRAYREF_SLOW (#90675 ) This makes it easier to narrow down who is throwing the error, instead of having to use gdb. Signed-off-by: Edward Z. Yang <ezyang@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/90675 Approved by: https://github.com/ngimel, https://github.com/malfet, https://github.com/JackCaoG	2022-12-14 21:29:23 +00:00
Larry Liu	f3393b7ea7	[torchgen] Introduce Executorch types and signatures (#90781 ) Retry of #90591, which is a retry of #89595. Reverted due to dependency PR breaking internal fbcode. ## Forked BaseCppType Created a module for Executorch: `torchgen.executorch`. ## In `torchgen.executorch.api.types.types`: * Define `BaseCppType` with `torch::executor` namespace. ## In `torchgen.executorch.api.et_cpp`: * Help generate `NamedCType` for `ExecutorchCppSignature` arguments. ## In `torchgen.executorch.api.types.signatures`: * Define the signature using these types. (`ExecutorchCppSignature`) ## In `torchgen.executorch.api.types.__init__`: * Suppress flake8 error for `import *`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/90781 Approved by: https://github.com/ezyang	2022-12-14 20:13:04 +00:00
Larry Liu	4adffe6d51	[torchgen] Let native function declaration generation logic take a callable (#90780 ) Retry of #90590, which is a retry of #89594. Original PR reverted due to internal breakage. This PR fixes the breakage by adding a default value to the new argument. This PR allows `get_native_function_declarations` API to take a function as argument. This function should take `NativeFunction` as input and emit code for native function declaration. By default it is `dest.compute_native_function_declaration`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/90780 Approved by: https://github.com/ezyang	2022-12-14 20:13:04 +00:00
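The pattern described in the commit above (an API takes a callable, and gives it a default value so existing callers keep working) can be sketched like this. All names below are hypothetical stand-ins, not the real torchgen signatures:

```python
from typing import Callable, List, Sequence


def default_declaration(func_name: str) -> List[str]:
    # Placeholder for the default generator; the real one emits C++ declarations.
    return [f"// declaration for {func_name}"]


def get_declarations(
    func_names: Sequence[str],
    declaration_gen: Callable[[str], List[str]] = default_declaration,
) -> List[str]:
    # Callers that omit declaration_gen get the previous behavior unchanged.
    lines: List[str] = []
    for name in func_names:
        lines.extend(declaration_gen(name))
    return lines


# A backend with its own conventions can swap in a custom generator.
custom = get_declarations(["add"], declaration_gen=lambda n: [f"// custom {n}"])
```

The default value is what keeps existing call sites valid, which matches the commit's description of how the internal breakage was fixed.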
Driss Guessous	51c6c5e156	[SDPA] Standardizes the return shape for dense tensor of SDPA regardless of fused kernel called (#90776 ) # Summary Continues to fix up the meta output story of SDPA to be more correct Pull Request resolved: https://github.com/pytorch/pytorch/pull/90776 Approved by: https://github.com/cpuhrsch	2022-12-14 18:08:02 +00:00
Edward Z. Yang	e87370133c	Include dispatch key in wrapper symbol name (#90674 ) When looking at gdb traces, this makes it easier to tell that you're looking at the CPU wrapper vs CUDA wrapper, etc. Signed-off-by: Edward Z. Yang <ezyang@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/90674 Approved by: https://github.com/ngimel	2022-12-14 03:09:22 +00:00
PyTorch MergeBot	ea64c8c6ad	Revert "[torchgen] Let native function declaration generation logic take a callable (#90590 )" This reverts commit `de6beca838`. Reverted https://github.com/pytorch/pytorch/pull/90590 on behalf of https://github.com/seemethere due to Causes internal failures, see https://www.internalfb.com/intern/sandcastle/job/4503600464398605/insights	2022-12-13 03:41:04 +00:00
PyTorch MergeBot	b3e6a6dc0b	Revert "[torchgen] Introduce Executorch types and signatures (#90591 )" This reverts commit `ddf00c803b`. Reverted https://github.com/pytorch/pytorch/pull/90591 on behalf of https://github.com/seemethere due to Part of a stack that causes internal failures, see https://www.internalfb.com/intern/sandcastle/job/4503600464398605/insights	2022-12-13 03:36:31 +00:00
Larry Liu	ddf00c803b	[torchgen] Introduce Executorch types and signatures (#90591 ) Retry of #89595. Accidentally closed. ## Forked `BaseCppType` Created a module for Executorch: `torchgen.executorch`. In `torchgen.executorch.api.types.types`: * Define `BaseCppType` with `torch::executor` namespace. In `torchgen.executorch.api.et_cpp`: * Help generate `NamedCType` for `ExecutorchCppSignature` arguments. In `torchgen.executorch.api.types.signatures`: * Define the signature using these types. (`ExecutorchCppSignature`) In `torchgen.executorch.api.types.__init__`: * Suppress flake8 error for `import *`. Differential Revision: [D41501836](https://our.internmc.facebook.com/intern/diff/D41501836/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/90591 Approved by: https://github.com/iseeyuan	2022-12-10 04:34:02 +00:00
Larry Liu	de6beca838	[torchgen] Let native function declaration generation logic take a callable (#90590 ) Retry of #89594. Accidentally closed. This PR allows `get_native_function_declarations` API to take a function as argument. This function should take `NativeFunction` as input and emit code for native function declaration. By default it is `dest.compute_native_function_declaration`. Differential Revision: [D41501838](https://our.internmc.facebook.com/intern/diff/D41501838/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/90590 Approved by: https://github.com/iseeyuan	2022-12-10 04:34:02 +00:00
Larry Liu	453ff96029	[torchgen] Refactor types (#90589 ) A retry of #89487. Accidentally closed. ## Split `torchgen.api.types` into `types_base`, `types` and `signatures`. In `types_base`: * Created base class `CType`. `BaseCType` and `ConstRefCType` etc are inheriting `CType`. * Only keep abstract type model definitions, such as `BaseCppType`. In `types`: * Define `BaseCppType` with `at` and `c10` namespaces. * All the signatures using these types. In `signatures`: * Define all the signatures. In `__init__`: * `from ... import `, suppress flake8 error. Differential Revision: [D41455634](https://our.internmc.facebook.com/intern/diff/D41455634/) NOTE FOR REVIEWERS*: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D41455634/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/90589 Approved by: https://github.com/iseeyuan	2022-12-10 04:34:00 +00:00
Elias Ellison	b651e06049	Add Pointwise Tag from pointwise set in DTensor, use in aot_autograd partitioner (#90029 ) Takes the pointwise op list from [DTensor](https://github.com/pytorch/pytorch/blob/master/torch/distributed/_tensor/ops/pointwise_ops.py#L36) as an initially starting point for pointwise ops, and feeds them to the aot autograd partitioner. Pull Request resolved: https://github.com/pytorch/pytorch/pull/90029 Approved by: https://github.com/ezyang	2022-12-08 20:21:17 +00:00
Ram Rachum	351d73b97f	Fix exception causes all over the codebase (#90271 ) This is the continuation to #90134 and hopefully the final PR in this series. Pull Request resolved: https://github.com/pytorch/pytorch/pull/90271 Approved by: https://github.com/kit1980	2022-12-07 04:29:00 +00:00
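The commit above does not spell out what "fixing exception causes" means. Assuming it refers to explicit exception chaining with `raise ... from`, a minimal sketch looks like this (`parse_port` is a hypothetical function, not code from the PR):

```python
def parse_port(text: str) -> int:
    try:
        return int(text)
    except ValueError as e:
        # "from e" stores the original error in __cause__, so the traceback
        # reads "The above exception was the direct cause of ..." rather than
        # "During handling of the above exception, another exception occurred".
        raise RuntimeError(f"invalid port: {text!r}") from e
```

Chaining this way tells the reader that the second exception is a deliberate translation of the first rather than a bug in the handler.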
Sean Ross-Ross	bb673fb1d9	fix: update error when tensor escapes vmap (#89077 ) Fixes https://github.com/pytorch/functorch/issues/1054 @zou3519, I played around with it, but I am unsure of how to repro the cases for gen_vmap_inplace_plumbing and below in gen_vmap_plumbing_no_returns I've also seen that there are 24 other instances of the `TORCH_INTERNAL_ASSERT(maybe_layer.has_value());` assert, should I change all of these and add tests? Pull Request resolved: https://github.com/pytorch/pytorch/pull/89077 Approved by: https://github.com/zou3519	2022-12-06 05:52:09 +00:00
Danni Li	05ccbd6d94	Functionalization: skip meta block computation if compute_reference_meta is false (#90219 ) Skip computing meta block when `compute_reference_meta` is `False`. Issue: #89914 Pull Request resolved: https://github.com/pytorch/pytorch/pull/90219 Approved by: https://github.com/ezyang	2022-12-06 04:03:01 +00:00
Edward Z. Yang	5266953443	Add crossref debug mode for functionalization, catches stride errors (#89498 ) The idea is to add a custom handler to Functionalize key in Python dispatcher that runs the functionalized version along side a non functionalized version, and checks that their outputs agree in the end. (Technically, for metadata mutation we should also check the inputs, but for now we're relying on those functions returning self.) I turned this on for test_functionalize.py (new TestCrossRefFunctionalize) and found a bunch of failures that look legit. This probably doesn't interact that nicely if you're also tracing at the same time, probably need more special logic for that (directly, just disabling tracing for when we create the nested fake tensor mode, but IDK if there's a more principled way to organize this.) There are some misc fixups which I can split if people really want. - xfail_inherited_tests moved to test common_utils - Bindings for _dispatch_tls_set_dispatch_key_included, _dispatch_tls_is_dispatch_key_included and _functionalization_reapply_views_tls - Type stubs for _enable_functionalization, _disable_functionalization - all_known_overloads utility to let you iterate over all OpOverloads in all namespaces. Iterator support on all torch._ops objects to let you iterate over their members. - suspend_functionalization lets you temporarily disable functionalization mode in a context - check_metadata_matches for easily comparing outputs of functions and see if they match (TODO: there are a few copies of this logic, consolidate!) - _fmt for easily printing the metadata of a tensor without its data - _uncache_dispatch for removing a particular dispatch key from the cache, so that we force it to regenerate - check_significant_strides new kwarg only_cuda to let you also do stride test even when inputs are not CUDA - Functionalize in torch._C.DispatchKey Signed-off-by: Edward Z. Yang <ezyang@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/89498 Approved by: https://github.com/malfet	2022-11-23 04:18:25 +00:00
Driss Guessous	b291c1213a	Create native function for determining which implementation of SDP to call (#89029 ) # Summary Creates a callable native function that can determine which implementation of scaled dot product will get called. This allows re-ordering the runtime dispatch of SDP to enable autograd. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89029 Approved by: https://github.com/cpuhrsch	2022-11-16 03:07:54 +00:00
Richard Zou	3bc327993f	PyDispatcher integration with functorch (#88785 ) This PR teaches PyDispatcher and PyOperator about functorch transforms. It is important that PyDispatcher/PyOperator dispatch with functorch transforms, because this is our plan for higher-order operators (operators that accept functions as arguments). Examples of these include: - functorch transforms over the existing cond operator (control flow) - autograd.Function support for functorch (which I am working towards), - AOTDispatcher (should be a higher order operator) Concretely, the problem with teaching PyDispatcher/PyOperator about functorch is that the stack-based dispatching logic (DynamicLayerStack) is hidden inside the fallbacks for two dispatch keys (DynamicLayer{Front, Back}). PyDispatcher doesn't know about C++ boxed fallbacks, our plan on record for that is that we need to reimplement all of them in Python (but can call helper functions in C++ to make our lives easier). Instead of exposing all of what DynamicLayer{Front, Back} do to python, this PR takes the approach of re-implementing part of the stack-based dispatching in Python. The motivation is that this is more sane and follows what the "ideal" implementation of functorch would have been: - each transform should be a "mode" - there should be no TLS dispatch key set hackery. functorch needs to do this hackery today to re-use VariableType implementations. This PR: - exposes the DynamicLayerStack to Python - The DynamicLayerStack is a stack of Interpreters. These get exposed to Python as well. - Interpreters can run operations (Interpreter.process) or lower them to the next interpreter in the stack (Interpreter.lower) - To use a PyOperator with functorch transforms, a developer needs to register a rule for each transform (vmap, grad, jvp, ...). - The PyOperator API is NOT user-facing. Things like autograd.Function support for functorch will end up going through the autograd.Function API. Question for reviewers: - Does this design make sense? - I'm trying to split up the "functorch support for autograd.Function" work into logical pieces. Would it be better if I didn't? (the full thing is a bit long - 1000-2000 LOC). Test Plan: - new tests that construct PyOperator and compose them with functorch transforms Pull Request resolved: https://github.com/pytorch/pytorch/pull/88785 Approved by: https://github.com/samdow, https://github.com/soulitzer	2022-11-16 00:46:59 +00:00
Edward Z. Yang	23a3eb37cf	SymIntify _copy functionalization kernels (and _copy_out too) (#88572 ) Signed-off-by: Edward Z. Yang <ezyang@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/88572 Approved by: https://github.com/anjali411, https://github.com/bdhirsh	2022-11-07 21:40:10 +00:00
Wonjoo Lee	a171b0636a	Add use_lazy_shape flag to GenLazyIr class (#88444 ) Add use_lazy_shape flag to GenLazyIr class to allow XLA to use its custom shape class. The default value is kept to use lazy shape, so this PR does not introduce any new behaviors. PyTorch/XLA companion PR: https://github.com/pytorch/xla/pull/4111 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88444 Approved by: https://github.com/alanwaketan, https://github.com/wconstab	2022-11-04 08:23:56 +00:00
Wonjoo Lee	0efd4e92b5	Make GenLazyNativeFuncDefinition generator to be customizable in lazy codegen (#87823 ) As part of the ongoing LTC migration effort, PyTorch/XLA is updating its codegen to use `xla::Shape` instead of `torch::lazy::Shape`. To achieve this, this PR updates the codegen to make the `GenLazyNativeFuncDefinition` generator customizable. The existing `GenLazyNativeFuncDefinition` is kept by using the initial default values, so this change should not introduce any new behaviors to the existing codegen in PyTorch. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87823 Approved by: https://github.com/alanwaketan, https://github.com/wconstab	2022-11-03 06:19:40 +00:00
Jerry Zhang	a0fb234b45	[codegen] using TORCH_LIBRARY_FRAGMENT for some namespaces (#88229 ) Summary: Sometimes we want to extend an existing custom namespace library, instead of creating a new one, but we don't have a namespace config right now, so we hardcode some custom libraries defined in pytorch today, i.e. quantized and quantized_decomposed Test Plan: ci Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/88229 Approved by: https://github.com/ezyang	2022-11-03 02:30:02 +00:00

529 Commits