This PR fixes a bug in `broadcast_in_dim` that prevented reduction ops from being used before `broadcast_in_dim`.
With this PR it's possible to run:
```py
import torch
import torch._refs
from torch._prims.executor import make_traced
def foo(a):
    return torch._refs.mean(a, keepdim=False)
a = torch.randn(3, 3, device='cuda')
make_traced(foo)(a, executor="nvfuser")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79444
Approved by: https://github.com/mruberry, https://github.com/jjsjann123
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/
Bug fixes and minor refactoring.
Squashed commits to work around the GitHub API.
Commits that are actually in this PR from the devel branch:
```
4c60e7dff22a494632370e5df55c011007340d06 Add examples infrastructure for using nvFuser in a standalone program (#1725)
02a05d98334ffa580d73ccb28fdb8c577ad296fe Fix issue #1751 (#1753)
8a69aa320bd7629e1709fe5ceb7104d2c88ec84c Refactor NvFuser transpose API to match eager mode behavior (#1746)
ffdf6b7709048170d768217fcd7083fc8387f932 Remove BroadcastWithoutStride. (#1738)
02bab16035e70734450c02124f5cdaa95cf5749d Fix flipping of a boolean flag (#1745)
465d66890c8242e811224359cbdb1c2915490741 cleanup (#1744)
26d354e68720bc7dd2d3b1338ac01b707a230b6a fixing noncontig broadcast (#1742)
856b6b2f9073662dd98ca22ba6c3540e20eb1cdd Add IterDomainBuilder (#1736)
1fd974f912cd4c1e21cbd16e2abb23598d66a02f fixing warning for gcc7 (#1732)
de2740a43a869f8272c2648e091d7b8235097db9 disabling complex in python tests for #1730 (#1733)
fbbbe0a2e7c7a63e0e2719b8bfccb759b714221a fixing MSVC build (#1728)
b5feee5e2b28be688dbddc766f3c0220389c8175 Fix the fused reduction runtime kernel (#1729)
5247682dff5980bb66edf8d3aac25dea2ef2ced5 Re-entrant GroupedGridReduction (#1727)
```
RUN_TORCHBENCH: nvfuser
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79147
Approved by: https://github.com/davidberard98
Adding a link to the GitHub 1.12 release branch nvFuser README.md in the JIT doc.
Note that this PR is intended to be cherry-picked into the 1.12 release; we'll have a follow-up PR to update the link once this PR is merged.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78160
Approved by: https://github.com/davidberard98
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/
A few bigger updates:
1. Initial support for cp.async and cp.async.wait: https://github.com/csarofeen/pytorch/pull/1619
2. Emulate Ampere's mma 16816 with Turing's mma 1688 for a unified interface: https://github.com/csarofeen/pytorch/pull/1643
3. Extending the infrastructure to support mma operators on the Turing and Ampere architectures: https://github.com/csarofeen/pytorch/pull/1440
Commits that are actually in this PR from the csarofeen branch:
```
* dd2325294e236c5082c642819a1103bcfe4561a3 (csarofeen/devel) Fusion Segmenter: Unify single kernel and multi-kernel runtime path (#1710)
* b3d1c3f446355a2d276bac8272e7aa8b5bb6b1f0 Fix missing cooperative launch (#1726)
* dc670a226cbe52be46cecef47001f38bf9a09433 Async gmem copy support on sm80+ (#1619)
* 5e6a8dab5a71aefe0548bbfa15d1a93c556d23fe Add turing mma support and test (#1643)
* d6d6b7d3f10dd91dafa4cdbd5e460bbb38173af4 Fix rFactor when there are indirect root domain(s), and refactor (#1723)
* 7093e39150c6d80e0f9f767d56654714a2e8a927 Mma op integration on ampere (#1440)
* fade8da55e60a118c5595378896d34b862b2fcc3 patch python test for bfloat16 (#1724)
* 8fbd0b18743a72ac10478857c3d2351204375685 Fine-grained kernel profiling (#1720)
* 77c1b4fa633f9e631d267923f4537336fa328939 Adding dry run mode to skip arch dependent checks (#1702)
* 151d95b97bebefc94199bb4a53423ede32b55451 More precise concretization analysis (#1719)
* f4d3630ed54d7069dd377a64be1f91013b285b66 Enable complex python tests (#1667)
* 4ceeee509774cc2ce6c834a4dc1e313f71d94503 Minor bugfix in transform_rfactor.cpp (#1715)
* 3675c70faf218e86d2c78dbd3874b175a3b0a203 Separate root domain and rfactor domain in TransformPrinter (#1716)
* f68b830d5def65dadfe29d4edf52fc703369c84a Fix scheduling with polymorphic broadcast (#1714)
* 4ab5ef7ae2cfd8fffad1e1d882ae7c50631211dc updating_ci_machine (#1718)
* 56585c58b1ff338704cafb0cd6be2b3d536bed5a Merge pull request #1711 from csarofeen/upstream_master_bump_0517
* 174d453d3be0c11a5acb0fff3b3f36e19cfdaf81 Allow using nvFuser on CUDA extension (#1701)
* 18bee67495454b9a79625799776e746bd5e81c4c Validate LOOP concrete IDs have complete IterDomains (#1676)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78244
Approved by: https://github.com/csarofeen, https://github.com/malfet
Sometimes bias won't have shape info (e.g. in the added test, conv gets run twice in a loop, each time with a different shape). In that case we should just skip the decomposition instead of erroring out.
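As a hypothetical repro sketch (the module and shapes here are made up for illustration; the actual test is the one added in the PR), the failing pattern looks roughly like this:
```py
import torch

# The same conv is run in a loop, and the scripted module is then called with
# inputs of two different shapes, so bias shape info may be missing when the
# conv-bias decomposition pass runs.
class Loop(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1, bias=True)

    def forward(self, x, y):
        out = []
        for t in [x, y]:  # two iterations, different spatial sizes
            out.append(self.conv(t))
        return out

m = torch.jit.script(Loop().cuda().eval())
m(torch.randn(1, 3, 16, 16, device='cuda'),
  torch.randn(1, 3, 32, 32, device='cuda'))
```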
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77440
Approved by: https://github.com/jjsjann123
This PR primarily addresses augmenting the frontend to properly support `broadcast_in_dim`. This required making a new version of `define_tensor()` that takes the `sizes` and `strides` of input tensors in order to properly determine broadcasts.
This PR also includes a fix for `python_example.py`, which broke when a new argument was added to reductions to allow the user to specify an output data type.
`define_tensor()` Interface Example:
```
fusion2 = Fusion()
input1 = torch.ones(1, 1, 4, device='cuda')
input2 = torch.ones(2, 3, 4, device='cuda')
with FusionDefinition(fusion2) as fd :
    t0 = fd.define_tensor(sizes=input1.size(), strides=input1.stride())
    t1 = fd.define_tensor(sizes=input2.size(), strides=input2.stride())
    fd.add_input(t0)
    fd.add_input(t1)
    t0_b = fd.Ops.broadcast_in_dim(t0, [2, 3, 4], [0, 1, 2])
    print("Broadcast TensorView", t0_b)
    t2 = fd.Ops.add(t0_b, t1)
    fd.add_output(t2)
```
Print statement of defined broadcast tensor:
```
Broadcast TensorView T2_l[ sbS6{1}, sbS7{1}, iS8{i2} ] DataType: float Contiguity: ttt
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76790
Approved by: https://github.com/mruberry, https://github.com/jjsjann123
Re-landing #68111/#74596
## Description
v0.5 PR of this [RFC](https://github.com/pytorch/pytorch/issues/49444).
Building on #50256, the following improvements are included:
* The [v0.5 release branch](https://github.com/oneapi-src/oneDNN/releases/tag/graph-v0.5) of the oneDNN Graph API is used
* The fuser now works with the profiling graph executor. We have inserted type check nodes to guard the profiled tensor properties.
### User API:
The optimization pass is disabled by default. Users can enable it with:
```
torch.jit.enable_onednn_fusion(True)
```
`torch.jit.freeze` should be used after tracing (recommended) or scripting a model.
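For reference, a minimal usage sketch of the inference flow described above (the model and input shapes are placeholders, not part of this PR):
```py
import torch

torch.jit.enable_onednn_fusion(True)  # the pass is off by default; opt in explicitly

# Placeholder model; this integration targets inference workloads.
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU()).eval()
example = torch.randn(32, 64)

with torch.no_grad():
    traced = torch.jit.trace(model, example)  # tracing (recommended) or scripting
    frozen = torch.jit.freeze(traced)         # freeze after tracing/scripting
    frozen(example)                           # warm-up runs let the profiling
    frozen(example)                           # graph executor specialize and fuse
    out = frozen(example)
```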
### Performance:
The [pytorch/benchmark](https://github.com/pytorch/benchmark) tool is used to compare the performance:
* SkyLake 8180 (1 socket of 28 cores): (performance charts not included here)
* SkyLake 8180 (single thread): (performance charts not included here)
  * By mapping hardswish to oneDNN Graph, it's 8% faster than PyTorch JIT (NNC + OFI)
  * We expect performance gains after mapping transpose, contiguous & view to oneDNN graph ops
### Directory structure of the integration code
Fuser-related code is placed under:
```
torch/csrc/jit/codegen/onednn/
```
Optimization pass registration is done in:
```
torch/csrc/jit/passes/onednn_graph_fuser.h
```
CMake for the integration code is in:
```
caffe2/CMakeLists.txt
cmake/public/mkldnn.cmake
cmake/Modules/FindMKLDNN.cmake
```
## Limitations
* In this PR, we only support the PyTorch-oneDNN Graph integration on Linux. Support for Windows and macOS will be enabled as a next step.
* We have only optimized the inference use case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76622
Approved by: https://github.com/eellison
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76485
Adds an environment variable `PYTORCH_JIT_ENABLE_NVFUSER` for
controlling whether or not nvfuser is enabled. This required changing
the PassManager behavior to support the case where nvfuser gets enabled
by default when PYTORCH_JIT_ENABLE_NVFUSER=1.
Previously the solution for turning nvfuser on or off was to use the
PassManager to register or un-register the pass. That works fine if the
pass starts off _disabled_, but causes issues once we try to enable the
pass by default.
The main issue with enabling by default is with the validation check to
see whether NVFuser can be turned on. The check relies on
at::globalContext().hasCUDA(), which requires CUDAHooks to be registered
before hasCUDA() will work correctly. At static initialization time it's
difficult to ensure that CUDAHooks will be registered _before_ we
attempt to register the nvfuser pass. In OSS it worked fine, but in
internal builds it would fail on ROCm builds.
To fix this, we switch the control of NVFuser enablement to a check in
the pass. i.e. previously, we enabled/disabled nvfuser by registering or
de-registering the pass in pass manager; now, the pass is always
registered in pass manager, and enablement is done by a check within the
nvfuser pass.
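As an illustrative sketch of that pattern (Python pseudocode with made-up names, not the actual C++ pass-manager code): the pass is always registered, and the enable/disable decision happens inside it at run time, when CUDAHooks is already available.
```py
import os

def nvfuser_fusion_pass(graph):
    # Enablement is checked when the pass actually runs, not at static
    # registration time. The default shown here is illustrative only; the
    # real default depends on the build/platform.
    enabled = os.environ.get("PYTORCH_JIT_ENABLE_NVFUSER") != "0"
    if not enabled:
        return graph  # registered but inert when disabled
    # ... validate CUDA availability and run fusion on `graph` here ...
    return graph

# Registered unconditionally; no register/un-register dance needed.
REGISTERED_PASSES = [nvfuser_fusion_pass]
```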
Remaining TODO: Connect this with NNC so that in cases where NNC is
available but not NVFuser (i.e. on AMD gpus), NNC can be turned on
automatically.
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D35982618
Pulled By: davidberard98
fbshipit-source-id: fd5b76bc0b8c8716c96fdc04bebfb15026a7ef60
(cherry picked from commit ff14603ff5ac8d9b6c749c4f111f4a8be8023b7f)
Extended permutation support in integration (see more details in https://github.com/csarofeen/pytorch/issues/1601). This update allows us to better support permutation propagation on tensors, specifically for binary ops with inputs of different ranks. Our goal is to avoid permuting tensors unless absolutely necessary. We try to preserve the permutation propagation rule in ATen, with some known limitations at this time.
The idea in this implementation is the same as in our existing code: permute input/output tensors outside of codegen. Consider a simplified binary op scenario, `output = binaryOp(input0, input1)`:
1. In a simple case where `input0` and `input1` come with the same rank & permutation order, our output would preserve the same permutation;
2. For cases where `input0` and `input1` come with different ranks but with **compatible** permutation, the tensor with the higher rank dictates the permutation of the output;
3. For cases where `input0` and `input1` come with different ranks but with **in-compatible** permutation, this is where permutation propagation fails and the output tensor will be contiguous.
By **compatible** permutation, we mean that we can permute the higher-rank tensor to contiguous format and then apply a second permutation to the lower-rank tensor to match their axes. This check is implemented in `MemoryFormat::broadcastToRank(int lower_rank)`.
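As a quick illustration of the layouts involved, here is a small helper (hypothetical, not part of the nvfuser API) that recovers a tensor's memory-format order from its strides, which is the information these propagation rules operate on:
```py
import torch

def memory_order(t):
    # Dimensions ordered from outermost (largest stride) to innermost (smallest).
    return sorted(range(t.dim()), key=t.stride, reverse=True)

b, h, w, c = 2, 4, 5, 3
t0 = torch.randn(b, h, w, c).permute([0, 3, 1, 2])  # strides (hwc, 1, wc, c)
t1 = torch.randn(h, w, c).permute([2, 0, 1])        # strides (1, wc, c)
print(memory_order(t0))  # [0, 2, 3, 1] -> channel dim is innermost
print(memory_order(t1))  # [1, 2, 0]    -> channel dim is innermost as well (compatible)
```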
Some concrete examples (note that we comply with eager propagation in cases 1-3, but diverge in behavior for cases 4 and 5):
1. different rank & same permutation
```
t0 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2]) # stride (hwc, 1, wc, c)
t1 = torch.randn(h, w, c).cuda().permute([2, 0, 1]) # stride (1, wc, c)
out = scripted_add(t0, t1) # stride (hwc, 1, wc, c) preserving memory format of t0
```
2. different rank & compatible permutation
```
t0 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2]) # stride (hwc, 1, wc, c)
t1 = torch.randn(c, h, w).cuda() # stride (hw, w, 1)
out = scripted_add(t0, t1) # stride (hwc, 1, wc, c) preserving memory format of t0
```
3. different rank & compatible permutation with broadcasting
```
t0 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2]) # stride (hwc, 1, wc, c)
t1 = torch.randn(c).cuda().unsqueeze(-1).unsqueeze(-1) # stride (1, 1, 1)
out = scripted_add(t0, t1) # stride (hwc, 1, wc, c) preserving memory format of t0
```
4. different rank & in-compatible permutation
```
t0 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2]) # stride (hwc, 1, wc, c)
t1 = torch.randn(h, w).cuda() # stride (w, 1)
jit_out = scripted_add(t0, t1) # stride (hwc, 1, wc, c) # stride (hwc, wc, c, 1) # nvfuser outputs contiguous tensor
eager_out = eager_add(t0, t1) # stride (hwc, 1, wc, c) # stride (hwc, 1, wc, c) # TensorIterator preserves memory format of LHS operand
```
5. different rank & in-compatible permutation
```
t0 = torch.randn(c, h, w).cuda() # stride (hw, w, 1)
t1 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2]) # stride (hwc, 1, wc, c)
jit_out = scripted_add(t0, t1) # stride (hwc, 1, wc, c) # stride (hwc, 1, wc, c) # nvfuser preserves memory format of highest rank tensors
eager_out = eager_add(t0, t1) # stride (hwc, 1, wc, c) # stride (hwc, hw, w, 1) # TensorIterator preserves memory format of LHS operand
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76563
Approved by: https://github.com/kevinstephano, https://github.com/ngimel
I noticed that when `SymInt` was introduced, `jit_type_base.h` was
added as an include to the `Operator.h` template, which is supposed to
be kept extremely clean and only use forward declarations. I also noticed
that forward declarations for `OptionalArrayRef` were missing.
So, I've refactored the forward declarations into
`ATen/core/ATen_fwd.h` and cleaned up some of the `c10`
headers that were masking these missing declarations. I've also
re-generated the pre-compiled header so `SymInt` is included.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76576
Approved by: https://github.com/albanD
Retry of #75983. The change is to handle cases where attr::cache_id is
not set. This can happen if compilation fails.
Original message:
1) Remember when fusions fail; on subsequent runs, always take the fallback.
2) During the first fallback, cache the Code object.
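A rough sketch of this strategy (names and structure are hypothetical; the real logic lives in the C++ fuser's fallback path):
```py
compiled_fallbacks = {}   # fusion-group key -> cached fallback Code object
failed_groups = set()     # fusion-group keys whose compilation failed

def run_fusion_group(key, inputs, try_fuse, build_fallback):
    if key in failed_groups:
        # Subsequent runs: skip fusion entirely and reuse the cached fallback.
        return compiled_fallbacks[key](inputs)
    try:
        return try_fuse(inputs)                # attempt the fused kernel
    except RuntimeError:
        failed_groups.add(key)                 # 1) remember the failure
        fallback = build_fallback()            # 2) build the fallback Code once...
        compiled_fallbacks[key] = fallback     #    ...and cache it for reuse
        return fallback(inputs)
```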
On autogen-69 from the nvfuser microbenchmarks (https://github.com/pytorch/benchmark/pull/801) this improved performance as follows:
* Original (always attempt fusion): 25ms
* Always take fallback after first failure: 0.79ms
* Always take fallback + cache Code object: 0.62ms
* Eager: 0.58ms
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76505
Approved by: https://github.com/eellison
Fixes #75708
`--ptxas-options` only passes its immediate argument to ptxas, so we should have put it in front of every ptxas argument.
It's actually strange how this worked in CUDA Toolkit 11.6. I'm following up with the NVRTC team on this internally; meanwhile we should merge this PR to avoid register failures in generated kernels.
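Schematically (the option names below are examples only, not the exact flags nvfuser passes), the fix amounts to giving every ptxas argument its own prefix:
```py
# Hypothetical NVRTC option lists for illustration.

# Before: only the argument immediately following `--ptxas-options` is forwarded
# to ptxas.
before = ["--ptxas-options", "-v", "-maxrregcount=64"]

# After: each ptxas argument is individually prefixed with `--ptxas-options`.
after = ["--ptxas-options", "-v", "--ptxas-options", "-maxrregcount=64"]
```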
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76226
Approved by: https://github.com/davidberard98
1) Remember when fusions fail; on subsequent runs, always take the fallback.
2) During the first fallback, cache the Code object.
On autogen-69 from the nvfuser microbenchmarks (https://github.com/pytorch/benchmark/pull/801) this improved performance as follows:
* Original (always attempt fusion): 25ms
* Always take fallback after first failure: 0.79ms
* Always take fallback + cache Code object: 0.62ms
* Eager: 0.58ms
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75983
Approved by: https://github.com/jjsjann123