This PR relands #80584, but instead of adding the suppression in CMakeLists.txt it suppresses the warning directly in `llvm_codegen.cpp`, and only for a single header.
In general, it's better to avoid the `set_target_properties` pattern for suppressing warnings, as it makes the build brittle and hard to debug/understand.
Test plan: wait for `ciflow/binaries_wheel` to finish
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81012
Approved by: https://github.com/huydhn, https://github.com/kit1980
There is a problem where pybind11 silently converts Python's complex scalars to `bool` and uses the `define_constant<bool>` overload. This went unnoticed because `0j` converts to `False`, so the tests passed; with a `2j` scalar, the tests for `_refs.where` would fail without proper bindings.
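A minimal repro sketch of the kind of failure this fixes, assuming the same `make_traced`/nvfuser executor setup used in the examples below (shapes and dtypes are illustrative):
```python
import torch
import torch._refs
from torch._prims.executor import make_traced

cond = torch.ones(3, 3, dtype=torch.bool, device='cuda')
b = torch.randn(3, 3, dtype=torch.complex64, device='cuda')

# Without proper complex bindings, the 2j scalar is silently passed to the
# define_constant<bool> overload, so the traced fusion gets the wrong constant.
func = lambda cond, b: torch._refs.where(cond, 2j, b)
assert make_traced(func)(cond, b, executor="nvfuser").dtype == torch.complex64
```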
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80522
Approved by: https://github.com/ngimel
This PR modifies the type promotion logic for nvFuser's `where` function when one of the arguments is a scalar. With the proposed change, the behavior now matches ATen's type promotion.
The following script fails on master and passes with this PR:
```py
import torch
import torch._refs
from torch._prims.executor import make_traced
a = torch.ones(3, 3, dtype=torch.bool, device='cuda')
b = torch.randn(3, 3, device='cuda')
func = lambda a, b: torch._refs.where(a, 0.0, b)
assert make_traced(func)(a, b, executor="nvfuser").dtype == torch.float32
```
This PR allows unskipping the nvFuser tests for `_refs.log_softmax`, which were failing with a dtype mismatch.
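A minimal sketch of the previously skipped case, assuming `_refs.log_softmax` is exposed under the name used here:
```python
import torch
import torch._refs
from torch._prims.executor import make_traced

a = torch.randn(3, 3, device='cuda')
func = lambda a: torch._refs.log_softmax(a, dim=-1)
# Before this PR the nvFuser-executed trace produced a mismatched dtype.
assert make_traced(func)(a, executor="nvfuser").dtype == torch.float32
```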
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80347
Approved by: https://github.com/ngimel
This PR fixes a bug in `broadcast_in_dim` that prevented reduction ops from being used before `broadcast_in_dim`.
With this PR it's possible to run:
```py
import torch
import torch._refs
from torch._prims.executor import make_traced
def foo(a):
    return torch._refs.mean(a, keepdim=False)
a = torch.randn(3, 3, device='cuda')
make_traced(foo)(a, executor="nvfuser")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79444
Approved by: https://github.com/mruberry, https://github.com/jjsjann123
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/
Bug fixes and minor refactor
Squashed commits to WAR github API
Commits that are actually in this PR from the devel branch:
```
4c60e7dff22a494632370e5df55c011007340d06 Add examples infrastructure for using nvFuser in a standalone program (#1725)
02a05d98334ffa580d73ccb28fdb8c577ad296fe Fix issue #1751 (#1753)
8a69aa320bd7629e1709fe5ceb7104d2c88ec84c Refactor NvFuser transpose API to match eager mode behavior (#1746)
ffdf6b7709048170d768217fcd7083fc8387f932 Remove BroadcastWithoutStride. (#1738)
02bab16035e70734450c02124f5cdaa95cf5749d Fix flipping of a boolean flag (#1745)
465d66890c8242e811224359cbdb1c2915490741 cleanup (#1744)
26d354e68720bc7dd2d3b1338ac01b707a230b6a fixing noncontig broadcast (#1742)
856b6b2f9073662dd98ca22ba6c3540e20eb1cdd Add IterDomainBuilder (#1736)
1fd974f912cd4c1e21cbd16e2abb23598d66a02f fixing warning for gcc7 (#1732)
de2740a43a869f8272c2648e091d7b8235097db9 disabling complex in python tests for #1730 (#1733)
fbbbe0a2e7c7a63e0e2719b8bfccb759b714221a fixing MSVC build (#1728)
b5feee5e2b28be688dbddc766f3c0220389c8175 Fix the fused reduction runtime kernel (#1729)
5247682dff5980bb66edf8d3aac25dea2ef2ced5 Re-entrant GroupedGridReduction (#1727)
```
RUN_TORCHBENCH: nvfuser
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79147
Approved by: https://github.com/davidberard98
Adding a link to the GitHub 1.12 release branch nvFuser README.md in the JIT doc.
Note that this PR is intended to be cherry-picked into the 1.12 release; we'll have a follow-up PR to update the link once this PR is merged.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78160
Approved by: https://github.com/davidberard98
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/
A few bigger updates:
1. Initial support of cp.async and cp.async.wait: https://github.com/csarofeen/pytorch/pull/1619
2. Emulate ampere's mma 16816 with Turing's mma 1688, for a unified interface: https://github.com/csarofeen/pytorch/pull/1643
3. Extending the infrastructure to support mma operators on turing and ampere arch: https://github.com/csarofeen/pytorch/pull/1440
Commits that are actually in this PR from the csarofeen branch:
```
* dd2325294e236c5082c642819a1103bcfe4561a3 (csarofeen/devel) Fusion Segmenter: Unify single kernel and multi-kernel runtime path (#1710)
* b3d1c3f446355a2d276bac8272e7aa8b5bb6b1f0 Fix missing cooperative launch (#1726)
* dc670a226cbe52be46cecef47001f38bf9a09433 Async gmem copy support on sm80+ (#1619)
* 5e6a8dab5a71aefe0548bbfa15d1a93c556d23fe Add turing mma support and test (#1643)
* d6d6b7d3f10dd91dafa4cdbd5e460bbb38173af4 Fix rFactor when there are indirect root domain(s), and refactor (#1723)
* 7093e39150c6d80e0f9f767d56654714a2e8a927 Mma op integration on ampere (#1440)
* fade8da55e60a118c5595378896d34b862b2fcc3 patch python test for bfloat16 (#1724)
* 8fbd0b18743a72ac10478857c3d2351204375685 Fine-grained kernel profiling (#1720)
* 77c1b4fa633f9e631d267923f4537336fa328939 Adding dry run mode to skip arch dependent checks (#1702)
* 151d95b97bebefc94199bb4a53423ede32b55451 More precise concretization analysis (#1719)
* f4d3630ed54d7069dd377a64be1f91013b285b66 Enable complex python tests (#1667)
* 4ceeee509774cc2ce6c834a4dc1e313f71d94503 Minor bugfix in transform_rfactor.cpp (#1715)
* 3675c70faf218e86d2c78dbd3874b175a3b0a203 Separate root domain and rfactor domain in TransformPrinter (#1716)
* f68b830d5def65dadfe29d4edf52fc703369c84a Fix scheduling with polymorphic broadcast (#1714)
* 4ab5ef7ae2cfd8fffad1e1d882ae7c50631211dc updating_ci_machine (#1718)
* 56585c58b1ff338704cafb0cd6be2b3d536bed5a Merge pull request #1711 from csarofeen/upstream_master_bump_0517
* 174d453d3be0c11a5acb0fff3b3f36e19cfdaf81 Allow using nvFuser on CUDA extension (#1701)
* 18bee67495454b9a79625799776e746bd5e81c4c Validate LOOP concrete IDs have complete IterDomains (#1676)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78244
Approved by: https://github.com/csarofeen, https://github.com/malfet
Sometimes the bias won't have shape info (e.g. in the added test, conv is run twice in a loop, each time with different shapes). In that case we should just skip the decomposition instead of erroring out.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77440
Approved by: https://github.com/jjsjann123
This PR primarily addresses augmenting the frontend to properly support `broadcast_in_dim`. This required making a new version of `define_tensor()` that takes in the `sizes` and `strides` of input tensors in order to properly determine broadcasts.
This PR also includes a fix for `python_example.py`, which broke when a new argument was added to reductions to allow the user to specify an output data type.
`define_tensor()` Interface Example:
```
fusion2 = Fusion()
input1 = torch.ones(1, 1, 4, device='cuda')
input2 = torch.ones(2, 3, 4, device='cuda')
with FusionDefinition(fusion2) as fd:
    t0 = fd.define_tensor(sizes=input1.size(), strides=input1.stride())
    t1 = fd.define_tensor(sizes=input2.size(), strides=input2.stride())

    fd.add_input(t0)
    fd.add_input(t1)

    t0_b = fd.Ops.broadcast_in_dim(t0, [2, 3, 4], [0, 1, 2])
    print("Broadcast TensorView", t0_b)
    t2 = fd.Ops.add(t0_b, t1)

    fd.add_output(t2)
```
Print statement of defined broadcast tensor:
```
Broadcast TensorView T2_l[ sbS6{1}, sbS7{1}, iS8{i2} ] DataType: float Contiguity: ttt
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76790
Approved by: https://github.com/mruberry, https://github.com/jjsjann123
Re-landing #68111/#74596
## Description
v0.5 PR of this [RFC](https://github.com/pytorch/pytorch/issues/49444).
On the basis of #50256, the below improvements are included:
* The [v0.5 release branch](https://github.com/oneapi-src/oneDNN/releases/tag/graph-v0.5) of the oneDNN Graph API is used
* The fuser now works with the profiling graph executor. We have inserted type check nodes to guard the profiled tensor properties.
### User API:
The optimization pass is disabled by default. Users could enable it by:
```
torch.jit.enable_onednn_fusion(True)
```
`torch.jit.freeze` should be used after tracing (recommended) or scripting a model.
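A minimal end-to-end usage sketch of the API above; the model and example input are illustrative placeholders, not part of this PR:
```python
import torch
import torch.nn as nn

torch.jit.enable_onednn_fusion(True)  # turn on the oneDNN Graph fuser

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU()).eval()  # placeholder model
example = torch.randn(1, 3, 32, 32)

with torch.no_grad():
    traced = torch.jit.trace(model, example)
    frozen = torch.jit.freeze(traced)  # freeze after tracing, as recommended
    frozen(example)                    # warm-up runs let the profiling executor fuse
    frozen(example)
    out = frozen(example)
```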
### Performance:
[pytorch/benchmark](https://github.com/pytorch/benchmark) tool is used to compare the performance:
* SkyLake 8180 (1 socket of 28 cores):

* SkyLake 8180 (single thread):

* By mapping hardswish to oneDNN Graph, it’s 8% faster than PyTorch JIT (NNC + OFI)
** We expect performance gain after mapping transpose, contiguous & view to oneDNN graph ops
### Directory structure of the integration code
Fuser-related code is placed under:
```
torch/csrc/jit/codegen/onednn/
```
Optimization pass registration is done in:
```
torch/csrc/jit/passes/onednn_graph_fuser.h
```
CMake for the integration code is in:
```
caffe2/CMakeLists.txt
cmake/public/mkldnn.cmake
cmake/Modules/FindMKLDNN.cmake
```
## Limitations
* In this PR, we only support the PyTorch-oneDNN-Graph integration on the Linux platform. Support on Windows and macOS will be enabled as a next step.
* We have only optimized the inference use-case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76622
Approved by: https://github.com/eellison
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76485
Adds an environment variable `PYTORCH_JIT_ENABLE_NVFUSER` for
controlling whether or not nvfuser is enabled. This required changing
the PassManager behavior to support the case where nvfuser gets enabled
by default when PYTORCH_JIT_ENABLE_NVFUSER=1.
Previously the solution for turning nvfuser on or off was to use the
PassManager to register or un-register the pass. That works fine if the
pass starts off _disabled_, but causes issues once we try to enable the
pass by default.
The main issue with enabling by default is with the validation check to
see whether NVFuser can be turned on. The check relies on
at::globalContext().hasCUDA(), which requires CUDAHooks to be registered
before hasCUDA() wil work correctly. At static initialization time it's
difficult to ensure that CUDAHooks will be registered _before_ we
attempt to register the nvfuser pass. In OSS it worked fine, but in
internal builds it would fail on ROCm builds.
To fix this, we switch the control of NVFuser enablement to a check in
the pass. i.e. previously, we enabled/disabled nvfuser by registering or
de-registering the pass in pass manager; now, the pass is always
registered in pass manager, and enablement is done by a check within the
nvfuser pass.
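A minimal usage sketch of the new control, assuming (per the description above) the variable is read when the always-registered nvfuser pass runs; the script body below is illustrative:
```python
# Launch with nvfuser enabled by default:
#   PYTORCH_JIT_ENABLE_NVFUSER=1 python this_script.py
import torch

@torch.jit.script
def f(x, y):
    return torch.relu(x + y)

x = torch.randn(8, 8, device="cuda")
y = torch.randn(8, 8, device="cuda")
for _ in range(3):  # profiling runs; fusion only kicks in if nvfuser is enabled
    f(x, y)
```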
Remaining TODO: Connect this with NNC so that in cases where NNC is
available but not NVFuser (i.e. on AMD gpus), NNC can be turned on
automatically.
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D35982618
Pulled By: davidberard98
fbshipit-source-id: fd5b76bc0b8c8716c96fdc04bebfb15026a7ef60
(cherry picked from commit ff14603ff5ac8d9b6c749c4f111f4a8be8023b7f)
Extended permutation support in integration (See more details on https://github.com/csarofeen/pytorch/issues/1601). This update allows us to better support permutation propagation on tensors, specifically for binary ops with inputs of different ranks. Our goal is to avoid permuting tensors unless absolutely necessary. We try to preserve the permutation propagation rule in aten, with some known limitation at the time.
The idea in this implementation is the same as in our existing code, which is to permute input/output tensors outside of codegen. For a simplified binary op scenario, `output = binaryOp(input0, input1)`:
1. In a simple case where `input0` and `input1` come with the same rank & permutation order, our output would preserve the same permutation;
2. For cases where `input0` and `input1` come with different ranks but with **compatible** permutation, the tensor with the higher rank dictates the permutation of the output;
3. For cases where `input0` and `input1` come with different ranks but with **incompatible** permutation, permutation propagation fails and the output tensor will be contiguous.
By **compatible** permutation, we mean that we can permute the higher-rank tensor to contiguous format and then apply a second permutation to the lower-rank tensor to match their axes. This check is implemented in `MemoryFormat::broadcastToRank(int lower_rank)`.
Some concrete examples (note that we comply with eager propagation in cases 1-3, but diverge in behavior for cases 4 and 5):
1. different rank & same permutation
```
t0 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2]) # stride (hwc, 1, wc, c)
t1 = torch.randn(h, w, c).cuda().permute([2, 0, 1]) # stride (1, wc, c)
out = scripted_add(t0, t1) # stride (hwc, 1, wc, c) preserving memory format of t0
```
2. different rank & compatible permutation
```
t0 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2]) # stride (hwc, 1, wc, c)
t1 = torch.randn(c, h, w).cuda() # stride (hw, w, 1)
out = scripted_add(t0, t1) # stride (hwc, 1, wc, c) preserving memory format of t0
```
3. different rank & compatible permutation with broadcasting
```
t0 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2]) # stride (hwc, 1, wc, c)
t1 = torch.randn(c).cuda().unsqueeze(-1).unsqueeze(-1) # stride (1, 1, 1)
out = scripted_add(t0, t1) # stride (hwc, 1, wc, c) preserving memory format of t0
```
4. different rank & incompatible permutation
```
t0 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2]) # stride (hwc, 1, wc, c)
t1 = torch.randn(h, w).cuda() # stride (w, 1)
jit_out = scripted_add(t0, t1) # stride (hwc, 1, wc, c) # stride (hwc, wc, c, 1) # nvfuser outputs contiguous tensor
eager_out = eager_add(t0, t1) # stride (hwc, 1, wc, c) # stride (hwc, 1, wc, c) # TI preserves memory format of LHS operand
```
5. different rank & incompatible permutation
```
t0 = torch.randn(c, h, w).cuda() # stride (hw, w, 1)
t1 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2]) # stride (hwc, 1, wc, c)
jit_out = scripted_add(t0, t1) # stride (hwc, 1, wc, c) # stride (hwc, 1, wc, c) # nvfuser preserves memory format of highest rank tensors
eager_out = eager_add(t0, t1) # stride (hwc, 1, wc, c) # stride (hwc, hw, w, 1) # TensorIterator preserves memory format of LHS operand
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76563
Approved by: https://github.com/kevinstephano, https://github.com/ngimel
I noticed that when `SymInt` was introduced, `jit_type_base.h` was
added as an include to the `Operator.h` template, which is supposed to
be kept extremely clean and only use forward declarations. Also,
the forward declarations for `OptionalArrayRef` were missing.
So, I've refactored the forward declarations into
`ATen/core/ATen_fwd.h` and cleaned up some of the `c10`
headers that were masking these missing declarations. I've also
re-generated the pre-compiled header so `SymInt` is included.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76576
Approved by: https://github.com/albanD
Retry of #75983. The change is to handle cases where attr::cache_id is
not set. This can happen if compilation fails.
Original message:
1) remember when fusions fail; and on subsequent runs, always take the fallback.
2) during the first fallback, cache the Code object.
On autogen-69 from the nvfuser microbenchmarks (https://github.com/pytorch/benchmark/pull/801) this improved performance as follows:
* Original (always attempt fusion): 25ms
* Always take fallback after first failure: 0.79ms
* Always take fallback + cache Code object: 0.62ms
* Eager: 0.58ms
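A minimal Python sketch of points 1) and 2) above; the `Code`, `compile_fused`, and `compile_fallback` names are illustrative stand-ins, not the actual C++ internals:
```python
class Code:
    """Stand-in for a compiled fallback graph."""
    def __init__(self, fn):
        self.fn = fn
    def run(self, *args):
        return self.fn(*args)

failed_fusions = set()   # nodes whose fusion compilation failed once
fallback_cache = {}      # node -> cached fallback Code object

def run_node(node, compile_fused, compile_fallback, *inputs):
    if node in failed_fusions:
        # 1) After the first failure, always take the (cached) fallback.
        return fallback_cache[node].run(*inputs)
    try:
        return compile_fused(node).run(*inputs)
    except RuntimeError:
        failed_fusions.add(node)
        # 2) On the first fallback, compile the unfused graph once and cache it.
        fallback_cache[node] = compile_fallback(node)
        return fallback_cache[node].run(*inputs)
```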
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76505
Approved by: https://github.com/eellison
Fixes #75708
`--ptxas-options` only passes its immediate argument to ptxas. So we should have put that in front of every ptxas argument.
It's actually strange how this worked in CUDA TK 11.6. I'm following up with the nvrtc team on this internally; meanwhile, we should merge this PR to avoid register failures in generated kernels.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76226
Approved by: https://github.com/davidberard98
1) remember when fusions fail; and on subsequent runs, always take the fallback.
2) during the first fallback, cache the Code object.
On autogen-69 from the nvfuser microbenchmarks (https://github.com/pytorch/benchmark/pull/801) this improved performance as follows:
* Original (always attempt fusion): 25ms
* Always take fallback after first failure: 0.79ms
* Always take fallback + cache Code object: 0.62ms
* Eager: 0.58ms
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75983
Approved by: https://github.com/jjsjann123
Summary:
[Comment](https://github.com/pytorch/pytorch/pull/62445/files#r680132022) claims it got added for consistency with the top-level CMakeLists.txt, but `-Wno-unused-variable` is not mentioned there.
Modifies violations in 50+ files that were added in the interim, by either removing unused variables or decorating the code with `C10_UNUSED` if a local variable is likely used to extend an object's lifetime until the end of the block.
Caused a preventable revert in https://github.com/pytorch/pytorch/pull/72633#issuecomment-1092300787
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75538
Reviewed By: anjali411
Differential Revision: D35747333
Pulled By: malfet
fbshipit-source-id: 3fc5828e44a4c05ba0e89e92613e6ebbdb260626
(cherry picked from commit c179fba21cfa2a0093fad50ccad5a22dd7cff52c)
Summary:
Fixing clang-format errors using `arc f`
Changes already in GitHub included https://github.com/pytorch/pytorch/pull/68460
Test Plan: test run in Signals
Reviewed By: osalpekar
Differential Revision: D35649381
fbshipit-source-id: 15f9cc7259c6425a14d2646200008f15ec47cbf0
(cherry picked from commit 6581afe58afae4dcc34d4024499c6cb61a56b448)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74361
This adds an optional validation after executing an NVFuser node, which checks that the output is the same as the unfused implementation. Then the outputs and the graph are reported via a callback.
```python
import torch
def callback(x, y, graph):
    for i in range(len(x) - amt, len(x)):  # `amt` is assumed to be defined by the surrounding harness
        print(x[i])
        print(y[i])
    print(graph)

with torch.jit.fuser("fuser2"):
    torch._C._jit_nvfuser_set_comparison_callback(True, callback)

    @torch.jit.script
    def g(x, y):
        z = torch.add(x, y)
        return torch.sin(z)

    def f(x, y, a):
        z = torch.add(x, y)
        return g(torch.relu(z), a)

    f_s = torch.jit.script(f)
    x = torch.rand((10, 10), dtype=torch.half).cuda()
    y = torch.rand((10, 10), dtype=torch.half).cuda()
    a = torch.rand((10, 10), dtype=torch.half).cuda()
    f_s(x, y, a)
    f_s(x, y, a)
    f_s(x, y, a)
```
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D34975310
Pulled By: davidberard98
fbshipit-source-id: 2379c9a6f371cd58da6a187c1f16882f3923ab24
(cherry picked from commit 96c87992c65f5e6bb1bdd51791682dd837af99b4)
Fixes an issue where CudaFusionGuard would return false on the backward graph because the `requires_grad` flag doesn't match.
This is due to the fact that autodiff uses the GradMode switch to turn requires_grad on/off, which is not taken into consideration by the nvfuser guard. We verified the implementation under `TensorType::matchTensor`.
- [x] Add python test to verify no fallback is observed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75016
Approved by: https://github.com/eellison
Summary:
Things changed in this PR that require review:
test/forward_backward_compatibility/check_forward_backward_compatibility.py
Our previous function overload extension names were wrong and have been updated in this PR, hence the update to the compatibility list.
nvfuser code updates with bug fixes for failures we encountered in OpInfo tests as well as failures reported by the AOTAutograd team.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73627
Reviewed By: Chillee
Differential Revision: D34765458
Pulled By: davidberard98
fbshipit-source-id: c81f3d6a1b723fb3a8ba419b7f82227f70440ca7
(cherry picked from commit b6a2c362c37051e44fac31687b2fe272f776551e)
Summary:
Added a Python API to disable nvfuser for a given op kind.
```
"_jit_set_nvfuser_skip_node_kind",
[](const std::string& op_name, bool flip = true) {
return fuser::cuda::skipNode(op_name, flip);
})
```
Args:
`op_name`: Symbol of op;
`flip`: flag indicating whether to flip the given op in the skip list.
Returns:
a bool flag indicating if `op_name` was already in the skip list.
A Python example that disables the fusion of `aten::add` afterwards:
`torch._C._jit_set_nvfuser_skip_node_kind("aten::add", True) # returns False, as no op is in skip list by default`
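A slightly fuller usage sketch based on the Args/Returns description above (the return values assume an initially empty skip list):
```python
import torch

# Add aten::add to the skip list; returns False because nothing was skipped yet.
assert not torch._C._jit_set_nvfuser_skip_node_kind("aten::add", True)

# flip=False only queries: aten::add is now reported as being in the skip list.
assert torch._C._jit_set_nvfuser_skip_node_kind("aten::add", False)

# Flip again to remove aten::add; returns True because it was in the skip list.
assert torch._C._jit_set_nvfuser_skip_node_kind("aten::add", True)
```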
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74520
Reviewed By: saketh-are
Differential Revision: D35046110
Pulled By: davidberard98
fbshipit-source-id: 689f5286513dbab206768823a852467b9f6b49b6
(cherry picked from commit 9a31129f7591ba2d393ab057b1cd137a6a25e7e8)
Summary:
## Description
Preview4 PR of this [RFC](https://github.com/pytorch/pytorch/issues/49444).
On the basis of https://github.com/pytorch/pytorch/pull/50256, the below improvements are included:
- The [preview4 release branch](https://github.com/oneapi-src/oneDNN/releases/tag/graph-v0.4.1) of the oneDNN Graph API is used
- The fuser now works with the profiling graph executor. We have inserted type check nodes to guard the profiled tensor properties.
### User API:
The optimization pass is disabled by default. Users could enable it by:
```
torch.jit.enable_onednn_fusion(True)
```
### Performance:
[pytorch/benchmark](https://github.com/pytorch/benchmark) tool is used to compare the performance:
- SkyLake 8180 (1 socket of 28 cores):

- SkyLake 8180 (single thread):

\* By mapping hardswish to oneDNN Graph, it’s 8% faster than PyTorch JIT (NNC + OFI)
\** We expect performance gain after mapping transpose, contiguous & view to oneDNN graph ops
### Directory structure of the integration code
Fuser-related code is placed under:
```
torch/csrc/jit/codegen/onednn/
```
Optimization pass registration is done in:
```
torch/csrc/jit/passes/onednn_graph_fuser.h
```
CMake for the integration code is:
```
caffe2/CMakeLists.txt
```
## Limitations
- In this PR, we have only supported the optimization on the Linux platform. Support on Windows and macOS will be enabled as the next step.
- We have only optimized the inference use case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68111
Reviewed By: eellison
Differential Revision: D34584878
Pulled By: malfet
fbshipit-source-id: ce817aa8cc9052ee9ed930c9cf66be83449e61a4
(cherry picked from commit cd17683aa7d9c0947df45a1ab53627feff795587)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73818
These all appear to be defined in libtorch_cpu.so, so they should be marked with TORCH_API. TORCH_API means that these symbols are exported from libtorch_cpu.so and no other libraries. In comparison, C10_EXPORT will export the symbol in _all_ built libraries, if it's available.
I think most of these were fine because most were only defined in cpp files (which would only be included in the targets for one .so file). However, the change in pass_manager.h affects behavior, since the class is defined in the .h file, which could result in two separate implementations of the same static functions. Previously we saw issues on Windows with this: https://github.com/pytorch/pytorch/pull/73742
Test Plan: Imported from OSS
Reviewed By: george-qi
Differential Revision: D34698175
Pulled By: davidberard98
fbshipit-source-id: cb871e861cf966bff596cfa8340a32a17fca0b66
(cherry picked from commit 6b9988e5688e6d4a9928c3e331efb74f000a9e4a)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73881
NVFuser fusion groups can contain nvfuser-only ops, e.g. `prim::reshape_copy`. Previously, we couldn't get a baseline performance measurement because the nvfuser-only ops would error out on nnc- and no-fusion runs. Instead, dump the fallback graphs after the fallbacks are corrected into runnable fallbacks.
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D34698307
Pulled By: davidberard98
fbshipit-source-id: c357b2736b789bfd347afe9c83a1b610b64881e0
(cherry picked from commit 5918d826502ff75fbc22d242844ae6435dd7d22a)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72889
The script along with the GRAPH_EXPORT macro will allow for an easy way to extract IR from logs. One use case in this diff is to extract the fusion groups from nvfuser, so that the fusions can be tested individually.
Usage (e.g. for nvfuser test)
1. Write some test.py file that uses nvfuser
2. `PYTORCH_JIT_LOG_LEVEL=">>graph_fuser" python3 test.py 2>&1 | tee output.txt`
3. `python3 pytorch/scripts/jit/log_extract.py output.txt --nvfuser`
This will run with and without nvfuser to compare the output.
Alternatively, use `--output` to dump the IR so that it can be used in other applications.
Currently, only `--output` works (since generating input tensors is not supported)
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D34440189
Pulled By: davidberard98
fbshipit-source-id: fca0f619200ee37aba34bb39b69e6c640c263e26
(cherry picked from commit eb319166075db160f1628f0de545641fbecde8be)
This patch enables building PyTorch from source with the Ninja and
'Visual Studio 16 2019' CMake generators on Windows on Arm.
Tests:
- Build from source: 'python setup.py develop'.
- Run simple Pytorch example: passed
- python test\test_torch.py:
-- same results as on x64
-- Ran 1344 tests, failures=2
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72424
Summary:
Things changed in this PR that require review:
1. aten/src/ATen/core/interned_strings.h
2. torch/csrc/jit/ir/alias_analysis.h : exposing createValue to allow efficient mutation
3. torch/csrc/jit/runtime/symbolic_shape_registry.cpp : added gelu/tanh/erf in registry
4. torch/jit/_script.py : throws when a scripted model uses autocast as a decorator, since it's not supported
nvfuser code update:
1. codegen improvements and performance tuning
2. integration bug fixes for shape expression logic
3. kernel segmentation update to address perf regression from horizontal fusion
4. scalar cpu tensor promotion to support inter-device operation between cpu scalar tensor and cuda tensor
Things reverted from local changes:
aten::gelu with approximation (tracked in PR: https://github.com/pytorch/pytorch/pull/61439)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72127
Reviewed By: HamidShojanazeri
Differential Revision: D34113233
Pulled By: jbschlosser
fbshipit-source-id: b82cde32b71e324eca0ea57cb8c9f9647278ca74
(cherry picked from commit e009bc5c4e)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72013
Find and replace `assert(!"` with `assert(false && "`
Excludes headers and paths that contain "third-party" or "external"
Clang raises a `-Wstring-conversion` warning when treating a string as a boolean. This is not uncommon for asserts though (e.g. `assert(!"should never happen")`). Clang does permit `expr && "string"` though in order to support these assertion use cases.
Test Plan: ci pass
Differential Revision: D33823092
fbshipit-source-id: 9a1af012215bdc91f8b4162ddb2df28d51539773
(cherry picked from commit 0286910350)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71879
Two locations of improper macro usage were reported (https://github.com/pytorch/pytorch/issues/71848), and this diff fixes them. In both cases this is behavior-changing, since the incorrect usages would have passed the assertion due to interpreting the error string as the condition, and both cases should have been 'assert false'.
Test Plan: Run CI
Reviewed By: alanwaketan
Differential Revision: D33800406
fbshipit-source-id: dfe3d9a6455e6eb96cb639022f8813a8bd6520c3
(cherry picked from commit ee551e5a16)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71711
This will fix a ton of broken asserts that should always fire but never actually fire.
All would have been caught with `-Wstring-conversion` warnings enabled.
Test Plan: CI Pass
Differential Revision: D33743605
fbshipit-source-id: 062641f9d5d02c6e317c5a286fd01017cf77237f
(cherry picked from commit 639b42e04b)
Summary:
This patch moves a CUDA-specific file, `CUDAGeneratorImpl.h`, to `ATen/cuda`, as the following TODO comment in `CUDAGeneratorImpl.h` suggests:
```
// TODO: this file should be in ATen/cuda, not top level
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70650
Reviewed By: jianyuh, xw285cornell
Differential Revision: D33414890
Pulled By: shintaro-iwasaki
fbshipit-source-id: 4ff839205f4e4ea4c8767f164d583eb7072f1b8b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69579
This should help us avoid reference counting overhead on singleton Type subclasses without a major rewrite of the Type subsystem.
ghstack-source-id: 146643993
Test Plan:
Ran //caffe2/caffe2/fb/high_perf_models/pytorch/benchmark_framework_overheads:cpp_benchmark with arguments `--op empty -niter 40 --stressTestRecordFunction --captureRecordFunctionInputs` on devbig with turbo off.
Before:
```
I1206 13:47:15.037441 1201670 bench.cpp:144] Mean 0.737675
I1206 13:47:15.037463 1201670 bench.cpp:145] Median 0.736725
I1206 13:47:15.037468 1201670 bench.cpp:146] Min 0.722897
I1206 13:47:15.037473 1201670 bench.cpp:147] stddev 0.00508187
I1206 13:47:15.037482 1201670 bench.cpp:148] stddev / mean 0.00688903
```
After:
```
I1206 13:48:16.830123 1205612 bench.cpp:144] Mean 0.66988
I1206 13:48:16.830150 1205612 bench.cpp:145] Median 0.663956
I1206 13:48:16.830157 1205612 bench.cpp:146] Min 0.65986
I1206 13:48:16.830164 1205612 bench.cpp:147] stddev 0.0335928
I1206 13:48:16.830171 1205612 bench.cpp:148] stddev / mean 0.0501475
```
Static runtime startup is also improved; for CMF local_ro, time to initialize a predictor went from 10.01s to 9.59s.
(Note: I wish I had a production workload to demonstrate the advantage of this on. I tried ctr_mobile_feed local_ro net but it was neutral. Anything that manipulates types or List/Dict a lot might be promising.)
Reviewed By: suo
Differential Revision: D32923880
fbshipit-source-id: c82ed6689b3598e61047fbcb2149982173127ff0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68691
TraceType is a sharded file, so by only including specific operator
headers, we ensure that changing one (non-method) operator only needs
one shard to be re-compiled.
This also changes all the included autograd and jit headers from
including `ATen/ATen.h` to just including `ATen/core/Tensor.h`.
Test Plan: Imported from OSS
Reviewed By: gchanan
Differential Revision: D33336948
Pulled By: albanD
fbshipit-source-id: 4e40371592b9a5a7e7fcd1d8cecae11ffb873113
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68691
TraceType is a sharded file, so by only including specific operator
headers, we ensure that changing one (non-method) operator only needs
one shard to be re-compiled.
This also changes all the included autograd and jit headers from
including `ATen/ATen.h` to just including `ATen/core/Tensor.h`.
Test Plan: Imported from OSS
Reviewed By: jbschlosser, malfet
Differential Revision: D32596264
Pulled By: albanD
fbshipit-source-id: 2f28b62d7b9932f30fad7daacd8ac5bb7f63c621
Summary:
Use `c10::printQuotedString` to escape any characters that might cause the
string to be interpreted as more than one argument by a shell script.
Please note that this codepath is deprecated and is not accessible
through typical PyTorch usage workflows.
This issue was discovered by Daniel Lawrence of the Amazon Alexa team.
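A Python analogue of the concern being fixed, with `shlex.quote` playing the role of `c10::printQuotedString`; the file name is illustrative:
```python
import shlex

filename = "weights; echo injected.pt"   # a name that would confuse a shell
unsafe = f"stat {filename}"              # the shell would see extra commands/arguments
safe = f"stat {shlex.quote(filename)}"   # quoted: treated as a single argument
print(unsafe)   # stat weights; echo injected.pt
print(safe)     # stat 'weights; echo injected.pt'
```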
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70070
Reviewed By: suo
Differential Revision: D33172721
Pulled By: malfet
fbshipit-source-id: 9dbd17f6eb775aaa1a545da42cbc95864c1189ee
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69964
Things added in this PR that require review:
1. cuLaunchCooperativeKernel driver API added
aten/src/ATen/cuda/detail/LazyNVRTC.cpp
aten/src/ATen/cuda/nvrtc_stub/ATenNVRTC.h
nvfuser code update:
1. perf tuning on the codegen scheduler that improves performance.
2. permutation support has been extended beyond contiguous/channels-last. (The improvements can be observed on the PW benchmark.)
Things reverted from local changes:
1. aten::gelu with approximation
2. local changes that are upstreamed in PR https://github.com/pytorch/pytorch/issues/68804
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69428
Reviewed By: ngimel
Differential Revision: D33073817
Pulled By: wconstab
fbshipit-source-id: e77d32e81d037d7370822b040456fd4c3bd68edb
Summary:
Follow up to https://github.com/pytorch/pytorch/issues/68095
This also changes the files from the ATen folder to include c10's `Export.h` instead since they can't ever be exporting `TORCH_PYTHON_API`.
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69585
Reviewed By: mrshenli
Differential Revision: D32958594
Pulled By: albanD
fbshipit-source-id: 1ec7ef63764573fa2b486928955e3a1172150061
Summary:
nvfuser code update:
1. Tuning heuristics on schedulers for reduction/normalization kernels;
2. bfloat16 on IO tensor support;
3. Refactored memory format support; now we can support dimension collapsing with non-coherent input tensors that have different memory formats, e.g. a channels-last tensor input to batch normalization. Note that we are currently limiting memory format to only Contiguous and Channels Last;
4. Refactored nvfuser graph partitioning in `graph_fuser.cpp`, separated node merge and profile node API. Updated `profiling_record.cpp`.
Things that are reverted from our local branch:
1. changes on some entries in autodiff
2. aten::gelu with approximation
3. native_dropout(_backward)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67943
Reviewed By: ngimel
Differential Revision: D32288709
Pulled By: dzhulgakov
fbshipit-source-id: fc9491182ea7e0158bc112c66f096823c588eaf1
Summary:
There were 2 versions of the same code which were slightly different although functionally equivalent.
When adding support for another CUDA / device version, both would need to be changed and kept in sync. So it is better to have only 1 version of it as the unique source of truth.
I chose the implementation which looks cleaner and easier to read and added some minor enhancements and comments to further increase readability.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55901
Reviewed By: H-Huang
Differential Revision: D31636917
Pulled By: bertmaher
fbshipit-source-id: 622e1fabc39de4f3f1b1aa9a1544cfbd35a5cfd9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65345
FooType::get() can return a const reference. Inconveniently, converting shared_ptr<FooType> to shared_ptr<Type> requires a copy & refcount bump, so to properly take advantage of this in unshapedType() we need to take a const Type& in isSubtypeOf(), which is good practice anyway -- don't require a shared_ptr if you don't need to take ownership.
ghstack-source-id: 140044165
Test Plan:
CI
perf says c10::unshapedType time decreased from 2.8% to 2.2% during static runtime startup, though I expect this to be generally beneficial.
Reviewed By: hlu1
Differential Revision: D31027361
fbshipit-source-id: 676feb81db9f74ad7b8651d8774f4ecb4cfa6ab8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65064
The problem appears when nvfuser is triggered from LazyTensor.
Because LT maintains its own thread pool, the thread used for the first-time
compilation does CUDA context initialization properly, but later
cached execution may use a different thread which does not have
a proper CUDA context.
Test Plan: Imported from OSS
Reviewed By: saketh-are
Differential Revision: D31269691
Pulled By: desertfire
fbshipit-source-id: 384362025c087d61e8b625ff938379df283ef8b2
Summary:
Delete `-Wno-unused-variable` from top level `CMakeLists.txt`
Still suppress those warnings for tests and `torch_python`
Delete a number of unused variables from caffe2 code
Use `(void)var;` to suppress unused variable in range loops
Use `C10_UNUSED` for global constructors and use `constexpr` instead of `static` for global constants
Do not delete `caffe2::OperatorBase::Output` calls as they have side effects
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66041
Reviewed By: ngimel
Differential Revision: D31360142
Pulled By: malfet
fbshipit-source-id: 6fdfb9f91efdc49ca984a2f2a17ee377d28210c8
Summary:
Reland of https://github.com/pytorch/pytorch/pull/65242
The last attempt of the reland automatically rebased onto stable, which did not yet have the revert commit
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66018
Reviewed By: albanD
Differential Revision: D31348822
Pulled By: soulitzer
fbshipit-source-id: 881d701b404530c1352ac9245bd67264e1652b8a
Summary:
Delete `-Wno-unused-variable` from top level `CMakeLists.txt`
Still suppress those warnings for tests and `torch_python`
Delete a number of unused variables from caffe2 code
Use `(void)var;` to suppress unused variable in range loops
Use `C10_UNUSED` for global constructors and use `constexpr` instead of `static` for global constants
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65954
Reviewed By: ngimel
Differential Revision: D31326599
Pulled By: malfet
fbshipit-source-id: 924155f1257a2ba1896c50512f615e45ca1f61f3
Summary:
Fixes https://github.com/pytorch/pytorch/issues/64000
- updates double backward formula to compute grad wrt output instead of self
- ~~In some of the error messages, we still refer to the dtype of the input, even though we are now checking the dtype of the output~~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65242
Reviewed By: malfet
Differential Revision: D31317680
Pulled By: soulitzer
fbshipit-source-id: b3b921e06775cfc12e5a97a9ee8d73aec3aac7c3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65610
- Replace HIP_PLATFORM_HCC with USE_ROCM
- Dont rely on CUDA_VERSION or HIP_VERSION and use USE_ROCM and ROCM_VERSION.
- In the next PR
- Will be removing the mapping from CUDA_VERSION to HIP_VERSION and CUDA to HIP in hipify.
- HIP_PLATFORM_HCC is deprecated, so will add HIP_PLATFORM_AMD to support HIP host code compilation on gcc.
cc jeffdaily sunway513 jithunnair-amd ROCmSupport amathews-amd
Reviewed By: jbschlosser
Differential Revision: D30909053
Pulled By: ezyang
fbshipit-source-id: 224a966ebf1aaec79beccbbd686fdf3d49267e06
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65459
Just run linter on the change and apply all suggestions
Test Plan: N/A
Reviewed By: seemethere
Differential Revision: D31102960
fbshipit-source-id: 04e1d07935690f2ddbc64533661b3e55379d13b5
Summary:
Syncing nvfuser code base from the devel branch. Listing a few of our developments since the last sync:
- Extends support to normalization and reduction kernels.
- Multiple kernel launches for a single `CudaFusionGroup`. The hierarchical caching system has been updated to cache graph segmentation.
- profile_ivalue is enabled to convert dynamic scalars into compile-time constants that are required by the codegen (e.g. reduction axes).
To keep this PR simple and relatively review-free, we stripped most external changes and submitted them as separate PRs, so this gigantic PR is easier to handle.
Internal updates are in files located under:
1. updates in nvfuser codegen `torch/csrc/jit/codegen/cuda`
2. added nvfuser specific benchmarks `benchmarks/cpp/nvfuser`
3. nvfuser jit cpp tests `test/cpp/jit/test_gpu.cpp` `test/cpp/jit/test_gpu_shift.cpp` `test/cpp/jit/test_gpu_validator.h`
Updates affecting integration:
1. profile_ivalue enabled for nvfuser. Related changes are in `torch/csrc/jit/runtime/*`,
2. exposed a few more symbols in `aten/src/ATen/core/*` used by codegen
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63745
Reviewed By: saketh-are
Differential Revision: D30752939
Pulled By: malfet
fbshipit-source-id: ce122e80f01bcd3865f5bd3c4dfde660665fd84c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63414
Misuse of a raw pointer here, where the stack is never nullable.
ghstack-source-id: 136938318
Test Plan:
compiles.
Imported from OSS
Reviewed By: ejguan
Differential Revision: D30375410
fbshipit-source-id: 9d65b620bb76d90d886c800f54308520095d58ee
Summary:
- HIP_VERSION semantic versioning will change in ROCm4.3. The changes essentially remove the dependency on HIP_VERSION provided in the hip header to keep code compatible with older and newer versions of ROCm.
- TORCH_HIP_VERSION is derived from HIP_VERSION_MAJOR and HIP_VERSION_MINOR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62786
Reviewed By: bdhirsh
Differential Revision: D30281682
Pulled By: seemethere
fbshipit-source-id: e41e69fb9e13de5ddd1af99ba5bbdcbb7b64b673
Summary:
The GoogleTest `TEST` macro is non-compliant with the `cppcoreguidelines-avoid-non-const-global-variables` check, as is `DEFINE_DISPATCH`.
All changes but the ones to `.clang-tidy` are generated using following script:
```
for i in `find . -type f -iname "*.c*" -or -iname "*.h"|xargs grep cppcoreguidelines-avoid-non-const-global-variables|cut -f1 -d:|sort|uniq`; do sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" $i; done
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008
Reviewed By: driazati, r-barnes
Differential Revision: D29838584
Pulled By: malfet
fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
Summary:
This is a first step towards creating a context manager that errors out on synchronizing calls.
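A usage sketch assuming the `torch.cuda.set_sync_debug_mode` API that this line of work exposes in current PyTorch (not necessarily in this exact commit):
```python
import torch

torch.cuda.set_sync_debug_mode("warn")     # "default", "warn", or "error"
x = torch.randn(4, device="cuda")
x.nonzero()                                # synchronizing call: warns (or raises in "error" mode)
torch.cuda.set_sync_debug_mode("default")  # restore the usual behavior
```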
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61889
Reviewed By: albanD
Differential Revision: D29805280
Pulled By: ngimel
fbshipit-source-id: b66400fbe0941b7daa51e6b30abe27b9cccd4e8a
Summary:
This PR suppresses clang-tidy warnings in the codebase (for now) so that we can re-enable clang-tidy checks on master.
I ran this script to add the `NOLINTNEXTLINE` comments (on a devserver):
```bash
python3 setup.py develop
# Uses same script that's run on CI and adds the -j (parallel), -s (add comments), -k (continue if diagnostic errors are found) options
python3 tools/clang_tidy.py \
-j \
-s \
-k \
-v \
--paths torch/csrc/ \
-g"-torch/csrc/jit/passes/onnx/helper.cpp" \
-g"-torch/csrc/jit/passes/onnx/shape_type_inference.cpp" \
-g"-torch/csrc/jit/serialization/onnx.cpp" \
-g"-torch/csrc/jit/serialization/export.cpp" \
-g"-torch/csrc/jit/serialization/import.cpp" \
-g"-torch/csrc/jit/serialization/import_legacy.cpp" \
-g"-torch/csrc/onnx/init.cpp" \
-g"-torch/csrc/cuda/nccl.*" \
-g"-torch/csrc/cuda/python_nccl.cpp" \
-g"-torch/csrc/autograd/FunctionsManual.cpp" \
-g"-torch/csrc/generic/*.cpp" \
-g"-torch/csrc/jit/codegen/cuda/runtime/*" \
-g"-torch/csrc/deploy/interpreter/interpreter.cpp" \
-g"-torch/csrc/deploy/interpreter/interpreter.h" \
-g"-torch/csrc/deploy/interpreter/interpreter_impl.h" \
-g"-torch/csrc/deploy/interpreter/test_main.cpp"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60649
Test Plan: Verified changes by re-running the script (without the `-s` option) and seeing no warnings/errors.
Reviewed By: walterddr, janeyx99
Differential Revision: D29504258
Pulled By: 1ntEgr8
fbshipit-source-id: 78310b30ee8213b73ddb4771ad874665323e7a4e
Summary:
Fixed the awkward configerator initialization issue that broke some
tests. Trying again.
Test Plan: predictor comparisons
Reviewed By: ZolotukhinM
Differential Revision: D28859795
fbshipit-source-id: 826801db24e86b1c3594a86e3ac32f0a84c496f7
Summary:
Fixes upcoming changes that are part of ROCm 4.2 and affect PyTorch JIT.
- ROCM_VERSION macro must be available to both device and host compilation passes.
- Unifies some of CUDA and HIP differences in the code generated.
- NAN / POS_INFINITY / NEG_INFINITY
- Do not hipify `extern __shared__` -> `HIP_DYNAMIC_SHARED()` macro [deprecated]
- Differentiates bf16 codegen for HIP.
- Optionally provides missing macros when using hiprtc precompiled header feature.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57400
Reviewed By: ejguan
Differential Revision: D28421065
Pulled By: malfet
fbshipit-source-id: 215f476773c61d8b0d9d148a4e5f5d016f863074
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58347
Back out "Revert D27652484 (ac04cc775b): [nnc] Enable CPU fusion inside Facebook"
Original commit changeset: ecfef3ee1e71
ghstack-source-id: 129279584
Test Plan: Tests for bugfix included in this stack
Reviewed By: navahgar
Differential Revision: D28461013
fbshipit-source-id: 79a80b6ffb653ab952ff5efaa143d3362bb7d966
Summary:
In my last PR I missed the CUDA and distributed folders; fixing this now.
This change was autogenerated by `python tools/clang_tidy.py -s`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57235
Reviewed By: janeyx99
Differential Revision: D28084444
Pulled By: malfet
fbshipit-source-id: bf222f69ee90c7872c3cb0931e8cdb84f0cb3cda
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os
def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files

def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname, "-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit", "--all", "-m", f"NOLINT stubs for {fname}"])

def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)

if __name__ == "__main__":
    main()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892
Reviewed By: H-Huang
Differential Revision: D27991944
Pulled By: malfet
fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
Summary:
On ROCm, the error when compiling grid_reduction.cu was "non-constant-expression cannot be narrowed from type 'int' to 'uint32_t'".
Added a typecast to fix the issue.
Also removed the test skip on ROCm: re-enabling the test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55243
Reviewed By: malfet
Differential Revision: D27917066
Pulled By: ngimel
fbshipit-source-id: b0b7c5fc8ecd2624222b35fe060846f7d1670f07
Summary:
Revert "Revert D27449031 (2a7df657fe): [pytorch][PR] [ROCm] use hiprtc precompiled header". Reland PR https://github.com/pytorch/pytorch/issues/54350.
This reverts commit 204ac21bf1.
The original PR was reverted under suspicion that it was causing CI instability, but it was instead due to a hardware failure.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55965
Reviewed By: jbschlosser
Differential Revision: D27755907
Pulled By: malfet
fbshipit-source-id: 75bf0b9d888df3dee62f00a366b1123757e0474e
Summary:
Converts loops of the form:
```
for(int64_t VAR=0;VAR<LIMIT;VAR++)
```
to the form
```
for(const auto VAR : c10::irange(LIMIT))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55148
Test Plan: Sandcastle
Reviewed By: ngimel
Differential Revision: D27447811
fbshipit-source-id: 6311a094ec4a81a0b57383aaee0ba1b1dc2445c4
Summary:
HIP's runtime compiler (hiprtc) is adding support for precompiled HIP headers in the ROCm 4.2 release. Conditionally add support for this feature. Using this feature will improve the ROCm torch wheel user experience; users will no longer need to install HIP headers separately to use torch JIT features.
The use of this feature is conditionalized on a new ROCM_VERSION macro.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54350
Reviewed By: H-Huang
Differential Revision: D27449031
Pulled By: malfet
fbshipit-source-id: 81a8d7847a47ce2bb253d1ea58740ef66ed154a3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54571
Supports bfloat16 via a similar method to half: upconvert inputs to
fp32, do math, then downconvert outputs to bf16.
Resource strings are mostly derived from cuda-11 headers.
Fixes #53918, for the legacy fuser at least.
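A Python analogue of the upconvert/compute/downconvert pattern described above (illustrative only; the real work happens in the generated kernel):
```python
import torch

def fused_bf16_op(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    a32, b32 = a.float(), b.float()      # upconvert bf16 inputs to fp32
    out32 = torch.sin(a32) * b32 + 1.0   # do the math in fp32
    return out32.bfloat16()              # downconvert the output to bf16

x = torch.randn(4, 4, dtype=torch.bfloat16)
y = torch.randn(4, 4, dtype=torch.bfloat16)
assert fused_bf16_op(x, y).dtype == torch.bfloat16
```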
Test Plan: Imported from OSS
Reviewed By: ailzhang
Differential Revision: D27328987
Pulled By: bertmaher
fbshipit-source-id: 5c0eae44164623faa0c75cb818e8bf0211579fdc
Summary:
Context: https://github.com/pytorch/pytorch/pull/53299#discussion_r587882857
These are the only hand-written parts of this diff:
- the addition to `.github/workflows/lint.yml`
- the file endings changed in these four files (to appease FB-internal land-blocking lints):
- `GLOSSARY.md`
- `aten/src/ATen/core/op_registration/README.md`
- `scripts/README.md`
- `torch/csrc/jit/codegen/fuser/README.md`
The rest was generated by running this command (on macOS):
```
git grep -I -l ' $' -- . ':(exclude)**/contrib/**' ':(exclude)third_party' | xargs gsed -i 's/ *$//'
```
I looked over the auto-generated changes and didn't see anything that looked problematic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53406
Test Plan:
This run (after adding the lint but before removing existing trailing spaces) failed:
- https://github.com/pytorch/pytorch/runs/2043032377
This run (on the tip of this PR) succeeded:
- https://github.com/pytorch/pytorch/runs/2043296348
Reviewed By: walterddr, seemethere
Differential Revision: D26856620
Pulled By: samestep
fbshipit-source-id: 3f0de7f7c2e4b0f1c089eac9b5085a58dd7e0d97
Summary:
nvcc's `--fmad=false` is not valid for the HIP compiler. Upcoming ROCm releases will start treating unrecognized compiler flags as an error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50508
Reviewed By: albanD
Differential Revision: D25920291
Pulled By: mrshenli
fbshipit-source-id: c0ff3b74dd07f3d0661ba29efafaab291ef3621c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52182
DynamicLibrary provides a very specific functionality, so there is no need to exposes it to every project depending on `ATen.h`
Test Plan: Imported from OSS
Reviewed By: walterddr
Differential Revision: D26417404
Pulled By: malfet
fbshipit-source-id: f8318cacb07dcc8b2f95984f88ea1df4e5369b8b
Summary:
Clang from Xcode does not support the `-fopenmp` option, so there is no need to try to compile with it.
Infer whether OpenMP is supported by checking the `_OPENMP` define.
Also, use the clang compiler if the host app was compiled with clang rather than gcc.
Fix a few range-loop warnings and add static_asserts that range-loop variables are raw pointers.
This change makes the fuser tests on OS X a bit faster.
Before:
```
% python3 test_jit.py -v TestScript.test_batchnorm_fuser_cpu
Fail to import hypothesis in common_utils, tests are not derandomized
CUDA not available, skipping tests
test_batchnorm_fuser_cpu (__main__.TestScript) ... clang: error: unsupported option '-fopenmp'
clang: error: unsupported option '-fopenmp'
warning: pytorch jit fuser failed to compile with openmp, trying without it...
ok
----------------------------------------------------------------------
Ran 1 test in 0.468s
OK
```
After:
```
% python3 test_jit.py -v TestScript.test_batchnorm_fuser_cpu
Fail to import hypothesis in common_utils, tests are not derandomized
CUDA not available, skipping tests
test_batchnorm_fuser_cpu (__main__.TestScript) ... ok
----------------------------------------------------------------------
Ran 1 test in 0.435s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51504
Reviewed By: smessmer
Differential Revision: D26186875
Pulled By: malfet
fbshipit-source-id: 930b3bcf543fdfad0f493d687072aaaf5f9e2bfc
Summary:
CUDA TK >= 11.1 provides a ptxjitcompiler that emits SASS instead of PTX.
1. This gives better backward compatibility, allowing a future TK to work with an older driver, which might not necessarily be able to load the generated PTX through JIT compilation and would error out at runtime;
https://docs.nvidia.com/deploy/cuda-compatibility/#using-ptx
2. Meanwhile, SASS doesn't provide good forward compatibility, so for unsupported archs we fall back to PTX to support future devices.
https://docs.nvidia.com/deploy/cuda-compatibility/index.html#cubin-compatibility
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50319
Reviewed By: malfet
Differential Revision: D26114475
Pulled By: ngimel
fbshipit-source-id: 046e9e7b3312d910f499572608a0bc1fe53feef5
Summary:
Sub-step of my attempt to split up the torch_cuda library, as it is huge. Please look at https://github.com/pytorch/pytorch/issues/49050 for details on the split and which files are in which target.
This PR introduces two new macros for Windows DLL purposes, TORCH_CUDA_CPP_API and TORCH_CUDA_CU_API. Both are defined as TORCH_CUDA_API for the time being.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50627
Reviewed By: mruberry
Differential Revision: D25955441
Pulled By: janeyx99
fbshipit-source-id: ff226026833b8fb2fb7c77df6f2d6c824f006869
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50228
`fastmod -m 'expect(<((at|c10)::)?\w+Type>\(\)\s*)->'
'expectRef${1}.'`
Presuming it builds, this is a safe change: the result of `expect()`
wasn't being saved anywhere, so we didn't need it, so we can take a
reference instead of a new `shared_ptr`.
ghstack-source-id: 119782961
Test Plan: CI
Reviewed By: SplitInfinity
Differential Revision: D25837374
fbshipit-source-id: 86757b70b1520e3dbaa141001e7976400cdd3b08
Summary:
This adds guarding for DifferentiableGraph nodes in order to not depend on
Also bailing out on required gradients for the CUDA fuser.
Fixes https://github.com/pytorch/pytorch/issues/49299
I still need to look into a handful of failing tests, but maybe it can be a discussion basis.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49433
Reviewed By: ngimel
Differential Revision: D25681374
Pulled By: Krovatkin
fbshipit-source-id: 8e7be53a335c845560436c0cceeb5e154c9cf296
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48868
Building on the previous diff, we can make `toTensor()` return a
`const Tensor&`, which should make it easier to avoid reference
counting.
ghstack-source-id: 119327372
Test Plan: internal benchmarks.
Reviewed By: bwasti
Differential Revision: D25325379
fbshipit-source-id: ca699632901691bcee432f595f75b0a4416d55dd
Summary:
The command used here is essentially `where cl.exe`. By using `system()` we will not be able to find cl.exe unless we are using the VS Developer Prompt, which makes `activate()` meaningless. Changing `system()` to `run()` fixes this.
Found during https://github.com/pytorch/pytorch/issues/49781.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50085
Reviewed By: smessmer
Differential Revision: D25782054
Pulled By: ezyang
fbshipit-source-id: e8e3cac903a73f3bd78def667ebe0e93201814c8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47665
Re-enabled the pass that removes outputs from the fusion which are only used by aten::size;
Added size computation for reduction ops via the new operator prim::ReductionSizes;
Test Plan: Imported from OSS
Reviewed By: navahgar, jamesr66a
Differential Revision: D25254675
Pulled By: Krovatkin
fbshipit-source-id: e9a057b0287ed0ac93b415647fd8e5e836ba9856
Summary:
Convert NVFuser's runtime CUDA sources (under `.../jit/codegen/cuda/runtime`) to string literals, then include the headers with the generated literals.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48283
Reviewed By: mrshenli
Differential Revision: D25163362
Pulled By: ngimel
fbshipit-source-id: 4e6c181688ddea78ce6f3c754fee62fa6df16641
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48200
CUDA 11.0 only supports < sm_80 (https://docs.nvidia.com/cuda/archive/11.0/nvrtc/#group__options)
Note: the NVRTC documentation is not a reliable source for querying supported architectures. The rule of thumb is that nvrtc supports the same set of archs as nvcc, so the best way to query that is something like `nvcc -h | grep -o "compute_[0-9][0-9]" | sort | uniq`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48455
Reviewed By: zhangguanheng66
Differential Revision: D25255529
Pulled By: ngimel
fbshipit-source-id: e84cf51ab50519b4c97dad063cc43c9194942bb2
Summary:
This PR aims to reduce the import overhead and symbol noise from the `windows.h` headers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48009
Reviewed By: gchanan
Differential Revision: D25045840
Pulled By: ezyang
fbshipit-source-id: 01fda70f433ba2dd0cd2d7cd676ab6ffe9d98b90
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47023
DeviceType pretty clearly only needs 1 byte. DeviceIndex only needs 1 byte given that machines don't have anywhere near 255 GPUs in them as far as I know.
ghstack-source-id: 116901430
Test Plan: Existing tests, added assertion to catch if my assumption about DeviceIndex is incorrect
Reviewed By: dzhulgakov
Differential Revision: D24605460
fbshipit-source-id: 7c9a89027fcf8eebd623b7cdbf6302162c981cd2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47959
Taking a shared_ptr by value incurs refcounting overhead and should only be done if the callee needs to take ownership. Otherwise, `const T&` is more efficient. (Specifically, you will have to do an atomic decrement when the argument is destroyed and probably an atomic increment as well. Passing by `const T&` also takes one less register than passing `std::shared_ptr<T>`, but that's less important.)
This diff fixes just this one function, but I'd be happy to audit & fix this whole file in future diffs. Thoughts?
ghstack-source-id: 116914899
Test Plan: build ATen-cpu
Reviewed By: Krovatkin
Differential Revision: D24970954
fbshipit-source-id: 6bdb4b710a94b8baf4ad63418fb38136134e0ef3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46092
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature.
This PR also changes the signature of some helpers called by empty to the new style.
ghstack-source-id: 116544203
(Note: this ignores all push blocking failures!)
Test Plan:
vs prev diff (outdated, before c10::optional fix): https://www.internalfb.com/intern/fblearner/details/224735103/
after c10::optional fix:
https://www.internalfb.com/intern/fblearner/details/231391773/
Also, after the c10::optional fix, the instruction counting benchmark shows a 2% regression for calling empty from Python. We decided this is acceptable and decided against landing D24425836 which would fix the regression.
Reviewed By: ezyang
Differential Revision: D24219944
fbshipit-source-id: e554096e90ce438c75b679131c3151ff8e5c5d50