Commit Graph

364 Commits

Author SHA1 Message Date
Nikita Shulga
c40a009d66 Revert D35194935: Check all CUDA API calls for errors in torch/
Test Plan: revert-hammer

Differential Revision:
D35194935 (79e5b053b6)

Original commit changeset: f5ec5be87cdf

Original Phabricator Diff: D35194935 (79e5b053b6)

fbshipit-source-id: 0bb770d2cdb29b8e724c0b6a125c748f363d3358
(cherry picked from commit 04e5a73da4a53b0ec296f3df2c85626d19290c1f)
2022-03-31 05:48:30 +00:00
Richard Barnes
79e5b053b6 Check all CUDA API calls for errors in torch/ (#74923)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74923

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D35194935

fbshipit-source-id: f5ec5be87cdf775eb9c99f8c3baed6b0366dda49
(cherry picked from commit 7284c4ed7d57261d4936055e0c1a3f8f911fb1f0)
2022-03-31 05:08:55 +00:00
jiej
86c817cfa0 Requires grad guard
Adds a CudaFusionGuard to guard on the device/requires_grad of the profiled tensor type.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74780
Approved by: https://github.com/davidberard98
2022-03-29 19:23:10 +00:00
jiej
e4e19d5beb nvfuser parser skip api (#74520)
Summary:
Added a Python API to disable nvfuser on certain op kinds.

```
          "_jit_set_nvfuser_skip_node_kind",
          [](const std::string& op_name, bool flip = true) {
            return fuser::cuda::skipNode(op_name, flip);
          })
```

Args:
    `op_name`: Symbol of op;
    `flip`: flag indicating whether to flip the given op in the skip list.
Returns:
    a bool flag indicating if `op_name` was already in the skip list.

A Python example that disables fusion of `aten::add` from that point on:
`torch._C._jit_set_nvfuser_skip_node_kind("aten::add", True)  # returns False, as no op is in skip list by default`
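
A sketch of the intended toggle flow, assuming the binding behaves as the Args/Returns contract above describes:

```
import torch

# add aten::add to the skip list; returns False because the list starts empty
torch._C._jit_set_nvfuser_skip_node_kind("aten::add", True)

# ... run scripted models here; nvfuser now leaves aten::add unfused ...

# flip it again to take aten::add back off the skip list; returns True
torch._C._jit_set_nvfuser_skip_node_kind("aten::add", True)
```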

Pull Request resolved: https://github.com/pytorch/pytorch/pull/74520

Reviewed By: saketh-are

Differential Revision: D35046110

Pulled By: davidberard98

fbshipit-source-id: 689f5286513dbab206768823a852467b9f6b49b6
(cherry picked from commit 9a31129f7591ba2d393ab057b1cd137a6a25e7e8)
2022-03-23 20:56:43 +00:00
Michael Suo
e5bf87963d Revert D34584878: [pytorch][PR] Add JIT graph fuser for oneDNN Graph API (Preview4)
Test Plan: revert-hammer

Differential Revision:
D34584878 (7dd0823011)

Original commit changeset: ce817aa8cc90

Original Phabricator Diff: D34584878 (7dd0823011)

fbshipit-source-id: a941aaad34f8fe5f0c51f719f9f5c29b811c4d5b
(cherry picked from commit a43262ec7521b1665b02a64d3f279e72ee2344b9)
2022-03-21 23:07:14 +00:00
chunyuan
7dd0823011 Add JIT graph fuser for oneDNN Graph API (Preview4) (#68111)
Summary:
## Description
Preview4 PR of this [RFC](https://github.com/pytorch/pytorch/issues/49444).

On the basis of https://github.com/pytorch/pytorch/pull/50256, the below improvements are included:

- The [preview4 release branch](https://github.com/oneapi-src/oneDNN/releases/tag/graph-v0.4.1) of the oneDNN Graph API is used
- The fuser now works with the profiling graph executor. We have inserted type check nodes to guard the profiled tensor properties.

### User API:
The optimization pass is disabled by default. Users could enable it by:
```
torch.jit.enable_onednn_fusion(True)
```
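
A minimal end-to-end sketch of the intended inference flow (the toy model and shapes are placeholders):

```
import torch

torch.jit.enable_onednn_fusion(True)  # opt in; the pass is off by default

model = torch.nn.Sequential(
    torch.nn.Linear(64, 64),
    torch.nn.ReLU(),
).eval()
scripted = torch.jit.script(model)

with torch.no_grad():
    x = torch.randn(32, 64)
    # warm-up runs let the profiling executor record tensor properties
    # before the fusion pass kicks in
    for _ in range(3):
        scripted(x)
    out = scripted(x)
```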

### Performance:
[pytorch/benchmark](https://github.com/pytorch/benchmark) tool is used to compare the performance:
- SkyLake 8180 (1 socket of 28 cores):

  ![image](https://user-images.githubusercontent.com/65992142/151162305-05e44425-a24e-4d5e-94e1-743b40b87a8c.png)

- SkyLake 8180 (single thread):

  ![image](https://user-images.githubusercontent.com/65992142/151162528-69f90b79-d08d-46b8-8775-d80a6ccbce8a.png)
 \* By mapping hardswish to oneDNN Graph, it’s 8% faster than PyTorch JIT (NNC + OFI)
  \** We expect performance gain after mapping transpose, contiguous & view to oneDNN graph ops

### Directory structure of the integration code
Fuser-related code is placed under:
```
torch/csrc/jit/codegen/onednn/
```

Optimization pass registration is done in:
```
torch/csrc/jit/passes/onednn_graph_fuser.h
```

CMake changes for the integration code are in:
```
caffe2/CMakeLists.txt
```

## Limitations

- In this PR, we have only supported the optimization on Linux. Support for Windows and macOS will be enabled as a next step.
- We have only optimized the inference use case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68111

Reviewed By: eellison

Differential Revision: D34584878

Pulled By: malfet

fbshipit-source-id: ce817aa8cc9052ee9ed930c9cf66be83449e61a4
(cherry picked from commit cd17683aa7d9c0947df45a1ab53627feff795587)
2022-03-21 22:12:19 +00:00
David Berard
890b1e8f9e [JIT] C10_EXPORT -> TORCH_API (#73818)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73818

These all appear to be defined in libtorch_cpu.so, so they should be marked with TORCH_API. TORCH_API means that these symbols are exported from libtorch_cpu.so and no other libraries. In comparison, C10_EXPORT will export the symbol from _all_ built libraries in which it appears.

I think most of these were fine because most were only defined in cpp files (which would only be included in the targets for one .so file). However, the change in pass_manager.h affects behavior, since the class is defined in the .h file, which could result in two separate implementations of the same static functions. Previously we saw issues on Windows with this: https://github.com/pytorch/pytorch/pull/73742
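
Schematically, the difference for a symbol defined in libtorch_cpu (the function name is hypothetical):

```
// Before: C10_EXPORT marks the symbol as exported from every library whose
// build includes this header, risking duplicate definitions across .so files.
C10_EXPORT void registerMyPass();

// After: TORCH_API exports the symbol from libtorch_cpu only and imports it
// everywhere else, so there is a single definition.
TORCH_API void registerMyPass();
```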

Test Plan: Imported from OSS

Reviewed By: george-qi

Differential Revision: D34698175

Pulled By: davidberard98

fbshipit-source-id: cb871e861cf966bff596cfa8340a32a17fca0b66
(cherry picked from commit 6b9988e5688e6d4a9928c3e331efb74f000a9e4a)
2022-03-14 20:29:58 +00:00
David Berard
31b64fc3e6 [JIT] log extract tool - dump NVFuser fallbacks instead of fusion groups (#73881)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73881

NVFuser fusion groups can contain nvfuser-only ops, e.g. `prim::reshape_copy`. Previously, we couldn't get a baseline performance measurement because the nvfuser-only ops would error out on the NNC and no-fusion runs. Instead, dump the fallback graphs, after they are converted into runnable fallbacks.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D34698307

Pulled By: davidberard98

fbshipit-source-id: c357b2736b789bfd347afe9c83a1b610b64881e0
(cherry picked from commit 5918d826502ff75fbc22d242844ae6435dd7d22a)
2022-03-08 16:38:17 +00:00
David Berard
b27ec57331 [JIT] script & logging for extracting IR from logs (#72889)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72889

The script along with the GRAPH_EXPORT macro will allow for an easy way to extract IR from logs. One use case in this diff is to extract the fusion groups from nvfuser, so that the fusions can be tested individually.

Usage (e.g. for nvfuser test)

1. Write some test.py file that uses nvfuser
2. `PYTORCH_JIT_LOG_LEVEL=">>graph_fuser" python3 test.py 2>&1 | tee output.txt`
3. `python3 pytorch/scripts/jit/log_extract.py output.txt --nvfuser`

This will run with and without nvfuser to compare the output.

Alternatively, use `--output` to dump the IR so that it can be used in other applications.

Currently, only `--output` works (since generating input tensors is not supported)

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D34440189

Pulled By: davidberard98

fbshipit-source-id: fca0f619200ee37aba34bb39b69e6c640c263e26
(cherry picked from commit eb319166075db160f1628f0de545641fbecde8be)
2022-03-02 18:34:35 +00:00
Gabor Kertesz
c4ff49f4c7 Enable win-arm64
This patch enables building PyTorch from source with the Ninja and
'Visual Studio 16 2019' CMake generators on Windows on Arm.

Tests:
- Build from source: 'python setup.py develop'.
- Run a simple PyTorch example: passed
- python test\test_torch.py:
-- same results as on x64
-- Ran 1344 tests, failures=2
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72424
2022-02-28 17:17:56 +00:00
CodemodService FBSourceClangFormatLinterBot
b9ccbe4ff2 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: bilalsou

Differential Revision: D34237270

fbshipit-source-id: f33c06e9cbbde8b1fa39b11f9addb716f3762c99
(cherry picked from commit 0db3686e9d)
2022-02-15 10:41:24 +00:00
jiej
2d110d514f Nvfuser code bump 2_1_2022 (#72127)
Summary:
Things changed in this PR that requires review:
1. aten/src/ATen/core/interned_strings.h
2. torch/csrc/jit/ir/alias_analysis.h : exposing createValue to allow efficient mutation
3. torch/csrc/jit/runtime/symbolic_shape_registry.cpp : added gelu/tanh/erf in registry
4. torch/jit/_script.py : throws when a scripted model uses autocast as a decorator, since it's not supported

nvfuser code update:
1. codegen improvements and performance tuning
2. integration bug fixes for shape expression logic
3. kernel segmentation update to address perf regression from horizontal fusion
4. scalar CPU tensor promotion to support inter-device operations between a CPU scalar tensor and a CUDA tensor

Things reverted from local changes:
aten::gelu with approximation (tracked in PR: https://github.com/pytorch/pytorch/pull/61439)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72127

Reviewed By: HamidShojanazeri

Differential Revision: D34113233

Pulled By: jbschlosser

fbshipit-source-id: b82cde32b71e324eca0ea57cb8c9f9647278ca74
(cherry picked from commit e009bc5c4e)
2022-02-15 00:43:16 +00:00
Ryan Spring
4f8b986e28 Implement Tanh Gelu Approximation (#61439)
Summary:
1. Implements https://github.com/pytorch/pytorch/issues/39853
2. Adds approximate boolean flag to Gelu
3. Enables Tanh Gelu approximation
4. Adds double backward support for Gelu
5. Enable Tanh Gelu in NvFuser

```
import torch

def normcdf(x):
    # standard normal CDF: 0.5 * (1 + erf(x / sqrt(2)))
    return 0.5 * (1.0 + torch.erf(x * 0.7071067811865476))

def gelu(x, approximate: str = 'none'):
    if approximate == 'tanh':
        # sqrt(2/pi) = 0.7978845608028654
        return 0.5 * x * (1.0 + torch.tanh(0.7978845608028654 * (x + 0.044715 * torch.pow(x, 3.0))))
    else:
        return x * normcdf(x)
```
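
For a quick sanity check of how closely the tanh form tracks the exact form, a small sketch using the function above (the range and print are illustrative):

```
import torch

x = torch.linspace(-4.0, 4.0, steps=1001)
exact = gelu(x)                       # x * normcdf(x)
approx = gelu(x, approximate='tanh')  # tanh-based approximation

# the two curves should agree closely over this range
print((exact - approx).abs().max())
```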

Linking XLA PR - https://github.com/pytorch/xla/pull/3039

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61439

Reviewed By: VitalyFedyunin

Differential Revision: D33894937

Pulled By: jbschlosser

fbshipit-source-id: b65e8fb6ea66168af8f34f45ed50e92737a33851
(cherry picked from commit 6e986f91a9)
2022-02-14 03:40:32 +00:00
Nikita Shulga
74c44ba9d6 Revert D33850228: [pytorch][PR] Implement Tanh Gelu Approximation
Test Plan: revert-hammer

Differential Revision:
D33850228 (23d03025dc)

Original commit changeset: 3cc33fb298e4

Original Phabricator Diff: D33850228 (23d03025dc)

fbshipit-source-id: 9436e7df73c2b2e2011f321674f24973316d3692
(cherry picked from commit c9efb58223)
2022-01-31 17:44:19 +00:00
Ryan Spring
23d03025dc Implement Tanh Gelu Approximation (#61439)
Summary:
1. Implements https://github.com/pytorch/pytorch/issues/39853
2. Adds approximate boolean flag to Gelu
3. Enables Tanh Gelu approximation
4. Adds double backward support for Gelu
5. Enable Tanh Gelu in NvFuser

```
import torch

def normcdf(x):
    # standard normal CDF: 0.5 * (1 + erf(x / sqrt(2)))
    return 0.5 * (1.0 + torch.erf(x * 0.7071067811865476))

def gelu(x, approximate: str = 'none'):
    if approximate == 'tanh':
        # sqrt(2/pi) = 0.7978845608028654
        return 0.5 * x * (1.0 + torch.tanh(0.7978845608028654 * (x + 0.044715 * torch.pow(x, 3.0))))
    else:
        return x * normcdf(x)
```

Linking XLA PR - https://github.com/pytorch/xla/pull/3039

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61439

Reviewed By: cpuhrsch

Differential Revision: D33850228

Pulled By: jbschlosser

fbshipit-source-id: 3cc33fb298e480d7ecc5c67716da019d60c6ab33
(cherry picked from commit 3a53b3e94f)
2022-01-31 17:07:45 +00:00
Nolan O'Brien
d68c314b13 [warnings][caffe2] Fix asserts yielding -Wstring-conversion warnings (#72013)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72013

Find and replace `assert(!"` with `assert(false && "`
Excludes headers and paths that contain "third-party" or "external"

Clang raises a `-Wstring-conversion` warning when a string is treated as a boolean. This is not uncommon in asserts (e.g. `assert(!"should never happen")`). Clang does, however, permit `expr && "string"` to support these assertion use cases.
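
The before/after shape of the codemod as a tiny example (the message text is illustrative):

```
#include <assert.h>

void handle(int kind) {
  if (kind != 0) {
    /* Before: assert(!"unhandled kind");  -- the string literal is used as
       a boolean, which trips clang's -Wstring-conversion. */
    /* After: the string is ANDed with an explicit boolean, so there is no
       conversion warning and the message still prints on failure. */
    assert(false && "unhandled kind");
  }
}
```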

Test Plan: ci pass

Differential Revision: D33823092

fbshipit-source-id: 9a1af012215bdc91f8b4162ddb2df28d51539773
(cherry picked from commit 0286910350)
2022-01-29 00:48:06 +00:00
Joel Schlosser
cb823d9f07 Revert D33744717: [pytorch][PR] Implement Tanh Gelu Approximation
Test Plan: revert-hammer

Differential Revision:
D33744717 (f499ab9cef)

Original commit changeset: d64532a562ed

Original Phabricator Diff: D33744717 (f499ab9cef)

fbshipit-source-id: 396c3f63de5865f894dbc353d0790a01a624be93
(cherry picked from commit e9fb2d1db1)
2022-01-28 18:35:01 +00:00
Ryan Spring
f499ab9cef Implement Tanh Gelu Approximation (#61439)
Summary:
1. Implements https://github.com/pytorch/pytorch/issues/39853
2. Adds approximate boolean flag to Gelu
3. Enables Tanh Gelu approximation
4. Adds double backward support for Gelu
5. Enable Tanh Gelu in NvFuser

```
import torch

def normcdf(x):
    # standard normal CDF: 0.5 * (1 + erf(x / sqrt(2)))
    return 0.5 * (1.0 + torch.erf(x * 0.7071067811865476))

def gelu(x, approximate: str = 'none'):
    if approximate == 'tanh':
        # sqrt(2/pi) = 0.7978845608028654
        return 0.5 * x * (1.0 + torch.tanh(0.7978845608028654 * (x + 0.044715 * torch.pow(x, 3.0))))
    else:
        return x * normcdf(x)
```

Linking XLA PR - https://github.com/pytorch/xla/pull/3039

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61439

Reviewed By: mikaylagawarecki

Differential Revision: D33744717

Pulled By: jbschlosser

fbshipit-source-id: d64532a562ed53247bb4fa52bb16722634d5c187
(cherry picked from commit 4713dd9cca)
2022-01-28 16:59:09 +00:00
Will Constable
4523a73288 Fix usages of TORCH_CHECK/_INTERNAL_ASSERT without condition (#71879)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71879

Two locations of improper macro usage were reported (https://github.com/pytorch/pytorch/issues/71848), and this diff fixes them. In both cases this is behavior-changing, since the incorrect usages would have passed the assertion by interpreting the error string as the condition; both cases should have been 'assert false'.
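
Sketched with a hypothetical message (TORCH_INTERNAL_ASSERT behaves analogously):

```
#include <ATen/core/Tensor.h>
#include <c10/util/Exception.h>

void check_input(const at::Tensor& tensor) {
  // Incorrect: the string literal becomes the "condition", which is always
  // truthy, so this check can never fail:
  //   TORCH_CHECK("expected a contiguous tensor");

  // Correct: condition first, then the message.
  TORCH_CHECK(tensor.is_contiguous(), "expected a contiguous tensor");
}

void unreachable_branch() {
  // For an intentional unconditional failure ('assert false'):
  TORCH_CHECK(false, "unreachable");
}
```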

Test Plan: Run CI

Reviewed By: alanwaketan

Differential Revision: D33800406

fbshipit-source-id: dfe3d9a6455e6eb96cb639022f8813a8bd6520c3
(cherry picked from commit ee551e5a16)
2022-01-27 04:20:55 +00:00
Nolan O'Brien
0fdb90da5e [warning] Fix TORCH_INTERNAL_ASSERT calls missing condition to check 1/x (#71711)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71711

This will fix a ton of broken asserts that should always fire but never actually fire.
All would have been caught with `-Wstring-conversion` warnings enabled.

Test Plan: CI Pass

Differential Revision: D33743605

fbshipit-source-id: 062641f9d5d02c6e317c5a286fd01017cf77237f
(cherry picked from commit 639b42e04b)
2022-01-25 15:45:21 +00:00
CodemodService FBSourceClangFormatLinterBot
88012c7daf [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D33577744

fbshipit-source-id: 7ecc8367998ee1dffde54c2f4dd3cfafe19a53c9
2022-01-14 06:10:57 -08:00
Mike Ruberry
3a0c680a14 Jiterates exp2, erfc, erfinv and entr and refactors code_template.h to ATen (#71295)
Summary:
Per title.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71295

Reviewed By: ngimel

Differential Revision: D33575885

Pulled By: mruberry

fbshipit-source-id: bc841b46fc0b5458a26a4d4465b18a7a54cd5a5b
2022-01-13 23:58:51 -08:00
Shintaro Iwasaki
5cae40c169 [pytorch][aten][cuda] move CUDAGeneratorImpl.h to ATen/cuda (#70650)
Summary:
This patch moves a CUDA-specific file, `CUDAGeneratorImpl.h` to `ATen/cuda` as the following TODO comment in  `CUDAGeneratorImpl.h` suggests:
```
// TODO: this file should be in ATen/cuda, not top level
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70650

Reviewed By: jianyuh, xw285cornell

Differential Revision: D33414890

Pulled By: shintaro-iwasaki

fbshipit-source-id: 4ff839205f4e4ea4c8767f164d583eb7072f1b8b
2022-01-10 22:27:04 -08:00
Scott Wolchok
ddea6980fe [PyTorch][JIT] Don't refcount Type singletons (#69579)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69579

This should help us avoid reference counting overhead on singleton Type subclasses without a major rewrite of the Type subsystem.
ghstack-source-id: 146643993

Test Plan:
Ran //caffe2/caffe2/fb/high_perf_models/pytorch/benchmark_framework_overheads:cpp_benchmark with arguments `--op empty -niter 40 --stressTestRecordFunction --captureRecordFunctionInputs` on devbig with turbo off.

Before:
```
I1206 13:47:15.037441 1201670 bench.cpp:144] Mean 0.737675
I1206 13:47:15.037463 1201670 bench.cpp:145] Median 0.736725
I1206 13:47:15.037468 1201670 bench.cpp:146] Min 0.722897
I1206 13:47:15.037473 1201670 bench.cpp:147] stddev 0.00508187
I1206 13:47:15.037482 1201670 bench.cpp:148] stddev / mean 0.00688903
```

After:
```
I1206 13:48:16.830123 1205612 bench.cpp:144] Mean 0.66988
I1206 13:48:16.830150 1205612 bench.cpp:145] Median 0.663956
I1206 13:48:16.830157 1205612 bench.cpp:146] Min 0.65986
I1206 13:48:16.830164 1205612 bench.cpp:147] stddev 0.0335928
I1206 13:48:16.830171 1205612 bench.cpp:148] stddev / mean 0.0501475
```

Static runtime startup is also improved; for CMF local_ro, time to initialize a predictor went from 10.01s to 9.59s.

(Note: I wish I had a production workload to demonstrate the advantage of this on. I tried ctr_mobile_feed local_ro net but it was neutral. Anything that manipulates types or List/Dict a lot might be promising.)

Reviewed By: suo

Differential Revision: D32923880

fbshipit-source-id: c82ed6689b3598e61047fbcb2149982173127ff0
2022-01-06 17:39:16 -08:00
Peter Bell
fa09099ba3 Codegen: TraceType only includes operators being registered (#68691)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68691

TraceType is a sharded file, so by only including specific operator
headers, we ensure that changing one (non-method) operator only needs
one shard to be re-compiled.

This also changes all the included autograd and jit headers from
including `ATen/ATen.h` to just including `ATen/core/Tensor.h`.

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D33336948

Pulled By: albanD

fbshipit-source-id: 4e40371592b9a5a7e7fcd1d8cecae11ffb873113
2022-01-02 13:09:19 -08:00
jjsjann123
e429a68478 Allow single node fusion for nvfuser (#70000)
Summary:
Setting `PYTORCH_NVFUSER_ONE_OP_FUSION=1` will fuse every node nvFuser supports, instead of waiting for a fusion opportunity.
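
A sketch of setting the flag (the script name is a placeholder):

```
# fuse every node nvFuser supports, even single-op groups
PYTORCH_NVFUSER_ONE_OP_FUSION=1 python my_model.py
```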

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70000

Reviewed By: samdow

Differential Revision: D33292195

Pulled By: davidberard98

fbshipit-source-id: 8ed5ce5e82fbb6737e8ab5ce4223b038eaf47756
2021-12-23 17:07:57 -08:00
CodemodService FBSourceClangFormatLinterBot
181120f7d7 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D33229251

fbshipit-source-id: 3a69bb459fa0a65888d6f9c8e70b5de032ddad97
2021-12-19 16:38:25 -08:00
jiej
78f06e0690 fixing conv2d decomposition and tests (#70127)
Summary:
The current implementation has a bug where the decomposed `add_optional` from `conv2d` is placed before its producer node, which causes a linter error on the graph.

Cherry-picked from https://github.com/csarofeen/pytorch/pull/1333
Fixing issue posted in https://github.com/csarofeen/pytorch/issues/1325

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70127

Reviewed By: ejguan

Differential Revision: D33199018

Pulled By: jansel

fbshipit-source-id: bce1f14a443811b4d55116a04fd4daa86084cc47
2021-12-19 10:38:23 -08:00
Nikita Shulga
26e32988bd Revert D32596264: Codegen: TraceType only includes operators being registered
Test Plan: revert-hammer

Differential Revision:
D32596264 (e66a8ab4f5)

Original commit changeset: 2f28b62d7b99

Original Phabricator Diff: D32596264 (e66a8ab4f5)

fbshipit-source-id: 7d18c4e77ce30dd7817a95f9c39b565cb246cd12
2021-12-17 11:20:12 -08:00
Peter Bell
e66a8ab4f5 Codegen: TraceType only includes operators being registered (#68691)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68691

TraceType is a sharded file, so by only including specific operator
headers, we ensure that changing one (non-method) operator only needs
one shard to be re-compiled.

This also changes all the included autograd and jit headers from
including `ATen/ATen.h` to just including `ATen/core/Tensor.h`.

Test Plan: Imported from OSS

Reviewed By: jbschlosser, malfet

Differential Revision: D32596264

Pulled By: albanD

fbshipit-source-id: 2f28b62d7b9932f30fad7daacd8ac5bb7f63c621
2021-12-17 10:35:05 -08:00
CodemodService FBSourceClangFormatLinterBot
de2d9e2966 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D33183467

fbshipit-source-id: d7c37f3522a38e85891524c544eab4fdb01270de
2021-12-17 09:45:20 -08:00
Nikita Shulga
92463573d8 Sanitize string before passing it as shell argument (#70070)
Summary:
Use `c10::printQuotedString` to escape any characters that might cause
the string to be interpreted as more than one argument by the shell.

Please note that this codepath is deprecated and is not accessible
through typical PyTorch usage workflows.

This issue was discovered by Daniel Lawrence of the Amazon Alexa team.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70070

Reviewed By: suo

Differential Revision: D33172721

Pulled By: malfet

fbshipit-source-id: 9dbd17f6eb775aaa1a545da42cbc95864c1189ee
2021-12-17 08:08:28 -08:00
jiej
76d282d447 Nvfuser code bump 12 5 (#69964)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69964

Things added in this PR that require review:
1. cuLaunchCooperativeKernel driver API added
aten/src/ATen/cuda/detail/LazyNVRTC.cpp
aten/src/ATen/cuda/nvrtc_stub/ATenNVRTC.h

nvfuser code update:
1. performance tuning of the codegen scheduler.
2. permutation support has been extended beyond contiguous/channels-last (the improvements can be observed on the PW benchmark).

Things reverted from local changes:
1. aten::gelu with approximation
2. local changes that is upstreamed in PR https://github.com/pytorch/pytorch/issues/68804

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69428

Reviewed By: ngimel

Differential Revision: D33073817

Pulled By: wconstab

fbshipit-source-id: e77d32e81d037d7370822b040456fd4c3bd68edb
2021-12-16 08:28:54 -08:00
Peter Bell
b2e79ed5ec Remove WindowsTorchApiMacro.h in favor of Export.h (#69585)
Summary:
Follow up to https://github.com/pytorch/pytorch/issues/68095

This also changes the files from the ATen folder to include c10's `Export.h` instead since they can't ever be exporting `TORCH_PYTHON_API`.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69585

Reviewed By: mrshenli

Differential Revision: D32958594

Pulled By: albanD

fbshipit-source-id: 1ec7ef63764573fa2b486928955e3a1172150061
2021-12-09 17:30:09 -08:00
Peter Bell
e279963eef Remove remaining THC code (#69039)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69039

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D32872476

Pulled By: ngimel

fbshipit-source-id: 7972aacc24aef9450fb59b707ed6396c501bcb31
2021-12-08 12:18:08 -08:00
jjsjann123
0dc3f829d9 Nvfuser code bump 11 5 (#67943)
Summary:
nvfuser code update:
1. Tuning heuristics on schedulers for reduction/normalization kernels;
2. bfloat16 on IO tensor support;
3. Refactored memory format support; we can now support dimension collapsing with non-coherent input tensors of different memory formats, e.g. a channels-last tensor input to batch normalization. Note that we are currently limiting memory formats to only Contiguous and Channels Last;
4. Refactored nvfuser graph partitioning in `graph_fuser.cpp`, separating the node merge and profile node APIs. Updated `profiling_record.cpp`.

Things that are reverted from our local branch:
1. changes on some entries in autodiff
2. aten::gelu with approximation
3. native_dropout(_backward)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67943

Reviewed By: ngimel

Differential Revision: D32288709

Pulled By: dzhulgakov

fbshipit-source-id: fc9491182ea7e0158bc112c66f096823c588eaf1
2021-11-17 01:22:17 -08:00
Alexander Grund
e4a9ee8d42 Deduplicate codegenOutputQuery to query maximum CUDA compute capabilities (#55901)
Summary:
There were two versions of the same code which were slightly different although functionally equivalent. When adding support for another CUDA/device version, both would need to be changed and kept in sync, so it is better to have a single version as the unique source of truth.

I chose the implementation that looks cleaner and easier to read, and added some minor enhancements and comments to further increase readability.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55901

Reviewed By: H-Huang

Differential Revision: D31636917

Pulled By: bertmaher

fbshipit-source-id: 622e1fabc39de4f3f1b1aa9a1544cfbd35a5cfd9
2021-10-18 07:42:15 -07:00
Scott Wolchok
2d885ab73d [jit] Reduce refcounting of Types (#65345)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65345

FooType::get() can return a const reference. Inconveniently, converting shared_ptr<FooType> to shared_ptr<Type> requires a copy & refcount bump, so to properly take advantage of this in unshapedType() we need to take a const Type& in isSubtypeOf(), which is good practice anyway -- don't require a shared_ptr if you don't need to take ownership.
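
The signature shape this implies, sketched in simplified form (not the full Type API):

```
// Taking shared_ptr forces callers holding a shared_ptr<FooType> to convert
// (copy + refcount bump) just to ask a question:
bool isSubtypeOf(const std::shared_ptr<Type>& rhs) const;

// Taking a const reference borrows instead: no ownership, no bump.
bool isSubtypeOf(const Type& rhs) const;
```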
ghstack-source-id: 140044165

Test Plan:
CI

perf says c10::unshapedType time decreased from 2.8% to 2.2% during static runtime startup, though I expect this to be generally beneficial.

Reviewed By: hlu1

Differential Revision: D31027361

fbshipit-source-id: 676feb81db9f74ad7b8651d8774f4ecb4cfa6ab8
2021-10-08 09:03:04 -07:00
jiej
321345d7c9 Revert "Revert D31227448: [pytorch][PR] fixing sorting in stride indices" (#66176)
Summary:
enabling https://github.com/pytorch/pytorch/issues/63940

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66176

Reviewed By: ngimel

Differential Revision: D31423920

Pulled By: dzhulgakov

fbshipit-source-id: 06b1e0f757f4fb5b31ee1fa464bcd689df919b9c
2021-10-07 22:09:07 -07:00
Bin Bao
6e06cb76ff [JIT] Initialize CUDA context before launching fused kernel (#65064)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65064

The problem appears when nvfuser is triggered from LazyTensor.
Because LT maintains its own thread pool, the thread used for the first-time
compilation does CUDA context initialization properly, but later
cached execution may use a different thread which does not have
a proper CUDA context.

Test Plan: Imported from OSS

Reviewed By: saketh-are

Differential Revision: D31269691

Pulled By: desertfire

fbshipit-source-id: 384362025c087d61e8b625ff938379df283ef8b2
2021-10-05 16:01:59 -07:00
Nikita Shulga
4c4525fa5c Compile without -Wno-unused-variable (take 2) (#66041)
Summary:
Delete `-Wno-unused-variable` from top level `CMakeLists.txt`
Still suppress those warnings for tests and `torch_python`

Delete a number of unused variables from caffe2 code
Use `(void)var;` to suppress unused variables in range loops
Use `C10_UNUSED` for global constructors and use `constexpr` instead of `static` for global constants

Do not delete `caffe2::OperatorBase::Output` calls as they have side effects
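
The two suppression idioms mentioned above, in a small sketch (the surrounding code is illustrative):

```
#include <c10/macros/Macros.h>
#include <vector>

// a global whose constructor runs only for its side effect:
C10_UNUSED static const bool kRegistered = [] { return true; }();

void consume(const std::vector<int>& v) {
  // range loop whose variable is intentionally unused:
  for (const auto& item : v) {
    (void)item;  // silences -Wunused-variable without changing behavior
  }
}
```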

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66041

Reviewed By: ngimel

Differential Revision: D31360142

Pulled By: malfet

fbshipit-source-id: 6fdfb9f91efdc49ca984a2f2a17ee377d28210c8
2021-10-04 20:39:39 -07:00
soulitzer
4cdfceddd2 [Reland] Avoid saving self for softmax and log_softmax (#66018)
Summary:
Reland of https://github.com/pytorch/pytorch/pull/65242

The last reland attempt was automatically rebased onto stable, which did not yet have the revert commit

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66018

Reviewed By: albanD

Differential Revision: D31348822

Pulled By: soulitzer

fbshipit-source-id: 881d701b404530c1352ac9245bd67264e1652b8a
2021-10-03 21:35:01 -07:00
Nikita Shulga
e4ee5ca698 Revert D31326599: [pytorch][PR] Compile without -Wno-unused-variable
Test Plan: revert-hammer

Differential Revision:
D31326599 (a6280ab653)

Original commit changeset: 924155f1257a

fbshipit-source-id: b8ee5bc0298637443232f5ee9ec79e51ed256faf
2021-10-01 20:40:47 -07:00
Nikita Shulga
a6280ab653 Compile without -Wno-unused-variable (#65954)
Summary:
Delete `-Wno-unused-variable` from top level `CMakeLists.txt`
Still suppress those warnings for tests and `torch_python`

Delete a number of unused variables from caffe2 code
Use `(void)var;` to suppress unused variables in range loops
Use `C10_UNUSED` for global constructors and use `constexpr` instead of `static` for global constants

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65954

Reviewed By: ngimel

Differential Revision: D31326599

Pulled By: malfet

fbshipit-source-id: 924155f1257a2ba1896c50512f615e45ca1f61f3
2021-10-01 17:40:47 -07:00
Michael Suo
ccf8d48f16 Revert D31317680: [pytorch][PR] Avoid saving self for softmax and log_softmax
Test Plan: revert-hammer

Differential Revision:
D31317680 (5f7cadc7aa)

Original commit changeset: b3b921e06775

fbshipit-source-id: 1bca0672383536a2c21243ceb52349c766a94344
2021-10-01 09:31:44 -07:00
soulitzer
5f7cadc7aa Avoid saving self for softmax and log_softmax (#65242)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/64000
 - updates double backward formula to compute grad wrt output instead of self
 - ~~In some of the error messages, we still refer to the dtype of the input, even though we are now checking the dtype of the output~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65242

Reviewed By: malfet

Differential Revision: D31317680

Pulled By: soulitzer

fbshipit-source-id: b3b921e06775cfc12e5a97a9ee8d73aec3aac7c3
2021-10-01 07:49:07 -07:00
Pruthvi Madugundu
085e2f7bdd [ROCm] Changes not to rely on CUDA_VERSION or HIP_VERSION (#65610)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65610

- Replace HIP_PLATFORM_HCC with USE_ROCM
- Don't rely on CUDA_VERSION or HIP_VERSION; use USE_ROCM and ROCM_VERSION instead.

- In the next PR
   - Will be removing the mapping from CUDA_VERSION to HIP_VERSION and CUDA to HIP in hipify.
   - HIP_PLATFORM_HCC is deprecated, so will add HIP_PLATFORM_AMD to support HIP host code compilation on gcc.

cc jeffdaily sunway513 jithunnair-amd ROCmSupport amathews-amd

Reviewed By: jbschlosser

Differential Revision: D30909053

Pulled By: ezyang

fbshipit-source-id: 224a966ebf1aaec79beccbbd686fdf3d49267e06
2021-09-29 09:55:43 -07:00
Nikita Shulga
82e0bf44c0 Apply linter suggestions to #65137 (#65459)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65459

Just ran the linter on the change and applied all suggestions

Test Plan: N/A

Reviewed By: seemethere

Differential Revision: D31102960

fbshipit-source-id: 04e1d07935690f2ddbc64533661b3e55379d13b5
2021-09-27 13:07:40 -07:00
CodemodService FBSourceClangFormatLinterBot
2a4d5e4c6d [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D31138547

fbshipit-source-id: ba134ae7f057c918eaefdc6310f7663e187e9749
2021-09-23 07:54:32 -07:00
jiej
127c9402d0 Revert "Revert D30752939: [pytorch][PR] nvfuser update" (#65137)
Summary:
This reverts commit 03389dc851.

Attempt again for PR: https://github.com/pytorch/pytorch/issues/63745
Fixes the Windows build failure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65137

Reviewed By: seemethere, dzhulgakov, heitorschueroff

Differential Revision: D30994556

Pulled By: malfet

fbshipit-source-id: f1925b6c5cc1a1a441a96499667c91e8dfc1b53d
2021-09-22 04:54:51 -07:00
Eli Uriegas
03389dc851 Revert D30752939: [pytorch][PR] nvfuser update
Test Plan: revert-hammer

Differential Revision:
D30752939 (cfaecaf40b)

Original commit changeset: ce122e80f01b

fbshipit-source-id: 57685df8f9946032a06eff1de8a3d1498500d2d2
2021-09-15 17:38:47 -07:00
jiej
cfaecaf40b nvfuser update (#63745)
Summary:
Syncing the nvfuser code base from the devel branch. Listing a few of our developments since the last sync:

- Extends support to normalization and reduction kernels.
- Multiple kernel launches for a single `CudaFusionGroup`. The hierarchical caching system has been updated to cache graph segmentation.
- profile_ivalue is enabled to convert dynamic scalars into compile-time constants, which are required by the codegen (e.g. reduction axes).

To keep this PR simple and relatively review-free. We stripped most external changes and submitted them as separate PRs, so this gigantic PR is easier to handle.

Internal updates are in files located under:
1. updates in nvfuser codegen `torch/csrc/jit/codegen/cuda`
2. added nvfuser specific benchmarks `benchmarks/cpp/nvfuser`
3. nvfuser jit cpp tests `test/cpp/jit/test_gpu.cpp` `test/cpp/jit/test_gpu_shift.cpp` `test/cpp/jit/test_gpu_validator.h`

updates affecting integration:

1. profile_ivalue enabled for nvfuser. related changes are in `torch/csrc/jit/runtime/*`,
2. exposed a few more symbols `aten/src/ATen/core/*` used by codegen

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63745

Reviewed By: saketh-are

Differential Revision: D30752939

Pulled By: malfet

fbshipit-source-id: ce122e80f01bcd3865f5bd3c4dfde660665fd84c
2021-09-15 14:42:55 -07:00
Zhengxu Chen
ac99d63f83 [jit] Make operation call accept Stack& instead Stack* (#63414)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63414

Misuse of a raw pointer here, where the stack is never nullable.
ghstack-source-id: 136938318

Test Plan:
compiles.

Imported from OSS

Reviewed By: ejguan

Differential Revision: D30375410

fbshipit-source-id: 9d65b620bb76d90d886c800f54308520095d58ee
2021-08-30 11:49:20 -07:00
Bert Maher
a709ab34a8 [nnc] Re-enable CPU fusion (#63665)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63665

This reverts commit 125e2d02e5.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D30471646

Pulled By: bertmaher

fbshipit-source-id: 4189869566f03b5f9ada78d78830f6a34946eed6
2021-08-23 12:42:42 -07:00
Alban Desmaison
125e2d02e5 Revert D30417370: [nnc] Enable CPU fusion
Test Plan: revert-hammer

Differential Revision:
D30417370 (b9fc656cf2)

Original commit changeset: 84ce7a578a36

fbshipit-source-id: cd23774cdc3273fd72f8a05f1900eaf36f373e6b
2021-08-20 12:30:21 -07:00
Bert Maher
b9fc656cf2 [nnc] Enable CPU fusion (#63545)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63545

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D30417370

Pulled By: bertmaher

fbshipit-source-id: 84ce7a578a3678d5562bab99d1dc00330c4f72d1
2021-08-20 11:18:21 -07:00
Pruthvi Madugundu
ab7a472980 [ROCm] Update HIP_VERSION to TORCH_HIP_VERSION (#62786)
Summary:
- HIP_VERSION semantic versioning will change in ROCm 4.3. The changes essentially remove the dependency on the HIP_VERSION provided in the hip header, keeping the code compatible with older and newer versions of ROCm.
- TORCH_HIP_VERSION is derived from HIP_VERSION_MAJOR and HIP_VERSION_MINOR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62786

Reviewed By: bdhirsh

Differential Revision: D30281682

Pulled By: seemethere

fbshipit-source-id: e41e69fb9e13de5ddd1af99ba5bbdcbb7b64b673
2021-08-13 15:00:43 -07:00
Richard Barnes
d2594fa538 irange-ify 3 (#62112)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62112

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D29879513

fbshipit-source-id: c01d18d34bb19014bf28d92c4d04b07e50a2770a
2021-07-26 12:56:58 -07:00
Richard Barnes
ee44d73e59 Modernize override (#61744)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61744

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D29717320

fbshipit-source-id: 6eea4295ee2e5572ab337620be412376fcc2f3cc
2021-07-23 23:04:46 -07:00
Nikita Shulga
a9b0a921d5 Disable avoid-non-const-global-variables lint check (#62008)
Summary:
As GoogleTest `TEST` macro is non-compliant with it as well as `DEFINE_DISPATCH`

All changes but the ones to `.clang-tidy` are generated using following script:
```
for i in `find . -type f -iname "*.c*" -or -iname "*.h"|xargs grep cppcoreguidelines-avoid-non-const-global-variables|cut -f1 -d:|sort|uniq`;  do sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" $i; done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008

Reviewed By: driazati, r-barnes

Differential Revision: D29838584

Pulled By: malfet

fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
2021-07-22 18:04:40 -07:00
Natalia Gimelshein
6284d2a82b wrap cudaStreamSynchronize calls (#61889)
Summary:
This is a first step towards creating a context manager that errors out on synchronizing calls.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61889

Reviewed By: albanD

Differential Revision: D29805280

Pulled By: ngimel

fbshipit-source-id: b66400fbe0941b7daa51e6b30abe27b9cccd4e8a
2021-07-21 19:30:52 -07:00
Richard Barnes
59a5312ce6 Modernize fix deprecated header (#61736)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61736

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D29716965

fbshipit-source-id: 314c2b557c240ac16bbfab114ab764beb189e78a
2021-07-20 10:06:11 -07:00
Richard Barnes
349f2f767c Modernize to default constructor and nullptr in torch (#61735)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61735

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D29716659

fbshipit-source-id: ec2a0a0b7e55d2e50b1d35f0b651bd40675ae7e8
2021-07-16 10:51:13 -07:00
Nikita Shulga
635d864b26 Fix modernize-use-equals-default nolint failures in torch/csrcs (#61142)
Summary:
Test-plan: Compile + clang-tidy

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61142

Reviewed By: VitalyFedyunin

Differential Revision: D29529372

Pulled By: malfet

fbshipit-source-id: 2ccde7712a51c28243b16bbb4d1d68086e0414a6
2021-07-06 09:46:46 -07:00
Mike Guo
6ecc1a4c4f Make pytorch clang-tidy clean (#60649)
Summary:
This PR suppresses clang-tidy warnings in the codebase (for now) so that we can re-enable clang-tidy checks on master.

I ran this script to add the `NOLINTNEXTLINE` comments (on a devserver):
```bash
python3 setup.py develop

# Uses same script that's run on CI and adds the -j (parallel), -s (add comments), -k (continue if diagnostic errors are found) options
python3 tools/clang_tidy.py \
  -j \
  -s \
  -k \
  -v \
  --paths torch/csrc/ \
  -g"-torch/csrc/jit/passes/onnx/helper.cpp" \
  -g"-torch/csrc/jit/passes/onnx/shape_type_inference.cpp" \
  -g"-torch/csrc/jit/serialization/onnx.cpp" \
  -g"-torch/csrc/jit/serialization/export.cpp" \
  -g"-torch/csrc/jit/serialization/import.cpp" \
  -g"-torch/csrc/jit/serialization/import_legacy.cpp" \
  -g"-torch/csrc/onnx/init.cpp" \
  -g"-torch/csrc/cuda/nccl.*" \
  -g"-torch/csrc/cuda/python_nccl.cpp" \
  -g"-torch/csrc/autograd/FunctionsManual.cpp" \
  -g"-torch/csrc/generic/*.cpp" \
  -g"-torch/csrc/jit/codegen/cuda/runtime/*" \
  -g"-torch/csrc/deploy/interpreter/interpreter.cpp" \
  -g"-torch/csrc/deploy/interpreter/interpreter.h" \
  -g"-torch/csrc/deploy/interpreter/interpreter_impl.h" \
  -g"-torch/csrc/deploy/interpreter/test_main.cpp"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60649

Test Plan: Verified changes by re-running the script (without the `-s` option) and seeing no warnings/errors.

Reviewed By: walterddr, janeyx99

Differential Revision: D29504258

Pulled By: 1ntEgr8

fbshipit-source-id: 78310b30ee8213b73ddb4771ad874665323e7a4e
2021-07-01 12:21:07 -07:00
Richard Barnes
b162d95e46 Fix a number of lint perf and safety issues in torch (#59897)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59897

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D29037012

fbshipit-source-id: 7c16286d5fc2b67964fb65f8374dfff4d1a7aefb
2021-06-15 13:14:51 -07:00
Richard Barnes
fbe65b16ae Use irange in torch/csrc/jit (#55716)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55716

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D27690245

fbshipit-source-id: 6052b0acd792a9527d131822453a17cdb7ae3ba5
2021-06-07 16:48:08 -07:00
Bert Maher
6309b342c3 [nnc] Enable CPU fuser inside FB, take 5 (#59461)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59461

long tail test failures
ghstack-source-id: 130607578

Test Plan: fixed T92123560

Reviewed By: navahgar

Differential Revision: D28892885

fbshipit-source-id: 762a275b5aa14af0847e46cbf4036d3342b82189
2021-06-04 16:26:46 -07:00
Bert Maher
46d724c919 Revert D28859795: [nnc] Enable CPU fusion inside Facebook, take 4
Test Plan: revert-hammer

Differential Revision:
D28859795 (6baa66ece9)

Original commit changeset: 826801db24e8

fbshipit-source-id: c85a0fc7e88c95af939d5c0f50c0c8878e1174d3
2021-06-03 16:29:51 -07:00
Bert Maher
6baa66ece9 [nnc] Enable CPU fusion inside Facebook, take 4
Summary:
Fixed the awkward Configerator initialization issue that broke some
tests. Trying again.

Test Plan: predictor comparisons

Reviewed By: ZolotukhinM

Differential Revision: D28859795

fbshipit-source-id: 826801db24e86b1c3594a86e3ac32f0a84c496f7
2021-06-03 09:33:13 -07:00
Richard Barnes
3979cb0656 irange for size_t (#55320)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55320

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D27572577

fbshipit-source-id: 97710fd2bb1303006b05828a0d1343b0b59ccb03
2021-06-03 01:04:13 -07:00
Bert Maher
afd5237a4f Revert D28800692: [nnc] Enable CPU fusion inside Facebook, take 3
Test Plan: revert-hammer

Differential Revision:
D28800692 (6e7dae9cec)

Original commit changeset: d791c3b2ccd7

fbshipit-source-id: 5042fecfbab59181572013bf39760bc716e86430
2021-06-02 10:07:46 -07:00
Bert Maher
6e7dae9cec [nnc] Enable CPU fusion inside Facebook, take 3 (#59253)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59253

Fixed a miscompilation exposed by multithreaded profiling collection; let's try again.
ghstack-source-id: 130286580

Test Plan: servicelab

Reviewed By: navahgar, huiguoo

Differential Revision: D28800692

fbshipit-source-id: d791c3b2ccd75fe5e6eca0859083d4cd67460147
2021-06-01 15:42:22 -07:00
Jeff Daily
ba694520e5 [ROCm] fix JIT codegen (#57400)
Summary:
Fixes upcoming changes that are part of ROCm 4.2 and affect PyTorch JIT.

- ROCM_VERSION macro must be available to both device and host compilation passes.
- Unifies some of CUDA and HIP differences in the code generated.
  - NAN / POS_INFINITY / NEG_INFINITY
  - Do not hipify `extern __shared__` -> `HIP_DYNAMIC_SHARED()` macro [deprecated]
- Differentiates bf16 codegen for HIP.
- Optionally provides missing macros when using hiprtc precompiled header feature.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57400

Reviewed By: ejguan

Differential Revision: D28421065

Pulled By: malfet

fbshipit-source-id: 215f476773c61d8b0d9d148a4e5f5d016f863074
2021-05-27 11:45:07 -07:00
Bert Maher
a6b358d53b Revert D28461013: [nnc] Enable CPU fusion inside Facebook, take 2
Test Plan: revert-hammer

Differential Revision:
D28461013 (c76405d3b1)

Original commit changeset: 79a80b6ffb65

fbshipit-source-id: d9cc5c512542153f39664635fb080d797a9de7d0
2021-05-19 15:27:38 -07:00
Bert Maher
c76405d3b1 [nnc] Enable CPU fusion inside Facebook, take 2 (#58347)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58347

Back out "Revert D27652484 (ac04cc775b): [nnc] Enable CPU fusion inside Facebook"
Original commit changeset: ecfef3ee1e71
ghstack-source-id: 129279584

Test Plan: Tests for bugfix included in this stack

Reviewed By: navahgar

Differential Revision: D28461013

fbshipit-source-id: 79a80b6ffb653ab952ff5efaa143d3362bb7d966
2021-05-18 21:45:48 -07:00
Bert Maher
c4c2039fb2 Revert D27652484: [nnc] Enable CPU fusion inside Facebook
Test Plan: revert-hammer

Differential Revision:
D27652484 (ac04cc775b)

Original commit changeset: a82681455dae

fbshipit-source-id: ecfef3ee1e7197148b172234691e72faf4b95cf8
2021-05-14 16:41:23 -07:00
Bert Maher
ac04cc775b [nnc] Enable CPU fusion inside Facebook (#58029)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58029

We've been testing this for months, it's time.
ghstack-source-id: 128932738

Test Plan: CI

Reviewed By: ZolotukhinM

Differential Revision: D27652484

fbshipit-source-id: a82681455dae0db19c8ac9918065b6e186c9e71a
2021-05-14 00:10:10 -07:00
Nikita Shulga
3a66a1cb99 [clang-tidy] Exclude cppcoreguidelines-avoid-magic-numbers (#57841)
Summary:
Add a cppcoreguidelines-avoid-magic-numbers exclusion to clang-tidy.
Remove existing nolint warnings using the following script:
```
for file in `git ls-files | grep -v \.py`; do gsed '/^ *\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers)/d' -i  $file; done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57841

Reviewed By: samestep

Differential Revision: D28295045

Pulled By: malfet

fbshipit-source-id: 7c6e8d1213c9593f169ed3df6a916498f1a97163
2021-05-07 20:02:33 -07:00
Ailing Zhang
0ecdbfebff s/InplaceOrView/ADInplaceOrView/g (#57372)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57372

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57324

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D28121821

Pulled By: ailzhang

fbshipit-source-id: f568dd2505f6279da9ffb93ce1d22e0f98c606bb
2021-05-01 22:56:18 -07:00
Nikita Shulga
eac02f85cf Fix more clang-tidy errors (#57235)
Summary:
In my last PR I missed the CUDA and distributed folders; fixing this now.
This change is autogenerated by `python tools/clang_tidy.py -s`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57235

Reviewed By: janeyx99

Differential Revision: D28084444

Pulled By: malfet

fbshipit-source-id: bf222f69ee90c7872c3cb0931e8cdb84f0cb3cda
2021-04-28 23:29:10 -07:00
Nikita Shulga
4cb534f92e Make PyTorch code-base clang-tidy compliant (#56892)
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os

def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files

def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname,"-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit","--all", "-m", f"NOLINT stubs for {fname}"])

def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)

if __name__ == "__main__":
    main()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892

Reviewed By: H-Huang

Differential Revision: D27991944

Pulled By: malfet

fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
2021-04-28 14:10:25 -07:00
Ailing Zhang
be7a943bb8 s/AutoDispatchBelowAutograd/AutoDispatchBelowInplaceOrView. (#56657)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56657

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D27931526

Pulled By: ailzhang

fbshipit-source-id: 3af718df3435e2b0b30bc62070dbdc5aeeecdfb4
2021-04-23 15:50:00 -07:00
Aswin John Mathews
73eaa0a5f5 Fixing error in jit cuda on ROCm: non-constant-expression cannot be n… (#55243)
Summary:
On ROCm, compiling grid_reduction.cu failed with the error
"non-constant-expression cannot be narrowed from type 'int' to 'uint32_t'".

Added a typecast to fix the issue.

Also removed the test skip on ROCm, re-enabling the test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55243

Reviewed By: malfet

Differential Revision: D27917066

Pulled By: ngimel

fbshipit-source-id: b0b7c5fc8ecd2624222b35fe060846f7d1670f07
2021-04-21 16:35:27 -07:00
Ailing Zhang
3d904b56ec s/AutoNonVariableTypeMode/AutoDispatchBelowAutograd/ (#56423)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56423

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D27866606

Pulled By: ailzhang

fbshipit-source-id: e3942356dc3133d1c5722de40ec0d45e6a60f2f1
2021-04-20 17:17:46 -07:00
Jeff Daily
e1752ffa04 [reland][ROCm] use hiprtc precompiled header (#55965)
Summary:
Revert "Revert D27449031 (2a7df657fe): [pytorch][PR] [ROCm] use hiprtc precompiled header".  Reland PR https://github.com/pytorch/pytorch/issues/54350.

This reverts commit 204ac21bf1.

The original PR was reverted under suspicion that it was causing CI instability, but it was instead due to a hardware failure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55965

Reviewed By: jbschlosser

Differential Revision: D27755907

Pulled By: malfet

fbshipit-source-id: 75bf0b9d888df3dee62f00a366b1123757e0474e
2021-04-15 15:47:56 -07:00
Richard Barnes
d690973295 irange on int64_t (#55148)
Summary:
Converts loops of the form:
```
for(int64_t VAR=0;VAR<LIMIT;VAR++)
```
to the form
```
for(const auto VAR : c10::irange(LIMIT))
```
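
A concrete instance of the converted form (the function is illustrative):

```
#include <c10/util/irange.h>
#include <cstdint>
#include <vector>

void zero_all(std::vector<int64_t>& values) {
  // iterates i = 0, 1, ..., values.size() - 1 with a const index
  for (const auto i : c10::irange(values.size())) {
    values[i] = 0;
  }
}
```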

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55148

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D27447811

fbshipit-source-id: 6311a094ec4a81a0b57383aaee0ba1b1dc2445c4
2021-04-05 16:14:00 -07:00
Mike Ruberry
c0ac0fef4e Revert D27448156: irange for size_t
Test Plan: revert-hammer

Differential Revision:
D27448156 (041b4431b2)

Original commit changeset: 585da57d4de9

fbshipit-source-id: 8e047c29f391c0166e0a1a87c3fb2a0854377365
2021-04-03 19:14:00 -07:00
Richard Barnes
041b4431b2 irange for size_t (#55163)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55163

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D27448156

fbshipit-source-id: 585da57d4de91c692b6360d65f7b8a66deb0f8c1
2021-04-02 23:22:29 -07:00
Alexander Golynski
204ac21bf1 Revert D27449031: [pytorch][PR] [ROCm] use hiprtc precompiled header
Test Plan: revert-hammer

Differential Revision:
D27449031 (2a7df657fe)

Original commit changeset: 81a8d7847a47

fbshipit-source-id: b7b970c8ea4110357fba3ad4d52a86fa5641d90c
2021-04-01 06:42:04 -07:00
Jeff Daily
2a7df657fe [ROCm] use hiprtc precompiled header (#54350)
Summary:
HIP's runtime compiler (hiprtc) is adding support for precompiled HIP headers in the ROCm 4.2 release.  Conditionally add support for this feature.  Using this feature will improve the ROCm torch wheel user experience; users will no longer need to install HIP headers separately to use torch JIT features.

The use of this feature is conditionalized on a new ROCM_VERSION macro.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54350

Reviewed By: H-Huang

Differential Revision: D27449031

Pulled By: malfet

fbshipit-source-id: 81a8d7847a47ce2bb253d1ea58740ef66ed154a3
2021-03-31 13:36:50 -07:00
Bert Maher
9db4802184 [fuser] Support bfloat16 (#54571)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54571

Supports bfloat16 via a similar method to half: upconvert inputs to
fp32, do math, then downconvert outputs to bf16.

Resource strings are mostly derived from cuda-11 headers.

Fixes #53918, for the legacy fuser at least.
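
A minimal standalone sketch of that pattern (not the fuser's actual generated code; the scale op is illustrative):

```
#include <cuda_bf16.h>

__global__ void scale_bf16(const __nv_bfloat16* in, __nv_bfloat16* out,
                           float alpha, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float x = __bfloat162float(in[i]);     // upconvert input to fp32
    out[i] = __float2bfloat16(alpha * x);  // do math in fp32, store as bf16
  }
}
```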

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D27328987

Pulled By: bertmaher

fbshipit-source-id: 5c0eae44164623faa0c75cb818e8bf0211579fdc
2021-03-25 15:59:15 -07:00
johnlu
36ce673f16 Disable the fusion group which is not supported by XPU device. (#54239)
Summary:
The XPU device doesn't support the fusion group. Disable it for XPU devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54239

Reviewed By: zou3519

Differential Revision: D27188735

Pulled By: ezyang

fbshipit-source-id: f28f62148e7aa12e8b08345df7eb0133216ce6a5
2021-03-22 07:43:28 -07:00
Sam Estep
8c798e0622 Forbid trailing whitespace (#53406)
Summary:
Context: https://github.com/pytorch/pytorch/pull/53299#discussion_r587882857

These are the only hand-written parts of this diff:
- the addition to `.github/workflows/lint.yml`
- the file endings changed in these four files (to appease FB-internal land-blocking lints):
  - `GLOSSARY.md`
  - `aten/src/ATen/core/op_registration/README.md`
  - `scripts/README.md`
  - `torch/csrc/jit/codegen/fuser/README.md`

The rest was generated by running this command (on macOS):
```
git grep -I -l ' $' -- . ':(exclude)**/contrib/**' ':(exclude)third_party' | xargs gsed -i 's/ *$//'
```

I looked over the auto-generated changes and didn't see anything that looked problematic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53406

Test Plan:
This run (after adding the lint but before removing existing trailing spaces) failed:
- https://github.com/pytorch/pytorch/runs/2043032377

This run (on the tip of this PR) succeeded:
- https://github.com/pytorch/pytorch/runs/2043296348

Reviewed By: walterddr, seemethere

Differential Revision: D26856620

Pulled By: samestep

fbshipit-source-id: 3f0de7f7c2e4b0f1c089eac9b5085a58dd7e0d97
2021-03-05 17:22:55 -08:00
Richard Barnes
29c4290a8d Use c10::irange for great good (#52153)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52153

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D26407087

fbshipit-source-id: ea8ce1c17299cb9d89621e4a39f31edc2faa9fd6
2021-02-24 18:43:50 -08:00
Jeff Daily
4df8e774e6 [ROCm] warn unsupported PYTORCH_CUDA_FUSER_DISABLE_FMA (#50508)
Summary:
nvcc's `--fmad=false` is not valid for the HIP compiler.  Upcoming ROCm releases will start treating unrecognized compiler flags as an error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50508

Reviewed By: albanD

Differential Revision: D25920291

Pulled By: mrshenli

fbshipit-source-id: c0ff3b74dd07f3d0661ba29efafaab291ef3621c
2021-02-16 08:09:57 -08:00
Nikita Shulga
b8f3a658f9 Do not include "DynamicLibrary.h" into a top-level header (#52182)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52182

DynamicLibrary provides very specific functionality, so there is no need to expose it to every project depending on `ATen.h`.

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D26417404

Pulled By: malfet

fbshipit-source-id: f8318cacb07dcc8b2f95984f88ea1df4e5369b8b
2021-02-13 19:34:46 -08:00
Nikita Shulga
5499e839f1 [Fuser] Do not attempt to use OpenMP if build without OpenMP support (#51504)
Summary:
Clang from Xcode does not support the `-fopenmp` option, so there is no need to try compiling with it.
Infer whether OpenMP is supported by checking the _OPENMP define.
Also, use the clang compiler if the host app was compiled with clang rather than gcc.
Fix a few range-loop warnings and add static_asserts that range-loop variables are raw pointers.

This change makes fuser tests on OS X a bit faster.

Before:
```
% python3 test_jit.py -v  TestScript.test_batchnorm_fuser_cpu
Fail to import hypothesis in common_utils, tests are not derandomized
CUDA not available, skipping tests
test_batchnorm_fuser_cpu (__main__.TestScript) ... clang: error: unsupported option '-fopenmp'
clang: error: unsupported option '-fopenmp'
warning: pytorch jit fuser failed to compile with openmp, trying without it...
ok

----------------------------------------------------------------------
Ran 1 test in 0.468s

OK
```

After:
```
% python3 test_jit.py -v  TestScript.test_batchnorm_fuser_cpu
Fail to import hypothesis in common_utils, tests are not derandomized
CUDA not available, skipping tests
test_batchnorm_fuser_cpu (__main__.TestScript) ... ok

----------------------------------------------------------------------
Ran 1 test in 0.435s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51504

Reviewed By: smessmer

Differential Revision: D26186875

Pulled By: malfet

fbshipit-source-id: 930b3bcf543fdfad0f493d687072aaaf5f9e2bfc
2021-02-02 15:31:59 -08:00
jjsjann123
392abde8e6 patch nvrtc API for cuda TK >= 11.1 (#50319)
Summary:
CUDA TK >= 11.1 provides a ptxjitcompiler that emits SASS instead of PTX.
1. This gives better backward compatibility: a future toolkit can work with an older driver, which might not be able to JIT-compile the generated PTX and would otherwise error out at runtime;
https://docs.nvidia.com/deploy/cuda-compatibility/#using-ptx
2. Meanwhile, SASS doesn't provide good forward compatibility, so for unsupported architectures we fall back to PTX to support future devices (see the sketch below).
https://docs.nvidia.com/deploy/cuda-compatibility/index.html#cubin-compatibility
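
A hedged sketch of the resulting dispatch; the arch check is illustrative, and the cubin entry points assume NVRTC from CUDA >= 11.1:
```
#include <nvrtc.h>
#include <vector>

// Fetch SASS when the target arch is supported by this toolkit, else PTX.
std::vector<char> getCompiledCode(nvrtcProgram prog, bool arch_supported) {
  size_t size = 0;
  std::vector<char> code;
  if (arch_supported) {
    // SASS (cubin): loads on older drivers without a runtime JIT step.
    nvrtcGetCUBINSize(prog, &size);
    code.resize(size);
    nvrtcGetCUBIN(prog, code.data());
  } else {
    // PTX: JIT-compiled by the driver, so it also runs on future devices.
    nvrtcGetPTXSize(prog, &size);
    code.resize(size);
    nvrtcGetPTX(prog, code.data());
  }
  return code;
}
```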

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50319

Reviewed By: malfet

Differential Revision: D26114475

Pulled By: ngimel

fbshipit-source-id: 046e9e7b3312d910f499572608a0bc1fe53feef5
2021-01-27 23:58:20 -08:00
Jane Xu
533cb9530e Introducing TORCH_CUDA_CPP_API and TORCH_CUDA_CU_API to the code (#50627)
Summary:
Sub-step of my attempt to split up the torch_cuda library, as it is huge. Please look at https://github.com/pytorch/pytorch/issues/49050 for details on the split and which files are in which target.

This PR introduces two new macros for Windows DLL purposes, TORCH_CUDA_CPP_API and TORCH_CUDA_CU_API. Both are defined as TORCH_CUDA_API for the time being.
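
Per the description, the interim definitions amount to:
```
// Both macros alias the existing export macro until the split lands.
#define TORCH_CUDA_CPP_API TORCH_CUDA_API
#define TORCH_CUDA_CU_API TORCH_CUDA_API
```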

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50627

Reviewed By: mruberry

Differential Revision: D25955441

Pulled By: janeyx99

fbshipit-source-id: ff226026833b8fb2fb7c77df6f2d6c824f006869
2021-01-21 19:09:11 -08:00
Scott Wolchok
4a0d17ba2d [PyTorch][codemod] Replace immediately-dereferenced expect calls w/expectRef (#50228)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50228

`fastmod -m 'expect(<((at|c10)::)?\w+Type>\(\)\s*)->'
'expectRef${1}.'`
Presuming it builds, this is a safe change: the result of `expect()`
wasn't being saved anywhere, so we don't need a new `shared_ptr` and can
take a reference instead.
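
A sketch of the before/after shape of the codemod (the receiver expression is illustrative):
```
// Before: expect<T>() materializes a new shared_ptr that is immediately discarded.
auto sizes = value->type()->expect<TensorType>()->sizes();

// After: expectRef<T>() hands back a reference, skipping the refcount bump.
auto sizes2 = value->type()->expectRef<TensorType>().sizes();
```
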
ghstack-source-id: 119782961

Test Plan: CI

Reviewed By: SplitInfinity

Differential Revision: D25837374

fbshipit-source-id: 86757b70b1520e3dbaa141001e7976400cdd3b08
2021-01-13 16:13:55 -08:00
Chester Liu
9d8bd216f9 Use Unicode friendly API in fused kernel related code (#49781)
Summary:
See https://github.com/pytorch/pytorch/issues/47422

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49781

Reviewed By: gchanan

Differential Revision: D25847993

Pulled By: ezyang

fbshipit-source-id: e683a8d5841885857ea3037ac801432a1a3eda68
2021-01-10 20:03:00 -08:00
Andres Suarez
8530c65e25 [codemod][fbcode/caffe2] Apply clang-format update fixes
Test Plan: Sandcastle and visual inspection.

Reviewed By: igorsugak

Differential Revision: D25849205

fbshipit-source-id: ef664c1ad4b3ee92d5c020a5511b4ef9837a09a0
2021-01-09 14:37:36 -08:00
Thomas Viehmann
ea087e2d92 JIT: guard DifferentiableGraph node (#49433)
Summary:
This adds guarding for DifferentiableGraph nodes so that execution does not depend on unverified profiled tensor properties.
It also bails out when gradients are required for the CUDA fuser.

Fixes https://github.com/pytorch/pytorch/issues/49299

I still need to look into a handful of failing tests, but maybe this can serve as a basis for discussion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49433

Reviewed By: ngimel

Differential Revision: D25681374

Pulled By: Krovatkin

fbshipit-source-id: 8e7be53a335c845560436c0cceeb5e154c9cf296
2021-01-08 20:01:27 -08:00
Scott Wolchok
480a756194 [PyTorch] IValue::toTensor can now return const Tensor& (#48868)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48868

Building on the previous diff, we can make `toTensor()` return a
`const Tensor&`, which should make it easier to avoid reference
counting.
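
A hedged sketch of what this enables at call sites (function and variable names are illustrative):
```
#include <ATen/core/ivalue.h>

// Binding a reference to the Tensor stored inside the IValue avoids the
// atomic incref/decref that copying out a Tensor would pay.
void peek(const c10::IValue& iv) {
  const at::Tensor& t = iv.toTensor();  // borrows; no ownership transfer
  (void)t.sizes();
}
```
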
ghstack-source-id: 119327372

Test Plan: internal benchmarks.

Reviewed By: bwasti

Differential Revision: D25325379

fbshipit-source-id: ca699632901691bcee432f595f75b0a4416d55dd
2021-01-06 08:40:50 -08:00
Chester Liu
2ac180a5dd Fix cl.exe detection in cpu/fused_kernel.cpp (#50085)
Summary:
The command used here is essentially `where cl.exe`. By using `system()` we will not be able to find cl.exe unless we are running in a VS Developer Prompt, which makes `activate()` meaningless. Changing `system()` to `run()` fixes this.

Found during https://github.com/pytorch/pytorch/issues/49781.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50085

Reviewed By: smessmer

Differential Revision: D25782054

Pulled By: ezyang

fbshipit-source-id: e8e3cac903a73f3bd78def667ebe0e93201814c8
2021-01-06 07:16:41 -08:00
caozhong
aff0b68a58 Fix include files for out-of-tree compilation (#48827)
Summary:
Signed-off-by: caozhong <zhong.z.cao@intel.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48827

Reviewed By: agolynski

Differential Revision: D25375988

Pulled By: ailzhang

fbshipit-source-id: a8d5ab4572d991d6d96dfe758011517651ff0a6b
2020-12-15 14:40:44 -08:00
jiej
a6fa3b2682 adding profile_ivalue (#47666)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47666

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D25255573

Pulled By: Krovatkin

fbshipit-source-id: 5d8753e4040a3d96105d28d26728125947c7a638
2020-12-09 15:29:15 -08:00
Nikita Shulga
b9cd774e29 Get rid of printf in cuda fuser debugPrint() (#46994)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46994

Reviewed By: raghuramank100, mruberry

Differential Revision: D25342954

Pulled By: malfet

fbshipit-source-id: 549b5b072f7f70877261a155e989a21072ec49d8
2020-12-04 15:13:26 -08:00
jiej
dabc286ab3 Remove output used only by sizes (#448) (#47665)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47665

Re-enabled the pass to remove outputs from fusion that is only used by aten::size;
Added size computation for reduction op via new operator prim::ReductionSizes;

Test Plan: Imported from OSS

Reviewed By: navahgar, jamesr66a

Differential Revision: D25254675

Pulled By: Krovatkin

fbshipit-source-id: e9a057b0287ed0ac93b415647fd8e5e836ba9856
2020-12-03 11:14:30 -08:00
Lemo
85c1e8acdc Replace kernel resource strings with real .cu source files (#48283)
Summary:
Convert nvFuser's runtime CUDA sources (under `.../jit/codegen/cuda/runtime`) to string literals, then include the headers with the generated literals.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48283

Reviewed By: mrshenli

Differential Revision: D25163362

Pulled By: ngimel

fbshipit-source-id: 4e6c181688ddea78ce6f3c754fee62fa6df16641
2020-12-02 21:22:29 -08:00
Elias Ellison
1195403915 [NNC] Add cpu fusion gflag (#48682)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48682

Reviewed By: Krovatkin, ngimel

Differential Revision: D25260205

Pulled By: eellison

fbshipit-source-id: df1655fd75f2a13bcf7c025b1f0a7becc2fd126a
2020-12-02 19:47:18 -08:00
jjsjann123
15fc66d6c8 fix nvrtc PTX architecture cap for CUDA toolkit (#48455)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48200

CUDA 11.0 only supports < sm_80 (https://docs.nvidia.com/cuda/archive/11.0/nvrtc/#group__options)

Note: the NVRTC documentation is not a reliable source for querying supported architectures. The rule of thumb is that nvrtc supports the same set of architectures as nvcc, so the best way to query them is something like `nvcc -h | grep -o "compute_[0-9][0-9]" | sort | uniq`
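
A minimal sketch of the cap (the helper name is assumed; the sm_80 limit for CUDA 11.0 comes from the note above):
```
#include <cuda.h>
#include <utility>

// Clamp the (major, minor) compute capability passed to NVRTC to what
// this toolkit can actually target.
std::pair<int, int> capArchForNVRTC(int major, int minor) {
#if CUDA_VERSION < 11010
  if (major >= 8) {
    return {7, 5};  // CUDA 11.0's NVRTC only accepts targets below sm_80
  }
#endif
  return {major, minor};
}
```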

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48455

Reviewed By: zhangguanheng66

Differential Revision: D25255529

Pulled By: ngimel

fbshipit-source-id: e84cf51ab50519b4c97dad063cc43c9194942bb2
2020-12-02 11:50:22 -08:00
Chester Liu
8177f63c91 Reorganize and refine the Windows.h import in C++ files (#48009)
Summary:
This PR aims to reduce the import overhead and symbol noise from the `windows.h` headers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48009

Reviewed By: gchanan

Differential Revision: D25045840

Pulled By: ezyang

fbshipit-source-id: 01fda70f433ba2dd0cd2d7cd676ab6ffe9d98b90
2020-11-20 14:21:09 -08:00
Scott Wolchok
4c9eb57914 [PyTorch] Narrow Device to 2 bytes by narrowing DeviceType and DeviceIndex (#47023)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47023

DeviceType pretty clearly only needs 1 byte. DeviceIndex only needs 1 byte given that machines don't have anywhere near 255 GPUs in them as far as I know.
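
A hedged sketch of the invariant this establishes (exact field layout aside):
```
#include <c10/core/Device.h>

// After the narrowing, a Device is just two one-byte fields.
static_assert(sizeof(c10::DeviceType) == 1, "DeviceType fits in one byte");
static_assert(sizeof(c10::DeviceIndex) == 1, "DeviceIndex fits in one byte");
static_assert(sizeof(c10::Device) == 2, "Device packs type + index");
```
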
ghstack-source-id: 116901430

Test Plan: Existing tests, added assertion to catch if my assumption about DeviceIndex is incorrect

Reviewed By: dzhulgakov

Differential Revision: D24605460

fbshipit-source-id: 7c9a89027fcf8eebd623b7cdbf6302162c981cd2
2020-11-18 19:39:40 -08:00
Scott Wolchok
1bafff2366 [PyTorch][JIT] Skip unnecessary refcounting in TensorType::merge (#47959)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47959

Taking a shared_ptr by value incurs refcounting overhead and should only be done if the callee needs to take ownership. Otherwise, `const T&` is more efficient. (Specifically, you will have to do an atomic decrement when the argument is destroyed and probably an atomic increment as well. Passing by `const T&` also takes one less register than passing `std::shared_ptr<T>`, but that's less important.)

This diff fixes just this one function, but I'd be happy to audit & fix this whole file in future diffs. Thoughts?
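
A sketch of the general guideline being applied (hypothetical types and signatures):
```
#include <memory>
#include <utility>

struct Node {};

// Read-only callee: const& avoids refcount traffic at every call site.
long useCount(const std::shared_ptr<Node>& n) { return n.use_count(); }

// Owning callee: take by value and move, paying exactly one refcount bump.
struct Holder {
  void store(std::shared_ptr<Node> n) { owned_ = std::move(n); }
  std::shared_ptr<Node> owned_;
};
```
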
ghstack-source-id: 116914899

Test Plan: build ATen-cpu

Reviewed By: Krovatkin

Differential Revision: D24970954

fbshipit-source-id: 6bdb4b710a94b8baf4ad63418fb38136134e0ef3
2020-11-18 17:49:16 -08:00
Sebastian Messmer
edf751ca2f Make empty c10-full (#46092)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46092

Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature.

This PR also changes the signature of some helpers called by empty to the new style.
ghstack-source-id: 116544203

(Note: this ignores all push blocking failures!)

Test Plan:
vs prev diff (outdated, before c10::optional fix): https://www.internalfb.com/intern/fblearner/details/224735103/

after c10::optional fix:
https://www.internalfb.com/intern/fblearner/details/231391773/

Also, after the c10::optional fix, the instruction counting benchmark shows a 2% regression for calling empty from Python. We decided this is acceptable and decided against landing D24425836 which would fix the regression.

Reviewed By: ezyang

Differential Revision: D24219944

fbshipit-source-id: e554096e90ce438c75b679131c3151ff8e5c5d50
2020-11-12 17:08:21 -08:00
Elias Ellison
f221a19a7f Force LLVM Compilation for CPU Tests (#46949)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46949

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D24805247

Pulled By: eellison

fbshipit-source-id: 4fcaf02d8a78cc5cbcbde36940d0a2c85fba3fc5
2020-11-12 11:12:08 -08:00
jiej
ac146c4820 [nvFuser] Switching to CudaFusionGuard from BailOut for nvfuser - update 2 (#46452)
Summary:
1. Added CudaFusionGuard as the custom TypeCheck for nvfuser; enabled dynamic shape support with the profiling executor;
2. dropped support for the legacy fuser;
3. re-enabled nvfuser tests;
4. added registration for profiling records to allow profiling on user-specified nodes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46452

Reviewed By: zou3519, anjali411

Differential Revision: D24364642

Pulled By: ngimel

fbshipit-source-id: daf53a9a6b6636e1ede420a3a6d0397d4a8b450b
2020-10-19 15:44:31 -07:00
Thomas Viehmann
d3d8da7a8e Enable CUDA Fuser for ROCm (#45965)
Summary:
This enables the cuda fuser on ROCm and enables its tests.

Part of this patch is based on the work of Rohith Nallamaddi, thank you.
Errors are my own, of course.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45965

Reviewed By: seemethere

Differential Revision: D24170457

Pulled By: walterddr

fbshipit-source-id: 3dd25b3501a41d2f00acba3ce8642ce51c49c9a6
2020-10-08 10:41:56 -07:00
Nikita Shulga
1558a3657b Add LazyNVRTC (#45674)
Summary:
Instead of dynamically loading `caffe2_nvrtc`, lazyNVRTC provides the same functionality by binding all the hooks to a lazy-binding implementation, very similar to shared-library jump tables:
On the first call, each function in the list gets a global handle to the respective shared library and replaces itself with the dynamically resolved symbol, using the following template:
```
  auto fn = reinterpret_cast<decltype(&NAME)>(getCUDALibrary().sym(C10_SYMBOLIZE(NAME)));
  if (!fn)
    throw std::runtime_error("Can't get " C10_SYMBOLIZE(NAME));
  lazyNVRTC.NAME = fn;
  return fn(...);
```
Fixes https://github.com/pytorch/pytorch/issues/31985

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45674

Reviewed By: ezyang

Differential Revision: D24073946

Pulled By: malfet

fbshipit-source-id: 1479a75e5200e14df003144625a859d312885874
2020-10-05 16:27:40 -07:00
jjsjann123
99e0a87bbb [nvFuser] Latency improvements for pointwise + reduction fusion (#45218)
Summary:
A lot of changes are in this update, some highlights:

- Added Doxygen config file
- Split the fusion IR (higher level TE like IR) from kernel IR (lower level CUDA like IR)
- Improved latency with dynamic shape handling for the fusion logic
- Prevent recompilation for pointwise + reduction fusions when not needed
- Improvements to inner dimension reduction performance
- Added input -> kernel + kernel launch parameters cache, added eviction policy
- Added reduction fusions with multiple outputs (still single reduction stage)
- Fixed code generation bugs for symbolic tiled GEMM example
- Added thread predicates to prevent shared memory from being loaded multiple times
- Improved syncthreads placement with shared memory and removed a read-before-write race
- Fixed FP16 reduction fusions where the output would come back as FP32

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45218

Reviewed By: ezyang

Differential Revision: D23905183

Pulled By: soumith

fbshipit-source-id: 12f5ad4cbe03e9a25043bccb89e372f8579e2a79
2020-09-24 23:17:20 -07:00
Bert Maher
6ec8fabc29 Fix frac in CUDA fuser (#44152)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44152

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D23528506

fbshipit-source-id: bfd468d72fa55ce317f88ae83e1f2d5eee041aa0
2020-09-09 11:10:08 -07:00
Elias Ellison
5bd2902796 [JIT] Remove references to no longer generated _tanh_backward and _sigmoid_backward (#44138)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44138

If you look at the sigmoid and tanh backward functions, they are composed of other ops: https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/runtime/symbolic_script.cpp#L786
https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/runtime/symbolic_script.cpp#L164

So tanh_backward and sigmoid_backward are no longer generated; they are legacy ops.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23543603

Pulled By: eellison

fbshipit-source-id: ce8353e53043cf969b536aac47c9576d66d4ce02
2020-09-05 01:41:36 -07:00
Ashkan Aliabadi
4e39c310eb Move torch/csrc/utils/hash.h to c10/util/hash.h. (#42503)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42503

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D23252331

Pulled By: AshkanAliabadi

fbshipit-source-id: 3c4c0e27b9a7eec8560e374c2a3ba5f1c65dae48
2020-08-29 17:47:00 -07:00
Bert Maher
0bf27d64f4 Fix NaN propagation in fuser's min/max implementation (#43590)
Summary:
fmax/fmin return the non-NaN argument when exactly one argument is NaN, which doesn't match eager-mode behavior, where the NaN propagates.
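
A hedged sketch of a NaN-propagating max matching eager semantics (not the exact fuser code, which emits device-side CUDA):
```
#include <cmath>
#include <limits>

// std::fmax(NaN, x) returns x, but eager-mode max propagates the NaN,
// so check for NaN explicitly before falling back to fmax.
float nanPropagatingMax(float a, float b) {
  if (std::isnan(a) || std::isnan(b)) {
    return std::numeric_limits<float>::quiet_NaN();
  }
  return std::fmax(a, b);
}
```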

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43590

Reviewed By: mruberry

Differential Revision: D23338664

Pulled By: bertmaher

fbshipit-source-id: b0316a6f01fcf8946ba77621efa18f339379b2d0
2020-08-26 17:31:06 -07:00
Nikita Shulga
d06f1818ad Fix codegen/cuda gcc-5.4 compilation issues (#43223)
Summary:
Most of the fixes are for the same old enum-is-not-hashable error.
In manager.cpp, use std::unordered_map::emplace rather than `insert` to avoid an error triggered by missed copy elision.
This regression was introduced by https://github.com/pytorch/pytorch/pull/43129

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43223

Reviewed By: albanD, seemethere

Differential Revision: D23198330

Pulled By: malfet

fbshipit-source-id: 576082f7a4454dd29182892c9c4e0b51a967d456
2020-08-18 17:19:07 -07:00
Christian Sarofeen
b3bda94393 [NVFuser] Enable E2E BCast-PWise-Reduction fusions (#43129)
Summary:
Had a bunch of merged commits that shouldn't have been there, reverted them to prevent conflicts. Lots of new features, highlights listed below.

**Overall:**

- Enables pointwise fusion; single (but N-D) broadcast + pointwise fusion; and single (but N-D) broadcast + pointwise + single (but N-D) reduction fusion.

**Integration:**

- Separate "magic scheduler" logic that takes a fusion and generates code generator schedule
- Reduction fusion scheduling with heuristics closely matching eagermode (unrolling supported, but no vectorize support)
- 2-Stage caching mechanism, one on contiguity, device, type, and operations, the other one is input size->reduction heuristic

**Code Generation:**

- More generic support in code generation for computeAt
- Full rework of loop nest generation and Indexing to more generically handle broadcast operations
- Code generator has automatic kernel launch configuration (including automatic allocation of grid reduction buffers)
- Symbolic (runtime) tiling on grid/block dimensions is supported
- Simplified index generation based on user-defined input contiguity
- Automatic broadcast support (similar to numpy/pytorch semantics)
- Support for compile time constant shared memory buffers
- Parallelized broadcast support (i.e. block reduction -> block broadcast support)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43129

Reviewed By: mrshenli

Differential Revision: D23162207

Pulled By: soumith

fbshipit-source-id: 16deee4074c64de877eed7c271d6a359927111b2
2020-08-18 09:10:08 -07:00
Nikita Shulga
4aa543ed2e Fix unordered-map-over-enum for GCC 5.4 (#41063)
Summary:
Forgot to add this to https://github.com/pytorch/pytorch/pull/41055

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41063

Differential Revision: D22407451

Pulled By: malfet

fbshipit-source-id: 6f06653b165cc4817d134657f87caf643182832a
2020-07-06 23:26:31 -07:00
Nikita Shulga
50df097599 Fix CUDA jit codegen compilation with gcc-5.4 (#41055)
Summary:
It's a known gcc-5.4 bug that enum class is not hashable by default, so `std::unordered_map` needs an explicit third template parameter to compute the hash of the key type.
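
A minimal sketch of the workaround (enum and map contents illustrative):
```
#include <cstddef>
#include <string>
#include <unordered_map>

struct EnumClassHash {
  template <typename T>
  std::size_t operator()(T t) const {
    return static_cast<std::size_t>(t);
  }
};

enum class ValType { TensorView, Scalar };

// gcc 5.4 can't synthesize std::hash for enum classes, so supply it explicitly.
std::unordered_map<ValType, std::string, EnumClassHash> val_type_names;
```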

Should fix regression caused by https://github.com/pytorch/pytorch/pull/40864

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41055

Differential Revision: D22405478

Pulled By: malfet

fbshipit-source-id: f4bd36bebdc1ad0251ebd1e6cefba866e6605fe6
2020-07-06 21:09:17 -07:00
Christian Sarofeen
b9b4f05abf [nvFuser] Working towards reductions, codegen improvements (#40864)
Summary:
Basic reduction fusion is working, and the code generator has been improved to approach the performance of eager-mode reductions. Coming soon will be pointwise-reduction fusions, done in a way that should prevent the possibility of hitting regressions. Also working on performant softmax kernels in the code generator, which may be our next fusion target.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40864

Reviewed By: ngimel

Differential Revision: D22392877

Pulled By: soumith

fbshipit-source-id: 457448a807d628b1035f6d90bc0abe8a87bf8447
2020-07-06 14:52:49 -07:00
Jiayu Liu
0203d70c63 [nit] fix some typo within documentation (#40692)
Summary:
Apologies if this seems trivial, but I'd like to fix these typos as I read through some of the source code. Thanks!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40692

Differential Revision: D22284651

Pulled By: mrshenli

fbshipit-source-id: 4259d1808aa4d15a02cfd486cfb44dd75fdc58f8
2020-06-30 19:24:44 -07:00
Sebastian Messmer
53af9df557 Unify boxed function signature between jit and c10 (#37034)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37034

c10 takes a Stack* in boxed functions while JIT took Stack&.
c10 doesn't return anything while JIT returns an int which is always zero.

This changes JIT to follow the c10 behavior.
ghstack-source-id: 106834069

Test Plan: unit tests

Differential Revision: D20567950

fbshipit-source-id: 1a7aea291023afc52ae706957e9a5ca576fbb53b
2020-06-29 19:24:26 -07:00
Sergey Ionov
d14d47b9b5 Get rid of global constructors in cuda codegen (#40183)
Summary:
Use a switch statement instead of lookups in global `std::unordered_map`s to do enum-to-name conversions.
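
A sketch of the pattern (enum and names illustrative): a switch over string literals needs no static constructor at load time, unlike a global `std::unordered_map`.
```
enum class UnaryOpType { Abs, Neg };

// Returning string literals from a switch avoids a global map that would
// require a constructor to run at program startup.
const char* unary_op_name(UnaryOpType t) {
  switch (t) {
    case UnaryOpType::Abs:
      return "abs";
    case UnaryOpType::Neg:
      return "neg";
  }
  return "<unknown>";
}
```
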
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40183

Reviewed By: malfet

Differential Revision: D22117731

Pulled By: ionsphere

fbshipit-source-id: d150114cfae5b1222bb9142d815f2379072506c7
2020-06-18 13:54:11 -07:00
Christian Sarofeen
80e5ebf989 [nvFuser] Transform replay refactor and minor updates (#39579)
Summary:
We've got quite a few things going on, preparing a push back to upstream so we don't get too desynced.

- Major refactor of transform replay. It is now far more robust and fixes bugs discovered in reductions. Preparing for extension to explicit broadcast ops which will be the last major memory pattern for op coverage. Broadcast ops will allow us to express up to and potentially beyond norms and gemms.

- Initial runtime expression evaluator. This allows us to evaluate expressions at runtime; it will be useful for determining our grid/block layout at runtime, so we don't have to manually compute it from the code we're trying to generate.

- Moving to int64 and double for scalar representations to match PyTorch JIT.

- Improvements in the codegen interface, where we return a Tensor-like object instead of the parent class Val.

- Add `addcmul` and `lerp` ops

- General updates, fixes, test additions, test improvements.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39579

Differential Revision: D21974001

Pulled By: soumith

fbshipit-source-id: 7f7ccc91593466e948f3ce90f8f9b7fbc5c28de2
2020-06-11 23:04:24 -07:00
Ilia Cherniavskii
abe2be2063 [resubmit] Use TensorMethods.cpp (#39385)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39385

see https://github.com/pytorch/pytorch/pull/37639

Test Plan:
https://github.com/pytorch/pytorch/pull/37639

Imported from OSS

Differential Revision: D21833287

fbshipit-source-id: 9928d3f4122903d0de67ad312e349352d5f5c19c
2020-06-02 20:27:51 -07:00
Edward Yang
2fe0fc2684 Revert D21374247: Use TensorMethods.cpp
Test Plan: revert-hammer

Differential Revision:
D21374247

Original commit changeset: 076964415079

fbshipit-source-id: 732ec8c561d1f37475c1b5549ba79c718e3a6db8
2020-06-01 08:12:09 -07:00
Ilia Cherniavskii
68e62b9ab6 Use TensorMethods.cpp (#37639)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37639

Changing TensorMethods.h to .cpp.
Necessary to avoid incomplete types in the dispatcher.

Test Plan:
CI

Imported from OSS

checked mobile size, no change, small reduction in size in fbios
fbios: Succeeded
Change in Download Size for arm64 + 3x assets variation: -18.2 KiB
Change in Uncompressed Size for arm64 + 3x assets variation: -8.8 KiB

reran benchmark, no stat. significant difference

buck run mode/opt caffe2/caffe2/fb/high_perf_models/pytorch/benchmark_framework_overheads:benchmark_torchscript_model -- --model_file caffe2/caffe2/fb/high_perf_models/pytorch/benchmark_framework_overheads/addmodule.pt --num_runs 3

╷ @  68592d0d  41 minutes ago  iliacher  D21374247
╭─╯  Use TensorMethods.cpp

Created 3 benchmark runs on aibench for caffe2/caffe2/fb/high_perf_models/pytorch/benchmark_framework_overheads/addmodule.pt.
Links to the results:

* Adhoc run: https://our.intern.facebook.com/intern/aibench/details/1729113760

* Adhoc run: https://our.intern.facebook.com/intern/aibench/details/3867976782

* Adhoc run: https://our.intern.facebook.com/intern/aibench/details/2782186766

hg prev

@  7f501b42  Thursday at 14:26  bvaughan  D21764704
╷  short-circuit pow for complex 1 and 0 exponents

Created 3 benchmark runs on aibench for caffe2/caffe2/fb/high_perf_models/pytorch/benchmark_framework_overheads/addmodule.pt.
Links to the results:

* Adhoc run: https://our.intern.facebook.com/intern/aibench/details/2155256332

* Adhoc run: https://our.intern.facebook.com/intern/aibench/details/1802057074

* Adhoc run: https://our.intern.facebook.com/intern/aibench/details/4119590830

Differential Revision: D21374247

fbshipit-source-id: 076964415079cf84fb57f1f7b43d087afed86e1d
2020-05-31 17:11:12 -07:00
lixinyu
a04fb2ab22 [Reland] add xenial + cuda 9.2 + gcc 5.4 CI test (#39036)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39036

Test Plan: Imported from OSS

Differential Revision: D21731026

Pulled By: glaringlee

fbshipit-source-id: ae678f786f95e3687ed6b3f176fe6736a436c721
2020-05-28 19:48:18 -07:00
Jeff Daily
1093e26d72 [ROCm] HIP version guard for occupancy API compatibility (#38551)
Summary:
CC ezyang xw285cornell

HIP from ROCm 3.5 renames `hipOccupancyMaxActiveBlocksPerMultiprocessor` to `hipModuleOccupancyMaxActiveBlocksPerMultiprocessor`.  In addition, the API parameter types now match CUDA.  Add these changes in a backwards-compatible manner.
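
A hedged sketch of the idea (the version macro and threshold are assumptions; the real change also reconciles the parameter-type differences noted above):
```
#include <hip/hip_runtime.h>

#if defined(HIP_VERSION) && HIP_VERSION >= 305
#define FUSER_HIP_OCCUPANCY hipModuleOccupancyMaxActiveBlocksPerMultiprocessor
#else
#define FUSER_HIP_OCCUPANCY hipOccupancyMaxActiveBlocksPerMultiprocessor
#endif

int maxActiveBlocks(hipFunction_t kernel, int block_size) {
  int max_blocks = 0;
  FUSER_HIP_OCCUPANCY(&max_blocks, kernel, block_size, /*dynSharedMemPerBlk=*/0);
  return max_blocks;
}
```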
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38551

Differential Revision: D21721832

Pulled By: ezyang

fbshipit-source-id: 6fc971845e363d7495d8be9550e76d0f082c3062
2020-05-27 10:09:06 -07:00
Jeff Daily
ccab142197 Add ROCm-specific half_support_literal for JIT. (#38899)
Summary:
CC ezyang xw285cornell sunway513 lcskrishna
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38899

Differential Revision: D21721855

Pulled By: ezyang

fbshipit-source-id: 3739c462f04cee40ff979f44387ef66b971f5303
2020-05-26 14:34:08 -07:00
Christian Sarofeen
8e69c3be17 [nvFuser] Reduction support in codegen, fp16 support (#38627)
Summary:
Adds reduction support for the code generator. Reductions are fully supported with the split/merge/reorder/rfactor/computeAt/unroll operators. There is also cross-thread (intra-block) reduction support.

The two remaining pieces missing for reduction support are:
- Safety: if cross-thread reduction was used, child operators shouldn't be able to bind that thread dim anymore
- Cross-block reduction: we will want inter-block reduction support to reach parity with TensorIterator

This PR also provides FP16 support for fusions: we insert casts from FP16 inputs to FP32, and casts back to FP16 on FP16 outputs.

Also working towards reduction support and shape inference for reductions in the fusion pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38627

Reviewed By: albanD

Differential Revision: D21663196

Pulled By: soumith

fbshipit-source-id: 3ff2df563f86c39cd5821ab9c1148149e5172a9e
2020-05-21 17:18:39 -07:00
anjali411
8e07b75cef Have DeviceType available in torch namespace (#38036)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38036

Resolves: https://github.com/pytorch/pytorch/issues/36946

Test Plan: Imported from OSS

Differential Revision: D21463610

Pulled By: anjali411

fbshipit-source-id: c4aabfac2cd1f05f8b66745aae0a17c2af4d9c9b
2020-05-11 16:06:52 -07:00
Cloud Han
6e1e2a60dc fix compilation error with gcc 5.5 (#38112)
Summary:
Fixed https://github.com/pytorch/pytorch/issues/38111
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38112

Differential Revision: D21476876

Pulled By: malfet

fbshipit-source-id: 06d25e763eb73961f7b4e4cfbd2bb59f5ab96387
2020-05-08 13:23:36 -07:00
jiej
1667aa6451 [CUDA_FUSER] Expand operation support for cuda fuser (#37849)
Summary:
This PR added more supported operations to the CUDA fuser. We are covering the major pointwise operations supported in the legacy fuser.

In an attempt to adapt to the legacy executor:
1. added a naive shape propagation pass on the PyTorch JIT IR;
2. small refactor of graph partitioning;
3. fallback to interpreter execution of the fusion group;
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37849

Reviewed By: yf225

Differential Revision: D21444320

Pulled By: soumith

fbshipit-source-id: 712e18ab8497f8d58a07e6f8d200cdab52cf0d74
2020-05-07 09:21:09 -07:00
Nikolay Korovaiko
6ecb5bb1f0 match old fuser rem to eager (#37196)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37196

Reviewed By: zdevito

Differential Revision: D21223172

Pulled By: Krovatkin

fbshipit-source-id: 4d4ff1127d5dc69ab73f07ca79c1f5b0b4dd9732
2020-05-01 10:55:06 -07:00
Nikolay Korovaiko
4ed790d742 Adding symbolic sizes, contiguity, stride indices (#36101)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36101

Reviewed By: jamesr66a

Differential Revision: D20908711

Pulled By: Krovatkin

fbshipit-source-id: f90ce74acffeb645d7d906d07e293164d65ed7e6
2020-05-01 02:01:25 -07:00
Deyu Fu
346215caa4 [jit] Adding vectorized load/store support for JIT generated CUDA kernel (#36555)
Summary:
The JIT pointwise kernel currently does not do vectorized loads/stores, which may lead to suboptimal performance for shorter data types like half and int8.

In this PR, a fixed width of 4 elements per load/store is added for supported tensor shapes, implemented as a runtime check inside the kernel.

Supported tensor shapes:
- all input/output data pointers are aligned to 4*sizeof(dtype)
- the last dimension is contiguous (stride 1) and its size is a multiple of 4
- all other dimensions have strides that are multiples of 4

All test_jit* tests passed, and here are performance results on a simple `ax+by+c` fusion.
result before PR:
```
torch.float32 kernel time: 0.748 ms.
torch.float16 kernel time: 0.423 ms.
torch.int8 kernel time: 0.268 ms.
```
result after PR:
```
torch.float32 kernel time: 0.733 ms.
torch.float16 kernel time: 0.363 ms.
torch.int8 kernel time: 0.191 ms.
```
test code:
```
import torch
import time

# disable profiling to test all data types
torch._C._jit_set_profiling_mode(False)
torch._C._jit_set_profiling_executor(False)

@torch.jit.script
def axpby(x, y):
    return  x * 2 - y * 3 + 1

for test_dtype in [torch.float32, torch.float16, torch.int8]:
    a = torch.randn(12345,4096, device="cuda").to(test_dtype)
    b = torch.randn(12345,4096, device="cuda").to(test_dtype)

    # warm up
    for _ in range(100):
        c = axpby(a,b)

    torch.cuda.synchronize()
    start = time.time()

    for _ in range(1000):
        c = axpby(a,b)

    torch.cuda.synchronize()
    end = time.time()
    print("{} kernel time: {:.3f} ms.".format(test_dtype, end-start))
```
Generated code:
[log_with_generated_code.txt](https://github.com/pytorch/pytorch/files/4472813/log_with_generated_code.txt)

Additional note:
the double type is excluded from the vectorized code path.

We can later improve this with dynamic vectorization-width support and fewer in-kernel checks once tensor shape information is available in codegen. For now, this implementation follows the existing caching through the TensorDesc mechanism, which does not carry enough compile-time information.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36555

Differential Revision: D21142762

Pulled By: ngimel

fbshipit-source-id: 1cfdc5807a944c4670b040dc2d2dfa480377e7d7
2020-04-20 19:24:28 -07:00
Christian Sarofeen
f11c4f90c2 New CUDA Fuser: Unrolling support, interface refactor (#36435)
Summary:
Unrolling support has been added in a way that produces well-performing code on GPUs. Not sure how long this link will last, but an example of a generated unrolled kernel is:
https://godbolt.org/z/i0uAv3

What can be seen there is multiple "ld.global.f32" instructions without "st.global.f32" in between them (and vice versa). This means we are issuing multiple loads that can run in parallel, as well as multiple stores that can run in parallel. This can be a crucial optimization for memory-bound kernels. This was generally a point of concern in TVM, as an attempt at a similar kernel in TVM produces https://godbolt.org/z/Vu97vG, which surrounds load-store pairs in conditional branches, preventing the benefits of unrolling.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36435

Reviewed By: ZolotukhinM

Differential Revision: D21024011

Pulled By: soumith

fbshipit-source-id: e852e282fa7a304aba962e1926f756098c011fe0
2020-04-16 09:20:24 -07:00
Christian Sarofeen
e551bfc8de New CUDA Fuser code lowering refactor (#36199)
Summary:
This PR completely refactors the code lowering process from our IR to CUDA. Before, we had one giant step that went from a relatively high-level IR straight to CUDA; now we first lower into concepts like ForLoop, IfThenElse, TensorIndex, and Allocate. This lowering will allow more complex code lowering like reductions and unrolling. Unrolling will quickly follow this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36199

Reviewed By: dzhulgakov

Differential Revision: D20925220

Pulled By: soumith

fbshipit-source-id: 8f621c694c68a1aad8653e625d7287fe2d8b35dc
2020-04-09 14:27:05 -07:00
Pavel Belevich
34b32ca914 Remove operator-> from at::Generator (#36027)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36027

Differential Revision: D20856462

Pulled By: pbelevich

fbshipit-source-id: 156fc23d51d8125d41e96b36b3b1312f13040588
2020-04-07 08:07:07 -07:00
Pavel Belevich
3328a2f903 Rename CPUGenerator to CPUGeneratorImpl and CUDAGenerator to CUDAGeneratorImpl (#36026)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36026

Differential Revision: D20856458

Pulled By: pbelevich

fbshipit-source-id: 6d105593dca67640d508a4aebf7edf028d52af32
2020-04-07 08:05:23 -07:00
Nikita Shulga
e707cee501 Fix gcc-5.4 compilation (#35935)
Summary:
It needs a hint for how to hash an `enum class` in `std::unordered_map`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35935

Test Plan: CI

Differential Revision: D20837750

Pulled By: malfet

fbshipit-source-id: 4208ee4bfa2e3cfbedf5b92bf18031225bf9dfa1
2020-04-03 08:39:30 -07:00
Christian Sarofeen
aeb13f212b Make ValType hashable. (#35917)
Summary:
Build fix stemming from https://github.com/pytorch/pytorch/issues/34785
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35917

Differential Revision: D20829353

Pulled By: soumith

fbshipit-source-id: 4ba84ecedd354efbc9ac47c9b0f0e3871b404f13
2020-04-03 00:16:56 -07:00
Song Zhou
dabeff33b9 [pytorch] Fix fblearner flow compiling errors (#35902)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35902

Move operator registration to an anonymous namespace to avoid collisions.

Reviewed By: soumith

Differential Revision: D20822382

fbshipit-source-id: 1ab00871491668b8b85e803ac877d96477f1688b
2020-04-02 14:52:48 -07:00
Soumith Chintala
d9dd353a00 fix clang-format (#35884)
Summary:
Fixes breakage introduced in a PR that I landed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35884

Differential Revision: D20817603

Pulled By: soumith

fbshipit-source-id: b0729bed81549d4c8e6a889c380baa19c73ef127
2020-04-02 12:12:27 -07:00
Christian Sarofeen
6d24f8fe21 Infrastructure for a new CUDA Fuser (#34785)
Summary:
**Summary:** This PR contains the infrastructure of a new CUDA fuser. This CUDA fuser is based on many of the same principles as TensorExpressions and Halide, but the implementation is built from the ground up. The fusion pass itself is similar to the default CUDA fuser's; however, it has undergone some refactoring and uses the new code generation infrastructure. For those interested in how the code generation in this PR works, I would recommend reviewing _test/cpp/jit/test_gpu_fusion.cpp_ as well as the long comment section at the beginning of _torch/csrc/jit/codegen/cuda/transform_replay.h_.

One of the largest differences between our approach and that of TVM/Halide is the concept of "TensorView". A TensorView should, at a high level, be thought of the way we think of working with Tensors in PyTorch: it's an N-D object which can undergo transformations that change its dimensionality. Dimensionality changes are done through the operations split/merge/reorder/computeAt. These transformations are similar to split/fuse/reorder/compute_at in TVM; they modify how a tensor is iterated over to generate GPU code. Interestingly, in our scheme these transformations are applied to tensors and only impact how that tensor is generated, as sketched below.
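
A hedged sketch in the spirit of _test/cpp/jit/test_gpu_fusion.cpp_ (API names approximate, not verified against this PR):
```
// Reshape a TensorView's iteration space and bind it to the GPU hierarchy.
TensorView* tv = add(tv0, tv1);                // tv0/tv1: inputs to the fusion
tv->split(0, 128);                             // [I] -> [I/128, 128]
tv->axis(0)->parallelize(ParallelType::BIDx);  // outer axis -> thread blocks
tv->axis(1)->parallelize(ParallelType::TIDx);  // inner axis -> threads
```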

**Warning:** This PR is purposefully not feature complete with the current fuser. We wanted to separate the infrastructure from the fusion capabilities. Once it is in, smaller incremental PRs will be submitted to expand the fuser's capabilities.

**Short term goals:**

Parity with current CUDA fuser (including performance):
- Dynamic shapes (no recompilation)
- Implicit handling of broadcast (broadcasted tensors are treated as tensors of the broadcasted size in the generated code)
- Dropout

**Mid-term goals:**

- Transposes fused with pointwise operations where transpose involves only 2 axes (across the fused operation).
- 1-D reductions fused with pointwise operations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34785

Reviewed By: ZolotukhinM

Differential Revision: D20650977

Pulled By: soumith

fbshipit-source-id: ee39c95a880e1b9822e874ed4cc180971572bf63
2020-04-02 09:22:42 -07:00
Meghan Lele
6384c2d81b [JIT] clang-format JIT code (#35115)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35115

This commit runs the newly added tools/clang_format.py on the JIT
codebase and includes all of the formatting changes thus produced.

Testing:
Ran the script, CI.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D20568523

Pulled By: SplitInfinity

fbshipit-source-id: e09bdb982ccf090eecfb7c7b461b8d0681eef82b
2020-03-26 11:24:51 -07:00
Pavel Belevich
5306713a36 Replace Generator* with Generator that holds std::shared_ptr<GeneratorImpl> (#34468)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34468

This PR prepares `at::Generator` for pybind11's `type_caster<at::Generator>` which is required to implement custom RNG in python. The following changes are done:
1. `at::Generator` was moved to `c10::GeneratorImpl` (similar to `c10::TensorImpl`)
2. `at::Generator` was recreated as a holder of `std::shared_ptr<c10::GeneratorImpl>` (similar to `at::Tensor` that holds `c10::intrusive_ptr<c10::TensorImpl>`)
3. Most of `at::Generator*` usages were replaced with `at::Generator`

TBD: replacing `Generator generator = nullptr` with `{}` requires JIT changes (adding Generator to IValue?)

Differential Revision: D20549420

Pulled By: pbelevich

fbshipit-source-id: 4c92a40eab8f033b359bb6c93f4cd84b07ee8d4e
2020-03-21 17:36:10 -07:00
Edward Yang
cf8b728255 Delete OperatorOptions, absorb AliasAnalysisKind into FunctionSchema. (#34588)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34588

I constructed the patch by deleting OperatorOptions and then rerouting
all queries for AliasAnalysisKind to FunctionSchema.  Some of the
behavior is kind of bogus: we really shouldn't be mutating FunctionSchema
after the fact, but that won't get fixed until we actually switch to
true schema merging.

Reland of https://github.com/pytorch/pytorch/pull/34160

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20387079

Pulled By: ezyang

fbshipit-source-id: d189f7a6ad8cd186b88b6fbfa3f189994eea14e8
2020-03-11 20:59:46 -07:00
Edward Yang
6f8a8e4e47 Revert D20282846: Delete OperatorOptions, absorb AliasAnalysisKind into FunctionSchema.
Test Plan: revert-hammer

Differential Revision:
D20282846

Original commit changeset: ba7bca6e8adc

fbshipit-source-id: b9e15d2b2c3d1dbc6e971ab3c0bdf380e769dcf1
2020-03-11 07:50:29 -07:00
Edward Yang
9d42177a31 Delete OperatorOptions, absorb AliasAnalysisKind into FunctionSchema. (#34160)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34160

I constructed the patch by deleting OperatorOptions and then rerouting
all queries for AliasAnalysisKind to FunctionSchema.  Some of the
behavior is kind of bogus: we really shouldn't be mutating FunctionSchema
after the fact, but that won't get fixed until we actually switch to
true schema merging.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20282846

Pulled By: ezyang

fbshipit-source-id: ba7bca6e8adc3365789639b88e54c4e881b1692e
2020-03-11 07:15:18 -07:00
Zachary DeVito
358450e02b improved TorchScript traceback (#33834)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33834

This changes how we report Tracebacks to make them more clear when
there are both serialized and non-serialized ranges. It now looks like:

```
Traceback (most recent call last):
  File "foo.py", line 25, in <module>
    s2(a, b)
  File "/scratch/zdevito/pytorch/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__.py", line 7, in forward
    x: Tensor,
    y: Tensor) -> Tensor:
    return (self).bar(x, y, )
            ~~~~~~~~~ <--- HERE
  def bar(self: __torch__.Moo,
    x: Tensor,
  File "code/__torch__.py", line 11, in bar
    x: Tensor,
    y: Tensor) -> Tensor:
    _0 = (self).baz(x, y, )
          ~~~~~~~~~ <--- HERE
    _1 = torch.ones([3], dtype=None, layout=None, device=None, pin_memory=None)
    return torch.add(_0, _1, alpha=1)
  File "code/__torch__.py", line 17, in baz
    x: Tensor,
    y: Tensor) -> Tensor:
    return torch.add(x, y, alpha=1)
           ~~~~~~~~~ <--- HERE

Traceback of TorchScript, original code (most recent call last):
  File "foo.py", line 11, in forward
    def forward(self, x, y):
        return self.bar(x, y)
               ~~~~~~~~ <--- HERE
  File "foo.py", line 9, in bar
    def bar(self, x, y):
        return self.baz(x, y) + torch.ones(3)
               ~~~~~~~~ <--- HERE
  File "foo.py", line 7, in baz
    def baz(self, x, y):
        return x + y
               ~~~~~ <--- HERE
RuntimeError: The size of tensor a (4) must match the size of tensor b (5) at non-singleton dimension 1
```

It follows the Python convention of putting the most important information last
and reading from the bottom up.

Changes:
* Moved the error message to the end, to copy Python
* Report original traceback separate from serialized traceback
* Make sure root functions have names in the interpreter trace.

Test Plan: Imported from OSS

Differential Revision: D20126136

Pulled By: zdevito

fbshipit-source-id: fd01f9985e5d74e04c4d064c02e8bc320f4fac13
2020-03-03 12:27:38 -08:00
Michael Suo
dbe850af5b [jit] do the code reorg (#33851)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33851

Rationale and context described in #33828.

Script to reproduce the move:
https://gist.github.com/suo/16cbefaaeb67ca5a7c6caffd49b7f6e9
ghstack-source-id: 99079645

Test Plan: Make sure CI passes

Reviewed By: jamesr66a

Differential Revision: D20133869

fbshipit-source-id: 390e9241a9c85366d9005c492ac31f10aa96488e
2020-02-27 13:02:51 -08:00