Commit Graph

73 Commits

Author SHA1 Message Date
cyy
8f291e8c00 Fix clang-tidy warnings in torch/jit (#146963)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146963
Approved by: https://github.com/davidberard98
2025-02-15 03:36:59 +00:00
cyy
419a7e197d [6/N] Fix Wextra-semi warning (#139605)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139605
Approved by: https://github.com/ezyang
2024-11-04 13:43:16 +00:00
cyy
7bbdf87517 [22/N] Fix clang-tidy warnings in jit (#134829)
Follows #134537

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134829
Approved by: https://github.com/ezyang
2024-09-19 19:24:42 +00:00
cyy
07fe1dd58f [13/N] Fix clang-tidy warnings in jit (#132411)
Follows #132209

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132411
Approved by: https://github.com/Skylion007
2024-08-02 03:14:09 +00:00
cyy
c99adce9a1 [12/N] Fix clang-tidy warnings in jit (#132209)
Follows #132131

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132209
Approved by: https://github.com/Skylion007
2024-08-01 15:12:12 +00:00
cyy
eccbd408e5 [10/N] Fix clang-tidy warnings in jit (#132122)
Follows #132010

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132122
Approved by: https://github.com/Skylion007
2024-07-30 12:56:31 +00:00
cyy
f4dcf2ae93 [1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128301
Approved by: https://github.com/ezyang, https://github.com/r-barnes
2024-07-08 07:03:53 +00:00
PyTorch MergeBot
846bb30e13 Revert "[1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301)"
This reverts commit bd72e28314.

Reverted https://github.com/pytorch/pytorch/pull/128301 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it fails XLA build bd72e28314. Please rebase your PR before relanding because I think the failure is hidden by an unrelated broken trunk XLA failure from your current base commit ([comment](https://github.com/pytorch/pytorch/pull/128301#issuecomment-2169035822))
2024-06-15 01:58:20 +00:00
cyy
bd72e28314 [1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128301
Approved by: https://github.com/ezyang
2024-06-14 23:21:01 +00:00
Richard Barnes
ed327876f5 [codemod] c10:optional -> std::optional (#126135)
Generated by running the following from PyTorch root:
```
find . -regex ".*\.\(cpp\|h\|cu\|hpp\|cc\|cxx\)$" | grep -v "build/" | xargs -n 50 -P 4 perl -pi -e 's/c10::optional/std::optional/'
```

`c10::optional` is just an alias for `std::optional`. This removes usages of that alias in preparation for eliminating it entirely.
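Since the alias makes the two spellings the same type, the rewrite is purely mechanical. A minimal sketch of why (the `c10_like` namespace below is a hypothetical stand-in for the real c10 header, which this listing does not show):

```cpp
#include <optional>
#include <type_traits>

namespace c10_like {
// Hypothetical stand-in for the real header: an alias template, so
// c10_like::optional<T> and std::optional<T> are the *same* type.
template <typename T>
using optional = std::optional<T>;
}  // namespace c10_like

// Because it is an alias (not a distinct wrapper type), rewriting one
// spelling to the other cannot change behavior or overload resolution.
static_assert(std::is_same_v<c10_like::optional<int>, std::optional<int>>,
              "alias and standard type are identical");
```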

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126135
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/albanD, https://github.com/aaronenyeshi
2024-05-14 19:35:51 +00:00
Aaron Gokaslan
b9182cbbd8 Fixup torch jit with some initializers and moves (#92037)
Fix up some minor code-quality issues in torch JIT

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92037
Approved by: https://github.com/ezyang
2023-01-12 17:29:24 +00:00
Aaron Gokaslan
18b37bbff9 Clang-Tidy: Improve tensorexpr headers with additional std::moves (#91572)
Splitting #91559 into smaller pieces

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91572
Approved by: https://github.com/ezyang
2023-01-05 09:57:54 +00:00
Wang, Eikan
429a80dded [NNC] Lowering function generates the output buffer with the specified stride (#76529)
Summary:
Pass stride information to the lowering function so it generates the output buffer with the proper memory layout.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76529

Reviewed By: ZolotukhinM

Differential Revision: D36116712

Pulled By: IvanKobzarev

fbshipit-source-id: d3901f756b3710ecce172d6db3ecb0b7c12fb929
(cherry picked from commit b6cd53c91c01db36ea0e99167dc0ce0ae1d3aa23)
2022-05-04 20:04:22 +00:00
Peter Bell
2e480fc2db Cleanup ATen-core forward declarations
I noticed that when `SymInt` was introduced, `jit_type_base.h` was
added as an include to the `Operator.h` template, which is supposed to
be kept extremely clean and use only forward declarations. I also
noticed that the forward declarations for `OptionalArrayRef` were missing.

So, I've refactored the forward declarations into
`ATen/core/ATen_fwd.h` and cleaned up some of the `c10`
headers that were masking these missing declarations. I've also
re-generated the pre-compiled header so `SymInt` is included.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76576
Approved by: https://github.com/albanD
2022-05-02 14:50:48 +00:00
zengk95
1d55518198 Revert "[nnc] Strides to Tensor (#72962)"
This reverts commit 939060925f.

Fixes https://github.com/pytorch/vision/issues/5873

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76332
Approved by: https://github.com/seemethere
2022-04-25 19:50:00 +00:00
Ivan Kobzarev
939060925f [nnc] Strides to Tensor (#72962)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72962

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM, cpuhrsch

Differential Revision: D34589306

Pulled By: IvanKobzarev

fbshipit-source-id: ecee5249760ecc0c8b2edb1842b90218899bc944
(cherry picked from commit 9e310c4c67389da30da89126d838ffe3864aba6f)
2022-04-23 19:35:15 +00:00
Wang, Eikan
ef0873327e [NNC] Add utility functions to check channels-last contiguous (#75938)
Summary:
The `Buf` uses `std::vector<ExprHandle>` to represent its strides. An `ExprHandle` can be an immediate value or a mathematical expression with variables, for both static and dynamic shapes, so it is hard to deduce a channels-last contiguous layout directly by numerical calculation. Hence, the utility functions in this PR use pattern matching to check whether a `Buf` is channels-last contiguous.
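For the fully static case, the stride pattern being matched reduces to a simple check. A minimal scalar sketch (a hypothetical helper, not the NNC API, using plain integers instead of `ExprHandle`s):

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper: for a 4-D buffer with dims ordered (N, C, H, W),
// channels-last ("NHWC") contiguity means the strides, in that same dim
// order, are (C*H*W, 1, W*C, C): channels vary fastest in memory.
bool isChannelsLastContiguous(const std::array<int64_t, 4>& dims,
                              const std::array<int64_t, 4>& strides) {
  const int64_t C = dims[1], H = dims[2], W = dims[3];
  return strides[1] == 1 &&           // C is innermost
         strides[3] == C &&           // a step in W skips C elements
         strides[2] == W * C &&       // a step in H skips a row of W*C
         strides[0] == H * W * C;     // a step in N skips a full image
}
```

With symbolic `ExprHandle` strides the same relationships have to be recognized structurally, which is why pattern matching is needed.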

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75938

Reviewed By: cpuhrsch

Differential Revision: D35724091

Pulled By: ZolotukhinM

fbshipit-source-id: f79ae21749d0aad8601f0434b52df88602ff09bf
(cherry picked from commit 3712bbbe4bea57c5c1abe1eafde4b8778e13e0c4)
2022-04-22 06:42:39 -07:00
Mikhail Zolotukhin
9123e9b3b5 [TensorExpr] Switch from ExprPtr to ExprHandle in Compute impl. (#72389)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72389

This is an NFC change that just prepares the code for the upcoming
deletion of the `DimArg` class. It makes the `Compute` and `Reduce`
APIs use `ExprHandle` everywhere.

There should be no observable behavior change from this PR.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D34030295

Pulled By: ZolotukhinM

fbshipit-source-id: 3fd035b6a6bd0a07ccfa92e118819478ae85412a
(cherry picked from commit 1b0a4b6fac)
2022-02-11 01:21:59 +00:00
Ivan Kobzarev
6fb8ebcd92 [tensorexp] Add strides to Buf (#68018)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68018

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D32262381

Pulled By: IvanKobzarev

fbshipit-source-id: dba79add0bf703bc2378d64e726d4c47ec30e3be
2021-11-13 08:33:01 -08:00
Ivan Kobzarev
e52d0e773b [tensorexpr][ir][quant] Adding qscale and qzero to tensorexpr IR Buf (#66675)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66675

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D31676328

Pulled By: IvanKobzarev

fbshipit-source-id: c6479415fa7d809e02dd3789ee0bfd6dfe50dc92
2021-10-27 01:32:16 -07:00
Mikhail Zolotukhin
f23f21dafe [TensorExpr] Remove 'Placeholder' class. (#64887)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64887

BufHandle has exactly the same functionality and should be used instead.

Differential Revision: D30889483

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: 365fe8e396731b88920535a3de96bd3301aaa3f3
2021-09-14 00:22:44 -07:00
Bert Maher
e7fb35021a [nnc] Enable fusion of bfloat16 ops (#64196)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64196

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D30643864

Pulled By: bertmaher

fbshipit-source-id: e95edeaf7089464d713ea1d1f951743d3e5f61c5
2021-08-30 20:09:36 -07:00
Raghavan Raman
a836d83957 [nnc] Fixed warning due to implicit parameter conversion (#64117)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64117

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D30616945

Pulled By: navahgar

fbshipit-source-id: eaf69232ac4a684ab5f97a54a514971655f86ef3
2021-08-30 04:39:34 -07:00
Bert Maher
2e6221a232 [nnc] Make 64-bit dimensions work (#64077)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64077

We were assuming kernel dimensions fit in 32 bits (the old fuser made
this assumption too), but we should be able to support 64.
ghstack-source-id: 136933272

Test Plan: unit tests; new IR level test with huge sizes

Reviewed By: ZolotukhinM

Differential Revision: D30596689

fbshipit-source-id: 23b7e393a2ebaecb0c391a6b1f0c4b05a98bcc94
2021-08-28 19:59:47 -07:00
Cheng Chang
0f6b524665 [NNC] Add C++ codegen backend to NNC (#62869)
Summary:
Adds a C++ codegen backend to NNC to generate C++ for CPU instead of generating LLVM IR.
Tensors are represented as blobs of float. Vector operations are devectorized/unrolled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62869

Test Plan:
https://github.com/pytorch/pytorch/tree/mvz-nnc-aot-prototype makes it able to AOT compile the whole MobileNetV3 model into binary code through LLVM codegen in NNC.

I forked that branch to https://github.com/cheng-chang/pytorch/tree/cc-aot-cpp, merged this PR into it, and modified `fancy_compile` to compile MobileNetV3 into C++ through

```
import torch

m = torch.jit.load('mobnet.pt')
m.eval()
f = torch.jit.freeze(m)
torch._C._fancy_compile(f.graph, [1, 3, 224, 224])
```

The generated C++ file `mobnet.cc` can be found at https://gist.github.com/cheng-chang/e2830cc6920b39204ebf368035b2bcec.

I manually compiled the generated C++ through `g++ -o mobnet -std=c++14 -L./build/lib -ltorch_cpu -ltorch mobnet.cc`, and it succeeded.

Reviewed By: ZolotukhinM

Differential Revision: D30149482

Pulled By: cheng-chang

fbshipit-source-id: e77b189f0353e37cd309423a48a513e668d07675
2021-08-26 09:56:37 -07:00
Mikhail Zolotukhin
f0d274294d [TensorExpr] Nuke KernelArena and KernelScope. (#63587)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63587

Now that there are no classes using KernelArena for memory management,
we can remove it.

Differential Revision: D30429115

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: 375f6f9294d27790645eeb7cb5a8e87047a57544
2021-08-24 00:32:16 -07:00
Mikhail Zolotukhin
4e15a6f495 [TensorExpr] Switch Exprs and Stmt from kernel-arena to shared_ptr. (#63216)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63216

Currently there are three classes managed by KernelArena: Expr, Stmt,
and Tensor (and derived classes). KernelArena has been a long-standing
pain point for NNC devs, and we're moving away from that memory-management
model to a ref-count-based model (using shared_ptr). This commit
switches Expr and Stmt to shared_ptr and is the biggest change in this
transition. Later commits will detach Tensor from KernelArena and kill
the arena + scope altogether.

Differential Revision: D30353195

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: 9575225ada3d0fb65087ae40435f3dfea4792cae
2021-08-24 00:32:11 -07:00
Mikhail Zolotukhin
1dc2b52764 [TensorExpr] Add a wrapper for all expr and stmt pointers. (#63195)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63195

This helps us to later switch from using KernelArena with raw pointers
to shared pointers without having to change all our source files at
once.

The changes are mechanical and should not affect any functionality.

With this PR, we're changing the following:
 * `Add*` --> `AddPtr`
 * `new Add(...)` --> `alloc<Add>(...)`
 * `dynamic_cast<Add*>` --> `to<Add>`
 * `static_cast<Add*>` --> `static_to<Add>`

Due to some complications with args forwarding, some places became more
verbose, e.g.:
 * `new Block({})` --> `new Block(std::vector<ExprPtr>())`
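The mappings above can be sketched with a toy hierarchy (the `Expr`/`Add` types below are illustrative, not the real NNC classes; only the alias/helper names mirror the renames listed in the message):

```cpp
#include <memory>
#include <utility>

// Toy IR node hierarchy standing in for the real NNC classes.
struct Expr {
  virtual ~Expr() = default;
};
struct Add : Expr {
  int lhs, rhs;
  Add(int l, int r) : lhs(l), rhs(r) {}
};

// `Add*` --> `AddPtr`: the pointer spelling becomes an alias, so it can
// later be retargeted from raw pointers to shared_ptr in one place.
using ExprPtr = std::shared_ptr<Expr>;
using AddPtr = std::shared_ptr<Add>;

// `new Add(...)` --> `alloc<Add>(...)`
template <class T, class... Args>
std::shared_ptr<T> alloc(Args&&... args) {
  return std::make_shared<T>(std::forward<Args>(args)...);
}

// `dynamic_cast<Add*>` --> `to<Add>`
template <class T>
std::shared_ptr<T> to(const ExprPtr& e) {
  return std::dynamic_pointer_cast<T>(e);
}

// `static_cast<Add*>` --> `static_to<Add>`
template <class T>
std::shared_ptr<T> static_to(const ExprPtr& e) {
  return std::static_pointer_cast<T>(e);
}
```

Because every allocation and cast goes through these wrappers, switching the underlying pointer type does not require touching each call site again.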

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D30292779

Pulled By: ZolotukhinM

fbshipit-source-id: 150301c7d2df56b608b035827b6a9a87f5e2d9e9
2021-08-17 13:44:45 -07:00
Raghavan Raman
e50e8b07d8 [nnc] Updated IRMutator and IRSimplifier to perform in-place mutations. (#63246)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63246

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D30309636

Pulled By: navahgar

fbshipit-source-id: 409ea8d6982888cfee9127e6248044dd2ed9d8d4
2021-08-16 00:09:22 -07:00
Raghavan Raman
59dd12042e [nnc] Removed const from all fields in IR. (#62336)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62336

This PR was generated by removing `const` from all types of nodes in NNC IR and fixing the compilation errors that resulted from this change.

This is the first step in making all NNC mutations in-place.

Test Plan: Imported from OSS

Reviewed By: iramazanli

Differential Revision: D30049829

Pulled By: navahgar

fbshipit-source-id: ed14e2d2ca0559ffc0b92ac371f405579c85dd63
2021-08-03 11:44:36 -07:00
Mike Guo
6ecc1a4c4f Make pytorch clang-tidy clean (#60649)
Summary:
This PR suppresses clang-tidy warnings in the codebase (for now) so that we can re-enable clang-tidy checks on master.

I ran this script to add the `NOLINTNEXTLINE` comments (on a devserver):
```bash
python3 setup.py develop

# Uses same script that's run on CI and adds the -j (parallel), -s (add comments), -k (continue if diagnostic errors are found) options
python3 tools/clang_tidy.py \
  -j \
  -s \
  -k \
  -v \
  --paths torch/csrc/ \
  -g"-torch/csrc/jit/passes/onnx/helper.cpp" \
  -g"-torch/csrc/jit/passes/onnx/shape_type_inference.cpp" \
  -g"-torch/csrc/jit/serialization/onnx.cpp" \
  -g"-torch/csrc/jit/serialization/export.cpp" \
  -g"-torch/csrc/jit/serialization/import.cpp" \
  -g"-torch/csrc/jit/serialization/import_legacy.cpp" \
  -g"-torch/csrc/onnx/init.cpp" \
  -g"-torch/csrc/cuda/nccl.*" \
  -g"-torch/csrc/cuda/python_nccl.cpp" \
  -g"-torch/csrc/autograd/FunctionsManual.cpp" \
  -g"-torch/csrc/generic/*.cpp" \
  -g"-torch/csrc/jit/codegen/cuda/runtime/*" \
  -g"-torch/csrc/deploy/interpreter/interpreter.cpp" \
  -g"-torch/csrc/deploy/interpreter/interpreter.h" \
  -g"-torch/csrc/deploy/interpreter/interpreter_impl.h" \
  -g"-torch/csrc/deploy/interpreter/test_main.cpp"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60649

Test Plan: Verified changes by re-running the script (without the `-s` option) and seeing no warnings/errors.

Reviewed By: walterddr, janeyx99

Differential Revision: D29504258

Pulled By: 1ntEgr8

fbshipit-source-id: 78310b30ee8213b73ddb4771ad874665323e7a4e
2021-07-01 12:21:07 -07:00
Jason Ansel
85517a2b70 [TensorExpr] More python binding cleanups (#60058)
Summary:
A few more quality of life improvements for NNC's python bindings:
- Use standard `torch.dtype`s (rather than `te.Dtype`)
- Make names optional (they don't seem to matter)
- Make shapes optional
- A few implicit conversions to make code cleaner

Followup to https://github.com/pytorch/pytorch/issues/59920

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60058

Reviewed By: bertmaher

Differential Revision: D29151953

Pulled By: jansel

fbshipit-source-id: c8286e329eb4ee3921ca0786e17248cf6a898bd8
2021-06-16 20:06:08 -07:00
Hui Guo
f4fdc49957 [NNC] Add python bindings for loopnest.compress_buffer (#59681)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59681

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D28981573

Pulled By: huiguoo

fbshipit-source-id: 003d66df576903c71bf46c95851fe6ccbba76f29
2021-06-11 11:28:39 -07:00
Raghavan Raman
eef72f3f8a [NNC] Update Buf on mutation instead of creating new ones (#57513)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57513

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D28226917

Pulled By: navahgar

fbshipit-source-id: 4e74c56a85b7aadc285b872b8ef8f8e26f31c8ce
2021-05-06 01:08:23 -07:00
Hui Guo
afe6b4c8ee [NNC] Add logical Operators '&&' and '||' (#56947)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56947

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D28007342

Pulled By: huiguoo

fbshipit-source-id: a2ad8d2e99d7c8d8c8bdcd8f65fa3f340bdd2bbc
2021-05-01 18:44:27 -07:00
Raghavan Raman
5b7317b562 [NNC] API for Buffer Compression (#55853)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54338

This PR adds the following API in NNC to implement "buffer compression".

```
static void compressBuffer(Buf* buf, Stmt* stmt);
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55853

Reviewed By: ezyang

Differential Revision: D27960986

Pulled By: navahgar

fbshipit-source-id: a69988e607196f3e2db0212313ea5deefb9859ac
2021-04-23 14:12:03 -07:00
Bert Maher
90f848572c NNC depthwise conv2d implementation (#54920)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54920

Add a depthwise convolution implementation and reasonably good
schedules for 3x3 stride=1,2.
ghstack-source-id: 126076113

Test Plan: new tensorexpr test: Conv.DepthwiseConv2D

Reviewed By: ZolotukhinM

Differential Revision: D27413745

fbshipit-source-id: 833da6072b655fbe2b679704e9d56a08e1bf7e7e
2021-04-08 21:56:53 -07:00
Nikita Shulga
6a39613f35 [BE] Make torch/csrc/jit/tensorexpr/ clang-tidy clean (#55628)
Summary:
Mostly auto-generated changes using
```
 python3 tools/clang_tidy.py -c build -x torch/csrc/jit/tensorexpr/eval.cpp -s
```
With following common patterns manually fixed
- Use ` = default` instead of `{}`
- Make deleted methods public
- Use pass-by-value + std::move instead of pass-by-reference+copy
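The last pattern can be sketched as follows (an illustrative class, not one from the codebase):

```cpp
#include <string>
#include <utility>

// Pass-by-value + std::move: the caller pays one copy for an lvalue
// argument or one move for an rvalue, instead of always paying
// a copy with the pass-by-const-reference-then-copy pattern.
class Kernel {
 public:
  explicit Kernel(std::string name) : name_(std::move(name)) {}
  const std::string& name() const { return name_; }

 private:
  std::string name_;
};
```

This is the fix clang-tidy's `modernize-pass-by-value` check suggests for copy-only constructor parameters.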

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55628

Reviewed By: walterddr

Differential Revision: D27655378

Pulled By: malfet

fbshipit-source-id: 92be87a08113435d820711103ea9b0364182c71a
2021-04-08 19:44:14 -07:00
Mikhail Zolotukhin
ff6b3c76ab [TensorExpr] Add TORCH_APIs to all expr classes. (#55002)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55002

Test Plan: Imported from OSS

Reviewed By: navahgar, walterddr

Differential Revision: D27446409

Pulled By: ZolotukhinM

fbshipit-source-id: 3442d5876bc68974fb3d44878f89c1a7895668d2
2021-04-01 19:48:10 -07:00
Mikhail Zolotukhin
1ccaec0238 [TensorExpr] Cleanup IRNodeType enum. (#55001)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55001

The enum is only used for precedence computation, so we only need to
enumerate node types for which we know the precedence priority.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D27446410

Pulled By: ZolotukhinM

fbshipit-source-id: 217dd63c4fd086155030ebf0c3e1772605109f7b
2021-04-01 19:48:07 -07:00
Horace He
42e0983230 [NNC] Added some APIs for dealing directly with Bufs (instead of Tensors) (#53011)
Summary:
(also includes some python binding stuff :P)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53011

Reviewed By: gchanan, robieta

Differential Revision: D26801120

Pulled By: Chillee

fbshipit-source-id: 42a1efb6cbc9ddc0b72b780f3d6b712b3ae62b09
2021-03-05 06:55:48 -08:00
Hui Guo
973e306c84 changed TE 'Allocate' API to take one argument 'Buf' instead of three arguments 'Var', 'dtype', 'dims'. (#50167)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50167

Test Plan:
Imported from OSS

`python test/test_jit_fuser_te.py`
`python test/test_jit_fuser_legacy.py`
`python test/test_jit_fuser.py`
`build/bin/test_tensorexpr`

Reviewed By: ZolotukhinM

Differential Revision: D25814342

Pulled By: huiguoo

fbshipit-source-id: 44cba7f92365b826c9cb1d385a94858934570dee
2021-02-22 15:08:51 -08:00
Zirui Tao
2b202667c1 [1/N] CPU pointwise optimization: Add a benchmark for Relu
Summary: As title

Test Plan:
Building: finished in 01:58.4 min (100%) 16761/16761 jobs, 16761 updated
  Total time: 02:32.3 min
Run on (24 X 2394.45 MHz CPU s)
2021-02-16 21:29:30
----------------------------------------------------------------------------------------------------
Benchmark                                             Time           CPU Iterations UserCounters...
----------------------------------------------------------------------------------------------------
relu_nnc/64                                        1738 ns       1738 ns     410535 log/s=36.8257M/s
relu_nnc/512                                       1708 ns       1708 ns     408678 log/s=299.711M/s
relu_nnc/8192                                      3297 ns       3297 ns     214362 log/s=2.48499G/s
relu_nnc/32768                                    10725 ns      10722 ns      61032 log/s=3.05603G/s
log_nnc_sleef/64                                   2076 ns       2075 ns     326248 log/s=30.8436M/s
log_nnc_sleef/512                                  3070 ns       3069 ns     230616 log/s=166.81M/s
log_nnc_sleef/8192                                22214 ns      22210 ns      31251 log/s=368.849M/s
log_nnc_sleef/32768                               85835 ns      85824 ns       8366 log/s=381.804M/s
log_nnc_fast/64                                    1852 ns       1852 ns     379123 log/s=34.5532M/s
log_nnc_fast/512                                   2456 ns       2456 ns     299463 log/s=208.503M/s
log_nnc_fast/8192                                 10953 ns      10952 ns      69894 log/s=747.957M/s
log_nnc_fast/32768                                35424 ns      35422 ns      19986 log/s=925.08M/s
log_nnc_vml/64                                     2361 ns       2361 ns     356220 log/s=27.1063M/s
log_nnc_vml/512                                    2218 ns       2218 ns     313444 log/s=230.857M/s
log_nnc_vml/8192                                   8420 ns       8420 ns      81594 log/s=972.912M/s
log_nnc_vml/32768                                 29484 ns      29484 ns      21701 log/s=1.1114G/s
log_aten/64                                       15970 ns      15970 ns      44401 log/s=4.00742M/s
log_aten/512                                      18344 ns      18344 ns      41056 log/s=27.9114M/s
log_aten/8192                                     24894 ns      24893 ns      27414 log/s=329.084M/s
log_aten/32768                                    29129 ns      29125 ns      22477 log/s=1.12508G/s
logit_nnc_sleef/64                                 2379 ns       2379 ns     261168 logit/s=26.8981M/s
logit_nnc_sleef/512                                5778 ns       5774 ns     114009 logit/s=88.6757M/s
logit_nnc_sleef/8192                              57268 ns      57236 ns      12429 logit/s=143.127M/s
logit_nnc_sleef/32768                            216356 ns     216344 ns       3026 logit/s=151.462M/s
logit_nnc_fast/64                                  2178 ns       2173 ns     282306 logit/s=29.4565M/s
logit_nnc_fast/512                                 2955 ns       2943 ns     202527 logit/s=173.95M/s
logit_nnc_fast/8192                               14836 ns      14835 ns      46794 logit/s=552.192M/s
logit_nnc_fast/32768                              53999 ns      53997 ns      12842 logit/s=606.846M/s
logit_nnc_vml/64                                   2132 ns       2132 ns     335874 logit/s=30.018M/s
logit_nnc_vml/512                                  3029 ns       3029 ns     250988 logit/s=169.058M/s
logit_nnc_vml/8192                                13264 ns      13263 ns      53504 logit/s=617.655M/s
logit_nnc_vml/32768                               49395 ns      48284 ns      14526 logit/s=678.654M/s
logit_aten/64                                     88180 ns      86690 ns       9270 logit/s=738.261k/s
logit_aten/512                                    54682 ns      54489 ns      10000 logit/s=9.3964M/s
logit_aten/8192                                  170878 ns     164357 ns       6965 logit/s=49.8427M/s
logit_aten/32768                                 452291 ns     434638 ns       3967 logit/s=75.3915M/s
logit_caffe2/64                                   30170 ns      29902 ns      24686 logit/s=2.14029M/s
logit_caffe2/512                                 203517 ns     201201 ns       3570 logit/s=2.54472M/s
logit_caffe2/8192                               3199528 ns    3157098 ns        220 logit/s=2.59479M/s
logit_caffe2/32768                             12520838 ns   12504846 ns         56 logit/s=2.62042M/s
tanh_nnc_fast/64                                   1979 ns       1977 ns     309745 tanh/s=32.3752M/s
tanh_nnc_fast/512                                  2331 ns       2331 ns     300937 tanh/s=219.636M/s
tanh_nnc_fast/8192                                 8323 ns       8323 ns      83601 tanh/s=984.26M/s
tanh_nnc_fast/32768                               30767 ns      30766 ns      23024 tanh/s=1065.06M/s
tanh_aten/64                                      17181 ns      17180 ns      36818 tanh/s=3.72522M/s
tanh_aten/512                                     19071 ns      19036 ns      37243 tanh/s=26.8968M/s
tanh_aten/8192                                    53542 ns      52006 ns      16268 tanh/s=157.521M/s
tanh_aten/32768                                  619869 ns     587600 ns       1000 tanh/s=55.7658M/s
tanh_caffe2/64                                     9668 ns       9654 ns      70926 tanh/s=6.62919M/s
tanh_caffe2/512                                   70409 ns      70409 ns       9881 tanh/s=7.27184M/s
tanh_caffe2/8192                                1179098 ns    1179011 ns        644 tanh/s=6.9482M/s
tanh_caffe2/32768                               4384300 ns    4382613 ns        156 tanh/s=7.47682M/s
BatchNorm/ATen/1/64/112/112                    23186429 ns   23183715 ns         27 GB/s=277.028M/s
BatchNorm/ATen/1/256/14/14                      1772907 ns    1770636 ns        394 GB/s=226.703M/s
BatchNorm/ATen/1/128/28/28                      3069417 ns    3069229 ns        232 GB/s=261.569M/s
BatchNorm/ATen/1/64/56/56                       6367276 ns    6367190 ns        111 GB/s=252.173M/s
BatchNorm/ATen/1/512/7/7                        1334734 ns    1334373 ns        516 GB/s=150.411M/s
BatchNorm/ATen/5/64/112/112                   131727903 ns  131721364 ns          7 GB/s=243.792M/s
BatchNorm/ATen/5/256/14/14                      7879002 ns    7874672 ns         85 GB/s=254.873M/s
BatchNorm/ATen/5/128/28/28                     15561373 ns   15269781 ns         42 GB/s=262.877M/s
BatchNorm/ATen/5/64/56/56                      29169722 ns   29107393 ns         24 GB/s=275.812M/s
BatchNorm/ATen/5/512/7/7                        5042006 ns    5028687 ns        100 GB/s=199.559M/s
BatchNorm/NNC/1/64/112/112                      3303598 ns    3271058 ns        188 GB/s=1.96344G/s
BatchNorm/NNC/1/256/14/14                        330641 ns     326644 ns       2033 GB/s=1.22889G/s
BatchNorm/NNC/1/128/28/28                        498706 ns     497894 ns       1131 GB/s=1.61242G/s
BatchNorm/NNC/1/64/56/56                        1116910 ns    1114768 ns        641 GB/s=1.44033G/s
BatchNorm/NNC/1/512/7/7                          163380 ns     163351 ns       3493 GB/s=1.22867G/s
BatchNorm/NNC/5/64/112/112                     16392078 ns   16386427 ns         41 GB/s=1.95971G/s
BatchNorm/NNC/5/256/14/14                       1133781 ns    1133369 ns        674 GB/s=1.77086G/s
BatchNorm/NNC/5/128/28/28                       2053208 ns    2053211 ns        276 GB/s=1.95503G/s
BatchNorm/NNC/5/64/56/56                        3874949 ns    3874734 ns        165 GB/s=2.07193G/s
BatchNorm/NNC/5/512/7/7                          653665 ns     651498 ns       1236 GB/s=1.54033G/s
BatchNorm/ATenRelu/1/64/112/112                36878892 ns   36100523 ns         22 GB/s=177.907M/s
BatchNorm/ATenRelu/1/256/14/14                  6404318 ns    5544976 ns        100 GB/s=72.3913M/s
BatchNorm/ATenRelu/1/128/28/28                  5897059 ns    5735509 ns        106 GB/s=139.973M/s
BatchNorm/ATenRelu/1/64/56/56                  10075458 ns    9965146 ns         62 GB/s=161.125M/s
BatchNorm/ATenRelu/1/512/7/7                    2680507 ns    2662541 ns        254 GB/s=75.3806M/s
BatchNorm/ATenRelu/5/64/112/112               145738113 ns  144253693 ns          5 GB/s=222.612M/s
BatchNorm/ATenRelu/5/256/14/14                 13582519 ns   13427209 ns         65 GB/s=149.476M/s
BatchNorm/ATenRelu/5/128/28/28                 22747138 ns   22627185 ns         31 GB/s=177.401M/s
BatchNorm/ATenRelu/5/64/56/56                  53609692 ns   52936728 ns         15 GB/s=151.656M/s
BatchNorm/ATenRelu/5/512/7/7                   11378314 ns   11083777 ns         65 GB/s=90.5395M/s
BatchNorm/NNCRelu/1/64/112/112                  3154436 ns    3148939 ns        193 GB/s=2.03958G/s
BatchNorm/NNCRelu/1/256/14/14                    337341 ns     337163 ns       1926 GB/s=1.19055G/s
BatchNorm/NNCRelu/1/128/28/28                    505570 ns     505569 ns       1231 GB/s=1.58794G/s
BatchNorm/NNCRelu/1/64/56/56                     903452 ns     903421 ns        659 GB/s=1.77728G/s
BatchNorm/NNCRelu/1/512/7/7                      158521 ns     158321 ns       3781 GB/s=1.2677G/s
BatchNorm/NNCRelu/5/64/112/112                 15488210 ns   15480019 ns         41 GB/s=2.07446G/s
BatchNorm/NNCRelu/5/256/14/14                   1149186 ns    1148963 ns        649 GB/s=1.74683G/s
BatchNorm/NNCRelu/5/128/28/28                   2011589 ns    2011424 ns        320 GB/s=1.99564G/s
BatchNorm/NNCRelu/5/64/56/56                    3776274 ns    3776060 ns        161 GB/s=2.12607G/s
BatchNorm/NNCRelu/5/512/7/7                      699762 ns     699582 ns        975 GB/s=1.43446G/s
BM_CompileSwish                                30471825 ns   30470017 ns         24
BM_CompileSwishLLVMOnly                        27479624 ns   27473475 ns         25
FusedOverhead                                    196219 ns     196195 ns       3342
UnfusedOverhead                                  220210 ns     220119 ns       3302
Gemm/Torch/128/128/128                           115526 ns     115343 ns       7414 GFLOPS=36.3637G/s
Gemm/TensorExprNoopt/128/128/128                3155851 ns    3155706 ns        210 GFLOPS=1.32912G/s
Gemm/TensorExprTile32x32/128/128/128             124454 ns     124452 ns       5774 GFLOPS=33.7021G/s
Gemm/TensorExprTile4x16/128/128/128              174408 ns     174366 ns       3987 GFLOPS=24.0546G/s
Gemm/TensorExprTile4x16VecUnroll/128/128/128      72949 ns      72948 ns       9028 GFLOPS=57.4974G/s
Gemm/TensorExprTile4x16Cache/128/128/128          73237 ns      73234 ns       9501 GFLOPS=57.2726G/s
Reduce1D/Torch/16777216                       426865265 ns  426853756 ns          2 BYTES=157.217M/s
Reduce1D/Naive/16777216                       132347709 ns  132343710 ns          5 BYTES=507.08M/s
Reduce1D/NativeRfactor/16777216               234668375 ns  234664682 ns          3 BYTES=285.978M/s
Reduce1D/TeNaive/16777216                      20468304 ns   20467906 ns         34 BYTES=3.27874G/s
Reduce1D/TeSplitTail/16777216                  20378995 ns   20378678 ns         34 BYTES=3.29309G/s
Reduce1D/TeSplitMask/16777216                  20371783 ns   20371260 ns         36 BYTES=3.29429G/s
Reduce1D/TeRfactorV2/16777216                   8235908 ns    8235723 ns         84 BYTES=8.14851G/s

CPU info:

Running ```sudo lshw -class processor``` reports 24 CPUs with identical architecture, as follows:

  *-cpu:0
       description: CPU
       product: Intel Core Processor (Broadwell)
       vendor: Intel Corp.
       physical id: 400
       bus info: cpu@0
       version: 6.61.2
       slot: CPU 0
       size: 2GHz
       capacity: 2GHz
       width: 64 bits
       capabilities: fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp x86-64 constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat
       configuration: cores=1 enabledcores=1 microcode=1 threads=1

Reviewed By: bwasti

Differential Revision: D26275048

fbshipit-source-id: 3de669f622eb8cd328787caa878dc0c05de600a5
2021-02-17 17:18:28 -08:00
Bert Maher
2e35fe9535 [te] Implement log approximation using the VML approach (#51752)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51752

Using a straight power series approximation with enough terms gives
precision down to the denormal range, and avoids the fp division used in the
sleef approach.  This is nice because recent CPUs have dual pipelined fma units,
so we can compute 16 logarithms in parallel; whereas there's usually only one
FP divider and it has a fairly high latency/low throughput.
ghstack-source-id: 121392347

Test Plan:
On my avx2+fma broadwell:
```
---------------------------------------------------------------------------
Benchmark                    Time           CPU Iterations UserCounters...
---------------------------------------------------------------------------
log_nnc_sleef/64           178 ns        178 ns    3933565 log/s=358.993M/s
log_nnc_sleef/512         1286 ns       1285 ns     559459 log/s=398.354M/s
log_nnc_sleef/8192       19366 ns      19364 ns      36619 log/s=423.053M/s
log_nnc_sleef/32768      79288 ns      79286 ns       8718 log/s=413.287M/s

log_nnc_fast/64             92 ns         92 ns    7644990 log/s=696.939M/s
log_nnc_fast/512           483 ns        483 ns    1426802 log/s=1059.49M/s
log_nnc_fast/8192         7519 ns       7514 ns      95319 log/s=1090.23M/s
log_nnc_fast/32768       31344 ns      31338 ns      22397 log/s=1045.62M/s

log_nnc_vml/64              88 ns         88 ns    7923812 log/s=728.469M/s
log_nnc_vml/512            454 ns        454 ns    1521437 log/s=1.12739G/s
log_nnc_vml/8192          6763 ns       6763 ns     103264 log/s=1.21136G/s
log_nnc_vml/32768        26565 ns      26564 ns      23609 log/s=1.23354G/s

log_aten/64                418 ns        418 ns    1651401 log/s=153.117M/s
log_aten/512               801 ns        801 ns     875857 log/s=638.923M/s
log_aten/8192             6877 ns       6872 ns     100840 log/s=1.19208G/s
log_aten/32768           26989 ns      26988 ns      26268 log/s=1.21416G/s
```

Reviewed By: bwasti, zheng-xq

Differential Revision: D26246400

fbshipit-source-id: dae47ee6baeab1a813ec4d4440748164051aed3d
2021-02-10 02:09:10 -08:00
Mikhail Zolotukhin
42aeb68128 [TensorExpr] Move 'initializer' field from 'Tensor' to 'Buf'. (#50993)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50993

This is the first step toward making `Tensor` a thin wrapper over `Buf` and
`Stmt`, which will be finished in subsequent PRs. This change also
makes it possible to remove `buf_initializers_` from `LoopNest`, making it "less
stateful".

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D26038224

Pulled By: ZolotukhinM

fbshipit-source-id: f418816e54c62f291fa45812901487394e9b95b5
2021-01-27 16:10:53 -08:00
Bram Wasti
d60d108280 [nnc] Expose fast tanh/sigmoid (#50736)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50736

Exposes tanh and sigmoid to other backends

Test Plan: buck test caffe2/test/cpp/tensorexpr:tensorexpr -- "ATen.fast"

Reviewed By: bertmaher

Differential Revision: D25884911

fbshipit-source-id: f9a5286450331f60935cfd40bb23f4a4f4c1d087
2021-01-22 09:56:02 -08:00
Peng Wu
6568572712 Support integral types for kAbs in SimpleIREvaluator (#49357)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49357

This is a follow-up fix for PR #48679. The previous PR
added support for integer inputs to aten::abs by promoting integers to
float and then demoting the result back to integers. This PR supports
integer inputs to aten::abs more efficiently in the SimpleIREvaluator
by implementing integer inputs for kAbs (renamed from kFabs).
- Rename kFabs to kAbs
- Add support for integer input to kAbs in SimpleIREvaluator (note that
llvm_codegen and cuda_codegen already support integer inputs to kAbs)

Test Plan:
- `PYTORCH_TENSOREXPR_DONT_USE_LLVM=1 python test/test_jit_fuser_te.py
TestTEFuser.test_unary_ops`
- `python test/test_jit_fuser_te.py TestTEFuser.test_unary_ops`

Imported from OSS

Reviewed By: eellison

Differential Revision: D25545791

fbshipit-source-id: e52f51a352d149f66ce8341fb3beb479be08a230
2020-12-18 07:57:58 -08:00
Bram Wasti
1047957831 [te][reapply] Add fast log approximation based on sleef (#49575)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49575

This is a fast log implementation.

benchmark:

```
buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench -c 'fbcode.caffe2_gpu_type=none'
```

Test Plan: buck test mode/no-gpu //caffe2/test/cpp/tensorexpr:tensorexpr -- *.fastLogFloat

Reviewed By: bertmaher

Differential Revision: D25627157

fbshipit-source-id: a4920f4f4005ce617d372b375e790ca966275cd9
2020-12-17 17:02:00 -08:00
Edward Yang
ea4ccc730e Revert D25445815: [te] Add fast log approximation based on sleef
Test Plan: revert-hammer

Differential Revision:
D25445815 (1329066b69)

Original commit changeset: 20696eacd12a

fbshipit-source-id: 38830a6abd16260d60e5dd9a5594e65736a9c782
2020-12-17 15:03:17 -08:00
Bram Wasti
1329066b69 [te] Add fast log approximation based on sleef
Summary:
This is a fast log implementation.

benchmark:
```
buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench -c 'fbcode.caffe2_gpu_type=none'
```

Test Plan: buck test mode/no-gpu //caffe2/test/cpp/tensorexpr:tensorexpr -- *.fastLogFloat

Reviewed By: bertmaher

Differential Revision: D25445815

fbshipit-source-id: 20696eacd12a55e797f606f4a6dbbd94c9652888
2020-12-17 14:28:34 -08:00