Commit Graph

8 Commits

Author SHA1 Message Date
Mikhail Zolotukhin
f23f21dafe [TensorExpr] Remove 'Placeholder' class. (#64887)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64887

BufHandle has exactly the same functionality and should be used instead.

Differential Revision: D30889483

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: 365fe8e396731b88920535a3de96bd3301aaa3f3
2021-09-14 00:22:44 -07:00
Mikhail Zolotukhin
f0d274294d [TensorExpr] Nuke KernelArena and KernelScope. (#63587)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63587

Now that there are no classes using KernelArena for memory management, we
can remove it.

Differential Revision: D30429115

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: 375f6f9294d27790645eeb7cb5a8e87047a57544
2021-08-24 00:32:16 -07:00
Mikhail Zolotukhin
62d02f2b57 [TensorExpr] Make 'Tensor' a value type. (#63586)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63586

This is another step in the transition away from KernelArena memory
management. Tensor is essentially just a pair of <BufPtr, StmtPtr>, so we
don't need to dynamically allocate it at all: it's cheap to pass by value,
and that's what we're switching to in this commit.

After this change nothing uses KernelScope/KernelArena and they can be
safely removed.

Differential Revision: D30429114

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: f90b859cfe863692b7beffbe9bd0e4143df1e819
2021-08-24 00:32:13 -07:00
Mikhail Zolotukhin
1dc2b52764 [TensorExpr] Add a wrapper for all expr and stmt pointers. (#63195)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63195

This helps us to later switch from using KernelArena with raw pointers
to shared pointers without having to change all our source files at
once.

The changes are mechanical and should not affect any functionality.

With this PR, we're changing the following:
 * `Add*` --> `AddPtr`
 * `new Add(...)` --> `alloc<Add>(...)`
 * `dynamic_cast<Add*>` --> `to<Add>`
 * `static_cast<Add*>` --> `static_to<Add>`

Due to some complications with args forwarding, some places became more
verbose, e.g.:
 * `new Block({})` --> `new Block(std::vector<ExprPtr>())`

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D30292779

Pulled By: ZolotukhinM

fbshipit-source-id: 150301c7d2df56b608b035827b6a9a87f5e2d9e9
2021-08-17 13:44:45 -07:00
Raghavan Raman
30e24b2d2b [nnc] Modified vectorize API to return bool (#59422)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59422

Test Plan: Imported from OSS

Reviewed By: huiguoo

Differential Revision: D28886980

Pulled By: navahgar

fbshipit-source-id: 58cc3ecd86564a312a132f8260d836b096505095
2021-06-11 12:02:19 -07:00
Raghavan Raman
e2467cc43e [NNC] Make splitWithTail transform in-place (#58268)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58268

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D28427228

Pulled By: navahgar

fbshipit-source-id: 270b62c4e83739ad21dd68f375120e56881b394f
2021-05-25 11:31:14 -07:00
Bert Maher
4156588365 [nnc] Allow 1 ulp tolerance in log approximation (#52165)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52165

Apparently bitwise identity is too high a bar (I'm seeing
differences at this level depending on the HW platform; e.g.,
Broadwell is bitwise accurate but Skylake is 1 ulp off). In any case,
VML itself is only accurate to 1 ulp, so let's allow that.
ghstack-source-id: 121815001

Test Plan: test_approx

Reviewed By: asuhan

Differential Revision: D26408079

fbshipit-source-id: 46cbd1487c72ae7bc40567f2f72ed2b919707d0d
2021-02-16 16:49:36 -08:00
Bert Maher
2e35fe9535 [te] Implement log approximation using the VML approach (#51752)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51752

Using a straight power-series approximation with enough terms gives
precision down to the denormal range, and avoids the FP division used in the
sleef approach. This is nice because recent CPUs have dual-pipelined FMA
units, so we can compute 16 logarithms in parallel, whereas there's usually
only one FP divider and it has fairly high latency / low throughput.
ghstack-source-id: 121392347

Test Plan:
On my AVX2+FMA Broadwell:
```
---------------------------------------------------------------------------
Benchmark                    Time           CPU Iterations UserCounters...
---------------------------------------------------------------------------
log_nnc_sleef/64           178 ns        178 ns    3933565 log/s=358.993M/s
log_nnc_sleef/512         1286 ns       1285 ns     559459 log/s=398.354M/s
log_nnc_sleef/8192       19366 ns      19364 ns      36619 log/s=423.053M/s
log_nnc_sleef/32768      79288 ns      79286 ns       8718 log/s=413.287M/s

log_nnc_fast/64             92 ns         92 ns    7644990 log/s=696.939M/s
log_nnc_fast/512           483 ns        483 ns    1426802 log/s=1059.49M/s
log_nnc_fast/8192         7519 ns       7514 ns      95319 log/s=1090.23M/s
log_nnc_fast/32768       31344 ns      31338 ns      22397 log/s=1045.62M/s

log_nnc_vml/64              88 ns         88 ns    7923812 log/s=728.469M/s
log_nnc_vml/512            454 ns        454 ns    1521437 log/s=1.12739G/s
log_nnc_vml/8192          6763 ns       6763 ns     103264 log/s=1.21136G/s
log_nnc_vml/32768        26565 ns      26564 ns      23609 log/s=1.23354G/s

log_aten/64                418 ns        418 ns    1651401 log/s=153.117M/s
log_aten/512               801 ns        801 ns     875857 log/s=638.923M/s
log_aten/8192             6877 ns       6872 ns     100840 log/s=1.19208G/s
log_aten/32768           26989 ns      26988 ns      26268 log/s=1.21416G/s
```

Reviewed By: bwasti, zheng-xq

Differential Revision: D26246400

fbshipit-source-id: dae47ee6baeab1a813ec4d4440748164051aed3d
2021-02-10 02:09:10 -08:00