Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63587
Now that there are no classes using KernelArena for memory management, we
can remove it.
Differential Revision: D30429115
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: 375f6f9294d27790645eeb7cb5a8e87047a57544
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63197
This eliminates the non-determinism that came from using hash values in sort methods.
Changes in tests are mostly mechanical.
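For illustration only (this is not the actual NNC change, just the general shape of the problem): ordering nodes by a pointer-derived hash can differ from run to run, while ordering by a stable, content-based key is reproducible.
```cpp
#include <algorithm>
#include <functional>
#include <string>
#include <vector>

struct Var {
  std::string name;
};

// Non-deterministic: pointer-based hashes vary between runs, so the resulting
// order (and anything generated from it) is not reproducible.
void sort_by_hash(std::vector<const Var*>& vars) {
  std::sort(vars.begin(), vars.end(), [](const Var* a, const Var* b) {
    return std::hash<const Var*>{}(a) < std::hash<const Var*>{}(b);
  });
}

// Deterministic: a stable key such as the variable's name gives the same
// order on every run.
void sort_by_name(std::vector<const Var*>& vars) {
  std::sort(vars.begin(), vars.end(), [](const Var* a, const Var* b) {
    return a->name < b->name;
  });
}
```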
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D30292776
Pulled By: ZolotukhinM
fbshipit-source-id: 74f57b53c3afc9d4be45715fd74781271373e055
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63195
This helps us to later switch from using KernelArena with raw pointers
to shared pointers without having to change all our source files at
once.
The changes are mechanical and should not affect any functionality.
With this PR, we're changing the following (see the illustrative sketch below):
* `Add*` --> `AddPtr`
* `new Add(...)` --> `alloc<Add>(...)`
* `dynamic_cast<Add*>` --> `to<Add>`
* `static_cast<Add*>` --> `static_to<Add>`
Due to some complications with args forwarding, some places became more
verbose, e.g.:
* `new Block({})` --> `alloc<Block>(std::vector<ExprPtr>())`
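For illustration, here is a minimal sketch of what such aliases and helpers could look like once everything is backed by shared pointers (this is a simplification, not the actual NNC definitions, which at this stage may still wrap raw pointers):
```cpp
#include <memory>
#include <utility>

// Hypothetical stand-ins for the IR node types.
class Expr {
 public:
  virtual ~Expr() = default;
};
class Add : public Expr {};

// Pointer aliases: AddPtr instead of Add*.
using ExprPtr = std::shared_ptr<Expr>;
using AddPtr = std::shared_ptr<Add>;

// alloc<Add>(...) instead of new Add(...).
template <class T, class... Args>
std::shared_ptr<T> alloc(Args&&... args) {
  return std::make_shared<T>(std::forward<Args>(args)...);
}

// to<Add>(e) instead of dynamic_cast<Add*>(e), and
// static_to<Add>(e) instead of static_cast<Add*>(e).
template <class T>
std::shared_ptr<T> to(const ExprPtr& e) {
  return std::dynamic_pointer_cast<T>(e);
}
template <class T>
std::shared_ptr<T> static_to(const ExprPtr& e) {
  return std::static_pointer_cast<T>(e);
}
```
The extra verbosity in the `Block` case comes from perfect forwarding: a braced-init-list like `{}` has no type that `Args&&...` can deduce, so the argument has to be spelled out as `std::vector<ExprPtr>()`.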
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D30292779
Pulled By: ZolotukhinM
fbshipit-source-id: 150301c7d2df56b608b035827b6a9a87f5e2d9e9
Summary:
The GoogleTest `TEST` macro is non-compliant with the `cppcoreguidelines-avoid-non-const-global-variables` clang-tidy check, as is `DEFINE_DISPATCH`, so the per-line `NOLINTNEXTLINE` suppressions for this check are removed (with a corresponding update to `.clang-tidy`).
All changes but the ones to `.clang-tidy` are generated using the following script:
```
for i in `find . -type f -iname "*.c*" -or -iname "*.h"|xargs grep cppcoreguidelines-avoid-non-const-global-variables|cut -f1 -d:|sort|uniq`; do sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" $i; done
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008
Reviewed By: driazati, r-barnes
Differential Revision: D29838584
Pulled By: malfet
fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os


def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files


def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname, "-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit", "--all", "-m", f"NOLINT stubs for {fname}"])


def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)


if __name__ == "__main__":
    main()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892
Reviewed By: H-Huang
Differential Revision: D27991944
Pulled By: malfet
fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55825
The mask has never been used (in vectorization we generate an explicit
`IfThenElse` construct when we need to mask out some elements). The PR
removes it and cleans up all its traces from tests.
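As an illustrative sketch of the masking idea (plain C++, not NNC IR): when a loop is vectorized by a width that does not divide the trip count, the tail lanes can be guarded with an explicit conditional rather than a mask operand attached to every Load/Store.
```cpp
#include <vector>

// Illustration only: each fixed-width chunk guards its lanes with an explicit
// conditional (the moral equivalent of an IfThenElse) instead of carrying a
// mask on the individual loads and stores.
void scale(std::vector<float>& data, float factor) {
  constexpr int kLanes = 4;
  const int n = static_cast<int>(data.size());
  for (int base = 0; base < n; base += kLanes) {
    for (int lane = 0; lane < kLanes; lane++) {
      const int i = base + lane;
      if (i < n) {  // explicit guard for the partial tail chunk
        data[i] = data[i] * factor;
      }
    }
  }
}
```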
Differential Revision: D27717776
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: 41d1feeea4322da75b3999d661801c2a7f82b9db
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48160
We no longer use the custom C++ test infra anyway, so move to pure
gtest.
Fixes #45703
ghstack-source-id: 116977283
Test Plan: `buck test //caffe2/test/cpp/tensorexpr`
Reviewed By: navahgar, nickgg
Differential Revision: D25046618
fbshipit-source-id: da34183d87465f410379048148c28e1623618553
Summary:
This is a rewrite of the Registerizer, supporting scalar replacement in *vastly* more situations. As a refresher, the registerizer does this:
Before:
```
A[0] = 0;
for (int x = 0; x < 10; x++) {
A[0] = (A[0]) + x;
}
```
After:
```
int A_ = 0;
for (int x = 0; x < 10; x++) {
A_ = x + A_;
}
A[0] = A_;
```
Which can greatly reduce the number of accesses to main memory in a kernel. There are cases where doing this gets complicated, and the existing implementation bails out whenever it encounters multiple partial overlaps of the same buffer, or conditional accesses under any circumstances. This makes it much less useful in the presence of complex (i.e. real-world, not example) kernels. This new version should work optimally in almost all cases (I have a few minor follow-ups).
I tested this version extensively, and found quite a few bugs in the original implementation that I'd prefer not to backport fixes for - so I'm in favor of landing this even if we don't immediately see a perf win. I believe the killer app for this kind of optimization is fused reductions, and we haven't enabled many examples of that yet.
It is safe to move two accesses of the same Tensor element to a local scalar Var if between all usages of the element there are no other Loads or Stores that may refer to it. In the comments I refer to this as overlapping the access, or "cutting" the existing AccessInfo. In the case where a candidate for registerization is cut, it may be possible to finalize the access early by writing it back to the Tensor and then creating a new scalar variable after the overlapping access is complete. We will attempt to do this when it saves memory accesses.
There are a few cases that make this more challenging:
- For: Loops multiply the number of real usages of a buffer by the loop extent, but only if we can pull the definition and finalization of the scalar variable out of the loop block. For loops often create accesses that are conditional on a loop var and overlap large ranges of elements.
E.g. Before:
```
A[0] = 2;
for (int x1 = 0; x1 < 10; x1++) {
A[0] = (A[0]) + x1;
}
for (int x2 = 1; x2 < 10; x2++) {
A[x2] = A[x2 - 1];
}
for (int x3 = 0; x3 < 10; x3++) {
A[0] = (A[0]) + x3;
}
```
After:
```
int A_1 = 2;
for (int x1 = 0; x1 < 10; x1++) {
A_1 = A_1 + x1;
}
A[0] = A_1;
for (int x2 = 1; x2 < 10; x2++) {
A[x2] = A[x2 - 1];
}
int A_2 = A[0];
for (int x3 = 0; x3 < 10; x3++) {
A_2 = A_2 + x3;
}
A[0] = A_2;
```
- Cond: Conditions complicate lifting scalars out of internal scopes. Generally we cannot lift an access outside of a conditional scope unless there is already a reference to that same access at the higher scope, since we don't know if the condition was guarding an array access not safe at the higher scope. In the comments I refer to this as the condition "hiding" the access, and the outer access "unhiding" it.
E.g. this example:
```
if (x<5 ? 1 : 0) {
A[x] = (A[x]) + 1;
}
A[x] = (A[x]) + 1;
if (x>5 ? 1 : 0) {
A[x] = (A[x]) + 1;
}
```
The A[x] access can be registerized due to the unconditional access between the two conditions:
```
int A_1 = A[x];
if (x<5 ? 1 : 0) {
A_1 = A_1 + 1;
}
A_1 = A_1 + 1;
if (x>5 ? 1 : 0) {
A_1 = A_1 + 1;
}
A[x] = A_1;
```
But this example has no accesses that can be registerized:
```
if (x<5 ? 1 : 0) {
A[x] = (A[x]) + 1;
}
if (x>5 ? 1 : 0) {
A[x] = (A[x]) + 1;
}
```
- IfThenElse: Same situation as Cond, except since IfThenElse is an Expr rather than a Stmt we cannot insert the scalar definition or finalizer within the conditional scope. Accesses inside an IfThenElse can be safely combined with external accesses but cannot exist completely within.
E.g. in this example `B[x]` cannot be registerized, as there is no safe place to define it.
```
A[x] = IfThenElse(x<3 ? 1 : 0, (B[x]) + (B[x]), B[x]);
```
But the equivalent kernel using Cond can be registerized:
```
if (x<3 ? 1 : 0) {
float B_1 = B[x];
A[x] = B_1 + B_1;
} else {
A[x] = B[x];
}
```
- Let: Accesses dependent on local variables via Let Stmts, or loop vars, cannot be raised outside of the scope of the dependent var.
E.g. no accesses in this example can be registerized:
```
for (int x = 0; x < 10; x++) {
int y = 30;
A[y] = x + (A[y]);
}
```
But they can in this example:
```
int y = 30;
for (int x = 0; x < 10; x++) {
A[y] = x + (A[y]);
}
```
**Testing**
The majority of this PR is tests, over 3k lines of them, because there are many different rules to consider and they can interact together more or less arbitrarily. I'd greatly appreciate any ideas for situations we could encounter that are not covered by the tests.
**Performance**
Still working on it, will update. In many FastRNNs sub-kernels this diff reduces the total number of calls to Store or Load by 4x, but since those kernels use Concat very heavily (meaning a lot of branches) the actual number encountered by any particular thread on GPU is reduced only slightly. Overall perf improved by a very small amount.
Reductions are where this optimization should really shine, and in particular, the more complex the kernel gets (with extra fusions, etc.), the better this version of the registerizer should do compared to the existing version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45574
Reviewed By: albanD
Differential Revision: D24151517
Pulled By: nickgg
fbshipit-source-id: 9f0b2d98cc213eeea3fda16fee3d144d49fd79ae
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45520
With this change `Load`s and `Store`s no longer accept `Placeholder`s in
their constructor and `::make` functions and can only be built with
`Buf`.
`Placeholder` gets its own `store`, `load`, `storeWithMask`, and
`loadWithMask` methods for more convenient construction.
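As a rough sketch of the shape of this API (simplified and hypothetical; the real NNC classes and signatures differ), the idea is that the `Placeholder` owns a `Buf` and builds `Load`/`Store` nodes from it, so callers never hand the `Placeholder` itself to those constructors:
```cpp
#include <memory>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-ins for NNC IR types.
struct Expr {
  virtual ~Expr() = default;
};
struct Stmt {
  virtual ~Stmt() = default;
};
struct Buf {};
using ExprPtr = std::shared_ptr<Expr>;
using StmtPtr = std::shared_ptr<Stmt>;
using BufPtr = std::shared_ptr<Buf>;

struct Load : Expr {
  Load(BufPtr buf, std::vector<ExprPtr> indices)
      : buf(std::move(buf)), indices(std::move(indices)) {}
  BufPtr buf;
  std::vector<ExprPtr> indices;
};
struct Store : Stmt {
  Store(BufPtr buf, std::vector<ExprPtr> indices, ExprPtr value)
      : buf(std::move(buf)), indices(std::move(indices)), value(std::move(value)) {}
  BufPtr buf;
  std::vector<ExprPtr> indices;
  ExprPtr value;
};

// The Placeholder keeps the Buf and forwards construction to it, so Load and
// Store themselves only ever see a Buf.
class Placeholder {
 public:
  explicit Placeholder(BufPtr buf) : buf_(std::move(buf)) {}

  ExprPtr load(std::vector<ExprPtr> indices) const {
    return std::make_shared<Load>(buf_, std::move(indices));
  }
  StmtPtr store(std::vector<ExprPtr> indices, ExprPtr value) const {
    return std::make_shared<Store>(buf_, std::move(indices), std::move(value));
  }

 private:
  BufPtr buf_;
};
```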
Test Plan: Imported from OSS
Reviewed By: glaringlee
Differential Revision: D23998789
Pulled By: ZolotukhinM
fbshipit-source-id: 3fe018e00c1529a563553b2b215f403b34aea912
Summary:
Fixes a bug in the NNC registerizer for Cuda where it would hoist reads out of a conditional context when trying to cache them. As a quick fix, prevent scalar replacement if a usage is within a condition.
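A hypothetical C++ analogue of the hazard (not the NNC IR itself): hoisting a guarded read above its condition executes the read even when the guard is what made it safe.
```cpp
#include <vector>

// Buggy "registerized" form: the read of in[idx] has been hoisted above the
// guard, so it executes even when flag is false and idx may be out of range.
float hoisted(const std::vector<float>& in, int idx, bool flag) {
  float cached = in[idx];  // unconditional read -- unsafe
  float out = 0.0f;
  if (flag) {
    out = cached;
  }
  return out;
}

// Safe form (what the fix preserves): the read stays inside the condition,
// so scalar replacement is simply skipped for conditional usages.
float guarded(const std::vector<float>& in, int idx, bool flag) {
  float out = 0.0f;
  if (flag) {
    out = in[idx];
  }
  return out;
}
```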
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44223
Reviewed By: gchanan
Differential Revision: D23551247
Pulled By: nickgg
fbshipit-source-id: 17a7bf2be4c8c3dd8a9ab7997dce9aea200c3685
Summary:
Insert the registerizer into the Cuda Codegen pass list, to enable scalar replacement and close the gap in simple reduction performance.
First up the good stuff, benchmark before:
```
Column sum     Caffe2    NNC       Simple    Better
(10, 100)      5.7917    9.7037    6.9386    6.0448
(100, 100)     5.9338    14.972    7.1139    6.3254
(100, 10000)   21.453    741.54    145.74    12.555
(1000, 1000)   8.0678    122.75    22.833    9.0778

Row sum        Caffe2    NNC       Simple    Better
(10, 100)      5.4502    7.9661    6.1469    5.5587
(100, 100)     5.7613    13.897    21.49     5.5808
(100, 10000)   21.702    82.398    75.462    22.793
(1000, 1000)   22.527    129       176.51    22.517
```
After:
```
Column sum     Caffe2    NNC       Simple    Better
(10, 100)      6.0458    9.4966    7.1094    6.056
(100, 100)     5.9299    9.1482    7.1693    6.593
(100, 10000)   21.739    121.97    162.63    14.376
(1000, 1000)   9.2374    29.01     26.883    10.127

Row sum        Caffe2    NNC       Simple    Better
(10, 100)      5.9773    8.1792    7.2307    5.8941
(100, 100)     6.1456    9.3155    24.563    5.8163
(100, 10000)   25.384    30.212    88.531    27.185
(1000, 1000)   26.517    32.702    209.31    26.537
```
Speedup is about 3-8x depending on the size of the data (increasing with bigger inputs).
The gap between NNC and Simple is closed or eliminated - the remaining issue appears to be kernel launch overhead. Next up is getting us closer to the _Better_ kernel.
It required a lot of refactoring and bug fixes along the way:
* Refactored flattening of parallelized loops out of the CudaPrinter and into its own stage, so we can transform the graph in the stage between flattening and printing (where registerization occurs).
* Made the AtomicAddFuser less pessimistic: it now recognizes that if an Add to a buffer is dependent on all used Block and Thread vars then it has no overlap and does not need to be atomic. This allows registerization to apply to these stores (see the sketch after this list).
* Fixed the PrioritizeLoad mutator so that it does not attempt to separate a Store and Load to the same buffer (i.e. the reduction case).
* Moved CudaAnalysis earlier in the process, allowing later stages to use the analyzed bufs.
* Fixed a bug in the Registerizer where, when adding a default initializer statement, it would use the dtype of the underlying var (which is always kHandle) instead of the dtype of the Buf.
* Fixed a bug in the IRMutator where the logic for Allocate statements was inverted, so they were replaced only if they did not change.
* Added simplification of simple Division patterns to the IRSimplifier.
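To illustrate the AtomicAddFuser change referenced in the list above (a simplified, hypothetical helper rather than the actual pass): if every Block/Thread variable the kernel is parallelized over appears in the store's index, each thread writes a distinct element, so a plain add is safe; otherwise the atomicAdd must stay.
```cpp
#include <set>
#include <string>

// Hypothetical sketch: decide whether an accumulation into buf[index] needs
// atomicAdd. `index_vars` holds the Block/Thread variables appearing in the
// store's index expression; `parallel_vars` holds all Block/Thread variables
// the kernel is bound to.
bool needs_atomic_add(const std::set<std::string>& index_vars,
                      const std::set<std::string>& parallel_vars) {
  for (const std::string& v : parallel_vars) {
    if (index_vars.count(v) == 0) {
      // Some threads along `v` write the same element: keep the atomicAdd.
      return true;
    }
  }
  // Every parallel variable appears in the index, so each thread owns a
  // distinct element and a plain add suffices (and can be registerized).
  return false;
}
```
The real pass operates on IR expressions rather than name sets, but the overlap condition it checks is the one described above.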
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42878
Reviewed By: glaringlee
Differential Revision: D23382499
Pulled By: nickgg
fbshipit-source-id: 3640a98fd843723abad9f54e67070d48c96fe949
Summary:
Adds a new optimization pass, the Registerizer, which looks for common Stores and Loads to a single item in a buffer and replaces them with a local temporary scalar which is cheaper to write.
For example it can replace:
```
A[0] = 0;
for (int x = 0; x < 10; x++) {
A[0] = (A[0]) + x;
}
```
with:
```
int A_ = 0;
for (int x = 0; x < 10; x++) {
A_ = x + A_;
}
A[0] = A_;
```
This is particularly useful on GPUs when parallelizing, since after replacing loops with metavars we have a lot of accesses like this. Early tests of simple reductions on a V100 indicate this can speed them up by ~5x.
This diff got a bit unwieldy with the integration code, so that will come in a follow-up.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42606
Reviewed By: bertmaher
Differential Revision: D22970969
Pulled By: nickgg
fbshipit-source-id: 831fd213f486968624b9a4899a331ea9aeb40180