Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70464
Add handling of strided input tensors to dynamic fusion. This is done with the same set of input striding specializations as https://github.com/pytorch/pytorch/pull/60684/:
```
S_ONE,       // STRIDE_ONE: packed
S_CONT,      // STRIDE_CONTIGUOUS: stride[i + 1] * sizes[i + 1]
S_TRAN_CONT, // STRIDE_TRANSPOSED_CONTIGUOUS: stride[i - 1] * sizes[i - 1]
S_AS_ARG,    // STRIDE_AS_ARG: stride passed in as runtime value
```
and two additional specializations for a) a contiguous tensor and b) a channels-last tensor. Channels-last is a common case and we should optimize for it. Additionally, tensors natively store whether they are contiguous/channels-last contiguous, which makes it faster to check whether tensors follow this pattern.
Output striding will be done in a follow-up.
The striding is stored on both the TensorExprGroup node and on the guard node. The striding descriptors are stored as a vector of strings on the node for debuggability and to take advantage of the existing support for storing IValues as attributes on nodes.
As an example:
```
%8 : Double(10, 11, 12, 13, strides=[1716, 1, 143, 11], requires_grad=0, device=cpu) = prim::TensorExprGroup_0[symbolic_shape_inputs=[-37, -36, -35, -34], striding_inputs_desc=[["TENSOR_CONT_CHANNELS_LAST"]]](%x, %24, %23, %22, %21)
```
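To make the specialization scheme concrete, here is a minimal sketch of how a guard might pick a striding descriptor for an input at runtime; `describeStriding` is a hypothetical helper for illustration, not the actual TensorExpr code:
```cpp
#include <ATen/ATen.h>
#include <string>
#include <vector>

// Hypothetical helper: choose the cheapest matching descriptor for an input.
// The contiguity flags are cached on the tensor, so the two whole-tensor
// specializations are O(1) checks before any per-dimension work.
std::vector<std::string> describeStriding(const at::Tensor& t) {
  if (t.is_contiguous()) {
    return {"TENSOR_CONT"};
  }
  if (t.is_contiguous(at::MemoryFormat::ChannelsLast)) {
    return {"TENSOR_CONT_CHANNELS_LAST"};
  }
  // Fall back to a per-dimension descriptor.
  std::vector<std::string> desc;
  for (int64_t i = 0; i < t.dim(); ++i) {
    if (t.stride(i) == 1) {
      desc.push_back("STRIDE_ONE");
    } else {
      desc.push_back("STRIDE_AS_ARG"); // pass the stride as a runtime value
    }
  }
  return desc;
}
```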
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D33458649
Pulled By: eellison
fbshipit-source-id: c42616d3c683d70f6258180d23d3841a31a6030d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70871
We had previously handled reusing memory on the optimized kernel execution path, but had not yet handled it when we hit the unoptimized fallback.
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D33458652
Pulled By: eellison
fbshipit-source-id: 4eb62181ed02c95813a99638f5e2d0f9347b5c08
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69421
I've hit a lot of build issues in D32671972, and I've come to realize that a lot of it boils down to header hygiene. `function.h` includes `profiler.h` *solely* to transitively include `record_function.h`, which winds up leaking the profiler symbols. Moreover, several files are relying on transitive includes to get access to `getTime`. As long as I have to touch all the places that use `getTime`, I may as well also move them to the new namespace.
Test Plan: Unit tests and CI.
Reviewed By: aaronenyeshi, albanD
Differential Revision: D32865907
fbshipit-source-id: f87d6fd5afb784dca2146436e72c69e34623020e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68410
First step toward not heap-allocating a string in RecordFunction::before() every time
ghstack-source-id: 144287654
Test Plan: CI
Reviewed By: chaekit
Differential Revision: D32453847
fbshipit-source-id: 080d95095fb568287b65fcc41a4ca6929b5f9a87
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67802
In RPC C++ code, we might sometimes call constValue() when the future actually has an exception, and in unittests we want to assert on the exception. What happens is that we get a message basically saying "!eptr_", which indicates there is some exception but we don't know what it is.
This diff simply adds logging for the exception and mentions that `value` should be used instead of `constValue` when the future can have an exception. The contract of `constValue` to throw when `eptr_` is set is still held; it is just enhanced with additional logging.
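Schematically, the enhanced check looks something like this (a sketch, not the verbatim code):
```cpp
// Inside c10::ivalue::Future::constValue() (schematic). The assertion still
// fires when an error is set, but the message now includes the stored
// exception instead of a bare "!eptr_".
TORCH_INTERNAL_ASSERT(
    !eptr_,
    "Future contains an error; use value() instead of constValue() when the "
    "future can hold an exception. Error: ",
    tryRetrieveErrorMessage());
```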
ghstack-source-id: 142375391
Test Plan: Added UT
Reviewed By: mrshenli
Differential Revision: D32156552
fbshipit-source-id: 4dd5e73b92173209074c104a4b75c2021e20de4b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65968
tryToGraphFunction() should cover all cases and is more composable than
ad-hoc virtual methods.
ghstack-source-id: 141759214
Test Plan: no behavior change.
Reviewed By: gmagogsfm
Differential Revision: D31326154
fbshipit-source-id: 692a35df424f7d4f777a96489c4cbb24b3ae7807
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65967
Graph is an implementation detail. If a user wants access to the
underlying graph, they should be able to explicitly dynamic-cast instead.
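For example, a call site that genuinely needs the graph would look roughly like this (a sketch):
```cpp
#include <torch/csrc/jit/api/function_impl.h>

void inspect(torch::jit::Function& fn) {
  // Only GraphFunction has a graph; other Function implementations don't.
  if (auto* graphFn = dynamic_cast<torch::jit::GraphFunction*>(&fn)) {
    std::shared_ptr<torch::jit::Graph> graph = graphFn->graph();
    // ... inspect or transform the graph ...
  }
}
```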
ghstack-source-id: 141659819
Test Plan: no behavior change.
Reviewed By: gmagogsfm
Differential Revision: D31326153
fbshipit-source-id: a0e984f57c6013494b92a7095bf5bb660035eb84
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65345
FooType::get() can return a const reference. Inconveniently, converting shared_ptr<FooType> to shared_ptr<Type> requires a copy & refcount bump, so to properly take advantage of this in unshapedType() we need to take a const Type& in isSubtypeOf(), which is good practice anyway -- don't require a shared_ptr if you don't need to take ownership.
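Schematically, the principle behind the signature change (not the exact declarations):
```cpp
#include <memory>

struct Type {
  virtual ~Type() = default;

  // Good: borrows the argument. Callers holding a const FooType& (e.g. from
  // FooType::get() returning a const reference) pass it directly, with no
  // shared_ptr construction and no atomic refcount bump.
  bool isSubtypeOf(const Type& other) const;

  // Bad: forces every caller with a shared_ptr<FooType> to materialize a
  // shared_ptr<Type> first, i.e. a copy plus a refcount bump per call.
  bool isSubtypeOf(std::shared_ptr<Type> other) const;
};
```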
ghstack-source-id: 140044165
Test Plan:
CI
perf says c10::unshapedType time decreased from 2.8% to 2.2% during static runtime startup, though I expect this to be generally beneficial.
Reviewed By: hlu1
Differential Revision: D31027361
fbshipit-source-id: 676feb81db9f74ad7b8651d8774f4ecb4cfa6ab8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64824
See comment in function_schema.h for explanation. I claim that this is a good tradeoff because the aliasing information seems to be used only in compiler-ish code paths, where performance isn't as critical as actual execution. If performance is important there too, perhaps we should hoist isWrite into the Argument itself since there are several paths that only care about isWrite.
ghstack-source-id: 138958896
Test Plan: CI; profile schema parsing on startup and see far fewer page faults in createArgumentVector.
Reviewed By: suo
Differential Revision: D30860719
fbshipit-source-id: 1d4d2328f2b8e34f5ddf9d82083fd4dd7b7f738f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63414
Misuse of a raw pointer here, where stack is never nullable.
ghstack-source-id: 136938318
Test Plan:
compiles.
Imported from OSS
Reviewed By: ejguan
Differential Revision: D30375410
fbshipit-source-id: 9d65b620bb76d90d886c800f54308520095d58ee
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62755
After this change, whether an argument is an out argument can be checked by calling is_out().
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D30415256
Pulled By: tugsbayasgalan
fbshipit-source-id: b2e1fa46bab7c813aaede1f44149081ef2df566d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62419
This diff adds support for a CPU-only Kineto profiler on mobile, thus
enabling chrome trace generation on mobile. This brings the C++ API for
mobile profiling on par with TorchScript.
This is done via:
1. Utilizing debug handle annotations in KinetoEvent.
2. Adding post-processing capability, via callbacks, to
KinetoThreadLocalState.
3. Creating a new RAII-style profiler, KinetoEdgeCPUProfiler, which can be
used in the surrounding scope of model execution. This will write the chrome
trace to the location specified in the profiler constructor; a usage sketch follows.
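A minimal usage sketch, assuming the constructor takes the module and an output path (see profiler_edge.h for the full set of options):
```cpp
#include <torch/csrc/jit/mobile/module.h>
#include <torch/csrc/jit/mobile/profiler_edge.h>

void runProfiled(torch::jit::mobile::Module& module,
                 std::vector<c10::IValue> inputs) {
  {
    // RAII: profiling covers everything in this scope.
    torch::jit::mobile::KinetoEdgeCPUProfiler profiler(
        module, "/tmp/trace.json");
    module.forward(inputs);
  } // chrome trace is written out when the profiler is destroyed
}
```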
Test Plan:
MobileProfiler.ModuleHierarchy
Imported from OSS
Reviewed By: raziel
Differential Revision: D29993660
fbshipit-source-id: 0b44f52f9e9c5f5aff81ebbd9273c254c3c03299
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62417
This diff adds an option to make enableProfiler enable callbacks only
for certain RecordScopes.
Why?
Profiling has some overhead when we repeatedly execute callbacks for
all scopes. On the mobile side, where we often have small quantized models,
this overhead can be large. We observed that by profiling only the top-level
op, and skipping profiling of the other ATen ops called within it, we can
limit this overhead: for example, instead of profiling at::conv2d ->
at::convolution -> at::convolution_ (and, if ops like transpose etc. are
called, those too), we skip profiling of everything below the top level.
Of course this limits visibility, but at least this way we get a choice.
A sketch of the intended usage follows.
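A sketch of what a restricted invocation might look like; the namespaces and exact signatures here are assumptions and may differ across releases:
```cpp
#include <torch/csrc/autograd/profiler_kineto.h>

void profileTopLevelOpsOnly() {
  using namespace torch::autograd::profiler;
  // Only RecordFunctions created with the LITE_INTERPRETER scope fire
  // profiler callbacks; nested ATen ops are skipped entirely.
  enableProfiler(
      ProfilerConfig(ProfilerState::KINETO),
      /*activities=*/{ActivityType::CPU},
      /*scopes=*/{at::RecordScope::LITE_INTERPRETER});
}
```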
Test Plan: Imported from OSS
Reviewed By: ilia-cher
Differential Revision: D29993659
fbshipit-source-id: 852d3ae7822f0d94dc6e507bd4019b60d488ef69
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62228
This diff adds debug handles to events and provides a way to use
RECORD_FUNCTIONs that will pass debug_handles down to the profiler, which
will record them in the events.
Why add debug_handles?
For PyTorch mobile, with the lite interpreter, we generate debug handles
that can be used to lazily symbolicate exception traces to a model-level
stack trace, similar to the model-level stack trace you get in
TorchScript models. The debug handles also enable getting the module
hierarchy for a lite interpreter model, support for which was added to
KinetoProfiler in previous diffs.
Follow-up plan:
1. Enable scoped callbacks such that the lite interpreter can use them to
profile only top-level ops.
2. Enable post-processing callbacks that take KinetoEvents and populate
module hierarchy using debug handles.
This will let us use KinetoProfiler for lite interpreter use cases on
mobile. The aim is to use an RAII guard to similarly generate chrome traces
for mobile use cases as well, although only for top-level ops. A sketch of
the RECORD_FUNCTION usage follows.
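For instance, a RECORD_FUNCTION variant carrying a debug handle might be used like this; the macro name is per this stack and should be treated as illustrative:
```cpp
#include <ATen/record_function.h>
#include <vector>

void runInstruction(int64_t debug_handle, std::vector<c10::IValue>& inputs) {
  // The debug handle rides along in the RecordFunction and ends up on the
  // KinetoEvent, where post-processing can map it back to module hierarchy.
  RECORD_EDGE_SCOPE_WITH_DEBUG_HANDLE_AND_INPUTS(
      "aten::conv2d", debug_handle, inputs);
  // ... execute the op ...
}
```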
Test Plan:
test_misc : RecordDebugHandles.Basic
Imported from OSS
Reviewed By: ilia-cher
Differential Revision: D29935899
fbshipit-source-id: 4f06dc411b6b5fe0ffaebdd26d3274c96f8f389b
Summary:
Add `-Wno-writable-strings` (which is clang's flavor of `-Wwrite-strings`) to the list of warnings ignored while compiling torch_python.
Avoid unnecessary copies in range loops.
Fix a number of signed-unsigned comparisons.
Found while building locally on M1.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62930
Reviewed By: albanD
Differential Revision: D30171981
Pulled By: malfet
fbshipit-source-id: 25bd43dab5675f927ca707e32737ed178b04651e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61895
* Add FLOP count for addmm; it should be `2*m*n*k`.
* Share the same code path for `addmm` and `mm`.
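For scale: for an (m × k) by (k × n) product, that is m·n·k multiplies plus m·n·k adds, so m = n = k = 1024 comes to 2 · 1024³ ≈ 2.15 GFLOPs.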
Test Plan:
Imported from OSS
`python test/test_profiler.py`
Run a sample profile and check that FLOPS for `aten::addmm` is correct.
`[chowar@devbig053.frc2 ~/local/pytorch/build] ninja bin/test_jit`
`[chowar@devbig053.frc2 ~/local/pytorch/build] ./bin/test_jit --gtest_filter='ComputeFlopsTest*'`
Reviewed By: dskhudia
Differential Revision: D29785671
fbshipit-source-id: d1512036202d7234a981bda897af1f75808ccbfe
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61791
During inlining we attach InlinedCallStack to the nodes being inlined. In
the process we attach module information as well, such that if a
CallMethod is being inlined we know which class instance and class type
the method belongs to. However, a CallMethod can be calling a method of
the same object to which the graph belongs, e.g.:
```
def forward(self, input):
    x = input + 10
    return self.forward_impl_(x, input)
```
Here forward_impl_ is a method defined on the same class in which forward
is defined. The existing module hierarchy annotation will mislabel this as
an unknown instance, since the method is not associated with the output of
a GetAttr node (it would be if we had called self.conv.forward_impl_, for
example).
The change in this PR reconciles this by creating a placeholder name "SELF"
for the module instance, indicating that you can traverse the
InlinedCallStack backwards to find the first node with name != SELF, which
would be the name of the object.
e.g.:
TOP(ResNet)::forward.SELF(ResNet)::_forward_impl.layer1(Sequential)::forward.0(BasicBlock)::forward.conv1(Conv2d)::forward.SELF(Conv2d)::_conv_forward
Test Plan:
Add test
Imported from OSS
Reviewed By: larryliu0820
Differential Revision: D29745443
fbshipit-source-id: 1525e41df53913341c4c36a56772454782a0ba93
Summary:
The GoogleTest `TEST` macro, as well as `DEFINE_DISPATCH`, is non-compliant with the `cppcoreguidelines-avoid-non-const-global-variables` clang-tidy check.
All changes but the ones to `.clang-tidy` are generated using the following script:
```
for i in `find . -type f -iname "*.c*" -or -iname "*.h"|xargs grep cppcoreguidelines-avoid-non-const-global-variables|cut -f1 -d:|sort|uniq`; do sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" $i; done
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008
Reviewed By: driazati, r-barnes
Differential Revision: D29838584
Pulled By: malfet
fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56504
Having callbacks registered but disabled via their
`shouldRun` callback defeats the `shouldRunRecordFunction`
optimization (no relation between the two things, despite the
shared prefix on the names) that aims to skip `RecordFunction`
construction.
This diff attempts to safely rectify this issue: we drop support for
`shouldRun` callbacks (this is bc-breaking; does anything use these
externally? do I need to add the support back and just stop using it
internally?), add support for enabling and disabling callbacks, and
(for global callbacks) make doing so thread-safe.
There is an interesting subtlety with `std::atomic` that came up: it
is neither copyable nor movable, which precludes putting it into
`std::vector`. I manually overrode this because the thread safety
reasons it is neither copyable nor movable don't apply here; we
already state that adding or removing callbacks (the operations that
might copy/move an atomic) are not thread-safe and should be done at
initialization time.
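The override follows the usual copyable-atomic-wrapper pattern; a minimal sketch, assuming copies only ever happen during single-threaded (un)registration:
```cpp
#include <atomic>
#include <vector>

// std::atomic<bool> is neither copyable nor movable, so it can't live in a
// std::vector directly. Since callbacks are only added/removed at
// initialization time (documented as not thread-safe), a copy that simply
// reloads the current value is safe.
struct CopyableEnabledFlag {
  std::atomic<bool> enabled{true};
  CopyableEnabledFlag() = default;
  CopyableEnabledFlag(const CopyableEnabledFlag& other)
      : enabled(other.enabled.load(std::memory_order_relaxed)) {}
  CopyableEnabledFlag& operator=(const CopyableEnabledFlag& other) {
    enabled.store(
        other.enabled.load(std::memory_order_relaxed),
        std::memory_order_relaxed);
    return *this;
  }
};

std::vector<CopyableEnabledFlag> callbackFlags; // now this compiles
```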
ghstack-source-id: 129614296
Test Plan:
Existing CI should cover correctness, right? Inspected
perf report of a simple benchmark that runs nn.Linear in a loop on
CUDA, where we internally have Kineto initialized and thus had a
shouldRun observer previously; we are no longer going through the
dispatcher's slow RecordFunction path or spending measurable time
constructing RecordFunction instances.
Reviewed By: ilia-cher
Differential Revision: D27834944
fbshipit-source-id: 93db1bc0a28b5372f7307490c908457e7853fa92
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57635
Note: this PR looks massive, but it's just one simple change, codemodded many times.
In many cases, a callback needs to access the value/error produced by the parent future. In Python this was easy because the callback was invoked with the parent future as an argument, and could thus inspect it. In C++ the callbacks didn't take any arguments, thus in many cases we worked around this by capturing the future in its own callback. This is risky (it leads to a reference cycle and thus a memory leak) and must be done carefully (spoiler: sometimes we weren't careful). With this change, C++ callbacks receive the completed future as an argument, as in Python.
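The resulting pattern, sketched against the c10::ivalue::Future API:
```cpp
#include <ATen/core/ivalue_inl.h>

void chain(const c10::intrusive_ptr<c10::ivalue::Future>& fut) {
  // Before: capturing fut inside its own callback created a reference cycle:
  //   fut->addCallback([fut]() { use(fut->value()); });  // leak-prone
  // After: the callback receives the completed future as an argument.
  fut->addCallback([](c10::ivalue::Future& completedFut) {
    if (completedFut.hasError()) {
      // inspect completedFut.exception_ptr() ...
    } else {
      // consume completedFut.value() ...
    }
  });
}
```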
ghstack-source-id: 128296580
Test Plan: CI
Reviewed By: wanchaol
Differential Revision: D28178783
fbshipit-source-id: 6de02c4568be42123372edc008f630d5ddae0081
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57634
`wrapPropagateTLSState` was restricting its argument to be an argument-less function, and I need to relax this for later work.
Also, it was requiring its argument to be converted to `std::function`, and also returned a `std::function`. Each creation of a `std::function` could cause a heap allocation. It's not particularly expensive, but here we can easily avoid it by having `wrapPropagateTLSState` directly operate on generic callables (thus, possibly, raw lambdas).
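The generalized wrapper then looks roughly like this; a sketch (a template over any callable, forwarding arguments, with no std::function anywhere), not the verbatim implementation:
```cpp
#include <ATen/ThreadLocalState.h>
#include <utility>

template <typename F>
auto wrapPropagateTLSState(F&& fn) {
  // Snapshot the thread-local state at wrap time; restore it around each
  // invocation. Returning the raw lambda avoids the heap allocation that
  // converting to std::function could incur.
  return [tls = at::ThreadLocalState(), fn = std::forward<F>(fn)](
             auto&&... args) mutable {
    at::ThreadLocalStateGuard guard(tls);
    return fn(std::forward<decltype(args)>(args)...);
  };
}
```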
ghstack-source-id: 128295264
Test Plan: CI
Reviewed By: ilia-cher
Differential Revision: D28178782
fbshipit-source-id: d657f5751514974518606dd4fc4175e805dcb90a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56562
Earlier, inlined callstacks were annotated only on nodes. This left out
nodes such as If, which have blocks of nodes. These nodes should also be
updated similarly.
Test Plan:
Added test in test_misc
Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D27902516
fbshipit-source-id: 4e65c686fa6b4977e8719db45f71f7d2599d4d8e
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os


def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files


def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname, "-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit", "--all", "-m", f"NOLINT stubs for {fname}"])


def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)


if __name__ == "__main__":
    main()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892
Reviewed By: H-Huang
Differential Revision: D27991944
Pulled By: malfet
fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55012
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54442
Added needsOutputs support to RecordFunction and improved ObserverUtil functions to handle list data, with minor renaming for consistency.
To get output data from kernel calls, we need to temporarily capture the outputs before passing them to the record function; the results are then released to the function return. We handle two cases: unboxed and boxed kernels. The boxed version is fairly simple since all outputs are stored in the stack object. For unboxed kernel calls, we added a `ReturnValue` utility class to properly handle the different return values of unboxed kernels.
As an optimization, this intermediate capture is only enabled for observers that request `needsOutputs(true)`, and should not affect other observers or the case when the observer is not enabled.
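For the boxed case, the flow is roughly the following sketch; `needsOutputs()` and `setOutputs()` are hypothetical names standing in for the capture added here, not necessarily the exact API:
```cpp
#include <ATen/record_function.h>
#include <vector>

void callBoxedWithObservers(
    void (*boxedKernel)(std::vector<c10::IValue>*),
    std::vector<c10::IValue>& stack) {
  at::RecordFunction guard(at::RecordScope::FUNCTION);
  // The boxed kernel leaves all of its outputs on the stack...
  boxedKernel(&stack);
  // ...so capturing them is cheap: hand the stack to observers that opted in
  // via needsOutputs(true); skipped entirely otherwise.
  // if (guard.isActive() && guard.needsOutputs()) {  // hypothetical
  //   guard.setOutputs(stack);                       // hypothetical
  // }
}
```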
Test Plan:
```
=> buck build //caffe2/test/cpp/jit: --show-output
=> buck-out/gen/caffe2/test/cpp/jit/jit --gtest_filter=RecordFunctionTest*
CUDA not available. Disabling CUDA and MultiCUDA tests
Note: Google Test filter = RecordFunctionTest*-*_CUDA:*_MultiCUDA
[==========] Running 7 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 7 tests from RecordFunctionTest
[ RUN ] RecordFunctionTest.TracedTestInputsOutputs
[ OK ] RecordFunctionTest.TracedTestInputsOutputs (226 ms)
[ RUN ] RecordFunctionTest.SampledCallbacks
[ OK ] RecordFunctionTest.SampledCallbacks (771 ms)
[ RUN ] RecordFunctionTest.RecordFunctionGuard
[ OK ] RecordFunctionTest.RecordFunctionGuard (0 ms)
[ RUN ] RecordFunctionTest.Callbacks
[ OK ] RecordFunctionTest.Callbacks (2 ms)
[ RUN ] RecordFunctionTest.ShouldRun
[ OK ] RecordFunctionTest.ShouldRun (0 ms)
[ RUN ] RecordFunctionTest.Basic
[ OK ] RecordFunctionTest.Basic (1 ms)
[ RUN ] RecordFunctionTest.OperatorNameOverload
[ OK ] RecordFunctionTest.OperatorNameOverload (1 ms)
[----------] 7 tests from RecordFunctionTest (1001 ms total)
[----------] Global test environment tear-down
[==========] 7 tests from 1 test case ran. (1002 ms total)
[ PASSED ] 7 tests.
```
Reviewed By: ilia-cher
Differential Revision: D27449877
fbshipit-source-id: 69918b729565f5899471d9db42a587f9af52238d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54442
Added needsOutputs support to RecordFunction and improved ObserverUtil functions to handle list data, with minor renaming for consistency.
To get output data from kernel calls, we need to temporarily capture the outputs before passing them to the record function; the results are then released to the function return. We handle two cases: unboxed and boxed kernels. The boxed version is fairly simple since all outputs are stored in the stack object. For unboxed kernel calls, we added a `ReturnValue` utility class to properly handle the different return values of unboxed kernels.
As an optimization, this intermediate capture is only enabled for observers that request `needsOutputs(true)`, and should not affect other observers or the case when the observer is not enabled.
Test Plan:
```
=> buck build //caffe2/test/cpp/jit: --show-output
=> buck-out/gen/caffe2/test/cpp/jit/jit --gtest_filter=RecordFunctionTest*
CUDA not available. Disabling CUDA and MultiCUDA tests
Note: Google Test filter = RecordFunctionTest*-*_CUDA:*_MultiCUDA
[==========] Running 7 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 7 tests from RecordFunctionTest
[ RUN ] RecordFunctionTest.TracedTestInputsOutputs
[ OK ] RecordFunctionTest.TracedTestInputsOutputs (226 ms)
[ RUN ] RecordFunctionTest.SampledCallbacks
[ OK ] RecordFunctionTest.SampledCallbacks (771 ms)
[ RUN ] RecordFunctionTest.RecordFunctionGuard
[ OK ] RecordFunctionTest.RecordFunctionGuard (0 ms)
[ RUN ] RecordFunctionTest.Callbacks
[ OK ] RecordFunctionTest.Callbacks (2 ms)
[ RUN ] RecordFunctionTest.ShouldRun
[ OK ] RecordFunctionTest.ShouldRun (0 ms)
[ RUN ] RecordFunctionTest.Basic
[ OK ] RecordFunctionTest.Basic (1 ms)
[ RUN ] RecordFunctionTest.OperatorNameOverload
[ OK ] RecordFunctionTest.OperatorNameOverload (1 ms)
[----------] 7 tests from RecordFunctionTest (1001 ms total)
[----------] Global test environment tear-down
[==========] 7 tests from 1 test case ran. (1002 ms total)
[ PASSED ] 7 tests.
```
Reviewed By: ilia-cher
Differential Revision: D25966661
fbshipit-source-id: 707886e1f212f40ba16a1fe292ea7dd33f2646e3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53953
torch.futures.wait_all would wait for all specified futures to
complete before it returned. As a result, if there was an error it would
still wait for a long time (e.g., long-running RPCs) before it returned that
error to the user.
This PR ensures `wait_all` returns an error as soon as any future runs into
an error, and doesn't wait for all futures to complete.
I removed the logic in `_invoke_rpc_python_udf` which raised an error in the
unwrap function, because ideally the error should be set on the Future and
not be raised to the user only when `wait()` is called. For example, in the
case of `wait_all`, the user never calls `wait()` on the future that errored
out, but on a future down the chain, and we should propagate these errors
via `setError` instead.
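The core of the fix, sketched in terms of the C++ ivalue::Future API:
```cpp
#include <ATen/core/ivalue_inl.h>
#include <stdexcept>

// Instead of raising inside an unwrap function (which the user only sees
// when calling wait()), record the error on the future itself so that
// downstream futures, like the one wait_all blocks on, observe it
// immediately via setError.
void failFuture(const c10::intrusive_ptr<c10::ivalue::Future>& fut) {
  fut->setError(std::make_exception_ptr(std::runtime_error("RPC failed")));
}
```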
ghstack-source-id: 124721216
Test Plan:
1) Unit test added.
2) waitforbuildbot
Reviewed By: mrshenli
Differential Revision: D27032362
fbshipit-source-id: c719e2277c27ff3d45f1511d5dc6f1f71a03e3a8
Summary:
Improve the FLOPs computation formula of the aten::conv2d operator to support the stride, padding, dilation, and groups arguments.
This diff also fixes the following issues:
- Apply a factor of 2 to aten::mm because the output accounts for both multiplication and addition.
- Fix the incorrect names of the scalar operators to aten::mul and aten::add.
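With those arguments accounted for, the resulting cost model should match the standard direct-convolution formula (the factor of 2 counting one multiply plus one add per MAC):
```
FLOPs(conv2d) = 2 * N * C_out * H_out * W_out * (C_in / groups) * kH * kW
```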
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51377
Test Plan:
```python
python test/test_profiler.py
```
Reviewed By: jspark1105
Differential Revision: D26165223
Pulled By: xuzhao9
fbshipit-source-id: 2c5f0155c47af2e6a19332fd6ed73ace47fa072a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50228
`fastmod -m 'expect(<((at|c10)::)?\w+Type>\(\)\s*)->' 'expectRef${1}.'`
Presuming it builds, this is a safe change: the result of `expect()`
wasn't being saved anywhere, so we didn't need it, and we can take a
reference instead of a new `shared_ptr`.
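The effect at a typical call site (schematic):
```cpp
// Before: expect<TensorType>() minted a fresh shared_ptr (refcount bump)
// whose only use was one method call before being discarded:
//   n->output()->type()->expect<TensorType>()->scalarType();
// After: expectRef<TensorType>() borrows a reference instead:
//   n->output()->type()->expectRef<TensorType>().scalarType();
```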
ghstack-source-id: 119782961
Test Plan: CI
Reviewed By: SplitInfinity
Differential Revision: D25837374
fbshipit-source-id: 86757b70b1520e3dbaa141001e7976400cdd3b08
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49408
Nearly every non-test callsite doesn't need to capture any variables anyway, and this saves 48 bytes per callback.
ghstack-source-id: 118665808
Test Plan:
Wait for GitHub CI since we had C++14-specific issues with
this one in previous PR https://github.com/pytorch/pytorch/pull/48629
Reviewed By: malfet
Differential Revision: D25563207
fbshipit-source-id: 6a2831205917d465f8248ca37429ba2428d5626d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48629
Nearly every non-test callsite doesn't need to capture any variables anyway, and this saves 48 bytes per callback.
ghstack-source-id: 118568240
Test Plan: CI
Reviewed By: dhruvbird
Differential Revision: D25135415
fbshipit-source-id: 5e92dc79da6473ed15d1e381a21ed315879168f3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48620
In preparation for storing a bare function pointer (8 bytes)
instead of a std::function (32 bytes).
ghstack-source-id: 118568242
Test Plan: CI
Reviewed By: ezyang
Differential Revision: D25132183
fbshipit-source-id: 3790cfb5d98479a46cf665b14eb0041a872c13da
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48274
The std::function-ness of it was used only for tests. (std::function is huge at 32 bytes, and not particularly efficient.)
ghstack-source-id: 117498491
Test Plan: CI
Reviewed By: dzhulgakov
Differential Revision: D25102077
fbshipit-source-id: fd941ddf32235a9659a1a17609c27cc5cb446a54
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48322
Disable old fuser internally. I would like to find where we are inadvertently setting the old fuser, but in the meantime I would like to land a diff that I know will 100% cause it not to be run, and verify that it fixes the issue.
Test Plan: sandcastle
Reviewed By: ZolotukhinM
Differential Revision: D25126202
fbshipit-source-id: 5a4d0742f5f829e536f50e7ede1256c94dd05232