`torch.profiler.record_function` is relatively slow; for example, in some benchmarks I was running, `x.view_as(x)` took ~2us on its own and ~16-17us when wrapped in a record_function context. The overhead comes from: dispatcher overhead from going through an op (the main source), python binding / python conversion overhead, and some overhead from the context manager.
This new implementation is faster, but it won't work with torchscript. Based on the benchmarks I was running, it adds 0.5-0.7us overhead per call when the profiler is turned off. To use it, you can just:
```python
with torch._C._profiler_manual._RecordFunctionFast("title"):
    torch.add(x, y)
```
It implements a context manager in python which directly calls the record_function utilities, instead of calling through an op.
* The context manager is implemented directly in python because the overhead from calling a python function seems non-negligible
* All the record_function calls and python object conversions are guarded by a check for whether the profiler is enabled; this seems to save a few hundred nanoseconds.
For more details about the experiments I ran to choose this implementation, see [my record_functions experiments branch](https://github.com/pytorch/pytorch/compare/main...davidberard98:pytorch:record-function-fast-experiments?expand=1).
This also adds a `torch.autograd.profiler._is_profiler_enabled` global variable that can be used to check whether a profiler is currently enabled. It's useful for further reducing the overhead, like this:
```python
if torch.autograd.profiler._is_profiler_enabled:
    with torch._C._profiler_manual._RecordFunctionFast("title"):
        torch.add(x, y)
else:
    torch.add(x, y)
```
On BERT_pytorch (CPU-bound model), if we add a record_function inside CachedAutotuning.run:
* Naive torch.profiler.record_function() is a ~30% slowdown
* Always wrapping with RecordFunctionFast causes a regression of ~2-4%.
* Guarding with an if statement - any regression is within noise
**Selected benchmark results**: these come from a 2.20GHz machine, GPU build but only running CPU ops; running `x.view_as(x)` with various record_functions applied (with profiling turned off). For more detailed results see the "record_functions experiments branch" linked above (those results are on a different machine, but show the same patterns). Note that the results are somewhat noisy; assume 0.05-0.1us of variation.
```
Baseline:: 1.7825262546539307 us # Just running x.view_as(x)
profiled_basic:: 13.600390434265137 us # torch.profiler.record_function(x) + view_as
precompute_manual_cm_rf:: 2.317216396331787 us # torch._C._profiler_manual._RecordFunctionFast(), if the context is pre-constructed + view_as
guard_manual_cm_rf:: 1.7994389533996582 us # guard with _is_profiler_enabled + view_as
```
Differential Revision: [D48421198](https://our.internmc.facebook.com/intern/diff/D48421198)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107195
Approved by: https://github.com/albanD, https://github.com/aaronenyeshi
This adds some utilities for conveniently working with fast combined CapturedTraceback from Python. The main goal of these utilities is to make it easier for people to use CapturedTraceback as a drop-in replacement for `traceback.extract_stack`, which is 20x slower than CapturedTraceback.
I port symbolic shapes to use the new CapturedTraceback code, to validate that the APIs work and are useful.
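For reference, a minimal sketch of the drop-in pattern (assuming the helpers live in `torch.utils._traceback` and expose `extract()` / `format()`; this is my reading rather than something stated above):
```python
# Hedged sketch: capture a cheap traceback now, pay the formatting cost later.
from torch.utils._traceback import CapturedTraceback  # assumed location

def remember_callsite():
    # Capture is fast; symbolization/formatting is deferred until format() is called.
    return CapturedTraceback.extract(skip=1)  # skip this helper frame (assumed kwarg)

tb = remember_callsite()
print("".join(tb.format()))  # roughly what traceback.format_list(traceback.extract_stack()) would give
```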
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107358
Approved by: https://github.com/zdevito, https://github.com/albanD
ghstack dependencies: #107438
We hope PyTorch's profiling parsing ability can also be applied to custom devices. Based on previous work in https://github.com/pytorch/pytorch/pull/101554, we have made supplementary updates to PyTorch profiling to extend its parsing capabilities to custom devices. These modifications do not affect the original logic of the code and mainly include the following aspects:
1. Added the relevant logic for use_device in torch.profiler.profiler._KinetoProfile.
2. In torch.autograd.profiler and torch.autograd.profiler_util, the ability to parse custom device profiling data has been added using the privateuse1 and use_device attributes.
3. In torch._C._autograd.pyi, custom device related attributes have been added. The underlying C++ logic will be added in subsequent pull requests.
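As a rough usage sketch (not taken from this PR), the intent is that a privateuse1 backend can be profiled through the same Python entry points; the backend registration and the exact `use_device` value below are assumptions:
```python
import torch
from torch.autograd.profiler import profile

# Hypothetical: a custom backend registered via the privateuse1 mechanism,
# e.g. after torch.utils.rename_privateuse1_backend("my_device").
x = torch.randn(8, 8)  # placeholder; a real run would move this to the custom device

with profile(use_device="privateuse1") as prof:  # `use_device` value is an assumption
    y = x.mm(x)

print(prof.key_averages().table(row_limit=5))
```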
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106142
Approved by: https://github.com/aaronenyeshi
Summary:
## About Sync Events
For CUDA profiling mode, we can enable tracing CUDA synchronization events.
* This feature captures synchronization events in CUDA, including 1) context/device sync, 2) stream sync, 3) CUDA event sync, and 4) CUDA stream wait event (inter-stream synchronization).
* We add this flag using the profiler's experimental config option.
* This PR relies on the 7b003638c6 change in pytorch/kineto
## Usage
Just set the `enable_cuda_sync_events` option in `_ExperimentalConfig`
```python
from torch.autograd.profiler import profile, _ExperimentalConfig

with profile(use_kineto=True, use_cuda=True,
             experimental_config=_ExperimentalConfig(enable_cuda_sync_events=True),
             ) as prof:
    workload()
```
**Please wait for the PyTorch GitHub repo to point to 7b003638c6 or a later commit in Kineto**
Test Plan:
## Unit Test
Added a unit test
```
buck2 test mode/dev-nosan caffe2/test:profiler --local-only -- test_profiler_cuda_sync_events
```
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0
https://www.internalfb.com/intern/testinfra/testrun/281475298097379
Reviewed By: davidberard98
Differential Revision: D46244591
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105187
Approved by: https://github.com/aaronenyeshi
Summary: Trigger tracing for MTIA events on the Python side when `ProfilerActivity.MTIA` is specified.
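A minimal sketch of what this enables on the Python side (assuming an MTIA device is available in the environment; the workload function is a placeholder, not from this PR):
```python
import torch
from torch.profiler import profile, ProfilerActivity

def workload():
    # placeholder for a model running on the MTIA device
    pass

# Passing ProfilerActivity.MTIA requests MTIA event tracing alongside CPU events.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.MTIA]) as prof:
    workload()

print(prof.key_averages().table(row_limit=10))
```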
Test Plan:
Test diff: D45437426
```
hg graft D45437426
```
- in one terminal
```
cd ~/fbsource/fbcode
buck2 run -j 8 \
//infra_asic_fpga/firmware/tools/mad/service:mad_service
```
- in another terminal
PyTorch profiler:
```
buck run mode/dev-nosan -j 8 //caffe2/torch/fb/acc_runtime/afg/tests:test_afg -- -m kernel_add
```
Differential Revision: D46122853
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102288
Approved by: https://github.com/aaronenyeshi
1. `torch.autograd.profiler` interface parameters changed: using `self.use_device` instead of `self.use_cuda` facilitates access by other devices, and integration follows in subsequent PRs.
2. Modify `ProfilerEventStub` (aka `std::shared_ptr<CUevent_st>`) to `ProfilerVoidEventStub` (aka `std::shared_ptr<void>`) so that `ProfilerStubs` can be inherited by any `{device}Methods`.
In addition, `cuda_event_start_` is renamed to `device_event_start_`, so CUDA and other devices can use this event pointer if needed.
3. Add custom device support to legacy profiling (add the `ProfilerState::KINETO_PRIVATEUSE1_FALLBACK` option).
4. Add the `privateuse1Stubs` registration.
(Result parsing and test cases are added in subsequent PRs.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101554
Approved by: https://github.com/aaronenyeshi
**TL;DR:** This re-introduces links between backward kernels and their corresponding forward kernels.
<img width="1020" alt="Screenshot 2023-05-26 at 7 25 22 PM" src="https://github.com/pytorch/pytorch/assets/5067123/02571b59-859c-4c9e-b3ef-121ef3159812">
In the example above, you can see there are two such flows: one for `aten::add` and one for `aten::binary_cross_entropy`.
### Details
Forward/backward links were added in https://github.com/pytorch/pytorch/pull/62553, but then disabled in https://github.com/pytorch/pytorch/pull/72904 due to segfaults (e.g. https://github.com/pytorch/pytorch/issues/69443).
A lot of refactoring has happened since the fwd-bwd links were disabled, so this PR updates the implementation:
* Use a raw profiler::impl::Result instead of a KinetoEvent
* Move the implementation to collection.cpp, where the TraceWrapper is currently handled.
* Sort the events before processing, because they aren't always in chronological order
* There can now be more than one event in the backward pass that matches the sequenceNr-threadID pair. The implementation needed to be updated to avoid showing multiple endpoints for a given sequenceNr-threadID pair ([ptr to where the bwd sequenceNr-threadID pair is duplicated](6e3e3dd477/torch/csrc/profiler/collection.cpp (L398-L399))).
Next, we need to verify that https://github.com/pytorch/pytorch/issues/69443 is fixed. Running the repro no longer errors. Looking further into the details of the issue, it seems the handling of the [raw linkedActivity pointer (old code from 2021)](6089dcac48/libkineto/src/output_json.cpp (L283)) caused the segfault. The linked activity no longer appears to be used anywhere in output_json.cpp, so the issue should be fixed.
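For anyone who wants to reproduce a trace like the screenshot above, a small sketch (model and shapes are arbitrary, not from this PR) that profiles a forward+backward pass and exports a trace viewable in chrome://tracing or Perfetto:
```python
import torch
from torch.profiler import profile, ProfilerActivity

x = torch.randn(32, 16, requires_grad=True)
w = torch.randn(16, 4, requires_grad=True)
target = torch.rand(32, 4)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    # aten::binary_cross_entropy in the forward pass should get a flow arrow
    # to its corresponding backward kernel in the exported trace.
    loss = torch.nn.functional.binary_cross_entropy(torch.sigmoid(x.mm(w)), target)
    loss.backward()

prof.export_chrome_trace("fwd_bwd_trace.json")
```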
### Testing
#### 1. unit test
`test_profiler_fwd_bwd_link` was un-skipped. It was modified to match the new implementation.
#### 2. https://github.com/pytorch/pytorch/issues/69443
I ran the repro in https://github.com/pytorch/pytorch/issues/69443 and verified there were no segfaults.
#### 3. Duplicate flow IDs
When forward-backward connections were first introduced, gpu-cpu async links had not been introduced. There's a possibility that gpu-cpu links and fwd-bwd links could interfere if their IDs overlap.
I manually tested this in chrome://tracing; I edited a file so that a gpu-cpu link had the same ID as one of the fwd-bwd connections. The chrome tracing UI continued showing both types of links.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102424
Approved by: https://github.com/aaronenyeshi
Many ops take as inputs scalars or scalar lists which are important for understanding the properties of the op. For example, convolution ops' behavior and output shapes often depend on padding and strides, which are provided as scalars or lists of scalars. This PR records scalar lists when `record_inputs=True`.
Details:
During collection (and this was true before this PR as well), we serialize values and tensor metadata into an InputOutputEncoder. After collection occurs, we deserialize these values to attach the information to each of the events.
This PR does this:
- Adds support for serializing scalar lists during collection / serialization
- Adds an extra field called "Concrete Args"
- Splits up the deserialization process into two steps - one for generating "input shapes" and one for generating "concrete args". We split up input shapes and concrete args to avoid interrupting any previous workflows that relied on the specific data in the input shapes category; additionally, it's just a better description. Note that single scalars will remain in the "input shapes" category as they were already in that category in the past.
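A hedged sketch of inspecting the recorded information from Python; the flag spelling (`record_shapes` on the public profiler) and the `concrete_inputs` attribute are my assumptions rather than something specified above:
```python
import torch
from torch.profiler import profile, ProfilerActivity

x = torch.randn(1, 3, 32, 32)
w = torch.randn(8, 3, 3, 3)

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    # stride/padding are the kind of scalar(-list) arguments discussed above
    torch.nn.functional.conv2d(x, w, stride=2, padding=1)

for evt in prof.events():
    if "conv2d" in evt.name:
        # input_shapes holds tensor shapes; concrete args (if present) hold scalar values
        print(evt.name, evt.input_shapes, getattr(evt, "concrete_inputs", None))
```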
Differential Revision: [D45798431](https://our.internmc.facebook.com/intern/diff/D45798431)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100593
Approved by: https://github.com/aaronenyeshi
Changes:
- #95200
1. Recognize `.py.in` and `.pyi.in` files as Python in VS Code for a better development experience.
2. Fix deep setting merge in `tools/vscode_settings.py`.
- #95267
3. Use `NamedTuple` rather than `namedtuple + __annotations__` for `torch.nn.utils.rnn.PackedSequence_`:
`namedtuple + __annotations__`:
```python
PackedSequence_ = namedtuple('PackedSequence_',
                             ['data', 'batch_sizes', 'sorted_indices', 'unsorted_indices'])

# type annotation for PackedSequence_ to make it compatible with TorchScript
PackedSequence_.__annotations__ = {'data': torch.Tensor, 'batch_sizes': torch.Tensor,
                                   'sorted_indices': Optional[torch.Tensor],
                                   'unsorted_indices': Optional[torch.Tensor]}
```
`NamedTuple` (Python 3.6+):
```python
class PackedSequence_(NamedTuple):
    data: torch.Tensor
    batch_sizes: torch.Tensor
    sorted_indices: Optional[torch.Tensor]
    unsorted_indices: Optional[torch.Tensor]
```
- => this PR: #95268
4. Sort import statements and remove unnecessary imports in `.pyi`, `.pyi.in` files.
5. Format `.pyi`, `.pyi.in` files and remove unnecessary ellipsis `...` in type stubs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95268
Approved by: https://github.com/huydhn
Changes:
1. `typing_extensions -> typing-extensions` in dependency. Use a dash rather than an underscore to fit the [PEP 503: Normalized Names](https://peps.python.org/pep-0503/#normalized-names) convention.
```python
import re

def normalize(name):
    return re.sub(r"[-_.]+", "-", name).lower()
```
2. Import `Literal`, `Protocol`, and `Final` from the standard library as of Python 3.8+.
3. Replace `Union[Literal[XXX], Literal[YYY]]` with `Literal[XXX, YYY]`.
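A small before/after illustration of change 3:
```python
from typing import Literal, Union

# Before: a union of single-value Literals
ModeOld = Union[Literal["r"], Literal["w"]]

# After: one Literal with multiple values (equivalent, and shorter)
ModeNew = Literal["r", "w"]
```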
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94490
Approved by: https://github.com/ezyang, https://github.com/albanD
The appropriate annotation for a block of memory is a function of time: an input can be mutated in-place to become an activation, a clever kernel might steal the memory of a detached input (such as a mask) to use as output memory, etc.
We could pessimistically assume that all ops mutate all of their inputs, however inspection of schema allows us to significantly narrow that assumption with minimal effort. Checking schemas also allows us to distinguish between dispatcher ops (which have load bearing semantics) and user annotations with reasonably high precision.
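As an illustration of the idea (the PR's logic lives in the profiler's C++ collection code; this is just the Python-visible analogue), an op schema's alias info tells us whether an argument may be written:
```python
import torch

def may_mutate_first_arg(op) -> bool:
    # An argument whose alias_info has is_write set is one the op may mutate in place.
    arg = op._schema.arguments[0]
    return arg.alias_info is not None and arg.alias_info.is_write

print(may_mutate_first_arg(torch.ops.aten.add_.Tensor))  # True: in-place add mutates `self`
print(may_mutate_first_arg(torch.ops.aten.add.Tensor))   # False: out-of-place add does not
```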
Differential Revision: [D40220390](https://our.internmc.facebook.com/intern/diff/D40220390/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86854
Approved by: https://github.com/chaekit
This PR unifies and rationalizes some of the input representation in Result. The current approach of storing separate types in separate vectors is tedious for two types (Tensors and scalars), but would be even more annoying with the addition of TensorLists. A similar disconnection exists with sizes and strides, which the user is also expected to zip with `tensor_metadata`.
I simplified things by moving inputs to a variant and moving sizes and strides into TensorMetadata. This also forced collection of sizes and strides in python tracer which helps to bring it in line with op profiling. Collection of TensorLists is fairly straightforward; `InputOutputEncoder` already has a spot for them (I actually collected them in the original TorchTidy prototype) so it was just a matter of plumbing things through.
Differential Revision: [D40734451](https://our.internmc.facebook.com/intern/diff/D40734451/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87825
Approved by: https://github.com/slgong-fb, https://github.com/chaekit
Part of the current ID assignment algorithm groups any Storages which are associated with the same TensorImpl*. This isn't sound (which I knew but deferred until it actually became a problem) because pointers can be reused by different objects (the ABA problem).
ABA is easy to handle for Storage because we see allocations and frees, but ~TensorImpl is very hot and cannot tolerate profiling code without significant increases in overhead.
This PR narrows the conditions under which ID assignment will join on TensorImpl*. Two storages which are associated with the same TensorImpl* are grouped IFF they were live at the same time. (Note that this still allows storages with disjoint lifetimes to be joined transitively through a third storage which overlaps with both.)
The need for this PR arose in memory profiling. The Python argument parser creates short-lived Tensors for (some) scalar arguments, which triggers this issue. (This is stochastic and platform dependent, since optimizations like reusing recently freed allocations are implementation defined.) Spurious connections can lead to confusing and long-range interactions when building up the memory profile, so it makes sense to harden ID assignment to avoid any issues.
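A toy illustration of the ABA hazard (not profiler code; whether the allocator actually reuses the block is platform dependent):
```python
import torch

a = torch.empty(1024)
addr_a = a.untyped_storage().data_ptr()
del a  # the storage is freed; the allocator may recycle this block

b = torch.empty(1024)
addr_b = b.untyped_storage().data_ptr()

# May print True: same raw address, but a completely different Storage object,
# which is why a bare pointer value cannot serve as a stable identity over time.
print(addr_a == addr_b)
```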
Differential Revision: [D40445121](https://our.internmc.facebook.com/intern/diff/D40445121/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87133
Approved by: https://github.com/slgong-fb, https://github.com/chaekit
There are a number of instrumentation utils which have been added to the profiler toolkit. They are generally small and self contained, often wrapping vendor APIs. (NVTX, ITT)
They don't really interact with the much more expansive machinery of the PyTorch profiler beyond registration / unregistration, minor util sharing, and reusing the profiler base class. Just as in the case of stubs, it makes sense to group them in a dedicated subfolder.
Differential Revision: [D39108649](https://our.internmc.facebook.com/intern/diff/D39108649/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39108649/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85511
Approved by: https://github.com/albanD
Summary: The `Call stack` field greatly increases trace file size when Python stack tracing is enabled (it needs to be deprecated carefully). Added a config option to avoid this increase.
Test Plan:
Setting `experimental_config=_ExperimentalConfig(no_callstack_trace=True)` will remove the field.
+ CI tests
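For completeness, a hedged usage sketch (assuming the option lands exactly as named; `with_stack=True` is my addition to show the interaction with Python stack tracing):
```python
import torch
from torch.autograd.profiler import profile, _ExperimentalConfig

with profile(use_kineto=True, with_stack=True,
             experimental_config=_ExperimentalConfig(no_callstack_trace=True),
             ) as prof:
    torch.randn(128, 128).mm(torch.randn(128, 128))

# The exported trace should omit the bulky "Call stack" field.
prof.export_chrome_trace("trace_without_call_stack.json")
```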
Differential Revision: D39489828
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84982
Approved by: https://github.com/robieta