Since we plan to have a bunch of code that is sensitive to whether or
not a SymInt contains a symbolic shape or not, it seems like a bad idea
to have an implicit constructor.
For example, code like:
```
sizes_and_strides_.stride_at_unchecked(dim) = 0;
```
would sail through, and the `0` would get implicitly promoted to a
SymInt.
This is a tradeoff though: it makes code that handles `SymInt`s more
clunky as `int64_t`s and integer literals need to be explicitly wrapped
in `SymInt` before being used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77666
Approved by: https://github.com/ezyang
- Allow registering custom decompositions
- Add easier API for invoking decompositions
- Shorten API names (no users yet)
I am doing these as one pr because they are fairly short/simple and because github first does not support ghstack yet.
cc @Chillee @zou3519
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76252
Approved by: https://github.com/davidberard98
Moves jit shape function registration to python. Like jit decompositions, a script must be run after adding new definitions which serializes them in a c++ file.
This was a request so that torch-mlir could define functions in python and upstream their shape functions. cc @silvasean @makslevental
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75546
Approved by: https://github.com/davidberard98
Summary:
This PR introduces `SymInt` type to Pytorch which will be used by LTC and AOTAutograd for tracing size arithmetic and tests.
`SymInt` is a C++ union structure [int64_t, SymbolicIntNode*] that wraps around an int64_t field where the value of the field could be an index into a list of `shared_ptr<SymbolicIntNode>` or a real int.
This PR doesn't add any support for actually tracing symbolic ints. i.e. data_ for now can only contain real ints.
```
Goal 1: just to show we can add a type to PyTorch core. (wraps int) LANDEABLE
Finalize the naming - symint
Want the name to be short
Does invoke “size” - NO
SInt/SymInt/SymbolicInt
SInt could mean signed int
sym_int or symint or SymInt (originally it was “int”; capitalized implies object semantics, whereas lowercase implies value semantics)
JIT schema - symint
C++ - symint
```
See more details here: https://docs.google.com/document/d/1iiLNwR5ohAsw_ymfnOpDsyF6L9RTUaHMpD8 (d843f63f2a)YLw-jxEw
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74861
Reviewed By: qihqi, ngimel
Differential Revision: D35226230
Pulled By: Krovatkin
fbshipit-source-id: 34acf342bd50fcaa4d8d5dd49c2fd6a98823a5b3
(cherry picked from commit 218643f63ef181cabb92d13a6e837eb64f2dda3c)
This is a technical revert of 6d36bbde7e to reconcile it with e50478c02592597f12b8490ec5496f76c7d8b8cc (which is the same + lint changes applied)
Should be skipped during import
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74186
Make the execution settings mutable on function_impl so that we can set it for running op decompositions. Add mapping to function objects and show example in test of executing op decompositions.
Test Plan: Imported from OSS
Reviewed By: gchanan
Differential Revision: D34938125
Pulled By: eellison
fbshipit-source-id: adf108b2f6c1bd166910c6d7b94245661d67ce0d
(cherry picked from commit 9957e33803002d9e71abe4ff802769270b6960d3)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74012
This allows setting an executor on a function. The first use case is use to decompositions in C++ without additional fusion passes etc which might not work with custom tensors like batched tensors/vmap. A subsequent use case might be taking advantage of invokees of JIT execution which guard on certain properties before invocation (such as complete shapes in AOT autograd, rank in lazy tensor).
Test Plan: Imported from OSS
Reviewed By: gchanan
Differential Revision: D34938124
Pulled By: eellison
fbshipit-source-id: cf7a45416457942b872322cab47d871a8336bdb5
(cherry picked from commit 9c600eb9ad0f2173f003e511268e97584edae36d)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73270
Together with open registration of NNC lowerings this should make possible to add support for custom operators, including internal fb-ops
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D34451275
Pulled By: eellison
fbshipit-source-id: ae8ae2deb93caa6770e738217461e65853897b55
(cherry picked from commit ea6b7e8a6d8f970a20e68d02eefc5c951e32aa07)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70532
This adds profiling to Optional[Tensor] types
First, in profiling_record.cpp, profiling nodes are added to Optional[Tensor] inputs. The nodes record
(a) whether or not any `None` types are encountered, and
(b) of the Tensor types, what's the most specific type matching all of non-null tensors that were encoutered (shape, dtype, etc.)
In tensorexpr_fuser, when specializing types based on the profiled information, an Optional[Tensor] type will always be Optional[], but the Tensor type contained in the optional type can be specialized (e.g. `Optional[Float(2x2x2, cpu, etc)]`)
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D33714748
Pulled By: davidberard98
fbshipit-source-id: 93c819054450de7ac84b112de1012c0c12e34120
(cherry picked from commit 21cfd80123)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70464
Add handling of strided input tensors to dynamic fusion. This is done with the same set of input striding specializations as https://github.com/pytorch/pytorch/pull/60684/:
```
S_ONE, // STRIDE_ONE: packed
S_CONT, // STRIDE_CONTIGUOUS: stride[i + 1] * sizes[i + 1]
S_TRAN_CONT, // STRIDE_TRANSPOSED_CONTIGUOUS: stride[i-1] * sizes[i-1]
S_AS_ARG, // STRIDE_AS_ARG: stride passed in as runtime value
```
and then two additional specializations for a) contiguous tensor and b) channels-last tensor. channels-last is a common case and we should optimize for it. additionally, tensors natively store whether they are contiguous/channels-last contiguous, which makes it faster to check if tensors follow this pattern.
Output striding will be done in a follow up.
The striding is stored on both the TensorGroup node and on the guard node. The striding descriptors are stored as a vector of strings on the node for debugability and to make use of storing ivalues as attributes on nodes.
As an example:
```
%8 : Double(10, 11, 12, 13, strides=[1716, 1, 143, 11], requires_grad=0, device=cpu) = prim::TensorExprGroup_0[symbolic_shape_inputs=[-37, -36, -35, -34], striding_inputs_desc=[["TENSOR_CONT_CHANNELS_LAST"]](%x, %24, %23, %22, %21)```
```
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D33458649
Pulled By: eellison
fbshipit-source-id: c42616d3c683d70f6258180d23d3841a31a6030d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70871
We had previously handled reusing memory in the optimized kernel execution path, but not yet handled it if we hit the unoptimized fallback.
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D33458652
Pulled By: eellison
fbshipit-source-id: 4eb62181ed02c95813a99638f5e2d0f9347b5c08
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69421
I've hit a lot of build issues in D32671972, and I've come to realize that a lot of it boils down to header hygene. `function.h` includes `profiler.h` *solely* to transitively include `record_function.h` which winds up leaking the profiler symbols. Moreover several files are relying on transitive includes to get access to `getTime`. As long as I have to touch all the places that use `getTime`, I may as well also move them to the new namespace.
Test Plan: Unit tests and CI.
Reviewed By: aaronenyeshi, albanD
Differential Revision: D32865907
fbshipit-source-id: f87d6fd5afb784dca2146436e72c69e34623020e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68410
First step toward not heap-allocating a string in RecordFunction::before() every time
ghstack-source-id: 144287654
Test Plan: CI
Reviewed By: chaekit
Differential Revision: D32453847
fbshipit-source-id: 080d95095fb568287b65fcc41a4ca6929b5f9a87
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67802
In RPC C++ code, we might sometimes call constValue() when the future actually has an exception, and in unittests we want to assert on the exception. What happens is that we get a message basically saying "!eptr_" which indicates there is some exception but we don't know what it is.
This diff simply adds logging for the exception and mentions that `value` over `constValue` should be used when the future can have an exception. The contract of `constValue` to throw when `eptr_` is set is still held, it is just enhanced with additional logging.
ghstack-source-id: 142375391
Test Plan: Added UT
Reviewed By: mrshenli
Differential Revision: D32156552
fbshipit-source-id: 4dd5e73b92173209074c104a4b75c2021e20de4b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65968
tryToGraphFunction() should cover all cases and more composable than
adhoc virtual methods.
ghstack-source-id: 141759214
Test Plan: no behavior change.
Reviewed By: gmagogsfm
Differential Revision: D31326154
fbshipit-source-id: 692a35df424f7d4f777a96489c4cbb24b3ae7807
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65967
Graph is an implementation detail. If user wants to get access to the
underlying graph, they should be able to explicitly dynamic cast instead.
ghstack-source-id: 141659819
Test Plan: no behavior change.
Reviewed By: gmagogsfm
Differential Revision: D31326153
fbshipit-source-id: a0e984f57c6013494b92a7095bf5bb660035eb84
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65345
FooType::get() can return a const reference. Inconveniently, converting shared_ptr<FooType> to shared_ptr<Type> requires a copy & refcount bump, so to properly take advantage of this in unshapedType() we need to take a const Type& in isSubtypeOf(), which is good practice anyway -- don't require a shared_ptr if you don't need to take ownership.
ghstack-source-id: 140044165
Test Plan:
CI
perf says c10::unshapedType time decreased from 2.8% to 2.2% during static runtime startup, though I expect this to be generally beneficial.
Reviewed By: hlu1
Differential Revision: D31027361
fbshipit-source-id: 676feb81db9f74ad7b8651d8774f4ecb4cfa6ab8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64824
See comment in function_schema.h for explanation. I claim that this is a good tradeoff because the aliasing information seems to be used only in compiler-ish code paths, where performance isn't as critical as actual execution. If performance is important there too, perhaps we should hoist isWrite into the Argument itself since there are several paths that only care about isWrite.
ghstack-source-id: 138958896
Test Plan: CI, profile schema parsing on startup and see much fewer page faults in createArgumentVector.
Reviewed By: suo
Differential Revision: D30860719
fbshipit-source-id: 1d4d2328f2b8e34f5ddf9d82083fd4dd7b7f738f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63414
Misuse of raw pointer in here where stack is never nullable.
ghstack-source-id: 136938318
Test Plan:
compiles.
Imported from OSS
Reviewed By: ejguan
Differential Revision: D30375410
fbshipit-source-id: 9d65b620bb76d90d886c800f54308520095d58ee
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62755
After this change, out argument can be checked by calling is_out()
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D30415256
Pulled By: tugsbayasgalan
fbshipit-source-id: b2e1fa46bab7c813aaede1f44149081ef2df566d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62419
This diff adds support for cpu only kineto profiler on mobile. Thus
enabling chrome trace generation on mobile. This bring cpp API for
mobile profiling on part with Torchscript.
This is done via:
1. Utilizating debug handle annotations in KinetoEvent.
2. Adding post processing capability, via callbacks, to
KinetoThreadLocalState
3. Creating new RAII stype profiler, KinetoEdgeCPUProfiler, which can be
used in surrounding scope of model execution. This will write chrome
trace to the location specified in profiler constructor.
Test Plan:
MobileProfiler.ModuleHierarchy
Imported from OSS
Reviewed By: raziel
Differential Revision: D29993660
fbshipit-source-id: 0b44f52f9e9c5f5aff81ebbd9273c254c3c03299
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62417
This diff adds an option to make enableProfiler enable callbacks only
for certain RecordScopes.
Why?
Profiling has some overhead when we repeatedly execute callbacks for
alls copes. On mobile side when we often have small quantized models
this overhead can be large. We observed that by only profiling top level
op and skipping profiling of other atend ops called within we can limit
this overhead. For example, instead of profling at::conv2d -> at::convolution ->
at::convolution_ and further more if ops like transpose etc. are called,
skipping profiling of those. Of course this limits the visibility, but
at the least this way we get a choice.
Test Plan: Imported from OSS
Reviewed By: ilia-cher
Differential Revision: D29993659
fbshipit-source-id: 852d3ae7822f0d94dc6e507bd4019b60d488ef69
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62228
This diff adds debug handles to events and provides a way to use
RECORD_FUNCTIONs that will pass debug_handles down to profiler, which
will record it in the events.
Why add debug_handles?
For pytorch mobile, with lite interpreter, we generate debug handles
that can be used for lazily symbolicate exception traces to model level
stack trace. Similar to the model level stack trace you get in
TorchScript models. The debug_handles also enable getting module
hierarchy for lite interpreter model, support for which was added to
KinetoProfiler in previous diffs.
Followup plan:
1. Enabled scope callbacks such that lite interpreter can use it to
profiler only top level ops.
2. Enable post processing callbacks that take KinetoEvents and populate
module hierarchy using debug handles.
This will let us use KinetoProfiler for lite interpter use cases on
mobile. Aim is to use RAII guard to similarly generate chrome trace for
mobile usecases as well, although only for top level ops.
Test Plan:
test_misc : RecordDebugHandles.Basic
Imported from OSS
Reviewed By: ilia-cher
Differential Revision: D29935899
fbshipit-source-id: 4f06dc411b6b5fe0ffaebdd26d3274c96f8f389b
Summary:
Add `-Wno-writable-strings`(which is clang's flavor of `-Wwrite-strings`) to list of warnings ignored while compiling torch_python.
Avoid unnecessary copies in range loop
Fix number of signed-unsigned comparisons
Found while building locally on M1
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62930
Reviewed By: albanD
Differential Revision: D30171981
Pulled By: malfet
fbshipit-source-id: 25bd43dab5675f927ca707e32737ed178b04651e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61895
* Add FLOP count for addmm, should be `2*m*n*k`.
Share the same code path for `addmm` and `mm`.
Test Plan:
Imported from OSS
`python test/test_profiler.py`
Run a sample profile and check that FLOPS for `aten::addmm` is correct.
`[chowar@devbig053.frc2 ~/local/pytorch/build] ninja bin/test_jit`
`[chowar@devbig053.frc2 ~/local/pytorch/build] ./bin/test_jit --gtest_filter='ComputeFlopsTest*'`
Reviewed By: dskhudia
Differential Revision: D29785671
fbshipit-source-id: d1512036202d7234a981bda897af1f75808ccbfe
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61791
methods from forward
During inlining we attached InlinedCallstack to nodes being inlined. In
the process we attach moodule information as well, such that if
CallMethod is being inlined we know which class instance and class type
the method belongs to. However, CallMethod can be calling a method of
the same object to which the graph belongs. e.g.:
```
def forward(self, input):
x = input + 10
return forward_impl_(x, input)
```
Here forward_impl is method defined on the same class in which forward
is defined. Existing module hierarchy annotation will mislabel this as
unknown instance since the method is not associated with output of
GetAttr node (it would be we had called self.conv.forward_impl_ for
example).
Change in this PR reconciles this by creating a placeholder name "SELF"
for module instance indicating that you can traverse InlinedCallStack
backwards to find first node with name != SELF, which would be the name
of the object.
e.g.:
TOP(ResNet)::forward.SELF(ResNet)::_forward_impl.layer1(Sequential)::forward.0(BasicBlock)::forward.conv1(Conv2d)::forward.SELF(Conv2d)::_conv_forward
Test Plan:
Add test
Imported from OSS
Reviewed By: larryliu0820
Differential Revision: D29745443
fbshipit-source-id: 1525e41df53913341c4c36a56772454782a0ba93
Summary:
As GoogleTest `TEST` macro is non-compliant with it as well as `DEFINE_DISPATCH`
All changes but the ones to `.clang-tidy` are generated using following script:
```
for i in `find . -type f -iname "*.c*" -or -iname "*.h"|xargs grep cppcoreguidelines-avoid-non-const-global-variables|cut -f1 -d:|sort|uniq`; do sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" $i; done
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008
Reviewed By: driazati, r-barnes
Differential Revision: D29838584
Pulled By: malfet
fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56504
Having callbacks registered but disabled via their
`shouldRun` callback defeats the `shouldRunRecordFunction`
optimization (no relation between the two things, despite the
shared prefix on the names) that aims to skip `RecordFunction`
construction.
This diff attempts to safely rectify this issue: we drop support for
`shouldRun` callbacks (this is bc-breaking; does anything use these
externally? do I need to add the support back and just stop using it
internally?), add support for enabling and disabling callbacks, and
(for global callbacks) make doing so thread-safe.
There is an interesting subtlety with `std::atomic` that came up: it
is neither copyable nor movable, which precludes putting it into
`std::vector`. I manually overrode this because the thread safety
reasons it is neither copyable nor movable don't apply here; we
already state that adding or removing callbacks (the operations that
might copy/move an atomic) are not thread-safe and should be done at
initialization time.
ghstack-source-id: 129614296
Test Plan:
Existing CI should cover correctness, right? Inspected
perf report of a simple benchmark that runs nn.Linear in a loop on
CUDA, where internally have Kineto initialized and thus had a
shouldRun observer previously; we are no longer going through the
dispatcher's slow RecordFunction path or spending measurable time
constructing RecordFunction instances.
Reviewed By: ilia-cher
Differential Revision: D27834944
fbshipit-source-id: 93db1bc0a28b5372f7307490c908457e7853fa92
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57635
Note: this PR looks massive, but it's just one simple change, codemodded many times.
In many cases, a callback needs to access the value/error produced by the parent future. In Python this was easy because the callback was invoked with the parent future as argument, and could thus inspect it. In C++ the callbacks didn't take any arguments, thus in many cases we worked around this by capturing the future in its own callback. This is risky (leads to reference cycle and thus memory leak) and must be done carefully (spoiler: sometimes we weren't).
ghstack-source-id: 128296580
Test Plan: CI
Reviewed By: wanchaol
Differential Revision: D28178783
fbshipit-source-id: 6de02c4568be42123372edc008f630d5ddae0081
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57634
`wrapPropagateTLSState` was restricting its argument to be an argument-less function, and I need to relax this for later work.
Also, it was requiring its argument to be converted to `std::function`, and also returned a `std::function`. Each creation of a `std::function` could cause a heap allocation. It's not particularly expensive, but here we can easily avoid it by having `wrapPropagateTLSState` directly operate on generic callables (thus, possibly, raw lambdas).
ghstack-source-id: 128295264
Test Plan: CI
Reviewed By: ilia-cher
Differential Revision: D28178782
fbshipit-source-id: d657f5751514974518606dd4fc4175e805dcb90a