Commit Graph

127 Commits

Author SHA1 Message Date
CodemodService FBSourceClangFormatLinterBot
60632a00fe [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D33561057

fbshipit-source-id: 79873717c45c8bbe6d0ae760e718770fd960185d
2022-01-13 03:27:06 -08:00
Elias Ellison
5480deb183 Add support for permuting dynamic fusion group outputs to channels last format (#70656)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70656

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D33458650

Pulled By: eellison

fbshipit-source-id: f0c7d20743deac7a87f7c9176e60da8100aefe41
2022-01-12 09:11:34 -08:00
Elias Ellison
39be20f259 [JIT][NNC] Add handling of strides to dynamic shape support. (#70464)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70464

Add handling of strided input tensors to dynamic fusion. This is done with the same set of input striding specializations as https://github.com/pytorch/pytorch/pull/60684/:
```
  S_ONE, // STRIDE_ONE: packed
  S_CONT, // STRIDE_CONTIGUOUS: stride[i + 1] * sizes[i + 1]
  S_TRAN_CONT, // STRIDE_TRANSPOSED_CONTIGUOUS: stride[i-1] * sizes[i-1]
  S_AS_ARG, // STRIDE_AS_ARG: stride passed in as runtime value
```
and then two additional specializations for a) a contiguous tensor and b) a channels-last tensor. Channels-last is a common case and we should optimize for it. Additionally, tensors natively store whether they are contiguous/channels-last contiguous, which makes it faster to check whether tensors follow this pattern.

Output striding will be done in a follow up.

The striding is stored on both the TensorGroup node and on the guard node. The striding descriptors are stored as a vector of strings on the node for debuggability and to take advantage of the existing support for storing ivalues as attributes on nodes.

As an example:

```
%8 : Double(10, 11, 12, 13, strides=[1716, 1, 143, 11], requires_grad=0, device=cpu) = prim::TensorExprGroup_0[symbolic_shape_inputs=[-37, -36, -35, -34], striding_inputs_desc=[["TENSOR_CONT_CHANNELS_LAST"]](%x, %24, %23, %22, %21)
```
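
For reference, a minimal standalone C++ sketch (illustrative only, not the fuser's own code) of how the contiguous and channels-last stride patterns are derived from the sizes in the IR example above:

```
#include <array>
#include <cstdint>
#include <iostream>

int main() {
  // Sizes from the IR example above: Double(10, 11, 12, 13) in NCHW order.
  const int64_t N = 10, C = 11, H = 12, W = 13;
  (void)N;  // N does not appear in the stride of any dimension

  // Contiguous layout: the innermost (W) dimension is packed.
  const std::array<int64_t, 4> contiguous{C * H * W, H * W, W, 1};

  // Channels-last layout: the channel dimension is packed instead.
  const std::array<int64_t, 4> channels_last{H * W * C, 1, W * C, C};

  for (auto s : contiguous) std::cout << s << ' ';    // 1716 156 13 1
  std::cout << '\n';
  for (auto s : channels_last) std::cout << s << ' '; // 1716 1 143 11, matching strides=[1716, 1, 143, 11]
  std::cout << '\n';
  return 0;
}
```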

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D33458649

Pulled By: eellison

fbshipit-source-id: c42616d3c683d70f6258180d23d3841a31a6030d
2022-01-12 09:11:31 -08:00
Elias Ellison
fb66f561b1 Add copy out to the fallback path in SR invocation of composed op (#70871)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70871

We had previously handled reusing memory in the optimized kernel execution path, but not yet handled it if we hit the unoptimized fallback.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D33458652

Pulled By: eellison

fbshipit-source-id: 4eb62181ed02c95813a99638f5e2d0f9347b5c08
2022-01-10 12:16:38 -08:00
Taylor Robie
24bc3be146 [Profiler] Clean up profiler includes. (#69421)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69421

I've hit a lot of build issues in D32671972, and I've come to realize that a lot of it boils down to header hygiene. `function.h` includes `profiler.h` *solely* to transitively include `record_function.h`, which winds up leaking the profiler symbols. Moreover, several files are relying on transitive includes to get access to `getTime`. As long as I have to touch all the places that use `getTime`, I may as well also move them to the new namespace.

Test Plan: Unit tests and CI.

Reviewed By: aaronenyeshi, albanD

Differential Revision: D32865907

fbshipit-source-id: f87d6fd5afb784dca2146436e72c69e34623020e
2021-12-15 12:50:24 -08:00
Scott Wolchok
1d84d8c5d8 [PyTorch] Remove StringView from RecordFunction interface (1/2) (#68410)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68410

First step toward not heap-allocating a string in RecordFunction::before() every time
ghstack-source-id: 144287654

Test Plan: CI

Reviewed By: chaekit

Differential Revision: D32453847

fbshipit-source-id: 080d95095fb568287b65fcc41a4ca6929b5f9a87
2021-11-30 13:20:08 -08:00
Joel Schlosser
8fef7c09f5 Remove finput from slow2d signatures (#68896)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68896

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D32655874

Pulled By: jbschlosser

fbshipit-source-id: 3c9acb106961c40af1432652179edb2bc5a4bfa5
2021-11-30 09:47:24 -08:00
Raghavan Raman
2fd468e5f8 [jit] Set the graph input types before interpreting the graph during tracing (#68242)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68242

Test Plan: Imported from OSS

Reviewed By: saketh-are

Differential Revision: D32382958

Pulled By: navahgar

fbshipit-source-id: 4e82a604a9ea2046af2755de23944147e618a65f
2021-11-15 15:44:32 -08:00
Rohan Varma
90d311b268 [RPC] Add exception logging to constValue() (#67802)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67802

In RPC C++ code, we might sometimes call constValue() when the future actually has an exception, and in unit tests we want to assert on the exception. What happens is that we get a message basically saying "!eptr_", which indicates there is some exception but we don't know what it is.

This diff simply adds logging for the exception and notes that `value` should be preferred over `constValue` when the future can have an exception. The contract of `constValue` throwing when `eptr_` is set is still upheld; it is just enhanced with additional logging.
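
A rough C++ sketch of the contract being discussed, using a simplified stand-in for ivalue::Future (assumed, simplified signatures; not the real class):

```
#include <exception>
#include <iostream>
#include <stdexcept>

// Simplified stand-in for ivalue::Future, just to illustrate the contract.
struct FutureSketch {
  std::exception_ptr eptr_;  // set when the future completed with an error
  int value_ = 0;            // stand-in for the real IValue payload

  // value(): safe when the future may hold an error -- the error is rethrown.
  int value() const {
    if (eptr_) std::rethrow_exception(eptr_);
    return value_;
  }

  // constValue(): only valid when there is no error. With the change above,
  // hitting an error here logs what the exception was, instead of only
  // reporting that "!eptr_" failed.
  int constValue() const {
    if (eptr_) {
      try { std::rethrow_exception(eptr_); }
      catch (const std::exception& e) { std::cerr << "future holds error: " << e.what() << '\n'; }
      throw std::logic_error("constValue() called on a future with an error; use value()");
    }
    return value_;
  }
};
```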
ghstack-source-id: 142375391

Test Plan: Added UT

Reviewed By: mrshenli

Differential Revision: D32156552

fbshipit-source-id: 4dd5e73b92173209074c104a4b75c2021e20de4b
2021-11-04 10:04:09 -07:00
Zhengxu Chen
0795735351 [jit] Clean up unneeded virtual methods from Function interface. (#65968)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65968

tryToGraphFunction() should cover all cases and is more composable than ad-hoc virtual methods.
ghstack-source-id: 141759214

Test Plan: no behavior change.

Reviewed By: gmagogsfm

Differential Revision: D31326154

fbshipit-source-id: 692a35df424f7d4f777a96489c4cbb24b3ae7807
2021-10-28 12:28:48 -07:00
Zhengxu Chen
b55a2500d2 [jit] Remove graph() call from abstract Function interface. (#65967)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65967

Graph is an implementation detail. If a user wants access to the underlying graph, they should explicitly dynamic-cast instead.
ghstack-source-id: 141659819

Test Plan: no behavior change.

Reviewed By: gmagogsfm

Differential Revision: D31326153

fbshipit-source-id: a0e984f57c6013494b92a7095bf5bb660035eb84
2021-10-27 11:54:26 -07:00
Michael Shi
ad5731cacc [PyTorch] Add flop count for bmm and baddbmm (#66636)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66636

Add FLOP count for bmm and baddbmm, which is `2*b*m*n*k`.
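
A small C++ sketch of where the `2*b*m*n*k` count comes from (illustrative only, not the profiler's implementation):

```
#include <cstdint>
#include <iostream>

// bmm multiplies b pairs of (m x k) and (k x n) matrices; each of the b*m*n
// output elements takes k multiplies and k adds, hence the factor of 2.
// baddbmm adds a bias on top, which doesn't change the dominant term.
int64_t bmm_flops(int64_t b, int64_t m, int64_t n, int64_t k) {
  return 2 * b * m * n * k;
}

int main() {
  std::cout << bmm_flops(/*b=*/8, /*m=*/128, /*n=*/64, /*k=*/256) << '\n';  // 33554432
  return 0;
}
```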

Reviewed By: ngimel

Differential Revision: D31622061

fbshipit-source-id: f3e1e1e34c45228693117b81647fb4a623c4085b
2021-10-25 17:31:12 -07:00
Nikolay Korovaiko
a7ebf76a15 jit trace (#59949)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59949

Reviewed By: ZolotukhinM

Differential Revision: D31366787

Pulled By: Krovatkin

fbshipit-source-id: 798cbcd97e8ecfba984f98cd70214954be9309af
2021-10-24 18:04:22 -07:00
Scott Wolchok
2d885ab73d [jit] Reduce refcounting of Types (#65345)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65345

FooType::get() can return a const reference. Inconveniently, converting shared_ptr<FooType> to shared_ptr<Type> requires a copy & refcount bump, so to properly take advantage of this in unshapedType() we need to take a const Type& in isSubtypeOf(), which is good practice anyway -- don't require a shared_ptr if you don't need to take ownership.
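
The general idiom being applied, as a C++ sketch outside the JIT type system (simplified names; in the real code get() returns a reference to a cached shared_ptr rather than a plain singleton):

```
#include <memory>

struct Type {
  virtual ~Type() = default;
};

struct FooType : Type {
  // Sketch: returning a const reference to a cached instance means callers
  // that only want to *inspect* the type never pay a shared_ptr copy or an
  // atomic refcount bump.
  static const FooType& get() {
    static const FooType instance;
    return instance;
  }
};

// Take const Type& rather than shared_ptr<Type>: the callee doesn't need
// ownership, so it shouldn't force the caller to materialize a shared_ptr.
bool isSubtypeOfSketch(const Type& subtype, const Type& supertype) {
  // Placeholder check; the real implementation walks the type hierarchy.
  return &subtype == &supertype;
}
```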
ghstack-source-id: 140044165

Test Plan:
CI

perf says c10::unshapedType time decreased from 2.8% to 2.2% during static runtime startup, though I expect this to be generally beneficial.

Reviewed By: hlu1

Differential Revision: D31027361

fbshipit-source-id: 676feb81db9f74ad7b8651d8774f4ecb4cfa6ab8
2021-10-08 09:03:04 -07:00
Scott Wolchok
ece25c453f [PyTorch] Store Argument::alias_info_ on the heap (#64824)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64824

See comment in function_schema.h for explanation. I claim that this is a good tradeoff because the aliasing information seems to be used only in compiler-ish code paths, where performance isn't as critical as actual execution. If performance is important there too, perhaps we should hoist isWrite into the Argument itself since there are several paths that only care about isWrite.
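
A C++ sketch of the layout tradeoff described in the comment (field names here are illustrative, not the actual c10::Argument members):

```
#include <memory>
#include <optional>
#include <string>
#include <vector>

// Stand-in for the aliasing annotations attached to a schema argument.
struct AliasInfoSketch {
  bool is_write = false;
  std::vector<std::string> before_sets, after_sets;
};

// Inline storage: every Argument pays sizeof(optional<AliasInfoSketch>) even
// though most arguments carry no aliasing info at all.
struct ArgumentInlineSketch {
  std::string name;
  std::optional<AliasInfoSketch> alias_info;
};

// Heap storage: the common case shrinks to a single (usually null) pointer;
// compiler-ish paths that actually read the info pay one extra indirection.
struct ArgumentHeapSketch {
  std::string name;
  std::unique_ptr<AliasInfoSketch> alias_info;
};
```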
ghstack-source-id: 138958896

Test Plan: CI, profile schema parsing on startup and see much fewer page faults in createArgumentVector.

Reviewed By: suo

Differential Revision: D30860719

fbshipit-source-id: 1d4d2328f2b8e34f5ddf9d82083fd4dd7b7f738f
2021-09-24 17:00:51 -07:00
Peter Bell
68e5935498 Remove fgrad_input from slow_conv2d (#64280)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64280

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D30830887

Pulled By: jbschlosser

fbshipit-source-id: 5a3a79ad9d9118177672eabf872f9d9a3313ebe4
2021-09-24 14:27:39 -07:00
Elias Ellison
3bf93d769c [JIT] Add gradient check in constants (#64613)
Summary:
fixes internal issue

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64613

Reviewed By: Gamrix

Differential Revision: D30799016

Pulled By: eellison

fbshipit-source-id: 48ef52d1cac627919e6cd232216d24878a2a8b58
2021-09-09 08:13:57 -07:00
Zhengxu Chen
ac99d63f83 [jit] Make operation call accept Stack& instead Stack* (#63414)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63414

Fixes a misuse of a raw pointer here; the stack is never nullable.
ghstack-source-id: 136938318

Test Plan:
compiles.

Imported from OSS

Reviewed By: ejguan

Differential Revision: D30375410

fbshipit-source-id: 9d65b620bb76d90d886c800f54308520095d58ee
2021-08-30 11:49:20 -07:00
Tugsbayasgalan (Tugsuu) Manlaibaatar
19c1b45f25 Detect out argument in the schema (#62755)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62755

After this change, the out argument can be checked by calling is_out().

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D30415256

Pulled By: tugsbayasgalan

fbshipit-source-id: b2e1fa46bab7c813aaede1f44149081ef2df566d
2021-08-27 11:20:33 -07:00
Kimish Patel
38c185189c [Pytorch Edge] Enable kineto profiler on mobile via EdgeKinetoProfiler (#62419)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62419

This diff adds support for a CPU-only Kineto profiler on mobile, thus enabling Chrome trace generation on mobile. This brings the C++ API for mobile profiling on par with TorchScript.
This is done via:
1. Utilizing debug handle annotations in KinetoEvent.
2. Adding post-processing capability, via callbacks, to KinetoThreadLocalState.
3. Creating a new RAII-style profiler, KinetoEdgeCPUProfiler, which can be used in the surrounding scope of model execution. This will write a Chrome trace to the location specified in the profiler constructor.

Test Plan:
MobileProfiler.ModuleHierarchy

Imported from OSS

Reviewed By: raziel

Differential Revision: D29993660

fbshipit-source-id: 0b44f52f9e9c5f5aff81ebbd9273c254c3c03299
2021-08-13 21:40:19 -07:00
Kimish Patel
1b04d99f55 [Pytorch Profiler] Introduce scopes to enableProfiler (#62417)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62417

This diff adds an option to make enableProfiler enable callbacks only for certain RecordScopes.
Why?
Profiling has some overhead when we repeatedly execute callbacks for all scopes. On the mobile side, where we often have small quantized models, this overhead can be large. We observed that by only profiling the top-level op and skipping profiling of the other aten ops called within it, we can limit this overhead. For example, instead of profiling at::conv2d -> at::convolution -> at::convolution_ (and furthermore ops like transpose etc., if they are called), we skip profiling of those nested ops. Of course this limits visibility, but at least this way we get a choice.
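
A hedged C++ sketch of the scope-filtering idea (the enum and profiler type here are simplified stand-ins, not the exact enableProfiler API):

```
#include <unordered_set>

// Simplified mirror of at::RecordScope; the real enum lives in
// ATen/record_function.h and has more members.
enum class ScopeSketch { FUNCTION, BACKWARD_FUNCTION, TORCHSCRIPT_FUNCTION, LITE_INTERPRETER };

struct ScopedProfilerSketch {
  // Empty set means "observe everything" (the previous behavior).
  std::unordered_set<ScopeSketch> enabled_scopes;

  // Checked at the start of each RECORD_FUNCTION: with a restricted scope set,
  // nested calls such as at::conv2d -> at::convolution -> at::convolution_
  // bail out here instead of paying the full callback cost.
  bool shouldObserve(ScopeSketch scope) const {
    return enabled_scopes.empty() || enabled_scopes.count(scope) != 0;
  }
};

// Usage sketch: only profile top-level (lite interpreter) ops on mobile.
// ScopedProfilerSketch profiler{{ScopeSketch::LITE_INTERPRETER}};
```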

Test Plan: Imported from OSS

Reviewed By: ilia-cher

Differential Revision: D29993659

fbshipit-source-id: 852d3ae7822f0d94dc6e507bd4019b60d488ef69
2021-08-13 21:40:15 -07:00
Kimish Patel
b00afe135d [Pytorch Profiler] Add debug_handles to KinetoEvent (#62228)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62228

This diff adds debug handles to events and provides a way to use
RECORD_FUNCTIONs that will pass debug_handles down to profiler, which
will record it in the events.

Why add debug_handles?
For PyTorch mobile, with the lite interpreter, we generate debug handles that can be used to lazily symbolicate exception traces to a model-level stack trace, similar to the model-level stack trace you get in TorchScript models. The debug_handles also enable getting the module hierarchy for lite interpreter models, support for which was added to KinetoProfiler in previous diffs.

Followup plan:
1. Enable scoped callbacks so that the lite interpreter can use them to profile only top-level ops.
2. Enable post processing callbacks that take KinetoEvents and populate
module hierarchy using debug handles.

This will let us use KinetoProfiler for lite interpreter use cases on mobile. The aim is to use an RAII guard to similarly generate Chrome traces for mobile use cases as well, although only for top-level ops.

Test Plan:
test_misc : RecordDebugHandles.Basic

Imported from OSS

Reviewed By: ilia-cher

Differential Revision: D29935899

fbshipit-source-id: 4f06dc411b6b5fe0ffaebdd26d3274c96f8f389b
2021-08-13 21:40:14 -07:00
Nikita Shulga
709ac6853a Fix warnings (#62930)
Summary:
Add `-Wno-writable-strings` (which is clang's flavor of `-Wwrite-strings`) to the list of warnings ignored while compiling torch_python.
Avoid unnecessary copies in range loops.
Fix a number of signed-unsigned comparisons.

Found while building locally on M1

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62930

Reviewed By: albanD

Differential Revision: D30171981

Pulled By: malfet

fbshipit-source-id: 25bd43dab5675f927ca707e32737ed178b04651e
2021-08-11 14:07:10 -07:00
Howard Cheng
fa22f6303f [PyTorch] Add flop count for addmm (#61895)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61895

* Add FLOP count for addmm, which should be `2*m*n*k`.

Share the same code path for `addmm` and `mm`.

Test Plan:
Imported from OSS

`python test/test_profiler.py`
Run a sample profile and check that FLOPS for `aten::addmm` is correct.

`[chowar@devbig053.frc2 ~/local/pytorch/build] ninja bin/test_jit`
`[chowar@devbig053.frc2 ~/local/pytorch/build] ./bin/test_jit --gtest_filter='ComputeFlopsTest*'`

Reviewed By: dskhudia

Differential Revision: D29785671

fbshipit-source-id: d1512036202d7234a981bda897af1f75808ccbfe
2021-08-11 12:33:43 -07:00
Kimish Patel
026cfe85b4 Fix InlinedCallStack annotation to account for module calling its own (#61791)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61791

methods from forward

During inlining we attach an InlinedCallStack to the nodes being inlined. In the process we attach module information as well, such that if a CallMethod is being inlined we know which class instance and class type the method belongs to. However, CallMethod can be calling a method of the same object to which the graph belongs, e.g.:

```
def forward(self, input):
  x = input + 10
  return self.forward_impl_(x, input)
```
Here forward_impl_ is a method defined on the same class in which forward is defined. The existing module hierarchy annotation will mislabel this as an unknown instance, since the method is not associated with the output of a GetAttr node (it would be if we had called self.conv.forward_impl_, for example).
The change in this PR reconciles this by creating a placeholder name "SELF" for the module instance, indicating that you can traverse the InlinedCallStack backwards to find the first node with name != SELF, which is the name of the object.
e.g.:
TOP(ResNet)::forward.SELF(ResNet)::_forward_impl.layer1(Sequential)::forward.0(BasicBlock)::forward.conv1(Conv2d)::forward.SELF(Conv2d)::_conv_forward

Test Plan:
Add test

Imported from OSS

Reviewed By: larryliu0820

Differential Revision: D29745443

fbshipit-source-id: 1525e41df53913341c4c36a56772454782a0ba93
2021-07-26 15:00:57 -07:00
Nikita Shulga
a9b0a921d5 Disable avoid-non-const-global-variables lint check (#62008)
Summary:
As the GoogleTest `TEST` macro is non-compliant with it, as is `DEFINE_DISPATCH`.

All changes but the ones to `.clang-tidy` are generated using following script:
```
for i in `find . -type f -iname "*.c*" -or -iname "*.h"|xargs grep cppcoreguidelines-avoid-non-const-global-variables|cut -f1 -d:|sort|uniq`;  do sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" $i; done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008

Reviewed By: driazati, r-barnes

Differential Revision: D29838584

Pulled By: malfet

fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
2021-07-22 18:04:40 -07:00
Bin Bao
add291cf66 [JIT] Add a phase to perform inplace<->functional conversion for activation operators (#57477)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57477

Currently the conversion only deals with activation operators. The legality check is somewhat strict for now.

Test Plan:
```
python test/test_jit.py -k test_functional_to_inplace_activation
python test/test_jit.py -k test_inplace_to_functional_activation
```

Reviewed By: mrshenli

Differential Revision: D28155153

Pulled By: desertfire

fbshipit-source-id: df092830c4dff3ce9578ff76285eb7a566b7d81b
2021-06-03 06:43:23 -07:00
Scott Wolchok
de22657e1c [PyTorch] Replace RecordFunction shouldRun callback with atomic bools (#56504)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56504

Having callbacks registered but disabled via their
`shouldRun` callback defeats the `shouldRunRecordFunction`
optimization (no relation between the two things, despite the
shared prefix on the names) that aims to skip `RecordFunction`
construction.

This diff attempts to safely rectify this issue: we drop support for
`shouldRun` callbacks (this is bc-breaking; does anything use these
externally? do I need to add the support back and just stop using it
internally?), add support for enabling and disabling callbacks, and
(for global callbacks) make doing so thread-safe.

There is an interesting subtlety with `std::atomic` that came up: it
is neither copyable nor movable, which precludes putting it into
`std::vector`. I manually overrode this because the thread safety
reasons it is neither copyable nor movable don't apply here; we
already state that adding or removing callbacks (the operations that
might copy/move an atomic) are not thread-safe and should be done at
initialization time.
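
A C++ sketch of the std::atomic workaround described above (wrapper name is my own; the actual change may structure this differently):

```
#include <atomic>
#include <vector>

// std::atomic<bool> is neither copyable nor movable, so it cannot live in a
// std::vector that may reallocate. A thin wrapper with explicit copy/assign is
// fine here only because callbacks are added/removed at initialization time,
// never concurrently with these copies.
struct CopyableAtomicBool {
  std::atomic<bool> enabled{true};

  CopyableAtomicBool() = default;
  CopyableAtomicBool(const CopyableAtomicBool& other)
      : enabled(other.enabled.load(std::memory_order_relaxed)) {}
  CopyableAtomicBool& operator=(const CopyableAtomicBool& other) {
    enabled.store(other.enabled.load(std::memory_order_relaxed), std::memory_order_relaxed);
    return *this;
  }
};

int main() {
  std::vector<CopyableAtomicBool> callback_enabled(3);  // one flag per callback
  callback_enabled[1].enabled.store(false);             // disable without unregistering
  return 0;
}
```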
ghstack-source-id: 129614296

Test Plan:
Existing CI should cover correctness, right?  Inspected
perf report of a simple benchmark that runs nn.Linear in a loop on
CUDA, where we internally have Kineto initialized and thus had a
shouldRun observer previously; we are no longer going through the
dispatcher's slow RecordFunction path or spending measurable time
constructing RecordFunction instances.

Reviewed By: ilia-cher

Differential Revision: D27834944

fbshipit-source-id: 93db1bc0a28b5372f7307490c908457e7853fa92
2021-05-26 14:31:33 -07:00
Nikita Shulga
3a66a1cb99 [clang-tidy] Exclude cppcoreguidelines-avoid-magic-numbers (#57841)
Summary:
Add cppcoreguidelines-avoid-magic-numbers exclusion to clang-tidy
Remove existing nolint warnings using following script:
```
for file in `git ls-files | grep -v \.py`; do gsed '/^ *\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers)/d' -i  $file; done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57841

Reviewed By: samestep

Differential Revision: D28295045

Pulled By: malfet

fbshipit-source-id: 7c6e8d1213c9593f169ed3df6a916498f1a97163
2021-05-07 20:02:33 -07:00
Luca Wehrstedt
36e47af58b Pass reference to parent future in callbacks (#57635)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57635

Note: this PR looks massive, but it's just one simple change, codemodded many times.

In many cases, a callback needs to access the value/error produced by the parent future. In Python this was easy because the callback was invoked with the parent future as an argument, and could thus inspect it. In C++ the callbacks didn't take any arguments, thus in many cases we worked around this by capturing the future in its own callback. This is risky (it leads to a reference cycle and thus a memory leak) and must be done carefully (spoiler: sometimes we weren't).
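
A C++ sketch of the before/after shape of the callbacks (a simplified stand-in Future, not the real ivalue::Future API):

```
#include <functional>
#include <iostream>

// Simplified stand-in: the callback now receives the parent future by
// reference instead of capturing it.
struct FutureSketch {
  bool has_error = false;
  // Simplification: run the callback immediately; the real future stores it
  // and runs it on completion.
  void addCallback(const std::function<void(FutureSketch&)>& cb) { cb(*this); }
};

int main() {
  FutureSketch fut;
  // Before: the lambda had no parameters, so code captured `fut` itself,
  // creating a future -> callback -> future reference cycle.
  // After: the parent is passed in, so nothing needs to be captured.
  fut.addCallback([](FutureSketch& parent) {
    if (parent.has_error) {
      std::cerr << "parent future completed with an error\n";
    }
  });
  return 0;
}
```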
ghstack-source-id: 128296580

Test Plan: CI

Reviewed By: wanchaol

Differential Revision: D28178783

fbshipit-source-id: 6de02c4568be42123372edc008f630d5ddae0081
2021-05-07 03:59:18 -07:00
Luca Wehrstedt
9aa1461a68 Make wrapPropagateTLSState more generic (#57634)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57634

`wrapPropagateTLSState` was restricting its argument to be an argument-less function, and I need to relax this for later work.

Also, it required its argument to be converted to a `std::function`, and returned a `std::function` itself. Each creation of a `std::function` can cause a heap allocation. It's not particularly expensive, but here we can easily avoid it by having `wrapPropagateTLSState` operate directly on generic callables (thus, possibly, raw lambdas).
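
A hedged C++ sketch of the more generic shape (stand-in TLS types; not the actual implementation):

```
#include <utility>

// Stand-ins for the thread-local state being captured and restored.
struct TLSStateSketch { /* captured thread-local settings */ };
struct TLSGuardSketch { explicit TLSGuardSketch(const TLSStateSketch&) {} };

// Accept any callable with any arguments and return a lambda: no conversion
// to std::function is needed, so wrapping no longer implies a heap allocation.
template <typename Callable>
auto wrapPropagateTLSStateSketch(Callable&& fn) {
  return [state = TLSStateSketch{},
          fn = std::forward<Callable>(fn)](auto&&... args) mutable {
    TLSGuardSketch guard(state);  // restore the captured TLS while fn runs
    return fn(std::forward<decltype(args)>(args)...);
  };
}

int main() {
  auto wrapped = wrapPropagateTLSStateSketch([](int x) { return x + 1; });
  return wrapped(41) == 42 ? 0 : 1;
}
```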
ghstack-source-id: 128295264

Test Plan: CI

Reviewed By: ilia-cher

Differential Revision: D28178782

fbshipit-source-id: d657f5751514974518606dd4fc4175e805dcb90a
2021-05-07 03:58:08 -07:00
Kimish Patel
5326ec60e6 [Inlined Callstack Fix] Fix inlined callstack for blocks of the node. (#56562)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56562

Earlier, inlined callstacks were annotated only on nodes. This left out nodes such as If, which have blocks of nodes. These nodes should also be updated similarly.

Test Plan:
Added test in test_misc

Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D27902516

fbshipit-source-id: 4e65c686fa6b4977e8719db45f71f7d2599d4d8e
2021-05-04 09:21:15 -07:00
Nikita Shulga
4cb534f92e Make PyTorch code-base clang-tidy compliant (#56892)
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os

def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files

def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname,"-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit","--all", "-m", f"NOLINT stubs for {fname}"])

def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)

if __name__ == "__main__":
    main()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892

Reviewed By: H-Huang

Differential Revision: D27991944

Pulled By: malfet

fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
2021-04-28 14:10:25 -07:00
Louis Feng
159fdde9ae Support needsOutputs for RecordFunction and ObserverUtil improvements (#55012)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55012

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54442

Added needsOutputs support to RecordFunction and improved ObserverUtil functions to handle list data. Minor renaming for consistency.

To get output data from kernel calls, we need to temporarily capture them before passing them to the record function. Then the results are released to the function's return. We handle two cases, for unboxed and boxed kernels. The boxed version is fairly simple since all outputs are stored in the stack object. For unboxed kernel calls, we added a `ReturnValue` utility class to properly handle the different return values of unboxed kernels.

For optimization, this intermediate capture is only enabled for observers that request `needsOutputs(true)` and should not affect other observers or when the observer is not enabled.

Test Plan:
```
=> buck build //caffe2/test/cpp/jit: --show-output
=> buck-out/gen/caffe2/test/cpp/jit/jit --gtest_filter=RecordFunctionTest*
CUDA not available. Disabling CUDA and MultiCUDA tests
Note: Google Test filter = RecordFunctionTest*-*_CUDA:*_MultiCUDA
[==========] Running 7 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 7 tests from RecordFunctionTest
[ RUN      ] RecordFunctionTest.TracedTestInputsOutputs
[       OK ] RecordFunctionTest.TracedTestInputsOutputs (226 ms)
[ RUN      ] RecordFunctionTest.SampledCallbacks
[       OK ] RecordFunctionTest.SampledCallbacks (771 ms)
[ RUN      ] RecordFunctionTest.RecordFunctionGuard
[       OK ] RecordFunctionTest.RecordFunctionGuard (0 ms)
[ RUN      ] RecordFunctionTest.Callbacks
[       OK ] RecordFunctionTest.Callbacks (2 ms)
[ RUN      ] RecordFunctionTest.ShouldRun
[       OK ] RecordFunctionTest.ShouldRun (0 ms)
[ RUN      ] RecordFunctionTest.Basic
[       OK ] RecordFunctionTest.Basic (1 ms)
[ RUN      ] RecordFunctionTest.OperatorNameOverload
[       OK ] RecordFunctionTest.OperatorNameOverload (1 ms)
[----------] 7 tests from RecordFunctionTest (1001 ms total)

[----------] Global test environment tear-down
[==========] 7 tests from 1 test case ran. (1002 ms total)
[  PASSED  ] 7 tests.

```

Reviewed By: ilia-cher

Differential Revision: D27449877

fbshipit-source-id: 69918b729565f5899471d9db42a587f9af52238d
2021-04-02 15:16:17 -07:00
Qi Zhao
5b448cf21a Revert D25966661: Support needsOutputs for RecordFunction and ObserverUtil improvements
Test Plan: revert-hammer

Differential Revision:
D25966661 (0e43a73f76)

Original commit changeset: 707886e1f212

fbshipit-source-id: a4e4af29abf622c1e0aaaf7dfb019c045988b4bc
2021-03-30 15:41:12 -07:00
Louis Feng
0e43a73f76 Support needsOutputs for RecordFunction and ObserverUtil improvements (#54442)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54442

Added needsOutputs support to RecordFunction and improved ObserverUtil functions to handle list data. Minor renaming for consistency.

To get output data from kernel calls, we need to temporarily capture them before passing them to the record function. Then the results are released to the function's return. We handle two cases, for unboxed and boxed kernels. The boxed version is fairly simple since all outputs are stored in the stack object. For unboxed kernel calls, we added a `ReturnValue` utility class to properly handle the different return values of unboxed kernels.

For optimization, this intermediate capture is only enabled for observers that request `needsOutputs(true)` and should not affect other observers or when the observer is not enabled.

Test Plan:
```
=> buck build //caffe2/test/cpp/jit: --show-output
=> buck-out/gen/caffe2/test/cpp/jit/jit --gtest_filter=RecordFunctionTest*
CUDA not available. Disabling CUDA and MultiCUDA tests
Note: Google Test filter = RecordFunctionTest*-*_CUDA:*_MultiCUDA
[==========] Running 7 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 7 tests from RecordFunctionTest
[ RUN      ] RecordFunctionTest.TracedTestInputsOutputs
[       OK ] RecordFunctionTest.TracedTestInputsOutputs (226 ms)
[ RUN      ] RecordFunctionTest.SampledCallbacks
[       OK ] RecordFunctionTest.SampledCallbacks (771 ms)
[ RUN      ] RecordFunctionTest.RecordFunctionGuard
[       OK ] RecordFunctionTest.RecordFunctionGuard (0 ms)
[ RUN      ] RecordFunctionTest.Callbacks
[       OK ] RecordFunctionTest.Callbacks (2 ms)
[ RUN      ] RecordFunctionTest.ShouldRun
[       OK ] RecordFunctionTest.ShouldRun (0 ms)
[ RUN      ] RecordFunctionTest.Basic
[       OK ] RecordFunctionTest.Basic (1 ms)
[ RUN      ] RecordFunctionTest.OperatorNameOverload
[       OK ] RecordFunctionTest.OperatorNameOverload (1 ms)
[----------] 7 tests from RecordFunctionTest (1001 ms total)

[----------] Global test environment tear-down
[==========] 7 tests from 1 test case ran. (1002 ms total)
[  PASSED  ] 7 tests.

```

Reviewed By: ilia-cher

Differential Revision: D25966661

fbshipit-source-id: 707886e1f212f40ba16a1fe292ea7dd33f2646e3
2021-03-30 14:26:22 -07:00
Pritam Damania
267fc27d39 Ensure torch.futures.wait_all exits early on error. (#53953)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53953

torch.futures.wait_all would wait for all specified futures to complete before it returned. As a result, if there was an error it would still wait for a long time (e.g. long-running RPCs) before it returned an error to the user.

This PR ensures `wait_all` returns an error as soon as any future runs into an error and doesn't wait for all futures to complete.

I removed the logic in _invoke_rpc_python_udf which raised an error in the unwrap function, because ideally the error should be set on the Future and not be raised to the user only when `wait()` is called. As an example, in the case of `wait_all`, the user never calls `wait()` on the future that errored out but on a future down the chain, and we should propagate these errors via `setError` instead.
ghstack-source-id: 124721216

Test Plan:
1) Unit test added.
2) waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D27032362

fbshipit-source-id: c719e2277c27ff3d45f1511d5dc6f1f71a03e3a8
2021-03-25 07:39:14 -07:00
Elias Ellison
9a990dafd9 Add a filter to remove mutation (#51923)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51923

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D26696700

Pulled By: eellison

fbshipit-source-id: 9665e9b786f55b6e5b98420eae19de262d46bb96
2021-03-01 21:22:33 -08:00
Xu Zhao
4fdebdc0c9 Improve PyTorch profiler flop computation formulas (#51377)
Summary:
Improve the FLOPs computation formula of the aten::conv2d operator to support the stride, pad, dilation, and groups arguments (a cost-model sketch follows after the list below).

This diff also fixes the following issues:
- Apply a factor of 2 to aten::mm because each output element accounts for both a multiplication and an addition.
- Fix incorrect names of scalar operators to aten::mul and aten::add.
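
For reference, the standard cost model such a formula needs to cover, as a C++ sketch (symmetric stride/pad/dilation assumed for brevity; the actual profiler code may handle the parameters differently):

```
#include <cstdint>

// Each output element of a grouped conv2d accumulates (C_in / groups) * kH * kW
// multiply-add pairs, hence the factor of 2; multiply by the number of output
// elements N * C_out * H_out * W_out.
int64_t conv2d_flops_sketch(int64_t N, int64_t C_in, int64_t C_out,
                            int64_t H_in, int64_t W_in,
                            int64_t kH, int64_t kW,
                            int64_t stride, int64_t pad,
                            int64_t dilation, int64_t groups) {
  const int64_t H_out = (H_in + 2 * pad - dilation * (kH - 1) - 1) / stride + 1;
  const int64_t W_out = (W_in + 2 * pad - dilation * (kW - 1) - 1) / stride + 1;
  return 2 * N * C_out * H_out * W_out * (C_in / groups) * kH * kW;
}
```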

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51377

Test Plan:
```python
python test/test_profiler.py
```

Reviewed By: jspark1105

Differential Revision: D26165223

Pulled By: xuzhao9

fbshipit-source-id: 2c5f0155c47af2e6a19332fd6ed73ace47fa072a
2021-02-02 11:49:04 -08:00
Scott Wolchok
4a0d17ba2d [PyTorch][codemod] Replace immediately-dereferenced expect calls w/expectRef (#50228)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50228

`fastmod -m 'expect(<((at|c10)::)?\w+Type>\(\)\s*)->'
'expectRef${1}.'`
Presuming it builds, this is a safe change: the result of `expect()`
wasn't being saved anywhere, so we didn't need it, so we can take a
reference instead of a new `shared_ptr`.
ghstack-source-id: 119782961

Test Plan: CI

Reviewed By: SplitInfinity

Differential Revision: D25837374

fbshipit-source-id: 86757b70b1520e3dbaa141001e7976400cdd3b08
2021-01-13 16:13:55 -08:00
Andres Suarez
8530c65e25 [codemod][fbcode/caffe2] Apply clang-format update fixes
Test Plan: Sandcastle and visual inspection.

Reviewed By: igorsugak

Differential Revision: D25849205

fbshipit-source-id: ef664c1ad4b3ee92d5c020a5511b4ef9837a09a0
2021-01-09 14:37:36 -08:00
Xu Zhao
573f4aa352 FLOPS Roofline Analysis Feature for PyTorch Profiler. (#46506)
Summary:
FLOPs Roofline Analysis Feature for PyTorch Profiler.

Currently, PyTorch Profiler lacks the ability to measure the FLOPs of operators, such as mm and conv.
FLOPs are helpful to estimate the computation complexity of the operators.
For now, we use input shapes to estimate the number of floating point operations.
In the future, we may compute this information by tracking hardware counters.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46506

Test Plan:
Run `python test/test_profiler_flops.py -k test_flops`. The test will print a profiler table with a "FLOPS" column, like the following:
```
----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ---------------------------------------------  ------------
                        Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls                                   Input Shapes        MFLOPS
----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ---------------------------------------------  ------------
                aten::matmul         0.06%      57.653us        82.97%      79.310ms      79.310ms             1                 [[40, 33, 1, 243], [243, 243]]            --
                    aten::mm        82.84%      79.186ms        82.86%      79.204ms      79.204ms             1                      [[1320, 243], [243, 243]]       984.323
                aten::conv2d         0.04%      36.345us        16.06%      15.347ms      15.347ms             1  [[40, 16, 18, 260], [33, 16, 18, 18], [33], [  44065010.318
           aten::convolution         0.02%      16.016us        16.02%      15.310ms      15.310ms             1  [[40, 16, 18, 260], [33, 16, 18, 18], [33], [            --
          aten::_convolution         0.07%      63.855us        16.00%      15.294ms      15.294ms             1  [[40, 16, 18, 260], [33, 16, 18, 18], [33], [            --
    aten::mkldnn_convolution        15.89%      15.188ms        15.93%      15.225ms      15.225ms             1  [[40, 16, 18, 260], [33, 16, 18, 18], [33], [            --
                  aten::relu         0.10%      98.223us         0.64%     612.157us     306.079us             2                             [[40, 33, 1, 243]]            --
             aten::threshold         0.49%     465.416us         0.54%     513.934us     256.967us             2                     [[40, 33, 1, 243], [], []]            --
                  aten::add_         0.29%     279.301us         0.29%     279.301us     279.301us             1                  [[40, 33, 1, 243], [243], []]            --
                 aten::empty         0.10%      99.113us         0.10%      99.113us      24.778us             4                       [[], [], [], [], [], []]            --
----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ---------------------------------------------  ------------
Self CPU time total: 95.584ms

.
----------------------------------------------------------------------
Ran 1 test in 0.176s
```

For now, we only provide FLOPs calculation for aten::conv2d and aten::mm operators.

Reviewed By: ezyang

Differential Revision: D25214452

Pulled By: xuzhao9

fbshipit-source-id: 0ae841bd8dbdeb032346dc3d9d38e19875aa1da3
2020-12-17 21:19:25 -08:00
Scott Wolchok
22c6dafd33 [PyTorch] Use plain old function pointer for RecordFunctionCallback (reapply) (#49408)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49408

Nearly every non-test callsite doesn't need to capture any variables anyway, and this saves 48 bytes per callback.
ghstack-source-id: 118665808

Test Plan:
Wait for GitHub CI since we had C++14-specific issues with
this one in previous PR https://github.com/pytorch/pytorch/pull/48629

Reviewed By: malfet

Differential Revision: D25563207

fbshipit-source-id: 6a2831205917d465f8248ca37429ba2428d5626d
2020-12-15 19:16:01 -08:00
Mike Ruberry
25bc906281 Revert D25135415: [PyTorch] Use plain old function pointer for RecordFunctionCallback
Test Plan: revert-hammer

Differential Revision:
D25135415 (7e23ee1598)

Original commit changeset: 5e92dc79da64

fbshipit-source-id: 45b1634a100084c84dca158a1f16ca760fef6988
2020-12-14 21:04:27 -08:00
Scott Wolchok
7e23ee1598 [PyTorch] Use plain old function pointer for RecordFunctionCallback (#48629)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48629

Nearly every non-test callsite doesn't need to capture any variables anyway, and this saves 48 bytes per callback.
ghstack-source-id: 118568240

Test Plan: CI

Reviewed By: dhruvbird

Differential Revision: D25135415

fbshipit-source-id: 5e92dc79da6473ed15d1e381a21ed315879168f3
2020-12-14 20:08:16 -08:00
Scott Wolchok
900aa4ee97 [PyTorch] remove convenience RecordFunctionCallback interface (#48620)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48620

In preparation for storing bare function pointer (8 bytes)
instead of std::function (32 bytes).
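
A quick C++ sketch of the size difference being chased (exact numbers are implementation-defined; 32 vs 8 bytes matches common 64-bit libstdc++ builds):

```
#include <cstdio>
#include <functional>

using BareCallback = void (*)();  // what the callback becomes after this stack

void noopCallback() {}

int main() {
  std::function<void()> boxed = noopCallback;  // type-erased wrapper
  BareCallback raw = noopCallback;             // plain pointer

  std::printf("std::function<void()>: %zu bytes, function pointer: %zu bytes\n",
              sizeof(boxed), sizeof(raw));     // typically 32 vs 8 on 64-bit
  return 0;
}
```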
ghstack-source-id: 118568242

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D25132183

fbshipit-source-id: 3790cfb5d98479a46cf665b14eb0041a872c13da
2020-12-14 20:03:15 -08:00
Chen Lai
416dc68341 [Pytorch][Annotation] Update inlined callstack with module instance info (#47416)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47416

Test Plan: Imported from OSS

Reviewed By: kimishpatel

Differential Revision: D24752846

Pulled By: cccclai

fbshipit-source-id: 94d3c18c56161d1de3a16bb7c93502fedf71644c
2020-12-03 10:44:46 -08:00
Scott Wolchok
d1df4038ff [PyTorch] Make RecordFunctionCallback::should_run_ a function pointer (#48274)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48274

The std::function-ness of it was used only for tests. (std::function is huge at 32 bytes, and not particularly efficient.)
ghstack-source-id: 117498491

Test Plan: CI

Reviewed By: dzhulgakov

Differential Revision: D25102077

fbshipit-source-id: fd941ddf32235a9659a1a17609c27cc5cb446a54
2020-12-01 13:02:25 -08:00
Ilia Cherniavskii
f7a8bf2855 Use libkineto in profiler (#46470)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46470

Adding ability to use Kineto (CUPTI) to profile CUDA kernels

Test Plan:
USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
python test/test_profiler.py

python test/test_autograd.py -k test_profile
python test/test_autograd.py -k test_record

```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                       Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       1.000us             2
                                      sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       2.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                       Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                                            aten::randn         5.17%      74.000us         6.71%      96.000us      48.000us       0.000us         0.00%       0.000us       0.000us             2
                                            aten::empty         1.33%      19.000us         1.33%      19.000us       4.750us       0.000us         0.00%       0.000us       0.000us             4
                                          aten::normal_         1.05%      15.000us         1.05%      15.000us       7.500us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::to        77.90%       1.114ms        91.61%       1.310ms     436.667us       0.000us         0.00%       3.000us       1.000us             3
                                    aten::empty_strided         2.52%      36.000us         2.52%      36.000us      12.000us       0.000us         0.00%       0.000us       0.000us             3
                                            aten::copy_         2.73%      39.000us        11.19%     160.000us      53.333us       0.000us         0.00%       3.000us       1.000us             3
                                        cudaMemcpyAsync         4.34%      62.000us         4.34%      62.000us      20.667us       0.000us         0.00%       0.000us       0.000us             3
                                  cudaStreamSynchronize         1.61%      23.000us         1.61%      23.000us       7.667us       0.000us         0.00%       0.000us       0.000us             3
                                               aten::mm         0.21%       3.000us         7.20%     103.000us     103.000us       0.000us         0.00%       2.000us       2.000us             1
                                           aten::stride         0.21%       3.000us         0.21%       3.000us       1.000us       0.000us         0.00%       0.000us       0.000us             3
                                       cudaLaunchKernel         2.45%      35.000us         2.45%      35.000us      17.500us       0.000us         0.00%       0.000us       0.000us             2
                                              aten::add         0.49%       7.000us         4.27%      61.000us      61.000us       0.000us         0.00%       1.000us       1.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
```

benchmark: https://gist.github.com/ilia-cher/a5a9eb6b68504542a3cad5150fc39b1a

Reviewed By: Chillee

Differential Revision: D25142223

Pulled By: ilia-cher

fbshipit-source-id: b0dff46c28da5fb0a8e01cf548aa4f2b723fde80
2020-11-25 04:32:16 -08:00
Elias Ellison
a00ba63023 Disable old fuser internally (#48322)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48322

Disable old fuser internally. I would like to find where we are inadvertently setting the old fuser, but in the meantime I would like to land a diff that I know will 100% cause it not to be run, and verify that it fixes the issue.

Test Plan: sandcastle

Reviewed By: ZolotukhinM

Differential Revision: D25126202

fbshipit-source-id: 5a4d0742f5f829e536f50e7ede1256c94dd05232
2020-11-21 00:42:23 -08:00