Commit Graph

88 Commits

Author SHA1 Message Date
Scott Wolchok
4a0d17ba2d [PyTorch][codemod] Replace immediately-dereferenced expect calls w/expectRef (#50228)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50228

`fastmod -m 'expect(<((at|c10)::)?\w+Type>\(\)\s*)->'
'expectRef${1}.'`
Presuming it builds, this is a safe change: the result of `expect()`
wasn't being saved anywhere, so we didn't need it, so we can take a
reference instead of a new `shared_ptr`.
ghstack-source-id: 119782961

Test Plan: CI

Reviewed By: SplitInfinity

Differential Revision: D25837374

fbshipit-source-id: 86757b70b1520e3dbaa141001e7976400cdd3b08
2021-01-13 16:13:55 -08:00
Andres Suarez
8530c65e25 [codemod][fbcode/caffe2] Apply clang-format update fixes
Test Plan: Sandcastle and visual inspection.

Reviewed By: igorsugak

Differential Revision: D25849205

fbshipit-source-id: ef664c1ad4b3ee92d5c020a5511b4ef9837a09a0
2021-01-09 14:37:36 -08:00
Xu Zhao
573f4aa352 FLOPS Roofline Analysis Feature for PyTorch Profiler. (#46506)
Summary:
FLOPs Roofline Analysis Feature for PyTorch Profiler.

Currently, PyTorch Profiler lacks the ability to measure the FLOPs of operators, such as mm and conv.
FLOPs are helpful to estimate the computation complexity of the operators.
For now, we use input shapes to estimate the number of floating pointer operations.
In the future, we may compute this information by tracking hardware counters.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46506

Test Plan:
Run `python test/test_profiler_flops.py -k test_flops`. The test will print a profiler table with "FLOPS" column, like the following:
----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ---------------------------------------------  ------------
                        Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls                                   Input Shapes        MFLOPS
----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ---------------------------------------------  ------------
                aten::matmul         0.06%      57.653us        82.97%      79.310ms      79.310ms             1                 [[40, 33, 1, 243], [243, 243]]            --
                    aten::mm        82.84%      79.186ms        82.86%      79.204ms      79.204ms             1                      [[1320, 243], [243, 243]]       984.323
                aten::conv2d         0.04%      36.345us        16.06%      15.347ms      15.347ms             1  [[40, 16, 18, 260], [33, 16, 18, 18], [33], [  44065010.318
           aten::convolution         0.02%      16.016us        16.02%      15.310ms      15.310ms             1  [[40, 16, 18, 260], [33, 16, 18, 18], [33], [            --
          aten::_convolution         0.07%      63.855us        16.00%      15.294ms      15.294ms             1  [[40, 16, 18, 260], [33, 16, 18, 18], [33], [            --
    aten::mkldnn_convolution        15.89%      15.188ms        15.93%      15.225ms      15.225ms             1  [[40, 16, 18, 260], [33, 16, 18, 18], [33], [            --
                  aten::relu         0.10%      98.223us         0.64%     612.157us     306.079us             2                             [[40, 33, 1, 243]]            --
             aten::threshold         0.49%     465.416us         0.54%     513.934us     256.967us             2                     [[40, 33, 1, 243], [], []]            --
                  aten::add_         0.29%     279.301us         0.29%     279.301us     279.301us             1                  [[40, 33, 1, 243], [243], []]            --
                 aten::empty         0.10%      99.113us         0.10%      99.113us      24.778us             4                       [[], [], [], [], [], []]            --
----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ---------------------------------------------  ------------
Self CPU time total: 95.584ms

.
----------------------------------------------------------------------
Ran 1 test in 0.176s

For now, we only provide FLOPs calculation for aten::conv2d and aten::mm operators.

Reviewed By: ezyang

Differential Revision: D25214452

Pulled By: xuzhao9

fbshipit-source-id: 0ae841bd8dbdeb032346dc3d9d38e19875aa1da3
2020-12-17 21:19:25 -08:00
Scott Wolchok
22c6dafd33 [PyTorch] Use plain old function pointer for RecordFunctionCallback (reapply) (#49408)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49408

Nearly every non-test callsite doesn't need to capture any variables anyway, and this saves 48 bytes per callback.
ghstack-source-id: 118665808

Test Plan:
Wait for GitHub CI since we had C++14-specific issues with
this one in previous PR https://github.com/pytorch/pytorch/pull/48629

Reviewed By: malfet

Differential Revision: D25563207

fbshipit-source-id: 6a2831205917d465f8248ca37429ba2428d5626d
2020-12-15 19:16:01 -08:00
Mike Ruberry
25bc906281 Revert D25135415: [PyTorch] Use plain old function pointer for RecordFunctionCallback
Test Plan: revert-hammer

Differential Revision:
D25135415 (7e23ee1598)

Original commit changeset: 5e92dc79da64

fbshipit-source-id: 45b1634a100084c84dca158a1f16ca760fef6988
2020-12-14 21:04:27 -08:00
Scott Wolchok
7e23ee1598 [PyTorch] Use plain old function pointer for RecordFunctionCallback (#48629)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48629

Nearly every non-test callsite doesn't need to capture any variables anyway, and this saves 48 bytes per callback.
ghstack-source-id: 118568240

Test Plan: CI

Reviewed By: dhruvbird

Differential Revision: D25135415

fbshipit-source-id: 5e92dc79da6473ed15d1e381a21ed315879168f3
2020-12-14 20:08:16 -08:00
Scott Wolchok
900aa4ee97 [PyTorch] remove convenience RecordFunctionCallback interface (#48620)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48620

In preparation for storing bare function pointer (8 bytes)
instead of std::function (32 bytes).
ghstack-source-id: 118568242

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D25132183

fbshipit-source-id: 3790cfb5d98479a46cf665b14eb0041a872c13da
2020-12-14 20:03:15 -08:00
Chen Lai
416dc68341 [Pytorch][Annotation] Update inlined callstack with module instance info (#47416)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47416

Test Plan: Imported from OSS

Reviewed By: kimishpatel

Differential Revision: D24752846

Pulled By: cccclai

fbshipit-source-id: 94d3c18c56161d1de3a16bb7c93502fedf71644c
2020-12-03 10:44:46 -08:00
Scott Wolchok
d1df4038ff [PyTorch] Make RecordFunctionCallback::should_run_ a function pointer (#48274)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48274

The std::function-ness of it was used only for tests. (std::function is huge at 32 bytes, and not particularly efficient.)
ghstack-source-id: 117498491

Test Plan: CI

Reviewed By: dzhulgakov

Differential Revision: D25102077

fbshipit-source-id: fd941ddf32235a9659a1a17609c27cc5cb446a54
2020-12-01 13:02:25 -08:00
Ilia Cherniavskii
f7a8bf2855 Use libkineto in profiler (#46470)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46470

Adding ability to use Kineto (CUPTI) to profile CUDA kernels

Test Plan:
USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
python test/test_profiler.py

python test/test_autograd.py -k test_profile
python test/test_autograd.py -k test_record

```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                       Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       1.000us             2
                                      sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       2.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                       Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                                            aten::randn         5.17%      74.000us         6.71%      96.000us      48.000us       0.000us         0.00%       0.000us       0.000us             2
                                            aten::empty         1.33%      19.000us         1.33%      19.000us       4.750us       0.000us         0.00%       0.000us       0.000us             4
                                          aten::normal_         1.05%      15.000us         1.05%      15.000us       7.500us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::to        77.90%       1.114ms        91.61%       1.310ms     436.667us       0.000us         0.00%       3.000us       1.000us             3
                                    aten::empty_strided         2.52%      36.000us         2.52%      36.000us      12.000us       0.000us         0.00%       0.000us       0.000us             3
                                            aten::copy_         2.73%      39.000us        11.19%     160.000us      53.333us       0.000us         0.00%       3.000us       1.000us             3
                                        cudaMemcpyAsync         4.34%      62.000us         4.34%      62.000us      20.667us       0.000us         0.00%       0.000us       0.000us             3
                                  cudaStreamSynchronize         1.61%      23.000us         1.61%      23.000us       7.667us       0.000us         0.00%       0.000us       0.000us             3
                                               aten::mm         0.21%       3.000us         7.20%     103.000us     103.000us       0.000us         0.00%       2.000us       2.000us             1
                                           aten::stride         0.21%       3.000us         0.21%       3.000us       1.000us       0.000us         0.00%       0.000us       0.000us             3
                                       cudaLaunchKernel         2.45%      35.000us         2.45%      35.000us      17.500us       0.000us         0.00%       0.000us       0.000us             2
                                              aten::add         0.49%       7.000us         4.27%      61.000us      61.000us       0.000us         0.00%       1.000us       1.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
```

benchmark: https://gist.github.com/ilia-cher/a5a9eb6b68504542a3cad5150fc39b1a

Reviewed By: Chillee

Differential Revision: D25142223

Pulled By: ilia-cher

fbshipit-source-id: b0dff46c28da5fb0a8e01cf548aa4f2b723fde80
2020-11-25 04:32:16 -08:00
Elias Ellison
a00ba63023 Disable old fuser internally (#48322)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48322

Disable old fuser internally. I would like to find where we are inadvertently setting the old fuser, but in the meantime I would like to land a diff that I know will 100% cause it not to be run, and verify that it fixes the issue.

Test Plan: sandcastle

Reviewed By: ZolotukhinM

Differential Revision: D25126202

fbshipit-source-id: 5a4d0742f5f829e536f50e7ede1256c94dd05232
2020-11-21 00:42:23 -08:00
Scott Wolchok
bef460a803 [PyTorch] Return raw ptr from ThreadLocalDebugInfo::get() (#47796)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47796

`ThreadLocalDebugInfo::get()` is a hot function. For example, it is called by `DefaultCPUAllocator::allocate()`. Most callers do not even bother to keep the returned `shared_ptr` around, proving that they have no lifetime issues currently. For the rest, it appears that the only way that the returned pointer could become invalid is if they then called a function that swapped out `ThreadLocalDebugInfo` using `ThreadLocalStateGuard`. There are very few such paths, and it doesn't look like any current callers of `ThreadLocalDebugInfo::get()` needed a `shared_ptr` at all.
ghstack-source-id: 116979577

Test Plan:
1) reviewers to double-check audit of safety
2) run framework overhead benchmarks

Reviewed By: dzhulgakov

Differential Revision: D24902978

fbshipit-source-id: d684737cc2568534cac7cd3fb8d623b971c2fd28
2020-11-18 20:37:17 -08:00
Dhruv Matani
9c1a41b724 [RFC] Add OperatorHandle overload to the RecordFunction::before() method (#46401)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46401

Broader context about selective/custom build available at https://fb.quip.com/2oEzAR5MKqbD and https://fb.workplace.com/groups/pytorch.mobile.team/permalink/735794523641956/

Basically, we want to be able to trace full operator names (with overload name). The current observer infra picks up the operator name from the schema, which doesn't seem to include the overload name. To ensure consistency with the existing uses and to accomodate the new use-case, this diff adds a new overload to accept an `OperatorHandle` object, and the code in `before()` eagerly resolves it to an `OperatorName` object (which can be cached in a member variable) as well as a string (view) operator-name which has the same semantics as before.

Why do we pass in an `OperatorHandle` but then resolve it to an `OperatorName`? This might come across as a strange design choice (and it is), but it is grounded in practicality.

It is not reasonable to cache an `OperatorHandle` object but caching an `OperatorName` object is reasonable since it holds all the data itself.

An initial version of this change was trying to test this change in the `xplat` repo, which didn't work. Thanks to ilia-cher for pointing out that the dispatcher observing mechanism is disabled under a compile time flag (macro) for xplat.
ghstack-source-id: 114360747

Test Plan:
`buck test fbcode/caffe2/fb/test:record_function_test` succeeds. Also replicated this test in OSS in the file `test_misc.cpp` where the rest of the `RecordFunction` subsystem is being tested.

Ran benchmark as reqiested by ilia-cher

{P146511280}

Reviewed By: ilia-cher

Differential Revision: D24315241

fbshipit-source-id: 239f3081e6aa2e26c3021a7dd61f328b723b03d9
2020-10-28 22:38:26 -07:00
Ansley Ussery
5072728d88 Fix stride printing/parsing formatting (#45156)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45156

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D24078695

Pulled By: ansley

fbshipit-source-id: dab993277d43b31105c38d12098c37653747b42a
2020-10-06 15:06:46 -07:00
Ilia Cherniavskii
f5c95d5cf1 Source code level attribution in profiler (#43898)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43898

Adding with_source parameter to enable tracking source code
(filename and line) in profiler for eager, torchscript and autograd
modes

Test Plan:
python test/test_profiler.py
```
Name                                 Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  Source Location
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  --------------------------------------------
ts_method_1                          10.43%           235.364us        36.46%           822.920us        822.920us        1                test/test_profiler.py(70): test_source
aten::add                            7.52%            169.833us        8.88%            200.439us        200.439us        1                test/test_profiler.py(69): test_source
aten::normal_                        6.26%            141.380us        6.26%            141.380us        141.380us        1                test/test_profiler.py(67): test_source
aten::add                            5.80%            130.830us        8.41%            189.800us        63.267us         3                test/test_profiler.py(72): test_source
aten::sum                            5.02%            113.340us        8.39%            189.475us        189.475us        1                test/test_profiler.py(64): ts_method_1
aten::add                            4.58%            103.346us        6.33%            142.847us        142.847us        1                test/test_profiler.py(62): ts_method_1
aten::mul                            4.05%            91.498us         9.62%            217.113us        217.113us        1                test/test_profiler.py(71): test_source
aten::add                            4.03%            90.880us         5.60%            126.405us        126.405us        1                test/test_profiler.py(58): ts_method_2
aten::empty                          3.49%            78.735us         3.49%            78.735us         19.684us         4                test/test_profiler.py(72): test_source
```

Reviewed By: ngimel

Differential Revision: D23432664

Pulled By: ilia-cher

fbshipit-source-id: 83ad7ebe0c2502494d3b48c4e687802db9c77615
2020-09-30 00:57:35 -07:00
Rohan Varma
27ab9bc0f9 [RPC profiling] Extend RPC profiling to support async function execution over RPC. (#44664)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44664

Closes https://github.com/pytorch/pytorch/issues/39971. This PR adds support for functions decorated with `rpc.functions.async_execution` to be profiled over RPC as builtins, jit functions, and blocking python UDFs currently can be. The reasoning for this is to provide complete feature support in terms of RPC profiling and the various types of functions users can run.

To enable this, the PR below this enables calling `disableProfiler()` safely from another thread. We use that functionality to defer disabling the profiler on the server until the future corresponding to the RPC request completes (rather than only the blocking `processRPC` call as was done previously). Since when the future completes we've kicked off the async function and the future corresponding to it has completed, we are able to capture any RPCs the function would have called and the actual work done on the other node.

For example, if the following async function is ran on a server over RPC:

```
def slow_add(x, y):
    time.sleep(1)
    return torch.add(x, y)

rpc.functions.async_execution
def slow_async_add(to, x, y):
    return rpc.rpc_async(to, slow_add, args=(x, y))
```

we expect to see the original RPC profiled, the nested RPC profiled, and the actual torch.add() work. All of these events should be recorded with the correct node id. Here is an example profiling output:

```
-------------------------------------------------------------------------------------------------------------------------  ---------------  ---------------  ---------------  --------
-------  ---------------  ---------------  ---------------
Name                                                                                                                       Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  Node ID
-------------------------------------------------------------------------------------------------------------------------  ---------------  ---------------  ---------------  --------
-------  ---------------  ---------------  ---------------                                                                                                                            rpc_async#slow_async_add(worker1 -> worker2)                                                                               0.00%            0.000us          0                1.012s
         1.012s           1                1
aten::empty                                                                                                                7.02%            11.519us         7.02%            11.519us         11.519us         1                1
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)                             0.00%            0.000us          0                1.006s
         1.006s           1                2                                                                                                                                          rpc_async#slow_async_add(worker1 -> worker2)#remote_op: aten::empty                                                        7.21%            11.843us         7.21%            11.843us
         11.843us         1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::add        71.94%           118.107us        85.77%           140.802us        140.802us        1                3
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::empty      13.82%           22.695us         13.82%           22.695us
         22.695us         1                3                                                                                                                                          -------------------------------------------------------------------------------------------------------------------------  ---------------  ---------------  ---------------  --------
-------  ---------------  ---------------  ---------------
Self CPU time total: 164.164us
```

This PR also moves a bunch of the profiling logic to `rpc/utils.cpp` to declutter `request_callback` code.
ghstack-source-id: 112868470

Test Plan:
```
rvarm1@devbig978:fbcode  (52dd34f6)$ buck test mode/no-gpu mode/dev-nosan //caffe2/test/distributed/rpc:process_group_agent -- test_rpc_profiling_async_function --print-passing-details --stress-runs 1
```

Reviewed By: mrshenli

Differential Revision: D23638387

fbshipit-source-id: eedb6d48173a4ecd41d70a9c64048920bd4807c4
2020-09-25 13:19:26 -07:00
Michael Suo
22401b850b port all JIT tests to gtest (#45264)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45264

Context for why we are porting to gtest in: https://github.com/pytorch/pytorch/pull/45018.

This PR completes the process of porting and removes unused files/macros.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23901392

Pulled By: suo

fbshipit-source-id: 89526890e1a49462f3f77718f4ee273c5bc578ba
2020-09-25 11:37:43 -07:00
Michael Suo
6d21d5f0b3 gtest-ify JIT tests, through the letter c (#45249)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45249

Reland of https://github.com/pytorch/pytorch/pull/45055 and
https://github.com/pytorch/pytorch/pull/45020

See https://github.com/pytorch/pytorch/pull/45018 for context.

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D23892645

Pulled By: suo

fbshipit-source-id: e7fe58d5e1a5a0c44f4e2aec9694145afabde0fd
2020-09-24 00:21:20 -07:00
Michael Suo
e9aa6898ab Revert D23802296: gtest-ify JIT tests, through the letter c
Test Plan: revert-hammer

Differential Revision:
D23802296 (d2b045030e)

Original commit changeset: 20c9798a414e

fbshipit-source-id: a28d56039ca404fe94ed7572f1febd1673e3e788
2020-09-23 17:42:19 -07:00
Michael Suo
d2b045030e gtest-ify JIT tests, through the letter c (#45020)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45020

See https://github.com/pytorch/pytorch/pull/45018 for context.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23802296

Pulled By: suo

fbshipit-source-id: 20c9798a414e9ba30869a862012cbdee0613c8b1
2020-09-23 14:28:45 -07:00
Rohan Varma
70d2e4d1f6 [RPC profiling] Allow disableProfiler() to be called from another thread. (#44653)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44653

This changes the profiler per a discussion with ilia-cher offline that enables `disableProfiler()` event consolidation logic to be called from different threads (i.e. threads where the profiler was not explicitly enabled). This is needed to support the functionality enabled by D23638387 where we defer profiling event collection until executing an async callback that can execute on a different thread, to support RPC async function profiling.

This is done by introducing 2 flags `cleanupTLSState` and `consolidate` which controls whether we should clean up thread local settings (we don't do this when calling `disableProfiler()` on non-main threads) and whether we should consolidate all profiled events. Backwards compatiblity is ensured since both options are true by default.

Added a test in `test_misc.cpp` to test this.
ghstack-source-id: 112605620

Reviewed By: mrshenli

Differential Revision: D23638499

fbshipit-source-id: f5bbb0d41ef883c5e5870bc27e086b8b8908f46b
2020-09-22 21:16:58 -07:00
Michael Suo
374e9373b5 [jit] Pull (most) tests out of libtorch_python (#44795)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44795

Today, we build our cpp tests twice, once as a standalone gtest binary,
and once linked in `libtorch_python` so we can call them from
`test_jit.py`.

This is convenient (it means that `test_jit.py` is a single entry point
for all our tests), but has a few drawbacks:
1. We can't actually use the gtest APIs, since we don't link gtest into
`libtorch_python`. We're stuck with the subset that we want to write
polyfills for, and an awkward registration scheme where you have to
write a test then include it in `tests.h`).
2. More seriously, we register custom operators and classes in these
tests. In a world where we may be linking many `libtorch_python`s, this
has a tendency to cause errors with `libtorch`.

So now, only tests that explicitly require cooperation with Python are
built into `libtorch_python`. The rest are built into
`build/bin/test_jit`.

There are tests which require that we define custom classes and
operators. In these cases, I've built thm into separate `.so`s that we
call `torch.ops.load_library()` on.

Test Plan: Imported from OSS

Reviewed By: SplitInfinity, ZolotukhinM

Differential Revision: D23735520

Pulled By: suo

fbshipit-source-id: d146bf4e7eb908afa6f96b394e4d395d63ad72ff
2020-09-18 14:04:40 -07:00
Louis Feng
eb75cfb9c0 Back out "Revert D23323486: DPP Async Tracing" plus windows build fix. (#44702)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44702

Original commit changeset: c6bd6d277aca

This diff caused windows build to fail due to a compiler bug in VS2019 (lambda capture constant int value). This back out works around the issue with explicit capture of const int value.

Test Plan: Tested and previously landed.

Reviewed By: mruberry

Differential Revision: D23703215

fbshipit-source-id: f9ef23be97540bc9cf78a855295fb8c69f360459
2020-09-16 11:32:11 -07:00
Mike Ruberry
7036e91abd Revert D23323486: DPP Async Tracing
Test Plan: revert-hammer

Differential Revision:
D23323486 (71673b31f9)

Original commit changeset: 4b6ca6c0e320

fbshipit-source-id: c6bd6d277aca070bef2de3522c2a60e23b4395ad
2020-09-15 01:19:23 -07:00
Louis Feng
71673b31f9 DPP Async Tracing (#44252)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44252

Add tracing to DPP client. Because DPP requests are async, we need to be able to start a trace event in one thread and potentially end in a different thread. RecordFunction and LibgpumonObserver previously assume each trace event starts and finishes in the same thread. So they use a thread local context to track enter and exit call backs. Async events breaks this assumption. This change attaches the event context to the RecordFunction object so we do not need to use thread local context.

Test Plan:
Tested with dpp perf test and able to collect trace.

{F307824044}

Reviewed By: ilia-cher

Differential Revision: D23323486

fbshipit-source-id: 4b6ca6c0e32028fb38a476cd1f44c17a001fc03b
2020-09-14 18:43:14 -07:00
Nikolay Korovaiko
f91bdbeabd Enable function calls in TEFuser and SpecializeAutogradZero (#43866)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43866

Reviewed By: ezyang

Differential Revision: D23452798

Pulled By: Krovatkin

fbshipit-source-id: 2cff4c905bf1b5d9de56e7869458ffa6fce1f1b5
2020-09-03 14:42:52 -07:00
Pritam Damania
f1624b82b5 Preserve python backtrace in autograd engine errors. (#43684)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43684

This PR attempts to address #42560 by capturing the appropriate
exception_ptr in the autograd engine and passing it over to the Future.

As part of this change, there is a significant change the Future API where we
now only accept an exception_ptr as part of setError.

For the example in #42560, the exception trace would now look like:

```
> Traceback (most recent call last):
>   File "test_autograd.py", line 6914, in test_preserve_backtrace
>     Foo.apply(t).sum().backward()
>   File "torch/tensor.py", line 214, in backward
>     torch.autograd.backward(self, gradient, retain_graph, create_graph)
>   File "torch/autograd/__init__.py", line 127, in backward
>     allow_unreachable=True)  # allow_unreachable flag
>   File "torch/autograd/function.py", line 87, in apply
>     return self._forward_cls.backward(self, *args)
>   File "test_autograd.py", line 6910, in backward
>     raise ValueError("something")
> ValueError: something
```
ghstack-source-id: 111109637

Test Plan: waitforbuildbot

Reviewed By: albanD

Differential Revision: D23365408

fbshipit-source-id: 1470c4776ec8053ea92a6ee1663460a3bae6edc5
2020-09-01 01:28:47 -07:00
Nikolay Korovaiko
000739c31a Function calls for fallback paths (#43274)
Summary:
This PR adds API to package unoptimized/fallback blocks as function calls. It's mainly meant to be used by TensorExpressionsFuser and SpecializeAutogradZero passes as both specialize the original graph but would also like to provide a fallback path in case the assumptions under which the graph was specialized do not hold for some inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43274

Reviewed By: malfet

Differential Revision: D23406961

Pulled By: Krovatkin

fbshipit-source-id: ef21fc9ad886953461b09418d02c75c58375490c
2020-08-28 23:31:02 -07:00
Nikolay Korovaiko
a97ca93c0e remove prim::profile and special-casing (#43160)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43160

Reviewed By: ZolotukhinM

Differential Revision: D23284421

Pulled By: Krovatkin

fbshipit-source-id: 35e97aad299509a682ae7e95d7cef53301625309
2020-08-22 23:52:36 -07:00
Elias Ellison
91f3114fc1 [JIT] Represent profiled types as a node attribute (#43035)
Summary:
This changes profiled types from being represented as:
`%23 : Float(4:256, 256:1, requires_grad=0, device=cpu) = prim::profile(%0)`
->
`%23 : Tensor = prim::profile[profiled_type=Float(4:256, 256:1, requires_grad=0, device=cpu)](%0)`

Previously, by representing the profiled type in the IR directly it was very easy for optimizations to accidentally use profiled types without inserting the proper guards that would ensure that the specialized type would be seen.

It would be a nice follow up to extend this to prim::Guard as well, however we have short term plans to get rid of prim::Guard.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43035

Reviewed By: ZolotukhinM

Differential Revision: D23120226

Pulled By: eellison

fbshipit-source-id: c78d7904edf314dd65d1a343f2c3a947cb721b32
2020-08-14 20:17:46 -07:00
Ilia Cherniavskii
a53fdaa23f Remove ProfiledType (#42570)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42570

ProfiledType doesn't do anything and is not used atm, removing

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D22938664

Pulled By: ilia-cher

fbshipit-source-id: 037c512938028f44258b702bbcde3f8c144f4aa0
2020-08-06 01:52:08 -07:00
Ilia Cherniavskii
e7a09b4d17 RecordFunction in Dispatcher (#37587)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37587

Lifting RecordFunction up into the dispatcher code

Test Plan: Imported from OSS

Differential Revision: D21374246

fbshipit-source-id: 19f9c1719e6fd3990e451c5bbd771121e91128f7
2020-07-17 22:20:05 -07:00
Rohan Varma
bf9cc5c776 Add callback with TLS state API in futures (#40326)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40326

Adds a helper function `addCallbackWithTLSState` to both
torch/csrc/utils/future.h which is used internally by RPC framework and the JIT
future. Uses this helper function to avoid to pass in TLS state where it is needed for rpc and `record_function_ops.cpp`. For example, the following:

```
at::ThreadLocalState tls_state;
fut->addCallback([tls_state = std::move(tls_state)]() {
at::ThreadLocalStateGuard g(tls_state);
some_cb_that_requires_tls_state();
}
```

becomes

```
fut->addCallbackWithTLSState(some_cb_that_requires_tls_state);
```
ghstack-source-id: 107383961

Test Plan: RPC Tests and added a test in test_misc.cpp

Differential Revision: D22147634

fbshipit-source-id: 46c02337b90ee58ca5a0861e932413c40d06ed4c
2020-07-08 23:25:35 -07:00
Sebastian Messmer
53af9df557 Unify boxed function signature between jit and c10 (#37034)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37034

c10 takes a Stack* in boxed functions while JIT took Stack&.
c10 doesn't return anything while JIT returns an int which is always zero.

This changes JIT to follow the c10 behavior.
ghstack-source-id: 106834069

Test Plan: unit tests

Differential Revision: D20567950

fbshipit-source-id: 1a7aea291023afc52ae706957e9a5ca576fbb53b
2020-06-29 19:24:26 -07:00
Jeremy Lilley
569c85b45d [futures] Add assert to Future constValue() accessor, add hasValue(). (#39950)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39950

Per the comment in the code, constValue() should only be used in
the case where the future was complete and value was not an error.
Add an assert to enforce this.

Also, add hasValue() accessor for completeness.
ghstack-source-id: 105815597

Test Plan: buck test mode/dev-nosan caffe2/test/cpp/jit:

Differential Revision: D22021776

fbshipit-source-id: b59b6c775eab344068a76f4cd8c3a9dc1f2a174e
2020-06-15 12:11:22 -07:00
Nikolay Korovaiko
7f55197a57 Peel Loop (#39434)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39434

Differential Revision: D21857037

Pulled By: Krovatkin

fbshipit-source-id: 6583da167fe93d96e93f1c3d71f46f94e7f4e982
2020-06-10 13:48:18 -07:00
Jeremy Lilley
be3bbfc917 [futures] Add collectAny() to ivalue::Future for completeness (#39597)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39597

To complement collectAll(), this change adds collectAny(), and writes
up relevant unittest coverage.

We also remove the vector-based helper version of collectAll(), which
was debatable usefulness in a previous change.
ghstack-source-id: 105527180

Test Plan: buck test mode/dev-nosan caffe2/test/cpp/jit/...

Differential Revision: D21910311

fbshipit-source-id: dbb3ca404672a3d751b1b3cf016e6084a9ff8040
2020-06-09 16:32:52 -07:00
Jeremy Lilley
b83fed8d4c [futures] Add c++ ivalue::Future collectAll() helper (#39119)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39119

Add some base c++ unittest coverage for ivalue::Future, and in
the process, add a basic collectAll() primitive, per 38937.

In the process, I realized that List<Future> is effectively
impossible to construct (since the Future's type is not templated,
but rather passed in,  the getTypePtr_<T>::call() isn't defined),
so added a workaround in List to make it possible.
ghstack-source-id: 105309650

Test Plan: buck test mode/dev-nosan caffe2/test/cpp/jit/...

Differential Revision: D21756884

fbshipit-source-id: 5d40c8d1c55098de5497655c7b887f4f56508a37
2020-06-08 05:52:09 -07:00
Linbin Yu
b28422d444 add overload name for str cmp (#39607)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39607

add overload name for strcmp macro to prevent duplicated op names in lite interpreter

also reformatted some other files

Test Plan:
verified these op schema are changed

```
-aten::eq(str a, str b) -> (bool)
+aten::eq.str(str a, str b) -> (bool)

-aten::ne(str a, str b) -> (bool)
+aten::ne.str(str a, str b) -> (bool)

-aten::lt(str a, str b) -> (bool)
+aten::lt.str(str a, str b) -> (bool)

-aten::gt(str a, str b) -> (bool)
+aten::gt.str(str a, str b) -> (bool)

-aten::le(str a, str b) -> (bool)
+aten::le.str(str a, str b) -> (bool)

-aten::ge(str a, str b) -> (bool)
+aten::ge.str(str a, str b) -> (bool)
```

Reviewed By: iseeyuan

Differential Revision: D21913049

fbshipit-source-id: 518db068c8c5b0efd19223f0bd94fc3351335dc4
2020-06-06 23:21:35 -07:00
Nikolay Korovaiko
97a2918a07 reduce number of bailout nodes (#38281)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38281

Differential Revision: D21665509

Pulled By: Krovatkin

fbshipit-source-id: c2c34b759aec30d0a161e582030ba994192ee4ec
2020-06-05 13:45:37 -07:00
Ilia Cherniavskii
abe2be2063 [resubmit] Use TensorMethods.cpp (#39385)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39385

see https://github.com/pytorch/pytorch/pull/37639

Test Plan:
https://github.com/pytorch/pytorch/pull/37639

Imported from OSS

Differential Revision: D21833287

fbshipit-source-id: 9928d3f4122903d0de67ad312e349352d5f5c19c
2020-06-02 20:27:51 -07:00
Edward Yang
2fe0fc2684 Revert D21374247: Use TensorMethods.cpp
Test Plan: revert-hammer

Differential Revision:
D21374247

Original commit changeset: 076964415079

fbshipit-source-id: 732ec8c561d1f37475c1b5549ba79c718e3a6db8
2020-06-01 08:12:09 -07:00
Ilia Cherniavskii
68e62b9ab6 Use TensorMethods.cpp (#37639)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37639

Changing TensorMethods.h to .cpp
Necessary to avoid incomplete types in dispatcher

Test Plan:
CI

Imported from OSS

checked mobile size, no change, small reduction in size in fbios
fbios: Succeeded
Change in Download Size for arm64 + 3x assets variation: -18.2 KiB
Change in Uncompressed Size for arm64 + 3x assets variation: -8.8 KiB

reran benchmark, no stat. significant difference

buck run mode/opt caffe2/caffe2/fb/high_perf_models/pytorch/benchmark_framework_overheads:benchmark_torchscript_model -- --model_file caffe2/caffe2/fb/high_perf_models/pytorch/benchmark_framework_overheads/addmodule.pt --num_runs 3

╷ @  68592d0d  41 minutes ago  iliacher  D21374247
╭─╯  Use TensorMethods.cpp

Created 3 benchmark runs on aibench for caffe2/caffe2/fb/high_perf_models/pytorch/benchmark_framework_overheads/addmodule.pt.
Links to the results:

* Adhoc run: https://our.intern.facebook.com/intern/aibench/details/1729113760

* Adhoc run: https://our.intern.facebook.com/intern/aibench/details/3867976782

* Adhoc run: https://our.intern.facebook.com/intern/aibench/details/2782186766

hg prev

@  7f501b42  Thursday at 14:26  bvaughan  D21764704
╷  short-circuit pow for complex 1 and 0 exponents

Created 3 benchmark runs on aibench for caffe2/caffe2/fb/high_perf_models/pytorch/benchmark_framework_overheads/addmodule.pt.
Links to the results:

* Adhoc run: https://our.intern.facebook.com/intern/aibench/details/2155256332

* Adhoc run: https://our.intern.facebook.com/intern/aibench/details/1802057074

* Adhoc run: https://our.intern.facebook.com/intern/aibench/details/4119590830

Differential Revision: D21374247

fbshipit-source-id: 076964415079cf84fb57f1f7b43d087afed86e1d
2020-05-31 17:11:12 -07:00
Ilia Cherniavskii
a5e023f28a Set RecordFunction id only when needed (#39265)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39265

In this PR we set id of RecordFunction only when callbacks need them and when
there's at least one active callback

Test Plan:
testRecordFunction unit test in test_misc.cpp
buck test mode/dev caffe2/test/cpp/jit:jit

https://our.intern.facebook.com/intern/testinfra/testrun/8725724291116413

Reviewed By: dzhulgakov

Differential Revision: D21790421

fbshipit-source-id: 016623d7f1a2a271921a71c0483061e232b40321
2020-05-29 15:34:44 -07:00
Luca Antiga
e088902b4a Add type-hint check for default arguments in TorchScript C++ frontend (#39021)
Summary:
This PR fixes https://github.com/pytorch/pytorch/issues/39020 by requiring users to type-hint default arguments to a TorchScript when using the C++ frontend (the Python frontend will insert those automatically).

Since this is a bit of a niche use case, I opted for the simpler solution of making type-hints mandatory for default arguments, as opposed to trying to type-infer them. I left a comment in the code justifying this choice.

Test is included.

/cc t-vi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39021

Differential Revision: D21755317

Pulled By: suo

fbshipit-source-id: e007650d3bfb3a4c58c25ad2c3a17759898f303b
2020-05-28 01:42:04 -07:00
Nikita Shulga
c6e9e9359f [Codemod][GleanFbcode] Remove dead includes in caffe2/test (#39023)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39023

Reviewed By: orionr

Differential Revision: D21702529

fbshipit-source-id: 6945bba95609102409850b105a8a091e33b8acc9
2020-05-27 14:07:26 -07:00
Ilia Cherniavskii
43dd8760d7 Move ThreadLocalDebugInfo to c10 (#37774)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37774

Move ThreadLocalDebugInfo from ATen to C10

Test Plan: Imported from OSS

Differential Revision: D21384249

Pulled By: ilia-cher

fbshipit-source-id: f9b5089a868f84a2ee013695a481fcc883d3c6b2
2020-05-11 19:27:41 -07:00
Ilia Cherniavskii
facc5e0cc4 Make profiler thread local (#36291)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36291

Move profiler state to be a thread local property,
reuse existing thread local propagation mechanism to ensure
correct profiling of async tasks. This also makes
push/pop callback thread safe and easier to use in e.g.
distributed profilier

Test Plan:
USE_BLAS=MKL USE_MKLDNN=0 USE_CUDA=0 python setup.py develop install
./build/bin/test_jit

./build/bin/test_jit
python test/test_autograd.py
python test/test_jit.py

Differential Revision: D20938501

Pulled By: ilia-cher

fbshipit-source-id: c0c6c3eddcfea8fc7c14229534b7246a0ad25845
2020-05-07 14:52:49 -07:00
Ilia Cherniavskii
2ef4010593 Propagate TLS callbacks with ThreadLocalState (#37745)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37745

This PR makes it possible to set TLS callbacks and use
them transparently not only in the main thread but also
in any async tasks

Test Plan: Imported from OSS

Differential Revision: D21374873

Pulled By: ilia-cher

fbshipit-source-id: 3be2e121673b32d7694e17e794f3b474826dffe9
2020-05-07 14:52:44 -07:00
Ilia Cherniavskii
2d708cefcc Move RecordFunction into ATen (#37548)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37548

Moving RecordFunction from torch::autograd::profiler into at namespace

Test Plan:
CI

Imported from OSS

Differential Revision: D21315852

fbshipit-source-id: 4a4dbabf116c162f9aef0da8606590ec3f3847aa
2020-05-07 14:52:39 -07:00