Commit Graph

58390 Commits

Author SHA1 Message Date
mikey dagitses
2ac9086987 run buildifier on unified build files (#98141)
This is pretty tricky. buildifier by default doesn't do much to these
files. It does a little more if you tell it that they are
`BUILD.bazel` files with `-type=build`. But it can do even more if you
remove the target definitions from the `def define_rules()` wrapper
and dedent them.

I wrote a little wrapper that does that. I'll submit it at a later
date.
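
A minimal sketch of what such a wrapper could look like (this is not the author's actual script, which isn't in this PR; `format_unified_build_file` is an invented name, and it assumes `buildifier` is on `PATH`):

```python
import subprocess
import textwrap

def format_unified_build_file(path):
    # Split the file at the `def define_rules():` wrapper.
    src = open(path).read()
    header, sep, body = src.partition("def define_rules():\n")
    if not sep:
        return  # no wrapper found; leave the file alone
    # Dedent the target definitions so buildifier sees a plain BUILD file.
    flat = textwrap.dedent(body)
    formatted = subprocess.run(
        ["buildifier", "-type=build"],
        input=flat, capture_output=True, text=True, check=True,
    ).stdout
    # Re-indent the formatted targets and put the wrapper back.
    open(path, "w").write(header + sep + textwrap.indent(formatted, "    "))
```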

Differential Revision: [D44606558](https://our.internmc.facebook.com/intern/diff/D44606558/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D44606558/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98141
Approved by: https://github.com/ezyang, https://github.com/PaliC
2023-04-04 00:37:19 +00:00
Michael Voznesensky
b1e60bfb6a Pass f_locals as a dict rather than kwargs (#98107)
Fixes https://github.com/pytorch/pytorch/issues/97688

One big problem is that instead of printing `x < y` we now print
`E["x"] < E["y"]`, and now all of the tests wobbled and I'm mad.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98107
Approved by: https://github.com/ezyang
2023-04-04 00:30:08 +00:00
Jason Ansel
b96fe9b61c Fix issues related to ClassInstantier in HF models (#97997)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97997
Approved by: https://github.com/anijain2305
2023-04-04 00:01:08 +00:00
Yifu Wang
4d13fcddef [spmd expansion] support torch.ops.aten.sym_numel (#98229)
The current logic assumes non-overload ops take two arguments; however, torch.ops.aten.sym_numel takes only one.
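
A hypothetical sketch of the shape of the fix (the function and argument names are illustrative, not the actual spmd-expansion code):

```python
import torch

def call_non_overload_op(op, args):
    if op is torch.ops.aten.sym_numel:  # unary: takes only the tensor
        (tensor,) = args
        return op(tensor)
    lhs, rhs = args  # the previous logic assumed exactly two arguments
    return op(lhs, rhs)
```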

Differential Revision: [D44615037](https://our.internmc.facebook.com/intern/diff/D44615037/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98229
Approved by: https://github.com/mrshenli
2023-04-03 23:57:10 +00:00
Yanbo Liang
a6bd21d935 [Dynamo] Eagerly initializing Lazy Module to reduce graph breaks (#97946)
Fixes a Meta-internal use case.
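
Illustrative of the effect rather than Dynamo's internal mechanism (a sketch, assuming a lazy module whose parameters can be materialized by one eager call):

```python
import torch

m = torch.nn.LazyLinear(8)
x = torch.randn(2, 4)
with torch.no_grad():
    m(x)  # one eager call initializes the lazy parameters in place
compiled = torch.compile(m)  # no graph break on uninitialized parameters
compiled(x)
```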

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97946
Approved by: https://github.com/wconstab
2023-04-03 22:24:43 +00:00
Bin Bao
96f548a1ac [inductor] Add an AOT mode for the Triton backend (#98214)
Summary:
This is a copy of https://github.com/pytorch/pytorch/pull/97152 to make
the landing easier.

This PR implements a two-pass wrapper codegen for the Triton
backend to achieve ahead-of-time compilation. In the first pass, the
regular Python wrapper code is generated, and then the generated
code is executed to perform Triton compilation and autotuning.
After that, the second-pass wrapper codegen generates a C++ wrapper
with the proper CUDA API calls to load and launch the Triton-generated CUDA kernels.

Like the AOT mode for the cpp backend, the next step would be to provide
a more complete API for AOT.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98214
Approved by: https://github.com/eellison
2023-04-03 22:19:18 +00:00
Mikayla Gawarecki
73b06a0268 Fix rendering of arguments for nn.functional ops that use boolean_dispatch (#98092)
Fix #97982

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98092
Approved by: https://github.com/albanD
2023-04-03 21:17:43 +00:00
Guang Yang
eeb18d1e54 Fix dynamo tests and re-enable internally (#97937)
Summary:
`:test_dynamo` has been broken for a long time internally at Meta. This PR fixes the broken test and re-enables it internally.
- Use the root `pytest.ini` for pytest
- Decouple tests so that one can be disabled without affecting others
- Temporarily disable the test cases that require additional effort to fix

**OSS CI doesn't provide test code coverage info; Meta's internal test infra does. The value of re-enabling these tests internally is not only to collect test coverage info but also to help fbcode developers build/test from fbcode.**

Test Plan:
`buck test mode/dev-nosan //caffe2/test:test_dynamo`
https://www.internalfb.com/intern/testinfra/testrun/7318349540623516

Differential Revision: D44325238

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97937
Approved by: https://github.com/ezyang
2023-04-03 20:47:13 +00:00
Yu Guo
3654552b8c add deterministic impl for scatter and scatter_reduction sum/mean mode (#98060)
Use the existing deterministic implementation via `index_put`, which is based on sorting indices.

With the `accumulate` arg in `index_put`, this can work for both scatter and scatter_reduce with the sum/mean reduction modes.
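
A small sketch of the user-facing behavior (illustrative values; the deterministic path matters on CUDA, where the default atomics-based scatter is nondeterministic):

```python
import torch

torch.use_deterministic_algorithms(True)
device = "cuda" if torch.cuda.is_available() else "cpu"
src = torch.randn(6, device=device)
index = torch.tensor([0, 1, 0, 1, 2, 1], device=device)
out = torch.zeros(3, device=device)
# With deterministic algorithms on, this routes through the sort-based
# index_put implementation described above.
out.scatter_reduce_(0, index, src, reduce="mean", include_self=False)
```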

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98060
Approved by: https://github.com/mikaylagawarecki
2023-04-03 20:38:29 +00:00
Kwanghoon An
13f169c9da Per Channel in back-propagation function (#97475)
Summary:
Supporting Per Channel quantization in the gradient computation function.

One workaround added here: current QNNPACK is not designed to process [transposed weights](https://fb.workplace.com/groups/pytorch.edge.users/permalink/1283737025829921/), so we simply replace Per Channel with Per Tensor quantization to compute the gradient (some learning-curve slowdown or WER degradation might be expected; we don't know, nothing is guaranteed).
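
A hypothetical illustration of the Per Channel -> Per Tensor replacement (the scale choice here is ours, not necessarily what the diff does):

```python
import torch

w = torch.randn(4, 3)
scales = torch.tensor([0.1, 0.2, 0.3, 0.4])
zero_points = torch.zeros(4, dtype=torch.int64)
w_per_channel = torch.quantize_per_channel(w, scales, zero_points, axis=0, dtype=torch.qint8)
# Collapse to a single (conservative) per-tensor scale for the gradient path.
w_per_tensor = torch.quantize_per_tensor(
    w_per_channel.dequantize(), scale=float(scales.max()), zero_point=0, dtype=torch.qint8
)
```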

Test Plan:
You can create your own synthetic model (FP32 layer -> INT8 layer with Per Channel quantization) and see if the loss is decreasing.

Differential Revision: D43898794

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97475
Approved by: https://github.com/weiwangmeta
2023-04-03 20:34:44 +00:00
PaliC
8e5f57a2b1 add users to external contribution metrics (#97928)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97928
Approved by: https://github.com/kit1980
2023-04-03 19:52:31 +00:00
Rohan Varma
1ea528ef24 [bf16] bf16 support for conv_depthwise3d (#97819)
Add bf16 for this op

Differential Revision: [D44473429](https://our.internmc.facebook.com/intern/diff/D44473429/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97819
Approved by: https://github.com/fegin
2023-04-03 19:31:27 +00:00
Jason Ansel
55afaa46a4 Support functools.partial and itertools.product (#98120)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98120
Approved by: https://github.com/anijain2305
2023-04-03 18:23:25 +00:00
Devashish Shankar
2c905f2152 Extend Pattern Matcher to allow handling split-cat style patterns (#97726)
Summary:
This diff extends the pattern matcher by adding a few features that allow it to handle split-getitem-cat style patterns.

3 problems I encountered were:

1. In the handler, I only need one Arg() (the one which is the first input to split). None of the other args are relevant to the replacement graph. So we add a new Ignored() pattern to have ignored args.

2. The pattern matching was visiting the split node again and again during the DFS. By propagating the patterns with _users>1 or Any into the child MatchContext, we avoid this problem.

3. To avoid the unbundling issue, I switched to using KeywordArg() instead of Arg(), as for this pattern we need a flat list of Arg() in the end.

Example pattern: https://www.internalfb.com/intern/anp/view/?id=3325856

```
pass_patterns.append(defaultdict(list))

@register_replacement_pattern(
    CallFunction(
        aten.cat,
        ListOf(
            CallFunction(
                operator.getitem,
                CallFunction(
                    aten.split_with_sizes,
                    KeywordArg("input_"),
                    Ignored(),
                    Ignored(),
                    _users=Any,
                ),
                Ignored(),
            ),
        ),
        Ignored(),
    ),
    pass_number=3,
)
def split_cat_replace(input_):
    return input_
```

Test Plan: https://www.internalfb.com/intern/anp/view/?kernel=default&id=3317105

Reviewed By: jansel

Differential Revision: D44282499

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97726
Approved by: https://github.com/jansel
2023-04-03 17:30:56 +00:00
Bin Bao
095c129bd3 [CI] Add inference run for the performance dashboard (#98174)
Summary: Remove the fp32 training performance run and trade it for an amp
inference performance run.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98174
Approved by: https://github.com/huydhn
2023-04-03 17:29:55 +00:00
Bin Bao
ba7ee00f00 Add a --inference flag to dynamo benchmark script (#98173)
Summary: When calling the benchmark scripts, make it a requirement to pass
`--inference` or `--training`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98173
Approved by: https://github.com/huydhn
2023-04-03 17:12:28 +00:00
John Haitas
5a54eb0b15 [caffe2] miniz fix -Wstrict-prototypes (#98027)
Summary: This fixes -Wstrict-prototypes warnings.

Test Plan: eyes

Reviewed By: rmaz

Differential Revision: D44556017

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98027
Approved by: https://github.com/albanD
2023-04-03 16:56:47 +00:00
Elias Ellison
0f0c1b6516 Flip back switch (#98099)
There are some errors occurring on the benchmark; switch back to the old cudagraph impl until they are figured out.

https://torchci-git-fork-huydhn-add-compilers-bench-74abf8-fbopensource.vercel.app/benchmark/compilers

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98099
Approved by: https://github.com/desertfire
2023-04-03 14:46:33 +00:00
DanilBaibak
55daa835e9 Added allowed_workflows to pytorch probot (#98082)
Added allowed_workflows to pytorch probot. This is a follow-up PR [regarding the retry bot](https://github.com/pytorch/test-infra/pull/3942/files#diff-ee5e4f1e1fa962c6f62e5dcebde6e0bab573e74474601bf5749ccb668fd9c900R14-R16).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98082
Approved by: https://github.com/huydhn
2023-04-03 12:30:43 +00:00
mingfeima
ced5c89b6f add explicit vectorization for Half dtype on CPU (#96076)
This patch is part of the half-float performance optimization on CPU:
* add specializations for dtype `Half` in `Vectorized<>` under both avx256 and avx512.
* add specializations for dtype `Half` in the functional utils, e.g. `vec::map_reduce<>()`, which uses float32 as the accumulate type (see the small demonstration below).
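
A toy demonstration of why float32 accumulation matters (the numbers fall out of the fp16 format itself, not this patch's kernels):

```python
import numpy as np

xs = np.full(10_000, np.float16(0.1))
acc16 = np.float16(0.0)
for v in xs:
    acc16 = np.float16(acc16 + v)  # half accumulator: stalls at 256.0
acc32 = xs.astype(np.float32).sum()  # float32 accumulator: ~1000, as expected
print(acc16, acc32)
```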

Also add a helper struct `vec_hold_type<scalar_t>`, since `Vectorized<Half>::value_type` points to its underlying storage type, which is `uint16_t`, leading to errors if a kernel uses `Vec::value_type`.

Half uses the same logic as BFloat16 in `Vectorized<>`: each Half vector is mapped to 2x float vectors for computation.

Note that this patch modifies the cmake files by adding **-mf16c** to the AVX2 build; from https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html we can see that all the hardware platforms that support **avx2** already have **f16c**.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96076
Approved by: https://github.com/malfet
2023-04-03 10:58:37 +00:00
Huy Do
c99895ca6f Move pull and trunk slow tests to periodic (#98040)
I notice that we are running some slow tests for CPU and `sm86` on pull and trunk.  They take much longer to run than other shards (1.5x to 2x longer).  I propose that we move them to periodic instead. Thoughts?

The correlations between them are:

* `linux-bionic-cuda11.7-py3.10-gcc7-sm86 / test (slow)` and `linux-bionic-cuda11.7-py3.10-gcc7-sm86 / test (default)` is 0.93
* `linux-bionic-py3.8-clang9-slow / test (slow)` and `linux-bionic-py3.8-clang9 / test (default)` is 0.98

### 🤖 Generated by Copilot at db56750

This pull request updates the `.github/workflows` files to optimize the testing workflows for PyTorch. It adds new periodic workflows for more platforms and configurations, and removes some redundant or slow workflows from the pull and trunk workflows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98040
Approved by: https://github.com/malfet
2023-04-03 08:13:12 +00:00
PyTorch MergeBot
c597d9c1f2 Revert "Inductor cpp wrapper: support LinearUnary (#97655)"
This reverts commit d03003ab8e.

Reverted https://github.com/pytorch/pytorch/pull/97655 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but it looks like the change causes a regression on CPU test time in d03003ab8e (inductor/test_cpp_wrapper.py)
2023-04-03 08:09:58 +00:00
chunyuan
d03003ab8e Inductor cpp wrapper: support LinearUnary (#97655)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97655
Approved by: https://github.com/jansel
2023-04-03 04:26:10 +00:00
chunyuan
0c1f524b92 Inductor cpp wrapper: support MKLPackedLinear (#90755)
Invoke `torch.ops.mkl._mkl_linear` from C++.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90755
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/jansel
2023-04-03 04:07:38 +00:00
Jiong Gong
5d62d12557 [Inductor] support transpose vertical reduction in cpp (#97781)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97781
Approved by: https://github.com/jansel
2023-04-03 02:02:15 +00:00
Jason Ansel
76074dc0a3 Improve support for dict subclasses (#98154)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98154
Approved by: https://github.com/anijain2305
2023-04-03 01:42:08 +00:00
Jiong Gong
bf22ecba2a [Inductor] support vertical reduction in cpp (#97644)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97644
Approved by: https://github.com/jansel
2023-04-03 01:29:12 +00:00
Jason Ansel
35b3309539 Fix graph break from inline patched init (#98150)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98150
Approved by: https://github.com/anijain2305, https://github.com/yanboliang
2023-04-03 01:11:30 +00:00
Jiong Gong
8e5f491623 [Inductor] simplify CPP backend Tile2D code and support non-contiguous load/store (#97626)
Remove `CppTile2DTailKernel` and `CppTile2DKernelChecker` and reuse `CppVecKernel` and `CppVecKernelChecker` for them. Add vectorization with fallback for load/store in CppVecKernel for the non-contiguous load/store needed by `CppTile2DTailKernel`.

This PR also adds functional support for transposed copy of the bfloat16 data type. Better performance requires vectorized intrinsics implemented for `at::vec::transpose_mxn`. cc @soumith @voznesenskym @penguinwu @anijain2305 @EikanWang @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @desertfire

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97626
Approved by: https://github.com/jansel
2023-04-03 01:11:20 +00:00
Jason Ansel
71d850a100 [inductor] Fallback on complex64 kernels (#98155)
Later PRs in this stack fix graph breaks in GoogleFnet, which trigger errors from inductor trying to compile torch.complex64; this PR fixes that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98155
Approved by: https://github.com/anijain2305, https://github.com/ngimel
2023-04-03 01:06:43 +00:00
Jason Ansel
bc9dd969e1 Support inlining no_grad() decorator (#98121)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98121
Approved by: https://github.com/anijain2305, https://github.com/voznesenskym
2023-04-03 00:24:56 +00:00
Shen Li
96403cfcec [Easy] Fix lint error on DTensor math_ops.py (#98170)
This lint error is caused by conflicts between #97996 and #98148

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98170
Approved by: https://github.com/yifuwang
2023-04-02 19:11:05 +00:00
Shen Li
02179827cb [Easy] Include SPMD and DTensor files in UFMT checks (#98148)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98148
Approved by: https://github.com/fegin
2023-04-02 15:34:49 +00:00
Aleksei Nikiforov
38609cc47d TensorExpr eval: fix copying variables from pointers on big endian systems (#96951)
When copying data from pointers, only the lowest bytes are copied. On little-endian systems they are located at the beginning of the pointed-to value; on big-endian systems they are located at the end.

This change fixes the TestTensorExprPyBind::test_dynamic_shape and TestTensorExprPyBind::test_dynamic_shape_2d tests from test/test_tensorexpr_pybind.py on big-endian systems.
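
A quick illustration of the failure mode (plain Python, independent of TensorExpr):

```python
import struct

value = 0x12345678
# Low-order bytes come first on little endian but last on big endian,
# so copying only the first bytes of a wider value truncates differently.
print(struct.pack("<Q", value)[:4].hex())  # little endian: '78563412'
print(struct.pack(">Q", value)[:4].hex())  # big endian:    '00000000'
```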

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96951
Approved by: https://github.com/ezyang, https://github.com/EikanWang
2023-04-02 12:49:14 +00:00
yanbing-j
2ab18a23e1 Update ideep submodule (#97430)
### Description

This PR is to update ideep submodule for the following two aspects:

1. On the inductor side, we are supporting the dynamic-shape path for packed linear, where we hope the packed weight of linear doesn't depend on the input shapes and can still deliver better performance using a packed weight obtained from dummy input shapes. However, the current ideep has an accuracy issue in this case; this update fixes it.
2. Add an extra arg `is_channels_last` for deconv to tell ideep whether to go channels-last or not, because ideep's memory-format checks (e.g. `is_nhwc()`, `is_ndhwc()`) are not 100% identical to `suggest_memory_format()` from PyTorch.

### Performance Benchmark

Used TorchBench tests on ICX with 40 cores; Intel OpenMP & tcmalloc were preloaded.
![image](https://user-images.githubusercontent.com/61222868/229072474-193513ba-6727-4451-91ff-0d57e016736f.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97430
Approved by: https://github.com/jgong5
2023-04-02 06:42:09 +00:00
Shen Li
347c67d4a2 [Easy] Consolidate string startswith checks (#98147)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98147
Approved by: https://github.com/fegin
2023-04-02 04:02:37 +00:00
Wanchao Liang
7fcff01b50 [reland] switch mean to use reduction linear (#97996)
mean is actually reduction linear if the final reduction is a partial
sum (which it currently is), so switch to use that instead.
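A minimal sketch of the observation (shards and sizes are illustrative):

```python
import torch

shards = [torch.randn(4) for _ in range(3)]
partial_sums = torch.stack([s.sum() for s in shards])  # local partial sums
global_numel = sum(s.numel() for s in shards)
mean = partial_sums.sum() / global_numel  # final reduction is a plain sum
assert torch.allclose(mean, torch.cat(shards).mean())
```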
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97996
Approved by: https://github.com/XilunWu, https://github.com/yifuwang
2023-04-02 03:19:56 +00:00
Jason Ansel
d9e5ab4606 Fix graph break from 'hasattr: HFPretrainedConfigVariable()' (#98119)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98119
Approved by: https://github.com/anijain2305
2023-04-02 02:56:45 +00:00
Jason Ansel
b9d3b3f595 Improve support for contextlib.nullcontext (#98111)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98111
Approved by: https://github.com/anijain2305
2023-04-02 02:33:14 +00:00
Jason Ansel
92b46202ef Add --stats option to benchmark scripts (#98109)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98109
Approved by: https://github.com/anijain2305
2023-04-02 02:23:13 +00:00
mikey dagitses
e402259b8a avoid warning in irange for unsigned types (#97973)
Unsigned types should not be compared to be less than zero.

Differential Revision: [D44538384](https://our.internmc.facebook.com/intern/diff/D44538384/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97973
Approved by: https://github.com/Skylion007
2023-04-01 23:52:37 +00:00
Nikita Shulga
2af09393f9 masked_scatter should accept only bool masks (#97999)
Modify test_torch to check that an assert is raised in this case.

torch.uint8 usage has been deprecated for a few releases, and errors have been raised for other dtypes on CUDA devices, but not on CPU.
This PR finally restricts the mask to just `torch.bool`.
See https://github.com/pytorch/pytorch/pull/96594 for an example of doing this for `torch.masked_fill`.

Fixes https://github.com/pytorch/pytorch/issues/94634
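
Illustrative of the new behavior (a sketch, not the test added in this PR):

```python
import torch

x = torch.zeros(4)
src = torch.arange(4.0)
mask = torch.tensor([True, False, True, False])
x.masked_scatter_(mask, src)  # ok: bool mask
try:
    x.masked_scatter_(mask.to(torch.uint8), src)  # now raises
except RuntimeError as e:
    print("uint8 mask rejected:", e)
```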

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97999
Approved by: https://github.com/ngimel
2023-04-01 23:25:25 +00:00
Jason Ansel
bbc4e911c8 Move CPUReproTests to its own file (#97943)
test_torchinductor has gotten too big (almost 10k lines); this stack is trying to split it into smaller pieces.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97943
Approved by: https://github.com/ngimel
2023-04-01 22:39:49 +00:00
Li-Huai (Allan) Lin
db8abde9b6 [MPS] Enable conditional indexing tests (#97871)
The tests seem to be working now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97871
Approved by: https://github.com/kulinseth
2023-04-01 16:15:08 +00:00
Shen Li
e8d39606eb [SPMD] Enable fused Adam in full train step tracing (#98113)
Differential Revision: [](https://our.internmc.facebook.com/intern/diff/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98113
Approved by: https://github.com/yifuwang, https://github.com/fegin
2023-04-01 15:54:13 +00:00
Shen Li
bccf2ef0ce Format DTensor dispatch.py and _meta_registrations.py (#98114)
Format-only changes with black and lintrunner to prepare for the commit on top.

Differential Revision: [D44603809](https://our.internmc.facebook.com/intern/diff/D44603809)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98114
Approved by: https://github.com/yifuwang, https://github.com/fegin
2023-04-01 15:54:13 +00:00
mikey dagitses
64077ce511 remove redundant typed StorageImpl::data() member (#97650)
This has the same implementation as the unsafe variants and the unsafe
variants match the original semantics of the code, given that they
don't check that the type matches.

Given that we're updating callsites anyway to address the mutability
aspect, we might as well just drop this method now.

Differential Revision: [D44410210](https://our.internmc.facebook.com/intern/diff/D44410210/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97650
Approved by: https://github.com/ezyang
2023-04-01 08:16:54 +00:00
Shunting Zhang
13461e9767 [inductor] more cuda metrics in wrapper (#97723)
The following metrics should be helpful:
- percent of time the GPU is busy
- percent of time each category of kernels (e.g. pointwise/reduction triton kernels) takes
- percent of time each individual kernel takes compared to the total wall time of the benchmark

This PR adds those.

Example result from the hf_Bert inference graph:

```
  == triton_pointwise category kernels ==
Kernel                            Self CUDA TIME (ms)  Count    Percent
------------------------------  ---------------------  -------  ---------
triton_poi_fused_gelu_6_0d1d                  0.48154  12.0     5.52%
triton_poi_fused_clone_1_0d1d2                0.29011  24.0     3.33%
triton_poi_fused_clone_2_0d1d2                0.17417  12.0     2.00%
triton_poi_fused_clone_4_0d1d2                0.10797  12.0     1.24%
Total                                         1.05379           12.08%

  == triton_persistent_reduction category kernels ==
Kernel                            Self CUDA TIME (ms)  Count    Percent
------------------------------  ---------------------  -------  ---------
triton_per_fused__softmax__to_                0.97188  12.0     11.14%
triton_per_fused_add_native_la                0.37401  24.0     4.29%
triton_per_fused_gelu_native_l                0.02     1.0      0.23%
triton_per_fused_add_embedding                0.01718  1.0      0.20%
Total                                         1.38307           15.86%

  == unknown category kernels ==
Kernel                            Self CUDA TIME (ms)  Count    Percent
------------------------------  ---------------------  -------  ---------
ampere_fp16_s16816gemm_fp16_12                2.24514  24.0     25.74%
ampere_fp16_s16816gemm_fp16_25                1.39796  49.0     16.03%
void cutlass::Kernel<cutlass_8                1.36093  1.0      15.61%
ampere_fp16_s16816gemm_fp16_64                0.74591  12.0     8.55%
ampere_fp16_s16816gemm_fp16_12                0.61989  12.0     7.11%
Memset (Device)                               0.024    12.0     0.28%
void at::native::(anonymous na                0.01543  2.03     0.18%
void at::native::vectorized_el                0.00011  0.03     0.00%
Total                                         6.40937           73.49%

Percent of time when GPU is busy: 101.44%
```

Note: the output shows the total time the GPU is busy is larger than the total wall time. We measure total wall time with profiling disabled but measure GPU time with profiling enabled; that may distort the measurement a bit. But I assume the effect is not too large, assuming the profiler mostly increases CPU time (rather than GPU time).

## interesting usages
1. I pick a model for which cudagraphs improves perf significantly, like densenet121, and run the tool on its forward graph. It's no surprise that the GPU is idle quite a lot of the time:
```
(Forward graph) Percent of time when GPU is busy: 32.69%
Total wall time 17.307 ms
```

Its backward graph has a smaller percentage of GPU idle time, but it's still high:
```
(Backward graph) Percent of time when GPU is busy: 46.70%
Total wall time 17.422 ms
```

2. I profile a subset of torchbench models and plot a table to show the percent of execution time for pointwise/reduction/persistent_reduction/unknown_category. Since I plan to explore using the coordinate descent tuner to improve reductions, models with a high percentage of time spent on reduction should be good candidates (e.g. resnet50, mobilenet_v2).

NOTE: each model appears twice. The first row is for the fwd graph and the second for the bwd graph. We profile different graphs for a model separately.

```
benchmark_name           pointwise_percent    reduction_percent    persistent_reduction_percent    unknown_category_percent    GPU_busy_percent    wall_time_ms
-----------------------  -------------------  -------------------  ------------------------------  --------------------------  ------------------  --------------
resnet18                 19.73%               7.86%                4.81%                           41.25%                      73.65%              2.549ms
resnet18                 18.59%               7.13%                3.35%                           67.35%                      96.41%              3.467ms
resnet50                 29.57%               22.13%               2.07%                           51.68%                      105.46%             6.834ms
resnet50                 26.42%               15.27%               0.94%                           59.68%                      102.31%             13.346ms
vgg16                    26.23%               0.00%                0.00%                           74.20%                      100.43%             18.212ms
vgg16                    15.63%               5.61%                0.10%                           79.42%                      100.75%             33.485ms
BERT_pytorch             28.62%               4.82%                14.88%                          33.32%                      81.64%              7.162ms
BERT_pytorch             14.43%               13.41%               18.19%                          49.24%                      95.27%              10.395ms
densenet121              11.89%               2.14%                3.86%                           16.36%                      34.25%              16.531ms
densenet121              10.37%               2.06%                4.09%                           31.46%                      47.98%              16.934ms
hf_Bert                  23.94%               0.00%                29.88%                          46.09%                      99.90%              7.766ms
hf_Bert                  11.65%               10.54%               20.26%                          61.66%                      104.11%             11.892ms
nvidia_deeprecommender   42.92%               0.00%                0.00%                           56.75%                      99.67%              3.476ms
nvidia_deeprecommender   31.36%               3.44%                0.46%                           65.20%                      100.45%             3.872ms
alexnet                  30.99%               0.00%                0.00%                           69.16%                      100.14%             3.169ms
alexnet                  24.41%               4.83%                0.17%                           71.09%                      100.50%             4.709ms
mobilenet_v2             29.21%               27.79%               2.49%                           44.00%                      103.49%             10.160ms
mobilenet_v2             17.50%               15.05%               1.06%                           69.68%                      103.29%             20.715ms
resnext50_32x4d          18.96%               9.28%                2.31%                           28.79%                      59.33%              5.899ms
resnext50_32x4d          18.48%               11.01%               1.86%                           53.80%                      85.14%              7.167ms
mnasnet1_0               19.07%               14.52%               3.01%                           35.43%                      72.03%              6.028ms
mnasnet1_0               14.17%               12.00%               1.87%                           67.56%                      95.60%              9.225ms
squeezenet1_1            38.56%               0.00%                1.77%                           56.21%                      96.53%              2.221ms
squeezenet1_1            21.26%               7.57%                1.05%                           67.30%                      97.18%              4.942ms
timm_vision_transformer  17.05%               0.00%                18.80%                          65.79%                      101.64%             9.608ms
timm_vision_transformer  9.31%                9.07%                10.32%                          73.25%                      101.96%             16.814ms
```

## how to use
`python {compiled_module_wrapper.py} -p`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97723
Approved by: https://github.com/jansel
2023-04-01 08:04:14 +00:00
Jerry Zhang
553bb01df9 [quant][pt2e][refactor] Remove extra arguments of _maybe_insert_observers_before_graph_output (#98029)
Summary:
This PR allows `_maybe_insert_observers_before_graph_output` to be reused by the pt2e flow.

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
python test/test_quantization.py TestQuantizeFxModels

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98029
Approved by: https://github.com/vkuzo
2023-04-01 05:38:36 +00:00
Milos Puzovic
2630144786 Call to mkldnn_matmul from aten::addmm on AArch64 (#91763)
We have noticed that for BERT_pytorch in torchbenchmark, the majority of time is spent running GEMM in aten::addmm. At the moment this calls into a BLAS routine, but on AArch64 it is faster to call into mkldnn_matmul. Performance-wise, compared to a build with OpenBLAS, it runs 1.2x faster on 16 cores with a batch size of 8 on Graviton3, and 2.3x faster if fast-math mode is enabled (mkldnn_matmul exposes, through oneDNN and the Arm Compute Library, an option to run GEMM with FP32 inputs using BF16 operations).


Pull Request resolved: https://github.com/pytorch/pytorch/pull/91763
Approved by: https://github.com/jgong5, https://github.com/ngimel, https://github.com/malfet
2023-04-01 04:25:57 +00:00