pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-06 00:20:18 +01:00

Author	SHA1	Message	Date
mikey dagitses	2ac9086987	run buildifier on unified build files (#98141 ) This is pretty tricky. buildifier by default doesn't do much to these files. It does a little more if you tell it that they are `BUILD.bazel` files with -type=build. But it can do even more if you remove the target definitions from the `def define_rules()` wrapper and dedent them. I wrote a little wrapper that does that. I'll submit it at a later date. Differential Revision: [D44606558](https://our.internmc.facebook.com/intern/diff/D44606558/) NOTE FOR REVIEWERS: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D44606558/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/98141 Approved by: https://github.com/ezyang, https://github.com/PaliC	2023-04-04 00:37:19 +00:00
Michael Voznesensky	b1e60bfb6a	Pass f_locals as a dict rather than kwargs (#98107 ) Fixes https://github.com/pytorch/pytorch/issues/97688 One big problem is that instead of printing x < y we now print `E["x"] < E["y"]` and now all of the tests wobbled and I'm mad. Signed-off-by: Edward Z. Yang <ezyangmeta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/98107 Approved by: https://github.com/ezyang	2023-04-04 00:30:08 +00:00
Jason Ansel	b96fe9b61c	Fix issues related to ClassInstantier in HF models (#97997 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/97997 Approved by: https://github.com/anijain2305	2023-04-04 00:01:08 +00:00
Yifu Wang	4d13fcddef	[spmd expansion] support torch.ops.aten.sym_numel (#98229 ) The current logic assumes non-overload ops takes two arguments however torch.ops.aten.sym_numel takes one. Differential Revision: [D44615037](https://our.internmc.facebook.com/intern/diff/D44615037/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/98229 Approved by: https://github.com/mrshenli	2023-04-03 23:57:10 +00:00
Yanbo Liang	a6bd21d935	[Dynamo] Eagerly initializing Lazy Module to reduce graph breaks (#97946 ) Fixes Meta internal user case. Pull Request resolved: https://github.com/pytorch/pytorch/pull/97946 Approved by: https://github.com/wconstab	2023-04-03 22:24:43 +00:00
Bin Bao	96f548a1ac	[inductor] Add an AOT mode for the Triton backend (#98214 ) Summary: This is a copy of https://github.com/pytorch/pytorch/pull/97152 to make the landing easier. This PR implements a two-pass wrapper codegen for the Triton backend to achieve ahead-of-time compilation. In the first pass, the regular python wrapper code will be generated, and then the generated code will be executed to perform Triton compilation and autotuning. After that, the second pass wrapper codegen will generate C++ wrapper with proper CUDA API to load and launch Triton-generated CUDA kernels. Like the AOT mode for the cpp backend, the next step would be to provide a more complete API for AOT. Pull Request resolved: https://github.com/pytorch/pytorch/pull/98214 Approved by: https://github.com/eellison	2023-04-03 22:19:18 +00:00
Mikayla Gawarecki	73b06a0268	Fix rendering of arguments for nn.functional ops that use boolean_dispatch (#98092 ) Fix #97982 Pull Request resolved: https://github.com/pytorch/pytorch/pull/98092 Approved by: https://github.com/albanD	2023-04-03 21:17:43 +00:00
Guang Yang	eeb18d1e54	Fix dynamo tests and re-enable internally (#97937 ) Summary: `:test_dynamo` has been broken for long time internally in Meta. This PR is to fix the broken test and re-enable it internally. - Using the root `pytest.ini` for pytest - Decouple tests so that one can be disabled with affecting others - Temporarily disable the test cases that require additional efforts to fix OSS CI doesn't provide test code coverage info. Meta internal test infra does. The value of re-enabling these tests internally is not only to collect test coverage info but help fbcode developers to build/test from fbcode. Test Plan: `buck test mode/dev-nosan //caffe2/test:test_dynamo` https://www.internalfb.com/intern/testinfra/testrun/7318349540623516 Differential Revision: D44325238 Pull Request resolved: https://github.com/pytorch/pytorch/pull/97937 Approved by: https://github.com/ezyang	2023-04-03 20:47:13 +00:00
Yu Guo	3654552b8c	add deterministic impl for scatter and scatter_reduction sum/mean mode (#98060 ) using the existing deterministic implementation via `index_put` which has a deterministic implementation based on sorting indices. With the `accumulate` arg in `index_put`, this can work for both scatter and scatter_reduce with sum/mean reduction mode. Pull Request resolved: https://github.com/pytorch/pytorch/pull/98060 Approved by: https://github.com/mikaylagawarecki	2023-04-03 20:38:29 +00:00
Kwanghoon An	13f169c9da	Per Channel in brack-propagation function (#97475 ) Summary: Supporting Per Channel quantization in the gradient computation function. One workaround that I have added here is Current QNNPACK is not designed to process [transposed weight](https://fb.workplace.com/groups/pytorch.edge.users/permalink/1283737025829921/) Here we are simply replacing Per Channel to Per Tensor to compute a gradient (Some slow learning curve or WER degradation might be expected - We don't know, nothing is guaranteed) Test Plan: You can create your own synthetic model, FP32 layer -> INT8 layer with Per Channel and see if loss is decreasing Differential Revision: D43898794 Pull Request resolved: https://github.com/pytorch/pytorch/pull/97475 Approved by: https://github.com/weiwangmeta	2023-04-03 20:34:44 +00:00
PaliC	8e5f57a2b1	add users to external contribution metrics (#97928 ) :copilot summary Pull Request resolved: https://github.com/pytorch/pytorch/pull/97928 Approved by: https://github.com/kit1980	2023-04-03 19:52:31 +00:00
Rohan Varma	1ea528ef24	[bf16] bf16 support for conv_depthwise3d (#97819 ) Add bf16 for this op Differential Revision: [D44473429](https://our.internmc.facebook.com/intern/diff/D44473429/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/97819 Approved by: https://github.com/fegin	2023-04-03 19:31:27 +00:00
Jason Ansel	55afaa46a4	Support functools.partial and itertools.product (#98120 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/98120 Approved by: https://github.com/anijain2305	2023-04-03 18:23:25 +00:00
Devashish Shankar	2c905f2152	Extend Pattern Matcher to allow handling split-cat style patterns (#97726 ) Summary: This diff extends pattern matcher, by adding a few features which allows it to handle split-getitem-cat style patterns. 3 problems I encountered were: 1. In the handler, I only need one Arg() (the one which is the first input to split). None of the other args are relevant to replacement graph. So, we add a new Ignored() pattern to have ignored args 2. The pattern matching was visiting the split node again and again during the DFS. By propogating the patterns with _users>1 or Any into the child MatchContext, we avoid this problem. 3. To avoid the unbundling issue, I switched to using KeywordArg() instead of Arg() - as for this pattern, we need a flat list of Arg() in the end Example pattern: https://www.internalfb.com/intern/anp/view/?id=3325856 ``` pass_patterns.append(defaultdict(list)) register_replacement_pattern( CallFunction( aten.cat, ListOf( CallFunction(operator.getitem, CallFunction(aten.split_with_sizes, KeywordArg("input_"), Ignored(), Ignored(), _users=Any), Ignored() ),), Ignored() ), pass_number=3 ) def split_cat_replace(input_): return input_ ``` Test Plan: https://www.internalfb.com/intern/anp/view/?kernel=default&id=3317105 Reviewed By: jansel Differential Revision: D44282499 Pull Request resolved: https://github.com/pytorch/pytorch/pull/97726 Approved by: https://github.com/jansel	2023-04-03 17:30:56 +00:00
Bin Bao	095c129bd3	[CI] Add inference run for the performance dashboard (#98174 ) Summary: Remove fp32 training performance run and trade for amp inference performance run. Pull Request resolved: https://github.com/pytorch/pytorch/pull/98174 Approved by: https://github.com/huydhn	2023-04-03 17:29:55 +00:00
Bin Bao	ba7ee00f00	Add a --inference flag to dynamo benchmark script (#98173 ) Summary: When calling benchmark scripts, make it a requirement to pass --inference or --training Pull Request resolved: https://github.com/pytorch/pytorch/pull/98173 Approved by: https://github.com/huydhn	2023-04-03 17:12:28 +00:00
John Haitas	5a54eb0b15	[caffe2] miniz fix -Wstrict-prototypes (#98027 ) Summary: this fixes -Wstrict-prototypes Test Plan: eyes Reviewed By: rmaz Differential Revision: D44556017 Pull Request resolved: https://github.com/pytorch/pytorch/pull/98027 Approved by: https://github.com/albanD	2023-04-03 16:56:47 +00:00
Elias Ellison	0f0c1b6516	Flip back swtich (#98099 ) There are some errors occurring on the benchmark - switch back to old cudagraph impl until they are figured out https://torchci-git-fork-huydhn-add-compilers-bench-74abf8-fbopensource.vercel.app/benchmark/compilers Pull Request resolved: https://github.com/pytorch/pytorch/pull/98099 Approved by: https://github.com/desertfire	2023-04-03 14:46:33 +00:00
DanilBaibak	55daa835e9	Added allowed_workflows to pytorch probot (#98082 ) Added allowed_workflows to pytorch probot. This is a follow up PR [regarding the retry bot](https://github.com/pytorch/test-infra/pull/3942/files#diff-ee5e4f1e1fa962c6f62e5dcebde6e0bab573e74474601bf5749ccb668fd9c900R14-R16). Pull Request resolved: https://github.com/pytorch/pytorch/pull/98082 Approved by: https://github.com/huydhn	2023-04-03 12:30:43 +00:00
mingfeima	ced5c89b6f	add explicit vectorization for Half dtype on CPU (#96076 ) This patch is part of half float performance optimization on CPU: * add specification for dtype `Half` in `Vectorized<>` under both avx256 and avx512. * add specification for dtype `Half` in functional utils, e.g. `vec::map_reduce<>()`, which uses float32 as accumulate type. Also add a helper struct `vec_hold_type<scalar_t>`, since Vectorized<Half>::value_type is pointing to its underlying storage type which is `uint16_t`, leading to error if the kernel uses `Vec::value_type`. Half uses the same logic as BFloat16 in the Vectorized<>, each half vector is mapped to 2x float vectors for computation. Notice that this patch modified the cmake files by adding -mf16c on AVX2 build, from https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html, we can see that all the hardware platforms that support avx2 already have f16c Pull Request resolved: https://github.com/pytorch/pytorch/pull/96076 Approved by: https://github.com/malfet	2023-04-03 10:58:37 +00:00
Huy Do	c99895ca6f	Move pull and trunk slow tests to periodic (#98040 ) I notice that we are running some slow tests for CPU and `sm86` on pull and trunk. They take much longer to run than other shards (1.5x to 2x longer). I propose that we move them to periodic instead. Thoughts? The correlation between them are: * `linux-bionic-cuda11.7-py3.10-gcc7-sm86 / test (slow)` and `linux-bionic-cuda11.7-py3.10-gcc7-sm86 / test (default)` is 0.93 * `linux-bionic-py3.8-clang9-slow / test (slow)` and `linux-bionic-py3.8-clang9 / test (default)` is 0.98 <!-- copilot:summary --> ### <samp>🤖 Generated by Copilot at db56750</samp> This pull request updates the `.github/workflows` files to optimize the testing workflows for PyTorch. It adds new periodic workflows for more platforms and configurations, and removes some redundant or slow workflows from the pull and trunk workflows. Pull Request resolved: https://github.com/pytorch/pytorch/pull/98040 Approved by: https://github.com/malfet	2023-04-03 08:13:12 +00:00
PyTorch MergeBot	c597d9c1f2	Revert "Inductor cpp wrapper: support LinearUnary (#97655 )" This reverts commit `d03003ab8e`. Reverted https://github.com/pytorch/pytorch/pull/97655 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but it looks like the change causes a regression on CPU test time `d03003ab8e` (inductor/test_cpp_wrapper.py)	2023-04-03 08:09:58 +00:00
chunyuan	d03003ab8e	Inductor cpp wrapper: support LinearUnary (#97655 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/97655 Approved by: https://github.com/jansel	2023-04-03 04:26:10 +00:00
chunyuan	0c1f524b92	Inductor cpp wrapper: support MKLPackedLinear (#90755 ) Invoke `torch.ops.mkl._mkl_linear` from c++. Pull Request resolved: https://github.com/pytorch/pytorch/pull/90755 Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/jansel	2023-04-03 04:07:38 +00:00
Jiong Gong	5d62d12557	[Inductor] support transpose vertical reduction in cpp (#97781 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/97781 Approved by: https://github.com/jansel	2023-04-03 02:02:15 +00:00
Jason Ansel	76074dc0a3	Improve support for dict subclasses (#98154 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/98154 Approved by: https://github.com/anijain2305	2023-04-03 01:42:08 +00:00
Jiong Gong	bf22ecba2a	[Inductor] support vertical reduction in cpp (#97644 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/97644 Approved by: https://github.com/jansel	2023-04-03 01:29:12 +00:00
Jason Ansel	35b3309539	Fix graph break from inline patched init (#98150 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/98150 Approved by: https://github.com/anijain2305, https://github.com/yanboliang	2023-04-03 01:11:30 +00:00
Jiong Gong	8e5f491623	[Inductor] simplify CPP backend Tile2D code and support non-contiguous load/store (#97626 ) Remove `CppTile2DTailKernel` and `CppTile2DKernelChecker` and reuse `CppVecKernel` and `CppVecKernelChecker` for them. Add vectorization with fallback for load/store in CppVecKernel for the non-contiguous load/store needed by `CppTile2DTailKernel`. This PR also adds a functional support for transposed copy of bfloat16 data types. Better performance requires vectorized intrinsics implemented for at::vec::transpose_mxn. cc @soumith @voznesenskym @penguinwu @anijain2305 @EikanWang @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @desertfire Pull Request resolved: https://github.com/pytorch/pytorch/pull/97626 Approved by: https://github.com/jansel	2023-04-03 01:11:20 +00:00
Jason Ansel	71d850a100	[inductor] Fallback on complex64 kernels (#98155 ) Later PRs in this stack fixe graph breaks in GoogleFnet which triggers errors from inductor trying to compile torch.complex64, this fixes that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/98155 Approved by: https://github.com/anijain2305, https://github.com/ngimel	2023-04-03 01:06:43 +00:00
Jason Ansel	bc9dd969e1	Support inlining no_grad() decorator (#98121 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/98121 Approved by: https://github.com/anijain2305, https://github.com/voznesenskym	2023-04-03 00:24:56 +00:00
Shen Li	96403cfcec	[Easy] Fix lint error on DTensor math_ops.py (#98170 ) This lint error is caused by conflicts betwee #97996 and #98148 Pull Request resolved: https://github.com/pytorch/pytorch/pull/98170 Approved by: https://github.com/yifuwang	2023-04-02 19:11:05 +00:00
Shen Li	02179827cb	[Easy] Include SPMD and DTensor files in UFMT checks (#98148 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/98148 Approved by: https://github.com/fegin	2023-04-02 15:34:49 +00:00
Aleksei Nikiforov	38609cc47d	TensorExpr eval: fix copying variables from pointers on big endian systems (#96951 ) When copying data from pointers, only lowest bytes are copied. On little endian systems they are located at the beginning of pointer. On big endian systems they are located at the end of pointer. This change fixes TestTensorExprPyBind::test_dynamic_shape and TestTensorExprPyBind::test_dynamic_shape_2d tests from test/test_tensorexpr_pybind.py on big endian systems. Pull Request resolved: https://github.com/pytorch/pytorch/pull/96951 Approved by: https://github.com/ezyang, https://github.com/EikanWang	2023-04-02 12:49:14 +00:00
yanbing-j	2ab18a23e1	Update ideep submodule (#97430 ) ### Description This PR is to update ideep submodule for the following two aspects: 1. At inductor side, we are supporting dynamic shape path for packed linear, which we hopes the packed weight of linear doesn't depend on the input shapes and still can get a better a performance using a packed weight got from a dummy input shapes. However the current ideep has a accuracy issue for this case. This updating will fix the issue. 2. Add an extra arg is_channels_last for deconv to notify ideep whether to go channels last or not because the memory format checks of ideep (e.g. is_nhwc(), is_ndhwc()) is not 100% identical to suggest_memory_format() from pytorch. ### Performance Benchmark Use TorchBench test in ICX with 40 cores Intel OpenMP & tcmalloc were preloaded ![image](https://user-images.githubusercontent.com/61222868/229072474-193513ba-6727-4451-91ff-0d57e016736f.png) Pull Request resolved: https://github.com/pytorch/pytorch/pull/97430 Approved by: https://github.com/jgong5	2023-04-02 06:42:09 +00:00
Shen Li	347c67d4a2	[Easy] Consolidate string startswith checks (#98147 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/98147 Approved by: https://github.com/fegin	2023-04-02 04:02:37 +00:00
Wanchao Liang	7fcff01b50	[reland] switch mean to use reduction linear (#97996 ) mean is actually a reduction linear formula if the final reduction is partial sum (which currently is), so switching to use that instead Pull Request resolved: https://github.com/pytorch/pytorch/pull/97996 Approved by: https://github.com/XilunWu, https://github.com/yifuwang	2023-04-02 03:19:56 +00:00
Jason Ansel	d9e5ab4606	Fix graph break from 'hasattr: HFPretrainedConfigVariable()' (#98119 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/98119 Approved by: https://github.com/anijain2305	2023-04-02 02:56:45 +00:00
Jason Ansel	b9d3b3f595	Improve support for contextlib.nullcontext (#98111 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/98111 Approved by: https://github.com/anijain2305	2023-04-02 02:33:14 +00:00
Jason Ansel	92b46202ef	Add --stats option to benchmark scripts (#98109 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/98109 Approved by: https://github.com/anijain2305	2023-04-02 02:23:13 +00:00
mikey dagitses	e402259b8a	avoid warning in irange for unsigned types (#97973 ) Unsigned types should not be compared to be less than zero. Differential Revision: [D44538384](https://our.internmc.facebook.com/intern/diff/D44538384/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/97973 Approved by: https://github.com/Skylion007	2023-04-01 23:52:37 +00:00
Nikita Shulga	2af09393f9	`masked_scatter` should accept only bool masks (#97999 ) Modify test_torch to check that assert is raised in this case torch.uint8 usage has been deprecated for a few releases, and errors has been raised for other dtypes on CUDA device, but not on CPU. This PR finally restricts mask to just `torch.bool` See https://github.com/pytorch/pytorch/pull/96594 as an example doing it for `torch.masked_fill` Fixes https://github.com/pytorch/pytorch/issues/94634 Pull Request resolved: https://github.com/pytorch/pytorch/pull/97999 Approved by: https://github.com/ngimel	2023-04-01 23:25:25 +00:00
Jason Ansel	bbc4e911c8	Move CPUReproTests to its own file (#97943 ) test_torchinductor has gotten too big (almost 10k lines), this stack is trying to split it into smaller pieces. Pull Request resolved: https://github.com/pytorch/pytorch/pull/97943 Approved by: https://github.com/ngimel	2023-04-01 22:39:49 +00:00
Li-Huai (Allan) Lin	db8abde9b6	[MPS] Enable conditional indexing tests (#97871 ) The tests seem to be working now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/97871 Approved by: https://github.com/kulinseth	2023-04-01 16:15:08 +00:00
Shen Li	e8d39606eb	[SPMD] Enable fused Adam in full train step tracing (#98113 ) Differential Revision: [](https://our.internmc.facebook.com/intern/diff/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/98113 Approved by: https://github.com/yifuwang, https://github.com/fegin	2023-04-01 15:54:13 +00:00
Shen Li	bccf2ef0ce	Format DTensor dispatch.py and _meta_registrations.py (#98114 ) Format-only changes with black and lintrunner to prepare for the commit on top. Differential Revision: [D44603809](https://our.internmc.facebook.com/intern/diff/D44603809) Pull Request resolved: https://github.com/pytorch/pytorch/pull/98114 Approved by: https://github.com/yifuwang, https://github.com/fegin	2023-04-01 15:54:13 +00:00
mikey dagitses	64077ce511	remove redundant typed StorageImpl::data() member (#97650 ) This has the same implementation as the unsafe variants and the unsafe variants match the original semantics of the code, given that they don't check that the type matches. Given that we're updating callsites anyways to address the mutability aspect, we might as well just drop this method now. Differential Revision: [D44410210](https://our.internmc.facebook.com/intern/diff/D44410210/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/97650 Approved by: https://github.com/ezyang	2023-04-01 08:16:54 +00:00
Shunting Zhang	13461e9767	[inductor] more cuda metrics in wrapper (#97723 ) Following metrics should be helpful: - percent of time GPU is busy - percent of time various category of kernels (e.g. pointwise/reduction triton kernel) takes - percent of time each individual kernel takes compared to total wall time of the benchmark This PR add those. Example result from hf_Bert infernece graph: ``` == triton_pointwise category kernels == Kernel Self CUDA TIME (ms) Count Percent ------------------------------ --------------------- ------- --------- triton_poi_fused_gelu_6_0d1d 0.48154 12.0 5.52% triton_poi_fused_clone_1_0d1d2 0.29011 24.0 3.33% triton_poi_fused_clone_2_0d1d2 0.17417 12.0 2.00% triton_poi_fused_clone_4_0d1d2 0.10797 12.0 1.24% Total 1.05379 12.08% == triton_persistent_reduction category kernels == Kernel Self CUDA TIME (ms) Count Percent ------------------------------ --------------------- ------- --------- triton_per_fused__softmax__to_ 0.97188 12.0 11.14% triton_per_fused_add_native_la 0.37401 24.0 4.29% triton_per_fused_gelu_native_l 0.02 1.0 0.23% triton_per_fused_add_embedding 0.01718 1.0 0.20% Total 1.38307 15.86% == unknown category kernels == Kernel Self CUDA TIME (ms) Count Percent ------------------------------ --------------------- ------- --------- ampere_fp16_s16816gemm_fp16_12 2.24514 24.0 25.74% ampere_fp16_s16816gemm_fp16_25 1.39796 49.0 16.03% void cutlass::Kernel<cutlass_8 1.36093 1.0 15.61% ampere_fp16_s16816gemm_fp16_64 0.74591 12.0 8.55% ampere_fp16_s16816gemm_fp16_12 0.61989 12.0 7.11% Memset (Device) 0.024 12.0 0.28% void at::native::(anonymous na 0.01543 2.03 0.18% void at::native::vectorized_el 0.00011 0.03 0.00% Total 6.40937 73.49% Percent of time when GPU is busy: 101.44% ``` Note: the output shows total time GPU is busy is larger than total wall time. We measure total wall time disabling profiling while measure GPU time enabling profiling, that may distort the measurement a bit? But I assume the effect is not too large assuming the profiler mostly increase CPU time (rather than GPU). ## interesting usages 1. I pick a model that cudagraphs improve perf significantly like densenet121 and run the tool on it's forward graph. It's no surprise that quite a lot of time GPU is idle: ``` (Forward graph) Percent of time when GPU is busy: 32.69% Total wall time 17.307 ms ``` Its backward graph has less percent of GPU idle time, but it's still high: ``` (Backward graph) Percent of time when GPU is busy: 46.70% Total wall time 17.422 ms ``` 2. I profile a subset of torchbench models and plot a table to show the percent of execution time for pointwise/reduction/persistent_reduction/unknown_category . Since I plan to explore using coordinate descent tuner to improve reduction, those models with high percent of time spending on reduction should be good caididates (e.g. resnet50, mobilenet_v2 ). NOTE: a same model appears twice. The first rows is for the fwd graph and the second for the bwd graph. We profile different graphs for a model separately. ``` benchmark_name pointwise_percent reduction_percent persistent_reduction_percent unknown_category_percent GPU_busy_percent wall_time_ms ----------------------- ------------------- ------------------- ------------------------------ -------------------------- ------------------ -------------- resnet18 19.73% 7.86% 4.81% 41.25% 73.65% 2.549ms resnet18 18.59% 7.13% 3.35% 67.35% 96.41% 3.467ms resnet50 29.57% 22.13% 2.07% 51.68% 105.46% 6.834ms resnet50 26.42% 15.27% 0.94% 59.68% 102.31% 13.346ms vgg16 26.23% 0.00% 0.00% 74.20% 100.43% 18.212ms vgg16 15.63% 5.61% 0.10% 79.42% 100.75% 33.485ms BERT_pytorch 28.62% 4.82% 14.88% 33.32% 81.64% 7.162ms BERT_pytorch 14.43% 13.41% 18.19% 49.24% 95.27% 10.395ms densenet121 11.89% 2.14% 3.86% 16.36% 34.25% 16.531ms densenet121 10.37% 2.06% 4.09% 31.46% 47.98% 16.934ms hf_Bert 23.94% 0.00% 29.88% 46.09% 99.90% 7.766ms hf_Bert 11.65% 10.54% 20.26% 61.66% 104.11% 11.892ms nvidia_deeprecommender 42.92% 0.00% 0.00% 56.75% 99.67% 3.476ms nvidia_deeprecommender 31.36% 3.44% 0.46% 65.20% 100.45% 3.872ms alexnet 30.99% 0.00% 0.00% 69.16% 100.14% 3.169ms alexnet 24.41% 4.83% 0.17% 71.09% 100.50% 4.709ms mobilenet_v2 29.21% 27.79% 2.49% 44.00% 103.49% 10.160ms mobilenet_v2 17.50% 15.05% 1.06% 69.68% 103.29% 20.715ms resnext50_32x4d 18.96% 9.28% 2.31% 28.79% 59.33% 5.899ms resnext50_32x4d 18.48% 11.01% 1.86% 53.80% 85.14% 7.167ms mnasnet1_0 19.07% 14.52% 3.01% 35.43% 72.03% 6.028ms mnasnet1_0 14.17% 12.00% 1.87% 67.56% 95.60% 9.225ms squeezenet1_1 38.56% 0.00% 1.77% 56.21% 96.53% 2.221ms squeezenet1_1 21.26% 7.57% 1.05% 67.30% 97.18% 4.942ms timm_vision_transformer 17.05% 0.00% 18.80% 65.79% 101.64% 9.608ms timm_vision_transformer 9.31% 9.07% 10.32% 73.25% 101.96% 16.814ms ``` ## how to use `python {compiled_module_wrapper.py} -p` Pull Request resolved: https://github.com/pytorch/pytorch/pull/97723 Approved by: https://github.com/jansel	2023-04-01 08:04:14 +00:00
Jerry Zhang	553bb01df9	[quant][pt2e][refactor] Remove extra arguments of _maybe_insert_observers_before_graph_output (#98029 ) Summary: This PR allows _maybe_insert_observers_before_graph_output to be reused by pt2e flow Test Plan: python test/test_quantization.py TestQuantizeFx python test/test_quantization.py TestQuantizeFxOps python test/test_quantization.py TestQuantizeFxModels Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/98029 Approved by: https://github.com/vkuzo	2023-04-01 05:38:36 +00:00
Milos Puzovic	2630144786	Call to mkldnn_matmul from aten::addmm on AArch64 (#91763 ) We have noticed that on BERT_pytorch in torchbenchmark majority of time is spent in running GEMM in aten:addmm. At the moment this calls into BLAS routine, but on AArch64 it will be faster if it calls into mkldnn_matmul. Performance wise compared to build with OpenBLAS it runs faster 1.2x faster on 16 cores with batch size of 8 on Graviton3, while if fast math mode (mkldnn_matmul exposes through oneDNN and Arm Compute Library option to run GEMM with FP32 inputs using BBF16 operations) is enabled then it is 2.3x Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/91763 Approved by: https://github.com/jgong5, https://github.com/ngimel, https://github.com/malfet	2023-04-01 04:25:57 +00:00

1 2 3 4 5 ...

58390 Commits