Commit Graph

391 Commits

Author SHA1 Message Date
Edward Z. Yang
1fd7ea1ba8 Update skips for RecursionError (#96109)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96109
Approved by: https://github.com/huydhn
2023-03-06 17:55:38 +00:00
Bin Bao
60cf95610d [CI] Skip xcit_large_24_p8_224 in TIMM (#96048)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96048
Approved by: https://github.com/jansel
2023-03-05 14:54:46 +00:00
Bin Bao
1359d16fe8 [CI] Further tighten the checking of two eager runs (#95902)
Summary: To catch nondeterminism in eager if there is any.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95902
Approved by: https://github.com/jansel
2023-03-05 14:53:02 +00:00
Edward Z. Yang
c7c4a20321 Update dynamic skips (#95966)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95966
Approved by: https://github.com/janeyx99, https://github.com/voznesenskym
2023-03-04 23:01:58 +00:00
Jason Ansel
43dd043ea7 Revert "[inductor] Improve error messages (#95567)" (#96014)
This reverts commit 62b775583f.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96014
Approved by: https://github.com/Chillee
2023-03-04 04:03:31 +00:00
Edward Z. Yang
d303665d33 Make int unspecialization actually work (#95621)
OK, so this PR used to be about reducing the number of constants we specialize on, but it turns out that unspecialization was ~essentially never used (because we still constant specialized way too aggressively) and I ended up having to fix a bunch of issues to actually get tests to pass. So this PR is now "make int unspecialization actually work". As part of this, I have to turn off unspecialization by default, as there are still latent bugs in inductor.

The general strategy is that an unspecialized int is represented as a SymInt. Representing it as a 0d tensor (which is what the code used to do) is untenable: (1) we often need unspecialized ints to participate in size computations, but we have no way of propagating sympy expressions through tensor compute, and (2) a lot of APIs work when passed SymInt, but not when passed a Tensor. However, I continue to represent Numpy scalars as Tensors, as they are rarely used for size computation and they have an explicit dtype, so they are more accurately modeled as 0d tensors.

* I folded in the changes from https://github.com/pytorch/pytorch/pull/95099 as I cannot represent unspecialized ints as SymInts without also turning on dynamic shapes. This also eliminates the necessity for test_unspec.py, as toggling specialization without dynamic shapes doesn't do anything. As dynamic shapes defaults to unspecializing, I just deleted this entirely; for the specialization case, I rely on regular static shape tests to catch it. (Hypothetically, we could also rerun all the tests with dynamic shapes, but WITH int/float specialization, but this seems... not that useful? I mean, I guess export wants it, but I'd kind of like our Source heuristic to improve enough that export doesn't have to toggle this either.)
* Only 0/1 integers get specialized by default now
* A hodgepodge of fixes. I'll comment on the PR about them.
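
A minimal sketch of the behavior described above, assuming the dynamic-shapes path is enabled; `torch.compile(dynamic=True)` is used here as a stand-in for turning on dynamic shapes, and the recompile behavior in the comments reflects the PR's intent rather than a guarantee of this exact snippet:

```python
import torch

def f(x, n):
    # `n` is a plain Python int: when specialized it is baked into the graph as
    # a constant, when unspecialized it is traced as a SymInt.
    return x * n + n

compiled = torch.compile(f, dynamic=True)
x = torch.randn(4)

print(compiled(x, 3))  # first call triggers compilation
print(compiled(x, 7))  # unspecialized int: intended to reuse the same graph
print(compiled(x, 1))  # 0 and 1 remain specialized by default per this PR
```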

Fixes https://github.com/pytorch/pytorch/issues/95469

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95621
Approved by: https://github.com/jansel, https://github.com/Chillee
2023-03-04 01:22:08 +00:00
Jason Ansel
62b775583f [inductor] Improve error messages (#95567)
Example error message before/after (710 to 131 lines):
https://gist.github.com/jansel/6fecad057738089fa95bf08c3de9fc8a

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95567
Approved by: https://github.com/mlazos
2023-03-02 02:20:55 +00:00
Bin Bao
879f0c3fee [CI] Increase the timeout limit for benchmark test (#95787)
Summary: xcit_large_24_p8_224 occasionally hits TIMEOUT on CI. Bump up
the limit to reduce flakiness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95787
Approved by: https://github.com/ezyang, https://github.com/ZainRizvi
2023-03-01 19:54:25 +00:00
Bin Bao
e79b2b7792 [CI] Force clear triton cache between running each test (#95729)
Summary: The idea is to see if this reduces some of the flakiness
we have seen on CI. If it does help, then we have a problem in our
caching implementation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95729
Approved by: https://github.com/ngimel
2023-03-01 04:10:03 +00:00
Will Constable
1a72712645 Add dynamo graph break stats to CI (#95635)
Adds columns to the CSV produced by the accuracy job, including dynamo graph break stats.

Example output from torchbench CI job:
<img width="771" alt="image" src="https://user-images.githubusercontent.com/4984825/221716236-9276684e-1be8-43e1-837e-f41671d4e0e3.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95635
Approved by: https://github.com/ezyang
2023-02-28 16:17:46 +00:00
Edward Z. Yang
3762e801ba Update dynamic skips (#95587)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95587
Approved by: https://github.com/voznesenskym
2023-02-28 03:26:55 +00:00
Bin Bao
fa5a4b0dfc [CI] Do not compare two eager run results against fp64 result (#95616)
Summary: When running the benchmark test with --accuracy, two eager runs
should return the same result. If not, we want to detect it early, but
comparing against fp64_output may hide non-determinism in eager.
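
A hedged sketch of the eager-vs-eager comparison described above, assuming a model that returns a single tensor; the tolerances are illustrative, not the benchmark harness's actual settings:

```python
import torch

def eager_runs_match(model, inputs, rtol=1e-4, atol=1e-4):
    # Run the model twice in eager mode and compare the runs against each other,
    # rather than against an fp64 reference that could mask non-determinism.
    out1 = model(*inputs)
    out2 = model(*inputs)
    return torch.allclose(out1, out2, rtol=rtol, atol=atol)
```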

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95616
Approved by: https://github.com/ZainRizvi
2023-02-27 20:11:21 +00:00
Bin Bao
ab1ab3ab19 [CI] Specify more torch.backends.cudnn options to reduce non-determinism (#95478)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95478
Approved by: https://github.com/ezyang
2023-02-25 18:54:12 +00:00
Bin Bao
4c8ad93a7c [Inductor][CI] Remove hf_GPT2_large from CPU inference test (#95473)
Summary: hf_GPT2_large shows random failures on CI for CPU inference. Created https://github.com/pytorch/pytorch/issues/95474 for the Intel team to investigate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95473
Approved by: https://github.com/anijain2305
2023-02-24 18:21:36 +00:00
Will Constable
8de4238a31 Add dynamo bench arg --per_process_memory_fraction (#95260)
Simply pipes the arg to the existing torch.cuda API by the same name.

Useful for locally debugging OOMs that happened on a smaller GPU.
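
A minimal sketch of what the flag feeds into, namely the existing `torch.cuda.set_per_process_memory_fraction` API; the 0.5 fraction is an illustrative value:

```python
import torch

if torch.cuda.is_available():
    # Cap this process at half of device 0's memory, e.g. to reproduce an OOM
    # that was observed on a smaller GPU.
    torch.cuda.set_per_process_memory_fraction(0.5, device=0)
```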

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95260
Approved by: https://github.com/davidberard98
2023-02-22 05:11:18 +00:00
Edward Z. Yang
08370ddad8 Update model skips (#95089)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95089
Approved by: https://github.com/albanD
2023-02-20 13:24:49 +00:00
Wang, Eikan
954c767bc6 [Inductor] Enable accuracy test for CPPBackend (#94898)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94898
Approved by: https://github.com/jgong5, https://github.com/desertfire
2023-02-20 05:02:15 +00:00
Edward Z. Yang
a2f44d82f8 Flag guard unbacked SymInt/SymFloat support (#94987)
I believe this fixes the AllenaiLongformerBase problem in periodic.

The longer version of the problem: we are currently optimistically converting all item() calls into unbacked SymInt/SymFloat, but sometimes this results in a downstream error due to a data-dependent guard. Fallbacks for this case are non-existent; this will just crash the model. This is bad, so we flag-guard until we get working fallbacks.

What could these fallbacks look like? One idea I have is to optimistically make data-dependent calls unbacked, but then, if that results in a crash, restart Dynamo analysis with the plan of graph breaking immediately when the item() call happens.
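
A hedged sketch of the data-dependent pattern described above; under symbolic capture, the `.item()` call yields a value the compiler cannot know ahead of time (an unbacked SymInt/SymFloat), and branching on it requires a data-dependent guard:

```python
import torch

def f(x):
    n = int(x.sum().item())  # data-dependent scalar; unbacked under capture
    if n == 0:               # guard on an unbacked value -> may fail to compile
        return x + 1
    return x * n
```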

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94987
Approved by: https://github.com/Skylion007, https://github.com/malfet
2023-02-17 00:25:05 +00:00
Edward Z. Yang
7aaebe00ee Fail dynamic_aot_eager AllenaiLongformerBase model (#94986)
```
GuardOnDataDependentSymNode: It appears that you're trying to get a value out of symbolic int/float whose value is data-dependent (and thus we do not know the true value.)  The expression we were trying to evaluate is Eq(i3, -1).  Scroll up to see where each of these data-dependent accesses originally occurred.

While executing %as_strided : [#users=1] = call_method[target=as_strided](args = (%pad,), kwargs = {size: (12, %add, 768, 64), stride: (%getitem, %mul, %getitem_1, %getitem_2)})
Original traceback:
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/transformers/models/longformer/modeling_longformer.py", line 928, in <graph break in _sliding_chunks_matmul_attn_probs_value>
    chunked_value = padded_value.as_strided(size=chunked_value_size, stride=chunked_value_stride)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94986
Approved by: https://github.com/albanD
2023-02-16 20:02:46 +00:00
Aaron Gokaslan
0444a6c90a [BE] Remove deprecated logging warn method (#94708)
Swaps all logging.warn calls to logging.warning since the former is deprecated and even raises a deprecation warning now.
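
For illustration, the before/after of the swap (`logging.warn` has long been a deprecated alias of `logging.warning` and now emits a DeprecationWarning):

```python
import logging

log = logging.getLogger(__name__)
# log.warn("old spelling")     # deprecated alias, raises a DeprecationWarning
log.warning("new spelling")    # preferred API
```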

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94708
Approved by: https://github.com/ezyang
2023-02-13 18:24:52 +00:00
Edward Z. Yang
ae7a628b03 Dynamic shapes CI updates (#94690)
Data from https://github.com/pytorch/pytorch/pull/94683

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94690
Approved by: https://github.com/cpuhrsch
2023-02-13 18:20:12 +00:00
PyTorch MergeBot
10c430ba0a Revert "Set torch.backends.cudnn.enabled to false when testing accuracy (#94363)"
This reverts commit 2a5851735a.

Reverted https://github.com/pytorch/pytorch/pull/94363 on behalf of https://github.com/desertfire due to TIMM models start to show flaky failures after this PR, need more investigation
2023-02-10 04:40:32 +00:00
Bin Bao
2a5851735a Set torch.backends.cudnn.enabled to false when testing accuracy (#94363)
Summary: It looks like setting torch.backends.cudnn.deterministic to
True is not enough for eliminating non-determinism when testing
benchmarks with --accuracy, so let's turn off cudnn completely.
With this change, mobilenet_v3_large does not show random failure on my
local environment. Also take this chance to clean up CI skip lists.
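
A minimal sketch of the settings described above as they would appear in the benchmark setup; disabling cuDNN entirely goes one step beyond deterministic mode:

```python
import torch

torch.backends.cudnn.deterministic = True  # was not sufficient on its own
torch.backends.cudnn.enabled = False       # turn cuDNN off completely for --accuracy runs
```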

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94363
Approved by: https://github.com/ezyang
2023-02-09 23:43:13 +00:00
Xuehai Pan
a229b4526f [BE] Prefer dash over underscore in command-line options (#94505)
Prefer dashes over underscores in command-line options. Add `--command-arg-name` forms to the argument parser. The old arguments with underscores (`--command_arg_name`) are kept for backward compatibility.

Both dashes and underscores are used in the PyTorch codebase. Some argument parsers only have dashes or only have underscores in their arguments. For example, the `torchrun` utility for distributed training only accepts underscore arguments (e.g., `--master_port`). Dashes are more common in other command-line tools, and they appear to be the default choice in the Python standard library:

`argparse.BooleanOptionalAction`: 4a9dff0e5a/Lib/argparse.py (L893-L895)

```python
class BooleanOptionalAction(Action):
    def __init__(...):
            if option_string.startswith('--'):
                option_string = '--no-' + option_string[2:]
                _option_strings.append(option_string)
```

It adds `--no-argname`, not `--no_argname`. Also, typing `_` requires pressing the shift key, unlike `-`.
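
A hedged sketch of the backward-compatible pattern described above: register the dashed spelling and keep the old underscored spelling as an alias. The argument name here is illustrative, not one of the actual flags touched by this PR:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--batch-size", "--batch_size",  # dashed form preferred, underscored alias kept
    dest="batch_size", type=int, default=32,
)

args = parser.parse_args(["--batch_size", "64"])  # old spelling still parses
print(args.batch_size)  # 64
```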

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94505
Approved by: https://github.com/ezyang, https://github.com/seemethere
2023-02-09 20:16:49 +00:00
Edward Z. Yang
c028fc4e25 Decouple PT2 dynamic shapes from the functorch setting (#94469)
The functorch setting still exists, but it is no longer necessary: we infer
use of the Python dispatcher by checking whether the ambient FakeTensorMode
has a ShapeEnv or not.  The setting is now only for controlling direct
AOTAutograd use; for PT2, it's sufficient to use
torch._dynamo.config.dynamic_shapes.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94469
Approved by: https://github.com/Chillee, https://github.com/voznesenskym, https://github.com/jansel
2023-02-09 06:41:41 +00:00
PyTorch MergeBot
ca63040d2b Revert "Set torch.backends.cudnn.enabled to false when testing accuracy (#94363)"
This reverts commit 7bfc59993d.

Reverted https://github.com/pytorch/pytorch/pull/94363 on behalf of https://github.com/huydhn due to This change fails in trunk 7bfc59993d running out of memory.  Mark this as weird because it was green in PR
2023-02-09 01:24:35 +00:00
Bin Bao
7bfc59993d Set torch.backends.cudnn.enabled to false when testing accuracy (#94363)
Summary: It looks like setting torch.backends.cudnn.deterministic to
True is not enough for eliminating non-determinism when testing
benchmarks with --accuracy, so let's turn off cudnn completely.
With this change, mobilenet_v3_large does not show random failure on my
local environment. Also take this chance to clean up CI skip lists.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94363
Approved by: https://github.com/ezyang
2023-02-08 23:30:10 +00:00
Jason Ansel
eb1aca162e Re-enable cudagraphs for benchmark scripts (#94192)
Related to https://github.com/pytorch/pytorch/pull/93253

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94192
Approved by: https://github.com/albanD, https://github.com/desertfire
2023-02-08 16:38:32 +00:00
chuanqiw
94394e568e change the dynamo benchmark timeout as a parameter (#94284)
Change the dynamo benchmark timeout from a hard-coded value to a parameter with a default of 1200ms, because the hard-coded 1200ms timeout caused some single-thread-mode models to crash on the CPU platform. With the parameter, users can specify the timeout freely.
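
A hedged sketch of turning the hard-coded timeout into a command-line parameter; the flag name and default shown here are illustrative assumptions, not the exact code of this PR:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--timeout", type=int, default=1200,
                    help="per-model benchmark timeout (previously hard-coded)")
args = parser.parse_args([])
print(args.timeout)
```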

Fixes #94281

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94284
Approved by: https://github.com/malfet
2023-02-08 00:45:08 +00:00
Bin Bao
db011e11ea Skip sebotnet33ts_256 on CI (#94067)
Summary: Random failures on CI that have been happening more frequently lately.
Skip for now and filed an issue at https://github.com/pytorch/pytorch/issues/94066

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94067
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-02-06 14:58:54 +00:00
Edward Z. Yang
1d53123f44 Report graph breaks separately from graph count (#94143)
graph break != graph count - 1.  Suppose you have a nested
inline function call f1 to f2 to f3.  A graph break in f3
results in six graphs: f1 before, f2 before, f3 before, f3 after,
f2 after, f1 after.
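
A hedged illustration of the nested-inlining scenario described above; `torch._dynamo.graph_break()` forces a break inside the innermost frame, and the six-graph count is the author's claim for this shape of nesting:

```python
import torch
import torch._dynamo

def f3(x):
    x = x + 3
    torch._dynamo.graph_break()  # break inside the innermost inlined call
    return x * 3

def f2(x):
    return f3(x + 2) * 2

def f1(x):
    return f2(x + 1)

torch.compile(f1)(torch.randn(4))
```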

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94143
Approved by: https://github.com/voznesenskym
2023-02-05 04:03:12 +00:00
Edward Z. Yang
c1da35af5e Update dynamic benchmark skips (#94114)
Data from https://github.com/pytorch/pytorch/pull/94134

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94114
Approved by: https://github.com/SherlockNoMad
2023-02-04 20:36:51 +00:00
Jason Ansel
e071d72f3c Tag dynamo backends as debug/experimental (#93878)
Hides debug/experimental backends by default.

Before:
```
torch._dynamo.list_backends()
['aot_eager', 'aot_eager_decomp_partition', 'aot_torchxla_trace_once', 'aot_torchxla_trivial', 'aot_ts', 'aot_ts_nvfuser', 'cudagraphs', 'dynamo_accuracy_minifier_backend', 'dynamo_minifier_backend', 'eager', 'inductor', 'ipex', 'nvprims_aten', 'nvprims_nvfuser', 'onnxrt', 'tensorrt', 'torchxla_trace_once', 'torchxla_trivial', 'ts', 'tvm']
```

After:
```
torch._dynamo.list_backends()
['aot_ts_nvfuser', 'cudagraphs', 'inductor', 'ipex', 'nvprims_nvfuser', 'onnxrt', 'tensorrt', 'tvm']
```

Fixes https://github.com/pytorch/pytorch/issues/93733

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93878
Approved by: https://github.com/voznesenskym
2023-02-04 00:50:51 +00:00
Jason Ansel
0a93e6db5a Fix/refactor dynamo ipex backend (#93863)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93863
Approved by: https://github.com/desertfire
2023-02-03 21:42:27 +00:00
Jason Ansel
203b2cad3e Remove fx2trt/torch2trt backends (#93822)
These backends have been broken for some time.  I tried to get them
running again, but as far as I can tell they are not maintained.
Installing torch_tensorrt downgrades PyTorch to 1.12.  If I manually
bypass that downgrade, I get import errors from inside fx2trt.  Fixes that
re-add these are welcome, but it might make sense to move these wrappers
to the torch_tensorrt repo once PyTorch 2.0 support is added.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93822
Approved by: https://github.com/frank-wei
2023-02-03 21:04:21 +00:00
Jason Ansel
a5ff40032d Fix/refactor dynamo onnxrt backend (#93818)
Fixes https://github.com/pytorch/pytorch/issues/90352

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93818
Approved by: https://github.com/voznesenskym
2023-02-03 20:48:02 +00:00
Edward Z. Yang
2481fc0df4 Add count to FakeTensorMode.__torch_dispatch__ (#93936)
Most calls to fake tensor never hit `FakeTensor.__torch_dispatch__`

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93936
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2023-02-03 14:21:11 +00:00
Fabio Rocha
63115b70f0 Fixed issue with --diff-branch arg in dynamo benchmarks (#93989)
As @peterbell10 pointed out, it was giving incorrect results for `compression_ratio`
and `compression_latency` when you used `--diff-branch`.

This fixes the issue by running a separate subprocess for each branch, so that one branch's run cannot be affected by the other's.

Also added a couple more significant figures to the numbers in the summary table.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93989
Approved by: https://github.com/jansel
2023-02-03 08:36:57 +00:00
Jason Ansel
60e8c766b5 Refactor dynamo training backends (#93409)
This splits training.py into many files and moves them from `dynamo.optimizations.training` to `dynamo.backends.*`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93409
Approved by: https://github.com/ezyang
2023-02-03 03:07:15 +00:00
atalman
6e285c479d Remove cuda 11.6 from CI replace with 11.7 (#93406)
Remove cuda 11.6 from CI replace with 11.7
Following the Release readme here: https://github.com/pytorch/pytorch/blob/master/RELEASE.md#release-compatibility-matrix

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93406
Approved by: https://github.com/malfet, https://github.com/desertfire
2023-02-02 19:16:05 +00:00
Jason Ansel
d7b39b17ab Remove torch/_dynamo/optimizations/{analysis,log_args}.py (#93279)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93279
Approved by: https://github.com/voznesenskym
2023-02-02 02:34:36 +00:00
Edward Z. Yang
03b465a6d0 Add --iterations to benchmark script (#93858)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93858
Approved by: https://github.com/williamwen42
2023-02-01 21:56:49 +00:00
Edward Z. Yang
08041c5264 Configurable repro_tolerance for same_two_models (#93398)
Fixes https://github.com/pytorch/pytorch/issues/93293

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93398
Approved by: https://github.com/SherlockNoMad
2023-02-01 01:41:48 +00:00
Edward Z. Yang
811e95a15e --dynamic-ci-skips now works for all backends (#93369)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93369
Approved by: https://github.com/albanD
2023-01-31 20:07:58 +00:00
Edward Z. Yang
efee879695 Don't suppress warnings in CI. (#93269)
Warnings are an important clue that something bad is going on.
You want to see them in logs.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93269
Approved by: https://github.com/voznesenskym
2023-01-30 19:21:09 +00:00
Edward Z. Yang
9eb402d18e Update dynamic benchmark skips (#93228)
Data from https://github.com/pytorch/pytorch/pull/93223

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93228
Approved by: https://github.com/desertfire
2023-01-30 14:22:53 +00:00
XiaobingSuper
9a2becf60a inductor: fix inplace op's wrong lowering issue when preop is NopKernel (#92247)
For TIMM ghostnet_100, there is a case of concat followed by an in-place add:

```
import torch
from torch._inductor import config
config.debug = True
torch._dynamo.config.verbose=True

class MockModule(torch.nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x, y, z):
        out = torch.cat([x, y], dim=1)
        out += z
        return out

mod = MockModule().eval()
inputs = (
                torch.randn([1, 64, 16, 16]),
                torch.randn([1, 64, 16, 16]),
                torch.randn([1, 128, 16, 16]),
            )
ref = mod(*inputs)

with torch.no_grad():
    opt_model = torch._dynamo.optimize('inductor')(mod)
    out = opt_model(*inputs)
    out = opt_model(*inputs)
    out = opt_model(*inputs)
print(torch.equal(ref, out))
```

Inductor always gets a wrong result; I found that Inductor generates wrong code:

```

from ctypes import c_void_p, c_long
import torch
import random
from torch import empty_strided, as_strided, device
from torch._inductor.codecache import AsyncCompile
from torch._inductor.select_algorithm import extern_kernels

aten = torch.ops.aten
assert_size_stride = torch._C._dynamo.guards.assert_size_stride
async_compile = AsyncCompile()

kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/77/c7773nj5pwikpmm2pwa62rcudlf7p3if7eyqb5k4sjsvewwje4le.h"
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       const float* __restrict__ in_ptr1,
                       const float* __restrict__ in_ptr2,
                       const float* __restrict__ in_ptr3,
                       float* __restrict__ out_ptr0,
                       float* __restrict__ out_ptr1,
                       float* __restrict__ out_ptr2)
{
    {
        for(long i0=0; i0<1024; i0+=1)
        {
            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + 16*i0);
            tmp0.store(out_ptr0 + 16*i0);
        }
        #pragma omp simd simdlen(8)
        for(long i0=16384; i0<16384; i0+=1)
        {
            auto tmp0 = in_ptr0[i0];
            out_ptr0[i0] = tmp0;
        }
    }
    {
        for(long i0=0; i0<1024; i0+=1)
        {
            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr1 + 16*i0);
            tmp0.store(out_ptr1 + 16*i0);
        }
        #pragma omp simd simdlen(8)
        for(long i0=16384; i0<16384; i0+=1)
        {
            auto tmp0 = in_ptr1[i0];
            out_ptr1[i0] = tmp0;
        }
    }
    {
        for(long i0=0; i0<2048; i0+=1)
        {
            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr2 + 16*i0);
            auto tmp1 = at::vec::Vectorized<float>::loadu(in_ptr3 + 16*i0);
            auto tmp2 = tmp0 + tmp1;
            tmp2.store(out_ptr2 + 16*i0);
        }
        #pragma omp simd simdlen(8)
        for(long i0=32768; i0<32768; i0+=1)
        {
            auto tmp0 = in_ptr2[i0];
            auto tmp1 = in_ptr3[i0];
            auto tmp2 = tmp0 + tmp1;
            out_ptr2[i0] = tmp2;
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1, arg2_1 = args
    args.clear()
    buf3 = empty_strided((1, 128, 16, 16), (32768, 256, 16, 1), device='cpu', dtype=torch.float32)
    buf0 = as_strided(buf3, (1, 64, 16, 16), (32768, 256, 16, 1))  # alias
    buf1 = as_strided(buf3, (1, 64, 16, 16), (32768, 256, 16, 1), 16384)  # alias
    buf2 = empty_strided((1, 128, 16, 16), (32768, 256, 16, 1), device='cpu', dtype=torch.float32)
    kernel_cpp_0(c_void_p(arg0_1.data_ptr()), c_void_p(arg1_1.data_ptr()), c_void_p(buf2.data_ptr()), c_void_p(arg2_1.data_ptr()), c_void_p(buf0.data_ptr()), c_void_p(buf1.data_ptr()), c_void_p(buf3.data_ptr()))
    del arg0_1
    del arg1_1
    del arg2_1
    return (buf3, )

if __name__ == "__main__":
    from torch._dynamo.testing import rand_strided
    from torch._inductor.utils import print_performance
    arg0_1 = rand_strided((1, 64, 16, 16), (16384, 256, 16, 1), device='cpu', dtype=torch.float32)
    arg1_1 = rand_strided((1, 64, 16, 16), (16384, 256, 16, 1), device='cpu', dtype=torch.float32)
    arg2_1 = rand_strided((1, 128, 16, 16), (32768, 256, 16, 1), device='cpu', dtype=torch.float32)
    print_performance(lambda: call([arg0_1, arg1_1, arg2_1]))

```
You can see that the add operation always adds a random value; see the IR code:

1. **ir_pre_fusion.txt**
```
buf0: SchedulerNode(ComputedBuffer)
buf0.writes = [MemoryDep(name='buf0', index=c0, size=(16384,))]
buf0.unmet_dependencies = []
buf0.met_dependencies = [MemoryDep(name='arg0_1', index=c0, size=(16384,))]
buf0.group.device = cpu
buf0.group.iteration = ((16384,), ())
buf0.sizes = ([16384], [])
buf0.aliases = ['buf3']
class buf0_loop_body:
    var_ranges = {z0: 16384}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('arg0_1', get_index)
        get_index_1 = self.get_index('index0')
        store = ops.store('buf0', get_index_1, load, None)
        return store

buf1: SchedulerNode(ComputedBuffer)
buf1.writes = [MemoryDep(name='buf1', index=c0, size=(16384,))]
buf1.unmet_dependencies = []
buf1.met_dependencies = [MemoryDep(name='arg1_1', index=c0, size=(16384,))]
buf1.group.device = cpu
buf1.group.iteration = ((16384,), ())
buf1.sizes = ([16384], [])
buf1.aliases = ['buf3']
class buf1_loop_body:
    var_ranges = {z0: 16384}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('arg1_1', get_index)
        get_index_1 = self.get_index('index0')
        store = ops.store('buf1', get_index_1, load, None)
        return store

buf2: NopKernelSchedulerNode(ConcatKernel)
buf2.writes = [StarDep(name='buf2')]
buf2.unmet_dependencies = [StarDep(name='buf0'), StarDep(name='buf1')]
buf2.met_dependencies = []

buf3: SchedulerNode(ComputedBuffer)
buf3.writes = [MemoryDep(name='buf3', index=c0, size=(32768,))]
buf3.unmet_dependencies = [MemoryDep(name='buf2', index=c0, size=(32768,))]
buf3.met_dependencies = [MemoryDep(name='arg2_1', index=c0, size=(32768,))]
buf3.group.device = cpu
buf3.group.iteration = ((32768,), ())
buf3.sizes = ([32768], [])
class buf3_loop_body:
    var_ranges = {z0: 32768}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('buf2', get_index)
        get_index_1 = self.get_index('index0')
        load_1 = ops.load('arg2_1', get_index_1)
        add = ops.add(load, load_1)
        get_index_2 = self.get_index('index0')
        store = ops.store('buf3', get_index_2, add, None)
        return store

```
2. **ir_post_fusion.txt**
```
buf0: SchedulerNode(ComputedBuffer)
buf0.writes = [MemoryDep(name='buf0', index=c0, size=(16384,))]
buf0.unmet_dependencies = []
buf0.met_dependencies = [MemoryDep(name='arg0_1', index=c0, size=(16384,))]
buf0.group.device = cpu
buf0.group.iteration = ((16384,), ())
buf0.sizes = ([16384], [])
buf0.aliases = ['buf3']
class buf0_loop_body:
    var_ranges = {z0: 16384}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('arg0_1', get_index)
        get_index_1 = self.get_index('index0')
        store = ops.store('buf0', get_index_1, load, None)
        return store

buf1: SchedulerNode(ComputedBuffer)
buf1.writes = [MemoryDep(name='buf1', index=c0, size=(16384,))]
buf1.unmet_dependencies = []
buf1.met_dependencies = [MemoryDep(name='arg1_1', index=c0, size=(16384,))]
buf1.group.device = cpu
buf1.group.iteration = ((16384,), ())
buf1.sizes = ([16384], [])
buf1.aliases = ['buf3']
class buf1_loop_body:
    var_ranges = {z0: 16384}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('arg1_1', get_index)
        get_index_1 = self.get_index('index0')
        store = ops.store('buf1', get_index_1, load, None)
        return store

buf2: NopKernelSchedulerNode(ConcatKernel)
buf2.writes = [StarDep(name='buf2')]
buf2.unmet_dependencies = [StarDep(name='buf0'), StarDep(name='buf1')]
buf2.met_dependencies = []

buf3: SchedulerNode(ComputedBuffer)
buf3.writes = [MemoryDep(name='buf3', index=c0, size=(32768,))]
buf3.unmet_dependencies = [MemoryDep(name='buf2', index=c0, size=(32768,))]
buf3.met_dependencies = [MemoryDep(name='arg2_1', index=c0, size=(32768,))]
buf3.group.device = cpu
buf3.group.iteration = ((32768,), ())
buf3.sizes = ([32768], [])
class buf3_loop_body:
    var_ranges = {z0: 32768}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('buf2', get_index)
        get_index_1 = self.get_index('index0')
        load_1 = ops.load('arg2_1', get_index_1)
        add = ops.add(load, load_1)
        get_index_2 = self.get_index('index0')
        store = ops.store('buf3', get_index_2, add, None)
        return store
```

From the IR code, you can see that buf3 always adds an empty buf2 which is never written. The root cause is an issue in how the mutation for an in-place add is handled when its input is a NopKernel.

After this PR, the IR looks like this (**ir_pre_fusion.txt**):

```
buf0: SchedulerNode(ComputedBuffer)
buf0.writes = [MemoryDep(name='buf0', index=c0, size=(16384,))]
buf0.unmet_dependencies = []
buf0.met_dependencies = [MemoryDep(name='arg0_1', index=c0, size=(16384,))]
buf0.group.device = cpu
buf0.group.iteration = ((16384,), ())
buf0.sizes = ([16384], [])
buf0.aliases = ['buf2']
class buf0_loop_body:
    var_ranges = {z0: 16384}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('arg0_1', get_index)
        get_index_1 = self.get_index('index0')
        store = ops.store('buf0', get_index_1, load, None)
        return store

buf1: SchedulerNode(ComputedBuffer)
buf1.writes = [MemoryDep(name='buf1', index=c0, size=(16384,))]
buf1.unmet_dependencies = []
buf1.met_dependencies = [MemoryDep(name='arg1_1', index=c0, size=(16384,))]
buf1.group.device = cpu
buf1.group.iteration = ((16384,), ())
buf1.sizes = ([16384], [])
buf1.aliases = ['buf2']
class buf1_loop_body:
    var_ranges = {z0: 16384}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('arg1_1', get_index)
        get_index_1 = self.get_index('index0')
        store = ops.store('buf1', get_index_1, load, None)
        return store

buf2: NopKernelSchedulerNode(ConcatKernel)
buf2.writes = [StarDep(name='buf2')]
buf2.unmet_dependencies = [StarDep(name='buf0'), StarDep(name='buf1')]
buf2.met_dependencies = []

buf3: SchedulerNode(ComputedBuffer)
buf3.writes = [MemoryDep(name='buf3', index=c0, size=(32768,))]
buf3.unmet_dependencies = [MemoryDep(name='buf2', index=c0, size=(32768,)), StarDep(name='buf2')]
buf3.met_dependencies = [MemoryDep(name='arg2_1', index=c0, size=(32768,))]
buf3.group.device = cpu
buf3.group.iteration = ((32768,), ())
buf3.sizes = ([32768], [])
buf3.mutations = ['buf2']
class buf3_loop_body:
    var_ranges = {z0: 32768}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('buf2', get_index)
        get_index_1 = self.get_index('index0')
        load_1 = ops.load('arg2_1', get_index_1)
        add = ops.add(load, load_1)
        get_index_2 = self.get_index('index0')
        store = ops.store('buf3', get_index_2, add, None)
        return store

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92247
Approved by: https://github.com/ngimel, https://github.com/desertfire, https://github.com/jansel
2023-01-29 05:35:21 +00:00
Edward Z. Yang
025ef99ddf Get rid of dedicated inductor dynamic_shapes config (#93076)
Instead, use Dynamo dynamic_shapes config

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93076
Approved by: https://github.com/voznesenskym
2023-01-27 02:58:16 +00:00
Edward Z. Yang
5e9fa0a8fc Mark crossvit_9_240 as passing dynamic=True (#92981)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92981
Approved by: https://github.com/Chillee
2023-01-26 13:05:37 +00:00
Michael Voznesensky
d322f82b05 Add @count util to torch, use it to track benchmark stats (#93013)
<img width="1333" alt="image" src="https://user-images.githubusercontent.com/4755252/214687911-f766f072-c162-4298-9aed-c889f1375336.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93013
Approved by: https://github.com/ezyang
2023-01-26 03:09:12 +00:00
Edward Z. Yang
2ee94633a1 Change ciflow/inductor to test inductor inference with dynamic shapes (#92771)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92771
Approved by: https://github.com/voznesenskym
2023-01-25 02:21:02 +00:00
Edward Z. Yang
f724ecbd52 Add dynamic shapes aot_eager to periodic (#92770)
This means it overlaps with ciflow/inductor, but I'm about
to change that soon.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92770
Approved by: https://github.com/voznesenskym, https://github.com/albanD, https://github.com/desertfire
2023-01-25 02:21:02 +00:00
Edward Z. Yang
fb46d3e138 Run all of the timm models shards in the periodic (#92900)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92900
Approved by: https://github.com/bdhirsh, https://github.com/atalman
2023-01-24 17:56:20 +00:00
Horace He
c0327eb463 Some more inductor fixes for symbolic shapes (#92867)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92867
Approved by: https://github.com/ezyang
2023-01-24 15:05:46 +00:00
PyTorch MergeBot
2cf03bbbab Revert "Run all of the timm models shards in the periodic (#92743)"
This reverts commit de69cedf98.

Reverted https://github.com/pytorch/pytorch/pull/92743 on behalf of https://github.com/atalman due to This needs to be landed after https://github.com/pytorch/pytorch/pull/92845 and https://github.com/pytorch/pytorch/pull/92846 are landed
2023-01-23 23:44:09 +00:00
Fabio Rocha
a43b55e135 A few usability improvements for the dynamo benchmarks. (#92713)
- `--diff_main` renamed to `--diff-branch BRANCH` and now works again.
- The summary table splits results per branch.
- The CSV output now has a column with the branch name when run in this mode.
- Added a `--progress` flag so you can track how many models are going to be run.

Example output:
```
$ python benchmarks/dynamo/torchbench.py  --quiet --performance --backend inductor --float16 --batch-size-file $(realpath benchmarks/dynamo/torchbench_models_list.txt)   --filter 'alexnet|vgg16' --progress  --diff viable/strict
Running model 1/2
batch size: 1024
cuda eval  alexnet                             dynamo_bench_diff_branch   1.251x p=0.00
cuda eval  alexnet                             viable/strict              1.251x p=0.00
Running model 2/2
batch size: 128
cuda eval  vgg16                               dynamo_bench_diff_branch   1.344x p=0.00
cuda eval  vgg16                               viable/strict              1.342x p=0.00

Summary for tag=dynamo_bench_diff_branch:
speedup             gmean=1.30x mean=1.30x
abs_latency         gmean=24.09x mean=25.26x
compilation_latency mean=2.0 seconds
compression_ratio   mean=0.9x

Summary for tag=viable/strict:
speedup             gmean=1.30x mean=1.30x
abs_latency         gmean=24.11x mean=25.29x
compilation_latency mean=0.5 seconds
compression_ratio   mean=1.0x
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92713
Approved by: https://github.com/jansel
2023-01-23 18:23:35 +00:00
Edward Z. Yang
4a3fb7bcbc Make CI_SKIPS into a consolidated dict (#92769)
This makes it easier to add more configurations without causing a
thicket of if statements selecting the correct variable.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92769
Approved by: https://github.com/voznesenskym, https://github.com/desertfire
2023-01-23 14:57:18 +00:00
Edward Z. Yang
3cfd2fa1c7 Make --inductor imply --backend inductor (#92764)
This is to make some downstream code more uniform (can always ask args.backend for backend)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92764
Approved by: https://github.com/voznesenskym, https://github.com/desertfire
2023-01-23 14:57:18 +00:00
Edward Z. Yang
c52567ec18 Switch CI exclusions to use exact match. (#92761)
Since the CI exclusions are hard-coded in our script, we might as well require them to match exactly. This solved some head scratching where I was like, "this model is not obviously excluded, why is it not showing up in CI."
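
A minimal sketch of the switch described above: membership in the skip list is now checked with exact string equality rather than substring matching. The list contents here are illustrative:

```python
CI_SKIP = {"sebotnet33ts_256", "xcit_large_24_p8_224"}

def is_excluded(model_name: str) -> bool:
    # Exact match only; a partial name no longer silently excludes a model
    # the way substring matching could.
    return model_name in CI_SKIP
```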

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92761
Approved by: https://github.com/jansel
2023-01-22 17:10:20 +00:00
Edward Z. Yang
de69cedf98 Run all of the timm models shards in the periodic (#92743)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92743
Approved by: https://github.com/kit1980
2023-01-21 18:39:17 +00:00
Michael Voznesensky
5778c04a15 Add --timing flag, phase timing to @dynamo_timed (#92637)
Ex output:
```
 TIMING:
 entire_frame_compile:8.574629999999999
 backend_compile:5.26806
```
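
A hedged sketch of a phase-timing decorator in the spirit of `@dynamo_timed`: accumulate wall time per named phase and print a TIMING summary at the end. This is an illustration, not the actual implementation in torch._dynamo.utils:

```python
import functools
import time
from collections import defaultdict

phase_times = defaultdict(float)

def timed(phase_name):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                phase_times[phase_name] += time.perf_counter() - start
        return wrapper
    return decorator

@timed("backend_compile")
def compile_backend():
    time.sleep(0.01)  # stand-in for real compilation work

compile_backend()
print("TIMING:")
for name, seconds in phase_times.items():
    print(f" {name}:{seconds}")
```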

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92637
Approved by: https://github.com/ezyang
2023-01-21 10:52:13 +00:00
Edward Z. Yang
27bf879b8c Forward fix: restore sebotnet33ts_256 aot_eager skip (#92741)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92741
Approved by: https://github.com/kit1980
2023-01-21 08:10:23 +00:00
Edward Z. Yang
9ad0aca6e5 Update aot_eager CI failures (#92696)
Based on https://hud.pytorch.org/pr/92689

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92696
Approved by: https://github.com/desertfire
2023-01-21 02:29:22 +00:00
PyTorch MergeBot
44132cc4b0 Revert "Add --timing flag, phase timing to @dynamo_timed (#92637)"
This reverts commit 773b513435.

Reverted https://github.com/pytorch/pytorch/pull/92637 on behalf of https://github.com/malfet due to Broke lint
2023-01-20 16:23:20 +00:00
Michael Voznesensky
773b513435 Add --timing flag, phase timing to @dynamo_timed (#92637)
Ex output:
```
 TIMING:
 entire_frame_compile:8.574629999999999
 backend_compile:5.26806
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92637
Approved by: https://github.com/ezyang
2023-01-20 05:01:21 +00:00
Edward Z. Yang
44e52ea514 Reenable mobilevit_s in CI, seems to pass (#92585)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92585
Approved by: https://github.com/Chillee
2023-01-19 15:24:45 +00:00
Edward Z. Yang
b92a7afed9 Reclassify some dynamic aot_eager failures as static failures (#92376)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92376
Approved by: https://github.com/Chillee
2023-01-18 19:27:11 +00:00
Wu, Chunyuan
3aa6cec18c [dynamo] exclude reset_rng_state when measure timing (#92237)
Fixes inductor performance regression on CPU: https://github.com/pytorch/torchdynamo/issues/2027, https://github.com/pytorch/torchdynamo/issues/2028 and https://github.com/pytorch/torchdynamo/issues/2029.
The details are explained here: https://github.com/pytorch/torchdynamo/issues/2028#issuecomment-1381496678.

### Performance

- Model: lennard_jones
- Machine: IceLake (32 cores per socket)
- Configuration: single instance, 32 cores per instance
- jemalloc and iomp enabled

```bash
python benchmarks/dynamo/torchbench.py  --inductor-settings --inductor --performance --float32 -dcpu -n5000  --no-skip --dashboard --only=lennard_jones --quiet
```

Time before regression | Time after regression | Time with this PR
-- | -- | --
0.00020483799744397402 | 0.0002818034990923479 | 0.00020241099991835654

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92237
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-01-18 13:17:28 +00:00
Edward Z. Yang
fbbb19599a Update dynamic skips after #92076 (#92103)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92103
Approved by: https://github.com/voznesenskym, https://github.com/Chillee
2023-01-13 04:05:10 +00:00
Edward Z. Yang
74cbf058a5 Support --dynamic-ci-skips (#91893)
This makes it easier for us to run only the skipped benchmarks and
see if that actually started passing.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91893
Approved by: https://github.com/albanD
2023-01-11 20:02:58 +00:00
Edward Z. Yang
d24324bf1d s/INDCUTOR/INDUCTOR/ (#91885)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91885
Approved by: https://github.com/Skylion007, https://github.com/atalman, https://github.com/malfet
2023-01-11 12:28:19 +00:00
Edward Z. Yang
56ed976edf hrnet_w18, tts_angular works with dynamic shapes (#91891)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91891
Approved by: https://github.com/voznesenskym
2023-01-11 11:40:16 +00:00
blzheng
0c1777acec Dynamo benchmark: add CPU specific changes (#88477)
This PR adds some CPU-specific changes:

- Add support for IPEX backend
- https://github.com/pytorch/torchdynamo/issues/1618
- https://github.com/pytorch/torchdynamo/issues/1534
- Enable CPU launcher in runner.py.
- Fix the issue that some environment variables are not supported on CPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88477
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-01-07 09:26:06 +00:00
Shunting Zhang
a5f32f8978 training support for dynamo+torchxla integration (#88449)
We've already shown some promising perf results by integrating dynamo with torchxla for inference. To provide a consistent UX for training and for inference, in this PR we try to enable training for dynamo/torchxla.

Training is trickier than inference and we may not expect much of a perf gain since
1. in the training case, torchxla only generates a single combined graph for fwd/bwd/optimizer, while in the `torchxla_trace_once` bridge we added in dynamo, due to how AOT_Autograd works, we generate 3 graphs: one for the forward, one for the backward, and one for the optimizer. XLA favors larger graphs so it can do more optimizations.
2. in the training case, tracing overhead can be overlapped with computation. Tracing overhead is not as big a deal for training as it is for inference. After all, training cares more about throughput while inference cares more about latency.
3. in the training case, people can increase the batch size to 'mitigate' the tracing overhead. Increasing the batch size does not change the tracing overhead, so the tracing overhead 'per example' appears to decrease.

But we still want to add training support to dynamo/torchxla to make the work complete.

We added an '--iterations-per-run' argument to control how many iterations we do per measure/device sync. This is to understand the impact of item 2 above.

Results:

With '--iterations-per-run' equals to 1, here are the perf numbers:
```
+-------------------------+--------------------+-------------------------+
| Model                   |   XLA (trace once) |   XLA (trace everytime) |
+=========================+====================+=========================+
| resnet18                |             0.91   |                0.959    |
+-------------------------+--------------------+-------------------------+
| resnet50                |             0.917  |                0.932    |
+-------------------------+--------------------+-------------------------+
| resnext50_32x4d         |             0.912  |                0.905    |
+-------------------------+--------------------+-------------------------+
| alexnet                 |             1.038  |                0.974    |
+-------------------------+--------------------+-------------------------+
| mobilenet_v2            |             0.881  |                0.835    |
+-------------------------+--------------------+-------------------------+
| mnasnet1_0              |             0.903  |                0.931    |
+-------------------------+--------------------+-------------------------+
| vgg16                   |             0.914  |                0.967    |
+-------------------------+--------------------+-------------------------+
| BERT_pytorch            |             1.359  |                0.84     |
+-------------------------+--------------------+-------------------------+
| timm_vision_transformer |             1.288  |                0.893    |
+-------------------------+--------------------+-------------------------+
| geomean                 |             1.0006 |                0.913794 |
+-------------------------+--------------------+-------------------------+
```

Overall it looks like graph breaks indeed cause a perf loss. But for BERT_pytorch and timm_vision_transformer we still see a perf gain. We need to do more experiments with a larger '--iterations-per-run'.

NOTE:
In torchbench.py I added the following code to do a few workaround:
```
from myscripts import workaround # TODO will remove this line before landing
```

Here are the content of workaround.py:
```
import torch
from torch import nn
import os

# override max_pool2d with avg_pool2d
if os.environ.get("REPLACE_MAXPOOL", "0") == "1":
    torch.nn.MaxPool2d = torch.nn.AvgPool2d

```

It works around a few issues we found:
1. MaxPool2d does not work for training in dynamo/torchxla: https://github.com/pytorch/torchdynamo/issues/1837 . WIP fix from Brian in https://github.com/pytorch/pytorch/pull/90226 , https://github.com/pytorch/xla/pull/4276/files (WIP)
2. a recent change (this PR https://github.com/pytorch/pytorch/pull/88697 ) in op decomposition causes batch_norm ops to fall back in torchxla. Fix from jack in https://github.com/pytorch/xla/pull/4282#event-7969608134 . (confirmed the fix after adding Deduper to handle duplicated returns from the fx graph generated by AOTAutograd)
3. we have an issue handling dropout because of a random-seed out-of-sync issue. Here is the fix: https://github.com/pytorch/xla/pull/4293 (confirmed the fix)

Example command:
```
REPLACE_MAXPOOL=1 USE_FAKE_TENSOR=0 GPU_NUM_DEVICES=1 python benchmarks/dynamo/torchbench.py --randomize-input --performance --trace-on-xla --training --backend=aot_torchxla_trace_once --only vgg16
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88449
Approved by: https://github.com/wconstab, https://github.com/qihqi, https://github.com/malfet
2023-01-05 19:59:34 +00:00
Bin Bao
6bf0e3b697 [inductor] Check for BackendCompilerFailed on CI (#91634)
Summary: https://github.com/pytorch/pytorch/pull/91283/ skips certain
random triton failure on CI, but we need to check against the
BackendCompilerFailed exception type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91634
Approved by: https://github.com/ngimel
2023-01-03 22:38:29 +00:00
Animesh Jain
a32916190d buck-related minifier work (#91215)
Summary: Extending the minifier to generate buck target

Test Plan: N/A

Differential Revision: D42173893

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91215
Approved by: https://github.com/bertmaher, https://github.com/ngimel
2022-12-22 19:33:50 +00:00
Bin Bao
07c61685c8 [inductor] CI improvments (#91283)
Summary:
1) Setting torch.backends.cudnn.deterministic to True helps to
eliminate the eager_variance failures seen on CI
2) Skip Triton failures instead of retrying
3) Some minor script cleanup is also included in this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91283
Approved by: https://github.com/anijain2305
2022-12-22 15:37:43 +00:00
Michael Lazos
2f5759eaba Disable non-deterministic models for optimizers (#91149)
These two models are non-deterministic even with constant inputs and weights, and as a result they very rarely fail in CI with variations between the fp64 and fp32 models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91149
Approved by: https://github.com/desertfire
2022-12-20 20:19:54 +00:00
Bin Bao
84e73e1269 [inductor] small CI improvements (#91140)
Summary: 1) Increase timm_model download retry times; 2) Skip certain
random triton failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91140
Approved by: https://github.com/williamwen42
2022-12-20 17:26:12 +00:00
Michael Lazos
07c340bb2a Remove debug code (#91148)
Removes some debug code

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91148
Approved by: https://github.com/desertfire, https://github.com/williamwen42
2022-12-20 15:00:55 +00:00
Bin Bao
2a37ba8e81 [inductor] Add retry after benchmark test fails on CI (#90808)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90808
Approved by: https://github.com/malfet
2022-12-19 18:10:55 +00:00
Michael Lazos
1accd915a4 Re-enable optimizers (#90709)
Fixes
https://github.com/pytorch/pytorch/issues/90165
https://github.com/pytorch/torchdynamo/issues/328

Re-enables optimizer capture + compilation now that the dynamo slowdowns have been fixed

and it has speedups, numbers to come soon

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90709
Approved by: https://github.com/anijain2305, https://github.com/jansel, https://github.com/yanboliang
2022-12-19 04:07:41 +00:00
Edward Z. Yang
212873c615 Add dynamic shapes benchmark accuracy to CI (#90444)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90444
Approved by: https://github.com/voznesenskym
2022-12-17 11:17:20 +00:00
PyTorch MergeBot
e2377c8300 Revert "Add dynamic shapes benchmark accuracy to CI (#90444)"
This reverts commit 85db031e60.

Reverted https://github.com/pytorch/pytorch/pull/90444 on behalf of https://github.com/ezyang due to lint failing
2022-12-17 07:18:07 +00:00
Edward Z. Yang
85db031e60 Add dynamic shapes benchmark accuracy to CI (#90444)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90444
Approved by: https://github.com/voznesenskym
2022-12-17 06:39:45 +00:00
Michael Lazos
7c524221ba [reland3][dynamo] Revert "Revert "[reland][dynamo] use optimizers correctly in benchmarking (#87492)" (#90746)" (#90956)

This reverts commit ff1bbc2773.

This should be okay to merge now. The flakiness of HF models will be fixed by seeding the rng (https://github.com/pytorch/pytorch/pull/90936), and the numeric mismatch was root-caused to three decomps (still investigating why those decomps cause this) see https://github.com/pytorch/torchdynamo/issues/1985 for more detail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90956
Approved by: https://github.com/desertfire
2022-12-17 06:27:15 +00:00
PyTorch MergeBot
6bc6fb21db Revert "[reland2][dynamo] Revert "Revert "[reland][dynamo] use optimizers correctly in benchmar… (#90956)"
This reverts commit 8bc38ae4e2.

Reverted https://github.com/pytorch/pytorch/pull/90956 on behalf of https://github.com/desertfire due to Causing TIMM model failures
2022-12-16 19:28:05 +00:00
Michael Lazos
8bc38ae4e2 [reland2][dynamo] Revert "Revert "[reland][dynamo] use optimizers correctly in benchmarking (#87492)" (#90746)" (#90956)

This reverts commit ff1bbc2773.

This should be okay to merge now. The flakiness of HF models will be fixed by seeding the rng (https://github.com/pytorch/pytorch/pull/90936), and the numeric mismatch was root-caused to three decomps (still investigating why those decomps cause this) see https://github.com/pytorch/torchdynamo/issues/1985 for more detail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90956
Approved by: https://github.com/desertfire
2022-12-16 13:33:38 +00:00
Bin Bao
ad4189c8db [reland][inductor] Update TIMM skip list (#90762)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90762
Approved by: https://github.com/eellison
2022-12-13 19:56:23 +00:00
Bin Bao
ff1bbc2773 Revert "[reland][dynamo] use optimizers correctly in benchmarking (#87492)" (#90746)
This reverts commit d91d7a3221.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90746
Approved by: https://github.com/anijain2305
2022-12-13 11:37:16 +00:00
PyTorch MergeBot
e37c8c8436 Revert "[inductor] Update TIMM skip list (#90188)"
This reverts commit fd3f5d7bf7.

Reverted https://github.com/pytorch/pytorch/pull/90188 on behalf of https://github.com/desertfire due to flaky accuracy failure
2022-12-12 15:31:50 +00:00
Edward Z. Yang
e1ed5ad5a5 Add a timeout to benchmark script (#90634)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90634
Approved by: https://github.com/voznesenskym
2022-12-11 23:12:29 +00:00
Jiong Gong
181d37475d Simple fix: add missing positional arg in init_optimizer() call (#90641)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90641
Approved by: https://github.com/kit1980
2022-12-11 13:18:05 +00:00
Bin Bao
fd3f5d7bf7 [inductor] Update TIMM skip list (#90188)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90188
Approved by: https://github.com/anijain2305
2022-12-09 21:30:23 +00:00
Animesh Jain
d91d7a3221 [reland][dynamo] use optimizers correctly in benchmarking (#87492)
Reland https://github.com/pytorch/pytorch/pull/87311

mlazos: updated to use SGD to not add a bunch of additional memory allocations (like Adam)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87492
Approved by: https://github.com/desertfire
2022-12-09 20:32:53 +00:00
Ram Rachum
351d73b97f Fix exception causes all over the codebase (#90271)
This is the continuation to #90134 and hopefully the final PR in this series.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90271
Approved by: https://github.com/kit1980
2022-12-07 04:29:00 +00:00
David Berard
8f079b895b [Dynamo+FSDP] Update benchmarks with use_orig_params=True (#90100)
After https://github.com/pytorch/pytorch/pull/89523, we now need to assert use_orig_params=True, even in the non-recursive case where (I think) we wouldn't otherwise need to run with use_orig_params=True.

Tested with `python benchmarks/dynamo/torchbench.py --training --accuracy --only hf_T5 --fsdp`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90100
Approved by: https://github.com/wconstab
2022-12-07 03:33:58 +00:00
Richard Zou
4068c5467d [Reland] Move functorch/_src to torch/_functorch (#88756) (#90091)
This will be the last disruptive functorch internals change.

Why are we moving these files?
- As a part of rationalizing functorch we are moving the code in
functorch/_src to torch/_functorch
- This is so that we can offer the functorch APIs as native PyTorch APIs
(coming soon) and resolve some internal build issues.

Why are we moving all of these files at once?
- It's better to break developers all at once rather than many times

Test Plan:
- wait for tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90091
Approved by: https://github.com/anijain2305, https://github.com/ezyang
2022-12-03 14:17:15 +00:00
Wang, Eikan
0bde810572 Add more debug information for Inductor (#90008)
- Add the graph index to the profile information of the Inductor kernel for better debuggability.

  The generated code for different graphs can produce kernels with the same name. The side effect is that it is hard to attribute E2E performance to these kernels, because the profiler aggregates results by kernel name regardless of which graph produced them. Hence, this PR adds the graph index to the profile information to address that limitation.

- Label arbitrary code ranges for the `eager` and `opt` modes for better debuggability.

  The profile information of the dynamo benchmarks mixes the eager and opt modes, making it hard to separate the ranges for the different modes. This PR adds `eager` and `opt` marks to the profile information to address that limitation (a rough sketch of this labeling follows below).
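A rough illustration of the labeling idea using `torch.profiler.record_function`; the range names and the `"eager"` dynamo backend here are illustrative, not the exact labels or backend the benchmarks use:
```
import torch
import torch._dynamo
from torch.profiler import profile, record_function

model = torch.nn.Linear(8, 8)
opt_model = torch._dynamo.optimize("eager")(model)  # stand-in for the real backend
x = torch.randn(4, 8)

with profile() as prof:
    with record_function("eager_range"):  # mark the eager-mode portion
        model(x)
    with record_function("opt_range"):    # mark the optimized portion
        opt_model(x)

print(prof.key_averages().table(sort_by="cpu_time_total"))
```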

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90008
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-12-02 09:34:48 +00:00
Animesh Jain
3162a48a77 [dynamo][benchmarks] Call zero grad (#90026)
Hoping that it might reduce some flakiness

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90026
Approved by: https://github.com/williamwen42
2022-12-02 04:05:57 +00:00
Animesh Jain
68805b08d1 [benchmarks][dynamo] Trying CI - Set train() for TIMM models accuracy tests (#89780)
Moving to train mode for TIMM models and also raising batch size for accuracy testing.

Raising batch size seems to remove a lot of noise/instability coming from batch_norm decomposition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89780
Approved by: https://github.com/ngimel
2022-11-30 12:57:35 +00:00
PyTorch MergeBot
218d9c6e09 Revert "Move functorch/_src to torch/_functorch (#88756)"
This reverts commit 52bc5c1cfe.

Reverted https://github.com/pytorch/pytorch/pull/88756 on behalf of https://github.com/clee2000 due to broke imports in tests 52bc5c1cfe https://github.com/pytorch/pytorch/actions/runs/3574742513/jobs/6010814968 probably a landrace
2022-11-29 17:17:11 +00:00
Richard Zou
52bc5c1cfe Move functorch/_src to torch/_functorch (#88756)
This will be the last disruptive functorch internals change.

Why are we moving these files?
- As a part of rationalizing functorch we are moving the code in
functorch/_src to torch/_functorch
- This is so that we can offer the functorch APIs as native PyTorch APIs
(coming soon) and resolve some internal build issues.

Why are we moving all of these files at once?
- It's better to break developers all at once rather than many times

Test Plan:
- wait for tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88756
Approved by: https://github.com/ezyang
2022-11-29 13:55:42 +00:00
Bin Bao
465ee7bc09 [inductor] skip dm_nfnet_f0 in TIMM model test (#89768)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89768
Approved by: https://github.com/clee2000
2022-11-28 20:08:41 +00:00
Animesh Jain
cdf4087597 [benchmarks] Disabling gradscaler (#89741)
Disabling GradScaler because:
 1) The benchmark setup runs only 2 iterations of fwd-bwd, so scaling is not useful.
 2) The current setup shares one grad_scaler between the eager and dynamo models, which is bad: GradScaler has state and can adjust the scaling factor between the eager and dynamo runs, making the accuracy check harder (a small illustration follows below).
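For context, a minimal sketch of the statefulness point above: `GradScaler.update()` mutates the scale factor, so giving the eager and dynamo runs separate instances keeps them independent (this is an illustration, not the benchmark code):
```
import torch
from torch.cuda.amp import GradScaler

# One scaler per run: the scale factor is mutable state, so a shared
# instance would let overflows in the eager run change the scaling
# that the dynamo run sees (and vice versa).
eager_scaler = GradScaler(enabled=torch.cuda.is_available())
dynamo_scaler = GradScaler(enabled=torch.cuda.is_available())
```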

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89741
Approved by: https://github.com/ngimel
2022-11-28 20:08:37 +00:00
Bin Bao
049a0f2cd5 [inductor] Update CI model tests (#89499)
Summary:
1) Add model inference test
2) Switch model training test to use AMP

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89499
Approved by: https://github.com/bertmaher
2022-11-23 18:30:51 +00:00
Edward Z. Yang
ed32511974 Don't use explain() for --explain; instead read it off the counters (#89518)
Fixes a huggingface problem where example_inputs is not actually the args.
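Roughly, the counters in question live in `torch._dynamo.utils.counters`; a sketch of reading graph-break stats off them after a run (the exact keys the benchmark script aggregates are an assumption here):
```
import torch
import torch._dynamo as dynamo
from torch._dynamo.utils import counters

@dynamo.optimize("eager")
def fn(x):
    print("side effect")  # forces a graph break
    return x + 1

fn(torch.randn(4))

# counters is a dict of Counter objects keyed by category, e.g.
# counters["graph_break"] maps break reasons to occurrence counts.
print("graph breaks:", sum(counters["graph_break"].values()))
```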

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89518
Approved by: https://github.com/albanD
2022-11-23 02:43:53 +00:00
Will Constable
26322544b8 Add limited FSDP correctness to torchdynamo benchmark (#89469)
- Does not do recursive wrapping
- Only supports accuracy bench
- Mainly useful for sweeping over models for correctness, in part
  to evaluate whether dynamo support for FSDP is breaking anywhere

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89469
Approved by: https://github.com/davidberard98, https://github.com/aazzolini
2022-11-23 00:19:36 +00:00
Animesh Jain
f281f435a8 Fix benchmarks - xla tensor test (#89509)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89509
Approved by: https://github.com/ngimel, https://github.com/shunting314
2022-11-22 18:42:13 +00:00
Shunting Zhang
e545caa50f dynamo/torchxla integration: trace on xla rather than eager (#88904)
In #87741 we added inference support for the dynamo/torchxla integration. Later on, in #88449, we attempted to add training support. That attempt was not smooth because
- we tried 2 things together:
   1. let dynamo trace the model on xla rather than eager
   2. enable training
- it turns out neither of these two tasks is trivial.

Furthermore, item 2 (enable training) depends on item 1 (tracing on xla). We enable training via AOTAutograd, which lifts all model parameters/buffers as graph inputs. Without item 1 being done, we would need to copy all graph inputs (including model parameters/buffers) from the eager device to xla devices, which hurts performance a lot. Keeping a cache that maps eager parameters to XLA parameters does not solve the problem either, since an update on one side does not automatically sync to the other; they easily go out of sync.

This PR let dynamo trace the model on XLA rather than eager. This is a preparation step to enabling training.

Also, tracing on XLA makes the data movement more efficient. We see 1.5x geomean speedup compared to previous 1.38x.
```
+-------------------------+--------------------+-------------------------+
| Model                   |   XLA (trace once) |   XLA (trace everytime) |
+=========================+====================+=========================+
| resnet18                |            1.38    |                 1.008   |
+-------------------------+--------------------+-------------------------+
| resnet50                |            1.227   |                 0.998   |
+-------------------------+--------------------+-------------------------+
| resnext50_32x4d         |            1.544   |                 1.008   |
+-------------------------+--------------------+-------------------------+
| alexnet                 |            1.085   |                 1.045   |
+-------------------------+--------------------+-------------------------+
| mobilenet_v2            |            2.028   |                 1.013   |
+-------------------------+--------------------+-------------------------+
| mnasnet1_0              |            1.516   |                 0.995   |
+-------------------------+--------------------+-------------------------+
| squeezenet1_1           |            0.868   |                 1.01    |
+-------------------------+--------------------+-------------------------+
| vgg16                   |            1.099   |                 1.008   |
+-------------------------+--------------------+-------------------------+
| BERT_pytorch            |            3.26    |                 1.027   |
+-------------------------+--------------------+-------------------------+
| timm_vision_transformer |            2.182   |                 1.015   |
+-------------------------+--------------------+-------------------------+
| geomean                 |            1.50389 |                 1.01261 |
+-------------------------+--------------------+-------------------------+
```

Example command
```
GPU_NUM_DEVICES=1 python benchmarks/dynamo/torchbench.py --randomize-input --performance --trace-on-xla --only resnet18 --backend=torchxla_trace_once
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88904
Approved by: https://github.com/wconstab, https://github.com/JackCaoG, https://github.com/jansel
2022-11-22 03:57:04 +00:00
Xu Zhao
e4d9dbd7d2 Port torchdynamo's torchbench script to userbenchmark (#89239)
Summary:
This Diff ports the torchbench.py script from torchdynamo to torchbench to support the development of internal models.

Currently, it only works with the `--only` option and can only test one model at a time.

Note that the noisy logs are from upstream model code, not the benchmark code.
In the internal environment, `torch._dynamo.config.base_dir` is not writable, so we add an option to specify the output directory.

Test Plan:
```
$ buck2 run mode/opt //caffe2/benchmarks/dynamo:torchbench -- --performance --only ads_dhen_5x --part over --output-directory /tmp/tb-test/
cuda eval  ads_dhen_5x
  1/  1 +0 frames   2s  1 graphs  1 graph calls  412/ 411 = 100% ops 100% time
```

```
$  buck2 run mode/opt //caffe2/benchmarks/dynamo:torchbench -- --performance --only cmf_10x --part over --output-directory /tmp/tb-test/
cuda eval  cmf_10x
  1/  1 +0 frames   1s  1 graphs  1 graph calls  306/ 305 = 100% ops 100% time
```

Reviewed By: jansel

Differential Revision: D41294311

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89239
Approved by: https://github.com/jansel
2022-11-21 17:25:28 +00:00
Michael Voznesensky
631baecbcd Add --explain flag to bench (#89316)
TORCHDYNAMO_DYNAMIC_SHAPES=1 AOT_DYNAMIC_SHAPES=1 time python benchmarks/dynamo/torchbench.py  --accuracy --explain  --backend aot_eager --train --only BERT_pytorch

Dynamo produced 76 graphs with 75 graph break and 198 ops

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89316
Approved by: https://github.com/ezyang
2022-11-19 03:35:09 +00:00
Bin Bao
19fcb80551 [inductor] Skip DALLE2_pytorch in torchbench (#89288)
Summary: DALLE2_pytorch fails in eager as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89288
Approved by: https://github.com/Krovatkin
2022-11-18 16:21:17 +00:00
Bin Bao
1f7c0ff6e7 [inductor] Temporarily disable functorch_dp_cifar10 test in TorchBench (#89281)
Summary: The failure wasn't caught because of a land race. Skip the test
for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89281
Approved by: https://github.com/Krovatkin
2022-11-18 16:07:44 +00:00
Bin Bao
31b10e7d40 Enable inductor CI for TorchBench (#87465)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87465
Approved by: https://github.com/malfet
2022-11-17 23:16:21 +00:00
Animesh Jain
74610a1ced [dynamo][benchmarks] HF - Fix seq len and batch sizes (#89165)
Fixes many models in https://github.com/pytorch/torchdynamo/issues/1842
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89165
Approved by: https://github.com/ngimel
2022-11-17 06:14:24 +00:00
Shunting Zhang
a13433940c allow loading model from a path in torchbench (#89028)
Sometimes it's really convenient to run simple models through the torchbench.py script rather than those from pytorch/benchmark. This PR adds the ability to run any model from a specified path by overloading the --only argument.

This PR is split out from #88904

Here is the usage:

        Specify the path and class name of the model in format like:
        --only=path:<MODEL_FILE_PATH>,class:<CLASS_NAME>

        Due to the fact that dynamo changes current working directory,
        the path should be an absolute path.

        The class should have a method get_example_inputs to return the inputs
        for the model. An example looks like
        ```
        class LinearModel(nn.Module):
            def __init__(self):
                super().__init__()
                self.linear = nn.Linear(10, 10)

            def forward(self, x):
                return self.linear(x)

            def get_example_inputs(self):
                return (torch.randn(2, 10),)
        ```

Test command:
```
# python benchmarks/dynamo/torchbench.py --performance --only=path:/pytorch/myscripts/model_collection.py,class:LinearModel --backend=eager
WARNING:common:torch.cuda.is_available() == False, using CPU
cpu  eval  LinearModel                        0.824x p=0.00
```

Content of model_collection.py
```
from torch import nn
import torch

class LinearModel(nn.Module):
    """
    AotAutogradStrategy.compile_fn ignores graphs with at most 1 call node.
    Make sure this model calls 2 linear layers to avoid being skipped.
    """
    def __init__(self, nlayer=2):
        super().__init__()
        layers = []
        for _ in range(nlayer):
            layers.append(nn.Linear(10, 10))
        self.layers = nn.Sequential(*layers)

    def forward(self, x):
        return self.layers(x)

    def get_example_inputs(self):
        return (torch.randn(2, 10),)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89028
Approved by: https://github.com/jansel
2022-11-16 00:29:08 +00:00
Bin Bao
4108367123 Exclude poolformer_m36 from the inductor model test (#88908)
Summary: The root cause is still to be investigated. Issue tracked at
https://github.com/pytorch/torchdynamo/issues/1856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88908
Approved by: https://github.com/malfet
2022-11-12 03:10:25 +00:00
William Wen
6e3555edea Add absolute latency to dashboard (#88790)
Add absolute latency to dashboard, as requested by https://github.com/pytorch/torchdynamo/issues/1833#issuecomment-1302742914

Tested by setting `run.sh` to
```
# Setup the output directory
rm -rf ../test-dynamo-runner-logs-7/
mkdir ../test-dynamo-runner-logs-7/

# Commands for torchbench for device=cuda, dtype=float32 for training and for performance testing
python benchmarks/dynamo/torchbench.py --performance --float32 -dcuda --output=../test-dynamo-runner-logs-7//inductor_torchbench_float32_training_cuda_performance.csv --training --inductor   --no-skip --dashboard --only mobilenet_v2 --cold_start_latency

# Commands for torchbench for device=cuda, dtype=float32 for training and for accuracy testing
python benchmarks/dynamo/torchbench.py --accuracy --float32 -dcuda --output=../test-dynamo-runner-logs-7//inductor_torchbench_float32_training_cuda_accuracy.csv --training --inductor   --no-skip --dashboard --only mobilenet_v2
```
and running `python benchmarks/dynamo/runner.py --output-dir ../test-dynamo-runner-logs-7/ --dashboard-archive-path /data/home/williamwen/dynamo-runner-logs-copy --training --run --compilers inductor --flag-compilers inductor --suites torchbench --update-dashboard`  (need to comment out the `generate_commands` line and change the github issue ID from 681 to something else).

Sample comment: https://github.com/pytorch/torchdynamo/issues/1831#issuecomment-1309645562

NOTE: this change breaks processing old logs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88790
Approved by: https://github.com/anijain2305
2022-11-10 01:45:52 +00:00
Jason Ansel
de53d4143a Fix TorchInductor benchmarking in fbcode (#88689)
Summary: Makes the C++ TorchInductor benchmarking work in fbcode plus some minor fixed to enable that.

Test Plan: Test added

Differential Revision: D41045910

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88689
Approved by: https://github.com/soumith
2022-11-09 18:13:06 +00:00
Will Constable
100b55637b Mark dynamo torchbench dlrm as unsupported (#88712)
- DLRM requires special configuration of embedding layers which are sparse
  and not compatible with DDP.
- I could mark the embedding params as ignored in DDP
  to make the benchmark pass, but this isn't a representative benchmark.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88712
Approved by: https://github.com/ezyang
2022-11-09 17:23:56 +00:00
Will Constable
89c5819626 Dynamo DDP accuracy bench uses find_unused_parameters (#88645)
- find_unused_parameters adds a slight overhead, but is required
  in cases where users do not manually specify which parameters to ignore
  because they will not receive grads. In some models, some parameters
  do not receive grads, and this causes DDP to throw an exception
  as it waits for a grad for each parameter (see the sketch below).
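A minimal sketch of the flag being enabled (process-group setup and the harness itself are omitted; `wrap_for_accuracy_bench` is an illustrative name):
```
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_for_accuracy_bench(model: torch.nn.Module, device_id: int) -> DDP:
    # find_unused_parameters=True lets DDP tolerate parameters that get
    # no gradient in an iteration, at a small overhead, instead of
    # raising while waiting for a grad for every parameter.
    return DDP(model, device_ids=[device_id], find_unused_parameters=True)
```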

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88645
Approved by: https://github.com/soumith
2022-11-08 16:13:10 +00:00
Will Constable
1f32c3c087 Add single-process DDP accuracy support to dynamo benchmark suite (#88511)
- does not intend to support multi-process, as that is more complex
  and we have torchbench scripts for that
- currently only works in accuracy mode as this was the main goal,
  but could be extended for measuring single-gpu perf impact of
  graph breaks

Run with

`python benchmarks/dynamo/torchbench.py --inductor --training --accuracy --only hf_Bert --ddp`

Example output
```
cuda train hf_Bert
[2022-11-04 18:52:08,304] torch._inductor.compile_fx: [WARNING] skipping cudagraphs due to complex input striding
PASS
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88511
Approved by: https://github.com/davidberard98, https://github.com/aazzolini
2022-11-05 02:41:17 +00:00
Animesh Jain
1b575782a0 [dynamo][benchmarks] use fresh inductor cache and raise batch size wherever possible (#88044)
cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88044
Approved by: https://github.com/ngimel
2022-10-30 17:10:17 +00:00
Shunting Zhang
e4a8661ab8 torchdynamo and xla integration (#87741)
# Motivation
- torchdynamo and torchxla use different strategies to achieve sound graph capture: the former relies on guards, the latter on retracing
- the guard system has quite low overhead, but torchxla's tracing overhead is quite high

The main idea is to leverage torchdynamo's guard system to avoid retracing in torchxla, so that
- we can integrate torchdynamo with XLA
- we reduce or even completely avoid the tracing overhead of torchxla

# Technique details
## XLA baseline
We found that different frameworks do not generate numerically identical results for the SAME model with the SAME input. By default, torchdynamo uses eager as baseline so the model will run with PyTorch. It would be tricky to compare a model running on XLA with this baseline: it's hard to check correctness. To make the comparison easier, we add a flag `--use-xla-baseline`. When it's enabled, the baseline will be run on XLA.

## New dynamo backends added
We add 2 new dynamo backends, torchxla_trivial and torchxla_trace_once, to control the optimization targets.

torchxla_trivial simply moves inputs/model parameters to XLA and runs the model on XLA. There is tracing overhead for each run, so we should expect that result to be mostly neutral compared to the XLA baseline.

torchxla_trace_once traces only once, at AOT compile time. Here are the steps (a toy sketch of the final lookup follows the list):
1. dynamo captures the guards and the subgraph
2. the torchxla_trace_once backend traces the graph with torchxla, lowers it, and records a hash of the graph for later lookup
3. at inference time, the hash is used directly to look up the optimized graph and run it.
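A toy sketch of the lookup in step 3 (purely illustrative; the real integration hashes the lowered XLA graph rather than a Python-level key):
```
# Lower once per unique graph hash, then reuse the compiled artifact.
compiled_cache = {}

def run_trace_once(graph_hash, lower_fn, *args):
    # lower_fn performs the expensive torchxla tracing/lowering; it runs
    # only the first time a given hash is seen.
    if graph_hash not in compiled_cache:
        compiled_cache[graph_hash] = lower_fn()
    return compiled_cache[graph_hash](*args)
```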

# Limitations
We cannot handle LTC/torchxla fallback right now. If an op is missing an LTC kernel, we raise an exception, which results in a dynamo fallback (or a try with another compiler). People have brainstormed the idea of breaking the graph and stitching the subgraphs together, but maybe it's easier to add the missing LTC kernels for those models.

# Results
The models we tested are those not causing LTC fallback. We ran the tests on **GPU**. We see a **1.38x** geomean speedup for torchxla_trace_once, and torchxla_trivial is mostly neutral as expected.
```
+-------------------------+--------------------+-------------------------+
| Model                   |   XLA (trace once) |   XLA (trace everytime) |
+=========================+====================+=========================+
| resnet18                |            1.346   |                 1.045   |
+-------------------------+--------------------+-------------------------+
| resnet50                |            1.153   |                 1.007   |
+-------------------------+--------------------+-------------------------+
| resnext50_32x4d         |            1.381   |                 1.039   |
+-------------------------+--------------------+-------------------------+
| alexnet                 |            1.045   |                 1.018   |
+-------------------------+--------------------+-------------------------+
| mobilenet_v2            |            1.562   |                 1.021   |
+-------------------------+--------------------+-------------------------+
| mnasnet1_0              |            1.303   |                 1.069   |
+-------------------------+--------------------+-------------------------+
| squeezenet1_1           |            1.278   |                 1.025   |
+-------------------------+--------------------+-------------------------+
| vgg16                   |            1.076   |                 1.008   |
+-------------------------+--------------------+-------------------------+
| BERT_pytorch            |            2.224   |                 0.978   |
+-------------------------+--------------------+-------------------------+
| timm_vision_transformer |            1.81    |                 1.025   |
+-------------------------+--------------------+-------------------------+
| geomean                 |            1.38101 |                 1.02324 |
+-------------------------+--------------------+-------------------------+
```

The speedup is similar to what we see from previous work for LTC's TorchScript backend (we see 1.40 geomean speedup there):
https://docs.google.com/presentation/d/1G09X8v41u_cLKLtSdf7v6R8G19-iZTPcW_VAdOnvYBI/edit#slide=id.g11bf989cb6b_1_5

# Next steps
- Use AOT autograd to enable training
- Share results on XLA devices
- Do more extensive tests on torchbench models

Example command
```
GPU_NUM_DEVICES=1 python benchmarks/dynamo/torchbench.py --randomize-input --performance --use-xla-baseline --only resnet18 --backend=torchxla_trace_once
```

Thanks @JackCaoG from the torchxla team for helping debug various perf issues and merging the torchxla PR! That was super critical for getting the results above. torchxla side PR: https://github.com/pytorch/xla/pull/4119

topic: not user facing

cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @jansel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87741
Approved by: https://github.com/wconstab
2022-10-29 17:52:26 +00:00
Animesh Jain
2cb7c3f865 [dynamo][benchmarks] Prepone Cold start setup (#87913)
Parallel compilation warms the threadpool when we call `torch._dynamo.optimize()`. In the current benchmarks, we were setting up TRITON_CACHE_DIR much later. Because of this, the parallel-compilation artifacts were not used, and compilation latency improvements were not visible on the dashboard. This PR simply moves the TRITON_CACHE_DIR setup earlier.
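Conceptually, the ordering fix amounts to something like this (a sketch; the cache path and the toy function are illustrative, not the benchmark code):
```
import os
import torch
import torch._dynamo as dynamo

# Point Triton at the cache directory *before* dynamo warms up its
# compile threadpool, so the parallel-compilation artifacts get reused.
os.environ["TRITON_CACHE_DIR"] = "/tmp/triton_cache"

opt_fn = dynamo.optimize("inductor")(lambda x: x * 2)  # threadpool warm-up happens here
```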

cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87913
Approved by: https://github.com/wconstab
2022-10-28 02:41:13 +00:00
Animesh Jain
83b381d34d [dynamo] add inductor runs w/o cudagraphs (#87847)
as title

cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87847
Approved by: https://github.com/jansel
2022-10-27 19:49:29 +00:00
Animesh Jain
ebe5aad466 [inductor] Revert channels-last support (#87588)
We witnessed slow compilation times last week. Earlier, I thought it was due to parallel compilation, but after a git bisect I found the source of the extra time to be my PR - https://github.com/pytorch/pytorch/pull/87049

For 1x1 kernels, the current striding check incorrectly declares channels-first 1x1 convs as channels-last. I am not sure why this caused such a large compilation-time jump, or why it did not fail; there was no change in the performance speedup. cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang @penguinwu to identify what could be the source of this compilation-time increase, so that we can manually check that part of the stack.

With this, `res2next50` compilation time went back to 96 seconds for a single thread (it had risen to 900 seconds with my earlier PR), and parallel compilation brings it down to ~30 seconds.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87588
Approved by: https://github.com/soumith, https://github.com/jansel, https://github.com/ngimel
2022-10-25 19:58:25 +00:00
Bin Bao
f047dadab9 Enable inductor CI for TIMM (#87462)
cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87462
Approved by: https://github.com/anijain2305
2022-10-22 05:50:00 +00:00
Will Constable
c55b332517 Delete unused static runtime experiment (#87473)
cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87473
Approved by: https://github.com/anijain2305
2022-10-21 20:03:24 +00:00
Will Constable
dfc65f43f9 Delete unused ts experiment (#87472)
cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87472
Approved by: https://github.com/anijain2305
2022-10-21 20:03:24 +00:00
Will Constable
7baf4b1969 Delete unused ltc experiments (#87471)
cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87471
Approved by: https://github.com/anijain2305
2022-10-21 20:03:22 +00:00
Will Constable
62d30f5a8a Remove unused cold_start experiment (#87470)
- this `--cold_start` experiment didn't end up being used
- there is a new `--cold_start_latency` flag that is used
- this experiment was only hooked up for nvfuser anyway

cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87470
Approved by: https://github.com/anijain2305
2022-10-21 20:00:05 +00:00
Edward Z. Yang
96691865b9 [dynamo] Unify raise_on_* config to suppress_errors and raise by default (#87440)
I noticed that a lot of bugs are being suppressed by torchdynamo's default
error suppression, and worse yet, there's no way to unsuppress them.  After
discussion with voz and soumith, we decided that we will unify error suppression
into a single option (suppress_errors) and default suppression to False.

If your model used to work and no longer works, try TORCHDYNAMO_SUPPRESS_ERRORS=1
to bring back the old suppression behavior.
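In code, the unified switch looks roughly like this (a sketch mirroring the option and environment variable named above):
```
import torch._dynamo

# Default after this PR: errors are raised. Opt back into the old
# suppression behavior explicitly:
torch._dynamo.config.suppress_errors = True

# or, from the shell, without touching code:
#   TORCHDYNAMO_SUPPRESS_ERRORS=1 python train.py
```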

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87440
Approved by: https://github.com/voznesenskym, https://github.com/albanD
2022-10-21 17:03:29 +00:00
Bin Bao
b1cf377cce Enable inductor CI for huggingface (#86792)
Summary: Unit tests will be enabled after they are fixed in trunk. TorchBench and TIMM need
more setup and are coming later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86792
Approved by: https://github.com/jansel, https://github.com/huydhn
2022-10-21 01:38:46 +00:00
PyTorch MergeBot
f38a88c4dd Revert "[dynamo] use optimizers correctly in benchmarking (#87311)"
This reverts commit 703c19008d.

Reverted https://github.com/pytorch/pytorch/pull/87311 on behalf of https://github.com/anijain2305 due to Bin (desertfire) is trying to get torchbench models in CI, and this PR prevents that. I will bring this back after models are in CI.
2022-10-20 22:01:51 +00:00
Animesh Jain
703c19008d [dynamo] use optimizers correctly in benchmarking (#87311)
We were not setting optimizers correctly

* This hid the issue that we see here - https://github.com/pytorch/torchdynamo/issues/1687
* This has also revealed that we are activating profilers for every dynamo-optimized model call, which could affect speedup

cc @jansel @lezcano @fdrocha
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87311
Approved by: https://github.com/mlazos, https://github.com/yanboliang
2022-10-20 05:46:25 +00:00
Animesh Jain
c30cfb07ab [dynamo][dashboard] Run 2 iterations for the correctness runs (#87104)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87104
Approved by: https://github.com/soumith
2022-10-18 15:53:40 +00:00
Jason Ansel
30f6f6903c [inductor] Move size asserts to C++, fix bug (#87028)
Inductor internally models any `size=1` dimension as having `stride=0` to simplify indexing formulas (sympy will remove these terms from the expression).

This caused a bug in our generated stride assert in detectron2_maskrcnn_r_50_fpn, where we asserted the wrong stride of a size==1 dimension.

This fixes that bug, and moves the size/stride assert logic to C++, which should be a small perf gain.
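A small illustration of why a `size=1` dimension can be modeled with `stride=0`: the index along that dimension is always 0, so its stride never contributes to the address computation (the shapes below are arbitrary):
```
import torch

x = torch.randn(4, 1, 3)                # contiguous strides: (3, 3, 1)
y = x.as_strided((4, 1, 3), (3, 0, 1))  # same sizes, stride 0 on the size-1 dim

# Both views address exactly the same elements, so either stride
# choice is valid for indexing purposes.
assert torch.equal(x, y)
```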
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87028
Approved by: https://github.com/anijain2305
2022-10-16 20:17:22 +00:00
Jason Ansel
054a2fd6c2 Sync changes from pytorch/torchdynamo (#87013)
This updates to:
6380959be2

Generated with:
https://github.com/pytorch/torchdynamo/blob/main/copy_to_core.sh
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87013
Approved by: https://github.com/voznesenskym
2022-10-15 21:00:57 +00:00
Jason Ansel
c7c09722ad Move TorchDynamo into PyTorch core (#86461)
Context:
https://github.com/pytorch/torchdynamo/issues/1588

This PR moves [TorchDynamo](https://github.com/pytorch/torchdynamo) and TorchInductor into PyTorch core.
- `torchdynamo` becomes `torch._dynamo`
- `torchinductor` becomes `torch._inductor`

This PR was generated by running `copy_to_core.sh` in https://github.com/pytorch/torchdynamo/pull/1538

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86461
Approved by: https://github.com/voznesenskym
2022-10-13 23:18:06 +00:00