pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Michael Lazos	ea0f60ecfa	[Dynamo] allow dynamic callables on tensor variables (#137940 ) Fixes https://github.com/pytorch/pytorch/issues/134844 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137940 Approved by: https://github.com/williamwen42	2024-11-08 23:49:34 +00:00
Laith Sakka	d1a45800a3	refresh numbers after accepted less than noise regression (#140029 ) https://github.com/pytorch/pytorch/pull/138363 regressed some benchmarks but less than noise level updating values to avoid flakiness. <img width="803" alt="Screenshot 2024-11-07 at 10 31 29 AM" src="https://github.com/user-attachments/assets/31326452-a6ad-44b8-b324-25e953355fcf"> PASS: benchmark ('add_loop_eager', 'compile_time_instruction_count') pass, actual result 3073605220 +1.21% is within expected 3037000000 ±1.50% PASS: benchmark ('add_loop_eager_dynamic', 'compile_time_instruction_count') pass, actual result 5700849667 +1.37% is within expected 5624000000 ±2.50% Pull Request resolved: https://github.com/pytorch/pytorch/pull/140029 Approved by: https://github.com/bobrenjc93	2024-11-07 22:27:00 +00:00
Laith Sakka	de4216bfda	increase add_loop benchmark and refresh all results! (#139703 ) see comments end of https://github.com/pytorch/pytorch/pull/138756 I am also refreshing all values Pull Request resolved: https://github.com/pytorch/pytorch/pull/139703 Approved by: https://github.com/bobrenjc93	2024-11-05 05:41:21 +00:00
Bin Bao	740054ffe6	[AOTI][reland] Switch OSS dashboard to use aoti_compile_and_package (#139597 ) Summary: Reland https://github.com/pytorch/pytorch/pull/139154 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139597 Approved by: https://github.com/angelayi	2024-11-04 18:53:17 +00:00
PyTorch MergeBot	709752e0bb	Revert "[AOTI] Switch OSS dashboard to use aoti_compile_and_package (#139154 )" This reverts commit `293fbb42d2`. Reverted https://github.com/pytorch/pytorch/pull/139154 on behalf of https://github.com/desertfire due to cpu_aot_inductor_amp_freezing fails ([comment](https://github.com/pytorch/pytorch/pull/139154#issuecomment-2452983651))	2024-11-02 13:04:00 +00:00
Bin Bao	293fbb42d2	[AOTI] Switch OSS dashboard to use aoti_compile_and_package (#139154 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139154 Approved by: https://github.com/angelayi ghstack dependencies: #139153	2024-11-02 03:10:05 +00:00
Laith Sakka	6a1c451479	Don't uselessly recompute axiom dict every static eval call (#138967 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/138967 Approved by: https://github.com/ezyang	2024-10-31 21:16:55 +00:00
Laith Sakka	c056dc4cb8	In Inductor, be willing to generate deferred runtime asserts when unbacked (#138804 ) Title + we avoid calling defer_assert when we statically know the guard results. timing for pnasnet5large ``` TIMING: code_gen:21.79672 inductor_compile:39.57726 backend_compile:65.30649 entire_frame_compile:95.22052 total_wall_time:95.22052 ``` matches with out the diff ``` TIMING: code_gen:21.89314 inductor_compile:39.72298 backend_compile:65.38539 entire_frame_compile:95.0854 total_wall_time:95.0854 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/138804 Approved by: https://github.com/ezyang	2024-10-28 02:19:55 +00:00
Aaron Gokaslan	5d074746e9	[BE]: Add better optional typing (#138426 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/138426 Approved by: https://github.com/XuehaiPan, https://github.com/malfet	2024-10-27 14:19:00 +00:00
Laith Sakka	705f5b3489	Several enhancements for check_results.py (#137925 ) 1) Always generate expected_results.csv rounded to the first three digits, e.g. 112313212312 --> 1120000000, etc. 2) Regenerate all records in expected_results.csv, not just the failed ones. Why? Because if we change something by 1.3% and the noise threshold is 1.5%, we want to reflect that. 3) Add the message "please update all results that changed significantly, and not only the failed ones". ``` (myenv) [lsakka@devgpu005.nha1 ~/pytorch/benchmarks/dynamo/pr_time_benchmarks (check_result_ehancements)]$ python check_results.py test_check_result/expected_test.csv test_check_result/result_test.csv out WIN: benchmark ('a', 'instruction count') failed, actual result 9011111111 is -18.16% lower than expected 11011111111 ±1.00% please update the expected results. please update all results that changed significantly, and not only the failed ones REGRESSION: benchmark ('b', 'memory') failed, actual result 20011111111 is 99.89% higher than expected 10011111111 ±+10.00% if this is an expected regression, please update the expected results. please update all results that changed significantly, and not only the failed ones REGRESSION: benchmark ('c', 'something') failed, actual result 107111111111 is 969.92% higher than expected 10011111111 ±+10.00% if this is an expected regression, please update the expected results. please update all results that changed significantly, and not only the failed ones MISSING REGRESSION TEST: benchmark ('d', 'missing-test') does not have a regression test enabled for it. new expected results file content if needed: a,instruction count,9011000000,0.01 b,memory,20010000000,0.1 c,something,107100000000,0.1 There was some failures you can use the new reference expected result stored at path:out and printed above ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/137925 Approved by: https://github.com/aorenste	2024-10-26 16:27:55 +00:00
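The rounding step described in point 1 of #137925 could be sketched roughly as below. This is an illustration only, not the code in check_results.py: the helper name `round_to_significant` and its `digits` parameter are assumptions, and rounding to nearest (rather than truncating) is a guess that the commit message does not settle. The message speaks of three digits while the regenerated rows in the sample output keep four, so the digit count is left as a parameter.

```python
def round_to_significant(value: int, digits: int = 3) -> int:
    """Keep the leading `digits` digits of a non-negative integer and zero the rest.

    Hypothetical helper; the name and default are not taken from check_results.py.
    """
    if value < 0:
        raise ValueError("expected a non-negative count")
    length = len(str(value))
    if length <= digits:
        return value
    factor = 10 ** (length - digits)
    # Round half up to the nearest multiple of `factor` (an assumed choice).
    return (value + factor // 2) // factor * factor


# Row 'a' of the sample output: 9011111111 becomes 9011000000 with four digits kept.
print(round_to_significant(9011111111, digits=4))
```

Working on integers avoids the floating-point surprises that a `log10`-based version can hit near powers of ten.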
Laith Sakka	10e2840ce3	Enable failing diffs on update_hint_regression and sum_floordiv_regression and autograd benchmarks regression (#137548 ) update_hint_regression has been behaving, so I am setting a 2% noise threshold for it, and 1.5% for sum_floordiv_regression. I have one concern with the way we do regression detection: small changes below the threshold level will accumulate and eventually trigger a failure. To avoid that, we would have to keep an eye on the dashboard and potentially refresh the expected results file regularly, even when there are no failures. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137548 Approved by: https://github.com/aorenste	2024-10-26 07:28:49 +00:00
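The accumulation concern raised in #137548 can be illustrated with a small simulation. It is a sketch under stated assumptions, not part of the benchmark tooling: the function name, the baseline of 100, and the idea that every change adds the same relative cost are all hypothetical.

```python
from typing import Optional


def first_failing_change(
    per_change_growth: float, threshold: float, max_changes: int = 1000
) -> Optional[int]:
    """Return the 1-based index of the first change whose cumulative drift from a
    fixed expected value exceeds `threshold`, or None if none does.

    Hypothetical model: the expected value is never refreshed, and each change
    multiplies the measured value by (1 + per_change_growth).
    """
    expected = 100.0
    actual = expected
    for index in range(1, max_changes + 1):
        actual *= 1.0 + per_change_growth
        drift = (actual - expected) / expected
        if drift > threshold:
            return index
    return None


# Two changes of +1% each: neither exceeds a 1.5% threshold on its own,
# but the second one is the change that gets blamed.
print(first_failing_change(per_change_growth=0.01, threshold=0.015))
```

Refreshing the expected value after each accepted change resets the drift to zero, which is the maintenance burden the commit message describes.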
Pian Pawakapan	09848c892a	[aot_compile] propagate ShapeEnv during lowering (#138362 ) We found that `export() -> _inductor.aot_compile()` lowering, 3 different ShapeEnvs get created, leading to errors when one ShapeEnv processes expressions created by another ShapeEnv. This plumbs the 2 places where ShapeEnv creation happens, detecting the original ShapeEnv from the GraphModule example values, so the original ShapeEnv is just reused. Differential Revision: D64613290 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138362 Approved by: https://github.com/angelayi	2024-10-24 22:22:14 +00:00
PyTorch MergeBot	8197e4c70d	Revert "[sparse] add search for optimal alg_id to torch.compile (#137427 )" This reverts commit `39bfba3f56`. Reverted https://github.com/pytorch/pytorch/pull/137427 on behalf of https://github.com/jcaip due to this PR breaks AO tests ([comment](https://github.com/pytorch/pytorch/pull/137427#issuecomment-2435906592))	2024-10-24 17:27:06 +00:00
Laith Sakka	ed313a5ca2	Introduce torch.sym_add, variadic add (#138660 ) Tested internally here: https://www.internalfb.com/diff/D64057744 This is a reland after previous internal failures. main change is ``` if min is None and max is None: torch._check_is_size(size) return ``` Partially addresses https://github.com/pytorch/pytorch/issues/128150 When you have big sums of values, we end up computing long chains of binary addition in our FX graph representation. Not only is this ugly, it also is quadratic, as the sympy.Add constructor is O(N) in number of arguments. Instead, ensure that we maintain the summation as a single FX node so we can do the entire addition all in one go. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/138660 Approved by: https://github.com/ezyang, https://github.com/bobrenjc93	2024-10-23 17:42:41 +00:00
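The quadratic-versus-linear argument in #138660 can be made concrete with a toy cost model. The sketch assumes only what the message states, namely that constructing an addition node costs time proportional to its number of arguments; the function names are hypothetical, and nothing here calls sympy or torch.

```python
def chained_cost(n_terms: int) -> int:
    """Toy cost of summing n_terms values by repeated binary addition.

    Assumption: the k-th partial sum builds a flattened Add holding k arguments,
    and building it costs k units.
    """
    total = 0
    for k in range(2, n_terms + 1):
        total += k
    return total


def variadic_cost(n_terms: int) -> int:
    """Toy cost of building one addition node over all n_terms values at once."""
    return n_terms


for n in (10, 100, 1000):
    print(n, chained_cost(n), variadic_cost(n))
```

Under this model the chained form costs n(n+1)/2 - 1 units against n for the single node, which is the gap a variadic add is meant to close.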
Jesse Cai	39bfba3f56	[sparse] add search for optimal alg_id to torch.compile (#137427 ) Summary: This PR adds a lowering for `torch._cslt_sparse_mm` to find the optimal alg_id and cache it when running with `torch.compile` Seeing speedups on both bfloat16 and float8 dtypes: <img width="641" alt="Screenshot 2024-10-17 at 2 10 38 PM" src="https://github.com/user-attachments/assets/b928cd11-32a3-43e5-b209-8e4028896f0b"> <img width="1274" alt="Screenshot 2024-10-17 at 1 39 03 PM" src="https://github.com/user-attachments/assets/d9edd684-a8ec-46fd-b3da-2e76dbcb7bb6"> * `torch._cslt_sparse_mm_search` has been modified to return optimal split-k parameters as well as max alg_id. * max_id is now available in `torch.backends.cusparselt` via `torch.backends.cusparselt.get_max_alg_id()` * fixed meta registrations for float8 Test Plan: python test/test_sparse_semi_structured.py Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/137427 Approved by: https://github.com/cpuhrsch	2024-10-22 22:39:42 +00:00
Ryan Guo	0a4197490c	Delay mul/pow expansion for `_SympyT` to enable more folding (#138235 ) Instead of calling `safe_expand` right after symbolic expression construction, we invoke it in `ShapeEnv.simplify`. This enables more simplification with product form, e.g., ``` (a + b)^2 / (a + b) --> (a + b) ``` which won't happen if we expand eagerly during product construction: ``` (a^2 + 2ab + b^2) / (a + b) --> no change ``` Fixes #136044. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138235 Approved by: https://github.com/ezyang	2024-10-21 16:38:47 +00:00
Animesh Jain	0a2407b93c	[dynamo] Support omegaconf DictConfig (#138378 ) Fixes https://github.com/pytorch/pytorch/issues/138224 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138378 Approved by: https://github.com/jansel ghstack dependencies: #138359	2024-10-20 02:43:17 +00:00
Chong Gu	d512d0e227	Always use aten.constant_pad_nd for mm padding (#137820 ) Summary: From experiment, it seems like aten.constant_pad_nd has better QPS compared to torch.cat. The qps gain for ig ctr is ~10%, and ~5% for oc. Test Plan: ``` buck2 run mode/opt -c fbcode.nvcc_arch=a100 //caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/585279927/480/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend=AOT_INDUCTOR ``` ``` buck2 run mode/opt //caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/588102397/1500/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend=AOT_INDUCTOR ``` Differential Revision: D64271583 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137820 Approved by: https://github.com/eellison	2024-10-18 19:35:03 +00:00
Brian Hirsh	a682194a11	inductor: use previous guards to know if a size is 1 for broadcasting (#136670 ) Fixes https://github.com/pytorch/pytorch/issues/136640 Today, inductor has some logic to figure out when it needs to do broadcasting during lowering, which just checks if any of the input shapes have sizes equal to 1. In particular: we should already have this information by the time we get to inductor, because our FakeTensor compute will have branched/guarded on whether any ops performed broadcasting, appropriately. In particular, if we have a tensor with a size value of `(64//((2048//(s3((s2//s3)))))))`, and it happens to be equal to one (and it is used in an op that requires this dim to be broadcasted), FakeTensorProp will have generated a guard: ``` Eq((64//((2048//(s3((s2//s3))))))), 1) ``` I chose the simplest possible way to beef up inductor's checks to know when a given size is equal to 1: loop over the existing shape env guards, and if our current size is a sympy expression on the LHS of one of our `Eq(LHS, 1)` guards, then return True. I'm hoping for feedback on whether or not this approach is reasonable. One better option I could imagine is that our symbolic reasoning should have automatically simplified the size of our tensor down to a constant as part of evaluating that guard. I was originally going to try to do this directly in the shape env, but I ran into a few issues: (1) I wanted to call some version of `set_replacement(expr, 1)`. But `set_replacement()` only accepts plain symbols on the LHS, not expressions (2) in theory I could get this to work if I could rework the above expression to move everything that is not a free variable to the RHS, e.g. `Eq(s2, 32)`. It looks like our existing `try_solve()` logic is... [not quite able](https://github.com/pytorch/pytorch/blob/main/torch/utils/_sympy/solve.py#L27) to do this generally though. Checking the guards feels pretty simple-and-easy. Are we worried that it is too slow to iterate over all the guards? I could also cache the lookup so we only need to iterate over guards that are of the form `Eq(LHS, 1)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136670 Approved by: https://github.com/ezyang	2024-10-16 22:41:39 +00:00
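The guard lookup described in #136670, together with the caching idea floated at the end of the message, might look roughly like this. It is a sketch under loud assumptions: `EqGuard`, `is_known_one`, and `index_size_one_guards` are hypothetical names, plain strings stand in for sympy expressions, and none of this is Inductor's or ShapeEnv's actual API.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class EqGuard:
    """Stand-in for a recorded guard of the form Eq(lhs, rhs) (hypothetical type)."""
    lhs: str
    rhs: int


def is_known_one(size_expr: str, guards: Iterable[EqGuard]) -> bool:
    """Linear scan: is there a guard Eq(size_expr, 1)?"""
    return any(g.rhs == 1 and g.lhs == size_expr for g in guards)


def index_size_one_guards(guards: Iterable[EqGuard]) -> Set[str]:
    """Cached variant: collect every LHS guarded equal to 1 once, then look up in O(1)."""
    return {g.lhs for g in guards if g.rhs == 1}


guards = [EqGuard("s0", 4), EqGuard("64//(2048//s2)", 1), EqGuard("s1", 1)]
ones = index_size_one_guards(guards)
print(is_known_one("64//(2048//s2)", guards), "s1" in ones, "s0" in ones)
```

The set-based index trades a one-time pass over the guards for constant-time queries, which addresses the iteration-cost worry as long as the index is rebuilt whenever new guards are added.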
Isuru Fernando	120fbe9caa	Update inductor benchmark time to avoid flakiness (#137900 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137900 Approved by: https://github.com/laithsakka	2024-10-15 16:17:04 +00:00
Edward Z. Yang	5c3ba6faff	Add fbscribelogger to Dynamo benchmark runner (#137867 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/137867 Approved by: https://github.com/bobrenjc93	2024-10-15 04:36:41 +00:00
Isuru Fernando	08ce3aac62	Cache some ValueRanges (#137438 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137438 Approved by: https://github.com/ezyang	2024-10-13 19:23:34 +00:00
Bin Bao	cfc5d18aad	[AOTI] Turn on the ABI-compatible mode as default (#136534 ) Summary: Make AOTI generate ABI-compatible code as default for OSS. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136534 Approved by: https://github.com/chenyang78 ghstack dependencies: #137660	2024-10-13 14:42:58 +00:00
Valentine233	67883e70c0	change GPT2ForSequenceClassification inference accuracy tolerance (#136749 ) Fixes https://github.com/pytorch/pytorch/issues/123503. https://github.com/pytorch/pytorch/pull/121866 makes GPT2ForSequenceClassification hit the SDPA pattern 18 and then encounter the accuracy issue. The issue only happens with BF16 single-thread inference. This PR increases the model tolerance from 4e-3 to 5e-3 so that the check passes. Note that the issue is due to some small implementation differences. For example, the sdpa math backend scales q, k before matmul for stability; the flash attention backend differs more, as it is a new algorithm. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136749 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-10-12 01:12:28 +00:00
PyTorch MergeBot	c58e5c4efa	Revert "[AOTI] Turn on the ABI-compatible mode as default (#136534 )" This reverts commit `b0da076f0c`. Reverted https://github.com/pytorch/pytorch/pull/136534 on behalf of https://github.com/desertfire due to The dependent PR https://github.com/pytorch/pytorch/pull/137660 fails in fbcode ([comment](https://github.com/pytorch/pytorch/pull/136534#issuecomment-2408211238))	2024-10-11 22:50:58 +00:00
Xuehai Pan	267f82b860	[BE] Format `.ci/` / `.github/` / `benchmarks/` / `functorch/` / `tools/` / `torchgen/` with `ruff format` (#132577 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132577 Approved by: https://github.com/malfet	2024-10-11 18:30:26 +00:00
Laith Sakka	a06d49a9f9	bump up add_loop_inductor_gpu expected instruction count. (#137672 ) The diff https://github.com/pytorch/pytorch/pull/137117/files increased the instruction count for add_loop_inductor_gpu, but not enough to fail in that diff; the test is now somewhat flaky. It failed on a recent merge: <img width="1351" alt="Screenshot 2024-10-09 at 5 25 57 PM" src="https://github.com/user-attachments/assets/27178f76-c08e-4d13-9ac4-4cd70f146611"> and here is the history: <img width="1047" alt="Screenshot 2024-10-09 at 5 26 07 PM" src="https://github.com/user-attachments/assets/bd563e34-6f7f-461a-ae54-8a616be9bf09"> <img width="777" alt="Screenshot 2024-10-09 at 5 30 19 PM" src="https://github.com/user-attachments/assets/d0a1ca81-2bdb-4cf6-8ac8-ba5971d447bf"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/137672 Approved by: https://github.com/masnesral	2024-10-11 16:46:38 +00:00
Bin Bao	b0da076f0c	[AOTI] Turn on the ABI-compatible mode as default (#136534 ) Summary: Make AOTI generate ABI-compatible code as default for OSS. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136534 Approved by: https://github.com/chenyang78 ghstack dependencies: #137660	2024-10-10 23:44:57 +00:00
Colin Peppler	9690cacd61	[aotinductor] Add helper fn to atomically apply size_hint to an expr w/ unbacked symints (#137537 ) ### Context Fixes CUDA IMA in autotune_at_compile_time, where we would generate an example tensor with an incorrect stride. In the case below, the stride should be (u0 * 128, 128, 1). However, we apply the fallback on the entire expr (i.e. u0 * 128). ``` # buf817 = tensor(size=(s0, u0, 128), stride=(u0 * 128, 128, 1)) buf812 = generate_example_value( (64, 8192, 128), (8192, 128, 1), "cuda:0", torch.bfloat16, 0 ) ``` The fix is to apply the fallback on each symbol. ### Test ``` PYTORCH_NO_CUDA_MEMORY_CACHING=1 compute-sanitizer python test_aot_inductor.py -k test_stride_with_unbacked_expr_abi_compatible_cuda ========= Invalid __global__ write of size 2 bytes ``` Differential Revision: [D64074561](https://our.internmc.facebook.com/intern/diff/D64074561) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137537 Approved by: https://github.com/jingsh	2024-10-10 17:11:24 +00:00
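The difference between applying the fallback to a whole expression and applying it to each symbol, as described in #137537, can be sketched with a toy product expression. Everything named here is hypothetical (`hint_whole_expr`, `hint_per_symbol`, the list-of-factors representation); the real fix operates on sympy expressions inside Inductor, and the fallback value 8192 is simply the one visible in the example above.

```python
from math import prod
from typing import Dict, Sequence, Union

Factor = Union[int, str]  # an int constant or the name of a symbol


def hint_whole_expr(factors: Sequence[Factor], hints: Dict[str, int], fallback: int) -> int:
    """Old behaviour (as described): if any symbol lacks a hint, the whole
    expression collapses to the fallback value."""
    if any(isinstance(f, str) and f not in hints for f in factors):
        return fallback
    return prod(f if isinstance(f, int) else hints[f] for f in factors)


def hint_per_symbol(factors: Sequence[Factor], hints: Dict[str, int], fallback: int) -> int:
    """Fixed behaviour (as described): substitute the fallback for each unhinted
    symbol, then evaluate the expression."""
    return prod(f if isinstance(f, int) else hints.get(f, fallback) for f in factors)


stride0 = ["u0", 128]  # the stride expression u0 * 128, with u0 unbacked
print(hint_whole_expr(stride0, {}, 8192))   # 8192, inconsistent with size (64, 8192, 128)
print(hint_per_symbol(stride0, {}, 8192))   # 1048576, i.e. 8192 * 128
```

With the per-symbol version, the example tensor's leading stride agrees with the sizes derived from the same fallback, which is what avoids the out-of-bounds write.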
Oguz Ulgen	034af88c2d	Add a microbechmark for cache read path (#137607 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137607 Approved by: https://github.com/jamesjwu	2024-10-10 16:36:18 +00:00
Laith Sakka	f394fb554b	Enable failing diffs for regressions on basic_modules_ListOfLinears benchmarks (#137541 ) Note that basic_modules_ListOfLinears_inductor_gpu_force_shape_pad is flaky, with 8% detected variance, so I set it up with a 20% threshold (8*2)++. The others are stable within ±1.5%. <img width="611" alt="Screenshot 2024-10-08 at 4 19 03 PM" src="https://github.com/user-attachments/assets/103c4bc7-6be8-41bf-ac31-4b8909fabfcf"> <img width="1581" alt="Screenshot 2024-10-08 at 4 18 56 PM" src="https://github.com/user-attachments/assets/56006f7a-e7de-4966-9a05-9263195adc68"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/137541 Approved by: https://github.com/aorenste	2024-10-10 02:47:38 +00:00
Laith Sakka	361046718d	Generate new expected results file when there is failures in diff time benchmarks (#137551 ) This also adds a signpost log for the benchmarks that pass. To test, I ran python check_results.py test_check_result/expected_test.csv test_check_result/result_test.csv out.csv Results: ``` WIN: benchmark ('a', 'instruction count') failed, actual result 90 is -18.18% lower than expected 110 ±1.00% please update the expected results. REGRESSION: benchmark ('b', 'memory') failed, actual result 200 is 100.00% higher than expected 100 ±+10.00% if this is an expected regression, please update the expected results. PASS: benchmark ('c', 'something') pass, actual result 107 +7.00% is within expected 100 ±10.00% MISSING REGRESSION TEST: benchmark ('d', 'missing-test') does not have a regression test enabled for it. You can use the new reference expected result stored at path: out.csv. a,instruction count,90,0.01 b,memory,200,0.1 c,something,100,0.1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/137551 Approved by: https://github.com/aorenste	2024-10-10 01:09:15 +00:00
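The WIN / REGRESSION / PASS outcomes in the sample output above follow from comparing each result against a band around its expected value. A minimal sketch of that comparison, assuming a symmetric relative band, is given below; the function name `classify` and its return shape are hypothetical and are not taken from check_results.py.

```python
from typing import Tuple


def classify(actual: float, expected: float, noise_margin: float) -> Tuple[str, float]:
    """Classify a benchmark result against expected ± noise_margin (a fraction).

    Returns the verdict and the signed percentage change, rounded to two decimals.
    Hypothetical helper; the real script's structure may differ.
    """
    low = expected * (1.0 - noise_margin)
    high = expected * (1.0 + noise_margin)
    change_pct = round((actual - expected) / expected * 100.0, 2)
    if actual < low:
        return "WIN", change_pct          # better than expected by more than the noise
    if actual > high:
        return "REGRESSION", change_pct   # worse than expected by more than the noise
    return "PASS", change_pct


# The three rows from the sample run in the commit message.
for name, actual, expected, margin in [("a", 90, 110, 0.01), ("b", 200, 100, 0.1), ("c", 107, 100, 0.1)]:
    print(name, classify(actual, expected, margin))
```

Both WIN and REGRESSION count as failures in the log, since either one means the stored expected value no longer describes the benchmark.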
PyTorch MergeBot	16a2c2cfd4	Revert "Introduce torch.sym_sum (#136429 )" This reverts commit `90bed32b98`. Reverted https://github.com/pytorch/pytorch/pull/136429 on behalf of https://github.com/ezyang due to fails internal stuff ([comment](https://github.com/pytorch/pytorch/pull/136429#issuecomment-2403335147))	2024-10-09 20:08:01 +00:00
Oguz Ulgen	ae03c0cff3	Add microbenchmark for FxGraphHashDetails.debug_lines (#137506 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137506 Approved by: https://github.com/jamesjwu	2024-10-09 16:15:05 +00:00
Michael Lazos	27dee935af	[Dynamo] Ensure torch function modes are dispatched on builtin ops (#137117 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137117 Approved by: https://github.com/yanboliang, https://github.com/williamwen42 ghstack dependencies: #137114, #137115, #137116	2024-10-09 02:29:40 +00:00
PyTorch MergeBot	2d18c2d5e7	Revert "[Dynamo] Ensure torch function modes are dispatched on builtin ops (#137117 )" This reverts commit `941be418d8`. Reverted https://github.com/pytorch/pytorch/pull/137117 on behalf of https://github.com/huydhn due to The top of the stack has been reverted but it leaves trunk in a broken state, so I try to revert the rest of the stack ([comment](https://github.com/pytorch/pytorch/pull/137114#issuecomment-2400765603))	2024-10-08 20:33:17 +00:00
Brian Hirsh	b41fc14072	compile time benchmarks for AOTDispatcher (partitioner) (#136760 ) compile time benchmark for the min cut partitioner. I'm hoping that this is a reasonable benchmark because: (1) it consists of a single input + many weights that are used sequentially (2) contains a mix of recompute vs non-recomputed ops (matmul + sin) (3) it is relatively simple from running locally: ``` collecting compile time instruction count for aotdispatcher_partitioner_cpu compile time instruction count for iteration 0 is 21764219181 compile time instruction count for iteration 1 is 12475020009 compile time instruction count for iteration 2 is 12463710140 compile time instruction count for iteration 3 is 12455676489 compile time instruction count for iteration 4 is 12451344330 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136760 Approved by: https://github.com/ezyang ghstack dependencies: #136759	2024-10-08 18:44:13 +00:00
Brian Hirsh	48b8f818b2	compile time benchmarks for AOTDispatcher (inference/training/subclasses) (#136759 ) this adds a few compile time benchmarks for some disjoint paths in AOTDispatcher: (1) inference vs training code paths (2) "subclasses" vs "no subclasses" codepaths Also see https://github.com/pytorch/pytorch/pull/136760 for a partitioner benchmark (I'm not sure why ghstack didn't display the stack nicely) I ran locally, and got these numbers on the 4 paths: ``` collecting compile time instruction count for aotdispatcher_inference_nosubclass_cpu compile time instruction count for iteration 0 is 11692348671 compile time instruction count for iteration 1 is 3026287204 compile time instruction count for iteration 2 is 3011467318 compile time instruction count for iteration 3 is 3004485935 compile time instruction count for iteration 4 is 3003087410 collecting compile time instruction count for aotdispatcher_training_nosubclass_cpu compile time instruction count for iteration 0 is 6068003223 compile time instruction count for iteration 1 is 5585418102 compile time instruction count for iteration 2 is 5581856618 compile time instruction count for iteration 3 is 5581651794 compile time instruction count for iteration 4 is 5578742619 collecting compile time instruction count for aotdispatcher_inference_subclass_cpu compile time instruction count for iteration 0 is 8634984264 compile time instruction count for iteration 1 is 8633467573 compile time instruction count for iteration 2 is 8632182092 compile time instruction count for iteration 3 is 8632056925 compile time instruction count for iteration 4 is 8632543871 collecting compile time instruction count for aotdispatcher_training_subclass_cpu compile time instruction count for iteration 0 is 14737239311 compile time instruction count for iteration 1 is 14734346427 compile time instruction count for iteration 2 is 14736493730 compile time instruction count for iteration 3 is 14734121272 compile time instruction count for iteration 4 is 14733852882 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136759 Approved by: https://github.com/laithsakka	2024-10-08 18:44:13 +00:00
Edward Z. Yang	90bed32b98	Introduce torch.sym_sum (#136429 ) Partially addresses https://github.com/pytorch/pytorch/issues/128150 When you have big sums of values, we end up computing long chains of binary addition in our FX graph representation. Not only is this ugly, it also is quadratic, as the sympy.Add constructor is O(N) in number of arguments. Instead, ensure that we maintain the summation as a single FX node so we can do the entire addition all in one go. update_hint_regression benchmark, before and after: ``` update_hint_regression,compile_time_instruction_count,2648328980 update_hint_regression,compile_time_instruction_count,2563748678 ``` Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136429 Approved by: https://github.com/isuruf	2024-10-08 18:12:57 +00:00
Michael Lazos	941be418d8	[Dynamo] Ensure torch function modes are dispatched on builtin ops (#137117 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137117 Approved by: https://github.com/yanboliang, https://github.com/williamwen42 ghstack dependencies: #137114, #137115, #137116	2024-10-07 18:55:26 +00:00
Laith Sakka	8b9cbf22c2	Enable regression test for add loop benchmarks (#136573 ) The red dotted line is 1.5 <img width="1607" alt="Screenshot 2024-09-24 at 11 50 41 AM" src="https://github.com/user-attachments/assets/719a9a86-89af-4c58-8723-80a28f9bb517"> expected taken from the average. <img width="850" alt="Screenshot 2024-09-24 at 2 33 27 PM" src="https://github.com/user-attachments/assets/0f25e855-35ae-4031-86ef-1452ef6598de"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136573 Approved by: https://github.com/ezyang	2024-10-04 18:12:08 +00:00
PyTorch MergeBot	951107e8c2	Revert "compile time benchmarks for AOTDispatcher (inference/training/subclasses) (#136759 )" This reverts commit `b17cd264d3`. Reverted https://github.com/pytorch/pytorch/pull/136759 on behalf of https://github.com/ZainRizvi due to Something in this stack seems to be causing tests to fail on trunk. See functorch/test_control_flow.py::TestControlFlow::test_associative_scan_dim_reverse_True_combine_mode_generic_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/11107079955/job/30872132411) [HUD commit link](`c010c6099b`) ([comment](https://github.com/pytorch/pytorch/pull/136670#issuecomment-2386303362))	2024-10-01 15:23:55 +00:00
PyTorch MergeBot	923410193b	Revert "compile time benchmarks for AOTDispatcher (partitioner) (#136760 )" This reverts commit `c010c6099b`. Reverted https://github.com/pytorch/pytorch/pull/136760 on behalf of https://github.com/ZainRizvi due to Something in this stack seems to be causing tests to fail on trunk. See functorch/test_control_flow.py::TestControlFlow::test_associative_scan_dim_reverse_True_combine_mode_generic_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/11107079955/job/30872132411) [HUD commit link](`c010c6099b`) ([comment](https://github.com/pytorch/pytorch/pull/136670#issuecomment-2386303362))	2024-10-01 15:23:55 +00:00
Bin Bao	a15f3f51bc	[AOTI] Update sam_fast from timeout to fail_to_run (#136996 ) Summary: sam_fast changes from timeout to fail_to_run after https://github.com/pytorch/pytorch/pull/136591, which "regressed" in a good way. Update the expected result file and continue investigating. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136996 Approved by: https://github.com/ezyang	2024-09-30 14:05:49 +00:00
Brian Hirsh	c010c6099b	compile time benchmarks for AOTDispatcher (partitioner) (#136760 ) compile time benchmark for the min cut partitioner. I'm hoping that this is a reasonable benchmark because: (1) it consists of a single input + many weights that are used sequentially (2) contains a mix of recompute vs non-recomputed ops (matmul + sin) (3) it is relatively simple from running locally: ``` collecting compile time instruction count for aotdispatcher_partitioner_cpu compile time instruction count for iteration 0 is 21764219181 compile time instruction count for iteration 1 is 12475020009 compile time instruction count for iteration 2 is 12463710140 compile time instruction count for iteration 3 is 12455676489 compile time instruction count for iteration 4 is 12451344330 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136760 Approved by: https://github.com/ezyang ghstack dependencies: #136670, #136759	2024-09-30 13:25:02 +00:00
Brian Hirsh	b17cd264d3	compile time benchmarks for AOTDispatcher (inference/training/subclasses) (#136759 ) this adds a few compile time benchmarks for some disjoint paths in AOTDispatcher: (1) inference vs training code paths (2) "subclasses" vs "no subclasses" codepaths Also see https://github.com/pytorch/pytorch/pull/136760 for a partitioner benchmark (I'm not sure why ghstack didn't display the stack nicely) I ran locally, and got these numbers on the 4 paths: ``` collecting compile time instruction count for aotdispatcher_inference_nosubclass_cpu compile time instruction count for iteration 0 is 11692348671 compile time instruction count for iteration 1 is 3026287204 compile time instruction count for iteration 2 is 3011467318 compile time instruction count for iteration 3 is 3004485935 compile time instruction count for iteration 4 is 3003087410 collecting compile time instruction count for aotdispatcher_training_nosubclass_cpu compile time instruction count for iteration 0 is 6068003223 compile time instruction count for iteration 1 is 5585418102 compile time instruction count for iteration 2 is 5581856618 compile time instruction count for iteration 3 is 5581651794 compile time instruction count for iteration 4 is 5578742619 collecting compile time instruction count for aotdispatcher_inference_subclass_cpu compile time instruction count for iteration 0 is 8634984264 compile time instruction count for iteration 1 is 8633467573 compile time instruction count for iteration 2 is 8632182092 compile time instruction count for iteration 3 is 8632056925 compile time instruction count for iteration 4 is 8632543871 collecting compile time instruction count for aotdispatcher_training_subclass_cpu compile time instruction count for iteration 0 is 14737239311 compile time instruction count for iteration 1 is 14734346427 compile time instruction count for iteration 2 is 14736493730 compile time instruction count for iteration 3 is 14734121272 compile time instruction count for iteration 4 is 14733852882 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136759 Approved by: https://github.com/laithsakka ghstack dependencies: #136670	2024-09-30 13:25:02 +00:00
Laith Sakka	e205193e1c	Enable failing diffs on regression (#136551 ) 1. Example of a failing diff: https://github.com/pytorch/pytorch/pull/136740 2. Test this by running python check_results.py test_check_result/expected_test.csv test_check_result/result_test.csv Results: ``` WIN: benchmark ('a', ' instruction count') failed, actual result 90 is 18.18% lower than expected 110 ±1.00% please update the expected results. REGRESSION: benchmark ('b', ' memory') failed, actual result 200 is 100.00% higher than expected 100 ±10.00% if this is an expected regression, please update the expected results. MISSING REGRESSION TEST: benchmark ('d', ' missing-test') does not have a regression test enabled for it ``` MISSING REGRESSION TEST does not fail, but it is logged. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136551 Approved by: https://github.com/ezyang ghstack dependencies: #136383	2024-09-29 22:31:26 +00:00
Jason Ansel	8da9c4178c	[inductor] Benchmark Halide in operatorbench.py (#136809 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136809 Approved by: https://github.com/eellison ghstack dependencies: #136808	2024-09-28 19:26:04 +00:00
Jason Ansel	375921b755	[inductor] Improve operatorbench.py (#136808 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136808 Approved by: https://github.com/eellison	2024-09-28 06:22:02 +00:00
William Wen	2157e396a3	[dynamo] attempt run only mode when dynamo cache limit is hit (#136655 ) Implement https://github.com/pytorch/pytorch/issues/135458. Try run-only mode when dynamo cache limit is hit. If no valid cache entries are found, then skip code recursively. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136655 Approved by: https://github.com/jansel	2024-09-27 17:15:05 +00:00


1803 Commits