pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Jovian Anthony Jaison	45d62d6fc5	[dynamo] Added cuda and triton versions to dynamo_compile (#141290 ) Opening another PR since #141140 was reverted. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141290 Approved by: https://github.com/masnesral	2024-11-22 20:04:42 +00:00
Bob Ren	6d779d0549	Always unspecialize float in OSS (#138922 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138922 Approved by: https://github.com/ezyang Co-authored-by: Edward Z. Yang <ezyang@meta.com>	2024-11-22 17:54:42 +00:00
Colin L. Rice	f5d00f1456	pytorch/features: Make a feature logger and record triton bundling (#141056 ) This modifies metrics_context to allow us to store whether a feature was used or not. This also starts recording this for triton bundling. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141056 Approved by: https://github.com/masnesral	2024-11-22 01:31:08 +00:00
Prajesh Praveen Anchalia	4e34fbdcbc	Add inductor_fx_graph_cache stats to dynamo_utils (#141190 ) Summary: Add the following inductor fx graph cache stats to dynamo compile - inductor_fx_cache_hit_count - inductor_fx_cache_miss_count - inductor_fx_cache_backend_type - inductor_fx_cache_hit_keys - inductor_fx_cache_miss_keys - remote_cache_version Test Plan: Run local tests and staging logger: P1683061460 Differential Revision: D66232206 Pull Request resolved: https://github.com/pytorch/pytorch/pull/141190 Approved by: https://github.com/masnesral	2024-11-21 20:59:10 +00:00
Ivan Zaitsev	149677e30c	Revert "[dynamo] Added cuda and triton versions to dynamo_compile" (#141280 ) Reverts pytorch/pytorch#141140 reason: conflicts with https://github.com/pytorch/pytorch/pull/141190 and wasn't merged using mergebot Pull Request resolved: https://github.com/pytorch/pytorch/pull/141280 Approved by: https://github.com/clee2000, https://github.com/kit1980	2024-11-21 20:50:06 +00:00
Jovian Anthony Jaison	11d0ba068f	[dynamo] Added cuda and triton versions to dynamo_compile (#141140) Summary: Add cuda and triton versions to dynamo_compile logging site. Test Plan: $ buck2 run mode/opt //scripts/oulgen:runner File changed: fbcode//caffe2/torch/_dynamo/convert_frame.py Buck UI: https://www.internalfb.com/buck2/1a8ada1f-d54e-44b2-a368-b2ff2030e113 Network: Up: 65KiB Down: 0B (reSessionID-8f4d1d6d-a680-4ecc-8e73-c29c932d824b) Jobs completed: 2166. Time elapsed: 7.0s. Cache hits: 0%. Commands: 3 (cached: 0, remote: 0, local: 3) BUILD SUCCEEDED ... Cuda: 12.4.0 Triton: 3.0.0 Reviewed By: masnesral Differential Revision: D66181508	2024-11-21 12:20:02 -08:00
Henry Tsang	4f2543c31d	[logs] Add dynamo_timed to get better compilation time breakdown for AOTI (#140198) Adds some dynamo_timed annotations to better understand AOTI compilation time. This will probably require a few more passes. A lot of time is spent in Scheduler.__init__, which does not yet have enough annotations. run_command_and_check also takes a lot of time, but there is probably not much we can do about it. Maybe we can add a config to tune the C++ optimization level? Traces (screenshot, 2024-11-08): https://github.com/user-attachments/assets/61645264-b3af-4d4a-804d-700b0f831c7c Differential Revision: D65554141 Pull Request resolved: https://github.com/pytorch/pytorch/pull/140198 Approved by: https://github.com/desertfire	2024-11-19 18:54:17 +00:00
Prajesh Praveen Anchalia	1e234e63b3	[pytorch][dynamo_compile] Log inductor config to dynamo_compile (#140790) Summary: Log the scrubbed inductor config to dynamo_compile as json:str. Scrub RE: `r'((^TYPE_CHECKING$)|(._progress$)|(.TESTING.)|(.(rocm|halide).)|(^trace\..)|(^_))'` to save some space. Test Plan: Staging logger: https://fburl.com/data/ltkt08zm P1679697917 {F1958428018} Differential Revision: D65806399 Pull Request resolved: https://github.com/pytorch/pytorch/pull/140790 Approved by: https://github.com/masnesral	2024-11-19 02:39:33 +00:00
Sam Larsen	e2e67a010a	[logging] Add dynamo_compile fields for pre-dispatch/joint/post-dispatch times (#140306 ) Tested internally: P1679622670 Differential Revision: [D65986059](https://our.internmc.facebook.com/intern/diff/D65986059) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140306 Approved by: https://github.com/ezyang	2024-11-15 15:02:08 +00:00
Sam Larsen	b11ff3cf60	[logging] Overhaul dynamo_timed and CompilationMetrics logging. (#139849 ) Here's the overview: There's a new contextmanager singleton called MetricsContext. Entering the MetricsContext is how we demarcate the boundary on which we'll create a single CompilationMetrics object, and therefore, a single dynamo_compile log entry. While we're inside the MetricsContext, we can update/set many different metrics. Most importantly: `dynamo_timed` can also update the in-progress MetricsContext. In the proposal here, we tell `dynamo_timed` that we want it to do so by providing the name of the MetricsContext field to increment. There can be many `dynamo_timed` calls in different parts of the code updating different fields. Then when the MetricsContext exits, that's when the logging of everything gathered finally happens. One potential footgun is trying to use `dynamo_timed` when we haven't entered the MetricsContext, but we assert on that problem. Another problem is that we re-enter the context recursively, but we watch for that and do the logging only when the outermost exits. Some specifics: * Introduce MetricsContext - a context manager that on exit, records the CompilationMetrics (which also logs to dynamo_compile). * Completely remove the concept of frame_phase_timing. Instead, update the MetricsContext during compilation, either directly or via dynamo_timed. * Remove some globals we previously used to accumulate counters to later populate a CompilationMetrics. We use CompilationMetrics set/update/increment APIs instead. * `record_compilation_metrics` is now called on exit from MetricsContext. * Populate legacy CompilationMetrics fields right before logging, inside `record_compilation_metrics`. * Remove the one-off `add_remote_cache_time_saved` helper; capture that timing directly into the MetricsContext. And specifically, several changes to dynamo_timed: * "Modernize" the parameters and update all callsites accordingly. 
* Move the backwards logging of the CompilationMetrics to the backwards compile location. * Add a parameter for which CompilationMetrics field to update Pull Request resolved: https://github.com/pytorch/pytorch/pull/139849 Approved by: https://github.com/ezyang	2024-11-14 19:11:20 +00:00
PyTorch MergeBot	d63eb3c46c	Revert "[logging] Overhaul dynamo_timed and CompilationMetrics logging. (#139849)" This reverts commit `cb15c15157`. Reverted https://github.com/pytorch/pytorch/pull/139849 on behalf of https://github.com/kit1980 due to breaking internal tests; there is also a bug according to the author ([comment](https://github.com/pytorch/pytorch/pull/139849#issuecomment-2474459094))	2024-11-13 18:47:51 +00:00
Sam Larsen	cb15c15157	[logging] Overhaul dynamo_timed and CompilationMetrics logging. (#139849 ) Here's the overview: There's a new contextmanager singleton called MetricsContext. Entering the MetricsContext is how we demarcate the boundary on which we'll create a single CompilationMetrics object, and therefore, a single dynamo_compile log entry. While we're inside the MetricsContext, we can update/set many different metrics. Most importantly: `dynamo_timed` can also update the in-progress MetricsContext. In the proposal here, we tell `dynamo_timed` that we want it to do so by providing the name of the MetricsContext field to increment. There can be many `dynamo_timed` calls in different parts of the code updating different fields. Then when the MetricsContext exits, that's when the logging of everything gathered finally happens. One potential footgun is trying to use `dynamo_timed` when we haven't entered the MetricsContext, but we assert on that problem. Another problem is that we re-enter the context recursively, but we watch for that and do the logging only when the outermost exits. Some specifics: * Introduce MetricsContext - a context manager that on exit, records the CompilationMetrics (which also logs to dynamo_compile). * Completely remove the concept of frame_phase_timing. Instead, update the MetricsContext during compilation, either directly or via dynamo_timed. * Remove some globals we previously used to accumulate counters to later populate a CompilationMetrics. We use CompilationMetrics set/update/increment APIs instead. * `record_compilation_metrics` is now called on exit from MetricsContext. * Populate legacy CompilationMetrics fields right before logging, inside `record_compilation_metrics`. * Remove the one-off `add_remote_cache_time_saved` helper; capture that timing directly into the MetricsContext. And specifically, several changes to dynamo_timed: * "Modernize" the parameters and update all callsites accordingly. 
* Move the backwards logging of the CompilationMetrics to the backwards compile location. * Add a parameter for which CompilationMetrics field to update Pull Request resolved: https://github.com/pytorch/pytorch/pull/139849 Approved by: https://github.com/ezyang ghstack dependencies: #140094	2024-11-11 14:24:23 +00:00
Sam Larsen	4f6b30bcbc	Add testing for the utils surrounding dynamo_timed (#140094 ) Summary: This will make it easier to verify that we don't break these utilities for the refactor in https://github.com/pytorch/pytorch/pull/139849. It's one giant test. I can split it into multiple for better readability if ppl prefer that. My rationale for the giant test is that I found I was just resetting compilation and recompiling the same thing many times, which was slow and wasteful. Test Plan: The new tests Differential Revision: [D65682138](https://our.internmc.facebook.com/intern/diff/D65682138) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140094 Approved by: https://github.com/ezyang	2024-11-10 04:17:45 +00:00
Shunting Zhang	c0735a3dd3	[pt2-bench] fix accuracy failure for a few models (#129941) This PR batches fixes for a few accuracy failures during training by raising the tolerance. I do that only for models whose failures I believe are not due to a real issue. ## sebotnet33ts_256 The accuracy test for this model started to fail around June 05 [link](https://hud.pytorch.org/benchmark/timm_models/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Sun%2C%2002%20Jun%202024%2007%3A19%3A38%20GMT&stopTime=Tue%2C%2002%20Jul%202024%2007%3A19%3A38%20GMT&granularity=day&mode=training&dtype=amp&lBranch=main&lCommit=04a0d856207d83c2031e4b9cb6825ba3e0092850&rBranch=main&rCommit=e62925930f6a62f6aeeb1fe1a661a9bd3352b53d&model=sebotnet33ts_256). I cannot repro locally, but the log from the dashboard shows: ``` RMSE (res-fp64): 0.09441, (ref-fp64): 0.02971 and shape=torch.Size([1536]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000 ``` Raising the tolerance should fix it. ## DebertaForQuestionAnswering This model fails the accuracy test on the dashboard only in max-autotune mode. I cannot repro locally with the command: ``` TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/huggingface.py --accuracy --no-translation-validation --training --amp --backend inductor --device cuda --only DebertaForQuestionAnswering ``` The error message on the dashboard is: ``` RMSE (res-fp64): 0.01803, (ref-fp64): 0.00537 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000 ``` A tolerance of 0.02 should suppress this error. ## gluon_inception_v3 This model fails on the dashboard in max-autotune mode. I cannot repro locally with the command: ``` TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only gluon_inception_v3 ``` The error message on the dashboard is: ``` RMSE (res-fp64): 0.02798, (ref-fp64): 0.00730 and shape=torch.Size([384]). 
res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000 Accuracy failed for key name Mixed_7c.branch3x3dbl_3a.bn.running_var ``` Raising the tolerance should suppress this error. ## mobilenetv3_large_100 Fails in max-autotune (MA) mode. I cannot repro locally with the command: ``` TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only ``` The error message on the dashboard is: ``` RMSE (res-fp64): 0.29754, (ref-fp64): 0.05205 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000 ``` The tensor is so small that the noise can be high. I use a larger multiplier for smaller tensors in torch._dynamo.utils.same. ## yolov3 Fails on the dashboard with the error: ``` Error on the dashboard: RMSE (res-fp64): 0.01278, (ref-fp64): 0.00246 and shape=torch.Size([256]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000 ``` Fix it by using a larger multiplier for smaller tensors and raising the tolerance. ## timm_efficientdet Fails on the dashboard with the error: ``` E0623 18:37:43.638000 139924418725056 torch/_dynamo/utils.py:1468] RMSE (res-fp64): 0.00096, (ref-fp64): 0.00009 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000 ``` But I cannot repro locally with the command: ``` time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --only timm_efficientdet --training ``` Raising the tolerance should fix it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129941 Approved by: https://github.com/jansel ghstack dependencies: #129996	2024-07-05 10:26:39 +00:00
Shunting Zhang	8f1c2e1e28	[pt2-bench] pass acc test if ref is NaN (#129996) I'm debugging the accuracy failure for training vision_maskrcnn. Unfortunately I could not run it locally (I've checked that the pinned commits for torchbenchmark/torchvision are correct, and reinstalled torchbenchmark for mask_rcnn). I get this error: ``` eager run fail: AssertionError: targets should not be none when in training mode ``` (Command: time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --training --only vision_maskrcnn ) But looking at the log from the dashboard: ``` E0623 19:17:59.085000 140114670171328 torch/_dynamo/utils.py:1468] RMSE (res-fp64): nan, (ref-fp64): nan and shape=torch.Size([1024, 256, 1, 1]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000 ``` We can see that both the reference number and the pt2 number are NaN. I change torch._dynamo.utils.same to return true if both RMSE values are NaN. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129996 Approved by: https://github.com/jansel	2024-07-05 10:26:39 +00:00
PyTorch MergeBot	6dfa53ca76	Revert "[pt2-bench] pass acc test if ref is NaN (#129996 )" This reverts commit `51fa0bd436`. Reverted https://github.com/pytorch/pytorch/pull/129996 on behalf of https://github.com/jeanschmidt due to Seems to have introduced breakages in main cuda12 focal jobs ([comment](https://github.com/pytorch/pytorch/pull/129996#issuecomment-2209175516))	2024-07-04 14:55:38 +00:00
PyTorch MergeBot	fa3953a2e1	Revert "[pt2-bench] fix accuracy failure for a few models (#129941 )" This reverts commit `dafbd603ee`. Reverted https://github.com/pytorch/pytorch/pull/129941 on behalf of https://github.com/jeanschmidt due to Seems to have introduced breakages in main cuda12 focal jobs ([comment](https://github.com/pytorch/pytorch/pull/129996#issuecomment-2209175516))	2024-07-04 14:55:38 +00:00
Shunting Zhang	dafbd603ee	[pt2-bench] fix accuracy failure for a few models (#129941) This PR batches fixes for a few accuracy failures during training by raising the tolerance. I do that only for models whose failures I believe are not due to a real issue. ## sebotnet33ts_256 The accuracy test for this model started to fail around June 05 [link](https://hud.pytorch.org/benchmark/timm_models/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Sun%2C%2002%20Jun%202024%2007%3A19%3A38%20GMT&stopTime=Tue%2C%2002%20Jul%202024%2007%3A19%3A38%20GMT&granularity=day&mode=training&dtype=amp&lBranch=main&lCommit=04a0d856207d83c2031e4b9cb6825ba3e0092850&rBranch=main&rCommit=e62925930f6a62f6aeeb1fe1a661a9bd3352b53d&model=sebotnet33ts_256). I cannot repro locally, but the log from the dashboard shows: ``` RMSE (res-fp64): 0.09441, (ref-fp64): 0.02971 and shape=torch.Size([1536]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000 ``` Raising the tolerance should fix it. ## DebertaForQuestionAnswering This model fails the accuracy test on the dashboard only in max-autotune mode. I cannot repro locally with the command: ``` TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/huggingface.py --accuracy --no-translation-validation --training --amp --backend inductor --device cuda --only DebertaForQuestionAnswering ``` The error message on the dashboard is: ``` RMSE (res-fp64): 0.01803, (ref-fp64): 0.00537 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000 ``` A tolerance of 0.02 should suppress this error. ## gluon_inception_v3 This model fails on the dashboard in max-autotune mode. I cannot repro locally with the command: ``` TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only gluon_inception_v3 ``` The error message on the dashboard is: ``` RMSE (res-fp64): 0.02798, (ref-fp64): 0.00730 and shape=torch.Size([384]). 
res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000 Accuracy failed for key name Mixed_7c.branch3x3dbl_3a.bn.running_var ``` Raising the tolerance should suppress this error. ## mobilenetv3_large_100 Fails in max-autotune (MA) mode. I cannot repro locally with the command: ``` TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only ``` The error message on the dashboard is: ``` RMSE (res-fp64): 0.29754, (ref-fp64): 0.05205 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000 ``` The tensor is so small that the noise can be high. I use a larger multiplier for smaller tensors in torch._dynamo.utils.same. ## yolov3 Fails on the dashboard with the error: ``` Error on the dashboard: RMSE (res-fp64): 0.01278, (ref-fp64): 0.00246 and shape=torch.Size([256]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000 ``` Fix it by using a larger multiplier for smaller tensors and raising the tolerance. ## timm_efficientdet Fails on the dashboard with the error: ``` E0623 18:37:43.638000 139924418725056 torch/_dynamo/utils.py:1468] RMSE (res-fp64): 0.00096, (ref-fp64): 0.00009 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000 ``` But I cannot repro locally with the command: ``` time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --only timm_efficientdet --training ``` Raising the tolerance should fix it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129941 Approved by: https://github.com/jansel ghstack dependencies: #129996	2024-07-04 01:14:29 +00:00
Shunting Zhang	51fa0bd436	[pt2-bench] pass acc test if ref is NaN (#129996) I'm debugging the accuracy failure for training vision_maskrcnn. Unfortunately I could not run it locally (I've checked that the pinned commits for torchbenchmark/torchvision are correct, and reinstalled torchbenchmark for mask_rcnn). I get this error: ``` eager run fail: AssertionError: targets should not be none when in training mode ``` (Command: time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --training --only vision_maskrcnn ) But looking at the log from the dashboard: ``` E0623 19:17:59.085000 140114670171328 torch/_dynamo/utils.py:1468] RMSE (res-fp64): nan, (ref-fp64): nan and shape=torch.Size([1024, 256, 1, 1]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000 ``` We can see that both the reference number and the pt2 number are NaN. I change torch._dynamo.utils.same to return true if both RMSE values are NaN. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129996 Approved by: https://github.com/jansel	2024-07-04 01:14:29 +00:00
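
Note on #139849: the commit message describes a re-entrant metrics context that logs once, when the outermost scope exits, and a timer that adds elapsed time to a named field. The following is a minimal standalone sketch of that pattern only. The class body, the `on_exit` callback, the `timed` helper, and the field names are hypothetical and are not the PyTorch implementation.

```python
import time
from contextlib import contextmanager


class MetricsContext:
    """Hypothetical sketch: gathers metrics and reports them once, on outermost exit."""

    def __init__(self, on_exit):
        self._on_exit = on_exit  # callback that receives the gathered metrics
        self._metrics = {}
        self._depth = 0          # nesting level; 0 means "not entered"

    def __enter__(self):
        if self._depth == 0:
            self._metrics = {}   # fresh record for each outermost entry
        self._depth += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        self._depth -= 1
        if self._depth == 0:     # only the outermost exit triggers logging
            self._on_exit(dict(self._metrics))
        return False

    def in_progress(self):
        return self._depth > 0

    def increment(self, field, value):
        # Guard against the footgun of updating metrics outside the context.
        assert self.in_progress(), "metric updated outside of MetricsContext"
        self._metrics[field] = self._metrics.get(field, 0) + value


@contextmanager
def timed(ctx, field):
    """Time the wrapped block and add the elapsed nanoseconds to `field`."""
    assert ctx.in_progress(), "timed() used outside of MetricsContext"
    start = time.perf_counter_ns()
    try:
        yield
    finally:
        ctx.increment(field, time.perf_counter_ns() - start)


logged = []
ctx = MetricsContext(on_exit=logged.append)
with ctx:
    with timed(ctx, "phase_a_ns"):
        sum(range(1000))
    with ctx:  # recursive re-entry: nothing is logged when this inner scope exits
        with timed(ctx, "phase_b_ns"):
            sum(range(1000))
    entries_before_outer_exit = len(logged)
```

Counting nesting depth rather than using a boolean flag is what lets the inner `with ctx:` exit without emitting a second log entry.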

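Note on #129941 and #129996: the logs quoted in those commits compare an RMSE of the compiled result against an fp64 reference with the RMSE of the eager result against the same reference, using a multiplier and a tolerance. The sketch below illustrates that kind of check in plain Python, including the two changes the commits describe (pass when both RMSE values are NaN, and use a larger multiplier for small tensors). The exact formula, the function names, and the `small_numel` and `small_multiplier` values are assumptions for illustration, not the code in torch._dynamo.utils.same.

```python
import math


def rmse(a, b):
    """Root-mean-square error between two equal-length, non-empty sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def passes_accuracy(res, ref, fp64_ref, tol=1e-3, multiplier=3.0,
                    small_numel=10, small_multiplier=10.0):
    """Hypothetical check: is `res` about as close to fp64_ref as `ref` is?"""
    res_error = rmse(res, fp64_ref)
    ref_error = rmse(ref, fp64_ref)
    # Both errors NaN: the reference run is already broken, so do not
    # blame the compiled result.
    if math.isnan(res_error) and math.isnan(ref_error):
        return True
    # Small tensors are noisier, so allow a larger multiplier (assumed values).
    m = small_multiplier if len(fp64_ref) <= small_numel else multiplier
    return res_error <= m * ref_error + tol


nan = float("nan")
both_nan = passes_accuracy([nan], [nan], [1.0])
only_res_nan = passes_accuracy([nan], [1.0], [1.0])
small_tensor = passes_accuracy([0.05] * 2, [0.01] * 2, [0.0] * 2)
large_tensor = passes_accuracy([0.05] * 20, [0.01] * 20, [0.0] * 20)
```

With these assumed thresholds, the same error ratio passes for the 2-element tensor and fails for the 20-element one, which is the effect the "larger multiplier for smaller tensors" change is after.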
19 Commits