pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Ivan Yashchuk 3aae6ff1e1 Add nvprims.var_mean (#83508)

This PR adds an nvfuser-specific primitive, `var_mean`. The interpretation `torch.var_mean` -> `torch.ops.nvprims.var_mean` is handled by the `TorchRefsNvfuserCapabilityMode` context manager.

I moved some helper code from `_prims/__init__.py` to `_prims_common`. Correctness is tested with OpInfo tests (see `PythonRefInfo("ops.nvprims.var_mean"`).

The layer norm reference now uses `torch.var_mean` instead of `torch._refs.var_mean` to allow interception.

Here's a simple performance comparison between this PR and master (on a 3080 Ti):

```py
import torch
from torch._prims.context import TorchRefsNvfuserCapabilityMode
from torch.fx.experimental.proxy_tensor import make_fx
from torch._prims.executor import execute

def func(a):
    return torch.native_layer_norm(a, (1024,), None, None, 1e-6)

a = torch.randn(10, 512, 1024, dtype=torch.float16, device="cuda")

with TorchRefsNvfuserCapabilityMode():
    gm = make_fx(func)(a)

for _ in range(10):
    execute(gm, a, executor="strictly_nvfuser")
```

Run with `PYTORCH_NVFUSER_DUMP=dump_eff_bandwidth python script.py`:

```py
# WITH THIS PR
# kernel1 run in 0.032768 ms, achieved: 641.25 GB/s
# kernel1 run in 0.033792 ms, achieved: 621.818 GB/s
# kernel1 run in 0.032768 ms, achieved: 641.25 GB/s
# kernel1 run in 0.032608 ms, achieved: 644.396 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s
# kernel1 run in 0.032768 ms, achieved: 641.25 GB/s
# kernel1 run in 0.03072 ms, achieved: 684 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s

# ON MASTER
# kernel1 run in 0.05632 ms, achieved: 373.091 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.043808 ms, achieved: 479.649 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
```

So this PR gives about a 35% improvement in achieved bandwidth with the nvfuser executor for this specific normalized shape.

This PR also fixes https://github.com/pytorch/pytorch/issues/83506 (see the change in `torch/csrc/jit/python/pybind_utils.cpp`).

Ref. https://github.com/pytorch/pytorch/issues/80187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83508
Approved by: https://github.com/ngimel

2022-08-28 18:45:25 +00:00
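The "about 35%" figure can be sanity-checked from the logged numbers alone. The sketch below is not part of the PR: it copies the achieved-bandwidth values from the log above and compares medians. The choice of the median as the summary statistic and the helper name `relative_gain` are assumptions made for illustration.

```python
from statistics import median

# Achieved bandwidth in GB/s, copied from the log in the commit message.
with_pr = [641.25, 621.818, 641.25, 644.396, 661.935,
           661.935, 641.25, 684.0, 661.935, 661.935]
on_master = [373.091, 477.209, 477.209, 477.209, 479.649,
             488.571, 477.209, 488.571, 488.571, 488.571]


def relative_gain(new, old):
    """Hypothetical helper: fractional gain of median(new) over median(old)."""
    return median(new) / median(old) - 1.0


gain = relative_gain(with_pr, on_master)
print(f"median with PR:   {median(with_pr):.1f} GB/s")
print(f"median on master: {median(on_master):.1f} GB/s")
print(f"relative gain:    {gain:.1%}")
```

The median is used here so that the slower first run on master (373.091 GB/s), which may be warm-up, does not dominate the comparison. The result lands in the mid-30% range, consistent with the estimate in the commit message.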
init.cpp	[JIT] Add SchemaCheckMode OpInfo test (#82442)	2022-08-09 23:13:43 +00:00
init.h
module_python.h	Revert "Revert "Add a lint rule for torch/csrc/util/pybind.h include (#82552)"" (#82599)	2022-08-02 19:37:02 +00:00
pybind_utils.cpp	Add nvprims.var_mean (#83508)	2022-08-28 18:45:25 +00:00
pybind_utils.h	Revert "Don't introduce new overload for SymInt (#83628)"	2022-08-27 01:23:17 +00:00
pybind.h
python_arg_flatten.cpp	[ONNX] Support optional type (#68793) (#73284)	2022-05-04 20:24:30 +00:00
python_arg_flatten.h
python_custom_class.cpp
python_custom_class.h
python_dict.cpp	Revert "Revert "Add a lint rule for torch/csrc/util/pybind.h include (#82552)"" (#82599)	2022-08-02 19:37:02 +00:00
python_dict.h
python_interpreter.cpp	Revert "Revert "Add a lint rule for torch/csrc/util/pybind.h include (#82552)"" (#82599)	2022-08-02 19:37:02 +00:00
python_ir.cpp	[JIT] Retry - Support scripting torch.is_autocast_enabled() (#82394)	2022-08-10 18:26:17 +00:00
python_ir.h
python_ivalue.h	Revert "Revert "Add a lint rule for torch/csrc/util/pybind.h include (#82552)"" (#82599)	2022-08-02 19:37:02 +00:00
python_list.cpp	Revert "Revert "Add a lint rule for torch/csrc/util/pybind.h include (#82552)"" (#82599)	2022-08-02 19:37:02 +00:00
python_list.h
python_sugared_value.cpp	Revert "Revert "Add a lint rule for torch/csrc/util/pybind.h include (#82552)"" (#82599)	2022-08-02 19:37:02 +00:00
python_sugared_value.h	[ROCm] Enable/fix unit tests test_stream_args and test_event_args (#82346)	2022-08-01 22:55:15 +00:00
python_tracer.cpp	Fix C API to be compatible with latest 3.11 beta (#81242)	2022-07-27 08:37:10 +00:00
python_tracer.h
python_tree_views.cpp	Reland "Make debug_pkl smaller by only emitting unique traces." (#73368)	2022-04-18 22:34:21 +00:00
python_tree_views.h
script_init.cpp	Get rid of ENABLE_UPGRADERS macro (#77574)	2022-08-09 05:33:14 +00:00
script_init.h
update_graph_executor_opt.cpp
update_graph_executor_opt.h