pytorch/torch/csrc/jit/python
Ivan Yashchuk 3aae6ff1e1 Add nvprims.var_mean (#83508)
This PR adds nvfuser-specific primitive - `var_mean`.
Interpretation `torch.var_mean` -> `torch.ops.nvprims.var_mean` is handled by `TorchRefsNvfuserCapabilityMode` context manager.

I moved some helper code from `_prims/__init__.py` to `_prims_common`. Correctness is tested with OpInfo tests (see `PythonRefInfo("ops.nvprims.var_mean"`).

Layer norm reference now uses `torch.var_mean` instead of `torch._refs.var_mean` to allow interception. Here's a simple comparison of performance with this PR and master (on 3080ti):
```py
import torch
from torch._prims.context import TorchRefsNvfuserCapabilityMode
from torch.fx.experimental.proxy_tensor import make_fx
from torch._prims.executor import execute

def func(a):
    return torch.native_layer_norm(a, (1024,), None, None, 1e-6)

a = torch.randn(10, 512, 1024, dtype=torch.float16, device="cuda")

with TorchRefsNvfuserCapabilityMode():
    gm = make_fx(func)(a)

for _ in range(10):
    execute(gm, a, executor="strictly_nvfuser");
```
run with `PYTORCH_NVFUSER_DUMP=dump_eff_bandwidth python script.py`
```py
# WITH THIS PR
# kernel1 run in 0.032768 ms, achieved: 641.25 GB/s
# kernel1 run in 0.033792 ms, achieved: 621.818 GB/s
# kernel1 run in 0.032768 ms, achieved: 641.25 GB/s
# kernel1 run in 0.032608 ms, achieved: 644.396 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s
# kernel1 run in 0.032768 ms, achieved: 641.25 GB/s
# kernel1 run in 0.03072 ms, achieved: 684 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s

# ON MASTER
# kernel1 run in 0.05632 ms, achieved: 373.091 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.043808 ms, achieved: 479.649 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
```
So this PR gives about 35% improvement in performance using nvfuser executor with this specific normalized shape.

Also this PR fixes https://github.com/pytorch/pytorch/issues/83506 (see the change in `torch/csrc/jit/python/pybind_utils.cpp`).

Ref. https://github.com/pytorch/pytorch/issues/80187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83508
Approved by: https://github.com/ngimel
2022-08-28 18:45:25 +00:00
..
init.cpp [JIT] Add SchemaCheckMode OpInfo test (#82442) 2022-08-09 23:13:43 +00:00
init.h
module_python.h Revert "Revert "Add a lint rule for torch/csrc/util/pybind.h include (#82552)"" (#82599) 2022-08-02 19:37:02 +00:00
pybind_utils.cpp Add nvprims.var_mean (#83508) 2022-08-28 18:45:25 +00:00
pybind_utils.h Revert "Don't introduce new overload for SymInt (#83628)" 2022-08-27 01:23:17 +00:00
pybind.h
python_arg_flatten.cpp [ONNX] Support optional type (#68793) (#73284) 2022-05-04 20:24:30 +00:00
python_arg_flatten.h
python_custom_class.cpp
python_custom_class.h
python_dict.cpp Revert "Revert "Add a lint rule for torch/csrc/util/pybind.h include (#82552)"" (#82599) 2022-08-02 19:37:02 +00:00
python_dict.h
python_interpreter.cpp Revert "Revert "Add a lint rule for torch/csrc/util/pybind.h include (#82552)"" (#82599) 2022-08-02 19:37:02 +00:00
python_ir.cpp [JIT] Retry - Support scripting torch.is_autocast_enabled() (#82394) 2022-08-10 18:26:17 +00:00
python_ir.h
python_ivalue.h Revert "Revert "Add a lint rule for torch/csrc/util/pybind.h include (#82552)"" (#82599) 2022-08-02 19:37:02 +00:00
python_list.cpp Revert "Revert "Add a lint rule for torch/csrc/util/pybind.h include (#82552)"" (#82599) 2022-08-02 19:37:02 +00:00
python_list.h
python_sugared_value.cpp Revert "Revert "Add a lint rule for torch/csrc/util/pybind.h include (#82552)"" (#82599) 2022-08-02 19:37:02 +00:00
python_sugared_value.h [ROCm] Enable/fix unit tests test_stream_args and test_event_args (#82346) 2022-08-01 22:55:15 +00:00
python_tracer.cpp Fix C API to be compatible with latest 3.11 beta (#81242) 2022-07-27 08:37:10 +00:00
python_tracer.h
python_tree_views.cpp Reland "Make debug_pkl smaller by only emitting unique traces." (#73368) 2022-04-18 22:34:21 +00:00
python_tree_views.h
script_init.cpp Get rid of ENABLE_UPGRADERS macro (#77574) 2022-08-09 05:33:14 +00:00
script_init.h
update_graph_executor_opt.cpp
update_graph_executor_opt.h