pytorch/torch/csrc/jit/codegen/cuda
David Berard e33f3229a2 [NVFuser] environment variable to turn nvfuser on or off (#76485)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76485

Adds an environment variable `PYTORCH_JIT_ENABLE_NVFUSER` for
controlling whether or not nvfuser is enabled. This required changing
the PassManager behavior to support the case where nvfuser gets enabled
by default when PYTORCH_JIT_ENABLE_NVFUSER=1.

Previously the solution for turning nvfuser on or off was to use the
PassManager to register or un-register the pass. That works fine if the
pass starts off _disabled_, but causes issues once we try to enable the
pass by default.

The main issue with enabling by default is with the validation check to
see whether NVFuser can be turned on. The check relies on
at::globalContext().hasCUDA(), which requires CUDAHooks to be registered
before hasCUDA() will work correctly. At static initialization time it's
difficult to ensure that CUDAHooks will be registered _before_ we
attempt to register the nvfuser pass. In OSS this worked fine, but
internally it would fail on ROCm builds.

To fix this, we switch the control of NVFuser enablement to a check in
the pass itself. I.e., previously we enabled/disabled nvfuser by
registering or de-registering the pass in the pass manager; now the pass
is always registered, and enablement is decided by a check within the
nvfuser pass.
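
A minimal Python sketch of this control-flow change, purely illustrative (the real implementation is C++; all names below are stand-ins, not actual APIs):

```python
registered_passes = []
nvfuser_enabled = False  # flipped by PYTORCH_JIT_ENABLE_NVFUSER or the JIT API

def nvfuser_pass(graph):
    # The enablement check now lives inside the pass, where backend
    # availability (e.g. hasCUDA()) can be queried safely at run time.
    if not nvfuser_enabled:
        return graph
    return graph  # stand-in for the actual fusion transformation

registered_passes.append(nvfuser_pass)  # always registered, even when disabled
```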

Remaining TODO: Connect this with NNC so that in cases where NNC is
available but not NVFuser (i.e. on AMD gpus), NNC can be turned on
automatically.

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D35982618

Pulled By: davidberard98

fbshipit-source-id: fd5b76bc0b8c8716c96fdc04bebfb15026a7ef60
(cherry picked from commit ff14603ff5ac8d9b6c749c4f111f4a8be8023b7f)

# NVFuser - A Fusion Code Generator for NVIDIA GPUs

NVFuser is integrated as a backend for TorchScript's Profiling Graph Executor.

## Enabling NVFuser

NVFuser is not currently the default fuser for NVIDIA GPUs.

Fusions will only show up around the third iteration of execution; the exact number depends on the profiling executor's optimization phases.

### Enable by Context Manager

```python
jit_model = torch.jit.script(model)

with torch.jit.fuser("fuser2"):
    for _ in range(5):
        outputs = jit_model(inputs)
```

### Enable by Specific Functions

1. Disable CPU/GPU fusion for the native/NNC fusers:

   ```python
   torch._C._jit_override_can_fuse_on_cpu(False)
   torch._C._jit_override_can_fuse_on_gpu(False)
   ```

2. Disable the NNC fuser:

   ```python
   torch._C._jit_set_texpr_fuser_enabled(False)
   ```

3. Enable nvfuser:

   ```python
   torch._C._jit_set_nvfuser_enabled(True)
   ```
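
Putting these together, a small sketch that switches to nvfuser and restores the previous state afterwards. This assumes `_jit_set_nvfuser_enabled` returns the previous value (as used by `torch.jit.fuser`); `model` and `inputs` are placeholders:

```python
import torch

# Save the current fuser state via the corresponding getters.
old_cpu = torch._C._jit_can_fuse_on_cpu()
old_gpu = torch._C._jit_can_fuse_on_gpu()
old_nnc = torch._C._jit_texpr_fuser_enabled()

torch._C._jit_override_can_fuse_on_cpu(False)
torch._C._jit_override_can_fuse_on_gpu(False)
torch._C._jit_set_texpr_fuser_enabled(False)
old_nvfuser = torch._C._jit_set_nvfuser_enabled(True)  # returns previous value

try:
    jit_model = torch.jit.script(model)
    for _ in range(5):
        outputs = jit_model(inputs)
finally:
    # Restore everything so the rest of the process is unaffected.
    torch._C._jit_set_nvfuser_enabled(old_nvfuser)
    torch._C._jit_set_texpr_fuser_enabled(old_nnc)
    torch._C._jit_override_can_fuse_on_gpu(old_gpu)
    torch._C._jit_override_can_fuse_on_cpu(old_cpu)
```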

### Simple knobs to change fusion behavior

1. Allow single-node fusion: `torch._C._jit_set_nvfuser_single_node_mode(True)`. A fusion group is normally created only when two or more compatible ops can be grouped together. Turning on single-node fusion allows the fusion pass to create a fusion group with a single node; this is very handy for testing, and can be useful when a single-node generated kernel outperforms the framework's native CUDA kernel.

2. Allow horizontal fusion: `torch._C._jit_set_nvfuser_horizontal_mode(True)`. The fusion pass normally fuses producers into consumers; horizontal mode additionally allows sibling nodes that share a tensor input to be fused together, which can save input memory bandwidth.

3. Turn off guards for fusion: `torch._C._jit_set_nvfuser_guard_mode(False)`. This disables the runtime check on a fusion group's pre-assumptions (tensor meta information / constant inputs / profiled constants). It is really only meant for testing, where we want to ensure generated kernels are indeed exercised; avoid using it in training scripts.

4. Turn off fusion for certain node kinds: `torch._C._jit_set_nvfuser_skip_node_kind("aten::add", True)`. This disables fusion for the given node kind while still allowing other nodes to be fused. The first parameter is the node kind; the second toggles whether that node kind is skipped. A combined sketch of these knobs follows below.
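
A combined sketch of the knobs above (the calls are the ones listed in this section; the flag values are only illustrative):

```python
import torch

torch._C._jit_set_nvfuser_enabled(True)

# 1. Allow fusion groups that contain a single node (handy for testing).
torch._C._jit_set_nvfuser_single_node_mode(True)

# 2. Also fuse sibling nodes that share a tensor input.
torch._C._jit_set_nvfuser_horizontal_mode(True)

# 3. Testing only: skip the runtime guard on fusion-group pre-assumptions.
torch._C._jit_set_nvfuser_guard_mode(False)

# 4. Keep aten::add out of fusion groups (True toggles the skip on).
torch._C._jit_set_nvfuser_skip_node_kind("aten::add", True)
```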

## Fusion Debugging

Given the following script as an example:

```python
import torch

def forward(x):
    o = x + 1.0
    o = o.relu()
    return o

shape = (2, 32, 128, 512)
input = torch.rand(*shape).cuda()
t = torch.jit.script(forward)

with torch.jit.fuser("fuser2"):
    for k in range(4):
        o = t(input)
```

### TorchScript Based Debugging

#### 1. TorchScript IR Graph

**Usage**

There are two easy ways to check fusion in a graph. The first is to print the graph from the Python script after a few runs (so optimization has kicked in):

```python
print(t.graph_for(input))
```

The second way is to turn on graph dumping in the profiling executor via the command line:

PYTORCH_JIT_LOG_LEVEL="profiling_graph_executor_impl" python <your pytorch script>
Example Output

The graph print-out is straightforward: look for `prim::CudaFusionGroup_X` for fused kernels. The profiling executor dumps many things, but the most important part is the `Optimized Graph`. In this example it shows a fusion group, which is an indication that fusion is happening and you should expect a fused kernel!

```
  Optimized Graph:
  graph(%x.1 : Tensor):
    %12 : bool = prim::CudaFusionGuard[types=[Float(2, 32, 128, 512, strides=[2097152, 65536, 512, 1], requires_grad=0, device=cuda:0)]](%x.1)
    %11 : Tensor = prim::If(%12)
      block0():
        %o.8 : Tensor = prim::CudaFusionGroup_0[cache_id=0](%x.1)
        -> (%o.8)
      block1():
        %18 : Function = prim::Constant[name="fallback_function", fallback=1]()
        %19 : (Float(2, 32, 128, 512, strides=[2097152, 65536, 512, 1], requires_grad=0, device=cuda:0)) = prim::CallFunction(%18, %x.1)
        %20 : Float(2, 32, 128, 512, strides=[2097152, 65536, 512, 1], requires_grad=0, device=cuda:0) = prim::TupleUnpack(%19)
        -> (%20)
    return (%11)
  with prim::CudaFusionGroup_0 = graph(%2 : Float(2, 32, 128, 512, strides=[2097152, 65536, 512, 1], requires_grad=0, device=cuda:0)):
    %4 : int = prim::Constant[value=1]()
    %3 : float = prim::Constant[value=1.]() # test.py:6:12
    %o.1 : Float(2, 32, 128, 512, strides=[2097152, 65536, 512, 1], requires_grad=0, device=cuda:0) = aten::add(%2, %3, %4) # test.py:6:8
    %o.5 : Float(2, 32, 128, 512, strides=[2097152, 65536, 512, 1], requires_grad=0, device=cuda:0) = aten::relu(%o.1) # test.py:7:8
    return (%o.5)
```

Note that one thing that can prevent fusion when you are running training is autodiff. The fusion pass only runs within `prim::DifferentiableGraph`, so the first thing to check is that the targeted ops are inside differentiable graph subgraphs. The graph dump can be quite confusing to look at, since it naively dumps all graphs executed by the profiling executor, and differentiable graphs are executed via a nested graph executor. So for each graph you might see a few segmented `Optimized Graph` dumps, each corresponding to a differentiable node in the original graph.

#### 2. Cuda Fusion Graphs

**Usage**

The CUDA fusion dump gives the input and output graphs of the fusion pass. This is a good place to check the fusion pass logic.

PYTORCH_JIT_LOG_LEVEL="graph_fuser" python <your pytorch script>
Example Output

Running the same script as above, you should look for two graphs in the log: `Before Fusion` shows the subgraph the fusion pass runs on, and `Before Compilation` shows the graph sent to the codegen backend, where each `CudaFusionGroup` will trigger the codegen runtime system to generate kernel(s) to execute the subgraph.

```
  Before Fusion:
  graph(%x.1 : Tensor):
    %2 : float = prim::Constant[value=1.]()
    %1 : int = prim::Constant[value=1]()
    %3 : Tensor = prim::profile[profiled_type=Float(2, 32, 128, 512, strides=[2097152, 65536, 512, 1], requires_grad=0, device=cuda:0)](%x.1)
    %o.10 : Tensor = aten::add(%3, %2, %1) # test.py:6:8
    %5 : Tensor = prim::profile[profiled_type=Float(2, 32, 128, 512, strides=[2097152, 65536, 512, 1], requires_grad=0, device=cuda:0)](%o.10)
    %o.7 : Tensor = aten::relu(%5) # test.py:7:8
    %7 : Tensor = prim::profile[profiled_type=Float(2, 32, 128, 512, strides=[2097152, 65536, 512, 1], requires_grad=0, device=cuda:0)](%o.7)
    %8 : Tensor = prim::profile[profiled_type=Float(2, 32, 128, 512, strides=[2097152, 65536, 512, 1], requires_grad=0, device=cuda:0)](%o.7)
    return (%7, %8)

  Before Compilation:
  graph(%x.1 : Tensor):
    %13 : bool = prim::CudaFusionGuard[types=[Float(2, 32, 128, 512, strides=[2097152, 65536, 512, 1], requires_grad=0, device=cuda:0)]](%x.1)
    %12 : Tensor = prim::If(%13)
      block0():
        %o.11 : Tensor = prim::CudaFusionGroup_0(%x.1)
        -> (%o.11)
      block1():
        %o.7 : Tensor = prim::FallbackGraph_1(%x.1)
        -> (%o.7)
    return (%12, %12)
  with prim::CudaFusionGroup_0 = graph(%2 : Float(2, 32, 128, 512, strides=[2097152, 65536, 512, 1], requires_grad=0, device=cuda:0)):
    %4 : int = prim::Constant[value=1]()
    %3 : float = prim::Constant[value=1.]()
    %o.10 : Float(2, 32, 128, 512, strides=[2097152, 65536, 512, 1], requires_grad=0, device=cuda:0) = aten::add(%2, %3, %4) # test.py:6:8
    %o.7 : Float(2, 32, 128, 512, strides=[2097152, 65536, 512, 1], requires_grad=0, device=cuda:0) = aten::relu(%o.10) # test.py:7:8
    return (%o.7)
  with prim::FallbackGraph_1 = graph(%x.1 : Float(2, 32, 128, 512, strides=[2097152, 65536, 512, 1], requires_grad=0, device=cuda:0)):
    %1 : int = prim::Constant[value=1]()
    %2 : float = prim::Constant[value=1.]()
    %o.10 : Float(2, 32, 128, 512, strides=[2097152, 65536, 512, 1], requires_grad=0, device=cuda:0) = aten::add(%x.1, %2, %1) # test.py:6:8
    %o.7 : Float(2, 32, 128, 512, strides=[2097152, 65536, 512, 1], requires_grad=0, device=cuda:0) = aten::relu(%o.10) # test.py:7:8
    return (%o.7)
```

## General ideas on debugging no-fusion

Currently there are a few consumers that utilize nvfuser by lowering computations to TorchScript and executing them through a ProfilingExecutor.

Without going into too much detail about how the integration is done, here are a few notes on debugging no-fusion on the ProfilingExecutor:

1. Run the TorchScript module multiple times (5 could be a lucky number) to enable fusion. The ProfilingExecutor uses the first (few) runs for profiling, and later optimizations (including the fusion pass that enables nvfuser) rely on that profiling information, so your initial runs are not going to trigger fused kernels. Note that the number of profiling runs depends on your model.

2. Fused kernels show up in the TorchScript IR as `prim::CudaFusionGroup`. You can look at your TorchScript optimized graph to see if fusion is happening: `jit_model.graph_for(*inputs)`.

3. If your scripted model has inputs requiring gradients, fusion only happens for graphs inside `prim::DifferentiableGraph`. There are many reasons why your graph might not be autodiff-able. Take a look at `torch/csrc/jit/runtime/symbolic_script.cpp`, which lists all autodiff-able ops (note that this is a different list from autograd-supported ops). There is also a threshold below which tiny autodiff graphs are inlined/reverted; this can be disabled via `torch._C._debug_set_autodiff_subgraph_inlining(False)`, as shown in the sketch below.
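
For instance, a minimal sketch of checking fusion in a training-style run (the toy function and shapes are placeholders):

```python
import torch

def f(x):
    return (x + 1.0).relu()

jit_f = torch.jit.script(f)
x = torch.rand(8, 32, device="cuda", requires_grad=True)

# Keep tiny differentiable subgraphs from being inlined back.
torch._C._debug_set_autodiff_subgraph_inlining(False)

with torch.jit.fuser("fuser2"):
    for _ in range(5):  # warm-up runs so profiling and optimization kick in
        out = jit_f(x)
        out.sum().backward()

# Look for prim::DifferentiableGraph and, inside it, prim::CudaFusionGroup.
print(jit_f.graph_for(x))
```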

## General ideas on debugging nvfuser malfunction

Assuming the ProfilingExecutor side worked out properly, but a region that is supposed to be fused did not end up in a fused kernel, here are some ways to dig deeper:

1. Dump the fusion pass result: `PYTORCH_JIT_LOG_LEVEL=graph_fuser python your_script.py &> log`

    Look for the graphs dumped with `Before Fusion` & `Before Compilation`, which show the portion of the graph the fusion pass runs on and the result of fusion (`CudaFusionGroup`).

2. Check which ops are not fused, and roughly why: `PYTORCH_JIT_LOG_LEVEL=">partition:graph_fuser" python your_script.py &> log`

    Enabling `GRAPH_UPDATE` from partition.cpp dumps a log entry whenever a given node is rejected by fusion.

3. Disable the FALLBACK path: if you see a warning that a FALLBACK path has been taken while executing your model with nvfuser enabled, it indicates that either codegen or the fusion pass has failed unexpectedly. This is likely to regress model performance, even though it is still functionally correct. We recommend disabling the FALLBACK path, so the error is reported properly and an informative issue can be opened:

    `PYTORCH_NVFUSER_DISABLE_FALLBACK=1 python your_script.py &> log`

4. Pinpoint the kernel/fusion pattern that is causing the error: with a larger model that includes multiple fusion patterns, it can be tricky to figure out which exact fusion is causing the FALLBACK and to build a minimal Python repro. One quick thing to try is to run the example with a few knobs turned on:

    ```
    PYTORCH_NVFUSER_DISABLE_FALLBACK=1 \
    PYTORCH_JIT_LOG_LEVEL=">partition:graph_fuser:>>kernel_cache" \
    python your_script.py &> log
    ```

    This logs all TorchScript IR parsed to codegen IR, as well as the kernels generated and executed by nvfuser. Since the fallback path is disabled, the last entries in the log are likely to indicate the failing fusion.

    Hint: look for the last `Before Compilation:`, which indicates a parsing failure, or the last `running GraphCache: xxxxx`, which indicates a JIT compilation/execution failure (also search for that GraphCache address, which should have a TorchScript IR dump earlier in the log).
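
    If the log is large, a tiny (hypothetical) helper can pull out the last dump around such a marker:

    ```python
    # Hypothetical convenience script; the marker strings match the dumps above.
    def last_dump(log_path: str, marker: str = "Before Compilation") -> str:
        text = open(log_path, errors="replace").read()
        idx = text.rfind(marker)
        return text[idx:idx + 2000] if idx != -1 else "marker not found"

    print(last_dump("log"))
    ```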

## Query nvfuser codegen kernels

There are a few debug dumps that can be turned on via environment variables. Look for `PYTORCH_NVFUSER_DUMP` inside `[pytorch_source_path]/torch/csrc/jit/codegen/cuda/utils.cpp`. A few useful ones are:

1. `dump_eff_bandwidth`: prints the effective bandwidth of each generated kernel. This naively measures kernel time divided by I/O buffer size, and is a good, simple performance metric for bandwidth-bound kernels.
2. `cuda_kernel`: prints the generated CUDA kernels.
3. `launch_param`: prints the launch config of generated kernels.
4. `print_args`: prints the input/output tensors of executed codegen kernels.
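
For example, assuming `PYTORCH_NVFUSER_DUMP` accepts a comma-separated list of options and is only read once the first fusion compiles (both assumptions; check the parsing in utils.cpp), it can be set from the script itself:

```python
import os

# Assumption: comma-separated option list, read when the first fusion compiles,
# so setting it at the top of the script takes effect.
os.environ["PYTORCH_NVFUSER_DUMP"] = "cuda_kernel,launch_param"
```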

## FAQs

1. There is a regression after turning on nvfuser.

First, check that fusion kernels are running properly. Try running your model with the fallback disabled, e.g. `export PYTORCH_NVFUSER_DISABLE_FALLBACK=1`, to see whether you hit any errors that caused the fallback.

2. I didn't see any speedup with nvfuser.

Check whether there is fusion in your scripted model. Run your script with `PYTORCH_JIT_LOG_LEVEL="graph_fuser"`; you should see some log dumps of the before/after graphs from the fusion pass. If nothing shows up in the log, something in TorchScript is not right and the fusion pass was not executed. Check [General ideas on debugging no-fusion] above for more details.