Getting Started
===============

Let’s start with a simple example. Note that you are likely to see more
significant speedups the newer your GPU is.

.. code:: python

    import torch
    from torch._dynamo import optimize

    def fn(x, y):
        a = torch.cos(x).cuda()
        b = torch.sin(y).cuda()
        return a + b

    new_fn = optimize("inductor")(fn)
    input_tensor = torch.randn(10000).to(device="cuda:0")
    a = new_fn(input_tensor, input_tensor)

This example will not actually run faster. Its purpose is to demonstrate
``torch.cos()`` and ``torch.sin()``, which are examples of pointwise ops:
they operate element by element on a vector. A more famous pointwise op you
might want to use is ``torch.relu()``. Pointwise ops in eager mode are
suboptimal because each one needs to read a tensor from memory, make some
changes, and then write back those changes. The single most important
optimization that inductor does is fusion. Returning to our example, we can
turn 2 reads and 2 writes into 1 read and 1 write, which is crucial
especially for newer GPUs where the bottleneck is memory bandwidth (how
quickly you can send data to a GPU) rather than compute (how quickly your
GPU can crunch floating point operations).
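
The tiny example above is dominated by overhead and by the device transfers
inside ``fn``, but you can see fusion pay off on a larger pointwise workload.
The following is a minimal benchmarking sketch, assuming a CUDA GPU is
available; the function, tensor size, and iteration count are illustrative
only and exact numbers will vary by hardware:

.. code-block:: python

    import time
    import torch
    from torch._dynamo import optimize

    def pointwise(x):
        # A chain of pointwise ops that inductor can fuse into a single kernel.
        return torch.relu(torch.sin(x) + torch.cos(x))

    compiled = optimize("inductor")(pointwise)
    x = torch.randn(2 ** 24, device="cuda")

    def bench(fn, iters=100):
        fn(x)  # warmup; the first call to the compiled version also triggers compilation
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            fn(x)
        torch.cuda.synchronize()
        return time.perf_counter() - start

    print("eager:   ", bench(pointwise))
    print("inductor:", bench(compiled))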

Another major optimization that inductor makes available is automatic
support for CUDA graphs. CUDA graphs help eliminate the overhead of
launching individual kernels from a Python program, which is especially
relevant for newer GPUs.
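
To get a sense of what CUDA graphs buy you, here is a sketch of the manual
``torch.cuda.graph`` capture API that inductor can automate for you. The
module, shapes, and warmup loop below are illustrative only; this is not
what inductor itself emits:

.. code-block:: python

    import torch

    model = torch.nn.Linear(512, 512).cuda()
    static_input = torch.randn(64, 512, device="cuda")

    # Warm up on a side stream before capture, as the CUDA graphs docs recommend.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            model(static_input)
    torch.cuda.current_stream().wait_stream(s)

    # Capture one iteration into a graph, then replay it with new data.
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_output = model(static_input)

    static_input.copy_(torch.randn(64, 512, device="cuda"))
    g.replay()  # re-launches all captured kernels with a single CPU-side call
    print(static_output.sum())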

TorchDynamo supports many different backends, but inductor specifically
works by generating `Triton <https://github.com/openai/triton>`__ kernels,
and we can inspect them by running ``TORCH_COMPILE_DEBUG=1 python trig.py``,
with the actual generated kernel being

.. code-block:: python

    @pointwise(size_hints=[16384], filename=__file__, meta={'signature': {0: '*fp32', 1: '*fp32', 2: 'i32'}, 'device': 0, 'constants': {}, 'configs': [instance_descriptor(divisible_by_16=(0, 1, 2), equal_to_1=())]})
    @triton.jit
    def kernel(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
        xnumel = 10000
        xoffset = tl.program_id(0) * XBLOCK
        xindex = xoffset + tl.reshape(tl.arange(0, XBLOCK), [XBLOCK])
        xmask = xindex < xnumel
        x0 = xindex
        tmp0 = tl.load(in_ptr0 + (x0), xmask)
        tmp1 = tl.sin(tmp0)
        tmp2 = tl.sin(tmp1)
        tl.store(out_ptr0 + (x0 + tl.zeros([XBLOCK], tl.int32)), tmp2, xmask)

And you can verify that fusing the two ``sin`` ops did actually occur,
because the two ``sin`` operations occur within a single Triton kernel and
the temporary variables are held in registers with very fast access.

You can read up a lot more on Triton’s performance
`here <https://openai.com/blog/triton/>`__, but the key is that it’s in
Python, so you can easily understand it even if you have not written all
that many CUDA kernels.

Next, let’s try a real model like resnet18 from the PyTorch Hub.

.. code-block:: python

    import torch
    import torch._dynamo as dynamo

    model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet18', pretrained=True)
    opt_model = dynamo.optimize("inductor")(model)
    opt_model(torch.randn(1, 3, 64, 64))

And that is not the only available backend; in a REPL you can run
``dynamo.list_backends()`` to see all of the available backends. Try out
``aot_cudagraphs`` or ``nvfuser`` next as inspiration.
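
For example, a quick sketch of listing the backends and swapping one in
might look like this (the exact names returned depend on your
installation):

.. code-block:: python

    import torch
    import torch._dynamo as dynamo

    print(dynamo.list_backends())  # e.g. ['aot_cudagraphs', 'eager', 'inductor', ...]

    model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet18', pretrained=True)
    dynamo.reset()  # clear previously compiled artifacts before switching backends
    opt_model = dynamo.optimize("aot_cudagraphs")(model)
    opt_model(torch.randn(1, 3, 64, 64))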

Let’s do something a bit more interesting now. Our community frequently
uses pretrained models from
`transformers <https://github.com/huggingface/transformers>`__ or
`TIMM <https://github.com/rwightman/pytorch-image-models>`__, and one of
our design goals is for Dynamo and inductor to work out of the box with
any model that people would like to author.

So we will directly download a pretrained model from the
HuggingFace hub and optimize it:

.. code-block:: python

    import torch
    from transformers import BertTokenizer, BertModel
    import torch._dynamo as dynamo

    # Copy pasted from here https://huggingface.co/bert-base-uncased
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained("bert-base-uncased").to(device="cuda:0")
    model = dynamo.optimize("inductor")(model)  # This is the only line of code that we changed
    text = "Replace me by any text you'd like."
    encoded_input = tokenizer(text, return_tensors='pt').to(device="cuda:0")
    output = model(**encoded_input)

If you remove the ``to(device="cuda:0")`` from the model and
``encoded_input``, then inductor will generate C++ kernels that are
optimized for running on your CPU. You can inspect both the Triton and C++
kernels for BERT; they are obviously more complex than the trigonometry
example above, but you can similarly skim them and understand them if you
understand PyTorch.
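
For reference, the CPU variant is simply the same code without the device
moves; a sketch:

.. code-block:: python

    import torch
    from transformers import BertTokenizer, BertModel
    import torch._dynamo as dynamo

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")  # stays on the CPU
    model = dynamo.optimize("inductor")(model)

    encoded_input = tokenizer("Replace me by any text you'd like.", return_tensors='pt')
    output = model(**encoded_input)  # inductor generates C++ kernels for the CPU path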

Similarly, let’s try out a TIMM example:

.. code-block:: python

    import timm
    import torch
    import torch._dynamo as dynamo

    model = timm.create_model('resnext101_32x8d', pretrained=True, num_classes=2)
    opt_model = dynamo.optimize("inductor")(model)
    opt_model(torch.randn(64, 3, 7, 7))

Our goal with Dynamo and inductor is to build the highest-coverage ML
compiler, one that should work with any model you throw at it.

Existing Backends
~~~~~~~~~~~~~~~~~

TorchDynamo has a growing list of backends, which can be found in
`backends.py <https://github.com/pytorch/pytorch/blob/master/torch/_dynamo/optimizations/backends.py>`__
or by calling ``dynamo.list_backends()``, each with its own optional
dependencies.

Some of the most commonly used backends include the following; a short
sketch after the list shows how to try a few of them out:

* **Debugging backends**:

  * ``dynamo.optimize("eager")`` - Uses PyTorch to run the extracted
    GraphModule. This is quite useful in debugging TorchDynamo issues.
  * ``dynamo.optimize("aot_eager")`` - Uses AotAutograd with no compiler,
    that is, just using PyTorch eager for AotAutograd’s extracted forward
    and backward graphs. This is useful for debugging, and unlikely to give
    speedups.

* **Training & inference backends**:

  * ``dynamo.optimize("inductor")`` - Uses the ``TorchInductor`` backend
    with AotAutograd and cudagraphs by leveraging codegened Triton kernels.
    `Read more <https://dev-discuss.pytorch.org/t/torchinductor-a-pytorch-native-compiler-with-define-by-run-ir-and-symbolic-shapes/747>`__
  * ``dynamo.optimize("nvfuser")`` - nvFuser with TorchScript. `Read more <https://dev-discuss.pytorch.org/t/tracing-with-primitives-update-1-nvfuser-and-its-primitives/593>`__
  * ``dynamo.optimize("aot_nvfuser")`` - nvFuser with AotAutograd. `Read more <https://dev-discuss.pytorch.org/t/tracing-with-primitives-update-1-nvfuser-and-its-primitives/593>`__
  * ``dynamo.optimize("aot_cudagraphs")`` - cudagraphs with AotAutograd. `Read more <https://github.com/pytorch/torchdynamo/pull/757>`__

* **Inference-only backends**:

  * ``dynamo.optimize("ofi")`` - Uses TorchScript ``optimize_for_inference``. `Read more <https://pytorch.org/docs/stable/generated/torch.jit.optimize_for_inference.html>`__
  * ``dynamo.optimize("fx2trt")`` - Uses Nvidia TensorRT for inference optimizations. `Read more <https://github.com/pytorch/TensorRT/blob/master/docsrc/tutorials/getting_started_with_fx_path.rst>`__
  * ``dynamo.optimize("onnxrt")`` - Uses ONNXRT for inference on CPU/GPU. `Read more <https://onnxruntime.ai/>`__
  * ``dynamo.optimize("ipex")`` - Uses IPEX for inference on CPU. `Read more <https://github.com/intel/intel-extension-for-pytorch>`__
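
As a quick way to experiment, you can loop over a few backend names on the
same model. This is just a sketch, assuming the debugging and inductor
backends are available in your build; call ``dynamo.reset()`` between runs
so each backend compiles from scratch:

.. code-block:: python

    import torch
    import torch._dynamo as dynamo

    model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU()).cuda()
    x = torch.randn(32, 64, device="cuda")

    for backend in ["eager", "aot_eager", "inductor"]:
        dynamo.reset()  # clear compiled caches so the next backend starts fresh
        opt_model = dynamo.optimize(backend)(model)
        out = opt_model(x)
        print(backend, tuple(out.shape))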

Why do you need another way of optimizing PyTorch code?
---------------------------------------------------------

While a number of other code optimization tools exist in the PyTorch
ecosystem, each of them has its own flow. Here are a few examples of
existing methods and their limitations:

- ``torch.jit.trace()`` is silently wrong if it cannot trace, for example
  during control flow
- ``torch.jit.script()`` requires modifications to user or library code by
  adding type annotations and removing non-PyTorch code
- ``torch.fx.symbolic_trace()`` either traces correctly or gives a hard
  error, but it is limited to traceable code so it still can’t handle
  control flow
- ``torch._dynamo`` works out of the box and produces partial graphs. It
  still has the option of producing a single graph with ``nopython=True``,
  which is needed for `some
  situations <./documentation/FAQ.md#do-i-still-need-to-export-whole-graphs>`__,
  but it allows a smoother transition where partial graphs can be optimized
  without code modification, as shown in the sketch after this list
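
To make the last point concrete, here is a sketch of the difference between
the default partial-graph behavior and ``nopython=True``. The function and
exception handling below are illustrative, and the exact exception type may
vary across versions:

.. code-block:: python

    import torch
    import torch._dynamo as dynamo

    def f(x):
        if x.sum() > 0:  # data-dependent control flow forces a graph break
            return torch.cos(x)
        return torch.sin(x)

    # Default: dynamo falls back to Python at the graph break and keeps going.
    print(dynamo.optimize("eager")(f)(torch.randn(10)))

    # nopython=True raises instead of falling back if a single whole graph
    # cannot be produced.
    try:
        dynamo.optimize("eager", nopython=True)(f)(torch.randn(10))
    except Exception as e:
        print("could not capture a single graph:", type(e).__name__)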