.. _modules:

Modules
=======

PyTorch uses modules to represent neural networks. Modules are:

* **Building blocks of stateful computation.**
  PyTorch provides a robust library of modules and makes it simple to define new custom modules, allowing for
  easy construction of elaborate, multi-layer neural networks.
* **Tightly integrated with PyTorch's**
  `autograd <https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html>`_
  **system.** Modules make it simple to specify learnable parameters for PyTorch's Optimizers to update.
* **Easy to work with and transform.** Modules are straightforward to save and restore, transfer between
  CPU / GPU / TPU devices, prune, quantize, and more.

This note describes modules, and is intended for all PyTorch users. Since modules are so fundamental to PyTorch,
many topics in this note are elaborated on in other notes or tutorials, and links to many of those documents
are provided here as well.

.. contents:: :local:


A Simple Custom Module
----------------------

To get started, let's look at a simpler, custom version of PyTorch's :class:`~torch.nn.Linear` module.
This module applies an affine transformation to its input.

.. code-block:: python

  import torch
  from torch import nn

  class MyLinear(nn.Module):
    def __init__(self, in_features, out_features):
      super().__init__()
      self.weight = nn.Parameter(torch.randn(in_features, out_features))
      self.bias = nn.Parameter(torch.randn(out_features))

    def forward(self, input):
      return (input @ self.weight) + self.bias

This simple module has the following fundamental characteristics of modules:

* **It inherits from the base Module class.**
  All modules should subclass :class:`~torch.nn.Module` for composability with other modules.
* **It defines some "state" that is used in computation.**
  Here, the state consists of randomly-initialized ``weight`` and ``bias`` tensors that define the affine
  transformation. Because each of these is defined as a :class:`~torch.nn.parameter.Parameter`, they are
  *registered* for the module and will automatically be tracked and returned from calls
  to :func:`~torch.nn.Module.parameters`. Parameters can be
  considered the "learnable" aspects of the module's computation (more on this later). Note that modules
  are not required to have state, and can also be stateless.
* **It defines a forward() function that performs the computation.** For this affine transformation module, the input
  is matrix-multiplied with the ``weight`` parameter (using the ``@`` short-hand notation) and added to the ``bias``
  parameter to produce the output. More generally, the ``forward()`` implementation for a module can perform arbitrary
  computation involving any number of inputs and outputs.

This simple module demonstrates how modules package state and computation together. Instances of this module can be
constructed and called:

.. code-block:: python

  m = MyLinear(4, 3)
  sample_input = torch.randn(4)
  m(sample_input)
  : tensor([-0.3037, -1.0413, -4.2057], grad_fn=<AddBackward0>)


Note that the module itself is callable, and that calling it invokes its ``forward()`` function.
This name is in reference to the concepts of "forward pass" and "backward pass", which apply to each module.
The "forward pass" is responsible for applying the computation represented by the module
to the given input(s) (as shown in the above snippet). The "backward pass" computes gradients of
module outputs with respect to its inputs and parameters, which can be used for "training" parameters through gradient
descent methods. PyTorch's autograd system automatically takes care of this backward pass computation, so there
is no need to implement a ``backward()`` function manually for each module. The process of training
module parameters through successive forward / backward passes is covered in detail in
:ref:`Neural Network Training with Modules`.

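
To make the backward pass concrete, here is a minimal sketch (reusing the ``MyLinear`` module defined above) that
reduces the module's output to a scalar stand-in for a loss and lets autograd populate the ``grad`` field of each
parameter; the ``.sum()`` reduction is only illustrative and is not part of the module itself:

.. code-block:: python

  m = MyLinear(4, 3)
  sample_input = torch.randn(4)

  # Forward pass: compute an output, then reduce it to a scalar "loss"
  loss = m(sample_input).sum()

  # Backward pass: autograd fills in the .grad field of each registered parameter
  loss.backward()
  print(m.weight.grad.shape)  # torch.Size([4, 3])
  print(m.bias.grad.shape)    # torch.Size([3])
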

The full set of parameters registered by the module can be iterated through via a call to
:func:`~torch.nn.Module.parameters` or :func:`~torch.nn.Module.named_parameters`,
where the latter includes each parameter's name:

.. code-block:: python

  for parameter in m.named_parameters():
    print(parameter)
  : ('weight', Parameter containing:
  tensor([[ 1.0597,  1.1796,  0.8247],
          [-0.5080, -1.2635, -1.1045],
          [ 0.0593,  0.2469, -1.4299],
          [-0.4926, -0.5457,  0.4793]], requires_grad=True))
  ('bias', Parameter containing:
  tensor([ 0.3634,  0.2015, -0.8525], requires_grad=True))

In general, the parameters registered by a module are aspects of the module's computation that should be
"learned". A later section of this note shows how to update these parameters using one of PyTorch's Optimizers.
Before we get to that, however, let's first examine how modules can be composed with one another.

Modules as Building Blocks
--------------------------

Modules can contain other modules, making them useful building blocks for developing more elaborate functionality.
The simplest way to do this is using the :class:`~torch.nn.Sequential` module. It allows us to chain together
multiple modules:

.. code-block:: python

  net = nn.Sequential(
    MyLinear(4, 3),
    nn.ReLU(),
    MyLinear(3, 1)
  )

  sample_input = torch.randn(4)
  net(sample_input)
  : tensor([-0.6749], grad_fn=<AddBackward0>)

Note that :class:`~torch.nn.Sequential` automatically feeds the output of the first ``MyLinear`` module as input
into the :class:`~torch.nn.ReLU`, and the output of that as input into the second ``MyLinear`` module. As
shown, it is limited to in-order chaining of modules.


In general, it is recommended to define a custom module for anything beyond the simplest use cases, as this gives
full flexibility on how submodules are used for a module's computation.

For example, here's a simple neural network implemented as a custom module:

.. code-block:: python

  import torch.nn.functional as F

  class Net(nn.Module):
    def __init__(self):
      super().__init__()
      self.l0 = MyLinear(4, 3)
      self.l1 = MyLinear(3, 1)
    def forward(self, x):
      x = self.l0(x)
      x = F.relu(x)
      x = self.l1(x)
      return x

This module is composed of two "children" or "submodules" (\ ``l0`` and ``l1``\ ) that define the layers of
the neural network and are utilized for computation within the module's ``forward()`` method. Immediate
children of a module can be iterated through via a call to :func:`~torch.nn.Module.children` or
:func:`~torch.nn.Module.named_children`:

.. code-block:: python

  net = Net()
  for child in net.named_children():
    print(child)
  : ('l0', MyLinear())
  ('l1', MyLinear())

To go deeper than just the immediate children, :func:`~torch.nn.Module.modules` and
:func:`~torch.nn.Module.named_modules` *recursively* iterate through a module and its child modules:

.. code-block:: python

  class BigNet(nn.Module):
    def __init__(self):
      super().__init__()
      self.l1 = MyLinear(5, 4)
      self.net = Net()
    def forward(self, x):
      return self.net(self.l1(x))

  big_net = BigNet()
  for module in big_net.named_modules():
    print(module)
  : ('', BigNet(
    (l1): MyLinear()
    (net): Net(
      (l0): MyLinear()
      (l1): MyLinear()
    )
  ))
  ('l1', MyLinear())
  ('net', Net(
    (l0): MyLinear()
    (l1): MyLinear()
  ))
  ('net.l0', MyLinear())
  ('net.l1', MyLinear())

Sometimes, it's necessary for a module to dynamically define submodules.
The :class:`~torch.nn.ModuleList` and :class:`~torch.nn.ModuleDict` modules are useful here; they
register submodules from a list or dict:

.. code-block:: python

  class DynamicNet(nn.Module):
    def __init__(self, num_layers):
      super().__init__()
      self.linears = nn.ModuleList(
        [MyLinear(4, 4) for _ in range(num_layers)])
      self.activations = nn.ModuleDict({
        'relu': nn.ReLU(),
        'lrelu': nn.LeakyReLU()
      })
      self.final = MyLinear(4, 1)
    def forward(self, x, act):
      for linear in self.linears:
        x = linear(x)
      x = self.activations[act](x)
      x = self.final(x)
      return x

  dynamic_net = DynamicNet(3)
  sample_input = torch.randn(4)
  output = dynamic_net(sample_input, 'relu')


For any given module, its parameters consist of its direct parameters as well as the parameters of all submodules.
This means that calls to :func:`~torch.nn.Module.parameters` and :func:`~torch.nn.Module.named_parameters` will
recursively include child parameters, allowing for convenient optimization of all parameters within the network:

.. code-block:: python

  for parameter in dynamic_net.named_parameters():
    print(parameter)
  : ('linears.0.weight', Parameter containing:
  tensor([[-1.2051,  0.7601,  1.1065,  0.1963],
          [ 3.0592,  0.4354,  1.6598,  0.9828],
          [-0.4446,  0.4628,  0.8774,  1.6848],
          [-0.1222,  1.5458,  1.1729,  1.4647]], requires_grad=True))
  ('linears.0.bias', Parameter containing:
  tensor([ 1.5310,  1.0609, -2.0940,  1.1266], requires_grad=True))
  ('linears.1.weight', Parameter containing:
  tensor([[ 2.1113, -0.0623, -1.0806,  0.3508],
          [-0.0550,  1.5317,  1.1064, -0.5562],
          [-0.4028, -0.6942,  1.5793, -1.0140],
          [-0.0329,  0.1160, -1.7183, -1.0434]], requires_grad=True))
  ('linears.1.bias', Parameter containing:
  tensor([ 0.0361, -0.9768, -0.3889,  1.1613], requires_grad=True))
  ('linears.2.weight', Parameter containing:
  tensor([[-2.6340, -0.3887, -0.9979,  0.0767],
          [-0.3526,  0.8756, -1.5847, -0.6016],
          [-0.3269, -0.1608,  0.2897, -2.0829],
          [ 2.6338,  0.9239,  0.6943, -1.5034]], requires_grad=True))
  ('linears.2.bias', Parameter containing:
  tensor([ 1.0268,  0.4489, -0.9403,  0.1571], requires_grad=True))
  ('final.weight', Parameter containing:
  tensor([[ 0.2509], [-0.5052], [ 0.3088], [-1.4951]], requires_grad=True))
  ('final.bias', Parameter containing:
  tensor([0.3381], requires_grad=True))


It's also easy to move all parameters to a different device or change their precision using
:func:`~torch.nn.Module.to`:

.. code-block:: python

  # Move all parameters to a CUDA device
  dynamic_net.to(device='cuda')

  # Change precision of all parameters
  dynamic_net.to(dtype=torch.float64)

  dynamic_net(torch.randn(4, device='cuda', dtype=torch.float64), 'relu')
  : tensor([6.5166], device='cuda:0', dtype=torch.float64, grad_fn=<AddBackward0>)


These examples show how elaborate neural networks can be formed through module composition. To allow for
quick and easy construction of neural networks with minimal boilerplate, PyTorch provides a large library of
performant modules within the :mod:`torch.nn` namespace that perform computation commonly found within neural
networks, including pooling, convolutions, loss functions, etc.

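
As an illustration of how these built-in modules compose, the following sketch assembles a small image classifier
from :mod:`torch.nn` layers; the specific layer choices and shapes are illustrative assumptions, not something
prescribed by this note:

.. code-block:: python

  # A small convolutional classifier built entirely from torch.nn modules.
  classifier = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),    # 32x32 feature maps -> 16x16
    nn.Flatten(),
    nn.Linear(16 * 16 * 16, 10)
  )

  images = torch.randn(8, 3, 32, 32)    # a batch of 8 RGB 32x32 images
  labels = torch.randint(0, 10, (8,))
  loss = nn.CrossEntropyLoss()(classifier(images), labels)

Built-in modules follow the same contract as the custom ``MyLinear`` above, so they can be freely mixed with
custom modules inside :class:`~torch.nn.Sequential` containers or custom ``forward()`` implementations.
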

In the next section, we give a full example of training a neural network.

For more information, check out:

* Recursively :func:`~torch.nn.Module.apply` a function to a module and its submodules
* Library of PyTorch-provided modules: `torch.nn <https://pytorch.org/docs/stable/nn.html>`_
* Defining neural net modules: https://pytorch.org/tutorials/beginner/examples_nn/two_layer_net_module.html

.. _Neural Network Training with Modules:

Neural Network Training with Modules
------------------------------------

Once a network is built, it has to be trained, and its parameters can be easily optimized with one of PyTorch's
Optimizers from :mod:`torch.optim`:

.. code-block:: python

  # Create the network (from previous section) and optimizer
  net = Net()
  optimizer = torch.optim.SGD(net.parameters(), lr=1e-4, weight_decay=1e-2, momentum=0.9)

  # Run a sample training loop that "teaches" the network
  # to output the constant zero function
  for _ in range(10000):
    input = torch.randn(4)
    output = net(input)
    loss = torch.abs(output)
    net.zero_grad()
    loss.backward()
    optimizer.step()

In this simplified example, the network learns to simply output zero, as any non-zero output is "penalized" according
to its absolute value by employing :func:`torch.abs` as a loss function. While this is not a very interesting task, the
key parts of training are present:


* A network is created.
* An optimizer (in this case, a stochastic gradient descent optimizer) is created, and the network's
  parameters are associated with it.
* A training loop...

  * acquires an input,
  * runs the network,
  * computes a loss,
  * zeros the network's parameters' gradients,
  * calls ``loss.backward()`` to compute the parameters' gradients,
  * calls ``optimizer.step()`` to apply the gradients to the parameters.


After the above snippet has been run, note that the network's parameters have changed. In particular, examining the
value of ``l1``\ 's ``weight`` parameter shows that its values are now much closer to 0 (as may be expected):

.. code-block:: python

  print(net.l1.weight)
  : Parameter containing:
  tensor([[-0.0013],
          [ 0.0030],
          [-0.0008]], requires_grad=True)

Training neural networks can often be tricky. For more information, check out:

* Using Optimizers: https://pytorch.org/tutorials/beginner/examples_nn/two_layer_net_optim.html
* Neural network training: https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html
* Introduction to autograd: https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html


Module State
------------

In the previous section, we demonstrated training a module's "parameters", or learnable aspects of computation.
Now, if we want to save the trained model to disk, we can do so by saving its ``state_dict`` (i.e. "state dictionary"):

.. code-block:: python

  # Save the module
  torch.save(net.state_dict(), 'net.pt')

  ...

  # Load the module later on
  new_net = Net()
  new_net.load_state_dict(torch.load('net.pt'))
  : <All keys matched successfully>

A module's ``state_dict`` contains state that affects its computation. This includes, but is not limited to, the
module's parameters. For some modules, it may be useful to have state beyond parameters that affects module
computation but is not learnable. For such cases, PyTorch provides the concept of "buffers", both "persistent"
and "non-persistent". Following is an overview of the various types of state a module can have:

* **Parameters**\ : learnable aspects of computation; contained within the ``state_dict``
* **Buffers**\ : non-learnable aspects of computation

  * **Persistent** buffers: contained within the ``state_dict`` (i.e. serialized when saving & loading)
  * **Non-persistent** buffers: not contained within the ``state_dict`` (i.e. left out of serialization)


As a motivating example for the use of buffers, consider a simple module that maintains a running mean. We want
the current value of the running mean to be considered part of the module's ``state_dict`` so that it will be
restored when loading a serialized form of the module, but we don't want it to be learnable.
This snippet shows how to use :func:`~torch.nn.Module.register_buffer` to accomplish this:

.. code-block:: python

  class RunningMean(nn.Module):
    def __init__(self, num_features, momentum=0.9):
      super().__init__()
      self.momentum = momentum
      self.register_buffer('mean', torch.zeros(num_features))
    def forward(self, x):
      self.mean = self.momentum * self.mean + (1.0 - self.momentum) * x
      return self.mean

Now, the current value of the running mean is considered part of the module's ``state_dict``
and will be properly restored when loading the module from disk:

.. code-block:: python

  m = RunningMean(4)
  for _ in range(10):
    input = torch.randn(4)
    m(input)

  print(m.state_dict())
  : OrderedDict([('mean', tensor([ 0.1041, -0.1113, -0.0647,  0.1515]))])

  # Serialized form will contain the 'mean' tensor
  torch.save(m.state_dict(), 'mean.pt')

  m_loaded = RunningMean(4)
  m_loaded.load_state_dict(torch.load('mean.pt'))
  assert torch.all(m.mean == m_loaded.mean)

As mentioned previously, buffers can be left out of the module's ``state_dict`` by marking them as non-persistent:

.. code-block:: python

  self.register_buffer('unserialized_thing', torch.randn(5), persistent=False)

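
For example, the following sketch registers one persistent and one non-persistent buffer and shows that only the
persistent one appears in the ``state_dict`` (the ``BufferDemo`` module and its buffer names are hypothetical,
introduced here only for illustration):

.. code-block:: python

  class BufferDemo(nn.Module):
    def __init__(self):
      super().__init__()
      self.register_buffer('running_total', torch.zeros(5))
      self.register_buffer('scratch', torch.randn(5), persistent=False)

  demo = BufferDemo()
  # Only the persistent buffer is included in the state_dict
  print(list(demo.state_dict().keys()))
  : ['running_total']
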

Both persistent and non-persistent buffers are affected by model-wide device / dtype changes applied with
:func:`~torch.nn.Module.to`:

.. code-block:: python

  # Moves all module parameters and buffers to the specified device / dtype
  m.to(device='cuda', dtype=torch.float64)

Buffers of a module can be iterated over using :func:`~torch.nn.Module.buffers` or
:func:`~torch.nn.Module.named_buffers`.

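
Note that these iterators yield non-persistent buffers as well; continuing with the hypothetical ``BufferDemo``
sketch above:

.. code-block:: python

  for name, buffer in demo.named_buffers():
    print(name, buffer.shape)
  : running_total torch.Size([5])
  scratch torch.Size([5])
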

For more information, check out:

* Saving and loading: https://pytorch.org/tutorials/beginner/saving_loading_models.html
* Serialization semantics: https://pytorch.org/docs/master/notes/serialization.html
* What is a state dict? https://pytorch.org/tutorials/recipes/recipes/what_is_state_dict.html

Module Hooks
------------

In :ref:`Neural Network Training with Modules`, we demonstrated the training process for a module, which iteratively
performs forward and backward passes, updating module parameters each iteration. For more control
over this process, PyTorch provides "hooks" that can perform arbitrary computation during a forward or backward
pass, even modifying how the pass is done if desired. Some useful applications of this functionality include
debugging, visualizing activations, examining gradients in-depth, etc. Hooks can be added to modules
you haven't written yourself, meaning this functionality can be applied to third-party or PyTorch-provided modules.

PyTorch provides two types of hooks for modules:

* **Forward hooks** are called during the forward pass. They can be installed for a given module with
  :func:`~torch.nn.Module.register_forward_pre_hook` and :func:`~torch.nn.Module.register_forward_hook`.
  These hooks will be called respectively just before the forward function is called and just after it is called.
  Alternatively, these hooks can be installed globally for all modules with the analogous
  :func:`~torch.nn.modules.module.register_module_forward_pre_hook` and
  :func:`~torch.nn.modules.module.register_module_forward_hook` functions.
* **Backward hooks** are called during the backward pass. They can be installed with
  :func:`~torch.nn.Module.register_full_backward_hook`. These hooks will be called when the backward for this
  Module has been computed and will allow the user to access the gradients for both the inputs and outputs.
  Alternatively, they can be installed globally for all modules with
  :func:`~torch.nn.modules.module.register_module_full_backward_hook`.

All hooks allow the user to return an updated value that will be used throughout the remaining computation.
Thus, these hooks can be used to either execute arbitrary code along the regular module forward / backward or
modify some inputs / outputs without having to change the module's ``forward()`` function.

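
As a rough sketch of how these registration functions fit together (the hook bodies below only print shapes and are
purely illustrative), the following installs a forward pre-hook, a forward hook, and a full backward hook on the
``MyLinear`` module defined earlier, then removes them using the handles returned at registration time:

.. code-block:: python

  m = MyLinear(4, 3)

  def forward_pre_hook(module, inputs):
    # Called just before forward(); returning None leaves the inputs unchanged
    print('forward pre-hook: input shape', inputs[0].shape)

  def forward_hook(module, inputs, output):
    # Called just after forward(); could return a modified output instead of None
    print('forward hook: output shape', output.shape)

  def backward_hook(module, grad_input, grad_output):
    # Called during the backward pass, once gradients for this module are available
    print('backward hook: grad w.r.t. output shape', grad_output[0].shape)

  pre_handle = m.register_forward_pre_hook(forward_pre_hook)
  fwd_handle = m.register_forward_hook(forward_hook)
  bwd_handle = m.register_full_backward_hook(backward_hook)

  m(torch.randn(4)).sum().backward()

  # Hooks can be removed via the handles returned at registration time
  pre_handle.remove()
  fwd_handle.remove()
  bwd_handle.remove()
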

Advanced Features
-----------------

PyTorch also provides several more advanced features that are designed to work with modules. All these functionalities
are "inherited" when writing a new module. In-depth discussion of these features can be found in the links below.

For more information, check out:

* Profiling: https://pytorch.org/tutorials/beginner/profiler.html
* Pruning: https://pytorch.org/tutorials/intermediate/pruning_tutorial.html
* Quantization: https://pytorch.org/tutorials/recipes/quantization.html
* Exporting modules to TorchScript (e.g. for usage from C++):
  https://pytorch.org/tutorials/beginner/Intro_to_TorchScript_tutorial.html