Summary:
Fixes https://github.com/pytorch/pytorch/issues/598
This is BC-breaking as we now explicitly don't call the hook when there are not Tensors at the top level of the output.
This feature was not working anyways as the returned grad_input/grad_output were wrong (not respecting the output structure and wrong inputs for multi-Node Module).
This is also BC-breaking as we now report the correct gradients for `nn.Module`s that contain multiple autograd `Node`s while we use to return bad results before.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46163
Reviewed By: ailzhang, mruberry
Differential Revision: D24894180
Pulled By: albanD
fbshipit-source-id: e1b5d193d2818eb2f51e2a2722c7405c8bd13c2b
Summary:
Update the requirements on input dimensions for torch.nn.SyncBatchNorm:
1. Checks the aggregated batch size `count_all` instead of batch size in every DDP process https://github.com/pytorch/pytorch/issues/36865
2. Added test function for SyncBatchNorm where every process only has 1 input
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37133
Differential Revision: D21331120
Pulled By: zhaojuanmao
fbshipit-source-id: ef3d1937990006609cfe4a68a64d90276c5085f2
Summary:
As shown in https://github.com/pytorch/pytorch/issues/36452 , SyncBatchNorm can block host thread due the ``MemcpyDtoH`` and ``MemcpyHtoD`` when dealing with argument ``counts`` for native function ``batch_norm_gather_stats_with_counts``.
- This fix change signiture of ``batch_norm_gather_stats_with_counts`` to
```c++
std::tuple<Tensor, Tensor> batch_norm_gather_stats_with_counts_cuda(const Tensor& self, const Tensor& mean, const Tensor& invstd, const Tensor& running_mean, const Tensor& running_var, double momentum, double epsilon, const Tensor& counts)
```
so it can directly receive "counts" in a ``CUDATensor`` rather than ``IntArrayRef`` whose data is in host memory.
- This fix also improve implementation of ``SyncBatchNorm`` function so the construction of ``counts`` tensor will not cause additional ``MemcpyHtoD``, which will block host thread, too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36659
Differential Revision: D21196991
Pulled By: ngimel
fbshipit-source-id: 84a529e6cf22e03618fecbb8f070ec452f81229e
Summary:
update the requirements on input dimensions for `torch.nn.SyncBatchNorm`:
1. 2D inputs is now permissible, https://github.com/pytorch/pytorch/issues/20204 ;
2. requires at least two element along normalization plane (BatchNorm behavior);
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29626
Differential Revision: D18492531
Pulled By: albanD
fbshipit-source-id: f008e46a2d520d73c3c2730890a7424eba2ede9e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25339
This is to get rid of backend-specific dispatch in modules; this autograd function is no longer backend specific so
doesn't need to be in a backend specific location.
Test Plan: Imported from OSS
Differential Revision: D17101576
Pulled By: gchanan
fbshipit-source-id: f4f0bd3ecc2d4dbd8cdfedbaabcadb8c603d2507
Summary:
- Summary:
Added synchronized batch normalization, allows synchronization of stats across mini-batches between processes within a process group.
Current implementation uses a mixture of extended ATen native functions (cpp cuda extension) + torch.nn.modules (c10d python API)
- User-facing api:
1. torch.nn.utils.convert_sync_batchnorm(modules, process_group=None)
2. torch.nn.SyncBatchNorm(num_features, eps=1e-5, momentum=0.1, affine=True, track_running_stats=True, ***process_group=None***)
- supported use case:
DistributedDataParallel with ***single-gpu multi-process***
a. User creates model containing `torch.nn.SyncBatchNorm` layers through one of the ways listed below:
1. use layers directly:
torch.nn.SyncBatchNorm(...)
similar API as with torch.nn.BatchNormXd(...)
with added argument `process_group` which is used to limit the scope of
synchronization within each process group. Default value is None, which
implies synchronization across all GPUs
2. use torch.nn.utils.convert_sync_batchnorm(modules, process_group)
recursively convert all `torch.nn.BatchNormXd` into `torch.nn.SyncBatchNorm`
preserving values of parameters/buffers.
the utility function also allows user to specify process_group value to all
converted layers.
b. user wraps their model with
`torch.distributed.parallel.DataParallelDistributed`, from this point, user
should follow the general guidelines for DDP use guide
- Error checking
For use cases not supported, we error out:
1. Application launched without ddp:
> import torch
> sbn = torch.nn.SyncBatchNorm(10).cuda()
> inp = torch.randn(5, 10, 3, 3).cuda()
> sbn(inp) --> Error!
> AttributeError: SyncBatchNorm is only supported within torch.nn.parallel.DistributedDataParallel
2. Application launched using DDP with multi-GPU per-process:
> ddp_module = nn.parallel.DistributedDataParallel(module, device_ids=device_ids, output_device=args.local_rank)
> ValueError: SyncBatchNorm is only supported for DDP with single GPU per process
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14267
Differential Revision: D14270035
Pulled By: ezyang
fbshipit-source-id: 4956d8fa565c32e9df5408d53719ff9f945f4d6d