12 Commits

Author SHA1 Message Date
Vasiliy Kuznetsov
f64d24c941 speed up SyncBatchNorm by batching distributed communication (#38246)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38246

Speeds up SyncBatchNorm by batching the distributed communication.
Initial benchmarks show a speed improvement of roughly 15% or more on
MobileNetV2 and EfficientNetB3 on a single machine with 8 GPUs. The
improvement over the baseline grows as the number of GPUs increases.
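
A minimal sketch of the batching idea, not the code from this PR: instead of issuing one collective per statistic, each rank packs its mean, invstd and count into a single buffer and gathers it once. `FakeComm`, `pack` and `unpack` are hypothetical helpers invented for this illustration; a real implementation would call a collective such as `torch.distributed.all_gather` on tensors.

```python
# Illustrative simulation only; `FakeComm` is a hypothetical stand-in for a
# real collective backend and is not part of PyTorch.
from typing import List, Tuple


class FakeComm:
    """Counts how many collective calls are issued."""

    def __init__(self) -> None:
        self.calls = 0

    def all_gather(self, per_rank: List[List[float]]) -> List[List[float]]:
        # Every rank ends up with a copy of every rank's buffer.
        self.calls += 1
        return [list(buf) for buf in per_rank]


def pack(mean: List[float], invstd: List[float], count: int) -> List[float]:
    # Lay the statistics out back to back: [mean..., invstd..., count].
    return list(mean) + list(invstd) + [float(count)]


def unpack(buf: List[float], channels: int) -> Tuple[List[float], List[float], int]:
    # Split a packed buffer back into its three parts.
    return buf[:channels], buf[channels:2 * channels], int(buf[2 * channels])


# Two ranks, two channels each: (mean, invstd, count). Values are made up.
local_stats = [([0.5, 1.0], [2.0, 4.0], 20), ([1.5, 3.0], [1.0, 0.5], 20)]

comm = FakeComm()
gathered = comm.all_gather([pack(*stats) for stats in local_stats])
batched_stats = [unpack(buf, channels=2) for buf in gathered]
batched_calls = comm.calls  # one collective instead of one per statistic
```

Gathering the three statistics separately would issue three collectives per layer; packing them trades a little copying for fewer synchronization points.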

Test Plan:
verified that before+after intermediate values in fwd/bwd pass are equivalent (with `torch.allclose`)

benchmark runner:
https://gist.github.com/vkuzo/7b1ce1b1b051ee6d46877d0f18ab9b1f

results (1 forward pass + 1 backward pass, 1 machine, 8x Tesla-P100, batch_size=20 per node):
```
model           gpus  before_ms after_ms  speedup
efficientnet-b3 2     660       654       0.00909
efficientnet-b3 4     777       710       0.08623
efficientnet-b3 8     988       838       0.15182
mobilenet-v2    2     267       266       0.00375
mobilenet-v2    4     328       289       0.1189
mobilenet-v2    8     453       373       0.1766
```
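
The `speedup` column appears to be the fractional reduction in time, `(before_ms - after_ms) / before_ms`, rather than a ratio. A small sketch that recomputes it from two rows of the table; the function name `relative_speedup` is an assumption made for this example.

```python
def relative_speedup(before_ms: float, after_ms: float) -> float:
    """Fraction of the original time saved: (before - after) / before."""
    return (before_ms - after_ms) / before_ms


# (model, gpus, before_ms, after_ms) copied from the 8-GPU rows above.
rows = [
    ("efficientnet-b3", 8, 988, 838),
    ("mobilenet-v2", 8, 453, 373),
]

speedups = {model: relative_speedup(before, after) for model, _, before, after in rows}

for model, value in speedups.items():
    print(f"{model}: {value:.5f}")
```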

Imported from OSS

Differential Revision: D21505905

fbshipit-source-id: 3e796343fce8329a2e17671d60ae66c0387924e7
2020-05-13 11:21:42 -07:00
elmirador
ae755a73d3 SyncBatchNorm size check update (#37133)
Summary:
Update the requirements on input dimensions for torch.nn.SyncBatchNorm:
1. Checks the aggregated batch size `count_all` instead of batch size in every DDP process https://github.com/pytorch/pytorch/issues/36865
2. Added test function for SyncBatchNorm where every process only has 1 input
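
A minimal sketch of the aggregated check described in item 1, assuming each process reports how many values per channel it holds. `check_total_count` and its error message are hypothetical and are not taken from the PyTorch source.

```python
from typing import Sequence


def check_total_count(counts_per_process: Sequence[int]) -> int:
    """Validate the aggregated count instead of each local batch size."""
    count_all = sum(counts_per_process)
    if count_all <= 1:
        # Hypothetical message; one value per channel has no usable variance.
        raise ValueError(
            f"expected more than 1 value per channel across all processes, got {count_all}"
        )
    return count_all


# Every process holds a single input, but together they are enough.
total = check_total_count([1, 1, 1, 1])

# A single process with a single value is still rejected.
try:
    check_total_count([1])
    single_rejected = False
except ValueError:
    single_rejected = True
```
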
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37133

Differential Revision: D21331120

Pulled By: zhaojuanmao

fbshipit-source-id: ef3d1937990006609cfe4a68a64d90276c5085f2
2020-05-01 18:01:30 -07:00
gzygzy9211
ab2a9ab925 Non-blocking SyncBatchNorm update (#36659)
Summary:
As shown in https://github.com/pytorch/pytorch/issues/36452 , SyncBatchNorm can block the host thread due to the ``MemcpyDtoH`` and ``MemcpyHtoD`` when dealing with the argument ``counts`` for the native function ``batch_norm_gather_stats_with_counts``.

- This fix changes the signature of ``batch_norm_gather_stats_with_counts`` to
```c++
std::tuple<Tensor, Tensor> batch_norm_gather_stats_with_counts_cuda(const Tensor& self, const Tensor& mean, const Tensor& invstd, const Tensor& running_mean, const Tensor& running_var, double momentum, double epsilon, const Tensor& counts)
```
so it can directly receive "counts" in a ``CUDATensor`` rather than ``IntArrayRef`` whose data is in host memory.

- This fix also improves the implementation of the ``SyncBatchNorm`` function so that constructing the ``counts`` tensor does not cause an additional ``MemcpyHtoD``, which would block the host thread as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36659

Differential Revision: D21196991

Pulled By: ngimel

fbshipit-source-id: 84a529e6cf22e03618fecbb8f070ec452f81229e
2020-04-23 10:22:19 -07:00
Jie
289d52c120 Fixing SyncBN dgrad (#36382)
Summary:
The previous PR https://github.com/pytorch/pytorch/issues/22248 , which added support for variadic batch sizes across processes, doesn't account for mean_dy/mean_dy_xmu on the backward path, which produces a wrong dgrad.
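
A small numeric sketch of why per-process batch sizes matter for `mean_dy`, under the assumption that the correct value is the gradient sum over all processes divided by the total count. Both function names are invented for this illustration, and the numbers are made up.

```python
from typing import Sequence


def global_mean_dy(sum_dy: Sequence[float], counts: Sequence[int]) -> float:
    """Sum of gradients over all processes divided by the total element count."""
    return sum(sum_dy) / sum(counts)


def naive_mean_dy(sum_dy: Sequence[float], counts: Sequence[int]) -> float:
    """Average of per-process means; only right when all counts are equal."""
    local_means = [s / n for s, n in zip(sum_dy, counts)]
    return sum(local_means) / len(local_means)


# Process 0 holds 3 elements (dy sums to 3.0), process 1 holds 1 element (dy = 5.0).
sum_dy = [3.0, 5.0]
counts = [3, 1]

correct = global_mean_dy(sum_dy, counts)  # 8.0 / 4
naive = naive_mean_dy(sum_dy, counts)     # (1.0 + 5.0) / 2
```

With equal counts the two agree; with unequal counts the naive average over-weights the smaller process.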
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36382

Differential Revision: D20984446

Pulled By: ngimel

fbshipit-source-id: 80066eee83760b275d61e2cdd4e86facca5577fd
2020-04-13 21:08:31 -07:00
Xiao Wang
c1dd70688a Fix deprecated python "add" calls (#33428)
Summary:
This PR fixes the Python `add` calls that use the deprecated signature `add(Scalar, Tensor)`. The alternative signature `add(Tensor, alpha=Scalar)` is used instead.

cc csarofeen zasdfgbnm ptrblck ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33428

Differential Revision: D20002534

Pulled By: vincentqb

fbshipit-source-id: 81f2dd6170a47a9b53a17e5817c26e70d8afa130
2020-02-26 09:02:31 -08:00
Brian Wignall
f326045b37 Fix typos, via a Levenshtein-type corrector (#31523)
Summary:
Should be non-semantic.

Uses https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines to find likely typos, with https://github.com/bwignall/typochecker to help automate the checking.

Uses an updated version of the tool used in https://github.com/pytorch/pytorch/pull/30606 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31523

Differential Revision: D19216749

Pulled By: mrshenli

fbshipit-source-id: 7fd489cb9a77cd7e4950c1046f925d57524960ea
2020-01-17 16:03:19 -08:00
jiej
9c7e604c60 SyncBatchNorm Update on input dimension checks (#29626)
Summary:
Update the requirements on input dimensions for `torch.nn.SyncBatchNorm`:
1. 2D inputs are now permissible, https://github.com/pytorch/pytorch/issues/20204 ;
2. requires at least two elements along the normalization plane (BatchNorm behavior);
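
A minimal sketch of the two rules above on a plain shape tuple, assuming the channel dimension is dim 1 and the normalization plane is everything else. `check_sync_bn_input` and its messages are hypothetical, not the PyTorch implementation.

```python
from typing import Sequence


def check_sync_bn_input(shape: Sequence[int], training: bool = True) -> int:
    """Return the number of values per channel, or raise if the shape is invalid."""
    if len(shape) < 2:
        raise ValueError(f"expected at least 2D input, got {len(shape)}D input")
    # Values per channel: batch size times all dims after the channel dim.
    values_per_channel = shape[0]
    for dim in shape[2:]:
        values_per_channel *= dim
    if training and values_per_channel < 2:
        raise ValueError("expected at least two elements per channel when training")
    return values_per_channel


ok_2d = check_sync_bn_input((4, 10))        # 2D input is accepted
ok_4d = check_sync_bn_input((1, 10, 3, 3))  # batch of 1, but 9 values per channel

try:
    check_sync_bn_input((1, 10))            # a single value per channel
    rejected = False
except ValueError:
    rejected = True
```
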
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29626

Differential Revision: D18492531

Pulled By: albanD

fbshipit-source-id: f008e46a2d520d73c3c2730890a7424eba2ede9e
2019-11-18 16:09:51 -08:00
Gregory Chanan
23fde77d3d Remove Module._backend as it's not used anymore.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/25342

Test Plan: Imported from OSS

Differential Revision: D17101571

Pulled By: gchanan

fbshipit-source-id: 2cda46fe197e26a1cacb5e912f535809973d306e
2019-08-29 15:43:49 -07:00
root
8640aef505 Add support for non-affine batch norm with float stats and half inputs (#22750)
Summary:
This PR creates support for non-affine batch norm with float running estimates and half inputs.
Changes were made similar to https://github.com/pytorch/pytorch/issues/16735.

I couldn't find a specific test for `SyncBatchNorm`, so I used [this code](https://gist.github.com/ptrblck/ab45bfcde6df55ac28a7be18531f4718) to test it.

cc ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22750

Differential Revision: D17119965

Pulled By: ezyang

fbshipit-source-id: 2e8c5d63fc3c636b8a1338c43c9c101a0f5e9b22
2019-08-29 14:04:37 -07:00
Gregory Chanan
a8ae33ce27 Move autograd function for CrossMapLRN2d from being backend specific to modules/_functions. (#25339)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25339

This is to get rid of backend-specific dispatch in modules; this autograd function is no longer backend specific so
doesn't need to be in a backend specific location.

Test Plan: Imported from OSS

Differential Revision: D17101576

Pulled By: gchanan

fbshipit-source-id: f4f0bd3ecc2d4dbd8cdfedbaabcadb8c603d2507
2019-08-29 09:55:11 -07:00
Shuaipeng Li
29ec4769bb Fix SyncBatchNorm running var update issue (#22248)
Summary:
## Fix https://github.com/pytorch/pytorch/issues/22192

+ change signature: `func: batch_norm_gather_stats(Tensor input, Tensor mean, Tensor invstd, Tensor? running_mean, Tensor? running_var, float momentum, float eps, Tensor counts) -> (Tensor, Tensor)`
+ change the CUDA implementation & CUDA header
```cuda
std::tuple<Tensor, Tensor> batch_norm_gather_stats_cuda(const Tensor& self, const Tensor& mean, const Tensor& invstd, const Tensor& running_mean,
                                                        // before: const Tensor& running_var, double momentum, double epsilon, int64_t count)
                                                        const Tensor& running_var, double momentum, double epsilon, const Tensor& counts)
```
+ change python interface
```python
class SyncBatchNorm(Function):
    def forward(self, input, weight, bias, running_mean, running_var, eps, momentum, process_group, world_size):
        ...
```
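
To illustrate why per-process `counts` are needed rather than a single `count`, here is a sketch of merging per-process statistics into global ones. It works on variances instead of `invstd` for readability, and `merge_stats` is a hypothetical helper, not the CUDA kernel changed here.

```python
from statistics import mean, pvariance
from typing import List, Sequence, Tuple


def merge_stats(
    means: Sequence[float], variances: Sequence[float], counts: Sequence[int]
) -> Tuple[float, float, int]:
    """Combine per-process mean and biased variance, weighting by counts."""
    total = sum(counts)
    merged_mean = sum(n * m for n, m in zip(counts, means)) / total
    # E[x^2] on each process is var_i + mean_i^2; weight it by that process's count.
    ex2 = sum(n * (v + m * m) for n, v, m in zip(counts, variances, means)) / total
    return merged_mean, ex2 - merged_mean * merged_mean, total


# Two processes with different batch sizes (made-up data).
shards: List[List[float]] = [[1.0, 2.0, 3.0], [10.0]]
means = [mean(s) for s in shards]
variances = [pvariance(s) for s in shards]
counts = [len(s) for s in shards]

merged_mean, merged_var, total = merge_stats(means, variances, counts)

# Reference values computed directly on the concatenated data.
flat = [x for s in shards for x in s]
reference = (mean(flat), pvariance(flat))
```
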
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22248

Differential Revision: D16002146

Pulled By: mrshenli

fbshipit-source-id: 9007e83928267b89df4d3847aabfbdb63e456956
2019-07-03 17:17:59 -07:00
jiej
39669316a6 (#14267)
Summary:
- Summary:

Added synchronized batch normalization, which allows synchronization of stats across mini-batches between processes within a process group.
The current implementation uses a mixture of extended ATen native functions (cpp cuda extension) + torch.nn.modules (c10d python API)

- User-facing api:

1. torch.nn.utils.convert_sync_batchnorm(modules, process_group=None)

2. torch.nn.SyncBatchNorm(num_features, eps=1e-5, momentum=0.1, affine=True, track_running_stats=True, ***process_group=None***)

- supported use case:
DistributedDataParallel with ***single-gpu multi-process***

a. User creates model containing `torch.nn.SyncBatchNorm` layers through one of the ways listed below:

  1. use layers directly:

     torch.nn.SyncBatchNorm(...)

     similar API as with torch.nn.BatchNormXd(...)
     with added argument `process_group` which is used to limit the scope of
     synchronization within each process group. Default value is None, which
     implies synchronization across all GPUs

  2. use torch.nn.utils.convert_sync_batchnorm(modules, process_group)

     recursively convert all `torch.nn.BatchNormXd` into `torch.nn.SyncBatchNorm`
     preserving values of parameters/buffers.
     the utility function also allows user to specify process_group value to all
     converted layers.

b. User wraps their model with
   `torch.nn.parallel.DistributedDataParallel`; from this point, user
   should follow the general DDP usage guidelines

- Error checking

For use cases not supported, we error out:

1. Application launched without ddp:
   > import torch
   > sbn = torch.nn.SyncBatchNorm(10).cuda()
   > inp = torch.randn(5, 10, 3, 3).cuda()
   > sbn(inp) --> Error!
   > AttributeError: SyncBatchNorm is only supported within torch.nn.parallel.DistributedDataParallel

2. Application launched using DDP with multi-GPU per-process:
   > ddp_module = nn.parallel.DistributedDataParallel(module, device_ids=device_ids, output_device=args.local_rank)
   > ValueError: SyncBatchNorm is only supported for DDP with single GPU per process
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14267

Differential Revision: D14270035

Pulled By: ezyang

fbshipit-source-id: 4956d8fa565c32e9df5408d53719ff9f945f4d6d
2019-03-06 13:39:11 -08:00