Thomas Viehmann 685224aa14 Add CTC loss (#9628)
Summary:
The CPU and CUDA variants are a direct transposition of Graves et al.'s description of the algorithm, with the modification that it works in log space.
There is also a binding for the (much faster) CuDNN implementation.
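
For reference, a minimal unbatched Python sketch of that log-space forward (alpha) recursion; the function name and the single-sequence shapes are purely illustrative, the real kernels are batched and also compute the betas and the gradients:

```
import torch

def ctc_neg_log_likelihood(log_probs, target, blank=0):
    # log_probs: (T, C) log-softmax outputs, target: (S,) label indices.
    T, C = log_probs.shape
    # Extended target: blanks interleaved between (and around) the labels.
    ext = torch.full((2 * len(target) + 1,), blank, dtype=torch.long)
    ext[1::2] = target
    L = len(ext)

    alpha = torch.full((T, L), float('-inf'))
    alpha[0, 0] = log_probs[0, blank]
    if L > 1:
        alpha[0, 1] = log_probs[0, ext[1]]

    for t in range(1, T):
        for s in range(L):
            # Allowed predecessors: same state, previous state, and the state
            # two back unless that would skip a blank or a repeated label.
            terms = [alpha[t - 1, s]]
            if s > 0:
                terms.append(alpha[t - 1, s - 1])
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[t - 1, s - 2])
            alpha[t, s] = torch.logsumexp(torch.stack(terms), 0) + log_probs[t, ext[s]]

    # Sum (in log space) over the two valid end states: last label or trailing blank.
    return -torch.logsumexp(alpha[T - 1, -2:], 0)
```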

This could eventually fix #3420

I still need to add tests (TestNN seems much more elaborate than the other testing) and fix the bugs that invariably turn up during testing. Also, I want to add some more code comments.

I could use feedback on all sorts of things, including:
- Type handling (cuda vs. cpu for the int tensors, dtype for the int tensors)
- Input convention. I use log probs because that is what the gradients are for (see the sketch after this list).
- Launch parameters for the kernels
- Errors and omissions and anything else I'm not even aware of.
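
To make the input convention concrete, a minimal sketch; the shapes and the `torch._ctc_loss` call mirror the CUDA path of the benchmark script further down, and the variable names are just illustrative:

```
import torch

T, N, C = 50, 16, 11   # input length, batch size, classes including the blank
S = 30                 # target length
BLANK = 0

# The loss consumes log-probabilities (log_softmax over the class dimension),
# because that is the quantity the gradients are computed for.
log_probs = torch.log_softmax(torch.randn(T, N, C), 2).detach().requires_grad_()
targets = torch.randint(1, C, (N, S), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

loss, log_alpha = torch._ctc_loss(log_probs, targets, input_lengths, target_lengths, BLANK)
```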

Thank you for looking!

In terms of performance, it looks superficially comparable to WarpCTC (but I have not systematically investigated this).
I have read that CuDNN is much faster than other implementations because it does *not* use log space, but its gathering step is also much, much faster (I avoided trying tricky things there, as they seem to contribute to warpctc's fragility). I might think some more about which existing torch function (scatter or index..) I could learn from for that step.
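
For that gathering step, a rough sketch of the kind of batched `torch.gather` I mean (illustrative shapes only, not kernel code):

```
import torch

T, N, C, S = 50, 16, 11, 30
BLANK = 0
log_probs = torch.log_softmax(torch.randn(T, N, C), 2)   # (T, N, C)
targets = torch.randint(1, C, (N, S), dtype=torch.long)

# Extended targets: blank, l1, blank, l2, ..., blank  ->  (N, 2S+1)
ext = torch.full((N, 2 * S + 1), BLANK, dtype=torch.long)
ext[:, 1::2] = targets

# Fetch log p(ext[s] | t) for every (t, n, s) in one gather instead of
# indexing label by label inside the recursion.
idx = ext.unsqueeze(0).expand(T, -1, -1)                 # (T, N, 2S+1)
log_probs_ext = log_probs.gather(2, idx)                 # (T, N, 2S+1)
```
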
Average kernel timings from nvprof for one example size:

```
CuDNN:
60.464us compute_alphas_and_betas
16.755us compute_grads_deterministic
Cuda:
121.06us ctc_loss_backward_collect_gpu_kernel (= grads)
109.88us ctc_loss_gpu_kernel (= alphas)
98.517us ctc_loss_backward_betas_gpu_kernel (= betas)
WarpCTC:
299.74us compute_betas_and_grad_kernel
66.977us compute_alpha_kernel
```

Of course, I still have the (silly) outer block loop rather than computing consecutive `s` in each thread, which I might change, and there are a few other places where one could look for better implementations.

Finally, it might not be unreasonable to start with these implementations: the performance of the loss has to be seen in the context of the entire training computation, which would likely dilute the relative speedup considerably.
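
To put rough, entirely hypothetical numbers on that dilution:

```
# Hypothetical figures: if the CTC loss accounts for 5% of an iteration,
# making the loss itself 3x faster only speeds up the iteration by ~3.4%.
loss_fraction = 0.05
loss_speedup = 3.0
overall = 1.0 / ((1.0 - loss_fraction) + loss_fraction / loss_speedup)
print(overall)   # ~1.034
```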

My performance measurement script:
```
import timeit
import sys
import torch
num_labels = 10
target_length = 30
input_length = 50
eps = 1e-5
BLANK = 0  # alternatively, num_labels could be used as the blank index
batch_size = 16

torch.manual_seed(5)
activations = torch.randn(input_length, batch_size, num_labels + 1)
log_probs = torch.log_softmax(activations, 2)
probs = torch.exp(log_probs)
targets = torch.randint(1, num_labels+1, (batch_size * target_length,), dtype=torch.long)
targets_2d = targets.view(batch_size, target_length)
target_lengths = torch.tensor(batch_size*[target_length])
input_lengths = torch.tensor(batch_size*[input_length])
activations = log_probs.detach()  # reuse the detached log-probs as the activations handed to warpctc

def time_cuda_ctc_loss(grout, *args):
    # Forward + backward through the new CUDA kernels.
    torch.cuda.synchronize()
    culo, culog_alpha = torch._ctc_loss(*args)
    g, = torch.autograd.grad(culo, args[0], grout)
    torch.cuda.synchronize()

def time_cudnn_ctc_loss(grout, *args):
    # Forward + backward through the CuDNN binding.
    torch.cuda.synchronize()
    culo, cugra = torch._cudnn_ctc_loss(*args)
    g, = torch.autograd.grad(culo, args[0], grout)
    torch.cuda.synchronize()

def time_warp_ctc_loss(grout, *args):
    # Forward + backward through warpctc for comparison.
    torch.cuda.synchronize()
    culo = warpctc.ctc_loss(*args, blank_label=BLANK, size_average=False, length_average=False, reduce=False)
    g, = torch.autograd.grad(culo, args[0], grout)
    torch.cuda.synchronize()

if sys.argv[1] == 'cuda':
    lpcu = log_probs.float().cuda().detach().requires_grad_()
    args = [lpcu, targets_2d.cuda(), input_lengths.cuda(), target_lengths.cuda(), BLANK]
    grout = lpcu.new_ones((batch_size,))
    torch.cuda.synchronize()
    print(timeit.repeat("time_cuda_ctc_loss(grout, *args)", number=1000, globals=globals()))
elif sys.argv[1] == 'cudnn':
    lpcu = log_probs.float().cuda().detach().requires_grad_()
    args = [lpcu, targets.int(), input_lengths.int(), target_lengths.int(), BLANK, True]
    grout = lpcu.new_ones((batch_size,))
    torch.cuda.synchronize()
    print(timeit.repeat("time_cudnn_ctc_loss(grout, *args)", number=1000, globals=globals()))
elif sys.argv[1] == 'warpctc':
    import warpctc
    activations = activations.cuda().detach().requires_grad_()
    args = [activations, input_lengths.int(), targets.int(), target_lengths.int()]
    grout = activations.new_ones((batch_size,), device='cpu')
    torch.cuda.synchronize()

    print(timeit.repeat("time_warp_ctc_loss(grout, *args)", number=1000, globals=globals()))
```
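
Appended to the script above (so it reuses the same tensors and argument conventions), one could also cross-check the two GPU paths directly; this is just a sketch, the tolerance is arbitrary and it was not part of the timing runs:

```
lp = log_probs.float().cuda().detach().requires_grad_()
loss_cuda, _ = torch._ctc_loss(
    lp, targets_2d.cuda(), input_lengths.cuda(), target_lengths.cuda(), BLANK)
loss_cudnn, _ = torch._cudnn_ctc_loss(
    lp, targets.int(), input_lengths.int(), target_lengths.int(), BLANK, True)
print(torch.allclose(loss_cuda, loss_cudnn, atol=1e-4))
```
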
I'll also link to a notebook that I used for writing up the algorithm in simple form and for testing the implementations against it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9628

Differential Revision: D8952453

Pulled By: ezyang

fbshipit-source-id: 18e073f40c2d01a7c96c1cdd41f6c70a06e35860