Summary:
The CPU and CUDA variants are a direct transposition of Graves et al.'s description of the algorithm, with the modification that it works in log space. There is also a binding for the (much faster) CuDNN implementation. This could eventually fix #3420

I still need to add tests (TestNN seems much more elaborate than the other testing) and fix the bugs that invariably turn up during testing. Also, I want to add some more code comments.

I could use feedback on all sorts of things, including:
- Type handling (cuda vs. cpu for the int tensors, dtype for the int tensors)
- Input convention. I use log probs because that is what the gradients are for.
- Launch parameters for the kernels
- Errors, omissions, and anything else I'm not even aware of.

Thank you for looking!

In terms of performance, it looks superficially comparable to WarpCTC (and thus slower than CuDNN, but I have not systematically investigated this). I have read that CuDNN is much faster than log-space implementations because it does *not* use log space, but also because its gathering step is much faster (I avoided trying tricky things there, as they seem to contribute to WarpCTC's fragility). I might think some more about which existing torch functions (scatter or index...) I could learn from for that step.

Average kernel timings from nvprof for one problem size:
```
CuDNN:
 60.464us  compute_alphas_and_betas
 16.755us  compute_grads_deterministic
Cuda:
121.06us   ctc_loss_backward_collect_gpu_kernel (= grads)
109.88us   ctc_loss_gpu_kernel (= alphas)
 98.517us  ctc_loss_backward_betas_gpu_kernel (= betas)
WarpCTC:
299.74us   compute_betas_and_grad_kernel
 66.977us  compute_alpha_kernel
```

Of course, I still have the (silly) outer blocks loop rather than computing consecutive `s` in each thread, which I might change, and there are a few other places where one could look for better implementations. Finally, it might not be unreasonable to start with these implementations: the performance of the loss has to be seen in the context of the entire training computation, which would likely dilute the relative speedup considerably.
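For context, here is a minimal, unoptimized sketch of the log-space alpha recursion from Graves et al. that the kernels transpose. This is an illustration only, not the code in this PR: the `ctc_alpha_loss` name is mine, it handles a single sequence with a non-empty target, and it uses `torch.logaddexp` (which postdates this PR) for the log-space additions.

```python
import torch

def ctc_alpha_loss(log_probs, target, blank=0):
    # log_probs: (T, C), the output of log_softmax for one sequence
    # target: (S,) label indices, all different from `blank`
    T, C = log_probs.shape
    # extended target: blank, t_1, blank, t_2, ..., t_S, blank
    ext = torch.full((2 * target.numel() + 1,), blank, dtype=torch.long)
    ext[1::2] = target
    S = ext.numel()
    alpha = torch.full((S,), -float("inf"))
    alpha[0] = log_probs[0, ext[0]]  # start with the leading blank...
    alpha[1] = log_probs[0, ext[1]]  # ...or directly with the first label
    for t in range(1, T):
        prev = alpha
        alpha = prev.clone()
        # stay at position s, or advance from s - 1
        alpha[1:] = torch.logaddexp(alpha[1:], prev[:-1])
        # additionally skip a blank from s - 2 when the neighbouring labels differ
        skip = (ext[2:] != blank) & (ext[2:] != ext[:-2])
        alpha[2:][skip] = torch.logaddexp(alpha[2:][skip], prev[:-2][skip])
        alpha = alpha + log_probs[t, ext]
    # valid alignments end in the last label or the trailing blank
    return -torch.logaddexp(alpha[-1], alpha[-2])
```

Roughly speaking, the kernels parallelize this recursion over `s` and over the batch; the scalar version above only serves as a readable reference.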
My performance measurement script:
```python
import timeit
import sys
import torch

num_labels = 10
target_length = 30
input_length = 50
eps = 1e-5
BLANK = 0  # num_labels
batch_size = 16

torch.manual_seed(5)
activations = torch.randn(input_length, batch_size, num_labels + 1)
log_probs = torch.log_softmax(activations, 2)
probs = torch.exp(log_probs)
targets = torch.randint(1, num_labels + 1, (batch_size * target_length,), dtype=torch.long)
targets_2d = targets.view(batch_size, target_length)
target_lengths = torch.tensor(batch_size * [target_length])
input_lengths = torch.tensor(batch_size * [input_length])
activations = log_probs.detach()

def time_cuda_ctc_loss(grout, *args):
    torch.cuda.synchronize()
    culo, culog_alpha = torch._ctc_loss(*args)
    g, = torch.autograd.grad(culo, args[0], grout)
    torch.cuda.synchronize()

def time_cudnn_ctc_loss(grout, *args):
    torch.cuda.synchronize()
    culo, cugra = torch._cudnn_ctc_loss(*args)
    g, = torch.autograd.grad(culo, args[0], grout)
    torch.cuda.synchronize()

def time_warp_ctc_loss(grout, *args):
    torch.cuda.synchronize()
    culo = warpctc.ctc_loss(*args, blank_label=BLANK, size_average=False,
                            length_average=False, reduce=False)
    g, = torch.autograd.grad(culo, args[0], grout)
    torch.cuda.synchronize()

if sys.argv[1] == 'cuda':
    lpcu = log_probs.float().cuda().detach().requires_grad_()
    args = [lpcu, targets_2d.cuda(), input_lengths.cuda(), target_lengths.cuda(), BLANK]
    grout = lpcu.new_ones((batch_size,))
    torch.cuda.synchronize()
    print(timeit.repeat("time_cuda_ctc_loss(grout, *args)", number=1000, globals=globals()))
elif sys.argv[1] == 'cudnn':
    lpcu = log_probs.float().cuda().detach().requires_grad_()
    args = [lpcu, targets.int(), input_lengths.int(), target_lengths.int(), BLANK, True]
    grout = lpcu.new_ones((batch_size,))
    torch.cuda.synchronize()
    print(timeit.repeat("time_cudnn_ctc_loss(grout, *args)", number=1000, globals=globals()))
elif sys.argv[1] == 'warpctc':
    import warpctc
    activations = activations.cuda().detach().requires_grad_()
    args = [activations, input_lengths.int(), targets.int(), target_lengths.int()]
    grout = activations.new_ones((batch_size,), device='cpu')
    torch.cuda.synchronize()
    print(timeit.repeat("time_warp_ctc_loss(grout, *args)", number=1000, globals=globals()))
```
I'll also link to a notebook that I used for writing up the algorithm in simple form and then testing the implementations against it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/9628
Differential Revision: D8952453
Pulled By: ezyang
fbshipit-source-id: 18e073f40c2d01a7c96c1cdd41f6c70a06e35860
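As a rough illustration of the kind of consistency check such a notebook would contain, one could compare `torch._ctc_loss` (the private helper the script above times) against the reference recursion sketched earlier; the `ctc_alpha_loss` name is mine, not part of the PR.

```python
import torch

torch.manual_seed(5)
lp = torch.randn(50, 1, 11).log_softmax(2)  # (T, batch=1, num_labels + 1)
tgt = torch.randint(1, 11, (1, 30))         # (batch=1, target_length)

# batched loss (and log-alpha table) from the new kernels
loss, _log_alpha = torch._ctc_loss(lp, tgt, torch.tensor([50]), torch.tensor([30]), 0)

# scalar reference recursion from the sketch above
ref = ctc_alpha_loss(lp[:, 0], tgt[0], blank=0)
print((loss[0] - ref).abs().item())         # expected to be ~0
```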