pytorch/torch/csrc/cuda/Tensor.cpp
xiaobing.zhang c2c835dd95 Port sigmoid backward to Aten(CPU+CUDA) (#29185)
Summary:
VitalyFedyunin, this PR ports sigmoid backward to ATen:
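For background, sigmoid backward computes grad_input = grad_output * output * (1 - output), reusing the saved forward output instead of recomputing sigmoid. A minimal Python sketch of that formula (illustration only, not the ported ATen kernel; the helper name is made up):
```
import torch

# Hypothetical reference implementation of the sigmoid backward formula.
def sigmoid_backward_reference(grad_output, output):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)), expressed via the
    # forward output so sigmoid is not recomputed.
    return grad_output * output * (1 - output)

x = torch.randn(4, requires_grad=True)
y = x.sigmoid()
y.sum().backward()  # grad_output is all ones here
assert torch.allclose(x.grad, sigmoid_backward_reference(torch.ones_like(y), y))
```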
**Test script:**
```
import torch
import time

torch.manual_seed(0)

def _time():
    # CUDA kernels launch asynchronously; synchronize so the timestamp
    # reflects completed GPU work.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.time()

device = "cpu"
if torch.cuda.is_available():
    device = "cuda"

# warm up
for n in [100, 10000]:
    input = torch.randn(128, n, requires_grad=True, device=device)
    for i in range(1000):
        output = input.sigmoid().sum()
        output.backward()

# get running time
for n in [100, 10000]:
    bwd_t = 0
    input = torch.randn(128, n, requires_grad=True, device=device)
    for i in range(10000):
        output = input.sigmoid().sum()
        t1 = _time()
        output.backward()
        t2 = _time()
        bwd_t = bwd_t + (t2 - t1)
    bwd_avg = bwd_t / 10000 * 1000
    print("input size(128, %d), backwad avg time is %.2f (ms)." % (n, bwd_avg))
```
Test Device: CPU: skx-8280, GPU: Tesla P40

**Performance:**
Before:
```
GPU:
input size(128, 100), backward avg time is 0.14 (ms).
input size(128, 10000), backward avg time is 0.17 (ms).
CPU:
OMP_NUM_THREADS=56
input size(128, 100), backward avg time is 0.06 (ms).
input size(128, 10000), backward avg time is 4.21 (ms).
OMP_NUM_THREADS=1
input size(128, 100), backward avg time is 0.06 (ms).
input size(128, 10000), backward avg time is 2.30 (ms).
```
After:
```
GPU:
input size(128, 100), backward avg time is 0.14 (ms).
input size(128, 10000), backward avg time is 0.17 (ms).
CPU:
OMP_NUM_THREADS=56
input size(128, 100), backward avg time is 0.05 (ms).
input size(128, 10000), backward avg time is 0.48 (ms).
OMP_NUM_THREADS=1
input size(128, 100), backward avg time is 0.04 (ms).
input size(128, 10000), backward avg time is 0.86 (ms).
```
How to set the number of threads? Use the following script (saved as run.sh):
```
num_threads=$1
script=$2
last_core=`expr $num_threads - 1`
echo "using $num_threads OMP threads"
echo "bind cores to 0~$last_core"
export OMP_NUM_THREADS=$num_threads
export KMP_AFFINITY=granularity=fine,compact,1,0
numactl --physcpubind=0-$last_core --membind=0 python $script
```
Then run **./run.sh num_threads test.py**.
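To confirm the settings took effect, you can query PyTorch from inside the benchmark process (these are standard PyTorch APIs; the exact output depends on your build):
```
import torch

# Intra-op thread count actually used by PyTorch.
print(torch.get_num_threads())
# Build/runtime parallelism details, including OpenMP settings.
print(torch.__config__.parallel_info())
```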
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29185

Differential Revision: D18587352

Pulled By: VitalyFedyunin

fbshipit-source-id: 8167ca261960399f795d35a83fa8c4be365bc4da
2019-11-20 07:31:42 -08:00


#define __STDC_FORMAT_MACROS
#include <torch/csrc/python_headers.h>
#include <structmember.h>
#include <vector>
#include <stack>
#include <tuple>
#include <torch/csrc/cuda/THCP.h>

// override_macros.h remaps the generic TH macros onto their THC (CUDA)
// counterparts so the generic Tensor implementation is instantiated for
// CUDA types; undef_macros.h and restore_macros.h below undo the
// remapping once the generic code has been included.
#include <torch/csrc/cuda/override_macros.h>
#include <torch/csrc/copy_utils.h>
#include <torch/csrc/DynamicTypes.h>
#include <torch/csrc/utils/tensor_numpy.h>

// Build-time marker: the setup scripts substitute the generic Tensor
// implementation here.
//generic_include THC torch/csrc/generic/Tensor.cpp

#include <torch/csrc/cuda/undef_macros.h>
#include <torch/csrc/cuda/restore_macros.h>