Commit Graph

70 Commits

Author SHA1 Message Date
Syed Tousif Ahmed
4dab208d97 Adds Issue#153109 as a test for CUDAPluggableAllocator (#163575)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163575
Approved by: https://github.com/ngimel
2025-10-01 09:07:48 +00:00
Nikita Shulga
02cd4dbcf4 [BE][CI] Get rid of duplicated code (#131406)
Followup after https://github.com/pytorch/pytorch/pull/131061: define a `run_if_exists` function that runs a cpp test if it exists and prints a warning otherwise.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131406
Approved by: https://github.com/kit1980, https://github.com/atalman
2024-07-23 04:01:13 +00:00
Syed Tousif Ahmed
1f961ad495 Runs aten cuda cpp tests in CI (#131061)
It seems like these tests were never run, because https://github.com/pytorch/pytorch/pull/99956 got rid of the `pushd $1` that made the if conditions true in CUDA builds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131061
Approved by: https://github.com/malfet, https://github.com/eqy
2024-07-19 12:35:33 +00:00
Huy Do
35834a405c Run C++ tests on CI with run_test.py (#99956)
After https://github.com/pytorch/pytorch/pull/99559, we can now run C++ tests with `run_test.py`.  Although advanced features such as `--import-slow-tests` and `--import-disabled-tests` won't work for now, there is still a gain in reliability and performance, as C++ tests can now be retried and run in parallel.

This covers all C++ tests in the CI, including aten, libtorch, and Vulkan C++ tests, across all platforms: Linux, Windows, and macOS.

Notes:
* To support C++ test discovery, the env variable `CPP_TESTS_DIR` can be set to where the C++ test binaries are located
* Support the pytest `-k` argument via run_test, as this is used by pytest-cpp to replace `--gtest-filter`
* The XML output is in pytest format, but that's ok for now because we don't have slow test or flaky test support for C++ tests yet
* ~~I need to figure out why conftest.py doesn't work when I invoke pytest directly for C++ tests, so `--sc` is not available for C++ tests at the moment.  Proper pytest plugins like stepwise work fine though.  I'll investigate and fix it in a separate PR~~ Found the cause: `conftest.py` is per directory and needs to be in any directory that holds C++ tests
* Two tests, `test_api` and `test_tensorexpr`, timed out on ASAN. I suspect that ASAN is now used on top of the python executable, which is slower than running native C++ code. IMO, it's ok to run these tests as before on ASAN for now
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99956
Approved by: https://github.com/clee2000, https://github.com/ZainRizvi
2023-05-09 21:24:12 +00:00
Richard Zou
4b1053497c [vmap] Prepend "legacy" to files for old vmap implementation (#90324)
We have an older torch.vmap implementation. It is no longer supported.
It still needs to exist somewhere for the sake of BC with
torch.autograd.functional.

This PR makes it clear what files are meant for implementing the old
vmap implementation. I've seen a couple of PRs recently adding support
for the old vmap implementation, so this will lessen the confusion.

Test Plan:
- CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90324
Approved by: https://github.com/samdow
2022-12-07 18:46:15 +00:00
Mikayla Gawarecki
3024bcfff5 Add cuda_atomic_ops_test to run_tests.sh
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74482

Approved by: https://github.com/cpuhrsch
2022-03-24 17:12:35 +00:00
Nikita Shulga
77beccaedb Do not build PyTorch with caffe2 by default (#66658)
Summary:
CAFFE2 has been deprecated for a while, but still included in every PyTorch build.
We should stop building it by default, although CI should still validate that caffe2 code is buildable.

Build even fewer dependencies when compiling mobile builds without Caffe2
Introduce `TEST_CAFFE2` in torch.common.utils
Skip `TestQuantizedEmbeddingOps` and `TestJit.test_old_models_bc` if code is compiled without Caffe2
Should be landed after https://github.com/pytorch/builder/pull/864

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66658

Reviewed By: driazati, seemethere, janeyx99

Differential Revision: D31669156

Pulled By: malfet

fbshipit-source-id: 1cc45e2d402daf913a4685eb9f841cc3863e458d
2021-10-21 20:32:47 -07:00
Alex Suhan
b176feec1e Add device and key for lazy tensors (#61621)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61621

Test Plan: CI

Reviewed By: mruberry

Differential Revision: D29912934

Pulled By: asuhan

fbshipit-source-id: 493c32063a3e756d93cbf1d876563a35eaafb537
2021-07-26 23:00:22 -07:00
Richard Zou
970096b624 [Reland] Adds an aten::_ops namespace with unambiguous function names (#59018)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59018

Fixes #58044.

This PR:
- adds `ATEN_FN(op)` and `ATEN_FN2(op, overload)` macros that resolve to
a non-overloaded function in aten::_ops that calls the desired operator
(without default arguments).

The motivation for this is two-fold:
1) Using aten operators with templates is hard if the operator is
overloaded (e.g. add.Tensor and add.Scalar).
2) Method-only operators require special handling; pointers-to-method
are different from function pointers. `ATEN_FN2(add_, Tensor)` returns
a function instead of a method.

There is some interesting behavior for out= operations.
`ATEN_FN2(sin, "out")` gives a function that is *faithful* to the schema;
that is, the order of arguments is exactly what it looks like in the
schema. This makes it so that you can directly register
`ATEN_FN2(sin,"out")` (or a function wrapping it using the same signature)
as an override for a DispatchKey.
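For illustration, here is a hedged sketch of how these macros might be used (illustrative only — the helper `apply_binary` is a made-up name, and the exact macro expansion lives in the generated Operators.h):

```C++
#include <ATen/ATen.h>
#include <ATen/Operators.h>  // generated header defining ATEN_FN / ATEN_FN2

// A template that needs a non-overloaded callable; plain at::add would be
// ambiguous here because both add.Tensor and add.Scalar exist.
template <typename F>
at::Tensor apply_binary(F op, const at::Tensor& a, const at::Tensor& b) {
  // The resolved function has no default arguments, so alpha is explicit.
  return op(a, b, /*alpha=*/1);
}

at::Tensor demo(const at::Tensor& a, const at::Tensor& b) {
  return apply_binary(ATEN_FN2(add, Tensor), a, b);
}
```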

Test Plan:
- New tests that ATEN_FN2 works on function and method-only operators
- New test that ATEN_FN works
- New test that ATEN_FN macro returns a "faithful" function.

Codegen output:
Operators.h and Operators.cpp are both here:
https://gist.github.com/zou3519/c2c6a900410b571f0d7d127019ca5175

Reviewed By: bdhirsh

Differential Revision: D28721206

Pulled By: zou3519

fbshipit-source-id: a070017f98e8f4038cb0c64be315eef45d264217
2021-06-01 17:19:06 -07:00
Alice Ou
24508337f4 Revert D28643215: Adds an aten::_ops namespace with unambiguous function names
Test Plan: revert-hammer

Differential Revision:
D28643215 (28740869a1)

Original commit changeset: 7b2b8459f1b2

fbshipit-source-id: ea869bf4cfde7038087e990b2cff5a86f9e2a531
2021-05-26 12:35:34 -07:00
Richard Zou
28740869a1 Adds an aten::_ops namespace with unambiguous function names (#58092)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58092

Fixes #58044.

This PR:
- adds `ATEN_FN(op)` and `ATEN_FN2(op, overload)` macros that resolve to
a non-overloaded function in aten::_ops that calls the desired operator
(without default arguments).

The motivation for this is two-fold:
1) Using aten operators with templates is hard if the operator is
overloaded (e.g. add.Tensor and add.Scalar).
2) Method-only operators require special handling; pointers-to-method
are different from function pointers. `ATEN_FN2(add_, Tensor)` returns
a function instead of a method.

There is some interesting behavior for out= operations.
`ATEN_FN2(sin, "out")` gives a function that is *faithful* to the schema;
that is, the order of arguments is exactly what it looks like in the
schema. This makes it so that you can directly register
`ATEN_FN2(sin,"out")` (or a function wrapping it using the same signature)
as an override for a DispatchKey.

Test Plan:
- New tests that ATEN_FN2 works on function and method-only operators
- New test that ATEN_FN works
- New test that ATEN_FN macro returns a "faithful" function.

Codegen output:
Operators.h and Operators.cpp are both here:
https://gist.github.com/zou3519/c2c6a900410b571f0d7d127019ca5175

Reviewed By: mruberry

Differential Revision: D28643215

Pulled By: zou3519

fbshipit-source-id: 7b2b8459f1b2eb5ad01ee7b0d2bb77639f77940e
2021-05-26 07:29:15 -07:00
Xiang Gao
3de86b951d Migrate thrust->cub for index put (#55693)
Summary:
64-bit indexing is not supported, because if `num_indices = 2^31`, then 4 long tensors of `num_indices` elements would take 64 GB of RAM (4 tensors × 2^31 elements × 8 bytes each = 64 GiB). I don't think anybody will be interested in running `index_put` with 64 GB of GPU RAM spent on indices alone.

Benchmark on CUDA 11.3 RTX3090:
```python
import torch
import itertools

def run50_sync(f):
    for _ in range(50):
        f()
    torch.cuda.synchronize()

run50_sync(lambda: torch.randperm(1000000, device='cuda'))

def benchmark(M, L):
    a = torch.randn(M, device='cuda')
    i1 = torch.randint(M, (L,), dtype=torch.long, device='cuda')
    v = torch.randn(L, device='cuda')

    torch.cuda.synchronize()

    %timeit run50_sync(lambda:a.index_put_((i1,), v, True))

for M, L in itertools.product((100, 100000, 10000000), repeat=2):
    print(M, L)
    benchmark(M, L)
```

Before
```
100 100
5.13 ms ± 91 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
100 100000
30.2 ms ± 471 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
100 10000000
3.17 s ± 14.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
100000 100
5.19 ms ± 61.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
100000 100000
11.9 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
100000 10000000
712 ms ± 3.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
10000000 100
5.07 ms ± 66.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
10000000 100000
12.1 ms ± 76.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
10000000 10000000
627 ms ± 7.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

After
```
100 100
3.75 ms ± 49.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
100 100000
26.2 ms ± 154 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
100 10000000
2.81 s ± 23.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
100000 100
3.85 ms ± 16.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
100000 100000
9.74 ms ± 40.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
100000 10000000
444 ms ± 1.86 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
10000000 100
3.85 ms ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
10000000 100000
10.7 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
10000000 10000000
396 ms ± 2.63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55693

Reviewed By: albanD

Differential Revision: D27895967

Pulled By: ngimel

fbshipit-source-id: 0616ce33395ce46f1a4161dfd38940b8e54fedc2
2021-04-27 12:27:09 -07:00
Nikita Shulga
efeb988518 Suppress "ioctl points to uninitialised" check (#48187)
Summary:
libcuda.so from CUDA-11.1 makes an ioctl() call that valgrind's memcheck tool considers dangerous.
Instruct valgrind to suppress that check.

Fixes false positives reported in https://app.circleci.com/pipelines/github/pytorch/pytorch/240774/workflows/d4c66de8-f13b-47a2-ae62-2ec1bbe0664b/jobs/9026496

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48187

Reviewed By: janeyx99

Differential Revision: D25059850

Pulled By: malfet

fbshipit-source-id: 982df5860524482b0fcb2bfc6bb490fb06694cf6
2020-11-18 18:45:46 -08:00
Xiang Gao
3876889218 Remove LegacyComplex.h (#39834)
Summary:
All std::complex usage has been migrated to c10::complex
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39834

Differential Revision: D22001969

Pulled By: ezyang

fbshipit-source-id: 665a9198afde45a95309053b2f2381e123bf869a
2020-06-12 08:18:25 -07:00
Nikita Shulga
39d037253c Test PyTorch using python-3.8 + GCC-9 on Bionic (Reland) (#39121)
Summary:
Enable new test config in .circleci/config.yml
Skip scanning several 3rd-party packages to work around https://bugs.python.org/issue40350
Remove pre python-3.5 checks from `test.sh` and update `scikit-learn` to python-3.8 compatible version

This is a reland of https://github.com/pytorch/pytorch/pull/39030
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39121

Differential Revision: D21820375

Pulled By: malfet

fbshipit-source-id: d0be79b7d204cf692e055d42b9be42402dc4c1c0
2020-06-01 11:11:12 -07:00
Richard Zou
a3bab37d96 Add BatchedTensorImpl (#38424)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38424

On the way to adding initial vmap support, this is the implementation
for BatchedTensorImpl. Vmap (in future PRs) leverages Tensors backed by
BatchedTensorImpl to do its work.

For more context, here is an overview of the plan to add initial vmap support.
- [this PR] Add BatchedTensorImpl
- Add one or two batching rules
- Add vmap Python API
- Add "slow" for-loop fallbacks for out-of-place functions via
dispatcher fallback mechanism.
- Add batching rules for "view" functions
- Add "slow" for-loop fallbacks for in-place functions
- Miscellaneous handling for failure cases
- And more

Test Plan: - `./build/bin/vmap_test`

Differential Revision: D21640917

Pulled By: zou3519

fbshipit-source-id: 969490a838cf2099ed80104e7d51ee8ff069e168
2020-05-20 09:10:00 -07:00
Gao, Xiang
c5624e831d Add overloads of std:: math functions for c10::complex [resubmit] (#37468)
Summary:
This reverts commit d167a7f654.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37468

Differential Revision: D21305110

Pulled By: anjali411

fbshipit-source-id: d1bdc9d9feac00331fc2b2b905d49f80bef680f9
2020-04-30 10:20:45 -07:00
Nikita Shulga
6098cf7e33 Add sched_setaffinity check from libgomp to valgrind.sup (#37532)
Summary:
- It's valid to call `sched_setaffinity` with nullptr
- The call is coming from libgomp, which should be valgrind-safe
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37532

Test Plan: CI

Differential Revision: D21311252

Pulled By: malfet

fbshipit-source-id: a325f97741b997738c35759d02fcc34c1cb44d95
2020-04-29 14:48:23 -07:00
Lu Fang
d167a7f654 Revert D21256854: [pytorch][PR] Add overloads of std:: math functions for c10::complex
Test Plan: revert-hammer

Differential Revision:
D21256854

Original commit changeset: 2112ba6b7992

fbshipit-source-id: b81c377f9cd33a493a63d1e666cbe6765516fca8
2020-04-27 13:23:34 -07:00
Gao, Xiang
6d409481b3 Add overloads of std:: math functions for c10::complex (#35725)
Summary:
Issue: https://github.com/pytorch/pytorch/issues/35284

~This depends on and contains https://github.com/pytorch/pytorch/pull/35524. Please review after the dependency gets merged and I will rebase to get a clean diff.~

The implementation of most functions follow the pattern

```C++
template<typename T>
C10_HOST_DEVICE c10::complex<T> some_function(c10::complex<T> x) {
#if defined(__CUDACC__) || defined(__HIPCC__)
  return static_cast<c10::complex<T>>(thrust::some_function(static_cast<thrust::complex<T>>(x)));
#else
  return static_cast<c10::complex<T>>(std::some_function(static_cast<std::complex<T>>(x)));
#endif
}
```
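For a sense of what call sites gain, a hedged usage sketch (the header path and the `c10::` qualification of the overloads are assumptions; this is not the PR's code):

```C++
#include <c10/util/complex.h>  // assumed header for c10::complex

// exp(i*theta) via the overloaded exp; the same source compiles on the
// host (std::exp path) and on the device (thrust::exp path).
template <typename T>
C10_HOST_DEVICE c10::complex<T> cis(T theta) {
  return c10::exp(c10::complex<T>(T(0), theta));
}
```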
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35725

Differential Revision: D21256854

Pulled By: ezyang

fbshipit-source-id: 2112ba6b79923450feafd7ebdc7184a3eaecadb6
2020-04-27 10:32:16 -07:00
Xiang Gao
20328f67bb Add core of c10::complex [resubmit] (#36626)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36626

This reverts commit 9216c67c9e.

Test Plan: Imported from OSS

Differential Revision: D21140441

Pulled By: anjali411

fbshipit-source-id: 488530088e2ff87dc27e70d21ace88ff2967e7ab
2020-04-24 12:08:23 -07:00
Gao, Xiang
7c7cb74887 Add missing ${CMAKE_CURRENT_SOURCE_DIR}/complex_test.cpp (#37080)
Summary:
This test is never built in OSS CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37080

Differential Revision: D21179296

Pulled By: anjali411

fbshipit-source-id: 22a5b82f17676213c8ec51642bef35dc61f9cace
2020-04-22 19:22:59 -07:00
Mike Ruberry
9216c67c9e Revert D21021677: [pytorch][PR] Add core of c10::complex
Test Plan: revert-hammer

Differential Revision:
D21021677

Original commit changeset: 9e144e581fa4

fbshipit-source-id: ce6a88fc71ec0134d0fc6ecdddc4c4db35f89b1f
2020-04-14 13:58:24 -07:00
Xiang Gao
25252816cf Add core of c10::complex (#35524)
Summary:
Step 0 of https://github.com/pytorch/pytorch/issues/35284

Reference: https://en.cppreference.com/w/cpp/numeric/complex
We are targeting C++20. The differences across C++ versions are mostly `constexpr` qualifiers; newer versions have more functions declared as `constexpr`.

This PR adds the core of `c10::complex`, it includes
- standard constructors as in `std::complex`
- explicit conversion constructors converting from `std/thrust::complex` to `c10::complex`
- standard assignment operators as in `std::complex`
- conversion assignment operators converting from `std/thrust::complex` to `c10::complex`
- other standard operators as in `std::complex`
- standard methods as in `std::complex`
- explicit casting operators to std/thrust
- basic non-member functions as in `std::complex`:
  - arithmetic operators
  - `==`, `!=`
  - `<<`, `>>`
  - `std::real`, `std::imag`, `std::abs`, `std::arg`, `std::norm`, `std::conj`, `std::proj`, `std::polar`
    - Some of them are intentionally not completely implemented; these are marked as `TODO` and will be implemented in the future.

This PR does not include:
- overload of math functions

which will come in the next PR
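
A hedged sketch exercising the pieces listed above (the header path is an assumption; the rest follows the `std::complex`-style interface described in the list):

```C++
#include <c10/util/complex.h>  // assumed header path
#include <complex>

int main() {
  c10::complex<float> z(1.0f, 2.0f);               // standard constructor
  z += c10::complex<float>(0.5f, -1.0f);           // arithmetic operator
  float re = z.real(), im = z.imag();              // standard methods
  std::complex<float> s(3.0f, 4.0f);
  c10::complex<float> from_std(s);                 // explicit conversion ctor
  auto back = static_cast<std::complex<float>>(z); // explicit cast back out
  (void)re; (void)im; (void)from_std; (void)back;
}
```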
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35524

Differential Revision: D21021677

Pulled By: anjali411

fbshipit-source-id: 9e144e581fa4b2bee62d33adaf756ce5aadc0c71
2020-04-14 11:00:24 -07:00
Natalia Gimelshein
b6306e1517 Revert D20624698: [pytorch][PR] Make GPU loops support mutable lambda
Test Plan: revert-hammer

Differential Revision:
D20624698

Original commit changeset: 06e398779345

fbshipit-source-id: d17059c692b4b460f3aa8081bc80c296ddb88228
2020-03-24 14:42:40 -07:00
Xiang Gao
39a101d06e Make GPU loops support mutable lambda (#35015)
Summary:
I will need it for https://github.com/pytorch/pytorch/pull/34004

The `mutable` qualifier allows a lambda to capture some values and modify its own copy. This would be useful for random kernels: we capture an RNG `state`, initialize it on the first run, and the initialized state is used later:

```C++
gpu_kernel(iter, [state, initialized](scalar_t arg) mutable -> scalar_t {
  if (!initialized) {
    curand_init(..., state);
    initialized = true;
  }
  return some_math(curand_uniform(state), arg);
});
```

The `operator()` of a `mutable` lambda is not `const`, so we cannot pass the lambda as a constant reference. It cannot be called inside a non-`mutable` lambda either.

Example usage:

```C++
auto t = at::empty({4096}, kCUDA);
float thread_work_index_ = 0;
auto iter = TensorIterator::nullary_op(t);
gpu_kernel(iter, [thread_work_index_]GPU_LAMBDA() mutable -> float {
  return thread_work_index_++;
});
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35015

Differential Revision: D20624698

Pulled By: ngimel

fbshipit-source-id: 06e3987793451cd514181d20252510297e2d28a9
2020-03-24 12:30:49 -07:00
Nikita Shulga
9dd5d51b01 [ATen] Exclude CUDA tests when running basic under valgrind (#34181)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34181

Test Plan: CI

Reviewed By: orionr, seemethere

Differential Revision: D20241021

fbshipit-source-id: a7371afc45acc2c07a36c8216036338e14170a56
2020-03-04 11:24:33 -08:00
Nikita Shulga
9d1c971b11 [Aten] Suppress valgrind leaks in libcuda (#34169)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34169

Valgrind has no insight into how memory is being initialized by ioctls()

Test Plan: CI

Reviewed By: seemethere

Differential Revision: D20235974

fbshipit-source-id: 46413afa4842e7d42582bbbda903438b1d98691f
2020-03-03 16:00:17 -08:00
Xiang Gao
9c2ed2574a Vectorized memory access in TensorIterator GPU loop for 1d contiguous case (#32383)
Summary:
Step 2 of https://github.com/pytorch/pytorch/issues/31975

Vectorized memory access is enabled. Generated code: https://github.com/zasdfgbnm/things/blob/master/2020Q1/disassembly-elementwise-vec.ipynb

```
void at::native::modern::elementwise_kernel<4, 64, 4, at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1}, at::detail::Array<char*, 3> >(int, at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1}, at::detail::Array<char*, 3>)

**ASM:**

	.section	.text._ZN2at6native6modern18elementwise_kernelILi4ELi64ELi4EZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE2_clEvEUlffE_NS_6detail5ArrayIPcLi3EEEEEviT2_T3_,"ax",progbits
	.sectioninfo	@"SHI_REGISTERS=20"
	.align	128
        .global         _ZN2at6native6modern18elementwise_kernelILi4ELi64ELi4EZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE2_clEvEUlffE_NS_6detail5ArrayIPcLi3EEEEEviT2_T3_
        .type           _ZN2at6native6modern18elementwise_kernelILi4ELi64ELi4EZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE2_clEvEUlffE_NS_6detail5ArrayIPcLi3EEEEEviT2_T3_,function
        .size           _ZN2at6native6modern18elementwise_kernelILi4ELi64ELi4EZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE2_clEvEUlffE_NS_6detail5ArrayIPcLi3EEEEEviT2_T3_,(.L_40898 - _ZN2at6native6modern18elementwise_kernelILi4ELi64ELi4EZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE2_clEvEUlffE_NS_6detail5ArrayIPcLi3EEEEEviT2_T3_)
        .other          _ZN2at6native6modern18elementwise_kernelILi4ELi64ELi4EZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE2_clEvEUlffE_NS_6detail5ArrayIPcLi3EEEEEviT2_T3_,@"STO_CUDA_ENTRY STV_DEFAULT"
_ZN2at6native6modern18elementwise_kernelILi4ELi64ELi4EZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE2_clEvEUlffE_NS_6detail5ArrayIPcLi3EEEEEviT2_T3_:
.text._ZN2at6native6modern18elementwise_kernelILi4ELi64ELi4EZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE2_clEvEUlffE_NS_6detail5ArrayIPcLi3EEEEEviT2_T3_:
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 294
        /*0000*/                   IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;
        /*0010*/              @!PT SHFL.IDX PT, RZ, RZ, RZ, RZ ;
        /*0020*/                   S2R R9, SR_CTAID.X ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 177
        /*0030*/                   S2R R0, SR_TID.X ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 294
        /*0040*/                   IMAD.SHL.U32 R9, R9, 0x100, RZ ;
        /*0050*/                   IADD3 R5, -R9, c[0x0][0x160], RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 256
        /*0060*/                   SHF.R.S32.HI R17, RZ, 0x1f, R9 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 296
        /*0070*/                   ISETP.GE.AND P0, PT, R5, 0x100, PT ;
        /*0080*/              @!P0 BRA `(.L_3173) ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 256
        /*0090*/                   IMAD.SHL.U32 R12, R9.reuse, 0x4, RZ ;
        /*00a0*/                   SHF.L.U64.HI R17, R9, 0x2, R17 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 260
        /*00b0*/                   IADD3 R8, P0, R12.reuse, c[0x0][0x188], RZ ;
        /*00c0*/                   IADD3 R2, P1, R12, c[0x0][0x190], RZ ;
        /*00d0*/                   IADD3.X R9, R17.reuse, c[0x0][0x18c], RZ, P0, !PT ;
        /*00e0*/                   IADD3.X R3, R17, c[0x0][0x194], RZ, P1, !PT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 218
        /*00f0*/                   IMAD.WIDE R8, R0, 0x10, R8 ;
        /*0100*/                   IMAD.WIDE R2, R0, 0x10, R2 ;
        /*0110*/                   LDG.E.128.SYS R8, [R8] ;
        /*0120*/                   LDG.E.128.SYS R4, [R2] ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 256
        /*0130*/                   IADD3 R12, P0, R12, c[0x0][0x180], RZ ;
        /*0140*/                   IADD3.X R13, R17, c[0x0][0x184], RZ, P0, !PT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 238
        /*0150*/                   IMAD.WIDE R12, R0, 0x10, R12 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 196
        /*0160*/                   FFMA R7, R7, c[0x0][0x168], R11 ;
        /*0170*/                   FFMA R6, R6, c[0x0][0x168], R10 ;
        /*0180*/                   FFMA R5, R5, c[0x0][0x168], R9 ;
        /*0190*/                   FFMA R4, R4, c[0x0][0x168], R8 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 238
        /*01a0*/                   STG.E.128.SYS [R12], R4 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 301
        /*01b0*/                   EXIT ;
.L_3173:
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 180
        /*01c0*/                   ISETP.GE.AND P0, PT, R0, R5, PT ;
        /*01d0*/                   BMOV.32.CLEAR RZ, B0 ;
        /*01e0*/                   BSSY B0, `(.L_3174) ;
        /*01f0*/               P0 BRA `(.L_3175) ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 183
        /*0200*/                   IADD3 R3, P1, R9, R0, RZ ;
        /*0210*/                   LEA.HI.X.SX32 R4, R0, R17, 0x1, P1 ;
        /*0220*/                   LEA R2, P1, R3, c[0x0][0x188], 0x2 ;
        /*0230*/                   LEA.HI.X R3, R3, c[0x0][0x18c], R4, 0x2, P1 ;
        /*0240*/                   LDG.E.SYS R8, [R2] ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 184
        /*0250*/                   IADD3 R4, R0, 0x40, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 180
        /*0260*/                   ISETP.GE.AND P1, PT, R4, R5, PT ;
        /*0270*/               P1 BRA `(.L_3175) ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 183
        /*0280*/                   LDG.E.SYS R4, [R2+0x100] ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 184
        /*0290*/                   IADD3 R6, R0, 0x80, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 180
        /*02a0*/                   ISETP.GE.AND P1, PT, R6, R5, PT ;
        /*02b0*/               P1 BRA `(.L_3175) ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 184
        /*02c0*/                   IADD3 R10, R0, 0xc0, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 183
        /*02d0*/                   LDG.E.SYS R7, [R2+0x200] ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 180
        /*02e0*/                   ISETP.GE.AND P1, PT, R10, R5, PT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 183
        /*02f0*/              @!P1 LDG.E.SYS R6, [R2+0x300] ;
.L_3175:
        /*0300*/                   BSYNC B0 ;
.L_3174:
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 180
        /*0310*/                   BMOV.32.CLEAR RZ, B0 ;
        /*0320*/                   BSSY B0, `(.L_3176) ;
        /*0330*/               P0 BRA `(.L_3177) ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 183
        /*0340*/                   IADD3 R3, P1, R9, R0, RZ ;
        /*0350*/                   LEA.HI.X.SX32 R10, R0, R17, 0x1, P1 ;
        /*0360*/                   LEA R2, P1, R3, c[0x0][0x190], 0x2 ;
        /*0370*/                   LEA.HI.X R3, R3, c[0x0][0x194], R10, 0x2, P1 ;
        /*0380*/                   LDG.E.SYS R11, [R2] ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 184
        /*0390*/                   IADD3 R10, R0, 0x40, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 180
        /*03a0*/                   ISETP.GE.AND P1, PT, R10, R5, PT ;
        /*03b0*/               P1 BRA `(.L_3177) ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 183
        /*03c0*/                   LDG.E.SYS R13, [R2+0x100] ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 184
        /*03d0*/                   IADD3 R10, R0, 0x80, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 180
        /*03e0*/                   ISETP.GE.AND P1, PT, R10, R5, PT ;
        /*03f0*/               P1 BRA `(.L_3177) ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 184
        /*0400*/                   IADD3 R10, R0, 0xc0, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 180
        /*0410*/                   ISETP.GE.AND P1, PT, R10, R5, PT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 183
        /*0420*/                   LDG.E.SYS R10, [R2+0x200] ;
        /*0430*/              @!P1 LDG.E.SYS R15, [R2+0x300] ;
.L_3177:
        /*0440*/                   BSYNC B0 ;
.L_3176:
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 193
        /*0450*/               P0 EXIT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 196
        /*0460*/                   IADD3 R9, P0, R9, R0, RZ ;
        /*0470*/                   FFMA R11, R11, c[0x0][0x168], R8 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 197
        /*0480*/                   IADD3 R14, R0, 0x40, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 196
        /*0490*/                   LEA.HI.X.SX32 R12, R0, R17, 0x1, P0 ;
        /*04a0*/                   LEA R2, P0, R9.reuse, c[0x0][0x180], 0x2 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 193
        /*04b0*/                   ISETP.GE.AND P1, PT, R14, R5, PT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 196
        /*04c0*/                   LEA.HI.X R3, R9, c[0x0][0x184], R12, 0x2, P0 ;
        /*04d0*/                   STG.E.SYS [R2], R11 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 193
        /*04e0*/               P1 EXIT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 197
        /*04f0*/                   IADD3 R8, R0, 0x80, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 196
        /*0500*/                   FFMA R13, R13, c[0x0][0x168], R4 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 193
        /*0510*/                   ISETP.GE.AND P0, PT, R8, R5, PT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 196
        /*0520*/                   STG.E.SYS [R2+0x100], R13 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 193
        /*0530*/               P0 EXIT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 197
        /*0540*/                   IADD3 R0, R0, 0xc0, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 196
        /*0550*/                   FFMA R7, R10, c[0x0][0x168], R7 ;
        /*0560*/                   FFMA R15, R15, c[0x0][0x168], R6 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 193
        /*0570*/                   ISETP.GE.AND P0, PT, R0, R5, PT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 196
        /*0580*/                   STG.E.SYS [R2+0x200], R7 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 193
        /*0590*/               P0 EXIT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 196
        /*05a0*/                   STG.E.SYS [R2+0x300], R15 ;
        /*05b0*/                   EXIT ;
.L_3178:
        /*05c0*/                   BRA `(.L_3178);
        /*05d0*/                   NOP;
        /*05e0*/                   NOP;
        /*05f0*/                   NOP;
.L_40898:
```

We can clearly see the `LDG.E.128` in it, which is a result of vectorization.
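
For intuition, a minimal standalone sketch of the technique (not PyTorch's actual Loops.cuh/MemoryAccess.cuh code; the kernel name and shapes are made up): reinterpreting a suitably aligned float pointer as float4 makes each thread issue one 128-bit load, which is exactly what shows up as `LDG.E.128` above.

```C++
// Minimal CUDA sketch of vectorized 128-bit memory access via float4.
__global__ void add_vec4(const float* a, const float* b, float* out, int n) {
  int i = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
  if (i + 3 < n) {
    float4 va = *reinterpret_cast<const float4*>(a + i);
    float4 vb = *reinterpret_cast<const float4*>(b + i);
    float4 vo = make_float4(va.x + vb.x, va.y + vb.y, va.z + vb.z, va.w + vb.w);
    *reinterpret_cast<float4*>(out + i) = vo;  // one 128-bit store
  } else {
    for (; i < n; ++i) out[i] = a[i] + b[i];   // scalar tail
  }
}
```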

Benchmark: https://github.com/zasdfgbnm/things/blob/master/2020Q1/benchmark-vec.ipynb

Benchmark on P100, dtype `uint8`:

before:
```
1.4.0a0+a5b4d78
e1d97025ee
22.2 µs ± 89.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
34.7 µs ± 38.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
52 µs ± 312 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
86.9 µs ± 135 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
154 µs ± 204 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
291 µs ± 668 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
566 µs ± 1.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.18 ms ± 1.54 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.29 ms ± 1.48 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.4 ms ± 1.15 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

after:
```
1.4.0a0+a5b4d78
1281cdfd8188fe86241ecaf71d001809d016c3a3
24 µs ± 116 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
30.5 µs ± 355 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
43.1 µs ± 300 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
67.6 µs ± 113 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
116 µs ± 275 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
215 µs ± 142 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
413 µs ± 791 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
824 µs ± 891 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.63 ms ± 478 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.19 ms ± 1.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

Benchmark on P100, dtype `half`:

Before:
```
1.4.0a0+a5b4d78
1c017f0c14
30.8 µs ± 226 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
43.4 µs ± 164 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
69.1 µs ± 83 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
119 µs ± 103 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
224 µs ± 99.1 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
418 µs ± 206 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
865 µs ± 237 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.69 ms ± 695 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.3 ms ± 527 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
6.77 ms ± 741 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

After

```
1.4.0a0+a5b4d78
7e50ee27333e7047072d328d03767b4845286356
28.9 µs ± 61.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
40.2 µs ± 244 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
63.8 µs ± 350 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
109 µs ± 196 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
199 µs ± 157 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
380 µs ± 446 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
743 µs ± 2.17 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.47 ms ± 1.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.91 ms ± 9.17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.8 ms ± 296 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

cc: csarofeen ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32383

Differential Revision: D19697455

Pulled By: ngimel

fbshipit-source-id: 0707481c2f334e6634c000b4afd275b2fee8fbe1
2020-02-03 16:20:40 -08:00
Wanchao Liang
3613a30345 Move dict_test.cpp to test folder and fix dict_test.cpp for Aten includes (#24071)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24071

Pull Request resolved: https://github.com/pytorch/pytorch/pull/24071

Test Plan: Imported from OSS

Differential Revision: D16728574

Pulled By: wanchaol

fbshipit-source-id: 6952b9703a40dc35f567bf17fbdcef6e0c6c2d6e
2019-08-08 22:41:16 -07:00
Syed Tousif Ahmed
effcc398c4 Refactor Random Number Generators in ATen (#21555)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21555
ghimport-source-id: dd900a8c3e1ef9ef1e011b8bb5476626d18cc462

Test Plan: Imported from OSS

Differential Revision: D15875780

Pulled By: ezyang

fbshipit-source-id: 6e04e90af62ab9c9593d74f344a3a084aaaf6f43
2019-06-19 13:54:09 -07:00
Syed Tousif Ahmed
ae342fd076 Refactor Random Number Generators in ATen (#21364)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21364
ghimport-source-id: ca7d37e10190ba46dc8512f437404ca9216d3369

Differential Revision: D15696497

Pulled By: ezyang

fbshipit-source-id: 2e713b8566ae915e175b5a79ac1dd9b86cc2a23d
2019-06-12 13:01:30 -07:00
Richard Zou
8ffcbfb7d4 Add unique_ptr<NamedTensorMeta> field to TensorImpl (#21341)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21341
ghimport-source-id: 06021b06864746571a904a1cfc0aaea5f8a12325

Differential Revision: D15717907

Pulled By: zou3519

fbshipit-source-id: 48ee76cf2f11a8b092be75ecac8d5faee68ca0d9
2019-06-10 07:29:36 -07:00
Richard Zou
4727685ea1 Added at::Dimname (#21280)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21280
ghimport-source-id: 921848326e4828ffd422868be26c409c6490e1ab

Differential Revision: D15698516

Pulled By: zou3519

fbshipit-source-id: 502b9b019d51dd46327e6caf2af69aa520c70cb6
2019-06-07 06:30:42 -07:00
Sam Gross
25a6ff10f0 Add gtest for TensorIterator (#21253)
Summary:
This adds a regression test for the bug fix in #21236. Operations
involving CUDA tensors and CPU scalars should not copy the CPU scalar to
the device (because that is slow). They should instead "lift" the scalar
to a kernel parameter.
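A hedged sketch of the scenario under test (illustrative ATen usage; not the gtest's actual code):

```C++
#include <ATen/ATen.h>

void demo() {
  at::Tensor t = at::ones({1024}, at::kCUDA);
  // The CPU scalar 5 should be "lifted" to a kernel parameter rather
  // than copied to the device as a tensor.
  at::Tensor r = t.add(5);
}
```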
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21253

Reviewed By: bddppq

Differential Revision: D15604080

Pulled By: colesbury

fbshipit-source-id: c14ded5d584499eaa5ea83337ffc50278205f3d6
2019-06-04 07:23:42 -07:00
Syed Tousif Ahmed
67414714e5 Move THCTensor_(uniform) to ATen (#20292)
Summary:
As a first step for this plan: https://github.com/pytorch/pytorch/issues/19508#issuecomment-485178192, this PR moves `THCTensor_(uniform)` to ATen. Major changes are:
- `uniform_` cuda kernel now utilizes a philox generator.
- the kernel also utilizes TensorIterator
- the kernel uses a grid-stride loop to achieve peak effective bandwidth (see the sketch below)
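
A hedged sketch of such a kernel (illustrative; not the PR's exact code — the seed/offset plumbing is simplified here):

```C++
#include <curand_kernel.h>

// Grid-stride loop with per-thread Philox state: each thread calls
// curand_init once, then strides over the whole output.
__global__ void uniform_kernel(float* out, int64_t n,
                               uint64_t seed, uint64_t offset) {
  int64_t idx = blockIdx.x * (int64_t)blockDim.x + threadIdx.x;
  curandStatePhilox4_32_10_t state;
  curand_init(seed, /*subsequence=*/idx, offset, &state);
  for (int64_t i = idx; i < n; i += (int64_t)blockDim.x * gridDim.x) {
    out[i] = curand_uniform(&state);  // uniform in (0, 1]
  }
}
```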

- Since the engine has changed from `curandStateMTGP32` to `curandStatePhilox4_32_10`, the random numbers generated will now be different.
- Here is the diff showing codegen changes: https://gist.github.com/syed-ahmed/4af9ae0d42b6c7dbaa13b9dd0d1dd1e8 (BC breaking change if any)

- Philox4_32_10 is known to pass the standard TestU01 Big Crush test (https://www.thesalmons.org/john/random123/papers/random123sc11.pdf) and hence the quality of random numbers generated isn't an issue when compared to the previously used `curandStateMTGP32`.
- I have added a test case in `aten/src/ATen/test/cuda_distributions_test.cu` which verifies that philox offset is incremented properly

The benchmark was done on a DGX station with 4 V100s.
I modified the script from jcjohnson's [multinomial benchmark](https://github.com/jcjohnson/pytorch-multinomial-benchmark) to produce this notebook, which shows that there is a general speedup with this PR and that no regression has been introduced: https://gist.github.com/syed-ahmed/9d26d4e96308aed274d0f2c7be5218ef

To reproduce the notebook:
- Run https://gist.github.com/syed-ahmed/4208c22c541f1d30ad6a9b1efc1d728f in a container with the current pytorch top of tree with the command: `python uniform_benchmark.py --stats_json before.json`
- Apply this diff to the current pytorch top of tree and run the same script in a container with the command: `python uniform_benchmark.py --stats_json after.json`
- Run the notebook attached above with the `after.json` and `before.json` in the same directory

The effective bandwidth was calculated using the script (thanks to ngimel): https://gist.github.com/syed-ahmed/f8b7384d642f4bce484228b508b4bc68
Following are the numbers before and after.
```
uniform, size, elements 65536 forward 5.168914794921875e-06 bandwidth (GB/s) 50.71548098597786
uniform, size, elements 131072 forward 5.056858062744141e-06 bandwidth (GB/s) 103.67860705101367
uniform, size, elements 262144 forward 7.164478302001953e-06 bandwidth (GB/s) 146.357621001797
uniform, size, elements 524288 forward 1.1217594146728515e-05 bandwidth (GB/s) 186.9520302275877
uniform, size, elements 1048576 forward 1.923084259033203e-05 bandwidth (GB/s) 218.10297600317384
uniform, size, elements 2097152 forward 3.640890121459961e-05 bandwidth (GB/s) 230.39992200138826
uniform, size, elements 4194304 forward 6.778717041015625e-05 bandwidth (GB/s) 247.49839679819922
uniform, size, elements 8388608 forward 0.00012810707092285157 bandwidth (GB/s) 261.92490202361347
uniform, size, elements 16777216 forward 0.00025241613388061524 bandwidth (GB/s) 265.86598474620627
uniform, size, elements 33554432 forward 0.000497891902923584 bandwidth (GB/s) 269.5720239913193
```
```
uniform, size, elements 65536 forward 5.550384521484375e-06 bandwidth (GB/s) 47.22988091821306
uniform, size, elements 131072 forward 5.581378936767578e-06 bandwidth (GB/s) 93.93520954942333
uniform, size, elements 262144 forward 6.165504455566406e-06 bandwidth (GB/s) 170.071404141686
uniform, size, elements 524288 forward 6.3276290893554685e-06 bandwidth (GB/s) 331.4277702414469
uniform, size, elements 1048576 forward 8.509159088134765e-06 bandwidth (GB/s) 492.91639239047356
uniform, size, elements 2097152 forward 1.2989044189453124e-05 bandwidth (GB/s) 645.8218077979443
uniform, size, elements 4194304 forward 2.347707748413086e-05 bandwidth (GB/s) 714.6211452997259
uniform, size, elements 8388608 forward 4.4286251068115234e-05 bandwidth (GB/s) 757.6715389250498
uniform, size, elements 16777216 forward 8.672237396240235e-05 bandwidth (GB/s) 773.8356427961071
uniform, size, elements 33554432 forward 0.00016920566558837892 bandwidth (GB/s) 793.2224227438523
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20292

Differential Revision: D15277761

Pulled By: ezyang

fbshipit-source-id: 8bfe31a01eeed77f0ed6e7ec4d2dda4c6472ecaa
2019-05-13 09:38:28 -07:00
Alex Şuhan
9811a4220d Add XLA / TPU device type, backend type and type id (#16763)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16763

Replicate the easy bits in https://github.com/pytorch/pytorch/pull/15153 with TPU / XLA instead of MSNPU. Also don't initialize the storage for XLA tensors for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16585

Reviewed By: ezyang

Differential Revision: D13912118

Pulled By: gchanan

fbshipit-source-id: 4889177e2478768fb281ed075b71146d1d850bd9
2019-02-05 12:56:44 -08:00
Roy Li
7e642dfff3 Introduce backend extensions (overriding operators on custom backends)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/15153

Reviewed By: gchanan

Differential Revision: D13445571

fbshipit-source-id: 62e2ebe0a6e81c4983b47cddb57ee5eb78e96708
2019-02-01 11:00:16 -08:00
Dmytro Dzhulgakov
a061e3fd77 Back out "Revert D13596031: Improve c2-aten tensor interop and add proper testing" (#16514)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16514

Original commit changeset: dc371697f14b
Relanding https://github.com/pytorch/pytorch/pull/15860 - the problem was that layer_norm was using at::empty, which is not yet available on mobile

Reviewed By: ezyang

Differential Revision: D13861480

fbshipit-source-id: e2116da32bc117175c96b9151b1beba9b31eff36
2019-01-31 13:38:55 -08:00
Edward Yang
3b337e7892 Revert D13596031: Improve c2-aten tensor interop and add proper testing
Differential Revision:
D13596031

Original commit changeset: d20b601e06ba

fbshipit-source-id: dc371697f14b3893a9164380a39e7a49d8d68ecf
2019-01-29 07:14:57 -08:00
Dmytro Dzhulgakov
5e21e0fe75 Improve c2-aten tensor interop and add proper testing (#15860)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15860

Few changes (which are harder to split in separate diffs, so together):
- make conversions explicit (as they can throw), to avoid surprises
- fix tensor legacy dispatch not initialized when tensor is created on C2 side
- add a bunch of invariants to enforce

Reviewed By: ezyang

Differential Revision: D13596031

fbshipit-source-id: d20b601e06ba47aeff2f6e8e15769840e2d46108
2019-01-28 23:41:50 -08:00
Roy Li
50fbf79451 test basic tensor interop
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/12249

Differential Revision: D13469356

Pulled By: li-roy

fbshipit-source-id: b49748462aa44ac34b8ce79783f2c895a537a232
2018-12-27 17:04:00 -08:00
Peter Goldsborough
5987b44dda Remove aten doc/ folder (#11158)
Summary:
ATen's doc/ folder is manually maintained and can thus cause confusion with the generated file. We now have proper online documentation for ATen, which is superior to ATen doc/. Let's delete ATen/doc.

ezyang apaszke soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11158

Differential Revision: D9618782

Pulled By: goldsborough

fbshipit-source-id: 0ef14f84947601a0589aa4a41e5c8619783426fe
2018-08-31 14:55:13 -07:00
Gregory Chanan
7842b6d0f7 Fix at::optional compile problems on Windows CUDA.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/10909

Differential Revision: D9516837

Pulled By: gchanan

fbshipit-source-id: fad7e3284e74c599b873ebaae2dcdf5013505855
2018-08-27 14:40:41 -07:00
Syed Tousif Ahmed
1fbabff76a Refactor THCNumerics and add common math functions for at::Half (#10301)
Summary:
**Summary**: This PR is a followup to mruberry's https://github.com/pytorch/pytorch/pull/9318/. It tries to achieve the following:
- Specialize std common math functions for the `at::Half` type.
- Create `CUDANumerics.cuh` to contain necessary parts from `THCNumerics.cuh`.
- Update `THCNumerics.cuh` with new usage and comments to demonstrate best practice for developers, hence making way for its deprecation.
- Remove legacy/redundant code paths.
- Remove unused CUDA HALF macros (see separate PR https://github.com/pytorch/pytorch/pull/10147)

**Comments**: `CUDANumerics.cuh` contains mathematical functions that are either not in the std namespace or are specialized for compilation with CUDA NVCC or CUDA NVRTC. This header is derived from the legacy `THCNumerics.cuh`. Following is some rationale for why some functions were kept while others were removed:
- All arithmetic can now be done in ATen using binary CUDA kernels or CUDA tensor pointwise apply (check https://github.com/pytorch/pytorch/pull/8919 and `CUDAApplyUtils`). `at::Half` comparisons rely on implicit conversion to float.
- Functions that are C/C++ standard compliant have been specialized for user-defined types; for instance, the std namespace has been opened up for `at::Half`, defining math functions for `at::Half`. Check `Half-inl.h` (and the sketch after this list).
- Some standard-compliant functions are specialized here for performance reasons. For instance, `powi` is used for `pow` calculation on integral types. Moreover, `abs`, `isinf`, `isnan` are specialized to save one API call versus going through std, although this is subject to change depending on whether we really care about saving one API call.
- Numeric limits such as `max/min` are removed since they call standard defines. Moreover, numeric limits for `at::Half` are present in `Half-inl.h`. I understood that HIP has some issue with `std::numeric_limits`, and this is the related GitHub issue I found: https://github.com/ROCm-Developer-Tools/HIP/issues/374. AlexVlx mentions that the issue can be avoided by using `std::numeric_limits` in `__device__`. Since we are launching lambdas with device contexts, I don't see why `std::numeric_limits` wouldn't compile on HIP if used within a kernel, unless I am unaware of the real reason why max/min was in THCNumerics in the first place. (I haven't tried a build with HIP.)
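
A hedged illustration of the kind of specialization described above (the header path is an assumption, and this is not the exact `Half-inl.h` code): compute in float, then narrow back to `at::Half`.

```C++
#include <cmath>
#include <ATen/Half.h>  // assumed header for at::Half at the time

namespace std {
inline at::Half sin(at::Half h) {
  // Promote to float, use the standard float implementation, narrow back.
  return at::Half(std::sin(static_cast<float>(h)));
}
} // namespace std
```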

Here are some reference PRs that was handy in refactoring TH into ATen:
- https://github.com/pytorch/pytorch/pull/6786
- https://github.com/pytorch/pytorch/pull/5475
- https://github.com/pytorch/pytorch/pull/9401
- https://github.com/pytorch/pytorch/pull/8689
- https://github.com/pytorch/pytorch/pull/8919
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10301

Differential Revision: D9204758

Pulled By: soumith

fbshipit-source-id: 09f489c1656458c02367b6cd31c3eeeca5acdc8a
2018-08-24 16:02:06 -07:00
mruberry
d6f21fc663 Ports Streams to ATen (#8997)
Summary:
This PR moves the THCStream logic (from both the THCStream and THCState APIs) to ATen. In particular, it:

+ Creates a new (THC free) at::CUDAStream class and API
+ Extends the at::Context API to expose it
+ Stubs the current THCStream and THCState APIs to use it
+ Updates THC to no longer violate stream encapsulation (stream.hpp is dead)
+ Adds an ATen cpp test of the API
+ Bonus: Removes some debug spew in test_nn.py

The new API has several advantages over the old one:

(1) It comes with an easy-to-use RAII type, the CUDAStream (see the sketch after this list). CUDAStreams have the expected copy and move semantics and are implicitly convertible to cudaStream_t.
(2) It does not depend on THCState, THCThreadLocal, or CUDA (thanks to goldsborough for suggesting the dynamic registration technique)
(3) It provides one consistent API/place for all stream operations, instead of having them split between THCStream and THCState
(4) The internals are completely encapsulated, unlike the historic THCStream
(5) It has getAndRetain semantics, which are safer than the historic gets (which allowed a gap between acquisition and retention)

There are a couple things this PR does not do, however, which are left for future work:

- It leaves the c10d::CUDAStream class as a THCStream wrapper (which now really wraps an at::CUDAStream).
- It leaves historic users of THCStream mostly untouched, except where they violated encapsulation (by using stream.hpp). A couple forward declarations were also changed.

I hope this PR allows easy usage of streams from ATen and is a useful pattern for porting more of the THCState API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8997

Differential Revision: D8683375

Pulled By: soumith

fbshipit-source-id: 2e48ad85f1f9c8817684fe63a267938e80eafdcf
2018-07-08 16:25:09 -07:00
Orion Reblitz-Richardson
4bf0202cac [build] Have PyTorch depend on minimal libcaffe2.so instead of libATen.so (#7399)
* Have PyTorch depend on minimal libcaffe2.so instead of libATen.so

* Build ATen tests as a part of Caffe2 build

* Hopefully cufft and nvcc fPIC fixes

* Make ATen install components optional

* Add tests back for ATen and fix TH build

* Fixes for test_install.sh script

* Fixes for cpp_build/build_all.sh

* Fixes for aten/tools/run_tests.sh

* Switch ATen cmake calls to USE_CUDA instead of NO_CUDA

* Attempt at fix for aten/tools/run_tests.sh

* Fix typo in last commit

* Fix valgrind call after pushd

* Be forgiving about USE_CUDA disable like PyTorch

* More fixes on the install side

* Link all libcaffe2 during test run

* Make cuDNN optional for ATen right now

* Potential fix for non-CUDA builds

* Use NCCL_ROOT_DIR environment variable

* Pass -fPIC through nvcc to base compiler/linker

* Remove THCUNN.h requirement for libtorch gen

* Add Mac test for -Wmaybe-uninitialized

* Potential Windows and Mac fixes

* Move MSVC target props to shared function

* Disable cpp_build/libtorch tests on Mac

* Disable sleef for Windows builds

* Move protos under BUILD_CAFFE2

* Remove space from linker flags passed with -Wl

* Remove ATen from Caffe2 dep libs since directly included

* Potential Windows fixes

* Preserve options while sleef builds

* Force BUILD_SHARED_LIBS flag for Caffe2 builds

* Set DYLD_LIBRARY_PATH and LD_LIBRARY_PATH for Mac testing

* Pass TORCH_CUDA_ARCH_LIST directly in cuda.cmake

* Fixes for the last two changes

* Potential fix for Mac build failure

* Switch Caffe2 to build_caffe2 dir to not conflict

* Cleanup FindMKL.cmake

* Another attempt at Mac cpp_build fix

* Clear cpp-build directory for Mac builds

* Disable test in Mac build/test to match cmake
2018-05-24 07:47:27 -07:00
Mike Ruberry
37b9d093d2 Updates collapseDims() function and documentation (#7056)
* Updates collapseDims() function and documentation

* Adds C++ tests, validates input, updates names for readability

* Removes invalid test

* stashing to merge AT_CHECK macro

* Updates asserts, removes tests on Windows
2018-05-12 23:42:55 -04:00
cpuhrsch
ae35e0e924 Support non-contiguous tensors for unary ops (#6119)
2018-04-27 21:31:34 +02:00
Will Feng
e089849b4a Add mutex to THC random number generator (#6527)
* Add mutex to THC random number generator

* Add test for CUDA RNG multithread

* fix lint

* Rename gen_state to state and remove unnecessary mutex lock

* Remove RNG test from cpp_extensions

* Add CUDA RNG test to libtorch

* Build test_rng only if CUDA exists

* Move test to aten/src/ATen/test/

* Separate ATen build and test, and run ATen test in CI test phase

* Don't test ATen in ASAN build

* Fix bug in ATen scalar_test

* Fix bug in ATen native_test

* Add FIXME to some CUDA tests in scalar_tensor_test

* Valgrind doesn't work well with CUDA; seed the CPU and CUDA RNG separately instead
2018-04-18 15:54:13 -04:00