Commit Graph

70 Commits

Author SHA1 Message Date
Syed Tousif Ahmed
4dab208d97 Adds Issue#153109 as a test for CUDAPluggableAllocator (#163575)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163575
Approved by: https://github.com/ngimel
2025-10-01 09:07:48 +00:00
Nikita Shulga
02cd4dbcf4 [BE][CI] Get rid of duplicated code (#131406)
Followup after https://github.com/pytorch/pytorch/pull/131061: define a `run_if_exists` function that runs a cpp test if it exists and prints a warning otherwise.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131406
Approved by: https://github.com/kit1980, https://github.com/atalman
2024-07-23 04:01:13 +00:00
Syed Tousif Ahmed
1f961ad495 Runs aten cuda cpp tests in CI (#131061)
It seems like these tests were never run, because https://github.com/pytorch/pytorch/pull/99956 got rid of the `pushd $1` that made the if conditions true in CUDA builds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131061
Approved by: https://github.com/malfet, https://github.com/eqy
2024-07-19 12:35:33 +00:00
Huy Do
35834a405c Run C++ tests on CI with run_test.py (#99956)
After https://github.com/pytorch/pytorch/pull/99559, we can now run C++ tests with `run_test.py`.  Although advanced features such as `--import-slow-tests` and `--import-disabled-tests` won't work for now, there is still a gain in reliability and performance, as C++ tests can now be retried and run in parallel.

This covers all C++ tests in the CI, including aten, libtorch, and Vulkan C++ tests, across all platforms: Linux, Windows, and macOS.

Notes:
* To support C++ test discovery, the env variable `CPP_TESTS_DIR` can be set to where the C++ test binaries are located
* Support the pytest `-k` argument via run_test, as this is used by pytest-cpp to replace `--gtest-filter`
* The XML output is in pytest format, but that's ok for now because we don't have slow test or flaky test support for C++ tests yet
* ~~I need to figure out why conftest.py doesn't work when I invoke pytest directly for C++ tests, so `--sc` is not available for C++ tests at the moment.  Proper pytest plugins like stepwise work fine though.  I'll investigate and fix it in a separate PR~~ Found the cause: `conftest.py` is per directory and needs to be in any directory that holds C++ tests
* Two tests, `test_api` and `test_tensorexpr`, timed out on ASAN. I suspect that ASAN is now used on top of the python executable, which is slower than running native C++ code. IMO, it's ok to run these tests as before on ASAN for now
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99956
Approved by: https://github.com/clee2000, https://github.com/ZainRizvi
2023-05-09 21:24:12 +00:00
Richard Zou
4b1053497c [vmap] Prepend "legacy" to files for old vmap implementation (#90324)
We have an older torch.vmap implementation. It is no longer supported.
It still needs to exist somewhere for the sake of BC with
torch.autograd.functional.

This PR makes it clear what files are meant for implementing the old
vmap implementation. I've seen a couple of PRs recently adding support
for the old vmap implementation, so this will lessen the confusion.

Test Plan:
- CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90324
Approved by: https://github.com/samdow
2022-12-07 18:46:15 +00:00
Mikayla Gawarecki
3024bcfff5 Add cuda_atomic_ops_test to run_tests.sh
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74482

Approved by: https://github.com/cpuhrsch
2022-03-24 17:12:35 +00:00
Nikita Shulga
77beccaedb Do not build PyTorch with caffe2 by default (#66658)
Summary:
CAFFE2 has been deprecated for a while, but still included in every PyTorch build.
We should stop building it by default, although CI should still validate that caffe2 code is buildable.

Build even fewer dependencies when compiling mobile builds without Caffe2
Introduce `TEST_CAFFE2` in torch.common.utils
Skip `TestQuantizedEmbeddingOps` and `TestJit.test_old_models_bc` if code is compiled without Caffe2
Should be landed after https://github.com/pytorch/builder/pull/864

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66658

Reviewed By: driazati, seemethere, janeyx99

Differential Revision: D31669156

Pulled By: malfet

fbshipit-source-id: 1cc45e2d402daf913a4685eb9f841cc3863e458d
2021-10-21 20:32:47 -07:00
Alex Suhan
b176feec1e Add device and key for lazy tensors (#61621)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61621

Test Plan: CI

Reviewed By: mruberry

Differential Revision: D29912934

Pulled By: asuhan

fbshipit-source-id: 493c32063a3e756d93cbf1d876563a35eaafb537
2021-07-26 23:00:22 -07:00
Richard Zou
970096b624 [Reland] Adds an aten::_ops namespace with unambiguous function names (#59018)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59018

Fixes #58044.

This PR:
- adds `ATEN_FN(op)` and `ATEN_FN2(op, overload)` macros that resolve to
a non-overloaded function in aten::_ops that calls the desired operator
(without default arguments).

The motivation for this is two-fold:
1) Using aten operators with templates is hard if the operator is
overloaded (e.g. add.Tensor and add.Scalar).
2) Method-only operators require special handling; pointers-to-method
are different from function pointers. `ATEN_FN2(add_, Tensor)` returns
a function instead of a method.

There is some interesting behavior for out= operations.
`ATEN_FN2(sin, "out")` gives a function that is *faithful* to the schema;
that is, the order of arguments is exactly what it looks like in the
schema. This makes it so that you can directly register
`ATEN_FN2(sin,"out")` (or a function wrapping it using the same signature)
as an override for a DispatchKey.
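For illustration, here is a hedged sketch of how these macros might be used (illustrative only — the helper `apply_binary` is a made-up name, and the exact macro expansion lives in the generated Operators.h):

```C++
#include <ATen/ATen.h>
#include <ATen/Operators.h>  // generated header defining ATEN_FN / ATEN_FN2

// A template that needs a non-overloaded callable; plain at::add would be
// ambiguous here because both add.Tensor and add.Scalar exist.
template <typename F>
at::Tensor apply_binary(F op, const at::Tensor& a, const at::Tensor& b) {
  // The resolved function has no default arguments, so alpha is explicit.
  return op(a, b, /*alpha=*/1);
}

at::Tensor demo(const at::Tensor& a, const at::Tensor& b) {
  return apply_binary(ATEN_FN2(add, Tensor), a, b);
}
```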

Test Plan:
- New tests that ATEN_FN2 works on function and method-only operators
- New test that ATEN_FN works
- New test that ATEN_FN macro returns a "faithful" function.

Codegen output:
Operators.h and Operators.cpp are both here:
https://gist.github.com/zou3519/c2c6a900410b571f0d7d127019ca5175

Reviewed By: bdhirsh

Differential Revision: D28721206

Pulled By: zou3519

fbshipit-source-id: a070017f98e8f4038cb0c64be315eef45d264217
2021-06-01 17:19:06 -07:00
Alice Ou
24508337f4 Revert D28643215: Adds an aten::_ops namespace with unambiguous function names
Test Plan: revert-hammer

Differential Revision:
D28643215 (28740869a1)

Original commit changeset: 7b2b8459f1b2

fbshipit-source-id: ea869bf4cfde7038087e990b2cff5a86f9e2a531
2021-05-26 12:35:34 -07:00
Richard Zou
28740869a1 Adds an aten::_ops namespace with unambiguous function names (#58092)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58092

Fixes #58044.

This PR:
- adds `ATEN_FN(op)` and `ATEN_FN2(op, overload)` macros that resolve to
a non-overloaded function in aten::_ops that calls the desired operator
(without default arguments).

The motivation for this is two-fold:
1) Using aten operators with templates is hard if the operator is
overloaded (e.g. add.Tensor and add.Scalar).
2) Method-only operators require special handling; pointers-to-method
are different from function pointers. `ATEN_FN2(add_, Tensor)` returns
a function instead of a method.

There is some interesting behavior for out= operations.
`ATEN_FN2(sin, "out")` gives a function that is *faithful* to the schema;
that is, the order of arguments is exactly what it looks like in the
schema. This makes it so that you can directly register
`ATEN_FN2(sin,"out")` (or a function wrapping it using the same signature)
as an override for a DispatchKey.

Test Plan:
- New tests that ATEN_FN2 works on function and method-only operators
- New test that ATEN_FN works
- New test that ATEN_FN macro returns a "faithful" function.

Codegen output:
Operators.h and Operators.cpp are both here:
https://gist.github.com/zou3519/c2c6a900410b571f0d7d127019ca5175

Reviewed By: mruberry

Differential Revision: D28643215

Pulled By: zou3519

fbshipit-source-id: 7b2b8459f1b2eb5ad01ee7b0d2bb77639f77940e
2021-05-26 07:29:15 -07:00
Xiang Gao
3de86b951d Migrate thrust->cub for index put (#55693)
Summary:
64-bit indexing is not supported, because if `num_indices = 2^31`, then 4 long tensors of `num_indices` elements would take 64 GB of RAM (4 tensors × 2^31 elements × 8 bytes each = 64 GiB). I don't think anybody will be interested in running `index_put` with 64 GB of GPU RAM spent on indices alone.

Benchmark on CUDA 11.3 RTX3090:
```python
import torch
import itertools

def run50_sync(f):
    for _ in range(50):
        f()
    torch.cuda.synchronize()

run50_sync(lambda: torch.randperm(1000000, device='cuda'))

def benchmark(M, L):
    a = torch.randn(M, device='cuda')
    i1 = torch.randint(M, (L,), dtype=torch.long, device='cuda')
    v = torch.randn(L, device='cuda')

    torch.cuda.synchronize()

    %timeit run50_sync(lambda:a.index_put_((i1,), v, True))

for M, L in itertools.product((100, 100000, 10000000), repeat=2):
    print(M, L)
    benchmark(M, L)
```

Before
```
100 100
5.13 ms ± 91 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
100 100000
30.2 ms ± 471 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
100 10000000
3.17 s ± 14.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
100000 100
5.19 ms ± 61.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
100000 100000
11.9 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
100000 10000000
712 ms ± 3.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
10000000 100
5.07 ms ± 66.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
10000000 100000
12.1 ms ± 76.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
10000000 10000000
627 ms ± 7.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

After
```
100 100
3.75 ms ± 49.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
100 100000
26.2 ms ± 154 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
100 10000000
2.81 s ± 23.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
100000 100
3.85 ms ± 16.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
100000 100000
9.74 ms ± 40.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
100000 10000000
444 ms ± 1.86 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
10000000 100
3.85 ms ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
10000000 100000
10.7 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
10000000 10000000
396 ms ± 2.63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55693

Reviewed By: albanD

Differential Revision: D27895967

Pulled By: ngimel

fbshipit-source-id: 0616ce33395ce46f1a4161dfd38940b8e54fedc2
2021-04-27 12:27:09 -07:00
Nikita Shulga
efeb988518 Suppress "ioctl points to uninitialised" check (#48187)
Summary:
libcuda.so from CUDA-11.1 makes an ioctl() call that valgrind's memcheck tool considers dangerous.
Instruct valgrind to suppress that check.

Fixes false positives reported in https://app.circleci.com/pipelines/github/pytorch/pytorch/240774/workflows/d4c66de8-f13b-47a2-ae62-2ec1bbe0664b/jobs/9026496

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48187

Reviewed By: janeyx99

Differential Revision: D25059850

Pulled By: malfet

fbshipit-source-id: 982df5860524482b0fcb2bfc6bb490fb06694cf6
2020-11-18 18:45:46 -08:00
Xiang Gao
3876889218 Remove LegacyComplex.h (#39834)
Summary:
All std::complex usage has been migrated to c10::complex
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39834

Differential Revision: D22001969

Pulled By: ezyang

fbshipit-source-id: 665a9198afde45a95309053b2f2381e123bf869a
2020-06-12 08:18:25 -07:00
Nikita Shulga
39d037253c Test PyTorch using python-3.8 + GCC-9 on Bionic (Reland) (#39121)
Summary:
Enable new test config in .circleci/config.yml
Skip scanning several 3rd-party packages to work around https://bugs.python.org/issue40350
Remove pre python-3.5 checks from `test.sh` and update `scikit-learn` to python-3.8 compatible version

This is a reland of https://github.com/pytorch/pytorch/pull/39030
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39121

Differential Revision: D21820375

Pulled By: malfet

fbshipit-source-id: d0be79b7d204cf692e055d42b9be42402dc4c1c0
2020-06-01 11:11:12 -07:00
Richard Zou
a3bab37d96 Add BatchedTensorImpl (#38424)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38424

On the way to adding initial vmap support, this is the implementation
for BatchedTensorImpl. Vmap (in future PRs) leverages Tensors backed by
BatchedTensorImpl to do its work.

For more context, here is an overview of the plan to add initial vmap support.
- [this PR] Add BatchedTensorImpl
- Add one or two batching rules
- Add vmap Python API
- Add "slow" for-loop fallbacks for out-of-place functions via
dispatcher fallback mechanism.
- Add batching rules for "view" functions
- Add "slow" for-loop fallbacks for in-place functions
- Miscellaneous handling for failure cases
- And more

Test Plan: - `./build/bin/vmap_test`

Differential Revision: D21640917

Pulled By: zou3519

fbshipit-source-id: 969490a838cf2099ed80104e7d51ee8ff069e168
2020-05-20 09:10:00 -07:00
Gao, Xiang
c5624e831d Add overloads of std:: math functions for c10::complex [resubmit] (#37468)
Summary:
This reverts commit d167a7f654.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37468

Differential Revision: D21305110

Pulled By: anjali411

fbshipit-source-id: d1bdc9d9feac00331fc2b2b905d49f80bef680f9
2020-04-30 10:20:45 -07:00
Nikita Shulga
6098cf7e33 Add sched_setaffinity check from libgomp to valgrind.sup (#37532)
Summary:
- It's valid to call `sched_setaffinity` with nullptr
- The call is coming from libgomp, which should be valgrind-safe
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37532

Test Plan: CI

Differential Revision: D21311252

Pulled By: malfet

fbshipit-source-id: a325f97741b997738c35759d02fcc34c1cb44d95
2020-04-29 14:48:23 -07:00
Lu Fang
d167a7f654 Revert D21256854: [pytorch][PR] Add overloads of std:: math functions for c10::complex
Test Plan: revert-hammer

Differential Revision:
D21256854

Original commit changeset: 2112ba6b7992

fbshipit-source-id: b81c377f9cd33a493a63d1e666cbe6765516fca8
2020-04-27 13:23:34 -07:00
Gao, Xiang
6d409481b3 Add overloads of std:: math functions for c10::complex (#35725)
Summary:
Issue: https://github.com/pytorch/pytorch/issues/35284

~This depends on and contains https://github.com/pytorch/pytorch/pull/35524. Please review after the dependency gets merged and I will rebase to get a clean diff.~

The implementation of most functions follow the pattern

```C++
template<typename T>
C10_HOST_DEVICE c10::complex<T> some_function(c10::complex<T> x) {
#if defined(__CUDACC__) || defined(__HIPCC__)
  return static_cast<c10::complex<T>>(thrust::some_function(static_cast<thrust::complex<T>>(x)));
#else
  return static_cast<c10::complex<T>>(std::some_function(static_cast<std::complex<T>>(x)));
#endif
}
```
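For a sense of what call sites gain, a hedged usage sketch (the header path and the `c10::` qualification of the overloads are assumptions; this is not the PR's code):

```C++
#include <c10/util/complex.h>  // assumed header for c10::complex

// exp(i*theta) via the overloaded exp; the same source compiles on the
// host (std::exp path) and on the device (thrust::exp path).
template <typename T>
C10_HOST_DEVICE c10::complex<T> cis(T theta) {
  return c10::exp(c10::complex<T>(T(0), theta));
}
```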
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35725

Differential Revision: D21256854

Pulled By: ezyang

fbshipit-source-id: 2112ba6b79923450feafd7ebdc7184a3eaecadb6
2020-04-27 10:32:16 -07:00
Xiang Gao
20328f67bb Add core of c10::complex [resubmit] (#36626)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36626

This reverts commit 9216c67c9e.

Test Plan: Imported from OSS

Differential Revision: D21140441

Pulled By: anjali411

fbshipit-source-id: 488530088e2ff87dc27e70d21ace88ff2967e7ab
2020-04-24 12:08:23 -07:00
Gao, Xiang
7c7cb74887 Add missing ${CMAKE_CURRENT_SOURCE_DIR}/complex_test.cpp (#37080)
Summary:
This test is never built in OSS CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37080

Differential Revision: D21179296

Pulled By: anjali411

fbshipit-source-id: 22a5b82f17676213c8ec51642bef35dc61f9cace
2020-04-22 19:22:59 -07:00
Mike Ruberry
9216c67c9e Revert D21021677: [pytorch][PR] Add core of c10::complex
Test Plan: revert-hammer

Differential Revision:
D21021677

Original commit changeset: 9e144e581fa4

fbshipit-source-id: ce6a88fc71ec0134d0fc6ecdddc4c4db35f89b1f
2020-04-14 13:58:24 -07:00
Xiang Gao
25252816cf Add core of c10::complex (#35524)
Summary:
Step 0 of https://github.com/pytorch/pytorch/issues/35284

Reference: https://en.cppreference.com/w/cpp/numeric/complex
We are targeting C++20. The differences across C++ versions are mostly `constexpr` qualifiers; newer versions have more functions declared as `constexpr`.

This PR adds the core of `c10::complex`, it includes
- standard constructors as in `std::complex`
- explicit conversion constructors converting from `std/thrust::complex` to `c10::complex`
- standard assignment operators as in `std::complex`
- conversion assignment operators converting from `std/thrust::complex` to `c10::complex`
- other standard operators as in `std::complex`
- standard methods as in `std::complex`
- explicit casting operators to std/thrust
- basic non-member functions as in `std::complex`:
  - arithmetic operators
  - `==`, `!=`
  - `<<`, `>>`
  - `std::real`, `std::imag`, `std::abs`, `std::arg`, `std::norm`, `std::conj`, `std::proj`, `std::polar`
    - Some of them are intentionally not completely implemented; these are marked as `TODO` and will be implemented in the future.

This PR does not include:
- overload of math functions

which will come in the next PR
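
A hedged sketch exercising the pieces listed above (the header path is an assumption; the rest follows the `std::complex`-style interface described in the list):

```C++
#include <c10/util/complex.h>  // assumed header path
#include <complex>

int main() {
  c10::complex<float> z(1.0f, 2.0f);               // standard constructor
  z += c10::complex<float>(0.5f, -1.0f);           // arithmetic operator
  float re = z.real(), im = z.imag();              // standard methods
  std::complex<float> s(3.0f, 4.0f);
  c10::complex<float> from_std(s);                 // explicit conversion ctor
  auto back = static_cast<std::complex<float>>(z); // explicit cast back out
  (void)re; (void)im; (void)from_std; (void)back;
}
```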
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35524

Differential Revision: D21021677

Pulled By: anjali411

fbshipit-source-id: 9e144e581fa4b2bee62d33adaf756ce5aadc0c71
2020-04-14 11:00:24 -07:00
Natalia Gimelshein
b6306e1517 Revert D20624698: [pytorch][PR] Make GPU loops support mutable lambda
Test Plan: revert-hammer

Differential Revision:
D20624698

Original commit changeset: 06e398779345

fbshipit-source-id: d17059c692b4b460f3aa8081bc80c296ddb88228
2020-03-24 14:42:40 -07:00
Xiang Gao
39a101d06e Make GPU loops support mutable lambda (#35015)
Summary:
I will need it for https://github.com/pytorch/pytorch/pull/34004

The `mutable` qualifier allows a lambda to capture some values and modify its own copy. This would be useful for random kernels: we capture an RNG `state`, initialize it on the first run, and the initialized state is used later:

```C++
gpu_kernel(iter, [state, initialized](scalar_t arg) mutable -> scalar_t {
  if (!initialized) {
    curand_init(..., state);
    initialized = true;
  }
  return some_math(curand_uniform(state), arg);
});
```

The `operator()` of a `mutable` lambda is not `const`, so we cannot pass the lambda as a constant reference. It cannot be called inside a non-`mutable` lambda either.

Example usage:

```C++
auto t = at::empty({4096}, kCUDA);
float thread_work_index_ = 0;
auto iter = TensorIterator::nullary_op(t);
gpu_kernel(iter, [thread_work_index_]GPU_LAMBDA() mutable -> float {
  return thread_work_index_++;
});
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35015

Differential Revision: D20624698

Pulled By: ngimel

fbshipit-source-id: 06e3987793451cd514181d20252510297e2d28a9
2020-03-24 12:30:49 -07:00
Nikita Shulga
9dd5d51b01 [ATen] Exclude CUDA tests when running basic under valgrind (#34181)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34181

Test Plan: CI

Reviewed By: orionr, seemethere

Differential Revision: D20241021

fbshipit-source-id: a7371afc45acc2c07a36c8216036338e14170a56
2020-03-04 11:24:33 -08:00
Nikita Shulga
9d1c971b11 [Aten] Suppress valgrind leaks in libcuda (#34169)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34169

Valgrind has no insight into how memory is being initialized by ioctls()

Test Plan: CI

Reviewed By: seemethere

Differential Revision: D20235974

fbshipit-source-id: 46413afa4842e7d42582bbbda903438b1d98691f
2020-03-03 16:00:17 -08:00
Xiang Gao
9c2ed2574a Vectorized memory access in TensorIterator GPU loop for 1d contiguous case (#32383)
Summary:
Step 2 of https://github.com/pytorch/pytorch/issues/31975

Vectorized memory access is enabled. Generated code: https://github.com/zasdfgbnm/things/blob/master/2020Q1/disassembly-elementwise-vec.ipynb

```
void at::native::modern::elementwise_kernel<4, 64, 4, at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1}, at::detail::Array<char*, 3> >(int, at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1}, at::detail::Array<char*, 3>)

**ASM:**

	.section	.text._ZN2at6native6modern18elementwise_kernelILi4ELi64ELi4EZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE2_clEvEUlffE_NS_6detail5ArrayIPcLi3EEEEEviT2_T3_,"ax",progbits
	.sectioninfo	@"SHI_REGISTERS=20"
	.align	128
        .global         _ZN2at6native6modern18elementwise_kernelILi4ELi64ELi4EZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE2_clEvEUlffE_NS_6detail5ArrayIPcLi3EEEEEviT2_T3_
        .type           _ZN2at6native6modern18elementwise_kernelILi4ELi64ELi4EZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE2_clEvEUlffE_NS_6detail5ArrayIPcLi3EEEEEviT2_T3_,function
        .size           _ZN2at6native6modern18elementwise_kernelILi4ELi64ELi4EZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE2_clEvEUlffE_NS_6detail5ArrayIPcLi3EEEEEviT2_T3_,(.L_40898 - _ZN2at6native6modern18elementwise_kernelILi4ELi64ELi4EZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE2_clEvEUlffE_NS_6detail5ArrayIPcLi3EEEEEviT2_T3_)
        .other          _ZN2at6native6modern18elementwise_kernelILi4ELi64ELi4EZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE2_clEvEUlffE_NS_6detail5ArrayIPcLi3EEEEEviT2_T3_,@"STO_CUDA_ENTRY STV_DEFAULT"
_ZN2at6native6modern18elementwise_kernelILi4ELi64ELi4EZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE2_clEvEUlffE_NS_6detail5ArrayIPcLi3EEEEEviT2_T3_:
.text._ZN2at6native6modern18elementwise_kernelILi4ELi64ELi4EZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE2_clEvEUlffE_NS_6detail5ArrayIPcLi3EEEEEviT2_T3_:
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 294
        /*0000*/                   IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;
        /*0010*/              @!PT SHFL.IDX PT, RZ, RZ, RZ, RZ ;
        /*0020*/                   S2R R9, SR_CTAID.X ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 177
        /*0030*/                   S2R R0, SR_TID.X ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 294
        /*0040*/                   IMAD.SHL.U32 R9, R9, 0x100, RZ ;
        /*0050*/                   IADD3 R5, -R9, c[0x0][0x160], RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 256
        /*0060*/                   SHF.R.S32.HI R17, RZ, 0x1f, R9 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 296
        /*0070*/                   ISETP.GE.AND P0, PT, R5, 0x100, PT ;
        /*0080*/              @!P0 BRA `(.L_3173) ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 256
        /*0090*/                   IMAD.SHL.U32 R12, R9.reuse, 0x4, RZ ;
        /*00a0*/                   SHF.L.U64.HI R17, R9, 0x2, R17 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 260
        /*00b0*/                   IADD3 R8, P0, R12.reuse, c[0x0][0x188], RZ ;
        /*00c0*/                   IADD3 R2, P1, R12, c[0x0][0x190], RZ ;
        /*00d0*/                   IADD3.X R9, R17.reuse, c[0x0][0x18c], RZ, P0, !PT ;
        /*00e0*/                   IADD3.X R3, R17, c[0x0][0x194], RZ, P1, !PT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 218
        /*00f0*/                   IMAD.WIDE R8, R0, 0x10, R8 ;
        /*0100*/                   IMAD.WIDE R2, R0, 0x10, R2 ;
        /*0110*/                   LDG.E.128.SYS R8, [R8] ;
        /*0120*/                   LDG.E.128.SYS R4, [R2] ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 256
        /*0130*/                   IADD3 R12, P0, R12, c[0x0][0x180], RZ ;
        /*0140*/                   IADD3.X R13, R17, c[0x0][0x184], RZ, P0, !PT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 238
        /*0150*/                   IMAD.WIDE R12, R0, 0x10, R12 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 196
        /*0160*/                   FFMA R7, R7, c[0x0][0x168], R11 ;
        /*0170*/                   FFMA R6, R6, c[0x0][0x168], R10 ;
        /*0180*/                   FFMA R5, R5, c[0x0][0x168], R9 ;
        /*0190*/                   FFMA R4, R4, c[0x0][0x168], R8 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 238
        /*01a0*/                   STG.E.128.SYS [R12], R4 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 301
        /*01b0*/                   EXIT ;
.L_3173:
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 180
        /*01c0*/                   ISETP.GE.AND P0, PT, R0, R5, PT ;
        /*01d0*/                   BMOV.32.CLEAR RZ, B0 ;
        /*01e0*/                   BSSY B0, `(.L_3174) ;
        /*01f0*/               P0 BRA `(.L_3175) ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 183
        /*0200*/                   IADD3 R3, P1, R9, R0, RZ ;
        /*0210*/                   LEA.HI.X.SX32 R4, R0, R17, 0x1, P1 ;
        /*0220*/                   LEA R2, P1, R3, c[0x0][0x188], 0x2 ;
        /*0230*/                   LEA.HI.X R3, R3, c[0x0][0x18c], R4, 0x2, P1 ;
        /*0240*/                   LDG.E.SYS R8, [R2] ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 184
        /*0250*/                   IADD3 R4, R0, 0x40, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 180
        /*0260*/                   ISETP.GE.AND P1, PT, R4, R5, PT ;
        /*0270*/               P1 BRA `(.L_3175) ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 183
        /*0280*/                   LDG.E.SYS R4, [R2+0x100] ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 184
        /*0290*/                   IADD3 R6, R0, 0x80, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 180
        /*02a0*/                   ISETP.GE.AND P1, PT, R6, R5, PT ;
        /*02b0*/               P1 BRA `(.L_3175) ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 184
        /*02c0*/                   IADD3 R10, R0, 0xc0, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 183
        /*02d0*/                   LDG.E.SYS R7, [R2+0x200] ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 180
        /*02e0*/                   ISETP.GE.AND P1, PT, R10, R5, PT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 183
        /*02f0*/              @!P1 LDG.E.SYS R6, [R2+0x300] ;
.L_3175:
        /*0300*/                   BSYNC B0 ;
.L_3174:
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 180
        /*0310*/                   BMOV.32.CLEAR RZ, B0 ;
        /*0320*/                   BSSY B0, `(.L_3176) ;
        /*0330*/               P0 BRA `(.L_3177) ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 183
        /*0340*/                   IADD3 R3, P1, R9, R0, RZ ;
        /*0350*/                   LEA.HI.X.SX32 R10, R0, R17, 0x1, P1 ;
        /*0360*/                   LEA R2, P1, R3, c[0x0][0x190], 0x2 ;
        /*0370*/                   LEA.HI.X R3, R3, c[0x0][0x194], R10, 0x2, P1 ;
        /*0380*/                   LDG.E.SYS R11, [R2] ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 184
        /*0390*/                   IADD3 R10, R0, 0x40, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 180
        /*03a0*/                   ISETP.GE.AND P1, PT, R10, R5, PT ;
        /*03b0*/               P1 BRA `(.L_3177) ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 183
        /*03c0*/                   LDG.E.SYS R13, [R2+0x100] ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 184
        /*03d0*/                   IADD3 R10, R0, 0x80, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 180
        /*03e0*/                   ISETP.GE.AND P1, PT, R10, R5, PT ;
        /*03f0*/               P1 BRA `(.L_3177) ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 184
        /*0400*/                   IADD3 R10, R0, 0xc0, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 180
        /*0410*/                   ISETP.GE.AND P1, PT, R10, R5, PT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 183
        /*0420*/                   LDG.E.SYS R10, [R2+0x200] ;
        /*0430*/              @!P1 LDG.E.SYS R15, [R2+0x300] ;
.L_3177:
        /*0440*/                   BSYNC B0 ;
.L_3176:
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 193
        /*0450*/               P0 EXIT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 196
        /*0460*/                   IADD3 R9, P0, R9, R0, RZ ;
        /*0470*/                   FFMA R11, R11, c[0x0][0x168], R8 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 197
        /*0480*/                   IADD3 R14, R0, 0x40, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 196
        /*0490*/                   LEA.HI.X.SX32 R12, R0, R17, 0x1, P0 ;
        /*04a0*/                   LEA R2, P0, R9.reuse, c[0x0][0x180], 0x2 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 193
        /*04b0*/                   ISETP.GE.AND P1, PT, R14, R5, PT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 196
        /*04c0*/                   LEA.HI.X R3, R9, c[0x0][0x184], R12, 0x2, P0 ;
        /*04d0*/                   STG.E.SYS [R2], R11 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 193
        /*04e0*/               P1 EXIT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 197
        /*04f0*/                   IADD3 R8, R0, 0x80, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 196
        /*0500*/                   FFMA R13, R13, c[0x0][0x168], R4 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 193
        /*0510*/                   ISETP.GE.AND P0, PT, R8, R5, PT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 196
        /*0520*/                   STG.E.SYS [R2+0x100], R13 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 193
        /*0530*/               P0 EXIT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 197
        /*0540*/                   IADD3 R0, R0, 0xc0, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 196
        /*0550*/                   FFMA R7, R10, c[0x0][0x168], R7 ;
        /*0560*/                   FFMA R15, R15, c[0x0][0x168], R6 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 193
        /*0570*/                   ISETP.GE.AND P0, PT, R0, R5, PT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 196
        /*0580*/                   STG.E.SYS [R2+0x200], R7 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 193
        /*0590*/               P0 EXIT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 196
        /*05a0*/                   STG.E.SYS [R2+0x300], R15 ;
        /*05b0*/                   EXIT ;
.L_3178:
        /*05c0*/                   BRA `(.L_3178);
        /*05d0*/                   NOP;
        /*05e0*/                   NOP;
        /*05f0*/                   NOP;
.L_40898:
```

We can clearly see the `LDG.E.128` in it, which is a result of vectorization.
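
For intuition, a minimal standalone sketch of the technique (not PyTorch's actual Loops.cuh/MemoryAccess.cuh code; the kernel name and shapes are made up): reinterpreting a suitably aligned float pointer as float4 makes each thread issue one 128-bit load, which is exactly what shows up as `LDG.E.128` above.

```C++
// Minimal CUDA sketch of vectorized 128-bit memory access via float4.
__global__ void add_vec4(const float* a, const float* b, float* out, int n) {
  int i = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
  if (i + 3 < n) {
    float4 va = *reinterpret_cast<const float4*>(a + i);
    float4 vb = *reinterpret_cast<const float4*>(b + i);
    float4 vo = make_float4(va.x + vb.x, va.y + vb.y, va.z + vb.z, va.w + vb.w);
    *reinterpret_cast<float4*>(out + i) = vo;  // one 128-bit store
  } else {
    for (; i < n; ++i) out[i] = a[i] + b[i];   // scalar tail
  }
}
```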

Benchmark: https://github.com/zasdfgbnm/things/blob/master/2020Q1/benchmark-vec.ipynb

Benchmark on P100, dtype `uint8`:

before:
```
1.4.0a0+a5b4d78
e1d97025ee
22.2 µs ± 89.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
34.7 µs ± 38.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
52 µs ± 312 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
86.9 µs ± 135 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
154 µs ± 204 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
291 µs ± 668 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
566 µs ± 1.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.18 ms ± 1.54 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.29 ms ± 1.48 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.4 ms ± 1.15 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

after:
```
1.4.0a0+a5b4d78
1281cdfd8188fe86241ecaf71d001809d016c3a3
24 µs ± 116 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
30.5 µs ± 355 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
43.1 µs ± 300 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
67.6 µs ± 113 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
116 µs ± 275 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
215 µs ± 142 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
413 µs ± 791 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
824 µs ± 891 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.63 ms ± 478 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.19 ms ± 1.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

Benchmark on P100, dtype `half`:

Before:
```
1.4.0a0+a5b4d78
1c017f0c14
30.8 µs ± 226 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
43.4 µs ± 164 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
69.1 µs ± 83 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
119 µs ± 103 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
224 µs ± 99.1 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
418 µs ± 206 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
865 µs ± 237 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.69 ms ± 695 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.3 ms ± 527 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
6.77 ms ± 741 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

After

```
1.4.0a0+a5b4d78
7e50ee27333e7047072d328d03767b4845286356
28.9 µs ± 61.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
40.2 µs ± 244 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
63.8 µs ± 350 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
109 µs ± 196 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
199 µs ± 157 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
380 µs ± 446 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
743 µs ± 2.17 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.47 ms ± 1.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.91 ms ± 9.17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.8 ms ± 296 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

cc: csarofeen ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32383

Differential Revision: D19697455

Pulled By: ngimel

fbshipit-source-id: 0707481c2f334e6634c000b4afd275b2fee8fbe1
2020-02-03 16:20:40 -08:00
Wanchao Liang
3613a30345 Move dict_test.cpp to test folder and fix dict_test.cpp for Aten includes (#24071)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24071

Pull Request resolved: https://github.com/pytorch/pytorch/pull/24071

Test Plan: Imported from OSS

Differential Revision: D16728574

Pulled By: wanchaol

fbshipit-source-id: 6952b9703a40dc35f567bf17fbdcef6e0c6c2d6e
2019-08-08 22:41:16 -07:00
Syed Tousif Ahmed
effcc398c4 Refactor Random Number Generators in ATen (#21555)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21555
ghimport-source-id: dd900a8c3e1ef9ef1e011b8bb5476626d18cc462

Test Plan: Imported from OSS

Differential Revision: D15875780

Pulled By: ezyang

fbshipit-source-id: 6e04e90af62ab9c9593d74f344a3a084aaaf6f43
2019-06-19 13:54:09 -07:00
Syed Tousif Ahmed
ae342fd076 Refactor Random Number Generators in ATen (#21364)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21364
ghimport-source-id: ca7d37e10190ba46dc8512f437404ca9216d3369

Differential Revision: D15696497

Pulled By: ezyang

fbshipit-source-id: 2e713b8566ae915e175b5a79ac1dd9b86cc2a23d
2019-06-12 13:01:30 -07:00
Richard Zou
8ffcbfb7d4 Add unique_ptr<NamedTensorMeta> field to TensorImpl (#21341)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21341
ghimport-source-id: 06021b06864746571a904a1cfc0aaea5f8a12325

Differential Revision: D15717907

Pulled By: zou3519

fbshipit-source-id: 48ee76cf2f11a8b092be75ecac8d5faee68ca0d9
2019-06-10 07:29:36 -07:00
Richard Zou
4727685ea1 Added at::Dimname (#21280)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21280
ghimport-source-id: 921848326e4828ffd422868be26c409c6490e1ab

Differential Revision: D15698516

Pulled By: zou3519

fbshipit-source-id: 502b9b019d51dd46327e6caf2af69aa520c70cb6
2019-06-07 06:30:42 -07:00
Sam Gross
25a6ff10f0 Add gtest for TensorIterator (#21253)
Summary:
This adds a regression test for the bug fix in #21236. Operations
involving CUDA tensors and CPU scalars should not copy the CPU scalar to
the device (because that is slow). They should instead "lift" the scalar
to a kernel parameter.
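A hedged sketch of the scenario under test (illustrative ATen usage; not the gtest's actual code):

```C++
#include <ATen/ATen.h>

void demo() {
  at::Tensor t = at::ones({1024}, at::kCUDA);
  // The CPU scalar 5 should be "lifted" to a kernel parameter rather
  // than copied to the device as a tensor.
  at::Tensor r = t.add(5);
}
```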
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21253

Reviewed By: bddppq

Differential Revision: D15604080

Pulled By: colesbury

fbshipit-source-id: c14ded5d584499eaa5ea83337ffc50278205f3d6
2019-06-04 07:23:42 -07:00
Syed Tousif Ahmed
67414714e5 Move THCTensor_(uniform) to ATen (#20292)
Summary:
As a first step for this plan: https://github.com/pytorch/pytorch/issues/19508#issuecomment-485178192, this PR moves `THCTensor_(uniform)` to ATen. Major changes are:
- `uniform_` cuda kernel now utilizes a philox generator.
- the kernel also utilizes TensorIterator
- the kernel uses a grid-stride loop to achieve peak effective bandwidth (see the sketch below)
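
A hedged sketch of such a kernel (illustrative; not the PR's exact code — the seed/offset plumbing is simplified here):

```C++
#include <curand_kernel.h>

// Grid-stride loop with per-thread Philox state: each thread calls
// curand_init once, then strides over the whole output.
__global__ void uniform_kernel(float* out, int64_t n,
                               uint64_t seed, uint64_t offset) {
  int64_t idx = blockIdx.x * (int64_t)blockDim.x + threadIdx.x;
  curandStatePhilox4_32_10_t state;
  curand_init(seed, /*subsequence=*/idx, offset, &state);
  for (int64_t i = idx; i < n; i += (int64_t)blockDim.x * gridDim.x) {
    out[i] = curand_uniform(&state);  // uniform in (0, 1]
  }
}
```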

- Since the engine has changed from `curandStateMTGP32` to `curandStatePhilox4_32_10`, the random numbers generated will now be different.
- Here is the diff showing codegen changes: https://gist.github.com/syed-ahmed/4af9ae0d42b6c7dbaa13b9dd0d1dd1e8 (BC breaking change if any)

- Philox4_32_10 is known to pass the standard TestU01 Big Crush test (https://www.thesalmons.org/john/random123/papers/random123sc11.pdf) and hence the quality of random numbers generated isn't an issue when compared to the previously used `curandStateMTGP32`.
- I have added a test case in `aten/src/ATen/test/cuda_distributions_test.cu` which verifies that philox offset is incremented properly

The benchmark was done on a DGX station with 4 V100s.
I modified the script from jcjohnson's [multinomial benchmark](https://github.com/jcjohnson/pytorch-multinomial-benchmark) to produce this notebook, which shows that there is a general speedup with this PR and that no regression has been introduced: https://gist.github.com/syed-ahmed/9d26d4e96308aed274d0f2c7be5218ef

To reproduce the notebook:
- Run https://gist.github.com/syed-ahmed/4208c22c541f1d30ad6a9b1efc1d728f in a container with the current pytorch top of tree with the command: `python uniform_benchmark.py --stats_json before.json`
- Apply this diff to the current pytorch top of tree and run the same script in a container with the command: `python uniform_benchmark.py --stats_json after.json`
- Run the notebook attached above with the `after.json` and `before.json` in the same directory

The effective bandwidth was calculated using the script (thanks to ngimel): https://gist.github.com/syed-ahmed/f8b7384d642f4bce484228b508b4bc68
Following are the numbers before and after.
```
uniform, size, elements 65536 forward 5.168914794921875e-06 bandwidth (GB/s) 50.71548098597786
uniform, size, elements 131072 forward 5.056858062744141e-06 bandwidth (GB/s) 103.67860705101367
uniform, size, elements 262144 forward 7.164478302001953e-06 bandwidth (GB/s) 146.357621001797
uniform, size, elements 524288 forward 1.1217594146728515e-05 bandwidth (GB/s) 186.9520302275877
uniform, size, elements 1048576 forward 1.923084259033203e-05 bandwidth (GB/s) 218.10297600317384
uniform, size, elements 2097152 forward 3.640890121459961e-05 bandwidth (GB/s) 230.39992200138826
uniform, size, elements 4194304 forward 6.778717041015625e-05 bandwidth (GB/s) 247.49839679819922
uniform, size, elements 8388608 forward 0.00012810707092285157 bandwidth (GB/s) 261.92490202361347
uniform, size, elements 16777216 forward 0.00025241613388061524 bandwidth (GB/s) 265.86598474620627
uniform, size, elements 33554432 forward 0.000497891902923584 bandwidth (GB/s) 269.5720239913193
```
```
uniform, size, elements 65536 forward 5.550384521484375e-06 bandwidth (GB/s) 47.22988091821306
uniform, size, elements 131072 forward 5.581378936767578e-06 bandwidth (GB/s) 93.93520954942333
uniform, size, elements 262144 forward 6.165504455566406e-06 bandwidth (GB/s) 170.071404141686
uniform, size, elements 524288 forward 6.3276290893554685e-06 bandwidth (GB/s) 331.4277702414469
uniform, size, elements 1048576 forward 8.509159088134765e-06 bandwidth (GB/s) 492.91639239047356
uniform, size, elements 2097152 forward 1.2989044189453124e-05 bandwidth (GB/s) 645.8218077979443
uniform, size, elements 4194304 forward 2.347707748413086e-05 bandwidth (GB/s) 714.6211452997259
uniform, size, elements 8388608 forward 4.4286251068115234e-05 bandwidth (GB/s) 757.6715389250498
uniform, size, elements 16777216 forward 8.672237396240235e-05 bandwidth (GB/s) 773.8356427961071
uniform, size, elements 33554432 forward 0.00016920566558837892 bandwidth (GB/s) 793.2224227438523
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20292

Differential Revision: D15277761

Pulled By: ezyang

fbshipit-source-id: 8bfe31a01eeed77f0ed6e7ec4d2dda4c6472ecaa
2019-05-13 09:38:28 -07:00
Alex Şuhan
9811a4220d Add XLA / TPU device type, backend type and type id (#16763)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16763

Replicate the easy bits in https://github.com/pytorch/pytorch/pull/15153 with TPU / XLA instead of MSNPU. Also don't initialize the storage for XLA tensors for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16585

Reviewed By: ezyang

Differential Revision: D13912118

Pulled By: gchanan

fbshipit-source-id: 4889177e2478768fb281ed075b71146d1d850bd9
2019-02-05 12:56:44 -08:00
Roy Li
7e642dfff3 Introduce backend extensions (overriding operators on custom backends)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/15153

Reviewed By: gchanan

Differential Revision: D13445571

fbshipit-source-id: 62e2ebe0a6e81c4983b47cddb57ee5eb78e96708
2019-02-01 11:00:16 -08:00
Dmytro Dzhulgakov
a061e3fd77 Back out "Revert D13596031: Improve c2-aten tensor interop and add proper testing" (#16514)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16514

Original commit changeset: dc371697f14b
Relanding https://github.com/pytorch/pytorch/pull/15860 - the problem was that layer_norm was using at::empty, which is not yet available on mobile

Reviewed By: ezyang

Differential Revision: D13861480

fbshipit-source-id: e2116da32bc117175c96b9151b1beba9b31eff36
2019-01-31 13:38:55 -08:00
Edward Yang
3b337e7892 Revert D13596031: Improve c2-aten tensor interop and add proper testing
Differential Revision:
D13596031

Original commit changeset: d20b601e06ba

fbshipit-source-id: dc371697f14b3893a9164380a39e7a49d8d68ecf
2019-01-29 07:14:57 -08:00
Dmytro Dzhulgakov
5e21e0fe75 Improve c2-aten tensor interop and add proper testing (#15860)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15860

Few changes (which are harder to split in separate diffs, so together):
- make conversions explicit (as they can throw), to avoid surprises
- fix tensor legacy dispatch not initialized when tensor is created on C2 side
- add a bunch of invariants to enforce

Reviewed By: ezyang

Differential Revision: D13596031

fbshipit-source-id: d20b601e06ba47aeff2f6e8e15769840e2d46108
2019-01-28 23:41:50 -08:00
Roy Li
50fbf79451 test basic tensor interop
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/12249

Differential Revision: D13469356

Pulled By: li-roy

fbshipit-source-id: b49748462aa44ac34b8ce79783f2c895a537a232
2018-12-27 17:04:00 -08:00
Peter Goldsborough
5987b44dda Remove aten doc/ folder (#11158)
Summary:
ATen's doc/ folder is manually maintained and can thus cause confusion with the generated file. We now have proper online documentation for ATen, which is superior to ATen doc/. Let's delete ATen/doc.

ezyang apaszke soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11158

Differential Revision: D9618782

Pulled By: goldsborough

fbshipit-source-id: 0ef14f84947601a0589aa4a41e5c8619783426fe
2018-08-31 14:55:13 -07:00
Gregory Chanan
7842b6d0f7 Fix at::optional compile problems on Windows CUDA.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/10909

Differential Revision: D9516837

Pulled By: gchanan

fbshipit-source-id: fad7e3284e74c599b873ebaae2dcdf5013505855
2018-08-27 14:40:41 -07:00
Syed Tousif Ahmed
1fbabff76a Refactor THCNumerics and add common math functions for at::Half (#10301)
Summary:
**Summary**: This PR is a followup to mruberry's https://github.com/pytorch/pytorch/pull/9318/. It tries to achieve the following:
- Specialize std common math functions for the `at::Half` type.
- Create `CUDANumerics.cuh` to contain necessary parts from `THCNumerics.cuh`.
- Update `THCNumerics.cuh` with new usage and comments to demonstrate best practice for developers, hence making way for its deprecation.
- Remove legacy/redundant code paths.
- Remove unused CUDA HALF macros (see separate PR https://github.com/pytorch/pytorch/pull/10147)

**Comments**: `CUDANumerics.cuh` contains mathematical functions that are either not in the std namespace or are specialized for compilation with CUDA NVCC or CUDA NVRTC. This header is derived from the legacy `THCNumerics.cuh`. Following is some rationale for why some functions were kept while others were removed:
- All arithmetic can now be done in ATen using binary CUDA kernels or CUDA tensor pointwise apply (check https://github.com/pytorch/pytorch/pull/8919 and `CUDAApplyUtils`). `at::Half` comparisons rely on implicit conversion to float.
- Functions that are C/C++ standard compliant have been specialized for user-defined types; for instance, the std namespace has been opened up for `at::Half`, defining math functions for `at::Half`. Check `Half-inl.h` (and the sketch after this list).
- Some standard-compliant functions are specialized here for performance reasons. For instance, `powi` is used for `pow` calculation on integral types. Moreover, `abs`, `isinf`, `isnan` are specialized to save one API call versus going through std, although this is subject to change depending on whether we really care about saving one API call.
- Numeric limits such as `max/min` are removed since they call standard defines. Moreover, numeric limits for `at::Half` are present in `Half-inl.h`. I understood that HIP has some issue with `std::numeric_limits`, and this is the related GitHub issue I found: https://github.com/ROCm-Developer-Tools/HIP/issues/374. AlexVlx mentions that the issue can be avoided by using `std::numeric_limits` in `__device__`. Since we are launching lambdas with device contexts, I don't see why `std::numeric_limits` wouldn't compile on HIP if used within a kernel, unless I am unaware of the real reason why max/min was in THCNumerics in the first place. (I haven't tried a build with HIP.)
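
A hedged illustration of the kind of specialization described above (the header path is an assumption, and this is not the exact `Half-inl.h` code): compute in float, then narrow back to `at::Half`.

```C++
#include <cmath>
#include <ATen/Half.h>  // assumed header for at::Half at the time

namespace std {
inline at::Half sin(at::Half h) {
  // Promote to float, use the standard float implementation, narrow back.
  return at::Half(std::sin(static_cast<float>(h)));
}
} // namespace std
```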

Here are some reference PRs that was handy in refactoring TH into ATen:
- https://github.com/pytorch/pytorch/pull/6786
- https://github.com/pytorch/pytorch/pull/5475
- https://github.com/pytorch/pytorch/pull/9401
- https://github.com/pytorch/pytorch/pull/8689
- https://github.com/pytorch/pytorch/pull/8919
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10301

Differential Revision: D9204758

Pulled By: soumith

fbshipit-source-id: 09f489c1656458c02367b6cd31c3eeeca5acdc8a
2018-08-24 16:02:06 -07:00
mruberry
d6f21fc663 Ports Streams to ATen (#8997)
Summary:
This PR moves the THCStream logic (from both the THCStream and THCState APIs) to ATen. In particular, it:

+ Creates a new (THC free) at::CUDAStream class and API
+ Extends the at::Context API to expose it
+ Stubs the current THCStream and THCState APIs to use it
+ Updates THC to no longer violate stream encapsulation (stream.hpp is dead)
+ Adds an ATen cpp test of the API
+ Bonus: Removes some debug spew in test_nn.py

The new API has several advantages over the old one:

(1) It comes with an easy-to-use RAII type, the CUDAStream (see the sketch after this list). CUDAStreams have the expected copy and move semantics and are implicitly convertible to cudaStream_t.
(2) It does not depend on THCState, THCThreadLocal, or CUDA (thanks to goldsborough for suggesting the dynamic registration technique)
(3) It provides one consistent API/place for all stream operations, instead of having them split between THCStream and THCState
(4) The internals are completely encapsulated, unlike the historic THCStream
(5) It has getAndRetain semantics, which are safer than the historic gets (which allowed a gap between acquisition and retention)

There are a couple things this PR does not do, however, which are left for future work:

- It leaves the c10d::CUDAStream class as a THCStream wrapper (which now really wraps an at::CUDAStream).
- It leaves historic users of THCStream mostly untouched, except where they violated encapsulation (by using stream.hpp). A couple forward declarations were also changed.

I hope this PR allows easy usage of streams from ATen and is a useful pattern for porting more of the THCState API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8997

Differential Revision: D8683375

Pulled By: soumith

fbshipit-source-id: 2e48ad85f1f9c8817684fe63a267938e80eafdcf
2018-07-08 16:25:09 -07:00
Orion Reblitz-Richardson
4bf0202cac [build] Have PyTorch depend on minimal libcaffe2.so instead of libATen.so (#7399)
* Have PyTorch depend on minimal libcaffe2.so instead of libATen.so

* Build ATen tests as a part of Caffe2 build

* Hopefully cufft and nvcc fPIC fixes

* Make ATen install components optional

* Add tests back for ATen and fix TH build

* Fixes for test_install.sh script

* Fixes for cpp_build/build_all.sh

* Fixes for aten/tools/run_tests.sh

* Switch ATen cmake calls to USE_CUDA instead of NO_CUDA

* Attempt at fix for aten/tools/run_tests.sh

* Fix typo in last commit

* Fix valgrind call after pushd

* Be forgiving about USE_CUDA disable like PyTorch

* More fixes on the install side

* Link all libcaffe2 during test run

* Make cuDNN optional for ATen right now

* Potential fix for non-CUDA builds

* Use NCCL_ROOT_DIR environment variable

* Pass -fPIC through nvcc to base compiler/linker

* Remove THCUNN.h requirement for libtorch gen

* Add Mac test for -Wmaybe-uninitialized

* Potential Windows and Mac fixes

* Move MSVC target props to shared function

* Disable cpp_build/libtorch tests on Mac

* Disable sleef for Windows builds

* Move protos under BUILD_CAFFE2

* Remove space from linker flags passed with -Wl

* Remove ATen from Caffe2 dep libs since directly included

* Potential Windows fixes

* Preserve options while sleef builds

* Force BUILD_SHARED_LIBS flag for Caffe2 builds

* Set DYLD_LIBRARY_PATH and LD_LIBRARY_PATH for Mac testing

* Pass TORCH_CUDA_ARCH_LIST directly in cuda.cmake

* Fixes for the last two changes

* Potential fix for Mac build failure

* Switch Caffe2 to build_caffe2 dir to not conflict

* Cleanup FindMKL.cmake

* Another attempt at Mac cpp_build fix

* Clear cpp-build directory for Mac builds

* Disable test in Mac build/test to match cmake
2018-05-24 07:47:27 -07:00
Mike Ruberry
37b9d093d2 Updates collapseDims() function and documentation (#7056)
* Updates collapseDims() function and documentation

* Adds C++ tests, validates input, updates names for readability

* Removes invalid test

* stashing to merge AT_CHECK macro

* Updates asserts, removes tests on Windows
2018-05-12 23:42:55 -04:00
cpuhrsch
ae35e0e924 Support non-contiguous tensors for unary ops (#6119)
2018-04-27 21:31:34 +02:00
Will Feng
e089849b4a Add mutex to THC random number generator (#6527)
* Add mutex to THC random number generator

* Add test for CUDA RNG multithread

* fix lint

* Rename gen_state to state and remove unnecessary mutex lock

* Remove RNG test from cpp_extensions

* Add CUDA RNG test to libtorch

* Build test_rng only if CUDA exists

* Move test to aten/src/ATen/test/

* Separate ATen build and test, and run ATen test in CI test phase

* Don't test ATen in ASAN build

* Fix bug in ATen scalar_test

* Fix bug in ATen native_test

* Add FIXME to some CUDA tests in scalar_tensor_test

* Valgrind doesn't work well with CUDA; seed the CPU and CUDA RNG separately instead
2018-04-18 15:54:13 -04:00