Commit Graph

135 Commits

Author SHA1 Message Date
Hannes Friederich
5932c37198 [caffe2] drop XROS ports (#76366)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76366

caffe2 is not currently being built for XROS.

Test Plan: CI

Reviewed By: kimishpatel

Differential Revision: D35923922

fbshipit-source-id: 260dacadf0bd5b6bab7833a4ce81e896d280b053
(cherry picked from commit 8370b8dd2519d55a79fa8d45e7951ca8dc0b21a8)
2022-04-26 23:54:22 +00:00
Shintaro Iwasaki
7dc2cfa249 [c10][rocm] fix __assert_fail() declaration mismatch error (#73040)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73040

This patch fixes a compilation error in PyTorch with ROCm when `NDEBUG` is passed.

## Problem

Forward declaration of `__host__ __device__ __assert_fail()` is used in `c10/macros/Macros.h` for HIP compilation when `NDEBUG` is set. However, HIP declares a `__device__`-only `__assert_fail()` in `hip/amd_detail/amd_device_functions.h`, causing a function type mismatch error.

This issue does not appear in ROCm CI tests since it happens only when `NDEBUG` is passed.

## Solution

[EDIT] After the discussion on GitHub, we chose to entirely disable `CUDA_KERNEL_ASSERT()` for ROCm.

 ---

To solve this compilation error, this patch disables `CUDA_KERNEL_ASSERT()`, which uses `__assert_fail()` when
1. `c10/macros/Macros.h` is included for `*.hip` (precisely speaking, `__HIP__` or `__HIP_ARCH__` is defined), and
2. `NDEBUG` is passed.

Note that there's no impact on default compilation because, without a special compilation flag, those HIP files are compiled without `-DNDEBUG`, which is why this issue had not been found earlier.
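
The guard described by conditions 1 and 2 could look roughly like the following minimal sketch (not the exact `c10/macros/Macros.h` source; per the EDIT above, the landed change disables the assert for ROCm entirely):
```cpp
// Minimal sketch only: compile the kernel assert out for HIP device code when
// NDEBUG is set, and keep the __assert_fail-based version otherwise.
#if defined(NDEBUG) && (defined(__HIP__) || defined(__HIP_ARCH__))
#define CUDA_KERNEL_ASSERT(cond)
#else
#define CUDA_KERNEL_ASSERT(cond)                                          \
  if (!(cond)) {                                                          \
    __assert_fail(                                                        \
        #cond, __FILE__, static_cast<unsigned int>(__LINE__), __func__);  \
  }
#endif
```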

### Justification
[1] We cannot declare one host-and-device function for two separate host and device functions.
```
__device__ int func() { return 0; }
__host__ int func() { return 0; }
// Compile error (hipcc) if these are instead declared as one combined function:
// __device__ __host__ int func();
```
[2] Forward declaration of a correct `__device__`-only `__assert_fail()` for `__HIP__` causes the following error:
```
pytorch/c10/util/TypeCast.h:135:7: error: reference to __device__ function '__assert_fail' in __host__ __device__ function
      ERROR_UNSUPPORTED_CAST
      ^
pytorch/c10/util/TypeCast.h:118:32: note: expanded from macro 'ERROR_UNSUPPORTED_CAST'
#define ERROR_UNSUPPORTED_CAST CUDA_KERNEL_ASSERT(false);
                               ^
pytorch/c10/macros/Macros.h:392:5: note: expanded from macro 'CUDA_KERNEL_ASSERT'
    __assert_fail(
```

[3] Maybe there's a way to properly define `__assert_fail()` for HIP + NDEBUG, but this might be too much. Please let me just disable it.

### Technical details

Error
```
pytorch/c10/macros/Macros.h:368:5: error: __host__ __device__ function '__assert_fail' cannot overload __device__ function '__assert_fail'
    __assert_fail(
    ^
/opt/rocm/hip/include/hip/amd_detail/amd_device_functions.h:1173:6: note: previous declaration is here
void __assert_fail(const char *assertion,
```

CUDA definition (9.x) of `__assert_fail()`
```
#elif defined(__GNUC__)
extern __host__ __device__ __cudart_builtin__ void __assert_fail(
  const char *, const char *, unsigned int, const char *)
  __THROW;
```

ROCm definition (the latest version)
```
// 2b59661f3e/include/hip/amd_detail/amd_device_functions.h (L1172-L1177)
extern "C" __device__ __attribute__((noinline)) __attribute__((weak))
void __assert_fail(const char *assertion,
                   const char *file,
                   unsigned int line,
                   const char *function);
```

Test Plan:
CI + reproducer
```
python3 tools/amd_build/build_amd.py
python3 setup.py develop --cmake-only
cmake -DHIP_HIPCC_FLAGS_RELEASE="-DNDEBUG" build
cmake --build build
```

Reviewed By: xw285cornell

Differential Revision: D34310555

fbshipit-source-id: 7542288912590533ced3f20afd2e704b6551991b
(cherry picked from commit 9e52196e36820abe36bf6427cabc7389d3ea6cb5)
2022-03-01 04:35:30 +00:00
Zhengxu Chen
fe277b8717 [jit][edge] Migrate to TypeFactory for jit types on mobile (#71516)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71516

Mobile should be able to construct dynamic types by default.
ghstack-source-id: 147498365

Test Plan:
CI.

**-48KB** binary size reduction for igios BSB.
UMBEX link: https://www.internalfb.com/intern/unigraph/explorer/?jsgq_traversal_spec=%7B%22builds%22%3A[%22bsb%3A422553426218394%5Cu0040base%22%2C%22bsb%3A422553426218394%5Cu0040diff%22]%7D&unigraph_project=UnigraphProjectMbex&is_mbex_redirected

Reviewed By: iseeyuan

Differential Revision: D33673958

fbshipit-source-id: 8600c04ae929283681971aae264d3774188df9cd
(cherry picked from commit 64ebcec09e)
2022-01-26 07:32:04 +00:00
CodemodService FBSourceClangFormatLinterBot
60632a00fe [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D33561057

fbshipit-source-id: 79873717c45c8bbe6d0ae760e718770fd960185d
2022-01-13 03:27:06 -08:00
Nolan O'Brien
8f4cec2231 [warnings][Caffe2] Suppress warnings in caffe2 headers (#71196)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71196

`caffe2` headers contain code that can elicit warnings when built with strict compiler flags.  Rather than force downstream/consuming code to weaken their compiler flags, suppress those warnings in the header using `#pragma clang diagnostic` suppressions.
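
The suppression pattern is roughly the following (illustrative only; the specific warnings silenced in this commit may differ):
```cpp
// Wrap the warning-producing header contents so consumers with strict flags
// are not affected; only active when compiling with clang.
#if defined(__clang__)
#pragma clang diagnostic push
#pragma clang diagnostic ignored "-Wshadow"
#endif

// ... header contents that would otherwise trigger the warning ...

#if defined(__clang__)
#pragma clang diagnostic pop
#endif
```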

Test Plan: CI Pass

Reviewed By: malfet

Differential Revision: D33536233

fbshipit-source-id: 74404e7a5edaf244f79f7a0addd991a84442a31f
2022-01-12 10:16:35 -08:00
Scott Wolchok
d026057bb3 [PyTorch] Update SmallVector from LLVM (#69110)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69110

I pasted in the current LLVM code, reapplied the modifications listed in the code comments, and caught a few more in the diff/build process. The trivially-copyable detection is different now; if gcc builds fail, I will try reverting to C10_IS_TRIVIALLY_COPYABLE or copying what LLVM is doing.

The motivation for this change is that, as noted in an existing comment, C10_IS_TRIVIALLY_COPYABLE did the wrong thing for std::unique_ptr, which caused problems with D32454856 / #68412.

ghstack-source-id: 145327773

Test Plan: CI

Reviewed By: bhosmer, mruberry

Differential Revision: D32733017

fbshipit-source-id: 9452ab90328e3fdf457aad23a26f2f6835b0bd3d
2021-12-10 11:57:19 -08:00
Mengwei Liu
53aac4b6f3 [PyTorch] Allow override for macro HAS_DEMANGLE (#66540)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66540

Currently the macro `HAS_DEMANGLE` is determined by compiler predefined macros. Here I'm adding an option to allow `HAS_DEMANGLE` to be defined in build files.
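
A sketch of the override pattern (the platform checks shown are illustrative, not necessarily the exact ones in the source):
```cpp
// Only derive HAS_DEMANGLE from compiler predefined macros when the build
// files have not already defined it.
#if !defined(HAS_DEMANGLE)
#if defined(__GNUG__) && !defined(_WIN32) && !defined(__ANDROID__)
#define HAS_DEMANGLE 1
#else
#define HAS_DEMANGLE 0
#endif
#endif
```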

Test Plan: Rely on CI

Reviewed By: poweic

Differential Revision: D31600007

fbshipit-source-id: 76cf088b0f5ee940e977d3b213f1446ea64be036
2021-10-17 16:10:45 -07:00
Mengwei Liu
d8532e3524 [PyTorch] Split c10 Type.cpp into two files to allow targets to include one of them (#66445)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66445

`Type.cpp` implements the `demangle()` function based on the macro `HAS_DEMANGLE`. This diff splits it into two `.cpp` files so that we can add either one to the build target. This change follows the pattern of `flags_use_no_gflags.cpp` and `flags_use_gflags.cpp`.

Test Plan: Rely on CI

Reviewed By: iseeyuan

Differential Revision: D31551432

fbshipit-source-id: f8b11783e513fa812228ec873459ad3043ff9147
2021-10-11 21:52:24 -07:00
Karol Kosik
eb3b9fe719 [XROS][ML] System specific adjustments for UTs to work. (#65245)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65245

Building and running c10 and qnnpack tests on XROS.

Notable changes:
- Adding `#if defined(_XROS_)` in a few places not supported by XROS
- Changing Threadpool to an abstract class
ghstack-source-id: 139513579

Test Plan: Run c10 and qnnpack tests on XROS.

Reviewed By: veselinp, iseeyuan

Differential Revision: D30137333

fbshipit-source-id: bb6239b935187fac712834341fe5a8d3377762b1
2021-10-01 18:15:14 -07:00
Pruthvi Madugundu
085e2f7bdd [ROCm] Changes not to rely on CUDA_VERSION or HIP_VERSION (#65610)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65610

- Replace HIP_PLATFORM_HCC with USE_ROCM
- Don't rely on CUDA_VERSION or HIP_VERSION; use USE_ROCM and ROCM_VERSION instead (see the sketch below).

- In the next PR:
   - Remove the mapping from CUDA_VERSION to HIP_VERSION and from CUDA to HIP in hipify.
   - HIP_PLATFORM_HCC is deprecated, so add HIP_PLATFORM_AMD to support HIP host code compilation with gcc.
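
A sketch of the guard style after this change (the macro body and version number below are hypothetical, for illustration only):
```cpp
// Branches previously keyed off __HIP_PLATFORM_HCC__ / HIP_VERSION now key
// off USE_ROCM / ROCM_VERSION.
#if defined(USE_ROCM) && defined(ROCM_VERSION) && ROCM_VERSION >= 40000
#define EXAMPLE_HAS_NEW_ROCM_FEATURE 1
#else
#define EXAMPLE_HAS_NEW_ROCM_FEATURE 0
#endif
```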

cc jeffdaily sunway513 jithunnair-amd ROCmSupport amathews-amd

Reviewed By: jbschlosser

Differential Revision: D30909053

Pulled By: ezyang

fbshipit-source-id: 224a966ebf1aaec79beccbbd686fdf3d49267e06
2021-09-29 09:55:43 -07:00
Jane Xu
6707dfeefb Remove 9.2 related macros for CONSTEXPR (#65066)
Summary:
Removes C10_HOST_CONSTEXPR_EXCEPT_CUDA92 references in the code

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65066

Reviewed By: driazati

Differential Revision: D31022520

Pulled By: janeyx99

fbshipit-source-id: f02cdc6caba5b48405575242921f5845ff18f729
2021-09-17 17:31:20 -07:00
Jeff Daily
c7b03e2b83 [ROCm] define C10_WARP_SIZE to warpSize HIP constant (#64302)
Summary:
warpSize is defined as a constexpr in the HIP headers. It is incorrect to assume a warp size of 64. This change fixes the C10_WARP_SIZE definition in torch sources, similar to [how it was done in caffe2](https://github.com/pytorch/pytorch/blob/master/caffe2/utils/GpuDefs.cuh#L10-L14).
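
The resulting definition is roughly of this form (a sketch; the actual Macros.h may differ in details):
```cpp
#if defined(__HIP_PLATFORM_HCC__)
#define C10_WARP_SIZE warpSize  // constexpr provided by the HIP headers (64 or 32)
#else
#define C10_WARP_SIZE 32
#endif
```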

cc jeffdaily sunway513 jithunnair-amd ROCmSupport

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64302

Reviewed By: mrshenli

Differential Revision: D30785975

Pulled By: malfet

fbshipit-source-id: 68f8333182ad4d02bd0c8d02f1751a50bc5bafa7
2021-09-10 09:43:47 -07:00
Dmytro Dzhulgakov
f446e835ee Fix CUDA_KERNEL_ASSERT ambiguous symbol in NDEBUG mode (#62527)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62527

If NDEBUG is applied inconsistently across a compilation we might get an 'ambiguous declaration' error. Let's make sure that the forward declaration matches glibc, including all specifiers.
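
For reference, the declaration the forward declaration has to agree with looks approximately like this on glibc systems (a sketch; the exact PyTorch declaration may differ):
```cpp
#include <assert.h>  // brings in glibc's own declaration and the __THROW macro

// Redeclaration matching glibc's signature, including __THROW and noreturn;
// a mismatch in these specifiers is what triggers the ambiguity described above.
extern "C" void __assert_fail(
    const char* assertion,
    const char* file,
    unsigned int line,
    const char* function) __THROW __attribute__((__noreturn__));
```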

Test Plan: sandcastle

Reviewed By: mdschatz

Differential Revision: D30030051

fbshipit-source-id: 9f4d5f1d4e74f0a4eaeeaaaad76b93ee485d8bcd
2021-08-11 01:10:09 -07:00
Brian Hirsh
7bc86458e1 Revert "Revert D28833086: beef up at::_ops API" (#60214)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60214

Relanding this PR, but with a fix for windows cuda builds (example failure in master here: https://github.com/pytorch/pytorch/runs/2852662871)

This is identical to the original PR except for one change in `tools/codegen/gen.py`: `static constexpr` -> `static CONSTEXPR_EXCEPT_WIN_CUDA`

This actually took a while to figure out, until I tracked down a previous pytorch PR that encountered a similar issue: https://github.com/pytorch/pytorch/pull/40675

This reverts commit 6d0fb85a62.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D29213932

Pulled By: bdhirsh

fbshipit-source-id: b90c7c10e5a51f8d6173ddca673b418e5774c248
2021-06-24 18:08:54 -07:00
Erjia Guan
65f33ec85c Follow-up fix for compilation error on CUDA92 (#60287)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60287

Follow up of #60017

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D29236208

Pulled By: ejguan

fbshipit-source-id: f1acf9630b45fea8cbdf7d64e47661643d0a52b8
2021-06-21 13:29:11 -07:00
Erjia Guan
691183bb74 Fix compile failure on CUDA92 (#60017)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/60016

For CUDA 9.2:
- OptionalBase did not check `is_arrayref`
- constexpr functions seemingly cannot raise exceptions on CUDA 9.2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60017

Reviewed By: malfet

Differential Revision: D29139515

Pulled By: ejguan

fbshipit-source-id: 4f4f6d9fe6a5f2eadf913de0a9781cc9f2e6ac6f
2021-06-16 12:23:08 -07:00
Jeff Daily
24e27af683 [ROCm] enable kernel asserts (#49624)
Summary:
Addresses missing ROCm feature indicated in https://github.com/pytorch/pytorch/issues/38943.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49624

Reviewed By: agolynski

Differential Revision: D28902459

Pulled By: malfet

fbshipit-source-id: 29c9b552770241a0ec52cd057ea45efc4389d838
2021-06-07 13:43:07 -07:00
Edward Yang
f05d5bec48 Preserve PyObject even when it goes dead (#56017)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56017

Fixes #55686

This patch is seemingly straightforward but some of the changes are very
subtle.  For the general algorithmic approach, please first read the
quoted issue.  Based on the algorithm, there are some fairly
straightforward changes:

- New boolean on TensorImpl tracking if we own the pyobj or not
- PythonHooks virtual interface for requesting deallocation of pyobj
  when TensorImpl is being released and we own its pyobj, and
  implementation of the hooks in python_tensor.cpp
- Modification of THPVariable to MaybeOwned its C++ tensor, directly
  using swolchok's nice new class

And then, there is python_variable.cpp.  Some of the changes follow the
general algorithmic approach:

- THPVariable_NewWithVar is simply adjusted to handle MaybeOwned and
  initializes as owned (like before)
- THPVariable_Wrap adds the logic for reverting ownership back to
  PyObject when we take out an owning reference to the Python object
- THPVariable_dealloc attempts to resurrect the Python object if
  the C++ tensor is live, and otherwise does the same old implementation
  as before
- THPVariable_tryResurrect implements the resurrection logic.  It is
  modeled after CPython code so read the cited logic and see if
  it is faithfully replicated
- THPVariable_clear is slightly updated for MaybeOwned and also to
  preserve the invariant that if owns_pyobj, then pyobj_ is not null.
  This change is slightly dodgy: the previous implementation has a
  comment mentioning that the pyobj nulling is required to ensure we
  don't try to reuse the dead pyobj.  I don't think, in this new world,
  this is possible, because the invariant says that the pyobj only
  dies if the C++ object is dead too.  But I still unset the field
  for safety.

And then... there is THPVariableMetaType.  colesbury explained in the
issue why this is necessary: when destructing an object in Python, you
start off by running the tp_dealloc of the subclass before moving up
to the parent class (much in the same way C++ destructors work).  The
deallocation process for a vanilla Python-defined class does irreparable
harm to the PyObject instance (e.g., the finalizers get run) making it
no longer valid to attempt to resurrect later in the tp_dealloc chain.
(BTW, the fact that objects can resurrect but in an invalid state is
one of the reasons why it's so frickin' hard to write correct __del__
implementations).  So we need to make sure that we actually override
the tp_dealloc of the bottom most *subclass* of Tensor to make sure
we attempt a resurrection before we start finalizing.  To do this,
we need to define a metaclass for Tensor that can override tp_dealloc
whenever we create a new subclass of Tensor.  By the way, it was totally
not documented how to create metaclasses in the C++ API, and it took
a good bit of trial error to figure it out (and the answer is now
immortalized in https://stackoverflow.com/q/67077317/23845 -- the things
that I got wrong in earlier versions of the PR included setting
tp_basicsize incorrectly, incorrectly setting Py_TPFLAGS_HAVE_GC on
the metaclass--you want to leave it unset so that it inherits, and
determining that tp_init is what actually gets called when you construct
a class, not tp_call as another not-to-be-named StackOverflow question
suggests).

Aside: Ordinarily, adding a metaclass to a class is a user visible
change, as it means that it is no longer valid to mixin another class
with a different metaclass.  However, because _C._TensorBase is a C
extension object, it will typically conflict with most other
metaclasses, so this is not BC breaking.

The desired new behavior of a subclass tp_dealloc is to first test if
we should resurrect, and otherwise do the same old behavior.  In an
initial implementation of this patch, I implemented this by saving the
original tp_dealloc (which references subtype_dealloc, the "standard"
dealloc for all Python defined classes) and invoking it.  However, this
results in an infinite loop, as it attempts to call the dealloc function
of the base type, but incorrectly chooses subclass type (because it is
not a subtype_dealloc, as we have overridden it; see
b38601d496/Objects/typeobject.c (L1261) )
So, with great reluctance, I must duplicate the behavior of
subtype_dealloc in our implementation.  Note that this is not entirely
unheard of in Python binding code; for example, Cython
c25c3ccc4b/Cython/Compiler/ModuleNode.py (L1560)
also does similar things.  This logic makes up the bulk of
THPVariable_subclass_dealloc
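
A highly simplified sketch of that flow (hypothetical signatures; the real function duplicates far more of CPython's subtype_dealloc):
```cpp
#include <Python.h>

// Assumed helpers for illustration: tryResurrect returns true when the C++
// tensor is still alive and ownership of the PyObject was handed back to it.
bool THPVariable_tryResurrect(PyObject* self);
void THPVariable_dealloc(PyObject* self);

static void THPVariable_subclass_dealloc(PyObject* self) {
  if (THPVariable_tryResurrect(self)) {
    return;  // object lives on, owned by the TensorImpl; skip finalization
  }
  PyObject_GC_UnTrack(self);  // public API, as noted above
  // ... run finalizers / clear subclass slots, as CPython's subtype_dealloc does ...
  THPVariable_dealloc(self);  // the parent dealloc is known statically
}
```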

To review this, you should pull up the CPython copy of subtype_dealloc
b38601d496/Objects/typeobject.c (L1230)
and verify that I have specialized the implementation for our case
appropriately.  Among the simplifications I made:

- I assume PyType_IS_GC, because I assume that Tensor subclasses are
  only ever done in Python and those classes are always subject to GC.
  (BTW, yes!  This means I have broken anyone who has extend PyTorch
  tensor from C API directly.  I'm going to guess no one has actually
  done this.)

- I don't bother walking up the type bases to find the parent dealloc;
  I know it is always THPVariable_dealloc.  Similarly, I can get rid
  of some parent type tests based on knowledge of how
  THPVariable_dealloc is defined

- The CPython version calls some private APIs which I can't call, so
  I use the public PyObject_GC_UnTrack APIs.

- I don't allow the finalizer of a Tensor to change its type (but
  more on this shortly)

One alternative I discussed with colesbury was instead of copy pasting
the subtype_dealloc, we could transmute the type of the object that was
dying to turn it into a different object whose tp_dealloc is
subtype_dealloc, so the stock subtype_dealloc would then be applicable.
We decided this would be kind of weird and didn't do it that way.

TODO:

- More code comments

- Figure out how not to increase the size of TensorImpl with the new
  bool field

- Add some torture tests for the THPVariable_subclass_dealloc, e.g.,
  involving subclasses of Tensors that do strange things with finalizers

- Benchmark the impact of taking the GIL to release C++ side tensors
  (e.g., from autograd)

- Benchmark the impact of adding a new metaclass to Tensor (probably
  will be done by separating out the metaclass change into its own
  change)

- Benchmark the impact of changing THPVariable to conditionally own
  Tensor (as opposed to unconditionally owning it, as before)

- Add tests that this actually indeed preserves the Python object

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D27765125

Pulled By: ezyang

fbshipit-source-id: 857f14bdcca2900727412aff4c2e2d7f0af1415a
2021-06-03 10:50:36 -07:00
chengjun
511979df85 Define the SYCL device version __assert_fail when the NDEBUG defined. (#58906)
Summary:
## Motivation
The utils in namespace `c10` require `__assert_fail` when NDEBUG is defined in kernel code.

The `__assert_fail` declaration in PyTorch is not compatible with the SYCL specification.

This causes a compile error when using these utils in SYCL kernels.

## Solution
Add the `__assert_fail` declaration for SYCL kernels to pytorch when compiling the SYCL kernels with `c10` utils.

## Additional context
`__assert_fail` in SYCL kernel

`extern SYCL_EXTERNAL void __assert_fail(const char *expr, const char *file, unsigned int line, const char *func);`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58906

Reviewed By: anjali411

Differential Revision: D28700863

Pulled By: ezyang

fbshipit-source-id: 81896d022b35ace8cd16474128649eabedfaf138
2021-05-26 12:48:37 -07:00
Peter Bell
0c2d38264a Improve BatchNorm1d performance (CUDA) (#57786)
Summary:
Part of gh-38915, resubmit of gh-57034

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57786

Reviewed By: mruberry

Differential Revision: D28290284

Pulled By: ngimel

fbshipit-source-id: 8768578ba9ace6a948cb8145c0091e0ea49b12da
2021-05-08 19:09:29 -07:00
Sam Estep
2992ff3fb8 Revert D28142447: Improve BatchNorm1d performance (CUDA)
Test Plan: revert-hammer

Differential Revision:
D28142447 (b2936ad8fa)

Original commit changeset: c70109780e20

fbshipit-source-id: e93f6d00d644697b106f5ea8ab79872f353b51c6
2021-05-06 15:01:19 -07:00
Peter Bell
b2936ad8fa Improve BatchNorm1d performance (CUDA) (#57034)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57034

Resolves gh-38915

For the example given in the issue, BatchNorm1d on cuDNN is around 12x slower
than BatchNorm2d. Internally, cuDNN expects at least a 4d tensor (N, C, H, W)
so these two modules actually call the same cuDNN code. My assumption is that
cuDNN just isn't optimized for H=W=1.

Instead, this disables cuDNN for 2d batch_norm inputs and improves the CUDA
implementation of `native_batch_norm` to be competitive with cuDNN. For the
example in the issue, `BatchNorm1d` now takes 335 us compared to 6.3 ms before,
an 18x speedup.

Before this change, nvprof shows:
```
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.64%  630.95ms       100  6.3095ms  5.6427ms  8.8800ms  void cudnn::bn_fw_tr_1C11_kernel_NCHW<float, float, int=512, bool=0, int=2>(cudnnTensorStruct, float const *, cudnn::bn_fw_tr_1C11_kernel_NCHW<float, float, int=512, bool=0, int=2>, cudnnTensorStruct*, float const *, float const , cudnnTensorStruct*, cudnnTensorStruct*, cudnnTensorStruct**, float const *, float const *, float const *, cudnnTensorStruct*, cudnnTensorStruct*)
```

But after, it shows:
```
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   54.76%  14.352ms       100  143.52us  123.52us  756.28us  _ZN2at6native27unrolled_elementwise_kernelIZZZNS0_72_GLOBAL__N__48_tmpxft_001e82d0_00000000_7_Normalization_cpp1_ii_db66e07022batch_norm_elementwiseERKNS_6TensorES5_RKN3c108optionalIS3_EESA_S5_S5_ENKUlvE_clEvENKUlvE2_clEvEUlfffffE_NS_6detail5ArrayIPcLi6EEE16OffsetCalculatorILi5EjESI_ILi1EjENS0_6memory15LoadWithoutCastENSL_16StoreWithoutCastEEEviT_T0_T1_T2_T3_T4_
                   35.09%  9.1951ms       100  91.950us  84.415us  362.17us  void at::native::reduce_kernel<int=256, int=2, at::native::ReduceOp<float, at::native::WelfordOps<float, float, int, float, thrust::pair<float, float>>, unsigned int, float, int=2>>(float)
                    0.71%  186.14us       100  1.8610us  1.8240us  1.9840us  _ZN2at6native72_GLOBAL__N__48_tmpxft_001e82d0_00000000_7_Normalization_cpp1_ii_db66e07045unrolled_elementwise_kernel_for_multi_outputsILi3EZZZNS1_34batch_norm_update_stats_and_invertERKNS_6TensorES5_S5_S5_ddlENKUlvE_clEvENKUlvE2_clEvEUlffffE_NS_6detail5ArrayIPcLi7EEE23TrivialOffsetCalculatorILi4EjESD_ILi3EjEEEviT0_T1_T2_T3_
                     0.59%  153.37us       100  1.5330us  1.4720us  2.6240us  void at::native::vectorized_elementwise_kernel<int=4, at::native::BUnaryFunctor<at::native::AddFunctor<long>>, at::detail::Array<char*, int=2>>(int, long, at::native::AddFunctor<long>)
```

I think there is similar scope to improve the backward implementation.

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D28142447

Pulled By: ngimel

fbshipit-source-id: c70109780e206fa85e50a31e90a1cb4c533199da
2021-05-06 12:14:02 -07:00
Scott Wolchok
44cc873fba [PyTorch] Autoformat c10 (#56830)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56830

Opt into formatting on GitHub and format everything. This is a trial run before turning on formatting for more and eventually all of the codebase.

Test Plan: CI

Reviewed By: zertosh

Differential Revision: D27979080

fbshipit-source-id: a80f0c48691c08ae8ca0af06377b87e6a2351151
2021-04-30 21:23:28 -07:00
Scott Wolchok
52805a0f4f [PyTorch] Include hip_runtime.h in macros.h (#57070)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57070

See code comment.
ghstack-source-id: 127564865

Test Plan: CI, should unbreak build of following formatting diff

Reviewed By: ngimel

Differential Revision: D28044331

fbshipit-source-id: f571e60b2534313fb9e7dd13dd98d2441b9ce8b8
2021-04-30 09:02:48 -07:00
Mengwei Liu
c15d943149 [PyTorch] Fix broken build caused by keyword missing on Windows (#53562)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53562

On Windows when we try to build //xplat/caffe2/c10:c10Windows, it failed with an error like
```
stderr: buck-out\gen\83497cbb\xplat\caffe2\c10\c10Windows#header-mode-symlink-tree-only,headers\c10/macros/Macros.h(189): error C2220: warning treated as error - no 'object' file generated
buck-out\gen\83497cbb\xplat\caffe2\c10\c10Windows#header-mode-symlink-tree-only,headers\c10/macros/Macros.h(189): warning C4067: unexpected tokens following preprocessor directive - expected a newline
```
See log here: https://www.internalfb.com/intern/buck/build/6eaea1f8-e237-4860-9f3b-3a8edd2207c6/

This is because Windows doesn't support the `__has_attribute` keyword. Here I'm changing the ordering of `if` and `elif` so that we don't hit that line when building on Windows.
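
A sketch of the kind of reordering involved (the macro and attribute names below are illustrative, not the actual contents of Macros.h):
```cpp
// Put the MSVC branch first so MSVC never reaches a __has_attribute test it
// cannot parse; other compilers still use the attribute when available.
#if defined(_MSC_VER)
#define EXAMPLE_NOINLINE __declspec(noinline)
#elif defined(__has_attribute)
#if __has_attribute(noinline)
#define EXAMPLE_NOINLINE __attribute__((noinline))
#endif
#endif
#ifndef EXAMPLE_NOINLINE
#define EXAMPLE_NOINLINE
#endif
```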

Test Plan: buck build //xplat/caffe2/c10:c10Windows xplat/mode/windows

Reviewed By: kimishpatel, swolchok

Differential Revision: D26896510

fbshipit-source-id: d52438a3df7bf742e467a919f6ab4fed14484f22
2021-03-11 18:24:46 -08:00
Scott Wolchok
eaad002cf6 [PyTorch] s/__attribute__((__noinline__))/__attribute__((noinline))/ (#52362)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52362

AFAICT, it is documented to be the latter and not the former.
GCC: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#Common-Function-Attributes
Clang: https://clang.llvm.org/docs/AttributeReference.html

Both versions work in the oldest and newest GCC & Clang versions on Godbolt: https://godbolt.org/z/s6f4PW

So why change?
1) lack of underscores matches the documentation
2) AMD HIP defines `__noinline__` as a macro, which doesn't play well with the underscore version.
See 2080cc113a/include/hip/hcc_detail/host_defines.h (L54)
ghstack-source-id: 121875424

Test Plan: Rely on existing CI

Reviewed By: bhosmer

Differential Revision: D26488991

fbshipit-source-id: 6cfcdfd41c58170659e263cd519ac5359ffd5d46
2021-02-17 21:04:28 -08:00
Samuel Marks
8aad66a7bd [c10/**] Fix typos (#49815)
Summary:
All pretty minor. I avoided renaming `class DestructableMock` to `class DestructibleMock` and similar such symbol renames (in this PR).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49815

Reviewed By: VitalyFedyunin

Differential Revision: D25734507

Pulled By: mruberry

fbshipit-source-id: bbe8874a99d047e9d9814bf92ea8c036a5c6a3fd
2021-01-01 02:11:56 -08:00
Scott Wolchok
f0f315c33b [PyTorch] Inline RecordFunctionCallback::shouldRun (#48286)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48286

RecordFunction initialization is a hot path. shouldRun often does little enough work that the function prologue takes a significant proportion of its time. So, this diff forces it to be inline.
ghstack-source-id: 117892387

Test Plan: FB-internal benchmarks

Reviewed By: ezyang

Differential Revision: D25108879

fbshipit-source-id: 7121413e714c5ca22c8bf10c1d2535a878c15aec
2020-12-04 20:48:39 -08:00
Brian Hirsh
b5149513ec migrate export_caffe2_op_to_c10.h macros to the new dispatcher registration API, update code_analyzer regex (#48308)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48308

The original regex that I added didn't correctly match namespaces that started with an underscore (e.g. `_test`), which caused a master-only test to fail.

The only change from the previous commit is that I updated the regex like so:

before: `^.*TORCH_LIBRARY_IMPL_init_([^_]+)_([^_]+)_[0-9]+(\(.*)?$`
after: `^.*TORCH_LIBRARY_IMPL_init_([_]*[^_]+)_([^_]+)_[0-9]+(\(.*)?$`

I added in a `[_]*` to the beginning of the namespace capture. I did the same for the `_FRAGMENT` regex.

Verified that running `ANALYZE_TEST=1 tools/code_analyzer/build.sh` (as the master-only test does) produces no diff in the output.

Fixing regex pattern to allow for underscores at the beginning of the
namespace

This reverts commit 3c936ecd3c.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D25123295

Pulled By: bdhirsh

fbshipit-source-id: 54bd1e3f0c8e28145e736142ad62a18806bb9672
2020-11-30 13:05:33 -08:00
Brian Hirsh
3c936ecd3c Revert D25056091: migrate export_caffe2_op_to_c10.h macros to the new dispatcher registration API
Test Plan: revert-hammer

Differential Revision:
D25056091 (0ea4982cf3)

Original commit changeset: 0f647ab9bc5e

fbshipit-source-id: e54047b91d82df25460ee00482373c4580f94d50
2020-11-19 19:10:14 -08:00
Brian Hirsh
0ea4982cf3 migrate export_caffe2_op_to_c10.h macros to the new dispatcher registration API (#48097)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48097

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D25056091

Pulled By: bdhirsh

fbshipit-source-id: 0f647ab9bc5e5aee497dac058df492f6e742cfe9
2020-11-19 17:56:56 -08:00
Xiao Wang
fcc7f272de maximum number of threads per block for sm_86 is 1536 (#45889)
Summary:
according to https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45889

Reviewed By: albanD

Differential Revision: D24131188

Pulled By: ngimel

fbshipit-source-id: 31d3038f7b1bc403751448c62b19609573c67a49
2020-10-06 12:01:33 -07:00
Louis Feng
eb75cfb9c0 Back out "Revert D23323486: DPP Async Tracing" plus windows build fix. (#44702)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44702

Original commit changeset: c6bd6d277aca

This diff caused the Windows build to fail due to a compiler bug in VS2019 (lambda capture of a constant int value). This back-out works around the issue by explicitly capturing the const int value.

Test Plan: Tested and previously landed.

Reviewed By: mruberry

Differential Revision: D23703215

fbshipit-source-id: f9ef23be97540bc9cf78a855295fb8c69f360459
2020-09-16 11:32:11 -07:00
Mike Ruberry
7036e91abd Revert D23323486: DPP Async Tracing
Test Plan: revert-hammer

Differential Revision:
D23323486 (71673b31f9)

Original commit changeset: 4b6ca6c0e320

fbshipit-source-id: c6bd6d277aca070bef2de3522c2a60e23b4395ad
2020-09-15 01:19:23 -07:00
Louis Feng
71673b31f9 DPP Async Tracing (#44252)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44252

Add tracing to the DPP client. Because DPP requests are async, we need to be able to start a trace event in one thread and potentially end it in a different thread. RecordFunction and LibgpumonObserver previously assumed each trace event starts and finishes in the same thread, so they used a thread-local context to track enter and exit callbacks. Async events break this assumption. This change attaches the event context to the RecordFunction object so we do not need the thread-local context.

Test Plan:
Tested with dpp perf test and able to collect trace.

{F307824044}

Reviewed By: ilia-cher

Differential Revision: D23323486

fbshipit-source-id: 4b6ca6c0e32028fb38a476cd1f44c17a001fc03b
2020-09-14 18:43:14 -07:00
Giuseppe Ottaviano
6324ef4ced [caffe2] Speed up compilation of aten-op.cc (#44440)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44440

`aten-op.cc` takes a long time to compile due to the large generated constructor. For each case, the `std::function` constructor and the initialization functions are inlined, producing a huge amount of intermediate code that takes a long time to optimize, given that many compiler optimization passes are superlinear in the function size.

This diff moves each case to a separate function, so that each one is cheap to optimize, and the constructor is just a large jump table, which is easy to optimize.
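
A self-contained illustration of the restructuring (hypothetical names; the real file is generated and far larger):
```cpp
#include <string>
#include <unordered_map>

// Each formerly-inlined case body becomes a small free function, so the big
// constructor only fills a jump table and stays cheap for the optimizer.
namespace {
int run_add(int a, int b) { return a + b; }
int run_mul(int a, int b) { return a * b; }
} // namespace

struct GeneratedDispatcher {
  std::unordered_map<std::string, int (*)(int, int)> table;
  GeneratedDispatcher() {
    table["add"] = run_add;  // one trivial assignment per case
    table["mul"] = run_mul;
  }
};
```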

Reviewed By: dzhulgakov

Differential Revision: D23593741

fbshipit-source-id: 1ce7a31cda10d9b0c9d799716ea312a291dc0d36
2020-09-09 21:21:48 -07:00
Yi Zhang
47e1b7a8f1 Set CONSTEXPR_EXCEPT_WIN_CUDA as const while it is not constexpr (#43380)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42467

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43380

Reviewed By: albanD

Differential Revision: D23278930

Pulled By: pbelevich

fbshipit-source-id: 6ce0bc9fd73cd0ead46c414fdea5f6fb7e9fec3e
2020-08-22 03:25:37 -07:00
Dmytro Dzhulgakov
06d978a9ad [c10/cuda] Reorganize device_count() and robustly surface ASAN warnings (#42249)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42249

Main change is to bring Caffe2's superior error messages for cuda initialization into c10 and use them in all code paths.

Basic logic:

| Case | Call to device_count() | init_cuda, e.g. allocating tensor |
| -- | -- | -- |
| all good | non-zero | just works |
| no gpus | 0, no warning | throw exception with good message |
| driver issues | 0, produce warning | throw exception with good message |
| out of memory with ASAN | 0, produce warning| throw exception with ASAN message |

Previously, the error thrown from init_cuda was very generic and the ASAN warning (if any) was buried in the logs.

Other clean up changes:
* cache device_count() always in a static variable
* move all asan macros in c10

Test Plan:
Hard to unittest because of build modes. Verified manually that the behavior from the table above holds by running the following script in different modes (ASAN/no-ASAN, CUDA_VISIBLE_DEVICES=):

```
print('before import')
import torch
print('after import')
print('devices: ', torch.cuda.device_count())
x = torch.tensor([1,2,3])
print('tensor creation')
x = x.cuda()
print('moved to cuda')
```

Reviewed By: ngimel

Differential Revision: D22824329

fbshipit-source-id: 5314007313a3897fc955b02f8b21b661ae35fdf5
2020-08-05 11:39:31 -07:00
Xiang Gao
877a59967f Ampere has CUDA_MAX_THREADS_PER_SM == 2048 (#41138)
Summary:
See: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf
page 44, table 5
![image](https://user-images.githubusercontent.com/1032377/86958633-56051580-c111-11ea-94da-c726a61dc00a.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41138

Differential Revision: D22488904

Pulled By: malfet

fbshipit-source-id: 97bd585d91e1a368f51aa6bd52081bc57d42dbf8
2020-07-10 20:02:20 -07:00
Jiakai Liu
a2ef54c598 [pytorch] fix CUDA_KERNEL_ASSERT macro for android build (#40151)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40151

For debug android build it throws the following error:
```
  In file included from src/pytorch/android/pytorch_android/src/main/cpp/pytorch_jni_common.cpp:9:
  In file included from src/pytorch/android/pytorch_android/src/main/cpp/pytorch_jni_common.h:2:
  In file included from ../../../../src/main/cpp/libtorch_include/armeabi-v7a/torch/csrc/api/include/torch/types.h:3:
  In file included from ../../../../src/main/cpp/libtorch_include/armeabi-v7a/ATen/ATen.h:5:
  In file included from ../../../../src/main/cpp/libtorch_include/armeabi-v7a/ATen/Context.h:4:
  In file included from ../../../../src/main/cpp/libtorch_include/armeabi-v7a/ATen/Tensor.h:3:
  In file included from ../../../../src/main/cpp/libtorch_include/armeabi-v7a/ATen/core/TensorBody.h:7:
  In file included from ../../../../src/main/cpp/libtorch_include/armeabi-v7a/c10/core/Scalar.h:13:
  ../../../../src/main/cpp/libtorch_include/armeabi-v7a/c10/util/TypeCast.h:157:22: error: use of undeclared identifier '__assert_fail'
  AT_FORALL_QINT_TYPES(DEFINE_UNCASTABLE)
                       ^
```

It seems `__assert_fail()` isn't available on Android by default; in NDEBUG mode the header forward-declares the function and CI passes.

But CUDA_KERNEL_ASSERT() shouldn't be relevant for the mobile build at all, and we already bypass `__APPLE__`, so the easiest fix is to also add `__ANDROID__`.

Test Plan: Imported from OSS

Differential Revision: D22095562

Pulled By: ljk53

fbshipit-source-id: 793108a7bc64db161a0747761c0fbd70262e7d5a
2020-06-17 16:26:08 -07:00
Benoit Steiner
6d3e4aa0f9 Made sure torchscript compiles in optimized mode (#38888)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38888

Test Plan: ran the build

Reviewed By: zdevito

Differential Revision: D21045046

fbshipit-source-id: f86d51b083cbc530012d36bbc770f13b28f4c65d
2020-06-01 14:53:55 -07:00
peter
a5d44800f0 Implement CUDA_KERNEL_ASSERT for MSVC (#39218)
Summary:
Tested locally on CPU/GPU + Debug/Release.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39218

Differential Revision: D21786500

Pulled By: malfet

fbshipit-source-id: 7e871003d3509436952932b5ff3599e36bb8f205
2020-05-29 11:44:54 -07:00
Natalia Gimelshein
ba14a701dc restore proper cuda assert behavior with DNDEBUG (#38943)
Summary:
Per title. https://github.com/pytorch/pytorch/issues/32719 essentially disabled asserts in CUDA kernels in release builds. Asserts in CUDA kernels are typically used to prevent invalid reads/writes, so without asserts invalid reads/writes become silent errors in most cases (sometimes they would still cause "illegal memory access" errors, but because of the caching allocator this usually won't happen).
We don't need two macros, CUDA_ALWAYS_ASSERT and CUDA_KERNEL_ASSERT, because all current asserts in CUDA kernels are important for preventing illegal memory accesses and should never be disabled.
This PR removes the macro CUDA_ALWAYS_ASSERT and instead makes CUDA_KERNEL_ASSERT (which is commonly used in the kernels) an assertion in both release and debug builds.
Fixes https://github.com/pytorch/pytorch/issues/38771
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38943

Differential Revision: D21723767

Pulled By: ngimel

fbshipit-source-id: d88d8aa1b047b476d5340e69311e65aff4da5074
2020-05-26 18:11:00 -07:00
Brian
389e16c33b torch.pow Add type promotion support and fix issue with __rpow__ (#37098)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37098

### **Cherry-picked from another stack:**
Some code review already occurred here: https://github.com/pytorch/pytorch/pull/32582

### Summary:

Fixes: https://github.com/pytorch/pytorch/issues/32436

The issue caused incorrect handling of dtypes for scalar ** tensor.
e.g. before this change:
```
>>> 5.5 ** torch.ones(5, dtype=torch.int32)
tensor([5, 5, 5, 5, 5], dtype=torch.int32)
```
should return a float tensor.

Also fixes a number of incorrect cases:
 * tensors to negative powers were giving incorrect results (1 instead
    of 0 or error)
 * Behavior wasn't consistent between cuda/cpu
 * large_value ** 1 in some cases gave a result not equal
    to large_value because of truncation in conversion to double and back.

BC-breaking:

Previously incorrect behavior (in 1.4):
```
>>> a
tensor([1, 1, 1, 1, 1], dtype=torch.int32)
>>> a.pow_(.5)
tensor([1, 1, 1, 1, 1], dtype=torch.int32)
```

After this change:
`RuntimeError: result type Float can't be cast to the desired output type Int`

Test Plan: Imported from OSS

Differential Revision: D21686207

Pulled By: nairbv

fbshipit-source-id: e797e7b195d224fa46404f668bb714e312ea78ac
2020-05-26 08:29:51 -07:00
Nikita Shulga
44345ad08c Do not define C10_IOS on Mac (#37283)
Summary:
Because MacOS is not iOS
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37283

Test Plan: CI

Differential Revision: D21244398

Pulled By: malfet

fbshipit-source-id: b822e216e83887e2f2961b5c5384eaf749629f61
2020-04-25 13:52:46 -07:00
Sebastian Messmer
de090c42b1 Optimize binary size of assert macros (#37023)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37023

Optimize binary size of assert macros, through two ideas:

- Concatenate string literals with __FILE__ and __LINE__ at compile time into one literal instead of keeping them in separate literals and combining them with c10::str.
- Optimize the binary size of c10::str for some scenarios, especially the one where it is called with an empty parameter list; this is a common call pattern in assert macros.

In server OSS builds, this PR reduces binary size from 118.05 MB to 117.05 MB.
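
A minimal sketch of the first idea (hypothetical macro names, not the actual c10 macros):
```cpp
#include <cstdio>

// __FILE__ and the stringified __LINE__ are adjacent string literals, so the
// compiler fuses them into a single literal at compile time, instead of the
// message being assembled at runtime by c10::str.
#define EXAMPLE_STR2(x) #x
#define EXAMPLE_STR(x) EXAMPLE_STR2(x)
#define EXAMPLE_LOCATION __FILE__ ":" EXAMPLE_STR(__LINE__)

int main() {
  std::puts("check failed at " EXAMPLE_LOCATION);
  return 0;
}
```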
ghstack-source-id: 102607237

Test Plan: Run oss server build (python setup.py install) and check size of libtorch_cpu.so reducing from 118.05MB to 117.05MB

Differential Revision: D20719400

fbshipit-source-id: 5c61f4195b947f06aafb8f0c8e255de3366e1ff2
2020-04-22 17:13:17 -07:00
Xiang Gao
15c7486416 Canonicalize includes in c10, and add tests for it (#36299)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36299

Test Plan: Imported from OSS

Differential Revision: D20943005

Pulled By: ezyang

fbshipit-source-id: 9dd0a58824bd0f1b5ad259942f92954ba1f63eae
2020-04-10 12:07:52 -07:00
Mike Ruberry
21c94606b8 Cleans up type conversions, adds CPU test comparing with NumPy (#35374)
Summary:
Per title. Follow-up to https://github.com/pytorch/pytorch/pull/35086.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35374

Differential Revision: D20712443

Pulled By: mruberry

fbshipit-source-id: 987089c14bff644fd6a636da5530dc260e1d1a68
2020-03-27 22:11:57 -07:00
Mike Ruberry
36e36eff2f Ignores deliberate undefined float->int conversion (#35086)
Summary:
In C++, casting a floating point value to an integer dtype is undefined when the value is outside the dtype's dynamic range. For example, casting 300.5 to Int8 is undefined behavior because the maximum representable Int8 value is 127, and 300.5 > 127.

PyTorch, like NumPy, deliberately allows and makes these casts, however, and when we do this we trigger undefined behavior that causes our sanitizers to (correctly) complain. I propose skipping this sanitization on our cast function.

The history of this PR demonstrates the issue, showing a single CI failure in the ASAN build when a test is added that converts a large float value to an integral value. The current PR shows a green CI after the sanitization is skipped.

There are alternatives to skipping this sanitization:

- Clamping or otherwise converting floats to the dynamic range of integral types they're cast to
- Throwing a runtime error if a float value is outside the dynamic range of the integral type it's cast to (this would not be NumPy compatible)
- Declaring programs in error if they perform these casts (this is technically true)
- Preventing this happening in PyTorch proper so the ASAN build doesn't fail

None of these alternatives seems particularly appealing, and I think it's appropriate to skip the sanitization because our behavior is deliberate.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35086

Differential Revision: D20591163

Pulled By: mruberry

fbshipit-source-id: fa7a90609c73c4c627bd39726a7dcbaeeffa1d1b
2020-03-23 01:08:57 -07:00
Igor Sugak
259d7299db [caffe2] do not declare __assert_fail in clang builds (#33893)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33893

It appears that when Clang drives CUDA compilation, `__assert_fail` is always defined as a device function.

Test Plan:
```lang=bash
buck build mode/opt -c fbcode.cuda_use_clang=true -c cxx.untracked_headers=ignore //fblearner/flow/projects/dper:workflow
```

Reviewed By: ngimel

Differential Revision: D20145034

fbshipit-source-id: 23153411ed631e05421c7afcf41b7ea5619cdd96
2020-03-10 14:45:03 -07:00
Nikita Shulga
4edff32f81 [c10] Fix typo in __assert_fail noreturn modifier guard (#34157)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34157

`[[noreturn]]` only conflicts with the CUDA `__assert_fail` definition if clang is used as the host compiler

Test Plan: CI

Reviewed By: EscapeZero

Differential Revision: D20232088

fbshipit-source-id: 7182c28a15278e03175865cd0c87410c5de5bf2c
2020-03-03 17:25:25 -08:00
Nikita Shulga
0689cf8fc1 [c10] Make __assert_fail CUDA definition compilable with clang host compiler (#34102)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34102

If nvcc is invoked with clang as the host compiler, it will fail with the following error due to a mismatch between the decorators defined in CUDA and in c10:
```
 error: attribute "noreturn" did not appear on original declaration
```

Test Plan: Build pytorch with clang

Reviewed By: EscapeZero

Differential Revision: D20204951

fbshipit-source-id: ff7cef0db43436e50590cb4bbf1ae7302c1440fa
2020-03-02 20:11:49 -08:00
Wojciech Baranowski
8aa09de19e build: set -DNDEBUG in Release (#32719)
Summary:
This might lead to silent undefined behaviour (e.g. with out-of-bounds indices). This affects `test_multinomial_invalid_probs_cuda`, which is now removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32719

Test Plan:
* Build with VERBOSE=1 and manually inspect `less ndebug.build.log | grep 'c++' | grep -v -- -DNDEBUG` (only with nina on Linux)
* CI

Fixes https://github.com/pytorch/pytorch/issues/22745

Differential Revision: D20104340

Pulled By: yf225

fbshipit-source-id: 2ebfd7ddae632258a36316999eeb5c968fb7642c
2020-02-26 12:53:31 -08:00
Michael Ranieri
9b2b15f4fc misc windows warning fixes (#33632)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33632

* `inline_container.h` was unnecessarily exposing all includers to caffe2 headers via `caffe2/core/logging.h`
* Add msvc version of hiding unused warnings.
* Make sure clang on windows does not use msvc pragmas.
* Don't redefine math macro.

Test Plan: CI green

Differential Revision: D20017046

fbshipit-source-id: 230a9743eb88aee08d0a4833680ec2f01b7ab1e9
2020-02-21 19:36:25 -08:00
Michael Ranieri
40265e2d66 prevent various warnings related to undef and redef (#33196)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33196

Test Plan: Sandcastle green

Reviewed By: malfet

Differential Revision: D19842268

fbshipit-source-id: 47bc3d7a75e803041491e11a648b4a9e7d9cc72c
2020-02-12 13:28:35 -08:00
Sebastian Messmer
c21f89970f Remove c++14-conditional constexpr (#30916)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30916

These macros said "make it constexpr if we're in C++14". Since we're now always C++14, we can just say "constexpr" instead.
ghstack-source-id: 96369584

Test Plan: waitforsandcastle

Differential Revision: D18869635

fbshipit-source-id: f41751e4e26fad6214ec3a98db2d961315fd73ff
2020-01-07 16:40:11 -08:00
Sebastian Messmer
409151e1bb Use [[noreturn]] instead of C10_NORETURN or CAFFE_NORETURN (#30917)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30917

This is a C++14 feature, we can use this now.
ghstack-source-id: 95255753

Test Plan: waitforsandcastle

Differential Revision: D18869637

fbshipit-source-id: dd02036b9faeaffa64b2d2d305725443054da31b
2019-12-15 23:54:16 -08:00
Serhat Yilmaz
57ee7dab87 Wraps assert statements in cuda kernels (#31276)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31276

Change assert --> CUDA_ASSERT_KERNEL to avoid hip undefined __assert_fail()

This is similar to https://github.com/pytorch/pytorch/pull/13902 in caffe2 land.

Test Plan: wait for CI to clear

Reviewed By: bddppq

Differential Revision: D19047582

fbshipit-source-id: 34703b03786c8eee9c78d2459eb54bde8dc21a57
2019-12-14 20:29:47 -08:00
Sebastian Messmer
70e9ef518f c10::string_view (#26616)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26616

Implement C++17 std::string_view for C++11.

This is useful for compile-time type name retrieval, which I'm going to stack on top of this.
It is also useful for replacing `const std::string&` with it throughout our codebase.
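
A hypothetical usage sketch (the include path is assumed here):
```cpp
#include <c10/util/string_view.h>  // assumed include path for the new type

// APIs that previously took `const std::string&` can take c10::string_view
// instead, accepting literals and substrings without allocating.
bool is_aten_op(c10::string_view qualified_name) {
  return qualified_name.size() >= 6 &&
      qualified_name.substr(0, 6) == c10::string_view("aten::");
}
```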
ghstack-source-id: 92100314

Test Plan: unit tests

Differential Revision: D17518992

fbshipit-source-id: 48e31c677d51b0041f4b37e89a92bd176d4a0b08
2019-10-21 16:10:40 -07:00
Sebastian Messmer
5c67b01467 Switch internal CUDA build to C++14 (#26757)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26757

This doesn't switch any open source builds or CI.
The internal fbcode build has been C++17 for quite some time, but in CUDA code we had it restricted to C++11.
This diff changes that to C++14.

Because this doesn't change anything open source, the risk of this is low.
ghstack-source-id: 90728524

Test Plan: waitforsandcastle

Differential Revision: D17558142

fbshipit-source-id: 9cfd47e38e71d5a2fdae2f535c01f281bf007d9a
2019-09-26 14:57:21 -07:00
Johannes M Dieterich
a8d4bb34ea Unify treatment of warp size / wave size (#25884)
Summary:
Introduce a C10_WARP_SIZE define in Macros.h

For kernels that had ifdef-ing of WARP_SIZE for ROCm vs CUDA, use said macro. This is no functional change - we merely refactor to unify on one WARP_SIZE definition.

I hope we can encourage use of this macro over more WARP_SIZE definitions being sprinkled across the code base (or numerically hard-coded).

Some kernels remain that have their own WARP_SIZE definitions but did not satisfy above condition. They will be fixed in follow-up PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25884

Differential Revision: D17276662

Pulled By: bddppq

fbshipit-source-id: cef8e77a74ae2e5de10df816ea80b25cb2bab713
2019-09-10 00:11:09 -07:00
Sam Gross
d8314a6260 Replace nullary/unary/binary loops with generic implementation (#21475)
Summary:
```
This replaces the kernel helpers in Loops.h/cuh with the following:

  cpu_kernel
  cpu_kernel_vec

  gpu_kernel
  gpu_kernel_with_scalars

These work with functions with any number of input arguments, with the
exception of 'gpu_kernel_with_scalars', which is limited to binary
operations. Previously, we only supported functions of 0, 1, or 2 input
arguments. Adding support for 3 or 4 input argument functions required a
significant amount of additional code.

This makes a few other changes:

Remove 'ntensors' from the for_each/serial_for_each loop. Most loops
assume a fixed number of tensors, and the value is accessible from
TensorIterator::ntensors()

Only lift CPU scalars to parameters in 'gpu_kernel_with_scalars'.
Previously, we performed this recursively in gpu_unary_kernel and
gpu_binary_kernel, so something like `torch.add(3, 4, out=cuda_tensor)`
would specialize to a "nullary" kernel. Now, only the first
scalar input is lifted to a kernel parameter. Any additional scalar
inputs are copied to CUDA tensors. Note that operations like `x + 5`
and `5 + x` still work efficiently. This avoids generating an exponential
number of specializations in the number of input arguments.
```

**Performance measurements**
Timing numbers are unchanged for basic elementwise operations. Linked below is a script to measure torch.add perf on PR vs. master CPU+GPU (GCC 7.3):
[miniperf.py](https://gist.github.com/colesbury/4a61893a22809cb0931f08cd37127be4)

**Generated assembly**
cpu_kernel and cpu_kernel_vec still generate good vectorized code with
both GCC 7.3 and GCC 4.8.5. Below is the assembly for the "hot" inner loop of
torch.add as well as an auto-vectorized torch.mul implementation using cpu_kernel/
binary_kernel. (The real torch.mul uses cpu_kernel_vec but I wanted to check that
auto vectorization still works well):

[torch.add GCC 7.3](https://gist.github.com/colesbury/927ddbc71dc46899602589e85aef1331)
[torch.add GCC 4.8](https://gist.github.com/colesbury/f00e0aafd3d1c54e874e9718253dae16)
[torch.mul auto vectorized GCC 7.3](https://gist.github.com/colesbury/3077bfc65db9b4be4532c447bc0f8628)
[torch.mul auto vectorized GCC 4.8](https://gist.github.com/colesbury/1b38e158b3f0aaf8aad3a76963fcde86)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21475

Differential Revision: D15745116

Pulled By: colesbury

fbshipit-source-id: 914277d7930dc16e94f15bf87484a4ef82890f91
2019-06-17 19:08:33 -07:00
Dmytro Dzhulgakov
c25e33789e Lightweight at-most-once logging for API usage (#20745)
Summary:
Resubmit #20698 which got messed up.

The idea is that when PyTorch is used in a custom build environment (e.g. Facebook), it's useful to track usage of various APIs centrally. This PR introduces a simple, very lightweight mechanism to do so - only the first invocation of a trigger point would be logged. This is significantly more lightweight than #18235 and thus we can allow putting logging in e.g. TensorImpl.

Also adds an initial list of trigger points. Trigger points are added in such a way that no static initialization triggers them, i.e. just linking with libtorch.so will not cause any logging. Further suggestions of what to log are welcomed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20745

Differential Revision: D15429196

Pulled By: dzhulgakov

fbshipit-source-id: a5e41a709a65b7ebccc6b95f93854e583cf20aca
2019-05-23 23:17:59 -07:00
Edward Z. Yang
9b1dbffba5
Re-sync with internal repository (#20702) 2019-05-20 09:22:57 -04:00
Dmytro Dzhulgakov
d3059b9c49 Lightweight logging for once-only API usage 2019-05-19 23:04:40 -07:00
Edward Yang
4e551a7edb Make C10_NODISCARD macro more portable for nvcc+clang. (#20324)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20324
ghimport-source-id: e51181c82f87c946b5ffcb87b0ad71a056cb4659

Differential Revision: D15359317

Pulled By: ezyang

fbshipit-source-id: d88798f13a61c74456641ddec8250c08ce8af240
2019-05-17 08:57:19 -07:00
Sebastian Messmer
e710f3b1e1 Fix C10_MOBILE macro for ios (#19779)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19779

This macro wasn't set correctly because the target macros weren't included from Apple's header.

Reviewed By: dzhulgakov

Differential Revision: D15090427

fbshipit-source-id: 43ca44f0f409e11718b7f60c3fdcd2aa02d7018e
2019-04-30 12:03:24 -07:00
Grigory Arutyunov
2336f0ba06 msvc_fixes (#17201)
Summary:
Fixing MSVC errors

```
  D:\pytorch-scripts\caffe2_builders\v141\pytorch\aten\src\THC/THCReduce.cuh(144): error C4002: too many actual paramet
ers for macro 'C10_LAUNCH_BOUNDS_1' [D:\pytorch-scripts\caffe2_builders\v141\pytorch\build\Debug\caffe2\caffe2_gpu.vcxp
roj]
  D:\pytorch-scripts\caffe2_builders\v141\pytorch\aten\src\THC/THCReduce.cuh(259): error C4002: too many actual paramet
ers for macro 'C10_LAUNCH_BOUNDS_1' [D:\pytorch-scripts\caffe2_builders\v141\pytorch\build\Debug\caffe2\caffe2_gpu.vcxp
roj]
  D:/pytorch-scripts/caffe2_builders/v141/pytorch/aten/src/THCUNN/SpatialDilatedMaxPooling.cu(51): error C4002: too man
y actual parameters for macro 'C10_LAUNCH_BOUNDS_1' [D:\pytorch-scripts\caffe2_builders\v141\pytorch\build\Debug\caffe2
\caffe2_gpu.vcxproj]
```

on variadic C10_LAUNCH_BOUNDS as well as Debug linking issues with at::Half in pool_op_cudnn.cc like this one

```
pool_op_cudnn.obj : error LNK2019: unresolved external symbol "public: bool __cdecl caffe2::MaxPoolFunctor<class caff
e2::CUDAContext>::GlobalPoolingBackward<struct c10::Half,2>(int,int,int,struct c10::Half const *,struct c10::Half const
 ,struct c10::Half const ,struct c10::Half ,class caffe2::CUDAContext )const " (??$GlobalPoolingBackward@UHalf@c10@
@$01@?$MaxPoolFunctor@VCUDAContext@caffe2@@caffe2@QEBA_NHHHPEBUHalf@c10@00PEAU23@PEAVCUDAContext@1@Z) referenced in
 function "public: bool __cdecl caffe2::`anonymous namespace'::CuDNNMaxPoolFunctor::GlobalPoolingBackward<struct c10::H
alf,2>(int,int,int,struct c10::Half const ,struct c10::Half const ,struct c10::Half const ,struct c10::Half ,class
caffe2::CUDAContext *)const " (??$GlobalPoolingBackward@UHalf@c10@@$01@CuDNNMaxPoolFunctor@?A0xb936404a@caffe2@QEBA_NH
HHPEBUHalf@c10@00PEAU34@PEAVCUDAContext@2@Z) [D:\pytorch-scripts\caffe2_builders\v141\pytorch\build\Debug\caffe2\caff
e2_gpu.vcxproj]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17201

Differential Revision: D14165732

Pulled By: ezyang

fbshipit-source-id: 875fd9a5b2db6f83fc483f6d750d2c011260eb8b
2019-03-01 15:17:41 -08:00
Junjie Bai
212024282b Mark cudaGetLastError return value unused in C10_CUDA_CHECK
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/17605

Reviewed By: xw285cornell

Differential Revision: D14277586

Pulled By: bddppq

fbshipit-source-id: 38879208f2ab83cf39d8a8a61b288cd09fcafd9a
2019-03-01 00:05:46 -08:00
Sebastian Messmer
6706e9af19 Make C10_MOBILE consistent with how feature macros are usually used (#17481)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17481

Usually, feature macros are either defined or undefined and checked accordingly.
C10_MOBILE was a weird special case that was always defined but either defined to 1 or to 0.

This caused a lot of confusion for me: when I tried to disable something in the
mobile build, it also got disabled in the server build (because I was using #ifdef).
I also found a place in the existing code base that made the same wrong assumption
and used the macro incorrectly; see https://fburl.com/y4icohts
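
A hedged sketch of the two conventions (macro names and platform checks are illustrative, not the real C10_MOBILE logic):
```
// Old convention: the macro is always defined and its value carries the answer,
// so it must be tested with #if, never #ifdef.
#define MOBILE_OLD_STYLE_SKETCH 0  // e.g. a server build

#ifdef MOBILE_OLD_STYLE_SKETCH
// Reached even on the server build -- exactly the confusion described above.
#endif

#if MOBILE_OLD_STYLE_SKETCH
// The only correct test under the old convention.
#endif

// New convention: define the macro only when the feature is enabled, so both
// #ifdef and #if defined(...) behave as expected.
#if defined(__ANDROID__) || defined(__APPLE__)
#define MOBILE_NEW_STYLE_SKETCH
#endif

#ifdef MOBILE_NEW_STYLE_SKETCH
// Mobile-only code path.
#endif
```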

Reviewed By: dzhulgakov

Differential Revision: D14214825

fbshipit-source-id: f3a155b6d43d334e8839e2b2e3c40ed2c773eab6
2019-02-27 17:57:51 -08:00
Syed Tousif Ahmed
86af14b0c7 Resolves ptxas warnings when compiling for CUDA_ARCH 750 and a memoryType deprecation warning (#15461)
Summary:
When compiling for `TORCH_CUDA_ARCH_LIST=7.5` we were getting ptxas warnings (https://github.com/pytorch/pytorch/issues/14310). This was because we had some hardcoded values in the launch_bounds used by kernels. The maximum number of threads per multiprocessor is 1024 for the Turing architecture (7.5) but 2048 for previous architectures. The hardcoded launch_bounds were requesting 2048 threads when compiling for Turing, hence the warning.

This PR adds a macro that checks the bounds of the supplied launch_bounds value. The maximum number of threads per block across all architectures is 1024; if a user supplies more than 1024, it is clamped down to 512. The minimum number of blocks per SM is then set based on this value. This PR should resolve https://github.com/pytorch/pytorch/issues/14310. The incorrect gradient computation reported in that issue is probably due to a faulty card.
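
A hedged sketch of the clamping behaviour described above (constants and names are illustrative, not the exact macro added by this PR):
```
constexpr int kMaxThreadsPerBlockSketch = 1024;

// Requests above the portable 1024-thread limit fall back to a safe default.
constexpr int clamp_launch_threads_sketch(int requested) {
  return requested <= kMaxThreadsPerBlockSketch ? requested : 512;
}

// Derive a minimum blocks-per-SM hint from the clamped value,
// e.g. 2048 / 1024 = 2 on pre-Turing parts, 1024 / 1024 = 1 on Turing.
constexpr int min_blocks_per_sm_sketch(int max_threads, int threads_per_sm) {
  return threads_per_sm / max_threads;
}

// Usage sketch (CUDA):
//   __global__ void __launch_bounds__(clamp_launch_threads_sketch(2048))
//   my_kernel_sketch() {}
```
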
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15461

Differential Revision: D13633952

Pulled By: soumith

fbshipit-source-id: 795aa151109f343ab5433bf3cb070cb6ec896fff
2019-01-10 21:44:39 -08:00
Edward Yang
e58bbbac18 Delete dependencies from CUDAStream; remove synchronize_with (#13920)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13920

I want to move CUDAStream and CUDAGuard to c10_cuda without also
bringing along CUDAContext or CUDAEvent for the ride (at least for
now).  To do this, I need to eliminate those dependencies.

There are a few functions in CUDAContext.h which don't really need
THCState, so they're separated out and put in the general-purpose
c10/cuda/CUDAFunctions.h.
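
A hedged sketch of the kind of THCState-free helper such a general-purpose header can hold (signatures illustrative, error handling elided):
```
#include <cuda_runtime_api.h>

namespace cuda_functions_sketch {

inline int device_count() {
  int count = 0;
  cudaGetDeviceCount(&count);  // plain runtime call, no THCState involved
  return count;
}

inline void set_device(int device) {
  cudaSetDevice(device);
}

}  // namespace cuda_functions_sketch
```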

Reviewed By: smessmer

Differential Revision: D13047468

fbshipit-source-id: 7ed9d5e660f95805ab39d7af25892327edae050e
2018-11-19 17:05:41 -08:00
Edward Yang
0478d32cb8 Move AlignOf, SmallVector and ArrayRef to c10.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/13916

Reviewed By: smessmer

Differential Revision: D13046722

fbshipit-source-id: 1583d3170d60e22f0a535cd1fd56bdf928186f5d
2018-11-14 11:13:16 -08:00
Edward Yang
fbabe5bf62 Rename c10::detail to c10::impl (#13838)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13838

According to Sebastian, the detail convention is specifically for header-private
functionality.  That's not what c10/detail is; it holds general, library-private headers
which may be used in multiple places within PyTorch.  Rename it to impl to avoid
the confusion in nomenclature.

Reviewed By: smessmer

Differential Revision: D13024368

fbshipit-source-id: 050f2632d83a69e3ae53ded88e8f938c5d61f0ef
2018-11-14 07:39:37 -08:00
Edward Yang
e35418b3be New implementations of DeviceGuard, StreamGuard and MultiStreamGuard (with CUDA specializations) (#13342)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13342

This PR introduces a few new concepts:

- DeviceGuardImplInterface, and implementations for CPU and CUDA, which
  provide a generic interface for interfacing with device and stream state,
  without requiring a direct dependency on the code in question.
- InlineDeviceGuard, a general template for generating both specialized
  and dynamically dispatched device guard implementations.  Dynamic
  dispatch is done by specializing it on a VirtualGuardImpl.
- Provide a device-independent DeviceGuard class, which can be used even
  from CPU code. It uses the aforementioned dynamic dispatch.
- CUDA-specialized CUDAGuard class, which doesn't have a dynamic dispatch
  but can only be used from CUDA.
- StreamGuard, which is the same as above, but for streams rather than
  devices.
- Optional variants of all the aforementioned guards, which are a no-op if
  no device/stream is specified
- CUDAMultiStreamGuard, specifically for the case when we want to set
  a device on every guard.

There are some subtle semantic changes, which have been thoroughly documented
in the class definition.

BC-breaking changes:

- Move constructor/assignment have been removed from all device guard
  implementations.
- In some cases where you previously wrote 'set_device' (or 'set_stream'), you now must write
  'reset_device', because if you switch devices/device types, the stream/device on the
  previous device is unset.  This is different from previous behavior.
- CUDAGuard no longer handles streams, or multiple streams.  Use CUDAStreamGuard
  or CUDAMultiStreamGuard as appropriate for your use case.
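
A minimal sketch of the RAII guard pattern described above, assuming a simplified interface (names and signatures are illustrative, not the actual c10 classes):
```
struct DeviceGuardImplSketch {
  virtual ~DeviceGuardImplSketch() = default;
  // Switches to new_device and returns the device that was previously active.
  virtual int exchangeDevice(int new_device) = 0;
  virtual void setDevice(int device) = 0;
};

class DeviceGuardSketch {
 public:
  DeviceGuardSketch(DeviceGuardImplSketch& impl, int device)
      : impl_(impl), original_device_(impl.exchangeDevice(device)) {}
  ~DeviceGuardSketch() {
    impl_.setDevice(original_device_);  // restore the previous device on scope exit
  }
  DeviceGuardSketch(const DeviceGuardSketch&) = delete;
  DeviceGuardSketch& operator=(const DeviceGuardSketch&) = delete;

 private:
  DeviceGuardImplSketch& impl_;
  int original_device_;
};
```
Dynamic dispatch through a virtual impl corresponds to the device-independent DeviceGuard; a CUDA-only specialization can skip the virtual calls entirely.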

Reviewed By: dzhulgakov

Differential Revision: D12849620

fbshipit-source-id: f61956256f0b12be754b3234fcc73c2abc1be04e
2018-11-11 12:11:10 -08:00
Jerry Zhang
e06f92785c Move ATen/core/Macros.h to c10/macros/Macros.h
Summary:
EXT=h,cc,cpp,hpp,cxx,cu,cuh
d=caffe2/aten/
codemod -m -d $d --extensions $EXT 'AT_HOST_DEVICE' 'C10_HOST_DEVICE'
codemod -m -d $d --extensions $EXT 'AT_DEVICE' 'C10_DEVICE'
codemod -m -d $d --extensions $EXT 'AT_HOST' 'C10_HOST'
codemod -m -d $d --extensions $EXT 'AT_ANDROID' 'C10_ANDROID'
codemod -m -d $d --extensions $EXT 'AT_IOS' 'C10_IOS'
codemod -m -d $d --extensions $EXT 'AT_MOBILE' 'C10_MOBILE'
codemod -m -d $d --extensions $EXT 'ATen/core/Macros.h' 'c10/macros/Macros.h'
codemod -m -d $d --extensions $EXT 'HIP_HOST_DEVICE' 'C10_HIP_HOST_DEVICE'

Reviewed By: dzhulgakov

Differential Revision: D12851341

fbshipit-source-id: 7d540530ef779e16ddf2b4cdda9dcc85a61410c3
2018-11-05 12:32:11 -08:00
Sebastian Messmer
979560c9fc Include c10 namespace into caffe2 and at namespaces. (#12950)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12950

For backwards compatibility, we want the c10 symbols to be reachable from caffe2 and aten.
When we move classes from at/caffe2 to c10, this:
 1. allows keeping backwards compatibility with third-party code we can't control
 2. allows splitting diffs that move such classes into two diffs, where one only fixes the includes and the second one fixes the namespaces.
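
A minimal sketch of the namespace-inclusion idea (type name illustrative):
```
namespace c10 {
struct MovedTypeSketch {};  // a class that physically moved to c10
}  // namespace c10

namespace caffe2 {
using namespace c10;  // caffe2::MovedTypeSketch still compiles
}  // namespace caffe2

namespace at {
using namespace c10;  // at::MovedTypeSketch still compiles
}  // namespace at
```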

Reviewed By: ezyang

Differential Revision: D10496244

fbshipit-source-id: 914818688fad8c079889dfdc6242bc228b539f0e
2018-10-25 14:08:47 -07:00
Edward Yang
8c514627a4 Add C10_LIKELY/C10_UNLIKELY macros (#12932)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12932

I was looking at some assembly for some code I was working on,
and felt a desire to have likely()/unlikely() macros.  I checked
if we already had them, and we didn't.  This commit adds them,
and fixes up all known use sites to make use of them.
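
A hedged sketch of the usual likely/unlikely pattern built on __builtin_expect (macro names illustrative; the actual C10_LIKELY/C10_UNLIKELY definitions may differ):
```
#if defined(__GNUC__) || defined(__clang__)
#define LIKELY_SKETCH(expr)   (__builtin_expect(!!(expr), 1))
#define UNLIKELY_SKETCH(expr) (__builtin_expect(!!(expr), 0))
#else
#define LIKELY_SKETCH(expr)   (expr)
#define UNLIKELY_SKETCH(expr) (expr)
#endif

// Usage: steer codegen toward the common path.
//   if (UNLIKELY_SKETCH(index < 0)) { /* rare error path */ }
```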

Reviewed By: Maratyszcza

Differential Revision: D10488399

fbshipit-source-id: 7476da208907480d49f02b37c7345c17d85c3db7
2018-10-22 16:26:19 -07:00
David Reiss
96d826f635 Define REGISTER_CPU_GRADIENT_OPERATOR (#12588)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12588

By default, this is an alias for REGISTER_CPU_OPERATOR.  If gradients are not
required (e.g., on mobile) it can be converted to a no-op by defining
CAFFE2_NO_GRADIENT_OPS, resulting in a smaller build.

GRADIENT_OPERATOR_SCHEMA works similarly.

CAFFE2_NO_GRADIENT_OPS also converts REGISTER_GRADIENT to a no-op.

Use these macros in fully_connected_op.cc as an example.
Follow-up diffs will convert more operators.

I had to introduce MACRO_EXPAND to handle the way Visual Studio expands
__VA_ARGS__.
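
A hedged sketch of the conditional registration described above (macro names are taken from this message; the bodies are illustrative only):
```
#ifdef CAFFE2_NO_GRADIENT_OPS
  // Gradient operators compile away entirely on builds that do not need them.
  #define REGISTER_CPU_GRADIENT_OPERATOR(...) /* no-op */
#else
  // Otherwise behave exactly like the regular CPU registration macro
  // (REGISTER_CPU_OPERATOR is assumed to come from caffe2's operator registry).
  #define REGISTER_CPU_GRADIENT_OPERATOR(...) \
    REGISTER_CPU_OPERATOR(__VA_ARGS__)
#endif
```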

Reviewed By: Yangqing

Differential Revision: D10209468

fbshipit-source-id: 4116d9098b97646bb30a00f2a7d46aa5d7ebcae0
2018-10-22 10:01:02 -07:00
Yangqing Jia
7dbb38e856 Moving logging from caffe2 to c10. (#12881)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12881

TSIA. This should not change any functionality.

Remaining work:
- change the build script to deprecate use of CAFFE2_USE_MINIMAL_GOOGLE_GLOG and use a C10 macro instead.
- Unify the exception name (EnforceNotMet -> Error)
- Unify the logging and warning APIs (like AT_WARNING)

Reviewed By: dzhulgakov

Differential Revision: D10441597

fbshipit-source-id: 4784dc0cd5af83dacb10c4952a2d1d7236b3f14d
2018-10-19 20:22:08 -07:00
Yangqing Jia
7d5f7ed270 Using c10 namespace across caffe2. (#12714)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12714

This is a short change to enable the c10 namespace in caffe2. We did not enable
it before due to gflags global-variable confusion, but that should be mostly
cleaned up now. The plan on record is that namespace caffe2 and namespace aten
will be full supersets of namespace c10.

Most of the diff is codemod; the only two non-codemod changes are in caffe2/core/common.h, where

```
using namespace c10;
```

is added, and in Flags.h, where instead of creating aliasing variables in the c10 namespace, we put them directly in the global namespace to match gflags (with the same behavior when gflags is not built in).

Reviewed By: dzhulgakov

Differential Revision: D10390486

fbshipit-source-id: 5e2df730e28e29a052f513bddc558d9f78a23b9b
2018-10-17 12:57:19 -07:00
Edward Yang
07d67aa17a Make TensorOptions immutable. (#12630)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12630

Instead of providing mutable accessors, our "mutators" now
return new copies of TensorOptions.  Since TensorOptions is
simply two 64-bit integers, this is not a big efficiency
problem.

There may be some sites that assumed that TensorOptions was
mutable.  They need to be fixed.
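
A minimal sketch of the copy-returning "mutator" style, with the state simplified to two small fields (the real TensorOptions packs more):
```
#include <cstdint>

struct OptionsSketch {
  int16_t device_index = -1;
  int8_t dtype_id = 0;

  // Instead of mutating in place, each setter returns a modified copy.
  OptionsSketch with_device_index(int16_t index) const {
    OptionsSketch copy = *this;
    copy.device_index = index;
    return copy;
  }
  OptionsSketch with_dtype(int8_t id) const {
    OptionsSketch copy = *this;
    copy.dtype_id = id;
    return copy;
  }
};

// Usage: OptionsSketch opts = OptionsSketch{}.with_device_index(1).with_dtype(2);
```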

Reviewed By: SsnL

Differential Revision: D10249293

fbshipit-source-id: b3d17acc37e78c0b90ea2c29515de5dd01209bd3
2018-10-15 08:30:16 -07:00
Sebastian Messmer
6f664d3917 Improve TypeMeta (#11502)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11502

TypeMeta is now only a pointer to a TypeMetaData structure, of which there is exactly one global instance per type.
This reduces the size of everything storing a TypeMeta (Tensor, Blob, ...) and potentially improves performance.

Also, this diff gets rid of the type name registry in favor of static strings.
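
A minimal sketch of the "pointer to one global TypeMetaData per type" layout (simplified; not the actual implementation):
```
#include <cstddef>

struct TypeMetaDataSketch {
  size_t itemsize;
  const char* name;  // static string, no name registry needed
};

class TypeMetaSketch {
 public:
  template <typename T>
  static TypeMetaSketch Make(const char* static_name) {
    // Exactly one TypeMetaData instance per T; copying TypeMetaSketch copies
    // only this pointer, so the handle stays the size of one pointer.
    static const TypeMetaDataSketch data{sizeof(T), static_name};
    return TypeMetaSketch(&data);
  }
  size_t itemsize() const { return data_->itemsize; }
  const char* name() const { return data_->name; }

 private:
  explicit TypeMetaSketch(const TypeMetaDataSketch* data) : data_(data) {}
  const TypeMetaDataSketch* data_;
};

// Usage: auto meta = TypeMetaSketch::Make<float>("float");
```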

Experiments (summary: 1-3% perf gain)
- Service Lab: https://our.intern.facebook.com/intern/servicelab/30712497/
 -> No significant results found.
- Mobile Lab c10bench.json: https://our.intern.facebook.com/intern/fblearner/details/75984908/
 -> 1-3% perf gain
- Mobile Lab c10bench default: https://our.intern.facebook.com/intern/fblearner/details/75984999/
 -> 2-3% perf gain
- adindexer canary: https://our.intern.facebook.com/intern/ads/canary/413002142824203076
 -> no significant changes (benchmark too noisy)
- adfinder canary: https://our.intern.facebook.com/intern/ads/canary/413002166737860362
 -> no significant changes (benchmark too noisy)

Reviewed By: dzhulgakov

Differential Revision: D9763422

fbshipit-source-id: fc08937f114af5ff9f3ddbe7c7e396942868cdf5
2018-10-06 14:09:28 -07:00
Yangqing Jia
9c49bb9ddf Move registry fully to c10 (#12077)
Summary:
This does 6 things:

- add c10/util/Registry.h as the unified registry util
  - cleaned up some APIs such as export condition
- fully remove aten/core/registry.h
- fully remove caffe2/core/registry.h
- remove a bogus aten/registry.h
- unify all macros
- set up registry testing in c10

Also, an important note: we used to mark the templated Registry class as EXPORT. This should not happen, because one should almost never export a template class. This PR fixes that.
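
For context, a minimal sketch of what a header-only registry utility of this kind looks like (illustrative only; c10/util/Registry.h is more general than this):
```
#include <functional>
#include <map>
#include <memory>
#include <string>

template <class ObjectType>
class RegistrySketch {
 public:
  using Creator = std::function<std::unique_ptr<ObjectType>()>;

  void Register(const std::string& key, Creator creator) {
    creators_[key] = std::move(creator);
  }

  std::unique_ptr<ObjectType> Create(const std::string& key) const {
    auto it = creators_.find(key);
    return it == creators_.end() ? nullptr : it->second();
  }

 private:
  std::map<std::string, Creator> creators_;
};
```
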
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12077

Reviewed By: ezyang

Differential Revision: D10050771

Pulled By: Yangqing

fbshipit-source-id: 417b249b49fed6a67956e7c6b6d22374bcee24cf
2018-09-27 03:09:54 -07:00
Yangqing Jia
a6f1ae7f20 set up c10 scaffolding. Move macros proper first.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/11939

Reviewed By: orionr, dzhulgakov

Differential Revision: D10004629

Pulled By: Yangqing

fbshipit-source-id: ba50a96820d35c7922d81c78c4cbe849c85c251c
2018-09-24 11:09:59 -07:00