Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13377
* Enable junk fill for the default CPU allocator. The first diff only enables this for the tests. A second diff will change the default of zero-fill to false.
* Fix tests to use the 64-bit counters that IterOp and LearningRateOp demand.
* Fix kernels that use uninitialized memory.
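To illustrate why junk fill flushes out kernels that read uninitialized memory, here is a minimal sketch of an allocation helper that fills new buffers with a non-zero pattern. The names `JunkFillAlloc` and `kJunkByte`, and the fill pattern itself, are hypothetical and are not the actual caffe2 allocator code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical junk pattern; the real allocator's fill pattern may differ.
constexpr unsigned char kJunkByte = 0xCD;

// Allocate nbytes and fill them with a non-zero pattern, so that code which
// silently relies on zero-initialized memory fails visibly in tests.
// The caller releases the buffer with std::free().
inline void* JunkFillAlloc(std::size_t nbytes) {
  assert(nbytes > 0 && "zero-sized allocations are not handled in this sketch");
  void* p = std::malloc(nbytes);
  if (p != nullptr) {
    std::memset(p, kJunkByte, nbytes);
  }
  return p;
}
```

A kernel that accumulates into such a buffer without first clearing it would produce obviously wrong results instead of passing by accident.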
Reviewed By: salexspb
Differential Revision: D10866512
fbshipit-source-id: 17860e77e63a203edf46d0da0335608f77884821
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12714
This is a short change to enable the c10 namespace in caffe2. We did not enable
it before because of confusion over gflags global variables, but that should be
mostly cleaned up by now. The current plan of record is that namespace caffe2 and
namespace aten will both be full supersets of namespace c10.
Most of the diff is codemod. The only two non-codemod changes are in caffe2/core/common.h, where
```
using namespace c10;
```
is added, and in Flags.h, where instead of creating aliasing variables in the c10 namespace, we put the flags directly in the global namespace to match gflags (the behavior is the same when building without gflags).
Reviewed By: dzhulgakov
Differential Revision: D10390486
fbshipit-source-id: 5e2df730e28e29a052f513bddc558d9f78a23b9b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12029
In order to remove the New() function in StaticContext (and eventually StaticContext itself) and converge on the Allocator design, we first change the return type of New to at::DataPtr.
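As a rough sketch of the idea, the snippet below returns a buffer bundled with the function that frees it, so ownership and the deleter travel together. `SketchDataPtr`, `SketchFreeBuffer` and `SketchNew` are hypothetical names; `std::unique_ptr` is used here only as a stand-in and is not the at::DataPtr API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>

// Hypothetical stand-in for at::DataPtr: owns the buffer together with the
// function that knows how to free it.
using SketchDataPtr = std::unique_ptr<void, void (*)(void*)>;

// Deleter used by this sketch; a real context could supply its own.
inline void SketchFreeBuffer(void* p) {
  std::free(p);
}

// Returns an owning handle instead of a raw pointer, so the caller cannot
// forget which deleter belongs to the allocation.
inline SketchDataPtr SketchNew(std::size_t nbytes) {
  assert(nbytes > 0 && "zero-sized allocations are not handled in this sketch");
  return SketchDataPtr(std::malloc(nbytes), &SketchFreeBuffer);
}
```

The buffer is released automatically when the returned handle goes out of scope.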
Reviewed By: ezyang
Differential Revision: D9889990
fbshipit-source-id: 3257c763530b987025f428741bdd2e089d11bad4
Summary:
This does 7 things:
- add c10/util/Registry.h as the unified registry util
- clean up some APIs such as the export condition
- fully remove aten/core/registry.h
- fully remove caffe2/core/registry.h
- remove a bogus aten/registry.h
- unify all macros
- set up registry testing in c10
Also, an important note: we used to mark the templated Registry class as EXPORT. This should not happen, because one should almost never export a template class. This PR fixes that.
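To make the point about templates concrete, here is a minimal header-only registry sketch. `SketchRegistry` and its methods are hypothetical and much simpler than c10/util/Registry.h; the point is only that the class template carries no export annotation, since each instantiation is generated where it is used.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical, simplified registry; not the c10/util/Registry.h API.
// Note there is no export macro on the class template itself.
template <class Key, class Object>
class SketchRegistry {
 public:
  using Creator = std::function<std::unique_ptr<Object>()>;

  // Associates a key with a factory function, replacing any earlier entry.
  void Register(const Key& key, Creator creator) {
    assert(creator && "creator must be callable");
    creators_[key] = std::move(creator);
  }

  bool Has(const Key& key) const {
    return creators_.count(key) != 0;
  }

  // Returns nullptr when the key is unknown.
  std::unique_ptr<Object> Create(const Key& key) const {
    auto it = creators_.find(key);
    if (it == creators_.end()) {
      return nullptr;
    }
    return it->second();
  }

 private:
  std::map<Key, Creator> creators_;
};
```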
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12077
Reviewed By: ezyang
Differential Revision: D10050771
Pulled By: Yangqing
fbshipit-source-id: 417b249b49fed6a67956e7c6b6d22374bcee24cf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11748
For avx512, we need to align to a multiple of 64B, not 32B.
Regardless of avx512, it is generally a good idea to be cache-line aligned.
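A minimal sketch of 64-byte aligned allocation follows, assuming a POSIX system where `posix_memalign` is available. The names `kSketchAlignment`, `SketchAlignedAlloc` and `IsAligned` are hypothetical and not taken from the caffe2 source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // posix_memalign (POSIX only), free

// 64 bytes: the width of an AVX-512 register and a common cache line size.
constexpr std::size_t kSketchAlignment = 64;

// Allocate nbytes aligned to kSketchAlignment. Returns nullptr on failure.
// The caller releases the buffer with free().
inline void* SketchAlignedAlloc(std::size_t nbytes) {
  assert(nbytes > 0 && "zero-sized allocations are not handled in this sketch");
  void* p = nullptr;
  if (posix_memalign(&p, kSketchAlignment, nbytes) != 0) {
    return nullptr;
  }
  return p;
}

// True when the address of p is a multiple of alignment.
inline bool IsAligned(const void* p, std::size_t alignment) {
  return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```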
Reviewed By: ilia-cher
Differential Revision: D9845056
fbshipit-source-id: b1d3ed67749c0c1a64acd5cc230a1279e8023512
Summary:
Properly annotated all APIs for the CPU front. Checked with cmake using
cmake -DUSE_ATEN=ON -DUSE_CUDA=OFF -DBUILD_ATEN=ON
and the resulting libcaffe2.so has about 11k symbols.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10504
Reviewed By: ezyang
Differential Revision: D9316491
Pulled By: Yangqing
fbshipit-source-id: 215659abf350af7032e9a4b0f28a856babab2454
Adding NUMA awareness through numa_node_id in DeviceOption. Blobs of operators
with numa_node_id are allocated on the corresponding memory banks, using CPU pools with
NUMA affinity set to run operators.
Summary:
During the team meeting today Dima and Alex mentioned that the current lambda
function causes a slowdown when a large number of allocs and deallocs
happen. My observation is that most of the Delete calls are actually direct
Delete() function pointers, so I gave it a shot to see if we can reduce
the overhead.
RawAllocDealloc is already very fast, and we observe another 5ns reduction
(12.5%). For TensorAllocDealloc of 32x32 tensors, we observe a 57ns saving
(26%). This was measured on a Xeon(R) CPU E5-2660.
Also cleaned up the function interfaces of ShareExternalPointer so that we have
only 2 functions.
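One plausible reading of the change is sketched below: instead of wrapping the deleter in a type-erased lambda, the plain function pointer is handed to the owning pointer directly. All names here (`SketchDeleter`, `SketchFree`, `WrapWithLambda`, `WrapWithFunctionPointer`) are hypothetical, and the sketch makes no claim about the timings quoted above.

```cpp
#include <cassert>
#include <cstdlib>
#include <functional>
#include <memory>

// Plain function-pointer deleter type (hypothetical name).
using SketchDeleter = void (*)(void*);

inline void SketchFree(void* p) {
  std::free(p);
}

// Variant 1: the deleter is wrapped in a lambda held by std::function,
// which adds a layer of type erasure on every deallocation.
inline std::shared_ptr<void> WrapWithLambda(void* p, SketchDeleter d) {
  std::function<void(void*)> f = [d](void* q) { d(q); };
  return std::shared_ptr<void>(p, f);
}

// Variant 2: the function pointer is stored directly as the deleter.
inline std::shared_ptr<void> WrapWithFunctionPointer(void* p, SketchDeleter d) {
  assert(d != nullptr);
  return std::shared_ptr<void>(p, d);
}
```

Both variants free the buffer when the last owner goes away; they differ only in how the deleter is stored and invoked.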
Reviewed By: salexspb, dzhulgakov
Differential Revision: D5801013
fbshipit-source-id: 7068207a43400fa3902bbb3689b3c729e839456c