Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35187
When I touch these files, lint always introduces unintended changes; to prevent that, we need to format the code first.
change is generated by:
arc f
Test Plan: integration test.
Differential Revision: D20587596
fbshipit-source-id: 512cf6b86bd6632a61c80ed53e3a9e229feecc2a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36172
Original commit changeset: 3d7801613f86
D20449887 broke some OSS tests as the OSS export sync wasn't working correctly.
Test Plan:
Manually export latest version to OSS to trigger the tests
+ test plan in D20449887
verified onnx tests are passing in https://github.com/pytorch/pytorch/pull/36172
Reviewed By: andrewwdye
Differential Revision: D20902279
fbshipit-source-id: bc30fcc9f5cc8076f69a5d92675fd27455948372
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31966
This has three parts:
* When `--caffe2_handle_executor_threads_exceptions` is set when a parallel execution step throws an exception it can hang waiting for async nets to finish. This adds cancellation code to cancel any async nets.
* This makes the exceptions returned from parallel workers pass a std::exception_ptr so the stack trace can be recorded with folly::SmartExceptionTracer (the capture pattern is sketched after this list).
* Define Cancel method at NetBase level to avoid pulling in unsupported AsyncSchedulingNet for fbandroid.
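For the second bullet, a minimal self-contained sketch of the capture-and-rethrow pattern (generic C++, not the caffe2 plan executor itself; the error container and the tracer hook are placeholders):
```
#include <exception>
#include <stdexcept>
#include <vector>

int main() {
  std::vector<std::exception_ptr> worker_errors;

  // A worker catches whatever was thrown and stores the exception object
  // itself, rather than flattening it into a string, so it can be rethrown later.
  try {
    throw std::runtime_error("worker failed");
  } catch (...) {
    worker_errors.push_back(std::current_exception());
  }

  // The coordinating thread can rethrow (or hand the pointer to a tracer such
  // as folly::SmartExceptionTracer) to recover the original exception.
  for (const auto& e : worker_errors) {
    try {
      std::rethrow_exception(e);
    } catch (const std::exception& ex) {
      (void)ex;  // inspect ex.what(), record the stack, etc.
    }
  }
  return 0;
}
```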
Test Plan:
Added unit tests for plan_executor
buck test //caffe2/caffe2:caffe2_test_cpu
buck test //caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest --stress-runs 100
Reviewed By: boryiingsu
Differential Revision: D19320177
fbshipit-source-id: d9939fcea1317751fa3de4172dfae7f781b71b75
Summary:
This is a reland of https://github.com/pytorch/pytorch/pull/36196
Before the fix, bazel spews the following multi-line warning for every single caffe2 operator:
```
In file included from ./c10/util/logging_is_google_glog.h:50,
from ./c10/util/Logging.h:26,
from ./caffe2/core/logging.h:2,
from ./caffe2/core/blob.h:13,
from ./caffe2/core/operator.h:18,
from ./caffe2/sgd/adadelta_op.h:1,
from caffe2/sgd/adadelta_op.cc:1:
bazel-out/k8-fastbuild/bin/external/com_github_glog/_virtual_includes/glog/glog/logging.h: In instantiation of 'std::string* google::Check_LTImpl(const T1&, const T2&, const char*) [with T1 = int; T2 = long unsigned int; std::string = std::__cxx11::basic_string<char>]':
./caffe2/core/operator.h:192:5: required from 'const T& caffe2::OperatorBase::Input(int, caffe2::DeviceType) [with T = caffe2::Tensor; caffe2::DeviceType = c10::DeviceType]'
./caffe2/core/operator.h:890:48: required from 'const caffe2::Tensor& caffe2::Operator<Context>::Input(int, caffe2::DeviceType) [with Context = caffe2::CPUContext; caffe2::DeviceType = c10::DeviceType]'
./caffe2/sgd/adadelta_op.h:87:5: required from 'bool caffe2::SparseAdadeltaOp<Context>::RunOnDevice() [with Context = caffe2::CPUContext]'
./caffe2/sgd/adadelta_op.h:85:8: required from here
bazel-out/k8-fastbuild/bin/external/com_github_glog/_virtual_includes/glog/glog/logging.h:722:32: warning: comparison of integer expressions of different signedness: 'const int' and 'const long unsigned int' [-Wsign-compare]
722 | DEFINE_CHECK_OP_IMPL(Check_LT, < )
| ^
bazel-out/k8-fastbuild/bin/external/com_github_glog/_virtual_includes/glog/glog/logging.h:148:53: note: in definition of macro 'GOOGLE_PREDICT_TRUE'
148 | #define GOOGLE_PREDICT_TRUE(x) (__builtin_expect(!!(x), 1))
| ^
bazel-out/k8-fastbuild/bin/external/com_github_glog/_virtual_includes/glog/glog/logging.h:722:1: note: in expansion of macro 'DEFINE_CHECK_OP_IMPL'
722 | DEFINE_CHECK_OP_IMPL(Check_LT, < )
| ^~~~~~~~~~~~~~~~~~~~
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36224
Test Plan: CI
Differential Revision: D20919506
Pulled By: malfet
fbshipit-source-id: b8b4b7c62dcbc109b30165b19635a6ef30033e73
Summary:
Otherwise, bazel spews the following multi-line warning for every single caffe2 operator:
```
In file included from ./c10/util/logging_is_google_glog.h:50,
from ./c10/util/Logging.h:26,
from ./caffe2/core/logging.h:2,
from ./caffe2/core/blob.h:13,
from ./caffe2/core/operator.h:18,
from ./caffe2/sgd/adadelta_op.h:1,
from caffe2/sgd/adadelta_op.cc:1:
bazel-out/k8-fastbuild/bin/external/com_github_glog/_virtual_includes/glog/glog/logging.h: In instantiation of 'std::string* google::Check_LTImpl(const T1&, const T2&, const char*) [with T1 = int; T2 = long unsigned int; std::string = std::__cxx11::basic_string<char>]':
./caffe2/core/operator.h:192:5: required from 'const T& caffe2::OperatorBase::Input(int, caffe2::DeviceType) [with T = caffe2::Tensor; caffe2::DeviceType = c10::DeviceType]'
./caffe2/core/operator.h:890:48: required from 'const caffe2::Tensor& caffe2::Operator<Context>::Input(int, caffe2::DeviceType) [with Context = caffe2::CPUContext; caffe2::DeviceType = c10::DeviceType]'
./caffe2/sgd/adadelta_op.h:87:5: required from 'bool caffe2::SparseAdadeltaOp<Context>::RunOnDevice() [with Context = caffe2::CPUContext]'
./caffe2/sgd/adadelta_op.h:85:8: required from here
bazel-out/k8-fastbuild/bin/external/com_github_glog/_virtual_includes/glog/glog/logging.h:722:32: warning: comparison of integer expressions of different signedness: 'const int' and 'const long unsigned int' [-Wsign-compare]
722 | DEFINE_CHECK_OP_IMPL(Check_LT, < )
| ^
bazel-out/k8-fastbuild/bin/external/com_github_glog/_virtual_includes/glog/glog/logging.h:148:53: note: in definition of macro 'GOOGLE_PREDICT_TRUE'
148 | #define GOOGLE_PREDICT_TRUE(x) (__builtin_expect(!!(x), 1))
| ^
bazel-out/k8-fastbuild/bin/external/com_github_glog/_virtual_includes/glog/glog/logging.h:722:1: note: in expansion of macro 'DEFINE_CHECK_OP_IMPL'
722 | DEFINE_CHECK_OP_IMPL(Check_LT, < )
| ^~~~~~~~~~~~~~~~~~~~
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36196
Differential Revision: D20909696
Pulled By: malfet
fbshipit-source-id: 16723355f473379ba9da6d3c33bd561b9724800a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34753
This improves support for exceptions and capturing stack traces in caffe2 async nets. We generally want to use exceptions everywhere we can in order to preserve stack information. It also makes the exception timestamp more accurate so multiple exceptions at the same time can be correctly ordered.
Test Plan: Updated the tests to use the new error semantics + adds a test to ensure the stack is correctly propagated through deferrable async scheduling.
Reviewed By: andrewwdye
Differential Revision: D20449887
fbshipit-source-id: 047fdf1bd52fd7c7c1f3fde77df9a27ed9e288e7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35857
This fixes a lot of common ops for InferBlobShapesAndTypes and adds support for testing the inferred shapes and types of gradient ops.
Ops:
* Concat
* Split
* LeakyReLU
* Relu
* Prelu
* Gelu
* Elu
* Sinh, Tanh, Cosh
* Abs
* ... and a number of other simple element wise ops
Test Plan:
Added support to hypothesis test to check the shape and type of gradient ops.
Enabled it for all the ops I fixed the shape and type inference for.
buck test caffe2/caffe2/python/operator_test:
Reviewed By: pradeepd24
Differential Revision: D20806284
fbshipit-source-id: 77f796d9ff208e09e871bdbadf9a0a7c196b77f2
Summary:
Fixes incorrect usages of symbol annotations including:
1. Exporting or importing a function/class in an anonymous namespace.
2. Exporting or importing a function/class implementation in a header file. However, by removing the symbol annotations, they are now local symbols. If they need to remain global, I can move the implementations to the source file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35364
Differential Revision: D20670031
Pulled By: ezyang
fbshipit-source-id: cd8018dee703e2424482c27fe9608e040d8105b8
Summary:
And a few typos
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34791
Test Plan: CI
Differential Revision: D20524879
Pulled By: malfet
fbshipit-source-id: 58fa03bd6356979e77cd1bffb6370d41a177c409
Summary:
Throwing from a destructor leads to undefined behaviour (most often a segfault),
so it's better to leak memory than to segfault.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34756
Test Plan: Run `test_pytorch_onnx_caffe2`
Differential Revision: D20504228
Pulled By: malfet
fbshipit-source-id: 7a05776fea9036f602e95b8182f8493cb5886dab
Summary:
To reduce compilation time
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34811
Test Plan: CI
Differential Revision: D20476992
Pulled By: malfet
fbshipit-source-id: 922cde93783fbfc04854851d7a05a635d5239792
Summary:
Replacing <ATen/core/Tensor.h> with <ATen/core/TensorBody.h> speeds up compilation of caffe2 operators by 15%
For example, it reduces pool_op.cu compilation from 18.8s to 16s
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34810
Test Plan: CI
Differential Revision: D20472230
Pulled By: malfet
fbshipit-source-id: e1b261cc24ff577f09e2d5f6428be2063c6d4a8b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34105
make parallel_net_test.cc chronos conforming.
exclude gtest asserts that check thrown exceptions when exceptions are disabled.
Test Plan: CI green
Differential Revision: D20153525
fbshipit-source-id: 7371e559da948f46773fed09e3a23a77411d59e0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33954
fixes caffe2/core/module_test.cc on windows
misc lint fixes.
Test Plan: CI green
Reviewed By: malfet
Differential Revision: D20153512
fbshipit-source-id: aeae84a028e26edd65c7218611e3c49a8d9bb8c0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33959
make sure clang on windows uses correct attributes.
add support for cl.exe style pragma attributes
Test Plan: CI green
Differential Revision: D20153548
fbshipit-source-id: bfbfd374e8f5e7d7b8598453c3ca2b6693a425f1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33563
When NVCC or Clang are driving CUDA compilation many math functions are declared by default, with a small difference: Clang marks them as `__device__` only, while NVCC uses both `__host__` and `__device__`. This makes every un-elaborated `min` or `max` function call from a `__host__` function generate a syntax error when Clang is used.
Fix the errors by using `std::min` and `std::max` from `<algorithm>`, since C++14 they are `constexpr` and can be used in the `__device__` code [1].
1. https://llvm.org/docs/CompileCudaWithLLVM.html#algorithm
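A minimal sketch of the fix pattern (a hypothetical host-side helper, not the actual caffe2 change; the real edits touch caffe2 CUDA sources):
```
// Host-side helper in a .cu file: under clang-driven CUDA compilation a bare
// `min(n, max_threads)` resolves to a __device__-only overload and fails to
// compile in __host__ code; qualifying with std:: picks the constexpr
// <algorithm> version, usable from both __host__ and __device__ code since C++14.
#include <algorithm>

int pick_block_size(int n, int max_threads) {
  return std::min(n, max_threads);
}
```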
Test Plan:
```lang=bash
buck build mode/opt -c fbcode.cuda_use_clang=true //fblearner/flow/projects/dper:workflow
buck build mode/opt //fblearner/flow/projects/dper:workflow
```
Execute tests on devgpu:
```
buck test mode/dev-nosan -j 8 //caffe2/caffe2/python/operator_test/... //caffe2/test:cuda
```
Reviewed By: ngimel
Differential Revision: D20005795
fbshipit-source-id: 98a3f35e8a96c15d3ad3d2066396591f5cca1696
Summary: The first run of the net is noisy sometimes - just run it twice.
Reviewed By: cheshen1
Differential Revision: D20039274
fbshipit-source-id: 639e65646bf52f3efe1ecd4bbcd0e413d9389b29
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30734
What are specialized lists?
The IValues that hold List[int], List[Tensor], and List[AnythingElse] are different C++ types.
e.g. List[int] has a std::vector<int> while List[AnythingElse] holds a std::vector<IValue>.
Why do we have specialized lists?
When we first created the JIT we needed to bind the ATen C++ API which has std::vector<int>,
std::vector<Tensor> as inputs. The easiest way to match this API was to make our IValues contain
these same types. Conversion was just unwrapping the IValue, very easy and cheap.
What is the problem with specialized lists?
We end up with significant special-casing throughout the compiler. Other types like Dict are not
specialized. So in the Pickler, for instance, there is a single piece of logic to handle
their serialization. For Lists, we end up with multiple cases. Furthermore, it doesn't
match Python, leading to problems along translation boundaries. Our pickle serialization
is slightly different from Python's, so it is harder to load objects from our IValue serialization
as Python values.
They also make it harder to provide an easy-to-use user API. We'd like to match pybind11 for C++
bindings to TorchScript. This would entail having a single torch::List class (untemplated)
that can be used to construct inputs. This is made much harder if the underlying ivalue needs
to be different depending on the type inside the list. The ideal case would be to have a constructor like
```
template<typename T>
List(std::vector<T> foo);
```
It would then set up the type tags correctly based on type T, without the need for passing tags.
Do specialized lists improve perf?
Not in a way we have been able to measure. Our major concern initially was having to translate
a std::vector<IValue> to std::vector<int> to call ATen functions. This was especially a concern
for aten::_convolution which takes a number of mostly-constant lists of integers. However,
when we measure the effect of actually having to do this conversion for an aten::_convolution,
it does not take measurable time (benchmark results below).
This is true even if you use a trivial convolution (e.g. 1x1x1), and comment out the actual convolution code.
What are the issues removing them?
This PR removes list specialization but keeps the serialization format, and IValue APIs almost exactly
the same. The only visible change is that toTensorListRef and family have turned into toTensorVector
because they now return by value a copy of the list as a vector.
Further PRs can then clean up the complexity issues that arose from specialization. This will likely
involve removing the isTensorList/isIntList functions, and refactoring the code that used them to
work generically. At some point we will also change serialization to no longer write specialized
lists in the pickle binary. This is forward incompatible, so will go in its own PR.
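A minimal sketch of the visible API change (assuming the c10::IValue accessors named above; this only illustrates the call-site difference, not the internals):
```
#include <ATen/core/ivalue.h>
#include <vector>

void use_tensor_list(const c10::IValue& iv) {
  // Before: the specialized list could be exposed by reference,
  //   const std::vector<at::Tensor>& ts = iv.toTensorListRef();
  // After: the list is stored generically, so callers get a copy by value.
  std::vector<at::Tensor> ts = iv.toTensorVector();
  for (const auto& t : ts) {
    (void)t;  // consume each tensor
  }
}
```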
Benchmark:
```
import torch
import torch.nn as nn
import torch.nn.functional as F
import time
class MnistNet(nn.Module):
    def __init__(self):
        super(MnistNet, self).__init__()
        self.conv1 = nn.Conv2d(1, 1, kernel_size=1)
        self.conv2 = nn.Conv2d(1, 1, kernel_size=1)
    def forward(self, x):
        for i in range(10):
            x = F.relu(self.conv1(x))
            x = F.relu(self.conv2(x))
        return x
model = MnistNet()
x = torch.rand(1, 1, 1, 1)
r = torch.jit.trace(model, x)
r(x)
r(x)
r(x)
r(x)
print(torch.jit.last_executed_optimized_graph())
while True:
    b = time.time()
    for i in range(100):
        r(x)
    e = time.time()
    print(e - b)
```
Results (no observable difference):
```
Before (actual conv)
0.13251137733459473
0.13260436058044434
0.13276338577270508
0.1327497959136963
0.13250041007995605
0.13270330429077148
0.13290190696716309
0.13265132904052734
0.13274288177490234
0.1326758861541748
0.13253355026245117
0.13254785537719727
0.13260746002197266
0.13285017013549805
0.13264012336730957
0.132490873336792
0.13280034065246582
0.13243484497070312
0.1325232982635498
0.1326127052307129
0.13264131546020508
0.13274383544921875
0.13298296928405762
0.1326909065246582
-------------------
After (actual conv)
0.13127517700195312
0.13150334358215332
0.13092470169067383
0.13102364540100098
0.13134360313415527
0.13155555725097656
0.13314104080200195
0.13151955604553223
0.13160037994384766
0.1315293312072754
0.13137340545654297
0.13148093223571777
0.131455659866333
0.1327371597290039
0.13134026527404785
0.13152337074279785
0.13151192665100098
0.13165974617004395
0.13403725624084473
0.13251852989196777
0.13135504722595215
0.1315624713897705
0.1317615509033203
0.1314380168914795
0.13157200813293457
--------------------
The following replace the convolution operator with a no-op, to show
that even if the conv op was made faster, then we still would not see
a difference:
Before (fake conv)
0.0069539546966552734
0.0069522857666015625
0.007120847702026367
0.007344722747802734
0.007689952850341797
0.007932662963867188
0.00761723518371582
0.007501363754272461
0.007532835006713867
0.007141828536987305
0.007174253463745117
0.007114410400390625
0.007071495056152344
------------------
After (fake conv)
0.007458209991455078
0.007337093353271484
0.007268190383911133
0.007313251495361328
0.007306575775146484
0.007468700408935547
0.0073091983795166016
0.007308483123779297
0.007538318634033203
0.007356882095336914
0.007464170455932617
0.007372140884399414
```
Test Plan: Imported from OSS
Differential Revision: D18814702
Pulled By: zdevito
fbshipit-source-id: 0371c73b63068fdc12f24b801371ea90f23531a6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31335
When an error occurs in a net we end up cancelling all the async ops. If one error occurs it's highly likely other errors will occur as well.
Typically we see:
1. SendOp failed due to a network error
2. async scheduling cancels all other ops via `SetFinished("Cancelled");`
3. Another SendOp fails due to a network error and crashes the process when the exception is thrown.
This changes caffe2 ops to allow failing twice.
Test Plan: buck test //caffe2/caffe2:caffe2_test_cpu
Reviewed By: andrewwdye
Differential Revision: D19106548
fbshipit-source-id: 4b7882258a240894cc16d061a563c83a3214d3d9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30915
Since we now have C++14, we don't need these c10::guts helpers anymore
ghstack-source-id: 95777609
Test Plan: waitforsandcastle
Differential Revision: D18869639
fbshipit-source-id: 97716f932297c64c6e814410ac47b444c33d4e2e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30917
This is a C++14 feature, we can use this now.
ghstack-source-id: 95255753
Test Plan: waitforsandcastle
Differential Revision: D18869637
fbshipit-source-id: dd02036b9faeaffa64b2d2d305725443054da31b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31116
Changelist:
- remove BUILD_NAMEDTENSOR macro
- remove torch._C._BUILD_NAMEDTENSOR
- remove all python behavior that relies on torch._C._BUILD_NAMEDTENSOR
Future:
- In the next diff, I will remove all usages of
ATen/core/EnableNamedTensor.h since that header doesn't do anything
anymore
- After that, we'll be done with the BUILD_NAMEDTENSOR removal.
Test Plan: - run CI
Differential Revision: D18934951
Pulled By: zou3519
fbshipit-source-id: 0a0df0f1f0470d0a01c495579333a2835aac9f5d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30912
Add a new data type ZERO_COLLISION_HASH .
Test Plan: ci
Reviewed By: boryiingsu
Differential Revision: D18843626
fbshipit-source-id: b2d8280f13c78b4a656cf95822198df59de7b64c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30315
The new structure is that libtorch_cpu contains the bulk of our
code, and libtorch depends on libtorch_cpu and libtorch_cuda.
This is a reland of https://github.com/pytorch/pytorch/pull/29731 but
I've extracted all of the prep work into separate PRs which can be
landed before this one.
Some things of note:
* torch/csrc/cuda/nccl.cpp was added to the wrong list of SRCS, now fixed (this didn't matter before because previously they were all in the same library)
* The dummy file for libtorch was brought back from the dead; it was previously deleted in #20774
* In an initial version of the patch, I forgot to make torch_cuda explicitly depend on torch_cpu. This led to some very odd errors, most notably "bin/blob_test: hidden symbol `_ZNK6google8protobuf5Arena17OnArenaAllocationEPKSt9type_infom' in lib/libprotobuf.a(arena.cc.o) is referenced by DSO"
* A number of places in Android/iOS builds have to add torch_cuda explicitly as a library, as they do not have transitive dependency calculation working correctly
* I had to make torch_cpu/torch_cuda caffe2_interface_library so that they get whole-archived linked into torch when you statically link. And I had to do this in an *exported* fashion because torch needs to depend on torch_cpu_library. In the end I exported everything and removed the redefinition in the Caffe2Config.cmake. I am not too sure why the old code did it this way in the first place, but it doesn't seem to have broken anything to switch it this way.
* There are some uses of `__HIP_PLATFORM_HCC__` still in `torch_cpu` code, so I had to apply it to that library too (UGH). This manifests as a failure when trying to run the CUDA fuser. This doesn't really matter substantively right now because we still in-place HIPify, but it would be good to fix eventually. This was a bit difficult to debug because of an unrelated HIP bug, see https://github.com/ROCm-Developer-Tools/HIP/issues/1706
Fixes #27215 (as our libraries are smaller), and executes on
part of the plan in #29235.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D18790941
Pulled By: ezyang
fbshipit-source-id: 01296f6089d3de5e8365251b490c51e694f2d6c7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29337
This argument is needed by boxing wrappers so they're able to get a pointer to the corresponding unboxed kernel and call into it.
But if a kernel is registered in a boxed way, we don't need it and should hide this from the API.
This is especially needed for the backend fallback API where users would only be left wondering why this argument is there and what it does.
Also, hiding it allows us to potentially totally remove it in a future refactoring if we find some way to do so.
ghstack-source-id: 94481316
Test Plan: unit tests
Differential Revision: D18361991
fbshipit-source-id: 5cef26c896fe3f2a5db730d3bc79dcd62e7ef492
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29201
This is required for boxed backend fallback kernels (e.g. lazy, AMP) because they need to know which op was actually called.
ghstack-source-id: 94481313
Test Plan: I will add unit tests in a diff stacked on top
Differential Revision: D18282746
fbshipit-source-id: 339a1bbabd6aff31a587b98f095c75104dfc6f99
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29731
The new structure is that libtorch_cpu contains the bulk of our
code, and libtorch depends on libtorch_cpu and libtorch_cuda.
Some subtleties about the patch:
- There were a few functions that crossed CPU-CUDA boundary without API macros. I just added them, easy enough. An inverse situation was aten/src/THC/THCTensorRandom.cu where we weren't supposed to put API macros directly in a cpp file.
- DispatchStub wasn't getting all of its symbols related to static members on DispatchStub exported properly. I tried a few fixes but in the end I just moved everyone off using DispatchStub to dispatch CUDA/HIP (so they just use normal dispatch for those cases.) Additionally, there were some mistakes where people incorrectly were failing to actually import the declaration of the dispatch stub, so added includes for those cases.
- torch/csrc/cuda/nccl.cpp was added to the wrong list of SRCS, now fixed (this didn't matter before because previously they were all in the same library)
- The dummy file for libtorch was brought back from the dead; it was previously deleted in #20774
- In an initial version of the patch, I forgot to make torch_cuda explicitly depend on torch_cpu. This led to some very odd errors, most notably "bin/blob_test: hidden symbol `_ZNK6google8protobuf5Arena17OnArenaAllocationEPKSt9type_infom' in lib/libprotobuf.a(arena.cc.o) is referenced by DSO"
- A number of places in Android/iOS builds have to add torch_cuda explicitly as a library, as they do not have transitive dependency calculation working correctly. This situation also happens with custom C++ extensions.
- There's a ROCm compiler bug where extern "C" on functions is not respected. There's a little workaround to handle this.
- Because I was too lazy to check if HIPify was converting TORCH_CUDA_API into TORCH_HIP_API, I just made it so HIP build also triggers the TORCH_CUDA_API macro. Eventually, we should translate and keep the nature of TORCH_CUDA_API constant in all cases.
Fixes #27215 (as our libraries are smaller), and executes on
part of the plan in #29235.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D18632773
Pulled By: ezyang
fbshipit-source-id: ea717c81e0d7554ede1dc404108603455a81da82
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29653
I didn't remove is_variable from Tensor for BC reasons, but I did
remove as many uses as I could from the codebase.
at::impl::variable_excluded_from_dispatch got moved to TensorBody.h
so that it's more widely accessible.
This diff is NOT semantics preserving. Here are the major differences:
- In a number of native operator implementations, we tested that arguments
are not variable. I replaced these with asserts that variable is
excluded from dispatch. I actually don't think these asserts are really
necessary now (they should certainly be true, but it's hard to get
it wrong), but I've kept them for old time's sake. At least, they'll detect
if you call these functions before you've processed variable (indicating
a bug in your kernel.)
- There are a number of places where we do a per-tensor test for being a
variable, for better error reporting when someone commits Tensor/Variable
confusion. Although these tests are substantively the same as the
tests above, in these cases I decided to *delete* the test entirely.
The reasoning is that in these cases, we didn't really care about
dispatch (also, see above; I'm not too sure we really need the dispatch
asserts), we cared about Tensor/Variable confusion. Since Tensor/Variable
confusion is impossible now, we don't need the tests. One of the key
factors which pushed me one way or another was whether or not a function
was doing per-tensor validation; if I kept the assert in such functions,
I'd repeatedly access the TLS. Even if we want to bring back the asserts,
they would have to go somewhere else.
Another similar idiom is the number of places we do !x.defined() ||
x.is_variable(); I treated this equivalently.
- nuclear_norm's computation of compute_uv is a bit weird, but I think
it's OK to just delete the is_variable case (I *suspect* that it is
always the case that self.is_variable(), but it doesn't really matter.)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D18496168
Pulled By: ezyang
fbshipit-source-id: 5a1ded931e0c10a6b758ba64a8380d34110e0c3e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29670
This is the entry point for loading CUDA code; improve the error message to prompt users to check that GPU code is included.
Test Plan: Build without gpu code. Run the binary. Check that the new error message exists.
Reviewed By: yfeldblum
Differential Revision: D18453798
fbshipit-source-id: 63d9ec50acdf57ef4baf3f7d99c836c56bc1435e
Summary:
Also move the logic that installs the pybind11 headers from setup.py to cmake (to align with other headers).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29659
Differential Revision: D18458208
Pulled By: bddppq
fbshipit-source-id: cfd1e74b892d4a65591626ab321780c8c87b810d
Summary:
This diff adds the following:
- An AsyncIf to support conditional async execution. This op assumes that then_net and else_net are async scheduling nets. This op itself completes when every async op in the active net completes. Cancellation cancels the inner nets and the async ops.
- Unit tests targeting asynchronicity and error/cancellation handling.
Test Plan:
New unit tests
With --stress-runs=2000:
https://our.intern.facebook.com/intern/testinfra/testrun/4785074616784325
Reviewed By: ilia-cher
Differential Revision: D18051357
fbshipit-source-id: 1399a437b3ca63fd4ea0cf08d173f85b9242cc1f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29052
Make sure we handle the case of multiple, async, terminal (no children)
and failing cpu ops.
Test Plan: AsyncIf tests
Reviewed By: yyetim
Differential Revision: D18276401
Pulled By: ilia-cher
fbshipit-source-id: 35b175dd025bc7e392056ac1331b159376a29e60
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28024
We preallocated type ids to align them with ScalarType. At that point, the maximum type id was 10 and we used 11 to specify undefined type id.
However, since then, ScalarType got more additions, 11 isn't undefined anymore, and numbers 11-15 have meaning.
caffe2::TypeIdentifier also got its separate additions, 12 and upwards have meaning that differs from ScalarType.
I'm going with the (CI-tested) assumption that caffe2::TypeIdentifier and ScalarType actually don't need to be aligned
and remove the functionality for preallocated type ids. This simplifies our type ids.
ghstack-source-id: 92051872
Test Plan: unit tests
Differential Revision: D17936165
fbshipit-source-id: 2c9df2b9b3f35b3e319641c96638321ac3433d5c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26509
We preallocated type ids to align them with ScalarType. At that point, the maximum type id was 10 and we used 11 to specify undefined type id, see https://github.com/pytorch/pytorch/pull/10139.
However, since then, ScalarType got more additions, 11 isn't undefined anymore, and numbers 11-15 have meaning.
caffe2::TypeIdentifier also got its separate additions, 12 and upwards have meaning that differs from ScalarType.
I'm going with the (CI-tested) assumption that caffe2::TypeIdentifier and ScalarType actually don't need to be aligned
and remove the functionality for preallocated type ids. This simplifies our type ids.
ghstack-source-id: 91896918
Test Plan: unit tests
Differential Revision: D17490109
fbshipit-source-id: 800c340d9d3556a99f6e3ffc33af14ad68d7cc59
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26502
Create type ids at compile time instead of incrementing a counter at runtime. This is done by computing a compile time crc64 on the type name. We couldn't do this before, because we still used GCC4 and that compiler didn't support the use of `__PRETTY_FUNCTION__` in a constexpr context. However, since GCC5 this is possible and we can use this trick.
This does not change the semantics of preallocated type ids. I actually think we don't need to preallocate anymore, but I split the removal of preallocation into a separate diff to be able to test it separately.
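A minimal standalone sketch of the compile-time idea (using a simple FNV-1a hash for brevity instead of the crc64 mentioned above; this is not the actual caffe2 code and, as noted, requires GCC 5+ or Clang):
```
#include <cstdint>

// FNV-1a over a NUL-terminated string, evaluated at compile time.
constexpr uint64_t fnv1a(const char* s) {
  uint64_t h = 14695981039346656037ull;
  for (; *s != '\0'; ++s) {
    h = (h ^ static_cast<uint64_t>(*s)) * 1099511628211ull;
  }
  return h;
}

// __PRETTY_FUNCTION__ embeds T's name, so each instantiation hashes a
// different string and yields a distinct id without any runtime counter.
template <typename T>
constexpr uint64_t type_id() {
  return fnv1a(__PRETTY_FUNCTION__);
}

static_assert(type_id<int>() != type_id<double>(), "ids differ per type");
```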
ghstack-source-id: 91896920
Test Plan: unit tests
Differential Revision: D17488861
fbshipit-source-id: ce7b059d7c8686b69cb091a4a8beaf4b96391343
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27086
This is a major source of merge conflicts, and AFAICT isn't necessary anymore (it may have been necessary for some mobile build stuff in the past).
This is a commandeer of #25031
Test Plan: Imported from OSS
Reviewed By: ljk53
Differential Revision: D17687345
Pulled By: ezyang
fbshipit-source-id: bf6131af835ed1f9e3c10699c81d4454a240445f
Summary: Add helper function randomFill to test_utils.h so we can use it in benchmark scripts as well as tests.
Test Plan:
```
buck run mode/opt //tvm/sparse:cblas_bench
```
Reviewed By: yinghai
Differential Revision: D17759193
fbshipit-source-id: e4909b04e83ca9382ab4718855fb63743d028de1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26337
- Factor out boxing and unboxing functionality from the c10 dispatcher into a c10::KernelFunction class
- Move that class and everything else it depends on into ATen/core/boxing
- This also allows us to get rid of c10::KernelCache. Instead, we now store a pointer to the unboxed functor in c10::KernelFunction.
- We're also getting rid of the DispatchTableEntry struct and instead store KernelFunction directly.
- To make this work, we need to change the dispatcher calling API from Dispatcher::lookup().callBoxed/callUnboxed and OperatorEntry::lookup().callBoxed/callUnboxed to Dispatcher::callBoxed/callUnboxed and OperatorEntry::callBoxed/callUnboxed.
ghstack-source-id: 90459911
Test Plan: unit tests
Differential Revision: D17416607
fbshipit-source-id: fd221f1d70eb3f1b4d33092eaa7e37d25684c934
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25908
Original commit changeset: f6e961e88c01
device_option propagation is completely broken in Caffe2 for cases when pass-through operators are used. As an example, the Gather operator doesn't have a gradient and passes through its inputs, which results in incorrect detection of the components for sparse parameter aggregation (the component will be empty instead of the real device).
This diff is trying to fix this issue.
The original diff had a problem: Caffe2 does not handle cases when a device option is present but contains only metadata (for example the one for auto-generated reduction ops in the backward pass). This diff addresses that by merging device options during the backward pass.
Test Plan:
1. net_transform is finally working with Gather + FloatToHalf transformed model instead of failing because of incorrect number of components.
2. New unit-test.
3. Verify that previously broken benchmark is now passing
ezyang do you have suggestions what else I should test?
Reviewed By: ezyang
Differential Revision: D17281528
fbshipit-source-id: 4a1bc386f29f6a34fbf8008effde9d4890abebfa
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23668
- The eager mode frontend now calls operators that are defined in native_functions.yaml with `use_c10_dispatcher: True` through the c10 dispatcher and no longer through globalATenDispatch().
- These operators aren't registered with globalAtenDispatch anymore, only on c10 now.
- Backend extensions calling globalATenDispatch().registerOp() to add their own kernels still work, this function will forward the registration to the c10 dispatcher for them.
ghstack-source-id: 90130455
Test Plan: benchmarks at https://docs.google.com/document/d/1gpzKZcFf1JJameY1vKxF7Cloul9s6D8HKIK2_Pp1hFo/edit#
Differential Revision: D16603133
fbshipit-source-id: 991f17b355e9c78c5e86fee4fa381df7ab98ac82
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25650
This PR removes protobuf dependencies from mobile build altogether:
- caffe2/proto: protobuf files, including caffe2.proto and torch.proto;
- caffe2 components that depend on caffe2.proto, including most part of
caffe2/core, caffe2/utils;
- libprotobuf / libprotobuf-lite dependencies;
- protobuf compiler;
- some utils class, e.g.: netdef_converter.cpp;
- introduce a macro to disable third_party/onnx which depends on protobuf;
Test Plan:
- builds;
- link with demo app to make sure it can load and run a model in pickle format;
Differential Revision: D17183548
Pulled By: ljk53
fbshipit-source-id: fe60b48674f29c4a9b58fd1cf8ece44191491531
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25671
To decouple string_utils.h from types.h and protobuf headers.
Logically, GetDimFromOrderString seems to be more similar to
StringToStorageOrder than to other string_utils functions.
Test Plan: - Will check all internal/external CI jobs.
Reviewed By: yinghai
Differential Revision: D17191912
Pulled By: ljk53
fbshipit-source-id: fe555feef27bfd74c92b6297c12fb668252ca9ff
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23888
This is an alternative to https://github.com/pytorch/pytorch/pull/23684.
Instead of splitting a bunch of headers into declaration and definition, we change tensor includes to only include the tensor declaration when the tensor definition isn't needed.
ghstack-source-id: 89357687
Test Plan: waitforsandcastle
Differential Revision: D16673569
fbshipit-source-id: fa1d92809b05de7910a8c2dc2f55abe071ca63bf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25620
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25602
Enable rocThrust with hipCUB and rocPRIM for ROCm. They are the ROCm implementations of the thrust and cub APIs and replace the older hip-thrust and cub-hip packages going forward. ROCm 2.5 is the first release to contain the new packages as an option, as of 2.6 they will be the only available option.
Add hipification rules to correctly hipify thrust::cuda to thrust::hip and cub:: to hipcub:: going forward. Add hipification rules to hipify specific cub headers to the general hipcub header.
Infrastructure work to correctly find, include and link against the new packages. Add the macro definition to choose the HIP backend to Thrust.
Since include chains are now a little different from CUDA's Thrust, add includes for functionality used where applicable.
Skip four tests that fail with the new rocThrust for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21864
Reviewed By: xw285cornell
Differential Revision: D16940768
Pulled By: bddppq
fbshipit-source-id: 3dba8a8f1763dd23d89eb0dd26d1db109973dbe5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25252
Our model going forward for extensions will be that you will have to
get an allocation of an ID in our system. This is how things work
in practice today; we're just simplifying our underlying registration
since there is no need to have distributed registration.
There are some codemods in this diff:
```
codemod --extensions cpp,h,cc,cuh,py,in --exclude-paths=c10/core/TensorTypeId.h '([A-Za-z]+?)TensorId\(\)' 'TensorTypeId::\1TensorId'
codemod --extensions cpp,h,cc,cuh,py,in 'TensorTypeIds::undefined\(\)' 'TensorTypeId::UndefinedTensorId'
codemod --extensions cpp 'TensorType1\(\)' 'TensorTypeId::CPUTensorId'
codemod --extensions cpp 'TensorType2\(\)' 'TensorTypeId::CUDATensorId'
codemod --extensions cpp 'TensorType3\(\)' 'TensorTypeId::XLATensorId'
codemod --extensions cpp 'TensorType1' 'CPUTensorId'
codemod --extensions cpp 'TensorType2' 'CUDATensorId'
codemod --extensions cpp 'TensorType3' 'XLATensorId'
```
The main hand-written changes are in c10/core/TensorTypeId.h
Other manual fixes:
- aten/src/ATen/core/op_registration/op_registration.cpp - stop using
std::string operator+
- aten/src/ATen/function_wrapper.py - handle a hardcoded TypeId() that
wasn't caught by codemod
- torch/csrc/tensor/python_tensor.h - fix now incorrect forward declaration
of TensorTypeId
- aten/src/ATen/core/op_registration/ - remove out-of-line registration
Differential Revision: D17072001
Test Plan: ossci and sandcastle
Pulled By: ezyang
fbshipit-source-id: c641515fd0604c045c54fbb1d6b1b950f45e89d1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24361
Currently we only support Conv in the kernel but have an entrance for both types using the same class.
It is time to make the change.
Reviewed By: csummersea
Differential Revision: D16604713
fbshipit-source-id: b98d39a2c7960707cd50ba27e43dce73f741eeeb
Summary:
Adds qtensor specific fields to the proto file so that they get serialized into the model.json
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23356
ghstack-source-id: 87263428
Differential Revision: D16473237
fbshipit-source-id: bf5b51d0863d036d30a1644a3c3b74516468224b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23096
Nets can have state that depends on the rest of the state in the Workspace. Hence, they should be destructed first.
Reviewed By: ajyu
Differential Revision: D16382987
fbshipit-source-id: 3fd030ba206e2d0e897abb9e31c95bdaeb9482b7
Summary:
As part of the Variable/Tensor merge, we want to be able to pass Variables into Caffe2 without doing extra shallow copy, to improve performance and also allow for in-place mutations in Caffe2 ops. There are a few approaches outlined in https://github.com/pytorch/pytorch/pull/22418, and this PR is the chosen approach.
Specifically, we can have the assumption that we won't be connecting autograd to C2 gradients at any point (as it's too tricky and not that useful). Therefore, we can pass Variable into Caffe2 ops by requiring that all Variables in Caffe2 don't require grad. For code paths in Caffe2 that might potentially track gradients (e.g. `ScriptModuleOp` and `call_caffe2_op_from_c10`), we use the `torch::NoGradGuard` to make sure gradients are not tracked.
This supersedes https://github.com/pytorch/pytorch/pull/22418.
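A minimal sketch of the guard usage described above (the function and the op call are hypothetical; torch::NoGradGuard is the guard named in the summary):
```
#include <torch/torch.h>

torch::Tensor call_into_caffe2(const torch::Tensor& input) {
  // Any code path that could record autograd history (e.g. ScriptModuleOp or
  // call_caffe2_op_from_c10 in the summary) runs under a NoGradGuard, so the
  // Variables handed to Caffe2 never end up requiring grad.
  torch::NoGradGuard no_grad;
  return input * 2;  // stand-in for the actual Caffe2 operator invocation
}
```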
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22473
Differential Revision: D16099042
Pulled By: yf225
fbshipit-source-id: 57efc3c7cfb3048d9abe90e63759acc14ebd2972
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22477
There is actually no use of an uninitialized variable, but some compilers are not smart enough to reason that the two if branches are always taken together.
Reviewed By: hx89
Differential Revision: D16100211
fbshipit-source-id: 25f01d668063603d7aaa776451afe8a10415d2ea
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22005
When a Dict or List is created with type information, it will remember that.
If at any point later, this list is instantiated to a List<T> with a concrete type, it will assert that T is the correct type.
Differential Revision: D15914462
fbshipit-source-id: a8c3d91cb6d28d0c1ac0b57a4c4c6ac137153ff7
Summary:
Currently the build system accepts USE_NAMEDTENSOR from the environment
variable and turns it into NAMEDTENSOR_ENABLED when passing to CMake.
This discrepancy does not seem necessary and complicates the build
system. The naming of this build option is also semantically incorrect
("BUILD_" vis-a-vis "USE_"). This commit eradicate this issue before it
is made into a stable release.
The support of NO_NAMEDTENSOR is also removed, since PyTorch has been
quite inconsistent about "NO_*" build options.
---
Note: All environment variables with their names starting with `BUILD_` are currently automatically passed to CMake with no need of an additional wrapper.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22360
Differential Revision: D16074509
Pulled By: zou3519
fbshipit-source-id: dc316287e26192118f3c99b945454bc50535b2ae
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22241
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20387
glibc has a non-standard function, feenableexcept, that triggers floating-point exception handler . Compared to feclearexcept + fetestexcept , this approach allows us to see precisely where the exception is raised from the stack trace.
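A minimal standalone sketch of the glibc mechanism (a toy main, not the caffe2 integration; `feenableexcept` is a glibc extension, not standard C++):
```
#include <fenv.h>   // feenableexcept (glibc extension)
#include <cstdio>

int main() {
  // Trap on divide-by-zero, invalid, and overflow; the process then receives
  // SIGFPE at the faulting instruction, so the stack trace points right at it.
  feenableexcept(FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW);

  volatile double zero = 0.0;
  double y = 1.0 / zero;   // raises SIGFPE here instead of silently producing inf
  std::printf("%f\n", y);  // never reached
  return 0;
}
```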
Reviewed By: jspark1105
Differential Revision: D15301095
fbshipit-source-id: 94f6e72456b2280f78d7d01c2ee069ae46d609bb
Summary:
Saying `I` in an error message is too subjective for a framework.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22369
Differential Revision: D16067712
Pulled By: soumith
fbshipit-source-id: 2a390646bd5b15674c99f65e3c460a7272f508b6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22084
For DictPtr/ListPtr, default construction was disallowed because it was ambiguous whether it was supposed to create an empty list or a nullptr.
But since we renamed them to Dict/List, we can now allow default construction without ambiguity.
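A minimal sketch of what this enables (assuming the templated c10::List/c10::Dict containers referenced above):
```
#include <ATen/core/List.h>
#include <ATen/core/Dict.h>
#include <string>

int main() {
  // With the element type carried by the template parameter, a
  // default-constructed container is unambiguously an empty one.
  c10::List<int64_t> values;
  c10::Dict<std::string, int64_t> lookup;

  values.push_back(42);
  lookup.insert("answer", 42);
  return 0;
}
```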
Differential Revision: D15948098
fbshipit-source-id: 942a9235b51608d1870ee4a2f2f0a5d0d45ec6e6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21937
This changes call sites to use the new naming scheme
Reviewed By: zdevito
Differential Revision: D15892404
fbshipit-source-id: 8d32aa90a0ead1066688166478f299fde9c2c133
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21806
Dispatcher::findSchema(op_name) now uses a lookup table instead of iterating through the list of operators to find it.
This speeds up op lookup (as in finding the operator handle from the name, not as in finding a kernel when you already have the operator handle)
and it also speeds up op registration, since that needs to check whether an op with the same name already exists.
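A generic sketch of the change (not the actual Dispatcher code; the class and member names here are placeholders): replace a linear scan over registered operators with a hash map keyed by operator name.
```
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

struct OperatorHandle { std::string name; };

class Registry {
 public:
  void registerOp(OperatorHandle op) {
    lookup_.emplace(op.name, ops_.size());  // also makes duplicate checks cheap
    ops_.push_back(std::move(op));
  }
  std::optional<OperatorHandle> findSchema(const std::string& name) const {
    auto it = lookup_.find(name);            // O(1) instead of O(#ops)
    if (it == lookup_.end()) return std::nullopt;
    return ops_[it->second];
  }

 private:
  std::vector<OperatorHandle> ops_;
  std::unordered_map<std::string, size_t> lookup_;
};
```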
Differential Revision: D15834256
fbshipit-source-id: c3639d7b567e4ed5e3627c3ebfd01b7d08b55ac1
Summary:
After https://github.com/pytorch/pytorch/pull/17072, we are allowed to pass Variables into ATen ops, thus there is no need to unwrap input variables in the c10 call path.
Note that since Caffe2 still expects inputs to be pure Tensors, we moved the unwrapping logic to the Caffe2 wrapper.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21620
Differential Revision: D15763560
Pulled By: yf225
fbshipit-source-id: 5375f0e51eb320f380ae599ebf98e6b259f0bff8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21446
This is used for easier tracing of the iter id when looking at the trace diagram.
Reviewed By: ilia-cher
Differential Revision: D15628950
fbshipit-source-id: ee75b3bdb14a36abc18c7bddc49d8ec9789b724d
Summary:
This renames the CMake `caffe2` target to `torch`, as well as renaming `caffe2_gpu` to `torch_gpu` (and likewise for other gpu target variants). Many intermediate variables that don't manifest as artifacts of the build remain for now with the "caffe2" name; a complete purge of `caffe2` from CMake variable names is beyond the scope of this PR.
The shell `libtorch` library that had been introduced as a stopgap in https://github.com/pytorch/pytorch/issues/17783 is again flattened in this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20774
Differential Revision: D15769965
Pulled By: kostmo
fbshipit-source-id: b86e8c410099f90be0468e30176207d3ad40c821
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21177
- Integrate c10::ListPtr into IValue and the c10 dispatcher.
- Streamline conversion to/from IValue. Before, we had IValue::to<> and kernel_functor.h had its own ivalue_to_arg_type and return_type_to_ivalue. They are now unified. Also, this means that nested types like Dicts of Lists of Optional of Dict of ... do work as expected now
Differential Revision: D15476433
fbshipit-source-id: bde9df80df20091aa8e6ae17ba7e90abd149b954
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21492
If one async operator failed, async_scheduling net currently only marks all scheduled async operators as finished without cancelling the callbacks.
The new behavior is to cancel the callbacks first, then set event status to finished.
Reviewed By: ilia-cher
Differential Revision: D15702475
fbshipit-source-id: 55a1774d768b2e238bab859b83332f1877a001ca
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17946
Some of these are probably implementable for exported operators,
but aren't implemented yet and for now it's better to assert than to just return wrong results.
Reviewed By: ezyang
Differential Revision: D14430749
fbshipit-source-id: 2b0037a9ed227a22aa7376a90e6d3d09d3e04707
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20603
When we use intra_op_parallel operators, Caffe2 tracing was generating a trace only for the master task, giving a false impression that a lot of threads are underutilized.
This diff also traces child tasks.
Reviewed By: ilia-cher
Differential Revision: D14820008
fbshipit-source-id: ff4ed203804d86d9231c21c99d869f1ddf1d1ef9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20493
This helps distinguish if the op was a quantized op or not.
Reviewed By: salexspb
Differential Revision: D15337854
fbshipit-source-id: 43c7aef143085cfaeb4ec2102a7f36cc454e0e94
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20173
Enabled op profiling even when net type is not dag or prof dag. Also added
engine type info to summary.
Reviewed By: salexspb, ilia-cher
Differential Revision: D15177813
fbshipit-source-id: 5be0efeaabc9a961cf1d73b0703749c08bb1adbb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20821
Change registration API. Instead of
static auto registry = torch::RegisterOperators()
.op("my::op", torch::RegisterOperators::options()
.kernel<Kernel>()
.dispatchKey(CPUTensorId()));
it is now
static auto registry = torch::RegisterOperators()
.op("my::op", torch::RegisterOperators::options()
.kernel<Kernel>(CPUTensorId()));
This binds kernel and dispatch key together, allowing them to be separate from other future configuration options like alias analysis or autograd wrappers.
The semantic problem behind this is that the dispatch key is a *kernel config parameter* and not an *operator config parameter*, while things like autograd wrappers, alias info, and actually the kernel itself are *operator config parameters*. And while previously the different kinds of config parameters were mixed, this diff now separates them.
Before this change, it wouldn't have been well defined if you specified a dispatchKey together with an autogradWrapper or aliasInfo for example.
// what is this supposed to do?
static auto registry = torch::RegisterOperators()
.op("my::op", torch::RegisterOperators::options()
.aliasInfo(DEFAULT)
.dispatchKey(CPUTensorId()));
If we get more kernel config parameters in the future, we could introduce something like this
static auto registry = torch::RegisterOperators()
.op("my::op", torch::RegisterOperators::options()
.kernel<Kernel>(torch::RegisterOperators::kernelOptions()
.dispatchKey(CPUTensorId())
.otherConfig());
but that's overkill as long as dispatch keys are the only kernel config parameter, and we can introduce that later without breaking backwards compatibility.
A nice side effect of this is that people can register multiple kernels to the same operator in the same `.op()` call:
static auto registry = torch::RegisterOperators()
.op("my::op", torch::RegisterOperators::options()
.kernel<Kernel1>(CPUTensorId())
.kernel<Kernel2>(CUDATensorId()));
Reviewed By: dzhulgakov
Differential Revision: D15455790
fbshipit-source-id: 1c46bfe676dcacf74cf36bd3f5df3d2c32b8fb11
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17818
Some of these are probably implementable for exported operators,
but aren't implemented yet and for now it's better to assert than to just return wrong results.
Reviewed By: ezyang
Differential Revision: D14392459
fbshipit-source-id: bf86e6cb0a7cfefd112a65dc85cc243e57a5ad52
Summary:
Resubmit #20698 which got messed up.
The idea is that when PyTorch is used in a custom build environment (e.g. Facebook), it's useful to track usage of various APIs centrally. This PR introduces a simple, very lightweight mechanism to do so - only the first invocation of a trigger point is logged. This is significantly more lightweight than #18235, so we can afford to put logging in e.g. TensorImpl.
Also adds an initial list of trigger points. Trigger points are added in such a way that no static initialization triggers them, i.e. just linking with libtorch.so will not cause any logging. Further suggestions of what to log are welcomed.
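A generic sketch of the log-only-once trigger point idea (a toy macro, not the actual c10 implementation):
```
#include <iostream>

// Each call site gets its own function-local static, so the lambda (and thus
// the log line) runs exactly once per trigger point, with negligible cost on
// subsequent calls.
#define LOG_API_USAGE_ONCE(event)                     \
  do {                                                \
    static bool logged = [] {                         \
      std::cerr << "API usage: " << (event) << "\n";  \
      return true;                                    \
    }();                                              \
    (void)logged;                                     \
  } while (0)

int main() {
  for (int i = 0; i < 3; ++i) {
    LOG_API_USAGE_ONCE("tensor.create");  // prints only on the first iteration
  }
  return 0;
}
```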
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20745
Differential Revision: D15429196
Pulled By: dzhulgakov
fbshipit-source-id: a5e41a709a65b7ebccc6b95f93854e583cf20aca
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20833
Att. The algorithm is still "horrendously inefficient". But since we are sunsetting Nomnigraph, I just did the minimal fix here.
Reviewed By: tracelogfb
Differential Revision: D15463880
fbshipit-source-id: 413a1280a92c1923ba49031177816a2d5f888575
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20514
Change API from
static auto registry = c10::RegisterOperators()
.op("my::op",
c10::kernel(...),
c10::dispatchKey(...)
);
to
static auto registry = c10::RegisterOperators()
.op("my::op", c10::RegisterOperators::options()
.kernel(...)
.dispatchKey(...)
);
because this allows better discoverability. People looking for which options are available will easier find it and IDE autocompletion will work better.
Reviewed By: zdevito
Differential Revision: D15346348
fbshipit-source-id: 4b74a33b75c2b9cda4a903639fb7abd2c7cff167
Summary:
#19975 was split into 2 PRs.
This one:
Introduce MemoryFormat argument to the `x.is_contiguous(memory_format=torch.channels_last)` and to the `y = x.contiguous(memory_format=torch.channels_last)` functions.
At this moment both functions just operate with strides and doesn't store any tensor state.
(Original RFC #19092)
-----
Expands functionality of two tensor functions `.is_contiguous` and `.contiguous` (both python and c++ api).
Note: We had several complaints about `.to(memory_format)` function, and decided not to support it.
1. `.contiguous` now support optional keyword-only argument - `memory_format`, which can be either `torch.contiguous_format` or `torch.channels_last`.
- Using `torch.contiguous_format` will preserve existing `.contiguous()` behavior.
- Calling `x.contiguous(memory_format=torch.channels_last)` returns new tensor which maintain same semantical layout (NCHW), but have different memory allocation pattern.
`x.contiguous(memory_format=torch.channels_last)` expects input tensor to be 3d, 4d or 5d; and fails otherwise.
2. `.is_contiguous` now support optional keyword-only argument - `memory_format`, which can be either `torch.contiguous_format` or `torch.channels_last`.
- `x.is_contiguous(memory_format=torch.contiguous_format)` preserves same functionality as `x.is_contiguous()` and remains unchanged.
- `x.is_contiguous(memory_format=torch.channels_last)` returns true if A) the input tensor is contiguous in memory AND B) it is allocated in memory in NHWC (or similar for 3d, 5d) format.
Note: Through the end of phase one, `x.is_contiguous(memory_format=torch.channels_last)` will calculate the state of the Tensor on every call. This functionality is going to be updated later.
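A minimal sketch of the same two entry points through the C++ API (assuming the at::MemoryFormat spelling of the Python keyword values):
```
#include <ATen/ATen.h>
#include <iostream>

int main() {
  at::Tensor x = at::randn({8, 3, 32, 32});  // 4d, NCHW semantics

  // Same semantic layout, different memory allocation pattern.
  at::Tensor y = x.contiguous(at::MemoryFormat::ChannelsLast);

  std::cout << x.is_contiguous() << "\n";                                // 1
  std::cout << y.is_contiguous(at::MemoryFormat::ChannelsLast) << "\n";  // 1
  return 0;
}
```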
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20455
Differential Revision: D15341577
Pulled By: VitalyFedyunin
fbshipit-source-id: bbb6b4159a8a49149110ad321109a3742383185d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20439
This is the QTensorProto workflow for multi group quantization in C2 side.
No DNNLOWP Tensor related thing is included in this pr, so once we finished glow side, we should be able to test this pr using resnet50.
Reviewed By: yinghai
Differential Revision: D15096919
fbshipit-source-id: 741eecd59eb79d24d9fe2b035f6246d42422d25c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20463
Source file changes mostly involve ifdef'ing-out references to JIT code
from files that are part of Caffe2Go. Update Internal build scripts to
remove those files from our globs.
After this, changes to most of the JIT files should not trigger mobile CI.
Reviewed By: dzhulgakov
Differential Revision: D15329407
fbshipit-source-id: 48f614c6b028eef0a03ce5161d083a3e078b0412
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20108
Add cpp runs for c2, hooked up via pybinds. Print output to terminal. This is not hooked up with the pep output yet because I'd like to verify the numbers first.
Note that this isn't quite the same mechanism as the pytorch cpp hookup, which uses cpp_python_extensions. If I can use the same mechanism to pull all the inputs for c2 through cpp and do FeedBlobs in cpp, then I'll switch to that.
Reviewed By: zheng-xq
Differential Revision: D15155976
fbshipit-source-id: 708079dacd3e19aacfe43d70c5e5bc54da2cf9e3
Summary:
Some functions were not decorated with `CAFFE2_API`, which makes them unusable when creating unit tests for custom ops outside the Caffe2 repo.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20114
Differential Revision: D15217490
Pulled By: ezyang
fbshipit-source-id: dda3910ad24e566567607deaac705a34ec8e7b8d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19817
A lot of files were depending on the JIT's typesystem
because operator.h depends on function_schema.h. However,
this isn't fundamental to the design. This diff tries to
remove the direct depenency and only includes the c10
wrapper helpers in files where it is required.
Reviewed By: smessmer
Differential Revision: D15112247
fbshipit-source-id: 2c53d83e542c32d9a398c8b60dbf40ab7a1cb0f6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19458
The algorithm in https://fburl.com/ggh9iyvc fails to really ensure topological ordering of nodes. The fix is ugly but effective. I think we need a real topological sort to fix this issue more nicely. Mikhail Zolotukhin, Bram Wasti.
Differential Revision: D15011893
fbshipit-source-id: 130c3aa442f5d578adfb14fbe5f16aa722434942
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19388
The old implementation forced a refcount bump when converting at::Tensor to caffe2::Tensor.
Now, it is possible to move it without a refcount bump.
Reviewed By: dzhulgakov
Differential Revision: D14986815
fbshipit-source-id: 92b4b0a6f323ed38376ffad75f960cad250ecd9b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19287
Since we now have a string-schema-based op registration API, we can also use it when exposing caffe2 operators.
Reviewed By: dzhulgakov
Differential Revision: D14931925
fbshipit-source-id: ec162469d2d94965e8c99d431c801ae7c43849c8
Summary:
Currently, a TensorImpl's `is_variable_` is true if and only if the TensorImpl has AutogradMeta. This PR unifies these two concepts by removing `is_variable_` and changing `is_variable()` to check for the existence of AutogradMeta instead.
Removing `is_variable_` is part of the work in Variable/Tensor merge.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19139
Differential Revision: D14893339
Pulled By: yf225
fbshipit-source-id: ceb5e22c3c01f79b5d21d5bdbf4a7d1bc397796a
Summary:
It's not intended that Storages have 'default' CUDA devices, but this is allowable via the Storage::create_legacy codepath.
This also messes with device_caching, because the initial cache is obtained from the Storage, which may have a 'default' device.
Instead, we materialize a device by allocating 0 bytes via the allocator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18605
Differential Revision: D14680620
Pulled By: gchanan
fbshipit-source-id: 6d43383d836e90beaf12bfe37c3f0506843f5432
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19154
I recently saw some weird workflow errors due to an empty but set net_type. Maybe we should just fall back to the simple net in this case.
Reviewed By: dzhulgakov
Differential Revision: D14890072
fbshipit-source-id: 4e9edf8232298000713bebb0bfdec61e9c5df17d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19080
OSS: add a tiny unit test utility function to create tensors given shape and data, outside of any workspace. I use it in an internal test.
Reviewed By: dzhulgakov
Differential Revision: D14814194
fbshipit-source-id: 6d53b235d99a97da812215f5c7f11fecad363c8c
Summary:
Almost there, feel free to review.
These c10 operators are exported to the _caffe2 domain.
TODO:
- [x] let the onnx checker pass
- [x] test tensor list as argument
- [x] test caffe2 backend and converter
- [x] check the c10 schema can be exported to onnx
- [x] refactor the test case to share some code
- [x] fix the problem in ONNX_ATEN_FALLBACK
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18210
Reviewed By: zrphercule
Differential Revision: D14600916
Pulled By: houseroad
fbshipit-source-id: 2592a75f21098fb6ceb38c5d00ee40e9e01cd144
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18531
Currently we use C10_LOG_EVERY_MS to log the data type change, but it pollutes the logs of some services;
we would like to change it to C10_LOG_FIRST_N to prevent that.
Reviewed By: dzhulgakov
Differential Revision: D14647704
fbshipit-source-id: b84e4002bd4aa94d616133cd1049c3d4ab05386e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18161
This introduces version 0 for the new operator registration.
For now, it only works with kernels that are defined as stack-based functions.
This is actually not the intended public API for defining kernels, but it's the basis which is going to be used to define the public APIs (see diffs on top for them),
and it's also the API used for exposing caffe2 operators.
This diff also switches the mechanism for exposing caffe2 operators to the new mechanism.
Reviewed By: dzhulgakov
Differential Revision: D14514231
fbshipit-source-id: 454ab7b5b46a10203aa27b175400d23f818dd1df
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18160
When exposing a c10 operator to the caffe2 frontend, don't use the operator schema but use the operator name instead.
This allows us to get rid of the existing mechanism for operator schema registration in a diff stacked on top.
Reviewed By: dzhulgakov
Differential Revision: D14513420
fbshipit-source-id: 6b08a9c6d9497eaf18b62361dd44bc07c7b4b76b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18129
A lot of tensor inference functions assume the operator passes the schema.
So call Verify to make sure this is actually the case.
I created a diff before to add checking in Concat (https://github.com/pytorch/pytorch/pull/17110), but I encountered a lot more places where this is assumed (for example ElementwiseOpShapeInference).
Reviewed By: mdschatz
Differential Revision: D14503933
fbshipit-source-id: cf0097b8c3e4beb1cded6b61e092a6adee4b8fcb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18123
The motivation of this fix is to resolve things like:
for (auto i = 0; i < N; i++) where N is bigger than int32.
These instances of comparison were found by enabling -Wsign-compare.
There are way too many things to fix, so this is being issued as a series of fixes.
The plan is to fix all these issues and then enable this flag in Caffe2 to catch future instances.
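As a small illustrative sketch (hypothetical code, not from this diff), the warning and one possible fix look like this:
```
#include <cstddef>
#include <vector>

void iterate(const std::vector<float>& v) {
  // Triggers -Wsign-compare: `i` is a signed int, `v.size()` is an unsigned size_t.
  // for (auto i = 0; i < v.size(); i++) { ... }

  // Fixed: use a size_t induction variable so both sides of the comparison are unsigned.
  for (size_t i = 0; i < v.size(); i++) {
    (void)v[i];
  }
}
```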
Reviewed By: ZolotukhinM
Differential Revision: D14497094
fbshipit-source-id: bca3927a2188bd33a508fa503ba221c220cdaefe
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18040
Add a flag to fail if a floating point exception is detected during operator runs.
Sample exception:
Exception [enforce fail at operator.h:837] !std::fetestexcept(FE_DIVBYZERO). Division by zero floating point exception (FE_DIVBYZERO) reported.
Error from operator:
input: "1" input: "0" output: "out" name: "" type: "Div"
Reviewed By: jspark1105
Differential Revision: D14467731
fbshipit-source-id: fad030b1d619a5a661ff2114edb947e4562cecdd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18037
The FunctionSchema can now store an overload name and the parser knows how to parse it. Specify like this:
my_func.overload1(arg1: Tensor) -> Tensor
my_func.overload2(arg1: Tensor, arg2: Tensor) -> Tensor
Reviewed By: zdevito
Differential Revision: D14467497
fbshipit-source-id: 8832b32f07351bb61090357b17b77a6a2fed3650
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18036
- Add macros to export c10 cuda operators to caffe2 frontend
- Instead of having a separate caffe2 registry for the c10 operator wrappers, use the existing caffe2 registries
Reviewed By: ezyang
Differential Revision: D14467495
fbshipit-source-id: 7715ed2e38d2bbe16f1446ae82c17193a3fabcb9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17781
The wrapper for calling a c10 operator from caffe2 is now based on a runtime FunctionSchema instead of compile time information. This way, it can be created for any c10 operator schema with just one invocation to a simple macro instead of having to define arguments and more as compile time structures.
Furthermore, previously, the wrapper assumed there's an argument present for preallocated outputs, but that was only true for caffe2 operators exported to c10. So the wrapper only worked correctly for calling caffe2->c10->caffe2. Now with the new implementation, it works for any c10 operator.
Also, binary size for this should be much smaller.
Reviewed By: ezyang
Differential Revision: D14375054
fbshipit-source-id: bac7ab8e63929e6e2a148eacac41ed092009aa86
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17743
- caffe2::Operator::SetOutputTensor() can now be used in operators that are called from c10/PyTorch.
- If the operator uses SetOutputTensor() instead of XOutput(), the wrapper doesn't preallocate an empty tensor for the operator anymore. Only outputs accessed in XOutput() will get an output tensor preallocated.
- Remove the copying of the vector with output tensors into a vector with pointer to output tensors.
- Preallocated outputs are now passed in as one TensorList argument on the stack. This TensorList argument has a well-defined name so other wrappers (i.e. the wrapper calling from c2 into c10) can recognize and use it.
- Macros for exporting caffe2 operators to c10 are simplified. Instead of having `c10_op_handle_for_c2_op`, we now pass in the operator handle as a template argument.
- `SetOutputTensor` and `OutputTensorOrUndefined` now work with operators exported to c10
Reviewed By: ezyang
Differential Revision: D14362434
fbshipit-source-id: 44a5e717204f21ea8e9728437429d9b84906f9f5
Summary:
1. Move ATen threadpool & open registration mechanism to C10
2. Move the `global_work_queue` to use this open registration mechanism, to allow users to substitute in their own
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17788
Reviewed By: zdevito
Differential Revision: D14379707
Pulled By: jamesr66a
fbshipit-source-id: 949662d0024875abf09907d97db927f160c54d45
Summary:
CreateDB actually returns nullptr when the db type is unknown and throws when the file is missing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17795
Reviewed By: ezyang
Differential Revision: D14383226
Pulled By: dzhulgakov
fbshipit-source-id: 1dcf75a6b4ba8b64a24d4e5daf02db3189d56b7b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17742
This path isn't used anymore, and is incompatible with the changes stacked on top of this diff.
Removing it.
cc bwasti to check and confirm these can really be deleted
Reviewed By: ezyang
Differential Revision: D14362426
fbshipit-source-id: 32cdc19f28c2a981ae1e204901420998367ee588
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17623
Despite its generic-sounding name, caffe2::DeviceGuard actually
only worked on CUDA devices. Rename it to something that more
clearly spells out its applicability.
I'm not sure if it's the right call, but in this patch I added
'using CUDAGuard = c10::cuda::CUDAGuard', as this seems to be more
in-line with how the Caffe2 codebase is currently written. More
idiomatic c10 namespace style would be to say cuda::CUDAGuard.
Willing to change this if people shout.
This is a respin of D13156470 (#14284)
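For reference, a minimal usage sketch of the renamed guard (device index 1 is an arbitrary example):
```
#include <c10/cuda/CUDAGuard.h>

void run_on_device_one() {
  c10::cuda::CUDAGuard guard(1);  // switch the current CUDA device to device 1
  // ... allocate tensors / launch work on device 1 here ...
}  // the previous device is restored when guard goes out of scope
```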
Reviewed By: dzhulgakov
Differential Revision: D14285504
fbshipit-source-id: 93b8ab938b064572b3b010c307e1261fde0fff3d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17579
These methods previously just returned 0 when it was not a legacy operator,
making it impossible to convert some operators.
Reviewed By: dzhulgakov
Differential Revision: D14253094
fbshipit-source-id: 72bfdcf6da291a4ab80d1e0ceb20984b86edc408
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17491
Before, there was no way to expose a caffe2 operator that had a variable number of inputs.
Now, this is allowed by giving the operator one tensor list input.
Note that the tensor list must be the first input, and that any other tensor inputs will be ignored and inaccessible in this case.
Reviewed By: ezyang
Differential Revision: D14220705
fbshipit-source-id: 7f921bfb581caf46b229888c409bbcc40f7dda80
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17481
Usually, feature macros are either defined or undefined and checked accordingly.
C10_MOBILE was a weird special case that was always defined but either defined to 1 or to 0.
This caused a lot of confusion for me when trying to disable something from the mobile build, and it also disabled it
from the server build (because I was using ifdef). Also, I found a place in the existing code base that made
that wrong assumption and used the macro wrongly; see https://fburl.com/y4icohts
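A short illustrative sketch of the difference (hypothetical snippets, not the actual sources):
```
// Old convention: C10_MOBILE is always defined, to either 0 or 1.
#if C10_MOBILE
// mobile-only code: correct under the old convention
#endif

#ifdef C10_MOBILE
// intended to be mobile-only, but under the old convention this branch was
// also compiled on server builds, because the macro was defined (to 0) there too
#endif

// New convention: the macro is only defined on mobile, so #ifdef behaves as expected.
#ifdef C10_MOBILE
// mobile-only code
#endif
```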
Reviewed By: dzhulgakov
Differential Revision: D14214825
fbshipit-source-id: f3a155b6d43d334e8839e2b2e3c40ed2c773eab6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17078
This prevents caffe2 operators from being exposed to c10 on mobile,
which in turn causes the whole c10 dispatcher to be stripped away
and saves binary size.
We probably want to re-enable the c10 dispatcher for mobile,
but for now this is ok.
Reviewed By: ezyang
Differential Revision: D14077972
fbshipit-source-id: e4dd3e3b60cdfbde91fe0d24102c1d9708d3e5c4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17384
Better handling of possible net run errors in prof_dag counters.
Reviewed By: yinghai
Differential Revision: D14177619
fbshipit-source-id: 51bc952c684c53136ce97e22281b1af5706f871e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15034
Rethrow exceptions that happened during RunAsync, and ensure that pending tasks
are not executed after the run is marked as finished.
Reviewed By: andrewwdye
Differential Revision: D13409649
fbshipit-source-id: 3fd12b3dcf32af4752f8b6e55eb7a92812a5c057
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17132
The schedule() function is not supposed to throw exceptions and is supposed
to succeed in scheduling the full graph of tasks; potential errors (e.g. errors
from the underlying thread pool, out-of-memory exceptions, etc.) are considered not
recoverable.
The invariant: the graph of tasks is either not executed or
executed in full before the call to finishRun().
Reviewed By: andrewwdye
Differential Revision: D14092457
fbshipit-source-id: a3e5d65dfee5ff5e5e71ec72bb9e576180019698
Summary:
In the NUMA case, PinnedCPUAllocator's allocate() would return a
DataPtr constructed by DefaultCPUAllocator, which would reference
the Default... Delete() rather than the Pinned... Delete(). That
meant Pinned... Delete() would never run, so cudaHostUnregister()
would never be called when regions were freed.
See: https://github.com/pytorch/pytorch/issues/16280
This change adds a 'naked_allocate()' method to the Default allocator
that just returns a pointer to the allocated memory rather than
wrapping it in a DataPtr. Pinned allocator uses that then constructs
a DataPtr with reference to its own Delete().
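A hedged sketch of the shape of the fix (the `naked_allocate` name comes from the description above; the rest is illustrative and not the actual allocator implementation):
```
#include <c10/core/Allocator.h>
#include <cstdlib>

// Deleter belonging to the pinned allocator; in the real code this is where
// cudaHostUnregister(ptr) would run before the memory is released.
static void pinned_delete(void* ptr) {
  std::free(ptr);
}

c10::DataPtr pinned_allocate_sketch(size_t nbytes) {
  // naked_allocate(): obtain raw memory from the default CPU allocator without
  // it being wrapped in a DataPtr that carries the default deleter.
  void* raw = std::malloc(nbytes);
  // Wrap the raw pointer ourselves so the pinned allocator's deleter runs on free.
  return c10::DataPtr(raw, raw, &pinned_delete, c10::Device(c10::DeviceType::CPU));
}
```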
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16340
Reviewed By: dzhulgakov
Differential Revision: D13843206
Pulled By: ezyang
fbshipit-source-id: 9efb572e5a01b49ef2a4aceeccc13cd0b1066528
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17080
This changes all operators using this macro to the new format
Reviewed By: dzhulgakov
Differential Revision: D14078628
fbshipit-source-id: 67048e485e326765fd49567cc008633d3d500d5c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16691
Previous diffs already introduced a macro that registers caffe2 CPU kernels with c10.
This now also registers the CUDA kernels with it.
Reviewed By: bwasti
Differential Revision: D13901619
fbshipit-source-id: c15e5b7081ff10e5219af460779b88d6e091a6a6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16965
Instead of having one large templated function to wrap the caffe2 op, minimize the amount of templated code.
Non-templated code can be reused between different operators and decreases binary size.
Reviewed By: orionr
Differential Revision: D14018806
fbshipit-source-id: bedd4152eec21dd8c5778446963826316d210543
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16399
Catching cudaError_t return values in a few places, because it's nodiscard in rocm. Unless we add -Wno-unused-result, it'll end up with a compilation error.
Also in c10/cuda/test, check whether a host has GPU or not. We were silently throwing out the error before (so not really testing the cuda api).
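A small sketch of the pattern (illustrative, assuming a build where cudaError_t is marked nodiscard):
```
#include <cuda_runtime.h>

// Capture and check the status instead of silently discarding it.
bool host_has_gpu() {
  int count = 0;
  cudaError_t err = cudaGetDeviceCount(&count);
  return err == cudaSuccess && count > 0;
}
```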
Reviewed By: bddppq
Differential Revision: D13828281
fbshipit-source-id: 587d1cc31c20b836ce9594e3c18f067d322b2934
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16721
The very key line is we have to set the stream to the default
stream before calling the allocator. This is very interesting.
It shouldn't be necessary, but seemingly is!
Reviewed By: dzhulgakov
Differential Revision: D13943193
fbshipit-source-id: c21014917d9fe504fab0ad8abbc025787f559287
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16615
This is another go at landing https://github.com/pytorch/pytorch/pull/16226
Now that the caching allocator is moved to c10_cuda, we can
delete the duplicate copy from Caffe2.
The difference between this and the previous PR is that this
version faithfully maintains the binding code; in particular,
we end up with a SECOND copy of the caching allocator in
this patch. I verified that this code does NOT cause a crash
in the workflow we canaried last time.
In further diffs, I plan to eliminate the second copy, and then
adjust the binding code.
Reviewed By: dzhulgakov
Differential Revision: D13901067
fbshipit-source-id: 66331fd4eadffd0a5defb3cea532d5cd07287872
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16867
Some caffe2 operators (example: BBoxTransform) have not just one template parameter (the context), but might have multiple template parameters.
Because of this, we can't handle the context parameter inside the macro.
Reviewed By: bwasti
Differential Revision: D13995696
fbshipit-source-id: f55c3be913c8b125445a8d486846fc2fab587a63
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16643
The test was disabled in D13908117 because it conflicted with another diff that was about to land.
Now fixed the merge conflict and re-landing it.
Reviewed By: ezyang
Differential Revision: D13911775
fbshipit-source-id: b790f1c3a3f207916eea41ac93bc104d011f629b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16548
With this macro, a caffe2 operator can now directly be registered with c10.
No need to write custom wrapper kernels anymore.
Differential Revision: D13877076
fbshipit-source-id: e56846238c5bb4b1989b79855fd44d5ecf089c9c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16751
This was made more complicated by the fact that ivalue::IntList
is a thing. So I had to fix all of the sites where we were referring
to IValue post facto.
The following codemods were run, in this order:
```
codemod -m -d . --extensions cc,cpp,cu,cuh,h,hpp,py,cwrap,yaml,in IntList IntArrayRef
codemod -m -d . --extensions cc,cpp,cu,cuh,h,hpp,py,cwrap,yaml,in IntArrayRef::create IntList::create
codemod -m -d . --extensions cc,cpp,cu,cuh,h,hpp,py,cwrap,yaml,in ivalue::IntArrayRef ivalue::IntList
codemod -m -d . --extensions cc,cpp,cu,cuh,h,hpp,py,cwrap,yaml,in Tag::IntArrayRef Tag::IntList
codemod -m -d . --extensions cc,cpp,cu,cuh,h,hpp,py,cwrap,yaml,in isIntArrayRef isIntList
codemod -m -d . --extensions cc,cpp,cu,cuh,h,hpp,py,cwrap,yaml,in toIntArrayRef toIntList
codemod -m -d . --extensions cc,cpp,cu,cuh,h,hpp,py,cwrap,yaml,in 'Shared<IntArrayRef>' 'Shared<IntList>'
codemod -m -d . --extensions cc,cpp,cu,cuh,h,hpp,py,cwrap,yaml,in 'intrusive_ptr<IntArrayRef>' 'intrusive_ptr<IntList>'
```
Some manual fixups were done afterwards; they can be reviewed separately
at https://github.com/pytorch/pytorch/pull/16752
Reviewed By: dzhulgakov
Differential Revision: D13954363
fbshipit-source-id: b5c40aacba042402155a2f5a229fa6db7992ac64
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16730
With Jerry's new updates, Tensor must be defined -- as a result, I've needed to update the shim for caffe2 ops being used in PyTorch.
Reviewed By: smessmer
Differential Revision: D13946950
fbshipit-source-id: 6f77877c61a743f82bdfc2ad04d6ab583000cc18
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16625
This is a squash of multiple PRs that refactored the old c10 dispatcher into a new one that follows the c10 dispatcher design doc.
It is now unboxed and follows the Stack semantics from JIT. It also uses the runtime JIT schema instead of its own compile time schema definitions.
Reviewed By: ezyang
Differential Revision: D13907069
fbshipit-source-id: edcc4806ccd21474fdfb5a98516219b1956db13d
Summary:
I went through my build log and did what I thought were reasonable fixes to all the C++ compilation warnings that came up
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16411
Differential Revision: D13901006
Pulled By: jamesr66a
fbshipit-source-id: 02df4e3e5a5c8dd9e69ac9f065cd3f2a80645033
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16514
Original commit changeset: dc371697f14b
Relanding https://github.com/pytorch/pytorch/pull/15860 - the problem was that layer_norm was using at::empty which is not yet on mobile
Reviewed By: ezyang
Differential Revision: D13861480
fbshipit-source-id: e2116da32bc117175c96b9151b1beba9b31eff36
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16473
This resolves the issues associated with caffe2 initialization (specifically the REGISTER_FUNCTION_SCHEMA_OPERATOR calls) being run after Torch's static op registration calls.
The fix employs a Meyers singleton wrapped by the constructor of a type. Everything is placed inside a macro to make it easier for users to use.
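A minimal sketch of the Meyers-singleton pattern being used (hypothetical names, not the actual macro):
```
// A function-local static is constructed on first use, so the registration runs
// the first time any translation unit touches it, independent of static-init order.
struct Registerer {
  Registerer() {
    // perform the one-time registration here
  }
};

inline Registerer& globalRegisterer() {
  static Registerer r;  // Meyers singleton: constructed exactly once, on first call
  return r;
}

// A type whose constructor forces the singleton to exist; a macro can declare a
// static instance of this type wherever the registration must be guaranteed.
struct EnsureRegistered {
  EnsureRegistered() {
    globalRegisterer();
  }
};
```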
Reviewed By: smessmer
Differential Revision: D13854306
fbshipit-source-id: ecf60861f229532826fae254974e9af4389055df
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16576
Allows instantiation of an operator with arguments passed by move rather than as explicit copies,
per Sebastian's suggestion.
Reviewed By: smessmer
Differential Revision: D13882416
fbshipit-source-id: bc8d50e73f5a1ae87155b0cf96799b8573a7a8fa
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16282
This changes the core kernel abstraction to be a function taking a stack, popping its arguments from the stack and pushing results to the stack,
instead of getting arguments as ArrayRef<IValue> and returning an output IValue.
Caffe2 operators need to have a way to pass in preallocated output tensors.
The convention for them is to get all inputs *and* outputs on the stack and also return all of them, i.e. a caffe2 op will always have inputs == outputs.
This will probably change in later diffs towards making the outputs in-arguments optional in the JIT schema.
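A hypothetical sketch of what a stack-based kernel looks like under this convention (illustrative names; not the real registration code):
```
#include <ATen/ATen.h>
#include <ATen/core/ivalue.h>
#include <utility>
#include <vector>

// Pops its two tensor inputs from the stack and pushes the result back,
// instead of taking ArrayRef<IValue> and returning a single output IValue.
void add_kernel(std::vector<c10::IValue>* stack) {
  c10::IValue b = std::move(stack->back()); stack->pop_back();
  c10::IValue a = std::move(stack->back()); stack->pop_back();
  stack->push_back(a.toTensor() + b.toTensor());
}
```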
Reviewed By: ezyang
Differential Revision: D13792335
fbshipit-source-id: e9cc2b5e438cc4653e1f701633a154b92b604932
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16510
This diff was supposed to be memory usage neutral, but based on
some internal flows involving cuDNN, it was not. Reverting pending
further investigation.
Original commit changeset: 03f1ebf7f11c
Reviewed By: xw285cornell
Differential Revision: D13863610
fbshipit-source-id: 15517e255fd6b0c064b65fb99f0ef19742236cfd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15860
A few changes (which are harder to split into separate diffs, so they go together):
- make conversions explicit (as they can throw, to avoid surprises)
- fix tensor legacy dispatch not initialized when tensor is created on C2 side
- add a bunch of invariants to enforce
Reviewed By: ezyang
Differential Revision: D13596031
fbshipit-source-id: d20b601e06ba47aeff2f6e8e15769840e2d46108
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16180
Only the kernel knows about its state, the caller doesn't see it anymore.
Reviewed By: ezyang
Differential Revision: D13744071
fbshipit-source-id: cb00ff1a881508c1b36ac4123bee1f68ca02ca9c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16177
Change the API for calling operators so that it can store state in an OpKernel object.
This diff doesn't store the state there yet, that comes in a follow up diff.
Reviewed By: ezyang
Differential Revision: D13742889
fbshipit-source-id: 20511a9a1b9f850074e50634d4b4acf87f8c6ecd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16173
Helper to make it easy to run ops in caffe2
Reviewed By: smessmer
Differential Revision: D13468240
fbshipit-source-id: 2276c7870af6dcdf829957f005fd16ac1ef319b5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16048
This enables full shimming of the operator (previously only
Output() was shimmed).
Reviewed By: smessmer
Differential Revision: D13468241
fbshipit-source-id: c853b775ab5cdcd968f4a6cc4766e91c3c6b1c45
Summary:
This PR adds thread-local guard (`at::AutoNonVariableTypeMode`) to make sure that in VariableType.cpp the operations on baseType still dispatch to non-Variable type, even if the parameters will become Variables after the Tensor/Variable merge. We achieve this by making `legacyTensorType()` and `getType()` check the `at::AutoNonVariableTypeMode` guard to decide whether to return non-Variable type for a variable.
This is part of the VariableImpl/TensorImpl merge work: https://github.com/pytorch/pytorch/issues/13638.
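A minimal usage sketch of the guard (the header location is an assumption; it has historically lived under ATen/core):
```
#include <ATen/core/LegacyTypeDispatch.h>  // assumed location of AutoNonVariableTypeMode

void call_without_variable_dispatch() {
  at::AutoNonVariableTypeMode guard(true);  // thread-local: dispatch as non-Variable
  // operations here resolve to the non-Variable (base) type
}  // previous mode restored on scope exit
```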
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15939
Reviewed By: ezyang
Differential Revision: D13640980
Pulled By: yf225
fbshipit-source-id: d12c2543822958558d7d70d36c50999a5eb8783f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16226
Now that the caching allocator is moved to c10_cuda, we can
delete the duplicate copy from Caffe2.
Reviewed By: dzhulgakov, smessmer
Differential Revision: D13762540
fbshipit-source-id: 03f1ebf7f11c68c19aa0d66110156fe228da6138
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16294
In `ReinitializeTensor`, we compare `tensor->GetDevice()` and `options.device()`, but at the callsite we actually just provide an option with `device_type`, which means the `device_id` will always be the default (-1) for `options`. For the tensor, although it is passed a `device` with the default `device_id`, when we allocate the data the `device` of the `tensor` becomes the `device` of the `Storage`, which is the `device` of the underlying `DataPtr`, which is the same as the `device` of the `Context` of the operator, which has a non-default `device_id`.
Therefore every time we call `ReinitializeTensor`, we find that the `device` does not match, and after the `ReinitializeTensor` call the `device` still does not match. That's why we allocate a new Tensor every time, which causes perf regressions for ops that use `ReinitializeTensor` on multiple GPUs.
Reviewed By: BIT-silence
Differential Revision: D13795635
fbshipit-source-id: 24d6afa1a0196a32eb0134ee08b4280244cdb0c3
Summary: Some automation to fix uninitialized members in caffe2 code. Ran canary to make sure I don't have any regression in prod, but I'm not sure how to test comprehensively for caffe2.
Reviewed By: ezyang
Differential Revision: D13776185
fbshipit-source-id: fb2a479971cc0276d8784be1c44f01252410bd24
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16176
This makes PyTorch and Caffe2's data() method line up.
Historically, PyTorch made no distinction between tensors
with const or non-const data, and thus provided a
non-const pointer with data() member. Changing the API to
return a const-pointer would break all mutable code, whereas
changing the Caffe2 API to change a pointer doesn't break
any code, *except* for code which required an exact match
on const-ness (e.g., in template arguments). Since the latter
is less disruptive, we've opted for it here.
The few places downstream that broke due to this are fixed
in this patch.
Reviewed By: smessmer
Differential Revision: D13742916
fbshipit-source-id: baa4b4544cfdf7c1f369f4d69a1e0d5953c1bd99
Summary:
Save reallocation costs by reserving vectors according to how many elements we expect to put in.
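A tiny illustration of the idea (not the actual call sites):
```
#include <vector>

std::vector<int> build(int n) {
  std::vector<int> out;
  out.reserve(n);      // single allocation up front
  for (int i = 0; i < n; ++i) {
    out.push_back(i);  // no reallocation or element copies during the loop
  }
  return out;
}
```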
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16201
Differential Revision: D13762594
Pulled By: ezyang
fbshipit-source-id: 7e3bfe421489dde48a2ddb0920dd155f69baecc0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16051
This changes the kernels stored in the c10 dispatcher from plain C function pointers to IValue-based KernelFunction*.
Note that KernelFunction is currently taking an `ArrayRef<IValue>` as arguments. A later diff will change that to it taking a `Stack*`.
Reviewed By: ezyang
Differential Revision: D13684518
fbshipit-source-id: 1fa54f60cec2e967b92a4a043d6e3ac1627ed991
Summary:
Based on offline discussion, it should be less surprising to the users of existing code. Thus caffe2::Tensor is now a move-only class (as it used to be); explicit calls to UnsafeSharedInstance() are necessary to get shared_ptr behavior.
This change also identified a few places that misused the copy constructor; those are fixed.
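A minimal sketch of the resulting semantics (the constructor argument is illustrative):
```
#include "caffe2/core/tensor.h"
#include <utility>

void tensor_ownership_demo() {
  caffe2::Tensor a(caffe2::CPU);
  caffe2::Tensor b = std::move(a);              // OK: moves are allowed
  // caffe2::Tensor c = b;                      // does not compile: copying is disallowed
  caffe2::Tensor d = b.UnsafeSharedInstance();  // explicit shared-storage handle
}
```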
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15416
Reviewed By: Yangqing
Differential Revision: D13524598
fbshipit-source-id: aea12d6dff77342606fa88ce4ddddbff266245a7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16050
The c10 dispatcher will (soon) depend on IValue and IValue can't be moved to c10 yet because it depends on at::Tensor, which depends on legacy Type dispatch and we don't want the legacy dispatch in c10.
So instead, we move the c10 dispatcher back to ATen/core until we can actually move at::Tensor to c10.
Reviewed By: ezyang
Differential Revision: D13684517
fbshipit-source-id: 1125f4254223907c52f96ff73034f6d4ae9fd0a7
Summary:
There is a little error in the comment: for "A->B", task B must start after task A finishes, not after "B".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15922
Differential Revision: D13709579
Pulled By: ezyang
fbshipit-source-id: 735afe83f4532b7c7456da3e96209b3e07071f37
Summary:
TensorProto.DataType in caffe2/proto/caffe2.proto has BYTE = 3 defined, while there is no corresponding TypeMeta defined in caffe2/core/types.cc: DataTypeToTypeMeta. This issue caused the C++ MNIST + LMDB tutorial to fail.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15627
Differential Revision: D13709602
Pulled By: ezyang
fbshipit-source-id: d4826d0f9b3975e6a8478d4bad1abbbedcaea197
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15814
The plan is to remove the APIs we want to deprecate one by one and make sure everything still builds in sandcastle and ossci.
Reviewed By: ezyang
Differential Revision: D12812029
fbshipit-source-id: ea0c3dd882bec95fcd4507160ebc61f598b6d040
Summary:
Use the new test utils in converter_nomnigraph_test, and add utils to set the device option name, external inputs, and outputs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15751
Differential Revision: D13586228
Pulled By: duc0
fbshipit-source-id: ff809dd7bf9f30641ce2a6fef7e2810f005521c2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15967
Codemod generated with clangr shard mode, 25 files per diff.
To eliminate partially initialized Tensors, we split the initialization of local Tensor variables into two steps: first declare an uninitialized Tensor, and then
call `ReinitializeTensor` to initialize it.
motivation: https://github.com/pytorch/pytorch/pull/12407
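A minimal sketch of the two-step pattern (the shape and options here are illustrative):
```
#include "caffe2/core/tensor.h"

void two_step_init(int64_t n) {
  caffe2::Tensor buffer;  // declared uninitialized; no storage is allocated yet
  caffe2::ReinitializeTensor(
      &buffer,
      {n, 16},
      at::dtype<float>().device(caffe2::CPU));
  // buffer is only (re)allocated when its sizes/options do not already match
}
```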
Reviewed By: smessmer
Differential Revision: D13586735
fbshipit-source-id: eae2d79e1107a2e813ce3809e690af4706aaa9ca
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15692
It was leading to occasional crashes with dynamically linked CUDA because the runtime was already destroyed.
Also, unique_ptr<T[]> is more suitable than deque<T> for the purpose.
Reviewed By: Yangqing
Differential Revision: D13571988
fbshipit-source-id: 37eb26dfbe361c49160367b53f87bd037c6c0e46
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15407
Don't ask the tensor for its intrusive pointer if we just want to check if two tensors are the same.
This mirrors ATen APIs.
Reviewed By: dzhulgakov
Differential Revision: D13520389
fbshipit-source-id: 681317f36f480ab60e532bb08a073f98f39770fd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15316
This starts cleaning up the files in c10 according to the module structure we decided on.
Move to c10/util:
- Half.h, Half-inl.h, Half.cpp, bitcasts.h
Move to c10/core:
- Device.h, Device.cpp
- DeviceType.h, DeviceType.cpp
i-am-not-moving-c2-to-c10
Reviewed By: dzhulgakov
Differential Revision: D13498493
fbshipit-source-id: dfcf1c490474a12ab950c72ca686b8ad86428f63
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15876
Build changes made it so some .so libraries are now registered after GlobalInit is called. Although this shouldn't be common, it also shouldn't be explicitly excluded. These changes allow for late Caffe2 registration, but also warn in that case.
Reviewed By: kuttas
Differential Revision: D13608186
fbshipit-source-id: 0ca7bcd32516d374077db0c2548cf8c28ccdd5f6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15195
This removes the use of caffe2::Tensor or at::Tensor in the c10 dispatcher and only uses c10::Tensor.
It also changes output tensors to be passed as `const Tensor&` instead of `Tensor*` because we otherwise can't forward them in operator_c10wrapper.h.
Reviewed By: ezyang
Differential Revision: D13461640
fbshipit-source-id: 7f79925a7d60f01660a24bbfda47391af0c70ed3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14819
This is a minimal wrapper for a c10::TensorImpl,
maybe destined for greatness later when we move caffe2::Tensor or at::Tensor into c10.
Reviewed By: dzhulgakov
Differential Revision: D13348039
fbshipit-source-id: 874f515358e94f35dc7a4c3e55b35fde59c51ff1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15418
Previously we were using Resize + ShareData.
Instead, we'll create a function on Tensor that clones itself with the same storage.
Suppose we want `t` to `ShareData` with `t0`. Previous:
```
Tensor t(dims, CPU);
t.Resize(t0.sizes());
t.ShareData(t0);
```
Now:
```
Tensor t = t0.Alias();
```
Reviewed By: dzhulgakov
Differential Revision: D13507609
fbshipit-source-id: 6e4275d02f4c3356cbce91127f1b01111dc86b9f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15417
Right now the way we test whether a Blob contains a CPU tensor in ```PythonOpBase``` is broken, which means the non-CPU path might never be taken.
Searching through the codebase, the non-GPU path is used in PythonDLPack, and it is used in PytorchOp, which is unused. So we'll remove the non-GPU path in this diff.
Reviewed By: dzhulgakov
Differential Revision: D13495011
fbshipit-source-id: 9fe9537f05026d2a2cf7051efa81d184de722710