Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43987
This replaces the caffe2 CPU random number generator (std::mt19937) with at::mt19937, the one currently used in PyTorch. The ATen RNG is 10x faster than the std one and appears to be more robust, given known bugs in the std implementation (https://fburl.com/diffusion/uhro7lqb).
For large embedding tables (10GB+) we see UniformFillOp taking upwards of 10 minutes, as we're bottlenecked on the single-threaded RNG. Swapping to at::mt19937 cuts that time to 10% of its current value.
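For reference, a minimal sketch of the engine swap at the call-site level (assuming ATen's `at::mt19937` from `ATen/core/MT19937RNGEngine.h`; `fill_uniform` is an illustrative helper, not the actual UniformFillOp code):
```
#include <ATen/core/MT19937RNGEngine.h>
#include <cstdint>
#include <vector>

// Illustrative helper: fill a buffer with uniform floats in [0, 1) using the
// ATen Mersenne Twister instead of std::mt19937.
void fill_uniform(std::vector<float>& out, uint64_t seed) {
  at::mt19937 gen(seed);  // same seeding/call shape as std::mt19937
  for (auto& v : out) {
    v = static_cast<float>(gen()) / 4294967296.0f;  // gen() yields a uint32_t
  }
}
```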
Test Plan: Ran all relevant tests + CI. This doesn't introduce new features (+ is a core change) so existing tests+CI should be sufficient to catch regressions.
Reviewed By: dzhulgakov
Differential Revision: D23219710
fbshipit-source-id: bd16ed6415b2933e047bcb283a013d47fb395814
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46424
Currently, if an exception occurs in a reporter thread, the process is killed via std::terminate. This adds support for handling the reporter exception when FLAGS_caffe2_handle_executor_threads_exceptions is set to true.
Test Plan: buck test mode/opt -c python.package_style=inplace //caffe2/caffe2/python:hypothesis_test //caffe2/caffe2:caffe2_test_cpu -- --stress-runs 100
Reviewed By: dahsh
Differential Revision: D24345027
fbshipit-source-id: 0659495c9e27680ebae41fe5a3cf26ce2f455cb3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46110
## Motivation
* `Cancel` is now added to `OperatorBase` and `NetBase` (https://github.com/pytorch/pytorch/pull/44145).
* We need a test to cover and exhibit that we can cancel a stuck net and propagate the error with the plan executor.
## Summary
* Added PlanExecutorTest `ErrorPlanWithCancellableStuckNet` for plan executor.
* Set cancelCount to zero at the beginning of tests to avoid global state being carried over in some test environments.
Test Plan:
## Unit Test Added
```
buck test caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest
buck test caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest --stress-runs 1000
```
Reviewed By: d4l3k
Differential Revision: D24226577
fbshipit-source-id: c834383bfe6ab50747975c229eb42a363eed3458
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46080
Temporary removal of ErrorPlanWithCancellableStuckNet; will fill it out more later.
Test Plan:
```
buck test caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest
```
remove a test
Reviewed By: fegin
Differential Revision: D24213971
fbshipit-source-id: e6e600bad00b45c726311193b4b3238f1700526e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45319
## Motivation
* `Cancel` is now added to `OperatorBase` and `NetBase` (https://github.com/pytorch/pytorch/pull/44145)
* We need a test to cover and exhibit that we can cancel a stuck net and propagate the error with the plan executor.
## Summary
* Added `ErrorPlanWithCancellableStuckNet` for plan executor.
* We set up a plan with two nets: one stuck net with a blocking operator that never returns, and one error net with an error op that throws, and tested that it throws and cancels.
Test Plan:
## Unit Test added
```
buck test caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest
buck test caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest --stress-runs 100
```
```
Summary
Pass: 400
ListingSuccess: 2
```
Reviewed By: d4l3k
Differential Revision: D23920548
fbshipit-source-id: feff41f73698bd6ea9b744f920e0fece4ee44438
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45981
This is a recommit of the previously reverted D20850851 (3fbddb92b1).
TL;DR: combining condition_variables and atomics is a bad idea:
https://stackoverflow.com/questions/49622713/c17-atomics-and-condition-variable-deadlock
This also adds some ifdefs to disable the death test for mobile, xplat, and TSAN builds, since forking doesn't play nicely with them.
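For context, a minimal sketch of the lost-wakeup hazard behind that link (illustrative names only, not the reverted caffe2 code):
```
#include <atomic>
#include <condition_variable>
#include <mutex>

std::atomic<bool> done{false};
std::mutex mu;
std::condition_variable cv;

void waiter() {
  std::unique_lock<std::mutex> lock(mu);
  while (!done.load()) {  // the racy setter can run entirely between this check...
    cv.wait(lock);        // ...and this wait, so the notification is lost forever
  }
}

void setter_racy() {
  done.store(true);  // not published under `mu`
  cv.notify_all();
}

void setter_fixed() {
  {
    std::lock_guard<std::mutex> lock(mu);  // publish under the same mutex the waiter holds
    done.store(true);
  }
  cv.notify_all();
}
```
Because the fixed setter flips the flag while holding the waiter's mutex, the notify can no longer slip in between the predicate check and the wait.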
Test Plan:
buck test mode/opt //caffe2/caffe2/python:hypothesis_test -- --stress-runs 1000 test_atomic_iter_with_concurrent_steps --timeout 120
buck test mode/opt //caffe2/caffe2/python:hypothesis_test -- --stress-runs 100
buck test mode/opt caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest --stress-runs 100
no timeouts https://www.internalfb.com/intern/testinfra/testconsole/testrun/7036874440059883/
will ensure no timeouts in OSS
Reviewed By: walterddr, dahsh
Differential Revision: D24165505
fbshipit-source-id: 17cd23bfbcd9c2826a4067a387023d5186353196
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45297
If we have two concurrent substeps and one of them throws an exception while the other is blocking, we currently hang. This change waits up to one minute for the blocking substep to complete before terminating the process.
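A rough sketch of the behavior described above (`run_substeps` is an illustrative stand-in, not the actual plan executor code):
```
#include <chrono>
#include <exception>
#include <functional>
#include <future>

void run_substeps(std::function<void()> blocking, std::function<void()> failing) {
  auto pending = std::async(std::launch::async, blocking);
  try {
    failing();  // throws
  } catch (...) {
    if (pending.wait_for(std::chrono::minutes(1)) != std::future_status::ready) {
      std::terminate();  // stuck substep never finished; don't hang the process forever
    }
    throw;  // propagate the original error once the sibling is done
  }
  pending.get();
}
```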
Test Plan: buck test caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest --stress-runs 100
Reviewed By: dahsh
Differential Revision: D20850851
fbshipit-source-id: 330503775d8062a34645ba55fe38e6770de5e3c7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44062
Previously, BackendSelect kernels were still written in the legacy way, i.e. they took one TensorOptions argument instead of scattered dtype, layout, device, pin_memory, and they used hacky_wrapper to be callable. This caused a re-wrapping step: calling into a BackendSelect kernel required taking the individual scattered arguments, packing them into a TensorOptions, and then unpacking them again inside the kernel for redispatch.
Now with this PR, BackendSelect kernels are written in the new way and no hacky_wrapper or re-wrapping is needed for them.
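A toy sketch of the re-wrapping overhead being removed (the names are illustrative stand-ins, not ATen types):
```
#include <optional>

// Toy stand-ins: `Options` plays the role of TensorOptions, ints play dtype/device.
struct Options {
  std::optional<int> dtype;
  std::optional<int> device;
};

// Legacy shape: the wrapper packed the scattered arguments into Options, and the
// kernel consumed the packed struct (an extra pack/unpack per call).
int legacy_kernel(const Options& opts) {
  return opts.dtype.value_or(0) + opts.device.value_or(0);
}
int legacy_entry(std::optional<int> dtype, std::optional<int> device) {
  return legacy_kernel(Options{dtype, device});
}

// New shape: the kernel takes the scattered arguments directly, no re-wrapping.
int new_kernel(std::optional<int> dtype, std::optional<int> device) {
  return dtype.value_or(0) + device.value_or(0);
}
```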
ghstack-source-id: 112825789
Test Plan:
vs master: https://www.internalfb.com/intern/fblearner/details/216117032/
vs previous diff: https://www.internalfb.com/intern/fblearner/details/216170194/
Reviewed By: ezyang
Differential Revision: D23484192
fbshipit-source-id: e8fb49c4692404b6b775d18548b990c4cdddbada
Summary:
The `2to3` tool has a `future` fixer that can be targeted specifically to remove these; the `caffe2` directory has the most redundant imports:
```2to3 -f future -w caffe2```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45033
Reviewed By: seemethere
Differential Revision: D23808648
Pulled By: bugra
fbshipit-source-id: 38971900f0fe43ab44a9168e57f2307580d36a38
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44145
## Motivation
* To be able to make C2 ops cancellable so we can safely exit.
* Some C2 operators are currently blocking and thus non-cancellable. If an error occurs, we need to be able to safely stop all net execution so we can throw the exception to the caller.
## Summary
* Adds `NetBase::Cancel()`, which iterates over the entire list of operators and calls Cancel on each (see the sketch after this list).
* Cancelling all ops was added to Net since there's nothing async-specific about it.
* `AsyncSchedulingNet` calls parent Cancel.
* To preserve backwards compatibility, `AsyncSchedulingNet`'s Cancel still calls `CancelAndFinishAsyncTasks`.
* Adds `Cancel()` to `OperatorBase`.
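A simplified sketch of the resulting cancellation flow (illustrative declarations, not the actual caffe2 signatures):
```
#include <vector>

class OperatorBase {
 public:
  virtual ~OperatorBase() = default;
  virtual void Cancel() {}  // default no-op; blocking ops override this to unblock themselves
};

class NetBase {
 public:
  virtual ~NetBase() = default;
  virtual void Cancel() {
    for (OperatorBase* op : operators_) {
      op->Cancel();  // nothing async-specific here, so it lives on NetBase
    }
  }
 protected:
  std::vector<OperatorBase*> operators_;
};

class AsyncSchedulingNet : public NetBase {
 public:
  void Cancel() override {
    NetBase::Cancel();            // cancel every op via the parent implementation
    CancelAndFinishAsyncTasks();  // kept for backwards compatibility
  }
 private:
  void CancelAndFinishAsyncTasks() { /* drain in-flight async tasks */ }
};
```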
Reviewed By: dzhulgakov
Differential Revision: D23279202
fbshipit-source-id: e1bb0ff04a4e1393f935dbcac7c78c0baf728550
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43564
Static dispatch was originally introduced for mobile selective build.
Since we have added selective build support for dynamic dispatch and
tested it in FB production for months, we can deprecate static dispatch
to reduce the complexity of the codebase.
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D23324452
Pulled By: ljk53
fbshipit-source-id: d2970257616a8c6337f90249076fca1ae93090c7
Summary: Per the title, this makes the C2 wrappers safer, as contiguity of torch inputs is not guaranteed.
Test Plan: covered by existing tests
Reviewed By: dzhulgakov
Differential Revision: D23310137
fbshipit-source-id: 3fe12abc7e394b8762098d032200778018e5b591
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43027
Format db.h and db.cc using the default formatter.
This change was split off of D22705434.
Test Plan: Wait for sandcastle.
Reviewed By: rohithmenon, marksantaniello
Differential Revision: D23113765
fbshipit-source-id: 3f02d55bfb055bda0fcba5122336fa001562d42e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43239
This is an incremental step in the process of migrating the caffe2 random number generator off of std::mt19937 and onto at::mt19937 + at::CPUGeneratorImpl. The ATen variants are much more performant (10x faster).
This adds a way to get the CPUContext RandSeed for tail use cases that require a std::mt19937 and borrow the CPUContext one.
Test Plan: This isn't used anywhere within the caffe2 codebase. Compile should be sufficient.
Reviewed By: dzhulgakov
Differential Revision: D23203280
fbshipit-source-id: 595c1cb447290604ee3ef61d5b5fc079b61a4e14
Summary:
This diff NVMifies the NE Eval Flow.
- It defines a `LoadNVM` operator which:
  - either receives a list of NVM blobs, or extracts the blobs that could be NVMified from the model,
  - dumps the NVMified blobs into NVM,
  - and deallocates them from DRAM.
- NVMifies the Eval net on the dper and C2 backends.
The specific NVMOp for SLS is pushed in separate diffs.
Test Plan: flow-cli test-locally dper.workflows.evaluation.eval_workflow --parameters-file=/mnt/public/ehsaardestani/temp/small_model.json 2>&1 | tee log
Reviewed By: yinghai, amylittleyang
Differential Revision: D22469973
fbshipit-source-id: ed8379ad404e96d04ac05e580176d3aca984575b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42249
The main change is to bring Caffe2's superior error messages for CUDA initialization into c10 and use them in all code paths.
Basic logic:
| Case | Call to device_count() | init_cuda, e.g. allocating tensor |
| -- | -- | -- |
| all good | non-zero | just works |
| no gpus | 0, no warning | throw exception with good message |
| driver issues | 0, produce warning | throw exception with good message |
| out of memory with ASAN | 0, produce warning| throw exception with ASAN message |
Previously, the error thrown from init_cuda was very generic and the ASAN warning (if any) was buried in the logs.
Other clean up changes:
* always cache device_count() in a static variable (see the sketch after this list)
* move all ASAN macros into c10
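A minimal sketch of the device_count() caching cleanup (`query_driver_device_count` is a hypothetical stand-in for the actual CUDA query plus the warning logic from the table above):
```
// Stand-in for the real cudaGetDeviceCount call and its error handling.
static int query_driver_device_count() {
  return 0;
}

int device_count() {
  // Function-local static: the driver is queried exactly once per process,
  // and later calls (including after a failed init) reuse the cached value.
  static int count = query_driver_device_count();
  return count;
}
```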
Test Plan:
Hard to unittest because of build modes. Verified manually that the behavior from the table above holds by running the following script in different modes (ASAN/no-ASAN, CUDA_VISIBLE_DEVICES=):
```
print('before import')
import torch
print('after import')
print('devices: ', torch.cuda.device_count())
x = torch.tensor([1,2,3])
print('tensor creation')
x = x.cuda()
print('moved to cuda')
```
Reviewed By: ngimel
Differential Revision: D22824329
fbshipit-source-id: 5314007313a3897fc955b02f8b21b661ae35fdf5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41461
`capacity` is misleading, and we have many incorrect uses internally. Let's rename it to `nbytes` to avoid the confusion in the future. Ultimately, we could remove this parameter if possible.
So far I haven't seen any case where this capacity is necessary.
Test Plan: oss ci
Differential Revision: D22544189
fbshipit-source-id: f310627f2ab8f4ebb294e0dd5eabc380926991eb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41096
The spark spot model had some issues in tensor conversion, see P134598596. They happen when we convert an undefined c10 tensor to a caffe2 tensor.
This diff adds a null check.
Test Plan: spark spot model runs without problem
Reviewed By: smessmer
Differential Revision: D22330705
fbshipit-source-id: dfe0f29a48019b6611cad3fd8f2ae49e8db5427e
Summary:
If a virtual function is implemented in a header file, its implementation will be included as a weak symbol in every shared library that includes the header, along with all of its dependencies.
This was one of the reasons the size of libcaffe2_module_test_dynamic.so was 500Kb (the AddRelatedBlobInfo implementation pulled a quarter of libprotobuf.a with it).
The combination of this and https://github.com/pytorch/pytorch/issues/40845 reduces the size of `libcaffe2_module_test_dynamic.so` from 500Kb to 50Kb.
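A small sketch of the fix pattern (`Widget` is a hypothetical class used only to illustrate the weak-symbol issue, not the actual AddRelatedBlobInfo code):
```
// widget.h -- before: the virtual member is defined in the header, so its body is
// emitted as a weak symbol into every translation unit (and shared library) that
// includes this header, dragging its dependencies along with it.
struct Widget {
  virtual ~Widget() = default;
  virtual int Describe() const { return 42; }
};

// widget.h -- after: only declare the virtual member in the header...
struct WidgetFixed {
  virtual ~WidgetFixed();
  virtual int Describe() const;
};

// widget.cc -- ...and define it once, in a single translation unit:
//   WidgetFixed::~WidgetFixed() = default;
//   int WidgetFixed::Describe() const { return 42; }
```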
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40844
Differential Revision: D22334725
Pulled By: malfet
fbshipit-source-id: 836a4cbb9f344355ddd2512667e77472546616c0
Summary:
… file
This prevents implementation of those functions(as lambdas) to be embedded as weak symbol into every shared library that includes this header.
Combination of this and https://github.com/pytorch/pytorch/pull/40844 reduces size of `libcaffe2_module_test_dynamic.so` from 500kb to 50Kb.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40845
Differential Revision: D22334779
Pulled By: malfet
fbshipit-source-id: 64706918fc2947350a58c0877f294b1b8b085455
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40096
Declaring `tensor_proto` to be of type `auto` means that it will copy the entire `TensorProto` instead of just keeping a reference. This changes it to use a const reference instead.
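A toy illustration of the pitfall (a plain struct stands in for the real TensorProto messages):
```
#include <cstddef>
#include <string>
#include <vector>

struct Proto {
  std::string big_payload;
};

std::size_t total_payload_bytes(const std::vector<Proto>& protos) {
  std::size_t total = 0;
  // Writing `for (auto proto : protos)` would copy every message on each iteration;
  // binding a const reference keeps the copy-free access the fix restores.
  for (const auto& proto : protos) {
    total += proto.big_payload.size();
  }
  return total;
}
```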
Test Plan:
Using the model loader benchmark to measure model loading performance:
### `tensor_proto` is of type `const auto&`
```
============================================================================
caffe2/caffe2/fb/predictor/ModelLoaderBenchmark.cpprelative time/iter iters/s
============================================================================
BlobProtoInt32DeserializationFloat16 11.08ms 90.27
BlobProtoByteDeserializationFloat16 1509.73% 733.73us 1.36K
----------------------------------------------------------------------------
BlobProtoInt32DeserializationUInt8 10.48ms 95.45
BlobProtoByteDeserializationUInt8 2974.57% 352.22us 2.84K
============================================================================
```
### `tensor_proto` is of type `auto`
```
============================================================================
caffe2/caffe2/fb/predictor/ModelLoaderBenchmark.cpprelative time/iter iters/s
============================================================================
BlobProtoInt32DeserializationFloat16 13.84ms 72.26
BlobProtoByteDeserializationFloat16 658.85% 2.10ms 476.08
----------------------------------------------------------------------------
BlobProtoInt32DeserializationUInt8 17.09ms 58.51
BlobProtoByteDeserializationUInt8 3365.98% 507.80us 1.97K
============================================================================
```
Reviewed By: marksantaniello
Differential Revision: D21959644
fbshipit-source-id: 6bc2dfbde306f88bf7cd4f9b14b95ac69c2e1b4d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39493
Make sure we wait for all types, incl. async cpu ops
Test Plan: CI
Reviewed By: kennyhorror
Differential Revision: D21873540
fbshipit-source-id: 37875cade68e1b3323086833f8d4db79362a68e8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39759
Caffe2 has a mode where it uses PT's caching allocator. Somehow we were not calling the initialization explicitly.
Now, I have no idea why it worked before. Probably worth running a bisect separately.
Reviewed By: houseroad
Differential Revision: D21962331
fbshipit-source-id: f16ad6b27a67dbe0bda93939cca8c94620d22a09
Summary:
Gets rid of some in-kernel asserts where they can be replaced with static_asserts.
Replaces a bare in-kernel `assert` in one case with `CUDA_KERNEL_ASSERT` where necessary.
Replaces host-code `assert`s with `TORCH_INTERNAL_ASSERT`.
Another group of asserts is in the fractional max pooling kernels, which should be fixed regardless (https://github.com/pytorch/pytorch/issues/39044); the problems there are not just asserts.
I've audited the remaining cases of in-kernel asserts; they behave more like `TORCH_INTERNAL_ASSERT`, so they should not fire on invalid user data. I think it's ok to leave them as is.
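A hedged CUDA-C++ sketch of where each kind of assert belongs (`gather_kernel` is a made-up example; the macros are the real c10 ones):
```
#include <c10/macros/Macros.h>
#include <c10/util/Exception.h>
#include <cstdint>

// Hypothetical kernel: the bounds check depends on user-provided indices, so it must
// stay an always-on device-side CUDA_KERNEL_ASSERT rather than a bare assert().
__global__ void gather_kernel(const int64_t* indices, int64_t num_indices, int64_t bound) {
  int64_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < num_indices) {
    CUDA_KERNEL_ASSERT(indices[i] >= 0 && indices[i] < bound);
  }
}

void launch_gather(const int64_t* indices, int64_t num_indices, int64_t bound) {
  // Host-side invariant: use TORCH_INTERNAL_ASSERT instead of a bare assert().
  TORCH_INTERNAL_ASSERT(num_indices >= 0);
  // Compile-time invariant: static_assert costs nothing at runtime.
  static_assert(sizeof(int64_t) == 8, "unexpected index width");
  if (num_indices == 0) {
    return;
  }
  unsigned int blocks = static_cast<unsigned int>((num_indices + 255) / 256);
  gather_kernel<<<blocks, 256>>>(indices, num_indices, bound);
}
```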
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39047
Differential Revision: D21750392
Pulled By: ngimel
fbshipit-source-id: e9417523a2c672284de3515933cb7ed166e56719
Summary:
Per title. https://github.com/pytorch/pytorch/issues/32719 essentially disabled asserts in cuda kernels in release build. Asserts in cuda kernels are typically used to prevent invalid reads/writes, so without asserts invalid read/writes are silent errors in most cases (sometimes they would still cause "illegal memory access" errors, but because of caching allocator this usually won't happen).
We don't need two macros (CUDA_ALWAYS_ASSERT and CUDA_KERNEL_ASSERT), because all current asserts in cuda kernels are important to prevent illegal memory accesses and should never be disabled.
This PR removes the CUDA_ALWAYS_ASSERT macro and instead makes CUDA_KERNEL_ASSERT (the one commonly used in the kernels) an assertion in both release and debug builds.
Fixes https://github.com/pytorch/pytorch/issues/38771
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38943
Differential Revision: D21723767
Pulled By: ngimel
fbshipit-source-id: d88d8aa1b047b476d5340e69311e65aff4da5074
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38066
Increasing priority for PinnedCPUAllocator to make sure it is set when CUDA is enabled.
Test Plan: buck test mode/dev-nosan //vision/fair/detectron2/tests:test_export_caffe2 -- 'testMaskRCNNGPU \(test_export_caffe2\.TestCaffe2Export\)'
Reviewed By: ppwwyyxx
Differential Revision: D21465835
fbshipit-source-id: 643cff30d35c174085e5fde5197ddb05885b2e99
Summary:
Helps prevent the following accidental failures:
```
..\caffe2\core\parallel_net_test.cc:303
The difference between ms and 350 is 41, which exceeds kTimeThreshold, where
ms evaluates to 391,
350 evaluates to 350, and
kTimeThreshold evaluates to 40.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37892
Differential Revision: D21417251
Pulled By: malfet
fbshipit-source-id: 300cff7042e466f014850cc7cc406c725d5d0c04
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37776
* Remove type-specific size tracking in favor of byte size tracking in Storage and StorageImpl
* Changed numel() and set_numel() to nbytes() and set_nbytes()
* Added enum argument to Storage/StorageImpl constructor to indicate new meaning of the size parameter
* Update all callers of the changed API
Part of issue https://github.com/pytorch/pytorch/issues/33950
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37028
Differential Revision: D21171334
Pulled By: ezyang
fbshipit-source-id: 37329a379de9a3a83cc5e9007e455a3e1c2d10b8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37101
Fixes #36954.
The basic concept is to streamline the process of rethrowing
c10::Error with extra error information. This is in a few
steps:
- I completely remodeled the Error data type and the internal
invariants. Instead of manually adding in newlines, the
message stack formatting process is responsible for inserting
newlines and spacing as necessary. Call sites are then
modified to respect the new API model.
- TORCH_RETHROW macro is added, which adds context to an error
message and then rethrows it.
New internal assert failure looks like:
```
0 INTERNAL ASSERT FAILED at ../c10/test/util/exception_test.cpp:64, please report a bug to PyTorch.
Exception raised from TestBody at ../c10/test/util/exception_test.cpp:64 (most recent call first):
frame #0: <unknown function> + 0x6aab9 (0x7ff611d3aab9 in /data/users/ezyang/pytorch-tmp/build/lib/libc10.so)
frame #1: ...
```
Error message with context looks like:
```
This is an error
This is context 1
This is context 2
```
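A small sketch of the intended call-site pattern (`load_block` is a made-up helper; `TORCH_RETHROW` is the macro added here):
```
#include <c10/util/Exception.h>

// Hypothetical helper that can fail with a c10::Error.
void load_block(int i) {
  TORCH_CHECK(i != 3, "This is an error");
}

void load_all(int n) {
  for (int i = 0; i < n; ++i) {
    try {
      load_block(i);
    } catch (c10::Error& e) {
      // Appends a context line (like the "This is context 1" lines above) and rethrows.
      TORCH_RETHROW(e, "while loading block ", i);
    }
  }
}
```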
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D21202891
Pulled By: ezyang
fbshipit-source-id: 361cadd16bc52e5886dba08e79277771ada76169
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37094
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D21202892
Pulled By: ezyang
fbshipit-source-id: d59e6bffabd90cc734056bdce2cd1fe63262fab8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36850
Since all unboxing now happens after dispatch, which means that all c10 ops support unboxing, we can use op.callBoxed() for all ops and no longer need callBoxedWorkaround (which went through the JIT registry).
ghstack-source-id: 102879558
Test Plan: waitforsandcastle
Differential Revision: D21102375
fbshipit-source-id: d1e041116563a9650d5a86b07eb96d217d8756f3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36841
Right now, all C2 ops' outputs are unwrapped blindly. This is not correct if a single tensor list is returned.
Test Plan: buck test mode/dev-nosan mode/no-gpu //caffe2/caffe2/fb/python/operator_test:torch_integration_test
Reviewed By: alyssawangqq
Differential Revision: D21100463
fbshipit-source-id: 9f22f3ddf029e7da9d98008d68820bf7f8239d4f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35187
When I touch these files, lint always introduces some unintended changes. To prevent that, we need to format the code first.
The change is generated by:
arc f
Test Plan: integration test.
Differential Revision: D20587596
fbshipit-source-id: 512cf6b86bd6632a61c80ed53e3a9e229feecc2a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36172
Original commit changeset: 3d7801613f86
D20449887 broke some OSS tests as the OSS export sync wasn't working correctly.
Test Plan:
Manually export latest version to OSS to trigger the tests
+ test plan in D20449887
verified onnx tests are passing in https://github.com/pytorch/pytorch/pull/36172
Reviewed By: andrewwdye
Differential Revision: D20902279
fbshipit-source-id: bc30fcc9f5cc8076f69a5d92675fd27455948372
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31966
This has three parts:
* When `--caffe2_handle_executor_threads_exceptions` is set and a parallel execution step throws an exception, it can hang waiting for async nets to finish. This adds cancellation code to cancel any async nets.
* This makes the exceptions returned from parallel workers pass through a std::exception_ptr so the stack trace can be recorded with folly::SmartExceptionTracer (see the sketch after this list).
* Defines the Cancel method at the NetBase level to avoid pulling the unsupported AsyncSchedulingNet into fbandroid.
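A minimal sketch of the std::exception_ptr hand-off (illustrative only; the real logic lives in the plan executor):
```
#include <exception>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

void run_workers(const std::vector<std::function<void()>>& jobs) {
  std::mutex mu;
  std::exception_ptr first_error;
  std::vector<std::thread> threads;
  for (const auto& job : jobs) {
    threads.emplace_back([&mu, &first_error, job] {
      try {
        job();
      } catch (...) {
        std::lock_guard<std::mutex> lock(mu);
        if (!first_error) {
          // Keep the full exception object (not just a message string) so tools
          // like folly::SmartExceptionTracer can still recover its stack trace.
          first_error = std::current_exception();
        }
      }
    });
  }
  for (auto& t : threads) {
    t.join();
  }
  if (first_error) {
    std::rethrow_exception(first_error);  // propagate the first worker failure
  }
}
```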
Test Plan:
Added unit tests for plan_executor
buck test //caffe2/caffe2:caffe2_test_cpu
buck test //caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest --stress-runs 100
Reviewed By: boryiingsu
Differential Revision: D19320177
fbshipit-source-id: d9939fcea1317751fa3de4172dfae7f781b71b75
Summary:
This is a reland of https://github.com/pytorch/pytorch/pull/36196
Before the fix, bazel spews the following multi-line warning for every single caffe2 operator:
```
In file included from ./c10/util/logging_is_google_glog.h:50,
from ./c10/util/Logging.h:26,
from ./caffe2/core/logging.h:2,
from ./caffe2/core/blob.h:13,
from ./caffe2/core/operator.h:18,
from ./caffe2/sgd/adadelta_op.h:1,
from caffe2/sgd/adadelta_op.cc:1:
bazel-out/k8-fastbuild/bin/external/com_github_glog/_virtual_includes/glog/glog/logging.h: In instantiation of 'std::string* google::Check_LTImpl(const T1&, const T2&, const char*) [with T1 = int; T2 = long unsigned int; std::string = std::__cxx11::basic_string<char>]':
./caffe2/core/operator.h:192:5: required from 'const T& caffe2::OperatorBase::Input(int, caffe2::DeviceType) [with T = caffe2::Tensor; caffe2::DeviceType = c10::DeviceType]'
./caffe2/core/operator.h:890:48: required from 'const caffe2::Tensor& caffe2::Operator<Context>::Input(int, caffe2::DeviceType) [with Context = caffe2::CPUContext; caffe2::DeviceType = c10::DeviceType]'
./caffe2/sgd/adadelta_op.h:87:5: required from 'bool caffe2::SparseAdadeltaOp<Context>::RunOnDevice() [with Context = caffe2::CPUContext]'
./caffe2/sgd/adadelta_op.h:85:8: required from here
bazel-out/k8-fastbuild/bin/external/com_github_glog/_virtual_includes/glog/glog/logging.h:722:32: warning: comparison of integer expressions of different signedness: 'const int' and 'const long unsigned int' [-Wsign-compare]
722 | DEFINE_CHECK_OP_IMPL(Check_LT, < )
| ^
bazel-out/k8-fastbuild/bin/external/com_github_glog/_virtual_includes/glog/glog/logging.h:148:53: note: in definition of macro 'GOOGLE_PREDICT_TRUE'
148 | #define GOOGLE_PREDICT_TRUE(x) (__builtin_expect(!!(x), 1))
| ^
bazel-out/k8-fastbuild/bin/external/com_github_glog/_virtual_includes/glog/glog/logging.h:722:1: note: in expansion of macro 'DEFINE_CHECK_OP_IMPL'
722 | DEFINE_CHECK_OP_IMPL(Check_LT, < )
| ^~~~~~~~~~~~~~~~~~~~
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36224
Test Plan: CI
Differential Revision: D20919506
Pulled By: malfet
fbshipit-source-id: b8b4b7c62dcbc109b30165b19635a6ef30033e73
Summary:
Otherwise, bazel spews the following multi-line warning for every single caffe2 operator:
```
In file included from ./c10/util/logging_is_google_glog.h:50,
from ./c10/util/Logging.h:26,
from ./caffe2/core/logging.h:2,
from ./caffe2/core/blob.h:13,
from ./caffe2/core/operator.h:18,
from ./caffe2/sgd/adadelta_op.h:1,
from caffe2/sgd/adadelta_op.cc:1:
bazel-out/k8-fastbuild/bin/external/com_github_glog/_virtual_includes/glog/glog/logging.h: In instantiation of 'std::string* google::Check_LTImpl(const T1&, const T2&, const char*) [with T1 = int; T2 = long unsigned int; std::string = std::__cxx11::basic_string<char>]':
./caffe2/core/operator.h:192:5: required from 'const T& caffe2::OperatorBase::Input(int, caffe2::DeviceType) [with T = caffe2::Tensor; caffe2::DeviceType = c10::DeviceType]'
./caffe2/core/operator.h:890:48: required from 'const caffe2::Tensor& caffe2::Operator<Context>::Input(int, caffe2::DeviceType) [with Context = caffe2::CPUContext; caffe2::DeviceType = c10::DeviceType]'
./caffe2/sgd/adadelta_op.h:87:5: required from 'bool caffe2::SparseAdadeltaOp<Context>::RunOnDevice() [with Context = caffe2::CPUContext]'
./caffe2/sgd/adadelta_op.h:85:8: required from here
bazel-out/k8-fastbuild/bin/external/com_github_glog/_virtual_includes/glog/glog/logging.h:722:32: warning: comparison of integer expressions of different signedness: 'const int' and 'const long unsigned int' [-Wsign-compare]
722 | DEFINE_CHECK_OP_IMPL(Check_LT, < )
| ^
bazel-out/k8-fastbuild/bin/external/com_github_glog/_virtual_includes/glog/glog/logging.h:148:53: note: in definition of macro 'GOOGLE_PREDICT_TRUE'
148 | #define GOOGLE_PREDICT_TRUE(x) (__builtin_expect(!!(x), 1))
| ^
bazel-out/k8-fastbuild/bin/external/com_github_glog/_virtual_includes/glog/glog/logging.h:722:1: note: in expansion of macro 'DEFINE_CHECK_OP_IMPL'
722 | DEFINE_CHECK_OP_IMPL(Check_LT, < )
| ^~~~~~~~~~~~~~~~~~~~
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36196
Differential Revision: D20909696
Pulled By: malfet
fbshipit-source-id: 16723355f473379ba9da6d3c33bd561b9724800a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34753
This improves support for exceptions and capturing stack traces in caffe2 async nets. We generally want to use exceptions everywhere we can in order to preserve stack information. It also makes the exception timestamp more accurate so multiple exceptions at the same time can be correctly ordered.
Test Plan: Updated the tests to use the new error semantics + adds a test to ensure the stack is correctly propagated through deferrable async scheduling.
Reviewed By: andrewwdye
Differential Revision: D20449887
fbshipit-source-id: 047fdf1bd52fd7c7c1f3fde77df9a27ed9e288e7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35857
This fixes a lot of common ops for InferBlobShapesAndTypes as well as adds support for testing the inferred shapes and types of gradient ops.
Ops:
* Concat
* Split
* LeakyReLU
* Relu
* Prelu
* Gelu
* Elu
* Sinh, Tanh, Cosh
* Abs
* ... and a number of other simple element wise ops
Test Plan:
Added support to hypothesis test to check the shape and type of gradient ops.
Enabled it for all the ops I fixed the shape and type inference for.
buck test caffe2/caffe2/python/operator_test:
Reviewed By: pradeepd24
Differential Revision: D20806284
fbshipit-source-id: 77f796d9ff208e09e871bdbadf9a0a7c196b77f2
Summary:
Fixes incorrect usages of symbol annotations including:
1. Exporting or importing a function/class in an anonymous namespace.
2. Exporting or importing a function/class implementation in a header file. However, by removing the symbol annotations, they become local symbols. If they need to remain global, I can move the implementations to the source file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35364
Differential Revision: D20670031
Pulled By: ezyang
fbshipit-source-id: cd8018dee703e2424482c27fe9608e040d8105b8
Summary:
And a few typos.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34791
Test Plan: CI
Differential Revision: D20524879
Pulled By: malfet
fbshipit-source-id: 58fa03bd6356979e77cd1bffb6370d41a177c409
Summary:
Throwing from a destructor leads to undefined behaviour (most often a segfault),
so it's better to leak memory than to segfault.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34756
Test Plan: Run `test_pytorch_onnx_caffe2`
Differential Revision: D20504228
Pulled By: malfet
fbshipit-source-id: 7a05776fea9036f602e95b8182f8493cb5886dab
Summary:
To speed up compilation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34811
Test Plan: CI
Differential Revision: D20476992
Pulled By: malfet
fbshipit-source-id: 922cde93783fbfc04854851d7a05a635d5239792
Summary:
Replacing <ATen/core/Tensor.h> with <ATen/core/TensorBody.h> speeds up compilation of caffe2 operators by 15%.
For example, it reduces pool_op.cu compilation from 18.8s to 16s.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34810
Test Plan: CI
Differential Revision: D20472230
Pulled By: malfet
fbshipit-source-id: e1b261cc24ff577f09e2d5f6428be2063c6d4a8b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34105
Make parallel_net_test.cc chronos-conforming.
Exclude gtest asserts that check thrown exceptions when exceptions are disabled.
Test Plan: CI green
Differential Revision: D20153525
fbshipit-source-id: 7371e559da948f46773fed09e3a23a77411d59e0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33954
Fixes caffe2/core/module_test.cc on Windows.
Miscellaneous lint fixes.
Test Plan: CI green
Reviewed By: malfet
Differential Revision: D20153512
fbshipit-source-id: aeae84a028e26edd65c7218611e3c49a8d9bb8c0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33959
Make sure clang on Windows uses the correct attributes.
Add support for cl.exe-style pragma attributes.
Test Plan: CI green
Differential Revision: D20153548
fbshipit-source-id: bfbfd374e8f5e7d7b8598453c3ca2b6693a425f1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33563
When NVCC or Clang are driving CUDA compilation many math functions are declared by default, with a small difference: Clang marks them as `__device__` only, while NVCC uses both `__host__` and `__device__`. This makes every un-elaborated `min` or `max` function call from a `__host__` function generate a syntax error when Clang is used.
Fix the errors by using `std::min` and `std::max` from `<algorithm>`, since C++14 they are `constexpr` and can be used in the `__device__` code [1].
1. https://llvm.org/docs/CompileCudaWithLLVM.html#algorithm
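A small sketch of the pattern being fixed (`clamp_count` is a made-up host-side example):
```
#include <algorithm>
#include <cstdint>

int64_t clamp_count(int64_t n, int64_t lo, int64_t hi) {
  // A bare `min`/`max` would resolve to the CUDA math declarations, which clang marks
  // __device__-only; std::min/std::max are constexpr since C++14 and work in both
  // host and device code.
  return std::max(lo, std::min(n, hi));
}
```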
Test Plan:
```lang=bash
buck build mode/opt -c fbcode.cuda_use_clang=true //fblearner/flow/projects/dper:workflow
buck build mode/opt //fblearner/flow/projects/dper:workflow
```
Execute tests on devgpu:
```
buck test mode/dev-nosan -j 8 //caffe2/caffe2/python/operator_test/... //caffe2/test:cuda
```
Reviewed By: ngimel
Differential Revision: D20005795
fbshipit-source-id: 98a3f35e8a96c15d3ad3d2066396591f5cca1696
Summary: The first run of the net is noisy sometimes - just run it twice.
Reviewed By: cheshen1
Differential Revision: D20039274
fbshipit-source-id: 639e65646bf52f3efe1ecd4bbcd0e413d9389b29
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30734
What are specialized lists?
The IValues that hold List[int], List[Tensor], and List[AnythingElse] are different C++ types.
e.g. List[int] has a std::vector<int> while List[AnythingElse] holds a std::vector<IValue>.
Why do we have specialized lists?
When we first created the JIT we needed to bind the ATen C++ API which has std::vector<int>,
std::vector<Tensor> as inputs. The easiest way to match this API was to make our IValues contain
these same types. Conversion was just unwrapping the IValue, very easy and cheap.
What is the problem with specialized lists?
We end up with significant special cases through the compiler. Other types like Dict are not
specialized. So in the Pickler, for instance, there is a single piece of logic to handle
their serialization. For Lists, we end up with multiple cases. Furthermore, it doesn't
match Python, leading to problems along translation boundaries. Our pickle serialization
is slightly different than python, so it is harder to load objects from our IValue serialization
as Python values.
They also make it harder to provide an easy-to-use user API. We'd like to match pybind11 for C++
bindings to TorchScript. This would entail having a single torch::List class (untemplated)
that can be used to construct inputs. This is made much harder if the underlying ivalue needs
to be different depending on the type inside the list. The ideal case would be to have a constructor like
```
template<typename T>
List(std::vector<T> foo);
```
It would then set up the type tags correctly based on type T, without the need for passing tags.
Do specialized lists improve perf?
Not in a way we have been able to measure. Our major concern initially was having to translate
a std::vector<IValue> to std::vector<int> to call ATen functions. This was especially a concern
for aten::_convolution which takes a number of mostly-constant lists of integers. However,
when we measure the effect of actually having to do this conversion for an aten::_convolution,
it does not take measurable time (benchmark results below).
This is true even if you use a trivial convolution (e.g. 1x1x1), and comment out the actual convolution code.
What are the issues removing them?
This PR removes list specialization but keeps the serialization format, and IValue APIs almost exactly
the same. The only visible change is that toTensorListRef and family have turned into toTensorVector
because they now return by value a copy of the list as a vector.
Further PRs can then clean up the complexity issues that arose from specialization. This will likely
involve removing the isTensorList/isIntList functions, and refactoring the code that used them to
work generically. At some point we will also change serialization to no longer write specialized
lists in the pickle binary. This is forward incompatible, so will go in its own PR.
Benchmark:
```
import torch
import torch.nn as nn
import torch.nn.functional as F
import time
class MnistNet(nn.Module):
    def __init__(self):
        super(MnistNet, self).__init__()
        self.conv1 = nn.Conv2d(1, 1, kernel_size=1)
        self.conv2 = nn.Conv2d(1, 1, kernel_size=1)
    def forward(self, x):
        for i in range(10):
            x = F.relu(self.conv1(x))
            x = F.relu(self.conv2(x))
        return x
model = MnistNet()
x = torch.rand(1, 1, 1, 1)
r = torch.jit.trace(model, x )
r(x)
r(x)
r(x)
r(x)
print(torch.jit.last_executed_optimized_graph())
while True:
    b = time.time()
    for i in range(100):
        r(x)
    e = time.time()
    print(e - b)
```
Results (no observable difference):
```
Before (actual conv)
0.13251137733459473
0.13260436058044434
0.13276338577270508
0.1327497959136963
0.13250041007995605
0.13270330429077148
0.13290190696716309
0.13265132904052734
0.13274288177490234
0.1326758861541748
0.13253355026245117
0.13254785537719727
0.13260746002197266
0.13285017013549805
0.13264012336730957
0.132490873336792
0.13280034065246582
0.13243484497070312
0.1325232982635498
0.1326127052307129
0.13264131546020508
0.13274383544921875
0.13298296928405762
0.1326909065246582
-------------------
After (actual conv)
0.13127517700195312
0.13150334358215332
0.13092470169067383
0.13102364540100098
0.13134360313415527
0.13155555725097656
0.13314104080200195
0.13151955604553223
0.13160037994384766
0.1315293312072754
0.13137340545654297
0.13148093223571777
0.131455659866333
0.1327371597290039
0.13134026527404785
0.13152337074279785
0.13151192665100098
0.13165974617004395
0.13403725624084473
0.13251852989196777
0.13135504722595215
0.1315624713897705
0.1317615509033203
0.1314380168914795
0.13157200813293457
--------------------
The following replace the convolution operator with a no-op, to show
that even if the conv op was made faster, then we still would not see
a difference:
Before (fake conv)
0.0069539546966552734
0.0069522857666015625
0.007120847702026367
0.007344722747802734
0.007689952850341797
0.007932662963867188
0.00761723518371582
0.007501363754272461
0.007532835006713867
0.007141828536987305
0.007174253463745117
0.007114410400390625
0.007071495056152344
------------------
After (fake conv)
0.007458209991455078
0.007337093353271484
0.007268190383911133
0.007313251495361328
0.007306575775146484
0.007468700408935547
0.0073091983795166016
0.007308483123779297
0.007538318634033203
0.007356882095336914
0.007464170455932617
0.007372140884399414
```
Test Plan: Imported from OSS
Differential Revision: D18814702
Pulled By: zdevito
fbshipit-source-id: 0371c73b63068fdc12f24b801371ea90f23531a6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31335
When an error occurs in a net we end up cancelling all the async ops. If one error occurs it's highly likely other errors will occur as well.
Typically we see:
1. SendOp failed due to a network error
2. async scheduling cancels all other ops via `SetFinished("Cancelled");`
3. Another SendOp fails due to a network error and crashes the process when the exception is thrown.
This changes caffe2 ops to allow failing twice.
Test Plan: buck test //caffe2/caffe2:caffe2_test_cpu
Reviewed By: andrewwdye
Differential Revision: D19106548
fbshipit-source-id: 4b7882258a240894cc16d061a563c83a3214d3d9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30915
Since we now have C++14, we don't need these c10::guts helpers anymore
ghstack-source-id: 95777609
Test Plan: waitforsandcastle
Differential Revision: D18869639
fbshipit-source-id: 97716f932297c64c6e814410ac47b444c33d4e2e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30917
This is a C++14 feature; we can use it now.
ghstack-source-id: 95255753
Test Plan: waitforsandcastle
Differential Revision: D18869637
fbshipit-source-id: dd02036b9faeaffa64b2d2d305725443054da31b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31116
Changelist:
- remove BUILD_NAMEDTENSOR macro
- remove torch._C._BUILD_NAMEDTENSOR
- remove all python behavior that relies on torch._C._BUILD_NAMEDTENSOR
Future:
- In the next diff, I will remove all usages of
ATen/core/EnableNamedTensor.h since that header doesn't do anything
anymore
- After that, we'll be done with the BUILD_NAMEDTENSOR removal.
Test Plan: - run CI
Differential Revision: D18934951
Pulled By: zou3519
fbshipit-source-id: 0a0df0f1f0470d0a01c495579333a2835aac9f5d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30912
Add a new data type, ZERO_COLLISION_HASH.
Test Plan: ci
Reviewed By: boryiingsu
Differential Revision: D18843626
fbshipit-source-id: b2d8280f13c78b4a656cf95822198df59de7b64c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30315
The new structure is that libtorch_cpu contains the bulk of our
code, and libtorch depends on libtorch_cpu and libtorch_cuda.
This is a reland of https://github.com/pytorch/pytorch/pull/29731 but
I've extracted all of the prep work into separate PRs which can be
landed before this one.
Some things of note:
* torch/csrc/cuda/nccl.cpp was added to the wrong list of SRCS, now fixed (this didn't matter before because previously they were all in the same library)
* The dummy file for libtorch was brought back from the dead; it was previously deleted in #20774
In an initial version of the patch, I forgot to make torch_cuda explicitly depend on torch_cpu. This led to some very odd errors, most notably "bin/blob_test: hidden symbol `_ZNK6google8protobuf5Arena17OnArenaAllocationEPKSt9type_infom' in lib/libprotobuf.a(arena.cc.o) is referenced by DSO"
* A number of places in Android/iOS builds have to add torch_cuda explicitly as a library, as they do not have transitive dependency calculation working correctly
* I had to apply caffe2_interface_library to torch_cpu/torch_cuda so that they get whole-archive linked into torch when you statically link. And I had to do this in an *exported* fashion because torch needs to depend on torch_cpu_library. In the end I exported everything and removed the redefinition in the Caffe2Config.cmake. I am not too sure why the old code did it this way in the first place; however, switching it doesn't seem to have broken anything.
* There's some uses of `__HIP_PLATFORM_HCC__` still in `torch_cpu` code, so I had to apply it to that library too (UGH). This manifests as a failure when trying to run the CUDA fuser. This doesn't really matter substantively right now because we still in-place HIPify, but it would be good to fix eventually. This was a bit difficult to debug because of an unrelated HIP bug, see https://github.com/ROCm-Developer-Tools/HIP/issues/1706
Fixes #27215 (as our libraries are smaller), and executes on part of the plan in #29235.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D18790941
Pulled By: ezyang
fbshipit-source-id: 01296f6089d3de5e8365251b490c51e694f2d6c7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29337
This argument is needed by boxing wrappers so they're able to get a pointer to the corresponding unboxed kernel and call into it.
But if a kernel is registered in a boxed way, we don't need it and should hide this from the API.
This is especially needed for the backend fallback API where users would only be left wondering why this argument is there and what it does.
Also, hiding it allows us to potentially totally remove it in a future refactoring if we find some way to do so.
ghstack-source-id: 94481316
Test Plan: unit tests
Differential Revision: D18361991
fbshipit-source-id: 5cef26c896fe3f2a5db730d3bc79dcd62e7ef492
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29201
This is required for boxed backend fallback kernels (e.g. lazy, AMP) because they need to know which op was actually called.
ghstack-source-id: 94481313
Test Plan: I will add unit tests in a diff stacked on top
Differential Revision: D18282746
fbshipit-source-id: 339a1bbabd6aff31a587b98f095c75104dfc6f99
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29731
The new structure is that libtorch_cpu contains the bulk of our
code, and libtorch depends on libtorch_cpu and libtorch_cuda.
Some subtleties about the patch:
- There were a few functions that crossed CPU-CUDA boundary without API macros. I just added them, easy enough. An inverse situation was aten/src/THC/THCTensorRandom.cu where we weren't supposed to put API macros directly in a cpp file.
- DispatchStub wasn't getting all of its symbols related to static members on DispatchStub exported properly. I tried a few fixes but in the end I just moved everyone off using DispatchStub to dispatch CUDA/HIP (so they just use normal dispatch for those cases.) Additionally, there were some mistakes where people incorrectly were failing to actually import the declaration of the dispatch stub, so added includes for those cases.
- torch/csrc/cuda/nccl.cpp was added to the wrong list of SRCS, now fixed (this didn't matter before because previously they were all in the same library)
- The dummy file for libtorch was brought back from the dead; it was previously deleted in #20774
- In an initial version of the patch, I forgot to make torch_cuda explicitly depend on torch_cpu. This led to some very odd errors, most notably "bin/blob_test: hidden symbol `_ZNK6google8protobuf5Arena17OnArenaAllocationEPKSt9type_infom' in lib/libprotobuf.a(arena.cc.o) is referenced by DSO"
- A number of places in Android/iOS builds have to add torch_cuda explicitly as a library, as they do not have transitive dependency calculation working correctly. This situation also happens with custom C++ extensions.
- There's a ROCm compiler bug where extern "C" on functions is not respected. There's a little workaround to handle this.
- Because I was too lazy to check if HIPify was converting TORCH_CUDA_API into TORCH_HIP_API, I just made it so HIP build also triggers the TORCH_CUDA_API macro. Eventually, we should translate and keep the nature of TORCH_CUDA_API constant in all cases.
Fixes#27215 (as our libraries are smaller), and executes on
part of the plan in #29235.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D18632773
Pulled By: ezyang
fbshipit-source-id: ea717c81e0d7554ede1dc404108603455a81da82
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29653
I didn't remove is_variable from Tensor for BC reasons, but I did
remove as many uses as I could from the codebase.
at::impl::variable_excluded_from_dispatch got moved to TensorBody.h
so that it's more widely accessible.
This diff is NOT semantics preserving. Here are the major differences:
- In a number of native operator implementations, we tested that arguments
are not variable. I replaced these with asserts that variable is
excluded from dispatch. I actually don't think these asserts are really
necessary now (they should certainly be true, but it's hard to get
it wrong), but I've kept them for old time's sake. At least, they'll detect
if you call these functions before you've processed variable (indicating
a bug in your kernel.)
- There are a number of places where we do a per-tensor test for being a
variable, for better error reporting when someone commits Tensor/Variable
confusion. Although these tests are substantively the same as the
tests above, in these cases I decided to *delete* the test entirely.
The reasoning is that in these cases, we didn't really care about
dispatch (also, see above; I'm not too sure we really need the dispatch
asserts), we cared about Tensor/Variable confusion. Since Tensor/Variable
confusion is impossible now, we don't need the tests. One of the key
factors which pushed me one way or another was whether or not a function
was doing per-tensor validation; if I kept the assert in such functions,
I'd repeatedly access the TLS. Even if we want to bring back the asserts,
they would have to go somewhere else.
Another similar idiom is the number of places we do !x.defined() ||
x.is_variable(); I treated this equivalently.
- nuclear_norm's computation of compute_uv is a bit weird, but I think
it's OK to just delete the is_variable case (I *suspect* that it is
always the case that self.is_variable(), but it doesn't really matter.)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D18496168
Pulled By: ezyang
fbshipit-source-id: 5a1ded931e0c10a6b758ba64a8380d34110e0c3e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29670
This is the entry point for loading CUDA code; improve the error message to prompt users to check that GPU code is included.
Test Plan: Build without gpu code. Run the binary. Check that the new error message exists.
Reviewed By: yfeldblum
Differential Revision: D18453798
fbshipit-source-id: 63d9ec50acdf57ef4baf3f7d99c836c56bc1435e
Summary:
Also move the logic that installs the pybind11 headers from setup.py to cmake (to align with other headers).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29659
Differential Revision: D18458208
Pulled By: bddppq
fbshipit-source-id: cfd1e74b892d4a65591626ab321780c8c87b810d
Summary:
This diff adds the following:
- An AsyncIf to support conditional async execution. This op assumes that then_net and else_net are async scheduling nets. This op itself completes when every async op in the active net completes. Cancellation cancels the inner nets and the async ops.
- Unit tests targeting asynchronicity and error/cancellation handling.
Test Plan:
New unit tests
With --stress-runs=2000:
https://our.intern.facebook.com/intern/testinfra/testrun/4785074616784325
Reviewed By: ilia-cher
Differential Revision: D18051357
fbshipit-source-id: 1399a437b3ca63fd4ea0cf08d173f85b9242cc1f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29052
Make sure we handle the case of multiple, async, terminal (no children)
and failing cpu ops.
Test Plan: AsyncIf tests
Reviewed By: yyetim
Differential Revision: D18276401
Pulled By: ilia-cher
fbshipit-source-id: 35b175dd025bc7e392056ac1331b159376a29e60
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28024
We preallocated type ids to align them with ScalarType. At that point, the maximum type id was 10 and we used 11 to specify undefined type id.
However, since then, ScalarType got more additions, 11 isn't undefined anymore, and numbers 11-15 have meaning.
caffe2::TypeIdentifier also got its separate additions, 12 and upwards have meaning that differs from ScalarType.
I'm going with the (CI-tested) assumption that caffe2::TypeIdentifier and ScalarType actually don't need to be aligned
and remove the functionality for preallocated type ids. This simplifies our type ids.
ghstack-source-id: 92051872
Test Plan: unit tests
Differential Revision: D17936165
fbshipit-source-id: 2c9df2b9b3f35b3e319641c96638321ac3433d5c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26509
We preallocated type ids to align them with ScalarType. At that point, the maximum type id was 10 and we used 11 to specify undefined type id, see https://github.com/pytorch/pytorch/pull/10139.
However, since then, ScalarType got more additions, 11 isn't undefined anymore, and numbers 11-15 have meaning.
caffe2::TypeIdentifier also got its separate additions, 12 and upwards have meaning that differs from ScalarType.
I'm going with the (CI-tested) assumption that caffe2::TypeIdentifier and ScalarType actually don't need to be aligned
and remove the functionality for preallocated type ids. This simplifies our type ids.
ghstack-source-id: 91896918
Test Plan: unit tests
Differential Revision: D17490109
fbshipit-source-id: 800c340d9d3556a99f6e3ffc33af14ad68d7cc59
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26502
Create type ids at compile time instead of incrementing a counter at runtime. This is done by computing a compile time crc64 on the type name. We couldn't do this before, because we still used GCC4 and that compiler didn't support the use of `__PRETTY_FUNCTION__` in a constexpr context. However, since GCC5 this is possible and we can use this trick.
This does not change the semantics of preallocated type ids. I actually think we don't need to preallocate anymore, but I split the removal of preallocation into a separate diff to be able to test it separately.
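A hedged sketch of the compile-time id idea (it uses a simple FNV-1a hash instead of the crc64 the real change uses; `type_id_for<T>()` is illustrative, not the c10 API):
```
#include <cstdint>

// Simple FNV-1a stand-in for the compile-time crc64 used by the real change.
constexpr uint64_t hash_string(const char* s, uint64_t h = 14695981039346656037ull) {
  return *s == '\0'
      ? h
      : hash_string(s + 1, (h ^ static_cast<uint64_t>(*s)) * 1099511628211ull);
}

template <typename T>
constexpr uint64_t type_id_for() {
  // __PRETTY_FUNCTION__ embeds the instantiated type name (e.g. "... [with T = int]"),
  // and since GCC5 it can be used in a constexpr context, so hashing it yields a
  // per-type compile-time constant instead of a runtime counter.
  return hash_string(__PRETTY_FUNCTION__);
}

static_assert(type_id_for<int>() != type_id_for<float>(), "distinct types, distinct ids");
```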
ghstack-source-id: 91896920
Test Plan: unit tests
Differential Revision: D17488861
fbshipit-source-id: ce7b059d7c8686b69cb091a4a8beaf4b96391343
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27086
This is a major source of merge conflicts, and AFAICT isn't necessary anymore (it may have been necessary for some mobile build stuff in the past).
This is a commandeer of #25031
Test Plan: Imported from OSS
Reviewed By: ljk53
Differential Revision: D17687345
Pulled By: ezyang
fbshipit-source-id: bf6131af835ed1f9e3c10699c81d4454a240445f
Summary: Add helper function randomFill to test_utils.h so we can use it in benchmark scripts as well as tests.
Test Plan:
```
buck run mode/opt //tvm/sparse:cblas_bench
```
Reviewed By: yinghai
Differential Revision: D17759193
fbshipit-source-id: e4909b04e83ca9382ab4718855fb63743d028de1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26337
- Factor out boxing and unboxing functionality from the c10 dispatcher into a c10::KernelFunction class
- Move that class and everything else it depends on into ATen/core/boxing
- This also allows us to get rid of c10::KernelCache. Instead, we now store a pointer to the unboxed functor in c10::KernelFunction.
- We're also getting rid of the DispatchTableEntry struct and instead store KernelFunction directly.
- To make this work, we need to change the dispatcher calling API from Dispatcher::lookup().callBoxed/callUnboxed and OperatorEntry::lookup().callBoxed/callUnboxed to Dispatcher::callBoxed/callUnboxed and OperatorEntry::callBoxed/callUnboxed.
ghstack-source-id: 90459911
Test Plan: unit tests
Differential Revision: D17416607
fbshipit-source-id: fd221f1d70eb3f1b4d33092eaa7e37d25684c934
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25908
Original commit changeset: f6e961e88c01
device_option propagation is completely broken in Caffe2 when pass-through operators are used. As an example, the Gather operator doesn't have a gradient and passes through its inputs, which results in incorrect detection of the components for sparse parameter aggregation (the component will be empty instead of the real device).
This diff is trying to fix this issue.
The original diff had a problem: Caffe2 does not handle cases where the device option is present but contains only metadata (for example, the one for auto-generated reduction ops in the backward pass). This diff addresses that by merging device options during the backward pass.
Test Plan:
1. net_transform is finally working with Gather + FloatToHalf transformed model instead of failing because of incorrect number of components.
2. New unit-test.
3. Verify that previously broken benchmark is now passing
ezyang do you have suggestions what else I should test?
Reviewed By: ezyang
Differential Revision: D17281528
fbshipit-source-id: 4a1bc386f29f6a34fbf8008effde9d4890abebfa
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23668
- The eager mode frontend now calls operators that are defined in native_functions.yaml with `use_c10_dispatcher: True` through the c10 dispatcher and no longer through globalATenDispatch().
- These operators aren't registered with globalATenDispatch anymore, only with c10 now.
- Backend extensions calling globalATenDispatch().registerOp() to add their own kernels still work, this function will forward the registration to the c10 dispatcher for them.
ghstack-source-id: 90130455
Test Plan: benchmarks at https://docs.google.com/document/d/1gpzKZcFf1JJameY1vKxF7Cloul9s6D8HKIK2_Pp1hFo/edit#
Differential Revision: D16603133
fbshipit-source-id: 991f17b355e9c78c5e86fee4fa381df7ab98ac82
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25650
This PR removes protobuf dependencies from mobile build altogether:
- caffe2/proto: protobuf files, including caffe2.proto and torch.proto;
- caffe2 components that depend on caffe2.proto, including most part of
caffe2/core, caffe2/utils;
- libprotobuf / libprotobuf-lite dependencies;
- protobuf compiler;
- some utils class, e.g.: netdef_converter.cpp;
- introduce a macro to disable third_party/onnx which depends on protobuf;
Test Plan:
- builds;
- link with demo app to make sure it can load and run a model in pickle format;
Differential Revision: D17183548
Pulled By: ljk53
fbshipit-source-id: fe60b48674f29c4a9b58fd1cf8ece44191491531
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25671
To decouple string_utils.h from types.h and protobuf headers.
Logically, GetDimFromOrderString seems more similar to StringToStorageOrder than to the other string_utils functions.
Test Plan: - Will check all internal/external CI jobs.
Reviewed By: yinghai
Differential Revision: D17191912
Pulled By: ljk53
fbshipit-source-id: fe555feef27bfd74c92b6297c12fb668252ca9ff
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23888
This is an alternative to https://github.com/pytorch/pytorch/pull/23684.
Instead of splitting a bunch of headers into declaration and definition, we change tensor includes to only include the tensor declaration when the tensor definition isn't needed.
ghstack-source-id: 89357687
Test Plan: waitforsandcastle
Differential Revision: D16673569
fbshipit-source-id: fa1d92809b05de7910a8c2dc2f55abe071ca63bf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25620
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25602
Enable rocThrust with hipCUB and rocPRIM for ROCm. They are the ROCm implementations of the thrust and cub APIs and replace the older hip-thrust and cub-hip packages going forward. ROCm 2.5 is the first release to contain the new packages as an option, as of 2.6 they will be the only available option.
Add hipification rules to correctly hipify thrust::cuda to thrust::hip and cub:: to hipcub:: going forward. Add hipification rules to hipify specific cub headers to the general hipcub header.
Infrastructure work to correctly find, include and link against the new packages. Add the macro definition to choose the HIP backend to Thrust.
Since include chains are now a little different from CUDA's Thrust, add includes for functionality used where applicable.
Skip four tests that fail with the new rocThrust for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21864
Reviewed By: xw285cornell
Differential Revision: D16940768
Pulled By: bddppq
fbshipit-source-id: 3dba8a8f1763dd23d89eb0dd26d1db109973dbe5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25252
Our model going forward for extensions will be that you will have to
get an allocation of an ID in our system. This is how things work
in practice today; we're just simplifying our underlying registration
since there is no need to have distributed registration.
There are some codemods in this diff:
```
codemod --extensions cpp,h,cc,cuh,py,in --exclude-paths=c10/core/TensorTypeId.h '([A-Za-z]+?)TensorId\(\)' 'TensorTypeId::\1TensorId'
codemod --extensions cpp,h,cc,cuh,py,in 'TensorTypeIds::undefined\(\)' 'TensorTypeId::UndefinedTensorId'
codemod --extensions cpp 'TensorType1\(\)' 'TensorTypeId::CPUTensorId'
codemod --extensions cpp 'TensorType2\(\)' 'TensorTypeId::CUDATensorId'
codemod --extensions cpp 'TensorType3\(\)' 'TensorTypeId::XLATensorId'
codemod --extensions cpp 'TensorType1' 'CPUTensorId'
codemod --extensions cpp 'TensorType2' 'CUDATensorId'
codemod --extensions cpp 'TensorType3' 'XLATensorId'
```
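For illustration, after these codemods a call site moves from the old function-call spelling to the enum-style spelling (a minimal sketch; the variables are hypothetical):
```cpp
#include <c10/core/TensorTypeId.h>

// Old spelling (pre-codemod):  auto key = CPUTensorId();
// New spelling (post-codemod): enum-style identifiers on TensorTypeId.
constexpr auto cpu_key  = c10::TensorTypeId::CPUTensorId;
constexpr auto cuda_key = c10::TensorTypeId::CUDATensorId;
```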
The main hand-written changes are in c10/core/TensorTypeId.h
Other manual fixes:
- aten/src/ATen/core/op_registration/op_registration.cpp - stop using
std::string operator+
- aten/src/ATen/function_wrapper.py - handle a hardcoded TypeId() that
wasn't caught by codemod
- torch/csrc/tensor/python_tensor.h - fix now incorrect forward declaration
of TensorTypeId
- aten/src/ATen/core/op_registration/ - remove out-of-line registration
Differential Revision: D17072001
Test Plan: ossci and sandcastle
Pulled By: ezyang
fbshipit-source-id: c641515fd0604c045c54fbb1d6b1b950f45e89d1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24361
Currently we only support Conv in the kernel but have an entry point for both types using one shared class.
It is time to make the change.
Reviewed By: csummersea
Differential Revision: D16604713
fbshipit-source-id: b98d39a2c7960707cd50ba27e43dce73f741eeeb
Summary:
Adds qtensor-specific fields to the proto file so that they get serialized into the model.json.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23356
ghstack-source-id: 87263428
Differential Revision: D16473237
fbshipit-source-id: bf5b51d0863d036d30a1644a3c3b74516468224b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23096
Nets can have state that depends on the rest of the state in the Workspace. Hence, they should be destructed first.
Reviewed By: ajyu
Differential Revision: D16382987
fbshipit-source-id: 3fd030ba206e2d0e897abb9e31c95bdaeb9482b7
Summary:
As part of the Variable/Tensor merge, we want to be able to pass Variables into Caffe2 without doing extra shallow copy, to improve performance and also allow for in-place mutations in Caffe2 ops. There are a few approaches outlined in https://github.com/pytorch/pytorch/pull/22418, and this PR is the chosen approach.
Specifically, we can have the assumption that we won't be connecting autograd to C2 gradients at any point (as it's too tricky and not that useful). Therefore, we can pass Variable into Caffe2 ops by requiring that all Variables in Caffe2 don't require grad. For code paths in Caffe2 that might potentially track gradients (e.g. `ScriptModuleOp` and `call_caffe2_op_from_c10`), we use the `torch::NoGradGuard` to make sure gradients are not tracked.
This supersedes https://github.com/pytorch/pytorch/pull/22418.
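A rough sketch of how the guard is used (the wrapper function here is hypothetical; only `torch::NoGradGuard` is the real API):
```cpp
#include <torch/torch.h>

// Hypothetical helper: run a Caffe2 op on a tensor that may be an autograd
// Variable; the guard makes sure no gradient tracking happens in this scope.
void run_caffe2_op_without_grad(const torch::Tensor& input) {
  torch::NoGradGuard no_grad;  // gradients are not tracked inside this scope
  // ... hand `input` to the Caffe2 operator here ...
}
```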
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22473
Differential Revision: D16099042
Pulled By: yf225
fbshipit-source-id: 57efc3c7cfb3048d9abe90e63759acc14ebd2972
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22477
There is actually no use of an uninitialized variable, but some compilers are not smart enough to reason that the two if branches are always taken together.
Reviewed By: hx89
Differential Revision: D16100211
fbshipit-source-id: 25f01d668063603d7aaa776451afe8a10415d2ea
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22005
When a Dict or List is created with type information, it will remember it.
If at any later point the list is instantiated as a List<T> with a concrete type, it will assert that T is the correct type.
Differential Revision: D15914462
fbshipit-source-id: a8c3d91cb6d28d0c1ac0b57a4c4c6ac137153ff7
Summary:
Currently the build system accepts USE_NAMEDTENSOR as an environment
variable and turns it into NAMEDTENSOR_ENABLED when passing it to CMake.
This discrepancy does not seem necessary and complicates the build
system. The naming of this build option is also semantically incorrect
("BUILD_" vis-a-vis "USE_"). This commit eradicates the issue before it
is made into a stable release.
Support for NO_NAMEDTENSOR is also removed, since PyTorch has been
quite inconsistent about "NO_*" build options.
---
Note: All environment variables with their names starting with `BUILD_` are currently automatically passed to CMake with no need of an additional wrapper.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22360
Differential Revision: D16074509
Pulled By: zou3519
fbshipit-source-id: dc316287e26192118f3c99b945454bc50535b2ae
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22241
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20387
glibc has a non-standard function, feenableexcept, that enables trapping of floating-point exceptions. Compared to feclearexcept + fetestexcept, this approach lets us see precisely where the exception is raised from the stack trace.
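A minimal, glibc-specific sketch of the trap-based approach (the example program is illustrative, not code from this diff):
```cpp
#include <cfenv>   // feenableexcept is a glibc extension (needs _GNU_SOURCE; g++ defines it by default)
#include <cstdio>

int main() {
  // Turn selected floating-point exceptions into traps (SIGFPE), so the stack
  // trace points at the exact offending instruction instead of having to poll
  // with feclearexcept()/fetestexcept() after the fact.
  feenableexcept(FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW);

  volatile double zero = 0.0;
  std::printf("%f\n", 1.0 / zero);  // traps here once FE_DIVBYZERO is enabled
  return 0;
}
```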
Reviewed By: jspark1105
Differential Revision: D15301095
fbshipit-source-id: 94f6e72456b2280f78d7d01c2ee069ae46d609bb
Summary:
Saying `I` in an error message is too subjective to be used in a framework.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22369
Differential Revision: D16067712
Pulled By: soumith
fbshipit-source-id: 2a390646bd5b15674c99f65e3c460a7272f508b6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22084
For DictPtr/ListPtr, default construction was disallowed because it was ambiguous whether it was supposed to create an empty list or a nullptr.
But since we renamed them to Dict/List, we can now allow default construction without ambiguity.
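A minimal sketch of what this enables (the header paths are an assumption about the current layout):
```cpp
#include <ATen/core/Dict.h>
#include <ATen/core/List.h>
#include <cstdint>
#include <string>

// Default construction now unambiguously means "empty container",
// rather than a possibly-null handle as with the old *Ptr types.
c10::List<int64_t> empty_list;
c10::Dict<std::string, int64_t> empty_dict;
```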
Differential Revision: D15948098
fbshipit-source-id: 942a9235b51608d1870ee4a2f2f0a5d0d45ec6e6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21937
This changes call sites to use the new naming scheme
Reviewed By: zdevito
Differential Revision: D15892404
fbshipit-source-id: 8d32aa90a0ead1066688166478f299fde9c2c133
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21806
Dispatcher::findSchema(op_name) now uses a lookup table instead of iterating through the list of operators to find it.
This speeds up op lookup (as in finding the operator handle from the name, not as in finding a kernel when you already have the operator handle),
and it also speeds up op registration, since that needs to check whether an op with the same name already exists.
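A minimal sketch of the kind of name-to-operator lookup table described above (simplified, with hypothetical types; not the actual Dispatcher code):
```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <unordered_map>

struct OperatorEntry { std::string schema; /* kernels etc. would live here */ };

class MiniDispatcher {
 public:
  // Registration checks for an existing name via the hash map (expected O(1))
  // instead of scanning the whole operator list.
  bool registerOp(const std::string& name, std::string schema) {
    if (byName_.count(name)) return false;       // op with the same name exists
    operators_.push_back({std::move(schema)});
    byName_.emplace(name, operators_.size() - 1);
    return true;
  }

  // findSchema: hash-map lookup instead of iterating over all operators.
  const OperatorEntry* findSchema(const std::string& name) const {
    auto it = byName_.find(name);
    return it == byName_.end() ? nullptr : &operators_[it->second];
  }

 private:
  std::deque<OperatorEntry> operators_;                  // stable addresses on growth
  std::unordered_map<std::string, std::size_t> byName_;
};
```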
Differential Revision: D15834256
fbshipit-source-id: c3639d7b567e4ed5e3627c3ebfd01b7d08b55ac1
Summary:
After https://github.com/pytorch/pytorch/pull/17072, we are allowed to pass Variables into ATen ops, thus there is no need to unwrap input variables in the c10 call path.
Note that since Caffe2 still expects inputs to be pure Tensors, we moved the unwrapping logic to the Caffe2 wrapper.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21620
Differential Revision: D15763560
Pulled By: yf225
fbshipit-source-id: 5375f0e51eb320f380ae599ebf98e6b259f0bff8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21446
This is used for easier tracing of the iteration id when looking at the trace diagram.
Reviewed By: ilia-cher
Differential Revision: D15628950
fbshipit-source-id: ee75b3bdb14a36abc18c7bddc49d8ec9789b724d
Summary:
This renames the CMake `caffe2` target to `torch`, as well as renaming `caffe2_gpu` to `torch_gpu` (and likewise for other gpu target variants). Many intermediate variables that don't manifest as artifacts of the build remain for now with the "caffe2" name; a complete purge of `caffe2` from CMake variable names is beyond the scope of this PR.
The shell `libtorch` library that had been introduced as a stopgap in https://github.com/pytorch/pytorch/issues/17783 is again flattened in this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20774
Differential Revision: D15769965
Pulled By: kostmo
fbshipit-source-id: b86e8c410099f90be0468e30176207d3ad40c821
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21177
- Integrate c10::ListPtr into IValue and the c10 dispatcher.
- Streamline conversion to/from IValue. Before, we had IValue::to<> and kernel_functor.h had its own ivalue_to_arg_type and return_type_to_ivalue. They are now unified. Also, this means that nested types like Dicts of Lists of Optional of Dict of ... now work as expected (see the sketch below).
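As a rough sketch of the nested round-trip this enables (written against the current-style c10::List/c10::Dict API, which may differ slightly from the ListPtr-era code in this diff):
```cpp
#include <ATen/core/Dict.h>
#include <ATen/core/List.h>
#include <ATen/core/ivalue.h>

void roundtrip_example() {
  c10::Dict<std::string, c10::List<int64_t>> dict;
  dict.insert("key", c10::List<int64_t>({1, 2, 3}));

  c10::IValue iv(dict);  // wrap the nested container in an IValue
  auto back = iv.to<c10::Dict<std::string, c10::List<int64_t>>>();  // and unwrap it
  (void)back;
}
```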
Differential Revision: D15476433
fbshipit-source-id: bde9df80df20091aa8e6ae17ba7e90abd149b954
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21492
If one async operator fails, the async_scheduling net currently only marks all scheduled async operators as finished without cancelling their callbacks.
The new behavior is to cancel the callbacks first, then set the event status to finished.
Reviewed By: ilia-cher
Differential Revision: D15702475
fbshipit-source-id: 55a1774d768b2e238bab859b83332f1877a001ca
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17946
Some of these are probably implementable for exported operators,
but aren't implemented yet and for now it's better to assert than to just return wrong results.
Reviewed By: ezyang
Differential Revision: D14430749
fbshipit-source-id: 2b0037a9ed227a22aa7376a90e6d3d09d3e04707
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20603
When we use intra_op_parallel operators, Caffe2 tracing generated a trace only for the master task, giving the false impression that many threads are underutilized.
This diff also traces child tasks.
Reviewed By: ilia-cher
Differential Revision: D14820008
fbshipit-source-id: ff4ed203804d86d9231c21c99d869f1ddf1d1ef9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20493
This helps distinguish whether or not the op was a quantized op.
Reviewed By: salexspb
Differential Revision: D15337854
fbshipit-source-id: 43c7aef143085cfaeb4ec2102a7f36cc454e0e94
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20173
Enabled op profiling even when net type is not dag or prof dag. Also added
engine type info to summary.
Reviewed By: salexspb, ilia-cher
Differential Revision: D15177813
fbshipit-source-id: 5be0efeaabc9a961cf1d73b0703749c08bb1adbb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20821
Change registration API. Instead of
  static auto registry = torch::RegisterOperators()
    .op("my::op", torch::RegisterOperators::options()
      .kernel<Kernel>()
      .dispatchKey(CPUTensorId()));
it is now
  static auto registry = torch::RegisterOperators()
    .op("my::op", torch::RegisterOperators::options()
      .kernel<Kernel>(CPUTensorId()));
This binds kernel and dispatch key together, allowing them to be separate from other future configuration options like alias analysis or autograd wrappers.
The semantic problem behind this is that the dispatch key is a *kernel config parameter* and not an *operator config parameter*, while things like autograd wrappers, alias info, and actually the kernel itself are *operator config parameters*. While previously the different kinds of config parameters were mixed, this diff now separates them.
Before this change, it wouldn't have been well defined if you specified a dispatchKey together with an autogradWrapper or aliasInfo for example.
  // what is this supposed to do?
  static auto registry = torch::RegisterOperators()
    .op("my::op", torch::RegisterOperators::options()
      .aliasInfo(DEFAULT)
      .dispatchKey(CPUTensorId()));
If we get more kernel config parameters in the future, we could introduce something like this
  static auto registry = torch::RegisterOperators()
    .op("my::op", torch::RegisterOperators::options()
      .kernel<Kernel>(torch::RegisterOperators::kernelOptions()
        .dispatchKey(CPUTensorId())
        .otherConfig()));
but that's overkill as long as dispatch keys are the only kernel config parameter, and we can introduce that later without breaking backwards compatibility.
A nice side effect of this is that people can register multiple kernels to the same operator in the same `.op()` call:
  static auto registry = torch::RegisterOperators()
    .op("my::op", torch::RegisterOperators::options()
      .kernel<Kernel1>(CPUTensorId())
      .kernel<Kernel2>(CUDATensorId()));
Reviewed By: dzhulgakov
Differential Revision: D15455790
fbshipit-source-id: 1c46bfe676dcacf74cf36bd3f5df3d2c32b8fb11
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17818
Some of these are probably implementable for exported operators,
but aren't implemented yet and for now it's better to assert than to just return wrong results.
Reviewed By: ezyang
Differential Revision: D14392459
fbshipit-source-id: bf86e6cb0a7cfefd112a65dc85cc243e57a5ad52
Summary:
Resubmit #20698 which got messed up.
The idea is that when PyTorch is used in a custom build environment (e.g. Facebook), it's useful to track usage of various APIs centrally. This PR introduces a simple, very lightweight mechanism to do so: only the first invocation of a trigger point is logged. This is significantly more lightweight than #18235 and thus we can afford to put logging in e.g. TensorImpl.
Also adds an initial list of trigger points. Trigger points are added in such a way that no static initialization triggers them, i.e. just linking with libtorch.so will not cause any logging. Further suggestions of what to log are welcomed.
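A minimal sketch of a log-once trigger point of this kind (an illustrative mechanism with a made-up macro name, not the actual implementation added in this PR):
```cpp
#include <atomic>
#include <iostream>

// One static flag per call site: the event is emitted only the first time
// execution reaches that particular trigger point, and is nearly free afterwards.
#define LOG_API_USAGE_ONCE(event)                          \
  do {                                                     \
    static std::atomic<bool> _logged{false};               \
    if (!_logged.exchange(true)) {                         \
      std::cout << "API usage: " << (event) << "\n";       \
    }                                                      \
  } while (0)

void tensor_created() {
  LOG_API_USAGE_ONCE("tensor.create");  // logged only on the first call
}
```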
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20745
Differential Revision: D15429196
Pulled By: dzhulgakov
fbshipit-source-id: a5e41a709a65b7ebccc6b95f93854e583cf20aca
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20833
Att. The algorithm is still "horrendously inefficient". But since we are sunsetting Nomnigraph, I just did the minimal fix here.
Reviewed By: tracelogfb
Differential Revision: D15463880
fbshipit-source-id: 413a1280a92c1923ba49031177816a2d5f888575
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20514
Change API from
  static auto registry = c10::RegisterOperators()
    .op("my::op",
      c10::kernel(...),
      c10::dispatchKey(...)
    );
to
  static auto registry = c10::RegisterOperators()
    .op("my::op", c10::RegisterOperators::options()
      .kernel(...)
      .dispatchKey(...)
    );
because this allows better discoverability. People looking for which options are available will find them more easily, and IDE autocompletion will work better.
Reviewed By: zdevito
Differential Revision: D15346348
fbshipit-source-id: 4b74a33b75c2b9cda4a903639fb7abd2c7cff167
Summary:
#19975 was separated into 2 PRs.
This one:
Introduce MemoryFormat argument to the `x.is_contiguous(memory_format=torch.channels_last)` and to the `y = x.contiguous(memory_format=torch.channels_last)` functions.
At this moment both functions just operate on strides and don't store any tensor state.
(Original RFC #19092)
-----
Expands functionality of two tensor functions `.is_contiguous` and `.contiguous` (both python and c++ api).
Note: We had several complaints about `.to(memory_format)` function, and decided not to support it.
1. `.contiguous` now supports an optional keyword-only argument, `memory_format`, which can be either `torch.contiguous_format` or `torch.channels_last`.
- Using `torch.contiguous_format` will preserve existing `.contiguous()` behavior.
- Calling `x.contiguous(memory_format=torch.channels_last)` returns a new tensor which maintains the same semantic layout (NCHW) but has a different memory allocation pattern.
`x.contiguous(memory_format=torch.channels_last)` expects the input tensor to be 3d, 4d or 5d, and fails otherwise.
2. `.is_contiguous` now supports an optional keyword-only argument, `memory_format`, which can be either `torch.contiguous_format` or `torch.channels_last`.
- `x.is_contiguous(memory_format=torch.contiguous_format)` preserves the same functionality as `x.is_contiguous()` and remains unchanged.
- `x.is_contiguous(memory_format=torch.channels_last)` returns true if A) the input tensor is contiguous in memory AND B) it is allocated in memory in NHWC (or similar for 3d, 5d) format.
Note: By the end of phase one, `x.is_contiguous(memory_format=torch.channels_last)` will calculate the state of the Tensor on every call. This functionality is going to be updated later.
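A short sketch of the same calls through the C++ API (the tensor shown is hypothetical):
```cpp
#include <torch/torch.h>

void channels_last_example() {
  // A 4d NCHW tensor (channels_last applies to 3d/4d/5d inputs).
  auto x = torch::randn({8, 3, 32, 32});

  // Same semantic NCHW layout, but memory laid out in channels-last (NHWC) order.
  auto y = x.contiguous(at::MemoryFormat::ChannelsLast);

  bool default_ok  = x.is_contiguous();                                // classic check
  bool channels_ok = y.is_contiguous(at::MemoryFormat::ChannelsLast);  // NHWC check
  (void)default_ok;
  (void)channels_ok;
}
```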
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20455
Differential Revision: D15341577
Pulled By: VitalyFedyunin
fbshipit-source-id: bbb6b4159a8a49149110ad321109a3742383185d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20439
This is the QTensorProto workflow for multi-group quantization on the C2 side.
No DNNLOWP Tensor related changes are included in this PR, so once we finish the Glow side, we should be able to test this PR using resnet50.
Reviewed By: yinghai
Differential Revision: D15096919
fbshipit-source-id: 741eecd59eb79d24d9fe2b035f6246d42422d25c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20463
Source file changes mostly involve ifdef'ing-out references to JIT code
from files that are part of Caffe2Go. Update Internal build scripts to
remove those files from our globs.
After this, changes to most of the JIT files should not trigger mobile CI.
Reviewed By: dzhulgakov
Differential Revision: D15329407
fbshipit-source-id: 48f614c6b028eef0a03ce5161d083a3e078b0412
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20108
Add cpp runs for c2, hooked up via pybinds. Print output to terminal. This is not hooked up with the pep output yet because I'd like to verify the numbers first.
Note that this isn't quite the same mechanism as the pytorch cpp hookup, which uses cpp_python_extensions. If I can use the same mechanism to pull all the inputs for c2 through cpp and do FeedBlobs in cpp, then I'll switch to that.
Reviewed By: zheng-xq
Differential Revision: D15155976
fbshipit-source-id: 708079dacd3e19aacfe43d70c5e5bc54da2cf9e3
Summary:
Some functions were not decorated with `CAFFE2_API`, which makes them unusable when creating unit tests for custom ops outside the Caffe2 repo.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20114
Differential Revision: D15217490
Pulled By: ezyang
fbshipit-source-id: dda3910ad24e566567607deaac705a34ec8e7b8d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19817
A lot of files were depending on the JIT's typesystem
because operator.h depends on function_schema.h. However,
this isn't fundamental to the design. This diff tries to
remove the direct dependency and only include the c10
wrapper helpers in files where they are required.
Reviewed By: smessmer
Differential Revision: D15112247
fbshipit-source-id: 2c53d83e542c32d9a398c8b60dbf40ab7a1cb0f6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19458
The algorithm in https://fburl.com/ggh9iyvc fails to really ensure topological ordering of nodes. The fix is ugly but effective. I think we need a real topological sort to fix this issue more nicely. Mikhail Zolotukhin, Bram Wasti.
Differential Revision: D15011893
fbshipit-source-id: 130c3aa442f5d578adfb14fbe5f16aa722434942
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19388
The old implementation forced a refcount bump when converting at::Tensor to caffe2::Tensor.
Now, it is possible to move it without a refcount bump.
Reviewed By: dzhulgakov
Differential Revision: D14986815
fbshipit-source-id: 92b4b0a6f323ed38376ffad75f960cad250ecd9b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19287
Since we now have a string-schema-based op registration API, we can also use it when exposing caffe2 operators.
Reviewed By: dzhulgakov
Differential Revision: D14931925
fbshipit-source-id: ec162469d2d94965e8c99d431c801ae7c43849c8
Summary:
Currently, a TensorImpl's `is_variable_` is true if and only if the TensorImpl has AutogradMeta. This PR unifies these two concepts by removing `is_variable_` and changing `is_variable()` to check for the existence of AutogradMeta instead.
Removing `is_variable_` is part of the work in Variable/Tensor merge.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19139
Differential Revision: D14893339
Pulled By: yf225
fbshipit-source-id: ceb5e22c3c01f79b5d21d5bdbf4a7d1bc397796a
Summary:
It's not intended that Storages have 'default' CUDA devices, but this is allowable via the Storage::create_legacy codepath.
This also messes with device caching, because the initial cache is obtained from the Storage, which may have a 'default' device.
Instead, we materialize a device by allocating 0 bytes via the allocator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18605
Differential Revision: D14680620
Pulled By: gchanan
fbshipit-source-id: 6d43383d836e90beaf12bfe37c3f0506843f5432
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19154
I recently saw some weird workflow errors due to an empty but set net_type. Maybe we should just fall back to the simple net in this case.
Reviewed By: dzhulgakov
Differential Revision: D14890072
fbshipit-source-id: 4e9edf8232298000713bebb0bfdec61e9c5df17d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19080
OSS: add a tiny unit test utility function to create tensors given shape and data outside of any workspace. I use it in an internal test
Reviewed By: dzhulgakov
Differential Revision: D14814194
fbshipit-source-id: 6d53b235d99a97da812215f5c7f11fecad363c8c
Summary:
Almost there, feel free to review.
These c10 operators are exported to the _caffe2 domain.
TODO:
- [x] let the onnx checker pass
- [x] test tensor list as argument
- [x] test caffe2 backend and converter
- [x] check the c10 schema can be exported to onnx
- [x] refactor the test case to share some code
- [x] fix the problem in ONNX_ATEN_FALLBACK
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18210
Reviewed By: zrphercule
Differential Revision: D14600916
Pulled By: houseroad
fbshipit-source-id: 2592a75f21098fb6ceb38c5d00ee40e9e01cd144
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18531
Currently we use C10_LOG_EVERY_MS to log the data type change, but it pollutes the logs of some services,
so we would like to change it to C10_LOG_FIRST_N to prevent that.
Reviewed By: dzhulgakov
Differential Revision: D14647704
fbshipit-source-id: b84e4002bd4aa94d616133cd1049c3d4ab05386e