Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46560
Follow-up for D24236604 (16c52d918b).
For nets that pass the schema check, memonger preserves the in-placeness of operators that are already in-place, so we can safely enable it for correct input nets.
(Note: this ignores all push blocking failures!)
Differential Revision: D24402482
fbshipit-source-id: a7e95cb0e3eb87adeac79b9b69eef207957b0bd5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43987
This replaces the caffe2 CPU random number generator (std::mt19937) with at::mt19937, the one currently used in pytorch. The ATen RNG is 10x faster than the std one and appears to be more robust given bugs in the std implementation (https://fburl.com/diffusion/uhro7lqb).
For large embedding tables (10GB+) we see UniformFillOp taking upwards of 10 minutes because we're bottlenecked on the single-threaded RNG. Swapping to at::mt19937 cuts that time to about 10% of the current value.
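As a rough illustration (not the actual operator code), a fill loop using at::mt19937 could look like the sketch below; it assumes only the engine's seed constructor and 32-bit operator() from ATen/core/MT19937RNGEngine.h, and hand-rolls the scaling that the real op delegates to ATen's distribution helpers.
```cpp
#include <ATen/core/MT19937RNGEngine.h>
#include <cstdint>
#include <vector>

void uniform_fill(std::vector<float>& data, float lo, float hi, uint64_t seed) {
  at::mt19937 gen(seed);
  for (auto& v : data) {
    // Map a raw 32-bit draw into [lo, hi).
    const float u = static_cast<float>(gen()) / static_cast<float>(UINT32_MAX);
    v = lo + u * (hi - lo);
  }
}
```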
Test Plan: Ran all relevant tests + CI. This doesn't introduce new features (+ is a core change) so existing tests+CI should be sufficient to catch regressions.
Reviewed By: dzhulgakov
Differential Revision: D23219710
fbshipit-source-id: bd16ed6415b2933e047bcb283a013d47fb395814
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46424
Currently, if an exception occurs in a reporter thread, the process is killed via std::terminate. This adds support for handling the reporter exception when FLAGS_caffe2_handle_executor_threads_exceptions is set to true.
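A minimal sketch of the handling (names are illustrative; only the flag name comes from this diff):
```cpp
#include <exception>

std::exception_ptr first_reporter_exception;  // surfaced to the caller later

void run_reporter(bool handle_exceptions /* FLAGS_caffe2_handle_executor_threads_exceptions */) {
  try {
    // ... run the reporter net periodically ...
  } catch (...) {
    if (!handle_exceptions) {
      throw;  // previous behavior: escapes the thread and std::terminate fires
    }
    // New behavior: remember the first exception and rethrow it from the
    // executor instead of killing the process.
    first_reporter_exception = std::current_exception();
  }
}
```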
Test Plan: buck test mode/opt -c python.package_style=inplace //caffe2/caffe2/python:hypothesis_test //caffe2/caffe2:caffe2_test_cpu -- --stress-runs 100
Reviewed By: dahsh
Differential Revision: D24345027
fbshipit-source-id: 0659495c9e27680ebae41fe5a3cf26ce2f455cb3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46110
## Motivation
* `Cancel` is now added to `OperatorBase` and `NetBase` (https://github.com/pytorch/pytorch/pull/44145).
* We need a test to cover and demonstrate that we can cancel a stuck net and propagate the error through the plan executor.
## Summary
* Added PlanExecutorTest `ErrorPlanWithCancellableStuckNet` for the plan executor.
* Set cancelCount to zero at the beginning of tests to avoid global state being carried over in some test environments.
Test Plan:
## Unit Test Added
```
buck test caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest
buck test caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest --stress-runs 1000
```
Reviewed By: d4l3k
Differential Revision: D24226577
fbshipit-source-id: c834383bfe6ab50747975c229eb42a363eed3458
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46080
Temporary removal of ErrorPlanWithCancellableStuckNet; it will be filled out more later.
Test Plan:
```
buck test caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest
```
remove a test
Reviewed By: fegin
Differential Revision: D24213971
fbshipit-source-id: e6e600bad00b45c726311193b4b3238f1700526e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45319
## Motivation
* `Cancel` is now added to `OperatorBase` and `NetBase` (https://github.com/pytorch/pytorch/pull/44145)
* We need a test to cover and demonstrate that we can cancel a stuck net and propagate the error through the plan executor.
## Summary
* Added `ErrorPlanWithCancellableStuckNet` for the plan executor.
* We set up a plan with two nets: a stuck net with a blocking operator that never returns, and an error net with an op that throws. The test verifies that the plan throws and the stuck net is cancelled; a minimal sketch of the test shape follows below.
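A minimal sketch of the test shape, assuming gtest; the helper and the blocking/throwing ops are illustrative, and the real test registers its own operators:
```cpp
TEST(PlanExecutorTest, ErrorPlanWithCancellableStuckNet) {
  // The plan contains two concurrent substeps:
  //   net_stuck: a blocking op that only returns once Cancel() is called
  //   net_error: an op that throws immediately
  PlanDef plan_def = MakeErrorPlanWithCancellableStuckNet();  // hypothetical helper
  Workspace ws;
  cancelCount = 0;  // reset global test state up front
  ASSERT_THROW(ws.RunPlan(plan_def), std::exception);
  // The stuck net's blocking op must have been cancelled rather than left hanging.
  ASSERT_EQ(cancelCount, 1);
}
```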
Test Plan:
## Unit Test added
```
buck test caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest
buck test caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest --stress-runs 100
```
```
Summary
Pass: 400
ListingSuccess: 2
```
Reviewed By: d4l3k
Differential Revision: D23920548
fbshipit-source-id: feff41f73698bd6ea9b744f920e0fece4ee44438
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45981
This is a recommit of previously reverted D20850851 (3fbddb92b1).
TL;DR - combining condition_variables and atomics is a bad idea
https://stackoverflow.com/questions/49622713/c17-atomics-and-condition-variable-deadlock
This also adds some ifdefs to disable the death test for mobile, xplat and tsan builds since forking doesn't play nicely with them.
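For reference, a sketch of the safe pattern (illustrative names): the flag is only written under the same mutex the waiter uses, so the notify cannot be lost between the predicate check and the wait.
```cpp
#include <condition_variable>
#include <mutex>

std::mutex mu;
std::condition_variable cv;
bool done = false;  // plain bool, always accessed under mu

void signal_done() {
  {
    std::lock_guard<std::mutex> lock(mu);  // write the flag under the lock
    done = true;
  }
  cv.notify_all();
}

void wait_done() {
  std::unique_lock<std::mutex> lock(mu);
  cv.wait(lock, [] { return done; });  // predicate checked under the same lock
}
```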
Test Plan:
buck test mode/opt //caffe2/caffe2/python:hypothesis_test -- --stress-runs 1000 test_atomic_iter_with_concurrent_steps --timeout 120
buck test mode/opt //caffe2/caffe2/python:hypothesis_test -- --stress-runs 100
buck test mode/opt caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest --stress-runs 100
no timeouts https://www.internalfb.com/intern/testinfra/testconsole/testrun/7036874440059883/
will ensure no timeouts in OSS
Reviewed By: walterddr, dahsh
Differential Revision: D24165505
fbshipit-source-id: 17cd23bfbcd9c2826a4067a387023d5186353196
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45297
If we have two concurrent substeps and one of them throws an exception while the other is blocking, we currently hang. This change waits up to 1 minute for the blocking substep to complete before terminating the process.
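Roughly, the waiting logic looks like this sketch (only the 1 minute grace period comes from this diff; the names are illustrative):
```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Returns true if the blocking substep finished within the grace period;
// otherwise the caller gives up and terminates the process.
bool wait_for_substep(std::condition_variable& cv, std::mutex& mu, bool& finished) {
  std::unique_lock<std::mutex> lock(mu);
  return cv.wait_for(lock, std::chrono::minutes(1), [&] { return finished; });
}
```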
Test Plan: buck test caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest --stress-runs 100
Reviewed By: dahsh
Differential Revision: D20850851
fbshipit-source-id: 330503775d8062a34645ba55fe38e6770de5e3c7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44062
Previously, BackendSelect kernels were still written in the legacy way, i.e. they took one TensorOptions argument instead of scattered dtype, layout, device, and pin_memory, and they used hacky_wrapper to be callable. This caused a re-wrapping step: calling into a BackendSelect kernel required taking the individual scattered arguments and packing them into a TensorOptions, and the kernel itself then gathered them again for redispatch.
Now with this PR, BackendSelect kernels are written in the new way and no hacky_wrapper or rewrapping is needed for them.
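As a rough illustration of the difference, the declarations below are sketches, not the generated code:
```cpp
#include <ATen/ATen.h>

// Legacy style: the kernel took a packed TensorOptions, so the scattered
// dtype/layout/device/pin_memory had to be gathered into one object first.
at::Tensor empty_backend_select_legacy(
    at::IntArrayRef size, const at::TensorOptions& options);

// New style: the kernel takes the scattered arguments directly, matching the
// dispatcher calling convention, so no pack/unpack round trip is needed.
at::Tensor empty_backend_select(
    at::IntArrayRef size,
    c10::optional<at::ScalarType> dtype,
    c10::optional<at::Layout> layout,
    c10::optional<at::Device> device,
    c10::optional<bool> pin_memory);
```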
ghstack-source-id: 112825789
Test Plan:
vs master: https://www.internalfb.com/intern/fblearner/details/216117032/
vs previous diff: https://www.internalfb.com/intern/fblearner/details/216170194/
Reviewed By: ezyang
Differential Revision: D23484192
fbshipit-source-id: e8fb49c4692404b6b775d18548b990c4cdddbada
Summary:
The `2to3` tool has a `future` fixer that specifically removes these redundant `__future__` imports; the `caffe2` directory has the most of them:
```2to3 -f future -w caffe2```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45033
Reviewed By: seemethere
Differential Revision: D23808648
Pulled By: bugra
fbshipit-source-id: 38971900f0fe43ab44a9168e57f2307580d36a38
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44145
## Motivation
* To be able to make C2 ops cancellable so we can safely exit.
* Some C2 operators are currently blocking and thus non-cancellable. If an error
occurs, we need to be able to safely stop all net execution so we can throw
the exception to the caller.
## Summary
* Adds `NetBase::Cancel()`, which iterates over the entire list of
operators and calls Cancel on each (see the sketch below).
* Cancel on all ops was added to Net since there's nothing Async specific about it.
* `AsyncSchedulingNet` calls the parent Cancel.
* To preserve backwards compatibility, `AsyncSchedulingNet`'s Cancel still calls
`CancelAndFinishAsyncTasks`.
* Adds `Cancel()` to `OperatorBase`.
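A sketch of the resulting behavior (member names are from the summary; treat the bodies as illustrative):
```cpp
void NetBase::Cancel() {
  // Iterate every operator in the net and ask it to cancel; blocking ops
  // override OperatorBase::Cancel() to unblock themselves.
  for (auto* op : GetOperators()) {
    op->Cancel();
  }
}

void AsyncSchedulingNet::Cancel() {
  NetBase::Cancel();             // cancel individual operators first
  CancelAndFinishAsyncTasks();   // kept for backwards compatibility
}
```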
Reviewed By: dzhulgakov
Differential Revision: D23279202
fbshipit-source-id: e1bb0ff04a4e1393f935dbcac7c78c0baf728550
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43564
Static dispatch was originally introduced for mobile selective build.
Since we have added selective build support for dynamic dispatch and
tested it in FB production for months, we can deprecate static dispatch
to reduce the complexity of the codebase.
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D23324452
Pulled By: ljk53
fbshipit-source-id: d2970257616a8c6337f90249076fca1ae93090c7
Summary: Per title; makes C2 wrappers safer, as the contiguity of torch inputs is not guaranteed (see the sketch below).
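A minimal sketch of the guard, assuming the wrapper receives an at::Tensor (illustrative; the real change lives inside the wrapper code):
```cpp
#include <ATen/ATen.h>

// Ensure the tensor handed to the caffe2 wrapper is contiguous;
// contiguous() returns the input unchanged when it is already contiguous.
at::Tensor prepare_for_c2(const at::Tensor& input) {
  return input.contiguous();
}
```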
Test Plan: covered by existing tests
Reviewed By: dzhulgakov
Differential Revision: D23310137
fbshipit-source-id: 3fe12abc7e394b8762098d032200778018e5b591
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43027
Format db.h and db.cc using the default formatter.
This change was split off of D22705434.
Test Plan: Wait for sandcastle.
Reviewed By: rohithmenon, marksantaniello
Differential Revision: D23113765
fbshipit-source-id: 3f02d55bfb055bda0fcba5122336fa001562d42e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43239
This is an incremental step as part of the process to migrate caffe2 random number generator off of std::mt19937 and to instead use at::mt19937+at::CPUGeneratorImpl. The ATen variants are much more performant (10x faster).
This adds a way to get the CPUContext RandSeed for tail use cases that require a std::mt19937 and currently borrow the CPUContext one; a usage sketch follows below.
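A sketch of the intended tail use; the accessor name is taken from the summary, and its exact signature is an assumption:
```cpp
#include <random>
#include "caffe2/core/context.h"

void tail_use_case(caffe2::CPUContext& context) {
  // Seed a local std::mt19937 from the context instead of borrowing
  // caffe2's internal engine directly.
  std::mt19937 gen(context.RandSeed());  // assumed accessor from this diff
  std::uniform_int_distribution<int> dist(0, 9);
  const int sample = dist(gen);
  (void)sample;
}
```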
Test Plan: This isn't used anywhere within the caffe2 codebase. Compile should be sufficient.
Reviewed By: dzhulgakov
Differential Revision: D23203280
fbshipit-source-id: 595c1cb447290604ee3ef61d5b5fc079b61a4e14
Summary:
This diff NVMifies the NE Eval Flow.
- It defines a `LoadNVM` operator which:
  - either receives a list of NVM blobs, or extracts the blobs that could be NVMified from the model,
  - dumps the NVMified blobs into NVM,
  - and deallocates them from DRAM.
- NVMifies the Eval net on the dper and C2 backends.
Specific NVMOp for SLS is pushed through different diffs.
Test Plan: flow-cli test-locally dper.workflows.evaluation.eval_workflow --parameters-file=/mnt/public/ehsaardestani/temp/small_model.json 2>&1 | tee log
Reviewed By: yinghai, amylittleyang
Differential Revision: D22469973
fbshipit-source-id: ed8379ad404e96d04ac05e580176d3aca984575b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42249
Main change is to bring Caffe2's superior error messages for cuda initialization into c10 and use them in all code paths.
Basic logic:
| Case | Call to device_count() | init_cuda, e.g. allocating tensor |
| -- | -- | -- |
| all good | non-zero | just works |
| no gpus | 0, no warning | throw exception with good message |
| driver issues | 0, produce warning | throw exception with good message |
| out of memory with ASAN | 0, produce warning| throw exception with ASAN message |
Previously, the error thrown from init_cuda was very generic and the ASAN warning (if any) was buried in the logs.
Other clean up changes:
* always cache device_count() in a static variable (sketch below)
* move all ASAN macros into c10
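A sketch of the caching (illustrative, not the actual c10 code):
```cpp
#include <cuda_runtime.h>

int cached_device_count() {
  // Query the driver only once; later calls reuse the cached value.
  static int count = [] {
    int c = 0;
    if (cudaGetDeviceCount(&c) != cudaSuccess) {
      // No GPUs / driver issues: report 0 here and let the first real CUDA
      // initialization throw the descriptive exception instead.
      c = 0;
    }
    return c;
  }();
  return count;
}
```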
Test Plan:
Hard to unittest because of build modes. Verified manually that the behavior from the table above holds by running the following script in different modes (ASAN/no-ASAN, CUDA_VISIBLE_DEVICES=):
```
print('before import')
import torch
print('after import')
print('devices: ', torch.cuda.device_count())
x = torch.tensor([1,2,3])
print('tensor creation')
x = x.cuda()
print('moved to cuda')
```
Reviewed By: ngimel
Differential Revision: D22824329
fbshipit-source-id: 5314007313a3897fc955b02f8b21b661ae35fdf5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41461
capacity is misleading, and we have many wrong uses internally. Let's rename it to nbytes to avoid the confusion in the future. Ultimately, we could remove this parameter if possible.
So far I haven't seen any case where this capacity is necessary.
Test Plan: oss ci
Differential Revision: D22544189
fbshipit-source-id: f310627f2ab8f4ebb294e0dd5eabc380926991eb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41096
The spark spot model had some issues in tensor conversion, see P134598596. It happens when we convert an undefined c10 tensor to a caffe2 tensor.
This diff adds a null check.
Test Plan: spark spot model runs without problem
Reviewed By: smessmer
Differential Revision: D22330705
fbshipit-source-id: dfe0f29a48019b6611cad3fd8f2ae49e8db5427e
Summary:
If a virtual function is implemented in a header file, its implementation will be included as a weak symbol in every shared library that includes this header, along with all of its dependencies.
This was one of the reasons the size of libcaffe2_module_test_dynamic.so was 500 KB (the AddRelatedBlobInfo implementation pulled a quarter of libprotobuf.a with it).
The combination of this and https://github.com/pytorch/pytorch/issues/40845 reduces the size of `libcaffe2_module_test_dynamic.so` from 500 KB to 50 KB.
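A sketch of the change (class and parameter are illustrative; AddRelatedBlobInfo is just the example mentioned above):
```cpp
// header (module.h): declaration only, no body in the header
class ModuleSchema {
 public:
  virtual ~ModuleSchema() = default;
  virtual void AddRelatedBlobInfo(int info);
};

// source (module.cc): the single, strong definition
void ModuleSchema::AddRelatedBlobInfo(int info) {
  // Implementation is emitted in exactly one translation unit,
  // instead of as a weak symbol in every library including the header.
  (void)info;
}
```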
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40844
Differential Revision: D22334725
Pulled By: malfet
fbshipit-source-id: 836a4cbb9f344355ddd2512667e77472546616c0
Summary:
… file
This prevents the implementations of those functions (as lambdas) from being embedded as weak symbols in every shared library that includes this header.
The combination of this and https://github.com/pytorch/pytorch/pull/40844 reduces the size of `libcaffe2_module_test_dynamic.so` from 500 KB to 50 KB.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40845
Differential Revision: D22334779
Pulled By: malfet
fbshipit-source-id: 64706918fc2947350a58c0877f294b1b8b085455
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40096
Declaring `tensor_proto` to be of type `auto` means that it will copy the entire `TensorProto` instead of just keeping a reference. This changes it to use a const reference instead.
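A sketch of the pitfall; the accessor below stands in for protobuf's const-reference getters and is not the exact call site:
```cpp
#include "caffe2/proto/caffe2_pb.h"

void deserialize(const caffe2::BlobProto& blob_proto) {
  // protobuf getters return a const reference; `auto` deep-copies the whole
  // TensorProto, while `const auto&` only binds to the existing message.
  const auto& tensor_proto = blob_proto.tensor();  // no copy
  // auto tensor_proto = blob_proto.tensor();      // would copy the message
  (void)tensor_proto;
}
```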
Test Plan:
Using the model loader benchmark to measure model loading performance:
### `tensor_proto` is of type `const auto&`
```
============================================================================
caffe2/caffe2/fb/predictor/ModelLoaderBenchmark.cpprelative time/iter iters/s
============================================================================
BlobProtoInt32DeserializationFloat16 11.08ms 90.27
BlobProtoByteDeserializationFloat16 1509.73% 733.73us 1.36K
----------------------------------------------------------------------------
BlobProtoInt32DeserializationUInt8 10.48ms 95.45
BlobProtoByteDeserializationUInt8 2974.57% 352.22us 2.84K
============================================================================
```
### `tensor_proto` is of type `auto`
```
============================================================================
caffe2/caffe2/fb/predictor/ModelLoaderBenchmark.cpprelative time/iter iters/s
============================================================================
BlobProtoInt32DeserializationFloat16 13.84ms 72.26
BlobProtoByteDeserializationFloat16 658.85% 2.10ms 476.08
----------------------------------------------------------------------------
BlobProtoInt32DeserializationUInt8 17.09ms 58.51
BlobProtoByteDeserializationUInt8 3365.98% 507.80us 1.97K
============================================================================
```
Reviewed By: marksantaniello
Differential Revision: D21959644
fbshipit-source-id: 6bc2dfbde306f88bf7cd4f9b14b95ac69c2e1b4d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39493
Make sure we wait for all types, incl. async cpu ops
Test Plan: CI
Reviewed By: kennyhorror
Differential Revision: D21873540
fbshipit-source-id: 37875cade68e1b3323086833f8d4db79362a68e8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39759
Caffe2 has a mode where it uses PT's caching allocator. Somehow we were not calling the initialization explicitly.
Now, I have no idea why it worked before. Probably worth running a bisect separately.
Reviewed By: houseroad
Differential Revision: D21962331
fbshipit-source-id: f16ad6b27a67dbe0bda93939cca8c94620d22a09
Summary:
Gets rid of some in-kernel asserts where they can be replaced with static_asserts.
Replaces a bare in-kernel `assert` with `CUDA_KERNEL_ASSERT` in the one case where a runtime check is necessary.
Replaces host-code `assert`s with `TORCH_INTERNAL_ASSERT`.
Another group of asserts is in the fractional max pooling kernels, which should be fixed regardless (https://github.com/pytorch/pytorch/issues/39044); the problems there are not just the asserts.
I've audited the remaining in-kernel asserts; they are more like `TORCH_INTERNAL_ASSERT`, so they should not fire on invalid user data. I think it's OK to leave them as is.
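For illustration, a sketch of the host-side substitutions (an assumed example, not code from this PR):
```cpp
#include <c10/util/Exception.h>
#include <cstdint>
#include <type_traits>

template <typename T>
void host_side_check(int64_t n) {
  // Compile-time property: no runtime assert needed at all.
  static_assert(std::is_floating_point<T>::value, "expected a floating point type");
  // Host-side runtime invariant: TORCH_INTERNAL_ASSERT survives release
  // builds and produces a useful message, unlike a bare assert().
  TORCH_INTERNAL_ASSERT(n >= 0, "negative element count: ", n);
}
```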
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39047
Differential Revision: D21750392
Pulled By: ngimel
fbshipit-source-id: e9417523a2c672284de3515933cb7ed166e56719
Summary:
Per title. https://github.com/pytorch/pytorch/issues/32719 essentially disabled asserts in cuda kernels in release builds. Asserts in cuda kernels are typically used to prevent invalid reads/writes, so without asserts invalid reads/writes become silent errors in most cases (sometimes they would still cause "illegal memory access" errors, but because of the caching allocator this usually won't happen).
We don't need two macros, CUDA_ALWAYS_ASSERT and CUDA_KERNEL_ASSERT, because all current asserts in cuda kernels are important to prevent illegal memory accesses, and they should never be disabled.
This PR removes the CUDA_ALWAYS_ASSERT macro and instead makes CUDA_KERNEL_ASSERT (which is commonly used in the kernels) an assertion in both release and debug builds.
Fixes https://github.com/pytorch/pytorch/issues/38771
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38943
Differential Revision: D21723767
Pulled By: ngimel
fbshipit-source-id: d88d8aa1b047b476d5340e69311e65aff4da5074
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38066
Increasing priority for PinnedCPUAllocator to make sure it is set when CUDA is enabled.
Test Plan: buck test mode/dev-nosan //vision/fair/detectron2/tests:test_export_caffe2 -- 'testMaskRCNNGPU \(test_export_caffe2\.TestCaffe2Export\)'
Reviewed By: ppwwyyxx
Differential Revision: D21465835
fbshipit-source-id: 643cff30d35c174085e5fde5197ddb05885b2e99
Summary:
Helps prevent accidental failures like the following:
```
..\caffe2\core\parallel_net_test.cc:303
The difference between ms and 350 is 41, which exceeds kTimeThreshold, where
ms evaluates to 391,
350 evaluates to 350, and
kTimeThreshold evaluates to 40.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37892
Differential Revision: D21417251
Pulled By: malfet
fbshipit-source-id: 300cff7042e466f014850cc7cc406c725d5d0c04
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37776
* Remove type-specific size tracking in favor of byte size tracking in Storage and StorageImpl
* Changed numel() and set_numel() to nbytes() and set_nbytes()
* Added enum argument to Storage/StorageImpl constructor to indicate new meaning of the size parameter
* Update all callers of the changed API
Part of issue https://github.com/pytorch/pytorch/issues/33950
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37028
Differential Revision: D21171334
Pulled By: ezyang
fbshipit-source-id: 37329a379de9a3a83cc5e9007e455a3e1c2d10b8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37101
Fixes #36954.
The basic concept is to streamline the process of rethrowing
c10::Error with extra error information. This is in a few
steps:
- I completely remodeled the Error data type and the internal
invariants. Instead of manually adding in newlines, the
message stack formatting process is responsible for inserting
newlines and spacing as necessary. Call sites are then
modified to respect the new API model.
- The TORCH_RETHROW macro is added, which adds context to an error
message and then rethrows it (usage sketch below).
New internal assert failure looks like:
```
0 INTERNAL ASSERT FAILED at ../c10/test/util/exception_test.cpp:64, please report a bug to PyTorch.
Exception raised from TestBody at ../c10/test/util/exception_test.cpp:64 (most recent call first):
frame #0: <unknown function> + 0x6aab9 (0x7ff611d3aab9 in /data/users/ezyang/pytorch-tmp/build/lib/libc10.so)
frame #1: ...
```
Error message with context looks like:
```
This is an error
This is context 1
This is context 2
```
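A hedged sketch of the usage pattern for TORCH_RETHROW; the call site below is illustrative and the failing check is just a placeholder:
```cpp
#include <string>
#include <c10/util/Exception.h>

void load_blob(const std::string& name) {
  try {
    // ... deserialize the blob ...
    TORCH_CHECK(false, "This is an error");  // placeholder failure
  } catch (c10::Error& e) {
    // Attach context and rethrow; the context lines stack under the
    // original message, as in the example output above.
    TORCH_RETHROW(e, "While loading blob '", name, "'");
  }
}
```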
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D21202891
Pulled By: ezyang
fbshipit-source-id: 361cadd16bc52e5886dba08e79277771ada76169
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37094
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D21202892
Pulled By: ezyang
fbshipit-source-id: d59e6bffabd90cc734056bdce2cd1fe63262fab8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36850
Since all unboxing now happens after dispatch, which means that all c10 ops support unboxing, we can now use op.callBoxed() for all ops and no longer need callBoxedWorkaround (which was going through the JIT registry).
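A minimal sketch of calling through the boxed convention (the caller is illustrative; the handle and stack types are the existing dispatcher API):
```cpp
#include <ATen/core/dispatch/Dispatcher.h>

void call_boxed_example(const c10::OperatorHandle& op, torch::jit::Stack* stack) {
  // Arguments are pushed onto *stack* by the caller; callBoxed runs the op
  // through the boxed calling convention and leaves results on the stack.
  op.callBoxed(stack);
}
```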
ghstack-source-id: 102879558
Test Plan: waitforsandcastle
Differential Revision: D21102375
fbshipit-source-id: d1e041116563a9650d5a86b07eb96d217d8756f3