Commit Graph

1294 Commits

Hao Lu
51bf7bed84 [caffe2] Allow memonger to optimize nets with inplace(enforced) ops (#46560)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46560

Follow-up for D24236604 (16c52d918b).

For nets that pass the schema check, memonger makes sure to preserve the in-placeness of operators that are already in-place, so we can safely enable it for correct input nets.

(Note: this ignores all push blocking failures!)

Differential Revision: D24402482

fbshipit-source-id: a7e95cb0e3eb87adeac79b9b69eef207957b0bd5
2020-10-22 13:23:33 -07:00
Richard Barnes
c44300884e Clarify timing of GetDeviceProperty() (#46715)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46715

Test Plan: N/A

Reviewed By: ezyang

Differential Revision: D24455538

fbshipit-source-id: 1770807d178f618ef6338e28f669f09e4cbd2009
2020-10-22 11:29:31 -07:00
Tristan Rice
0c9787c758 caffe2: use at::mt19937 instead of std::mt19937 (10x speedup) (#43987)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43987

This replaces the caffe2 CPU random number generator (std::mt19937) with at::mt19937, the one currently used in pytorch. The ATen RNG is 10x faster than the std one and appears to be more robust, given bugs in the std implementation (https://fburl.com/diffusion/uhro7lqb)

For large embedding tables (10GB+) we see UniformFillOp taking upwards of 10 minutes, as we're bottlenecked on the single-threaded RNG. Swapping to at::mt19937 cuts that time to 10% of its current value.
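
As a rough standalone illustration of the bottleneck (not the PR's benchmark; the buffer size and harness below are made up), a single std::mt19937 uniform fill funnels every element through one generator:

```
// Minimal sketch of the single-threaded RNG bottleneck that UniformFillOp
// hits on large tensors; numbers here are illustrative only.
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

int main() {
  std::vector<float> buf(100000000);  // ~400MB stand-in for an embedding table
  std::mt19937 gen(42);
  std::uniform_real_distribution<float> dist(0.0f, 1.0f);

  auto start = std::chrono::steady_clock::now();
  for (auto& v : buf) {
    v = dist(gen);  // every element goes through the one generator
  }
  auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                std::chrono::steady_clock::now() - start)
                .count();
  std::printf("filled %zu floats in %lld ms\n", buf.size(), (long long)ms);
  return 0;
}
```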

Test Plan: Ran all relevant tests + CI. This doesn't introduce new features (and is a core change), so existing tests + CI should be sufficient to catch regressions.

Reviewed By: dzhulgakov

Differential Revision: D23219710

fbshipit-source-id: bd16ed6415b2933e047bcb283a013d47fb395814
2020-10-16 16:08:35 -07:00
Tristan Rice
dd169ca17c caffe2/plan_executor: propagate exceptions from reporter substeps (#46424)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46424

Currently, if an exception occurs in a reporter thread, the process is killed via std::terminate. This adds support for handling the reporter exception if FLAGS_caffe2_handle_executor_threads_exceptions is set to true.
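
A hedged standalone sketch of the propagation mechanism (the flag is real per the summary; the code below is illustrative, not the plan executor's): capture the exception in the worker thread and rethrow it on the joining thread instead of letting it escape and trigger std::terminate.

```
// Illustrative only: propagate a worker-thread exception to the caller.
#include <exception>
#include <stdexcept>
#include <thread>

int main() {
  std::exception_ptr eptr;
  std::thread reporter([&eptr] {
    try {
      throw std::runtime_error("reporter substep failed");
    } catch (...) {
      eptr = std::current_exception();  // capture instead of escaping
    }
  });
  reporter.join();
  if (eptr) {
    std::rethrow_exception(eptr);  // surfaces on the caller's thread
  }
  return 0;
}
```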

Test Plan: buck test mode/opt -c python.package_style=inplace //caffe2/caffe2/python:hypothesis_test //caffe2/caffe2:caffe2_test_cpu -- --stress-runs 100

Reviewed By: dahsh

Differential Revision: D24345027

fbshipit-source-id: 0659495c9e27680ebae41fe5a3cf26ce2f455cb3
2020-10-16 12:28:57 -07:00
Hao Lu
16c52d918b [caffe2] Bypass memonger for in-place ops (#46378)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46378

Reviewed By: dzhulgakov

Differential Revision: D24236604

fbshipit-source-id: 9f599687467ea969e89243482f8e2a41f7db0a23
2020-10-15 16:03:52 -07:00
Danny Huang
85c3ba5588 [caffe2] add PlanExecutorTest ErrorPlanWithCancellableStuckNet (#46110)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46110

## Motivation
* `Cancel` is now added to `OperatorBase` and `NetBase` (https://github.com/pytorch/pytorch/pull/44145).
* We need a test to cover and demonstrate that we can cancel a stuck net and propagate the error with the plan executor.

## Summary
* Added PlanExecutorTest `ErrorPlanWithCancellableStuckNet` for plan executor.
* Set cancelCount to zero at the beginning of tests to avoid global state being carried over in some test environments.

Test Plan:
## Unit Test Added

```
buck test caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest
buck test caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest --stress-runs 1000
```

Reviewed By: d4l3k

Differential Revision: D24226577

fbshipit-source-id: c834383bfe6ab50747975c229eb42a363eed3458
2020-10-12 12:00:15 -07:00
Danny Huang
87226f72d2 [caffe2] temp remove ErrorPlanWithCancellableStuckNet (#46080)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46080

Temporary removal of ErrorPlanWithCancellableStuckNet; will fill it out more later.

Test Plan:
```
buck test caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest
```
remove a test

Reviewed By: fegin

Differential Revision: D24213971

fbshipit-source-id: e6e600bad00b45c726311193b4b3238f1700526e
2020-10-08 23:35:45 -07:00
Danny Huang
487624e369 [caffe2] plan executor error propagation test with blocking cancellable op (#45319)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45319

## Motivation
* `Cancel` is now added to `OperatorBase` and `NetBase` (https://github.com/pytorch/pytorch/pull/44145)
* We need a test to cover and demonstrate that we can cancel a stuck net and propagate the error with the plan executor.

## Summary
* Added `ErrorPlanWithCancellableStuckNet` for the plan executor.
* We set up a plan with two nets: a stuck net with a blocking operator that never
  returns, and an error net with an op that throws; we tested that the plan throws and cancels.

Test Plan:
## Unit Test added
```
buck test caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest
buck test caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest --stress-runs 100
```
```
Summary
  Pass: 400
  ListingSuccess: 2
```

Reviewed By: d4l3k

Differential Revision: D23920548

fbshipit-source-id: feff41f73698bd6ea9b744f920e0fece4ee44438
2020-10-08 19:54:49 -07:00
Tristan Rice
59e4803b94 Recommit: caffe2/plan_executor: wait for 1 minute after exception and then abort (#45981)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45981

This is a recommit of previously reverted D20850851 (3fbddb92b1).

TL;DR - combining condition_variables and atomics is a bad idea

https://stackoverflow.com/questions/49622713/c17-atomics-and-condition-variable-deadlock

This also adds some ifdefs to disable the death test for mobile, xplat and tsan builds since forking doesn't play nicely with them.
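
A minimal standalone sketch of the hazard (assuming the classic lost-wakeup scenario from the linked answer; this is not the executor's code): if the writer flips the atomic and notifies without holding the mutex, the notification can fire after the waiter has checked the predicate but before it has gone to sleep, and is lost.

```
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

std::mutex mu;
std::condition_variable cv;
std::atomic<bool> done{false};

int main() {
  std::thread waiter([] {
    std::unique_lock<std::mutex> lock(mu);
    // wait() re-checks the predicate under the mutex, but its unlock-and-sleep
    // is only atomic against threads that lock the mutex before notifying.
    cv.wait(lock, [] { return done.load(); });
  });

  std::this_thread::sleep_for(std::chrono::milliseconds(10));

  // Buggy pattern: done.store(true); cv.notify_one();  // no mutex held; the
  // notify can race into the gap between predicate check and sleep.
  {
    std::lock_guard<std::mutex> lock(mu);  // safe: store under the mutex
    done.store(true);
  }
  cv.notify_one();

  waiter.join();
  return 0;
}
```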

Test Plan:
buck test mode/opt //caffe2/caffe2/python:hypothesis_test -- --stress-runs 1000 test_atomic_iter_with_concurrent_steps --timeout 120
  buck test mode/opt //caffe2/caffe2/python:hypothesis_test -- --stress-runs 100
  buck test mode/opt caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest --stress-runs 100

no timeouts https://www.internalfb.com/intern/testinfra/testconsole/testrun/7036874440059883/

will ensure no timeouts in OSS

Reviewed By: walterddr, dahsh

Differential Revision: D24165505

fbshipit-source-id: 17cd23bfbcd9c2826a4067a387023d5186353196
2020-10-08 14:17:30 -07:00
Rong Rong
1bb2d41b68 Revert D20850851: caffe2/plan_executor: wait for 1 minute after exception and then abort
Test Plan: revert-hammer

Differential Revision:
D20850851 (3fbddb92b1)

Original commit changeset: 330503775d80

fbshipit-source-id: 612c6c3c4d5586bc8ad00a112cd00fc74fb44243
2020-10-07 09:04:24 -07:00
Tristan Rice
3fbddb92b1 caffe2/plan_executor: wait for 1 minute after exception and then abort (#45297)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45297

If we have two concurrent substeps and one of them throws an exception while the other is blocking, we currently hang. This waits up to one minute for the blocking substep to complete before terminating the process.
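
A standalone sketch of the "wait up to a deadline, then abort" shape (the one-minute grace period is from the summary; everything else is illustrative):

```
#include <chrono>
#include <condition_variable>
#include <cstdlib>
#include <mutex>
#include <thread>

std::mutex mu;
std::condition_variable cv;
bool finished = false;

int main() {
  std::thread blocking_substep([] {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    {
      std::lock_guard<std::mutex> lock(mu);
      finished = true;
    }
    cv.notify_one();
  });

  std::unique_lock<std::mutex> lock(mu);
  if (!cv.wait_for(lock, std::chrono::minutes(1), [] { return finished; })) {
    // The blocking substep never finished after the exception; terminating
    // beats hanging the process forever.
    std::abort();
  }
  lock.unlock();
  blocking_substep.join();
  return 0;
}
```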

Test Plan: buck test caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest --stress-runs 100

Reviewed By: dahsh

Differential Revision: D20850851

fbshipit-source-id: 330503775d8062a34645ba55fe38e6770de5e3c7
2020-10-06 12:59:09 -07:00
Sebastian Messmer
2ac7de7d53 Remove hacky_wrapper from BackendSelect kernels (#44062)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44062

Previously, BackendSelect kernels were still written in the legacy way, i.e. they took one TensorOptions argument instead of scattered dtype, layout, device, pin_memory, and they used hacky_wrapper to be callable. This caused a re-wrapping step: calling into a BackendSelect kernel required taking the individual scattered arguments, packing them into a TensorOptions, and the kernel itself then gathered them again for redispatch.

Now with this PR, BackendSelect kernels are written in the new way and no hacky_wrapper or rewrapping is needed for them.
ghstack-source-id: 112825789
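
A hedged mock of that round trip (the `TensorOptions` struct and function names below are illustrative stand-ins, not PyTorch's real internal types):

```
#include <cstdio>

struct TensorOptions { int dtype; int layout; int device; bool pin_memory; };

// New-style kernel: takes the scattered arguments directly, no repacking.
void backend_select_kernel(int dtype, int layout, int device, bool pin_memory) {
  std::printf("dispatch with dtype=%d device=%d\n", dtype, device);
}

// Legacy path: the caller's scattered args were packed into TensorOptions only
// for the kernel to gather them again for redispatch -- the step removed here.
void legacy_call(int dtype, int layout, int device, bool pin_memory) {
  TensorOptions opts{dtype, layout, device, pin_memory};       // pack
  backend_select_kernel(opts.dtype, opts.layout, opts.device,  // re-gather
                        opts.pin_memory);
}

int main() {
  legacy_call(/*dtype=*/6, /*layout=*/0, /*device=*/0, /*pin_memory=*/false);
  return 0;
}
```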

Test Plan:
vs master: https://www.internalfb.com/intern/fblearner/details/216117032/

vs previous diff: https://www.internalfb.com/intern/fblearner/details/216170194/

Reviewed By: ezyang

Differential Revision: D23484192

fbshipit-source-id: e8fb49c4692404b6b775d18548b990c4cdddbada
2020-09-25 09:04:03 -07:00
Bugra Akyildiz
27c7158166 Remove __future__ imports for legacy Python2 supports (#45033)
Summary:
There is a tool called `2to3` whose `future` fixer specifically removes these; the `caffe2` directory has the most redundant imports:

```
2to3 -f future -w caffe2
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45033

Reviewed By: seemethere

Differential Revision: D23808648

Pulled By: bugra

fbshipit-source-id: 38971900f0fe43ab44a9168e57f2307580d36a38
2020-09-23 17:57:02 -07:00
Nikita Shulga
2ae74c0632 Compile less legacy code when BUILD_CAFFE2 is set to False (take 2) (#44453)
Summary:
2nd attempt to land https://github.com/pytorch/pytorch/pull/44079

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44453

Reviewed By: walterddr, seemethere

Differential Revision: D23619528

Pulled By: malfet

fbshipit-source-id: c7c206ebd327dcf3994789bd47008b05ff862fe7
2020-09-11 16:27:47 -07:00
Danny Huang
2b8f0b2023 [caffe2] adds Cancel to OperatorBase and NetBase (#44145)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44145

## Motivation

* To be able to make C2 ops cancellable so we can safely exit.
* Some C2 operators are currently blocking and thus non-cancellable. If an error
  occurs, we need to be able to safely stop all net execution so we can throw
  the exception to the caller.

## Summary
*  Adds `NetBase::Cancel()`, which iterates over the entire list of
   operators and calls Cancel on each (see the sketch below).
* Cancel on all ops was added to Net since there's nothing Async-specific about it.
* `AsyncSchedulingNet` calls the parent Cancel.
* To preserve backwards compatibility, `AsyncSchedulingNet`'s Cancel still calls
   `CancelAndFinishAsyncTasks`.
* Adds `Cancel()` to `OperatorBase`.
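
A hypothetical sketch of the resulting structure (real caffe2 signatures differ; this only shows the iterate-and-cancel shape described above):

```
#include <memory>
#include <vector>

struct OperatorBase {
  virtual ~OperatorBase() = default;
  virtual void Cancel() {}  // default no-op; blocking ops override to unblock
};

struct NetBase {
  virtual ~NetBase() = default;
  virtual void Cancel() {
    for (auto& op : operators_) {
      op->Cancel();  // ask every op to stop so a stuck net can be torn down
    }
  }
  std::vector<std::unique_ptr<OperatorBase>> operators_;
};

struct AsyncSchedulingNet : NetBase {
  void Cancel() override {
    NetBase::Cancel();            // parent cancels all ops
    CancelAndFinishAsyncTasks();  // kept for backwards compatibility
  }
  void CancelAndFinishAsyncTasks() { /* ... */ }
};

int main() {
  AsyncSchedulingNet net;
  net.Cancel();
  return 0;
}
```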

Reviewed By: dzhulgakov

Differential Revision: D23279202

fbshipit-source-id: e1bb0ff04a4e1393f935dbcac7c78c0baf728550
2020-09-11 12:50:26 -07:00
Wanchao Liang
d07a36e0c1 Revert D23490149: [pytorch][PR] Compile less legacy code when BUILD_CAFFE2 is set to False
Test Plan: revert-hammer

Differential Revision:
D23490149 (15e99b6ff6)

Original commit changeset: a76382c30d83

fbshipit-source-id: 75057fa9af2c19eb976962552118bf0a99911b38
2020-09-04 22:59:39 -07:00
Nikita Shulga
15e99b6ff6 Compile less legacy code when BUILD_CAFFE2 is set to False (#44079)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44079

Reviewed By: walterddr

Differential Revision: D23490149

Pulled By: malfet

fbshipit-source-id: a76382c30d83127d180ec63ac15093a7297aae53
2020-09-04 20:04:21 -07:00
Jiakai Liu
3a0e35c9f2 [pytorch] deprecate static dispatch (#43564)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43564

Static dispatch was originally introduced for mobile selective build.

Since we have added selective build support for dynamic dispatch and
tested it in FB production for months, we can deprecate static dispatch
to reduce the complexity of the codebase.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23324452

Pulled By: ljk53

fbshipit-source-id: d2970257616a8c6337f90249076fca1ae93090c7
2020-08-27 14:52:48 -07:00
Natalia Gimelshein
d1d32003bb force pytorch tensors to be contiguous before calling c2 ops
Summary: Per title; this makes the c2 wrappers safer, as contiguity of torch inputs is not guaranteed.
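
A hedged sketch of the guard (at::Tensor::contiguous() is the real API; the wrapper function is illustrative):

```
#include <ATen/ATen.h>

// caffe2 kernels assume dense, contiguous memory, while torch inputs may be
// non-contiguous views; contiguous() materializes a dense copy only when
// needed and returns the tensor itself if it is already contiguous.
at::Tensor prepare_for_c2(const at::Tensor& t) {
  return t.contiguous();
}
```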

Test Plan: covered by existing tests

Reviewed By: dzhulgakov

Differential Revision: D23310137

fbshipit-source-id: 3fe12abc7e394b8762098d032200778018e5b591
2020-08-24 23:04:13 -07:00
Sean Lynch
f80b695a75 Properly format db.h and db.cc (#43027)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43027

Format db.h and db.cc using the default formatter.

This change was split off of D22705434.

Test Plan: Wait for sandcastle.

Reviewed By: rohithmenon, marksantaniello

Differential Revision: D23113765

fbshipit-source-id: 3f02d55bfb055bda0fcba5122336fa001562d42e
2020-08-24 18:29:45 -07:00
Tristan Rice
5e04bb2c1c caffe2: expose CPUContext RandSeed for backwards compatibility with external RNG (#43239)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43239

This is an incremental step as part of the process to migrate caffe2 random number generator off of std::mt19937 and to instead use at::mt19937+at::CPUGeneratorImpl. The ATen variants are much more performant (10x faster).

This adds a way to get the CPUContext RandSeed for tail use cases that require a std::mt19937 and can borrow the CPUContext one.

Test Plan: This isn't used anywhere within the caffe2 codebase. Compile should be sufficient.

Reviewed By: dzhulgakov

Differential Revision: D23203280

fbshipit-source-id: 595c1cb447290604ee3ef61d5b5fc079b61a4e14
2020-08-21 19:36:38 -07:00
Ehsan K. Ardestani
ecb9e790ed Remove excessive logging in plan_executor (#42888)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42888

as title

Test Plan: flow-cli test-locally dper.workflows.evaluation.eval_workflow --parameters-file /mnt/public/ehsanardestani/temp/quant_eval_inputs_all.json

Reviewed By: amylittleyang

Differential Revision: D23066529

fbshipit-source-id: f925afd1734e617e412b0f171e16c781d13272d9
2020-08-11 23:57:17 -07:00
Ehsan K. Ardestani
a5af2434fe NVMified NE Eval
Summary:
This diff NVMifies the NE Eval Flow.
- It defines a `LoadNVM` operator which either
  - receives a list of NVM blobs, or
  - extracts the blobs that can be NVMified from the model.
- It dumps the NVMified blobs into NVM and deallocates them from DRAM.
- It NVMifies the Eval net on the dper and C2 backends.

The specific NVM op for SLS is pushed in separate diffs.

Test Plan: flow-cli test-locally dper.workflows.evaluation.eval_workflow --parameters-file=/mnt/public/ehsaardestani/temp/small_model.json 2>&1 | tee log

Reviewed By: yinghai, amylittleyang

Differential Revision: D22469973

fbshipit-source-id: ed8379ad404e96d04ac05e580176d3aca984575b
2020-08-06 10:25:31 -07:00
Dmytro Dzhulgakov
06d978a9ad [c10/cuda] Reorganize device_count() and robustly surface ASAN warnings (#42249)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42249

Main change is to bring Caffe2's superior error messages for cuda initialization into c10 and use them in all code paths.

Basic logic:

| Case | Call to device_count() | init_cuda, e.g. allocating tensor |
| -- | -- | -- |
| all good | non-zero | just works |
| no gpus | 0, no warning | throw exception with good message |
| driver issues | 0, produce warning | throw exception with good message |
| out of memory with ASAN | 0, produce warning| throw exception with ASAN message |

Previously, the error thrown from init_cuda was very generic and the ASAN warning (if any) was buried in the logs.

Other cleanup changes:
* always cache device_count() in a static variable
* move all ASAN macros into c10

Test Plan:
Hard to unit test because of build modes. Verified manually that the behavior from the table above holds by running the following script in different modes (ASAN/no-ASAN, CUDA_VISIBLE_DEVICES=):

```
print('before import')
import torch
print('after import')
print('devices: ', torch.cuda.device_count())
x = torch.tensor([1,2,3])
print('tensor creation')
x = x.cuda()
print('moved to cuda')
```

Reviewed By: ngimel

Differential Revision: D22824329

fbshipit-source-id: 5314007313a3897fc955b02f8b21b661ae35fdf5
2020-08-05 11:39:31 -07:00
Rohith Menon
4e16be9073 [MemLeak] Fix memory leak from releasing unique ptr (#41883)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41883

Fix memory leak from releasing unique ptr

Test Plan:
Tested serialization with and without the change.

Heap profile without change:
```
Welcome to jeprof!  For help, type 'help'.
(jeprof) top
Total: 7298.4 MB
  4025.2  55.2%  55.2%   4025.2  55.2% c10::alloc_cpu (inline)
  3195.3  43.8%  98.9%   3195.3  43.8% caffe2::SerializeUsingBytesOrInt32
    63.6   0.9%  99.8%     63.6   0.9% __gnu_cxx::new_allocator::allocate (inline)
     5.0   0.1%  99.9%      5.0   0.1% google::protobuf::RepeatedField::Reserve
     2.5   0.0%  99.9%      2.5   0.0% folly::aligned_malloc (inline)
     1.2   0.0%  99.9%      1.2   0.0% caffe2::detail::CopyFromProtoWithCast (inline)
     1.0   0.0%  99.9%      1.0   0.0% __new_exitfn
     1.0   0.0% 100.0%      1.0   0.0% std::_Function_base::_Base_manager::_M_init_functor (inline)
     0.5   0.0% 100.0%      0.5   0.0% folly::HHWheelTimerBase::newTimer (inline)
     0.5   0.0% 100.0%      0.5   0.0% std::__detail::_Hashtable_alloc::_M_allocate_node
```

Heap profile with change:
```
Welcome to jeprof!  For help, type 'help'.
(jeprof) top
Total: 6689.2 MB
  4025.2  60.2%  60.2%   4025.2  60.2% c10::alloc_cpu (inline)
  2560.0  38.3%  98.4%   2560.0  38.3% caffe2::::HugePagesArena::alloc_huge (inline)
    90.9   1.4%  99.8%     90.9   1.4% __gnu_cxx::new_allocator::allocate (inline)
     5.0   0.1%  99.9%      5.0   0.1% google::protobuf::RepeatedField::Reserve
     2.0   0.0%  99.9%      2.0   0.0% prof_backtrace_impl (inline)
     1.0   0.0%  99.9%     20.3   0.3% std::__cxx11::basic_string::_M_construct (inline)
     1.0   0.0%  99.9%      1.0   0.0% std::_Function_base::_Base_manager::_M_init_functor (inline)
     0.5   0.0%  99.9%      0.5   0.0% folly::UnboundedQueue::allocNextSegment (inline)
     0.5   0.0% 100.0%      0.5   0.0% folly::aligned_malloc (inline)
     0.5   0.0% 100.0%      0.5   0.0% __new_exitfn
```

Reviewed By: yinghai

Differential Revision: D22662093

fbshipit-source-id: d0b8ff1ed26c72b14bb02fb1146c51ef11a7e519
2020-07-22 16:54:19 -07:00
Stanislau Hlebik
b774ce54f8 remediation of S205607
fbshipit-source-id: 798decc90db4f13770e97cdce3c0df7d5421b2a3
2020-07-17 17:19:47 -07:00
Stanislau Hlebik
8fdea489af remediation of S205607
fbshipit-source-id: 5113fe0c527595e4227ff827253b7414abbdf7ac
2020-07-17 17:17:03 -07:00
Lu Fang
b2e52186b9 Rename capacity to nbytes in ShareExternalPointer to avoid confusion in future (#41461)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41461

capacity is misleading, and we have many wrong uses internally. Let's rename it to nbytes to avoid confusion in the future. Ultimately, we could remove this parameter entirely; so far I haven't seen any case where this capacity is necessary.

Test Plan: oss ci

Differential Revision: D22544189

fbshipit-source-id: f310627f2ab8f4ebb294e0dd5eabc380926991eb
2020-07-15 22:04:18 -07:00
Linbin Yu
df1f8a48d8 add null check for c2 tensor conversion (#41096)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41096

The spark spot model had some issues in tensor conversion (see P134598596). They happen when we convert an undefined c10 tensor to a caffe2 tensor.
This diff adds a null check.
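
A hedged sketch of the guard (TORCH_CHECK and Tensor::defined() are real APIs; the wrapper function is illustrative):

```
#include <ATen/ATen.h>
#include <c10/util/Exception.h>

// An undefined at::Tensor has no TensorImpl behind it, so touching its sizes
// or storage during conversion would dereference null; check defined() first.
void check_convertible_to_caffe2(const at::Tensor& t) {
  TORCH_CHECK(t.defined(), "cannot convert an undefined tensor to caffe2");
}
```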

Test Plan: spark spot model runs without problem

Reviewed By: smessmer

Differential Revision: D22330705

fbshipit-source-id: dfe0f29a48019b6611cad3fd8f2ae49e8db5427e
2020-07-09 11:44:23 -07:00
Nikita Shulga
d1352192e2 Move OperatorBase::AddRelatedBlobInfo implementation to .cc file (#40844)
Summary:
If a virtual function is implemented in a header file, its implementation will be included as a weak symbol in every shared library that includes that header, along with all of its dependencies.

This was one of the reasons the size of libcaffe2_module_test_dynamic.so was 500Kb (the AddRelatedBlobInfo implementation pulled a quarter of libprotobuf.a in with it).

The combination of this and https://github.com/pytorch/pytorch/issues/40845 reduces the size of `libcaffe2_module_test_dynamic.so` from 500Kb to 50Kb.
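
An illustrative header/.cc split (signatures simplified, not the real OperatorBase):

```
// operator_base.h -- declaration only. Defining a virtual member here would
// emit it as a weak symbol, plus everything it pulls in, into every shared
// library that includes this header.
struct OperatorBase {
  virtual ~OperatorBase();
  virtual void AddRelatedBlobInfo(const char* blob_name);
};

// operator_base.cc -- single strong definition; its dependencies (in the real
// code, a quarter of libprotobuf.a) get linked exactly once.
OperatorBase::~OperatorBase() = default;
void OperatorBase::AddRelatedBlobInfo(const char* blob_name) { /* ... */ }
```
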
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40844

Differential Revision: D22334725

Pulled By: malfet

fbshipit-source-id: 836a4cbb9f344355ddd2512667e77472546616c0
2020-07-01 11:48:15 -07:00
Nikita Shulga
cbdf399fc6 Move OperatorSchema default inference function implementations to .cc… (#40845)
Summary:
Move OperatorSchema default inference function implementations to the .cc file.

This prevents the implementations of those functions (as lambdas) from being embedded as weak symbols into every shared library that includes this header.

The combination of this and https://github.com/pytorch/pytorch/pull/40844 reduces the size of `libcaffe2_module_test_dynamic.so` from 500Kb to 50Kb.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40845

Differential Revision: D22334779

Pulled By: malfet

fbshipit-source-id: 64706918fc2947350a58c0877f294b1b8b085455
2020-07-01 11:42:52 -07:00
Sean Lynch
64689c2474 Remove unnecessary copy within blob serialization (#40096)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40096

Declaring `tensor_proto` to be of type `auto` means that it will copy the entire `TensorProto` instead of just keeping a reference. This changes it to use a const reference instead.
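
The pattern in a standalone sketch (the `TensorProto` stand-in and loop below are illustrative, not the serialization code itself):

```
#include <string>
#include <vector>

struct TensorProto { std::vector<std::string> payload; };  // stand-in type

void deserialize(const std::vector<TensorProto>& protos) {
  for (const auto& tensor_proto : protos) {
    // `const auto&` binds a reference; plain `auto` would copy the whole
    // TensorProto (and its payload) on every iteration.
    (void)tensor_proto;
  }
}

int main() {
  deserialize({});
  return 0;
}
```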

Test Plan:
Using the model loader benchmark to measure model loading performance:

### `tensor_proto` is of type `const auto&`
```
============================================================================
caffe2/caffe2/fb/predictor/ModelLoaderBenchmark.cpprelative  time/iter  iters/s
============================================================================
BlobProtoInt32DeserializationFloat16                        11.08ms    90.27
BlobProtoByteDeserializationFloat16             1509.73%   733.73us    1.36K
----------------------------------------------------------------------------
BlobProtoInt32DeserializationUInt8                          10.48ms    95.45
BlobProtoByteDeserializationUInt8               2974.57%   352.22us    2.84K
============================================================================
```

### `tensor_proto` is of type `auto`
```
============================================================================
caffe2/caffe2/fb/predictor/ModelLoaderBenchmark.cpprelative  time/iter  iters/s
============================================================================
BlobProtoInt32DeserializationFloat16                        13.84ms    72.26
BlobProtoByteDeserializationFloat16              658.85%     2.10ms   476.08
----------------------------------------------------------------------------
BlobProtoInt32DeserializationUInt8                          17.09ms    58.51
BlobProtoByteDeserializationUInt8               3365.98%   507.80us    1.97K
============================================================================
```

Reviewed By: marksantaniello

Differential Revision: D21959644

fbshipit-source-id: 6bc2dfbde306f88bf7cd4f9b14b95ac69c2e1b4d
2020-06-16 14:45:59 -07:00
Ilia Cherniavskii
01986e9890 Wait for all op types in SimpleNet (#39493)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39493

Make sure we wait for all types, incl. async cpu ops

Test Plan: CI

Reviewed By: kennyhorror

Differential Revision: D21873540

fbshipit-source-id: 37875cade68e1b3323086833f8d4db79362a68e8
2020-06-11 13:00:34 -07:00
Dmytro Dzhulgakov
e46060701d [caffe2] Fix of initializing ATen's CUDA before using caching allocator (#39759)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39759

Caffe2 has a mode where it uses PT's caching allocator. Somehow we were not calling the initialization explicitly.

Now, I have no idea why it worked before. Probably worth running a bisect separately.

Reviewed By: houseroad

Differential Revision: D21962331

fbshipit-source-id: f16ad6b27a67dbe0bda93939cca8c94620d22a09
2020-06-09 17:25:42 -07:00
Natalia Gimelshein
9c19a12965 fix asserts in cuda code (#39047)
Summary:
- Gets rid of some in-kernel asserts where they can be replaced with static_asserts.
- Replaces a bare in-kernel `assert` in one case with `CUDA_KERNEL_ASSERT` where necessary.
- Replaces host-code `assert`s with `TORCH_INTERNAL_ASSERT`.

Another group of asserts is in the fractional max pooling kernels, which should be fixed regardless (https://github.com/pytorch/pytorch/issues/39044); the problems there are not just the asserts.
I've audited the remaining cases of in-kernel asserts; they are more like `TORCH_INTERNAL_ASSERT`, so they should not fire with invalid user data. I think it's ok to leave them as-is.
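
A hedged sketch of the host-side replacements (TORCH_INTERNAL_ASSERT is the real macro; the function and checks are illustrative):

```
#include <c10/util/Exception.h>
#include <cstdint>

template <typename scalar_t>
void host_side_check(int64_t numel) {
  // Compile-time property: a static_assert costs nothing at runtime.
  static_assert(sizeof(scalar_t) <= 8, "unexpected scalar width");
  // Host-code invariant: TORCH_INTERNAL_ASSERT survives release builds,
  // unlike a plain assert() compiled under -DNDEBUG.
  TORCH_INTERNAL_ASSERT(numel >= 0, "negative element count");
}
```
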
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39047

Differential Revision: D21750392

Pulled By: ngimel

fbshipit-source-id: e9417523a2c672284de3515933cb7ed166e56719
2020-05-28 15:51:38 -07:00
Natalia Gimelshein
ba14a701dc restore proper cuda assert behavior with DNDEBUG (#38943)
Summary:
Per title. https://github.com/pytorch/pytorch/issues/32719 essentially disabled asserts in cuda kernels in release builds. Asserts in cuda kernels are typically used to prevent invalid reads/writes, so without them invalid reads/writes become silent errors in most cases (sometimes they would still cause "illegal memory access" errors, but because of the caching allocator this usually won't happen).
We don't need two macros, CUDA_ALWAYS_ASSERT and CUDA_KERNEL_ASSERT, because all current asserts in cuda kernels are important to prevent illegal memory accesses and should never be disabled.
This PR removes the macro CUDA_ALWAYS_ASSERT and instead makes CUDA_KERNEL_ASSERT (which is commonly used in the kernels) an assertion in both release and debug builds.
Fixes https://github.com/pytorch/pytorch/issues/38771
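
An illustrative kernel (not from the PR) showing the pattern: without the assert, an out-of-range index is a silent out-of-bounds read; with CUDA_KERNEL_ASSERT it traps in both debug and release builds after this change.

```
#include <c10/macros/Macros.h>
#include <cstdint>

__global__ void gather_kernel(const int64_t* index, const float* src,
                              float* dst, int64_t n, int64_t src_size) {
  int64_t i = blockIdx.x * (int64_t)blockDim.x + threadIdx.x;
  if (i < n) {
    // Guards the read below; fires regardless of NDEBUG.
    CUDA_KERNEL_ASSERT(index[i] >= 0 && index[i] < src_size);
    dst[i] = src[index[i]];
  }
}
```
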
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38943

Differential Revision: D21723767

Pulled By: ngimel

fbshipit-source-id: d88d8aa1b047b476d5340e69311e65aff4da5074
2020-05-26 18:11:00 -07:00
Kurt Mohler
f9eb8824f1 Remove datatype from Storage and StorageImpl (#38870)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38870

* Removed dtype data member from StorageImpl
* Removed any methods or method arguments in Storage/StorageImpl that deal with dtypes
* Update all callers of the changed API

Part of issue https://github.com/pytorch/pytorch/issues/33950
Original PR: https://github.com/pytorch/pytorch/pull/38038

Reviewed By: albanD

Differential Revision: D21549645

Pulled By: ezyang

fbshipit-source-id: 4289b356c55ff6b9530376a79343b99b540ee3de
2020-05-21 15:26:08 -07:00
Ilia Cherniavskii
a94fb71b12 Memory profiling (#37775)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37775

Adding memory usage into profiler table output

Test Plan:
BUILD_BINARY=1 USE_BLAS=MKL USE_MKLDNN=0 USE_CUDA=0 python setup.py
develop install --cmake

```
import torch
import torchvision.models as models
model = models.resnet18()
inp = torch.randn(5, 3, 224, 224)

with torch.autograd.profiler.profile(profile_memory=True, record_shapes=True) as prof:
    model(inp)

print(prof.key_averages(group_by_input_shape=True).table(sort_by="cpu_memory_usage", row_limit=15))
```

```
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Name                         Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CPU Mem Total    Number of Calls  Input Shapes
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
resize_                      0.37%            577.936us        0.37%            577.936us        9.796us          339.03 Mb        59               [[0]]
empty                        0.69%            1.061ms          0.74%            1.139ms          5.556us          47.42 Mb         205              []
stride                       0.00%            0.853us          0.00%            0.853us          0.853us          19.53 Kb         1                [[5, 1000]]
empty_strided                0.01%            21.393us         0.02%            26.033us         5.207us          252 b            5                []
is_complex                   0.02%            37.425us         0.02%            37.425us         1.291us          208 b            29               [[]]
masked_select                0.04%            55.333us         0.06%            93.616us         46.808us         120 b            2                [[30], [30]]
conv2d                       0.01%            18.009us         9.62%            14.902ms         14.902ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
convolution                  0.01%            12.436us         9.61%            14.884ms         14.884ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
_convolution                 0.03%            52.381us         9.60%            14.871ms         14.871ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
size                         0.00%            5.429us          0.00%            5.429us          0.339us          0 b              16               [[5, 3, 224, 224]]
contiguous                   0.00%            1.934us          0.00%            1.934us          0.967us          0 b              2                [[5, 3, 224, 224]]
_convolution_nogroup         0.02%            27.505us         9.57%            14.814ms         14.814ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
_nnpack_available            0.02%            34.267us         0.02%            34.267us         1.713us          0 b              20               []
thnn_conv2d                  0.01%            13.274us         9.54%            14.771ms         14.771ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
thnn_conv2d_forward          5.98%            9.264ms          19.02%           29.446ms         14.723ms         0 b              2                [[5, 3, 224, 224], [64, 3, 7, 7], [
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Self CPU time total: 154.855ms
```

Reviewed By: ngimel

Differential Revision: D21384248

Pulled By: ilia-cher

fbshipit-source-id: 31359cce2aa06f6255ed1ad8c60d03cb640bfec3
2020-05-19 15:48:48 -07:00
Xiang Gao
5e2d8745c8 RIP CUDA <9.2: circleci, aten, and caffe2 (#36846)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36846

Test Plan: Imported from OSS

Differential Revision: D21620850

Pulled By: ngimel

fbshipit-source-id: 7ad1676a12f86250f301095ffc6f365a3b370f34
2020-05-18 13:41:05 -07:00
Allan Di Wu
d35ab0b7ae Fix CUDA memory management issues caused by not using PinnedCPUAllocator (#38066)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38066

Increasing priority for PinnedCPUAllocator to make sure it is set when CUDA is enabled.

Test Plan: buck test mode/dev-nosan //vision/fair/detectron2/tests:test_export_caffe2 -- 'testMaskRCNNGPU \(test_export_caffe2\.TestCaffe2Export\)'

Reviewed By: ppwwyyxx

Differential Revision: D21465835

fbshipit-source-id: 643cff30d35c174085e5fde5197ddb05885b2e99
2020-05-07 21:52:00 -07:00
Ansha Yu
32329c3338 [nomni] fix outputs check to replaceSubgraph (#38005)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38005

D21445887 runs into a dbgo build crash on this stack P130135519

This is because the assertion `sg_inputs_copy.size() == 0` is too restrictive:
nn::getOutputs(sg) returns "output" nodes, which can include any inputs
that have additional consumers outside the subgraph itself.
To fix this, I propose removing inputs from the output check.

Test Plan:
Run tests

Sanity canaries:
https://our.intern.facebook.com/intern/ads/canary/426498931666198610/
https://our.intern.facebook.com/intern/ads/canary/426498935267166205/

Reviewed By: bwasti

Differential Revision: D21445881

fbshipit-source-id: 419a4b1a230f0370619cea574403bfa114e56a7c
2020-05-07 19:58:15 -07:00
Edward Yang
fe88806784 Back out "Revert D21171334: [pytorch][PR] Change StorageImpl to track byte count rather than element count" (#37893)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37893

Original commit changeset: 50746043acf3

Test Plan: sandcastle and ossci

Reviewed By: malfet, seemethere, ngimel

Differential Revision: D21416509

fbshipit-source-id: 735ec4e61f9d36d4537f52dd2dc6267751aeb94b
2020-05-05 22:43:15 -07:00
Nikita Shulga
9f060d3873 [Caffe2] Increase timing threshold to 50 ms on Windows (#37892)
Summary:
Helps prevent accidental failures like the following:
```
..\caffe2\core\parallel_net_test.cc:303
The difference between ms and 350 is 41, which exceeds kTimeThreshold, where
ms evaluates to 391,
350 evaluates to 350, and
kTimeThreshold evaluates to 40.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37892

Differential Revision: D21417251

Pulled By: malfet

fbshipit-source-id: 300cff7042e466f014850cc7cc406c725d5d0c04
2020-05-05 19:45:36 -07:00
Edward Yang
a2fc7f787a Revert D21171334: [pytorch][PR] Change StorageImpl to track byte count rather than element count
Test Plan: revert-hammer

Differential Revision:
D21171334

Original commit changeset: 37329a379de9

fbshipit-source-id: 50746043acf3c76754688de0fe6f1cc12437ea2f
2020-05-05 16:36:15 -07:00
Kurt Mohler
3706803b60 Change StorageImpl to track byte count rather than element count (#37776)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37776

* Remove type-specific size tracking in favor of byte size tracking in Storage and StorageImpl
* Changed numel() and set_numel() to nbytes() and set_nbytes()
* Added enum argument to Storage/StorageImpl constructor to indicate new meaning of the size parameter
* Update all callers of the changed API

Part of issue https://github.com/pytorch/pytorch/issues/33950
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37028

Differential Revision: D21171334

Pulled By: ezyang

fbshipit-source-id: 37329a379de9a3a83cc5e9007e455a3e1c2d10b8
2020-05-05 14:20:51 -07:00
Edward Yang
a058e938f9 Refactor error msg stack handling, add TORCH_RETHROW (#37101)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37101

Fixes #36954.

The basic concept is to streamline the process of rethrowing
c10::Error with extra error information.  This is in a few
steps:

- I completely remodeled the Error data type and the internal
  invariants.  Instead of manually adding in newlines, the
  message stack formatting process is responsible for inserting
  newlines and spacing as necessary.  Call sites are then
  modified to respect the new API model.
- TORCH_RETHROW macro is added, which adds context to an error
  message and then rethrows it.

New internal assert failure looks like:

```
0 INTERNAL ASSERT FAILED at ../c10/test/util/exception_test.cpp:64, please report a bug to PyTorch.
Exception raised from TestBody at ../c10/test/util/exception_test.cpp:64 (most recent call first):
frame #0: <unknown function> + 0x6aab9 (0x7ff611d3aab9 in /data/users/ezyang/pytorch-tmp/build/lib/libc10.so)
frame #1: ...
```

Error message with context looks like:

```
This is an error
  This is context 1
  This is context 2
```
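
A hedged usage sketch of the macro (the call site and context strings are illustrative):

```
#include <c10/util/Exception.h>

void process_batch(int batch_idx);  // assumed to possibly throw c10::Error

void run(int num_batches) {
  for (int i = 0; i < num_batches; ++i) {
    try {
      process_batch(i);
    } catch (c10::Error& e) {
      // Appends a context line to the message stack and rethrows the same
      // exception, producing the nested layout shown above.
      TORCH_RETHROW(e, "while processing batch ", i);
    }
  }
}
```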

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D21202891

Pulled By: ezyang

fbshipit-source-id: 361cadd16bc52e5886dba08e79277771ada76169
2020-05-04 11:56:45 -07:00
Edward Yang
efd8f70cac Make msg() and msg_with_backtrace() private (#37094)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37094

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D21202892

Pulled By: ezyang

fbshipit-source-id: d59e6bffabd90cc734056bdce2cd1fe63262fab8
2020-05-04 11:54:34 -07:00
cyy
2658bae570 use std::move (#34365)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34365

Differential Revision: D21349942

Pulled By: mrshenli

fbshipit-source-id: 4deb51cbb557501b43990ec7080c71a839cb5db9
2020-05-01 13:42:23 -07:00
Sebastian Messmer
4e976b9334 Remove callBoxedWorkaround (#36850)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36850

Since all unboxing now happens after dispatch, meaning all c10 ops support unboxing, we can now use op.callBoxed() for all ops and no longer need callBoxedWorkaround (which went through the JIT registry).
ghstack-source-id: 102879558

Test Plan: waitforsandcastle

Differential Revision: D21102375

fbshipit-source-id: d1e041116563a9650d5a86b07eb96d217d8756f3
2020-04-24 23:13:31 -07:00
Nikita Shulga
e7a72bb0c6 Add nomnigraph include folder to Caffe2_GPU_INCLUDE (#37056)
Summary:
Because `caffe2/contrib/tensorrt` includes nomnigraph headers
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37056

Test Plan: `cmake ../pytorch -DPYTHON_EXECUTABLE=/usr/bin/python3.7 -DCMAKE_BUILD_TYPE=RELWITHDEBINFO -DUSE_CUDA=YES -DBUILD_TEST=YES -DUSE_TENSORRT=YES -DTENSORRT_ROOT=$HOME/Downloads/TensorRT-7.0.0.11 -DCMAKE_CXX_COMPILER=/usr/bin/cuda-g++ -DCMAKE_C_COMPILER=/usr/bin/cuda-gcc -DUSE_MKLDNN=ON -G Ninja; ninja torch_cuda`

Differential Revision: D21178927

Pulled By: malfet

fbshipit-source-id: e1bed94fdb395ebfd6eb5d950ca378da77592531
2020-04-22 09:44:13 -07:00