Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45177
## Motivation
* Make C2 ops cancellable so we can exit safely.
* Some C2 operators are blocking and thus not cancellable. If an error
occurs, we need to be able to safely stop all net execution so we can
propagate the exception to the caller.
## Summary
* When an error occurs in a net or the net is cancelled, the `Cancel` method
is called on all running ops.
* This diff adds a `Cancel` method to `SafeEnqueueBlobsOp` and
`SafeDequeueBlobsOp` that calls `queue->close()`, forcing all blocking ops to
return.
* Adds a unit test that verifies the error propagation.
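The mechanism above relies on a blocked enqueue/dequeue waking up when the queue is closed. The following is a minimal sketch of that idea, not the real `BlobsQueue`. It assumes a condition-variable-based bounded queue. `ClosableQueue`, `SafeDequeueOpSketch` and `DemoCancelUnblocks` are hypothetical names.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical bounded blocking queue (not caffe2::BlobsQueue).
template <typename T>
class ClosableQueue {
 public:
  explicit ClosableQueue(std::size_t capacity) : capacity_(capacity) {}

  // Blocks while full; returns false if the queue has been closed.
  bool Push(T value) {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [&] { return closed_ || items_.size() < capacity_; });
    if (closed_) return false;
    items_.push_back(std::move(value));
    cv_.notify_all();
    return true;
  }

  // Blocks while empty; returns false once the queue is closed and drained.
  bool Pop(T* out) {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [&] { return closed_ || !items_.empty(); });
    if (items_.empty()) return false;
    *out = std::move(items_.front());
    items_.pop_front();
    cv_.notify_all();
    return true;
  }

  // Marks the queue closed and wakes every thread blocked in Push/Pop.
  void Close() {
    std::lock_guard<std::mutex> lock(mu_);
    closed_ = true;
    cv_.notify_all();
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<T> items_;
  std::size_t capacity_;
  bool closed_ = false;
};

// Hypothetical op: Cancel() closes the queue so a Run() stuck in Pop() returns.
class SafeDequeueOpSketch {
 public:
  explicit SafeDequeueOpSketch(ClosableQueue<int>* q) : queue_(q) {}
  bool Run(int* out) { return queue_->Pop(out); }
  void Cancel() { queue_->Close(); }

 private:
  ClosableQueue<int>* queue_;
};

// A consumer blocked on an empty queue returns once Cancel() runs.
bool DemoCancelUnblocks() {
  ClosableQueue<int> q(4);
  SafeDequeueOpSketch op(&q);
  bool got_value = true;
  std::thread consumer([&] { int v = 0; got_value = op.Run(&v); });
  op.Cancel();  // without this, join() would wait forever
  consumer.join();
  return got_value;  // false: Pop() observed a closed, empty queue
}
```

Closing the queue, rather than interrupting threads, means every waiter re-checks one shared flag. It works whether Cancel happens before or after the consumer starts waiting.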
Test Plan:
## Unit test added to verify that queue ops propagate errors
```
buck test caffe2/caffe2/python:hypothesis_test -- test_safe_dequeue_blob__raises_exception_when_hang --stress-runs 1000
```
```
Summary
Pass: 1000
ListingSuccess: 1
```
Reviewed By: d4l3k
Differential Revision: D23846967
fbshipit-source-id: c7ddd63259e033ed0bed9df8e1b315f87bf59394
Summary:
## Motivation
* Make C2 ops cancellable so we can exit safely.
* Some C2 operators are blocking and thus not cancellable. If an error
occurs, we need to be able to safely stop all net execution so we can
propagate the exception to the caller.
* When an error occurs in a net or the net is cancelled, the `Cancel` method
is called on all running ops.
* This diff adds a `Cancel` method to `SafeEnqueueBlobsOp` and
`SafeDequeueBlobsOp` that calls `queue->close()`, forcing all blocking ops to
return.
* Adds a unit test that verifies the error propagation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44495
Test Plan:
## Unit Test added to verify that queue ops propagate errors
```
buck test caffe2/caffe2/python:hypothesis_test
```
Reviewed By: dzhulgakov
Differential Revision: D23236088
Pulled By: dahsh
fbshipit-source-id: daa90d9ee32483fb51195e269a52cf5987bb0a5a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14269
Removes the reference to Context proper and instead adds a bool argument for async copy (same as `copy_`).
For `CopyFrom`, I haven't updated all call sites yet. Instead, I rely on a terrible hack: a pointer to a context is implicitly converted to bool when passed, haha :) It's not good code, and I propose fixing it in a follow-up diff (maybe using clangr tooling).
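The hack described above works because C++ applies an implicit pointer-to-bool conversion to function arguments. A non-null pointer becomes `true` and `nullptr` becomes `false`. The sketch below is illustrative only. `FakeContext`, `MiniTensor` and `LegacyCallSite` are hypothetical stand-ins, not the real Caffe2 types.

```cpp
#include <cassert>
#include <vector>

struct FakeContext {};  // hypothetical stand-in for a device context

// Hypothetical tensor whose CopyFrom takes an async flag, copy_-style.
struct MiniTensor {
  std::vector<float> data;
  bool last_copy_async = false;

  void CopyFrom(const MiniTensor& src, bool async = false) {
    data = src.data;          // this CPU-only sketch always copies synchronously
    last_copy_async = async;  // record what the caller asked for
  }
};

// An old call site that still passes a context pointer. It compiles because
// the pointer is implicitly converted to bool (non-null -> true).
void LegacyCallSite(MiniTensor* dst, const MiniTensor& src, FakeContext* ctx) {
  dst->CopyFrom(src, ctx);
}
```

This is why the hack is fragile. Any non-null context silently requests an async copy, and the compiler gives no hint that the argument's meaning changed.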
Reviewed By: ezyang
Differential Revision: D13117981
fbshipit-source-id: 7cb1dc2ba6a4c50ac26614f45ab8318ea96e3138
Summary:
Hi guys,
I'd like to build Caffe2 with more supported options on Windows with Microsoft Visual Studio.
This is the first pull request.
Running scripts/build_windows_shared.bat builds Caffe2 with both CMAKE_BUILD_TYPE=Debug and CMAKE_BUILD_TYPE=Release using Visual Studio 14 2015.
CUDA is 9.0 and cuDNN is 7.0.5; glog, gflags, and lmdb are supported on my system.
Python is 3.5, and Detectron works from the Python interface as well.
It was even possible to debug Detectron code and step into caffe2_gpu.dll with the PDBs built.
Unfortunately, the c10/experimental ops don't build with this Visual Studio generator, so I added a special option, INCLUDE_EXPERIMENTAL_C10_OPS (default ON), to deal with this in build_windows_shared.bat.
After this pull request, the next step is to add Visual Studio 2017 support to the script.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13550
Reviewed By: ezyang
Differential Revision: D13042597
Pulled By: orionr
fbshipit-source-id: f313f909f599cd582a1d000eff766eef3a9fc4fc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9939
Pull Request resolved: https://github.com/facebookresearch/weakly-supervised-action-detection/pull/13
Pull Request resolved: https://github.com/pytorch/translate/pull/166
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9125
Closes https://github.com/pytorch/pytorch/pull/9125
Use inheritance for polymorphism and remove the template parameter.
This changes the templating at call sites; the core implementations will change later.
Previously, the Caffe2 Tensor class was bound at compile time to a particular device/context. With this change, we're making the device a runtime property (stored inside the tensor) while preserving the same semantics. For example, one has to specify the device type in order to create a Tensor; there are no uninitialized tensors. More specifically, the changes are:
1. We added an extra argument, *DeviceType*, to most of the tensor constructors, e.g. `Tensor(DeviceType type)`.
2. The semantics of the constructor `Tensor(const Tensor<SrcContext>& src, ContextForCopy* context);` have changed. The second argument, a context, is passed in so that we can call the templated Copy function. Previously it could be in a different context from the source and target; now we enforce that, if provided, the context has the same device type as `src`.
3. To preserve the 'get-or-construct' semantics of Blob, we added a specialized getter, `Blob::GetMutableTensor`, that verifies both that the Blob contains a Tensor and that it's of the correct type.
4. In particular, the Tensor type is no longer default-constructible (as we don't have tensors of unknown device), so some of the code handling STL containers needs to change.
Note: Some changes are postponed just to keep this diff a bit smaller. Please see `TODO`s.
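Points 1, 3 and 4 above can be illustrated with a heavily simplified sketch. The `Tensor`, `Blob` and `MakeCpuTensors` below are hypothetical and only model the device type. The sketch also assumes that `GetMutableTensor` replaces the contents when the stored tensor has a different device type, rather than failing.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <type_traits>
#include <vector>

// Hypothetical, simplified types; not the real caffe2::Tensor or caffe2::Blob.
enum class DeviceType { CPU, CUDA };

class Tensor {
 public:
  // No default constructor: every tensor knows its device at runtime.
  explicit Tensor(DeviceType type) : type_(type) {}
  DeviceType GetDeviceType() const { return type_; }

 private:
  DeviceType type_;
};
static_assert(!std::is_default_constructible<Tensor>::value,
              "tensors of unknown device must not exist");

class Blob {
 public:
  // Get-or-construct (assumed behaviour): reuse the stored tensor if it has
  // the requested device type, otherwise replace it with a new one.
  Tensor* GetMutableTensor(DeviceType type) {
    if (!tensor_ || tensor_->GetDeviceType() != type) {
      tensor_ = std::make_unique<Tensor>(type);
    }
    return tensor_.get();
  }

 private:
  std::unique_ptr<Tensor> tensor_;  // the real Blob can hold any type
};

// `std::vector<Tensor> v(n);` no longer compiles, so containers must be
// filled with explicitly constructed tensors instead.
std::vector<Tensor> MakeCpuTensors(std::size_t n) {
  std::vector<Tensor> out;
  out.reserve(n);
  for (std::size_t i = 0; i < n; ++i) out.emplace_back(DeviceType::CPU);
  return out;
}
```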
Reviewed By: ezyang, houseroad
Differential Revision: D9024330
fbshipit-source-id: e0b8295d2dc6ebe2963383ded5af799ad17164ba
Summary:
Pull Request resolved: https://github.com/facebookresearch/weakly-supervised-action-detection/pull/13
Pull Request resolved: https://github.com/pytorch/translate/pull/166
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9125
Closes https://github.com/pytorch/pytorch/pull/9125
Use inheritance for polymorphism and remove the template parameter.
This changes the templating at call sites; the core implementations will change later.
Previously, the Caffe2 Tensor class was bound at compile time to a particular device/context. With this change, we're making the device a runtime property (stored inside the tensor) while preserving the same semantics. For example, one has to specify the device type in order to create a Tensor; there are no uninitialized tensors. More specifically, the changes are:
1. We added an extra argument, *DeviceType*, to most of the tensor constructors, e.g. `Tensor(DeviceType type)`.
2. The semantics of the constructor `Tensor(const Tensor<SrcContext>& src, ContextForCopy* context);` have changed. The second argument, a context, is passed in so that we can call the templated Copy function. Previously it could be in a different context from the source and target; now we enforce that, if provided, the context has the same device type as `src`.
3. To preserve the 'get-or-construct' semantics of Blob, we added a specialized getter, `Blob::GetMutableTensor`, that verifies both that the Blob contains a Tensor and that it's of the correct type.
4. In particular, the Tensor type is no longer default-constructible (as we don't have tensors of unknown device), so some of the code handling STL containers needs to change.
Note: Some changes are postponed just to keep this diff a bit smaller. Please see `TODO`s.
Reviewed By: xw285cornell
Differential Revision: D8121878
fbshipit-source-id: 4a5e9a677ba4ac82095df959851a054c81eccf81
Summary:
Previously, in SafeDequeueOp, `in.dims()[0]` would fail if `in.ndim() == 0`,
and the resulting error message was not informative. I added a `CAFFE_ENFORCE`,
which prints out the input and output blob names. This will be very helpful for
future debugging as well.
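The sketch below shows the kind of guard described above. It checks the rank before indexing the first dimension, so a scalar input yields a readable error instead of an out-of-range access. It uses a plain exception in place of `CAFFE_ENFORCE`, and `FirstDim` and its parameters are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: returns the first (batch) dimension of a dequeued
// blob, throwing a descriptive error when the blob has no dimensions.
int64_t FirstDim(const std::vector<int64_t>& dims,
                 const std::string& input_name,
                 const std::string& output_name) {
  if (dims.empty()) {
    std::ostringstream msg;
    msg << "Expected input " << input_name << " (dequeued into "
        << output_name << ") to have at least 1 dimension, got ndim="
        << dims.size();
    throw std::runtime_error(msg.str());  // stands in for CAFFE_ENFORCE
  }
  return dims[0];
}
```

Including both blob names in the message is what makes the failure actionable in a large net, where many queue ops can share the same type.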
Differential Revision: D6821421
fbshipit-source-id: b07e5829a2c580aaaac88b0d9ff8d05f6da11713
Summary: The weighted sampling reader's dequeue randomly chooses a Hive reader to read a mini-batch. This diff allows dequeue to output the index of the randomly chosen table to a specific blob.
Reviewed By: kennyhorror
Differential Revision: D6621070
fbshipit-source-id: 754b981fc2bcfdb0146d2a0a5b677e7cfe74211b
Summary: This is a reapplication of the earlier PR due to the xplat move. The original author is Christoph Conrads <christoph.conrads@fluent.ai> (christoph-conrads).
Reviewed By: houseroad
Differential Revision: D6379736
fbshipit-source-id: b7482ecf3b9487a528c15e92976e915791210002
Summary: This allows reading data in small batches and concatenating the batches later on.
Reviewed By: kennyhorror
Differential Revision: D5739129
fbshipit-source-id: 66a8087e5f9d10d654e367c6111ac90cbf54224e
Summary: As title. This helps in the (quite common) cases where data input gets stuck for one reason or another and net execution never proceeds, hanging forever.
Reviewed By: andrewwdye
Differential Revision: D5409885
fbshipit-source-id: 840261fd5964408f788fc0f50ece0d74193694ac
Summary:
Similar to SafeDequeueBlobsOp, but adds weight-based sampling for reading from multiple input BlobsQueues.
WeightedSampleDequeueBlobsOp takes a vector of weights (each weight is mapped to one input blob queue).
Based on these weights, we randomly choose which BlobsQueue to fetch from.
WeightedSampleDequeueBlobsOp stops when any of the input BlobsQueues is empty.
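The selection logic above can be sketched with `std::discrete_distribution`, which picks index `i` with probability proportional to `weights[i]`. This is an illustration, not the operator's actual code. `WeightedSampleDequeue` is hypothetical, and plain `std::deque<int>` queues stand in for BlobsQueues. The chosen index is also returned, matching the later change that outputs the chosen table's index.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <random>
#include <utility>
#include <vector>

// Hypothetical sketch of weight-based selection among several input queues.
class WeightedSampleDequeue {
 public:
  WeightedSampleDequeue(std::vector<std::deque<int>*> queues,
                        const std::vector<double>& weights, unsigned seed)
      : queues_(std::move(queues)),
        dist_(weights.begin(), weights.end()),
        rng_(seed) {
    assert(queues_.size() == weights.size());
  }

  // Returns false (stop) as soon as any input queue is empty. Otherwise pops
  // one item from a queue chosen with probability proportional to its weight.
  bool Dequeue(int* value, std::size_t* chosen_index) {
    for (const auto* q : queues_) {
      if (q->empty()) return false;
    }
    std::size_t i = static_cast<std::size_t>(dist_(rng_));
    *value = queues_[i]->front();
    queues_[i]->pop_front();
    *chosen_index = i;
    return true;
  }

 private:
  std::vector<std::deque<int>*> queues_;
  std::discrete_distribution<int> dist_;
  std::mt19937 rng_;
};
```

`std::discrete_distribution` normalizes the weights itself, so they need not sum to 1. A weight of 0 means that queue is never chosen.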
Reviewed By: dzhulgakov
Differential Revision: D4905160
fbshipit-source-id: 5b1551e2250569f933a6c01ed04442843c5e0cb6
Summary: Allows drilling down into data throughput, both overall and per field.
Reviewed By: dzhulgakov
Differential Revision: D4622168
fbshipit-source-id: 1462bb2fac05824fda0c02f4f5f0b8713893e650
Summary: Add support for "safe" versions of enqueue and dequeue. I'm not sure if using `math::Set<bool, Context>` is the best context-independent approach for setting the status.
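A "safe" dequeue reports exhaustion through a status flag instead of failing. The sketch below shows how a caller can use that flag as a loop exit condition. It assumes the status becomes `true` once the queue is closed and drained. It is single-threaded and does not model blocking on an open, empty queue. `SimpleQueue`, `SafeDequeue` and `SumUntilClosed` are hypothetical names.

```cpp
#include <cassert>
#include <deque>

// Hypothetical non-blocking queue used only to illustrate the status contract.
struct SimpleQueue {
  std::deque<int> items;
  bool closed = false;
};

// Assumed "safe" semantics: set *status = true when the queue is closed and
// drained, instead of raising an error; otherwise pop one item.
void SafeDequeue(SimpleQueue& q, int* value, bool* status) {
  if (q.items.empty()) {
    assert(q.closed && "sketch does not model blocking on an open queue");
    *status = true;
    return;
  }
  *value = q.items.front();
  q.items.pop_front();
  *status = false;
}

// A typical consumer: the status flag is the exit criterion, not an exception.
int SumUntilClosed(SimpleQueue& q) {
  int sum = 0;
  int value = 0;
  bool done = false;
  for (;;) {
    SafeDequeue(q, &value, &done);
    if (done) break;
    sum += value;
  }
  return sum;
}
```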
Differential Revision: D4398633
fbshipit-source-id: 7c88c8e11acfe36fd3d94f17dbf68ce558eb6df1
Summary:
One more small batch of CHECKs that were left in the C2 codebase. Most of the
leftovers should be in tests/GPU-only code.
Reviewed By: Yangqing
Differential Revision: D4243782
fbshipit-source-id: a4a03c116ea8ba16facd2efc135746d5921f19d5