Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15588
Use the NHWC2NCHW or NCHW2NHWC functions, which are easier to understand than code using transpose and which generalize to non-2D convolutions.
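For illustration, a minimal numpy sketch (not the actual C++ helpers this diff uses) of what such a layout conversion expresses: move the channel axis between last and second position, for any number of spatial dimensions.
```python
# Minimal numpy sketch of NHWC<->NCHW-style conversion; illustrative only,
# not the Caffe2 C++ utilities referenced above.
import numpy as np

def nhwc2nchw(x):
    # (N, D1, ..., Dk, C) -> (N, C, D1, ..., Dk); works for 1-D, 2-D, or 3-D conv inputs.
    return np.moveaxis(x, -1, 1)

def nchw2nhwc(x):
    # (N, C, D1, ..., Dk) -> (N, D1, ..., Dk, C)
    return np.moveaxis(x, 1, -1)

x = np.random.rand(2, 4, 5, 3)                      # NHWC input
assert nhwc2nchw(x).shape == (2, 3, 4, 5)           # NCHW output
assert np.array_equal(nchw2nhwc(nhwc2nchw(x)), x)   # round-trips
```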
Reviewed By: csummersea
Differential Revision: D13557674
fbshipit-source-id: c4fdb8850503ea58f6b17b188513ae2b29691ec0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15082
We didn't have a unit test for low-precision rowwise adagrad.
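For context, a minimal numpy sketch of the row-wise Adagrad update being tested; this is the full-precision form for illustration only, while the kernels under test store parameters/moments in low precision.
```python
# Full-precision numpy sketch of row-wise Adagrad; the tested kernels apply
# the same update with low-precision (quantized) storage, which this omits.
import numpy as np

def rowwise_adagrad_update(param, moment, grad, indices, lr=0.01, eps=1e-6):
    # moment keeps one scalar per parameter row: a running sum of the
    # mean squared gradient for that row.
    for i, row in enumerate(indices):
        moment[row] += np.mean(grad[i] ** 2)
        param[row] -= lr * grad[i] / (np.sqrt(moment[row]) + eps)
    return param, moment

param = np.zeros((4, 3), dtype=np.float32)
moment = np.zeros(4, dtype=np.float32)
grad = np.ones((2, 3), dtype=np.float32)
rowwise_adagrad_update(param, moment, grad, indices=[0, 2])
```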
Reviewed By: chocjy
Differential Revision: D13300732
fbshipit-source-id: 46e7bdfc82c5a6855eeb6f653c0a96b0b3a20546
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15389
SparseLengthsMean was generating uninitialized data for empty inputs (lengths == 0). We should return zeros.
The unit tests also did not cover this special case; this diff fixes that as well.
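As a reference for the intended behavior, a minimal numpy sketch (illustrative only, not the operator implementation): segments of length 0 contribute rows of zeros rather than uninitialized data.
```python
# Numpy reference sketch of SparseLengthsMean semantics for empty segments;
# illustrative only, not the Caffe2 implementation.
import numpy as np

def sparse_lengths_mean(data, indices, lengths):
    out = np.zeros((len(lengths), data.shape[1]), dtype=data.dtype)
    pos = 0
    for i, n in enumerate(lengths):
        if n > 0:  # an empty segment (length == 0) stays all zeros
            out[i] = data[indices[pos:pos + n]].mean(axis=0)
        pos += n
    return out

data = np.arange(12, dtype=np.float32).reshape(4, 3)
print(sparse_lengths_mean(data, indices=[0, 2], lengths=[0, 2, 0]))
# rows 0 and 2 are zeros; row 1 is the mean of data[0] and data[2]
```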
Reviewed By: salexspb
Differential Revision: D13515970
fbshipit-source-id: 3c35265638f64f13f0262cee930c94f8628005da
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15458
Many nets in the wild seem to have outputs that are never produced by the net.
Reviewed By: ZolotukhinM
Differential Revision: D13534185
fbshipit-source-id: 2b23b39c28404c53f68868f3bf6df53c5fea9eab
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15453
Just move things around to facilitate further development. No logic change.
Reviewed By: rdzhabarov
Differential Revision: D13533959
fbshipit-source-id: eebab1306939e802aacffb24a711d372fd67916c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15174
Previously, Caffe2 maintained a separate per-thread per-device
current logical CUDA stream ID. In this PR, we switch Caffe2 over
to using c10::Stream to manage the current stream, and also
manage the allocation of cudaStream_t objects.
This results in a slight behavior change: previously, Caffe2
would have been willing to allocate an arbitrary number of
CUDA streams, depending on how high the logical stream IDs
went. The c10::Stream pool has a fixed number of streams; once
you exceed it, it wraps around.
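A minimal sketch of the wrap-around behavior (the pool size here is a made-up number for illustration; the real size is fixed by c10):
```python
# Illustrative sketch only: logical stream ids no longer allocate new
# cudaStream_t objects without bound; they map onto a fixed-size pool.
POOL_SIZE = 32  # hypothetical value; the actual pool size is defined by c10

def pool_slot_for(logical_stream_id):
    # ids 0..POOL_SIZE-1 get distinct streams; beyond that, streams are reused.
    return logical_stream_id % POOL_SIZE

assert pool_slot_for(3) == 3
assert pool_slot_for(POOL_SIZE + 3) == 3  # wraps around instead of allocating
```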
Reviewed By: dzhulgakov
Differential Revision: D13451550
fbshipit-source-id: da6cf33ee026932a2d873835f6e090f7b8a7d8dc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15366
Swap the old implementation for one that is slightly easier to understand.
I ran the tests and compared the number of chains against the old algorithm. The new one produces fewer chains on every test, but we have yet to see whether that impacts performance at all.
Chain counts, old algorithm vs. nomnigraph:
- old chain 34, nomnigraph chain 25
- old chain 46, nomnigraph chain 34
- old chain 228, nomnigraph chain 188
- old chain 397, nomnigraph chain 338
Reviewed By: ilia-cher
Differential Revision: D13057451
fbshipit-source-id: ccd050bfead6eb94ab9c7b0a70b09a22c2b9e499
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15250
This adds `__repr__` methods to all of the classes under task.py. This makes the objects much easier to interact with when using them in an interactive manner, such as in a Jupyter notebook.
The default `__repr__` method just returns the object ID, which is very unhelpful.
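A minimal sketch (hypothetical class, not the actual task.py code) of the kind of `__repr__` that was added:
```python
# Hypothetical example of the style of __repr__ added to the task.py classes:
# show the fields that matter instead of the default object-id repr.
class Task:
    def __init__(self, name, node):
        self.name = name
        self.node = node

    def __repr__(self):
        return f"Task(name={self.name!r}, node={self.node!r})"

print(Task("reader", "trainer:0"))  # Task(name='reader', node='trainer:0')
```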
Reviewed By: hanli0612
Differential Revision: D13475758
fbshipit-source-id: 6e1b166ec35163b9776c797b6a2e0d002560cd29
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15191
OSS:
Just splitting out basic flags from a unit test, so I can extend them in another test where I need to add additional flags.
Reviewed By: yinghai
Differential Revision: D13159184
fbshipit-source-id: 9823e792cf0ed8d0379235c44564862b7d784845
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15110
Support casting to string on CPU.
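A minimal usage sketch of what this enables (assuming the standard Caffe2 Python workspace API; exact argument spelling may differ):
```python
# Illustrative sketch: run the Cast operator with a string target type on CPU.
# Assumes the standard Caffe2 Python bindings are available.
import numpy as np
from caffe2.proto import caffe2_pb2
from caffe2.python import core, workspace

workspace.FeedBlob("X", np.array([1.5, 2.0], dtype=np.float32))
op = core.CreateOperator("Cast", ["X"], ["Y"], to=caffe2_pb2.TensorProto.STRING)
workspace.RunOperatorOnce(op)
print(workspace.FetchBlob("Y"))  # string representations of the input values
```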
Reviewed By: intermilan
Differential Revision: D13429381
fbshipit-source-id: b737a1ba1237b10f692d5c42b42a544b94ba9fd1
Summary:
The speed-up of a single operation is up to 3X.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15106
Differential Revision: D13429596
Pulled By: bddppq
fbshipit-source-id: f8d987cafeac9bef9c3daf7e43ede8c6a4ee2ce5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14631
Adding an empty name scope to allow people to jump out of the current namescope.
This could be useful when you want to access a blob from a parent or sibling scope.
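A sketch of the intended usage (an assumption for illustration: `EmptyNameScope` is taken to sit alongside `NameScope` in caffe2.python.core, and the exact spelling may differ):
```python
# Illustrative usage sketch; the exact location/signature of EmptyNameScope
# is an assumption here.
from caffe2.python import core

with core.NameScope("gpu_0"):
    scoped = core.ScopedBlobReference("w")    # resolves to "gpu_0/w"
    with core.EmptyNameScope():
        root = core.ScopedBlobReference("w")  # resolves to "w" in the root scope

print(scoped, root)  # gpu_0/w  w
```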
Facebook:
e.g.: we encountered a potential use case in D13124249 (it's a large diff; please search for EmptyNameScope in that diff) where we need to access a blob declared in the root namescope from a device namescope (device namescopes are used by the parallel_GPU API). `EmptyNameScope` can help us do that with ease.
I referenced `EmptyDeviceScope` D6103412 while implementing this one.
Reviewed By: yinghai
Differential Revision: D13272240
fbshipit-source-id: d4cde5abcc2336e456b6c6ef086266ef94d86da8
Summary:
…done once
This allows no-op builds to work correctly even when BUILD_CAFFE2_OPS is on.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14982
Differential Revision: D13413960
Pulled By: zdevito
fbshipit-source-id: 6e5412a8c375af8a47c76f548cdd31cff15f3853
Summary:
Currently in caffe2, one cannot properly fetch the content of Int8 blobs.
Upon digging into the source code, it turns out that the relevant source file is not being compiled. Adding it to CMakeLists.txt fixes this issue.
First time ever doing a pull request. Please let me know if there's any rule I should follow. Thanks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15047
Differential Revision: D13417583
Pulled By: bddppq
fbshipit-source-id: dd39575971a3012635edbf97a045d80e4b62a8eb
Summary:
Fix auto grad summing for IfOp where an intermediate output needs renaming.
Bug before this diff:
- we only rename the output of IfOp without changing the subnet ops' outputs
- this results in a "blob not found" error
The unit test provides an example.
This diff fixes that for IfOp.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14772
Differential Revision: D13327090
Pulled By: harouwu
fbshipit-source-id: ec40ee88526ace3619c54551e223dd71158a02f8
Summary:
This will let us install tests and other Caffe2 python code as a part of running Caffe2 tests in PyTorch.
Broken out of https://github.com/pytorch/pytorch/pull/13733/
cc pjh5 yf225
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14898
Reviewed By: pjh5
Differential Revision: D13381123
Pulled By: orionr
fbshipit-source-id: 0ec96629b0570f6cc2abb1d1d6fce084e7464dbe
Summary:
This pull request contains changes for:
1. Added the MIOpen RNN APIs miopenGetRNNLayerBiasSize and miopenGetRNNLayerParamSize.
2. Fixed usage of the API miopenGetRNNLayerParam.
3. Modified the RNN test to run using the MIOpen engine.
Differential Revision: D13355699
Pulled By: bddppq
fbshipit-source-id: 6f750657f8049c5446eca893880b397804120b69
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13756
This implements a general Gather operator for an arbitrary axis, sharing code with BatchGather.
- CPU gather & batch gather logic is now shared through caffe2::gather_helper, for any axis.
- Shared CUDA kernel moved to gather_op.cuh, for any axis.
- Gradients of axis > 0 delegate to BatchGatherGradientOp which now has axis argument.
- BatchGatherOp doc strings updated to have the correct rank (q + (r - 1)) and output.
- Added tests for axis == 2.
GatherOp supports index wrapping for axis == 0 by default, which was added earlier for ONNX.
This diff also extends it to work in the CUDA kernel. Added a "wrap_indices" argument which specifies
whether this wrapping should be done; set it to true if you'd like wrapping for any axis.
TBD: Update gradients to support negative indices (separate diff).
TBD: Once we have operator versioning, we'd like to update GatherOp to NOT support axis 0 wrapping
by default, but rather do it only if wrap_indices is set.
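For reference, a minimal numpy sketch (illustrative only, not the operator code) of gathering along an arbitrary axis with optional index wrapping:
```python
# Numpy reference sketch of Gather along an axis with optional index wrapping
# (negative indices counted from the end of that axis); illustrative only.
import numpy as np

def gather(data, indices, axis=0, wrap_indices=False):
    idx = np.asarray(indices)
    if wrap_indices:
        idx = np.where(idx < 0, idx + data.shape[axis], idx)
    return np.take(data, idx, axis=axis)

data = np.arange(24).reshape(2, 3, 4)
print(gather(data, [-1, 1], axis=2, wrap_indices=True).shape)  # (2, 3, 2)
```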
Reviewed By: dzhulgakov
Differential Revision: D12983815
fbshipit-source-id: 8add9d67b47fe8c5ba7a335f581ca0530b205cd7
Summary:
The goal of this PR is to unify the cuda and hip device types in the caffe2 python front end.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14221
Differential Revision: D13148564
Pulled By: bddppq
fbshipit-source-id: ef9bd2c7d238200165f217097ac5727e686d887b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14268
Removes the need for Context in Tensor by doing simple dispatch for CopyBytes. It'd eventually be subsumed by Roy Li's changes for a proper copy_ op, but before that is done, let's get clear logic for how copies are implemented and clean up some cruft in the CopyFrom implementation.
Note that with these changes one can probably get rid of Context::CopyFromCPU/CopyToCPU, but that's a matter for follow-up diffs.
This diff doesn't change the API of Tensor yet, but relies on the fact that passing `Context` to CopyFrom makes the copy async if the device is CUDA and doesn't have any effect otherwise (that's how the Context methods are implemented).
This doesn't change the semantics of the async copy implementation - as before, it blindly calls cudaMemcpyAsync, which probably means it can be misused if invoked separately outside of an operator body. I'll leave that for the follow-up copy_ unification.
For Extend() we always do an async copy - it makes sense as it's an in-place device-to-device operation and the result is only observable by a further op.
Note: there are now three ways of invoking a copy in C2 code - the templated CopyBytes, the virtual CopyFromCPU/etc., and the double-dispatch free function here. Hopefully we can get rid of the second one.
Also, please advise whether it's c10-worthy :)
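A minimal Python sketch (hypothetical names; the real implementation is C++) of the double-dispatch idea: copies are routed by (source device type, destination device type) through a registry of free functions rather than through a Context object.
```python
# Illustrative sketch of dispatching CopyBytes on (src device, dst device);
# names are made up, the actual Caffe2 code is C++ and uses a registry.
_copy_registry = {}

def register_copy(src_type, dst_type):
    def wrap(fn):
        _copy_registry[(src_type, dst_type)] = fn
        return fn
    return wrap

@register_copy("cpu", "cpu")
def copy_cpu_to_cpu(nbytes, src, dst):
    dst[:nbytes] = src[:nbytes]  # plain memcpy-like copy

def copy_bytes(nbytes, src, src_type, dst, dst_type):
    # double dispatch: pick the copy routine from both device types
    return _copy_registry[(src_type, dst_type)](nbytes, src, dst)

src, dst = bytearray(b"caffe2!!"), bytearray(8)
copy_bytes(8, src, "cpu", dst, "cpu")
print(dst)  # bytearray(b'caffe2!!')
```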
Reviewed By: ezyang
Differential Revision: D13117987
fbshipit-source-id: a6772d6dcf3effaf06717da3a656fc9873b310b5