Summary:
The test seems to be failing in ROCM 4.1 on CI node. Disabling the same for now. The test will be re-enabled for ROCM when CI transitions to 4.2.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56951
Reviewed By: zou3519
Differential Revision: D28059808
Pulled By: ezyang
fbshipit-source-id: a9b064b7525ae6dce89c51fe29ff07f37b7ac796
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46702
- fails on probability distribution with odd items
- trying to access an `acc_type` (`float`) in a `scalar_t` (`float16`) aligned memory
- produce unrepeatable result for large input tensor
- parallel cumsum not monotonic at some positions
### Fixes
- computing cumsum on `acc_type` (`float`) instead of using `scalar_t` (`float16`) fixed both issues
- the non-monotonic behavior may happen even using `float`, though
- in these cases, deterministic behavior may be achieved by eliminating the race condition when writing the result, using the atomic function `atomicMax`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55364
Reviewed By: mruberry
Differential Revision: D28031666
Pulled By: ngimel
fbshipit-source-id: 0fc6289e0b9ea2d31ef3771e7ca370de8f5c02de
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55102
To avoid casting a tensor to `.long()`, we introduce support for int32 in `torch.repeat_interleave`.
Reviewed By: ezyang
Differential Revision: D27478235
fbshipit-source-id: 08b4cce65fe94ff10535ddc07e1ba2bacea6a2cf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53238
There is a tension for the Vitals design: (1) we want a macro based logging API for C++ and (2) we want a clean python API. Furthermore, we want to this to work with "print on destruction" semantics.
The unfortunate resolution is that there are (2) ways to define vitals:
(1) Use the macros for local use only within C++ - this keeps the semantics people enjoy
(2) For vitals to be used through either C++ or Python, we use a global VitalsAPI object.
Both these go to the same place for the user: printing to stdout as the globals are destructed.
The long history on this diff shows many different ways to try to avoid having 2 different paths... we tried weak pointers & shared pointers, verbose switch cases, etc. Ultimately each ran into an ugly trade-off and this cuts the difference better the alternatives.
Test Plan:
buck test mode/dev caffe2/test:torch -- --regex vital
buck test //caffe2/aten:vitals
Reviewed By: orionr
Differential Revision: D26736443
fbshipit-source-id: ccab464224913edd07c1e8532093f673cdcb789f
Summary:
#### Reason for relanding
Line 1607 of `torch/testing/_internal/common_methods_invocations.py` of https://github.com/pytorch/pytorch/issues/50999 had `dtype` instead of `dtype=torch.bool`, so 4 of the 9 sample inputs for `bool` had incorrect dtype. This bug was caught by https://github.com/pytorch/pytorch/issues/54949.
1. Added support for pow() on CPU for `float16` (`Half`) and `bfloat16` types.
Both `pow(Tensor, Scalar)` and `pow(Tensor, Tensor)` are now supported for the aforementioned types.
However autograd isn't supported for `Float16` on CPU yet, as `log_vml_cpu` can't be enabled for it.
2. heitorschueroff added `pow_tensor_scalar_optimized_kernel` to refactor & simplify `PowKernel.cpp`.
It provides a common path for all the complex types & floating point types (except Float16, due to lack of complete AVX2 vectorization support for it). It replaced code that had previously been duplicated for (float, double) and complex types,
so PowKernel.cpp looks a lot cleaner now.
3. Enabled (unskipped) some tests for `erf`, `erfc`,`erfinv`, `tan` and `linalg.vector.norm` which were being skipped earlier due to `pow()` not having been implemented for `float16` & `bfloat16`.
4. Added an OpInfo for `pow()` & enabled some test cases for `pow()`.
5. Extended the coverage of existing tests for `pow` in `test_binary_ufuncs.py` in order to enable comparison with `numpy`, even with discontiguous tensors, and added a test to ensure that a runtime error is raised for `pow`'s inplace variant if resizing the base tensor is required during its invocation.
6. Added `float16` & `bfloat16` to `square`'s dtype lists in its `UnaryUfuncInfo`.
7. Removed redundant `dtypesIfCPU` and `dtypesIfCUDA` from `OpInfo`s where they are equal to `dtypes`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55280
Reviewed By: jbschlosser
Differential Revision: D27591772
Pulled By: heitorschueroff
fbshipit-source-id: c7420811b32595bb3353149a61e54a73f2eb352b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55150
Somehow I forgot to add these checks. Now they're in here. Thanks
ngimel for noticing.
This is probably a slight efficiency hit on TensorIterator, which is
probably already doing all these checks. Would be good to follow up
on this, though it may not be easily fixable with the TI rewrite.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: zhangguanheng66
Differential Revision: D27523879
Pulled By: ezyang
fbshipit-source-id: 458e617dbc6de6fcfa9e5841148b30b99f52e001
Summary:
Added the functionality desired in https://github.com/pytorch/pytorch/issues/50789.
1. Added support for pow() on CPU for `float16` (`Half`) and `bfloat16` types.
Both `pow(Tensor, Scalar)` and `pow(Tensor, Tensor)` are now supported for the aforementioned types.
However autograd isn't supported for `Float16` on CPU yet, as `log_vml_cpu` can't be enabled for it.
2. heitorschueroff added `pow_tensor_scalar_optimized_kernel` to refactor & simplify `PowKernel.cpp`.
It provides a common path for all the complex types & floating point types (except Float16, due to lack of complete AVX2 vectorization support for it). It replaced code that had previously been duplicated for (float, double) and complex types,
so PowKernel.cpp looks a lot cleaner now.
3. Enabled (unskipped) some tests for `erf`, `erfc`,`erfinv`, `linalg.norm` and `linalg.vector.norm` which were being skipped earlier due to `pow()` not having been implemented for `float16` & `bfloat16`.
4. Added an OpInfo for `pow()` & enabled some test cases for `pow()`.
5. Extended the coverage of existing tests for `pow` in `test_binary_ufuncs.py` in order to enable comparison with `numpy`, even with discontiguous tensors, and added a test to ensure that a runtime error is raised for `pow`'s inplace variant if resizing the base tensor is required during its invocation.
6. Added `float16` & `bfloat16` to `square`'s dtype lists in its `UnaryUfuncInfo`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50999
Reviewed By: zou3519
Differential Revision: D27478225
Pulled By: heitorschueroff
fbshipit-source-id: d309dd98d5a96d0cb9b08281757bb1c65266d011
Summary:
- Corrected a few errata in the SVD docs
- Made the notation more uniform (refer to `Vh` in `linalg.svd`, always use double tilts...)
- Wrote a better explanation about why the gradients of `U` and `V` are not well-defined when the input is complex or real but has repeated singular values. The previous one pointed to a somewhat obscure post on gauge theory.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54002
Reviewed By: malfet
Differential Revision: D27459502
Pulled By: mruberry
fbshipit-source-id: f5c35eca02d35dadd2fc0eeadfacc8824f409400
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54901
Some subtleties:
- Need to make sure not to clobber composite definitions when
deciding when to generate
- I was lazy and so I didn't make inplace on TensorList work,
nor did I make inplace functions that returned void work
- A few tests started complaining that these noop meta functions
weren't raising the errors they needed. This is tracked
in https://github.com/pytorch/pytorch/issues/54897
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: jbschlosser
Differential Revision: D27407232
Pulled By: ezyang
fbshipit-source-id: 5e706a267496368acdafd128942c310954e43d29
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53973
Two parts to this PR; I had to put them together because adding support for X causes more test code to be exercised, which in turn may require a fix for Y.
The first part is restoring the concept of storage to meta tensors. Previously, meta tensors had a nullptr storage (e.g., `meta_tensor.storage()` is an error.) As I was increasing the coverage of meta tensors, I started running into test cases (specifically memory overlap tests) that were failing because not having storage meant I couldn't check for memory overlap. After some discussion, we decided that it would make sense for meta tensors to model this as well (we already model strides, so getting accurate view information also seems useful). This PR does that by:
* Rewrite all of the factory functions in MetaTensor.cpp to use the generic versions (which are very carefully written to not actually poke at the data pointer, so everything works out). The key idea here is we give meta tensors a special allocator, MetaAllocator, which always returns a nullptr even if you ask for a nonzero number of bytes. resize_ is also made generic; the normal variant can be used directly rather than having to instruct it to avoid resizing storage
* Turn on memory overlap checking in TensorIterator even for meta tensors
* Although meta tensors now have storage, the concept of meta storage is NOT exposed to Python land (as it would imply I would have to codegen MetaFloatStorage, MetaDoubleStorage, etc. classes). So `x.storage()` still raises an error and I have a cludge in `__deepcopy__` to break storage sharing upon deep copy (this is wrong, but no tests exercise this at the moment).
The second part is adding more support for the most used functions in the test suite.
* Inplace operations have very simple meta functions. I added `fill_`, `zero_`, `random_`, `uniform_` and `normal_`. In the case of random, I take advantage of pbelevich's templates for defining random kernels, so that I can reuse the common scaffolding, and then just register a noop stub that actually does the RNG. (Look, another structured kernels tiny variant!)
* `copy_` is now implemented. Copying into a meta tensor is always OK, but copying out of a meta tensor raises an error (as we don't know what the "correct" data to copy out is in this case)
* `empty_strided` usage from structured kernels now is implemented (TBH, this could have been done as soon as `empty_strided` was added)
* Meta was missing in a few places in TensorOptions/DispatchKey utility functions, so I added them
* Autograd engine now correctly homes meta tensors with CPU tensors (they have -1 device index so CUDA queues wouldn't work anyway)
* `apply_`, `map_` and `map2_` are special cased to no-op on meta tensor self. These count as inplace operations too but they are implemented a little differently.
Getting more meta function support triggers a number of bugs in the test suite, which I then fix:
- Linear algebra functions sometimes don't report NotImplementedError because they get swallowed by catch all try blocks. This is tracked in https://github.com/pytorch/pytorch/issues/53739
- dlpack obviously doesn't work with meta tensors, I just disabled the test
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D27036572
Test Plan: Imported from OSS
Reviewed By: agolynski, bdhirsh
Pulled By: ezyang
fbshipit-source-id: 7005ecf4feb92a643c37389fdfbd852dbf00ac78
Summary:
```
index_add(Tensor self, int dim, Tensor index, Tensor source) -> Tensor
```
now becomes
```
index_add(Tensor self, int dim, Tensor index, Tensor source, Scalar alpha=1) -> Tensor
```
Generally, this sounds useful and harmless, and inside PyTorch, we are already needing this feature in `add_out_dense_sparse_cuda`, see the `SparseCUDATensorMath.cu` change in this PR.
**Test not added yet. Will add if after discussion we believe this is a good idea.**
- [ ] TODO: add test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54176
Reviewed By: ngimel
Differential Revision: D27319198
Pulled By: mruberry
fbshipit-source-id: fe43be082d1230c87c5313458213d5252be2ff23
Summary:
Added the support for half / bfloat / bool for `index_select`, as suggested by ngimel in
https://github.com/pytorch/pytorch/issues/49707#issuecomment-788140578
For the tests to pass, I also added the support for `index_add`.
I added `OpInfo` tests for `index_add` and more thorough forward tests for `index_select` to test these changes.
While doing so, I found that the support for scalar types in the derivative of `index_add` was not correct, so I corrected it.
Resolves https://github.com/pytorch/pytorch/issues/49707
It should also resolve similar issues that I encountered when porting `index_copy`, `take` and `put`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53898
Reviewed By: mruberry
Differential Revision: D27193294
Pulled By: ngimel
fbshipit-source-id: 5a0af2c62a0cf24f3cc9c74f230ab4f3712bbb7a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54079
Fixes https://github.com/pytorch/pytorch/issues/53815
Instead of testing if something is CUDA, we instead test if something
is not CPU. This in the general theming of "Don't be so darn CUDA
centric".
Intruigingly, we didn't have a is_cpu() method on Tensor. Which seems
like a big oversight and one of the reasons how we ended up in this
mess. So in it goes. Maybe we should also get this for Python bindings
as well (but in that case, should probably look into redoing all of the
is_X bindings so they aren't done manually).
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D27109507
Pulled By: ezyang
fbshipit-source-id: abbe72c2e688c452ffe098d206cb79938b5824b1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53759Fixes#53587, see issue for in-depth explanation of the bug.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D26971342
Pulled By: ezyang
fbshipit-source-id: 805983fed2658e27fb033f36a71fd30950a29328
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53665
ngimel pointed out to me where we already test the behavior of the `Upsample` ops in `test_nn.py`. This PR deleting my bespoke tests in `test_torch.py` and updates those in `test_nn.py` to test memory format properly.
There were two reasons the original test didn't pick up on a memory format regression:
- They didn't test the memory format of the output tensor explicitly, i.e. `output.is_contiguous(memory_format=...)`
- Even with that change, the test tensors were to simple to fail the tests. From some trial and error, it looks like one of the first two dimensions in the inputs needs to be > 1 in order for the `channels_last` memory format to actually re-order the strides.
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D26929683
Pulled By: bdhirsh
fbshipit-source-id: d17bc660ff031e9b3e2c93c60a9e9308e56ea612
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48841 for half datatype (it was fixed for other datatypes before).
The reason for https://github.com/pytorch/pytorch/issues/48841 happening for half was that `exponential_` for half was producing 0s.
Exponential distribution implementation on cuda is here e08aae2613/aten/src/ATen/native/cuda/DistributionTemplates.h (L535-L545)
with `transformation::exponential` defined here
e08aae2613/aten/src/ATen/core/TransformationHelper.h (L113-L123)
It takes a uniformly distributed random number and takes `log` of it. If necessary, the result is then converted to low precision datatype (half). To avoid 0's, before applying `log`, ones are replaced with std::nextafter(1,0). This seems fine, because log(1-eps) is still representable in half precision (`torch.tensor([1.], device="cuda").nextafter(torch.tensor([0.], device="cuda")).log().half()` produces 5.96e-8) , so casting to `scalar_t` should work. However, since fast log approximation is used (`__logf`), the log result is ~3e-9 instead of more accurate 5.96e-8, and underflows when casting to half. Using `::log` instead of fast approximation fixes it, however, it comes with ~20% perf penalty on exponential kernel for fp32 datatype, probably more for half.
Edit: alternative approach used now is to filter all small values returned by transformation. The result is equivalent to squashing of 1's to 1-eps that was used before, and computing correct log of 1-eps (which is -eps, exactly equal even for doubles). This doesn't incur noticeable performance hit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53480
Reviewed By: mruberry
Differential Revision: D26924622
Pulled By: ngimel
fbshipit-source-id: dc1329e4773bf91f26af23c8afa0ae845cfb0937
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53535
During the port to structured kernels for upsample kernels, I missed that a subset of them explicitly pass `memory_format` information from the input to the output tensors.
Note 1:
I added the logic into the `meta` function of each op, which feels morally correct since this logic affects the output shape/metadata. One consequence is that all backend implementations will get the logic. I synced with fmassa that this seems reasonable.
Note 2:
This logic used to happen in the following operators, which this PR fixes:
- upsample_nearest3d
- upsample_trilinear3d
- upsample_nearest2d
- upsample_bilinear2d
I explicitly didn't patch the other upsample kernels, which look like they never forwarded memory_format information:
- `upsample_bicubic2d` (maybe this should though? `UpSampleBicubic2d.cpp` isn't currently written to do anything different for `channels_last` tensors)
- All of the `upsample_{mode}1d` operators. Probably because, afaik, channels_last isn't supported for 3d tensors
- The corresponding backwards operator for every upsample op.
Note 3:
I'm also wondering why memory_format isn't just directly a part of the `tensor::options()` method, which would cause all ops to universally forward memory_format information from input to output tensors, rather than just the upsample ops. My guess is:
- BC-breakage. I'm not sure whether this would really *break* people, but it's an API change
- performance. `tensor::options()` is called everywhere, and adding a call to `suggest_memory_format()` would probably noticeably hit microbenchmarks. We could probably deal with that by making `memory_format` a precomputed field on the tensor?
Test Plan: Imported from OSS
Reviewed By: H-Huang
Differential Revision: D26891540
Pulled By: bdhirsh
fbshipit-source-id: b3845f4dd5646b88bf738b9e41fe829be6b0e5cf
Summary:
Helps make master green by removing this hefty memory allocating from CPU test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53561
Reviewed By: malfet, albanD
Differential Revision: D26897941
Pulled By: janeyx99
fbshipit-source-id: 9f6c2d55f4eea1ab48665f7819fc113f21991036
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53276
- One of the tests had a syntax error (but the test
wasn't fine grained enough to catch this; any error
was a pass)
- Doesn't work on ROCm
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D26820048
Test Plan: Imported from OSS
Reviewed By: mruberry
Pulled By: ezyang
fbshipit-source-id: b02c4252d10191c3b1b78f141d008084dc860c45