Fixes#120614
Takes over #120615
In summary, this PR:
- Adds a `__dlpack__` attribute check in the tensor creation path (i.e. [`internal_new_from_data` @ tensor_new.cpp](cdfe1bffd1/torch/csrc/utils/tensor_new.cpp (L266)))
- Creates the tensor by using the DLPack machinery, instead of an element-by-element copy
- No changes since #120615
- Adds a test, making sure the DLPack machinery is used
- Wraps a tensor in a fresh `TensorDLPackWrapper` class that implements only the DLPack methods
- Creates a new tensor from an instance of `TensorDLPackWrapper`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138697
Approved by: https://github.com/ezyang
Co-authored-by: Wenzel Jakob <wenzel.jakob@epfl.ch>
Changes the StreamID encoding to use the last bit to distinguish between external and internal streams, 4 bits for IdType (DEFAULT, EXT or user-created streams possibly with high priority), and 5 bits for index. This allows us to have more stream priorities exposed to user (I'm currently setting 4, but that's easy to change now). Note, we are pre-creating all 32 streams in the pool per each allowed priority, I don't know if it's a problem in practice. Currently cuda 11.8/A100 GPUs allow 6 different stream priorities, the number may be different for the different cards/different cuda versions.
Previous callsites explicitly requesting high prioity stream (`isHighPriority=true`) are now getting the highest priority stream.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101956
Approved by: https://github.com/ezyang
Changes the StreamID encoding to use the last bit to distinguish between external and internal streams, 4 bits for IdType (DEFAULT, EXT or user-created streams possibly with high priority), and 5 bits for index. This allows us to have more stream priorities exposed to user (I'm currently setting 4, but that's easy to change now). Note, we are pre-creating all 32 streams in the pool per each allowed priority, I don't know if it's a problem in practice. Currently cuda 11.8/A100 GPUs allow 6 different stream priorities, the number may be different for the different cards/different cuda versions.
Previous callsites explicitly requesting high prioity stream (`isHighPriority=true`) are now getting the highest priority stream.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101956
Approved by: https://github.com/ezyang
It seems that some legacy default stream logic (e.g., present in a8ff647e42/torch/utils/dlpack.py (L114) ) is not handled on the potential receiving end in `torch/_tensor.py`.
Open to suggestions on how to make the test case less clunky, as this was the combination we arrived at after discovering flakiness in alternate versions.
Thanks to Olga Andreeva for surfacing this issue and providing a repro.
CC @Aidyn-A @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101318
Approved by: https://github.com/ngimel
Fixes#83069. Also move all the dlpack tests to a new file., `test_dlpack.py`.
The fix involves always allocating a "strides" int array when converting to dlPack and deleting the strides when the capsule descructor is called. Then the strides are copied from the tensor, and `strides[i]` is set to `1` where `shape[i] < 2`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83158
Approved by: https://github.com/ezyang