Mirror of https://github.com/zebrajr/pytorch.git
Synced 2025-12-07 00:21:07 +01:00
7 Commits
63d62d3e44
Skips test_addcmul_cuda if using ROCm (#44304)
Summary: This test is failing consistently on linux-bionic-rocm3.7-py3.6-test2. Relevant log snippet:

```
03:43:11 FAIL: test_addcmul_cuda_float16 (__main__.TestForeachCUDA)
03:43:11 ----------------------------------------------------------------------
03:43:11 Traceback (most recent call last):
03:43:11   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 818, in wrapper
03:43:11     method(*args, **kwargs)
03:43:11   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 258, in instantiated_test
03:43:11     result = test(self, *args)
03:43:11   File "test_foreach.py", line 83, in test_addcmul
03:43:11     self._test_pointwise_op(device, dtype, torch._foreach_addcmul, torch._foreach_addcmul_, torch.addcmul)
03:43:11   File "test_foreach.py", line 58, in _test_pointwise_op
03:43:11     self.assertEqual(tensors, expected)
03:43:11   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1153, in assertEqual
03:43:11     exact_dtype=exact_dtype, exact_device=exact_device)
03:43:11   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1127, in assertEqual
03:43:11     self.assertTrue(result, msg=msg)
03:43:11 AssertionError: False is not true : Tensors failed to compare as equal! With rtol=0.001 and atol=1e-05, found 10 element(s) (out of 400) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.00048828125 (-0.46484375 vs. -0.46533203125), which occurred at index (11, 18).
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44304
Reviewed By: malfet, izdeby
Differential Revision: D23578316
Pulled By: mruberry
fbshipit-source-id: 558eecf42677383e7deaa4961e12ef990ffbe28c
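For reference, the margin in that assertion follows the standard relative-plus-absolute tolerance rule (the same shape as `torch.allclose`). A minimal pure-Python sketch, plugging in the exact worst pair from the log above:

```python
def within_tolerance(actual, expected, rtol=1e-3, atol=1e-5):
    # Elementwise check: |actual - expected| <= atol + rtol * |expected|
    return all(abs(a - e) <= atol + rtol * abs(e)
               for a, e in zip(actual, expected))

# Worst pair from the log: difference 0.00048828125 exceeds the allowed
# margin 1e-05 + 0.001 * 0.46533203125 (about 0.000475), so the test fails.
print(within_tolerance([-0.46484375], [-0.46533203125]))  # → False
```

A half-precision (float16) ULP near 0.465 is larger than this margin, which is why such small discrepancies can trip the default tolerances on a different backend.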
cce5982c4c
Add unary ops: exp and sqrt (#42537)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42537

[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](https://github.com/pytorch/pytorch/pull/41554)

**Motivation**
[GitHub issue](https://github.com/pytorch/pytorch/issues/38655)
Current PyTorch optimizer implementations are not efficient when we work with many small feature tensors: launching a large number of kernels slows down the whole process, so we need to reduce the number of kernels we launch. As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex). To track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model's performance against the original model version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**Current API restrictions**
- The list can't be empty (will be fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.

**Broadcasting**
At this point we don't support broadcasting.

**What are the 'fast' and 'slow' routes**
In particular cases, we can't process an op with a fast list CUDA kernel. We can still fall back to a regular for-loop in which the op is applied to each tensor individually through the dispatch mechanism. A few checks decide whether the op takes the 'fast' or 'slow' path. To go the fast route:
- All tensors must have strided layout.
- All tensors must be dense and not have overlapping memory.
- The resulting tensor type must be the same.

----------------

**In this PR**
Adding APIs:
```
torch._foreach_exp(TensorList tl1)
torch._foreach_exp_(TensorList tl1)
torch._foreach_sqrt(TensorList tl1)
torch._foreach_sqrt_(TensorList tl1)
```

**Tests**
Tested via unit tests.

**TODO**
1. Properly handle empty lists
2. Properly handle bool tensors

**Plan for the next PRs**
1. APIs: Pointwise Ops
2. Complete tasks from TODO
3. Rewrite PyTorch optimizers to use for-each operators for performance gains.

Test Plan: Imported from OSS
Reviewed By: cpuhrsch
Differential Revision: D23331889
Pulled By: izdeby
fbshipit-source-id: 8b04673b8412957472ed56361954ca3884eb9376
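To make the semantics concrete, here is a sketch of what the unary for-each ops compute, with plain Python lists of floats standing in for CUDA tensors; the real implementation fuses this work into far fewer kernel launches:

```python
import math

def foreach_exp(tensors):
    # Out-of-place: returns a new list of "tensors", exp applied elementwise.
    return [[math.exp(x) for x in t] for t in tensors]

def foreach_sqrt_(tensors):
    # In-place variant: mutates each input list instead of allocating new ones.
    for t in tensors:
        for i, x in enumerate(t):
            t[i] = math.sqrt(x)
    return tensors

print(foreach_exp([[0.0, 0.0]]))    # → [[1.0, 1.0]]
print(foreach_sqrt_([[4.0, 9.0]]))  # → [[2.0, 3.0]]
```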
10dd25dcd1
Add binary ops for _foreach APIs (#42536)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42536

[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](https://github.com/pytorch/pytorch/pull/41554)

**Motivation**
[GitHub issue](https://github.com/pytorch/pytorch/issues/38655)
Current PyTorch optimizer implementations are not efficient when we work with many small feature tensors: launching a large number of kernels slows down the whole process, so we need to reduce the number of kernels we launch. As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex). To track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model's performance against the original model version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**Current API restrictions**
- The list can't be empty (will be fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.

**Broadcasting**
At this point we don't support broadcasting.

**What are the 'fast' and 'slow' routes**
In particular cases, we can't process an op with a fast list CUDA kernel. We can still fall back to a regular for-loop in which the op is applied to each tensor individually through the dispatch mechanism. A few checks decide whether the op takes the 'fast' or 'slow' path. To go the fast route:
- All tensors must have strided layout.
- All tensors must be dense and not have overlapping memory.
- The resulting tensor type must be the same.

----------------

**In this PR**
Adding APIs:
```
torch._foreach_sub(TensorList tl1, TensorList tl2)
torch._foreach_sub_(TensorList self, TensorList tl2)
torch._foreach_mul(TensorList tl1, TensorList tl2)
torch._foreach_mul_(TensorList self, TensorList tl2)
torch._foreach_div(TensorList tl1, TensorList tl2)
torch._foreach_div_(TensorList self, TensorList tl2)
torch._foreach_sub(TensorList tl1, Scalar scalar)
torch._foreach_sub_(TensorList self, Scalar scalar)
torch._foreach_mul(TensorList tl1, Scalar scalar)
torch._foreach_mul_(TensorList self, Scalar scalar)
torch._foreach_div(TensorList tl1, Scalar scalar)
torch._foreach_div_(TensorList self, Scalar scalar)
```

**Tests**
Tested via unit tests.

**TODO**
1. Properly handle empty lists
2. Properly handle bool tensors

**Plan for the next PRs**
1. APIs: Unary Ops for list, Pointwise Ops
2. Complete tasks from TODO
3. Rewrite PyTorch optimizers to use for-each operators for performance gains.

Test Plan: Imported from OSS
Reviewed By: cpuhrsch
Differential Revision: D23331891
Pulled By: izdeby
fbshipit-source-id: 18c5937287e33e825b2e391e41864dd64e226f19
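A sketch of the difference between the list-with-list and list-with-scalar overloads above, again using plain Python lists as stand-ins for tensor lists (the single-function dispatch on argument type is an illustration device, not how the C++ overloads are resolved):

```python
def foreach_mul(tl1, other):
    # list-with-scalar overload: broadcast one scalar over every tensor.
    if isinstance(other, (int, float)):
        return [[x * other for x in t] for t in tl1]
    # list-with-list overload: multiply matching tensor pairs elementwise.
    return [[x * y for x, y in zip(t1, t2)] for t1, t2 in zip(tl1, other)]

print(foreach_mul([[1.0, 2.0]], 3.0))           # → [[3.0, 6.0]]
print(foreach_mul([[1.0, 2.0]], [[4.0, 5.0]]))  # → [[4.0, 10.0]]
```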
2f044d4ee5
Fix CI build (#44068)
Summary: Some of our machines have only 1 device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44068
Reviewed By: wanchaol
Differential Revision: D23485730
Pulled By: izdeby
fbshipit-source-id: df6bc0aba18feefc50c56a8f376103352fa2a2ea
297c938729
Add _foreach_add(TensorList tl1, TensorList tl2) and _foreach_add_(TensorList tl1, TensorList tl2) APIs (#42533)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42533

[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](https://github.com/pytorch/pytorch/pull/41554)

**Motivation**
[GitHub issue](https://github.com/pytorch/pytorch/issues/38655)
Current PyTorch optimizer implementations are not efficient when we work with many small feature tensors: launching a large number of kernels slows down the whole process, so we need to reduce the number of kernels we launch. As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex). To track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model's performance against the original model version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**Current API restrictions**
- The list can't be empty (will be fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.

**Broadcasting**
At this point we don't support broadcasting.

**What are the 'fast' and 'slow' routes**
In particular cases, we can't process an op with a fast list CUDA kernel. We can still fall back to a regular for-loop in which the op is applied to each tensor individually through the dispatch mechanism. A few checks decide whether the op takes the 'fast' or 'slow' path. To go the fast route:
- All tensors must have strided layout.
- All tensors must be dense and not have overlapping memory.
- The resulting tensor type must be the same.

----------------

**In this PR**
- Adding a `_foreach_add(TensorList tl1, TensorList tl2)` API
- Adding a `_foreach_add_(TensorList tl1, TensorList tl2)` API

**Tests**
Tested via unit tests.

**TODO**
1. Properly handle empty lists

**Plan for the next PRs**
1. APIs: Binary Ops for list with Scalar, Binary Ops for list with list, Unary Ops for list, Pointwise Ops
2. Complete tasks from TODO
3. Rewrite PyTorch optimizers to use for-each operators for performance gains.

Test Plan: Imported from OSS
Reviewed By: zou3519
Differential Revision: D23331894
Pulled By: izdeby
fbshipit-source-id: 876dd1bc82750f609b9e3ba23c8cad94d8d6041c
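The fast-versus-slow routing described above amounts to an eligibility check over the whole list. A minimal sketch, where the dict keys (`layout`, `dense`, `dtype`) are hypothetical stand-ins for the real per-tensor properties the CUDA path inspects:

```python
def can_take_fast_route(tensors):
    # Fast route: every tensor strided, dense/non-overlapping, and of one
    # common dtype; otherwise fall back to the per-tensor dispatch loop.
    first = tensors[0]
    return all(
        t["layout"] == "strided"
        and t["dense"]
        and t["dtype"] == first["dtype"]
        for t in tensors
    )

ok = [{"layout": "strided", "dense": True, "dtype": "float32"}] * 2
mixed = ok + [{"layout": "sparse", "dense": False, "dtype": "float32"}]
print(can_take_fast_route(ok), can_take_fast_route(mixed))  # → True False
```

Note that failing the check does not fail the op; it only selects the slow per-tensor loop, so both routes produce the same result.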
4cb8d306e6
Add _foreach_add_(TensorList tensors, Scalar scalar) API (#42531)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42531

[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](https://github.com/pytorch/pytorch/pull/41554)

**Motivation**
[GitHub issue](https://github.com/pytorch/pytorch/issues/38655)
Current PyTorch optimizer implementations are not efficient when we work with many small feature tensors: launching a large number of kernels slows down the whole process, so we need to reduce the number of kernels we launch. As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex). To track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model's performance against the original model version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**Current API restrictions**
- The list can't be empty (will be fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.

**Broadcasting**
At this point we don't support broadcasting.

**What are the 'fast' and 'slow' routes**
In particular cases, we can't process an op with a fast list CUDA kernel. We can still fall back to a regular for-loop in which the op is applied to each tensor individually through the dispatch mechanism. A few checks decide whether the op takes the 'fast' or 'slow' path. To go the fast route:
- All tensors must have strided layout.
- All tensors must be dense and not have overlapping memory.
- The resulting tensor type must be the same.

---------------

**In this PR**
- Adding a `std::vector<Tensor> _foreach_add_(TensorList tensors, Scalar scalar)` API
- Resolving some additional comments from the previous [PR](https://github.com/pytorch/pytorch/pull/41554).

**Tests**
Tested via unit tests.

**TODO**
1. Properly handle empty lists

**Plan for the next PRs**
1. APIs: Binary Ops for list with Scalar, Binary Ops for list with list, Unary Ops for list, Pointwise Ops
2. Complete tasks from TODO
3. Rewrite PyTorch optimizers to use for-each operators for performance gains.

Test Plan: Imported from OSS
Reviewed By: zou3519
Differential Revision: D23331892
Pulled By: izdeby
fbshipit-source-id: c585b72e1e87f6f273f904f75445618915665c4c
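The kernel-count motivation can be made concrete with a sketch of the in-place scalar overload: one logical call updates every tensor in the list, where a naive optimizer loop would issue one kernel launch per tensor. Plain Python lists stand in for tensors:

```python
def foreach_add_(tensors, scalar):
    # In-place: every "tensor" in the list is updated; on the CUDA fast
    # path all of this work lands in a small, fixed number of launches
    # instead of len(tensors) separate ones.
    for t in tensors:
        for i, x in enumerate(t):
            t[i] = x + scalar
    return tensors

params = [[1.0, 2.0], [3.0]]
foreach_add_(params, 0.5)
print(params)  # → [[1.5, 2.5], [3.5]]
```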
e995c3d21e
Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar) (#41554)
Summary: Initial PR for the tensor list functionality.

**Motivation**
[GitHub issue](https://github.com/pytorch/pytorch/issues/38655)
Current PyTorch optimizer implementations are not efficient when we work with many small feature tensors: launching a large number of kernels slows down the whole process, so we need to reduce the number of kernels we launch. As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex). To track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model's performance against the original model version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**In this PR**
- Adding the `multi_tensor_apply` mechanism, which helps to efficiently apply a given functor to a list of tensors on CUDA.
- Adding a first private API: `std::vector<Tensor> _foreach_add(TensorList tensors, Scalar scalar)`

**Tests**
Tested via unit tests.

**Plan for the next PRs**
1. Cover these ops with `multi_tensor_apply` support: exponent, division, mul_, add_, addcmul_, addcdiv_, sqrt
2. Rewrite PyTorch optimizers to use for-each operators in order to get performance gains.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41554
Reviewed By: cpuhrsch
Differential Revision: D22829724
Pulled By: izdeby
fbshipit-source-id: 47febdbf7845cf931958a638567b7428a24782b1
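A rough sketch of the chunking idea behind `multi_tensor_apply`: flatten many small tensors into fixed-size chunks so that a single launch can cover all of them. This is conceptual only; the real mechanism builds metadata blocks consumed by one CUDA kernel, and the `chunk_size` value and functor interface here are invented for illustration:

```python
def multi_tensor_apply(functor, tensor_list, chunk_size=4):
    # Build (tensor, start, end) chunk descriptors across all tensors...
    chunks = []
    for t in tensor_list:
        for start in range(0, len(t), chunk_size):
            chunks.append((t, start, min(start + chunk_size, len(t))))
    # ...then process every chunk. A real CUDA launch would hand one
    # thread block per chunk to a single kernel and run them concurrently,
    # rather than looping here on the host.
    for t, lo, hi in chunks:
        for i in range(lo, hi):
            t[i] = functor(t[i])
    return tensor_list

grads = [[1.0] * 6, [2.0]]
multi_tensor_apply(lambda x: x + 1.0, grads)
print(grads)  # → [[2.0, 2.0, 2.0, 2.0, 2.0, 2.0], [3.0]]
```

The payoff is that launch overhead is paid once per batch of chunks instead of once per tensor, which is exactly the bottleneck the motivation section describes for optimizers over many small parameters.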