Summary:
In this PR:
1) Added binary operations with ScalarLists.
2) Fixed _foreach_div(...) bug in native_functions
3) Covered all possible cases with scalars and scalar lists in tests
4) [minor] fixed bug in native_functions by adding "use_c10_dispatcher: full" to all _foreach functions
tested via unit tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44743
Reviewed By: bwasti, malfet
Differential Revision: D23753711
Pulled By: izdeby
fbshipit-source-id: bf3e8c54bc07867e8f6e82b5d3d35ff8e99b5a0a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42537
[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](https://github.com/pytorch/pytorch/pull/41554).
**Motivation**
[GitHub issue](https://github.com/pytorch/pytorch/issues/38655)
Current PyTorch optimizer implementations are not efficient in cases when we work with a lot of small feature tensors. Starting a lot of kernels slows down the whole process. We need to reduce the number of kernels that we start.
As an example, we should be looking at [NVIDIAs Apex](https://github.com/NVIDIA/apex).
In order to track progress, we will pick PyTorchs DCGAN model with Adam optimizer and once the optimizer is reimplemented with tensor lists, benchmark the model performance against original model version, Apexs version with original Adam optimizer and it’s FusedAdam optimizer.
**Current API restrictions**
- List can't be empty (will fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.
**Broadcasting**
At this point we don't support broadcasting.
**What is 'Fast' and 'Slow' route**
In particular cases, we cant process an op with a fast list CUDA kernel. Still, we can do with a regular for-loop where the op will be applied to each tensor individually through the dispatch mechanisms. There are a few checks that decide whether the op will be performed via a 'fast' or 'slow' path.
To go the fast route,
- All tensors must have strided layout
- All tensors must be dense and not have overlapping memory
- The resulting tensor type must be the same.
----------------
**In this PR**
Adding APIs:
```
torch._foreach_exp(TensorList tl1)
torch._foreach_exp_(TensorList tl1)
torch._foreach_sqrt(TensorList tl1)
torch._foreach_sqrt_(TensorList tl1)
```
**Tests**
Tested via unit tests
**TODO**
1. Properly handle empty lists
2. Properly handle bool tensors
**Plan for the next PRs**
1. APIs
- Pointwise Ops
2. Complete tasks from TODO
3. Rewrite PyTorch optimizers to use for-each operators for performance gains.
Test Plan: Imported from OSS
Reviewed By: cpuhrsch
Differential Revision: D23331889
Pulled By: izdeby
fbshipit-source-id: 8b04673b8412957472ed56361954ca3884eb9376
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42536
[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](https://github.com/pytorch/pytorch/pull/41554).
**Motivation**
[GitHub issue](https://github.com/pytorch/pytorch/issues/38655)
Current PyTorch optimizer implementations are not efficient in cases when we work with a lot of small feature tensors. Starting a lot of kernels slows down the whole process. We need to reduce the number of kernels that we start.
As an example, we should be looking at [NVIDIAs Apex](https://github.com/NVIDIA/apex).
In order to track progress, we will pick PyTorchs DCGAN model with Adam optimizer and once the optimizer is reimplemented with tensor lists, benchmark the model performance against original model version, Apexs version with original Adam optimizer and it’s FusedAdam optimizer.
**Current API restrictions**
- List can't be empty (will fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.
**Broadcasting**
At this point we don't support broadcasting.
**What is 'Fast' and 'Slow' route**
In particular cases, we cant process an op with a fast list CUDA kernel. Still, we can do with a regular for-loop where the op will be applied to each tensor individually through the dispatch mechanisms. There are a few checks that decide whether the op will be performed via a 'fast' or 'slow' path.
To go the fast route,
- All tensors must have strided layout
- All tensors must be dense and not have overlapping memory
- The resulting tensor type must be the same.
----------------
**In this PR**
Adding APIs:
```
torch._foreach_sub(TensorList tl1, TensorList tl2)
torch._foreach_sub_(TensorList self, TensorList tl2)
torch._foreach_mul(TensorList tl1, TensorList tl2)
torch._foreach_mul_(TensorList self, TensorList tl2)
torch._foreach_div(TensorList tl1, TensorList tl2)
torch._foreach_div_(TensorList self, TensorList tl2)
torch._foreach_sub(TensorList tl1, Scalar scalar)
torch._foreach_sub_(TensorList self, Scalar scalar)
torch._foreach_mul(TensorList tl1, Scalar scalar)
torch._foreach_mul_(TensorList self, Scalar scalar)
torch._foreach_div(TensorList tl1, Scalar scalar)
torch._foreach_div(TensorList self, Scalar scalar)
```
**Tests**
Tested via unit tests
**TODO**
1. Properly handle empty lists
2. Properly handle bool tensors
**Plan for the next PRs**
1. APIs
- Unary Ops for list
- Pointwise Ops
2. Complete tasks from TODO
3. Rewrite PyTorch optimizers to use for-each operators for performance gains.
Test Plan: Imported from OSS
Reviewed By: cpuhrsch
Differential Revision: D23331891
Pulled By: izdeby
fbshipit-source-id: 18c5937287e33e825b2e391e41864dd64e226f19
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42533
[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](https://github.com/pytorch/pytorch/pull/41554).
**Motivation**
[GitHub issue](https://github.com/pytorch/pytorch/issues/38655)
Current PyTorch optimizer implementations are not efficient in cases when we work with a lot of small feature tensors. Starting a lot of kernels slows down the whole process. We need to reduce the number of kernels that we start.
As an example, we should be looking at [NVIDIAs Apex](https://github.com/NVIDIA/apex).
In order to track progress, we will pick PyTorchs DCGAN model with Adam optimizer and once the optimizer is reimplemented with tensor lists, benchmark the model performance against original model version, Apexs version with original Adam optimizer and it’s FusedAdam optimizer.
**Current API restrictions**
- List can't be empty (will fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.
**Broadcasting**
At this point we don't support broadcasting.
**What is 'Fast' and 'Slow' route**
In particular cases, we cant process an op with a fast list CUDA kernel. Still, we can do with a regular for-loop where the op will be applied to each tensor individually through the dispatch mechanisms. There are a few checks that decide whether the op will be performed via a 'fast' or 'slow' path.
To go the fast route,
- All tensors must have strided layout
- All tensors must be dense and not have overlapping memory
- The resulting tensor type must be the same.
----------------
**In this PR**
- Adding a `_foreach_add(TensorList tl1, TensorList tl2)` API
- Adding a `_foreach_add_(TensorList tl1, TensorList tl2)` API
**Tests**
Tested via unit tests
**TODO**
1. Properly handle empty lists
**Plan for the next PRs**
1. APIs
- Binary Ops for list with Scalar
- Binary Ops for list with list
- Unary Ops for list
- Pointwise Ops
2. Complete tasks from TODO
3. Rewrite PyTorch optimizers to use for-each operators for performance gains.
Test Plan: Imported from OSS
Reviewed By: zou3519
Differential Revision: D23331894
Pulled By: izdeby
fbshipit-source-id: 876dd1bc82750f609b9e3ba23c8cad94d8d6041c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42629
How to approach reviewing this diff:
- The new codegen itself lives in `tools/codegen`. Start with `gen.py`, then read `model.py` and them the `api/` folder. The comments at the top of the files describe what is going on. The CLI interface of the new codegen is similar to the old one, but (1) it is no longer necessary to explicitly specify cwrap inputs (and now we will error if you do so) and (2) the default settings for source and install dir are much better; to the extent that if you run the codegen from the root source directory as just `python -m tools.codegen.gen`, something reasonable will happen.
- The old codegen is (nearly) entirely deleted; every Python file in `aten/src/ATen` was deleted except for `common_with_cwrap.py`, which now permanently finds its home in `tools/shared/cwrap_common.py` (previously cmake copied the file there), and `code_template.py`, which now lives in `tools/codegen/code_template.py`. We remove the copying logic for `common_with_cwrap.py`.
- All of the inputs to the old codegen are deleted.
- Build rules now have to be adjusted to not refer to files that no longer exist, and to abide by the (slightly modified) CLI.
- LegacyTHFunctions files have been generated and checked in. We expect these to be deleted as these final functions get ported to ATen. The deletion process is straightforward; just delete the functions of the ones you are porting. There are 39 more functions left to port.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: bhosmer
Differential Revision: D23183978
Pulled By: ezyang
fbshipit-source-id: 6073ba432ad182c7284a97147b05f0574a02f763