Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45742
Add a new flag to native_functions.yaml: `use_c10_dispatcher: hacky_wrapper_for_legacy_signatures`
and the codegen only wraps kernels in the aforementioned wrapper if that flag is set.
Apart from that, `use_c10_dispatcher: hacky_wrapper_for_legacy_signatures` is equivalent to `full`,
i.e. it has full boxing and unboxing support.
This greatly reduces the number of ops we apply the hacky_wrapper to, i.e. all ops marked as `use_c10_dispatcher: full` don't have it anymore.
ghstack-source-id: 113982139
Test Plan:
waitforsandcastle
vs fbcode:
https://www.internalfb.com/intern/fblearner/details/214511705/
vs base diff:
https://www.internalfb.com/intern/fblearner/details/214693207/
Reviewed By: ezyang
Differential Revision: D23328718
fbshipit-source-id: be120579477b3a05f26ca5f75025bfac37617620
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45131
These make it easier to group native functions together and determine
what kind of native function it is (inplace/out/functional). Currently
they are not used but they may be useful for tools.autograd porters.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: zhangguanheng66
Differential Revision: D23872526
Pulled By: ezyang
fbshipit-source-id: 1d6e429ab9a1f0fdb764be4228c5bca4dce8f24e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45127
I thought I was being clever by using Sequence, which doesn't commit to
List or Tuple, but forces read-onlyness in the type system. However,
there is runtime implication to using List or Tuple: Lists can't be
hashed, but Tuples can be! This is important because I shortly want
to group by FunctionSchema, and to do this I need FunctionSchema to
be hashable. Switch everything to Tuple for true immutability.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: gchanan
Differential Revision: D23872527
Pulled By: ezyang
fbshipit-source-id: 5c8fae1c50a5ae47b4167543646d94ddcafff8c3
Summary:
Amp gradient unscaling is a great use case for multi tensor apply (in fact it's the first case I wrote it for). This PR adds an MTA unscale+infcheck functor. Really excited to have it for `torch.cuda.amp`. izdeby your interface was clean and straightforward to use, great work!
Labeled as bc-breaking because the native_functions.yaml exposure of unscale+infcheck changes from [`_amp_non_finite_check_and_unscale_` to `_amp_foreach_non_finite_check_and_unscale_`]( https://github.com/pytorch/pytorch/pull/44778/files#diff-f1e4b2c15de770d978d0eb77b53a4077L6289-L6293).
The PR also modifies Unary/Binary/Pointwise Functors to
- do ops' internal math in FP32 for FP16 or bfloat16 inputs, which improves precision ([and throughput, on some architectures!](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions)) and has no downside for the ops we care about.
- accept an instantiated op functor rather than an op functor template (`template<class> class Op`). This allows calling code to pass lambdas.
Open question: As written now, the PR has MTA Functors take care of pre- and post-casting FP16/bfloat16 inputs to FP32 before running the ops. However, alternatively, the pre- and post-math casting could be deferred/written into the ops themselves, which gives them a bit more control. I can easily rewrite it that way if you prefer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44778
Reviewed By: gchanan
Differential Revision: D23944102
Pulled By: izdeby
fbshipit-source-id: 22b25ccad5f69b413c77afe8733fa9cacc8e766d
Summary:
In this PR:
1) Added binary operations with ScalarLists.
2) Fixed _foreach_div(...) bug in native_functions
3) Covered all possible cases with scalars and scalar lists in tests
4) [minor] fixed bug in native_functions by adding "use_c10_dispatcher: full" to all _foreach functions
tested via unit tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44743
Reviewed By: bwasti, malfet
Differential Revision: D23753711
Pulled By: izdeby
fbshipit-source-id: bf3e8c54bc07867e8f6e82b5d3d35ff8e99b5a0a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42537
[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](https://github.com/pytorch/pytorch/pull/41554).
**Motivation**
[GitHub issue](https://github.com/pytorch/pytorch/issues/38655)
Current PyTorch optimizer implementations are not efficient in cases when we work with a lot of small feature tensors. Starting a lot of kernels slows down the whole process. We need to reduce the number of kernels that we start.
As an example, we should be looking at [NVIDIAs Apex](https://github.com/NVIDIA/apex).
In order to track progress, we will pick PyTorchs DCGAN model with Adam optimizer and once the optimizer is reimplemented with tensor lists, benchmark the model performance against original model version, Apexs version with original Adam optimizer and it’s FusedAdam optimizer.
**Current API restrictions**
- List can't be empty (will fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.
**Broadcasting**
At this point we don't support broadcasting.
**What is 'Fast' and 'Slow' route**
In particular cases, we cant process an op with a fast list CUDA kernel. Still, we can do with a regular for-loop where the op will be applied to each tensor individually through the dispatch mechanisms. There are a few checks that decide whether the op will be performed via a 'fast' or 'slow' path.
To go the fast route,
- All tensors must have strided layout
- All tensors must be dense and not have overlapping memory
- The resulting tensor type must be the same.
----------------
**In this PR**
Adding APIs:
```
torch._foreach_exp(TensorList tl1)
torch._foreach_exp_(TensorList tl1)
torch._foreach_sqrt(TensorList tl1)
torch._foreach_sqrt_(TensorList tl1)
```
**Tests**
Tested via unit tests
**TODO**
1. Properly handle empty lists
2. Properly handle bool tensors
**Plan for the next PRs**
1. APIs
- Pointwise Ops
2. Complete tasks from TODO
3. Rewrite PyTorch optimizers to use for-each operators for performance gains.
Test Plan: Imported from OSS
Reviewed By: cpuhrsch
Differential Revision: D23331889
Pulled By: izdeby
fbshipit-source-id: 8b04673b8412957472ed56361954ca3884eb9376
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42536
[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](https://github.com/pytorch/pytorch/pull/41554).
**Motivation**
[GitHub issue](https://github.com/pytorch/pytorch/issues/38655)
Current PyTorch optimizer implementations are not efficient in cases when we work with a lot of small feature tensors. Starting a lot of kernels slows down the whole process. We need to reduce the number of kernels that we start.
As an example, we should be looking at [NVIDIAs Apex](https://github.com/NVIDIA/apex).
In order to track progress, we will pick PyTorchs DCGAN model with Adam optimizer and once the optimizer is reimplemented with tensor lists, benchmark the model performance against original model version, Apexs version with original Adam optimizer and it’s FusedAdam optimizer.
**Current API restrictions**
- List can't be empty (will fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.
**Broadcasting**
At this point we don't support broadcasting.
**What is 'Fast' and 'Slow' route**
In particular cases, we cant process an op with a fast list CUDA kernel. Still, we can do with a regular for-loop where the op will be applied to each tensor individually through the dispatch mechanisms. There are a few checks that decide whether the op will be performed via a 'fast' or 'slow' path.
To go the fast route,
- All tensors must have strided layout
- All tensors must be dense and not have overlapping memory
- The resulting tensor type must be the same.
----------------
**In this PR**
Adding APIs:
```
torch._foreach_sub(TensorList tl1, TensorList tl2)
torch._foreach_sub_(TensorList self, TensorList tl2)
torch._foreach_mul(TensorList tl1, TensorList tl2)
torch._foreach_mul_(TensorList self, TensorList tl2)
torch._foreach_div(TensorList tl1, TensorList tl2)
torch._foreach_div_(TensorList self, TensorList tl2)
torch._foreach_sub(TensorList tl1, Scalar scalar)
torch._foreach_sub_(TensorList self, Scalar scalar)
torch._foreach_mul(TensorList tl1, Scalar scalar)
torch._foreach_mul_(TensorList self, Scalar scalar)
torch._foreach_div(TensorList tl1, Scalar scalar)
torch._foreach_div(TensorList self, Scalar scalar)
```
**Tests**
Tested via unit tests
**TODO**
1. Properly handle empty lists
2. Properly handle bool tensors
**Plan for the next PRs**
1. APIs
- Unary Ops for list
- Pointwise Ops
2. Complete tasks from TODO
3. Rewrite PyTorch optimizers to use for-each operators for performance gains.
Test Plan: Imported from OSS
Reviewed By: cpuhrsch
Differential Revision: D23331891
Pulled By: izdeby
fbshipit-source-id: 18c5937287e33e825b2e391e41864dd64e226f19
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42533
[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](https://github.com/pytorch/pytorch/pull/41554).
**Motivation**
[GitHub issue](https://github.com/pytorch/pytorch/issues/38655)
Current PyTorch optimizer implementations are not efficient in cases when we work with a lot of small feature tensors. Starting a lot of kernels slows down the whole process. We need to reduce the number of kernels that we start.
As an example, we should be looking at [NVIDIAs Apex](https://github.com/NVIDIA/apex).
In order to track progress, we will pick PyTorchs DCGAN model with Adam optimizer and once the optimizer is reimplemented with tensor lists, benchmark the model performance against original model version, Apexs version with original Adam optimizer and it’s FusedAdam optimizer.
**Current API restrictions**
- List can't be empty (will fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.
**Broadcasting**
At this point we don't support broadcasting.
**What is 'Fast' and 'Slow' route**
In particular cases, we cant process an op with a fast list CUDA kernel. Still, we can do with a regular for-loop where the op will be applied to each tensor individually through the dispatch mechanisms. There are a few checks that decide whether the op will be performed via a 'fast' or 'slow' path.
To go the fast route,
- All tensors must have strided layout
- All tensors must be dense and not have overlapping memory
- The resulting tensor type must be the same.
----------------
**In this PR**
- Adding a `_foreach_add(TensorList tl1, TensorList tl2)` API
- Adding a `_foreach_add_(TensorList tl1, TensorList tl2)` API
**Tests**
Tested via unit tests
**TODO**
1. Properly handle empty lists
**Plan for the next PRs**
1. APIs
- Binary Ops for list with Scalar
- Binary Ops for list with list
- Unary Ops for list
- Pointwise Ops
2. Complete tasks from TODO
3. Rewrite PyTorch optimizers to use for-each operators for performance gains.
Test Plan: Imported from OSS
Reviewed By: zou3519
Differential Revision: D23331894
Pulled By: izdeby
fbshipit-source-id: 876dd1bc82750f609b9e3ba23c8cad94d8d6041c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42629
How to approach reviewing this diff:
- The new codegen itself lives in `tools/codegen`. Start with `gen.py`, then read `model.py` and them the `api/` folder. The comments at the top of the files describe what is going on. The CLI interface of the new codegen is similar to the old one, but (1) it is no longer necessary to explicitly specify cwrap inputs (and now we will error if you do so) and (2) the default settings for source and install dir are much better; to the extent that if you run the codegen from the root source directory as just `python -m tools.codegen.gen`, something reasonable will happen.
- The old codegen is (nearly) entirely deleted; every Python file in `aten/src/ATen` was deleted except for `common_with_cwrap.py`, which now permanently finds its home in `tools/shared/cwrap_common.py` (previously cmake copied the file there), and `code_template.py`, which now lives in `tools/codegen/code_template.py`. We remove the copying logic for `common_with_cwrap.py`.
- All of the inputs to the old codegen are deleted.
- Build rules now have to be adjusted to not refer to files that no longer exist, and to abide by the (slightly modified) CLI.
- LegacyTHFunctions files have been generated and checked in. We expect these to be deleted as these final functions get ported to ATen. The deletion process is straightforward; just delete the functions of the ones you are porting. There are 39 more functions left to port.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: bhosmer
Differential Revision: D23183978
Pulled By: ezyang
fbshipit-source-id: 6073ba432ad182c7284a97147b05f0574a02f763