Commit Graph

20 Commits

Author SHA1 Message Date
Brian Hirsh
7a0f0d24d0 Codegen - error when an argument that looks like an out argument isn't a kwarg (fix #43273) (#47284)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47284
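For illustration only, a hedged Python sketch (field and helper names invented, not the actual tools/codegen model) of the kind of check the title describes:

```python
def check_out_args_are_kwargs(schema):
    """Error when a write-annotated Tensor argument that looks like an out
    argument is declared positionally instead of keyword-only (after '*')."""
    for arg in schema.positional_args:  # hypothetical field: args before '*'
        if arg.type == 'Tensor' and arg.annotation and arg.annotation.is_write:
            raise RuntimeError(
                f"{schema.name}: '{arg.name}' looks like an out argument "
                "but is not a kwarg; declare it after '*' in "
                "native_functions.yaml"
            )
```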

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D24706763

Pulled By: bdhirsh

fbshipit-source-id: 60fbe81a0dff7e07aa8c169235d15b84151d3ed7
2020-11-03 16:30:01 -08:00
Edward Yang
54d83296a9 Desugar missing dispatch field into singleton Math entry (#46970)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46970

Now that catchall declarations are reinterpreted as registrations to
dispatch key Math, we can simplify code generation logic by directly
generating to Math and bypassing the catchall logic.  This also helps
avoid bugs where we incorrectly classify some kernels as Math and others
as not, even though they get registered in the same way.

Bill of changes:
- Give Math its own unique TORCH_LIBRARY_IMPL
- Make it so NativeFunction.dispatch is always non-None.  Simplify
  downstream conditionals accordingly
- When parsing a NativeFunction, fill in a missing dispatch field with a
  singleton Math entry (pointing to the cpp.name!); see the sketch below
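
A minimal sketch, with assumed names (not the literal tools/codegen code), of that parse-time desugaring:

```python
def parse_dispatch(yaml_entry: dict, cpp_name: str) -> dict:
    """Desugar a missing 'dispatch' field into a singleton Math entry."""
    dispatch = yaml_entry.get('dispatch')
    if dispatch is None:
        # Former catchall: reinterpret as a registration to dispatch key
        # Math, pointing at the default C++ name.
        dispatch = {'Math': cpp_name}
    return dispatch  # NativeFunction.dispatch is now always non-None
```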

One somewhat large side effect of this change is that a lot of kernels
which previously didn't report as "math" now do.  I picked settings for
these booleans that made sense to me, but I'm not sure whether e.g. XLA
will handle them 100% correctly.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D24592391

Pulled By: ezyang

fbshipit-source-id: 2e3355f19f9525698864312418df08411f30a85d
2020-10-29 14:43:44 -07:00
Alban Desmaison
46b252b83a Revert D24262885: [pytorch][PR] Added foreach_zero_ API
Test Plan: revert-hammer

Differential Revision:
D24262885 (8e37dcb1f3)

Original commit changeset: 144c283dd009

fbshipit-source-id: 451b202e23bc1fcb11b20d26c11d9a1329789d22
2020-10-28 06:48:59 -07:00
iurii zdebskyi
8e37dcb1f3 Added foreach_zero_ API (#46215)
Summary:
Added a foreach_zero_(TensorList) API

Tested via unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46215

Reviewed By: zhangguanheng66

Differential Revision: D24262885

Pulled By: izdeby

fbshipit-source-id: 144c283dd00924083096d6d92eb9085cbd6097d3
2020-10-27 18:03:34 -07:00
Iurii Zdebskyi
e7564b076c Refactor scalar list APIs to use overloads (#45673)
Summary:
Refactor the foreach APIs to use overloads for scalar list inputs.
Tested via unit tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45673

Reviewed By: heitorschueroff

Differential Revision: D24053424

Pulled By: izdeby

fbshipit-source-id: 35976cc50b4acfe228a32ed26cede579d5621cde
2020-10-19 09:28:49 -07:00
Iurii Zdebskyi
8a074af929 Added scalar lists APIs for addcdiv and addcmul (#45932)
Summary:
1) Added new APIs:
 _foreach_addcdiv(Tensor(a!)[] self, Tensor[] tensor1, Tensor[] tensor2, float[] scalars)
 _foreach_addcdiv_(Tensor(a!)[] self, Tensor[] tensor1, Tensor[] tensor2, float[] scalars)
 _foreach_addcmul(Tensor(a!)[] self, Tensor[] tensor1, Tensor[] tensor2, float[] scalars)
 _foreach_addcmul_(Tensor(a!)[] self, Tensor[] tensor1, Tensor[] tensor2, float[] scalars)

2) Updated optimizers to use new APIs

Tested via unit tests
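
A hedged usage sketch of the new APIs (private, underscore-prefixed; addcdiv/addcmul semantics assumed, with one scalar per list entry):

```python
import torch

inputs  = [torch.ones(2) for _ in range(3)]
tensor1 = [torch.full((2,), 2.0) for _ in range(3)]
tensor2 = [torch.full((2,), 4.0) for _ in range(3)]
scalars = [1.0, 2.0, 3.0]

# out[i] = inputs[i] + scalars[i] * tensor1[i] / tensor2[i]
out = torch._foreach_addcdiv(inputs, tensor1, tensor2, scalars)
# In-place variant mutates `inputs` directly.
torch._foreach_addcmul_(inputs, tensor1, tensor2, scalars)
```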

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45932

Reviewed By: navahgar

Differential Revision: D24150306

Pulled By: izdeby

fbshipit-source-id: c2e65dedc95d9d81a2fdd116e41df0accb0b6f26
2020-10-14 08:12:37 -07:00
chengjun
5741de883a Define the record_stream method in native_functions.yaml (#44301)
Summary:
The record_stream method was hard-coded for the CUDA device. Defining record_stream in native_functions.yaml enables dynamic dispatch to different backend devices.
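
For context, a short usage sketch of the public method whose dispatch this changes (behavior for CUDA is unchanged):

```python
import torch

# record_stream marks a tensor as in use on a side stream, so the caching
# allocator won't reclaim its memory until that stream's work completes.
if torch.cuda.is_available():
    side = torch.cuda.Stream()
    x = torch.empty(1 << 20, device='cuda')
    with torch.cuda.stream(side):
        y = x * 2          # x is consumed on the side stream
    x.record_stream(side)  # now routed through the dispatcher
```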

Fixes https://github.com/pytorch/pytorch/issues/36556

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44301

Reviewed By: glaringlee

Differential Revision: D23763954

Pulled By: ezyang

fbshipit-source-id: e6d24f5e7892b56101fa858a6cad2abc5cdc4293
2020-10-13 09:15:22 -07:00
Edward Yang
944eb0e31d Add NativeFunctionGroup (#45918)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45918

This groups together related native functions (functional, inplace, out)
into a single group.  It's not used by anything yet, but Jiakai said this
would be useful for his stuff, so I'm putting it in immediately.
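
A hedged sketch (assumed attribute names, building on the NativeFunction.signature/kind helpers from #45131 further down this log) of what such grouping could look like:

```python
from collections import defaultdict

def group_native_functions(native_functions):
    """Group functional/inplace/out variants under their shared signature."""
    groups = defaultdict(dict)
    for f in native_functions:
        # signature() erases the functional/inplace/out distinction, so all
        # variants of one op map to the same key; kind() tells them apart.
        groups[f.signature()][f.kind()] = f
    return groups
```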

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: smessmer

Differential Revision: D24163526

Pulled By: ezyang

fbshipit-source-id: 9979b0fe9249c78e4a64a50c5ed0e2ab99f499b9
2020-10-13 08:34:36 -07:00
Sebastian Messmer
6ba6ecb048 Only use hacky_wrapper_for_legacy_signatures if an op needs it (#45742)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45742

Add a new flag to native_functions.yaml: `use_c10_dispatcher: hacky_wrapper_for_legacy_signatures`
and the codegen only wraps kernels in the aforementioned wrapper if that flag is set.
Apart from that, `use_c10_dispatcher: hacky_wrapper_for_legacy_signatures` is equivalent to `full`,
i.e. it has full boxing and unboxing support.

This greatly reduces the number of ops we apply the hacky_wrapper to, i.e. all ops marked as `use_c10_dispatcher: full` don't have it anymore.
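
A minimal sketch, with invented names (not the actual codegen internals), of the decision this flag drives:

```python
def registration_for(f, kernel: str) -> str:
    """Emit the kernel expression to register with the c10 dispatcher."""
    if f.use_c10_dispatcher == 'hacky_wrapper_for_legacy_signatures':
        # Equivalent to 'full' (boxing/unboxing supported), but the kernel
        # is first adapted from its legacy signature.
        return f'hacky_wrapper_for_legacy_signatures({kernel})'
    return kernel  # 'full': the kernel is registered as-is
```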
ghstack-source-id: 113982139

Test Plan:
waitforsandcastle

vs fbcode:
https://www.internalfb.com/intern/fblearner/details/214511705/

vs base diff:
https://www.internalfb.com/intern/fblearner/details/214693207/

Reviewed By: ezyang

Differential Revision: D23328718

fbshipit-source-id: be120579477b3a05f26ca5f75025bfac37617620
2020-10-12 09:39:18 -07:00
Ailing Zhang
d811d4d7ba Support DefaultBackend keyword in native_functions.yaml. (#45719)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45719

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D24165888

Pulled By: ailzhang

fbshipit-source-id: 9b3c5e71f5b6a985e1a43157813e7d77dbe13b07
2020-10-09 16:28:26 -07:00
Edward Yang
4583edb5d6 Add NativeFunction.signature and kind. (#45131)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45131

These make it easier to group native functions together and to determine
what kind of native function each one is (inplace/out/functional).  Currently
they are not used, but they may be useful for tools.autograd porters.
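
A hedged sketch of the classification, assuming the usual native_functions.yaml conventions (trailing '_' for in-place, an 'out' overload for out variants); the real field names may differ:

```python
def kind(func) -> str:
    """Classify a native function as inplace, out, or functional."""
    if func.name.endswith('_'):
        return 'inplace'          # e.g. add_
    if func.overload_name == 'out':
        return 'out'              # e.g. add.out with a Tensor(a!) out kwarg
    return 'functional'           # e.g. add
```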

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: zhangguanheng66

Differential Revision: D23872526

Pulled By: ezyang

fbshipit-source-id: 1d6e429ab9a1f0fdb764be4228c5bca4dce8f24e
2020-10-01 08:46:40 -07:00
Edward Yang
41bd5a5ee0 Switch all Sequences in tools.codegen.model to Tuple (#45127)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45127

I thought I was being clever by using Sequence, which doesn't commit to
List or Tuple, but forces read-onlyness in the type system.  However,
there is a runtime implication to using List or Tuple: Lists can't be
hashed, but Tuples can be!  This is important because I shortly want
to group by FunctionSchema, and to do this I need FunctionSchema to
be hashable.  Switch everything to Tuple for true immutability.
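
For illustration, the runtime difference in plain Python:

```python
# Tuples are immutable and hashable; lists are not. Hashability is what
# lets FunctionSchema serve as a dict key for grouping.
d = {(1, 2, 3): 'ok'}            # fine
try:
    d[[1, 2, 3]] = 'boom'
except TypeError as e:
    print(e)                     # unhashable type: 'list'
```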

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D23872527

Pulled By: ezyang

fbshipit-source-id: 5c8fae1c50a5ae47b4167543646d94ddcafff8c3
2020-10-01 08:41:53 -07:00
Michael Carilli
72bc3d9de4 Use MTA for amp grad unscaling, enforce op math type in MTA functors, and allow op lambdas (#44778)
Summary:
Amp gradient unscaling is a great use case for multi tensor apply (in fact it's the first case I wrote it for).  This PR adds an MTA unscale+infcheck functor.  Really excited to have it for `torch.cuda.amp`. izdeby, your interface was clean and straightforward to use, great work!

Labeled as bc-breaking because the native_functions.yaml exposure of unscale+infcheck changes from [`_amp_non_finite_check_and_unscale_` to `_amp_foreach_non_finite_check_and_unscale_`]( https://github.com/pytorch/pytorch/pull/44778/files#diff-f1e4b2c15de770d978d0eb77b53a4077L6289-L6293).

The PR also modifies Unary/Binary/Pointwise Functors to
- do ops' internal math in FP32 for FP16 or bfloat16 inputs, which improves precision ([and throughput, on some architectures!](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions)) and has no downside for the ops we care about.
- accept an instantiated op functor rather than an op functor template (`template<class> class Op`).  This allows calling code to pass lambdas.

Open question:  As written now, the PR has MTA Functors take care of pre- and post-casting FP16/bfloat16 inputs to FP32 before running the ops.  However, alternatively, the pre- and post-math casting could be deferred/written into the ops themselves, which gives them a bit more control.  I can easily rewrite it that way if you prefer.
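
As a conceptual reference only (the real implementation is a single fused multi-tensor-apply CUDA kernel, not this Python loop), the unfused semantics are roughly:

```python
import torch

def unscale_and_check(grads, found_inf, inv_scale):
    """Reference semantics for the fused unscale+infcheck op."""
    for g in grads:
        g.mul_(inv_scale)              # unscale in place (math kept in FP32)
        if not torch.isfinite(g).all():
            found_inf.fill_(1.0)       # flag inf/NaN for the GradScaler
```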

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44778

Reviewed By: gchanan

Differential Revision: D23944102

Pulled By: izdeby

fbshipit-source-id: 22b25ccad5f69b413c77afe8733fa9cacc8e766d
2020-10-01 07:51:16 -07:00
Iurii Zdebskyi
d5748d9a1a Enable binary ops with Scalar Lists for foreach APIs (#45298)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45298

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23931986

Pulled By: izdeby

fbshipit-source-id: 281267cd6f90d57a169af89f9f10b0f4fcab47e3
2020-09-25 12:58:34 -07:00
Xinyu Li
26001a2334 Revert D23753711: [pytorch][PR] Add foreach APIs for binary ops with ScalarList
Test Plan: revert-hammer

Differential Revision:
D23753711 (71d1b5b0e2)

Original commit changeset: bf3e8c54bc07

fbshipit-source-id: 192692e0d3fff4cade9983db0a1760fedfc9674c
2020-09-24 11:55:49 -07:00
iurii zdebskyi
71d1b5b0e2 Add foreach APIs for binary ops with ScalarList (#44743)
Summary:
In this PR:
1) Added binary operations with ScalarLists.
2) Fixed a _foreach_div(...) bug in native_functions.
3) Covered all possible cases with scalars and scalar lists in tests.
4) [minor] Fixed a bug in native_functions by adding "use_c10_dispatcher: full" to all _foreach functions.

Tested via unit tests
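
A hedged usage sketch of the scalar-list overloads (one scalar per tensor in the list):

```python
import torch

tensors = [torch.ones(2) for _ in range(3)]
scalars = [2.0, 3.0, 4.0]
out = torch._foreach_mul(tensors, scalars)  # out[i] = tensors[i] * scalars[i]
torch._foreach_div_(tensors, scalars)       # in-place variant
```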

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44743

Reviewed By: bwasti, malfet

Differential Revision: D23753711

Pulled By: izdeby

fbshipit-source-id: bf3e8c54bc07867e8f6e82b5d3d35ff8e99b5a0a
2020-09-24 08:30:42 -07:00
Iurii Zdebskyi
cce5982c4c Add unary ops: exp and sqrt (#42537)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42537

[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](https://github.com/pytorch/pytorch/pull/41554).

**Motivation**
[GitHub issue](https://github.com/pytorch/pytorch/issues/38655)
Current PyTorch optimizer implementations are not efficient in cases where we work with a lot of small feature tensors: starting a lot of kernels slows down the whole process, so we need to reduce the number of kernels that we start.
As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex).
To track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model's performance against the original model version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**Current API restrictions**
- List can't be empty (will be fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.

**Broadcasting**
At this point we don't support broadcasting.

**What is 'Fast' and 'Slow' route**
In particular cases, we can't process an op with a fast list CUDA kernel. Still, we can fall back to a regular for-loop in which the op is applied to each tensor individually through the dispatch mechanism. A few checks (sketched in the code below the list) decide whether the op is performed via the 'fast' or 'slow' path.
To go the fast route,
- All tensors must have strided layout
- All tensors must be dense and not have overlapping memory
- The resulting tensor type must be the same.
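
A hedged Python sketch of those checks (the helper name is invented; the real gating lives in the C++ foreach utilities):

```python
import torch

def can_use_fast_route(tensors) -> bool:
    ref = tensors[0]
    return all(
        t.layout == torch.strided      # strided layout only
        and t.is_contiguous()          # dense, no overlapping memory
        and t.dtype == ref.dtype       # resulting type must stay the same
        and t.device == ref.device
        for t in tensors
    )
```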

----------------
**In this PR**
Adding APIs:
```
torch._foreach_exp(TensorList tl1)
torch._foreach_exp_(TensorList tl1)
torch._foreach_sqrt(TensorList tl1)
torch._foreach_sqrt_(TensorList tl1)
```
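
A usage sketch; on the fast path each call launches one fused kernel instead of one kernel per tensor:

```python
import torch

tensors = [torch.rand(5) for _ in range(10)]
outs = torch._foreach_exp(tensors)     # like [t.exp() for t in tensors]
assert all(torch.allclose(o, t.exp()) for o, t in zip(outs, tensors))
torch._foreach_sqrt_(tensors)          # in-place variant mutates each tensor
```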

**Tests**
Tested via unit tests

**TODO**
1. Properly handle empty lists
2. Properly handle bool tensors

**Plan for the next PRs**
1. APIs
- Pointwise Ops

2. Complete tasks from TODO
3. Rewrite PyTorch optimizers to use for-each operators for performance gains.

Test Plan: Imported from OSS

Reviewed By: cpuhrsch

Differential Revision: D23331889

Pulled By: izdeby

fbshipit-source-id: 8b04673b8412957472ed56361954ca3884eb9376
2020-09-07 19:57:34 -07:00
Iurii Zdebskyi
10dd25dcd1 Add binary ops for _foreach APIs (#42536)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42536

[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](https://github.com/pytorch/pytorch/pull/41554).

**Motivation**
[GitHub issue](https://github.com/pytorch/pytorch/issues/38655)
Current PyTorch optimizer implementations are not efficient in cases where we work with a lot of small feature tensors: starting a lot of kernels slows down the whole process, so we need to reduce the number of kernels that we start.
As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex).
To track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model's performance against the original model version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**Current API restrictions**
- List can't be empty (will be fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.

**Broadcasting**
At this point we don't support broadcasting.

**What is 'Fast' and 'Slow' route**
In particular cases, we can't process an op with a fast list CUDA kernel. Still, we can fall back to a regular for-loop in which the op is applied to each tensor individually through the dispatch mechanism. A few checks decide whether the op is performed via the 'fast' or 'slow' path.
To go the fast route,
- All tensors must have strided layout
- All tensors must be dense and not have overlapping memory
- The resulting tensor type must be the same.

----------------
**In this PR**
Adding APIs:
```
torch._foreach_sub(TensorList tl1, TensorList tl2)
torch._foreach_sub_(TensorList self, TensorList tl2)
torch._foreach_mul(TensorList tl1, TensorList tl2)
torch._foreach_mul_(TensorList self, TensorList tl2)
torch._foreach_div(TensorList tl1, TensorList tl2)
torch._foreach_div_(TensorList self, TensorList tl2)

torch._foreach_sub(TensorList tl1, Scalar scalar)
torch._foreach_sub_(TensorList self, Scalar scalar)
torch._foreach_mul(TensorList tl1, Scalar scalar)
torch._foreach_mul_(TensorList self, Scalar scalar)
torch._foreach_div(TensorList tl1, Scalar scalar)
torch._foreach_div_(TensorList self, Scalar scalar)
```
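
A usage sketch of the list-with-list and list-with-scalar variants above:

```python
import torch

a = [torch.ones(3) for _ in range(4)]
b = [torch.full((3,), 2.0) for _ in range(4)]
out = torch._foreach_mul(a, b)   # out[i] = a[i] * b[i]
torch._foreach_sub_(a, 0.5)      # subtracts the scalar from every tensor in place
```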

**Tests**
Tested via unit tests

**TODO**
1. Properly handle empty lists
2. Properly handle bool tensors

**Plan for the next PRs**
1. APIs
- Unary Ops for list
- Pointwise Ops

2. Complete tasks from TODO
3. Rewrite PyTorch optimizers to use for-each operators for performance gains.

Test Plan: Imported from OSS

Reviewed By: cpuhrsch

Differential Revision: D23331891

Pulled By: izdeby

fbshipit-source-id: 18c5937287e33e825b2e391e41864dd64e226f19
2020-09-07 10:29:32 -07:00
Iurii Zdebskyi
297c938729 Add _foreach_add(TensorList tl1, TensorList tl2) and _foreach_add_(TensorList tl1, TensorList tl2) APIs (#42533)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42533

[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](https://github.com/pytorch/pytorch/pull/41554).

**Motivation**
[GitHub issue](https://github.com/pytorch/pytorch/issues/38655)
Current PyTorch optimizer implementations are not efficient in cases where we work with a lot of small feature tensors: starting a lot of kernels slows down the whole process, so we need to reduce the number of kernels that we start.
As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex).
To track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model's performance against the original model version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**Current API restrictions**
- List can't be empty (will be fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.

**Broadcasting**
At this point we don't support broadcasting.

**What is 'Fast' and 'Slow' route**
In particular cases, we can't process an op with a fast list CUDA kernel. Still, we can fall back to a regular for-loop in which the op is applied to each tensor individually through the dispatch mechanism. A few checks decide whether the op is performed via the 'fast' or 'slow' path.
To go the fast route,
- All tensors must have strided layout
- All tensors must be dense and not have overlapping memory
- The resulting tensor type must be the same.

----------------
**In this PR**
- Adding a `_foreach_add(TensorList tl1, TensorList tl2)` API
- Adding a `_foreach_add_(TensorList tl1, TensorList tl2)` API
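
A usage sketch tying these to the optimizer motivation above: one fused call replaces a Python loop of per-tensor updates.

```python
import torch

params = [torch.randn(4) for _ in range(10)]
grads  = [torch.randn(4) for _ in range(10)]
# One fused call instead of: for p, g in zip(params, grads): p.add_(g)
torch._foreach_add_(params, grads)
new = torch._foreach_add(params, grads)   # out-of-place variant
```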

**Tests**
Tested via unit tests

**TODO**
1. Properly handle empty lists

**Plan for the next PRs**
1. APIs
- Binary Ops for list with Scalar
- Binary Ops for list with list
- Unary Ops for list
- Pointwise Ops

2. Complete tasks from TODO
3. Rewrite PyTorch optimizers to use for-each operators for performance gains.

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D23331894

Pulled By: izdeby

fbshipit-source-id: 876dd1bc82750f609b9e3ba23c8cad94d8d6041c
2020-09-02 12:18:28 -07:00
Edward Yang
6ea89166bd Rewrite of ATen code generator (#42629)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42629

How to approach reviewing this diff:

- The new codegen itself lives in `tools/codegen`. Start with `gen.py`, then read `model.py` and then the `api/` folder. The comments at the top of the files describe what is going on. The CLI interface of the new codegen is similar to the old one, but (1) it is no longer necessary to explicitly specify cwrap inputs (and now we will error if you do so) and (2) the default settings for source and install dir are much better; to the extent that if you run the codegen from the root source directory as just `python -m tools.codegen.gen`, something reasonable will happen.
- The old codegen is (nearly) entirely deleted; every Python file in `aten/src/ATen` was deleted except for `common_with_cwrap.py`, which now permanently finds its home in `tools/shared/cwrap_common.py` (previously cmake copied the file there), and `code_template.py`, which now lives in `tools/codegen/code_template.py`. We remove the copying logic for `common_with_cwrap.py`.
- All of the inputs to the old codegen are deleted.
- Build rules now have to be adjusted to not refer to files that no longer exist, and to abide by the (slightly modified) CLI.
- LegacyTHFunctions files have been generated and checked in. We expect these to be deleted as these final functions get ported to ATen. The deletion process is straightforward: just delete the functions as you port them. There are 39 more functions left to port.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D23183978

Pulled By: ezyang

fbshipit-source-id: 6073ba432ad182c7284a97147b05f0574a02f763
2020-08-31 09:00:22 -07:00