Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48195
The general approach is to change `Arguments`, whose arguments were previously grouped as `positional`, `kwarg_only`, and `out`: `positional` is split into `pre_self_positional`, `self_arg`, and `post_self_positional`, and `kwarg_only` is split into `pre_tensor_options_kwarg_only`, `tensor_options`, and `post_tensor_options_kwarg_only`. The splits are as you'd expect: we extract the self argument and the tensor options arguments, and record the other arguments that came before and after. To do this, we move the logic in `group_arguments` into the parsing process.
Some fuzz in the process:
* I renamed `ThisArgument` to `SelfArgument`, since we don't actually use the terminology "this" outside of C++ (and the model is Python-biased)
* I kept the `group_arguments` function, which now just reads out the arguments from the structured model in the correct order. In the long term, we should get rid of this function entirely, but for now I kept it as is to reduce churn.
* I decided to arbitrarily say that when self is missing, everything goes in "post-self", but when tensor options is missing, everything goes in "pre-tensor-options". This was based on where you typically find the argument in question: self is usually at the front (so most args come after it), while tensor options are typically at the end (so most args go before them).
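A minimal sketch of what the regrouped `Arguments` described above might look like (field names are taken from this summary; the actual dataclass in tools.codegen.model may differ in details such as the `out` handling):
```
from dataclasses import dataclass
from typing import Optional, Tuple

# Stand-ins for the real codegen model types (sketch only).
Argument = object
SelfArgument = object
TensorOptionsArguments = object

@dataclass(frozen=True)
class Arguments:
    # What used to be `positional`, split around self:
    pre_self_positional: Tuple[Argument, ...]
    self_arg: Optional[SelfArgument]                  # None if the function has no self
    post_self_positional: Tuple[Argument, ...]
    # What used to be `kwarg_only`, split around tensor options:
    pre_tensor_options_kwarg_only: Tuple[Argument, ...]
    tensor_options: Optional[TensorOptionsArguments]  # None if there are no tensor options
    post_tensor_options_kwarg_only: Tuple[Argument, ...]
    # The out-argument group.
    out: Tuple[Argument, ...]
```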
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: zhangguanheng66
Differential Revision: D25231166
Pulled By: ezyang
fbshipit-source-id: 25d77ad8319c4ce0bba4ad82e451bf536ef823ad
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48182
I'm planning to add a bunch more argument fields following
https://github.com/pytorch/pytorch/pull/45890#discussion_r503646917 and
it will be a lot more convenient if the arguments get to live
in their own dedicated struct. The type checker will tell you if
I've done it wrong. No change to output.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: ljk53
Differential Revision: D25057897
Pulled By: ezyang
fbshipit-source-id: dd377181dad6ab0c894d19d83408b7812775a691
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48249
Introduced autograd-related data models in tools.codegen.api.autograd.
Migrated load_derivatives.py to produce the new data models from derivatives.yaml.
It has a clean mypy-strict result.
Changed both gen_autograd_functions.py and gen_variable_type.py to consume
the new data model.
Added type annotations to gen_autograd_functions.py - it has a clean mypy-strict
result except for the .gen_autograd import (so it hasn't been added to the strict
config in this PR).
To limit the scope of the PR, gen_variable_type.py is not refactored, and the
main structure of load_derivatives.py / gen_autograd_functions.py is kept. We
only make the changes necessary to make it work.
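As a rough illustration only (the class and field names here are a guess at the shape of the data; the real definitions live in tools.codegen.api.autograd), a derivatives.yaml entry might be modeled roughly like:
```
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Derivative:
    formula: str                 # the backward formula string from derivatives.yaml
    var_names: Tuple[str, ...]   # the forward arguments this formula differentiates w.r.t.

@dataclass(frozen=True)
class DifferentiabilityInfo:
    name: str                            # operator name from derivatives.yaml
    derivatives: Tuple[Derivative, ...]  # one entry per differentiable input (or group of inputs)
```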
Confirmed byte-for-byte compatible with the old codegen:
```
# Run it before and after this PR:
.jenkins/pytorch/codegen-test.sh <baseline_output_dir>
.jenkins/pytorch/codegen-test.sh <test_output_dir>

# Then run diff to compare the generated files:
diff -Naur <baseline_output_dir> <test_output_dir>
```
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D25086561
Pulled By: ljk53
fbshipit-source-id: 1f43ab0931d9814c24683b9a48ca497c5fc3d729
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48252
Moved to a shared place so that gen_variable_type.py can reuse it.
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D25087808
Pulled By: ljk53
fbshipit-source-id: 1f32e506956fc4eb08734cfde0add47b3e666bd9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45277
Implements structured kernels as per https://github.com/pytorch/rfcs/pull/9 and ports upsample_nearest1d to use the framework.
The general structure of this diff:
- Define a new syntax for specifying structured kernels in `native_functions.yaml`. You put `structured: True` on the `out` function (that's what you implement) and `structured_delegate: foo.out` on the functional/inplace variants to define them in terms of the `out` function. There's a bunch of new consistency checking to see if you've done this right, though the error messages are of varying quality. This is most of what's going on in tools.codegen.model
- NativeFunctionGroup turns into StructuredNativeFunctions. Previously I thought that maybe we would use this grouping mechanism for both structured and unstructured kernels, but it turned out that Jiakai needed to make his own grouping structure. So now I've specialized it for structured kernels, which also means I get to add a bunch of invariants, like requiring structured kernels to have both a functional and an out variant. This is the lower bundle of changes in tools.codegen.model
- When you make an out kernel structured, this induces us to generate a new meta function signature for you to write shape checking and output allocation code. The signatures of these are defined by `tools.codegen.api.meta` and generated into `MetaFunctions.h`. Coverage here is very bare bones and will be driven by actual operators we port as we go.
- The meaty part of code generation is what we do when we have some grouped StructuredNativeFunctions. We continue to generate a wrapper per function type, but they are a bit different, as they call your meta functions and make reference to the actual implementation in `out` (see the Python sketch after this list).
- Then there's a port of `upsample_nearest1d`; easiest to review by just looking at what the final code looks like.
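To make the wrapper structure concrete, here is a Python stand-in for the call pattern described above (the real generated code is C++ and these function names are illustrative, not the real API):
```
import torch

def upsample_nearest1d_meta(self, output_size):
    # "meta" step: shape checking and output-size computation only.
    assert self.dim() == 3, "expected a 3-D (N, C, W) input"
    n, c, _ = self.shape
    return (n, c, output_size[0])

def upsample_nearest1d_impl(self, output_size, out):
    # Stands in for the single hand-written out-style kernel.
    out.copy_(torch.nn.functional.interpolate(self, size=output_size, mode="nearest"))

def upsample_nearest1d_out_sketch(self, output_size, out):
    out.resize_(upsample_nearest1d_meta(self, output_size))  # check shapes, fix up `out`
    upsample_nearest1d_impl(self, output_size, out)
    return out

def upsample_nearest1d_functional_sketch(self, output_size):
    out = self.new_empty(upsample_nearest1d_meta(self, output_size))  # allocate, then reuse the kernel
    upsample_nearest1d_impl(self, output_size, out)
    return out
```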
Missing pieces:
- Stride calculation in TensorMeta
- Sufficient sanity checking for inplace/out variants
- Enough rope to make TensorIterator work
This PR improves instruction counts on `upsample_nearest1d` because it eliminates an extra redispatch. Testing `at::upsample_nearest1d(x, {10});`
* Functional: before 1314105, after 1150705
* Out: before 915705, after 838405
These numbers may be jittered by up to +-16400 (which is the difference I saw when testing against an unaffected operator, `at::upsample_linear1d`), though that may also be because unrelated changes affected all operators globally.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D24253555
Test Plan: Imported from OSS
Reviewed By: smessmer
Pulled By: ezyang
fbshipit-source-id: 4ef58dd911991060f13576864c8171f9cc614456
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46970
Now that catchall declarations are reinterpreted as registrations to the
Math dispatch key, we can simplify code generation logic by directly
generating to Math and bypassing the catchall logic. This also helps
avoid bugs where we incorrectly classify some kernels as Math and others
as not, even though they get registered in the same way.
Bill of changes:
- Give Math its own unique TORCH_LIBRARY_IMPL
- Make it so NativeFunction.dispatch is always non-None. Simplify
downstream conditionals accordingly
- When parsing NativeFunction, fill in missing dispatch with a
singleton Math entry (pointing to the cpp.name!)
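A sketch of the fill-in described in the last bullet (this assumes the dispatch table is modeled as a plain dict from dispatch key to kernel name; the real model code may differ):
```
def fill_in_default_dispatch(dispatch, cpp_name):
    # If native_functions.yaml gave no dispatch section, treat the entry as a
    # Math (formerly catchall) registration pointing at its cpp name.
    if not dispatch:
        return {"Math": cpp_name}
    return dict(dispatch)

assert fill_in_default_dispatch(None, "add") == {"Math": "add"}
assert fill_in_default_dispatch({"CPU": "add_cpu"}, "add") == {"CPU": "add_cpu"}
```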
One thing that is a little big about this change is that a lot of kernels
which previously didn't report as "math" now report as "math". I picked
a setting for these booleans that made sense to me, but I'm not sure
if e.g. XLA will handle it 100% correctly.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D24592391
Pulled By: ezyang
fbshipit-source-id: 2e3355f19f9525698864312418df08411f30a85d
Summary:
Refactor foreach APIs to use overloads for scalar list inputs.
Tested via unit tests.
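For context, a usage sketch of the scalar-list overloads (the `torch._foreach_*` functions are a private API and may change):
```
import torch

tensors = [torch.ones(3) for _ in range(4)]
scalars = [1.0, 2.0, 3.0, 4.0]

out = torch._foreach_add(tensors, scalars)  # the i-th scalar is applied to the i-th tensor
torch._foreach_mul_(tensors, scalars)       # in-place overload
```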
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45673
Reviewed By: heitorschueroff
Differential Revision: D24053424
Pulled By: izdeby
fbshipit-source-id: 35976cc50b4acfe228a32ed26cede579d5621cde
Summary:
The record_stream method was hard-coded for the CUDA device. Define record_stream in native_functions.yaml to enable dynamic dispatch to different backend devices.
Fixes https://github.com/pytorch/pytorch/issues/36556
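For reference, a usage sketch of the Python-level API whose dispatch this generalizes (requires a CUDA device to run):
```
import torch

side = torch.cuda.Stream()
x = torch.empty(1024, device="cuda")      # allocated on the current (default) stream
with torch.cuda.stream(side):
    y = x * 2                             # x is consumed on a different stream
# Tell the caching allocator that x is in use on `side`, so its memory is not
# reused until the work queued on `side` up to this point has finished.
x.record_stream(side)
```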
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44301
Reviewed By: glaringlee
Differential Revision: D23763954
Pulled By: ezyang
fbshipit-source-id: e6d24f5e7892b56101fa858a6cad2abc5cdc4293
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45918
This groups together related native functions (functional, inplace, out)
into a single group. It's not used by anything yet, but Jiakai said this
would be useful for his stuff, so I'm putting it in immediately.
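A sketch of the grouping this enables (the attribute names here are illustrative stand-ins for whatever the codegen model exposes, not the real API):
```
from collections import defaultdict

def group_native_functions(native_functions):
    # Collect the functional / inplace / out flavors of each operator,
    # e.g. "add.Tensor", "add_.Tensor", and "add.out" end up in one group.
    groups = defaultdict(dict)
    for f in native_functions:
        groups[f.root_name][f.kind] = f   # f.kind: "functional", "inplace", or "out"
    return dict(groups)
```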
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: smessmer
Differential Revision: D24163526
Pulled By: ezyang
fbshipit-source-id: 9979b0fe9249c78e4a64a50c5ed0e2ab99f499b9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45742
Add a new flag to native_functions.yaml: `use_c10_dispatcher: hacky_wrapper_for_legacy_signatures`
and the codegen only wraps kernels in the aforementioned wrapper if that flag is set.
Apart from that, `use_c10_dispatcher: hacky_wrapper_for_legacy_signatures` is equivalent to `full`,
i.e. it has full boxing and unboxing support.
This greatly reduces the number of ops we apply the hacky_wrapper to; all ops marked as `use_c10_dispatcher: full` no longer get it.
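Schematically, the codegen decision becomes something like the sketch below (the actual wrapper expression it emits is C++ and is not reproduced here; the names are placeholders):
```
def registration_expr(use_c10_dispatcher, kernel_expr):
    # Only entries that opt in via the new flag get the legacy-signature wrapper;
    # plain `use_c10_dispatcher: full` entries are registered as-is.
    if use_c10_dispatcher == "hacky_wrapper_for_legacy_signatures":
        return f"make_legacy_signature_wrapper({kernel_expr})"  # placeholder name
    return kernel_expr
```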
ghstack-source-id: 113982139
Test Plan:
waitforsandcastle
vs fbcode:
https://www.internalfb.com/intern/fblearner/details/214511705/
vs base diff:
https://www.internalfb.com/intern/fblearner/details/214693207/
Reviewed By: ezyang
Differential Revision: D23328718
fbshipit-source-id: be120579477b3a05f26ca5f75025bfac37617620
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45131
These make it easier to group native functions together and determine
what kind of native function each one is (inplace/out/functional). Currently
they are not used, but they may be useful for tools.autograd porters.
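A sketch of the sort of classification such helpers provide (purely illustrative; the real helpers work on the parsed schema, not on raw strings):
```
def schema_kind(name, has_out_args):
    # `name` is the operator name without its overload, e.g. "add" or "add_".
    if has_out_args:
        return "out"
    if name.endswith("_"):   # a trailing underscore marks an in-place op
        return "inplace"
    return "functional"

assert schema_kind("add", has_out_args=True) == "out"
assert schema_kind("add_", has_out_args=False) == "inplace"
assert schema_kind("add", has_out_args=False) == "functional"
```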
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: zhangguanheng66
Differential Revision: D23872526
Pulled By: ezyang
fbshipit-source-id: 1d6e429ab9a1f0fdb764be4228c5bca4dce8f24e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45127
I thought I was being clever by using Sequence, which doesn't commit to
List or Tuple, but forces read-onlyness in the type system. However,
there is a runtime implication to using List or Tuple: Lists can't be
hashed, but Tuples can be! This is important because I shortly want
to group by FunctionSchema, and to do this I need FunctionSchema to
be hashable. Switch everything to Tuple for true immutability.
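The runtime difference in a few lines of plain Python:
```
hash(("a", "b"))           # fine: tuples of hashables are hashable
try:
    hash(["a", "b"])       # lists are not
except TypeError as e:
    print(e)               # unhashable type: 'list'
```
This is what lets a frozen dataclass whose fields are all tuples (like FunctionSchema here) serve as a dictionary key when grouping.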
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: gchanan
Differential Revision: D23872527
Pulled By: ezyang
fbshipit-source-id: 5c8fae1c50a5ae47b4167543646d94ddcafff8c3
Summary:
Amp gradient unscaling is a great use case for multi-tensor apply (in fact it's the first case I wrote it for). This PR adds an MTA unscale+infcheck functor. Really excited to have it for `torch.cuda.amp`. izdeby your interface was clean and straightforward to use, great work!
Labeled as bc-breaking because the native_functions.yaml exposure of unscale+infcheck changes from [`_amp_non_finite_check_and_unscale_` to `_amp_foreach_non_finite_check_and_unscale_`]( https://github.com/pytorch/pytorch/pull/44778/files#diff-f1e4b2c15de770d978d0eb77b53a4077L6289-L6293).
The PR also modifies Unary/Binary/Pointwise Functors to
- do ops' internal math in FP32 for FP16 or bfloat16 inputs, which improves precision ([and throughput, on some architectures!](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions)) and has no downside for the ops we care about.
- accept an instantiated op functor rather than an op functor template (`template<class> class Op`). This allows calling code to pass lambdas.
Open question: As written now, the PR has MTA Functors take care of pre- and post-casting FP16/bfloat16 inputs to FP32 before running the ops. However, alternatively, the pre- and post-math casting could be deferred/written into the ops themselves, which gives them a bit more control. I can easily rewrite it that way if you prefer.
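For context, a sketch of the standard `torch.cuda.amp` loop this kernel sits behind (the foreach unscale is an internal detail of `GradScaler`; this requires a CUDA device to run):
```
import torch

model = torch.nn.Linear(10, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(8, 10, device="cuda")
target = torch.randn(8, 1, device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast():
    loss = torch.nn.functional.mse_loss(model(x), target)
scaler.scale(loss).backward()
scaler.step(optimizer)   # internally unscales grads and skips the step if infs/NaNs were found
scaler.update()
```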
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44778
Reviewed By: gchanan
Differential Revision: D23944102
Pulled By: izdeby
fbshipit-source-id: 22b25ccad5f69b413c77afe8733fa9cacc8e766d
Summary:
In this PR:
1) Added binary operations with ScalarLists.
2) Fixed a _foreach_div(...) bug in native_functions.
3) Covered all possible cases with scalars and scalar lists in tests.
4) [minor] Fixed a bug in native_functions by adding "use_c10_dispatcher: full" to all _foreach functions.
Tested via unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44743
Reviewed By: bwasti, malfet
Differential Revision: D23753711
Pulled By: izdeby
fbshipit-source-id: bf3e8c54bc07867e8f6e82b5d3d35ff8e99b5a0a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42537
[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](https://github.com/pytorch/pytorch/pull/41554).
**Motivation**
[GitHub issue](https://github.com/pytorch/pytorch/issues/38655)
Current PyTorch optimizer implementations are not efficient when we work with a lot of small feature tensors: launching a lot of kernels slows down the whole process, so we need to reduce the number of kernel launches.
As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex).
In order to track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model's performance against the original model version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.
**Current API restrictions**
- List can't be empty (will be fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.
**Broadcasting**
At this point we don't support broadcasting.
**What are the 'Fast' and 'Slow' routes**
In particular cases, we can't process an op with a fast list CUDA kernel. Still, we can fall back to a regular for-loop where the op is applied to each tensor individually through the dispatch mechanism. There are a few checks that decide whether the op will be performed via the 'fast' or 'slow' path.
To go the fast route,
- All tensors must have strided layout
- All tensors must be dense and not have overlapping memory
- The resulting tensor type must be the same.
----------------
**In this PR**
Adding APIs:
```
torch._foreach_exp(TensorList tl1)
torch._foreach_exp_(TensorList tl1)
torch._foreach_sqrt(TensorList tl1)
torch._foreach_sqrt_(TensorList tl1)
```
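A usage sketch of these unary ops (this is a private API and may change):
```
import torch

tensors = [torch.rand(5) for _ in range(10)]

exp_out = torch._foreach_exp(tensors)   # returns a new list of tensors
torch._foreach_sqrt_(tensors)           # in-place: each tensor in the list is updated
```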
**Tests**
Tested via unit tests
**TODO**
1. Properly handle empty lists
2. Properly handle bool tensors
**Plan for the next PRs**
1. APIs
- Pointwise Ops
2. Complete tasks from TODO
3. Rewrite PyTorch optimizers to use for-each operators for performance gains.
Test Plan: Imported from OSS
Reviewed By: cpuhrsch
Differential Revision: D23331889
Pulled By: izdeby
fbshipit-source-id: 8b04673b8412957472ed56361954ca3884eb9376
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42536
[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](https://github.com/pytorch/pytorch/pull/41554).
**Motivation**
[GitHub issue](https://github.com/pytorch/pytorch/issues/38655)
Current PyTorch optimizer implementations are not efficient when we work with a lot of small feature tensors: launching a lot of kernels slows down the whole process, so we need to reduce the number of kernel launches.
As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex).
In order to track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model's performance against the original model version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.
**Current API restrictions**
- List can't be empty (will be fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.
**Broadcasting**
At this point we don't support broadcasting.
**What are the 'Fast' and 'Slow' routes**
In particular cases, we can't process an op with a fast list CUDA kernel. Still, we can fall back to a regular for-loop where the op is applied to each tensor individually through the dispatch mechanism. There are a few checks that decide whether the op will be performed via the 'fast' or 'slow' path.
To go the fast route,
- All tensors must have strided layout
- All tensors must be dense and not have overlapping memory
- The resulting tensor type must be the same.
----------------
**In this PR**
Adding APIs:
```
torch._foreach_sub(TensorList tl1, TensorList tl2)
torch._foreach_sub_(TensorList self, TensorList tl2)
torch._foreach_mul(TensorList tl1, TensorList tl2)
torch._foreach_mul_(TensorList self, TensorList tl2)
torch._foreach_div(TensorList tl1, TensorList tl2)
torch._foreach_div_(TensorList self, TensorList tl2)
torch._foreach_sub(TensorList tl1, Scalar scalar)
torch._foreach_sub_(TensorList self, Scalar scalar)
torch._foreach_mul(TensorList tl1, Scalar scalar)
torch._foreach_mul_(TensorList self, Scalar scalar)
torch._foreach_div(TensorList tl1, Scalar scalar)
torch._foreach_div_(TensorList self, Scalar scalar)
```
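A usage sketch of these binary ops (private API, subject to change):
```
import torch

a = [torch.full((3,), 4.0) for _ in range(4)]
b = [torch.full((3,), 2.0) for _ in range(4)]

diff = torch._foreach_sub(a, b)   # list-with-list: elementwise per (a[i], b[i]) pair
torch._foreach_mul_(a, 0.5)       # list-with-scalar, in-place
```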
**Tests**
Tested via unit tests
**TODO**
1. Properly handle empty lists
2. Properly handle bool tensors
**Plan for the next PRs**
1. APIs
- Unary Ops for list
- Pointwise Ops
2. Complete tasks from TODO
3. Rewrite PyTorch optimizers to use for-each operators for performance gains.
Test Plan: Imported from OSS
Reviewed By: cpuhrsch
Differential Revision: D23331891
Pulled By: izdeby
fbshipit-source-id: 18c5937287e33e825b2e391e41864dd64e226f19
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42533
[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](https://github.com/pytorch/pytorch/pull/41554).
**Motivation**
[GitHub issue](https://github.com/pytorch/pytorch/issues/38655)
Current PyTorch optimizer implementations are not efficient when we work with a lot of small feature tensors: launching a lot of kernels slows down the whole process, so we need to reduce the number of kernel launches.
As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex).
In order to track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model's performance against the original model version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.
**Current API restrictions**
- List can't be empty (will be fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.
**Broadcasting**
At this point we don't support broadcasting.
**What are the 'Fast' and 'Slow' routes**
In particular cases, we can't process an op with a fast list CUDA kernel. Still, we can fall back to a regular for-loop where the op is applied to each tensor individually through the dispatch mechanism. There are a few checks that decide whether the op will be performed via the 'fast' or 'slow' path.
To go the fast route,
- All tensors must have strided layout
- All tensors must be dense and not have overlapping memory
- The resulting tensor type must be the same.
----------------
**In this PR**
- Adding a `_foreach_add(TensorList tl1, TensorList tl2)` API
- Adding a `_foreach_add_(TensorList tl1, TensorList tl2)` API
**Tests**
Tested via unit tests
**TODO**
1. Properly handle empty lists
**Plan for the next PRs**
1. APIs
- Binary Ops for list with Scalar
- Binary Ops for list with list
- Unary Ops for list
- Pointwise Ops
2. Complete tasks from TODO
3. Rewrite PyTorch optimizers to use for-each operators for performance gains.
Test Plan: Imported from OSS
Reviewed By: zou3519
Differential Revision: D23331894
Pulled By: izdeby
fbshipit-source-id: 876dd1bc82750f609b9e3ba23c8cad94d8d6041c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42629
How to approach reviewing this diff:
- The new codegen itself lives in `tools/codegen`. Start with `gen.py`, then read `model.py`, and then the `api/` folder. The comments at the top of the files describe what is going on. The CLI interface of the new codegen is similar to the old one, but (1) it is no longer necessary to explicitly specify cwrap inputs (and now we will error if you do so) and (2) the default settings for source and install dir are much better; to the extent that if you run the codegen from the root source directory as just `python -m tools.codegen.gen`, something reasonable will happen.
- The old codegen is (nearly) entirely deleted; every Python file in `aten/src/ATen` was deleted except for `common_with_cwrap.py`, which now permanently finds its home in `tools/shared/cwrap_common.py` (previously cmake copied the file there), and `code_template.py`, which now lives in `tools/codegen/code_template.py`. We remove the copying logic for `common_with_cwrap.py`.
- All of the inputs to the old codegen are deleted.
- Build rules now have to be adjusted to not refer to files that no longer exist, and to abide by the (slightly modified) CLI.
- LegacyTHFunctions files have been generated and checked in. We expect these to be deleted as these final functions get ported to ATen. The deletion process is straightforward; just delete the functions of the ones you are porting. There are 39 more functions left to port.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: bhosmer
Differential Revision: D23183978
Pulled By: ezyang
fbshipit-source-id: 6073ba432ad182c7284a97147b05f0574a02f763