Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51460
This PR is part of a larger effort to ensure torch.linalg documentation is consistent (see #50287).
* #51459 [doc] Fix linalg.cholesky doc consistency issues
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D26176130
Pulled By: heitorschueroff
fbshipit-source-id: cc89575db69cbfd5f87d970a2e71deb6522a35b1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51459
This PR is part of a larger effort to ensure torch.linalg documentation is consistent (see #50287).
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D26176131
Pulled By: heitorschueroff
fbshipit-source-id: 2ad88a339e6dff044965e8bf29dd8c852afecb34
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51270
Similar to #50973, allow the batched version to run vanilla allreduce for the first K iterations.
This may be useful if the batched version can be applied to some use cases where the accuracy requirement is not very strict.
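The warm-start idea above can be sketched in plain Python (this is our own illustration, not the actual DDP comm hook API): count iterations inside the hook and dispatch to the exact path until the threshold is crossed.

```python
# Hedged sketch of the warm-start idea; the class and callback signatures
# here are hypothetical stand-ins for the real DDP comm hook machinery.
class WarmStartHook:
    def __init__(self, start_compress_iter, allreduce_fn, compressed_fn):
        self.iter = 0
        self.start_compress_iter = start_compress_iter
        self.allreduce_fn = allreduce_fn      # exact (vanilla) reduction
        self.compressed_fn = compressed_fn    # approximate (PowerSGD) reduction

    def __call__(self, bucket):
        self.iter += 1
        if self.iter <= self.start_compress_iter:
            # First K iterations: accuracy-preserving vanilla allreduce.
            return self.allreduce_fn(bucket)
        # Afterwards: switch to the faster, lossy compressed path.
        return self.compressed_fn(bucket)
```

The net effect is that early, accuracy-sensitive training steps stay exact, while the bulk of training enjoys the compression speedup.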
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120725858
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
baseline: f248001754
batched PowerSGD: f246960752
The training time was reduced from 54m48s to 30m33s, and the accuracy is approximately the same: 44.21 vs 44.35
Reviewed By: rohan-varma
Differential Revision: D26077709
fbshipit-source-id: 6afeefad7a3fbdd7da2cbffb56dfbad855a96cb5
Summary:
This is the initial skeleton for C++ codegen; it includes code generation for Allocate and Free.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51070
Test Plan: New unit tests are added to `test_cpp_codegen.cpp`.
Reviewed By: ZolotukhinM
Differential Revision: D26061818
Pulled By: cheng-chang
fbshipit-source-id: b5256b2dcee6b2583ba73b6c9684994dbe7cdc1f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51269
This saves about 10% of the compile time of Functions.cpp. Found using clang-9's `-ftime-trace` feature + ClangBuildAnalyzer.
Test Plan:
Compared -ftime-trace + ClangBuildAnalyzer output.
Before: P167884397
After: P167888502
Note that time spent generating assertSignatureIsCorrect is way down, though it's still kind of slow.
Reviewed By: ezyang
Differential Revision: D26121814
fbshipit-source-id: 949a85d8939c02e4fb5ac1adc35905ed34414724
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51247
See code comment for explanation.
This measures as neutral compared to the previous diff with `perf stat` when running a
benchmark that calls empty in a loop. I think we should commit it
anyway because:
1) I have previously seen it make a difference when applied earlier in
the stack.
2) This makes sense both on principle and via inspecting output
assembly: we avoid having to touch the boxed kernel at all (usually)
and instead use the unboxed kernel for both the validity check in
`OperatorEntry::lookup` and the actual `KernelFunction::call`.
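The shape of this optimization can be sketched in Python (the real code is C++; the class and method names below are our own simplification): keep the unboxed function pointer around and use it both as the "is this kernel registered?" check and as the thing that actually gets called, so the boxed wrapper is never touched on the fast path.

```python
# Hypothetical stand-in for the real KernelFunction: the same unboxed
# callable serves both the validity check and the fast-path call.
class KernelFunction:
    def __init__(self, unboxed=None):
        self.unboxed = unboxed  # None means "no unboxed kernel registered"

    def is_valid_unboxed(self):
        # Validity check used by lookup: just a null test on the pointer.
        return self.unboxed is not None

    def call(self, *args):
        # Fast-path call goes straight through the unboxed callable.
        return self.unboxed(*args)
```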
ghstack-source-id: 120697497
Test Plan: Aforementioned perf measurement
Reviewed By: ezyang
Differential Revision: D26113650
fbshipit-source-id: 8448c4ed764d477f63eb7c0f6dd87b1fc0228b73
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51245
Splitting this out from #51164 (D26069629) to allow it to
land separately; I'm sure this is a good idea but I'm less sure about
#51164.
ghstack-source-id: 120697499
Test Plan:
double-check effect on empty benchmark with perf stat;
didn't move
Reviewers: ezyang, messmer
Reviewed By: ezyang
Differential Revision: D26112627
fbshipit-source-id: 50d4418d351527bcedd5ccdc49106bc642699870
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51163
The Dispatcher seems to have been in a precarious local
maximum: I tried to make several different changes to parameter
passing and ended up with regressions due to reduced inlining that
swamped any gains I might have gotten from the parameter passing
changes.
This diff reduces the amount of inline code on the fast path. It
should both reduce code size and provide a platform for making further
improvements to the dispatcher code.
It is a slight performance regression, but it unblocked the following
two diffs (which seem to get us back where we were) from landing.
ghstack-source-id: 120693163
Test Plan:
CI, framework overhead benchmarks to check the size of the
regression
Compared timing for empty framework overhead benchmark before/after.
Build command: `buck build mode/no-gpu //caffe2/caffe2/fb/high_perf_models/pytorch/benchmark_framework_overheads:cpp_benchmark mode/opt-clang --show-output`
Run with `numactl -m 0 -C 3 path/to/cpp_benchmark -op empty -niter 100`
Before:
```
I0126 16:02:04.373075 2135872 bench.cpp:139] Mean 0.266272
I0126 16:02:04.373106 2135872 bench.cpp:140] Median 0.266347
I0126 16:02:04.373111 2135872 bench.cpp:141] Min 0.263585
I0126 16:02:04.373117 2135872 bench.cpp:142] stddev 0.0021264
I0126 16:02:04.373131 2135872 bench.cpp:143] stddev / mean 0.00798581
```
After:
```
I0126 16:02:30.377992 2137048 bench.cpp:139] Mean 0.27579
I0126 16:02:30.378023 2137048 bench.cpp:140] Median 0.275281
I0126 16:02:30.378029 2137048 bench.cpp:141] Min 0.270617
I0126 16:02:30.378034 2137048 bench.cpp:142] stddev 0.00308287
I0126 16:02:30.378044 2137048 bench.cpp:143] stddev / mean 0.0111783
```
Yes, it's a regression, but I compared D26069629 stacked on this diff vs not:
With this diff:
```
I0126 16:02:50.662864 2137574 bench.cpp:139] Mean 0.268645
I0126 16:02:50.662891 2137574 bench.cpp:140] Median 0.267485
I0126 16:02:50.662896 2137574 bench.cpp:141] Min 0.266485
I0126 16:02:50.662901 2137574 bench.cpp:142] stddev 0.00219359
I0126 16:02:50.662915 2137574 bench.cpp:143] stddev / mean 0.00816537
```
Without:
```
I0126 20:40:27.815824 3240699 bench.cpp:139] Mean 0.270755
I0126 20:40:27.815860 3240699 bench.cpp:140] Median 0.268998
I0126 20:40:27.815866 3240699 bench.cpp:141] Min 0.268306
I0126 20:40:27.815873 3240699 bench.cpp:142] stddev 0.00260365
I0126 20:40:27.815886 3240699 bench.cpp:143] stddev / mean 0.00961624
```
So we do seem to have accomplished something w.r.t. not overwhelming the inliner.
Reviewed By: ezyang
Differential Revision: D26091377
fbshipit-source-id: c9b7f4e187059fa15452b7c75fc29816022b92b1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51312
Follow-up to D24690094 (4a870f6518), exposing the API in Python. Created a matching unit test.
ghstack-source-id: 120611452
Test Plan: Ran unit test
Reviewed By: dhruvbird
Differential Revision: D26112765
fbshipit-source-id: ffe3bb97de0a4f08b31719b4b47dcebd7d2fd42a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51049
This diff makes it OK to query has_storage() on all TensorImpls. I added debug assertions that storage_ is indeed never set on them, which is required for this to be correct.
ghstack-source-id: 120714380
Test Plan: CI
Reviewed By: ezyang
Differential Revision: D26008498
fbshipit-source-id: b3f55f0b57b04636d13b09aa55bb720c6529542c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51048
There doesn't seem to be any reason to prohibit accessing the always-zero storage_offset of those TensorImpls that prohibit set_storage_offset.
ghstack-source-id: 120714379
Test Plan: CI
Reviewed By: ezyang
Differential Revision: D26008499
fbshipit-source-id: cd92ac0afdebbd5cf8f04df141843635113b6444
Summary:
Performs the update that was suggested in https://github.com/pytorch/pytorch/issues/41489
Adjust the functionality to largely match that of the scipy companion PR https://github.com/scipy/scipy/pull/10844/, including
- a new `draw_base2` method
- include zero as the first point in the (unscrambled) Sobol sequence
The scipy PR is also quite opinionated when the `draw` method is not called with a power-of-2 number of points (for which the resulting sequence has nice properties; see the scipy PR for a comprehensive discussion).
Note that this update is a **breaking change**: sequences generated with the same parameters will not be identical to those generated before this change! They will have the same (arguably better) distributional properties, but calling the engine with the same seed will produce different numbers in the sequence.
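The "zero as the first point" behavior can be illustrated without torch: the first dimension of the unscrambled Sobol sequence coincides with the base-2 van der Corput sequence, which a few lines of pure Python can generate (illustration only, not the torch implementation).

```python
# Base-2 radical-inverse (van der Corput) sequence: the first dimension of
# the unscrambled Sobol sequence. After this change, index 0 maps to 0.0.
def van_der_corput_base2(n):
    points = []
    for i in range(n):
        x, denom = 0.0, 1.0
        k = i
        while k:
            denom *= 2.0
            k, bit = divmod(k, 2)
            x += bit / denom  # append the next reversed bit of i
        points.append(x)
    return points
```

Drawing a power-of-2 number of points from this sequence fills dyadic intervals evenly, which is why the base-2 `draw_base2` entry point exists.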
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49710
Test Plan:
```
from torch.quasirandom import SobolEngine
sobol = SobolEngine(3)
sobol.draw(4)
sobol = SobolEngine(4, scramble=True)
sobol.draw(5)
sobol = SobolEngine(4, scramble=True)
sobol.draw_base2(2)
```
Reviewed By: malfet
Differential Revision: D25657233
Pulled By: Balandat
fbshipit-source-id: 9df50a14631092b176cc692b6024aa62a639ef61
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51341
Adds tests for objects that contain CPU/GPU tensors to ensure that
they can also be serialized/deserialized appropriately.
ghstack-source-id: 120718120
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D26144100
fbshipit-source-id: f1a8ccb9741bb5372cb7809cb43cbe43bf47d517
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50625
Make API signatures consistent and provide default arguments similar to
the tensor collectives.
ghstack-source-id: 120718121
Test Plan: CI
Reviewed By: wanchaol
Differential Revision: D25932012
fbshipit-source-id: d16267e236a65ac9d55e19e2178f9d9267b08a20
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51342
There is a subtle bug with the MemoryPlanner with regard to view ops with out variant.
```
def forward(self, a: Tensor, shape: List[int]):
    b = a.reshape(shape)
    return b + b
```
In this case, if we replace reshape with its out variant, b is managed by the MemoryPlanner, and its storage is set to nullptr right after inference when opts.cleanup_activations is true. Because b is a view of a, the storage of a is also set to nullptr, which violates the API's promise that a is const.
To fix this bug, I changed the MemoryPlanner so that it puts b in the unmanaged part.
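The aliasing hazard can be modeled with a toy Storage/Tensor pair (ours, not the real MemoryPlanner code): views share a Storage object, so "freeing" a managed view's storage also clobbers the base tensor's data, which is why view outputs must stay unmanaged.

```python
# Toy model of the bug. Storage holds the data; a view Tensor shares the
# same Storage object as its base.
class Storage:
    def __init__(self, data):
        self.data = data

class Tensor:
    def __init__(self, storage):
        self.storage = storage

    def reshape_view(self):
        # A reshape that can alias returns a view sharing this storage.
        return Tensor(self.storage)

def cleanup(managed_tensors):
    # What the planner does after inference when cleanup_activations is on:
    # release the storage of every tensor it manages.
    for t in managed_tensors:
        t.storage.data = None
```

Running `cleanup` over a list containing a view wipes the base tensor's data too; leaving the view out of the managed set (the fix) keeps the input intact.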
Test Plan:
Add unit test to enforce the constness of inputs
```
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
```
Reviewed By: ajyu
Differential Revision: D26144203
fbshipit-source-id: 2dbacccf7685d0fe0f0b1195166e0510b2069fe3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51066
The backend name of a process group created with the distributed_c10d Python API is tracked, but there is no good way to track the name of a process group created with the ProcessGroup C++ API. In some cases, knowing the backend name of a process group is useful, e.g., to log the backend name or to write code that depends on a known backend.
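The shape of the fix can be sketched in Python (these are illustrative classes, not the actual c10d types): a Python-side registry only knows about groups created through Python, whereas a virtual getter on the process-group base class answers the question for any group, including ones constructed directly in C++.

```python
# Hypothetical stand-ins for the real ProcessGroup hierarchy: each concrete
# backend reports its own name via a base-class getter.
class ProcessGroup:
    def get_backend_name(self):
        raise NotImplementedError

class ProcessGroupNCCL(ProcessGroup):
    def get_backend_name(self):
        return "nccl"

class ProcessGroupGloo(ProcessGroup):
    def get_backend_name(self):
        return "gloo"

def log_backend(pg):
    # Works regardless of how the group was created.
    return "using backend: " + pg.get_backend_name()
```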
ghstack-source-id: 120628432
Test Plan: unit tests
Reviewed By: pritamdamania87
Differential Revision: D26059769
fbshipit-source-id: 6584c6695c5c3570137dc98c16e06cbe4b7f5503
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51314
Updates the DistributedOptimizer doc to include TorchScript enablement information.
Test Plan: Imported from OSS
Reviewed By: pbelevich
Differential Revision: D26156032
Pulled By: wanchaol
fbshipit-source-id: 1f3841f55918a5c2ed531cf6aeeb3f6e3a09a6a8
Summary:
Reference: https://github.com/pytorch/pytorch/issues/33152
Changes
* Enable complex support for masked_scatter
* Enable half support for masked_scatter CPU
* Enable complex autograd support for masked_scatter CPU and masked_select (both CPU and CUDA).
**Note**:
Complex support for masked_scatter CUDA is disabled, as it depends on `masked_fill`, which has yet to be ported to ATen.
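For reference, the masked_scatter semantics being extended to complex values can be sketched over plain Python lists (illustration only; the real op works on torch tensors): elements of `source` are copied, in order, into the positions of the input where `mask` is True.

```python
# Pure-Python sketch of masked_scatter semantics, including complex values.
def masked_scatter(self_vals, mask, source):
    it = iter(source)
    # Each True position consumes the next element of source, in order.
    return [next(it) if m else v for v, m in zip(self_vals, mask)]
```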
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51281
Reviewed By: ailzhang
Differential Revision: D26127561
Pulled By: anjali411
fbshipit-source-id: 6284926b934942213c5dfc24b5bcc8538d0231af
Summary:
Fixes #{issue number}
Resubmitting a new PR as the older one got reverted due to problems in test_optim.py.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51227
Reviewed By: ezyang
Differential Revision: D26142505
Pulled By: ailzhang
fbshipit-source-id: a2ab5d85630aac2d2ce17652ba19c11ea668a6a9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51365
We have a pretty big backlog of PRs when it comes to checking for staleness, and the action only supports processing 30 PRs at a time.
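Assuming this workflow uses the actions/stale action, the relevant knob would look something like the fragment below (the version tag and value are illustrative, not the actual workflow contents):

```yaml
# Hypothetical workflow fragment: raise the per-run processing budget of
# actions/stale above its default of 30 operations.
- uses: actions/stale@v3
  with:
    operations-per-run: 300
```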
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Test Plan: Imported from OSS
Reviewed By: samestep
Differential Revision: D26153785
Pulled By: seemethere
fbshipit-source-id: 585b36068683e04cf4e2cc59013482f143ec30a3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51072
AT_ASSERTM is deprecated and should be replaced by either TORCH_CHECK or
TORCH_INTERNAL_ASSERT, depending on the situation.
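The distinction can be sketched in Python (the real macros are C++; `torch_check` below is our own stand-in): TORCH_CHECK-style checks validate user input and raise an error the caller can act on, while TORCH_INTERNAL_ASSERT-style assertions guard internal invariants whose failure indicates a bug in the library itself.

```python
# Hypothetical analogue of TORCH_CHECK: user-facing validation that raises.
def torch_check(cond, msg):
    if not cond:
        raise RuntimeError(msg)

def clamp_index(i, size):
    # User-facing precondition -> TORCH_CHECK territory.
    torch_check(size > 0, "size must be positive")
    r = min(max(i, 0), size - 1)
    # Internal invariant -> TORCH_INTERNAL_ASSERT territory.
    assert 0 <= r < size
    return r
```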
Test Plan: Imported from OSS
Reviewed By: ailzhang
Differential Revision: D26074364
Pulled By: ezyang
fbshipit-source-id: 742e28afe49e0a546c252a0fad487f93410d0cb5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50843
AT_ASSERTM is deprecated and should be replaced by either TORCH_CHECK or
TORCH_INTERNAL_ASSERT, depending on the situation.
Test Plan: Imported from OSS
Reviewed By: ailzhang
Differential Revision: D26074365
Pulled By: ezyang
fbshipit-source-id: 46e13588fad4e24828f3cc99635e9cb2223a6c2c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51329
Currently test_qbatch_norm_relu contains too many examples and causes a timeout. Split it for now to fix the timeout issue.
Test Plan: buck test caffe2/test:quantization
Reviewed By: supriyar
Differential Revision: D26141037
fbshipit-source-id: da877efa78924a252a35c2b83407869ebb8c48b7