Commit Graph

33381 Commits

Author SHA1 Message Date
Heitor Schueroff
8fa328f88e [doc] Deprecate torch.cholesky in favor of torch.linalg.cholesky (#51460)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51460

This PR is part of a larger effort to ensure torch.linalg documentation is consistent (see #50287).

* #51459 [doc] Fix linalg.cholesky doc consistency issues

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D26176130

Pulled By: heitorschueroff

fbshipit-source-id: cc89575db69cbfd5f87d970a2e71deb6522a35b1
2021-02-01 15:47:08 -08:00
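For readers migrating, `torch.linalg.cholesky` returns the lower-triangular factor L with A = L Lᴴ. A dependency-free sketch of what that factorization computes (pure Python, illustration only; the real implementation dispatches to LAPACK/MAGMA):

```python
import math

def cholesky_lower(a):
    """Return the lower-triangular L with a = L @ L.T, for a small
    symmetric positive-definite matrix given as a list of rows."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                l[i][j] = math.sqrt(a[i][i] - s)  # diagonal entry
            else:
                l[i][j] = (a[i][j] - s) / l[j][j]  # below-diagonal entry
    return l

l = cholesky_lower([[4.0, 2.0], [2.0, 3.0]])
assert abs(l[0][0] - 2.0) < 1e-12 and abs(l[1][0] - 1.0) < 1e-12
```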
Heitor Schueroff
8583f7cbe2 [doc] Fix linalg.cholesky doc consistency issues (#51459)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51459

This PR is part of a larger effort to ensure torch.linalg documentation is consistent (see #50287).

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D26176131

Pulled By: heitorschueroff

fbshipit-source-id: 2ad88a339e6dff044965e8bf29dd8c852afecb34
2021-02-01 15:43:47 -08:00
Yi Wang
c08078031f [Gradient Compression] Allow BatchedPowerSGD to run vanilla allreduce for the first K iterations (#51270)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51270

Similar to #50973, allow the batched version to run vanilla allreduce for the first K iterations.

This may make the batched version applicable to some use cases where the accuracy requirement is not very strict.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120725858

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

baseline: f248001754
batched PowerSGD: f246960752

The training time was reduced from 54m48s to 30m33s, and the accuracy is approximately the same: 44.21 vs 44.35

Reviewed By: rohan-varma

Differential Revision: D26077709

fbshipit-source-id: 6afeefad7a3fbdd7da2cbffb56dfbad855a96cb5
2021-02-01 15:26:29 -08:00
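The switching logic described above can be sketched in a few lines (names are hypothetical stand-ins; the real hook lives in powerSGD_hook.py and returns futures, not strings):

```python
class HookState:
    """Minimal stand-in for PowerSGDState: tracks the iteration count and
    the threshold below which plain allreduce is used."""
    def __init__(self, start_powerSGD_iter):
        self.iter = 0
        self.start_powerSGD_iter = start_powerSGD_iter

def hybrid_hook(state):
    # Vanilla allreduce for the first K iterations, compression afterwards.
    path = "allreduce" if state.iter < state.start_powerSGD_iter else "batched_powerSGD"
    state.iter += 1
    return path

state = HookState(start_powerSGD_iter=2)
paths = [hybrid_hook(state) for _ in range(4)]
assert paths == ["allreduce", "allreduce", "batched_powerSGD", "batched_powerSGD"]
```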
Rong Rong (AI Infra)
718e4b110b add git submodule troubleshoot to CONTRIBUTING.md (#51458)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/51355.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51458

Reviewed By: janeyx99

Differential Revision: D26176233

Pulled By: walterddr

fbshipit-source-id: 758e4203e11c81489234bbca812d1a3738504148
2021-02-01 14:30:00 -08:00
Cheng Chang
109bc1047e [NNC] Generate C++ code for Allocate and Free (#51070)
Summary:
This is the initial skeleton for C++ codegen, it includes generations for Allocate and Free.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51070

Test Plan: New unit tests are added to `test_cpp_codegen.cpp`.

Reviewed By: ZolotukhinM

Differential Revision: D26061818

Pulled By: cheng-chang

fbshipit-source-id: b5256b2dcee6b2583ba73b6c9684994dbe7cdc1f
2021-02-01 13:06:51 -08:00
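The idea — walking IR statements and printing the corresponding C++ — can be sketched as a toy printer (the real NNC CppPrinter's API is not reproduced here; the statement encoding is illustrative):

```python
def emit_cpp(stmts):
    """Emit C++ source for ("Allocate", buf, ctype, size) and ("Free", buf)
    IR statements. A toy sketch, not the real NNC code generator."""
    lines = []
    for stmt in stmts:
        if stmt[0] == "Allocate":
            _, buf, ctype, size = stmt
            lines.append(f"{ctype}* {buf} = static_cast<{ctype}*>("
                         f"malloc({size} * sizeof({ctype})));")
        elif stmt[0] == "Free":
            lines.append(f"free({stmt[1]});")
    return "\n".join(lines)

code = emit_cpp([("Allocate", "x", "float", 64), ("Free", "x")])
assert "malloc(64 * sizeof(float))" in code and "free(x);" in code
```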
anjali411
642afcb168 Add sgn to torch.rst so that it appears in the built docs (#51479)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51479

Fixes https://github.com/pytorch/pytorch/issues/50146

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D26179734

Pulled By: anjali411

fbshipit-source-id: 1cda9a3dc9ce600e585900eea70fbecac0635d5c
2021-02-01 12:43:06 -08:00
Scott Wolchok
d1ddc5d65d [PyTorch] Outline OperatorEntry::assertSignatureIsCorrect fail path (#51269)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51269

This saves about 10% of the compile time of Functions.cpp. Found using clang-9's `-ftime-trace` feature + ClangBuildAnalyzer.

Test Plan:
Compared -ftime-trace + ClangBuildAnalyzer output.

Before: P167884397

After: P167888502

Note that time spent generating assertSignatureIsCorrect is way down, though it's still kind of slow.

Reviewed By: ezyang

Differential Revision: D26121814

fbshipit-source-id: 949a85d8939c02e4fb5ac1adc35905ed34414724
2021-02-01 12:40:19 -08:00
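The outlining pattern itself is language-agnostic: keep the hot comparison inline and push the expensive error-message construction into a separate, non-inlined function so it is compiled once instead of at every call site. A Python-flavored sketch (function names hypothetical):

```python
def _report_signature_error(name, expected, actual):
    # Cold path, outlined: the message construction lives here, out of the
    # hot, frequently-instantiated caller.
    raise TypeError(f"{name}: expected signature {expected!r}, found {actual!r}")

def assert_signature_is_correct(name, expected, actual):
    # Hot path stays tiny: one comparison plus a call into the cold path.
    if expected != actual:
        _report_signature_error(name, expected, actual)

# Matching signatures pass silently.
assert_signature_is_correct("aten::empty", "(int[]) -> Tensor", "(int[]) -> Tensor")
```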
Scott Wolchok
9877777fee [PyTorch] check isValidUnboxed() in the dispatcher (#51247)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51247

See code comment for explanation.

This measures neutral compared to the previous diff with `perf stat` when running on a
benchmark that calls empty in a loop. I think that we should commit it
anyway because:
1) I have previously seen it make a difference when applied earlier in
the stack.
2) This makes sense both on principle and via inspecting output
assembly: we avoid having to touch the boxed kernel at all (usually)
and instead use the unboxed kernel for both the validity check in
`OperatorEntry::lookup` and the actual `KernelFunction::call`.
ghstack-source-id: 120697497

Test Plan: Aforementioned perf measurement

Reviewed By: ezyang

Differential Revision: D26113650

fbshipit-source-id: 8448c4ed764d477f63eb7c0f6dd87b1fc0228b73
2021-02-01 12:40:14 -08:00
Scott Wolchok
4495b49ffa [PyTorch] Pass TensorOptions by value (#51165)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51165

`TensorOptions` does not have a non-trivial copy, move, or
destroy operation and is small enough to fit in a register, so it
seems like we should pass it by value.
ghstack-source-id: 120697498

Test Plan:
Measured timing for empty framework overhead benchmark before & after this change:

Before:
```
I0126 16:02:50.662864 2137574 bench.cpp:139] Mean 0.268645
I0126 16:02:50.662891 2137574 bench.cpp:140] Median 0.267485
I0126 16:02:50.662896 2137574 bench.cpp:141] Min 0.266485
I0126 16:02:50.662901 2137574 bench.cpp:142] stddev 0.00219359
I0126 16:02:50.662915 2137574 bench.cpp:143] stddev / mean 0.00816537

          2,968.37 msec task-clock                #    0.997 CPUs utilized            ( +-  0.03% )
               250      context-switches          #    0.084 K/sec                    ( +-  2.21% )
                 1      cpu-migrations            #    0.000 K/sec
            11,403      page-faults               #    0.004 M/sec                    ( +-  0.28% )
     5,898,481,882      cycles                    #    1.987 GHz                      ( +-  0.03% )  (50.05%)
    16,169,242,938      instructions              #    2.74  insn per cycle           ( +-  0.03% )  (50.06%)
     3,076,546,626      branches                  # 1036.443 M/sec                    ( +-  0.05% )  (50.05%)
         2,531,859      branch-misses             #    0.08% of all branches          ( +-  0.89% )  (50.03%)
```

After:
```
I0126 16:23:20.010062 2244624 bench.cpp:139] Mean 0.266814
I0126 16:23:20.010092 2244624 bench.cpp:140] Median 0.265759
I0126 16:23:20.010099 2244624 bench.cpp:141] Min 0.260291
I0126 16:23:20.010107 2244624 bench.cpp:142] stddev 0.00548279
I0126 16:23:20.010118 2244624 bench.cpp:143] stddev / mean 0.0205491

          2,983.75 msec task-clock                #    0.995 CPUs utilized            ( +-  0.36% )
               243      context-switches          #    0.082 K/sec                    ( +-  1.26% )
                 1      cpu-migrations            #    0.000 K/sec
            11,422      page-faults               #    0.004 M/sec                    ( +-  0.18% )
     5,928,639,486      cycles                    #    1.987 GHz                      ( +-  0.36% )  (50.02%)
    16,105,928,210      instructions              #    2.72  insn per cycle           ( +-  0.05% )  (50.02%)
     3,150,273,453      branches                  # 1055.809 M/sec                    ( +-  0.03% )  (50.05%)
         3,713,617      branch-misses             #    0.12% of all branches          ( +-  0.83% )  (50.07%)

```

It looked close to neutral, so I used `perf stat` to confirm it's about a 1% instruction count win.

For deciding whether this stack is worth it, I went back and ran `perf stat` on the baseline diff before I started touching the dispatcher:

```
          2,968.37 msec task-clock                #    0.997 CPUs utilized            ( +-  0.03% )
               250      context-switches          #    0.084 K/sec                    ( +-  2.21% )
                 1      cpu-migrations            #    0.000 K/sec
            11,403      page-faults               #    0.004 M/sec                    ( +-  0.28% )
     5,898,481,882      cycles                    #    1.987 GHz                      ( +-  0.03% )  (50.05%)
    16,169,242,938      instructions              #    2.74  insn per cycle           ( +-  0.03% )  (50.06%)
     3,076,546,626      branches                  # 1036.443 M/sec                    ( +-  0.05% )  (50.05%)
         2,531,859      branch-misses             #    0.08% of all branches          ( +-  0.89% )  (50.03%)
```

If I've done the arithmetic correctly, we have a 0.39% instruction count win.
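Checking that arithmetic against the instruction counts in the two `perf stat` listings above:

```python
baseline = 16_169_242_938  # instructions, baseline listing above
after = 16_105_928_210     # instructions, after this change
win_pct = (baseline - after) / baseline * 100
assert round(win_pct, 2) == 0.39  # matches the claimed 0.39% win
```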

Reviewed By: ezyang

Differential Revision: D25983863

fbshipit-source-id: 87d1451a01ead25738ea6b80db270d344bc583b2
2021-02-01 12:40:08 -08:00
Scott Wolchok
341c76dcc1 [PyTorch] Add C10_ALWAYS_INLINE to critical dispatcher paths (#51245)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51245

Splitting this out from #51164 (D26069629) to allow it to
land separately; I'm sure this is a good idea but I'm less sure about
#51164.
ghstack-source-id: 120697499

Test Plan:
double-check effect on empty benchmark with perf stat;
didn't move

Reviewers: ezyang, messmer

Reviewed By: ezyang

Differential Revision: D26112627

fbshipit-source-id: 50d4418d351527bcedd5ccdc49106bc642699870
2021-02-01 12:39:58 -08:00
Scott Wolchok
673687e764 [PyTorch] Refactor Dispatcher to inline less code in fast path (#51163)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51163

The Dispatcher seems to have been in a precarious local
maximum: I tried to make several different changes to parameter
passing and ended up with regressions due to reduced inlining that
swamped any gains I might have gotten from the parameter passing
changes.

This diff reduces the amount of inline code on the fast path. It
should both reduce code size and provide a platform for making further
improvements to the dispatcher code.

It is a slight performance regression, but it unblocked the following
two diffs (which seem to get us back where we were) from landing.
ghstack-source-id: 120693163

Test Plan:
CI, framework overhead benchmarks to check the size of the
regression

Compared timing for empty framework overhead benchmark before/after.

Build command: `buck build mode/no-gpu //caffe2/caffe2/fb/high_perf_models/pytorch/benchmark_framework_overheads:cpp_benchmark mode/opt-clang --show-output`
Run with `numactl -m  0 -C 3 path/to/cpp_benchmark -op empty -niter 100`

Before:
```
I0126 16:02:04.373075 2135872 bench.cpp:139] Mean 0.266272
I0126 16:02:04.373106 2135872 bench.cpp:140] Median 0.266347
I0126 16:02:04.373111 2135872 bench.cpp:141] Min 0.263585
I0126 16:02:04.373117 2135872 bench.cpp:142] stddev 0.0021264
I0126 16:02:04.373131 2135872 bench.cpp:143] stddev / mean 0.00798581
```

After:
```
I0126 16:02:30.377992 2137048 bench.cpp:139] Mean 0.27579
I0126 16:02:30.378023 2137048 bench.cpp:140] Median 0.275281
I0126 16:02:30.378029 2137048 bench.cpp:141] Min 0.270617
I0126 16:02:30.378034 2137048 bench.cpp:142] stddev 0.00308287
I0126 16:02:30.378044 2137048 bench.cpp:143] stddev / mean 0.0111783
```

Yes, it's a regression, but I compared D26069629 stacked on this diff vs not:

With this diff:

```
I0126 16:02:50.662864 2137574 bench.cpp:139] Mean 0.268645
I0126 16:02:50.662891 2137574 bench.cpp:140] Median 0.267485
I0126 16:02:50.662896 2137574 bench.cpp:141] Min 0.266485
I0126 16:02:50.662901 2137574 bench.cpp:142] stddev 0.00219359
I0126 16:02:50.662915 2137574 bench.cpp:143] stddev / mean 0.00816537
```

Without:
```
I0126 20:40:27.815824 3240699 bench.cpp:139] Mean 0.270755
I0126 20:40:27.815860 3240699 bench.cpp:140] Median 0.268998
I0126 20:40:27.815866 3240699 bench.cpp:141] Min 0.268306
I0126 20:40:27.815873 3240699 bench.cpp:142] stddev 0.00260365
I0126 20:40:27.815886 3240699 bench.cpp:143] stddev / mean 0.00961624
```

So we do seem to have accomplished something w.r.t. not overwhelming the inliner.

Reviewed By: ezyang

Differential Revision: D26091377

fbshipit-source-id: c9b7f4e187059fa15452b7c75fc29816022b92b1
2021-02-01 12:36:48 -08:00
Jacob Szwejbka
ec611aca88 [Pytorch Mobile] Expose _export_operator_list to python (#51312)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51312

Follow-up to D24690094 (4a870f6518), exposing the API in Python. Created a matching unit test.
ghstack-source-id: 120611452

Test Plan: Ran unit test

Reviewed By: dhruvbird

Differential Revision: D26112765

fbshipit-source-id: ffe3bb97de0a4f08b31719b4b47dcebd7d2fd42a
2021-02-01 12:09:02 -08:00
James Reed
609f76f27a [WIP][FX] Add Interpreter and Transformer (#50420)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50420

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D25880330

Pulled By: jamesr66a

fbshipit-source-id: 27d34888e36e39924821fed891d79f969237a104
2021-02-01 11:40:12 -08:00
Yi Wang
0831984ed5 [Resubmission][Gradient Compression] Refactor default_hooks.py and powerSGD_hook.py by creating a util function that make a vanilla allreduce future (#51400)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51400

Resubmission of #51094

Address https://github.com/pytorch/pytorch/pull/50973#discussion_r564229818

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120725690

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl

Reviewed By: rohan-varma

Differential Revision: D26162333

fbshipit-source-id: ccc2eae5383a23673e00d61cb5570fb8bf749cd0
2021-02-01 11:34:41 -08:00
Scott Wolchok
6c24296795 [PyTorch] Devirtualize TensorImpl::has_storage (#51049)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51049

This diff makes it OK to query has_storage() on all TensorImpls. I added debug assertions that storage_ is indeed never set on them, which is required for this to be correct.
ghstack-source-id: 120714380

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D26008498

fbshipit-source-id: b3f55f0b57b04636d13b09aa55bb720c6529542c
2021-02-01 11:30:23 -08:00
Scott Wolchok
765062c085 [PyTorch] Devirtualize TensorImpl::storage_offset (#51048)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51048

There doesn't seem to be any reason to prohibit accessing the always-zero storage_offset of those TensorImpls that prohibit set_storage_offset.
ghstack-source-id: 120714379

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D26008499

fbshipit-source-id: cd92ac0afdebbd5cf8f04df141843635113b6444
2021-02-01 11:27:13 -08:00
kshitij12345
50fa415a4d [testing] Add OpInfo for ceil and floor (#51198)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/50006

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51198

Reviewed By: malfet

Differential Revision: D26105099

Pulled By: mruberry

fbshipit-source-id: 6cfa89f42b87cca66dbc5bf474d17a6cad7eb45a
2021-02-01 10:10:36 -08:00
Max Balandat
449098c2d2 [SobolEngine] Update direction numbers to 21201 dims (#49710)
Summary:
Performs the update that was suggested in https://github.com/pytorch/pytorch/issues/41489

Adjust the functionality to largely match that of the scipy companion PR https://github.com/scipy/scipy/pull/10844/, including
- a new `draw_base2` method
- include zero as the first point in the (unscrambled) Sobol sequence

The scipy PR is also quite opinionated about the `draw` method being called with a power-of-2 number of points (for which the resulting sequence has nice properties; see the scipy PR for a comprehensive discussion of this).

Note that this update is a **breaking change** in the sense that sequences generated with the same parameters after as before will not be identical! They will have the same (better, arguably) distributional properties, but calling the engine with the same seed will result in different numbers in the sequence.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49710

Test Plan:
```
from torch.quasirandom import SobolEngine

sobol = SobolEngine(3)
sobol.draw(4)

sobol = SobolEngine(4, scramble=True)
sobol.draw(5)

sobol = SobolEngine(4, scramble=True)
sobol.draw_base2(2)
```

Reviewed By: malfet

Differential Revision: D25657233

Pulled By: Balandat

fbshipit-source-id: 9df50a14631092b176cc692b6024aa62a639ef61
2021-02-01 08:44:31 -08:00
Hameer Abbasi
b1907f5ebc Fix pickling for Tensor subclasses (redo) (#47732)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/47051
Redo of https://github.com/pytorch/pytorch/issues/47115

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47732

Reviewed By: izdeby

Differential Revision: D25465382

Pulled By: ezyang

fbshipit-source-id: 3a8d57281a2d6f57415d5735d34ad307f3526638
2021-02-01 07:32:52 -08:00
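The behavior being fixed — a pickle round trip should preserve the subclass type — can be illustrated with a plain Python subclass (a stand-in for illustration; the actual fix concerns subclasses of `torch.Tensor`):

```python
import pickle

class Logged(list):
    """Stand-in for a Tensor subclass."""

x = Logged([1, 2, 3])
y = pickle.loads(pickle.dumps(x))
assert type(y) is Logged  # subclass type survives the round trip
assert y == [1, 2, 3]     # and so does the payload
```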
anjali411
508bab43e7 Support complex number list in JIT (#51145)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51145

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D26154025

Pulled By: anjali411

fbshipit-source-id: 74645f9b6467757ddb9d75846e778222109848f0
2021-01-31 23:54:14 -08:00
Mike Ruberry
40c0fffb4b Fixes docs (#51439)
Summary:
pytorch_python_doc_build is failing with:

```
Jan 31 04:30:45 /var/lib/jenkins/workspace/docs/source/notes/broadcasting.rst:6: WARNING: 'any' reference target not found: numpy.doc.broadcasting
```

this removes the incorrect reference and adds an updated link.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51439

Reviewed By: ngimel

Differential Revision: D26170232

Pulled By: mruberry

fbshipit-source-id: 829999db52e1e860d36d626d0d9f26e31283d14b
2021-01-31 22:00:26 -08:00
Jianyu Huang
d1dcd5f287 [fbgemm_gpu] Use the latest philox_cuda_state API for stochastic rounding (#51004)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51004

Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/493

Follow up on the failure case on FP16 stochastic rounding:
- https://github.com/pytorch/pytorch/pull/50148
- D26006041

From Natalia:
- https://github.com/pytorch/pytorch/pull/50916 is the fix, philox_engine_inputs is deprecated btw so if you could refactor it to use philox_cuda_state that would be great.
- instructions to change the call https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/CUDAGeneratorImpl.h#L48-L83, it will be important to use philox_cuda_state with graph capture.

Benchmark:
- Before this Diff:
```
(base) [jianyuhuang@devgpu017.atn5.facebook.com: ~/fbsource/fbcode/hpc/ops/benchmarks] $  buck run mode/opt //hpc/ops/benchmarks:split_table_batched_embeddings_benchmark device -- --fp16 --stoc 2>&1 | tee before_diff.log
PARSING BUCK FILES: FINISHED IN 0.4s
CREATING ACTION GRAPH: FINISHED IN 0.0s
DOWNLOADED 0 ARTIFACTS, 0.00 BYTES, 0.0% CACHE MISS
BUILDING: FINISHED IN 5.3s (100%) 6474/6474 JOBS, 0 UPDATED
BUILD SUCCEEDED
DEBUG:root:Using fused exact_row_wise_adagrad with optimizer_args=OptimizerArgs(stochastic_rounding=True, gradient_clipping=False, max_gradient=1.0, learning_rate=0.1, eps=0.1, beta1=0.9, beta2=0.999, weight_decay=0.0, eta=0.001, momentum=0.9)
INFO:root:Embedding parameters:  0.41 GParam,  0.82GB
INFO:root:Accessed weights per batch:  83.89MB
INFO:root:Forward, B: 512, E: 100000, T: 32, D: 128, L: 20, W: False, BW:  607.48GB/s, T: 138us
INFO:root:ForwardBackward, B: 512, E: 100000, T: 32, D: 128, L: 20, BW:  220.85GB/s, T: 1139us
```

- After this Diff:
```
(base) [jianyuhuang@devgpu017.atn5.facebook.com: ~/fbsource/fbcode/hpc/ops/benchmarks] $  buck run mode/opt //hpc/ops/benchmarks:split_table_batched_embeddings_benchmark device -- --fp16 --stoc 2>&1 | tee after_diff.log
PARSING BUCK FILES: FINISHED IN 1.1s
CREATING ACTION GRAPH: FINISHED IN 0.0s
DEBUG:root:Using fused exact_row_wise_adagrad with optimizer_args=OptimizerArgs(stochastic_rounding=True, gradient_clipping=False, max_gradient=1.0, learning_rate=0.1, eps=0.1, beta1=0.9, beta2=0.999, weight_decay=0.0, eta=0.001, momentum=0.9)
INFO:root:Embedding parameters:  0.41 GParam,  0.82GB
INFO:root:Accessed weights per batch:  83.89MB
INFO:root:Forward, B: 512, E: 100000, T: 32, D: 128, L: 20, W: False, BW:  608.80GB/s, T: 138us
INFO:root:ForwardBackward, B: 512, E: 100000, T: 32, D: 128, L: 20, BW:  229.17GB/s, T: 1098us
```

Test Plan: CI

Reviewed By: ngimel

Differential Revision: D26038596

fbshipit-source-id: 5360395c1c3b1a062b38e5695239258e892c63c4
2021-01-31 20:42:43 -08:00
jiej
0e1c5cb354 fixing index clamping for upsample nearest kernel backward (#51240)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/51036

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51240

Reviewed By: ailzhang

Differential Revision: D26139221

Pulled By: ngimel

fbshipit-source-id: 0591ac6d1f988b54c1b1ee50d34fb7c2a3f97c4e
2021-01-31 15:22:58 -08:00
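The clamp in question keeps the computed source index inside the input; a pure-Python sketch of nearest-neighbor index mapping (illustrative only, not the CUDA kernel):

```python
def nearest_src_index(dst_index, scale, input_size):
    # Without the min(), rounding at the boundary can yield input_size
    # itself, which the backward kernel would then write out of bounds.
    return min(int(dst_index * scale), input_size - 1)

# Upsampling 2 -> 5: every destination index must map inside [0, 2).
assert all(nearest_src_index(i, 2 / 5, 2) < 2 for i in range(5))
assert nearest_src_index(4, 2 / 5, 2) == 1
```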
Rohan Varma
9cf62a4b5d [1.8] Add additional tests for object-based APIs (#51341)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51341

Adds tests for objects that contain CPU/GPU tensors to ensure that
they can also be serialized/deserialized appropriately.
ghstack-source-id: 120718120

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D26144100

fbshipit-source-id: f1a8ccb9741bb5372cb7809cb43cbe43bf47d517
2021-01-30 19:50:08 -08:00
Rohan Varma
c255628134 [Collective APIs] Make python object collective API args consistent (#50625)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50625

Make API signatures consistent and provide default argument similar to
the tensor collectives.
ghstack-source-id: 120718121

Test Plan: CI

Reviewed By: wanchaol

Differential Revision: D25932012

fbshipit-source-id: d16267e236a65ac9d55e19e2178f9d9267b08a20
2021-01-30 19:47:16 -08:00
Marat Subkhankulov
721ba97eb6 Create op benchmark for stack (#51263)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51263

- Add benchmark for stack op

Test Plan:
```
buck build mode/opt //caffe2/benchmarks/operator_benchmark/pt:stack_test --show-output
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 buck-out/gen/caffe2/benchmarks/operator_benchmark/pt/stack_test.par --tag_filter=static_runtime | grep Execution

Forward Execution Time (us) : 6.380
Forward Execution Time (us) : 6.553
Forward Execution Time (us) : 14.904
Forward Execution Time (us) : 5.657
Forward Execution Time (us) : 5.612
Forward Execution Time (us) : 6.051
Forward Execution Time (us) : 4.225
Forward Execution Time (us) : 4.240
Forward Execution Time (us) : 6.280
Forward Execution Time (us) : 6.267
Forward Execution Time (us) : 418.932
Forward Execution Time (us) : 417.694
Forward Execution Time (us) : 1592.455
Forward Execution Time (us) : 2919.261
Forward Execution Time (us) : 211.458
Forward Execution Time (us) : 211.518
Forward Execution Time (us) : 783.953
Forward Execution Time (us) : 1457.823
Forward Execution Time (us) : 2032.816
Forward Execution Time (us) : 2090.662
Forward Execution Time (us) : 6487.098
Forward Execution Time (us) : 11874.702
Forward Execution Time (us) : 2123.830
Forward Execution Time (us) : 2195.453
Forward Execution Time (us) : 6435.978
Forward Execution Time (us) : 11852.205
Forward Execution Time (us) : 2036.526
Forward Execution Time (us) : 2055.618
Forward Execution Time (us) : 6417.192
Forward Execution Time (us) : 12468.744
Forward Execution Time (us) : 4959.704
Forward Execution Time (us) : 5121.823
Forward Execution Time (us) : 5082.105
Forward Execution Time (us) : 5395.936
Forward Execution Time (us) : 5162.756
Forward Execution Time (us) : 23798.080
Forward Execution Time (us) : 4957.921
Forward Execution Time (us) : 4971.234
Forward Execution Time (us) : 5005.909
Forward Execution Time (us) : 5159.614
Forward Execution Time (us) : 5013.221
Forward Execution Time (us) : 20238.741
Forward Execution Time (us) : 7632.439
Forward Execution Time (us) : 7589.376
Forward Execution Time (us) : 7859.937
Forward Execution Time (us) : 8214.213
Forward Execution Time (us) : 11606.562
Forward Execution Time (us) : 34612.919
```

Reviewed By: hlu1

Differential Revision: D25859143

fbshipit-source-id: a1b735ce87f57b5eb67e223e549248a2cd7663c1
2021-01-30 10:32:14 -08:00
Natalia Gimelshein
e26fccc22b update profiler doc strings (#51395)
Summary:
Fixes formatting for autograd.profiler doc string (was broken), slightly expands profiler.profile documentation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51395

Reviewed By: ilia-cher

Differential Revision: D26162349

Pulled By: ngimel

fbshipit-source-id: ac7af8e0f3dbae2aa899ad815d2311c2758ee57c
2021-01-29 23:37:06 -08:00
Ilia Cherniavskii
17b5683156 Multi-GPU Kineto profiler test (#51391)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51391

Adding a test to check the kineto profiler on multiple gpus

Test Plan: python test/test_profiler.py

Reviewed By: ngimel

Differential Revision: D26160788

Pulled By: ilia-cher

fbshipit-source-id: f3554f52176cc26e7f331d205f1a514eb03aa758
2021-01-29 23:26:12 -08:00
Hao Lu
11cda929fb [StaticRuntime] Fix bug in MemoryPlanner (#51342)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51342

There is a subtle bug with the MemoryPlanner with regard to view ops with out variant.

```
  def forward(self, a: Tensor, shape: List[int]):
      b = a.reshape(shape)
      return b + b
```
In this case, if we replace reshape with the out variant, b is managed by the MemoryPlanner, and when opts.cleanup_activations is true the MemoryPlanner sets b's storage to nullptr right after inference. Because b is a view of a, the storage of a is also set to nullptr, which violates the API's promise that a is const.

To fix this bug, I changed the MemoryPlanner so that it puts b in the unmanaged part.

Test Plan:
Add unit test to enforce the constness of inputs

```
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
```

Reviewed By: ajyu

Differential Revision: D26144203

fbshipit-source-id: 2dbacccf7685d0fe0f0b1195166e0510b2069fe3
2021-01-29 21:16:02 -08:00
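The aliasing hazard can be modeled with a couple of tiny Python classes (a simulation of the described bug, not Static Runtime code):

```python
class Storage:
    def __init__(self, data):
        self.data = data

class SimTensor:
    def __init__(self, storage):
        self.storage = storage
    def reshape_view(self):
        # Like reshape (when no copy is needed): the result shares storage.
        return SimTensor(self.storage)

a = SimTensor(Storage([1, 2, 3]))
b = a.reshape_view()

# Buggy planner: b is "managed", so its storage is released after inference...
b.storage.data = None
# ...but b aliases a, so the supposedly-const input a is clobbered too.
assert a.storage.data is None

# The fix keeps view-op outputs like b unmanaged, leaving a's storage alone.
```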
Ansley Ussery
09e48dbd33 Handle error during dict expansion (#51374)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51374

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D26155995

Pulled By: ansley

fbshipit-source-id: 04e924cb641565341c570c6cf5e5eec42e4f9c8b
2021-01-29 18:46:10 -08:00
Natalia Gimelshein
7ab89f58be expose memory_fraction and gpu_process docs (#51372)
Summary:
Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51372

Reviewed By: mruberry

Differential Revision: D26157787

Pulled By: ngimel

fbshipit-source-id: 97eac5f12881a2bf62c251f6f7eaf65fdbe34056
2021-01-29 18:22:34 -08:00
Natalia Gimelshein
7d30f67659 remove LegacyDefinitions as it is empty now (#51251)
Summary:
Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51251

Reviewed By: mruberry

Differential Revision: D26120574

Pulled By: ngimel

fbshipit-source-id: 223b4f358932f47e0af7413752c7db7c35402260
2021-01-29 18:15:11 -08:00
Yanli Zhao
d5541c50a3 add a c++ interface in processGroup to get its backend name (#51066)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51066

The backend name of a process group created using the distributed_c10d Python API is tracked, but there is no good way to track the name of a process group created using the ProcessGroup C++ API. In some cases, knowing the backend name of a process group is useful, e.g., to log the backend name, or to write code that depends on the known backend.
ghstack-source-id: 120628432

Test Plan: unit tests

Reviewed By: pritamdamania87

Differential Revision: D26059769

fbshipit-source-id: 6584c6695c5c3570137dc98c16e06cbe4b7f5503
2021-01-29 17:28:42 -08:00
Wanchao Liang
662b6d2115 [dist_optim] update the doc of DistributedOptimizer (#51314)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51314

updating the doc of DistributedOptimizer to include TorchScript enablement information

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26156032

Pulled By: wanchaol

fbshipit-source-id: 1f3841f55918a5c2ed531cf6aeeb3f6e3a09a6a8
2021-01-29 17:12:52 -08:00
kshitij12345
a88e1d3ddf [complex] Complex support for masked_scatter and autograd support for masked_scatter and masked_select (#51281)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/33152

Changes
* Enable complex support for masked_scatter
* Enable half support for masked_scatter CPU
* Enable complex autograd support for masked_scatter CPU and masked_select (both CPU and CUDA).

**Note**:
Complex Support for masked_scatter CUDA is disabled as it depends on `masked_fill` which is yet to be ported to ATen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51281

Reviewed By: ailzhang

Differential Revision: D26127561

Pulled By: anjali411

fbshipit-source-id: 6284926b934942213c5dfc24b5bcc8538d0231af
2021-01-29 13:49:31 -08:00
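For reference, masked_scatter copies elements of `source`, in order, into the positions of `self` where `mask` is true; a list-based sketch of that semantics (complex values behave the same way):

```python
def masked_scatter(dest, mask, source):
    """Copy successive elements of `source` into the True positions of
    `mask`, leaving the rest of `dest` untouched."""
    it = iter(source)
    return [next(it) if m else d for d, m in zip(dest, mask)]

out = masked_scatter([0, 0, 0, 0], [True, False, True, False], [1 + 2j, 3 + 4j])
assert out == [1 + 2j, 0, 3 + 4j, 0]
```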
Brian Skinn
fe645fdfc7 Update _torch_docs.py (#51212)
Summary:
Fix `torch.linalg.qr` reference where it's desired to render fully-qualified name into docs.

Suggested fix for https://github.com/pytorch/pytorch/pull/47764/files#r565368195

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51212

Reviewed By: ezyang

Differential Revision: D26142496

Pulled By: ailzhang

fbshipit-source-id: 052b2085099baa372e3b515b403f25d23cf50785
2021-01-29 13:03:09 -08:00
Arindam Roy
da920fa141 Enable rocm tests in common nn (#51227)
Summary:
Fixes #{issue number}
Resubmitting a new PR as the older one got reverted due to problems in test_optim.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51227

Reviewed By: ezyang

Differential Revision: D26142505

Pulled By: ailzhang

fbshipit-source-id: a2ab5d85630aac2d2ce17652ba19c11ea668a6a9
2021-01-29 12:54:04 -08:00
Eli Uriegas
52609c8c65 .github: Up frequency of stale checks (#51365)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51365

We have a pretty big backlog of PRs when it comes to stale checks, and the action only supports processing 30 PRs at a time.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D26153785

Pulled By: seemethere

fbshipit-source-id: 585b36068683e04cf4e2cc59013482f143ec30a3
2021-01-29 12:50:40 -08:00
Ivan Kobzarev
dbfaf966b0 [android] turn on USE_VULKAN for android builds by default (#51291)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51291

Turning on USE_VULKAN for android builds
Remove standalone android vulkan build

Testing all ci jobs (for master): https://github.com/pytorch/pytorch/pull/51292

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D26141891

Pulled By: IvanKobzarev

fbshipit-source-id: e8e1a4ab612c0786ce09217ab9370fd75a71eb00
2021-01-29 11:58:21 -08:00
Hong Xu
ebd2a82559 Replace all AT_ASSERTM in RNN_miopen.cpp (#51072)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51072

AT_ASSERTM is deprecated and should be replaced by either TORCH_CHECK or
TORCH_INTERNAL_ASSERT, depending on the situation.

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D26074364

Pulled By: ezyang

fbshipit-source-id: 742e28afe49e0a546c252a0fad487f93410d0cb5
2021-01-29 11:40:38 -08:00
Hong Xu
dfca1e48d3 Replace all AT_ASSERTM under c10/ (except Exception.h) (#50843)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50843

AT_ASSERTM is deprecated and should be replaced by either TORCH_CHECK or
TORCH_INTERNAL_ASSERT, depending on the situation.

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D26074365

Pulled By: ezyang

fbshipit-source-id: 46e13588fad4e24828f3cc99635e9cb2223a6c2c
2021-01-29 11:37:07 -08:00
Shoichiro Kawauchi
c41ca4ae5b [doc]Fix autograd.detect_anomaly docs incorrectly formatted (#51335)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/51141

Two bullet points don't render as bullet points.

Before
<img width="657" alt="screenshot before" src="https://user-images.githubusercontent.com/19372617/106240701-125a3080-6248-11eb-9572-f915aa9b72e1.png">

After
<img width="888" alt="screenshot after" src="https://user-images.githubusercontent.com/19372617/106240714-17b77b00-6248-11eb-8e54-51be103639e9.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51335

Reviewed By: izdeby

Differential Revision: D26148582

Pulled By: ezyang

fbshipit-source-id: 5aff6f9bd7affdf13bec965e9bf1a417e5caa88d
2021-01-29 11:18:51 -08:00
Rohan Varma
5021582fe6 Fix benchmarks/distributed/ddp/benchmark.py (#51095)
Summary:
Fixes the issue reported in https://github.com/pytorch/pytorch/issues/50679 by using built-in object-based collectives. The user has verified that this patch works.

Test with:
RANK=0 python3 pytorch-dist-benchmark.py --world-size 2 --master-addr 127.0.0.1 --master-port 23456
RANK=1 python3 pytorch-dist-benchmark.py --world-size 2 --master-addr 127.0.0.1 --master-port 23456

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51095

Reviewed By: SciPioneer

Differential Revision: D26070275

Pulled By: rohan-varma

fbshipit-source-id: 59abcaac9e395bcdd8a018bf6ba07521d94b2fdf
2021-01-29 11:10:13 -08:00
Richard Barnes
1b089c1257 Modernize for-loops (#50899)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50899

Test Plan: Sandcastle tests + OSS CI

Reviewed By: ezyang

Differential Revision: D26001931

fbshipit-source-id: d829d520f647aacd178e1c7a9faa6196cc5af54e
2021-01-29 10:52:31 -08:00
Yi Zhang
edaa23c8ab extend init_group_test timeout to 5s (#51330)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50662

![image](https://user-images.githubusercontent.com/16190118/106225549-58030300-6220-11eb-948d-1998bdafc245.png)

From: https://circleci.com/api/v1.1/project/github/pytorch/pytorch/10203733/output/105/0?file=true&allocation-id=60022ee190b8596d279f4531-0-build%2F195A7D58 (e86f941395)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51330

Reviewed By: izdeby

Differential Revision: D26148618

Pulled By: ezyang

fbshipit-source-id: 708d7522843da2f5c919cf41919e6819f89903e2
2021-01-29 10:44:28 -08:00
Ivan Yashchuk
30675d0921 Added OpInfo-based testing of triangular_solve (#50948)
Summary:
Added OpInfo-based testing of `torch.triangular_solve`.

These tests helped discover that CPU `triangular_solve` wasn't working for empty matrices, and that for CUDA inputs a warning was printed to the terminal. Both issues are now fixed.

CUDA gradgrad checks are skipped.
```
11.44s call     test/test_ops.py::TestGradientsCUDA::test_fn_gradgrad_triangular_solve_cuda_complex128
2.97s call     test/test_ops.py::TestGradientsCUDA::test_fn_gradgrad_triangular_solve_cuda_float64
1.60s call     test/test_ops.py::TestGradientsCPU::test_fn_gradgrad_triangular_solve_cpu_complex128
1.36s call     test/test_ops.py::TestOpInfoCUDA::test_supported_dtypes_triangular_solve_cuda_complex128
1.20s call     test/test_ops.py::TestGradientsCUDA::test_fn_grad_triangular_solve_cuda_complex128
0.86s call     test/test_ops.py::TestCommonCUDA::test_variant_consistency_jit_triangular_solve_cuda_complex64
0.85s call     test/test_ops.py::TestCommonCUDA::test_variant_consistency_jit_triangular_solve_cuda_complex128
0.81s call     test/test_ops.py::TestCommonCUDA::test_variant_consistency_jit_triangular_solve_cuda_float64
0.77s call     test/test_ops.py::TestCommonCUDA::test_variant_consistency_jit_triangular_solve_cuda_float32
0.46s call     test/test_ops.py::TestCommonCPU::test_variant_consistency_jit_triangular_solve_cpu_complex128
0.44s call     test/test_ops.py::TestCommonCPU::test_variant_consistency_jit_triangular_solve_cpu_complex64
0.44s call     test/test_ops.py::TestGradientsCUDA::test_fn_grad_triangular_solve_cuda_float64
0.42s call     test/test_ops.py::TestGradientsCPU::test_fn_gradgrad_triangular_solve_cpu_float64
0.40s call     test/test_ops.py::TestCommonCPU::test_variant_consistency_jit_triangular_solve_cpu_float32
0.40s call     test/test_ops.py::TestCommonCPU::test_variant_consistency_jit_triangular_solve_cpu_float64
0.17s call     test/test_ops.py::TestGradientsCPU::test_fn_grad_triangular_solve_cpu_complex128
```
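For context, `triangular_solve` solves `A x = b` where `A` is a triangular matrix, which admits a direct substitution algorithm instead of a general factorization. A minimal pure-Python sketch of back substitution for the upper-triangular case (hypothetical helper, not the PyTorch implementation):

```python
def solve_upper_triangular(A, b):
    """Solve A x = b by back substitution, where A is an upper-triangular
    square matrix given as a list of row lists and b is a vector."""
    n = len(b)
    x = [0.0] * n
    # Work from the last row upward: each row has only one new unknown.
    for i in range(n - 1, -1, -1):
        s = b[i] - sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / A[i][i]
    return x

# For A = [[2, 1], [0, 4]] and b = [5, 8], the solution is x = [1.5, 2.0].
```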

Ref. https://github.com/pytorch/pytorch/issues/50006

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50948

Reviewed By: ailzhang

Differential Revision: D26123998

Pulled By: mruberry

fbshipit-source-id: 54136e8fc8a71f107dddb692c5be298c6d5ed168
2021-01-29 10:31:07 -08:00
Ansley Ussery
1b479416b7 Clarify logic in ir_emitter (#51299)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51299

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D26131245

Pulled By: ansley

fbshipit-source-id: ecd69275214775804f5aa92f9b4c0b19be19b596
2021-01-29 10:05:01 -08:00
Jeffrey Wan
c0966914bc Internal gradcheck wrapper in testing._internal that sets certain flags to True (#51133)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49409

There are many call sites where gradcheck/gradgradcheck is now implicitly invoked with `check_batched_grad` as True, where it was previously False. Cases fall into two basic categories:
1) the call site was previously using `torch.autograd.gradcheck` but is now changed to use the globally imported function instead
2) the call site was already using the globally imported function, but does not explicitly pass the `check_batched_grad` flag

Only in the _assertGradAndGradgradChecks cases, which are infrequent, I assumed that the author is aware that omitting the flag means not applying check_batched_grad=True. (but maybe that is not the case?)

Overall this PR in its current state assumes that unless the author explicitly specified `check_batched_grad=False`, they were probably not aware of this flag and did not mean for it to be False.

So far exceptions to the above (as discovered by CI) include:
 - Mkldnn (opaque tensors do not have strides) https://app.circleci.com/pipelines/github/pytorch/pytorch/264416/workflows/e4d87886-6247-4305-8526-2696130aa9a4/jobs/10401882/tests
 - all cases in test_sparse (https://app.circleci.com/pipelines/github/pytorch/pytorch/264553/workflows/3c1cbe30-830d-4acd-b240-38d833dccd9b/jobs/10407103)
 - all cases in test_overrides (https://app.circleci.com/pipelines/github/pytorch/pytorch/264553/workflows/3c1cbe30-830d-4acd-b240-38d833dccd9b/jobs/10407236)
 - test_autograd (test_LSTM_grad_and_gradgrad) - (https://app.circleci.com/pipelines/github/pytorch/pytorch/264553/workflows/3c1cbe30-830d-4acd-b240-38d833dccd9b/jobs/10407235)
 - test_data_parallel (test_data_parallel_buffers_requiring_grad) - *SIGSEGV* (https://app.circleci.com/pipelines/github/pytorch/pytorch/264820/workflows/14d89503-040d-4e3d-9f7b-0bc04833589b/jobs/10422697)
 - test_nn (https://app.circleci.com/pipelines/github/pytorch/pytorch/264919/workflows/df79e3ed-8a31-4a8e-b584-858ee99686ff/jobs/10427315)

A possible TODO is to prevent new tests from invoking the external gradcheck.
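The core idea behind gradcheck is to compare an analytic gradient against a finite-difference estimate. A minimal one-dimensional sketch (hypothetical helpers, not the `torch.autograd.gradcheck` implementation, which also handles tensors, complex dtypes, and the batched-grad vmap path that this PR toggles):

```python
def numerical_grad(f, x, eps=1e-6):
    """Central finite-difference approximation of df/dx at x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def gradcheck_1d(f, grad_f, x, atol=1e-4):
    """Return True if the analytic gradient grad_f matches the
    numerical estimate of f's derivative at x within atol."""
    return abs(grad_f(x) - numerical_grad(f, x)) < atol

# A correct gradient passes; a wrong one fails.
# gradcheck_1d(lambda x: x * x, lambda x: 2 * x, 3.0)  -> True
# gradcheck_1d(lambda x: x * x, lambda x: 3 * x, 3.0)  -> False
```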

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51133

Reviewed By: ezyang

Differential Revision: D26147919

Pulled By: soulitzer

fbshipit-source-id: dff883b50f337510a89f391ea2fd87de2d531432
2021-01-29 09:13:37 -08:00
Iurii Zdebskyi
5a406c023e Revert D26070147: [Gradient Compression] Refactor default_hooks.py and powerSGD_hook.py by creating a util function that make a vanilla allreduce future
Test Plan: revert-hammer

Differential Revision:
D26070147 (e7b3496232)

Original commit changeset: 8c9339f1511e

fbshipit-source-id: fa1e9582baec9759a73b3004be9bb19bdeb6cd34
2021-01-29 09:06:24 -08:00
Yan Li
270111b7b6 split quantization jit op (#51329)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51329

Currently test_qbatch_norm_relu contains too many examples, causing a timeout. Splitting them for now to fix the timeout issue.

Test Plan: buck test caffe2/test:quantization

Reviewed By: supriyar

Differential Revision: D26141037

fbshipit-source-id: da877efa78924a252a35c2b83407869ebb8c48b7
2021-01-29 07:49:53 -08:00