Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51460
This PR is part of a larger effort to ensure torch.linalg documentation is consistent (see #50287).
* #51459 [doc] Fix linalg.cholesky doc consistency issues
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D26176130
Pulled By: heitorschueroff
fbshipit-source-id: cc89575db69cbfd5f87d970a2e71deb6522a35b1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51459
This PR is part of a larger effort to ensure torch.linalg documentation is consistent (see #50287).
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D26176131
Pulled By: heitorschueroff
fbshipit-source-id: 2ad88a339e6dff044965e8bf29dd8c852afecb34
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51270
Similar to #50973, allow the batched version to run vanilla allreduce for the first K iterations.
This may be useful if the batched version can be applied to some use cases where the accuracy requirement is not very strict.
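The warm-start idea above can be sketched in plain Python (this is our own illustration, not the actual DDP comm hook API): count iterations inside the hook and dispatch to the exact path until the threshold is crossed.

```python
# Hedged sketch of the warm-start idea; the class and callback signatures
# here are hypothetical stand-ins for the real DDP comm hook machinery.
class WarmStartHook:
    def __init__(self, start_compress_iter, allreduce_fn, compressed_fn):
        self.iter = 0
        self.start_compress_iter = start_compress_iter
        self.allreduce_fn = allreduce_fn      # exact (vanilla) reduction
        self.compressed_fn = compressed_fn    # approximate (PowerSGD) reduction

    def __call__(self, bucket):
        self.iter += 1
        if self.iter <= self.start_compress_iter:
            # First K iterations: accuracy-preserving vanilla allreduce.
            return self.allreduce_fn(bucket)
        # Afterwards: switch to the faster, lossy compressed path.
        return self.compressed_fn(bucket)
```

The net effect is that early, accuracy-sensitive training steps stay exact, while the bulk of training enjoys the compression speedup.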
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120725858
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
baseline: f248001754
batched PowerSGD: f246960752
The training time was reduced from 54m48s to 30m33s, and the accuracy is approximately the same: 44.21 vs 44.35
Reviewed By: rohan-varma
Differential Revision: D26077709
fbshipit-source-id: 6afeefad7a3fbdd7da2cbffb56dfbad855a96cb5
Summary:
This is the initial skeleton for C++ codegen; it includes code generation for Allocate and Free.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51070
Test Plan: New unit tests are added to `test_cpp_codegen.cpp`.
Reviewed By: ZolotukhinM
Differential Revision: D26061818
Pulled By: cheng-chang
fbshipit-source-id: b5256b2dcee6b2583ba73b6c9684994dbe7cdc1f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51269
This saves about 10% of the compile time of Functions.cpp. Found using clang-9's `-ftime-trace` feature + ClangBuildAnalyzer.
Test Plan:
Compared -ftime-trace + ClangBuildAnalyzer output.
Before: P167884397
After: P167888502
Note that time spent generating assertSignatureIsCorrect is way down, though it's still kind of slow.
Reviewed By: ezyang
Differential Revision: D26121814
fbshipit-source-id: 949a85d8939c02e4fb5ac1adc35905ed34414724
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51247
See code comment for explanation.
This measures as neutral compared to the previous diff with `perf stat` when running a
benchmark that calls empty in a loop. I think we should commit it
anyway because:
1) I have previously seen it make a difference when applied earlier in
the stack.
2) This makes sense both on principle and via inspecting output
assembly: we avoid having to touch the boxed kernel at all (usually)
and instead use the unboxed kernel for both the validity check in
`OperatorEntry::lookup` and the actual `KernelFunction::call`.
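The shape of this optimization can be sketched in Python (the real code is C++; the class and method names below are our own simplification): keep the unboxed function pointer around and use it both as the "is this kernel registered?" check and as the thing that actually gets called, so the boxed wrapper is never touched on the fast path.

```python
# Hypothetical stand-in for the real KernelFunction: the same unboxed
# callable serves both the validity check and the fast-path call.
class KernelFunction:
    def __init__(self, unboxed=None):
        self.unboxed = unboxed  # None means "no unboxed kernel registered"

    def is_valid_unboxed(self):
        # Validity check used by lookup: just a null test on the pointer.
        return self.unboxed is not None

    def call(self, *args):
        # Fast-path call goes straight through the unboxed callable.
        return self.unboxed(*args)
```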
ghstack-source-id: 120697497
Test Plan: Aforementioned perf measurement
Reviewed By: ezyang
Differential Revision: D26113650
fbshipit-source-id: 8448c4ed764d477f63eb7c0f6dd87b1fc0228b73
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51245
Splitting this out from #51164 (D26069629) to allow it to
land separately; I'm sure this is a good idea but I'm less sure about
#51164.
ghstack-source-id: 120697499
Test Plan:
double-check effect on empty benchmark with perf stat;
didn't move
Reviewers: ezyang, messmer
Reviewed By: ezyang
Differential Revision: D26112627
fbshipit-source-id: 50d4418d351527bcedd5ccdc49106bc642699870
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51163
The Dispatcher seems to have been in a precarious local
maximum: I tried to make several different changes to parameter
passing and ended up with regressions due to reduced inlining that
swamped any gains I might have gotten from the parameter passing
changes.
This diff reduces the amount of inline code on the fast path. It
should both reduce code size and provide a platform for making further
improvements to the dispatcher code.
It is a slight performance regression, but it unblocked the following
two diffs (which seem to get us back where we were) from landing.
ghstack-source-id: 120693163
Test Plan:
CI, framework overhead benchmarks to check the size of the
regression
Compared timing for empty framework overhead benchmark before/after.
Build command: `buck build mode/no-gpu //caffe2/caffe2/fb/high_perf_models/pytorch/benchmark_framework_overheads:cpp_benchmark mode/opt-clang --show-output`
Run with `numactl -m 0 -C 3 path/to/cpp_benchmark -op empty -niter 100`
Before:
```
I0126 16:02:04.373075 2135872 bench.cpp:139] Mean 0.266272
I0126 16:02:04.373106 2135872 bench.cpp:140] Median 0.266347
I0126 16:02:04.373111 2135872 bench.cpp:141] Min 0.263585
I0126 16:02:04.373117 2135872 bench.cpp:142] stddev 0.0021264
I0126 16:02:04.373131 2135872 bench.cpp:143] stddev / mean 0.00798581
```
After:
```
I0126 16:02:30.377992 2137048 bench.cpp:139] Mean 0.27579
I0126 16:02:30.378023 2137048 bench.cpp:140] Median 0.275281
I0126 16:02:30.378029 2137048 bench.cpp:141] Min 0.270617
I0126 16:02:30.378034 2137048 bench.cpp:142] stddev 0.00308287
I0126 16:02:30.378044 2137048 bench.cpp:143] stddev / mean 0.0111783
```
Yes, it's a regression, but I compared D26069629 stacked on this diff vs not:
With this diff:
```
I0126 16:02:50.662864 2137574 bench.cpp:139] Mean 0.268645
I0126 16:02:50.662891 2137574 bench.cpp:140] Median 0.267485
I0126 16:02:50.662896 2137574 bench.cpp:141] Min 0.266485
I0126 16:02:50.662901 2137574 bench.cpp:142] stddev 0.00219359
I0126 16:02:50.662915 2137574 bench.cpp:143] stddev / mean 0.00816537
```
Without:
```
I0126 20:40:27.815824 3240699 bench.cpp:139] Mean 0.270755
I0126 20:40:27.815860 3240699 bench.cpp:140] Median 0.268998
I0126 20:40:27.815866 3240699 bench.cpp:141] Min 0.268306
I0126 20:40:27.815873 3240699 bench.cpp:142] stddev 0.00260365
I0126 20:40:27.815886 3240699 bench.cpp:143] stddev / mean 0.00961624
```
So we do seem to have accomplished something w.r.t. not overwhelming the inliner.
Reviewed By: ezyang
Differential Revision: D26091377
fbshipit-source-id: c9b7f4e187059fa15452b7c75fc29816022b92b1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51312
Follow-up to D24690094 (4a870f6518), exposing the API in Python. Created a matching unit test.
ghstack-source-id: 120611452
Test Plan: Ran unit test
Reviewed By: dhruvbird
Differential Revision: D26112765
fbshipit-source-id: ffe3bb97de0a4f08b31719b4b47dcebd7d2fd42a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51049
This diff makes it OK to query has_storage() on all TensorImpls. I added debug assertions that storage_ is indeed never set on them, which is required for this to be correct.
ghstack-source-id: 120714380
Test Plan: CI
Reviewed By: ezyang
Differential Revision: D26008498
fbshipit-source-id: b3f55f0b57b04636d13b09aa55bb720c6529542c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51048
There doesn't seem to be any reason to prohibit accessing the always-zero storage_offset of those TensorImpls that prohibit set_storage_offset.
ghstack-source-id: 120714379
Test Plan: CI
Reviewed By: ezyang
Differential Revision: D26008499
fbshipit-source-id: cd92ac0afdebbd5cf8f04df141843635113b6444
Summary:
Performs the update that was suggested in https://github.com/pytorch/pytorch/issues/41489
Adjust the functionality to largely match that of the scipy companion PR https://github.com/scipy/scipy/pull/10844/, including
- a new `draw_base2` method
- include zero as the first point in the (unscrambled) Sobol sequence
The scipy PR is also quite opinionated when the `draw` method is not called with a power-of-2 number of points (for which the resulting sequence has nice properties; see the scipy PR for a comprehensive discussion).
Note that this update is a **breaking change**: sequences generated with the same parameters will not be identical to those generated before this change! They will have the same (arguably better) distributional properties, but calling the engine with the same seed will produce different numbers in the sequence.
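The "zero as the first point" behavior can be illustrated without torch: the first dimension of the unscrambled Sobol sequence coincides with the base-2 van der Corput sequence, which a few lines of pure Python can generate (illustration only, not the torch implementation).

```python
# Base-2 radical-inverse (van der Corput) sequence: the first dimension of
# the unscrambled Sobol sequence. After this change, index 0 maps to 0.0.
def van_der_corput_base2(n):
    points = []
    for i in range(n):
        x, denom = 0.0, 1.0
        k = i
        while k:
            denom *= 2.0
            k, bit = divmod(k, 2)
            x += bit / denom  # append the next reversed bit of i
        points.append(x)
    return points
```

Drawing a power-of-2 number of points from this sequence fills dyadic intervals evenly, which is why the base-2 `draw_base2` entry point exists.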
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49710
Test Plan:
```
from torch.quasirandom import SobolEngine
sobol = SobolEngine(3)
sobol.draw(4)
sobol = SobolEngine(4, scramble=True)
sobol.draw(5)
sobol = SobolEngine(4, scramble=True)
sobol.draw_base2(2)
```
Reviewed By: malfet
Differential Revision: D25657233
Pulled By: Balandat
fbshipit-source-id: 9df50a14631092b176cc692b6024aa62a639ef61
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51341
Adds tests for objects that contain CPU/GPU tensors to ensure that
they can also be serialized/deserialized appropriately.
ghstack-source-id: 120718120
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D26144100
fbshipit-source-id: f1a8ccb9741bb5372cb7809cb43cbe43bf47d517
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50625
Make API signatures consistent and provide default arguments similar to
the tensor collectives.
ghstack-source-id: 120718121
Test Plan: CI
Reviewed By: wanchaol
Differential Revision: D25932012
fbshipit-source-id: d16267e236a65ac9d55e19e2178f9d9267b08a20
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51342
There is a subtle bug with the MemoryPlanner with regard to view ops with out variant.
```
def forward(self, a: Tensor, shape: List[int]):
    b = a.reshape(shape)
    return b + b
```
In this case, if we replace reshape with its out variant, b is managed by the MemoryPlanner, and its storage is set to nullptr right after inference when opts.cleanup_activations is true. Because b is a view of a, the storage of a is also set to nullptr, which violates the API's promise that a is const.
To fix this bug, I changed the MemoryPlanner so that it puts b in the unmanaged part.
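The aliasing hazard can be modeled with a toy Storage/Tensor pair (ours, not the real MemoryPlanner code): views share a Storage object, so "freeing" a managed view's storage also clobbers the base tensor's data, which is why view outputs must stay unmanaged.

```python
# Toy model of the bug. Storage holds the data; a view Tensor shares the
# same Storage object as its base.
class Storage:
    def __init__(self, data):
        self.data = data

class Tensor:
    def __init__(self, storage):
        self.storage = storage

    def reshape_view(self):
        # A reshape that can alias returns a view sharing this storage.
        return Tensor(self.storage)

def cleanup(managed_tensors):
    # What the planner does after inference when cleanup_activations is on:
    # release the storage of every tensor it manages.
    for t in managed_tensors:
        t.storage.data = None
```

Running `cleanup` over a list containing a view wipes the base tensor's data too; leaving the view out of the managed set (the fix) keeps the input intact.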
Test Plan:
Add unit test to enforce the constness of inputs
```
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
```
Reviewed By: ajyu
Differential Revision: D26144203
fbshipit-source-id: 2dbacccf7685d0fe0f0b1195166e0510b2069fe3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51066
The backend name of a process group created with the distributed_c10d Python API is tracked, but there is no good way to track the name of a process group created with the ProcessGroup C++ API. In some cases, knowing the backend name of a process group is useful, e.g., to log the backend name or to write code that depends on a known backend.
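The shape of the fix can be sketched in Python (these are illustrative classes, not the actual c10d types): a Python-side registry only knows about groups created through Python, whereas a virtual getter on the process-group base class answers the question for any group, including ones constructed directly in C++.

```python
# Hypothetical stand-ins for the real ProcessGroup hierarchy: each concrete
# backend reports its own name via a base-class getter.
class ProcessGroup:
    def get_backend_name(self):
        raise NotImplementedError

class ProcessGroupNCCL(ProcessGroup):
    def get_backend_name(self):
        return "nccl"

class ProcessGroupGloo(ProcessGroup):
    def get_backend_name(self):
        return "gloo"

def log_backend(pg):
    # Works regardless of how the group was created.
    return "using backend: " + pg.get_backend_name()
```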
ghstack-source-id: 120628432
Test Plan: unit tests
Reviewed By: pritamdamania87
Differential Revision: D26059769
fbshipit-source-id: 6584c6695c5c3570137dc98c16e06cbe4b7f5503
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51314
Updates the DistributedOptimizer doc to include TorchScript enablement information.
Test Plan: Imported from OSS
Reviewed By: pbelevich
Differential Revision: D26156032
Pulled By: wanchaol
fbshipit-source-id: 1f3841f55918a5c2ed531cf6aeeb3f6e3a09a6a8
Summary:
Reference: https://github.com/pytorch/pytorch/issues/33152
Changes
* Enable complex support for masked_scatter
* Enable half support for masked_scatter CPU
* Enable complex autograd support for masked_scatter CPU and masked_select (both CPU and CUDA).
**Note**:
Complex support for masked_scatter CUDA is disabled, as it depends on `masked_fill`, which has yet to be ported to ATen.
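For reference, the masked_scatter semantics being extended to complex values can be sketched over plain Python lists (illustration only; the real op works on torch tensors): elements of `source` are copied, in order, into the positions of the input where `mask` is True.

```python
# Pure-Python sketch of masked_scatter semantics, including complex values.
def masked_scatter(self_vals, mask, source):
    it = iter(source)
    # Each True position consumes the next element of source, in order.
    return [next(it) if m else v for v, m in zip(self_vals, mask)]
```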
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51281
Reviewed By: ailzhang
Differential Revision: D26127561
Pulled By: anjali411
fbshipit-source-id: 6284926b934942213c5dfc24b5bcc8538d0231af
Summary:
Fixes #{issue number}
Resubmitting a new PR as the older one got reverted due to problems in test_optim.py.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51227
Reviewed By: ezyang
Differential Revision: D26142505
Pulled By: ailzhang
fbshipit-source-id: a2ab5d85630aac2d2ce17652ba19c11ea668a6a9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51365
We have a pretty big backlog of PRs when it comes to checking for staleness, and the action only supports processing 30 PRs at a time.
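Assuming this workflow uses the actions/stale action, the relevant knob would look something like the fragment below (the version tag and value are illustrative, not the actual workflow contents):

```yaml
# Hypothetical workflow fragment: raise the per-run processing budget of
# actions/stale above its default of 30 operations.
- uses: actions/stale@v3
  with:
    operations-per-run: 300
```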
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Test Plan: Imported from OSS
Reviewed By: samestep
Differential Revision: D26153785
Pulled By: seemethere
fbshipit-source-id: 585b36068683e04cf4e2cc59013482f143ec30a3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51072
AT_ASSERTM is deprecated and should be replaced by either TORCH_CHECK or
TORCH_INTERNAL_ASSERT, depending on the situation.
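The distinction can be sketched in Python (the real macros are C++; `torch_check` below is our own stand-in): TORCH_CHECK-style checks validate user input and raise an error the caller can act on, while TORCH_INTERNAL_ASSERT-style assertions guard internal invariants whose failure indicates a bug in the library itself.

```python
# Hypothetical analogue of TORCH_CHECK: user-facing validation that raises.
def torch_check(cond, msg):
    if not cond:
        raise RuntimeError(msg)

def clamp_index(i, size):
    # User-facing precondition -> TORCH_CHECK territory.
    torch_check(size > 0, "size must be positive")
    r = min(max(i, 0), size - 1)
    # Internal invariant -> TORCH_INTERNAL_ASSERT territory.
    assert 0 <= r < size
    return r
```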
Test Plan: Imported from OSS
Reviewed By: ailzhang
Differential Revision: D26074364
Pulled By: ezyang
fbshipit-source-id: 742e28afe49e0a546c252a0fad487f93410d0cb5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50843
AT_ASSERTM is deprecated and should be replaced by either TORCH_CHECK or
TORCH_INTERNAL_ASSERT, depending on the situation.
Test Plan: Imported from OSS
Reviewed By: ailzhang
Differential Revision: D26074365
Pulled By: ezyang
fbshipit-source-id: 46e13588fad4e24828f3cc99635e9cb2223a6c2c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51329
Currently test_qbatch_norm_relu contains too many examples and causes a timeout. Split it for now to fix the timeout issue.
Test Plan: buck test caffe2/test:quantization
Reviewed By: supriyar
Differential Revision: D26141037
fbshipit-source-id: da877efa78924a252a35c2b83407869ebb8c48b7