Summary:
Working towards https://docs.google.com/document/d/10yx2-4gs0gTMOimVS403MnoAWkqitS8TUHX73PN8EjE/edit?pli=1#
This PR:
- Ensure that all the submodules are listed in a rst file (that ensure they are considered by the coverage tool)
- Remove some long deprecated code that just error out on import
- Remove the allow list altogether to ensure nothing gets added back there
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73983
Reviewed By: anjali411
Differential Revision: D34787908
Pulled By: albanD
fbshipit-source-id: 163ce61e133b12b2f2e1cbe374f979e3d6858db7
(cherry picked from commit c9edfead7a01dc45bfc24eaf7220d2a84ab1f62e)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73601
Some users had questions about how the RPC framework deals with
failures and whether we retry. Adding a note about this to our docs to
elaborate on our current behavior and why we chose that approach.
ghstack-source-id: 150359866
Test Plan: view docs.
Reviewed By: mrshenli
Differential Revision: D34560199
fbshipit-source-id: ee33ceed7fa706270d4ca5c8fcff7535583490ff
(cherry picked from commit 954a906240cc40aacf08ca13f6554a35303a678a)
Summary:
This PR adds a minimal version of a NestedTensor. It introduces the general harness future development can be built around.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72881
Reviewed By: albanD
Differential Revision: D34259177
Pulled By: cpuhrsch
fbshipit-source-id: 0245c36f603424e20f3b09651043c207f526d760
(cherry picked from commit 10764e8d427f29b364567e4cbc86ed73c3933158)
Summary:
This PR introduces the `cuSolverSP` backend for `linalg.solve` with sparse CSR input matrices. The motivation comes from the issue: https://github.com/pytorch/pytorch/issues/69538.
`cuSolver` provides [`cusolverSp<t>csrlsvluHost`](https://docs.nvidia.com/cuda/cusolver/index.html#cusolver-lt-t-gt-csrlsvlu) API, a few things to note:
1. As mentioned in the documentation: `only CPU (Host) path is provided.` From the profiling, there doesn't seem to be any GPU kernel launch for optimization, please see the profiling below.
2. Since only `host` path is provided, the CPU path uses `csrlsvluHost` (but requires PyTorch to be installed/built with CUDA support).
3. The documentation mentions reordering helps optimize stuff, but it isn't clear how it affects the performance. There are options for reordering, so we stick to `reorder = 0` as the default choice.
`cuSolver` has [`csrlsvqr`](https://docs.nvidia.com/cuda/cusolver/index.html#cusolver-lt-t-gt-csrlsvqr) function which provides a `device` path to solve the linear system. This function is used for the CUDA path in this PR.
**Gist:**
For CPU Path: we call [`csrlsvluHost` function of cuSolver](https://docs.nvidia.com/cuda/cusolver/index.html#cusolver-lt-t-gt-csrlsvlu).
For CUDA Path: we call [`csrlsvqr` function of cuSolver](https://docs.nvidia.com/cuda/cusolver/index.html#cusolver-lt-t-gt-csrlsvqr).
**Profiling:** (On sparse input tensor of size 1000 x 1000, with a vector of shape length 1000), for `csrlsvlu` function (to show no GPU optimization)
```cpp
==3999651== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 2.1440us 1 2.1440us 2.1440us 2.1440us [CUDA memcpy HtoD]
API calls: 99.72% 1.07199s 9 119.11ms 500ns 1.07164s cudaFree
0.11% 1.2182ms 398 3.0600us 140ns 137.94us cuDeviceGetAttribute
0.06% 674.45us 4 168.61us 165.50us 173.64us cuDeviceTotalMem
0.03% 357.07us 4 89.268us 2.7800us 201.89us cudaMalloc
0.03% 309.29us 1 309.29us 309.29us 309.29us cudaGetDeviceProperties
0.01% 160.47us 332 483ns 350ns 3.3300us cudaFuncSetAttribute
0.01% 115.12us 4 28.780us 26.290us 33.410us cuDeviceGetName
0.00% 28.591us 5 5.7180us 440ns 16.921us cudaGetDevice
0.00% 22.061us 4 5.5150us 871ns 18.690us cudaDeviceSynchronize
0.00% 20.370us 18 1.1310us 410ns 6.9900us cudaEventDestroy
0.00% 16.390us 1 16.390us 16.390us 16.390us cudaMemcpy
0.00% 11.540us 2 5.7700us 1.4900us 10.050us cuDeviceGetPCIBusId
0.00% 10.510us 18 583ns 430ns 1.6200us cudaEventCreateWithFlags
0.00% 7.9100us 21 376ns 290ns 700ns cudaDeviceGetAttribute
0.00% 1.4300us 6 238ns 150ns 590ns cuDeviceGet
0.00% 1.2200us 4 305ns 190ns 500ns cuDeviceGetCount
0.00% 900ns 1 900ns 900ns 900ns cuInit
0.00% 860ns 4 215ns 180ns 260ns cuDeviceGetUuid
0.00% 240ns 1 240ns 240ns 240ns cuDriverGetVersion
0.00% 230ns 1 230ns 230ns 230ns cudaGetDeviceCount
```
Script:
```python
import torch
def solve(x, other, out):
torch.linalg.solve(x, other, out=out)
if __name__ == "__main__":
dense_inp = torch.randn((1000, 1000), dtype=torch.float64)
# Set 50% of the values to 0 randomly
dense_inp = torch.nn.functional.dropout(dense_inp, p=0.5)
sparse_inp = dense_inp.to_sparse_csr()
other = torch.randint(100, (1000,), dtype=torch.float64)
out = torch.randint(1, (1000,), dtype=torch.float64)
solve(sparse_inp, other, out)
```
The following error is raised when the function is used on a CPU device with PyTorch built/installed without CUDA support:
* When built without CUDA support:
```python
/home/krshrimali/pytorch/torch/autograd/profiler.py:151: UserWarning: CUDA is not available, disabling CUDA profiling
warn("CUDA is not available, disabling CUDA profiling")
Traceback (most recent call last):
File "/home/krshrimali/pytorch/test_sp.py", line 17, in <module>
solve(x, other, out)
File "/home/krshrimali/pytorch/test_sp.py", line 5, in solve
torch.linalg.solve(x, other, out=out)
RuntimeError: PyTorch was not built with CUDA support. Please use PyTorch built CUDA support
```
**Performance Comparison** (vs SciPy's [`scipy.sparse.linalg.spsolve`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.linalg.spsolve.html):
Time taken by `scipy.sparse.linalg.spsolve` : 0.595 seconds
On CPU: Time taken by `torch.linalg.solve` : 4.565 seconds
On CUDA: Time taken by `torch.linalg.solve`: 1.838 seconds
The inputs are of dimensions: (17281, 17281) and (17281, 1), and were taken from https://math.nist.gov/MatrixMarket/extreme.html.
Thanks to IvanYashchuk for helping me with the PR, and guiding me through it.
cc: IvanYashchuk pearu nikitaved cpuhrsch
cc nikitaved pearu cpuhrsch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71399
Reviewed By: VitalyFedyunin
Differential Revision: D33767740
Pulled By: cpuhrsch
fbshipit-source-id: a945f065210cd719096eb8d7cdbf8e8937c2fce9
(cherry picked from commit f4f35c17da414e1ca6c6d91402933521857aa1ea)
PR #72405 added four new types to the public python API:
`torch.ComplexFloatTensor`, `torch.ComplexDoubleTensor`,
`torch.cuda.ComplexFloatTensor` and `torch.cuda.ComplexDoubleTensor`.
I believe this was unintentional and a clarifying comment as to the
purpose of `all_declared_types` is needed to avoid this in future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73370
Summary:
This pull request introduces `SoftplusTransform` to `torch.distributions.transforms`. `SoftplusTransform` transforms via the mapping `Softplus(x) = log(1 + exp(x))`. Note that the transform is different to [`torch.nn.Softplus`](https://pytorch.org/docs/stable/generated/torch.nn.Softplus.html#torch.nn.Softplus), as that has additional `beta` and `threshold` parameters. Inverse and `log_abs_det_jacobian` for a more complex `SoftplusTransform` can be added in the future.
vitkl fritzo
Addresses the issue discussed here: [pyro issue 855](https://github.com/pyro-ppl/numpyro/issues/855)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52300
Reviewed By: albanD, ejguan
Differential Revision: D34082655
Pulled By: neerajprad
fbshipit-source-id: 6114e74ee5d73c1527191bed612a142d691e2094
(cherry picked from commit a181a3a9e53a34214a503d38760ad7778d08a680)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73361
This PR adds the documentation for the newly introduced `TORCH_CPP_LOG_LEVEL` and how it can be used along with `TORCH_DISTRIBUTED_DEBUG` to adjust the log level of c10d.
ghstack-source-id: 149874995
Test Plan: Locally rendered and checked the documentation.
Reviewed By: rohan-varma
Differential Revision: D34452352
fbshipit-source-id: ecb54590f3030ddef9921a7152ca9f7fc9438345
(cherry picked from commit f4c7c6f3b27dbd3006686cf26a6e9e53cd2c8f09)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73166
This PR refactors, cleans up, and optimizes the implementation of `TORCH_DISTRIBUTED_DEBUG`. It also introduces three new user APIs: `get_debug_level()`, `set_debug_level()`, and `set_debug_level_from_env()` to retrieve and modify the debug level after a process has started.
ghstack-source-id: 149778566
Test Plan: Run the existing unit tests.
Reviewed By: rohan-varma
Differential Revision: D34371226
fbshipit-source-id: e18443b411adcbaf39b2ec999178c198052fcd5b
(cherry picked from commit 26d6bb1584b83a0490d8b766482656a5887fa21d)
Summary:
Fixespytorch/pytorch.github.io#929
The pytorch doc team would like to move to only major.minor documentation at https://pytorch.org/docs/versions.html, not major.minor.patch. This has been done in the CI scripts, but the generated documentation still has the patch version. Remove it when building RELEASE documentation. This allows simplifying the logic, using `'.'.join(torch_version.split('.')[:2])` since we no longer care about trimming off the HASH: it automatically gets removed.
holly1238, brianjo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72706
Reviewed By: samdow
Differential Revision: D34215815
Pulled By: albanD
fbshipit-source-id: 8437036cc6636674d9ab8b1666f37b561d0527e1
(cherry picked from commit d8caf988f9)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69550
Fix the wiki URL.
Also minor reorganization in onnx.rst.
Test Plan: Imported from OSS
Reviewed By: msaroufim
Differential Revision: D32994269
Pulled By: malfet
fbshipit-source-id: 112acfe8b7c778d7e3c2cef684023fdaf2c6ec9c
(cherry picked from commit f0787fabde)
Summary:
It is probably the most user friendly to link to that (lesser known?) feature.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72584
Reviewed By: soulitzer
Differential Revision: D34173999
Pulled By: albanD
fbshipit-source-id: 99fff7a55412faf54888f8317ab2388f4d7d30e4
(cherry picked from commit 2191ee7657)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68491
* Allows implementing symbolic functions for domains other than `aten`, for example `prim`, in symbolic_opset#.py.
* Allows symbolic function to access extra context if needed, through `SymbolicFunctionState`.
* Particularly, the `prim::PythonOp` special case can access node without the need of passing node through inputs. Updates will be made downstreams, and in a follow-up PR we will remove the previous workaround in exporter.
* `prim::Loop`, `prim::If`, etc are now moved outside of `_run_symbolic_function` from utils.py, and to symbolic_opset9.py.
Motivation for this change:
- Better maintainability and reducing complexity. Easier to add symbolic for operators, both simple and complex ones (that need additional context), without the former needing to know the existence of the latter.
- The design idea was long outdated. prim ops are no longer rare special cases, and they shouldn't all be handled inside `_run_symbolic_function`. As a result this function becomes too clumsy. There were also prim ops symbolic added in symbolic_opset#.py with signature `prim_[opname]`, creating separation and confusion.
Test Plan: Imported from OSS
Reviewed By: jansel
Differential Revision: D32483782
Pulled By: malfet
fbshipit-source-id: f9affc31b1570af30ffa6668da9375da111fd54a
Co-authored-by: BowenBao <bowbao@microsoft.com>
(cherry picked from commit 1e04ffd2fd)
Summary:
This PR adds a transform that uses the cumulative distribution function of a given probability distribution.
For example, the following code constructs a simple Gaussian copula.
```python
# Construct a Gaussian copula from a multivariate normal.
base_dist = MultivariateNormal(
loc=torch.zeros(2),
scale_tril=LKJCholesky(2).sample(),
)
transform = CumulativeDistributionTransform(Normal(0, 1))
copula = TransformedDistribution(base_dist, [transform])
```
The following snippet creates a "wrapped" Gaussian copula for correlated positive variables with Weibull marginals.
```python
transforms = [
CumulativeDistributionTransform(Normal(0, 1)),
CumulativeDistributionTransform(Weibull(4, 2)).inv,
]
wrapped_copula = TransformedDistribution(base_dist, transforms)
```
cc fritzo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72495
Reviewed By: ejguan
Differential Revision: D34085919
Pulled By: albanD
fbshipit-source-id: 7917391519a96b0d9b54c52db65d1932f961d070
(cherry picked from commit 572196146e)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72499
Pull Request resolved: https://github.com/pytorch/benchmark/pull/740
To fx2trt out of tree to remove bloatness of PyTorch core.
It's the first and major step. Next, we will move acc_tracer out of the tree and rearrange some fx passes.
Reviewed By: suo
Differential Revision: D34065866
fbshipit-source-id: c72b7ad752d0706abd9a63caeef48430e85ec56d
(cherry picked from commit 91647adbca)
Summary:
Correcting a minor typo: "Users should pay" instead of "Users should be pay"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72500
Reviewed By: albanD
Differential Revision: D34077972
Pulled By: ejguan
fbshipit-source-id: 5d7a138d1f17eca838d2c1da76d7759d96e4375f
(cherry picked from commit d046baa89c)
Summary:
This PR ports `index_copy` implementation to structured kernels, also adds an `out` variant.
~Note to the reviewers: This is in draft mode, waiting for the tests from the CI, and I'll give a final look before requesting the review.~
Issue tracker: https://github.com/pytorch/pytorch/issues/55070
cc: bdhirsh ysiraichi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67329
Reviewed By: ejguan
Differential Revision: D34077219
Pulled By: bdhirsh
fbshipit-source-id: 6accda33957f654b753261c5c3d765a27a64d2c0
(cherry picked from commit f3ac83217a)
Summary:
Let's make the documentation for `torch.sparse.sampled_addmm` searchable in the PyTorch documentation.
This PR shall be cherry-picked for the next 1.11 release.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72312
Reviewed By: davidberard98
Differential Revision: D34045230
Pulled By: cpuhrsch
fbshipit-source-id: c1b1dc907443284857f48c8ce1efab22c6701bbe
(cherry picked from commit 225929ecf2)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72084
make fsdp folder to be public
ghstack-source-id: 148173447
Test Plan: unit tests
Reviewed By: mrshenli
Differential Revision: D33903417
fbshipit-source-id: 7852a2adc4af09af48a5ffa52ebf210489f834d5
(cherry picked from commit bd06513cfe)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72111
For vectorize flag:
- Advertises the use of functorch
For autograd.functional.jvp:
- Advertises the use of functorch and the low-level jvp API, both of
which will be more performant than the double backprop trick.
Test Plan: - view docs
Reviewed By: albanD
Differential Revision: D33918065
Pulled By: zou3519
fbshipit-source-id: 6e19699aa94f0e023ccda0dc40551ad6d932b7c7
(cherry picked from commit b4662ceb99)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72009
This simplifies the Stats interface by merging IntervalStat and FixedCountStat into a single Stat w/ a specific window size duration and an optional max samples per window. This allows for the original intention of having comparably sized windows (for statistical purposes) while also having a consistent output bandwidth.
Test Plan:
```
buck test //caffe2/test:monitor //caffe2/test/cpp/monitor:monitor
```
Reviewed By: kiukchung
Differential Revision: D33822956
fbshipit-source-id: a74782492421be613a1a8b14341b6fb2e8eeb8b4
(cherry picked from commit 293b94e0b4)
Summary:
Follow up: we would need to update the links to the tutorial later
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71643
Reviewed By: albanD
Differential Revision: D33713982
Pulled By: soulitzer
fbshipit-source-id: a314ffa4e7d5c5ebdef9c50033f338b06578d71c
(cherry picked from commit ba30daaaa5)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66745
This PR implement NCCL gather and add gather to ProcessGroupNCCL using nccl send/recv api.
NCCL doesn’t directly provide primitives for gather, so we need to be implemented on top of NCCL’s send/recv API.
1. In ProcessGroupNCCL.cpp, the outputTensors are first flattened, then inputTensors and outputFlattened are passed by the collective class to gather() function in nccl.cpp.
1. In nccl.cpp, gather is implemented using ncclSend/ncclRecv: all the ranks send inputTensor to the root rank, and the root rank uses a for loop to receive these inputTensors.
ghstack-source-id: 147754838
Test Plan:
test_gather_ops
test_gather_checks
test_gather_stress
Reviewed By: pritamdamania87
Differential Revision: D29616361
fbshipit-source-id: b500d9b8e67113194c5cc6575fb0e5d806dc7782
(cherry picked from commit d560ee732e)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71658
This adds the beginnings of a TensorboardEventHandler which will log stats to Tensorboard.
Test Plan: buck test //caffe2/test:monitor
Reviewed By: edward-io
Differential Revision: D33719954
fbshipit-source-id: e9847c1319255ce0d9cf2d85d8b54b7a3c681bd2
(cherry picked from commit 5c8520a6ba)
Fix the wiki URL.
Also minor reorganization in onnx.rst.
[ONNX] restore documentation of public functions (#69623)
The build-docs check requires all public functions to be documented.
These should really not be public, but we'll fix that later.'
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71609