Commit Graph

730 Commits

Author SHA1 Message Date
Bin Bao
4bf1cd6961 [aotinductor] Rename aot_runtime to aoti_runtime (#110007)
Summary: Make the naming more explicit

Differential Revision: D49593528

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110007
Approved by: https://github.com/houseroad
2023-09-26 00:46:54 +00:00
Bin Bao
9c2715bbb2 [inductor] Clean up AOTInductor runtime ABI (#109678)
Summary: Change the AOTInductor runtime interface to avoid referring to aten data structures directly, mostly at::Tensor and ProxyExecutor. This is a combination of https://github.com/pytorch/pytorch/pull/109436, https://github.com/pytorch/pytorch/pull/109498, https://github.com/pytorch/pytorch/pull/109450, https://github.com/pytorch/pytorch/pull/109606, plus a few internal build changes.

Reviewed By: frank-wei

Differential Revision: D49374820

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109678
Approved by: https://github.com/frank-wei, https://github.com/chenyang78
2023-09-21 00:25:24 +00:00
Xuehai Pan
0bf30c140a [pytree] Use OpTree for PyTree manipulation (#93139)
Split from #92679. Use C++-based PyTree implementation.

## Highlights

1. High performance (20x speedup over the pure-Python implementation, 10%-20% overall speedup for `torch.fx`)
2. Multi-input tree-map support
3. Custom tree node registry with namespace isolation

Refs:

- #65761
- #91323
- #92679

From https://github.com/pytorch/pytorch/issues/65761#issuecomment-1334746366:

> ### 0. Out-of-box compatible with JAX's pytree, provides the same interfaces and functions (and more).
>
> ### 1. High-performance: `optree` has comparably fast tree operations (~0.9x for `dict`s and ~2.5x for `OrderedDict`s) relative to JAX's pytree, and it is 20x faster than `torch.utils._pytree`.
>
> `optree` implements some common Python container types in C++ (e.g., `OrderedDict`) and achieves 2.5x the performance of JAX's pytree. Check out section [Built-in PyTree Node Types](https://github.com/metaopt/optree#built-in-pytree-node-types) and [Benchmark](https://github.com/metaopt/optree#benchmark) for more details.
>
> | Module    | Nodes | OpTree (μs) | JAX XLA (μs) | PyTorch (μs) | DM-Tree (μs) | Speedup (J / O) | Speedup (P / O) | Speedup (D / O) |
> | :-------- | ----: | ----------: | -----------: | -----------: | -----------: | --------------: | --------------: | --------------: |
> | TinyMLP   |    53 |       26.40 |        68.19 |       586.87 |        34.14 |            2.58 |           22.23 |            1.29 |
> | AlexNet   |   188 |       84.28 |       259.51 |      2182.07 |       125.12 |            3.08 |           25.89 |            1.48 |
> | ResNet18  |   698 |      288.57 |       807.27 |      7881.69 |       429.39 |            2.80 |           27.31 |            1.49 |
> | ResNet34  |  1242 |      580.75 |      1564.97 |     15082.84 |       819.02 |            2.69 |           25.97 |            1.41 |
> | ResNet50  |  1702 |      791.18 |      2081.17 |     20982.82 |      1104.62 |            2.63 |           26.52 |            1.40 |
> | ResNet101 |  3317 |     1603.93 |      3939.37 |     40382.14 |      2208.63 |            2.46 |           25.18 |            1.38 |
> | ResNet152 |  4932 |     2446.56 |      6267.98 |     56892.36 |      3139.17 |            2.56 |           23.25 |            1.28 |
> | ViT-H/14  |  3420 |     1681.48 |      4488.33 |     41703.16 |      2504.86 |            2.67 |           24.80 |            1.49 |
> | Swin-B    |  2881 |     1565.41 |      4091.10 |     34241.99 |      1936.75 |            2.61 |           21.87 |            1.24 |
> |           |       |             |              |              |  **Average** |        **2.68** |       **24.78** |        **1.38** |
>
> <div align="center">
>   <img src="https://user-images.githubusercontent.com/16078332/200494435-fd5bb385-59f7-4811-b520-98bf5763ccf3.png" width="90%" />
> </div>
>
> ### 2. Namespace Isolation for the PyTree Type Registry
>
> In addition to JAX's pytree registry for custom node type registration, `optree` adds `namespace` isolation to the registry. Users can register the same type multiple times for different flatten/unflatten behavior. It also provides module-level isolation for safety reasons. For example, you can add a unique prefix to your namespace to isolate your registry from other modules (e.g., `torch.xxx`, `torch.functorch.xxx`):
>
> ```python
> # Register a Python type into a namespace
> import optree
> import torch
>
> optree.register_pytree_node(
>     torch.Tensor,
>     # (tensor) -> (children, metadata)
>     flatten_func=lambda tensor: (
>         (tensor.cpu().numpy(),),
>         dict(dtype=tensor.dtype, device=tensor.device, requires_grad=tensor.requires_grad),
>     ),
>     # (metadata, children) -> tensor
>     unflatten_func=lambda metadata, children: torch.tensor(children[0], **metadata),
>     namespace='torch.torch2numpy',
> )
> ```
>
> ```python
> >>> tree = {'weight': torch.ones(size=(1, 2)).cuda(), 'bias': torch.zeros(size=(2,))}
> >>> tree
> {'weight': tensor([[1., 1.]], device='cuda:0'), 'bias': tensor([0., 0.])}
>
> # Flatten without specifying the namespace
> >>> tree_flatten(tree)  # `torch.Tensor`s are leaf nodes
> ([tensor([0., 0.]), tensor([[1., 1.]], device='cuda:0')], PyTreeSpec({'bias': *, 'weight': *}))
>
> # Flatten with the namespace
> >>> leaves, treespec = optree.tree_flatten(tree, namespace='torch.torch2numpy')
> >>> leaves, treespec
> (
>     [array([0., 0.], dtype=float32), array([[1., 1.]], dtype=float32)],
>     PyTreeSpec(
>         {
>             'bias': CustomTreeNode(Tensor[{'dtype': torch.float32, 'device': device(type='cpu'), 'requires_grad': False}], [*]),
>             'weight': CustomTreeNode(Tensor[{'dtype': torch.float32, 'device': device(type='cuda', index=0), 'requires_grad': False}], [*])
>         },
>         namespace='torch.torch2numpy'
>     )
> )
>
> # `entries` are not defined and use `range(len(children))`
> >>> optree.tree_paths(tree, namespace='torch.torch2numpy')
> [('bias', 0), ('weight', 0)]
>
> # Unflatten back to a copy of the original object
> >>> optree.tree_unflatten(treespec, leaves)
> {'bias': tensor([0., 0.]), 'weight': tensor([[1., 1.]], device='cuda:0')}
> ```
>
> Check out section [Registering a Container-like Custom Type as Non-leaf Nodes](https://github.com/metaopt/optree#notes-about-the-pytree-type-registry) for more details.
>
> ### 3. Support both `None` as Non-leaf Node and `None` as Leaf
>
> In JAX's implementation, `None` is always an internal non-leaf node with an arity 0, which is like an empty tuple. This limits the usage of the JAX's pytree utilities for PyTorch. For example, the `nn.Module` uses `_parameters` and `_buffers` (`OrderedDict[str, Optional[Tensor]]`) to hold the tensors, while the value can be a tensor or `None`.
>
> `optree` supports both `None` as Non-leaf Node (JAX's default) and `None` as Leaf (PyTorch's default). Check out section [None is Non-leaf Node vs. None is Leaf](https://github.com/metaopt/optree#none-is-non-leaf-node-vs-none-is-leaf) for more details.
>
> ### 4. Some other improvements and bug fixes
>
> 1. Adds in-place version of treemap (`tree_map_`), which reduces redundant unflatten operation for better performance.
> 2. Adds support for tree flatten and tree map with paths. (useful for `functorch` module extraction).
> 3. Improves the JAX's pytree sorting support for `dict`s.
> 4. Better string representation `repr(PyTreeSpec)`.
> 5. Fixes some bugs for JAX's pytree of hashing, pickle serialization, segmentation fault for infinite recursion, and tree-compose/tree-transpose.
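
The flatten/unflatten model, multi-input tree-map, and the two `None` conventions described above can be illustrated with a small pure-Python sketch. This is not `optree`'s implementation or API: the functions `tree_flatten`, `tree_unflatten`, and `tree_map` below, the tuple-based spec format, and the `none_is_leaf` keyword are all simplified stand-ins chosen for illustration, and only `dict`, `list`, and `tuple` containers are handled.

```python
def tree_flatten(tree, none_is_leaf=False):
    """Return (leaves, spec). Illustrative only; spec is a (kind, keys, child_specs) tuple."""
    if tree is None and not none_is_leaf:
        # JAX-style convention: None is an internal node with zero children.
        return [], ("none", None, [])
    if isinstance(tree, dict):
        keys = sorted(tree)  # deterministic key order, as in the quoted output
        children = [tree[k] for k in keys]
        kind = "dict"
    elif isinstance(tree, (list, tuple)):
        keys = None
        children = list(tree)
        kind = "tuple" if isinstance(tree, tuple) else "list"
    else:
        # Anything else (including None when none_is_leaf=True) is a leaf.
        return [tree], ("leaf", None, [])
    leaves, specs = [], []
    for child in children:
        sub_leaves, sub_spec = tree_flatten(child, none_is_leaf)
        leaves.extend(sub_leaves)
        specs.append(sub_spec)
    return leaves, (kind, keys, specs)


def tree_unflatten(spec, leaves):
    """Rebuild a tree from a spec and a flat sequence of leaves."""
    leaves = iter(leaves)

    def build(node):
        kind, keys, specs = node
        if kind == "leaf":
            return next(leaves)
        if kind == "none":
            return None
        children = [build(s) for s in specs]
        if kind == "dict":
            return dict(zip(keys, children))
        return tuple(children) if kind == "tuple" else children

    return build(spec)


def tree_map(fn, tree, *rests, none_is_leaf=False):
    """Multi-input tree-map: apply fn to corresponding leaves of same-shaped trees."""
    leaves, spec = tree_flatten(tree, none_is_leaf)
    columns = [leaves] + [tree_flatten(r, none_is_leaf)[0] for r in rests]
    return tree_unflatten(spec, [fn(*args) for args in zip(*columns)])
```

With this sketch, a module-like mapping such as `{"weight": 1, "bias": None}` yields one leaf under the non-leaf convention and two leaves (including `None`) when `none_is_leaf=True`, which is the distinction the quoted comment draws between the JAX and PyTorch defaults.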

From https://github.com/pytorch/pytorch/pull/92679#issuecomment-1398778481:

> ```python
> # pytree_make_fx_bench.py
> import torch
> from torch.fx.experimental.proxy_tensor import make_fx
> import time
>
> def f(x):
>     for _ in range(10000):
>         x = x+x
>     return x
>
> import time
> begin = time.time()
> out = make_fx(f, tracing_mode="real")(torch.randn(20))
> begin = time.time()
> print(f'tracing_mode="real" {time.time() - begin:.2f}')
> out = make_fx(f, tracing_mode="fake")(torch.randn(20))
> print(f'tracing_mode="fake" {time.time() - begin:.2f}')
>
> out = make_fx(f, tracing_mode="symbolic")(torch.randn(20))
> print(f'tracing_mode="symbolic" {time.time() - begin:.2f}')
> ```
>
> This seems to run around 10-20% faster with the optree implementation:
>
> ```
> # Optree
> python pytree_make_fx_bench.py
> tracing_mode="real" 0.00
> tracing_mode="fake" 6.32
> tracing_mode="symbolic" 27.13
> ```
>
> ```
> # torch.utils._pytree
> python pytree_make_fx_bench.py
> tracing_mode="real" 0.00
> tracing_mode="fake" 7.66
> tracing_mode="symbolic" 31.07
> ```
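
The "10-20% faster" estimate can be checked against the timings quoted above. The snippet below only redoes that arithmetic on the four reported numbers; the variable and function names are illustrative.

```python
# Wall-clock seconds reported in the two runs quoted above.
optree_times = {"fake": 6.32, "symbolic": 27.13}
pytree_times = {"fake": 7.66, "symbolic": 31.07}


def percent_time_saved(baseline, candidate):
    """Percentage of the baseline time that the candidate saves."""
    return (1.0 - candidate / baseline) * 100.0


savings = {
    mode: percent_time_saved(pytree_times[mode], optree_times[mode])
    for mode in optree_times
}

for mode, saved in savings.items():
    print(f'tracing_mode="{mode}": {saved:.1f}% less time with optree')
```

Both modes land inside the quoted 10-20% range. The `tracing_mode="real"` row is left out because both runs report 0.00 for it.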

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93139
Approved by: https://github.com/malfet
2023-09-18 21:24:56 +00:00
Bin Bao
0f646b1d15 [inductor] Add a C shim layer for libtorch (#109391)
Summary:
This PR adds a limited C shim layer for libtorch. The ultimate goal is to ban any direct reference to aten/c10 data structures or functions, to avoid ABI breakage by providing stable C interfaces.

To make the review and landing easier, we broke the changes into several steps. In this PR (a combination of https://github.com/pytorch/pytorch/pull/109022 and https://github.com/pytorch/pytorch/pull/109351), we add C interfaces for certain libtorch functions and modify the wrapper codegen to generate calls to those interfaces. There are a few other items to be addressed in future PRs:

* The AOTInductor runtime interface still takes lists of aten tensors as input and output
* The interaction with ProxyExecutor (general fallback support) needs to move away from aten tensor
* Remove all references to aten/c10 headers in the AOTInductor-generated code

Differential Revision: D49302669

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109391
Approved by: https://github.com/chenyang78
2023-09-16 16:46:26 +00:00
Yu, Guangye
b1f21399c8 Prerequisite of ATen/native/utils header for C++ extension (#109013)
# Motivation
Without this PR, including a header such as `#include <ATen/native/ForeachUtils.h>` in a C++ extension fails with the error `/home/xxx/torch/include/ATen/native/ForeachUtils.h:7:10: fatal error: 'ATen/native/utils/ParamsHash.h' file not found`. We should fix it.

# Solution
Add the ATen/native/utils header file in the build.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109013
Approved by: https://github.com/ezyang
2023-09-12 02:30:45 +00:00
Bin Bao
60bd30ee0b [inductor] Move AOTInductor runtime headers (#108564)
Summary: Move AOTInductor runtime header files into their own subdirectory, to separate them from the to-be-added libtorch C interface.

Reviewed By: frank-wei

Differential Revision: D48905038

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108564
Approved by: https://github.com/frank-wei
2023-09-06 11:50:41 +00:00
Huy Do
4084d039b7 Only add triton dependency to CUDA and ROCm binaries if it hasn't been set as an installation requirement yet (#108424)
The dependency was previously added twice in CUDA and ROCm binaries, once as an installation dependency from builder and again as an extra dependency for dynamo, for example:

```
Requires-Python: >=3.8.0
Description-Content-Type: text/markdown
License-File: LICENSE
License-File: NOTICE
Requires-Dist: filelock
Requires-Dist: typing-extensions
Requires-Dist: sympy
Requires-Dist: networkx
Requires-Dist: jinja2
Requires-Dist: fsspec
Requires-Dist: pytorch-triton (==2.1.0+e6216047b8)
Provides-Extra: dynamo
Requires-Dist: pytorch-triton (==2.1.0+e6216047b8) ; extra == 'dynamo'
Requires-Dist: jinja2 ; extra == 'dynamo'
Provides-Extra: opt-einsum
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'
```

In the previous release, we needed to remove this part from `setup.py` to build release binaries https://github.com/pytorch/pytorch/pull/96010.  With this, that step isn't needed anymore because the dependency will come from builder.
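
The deduplication rule can be sketched as follows. This is a minimal illustration, not the code in `setup.py`: the helper name `dynamo_extras` and its arguments are hypothetical, and the requirement strings are taken from the metadata shown above.

```python
def dynamo_extras(install_requires, triton_requirement):
    """Build the 'dynamo' extra, adding triton only if it is not already required.

    Hypothetical helper: `install_requires` is the list of installation
    requirements (for example, as supplied by builder), and
    `triton_requirement` is the pinned triton requirement string.
    """
    already_required = any(
        req.strip().startswith(("pytorch-triton", "triton"))
        for req in install_requires
    )
    extras = ["jinja2"]
    if not already_required:
        # Only fall back to the extra when nothing else pulls triton in.
        extras.insert(0, triton_requirement)
    return extras


pin = "pytorch-triton==2.1.0+e6216047b8"
with_builder_dep = dynamo_extras(["filelock", "pytorch-triton (==2.1.0+e6216047b8)"], pin)
without_builder_dep = dynamo_extras(["filelock", "sympy"], pin)
```

In the first call the extra contains only `jinja2`, matching the wheel metadata in the testing section; in the second, the pinned triton requirement is kept in the extra.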

### Testing

Tested using the draft https://github.com/pytorch/pytorch/pull/108374 and by manually inspecting the wheel artifacts at https://github.com/pytorch/pytorch/actions/runs/6045878399 (to avoid going through all of `ciflow/binaries` again)

* torch-2.1.0.dev20230901+cu121-cp39-cp39-linux_x86_64
```
Requires-Python: >=3.8.0
Description-Content-Type: text/markdown
Requires-Dist: filelock
Requires-Dist: typing-extensions
Requires-Dist: sympy
Requires-Dist: networkx
Requires-Dist: jinja2
Requires-Dist: fsspec
Requires-Dist: pytorch-triton (==2.1.0+e6216047b8) <-- This will be 2.1.0 on the release branch after https://github.com/pytorch/builder/pull/1515
Provides-Extra: dynamo
Requires-Dist: jinja2 ; extra == 'dynamo'
Provides-Extra: opt-einsum
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'
```

* torch-2.1.0.dev20230901+cu121.with.pypi.cudnn-cp39-cp39-linux_x86_64
```
Requires-Python: >=3.8.0
Description-Content-Type: text/markdown
Requires-Dist: filelock
Requires-Dist: typing-extensions
Requires-Dist: sympy
Requires-Dist: networkx
Requires-Dist: jinja2
Requires-Dist: fsspec
Requires-Dist: pytorch-triton (==2.1.0+e6216047b8)
Requires-Dist: nvidia-cuda-nvrtc-cu12 (==12.1.105) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cuda-runtime-cu12 (==12.1.105) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cuda-cupti-cu12 (==12.1.105) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cudnn-cu12 (==8.9.2.26) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cublas-cu12 (==12.1.3.1) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cufft-cu12 (==11.0.2.54) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-curand-cu12 (==10.3.2.106) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cusolver-cu12 (==11.4.5.107) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cusparse-cu12 (==12.1.0.106) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-nccl-cu12 (==2.18.1) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-nvtx-cu12 (==12.1.105) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: triton (==2.1.0) ; platform_system == "Linux" and platform_machine == "x86_64" <--This is 2.1.0 because it already has https://github.com/pytorch/pytorch/pull/108423, but the package doesn't exist yet atm
Provides-Extra: dynamo
Requires-Dist: jinja2 ; extra == 'dynamo'
Provides-Extra: opt-einsum
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'
```

* torch-2.1.0.dev20230901+rocm5.6-cp38-cp38-linux_x86_64
```
Requires-Python: >=3.8.0
Description-Content-Type: text/markdown
Requires-Dist: filelock
Requires-Dist: typing-extensions
Requires-Dist: sympy
Requires-Dist: networkx
Requires-Dist: jinja2
Requires-Dist: fsspec
Requires-Dist: pytorch-triton-rocm (==2.1.0+34f8189eae) <-- This will be 2.1.0 on the release branch after https://github.com/pytorch/builder/pull/1515
Provides-Extra: dynamo
Requires-Dist: jinja2 ; extra == 'dynamo'
Provides-Extra: opt-einsum
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108424
Approved by: https://github.com/atalman
2023-09-02 01:16:18 +00:00
drisspg
182a9cf366 Add Independent Memory Efficient and Flash Attention Build Flags (#107985)
# Summary
In an effort to simplify https://github.com/pytorch/pytorch/pull/105602, this PR pulls out independent chunks of code that can be landed prior to FlashV2 landing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107985
Approved by: https://github.com/cpuhrsch
2023-08-28 18:39:18 +00:00
PyTorch MergeBot
22cade56ba Revert "[Reland] Upgrade NVTX to NVTX3 (#97582)"
This reverts commit 5bbfb96203.

Reverted https://github.com/pytorch/pytorch/pull/97582 on behalf of https://github.com/izaitsevfb due to Breaks meta RL builds ([comment](https://github.com/pytorch/pytorch/pull/97582#issuecomment-1679568525))
2023-08-15 20:55:12 +00:00
cyy
5bbfb96203 [Reland] Upgrade NVTX to NVTX3 (#97582)
PR #90689 replaces NVTX with NVTX3. However, the torch::nvtoolsext is created only when the third party NVTX is used. This is clearly a logical error. We now move the creation code out of the branch to cover all cases. This should fix the issues reported in the comments of #90689.

It would be better to move configurations of the failed FRL jobs to CI tests so that we can find such issues early before merging.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97582
Approved by: https://github.com/peterbell10
2023-08-14 16:55:25 +00:00
shibo19
6691413145 export torch/csrc/dynamo/*.h (#106757)
As the title says, we need the header files in torch/csrc/dynamo, so this PR exports them. Could you have a look? @albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106757
Approved by: https://github.com/albanD
2023-08-09 03:57:49 +00:00
shibo19
26846546e8 export tools/autograd to torchgen package (#106663)
As discussed in https://github.com/pytorch/pytorch/pull/105003, I have exported tools/autograd to the torchgen package. Could you have a look? @zou3519
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106663
Approved by: https://github.com/zou3519
2023-08-07 16:14:51 +00:00
Jesse Cai
f81f9093ec [core][pruning][feature] cuSPARSELt build integration (#103700)
Summary:

This stack of PRs integrates cuSPARSELt into PyTorch.

This PR adds support for cuSPARSELt to the build process.
It adds a new flag, USE_CUSPARSELT, that defaults to false.

When USE_CUSPARSELT=1 is specified, the user can also specify
CUSPARSELT_ROOT, which defines the path to the library.

Compiling pytorch with cusparselt support can be done as follows:

```
# Export the variables so that the build started by setup.py can see them
export USE_CUSPARSELT=1
export CUSPARSELT_ROOT=/path/to/cusparselt

python setup.py develop
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103700
Approved by: https://github.com/albanD
2023-08-02 12:48:39 +00:00
Edward Z. Yang
f70844bec7 Enable UFMT on a bunch of low traffic Python files outside of main files (#106052)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106052
Approved by: https://github.com/albanD, https://github.com/Skylion007
2023-07-27 01:01:17 +00:00
Justin Chu
4cc1745b13 [BE] f-stringify torch/ and scripts (#105538)
This PR is a follow up on the pyupgrade series to convert more strings to use f-strings using `flynt`.

- https://docs.python.org/3/reference/lexical_analysis.html#f-strings
- https://pypi.org/project/flynt/

Command used:

```
flynt torch/ -ll 120
flynt scripts/ -ll 120
flynt tools/ -ll 120
```

and excluded `collect_env.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105538
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-07-21 19:35:24 +00:00
George White
803d58a408 Add TensorPipe header files to Python package (#105521)
This change adds the TensorPipe header files to `torch_package_data` if `USE_DISTRIBUTED` is set to `ON` in the CMake cache. The TensorPipe library and CMake config are already available in the Torch wheel, but the headers are not. This resolves an issue where out-of-tree backends could not implement TensorPipe converters, because the `tensorpipe::Message` struct is defined in the TensorPipe headers.

Fixes #105224.
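
As a rough illustration of gating package data on a CMake cache entry, the sketch below parses `CMakeCache.txt`-style text (entries have the form `NAME:TYPE=VALUE`) and appends a header glob only when `USE_DISTRIBUTED` is on. The function names and both glob patterns are assumptions made for this example; they are not taken from `setup.py`.

```python
def cmake_cache_flag(cache_text, name):
    """Return True if the cache entry `name:TYPE=VALUE` holds a truthy value."""
    for line in cache_text.splitlines():
        line = line.strip()
        # Skip blank lines and the two comment styles used in CMakeCache.txt.
        if not line or line.startswith(("#", "//")):
            continue
        key, sep, value = line.partition("=")
        if sep and key.split(":", 1)[0] == name:
            return value.strip().upper() in {"ON", "1", "TRUE", "YES"}
    return False


def package_data_globs(cache_text):
    """Hypothetical helper: list of header globs to ship in the wheel."""
    globs = ["include/torch/*.h"]  # placeholder for the existing entries
    if cmake_cache_flag(cache_text, "USE_DISTRIBUTED"):
        globs.append("include/tensorpipe/*.h")  # hypothetical glob
    return globs


example_cache = "// Build with distributed support\nUSE_DISTRIBUTED:BOOL=ON\nUSE_CUDA:BOOL=OFF\n"
print(package_data_globs(example_cache))
```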

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105521
Approved by: https://github.com/albanD
2023-07-20 16:06:00 +00:00
Justin Chu
14d87bb5ff [BE] Enable ruff's UP rules and autoformat tools and scripts (#105428)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105428
Approved by: https://github.com/albanD, https://github.com/soulitzer, https://github.com/malfet
2023-07-19 01:24:44 +00:00
Bin Bao
b10de43c0a Add aot_inductor as a test backend for benchmarking (#105221)
Summary:
Original PR at https://github.com/pytorch/pytorch/pull/104977. Landing from fbcode instead.

Add an aot_inductor backend (Export+AOTInductor) in the benchmarking harness. Note it is not a dynamo backend.

Moved files from torch/_inductor/aot_inductor_include to torch/csrc/inductor as a more standard way for exposing headers
Created a caching function in benchmarks/dynamo/common.py for compiling, loading and caching the .so file, as a proxy for a pure C++ deployment, but easier for benchmarking.

Differential Revision: D47452591

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105221
Approved by: https://github.com/jansel
2023-07-18 13:16:36 +00:00
Bin Bao
528ab477ce [reland][inductor] Register an op for mm_plus_mm (#105153)
Summary: Reland https://github.com/pytorch/pytorch/pull/104835 after fixing internal build issues

Test Plan: CI

Differential Revision: D47442849

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105153
Approved by: https://github.com/clee2000
2023-07-14 14:35:29 +00:00
Catherine Lee
c36dca7bc5 Revert "[inductor] Register an op for mm_plus_mm (#104835)" (#105150)
This reverts commit 9c46a1620c.

Actual revert referenced in https://github.com/pytorch/pytorch/pull/105149

#104835 is causing internal builds to fail

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105150
Approved by: https://github.com/atalman
2023-07-13 17:13:45 +00:00
Bin Bao
9c46a1620c [inductor] Register an op for mm_plus_mm (#104835)
Summary: Currently the aten version of mm_plus_mm has no cpp
implementation, and thus cpp_wrapper can not generate the correct cpp
function call for it.

Differential Revision: [D47372057](https://our.internmc.facebook.com/intern/diff/D47372057)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104835
Approved by: https://github.com/jansel, https://github.com/SherlockNoMad
2023-07-12 02:34:02 +00:00
Edward Z. Yang
3dc4adc7a6 Don't build CUDA with debug info by default. (#102617)
Fixes https://github.com/pytorch/pytorch/issues/102594

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102617
Approved by: https://github.com/malfet
2023-07-05 20:16:19 +00:00
Xu Han
6c1ccccf21 Enable mimalloc on pytorch Windows (#102595)
This PR is an implementation of [#102534](https://github.com/pytorch/pytorch/issues/102534), option 2.
Major changes:
1. Add mimalloc as a submodule.
2. Add build option "USE_MIMALLOC".
3. It is only enabled on Windows builds, where it should improve PyTorch memory allocation performance.

Additional Test:
<img width="953" alt="image" src="https://github.com/pytorch/pytorch/assets/8433590/4b2ec2dc-16f1-4ad9-b457-cfeb37e489d3">
This PR also builds and statically links mimalloc on Linux successfully.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102595
Approved by: https://github.com/jgong5, https://github.com/malfet
2023-06-27 08:53:26 +00:00
Yang Chen
d2281e38ae Adds the initial support for AOTInductor model and interface (#104202)
This PR combines the C++ code for the AOTInductor's model and interface with Bin Bao's changes to AOTInductor codegen.

It adds a number of AOTInductor C interfaces that can be used by an inference runtime. Under the hood of the interfaces, the model code generated by the AOTInductor's codegen is wrapped into a class, AOTInductorModel, which manages tensors and runs the model inference.

On top of AOTInductorModel, we provide one more abstract layer, AOTInductorModelContainer, which allows the user to have multiple inference runs concurrently for the same model.
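
One way to picture a container that allows concurrent runs of the same model is a pool of interchangeable model instances, where each run borrows an idle instance and returns it afterwards. The Python sketch below shows that idea only; the class `ModelContainer`, its constructor arguments, and the pooling strategy are assumptions for illustration and do not describe the actual C++ AOTInductorModelContainer.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


class ModelContainer:
    """Hypothetical pool: each run borrows one model instance, then returns it."""

    def __init__(self, make_model, num_models):
        self._idle = queue.Queue()
        for _ in range(num_models):
            self._idle.put(make_model())

    def run(self, inputs):
        model = self._idle.get()  # blocks until an instance is free
        try:
            return model(inputs)
        finally:
            self._idle.put(model)  # hand the instance back for the next run


def make_doubling_model():
    # Stand-in for a compiled model: doubles every input value.
    return lambda xs: [2 * x for x in xs]


container = ModelContainer(make_doubling_model, num_models=2)
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(container.run, [[1], [2], [3]]))
```

With two instances and four worker threads, at most two runs execute at once and the others wait in `queue.Queue.get` until an instance is returned.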

This PR also adjusts the compilation options for AOT codegen, particularly some fbcode-related changes such as libs to be linked and header-file search paths.

Note that this is the very first version of the AOTInductor model and interface, so many features (e.g. dynamic shape) are incomplete. We will support those missing features in future PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104202
Approved by: https://github.com/desertfire
2023-06-27 00:37:26 +00:00
PyTorch MergeBot
2c313e7b99 Revert "Record view stacks if running anomaly mode (#103185)"
This reverts commit a02c573a89.

Reverted https://github.com/pytorch/pytorch/pull/103185 on behalf of https://github.com/izaitsevfb due to Breaks internal builds, see D46629734 ([comment](https://github.com/pytorch/pytorch/pull/103185#issuecomment-1588258206))
2023-06-12 23:52:10 +00:00
Edward Z. Yang
a02c573a89 Record view stacks if running anomaly mode (#103185)
Now, when you do an inplace mutation and the view is naughty, you get this message:

```
RuntimeError: A view was created in no_grad mode and is being modified inplace with grad mode enabled. Given that this use case is ambiguous and error-prone, it is forbidden. You can clarify your code by moving both the view and the inplace either both inside the no_grad block (if you don't want the inplace to be tracked) or both outside (if you want the inplace to be tracked). To find out where this view was allocated, run your entire forward region under anomaly mode (torch.autograd.detect_anomaly(check_nan=False)).
```

When you run under anomaly mode, you get:

```
RuntimeError: A view was created in no_grad mode and is being modified inplace with grad mode enabled. Given that this use case is ambiguous and error-prone, it is forbidden. You can clarify your code by moving both the view and the inplace either both inside the no_grad block (if you don't want the inplace to be tracked) or both outside (if you want the inplace to be tracked). This view was allocated at:
  File "/data/users/ezyang/c/pytorch/test/test_autograd.py", line 4299, in arglebargle
  File "/data/users/ezyang/c/pytorch/test/test_autograd.py", line 4306, in test_anomaly_gives_view_stack
  File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
  File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/case.py", line 591, in run
  File "/data/users/ezyang/c/pytorch/torch/testing/_internal/common_utils.py", line 2266, in _run_with_retry
  File "/data/users/ezyang/c/pytorch/torch/testing/_internal/common_utils.py", line 2337, in run
  File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/case.py", line 650, in __call__
  File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/suite.py", line 122, in run
  File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/suite.py", line 84, in __call__
  File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/suite.py", line 122, in run
  File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/suite.py", line 84, in __call__
  File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/runner.py", line 184, in run
  File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/main.py", line 271, in runTests
  File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/main.py", line 101, in __init__
  File "/data/users/ezyang/c/pytorch/torch/testing/_internal/common_utils.py", line 894, in run_tests
  File "/data/users/ezyang/c/pytorch/test/test_autograd.py", line 11209, in <module>
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103185
Approved by: https://github.com/zdevito
2023-06-09 16:56:28 +00:00
Li-Huai (Allan) Lin
3c0072e7c0 [MPS] Prerequisite for MPS C++ extension (#102483)
In order to add MPS kernels to the torchvision codebase, we need to expose the MPS headers and allow Objective-C++ files to be used in extensions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102483
Approved by: https://github.com/malfet
2023-06-07 17:28:31 +00:00
lkct
9567aaebe5 Package torch/*.pyi type hints (#103016)
Including `torch._VF` and `torch.return_types`

These are generated by:
4003e96ca1/tools/pyi/gen_pyi.py (L1139-L1155)

Ref #99541
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103016
Approved by: https://github.com/Skylion007
2023-06-05 23:08:10 +00:00
Nikita Shulga
49d0d1d79f Update XLA pin (#102446)
Updating the pin to the same hash as  https://github.com/pytorch/pytorch/pull/100922

On the XLA side, the build has switched from CMake to Bazel, which requires a number of changes on the PyTorch side:
 - Copy installed headers back to the `torch/` folder before starting the build
 - Install `torch/csrc/lazy/python/python_utils.h`
 - Define `LD_LIBRARY_PATH`

TODO:
 - Enable bazel caching
 - Pass CXX11_ABI flag to  `//test/cpp:all`  to reuse build artifacts from  `//:_XLAC.so`

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at cd4768b</samp>

> _To fix the XLA tests that were failing_
> _We updated the submodule and scaling_
> _We added `python_util.h`_
> _And copied `torch` as well_
> _And set `LD_LIBRARY_PATH` for linking_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102446
Approved by: https://github.com/huydhn
2023-06-01 02:04:07 +00:00
lantiankaikai
17166c2511 python_arg_parser to allow fake tensor element in symint_list when in dynamo mode #95424 (#97508)
Failure mechanism of #95424:
In dynamo mode, a numpy.int_ passed to a 'shape'-like parameter (Sequence[Union[int, symint]]) is wrapped as a list containing a FakeTensor. However, the parser in python_arg_parser expects an int in symint_list but gets a FakeTensor.

Following #85759, this PR allows tensor elements in symint_list when in dynamo mode.

This PR also fixes the tests below, which have a similar failure mechanism:
pytest ./generated/test_huggingface_diffusers.py -k test_016
pytest ./generated/test_ustcml_RecStudio.py -k test_036


Pull Request resolved: https://github.com/pytorch/pytorch/pull/97508
Approved by: https://github.com/yanboliang
2023-05-31 19:19:17 +00:00
mikey dagitses
979f55d3bc implementation of DataPtr context for copy-on-write tensors (#100818)
implementation of DataPtr context for copy-on-write tensors

Summary:
Copy-on-write storage
=====================
This library adds support for copy-on-write storage, i.e. lazy copies,
to tensors. The design maintains the PyTorch invariant that tensors
alias if and only if they share a storage. Thus, tensors that are lazy
copies of one another will have distinct storages that share a data
allocation.

Thread-safety
-------------
The correctness of this design hinges on the pre-existing PyTorch user
requirement (and general default programming assumption) that users
are responsible for guaranteeing that writes do not take place
concurrently with reads and other writes.

Lazily copied tensors add a complication to this programming model
because users are not required to know if lazy copies exist and are
not required to serialize writes across lazy copies. For example: two
tensors with distinct storages that share a copy-on-write data context
may be given to different threads that may do whatever they wish to
them, and the runtime is required to guarantee its safety.

It turns out that this is not that difficult to protect because, due
to the copy-on-write requirement, we just need to materialize a tensor
upon writing. This could be done entirely without synchronization if
we materialized each copy, however, we have a common-sense
optimization to elide the copy for the last remaining reference. This
requires waiting for any pending copies.

### Thread-safety detailed design
There are two operations that affect the copy-on-write details of a
tensor:

1) lazy-clone (e.g. an explicit call or a hidden implementation detail
   added through an operator like reshape)
2) materialization (i.e. any write to the tensor)

The key insight that we exploit is that lazy-clone is logically a read
operation and materialization is logically a write operation. This
means that, for a given set of tensors that share a storage, if
materialization is taking place, no other read operation, including
lazy-clone, can be concurrent with it.

However, this insight only applies within a set of tensors that share
a storage. We also have to be concerned with tensors with different
storages that share a copy-on-write context. In this world,
materialization can race with lazy-clone or even other
materializations. _However_, in order for this to be the case, there
must be _at least_ two references to the context. This means that the
context _cannot_ vanish out from under you if you are performing a
lazy-clone, and hence, a lazy-clone only requires an atomic refcount
bump.

The most complicated case is that all lazy-copies are concurrently
materializing. In this case, because a write is occurring, there are
no in-flight lazy-copies taking place. We must simply ensure that all
lazy-copies are able to materialize (read the data) concurrently. If
we didn't have the aforementioned optimization where the last copy
steals the data, we could get away with no locking whatsoever: each
makes a copy and decrements the refcount. However, because of the
optimization, we require the loser of the materializing race to wait
for the pending copies to finish, and then steal the data without
copying it.

We implement this by taking a shared lock when copying the data and
taking an exclusive lock when stealing the data. The exclusive lock
acquisition ensures that all pending shared locks are finished before
we steal the data.
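
The locking scheme above can be modeled in a few lines of Python. This is only a sketch of the idea, not the C++ implementation from this PR: `SharedExclusiveLock`, `CowContext`, `lazy_clone` and `materialize` are hypothetical names, a plain `threading.Lock` stands in for the atomic refcount, and a Python list stands in for the data allocation. Taking the shared lock before decrementing the refcount is a choice made in this sketch so that the last holder cannot steal the data before a pending copy has registered itself.

```python
import threading


class SharedExclusiveLock:
    """Minimal readers-writer lock (hypothetical helper for this sketch)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_shared(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_shared(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_exclusive(self):
        with self._cond:
            # Wait until no copy (shared holder) is still in flight.
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def release_exclusive(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()


class CowContext:
    """Shared data plus a count of the lazy copies that reference it."""

    def __init__(self, data):
        self.data = data
        self.refcount = 1
        self._count_lock = threading.Lock()  # stands in for an atomic counter
        self._lock = SharedExclusiveLock()

    def lazy_clone(self):
        # Logically a read: only the refcount is bumped.
        with self._count_lock:
            self.refcount += 1
        return self

    def materialize(self):
        """Drop one reference and return data private to the caller."""
        self._lock.acquire_shared()
        try:
            with self._count_lock:
                self.refcount -= 1
                last = self.refcount == 0
            if not last:
                # Copies may run concurrently under the shared lock.
                return list(self.data)
        finally:
            self._lock.release_shared()
        # Last reference: wait for pending copies, then steal without copying.
        self._lock.acquire_exclusive()
        try:
            data, self.data = self.data, None
            return data
        finally:
            self._lock.release_exclusive()
```

With three holders materializing concurrently, two receive copies and the last one to decrement the refcount receives the original allocation.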

Test Plan: 100% code coverage.

---
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/100818).
* #100821
* #100820
* #100819
* __->__ #100818

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100818
Approved by: https://github.com/ezyang
2023-05-11 11:13:51 +00:00
Nikita Shulga
08ef92e711 Delete Python-2 checks from setup.py (#101112)
### <samp>🤖 Generated by Copilot at 557960b</samp>

> _`Python 2` is gone_
> _PyTorch cleans up its code_
> _Winter of legacy_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101112
Approved by: https://github.com/kit1980, https://github.com/albanD
2023-05-10 20:17:31 +00:00
Iris
466adab7c4 Add fsspec to PT setup.py (#99768)
Follow up for https://github.com/pytorch/pytorch/pull/96532. Including this in setup.py so the package will be available for CI.

Fsspec package size:
```
du  -h /fsx/users/irisz/conda/envs/pytorch/lib/python3.9/site-packages/fsspec-2023.3.0-py3.9.egg
264K    /fsx/users/irisz/conda/envs/pytorch/lib/python3.9/site-packages/fsspec-2023.3.0-py3.9.egg/fsspec/__pycache__
58K     /fsx/users/irisz/conda/envs/pytorch/lib/python3.9/site-packages/fsspec-2023.3.0-py3.9.egg/fsspec/implementations/__pycache__
377K    /fsx/users/irisz/conda/envs/pytorch/lib/python3.9/site-packages/fsspec-2023.3.0-py3.9.egg/fsspec/implementations
1017K   /fsx/users/irisz/conda/envs/pytorch/lib/python3.9/site-packages/fsspec-2023.3.0-py3.9.egg/fsspec
96K     /fsx/users/irisz/conda/envs/pytorch/lib/python3.9/site-packages/fsspec-2023.3.0-py3.9.egg/EGG-INFO
1.2M    /fsx/users/irisz/conda/envs/pytorch/lib/python3.9/site-packages/fsspec-2023.3.0-py3.9.egg
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99768
Approved by: https://github.com/kit1980
2023-04-25 01:34:08 +00:00
Nikita Shulga
32cd05ae60 Package torch.fx type hints (#99541)
### <samp>🤖 Generated by Copilot at ca3aab4</samp>

> _`fx` module traced_
> _Symbolic graphs transformed_
> _Type stubs for winter_

Fixes https://github.com/pytorch/pytorch/issues/99530

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99541
Approved by: https://github.com/kit1980, https://github.com/Chillee
2023-04-19 22:00:07 +00:00
Jithun Nair
ce4df4cc59 Enable triton build in CI docker image for ROCm (#98096)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98096
Approved by: https://github.com/malfet
2023-04-11 09:02:19 +00:00
PyTorch MergeBot
cb3c478069 Revert "refactor(add privateuseone floder in aten/src/ATen): add a PrivateUse… (#98127)"
This reverts commit 5a537e291d.

Reverted https://github.com/pytorch/pytorch/pull/98127 on behalf of https://github.com/weiwangmeta due to Sorry, our internal code is not ready to take such changes
2023-04-08 05:32:21 +00:00
ykddd
5a537e291d refactor(add privateuseone floder in aten/src/ATen): add a PrivateUse… (#98127)
Add a PrivateUse1 folder under ATen to contain all the feature adaptations for PrivateUse1. One example is GetGeneratorPrivate, which a third-party backend uses to register its own Generator implementation. This makes it easier for us to manage these features centrally, and it makes adaptation more convenient for different backend vendors. For more info: https://github.com/pytorch/pytorch/issues/98073

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98127
Approved by: https://github.com/bdhirsh
2023-04-07 03:43:16 +00:00
jjsjann123
7282be3d91 Patch for nvfuser build (#97404)
1. Package the nvfuser headers to support C++ builds against nvfuser;
2. Move `#include <torch/csrc/jit/codegen/fuser/interface.h>` from `torch/csrc/jit/runtime/register_ops_utils.h` to `torch/csrc/jit/runtime/register_prim_ops_fulljit.cpp` to avoid a missing header, since pytorch doesn't package `interface.h`;
3. Patch the DynamicLibrary load of nvfuser to leak the handle, which avoids double de-allocation of `libnvfuser_codegen.so`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97404
Approved by: https://github.com/davidberard98
2023-03-28 23:36:08 +00:00
Han Qi (qihqi)
b895a0a675 [BE] Move flatbuffer related python C bindings to script_init (#97476)
Summary:
An extra C binding module for flatbuffer was introduced because
not all dependencies of PyTorch want to (or can) bundle flatbuffer.

However, flatbuffer is included by default now, so this separate binding is no longer needed.

Test Plan: existing unit tests

Differential Revision: D44352583

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97476
Approved by: https://github.com/dbort
2023-03-28 17:56:32 +00:00
PyTorch MergeBot
5170995b2a Revert "Upgrade NVTX to NVTX3 (#90689)"
This reverts commit e64ddd1ab9.

Reverted https://github.com/pytorch/pytorch/pull/90689 on behalf of https://github.com/osalpekar due to Build Failures due to not being able to find one nvtx3 header in FRL jobs: [D42332540](https://www.internalfb.com/diff/D42332540)
2023-03-24 18:16:06 +00:00
cyy
e64ddd1ab9 Upgrade NVTX to NVTX3 (#90689)
Due to the recent upgrade to CUDA 11, we can also upgrade NVTX to NVTX3, a header-only library that simplifies the build system considerably.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90689
Approved by: https://github.com/soumith, https://github.com/malfet
2023-03-23 01:56:42 +00:00
Nikita Shulga
1ab883797a [BE] Dedup hardcoded triton versions (#96580)
Define it once in `.ci/docker/triton_version.txt` and use everywhere.

Also, patch version defined in `triton/__init__.py` as currently it always returns `2.0.0` even if package name is `2.1.0`

Followup after https://github.com/pytorch/pytorch/pull/95896 where version needed to be updated in 4+ places
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96580
Approved by: https://github.com/huydhn
2023-03-12 20:00:48 +00:00
PyTorch MergeBot
30b968f60d Revert "[BE] Dedup hardcoded triton versions (#96580)"
This reverts commit c131e51e62.

Reverted https://github.com/pytorch/pytorch/pull/96580 on behalf of https://github.com/malfet due to Forgot to fix lint
2023-03-12 19:37:52 +00:00
Nikita Shulga
c131e51e62 [BE] Dedup hardcoded triton versions (#96580)
Define it once in `.ci/docker/triton_version.txt` and use everywhere.

Also, patch version defined in `triton/__init__.py` as currently it always returns `2.0.0` even if package name is `2.1.0`

Followup after https://github.com/pytorch/pytorch/pull/95896 where version needed to be updated in 4+ places
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96580
Approved by: https://github.com/huydhn
2023-03-12 16:56:04 +00:00
Natalia Gimelshein
76cac70939 new triton main pin (#95896)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95896
Approved by: https://github.com/jansel, https://github.com/malfet
2023-03-10 06:30:41 +00:00
cyy
6786a24fd2 fix some tiny code issues (#95757)
This PR tries to fix:
1. a misspelled NDEBUG preprocessing condition.
2. get rid of all writable-strings warnings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95757
Approved by: https://github.com/soulitzer
2023-03-01 23:27:32 +00:00
Wei Wang
46f092dc66 Add jinja2 as mandatory dependency (#95691)
Should fix the nightly wheels issue #95671. The v2.0.0 RC does not need this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95691
Approved by: https://github.com/malfet
2023-03-01 17:28:55 +00:00
cyy
f27e09de04 Cleanup Windows warning suppression in CMake and fix some warnings in the source code (#94927)
This PR does two things:
1. It moves some Windows warning suppression from various CMake files into the main CMakeLists.txt, following the conventions of gcc and clang.
2. It fixes some Windows warnings in the source code. Most importantly, it fixes lots of dll warnings by adjusting C10_API to TORCH_API or TORCH_PYTHON_API. There are still some dll warnings because some TORCH_API functions are actually built as part of libtorch_python.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94927
Approved by: https://github.com/malfet
2023-02-27 19:22:20 +00:00
donnyyou
5d70ee93fa Expose more headers for extensions. (#95447)
Expose more headers for extensions of distributed methods.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95447
Approved by: https://github.com/ezyang
2023-02-27 18:59:40 +00:00
jjsjann123
21eb7f70f1 Nvfuser python API import fix (#94036)
1. Make the nvfuser Python API import work with both devel and upstream;
2. Add an environment variable to allow a custom nvfuser code base to be built with upstream pytorch core.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94036
Approved by: https://github.com/malfet, https://github.com/davidberard98
2023-02-16 20:10:40 +00:00