Commit Graph

2824 Commits

Author SHA1 Message Date
PyTorch MergeBot
dabc9566c4 Revert "(MTIA) Move "empty_cache" API (#143402)"
This reverts commit c7d9f29807.

Reverted https://github.com/pytorch/pytorch/pull/143402 on behalf of https://github.com/huydhn due to The internal diff D67148738 has been reverted ([comment](https://github.com/pytorch/pytorch/pull/143402#issuecomment-2557982597))
2024-12-21 04:01:23 +00:00
Mikayla Gawarecki
8e483654cb Add config.save.use_pinned_memory_for_d2h to serialization config (#143342)
This was benchmarked with two separate scripts on my A100
(A) Save state_dict of llama3-style model on CUDA to disk with ``torch.save``
(B) Save `ModuleList` of 10 `nn.Linear(10,000, 10,000)` on CUDA to disk with `torch.save`
Timings are an average of 5 runs and benchmark scripts + results are attached

Under both scenarios, we see **~2x speedup in ``torch.save`` time with (``compute_crc32=False`` and ``use_pinned_memory_for_d2h=True``)** compared to the baseline of the current defaults (``compute_crc32=True`` and ``use_pinned_memory_for_d2h=False``

(A)  Save state_dict of llama3-style model on CUDA to disk with ``torch.save`` [[script](https://gist.github.com/mikaylagawarecki/d3a86ea1bb08045d1a839976808d7432)][[results](https://gist.github.com/mikaylagawarecki/f61a4714e5cff703146a1fcb7e0c755c)]

|                                                                                 |  use_pinned_memory_for_d2h=False (Default) |  use_pinned_memory_for_d2h=True |
|-|-|-|
| `compute_crc_32= True`  (Default)| 28.54s | 20.76s |
| `compute_crc_32 = False` | 22.57s |  **14.51s** |

(B) Save `ModuleList` of 10 `nn.Linear(10,000, 10,000)` on CUDA to disk with `torch.save` [[script](https://gist.github.com/mikaylagawarecki/ecbc505436bdd4b5190ef1b3430c12b6)][[results](https://gist.github.com/mikaylagawarecki/4e686bcf030b57de8c3ca74d8f5a88f7)]

|                                                                                 |  use_pinned_memory_for_d2h=False (Default) |  use_pinned_memory_for_d2h=True |
|-|-|-|
| `compute_crc_32= True`  (Default)| 8.38s | 5.53s |
| `compute_crc_32 = False` | 6.94s |  **3.99s** |

Trace of (A) with `use_pinned_memory_for_d2h=True`, `compute_crc32=False`
<img width="1745" alt="Screenshot 2024-12-16 at 7 32 33 PM" src="https://github.com/user-attachments/assets/80b87a8c-5a70-4eb9-ad66-7abc4aa7cc25" />

Baseline trace of (A) with `use_pinned_memory_for_d2h=False`, `compute_crc32=True`
<img width="1799" alt="Screenshot 2024-12-16 at 7 38 20 PM" src="https://github.com/user-attachments/assets/13fa12d1-8f5f-424c-adc4-275b67012927" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143342
Approved by: https://github.com/albanD
ghstack dependencies: #143324
2024-12-20 21:01:18 +00:00
Mikayla Gawarecki
3f63b742e6 Refactor serialization getter/setters into torch.utils.serialization.config (#143324)
Consolidate
- get/set_default_load_endianness
- get/set_default_mmap_options
- get/set_crc32_options

into one global dynamo-style config + allow global setting of mmap. The existing APIs are not removed and will get/set from the config (as they can't be removed for BC)

In #143459 I add the local (argument style) config

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143324
Approved by: https://github.com/albanD
2024-12-20 21:01:17 +00:00
Nikhil Gupta
94737e8a2a [ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124)
Description:
1. Quantize Linear Layer Weights to 4-bits:
Quantize the weights of the Linear layer to 4 bits, using symmetric quantization.
Pack two 4-bit weights into one uint8 container.
Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32.

2. Prepare Quantized Weights, Scales, and Optional Bias:
After quantizing, obtain the quantized_weights, scales, and groupsize.
If the original Linear layer has a bias, prepare it as well.

3. Pack the Weights Efficiently:
Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias.
```python
packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features)
```
Input parameters should include:
in_features and out_features (the same as the Linear layer’s corresponding parameters).

4. Perform Dynamic Quantized Matrix Multiplication:
Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights.
```python
output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights,  groupsize, in_features, out_features)
```
Inputs required include:
The input tensor, packed_weights , groupsize, and the in_features and out_features.

API Usage: https://github.com/pytorch/pytorch/issues/143289

Model Perf :
7B Transformer model:
Prefill : 340 t/s
Decode  : 40  t/s
2B Transformer model
Prefill : 747 t/s
Decode  : 80  t/s

Tests:
python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight
Ran 1 test in 0.016s

OK

python test/test_linalg.py -k test__dyn_quant_matmul_4bit
Ran 8 tests in 0.077s

OK

python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit
Ran 8 tests in 11.454s

Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124
Approved by: https://github.com/digantdesai, https://github.com/malfet
2024-12-20 19:32:03 +00:00
Hyunho Yeo
c7d9f29807 (MTIA) Move "empty_cache" API (#143402)
Summary: This diff moves one of memory-related APIs to the consolidated location, which is `mtia/memory.py`.

Test Plan:
```
buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api
```

https://www.internalfb.com/intern/testinfra/testrun/13510798943184259

Reviewed By: nautsimon

Differential Revision: D67148738

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143402
Approved by: https://github.com/nautsimon
2024-12-20 17:39:06 +00:00
Avik Chaudhuri
29b586bbad fix formatting in programming model doc (#143587)
Test Plan: Some of the formatting in https://docs-preview.pytorch.org/pytorch/pytorch/143546/export.programming_model.html is broken.

Differential Revision: D67458972

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143587
Approved by: https://github.com/yushangdi
2024-12-20 07:09:19 +00:00
PyTorch MergeBot
8136daff5a Revert "[ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124)"
This reverts commit 4b82251011.

Reverted https://github.com/pytorch/pytorch/pull/134124 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it breaks lots of internal build ([comment](https://github.com/pytorch/pytorch/pull/134124#issuecomment-2555953189))
2024-12-19 23:33:17 +00:00
Nikhil Gupta
4b82251011 [ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124)
Description:
1. Quantize Linear Layer Weights to 4-bits:
Quantize the weights of the Linear layer to 4 bits, using symmetric quantization.
Pack two 4-bit weights into one uint8 container.
Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32.

2. Prepare Quantized Weights, Scales, and Optional Bias:
After quantizing, obtain the quantized_weights, scales, and groupsize.
If the original Linear layer has a bias, prepare it as well.

3. Pack the Weights Efficiently:
Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias.
```python
packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features)
```
Input parameters should include:
in_features and out_features (the same as the Linear layer’s corresponding parameters).

4. Perform Dynamic Quantized Matrix Multiplication:
Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights.
```python
output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights,  groupsize, in_features, out_features)
```
Inputs required include:
The input tensor, packed_weights , groupsize, and the in_features and out_features.

API Usage: https://github.com/pytorch/pytorch/issues/143289

Model Perf :
7B Transformer model:
Prefill : 340 t/s
Decode  : 40  t/s
2B Transformer model
Prefill : 747 t/s
Decode  : 80  t/s

Tests:
python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight
Ran 1 test in 0.016s

OK

python test/test_linalg.py -k test__dyn_quant_matmul_4bit
Ran 8 tests in 0.077s

OK

python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit
Ran 8 tests in 11.454s

Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124
Approved by: https://github.com/digantdesai, https://github.com/malfet
2024-12-19 18:51:26 +00:00
Avik Chaudhuri
1433bad0e4 torch export programming model (#143546)
Differential Revision: [D67429743](https://our.internmc.facebook.com/intern/diff/D67429743/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143546
Approved by: https://github.com/ydwu4
2024-12-19 16:56:13 +00:00
PyTorch MergeBot
14fe1f7190 Revert "[ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124)"
This reverts commit d3ff2d42c2.

Reverted https://github.com/pytorch/pytorch/pull/134124 on behalf of https://github.com/malfet due to This broke S390 builds, includes cpuinfo unconditionally ([comment](https://github.com/pytorch/pytorch/pull/134124#issuecomment-2552560208))
2024-12-19 01:05:11 +00:00
Nikhil Gupta
d3ff2d42c2 [ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124)
Description:
1. Quantize Linear Layer Weights to 4-bits:
Quantize the weights of the Linear layer to 4 bits, using symmetric quantization.
Pack two 4-bit weights into one uint8 container.
Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32.

2. Prepare Quantized Weights, Scales, and Optional Bias:
After quantizing, obtain the quantized_weights, scales, and groupsize.
If the original Linear layer has a bias, prepare it as well.

3. Pack the Weights Efficiently:
Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias.
```python
packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features)
```
Input parameters should include:
in_features and out_features (the same as the Linear layer’s corresponding parameters).

4. Perform Dynamic Quantized Matrix Multiplication:
Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights.
```python
output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights,  groupsize, in_features, out_features)
```
Inputs required include:
The input tensor, packed_weights , groupsize, and the in_features and out_features.

API Usage: https://github.com/pytorch/pytorch/issues/143289

Model Perf :
7B Transformer model:
Prefill : 340 t/s
Decode  : 40  t/s
2B Transformer model
Prefill : 747 t/s
Decode  : 80  t/s

Tests:
python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight
Ran 1 test in 0.016s

OK

python test/test_linalg.py -k test__dyn_quant_matmul_4bit
Ran 8 tests in 0.077s

OK

python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit
Ran 8 tests in 11.454s

Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124
Approved by: https://github.com/digantdesai, https://github.com/malfet
2024-12-18 22:30:07 +00:00
Yidi Wu
1e201422ed [export] add is_exporting flag (#142425)
We added an is_export flag under torch.compiler.is_exporting. This comes handy when we try to do some special logic in user-level and system-level (e.g. in upper of the stack).

In increasing-scope:
- `_is_fx_tracing` is set to True when we use under symbolic_trace or make_fx.
- `is_exporting` is set to True when we're doing strict or non-strict export, which internally has a step that calls make_fx and set _is_fx_tracing to be True.
- `is_compiling` is set to True when we're either doing strict, non-strict export or torch.compile.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142425
Approved by: https://github.com/avikchaudhuri
2024-12-18 21:36:28 +00:00
Zizeng Meng
eb67dd3e2d [3/N][Memory Profiling] Add memory profiling function for MTIA hooks (#142149)
Design Doc: https://fburl.com/gdoc/47zpuweb
Prototyping:  D66469341

In this diff, we implement two new mtia hooks to start/stop profiler and export the memory snapshot.

In next diff, we will integrate the mtia backend with profiler python api

Differential Revision: [D66823583](https://our.internmc.facebook.com/intern/diff/D66823583/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142149
Approved by: https://github.com/nautsimon
2024-12-18 11:58:23 +00:00
Hyunho Yeo
efe21ee59d [MTIA] (3/n) Implement PyTorch APIs to query/reset device peak memory usage (#143347)
Summary: This diff implements the "max_memory_allocated" PyTorch API for MTIA devices, which returns the peak device DRAM usage

Test Plan:
Passed the local unit test
```
buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- -r test_max_memory_allocated
```

https://www.internalfb.com/intern/testinfra/testrun/8444249544807192

Reviewed By: yuhc, egienvalue

Differential Revision: D67118173

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143347
Approved by: https://github.com/nautsimon
2024-12-17 23:37:03 +00:00
Bin Bao
a3688ead4b [AOTI][doc] Update tutorial (#143390)
Summary: Update the cpp inference part to call AOTIModelPackageLoader.run directly
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143390
Approved by: https://github.com/yushangdi
2024-12-17 18:35:40 +00:00
PyTorch MergeBot
969b07b96f Revert "[ROCm] CK Flash Attention Backend (#138947)"
This reverts commit 500d02921b.

Reverted https://github.com/pytorch/pytorch/pull/138947 on behalf of https://github.com/atalman due to Breaks default windows checkout ([comment](https://github.com/pytorch/pytorch/pull/138947#issuecomment-2548998359))
2024-12-17 16:46:57 +00:00
Andy Lugo
500d02921b [ROCm] CK Flash Attention Backend (#138947)
Replaces https://github.com/ROCm/pytorch/pull/1592

This PR contains the initial implementation of SDPA with composable_kernel backend. The CK path can be forced by simply calling `torch.backends.cuda.preferred_rocm_fa_library("ck")`. Similarly, you can force the incumbent aotriton implementation by passing in "aotriton" or "default". As you'd expect, not setting this option will result in aotriton to be used as the backend. In the case of CK, if pytorch deems flash attention usable, then it will use the CK path in all the same places aotriton would have been used. This PR makes no changes to the heuristics which select which attention scheme to use (i.e. flash attention vs memory efficient attention vs math etc etc). It only gets called when flash attention is both enabled (via `USE_FLASH_ATTENTION`) and is selected at runtime by the existing heuristics.

Files located in pytorch/aten/src/ATen/native/transformers/hip/flash_attn/ck/mha* have been pulled from https://github.com/Dao-AILab/flash-attention courtesy of @tridao's hard work who is the co-author

NOTE: In order to use this backend, the user MUST set USE_CK_FLASH_ATTENTION=1 in their environment when they build PyTorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138947
Approved by: https://github.com/pruthvistony, https://github.com/xw285cornell, https://github.com/leitian

Co-authored-by: Xiaodong Wang <xw285@cornell.edu>
2024-12-17 02:18:07 +00:00
Will Constable
9d57a39541 [C10D] Update docs for wait() (#143305)
Clarify that currently active stream, not default stream, is the one
that will be blocked by a call to wait(), and also point out that the
CPU is not blocked by the call for CUDA/nccl collectives.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143305
Approved by: https://github.com/LucasLLC, https://github.com/ngimel
2024-12-17 00:41:11 +00:00
Nichols A. Romero
c0a39ad35a [ROCm] Fix TunableOp UTs: Rotating Buffer (#143172)
TunableOp's rotating buffer feature cannot be properly tested because the environment variable that controls this feature is sticky. A Python API is introduced to modify this value.

Additional items in this PR:
* UT for rotating buffer API
* Clean up UTs that were setting the rotating buffer via the environment variable
* Align behavior of environment variable and Python API when a negative value (< 0) is set.
* Update documentation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143172
Approved by: https://github.com/jeffdaily
2024-12-14 06:18:11 +00:00
Shangdi Yu
bb574abe73 [BC-Breaking]Remove capture_pre_autograd_graph references in quantization (#139505)
Summary:
As title

This is a BC-breaking change because graph produced by "capture_pre_autograd_graph" cannot be input to quantization anymore. But this is ok, since this API is deprecated for a while and is going to be deleted. We have removed all call sites of it.

We remove the deprecated API references in code, docs, and tests.

We also removed two tests that specific to capture_pre_autograd_graph API.

Test Plan: CI

Differential Revision: D65351887

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139505
Approved by: https://github.com/tugsbayasgalan, https://github.com/andrewor14, https://github.com/jerryzh168
2024-12-13 22:26:22 +00:00
Howard Huang
b0c3d39e0d [pipelining] Update tutorials and documentation (#143045)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143045
Approved by: https://github.com/wconstab, https://github.com/kwen2501
2024-12-12 18:42:17 +00:00
Svetlana Karslioglu
0f78be5573 Fix search icon (#142808)
Removing:

.pytorch-left-menu-search input[type=text] {
    background-image: none;
}
so that the search icon correctly appears in the sphinx searchbox

Also, fixing scrolling

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142808
Approved by: https://github.com/albanD
2024-12-12 16:09:30 +00:00
gasoonjia
91261107e0 debug handler maintain through decomposition (#141612)
Add checks in the ao numberic debugger to guard the debug handle consistency between aten op decomposition

Differential Revision: [D66517480](https://our.internmc.facebook.com/intern/diff/D66517480/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141612
Approved by: https://github.com/jerryzh168
2024-12-12 12:26:45 +00:00
Xuehai Pan
18785c1af9 [BE][accelerator] formalize API name {current,set}_device_{idx => index} (#140542)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140542
Approved by: https://github.com/guangyey, https://github.com/albanD
2024-12-12 10:53:48 +00:00
PyTorch MergeBot
cd50bd8477 Revert "[BE][accelerator] formalize API name {current,set}_device_{idx => index} (#140542)"
This reverts commit fb02b40d27.

Reverted https://github.com/pytorch/pytorch/pull/140542 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I need to revert this in order to revert https://github.com/pytorch/pytorch/pull/133572#issuecomment-2537204202 due to a conflict ([comment](https://github.com/pytorch/pytorch/pull/140542#issuecomment-2537253665))
2024-12-11 21:44:23 +00:00
Xuehai Pan
fb02b40d27 [BE][accelerator] formalize API name {current,set}_device_{idx => index} (#140542)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140542
Approved by: https://github.com/guangyey, https://github.com/albanD
2024-12-11 17:57:56 +00:00
Howard Huang
88154024b3 [pipelining] Add ZBV schedule (#142084)
Adds ZBV schedule which is explained in https://arxiv.org/pdf/2401.10241, Section 6. Tested it works under the new PipelineScheduleRuntime by fixing a small bug in handling V-shaped schedules. This PR is a replacement for https://github.com/pytorch/pytorch/pull/138444

cc the original authors: @QPHutu @ufotalent https://github.com/pytorch/pytorch/pull/138444#issuecomment-2472684977

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142084
Approved by: https://github.com/kwen2501
2024-12-11 02:00:57 +00:00
lzhang2
5d6acd5a31 Register Intel distributed Backend (XCCL) in PyTorch distributed package (#141856)
### Motivation:

As design illustrated in Intel distributed support RFC https://github.com/pytorch/pytorch/issues/141741, two sections are needed to enable intel distributed backend (`XCCL`) support in PyTorch.
1. Intel GPU distributed Backend integration in PyTorch `torch-xpu-ops`.
2. **Intel distributed Backend register in PyTorch distributed package**. This PR is to contribute section 2 change.

### Example:
Here is a simple example of using spawn to launch XCCL backend and perform allreduce on XPU tensors.
```
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

def run_allreduce(rank, world_size):
    setup(rank, world_size)
    device = torch.device('xpu:{}'.format(rank))
    x = torch.randn([2, 2], device=device)
    dist.all_reduce(x)
    cleanup()

if __name__ == '__main__':
    world_size = 2
    mp.spawn(run_allreduce, args=(world_size,), nprocs=world_size, join=True)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141856
Approved by: https://github.com/kwen2501, https://github.com/gujinghui, https://github.com/albanD
2024-12-10 01:58:06 +00:00
Hyunho Yeo
005c5694eb Refactor "torch.mtia.memory_stats" API (#141723)
Summary:
This diff refactors the code for the "torch.mtia.memory_stats" API to maintain the same file hierarchy as its CUDA counterpart:
- All device memory APIs are now located under ".../mtia/memory.py".
- Device memory APIs can be accessed using either "torch.mtia.XYZ" or "torch.mtia.memory.XYZ".

Test Plan:
Passed a local unit test: `buck run //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api`

```
Ran 14 tests in 16.657s

OK
I1127 11:06:06.505201 2133030 afg_bindings.cpp:943] afg-aten::mul.out-dtype_Float-bBtLGD6Y executable has been unloaded
I1127 11:06:06.506654 2133030 afg_bindings.cpp:943] afg-add-dtype_Float-fa37JncC executable has been unloaded
W1127 11:06:08.731138 2133030 HazptrDomain.h:148] Tagged objects remain. This may indicate a higher-level leak of object(s) that use hazptr_obj_cohort.
```

Differential Revision: D66549179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141723
Approved by: https://github.com/nautsimon
2024-12-09 19:19:19 +00:00
Andrew Gu
78425bff30 [FSDP2] Move to public torch.distributed.fsdp (#141868)
**Overview**
This PR moves `torch/distributed/_composable/fsdp` to `torch/distributed/fsdp/_fully_shard` and makes public APIs available from `torch.distributed.fsdp`, e.g.:
```
from torch.distributed.fsdp import fully_shard
```
This is targeting 2.6 release. I rewrote some of the documentation with (hopefully) improved phrasing.

**Changes for Reland**
- Preserved the public objects from `torch/distributed/_composable/fsdp/fully_shard.py` so that the import path still works internally
- Added a unit test that we can do `from torch.distributed._composable.fsdp.fully_shard import FSDPModule`

Differential Revision: [D66890387](https://our.internmc.facebook.com/intern/diff/D66890387)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141868
Approved by: https://github.com/kwen2501, https://github.com/wconstab, https://github.com/weifengpy, https://github.com/fegin, https://github.com/XilunWu

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2024-12-07 01:24:28 +00:00
PyTorch MergeBot
bab15df40a Revert "[FSDP2] Move to public torch.distributed.fsdp (#141868)"
This reverts commit 45583a5df9.

Reverted https://github.com/pytorch/pytorch/pull/141868 on behalf of https://github.com/atalman due to failing internally ([comment](https://github.com/pytorch/pytorch/pull/141868#issuecomment-2523925180))
2024-12-06 18:38:12 +00:00
Shangdi Yu
02c509669a Aoti minifier flatten (#141156)
Flatten the inputs to minifier so AOTI Minifier can handle unflattened inputs and kwargs.

- flatten the inputs in minifier
- changed the "load_and_run" part of the minifier verification to run on the flattened inputs.
- refactored code to keep `torch._inductor.__init__.py` clean
- update doc

`python test/inductor/test_minifier.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141156
Approved by: https://github.com/desertfire
2024-12-06 07:12:45 +00:00
Svetlana Karslioglu
ce22a01e11 Add an option for classic search (#142018)
Fixes https://github.com/pytorch/tutorials/issues/3143

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142018
Approved by: https://github.com/albanD
2024-12-06 01:24:52 +00:00
bhack
ae9cda0221 Add truediv support in export serializer (#136364)
Fixes #136113

- [x] Inital `truediv` coverage
- [ ] Expand/reduce coverage?
- [x] Add tests
- [x] Re-check docstrings
- [ ] Linting

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136364
Approved by: https://github.com/pianpwk

Co-authored-by: Angela Yi <angelayi@meta.com>
Co-authored-by: Pian Pawakapan <pianpwk@meta.com>
2024-12-05 17:33:33 +00:00
Yukio Siraichi
f8c212a925 Transform unbacked int expressions into a fresh unbacked int. (#141917)
Fix: #141419

This PR introduces the `torch.sym_fresh_size` API, which transforms an unbacked int
expression into a fresh unbacked int.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141917
Approved by: https://github.com/ezyang
2024-12-05 16:53:44 +00:00
Yu, Guangye
8dd4673cea Support torch.xpu.mem_get_info API (#141230)
# Motivate
Fix https://github.com/pytorch/pytorch/issues/130599
This PR intends to add a new API, `torch.xpu.mem_get_info,` which is widely used in popular model workloads.
For example, [here](403c0714d1/src/accelerate/utils/modeling.py (L721)) we need to get current GPU memory usage to split or load the model.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141230
Approved by: https://github.com/EikanWang, https://github.com/albanD
2024-12-05 08:17:25 +00:00
Yiming Zhou
31f2d4eb4e [export] Update docs (#142011)
Summary:
Update export docs. Including:
1. Update the output graph.
2. Misc fixes for examples.

Test Plan: CI

Differential Revision: D66726729

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142011
Approved by: https://github.com/angelayi
2024-12-05 03:44:46 +00:00
Andrew Gu
45583a5df9 [FSDP2] Move to public torch.distributed.fsdp (#141868)
**Overview**
This PR moves `torch/distributed/_composable/fsdp` to `torch/distributed/fsdp/_fully_shard` and makes public APIs available from `torch.distributed.fsdp`, e.g.:
```
from torch.distributed.fsdp import fully_shard
```
This is targeting 2.6 release. I rewrote some of the documentation with (hopefully) improved phrasing.

**Follow-Ups**
- [x] Add some explanation in the docs about FSDP1 vs. FSDP2
- [ ] Move unit tests from `test/distributed/_composable/fsdp` to `test/distributed/fsdp/fully_shard/`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141868
Approved by: https://github.com/kwen2501, https://github.com/wconstab, https://github.com/weifengpy

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2024-12-05 03:04:01 +00:00
Svetlana Karslioglu
f7bd0c6b60 [doc] Fix the toctree level (#142008)
Changing this back 1 in order to not expand on the index.html page.
Before:
![Screenshot 2024-12-04 at 11 47 54 AM (2)](https://github.com/user-attachments/assets/40d730ee-61b9-4d60-ab13-9b9075cb3cba)
After:
![Screenshot 2024-12-04 at 11 48 30 AM (2)](https://github.com/user-attachments/assets/5eb711a0-e76c-4573-9fdf-88b6b94b31a9)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142008
Approved by: https://github.com/sekyondaMeta, https://github.com/malfet
2024-12-04 19:52:14 +00:00
rzou
827c322290 Make torch.library.triton_op public (#141880)
We've been using it privately for half a year and everything's been
good. This PR:
1. Makes torch.library.triton_op public
2. Renames capture_triton -> wrap_triton. We got feedback that no one
   knew what "capture triton" does.
3. Makes torch.library.wrap_triton public.

triton_op is used to construct a Python custom operator that may call 1+
triton kernels. Each of those triton kernels must be annotated with
wrap_triton.

Test Plan:
- existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141880
Approved by: https://github.com/albanD
ghstack dependencies: #141894
2024-12-03 16:28:56 +00:00
Benjamin Glass
4959784dac Add API query for available per-process CUDA memory (#140620)
Certain `cpp_wrapper`-enabled tests were OOM-ing in the CI pipeline, with error messages suggesting that sufficient memory was accessible.  This ultimately resulted from an internal memory limitation that was not queryable in the API.  This PR adds querying for that limit.

Additionally, the failing tests had incorrect memory availability checks, and are updated with measured memory requirements.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140620
Approved by: https://github.com/malfet, https://github.com/eqy
ghstack dependencies: #141367
2024-12-03 00:24:03 +00:00
Hyunho Yeo
d70b7029c8 [MTIA] Support torch.mtia.empty_cache() (#141533)
Summary: As title

Test Plan:
Passed a local unit test: `buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api`

https://www.internalfb.com/intern/testinfra/testrun/4785074861101240

Reviewed By: nautsimon

Differential Revision: D66481778

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141533
Approved by: https://github.com/nautsimon
2024-11-28 02:24:19 +00:00
Mark Saroufim
e24190709f [BE] Remove Model Dump utility (#141540)
So I found this utility by accident, trying to find how many html files we have in the repo so I could convert them to markdown

Turns out we package some html and js files in pytorch to visualize torchscript models. This seems kinda strange, probably shouldn't be in core, I removed the tests I could find. Maybe some internal tests will break but considering torchscript is being superseded might make sense to do this

Last time there was a meaningful update to the test for this file was about 2 years ago by @digantdesai since then it's a bunch of routine upgrades

It seems like this package is unused https://github.com/search?type=code&auto_enroll=true&q=torch.utils.model_dump&p=1 I skimmed through 5 pages of these and the only time this shows up in code search is when someone is either cloning pytorch or checking in their venv into github
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141540
Approved by: https://github.com/malfet
2024-11-27 22:52:55 +00:00
Isuru Fernando
b37cfddeb3 Refactor ShapeGuardPrinter for future C++ addiiton (#140968)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140968
Approved by: https://github.com/anijain2305
ghstack dependencies: #140597
2024-11-27 20:09:58 +00:00
PyTorch MergeBot
6e61ff4fd3 Revert "Add truediv support in export serializer (#136364)"
This reverts commit 1df440dc4e.

Reverted https://github.com/pytorch/pytorch/pull/136364 on behalf of https://github.com/huydhn due to Sorry for reverting your change but its doc build failure is legit ([comment](https://github.com/pytorch/pytorch/pull/136364#issuecomment-2502620732))
2024-11-27 03:24:31 +00:00
Svetlana Karslioglu
807a7dbf9f Don't generate modindex (#141601)
Fixes https://github.com/pytorch/pytorch/issues/141591
The generated index looks ugly. Attempting to not generate it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141601
Approved by: https://github.com/malfet, https://github.com/albanD
2024-11-27 02:07:21 +00:00
bhack
1df440dc4e Add truediv support in export serializer (#136364)
Fixes #136113

- [x] Inital `truediv` coverage
- [ ] Expand/reduce coverage?
- [x] Add tests
- [x] Re-check docstrings
- [ ] Linting

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136364
Approved by: https://github.com/pianpwk

Co-authored-by: Angela Yi <angelayi@meta.com>
Co-authored-by: Pian Pawakapan <pianpwk@meta.com>
2024-11-27 00:31:47 +00:00
Nichols A. Romero
a99332eb25 [ROCM] Support Multi-GPU offline tuning in TunableOp (#139673)
This PR enhances offline tuning to support multi-GPUs.

High-level description of algorithm:
- Duplicate GEMMs are first eliminated
- GEMMs are distributed to multi-GPUs for tuning
- Results are gathered into a file with `_full` in the filename

Also adding support for GemmAndBias and ScaledGemm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139673
Approved by: https://github.com/jeffdaily, https://github.com/hongxiayang
2024-11-26 19:07:41 +00:00
Stephen Matthews
2bbd984aa2 Fix typo in Reproducibility docs (#141341)
Fixes trivial issue in the docs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141341
Approved by: https://github.com/svekars
2024-11-26 16:53:26 +00:00
ZhiweiYan-96
c418a9ac75 [Intel GPU] XPUInductorQuantizer for XPU int8 recipe customization (#139578)
# Motivation
This PR add `XPUInductorQuantizer`, which would defined the recipe of int8 quantization at XPU backend.

# Detailed
The `XPUInductorQuantizer` is class derived from `X86InductorQuantizer` as both quantizer would take the advantage of highly optimized operators in oneDNN library(qconv, qlinear, qconv/qlinear fusion).

We share the same recipe as `X86InductorQuantizer`, so we would have same `annotate_xxxx` methods.  So, in ideal situation, the `XPUInductorQuantizer` would have no class body as all implementation can inherit from base class.

In this PR, we override the `annotate_xxx` method for operators that has NOT be implemented. All operators XPU backend does  not implement would be fallbacked to fp32 implementation as the node in graph is a `dq-op-q` pairs. This would help provide good OOB usability for XPU backend.   On the other hand, the implemented operators would uses `annotate_op` implemented in base class and could be lowered successfully.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139578
Approved by: https://github.com/EikanWang, https://github.com/leslie-fang-intel, https://github.com/CuiYifeng, https://github.com/jerryzh168
ghstack dependencies: #133080
2024-11-26 09:44:14 +00:00