pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
PyTorch MergeBot	dabc9566c4	Revert "(MTIA) Move "empty_cache" API (#143402 )" This reverts commit `c7d9f29807`. Reverted https://github.com/pytorch/pytorch/pull/143402 on behalf of https://github.com/huydhn due to The internal diff D67148738 has been reverted ([comment](https://github.com/pytorch/pytorch/pull/143402#issuecomment-2557982597))	2024-12-21 04:01:23 +00:00
Mikayla Gawarecki	8e483654cb	Add config.save.use_pinned_memory_for_d2h to serialization config (#143342 ) This was benchmarked with two separate scripts on my A100 (A) Save state_dict of llama3-style model on CUDA to disk with ``torch.save`` (B) Save `ModuleList` of 10 `nn.Linear(10,000, 10,000)` on CUDA to disk with `torch.save` Timings are an average of 5 runs and benchmark scripts + results are attached Under both scenarios, we see ~2x speedup in ``torch.save`` time with (``compute_crc32=False`` and ``use_pinned_memory_for_d2h=True``) compared to the baseline of the current defaults (``compute_crc32=True`` and ``use_pinned_memory_for_d2h=False`` (A) Save state_dict of llama3-style model on CUDA to disk with ``torch.save`` [[script](https://gist.github.com/mikaylagawarecki/d3a86ea1bb08045d1a839976808d7432)][[results](https://gist.github.com/mikaylagawarecki/f61a4714e5cff703146a1fcb7e0c755c)] \| \| use_pinned_memory_for_d2h=False (Default) \| use_pinned_memory_for_d2h=True \| \|-\|-\|-\| \| `compute_crc_32= True` (Default)\| 28.54s \| 20.76s \| \| `compute_crc_32 = False` \| 22.57s \| 14.51s \| (B) Save `ModuleList` of 10 `nn.Linear(10,000, 10,000)` on CUDA to disk with `torch.save` [[script](https://gist.github.com/mikaylagawarecki/ecbc505436bdd4b5190ef1b3430c12b6)][[results](https://gist.github.com/mikaylagawarecki/4e686bcf030b57de8c3ca74d8f5a88f7)] \| \| use_pinned_memory_for_d2h=False (Default) \| use_pinned_memory_for_d2h=True \| \|-\|-\|-\| \| `compute_crc_32= True` (Default)\| 8.38s \| 5.53s \| \| `compute_crc_32 = False` \| 6.94s \| 3.99s \| Trace of (A) with `use_pinned_memory_for_d2h=True`, `compute_crc32=False` <img width="1745" alt="Screenshot 2024-12-16 at 7 32 33 PM" src="https://github.com/user-attachments/assets/80b87a8c-5a70-4eb9-ad66-7abc4aa7cc25" /> Baseline trace of (A) with `use_pinned_memory_for_d2h=False`, `compute_crc32=True` <img width="1799" alt="Screenshot 2024-12-16 at 7 38 20 PM" src="https://github.com/user-attachments/assets/13fa12d1-8f5f-424c-adc4-275b67012927" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/143342 Approved by: https://github.com/albanD ghstack dependencies: #143324	2024-12-20 21:01:18 +00:00
Mikayla Gawarecki	3f63b742e6	Refactor serialization getter/setters into torch.utils.serialization.config (#143324 ) Consolidate - get/set_default_load_endianness - get/set_default_mmap_options - get/set_crc32_options into one global dynamo-style config + allow global setting of mmap. The existing APIs are not removed and will get/set from the config (as they can't be removed for BC) In #143459 I add the local (argument style) config Pull Request resolved: https://github.com/pytorch/pytorch/pull/143324 Approved by: https://github.com/albanD	2024-12-20 21:01:17 +00:00
Nikhil Gupta	94737e8a2a	[ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124 ) Description: 1. Quantize Linear Layer Weights to 4-bits: Quantize the weights of the Linear layer to 4 bits, using symmetric quantization. Pack two 4-bit weights into one uint8 container. Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32. 2. Prepare Quantized Weights, Scales, and Optional Bias: After quantizing, obtain the quantized_weights, scales, and groupsize. If the original Linear layer has a bias, prepare it as well. 3. Pack the Weights Efficiently: Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias. ```python packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features) ``` Input parameters should include: in_features and out_features (the same as the Linear layer’s corresponding parameters). 4. Perform Dynamic Quantized Matrix Multiplication: Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights. ```python output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights, groupsize, in_features, out_features) ``` Inputs required include: The input tensor, packed_weights , groupsize, and the in_features and out_features. API Usage: https://github.com/pytorch/pytorch/issues/143289 Model Perf : 7B Transformer model: Prefill : 340 t/s Decode : 40 t/s 2B Transformer model Prefill : 747 t/s Decode : 80 t/s Tests: python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight Ran 1 test in 0.016s OK python test/test_linalg.py -k test__dyn_quant_matmul_4bit Ran 8 tests in 0.077s OK python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit Ran 8 tests in 11.454s Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452 Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124 Approved by: https://github.com/digantdesai, https://github.com/malfet	2024-12-20 19:32:03 +00:00
Hyunho Yeo	c7d9f29807	(MTIA) Move "empty_cache" API (#143402 ) Summary: This diff moves one of memory-related APIs to the consolidated location, which is `mtia/memory.py`. Test Plan: ``` buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api ``` https://www.internalfb.com/intern/testinfra/testrun/13510798943184259 Reviewed By: nautsimon Differential Revision: D67148738 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143402 Approved by: https://github.com/nautsimon	2024-12-20 17:39:06 +00:00
Avik Chaudhuri	29b586bbad	fix formatting in programming model doc (#143587 ) Test Plan: Some of the formatting in https://docs-preview.pytorch.org/pytorch/pytorch/143546/export.programming_model.html is broken. Differential Revision: D67458972 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143587 Approved by: https://github.com/yushangdi	2024-12-20 07:09:19 +00:00
PyTorch MergeBot	8136daff5a	Revert "[ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124 )" This reverts commit `4b82251011`. Reverted https://github.com/pytorch/pytorch/pull/134124 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it breaks lots of internal build ([comment](https://github.com/pytorch/pytorch/pull/134124#issuecomment-2555953189))	2024-12-19 23:33:17 +00:00
Nikhil Gupta	4b82251011	[ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124 ) Description: 1. Quantize Linear Layer Weights to 4-bits: Quantize the weights of the Linear layer to 4 bits, using symmetric quantization. Pack two 4-bit weights into one uint8 container. Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32. 2. Prepare Quantized Weights, Scales, and Optional Bias: After quantizing, obtain the quantized_weights, scales, and groupsize. If the original Linear layer has a bias, prepare it as well. 3. Pack the Weights Efficiently: Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias. ```python packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features) ``` Input parameters should include: in_features and out_features (the same as the Linear layer’s corresponding parameters). 4. Perform Dynamic Quantized Matrix Multiplication: Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights. ```python output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights, groupsize, in_features, out_features) ``` Inputs required include: The input tensor, packed_weights , groupsize, and the in_features and out_features. API Usage: https://github.com/pytorch/pytorch/issues/143289 Model Perf : 7B Transformer model: Prefill : 340 t/s Decode : 40 t/s 2B Transformer model Prefill : 747 t/s Decode : 80 t/s Tests: python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight Ran 1 test in 0.016s OK python test/test_linalg.py -k test__dyn_quant_matmul_4bit Ran 8 tests in 0.077s OK python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit Ran 8 tests in 11.454s Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452 Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124 Approved by: https://github.com/digantdesai, https://github.com/malfet	2024-12-19 18:51:26 +00:00
Avik Chaudhuri	1433bad0e4	torch export programming model (#143546 ) Differential Revision: [D67429743](https://our.internmc.facebook.com/intern/diff/D67429743/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143546 Approved by: https://github.com/ydwu4	2024-12-19 16:56:13 +00:00
PyTorch MergeBot	14fe1f7190	Revert "[ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124 )" This reverts commit `d3ff2d42c2`. Reverted https://github.com/pytorch/pytorch/pull/134124 on behalf of https://github.com/malfet due to This broke S390 builds, includes cpuinfo unconditionally ([comment](https://github.com/pytorch/pytorch/pull/134124#issuecomment-2552560208))	2024-12-19 01:05:11 +00:00
Nikhil Gupta	d3ff2d42c2	[ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124 ) Description: 1. Quantize Linear Layer Weights to 4-bits: Quantize the weights of the Linear layer to 4 bits, using symmetric quantization. Pack two 4-bit weights into one uint8 container. Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32. 2. Prepare Quantized Weights, Scales, and Optional Bias: After quantizing, obtain the quantized_weights, scales, and groupsize. If the original Linear layer has a bias, prepare it as well. 3. Pack the Weights Efficiently: Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias. ```python packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features) ``` Input parameters should include: in_features and out_features (the same as the Linear layer’s corresponding parameters). 4. Perform Dynamic Quantized Matrix Multiplication: Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights. ```python output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights, groupsize, in_features, out_features) ``` Inputs required include: The input tensor, packed_weights , groupsize, and the in_features and out_features. API Usage: https://github.com/pytorch/pytorch/issues/143289 Model Perf : 7B Transformer model: Prefill : 340 t/s Decode : 40 t/s 2B Transformer model Prefill : 747 t/s Decode : 80 t/s Tests: python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight Ran 1 test in 0.016s OK python test/test_linalg.py -k test__dyn_quant_matmul_4bit Ran 8 tests in 0.077s OK python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit Ran 8 tests in 11.454s Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452 Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124 Approved by: https://github.com/digantdesai, https://github.com/malfet	2024-12-18 22:30:07 +00:00
Yidi Wu	1e201422ed	[export] add is_exporting flag (#142425 ) We added an is_export flag under torch.compiler.is_exporting. This comes handy when we try to do some special logic in user-level and system-level (e.g. in upper of the stack). In increasing-scope: - `_is_fx_tracing` is set to True when we use under symbolic_trace or make_fx. - `is_exporting` is set to True when we're doing strict or non-strict export, which internally has a step that calls make_fx and set _is_fx_tracing to be True. - `is_compiling` is set to True when we're either doing strict, non-strict export or torch.compile. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142425 Approved by: https://github.com/avikchaudhuri	2024-12-18 21:36:28 +00:00
Zizeng Meng	eb67dd3e2d	[3/N][Memory Profiling] Add memory profiling function for MTIA hooks (#142149 ) Design Doc: https://fburl.com/gdoc/47zpuweb Prototyping: D66469341 In this diff, we implement two new mtia hooks to start/stop profiler and export the memory snapshot. In next diff, we will integrate the mtia backend with profiler python api Differential Revision: [D66823583](https://our.internmc.facebook.com/intern/diff/D66823583/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/142149 Approved by: https://github.com/nautsimon	2024-12-18 11:58:23 +00:00
Hyunho Yeo	efe21ee59d	[MTIA] (3/n) Implement PyTorch APIs to query/reset device peak memory usage (#143347 ) Summary: This diff implements the "max_memory_allocated" PyTorch API for MTIA devices, which returns the peak device DRAM usage Test Plan: Passed the local unit test ``` buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- -r test_max_memory_allocated ``` https://www.internalfb.com/intern/testinfra/testrun/8444249544807192 Reviewed By: yuhc, egienvalue Differential Revision: D67118173 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143347 Approved by: https://github.com/nautsimon	2024-12-17 23:37:03 +00:00
Bin Bao	a3688ead4b	[AOTI][doc] Update tutorial (#143390 ) Summary: Update the cpp inference part to call AOTIModelPackageLoader.run directly Pull Request resolved: https://github.com/pytorch/pytorch/pull/143390 Approved by: https://github.com/yushangdi	2024-12-17 18:35:40 +00:00
PyTorch MergeBot	969b07b96f	Revert "[ROCm] CK Flash Attention Backend (#138947 )" This reverts commit `500d02921b`. Reverted https://github.com/pytorch/pytorch/pull/138947 on behalf of https://github.com/atalman due to Breaks default windows checkout ([comment](https://github.com/pytorch/pytorch/pull/138947#issuecomment-2548998359))	2024-12-17 16:46:57 +00:00
Andy Lugo	500d02921b	[ROCm] CK Flash Attention Backend (#138947 ) Replaces https://github.com/ROCm/pytorch/pull/1592 This PR contains the initial implementation of SDPA with composable_kernel backend. The CK path can be forced by simply calling `torch.backends.cuda.preferred_rocm_fa_library("ck")`. Similarly, you can force the incumbent aotriton implementation by passing in "aotriton" or "default". As you'd expect, not setting this option will result in aotriton to be used as the backend. In the case of CK, if pytorch deems flash attention usable, then it will use the CK path in all the same places aotriton would have been used. This PR makes no changes to the heuristics which select which attention scheme to use (i.e. flash attention vs memory efficient attention vs math etc etc). It only gets called when flash attention is both enabled (via `USE_FLASH_ATTENTION`) and is selected at runtime by the existing heuristics. Files located in pytorch/aten/src/ATen/native/transformers/hip/flash_attn/ck/mha* have been pulled from https://github.com/Dao-AILab/flash-attention courtesy of @tridao's hard work who is the co-author NOTE: In order to use this backend, the user MUST set USE_CK_FLASH_ATTENTION=1 in their environment when they build PyTorch. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138947 Approved by: https://github.com/pruthvistony, https://github.com/xw285cornell, https://github.com/leitian Co-authored-by: Xiaodong Wang <xw285@cornell.edu>	2024-12-17 02:18:07 +00:00
Will Constable	9d57a39541	[C10D] Update docs for wait() (#143305 ) Clarify that currently active stream, not default stream, is the one that will be blocked by a call to wait(), and also point out that the CPU is not blocked by the call for CUDA/nccl collectives. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143305 Approved by: https://github.com/LucasLLC, https://github.com/ngimel	2024-12-17 00:41:11 +00:00
Nichols A. Romero	c0a39ad35a	[ROCm] Fix TunableOp UTs: Rotating Buffer (#143172 ) TunableOp's rotating buffer feature cannot be properly tested because the environment variable that controls this feature is sticky. A Python API is introduced to modify this value. Additional items in this PR: * UT for rotating buffer API * Clean up UTs that were setting the rotating buffer via the environment variable * Align behavior of environment variable and Python API when a negative value (< 0) is set. * Update documentation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143172 Approved by: https://github.com/jeffdaily	2024-12-14 06:18:11 +00:00
Shangdi Yu	bb574abe73	[BC-Breaking]Remove capture_pre_autograd_graph references in quantization (#139505 ) Summary: As title This is a BC-breaking change because graph produced by "capture_pre_autograd_graph" cannot be input to quantization anymore. But this is ok, since this API is deprecated for a while and is going to be deleted. We have removed all call sites of it. We remove the deprecated API references in code, docs, and tests. We also removed two tests that specific to capture_pre_autograd_graph API. Test Plan: CI Differential Revision: D65351887 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139505 Approved by: https://github.com/tugsbayasgalan, https://github.com/andrewor14, https://github.com/jerryzh168	2024-12-13 22:26:22 +00:00
Howard Huang	b0c3d39e0d	[pipelining] Update tutorials and documentation (#143045 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143045 Approved by: https://github.com/wconstab, https://github.com/kwen2501	2024-12-12 18:42:17 +00:00
Svetlana Karslioglu	0f78be5573	Fix search icon (#142808 ) Removing: .pytorch-left-menu-search input[type=text] { background-image: none; } so that the search icon correctly appears in the sphinx searchbox Also, fixing scrolling Pull Request resolved: https://github.com/pytorch/pytorch/pull/142808 Approved by: https://github.com/albanD	2024-12-12 16:09:30 +00:00
gasoonjia	91261107e0	debug handler maintain through decomposition (#141612 ) Add checks in the ao numberic debugger to guard the debug handle consistency between aten op decomposition Differential Revision: [D66517480](https://our.internmc.facebook.com/intern/diff/D66517480/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141612 Approved by: https://github.com/jerryzh168	2024-12-12 12:26:45 +00:00
Xuehai Pan	18785c1af9	[BE][accelerator] formalize API name `{current,set}_device_{idx => index}` (#140542 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140542 Approved by: https://github.com/guangyey, https://github.com/albanD	2024-12-12 10:53:48 +00:00
PyTorch MergeBot	cd50bd8477	Revert "[BE][accelerator] formalize API name `{current,set}_device_{idx => index}` (#140542 )" This reverts commit `fb02b40d27`. Reverted https://github.com/pytorch/pytorch/pull/140542 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I need to revert this in order to revert https://github.com/pytorch/pytorch/pull/133572#issuecomment-2537204202 due to a conflict ([comment](https://github.com/pytorch/pytorch/pull/140542#issuecomment-2537253665))	2024-12-11 21:44:23 +00:00
Xuehai Pan	fb02b40d27	[BE][accelerator] formalize API name `{current,set}_device_{idx => index}` (#140542 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140542 Approved by: https://github.com/guangyey, https://github.com/albanD	2024-12-11 17:57:56 +00:00
Howard Huang	88154024b3	[pipelining] Add ZBV schedule (#142084 ) Adds ZBV schedule which is explained in https://arxiv.org/pdf/2401.10241, Section 6. Tested it works under the new PipelineScheduleRuntime by fixing a small bug in handling V-shaped schedules. This PR is a replacement for https://github.com/pytorch/pytorch/pull/138444 cc the original authors: @QPHutu @ufotalent https://github.com/pytorch/pytorch/pull/138444#issuecomment-2472684977 Pull Request resolved: https://github.com/pytorch/pytorch/pull/142084 Approved by: https://github.com/kwen2501	2024-12-11 02:00:57 +00:00
lzhang2	5d6acd5a31	Register Intel distributed Backend (`XCCL`) in PyTorch distributed package (#141856 ) ### Motivation: As design illustrated in Intel distributed support RFC https://github.com/pytorch/pytorch/issues/141741, two sections are needed to enable intel distributed backend (`XCCL`) support in PyTorch. 1. Intel GPU distributed Backend integration in PyTorch `torch-xpu-ops`. 2. Intel distributed Backend register in PyTorch distributed package. This PR is to contribute section 2 change. ### Example: Here is a simple example of using spawn to launch XCCL backend and perform allreduce on XPU tensors. ``` import os import torch import torch.distributed as dist import torch.multiprocessing as mp def setup(rank, world_size): os.environ['MASTER_ADDR'] = 'localhost' os.environ['MASTER_PORT'] = '29500' dist.init_process_group(rank=rank, world_size=world_size) def cleanup(): dist.destroy_process_group() def run_allreduce(rank, world_size): setup(rank, world_size) device = torch.device('xpu:{}'.format(rank)) x = torch.randn([2, 2], device=device) dist.all_reduce(x) cleanup() if __name__ == '__main__': world_size = 2 mp.spawn(run_allreduce, args=(world_size,), nprocs=world_size, join=True) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/141856 Approved by: https://github.com/kwen2501, https://github.com/gujinghui, https://github.com/albanD	2024-12-10 01:58:06 +00:00
Hyunho Yeo	005c5694eb	Refactor "torch.mtia.memory_stats" API (#141723 ) Summary: This diff refactors the code for the "torch.mtia.memory_stats" API to maintain the same file hierarchy as its CUDA counterpart: - All device memory APIs are now located under ".../mtia/memory.py". - Device memory APIs can be accessed using either "torch.mtia.XYZ" or "torch.mtia.memory.XYZ". Test Plan: Passed a local unit test: `buck run //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api` ``` Ran 14 tests in 16.657s OK I1127 11:06:06.505201 2133030 afg_bindings.cpp:943] afg-aten::mul.out-dtype_Float-bBtLGD6Y executable has been unloaded I1127 11:06:06.506654 2133030 afg_bindings.cpp:943] afg-add-dtype_Float-fa37JncC executable has been unloaded W1127 11:06:08.731138 2133030 HazptrDomain.h:148] Tagged objects remain. This may indicate a higher-level leak of object(s) that use hazptr_obj_cohort. ``` Differential Revision: D66549179 Pull Request resolved: https://github.com/pytorch/pytorch/pull/141723 Approved by: https://github.com/nautsimon	2024-12-09 19:19:19 +00:00
Andrew Gu	78425bff30	[FSDP2] Move to public `torch.distributed.fsdp` (#141868 ) Overview This PR moves `torch/distributed/_composable/fsdp` to `torch/distributed/fsdp/_fully_shard` and makes public APIs available from `torch.distributed.fsdp`, e.g.: ``` from torch.distributed.fsdp import fully_shard ``` This is targeting 2.6 release. I rewrote some of the documentation with (hopefully) improved phrasing. Changes for Reland - Preserved the public objects from `torch/distributed/_composable/fsdp/fully_shard.py` so that the import path still works internally - Added a unit test that we can do `from torch.distributed._composable.fsdp.fully_shard import FSDPModule` Differential Revision: [D66890387](https://our.internmc.facebook.com/intern/diff/D66890387) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141868 Approved by: https://github.com/kwen2501, https://github.com/wconstab, https://github.com/weifengpy, https://github.com/fegin, https://github.com/XilunWu Co-authored-by: Svetlana Karslioglu <svekars@meta.com>	2024-12-07 01:24:28 +00:00
PyTorch MergeBot	bab15df40a	Revert "[FSDP2] Move to public `torch.distributed.fsdp` (#141868 )" This reverts commit `45583a5df9`. Reverted https://github.com/pytorch/pytorch/pull/141868 on behalf of https://github.com/atalman due to failing internally ([comment](https://github.com/pytorch/pytorch/pull/141868#issuecomment-2523925180))	2024-12-06 18:38:12 +00:00
Shangdi Yu	02c509669a	Aoti minifier flatten (#141156 ) Flatten the inputs to minifier so AOTI Minifier can handle unflattened inputs and kwargs. - flatten the inputs in minifier - changed the "load_and_run" part of the minifier verification to run on the flattened inputs. - refactored code to keep `torch._inductor.__init__.py` clean - update doc `python test/inductor/test_minifier.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/141156 Approved by: https://github.com/desertfire	2024-12-06 07:12:45 +00:00
Svetlana Karslioglu	ce22a01e11	Add an option for classic search (#142018 ) Fixes https://github.com/pytorch/tutorials/issues/3143 Pull Request resolved: https://github.com/pytorch/pytorch/pull/142018 Approved by: https://github.com/albanD	2024-12-06 01:24:52 +00:00
bhack	ae9cda0221	Add `truediv` support in export serializer (#136364 ) Fixes #136113 - [x] Inital `truediv` coverage - [ ] Expand/reduce coverage? - [x] Add tests - [x] Re-check docstrings - [ ] Linting Pull Request resolved: https://github.com/pytorch/pytorch/pull/136364 Approved by: https://github.com/pianpwk Co-authored-by: Angela Yi <angelayi@meta.com> Co-authored-by: Pian Pawakapan <pianpwk@meta.com>	2024-12-05 17:33:33 +00:00
Yukio Siraichi	f8c212a925	Transform unbacked int expressions into a fresh unbacked int. (#141917 ) Fix: #141419 This PR introduces the `torch.sym_fresh_size` API, which transforms an unbacked int expression into a fresh unbacked int. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141917 Approved by: https://github.com/ezyang	2024-12-05 16:53:44 +00:00
Yu, Guangye	8dd4673cea	Support torch.xpu.mem_get_info API (#141230 ) # Motivate Fix https://github.com/pytorch/pytorch/issues/130599 This PR intends to add a new API, `torch.xpu.mem_get_info,` which is widely used in popular model workloads. For example, [here](`403c0714d1/src/accelerate/utils/modeling.py (L721)`) we need to get current GPU memory usage to split or load the model. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141230 Approved by: https://github.com/EikanWang, https://github.com/albanD	2024-12-05 08:17:25 +00:00
Yiming Zhou	31f2d4eb4e	[export] Update docs (#142011 ) Summary: Update export docs. Including: 1. Update the output graph. 2. Misc fixes for examples. Test Plan: CI Differential Revision: D66726729 Pull Request resolved: https://github.com/pytorch/pytorch/pull/142011 Approved by: https://github.com/angelayi	2024-12-05 03:44:46 +00:00
Andrew Gu	45583a5df9	[FSDP2] Move to public `torch.distributed.fsdp` (#141868 ) Overview This PR moves `torch/distributed/_composable/fsdp` to `torch/distributed/fsdp/_fully_shard` and makes public APIs available from `torch.distributed.fsdp`, e.g.: ``` from torch.distributed.fsdp import fully_shard ``` This is targeting 2.6 release. I rewrote some of the documentation with (hopefully) improved phrasing. Follow-Ups - [x] Add some explanation in the docs about FSDP1 vs. FSDP2 - [ ] Move unit tests from `test/distributed/_composable/fsdp` to `test/distributed/fsdp/fully_shard/` Pull Request resolved: https://github.com/pytorch/pytorch/pull/141868 Approved by: https://github.com/kwen2501, https://github.com/wconstab, https://github.com/weifengpy Co-authored-by: Svetlana Karslioglu <svekars@meta.com>	2024-12-05 03:04:01 +00:00
Svetlana Karslioglu	f7bd0c6b60	[doc] Fix the toctree level (#142008 ) Changing this back 1 in order to not expand on the index.html page. Before: ![Screenshot 2024-12-04 at 11 47 54 AM (2)](https://github.com/user-attachments/assets/40d730ee-61b9-4d60-ab13-9b9075cb3cba) After: ![Screenshot 2024-12-04 at 11 48 30 AM (2)](https://github.com/user-attachments/assets/5eb711a0-e76c-4573-9fdf-88b6b94b31a9) Pull Request resolved: https://github.com/pytorch/pytorch/pull/142008 Approved by: https://github.com/sekyondaMeta, https://github.com/malfet	2024-12-04 19:52:14 +00:00
rzou	827c322290	Make torch.library.triton_op public (#141880 ) We've been using it privately for half a year and everything's been good. This PR: 1. Makes torch.library.triton_op public 2. Renames capture_triton -> wrap_triton. We got feedback that no one knew what "capture triton" does. 3. Makes torch.library.wrap_triton public. triton_op is used to construct a Python custom operator that may call 1+ triton kernels. Each of those triton kernels must be annotated with wrap_triton. Test Plan: - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/141880 Approved by: https://github.com/albanD ghstack dependencies: #141894	2024-12-03 16:28:56 +00:00
Benjamin Glass	4959784dac	Add API query for available per-process CUDA memory (#140620 ) Certain `cpp_wrapper`-enabled tests were OOM-ing in the CI pipeline, with error messages suggesting that sufficient memory was accessible. This ultimately resulted from an internal memory limitation that was not queryable in the API. This PR adds querying for that limit. Additionally, the failing tests had incorrect memory availability checks, and are updated with measured memory requirements. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140620 Approved by: https://github.com/malfet, https://github.com/eqy ghstack dependencies: #141367	2024-12-03 00:24:03 +00:00
Hyunho Yeo	d70b7029c8	[MTIA] Support torch.mtia.empty_cache() (#141533 ) Summary: As title Test Plan: Passed a local unit test: `buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api` https://www.internalfb.com/intern/testinfra/testrun/4785074861101240 Reviewed By: nautsimon Differential Revision: D66481778 Pull Request resolved: https://github.com/pytorch/pytorch/pull/141533 Approved by: https://github.com/nautsimon	2024-11-28 02:24:19 +00:00
Mark Saroufim	e24190709f	[BE] Remove Model Dump utility (#141540 ) So I found this utility by accident, trying to find how many html files we have in the repo so I could convert them to markdown Turns out we package some html and js files in pytorch to visualize torchscript models. This seems kinda strange, probably shouldn't be in core, I removed the tests I could find. Maybe some internal tests will break but considering torchscript is being superseded might make sense to do this Last time there was a meaningful update to the test for this file was about 2 years ago by @digantdesai since then it's a bunch of routine upgrades It seems like this package is unused https://github.com/search?type=code&auto_enroll=true&q=torch.utils.model_dump&p=1 I skimmed through 5 pages of these and the only time this shows up in code search is when someone is either cloning pytorch or checking in their venv into github Pull Request resolved: https://github.com/pytorch/pytorch/pull/141540 Approved by: https://github.com/malfet	2024-11-27 22:52:55 +00:00
Isuru Fernando	b37cfddeb3	Refactor ShapeGuardPrinter for future C++ addiiton (#140968 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140968 Approved by: https://github.com/anijain2305 ghstack dependencies: #140597	2024-11-27 20:09:58 +00:00
PyTorch MergeBot	6e61ff4fd3	Revert "Add `truediv` support in export serializer (#136364 )" This reverts commit `1df440dc4e`. Reverted https://github.com/pytorch/pytorch/pull/136364 on behalf of https://github.com/huydhn due to Sorry for reverting your change but its doc build failure is legit ([comment](https://github.com/pytorch/pytorch/pull/136364#issuecomment-2502620732))	2024-11-27 03:24:31 +00:00
Svetlana Karslioglu	807a7dbf9f	Don't generate modindex (#141601 ) Fixes https://github.com/pytorch/pytorch/issues/141591 The generated index looks ugly. Attempting to not generate it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141601 Approved by: https://github.com/malfet, https://github.com/albanD	2024-11-27 02:07:21 +00:00
bhack	1df440dc4e	Add `truediv` support in export serializer (#136364 ) Fixes #136113 - [x] Inital `truediv` coverage - [ ] Expand/reduce coverage? - [x] Add tests - [x] Re-check docstrings - [ ] Linting Pull Request resolved: https://github.com/pytorch/pytorch/pull/136364 Approved by: https://github.com/pianpwk Co-authored-by: Angela Yi <angelayi@meta.com> Co-authored-by: Pian Pawakapan <pianpwk@meta.com>	2024-11-27 00:31:47 +00:00
Nichols A. Romero	a99332eb25	[ROCM] Support Multi-GPU offline tuning in TunableOp (#139673 ) This PR enhances offline tuning to support multi-GPUs. High-level description of algorithm: - Duplicate GEMMs are first eliminated - GEMMs are distributed to multi-GPUs for tuning - Results are gathered into a file with `_full` in the filename Also adding support for GemmAndBias and ScaledGemm Pull Request resolved: https://github.com/pytorch/pytorch/pull/139673 Approved by: https://github.com/jeffdaily, https://github.com/hongxiayang	2024-11-26 19:07:41 +00:00
Stephen Matthews	2bbd984aa2	Fix typo in Reproducibility docs (#141341 ) Fixes trivial issue in the docs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141341 Approved by: https://github.com/svekars	2024-11-26 16:53:26 +00:00
ZhiweiYan-96	c418a9ac75	[Intel GPU] XPUInductorQuantizer for XPU int8 recipe customization (#139578 ) # Motivation This PR add `XPUInductorQuantizer`, which would defined the recipe of int8 quantization at XPU backend. # Detailed The `XPUInductorQuantizer` is class derived from `X86InductorQuantizer` as both quantizer would take the advantage of highly optimized operators in oneDNN library(qconv, qlinear, qconv/qlinear fusion). We share the same recipe as `X86InductorQuantizer`, so we would have same `annotate_xxxx` methods. So, in ideal situation, the `XPUInductorQuantizer` would have no class body as all implementation can inherit from base class. In this PR, we override the `annotate_xxx` method for operators that has NOT be implemented. All operators XPU backend does not implement would be fallbacked to fp32 implementation as the node in graph is a `dq-op-q` pairs. This would help provide good OOB usability for XPU backend. On the other hand, the implemented operators would uses `annotate_op` implemented in base class and could be lowered successfully. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139578 Approved by: https://github.com/EikanWang, https://github.com/leslie-fang-intel, https://github.com/CuiYifeng, https://github.com/jerryzh168 ghstack dependencies: #133080	2024-11-26 09:44:14 +00:00

1 2 3 4 5 ...

2824 Commits