pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Eddie Yan	e2917245fb	[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 ) Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs... Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441 Approved by: https://github.com/Chillee, https://github.com/malfet	2025-01-30 22:33:50 +00:00
Benjamin Glass	5aa5a5763e	[inductor triton] Disable incorrect TF32 usage on CUDA capability < 8 (#145684 ) Triton 2.2 and greater have a bug where allowing TF32 generation for a GPU that does not support TF32 will cause code generation errors. Patch around this problem by: 1. Adding a function to `torch.cuda` that determines whether CUDA hardware is capable of using the TF32 format. 2. Using that function to explicitly disable TF32 generation when calling Triton, where needed. To demonstrate that this fix works, try running `test/inductor/test_max_autotune.py` on a GPU with CUDA compute capability < 8 (e.g. any NVIDIA consumer GPU) without this fix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145684 Approved by: https://github.com/eqy	2025-01-28 22:01:08 +00:00
Zheng, Zhaoqiong	9003d81144	change the test wheel to release wheel when release wheel available (#145252 ) change the test wheel to release wheel when release wheel available Pull Request resolved: https://github.com/pytorch/pytorch/pull/145252 Approved by: https://github.com/seemethere, https://github.com/atalman Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-01-28 21:23:53 +00:00
PyTorch MergeBot	9010649292	Revert "Add option to serialization config to reduce random reads from get_record_offset when loading with mmap=True (#143880 )" This reverts commit `db3685a35c`. Reverted https://github.com/pytorch/pytorch/pull/143880 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but either this PR or the base PR breaks distributed tests ([comment](https://github.com/pytorch/pytorch/pull/143880#issuecomment-2617743403))	2025-01-28 03:07:17 +00:00
Mikayla Gawarecki	db3685a35c	Add option to serialization config to reduce random reads from get_record_offset when loading with mmap=True (#143880 ) ## Background This PR adds `torch.utils.serialization.config.load.calculate_storage_offsets`. This option relies on the previous PR in this stack, where storage order was changed to non-lexicographical. A `.format_version` entry was added to the zipfile, and `calculate_storage_offsets` will only work on checkpoints with `.format_version`. When this is turned on, for `torch.load(mmap=True)`, the offset of each storage record (other than the 0th storage) will be calculated instead of relying on `miniz` APIs to determine it. The existing APIs issue multiple random reads (reading the end of central directory record, then reading the zipfile header for the record) to determine the storage offset where the record starts. This can greatly degrade `torch.load(mmap=True)` performance for non-filesystem cases. `6aaae9d78f/caffe2/serialize/inline_container.cc (L589-L605)` ## Testing strategy The agreed-upon testing strategy was as follows: - Add debug code gated by an environment flag `TORCH_SERIALIZATION_DEBUG` that will run this offset calculation logic and verify it against getRecordOffset for each storage (when mmap=False) - This flag is set throughout CI, which means that every time `torch.load` is called, the offset calculation logic is implicitly being tested. Differential Revision: [D67673026](https://our.internmc.facebook.com/intern/diff/D67673026) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143880 Approved by: https://github.com/albanD ghstack dependencies: #143879	2025-01-27 23:57:30 +00:00
PyTorch MergeBot	c986eba560	Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 )" This reverts commit `abf28982a8`. Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/ZainRizvi due to Sorry but this is failing internally. @Chillee can you please help change get remerged? See D68720562 ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2616726406))	2025-01-27 19:38:26 +00:00
Yanbo Liang	ec91b7720f	[Custom Ops] Add a new API to allow users to register an autocast for the custom op (#145588 ) Fixes #137033 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145588 Approved by: https://github.com/zou3519	2025-01-27 19:22:43 +00:00
Eddie Yan	abf28982a8	[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 ) Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs... Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441 Approved by: https://github.com/Chillee	2025-01-27 18:05:23 +00:00
Joel Schlosser	b2a0feac85	Update OSS nested tensor docs to focus on NJT (#145402 ) Updated nested tensor docs to be NJT-centric (instead of NST-centric). They now include: * High-level description of NST vs. NJT + a recommendation to use NJT * General NJT construction / usage * torch.compile() integration w/ dynamic shapes * Common errors and how to fix them * Contribution guide * Data layout / shape information (with diagram) * Links to more extensive tutorials involving Transformers / SDPA / FlexAttention Pull Request resolved: https://github.com/pytorch/pytorch/pull/145402 Approved by: https://github.com/soulitzer	2025-01-25 04:08:19 +00:00
jainapurva	547c18ee9f	Add Torchao docs link to Pytorch libraries (#145412 ) Add Torchao docs link to the libraries section in torch docs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145412 Approved by: https://github.com/svekars	2025-01-24 17:11:20 +00:00
PyTorch MergeBot	dad9bc3461	Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 )" This reverts commit `de945d78da`. Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/izaitsevfb due to unused variables again :( ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2611182461))	2025-01-23 22:59:25 +00:00
PyTorch MergeBot	d7b6746470	Revert "Fix deprecated pytorch_sphinx_theme editable installation (#145347 )" This reverts commit `c27dd9cf72`. Reverted https://github.com/pytorch/pytorch/pull/145347 on behalf of https://github.com/huydhn due to Remove -e breaks the theme somehow ([comment](https://github.com/pytorch/pytorch/pull/145347#issuecomment-2610911258))	2025-01-23 20:06:07 +00:00
Nikhil Gupta	41b38f755c	Revert "Reverting the PR adding Kleidiai-based int4 kernels (#145392 )" (#145505 ) https://github.com/pytorch/pytorch/pull/134124 was reverted by https://github.com/pytorch/pytorch/pull/145392 due to a KleidiAI clone issue. 1. This reverts commit `0940eb6d44` (https://github.com/pytorch/pytorch/pull/145392) and fixes the KleidiAI mirror issue. 2. KleidiAI is now cloned from the GitHub mirror instead of the Arm GitLab. Change-Id: I7d6eee7214cd117d3057d615936fcc3ee6052fa2 Fixes https://github.com/pytorch/pytorch/issues/145273 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145505 Approved by: https://github.com/malfet	2025-01-23 18:50:59 +00:00
Zheng, Zhaoqiong	fef92c9447	Fix IdentationError of code example (#145251 ) I found an IndentationError when copy-pasting the example of inference with torch.compile. This PR fixes the formatting. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145251 Approved by: https://github.com/mikaylagawarecki Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-01-23 18:17:11 +00:00
Eddie Yan	de945d78da	[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 ) Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs... Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441 Approved by: https://github.com/Chillee	2025-01-22 22:42:48 +00:00
albanD	0940eb6d44	Reverting the PR adding Kleidiai-based int4 kernels (#145392 ) Mitigation for https://github.com/pytorch/pytorch/issues/145273 Reverting https://github.com/pytorch/pytorch/pull/134124 and https://github.com/pytorch/pytorch/pull/144074 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145392 Approved by: https://github.com/ZainRizvi, https://github.com/malfet, https://github.com/atalman, https://github.com/digantdesai	2025-01-22 20:11:49 +00:00
Huy Do	c27dd9cf72	Fix deprecated pytorch_sphinx_theme editable installation (#145347 ) Fixes https://github.com/pytorch/pytorch/issues/145221 Pip editable install is going to be deprecated soon https://github.com/pypa/pip/issues/11457. The fix here is just to remove it and install `pytorch_sphinx_theme` normally. ### Testing Doc build is working with the change: * PR https://github.com/pytorch/pytorch/actions/runs/12901499736/job/35975042345?pr=145347 * Nightly https://github.com/pytorch/pytorch/actions/runs/12901500521/job/35975046289 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145347 Approved by: https://github.com/ZainRizvi	2025-01-22 17:28:16 +00:00
ZhaoqiongZ	465a1cfe2e	update get start xpu (#143183 ) - Support new Intel client GPU on Windows [Intel® Arc™ B-Series graphics](https://www.intel.com/content/www/us/en/products/docs/discrete-gpus/arc/desktop/b-series/overview.html) and [Intel® Core™ Ultra Series 2 with Intel® Arc™ Graphics](https://www.intel.com/content/www/us/en/products/details/processors/core-ultra.html) - Support vision/audio prebuilt wheels on Windows Pull Request resolved: https://github.com/pytorch/pytorch/pull/143183 Approved by: https://github.com/EikanWang, https://github.com/leslie-fang-intel, https://github.com/atalman, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-01-17 06:31:40 +00:00
PyTorch MergeBot	4ea189422d	Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 )" This reverts commit `a6763b7b81`. Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/kit1980 due to breaking internal builds: unused variable 'halpha' ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2596895865))	2025-01-16 21:12:41 +00:00
PyTorch MergeBot	6559374494	Revert "Add flop formula for _scaled_mm (#144872 )" This reverts commit `f31452268b`. Reverted https://github.com/pytorch/pytorch/pull/144872 on behalf of https://github.com/lw due to Breaks ROCm jobs on main ([comment](https://github.com/pytorch/pytorch/pull/144872#issuecomment-2595994134))	2025-01-16 15:16:18 +00:00
Luca Wehrstedt	f31452268b	Add flop formula for _scaled_mm (#144872 ) This will make it work correctly with the partitioner's AutoAC Pull Request resolved: https://github.com/pytorch/pytorch/pull/144872 Approved by: https://github.com/vkuzo	2025-01-16 13:57:54 +00:00
eqy	a6763b7b81	[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 ) Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs... Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441 Approved by: https://github.com/Chillee	2025-01-15 18:37:55 +00:00
Boyuan Feng	7e80758efc	[CUDAGraph][Docs] add `cuda` to `torch.randn` (#144793 ) Previous doc example created `torch.randn` tensor on cpu so CUDAGraph was skipped. Fixes #144386 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144793 Approved by: https://github.com/eellison	2025-01-15 18:02:10 +00:00
PyTorch MergeBot	64bcf39180	Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 )" This reverts commit `388b75edec`. Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/kit1980 due to breaking internal builds: unused variable 'halpha' ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2588517060))	2025-01-14 00:48:28 +00:00
eqy	388b75edec	[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 ) Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs... Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441 Approved by: https://github.com/Chillee	2025-01-11 15:30:38 +00:00
Nikita Shulga	92ddb3d3d3	[MPS] Expose `MPSProfiler::start/stopCapture` to Python (#144561 ) I.e. when the `MTL_CAPTURE_ENABLED` environment variable is set to 1, one should be able to wrap the code with `torch.mps.profiler.metal_capture` to generate a gputrace for shaders invoked inside the context manager. For example, the code below:
```python
import torch
import os


def foo(x):
    return x[:, ::2].sin() + x[:, 1::2].cos()


if __name__ == "__main__":
    os.environ["MTL_CAPTURE_ENABLED"] = "1"
    x = torch.rand(32, 1024, device="mps")
    with torch.mps.profiler.metal_capture("compiled_shader"):
        torch.compile(foo)(x)
```
should capture the execution of a `torch.compile` generated shader <img width="734" alt="image" src="https://github.com/user-attachments/assets/718ff64e-103b-4b11-b66c-c89cfc770b5d" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/144561 Approved by: https://github.com/manuelcandales ghstack dependencies: #144559, #144560	2025-01-11 02:05:36 +00:00
Alexander Kurakin	18c1dcb8f3	docs: get rid of copyright year (#144562 ) Fixes https://github.com/pytorch/pytorch/pull/144153#pullrequestreview-2540418083 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144562 Approved by: https://github.com/albanD	2025-01-10 19:57:25 +00:00
titaiwangms	a742859fc2	[ONNX] Update images and APIs to onnx_dynamo.rst (#144358 ) Update the result image of exporting, and delete the functions/class that belongs to `torch.onnx.dynamo_export` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144358 Approved by: https://github.com/justinchuby, https://github.com/malfet	2025-01-08 21:44:43 +00:00
PyTorch MergeBot	99f2491af9	Revert "Use absolute path `path.resolve()` -> `path.absolute()` (#129409 )" This reverts commit `45411d1fc9`. Reverted https://github.com/pytorch/pytorch/pull/129409 on behalf of https://github.com/jeanschmidt due to Breaking internal CI, @albanD please help get this PR merged ([comment](https://github.com/pytorch/pytorch/pull/129409#issuecomment-2571316444))	2025-01-04 14:17:20 +00:00
Xiaodong Wang	0a94bb432e	[ROCm] CK Flash Attention Backend (#143695 ) Replace https://github.com/pytorch/pytorch/pull/138947 for re-import. Replaces https://github.com/ROCm/pytorch/pull/1592 This PR contains the initial implementation of SDPA with composable_kernel backend. The CK path can be forced by simply calling torch.backends.cuda.preferred_rocm_fa_library("ck"). Similarly, you can force the incumbent aotriton implementation by passing in "aotriton" or "default". As you'd expect, not setting this option will result in aotriton to be used as the backend. In the case of CK, if pytorch deems flash attention usable, then it will use the CK path in all the same places aotriton would have been used. This PR makes no changes to the heuristics which select which attention scheme to use (i.e. flash attention vs memory efficient attention vs math etc etc). It only gets called when flash attention is both enabled (via USE_FLASH_ATTENTION) and is selected at runtime by the existing heuristics. Files located in pytorch/aten/src/ATen/native/transformers/hip/flash_attn/ck/mha* have been pulled from https://github.com/Dao-AILab/flash-attention courtesy of @tridao's hard work who is the co-author NOTE: In order to use this backend, the user MUST set USE_CK_FLASH_ATTENTION=1 in their environment when they build PyTorch. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143695 Approved by: https://github.com/malfet Co-authored-by: Andy Lugo <Andy.LugoReyes@amd.com> Co-authored-by: Jithun Nair <jithun.nair@amd.com>	2025-01-03 22:01:36 +00:00
Jay Zhang	b75f32b848	Update TorchDynamo-based ONNX Exporter memory usage example code. (#144139 ) Address related comments earlier. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144139 Approved by: https://github.com/justinchuby Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>	2025-01-03 20:41:36 +00:00
Wanchao Liang	eb7a303d21	[dtensor] expose the __create_chunk_list__ in the doc (#144100 ) As titled, this PR exposes this dunder method as a public API in the docs, so that different checkpoint implementations can leverage this protocol instead of each exposing a separate API. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144100 Approved by: https://github.com/awgu ghstack dependencies: #144099	2025-01-03 20:06:23 +00:00
Xuehai Pan	45411d1fc9	Use absolute path `path.resolve()` -> `path.absolute()` (#129409 ) Changes: 1. Always explicit `.absolute()`: `Path(__file__)` -> `Path(__file__).absolute()` 2. Replace `path.resolve()` with `path.absolute()` if the code is resolving the PyTorch repo root directory. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129409 Approved by: https://github.com/albanD	2025-01-03 20:03:40 +00:00
Wanchao Liang	48a05ee773	[dtensor] improve doc of the DTensor class (#144099 ) as titled: explicitly list all public members to make sure the public API stays consistent, also use groupwise as the member order to make doc look better Pull Request resolved: https://github.com/pytorch/pytorch/pull/144099 Approved by: https://github.com/awgu	2025-01-03 05:35:44 +00:00
Yu, Guangye	3848de55ed	Add get_stream_from_external API for CUDA backend (#143799 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143799 Approved by: https://github.com/albanD, https://github.com/EikanWang ghstack dependencies: #142347, #141119, #141123	2024-12-31 11:15:59 +00:00
Yu, Guangye	8f6c4d1732	Add get_stream_from_external API for XPU backend (#141123 ) # Motivation This PR aims to introduce `torch.xpu.ExternalStream` to be used to wrap SYCL queue created in other libraries to PyTorch. # Additional Context Pull Request resolved: https://github.com/pytorch/pytorch/pull/141123 Approved by: https://github.com/albanD, https://github.com/EikanWang ghstack dependencies: #142347, #141119	2024-12-31 11:15:52 +00:00
Xuehai Pan	b6bdb67f82	[BE][Easy] use `pathlib.Path` instead of `dirname` / `".."` / `pardir` (#129374 ) Changes by apply order: 1. Replace all `".."` and `os.pardir` usage with `os.path.dirname(...)`. 2. Replace nested `os.path.dirname(os.path.dirname(...))` call with `str(Path(...).parent.parent)`. 3. Reorder `.absolute()` ~/ `.resolve()`~ and `.parent`: always resolve the path first. `.parent{...}.absolute()` -> `.absolute().parent{...}` 4. Replace chained `.parent x N` with `.parents[${N - 1}]`: the code is easier to read (see 5.) `.parent.parent.parent.parent` -> `.parents[3]` 5. ~Replace `.parents[${N - 1}]` with `.parents[${N} - 1]`: the code is easier to read and does not introduce any runtime overhead.~ ~`.parents[3]` -> `.parents[4 - 1]`~ 6. ~Replace `.parents[2 - 1]` with `.parent.parent`: because the code is shorter and easier to read.~ Pull Request resolved: https://github.com/pytorch/pytorch/pull/129374 Approved by: https://github.com/justinchuby, https://github.com/malfet	2024-12-29 17:23:13 +00:00
Yanan Cao (PyTorch)	ba5cacbc17	[Codemod][AddExplicitStrictExportArg] caffe2/test (#143688 ) Reviewed By: avikchaudhuri Differential Revision: D67530154 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143688 Approved by: https://github.com/tugsbayasgalan	2024-12-27 07:58:44 +00:00
PyTorch MergeBot	475656fd9c	Revert "[BE][Easy] use `pathlib.Path` instead of `dirname` / `".."` / `pardir` (#129374 )" This reverts commit `2293fe1024`. Reverted https://github.com/pytorch/pytorch/pull/129374 on behalf of https://github.com/malfet due to failing internal ROCM builds with error: ModuleNotFoundError: No module named hipify ([comment](https://github.com/pytorch/pytorch/pull/129374#issuecomment-2562973920))	2024-12-26 17:32:23 +00:00
PyTorch MergeBot	cc4e70b7c3	Revert "Use absolute path `path.resolve()` -> `path.absolute()` (#129409 )" This reverts commit `135c7db99d`. Reverted https://github.com/pytorch/pytorch/pull/129409 on behalf of https://github.com/malfet due to need to revert to as dependency of https://github.com/pytorch/pytorch/pull/129374 ([comment](https://github.com/pytorch/pytorch/pull/129409#issuecomment-2562969825))	2024-12-26 17:26:06 +00:00
Xuehai Pan	135c7db99d	Use absolute path `path.resolve()` -> `path.absolute()` (#129409 ) Changes: 1. Always explicit `.absolute()`: `Path(__file__)` -> `Path(__file__).absolute()` 2. Replace `path.resolve()` with `path.absolute()` if the code is resolving the PyTorch repo root directory. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129409 Approved by: https://github.com/albanD	2024-12-24 08:33:08 +00:00
Jerry Zhang	ace645a017	Add support for prototype affine quantization in pt2e flow (#141421 ) Summary: duplicated affine quantization functionality including observer (https://github.com/pytorch/ao/blob/main/torchao/quantization/observer.py) and some quant_primitive ops (`7c3c51fd0d/torchao/quantization/quant_primitives.py (L26-L30)`) to allow for per group quantization min max observer in pt2e flow Next: We can follow up to add moving average min max observer Test Plan: python test/test_quantization.py -k test_channel_group_quantization Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/141421 Approved by: https://github.com/cccclai	2024-12-24 04:22:18 +00:00
Oguz Ulgen	dc55704b48	Rename cache limit to recompile limit in configs (#143709 ) This PR renames every cache_limit to recompile_limit via sed. Old config options are maintained via Config(alias='xyz') Pull Request resolved: https://github.com/pytorch/pytorch/pull/143709 Approved by: https://github.com/jansel	2024-12-22 10:03:57 +00:00
Xuehai Pan	2293fe1024	[BE][Easy] use `pathlib.Path` instead of `dirname` / `".."` / `pardir` (#129374 ) Changes by apply order: 1. Replace all `".."` and `os.pardir` usage with `os.path.dirname(...)`. 2. Replace nested `os.path.dirname(os.path.dirname(...))` call with `str(Path(...).parent.parent)`. 3. Reorder `.absolute()` ~/ `.resolve()`~ and `.parent`: always resolve the path first. `.parent{...}.absolute()` -> `.absolute().parent{...}` 4. Replace chained `.parent x N` with `.parents[${N - 1}]`: the code is easier to read (see 5.) `.parent.parent.parent.parent` -> `.parents[3]` 5. ~Replace `.parents[${N - 1}]` with `.parents[${N} - 1]`: the code is easier to read and does not introduce any runtime overhead.~ ~`.parents[3]` -> `.parents[4 - 1]`~ 6. ~Replace `.parents[2 - 1]` with `.parent.parent`: because the code is shorter and easier to read.~ Pull Request resolved: https://github.com/pytorch/pytorch/pull/129374 Approved by: https://github.com/justinchuby, https://github.com/malfet	2024-12-21 22:08:01 +00:00
PyTorch MergeBot	c7d7eff798	Revert "[MTIA] (3/n) Implement PyTorch APIs to query/reset device peak memory usage (#143347 )" This reverts commit `efe21ee59d`. Reverted https://github.com/pytorch/pytorch/pull/143347 on behalf of https://github.com/huydhn due to D67118173 has been backed out internally ([comment](https://github.com/pytorch/pytorch/pull/143347#issuecomment-2557983266))	2024-12-21 04:04:16 +00:00
PyTorch MergeBot	dabc9566c4	Revert "(MTIA) Move "empty_cache" API (#143402 )" This reverts commit `c7d9f29807`. Reverted https://github.com/pytorch/pytorch/pull/143402 on behalf of https://github.com/huydhn due to The internal diff D67148738 has been reverted ([comment](https://github.com/pytorch/pytorch/pull/143402#issuecomment-2557982597))	2024-12-21 04:01:23 +00:00
Mikayla Gawarecki	8e483654cb	Add config.save.use_pinned_memory_for_d2h to serialization config (#143342 ) This was benchmarked with two separate scripts on my A100: (A) Save state_dict of llama3-style model on CUDA to disk with ``torch.save`` (B) Save `ModuleList` of 10 `nn.Linear(10,000, 10,000)` on CUDA to disk with `torch.save`. Timings are an average of 5 runs, and benchmark scripts + results are attached. Under both scenarios, we see ~2x speedup in ``torch.save`` time with (``compute_crc32=False`` and ``use_pinned_memory_for_d2h=True``) compared to the baseline of the current defaults (``compute_crc32=True`` and ``use_pinned_memory_for_d2h=False``).
(A) Save state_dict of llama3-style model on CUDA to disk with ``torch.save`` [[script](https://gist.github.com/mikaylagawarecki/d3a86ea1bb08045d1a839976808d7432)][[results](https://gist.github.com/mikaylagawarecki/f61a4714e5cff703146a1fcb7e0c755c)]
| | use_pinned_memory_for_d2h=False (Default) | use_pinned_memory_for_d2h=True |
|-|-|-|
| `compute_crc32=True` (Default) | 28.54s | 20.76s |
| `compute_crc32=False` | 22.57s | 14.51s |
(B) Save `ModuleList` of 10 `nn.Linear(10,000, 10,000)` on CUDA to disk with `torch.save` [[script](https://gist.github.com/mikaylagawarecki/ecbc505436bdd4b5190ef1b3430c12b6)][[results](https://gist.github.com/mikaylagawarecki/4e686bcf030b57de8c3ca74d8f5a88f7)]
| | use_pinned_memory_for_d2h=False (Default) | use_pinned_memory_for_d2h=True |
|-|-|-|
| `compute_crc32=True` (Default) | 8.38s | 5.53s |
| `compute_crc32=False` | 6.94s | 3.99s |
Trace of (A) with `use_pinned_memory_for_d2h=True`, `compute_crc32=False` <img width="1745" alt="Screenshot 2024-12-16 at 7 32 33 PM" src="https://github.com/user-attachments/assets/80b87a8c-5a70-4eb9-ad66-7abc4aa7cc25" />
Baseline trace of (A) with `use_pinned_memory_for_d2h=False`, `compute_crc32=True` <img width="1799" alt="Screenshot 2024-12-16 at 7 38 20 PM" src="https://github.com/user-attachments/assets/13fa12d1-8f5f-424c-adc4-275b67012927" />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143342 Approved by: https://github.com/albanD ghstack dependencies: #143324	2024-12-20 21:01:18 +00:00
Mikayla Gawarecki	3f63b742e6	Refactor serialization getter/setters into torch.utils.serialization.config (#143324 ) Consolidate - get/set_default_load_endianness - get/set_default_mmap_options - get/set_crc32_options into one global dynamo-style config + allow global setting of mmap. The existing APIs are not removed and will get/set from the config (as they can't be removed for BC) In #143459 I add the local (argument style) config Pull Request resolved: https://github.com/pytorch/pytorch/pull/143324 Approved by: https://github.com/albanD	2024-12-20 21:01:17 +00:00
Nikhil Gupta	94737e8a2a	[ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124 ) Description: 1. Quantize Linear Layer Weights to 4-bits: Quantize the weights of the Linear layer to 4 bits, using symmetric quantization. Pack two 4-bit weights into one uint8 container. Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32. 2. Prepare Quantized Weights, Scales, and Optional Bias: After quantizing, obtain the quantized_weights, scales, and groupsize. If the original Linear layer has a bias, prepare it as well. 3. Pack the Weights Efficiently: Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias. ```python packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features) ``` Input parameters should include: in_features and out_features (the same as the Linear layer’s corresponding parameters). 4. Perform Dynamic Quantized Matrix Multiplication: Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights. ```python output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights, groupsize, in_features, out_features) ``` Inputs required include: The input tensor, packed_weights , groupsize, and the in_features and out_features. 
API Usage: https://github.com/pytorch/pytorch/issues/143289 Model Perf : 7B Transformer model: Prefill : 340 t/s Decode : 40 t/s 2B Transformer model Prefill : 747 t/s Decode : 80 t/s Tests: python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight Ran 1 test in 0.016s OK python test/test_linalg.py -k test__dyn_quant_matmul_4bit Ran 8 tests in 0.077s OK python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit Ran 8 tests in 11.454s Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452 Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124 Approved by: https://github.com/digantdesai, https://github.com/malfet	2024-12-20 19:32:03 +00:00
Hyunho Yeo	c7d9f29807	(MTIA) Move "empty_cache" API (#143402 ) Summary: This diff moves one of memory-related APIs to the consolidated location, which is `mtia/memory.py`. Test Plan: ``` buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api ``` https://www.internalfb.com/intern/testinfra/testrun/13510798943184259 Reviewed By: nautsimon Differential Revision: D67148738 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143402 Approved by: https://github.com/nautsimon	2024-12-20 17:39:06 +00:00
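
The `pathlib` rewrites described in commits `2293fe1024` / `b6bdb67f82` (#129374) and `135c7db99d` / `45411d1fc9` (#129409) can be illustrated with a small standard-library sketch. It is not code from those PRs: the file path and the helper names (`root_via_dirname`, `root_via_parent_chain`, `root_via_parents_index`) are hypothetical and exist only to show the equivalences the commit messages list.

```python
import posixpath
from pathlib import Path, PurePosixPath

# Hypothetical file location, for illustration only (not an actual PyTorch path).
EXAMPLE_FILE = "/repo/tools/a/b/script.py"


def root_via_dirname(file: str) -> str:
    """Old style: one nested dirname() call per directory level."""
    return posixpath.dirname(
        posixpath.dirname(posixpath.dirname(posixpath.dirname(file)))
    )


def root_via_parent_chain(file: str) -> PurePosixPath:
    """Intermediate style: chained `.parent` attributes."""
    return PurePosixPath(file).parent.parent.parent.parent


def root_via_parents_index(file: str) -> PurePosixPath:
    """Final style: `.parent` x N is written as `.parents[N - 1]`."""
    return PurePosixPath(file).parents[3]


# absolute() only anchors the path at the current directory, while resolve()
# additionally collapses ".." segments and follows symlinks.
relative = Path("x") / ".." / "y"
absolute_keeps_dotdot = ".." in relative.absolute().parts
resolve_drops_dotdot = ".." not in relative.resolve().parts

if __name__ == "__main__":
    print(root_via_dirname(EXAMPLE_FILE))
    print(root_via_parent_chain(EXAMPLE_FILE))
    print(root_via_parents_index(EXAMPLE_FILE))
    print(absolute_keeps_dotdot, resolve_drops_dotdot)
```

The last two checks show why the order in step 3 of #129374 matters: `.absolute()` leaves any `..` segments in place, so taking `.parent` before making the path absolute can give a different directory than taking it afterwards.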


3063 Commits
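
Commit `94737e8a2a` (#134124) describes quantizing Linear weights to 4 bits with symmetric quantization and packing two 4-bit values into one uint8 container. The sketch below shows that general idea in plain Python. It is an assumption-laden illustration, not the layout used by `torch.ops.aten._dyn_quant_pack_4bit_weight` or KleidiAI: the function names are hypothetical, a single scale is shared by the whole list (no channel-wise or group-wise scales), and the low-nibble-first byte order is an arbitrary choice.

```python
def quantize_symmetric_4bit(weights):
    """Map floats to signed 4-bit integers in [-8, 7] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    # Symmetric scheme: zero maps to zero and the largest magnitude maps to 7.
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    quantized = [max(-8, min(7, round(w / scale))) for w in weights]
    return quantized, scale


def pack_int4_pairs(quantized):
    """Pack two signed 4-bit values into each byte, low nibble first."""
    if len(quantized) % 2:
        raise ValueError("need an even number of values to pack in pairs")
    packed = bytearray()
    for low, high in zip(quantized[0::2], quantized[1::2]):
        # Masking with 0xF keeps the two's-complement 4-bit pattern.
        packed.append(((high & 0xF) << 4) | (low & 0xF))
    return bytes(packed)


def unpack_int4_pairs(packed):
    """Inverse of pack_int4_pairs: recover the signed 4-bit values."""
    values = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            # Nibbles 8..15 encode the negative values -8..-1.
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values


if __name__ == "__main__":
    q, scale = quantize_symmetric_4bit([7.0, -7.0, 0.0, 1.0])
    packed = pack_int4_pairs(q)
    print(q, scale, packed.hex(), unpack_int4_pairs(packed))
```

Packing halves the storage for the weights, and the round trip through `unpack_int4_pairs` is lossless for the quantized integers; the only precision loss comes from the quantization step itself.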