pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 00:21:07 +01:00

Author	SHA1	Message	Date
Ke Wen	18a7a04c4a	[c10d] Add NCCL memory allocator (#145675 ) This PR implements a small UI improvement over #133603. It prepares a NCCL memory allocator in torch cpp and then pybind's it out, so that user can directly use it. UI: ``` pool = torch.cuda.MemPool(backend.mem_allocator) with torch.cuda.use_mem_pool(pool): tensor = torch.arange(1024 * 1024 * 2, device=device) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145675 Approved by: https://github.com/syed-ahmed, https://github.com/wconstab	2025-01-29 23:20:22 +00:00
PyTorch MergeBot	6371c25b91	Revert "[c10d] Add NCCL memory allocator (#145675 )" This reverts commit `9fd6722fc9`. Reverted https://github.com/pytorch/pytorch/pull/145675 on behalf of https://github.com/ZainRizvi due to This fails to build internally, can you please take a look at D68831004 for more details? ([comment](https://github.com/pytorch/pytorch/pull/145675#issuecomment-2622515425))	2025-01-29 18:30:30 +00:00
Ke Wen	9fd6722fc9	[c10d] Add NCCL memory allocator (#145675 ) This PR implements a small UI improvement over #133603. It prepares a NCCL memory allocator in torch cpp and then pybind's it out, so that user can directly use it. UI: ``` pool = torch.cuda.MemPool(backend.mem_allocator) with torch.cuda.use_mem_pool(pool): tensor = torch.arange(1024 * 1024 * 2, device=device) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145675 Approved by: https://github.com/syed-ahmed, https://github.com/wconstab	2025-01-29 02:48:56 +00:00
cyyever	23eb0a3201	Improve typing in torch/types.py (#145237 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/145237 Approved by: https://github.com/XuehaiPan, https://github.com/albanD Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn>	2025-01-28 05:29:12 +00:00
PyTorch MergeBot	c986eba560	Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 )" This reverts commit `abf28982a8`. Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/ZainRizvi due to Sorry but this is failing internally. @Chillee can you please help change get remerged? See D68720562 ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2616726406))	2025-01-27 19:38:26 +00:00
Randolf Scholz	835e770bad	Use `typing.IO[bytes]` instead of `io.BytesIO` in annotations (#144994 ) Fixes #144976 Using approach ① `IO[bytes]`, but could also try with a protocol. ## Notes: - moved `torch.serialization.FILE_LIKE` to `torch.types.FileLike` - Use `FileLike` annotation where it makes sense - made sure those functions also support `os.PathLike` - Replaced `isinstance(x, io.BytesIO)` with `isinstance(x, (io.IOBase, IO))` where appropriate. - Replaced `BinaryIO` with `IO[bytes]` (the two ABCs are almost identical, the only difference is that `BinaryIO` allows `bytearray` input to `write`, whereas `IO[bytes]` only `bytes`) - needed to make `torch.serialization._opener` generic to avoid LSP violations. - skipped `torch/onnx/verification` for now (functions use `BytesIO.getvalue` which is not part of the `IO[bytes]` ABC, but it kind of seems that this is redundant, as e.g. `onnx.load` supports `str \| PathLike[str] \| IO[bytes]` directly...) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144994 Approved by: https://github.com/ezyang, https://github.com/Skylion007	2025-01-27 18:08:07 +00:00
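The annotation pattern described in #144994 can be sketched with the standard library alone. The alias `FileLike` below is only modeled on the description in the commit message and is not copied from `torch.types`; `read_header` is a hypothetical helper used for illustration.

```python
import io
import os
import tempfile
from typing import IO, Union

# Hypothetical alias modeled on the commit description; not the torch.types definition.
FileLike = Union[str, os.PathLike, IO[bytes]]


def read_header(f: FileLike, n: int = 4) -> bytes:
    """Return the first n bytes from a path or from an already-open binary stream."""
    if isinstance(f, (str, os.PathLike)):
        # Path-like input: open the file ourselves in binary mode.
        with open(f, "rb") as fh:
            return fh.read(n)
    # Anything else is treated as a binary stream exposing read().
    return f.read(n)


# An in-memory stream satisfies IO[bytes].
header_from_stream = read_header(io.BytesIO(b"PK\x03\x04rest"))

# A filesystem path is accepted by the same function.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "blob.bin")
    with open(path, "wb") as fh:
        fh.write(b"PK\x03\x04rest")
    header_from_path = read_header(path)
```

Annotating with `IO[bytes]` rather than `io.BytesIO` lets a type checker accept real file handles and other binary streams, not just in-memory buffers.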
Eddie Yan	abf28982a8	[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 ) Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs... Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441 Approved by: https://github.com/Chillee	2025-01-27 18:05:23 +00:00
Shuqiang Zhang	c0861d092c	[PGNCCL] Add an API to get the status/error code at the PG level (#144498 ) Summary: This PR is basically a replacement of https://github.com/pytorch/pytorch/pull/140087, which caused some perf drop due to frequent TCPStore checks in the watchdog thread. The fix is to move the TCPStore check to the monitoring thread. If the PG is unhealthy, the user should be able to get the type of error, e.g., timeout, NCCL error or remote error. This API is applied at the PG level, compared to the work.get_future_result() API, which is applied at the Work level. Error detection at the PG level is much more convenient for users who want to handle the PG failure as a whole, e.g., by restarting the PG. Error handling at the Work level is still useful for users to attach work-specific context and debug the RC of the specific failing work/collective. Note it is critical for all ranks in the PG to be notified about an error as soon as it occurs, so we introduce an errorType of REMOTE_ERROR, which is 'broadcasted' from a src rank (which detects a local error) to all other ranks in the PG; the broadcast is currently done through TCPStore. Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/144498 Approved by: https://github.com/kwen2501	2025-01-24 16:47:32 +00:00
PyTorch MergeBot	dad9bc3461	Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 )" This reverts commit `de945d78da`. Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/izaitsevfb due to unused variables again :( ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2611182461))	2025-01-23 22:59:25 +00:00
Nikhil Gupta	41b38f755c	Revert "Reverting the PR adding Kleidiai-based int4 kernels (#145392 )" (#145505 ) https://github.com/pytorch/pytorch/pull/134124 was reverted by https://github.com/pytorch/pytorch/pull/145392 due to KleidiAI clone issue. 1. This reverts commit `0940eb6d44` (https://github.com/pytorch/pytorch/pull/145392 )and Fixes KleidiAI mirror issue. 2. KleidiAI is now cloned from github mirror instead of arm gitlab Change-Id: I7d6eee7214cd117d3057d615936fcc3ee6052fa2 Fixes https://github.com/pytorch/pytorch/issues/145273 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145505 Approved by: https://github.com/malfet	2025-01-23 18:50:59 +00:00
Animesh Jain	015c6d6fdb	[dynamo][guards] Turn on profiling of guard manager (#145420 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145420 Approved by: https://github.com/ezyang ghstack dependencies: #145351	2025-01-23 18:17:43 +00:00
Eddie Yan	de945d78da	[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 ) Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs... Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441 Approved by: https://github.com/Chillee	2025-01-22 22:42:48 +00:00
albanD	0940eb6d44	Reverting the PR adding Kleidiai-based int4 kernels (#145392 ) Mitigation for https://github.com/pytorch/pytorch/issues/145273 Reverting https://github.com/pytorch/pytorch/pull/134124 and https://github.com/pytorch/pytorch/pull/144074 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145392 Approved by: https://github.com/ZainRizvi, https://github.com/malfet, https://github.com/atalman, https://github.com/digantdesai	2025-01-22 20:11:49 +00:00
Hyunho Yeo	057d9aff39	[S481486] [MTIA] Correct mtia.device_count() API (#145338 ) Summary: Prev: Count the number of "general" accelerators Curr: Count the number of MTIA devices by using the MTIA runtime API Test Plan: ``` buck test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- -r test_get_device_count ``` https://www.internalfb.com/intern/testinfra/testrun/8162774572631995 Reviewed By: BoyueZheng Differential Revision: D68472668 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145338 Approved by: https://github.com/BoyueZheng, https://github.com/egienvalue	2025-01-22 17:45:15 +00:00
Isuru Fernando	0efa843392	Dynamic shape guards in C++ (#139899 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139899 Approved by: https://github.com/anijain2305, https://github.com/albanD, https://github.com/jansel ghstack dependencies: #143385, #143164	2025-01-22 14:58:35 +00:00
Aaron Orenstein	f2cfe8b59f	PEP585 update - mostly toplevels (#145178 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145178 Approved by: https://github.com/bobrenjc93	2025-01-22 02:21:14 +00:00
Aaron Orenstein	5b5766665d	PEP585 update - torch/_C torch/_decomp torch/_lazy torch/_library torch/_numpy torch/_prims torch/_refs torch/_strobelight (#145102 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145102 Approved by: https://github.com/bobrenjc93 ghstack dependencies: #145105	2025-01-18 20:47:12 +00:00
Yu, Guangye	176cde6240	Use torch with statement in torch distributed module (#144951 ) # Motivation In https://github.com/pytorch/pytorch/pull/137678, we help use the device-agnostic APIs to generalize distributed module. As this [comment](https://github.com/pytorch/pytorch/pull/137678#discussion_r1828645683) said, we will use the with statement of `torch.Stream` once https://github.com/pytorch/pytorch/pull/140138 is landed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144951 Approved by: https://github.com/kwen2501, https://github.com/albanD	2025-01-17 01:49:28 +00:00
PyTorch MergeBot	6c713ccb5e	Revert "Make functionalization `ViewMeta` serializable with pickle. (#143712 )" This reverts commit `b8abdaa286`. Reverted https://github.com/pytorch/pytorch/pull/143712 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/143712#issuecomment-2597205261))	2025-01-17 00:52:50 +00:00
PyTorch MergeBot	4ea189422d	Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 )" This reverts commit `a6763b7b81`. Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/kit1980 due to breaking internal builds: unused variable 'halpha' ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2596895865))	2025-01-16 21:12:41 +00:00
Yukio Siraichi	b8abdaa286	Make functionalization `ViewMeta` serializable with pickle. (#143712 ) Fix: #141974 This PR makes `ViewMeta` sequence, present in functional tensors, serializable with pickle. In order to accomplish that, it makes `ViewMeta` an abstract class with overridable `forward` and `reverse` functions. In this context, each operation that once instantiated `ViewMeta` should now create a new specialized class that inherits from `ViewMeta`. Therefore, this PR also uses codegen for creating these specializations. In summary, these are the changes this PR introduces: - `ViewMeta` is turned into an abstract class (see _FunctionalStorageImpl.cpp_). `forward` and `reverse` are pure virtual functions that need to be implemented. `to_out_index` should be implemented by operations that might return more than 1 output. - New `ViewMeta` specializations for `resize_` and `_unsafe_view` are created (see _FunctionalizeFallbackKernel.h_). - New templates _ViewMetaClasses.{cpp,h}_ are created. They hold the declaration and definition of the `ViewMeta` specializations, which are automatically generated in the ATen codegen (see _gen.py_). - New `_functionalization` Python sub-module is created (see _Module.cpp_). It serves as namespace for the `ViewMeta` specializations and `InverseReturnMode` enum. - New template _ViewMetaClassesPythonBinding.cpp_ is created. It holds the automatically generated Python bindings for the `ViewMeta` specialization, which are generated in the torch codegen (see _generate_code.py_). Note that this PR makes use of codegen at 2 different moments: - ATen codegen (_gen.py_): generates the `ViewMeta` specialized classes. - Torch codegen (_generate_code.py_): generates the Python bindings for them. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143712 Approved by: https://github.com/bdhirsh	2025-01-16 19:41:41 +00:00
eqy	a6763b7b81	[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 ) Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs... Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441 Approved by: https://github.com/Chillee	2025-01-15 18:37:55 +00:00
PyTorch MergeBot	64bcf39180	Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 )" This reverts commit `388b75edec`. Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/kit1980 due to breaking internal builds: unused variable 'halpha' ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2588517060))	2025-01-14 00:48:28 +00:00
bobrenjc93	80c286cbec	remove allow-untyped-defs from torch/_C/_dynamo/eval_frame.pyi (#144655 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144655 Approved by: https://github.com/StrongerXi	2025-01-13 20:03:25 +00:00
eqy	388b75edec	[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 ) Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs... Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441 Approved by: https://github.com/Chillee	2025-01-11 15:30:38 +00:00
Nikita Shulga	92ddb3d3d3	[MPS] Expose `MPSProfiler::start/stopCapture` to Python (#144561 ) I.e. when the `MTL_CAPTURE_ENABLED` environment variable is set to 1, one should be able to wrap the code with `torch.mps.profiler.metal_capture` to generate a gputrace for shaders invoked inside the context manager. For example, code below: ```python import torch import os def foo(x): return x[:,::2].sin() + x[:, 1::2].cos() if __name__ == "__main__": os.environ["MTL_CAPTURE_ENABLED"] = "1" x = torch.rand(32, 1024, device="mps") with torch.mps.profiler.metal_capture("compiled_shader"): torch.compile(foo)(x) ``` should capture the execution of a `torch.compile` generated shader <img width="734" alt="image" src="https://github.com/user-attachments/assets/718ff64e-103b-4b11-b66c-c89cfc770b5d" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/144561 Approved by: https://github.com/manuelcandales ghstack dependencies: #144559, #144560	2025-01-11 02:05:36 +00:00
Randolf Scholz	f002825e1e	added `__add__` and `__mul__` hints to torch.Size (#144322 ) Fixes #144218 `Size` returns `Size`, whereas `tuple` returns `tuple`: `9f28171658/stdlib/builtins.pyi (L985-L988)` - Use `SupportIndex` instead of `int` in `__getitem__` (supported at runtime) - `Size.__add__` overrides `tuple.__add__`, the latter supports adding tuples on non-integral type. - Added typing unit tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144322 Approved by: https://github.com/Skylion007	2025-01-08 01:02:11 +00:00
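The typing point in #144322 (a `tuple` subclass whose `__add__` and `__mul__` return the subclass rather than a plain `tuple`) can be sketched without PyTorch. The `Size` class below is a hypothetical stand-in written for illustration; it is not `torch.Size` and does not reproduce its stubs.

```python
from typing import SupportsIndex, Tuple


class Size(tuple):
    """Hypothetical stand-in for a tuple subclass that preserves its own type."""

    def __add__(self, other: Tuple[int, ...]) -> "Size":
        # tuple.__add__ returns a plain tuple, so re-wrap the result.
        return Size(tuple.__add__(self, other))

    def __mul__(self, n: SupportsIndex) -> "Size":
        # Repetition also re-wraps, so the annotation "-> Size" holds at runtime.
        return Size(tuple.__mul__(self, n))


plain = (2, 3) + (4,)        # built-in behaviour: result is a tuple
sized = Size((2, 3)) + (4,)  # overridden behaviour: result is a Size
repeated = Size((2,)) * 3
```

Because the override narrows the accepted operand to integer tuples while `tuple.__add__` accepts tuples of any element type, a type checker needs explicit hints on the subclass to infer `Size` as the result type.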
Xiaodong Wang	0a94bb432e	[ROCm] CK Flash Attention Backend (#143695 ) Replace https://github.com/pytorch/pytorch/pull/138947 for re-import. Replaces https://github.com/ROCm/pytorch/pull/1592 This PR contains the initial implementation of SDPA with composable_kernel backend. The CK path can be forced by simply calling torch.backends.cuda.preferred_rocm_fa_library("ck"). Similarly, you can force the incumbent aotriton implementation by passing in "aotriton" or "default". As you'd expect, not setting this option will result in aotriton to be used as the backend. In the case of CK, if pytorch deems flash attention usable, then it will use the CK path in all the same places aotriton would have been used. This PR makes no changes to the heuristics which select which attention scheme to use (i.e. flash attention vs memory efficient attention vs math etc etc). It only gets called when flash attention is both enabled (via USE_FLASH_ATTENTION) and is selected at runtime by the existing heuristics. Files located in pytorch/aten/src/ATen/native/transformers/hip/flash_attn/ck/mha* have been pulled from https://github.com/Dao-AILab/flash-attention courtesy of @tridao's hard work who is the co-author NOTE: In order to use this backend, the user MUST set USE_CK_FLASH_ATTENTION=1 in their environment when they build PyTorch. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143695 Approved by: https://github.com/malfet Co-authored-by: Andy Lugo <Andy.LugoReyes@amd.com> Co-authored-by: Jithun Nair <jithun.nair@amd.com>	2025-01-03 22:01:36 +00:00
Yu, Guangye	3848de55ed	Add get_stream_from_external API for CUDA backend (#143799 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143799 Approved by: https://github.com/albanD, https://github.com/EikanWang ghstack dependencies: #142347, #141119, #141123	2024-12-31 11:15:59 +00:00
Yu, Guangye	8f6c4d1732	Add get_stream_from_external API for XPU backend (#141123 ) # Motivation This PR aims to introduce `torch.xpu.ExternalStream` to be used to wrap SYCL queue created in other libraries to PyTorch. # Additional Context Pull Request resolved: https://github.com/pytorch/pytorch/pull/141123 Approved by: https://github.com/albanD, https://github.com/EikanWang ghstack dependencies: #142347, #141119	2024-12-31 11:15:52 +00:00
Aaron Orenstein	06b4b96b34	dynamo tracing perf: no re in arg_ref: 33.9 -> 33.7 (#143069 ) See #143056 for overall docs. This PR: Avoid use of python re and move valid varname check in `GuardBuilder.arg_ref()` into C++ Pull Request resolved: https://github.com/pytorch/pytorch/pull/143069 Approved by: https://github.com/jansel	2024-12-23 05:32:09 +00:00
Aaron Orenstein	9bf4b1c2e9	dynamo tracing perf: c++ strip_function_call: 49.12 -> 47.77 (#143063 ) See #143056 for overall docs. This PR: Convert `strip_function_call()` into C++ Pull Request resolved: https://github.com/pytorch/pytorch/pull/143063 Approved by: https://github.com/jansel ghstack dependencies: #143057, #143062	2024-12-22 06:38:46 +00:00
Nikhil Gupta	94737e8a2a	[ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124 ) Description: 1. Quantize Linear Layer Weights to 4-bits: Quantize the weights of the Linear layer to 4 bits, using symmetric quantization. Pack two 4-bit weights into one uint8 container. Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32. 2. Prepare Quantized Weights, Scales, and Optional Bias: After quantizing, obtain the quantized_weights, scales, and groupsize. If the original Linear layer has a bias, prepare it as well. 3. Pack the Weights Efficiently: Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias. ```python packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features) ``` Input parameters should include: in_features and out_features (the same as the Linear layer’s corresponding parameters). 4. Perform Dynamic Quantized Matrix Multiplication: Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights. ```python output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights, groupsize, in_features, out_features) ``` Inputs required include: The input tensor, packed_weights , groupsize, and the in_features and out_features. 
API Usage: https://github.com/pytorch/pytorch/issues/143289 Model Perf : 7B Transformer model: Prefill : 340 t/s Decode : 40 t/s 2B Transformer model Prefill : 747 t/s Decode : 80 t/s Tests: python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight Ran 1 test in 0.016s OK python test/test_linalg.py -k test__dyn_quant_matmul_4bit Ran 8 tests in 0.077s OK python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit Ran 8 tests in 11.454s Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452 Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124 Approved by: https://github.com/digantdesai, https://github.com/malfet	2024-12-20 19:32:03 +00:00
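Step 1 of the description above (symmetric 4-bit quantization, with two 4-bit values packed into one uint8 container) can be sketched in plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, a single scale is used for the whole list, and the nibble order (low nibble first) is a guess; the layout actually produced by `torch.ops.aten._dyn_quant_pack_4bit_weight` is not specified in the commit message.

```python
def quantize_symmetric_4bit(weights):
    """Map floats to signed integers in [-8, 7] using one shared scale (assumed scheme)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def pack_nibbles(q):
    """Pack pairs of signed 4-bit values into bytes; low nibble first (assumed order)."""
    if len(q) % 2 != 0:
        raise ValueError("need an even number of values to pack in pairs")
    out = bytearray()
    for lo, hi in zip(q[0::2], q[1::2]):
        # Masking with 0xF keeps the two's-complement 4-bit pattern of each value.
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)


def unpack_nibbles(packed):
    """Inverse of pack_nibbles: recover the signed 4-bit values."""
    def signed(n):
        return n - 16 if n >= 8 else n

    q = []
    for b in packed:
        q.append(signed(b & 0xF))
        q.append(signed(b >> 4))
    return q


weights = [0.7, -0.35, 0.1, -0.7]
q, scale = quantize_symmetric_4bit(weights)
packed = pack_nibbles(q)
restored = unpack_nibbles(packed)
```

Packing halves the storage for the quantized weights: four values occupy two bytes here, and dequantization multiplies each restored integer by `scale`.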
Guilherme Leobas	673cc88fd6	Add support for `contextmanager` in Dynamo (#136033 ) Fixes #130559 * Intro This PR adds support for `@contextmanager` in Dynamo. We chose to limit the scope of this work to only `@contextmanager` and plan to handle generators fully in #141055 (still in draft). * Motivation Dynamo lacks support for generator functions. When it encounters one, it traces it as if it were a regular function. This is problematic because it can lead to incorrect behavior. To illustrate, consider the test case below: ```python import torch import contextlib @contextlib.contextmanager def set_default_dtype(dtype): old_dtype = torch.get_default_dtype() try: torch.set_default_dtype(dtype) yield finally: torch.set_default_dtype(old_dtype) @torch.compile(backend="eager", fullgraph=True) def fn(): with set_default_dtype(torch.float64): x = torch.tensor([3.0, 3.0 + 5.0j]) return x ``` Before this work, Dynamo would not stop at the `yield`, and the graph produced would contain both calls to `set_default_dtype` executed one after the other. This is incorrect because the context manager should execute code before and after the `yield`. * List of changes `YIELD_VALUE` now raises an exception (`YieldValueOp`) to signal that control flow must be suspended and returned to the caller. Additionally, `RETURN_VALUE` behaves differently in a generator function. Unlike regular functions, where `RETURN_VALUE` indicates the final result, in generators it signifies that the generator is exhausted and implicitly raises `StopIteration`. A new `VariableTracker` named `FunctionDecoratedByContextlibContextManagerVariable` was introduced to handle `@contextmanager`. This variable tracker acts not just as a wrapper for the original function but also maintains an internal `tx` (InstructionTranslator) object to suspend and return control flow to the parent tracer when a `yield` is encountered. * Corner cases Returning a context manager from a compiled function is not supported. 
This would require PyTorch to synchronize the generator state between Dynamo and the interpreter. Any attempt to return it will result in an `IncorrectUsage` exception. Graph breaks require special handling as well. In the event of a graph break, the frame associated with the context manager is skipped, and the context manager runs in eager mode. * This PR is breaking my code There is a configuration flag (`enable_trace_contextlib`) that can be set to `False` to disable tracing context managers. If this still causes crashes, please revert this PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136033 Approved by: https://github.com/zou3519	2024-12-20 12:02:20 +00:00
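The ordering that the commit above relies on (code before the `yield` runs on entry, code after it runs on exit) can be checked with a stdlib-only sketch. The `state` dictionary and `set_default` function are hypothetical stand-ins for the default-dtype example in the commit message; nothing here touches Dynamo.

```python
import contextlib

events = []
state = {"dtype": "float32"}  # stand-in for a global default setting


@contextlib.contextmanager
def set_default(dtype):
    old = state["dtype"]
    try:
        state["dtype"] = dtype          # runs when the with-block is entered
        events.append(f"enter:{dtype}")
        yield                           # control returns to the with-block body
    finally:
        state["dtype"] = old            # runs when the with-block exits
        events.append(f"exit:{old}")


def fn():
    with set_default("float64"):
        events.append(f"body:{state['dtype']}")
    return state["dtype"]


result = fn()
```

A tracer that does not suspend at the `yield` would run both the setup and the teardown before the body, so the body would observe the restored value instead of the temporary one.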
PyTorch MergeBot	8136daff5a	Revert "[ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124 )" This reverts commit `4b82251011`. Reverted https://github.com/pytorch/pytorch/pull/134124 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it breaks lots of internal build ([comment](https://github.com/pytorch/pytorch/pull/134124#issuecomment-2555953189))	2024-12-19 23:33:17 +00:00
Nikhil Gupta	4b82251011	[ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124 ) Description: 1. Quantize Linear Layer Weights to 4-bits: Quantize the weights of the Linear layer to 4 bits, using symmetric quantization. Pack two 4-bit weights into one uint8 container. Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32. 2. Prepare Quantized Weights, Scales, and Optional Bias: After quantizing, obtain the quantized_weights, scales, and groupsize. If the original Linear layer has a bias, prepare it as well. 3. Pack the Weights Efficiently: Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias. ```python packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features) ``` Input parameters should include: in_features and out_features (the same as the Linear layer’s corresponding parameters). 4. Perform Dynamic Quantized Matrix Multiplication: Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights. ```python output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights, groupsize, in_features, out_features) ``` Inputs required include: The input tensor, packed_weights , groupsize, and the in_features and out_features. 
API Usage: https://github.com/pytorch/pytorch/issues/143289 Model Perf : 7B Transformer model: Prefill : 340 t/s Decode : 40 t/s 2B Transformer model Prefill : 747 t/s Decode : 80 t/s Tests: python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight Ran 1 test in 0.016s OK python test/test_linalg.py -k test__dyn_quant_matmul_4bit Ran 8 tests in 0.077s OK python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit Ran 8 tests in 11.454s Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452 Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124 Approved by: https://github.com/digantdesai, https://github.com/malfet	2024-12-19 18:51:26 +00:00
PyTorch MergeBot	14fe1f7190	Revert "[ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124 )" This reverts commit `d3ff2d42c2`. Reverted https://github.com/pytorch/pytorch/pull/134124 on behalf of https://github.com/malfet due to This broke S390 builds, includes cpuinfo unconditionally ([comment](https://github.com/pytorch/pytorch/pull/134124#issuecomment-2552560208))	2024-12-19 01:05:11 +00:00
Nikhil Gupta	d3ff2d42c2	[ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124 ) Description: 1. Quantize Linear Layer Weights to 4-bits: Quantize the weights of the Linear layer to 4 bits, using symmetric quantization. Pack two 4-bit weights into one uint8 container. Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32. 2. Prepare Quantized Weights, Scales, and Optional Bias: After quantizing, obtain the quantized_weights, scales, and groupsize. If the original Linear layer has a bias, prepare it as well. 3. Pack the Weights Efficiently: Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias. ```python packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features) ``` Input parameters should include: in_features and out_features (the same as the Linear layer’s corresponding parameters). 4. Perform Dynamic Quantized Matrix Multiplication: Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights. ```python output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights, groupsize, in_features, out_features) ``` Inputs required include: The input tensor, packed_weights , groupsize, and the in_features and out_features. 
API Usage: https://github.com/pytorch/pytorch/issues/143289 Model Perf : 7B Transformer model: Prefill : 340 t/s Decode : 40 t/s 2B Transformer model Prefill : 747 t/s Decode : 80 t/s Tests: python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight Ran 1 test in 0.016s OK python test/test_linalg.py -k test__dyn_quant_matmul_4bit Ran 8 tests in 0.077s OK python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit Ran 8 tests in 11.454s Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452 Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124 Approved by: https://github.com/digantdesai, https://github.com/malfet	2024-12-18 22:30:07 +00:00
Zizeng Meng	eb67dd3e2d	[3/N][Memory Profiling] Add memory profiling function for MTIA hooks (#142149 ) Design Doc: https://fburl.com/gdoc/47zpuweb Prototyping: D66469341 In this diff, we implement two new mtia hooks to start/stop profiler and export the memory snapshot. In next diff, we will integrate the mtia backend with profiler python api Differential Revision: [D66823583](https://our.internmc.facebook.com/intern/diff/D66823583/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/142149 Approved by: https://github.com/nautsimon	2024-12-18 11:58:23 +00:00
William Wen	18261e9f39	[dynamo] implement framelocals mapping as c++ object (#140063 ) Implements https://github.com/pytorch/pytorch/issues/93753 - move frame local guard accessors to C++. Before, we used dict accessors on a Python dict representing the frame's fastlocals that we manually build. We move this accessor to C++ and additionally use the fastlocal index whenever possible. Some implementation notes: - `FrameLocalsMapping` is now initialized as a C++ vector of `PyObject`s. We do not just use the frame's localsplus/fastlocals buffer because we also unbox cells. - `FrameLocalsMapping` can still be converted into a Python dict representing the frame's fastlocals, but it is done lazily. - We update `LeafGuard`, `GuardAccessor`, and `GuardManager`'s `check_nopybind` methods to accept `FrameLocalsMapping`. By default, we convert the `FrameLocalsMapping` to a Python dict and run the original `check_nopybind` on it, but in some cases, conversion is not needed. - We add a new guard accessor `FrameLocalsGuardAccessor`, which is similar to `DictGetItemGuardAccessor` but has special handling for `FrameLocalsMapping`. We create a separate class to emphasize different use cases, but we could probably combine these two (can do in a follow up) dynamo_guard_eval.py microbenchmark update: - 713.2us -> 630.0us (3.10) - 598.8us -> 530.7us (3.12) Other followups: - Add `FrameLocalsMapping` version for `check_verbose_nopybind` in order to match behavior between `check_nopybind` and `check_verbose_nopybind`. This can prevent difficult debugging situations where guards fail (`check_nopybind` returns false) but no guard error message is generated (`check_verbose_nopybind` succeeds). 
- Rewrite the `SHAPE_ENV` guard into C++ - it is a fairly common guard that results in `FrameLocalsMapping` needing to convert to a dict Pull Request resolved: https://github.com/pytorch/pytorch/pull/140063 Approved by: https://github.com/jansel ghstack dependencies: #142117, #142430	2024-12-17 18:54:27 +00:00
bobrenjc93	74e66a21b4	remove allow-untyped-defs for torch/_C/_distributed_autograd.pyi (#143369 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143369 Approved by: https://github.com/aorenste	2024-12-17 18:09:28 +00:00
bobrenjc93	313b9964ae	remove allow-untyped-defs for torch/_C/_lazy.pyi (#143370 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143370 Approved by: https://github.com/aorenste, https://github.com/desertfire ghstack dependencies: #143366	2024-12-17 17:18:10 +00:00
PyTorch MergeBot	969b07b96f	Revert "[ROCm] CK Flash Attention Backend (#138947 )" This reverts commit `500d02921b`. Reverted https://github.com/pytorch/pytorch/pull/138947 on behalf of https://github.com/atalman due to Breaks default windows checkout ([comment](https://github.com/pytorch/pytorch/pull/138947#issuecomment-2548998359))	2024-12-17 16:46:57 +00:00
Andy Lugo	500d02921b	[ROCm] CK Flash Attention Backend (#138947 ) Replaces https://github.com/ROCm/pytorch/pull/1592 This PR contains the initial implementation of SDPA with composable_kernel backend. The CK path can be forced by simply calling `torch.backends.cuda.preferred_rocm_fa_library("ck")`. Similarly, you can force the incumbent aotriton implementation by passing in "aotriton" or "default". As you'd expect, not setting this option will result in aotriton to be used as the backend. In the case of CK, if pytorch deems flash attention usable, then it will use the CK path in all the same places aotriton would have been used. This PR makes no changes to the heuristics which select which attention scheme to use (i.e. flash attention vs memory efficient attention vs math etc etc). It only gets called when flash attention is both enabled (via `USE_FLASH_ATTENTION`) and is selected at runtime by the existing heuristics. Files located in pytorch/aten/src/ATen/native/transformers/hip/flash_attn/ck/mha* have been pulled from https://github.com/Dao-AILab/flash-attention courtesy of @tridao's hard work who is the co-author NOTE: In order to use this backend, the user MUST set USE_CK_FLASH_ATTENTION=1 in their environment when they build PyTorch. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138947 Approved by: https://github.com/pruthvistony, https://github.com/xw285cornell, https://github.com/leitian Co-authored-by: Xiaodong Wang <xw285@cornell.edu>	2024-12-17 02:18:07 +00:00
Yifu Wang	af190479c8	[fused_all_gather_matmul] use _multimem_all_gather_matmul for small global Ms (#143160 ) ## Benchmark M=2048, N=3584, K=8192 baseline (nccl + cublas): 301us decomp-based async-tp: 354us comm-aware async-tp: 295us multimem_all_gather matmul: 277us As M further decreases, the multimem_all_gather approach consistently outperforms the baseline and other approaches (omitted other approaches in the chart as they start to be slower than the baseline): ![image](https://github.com/user-attachments/assets/5811455a-68c9-43fe-9d82-ca488dd77bc1) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143160 Approved by: https://github.com/weifengpy ghstack dependencies: #142283, #142810, #143159	2024-12-17 01:07:27 +00:00
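The benchmark figures quoted in #143160 above can be turned into relative speedups with a few lines of arithmetic. This is only an illustrative sketch: the timings are copied from the commit message (M=2048, N=3584, K=8192), while the helper name `speedup_vs_baseline` and the dictionary layout are assumptions made here, not part of the PR.

```python
from typing import Dict

# Timings in microseconds, as quoted in the commit message for M=2048, N=3584, K=8192.
TIMINGS_US: Dict[str, float] = {
    "baseline (nccl + cublas)": 301.0,
    "decomp-based async-tp": 354.0,
    "comm-aware async-tp": 295.0,
    "multimem_all_gather matmul": 277.0,
}


def speedup_vs_baseline(timings: Dict[str, float], baseline_key: str) -> Dict[str, float]:
    """Hypothetical helper: speedup = baseline time / candidate time (>1 means faster)."""
    base = timings[baseline_key]
    return {name: base / t for name, t in timings.items()}


SPEEDUPS = speedup_vs_baseline(TIMINGS_US, "baseline (nccl + cublas)")
for name, factor in SPEEDUPS.items():
    print(f"{name}: {factor:.3f}x")
```

A value above 1 means the approach is faster than the nccl + cublas baseline for this shape; a value below 1 means it is slower.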
Nichols A. Romero	c0a39ad35a	[ROCm] Fix TunableOp UTs: Rotating Buffer (#143172 ) TunableOp's rotating buffer feature cannot be properly tested because the environment variable that controls this feature is sticky. A Python API is introduced to modify this value. Additional items in this PR: * UT for rotating buffer API * Clean up UTs that were setting the rotating buffer via the environment variable * Align behavior of environment variable and Python API when a negative value (< 0) is set. * Update documentation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143172 Approved by: https://github.com/jeffdaily	2024-12-14 06:18:11 +00:00
Alex Denisov	539286a67b	Inductor annotations (#130429 ) Add NVTX annotations around training phases and buffer computations RFC/discussion: https://dev-discuss.pytorch.org/t/rfc-performance-profiling-at-scale-with-details-nvtx-annotations/2224 <img width="2160" alt="Screenshot 2024-07-10 at 11 48 04" src="https://github.com/pytorch/pytorch/assets/1175576/9ade139c-d393-473f-9b68-6c25da367dc4"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130429 Approved by: https://github.com/aorenste, https://github.com/eellison, https://github.com/albanD Co-authored-by: Cedric GESTES <cedric.gestes@flex.ai>	2024-12-10 08:53:39 +00:00
Bin Bao	6680a83e89	[AOTI XPU] Support AOT Inductor for Intel GPU. (#140269 ) This PR adds XPU support for AOT Inductor and reuses the corresponding UT. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140269 Approved by: https://github.com/desertfire, https://github.com/EikanWang ghstack dependencies: #140268 Co-authored-by: Bin Bao <binbao@meta.com>	2024-12-10 05:05:08 +00:00
lzhang2	5d6acd5a31	Register Intel distributed Backend (`XCCL`) in PyTorch distributed package (#141856 ) ### Motivation: As illustrated by the design in the Intel distributed support RFC https://github.com/pytorch/pytorch/issues/141741, two parts are needed to enable Intel distributed backend (`XCCL`) support in PyTorch. 1. Intel GPU distributed backend integration in PyTorch `torch-xpu-ops`. 2. Intel distributed backend registration in the PyTorch distributed package. This PR contributes the part 2 change. ### Example: Here is a simple example of using spawn to launch the XCCL backend and perform allreduce on XPU tensors. ``` import os import torch import torch.distributed as dist import torch.multiprocessing as mp def setup(rank, world_size): os.environ['MASTER_ADDR'] = 'localhost' os.environ['MASTER_PORT'] = '29500' dist.init_process_group(rank=rank, world_size=world_size) def cleanup(): dist.destroy_process_group() def run_allreduce(rank, world_size): setup(rank, world_size) device = torch.device('xpu:{}'.format(rank)) x = torch.randn([2, 2], device=device) dist.all_reduce(x) cleanup() if __name__ == '__main__': world_size = 2 mp.spawn(run_allreduce, args=(world_size,), nprocs=world_size, join=True) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/141856 Approved by: https://github.com/kwen2501, https://github.com/gujinghui, https://github.com/albanD	2024-12-10 01:58:06 +00:00
PyTorch MergeBot	219e9c83a5	Revert "[AOTI XPU] Support AOT Inductor for Intel GPU. (#140269 )" This reverts commit `854d83133b`. Reverted https://github.com/pytorch/pytorch/pull/140269 on behalf of https://github.com/clee2000 due to breaks forward compatibility? D66937097 ([comment](https://github.com/pytorch/pytorch/pull/140269#issuecomment-2528828555))	2024-12-09 17:33:28 +00:00
xinan.lin	854d83133b	[AOTI XPU] Support AOT Inductor for Intel GPU. (#140269 ) This PR adds XPU support for AOT Inductor and reuses the corresponding UT. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140269 Approved by: https://github.com/desertfire, https://github.com/EikanWang ghstack dependencies: #140268	2024-12-07 19:22:04 +00:00
eqy	8fc6d3a5d8	[SDPA] Allow user-specified priority order with context manager (#140467 ) TODO: docs changes? For better debuggability of issues like https://github.com/pytorch/pytorch/issues/139298 Better testing, current sketch: ``` Python import torch from torch.nn.functional import scaled_dot_product_attention from torch.nn.attention import SDPBackend, sdpa_kernel q = torch.randn(64, 1024, 8, 64, dtype=torch.half, device='cuda') print(torch._C._get_sdp_priority_order()) orders = [[SDPBackend.CUDNN_ATTENTION, SDPBackend.MATH, SDPBackend.EFFICIENT_ATTENTION], [SDPBackend.MATH, SDPBackend.CUDNN_ATTENTION, SDPBackend.EFFICIENT_ATTENTION], [SDPBackend.EFFICIENT_ATTENTION, SDPBackend.CUDNN_ATTENTION, SDPBackend.MATH]] import time times = list() for order in orders: print(order) with sdpa_kernel(order, set_priority=True): scaled_dot_product_attention(q, q, q) torch.cuda.synchronize() t0 = time.perf_counter() with sdpa_kernel(order, set_priority=True): scaled_dot_product_attention(q, q, q) torch.cuda.synchronize() t1 = time.perf_counter() times.append(t1 - t0) print(times) assert times[0] < times[1] assert times[0] > times[2] assert times[1] > times[2] print(torch._C._get_sdp_priority_order()) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/140467 Approved by: https://github.com/drisspg	2024-12-06 07:56:35 +00:00
Yukio Siraichi	12d28a5929	Move overlapping guards to C++. (#140013 ) This PR moves the logic for computing the overlapping relations between input tensors that share a storage instance to C++. In summary, this PR: - Moves both `tensors_definitely_do_not_overlap` and part of `compute_overlapping_tensors` to C++ - Introduces a `check_overlapping` function that re-runs `compute_overlapping_tensors`, checking that the result is consistent with what is expected - Introduces the `StorageOverlapChecker` class - Keeps track of overlapping and non-overlapping tensors - Actually checks the overlapping relation (call `check_overlapping`) when all tensors are collected - Introduces the `STORAGE_OVERLAPPING` relational guard - Has a reference to a `StorageOverlapChecker` - Stores the to-be-checked tensors in the checker, and triggers its check - Introduces `install_storage_overlapping_guard` python function - Creates an instance of `StorageOverlapChecker` - Creates 2 instances of the `STORAGE_OVERLAPPING` guard (for overlapping and non-overlapping tensors), referencing the same `StorageOverlapChecker` instance Why is `StorageOverlapChecker` needed? The way `GuardManager` is implemented, we have no control over the order in which the check methods are called, i.e. no control over the order the tensors are collected. So, we can't easily split them in "overlapping" and non-overlapping kinds. Instead, we create 2 instances of `STORAGE_OVERLAPPING` guard, each of which helps collecting the tensors for one of the kinds mentioned above. They are then used in a single `StorageOverlapChecker` instance. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140013 Approved by: https://github.com/bdhirsh ghstack dependencies: #139554, #139555	2024-12-05 14:43:58 +00:00
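To make the idea behind the overlap guards in #140013 above more concrete, here is a minimal pure-Python sketch of a conservative "definitely do not overlap" test for two strided views of the same storage. It is not the algorithm PyTorch implements in C++: it only compares the range of element indices each view can touch, and it assumes non-negative strides. The names `touched_range` and `definitely_do_not_overlap` are hypothetical.

```python
from typing import Optional, Sequence, Tuple

# A view is described as (storage_offset, sizes, strides), all in units of elements.
View = Tuple[int, Sequence[int], Sequence[int]]


def touched_range(offset: int, sizes: Sequence[int], strides: Sequence[int]) -> Optional[Tuple[int, int]]:
    """Return the inclusive (first, last) storage index a view can touch.

    Returns None for an empty view. Assumes non-negative strides.
    """
    if any(size == 0 for size in sizes):
        return None
    last = offset + sum((size - 1) * stride for size, stride in zip(sizes, strides))
    return offset, last


def definitely_do_not_overlap(a: View, b: View) -> bool:
    """Conservative check: True only when the touched index ranges are disjoint.

    A False result means "may overlap", not "does overlap" (e.g. interleaved views).
    """
    range_a = touched_range(*a)
    range_b = touched_range(*b)
    if range_a is None or range_b is None:
        return True  # an empty view cannot overlap anything
    return range_a[1] < range_b[0] or range_b[1] < range_a[0]
```

For example, two contiguous 2x3 views at offsets 0 and 6 are reported as disjoint, while two interleaved stride-2 views at offsets 0 and 1 are reported as "may overlap" even though they share no element; that is the price of the simple range comparison used in this sketch.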
Yu, Guangye	8dd4673cea	Support torch.xpu.mem_get_info API (#141230 ) # Motivation Fix https://github.com/pytorch/pytorch/issues/130599 This PR intends to add a new API, `torch.xpu.mem_get_info`, which is widely used in popular model workloads. For example, [here](`403c0714d1/src/accelerate/utils/modeling.py (L721)`) we need to get current GPU memory usage to split or load the model. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141230 Approved by: https://github.com/EikanWang, https://github.com/albanD	2024-12-05 08:17:25 +00:00
Benjamin Glass	4959784dac	Add API query for available per-process CUDA memory (#140620 ) Certain `cpp_wrapper`-enabled tests were OOM-ing in the CI pipeline, with error messages suggesting that sufficient memory was accessible. This ultimately resulted from an internal memory limitation that was not queryable in the API. This PR adds querying for that limit. Additionally, the failing tests had incorrect memory availability checks, and are updated with measured memory requirements. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140620 Approved by: https://github.com/malfet, https://github.com/eqy ghstack dependencies: #141367	2024-12-03 00:24:03 +00:00
Hyunho Yeo	d70b7029c8	[MTIA] Support torch.mtia.empty_cache() (#141533 ) Summary: As title Test Plan: Passed a local unit test: `buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api` https://www.internalfb.com/intern/testinfra/testrun/4785074861101240 Reviewed By: nautsimon Differential Revision: D66481778 Pull Request resolved: https://github.com/pytorch/pytorch/pull/141533 Approved by: https://github.com/nautsimon	2024-11-28 02:24:19 +00:00
Simon Fan	db4e8a1d8a	[ca] expose option to collect sizes as dynamic (#141153 ) This is to address recompiles from eager nodes that saved dynamic activations Pull Request resolved: https://github.com/pytorch/pytorch/pull/141153 Approved by: https://github.com/jansel ghstack dependencies: #141152	2024-11-22 19:26:27 +00:00
Animesh Jain	fb529c2c84	[dynamo] skip_guard_eval_unsafe stance for power users (#140251 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140251 Approved by: https://github.com/jansel ghstack dependencies: #140223, #140250	2024-11-21 06:28:58 +00:00
Syed Tousif Ahmed	e0482fdf95	Implements user buffer registration using MemPool (#133603 ) This PR implements user buffer registration and demonstrates NVLink Sharp (NVLS) reductions using a combination of allocating special memory using MemPool and registering it with the NCCL buffer registration APIs. Part of https://github.com/pytorch/pytorch/issues/124807. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133603 Approved by: https://github.com/kwen2501, https://github.com/eqy	2024-11-21 01:40:11 +00:00
Animesh Jain	9d229f08f4	[dynamo][guards] Introduce a diff_guard_manager (#140250 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140250 Approved by: https://github.com/jansel ghstack dependencies: #140223	2024-11-20 17:59:30 +00:00
FFFrog	a9e54f64ee	Remove unused Python API named _set_torch_function_mode (#141023 ) Detailed description: As the title stated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141023 Approved by: https://github.com/Skylion007, https://github.com/ezyang	2024-11-20 09:48:03 +00:00
PyTorch MergeBot	9fac5a16fd	Revert "[PGNCCL] Add an API to get the status/error code of each PG (#140087 )" This reverts commit `80aa19a622`. Reverted https://github.com/pytorch/pytorch/pull/140087 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/140087#issuecomment-2486912231))	2024-11-19 22:53:46 +00:00
PyTorch MergeBot	496c1e78c5	Revert "Implements user buffer registration using MemPool (#133603 )" This reverts commit `25d9be37be`. Reverted https://github.com/pytorch/pytorch/pull/133603 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/133603#issuecomment-2486897708))	2024-11-19 22:42:26 +00:00
Animesh Jain	f4ce9ac29d	[dynamo] Dont erase the cache line on invalidation (#140821 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140821 Approved by: https://github.com/jansel	2024-11-19 19:11:10 +00:00
Yifu Wang	ab5c8857ef	[SymmetricMemory] support specifying group_name at rendezvous time (#139529 ) Before this PR, users need to call `empty_strided_p2p()` with a `group_name`: ```python tensor = _SymmetricMemory.empty_strided_p2p((1024,), (1,), device=device, group_name="0") symm_mem = _SymmetricMemory.rendezvous(tensor) ``` Users can now omit `group_name` at allocation time and specify it later at rendezvous time: ```python tensor = _SymmetricMemory.empty_strided_p2p((1024,), (1,), device=device) symm_mem = _SymmetricMemory.rendezvous(tensor, group_name="0") ``` Rationales for this change: - This allows the same allocation to establish symmetric memory under different groups - Specifying `group_name` at rendezvous time instead of allocation time is a more natural UX Pull Request resolved: https://github.com/pytorch/pytorch/pull/139529 Approved by: https://github.com/lw	2024-11-17 09:31:17 +00:00
Ryan Guo	ac6684ebbc	[dynamo] Identify pre-existing captured cells by cell id rather than content id (#140436 ) In `match_nested_cell`, Dynamo tried to identify pre-existing captured cells by `(cell_name, id(cell_contents))`. This works in most cases, but as the test added in this patch shows, it's not a complete solution. This patch 1. changes `match_nested_cell` to `lookup_variable_for_captured_cell`, and does the lookup based on id of cell objects, not their contents. This requires plumbing a tuple of captured cell objects from different CPython versions all the way to `InstructionTranslator.__init__`, where we store a mapping from the ids of these cell objects, and use it later in `UserFunctionVariable.bind_args` to look for these unboxed cells. 2. builds off (1) -- rather than using a `VariableTracker` that represents the content of the unboxed cells, use `ClosureVariable`, which enables codegen in case these cells escape as closure of a `NestedUserFunctionVariable`. The patch adds a regression test for each of the scenarios above: 1. `test_write_to_cells_with_name_shadowing` where Dynamo mistakenly thought the program is writing to a cell captured by root frame (which it doesn't support atm), which resulted in ``` File "/Users/ryanguo99/Documents/work/pytorch/torch/_dynamo/symbolic_convert.py", line 3340, in STORE_DEREF unimplemented("write to __closure__ while inlining") File "/Users/ryanguo99/Documents/work/pytorch/torch/_dynamo/exc.py", line 313, in unimplemented raise Unsupported(msg, case_name=case_name) torch._dynamo.exc.Unsupported: write to __closure__ while inlining ``` 2. 
`test_existing_func_that_creates_capturing_nested_func` where Dynamo ended up trying to codegen a `NestedUserFunctionVariable` that captures a cell which was also captured by the root frame, so it was unboxed and ends up emitting `LOAD_DEREF` rather than `LOAD_FAST/LOAD_CLOSURE` during codegen, resulting in ``` File "/Users/ryanguo99/Documents/work/pytorch/torch/_dynamo/variables/functions.py", line 105, in _create_nested_fn func = FunctionType(code, f_globals, name, defaults, closure) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ TypeError: arg 5 (closure) expected cell, found int ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/140436 Approved by: https://github.com/jansel, https://github.com/williamwen42 ghstack dependencies: #140330, #140152	2024-11-15 17:17:30 +00:00
Ryan Guo	85dd7b84cf	[dynamo] Add a `DynamoFrameType` type above Python frame object (#140330 ) This patch introduces a `DynamoFrameType` to serve as a layer between Dynamo and different versions of the Python frame object. In `DynamoFrameType`, we only register attributes Dynamo cares about (e.g., `f_code`, `f_locals`, etc.). This will be helpful when it comes to adding new attributes to this `DynamoFrameType`, or dealing with Python version changes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140330 Approved by: https://github.com/jansel, https://github.com/williamwen42	2024-11-15 17:17:30 +00:00
Syed Tousif Ahmed	25d9be37be	Implements user buffer registration using MemPool (#133603 ) This PR implements user buffer registration and demonstrates NVLink Sharp (NVLS) reductions using a combination of allocating special memory using MemPool and registering it with the NCCL buffer registration APIs. Part of https://github.com/pytorch/pytorch/issues/124807. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133603 Approved by: https://github.com/kwen2501, https://github.com/eqy	2024-11-15 12:47:49 +00:00
Shuqiang Zhang	80aa19a622	[PGNCCL] Add an API to get the status/error code of each PG (#140087 ) Summary: If a PG is unhealthy, the user should be able to get the type of error, e.g., timeout, NCCL error, or remote error. This API applies at the PG level, whereas the work.get_future_result() API applies at the Work level. Error detection at the PG level is much more convenient for users who want to handle the PG failure as a whole, e.g., by restarting the PG. Error handling at the Work level is still useful for users to attach work-specific context and debug the root cause of the specific failing work/collective. Note that it is critical for all ranks in the PG to be notified about an error as soon as it occurs, so we introduce an errorType of REMOTE_ERROR, which is 'broadcasted' from a src rank (which detects a local error) to all other ranks in the PG; the broadcast is currently done through TCPStore. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140087 Approved by: https://github.com/kwen2501	2024-11-15 04:11:00 +00:00
Zhengxu Chen	3ef2dfc1ba	[export] Implement cpp deserializer. (#136398 ) Differential Revision: D63206258 This diff introduces a mechanism to generate a json-compatible deserializer in cpp using nlohmann json (already being used by AOTI). Why do we need this? Because there will be a lot of cases where people don't want to use Python to load the graph (e.g. cpp runtime); instead, they can use this header to deserialize the JSON graph. Every time we call update_schema.py to update the schema, the header will be auto-generated and included in the source files. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136398 Approved by: https://github.com/angelayi	2024-11-14 16:34:59 +00:00
Aaron Orenstein	82597d07aa	type annotations for meta_utils (#140203 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140203 Approved by: https://github.com/ezyang	2024-11-13 20:07:47 +00:00
Yu, Guangye	659d2132be	Add architecture to XPU device property (#138186 ) # Motivation Add `architecture` to XPU device property. In some cases, low-level application code can use special features or do specific optimizations depending on the device architecture, and this PR enables such applications. Modified from https://github.com/pytorch/pytorch/pull/129675/files Pull Request resolved: https://github.com/pytorch/pytorch/pull/138186 Approved by: https://github.com/ezyang	2024-11-13 03:35:13 +00:00
Yifu Wang	0a0915fb5e	[SymmetricMemory] improve the API for stream_write_value32 (#139934 ) This PR updates the binding for `stream_write_value32` to be consistent with `memset32`, which IMO makes more sense for this type of utility: - Changed the API to take a uint32 tensor as argument, instead of a device pointer - Changed the Python binding to be a static method of `_SymmetricMemory`, instead of an object method - Use the dispatcher for device dispatching, as opposed to `SymmetricMemory` backends Pull Request resolved: https://github.com/pytorch/pytorch/pull/139934 Approved by: https://github.com/weifengpy ghstack dependencies: #139227	2024-11-11 18:49:22 +00:00
PyTorch MergeBot	5f4a21dc58	Revert "[SymmetricMemory] improve the API for stream_write_value32 (#139934 )" This reverts commit `2f3a5a15ef`. Reverted https://github.com/pytorch/pytorch/pull/139934 on behalf of https://github.com/malfet due to Broke distributed tests, see https://github.com/pytorch/pytorch/actions/runs/11770673088/job/32784210441 ([comment](https://github.com/pytorch/pytorch/pull/139934#issuecomment-2468641512))	2024-11-11 17:02:07 +00:00
Yifu Wang	2f3a5a15ef	[SymmetricMemory] improve the API for stream_write_value32 (#139934 ) This PR updates the binding for `stream_write_value32` to be consistent with `memset32`, which IMO makes more sense for this type of utility: - Changed the API to take a uint32 tensor as argument, instead of a device pointer - Changed the Python binding to be a static method of `_SymmetricMemory`, instead of an object method - Use the dispatcher for device dispatching, as opposed to `SymmetricMemory` backends Pull Request resolved: https://github.com/pytorch/pytorch/pull/139934 Approved by: https://github.com/weifengpy ghstack dependencies: #139227	2024-11-11 01:54:35 +00:00
Yifu Wang	1659e241c8	[experimental] async-tp impl with cutlass-based, progress aware kernel (#139227 ) This PR introduces the following: ### torch.ops.symm_mem._async_input_mm `_async_input_mm(Tensor a, Tensor b, Tensor a_chunk_signals, int a_chunk_pivot) -> Tensor` An mm impl that supports consuming asynchronous input. It guarantees the following rasterization order, and that the corresponding signal arrives before an input chunk is consumed. ``` num_chunks = a_chunk_signals.numel() for chunk_idx in range(a_chunk_pivot, num_chunks + a_chunk_pivot): chunk_idx = chunk_idx % num_chunks wait_signal(a_chunk_signals, chunk_idx) # Compute output tiles that consume the input chunk ``` ### PersistentAsyncInputScheduler This is a forked version of PersistentScheduler that supports consuming asynchronous input. This tile scheduler introduces the following arguments: - `tiles_per_chunk_m` – Specifies the size of an M chunk. Chunks are the granularity at which the asynchronous input becomes ready. It must be an integer multiple of the size of an M tile. - `chunk_signals` – `chunk_signals[i] == 1` indicates that chunk i is ready. Before returning a work tile, get_current_work() waits for the signal to ensure that the corresponding chunk is ready. - `tile_idx_pivot_m` – After applying swizzling, apply `pivot(m) => (m + tile_idx_pivot_m) % tiles_m` to `m`. In a distributed setting, this allows different ranks to process different m indices at the same time, thus avoiding communication hotspots. Note that this scheduler currently only supports the `KernelTmaWarpSpecializedCooperative` kernel schedule. This is enforced via the template argument `KernelSchedule`.
Usage: ``` using GemmKernel = cutlass::gemm::kernel::GemmUniversal< Shape<int, int, int, int>, CollectiveMainloop, CollectiveEpilogue, cutlass::gemm::PersistentAsyncInputScheduler<KernelSchedule>>; ``` ### _fused_all_gather_matmul_native An ag-mm impl that combines `torch.ops.symm_mem._async_input_mm` and progress-aware all-gather. This is not yet enabled via the async-tp passes. We will use it as a backend to optimize the current decomposition-based async-tp impl. ## Benchmarks ### 4096x3584x8192 - cublas + nccl: 539us - decomp-based async-tp w/o cuda graph: 694us - decomp-based async-tp w/ cuda graph: 478us - new cutlass kernel: 408us <img width="478" alt="image" src="https://github.com/user-attachments/assets/39f316ab-36c5-4b41-af77-07854a385dfc"> ### 2048x3584x8192 - cublas + nccl: 301us - decomp-based async-tp w/o cuda graph: 687us - decomp-based async-tp w/ cuda graph: 356us - new cutlass kernel: 276us <img width="441" alt="image" src="https://github.com/user-attachments/assets/9e23ce21-863b-43dd-a562-fb05d3a5a144"> ## Next Steps - Add tuning logic - Use `_fused_all_gather_matmul_native` as a backend for the decomp-based async-tp impl Differential temp Revision: [D65623152](https://our.internmc.facebook.com/intern/diff/D65623152) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139227 Approved by: https://github.com/weifengpy, https://github.com/Chillee	2024-11-08 23:28:25 +00:00
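The rasterization order and the `pivot(m)` mapping described in #139227 above can be illustrated without any GPU code. The following is a minimal sketch in plain Python, assuming nothing beyond the two formulas quoted in the commit message; the function names `chunk_visit_order` and `pivot_tile_m` are hypothetical and do not exist in PyTorch or CUTLASS.

```python
from typing import List


def chunk_visit_order(num_chunks: int, a_chunk_pivot: int) -> List[int]:
    """Order in which input chunks are consumed, starting at the pivot and wrapping around."""
    return [idx % num_chunks for idx in range(a_chunk_pivot, num_chunks + a_chunk_pivot)]


def pivot_tile_m(m: int, tile_idx_pivot_m: int, tiles_m: int) -> int:
    """pivot(m) => (m + tile_idx_pivot_m) % tiles_m, as stated in the commit message."""
    return (m + tile_idx_pivot_m) % tiles_m


# Illustration: with 4 chunks, using the rank as the pivot makes each rank start on a
# different chunk (an assumption about how a caller might choose the pivot).
for rank in range(4):
    print(rank, chunk_visit_order(num_chunks=4, a_chunk_pivot=rank))
```

Choosing a different pivot per rank is one way to get the effect the commit message describes: ranks work on different m indices at the same time, which spreads out the communication instead of concentrating it on one chunk.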
Animesh Jain	e6c5a77485	[dynamo][guards] Profile guard manager in C++ (#140110 ) This should remove the pybind noise from the profiling. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140110 Approved by: https://github.com/jansel ghstack dependencies: #139953	2024-11-08 18:44:08 +00:00
Luca Wehrstedt	5f287df422	Add type information for FakeProcessGroup (#133211 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133211 Approved by: https://github.com/Skylion007	2024-11-08 11:18:52 +00:00
PyTorch MergeBot	36e0f119d0	Revert "[experimental] async-tp impl with cutlass-based, progress aware kernel (#139227 )" This reverts commit `5203138483`. Reverted https://github.com/pytorch/pytorch/pull/139227 on behalf of https://github.com/yifuwang due to Need to address internal build failure D65605027 ([comment](https://github.com/pytorch/pytorch/pull/139227#issuecomment-2463204467))	2024-11-07 21:01:36 +00:00
Yifu Wang	5203138483	[experimental] async-tp impl with cutlass-based, progress aware kernel (#139227 ) This PR introduces the following: ### torch.ops.symm_mem._async_input_mm `_async_input_mm(Tensor a, Tensor b, Tensor a_chunk_signals, int a_chunk_pivot) -> Tensor` An mm impl that supports consuming asynchronous input. It guarantees the following rasterization order, and that the corresponding signal arrives before an input chunk is consumed. ``` num_chunks = a_chunk_signals.numel() for chunk_idx in range(a_chunk_pivot, num_chunks + a_chunk_pivot): chunk_idx = chunk_idx % num_chunks wait_signal(a_chunk_signals, chunk_idx) # Compute output tiles that consume the input chunk ``` ### PersistentAsyncInputScheduler This is a forked version of PersistentScheduler that supports consuming asynchronous input. This tile scheduler introduces the following arguments: - `tiles_per_chunk_m` – Specifies the size of an M chunk. Chunks are the granularity at which the asynchronous input becomes ready. It must be an integer multiple of the size of an M tile. - `chunk_signals` – `chunk_signals[i] == 1` indicates that chunk i is ready. Before returning a work tile, get_current_work() waits for the signal to ensure that the corresponding chunk is ready. - `tile_idx_pivot_m` – After applying swizzling, apply `pivot(m) => (m + tile_idx_pivot_m) % tiles_m` to `m`. In a distributed setting, this allows different ranks to process different m indices at the same time, thus avoiding communication hotspots. Note that this scheduler currently only supports the `KernelTmaWarpSpecializedCooperative` kernel schedule. This is enforced via the template argument `KernelSchedule`.
Usage: ``` using GemmKernel = cutlass::gemm::kernel::GemmUniversal< Shape<int, int, int, int>, CollectiveMainloop, CollectiveEpilogue, cutlass::gemm::PersistentAsyncInputScheduler<KernelSchedule>>; ``` ### _fused_all_gather_matmul_native An ag-mm impl that combines `torch.ops.symm_mem._async_input_mm` and progress-aware all-gather. This is not yet enabled via the async-tp passes. We will use it as a backend to optimize the current decomposition-based async-tp impl. ## Benchmarks ### 4096x3584x8192 - cublas + nccl: 539us - decomp-based async-tp w/o cuda graph: 694us - decomp-based async-tp w/ cuda graph: 478us - new cutlass kernel: 408us <img width="478" alt="image" src="https://github.com/user-attachments/assets/39f316ab-36c5-4b41-af77-07854a385dfc"> ### 2048x3584x8192 - cublas + nccl: 301us - decomp-based async-tp w/o cuda graph: 687us - decomp-based async-tp w/ cuda graph: 356us - new cutlass kernel: 276us <img width="441" alt="image" src="https://github.com/user-attachments/assets/9e23ce21-863b-43dd-a562-fb05d3a5a144"> ## Next Steps - Add tuning logic - Use `_fused_all_gather_matmul_native` as a backend for the decomp-based async-tp impl Pull Request resolved: https://github.com/pytorch/pytorch/pull/139227 Approved by: https://github.com/weifengpy, https://github.com/Chillee	2024-11-07 03:43:12 +00:00
CaoE	9e14d86573	[Inductor][CPP] Add oneDNN BRGEMM config for Half cpp gemm template (#136255 ) `kernel_micro_gemm` generated using BRGEMM: ``` template <bool accum> inline void kernel_micro_gemm( const half* __restrict__ A, const half* __restrict__ B, float* __restrict__ C, int64_t M, int64_t N, int64_t K, int64_t lda, int64_t ldb, int64_t ldc ) { at::native::cpublas::brgemm( M, N, K, lda, ldb, ldc, 1.f, accum ? 1.f : 0.f, A, B, C); } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136255 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-11-05 05:33:29 +00:00
Shivam Raikundalia	ac7acfb894	[Profiler] Create Auto-Trace Frontend for Trace ID (#139310 ) Summary: This PR adds an Auto-Trace implementation for Trace ID. By default, the Python side will generate a uuid in the same format as the one set in the backend by kineto. Upon running an auto-trace, the Python-generated trace id will overwrite the one set in kineto using the Config variable. Since we don't expect users to generate on-demand traces after an auto-trace, we can simply keep overwriting the backend trace id whenever auto-trace is run. If we one day want to support something like this, we simply have to add a call in kineto on the backend to generate a new ID upon start of profiling. We also implement a custom callback in the frontend so that users can generate their own trace ids if they wish to. This works similarly to the default, the only difference being that they have to manually set this callback after a profiler is created. We use a specific call to set this rather than putting it in the frontend initializer in case users want to change the trace_id for different repeats. Test Plan: Tested both default and custom callbacks using the verbose prints added. Trace ids on the frontend and the prints on the backend for the manifold upload matched. Differential Revision: D65178308 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139310 Approved by: https://github.com/shengfukevin	2024-10-31 19:02:57 +00:00
Richard Zou	ef380f7b8e	[real tensor prop] Add some asserts for custom ops (#139212 ) When we see a custom op: - check that its mutation annotations are correct - check that its aliasing constraints match our constraints for custom ops. Otherwise, there may be undefined behavior. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/139212 Approved by: https://github.com/angelayi	2024-10-30 19:29:11 +00:00
Will Feng	4ee514144b	[c10d][Partial-Graph Overlap] Support calling .wait_tensor() on output tensor of eager `async_op=True` collective if under `allow_inflight_collective_as_graph_input_ctx()` context manager (#137763 ) This PR aims to support the following use case: ```python def all_reduce_eager(x): y = x * x req = dist.all_reduce(y, op=dist.ReduceOp.SUM, async_op=True) assert isinstance(req, torch.distributed.Work) return y @torch.compile(fullgraph=True) def all_reduce_wait_compiled(y): torch.ops.c10d_functional.wait_tensor(y) return y * y x = torch.ones(1280, 1280, device="cuda") + self.rank with allow_inflight_collective_as_graph_input_ctx(): y = all_reduce_eager(x) z = all_reduce_wait_compiled(y) ``` where the collective is issued in eager (with `async_op=True`) but waited in compiled region. This is important for internal use cases such as TorchRec, where we issue collectives in eager for SparseArch all_to_all but want to wait for them in compiled region at beginning of OverArch, so that the all_to_all can be overlapped with the DenseArch compute that runs in parallel. ---- Update: Did two items to prevent regression to existing use cases: 1. Added memory-stressed test case to test_c10d_nccl.py `test_unwaited` to cover existing user's "not calling work.wait() for non-functional collective" use case 2. Gated all new `register_work()` / `unregister_work()` calls with `c10d::allow_inflight_collective_as_graph_input()` check, which is a new context manager that requires explicit user enablement (i.e. not on by default, so should not affect existing users). The risk of this new version of PR causing regression should be very low. 
------ Test commands: - `pytest -rA test/distributed/test_inductor_collectives.py::TestCollectivesMultiProc::test_eager_async_allreduce_inductor_wait` - `pytest -rA test/test_fx.py::TestDCE::test_keep_collectives` - `pytest -rA test/test_fx.py::TestDCE::test_keep_collectives_no_overload` - `pytest -rA test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_wait_tensor` - `pytest -rA test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_unwaited` - `pytest -rA test/distributed/test_c10d_nccl.py::CommTest::test_wait_tensor` - `pytest -rA test/distributed/test_c10d_nccl.py::CommTest::test_unwaited` - `pytest -rA test/distributed/_tensor/test_tensor_ops.py::DistTensorOpsTest::test_equal` - `pytest -rA test/distributed/_tensor/test_random_ops.py::DistTensorRandomOpTest::test_manual_seed` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_baseline_aot_eager_multiprocess` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_setattr` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_inline` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_no_inline` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_asymmetric_compilation` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_scalar` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_speculation_divergence` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_tensor` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_dim_mismatch` - `pytest -rA 
test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_graph_break_empty_graph_still_collective` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_missing_source` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_scalar_missing_source` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_type_mismatch` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_activation_checkpointing` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_baseline_aot_eager_multiprocess` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_activation_checkpointing` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_inductor` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_setattr` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_inline` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_no_inline` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_aot_eager` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_aot_eager_static_graph` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_inductor` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_inductor_static_graph` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_fsdp_activation_checkpointing` - `pytest -rA test/distributed/_tensor/test_experimental_ops.py::DistOtherOpsTest::test_bernoulli` - 
`pytest -rA test/distributed/_tensor/test_dtensor_compile.py::TestDTensorCompileE2E::test_tp_compile_fullgraph_is_seq_parallel_True` - `pytest -rA test/distributed/test_inductor_collectives.py::TestCollectivesMultiProc::test_allreduce_inductor_cudagraph_trees` - `python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --inference --bfloat16 --total-partitions 2 --partition-id 1 --output inference_torchbench.csv --only moco` ------ Differential Revision: [D65023311](https://our.internmc.facebook.com/intern/diff/D65023311) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137763 Approved by: https://github.com/yifuwang	2024-10-29 03:31:19 +00:00
Simon Fan	fd9f4e6770	Back out "[compiled autograd] tls access helpers (#138061 )" and Back out "[compiled autograd] Compiled autograd configs in TLS (#137821 )" (#139086 ) Summary: Original commit changeset: 9bf80c1492d7 Original Phabricator Diff: D64796226 Original commit changeset: aa1d9ef8f6e6 Original Phabricator Diff: D64796212 Differential Revision: D65072644 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139086 Approved by: https://github.com/malfet	2024-10-28 23:37:05 +00:00
William Wen	35be6aef69	[dynamo] add some cpython debugging methods (#138030 ) This PR enables you to inspect PyObjects in C using `INSPECT(...)` without requiring https://docs.python.org/3/howto/gdb_helpers.html. `torch._dynamo.eval_frame.raise_sigtrap` can also be used to set gdb breakpoints while running Python code, e.g. ```python x = x + 1 torch._dynamo.eval_frame.raise_sigtrap(); # can breakpoint on ceval.c:CALL to breakpoint the `sin` call in C. x = torch.sin(x) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/138030 Approved by: https://github.com/jansel	2024-10-28 22:25:21 +00:00
PyTorch MergeBot	e5595f10c8	Revert "[c10d][Partial-Graph Overlap] Support calling .wait_tensor() on output tensor of eager `async_op=True` collective if under `allow_inflight_collective_as_graph_input_ctx()` context manager (#137763 )" This reverts commit `a688c57033`. Reverted https://github.com/pytorch/pytorch/pull/137763 on behalf of https://github.com/yf225 due to Seems to have bad interaction with latest commits on trunk, reverting to be safe ([comment](https://github.com/pytorch/pytorch/pull/137763#issuecomment-2442527696))	2024-10-28 20:13:46 +00:00
Will Feng	a688c57033	[c10d][Partial-Graph Overlap] Support calling .wait_tensor() on output tensor of eager `async_op=True` collective if under `allow_inflight_collective_as_graph_input_ctx()` context manager (#137763 ) This PR aims to support the following use case: ```python def all_reduce_eager(x): y = x * x req = dist.all_reduce(y, op=dist.ReduceOp.SUM, async_op=True) assert isinstance(req, torch.distributed.Work) return y @torch.compile(fullgraph=True) def all_reduce_wait_compiled(y): torch.ops.c10d_functional.wait_tensor(y) return y * y x = torch.ones(1280, 1280, device="cuda") + self.rank with allow_inflight_collective_as_graph_input_ctx(): y = all_reduce_eager(x) z = all_reduce_wait_compiled(y) ``` where the collective is issued in eager (with `async_op=True`) but waited in compiled region. This is important for internal use cases such as TorchRec, where we issue collectives in eager for SparseArch all_to_all but want to wait for them in compiled region at beginning of OverArch, so that the all_to_all can be overlapped with the DenseArch compute that runs in parallel. 
------ Test commands: - `pytest -rA test/distributed/test_inductor_collectives.py::TestCollectivesMultiProc::test_eager_async_allreduce_inductor_wait` - `pytest -rA test/test_fx.py::TestDCE::test_keep_collectives` - `pytest -rA test/test_fx.py::TestDCE::test_keep_collectives_no_overload` - `pytest -rA test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_wait_tensor` - `pytest -rA test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_unwaited` - `pytest -rA test/distributed/test_c10d_nccl.py::CommTest::test_wait_tensor` - `pytest -rA test/distributed/test_c10d_nccl.py::CommTest::test_unwaited` - `pytest -rA test/distributed/_tensor/test_tensor_ops.py::DistTensorOpsTest::test_equal` - `pytest -rA test/distributed/_tensor/test_random_ops.py::DistTensorRandomOpTest::test_manual_seed` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_baseline_aot_eager_multiprocess` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_setattr` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_inline` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_no_inline` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_asymmetric_compilation` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_scalar` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_speculation_divergence` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_tensor` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_dim_mismatch` - `pytest -rA 
test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_graph_break_empty_graph_still_collective` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_missing_source` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_scalar_missing_source` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_type_mismatch` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_activation_checkpointing` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_baseline_aot_eager_multiprocess` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_activation_checkpointing` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_inductor` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_setattr` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_inline` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_no_inline` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_aot_eager` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_aot_eager_static_graph` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_inductor` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_inductor_static_graph` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_fsdp_activation_checkpointing` - `pytest -rA test/distributed/_tensor/test_experimental_ops.py::DistOtherOpsTest::test_bernoulli` - 
`pytest -rA test/distributed/_tensor/test_dtensor_compile.py::TestDTensorCompileE2E::test_tp_compile_fullgraph_is_seq_parallel_True` - `pytest -rA test/distributed/test_inductor_collectives.py::TestCollectivesMultiProc::test_allreduce_inductor_cudagraph_trees` - `python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --inference --bfloat16 --total-partitions 2 --partition-id 1 --output inference_torchbench.csv --only moco` ------ Differential Revision: [D65023311](https://our.internmc.facebook.com/intern/diff/D65023311) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137763 Approved by: https://github.com/yifuwang	2024-10-28 18:11:23 +00:00
Wouter Devriendt	bae3426af7	reimport pr137735 due to merging check issues (#138959 ) This is a cherry-pick from #137735 by @mikaylagawarecki , that cannot be merged due to a (wrongly) failing check for codev @diff-train-skip-merge Pull Request resolved: https://github.com/pytorch/pytorch/pull/138959 Approved by: https://github.com/mikaylagawarecki	2024-10-27 16:31:34 +00:00
Yu, Guangye	40c098f731	Introduce a device-agnostic runtime API design (#132204 ) # Motivation According to [[RFC]A device-agnostic Python runtime API design for stream-based accelerators](https://github.com/pytorch/pytorch/issues/128403), this PR intends to introduce a device-agnostic runtime API design. I personally prefer the Simple Version APIs that no longer accept the device type as an input argument. This means we will leverage `getAccelerator` to fetch the current accelerator. These APIs can also be extended flexibly to handle scenarios with multiple accelerator types. The design does NOT break the previous design philosophies. I also believe that the `torch.accelerator` namespace is better. It lets users know that the APIs they are calling are running on an accelerator rather than the CPU. This is important. Meanwhile, we can follow a simple API design principle: 1. Device-agnostic APIs should be placed under the torch.accelerator namespace and not accept a device_type optional parameter. 2. Device-specific APIs should be placed under device-specific submodules. 3. APIs required by both CPU and accelerators should be placed under the torch namespace and accept a device_type optional parameter. Also, I list the pros and cons of Simple Version here: Pros: - `torch.accelerator.foo` will have the same input argument as `torch.xxx.foo`, bringing a better user experience; - more concise, making it easier for developers to write device-agnostic code. Cons: - no obvious drawbacks. 
# Additional Context I list the new APIs here: ```python torch.accelerator.is_available() -> bool: torch.accelerator.current_accelerator() -> torch.device: torch.accelerator.device_count() -> int: torch.accelerator.current_device_idx() -> int: torch.accelerator.set_device_idx(device: Union[torch.device, str, int, None]) -> None: torch.accelerator.current_stream(device: Union[torch.device, str, int, None]) -> torch.Stream: torch.accelerator.set_stream(stream: torch.Stream) -> None: torch.accelerator.synchronize(device: Union[torch.device, str, int, None]) -> None: ``` Following the discussion with Alban, we decided to rename `set_device` to `set_device_idx` and `current_device` to `current_device_idx` to be more explicit. A separate PR will add support for device and stream context managers. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132204 Approved by: https://github.com/EikanWang, https://github.com/abhilash1910, https://github.com/gujinghui, https://github.com/albanD	2024-10-27 10:37:09 +00:00
Yifu Wang	ed9169df98	Removed the typing information for already deleted ProcessGroupCudaP2P (#138753 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138753 Approved by: https://github.com/weifengpy	2024-10-25 00:32:07 +00:00
Shuqiang Zhang	4c91481656	[c10d] allow sub group to be eagerly inited even if default one is not (#138665 ) Summary: Currently, eager mode is applied either to all PGs or to NONE of them. There are cases where we don't want to initialize the comms for the default PG but still want to initialize the comms for a sub PG. Now, with a device_id passed to the new group, this case is supported. Test Plan: newly added UT Tags: Resolves https://github.com/pytorch/pytorch/issues/137018 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138665 Approved by: https://github.com/kwen2501 ghstack dependencies: #138781	2024-10-24 23:51:28 +00:00
Simon Fan	49fa437097	[compiled autograd] Compiled autograd configs in TLS (#137821 ) Multithreaded doesn't work yet, this adds python side TLS only for the python side state Pull Request resolved: https://github.com/pytorch/pytorch/pull/137821 Approved by: https://github.com/jansel, https://github.com/yf225 ghstack dependencies: #137953	2024-10-22 08:03:52 +00:00
Syed Tousif Ahmed	03c72976a5	Properly uses ref-counting for torch.cuda.use_mem_pool (#133600 ) This PR refactors some ref-counting functionality out of `beginAllocateToPool` and `releasePool`. The ref-counting logic is then used in construction and destruction of `torch.cuda.MemPool`. The `use_count` variable in the CUDACachingAllocator is essentially a refcount of how many context managers are using the pool. Since we are now lifting up the MemPool abstraction to the user, the MemPool object itself now needs to hold an extra reference as well. Part of https://github.com/pytorch/pytorch/issues/124807. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133600 Approved by: https://github.com/eqy, https://github.com/ezyang	2024-10-22 03:21:53 +00:00
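The ref-counting idea in the commit above (a count of active context managers, plus one extra reference held by the pool object itself) can be sketched in plain Python. This is a toy model under stated assumptions, not the CUDACachingAllocator code: `ToyPool`, `toy_use_pool`, `close`, and the `freed` flag are hypothetical names introduced only for illustration.

```python
from contextlib import contextmanager


class ToyPool:
    """Hypothetical stand-in for a memory pool; not torch.cuda.MemPool."""

    def __init__(self):
        # The pool object itself holds one reference for its own lifetime,
        # so the count cannot reach zero while the user still owns the pool.
        self.use_count = 1
        self.freed = False

    def _acquire(self):
        self.use_count += 1

    def _release(self):
        self.use_count -= 1
        if self.use_count == 0:
            # Last reference dropped: the pool's resources can be reclaimed.
            self.freed = True

    def close(self):
        """Drop the reference held by the pool object itself."""
        self._release()


@contextmanager
def toy_use_pool(pool):
    """Each active context manager contributes one reference to the pool."""
    pool._acquire()
    try:
        yield pool
    finally:
        pool._release()


pool = ToyPool()
with toy_use_pool(pool):
    with toy_use_pool(pool):
        inner_count = pool.use_count       # own reference + two contexts
count_after_contexts = pool.use_count      # only the pool's own reference
freed_before_close = pool.freed
pool.close()                               # drops the final reference
```

In this sketch, exiting every context manager does not release the pool, because the object's own reference keeps the count above zero until `close()` is called. That mirrors the motivation given in the commit message for holding the extra reference once the pool is exposed to users.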
Jeff Daily	3f3b692a00	[ROCm] CK-based GEMM (#131004 ) - composable_kernel as a third_party submodule - "ck" as a `torch.backends.cuda.preferred_linalg_library()` - reference CK gemm implementations for float, bfloat16, and half types Pull Request resolved: https://github.com/pytorch/pytorch/pull/131004 Approved by: https://github.com/xw285cornell, https://github.com/pruthvistony Co-authored-by: Andres Lugo <Andy.LugoReyes@amd.com> Co-authored-by: Pruthvi Madugundu <pruthvigithub@gmail.com>	2024-10-20 02:57:43 +00:00
Yu, Guangye	8cda774a03	Add torch.xpu.get_arch_list and torch.xpu.get_gencode_flags for XPU (#137773 ) # Motivation Add `torch.xpu.get_arch_list()` and `torch.xpu.get_gencode_flags()` methods that return architecture list and AOT flags to preserve what flags PyTorch XPU was built with. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137773 Approved by: https://github.com/EikanWang, https://github.com/albanD	2024-10-18 02:28:08 +00:00
PyTorch MergeBot	dd32a32cb6	Revert "Expose option to disable CRC-32 computation during `torch.save` (#137735 )" This reverts commit `534fa96f2d`. Reverted https://github.com/pytorch/pytorch/pull/137735 on behalf of https://github.com/clee2000 due to failing internally D64438525, probably needs gating ([comment](https://github.com/pytorch/pytorch/pull/137735#issuecomment-2417412264))	2024-10-16 17:03:06 +00:00
PyTorch MergeBot	361f42bc42	Revert "[compiled autograd] Compiled autograd configs in TLS (#137821 )" This reverts commit `9aba0b91c8`. Reverted https://github.com/pytorch/pytorch/pull/137821 on behalf of https://github.com/wdvr due to Reverting this for now, it is failing test_public_bindings in trunk ([comment](https://github.com/pytorch/pytorch/pull/137821#issuecomment-2417351788))	2024-10-16 16:38:29 +00:00
William Wen	4c8718d8e7	[dynamo] add torch.compiler.set_stance (#137504 ) Attempt # 2 at https://github.com/pytorch/pytorch/pull/132926 to implement https://github.com/pytorch/pytorch/issues/123771. Implement a new `torch.compiler.set_stance` function that can force `torch.compile` regions to run eagerly. See added tests for usage examples. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137504 Approved by: https://github.com/yf225, https://github.com/jansel	2024-10-16 16:18:25 +00:00
Simon Fan	9aba0b91c8	[compiled autograd] Compiled autograd configs in TLS (#137821 ) Multithreaded doesn't work yet, this adds python side TLS only for the python side state Pull Request resolved: https://github.com/pytorch/pytorch/pull/137821 Approved by: https://github.com/jansel, https://github.com/yf225 ghstack dependencies: #137953	2024-10-16 09:28:32 +00:00

1178 Commits