Fix the following compilation error, where the forward-declared `torch::jit::AliasDb::WriteRegistry` is still an incomplete type at the point of `AliasDb`'s defaulted destructor:
```
/usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/14.2.1/../../../../include/c++/14.2.1/bits/unique_ptr.h:91:16: error: invalid application of 'sizeof' to an incomplete type 'torch::jit::AliasDb::WriteRegistry'
91 | static_assert(sizeof(_Tp)>0,
| ^~~~~~~~~~~
/usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/14.2.1/../../../../include/c++/14.2.1/bits/unique_ptr.h:399:4: note: in instantiation of member function 'std::default_delete<torch::jit::AliasDb::WriteRegistry>::operator()' requested here
399 | get_deleter()(std::move(__ptr));
| ^
../torch/csrc/jit/ir/alias_analysis.cpp:200:10: note: in instantiation of member function 'std::unique_ptr<torch::jit::AliasDb::WriteRegistry>::~unique_ptr' requested here
200 | AliasDb::~AliasDb() = default;
| ^
../torch/csrc/jit/ir/alias_analysis.cpp:200:23: note: in defaulted destructor for 'torch::jit::AliasDb' first required here
200 | AliasDb::~AliasDb() = default;
| ^
../torch/csrc/jit/ir/alias_analysis.h:298:10: note: forward declaration of 'torch::jit::AliasDb::WriteRegistry'
298 | struct WriteRegistry;
| ^
1 error generated.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148758
Approved by: https://github.com/Skylion007
**Background**: I've been comparing performance of torch.compile vs. torch.export + AOTI (specifically, loaded from Python) on the Flux model and found a ~1.4% performance decrease with the latter. The trace shows that CUDAGraphs are not utilized for torch.export + AOTI, leading to higher overhead.
When trying to manually CUDAGraph the loaded, previously exported + AOTIed model (thanks to @eellison for the logic here), I get:
```
Error: operation not permitted when stream is capturing
```
@desertfire confirms that this is due to multi-threading logic on the AOTI runtime side (in `AOTIModelContainer` / `AOTIModel`) conflicting with the use of CUDAGraphs.
**Fix**: This PR takes the approach of providing an alternate, single-threaded method for running loaded models with the AOTI runtime. Details:
* Python side introduces a new flag to enable this behavior (needs a better name): `torch._inductor.package.load_package(..., run_single_threaded=False)`
* This flag is passed down to the C++ side's `AOTIModelPackageLoader`, which passes it to the `CreateAOTIModelRunnerFunc` during `AOTIModelContainerRunner` construction.
* C++ side introduces single-threaded alternatives to model running and model container running:
* `AOTIModelContainer.run_single_threaded()` / `AOTIModel.run_single_threaded()`. The interfaces match those of `run()`, but the synchronization logic has been removed.
* Introduces `AOTInductorModelContainerRunSingleThreaded` to AOTI's `interface.h`; this is invoked by the `AOTIModelContainerRunner` utility class when `run_single_threaded=true`.
I've verified on both a small repro and my real-world use case that I can manually CUDAGraph a loaded model that was previously exported + AOTIed.
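For reference, a minimal sketch of the manual CUDA-graph wrapping this enables (the `model.pt2` path, input shape, and warmup count are illustrative, not from this PR):
```python
import torch
from torch._inductor.package import load_package

# Load the exported + AOTI-compiled model in single-threaded mode so graph
# capture isn't broken by the runtime's synchronization logic.
compiled = load_package("model.pt2", run_single_threaded=True)

static_inp = torch.randn(8, 64, device="cuda")

# Warm up on a side stream before capturing.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        compiled(static_inp)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    # Previously failed with "operation not permitted when stream is capturing".
    static_out = compiled(static_inp)

static_inp.copy_(torch.randn(8, 64, device="cuda"))
g.replay()  # re-runs the captured kernels against the updated static input
```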
**Future work:**
* Flip default value to `run_single_threaded=True` as Python-side inference doesn't take advantage of the AOTI runtime thread pool
* There are some BC concerns here - models need to be re-serialized so the .so contains the new `AOTInductorModelContainerRunSingleThreaded` interface func. We can flip the default value and warn (instead of crashing) if the `AOTInductorModelContainerRunSingleThreaded` symbol does not exist.
* Compose with cudagraph trees as opposed to manual cuda graph wrapping
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148601
Approved by: https://github.com/desertfire
Decorating the header with `C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wmismatched-new-delete")` suppresses the following warning (seen when building against ancient llvm-9):
```
In file included from /var/lib/jenkins/workspace/torch/csrc/jit/tensorexpr/llvm_codegen.cpp:24:
/opt/llvm/include/llvm/IR/IRBuilder.h: In member function 'llvm::LoadInst* llvm::IRBuilder<T, Inserter>::CreateLoad(llvm::Type*, llvm::Value*, const llvm::Twine&) [with T = llvm::ConstantFolder; Inserter = llvm::IRBuilderDefaultInserter]':
/opt/llvm/include/llvm/IR/IRBuilder.h:1581:19: error: 'static void llvm::User::operator delete(void*)' called on pointer returned from a mismatched allocation function [-Werror=mismatched-new-delete]
1581 | return Insert(new LoadInst(Ty, Ptr), Name);
| ^~~~~~~~~~~~~~~~~~~~~
/opt/llvm/include/llvm/IR/IRBuilder.h:1581:19: note: returned from 'static void* llvm::UnaryInstruction::operator new(size_t)'
```
A reasonable follow-up would probably be to disable NNC testing altogether, as the project has been in maintenance mode for a while now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148763
Approved by: https://github.com/Skylion007, https://github.com/ZainRizvi, https://github.com/atalman
ghstack dependencies: #148739
Fixes #148208. There are solutions for exposing symbols that inline functions reference implicitly (i.e., inline function A calls non-inline function B in foo.h; code that includes foo.h has to see the symbol B in the DLL).
Solution 1: tag the entire struct whose member functions are defined inline with TORCH_PYTHON_API --- this PR does this for python_arg_parser.h. An alternative exists but would slow down dispatching a lot --- drop the inline keyword and move the implementations to the .cc file.
Solution 2: tag individual functions with TORCH_PYTHON_API. This PR does this for python_tensor.h.
Related discussion about hiding torch_python symbols: https://github.com/pytorch/pytorch/pull/142214
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148213
Approved by: https://github.com/malfet
Summary:
LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.
This diff either (a) removes an unused variable and, possibly, its associated code or (b) qualifies the variable with `[[maybe_unused]]`.
- If you approve of this diff, please use the "Accept & Ship" button :-)
Test Plan: Sandcastle
Reviewed By: dtolnay
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148501
Approved by: https://github.com/Skylion007
This is a forward fix for #135338.
It hits an error like this:
```
"distributed_c10d.py", line 2156, in destroy_process_group
if type(pg) == ProcessGroup and pg._has_hooks():
RuntimeError: Could not find the default backend type 0 for Process Group with name undefined.
```
When users call `init_process_group()` without specifying a backend, the default backend is not set (or is set to `undefined`), hence the error above, which is triggered by the `_has_hooks()` call.
The fix wraps `getDefaultBackend` in a try-catch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148596
Approved by: https://github.com/LucasLLC, https://github.com/fduwjj
This is an initial attempt to provide some statistics for the pinned host memory allocations flowing through CachingHostAllocator. Many times in the past we have had inexplicable slowdowns that would be much easier to diagnose if we had some host memory characteristics.
This change tries very hard not to disrupt the initial design of the allocator, and it uses the existing locking mechanisms, wherever possible, to gather statistics "for free". The only deviation is on the "slow path", where we incur CUDA calls anyway, so taking a short lock is not going to hurt performance much, especially in the steady state where most allocations come from the cache.
As mentioned before, this is a first PR to introduce the concept and to see if it fits the right paradigm. We can always add more later.
Metrics that would require more involved changes to the code base and locks, like requested memory, have been punted for now. I also tried to reuse the Stat structure used in the CUDA caching allocator, in order to maintain symmetry.
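To make the scope concrete, these are the kinds of allocations the statistics would cover (a small sketch; sizes are illustrative):
```python
import torch

# Pinned (page-locked) host buffers are served by the CachingHostAllocator.
host = torch.empty(1 << 20, dtype=torch.uint8, pin_memory=True)   # slow path on a cache miss
dev = host.to("cuda", non_blocking=True)                          # async H2D copy enabled by pinning
del host
host2 = torch.empty(1 << 20, dtype=torch.uint8, pin_memory=True)  # ideally a cache hit in steady state
```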
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147660
Approved by: https://github.com/ngimel
This PR fixes an issue of inability to capture `isend`/`irecv` ops in `async` mode.
<details>
<summary>The repro code</summary>
```Python
import os
import torch
import torch.distributed as dist

USE_ASYNC = True

def test_func(x, rank):
    if rank == 0:
        x += 1
        # Send the tensor to process 1
        if USE_ASYNC:
            a = dist.isend(tensor=x, dst=1)
        else:
            dist.send(tensor=x, dst=1)
    else:
        # Receive tensor from process 0
        if USE_ASYNC:
            a = dist.irecv(tensor=x, src=0)
        else:
            dist.recv(tensor=x, src=0)
    if USE_ASYNC:
        a.wait()
    return x + 2

def run(rank):
    torch.cuda.set_device(rank)
    x = torch.ones(1, device='cuda')
    with torch.cuda.stream(torch.cuda.Stream()):
        for i in range(11):
            x.copy_(torch.ones(1, device='cuda'))
            y = test_func(x, rank)
        print(f"Rank{rank} has data {y} in warmup")
        torch.cuda.synchronize()

    graph = torch.cuda.CUDAGraph()
    x.copy_(torch.ones(1, device='cuda'))
    with torch.cuda.graph(graph):
        y = test_func(x, rank)

    for i in range(1):
        x.copy_(torch.ones(1, device='cuda'))
        graph.replay()
        print(f"Rank{rank} has data {y} after graph replay")

def main():
    rank = int(os.environ['RANK'])
    local_rank = int(os.environ['LOCAL_RANK'])
    world_size = int(os.environ['WORLD_SIZE'])
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    run(local_rank)

if __name__ == "__main__":
    main()
```
</details>
It fails with an error stating that the work handle is of NoneType:
```
[rank1]: Traceback (most recent call last):
[rank1]: File "/workspace/repro.py", line 54, in <module>
[rank1]: main()
[rank1]: File "/workspace/repro.py", line 51, in main
[rank1]: run(local_rank)
[rank1]: File "/workspace/repro.py", line 38, in run
[rank1]: y = test_func(x, rank)
[rank1]: ^^^^^^^^^^^^^^^^^^
[rank1]: File "/workspace/repro.py", line 22, in test_func
[rank1]: a.wait()
[rank1]: ^^^^^^
[rank1]: AttributeError: 'NoneType' object has no attribute 'wait'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148462
Approved by: https://github.com/kwen2501
Currently, recorded profiler events for aten ops do not store overload names. It would be useful to know which overloads are actually called to analyse performance.
For example, consider the following dispatch trace which occurs if there is a fallthrough kernel registered for aten::add:
```
[call] op=[aten::add.Tensor], key=[AutogradCPU]
[redispatch] op=[aten::add.Tensor], key=[Undefined]
[call] op=[aten::empty.memory_format], key=[BackendSelect]
[redispatch] op=[aten::empty.memory_format], key=[CPU]
[call] op=[aten::add.out], key=[CPU]
```
In this case, aten::add.out is a child of aten::add.Tensor; however, the current profiler trace provides no way to differentiate between the overloads that were called.
See the added unit test for a more detailed example.
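As a rough illustration of where the overload name would show up (a sketch only; whether the suffix is recorded, and under which profiler option, depends on this PR's implementation):
```python
import torch
from torch.profiler import profile, ProfilerActivity

a, b = torch.randn(16, 16), torch.randn(16, 16)
with profile(activities=[ProfilerActivity.CPU]) as prof:
    a + b  # dispatches aten::add.Tensor, which redispatches to aten::add.out

# Event names for the recorded aten ops; with overload names these would read
# e.g. "aten::add.Tensor" rather than just "aten::add".
for evt in prof.key_averages():
    print(evt.key)
```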
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143114
Approved by: https://github.com/sraikund16
torch.compile doesn't work on windows so we can ifdef-away the problem.
I do not know what the root cause actually is. Most notably, the pytorch
windows build is fine, but some third-party projects that use pytorch headers
on windows (e.g. torchaudio) have issues.
Test Plan:
- wait for CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148454
Approved by: https://github.com/atalman, https://github.com/xmfan
1. My company uses the privateuseone key to integrate a new hardware device and requires the `batch_isend_irecv` function. However, `batch_isend_irecv` is currently only enabled for CUDA, so I add a `supports_coalescing` property in `c10d::Backend` to let a backend declare whether it supports coalescing (see the usage sketch after this list).
2. If `pg._has_hooks` returns True, we don't need to check whether the current device is CUDA, so privateuseone can also support `pg._wait_for_pending_works`.
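A minimal sketch of the `batch_isend_irecv` usage this unlocks on a non-CUDA backend (assumes a 2-rank process group has already been initialized for the device in question; device/backend setup is omitted):
```python
import torch
import torch.distributed as dist

rank = dist.get_rank()
peer = 1 - rank
t = torch.ones(4) if rank == 0 else torch.zeros(4)

# Rank 0 sends, rank 1 receives; the ops are coalesced into one batch.
op = dist.P2POp(dist.isend if rank == 0 else dist.irecv, t, peer)
reqs = dist.batch_isend_irecv([op])
for req in reqs:
    req.wait()
```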
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135338
Approved by: https://github.com/kwen2501, https://github.com/albanD
Motivation
===
This PR is part of the OneDNN upstreaming plan, as stated in #114848 [(comment)](https://github.com/pytorch/pytorch/issues/114848#issuecomment-2451553203). SDPA support is provided via the overridable variant on the XPU backend. Besides the added `Attention.cpp` file, `Graph.h` is added to hold utilities for the OneDNN graph, including those for kernel/compiled-graph caching. In addition, a selection of test cases in `test/test_transformers.py` are copied into the new `test/xpu/test_transformers.py` and modified accordingly to provide additional coverage beyond `./third_party/torch-xpu-ops/test/xpu/test_ops_xpu.py`.
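A small usage sketch of the path being enabled (assumes an XPU build with the oneDNN graph support from the dependencies below; shapes and dtype are illustrative):
```python
import torch
import torch.nn.functional as F

q = torch.randn(2, 8, 128, 64, device="xpu", dtype=torch.float16)
k = torch.randn(2, 8, 128, 64, device="xpu", dtype=torch.float16)
v = torch.randn(2, 8, 128, 64, device="xpu", dtype=torch.float16)

# Dispatches to the XPU SDPA implementation backed by the oneDNN graph API.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```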
Depends on OneDNN version v3.7 upgrade in #147498
Depends on BUILD_GRAPH switch in #147608
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147614
Approved by: https://github.com/jansel, https://github.com/EikanWang
Summary:
LLVM has a warning `-Wunused-value` which we treat as an error because it's so often diagnostic of a code issue. Unused values often indicate a programming mistake, but can also just be unnecessary cruft that harms readability and performance.
For questions/comments, contact r-barnes.
- If you approve of this diff, please use the "Accept & Ship" button :-)
Test Plan: Sandcastle
Differential Revision: D69945678
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147555
Approved by: https://github.com/Skylion007, https://github.com/eqy
## Before
Previously, CA would always unpack all saved variables stored in the autograd graph before executing it. This meant that we couldn't capture unpack hooks as part of the CA graph, and they would fire out of order with respect to other backward hooks. For memory-saving APIs built on top of saved-tensor hooks, like non-reentrant checkpointing and offloading, we couldn't achieve any savings because all activations would be recomputed/loaded and active at the same time, making them a no-op.
## After
We add unpack hooks into the CA graph so that they can be executed progressively. The python hook and hook input themselves are wrapped by non-traceable code, so CA polyfills the wrapping as:
```python
# pseudocode
class SavedVariable:
def unpack(self):
if self.hook:
return self.hook(self.packed_data)
else:
return self.packed_data
# This approach won't directly work when we add support for Forward AD or double-backward.
```
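For reference, a sketch of the eager-mode saved-tensor hooks being polyfilled above (a simple offload-to-CPU pair; the actual hooks depend on the memory-saving API in use):
```python
import torch

def pack(t):            # runs at forward time, when an activation is saved
    return t.to("cpu")

def unpack(packed):     # runs at backward time, when the activation is needed
    return packed.to("cuda")

x = torch.randn(1024, 1024, device="cuda", requires_grad=True)
with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
    y = (x @ x).sum()

# unpack() fires during backward; with this change CA can delay it until the
# corresponding node of the CA graph actually consumes the activation.
y.backward()
```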
When directly executing the CA graph (without torch.compiling it) under checkpointing/offloading, the memory profile is expected to stay the same as with the eager autograd engine. If an AOT backward is in the autograd graph, the memory profile is expected to be better than with the eager autograd engine, since we can now delay unpacking of saved activations into the AOT backward's execution.
All tests pass when running the CA graph directly; the remaining issues are in Dynamo.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147242
Approved by: https://github.com/jansel
Resubmission of #144974 which was reverted for unrelated reasons.
Newer matmul kernels, e.g. those targeting Hopper GPUs, sometimes use a "persistent" schedule, which consists of launching as many CUDA blocks as there are SMs on the GPU, with each such block then working on multiple output tiles in a row. This makes it possible to eliminate the overhead of starting and finishing each tile, effectively doing cross-tile pipelining. In previous generations these latencies could be hidden by having multiple CUDA blocks per SM but, with blocks becoming larger, only one can run at a time per SM, and thus this needs to be taken care of in software.
Persistent kernels become an issue when other kernels are running concurrently. The classical example is a NCCL communication kernel running in the background. In such cases the matmul expects to be able to use all the SMs but is prevented from doing so because some of them are busy. This can lead to its blocks being scheduled as two separate waves on the available SMs. This "wave quantization" can double the latency of the matmul kernels.
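A back-of-the-envelope illustration of the wave-quantization effect (SM counts are illustrative):
```python
import math

num_sms = 132                  # e.g. an H100
busy_sms = 12                  # occupied by a background NCCL kernel
blocks = num_sms               # a persistent matmul launches one block per SM

waves = math.ceil(blocks / (num_sms - busy_sms))
print(waves)  # 2 -> the matmul now takes roughly twice as long as one full wave
```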
While we wait for smarter solutions, such as automatic load balancing among the blocks, an easy way to unblock ourselves is to tell the matmuls to only use a subset of the GPU's SMs. For this, I am introducing a global `sm_carveout` flag which can be used to specify how many SMs should be left available for other kernels.
For now I only change the cuBLAS kernels and the scaled-mm CUTLASS kernel. More kernels can be opted-in later.
I tested this change manually, by using the Kineto profiler to look up the grid size of a scaled-mm kernel with different values of `sm_carveout`, and making sure it changed. Suggestions are welcome for a more automated test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147966
Approved by: https://github.com/danthe3rd
Summary:
# Summary
### Sticky points
Cuda-graph rng handling has changed / deviated from original implementation. We will be left with a dangling 'offset' val and confusing naming due to BC
## Dependencies
- Flash PR: https://github.com/Dao-AILab/flash-attention/pull/1419
### Other Points
- The BC linter is complaining about losing generate.py and its functions, which is not a real BC surface
cc albanD
imported-using-ghimport
Test Plan:
Imported from OSS
Building in dev
`buck build @//mode/dev-nosan -c fbcode.nvcc_arch=h100a //caffe2:ATen-cu --show-full-output `
Running `nm` on the .so, I do see that the flash symbols are correctly named:
```
0000000001c3dfb0 t pytorch_flash::run_mha_bwd(pytorch_flash::Flash_bwd_params&, CUstream_st*)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
0000000001c36080 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}::operator()() const::{lambda()#6}::operator()() const
0000000001c360e0 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
0000000001c35fc0 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#6}::operator()() const
0000000001c36020 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
```
Reviewed By: vkuzo
Differential Revision: D68502879
Pulled By: drisspg
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146372
Approved by: https://github.com/jbschlosser
Newer matmul kernels, e.g. those targeting Hopper GPUs, sometimes use a "persistent" schedule, which consists of launching as many CUDA blocks as there are SMs on the GPU, with each such block then working on multiple output tiles in a row. This makes it possible to eliminate the overhead of starting and finishing each tile, effectively doing cross-tile pipelining. In previous generations these latencies could be hidden by having multiple CUDA blocks per SM but, with blocks becoming larger, only one can run at a time per SM, and thus this needs to be taken care of in software.
Persistent kernels become an issue when other kernels are running concurrently. The classical example is a NCCL communication kernel running in the background. In such cases the matmul expects to be able to use all the SMs but is prevented from doing so because some of them are busy. This can lead to its blocks being scheduled as two separate waves on the available SMs. This "wave quantization" can double the latency of the matmul kernels.
While we wait for smarter solutions, such as automatic load balancing among the blocks, an easy way to unblock ourselves is to tell the matmuls to only use a subset of the GPU's SMs. For this, I am introducing a global `sm_carveout` flag which can be used to specify how many SMs should be left available for other kernels.
For now I only change the cuBLAS kernels and the scaled-mm CUTLASS kernel. More kernels can be opted-in later.
I tested this change manually, by using the Kineto profiler to look up the grid size of a scaled-mm kernel with different values of `sm_carveout`, and making sure it changed. Suggestions are welcome for a more automated test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144974
Approved by: https://github.com/eqy, https://github.com/albanD
This is a follow up to #147465 that changes most TORCH_CHECK calls in TCPStore and TCPStoreLibUvBackend to use typed exceptions instead of generic `TORCH_CHECK` calls which end up as RuntimeErrors in Python.
Test plan:
```
pytest test/distributed/test_store.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147647
Approved by: https://github.com/fduwjj
Summary:
Seeing errors like the following when testing sigmoid for the inline_cvr and perevent_cvr models.
```
terminate called after throwing an instance of 'c10::Error'
what(): forward() Expected a value of type 'Dict[int, Tuple[Tensor, Tensor, Tensor]]' for argument 'event_based_features' but instead found type 'Dict[Any, Any]'.
```
Let an empty dict pass the type check.
Test Plan:
```
MODEL_ENTITY_ID=691508446
SNAPSHOT_ID=0
OTHER_MODEL_ENTITY_ID=649645886
OTHER_SNAPSHOT_ID=0
MODULE=local
buck2 run mode/opt caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- \
--loadMode=BenchmarkAB \
--inputNetFile=/data/users/${USER}/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}${suffix} \
--otherNetFile=/data/users/${USER}/models/${OTHER_MODEL_ENTITY_ID}/${OTHER_SNAPSHOT_ID}/${OTHER_MODEL_ENTITY_ID}_${OTHER_SNAPSHOT_ID}${suffix} \
--moduleName=${module} \
--submodToDevice "" \
--benchmarkDontRebatchSamples=true \
--sampleInputFilePath=/data/users/${USER}/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/archive_.predictor.disagg.gpu.local/data/sample_inputs/local.pt
```
Reviewed By: yjhao
Differential Revision: D69871393
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147480
Approved by: https://github.com/henryoier, https://github.com/jeanschmidt
This makes it easier to roll out `TORCHELASTIC_USE_AGENT_STORE` by opportunistically swallowing bind errors when the agent store is enabled and the port matches `MASTER_PORT`.
This should be very safe: if the store is somehow not up and the envs are set, the TCPStore client connections will fail to connect, so we end up with a slightly different error message, but success/failure behavior is identical.
This also pybinds `c10d::SocketError` into Python so we can assert on the error type in tests.
https://docs.google.com/document/d/1CzOn_N53AiFxWGgbyMWSnd2elCJd4lZ-ajPg2lzcxoM/edit?tab=t.0#heading=h.2j2f5dimrdau
Test plan:
```
pytest test/distributed/test_store.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147465
Approved by: https://github.com/fduwjj
#143063 was missing handling for a couple of UCS cases and had some bugs in the way it dealt with errors.
- Fix all the UCS handling (and make some of the common code more common)
- Make sure all the error paths return `nullptr`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147436
Approved by: https://github.com/jansel
Summary:
Continuing the work from https://github.com/pytorch/pytorch/pull/146427
Adds the `torch.float8_e8m0fnu` dtype to PyTorch, as detailed in
https://github.com/pytorch/pytorch/issues/146414 . Please see the issue for a detailed definition of the format. Example of basic functionality:
```python
import torch
# round trip
x0 = torch.randn(4, 4, dtype=torch.float32)
x1 = x0.to(torch.float8_e8m0fnu) # RNE rounding
x2 = x1.to(torch.float32) # 2 ** exponent
# creation with empty
x0 = torch.empty(4, 4, dtype=torch.float8_e8m0fnu)
# printing
print(x0)
```
Done in this PR:
* numerical correctness
* op coverage (except for `torch._scaled_mm`): create tensor, cast to/from float32
* printing a tensor works
For future PRs:
* performance optimizations for casting
* torch._scaled_mm
* PT2
* various cleanups (detailed in comments with issue numbers)
Test Plan:
```
pytest test/quantization/core/experimental/test_float8.py -s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147466
Approved by: https://github.com/drisspg
This PR and the previous:
- Moves parts of `eval_frame.c` to C++.
- Reduces code duplication in `dynamo__custom_eval_frame` and makes the control flow more clear.
- Enables `convert_frame` to signal to `eval_frame.cpp` in a general manner how to evaluate this frame, recursive frames, and future frames with the same code object (default/compile, skip, run-only). e.g. this will allow us to change skipping/cache limit hit eval_frame behavior directly from convert_frame without requiring changes to C/C++.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146355
Approved by: https://github.com/jansel
ghstack dependencies: #145603
This PR is to add `torch._scaled_mm` for CPU backend.
`_scaled_mm_out_cpu` and `_scaled_mm_cpu` are newly added and included in the `torch._scaled_mm` CPU dispatch. We also add `_scaled_mm_out_cpu_emulated` as a fallback function if the current platform cannot run FP8 matmul using oneDNN. This PR also updates the various FP8-related UTs to support CPU tests.
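A minimal sketch of the newly supported CPU path (shapes are illustrative; per-tensor fp32 scales of 1.0 are used for simplicity):
```python
import torch

a = torch.randn(32, 64).to(torch.float8_e4m3fn)
# Column-major mat2, matching what the CUDA path expects; kept here for uniformity.
b = torch.randn(64, 16).to(torch.float8_e4m3fn).t().contiguous().t()

out = torch._scaled_mm(
    a, b,
    scale_a=torch.tensor(1.0),
    scale_b=torch.tensor(1.0),
    out_dtype=torch.bfloat16,
)
```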
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139975
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet
Some of the Windows files (fused_kernels.cpp or temp_file.h) contain code that fails to compile when this flag is enabled in a clang-cl build.
This PR resolves the issue by ensuring that even if we build with clang-cl, it doesn't include those flags on windows.
Alternatively if needed, I can fix the files mentioned to pass under this flag.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146981
Approved by: https://github.com/cyyever, https://github.com/Skylion007
# Motivation
Add the context variable `torch.backends.mkldnn.allow_tf32` to control tf32 computation in convolution kernels on the XPU side. The tf32 data type is beneficial for improving the performance of deep learning workloads during training/inference. The current PR uses the [oneDNN API fpmath_mode](https://oneapi-src.github.io/oneDNN/dev_guide_attributes_fpmath_mode.html#the-floating-point-math-mode-attribute) to trigger tf32 acceleration in convolution kernels.
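A usage sketch of the new flag (assumes an XPU build; the convolution shape mirrors the first verbose line in the validation section below):
```python
import torch

torch.backends.mkldnn.allow_tf32 = True        # enable tf32 math mode for oneDNN convolutions
conv = torch.nn.Conv2d(16, 33, kernel_size=3, stride=2).to("xpu")
x = torch.randn(20, 16, 50, 100, device="xpu")
y = conv(x)                                     # expected to run with attr-fpmath:tf32
torch.backends.mkldnn.allow_tf32 = False        # restore strict fp32
```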
# Validation
* ut to test context variable
`python test/xpu/test_conv.py -k test_mkldnn_allow_tf32_get_set`
* Runtime exemplification
```
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_f32::blocked:abcd::f0 wei_f32::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_f32::blocked:abcd::f0,attr-scratchpad:user attr-fpmath:tf32,alg:convolution_direct,mb20_ic16oc33_ih50oh24kh3sh2dh0ph0_iw100ow49kw3sw2dw0pw0,0.649902
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_f32::blocked:abcd::f0 wei_f32::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_f32::blocked:abcd::f0,attr-scratchpad:user attr-fpmath:tf32,alg:convolution_direct,mb20_ic33oc33_ih24oh24kh3sh1dh0ph1_iw49ow49kw3sw1dw0pw1,0.151855
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,backward_data,src_f32::blocked:abcd::f0 wei_f32::blocked:abcd::f0 bia_undef::undef::: dst_f32::blocked:abcd::f0,attr-scratchpad:user attr-fpmath:tf32,alg:convolution_direct,mb20_ic33oc33_ih24oh24kh3sh1dh0ph1_iw49ow49kw3sw1dw0pw1,0.167969
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,backward_weights,src_f32::blocked:abcd::f0 wei_f32::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_f32::blocked:abcd::f0,attr-scratchpad:user attr-fpmath:tf32,alg:convolution_direct,mb20_ic33oc33_ih24oh24kh3sh1dh0ph1_iw49ow49kw3sw1dw0pw1,0.26709
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,backward_weights,src_f32::blocked:abcd::f0 wei_f32::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_f32::blocked:abcd::f0,attr-scratchpad:user attr-fpmath:tf32,alg:convolution_direct,mb20_ic16oc33_ih50oh24kh3sh2dh0ph0_iw100ow49kw3sw2dw0pw0,0.219971
```
According to the `fpmath:tf32` field in the verbose output, the current context-setting utils successfully trigger tf32 computation in the conv forward/backward_data/backward_weights kernels.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137570
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/atalman, https://github.com/malfet
Co-authored-by: Yu, Guangye <guangye.yu@intel.com>
Summary:
Generate two helper functions for enum classes in generated_serialization_types.h
printEnum: will convert enum values into strings.
parseEnum: will convert strings into enum values.
Test Plan: CI
Differential Revision: D69604850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147126
Approved by: https://github.com/yiming0416
Summary:
(1) Make sure CPU and GPU don't have different implementations and behavior when called through the same path and API. The only difference between CPU and GPU after this PR should be the running hardware.
(2) This PR fixes the memory-access issue when `it == constants_map.end()`
(3) This PR resolves T179437596
Test Plan: buck2 run mode/dev sigmoid/inference/test:e2e_test_cpu
Differential Revision: D68540744
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145459
Approved by: https://github.com/desertfire, https://github.com/hl475
**Summary**
It's part of the task to enable max-autotune with GEMM template for WoQ INT4 GEMM on CPU.
This PR adds a lowering pass for `torch.ops.aten._weight_int4pack_mm_for_cpu`. This op is used for WoQ int4 in Torchao. The lowering pass is a prerequisite for max-autotune, which is planned to be enabled for this op in subsequent PRs.
**Test plan**
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_woq_int4
python test/inductor/test_cpu_cpp_wrapper.py -k test_woq_int4
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145250
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
ghstack dependencies: #145245
context here: https://fb.workplace.com/groups/326136610199609/permalink/495389539940981/
This PR is an attempt to make it such that if you create a tensor from an external buffer (using `UntypedStorage.from_buffer(buf)`), we can generate a proper fake tensor for you out of the box.
The annoying bit is that there are not any dispatcher ops to interpose on and change behavior. So instead, I took the manual C binding and tweaked the storage device to be "meta" if we see an active fake mode.
Put "poc" in the title since I... think this is hopefully reasonable, but I can be convinced that it's not :)
```
from torch._subclasses.fake_tensor import FakeTensorMode
import pickle
import io
import torch
from contextlib import nullcontext

use_fake_tensor = True

with FakeTensorMode() if use_fake_tensor else nullcontext():
    obj = [1, 2]
    f = io.BytesIO()
    pickle.Pickler(f).dump(obj)
    byte_storage = torch.ByteStorage._from_buffer(f.getvalue())  # type: ignore[attr-defined]
    t = torch.ByteTensor(byte_storage)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146642
Approved by: https://github.com/zou3519
Summary: Currently inf is serialized as Infinity in JSON, which is not standard-compliant. Instead, we tweak all special floating-point values into strings and handle them at the JSON layer.
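A small sketch of the idea (generic Python `json`, not the actual serializer touched by this diff):
```python
import json
import math

def encode_float(x):
    # Special values are not representable in standard JSON, so emit strings.
    if math.isinf(x) or math.isnan(x):
        return str(x)  # "inf", "-inf", or "nan"
    return x

print(json.dumps({"scale": encode_float(float("inf"))}))  # {"scale": "inf"}
```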
Test Plan:
see D69060784
CI
Differential Revision: D69186425
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146490
Approved by: https://github.com/yiming0416
Summary:
When a Static Runtime graph node has sub-blocks, the memory planner does not consider the sub-blocks' inputs as inputs of the node. As a result, such inputs' lifetimes are computed incorrectly and the corresponding tensor memory is released earlier than required, which causes errors.
Differential Revision: D69195886
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146855
Approved by: https://github.com/swolchok
Summary: Public summary (shared with Github): This diff implements a C++-Python binding to enable `reset_peak_memory_stats`.
Test Plan: The test is implemented in the following diff.
Reviewed By: yuhc
Differential Revision: D68988673
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146710
Approved by: https://github.com/nautsimon