pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Zhengxu Chen	fe11d300ac	[nativert] Improve MPMCQueue tests. (#153154 ) Summary: - Use std::this_thread::yield and stop busy waiting. - Sort test file orders. Following up @swolchok's comment from https://github.com/pytorch/pytorch/pull/152837 Test Plan: CI Differential Revision: D74402536 Pull Request resolved: https://github.com/pytorch/pytorch/pull/153154 Approved by: https://github.com/Skylion007, https://github.com/cyyever	2025-05-09 19:25:42 +00:00
Alvaro-Kothe	e86b6b2a19	Add tests to check pretty print when padding is a string in C++ API (#153126 ) Currently there are no tests to verify the behaviour of pretty print when padding is `torch::kSame` or `torch::kValid`. This PR just adds these tests to check for future regressions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153126 Approved by: https://github.com/Skylion007	2025-05-08 17:55:25 +00:00
Zhengxu Chen	5bb154e6fd	[nativert] Move MPMCQueue to torch/nativert. (#152837 ) Summary: Torch Native Runtime RFC: https://github.com/zhxchen17/rfcs/blob/master/RFC-0043-torch-native-runtime.md To land the runtime into PyTorch core, we will gradually land logical parts of the code into the GitHub issue and get each piece properly reviewed. This diff adds a small library implementing a multi producer multi consumer queue which will be used to synchronize tasks for Torch Native Runtime. Differential Revision: D74184245 Pull Request resolved: https://github.com/pytorch/pytorch/pull/152837 Approved by: https://github.com/albanD, https://github.com/dolpm, https://github.com/swolchok	2025-05-07 21:17:42 +00:00
Yiming Zhou	1d7728056b	[nativert] Move TensorMeta to pytorch core (#152475 ) Summary: Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72 This diff moves `TensorMeta.cpp` and `TensorMeta.h` to PyTorch core under `torch/nativert/graph/` Existing `torch::_export::TensorMeta` in `torch/csrc/utils/generated_serialization_types.h` is auto-generated from the export serde schema and therefore contains only the most basic serializable types. We need the newly added `TensorMeta.cpp` to deserialize the metadata into an in-memory class with c10 types so that it can be consumed by the runtime later. Test Plan: Added test under `test/cpp/nativert/test_tensor_meta.cpp` Differential Revision: D73820548 Pull Request resolved: https://github.com/pytorch/pytorch/pull/152475 Approved by: https://github.com/albanD	2025-05-06 01:50:46 +00:00
Aaron Gokaslan	49b9efdf1f	[BE]: Cleanup traceutils with fmtlib (#152265 ) Simplify code and make it faster. Pull Request resolved: https://github.com/pytorch/pytorch/pull/152265 Approved by: https://github.com/albanD, https://github.com/cyyever	2025-05-04 22:27:19 +00:00
Julius Herb	8f54e56e62	Add optional device index to AOTIModelPackageLoader (#152093 ) This is my suggestion for resolving #152087 This PR extends the constructor of `AOTIModelPackageLoader` with an (optional) device index. The device type is still determined by `metadata_["AOTI_DEVICE_KEY"]`, but the `device_index` argument can be used to move an AOTI model package to different devices like `cuda:0`, `cuda:1`, ... in a convenient way. AFAIK, this is not possible so far using `AOTIModelPackageLoader` alone. The default case (no device index specified) with `metadata_["AOTI_DEVICE_KEY"] == "cuda"` would lead to the current behavior, i.e., the model is loaded to device `cuda`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/152093 Approved by: https://github.com/desertfire	2025-05-04 11:40:12 +00:00
Scott Wolchok	c7484805ca	Add two missing JIT tests to CMake (#152440 ) Looks like I forgot to add these. Pull Request resolved: https://github.com/pytorch/pytorch/pull/152440 Approved by: https://github.com/Skylion007	2025-04-30 16:18:55 +00:00
Scott Wolchok	520366e102	Fix StringCordView::substr after D73379178 / #151810 (#152304 ) Received complaint that we broke something. After a bunch of debugging, landed on this test + fix. Differential Revision: [D73754877](https://our.internmc.facebook.com/intern/diff/D73754877/) NOTE FOR REVIEWERS: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D73754877/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/152304 Approved by: https://github.com/Skylion007	2025-04-29 06:00:38 +00:00
PyTorch MergeBot	46419c7899	Revert "[Relandx2] Rewrite the guts of torch::jit::Lexer to speed it up (#152372 )" This reverts commit `7ce6f63214`. Reverted https://github.com/pytorch/pytorch/pull/152372 on behalf of https://github.com/malfet due to Looks like it broke distributed this time around, see `f05d3e5019/1` ([comment](https://github.com/pytorch/pytorch/pull/152372#issuecomment-2837426497))	2025-04-29 04:37:40 +00:00
Scott Wolchok	7ce6f63214	[Relandx2] Rewrite the guts of torch::jit::Lexer to speed it up (#152372 ) Reapplying with fix for linux-manylinux-2_28-py3-cpu-s390x / build failure (https://github.com/pytorch/pytorch/actions/runs/14716285820/job/41300304223#logs), which is to just update a pair of static_assert constants I got wrong. Pull Request resolved: https://github.com/pytorch/pytorch/pull/152372 Approved by: https://github.com/wdvr, https://github.com/malfet	2025-04-28 23:55:48 +00:00
PyTorch MergeBot	e7c19f4f69	Revert "Reapply "Rewrite the guts of torch::jit::Lexer to speed it up (#151850 )" (#152250 )" This reverts commit `e407ea1e5e`. Reverted https://github.com/pytorch/pytorch/pull/152250 on behalf of https://github.com/malfet due to Breaks s390, may be time to move build back to opt-in `2667cb69d9/1` ([comment](https://github.com/pytorch/pytorch/pull/152250#issuecomment-2836833030))	2025-04-28 22:05:12 +00:00
Scott Wolchok	e407ea1e5e	Reapply "Rewrite the guts of torch::jit::Lexer to speed it up (#151850 )" (#152250 ) Almost-exact reapply of #151850 (adding minor reviewer nits) . AFAICT it was reverted unnecessarily. Pull Request resolved: https://github.com/pytorch/pytorch/pull/152250 Approved by: https://github.com/Skylion007, https://github.com/cyyever	2025-04-28 19:33:40 +00:00
Anthony Shoumikhin	e2f9759bd0	Fix broken URLs (#152237 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/152237 Approved by: https://github.com/huydhn, https://github.com/malfet	2025-04-27 09:56:42 +00:00
cyy	65b845f82b	Remove useless options for third-party ONNX build (#147616 ) Treat ONNX CMake targets properly and remove unneeded options. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147616 Approved by: https://github.com/malfet	2025-04-26 02:34:08 +00:00
PyTorch MergeBot	fa1b4ef649	Revert "Rewrite the guts of torch::jit::Lexer to speed it up (#151850 )" This reverts commit `47d34261e0`. Reverted https://github.com/pytorch/pytorch/pull/151850 on behalf of https://github.com/ZainRizvi due to This codev PR is breaking on its internal counterpart diff D73129443. For codev PRs like this one, please always make sure the internal diff is green and then land the diff internally. The Github PR will be automatically merged ([comment](https://github.com/pytorch/pytorch/pull/151850#issuecomment-2831686141))	2025-04-26 00:44:11 +00:00
Scott Wolchok	47d34261e0	Rewrite the guts of torch::jit::Lexer to speed it up (#151850 ) The trie-based approach was, apparently, not efficient. This incidentally fixes a bug where "not inp" and "is note" were lexed incorrectly; see test_lexer.cpp update. Differential Revision: [D73129443](https://our.internmc.facebook.com/intern/diff/D73129443/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/151850 Approved by: https://github.com/Skylion007 ghstack dependencies: #151801, #151802, #151803, #151804, #151805, #151806, #151807, #151810, #151849	2025-04-25 23:49:35 +00:00
Scott Wolchok	cf101d66ee	Add simple direct C++ tests for torch::jit::Lexer (#151849 ) We have test_jit.py, but given that I'm working on significant changes to the lexer, it seems nice to have direct C++ tests. (Also, writing the tests caught a pair of related bugs; see the two tests with "Bug" in their name. The rewrite will fix them.) Differential Revision: [D73402367](https://our.internmc.facebook.com/intern/diff/D73402367/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/151849 Approved by: https://github.com/malfet ghstack dependencies: #151801, #151802, #151803, #151804, #151805, #151806, #151807, #151810	2025-04-25 22:39:49 +00:00
Scott Wolchok	2a58d2a155	StringCordView: make iterator fast when there is only one piece (#151810 ) This makes the StringCordView iterator a variant holding either the existing implementation (when there is more than one piece) or a simple `std::string_view::iterator` (when there is only one piece). The latter seems to be significantly cheaper. Differential Revision: [D73379178](https://our.internmc.facebook.com/intern/diff/D73379178/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/151810 Approved by: https://github.com/Skylion007 ghstack dependencies: #151801, #151802, #151803, #151804, #151805, #151806, #151807	2025-04-24 04:43:34 +00:00
Scott Wolchok	aa61707a56	Fix extra heap allocation in Source constructor (#151800 ) This was a sneaky one: the StringCordView default constructor allocates. Differential Revision: [D73129448](https://our.internmc.facebook.com/intern/diff/D73129448/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/151800 Approved by: https://github.com/malfet, https://github.com/cyyever, https://github.com/Skylion007 ghstack dependencies: #151682	2025-04-22 23:36:06 +00:00
inventshah	bf28d1cafc	Expose bicubic mode for torch::nn::functional::grid_sample in LibTorch (#150817 ) When bicubic interpolation was added to grid_sampler in #44780, `GridSampleFuncOptions` was not updated to allow a user to use bicubic mode in LibTorch, even though the function could handle it. This PR fixes the parity such that LibTorch's `torch::nn::functional::grid_sample` behaves the same as PyTorch's `torch.nn.functional.grid_sample`. Existing users can directly use `torch::grid_sampler` but must know what int to pass for the interpolation (2 for bicubic) and padding mode parameters, which is not ideal. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150817 Approved by: https://github.com/Skylion007	2025-04-21 08:55:27 +00:00
PaulZhang12	3ed5f1fb77	[CUDA][cuBLAS] Aten GEMM overload for FP32 output from FP16/BF16 inputs (#150812 ) Enable FP32 output from FP16/BF16 GEMMs in aten with cuBLAS. Accumulation for these GEMMs is generally already done in FP32. Adds the functionality to the following aten operators: * mm * bmm * addmm * baddmm Follow up of customer issue: https://github.com/pytorch/pytorch/issues/146241#issuecomment-2781889390 Differential Revision: [D73126191](https://our.internmc.facebook.com/intern/diff/D73126191) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150812 Approved by: https://github.com/ngimel, https://github.com/eqy	2025-04-18 01:53:26 +00:00
Mu-Chu Lee	c3a18f6126	[AOTInductor] Add states for constant folding process (#151273 ) Summary: We add states in the constant folding process for AOTInductor. Basically, there are three states: (1) None: The state when no constants are loaded and uninitialized. (2) Initialized: The state when constants are loaded, but not yet folded. (3) Folded: The state where the model is fully ready with folded constants. Note that even if constant folding is not enabled, we still only run when state is FOLDED; this is okay because without constant folding, the transition from INITIALIZED to FOLDED is just a pass-through. Test Plan: python test/inductor/test_aot_inductor.py -k test_constant_folding_with_update Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D73002538](https://our.internmc.facebook.com/intern/diff/D73002538) Pull Request resolved: https://github.com/pytorch/pytorch/pull/151273 Approved by: https://github.com/jingsh, https://github.com/desertfire	2025-04-17 16:41:38 +00:00
Nariaki Tateiwa	23a3cef5d9	[c10d] Add `_allgather_base` , `reduce_scatter` , and `_reduce_scatter_base` into ProcessGroupMPI to enable FSDP with MPI backend (#150162 ) This PR implements _allgather_base, reduce_scatter, and _reduce_scatter_base in the MPI backend (ProcessGroupMPI), enabling support for Fully Sharded Data Parallel (FSDP) in environments that use MPI for distributed communication. ### Context As noted in https://github.com/pytorch/pytorch/issues/85628, FSDP currently supports only the NCCL backend. Due to this limitation, FSDP cannot run on legacy HPC environments or clusters that rely on MPI. By implementing just these three collective operations, we can enable FSDP to work with the MPI backend. These collectives are implemented in a similar manner to existing operations such as allgather. ### Testing We validated this PR using pytorch/build/bin/ProcessGroupMPITest with OpenMPI, and all tests passed successfully. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150162 Approved by: https://github.com/H-Huang	2025-04-14 19:31:38 +00:00
Shivam Raikundalia	ad5e9065ac	[Profiler/Easy] Remove temp flag for on-demand Memory Snapshot (#151068 ) Summary: Now that we have profiler impl in we don't need the temporary flag. submodule update too. Test Plan: CI Reviewed By: sanrise Differential Revision: D72672186 Pull Request resolved: https://github.com/pytorch/pytorch/pull/151068 Approved by: https://github.com/davidberard98	2025-04-11 18:50:25 +00:00
fduwjj	f663aa4e81	[c10d][tcp_store] Fix connection reset caused by wrong socket close (#150987 ) While fixing the memory leak in https://github.com/pytorch/pytorch/pull/145757, we accidentally close the socket for the case when nread == 0 and thought it is the case when connection is closed. This is not true. According to libuv doc: https://docs.libuv.org/en/v1.x/stream.html#c.uv_read_cb. > nread might be 0, which does not indicate an error or EOF. This is equivalent to EAGAIN or EWOULDBLOCK under read(2). We found this bug when debugging a broken pipe issue when users first call a set and then wait for all keys right afterwards on 128 ranks. This might also cause other broken pipe issues we have seen in the prod jobs recently. Added a unit test to test this case. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150987 Approved by: https://github.com/d4l3k, https://github.com/XilunWu	2025-04-10 18:48:57 +00:00
Mu-Chu Lee	f3cf3ec591	[AOTInductor] Add User Managed buffer for AOTI constant buffer. (#150276 ) Summary: We add the functionality to allow users to directly pass an at::Tensor into AOTInductor, which would be used as the constant. This user managed buffer skips the copying step in AOTInductor, and lets users directly manage the memory usage themselves. Test Plan: LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib /data/users/$USER/pytorch/build/bin/test_aoti_inference Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D72589514](https://our.internmc.facebook.com/intern/diff/D72589514) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150276 Approved by: https://github.com/chenyang78, https://github.com/desertfire	2025-04-10 00:15:44 +00:00
Shivam Raikundalia	99c9a31386	[submodule] [Snapshot/Profiler] Memory Snapshot On Demand (#150559 ) Summary: Profiler side of memory snapshot. 1. Add API to actually do snapshot when client interface is called 2. Add ifdefs to builds so that kineto hooks snapshot correctly. Design Philosophy: There is one interesting part of this implementation and it is during export. For export we are calling the python impl of the export rather than CPP even though we are already in CPP. This is because it is better to simply have one path of export rather than 2. Personally, I want there to be parity between auto-trace and on-demand, so if we can limit the side paths then we will have an easier time maintaining this relationship. Test Plan: {F1976563426} Reviewed By: sanrise Differential Revision: D70733247 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150559 Approved by: https://github.com/sanrise	2025-04-07 13:04:38 +00:00
Mu-Chu Lee	063ea5d669	[AOTInductor] Modify test for Memory tracking for memory-related operations (#150269 ) Summary: Fix the test for memory tracking. This PR does: (1) Add tracking before and after all memory-related operations. Make sure each operation does indeed capture consumed memory both in CUDA and in torch's CUDACachingAllocator. (2) Keep track of memory being reserved by CUDACachingAllocator in torch and its relationship with global CUDA memory consumption. Test Plan: This PR is adding tests. Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/150269 Approved by: https://github.com/jingsh, https://github.com/chenyang78, https://github.com/desertfire	2025-04-02 04:18:18 +00:00
Ke Wen	35c45a4a31	[Reland] Launch kernel on current stream & remove `record_stream` entirely (#150398 ) Relanding #148590 due to merge conflict. This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related): 1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back. - Resolves #147729 - Resolves #146881 - Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user. 2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling. - Resolves #147168 3. Remove tensor life management when async_op=False; only use it when async_op=True. 4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460). 5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line and compute kernels. Joint work with @cenzhaometa who wants to remove the event sync overhead. Squashed contents: * [ptd][nccl] use current-stream as nccl-stream under async=False mode (#147820) PTD current workflow: - PTD creates its own dedicated `ncclStream` for comm operation - it will first add a dependency on current-stream (typically the compute stream) to ensure tensors are ready before invoking collective such stream synchronization become expensive in Inference world (cpu overhead: 70us vs GPU kernel time: 160us). 
This diff: - async=False [default], will use current-stream as nccl-stream and avoid the stream-sync overhead - async=True, will retain existing logic: create new nccl-stream, let it wait on current-stream to ensure tensors are ready - pass down async from c10d down to NCCL-PG this helps shave off 50% CPU overhead (70us -> 35us), which reduce total CPU/GPU from 230us to 195us by 15% * [PGNCCL] Make avoid-record-stream default * [c10d] Add asyncOp argument to Ops * Change python side wait * Pass asyncOp at ProcessGroup level * Watchdog unstashing tensors as a safety net * Stash tensors for reduce_scatter_v and all_gather_v Pull Request approved: https://github.com/pytorch/pytorch/pull/149753 * [c10d] Move unstashing from watchdog to main thread Pull Request approved: https://github.com/pytorch/pytorch/pull/150079 * [PGNCCL][BE] Merge mutex into TensorShelf for encapsulation Pull Request approved: https://github.com/pytorch/pytorch/pull/150130 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150398 Approved by: https://github.com/atalman	2025-04-01 16:46:07 +00:00
Mu-Chu Lee	a2070e2fd5	[AOTInductor] Free tensors in test (#150274 ) Summary: This PR frees tensors that were new-ed within the test itself to prevent memory leaks. Test Plan: Fixing the tests themselves. Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/150274 Approved by: https://github.com/chenyang78	2025-03-31 23:28:13 +00:00
Irshad CC	f3c77b2458	Set requires grad in TensorMaker::make_tensor() (#148255 ) Fixes #146419 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148255 Approved by: https://github.com/soulitzer	2025-03-29 08:06:42 +00:00
Mu-Chu Lee	03313c6619	[AOTInductor] Add function for users to extract constants in container (#150163 ) Summary: Add extract_constant_map that allows users to inspect the constants being used by AOTInductor Test Plan: `python test/inductor/test_aot_inductor.py -k extract_constants_map` `LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib /data/users/$USER/pytorch/build/bin/test_aoti_inference` Differential Revision: D72020400 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150163 Approved by: https://github.com/chenyang78	2025-03-29 03:36:12 +00:00
Mu-Chu Lee	e6afb51805	[AOTInductor] Free folded constants that are managed by AOTInductor internally. (#149825 ) Summary: This diff allows freeing the usage of folded constants that are created by AOTInductor through CUDACachingAllocator instead of the constant blob from cudaMalloc directly. Test Plan: LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib /home/$USER/local/pytorch/build/bin/test_aoti_inference Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/149825 Approved by: https://github.com/chenyang78, https://github.com/desertfire, https://github.com/jingsh	2025-03-27 06:05:50 +00:00
Mu-Chu Lee	12628ba24d	[AOTInductor] Bug fix for freeing buffers when freeing multiple times (#149810 ) Summary: We might free the active buffer if we free the buffer twice. Test Plan: ``` LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib /home/$USER/local/pytorch/build/bin/test_aoti_inference ``` Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/149810 Approved by: https://github.com/chenyang78	2025-03-25 20:26:36 +00:00
Scott Wolchok	c73a526599	Extract reusable portions of elu_kernel into header (#149673 ) Similar to #140425, we are making the implementation usable via header-only code sharing. Review note: #62546 by @yanbing-j removed expm1 usage from this path. I don't know why and expm1 should be more efficient, so I've put it back. Please let me know if there is a good reason I shouldn't. Testing: existing correctness tests should cover. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149673 Approved by: https://github.com/cyyever, https://github.com/Skylion007	2025-03-21 23:54:26 +00:00
Shangdi Yu	46dd226702	Fakify torchbind objects in compile_fx and add tests for SigridTransformsInstanceTorchBind (#149529 ) Summary: We need to properly fakify torchbind objects, including the ones in graph module attributes, so the registered fake implementation works properly. - _fakify_script_objects in `compile_fx` - Allow fake torchbind objects in `torchbind_constants` Remove `node.meta["unbacked_bindings"]` for `aot_compile` in `compile_fx`. Otherwise `ShapeProp` will fail when trying to resolve the `unbacked_bindings` of `with_effect` tokens. Update `sigrid_transforms_test` to use the latest `torch._inductor.aot_compile` API. Add a test for `Fakify torchbind objects in compile_fx and add tests for SigridTransformsInstanceTorchBind` in `e2e_test`. Test Plan: ``` buck run //caffe2/torch/fb/sparsenn:sigrid_test -- -r test_transform_torch_bind buck run //sigmoid/inference/test:e2e_test_cpu -- -r SigridTransforms buck2 run mode/dev-nosan sigmoid/inference/ts_migration:pt2i_readiness_main -- --model_id 545017754 --test_suite ads_all --mode test_preproc ``` Differential Revision: D70013257 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149529 Approved by: https://github.com/angelayi	2025-03-21 18:58:28 +00:00
Bin Bao	04e251a7dd	[AOTI] Add num_runners to AOTIModelPackageLoader (#149364 ) Summary: AOTIModelContainerRunner takes a num_runners argument for multi-threaded inference, but AOTIModelPackageLoader forgot to take the same parameter, although its run() API already expects to take an optional cudaStream_t parameter for multi-threaded inference. Differential Revision: [D71357418](https://our.internmc.facebook.com/intern/diff/D71357418) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149364 Approved by: https://github.com/angelayi	2025-03-19 02:28:06 +00:00
Mu-Chu Lee	bb42e4d137	[AOTInductor] Add function to free buffer (#149161 ) Summary: We add a function that allows users to free the unused buffer. Test Plan: Testing correctness: python test/inductor/test_aot_inductor.py -k free_inactive Testing memory consumption: LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib /home/$USER/local/pytorch/build/bin/test_aoti_inference Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/149161 Approved by: https://github.com/chenyang78, https://github.com/desertfire ghstack dependencies: #149249	2025-03-18 02:43:14 +00:00
PyTorch MergeBot	afa1eda901	Revert "[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590 )" This reverts commit `ef6296e7f2`. Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/izaitsevfb due to reverted internally, see D71292427 ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2731114626))	2025-03-17 22:43:15 +00:00
Joel Schlosser	5e1b715dda	BC fix for AOTIModelPackageLoader() constructor defaults (#149082 ) The default value for `run_single_threaded` was wrongly specified in the .cpp file instead of the header, breaking C++-side instantiation of `AOTIModelPackageLoader` with no arguments. This PR fixes this and adds a test for the use case of running with `AOTIModelPackageLoader` instead of `AOTIModelContainerRunner` on the C++ side. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149082 Approved by: https://github.com/desertfire	2025-03-13 18:40:53 +00:00
Bin Bao	b9803a5c81	[AOTI] Re-enable AOTI cpp unit test (#149085 ) Summary: test_inductor_aoti was removed by accident previously. Add it back. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149085 Approved by: https://github.com/jbschlosser	2025-03-13 16:00:38 +00:00
Ke Wen	ef6296e7f2	[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590 ) This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related): 1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back. - Resolves #147729 - Resolves #146881 - Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user. 2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling. - Resolves #147168 3. Remove tensor life management when async_op=False; only use it when async_op=True. 4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460). 5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line and compute kernels. Joint work with @cenzhaometa who wants to remove the event sync overhead. Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj Differential Revision: [D70937982](https://our.internmc.facebook.com/intern/diff/D70937982) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590 Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj	2025-03-11 18:36:12 +00:00
PyTorch MergeBot	a95eb0c0a7	Revert "[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590 )" This reverts commit `2149f6c684`. Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/ZainRizvi due to Breaking internally, see D70873275. Discussed reverting this with Ke. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2712001270))	2025-03-10 22:38:40 +00:00
Ke Wen	2149f6c684	[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590 ) This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related): 1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back. - Resolves #147729 - Resolves #146881 - Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user. 2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling. - Resolves #147168 3. Remove tensor life management when async_op=False; only use it when async_op=True. 4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460). 5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line and compute kernels. Joint work with @cenzhaometa who wants to remove the event sync overhead. Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj Differential Revision: [D70835197](https://our.internmc.facebook.com/intern/diff/D70835197) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590 Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj	2025-03-09 07:32:23 +00:00
PyTorch MergeBot	9cb25f0ea2	Revert "[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590 )" This reverts commit `17dbeb11db`. Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/janeyx99 due to PR break backward compat test ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2708641172))	2025-03-09 03:01:55 +00:00
Ke Wen	17dbeb11db	[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590 ) This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related): 1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back. - Resolves #147729 - Resolves #146881 - Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user. 2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling. - Resolves #147168 3. Remove tensor life management when async_op=False; only use it when async_op=True. 4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460). 5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line and compute kernels. Joint work with @cenzhaometa who wants to remove the event sync overhead. Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj Differential Revision: [D70835197](https://our.internmc.facebook.com/intern/diff/D70835197) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590 Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj	2025-03-08 20:00:12 +00:00
rpsilva	4abff4b271	Introduce cache clearing APIs for the lazy graph executor (#144489) This PR introduces two new methods to the LazyGraphExecutor class: - ClearComputationCache(): Allows clearing the entire computation cache. - RemoveFromComputationCache(hash): Enables removal of specific cache entries based on their hash. The main objective is to expose cache management functionality for debugging cache hits and misses across different computations. For instance: - Reset the cache state in tests, allowing reuse of the same computation client to evaluate cache logic consistently. - Selectively remove cache entries to analyze the impact on subsequent computations. - Improve observability into the cache behavior, aiding in the investigation of cache-related issues or optimizations. On the XLA lazy graph executor, we want to run a series of tests that modify some parts of the HLO module proto of the computation, and we need a means to ensure that the hash is agnostic to some elements (OpMetadata in the XLA proto data). With these methods, it is easy to parameterize the test, clear the cache, and validate that the resulting hash is the same between runs. Otherwise, we'd need to hardcode the resulting serialized hash. Another motivation is that users could clear some computation hashes for added flexibility in their applications, by introducing their own custom strategies for maintaining the cache (without relying on the default LRU). Pull Request resolved: https://github.com/pytorch/pytorch/pull/144489 Approved by: https://github.com/wconstab	2025-01-29 17:38:01 +00:00
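To illustrate the kind of cache management this commit describes, here is a minimal sketch of an LRU cache keyed by computation hash, with a clear-all and a remove-by-hash operation. It assumes nothing about the real `LazyGraphExecutor` internals: `ComputationCache` and its method names are hypothetical, and the string hashes and values are placeholders.

```python
from collections import OrderedDict


class ComputationCache:
    """Toy LRU cache keyed by computation hash (hypothetical, not the PyTorch class)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()

    def get(self, key):
        # A hit moves the entry to the most-recently-used position.
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)
        return self._entries[key]

    def add(self, key, computation):
        self._entries[key] = computation
        self._entries.move_to_end(key)
        # Evict least-recently-used entries once over capacity.
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)

    def remove(self, key):
        # Plays the role of RemoveFromComputationCache(hash);
        # returns whether an entry was actually removed.
        return self._entries.pop(key, None) is not None

    def clear(self):
        # Plays the role of ClearComputationCache().
        self._entries.clear()

    def __len__(self):
        return len(self._entries)


cache = ComputationCache(capacity=2)
cache.add("h1", "graph-1")
cache.add("h2", "graph-2")

removed = cache.remove("h1")         # selectively drop one entry
miss_after_remove = cache.get("h1")  # the next lookup is a miss
cache.clear()                        # reset state between test cases
size_after_clear = len(cache)
```

In a parameterized test, calling `clear()` between cases would let the same cache object be reused while each case still starts from a known empty state, which is the use the commit message describes.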
Shuqiang Zhang	c0861d092c	[PGNCCL] Add an API to get the status/error code at the PG level (#144498) Summary: This PR is basically a replacement of https://github.com/pytorch/pytorch/pull/140087, which caused some perf drop due to frequent TCPStore checks in the watchdog thread. The fix is to move the TCPStore check to the monitoring thread. If the PG is unhealthy, the user should be able to get the type of error, e.g., timeout, NCCL error, or remote error. This API applies at the PG level, whereas the work.get_future_result() API applies at the Work level. Error detection at the PG level is much more convenient for users who want to handle a PG failure as a whole, e.g., by restarting the PG. Error handling at the Work level is still useful for users who want to attach work-specific context and debug the root cause of the specific failing work/collective. Note that it is critical for all ranks in the PG to be notified about an error as soon as it occurs, so we introduce an errorType of REMOTE_ERROR, which is 'broadcast' from a src rank (which detects a local error) to all other ranks in the PG. The broadcast is currently done through TCPStore. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144498 Approved by: https://github.com/kwen2501	2025-01-24 16:47:32 +00:00
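The propagation described here (one rank detects a local error, the others learn of it as a remote error through a shared store) can be sketched as follows. Everything in this block is an assumption made for illustration: a plain `dict` stands in for TCPStore, `ToyProcessGroup`, `report_local_error`, `poll_store`, and the key name `remote_error_src` are hypothetical, and the enum members are placeholders that only mirror the error kinds named in the message.

```python
from enum import Enum


class ErrorType(Enum):
    # Illustrative members mirroring the kinds of error the message lists.
    SUCCESS = 0
    TIMEOUT = 1
    COMM_ERROR = 2
    REMOTE_ERROR = 3


class ToyProcessGroup:
    """Hypothetical per-rank view of a process group's health."""

    def __init__(self, rank, store):
        self.rank = rank
        self.store = store  # shared dict standing in for TCPStore
        self.error = ErrorType.SUCCESS

    def report_local_error(self, error):
        # The detecting rank records its own error type and publishes
        # its rank so that other ranks can find out.
        self.error = error
        self.store.setdefault("remote_error_src", self.rank)

    def poll_store(self):
        # What a monitoring thread might do periodically: check whether
        # some other rank has published an error.
        src = self.store.get("remote_error_src")
        if src is not None and src != self.rank and self.error is ErrorType.SUCCESS:
            self.error = ErrorType.REMOTE_ERROR

    def get_error(self):
        return self.error


store = {}
groups = [ToyProcessGroup(rank, store) for rank in range(3)]

groups[1].report_local_error(ErrorType.TIMEOUT)  # rank 1 times out locally
for group in groups:
    group.poll_store()

statuses = [group.get_error() for group in groups]
```

Here rank 1 keeps its local error type while ranks 0 and 2 see a remote error, which is what lets every rank decide to tear down or restart the group as a whole.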
fduwjj	ae7df51232	[c10d] Fix CudaEventCache for dangling references (#144496) As reported in https://github.com/pytorch/pytorch/issues/143470, we have dangling references in `CudaEventCache`, so we want to fix them. 1. We add a unit test to repro the problem described in the issue. 2. Instead of converting variables to shared pointers as suggested in the issue, we make the cache itself a shared pointer. So if the thread that creates the cache dies before all events get recycled, the cache is still there until the last CudaEvent gets deleted. (Thanks to @kwen2501 for the suggestion.) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144496 Approved by: https://github.com/kwen2501	2025-01-15 05:11:48 +00:00
PyTorch MergeBot	b80ecc4457	Revert "Fix poision child process issue when call getAccelerator() (#144368 )" This reverts commit `2583d831d4`. Reverted https://github.com/pytorch/pytorch/pull/144368 on behalf of https://github.com/clee2000 due to broke internal tests D68023262, probably the same problem as noted in the issue this PR is mentioned above ([comment](https://github.com/pytorch/pytorch/pull/144368#issuecomment-2584848568))	2025-01-10 23:36:43 +00:00

2367 Commits