Previously, we launched the a2av kernel with at most 8 blocks for intra-node cases, which turns out to saturate only 57 GB/s of bandwidth.
This PR adds more blocks for the intra-node case, up to 8 per peer, increasing data parallelism. The kernel now achieves 350 GB/s SOL on Hopper. See figure.
It also uses a simple input-size-based tuning to avoid jumping to 8 CTAs directly (i.e. 1, 2, 4, then 8); see the sketch below.
For inter-node, we keep the cap at 8 blocks, since 57 GB/s already exceeds typical NIC bandwidths (400 Gb/s, i.e. 50 GB/s).
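A minimal sketch of the ramp-up idea (the byte thresholds below are hypothetical placeholders for illustration, not the values the kernel actually uses):
```python
def intra_node_blocks_per_peer(nbytes: int) -> int:
    """Pick CTAs per peer from the input size instead of jumping straight to 8.
    Thresholds are illustrative placeholders."""
    if nbytes < 64 * 1024:
        return 1          # small messages: one CTA is enough
    if nbytes < 256 * 1024:
        return 2
    if nbytes < 1024 * 1024:
        return 4
    return 8              # large messages: full data parallelism
```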

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153509
Approved by: https://github.com/ngimel
ghstack dependencies: #153483
Summary: Use pybind11::gil_scoped_acquire instead of the old implementation, as it automatically takes care of error handling. The original implementation missed releasing the GIL on some error paths, which could put the program in a deadlock.
Test Plan: Induced an error manually and verified that the GIL was released.
Differential Revision: D74593564
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153415
Approved by: https://github.com/Skylion007, https://github.com/cyyever
Adds create_graph support when you don't compile, or when you compile only with torch.compile(backend="eager").
Using a backend that uses AOTDispatch produces a post-dispatch AOT backward, whose double backward will be silently incorrect if the forward trace involved any ops that are not composite implicit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153222
Approved by: https://github.com/jansel
ghstack dependencies: #153193
During FR dumps, for reasons not yet understood, we see CUDA errors when querying events, and this fails the whole FR dump (when trying to get the entries). So we wrap the query in a try-catch instead of letting it fail the whole process.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153414
Approved by: https://github.com/d4l3k
Summary:
To add PT2 information to the memory snapshot, we piggyback off the Kineto implementation using record_function, similar to how the user annotations are added. To do this we add the following:
1. A stack implementation that we instantiate to keep track of which compile context we are currently in (the top element of the stack). The stack is per device and thread-local, since different threads of a process can be in different compile contexts at a given time. For this reason we do not need mutexes in the stack implementation, since no two threads will ever touch a given stack (see the sketch after this list).
2. RecordFunction hooks to properly pipe the correct events to the compile context stack. These hooks are similar to the annotation ones in that we register them lazily and DO NOT unregister them. This is done out of convenience. In the future, we should save the handles and unregister them to minimize overhead after profiling is finished. As of now, we register at the FUNCTION scope, which is wide; however, we treat any function that does not start with "Torch-Compiled Region" as a no-op, so we anticipate the performance difference to be negligible during and after profiling. We also hide this feature behind a flag that is off by default, so existing jobs will be unaffected.
3. Piping for compile context to pickle output
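An illustrative Python sketch of the per-thread, per-device stack idea from point 1 (the actual implementation is in C++ inside the profiler; all names below are made up for illustration):
```python
import threading
from collections import defaultdict

# Thread-local storage: each thread sees its own dict of per-device stacks,
# so no mutex is needed -- two threads can never touch the same stack.
_local = threading.local()

def _stacks():
    if not hasattr(_local, "stacks"):
        _local.stacks = defaultdict(list)  # device index -> stack of context names
    return _local.stacks

def push_compile_context(device: int, name: str) -> None:
    _stacks()[device].append(name)

def pop_compile_context(device: int) -> None:
    stack = _stacks()[device]
    if stack:
        stack.pop()

def current_compile_context(device: int):
    stack = _stacks()[device]
    return stack[-1] if stack else None  # top of the stack = current compile context
```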
Test Plan:
In D74039793, we add CompileContext to the visualizer and we see the following {F1977654658}
Differential Revision: D74028214
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152707
Approved by: https://github.com/eqy
Based on the [conversation](https://github.com/pytorch/pytorch/issues/121791), we plan to drop "highest, high, medium" as the way to represent fp32 internal computation data types. Instead, we will directly use the algorithm name to represent it.
### Design Choice: Directly use algorithm names like "TF32", "BF16".
#### Pros
- The names are more informative: 'tf32' says more than a generic 'high'.
- Easier to extend to new algorithms like `tf32x3`.
#### Cons
- "HIGHEST, HIGH, MEDIUM" indicated the relative precision between different algorithms. However, we can have more documents to discuss them.
### We provide a layered structure for backends/operators.
('f32' is short for 'fp32_precision')

### We provide 3 fp32 compute precisions that can be set ("none" meaning unset):
- **"ieee"**: Not allowed to use any other internal computation data type.
- **"tf32"**: Allowed to use tf32 as the internal computation data type.
- **"bf16"**: Allowed to use bf16 as the internal computation data type.
- **"none"**: Precision is not set; it can be overridden by its parent node.
### Overriding Precision Settings
A child node is overridden by its parent node if the child is set to the default ("none").
The current default settings are:
```
backend = generic, op = all, precision setting = none
backend = cuda, op = all, precision setting = none
backend = cuda, op = conv, precision setting = tf32
backend = cuda, op = rnn, precision setting = tf32
backend = cuda, op = matmul, precision setting = none
backend = mkldnn, op = all, precision setting = none
backend = mkldnn, op = conv, precision setting = none
backend = mkldnn, op = rnn, precision setting = none
backend = mkldnn, op = matmul, precision setting = none
```
- If the user sets `torch.backends.mkldnn.fp32_precision="bf16"`, its child nodes `torch.backends.mkldnn.matmul.fp32_precision` / `torch.backends.mkldnn.conv.fp32_precision` / `torch.backends.mkldnn.rnn.fp32_precision` will also be overridden to "bf16".
- If the user sets `torch.backends.fp32_precision="bf16"`, `torch.backends.mkldnn.fp32_precision` and its child nodes will also be overridden to "bf16" (see the sketch below).
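A minimal usage sketch of the inheritance behavior described above (attribute names as given in this description; the expected values are what the rules above imply, not verified output):
```python
import torch

# Leaf settings default to "none", so they inherit from their parent backend.
torch.backends.mkldnn.fp32_precision = "bf16"
print(torch.backends.mkldnn.matmul.fp32_precision)  # expected: "bf16" (inherited)

# An explicitly set child takes precedence over its parent.
torch.backends.mkldnn.conv.fp32_precision = "ieee"
print(torch.backends.mkldnn.conv.fp32_precision)    # expected: "ieee"
```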
### Backward Compatibility
Since the new API allows more fine-grained control, there will be some conflicts. For example, the previous `torch.backends.cudnn.allow_tf32` is not enough to represent a state such as `torch.backends.cudnn.rnn.fp32_precision="ieee"` combined with `torch.backends.cudnn.conv.fp32_precision="tf32"`. Therefore, our goals for backward compatibility are:
- If the user only uses the previous APIs, they will work as before.
- If the user uses the **new** API to change the state to one that is **un-representable** by the old API and then reads it through the **old** API, we raise a RuntimeError and point the user to the documentation.
### Test Plan
```
python test/test_cuda.py -k test_fp32_precision_with_tf32
python test/test_cuda.py -k test_fp32_precision_with_float32_matmul_precision
python test/test_cuda.py -k test_invalid_status_for_legacy_api
python test/test_mkldnn.py -k test_mlkdnn_get_set
python test/test_mkldnn.py -k test_generic_precision
python test/test_mkldnn.py -k test_invalid
python test/test_mkldnn.py -k test_default_use_parent
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125888
Approved by: https://github.com/jgong5, https://github.com/albanD
Co-authored-by: Jiang, Yanbing <yanbing.jiang@intel.com>
Summary:
Bug fix for constant folding states. We were not setting the correct state for each update.
One race condition would be:
(1) All threads obtain the model_exec_lock from main run.
(2) In the second round of constant-buffer updates, we should have set the secondary buffer to INITIALIZED, but the primary is mistakenly set instead.
(3) run_const_fold gets called and a model_exec_lock (shared) is obtained, waiting for availability at this point.
(4) The main run enters the INITIALIZED path and waits for a unique_lock, while the shared_lock from (3) is still being held at this moment.
Test Plan:
TBD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153152
Approved by: https://github.com/jingsh, https://github.com/chenyang78
Summary:
X-link: https://github.com/pytorch/gloo/pull/437
This provides a new "UnboundBuffer" implementation for the Gloo ibverbs backend so it can be used with PyTorch.
This currently passes basic tests such as `reduce_test` and `send_recv_test`, but there are still a number of failures. Putting this up for review so the follow-up fixes are less of a mega PR, and also so we can start some initial end-to-end testing with PyTorch.
Known issues:
* recv from any is not supported
* AllreduceBcubeBase2 is failing
Test Plan:
```
buck2 run mode/dbgo //gloo/test:send_recv_test_ibverbs
buck2 test //gloo/test:
GLOO_DEVICE_TRANSPORT=IBVERBS buck2 run @//mode/opt //caffe2/test/distributed:c10d -- -r '.*gloo.*' -f
```
We can't run any of the Gloo tests in CI since none of our CI machines have ibverbs; the tests are disabled by default and need to be run manually.
Differential Revision: D73291471
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153015
Approved by: https://github.com/fduwjj
The current FR code is built with `USE_C10D_NCCL`; we should remove that to make it generic. We keep the existing API used by NCCL so that we have some backward compatibility, because lots of use cases are built around FR with NCCL. The generic version based on `c10::Event` can then be used for other backends like Gloo, etc.
The current unit tests should cover the change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152563
Approved by: https://github.com/kwen2501, https://github.com/d4l3k
ghstack dependencies: #152585
Make Functorch interpreters serializable most of the time, so that we can save the guards on functorch states (a minimal sketch of case 1 follows the test cases below).
## Test Cases:
0. torch.compile() without any functorch layers present. The guard should fail when any layer is pushed.
1. torch.compile() nested in vmap.
2. torch.compile() nested in grad.
3. torch.compile() nested in jvp + vmap
4. torch.compile() nested in functionalize
5. torch.compile() nested in vmap + grad
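A minimal sketch of case 1 (illustrative only, not the actual test code), assuming the nesting works as the test cases above imply:
```python
import torch
from torch.func import vmap

# Compile with the eager backend; compilation happens while a vmap layer is on the
# functorch interpreter stack, so guards on the functorch state get installed.
@torch.compile(backend="eager")
def f(x):
    return torch.sin(x) + 1

x = torch.randn(4, 3)
out = vmap(f)(x)   # torch.compile() nested in vmap
print(out.shape)   # torch.Size([4, 3])
```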
Differential Revision: [D74008787](https://our.internmc.facebook.com/intern/diff/D74008787/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152616
Approved by: https://github.com/zou3519
ghstack dependencies: #152615
Two error messages in the codebase instruct the user to use `Tensor.dense()`. This method doesn't exist, but `Tensor.to_dense()` does, and that is what the user should be using instead.
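A quick illustration of the correct API:
```python
import torch

# Tensor.to_dense() exists; Tensor.dense() does not.
sparse = torch.tensor([[0.0, 1.0], [2.0, 0.0]]).to_sparse()
print(sparse.to_dense())
```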
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152631
Approved by: https://github.com/jansel
This is my suggestion for resolving #152087
This PR extends the constructor of `AOTIModelPackageLoader` with an (optional) device index. The device type is still determined by `metadata_["AOTI_DEVICE_KEY"]`, but the `device_index` argument can be used to move an AOTI model package to different devices like `cuda:0`, `cuda:1`, ... in a convenient way. AFAIK, this is not possible so far using `AOTIModelPackageLoader` alone. The default case (no device index specified) with `metadata_["AOTI_DEVICE_KEY"] == "cuda"` would lead to the current behavior, i.e., the model is loaded to device `cuda`.
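A hypothetical usage sketch from Python, assuming the new argument is exposed through the `torch._C._aoti.AOTIModelPackageLoader` binding (the argument names and positions here are assumptions, not verified against the actual binding):
```python
import torch

# Load the same compiled .pt2 package onto two different CUDA devices.
loader0 = torch._C._aoti.AOTIModelPackageLoader("model.pt2", "model", device_index=0)
loader1 = torch._C._aoti.AOTIModelPackageLoader("model.pt2", "model", device_index=1)

x0 = torch.randn(8, 16, device="cuda:0")
y0 = loader0.run([x0])               # runs on cuda:0
y1 = loader1.run([x0.to("cuda:1")])  # runs on cuda:1
```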
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152093
Approved by: https://github.com/desertfire
`torch/csrc/utils.h` should be device-independent. Currently, it contains CUDA-related implementations, which indirectly causes the [failure of ROCm testing](https://github.com/pytorch/pytorch/pull/151914#issuecomment-2839691038) (the ROCm test environment shouldn't expose HIP-related header files, and exposing them causes the JIT compilation to fail during testing).
Therefore, move CUDA-related implementations to `torch/csrc/cuda/utils.h`.
**Question:**
This change may introduce a BC break.
I searched for this function globally on GitHub and I think the impact is very small.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152521
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #152512, #152513
definitely_true is almost the same as guard_or_false; the potential differences are not meaningful enough to justify the existence of both. The same applies to definitely_false, which can be expressed with guard_or_true and guard_or_false (see the sketch below).
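An illustrative sketch of the intended equivalence (not taken from this PR's diff; `cond` stands for a symbolic boolean built from SymInts):
```python
from torch.fx.experimental.symbolic_shapes import guard_or_false, guard_or_true

def old_definitely_true(cond):
    # definitely_true(cond): True only if cond is statically known to be True.
    # guard_or_false returns cond's known value, or False when it cannot be
    # decided, which matches that behavior without adding a guard.
    return guard_or_false(cond)

def old_definitely_false(cond):
    # definitely_false(cond): True only if cond is statically known to be False.
    # guard_or_true returns cond's known value, or True when undecidable, so its
    # negation matches definitely_false.
    return not guard_or_true(cond)
```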
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152463
Approved by: https://github.com/bobrenjc93