Commit Graph

5938 Commits

Author SHA1 Message Date
Nikita Shulga
4e7232c5da [MPS] Fix smooth_l1_loss backward for fp16 (#166687)
Also enable the fp16 implementation for CPU, which simplifies the OpInfo definitions for the op.
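A minimal sketch of the user-facing path (assuming an MPS- or fp16-capable CPU build with this change; the device selection below is just for illustration):

```python
import torch
import torch.nn.functional as F

# Fall back to CPU when MPS is unavailable; fp16 smooth_l1_loss is enabled there too.
device = "mps" if torch.backends.mps.is_available() else "cpu"
x = torch.randn(8, dtype=torch.float16, device=device, requires_grad=True)
target = torch.randn(8, dtype=torch.float16, device=device)

loss = F.smooth_l1_loss(x, target)
loss.backward()  # the fp16 backward pass this commit fixes on MPS
print(x.grad.dtype)  # torch.float16
```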

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166687
Approved by: https://github.com/Skylion007
ghstack dependencies: #166214
2025-10-31 21:13:46 +00:00
Jeff Daily
c3b71d5499 [ROCm][CI] remove relaxed tolerance for tf32 tests (#166478)
Instead of relaxing tolerances for certain unit tests that exercise TF32 on MI300, skip the tests until hipblaslt accuracy is improved.
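A hedged illustration of the skip-instead-of-relax approach (the decorator and condition below are stand-ins, not the exact ones used in the PR):

```python
import unittest

import torch


class TestTF32Matmul(unittest.TestCase):
    # Skip on ROCm builds rather than loosening tolerances; the real change uses
    # PyTorch's internal test decorators, this only shows the general pattern.
    @unittest.skipIf(torch.version.hip is not None,
                     "TF32 accuracy on MI300/hipblaslt pending improvement (#166478)")
    def test_matmul_tf32(self):
        a = torch.randn(64, 64)
        b = torch.randn(64, 64)
        torch.testing.assert_close(a @ b, a @ b)


if __name__ == "__main__":
    unittest.main()
```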

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166478
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com>
2025-10-31 16:15:42 +00:00
Kurt Mohler
1e3600b528 [MPS] Move logaddexp/logaddexp2 to Metal and support complex (#166670)
NOTE: Complex inputs are only supported in `logaddexp`. Since `logaddexp2` does not support complex inputs for CPU, it is not enabled for MPS in this PR either.
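A minimal sketch of the user-visible behavior (assuming a build with this change; falls back to CPU when MPS is unavailable):

```python
import torch

device = "mps" if torch.backends.mps.is_available() else "cpu"
a = torch.tensor([0.5 + 1.0j, -0.25 + 0.5j], device=device)
b = torch.tensor([1.5 - 0.5j, 0.75 + 0.25j], device=device)

print(torch.logaddexp(a, b))             # complex inputs now supported
print(torch.logaddexp2(a.real, b.real))  # logaddexp2 remains real-only
```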
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166670
Approved by: https://github.com/malfet
2025-10-31 16:15:02 +00:00
Yuanyuan Chen
030de07aff [2/N] Use 'is' in callable comparisons (#166685)
It is generally advised to use `is/is not` for comparisons against torch functions.
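For example (an illustrative pattern, not code taken from the PR):

```python
import torch


def is_relu(fn):
    # Identity comparison: `is` never invokes __eq__ and is unambiguous for callables.
    return fn is torch.nn.functional.relu


print(is_relu(torch.nn.functional.relu))  # True
print(is_relu(torch.sigmoid))             # False
```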

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166685
Approved by: https://github.com/xmfan, https://github.com/mlazos
2025-10-31 08:08:07 +00:00
Artem Kuzmitckii
45c3f02d69 [ROCm] moved gfx1100 back to experimental status for AOTriton (#166397)
Per the following AOTriton commit:
8625c4faee

These changes were missed in the 0.11b release:
https://github.com/pytorch/pytorch/pull/161754

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166397
Approved by: https://github.com/jeffdaily
2025-10-30 21:43:01 +00:00
PyTorch MergeBot
694d205143 Revert "shrink_group implementation to expose ncclCommShrink API (#164518)"
This reverts commit 311ea0dec0.

Reverted https://github.com/pytorch/pytorch/pull/164518 on behalf of https://github.com/atalman due to breaks internal builds Error: from logging_utils import ( ModuleNotFoundError: No module named 'logging_utils' ([comment](https://github.com/pytorch/pytorch/pull/164518#issuecomment-3469308568))
2025-10-30 17:52:29 +00:00
Simon Layton
fb545fb068 Add MXFP4 grouped gemm support via. FBGEMM kernels (#166530)
Summary:

* Extend `_scaled_grouped_mm_v2` to include MXFP4 support
* Add testing to existing grouped routines

Test Plan:

```
pytest -svv -k "mxfp4 and group" test/test_scaled_matmul_cuda.py
```

Signed-off-by: Simon Layton <simonlayton@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166530
Approved by: https://github.com/drisspg
2025-10-30 16:46:11 +00:00
Yuanyuan Chen
2de4cf2102 [1/N] Remove unused loop variables (#166258)
This PR removes unused loop variables.
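An illustrative before/after of the pattern being cleaned up (not taken verbatim from the PR):

```python
items = ["a", "b", "c"]

# Before: the index is bound but never used.
for i, item in enumerate(items):
    print(item)

# After: drop the unused loop variable.
for item in items:
    print(item)
```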

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166258
Approved by: https://github.com/Lucaskabela, https://github.com/mlazos
2025-10-30 12:22:25 +00:00
linhaifeng
369f2d6951 [3/N] fix typo in other folders (#166606)
Fix typos in other folders.

Related: #166374, #166126

_typos.toml
```bash
[files]
extend-exclude = ["tools/linter/dictionary.txt"]
[default.extend-words]
nd = "nd"
arange = "arange"
Nd = "Nd"
GLOBALs = "GLOBALs"
hte = "hte"
iy = "iy"
PN = "PN"
Dout = "Dout"
optin = "optin"
gam = "gam"
PTD = "PTD"
Sur = "Sur"
nin = "nin"
tme = "tme"
inpt = "inpt"
mis = "mis"
Raison = "Raison"
ouput = "ouput"
nto = "nto"
Onwer = "Onwer"
callibrate = "callibrate"
ser = "ser"
Metdata = "Metdata"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166606
Approved by: https://github.com/ezyang
2025-10-30 10:30:40 +00:00
Dzmitry Huba
791ca80d3a Enable local tensor mode for DTensor attention and convolution tests (#166406)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166406
Approved by: https://github.com/ezyang
2025-10-30 02:48:02 +00:00
Bruce Chang
311ea0dec0 shrink_group implementation to expose ncclCommShrink API (#164518)
Closes #164529

To expose the new [ncclCommShrink](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/comms.html#ncclcommshrink) API to PyTorch.

This is useful when you need to exclude certain GPUs or nodes from a collective operation, for example in fault tolerance scenarios or when dynamically adjusting resource utilization.

For more info:  [Shrinking a communicator](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/communicators.html#shrinking-a-communicator)
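A hypothetical usage sketch: the commit exposes a `shrink_group` entry point, but its module path and signature are not shown in this log, so the `torch.distributed` placement and the `ranks_to_exclude` argument below are assumptions for illustration only:

```python
# Hypothetical sketch -- only the name shrink_group comes from the commit title;
# its location and keyword argument are assumed here.
import torch
import torch.distributed as dist

dist.init_process_group("nccl")

# Exclude a failed rank (e.g. rank 3) and keep running collectives on the survivors.
surviving_pg = dist.shrink_group(ranks_to_exclude=[3])  # assumed entry point/kwarg

if dist.get_rank() != 3:
    t = torch.ones(1, device="cuda")
    dist.all_reduce(t, group=surviving_pg)
```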

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164518
Approved by: https://github.com/kwen2501
2025-10-30 01:50:54 +00:00
Maggie Moss
d1a6e006e0 Fix syntax for pyrefly errors (#166496)
Last one! This ensures all existing suppressions match the expected syntax and will silence only one error code.

pyrefly check
lintrunner

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166496
Approved by: https://github.com/Skylion007, https://github.com/mlazos
2025-10-29 20:00:25 +00:00
PyTorch MergeBot
1dd6b76914 Revert "[1/N] Remove unused loop variables (#166258)"
This reverts commit 76b2c37045.

Reverted https://github.com/pytorch/pytorch/pull/166258 on behalf of https://github.com/atalman due to breaks test/distributed/test_serialization.py::TestSerialization::test_weights_only [GH job link](https://github.com/pytorch/pytorch/actions/runs/18894311802/job/53929321703) [HUD commit link](76b2c37045) ([comment](https://github.com/pytorch/pytorch/pull/166258#issuecomment-3460964612))
2025-10-29 11:10:37 +00:00
etaf
1b655a87ef [xpu][test] Enable more UTs for Intel GPU. (#166047)
This PR enables additional Inductor unit tests for Intel GPU. Due to the increased number of test cases, the number of runners has been extended from 8 to 12 to prevent CI timeouts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166047
Approved by: https://github.com/jansel

Co-authored-by: Deng, Daisy <daisy.deng@intel.com>
Co-authored-by: Jason Ansel <jansel@jansel.net>
2025-10-29 06:25:36 +00:00
Junjie Wang (PyTorch)
774abb018e [ptd] Fix test config in destroy_pg (#166463)
Summary: When device_type is CPU, we will not use the device id from CUDA, which was enabled in https://github.com/pytorch/pytorch/pull/161015. However, we should not exclude the case where the accelerator itself is CPU. This PR fixes that.

Test Plan: UT

Differential Revision: D85714901

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166463
Approved by: https://github.com/mori360, https://github.com/fegin
2025-10-29 04:35:04 +00:00
Nikita Shulga
877f126e35 [MPS] Improve index_select error checking (#166468)
Just copy-n-paste overlap checks from
0d4992c170/aten/src/ATen/native/TensorAdvancedIndexing.cpp (L1620-L1622)

Very similar to https://github.com/pytorch/pytorch/pull/166425
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166468
Approved by: https://github.com/dcci, https://github.com/Skylion007
2025-10-29 02:23:12 +00:00
Yuanyuan Chen
76b2c37045 [1/N] Remove unused loop variables (#166258)
This PR removes unused loop variables.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166258
Approved by: https://github.com/Lucaskabela, https://github.com/mlazos
2025-10-29 01:34:15 +00:00
Xingyuan Li
68b3984b77 [xpu][test] Enable skipped SparseAdam UTs (#166375)
With `SparseAdam` now correctly supported on Intel GPU, the previously disabled UTs can be enabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166375
Approved by: https://github.com/Skylion007, https://github.com/janeyx99
2025-10-28 22:49:25 +00:00
Simon Layton
b5189e269e NVFP4 grouped gemm support via. FBGEMM kernels (#166308)
Summary:

* Add NVFP4 (1x16 block e4m3, tensor-wise fp32) scaled grouped gemm
* Extend testing to add nvfp4 support

Test Plan:

```
pytest -svv -k grouped test/test_scaled_matmul_cuda.py
```

Signed-off-by: Simon Layton <simonlayton@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166308
Approved by: https://github.com/ngimel
2025-10-28 20:32:53 +00:00
Nikita Shulga
1abfa5f70b [EZ][MPS] Improve distribution error checking (#166425)
Essentially, disallow ops on self-overlapping outputs by adding the
`at::assert_no_internal_overlap(self);` check that is already used in CPU
and CUDA builds; see
895795f07c/aten/src/ATen/native/DistributionTemplates.h (L366)

This fixes `test_error_inputs_bernoulli_mps`
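A minimal sketch of the newly rejected case (assuming a build with this change; the exact error text may differ):

```python
import torch

device = "mps" if torch.backends.mps.is_available() else "cpu"
# Internally overlapping output: several elements alias one memory location.
out = torch.empty(1, device=device).expand(10)

try:
    out.bernoulli_(0.5)
except RuntimeError as e:
    print(e)  # complains that multiple elements of the output share memory
```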

Should be landed ahead of https://github.com/pytorch/pytorch/pull/165267
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166425
Approved by: https://github.com/Skylion007, https://github.com/seemethere
2025-10-28 18:42:12 +00:00
Roman Krasavtsev
e137cd0a10 docs: fix typos (#164879)
Correct typos in the comments

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164879
Approved by: https://github.com/Lucaskabela, https://github.com/mlazos, https://github.com/cyyever
2025-10-28 12:00:36 +00:00
Janani Sriram
ff46d5a79b [Inductor][Triton][FP8] Support deepseek-style scaling in Inductor (#164404)
Summary:
Support deepseek-style scaling in Inductor Triton for FP8 GEMMs. DeepSeek-style scaling is a colloquial term for a fine-grained mixed precision framework using FP8 to train [Deepseek-V3](https://arxiv.org/pdf/2412.19437), DeepSeek AI's recent MoE (Mixture of Experts) model. DeepSeek-style scaling effectively extends the dynamic range of FP8 by mitigating dequantization overhead under increased-precision accumulation, which is key to achieving more accurate FP8 GEMM results.

DeepSeek-style scaling on matmul `A @ B` leverages two different types of scaling strategies to preserve a balance between numerical stability and training efficiency:
- Activations (input tensor `A`): tile-wise (1x128 across shape `(M, K)`)
- Weights (input tensor `B`): block-wise (128x128 across shape `(N, K)`)
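A small sketch of the resulting scale-tensor shapes (a layout illustration only, not the Inductor kernel; the amax-based scale computation is an assumption for demonstration):

```python
import torch

M, N, K = 4096, 768, 512
A = torch.randn(M, K)   # activations
B = torch.randn(N, K)   # weights

# Tile-wise 1x128 scales for A -> shape (M, K // 128)
a_scales = A.abs().unflatten(1, (K // 128, 128)).amax(dim=-1)

# Block-wise 128x128 scales for B -> shape (N // 128, K // 128)
b_scales = (B.abs()
             .unflatten(0, (N // 128, 128))
             .unflatten(2, (K // 128, 128))
             .amax(dim=(1, 3)))

print(a_scales.shape, b_scales.shape)  # torch.Size([4096, 4]) torch.Size([6, 4])
```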

This diff enables Inductor users to replicate past successes with deepseek-style scaling and achieve higher numerical stability while increasing training efficiency.

NOTE: Block-wise 128x128 scaling is only supported in CUDA 12.9+; therefore, deepseek-style scaling is currently unsupported in `fbcode` (CUDA 12.4). Use OSS PyTorch to run deepseek-style scaling.

NOTE: Accuracy for FP8 is unstable, even with high tolerances, which is why TritonBench benchmarks are unlikely to be accurate against a `torch` implementation.

Test Plan:
In OSS PyTorch, run
```
TORCHINDUCTOR_CACHE_DIR=~/personal/cache_dir_inductor CUDA_LAUNCH_BLOCKING=1 TORCH_USE_CUDA_DSA=1 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 ENABLE_PERSISTENT_TMA_MATMUL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 python run.py --op fp8_gemm --only torch_fp8_gemm,pt2_fp8_gemm --metrics tflops,accuracy --m 4096 --n 768 --k 512 --output="{output_dir}/deepseek_bench.csv" --scaling_deepseek --atol=1e-2 --rtol=0.5 2>&1 | tee ~/personal/deepseek_style/deepseek_bench.log
```

Differential Revision: D83609850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164404
Approved by: https://github.com/slayton58
2025-10-28 03:38:54 +00:00
Dzmitry Huba
a51f877287 Enable local tensor mode for another set of DTensor tests (#166105)
Enable local tensor mode DTensor tests for the optimizers, op strategy,  matrix ops,
math ops, init ops, experimental ops, embedding ops, dynamic, convolution ops, main api.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166105
Approved by: https://github.com/ezyang
2025-10-27 23:58:24 +00:00
Nikita Shulga
4e6afa8c07 [BE][Opinfo] Mark [c]double as unsupported for MPS (#166213)
Test plan: Run `python ../test/test_ops.py -v -k test_dtypes___radd___mps` when TestCommon parametrization is enabled for MPS
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166213
Approved by: https://github.com/kulinseth, https://github.com/Skylion007
2025-10-27 05:38:36 +00:00
Dzmitry Huba
86f9f1d0ab Enable local tensor model for DTensor redistribute tests (#166081)
The redistribute tests extensively exercise various sharding schemes and
redistribution between them. These tests uncovered more edge cases
that were not supported by local tensor, primarily different flavors
of uneven sharding. To handle these cases, this change implements the
missing functional collectives and adds support for the uneven-sharding
case where the sharding group (ranks) is larger than the size of the dimension
being sharded. In the latter case the "missing" shards are represented
by zero-sized tensors so that the rest of the local tensor machinery
can stay oblivious to this special case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166081
Approved by: https://github.com/ezyang
2025-10-26 22:21:43 +00:00
Yuanyuan Chen
a60d9e1f6d Fix flake8 B028 warnings (#166224)
This PR fixes flake8 B028 warning by specifying stacklevel=2 in `warnings.warn`. The advantage is that users can know more contextual information about PyTorch warnings.
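For instance (an illustrative pattern, not code from the PR):

```python
import warnings


def deprecated_helper():
    # stacklevel=2 attributes the warning to the caller's line rather than to
    # this helper, which is the extra context B028 is about.
    warnings.warn("deprecated_helper is deprecated", DeprecationWarning, stacklevel=2)


deprecated_helper()
```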

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166224
Approved by: https://github.com/ezyang
2025-10-26 06:18:55 +00:00
Kurt Mohler
c9b49e506e [MPS] Add linalg.householder_product for MPS (#166090)
Fixes #166089
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166090
Approved by: https://github.com/malfet
2025-10-24 21:13:56 +00:00
Eddie Yan
e64a814ae7 [CUDA] Add experimental green context support for SM carveout (#159104)
Low-level PyTorch APIs should be usable/stable enough at this point but we might move the underlying driver API usage a bit from here...

Built on top of @drisspg 's branch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159104
Approved by: https://github.com/ngimel, https://github.com/malfet, https://github.com/kwen2501

Co-authored-by: drisspg <drisspguessous@gmail.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-10-22 21:38:52 +00:00
Artem Kuzmitckii
f3b8e15f20 [AMD][gfx1100] test_decompose_mem_bound_mm.py tolerance increase (#165625)
Increase tolerances in test_decompose_mem_bound_mm.py for navi3x (gfx11x).

(cherry picked from commit 03c7da05f61890bbf5ae41e23c8df6d5f6805bac)

Fixes for the CI HUD for gfx1100.

Signed-off-by: Artem Kuzmitckii <artem.kuzmitckii@amd.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165625
Approved by: https://github.com/jeffdaily

Co-authored-by: iupaikov-amd <Iurii.Paikov@amd.com>
Co-authored-by: Dmitry Nikolaev <139769634+dnikolaev-amd@users.noreply.github.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-22 01:38:48 +00:00
PyTorch MergeBot
ad4dc52bf6 Revert "shrink_group implementation to expose ncclCommShrink API (#164518)"
This reverts commit 4e643422f6.

Reverted https://github.com/pytorch/pytorch/pull/164518 on behalf of https://github.com/albanD due to Breaks lint ([comment](https://github.com/pytorch/pytorch/pull/164518#issuecomment-3429426503))
2025-10-21 20:24:14 +00:00
Bruce Chang
4e643422f6 shrink_group implementation to expose ncclCommShrink API (#164518)
Closes #164529

To expose the new [ncclCommShrink](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/comms.html#ncclcommshrink) API to PyTorch.

This is useful when you need to exclude certain GPUs or nodes from a collective operation, for example in fault tolerance scenarios or when dynamically adjusting resource utilization.

For more info:  [Shrinking a communicator](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/communicators.html#shrinking-a-communicator)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164518
Approved by: https://github.com/kwen2501
2025-10-21 19:47:33 +00:00
Yuanyuan Chen
0e083942cc Enable PLW0127 in ruff (#165851)
This PR enables `PLW0127` in ruff, which flags self-assignments of the form `var = var`.
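An example of what the rule flags:

```python
x = 42
x = x  # PLW0127: no-op self-assignment; removed by the lint fix
print(x)
```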

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165851
Approved by: https://github.com/Lucaskabela
2025-10-21 03:30:57 +00:00
PyTorch MergeBot
633a3b7f67 Revert "shrink_group implementation to expose ncclCommShrink API (#164518)"
This reverts commit fa0db212e7.

Reverted https://github.com/pytorch/pytorch/pull/164518 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/164518#issuecomment-3419893217))
2025-10-19 19:20:45 +00:00
Bruce Chang
fa0db212e7 shrink_group implementation to expose ncclCommShrink API (#164518)
Closes #164529

To expose the new [ncclCommShrink](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/comms.html#ncclcommshrink) API to PyTorch.

This is useful when you need to exclude certain GPUs or nodes from a collective operation, for example in fault tolerance scenarios or when dynamically adjusting resource utilization.

For more info:  [Shrinking a communicator](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/communicators.html#shrinking-a-communicator)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164518
Approved by: https://github.com/kwen2501
2025-10-19 18:00:08 +00:00
Yuanyuan Chen
3255e7872b Enable all flake8-logging-format rules (#164655)
These rules are enabled by removing existing suppressions.
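A small example of the deferred-formatting style these rules (the `G` family) enforce:

```python
import logging

logger = logging.getLogger(__name__)
name = "checkpoint.pt"

logger.info("loaded %s", name)     # OK: formatting deferred to the logging framework
# logger.info(f"loaded {name}")    # flagged by G004 (f-string in a logging call)
```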

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164655
Approved by: https://github.com/janeyx99, https://github.com/mlazos
2025-10-19 00:59:28 +00:00
Dzmitry Huba
c4f6619330 Enable more DTensor tests in local tensor mode and fix more integration issues (#165716)
- During op dispatch, local tensor is supposed to collect RNG state from the CPU and CUDA
devices so that it can be reset before execution of the op for each rank, such that ops
with randomness produce the same result for all ranks (note that we are planning a
separate change to add support for per-rank RNG state). Previously we relied on
op input arguments to deduce which devices to get RNG state from, which doesn't work
for factory functions such as torch.randn. Hence this change switches to unconditionally
collecting RNG state from all devices.

- Fixing per-rank-specific computations in the _MaskedPartial and Shard placements discovered
during test enablement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165716
Approved by: https://github.com/ezyang
2025-10-18 23:33:24 +00:00
Yuanyuan Chen
1f43d17ce6 Fix self assignment (#165816)
This PR removes assignments of the form `var=var`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165816
Approved by: https://github.com/jansel
2025-10-18 18:51:52 +00:00
PyTorch MergeBot
beb6b62e8c Revert "Enable more DTensor tests in local tensor mode and fix more integration issues (#165716)"
This reverts commit 1b397420f2.

Reverted https://github.com/pytorch/pytorch/pull/165716 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/165716#issuecomment-3418083391))
2025-10-18 09:15:49 +00:00
Yuanyuan Chen
fdab48a7c1 Enable all PIE rules on ruff (#165814)
This PR enables all PIE rules in ruff. Some rules from this family were already enabled; the newly added rules are:
```
PIE796  Enum contains duplicate value: {value}
PIE808  Unnecessary start argument in range
```
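Illustrative examples of the two newly enabled rules:

```python
import enum


class Color(enum.Enum):
    RED = 1
    CRIMSON = 1  # PIE796: duplicate enum value (CRIMSON becomes an alias of RED)


# PIE808: the explicit start argument is unnecessary; prefer range(3).
for i in range(0, 3):
    print(i)
```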

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165814
Approved by: https://github.com/ezyang
2025-10-18 07:36:18 +00:00
PyTorch MergeBot
24520b8386 Revert "Enable all PIE rules on ruff (#165814)"
This reverts commit c79dfdc655.

Reverted https://github.com/pytorch/pytorch/pull/165814 on behalf of https://github.com/cyyever due to Need to cover more files ([comment](https://github.com/pytorch/pytorch/pull/165814#issuecomment-3417931863))
2025-10-18 07:21:08 +00:00
Yuanyuan Chen
c79dfdc655 Enable all PIE rules on ruff (#165814)
This PR enables all PIE rules in ruff. Some rules from this family were already enabled; the newly added rules are:
```
PIE796  Enum contains duplicate value: {value}
PIE808  Unnecessary start argument in range
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165814
Approved by: https://github.com/ezyang
2025-10-18 06:40:12 +00:00
Yuanyuan Chen
e595136187 Enable PLC1802 on ruff (#165813)
This PR enables ruff check `PLC1802`, which detects len calls on sequences in a boolean test context.
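For example:

```python
items = [1, 2, 3]

# Flagged by PLC1802: len() used only for its truthiness.
if len(items):
    print("non-empty")

# Preferred: rely on the sequence's truthiness directly.
if items:
    print("non-empty")
```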

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165813
Approved by: https://github.com/ezyang
2025-10-18 05:44:14 +00:00
Yuanyuan Chen
aaac8cb0f5 [1/N] Add strict parameter to Python zip calls (#165531)
Add `strict=True/False` to zip calls in test utils. `strict=True` is passed when possible.
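A quick illustration of the difference:

```python
names = ["a", "b", "c"]
values = [1, 2]

print(list(zip(names, values)))  # silently truncates to the shorter input

try:
    list(zip(names, values, strict=True))
except ValueError as e:
    print(e)  # zip() argument 2 is shorter than argument 1
```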

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165531
Approved by: https://github.com/Skylion007
2025-10-18 05:26:33 +00:00
Dzmitry Huba
1b397420f2 Enable more DTensor tests in local tensor mode and fix more integration issues (#165716)
- During op dispatch, local tensor is supposed to collect RNG state from the CPU and CUDA
devices so that it can be reset before execution of the op for each rank, such that ops
with randomness produce the same result for all ranks (note that we are planning a
separate change to add support for per-rank RNG state). Previously we relied on
op input arguments to deduce which devices to get RNG state from, which doesn't work
for factory functions such as torch.randn. Hence this change switches to unconditionally
collecting RNG state from all devices.

- Fixing per-rank-specific computations in the _MaskedPartial and Shard placements discovered
during test enablement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165716
Approved by: https://github.com/ezyang
2025-10-17 23:28:22 +00:00
PyTorch MergeBot
fae74cd52f Revert "shrink_group implementation to expose ncclCommShrink API (#164518)"
This reverts commit a032510db3.

Reverted https://github.com/pytorch/pytorch/pull/164518 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/164518#issuecomment-3416718767))
2025-10-17 18:55:53 +00:00
Bruce Chang
a032510db3 shrink_group implementation to expose ncclCommShrink API (#164518)
Closes #164529

To expose the new [ncclCommShrink](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/comms.html#ncclcommshrink) API to PyTorch.

This is useful when you need to exclude certain GPUs or nodes from a collective operation, for example in fault tolerance scenarios or when dynamically adjusting resource utilization.

For more info:  [Shrinking a communicator](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/communicators.html#shrinking-a-communicator)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164518
Approved by: https://github.com/Skylion007, https://github.com/syed-ahmed, https://github.com/kwen2501
2025-10-17 17:55:03 +00:00
Wei Wang
d7e275d4b4 [CI][CUDA] Add periodic b200 distributed job (#159323)
1. Run the distributed job on a B200 runner, periodically.
2. Discovered a generic distributed-test issue: certain unit tests hard-code ranks, calling for a require_exact_world_size(world_size) API instead of require_world_size(world_size).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159323
Approved by: https://github.com/eqy

Co-authored-by: Aidyn-A <aidyn.b.aitzhan@gmail.com>
2025-10-16 21:54:04 +00:00
Dzmitry Huba
2cd5fd1588 Enable local tensor mode on DTensor view ops test (#165596)
While enabling this test, we discovered a lack of support for sub meshes. Added limited support
for sub meshes by properly computing rank coordinates for a given sub mesh. The implementation
follows a similar approach to collectives: we infer all sub meshes for the given dimensions and
compute each rank's coordinates with respect to its sub mesh.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165596
Approved by: https://github.com/ezyang
2025-10-16 20:52:06 +00:00
Brian Hirsh
ed74dc054d add the option to disable functionalization in AOTDispatcher (#164577)
I'm cleaning this PR up as a proper way of disabling functionalization via config in AOTDispatcher. I removed the non-functionalization related changes from the original version:

(1) preventing proxy mode (and functionalization) from incorrectly decomposing CIA ops (Ed has a PR for it here: https://github.com/pytorch/pytorch/pull/164939)

(2) preventing python-dispatcher-based decomps above autograd from running. I'm not doing this for now and will likely do it in a follow-up.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164577
Approved by: https://github.com/ezyang
ghstack dependencies: #165372
2025-10-16 15:44:11 +00:00
Brian Hirsh
f33c7e1a43 add and fix OpInfo tests for the default partitioner (#165372)
I noticed the default partitioner was breaking in some dynamic-shape tests, so prior to turning off functionalization I want to tweak it to pass all of our OpInfo tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165372
Approved by: https://github.com/ezyang
2025-10-16 15:44:11 +00:00