pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Ke Wen	4879f8f919	[TP] Add warning when module is distributed twice (#147006 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147006 Approved by: https://github.com/XilunWu	2025-02-13 06:49:17 +00:00
Tianyu Liu	ac0f206f3c	[dtensor] fix side-effect on dtype for _like ops (#146869 ) fixes https://github.com/pytorch/pytorch/issues/146749 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146869 Approved by: https://github.com/yifuwang, https://github.com/janeyx99, https://github.com/ngimel	2025-02-12 08:42:14 +00:00
Xilun Wu	c4d835fbab	[DTensor][conv] add DTensor convolution_backward op support for case where the input Tensor has requires_grad=False (#142278 ) Fixes #142058 ## Summary The DTensor `convolution_backward` op throws an exception when the input Tensor has `requires_grad=False`, which happens if the conv layer is the first layer in the model. The ATen `convolution_backward` op usually returns 3 Tensors (grad_input, grad_weight, grad_bias), and `grad_input` is actually an Optional[Tensor] which can be `None` in the case mentioned above. However, the DTensor sharding propagation rule and the corresponding TP conv backward implementation both assume that `grad_input` exists. ## Fix Allow `grad_input` to be `None` for the `convolution_backward` op. ## Test `pytest test/distributed/tensor/test_convolution_ops.py` ## Follow-up The current implementation of the DTensor conv op also ignores `output_mask`, and this may need further care. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142278 Approved by: https://github.com/bdhirsh	2025-02-10 07:06:40 +00:00
Xilun Wu	5cc1b54a91	[2/N][cp][example] flex attention in context parallel (backward pass) (#146397 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146397 Approved by: https://github.com/fegin ghstack dependencies: #145896	2025-02-06 19:50:02 +00:00
Xilun Wu	6220c64aea	[1/N][cp][example] flex attention in context parallel (forward pass) (#145896 ) Description This is an example of how FlexAttention can be used in a context parallel fashion. Right now it's only a flex_attention call with collectives added and has no load balancer, but we're about to add the missing parts step by step: 1. backward pass 2. static load balancing for causal masking 3. dynamic load balancing for other general maskings 4. automatic collective insertion solution 5. non-intrusive context parallel APIs Test `torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/tensor/examples/flex_attention_cp.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145896 Approved by: https://github.com/fegin, https://github.com/Skylion007	2025-02-06 19:50:02 +00:00
Aaron Gokaslan	292af3cc89	[BE][Ez]: ISC001 Auto concatenate implicit one line strings (#146408 ) Apply ruff rule about implicit string concatenation, this autofixes strings that are all the same type and on the same line. These lines are broken up likely as the result of autoformatters in the past. All fixes are automated using the autofixes in ISC001. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146408 Approved by: https://github.com/justinchuby, https://github.com/janeyx99	2025-02-04 19:07:04 +00:00
Stas Bekman	3aeccf2a28	DeepSpeed github repo move sync (#146320 ) DeepSpeed has moved to a new repo on github https://github.com/deepspeedai/DeepSpeed This PR updates this repo to use the new URL. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146320 Approved by: https://github.com/awgu	2025-02-03 23:20:49 +00:00
wz337	6f5c8fb128	[DTensor] Add pointwise ops strategy for `aten.minimum` (#145816 ) Need it for Shampoo optimizer. `9c5700ad5e/matrix_functions.py (L240-L242)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145816 Approved by: https://github.com/XilunWu	2025-01-29 01:19:01 +00:00
Xilun Wu	2ce70da96c	[cp] override compute_log_sumexp to True for aten._scaled_dot_product_efficient_attention.default if False (#145421 ) ## Description Our current CP doesn't support efficient attention when `compute_log_sumexp=False`. `compute_log_sumexp` is `False` only when `requires_grad=False`, and since PP's [shape inference](`d95a6babcc/torch/distributed/pipelining/stage.py (L1387)`) happens under a `torch.no_grad()` context, we need to override `compute_log_sumexp` to `True` in our CP attention implementation. ## Test - Test PP+FSDP+CP w/ `mixed_precision = "float32"` in torchtitan - `pytest test/distributed/tensor/test_attention.py -s -k test_ring_attention_sdpa` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145421 Approved by: https://github.com/H-Huang, https://github.com/tianyu-l	2025-01-24 06:17:54 +00:00
Aaron Orenstein	c95efc37ba	PEP585 update - torch/distributed/tensor (#145141 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145141 Approved by: https://github.com/bobrenjc93	2025-01-18 20:01:59 +00:00
bobrenjc93	08be9ec312	Migrate from Tuple -> tuple in torch/distributed (#144258 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144258 Approved by: https://github.com/aorenste	2025-01-10 08:34:54 +00:00
Wanchao Liang	b1c2c3967a	[dtensor] deprecate _shard_tensor to use src_data_rank=None (#144171 ) As titled: we can achieve no-comm sharding for the inference case with src_data_rank=None, so deprecate the private API. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144171 Approved by: https://github.com/awgu	2025-01-09 22:26:45 +00:00
Andrew Gu	8ac005ddb8	[DTensor] Add `aten.view.dtype` op support (#144404 ) Fixes https://github.com/pytorch/pytorch/issues/144286 Viewing a tensor to a different dtype does not require any redistribution and can use the default strategy. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144404 Approved by: https://github.com/wanchaol	2025-01-08 23:11:22 +00:00
Xuehai Pan	dcc3cf7066	[BE] fix ruff rule E226: add missing whitespace around operator in f-strings (#144415 ) The fixes are generated by: ```bash ruff check --fix --preview --unsafe-fixes --select=E226 . lintrunner -a --take "RUFF,PYFMT" --all-files ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144415 Approved by: https://github.com/huydhn, https://github.com/Skylion007	2025-01-08 21:55:00 +00:00
Luca Wehrstedt	defbf0d339	[DTensor] Add strategy for _scaled_mm (#143760 ) This is done by copying the one for a regular mm, and enforcing that the scales have the same sharding scheme as their respective operands. This works because scales are 2-d tensors that must "broadcast" to the operands. This broadcasting is trivial when scales have dimensions of 1 or N, which are the only options we currently support. Note, however, that after this PR scales will be allowed to have the mesh's world size as a dimension (in certain cases). This works because, when mapped to the local shard, it becomes a dimension of 1, which can be handled by the operator. Note that when using row-wise _scaled_mm for tensor (sequence) parallelism, this situation arises naturally! Because of these specificities, the test is rather complex, as it specifically tests all these behaviors. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143760 Approved by: https://github.com/tianyu-l	2025-01-06 16:35:47 +00:00
Aaron Orenstein	45ef3309e3	[BE] typing for decorators (#144161 ) Summary: Untyped decorators strip annotations from the decorated items. - _compile - _inductor/fx_passes/post_grad - _inductor/lowering - _library/custom_ops - _meta_registrations - _ops - _refs/nn/functional - ao/quantization/quantizer/xnnpack_quantizer_utils - distributed/_composable/contract - fx/experimental/graph_gradual_typechecker - fx/experimental/migrate_gradual_types/constraint_generator - optim/optimizer - signal/windows/windows - testing/_internal/common_device_type - torch/_inductor/decomposition - utils/flop_counter Test Plan: unit tests Differential Revision: D62302684 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144161 Approved by: https://github.com/Skylion007, https://github.com/albanD	2025-01-04 16:40:09 +00:00
Wanchao Liang	eb7a303d21	[dtensor] expose the __create_chunk_list__ in the doc (#144100 ) As titled, this PR exposes this dunder method as a public API in the doc, so that different checkpoint implementations can leverage this protocol, instead of exposing a separate API. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144100 Approved by: https://github.com/awgu ghstack dependencies: #144099	2025-01-03 20:06:23 +00:00
Wanchao Liang	48a05ee773	[dtensor] improve doc of the DTensor class (#144099 ) as titled: explicitly list all public members to make sure the public API stays consistent, also use groupwise as the member order to make doc look better Pull Request resolved: https://github.com/pytorch/pytorch/pull/144099 Approved by: https://github.com/awgu	2025-01-03 05:35:44 +00:00
Wanchao Liang	0431d47eaa	[tp] propagate src_data_rank kwarg in TP API (#144005 ) as titled, this PR propagates the src_data_rank in the TP API, so that module level APIs could leverage the flexibility to choose src_data_rank, and avoid the communication if it does not need to Pull Request resolved: https://github.com/pytorch/pytorch/pull/144005 Approved by: https://github.com/tianyu-l ghstack dependencies: #143883	2025-01-02 05:35:52 +00:00
Wanchao Liang	f242dbb76f	[dtensor] add src_data_rank to distribute_tensor API (#143883 ) As titled, this PR adds a kwarg src_data_rank to the distribute_tensor API, to allow users to specify a specific rank as the full tensor source data. Previously we specified group_rank=0 by default as the source of truth for single device semantic. This new option: * gives advanced users the flexibility to choose the source data rank * allows users to specify None explicitly, which means we will skip the communications needed (scatter/broadcast) for cases that do not care about single device semantic (e.g. loading from a checkpoint) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143883 Approved by: https://github.com/XilunWu, https://github.com/tianyu-l	2025-01-02 05:35:52 +00:00
Luca Wehrstedt	aec3b46274	[DTensor] Add aten.amin/amax to linear_reduction_strategy (#143747 ) In the same vein as https://github.com/pytorch/pytorch/pull/134206, these two ops still seemed missing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143747 Approved by: https://github.com/kwen2501	2024-12-24 13:36:40 +00:00
Xuehai Pan	b77406a9ec	[BE][CI] bump `ruff` to 0.8.4 (#143753 ) Changes: 1. Bump `ruff` from 0.7.4 to 0.8.4 2. Change `%`-formatted strings to f-string 3. Change arguments with the `__`-prefix to positional-only arguments with the `/` separator in function signature. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143753 Approved by: https://github.com/Skylion007	2024-12-24 12:24:10 +00:00
Tom Ritchford	f1cbf4b1b5	Enable ruff's unused variable checking everywhere in pytorch (#136965 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136965 Approved by: https://github.com/cyyever, https://github.com/albanD	2024-12-22 02:33:11 +00:00
bobrenjc93	8e78345d69	remove allow-untyped-defs from distributed/tensor/experimental/__init__.py (#143583 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143583 Approved by: https://github.com/awgu	2024-12-19 20:25:28 +00:00
Aaron Orenstein	401b1498d2	[BE] typing for decorators - distributed/_tensor/ops/utils (#142139 ) Test Plan: unit tests Differential Revision: D62302679 Pull Request resolved: https://github.com/pytorch/pytorch/pull/142139 Approved by: https://github.com/Skylion007, https://github.com/kwen2501	2024-12-16 21:19:33 +00:00
lzhang2	b7ad52abb0	Use new group instead of split group on non-CUDA device (#141469 ) Motivation: Currently, `split_group` only works for the NCCL backend (https://github.com/pytorch/pytorch/blob/main/torch/distributed/distributed_c10d.py#L4745), so we need to use `new_group` on non-CUDA devices. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141469 Approved by: https://github.com/kwen2501, https://github.com/gujinghui, https://github.com/albanD	2024-12-13 05:11:33 +00:00
Jane Xu	fd65bd755d	[BE] replace incorrect .. note:: invocations (#142868 ) Something I've noticed is that a lot of the distributed sites don't render on our docs at all, but if they ever do, the notes will render properly now 😛 Pull Request resolved: https://github.com/pytorch/pytorch/pull/142868 Approved by: https://github.com/albanD	2024-12-11 19:58:18 +00:00
Xilun Wu	bce07deb96	[dtensor][cp][experiment] add CP experimental API to choose rotate method (#142093 ) Summary This PR adds a new experimental API `set_rotate_method` for Context Parallel. This API allows user to choose the desired communication method (between all-to-all and all-gather) for shards rotation. Test `pytest test/distributed/_tensor/test_attention.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/142093 Approved by: https://github.com/fegin	2024-12-10 18:25:23 +00:00
Ke Wen	a58d2f14e8	[DTensor] Add a private util for sharding tensor (#142288 ) Locally shards a full tensor based on the indicated sharding arrangement, and returns a DTensor containing the local shard. Warning: this is a private API intended to skip the communication otherwise required by `distribute_tensor`. It is only applicable when all ranks have the same `full_tensor`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142288 Approved by: https://github.com/wz337	2024-12-07 05:30:18 +00:00
Ke Wen	8bdcdae733	[DTensor] Support matmul in inference_mode (#142197 ) Fixes #142190 . The solution is to add a `decompose_handler` for `aten.matmul`, similar to how we handle `aten.linear`. With the decomposition, `aten.matmul` becomes `aten.mm` which has sharding strategy registered with DTensor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142197 Approved by: https://github.com/XilunWu, https://github.com/wz337	2024-12-06 07:15:05 +00:00
main-horse	52b7f0ba12	[DTensor] fix stride of fake tensor produced by `shard_dim_alltoall` (#141835 ) currently, DTensor redistributions involving all2all `Shard(n)->Shard(m)` will generate faulty inductor code when compiled: ```python # torchrun --nproc_per_node=2 crash.py import torch from torch.distributed.device_mesh import init_device_mesh from torch.distributed.tensor import Shard, DTensor mesh = init_device_mesh('cuda', (2,), mesh_dim_names=('ep',)) dt = DTensor.from_local(torch.randn(2, 4, device='cuda'), mesh, [Shard(0)]).requires_grad_() def f(dt): return dt.redistribute(placements=[Shard(1)]).to_local() f(dt).sum().backward() # no crash f = torch.compile(f) f(dt).sum().backward() # crash ``` resulting: ```python [rank1]: Traceback (most recent call last): [rank1]: File "/crash.py", line 11, in <module> [rank1]: f(dt).sum().backward() # crash [rank1]: ^^^^^ ... [rank1]: File "/tmp/torchinductor_main/gu/cgurkeb7tzx7kfsnooolsjefrgoizzylrldrugc52n4avmgiccas.py", line 41, in call [rank1]: assert_size_stride(buf0, (4, 2), (4, 1)) [rank1]: AssertionError: expected size 4==4, stride 2==4 at dim=0 ``` This happens because the current [`register_fake` implementation for `shard_dim_alltoall` ops](`5deca07c0d/torch/distributed/tensor/_collective_utils.py (L32)`) returns an erroneous stride: ```python import torch import torch.distributed as dist from torch._C._distributed_c10d import _register_process_group from torch.distributed.device_mesh import init_device_mesh from torch.distributed.tensor._collective_utils import _shard_dim_alltoall_meta, _get_group_size_by_name mesh = init_device_mesh('cuda', (2,), mesh_dim_names=('ep',)) _register_process_group('ep', mesh['ep'].get_group()) x = torch.randn(2, 4, device='meta') y = _shard_dim_alltoall_meta(x, 0, 1, 'ep') if dist.get_rank() == 0: print(x.shape, x.stride()) # torch.Size([2, 4]) (4, 1) print(y.shape, y.stride()) # torch.Size([4, 2]) (4, 1) ``` --- The proposed fix in the pull request causes the provided example 
code to compile correctly && stop erroring. However, I know very little about torch internals, and expect there to be something wrong with this patch. Any corrections are appreciated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141835 Approved by: https://github.com/awgu, https://github.com/tianyu-l	2024-12-06 06:56:03 +00:00
Aaron Gokaslan	08db735629	[BE]: Update mypy to 1.13.0 (#140808 ) Update mypy to 1.13.0 . Should hopefully reduce linting time. Has support for orjson cache serialization which should improve mypy cache perf if orjson is installed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140808 Approved by: https://github.com/ezyang, https://github.com/malfet	2024-12-03 02:50:10 +00:00
PyTorch MergeBot	daa77f3d9f	Revert "[BE]: Update mypy to 1.13.0 (#140808 )" This reverts commit `00134d68af`. Reverted https://github.com/pytorch/pytorch/pull/140808 on behalf of https://github.com/huydhn due to This is failing a distributed test in trunk, target determination missed this test and did not run it on PR ([comment](https://github.com/pytorch/pytorch/pull/140808#issuecomment-2512788426))	2024-12-02 20:47:43 +00:00
Aaron Gokaslan	00134d68af	[BE]: Update mypy to 1.13.0 (#140808 ) Update mypy to 1.13.0 . Should hopefully reduce linting time. Has support for orjson cache serialization which should improve mypy cache perf if orjson is installed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140808 Approved by: https://github.com/ezyang, https://github.com/malfet	2024-12-02 18:47:54 +00:00
Xilun Wu	ce572fedfc	[dtensor][random] use torch.uint64 as the seed/offset tensor dtype to avoid overflow (#141532 ) Summary DTensor RNG code raises an error if the seed passed in is beyond the `torch.int64` range (e.g. `torch.tensor([2**64-1])` raises an error). The solution is to specify `dtype=torch.uint64` in the `torch.tensor()` call. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141532 Approved by: https://github.com/wconstab ghstack dependencies: #141731, #141220, #141223	2024-11-29 07:59:34 +00:00
Xilun Wu	93cbb287c2	[dtensor][random] allow user to manual_seed different seed on device mesh; only sync RNG state in WORLD when manual_seed has not been called (#141223 ) Summary This PR proposes 4 changes to DTensor RNG management: 1. DTensor allows users to eagerly initialize the RNG tracker by calling `torch.distributed.tensor._random.manual_seed`. 2. DTensor `manual_seed` no longer checks the integrity of the `seed` argument. Users are responsible for setting the same seed on all ranks within an SPMD group, but if there are multiple separate SPMD groups (e.g. across pipeline stages), users should set a _different_ seed for each SPMD group. For cases like Pipeline Parallel, users can set different initial seed for pipelining stages by calling ``` world_mesh = init_device_mesh( device_type="cuda", mesh_shape=(2, 2, 2), mesh_dim_names=("pp", "dp", "tp"), ) pp_mesh = world_mesh["pp"] pp_rank = pp_mesh.get_local_rank() spmd_mesh = world_mesh["dp", "tp"]._flatten("spmd") # this flattening is only needed if you need to call collective over this mesh torch.distributed.tensor._random.manual_seed(123+pp_rank, spmd_mesh) ``` In other word, if users want to call `torch.distributed.tensor._random.manual_seed`, they will be responsible for passing in the right value and DTensor won't perform any checks on it. If the current rank is not a part of the mesh, it will use the current device RNG state to initialize. 3. `OffsetBasedRNGTracker` still performs RNG state synchronization by broadcasting the RNG state on rank 0 to `WORLD`. However, calling `torch.distributed.tensor._random.manual_seed` is an exception. In this case, no broadcast will happen. 4. Enforce that the `manual_seed` call only accept "full mesh" i.e. the DTensor RNG state on every rank must be set through the call. This makes sure that no rank has its RNG state left uninitialized and the SPMD ranks have their RNG state synchronous. Motivation tl;dr 1. 
Lazily initializing DTensor RNG tracker causes hang in non-SPMD code such as Pipeline Parallel. 2. Users may want to set different seed on ranks in one device mesh. 3. We want to keep the old behavior if users prefer not curating the RNG state and want to have DTensor take care of it. see detail in https://github.com/pytorch/pytorch/issues/140301 Test `pytest test/distributed/_tensor/test_random_ops.py` `pytest test/distributed/tensor/parallel/test_tp_random_state.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/141223 Approved by: https://github.com/wconstab ghstack dependencies: #141731, #141220	2024-11-29 07:59:34 +00:00
Xilun Wu	7f5bc9dd87	[dtensor][random][tp] remove the adhoc DTensor RNG tracker TensorParallelRNGTracker since it does not match FSDP2+TP (#141220 ) Summary The ad-hoc DTensor RNG tracker was used to mimic Megatron DDP+TP RNG behavior but it turns out not compatible with PyTorch Distributed FSDP2+TP so we decide to deprecate it and use `OffsetBasedRNGTracker` to replace, which follows the SPMD semantics (replicas get the same random sampling result, shards get different results). Motivation `TensorParallelRNGTracker` was designed for DDP+TP where the random operators produce the same result along the data parallel mesh dimension and different results along the tensor parallel dimension. However this does not apply to the new FSDP+TP composable combination where the model weights are sharded along data parallel mesh dimension as well. Therefore we decide to remove this outdated RNG tracker type for now. If users have demands for exact match between PyTorch Distributed and Megatron on Random Number generation result, feel free to file an issue. Impact `TensorParallelRNGTracker` was only used when Tensor Parallel is used (i.e. calling `parallelize_module`). For non-FSDP users, the "replicas get the same random numbers and shards get different ones" remains unchanged. Unlike `TensorParallelRNGTracker` which sets different seeds (`base_seed + 2718 + TP_rank`) within the TP group, DTensor now sets the same seed (default value is 1234 but users can call `torch.distributed.tensor._random.manual_seed` to modify) on all ranks but choose the right RNG offset based on DTensor placements to enforce the "replicas get the same random numbers and shards get different ones" invariant. For FSDP2 users, improvement should be observed in a way that DTensor sharded within DP group now gets different random number sampling which `TensorParallelRNGTracker` failed to do, though we're not sure how much this change will improve the eventual training loss convergence. 
Test 1-d model weight meta init: `pytest test/distributed/_tensor/test_random_ops.py -s -k test_tp_model_meta_init` 2-d model weight meta init: `pytest test/distributed/_tensor/test_random_ops.py -s -k test_fsdp_tp_model_meta_init` TP model weight init test: `pytest test/distributed/tensor/parallel/test_tp_random_state.py` FSDP+TP model weight init test: `pytest test/distributed/_composable/fsdp/test_fully_shard_init.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/141220 Approved by: https://github.com/wconstab ghstack dependencies: #141731	2024-11-29 07:59:26 +00:00
Xilun Wu	c55191f3a2	[dtensor][random] add 1d and 2d model meta init tests (#141731 ) Summary Added tests for model meta init on 1-d mesh (TP) and 2-d mesh (FSDP+TP). This exposes the issue where DTensor RNG failed to initialize weights differently across FSDP ranks. Test `pytest test/distributed/_tensor/test_random_ops.py -s -k meta_init` Pull Request resolved: https://github.com/pytorch/pytorch/pull/141731 Approved by: https://github.com/wconstab	2024-11-29 07:59:20 +00:00
Will Constable	54d26d670e	[CP] Add assertion for unsupported load-balance + non-causal (#141622 ) We actually do not support load-balance mode when non_causal = True, due to changes in data shuffling for load_balance mode. This PR just adds an assertion to make this limitation clear. Fixes #141429 Pull Request resolved: https://github.com/pytorch/pytorch/pull/141622 Approved by: https://github.com/XilunWu	2024-11-28 02:52:35 +00:00
Aaron Gokaslan	12e95aa4ee	[BE]: Apply PERF401 autofixes from ruff (#140980 ) * Automatically applies ruff rule 401. Turns loops into equivalent list comprehensions which are faster and do not leak the scope of the loop variables. * list comprehensions not only often have better typing, but are 50+% faster than for loops on overhead. They also preserve length information etc and are better for the interpreter to optimize. * Manually went back and made mypy happy after the change. * Also fixed style lints in files covered by flake8 but not by pyfmt Pull Request resolved: https://github.com/pytorch/pytorch/pull/140980 Approved by: https://github.com/justinchuby, https://github.com/malfet	2024-11-20 17:52:07 +00:00
Mikayla Gawarecki	f3f305ef3e	Fix condition for weights_only unpickler for DTensor (#140740 ) Same as #140739 but for DTensor (move safe globals for DTensor to `torch.distributed.tensor.__init__` and update error message to let user know `torch.distributed.tensor` must be imported to load DTensor) Differential Revision: [D65961690](https://our.internmc.facebook.com/intern/diff/D65961690) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140740 Approved by: https://github.com/malfet ghstack dependencies: #140739	2024-11-19 02:44:53 +00:00
zeshengzong	cb71bcc542	Replace clone.detach with detach.clone (#140264 ) Fixes #64532 As stated in the issue, replace `clone.detach` with `detach.clone`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140264 Approved by: https://github.com/soulitzer	2024-11-13 07:01:02 +00:00
IvanKobzarev	781c68c865	[aotd] coerce_same_metadata_as_tangent with expected_type for e.g.AsyncCollectiveTensor (#139095 ) Based on the discussion here: https://github.com/pytorch/pytorch/pull/138731 Introduces the ability for a subclass to implement type conversion to expected_type. ``` def __coerce_same_metadata_as_tangent__( self, expected_metadata: Any, expected_type: Optional[Type] = None ): ``` Here `expected_type=None` means `SubclassClass` is expected. E.g. for `DTensor` we may find a tangent `AsyncCollectiveTensor` where we expected `Tensor`; in this case the method will be called with `expected_type=Tensor` at runtime. Adds an implementation to AsyncCollectiveTensor that just triggers `wait()`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139095 Approved by: https://github.com/bdhirsh	2024-11-07 16:24:48 +00:00
PyTorch MergeBot	1d28b8b6d5	Revert "Deprecate `torch._utils.is_compiling()` and `torch._dynamo.external_utils.is_compiling()` (#127690 )" This reverts commit `e84d1121ad`. Reverted https://github.com/pytorch/pytorch/pull/127690 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. More details in D65483292 ([comment](https://github.com/pytorch/pytorch/pull/127690#issuecomment-2458381056))	2024-11-05 23:10:38 +00:00
Xuehai Pan	e84d1121ad	Deprecate `torch._utils.is_compiling()` and `torch._dynamo.external_utils.is_compiling()` (#127690 ) This PR is split from PR #126898. - #126898 ------ Pull Request resolved: https://github.com/pytorch/pytorch/pull/127690 Approved by: https://github.com/Skylion007, https://github.com/malfet	2024-11-05 10:44:56 +00:00
wz337	b71ab3fc85	[DTensor][Bug Fix]Fix 2D DTensor mm with mesh_shape (1, n) or (n, 1) (#139134 ) Fixes #138742. In the issue, the matrix multiplication with DTensor failed when the size of one of mesh dimension is 1 when the mesh is > 1D. We are missing tests for covering this corner case where mesh_shape is (n, 1) or (1, n). The DTensor mm op is correct when the 1D mesh is of shape (self.world_size, ) or 2D mesh with none of the mesh_dimension has a size of 1. In this PR, we fixed the corner case by updating `gen_einsum_strategies` in `_einsum_strategy.py`. Specifically, we cannot skip generating `mesh_dim_strategies` when `mesh_dim <= 1`, as this is not valid for nD mesh with one of the mesh dimension sizes being 1. Without the fix, the OpStrategy generated for 2D mesh with mesh_shape of (1,n) or (n,1) is wrong, as the OpStrategy generated is 1D. ``` all_mesh_dim_strategies=[[[Replicate(), Replicate(), Replicate()], [Partial(sum), Shard(dim=1), Shard(dim=0)], [Shard(dim=0), Shard(dim=0), Replicate()], [Shard(dim=1), Replicate(), Shard(dim=1)]]] OpStrategy(all_strategies)::: [(R, R) -> R, (S(1), S(0)) -> P, (S(0), R) -> S(0), (R, S(1)) -> S(1)] @ mesh: (4, 1)[(R, R) -> R, (S(1), S(0)) -> P, (S(0), R) -> S(0), (R, S(1)) -> S(1)] @ mesh: (4, 1) ``` After the fix, we can see the OpStrategy generated is correct with 2D strategy. 
``` all_mesh_dim_strategies=[[[Replicate(), Replicate(), Replicate()], [Partial(sum), Shard(dim=1), Shard(dim=0)], [Shard(dim=0), Shard(dim=0), Replicate()], [Shard(dim=1), Replicate(), Shard(dim=1)]]][[[Replicate(), Replicate(), Replicate()], [Partial(sum), Shard(dim=1), Shard(dim=0)], [Shard(dim=0), Shard(dim=0), Replicate()], [Shard(dim=1), Replicate(), Shard(dim=1)]]] OpStrategy(all_strategies) = [(RR, RR) -> RR, (RS(1), RS(0)) -> RP, (RS(0), RR) -> RS(0), (RR, RS(1)) -> RS(1), (S(1)R, S(0)R) -> PR, (S(1)S(1), S(0)S(0)) -> PP, (S(1)S(0), S(0)R) -> PS(0), (S(1)R, S(0)S(1)) -> PS(1), (S(0)R, RR) -> S(0)R, (S(0)S(1), RS(0)) -> S(0)P, (S(0)S(0), RR) -> S(0)S(0), (S(0)R, RS(1)) -> S(0)S(1), (RR, S(1)R) -> S(1)R, (RS(1), S(1)S(0)) -> S(1)P, (RS(0), S(1)R) -> S(1)S(0), (RR, S(1)S(1)) -> S(1)S(1)] @ mesh: (4, 1) ``` ***** As a follow up, we should add more test coverage for DTensor op with 2D mesh and 2D mesh with one of the size of mesh dimension being 1. ***** Pull Request resolved: https://github.com/pytorch/pytorch/pull/139134 Approved by: https://github.com/fegin	2024-10-30 08:09:39 +00:00
wz337	7e951c1675	[EZ][DTensor] Update DTensor readme to use the new import path (#138625 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/138625 Approved by: https://github.com/XilunWu	2024-10-23 00:08:36 +00:00
Chien-Chin Huang	0b4a071a1d	[CP] Implement AllGather based context parallelism (#132820 ) Summary: This implementation does not utilize the benefit that after allgather we can directly perform the SDPA without doing the ring-based SDPA, but we can overlap the communication with the first sharded kv computation. This implementation shows some performance benefit and memory saving compared to the original alltoall implementation in certain cases. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132820 Approved by: https://github.com/XilunWu	2024-10-22 05:25:50 +00:00
Tom Ritchford	c0582fd0f8	Remove unused Python variables in torch/[b-z]* (#136963 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136963 Approved by: https://github.com/ezyang	2024-10-19 16:45:22 +00:00
Aaron Gokaslan	195d0a666b	[BE][Ez]: Use interned hardcoded string FURB156 (#138330 ) Uses string constants from string module. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138330 Approved by: https://github.com/albanD	2024-10-18 18:26:16 +00:00
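
Note on 52b7f0ba12 (#141835): the assertion in that traceback complains about stride 2 versus 4 at dim 0 of a (4, 2) buffer. A minimal sketch, assuming the real output of the all-to-all is a freshly allocated contiguous (row-major) tensor, shows where the 2 comes from and why carrying over the (2, 4) input's strides gives 4 instead. `contiguous_strides` is a hypothetical helper written for this note, not a PyTorch API.

```python
def contiguous_strides(shape):
    """Row-major strides (in elements) for a dense tensor of the given shape."""
    strides = []
    step = 1
    # Walk dimensions from innermost to outermost, accumulating the step size.
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


input_shape = (2, 4)   # local shard before Shard(0) -> Shard(1) on 2 ranks
output_shape = (4, 2)  # local shard after the redistribution

input_strides = contiguous_strides(input_shape)
output_strides = contiguous_strides(output_shape)

# Reusing the input's strides for the output shape is only valid if they match.
strides_match = input_strides == output_strides
print(input_strides, output_strides, strides_match)
```

Under that assumption, a fake (meta) implementation has to derive strides from the output shape rather than inherit them from the input.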

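
Note on ce572fedfc (#141532): the overflow described there is a plain integer-range problem, which can be shown without PyTorch. The sketch below uses the standard-library `struct` module as a stand-in for a 64-bit tensor element; `fits` is a hypothetical helper, and the mapping of `q`/`Q` to `torch.int64`/`torch.uint64` is an analogy, not a statement about how PyTorch stores seeds.

```python
import struct


def fits(fmt, value):
    """Return True if value can be packed with the given struct format code."""
    try:
        struct.pack(fmt, value)
        return True
    except (struct.error, OverflowError):
        return False


seed = 2**64 - 1  # the example seed from the commit message

fits_signed = fits("<q", seed)    # 64-bit signed, analogous to int64
fits_unsigned = fits("<Q", seed)  # 64-bit unsigned, analogous to uint64
print(fits_signed, fits_unsigned)
```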

200 Commits
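
Note on b71ab3fc85 (#139134): the corrected output in that commit lists 16 strategies for the (4, 1) mesh, which is what results from choosing one of the four single-dimension mm strategies independently on each mesh dimension, including the dimension of size 1. A rough sketch in plain Python follows, with placements written as strings. The names `PER_DIM` and `combine` are invented for illustration; this is not the `gen_einsum_strategies` implementation.

```python
from itertools import product

# Per-mesh-dimension mm strategies as (lhs, rhs, out), read off the log output.
PER_DIM = [
    ("R", "R", "R"),
    ("S(1)", "S(0)", "P"),
    ("S(0)", "R", "S(0)"),
    ("R", "S(1)", "S(1)"),
]


def combine(mesh_ndim):
    """Pick one strategy per mesh dimension and join the placements."""
    rows = []
    for choice in product(PER_DIM, repeat=mesh_ndim):
        lhs = "".join(c[0] for c in choice)
        rhs = "".join(c[1] for c in choice)
        out = "".join(c[2] for c in choice)
        rows.append(f"({lhs}, {rhs}) -> {out}")
    return rows


strategies_1d = combine(1)
strategies_2d = combine(2)
print(len(strategies_1d), len(strategies_2d))
print(strategies_2d[:4])
```

Skipping a mesh dimension because its size is 1 drops one factor of the product, which is how a 2D mesh ended up with a 1D strategy list before the fix.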