Commit Graph

36 Commits

Author SHA1 Message Date
Yuanyuan Chen
a60d9e1f6d Fix flake8 B028 warnings (#166224)
This PR fixes flake8 B028 warnings by specifying stacklevel=2 in `warnings.warn`. The advantage is that users get more contextual information about where PyTorch warnings originate.
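For context, a minimal sketch of the pattern this change applies (the function and message below are illustrative, not taken from the PR):

```
import warnings

def deprecated_helper():
    # stacklevel=2 attributes the warning to the caller of this function,
    # rather than to this line inside the library.
    warnings.warn(
        "deprecated_helper is deprecated; use new_helper instead",
        DeprecationWarning,
        stacklevel=2,
    )

deprecated_helper()
```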

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166224
Approved by: https://github.com/ezyang
2025-10-26 06:18:55 +00:00
Maggie Moss
c7eee49525 Fix pyrefly ignores 1/n (#166239)
First diff adjusting the syntax of `pyrefly: ignore` suppressions so they only hide one class of type error.
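A sketch of the scoped suppression form implied by this change (the bracketed error-code spelling is an assumption, not confirmed pyrefly syntax):

```
from typing import Any

def untyped_value() -> Any:
    return "not an int"

# Broad form: hides every pyrefly error reported on this line.
x: int = untyped_value()  # pyrefly: ignore

# Scoped form (spelling assumed): hides only the named class of type error.
y: int = untyped_value()  # pyrefly: ignore[bad-assignment]
```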

Test:
lintrunner
pyrefly check

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166239
Approved by: https://github.com/oulgen
2025-10-26 00:44:10 +00:00
Dzmitry Huba
5e58420dff LocalTensor (#164537)
A LocalTensor is a tensor subclass which simulates a tensor that is
distributed across SPMD ranks.  A LocalTensor might be size N, but in fact
there are world_size shards/replicas of it stored internally.  When you do a
plain PyTorch operation on it, we apply the operation to each shard; when you
do a collective, we do the mathematically equivalent operation on the local
shards.  A LocalTensor is associated with a list of ranks which specify
which ranks it holds local tensors for.

NB: this is NOT a DataParallel-like abstraction where you can run operations
on multiple different GPUs. It is intended purely for *debugging* purposes;
the overhead is almost certainly too high to keep eight GPUs busy (even the C++
autograd needs multithreading to keep up!). (It might potentially be possible
to trace through this with torch.compile and then compile it with CUDA graphs,
but this is currently a non-goal.)

In order to handle MPMD, we provide a helper decorator that allows you to
run a side-effect-free function for each LocalTensor shard and combine the
results back into a LocalTensor or LocalIntNode.
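As an illustration of the per-shard idea only (the decorator name and the dict-of-shards representation below are hypothetical, not the actual LocalTensor API):

```
import torch

def per_shard(fn):
    """Hypothetical helper: apply a side-effect-free fn to each rank's shard
    and combine the per-rank results back into a dict of shards."""
    def wrapper(shards, *args, **kwargs):
        # shards maps rank -> that rank's local tensor
        return {rank: fn(t, *args, **kwargs) for rank, t in shards.items()}
    return wrapper

@per_shard
def scale(t, alpha):
    return t * alpha

# Two ranks' worth of shards for one logical tensor.
shards = {0: torch.ones(4), 1: torch.full((4,), 2.0)}
print(scale(shards, alpha=3.0))
```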

Note: This PR converts all DTensor ops and some DTensor tests to illustrate the
intended usage and ensure correctness. More tests will be converted in
subsequent PRs. During test conversion we aim to share as much test logic as
possible between multi-process / multi-threaded and local tensor tests, so that
developers are able to run both flavors of the tests.

Note: This work is based on the original proposal
by @ezyang (WIP PR https://github.com/pytorch/pytorch/pull/162753).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164537
Approved by: https://github.com/ezyang
2025-10-12 20:06:41 +00:00
Dzmitry Huba
ae25dd51fc Simplifying computation of the final result for equals op on DTensor (#164999)
Instead of collecting local results using all_gather_object followed by a local reduction, with this change we switch to a single all_reduce with a MIN reduction operation to compute the final equals result.

This change is needed to enable the LocalTensor work (all_gather_object introduces challenges for DTensor and LocalTensor integration).
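A minimal standalone sketch of the all_reduce(MIN) pattern described above, using torch.distributed directly rather than DTensor internals (assumes an initialized process group):

```
import torch
import torch.distributed as dist

def distributed_equal(local_a, local_b):
    # 1 if this rank's local shards match, 0 otherwise.
    local_result = torch.tensor(
        1 if torch.equal(local_a, local_b) else 0, dtype=torch.int32
    )
    # A single MIN all_reduce: the global result is 1 only if every rank agreed.
    dist.all_reduce(local_result, op=dist.ReduceOp.MIN)
    return bool(local_result.item())
```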

topic: not user facing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164999
Approved by: https://github.com/ezyang
2025-10-10 03:01:28 +00:00
Maggie Moss
9944cac6e6 Add suppressions to torch/_inductor (#165062)
Adds suppressions so that pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283

Split this directory into two PRs to keep them from being too large.

Test plan:
dmypy restart && python3 scripts/lintrunner.py -a
pyrefly check

step 1: delete lines in the pyrefly.toml file from the project-excludes field
step 2: run pyrefly check
step 3: add suppressions, clean up unused suppressions
before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199

after:
INFO 0 errors (6,884 ignored)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165062
Approved by: https://github.com/oulgen, https://github.com/mlazos
2025-10-09 20:34:20 +00:00
Maggie Moss
7457d139c5 Add pyrefly suppressions to torch/distributed (7/n) (#165002)
Adds suppressions so that pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283

One more PR after this one.

Test plan:
dmypy restart && python3 scripts/lintrunner.py -a
pyrefly check

step 1: delete lines in the pyrefly.toml file from the project-excludes field
step 2: run pyrefly check
step 3: add suppressions, clean up unused suppressions
before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199

after:
INFO 0 errors (6,884 ignored)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165002
Approved by: https://github.com/oulgen
2025-10-09 04:08:25 +00:00
Sherlock Huang
4c2c401ccf Record redistribute_local_tensor in DebugMode (#163704)
An explicit redistribute_local_tensor API call can also result in communication, so record it!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163704
Approved by: https://github.com/ezyang
2025-09-24 16:11:26 +00:00
Sherlock Huang
95ac7d724e Rename to _debug_mode.py to make it private (#163534)
rename debug_mode.py to _debug_mode.py to make it private, per @alban's request.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163534
Approved by: https://github.com/albanD
2025-09-23 04:27:10 +00:00
Edward Yang
2887f3fde4 [BE] Slight improvements to documentation in python_dispatch (#162963)
I was briefly confused about which way I should iterate the stack; here are the
comments I wanted.

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162963
Approved by: https://github.com/albanD, https://github.com/SherlockNoMad
2025-09-21 01:45:46 +00:00
rzou
98ce93db0b [DTensor] Add guide for what to do about mixed torch.Tensor and DTensor operations (#162651)
Also updates the error message to point to the guide.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162651
Approved by: https://github.com/ezyang
ghstack dependencies: #162117, #162307
2025-09-18 06:41:02 +00:00
Sherlock Huang
f8d379d29e [DTensor] Introduce DebugMode (#162665)
Introduce a lightweight TorchDispatchMode for understanding the magic behind DTensor.

- Tracks redistribution, see `redistribute_input(input_idx, from_placement, to_placement)`
- Optionally tracks torch-level functions, via `__torch_function__`
- Optionally tracks FakeTensor operations, which was needed for propagating tensor meta as a step of sharding propagation
- Optionally tracks real tensor operations, including functional c10d op, and regular ops
- Calls are shown in the hierarchical structure!
- shorthand representation
  - dt: DTensor, ft: FakeTensor, t: Tensor
  - DM(2, 2) == DeviceMesh(shape = [2, 2])
  - [R, P, S(0)] == Placement[Replicate, Partial, Shard(0)]
  - f32[8,8] == float32 with shape[8, 8]

```
  debug_mode = DTensorDebugMode(record_faketensor=False, record_realtensor=True)
  with debug_mode:
      torch.mm(x_dtensor, y_dtensor)
  print(debug_mode.debug_string())
```
produces:
```
  torch.mm(dt: f32[8, 8][S(0)], dt: f32[8, 32][S(0)])
    aten::mm(dt: f32[8, 8][S(0)], dt: f32[8, 32][S(0)])
      redistribute_input(1, [S(0)], [R])
        _c10d_functional::all_gather_into_tensor(t: f32[1, 32], 8, 0)
        _c10d_functional::wait_tensor(t: f32[8, 32])
      aten::mm(t: f32[1, 8], t: f32[8, 32])
```

Another example, for torch.einsum
```
  torch.functional.einsum(bld,dnh->blnh, dt: f32[16, 6, 8][P, R], dt: f32[8, 4, 4][R, P])
    aten::unsqueeze(dt: f32[16, 6, 8][P, R], 3)
      aten::unsqueeze(t: f32[16, 6, 8], 3)
    aten::unsqueeze(dt: f32[16, 6, 8, 1][P, R], 4)
      aten::unsqueeze(t: f32[16, 6, 8, 1], 4)
    aten::permute(dt: f32[16, 6, 8, 1, 1][P, R], [0, 1, 3, 4, 2])
      aten::permute(t: f32[16, 6, 8, 1, 1], [0, 1, 3, 4, 2])
    aten::unsqueeze(dt: f32[8, 4, 4][R, P], 3)
      aten::unsqueeze(t: f32[8, 4, 4], 3)
    aten::unsqueeze(dt: f32[8, 4, 4, 1][R, P], 4)
      aten::unsqueeze(t: f32[8, 4, 4, 1], 4)
    aten::permute(dt: f32[8, 4, 4, 1, 1][R, P], [3, 4, 1, 2, 0])
      aten::permute(t: f32[8, 4, 4, 1, 1], [3, 4, 1, 2, 0])
    aten::permute(dt: f32[16, 6, 1, 1, 8][P, R], [0, 1, 4, 2, 3])
      aten::permute(t: f32[16, 6, 1, 1, 8], [0, 1, 4, 2, 3])
    aten::view(dt: f32[16, 6, 8, 1, 1][P, R], [1, 96, 8])
      aten::view(t: f32[16, 6, 8, 1, 1], [1, 96, 8])
    aten::permute(dt: f32[1, 1, 4, 4, 8][R, P], [4, 2, 3, 0, 1])
      aten::permute(t: f32[1, 1, 4, 4, 8], [4, 2, 3, 0, 1])
    aten::view(dt: f32[8, 4, 4, 1, 1][R, P], [1, 8, 16])
      aten::view(t: f32[8, 4, 4, 1, 1], [1, 8, 16])
    aten::bmm(dt: f32[1, 96, 8][P, R], dt: f32[1, 8, 16][R, P])
      redistribute_input(0, [P, R], [S(2), S(2)])
        aten::chunk(t: f32[1, 96, 8], 4, 2)
        aten::cat(['t: f32[1, 96, 2]', 't: f32[1, 96, 2]', 't: f32[1, 96, 2]', 't: f32[1, 96, 2]'])
        _c10d_functional::reduce_scatter_tensor(t: f32[4, 96, 2], sum, 4, 2)
        aten::clone(t: f32[1, 96, 1])
      redistribute_input(1, [R, P], [S(1), S(1)])
        aten::chunk(t: f32[1, 8, 16], 4, 1)
        aten::clone(t: f32[1, 2, 16])
        aten::chunk(t: f32[1, 2, 16], 2, 1)
        aten::cat(['t: f32[1, 1, 16]', 't: f32[1, 1, 16]'])
        _c10d_functional::reduce_scatter_tensor(t: f32[2, 1, 16], sum, 2, 3)
        _c10d_functional::wait_tensor(t: f32[1, 1, 16])
      aten::bmm(t: f32[1, 96, 1], t: f32[1, 1, 16])
    aten::view(dt: f32[1, 96, 16][P, P], [16, 6, 1, 4, 4])
      aten::view(t: f32[1, 96, 16], [16, 6, 1, 4, 4])
    aten::permute(dt: f32[16, 6, 1, 4, 4][P, P], [0, 1, 3, 4, 2])
      aten::permute(t: f32[16, 6, 1, 4, 4], [0, 1, 3, 4, 2])
    aten::view(dt: f32[16, 6, 4, 4, 1][P, P], [16, 6, 4, 4])
      aten::view(t: f32[16, 6, 4, 4, 1], [16, 6, 4, 4])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162665
Approved by: https://github.com/ezyang
2025-09-16 07:30:05 +00:00
Scott Wolchok
f68f76d8c7 Remove logger.debug statements in DTensor dispatch (#161596)
These seem to have been costing us 5-10 usec per detach (out of ~95 usec total). If they need to ship, let's talk about requirements and how we can make this more efficient, given that we would prefer an entire DTensor op to finish in 10 usec.

Differential Revision: [D81530106](https://our.internmc.facebook.com/intern/diff/D81530106)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161596
Approved by: https://github.com/ezyang, https://github.com/Skylion007
ghstack dependencies: #161591, #161595, #161633, #161634, #161692, #162219, #162220, #162218
2025-09-11 06:58:35 +00:00
rzou
70d36e047d Making batching rule for F.embedding DTensor-aware (#162117)
`vmap(F.embedding)(DTensor, DTensor)` was failing because F.embedding's
batching rule generates a new tensor via at::arange, at::arange
generates a regular tensor, and DTensor rightfully errors on mixed
DTensor-regular Tensor operations.

This PR fixes the problem by activating DTensor implicit replication on
just the at::arange and the subsequent add operation.

In order to accomplish this, I moved the DTensor implicit replication flag
to C++ (most batching rules are in C++).
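For context, a sketch of the Python-level escape hatch this flag backs, using the public experimental context manager (illustrative usage only; the helper function is hypothetical):

```
import torch
from torch.distributed.tensor.experimental import implicit_replication

def add_positional_offsets(dtensor):
    # at::arange produces a plain Tensor; inside this context it is treated as
    # replicated, so mixing it with a DTensor no longer errors.
    with implicit_replication():
        offsets = torch.arange(dtensor.shape[-1], device=dtensor.device)
        return dtensor + offsets
```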

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162117
Approved by: https://github.com/bdhirsh
2025-09-05 21:40:14 +00:00
Will Constable
4fd761fecc [DTensor] Wrap sharding prop error with contextual exception (#161574)
Mainly, this gives the user more information about the operator that failed to
run when sharding propagation fails.

Previously, only this exception would be raised:
```
RuntimeError: ('Attempted to flatten sharded dimension 1, ', 'but only the leftmost dim of a Flatten can be sharded.')
```

Now you get both the above exception as well as

```
The above exception was the direct cause of the following exception:
RuntimeError: Sharding propagation failed for Op(op=aten.view.default, args_schema=Spec((Replicate(), Shard(dim=0), Shard(dim=1), Shard(dim=2)) on (8, 8, 4)), [64, 4] @ mesh: (1, 2, 2, 2))
```

<stacktrace omitted>
<details><summary>detailed error</summary>

```
======================================================================
ERROR: test_linear (__main__.TestDTensor)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/users/whc/pytorch/torch/testing/_internal/common_distributed.py", line 668, in wrapper
    self._join_processes(fn)
  File "/data/users/whc/pytorch/torch/testing/_internal/common_distributed.py", line 932, in _join_processes
    self._check_return_codes(fn, elapsed_time)
  File "/data/users/whc/pytorch/torch/testing/_internal/common_distributed.py", line 972, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Process 4 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/data/users/whc/pytorch/torch/distributed/tensor/_dispatch.py", line 150, in dispatch
    self.sharding_propagator.propagate(op_info)
  File "/data/users/whc/pytorch/torch/distributed/tensor/_sharding_prop.py", line 309, in propagate
    OutputSharding, self.propagate_op_sharding(op_info.schema)
  File "/data/users/whc/pytorch/torch/distributed/tensor/_sharding_prop.py", line 45, in __call__
    return self.cache(*args, **kwargs)
  File "/data/users/whc/pytorch/torch/distributed/tensor/_sharding_prop.py", line 329, in propagate_op_sharding_non_cached
    op_strategy = self.op_strategy_funcs[op_schema.op](strategy_schema)
  File "/data/users/whc/pytorch/torch/distributed/tensor/_ops/_view_ops.py", line 673, in reshape_strategy
    input_tgt_placements, output_placements = propagate_shape_and_sharding(
  File "/data/users/whc/pytorch/torch/distributed/tensor/_ops/_view_ops.py", line 601, in propagate_shape_and_sharding
    in_dim = get_in_dim_to_shard(cmd)
  File "/data/users/whc/pytorch/torch/distributed/tensor/_ops/_view_ops.py", line 537, in get_in_dim_to_shard
    raise RuntimeError(
RuntimeError: ('Attempted to flatten sharded dimension 1, ', 'but only the leftmost dim of a Flatten can be sharded.')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/data/users/whc/pytorch/torch/testing/_internal/common_distributed.py", line 816, in run_test
    getattr(self, test_name)()
  File "/data/users/whc/pytorch/torch/testing/_internal/common_distributed.py", line 670, in wrapper
    fn()
  File "/data/users/whc/pytorch/torch/testing/_internal/common_utils.py", line 3224, in wrapper
    method(*args, **kwargs)
  File "/data/users/whc/pytorch/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 490, in wrapper
    raise e
  File "/data/users/whc/pytorch/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 487, in wrapper
    func(self, *args, **kwargs)  # type: ignore[misc]
  File "/data/users/whc/pytorch/test.py", line 60, in test_linear
    print("results: ", distributed_linear(distributed_input))
  File "/data/users/whc/pytorch/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/users/whc/pytorch/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/users/whc/pytorch/torch/nn/modules/linear.py", line 134, in forward
    return F.linear(input, self.weight, self.bias)
  File "/data/users/whc/pytorch/torch/_compile.py", line 53, in inner
    return disable_fn(*args, **kwargs)
  File "/data/users/whc/pytorch/torch/_dynamo/eval_frame.py", line 1005, in _fn
    return fn(*args, **kwargs)
  File "/data/users/whc/pytorch/torch/distributed/tensor/_api.py", line 358, in __torch_dispatch__
    return DTensor._op_dispatcher.dispatch(
  File "/data/users/whc/pytorch/torch/distributed/tensor/_dispatch.py", line 163, in dispatch
    raise RuntimeError(
RuntimeError: Sharding propagation failed for Op(op=aten.view.default, args_schema=Spec((Replicate(), Shard(dim=0), Shard(dim=1), Shard(dim=2)) on (8, 8, 4)), [64, 4] @ mesh: (1, 2, 2, 2))
```
</details>
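The wrapping itself is plain Python exception chaining; a minimal runnable sketch of the pattern (function and message names are illustrative, not the exact DTensor internals):

```
def propagate_sharding(op_name):
    # Toy stand-in for sharding propagation.
    raise RuntimeError("Attempted to flatten sharded dimension 1, "
                       "but only the leftmost dim of a Flatten can be sharded.")

def dispatch(op_name):
    try:
        propagate_sharding(op_name)
    except Exception as e:
        # `raise ... from e` keeps the original error as __cause__, producing the
        # "direct cause of the following exception" output shown above.
        raise RuntimeError(f"Sharding propagation failed for Op(op={op_name}, ...)") from e

dispatch("aten.view.default")
```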

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161574
Approved by: https://github.com/zpcore, https://github.com/XilunWu
2025-08-28 15:56:15 +00:00
Xilun Wu
6c05ea6475 [DTensor] add op support: aten.squeeze_.dim (#159532)
**Summary**
This PR enables the in-place op `aten.squeeze_.dim` on DTensor with a change to
DTensor dispatch logic: when processing an in-place operator, we should assign
`output_sharding.output_spec` back to the first argument. This is because
the in-place op_call on `arg._local_tensor` can also shift the tensor meta.
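Schematically, the dispatch adjustment looks roughly like this (pseudocode following the description above, not the real dispatcher; `_local_tensor` and `_spec` are DTensor internals):

```
def dispatch_inplace(op_call, args, output_sharding):
    dtensor_arg = args[0]  # the DTensor being mutated in place
    # Running the in-place op on the local shard may shift its tensor meta
    # (e.g. a dimension gets squeezed away).
    op_call(dtensor_arg._local_tensor, *args[1:])
    # So the propagated output spec must be assigned back to the first argument.
    dtensor_arg._spec = output_sharding.output_spec
    return dtensor_arg
```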

**Test**
`pytest test/distributed/tensor/test_view_ops.py -s -k  test_squeeze_`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159532
Approved by: https://github.com/zpcore
2025-08-14 18:01:19 +00:00
Edward Z. Yang
3cec82a7e9 Ensure outer aliasing on DTensor matches inner aliasing (#158954)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158954
Approved by: https://github.com/albanD, https://github.com/wconstab
2025-08-12 17:47:48 +00:00
Will Constable
3cf7b4024e [DTensor] Support user-supplied Generator for random ops (#159933)
If the user provides a generator kwarg to a random op (e.g.
nn.init.uniform_(..., generator=my_generator)), we can still advance
that generator's state in a SPMD-global way so that each local-tensor
gets appropriate values and the generator advances to the same state as
if it had operated on the full tensor.
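A usage sketch of the behavior described above (mesh and shapes are illustrative; assumes an initialized process group and CUDA devices):

```
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import distribute_tensor, Shard

mesh = init_device_mesh("cuda", (2,))
weight = distribute_tensor(torch.empty(8, 8), mesh, [Shard(0)])

g = torch.Generator(device="cuda").manual_seed(1234)
# The user-supplied generator is advanced in an SPMD-global way, so each rank's
# local shard receives the values it would have gotten on the full tensor.
nn.init.uniform_(weight, generator=g)
```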

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159933
Approved by: https://github.com/fduwjj, https://github.com/XilunWu, https://github.com/wanchaol
2025-08-07 18:47:22 +00:00
zpcore
bb6766053b fix strategy hashing arg mismatch (#159506)
Reland https://github.com/pytorch/pytorch/pull/159289.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159506
Approved by: https://github.com/XilunWu
2025-08-01 05:42:40 +00:00
PyTorch MergeBot
c07bb277a0 Revert "fix strategy hashing arg mismatch (#159506)"
This reverts commit 3a55676200.

Reverted https://github.com/pytorch/pytorch/pull/159506 on behalf of https://github.com/yangw-dev due to failed the internal tests test_get_bwd_hook (torch.equal(output * 2, input_tensor.grad)) ([comment](https://github.com/pytorch/pytorch/pull/159506#issuecomment-3140858905))
2025-07-31 17:54:29 +00:00
zpcore
3a55676200 fix strategy hashing arg mismatch (#159506)
Reland https://github.com/pytorch/pytorch/pull/159289.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159506
Approved by: https://github.com/XilunWu
2025-07-30 21:37:13 +00:00
Will Constable
2176d481c1 [DTensor] dispatch to sharding prop over decomps (#159324)
Fixes #159110

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159324
Approved by: https://github.com/ezyang
2025-07-29 21:28:36 +00:00
Xuehai Pan
3f8e2e91ad [BE][15/16] fix typos in torch/ (torch/distributed/tensor/) (#156605)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156605
Approved by: https://github.com/wanchaol, https://github.com/albanD
2025-07-17 12:08:33 +00:00
Wanchao Liang
ee5c2908cb [dtensor] refactor PlacementStrategy -> OpSpec, move utils to OpSchema (#155592)
as titled. It's sometimes confusing to use PlacementStrategy as a name: we also
have OpStrategy and TupleStrategy, and the latter two contain the former, so it
is better to make the naming clearer.

Renaming PlacementStrategy -> OpSpec as it is an operator spec that
contains output_spec + input_specs.
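Schematically, the renamed type bundles exactly those specs (a simplified sketch with stand-in types; the real OpSpec likely carries additional fields):

```
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class TensorSpec:
    """Stand-in for a DTensor spec (mesh + placements), to keep the sketch self-contained."""
    mesh_shape: tuple
    placements: tuple

@dataclass
class OpSpec:
    """An operator spec: the output spec plus the specs expected of each input."""
    output_spec: TensorSpec
    input_specs: Optional[Sequence[TensorSpec]] = None
```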

Also, some utils could be merged into OpSchema, so they are included together
in this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155592
Approved by: https://github.com/awgu
2025-06-12 00:51:36 +00:00
Brian Hirsh
1c6b517e19 DTensor: more generically support CompositeImplicitAutograd ops under inference mode (#149514)
Today, if you run DTensor (or any tensor subclass) under inference_mode, you will start seeing `CompositeImplicitAutograd` ops show up in `__torch_dispatch__`.

"handling" these ops is trivial: you can just tell them to decompose into their constituent ops. Normally this decomposing happens in autograd, above DTensor, but inference_mode turns autograd off, forcing the subclass to handle the op directly.

It looks like previously we manually added a few CompositeImplicitAutograd entries to DTensor (e.g. linear), but this PR tries to support these ops a bit more generically.

The main difference is that DTensor now needs to check if a given op is `CompositeImplicitAutograd` before attempting to run sharding prop. I ran a quick microbenchmark for the below code with `timeit`, which gave me overhead on the order of ~1us, which is hopefully not too bad for eager mode:

```
# op_call is the OpOverload being dispatched
def fast_function():
    return torch._C._dispatch_has_kernel_for_dispatch_key(
        op_call.name(), torch._C.DispatchKey.CompositeImplicitAutograd
    )

import timeit
time_taken = timeit.timeit(fast_function, number=1000)
# printed 0.12..., aka 1.2us
print(f'func={str(op_call)}, time={str(time_taken)}')
```
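For illustration, the generic check-and-decompose pattern can be sketched in a subclass `__torch_dispatch__` roughly like this (a simplified sketch, not DTensor's actual dispatch code):

```
import torch

class DecomposingTensor(torch.Tensor):
    @classmethod
    def __torch_dispatch__(cls, op_call, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        # CompositeImplicitAutograd ops have a decomposition into simpler ops;
        # run it instead of requiring a dedicated rule for every such op.
        if torch._C._dispatch_has_kernel_for_dispatch_key(
            op_call.name(), torch._C.DispatchKey.CompositeImplicitAutograd
        ):
            return op_call.decompose(*args, **kwargs)
        raise NotImplementedError(f"no rule registered for {op_call}")
```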

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149514
Approved by: https://github.com/kwen2501, https://github.com/albanD, https://github.com/wanchaol
2025-03-21 22:09:19 +00:00
Wanchao Liang
f859722f70 [dtensor] refactor sharding prop to handle cross mesh computation (#147869)
as titled, this PR moves the same-mesh check from the sharding propagation level to each individual operator level.

This allows each individual operator more flexibility to check whether it can run on the given meshes. For example, before this PR, if a user had two DTensor params that live on different DeviceMeshes and wanted to run a `foreach` operator on them individually, it would error out with a cross-mesh error. But for foreach computation there can be DTensors that live on different meshes, as long as the meshes match up in a "zipped" way.

This should also fix https://github.com/pytorch/pytorch/issues/134212


Pull Request resolved: https://github.com/pytorch/pytorch/pull/147869
Approved by: https://github.com/tianyu-l
2025-03-04 18:30:44 +00:00
Xuehai Pan
995df34b19 [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144547
Approved by: https://github.com/kwen2501
2025-02-28 07:35:56 +00:00
Xilun Wu
ef61c290e1 [DTensor][random] defer DTensor RNG state sync until first random op call or manual_seed call; support more flexible OffsetBasedRNGTracker init (#147025)
Resolves https://github.com/pytorch/pytorch/issues/146767.

May also resolve https://github.com/pytorch/pytorch/issues/147584.

### Summary
This PR removes the RNG tracker init from the `distribute_tensor` call for the following reasons:

1. if the user does not use random ops on DTensor, there's no need to init the DTensor RNG, which currently requires a CUDA device to be present.
2. this complies with the 0-communication semantic of `src_data_rank=None` shard distribution.

Besides, `OffsetBasedRNGTracker` only accepts a `DeviceMesh` argument to its constructor.

### Consequence

DTensor RNG initialization is deferred until the first DTensor random op call or a call to `torch.distributed.tensor.random.manual_seed`.
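A usage sketch of the deferred behavior (mesh shape and seed are illustrative; assumes an initialized process group and CUDA devices):

```
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import distribute_tensor, Shard
import torch.distributed.tensor.random as dt_random

mesh = init_device_mesh("cuda", (2,))
# No RNG tracker is initialized at distribution time anymore:
w = distribute_tensor(torch.empty(16, 16), mesh, [Shard(0)])

# RNG state sync happens lazily, at the first random op or explicit seeding:
dt_random.manual_seed(0, mesh)
torch.nn.init.normal_(w)
```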

### Test
`pytest test/distributed/tensor/test_random_ops.py`
`pytest test/distributed/tensor/parallel/test_tp_random_state.py`
`pytest test/distributed/tensor/parallel/test_tp_style.py`

Differential Revision: [D70201856](https://our.internmc.facebook.com/intern/diff/D70201856)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147025
Approved by: https://github.com/kwen2501
2025-02-26 17:33:22 +00:00
Aaron Orenstein
c95efc37ba PEP585 update - torch/distributed/tensor (#145141)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145141
Approved by: https://github.com/bobrenjc93
2025-01-18 20:01:59 +00:00
bobrenjc93
08be9ec312 Migrate from Tuple -> tuple in torch/distributed (#144258)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144258
Approved by: https://github.com/aorenste
2025-01-10 08:34:54 +00:00
Aaron Orenstein
45ef3309e3 [BE] typing for decorators (#144161)
Summary:
Untyped decorators strip annotations from the decorated items.
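For background, the usual fix is to type the decorator with ParamSpec/TypeVar so the wrapped callable keeps its signature (a generic sketch, not any specific decorator from the list below):

```
import functools
from typing import Callable, ParamSpec, TypeVar

P = ParamSpec("P")
R = TypeVar("R")

def logged(fn: Callable[P, R]) -> Callable[P, R]:
    # Because the decorator is typed with ParamSpec, type checkers still see the
    # original parameter and return annotations on the decorated function.
    @functools.wraps(fn)
    def wrapper(*args: P.args, **kwargs: P.kwargs) -> R:
        print(f"calling {fn.__name__}")
        return fn(*args, **kwargs)
    return wrapper

@logged
def scale(x: float, factor: float = 2.0) -> float:
    return x * factor

scale(3.0)  # still checked as (x: float, factor: float = 2.0) -> float
```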

- _compile
- _inductor/fx_passes/post_grad
- _inductor/lowering
- _library/custom_ops
- _meta_registrations
- _ops
- _refs/nn/functional
- ao/quantization/quantizer/xnnpack_quantizer_utils
- distributed/_composable/contract
- fx/experimental/graph_gradual_typechecker
- fx/experimental/migrate_gradual_types/constraint_generator
- optim/optimizer
- signal/windows/windows
- testing/_internal/common_device_type
- torch/_inductor/decomposition
- utils/flop_counter

Test Plan: unit tests

Differential Revision: D62302684

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144161
Approved by: https://github.com/Skylion007, https://github.com/albanD
2025-01-04 16:40:09 +00:00
Ke Wen
8bdcdae733 [DTensor] Support matmul in inference_mode (#142197)
Fixes #142190 .

The solution is to add a `decompose_handler` for `aten.matmul`, similar to how we handle `aten.linear`.
With the decomposition, `aten.matmul` becomes `aten.mm` which has sharding strategy registered with DTensor.
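A usage sketch of what this enables (shapes and mesh are illustrative; assumes an initialized process group and CUDA devices):

```
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import distribute_tensor, Replicate, Shard

mesh = init_device_mesh("cuda", (2,))
a = distribute_tensor(torch.randn(8, 16), mesh, [Shard(0)])
b = distribute_tensor(torch.randn(16, 4), mesh, [Replicate()])

with torch.inference_mode():
    # aten.matmul decomposes to aten.mm, which has a DTensor sharding strategy.
    out = torch.matmul(a, b)
```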

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142197
Approved by: https://github.com/XilunWu, https://github.com/wz337
2024-12-06 07:15:05 +00:00
Xilun Wu
c55191f3a2 [dtensor][random] add 1d and 2d model meta init tests (#141731)
**Summary**
Added tests for model meta init on 1-d mesh (TP) and 2-d mesh (FSDP+TP). This exercises the issue where DTensor RNG failed to initialize weights differently across FSDP ranks.
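For context, a minimal meta-init flow of the kind these tests cover (module, shapes, and device are illustrative; the parallelization step is elided):

```
import torch
import torch.nn as nn

# Build the model on the meta device: shapes and dtypes only, no storage.
with torch.device("meta"):
    model = nn.Linear(1024, 1024)

# ...apply TP and/or FSDP to the meta model here...

# Materialize storage, then run the real (DTensor-aware) random initialization.
model.to_empty(device="cuda")
for p in model.parameters():
    nn.init.normal_(p)
```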

**Test**
`pytest test/distributed/_tensor/test_random_ops.py -s -k meta_init`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141731
Approved by: https://github.com/wconstab
2024-11-29 07:59:20 +00:00
Tom Ritchford
c0582fd0f8 Remove unused Python variables in torch/[b-z]* (#136963)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136963
Approved by: https://github.com/ezyang
2024-10-19 16:45:22 +00:00
Wanchao Liang
cfc227ad43 [reland][dtensor] move DTensor to public namespace (#134203)
reland of https://github.com/pytorch/pytorch/pull/133113

I have to create a new PR because the previous reverted PR could not either be rebased, or imported successfully :(

----

Moving DTensor to be in the public namespace, to formally add the documentation page that includes all the public APIs. This includes:

* many path renames and path import fixes
* a dedicated doc page without too much content yet (adding in the next PRs)
* To preserve BC for users still using `torch.distributed._tensor`, I added a shim script to redirect old path calls to the new module

The BC preservation is evidenced by the fact that all DTensor tests still pass without changing the public imports, so it is safe to land the changes.
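A sketch of the user-visible compatibility behavior described in the last bullet (not the shim's implementation):

```
# The old private path still works via the shim...
from torch.distributed._tensor import DTensor as LegacyDTensor
# ...and resolves to the same class as the new public path.
from torch.distributed.tensor import DTensor

assert LegacyDTensor is DTensor
```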

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134203
Approved by: https://github.com/tianyu-l
2024-09-08 17:08:40 +00:00
PyTorch MergeBot
35f36363ec Revert "[dtensor] move DTensor to public namespace (#133113)"
This reverts commit 2ee6b97464.

Reverted https://github.com/pytorch/pytorch/pull/133113 on behalf of https://github.com/wanchaol due to looks like it break some internal type imports ([comment](https://github.com/pytorch/pytorch/pull/133113#issuecomment-2295670911))
2024-08-19 05:00:19 +00:00
Wanchao Liang
2ee6b97464 [dtensor] move DTensor to public namespace (#133113)
Moving DTensor to be in the public namespace, to formally add the
documentation page that includes all the public APIs. This includes:

* many path renames and path import fixes
* a dedicated doc page without too much content yet (adding in the next
  PRs)
* To preserve BC for users still using `torch.distributed._tensor`, I added
  a shim script to redirect old path calls to the new module

The BC preservation is evidenced by the fact that all DTensor tests still
pass without changing the public imports, so it is safe to land the
changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133113
Approved by: https://github.com/XilunWu
ghstack dependencies: #133305, #133306
2024-08-17 05:09:52 +00:00