Mirror of https://github.com/zebrajr/pytorch.git (synced 2025-12-06 12:20:52 +01:00)
[dtensor] move DTensor to public namespace (#133113)
Moving DTensor to the public namespace, to formally add a documentation page that covers all the public APIs. This includes:

* many path renames and import fixes
* a dedicated doc page, without much content yet (to be added in the next PRs)
* to preserve BC for users still importing from `torch.distributed._tensor`, a shim script that redirects old import paths to the new module

The BC preservation is evidenced by the fact that all DTensor tests still pass without changing their public imports, so it is safe to land the changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133113
Approved by: https://github.com/XilunWu
ghstack dependencies: #133305, #133306
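As an editorial illustration (not part of the commit), a minimal sketch of what the BC shim guarantees, assuming a build that contains this change: the old private path and the new public path resolve to the same objects.

```python
# Minimal sketch: the legacy private module is populated from the new public package,
# so both import paths hand back the same class objects.
import torch.distributed._tensor as legacy_dt   # old path, kept working by the shim
import torch.distributed.tensor as public_dt    # new public path added in this PR

assert legacy_dt.DTensor is public_dt.DTensor
assert legacy_dt.Shard is public_dt.Shard
```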
This commit is contained in:
parent 1a4709cef5
commit 2ee6b97464
@@ -626,6 +626,17 @@ coverage_ignore_functions = [
     # torch.distributed.rpc.internal
     "deserialize",
     "serialize",
+    # torch.distributed.tensor.api
+    "distribute_module",
+    "distribute_tensor",
+    # torch.distributed.tensor.random
+    "is_rng_supported_mesh",
+    # torch.distributed.tensor.experimental
+    "context_parallel",
+    "local_map",
+    "register_sharding",
+    # torch.distributed.tensor.debug
+    "visualize_sharding",
     # torch.distributed.tensor.parallel.api
     "parallelize_module",
     # torch.distributed.tensor.parallel.input_reshard
@@ -2621,6 +2632,15 @@ coverage_ignore_classes = [
     "RemoteException",
     # torch.distributed.rpc.rref_proxy
     "RRefProxy",
+    # torch.distributed.tensor.api
+    "DTensor",
+    # torch.distributed.tensor.placement_types
+    "DTensorSpec",
+    "Placement",
+    # torch.distributed.tensor.random
+    "OffsetBasedRNGTracker",
+    # torch.distributed.tensor.debug
+    "CommDebugMode",
     # torch.distributed.tensor.parallel.fsdp
     "DTensorExtensions",
     # torch.distributed.tensor.parallel.style
@@ -876,7 +876,6 @@ If you are running single node training, it may be convenient to interactively b
 .. py:module:: torch.distributed.nn.api
 .. py:module:: torch.distributed.nn.jit
 .. py:module:: torch.distributed.nn.jit.templates
-.. py:module:: torch.distributed.tensor
 .. py:module:: torch.distributed.algorithms.ddp_comm_hooks.ddp_zero_hook
 .. py:module:: torch.distributed.algorithms.ddp_comm_hooks.debugging_hooks
 .. py:module:: torch.distributed.algorithms.ddp_comm_hooks.default_hooks
docs/source/distributed.tensor.rst (new file, 104 lines)

@@ -0,0 +1,104 @@
+.. role:: hidden
+    :class: hidden-section
+
+PyTorch DTensor (Distributed Tensor)
+======================================================
+
+.. note::
+    ``torch.distributed.tensor`` is currently in an alpha state and under
+    development. We are committed to backward compatibility for most of the APIs
+    listed in this doc, but API changes may happen if necessary.
+
+PyTorch DTensor offers simple and flexible tensor sharding primitives that transparently handle distributed
+logic, including sharded storage, operator computation and collective communications across devices/hosts.
+``DTensor`` can be used to build different parallelism solutions and supports a sharded state_dict representation
+when working with multi-dimensional sharding.
+
+Please see examples from the PyTorch native parallelism solutions that are built on top of ``DTensor``:
+
+* `Tensor Parallel <https://pytorch.org/docs/main/distributed.tensor.parallel.html>`__
+* `FSDP2 <https://github.com/pytorch/torchtitan/blob/main/docs/fsdp.md>`__
+
+.. automodule:: torch.distributed.tensor
+
+.. currentmodule:: torch.distributed.tensor
+
+:class:`DTensor` follows the SPMD (single program, multiple data) programming model to empower users to
+write distributed programs as if they were single-device programs with the same convergence property. It
+provides a uniform tensor sharding layout (DTensor Layout) through specifying the :class:`DeviceMesh`
+and :class:`Placement`:
+
+- :class:`DeviceMesh` represents the device topology and the communicators of the cluster using
+  an n-dimensional array.
+
+- :class:`Placement` describes the sharding layout of the logical tensor on the :class:`DeviceMesh`.
+  DTensor supports three types of placements: :class:`Shard`, :class:`Replicate` and :class:`Partial`.
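As an editorial illustration (not part of the diff), a minimal sketch of describing a layout with a ``DeviceMesh`` plus placements, assuming 4 GPUs arranged as a 2x2 mesh:

```python
# Minimal sketch: one placement per mesh dimension.
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Replicate, Shard

mesh_2d = init_device_mesh("cuda", (2, 2), mesh_dim_names=("dp", "tp"))
# Replicate across the "dp" dimension, shard tensor dim 0 across the "tp" dimension.
placements = [Replicate(), Shard(0)]
```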
+There are three ways to construct a :class:`DTensor`:
+
+* :meth:`distribute_tensor` creates a :class:`DTensor` from a logical or "global" ``torch.Tensor`` on
+  each rank. This can be used to shard leaf ``torch.Tensor`` s (i.e. model parameters/buffers
+  and inputs).
+* :meth:`DTensor.from_local` creates a :class:`DTensor` from a local ``torch.Tensor`` on each rank, which can
+  be used to create a :class:`DTensor` from non-leaf ``torch.Tensor`` s (i.e. intermediate activation
+  tensors during forward/backward).
+* DTensor provides dedicated tensor factory methods (e.g. :meth:`empty`, :meth:`ones`, :meth:`randn`, etc.)
+  to allow different :class:`DTensor` creations by directly specifying the :class:`DeviceMesh` and
+  :class:`Placement`.
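An editorial sketch of the three construction paths (not part of the diff), assuming the code runs under ``torchrun`` with 4 ranks and one GPU per rank:

```python
# Minimal sketch of the three ways to build a DTensor.
import torch
import torch.distributed.tensor as dt
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import DTensor, Shard, distribute_tensor

mesh = init_device_mesh("cuda", (4,))

# 1) shard a "global" tensor (e.g. a parameter) along dim 0
global_weight = torch.randn(16, 32)
sharded_weight = distribute_tensor(global_weight, mesh, [Shard(0)])

# 2) wrap an already-local shard (e.g. an intermediate activation)
local_chunk = torch.randn(4, 32, device="cuda")
activation = DTensor.from_local(local_chunk, mesh, [Shard(0)])

# 3) use a DTensor factory function directly
zeros_dt = dt.zeros(16, 32, device_mesh=mesh, placements=[Shard(0)])
```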
+.. autoclass:: DTensor
+    :members:
+    :member-order: bysource
+
+.. autofunction:: distribute_tensor
+
+
+Along with :meth:`distribute_tensor`, DTensor also offers a :meth:`distribute_module` API to allow easier
+sharding at the :class:`nn.Module` level.
+
+.. autofunction:: distribute_module
+
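An editorial sketch of :meth:`distribute_module` (not part of the diff), assuming a 1D mesh over 4 ranks and a toy module; the hypothetical ``shard_params`` partition function shards every parameter along dim 0:

```python
# Minimal sketch: convert a module's parameters to DTensors via a partition function.
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_module, distribute_tensor

mesh = init_device_mesh("cuda", (4,))
module = nn.Linear(32, 16)

def shard_params(name, module, device_mesh):
    # Replace each parameter with a dim-0 sharded DTensor parameter.
    for p_name, param in module.named_parameters(recurse=False):
        dist_param = nn.Parameter(distribute_tensor(param, device_mesh, [Shard(0)]))
        module.register_parameter(p_name, dist_param)

sharded_module = distribute_module(module, mesh, partition_fn=shard_params)
```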
+DTensor supports the following types of :class:`Placement` on each :class:`DeviceMesh` dimension:
+
+.. autoclass:: Shard
+    :members:
+    :undoc-members:
+
+.. autoclass:: Replicate
+    :members:
+    :undoc-members:
+
+.. autoclass:: Partial
+    :members:
+    :undoc-members:
+
+DTensor provides dedicated tensor factory functions to allow creating :class:`DTensor` directly
+using ``torch.Tensor``-like factory function APIs (e.g. ``torch.ones``, ``torch.empty``, etc.), by additionally
+specifying the :class:`DeviceMesh` and :class:`Placement` for the :class:`DTensor` created:
+
+.. autofunction:: zeros
+
+.. autofunction:: ones
+
+.. autofunction:: empty
+
+.. autofunction:: full
+
+.. autofunction:: rand
+
+.. autofunction:: randn
+
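An editorial sketch of the factory-function pattern (not part of the diff), assuming a 1D mesh over 4 ranks; each call takes the global shape plus ``device_mesh`` and ``placements`` keyword arguments:

```python
# Minimal sketch: the global shape goes in, each rank only materializes its local shard.
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, full, ones

mesh = init_device_mesh("cuda", (4,))

w = ones(8, 4, device_mesh=mesh, placements=[Shard(0)])
b = full((8,), 0.1, device_mesh=mesh, placements=[Shard(0)])

print(w.to_local().shape)  # torch.Size([2, 4]) on each of the 4 ranks
```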
+
+.. modules that are missing docs, add the doc later when necessary
+.. py:module:: torch.distributed.tensor.api
+.. py:module:: torch.distributed.tensor.device_mesh
+.. py:module:: torch.distributed.tensor.random
+.. py:module:: torch.distributed.tensor.placement_types
+.. py:module:: torch.distributed.tensor.experimental
+.. py:module:: torch.distributed.tensor.experimental.attention
+.. py:module:: torch.distributed.tensor.experimental.func_map
+.. py:module:: torch.distributed.tensor.experimental.register_sharding
+.. py:module:: torch.distributed.tensor.experimental.tp_transform
+.. py:module:: torch.distributed.tensor.debug
+.. py:module:: torch.distributed.tensor.debug.comm_mode
+.. py:module:: torch.distributed.tensor.debug.visualize_sharding
@@ -74,12 +74,13 @@ Features described in this documentation are classified by release status:
    torch.backends <backends>
    torch.export <export>
    torch.distributed <distributed>
+   torch.distributed.tensor <distributed.tensor>
    torch.distributed.algorithms.join <distributed.algorithms.join>
    torch.distributed.elastic <distributed.elastic>
    torch.distributed.fsdp <fsdp>
+   torch.distributed.tensor.parallel <distributed.tensor.parallel>
    torch.distributed.optim <distributed.optim>
    torch.distributed.pipelining <distributed.pipelining>
-   torch.distributed.tensor.parallel <distributed.tensor.parallel>
    torch.distributed.checkpoint <distributed.checkpoint>
    torch.distributions <distributions>
    torch.compiler <torch.compiler>
@@ -33,7 +33,8 @@
     "torch.nn.quantizable": "torch.ao.nn.quantizable",
     "torch.nn.quantizable.modules": "torch.ao.nn.quantizable.modules",
     "torch.nn.quantizable.modules.activation": "torch.ao.nn.quantizable.modules.activation",
-    "torch.nn.quantizable.modules.rnn": "torch.ao.nn.quantizable.modules.rnn"
+    "torch.nn.quantizable.modules.rnn": "torch.ao.nn.quantizable.modules.rnn",
+    "torch.distributed.tensor.device_mesh": "torch.distributed.device_mesh"
   },
   "torch.backends": [
     "contextmanager"
@@ -231,6 +232,9 @@
     "urlunparse"
   ],
   "torch.distributed.rpc": [],
+  "torch.distributed.tensor": [
+    "DeviceMesh"
+  ],
   "torch.fft": [
     "Tensor",
     "fft",
@@ -3,9 +3,9 @@
 import torch
 from torch.distributed._tensor import DeviceMesh
-from torch.distributed._tensor._op_schema import OpSchema
-from torch.distributed._tensor.ops._common_rules import einop_rule, pointwise_rule
 from torch.distributed._tensor.placement_types import DTensorSpec, TensorMeta
+from torch.distributed.tensor._op_schema import OpSchema
+from torch.distributed.tensor._ops._common_rules import einop_rule, pointwise_rule
 from torch.testing._internal.common_utils import run_tests
 from torch.testing._internal.distributed._tensor.common_dtensor import (
     DTensorTestBase,
@@ -909,7 +909,7 @@ class TestDTensorPlacementTypes(DTensorTestBase):
 ]
 assert_array_equal(expected_pad_sizes, pad_sizes)

-from torch.distributed._tensor._collective_utils import unpad_tensor
+from torch.distributed.tensor._collective_utils import unpad_tensor

 unpadded_list = [
     unpad_tensor(tensor, shard_placement.dim, pad_sizes[i])
@@ -167,7 +167,7 @@ class TestEmbeddingOp(DTensorTestBase):
 self._run_embedding_op_test(mesh, 0, [6, 7, 6], 13, 22)
 self._run_embedding_op_test(mesh, 0, [34], 15, 14, padding_idx=10)

-from torch.distributed._tensor.ops._embedding_ops import _MaskPartial
+from torch.distributed.tensor._ops._embedding_ops import _MaskPartial

 # test collectives
 embedding_mod = torch.nn.Embedding(10, 20, device=self.device_type)
@@ -191,7 +191,7 @@ class TestEmbeddingOp(DTensorTestBase):
 inp = torch.randint(0, 10, (4, 4), device=self.device_type)
 replicated_inp = DTensor.from_local(inp, mesh, [Replicate()], run_check=False)

-from torch.distributed._tensor.ops._embedding_ops import _MaskPartial
+from torch.distributed.tensor._ops._embedding_ops import _MaskPartial

 # case 1: two embeddings with the same shape, thus sharing the underlying _MaskPartial
 # and MaskBuffer, because of cache hit from sharding propagation
@@ -7,8 +7,8 @@ import itertools
 import torch
 from torch.distributed._tensor import DeviceMesh, distribute_module, distribute_tensor
 from torch.distributed._tensor.debug import CommDebugMode
-from torch.distributed._tensor.ops.utils import is_tensor_partial, normalize_dim
 from torch.distributed._tensor.placement_types import Replicate, Shard
+from torch.distributed.tensor._ops.utils import is_tensor_partial, normalize_dim
 from torch.testing._internal.common_utils import run_tests
 from torch.testing._internal.distributed._tensor.common_dtensor import (
     DTensorTestBase,
@@ -4,12 +4,6 @@ from itertools import chain

 import torch
 from torch.distributed._tensor import DeviceMesh, DTensor
-from torch.distributed._tensor._collective_utils import redistribute_cost
-from torch.distributed._tensor._op_schema import OpSchema, OpStrategy, PlacementStrategy
-from torch.distributed._tensor.ops._einsum_strategy import (
-    EinsumDims,
-    gen_einsum_strategies,
-)
 from torch.distributed._tensor.placement_types import (
     DTensorSpec,
     Partial,
@@ -17,6 +11,12 @@ from torch.distributed._tensor.placement_types import (
     Shard,
     TensorMeta,
 )
+from torch.distributed.tensor._collective_utils import redistribute_cost
+from torch.distributed.tensor._op_schema import OpSchema, OpStrategy, PlacementStrategy
+from torch.distributed.tensor._ops._einsum_strategy import (
+    EinsumDims,
+    gen_einsum_strategies,
+)
 from torch.testing._internal.common_utils import run_tests, TestCase
 from torch.testing._internal.distributed._tensor.common_dtensor import DTensorOpTestBase
@@ -169,7 +169,7 @@ class TestCostModel(DTensorOpTestBase):

 def test_redistribute_cost_latency(self):
     # test cost model on addmm op
-    from torch.distributed._tensor.ops._matrix_ops import addmm_strategy
+    from torch.distributed.tensor._ops._matrix_ops import addmm_strategy

     mesh = self.build_device_mesh()
     shard0_placement = (Shard(0),)
@@ -246,7 +246,7 @@ class TestCostModel(DTensorOpTestBase):
 self.assertTrue(allreduce_cost > reduce_scatter_cost)

 def test_mm_strategies(self):
-    from torch.distributed._tensor.ops._matrix_ops import mm_strategy
+    from torch.distributed.tensor._ops._matrix_ops import mm_strategy

     mesh = self.build_device_mesh()
     lhs_tensor = torch.randn(6, 8)
@@ -292,7 +292,7 @@ class TestCostModel(DTensorOpTestBase):
 self.assertFalse(output_sharding.needs_redistribute)

 def test_bmm_strategies(self):
-    from torch.distributed._tensor.ops._matrix_ops import bmm_strategy
+    from torch.distributed.tensor._ops._matrix_ops import bmm_strategy

     mesh = self.build_device_mesh()
     lhs_tensor = torch.randn(8, 6, 8)
@@ -5,10 +5,10 @@ import itertools

 import torch
 from torch.distributed._tensor import DeviceMesh, distribute_tensor, DTensor
-from torch.distributed._tensor._collective_utils import shard_dim_alltoall
 from torch.distributed._tensor.debug import CommDebugMode
 from torch.distributed._tensor.placement_types import Partial, Replicate, Shard
 from torch.distributed.device_mesh import init_device_mesh
+from torch.distributed.tensor._collective_utils import shard_dim_alltoall
 from torch.testing._internal.common_utils import run_tests
 from torch.testing._internal.distributed._tensor.common_dtensor import (
     DTensorTestBase,
@@ -207,7 +207,7 @@ class RedistributeTest(DTensorTestBase):
 with self.assertRaisesRegex(RuntimeError, "Can not redistribute to Partial"):
     partial_tensor = replica_tensor.redistribute(device_mesh, [partial_spec])

-from torch.distributed._tensor._redistribute import Redistribute
+from torch.distributed.tensor._redistribute import Redistribute

 comm_mode = CommDebugMode()

@@ -445,7 +445,7 @@ class DistTensorOpsTest(DTensorTestBase):
 # case 2 input sharding: input sharded, index replicated, output mask partial
 # only works when index has size 1 on the gather dimension and
 # input is sharded on the gather dimension
-from torch.distributed._tensor.ops._embedding_ops import _MaskPartial
+from torch.distributed.tensor._ops._embedding_ops import _MaskPartial

 gather_dim = 1
 global_input = torch.randn(12, 8, 16)
@@ -9,7 +9,8 @@ import torch.distributed as dist
 from torch import rand, randn, Tensor
 from torch.distributed._tensor import DeviceMesh, distribute_tensor, Replicate, Shard
 from torch.distributed._tensor.debug import CommDebugMode
-from torch.distributed._tensor.ops._view_ops import (
+from torch.distributed._tensor.placement_types import Placement
+from torch.distributed.tensor._ops._view_ops import (
     Broadcast,
     dim_maps,
     Flatten,
@@ -19,7 +20,6 @@ from torch.distributed._tensor.ops._view_ops import (
     Split,
     view_groups,
 )
-from torch.distributed._tensor.placement_types import Placement
 from torch.testing._internal.common_utils import run_tests
 from torch.testing._internal.distributed._tensor.common_dtensor import (
     DTensorTestBase,
@@ -5,11 +5,6 @@ import os
 import torch
 import torch.distributed._functional_collectives as funcol
 from torch.distributed._tensor import DTensor
-from torch.distributed._tensor._collective_utils import (
-    mesh_broadcast,
-    mesh_scatter,
-    unpad_tensor,
-)
 from torch.distributed._tensor.placement_types import _Partial, Shard
 from torch.distributed.device_mesh import _mesh_resources, DeviceMesh, init_device_mesh
 from torch.distributed.distributed_c10d import (
@@ -22,6 +17,11 @@ from torch.distributed.distributed_c10d import (
     is_nccl_available,
     ProcessGroup,
 )
+from torch.distributed.tensor._collective_utils import (
+    mesh_broadcast,
+    mesh_scatter,
+    unpad_tensor,
+)
 from torch.testing._internal.common_distributed import skip_if_lt_x_gpu
 from torch.testing._internal.common_utils import run_tests
 from torch.testing._internal.distributed._tensor.common_dtensor import (
@@ -1432,12 +1432,12 @@ class GuardBuilder(GuardBuilderBase):
     }
 )
 if torch.distributed.is_available():
-    from torch.distributed._tensor.placement_types import (
+    from torch.distributed.device_mesh import DeviceMesh
+    from torch.distributed.tensor.placement_types import (
         Partial,
         Replicate,
         Shard,
     )
-    from torch.distributed.device_mesh import DeviceMesh

     ok_types = ok_types + (
         Shard,
@@ -142,7 +142,7 @@ manual_torch_name_rule_map = {
     "torch.distributed.is_initialized": TorchInGraphFunctionVariable,
     "torch.distributed.get_rank": TorchInGraphFunctionVariable,
     "torch.distributed.get_world_size": TorchInGraphFunctionVariable,
-    "torch.distributed._tensor.api.DTensor#from_local": TorchInGraphFunctionVariable,
+    "torch.distributed.tensor.api.DTensor#from_local": TorchInGraphFunctionVariable,
     "torch.distributed.distributed_c10d._get_group_size_by_name": TorchInGraphFunctionVariable,
     "torch.distributed.distributed_c10d._resolve_group_name_by_ranks_and_tag": TorchInGraphFunctionVariable,
     "torch.distributed.distributed_c10d._get_group_tag": TorchInGraphFunctionVariable,
@@ -3190,8 +3190,8 @@ LEGACY_MOD_INLINELIST = {

 if torch.distributed.is_available():
     LEGACY_MOD_INLINELIST |= {
-        "torch.distributed._tensor.api",
-        "torch.distributed._tensor.device_mesh",
+        "torch.distributed.tensor.api",
+        "torch.distributed.tensor.device_mesh",
         "torch.distributed.device_mesh",
         "torch.distributed.algorithms._checkpoint.checkpoint_wrapper",
         "torch.distributed.tensor.parallel._data_parallel_utils",
@@ -50,7 +50,7 @@ class DistributedVariable(VariableTracker):
 def is_from_local(value):
     if not DistributedVariable.is_available():
         return False
-    from torch.distributed._tensor import DTensor
+    from torch.distributed.tensor import DTensor

     return inspect.isfunction(value) and value is DTensor.from_local

@@ -108,7 +108,7 @@ class PlacementClassVariable(DistributedVariable):
 if not DistributedVariable.is_available():
     return False

-from torch.distributed._tensor.placement_types import Placement
+from torch.distributed.tensor.placement_types import Placement

 return type(value) is type and issubclass(value, Placement)

@@ -143,7 +143,7 @@ class PlacementVariable(DistributedVariable):
 if not DistributedVariable.is_available():
     return False

-from torch.distributed._tensor.placement_types import Placement
+from torch.distributed.tensor.placement_types import Placement

 return isinstance(value, Placement)

@@ -598,7 +598,6 @@ class TorchInGraphFunctionVariable(BaseTorchVariable):
 )

 if DistributedVariable.is_available():
-    from torch.distributed._tensor import DTensor
     from torch.distributed.distributed_c10d import (
         _get_group_size_by_name,
         _get_group_tag,
@@ -606,6 +605,7 @@ class TorchInGraphFunctionVariable(BaseTorchVariable):
         _resolve_group_name_by_ranks_and_tag,
         get_process_group_ranks,
     )
+    from torch.distributed.tensor import DTensor

     @register(
         _get_group_size_by_name,
@@ -4,8 +4,8 @@ from typing import cast, List, NamedTuple, Optional, Tuple, Union
 import torch
 import torch._dynamo.compiled_autograd as ca
 import torch.distributed as dist
-from torch.distributed._tensor import DTensor
 from torch.distributed.distributed_c10d import ReduceOp
+from torch.distributed.tensor import DTensor

 from ._fsdp_common import (
     _get_dim0_padded_size,
@@ -10,8 +10,8 @@ import torch._dynamo.compiled_autograd as ca
 import torch.distributed as dist
 import torch.nn as nn
 from torch.distributed._composable.contract import _get_registry
-from torch.distributed._tensor import DeviceMesh, DTensor
-from torch.distributed._tensor.placement_types import DTensorSpec
+from torch.distributed.tensor import DeviceMesh, DTensor
+from torch.distributed.tensor.placement_types import DTensorSpec


 @dataclass
@@ -4,8 +4,8 @@ from typing import List, Optional, Set, Tuple, Union
 import torch
 import torch.distributed as dist
 import torch.nn as nn
-from torch.distributed._tensor import DeviceMesh, DTensor, init_device_mesh
 from torch.distributed.device_mesh import _get_device_handle
+from torch.distributed.tensor import DeviceMesh, DTensor, init_device_mesh
 from torch.utils._python_dispatch import is_traceable_wrapper_subclass

 from ._fsdp_common import _is_composable_with_fsdp, FSDPMeshInfo, HSDPMeshInfo
@@ -9,9 +9,9 @@ import torch._dynamo.compiled_autograd as ca
 import torch.nn as nn
 from torch._prims_common import make_contiguous_strides_for
 from torch.distributed._functional_collectives import AsyncCollectiveTensor
-from torch.distributed._tensor import DTensor, Replicate, Shard
-from torch.distributed._tensor.device_mesh import _mesh_resources
-from torch.distributed._tensor.placement_types import (
+from torch.distributed.tensor import DTensor, Replicate, Shard
+from torch.distributed.tensor.device_mesh import _mesh_resources
+from torch.distributed.tensor.placement_types import (
     _StridedShard,
     DTensorSpec,
     Placement,
@@ -6,7 +6,7 @@ from typing import Any, cast, Iterable, List, NoReturn, Optional, Union
 import torch
 import torch.nn as nn
 from torch.distributed._composable import contract
-from torch.distributed._tensor import DeviceMesh
+from torch.distributed.tensor import DeviceMesh
 from torch.distributed.utils import _get_root_modules

 from ._fsdp_api import MixedPrecisionPolicy, OffloadPolicy
@@ -97,7 +97,7 @@ RANK_TYPES = Union[
     List[List[int]],
     dist.ProcessGroup,
     DeviceMesh,
-    Tuple["dist._tensor.DeviceMesh", int],
+    Tuple["dist.tensor.DeviceMesh", int],
     str,
 ]

@@ -27,7 +27,7 @@ from torch.distributed._functional_collectives import AsyncCollectiveTensor
 if dist.is_available() or TYPE_CHECKING:
     from torch.distributed import distributed_c10d
     from torch.distributed._shard.sharded_tensor import ShardedTensor
-    from torch.distributed._tensor import distribute_tensor, DTensor, Replicate
+    from torch.distributed.tensor import distribute_tensor, DTensor, Replicate


 def _identity_func(
@@ -1,58 +1,43 @@
-# mypy: allow-untyped-defs
-# Copyright (c) Meta Platforms, Inc. and affiliates
-
-import torch
-import torch.distributed._tensor.ops as _ops  # force import all built-in dtensor ops
-from torch.distributed._tensor.api import (
-    distribute_module,
-    distribute_tensor,
-    DTensor,
-    empty,
-    full,
-    ones,
-    rand,
-    randn,
-    zeros,
-)
-from torch.distributed._tensor.placement_types import (
-    Partial,
-    Placement,
-    Replicate,
-    Shard,
-)
-from torch.distributed.device_mesh import DeviceMesh, init_device_mesh
-from torch.optim.optimizer import (
-    _foreach_supported_types as _optim_foreach_supported_types,
-)
-from torch.utils._foreach_utils import (
-    _foreach_supported_types as _util_foreach_supported_types,
-)
-
-
-# All public APIs from dtensor package
-__all__ = [
-    "DTensor",
-    "DeviceMesh",
-    "distribute_tensor",
-    "distribute_module",
-    "init_device_mesh,",
-    "Shard",
-    "Replicate",
-    "Partial",
-    "Placement",
-    "ones",
-    "empty",
-    "full",
-    "rand",
-    "randn",
-    "zeros",
-]
-
-
-# Append DTensor to the list of supported types for foreach implementation for optimizer
-# and clip_grad_norm_ so that we will try to use foreach over the for-loop implementation on CUDA.
-if DTensor not in _optim_foreach_supported_types:
-    _optim_foreach_supported_types.append(DTensor)
-
-if DTensor not in _util_foreach_supported_types:
-    _util_foreach_supported_types.append(DTensor)
+"""
+NOTICE: DTensor has moved to torch.distributed.tensor
+
+This file is a shim to redirect to the new location, and
+we keep the old import path that starts with `_tensor` for
+backward compatibility.
+"""
+import importlib
+import sys
+
+import torch.distributed.tensor
+
+
+def _populate():  # type: ignore[no-untyped-def]
+    for name in (
+        # TODO: _utils here mainly for checkpoint imports BC, remove it
+        "_utils",
+        "api",
+        "debug",
+        "device_mesh",
+        "experimental",
+        "placement_types",
+        "random",
+    ):
+        try:
+            globals()[name] = sys.modules[
+                f"torch.distributed._tensor.{name}"
+            ] = importlib.import_module(f"torch.distributed.tensor.{name}")
+        except ImportError as e:
+            import traceback
+
+            traceback.print_exc()
+            raise ImportError(
+                f"Failed to import torch.distributed.tensor.{name} due to {e}"
+            ) from e
+
+    for name, val in torch.distributed.tensor.__dict__.items():
+        # Skip private names and tensor parallel package
+        if not name.startswith("_") and name != "parallel":
+            globals()[name] = val
+
+
+_populate()
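An editorial sketch (not part of the diff) of what the shim above provides: old submodule paths are aliased in ``sys.modules`` to the new package, so legacy imports keep resolving to the same module objects. Assumes a build containing this change.

```python
# Minimal sketch: the legacy submodule name is an alias of the new module object.
import sys
import torch.distributed._tensor  # triggers _populate() in the shim

from torch.distributed._tensor.placement_types import Shard as LegacyShard
from torch.distributed.tensor.placement_types import Shard

assert LegacyShard is Shard
assert (
    sys.modules["torch.distributed._tensor.placement_types"]
    is sys.modules["torch.distributed.tensor.placement_types"]
)
```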
@@ -14,8 +14,8 @@ from typing import (

 import torch
 from torch.distributed._shard.sharded_tensor.api import ShardedTensor
-from torch.distributed._tensor import DTensor
 from torch.distributed.checkpoint.metadata import STATE_DICT_TYPE
+from torch.distributed.tensor import DTensor


 PATH_ITEM = Union[str, int]
@@ -11,7 +11,6 @@ from typing import Any, cast, Dict, List, Optional, Tuple, Union

 import torch
 from torch.distributed._shard._utils import narrow_tensor_by_index
-from torch.distributed._tensor import DTensor
 from torch.distributed.checkpoint._dedup_save_plans import dedup_save_plans
 from torch.distributed.checkpoint._nested_dict import (
     FLATTEN_MAPPING,
@@ -45,6 +44,7 @@ from torch.distributed.checkpoint.planner_helpers import (
     _init_state_dict,
 )
 from torch.distributed.checkpoint.utils import find_state_dict_object
+from torch.distributed.tensor import DTensor


 logger: logging.Logger = logging.getLogger(__name__)
@@ -11,12 +11,12 @@ import torch.distributed.checkpoint as dcp
 import torch.multiprocessing as mp
 import torch.nn as nn
 import torch.nn.functional as F
-from torch.distributed._tensor.device_mesh import init_device_mesh
 from torch.distributed.checkpoint.state_dict import (
     _patch_model_state_dict,
     _patch_optimizer_state_dict,
 )
 from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
+from torch.distributed.tensor.device_mesh import init_device_mesh


 DEVICE = "cuda"
@@ -12,11 +12,11 @@ import torch.distributed as dist
 import torch.distributed.checkpoint as dcp
 import torch.multiprocessing as mp
 import torch.nn as nn
-from torch.distributed._tensor.device_mesh import init_device_mesh
 from torch.distributed.checkpoint.state_dict import (
     _patch_model_state_dict,
     _patch_optimizer_state_dict,
 )
+from torch.distributed.device_mesh import init_device_mesh
 from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

@@ -12,7 +12,6 @@ from torch.distributed._shard.sharded_tensor.metadata import (
 )
 from torch.distributed._shard.sharded_tensor.shard import Shard
 from torch.distributed._shard.sharding_spec.chunk_sharding_spec import ChunkShardingSpec
-from torch.distributed._tensor import DTensor
 from torch.distributed.checkpoint._nested_dict import unflatten_state_dict
 from torch.distributed.checkpoint.default_planner import DefaultLoadPlanner
 from torch.distributed.checkpoint.metadata import (
@@ -39,6 +38,7 @@ from torch.distributed.checkpoint.utils import (
 from torch.distributed.distributed_c10d import _get_default_group
 from torch.distributed.fsdp._shard_utils import _create_chunk_sharded_tensor
 from torch.distributed.remote_device import _remote_device
+from torch.distributed.tensor import DTensor


 STATE_DICT_2D_LAYOUT = Dict[str, Tuple[Optional[Sequence[int]], Sequence[int]]]
@@ -7,8 +7,8 @@ import torch.distributed as dist
 from torch._utils import _get_device_module
 from torch.distributed._shard.metadata import ShardMetadata
 from torch.distributed._shard.sharded_tensor import ShardedTensor
-from torch.distributed._tensor import DTensor
-from torch.distributed._tensor._utils import compute_local_shape_and_global_offset
+from torch.distributed.tensor import DTensor
+from torch.distributed.tensor._utils import compute_local_shape_and_global_offset

 from .metadata import (
     BytesStorageMetadata,
@@ -32,7 +32,6 @@ from torch.distributed._state_dict_utils import (
     _offload_state_dict_to_cpu,
     _unflatten_state_dict,
 )
-from torch.distributed._tensor import DTensor
 from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
     _CHECKPOINT_PREFIX,
 )
@@ -50,6 +49,7 @@ from torch.distributed.fsdp._common_utils import (
     _get_module_fsdp_state_if_fully_sharded_module,
     FSDP_WRAPPED_MODULE,
 )
+from torch.distributed.tensor import DTensor
 from torch.nn.modules.module import _IncompatibleKeys
 from torch.nn.parallel import DistributedDataParallel as DDP
 from torch.utils._pytree import tree_map_only
@@ -1887,7 +1887,7 @@ class FlatParamHandle:
 flat_param = self.flat_param
 self._check_unsharded(flat_param)
 views = self._get_unflat_views()
-from torch.distributed._tensor import DTensor
+from torch.distributed.tensor import DTensor

 for i, (view, (param_name, module, _)) in enumerate(
     zip(views, flat_param._param_infos)
@@ -2717,7 +2717,7 @@ def _warn_use_fake_reduce(log: logging.Logger, warning: str):
 def _same_storage(a, b):
     # Params are DTensors in backward
     # with SHARD_GRAD_OP + TP
-    from torch.distributed._tensor import DTensor
+    from torch.distributed.tensor import DTensor

     if isinstance(a, DTensor):
         a = a._local_tensor
@@ -5,12 +5,12 @@ import torch
 import torch.distributed as dist
 from torch.distributed._shard.sharded_tensor.api import ShardedTensor
 from torch.distributed._shard.sharded_tensor.shard import Shard
-from torch.distributed._tensor import DeviceMesh, DTensor
 from torch.distributed.fsdp._shard_utils import (
     _all_gather_dtensor,
     _create_chunk_dtensor,
     _create_chunk_sharded_tensor,
 )
+from torch.distributed.tensor import DeviceMesh, DTensor


 class FSDPExtensions(ABC):
@@ -27,7 +27,6 @@ import torch.distributed as dist
 import torch.distributed.fsdp._traversal_utils as traversal_utils
 import torch.nn as nn
 from torch.distributed._state_dict_utils import _gather_state_dict
-from torch.distributed._tensor import DTensor, Replicate
 from torch.distributed.distributed_c10d import _get_pg_default_device
 from torch.distributed.fsdp._common_utils import (
     _apply_to_modules,
@@ -53,6 +52,7 @@ from torch.distributed.fsdp.api import (
     StateDictSettings,
     StateDictType,
 )
+from torch.distributed.tensor import DTensor, Replicate
 from torch.utils._pytree import tree_map_only

@@ -15,7 +15,7 @@ from torch.distributed._shard.sharded_tensor import (
     TensorProperties,
 )
 from torch.distributed._shard.sharding_spec import ShardMetadata
-from torch.distributed._tensor import DeviceMesh, DTensor, Replicate, Shard as DShard
+from torch.distributed.tensor import DeviceMesh, DTensor, Replicate, Shard as DShard


 def _get_remote_device_str(rank, device_type, num_devices_per_node):
@@ -25,7 +25,6 @@ from torch.distributed._shard.sharded_tensor import (
     Shard,
     ShardedTensor,
 )
-from torch.distributed._tensor import DTensor
 from torch.distributed.device_mesh import _mesh_resources
 from torch.distributed.fsdp._common_utils import (
     _FSDPState,
@@ -49,6 +48,7 @@ from torch.distributed.fsdp.api import (
     ShardingStrategy,
     StateDictType,
 )
+from torch.distributed.tensor import DTensor
 from torch.distributed.utils import _replace_by_prefix

 from ._fsdp_extensions import (
@@ -25,7 +25,6 @@ import torch
 import torch.distributed as dist
 import torch.distributed.fsdp._traversal_utils as traversal_utils
 import torch.nn as nn
-from torch.distributed._tensor import DeviceMesh
 from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
     _CHECKPOINT_WRAPPED_MODULE,
     ActivationWrapper,
@@ -84,6 +83,7 @@ from torch.distributed.fsdp.api import (
     StateDictSettings,
     StateDictType,
 )
+from torch.distributed.tensor import DeviceMesh
 from torch.distributed.utils import _p_assert

 from ._flat_param import FlatParameter, FlatParamHandle
@@ -0,0 +1,57 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates
+
+import torch
+import torch.distributed.tensor._ops  # force import all built-in dtensor ops
+from torch.distributed.device_mesh import DeviceMesh, init_device_mesh
+from torch.distributed.tensor.api import (
+    distribute_module,
+    distribute_tensor,
+    DTensor,
+    empty,
+    full,
+    ones,
+    rand,
+    randn,
+    zeros,
+)
+from torch.distributed.tensor.placement_types import (
+    Partial,
+    Placement,
+    Replicate,
+    Shard,
+)
+from torch.optim.optimizer import (
+    _foreach_supported_types as _optim_foreach_supported_types,
+)
+from torch.utils._foreach_utils import (
+    _foreach_supported_types as _util_foreach_supported_types,
+)
+
+
+# All public APIs from dtensor package
+__all__ = [
+    "DTensor",
+    "DeviceMesh",
+    "distribute_tensor",
+    "distribute_module",
+    "init_device_mesh",
+    "Shard",
+    "Replicate",
+    "Partial",
+    "Placement",
+    "ones",
+    "empty",
+    "full",
+    "rand",
+    "randn",
+    "zeros",
+]
+
+
+# Append DTensor to the list of supported types for foreach implementation for optimizer
+# and clip_grad_norm_ so that we will try to use foreach over the for-loop implementation on CUDA.
+if DTensor not in _optim_foreach_supported_types:
+    _optim_foreach_supported_types.append(DTensor)
+
+if DTensor not in _util_foreach_supported_types:
+    _util_foreach_supported_types.append(DTensor)
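As an editorial aside (not part of the diff), a minimal sketch of what the foreach registration above implies: once the package is imported, ``DTensor`` appears in the foreach-supported type lists consulted by the optimizers and ``clip_grad_norm_``.

```python
# Minimal sketch: importing the package registers DTensor for the foreach paths.
from torch.distributed.tensor import DTensor
from torch.optim.optimizer import _foreach_supported_types as optim_types
from torch.utils._foreach_utils import _foreach_supported_types as util_types

assert DTensor in optim_types
assert DTensor in util_types
```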
@@ -7,7 +7,7 @@ from typing import List, Optional

 import torch
 import torch.distributed._functional_collectives as funcol
-import torch.distributed._tensor.placement_types as placement_types
+import torch.distributed.tensor.placement_types as placement_types
 from torch._C._distributed_c10d import _resolve_process_group
 from torch.distributed.device_mesh import _mesh_resources, DeviceMesh
 from torch.distributed.distributed_c10d import (
@@ -8,24 +8,24 @@ from typing import cast, Dict, List, Optional, Sequence, Tuple, TYPE_CHECKING

 import torch
 import torch.distributed as dist
-import torch.distributed._tensor.api as dtensor
-import torch.distributed._tensor.random as random
-from torch.distributed._tensor._op_schema import (
+import torch.distributed.tensor.api as dtensor
+import torch.distributed.tensor.random as random
+from torch.distributed.tensor._op_schema import (
     _is_inplace_op,
     _is_out_variant_op,
     OpInfo,
     OpSchema,
     OutputSpecType,
 )
-from torch.distributed._tensor._redistribute import redistribute_local_tensor
-from torch.distributed._tensor._sharding_prop import ShardingPropagator
-from torch.distributed._tensor._tp_conv import (
+from torch.distributed.tensor._redistribute import redistribute_local_tensor
+from torch.distributed.tensor._sharding_prop import ShardingPropagator
+from torch.distributed.tensor._tp_conv import (
     convolution_backward_handler,
     convolution_handler,
 )
-from torch.distributed._tensor._utils import try_find_mesh_from_args
-from torch.distributed._tensor.placement_types import DTensorSpec, Replicate, TensorMeta
-from torch.distributed._tensor.random import is_rng_supported_mesh
+from torch.distributed.tensor._utils import try_find_mesh_from_args
+from torch.distributed.tensor.placement_types import DTensorSpec, Replicate, TensorMeta
+from torch.distributed.tensor.random import is_rng_supported_mesh


 if TYPE_CHECKING:
@@ -5,8 +5,8 @@ from typing import Any, Dict, List, Optional, Sequence, Tuple, Union

 import torch
 from torch._ops import OpOverload
-from torch.distributed._tensor.placement_types import DTensorSpec, Placement
 from torch.distributed.device_mesh import DeviceMesh
+from torch.distributed.tensor.placement_types import DTensorSpec, Placement


 try:
@@ -2,15 +2,15 @@
 from typing import cast, Dict, List, Optional, Tuple

 import torch
-from torch.distributed._tensor._op_schema import (
+from torch.distributed.tensor._op_schema import (
     _is_inplace_op,
     _is_out_variant_op,
     OpSchema,
     OutputSharding,
 )
-from torch.distributed._tensor._utils import compute_local_shape
-from torch.distributed._tensor.ops.utils import prod
-from torch.distributed._tensor.placement_types import DTensorSpec, TensorMeta
+from torch.distributed.tensor._ops.utils import prod
+from torch.distributed.tensor._utils import compute_local_shape
+from torch.distributed.tensor.placement_types import DTensorSpec, TensorMeta


 def _replace_char_in_str(string: str, new_char: str, idx: int) -> str:
@@ -4,9 +4,9 @@
 from typing import List

 import torch
-from torch.distributed._tensor._op_schema import OpSchema, OutputSharding
-from torch.distributed._tensor.ops.utils import register_prop_rule
-from torch.distributed._tensor.placement_types import DTensorSpec, TensorMeta
+from torch.distributed.tensor._op_schema import OpSchema, OutputSharding
+from torch.distributed.tensor._ops.utils import register_prop_rule
+from torch.distributed.tensor.placement_types import DTensorSpec, TensorMeta


 aten = torch.ops.aten
@@ -2,15 +2,15 @@ import itertools
 from dataclasses import dataclass
 from typing import List, Set, Tuple

-from torch.distributed._tensor._op_schema import OpStrategy, PlacementStrategy
-from torch.distributed._tensor.placement_types import (
+from torch.distributed.device_mesh import DeviceMesh
+from torch.distributed.tensor._op_schema import OpStrategy, PlacementStrategy
+from torch.distributed.tensor.placement_types import (
     DTensorSpec,
     Partial,
     Placement,
     Replicate,
     Shard,
 )
-from torch.distributed.device_mesh import DeviceMesh


 @dataclass
@@ -7,23 +7,23 @@ from typing import cast, Optional

 import torch
 import torch.distributed._functional_collectives as funcol
-from torch.distributed._tensor._op_schema import (
+from torch.distributed.device_mesh import DeviceMesh
+from torch.distributed.tensor._op_schema import (
     OpSchema,
     OpStrategy,
     PlacementList,
     StrategyType,
 )
-from torch.distributed._tensor.ops.utils import (
+from torch.distributed.tensor._ops.utils import (
     expand_to_full_mesh_op_strategy,
     register_op_strategy,
 )
-from torch.distributed._tensor.placement_types import (
+from torch.distributed.tensor.placement_types import (
     Partial,
     Placement,
     Replicate,
     Shard,
 )
-from torch.distributed.device_mesh import DeviceMesh


 aten = torch.ops.aten
@@ -3,15 +3,15 @@
 # implement matrix related ops for distributed tensor

 import torch
-from torch.distributed._tensor._op_schema import (
+from torch.distributed.tensor._op_schema import (
     OpSchema,
     OpStrategy,
     PlacementStrategy,
     StrategyType,
 )
-from torch.distributed._tensor.device_mesh import DeviceMesh
-from torch.distributed._tensor.ops.utils import register_op_strategy
-from torch.distributed._tensor.placement_types import DTensorSpec, Replicate
+from torch.distributed.tensor._ops.utils import register_op_strategy
+from torch.distributed.tensor.device_mesh import DeviceMesh
+from torch.distributed.tensor.placement_types import DTensorSpec, Replicate


 aten = torch.ops.aten
@@ -7,7 +7,8 @@ from enum import Enum
 from typing import cast, List, Optional, Sequence, Tuple, Union

 import torch
-from torch.distributed._tensor._op_schema import (
+from torch.distributed.device_mesh import DeviceMesh
+from torch.distributed.tensor._op_schema import (
     OpSchema,
     OpStrategy,
     PlacementList,
@@ -15,8 +16,7 @@ from torch.distributed._tensor._op_schema import (
     RuntimeSchemaInfo,
     TupleStrategy,
 )
-from torch.distributed._tensor._utils import normalize_to_torch_size
-from torch.distributed._tensor.ops.utils import (
+from torch.distributed.tensor._ops.utils import (
     as_list,
     expand_to_full_mesh_op_strategy,
     generate_redistribute_costs,
@@ -25,14 +25,14 @@ from torch.distributed._tensor.ops.utils import (
     normalize_dims,
     register_op_strategy,
 )
-from torch.distributed._tensor.placement_types import (
+from torch.distributed.tensor._utils import normalize_to_torch_size
+from torch.distributed.tensor.placement_types import (
     DTensorSpec,
     Partial,
     Placement,
     Replicate,
     Shard,
 )
-from torch.distributed.device_mesh import DeviceMesh


 aten = torch.ops.aten
@@ -5,14 +5,15 @@
 from typing import List

 import torch
-from torch.distributed._tensor._op_schema import (
+from torch.distributed.device_mesh import DeviceMesh
+from torch.distributed.tensor._op_schema import (
     OpSchema,
     OpStrategy,
     PlacementList,
     PlacementStrategy,
 )
-from torch.distributed._tensor.ops._einsum_strategy import gen_einsum_strategies
-from torch.distributed._tensor.ops.utils import (
+from torch.distributed.tensor._ops._einsum_strategy import gen_einsum_strategies
+from torch.distributed.tensor._ops.utils import (
     expand_to_full_mesh_op_strategy,
     generate_redistribute_costs,
     infer_broadcast_dims_map,
@@ -20,13 +21,12 @@ from torch.distributed._tensor.ops.utils import (
     map_placements_after_broadcast,
     register_op_strategy,
 )
-from torch.distributed._tensor.placement_types import (
+from torch.distributed.tensor.placement_types import (
     DTensorSpec,
     Placement,
     Replicate,
     Shard,
 )
-from torch.distributed.device_mesh import DeviceMesh


 aten = torch.ops.aten
@@ -2,7 +2,8 @@
 from typing import List, Sequence, Tuple

 import torch
-from torch.distributed._tensor._op_schema import (
+from torch.distributed.device_mesh import DeviceMesh
+from torch.distributed.tensor._op_schema import (
     _is_inplace_op,
     _is_out_variant_op,
     OpSchema,
@@ -12,21 +13,20 @@ from torch.distributed._tensor._op_schema import (
     StrategyType,
     TupleStrategy,
 )
-from torch.distributed._tensor.ops.utils import (
+from torch.distributed.tensor._ops.utils import (
     generate_redistribute_costs,
     infer_broadcast_dims_map,
     map_placements_after_broadcast,
     normalize_dim,
     register_op_strategy,
 )
-from torch.distributed._tensor.placement_types import (
+from torch.distributed.tensor.placement_types import (
     DTensorSpec,
     Partial,
     Placement,
     Replicate,
     Shard,
 )
-from torch.distributed.device_mesh import DeviceMesh


 aten = torch.ops.aten
@@ -1,14 +1,14 @@
 # mypy: allow-untyped-decorators
 # Copyright (c) Meta Platforms, Inc. and affiliates
 import torch
-from torch.distributed._tensor._op_schema import (
+from torch.distributed.device_mesh import DeviceMesh
+from torch.distributed.tensor._op_schema import (
     OpSchema,
     OpStrategy,
     PlacementStrategy,
     StrategyType,
 )
-from torch.distributed._tensor.ops.utils import is_tensor_partial, register_op_strategy
-from torch.distributed.device_mesh import DeviceMesh
+from torch.distributed.tensor._ops.utils import is_tensor_partial, register_op_strategy


 aten = torch.ops.aten
@@ -4,7 +4,8 @@
 from typing import cast, List, Optional, Sequence, Tuple

 import torch
-from torch.distributed._tensor._op_schema import (
+from torch.distributed.device_mesh import DeviceMesh
+from torch.distributed.tensor._op_schema import (
     _is_inplace_op,
     OpSchema,
     OpStrategy,
@@ -15,9 +16,9 @@ from torch.distributed._tensor._op_schema import (
     StrategyType,
     TupleStrategy,
 )
-from torch.distributed._tensor.ops._common_rules import pointwise_rule
-from torch.distributed._tensor.ops._embedding_ops import _MaskPartial
-from torch.distributed._tensor.ops.utils import (
+from torch.distributed.tensor._ops._common_rules import pointwise_rule
+from torch.distributed.tensor._ops._embedding_ops import _MaskPartial
+from torch.distributed.tensor._ops.utils import (
     expand_to_full_mesh_op_strategy,
     is_tensor_dim_sharded,
     is_tensor_evenly_shardable,
@@ -26,14 +27,13 @@ from torch.distributed._tensor.ops.utils import (
     register_op_strategy,
     register_prop_rule,
 )
-from torch.distributed._tensor.placement_types import (
+from torch.distributed.tensor.placement_types import (
     DTensorSpec,
     Partial,
     Placement,
     Replicate,
     Shard,
 )
-from torch.distributed.device_mesh import DeviceMesh


 aten = torch.ops.aten
@@ -17,23 +17,27 @@ from typing import (

 import torch
 from torch import Tensor
-from torch.distributed._tensor._op_schema import (
+from torch.distributed.device_mesh import DeviceMesh
+from torch.distributed.tensor._op_schema import (
     OpSchema,
     OpStrategy,
     PlacementStrategy,
     RuntimeSchemaInfo,
     StrategyType,
 )
-from torch.distributed._tensor.api import Shard
-from torch.distributed._tensor.ops.utils import (
+from torch.distributed.tensor._ops.utils import (
     generate_redistribute_costs,
     normalize_dim,
     normalize_dims,
     prod,
     register_op_strategy,
 )
-from torch.distributed._tensor.placement_types import DTensorSpec, Placement, Replicate
-from torch.distributed.device_mesh import DeviceMesh
+from torch.distributed.tensor.placement_types import (
+    DTensorSpec,
+    Placement,
+    Replicate,
+    Shard,
+)


 aten = torch.ops.aten
@@ -6,17 +6,17 @@ import operator
 from typing import cast, Iterable, List, Optional, Sequence, Tuple, Union

 import torch
-from torch.distributed._tensor._collective_utils import redistribute_cost
-from torch.distributed._tensor._op_schema import (
+from torch.distributed.tensor._collective_utils import redistribute_cost
+from torch.distributed.tensor._op_schema import (
     OpSchema,
     OpStrategy,
     PlacementList,
     PlacementStrategy,
     RuntimeSchemaInfo,
 )
-from torch.distributed._tensor.api import DTensor
-from torch.distributed._tensor.device_mesh import DeviceMesh
-from torch.distributed._tensor.placement_types import (
+from torch.distributed.tensor.api import DTensor
+from torch.distributed.tensor.device_mesh import DeviceMesh
+from torch.distributed.tensor.placement_types import (
     DTensorSpec,
     Partial,
     Placement,
@@ -6,9 +6,9 @@ from typing import cast, List, NamedTuple, Tuple

 import torch
 import torch.distributed._functional_collectives as funcol
-import torch.distributed._tensor.api as dtensor
-from torch.distributed._tensor.device_mesh import DeviceMesh
-from torch.distributed._tensor.placement_types import (
+import torch.distributed.tensor.api as dtensor
+from torch.distributed.tensor.device_mesh import DeviceMesh
+from torch.distributed.tensor.placement_types import (
     DTensorSpec,
     Partial,
     Placement,
@@ -6,7 +6,8 @@ from typing import Callable, cast, Dict, List, Optional, Sequence, Tuple, Union
 import torch
 from torch._ops import OpOverload
 from torch._subclasses import FakeTensorMode
-from torch.distributed._tensor._op_schema import (
+from torch.distributed.device_mesh import DeviceMesh
+from torch.distributed.tensor._op_schema import (
     OpInfo,
     OpSchema,
     OpStrategy,
@@ -17,13 +18,12 @@ from torch.distributed._tensor._op_schema import (
     StrategyType,
     TupleStrategy,
 )
-from torch.distributed._tensor._utils import (
+from torch.distributed.tensor._utils import (
     compute_local_shape,
     compute_local_stride,
     try_find_mesh_from_args,
 )
-from torch.distributed._tensor.placement_types import DTensorSpec, TensorMeta
-from torch.distributed.device_mesh import DeviceMesh
+from torch.distributed.tensor.placement_types import DTensorSpec, TensorMeta


 aten = torch.ops.aten
@@ -5,7 +5,7 @@ from typing import cast, Dict, List, Tuple

 import torch
 import torch.distributed as dist
-import torch.distributed._tensor.api as dtensor
+import torch.distributed.tensor.api as dtensor


 aten = torch.ops.aten
@@ -1,9 +1,10 @@
 from typing import cast, List, Sequence, Tuple

 import torch
-import torch.distributed._tensor.api as dtensor
+import torch.distributed.tensor.api as dtensor
 from torch._prims_common import ShapeType
-from torch.distributed._tensor.placement_types import (
+from torch.distributed.device_mesh import DeviceMesh
+from torch.distributed.tensor.placement_types import (
     _StridedShard,
     DTensorSpec,
     Partial,
@@ -11,7 +12,6 @@ from torch.distributed._tensor.placement_types import (
     Replicate,
     Shard,
 )
-from torch.distributed.device_mesh import DeviceMesh


 # TODO: audit existing code base to see if we can safely remove this API.
@@ -6,23 +6,21 @@ import warnings
 from typing import Any, Callable, cast, Optional, Sequence, Tuple

 import torch
-import torch.distributed._tensor._dispatch as op_dispatch
-import torch.distributed._tensor.random as random
+import torch.distributed.tensor._dispatch as op_dispatch
+import torch.distributed.tensor.random as random
 import torch.nn as nn
-from torch.distributed._tensor._collective_utils import (
-    check_tensor_meta,
-    mesh_broadcast,
-)
-from torch.distributed._tensor._redistribute import (
+from torch.distributed.device_mesh import _mesh_resources, DeviceMesh
+from torch.distributed.tensor._collective_utils import check_tensor_meta, mesh_broadcast
+from torch.distributed.tensor._redistribute import (
     Redistribute,
     redistribute_local_tensor,
 )
-from torch.distributed._tensor._utils import (
+from torch.distributed.tensor._utils import (
     compute_global_tensor_info,
     compute_local_shape,
     normalize_to_torch_size,
 )
-from torch.distributed._tensor.placement_types import (
+from torch.distributed.tensor.placement_types import (
     DTensorSpec,
     Partial,
     Placement,
@@ -30,11 +28,7 @@ from torch.distributed._tensor.placement_types import (
     Shard,
     TensorMeta,
 )
-from torch.distributed._tensor.random import (
-    is_rng_supported_mesh,
-    OffsetBasedRNGTracker,
-)
-from torch.distributed.device_mesh import _mesh_resources, DeviceMesh
+from torch.distributed.tensor.random import is_rng_supported_mesh, OffsetBasedRNGTracker


 __all__ = [
@@ -254,7 +248,8 @@ class DTensor(torch.Tensor):
         """
         Construct a DTensor from a local tensor, device mesh, and placement and
         other tensor properties (i.e. shape, requires_grad, strides, etc).
-        Note: This is not a public API and it's only supposed to be used by the
+
+        .. note:: This is not a public API and it's only supposed to be used by the
             operator implementations and internals. If you want to construct a
             DTensor from a local tensor, consider using ``DTensor.from_local``, if
             you want to construct a DTensor from a "global" tensor (where you
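The docstring above steers users toward DTensor.from_local when each rank already holds its shard, and toward distribute_tensor when starting from a full ("global") tensor. A hedged sketch of both entry points (the mesh and shapes are placeholders):

    import torch
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.tensor import DTensor, Shard, distribute_tensor

    mesh = init_device_mesh("cuda", (2,))

    # Wrap a tensor that is already the per-rank local shard.
    local_shard = torch.randn(4, 8)
    dt_a = DTensor.from_local(local_shard, mesh, [Shard(0)])

    # Start from the full tensor and let DTensor shard/scatter it.
    full = torch.randn(8, 8)
    dt_b = distribute_tensor(full, mesh, [Shard(0)])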
@@ -1,6 +1,6 @@
 # mypy: allow-untyped-defs
-from torch.distributed._tensor.debug.comm_mode import CommDebugMode
-from torch.distributed._tensor.debug.visualize_sharding import visualize_sharding
+from torch.distributed.tensor.debug.comm_mode import CommDebugMode
+from torch.distributed.tensor.debug.visualize_sharding import visualize_sharding


 __all__ = ["CommDebugMode", "visualize_sharding"]
@@ -12,7 +12,7 @@ def _get_sharding_prop_cache_info():
     This would return a named tuple showing hits, misses, maxsize and cursize of the sharding
     propagator cache.
     """
-    from torch.distributed._tensor.api import DTensor
+    from torch.distributed.tensor.api import DTensor

     return (
         DTensor._op_dispatcher.sharding_propagator.propagate_op_sharding.cache_info()  # type:ignore[attr-defined]
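The cached propagate_op_sharding call above exposes the standard functools cache statistics the docstring describes, so the named tuple can be read directly. A rough sketch, with the caveat that this is an internal attribute chain and may change:

    # Internal API: inspect sharding-propagation cache statistics.
    from torch.distributed.tensor import DTensor

    info = DTensor._op_dispatcher.sharding_propagator.propagate_op_sharding.cache_info()
    print(info.hits, info.misses, info.maxsize, info.currsize)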
@@ -8,7 +8,7 @@ import torch.nn as nn
 from functorch.compile import make_boxed_func
 from torch._functorch.compilers import aot_module
 from torch._inductor.decomposition import select_decomp_table
-from torch.distributed._tensor import DTensor
+from torch.distributed.tensor import DTensor


 inductor_decomps = select_decomp_table()
@@ -7,11 +7,11 @@ from collections import defaultdict
 from typing import Any, Dict

 import torch
+import torch.distributed._tools.mod_tracker as mod_tracker
 import torch.nn
 from torch._guards import detect_fake_mode
 from torch.autograd.graph import register_multi_grad_hook
-from torch.distributed._tensor.api import DTensor
-from torch.distributed._tools.mod_tracker import ModTracker
+from torch.distributed.tensor.api import DTensor
 from torch.nn.modules.module import (
     register_module_forward_hook,
     register_module_forward_pre_hook,
@@ -69,7 +69,7 @@ trivial_ops = {
 }


-class CommModeModuleTracker(ModTracker):
+class _CommModeModuleTracker(mod_tracker.ModTracker):
     """
     Inherits ModuleTracker and expands on its functionality to track the
     parameters and sharding information of a model at a module-level
@@ -250,7 +250,7 @@ class CommDebugMode(TorchDispatchMode):
             self.comm_registry.add(py_op)

         self.comm_registry.add(torch.ops._dtensor.shard_dim_alltoall)
-        self.advanced_module_tracker = CommModeModuleTracker()
+        self.advanced_module_tracker = _CommModeModuleTracker()

     def generate_json_dump(self, file_name="comm_mode_log.json", noise_level=3):
         """
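CommDebugMode, whose module tracker is renamed above, is a TorchDispatchMode: it wraps the region being profiled as a context manager, and generate_json_dump (signature shown above) writes the collected log. A hedged usage sketch; get_total_counts is an assumed accessor and the mesh/tensors are placeholders:

    import torch
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.tensor import Shard, distribute_tensor
    from torch.distributed.tensor.debug import CommDebugMode

    mesh = init_device_mesh("cuda", (2,))
    a = distribute_tensor(torch.randn(8, 8), mesh, [Shard(0)])
    b = distribute_tensor(torch.randn(8, 8), mesh, [Shard(1)])

    comm_mode = CommDebugMode()
    with comm_mode:
        (a @ b).sum()                        # DTensor matmul triggers collectives
    print(comm_mode.get_total_counts())      # assumed accessor
    comm_mode.generate_json_dump(file_name="comm_mode_log.json", noise_level=3)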
@@ -4,8 +4,8 @@ from typing import List, Sequence, Tuple
 import numpy as np

 from torch._prims_common import ShapeType
-from torch.distributed._tensor import DeviceMesh
-from torch.distributed._tensor.placement_types import Placement, Shard
+from torch.distributed.tensor import DeviceMesh
+from torch.distributed.tensor.placement_types import Placement, Shard


 __all__ = ["visualize_sharding"]
@@ -8,8 +8,8 @@ from typing import Callable, Dict, Union

 import torch
 import torch.nn as nn
-from torch.distributed._tensor import DeviceMesh
-from torch.distributed._tensor.debug import CommDebugMode
+from torch.distributed.tensor import DeviceMesh
+from torch.distributed.tensor.debug import CommDebugMode
 from torch.distributed.tensor.parallel import (
     ColwiseParallel,
     parallelize_module,
@@ -12,7 +12,7 @@ import time
 import torch
 import torch.distributed as dist
 import torch.nn as nn
-from torch.distributed._tensor import (
+from torch.distributed.tensor import (
     DeviceMesh,
     distribute_module,
     distribute_tensor,
@@ -9,23 +9,23 @@ from functools import cached_property
 from typing import List, TYPE_CHECKING

 import torch
-from torch.distributed._tensor import (
+from torch.distributed.checkpoint.metadata import (
+    ChunkStorageMetadata,
+    TensorProperties,
+    TensorStorageMetadata,
+)
+from torch.distributed.tensor import (
     DeviceMesh,
     DTensor,
     init_device_mesh,
     Replicate,
     Shard,
 )
-from torch.distributed._tensor.debug.visualize_sharding import visualize_sharding
-from torch.distributed.checkpoint.metadata import (
-    ChunkStorageMetadata,
-    TensorProperties,
-    TensorStorageMetadata,
-)
+from torch.distributed.tensor.debug.visualize_sharding import visualize_sharding


 if TYPE_CHECKING:
-    from torch.distributed._tensor.placement_types import Placement
+    from torch.distributed.tensor.placement_types import Placement


 def get_device_type():
@@ -6,8 +6,8 @@ torchrun --standalone --nnodes=1 --nproc-per-node=4 visualize_sharding_example.p
 import os

 import torch
-from torch.distributed._tensor import DeviceMesh, distribute_tensor, Replicate, Shard
-from torch.distributed._tensor.debug.visualize_sharding import visualize_sharding
+from torch.distributed.tensor import DeviceMesh, distribute_tensor, Replicate, Shard
+from torch.distributed.tensor.debug.visualize_sharding import visualize_sharding


 world_size = int(os.environ["WORLD_SIZE"])
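The example above is launched with the torchrun command in its header. A condensed, hedged sketch of the same flow against the new public path (the header keyword argument is assumed from the example's usage):

    # torchrun --standalone --nnodes=1 --nproc-per-node=4 demo.py
    import os
    import torch
    from torch.distributed.tensor import DeviceMesh, Shard, distribute_tensor
    from torch.distributed.tensor.debug.visualize_sharding import visualize_sharding

    world_size = int(os.environ["WORLD_SIZE"])
    mesh = DeviceMesh("cuda", list(range(world_size)))
    dt = distribute_tensor(torch.randn(8, 8), mesh, [Shard(0)])
    visualize_sharding(dt, header="row-wise sharded")  # header kwarg assumed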
@@ -2,9 +2,9 @@
 # Copyright (c) Meta Platforms, Inc. and affiliates
 from contextlib import contextmanager

-from torch.distributed._tensor.api import DTensor
-from torch.distributed._tensor.experimental.func_map import local_map
-from torch.distributed._tensor.experimental.register_sharding import register_sharding
+from torch.distributed.tensor.api import DTensor
+from torch.distributed.tensor.experimental.func_map import local_map
+from torch.distributed.tensor.experimental.register_sharding import register_sharding


 __all__ = ["implicit_replication", "local_map", "register_sharding"]
@@ -24,8 +24,8 @@ import torch.distributed as dist
 import torch.distributed._functional_collectives as ft_c
 import torch.nn.functional as F
 from torch import nn
-from torch.distributed._tensor import distribute_module, DTensor, Replicate, Shard
 from torch.distributed.device_mesh import DeviceMesh
+from torch.distributed.tensor import distribute_module, DTensor, Replicate, Shard
 from torch.distributed.tensor.parallel.style import ParallelStyle

@@ -4,8 +4,8 @@ from typing import Callable, Optional, Sequence, Tuple, Union

 import torch
 from torch.distributed._functional_collectives import AsyncCollectiveTensor
-from torch.distributed._tensor import DeviceMesh, DTensor
-from torch.distributed._tensor.placement_types import Placement
+from torch.distributed.tensor import DeviceMesh, DTensor
+from torch.distributed.tensor.placement_types import Placement


 try:
@@ -16,7 +16,6 @@ except ImportError:

 __all__ = ["local_map"]

-
 PlacementType = Optional[Sequence[Placement]]
 InputPlacements = Optional[Tuple[PlacementType, ...]]
 OutputPlacements = Union[PlacementType, Tuple[PlacementType, ...]]
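The aliases above describe what local_map accepts: one placement sequence per input plus the placements to stamp on the output. A hedged sketch of wrapping a plain-tensor function so it runs on the local shards of DTensor arguments (keyword names per the experimental API; treat the details as assumptions):

    import torch
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.tensor import Replicate, Shard, distribute_tensor
    from torch.distributed.tensor.experimental import local_map

    def local_mm(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        return torch.mm(x, y)                       # sees plain local tensors

    mesh = init_device_mesh("cuda", (2,))
    dist_mm = local_map(
        local_mm,
        out_placements=[Shard(0)],                  # layout of the wrapped output
        in_placements=([Shard(0)], [Replicate()]),  # expected layout per input
    )
    a = distribute_tensor(torch.randn(8, 4), mesh, [Shard(0)])
    b = distribute_tensor(torch.randn(4, 4), mesh, [Replicate()])
    c = dist_mm(a, b)                               # DTensor sharded on dim 0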
@@ -5,8 +5,8 @@ from typing import Callable, List, Sequence, Tuple, Union

 import torch
 from torch._ops import OpOverload
-from torch.distributed._tensor import DeviceMesh, DTensor
-from torch.distributed._tensor._op_schema import (
+from torch.distributed.tensor import DeviceMesh, DTensor
+from torch.distributed.tensor._op_schema import (
     _is_inplace_op,
     OpSchema,
     OpStrategy,
@@ -15,7 +15,7 @@ from torch.distributed._tensor._op_schema import (
     StrategyType,
     TupleStrategy,
 )
-from torch.distributed._tensor.ops.utils import expand_to_full_mesh_op_strategy
+from torch.distributed.tensor._ops.utils import expand_to_full_mesh_op_strategy


 __all__ = ["register_sharding"]
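register_sharding, re-exported above, is the experimental decorator for declaring which shardings an operator accepts; the registered function returns a list of (output placements, input placements) pairs. A rough sketch along those lines (the softmax example and argument handling are assumptions based on the experimental API, not this patch):

    import torch
    from torch.distributed.tensor import Replicate, Shard
    from torch.distributed.tensor.experimental import register_sharding

    aten = torch.ops.aten

    @register_sharding(aten._softmax.default)
    def custom_softmax_sharding(x, dim, half_to_float):
        softmax_dim = dim if dim >= 0 else dim + x.ndim
        acceptable = [([Replicate()], [Replicate(), None, None])]
        for d in range(x.ndim):
            if d != softmax_dim:
                # sharding any non-softmax dim keeps the op fully local
                acceptable.append(([Shard(d)], [Shard(d), None, None]))
        return acceptable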
@@ -5,22 +5,22 @@ from typing import Any, cast, Dict, List, Optional, Sequence, Tuple

 import torch
 from torch._subclasses.fake_tensor import FakeTensor
-from torch.distributed._tensor import DeviceMesh, distribute_tensor, DTensor
-from torch.distributed._tensor._op_schema import (
+from torch.distributed.tensor import DeviceMesh, distribute_tensor, DTensor
+from torch.distributed.tensor._op_schema import (
     DTensorSpec,
     OpSchema,
     OutputSharding,
     OutputSpecType,
     PlacementStrategy,
 )
-from torch.distributed._tensor._redistribute import redistribute_local_tensor
-from torch.distributed._tensor.placement_types import (
+from torch.distributed.tensor._redistribute import redistribute_local_tensor
+from torch.distributed.tensor.parallel.style import ColwiseParallel, ParallelStyle
+from torch.distributed.tensor.placement_types import (
     Placement,
     Replicate,
     Shard,
     TensorMeta,
 )
-from torch.distributed.tensor.parallel.style import ColwiseParallel, ParallelStyle
 from torch.export import ExportedProgram
 from torch.export.exported_program import ExportGraphSignature
 from torch.fx import GraphModule
@@ -3,8 +3,8 @@ from typing import no_type_check, Optional, Tuple

 import torch
 from torch.distributed._functional_collectives import AsyncCollectiveTensor
-from torch.distributed._tensor import DTensor
-from torch.distributed._tensor.placement_types import DTensorSpec
+from torch.distributed.tensor import DTensor
+from torch.distributed.tensor.placement_types import DTensorSpec


 @no_type_check
@@ -2,9 +2,9 @@
 import warnings
 from typing import Tuple, Union

-from torch.distributed._tensor import DeviceMesh
-from torch.distributed._tensor.placement_types import Placement
 from torch.distributed.device_mesh import _mesh_resources
+from torch.distributed.tensor import DeviceMesh
+from torch.distributed.tensor.placement_types import Placement


 try:
@@ -3,15 +3,15 @@ from fnmatch import fnmatch
 from typing import Dict, Union

 import torch
-import torch.distributed._tensor.random as random
+import torch.distributed.tensor.random as random
 import torch.nn as nn
-from torch.distributed._tensor import DeviceMesh
-from torch.distributed._tensor.random import (
+from torch.distributed.tensor import DeviceMesh
+from torch.distributed.tensor.parallel._utils import _validate_tp_mesh_dim
+from torch.distributed.tensor.parallel.style import ParallelStyle
+from torch.distributed.tensor.random import (
     is_rng_supported_mesh,
     TensorParallelRNGTracker,
 )
-from torch.distributed.tensor.parallel._utils import _validate_tp_mesh_dim
-from torch.distributed.tensor.parallel.style import ParallelStyle


 __all__ = [
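The imports touched above belong to the tensor-parallel API, whose parallelize_module takes a plan mapping submodule names to ParallelStyle objects. A hedged end-to-end sketch (the toy model, submodule names, and mesh size are placeholders):

    import torch
    import torch.nn as nn
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.tensor.parallel import (
        ColwiseParallel,
        RowwiseParallel,
        parallelize_module,
    )

    class Toy(nn.Module):
        def __init__(self):
            super().__init__()
            self.up = nn.Linear(16, 32)
            self.down = nn.Linear(32, 16)

        def forward(self, x):
            return self.down(torch.relu(self.up(x)))

    tp_mesh = init_device_mesh("cuda", (2,))
    model = parallelize_module(
        Toy().cuda(),
        tp_mesh,
        {"up": ColwiseParallel(), "down": RowwiseParallel()},
    )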
@@ -14,12 +14,12 @@ from torch.distributed._shard.sharded_tensor import (
 )
 from torch.distributed._shard.sharding_spec import ShardMetadata
 from torch.distributed._shard.sharding_spec.chunk_sharding_spec import ChunkShardingSpec
-from torch.distributed._tensor import DeviceMesh, DTensor, Replicate, Shard as DShard
 from torch.distributed.device_mesh import _mesh_resources
 from torch.distributed.fsdp._common_utils import _set_fsdp_flattened
 from torch.distributed.fsdp._fsdp_extensions import FSDPExtensions
 from torch.distributed.fsdp._shard_utils import _create_chunk_sharded_tensor
 from torch.distributed.remote_device import _remote_device
+from torch.distributed.tensor import DeviceMesh, DTensor, Replicate, Shard as DShard
 from torch.distributed.tensor.parallel._data_parallel_utils import (
     _flatten_tensor,
     _unflatten_tensor,
@@ -3,7 +3,7 @@ from functools import partial
 from typing import Any, Optional, Tuple

 import torch
-from torch.distributed._tensor import DeviceMesh, DTensor, Replicate, Shard
+from torch.distributed.tensor import DeviceMesh, DTensor, Replicate, Shard


 __all__ = [
@@ -8,15 +8,15 @@ import torch._prims_common as utils
 import torch.distributed._functional_collectives as funcol
 import torch.distributed.distributed_c10d as c10d
 from torch import Tensor
-from torch.distributed._tensor import DTensor, Replicate, Shard
-from torch.distributed._tensor.ops._embedding_ops import _MaskPartial
-from torch.distributed._tensor.ops._math_ops import (
+from torch.distributed.device_mesh import DeviceMesh
+from torch.distributed.tensor import DTensor, Replicate, Shard
+from torch.distributed.tensor._ops._embedding_ops import _MaskPartial
+from torch.distributed.tensor._ops._math_ops import (
     _skip_dim,
     Reduction,
     replicate_reduction_dims,
 )
-from torch.distributed._tensor.placement_types import DTensorSpec, Placement, TensorMeta
-from torch.distributed.device_mesh import DeviceMesh
+from torch.distributed.tensor.placement_types import DTensorSpec, Placement, TensorMeta


 aten = torch.ops.aten
@@ -6,7 +6,7 @@ from typing import Any, Dict, Optional, Tuple, Union

 import torch
 import torch.nn as nn
-from torch.distributed._tensor import (
+from torch.distributed.tensor import (
     DeviceMesh,
     distribute_module,
     distribute_tensor,
@@ -14,7 +14,7 @@ from torch.distributed._tensor import (
     Replicate,
     Shard,
 )
-from torch.distributed._tensor.placement_types import Placement
+from torch.distributed.tensor.placement_types import Placement


 __all__ = [
@@ -6,7 +6,8 @@ from typing import Any, cast, List, NamedTuple, Optional, Tuple

 import torch
 import torch.distributed._functional_collectives as funcol
-from torch.distributed._tensor._collective_utils import (
+from torch.distributed.device_mesh import DeviceMesh
+from torch.distributed.tensor._collective_utils import (
     fill_empty_tensor_to_shards,
     mesh_broadcast,
     mesh_scatter,
@@ -14,7 +15,6 @@ from torch.distributed._tensor._collective_utils import (
     shard_dim_alltoall,
     unpad_tensor,
 )
-from torch.distributed.device_mesh import DeviceMesh


 __all__ = ["Placement", "Shard", "Replicate", "Partial", "DTensorSpec", "TensorMeta"]
@@ -7,8 +7,8 @@ from typing import Dict, List, Optional
 import torch
 import torch.distributed as dist
 from torch import Tensor
-from torch.distributed._tensor.placement_types import DTensorSpec, Shard
 from torch.distributed.device_mesh import _get_device_handle, DeviceMesh
+from torch.distributed.tensor.placement_types import DTensorSpec, Shard


 __all__ = [
@@ -290,7 +290,7 @@ class OffsetBasedRNGTracker(_RNGStateTracker):
             return_offset=False,
         )[0]

-        from torch.distributed._tensor.ops.utils import prod
+        from torch.distributed.tensor._ops.utils import prod

         local_size = prod(local_size_on_rank_0)

@@ -317,7 +317,7 @@ class OffsetBasedRNGTracker(_RNGStateTracker):
        """
        dtensor_shape = spec.shape

-        from torch.distributed._tensor.ops.utils import prod
+        from torch.distributed.tensor._ops.utils import prod

        numel = prod(dtensor_shape)
        # pytorch: offset must be multiple of 4
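The retained comment above ("offset must be multiple of 4") is why the tracker rounds the RNG offset increment it derives from numel. A one-line sketch of that rounding, as an illustration rather than a quote of the surrounding method:

    numel = 6
    offset_incr = (numel + 3) // 4 * 4   # rounds 6 up to 8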
@@ -34,7 +34,6 @@ from torch.distributed._composable.fsdp._fsdp_param_group import (
     FSDPParamGroup,
     RegisterPostBackwardFunction,
 )
-from torch.distributed._tensor import distribute_tensor, DTensor, Shard
 from torch.distributed.device_mesh import DeviceMesh
 from torch.distributed.fsdp import CPUOffload, FullyShardedDataParallel as FSDP
 from torch.distributed.fsdp._common_utils import TrainingState
@@ -46,6 +45,7 @@ from torch.distributed.fsdp.fully_sharded_data_parallel import (
 )
 from torch.distributed.fsdp.sharded_grad_scaler import ShardedGradScaler
 from torch.distributed.fsdp.wrap import always_wrap_policy, ModuleWrapPolicy, wrap
+from torch.distributed.tensor import distribute_tensor, DTensor, Shard
 from torch.distributed.tensor.parallel import (
     ColwiseParallel,
     parallelize_module,