Commit Graph

79 Commits

Author SHA1 Message Date
Carlos Mocholí
9df4ee8d38 Fix ColwiseParallel typo (#116151)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116151
Approved by: https://github.com/wanchaol
2023-12-20 06:40:32 +00:00
Tianyu Liu
2a5659a797 add length assertion to PrepareModuleInput and PrepareModuleOutput (#115957)
## summary

`zip(inputs, self.input_layouts, self.desired_input_layouts)` is used in `_prepare_input_fn`, and similarly in `_prepare_output_fn`. Without an assertion, unmatched dimensions in the inputs/outputs are silently dropped, potentially causing unexpected behaviors.
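
As a hedged illustration (not code from the PR): `zip` truncates to the shortest iterable, so a mismatched number of inputs and layouts is silently dropped unless a length check fires first.

```python
inputs = ("x", "y", "z")
input_layouts = ("Shard(0)", "Replicate()")  # one layout short

# zip stops at the shorter iterable: 'z' is silently dropped
print(list(zip(inputs, input_layouts)))

# a length assertion like the one this PR adds surfaces the mismatch early
try:
    assert len(inputs) == len(input_layouts), (
        f"expected {len(inputs)} layouts, got {len(input_layouts)}"
    )
except AssertionError as e:
    print(f"caught early: {e}")
```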

## test plan
`python test/distributed/tensor/parallel/test_tp_style.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115957
Approved by: https://github.com/wanchaol
2023-12-18 21:50:18 +00:00
Wanchao Liang
a1a0b290d2 [tp] further fix the docs (#115974)
Some typos kept the note section from rendering properly. This wasn't visible in the last PR directly, as that PR only showed the documentation from its first commit :(

Also make the parallelize_module doc example more concrete

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115974
Approved by: https://github.com/wz337
2023-12-18 20:41:53 +00:00
Wanchao Liang
61abacf829 [tp] improve documentation (#115880)
Improve the TP documentation in terms of format and descriptions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115880
Approved by: https://github.com/XilunWu
2023-12-15 18:44:22 +00:00
Iris Zhang (PyTorch)
23fa9621e4 [DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099) (#115193)
Summary:

Rename _device_mesh.py to device_mesh.py, update all callsites, add documentation.
We created stubs for the public classes and methods in torch.distributed.device_mesh so that it can be imported whether or not distributed is available.
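
A minimal sketch of the stub pattern described here (the module layout is assumed, not taken from the diff):

```python
import torch.distributed as dist

if dist.is_available():
    from torch.distributed.device_mesh import DeviceMesh, init_device_mesh
else:
    # Stubs keep `import torch.distributed.device_mesh` working on builds
    # compiled without distributed support; using them raises at call time.
    class DeviceMesh:
        def __init__(self, *args, **kwargs):
            raise RuntimeError("DeviceMesh requires torch.distributed")

    def init_device_mesh(*args, **kwargs):
        raise RuntimeError("init_device_mesh requires torch.distributed")
```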

Original diff reverted: D51629761
Original PR reverted: https://github.com/pytorch/pytorch/pull/115099
Prior to landing, all CI signals passed. ShipIt added the "ci/trunk" label to the PR but DID NOT wait for it and went ahead committing. More context can be found in the reverted PR above.

Test Plan: CI.

Differential Revision: D51861018

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115193
Approved by: https://github.com/fegin
2023-12-08 08:44:32 +00:00
Nikita Shulga
a827ac71f2 Revert "[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099)"
This reverts commit eaa64339d6.
2023-12-05 08:59:36 -08:00
Iris Zhang (PyTorch)
eaa64339d6 [DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099)
Summary:
Rename _device_mesh.py to device_mesh.py, update all callsites, and add documentation.

Original diff reverted: D51629761
Original PR reverted: https://github.com/pytorch/pytorch/pull/114991
It was failing a public module binding test on macOS due to the change in import order for torch/distributed/fsdp/_common_utils.py. Since the original import still works, we removed the changes to this file.

Test Plan: CI.

Differential Revision: D51825114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115099
Approved by: https://github.com/wanchaol, https://github.com/fegin
2023-12-05 05:44:52 +00:00
PyTorch MergeBot
3a2e2044cd Revert "[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#114710) (#114991)"
This reverts commit 729ac7317a.

Reverted https://github.com/pytorch/pytorch/pull/114991 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/114991#issuecomment-1837214567))
2023-12-02 17:55:51 +00:00
Wanchao Liang
28925902fa [TP] fully rewrite Tensor Parallel APIs (#114732)
This PR rewrites the Tensor Parallel implementation. The Tensor Parallel APIs
are supposed to be a very thin wrapper around the DTensor APIs, but the current
implementation has gotten too messy and buggy, making it really hard to debug
what went wrong when using it. It's crucially important for advanced users and
developers to be able to understand the API and its implementation without
going through many different kinds of functions and utils, so that they can
trust what happens under the hood.

In particular this PR:

* Make ParallelStyle a real contract API for parallelize_module to take:
  each concrete ParallelStyle only needs to implement `apply` to apply the
  sharding to an nn.Module, and all unnecessary fields are removed. This
  also enables easier ParallelStyle authoring going forward (see the sketch
  after this list).
* Keep the ColwiseParallel and RowwiseParallel public interfaces, but
  refactor them so that the parameter sharding and the input/output handling
  live within the style itself, making it easy to understand how
  Linear/Embedding layers are sharded and how the input/output
  transformations are performed.
* Remove the private _prepare_input/_prepare_output_fn fields from both
  ColwiseParallel and RowwiseParallel. Since we have thrown deprecation
  messages in nightly for a while, TP is a prototype release, and the fields
  are private, it should be safe to remove them.
* Refactor the recently landed PrepareModuleInput/Output styles: change
  output_layouts to desired_input/output_layouts, group the functions inside
  the style itself, and remove the default arguments for these two styles so
  users must specify them and think about the sharding layouts. Fixed bugs
  around not handling the `use_local_output` flag.
* Make default arguments None instead of Placement objects; it is standard
  Python practice not to use a custom object instance as a default argument.
* Remove all dead APIs (i.e. the PairwiseParallel and SequenceParallel
  styles and all prepare input/output functions), as we have thrown
  deprecation messages for a while; we are in the process of removing all of
  them from the tests.
* Throw a deprecation warning for `tp_mesh_dim`, as we recommend device mesh
  slicing/indexing instead of manually specifying the mesh dim.
* Rewrite the documentation for every ParallelStyle to make it clearer what
  each style does.
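
A hedged sketch of what authoring a style under this contract could look like (the `apply` name follows this message; the shipped method may be spelled differently, e.g. `_apply`):

```python
import torch.nn as nn
from torch.distributed._tensor import DeviceMesh, Shard, distribute_tensor

class MyColwiseStyle:
    """Illustrative custom style: shard each parameter's dim 0 column-wise."""

    def apply(self, module: nn.Module, device_mesh: DeviceMesh) -> nn.Module:
        for name, param in list(module.named_parameters(recurse=False)):
            dtensor = distribute_tensor(param, device_mesh, [Shard(0)])
            module.register_parameter(name, nn.Parameter(dtensor))
        return module
```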

TODOs:
* Rewrite TP tests to adjust for the changes we have in this PR
* add more tests to guard the bug fixes

Differential Revision: [D51761183](https://our.internmc.facebook.com/intern/diff/D51761183)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114732
Approved by: https://github.com/wz337, https://github.com/fduwjj
2023-12-02 08:18:12 +00:00
Iris Zhang (PyTorch)
729ac7317a [DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#114710) (#114991)
Summary:

Same content of changes as https://github.com/pytorch/pytorch/pull/114710

Rename _device_mesh.py to device_mesh.py, update all callsites, and add documentation.
ghstack-source-id: 208980207
exported-using-ghexport

Test Plan: CI.

Reviewed By: wanchaol

Differential Revision: D51629761

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114991
Approved by: https://github.com/wanchaol, https://github.com/fduwjj, https://github.com/fegin
2023-12-02 04:39:41 +00:00
Wanchao Liang
4875e4d63f [tp] delete dead code (#114731)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114731
Approved by: https://github.com/fegin, https://github.com/wz337
2023-12-01 06:35:42 +00:00
wz337
7b3e45be59 [DeviceMesh] Rename get_dim_groups to get_group (#114708)
Rename get_dim_groups to get_group and update all callsites.
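
Call sites change roughly as follows (a sketch using the current import path, which postdates parts of this history):

```python
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

# assumes a process group is already initialized, e.g. under torchrun
mesh = init_device_mesh("cuda", (dist.get_world_size(),))

# group = mesh.get_dim_groups(0)  # old name, removed by this rename
group = mesh.get_group(0)         # new name
```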

Differential Revision: [D51629801](https://our.internmc.facebook.com/intern/diff/D51629801/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114708
Approved by: https://github.com/XilunWu, https://github.com/wanchaol, https://github.com/fegin
2023-11-30 23:40:14 +00:00
wz337
febbc48f43 [DeviceMesh] Make our mesh_dim kwarg naming consistent (#114707)
Change `size(self, dim: Optional[int] = None)` to `size(self, mesh_dim: Optional[int] = None)` so it is consistent with the rest of our APIs.

We also update usages for this API change in both PT and internal code (pyper, APS).
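
A hedged sketch of the keyword change, assuming an 8-rank job:

```python
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh("cuda", (2, 4))  # assumes 8 ranks

mesh.size()            # 8: total number of ranks in the mesh
# mesh.size(dim=0)     # old keyword argument
mesh.size(mesh_dim=0)  # new keyword argument: 2
```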

Differential Revision: [D51602986](https://our.internmc.facebook.com/intern/diff/D51602986/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114707
Approved by: https://github.com/XilunWu, https://github.com/wanchaol, https://github.com/fegin
2023-11-29 19:43:23 +00:00
Iris Zhang
72ce5dd13e [2D] Remove enable_2d_with_fsdp() API and make remove_enable_2d_with_fsdp private (#112473)
As we have our new 2D flow out, we want to remove `enable_2d_with_fsdp()`.
In addition, we change pre_dp_module_transform to private, as we may need to change the UX later on.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112473
Approved by: https://github.com/fegin, https://github.com/wanchaol
2023-11-16 01:14:00 +00:00
PyTorch MergeBot
277474f1a0 Revert "[2d] pass shape/stride during tensor unflatten (#113547)"
This reverts commit 93372455a7.

Reverted https://github.com/pytorch/pytorch/pull/113547 on behalf of https://github.com/wanchaol due to broken compile test ([comment](https://github.com/pytorch/pytorch/pull/113547#issuecomment-1813048318))
2023-11-15 18:32:54 +00:00
Wanchao Liang
93372455a7 [2d] pass shape/stride during tensor unflatten (#113547)
As titled. Built on top of the work @wz337 enabled, this saves some runtime
CPU time by recreating DTensor parameters with the correct shape/stride, and
avoids issues with unevenly sharded parameters.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113547
Approved by: https://github.com/XilunWu
ghstack dependencies: #113323, #113324
2023-11-14 09:28:09 +00:00
NVS Abhilash
44c0521e8c fix: docstring error in torch/distributed module (#113241)
Fixes: #113193

`pydocstyle <all_files_in_issue> --count`

- Before: 345
- After: 130

For deprecated methods, I have added a `noqa` to ignore them. I was not able to find the file `torch/distributed/tensor/parallel/multihead_attention_tp.py`, so I've ignored it for this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113241
Approved by: https://github.com/kit1980
2023-11-09 19:10:20 +00:00
Iris Zhang
12c1465d76 [DeviceMesh] Make mesh_resources private (#112294)
This is to prepare for moving DeviceMesh out as a standalone distributed package.

`_mesh_resources` should only be used in torch.distributed package.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112294
Approved by: https://github.com/fegin
2023-10-28 17:28:46 +00:00
Wanchao Liang
033680c9af [tp] fix PrepareModuleInput for multiple inputs (#112204)
Not all inputs need sharding annotations and conversion to DTensors. If the
user annotates only one input and marks the rest as None, we should skip
creating DTensors for those inputs.
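
A hedged sketch (argument names follow the current API, which postdates this commit): annotating only the first input and marking the second as None means no DTensor is created for it.

```python
from torch.distributed._tensor import Replicate, Shard
from torch.distributed.tensor.parallel import PrepareModuleInput

style = PrepareModuleInput(
    input_layouts=(Shard(0), None),            # second input: no annotation
    desired_input_layouts=(Replicate(), None)  # skipped, stays a plain tensor
)
```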
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112204
Approved by: https://github.com/fduwjj
2023-10-27 05:08:05 +00:00
Iris Z
185e76238d [2D][Documentation] Add some comments to _chunk_dtensor (#111775)
As title.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111775
Approved by: https://github.com/awgu
2023-10-23 20:43:03 +00:00
fduwjj
fdc29f58c6 [TP] Refactor style to make it work with torch.compile (#111625)
We are refactoring the parallel style to accomplish the following:
1. Further simplify the code logic to make it more readable for users.
2. Remove the tuple check so that we can work with dynamo for now. Ideally dynamo needs to support this case, and we will fix that in parallel.
3. Add tests for the newly added parallel style, in both unit tests and torch.compile tests, so that we can catch regressions due to code changes.
4. Move the placements early-return check into DTensor, since it is bypassed by dynamo.
5. Remove PairwiseParallelStyle from the unit tests in favor of the new Col/Rowwise parallel styles.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111625
Approved by: https://github.com/wanchaol
2023-10-20 19:20:43 +00:00
Wanchao Liang
03e28bde2e [tp] fix torch compile regression (#111521)
The most recent refactor of TP,
https://github.com/pytorch/pytorch/pull/111160, breaks the torch.compile
path, so revert the behavior by:
1. using the old default prepare_input/output
2. adding colwise/rowwise parallel tests instead
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111521
Approved by: https://github.com/fduwjj
2023-10-19 10:27:10 +00:00
Wanchao Liang
59281d5631 [tp] fix SP style regression (#111353)

Although we want to remove prepare_input/output, we should still keep
the old behavior for SequenceParallel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111353
Approved by: https://github.com/fduwjj
2023-10-16 17:18:17 +00:00
fduwjj
bfcd86955e [TP] Fix TP doc format to show examples correctly (#111346)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111346
Approved by: https://github.com/wanchaol
ghstack dependencies: #111160, #111166, #111176, #111177
2023-10-16 06:15:10 +00:00
fduwjj
25a2845d78 [TP] Enable embedding sharding in TP API (#111177)
We see use cases where embedding sharding is also needed in the TP API, so we enabled it there, since DTensor already supports colwise embedding sharding.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111177
Approved by: https://github.com/wanchaol
ghstack dependencies: #111160, #111166, #111176
2023-10-15 11:49:56 +00:00
fduwjj
8085e08a84 [TP] Add prepareInput and output for input/output DTensor layout annotation in the parent module in TP API (#111166)
In some use cases, we found that users might want to annotate the input/output DTensor layouts for the parent module, rather than for the submodule whose parameters are to be distributed. These two classes let users annotate the input/output DTensor layouts, for which we register pre-forward/forward hooks on the TP-lized module.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111166
Approved by: https://github.com/wanchaol
ghstack dependencies: #111160
2023-10-14 15:37:52 +00:00
fduwjj
3a8b10e2da [TP] Refactor Parallel Style to make it more usable (#111160)
One thing we found challenging for users is that we don't want to expose the concepts of prepare_input and prepare_output: there are so many function names to select from that it gets quite confusing. On the other hand, colwise and rowwise parallel always need the input(out) and output(in) to have a certain layout, so we can simplify the logic here and make it more usable.

So we added three public attributes to the ParallelStyle; the code logic looks like:

```python
from abc import ABC
from typing import Optional, Tuple, Union

from torch.distributed._tensor import Replicate, Shard
from torch.distributed._tensor.placement_types import Placement
from torch.distributed.tensor.parallel import parallelize_module


class ParallelStyle(ABC):
    """
    The parallel style the user wants the module or submodule to be parallelized with.

    We can add more in the future, but this seems sufficient for immediate needs.
    Users can extend this class to build their own parallel style with customized
    input/output preparations.
    """

    def __init__(
        self,
        input_layouts: Union[Placement, Tuple[Placement, ...]],
        output_layouts: Union[Placement, Tuple[Placement, ...]],
        use_local: bool,
    ):
        self.input_layouts = input_layouts
        self.output_layouts = output_layouts
        self.use_local = use_local


class RowwiseParallel(ParallelStyle):
    """
    Partition the rows of a module. We assume the input to be a sharded DTensor
    and the output to be a replicated tensor.
    """

    def __init__(self):
        super().__init__(input_layouts=Shard(-1), output_layouts=Replicate(), use_local=True)


class ColwiseParallel(ParallelStyle):
    """
    Partition the columns of a module. We assume the input to be a replicated
    DTensor and the output to be a sharded DTensor.
    """

    def __init__(self, output_layouts: Optional[Placement] = None):
        super().__init__(
            input_layouts=Replicate(),
            output_layouts=output_layouts if output_layouts is not None else Shard(-1),
            use_local=True,
        )


# For sequence parallel, users just set a different input layout, Shard(0) or
# Shard(1), instead of Replicate().


class PrepareModuleInput(ParallelStyle):
    """
    Only used to specify the input distribution spec for a module.
    """

    def __init__(self):
        super().__init__(input_layouts=Shard(0), output_layouts=Replicate(), use_local=False)


class PrepareModuleOutput(ParallelStyle):
    """
    Only used to specify the output distribution spec for a module.
    """

    def __init__(self):
        super().__init__(input_layouts=Replicate(), output_layouts=Shard(0), use_local=True)


parallelize_plan = {
    "embedding": ColwiseParallel(output_layouts=Replicate()),
    "attn": PrepareModuleInput(),
    "attn.w1": ColwiseParallel(),
    "attn.w2": ColwiseParallel(),
    "attn.w3": ColwiseParallel(),
    "attn.wo": RowwiseParallel(),
}

parallelize_module(
    module=block,  # this can be a submodule or module
    device_mesh=mesh["tp"],
    parallelize_plan=parallelize_plan,
)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111160
Approved by: https://github.com/wanchaol
2023-10-14 15:26:36 +00:00
wz337
8e32e62f67 [TP] Validate TP mesh dim for 2D composition (#111001)
Currently, we only support intra-node TP when composing TP with other parallelisms. This PR adds an additional check to validate the TP mesh dim during TP initialization when a parent mesh exists.

cc. @fegin, @fduwjj
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111001
Approved by: https://github.com/fduwjj
2023-10-12 02:11:44 +00:00
wz337
80dfc974dd [2D] Enable 2D FSDP+TP model.load_state_dict() (#110925)
This PR adds an `all_gather_dtensor()` method to fsdp/_fsdp_extensions.py and the actual implementation in tensor/parallel/fsdp.py. This enables FSDP to load a 2D DTensor state_dict into the model when calling `model.load_state_dict()`.

cc. @fegin

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110925
Approved by: https://github.com/fegin
ghstack dependencies: #110831, #110846
2023-10-11 18:22:20 +00:00
wz337
6c136c3302 [2D] Enable 2D DTensor state_dict for FSDP + TP (#110846)
This PR adds a `chunk_dtensor()` method to fsdp/_fsdp_extensions.py and the actual implementation of `chunk_dtensor()` in tensor/parallel/fsdp.py. This enables FSDP to return 2D DTensor state_dict when composing FSDP with TP.

cc. @fegin
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110846
Approved by: https://github.com/fegin, https://github.com/wanchaol
ghstack dependencies: #110831
2023-10-11 17:40:39 +00:00
wz337
8140494afd [3/N][2D] Enable training with new 2D flow (#110034)
Replacing https://github.com/pytorch/pytorch/pull/109553 as it gets reverted.

This PR enables training with the new 2D flow and adds an associated test. In addition, it moves the FSDP-specific parts of tensor/parallel/_data_parallel_utils.py back to tensor/parallel/fsdp.py to avoid a circular dependency for ddp.py and test/distributed/tensor/parallel/test_ddp_2d_parallel.py.

state_dict related changes would be in later PRs.

cc. @fegin, @fduwjj, @wanchaol, @awgu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110034
Approved by: https://github.com/fduwjj
2023-09-26 09:14:15 +00:00
PyTorch MergeBot
f5886bf352 Revert "[3/N][2D] Enable training with new 2D flow (#109553)"
This reverts commit 217b37c023.

Reverted https://github.com/pytorch/pytorch/pull/109553 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but those distributed failures look legit and they are failing in trunk https://hud.pytorch.org/pr/109553 ([comment](https://github.com/pytorch/pytorch/pull/109553#issuecomment-1734100546))
2023-09-25 16:37:19 +00:00
wz337
217b37c023 [3/N][2D] Enable training with new 2D flow (#109553)
This PR enables training with the new 2D flow and adds an associated test.

state_dict related changes would be in later PRs.

cc. @fegin, @fduwjj, @wanchaol, @awgu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109553
Approved by: https://github.com/fegin, https://github.com/awgu
2023-09-25 05:32:07 +00:00
Chien-Chin Huang
1b3e5b53f3 [FSDP][optim_state_dict] Add device to _shard_utils.py to explicitly use the device from fsdp_state (#109631)
_get_pg_default_device does not always get the device we want. This PR lets the user explicitly tell us the correct device.
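
A hypothetical sketch of the shape of the change (names assumed, not taken from the diff): the helper takes an explicit device from fsdp_state instead of inferring one from the process group.

```python
from typing import Optional
import torch

def _shard_tensor_sketch(
    tensor: torch.Tensor,
    device: Optional[torch.device] = None,  # new: caller passes fsdp_state's device
) -> torch.Tensor:
    if device is None:
        # old behavior: infer a device (e.g. via _get_pg_default_device),
        # which is not always the device FSDP actually wants
        device = tensor.device
    return tensor.to(device)
```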

Differential Revision: [D49425743](https://our.internmc.facebook.com/intern/diff/D49425743/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109631
Approved by: https://github.com/awgu, https://github.com/fduwjj, https://github.com/wz337
2023-09-20 01:59:38 +00:00
Wanchao Liang
a29b9101fa [dynamo] fix dynamo + DTensor to work with 2d (#108329)
Pair-debugged with @wconstab; we found issues on both the dynamo side and
the TP FSDP extension side. This PR fixes the dynamo + DTensor integration
so that the current graph-break FSDP can work with tensor parallel, by moving
the torch.compile call to after FSDP wrapping.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108329
Approved by: https://github.com/Skylion007, https://github.com/wconstab
2023-08-31 22:46:26 +00:00
fduwjj
3828cd4b79 [TP][EZ] Update doc for TP parallel style (#107819)
We need to update the doc for PairwiseParallel and SequenceParallel so that users don't get the wrong impression that these work for ``nn.Transformer``.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107819
Approved by: https://github.com/awgu, https://github.com/wanchaol
2023-08-24 00:13:52 +00:00
Wanchao Liang
da765995fb [2d] remove ShardedTensor from fsdp extension (#107472)
2D Parallel won't use ShardedTensor, and it causes headaches for dynamo
to recognize it, so remove it from the runtime flatten/unflatten path.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107472
Approved by: https://github.com/fduwjj
2023-08-21 17:16:07 +00:00
Wanchao Liang
d8f2ef10a6 [dtensor][1/n] refactor op dispatch logic to reduce overhead (#107305)
This PR is the first change in a series of refactors to the op dispatch logic to:
1. Remove the redundant logic in op dispatch and simplify the error checking.
2. Reduce the number of tree_map/tree_flatten/unflatten calls to cut the overhead coming from those operations.
3. Remove the CachedShardingPropagator by using lru_cache from functools directly; this not only helps TP, general DTensor operations get faster too (see the sketch after this list).
4. Change the view ops behavior: in-place changing the op_schema is dangerous for sharding-prop caching, so model view ops as one type of resharding too.
5. Enrich the output sharding to include whether the op needs a redistribute, so that we don't need an explicit op schema comparison to know it.
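
A minimal sketch of point 3, assuming the op schema is hashable (names are illustrative):

```python
from functools import lru_cache

class ShardingPropagator:
    @lru_cache(maxsize=None)
    def propagate_op_sharding(self, op_schema):
        # stand-in for the expensive propagation that computes output
        # placements from the op and its input specs
        return f"sharding for {op_schema}"

prop = ShardingPropagator()
prop.propagate_op_sharding("aten.addmm.default")  # computed once
prop.propagate_op_sharding("aten.addmm.default")  # served from the cache
```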

This should help further reduce the CPU overhead. Benchmark results:
before (without this change), aten.addmm latency: 0.476ms
![Screenshot 2023-08-16 at 10 46 26 AM](https://github.com/pytorch/pytorch/assets/9443650/7692e6c1-1936-4c7f-bf9c-6c8c9b8f6c76)

after (with this change), aten.addmm latency: 0.341ms
![Screenshot 2023-08-16 at 11 05 49 AM](https://github.com/pytorch/pytorch/assets/9443650/15a53f0b-7a95-444e-ab2f-3ee0ad2fa47f)

Overall, one layer of MLP time reduced from 13.535ms to 9.665ms.

Apart from the overhead reduction, this PR simplifies the op dispatching and resharding logic (more refactoring is needed to make things cleaner; that will be done in later PRs).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107305
Approved by: https://github.com/fduwjj
2023-08-18 18:30:46 +00:00
fduwjj
983fd5ba79 [2D][TP] Enable DDP TP integration with unit test (#106583)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106583
Approved by: https://github.com/kumpera, https://github.com/fegin, https://github.com/wanchaol
ghstack dependencies: #107313
2023-08-17 02:54:17 +00:00
fduwjj
f3b0d83fe3 [EZ][TP] Refactor FSDP 2D integration extension code so that it can re-used (#107313)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107313
Approved by: https://github.com/wz337
2023-08-16 22:01:17 +00:00
fduwjj
4a6ca4cc05 [TP][DTensor Perf] Some perf improvement to reduce DTensor CPU overhead (#106524)
By inspecting a small TP benchmark, we found a couple of things we can optimize:
1. We call deep_copy many times when we initialize DTensor.
2. Some sharding_prop results are not cached successfully.
3. We still call redistribute when it's not necessary.

![image](https://github.com/pytorch/pytorch/assets/6937752/b847d110-eea1-45df-9298-066d0ba07dd7)

![image](https://github.com/pytorch/pytorch/assets/6937752/fc08f564-caed-496b-80d7-275c1dba3806)

![image](https://github.com/pytorch/pytorch/assets/6937752/fdc06cc4-a4ba-48e8-a118-c041bbd04f5e)

So we want to:
1. Remove the deep_copy; we now make placements a tuple so we are sure it's immutable.
2. Somehow the op_schema gets changed during sharding-op propagation, so we store a hashed version of it before passing it to sharding_prop. Ideally we want to figure out why `op_schema` gets changed; it looks like it changes in both the index and detach/view ops, so it may take more time to debug.
3. When we hash the op_schema, hash the entire args_schema, not just the args_spec, which only contains the DTensorSpecs from args that are DTensors.
4. It turns out that DTensor sometimes has mem_format set to None (not contiguous), which triggers an unnecessary redistribute, so we only compare type/shape/stride in the metadata.

Also we need to ensure _Partial and Shard have different hash values in the DTensorSpec (see the sketch below).
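
A hedged sketch of the hashing points above (illustrative classes, not the shipped ones): include the placement type in the hash, and keep placements an immutable tuple.

```python
class Shard:
    def __init__(self, dim: int):
        self.dim = dim
    def __eq__(self, other):
        return type(other) is Shard and other.dim == self.dim
    def __hash__(self):
        # include the placement type so Shard and _Partial never collide
        return hash(("Shard", self.dim))

class _Partial:
    def __eq__(self, other):
        return type(other) is _Partial
    def __hash__(self):
        return hash(("_Partial",))

placements = (Shard(0),)  # a tuple, not a list: immutable and hashable
assert hash(Shard(0)) != hash(_Partial())
```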

![image](https://github.com/pytorch/pytorch/assets/6937752/321e6890-1ab6-4975-adc9-524c6ef9a76b)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106524
Approved by: https://github.com/wanchaol
2023-08-14 20:03:19 +00:00
alanhe151220037
1afbc985fe Make RNGStateTracker support cuda-like device (#106771)
Replace `CudaRNGStateTracker` with `RNGStateTracker` by rewriting some CUDA-binding code with a `device_handle`.
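
A hedged sketch of the `device_handle` idea, assuming the handle is just the per-backend torch module looked up by name:

```python
import torch

def get_device_handle(device_type: str = "cuda"):
    # e.g. torch.cuda for "cuda"; CPU has no device RNG module here
    return getattr(torch, device_type, None) if device_type != "cpu" else None

handle = get_device_handle("cuda")
if handle is not None and handle.is_available():
    state = handle.get_rng_state()  # previously hard-coded torch.cuda.get_rng_state()
```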

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106771
Approved by: https://github.com/wanchaol
2023-08-10 19:14:33 +00:00
fduwjj
487ebcac3b Clean up unused MHA code to avoid confusion (#105956)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105956
Approved by: https://github.com/wz337, https://github.com/ezyang, https://github.com/wanchaol
2023-07-27 17:10:17 +00:00
FFFrog
9a1cdcb8a0 Format: fixing multiple string concatenation in single line (#106013)
Fix multiple string concatenations on a single line.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106013
Approved by: https://github.com/albanD
2023-07-26 18:39:18 +00:00
Nikita Shulga
5837e95d30 [Reland] Update mypy to 1.4.1 (#105227)
This PR re-lands
- [Typing] Fix PEP 484 Violation (#105022)
- Update mypy to 1.4.1 (#91983)

Those were reverted due to a conflict with the internal source repo.

Mostly fixes for PEP-484 violations (i.e., when a default arg is set to None but the type is not annotated as Optional); see the sketch below.
Plus a few real fixes:
  - Add missing `_get_upgraders_entry_map` to `torch/_C/__init__.pyi`
  - Add missing return statement to `torch._export.deserialize_graph`
  - Fix error message in `torch.ao.ns.fx.weight_utils.get_lstm_mod_weights`
  - Add an assert in `torch/optim/optimizer.py` that the Optional list is not None
TODO (in followup PR):
  - Fix erroneous `isinstance` check in `torch/ao/quantization/_pt2e/qat_utils.py`
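
A representative before/after of the PEP-484 pattern fixed throughout (the function is illustrative, not from the diff):

```python
from typing import Optional

# before (flagged by mypy 1.4.1): default is None, annotation disallows it
#   def load(path: str, map_location: str = None): ...

# after: the annotation is made explicitly Optional
def load(path: str, map_location: Optional[str] = None) -> None:
    ...
```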

Unrelated: to bypass CI failures due to the gcc9 dependency update in Ubuntu 18.04:
- Add a hack to `.ci/docker/install_conda.sh` that squashes the older libstdc++ from the conda environment in favor of the one from the OS
- Update bazel CUDA builds to focal, as with libstdc++-6.0.32 bazel builds lose the ability to catch exceptions (probably because they link cupti statically, but I could not find where that is done)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105227
Approved by: https://github.com/atalman, https://github.com/albanD, https://github.com/Skylion007
2023-07-15 20:30:20 +00:00
PyTorch MergeBot
15fd1ea118 Revert "[Reland] Update mypy to 1.4.1 (#105227)"
This reverts commit c9c4f8efc3.

Reverted https://github.com/pytorch/pytorch/pull/105227 on behalf of https://github.com/atalman due to trying to mitigate ci sev #105248 ([comment](https://github.com/pytorch/pytorch/pull/105227#issuecomment-1636510935))
2023-07-14 22:28:35 +00:00
Nikita Shulga
c9c4f8efc3 [Reland] Update mypy to 1.4.1 (#105227)
This PR re-lands
- [Typing] Fix PEP 484 Violation (#105022)
- Update mypy to 1.4.1 (#91983)

Those were reverted due to a conflict with the internal source repo.

Mostly fixes for PEP-484 violations (i.e., when a default arg is set to None but the type is not annotated as Optional)
Plus a few real fixes:
  - Add missing `_get_upgraders_entry_map` to `torch/_C/__init__.pyi`
  - Add missing return statement to `torch._export.deserialize_graph`
  - Fix error message in `torch.ao.ns.fx.weight_utils.get_lstm_mod_weights`
  - Add an assert in `torch/optim/optimizer.py` that the Optional list is not None
TODO (in followup PR):
  - Fix erroneous `isinstance` check in `torch/ao/quantization/_pt2e/qat_utils.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105227
Approved by: https://github.com/atalman, https://github.com/albanD, https://github.com/Skylion007
2023-07-14 20:45:12 +00:00
PyTorch MergeBot
3c5a494d7a Revert "Update mypy to 1.4.1 (#91983)"
This reverts commit 634659e262.

Reverted https://github.com/pytorch/pytorch/pull/91983 on behalf of https://github.com/malfet due to It's dependent change was reverted, so reverting this one as well, to keep CI clean ([comment](https://github.com/pytorch/pytorch/pull/91983#issuecomment-1636059709))
2023-07-14 15:59:16 +00:00
Nikita Shulga
634659e262 Update mypy to 1.4.1 (#91983)
Mostly fixes for PEP-484 violations (i.e., when a default arg is set to None but the type is not annotated as Optional)
Plus a few real fixes:
  - Add missing `_get_upgraders_entry_map` to `torch/_C/__init__.pyi`
  - Add missing return statement to `torch._export.deserialize_graph`
  - Fix error message in `torch.ao.ns.fx.weight_utils.get_lstm_mod_weights`
TODO (in followup PR):
  - Fix erroneous `isinstance` check in `torch/ao/quantization/_pt2e/qat_utils.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91983
Approved by: https://github.com/kit1980, https://github.com/ZainRizvi, https://github.com/huydhn, https://github.com/thiagocrepaldi, https://github.com/aaronenyeshi
2023-07-13 16:30:36 +00:00
Xilun Wu
e799f565eb [DTensor][TP][Random] Introduce TensorParallelRNGTracker to integrate parallel RNG state with Tensor Parallel (#103910)
This PR enables the automatic use of `TensorParallelRNGTracker` in the Tensor Parallel API. Unit tests will be added to cover it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103910
Approved by: https://github.com/wanchaol, https://github.com/fduwjj
2023-06-30 08:06:41 +00:00