Summary:
Rename _device_mesh.py to device_mesh.py, update all callsites, add documentation.
We created stubs for the public class and methods in torch.distributed.device_mesh so that torch.distributed.device_mesh can be imported regardless of whether distributed is_available().
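A rough sketch of the guard pattern (not the actual PyTorch source; the stub body below is made up, only to illustrate why the module stays importable either way):

```python
# Sketch of the guard inside a module like torch/distributed/device_mesh.py:
# when distributed is not available, the module defines importable stubs
# instead of the real implementation, so the import itself never fails.
import torch

if not torch.distributed.is_available():
    class DeviceMesh:  # stub: importable, but unusable without distributed
        def __init__(self, *args, **kwargs):
            raise RuntimeError("DeviceMesh requires torch.distributed to be available")
else:
    ...  # real DeviceMesh implementation goes here
```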
Original diff reverted: D51629761
Original PR reverted: https://github.com/pytorch/pytorch/pull/115099
Prior to landing, all CI signals had passed. Shipit added the "ci/trunk" label to the PR but DID NOT wait for it and went ahead with the commit. More context can be found in the reverted PR above.
Test Plan: CI.
Differential Revision: D51861018
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115193
Approved by: https://github.com/fegin
Summary:
Rename _device_mesh.py to device_mesh.py, update all callsites, add documentation.
Original diff reverted: D51629761
Original PR reverted: https://github.com/pytorch/pytorch/pull/114991
It was failing a public module binding test on macOS, due to the change in import order in torch/distributed/fsdp/_common_utils.py. Since the original import would still work, we removed the changes to this file.
Test Plan: CI.
Differential Revision: D51825114
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115099
Approved by: https://github.com/wanchaol, https://github.com/fegin
This PR removes four usages of compute_local_offset() in the PyTorch repo and replaces them with the new API compute_local_shape_and_global_offset().
We will remove the compute_local_offset() API in the next diff, as there are still usages of it internally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108547
Approved by: https://github.com/wanchaol
The compute_local_shape_and_global_offset API does the following:
1) Calculates both local_shape and global_offset in one call, replacing two separate API calls (compute_local_size and compute_local_shape).
2) Generates the correct global_offset for checkpointing purposes. We are currently using compute_local_offset for downstream checkpoint components, which could lead to incorrect results. For checkpointing, we need global_offset instead of local_offset, and the two are not equal when a dimension is sharded multiple times on different mesh dimensions (e.g. placements = [Shard(0), Shard(0)]); see the sketch below.
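To illustrate point 2, here is a plain-Python sketch (no process group needed) of the offsets for a hypothetical 1-D tensor of length 8 on a 2x2 mesh with placements = [Shard(0), Shard(0)]; the sizes are made up, chosen only to show the mismatch:

```python
global_len = 8
mesh_shape = (2, 2)  # mesh dim 0 and mesh dim 1 both shard tensor dim 0

for i in range(mesh_shape[0]):      # coordinate on mesh dim 0
    for j in range(mesh_shape[1]):  # coordinate on mesh dim 1
        after_first = global_len // mesh_shape[0]        # 4 elements after the first shard
        local_len = after_first // mesh_shape[1]         # 2 elements per rank
        local_offset = j * local_len                     # offset within the 4-element chunk
        global_offset = i * after_first + j * local_len  # offset within the full tensor
        print(f"mesh coord ({i},{j}): local_shape=({local_len},), "
              f"local_offset={local_offset}, global_offset={global_offset}")
```

On mesh coordinate (1, 1) the shard starts at global index 6, while the offset produced by the second sharding alone is 2, so a checkpoint written with the local offset would land in the wrong slice of the global tensor.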
Follow-up PRs:
1) Replace related downstream components to use compute_local_shape_and_global_offset instead of compute_local_size and compute_local_offset.
2) Audit the existing code base to see whether compute_local_size and compute_local_offset can be removed, since they are still in use.
cc. @wanchaol
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107996
Approved by: https://github.com/wanchaol
As functional collectives are being updated, using tensor_split() as the underlying sharding algorithm would require padding and unpadding on multiple ranks. Therefore, we are changing the sharding algorithm to be in line with ``torch.chunk()``, so that in most scenarios padding is only needed on the last two ranks.
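For reference, a small example of the two split behaviors (the tensor length and number of ranks below are arbitrary, chosen only to show the difference):

```python
import torch

t = torch.arange(10)
# tensor_split balances the remainder across ranks: sizes [3, 3, 2, 2]
print([c.numel() for c in torch.tensor_split(t, 4)])
# chunk keeps full-size chunks up front, so only the trailing ranks are
# short and only they would need padding: sizes [3, 3, 3, 1]
print([c.numel() for c in torch.chunk(t, 4)])
```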
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98722
Approved by: https://github.com/wanchaol
Summary:
Implement a zeros() function in the DTensor API.
- the user specifies the shape of the zeros tensor, and the function creates the local zero tensor on each rank given the placement information (see the usage sketch below)
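A minimal usage sketch, assuming the factory is exposed as torch.distributed._tensor.zeros and that the script is launched with torchrun (module path and argument names may differ by version):

```python
# Hedged sketch: run with e.g. `torchrun --nproc-per-node 4 this_script.py`.
import torch.distributed as dist
from torch.distributed._tensor import DeviceMesh, Shard, zeros

dist.init_process_group("gloo")
mesh = DeviceMesh("cpu", list(range(dist.get_world_size())))

# Each rank only allocates its local shard of the global (8, 4) zeros tensor.
dt = zeros(8, 4, device_mesh=mesh, placements=[Shard(0)])
print(dist.get_rank(), dt.to_local().shape)  # e.g. torch.Size([2, 4]) on 4 ranks

dist.destroy_process_group()
```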
Test Plan:
- unit test for the util function compute_local_tensor_size
- unit test for _tensor.zeros
Reviewed By: wanchaol
Differential Revision: D43630718
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95863
Approved by: https://github.com/wanchaol