Commit Graph

2083 Commits

Author SHA1 Message Date
Wanchao Liang
a26480a4d1 [dtensor] move early return check into redistribute autograd function (#121653)
This PR fixed the bug of redistribute to move early return check into the
redistribute autograd function, so that even though we redistribute the
same placement, the grad_placements from the `to_local` call might be
different, the redistribute backward still need to happen

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121653
Approved by: https://github.com/awgu
2024-03-12 17:37:30 +00:00
PyTorch MergeBot
0398dc9e8e Revert "[DCP] Makes fsspec public (#121508)"
This reverts commit d482614fec.

Reverted https://github.com/pytorch/pytorch/pull/121508 on behalf of https://github.com/osalpekar due to this causes torchrec tests to fail internally with this error: ModuleNotFoundError: No module named 'fsspec'. see [D54779117](https://www.internalfb.com/diff/D54779117) ([comment](https://github.com/pytorch/pytorch/pull/121508#issuecomment-1992137831))
2024-03-12 17:02:43 +00:00
Andrew Gu
85dc254364 [DTensor] Moved Transformer sharding to staticmethod (#121660)
To support FSDP + TP/SP unit tests, let us factor out the canonical TP/SP sharding of `Transformer` to a staticmethod that can be called by other unit tests.

Test Plan:
```
pytest test/distributed/tensor/parallel/test_tp_examples.py -k test_transformer_training
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121660
Approved by: https://github.com/wanchaol, https://github.com/yifuwang
ghstack dependencies: #121360, #121357
2024-03-12 15:08:57 +00:00
Howard Huang
2a99e6f299 Update error message (#121644)
Summary:
We don't want people to move to NCCL exp without explicit opt in. It seems that sparse allreduce was accidentally called and people were confused whether they should use NCCL exp instead.

Update the error message to explicitly say that sparse_allreduce is not supported.

Test Plan: sandcastle

Differential Revision: D54759307

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121644
Approved by: https://github.com/awgu
2024-03-12 13:04:21 +00:00
Andrew Gu
272cf29e4d [FSDP2][BE] Refactored check_1d_sharded_parity to use mesh (#121357)
Eventually, we should just have one unified way to check for parity between a `DTensor`-sharded model and a replicated model. This PR is a small refactor to work toward that. One current gap to use this `check_sharded_parity` function for 2D is that FSDP's `(Shard(0), Shard(0))` layout differs from that of the `DTensor` APIs since FSDP shards on dim-0 after TP shards on dim-0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121357
Approved by: https://github.com/weifengpy
ghstack dependencies: #121360
2024-03-11 22:34:42 +00:00
PyTorch MergeBot
fd0dbcd891 Revert "Batch Norm Consolidation (#116092)"
This reverts commit 7b4f70eda5.

Reverted https://github.com/pytorch/pytorch/pull/116092 on behalf of https://github.com/osalpekar due to Causes build failure in //caffe2:aten-hip (AMD build) target. See [D54707318](https://www.internalfb.com/diff/D54707318) for more details, may require internal build system changes to resolve. ([comment](https://github.com/pytorch/pytorch/pull/116092#issuecomment-1989542965))
2024-03-11 22:22:41 +00:00
wz337
60cd2a43ca [DeviceMesh] Add support for nD slicing (#119752)
Fixes one of the issue mentioned in #118639
@mvpatel2000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119752
Approved by: https://github.com/wanchaol
2024-03-10 00:16:37 +00:00
Yifu Wang
71d0202627 [dynamo] support rewriting dist.all_reduce with explicitly specified reduce op (#120181)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120181
Approved by: https://github.com/wconstab, https://github.com/awgu
2024-03-09 08:28:22 +00:00
Wanchao Liang
242e03ba86 [dtensor] add async_op option to redistribute and some refactor (#121477)
async output option was only available in `full_tensor()` call, but I think it's
generally good to make this option available in the `redistribute` call directly
so that user can control it

This PR adds async_op option to redistribute call, to allow user control
whether to perform tensor redistribution asynchronously or not.

By default we set this to False, this is to follow the semantics of the c10d
collectives.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121477
Approved by: https://github.com/wz337
2024-03-09 06:17:23 +00:00
Aidyn-A
eb3919944d [C10d][NCCL] Refactor complex all_reduce and broadcast (#121045)
The necessity of this PR lies in the fact that autograd engine + DDP calls `all_reduce` from C++, so the changes must be made in C++.

```
[rank0]: Traceback (most recent call last):
[rank0]:   File "~/complex_ddp.py", line 72, in <module>
[rank0]:     main()
[rank0]:   File "~/complex_ddp.py", line 64, in main
[rank0]:     loss.backward()
[rank0]:   File "/home/usr/pytorch/torch/_tensor.py", line 525, in backward
[rank0]:     torch.autograd.backward(
[rank0]:   File "/home/usr/pytorch/torch/autograd/__init__.py", line 267, in backward
[rank0]:     _engine_run_backward(
[rank0]:   File "/home/usr/pytorch/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank0]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank0]: TypeError: Input tensor data type is not supported for NCCL process group: ComplexFloat
```

I believe, for minimizing the Python overhead, the same could be done for the rest of the ops, what do you think @kwen2501?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121045
Approved by: https://github.com/eqy, https://github.com/kwen2501
2024-03-09 02:00:54 +00:00
Lucas Pasqualin
d482614fec [DCP] Makes fsspec public (#121508)
Fixes #118033

Also removes `_checkpointer.py` class
original PR's:
- https://github.com/pytorch/pytorch/pull/121330
- https://github.com/pytorch/pytorch/pull/121329

We're also disabling `test_fsdp` since it is failing on random PR's

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121508
Approved by: https://github.com/fegin
2024-03-09 01:14:18 +00:00
Wanchao Liang
bc02fca358 [dtensor] to_local backward grad placement passthrough (#121474)
to_local accepts a `grad_placements` if user choose to pass, previously
we enforce the grad_out to be the "same" placement as the current
DTensor for safety.

But I realized that we DO NOT need to enforce this constraint. Why?
backward placement does not need to be the same as fwd tensor placement, this
is already the case for param vs param.grad (i.e. param can be replicate
and grad can be partial), so we should not restrict this to activation
vs activation grad too

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121474
Approved by: https://github.com/awgu, https://github.com/yoyoyocmu, https://github.com/yifuwang
2024-03-08 23:11:49 +00:00
Ke Wen
038b2e8780 [c10d] Add complex support for P2P (#121240)
Fixes the following error when `tensor` is a complex tensor:
```
[rank0]:     return pg.send([tensor], dst, tag)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: Unconvertible NCCL type ComplexFloat
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121240
Approved by: https://github.com/shuqiangzhang
2024-03-08 22:47:49 +00:00
Boyuan Feng
35d3adb4b0 Add ATen Op _chunk_cat and _chunk_cat.out (#121081)
# Motivation

In backward of per-parameter sharding FSDP, each rank performs reduce scatter to sync gradients across ranks. A rank chunks each gradient tensor into `world_size` slices along the 0-th dimension and concatenate all slices along the 1-th dimension. Gradient tensors will be padded before concatenation when tensor.size(0) % world_size != 0.

### Example 1
Consider `world_size=3` and tensors A (2x4), B (3x3), C (1x2):

Input tensors:
```
AAAA   BBB   CC
AAAA   BBB
       BBB
```

Reduce-scatter-copy-in Output:
```
AAAABBBCC
AAAABBB00
0000BBB00
```

### Example 2
Consider `world_size=2` and tensors A (2x4), B (3x3), C(1x2), D(4x2):

Input tensors:
```
AAAA   BBB   CC   DD
AAAA   BBB   00   DD
       BBB        DD
       000        DD
```

Reduce-scatter-copy-in first pad:
```
AAAA   BBB   CC   DD
AAAA   BBB   00   DD
       BBB        DD
       000        DD
```

Then chunk and cat along dim as the output:
```
AAAABBBBBBCCDDDD
AAAABBB00000DDDD
```

The performance of reduce-scatter-copy-in is critical to per-parameter sharding FSDP. However, reduce-scatter-copy-in via composing existing ATen ops involves `cat` and irregular `pad`, leading redundant data copies and unsatisfactory performance.

# PR
We provide aten native support for reduce-scatter-copy-in, namely `_chunk_cat()`:

```
_chunk_cat(Tensor[] tensors, int dim, int num_chunks) -> Tensor
```

This PR includes the registration of `_chunk_cat` and `_chunk_cat.out`, OpInfo tests, and basic implementation composing existing ATen ops.
In the next PR, we will add the CUDA implementation. Comparing with baselines of composing existing ATen ops, `_chunk_cat()` CUDA implementation improves copy bandwidth from 498 GB/s to 966 GB/s on a production benchmark.

## Requirements on input

1. If input tensors have different ndims, dim should be non-negative and be less than the ndims of every input tensors. If all input tensors have the same ndims, we support both negative and non-negative dim.
2. For wrapped_dim, all tensors should have the same size for 0,...,wrapped_dim-1 dimensions. No requirements for (wrapped_dim, ...)-th dimension.
3. Expect positive num_chunks
4. Expect non-empty input tensor list and each input tensor should have at least 1 element

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121081
Approved by: https://github.com/albanD
2024-03-08 21:48:12 +00:00
Yifu Wang
22cd2658b4 Disable GroupRegistry's thread isolation by default (#121457)
Today `GroupRegistry` employs thread isolation by default, i.e. every thread sees its own process group registry. This is intended to work for one-device-per-process (for python use cases) and one-device-per-thread case (for custom native runtimes).

However, there's a problem - there are python use cases that initializes/registers process groups in one thread, and runs collectives in another thread. This use case should be supported. However, since `GroupRegistry` employs thread isolation by default, collectives in different threads can't find the registered process groups.

This PR fixes the issue by:
- Make `GroupRegistry` work in non-thread isolation mode by default. This would match the behavior w/o the native process group registry.
- Introduces `set_thread_isolation_mode` so one-device-per-thread runtimes can enable thread isolation mode explicitly.

Differential Revision: [D54658515](https://our.internmc.facebook.com/intern/diff/D54658515)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121457
Approved by: https://github.com/wanchaol
2024-03-08 19:31:24 +00:00
PyTorch MergeBot
0f3f4f5534 Revert "[nit][DCP][DSD] Remove Unused Variables in test_state_dict.py (#121204)"
This reverts commit 4186c36531.

Reverted https://github.com/pytorch/pytorch/pull/121204 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but the failure looks legit ([comment](https://github.com/pytorch/pytorch/pull/121204#issuecomment-1986252526))
2024-03-08 19:08:50 +00:00
Wanchao Liang
08460f4bae [tp] remove deprecated tp_mesh_dim arg (#121432)
This PR removes the deprecated tp_mesh_dim arg to prepare for release.
As we deprecated this arg for a while (by throwing deprecating
messages), we should remove it before the release

#suppress-api-compatibility-check

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121432
Approved by: https://github.com/wz337
ghstack dependencies: #121431
2024-03-08 17:46:44 +00:00
Yeounoh Chung
f7ec984b1b [DTensor][XLA] support XLA backend in distirbute_module API (#121355)
Addresses #92909  cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121355
Approved by: https://github.com/wanchaol
2024-03-08 15:47:33 +00:00
andrewor14
7b4f70eda5 Batch Norm Consolidation (#116092)
**Summary:**

This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:

```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
  aten.native_batch_norm ->
  aten._native_batch_norm_legit (export only) ->
  _batch_norm_legit_cpu/cuda (kernels, export only) ->
  _batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```

Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.

Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:

```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```

The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:

```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```

Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2 week FC
window, and the ops used in the old stack is planned to
be removed after the 6 month BC window.

Test Plan: `OpInfo` tests for `batch_norm_with_update`.

Reviewers: albanD, bdhirsh

Subscribers: albanD, bdhirsh, supriyar

Tasks: https://github.com/pytorch/pytorch/issues/111384

Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2024-03-08 15:07:15 +00:00
Lucas Pasqualin
96ed37ac13 [DCP] Makes async_save public (#121325)
Makes async_save public

Differential Revision: [D54593610](https://our.internmc.facebook.com/intern/diff/D54593610/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121325
Approved by: https://github.com/wz337
ghstack dependencies: #121317
2024-03-08 05:13:13 +00:00
Lucas Pasqualin
909d73d8cb [DCP] Removes no_dist and coordinator_rank from public DCP API's (#121317)
[DCP] Removes `no_dist` and `coordinator_rank` from public DCP API's

Differential Revision: [D54591181](https://our.internmc.facebook.com/intern/diff/D54591181/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121317
Approved by: https://github.com/fegin
2024-03-08 02:14:12 +00:00
wz337
4186c36531 [nit][DCP][DSD] Remove Unused Variables in test_state_dict.py (#121204)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121204
Approved by: https://github.com/Skylion007
2024-03-08 01:54:25 +00:00
Tianyu Liu
dc514b967e [dtensor][TP] check funcol calls and improve doc for loss parallel (#121366)
Since CommDebugMode is fixed, we can check that loss parallel is working as expected.

Under loss parallel, the forward computation should invoke 3 all-reduces, and the backward computation should invoke no functional collectives.

Co-authored-by: Wanchao <wanchaol@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121366
Approved by: https://github.com/wanchaol
2024-03-08 01:41:31 +00:00
Chien-Chin Huang
0811f15270 [DCP][state_dict] Let _offload_state_dict_to_cpu to return the companion_obj if it exist. (#121273)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121273
Approved by: https://github.com/wz337, https://github.com/LucasLLC
2024-03-08 00:24:29 +00:00
Shengbao Zheng
60aaba4128 create function to get ProcessGroupNCCL uid (#121132)
Summary: expose ProcessGroupNCCL uid

Differential Revision: D54446056

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121132
Approved by: https://github.com/aaronenyeshi
2024-03-07 18:34:38 +00:00
Andrew Gu
e8e3049f57 [FSDP2] Relaxed check for parent mesh (#121360)
Mixing 1D and 2D `DTensor`s in the same sharded state dict should be okay, so we can remove the check that a parameter for FSDP to shard must be a `DTensor` if passing a child mesh to FSDP.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121360
Approved by: https://github.com/yifuwang, https://github.com/Skylion007
ghstack dependencies: #120351, #121328
2024-03-07 08:09:25 +00:00
Wanchao Liang
1a28ebffb3 [TP] Introduce Sequence Parallel Style for Laynorm/RMSNorm/Dropout (#121295)
As titled, this PR introduces a dedicated `ParallelStyle` to shard the
nn.LayerNorm/nn.Dropout/RMSNorm layers. We were mainly using a manual
distribute_module calls before when sharding the RMSNorm layer, but I
think we should have a dedicate TP API to easily shard those layers,
instead of user manually using DTensors.

I call this SequenceParallel, which might bring some confusion that we
technically "deprecated" a SequenceParallel style months ago. But this
time the SeuqenceParallel style is significantly different with the
previous ones (which used to shard two consecutive Linear layers). I
believe making it the right name is the first priority, instead of
worrying about the issue of reusing the old name

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121295
Approved by: https://github.com/awgu, https://github.com/tianyu-l
ghstack dependencies: #121294
2024-03-07 02:04:59 +00:00
Wanchao Liang
a88356f45c [dtensor] make add_.Tensor/div_.Scalar to be linear pointwise instead (#121294)
add_.Tensor and div_.Scalar should support linearity so that we delay the partial
results.

This fixes the additional collective in the layernorm layer that we seen

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121294
Approved by: https://github.com/tianyu-l
2024-03-06 22:52:18 +00:00
Andrew Gu
372f192050 [DTensor] Initialized RNG tracker if needed (#121328)
Since we are already checking if the RNG tracker is initialized, there is no real performance difference between erroring vs. just initializing a default RNG tracker (which we choose to be the `OffsetBasedRNGTracker`).

```
pytest test/distributed/_composable/fsdp/test_fully_shard_init.py -k test_meta
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121328
Approved by: https://github.com/wanchaol
ghstack dependencies: #120351
2024-03-06 22:21:44 +00:00
Yifu Wang
d7a5e59647 [dynamo] support group=None when rewriting collectives (#121043)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121043
Approved by: https://github.com/awgu
2024-03-06 21:37:19 +00:00
Andrew Gu
e865700f6a [FSDP2] Added initial meta-device init support (#120351)
This PR adds initial support for meta-device initialization for pre-training without loading from a state dict. The idea is to allow `fully_shard(module)` to return and still have sharded parameters on meta device. Then, the user is free to initialize them as they please, e.g. using `to_empty()`.

We override `_apply` to achieve the following:
- Reshard the parameters to ensure that sharded parameters are registered (for correctness) -- we will always need this
- Pad new local tensors and use the padded local tensors (to handle uneven sharding) -- we will remove this once `DTensor` pads its local tensor

We use the `swap_tensors` path in `_apply`. For now, this requires setting `torch.__future__.set_swap_module_params_on_conversion(True)`; however, in the future, this may be enabled by default for wrapper subclasses and will not need any explicit API call. If requiring this call is too intrusive in the short term, we can also call it in `_apply` or when importing `fully_shard`.

```
# Pre-training flow (no checkpoint)
global_mesh = init_device_mesh(..., mesh_dim_names=("dp", "tp"))
dp_mesh, tp_mesh = global_mesh["dp"], global_mesh["tp"]
with torch.device("meta"):
  model = ...
  parallelize_module(model, tp_mesh, ...)
  fully_shard(model, mesh=dp_mesh, ...)
for param in model.parameters():
  assert param.device.type == "meta"

model.to_empty(device="cuda")
random.manual_seed(42, global_mesh)
for module in model.modules():
  if hasattr(module, "reset_parameters"):
    module.reset_parameters()
```

This PR includes some minor changes to allow the user to similarly cast the module to a different dtype after construction time but before forward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120351
Approved by: https://github.com/wanchaol
2024-03-06 21:18:25 +00:00
Kurman Karabukaev
360761f7d0 [Torchelasic] Create root log directory by default (#121257)
Summary:
After refactoring in https://github.com/pytorch/pytorch/pull/120691, default behavior unintentionally was changes from creating tempdir for logging to not capturing any logs by torch Elastic Agent.

Reverting the behavior to:
- making tempdir when log dir is not specified
- allowing non-empty root log dir
    - Note: in case attempt folder exists, it will be pruned here: https://github.com/pytorch/pytorch/blob/main/torch/distributed/elastic/multiprocessing/api.py#L294

Differential Revision: D54531851

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121257
Approved by: https://github.com/d4l3k
2024-03-06 18:50:38 +00:00
PyTorch MergeBot
b529c19bdf Revert "Batch Norm Consolidation (#116092)"
This reverts commit 5680f565d5.

Reverted https://github.com/pytorch/pytorch/pull/116092 on behalf of https://github.com/jeffdaily due to broke ROCm, PR signal was clean but trunk was not, the merge should have been blocked but wasn't ([comment](https://github.com/pytorch/pytorch/pull/116092#issuecomment-1981373237))
2024-03-06 17:10:01 +00:00
Tugsbayasgalan Manlaibaatar
5680f565d5 Batch Norm Consolidation (#116092)
**Summary:**

This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:

```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
  aten.native_batch_norm ->
  aten._native_batch_norm_legit (export only) ->
  _batch_norm_legit_cpu/cuda (kernels, export only) ->
  _batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```

Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.

Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:

```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```

The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:

```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```

Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2 week FC
window, and the ops used in the old stack is planned to
be removed after the 6 month BC window.

Test Plan: `OpInfo` tests for `batch_norm_with_update`.

Reviewers: albanD, bdhirsh

Subscribers: albanD, bdhirsh, supriyar

Tasks: https://github.com/pytorch/pytorch/issues/111384

Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2024-03-06 04:50:46 +00:00
Chien-Chin Huang
5abf7972d1 [DCP][state_dict] Implement pin_memory and shared_memory copy for _offload_state_dict_to_cpu (#120378)
**Summary**
This PR extend `_offload_state_dict_to_cpu` to accept a `cpu_offload_state_dict` argument. If `cpu_offload_state_dict` is not None, `_offload_state_dict_to_cpu` will use `copy_` to copy the GPU data to the CPU tensors. This allows users to pass a pin_memory or share_memory version of `cpu_offload_state_dict`.

This PR also adds `_create_cpu_state_dict` to allow users to easily create a pin_memory or share_memory cpu state_dict.

**Performance improvement**
```
# The micro-benchmark has a source state_dict with 150 tensors, and each tensor is 50MB.
# The micro-benchmark is run on a H100 machine with PCIe 5

cpu_state_dict_2 = _create_cpu_state_dict(state_dict, pin_memory=True)
cpu_state_dict_3 = _create_cpu_state_dict(state_dict, share_memory=True)

# GPU->CPU memory: 4.6556 seconds
cpu_state_dict = _offload_state_dict_to_cpu(state_dict)

# GPU->pin memory: 0.1566 seconds
_offload_state_dict_to_cpu(state_dict, cpu_offload_state_dict=cpu_state_dict_2)

# GPU->shared memory: 0.5509 seconds (variation is quite large)
_offload_state_dict_to_cpu(state_dict, cpu_offload_state_dict=cpu_state_dict_3)

# GPU->pin memory->shared memory: 0.2550 seconds
_offload_state_dict_to_cpu(state_dict, cpu_offload_state_dict=cpu_state_dict_2)
_offload_state_dict_to_cpu(cpu_state_dict_2, cpu_offload_state_dict=cpu_state_dict_3)
```

Differential Revision: [D54045845](https://our.internmc.facebook.com/intern/diff/D54045845/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120378
Approved by: https://github.com/LucasLLC
2024-03-05 17:48:15 +00:00
wz337
de8af28083 [FSDP][StateDict] Allow FULL_STATE_DICT option for 2D (#120837)
Fixes #120722

TL;DR for the issue:
As users are expected to use get_model_state_dict to do state_dict retrieval, I think it's fine to remove the warning and RuntimeError.
More context in #120722.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120837
Approved by: https://github.com/Skylion007
2024-03-05 10:03:44 +00:00
Wanchao Liang
2e50566722 [dtensor] change distribute_module input/output_fn to accept module (#120895)
This is a BC breaking change to distribute_module. The underlying rationle
for this change is that sometimes in the input_fn/output_fn, user would want
to access to the current module for some attributes. This might not be
common enough, but in some cases it's worth to access to the module.

An outstanding use case we want to support is float8, if we want to make
float8 works with the TP API, the input_fn/output_fn of TP parallel
styles would need to get access to the module, where the module might
encapsulates `dynamic_linear.emulate` attribute, that is useful for
input/output casting

Since this is needed for fp8 and DTensor still under prototype release,
I feel it's worth the change and it's better we make the change as
early.

Right now making it a soft BC breaking, which means we maintain BC still
but throw deprecation messages.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120895
Approved by: https://github.com/tianyu-l
2024-03-04 07:22:32 +00:00
Lucas Pasqualin
7aced61c46 [DCP] deletes legacy formatting test (#120127)
Should no longer be necessary

Differential Revision: [D53791345](https://our.internmc.facebook.com/intern/diff/D53791345/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120127
Approved by: https://github.com/fegin
ghstack dependencies: #119816
2024-03-02 22:04:39 +00:00
IvanKobzarev
bab4b5a341 [dist][sharded_tensor] Fix ChunkShardingSpec metadata offsets for empty shards (#121002)
ChunkShardingSpec generated metadata where offsets exceed the tensor size.

Example:

Torchrec prepared ShardedTensorMetadata:
```
ShardedTensorMetadata(shards_metadata=[
ShardMetadata(shard_offsets=[0, 0], shard_sizes=[2, 512], placement=rank:0/cuda:0),
ShardMetadata(shard_offsets=[2, 0], shard_sizes=[2, 512], placement=rank:1/cuda:1),
ShardMetadata(shard_offsets=[4, 0], shard_sizes=[2, 512], placement=rank:2/cuda:2),
ShardMetadata(shard_offsets=[6, 0], shard_sizes=[2, 512], placement=rank:3/cuda:3),
ShardMetadata(shard_offsets=[8, 0], shard_sizes=[2, 512], placement=rank:4/cuda:4),
ShardMetadata(shard_offsets=[10, 0], shard_sizes=[0, 512], placement=rank:5/cuda:5),
ShardMetadata(shard_offsets=[10, 0], shard_sizes=[0, 512], placement=rank:6/cuda:6)
],
size=torch.Size([10, 512]
),
```
Calling ShardedTensor._init_from_local_shards_and_global_metadata()
ShardedTensor ShardingSpec builds metadata

```
ShardedTensorMetadata(shards_metadata=[
ShardMetadata(shard_offsets=[0, 0], shard_sizes=[2, 512], placement=rank:0/cuda:0),
ShardMetadata(shard_offsets=[2, 0], shard_sizes=[2, 512], placement=rank:1/cuda:1),
ShardMetadata(shard_offsets=[4, 0], shard_sizes=[2, 512], placement=rank:2/cuda:2),
ShardMetadata(shard_offsets=[6, 0], shard_sizes=[2, 512], placement=rank:3/cuda:3),
ShardMetadata(shard_offsets=[8, 0], shard_sizes=[2, 512], placement=rank:4/cuda:4),
ShardMetadata(shard_offsets=[10, 0], shard_sizes=[0, 512], placement=rank:5/cuda:5),
ShardMetadata(shard_offsets=[12, 0], shard_sizes=[0, 512], placement=rank:6/cuda:6)
],
size=torch.Size([10, 512]), tensor_properties=TensorProperties(dtype=torch.float16, layout=torch.strided, requires_grad=False, memory_format=torch.contiguous_format, pin_memory=False))
```
The deduced ChunkShardingSpec:
```
ChunkShardingSpec(dim=0, placements=[rank:0/cuda:0, rank:1/cuda:1, rank:2/cuda:2, rank:3/cuda:3, rank:4/cuda:4, rank:5/cuda:5, rank:6/cuda:6])
```

The fix is to limit offsets by dim size.

Differential Revision: [D54419513](https://our.internmc.facebook.com/intern/diff/D54419513)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121002
Approved by: https://github.com/wz337
2024-03-02 08:58:48 +00:00
Kurman Karabukaev
b0cfa96e82 [Torchelastic][Logging] Pluggable logsspecs using python entrypoints and option to specify one by name. (#120942)
Summary:
Expose an option to users to specify name of the LogsSpec implementation to use.
- Has to be defined in entrypoints under `torchrun.logs_specs` group.
- Must implement LogsSpec defined in prior PR/diff.

Test Plan: unit test+local tests

Reviewed By: ezyang

Differential Revision: D54180838

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120942
Approved by: https://github.com/ezyang
2024-03-02 08:07:52 +00:00
Tianyu Liu
af5376c444 [dtensor] add support for loss parallel (#119877)
Loss parallel is the last piece of sequence parallelism to enable. It enables efficient distributed cross entropy computation when the input is sharded on the class dimension (in a classification problem with many classes). The implementation is via a context manager `loss_parallel`, after enabling which users can directly use `torch.nn.functional.cross_entropy` or `torch.nn.CrossEntropyLoss` without modifying other parts of their code.

Here are the underlying rationales why we are going through these op replacements:

1. `nn.functional.cross_entropy` is the common method that OSS user is using for things like transformer training, to avoid changing user code, we want user to still use this function for loss calculation if they are already using it.
2. `nn.functional.cross_entropy` boils down into `aten.log_softmax` and `aten.nll_loss_foward/backward`, and DTensor now supports those ops already (#117723 #119255 #118917 #119256). They are doing computation with input *replicated* on the class dimension.
3. However when the input of this loss calculation is **sharded on the class dimension**, to run sharded computation efficiently, we need to run both `aten.log_softmax` and `aten.nll_loss_foward` with multiple all-reduce collectives **in the middle of** those aten ops. This is not possible if we are just overriding these two ops, so we need to have some way to **decompose** these two ops into smaller ops to have collectives run in the middle of these two ops.
4. We explored the existing decompositions (#118950). It seems working, except that `log_softmax_backward` and `nll_loss_backward` combined together in aten are implemented in a inefficient way, which would trigger an additional expensive collective. Recently some user also reported similar issues https://github.com/pytorch/pytorch/issues/119261.
5. Therefore, currently we are doing our own decomposition inside a context manager for sequence parallelism specifically. Once we have a better decomposition in core, we can possibly take that instead of reinventing the wheels here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119877
Approved by: https://github.com/wanchaol
2024-03-02 05:06:26 +00:00
Yifu Wang
f7a2bae0ac Change TestOpWaitiness to use MultiProcessTestCase (#121046)
The test has been failing sporadically rencetly in CI and the failures
are not reproducible locally, likely due to some nasty race conditional
related a combination of MultiThreadedTestCase, the use of global state
and finalizers, and the recently introduced test decorator for native
funcol migration.

Switching to the test to use MultiProcessTestCase to provide better
isolation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121046
Approved by: https://github.com/weifengpy
2024-03-02 01:12:14 +00:00
Andrew Gu
4cf6d1172b [FSDP2] Used ReduceOp.AVG if fp32 reduce-scatter (#120919)
This PR uses `ncclAvg` op (via `ReduceOp.AVG`) if doing fp32 reduce-scatter. This allows the division by world size to happen in the reduce-scatter kernel itself, which seems to save extra memory read/write for dividing. This yields ~1.5% speedup on the Llama-7B workload (and makes per-parameter FSDP faster than flat-parameter FSDP 😅 ).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120919
Approved by: https://github.com/yifuwang, https://github.com/wanchaol
ghstack dependencies: #120238, #120910
2024-03-02 00:39:16 +00:00
Andrew Gu
7c71d7f32b [DTensor] Supported foreach=True for clip_grad_norm_ (#120910)
This PR adds support for `clip_grad_norm_(foreach=True)` by implementing `aten._foreach_norm.Scalar` and `aten._foreach_mul_.Tensor`. `foreach=True` is required to get competitive performance with `DTensor`.

`foreach=True` reduces CPU overhead for Llama-7B from 388 ms to 63 ms. Existing flat-parameter FSDP's `clip_grad_norm_` takes 3 ms on CPU 😢 .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120910
Approved by: https://github.com/wanchaol, https://github.com/janeyx99
ghstack dependencies: #120238
2024-03-02 00:28:09 +00:00
Andrew Gu
f0e8e7cf43 [DTensor] Supported foreach=False for clip_grad_norm_ (#120238)
This PR adds `DTensor` support for `aten.linalg_vector_norm.default` and `aten.stack.default` so that we can run `clip_grad_norm_` (with `foreach=False`).

To implement `linalg_vector_norm`, we introduce a `_NormPartial` placement since the reduction op for norm is the norm itself.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120238
Approved by: https://github.com/wanchaol
2024-03-02 00:25:16 +00:00
Shuqiang Zhang
c8e56b4965 [c10d] dump from one and only one thread (PG0's monitor thread) (#120893)
Summary:
When there are multiple PGs in a process and a hardware failure happens,
we found that multiple PGs/ threads in the same
process are competing to dump the same records at the same time. The
affects the reliability of dumps.

In this PR, we will try to make the change such that only one thread/PG
could dump: PG0's monitor thread. We use a static variable to indicate
that something (e.g., collective timeout) has triggered the dump
locally.

monitor thread would dump debug info under any one of the 3 conditions:
1: this static variable is set to true by the watchdog thread when it detects
a timeout or pipe dump signal
2: timeout signal is received from other ranks through tcpstore
3: no heartbeat of watchdog
Test Plan:
python test/distributed/test_c10d_nccl.py -k
test_timeout_dumps_on_stuck_ranks

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120893
Approved by: https://github.com/wconstab
2024-03-02 00:13:13 +00:00
Will Constable
581fe26792 [C10D] Add ProcessGroup op_id to track ops inside coalescing region (#120745)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120745
Approved by: https://github.com/zdevito
2024-03-01 23:45:43 +00:00
Lucas Pasqualin
9d5dea7812 [DCP] Adds storage reader and planner classes for online loading/sharding of models in torch.save format (#119816)
as title

Differential Revision: [D53718041](https://our.internmc.facebook.com/intern/diff/D53718041/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119816
Approved by: https://github.com/fegin
2024-03-01 00:21:05 +00:00
Elias Ellison
d03b11ad5b Pass inductor strides forward in ddp optimizer (#120523)
# Note: Returning Fake Tensors on First AOT Autograd Call
            #
            # Inductor will optimize strides of outputs when it deems it profitable.
            # For instance, converting to channels last. When we split the graph here
            # into multiple inductor compilations, we need to make sure that the
            # output strides of one compilation is appropriately passed to the subsequent
            # compilations. However, the mapping from inductor output to dynamo output
            # is non-trivial due to aot_autograd's deduping, de-aliasing, mutation, re-writing,
            # subclass handling, etc. In order to replay all this logic we set a flag such that
            # the first invocation of inductor in aot_autograd will return Fake Tensors with
            # appropriate strides. Then, all of aot autograd's runtime logic is replayed.
            # This gives us the appropriately strided outputs here which will reflect runtime strides.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120523
Approved by: https://github.com/yf225, https://github.com/bdhirsh
2024-02-29 22:25:00 +00:00
PyTorch MergeBot
76d3a6bb4a Revert "[C10D] Add ProcessGroup op_id to track ops inside coalescing region (#120745)"
This reverts commit 381a7ad3f1.

Reverted https://github.com/pytorch/pytorch/pull/120745 on behalf of https://github.com/kit1980 due to The new test fails internally, see D54343421 ([comment](https://github.com/pytorch/pytorch/pull/120745#issuecomment-1972047106))
2024-02-29 22:06:13 +00:00