Commit Graph

27 Commits

Author SHA1 Message Date
Will Constable
9fb9219478 Make DDPOptimizer work with torch._dynamo.explain() (#94749)
GraphModules that were created during DDPOptimizer graph breaking
lacked `compile_subgraph_reason`, which caused an exception when
running .explain().

Now the reason is provided and users can use .explain() to find out
that DDPOptimizer is causing graph breaks.

Fixes #94579

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94749
Approved by: https://github.com/voznesenskym
2023-02-14 01:33:47 +00:00
Xuehai Pan
046e88a291 [BE] [3/3] Rewrite super() calls in test (#94592)
Rewrite calls to the Python built-in `super()` to the zero-argument form. Only non-semantic changes are applied.

- #94587
- #94588
- #94592

Also, methods with only a `super()` call are removed:

```diff
class MyModule(nn.Module):
-   def __init__(self):
-       super().__init__()
-
    def forward(self, ...):
        ...
```
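
As an illustration of why the rewrite is non-semantic (a minimal sketch, not code from the PR): in Python 3 the legacy two-argument form and the zero-argument form resolve to the same base-class call.

```python
class Base:
    def __init__(self):
        self.initialized_by = "Base"

class OldStyle(Base):
    def __init__(self):
        # legacy two-argument form, common in Python-2-era code
        super(OldStyle, self).__init__()

class NewStyle(Base):
    def __init__(self):
        # zero-argument form; Python 3 supplies the class and instance
        super().__init__()

# both forms run Base.__init__
print(OldStyle().initialized_by, NewStyle().initialized_by)  # Base Base
```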

Cases where the rewrite would change the semantics are left unchanged. E.g.:

f152a79be9/caffe2/python/net_printer.py (L184-L190)

f152a79be9/test/test_jit_fuser_te.py (L2628-L2635)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94592
Approved by: https://github.com/ezyang, https://github.com/seemethere
2023-02-12 22:20:53 +00:00
Jason Ansel
2b0d7e63f0 Move dynamo.optimizations.distributed to backends (#93408)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93408
Approved by: https://github.com/wconstab
2023-02-02 20:42:17 +00:00
Will Constable
ac791bddce Refactor dynamo distributed test helpers to be reusable (#93187)
The point is to let test helpers previously defined and used in `test_dynamo_distributed.py` be reused from a new file, `test_traceable_collectives.py`, later in this stack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93187
Approved by: https://github.com/kumpera
2023-02-01 06:09:42 +00:00
Will Constable
648202ceb9 Improve DDPOptimizer by avoiding small preamble graph (#93162)
This optimizes an edge case where some compute-only ops (e.g. add)
could end up in an orphan graph at the input side due to the bucket
for the next graph being full already.  The fix is to fuse this
graph (which is "empty" in parameter count) together with the adjoining
"full" bucket.
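
The fusion rule can be sketched in plain Python (hypothetical helper, not the actual DDPOptimizer splitting code):

```python
def split_buckets(param_sizes, bucket_cap):
    """Greedily bucket parameter byte-counts, walking from the output side
    toward the input side (the order DDPOptimizer traverses)."""
    buckets, size = [[]], 0
    for s in param_sizes:
        if size >= bucket_cap:      # current bucket is already full
            buckets.append([])
            size = 0
        buckets[-1].append(s)
        size += s
    # the fix: a trailing input-side bucket holding zero parameter bytes
    # (compute-only ops) is fused into the adjoining full bucket rather
    # than left as an orphan graph
    if len(buckets) > 1 and sum(buckets[-1]) == 0:
        buckets[-2].extend(buckets.pop())
    return buckets

# compute-only ops (size 0) at the input side no longer form their own graph
print(split_buckets([100, 50, 0, 0], bucket_cap=120))  # [[100, 50, 0, 0]]
```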

Note: I encountered this when trying to repro some suspected duplicate
argument errors, but this is unrelated and I have not yet repro'd
a duplicate arg issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93162
Approved by: https://github.com/davidberard98
2023-01-28 15:33:53 +00:00
Will Constable
5441f2c067 Fix DDPOptimizer fake_mode execution (#92986)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

* __->__ #92986

When running compiled submods for the purpose of producing outputs to pass
to the compilation step for the next submod, we use fake parameters and
assume fake inputs, but we forgot to activate our fake_mode during execution.

This caused certain edge cases where tensors other than activations or parameters
got created during execution, such as scalar->tensor expansion in the case
of executing torch.where(tensor, scalar, scalar).

Also add a test and clarify behavior of DDPOptimizer via comments.

Fixes #92941

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92986
Approved by: https://github.com/bdhirsh
2023-01-26 00:37:54 +00:00
Edward Z. Yang
6dcc214ac2 Fix AssertionError fake_mode is not None in distributed (#90392)
Fixes https://github.com/pytorch/pytorch/issues/90375

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90392
Approved by: https://github.com/voznesenskym
2022-12-07 20:12:39 +00:00
Ram Rachum
351d73b97f Fix exception causes all over the codebase (#90271)
This is the continuation to #90134 and hopefully the final PR in this series.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90271
Approved by: https://github.com/kit1980
2022-12-07 04:29:00 +00:00
Will Constable
705ad36cc5 Dynamo asserts FSDP wrapped modules use_orig_param (#89523)
- This is a strict requirement given the way dynamo+FSDP is implemented,
  but isn't convenient to assert.
- By plumbing the use_orig_param field onto all wrapped modules, we can
  do this assertion inside dynamo
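
The plumbing idea, sketched generically (hypothetical names, not the actual FSDP/dynamo code): the wrapper records the flag on every wrapped module so a later compilation pass can assert on it.

```python
class WrappedModule:
    """Stand-in for an FSDP-wrapped module carrying the plumbed flag."""
    def __init__(self, inner, use_orig_params):
        self.inner = inner
        self._use_orig_params = use_orig_params  # plumbed for later assertion

def compile_entry(module):
    """Stand-in for the dynamo-side check the plumbing makes possible."""
    if not getattr(module, "_use_orig_params", False):
        raise AssertionError("dynamo+FSDP requires use_orig_params=True")
    return module.inner

print(compile_entry(WrappedModule("net", use_orig_params=True)))  # net
```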

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89523
Approved by: https://github.com/awgu
2022-11-29 05:27:23 +00:00
Edward Z. Yang
95563b3eda Reland "Add single process version of dynamo distributed hf_Bert tests (#89721)" (#89756)
This reverts commit 0d9a615af4.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89756
Approved by: https://github.com/anjali411, https://github.com/malfet
2022-11-28 19:15:03 +00:00
PyTorch MergeBot
0d9a615af4 Revert "Add single process version of dynamo distributed hf_Bert tests (#89721)"
This reverts commit 1a2dd6b15e.

Reverted https://github.com/pytorch/pytorch/pull/89721 on behalf of https://github.com/ezyang due to this broke inductor_distributed job
2022-11-28 14:56:54 +00:00
Edward Z. Yang
49eb43fc45 Don't modify log level in dynamo distributed test (#89655)
Let the developer decide!

Taken from voz's https://github.com/pytorch/pytorch/pull/89392

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89655
Approved by: https://github.com/albanD
2022-11-28 14:47:52 +00:00
Edward Z. Yang
1a2dd6b15e Add single process version of dynamo distributed hf_Bert tests (#89721)
It's a lot easier to debug problems in the Dynamo optimization pass if
you aren't actually triggering a multiprocessing run.  Keep these tests
around.

I think the other tests can probably get this treatment too, leaving
this to future work.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89721
Approved by: https://github.com/voznesenskym
2022-11-28 03:16:47 +00:00
David Berard
304b5de1b0 Re-enable test_hf_bert_fsdp (#89223)
It looks like this failure was actually caused by https://github.com/pytorch/pytorch/pull/88629, see the revert message on that PR. It probably just looked like a flaky test on CI because of how quickly the PR was reverted.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89223
Approved by: https://github.com/voznesenskym
2022-11-18 21:40:27 +00:00
Michael Voznesensky
06ce1338bc [dynamo] Port all pytorch/dynamo and test/dynamo pieces over from symbolic-shapes branch (#88768)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88768
Approved by: https://github.com/jansel, https://github.com/ezyang
2022-11-13 04:50:21 +00:00
Will Constable
a3f3ec8fac [FSDP+dynamo]: forward treats parameter-views as params (#88781)
Dynamo+AotAutograd needs a way to wrap all tensors (whether
inputs or params/buffers) in FakeTensor wrappers, and
FSDP's mangling of parameters hides them from this wrapping.

This PR unblocks running hf_bert and hf_T5 with FSDP under dynamo, whether using recursive wrapping around transformer layers or only applying FSDP around the whole model.  Perf/memory validation and possibly optimization is the next step.
`python benchmarks/dynamo/distributed.py --torchbench_model hf_Bert --fsdp --dynamo aot_eager`
`python benchmarks/dynamo/distributed.py --torchbench_model hf_Bert --fsdp --dynamo aot_eager --fsdp_wrap`
`python benchmarks/dynamo/distributed.py --torchbench_model hf_T5 --fsdp --dynamo aot_eager`
`python benchmarks/dynamo/distributed.py --torchbench_model hf_T5 --fsdp --dynamo aot_eager --fsdp_wrap`

The problem:
Dynamo (actually aot_autograd) trips up with FSDP because it must
wrap all input tensors in FakeTensor wrappers, and it only knows
to wrap graph inputs or named_(parameters, buffers).  FSDP's
pre_forward hook sets views (which are not nn.Parameters) into the flatparam
as attrs on the module with the same name as the original param, but
they will not show up in named_parameters.

- in use_orig_params mode, FSDP still de-registers
  params during pre-forward hook, then re-registers them
  post-forward
- during forward (between the hooks), the params are setattr'd
  on the module as regular view tensors, not nn.Parameters
- note: use_orig_params is the recommended way to use FSDP,
  and use_orig_params=False is being deprecated.  So I only consider
  use_orig_params=True for this enablement

The solution:
- adding them to named_buffers is not possible because it interferes
  with how FSDP's `_apply` works
- since they are not actual nn.Parameters, register_parameter will
  complain about registering them
- simply setting `module._parameters[name] = view` seems to be a viable
  workaround, despite being hacky, and FSDP code already modifies _parameters
  directly
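
The workaround can be illustrated with a minimal stand-in for nn.Module's parameter bookkeeping (purely illustrative, not torch code): a plain setattr of the view is invisible to parameter enumeration, while writing into _parameters makes it show up.

```python
class MiniModule:
    """Minimal stand-in: named_parameters only consults _parameters."""
    def __init__(self):
        self._parameters = {}

    def named_parameters(self):
        return dict(self._parameters)

mod = MiniModule()

# FSDP's pre-forward hook effectively does a plain setattr of a view;
# parameter enumeration never sees it, so it would never be fakified
mod.weight = "flatparam_view"
assert "weight" not in mod.named_parameters()

# the hacky-but-viable workaround: write directly into _parameters
mod._parameters["weight"] = mod.weight
assert mod.named_parameters() == {"weight": "flatparam_view"}
```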

Note: Manual checkpointing still isn't working with FSDP+dynamo,
so that will have to be addressed in a follow up.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88781
Approved by: https://github.com/ezyang, https://github.com/awgu
2022-11-12 01:17:23 +00:00
Will Constable
3fd0729bb6 DDPOptimizer replace debug=True/False with using torchdynamo logger (#88480)
Example output:

```
[2022-11-04 05:09:29,525] torch._dynamo.optimizations.distributed: [INFO]
DDPOptimizer bucket assignments
┌─────────┬────────────┬───────────────────┐
│   Index │   Size (b) │ Param Names       │
├─────────┼────────────┼───────────────────┤
│       0 │  100120020 │ self_net_6_weight │
├─────────┼────────────┼───────────────────┤
│         │            │ self_net_6_bias   │
├─────────┼────────────┼───────────────────┤
│         │            │ self_net_4_weight │
├─────────┼────────────┼───────────────────┤
│         │            │ self_net_4_bias   │
├─────────┼────────────┼───────────────────┤
│       1 │  100020000 │ self_net_2_weight │
├─────────┼────────────┼───────────────────┤
│         │            │ self_net_2_bias   │
├─────────┼────────────┼───────────────────┤
│       2 │     220000 │ self_net_0_weight │
├─────────┼────────────┼───────────────────┤
│         │            │ self_net_0_bias   │
└─────────┴────────────┴───────────────────┘
[2022-11-04 05:09:29,527] torch._dynamo.optimizations.distributed: [DEBUG]
---orig graph---
graph():
    %inputs : torch.Tensor [#users=1] = placeholder[target=inputs]
    %self_net_0 : [#users=1] = call_module[target=self_net_0](args = (%inputs,), kwargs = {})
    %self_net_1 : [#users=1] = call_module[target=self_net_1](args = (%self_net_0,), kwargs = {})
    %self_net_2 : [#users=1] = call_module[target=self_net_2](args = (%self_net_1,), kwargs = {})
    %self_net_3 : [#users=1] = call_module[target=self_net_3](args = (%self_net_2,), kwargs = {})
    %self_net_4 : [#users=1] = call_module[target=self_net_4](args = (%self_net_3,), kwargs = {})
    %self_net_5 : [#users=1] = call_module[target=self_net_5](args = (%self_net_4,), kwargs = {})
    %self_net_6 : [#users=1] = call_module[target=self_net_6](args = (%self_net_5,), kwargs = {})
    %self_net_7 : [#users=1] = call_module[target=self_net_7](args = (%self_net_6,), kwargs = {})
    return (self_net_7,)

---split graph---
graph():
    %inputs : torch.Tensor [#users=1] = placeholder[target=inputs]
    %submod_0 : [#users=1] = call_module[target=submod_0](args = (%inputs,), kwargs = {})
    %submod_1 : [#users=1] = call_module[target=submod_1](args = (%submod_0,), kwargs = {})
    %submod_2 : [#users=1] = call_module[target=submod_2](args = (%submod_1,), kwargs = {})
    return (submod_2,)

---submod_0 graph---
graph():
    %inputs : [#users=1] = placeholder[target=inputs]
    %self_net_0 : [#users=1] = call_module[target=self_net_0](args = (%inputs,), kwargs = {})
    %self_net_1 : [#users=1] = call_module[target=self_net_1](args = (%self_net_0,), kwargs = {})
    return self_net_1

---submod_1 graph---
graph():
    %self_net_1 : [#users=1] = placeholder[target=self_net_1]
    %self_net_2 : [#users=1] = call_module[target=self_net_2](args = (%self_net_1,), kwargs = {})
    %self_net_3 : [#users=1] = call_module[target=self_net_3](args = (%self_net_2,), kwargs = {})
    return self_net_3

---submod_2 graph---
graph():
    %self_net_3 : [#users=1] = placeholder[target=self_net_3]
    %self_net_4 : [#users=1] = call_module[target=self_net_4](args = (%self_net_3,), kwargs = {})
    %self_net_5 : [#users=1] = call_module[target=self_net_5](args = (%self_net_4,), kwargs = {})
    %self_net_6 : [#users=1] = call_module[target=self_net_6](args = (%self_net_5,), kwargs = {})
    %self_net_7 : [#users=1] = call_module[target=self_net_7](args = (%self_net_6,), kwargs = {})
    return self_net_7

---------------
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88480
Approved by: https://github.com/anj-s, https://github.com/davidberard98
2022-11-05 02:40:51 +00:00
Will Constable
678d038001 Support DDP ignored parameters in DDPOptimizer (#88460)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88460
Approved by: https://github.com/aazzolini
2022-11-04 21:42:15 +00:00
Will Constable
70b00b1383 Add hf_bert + DDP multigpu test (#88435)
Spot-checks an e2e model working with ddp.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88435
Approved by: https://github.com/davidberard98
2022-11-04 03:17:48 +00:00
Will Constable
a51da28551 Support multi-gpu CI for inductor-distributed (#87996)
This test by itself isn't the end goal; it is a minimal test that exercises multi-GPU, and the focus of the PR is the infra behind enabling that.  I'll follow up with more tests using actual models etc.

cc @malfet @desertfire for awareness/feedback on the infra side
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87996
Approved by: https://github.com/aazzolini
2022-11-02 03:52:20 +00:00
Will Constable
82a9de16d4 Change dynamo/distributed tests to use cuda/nccl (#88133)
- FSDP tests require nccl
- also run in inductor shard and skip inductor in distributed shard
- inductor shard has newer GPU and supports triton/inductor, but only runs on trunk
- distributed shard runs on PR, but inductor shard only runs on trunk/opt-in

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88133
Approved by: https://github.com/davidberard98
2022-11-01 15:35:44 +00:00
Will Constable
91c95ff7c5 Enable graph_split_inductor test as it runs now (#87762)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87762
Approved by: https://github.com/davidberard98
2022-10-26 22:06:03 +00:00
Will Constable
aa66c6e01e Fix missing weight init and clean up helper (#87760)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87760
Approved by: https://github.com/davidberard98
2022-10-26 19:29:35 +00:00
Will Constable
233305a852 Improvements for DDP Optimizer (#87549)
- adds support for 'first_bucket_cap' arg, to align bucketing more precisely
  with DDP, which may start a smaller first bucket
- refactors the bucket splitting logic to be cleaner
- adds pretty-print for bucket info, and a way to access bucket info
  from the DDPOptimizer class from a test case or benchmark
- dumps debug logs to stdout
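
The first_bucket_cap behavior can be sketched as follows (hypothetical function; the real splitting logic lives in DDPOptimizer):

```python
def assign_buckets(param_sizes, bucket_cap, first_bucket_cap=None):
    """Greedy bucketing where only the first bucket may use a smaller cap,
    mirroring DDP's smaller initial bucket."""
    cap = first_bucket_cap if first_bucket_cap is not None else bucket_cap
    buckets, current, size = [], [], 0
    for s in param_sizes:
        if current and size + s > cap:
            buckets.append(current)
            current, size = [], 0
            cap = bucket_cap  # every later bucket uses the full cap
        current.append(s)
        size += s
    if current:
        buckets.append(current)
    return buckets

# with a smaller first cap, the first bucket closes early, as DDP's does
print(assign_buckets([10, 10, 10, 10], bucket_cap=20, first_bucket_cap=10))
# [[10], [10, 10], [10]]
```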

cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87549
Approved by: https://github.com/soumith
2022-10-24 03:40:43 +00:00
PyTorch MergeBot
0ef0a78196 Revert "Improvements for DDP Optimizer (#87525)"
This reverts commit cf693a02e0.

Reverted https://github.com/pytorch/pytorch/pull/87525 on behalf of https://github.com/ZainRizvi due to The macos error messages look like they were indeed caused by this PR
2022-10-22 04:51:33 +00:00
Will Constable
cf693a02e0 Improvements for DDP Optimizer (#87525)
- adds support for 'first_bucket_cap' arg, to align bucketing more precisely
  with DDP, which may start a smaller first bucket
- refactors the bucket splitting logic to be cleaner
- adds pretty-print for bucket info, and a way to access bucket info
  from the DDPOptimizer class from a test case or benchmark
- dumps debug logs to stdout

cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87525
Approved by: https://github.com/davidberard98
2022-10-22 03:44:12 +00:00
Will Constable
b18fadae88 Re-enable dynamo ddp tests (#87524)
- Move dynamo dist tests to another shard
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87524
Approved by: https://github.com/davidberard98
2022-10-22 03:29:02 +00:00