Add similar semantics for creating a buffer object similar to creating a parameter. This is done by introducing a new Buffer class that can be used for type disambiguation. The underlying functionality of registering a buffer remains the same as the register_buffer method has not been changed. The persistent parameter in the Buffer type is to indicate whether a buffer object should be persistent or not. Other non-test changes have to do with getting the new Buffer type recognized by inductor and dynamo. Remaining changes are test changes to make sure that the Buffer type can be used as a drop in replacement for register_buffer as it just leads to register_buffer being called. The addition of this new functionality still allows for normal tensors to be used as buffers so these changes are intended to be backwards compatible.
Fixes#35735
Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125971
Approved by: https://github.com/albanD, https://github.com/anijain2305, https://github.com/mlazos
This PR implements an opt-in configuration option for synchronizing compilation across all ranks at the end of Dynamo tracing (and potentially, other places in the future). There are two pieces to this PR:
1. Implementing infrastructure for compiler collectives (DistributedState/LocalState, the actual collective)
2. Using this infrastructure to synchronize automatic dynamic choices across all ranks
The infrastructure in part one can be used for other purposes, just add more (serializable) fields to LocalState.
Here is how automatic dynamic synchronization works:
1. Preflight in "torch/_dynamo/variables/builder.py": On the first Dynamo trace run, we trace without automatic dynamic at all; we assume all Tensor inputs that are not otherwise marked are static. This run is purely to collect all Tensor input sizes in the program.
2. torch/_dynamo/output_graph.py: At the end of the first Dynamo trace run, we perform a compiler collective to distribute all Tensor input sizes to all ranks. Then, we restart Dynamo
3. Apply the updates in "torch/_dynamo/variables/builder.py": Now that we have all sizes for every rank, we now update frame state with the observed sizes for all ranks, in rank order. Under the assumption that frame state is consistent on all ranks, this series of updates will preserve consistency.
For future work, it would be safer if we force a consistent hint on all ranks; this is more involved as we have to interpose in fakification.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130935
Approved by: https://github.com/jansel
Tracing through `__init__` is important because it initializes (calls STORE_ATTR) on members. By doing that, we kick in the mutation tracking for these objects. So, things like mutating `_modules` etc is tracked automatically.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126578
Approved by: https://github.com/jansel
ghstack dependencies: #128001
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo make `usort` do more and generate the changes in the PR. Except `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126
Approved by: https://github.com/kit1980
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo make `usort` do more and generate the changes in the PR. Except `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126
Approved by: https://github.com/kit1980
ghstack dependencies: #127122, #127123, #127124, #127125
This PR fixes Issue #111279.
While #111279 reported the issue with `MultiheadAttention`, a minimal reproduction would be:
```python
class ToyModel(nn.Module):
def __init__(self,):
super().__init__()
self.linear = nn.Linear(128, 10)
def forward(self, x: torch.Tensor) -> torch.Tensor:
return self.linear.forward(x) # Error
# return self.linear(x) # OK
```
Dynamo treats `self.linear(x)` as `call_module` while treating `self.linear.forward(x)` as a [`get_attr` and a `call_method`](https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/variables/nn_module.py#L358-L378). However, existing DDPOptimizer assumes, for a `get_attr` node, `getattr(gm, node.target)` gives a tensor with the `requires_grad` attribute. Existing DDPOptimizer also does not support `call_method` nodes.
This PR adds support for `call_method` and check on `get_attr`. It also checks if a module's parameters have been added to a bucket to support multiple method calls from the same module.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121771
Approved by: https://github.com/yf225
# Note: Returning Fake Tensors on First AOT Autograd Call
#
# Inductor will optimize strides of outputs when it deems it profitable.
# For instance, converting to channels last. When we split the graph here
# into multiple inductor compilations, we need to make sure that the
# output strides of one compilation is appropriately passed to the subsequent
# compilations. However, the mapping from inductor output to dynamo output
# is non-trivial due to aot_autograd's deduping, de-aliasing, mutation, re-writing,
# subclass handling, etc. In order to replay all this logic we set a flag such that
# the first invocation of inductor in aot_autograd will return Fake Tensors with
# appropriate strides. Then, all of aot autograd's runtime logic is replayed.
# This gives us the appropriately strided outputs here which will reflect runtime strides.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120523
Approved by: https://github.com/yf225, https://github.com/bdhirsh
When CUDA is not available `c10d.init_process_group("nccl"...)` will fail with
> RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
Hence add a corresponding skip marker to the classes deriving from DynamoDistributedSingleProcTestCase next to the `requires_nccl` marker.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117741
Approved by: https://github.com/ezyang, https://github.com/malfet
Fixes https://github.com/pytorch/pytorch/issues/111636
Fixes https://github.com/pytorch/pytorch/issues/108877
Fixes https://github.com/pytorch/pytorch/issues/116956
Inductor has an invariant that every dynamic shape symbol s0, s1, etc. which is referenced by an input tensor must also be passed in explicitly as an argument. It has some capability of reverse engineering symbols if it's obvious how to get them (e.g., if you pass in `arg: f32[s0, 4]` it will know that it can retrieve `s0 = arg.size(0)`) but in full generality it is not always possible to derive this (e.g., if the only mention of s0 is in `arg2: f32[s0 + s1, 4]`). However, the graph splitter used by optimize_ddp did not respect this invariant. This PR makes it respect it.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117406
Approved by: https://github.com/wconstab
this was a SEV when FSDP modules are registered as graph attributes this unit test prevents it from happening again
without SEV fix: D48810186
```
python test/distributed/test_dynamo_distributed.py -k
test_fsdp_skip_register_attr_or_module
File "/data/users/weif/pytorch/torch/_dynamo/repro/after_dynamo.py",
line 117, in debug_wrapper
compiled_gm = compiler_fn(gm, example_inputs)
File
"/data/users/weif/pytorch/test/distributed/test_dynamo_distributed.py", line 897, in debug_compiler
self.assertFalse(name in node.name, f"FSDP module {name} should not
be registered as attributes")
torch._dynamo.exc.BackendCompilerFailed: backend='debug_compiler' raised:
AssertionError: True is not false : FSDP module l__self___net_0_weight should not be registered as attributes
```
with SEV fix: D48810186
```
python test/distributed/test_dynamo_distributed.py -k test_fsdp_skip_register_attr_or_module
Ran 1 test in 6.438s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115112
Approved by: https://github.com/mlazos
I'm not sure why we needed two overloads previously, let's find out! Removing the int overload is load bearing because it now forces specialization on SymInt arguments instead of falling through to the SymInt overload, see new test.
I decided NOT to allow storage offset simultaneously with None strides.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114236
Approved by: https://github.com/albanD
Summary:
Original commit changeset: 33650f7cb0fb
Original Phabricator Diff: D48833682
Test Plan: See T162942232 for how we figured out that this diff caused significant numeric difference.
Reviewed By: voznesenskym
Differential Revision: D49082219
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108823
Approved by: https://github.com/xw285cornell
This PR:
1) Add device_mesh kwarg to FSDP. Remove init_device_mesh() from _runtime_utils.py, as device_mesh would be passed in by user as an kwarg.
2) change use_dtensor flag for state_dict_config and optim_state_dict_config to be private. If device_mesh is used with sharded model/optim state dict, _use_dtensor flag would be set to True and model/optim state dict would return dtensor state_dict. Otherwise, _use_dtensor flag would be set to False and model/optim state dict would return sharded_tensor state_dict.
3) Update _optim_utils.py, _shard_utils.py, and _state_dict_utils.py to add support for HSDP to return 2D DTensor state_dict.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107533
Approved by: https://github.com/fegin, https://github.com/awgu, https://github.com/wanchaol