pytorch/torch/distributed
ankurneog e248c1d7eb Update real device in FSDP state_dict_utils (#134994)
## Motivation
By default, `tensor.device` for a sharded tensor (as well as a non-sharded one) resolves to CUDA, so running the FSDP unit tests on a non-CUDA build fails with the errors below. This change derives the actual device type from the created tensor instead.

```
[rank3]   File "/root/repos/pytorch-training-tests/tests/pytorch/v2.4.0/distributed_hpu/fsdp/test_fsdp_dtensor_state_dict.py", line 143, in test_dtensor_sharded_tensor_state_dict_identical
[rank3]     sharded_tensor_sd = ref_model.state_dict()
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1944, in state_dict
[rank3]     hook_result = hook(self, destination, prefix, local_metadata)
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank3]     return func(*args, **kwargs)
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/_state_dict_utils.py", line 752, in _post_state_dict_hook
[rank3]     tensor.device,
[rank3]   File "/usr/local/lib/python3.10/dist-packages/typing_extensions.py", line 2853, in wrapper
[rank3]     return arg(*args, **kwargs)
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/sharded_tensor/api.py", line 1152, in __torch_function__
[rank3]     return dispatch(st_instance, func)
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/sharded_tensor/api.py", line 1134, in dispatch
[rank3]     return _SHARDED_OPS[func](types, args, kwargs, st._process_group)
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/op_registry_utils.py", line 33, in wrapper
[rank3]     return wrapped_func(types, args, kwargs, process_group)
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/sharded_tensor/_ops/tensor_ops.py", line 52, in tensor_device
[rank3]     dev = torch.device(torch.cuda.current_device())
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 878, in current_device
[rank3]     _lazy_init()
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 305, in _lazy_init
[rank3]     raise AssertionError("Torch not compiled with CUDA enabled")
[rank3] AssertionError: Torch not compiled with CUDA enabled
```
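The root cause is visible in the last frames of the traceback: `tensor_device` in `_ops/tensor_ops.py` queried `torch.cuda.current_device()` unconditionally, which triggers CUDA lazy init even on builds without CUDA. A minimal sketch of the contrast (the helper name below is hypothetical, not the actual PR diff):

```python
import torch

def shard_device(t: torch.Tensor) -> torch.device:
    """Hypothetical helper illustrating the idea behind the fix.

    The failing pattern in the traceback above was roughly:
        dev = torch.device(torch.cuda.current_device())
    which raises "Torch not compiled with CUDA enabled" on
    CPU-only or HPU builds.
    """
    # Reading the device from the tensor itself works for any
    # backend (cpu, cuda, hpu, ...), with no CUDA initialization.
    return t.device

print(shard_device(torch.zeros(2)))  # default tensor: cpu
```

The same principle applies in `_post_state_dict_utils`: rather than assuming CUDA, the real device type is taken from the tensor that was actually created.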

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134994
Approved by: https://github.com/fegin
2024-09-17 04:39:08 +00:00
| Name | Latest commit message | Commit date |
| --- | --- | --- |
| _composable | [Traceable FSDP2] Don't register RegisterPostBackwardFunction if user intends to use Traceable FSDP2, and assert that compiled autograd is not used when entering RegisterPostBackwardFunction (#135824) | 2024-09-14 06:30:12 +00:00 |
| _shard | [BE]: Update mypy to 1.11.2 (#133816) | 2024-09-16 19:44:11 +00:00 |
| _sharded_tensor | | |
| _sharding_spec | | |
| _symmetric_memory | [micro_pipeline_tp] support all _scaled_mm args (#131984) | 2024-08-05 21:44:37 +00:00 |
| _tensor | [reland][dtensor] move DTensor to public namespace (#134203) | 2024-09-08 17:08:40 +00:00 |
| _tools | Runtime Estimator for estimating GPU compute time (#134243) | 2024-08-28 20:06:54 +00:00 |
| algorithms | [BE]: Update mypy to 1.11.2 (#133816) | 2024-09-16 19:44:11 +00:00 |
| autograd | | |
| benchmarks | [BE][Easy] enable ruff rule PIE790: unnecessary pass statement (#133200) | 2024-08-15 15:50:19 +00:00 |
| checkpoint | [Distributed] fix FileSystemWriter __init__ (#136135) | 2024-09-16 19:11:08 +00:00 |
| elastic | Adding entry-point based support for out-of-tree rendezvous plugins (#132633) | 2024-09-11 03:35:02 +00:00 |
| examples | | |
| fsdp | Update real device in FSDP state_dict_utils (#134994) | 2024-09-17 04:39:08 +00:00 |
| launcher | | |
| nn | Revert "added persistent option to buffers and namedbuffers (#132994)" | 2024-08-09 18:14:53 +00:00 |
| optim | [BE]: Update mypy to 1.11.2 (#133816) | 2024-09-16 19:44:11 +00:00 |
| pipelining | [BE]: Update mypy to 1.11.2 (#133816) | 2024-09-16 19:44:11 +00:00 |
| rpc | [BE][Easy] enable ruff rule PIE790: unnecessary pass statement (#133200) | 2024-08-15 15:50:19 +00:00 |
| tensor | [BE]: Update mypy to 1.11.2 (#133816) | 2024-09-16 19:44:11 +00:00 |
| __init__.py | Remove ProcessGroupRoundRobin (#132888) | 2024-08-08 01:07:40 +00:00 |
| _checkpointable.py | | |
| _composable_state.py | | |
| _functional_collectives_impl.py | | |
| _functional_collectives.py | [BE]: Update mypy to 1.11.2 (#133816) | 2024-09-16 19:44:11 +00:00 |
| _state_dict_utils.py | [DSD][EZ] Minor update in _state_dict_utils.py (#136165) | 2024-09-17 04:32:43 +00:00 |
| argparse_util.py | | |
| c10d_logger.py | | |
| collective_utils.py | | |
| constants.py | | |
| CONTRIBUTING.md | | |
| device_mesh.py | [c10d][Reland] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure (#132931) (#135653) | 2024-09-16 19:56:42 +00:00 |
| distributed_c10d.py | [c10d][Reland] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure (#132931) (#135653) | 2024-09-16 19:56:42 +00:00 |
| launch.py | | |
| logging_handlers.py | | |
| remote_device.py | [BE][Easy] fix ruff rule needless-bool (SIM103) (#130206) | 2024-07-14 08:17:52 +00:00 |
| rendezvous.py | | |
| run.py | fix torchrun log message (#131652) | 2024-07-25 14:50:10 +00:00 |
| utils.py | [FSDP] casting input args with dataclass(frozen=True) (#135067) | 2024-09-05 01:19:53 +00:00 |