Commit Graph

44 Commits

Author SHA1 Message Date
gaoyufeng
cde54fe4e9 fix-unpin-memory-tensor-param (#160992)
Fixes #160983

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160992
Approved by: https://github.com/ngimel
2025-08-26 21:55:25 +00:00
Teja Rao
19ffdf4ea0 [dcp] add new checkpoint staging to preserve storage sharing and support mutable state_dicts (#155192)
Summary:
This implements staging in a way that doesn't mess up checkpointing semantics. We want to stay close to torch.save/load semantics, but when async checkpointing is used it messes up shared storages and doesn't handle custom objects or tensors well. E.g., a user passes a state_dict containing a CUDA tensor; it is deep-cloned, causing the staging tensor to be created on GPU, which can cause OOMs and is hard to debug.

This diff hooks into deepcopy of storages to move them to CPU using the cached storages created for async checkpoint staging. This allows reusing the staging storages instead of recreating them on each checkpoint, while staying flexible enough to handle changes: old storages are cleaned up and new ones are created as needed.

The lifetime of a staging storage is tied to the original storage object: when the original storage is gc-ed, we delete the corresponding staging storage from the cache, allowing it to be gc-ed if there are no other references. I am using the storage's data_ptr to keep track of this. Please share thoughts on this.
The alternative is to use FQNs instead of the storage id and verify that the underlying storage object has the same shape/size, etc., to make the caching logic work. The current implementation is much simpler and cleaner.
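
A minimal sketch of the caching idea, assuming a hypothetical per-tensor helper and a data_ptr-keyed cache (names below are illustrative); the landed code works at the storage level inside the DCP staging utilities:

```python
import weakref

import torch

# Hypothetical cache: data_ptr of the original storage -> staged CPU tensor.
_staging_cache: dict[int, torch.Tensor] = {}


def _stage_tensor(tensor: torch.Tensor) -> torch.Tensor:
    """Copy `tensor` into a cached CPU staging buffer, reused across checkpoints."""
    key = tensor.untyped_storage().data_ptr()
    staged = _staging_cache.get(key)
    if staged is None or staged.shape != tensor.shape or staged.dtype != tensor.dtype:
        # Create (or recreate) the staging buffer; pin/share memory could be applied here.
        staged = torch.empty_like(tensor, device="cpu")
        _staging_cache[key] = staged
        # Tie the staging buffer's lifetime to the original tensor: once the original
        # is gc-ed, drop the cache entry so the staging buffer can be gc-ed too
        # (if nothing else references it).
        weakref.finalize(tensor, _staging_cache.pop, key, None)
    staged.copy_(tensor)
    return staged
```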

The API:
```
# construct a stager once per job in checkpointing.
stager = StateDictStager(pin_memory=pin_memory, share_memory=share_memory)

# do this on every checkpoint:
 with staging_context(stager):
     cpu_state_dict = copy.deepcopy(state_dict)
```

Also, adds support for pinned-memory.

One problem this implementation does not address is that we lose the original device.

The only alternative here is to pickle synchronously like torch.save, but with special handling for storages. It is valuable to keep the state_dict throughout the checkpointing process so users can manipulate and debug it as needed, which means we would need to unpickle in the background process. I think that approach is flexible but not performant, not very different from the current solution, and needs more code. One idea, if we really want to address this, is to stash the original device in a variable on the storage and then use it to recover the device on the load side. I think we do not need this for now and can be explicit that async checkpointing loses the device type.

Update:
Note: Due to reservations about hooking into deepcopy to customize it, the PR is now updated to use deepcopy-like logic to clone the state_dict. There are some caveats to this solution:
1. Deepcopy code is duplicated so we can hook into it for tensors. There is a risk of this code getting outdated with Python version changes. This is needed to handle several different types such as NamedTuples, frozen dataclasses, and nested dataclasses; the deepcopy logic relies on reduce_ex to get a function with which these can be constructed.
2. Since we are bypassing deepcopy and adding custom logic to clone a tensor, we are missing some of the functionality that exists in deepcopy for torch.Tensor, like _clear_non_serializable_cached_data() and other logic. I would like thoughts on which of that logic, or whether all of it, should be copied.
3. If an object implements __deepcopy__, we will not be able to handle tensors in its attrs with this logic, because it will likely just call copy.deepcopy on the attrs instead of going through this logic. We handle subclasses of torch.Tensor to work around this.

The new API:
```
# construct a stager once per job in checkpointing.
stager = StateDictStager(pin_memory=pin_memory, share_memory=share_memory)

# do this on every checkpoint:
cpu_state_dict = stager.stage(state_dict)
```

Test Plan:
unit tests

Differential Revision: D75993324

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155192
Approved by: https://github.com/mikaylagawarecki, https://github.com/pradeepfn
2025-06-19 02:04:21 +00:00
Meet Vadakkanchery
9bfefda296 [DCP][PyTorch Staging APIs][2/x] Handle 0-elem case + ShardedTensor copy for staging (#156092)
Summary:
### Diff Context

1. Sometimes a tensor might have non-zero size but 0 numel. In this case, pinning memory fails, so we take a best guess at how to replicate the tensor to maintain symmetry in the returned state dict (see the sketch below).

2. ShardedTensor copying was not originally handled in the PyTorch state_dict copy APIs; this diff handles it.
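
A hedged sketch of the 0-numel special case in point 1 (the helper name and pin_memory plumbing are illustrative, not the exact code in the PR):

```python
import torch


def _replicate_for_staging(tensor: torch.Tensor, pin_memory: bool) -> torch.Tensor:
    cpu_copy = torch.empty_like(tensor, device="cpu")
    # Pinning a buffer with zero elements fails, so for a non-zero-size but
    # 0-numel tensor we return the plain CPU replica to keep the returned
    # state dict structurally symmetric with the input.
    if tensor.numel() == 0 or not pin_memory:
        return cpu_copy
    return cpu_copy.pin_memory()
```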

Test Plan: CI

Differential Revision: D75553096

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156092
Approved by: https://github.com/pradeepfn
2025-06-18 22:41:25 +00:00
Natalia Gimelshein
34e3930401 fix numpy compatibility for 2d small list indices (#154806)
This will fix #119548 and linked issues once we switch from a warning to the new behavior,
but for now, given how much this syntax was used in our test suite, we suspect a silent change would be disruptive.
We will change the behavior after the 2.8 branch is cut.
NumPy's behavior was changed at least as of NumPy 1.24 (more than 2 years ago).
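
Roughly, the divergence in question looks like this (my illustration of the nested-list indexing case; the exact warning text and cutover release come from the PR above):

```python
import torch

x = torch.arange(12).reshape(3, 4)
idx = [[0, 1], [1, 2]]  # a small 2-D list index

# Legacy behavior (kept for now, with a warning): the outer list is treated
# like a tuple of per-dimension indices, i.e. x[[0, 1], [1, 2]] -> shape (2,).
print(x[idx])

# NumPy >= 1.24 behavior (the planned new behavior): the nested list acts as
# a single 2-D index array on dim 0, i.e. result shape (2, 2, 4).
print(x[torch.tensor(idx)])
```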

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154806
Approved by: https://github.com/cyyever, https://github.com/Skylion007, https://github.com/albanD
2025-06-04 01:58:52 +00:00
Matthew Hoffman
e3ebf61589 Create and send full_tensor on ProcessGroup-supported device in _broadcast_tensors (#148865)
Fixes #138842

`device` is always the device of the `local_state_dict`, which may or may not be CPU; CPU is not supported by the NCCL backend.

Instead, create broadcasted tensors on one of `pg._device_types` and then move the tensors back if `local_state_dict`'s `device` was not supported by the `ProcessGroup`.
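
A hedged sketch of that flow (the `_device_types` usage is taken from the description above; the helper name and the rest are illustrative):

```python
import torch
import torch.distributed as dist


def _broadcast_full_tensor(full_tensor: torch.Tensor, pg: dist.ProcessGroup) -> torch.Tensor:
    # Pick one of the device types the process group supports (for NCCL: "cuda").
    pg_device = next(iter(pg._device_types))
    original_device = full_tensor.device
    t = full_tensor.to(pg_device) if original_device.type != pg_device.type else full_tensor
    dist.broadcast(t, src=0, group=pg)
    # Move back only when the local state dict lived on a device the PG does not support.
    return t.to(original_device) if t.device != original_device else t
```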

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148865
Approved by: https://github.com/mori360
2025-03-12 20:56:31 +00:00
Xuehai Pan
995df34b19 [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144547
Approved by: https://github.com/kwen2501
2025-02-28 07:35:56 +00:00
Chien-Chin Huang
0de27ee7e0 Let _create_cpu_state_dict and _copy_state_dict support DTensor (#146852)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146852
Approved by: https://github.com/d4l3k
2025-02-12 18:43:52 +00:00
Aaron Orenstein
00ffeca1b1 PEP585 update - torch/distributed (#145164)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145164
Approved by: https://github.com/bobrenjc93
2025-01-21 04:23:29 +00:00
PyTorch MergeBot
6374332d33 Revert "PEP585 update - torch/distributed (#145164)"
This reverts commit 6cb186e279.

Reverted https://github.com/pytorch/pytorch/pull/145164 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing an inductor test ([comment](https://github.com/pytorch/pytorch/pull/145164#issuecomment-2602875679))
2025-01-20 16:46:46 +00:00
Aaron Orenstein
6cb186e279 PEP585 update - torch/distributed (#145164)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145164
Approved by: https://github.com/bobrenjc93
2025-01-20 00:19:01 +00:00
bobrenjc93
08be9ec312 Migrate from Tuple -> tuple in torch/distributed (#144258)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144258
Approved by: https://github.com/aorenste
2025-01-10 08:34:54 +00:00
mori360
a7ba562ec8 [state dict] Change _load_model_state_dict to enable cpu_offload, accept 2 device type and optimize memory (#142845)
For the distributed state dict API [migration](https://github.com/pytorch/torchtune/pull/2138), make the following changes:
1. `load_from_full_model_state_dict` in TorchTune calls `set_model_state_dict` with an option for whether to use cpu_offload. Add cpu_offload to `_load_model_state_dict` to process tensors on CPU if the option is True (see the sketch below).
2. Change the device check, as lora_finetune might have 2 device types; accept that as valid.
3. Some changes to optimize memory performance:
3.1 use `.detach().clone()` instead of a view directly
3.2 if local_state is not meta, copy `full_tensor[slices]` into `ret.to_local()`
4. add related unit tests
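
A hedged usage sketch for item 1, using the option names from `torch.distributed.checkpoint.state_dict` (the TorchTune-side wiring is paraphrased, not quoted):

```python
from torch.distributed.checkpoint.state_dict import StateDictOptions, set_model_state_dict

model = ...            # the sharded model being restored (placeholder)
full_state_dict = ...  # full state dict from the checkpoint (placeholder)

set_model_state_dict(
    model,
    model_state_dict=full_state_dict,
    options=StateDictOptions(
        full_state_dict=True,  # loading a full (unsharded) state dict
        cpu_offload=True,      # new: process slices on CPU to reduce GPU peak memory
    ),
)
```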

Memory performance calling from TorchTune with llama2/7B_full:
1. cpu_offload = True
<img width="555" alt="Screenshot 2024-12-18 at 1 36 47 PM" src="https://github.com/user-attachments/assets/429261f5-1107-4592-b295-de3944a2614b" />

2. cpu_offload = False
<img width="555" alt="Screenshot 2024-12-18 at 1 36 52 PM" src="https://github.com/user-attachments/assets/40bf281a-236a-4218-826b-b1192a10c806" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142845
Approved by: https://github.com/fegin
2024-12-19 05:06:41 +00:00
wz337
637d5c4b7e [DSD] Fix loading uneven full tensor into sharded state dict (#136365)
Fix #136228.

This is a follow-up to https://github.com/pytorch/pytorch/pull/135725. We need to pass the shape and stride from the original DTensor, since in the uneven case `from_local` would calculate shape and stride from the local tensor, assuming the tensor is evenly sharded.
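
In code, the fix is roughly the following (a sketch; `local_tensor` and `orig_dtensor` are placeholders for the values available at that point):

```python
from torch.distributed.tensor import DTensor

# Before: from_local inferred the global shape/stride from the local shard,
# which is wrong when the tensor is unevenly sharded across ranks.
# After: pass the global shape and stride taken from the original DTensor.
new_dtensor = DTensor.from_local(
    local_tensor,
    device_mesh=orig_dtensor.device_mesh,
    placements=orig_dtensor.placements,
    shape=orig_dtensor.size(),
    stride=orig_dtensor.stride(),
)
```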

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136365
Approved by: https://github.com/fegin
2024-09-23 16:35:58 +00:00
wz337
408fe41a45 [DSD][EZ] Minor update in _state_dict_utils.py (#136165)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136165
Approved by: https://github.com/kwen2501
ghstack dependencies: #135725, #135763
2024-09-17 04:32:43 +00:00
wz337
0cdc6a8dcd [DSD] Fix distributed state dict full_state_dict option hang during set_state_dict (#135725)
Fix https://github.com/pytorch/pytorch/issues/134095
This fixes the distributed state dict full_state_dict option hanging during set_state_dict. We switch `_distribute_tensors` in _state_dict_utils.py to use `DTensor.from_local` instead of `distribute_tensor` to support the FSDP2+TP 2D strided-sharding use case, as `distribute_tensor` cannot handle strided sharding yet. `distribute_tensor` incurs a scatter behind the scenes, while `DTensor.from_local` takes the local slice from the full tensor on each rank to create the DTensor (no collective). This means it's the user's responsibility to make sure the full_tensor from the full_state_dict is the same across all ranks.
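
The difference, sketched (placeholders for the mesh, placements, and per-rank slice; not the literal diff):

```python
from torch.distributed.tensor import DTensor, distribute_tensor

# distribute_tensor: rank 0's full tensor is scattered to every rank -- a
# collective -- which is where the strided-sharding case could hang.
dt = distribute_tensor(full_tensor, device_mesh, placements)

# DTensor.from_local: each rank slices its own shard out of the full tensor it
# already holds; no collective is issued, so the caller must guarantee that
# full_tensor is identical on all ranks.
dt = DTensor.from_local(full_tensor[local_slices], device_mesh, placements)
```
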
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135725
Approved by: https://github.com/fegin
2024-09-13 03:26:36 +00:00
PyTorch MergeBot
3e1a4ea132 Revert "[DSD] Fix distributed state dict full_state_dict option hang during set_state_dict (#135725)"
This reverts commit 83c594ebd6.

Reverted https://github.com/pytorch/pytorch/pull/135725 on behalf of https://github.com/ZainRizvi due to This is breaking lint. See [GH job link](https://github.com/pytorch/pytorch/actions/runs/10835983999/job/30068709508) [HUD commit link](83c594ebd6) ([comment](https://github.com/pytorch/pytorch/pull/135725#issuecomment-2347303272))
2024-09-12 21:47:38 +00:00
wz337
83c594ebd6 [DSD] Fix distributed state dict full_state_dict option hang during set_state_dict (#135725)
Fix https://github.com/pytorch/pytorch/issues/134095
This fixes the distributed state dict full_state_dict option hanging during set_state_dict. We switch `_distribute_tensors` in _state_dict_utils.py to use `DTensor.from_local` instead of `distribute_tensor` to support the FSDP2+TP 2D strided-sharding use case, as `distribute_tensor` cannot handle strided sharding yet. `distribute_tensor` incurs a scatter behind the scenes, while `DTensor.from_local` takes the local slice from the full tensor on each rank to create the DTensor (no collective). This means it's the user's responsibility to make sure the full_tensor from the full_state_dict is the same across all ranks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135725
Approved by: https://github.com/fegin
2024-09-12 17:43:57 +00:00
Wanchao Liang
cfc227ad43 [reland][dtensor] move DTensor to public namespace (#134203)
reland of https://github.com/pytorch/pytorch/pull/133113

I had to create a new PR because the previously reverted PR could not be rebased or imported successfully :(

----

Moving DTensor to be in the public namespace, to formally add the documentation page that includes all the public APIs. This includes:

* many path renames and path import fixes
* a dedicated doc page without too much content yet (adding in the next PRs)
* To preserve BC for users still using torch.distributed._tensor, I added a shim module to redirect old-path imports to the new module

That BC is preserved is evidenced by the fact that all DTensor tests still pass without changing the public imports, so it's safe to land the changes.
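
A minimal sketch of what such a redirect shim can look like (illustrative only; the actual shim file and symbol handling in the PR may differ):

```python
# torch/distributed/_tensor/__init__.py -- illustrative backward-compat shim
import torch.distributed.tensor as _public

# Re-export the public names so `from torch.distributed._tensor import DTensor`
# keeps working after the move.
globals().update({name: getattr(_public, name) for name in _public.__all__})


def __getattr__(name):
    # Fall back to the new public module for anything not re-exported above.
    return getattr(_public, name)
```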

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134203
Approved by: https://github.com/tianyu-l
2024-09-08 17:08:40 +00:00
mori360
d0ac5d55ba Memory optimization for DSD for TorchTune LoRA (#134025)
Optimize memory cost at [PR#129635](https://github.com/pytorch/pytorch/pull/129635)

There are 2 main parts to the optimization here:
1. Optimize the tensor-distributing part by postponing full_tensor generation, which avoids memory overlap and saves around 50% peak memory in the 2-param test case.
2. Apply `assign=True` in `load_state_dict`, which saves memory during state dict loading by assigning the input params instead of copying into existing ones, around 50% peak memory in the loading part (see the sketch below).
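
Item 2 in isolation, sketched with the standard `nn.Module.load_state_dict(..., assign=True)` keyword:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8, device="meta")  # params not yet materialized
state_dict = {"weight": torch.randn(8, 8), "bias": torch.randn(8)}

# assign=True makes load_state_dict adopt the incoming tensors directly instead
# of copying them into pre-allocated parameters, so the loaded tensors are not
# duplicated in memory while the state dict is applied.
model.load_state_dict(state_dict, assign=True)
```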

Future work:
Memory optimization for the optimizer will be done in the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134025
Approved by: https://github.com/fegin

Co-authored-by: Rachel Guo <guorachel@meta.com>
2024-08-26 17:24:25 +00:00
PyTorch MergeBot
35f36363ec Revert "[dtensor] move DTensor to public namespace (#133113)"
This reverts commit 2ee6b97464.

Reverted https://github.com/pytorch/pytorch/pull/133113 on behalf of https://github.com/wanchaol due to looks like it break some internal type imports ([comment](https://github.com/pytorch/pytorch/pull/133113#issuecomment-2295670911))
2024-08-19 05:00:19 +00:00
Wanchao Liang
2ee6b97464 [dtensor] move DTensor to public namespace (#133113)
Moving DTensor to be in the public namespace, to formally add the
documentation page that includes all the public APIs. This includes:

* many path renames and path import fixes
* a dedicated doc page without too much content yet (adding in the next
  PRs)
* To preserve BC for users still using `torch.distributed._tensor`,
  I added a shim module to redirect old-path imports to the new module

That BC is preserved is evidenced by the fact that all DTensor tests still
pass without changing the public imports, so it's safe to land the
changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133113
Approved by: https://github.com/XilunWu
ghstack dependencies: #133305, #133306
2024-08-17 05:09:52 +00:00
Xuehai Pan
b25ef91bf1 [BE][Easy][18/19] enforce style for empty lines in import segments in torch/d*/ (#129770)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129770
Approved by: https://github.com/wconstab
2024-08-01 04:22:50 +00:00
Lucas Pasqualin
69c34f6e4c Corrects Error Codes from cudaHostRegister (#132089)
The incorrect error codes were causing some terrible error messages, e.g.:

```
# printing directly: cudaError.???
# casting to int first: 712

Traceback (most recent call last):
  File "/data/users/lpasqualin/fbsource/fbcode/scripts/lpasqualin/playground.py", line 15, in <module>
    main()
  File "/data/users/lpasqualin/fbsource/fbcode/scripts/lpasqualin/playground.py", line 11, in main
    _create_cpu_state_dict(sd, share_memory=True, pin_memory=True)
  File "/home/lpasqualin/pytorch/torch/distributed/_state_dict_utils.py", line 436, in _create_cpu_state_dict
    ret = _iterate_state_dict(
          ^^^^^^^^^^^^^^^^^^^^
  File "/home/lpasqualin/pytorch/torch/distributed/_state_dict_utils.py", line 143, in _iterate_state_dict
    ret = {
          ^
  File "/home/lpasqualin/pytorch/torch/distributed/_state_dict_utils.py", line 144, in <dictcomp>
    key: _iterate_state_dict(
         ^^^^^^^^^^^^^^^^^^^^
  File "/home/lpasqualin/pytorch/torch/distributed/_state_dict_utils.py", line 125, in _iterate_state_dict
    ret = tensor_func(iter_object, pg, device, companion_obj)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lpasqualin/pytorch/torch/distributed/_state_dict_utils.py", line 428, in tensor_func
    succ == 0
AssertionError: Pinning shared memory failed with error-code: cudaError.???
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132089
Approved by: https://github.com/Skylion007
2024-07-30 21:42:00 +00:00
Teja
b61600f6cc [pytorch] fix the leak for pinned memory when using _create_cpu_state… (#131270)
When pin_memory and share_memory are both set to True in _create_cpu_state_dict, the memory is pinned using cudaHostRegister but is never unpinned. So, once a tensor is created and freed, when a new tensor is created the caching allocator allocates the same memory, which fails with the error below.

```
obj = <[RuntimeError('CUDA error: part or all of the requested memory range is already mapped\nCUDA kernel errors might be a...pile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n') raised in repr()] Tensor object at 0x7f0028a4d6c0> pg = None, device = None, _ = None
```

This PR fixes the leak by attaching a hook that unregisters the memory when the tensor is freed.

This is easily reproducible with xlformers checkpointing unit tests and the fix is verified with the same.
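
A hedged sketch of the pin/unpin pairing (using the CUDA runtime bindings exposed via `torch.cuda.cudart()`; the landed fix attaches the hook inside `_create_cpu_state_dict`):

```python
import weakref

import torch


def _pin_shared_cpu_tensor(t: torch.Tensor) -> torch.Tensor:
    """Pin an already share_memory_()'d CPU tensor and unpin it when it is freed."""
    cudart = torch.cuda.cudart()
    ptr, nbytes = t.data_ptr(), t.numel() * t.element_size()
    succ = cudart.cudaHostRegister(ptr, nbytes, 1)  # 1 ~ cudaHostRegisterPortable
    assert succ == 0, f"Pinning shared memory failed with error code {int(succ)}"

    # The leak: this range was never cudaHostUnregister'ed, so a later allocation
    # reusing it hit "part or all of the requested memory range is already mapped".
    # Attach a finalizer that unregisters the range once the tensor is gc-ed.
    weakref.finalize(t, cudart.cudaHostUnregister, ptr)
    return t
```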

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131270
Approved by: https://github.com/LucasLLC
2024-07-23 15:47:21 +00:00
Xuehai Pan
973037be6a [BE][Easy] apply autofix for ruff rules unnecessary-collection-call (C408): list() / tuple() / dict() (#130199)
This PR changes the empty collection factory call to Python literals:

- `list()` -> `[]`
- `tuple()` -> `()`
- `dict()` -> `{}`

The Python literals are more performant and safer. For example, the bytecode for building an empty dictionary:

```bash
$ python3 -m dis - <<EOS
import collections

d1 = {}
d2 = dict()

dict = collections.OrderedDict
d3 = dict()
EOS
```

```text
  0           0 RESUME                   0

  1           2 LOAD_CONST               0 (0)
              4 LOAD_CONST               1 (None)
              6 IMPORT_NAME              0 (collections)
              8 STORE_NAME               0 (collections)

  3          10 BUILD_MAP                0
             12 STORE_NAME               1 (d1)

  4          14 PUSH_NULL
             16 LOAD_NAME                2 (dict)
             18 CALL                     0
             26 STORE_NAME               3 (d2)

  6          28 LOAD_NAME                0 (collections)
             30 LOAD_ATTR                8 (OrderedDict)
             50 STORE_NAME               2 (dict)

  7          52 PUSH_NULL
             54 LOAD_NAME                2 (dict)
             56 CALL                     0
             64 STORE_NAME               5 (d3)
             66 RETURN_CONST             1 (None)
```

The dict literal `{}` only has one bytecode `BUILD_MAP`, while the factory call `dict()` has three `PUSH_NULL + LOAD_NAME + CALL`. Also, the factory call is not safe if users override the `dict` name in `locals` or `globals` (see the example of replacing with `OrderedDict` above).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130199
Approved by: https://github.com/malfet
2024-07-11 17:30:28 +00:00
Xuehai Pan
94dc3253a0 [BE][Easy] enable UFMT for torch/distributed/ (#128870)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128870
Approved by: https://github.com/fegin, https://github.com/wconstab
2024-06-22 18:53:28 +00:00
PyTorch MergeBot
9c929f6ce9 Revert "[BE][Easy] enable UFMT for torch/distributed/ (#128870)"
This reverts commit a0e1e20c41.

Reverted https://github.com/pytorch/pytorch/pull/128870 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128870#issuecomment-2181780356))
2024-06-21 00:38:28 +00:00
Xuehai Pan
a0e1e20c41 [BE][Easy] enable UFMT for torch/distributed/ (#128870)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128870
Approved by: https://github.com/fegin
ghstack dependencies: #128868, #128869
2024-06-18 21:49:08 +00:00
mori360
d71f92213c [DSD] keep 'exp_avg' as DTensor after torch.distributed.checkpoint.state_dict.set_optimizer_state_dict (#128004)
Fixes #126950
`ptd_state_dict` with `broadcast_from_rank0=False` might miss 2 condition checks in `set_optimizer_state_dict`.
Here we add another condition, `full_state_dict=True`, with the corresponding tensor distribution done without broadcasting when broadcast_from_rank0=False.
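
For reference, the call path being fixed looks roughly like this (a sketch; `model`, `optimizer`, and the full optimizer state dict are placeholders):

```python
from torch.distributed.checkpoint.state_dict import StateDictOptions, set_optimizer_state_dict

# Full (unsharded) optimizer state dict applied on every rank, without
# broadcasting from rank 0 -- the combination whose missing condition checks
# left 'exp_avg' as a plain tensor instead of a DTensor.
set_optimizer_state_dict(
    model,
    optimizers=optimizer,
    optim_state_dict=full_optim_state_dict,
    options=StateDictOptions(full_state_dict=True, broadcast_from_rank0=False),
)
```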

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128004
Approved by: https://github.com/fegin
2024-06-12 18:14:56 +00:00
Aaron Orenstein
3a0d088517 Flip default value for mypy disallow_untyped_defs [5/11] (#127842)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127842
Approved by: https://github.com/oulgen
2024-06-08 18:49:18 +00:00
Chien-Chin Huang
6d21685b45 [DSD] Fixes various bugs for broadcast_from_rank0 (#127635)
Fixes https://github.com/pytorch/pytorch/issues/126285

Summary:
1. Fixes https://github.com/pytorch/pytorch/issues/126285
2. Broadcast one tensor at a time to avoid OOM.
3. Add some docstrings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127635
Approved by: https://github.com/weifengpy
2024-06-03 06:35:21 +00:00
Lucas Pasqualin
42312a52b3 [DSD] Adds type_check param to copy state dict utils (#127417)
[DSD] Adds type_check param to copy state dict utils.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127417
Approved by: https://github.com/fegin
2024-06-01 17:50:52 +00:00
Chien-Chin Huang
15a9770225 [DSD] Implement broadcast_from_rank0 option for optim state_dict (#125339)
Summary:
This is useful if users would like to avoid CPU memory OOM when loading from a full state_dict.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125339
Approved by: https://github.com/weifengpy
ghstack dependencies: #125708, #125338
2024-05-08 07:22:20 +00:00
Chien-Chin Huang
0542fd485f [DSD] Implement broadcast_from_rank0 option for model state_dict (#125338)
Summary:
This is useful if users would like to avoid CPU memory OOM when loading from a full state_dict.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125338
Approved by: https://github.com/weifengpy
ghstack dependencies: #125708
2024-05-08 07:11:18 +00:00
Xuehai Pan
93e249969b [BE] enable ruff rule RSE and remove useless parentheses in raise statements (#124261)
Remove useless parentheses in `raise` statements if the exception type is raised with no argument.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124261
Approved by: https://github.com/albanD
2024-04-17 19:29:34 +00:00
Lucas Pasqualin
46a25cc0db [DCP] Adds support for non-primatives in async_save by deep copying during cpu offloading (#123941)
Adds support for non-primitives in async_save by deep copying during CPU offloading.

If users are not type checking, the expectation in async save is likely that the object is copied.

Differential Revision: [D56065237](https://our.internmc.facebook.com/intern/diff/D56065237/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123941
Approved by: https://github.com/fegin
2024-04-16 20:49:25 +00:00
Lucas Pasqualin
d838cc8f66 [DCP] Returns a copy of sd in copy sd (#123567)
I found that returning the copy is actually useful in situations where you might do something like:

```
ret = _copy_state_dict(obj, cache)
ret.update(some_other_values)
```

and would like `cache` not to change structure as a result of `ret.update(some_other_values)`. Open to notes here; not returning a copy might force the user to do some additional copies for this case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123567
Approved by: https://github.com/wz337
2024-04-16 15:29:32 +00:00
Lucas Pasqualin
620aaaf0cb [DCP] Adds ability to create a CPU state dict that is both shared and pinned (#122338)
[DCP] Adds the ability to create a CPU state dict that is both shared and pinned, as well as a new utility specific to copying the state dict.

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1ge8d5c17670f16ac4fc8fcb4181cb490c

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122338
Approved by: https://github.com/fegin
2024-04-03 20:05:01 +00:00
Chien-Chin Huang
0811f15270 [DCP][state_dict] Let _offload_state_dict_to_cpu to return the companion_obj if it exist. (#121273)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121273
Approved by: https://github.com/wz337, https://github.com/LucasLLC
2024-03-08 00:24:29 +00:00
Chien-Chin Huang
5abf7972d1 [DCP][state_dict] Implement pin_memory and shared_memory copy for _offload_state_dict_to_cpu (#120378)
**Summary**
This PR extends `_offload_state_dict_to_cpu` to accept a `cpu_offload_state_dict` argument. If `cpu_offload_state_dict` is not None, `_offload_state_dict_to_cpu` will use `copy_` to copy the GPU data to the CPU tensors. This allows users to pass a pin_memory or share_memory version of `cpu_offload_state_dict`.

This PR also adds `_create_cpu_state_dict` to allow users to easily create a pin_memory or share_memory cpu state_dict.

**Performance improvement**
```
# The micro-benchmark has a source state_dict with 150 tensors, and each tensor is 50MB.
# The micro-benchmark is run on a H100 machine with PCIe 5

cpu_state_dict_2 = _create_cpu_state_dict(state_dict, pin_memory=True)
cpu_state_dict_3 = _create_cpu_state_dict(state_dict, share_memory=True)

# GPU->CPU memory: 4.6556 seconds
cpu_state_dict = _offload_state_dict_to_cpu(state_dict)

# GPU->pin memory: 0.1566 seconds
_offload_state_dict_to_cpu(state_dict, cpu_offload_state_dict=cpu_state_dict_2)

# GPU->shared memory: 0.5509 seconds (variation is quite large)
_offload_state_dict_to_cpu(state_dict, cpu_offload_state_dict=cpu_state_dict_3)

# GPU->pin memory->shared memory: 0.2550 seconds
_offload_state_dict_to_cpu(state_dict, cpu_offload_state_dict=cpu_state_dict_2)
_offload_state_dict_to_cpu(cpu_state_dict_2, cpu_offload_state_dict=cpu_state_dict_3)
```

Differential Revision: [D54045845](https://our.internmc.facebook.com/intern/diff/D54045845/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120378
Approved by: https://github.com/LucasLLC
2024-03-05 17:48:15 +00:00
Yue Dong
2bda6b4cb8 [DTensor] Only wait on AsyncCollectiveTensor after DTensor-based state dict loading (#119716)
Summary:
This PR serves as a follow-up fix to address numerical correctness concerns identified in PR #118197, and we should only wait on `AsyncCollectiveTensor`.

Without the change, we occasionally ran into the exception: `AttributeError("'Tensor' object has no attribute 'wait'")`
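
The resulting guard looks roughly like this (class path assumed from the functional collectives module):

```python
import torch
from torch.distributed._functional_collectives import AsyncCollectiveTensor


def _maybe_wait(t: torch.Tensor) -> torch.Tensor:
    # Only an AsyncCollectiveTensor carries a pending collective to wait on;
    # calling .wait() on a plain Tensor raises the AttributeError seen above.
    if isinstance(t, AsyncCollectiveTensor):
        return t.wait()
    return t
```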

Test Plan:
**CI**:
Wait for the CI test

**Test with prod model**:
- Tested with models and no-longer ran into the exception after checkpoint loading.

Differential Revision: D53680406

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119716
Approved by: https://github.com/fegin, https://github.com/Skylion007, https://github.com/wz337
2024-02-13 04:30:45 +00:00
Catherine Lee
f9971daaee Fix divergence between internal + external (#118509)
D53049807 and https://github.com/pytorch/pytorch/pull/118197 got out of sync somehow

Fixing externally since I'm pretty sure the internal version is correct

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118509
Approved by: https://github.com/malfet
2024-01-29 14:53:50 +00:00
Chien-Chin Huang
4f78869c18 [state_dict] Calls wait() for the DTensor to_local() result (#118197)
See the discussion in https://github.com/pytorch/pytorch/pull/117799.

There are some issues when returning an AsyncCollectiveTensor (we haven't found the
root causes), including OOM and unexpected values.

This PR forces `_gather_state_dict()` to be synchronous with respect to the main stream.

Differential Revision: [D53049807](https://our.internmc.facebook.com/intern/diff/D53049807/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118197
Approved by: https://github.com/wz337, https://github.com/LucasLLC
2024-01-25 17:14:08 +00:00
Chien-Chin Huang
cc28f61fa3 [DCP][BE] Move DCP._state_dict_utils out from DCP (#115523)
DCP._state_dict_utils is also used by FSDP, which can sometimes cause a circular import. Move it out of DCP to avoid the circular import.

Differential Revision: [D52022440](https://our.internmc.facebook.com/intern/diff/D52022440/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115523
Approved by: https://github.com/wz337
2023-12-13 08:59:48 +00:00