pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Wanchao Liang	a26480a4d1	[dtensor] move early return check into redistribute autograd function (#121653 ) This PR fixes a bug in redistribute by moving the early return check into the redistribute autograd function. Even when we redistribute to the same placement, the grad_placements from the `to_local` call might be different, so the redistribute backward still needs to happen Pull Request resolved: https://github.com/pytorch/pytorch/pull/121653 Approved by: https://github.com/awgu	2024-03-12 17:37:30 +00:00
PyTorch MergeBot	0398dc9e8e	Revert "[DCP] Makes fsspec public (#121508 )" This reverts commit `d482614fec`. Reverted https://github.com/pytorch/pytorch/pull/121508 on behalf of https://github.com/osalpekar due to this causes torchrec tests to fail internally with this error: ModuleNotFoundError: No module named 'fsspec'. see [D54779117](https://www.internalfb.com/diff/D54779117) ([comment](https://github.com/pytorch/pytorch/pull/121508#issuecomment-1992137831))	2024-03-12 17:02:43 +00:00
Andrew Gu	85dc254364	[DTensor] Moved `Transformer` sharding to staticmethod (#121660 ) To support FSDP + TP/SP unit tests, let us factor out the canonical TP/SP sharding of `Transformer` to a staticmethod that can be called by other unit tests. Test Plan: ``` pytest test/distributed/tensor/parallel/test_tp_examples.py -k test_transformer_training ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121660 Approved by: https://github.com/wanchaol, https://github.com/yifuwang ghstack dependencies: #121360, #121357	2024-03-12 15:08:57 +00:00
Howard Huang	2a99e6f299	Update error message (#121644 ) Summary: We don't want people to move to NCCL exp without explicit opt in. It seems that sparse allreduce was accidentally called and people were confused whether they should use NCCL exp instead. Update the error message to explicitly say that sparse_allreduce is not supported. Test Plan: sandcastle Differential Revision: D54759307 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121644 Approved by: https://github.com/awgu	2024-03-12 13:04:21 +00:00
Andrew Gu	272cf29e4d	[FSDP2][BE] Refactored `check_1d_sharded_parity` to use mesh (#121357 ) Eventually, we should just have one unified way to check for parity between a `DTensor`-sharded model and a replicated model. This PR is a small refactor to work toward that. One current gap to use this `check_sharded_parity` function for 2D is that FSDP's `(Shard(0), Shard(0))` layout differs from that of the `DTensor` APIs since FSDP shards on dim-0 after TP shards on dim-0. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121357 Approved by: https://github.com/weifengpy ghstack dependencies: #121360	2024-03-11 22:34:42 +00:00
PyTorch MergeBot	fd0dbcd891	Revert "Batch Norm Consolidation (#116092 )" This reverts commit `7b4f70eda5`. Reverted https://github.com/pytorch/pytorch/pull/116092 on behalf of https://github.com/osalpekar due to Causes build failure in //caffe2:aten-hip (AMD build) target. See [D54707318](https://www.internalfb.com/diff/D54707318) for more details, may require internal build system changes to resolve. ([comment](https://github.com/pytorch/pytorch/pull/116092#issuecomment-1989542965))	2024-03-11 22:22:41 +00:00
wz337	60cd2a43ca	[DeviceMesh] Add support for nD slicing (#119752 ) Fixes one of the issues mentioned in #118639 @mvpatel2000 Pull Request resolved: https://github.com/pytorch/pytorch/pull/119752 Approved by: https://github.com/wanchaol	2024-03-10 00:16:37 +00:00
Yifu Wang	71d0202627	[dynamo] support rewriting dist.all_reduce with explicitly specified reduce op (#120181 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120181 Approved by: https://github.com/wconstab, https://github.com/awgu	2024-03-09 08:28:22 +00:00
Wanchao Liang	242e03ba86	[dtensor] add async_op option to redistribute and some refactor (#121477 ) The async output option was only available in the `full_tensor()` call, but I think it's generally good to make this option available in the `redistribute` call directly so that users can control it. This PR adds an async_op option to the redistribute call, to let users control whether tensor redistribution is performed asynchronously or not. By default we set this to False, to follow the semantics of the c10d collectives. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121477 Approved by: https://github.com/wz337	2024-03-09 06:17:23 +00:00
Aidyn-A	eb3919944d	[C10d][NCCL] Refactor complex all_reduce and broadcast (#121045 ) The necessity of this PR lies in the fact that autograd engine + DDP calls `all_reduce` from C++, so the changes must be made in C++. ``` [rank0]: Traceback (most recent call last): [rank0]: File "~/complex_ddp.py", line 72, in <module> [rank0]: main() [rank0]: File "~/complex_ddp.py", line 64, in main [rank0]: loss.backward() [rank0]: File "/home/usr/pytorch/torch/_tensor.py", line 525, in backward [rank0]: torch.autograd.backward( [rank0]: File "/home/usr/pytorch/torch/autograd/__init__.py", line 267, in backward [rank0]: _engine_run_backward( [rank0]: File "/home/usr/pytorch/torch/autograd/graph.py", line 744, in _engine_run_backward [rank0]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [rank0]: TypeError: Input tensor data type is not supported for NCCL process group: ComplexFloat ``` I believe, for minimizing the Python overhead, the same could be done for the rest of the ops, what do you think @kwen2501? Pull Request resolved: https://github.com/pytorch/pytorch/pull/121045 Approved by: https://github.com/eqy, https://github.com/kwen2501	2024-03-09 02:00:54 +00:00
Lucas Pasqualin	d482614fec	[DCP] Makes fsspec public (#121508 ) Fixes #118033 Also removes `_checkpointer.py` class original PR's: - https://github.com/pytorch/pytorch/pull/121330 - https://github.com/pytorch/pytorch/pull/121329 We're also disabling `test_fsdp` since it is failing on random PR's Pull Request resolved: https://github.com/pytorch/pytorch/pull/121508 Approved by: https://github.com/fegin	2024-03-09 01:14:18 +00:00
Wanchao Liang	bc02fca358	[dtensor] to_local backward grad placement passthrough (#121474 ) to_local accepts `grad_placements` if the user chooses to pass it. Previously we enforced the grad_out to have the "same" placement as the current DTensor for safety, but I realized that we DO NOT need to enforce this constraint. Why? The backward placement does not need to be the same as the fwd tensor placement. This is already the case for param vs param.grad (i.e. param can be replicate and grad can be partial), so we should not impose this restriction on activation vs activation grad either Pull Request resolved: https://github.com/pytorch/pytorch/pull/121474 Approved by: https://github.com/awgu, https://github.com/yoyoyocmu, https://github.com/yifuwang	2024-03-08 23:11:49 +00:00
Ke Wen	038b2e8780	[c10d] Add complex support for P2P (#121240 ) Fixes the following error when `tensor` is a complex tensor: ``` [rank0]: return pg.send([tensor], dst, tag) [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: RuntimeError: Unconvertible NCCL type ComplexFloat ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121240 Approved by: https://github.com/shuqiangzhang	2024-03-08 22:47:49 +00:00
Boyuan Feng	35d3adb4b0	Add ATen Op _chunk_cat and _chunk_cat.out (#121081 ) # Motivation In backward of per-parameter sharding FSDP, each rank performs reduce scatter to sync gradients across ranks. A rank chunks each gradient tensor into `world_size` slices along the 0-th dimension and concatenates all slices along the 1-th dimension. Gradient tensors will be padded before concatenation when tensor.size(0) % world_size != 0. ### Example 1 Consider `world_size=3` and tensors A (2x4), B (3x3), C (1x2): Input tensors: ``` AAAA BBB CC AAAA BBB BBB ``` Reduce-scatter-copy-in Output: ``` AAAABBBCC AAAABBB00 0000BBB00 ``` ### Example 2 Consider `world_size=2` and tensors A (2x4), B (3x3), C(1x2), D(4x2): Input tensors: ``` AAAA BBB CC DD AAAA BBB 00 DD BBB DD 000 DD ``` Reduce-scatter-copy-in first pad: ``` AAAA BBB CC DD AAAA BBB 00 DD BBB DD 000 DD ``` Then chunk and cat along dim as the output: ``` AAAABBBBBBCCDDDD AAAABBB00000DDDD ``` The performance of reduce-scatter-copy-in is critical to per-parameter sharding FSDP. However, reduce-scatter-copy-in via composing existing ATen ops involves `cat` and irregular `pad`, leading to redundant data copies and unsatisfactory performance. # PR We provide aten native support for reduce-scatter-copy-in, namely `_chunk_cat()`: ``` _chunk_cat(Tensor[] tensors, int dim, int num_chunks) -> Tensor ``` This PR includes the registration of `_chunk_cat` and `_chunk_cat.out`, OpInfo tests, and a basic implementation composing existing ATen ops. In the next PR, we will add the CUDA implementation. Compared with baselines that compose existing ATen ops, the `_chunk_cat()` CUDA implementation improves copy bandwidth from 498 GB/s to 966 GB/s on a production benchmark. ## Requirements on input 1. If input tensors have different ndims, dim should be non-negative and be less than the ndims of every input tensor. If all input tensors have the same ndims, we support both negative and non-negative dim. 2. For wrapped_dim, all tensors should have the same size for 0,...,wrapped_dim-1 dimensions. No requirements for (wrapped_dim, ...)-th dimension. 3. Expect positive num_chunks 4. Expect non-empty input tensor list and each input tensor should have at least 1 element Pull Request resolved: https://github.com/pytorch/pytorch/pull/121081 Approved by: https://github.com/albanD	2024-03-08 21:48:12 +00:00
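The layout described in the `_chunk_cat` commit above can be sketched in plain Python for the 2-D, `dim=0` case. This is only an illustrative reference, not the ATen implementation: the helper name `chunk_cat_2d`, the list-of-lists representation, and the `pad_value` parameter are assumptions made for the example.

```python
from typing import List

Matrix = List[List[float]]


def chunk_cat_2d(tensors: List[Matrix], num_chunks: int, pad_value: float = 0) -> Matrix:
    """Hypothetical reference for chunk-and-concatenate along dim 0.

    Each 2-D input is padded along dim 0 until its row count is a multiple of
    num_chunks, split into num_chunks equal slices, and each slice is
    flattened. Output row i is the concatenation of slice i of every input.
    """
    if num_chunks <= 0:
        raise ValueError("num_chunks must be positive")
    if not tensors or any(not t or not t[0] for t in tensors):
        raise ValueError("expected a non-empty list of non-empty tensors")

    out: Matrix = [[] for _ in range(num_chunks)]
    for t in tensors:
        rows, cols = len(t), len(t[0])
        rows_per_chunk = -(-rows // num_chunks)  # ceiling division
        pad_rows = rows_per_chunk * num_chunks - rows
        padded = t + [[pad_value] * cols for _ in range(pad_rows)]
        for i in range(num_chunks):
            for row in padded[i * rows_per_chunk:(i + 1) * rows_per_chunk]:
                out[i].extend(row)
    return out


# Example 1 from the commit message, with A=1, B=2, C=3 as element values.
A = [[1] * 4 for _ in range(2)]
B = [[2] * 3 for _ in range(3)]
C = [[3] * 2 for _ in range(1)]
example_1 = chunk_cat_2d([A, B, C], num_chunks=3)
```

Each output row is what one rank would contribute as its reduce-scatter input slice, which is why padding must happen per tensor before the slices are concatenated.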
Yifu Wang	22cd2658b4	Disable GroupRegistry's thread isolation by default (#121457 ) Today `GroupRegistry` employs thread isolation by default, i.e. every thread sees its own process group registry. This is intended to work for one-device-per-process (for python use cases) and one-device-per-thread case (for custom native runtimes). However, there's a problem - there are python use cases that initialize/register process groups in one thread, and run collectives in another thread. This use case should be supported. However, since `GroupRegistry` employs thread isolation by default, collectives in different threads can't find the registered process groups. This PR fixes the issue by: - Making `GroupRegistry` work in non-thread isolation mode by default. This would match the behavior w/o the native process group registry. - Introducing `set_thread_isolation_mode` so one-device-per-thread runtimes can enable thread isolation mode explicitly. Differential Revision: [D54658515](https://our.internmc.facebook.com/intern/diff/D54658515) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121457 Approved by: https://github.com/wanchaol	2024-03-08 19:31:24 +00:00
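The difference between the two registry modes in the commit above can be sketched with a toy registry. This is a hypothetical stand-in written for illustration, not the c10d `GroupRegistry`; only the name `set_thread_isolation_mode` comes from the commit message, and the `register`/`resolve` methods and the helper function are assumptions.

```python
import threading
from typing import Dict, Optional


class ToyGroupRegistry:
    """Hypothetical registry illustrating shared vs. thread-isolated lookup."""

    def __init__(self) -> None:
        self._thread_isolation = False  # non-isolated by default, as in the commit
        self._shared: Dict[str, object] = {}
        self._local = threading.local()
        self._lock = threading.Lock()

    def set_thread_isolation_mode(self, enabled: bool) -> None:
        self._thread_isolation = enabled

    def _table(self) -> Dict[str, object]:
        # In isolation mode every thread lazily gets its own private table.
        if self._thread_isolation:
            if not hasattr(self._local, "groups"):
                self._local.groups = {}
            return self._local.groups
        return self._shared

    def register(self, name: str, group: object) -> None:
        with self._lock:
            self._table()[name] = group

    def resolve(self, name: str) -> Optional[object]:
        with self._lock:
            return self._table().get(name)


def resolve_from_other_thread(registry: ToyGroupRegistry, name: str) -> Optional[object]:
    """Look a group up from a freshly started thread and return what it saw."""
    seen = []
    worker = threading.Thread(target=lambda: seen.append(registry.resolve(name)))
    worker.start()
    worker.join()
    return seen[0]
```

With the default mode, a group registered on the main thread is visible to a worker thread; after `set_thread_isolation_mode(True)`, the same lookup from another thread returns `None`, which is the failure the commit describes for Python use cases.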
PyTorch MergeBot	0f3f4f5534	Revert "[nit][DCP][DSD] Remove Unused Variables in test_state_dict.py (#121204 )" This reverts commit `4186c36531`. Reverted https://github.com/pytorch/pytorch/pull/121204 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but the failure looks legit ([comment](https://github.com/pytorch/pytorch/pull/121204#issuecomment-1986252526))	2024-03-08 19:08:50 +00:00
Wanchao Liang	08460f4bae	[tp] remove deprecated tp_mesh_dim arg (#121432 ) This PR removes the deprecated tp_mesh_dim arg to prepare for release. Since this arg has been deprecated for a while (with deprecation messages being thrown), we should remove it before the release #suppress-api-compatibility-check Pull Request resolved: https://github.com/pytorch/pytorch/pull/121432 Approved by: https://github.com/wz337 ghstack dependencies: #121431	2024-03-08 17:46:44 +00:00
Yeounoh Chung	f7ec984b1b	[DTensor][XLA] support XLA backend in distribute_module API (#121355 ) Addresses #92909 cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang Pull Request resolved: https://github.com/pytorch/pytorch/pull/121355 Approved by: https://github.com/wanchaol	2024-03-08 15:47:33 +00:00
andrewor14	7b4f70eda5	Batch Norm Consolidation (#116092 ) Summary: This commit simplifies the existing decomposition hierarchy of batch norm ops by adding a single, backend agnostic op: `batch_norm_with_update`. The existing hierarchy looks like: ``` aten.batch_norm -> aten._batch_norm_impl_index -> [ aten.native_batch_norm -> aten._native_batch_norm_legit (export only) -> _batch_norm_legit_cpu/cuda (kernels, export only) -> _batch_norm_cpu/cuda (kernels) ] OR [ aten.cudnn_batch_norm ] OR [ aten.miopen_batch_norm ] ``` Aside from complexity, an important problem with the above decomposition hierarchy is cuda numerics in export flows. We observed significantly worse convergence when training a mobilenetv2-like model when using the `_batch_norm_cuda` kernel instead of the `cudnn_batch_norm` kernel. This means users who export their models on CPU first then move the models to cuda later may silently see worse accuracies even when cudnn is installed, because they are using the worse kernel. This issue is summarized in https://github.com/pytorch/pytorch/issues/111384. Instead, the new hierarchy proposed by consolidating existing batch norm ops will look like: ``` aten.batch_norm -> aten.batch_norm_with_update -> [ _batch_norm_cpu (kernel) ] OR [ _batch_norm_cuda (kernel) ] OR [ cudnn_batch_norm (kernel) ] OR [ miopen_batch_norm (kernel) ] ``` The new op `batch_norm_with_update` hides backend implementation details and automatically picks the right kernel based on what is installed. This commit also adds the following variants to this op: ``` batch_norm_with_update_functional batch_norm_with_update.out batch_norm_no_update batch_norm_no_update.out batch_norm_backward ``` Note that this commit only adds this op and its variants, but does not actually change the decomps to produce these ops in the graph. This will be done after the 2 week FC window, and the ops used in the old stack is planned to be removed after the 6 month BC window. 
Test Plan: `OpInfo` tests for `batch_norm_with_update`. Reviewers: albanD, bdhirsh Subscribers: albanD, bdhirsh, supriyar Tasks: https://github.com/pytorch/pytorch/issues/111384 Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092 Approved by: https://github.com/bdhirsh, https://github.com/albanD	2024-03-08 15:07:15 +00:00
Lucas Pasqualin	96ed37ac13	[DCP] Makes async_save public (#121325 ) Makes async_save public Differential Revision: [D54593610](https://our.internmc.facebook.com/intern/diff/D54593610/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121325 Approved by: https://github.com/wz337 ghstack dependencies: #121317	2024-03-08 05:13:13 +00:00
Lucas Pasqualin	909d73d8cb	[DCP] Removes `no_dist` and `coordinator_rank` from public DCP API's (#121317 ) [DCP] Removes `no_dist` and `coordinator_rank` from public DCP API's Differential Revision: [D54591181](https://our.internmc.facebook.com/intern/diff/D54591181/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121317 Approved by: https://github.com/fegin	2024-03-08 02:14:12 +00:00
wz337	4186c36531	[nit][DCP][DSD] Remove Unused Variables in test_state_dict.py (#121204 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121204 Approved by: https://github.com/Skylion007	2024-03-08 01:54:25 +00:00
Tianyu Liu	dc514b967e	[dtensor][TP] check funcol calls and improve doc for loss parallel (#121366 ) Since CommDebugMode is fixed, we can check that loss parallel is working as expected. Under loss parallel, the forward computation should invoke 3 all-reduces, and the backward computation should invoke no functional collectives. Co-authored-by: Wanchao <wanchaol@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/121366 Approved by: https://github.com/wanchaol	2024-03-08 01:41:31 +00:00
Chien-Chin Huang	0811f15270	[DCP][state_dict] Let _offload_state_dict_to_cpu return the companion_obj if it exists. (#121273 ) As title Pull Request resolved: https://github.com/pytorch/pytorch/pull/121273 Approved by: https://github.com/wz337, https://github.com/LucasLLC	2024-03-08 00:24:29 +00:00
Shengbao Zheng	60aaba4128	create function to get ProcessGroupNCCL uid (#121132 ) Summary: expose ProcessGroupNCCL uid Differential Revision: D54446056 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121132 Approved by: https://github.com/aaronenyeshi	2024-03-07 18:34:38 +00:00
Andrew Gu	e8e3049f57	[FSDP2] Relaxed check for parent mesh (#121360 ) Mixing 1D and 2D `DTensor`s in the same sharded state dict should be okay, so we can remove the check that a parameter for FSDP to shard must be a `DTensor` if passing a child mesh to FSDP. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121360 Approved by: https://github.com/yifuwang, https://github.com/Skylion007 ghstack dependencies: #120351, #121328	2024-03-07 08:09:25 +00:00
Wanchao Liang	1a28ebffb3	[TP] Introduce Sequence Parallel Style for LayerNorm/RMSNorm/Dropout (#121295 ) As titled, this PR introduces a dedicated `ParallelStyle` to shard the nn.LayerNorm/nn.Dropout/RMSNorm layers. We were mainly using manual distribute_module calls before when sharding the RMSNorm layer, but I think we should have a dedicated TP API to easily shard those layers, instead of users manually using DTensors. I call this SequenceParallel, which might cause some confusion because we technically "deprecated" a SequenceParallel style months ago. But this time the SequenceParallel style is significantly different from the previous one (which used to shard two consecutive Linear layers). I believe giving it the right name is the first priority, rather than worrying about reusing the old name Pull Request resolved: https://github.com/pytorch/pytorch/pull/121295 Approved by: https://github.com/awgu, https://github.com/tianyu-l ghstack dependencies: #121294	2024-03-07 02:04:59 +00:00
Wanchao Liang	a88356f45c	[dtensor] make add_.Tensor/div_.Scalar to be linear pointwise instead (#121294 ) add_.Tensor and div_.Scalar should support linearity so that we delay the partial results. This fixes the additional collective in the layernorm layer that we have seen Pull Request resolved: https://github.com/pytorch/pytorch/pull/121294 Approved by: https://github.com/tianyu-l	2024-03-06 22:52:18 +00:00
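Why linearity lets the reduction of partial results be delayed can be illustrated with plain numbers: for a linear op, applying it on each rank's partial value and reducing once gives the same result as reducing first. The sketch below is an assumption-laden toy (per-rank values are a Python list, the "all-reduce" is a local sum, and all helper names are hypothetical); it does not use DTensor.

```python
from fractions import Fraction
from typing import Callable, List


def all_reduce_sum(per_rank: List[Fraction]) -> Fraction:
    """Stand-in for a SUM all-reduce over per-rank partial values."""
    return sum(per_rank, Fraction(0))


def reduce_then_apply(per_rank: List[Fraction], op: Callable[[Fraction], Fraction]) -> Fraction:
    # Eager strategy: materialize the full value first, then run the op.
    return op(all_reduce_sum(per_rank))


def apply_then_reduce(per_rank: List[Fraction], op: Callable[[Fraction], Fraction]) -> Fraction:
    # Delayed strategy: run the op on each partial value, reduce later.
    return all_reduce_sum([op(x) for x in per_rank])


partial_x = [Fraction(3), Fraction(5), Fraction(-2)]  # hypothetical partial values on 3 ranks
partial_y = [Fraction(1), Fraction(1), Fraction(4)]


def div_by_4(v: Fraction) -> Fraction:  # linear, like div_.Scalar
    return v / 4


def square(v: Fraction) -> Fraction:  # non-linear counterexample
    return v * v


# Adding two partial operands rank by rank needs one reduction instead of two.
added_then_reduced = all_reduce_sum([x + y for x, y in zip(partial_x, partial_y)])
```

Division by a scalar and elementwise addition commute with the sum, so the partial placement can be kept; squaring does not, so a non-linear op would force the reduction to happen first.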
Andrew Gu	372f192050	[DTensor] Initialized RNG tracker if needed (#121328 ) Since we are already checking if the RNG tracker is initialized, there is no real performance difference between erroring vs. just initializing a default RNG tracker (which we choose to be the `OffsetBasedRNGTracker`). ``` pytest test/distributed/_composable/fsdp/test_fully_shard_init.py -k test_meta ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121328 Approved by: https://github.com/wanchaol ghstack dependencies: #120351	2024-03-06 22:21:44 +00:00
Yifu Wang	d7a5e59647	[dynamo] support group=None when rewriting collectives (#121043 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121043 Approved by: https://github.com/awgu	2024-03-06 21:37:19 +00:00
Andrew Gu	e865700f6a	[FSDP2] Added initial meta-device init support (#120351 ) This PR adds initial support for meta-device initialization for pre-training without loading from a state dict. The idea is to allow `fully_shard(module)` to return and still have sharded parameters on meta device. Then, the user is free to initialize them as they please, e.g. using `to_empty()`. We override `_apply` to achieve the following: - Reshard the parameters to ensure that sharded parameters are registered (for correctness) -- we will always need this - Pad new local tensors and use the padded local tensors (to handle uneven sharding) -- we will remove this once `DTensor` pads its local tensor We use the `swap_tensors` path in `_apply`. For now, this requires setting `torch.__future__.set_swap_module_params_on_conversion(True)`; however, in the future, this may be enabled by default for wrapper subclasses and will not need any explicit API call. If requiring this call is too intrusive in the short term, we can also call it in `_apply` or when importing `fully_shard`. ``` # Pre-training flow (no checkpoint) global_mesh = init_device_mesh(..., mesh_dim_names=("dp", "tp")) dp_mesh, tp_mesh = global_mesh["dp"], global_mesh["tp"] with torch.device("meta"): model = ... parallelize_module(model, tp_mesh, ...) fully_shard(model, mesh=dp_mesh, ...) for param in model.parameters(): assert param.device.type == "meta" model.to_empty(device="cuda") random.manual_seed(42, global_mesh) for module in model.modules(): if hasattr(module, "reset_parameters"): module.reset_parameters() ``` This PR includes some minor changes to allow the user to similarly cast the module to a different dtype after construction time but before forward. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120351 Approved by: https://github.com/wanchaol	2024-03-06 21:18:25 +00:00
Kurman Karabukaev	360761f7d0	[Torchelastic] Create root log directory by default (#121257 ) Summary: After the refactoring in https://github.com/pytorch/pytorch/pull/120691, the default behavior was unintentionally changed from creating a tempdir for logging to not capturing any logs in the torch Elastic Agent. Reverting the behavior to: - making a tempdir when the log dir is not specified - allowing a non-empty root log dir - Note: if the attempt folder exists, it will be pruned here: https://github.com/pytorch/pytorch/blob/main/torch/distributed/elastic/multiprocessing/api.py#L294 Differential Revision: D54531851 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121257 Approved by: https://github.com/d4l3k	2024-03-06 18:50:38 +00:00
PyTorch MergeBot	b529c19bdf	Revert "Batch Norm Consolidation (#116092 )" This reverts commit `5680f565d5`. Reverted https://github.com/pytorch/pytorch/pull/116092 on behalf of https://github.com/jeffdaily due to broke ROCm, PR signal was clean but trunk was not, the merge should have been blocked but wasn't ([comment](https://github.com/pytorch/pytorch/pull/116092#issuecomment-1981373237))	2024-03-06 17:10:01 +00:00
Tugsbayasgalan Manlaibaatar	5680f565d5	Batch Norm Consolidation (#116092 ) Summary: This commit simplifies the existing decomposition hierarchy of batch norm ops by adding a single, backend agnostic op: `batch_norm_with_update`. The existing hierarchy looks like: ``` aten.batch_norm -> aten._batch_norm_impl_index -> [ aten.native_batch_norm -> aten._native_batch_norm_legit (export only) -> _batch_norm_legit_cpu/cuda (kernels, export only) -> _batch_norm_cpu/cuda (kernels) ] OR [ aten.cudnn_batch_norm ] OR [ aten.miopen_batch_norm ] ``` Aside from complexity, an important problem with the above decomposition hierarchy is cuda numerics in export flows. We observed significantly worse convergence when training a mobilenetv2-like model when using the `_batch_norm_cuda` kernel instead of the `cudnn_batch_norm` kernel. This means users who export their models on CPU first then move the models to cuda later may silently see worse accuracies even when cudnn is installed, because they are using the worse kernel. This issue is summarized in https://github.com/pytorch/pytorch/issues/111384. Instead, the new hierarchy proposed by consolidating existing batch norm ops will look like: ``` aten.batch_norm -> aten.batch_norm_with_update -> [ _batch_norm_cpu (kernel) ] OR [ _batch_norm_cuda (kernel) ] OR [ cudnn_batch_norm (kernel) ] OR [ miopen_batch_norm (kernel) ] ``` The new op `batch_norm_with_update` hides backend implementation details and automatically picks the right kernel based on what is installed. This commit also adds the following variants to this op: ``` batch_norm_with_update_functional batch_norm_with_update.out batch_norm_no_update batch_norm_no_update.out batch_norm_backward ``` Note that this commit only adds this op and its variants, but does not actually change the decomps to produce these ops in the graph. This will be done after the 2 week FC window, and the ops used in the old stack is planned to be removed after the 6 month BC window. 
Test Plan: `OpInfo` tests for `batch_norm_with_update`. Reviewers: albanD, bdhirsh Subscribers: albanD, bdhirsh, supriyar Tasks: https://github.com/pytorch/pytorch/issues/111384 Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092 Approved by: https://github.com/bdhirsh, https://github.com/albanD	2024-03-06 04:50:46 +00:00
Chien-Chin Huang	5abf7972d1	[DCP][state_dict] Implement pin_memory and shared_memory copy for _offload_state_dict_to_cpu (#120378 ) Summary This PR extends `_offload_state_dict_to_cpu` to accept a `cpu_offload_state_dict` argument. If `cpu_offload_state_dict` is not None, `_offload_state_dict_to_cpu` will use `copy_` to copy the GPU data to the CPU tensors. This allows users to pass a pin_memory or share_memory version of `cpu_offload_state_dict`. This PR also adds `_create_cpu_state_dict` to allow users to easily create a pin_memory or share_memory cpu state_dict. Performance improvement ``` # The micro-benchmark has a source state_dict with 150 tensors, and each tensor is 50MB. # The micro-benchmark is run on a H100 machine with PCIe 5 cpu_state_dict_2 = _create_cpu_state_dict(state_dict, pin_memory=True) cpu_state_dict_3 = _create_cpu_state_dict(state_dict, share_memory=True) # GPU->CPU memory: 4.6556 seconds cpu_state_dict = _offload_state_dict_to_cpu(state_dict) # GPU->pin memory: 0.1566 seconds _offload_state_dict_to_cpu(state_dict, cpu_offload_state_dict=cpu_state_dict_2) # GPU->shared memory: 0.5509 seconds (variation is quite large) _offload_state_dict_to_cpu(state_dict, cpu_offload_state_dict=cpu_state_dict_3) # GPU->pin memory->shared memory: 0.2550 seconds _offload_state_dict_to_cpu(state_dict, cpu_offload_state_dict=cpu_state_dict_2) _offload_state_dict_to_cpu(cpu_state_dict_2, cpu_offload_state_dict=cpu_state_dict_3) ``` Differential Revision: [D54045845](https://our.internmc.facebook.com/intern/diff/D54045845/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120378 Approved by: https://github.com/LucasLLC	2024-03-05 17:48:15 +00:00
wz337	de8af28083	[FSDP][StateDict] Allow FULL_STATE_DICT option for 2D (#120837 ) Fixes #120722 TL;DR for the issue: As users are expected to use get_model_state_dict to do state_dict retrieval, I think it's fine to remove the warning and RuntimeError. More context in #120722. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120837 Approved by: https://github.com/Skylion007	2024-03-05 10:03:44 +00:00
Wanchao Liang	2e50566722	[dtensor] change distribute_module input/output_fn to accept module (#120895 ) This is a BC breaking change to distribute_module. The underlying rationale for this change is that sometimes in the input_fn/output_fn, users want access to the current module for some attributes. This might not be common, but in some cases it is worth having access to the module. An outstanding use case we want to support is float8: if we want to make float8 work with the TP API, the input_fn/output_fn of TP parallel styles would need access to the module, where the module might encapsulate the `dynamic_linear.emulate` attribute, which is useful for input/output casting. Since this is needed for fp8 and DTensor is still under prototype release, I feel it's worth the change and it's better to make the change as early as possible. Right now this is a soft BC break, which means we still maintain BC but throw deprecation messages. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120895 Approved by: https://github.com/tianyu-l	2024-03-04 07:22:32 +00:00
Lucas Pasqualin	7aced61c46	[DCP] deletes legacy formatting test (#120127 ) Should no longer be necessary Differential Revision: [D53791345](https://our.internmc.facebook.com/intern/diff/D53791345/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120127 Approved by: https://github.com/fegin ghstack dependencies: #119816	2024-03-02 22:04:39 +00:00
IvanKobzarev	bab4b5a341	[dist][sharded_tensor] Fix ChunkShardingSpec metadata offsets for empty shards (#121002 ) ChunkShardingSpec generated metadata where offsets exceed the tensor size. Example: Torchrec prepared ShardedTensorMetadata: ``` ShardedTensorMetadata(shards_metadata=[ ShardMetadata(shard_offsets=[0, 0], shard_sizes=[2, 512], placement=rank:0/cuda:0), ShardMetadata(shard_offsets=[2, 0], shard_sizes=[2, 512], placement=rank:1/cuda:1), ShardMetadata(shard_offsets=[4, 0], shard_sizes=[2, 512], placement=rank:2/cuda:2), ShardMetadata(shard_offsets=[6, 0], shard_sizes=[2, 512], placement=rank:3/cuda:3), ShardMetadata(shard_offsets=[8, 0], shard_sizes=[2, 512], placement=rank:4/cuda:4), ShardMetadata(shard_offsets=[10, 0], shard_sizes=[0, 512], placement=rank:5/cuda:5), ShardMetadata(shard_offsets=[10, 0], shard_sizes=[0, 512], placement=rank:6/cuda:6) ], size=torch.Size([10, 512] ), ``` Calling ShardedTensor._init_from_local_shards_and_global_metadata() ShardedTensor ShardingSpec builds metadata ``` ShardedTensorMetadata(shards_metadata=[ ShardMetadata(shard_offsets=[0, 0], shard_sizes=[2, 512], placement=rank:0/cuda:0), ShardMetadata(shard_offsets=[2, 0], shard_sizes=[2, 512], placement=rank:1/cuda:1), ShardMetadata(shard_offsets=[4, 0], shard_sizes=[2, 512], placement=rank:2/cuda:2), ShardMetadata(shard_offsets=[6, 0], shard_sizes=[2, 512], placement=rank:3/cuda:3), ShardMetadata(shard_offsets=[8, 0], shard_sizes=[2, 512], placement=rank:4/cuda:4), ShardMetadata(shard_offsets=[10, 0], shard_sizes=[0, 512], placement=rank:5/cuda:5), ShardMetadata(shard_offsets=[12, 0], shard_sizes=[0, 512], placement=rank:6/cuda:6) ], size=torch.Size([10, 512]), tensor_properties=TensorProperties(dtype=torch.float16, layout=torch.strided, requires_grad=False, memory_format=torch.contiguous_format, pin_memory=False)) ``` The deduced ChunkShardingSpec: ``` ChunkShardingSpec(dim=0, placements=[rank:0/cuda:0, rank:1/cuda:1, rank:2/cuda:2, rank:3/cuda:3, rank:4/cuda:4, 
rank:5/cuda:5, rank:6/cuda:6]) ``` The fix is to limit offsets by dim size. Differential Revision: [D54419513](https://our.internmc.facebook.com/intern/diff/D54419513) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121002 Approved by: https://github.com/wz337	2024-03-02 08:58:48 +00:00
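The fix named in the commit above ("limit offsets by dim size") can be sketched as follows. This is a minimal illustration, assuming a ceiling-division chunk split, which is consistent with the metadata shown in the commit but is not taken from the `ChunkShardingSpec` source; the function name `chunk_shard_layout` is hypothetical.

```python
from typing import List, Tuple


def chunk_shard_layout(dim_size: int, num_shards: int) -> List[Tuple[int, int]]:
    """Return (offset, size) along the sharding dim for each shard.

    Assumes ceiling-division chunking. Offsets are clamped to dim_size so
    that empty trailing shards never point past the end of the tensor.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    chunk = -(-dim_size // num_shards)  # ceiling division
    layout = []
    for rank in range(num_shards):
        offset = min(rank * chunk, dim_size)  # the clamp is the fix
        size = min(chunk, dim_size - offset)  # 0 for empty shards
        layout.append((offset, size))
    return layout


# The case from the commit message: dim 0 of size 10 split across 7 ranks.
layout = chunk_shard_layout(dim_size=10, num_shards=7)
```

Without the clamp, rank 6 would get offset `6 * 2 = 12`, which exceeds the dimension size of 10 and no longer matches the metadata prepared by the caller.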
Kurman Karabukaev	b0cfa96e82	[Torchelastic][Logging] Pluggable logsspecs using python entrypoints and option to specify one by name. (#120942 ) Summary: Expose an option to users to specify name of the LogsSpec implementation to use. - Has to be defined in entrypoints under `torchrun.logs_specs` group. - Must implement LogsSpec defined in prior PR/diff. Test Plan: unit test+local tests Reviewed By: ezyang Differential Revision: D54180838 Pull Request resolved: https://github.com/pytorch/pytorch/pull/120942 Approved by: https://github.com/ezyang	2024-03-02 08:07:52 +00:00
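Looking up a plugin by name in an entry-point group can be sketched with the standard library's `importlib.metadata`. Only the group name `torchrun.logs_specs` comes from the commit above; the helper name is hypothetical, and this is not the torchrun implementation.

```python
from importlib import metadata
from typing import Optional

LOGS_SPECS_GROUP = "torchrun.logs_specs"  # group name taken from the commit message


def find_logs_specs_entry_point(name: str) -> Optional[metadata.EntryPoint]:
    """Return the entry point registered under the group with this name, if any."""
    eps = metadata.entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: selectable entry points.
        candidates = eps.select(group=LOGS_SPECS_GROUP)
    else:
        # Python 3.8/3.9: a plain dict keyed by group name.
        candidates = eps.get(LOGS_SPECS_GROUP, [])
    for ep in candidates:
        if ep.name == name:
            return ep
    return None
```

A caller would then use `ep.load()` to import the registered class. A package offering its own implementation would declare it under the same group in its packaging metadata.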
Tianyu Liu	af5376c444	[dtensor] add support for loss parallel (#119877 ) Loss parallel is the last piece of sequence parallelism to enable. It enables efficient distributed cross entropy computation when the input is sharded on the class dimension (in a classification problem with many classes). The implementation is a context manager, `loss_parallel`; once it is enabled, users can directly use `torch.nn.functional.cross_entropy` or `torch.nn.CrossEntropyLoss` without modifying other parts of their code. Here is the rationale for these op replacements: 1. `nn.functional.cross_entropy` is the common method that OSS users rely on for things like transformer training. To avoid changing user code, we want users to keep using this function for loss calculation if they already do. 2. `nn.functional.cross_entropy` boils down to `aten.log_softmax` and `aten.nll_loss_forward/backward`, and DTensor already supports those ops (#117723 #119255 #118917 #119256). They compute with the input replicated on the class dimension. 3. However, when the input of this loss calculation is sharded on the class dimension, running the sharded computation efficiently requires running both `aten.log_softmax` and `aten.nll_loss_forward` with multiple all-reduce collectives in the middle of those aten ops. This is not possible if we just override these two ops, so we need some way to decompose them into smaller ops so that collectives can run in between. 4. We explored the existing decompositions (#118950). They seem to work, except that `log_softmax_backward` and `nll_loss_backward` combined together in aten are implemented in an inefficient way, which would trigger an additional expensive collective. Recently some users also reported similar issues: https://github.com/pytorch/pytorch/issues/119261. 5. Therefore, we currently do our own decomposition inside a context manager, specifically for sequence parallelism. Once we have a better decomposition in core, we can possibly take that instead of reinventing the wheel here. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119877 Approved by: https://github.com/wanchaol	2024-03-02 05:06:26 +00:00
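The point in item 3 above, that a class-sharded cross entropy needs collectives in the middle of the computation, can be sketched in plain Python. This is only an illustration, not the DTensor implementation: `sharded_cross_entropy` and `reference_cross_entropy` are hypothetical helper names, each inner list stands in for one rank's shard of the logits, and the two "all-reduce" steps are simulated with ordinary `max` and `sum` over the shards.

```python
import math


def sharded_cross_entropy(logit_shards, target):
    """Cross entropy for one sample whose logits are split across shards
    along the class dimension (each inner list plays the role of one rank)."""
    # Simulated all-reduce (max): every shard contributes its local maximum.
    global_max = max(max(shard) for shard in logit_shards)
    # Simulated all-reduce (sum): every shard contributes its local sum of
    # shifted exponentials, giving the softmax normalizer.
    global_sum = sum(
        sum(math.exp(x - global_max) for x in shard) for shard in logit_shards
    )
    # Only the shard that owns the target class holds the target logit.
    target_logit = 0.0
    offset = 0
    for shard in logit_shards:
        if offset <= target < offset + len(shard):
            target_logit = shard[target - offset]
        offset += len(shard)
    # Negative log-softmax evaluated at the target class.
    return global_max + math.log(global_sum) - target_logit


def reference_cross_entropy(logits, target):
    """Unsharded reference: log-sum-exp minus the target logit."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]
```

Both functions should agree up to floating-point rounding for any split of the class dimension, which is why the sharded path needs only a max-reduce and a sum-reduce rather than a full gather of the logits.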
Yifu Wang	f7a2bae0ac	Change TestOpWaitiness to use MultiProcessTestCase (#121046 ) The test has been failing sporadically in CI recently and the failures are not reproducible locally, likely due to a nasty race condition involving a combination of MultiThreadedTestCase, the use of global state and finalizers, and the recently introduced test decorator for native funcol migration. Switching the test to use MultiProcessTestCase to provide better isolation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121046 Approved by: https://github.com/weifengpy	2024-03-02 01:12:14 +00:00
Andrew Gu	4cf6d1172b	[FSDP2] Used `ReduceOp.AVG` if fp32 reduce-scatter (#120919 ) This PR uses `ncclAvg` op (via `ReduceOp.AVG`) if doing fp32 reduce-scatter. This allows the division by world size to happen in the reduce-scatter kernel itself, which seems to save extra memory read/write for dividing. This yields ~1.5% speedup on the Llama-7B workload (and makes per-parameter FSDP faster than flat-parameter FSDP 😅 ). Pull Request resolved: https://github.com/pytorch/pytorch/pull/120919 Approved by: https://github.com/yifuwang, https://github.com/wanchaol ghstack dependencies: #120238, #120910	2024-03-02 00:39:16 +00:00
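As a rough illustration of why averaging inside the reduce-scatter removes a separate division pass, the following pure-Python simulation compares the two orderings. It does not call NCCL or `torch.distributed`; `reduce_scatter` and `sum_then_divide` are hypothetical helpers, and each inner list stands in for one rank's full gradient buffer.

```python
def reduce_scatter(rank_inputs, op="sum"):
    """Simulate a reduce-scatter: rank r receives the reduced r-th chunk.

    op="avg" divides by the world size inside the reduction itself,
    mimicking what an averaging reduce op does in a single kernel.
    """
    world_size = len(rank_inputs)
    chunk = len(rank_inputs[0]) // world_size
    outputs = []
    for rank in range(world_size):
        lo, hi = rank * chunk, (rank + 1) * chunk
        reduced = [sum(inp[i] for inp in rank_inputs) for i in range(lo, hi)]
        if op == "avg":
            reduced = [v / world_size for v in reduced]
        outputs.append(reduced)
    return outputs


def sum_then_divide(rank_inputs):
    """Baseline: sum reduce-scatter, then a second pass over each output
    to divide by the world size (the extra read/write being avoided)."""
    world_size = len(rank_inputs)
    summed = reduce_scatter(rank_inputs, op="sum")
    return [[v / world_size for v in chunk] for chunk in summed]
```

In this simulation both paths give the same values; the difference the commit describes is that the baseline touches every output element a second time.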
Andrew Gu	7c71d7f32b	[DTensor] Supported `foreach=True` for `clip_grad_norm_` (#120910 ) This PR adds support for `clip_grad_norm_(foreach=True)` by implementing `aten._foreach_norm.Scalar` and `aten._foreach_mul_.Tensor`. `foreach=True` is required to get competitive performance with `DTensor`. `foreach=True` reduces CPU overhead for Llama-7B from 388 ms to 63 ms. Existing flat-parameter FSDP's `clip_grad_norm_` takes 3 ms on CPU 😢 . Pull Request resolved: https://github.com/pytorch/pytorch/pull/120910 Approved by: https://github.com/wanchaol, https://github.com/janeyx99 ghstack dependencies: #120238	2024-03-02 00:28:09 +00:00
Andrew Gu	f0e8e7cf43	[DTensor] Supported `foreach=False` for `clip_grad_norm_` (#120238 ) This PR adds `DTensor` support for `aten.linalg_vector_norm.default` and `aten.stack.default` so that we can run `clip_grad_norm_` (with `foreach=False`). To implement `linalg_vector_norm`, we introduce a `_NormPartial` placement since the reduction op for norm is the norm itself. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120238 Approved by: https://github.com/wanchaol	2024-03-02 00:25:16 +00:00
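The remark that "the reduction op for norm is the norm itself" can be checked numerically with a small sketch. This is not the `_NormPartial` implementation; `vector_norm` and `norm_of_shards` are hypothetical helpers, and the identity shown holds for finite p > 0 (for the infinity norm the corresponding reduction would be a max).

```python
def vector_norm(values, p=2.0):
    """Plain p-norm of a list of floats, for finite p > 0."""
    return sum(abs(v) ** p for v in values) ** (1.0 / p)


def norm_of_shards(shards, p=2.0):
    """Combine per-shard norms by taking the p-norm of the local norms.

    Each shard computes its own norm locally; reducing those partial
    results with the same p-norm recovers the norm of the full vector.
    """
    local_norms = [vector_norm(shard, p) for shard in shards]
    return vector_norm(local_norms, p)
```

For example, shards `[3.0]` and `[4.0]` have local 2-norms 3 and 4, and the 2-norm of `[3, 4]` is 5, matching the norm of the unsharded vector `[3.0, 4.0]`.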
Shuqiang Zhang	c8e56b4965	[c10d] dump from one and only one thread (PG0's monitor thread) (#120893 ) Summary: When there are multiple PGs in a process and a hardware failure happens, we found that multiple PGs/threads in the same process compete to dump the same records at the same time. This affects the reliability of dumps. In this PR, we make the change such that only one thread/PG can dump: PG0's monitor thread. We use a static variable to indicate that something (e.g., a collective timeout) has triggered the dump locally. The monitor thread dumps debug info under any one of three conditions: 1: this static variable is set to true by the watchdog thread when it detects a timeout or pipe dump signal 2: a timeout signal is received from other ranks through tcpstore 3: no heartbeat from the watchdog Test Plan: python test/distributed/test_c10d_nccl.py -k test_timeout_dumps_on_stuck_ranks Pull Request resolved: https://github.com/pytorch/pytorch/pull/120893 Approved by: https://github.com/wconstab	2024-03-02 00:13:13 +00:00
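The single-dumper pattern described above can be sketched with Python threads. The actual change lives in the C++ NCCL process group; everything here (`DumpCoordinator`, `signal_timeout`, `monitor_loop`, `run_demo`) is a hypothetical stand-in, and only condition 1 (a locally set flag) is modeled.

```python
import threading


class DumpCoordinator:
    """Watchdogs only raise a shared flag; one monitor thread does the dump."""

    def __init__(self):
        # Stands in for the static variable set by watchdog threads.
        self._should_dump = threading.Event()
        self.dump_count = 0

    def signal_timeout(self):
        # Called by any watchdog thread. It never dumps by itself.
        self._should_dump.set()

    def monitor_loop(self, wait_seconds=5.0):
        # Run by the single designated monitor thread only.
        if self._should_dump.wait(timeout=wait_seconds):
            self.dump_count += 1  # placeholder for writing the debug info


def run_demo(num_watchdogs=4):
    """Several watchdogs report a timeout; count how many dumps happen."""
    coordinator = DumpCoordinator()
    monitor = threading.Thread(target=coordinator.monitor_loop)
    watchdogs = [
        threading.Thread(target=coordinator.signal_timeout)
        for _ in range(num_watchdogs)
    ]
    monitor.start()
    for thread in watchdogs:
        thread.start()
    for thread in watchdogs:
        thread.join()
    monitor.join()
    return coordinator.dump_count
```

Because the watchdogs only set the flag and a single thread acts on it, the number of dumps does not grow with the number of process groups that detect the failure.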
Will Constable	581fe26792	[C10D] Add ProcessGroup op_id to track ops inside coalescing region (#120745 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120745 Approved by: https://github.com/zdevito	2024-03-01 23:45:43 +00:00
Lucas Pasqualin	9d5dea7812	[DCP] Adds storage reader and planner classes for online loading/sharding of models in torch.save format (#119816 ) as title Differential Revision: [D53718041](https://our.internmc.facebook.com/intern/diff/D53718041/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/119816 Approved by: https://github.com/fegin	2024-03-01 00:21:05 +00:00
Elias Ellison	d03b11ad5b	Pass inductor strides forward in ddp optimizer (#120523 ) Note: Returning Fake Tensors on First AOT Autograd Call. Inductor will optimize strides of outputs when it deems it profitable, for instance by converting to channels last. When we split the graph here into multiple inductor compilations, we need to make sure that the output strides of one compilation are appropriately passed to the subsequent compilations. However, the mapping from inductor output to dynamo output is non-trivial due to aot_autograd's deduping, de-aliasing, mutation, re-writing, subclass handling, etc. In order to replay all this logic, we set a flag such that the first invocation of inductor in aot_autograd will return Fake Tensors with appropriate strides. Then, all of aot autograd's runtime logic is replayed. This gives us the appropriately strided outputs here, which will reflect runtime strides. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120523 Approved by: https://github.com/yf225, https://github.com/bdhirsh	2024-02-29 22:25:00 +00:00
PyTorch MergeBot	76d3a6bb4a	Revert "[C10D] Add ProcessGroup op_id to track ops inside coalescing region (#120745 )" This reverts commit `381a7ad3f1`. Reverted https://github.com/pytorch/pytorch/pull/120745 on behalf of https://github.com/kit1980 due to The new test fails internally, see D54343421 ([comment](https://github.com/pytorch/pytorch/pull/120745#issuecomment-1972047106))	2024-02-29 22:06:13 +00:00


2083 Commits