pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Justin Chu	232b96b6e2	[BE] Enable ruff's UP rules and autoformat distributed/ (#105433 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/105433 Approved by: https://github.com/albanD	2023-07-19 14:27:11 +00:00
Nikita Shulga	5837e95d30	[Reland] Update mypy to 1.4.1 (#105227 ) This PR re-lands - [Typing] Fix PEP 484 Violation (#105022) - Update mypy to 1.4.1 (#91983) which were reverted due to the conflict with internal source repo. Mostly fixes for PEP-484 violation (i.e. when default arg is set to None, but type is not annotated as optional) Plus a few real fixes: - Add missing `_get_upgraders_entry_map` to `torch/_C/__init__.pyi` - Add missing return statement to `torch._export. deserialize_graph` - Fix error message in `torch.ao.ns.fx.weight_utils.get_lstm_mod_weights` - Add assert in `torch/optim/optimizer.py` that Optional list is not None TODO (in followup PR): - Fix erroneous `isinstance` check in `torch/ao/quantization/_pt2e/qat_utils.py` Unrelated, to bypass CI failures due to the gcc9 dependency update in Ubuntu-18.04: - Add hack to squash older libstdc++ from conda environment in favor of the one from the OS to `.ci/docker/install_conda.sh` - Update bazel cuda builds to focal, as with libstdc++-6.0.32 bazel builds lose the ability to catch exceptions (probably because they link with cupti statically, but I could not find where it is done) Pull Request resolved: https://github.com/pytorch/pytorch/pull/105227 Approved by: https://github.com/atalman, https://github.com/albanD, https://github.com/Skylion007	2023-07-15 20:30:20 +00:00
PyTorch MergeBot	15fd1ea118	Revert "[Reland] Update mypy to 1.4.1 (#105227 )" This reverts commit `c9c4f8efc3`. Reverted https://github.com/pytorch/pytorch/pull/105227 on behalf of https://github.com/atalman due to trying to mitigate ci sev #105248 ([comment](https://github.com/pytorch/pytorch/pull/105227#issuecomment-1636510935))	2023-07-14 22:28:35 +00:00
Nikita Shulga	c9c4f8efc3	[Reland] Update mypy to 1.4.1 (#105227 ) This PR re-lands - [Typing] Fix PEP 484 Violation (#105022) - Update mypy to 1.4.1 (#91983) which were reverted due to the conflict with internal source repo. Mostly fixes for PEP-484 violation (i.e. when default arg is set to None, but type is not annotated as optional) Plus a few real fixes: - Add missing `_get_upgraders_entry_map` to `torch/_C/__init__.pyi` - Add missing return statement to `torch._export. deserialize_graph` - Fix error message in `torch.ao.ns.fx.weight_utils.get_lstm_mod_weights` - Add assert in `torch/optim/optimizer.py` that Optional list is not None TODO (in followup PR): - Fix erroneous `isinstance` check in `torch/ao/quantization/_pt2e/qat_utils.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/105227 Approved by: https://github.com/atalman, https://github.com/albanD, https://github.com/Skylion007	2023-07-14 20:45:12 +00:00
PyTorch MergeBot	b4d91b1c5b	Revert "[Typing] Fix PEP 484 Violation (#105022 )" This reverts commit `4148b7bada`. Reverted https://github.com/pytorch/pytorch/pull/105022 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/105022#issuecomment-1635967734))	2023-07-14 14:45:09 +00:00
Nikita Shulga	4148b7bada	[Typing] Fix PEP 484 Violation (#105022 ) Not sure how it worked before, but arguments must be annotated as optional if they default to None. Towards enabling mypy-1.4.1 in lintrunner ### <samp>🤖 Generated by Copilot at 5e1b9f4</samp> > _We annotate the arguments of doom_ > _To show the `None` values of gloom_ > _We improve the type checking and readability_ > _With `Optional` annotations of metal-ity_ Pull Request resolved: https://github.com/pytorch/pytorch/pull/105022 Approved by: https://github.com/izaitsevfb, https://github.com/huydhn, https://github.com/Skylion007	2023-07-12 10:20:48 +00:00
Rohan Varma	a748be93df	[CheckpointWrapper] Warn on reentrant use (#102890 ) We'd like to encourage users to try non-reentrant as much as possible, and identify any gaps this way. Differential Revision: [D46397786](https://our.internmc.facebook.com/intern/diff/D46397786/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/102890 Approved by: https://github.com/awgu	2023-06-04 18:31:22 +00:00
Rohan Varma	957ea485c4	[FSDP/AC] checkpoint_wrapper accept auto_wrap_policy (#102672 ) Some feedback for this API is that folks would like to use auto_wrap_policy similar to FSDP instead of having to adapt to the signature of ``check_fn``. Differential Revision: [D46340320](https://our.internmc.facebook.com/intern/diff/D46340320/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/102672 Approved by: https://github.com/awgu	2023-06-04 18:31:19 +00:00
Dirk Groeneveld	75945d54f7	Properly propagates checkpoint wrapper args and kwargs (#99791 ) It looks like passing `args` and `kwargs` to `checkpoint_wrapper()` does not work because someone forgot some `*`s. This adds them back in. Pull Request resolved: https://github.com/pytorch/pytorch/pull/99791 Approved by: https://github.com/awgu	2023-05-03 23:19:21 +00:00
Rohan Varma	bba2090831	Enable fused optimizer for DP (#98270 ) Differential Revision: [D42714482](https://our.internmc.facebook.com/intern/diff/D42714482/) NOTE FOR REVIEWERS: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D42714482/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/98270 Approved by: https://github.com/awgu	2023-04-13 20:16:32 +00:00
Edward Z. Yang	5a7aad9681	Convert logging f-strings to use % format, part four (#98705 ) This does multi-line concatenated string literals. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/98705 Approved by: https://github.com/voznesenskym	2023-04-11 13:17:59 +00:00
Kazuaki Ishizaki	6514d71add	Fix typos under torch/distributed directory (#98225 ) This PR fixes typos in comments and messages of `.py` files under `torch/distributed` directory Pull Request resolved: https://github.com/pytorch/pytorch/pull/98225 Approved by: https://github.com/soulitzer, https://github.com/kit1980	2023-04-05 00:21:33 +00:00
Edward Z. Yang	5df59f957f	Fix G001,G002,G003 in logs to % syntax (#97812 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/97812 Approved by: https://github.com/Skylion007, https://github.com/kiukchung, https://github.com/malfet, https://github.com/mlazos	2023-04-01 01:43:33 +00:00
Kazuaki Ishizaki	35fd5c548e	Fix typos under torch/distributed directory (#95638 ) This PR fixes typos in comments and messages of `.py` files under torch/distributed directory Pull Request resolved: https://github.com/pytorch/pytorch/pull/95638 Approved by: https://github.com/usamah1, https://github.com/H-Huang, https://github.com/kit1980	2023-03-27 21:13:44 +00:00
Rohan Varma	32f11f58c9	DDP native mixed precision (#92882 ) Implements native mixed precision support for DDP in a similar fashion to how it is enabled for FSDP. The implementation works as follows: 1. In DDP init, we save `_mp_param` and `_fp_param` variables to manage mixed precision parameter usage. In particular, _mp_param will represent the parameter in the reduced precision, while _fp_param will represent the param in regular precision. During forward/backward, we swap back and forth as needed. 2. The root module gets a root pre-forward hook that kicks off copies to the reduced precision for all submodules. An event is recorded for each submodule to allow for waiting, as we run these asynchronously. 3. Each module gets a pre-forward hook that waits on its corresponding event. note that modules might be reused during training, in this case the wait is only done for the first module. After this wait, the module's parameters are in reduced precision. 4. In the pre-forward hook, we register a backward hook on the lower precision parameters in order to run reduced precision allreduce + parameter upcast. We can't rely on the Reducer's constructor setting up these hooks because the gradient is accumulated on the low precision param, so we need to register them ourselves. 5. In the backward pass, when the hook runs, we first run allreduce + divide in the reduced precision. Next, we upcast parameters and gradients back to fp32 asynchronously. We also queue a callback at the end of backward to wait on these upcasts so that the upcast is complete before optim.step() runs. 6. Parameters that don't require grad are also cast since they may be used in computation, they are upcast back in the final autograd callback. 7. DDP Ignored parameters are not touched. Follow-ups: 1. Unify comm hooks and make it work with apply optimizer in backward 2. implement keep_low_precision_grads, 3. allow BN, LN, or custom units to run in reduced precision, 4. support for cast_forward_inputs 5. Unify certain APIs / helpers with FSDP where possible, such as for _cast_forward_inputs 6. Integrate this with replicate() API. 7. The order in which we kick off copies and wait for them is set by the iteration order of module.modules(), but this might not be how the modules are used in the actual training. In the worst case, the last module in module.modules() could be used first which would result in waiting for all copies unnecessarily. For static graphs, we should record the module execution order and copy / wait in this order. 8. Entirely unused modules probably don't need to be cast. Differential Revision: [D42515803](https://our.internmc.facebook.com/intern/diff/D42515803/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/92882 Approved by: https://github.com/zhaojuanmao	2023-03-13 14:10:31 +00:00
Xuehai Pan	5b1cedacde	[BE] [2/3] Rewrite `super()` calls in functorch and torch (#94588 ) Rewrite Python built-in class `super()` calls. Only non-semantic changes should be applied. - #94587 - #94588 - #94592 Also, methods with only a `super()` call are removed: ```diff class MyModule(nn.Module): - def __init__(self): - super().__init__() - def forward(self, ...): ... ``` Some cases that change the semantics should be kept unchanged. E.g.: `f152a79be9/caffe2/python/net_printer.py (L184-L190)` `f152a79be9/test/test_jit_fuser_te.py (L2628-L2635)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/94588 Approved by: https://github.com/ezyang, https://github.com/albanD	2023-02-10 21:16:33 +00:00
Aaron Gokaslan	8fce9a09cd	[BE]: pyupgrade Python to 3.8 - imports and object inheritance only (#94308 ) Apply parts of pyupgrade to torch (starting with the safest changes). This PR only does two things: removes the need to inherit from object and removes unused future imports. Pull Request resolved: https://github.com/pytorch/pytorch/pull/94308 Approved by: https://github.com/ezyang, https://github.com/albanD	2023-02-07 21:10:56 +00:00
Vasiliy Kuznetsov	f15ab8a7f2	AO migration: replace torch internal callsites (#94170 ) Summary: Do the following renames: `torch.quantization` -> `torch.ao.quantization` `torch.nn.quantized` -> `torch.ao.nn.quantized` `torch.nn.quantizable` -> `torch.ao.nn.quantizable` `torch.nn.qat` -> `torch.ao.nn.qat` `torch.nn.intrinsic` -> `torch.ao.nn.intrinsic` And then, do `torch.ao.nn.quantized._reference` -> `torch.ao.nn.quantized.reference` to clean up the aftermath of https://github.com/pytorch/pytorch/pull/84974 Then, manually update `test/test_module_init.py` to fix hanging whitespace due to the replace. Run this script to do the replacements: https://gist.github.com/vkuzo/7f7afebf8c31b9ba48306223e68a1c82 This is for https://github.com/pytorch/pytorch/issues/81667 Test plan: CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/94170 Approved by: https://github.com/jerryzh168	2023-02-07 02:32:23 +00:00
Rohan Varma	e8bf7c21e4	Integrate apply_optim_in_backward with DDP (#89194 ) Allow _apply_optim_in_backward to work with DDP. Example: ``` dist.init_process_group("nccl", rank=rank, world_size=2) torch.cuda.set_device(rank) e = enc().cuda(rank) _apply_optimizer_in_backward( optimizer_class=torch.optim.SGD, params=e.parameters(), optimizer_kwargs={"lr": 0.03}, ) e = DDP(e, device_ids=[rank]) inp = torch.randn(1, 10, device=rank) e(inp).sum().backward() ``` Constraints: 1. Custom communication hook not yet supported 2. _apply_optim_in_backward needs to be called _before_ wrapping model in DDP. 3. DDP will remove the gradient hooks _apply_optim_in_backward registers, so these gradient hooks will not be fired and cannot be used. 4. All DDP managed parameters have grads set to None by default once optimizer is applied. There is no support for setting only some parameter grads to None, this must be done manually by user (and DDP_OVERLAPPED_OPTIM_SET_GRADS_TO_NONE=0 needs to be set.) Differential Revision: [D41329694](https://our.internmc.facebook.com/intern/diff/D41329694/) NOTE FOR REVIEWERS: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D41329694/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/89194 Approved by: https://github.com/zhaojuanmao	2022-12-21 07:35:19 +00:00
Andrew Gu	fc429512d5	[FSDP] Clean up `FlatParamHandle` dtypes, post-backward hook (#90660 ) This PR reworks the internal handling of parameter and gradient reduction mixed precision, cleans up the post-backward hook logic, and adds some minor changes to the communication hooks. Overview This PR addresses everything in https://github.com/pytorch/pytorch/issues/90657 except renaming `keep_low_precision_grads` to `keep_grads_in_reduce_dtype` since that is BC breaking. I recommend reading the issue before proceeding. For `MixedPrecision(param_dtype, reduce_dtype, ...)`, the exact rule for parameter and gradient reduction mixed precision that we are following is: > If `param_dtype is not None` and `reduce_dtype is None`, then we infer `reduce_dtype = param_dtype`. Otherwise, we take `param_dtype` and `reduce_dtype` as is. This PR enforces that, at the `FlatParamHandle` level, `handle._config.fwd_bwd_param_dtype` and `handle._config.reduce_dtype` are never `None`. The way to check if mixed precision is enabled is to compare against the original parameter dtype, which is now stored in `handle._orig_param_dtype`. It is no longer correct to check against `None`. This avoids ambiguous cases such as when the user passes `MixedPrecision(param_dtype=torch.float32)`. In that case, our existing implementation mistakenly thinks that parameter mixed precision is enabled and either relies on no-ops silently or errors (such as one case reported by MosaicML). Additional Details - We remove `FullyShardedDataParallel._mixed_precision_enabled_for_params`, `FullyShardedDataParallel._mixed_precision_enabled_for_reduce`, and `FullyShardedDataParallel._mixed_precision_keep_low_precision_grads` since they are not used. - The unit test `test_meta_device_with_mixed_precision()` exercises a tricky edge case with meta device initialization, `apply()` (calling into `summon_full_params()`), and `param_dtype=torch.float32` for a nested wrapping case, where each nested instance has parameters. - We include some minor fixes/improvements to the communication hook implementation. Follow-Ups - We should get rid of `HandleConfig` and store its fields as attributes on `FlatParamHandle` directly. - Rename `keep_low_precision_grads` to `keep_grads_in_reduce_dtype`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/90660 Approved by: https://github.com/zhaojuanmao	2022-12-13 07:34:59 +00:00
Rohan Varma	a5532929da	Remove DDP import (#89982 ) This import is only used for typing, removing it to avoid circular ref in next diffs Differential Revision: [D41636897](https://our.internmc.facebook.com/intern/diff/D41636897/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89982 Approved by: https://github.com/zhaojuanmao	2022-12-01 14:56:48 +00:00
Kazuaki Ishizaki	2ddefbdc3c	Fix typos used in documents under torch directory (#88300 ) This PR fixes typos, in comments of Python files, that are found from a search box at https://pytorch.org/docs/master/search.html Pull Request resolved: https://github.com/pytorch/pytorch/pull/88300 Approved by: https://github.com/lezcano	2022-11-02 09:38:13 +00:00
Andrew Gu	a2ffc3be97	[AC] Add trailing "." to `_CHECKPOINT_PREFIX` like FSDP (#87951 ) This is for consistency with FSDP. - `_FSDP_WRAPPED_MODULE` and `_CHECKPOINT_WRAPPED_MODULE` are exactly the wrapped module variable name, meaning you can call `getattr(module, _FSDP_WRAPPED_MODULE)` or `getattr(module, _CHECKPOINT_WRAPPED_MODULE)`. - `_FSDP_PREFIX` and `_CHECKPOINT_PREFIX` include the trailing `"."` and are only used for FQNs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87951 Approved by: https://github.com/zhaojuanmao	2022-10-28 22:05:29 +00:00
Andrew Gu	1c37119a1f	[FSDP] New fix for composing with other module wrappers (#87950 ) We change `.module` to pass through `ActivationWrapper` directly to the inner wrapped module. This should fix the state dict issues. Given the invariant that `.module` always returns the inner wrapped module, FSDP always registers the `FlatParameter` on the inner wrapped module, regardless of whether there is an intermediate `ActivationWrapper` or not. This avoids casing on whether `ActivationWrapper` is added before or after FSDP construction. This PR removes the added unit test in `test_fsdp_misc.py` for changing the wrapped module because I would rather not complicate `_lazy_init()` logic just to support that kind of adversarial behavior. The user should not be swapping out the wrapped module arbitrarily or deleting the `FlatParameter`. I mainly had those tests to make sure that all branches of the code I added were correct. Differential Revision: [D40799961](https://our.internmc.facebook.com/intern/diff/D40799961) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87950 Approved by: https://github.com/zhaojuanmao	2022-10-28 21:11:40 +00:00
Andrew Gu	f967918411	[AC] Return `None` from `apply_activation_checkpointing()` (#87871 ) `_recursive_wrap()` returns `Tuple[nn.Module, int]`, where the `nn.Module` is the in-place modified module and the `int` is the numel wrapped. In that sense, the return value is not meant to be publicly used. The `apply_activation_checkpointing()` docs already suggest that the function returns `None`, so this PR simply follows that. Test Plan CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/87871 Approved by: https://github.com/zhaojuanmao	2022-10-28 02:00:39 +00:00
Andrew Gu	04ad0134ae	[FSDP] Use `reduce_scatter_tensor()` (#87240 ) Let us silence some more warnings 👍🏼 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87240 Approved by: https://github.com/rohan-varma	2022-10-24 11:29:23 +00:00
Rohan Varma	bdefa260b2	[RFC] Separate CPU offload activation to its own wrapper (#85459 ) Passing in `offload_to_cpu=True` to checkpoint_wrapper is a bit confusing, because this causes the activation checkpoint args to be ignored and we do CPU offloading. This isn't ideal from API design perspective, so proposing to make `offload_wrapper` its own concept. Now, offload to CPU + checkpoint can be composed together, such as ``` # apply AC to transformer layers apply_ac_wrapper(model, checkpoint_wrapper, check_fn=lambda mod: isinstance(mod, TransformerLayer)) # offload the rest of activations to CPU model = offload_wrapper(model) ``` Will polish / add tests if this proposal sounds good. Differential Revision: [D39719854](https://our.internmc.facebook.com/intern/diff/D39719854/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85459 Approved by: https://github.com/awgu	2022-10-15 05:19:28 +00:00
Rohan Varma	2b5625a726	Update hierarchical_model_averager.py (#85648 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/85648 Approved by: https://github.com/wayi1, https://github.com/H-Huang	2022-10-03 06:15:20 +00:00
Rohan Varma	a8074a1a0b	[Checkpoint] rename apply_ac_wrapper (#85449 ) Per title Differential Revision: [D39714855](https://our.internmc.facebook.com/intern/diff/D39714855/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85449 Approved by: https://github.com/awgu	2022-09-23 21:15:08 +00:00
Rohan Varma	cc64f64670	[Docs] Minor fix to apply_ac doc (#85448 ) Per title Created from CodeHub with https://fburl.com/edit-in-codehub Differential Revision: [D39714530](https://our.internmc.facebook.com/intern/diff/D39714530/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85448 Approved by: https://github.com/awgu	2022-09-23 21:15:08 +00:00
anjali411	85073b8ddc	Add __all__ to fx, distributed and cuda submodules (#85080 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85080 Approved by: https://github.com/albanD	2022-09-21 18:04:58 +00:00
Rohan Varma	8cb7826889	[CheckpointWrapper] Reentrant kwarg support (#84908 ) A temporary patch to support keyword args when reentrant checkpoint wrapper is used. This is needed to unblock some crucial workloads; the ideal fix would be checking this directly into torch.utils.checkpoint. Differential Revision: [D39453453](https://our.internmc.facebook.com/intern/diff/D39453453/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84908 Approved by: https://github.com/awgu	2022-09-15 00:30:23 +00:00
Rohan Varma	55ca6901a7	[CheckpointWrapper] Decouple CPU offload (#84907 ) This fixes the activation offload for checkpoint wrapper, which was previously broken. It was broken because it was tightly coupled with activation checkpoint, i.e. we did: ``` with save_on_cpu: checkpoint(module_forward()) ``` which would not offload any activation tensors to CPU, as those activations would already be not saved by autograd due to the checkpoint implementation taking priority. Now, if `offload_to_cpu` is specified, we only do `save_on_cpu` and no checkpoint, so all intermediate tensors are offloaded to CPU instead of checkpointed. These wrappers can be composed, i.e. if we have `(Linear, Linear) -> (Linear, Linear) -> (Linear, Linear)` we can do `Offload( checkpoint(Linear, Linear) -> checkpoint(Linear, Linear) -> checkpoint(Linear, Linear))` and inner tensors would be checkpointed while outers will be offloaded. Differential Revision: [D39448882](https://our.internmc.facebook.com/intern/diff/D39448882/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84907 Approved by: https://github.com/awgu	2022-09-15 00:30:23 +00:00
Rodrigo Kumpera	38192f63cd	Add __all__ for a few distributed modules plus a little typing (reland) (#84872 ) This handles distributed_c10d, which is massive and ddp_comm_hooks. This relands #84119 with the required fixes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84872 Approved by: https://github.com/rohan-varma	2022-09-13 21:57:49 +00:00
PyTorch MergeBot	219ff26172	Revert "Add __all__ for a few distributed modules plus a little typing (#84119 )" This reverts commit `6f21680563`. Reverted https://github.com/pytorch/pytorch/pull/84119 on behalf of https://github.com/izaitsevfb due to breaking internal builds, see D39386448	2022-09-09 20:01:07 +00:00
Rodrigo Kumpera	6f21680563	Add __all__ for a few distributed modules plus a little typing (#84119 ) This handles distributed_c10d, which is massive and ddp_comm_hooks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84119 Approved by: https://github.com/rohan-varma	2022-09-08 23:28:31 +00:00
Rodrigo Kumpera	65dc5dd3f3	[c10d] Introduce dist.get_local_rank, dist.get_global_rank and dist.get_global_ranks (#82134 ) Those functions enable membership introspection into a ProcessGroup. A common scenario that needs this is library code that consumes a PG but doesn't create it, which means it likely doesn't know the global ranks used to create it. Translating from local to global is necessary when using c10d collectives like broadcast, so if your library code adopts the convention of using local rank 0, it needs to do the following: ```python import torch.distributed as dist my_pg: dist.ProcessGroup = ... def my_library_bcast(tensor): dist.broadcast(tensor, src=dist.get_global_rank(my_pg, 0), group=my_pg) ``` This implements some of the helpers needed to implement the `clone` API from: https://github.com/pytorch/pytorch/issues/81291 Pull Request resolved: https://github.com/pytorch/pytorch/pull/82134 Approved by: https://github.com/rohan-varma	2022-08-30 17:45:00 +00:00
Rohan Varma	1a53e35b9d	Enforce explicit ProcessGroup passed into DefaultState (#84105 ) Would prefer to enforce that users pass in explicit PG into these state objects when using comm hooks with FSDP, so that it is clear and easily debuggable over which processes communication is taking place. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84105 Approved by: https://github.com/mrshenli, https://github.com/zhaojuanmao	2022-08-29 14:52:58 +00:00
PyTorch MergeBot	5cf4542f86	Revert "Enforce explicit ProcessGroup passed into DefaultState (#84105 )" This reverts commit `adc9a1e2fb`. Reverted https://github.com/pytorch/pytorch/pull/84105 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally	2022-08-28 14:30:18 +00:00
Rohan Varma	adc9a1e2fb	Enforce explicit ProcessGroup passed into DefaultState (#84105 ) Would prefer to enforce that users pass in explicit PG into these state objects when using comm hooks with FSDP, so that it is clear and easily debuggable over which processes communication is taking place. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84105 Approved by: https://github.com/mrshenli, https://github.com/zhaojuanmao	2022-08-27 03:12:20 +00:00
Olga Andreeva	f204afc2bb	Added communication hook for sharded cases (#83254 ) Fixes https://github.com/pytorch/pytorch/issues/79114 An implementation of an FSDP communication hook interface for sharded strategies: - Added `reduce_scatter_hook` to default hooks. Note the difference between `reduce_scatter` and `all_reduce`: it requires 2 tensors, `input_gradient` and `output`, and stores the result in `output`, which is then used as a summed gradient shard. - Adjusted FSDP logic to return `reduce_scatter_hook` as a default communication hook for sharded strategies, `DefaultState` is the same for sharded and non-sharded strategies. - Adjusted low-precision hooks to work with both `all_reduce` and `reduce_scatter` depending on whether `output` tensor is provided or not. Test plan: Added all existing sharded strategies as input parameters to existing tests. For `test_default_communication_hook_behaviour`, double checked how a linear layer is sharded across workers. This test creates a simple net ``1 X N``, where ``N`` is the number of workers. For sharded cases, ``N`` parameters are sharded across ``N`` workers. This test checks that after backward, each worker has a proper value in its chunk of the gradient, or the whole gradient on every worker is equal to an expected value. Checked that low-precision tests work for sharded cases. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83254 Approved by: https://github.com/rohan-varma, https://github.com/awgu	2022-08-18 18:41:14 +00:00
joncrall	4618371da5	Integrate xdoctest - Rebased (#82797 ) This is a new version of #15648 based on the latest master branch. Unlike the previous PR where I fixed a lot of the doctests in addition to integrating xdoctest, I'm going to reduce the scope here. I'm simply going to integrate xdoctest, and then I'm going to mark all of the failing tests as "SKIP". This will let xdoctest run on the dashboards, provide some value, and still let the dashboards pass. I'll leave fixing the doctests themselves to another PR. In my initial commit, I do the bare minimum to get something running with failing dashboards. The few tests that I marked as skip are causing segfaults. Running xdoctest results in 293 failed, 201 passed tests. The next commits will be to disable those tests. (unfortunately I don't have a tool that will insert the `#xdoctest: +SKIP` directive over every failing test, so I'm going to do this mostly manually.) Fixes https://github.com/pytorch/pytorch/issues/71105 @ezyang Pull Request resolved: https://github.com/pytorch/pytorch/pull/82797 Approved by: https://github.com/ezyang	2022-08-12 02:08:01 +00:00
Rohan Varma	5b2c03823d	Generalize CheckpointWrapper (#83035 ) Allow checkpoint_wrapper to take in the checkpoint functional impl. This decouples it from torch.utils.checkpoint and allows other checkpoint implementations to be used. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83035 Approved by: https://github.com/awgu	2022-08-09 23:35:39 +00:00
ProGamerGov	71d50f4f89	Change docstring type callable to Callable for consistency (#82487 ) ### Description Across PyTorch's docstrings, both `callable` and `Callable` are used for variable types. `Callable` should be capitalized as we are referring to the `Callable` type, and not the Python `callable()` function. ### Testing There shouldn't be any testing required. Pull Request resolved: https://github.com/pytorch/pytorch/pull/82487 Approved by: https://github.com/albanD	2022-08-01 17:26:09 +00:00
Olga Andreeva	a60907ec11	Adding fsdp fp16 and bf16 hooks (#81711 ) Recently, `register_comm_hook` was introduced to `FSDP`, which at the moment supports only `NO_SHARD` strategy and has a default `all_reduce` hook implemented. This PR adds two lower precision hooks to an existing default hook. I've also made slight adjustments to existing implementation of an `all_reduce` hook including: `AllReduceState` -> `DefaultState`, motivation: `AllReduceState` is not specific to all_reduce. Gradients' pre- and post-division factors are also useful for other hooks, that require pre- and post-division, e.g. `fp16_hook` and `bf16_hook`. I've put all 3 hooks into `default_hooks.py` Additionally, `FSDP` supports `MixedPrecision` and, theoretically, it is possible to specify MixedPrecision for gradients and attach a lower precision hook to the model. To avoid double-casting, I've added a couple of checks to `fully_sharded_data_parallel`, i.e. casting to precision and back is performed by a lower precision hook only. I think, as a next step, it would be nice to ensure that user can't have both lower precision hook and MixedPrecision(reduce_dtype=<precision>) specified, but I am happy to discuss this and adjust current implementation. As a test, I create two models: one with a lower precision hook and one with a `MixedPrecision(reduce_dtype=<precision>)` specified, perform one forward/backward and optimizer step and compare gradients. PS. first version of this PR was reverted, because added unittests didn't include NCCL version checks for `bf16_hook` (thus failed on trunk). In this version, I've added appropriate checks for tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/81711 Approved by: https://github.com/rohan-varma	2022-07-19 23:54:51 +00:00
PyTorch MergeBot	a8f4011e90	Revert "Adding fsdp fp16 and bf16 hooks (#80557 )" This reverts commit `f7d6828467`. Reverted https://github.com/pytorch/pytorch/pull/80557 on behalf of https://github.com/aovladi due to broke distributed tests on trunk	2022-07-19 03:11:19 +00:00
Olga Andreeva	f7d6828467	Adding fsdp fp16 and bf16 hooks (#80557 ) Recently, `register_comm_hook` was introduced to `FSDP`, which at the moment supports only `NO_SHARD` strategy and has a default `all_reduce` hook implemented. This PR adds two lower precision hooks to an existing default hook. I've also made slight adjustments to existing implementation of an `all_reduce` hook including: - `AllReduceState` -> `DefaultState` , motivation: `AllReduceState` is not specific to `all_reduce`. Gradients' pre- and post-division factors are also useful for other hooks, that require pre- and post-division, e.g. fp16_hook and bf16_hook. - I've put all 3 hooks into `default_hooks.py` Additionally, `FSDP` supports `MixedPrecision` and, theoretically, it is possible to specify `MixedPrecision` for gradients and attach a lower precision hook to the model. To avoid double-casting, I've added a couple of checks to `fully_sharded_data_parallel`, i.e. casting to precision and back is performed by a lower precision hook only. I think, as a next step, it would be nice to ensure that user can't have both lower precision hook and `MixedPrecision(reduce_dtype=<precision>)` specified, but I am happy to discuss this and adjust current implementation. As a test, I create two models: one with a lower precision hook and one with a `MixedPrecision(reduce_dtype=<precision>)` specified, perform one forward/backward and optimizer step and compare gradients. Pull Request resolved: https://github.com/pytorch/pytorch/pull/80557 Approved by: https://github.com/rohan-varma	2022-07-18 22:40:56 +00:00
Jerome	547e499731	Enable Zero1's ddp_with_overlap for hpu backend (#80438 ) Enable zero with ddp overlap feature along with a simple interface to insert functional optimizer to the map Signed-off-by: Jerome <janand@habana.ai> Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/80438 Approved by: https://github.com/rohan-varma, https://github.com/awgu	2022-07-18 15:05:27 +00:00
Rohan Varma	0c5fdfd95f	Revert "Revert "[FSDP Optim State] Remove checkpoint prefix (#80480 )"" (#80936 ) This reverts commit `fe361dede4`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/80936 Approved by: https://github.com/awgu	2022-07-06 22:21:07 +00:00
PyTorch MergeBot	fe361dede4	Revert "[FSDP Optim State] Remove checkpoint prefix (#80480 )" This reverts commit `04c50fec1c`. Reverted https://github.com/pytorch/pytorch/pull/80480 on behalf of https://github.com/suo due to Broke master `04c50fec1c`, the test failures were not unrelated	2022-07-06 02:43:27 +00:00
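
Several commits in this listing (#105022, #105227) fix PEP 484 violations, where a parameter defaults to `None` but its annotation is not `Optional`. A minimal sketch of the corrected pattern follows; the `scale` function and its parameters are hypothetical and not taken from the PyTorch source.

```python
from typing import List, Optional


# A signature such as `weights: List[float] = None` relies on implicit Optional,
# which PEP 484 no longer recommends and newer mypy releases reject by default.
# The explicit form below states that None is an accepted value.
def scale(values: List[float], weights: Optional[List[float]] = None) -> List[float]:
    """Multiply each value by its weight; a missing `weights` means all ones."""
    if weights is None:
        weights = [1.0] * len(values)
    # After the None check, `weights` is narrowed to List[float].
    return [v * w for v, w in zip(values, weights)]


print(scale([1.0, 2.0]))
print(scale([1.0, 2.0], [2.0, 3.0]))
```

The `is None` check also gives the type checker what it needs to narrow the type before the list is used.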


220 Commits
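
The logging commits in this listing (#97812, #98705) convert f-string log calls to `%`-style arguments. The sketch below illustrates the difference using only the standard library; the logger name and message are assumptions made for illustration, not code from those PRs.

```python
import io
import logging

# Hypothetical logger that writes to an in-memory stream so the result can be inspected.
logger = logging.getLogger("example.distributed")
stream = io.StringIO()
logger.addHandler(logging.StreamHandler(stream))
logger.setLevel(logging.INFO)
logger.propagate = False

rank, world_size = 0, 2

# Eager formatting: the f-string is built even though DEBUG records are dropped here.
logger.debug(f"rank {rank} of {world_size} ready")

# Lazy formatting: the arguments are interpolated only if the record is emitted.
logger.info("rank %d of %d ready", rank, world_size)

output = stream.getvalue()
print(output, end="")
```

With `%`-style arguments the formatting cost is paid only for records that pass the level check, which is the motivation usually given for lint rules such as G001 to G003.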