Commit Graph

26 Commits

Author SHA1 Message Date
Rohit Singh Rathaur
c4565c3b94 [distributed] Replace 164 assert statements in fsdp directory (#165235)
Replace assert statements with explicit if/raise patterns across 20 files:
- _optim_utils.py (38 asserts)
- _flat_param.py (25 asserts)
- _fully_shard/_fsdp_param.py (23 asserts)
- sharded_grad_scaler.py (12 asserts)
- fully_sharded_data_parallel.py (11 asserts)
- wrap.py (10 asserts)
- _state_dict_utils.py (9 asserts)
- _fully_shard/_fsdp_param_group.py (8 asserts)
- _runtime_utils.py (6 asserts)
- _init_utils.py (6 asserts)
- 10 additional files (16 asserts)

This ensures the checks cannot be silently disabled by running Python with the `-O` flag, which strips `assert` statements.
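A minimal sketch of the replacement pattern (function, variable, and message are illustrative, not taken from the diff):

```python
def check_materialized(param):
    # Before (silently stripped under `python -O`):
    # assert param is not None, "expected a materialized parameter"

    # After: the check always executes, regardless of interpreter flags
    if param is None:
        raise AssertionError("expected a materialized parameter")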

Partially fixes #164878

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165235
Approved by: https://github.com/albanD
2025-10-14 18:04:57 +00:00
Xuehai Pan
3fd84a8592 [BE][PYFMT] migrate PYFMT for torch/[a-c]*/ to ruff format (#144554)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144554
Approved by: https://github.com/soulitzer
2025-07-03 18:56:07 +00:00
Xuehai Pan
995df34b19 [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144547
Approved by: https://github.com/kwen2501
2025-02-28 07:35:56 +00:00
Aaron Orenstein
c64e657632 PEP585 update - torch/distributed/fsdp (#145162)
See #145101 for details.
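For illustration, a typical PEP 585 migration looks like the following (example code, not from this diff):

```python
# Before: typing-module generics
# from typing import Dict, List, Tuple
# def shard_sizes(ranks: List[int]) -> Dict[int, Tuple[int, int]]: ...

# After: builtin generics, no typing imports needed (Python 3.9+)
def shard_sizes(ranks: list[int]) -> dict[int, tuple[int, int]]:
    return {r: (r, r + 1) for r in ranks}
```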

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145162
Approved by: https://github.com/bobrenjc93
2025-01-19 20:04:05 +00:00
bobrenjc93
08be9ec312 Migrate from Tuple -> tuple in torch/distributed (#144258)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144258
Approved by: https://github.com/aorenste
2025-01-10 08:34:54 +00:00
lzhang2
1886e33f60 Use device-agnostic runtime API in distributed DDP/FSDP instead of cuda device specific. (#137678)
# Motivation
This PR switches distributed DDP/FSDP to device-agnostic runtime APIs instead of `cuda`-specific ones.
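A minimal sketch of the device-agnostic pattern, assuming `torch.get_device_module` (available in recent PyTorch releases); this is illustrative, not the PR's actual diff:

```python
import torch

def synchronize_device(device: torch.device) -> None:
    # Look up the backend module for whatever accelerator the tensors live
    # on (torch.cuda, torch.xpu, ...) instead of hard-coding torch.cuda.
    backend = torch.get_device_module(device)
    backend.synchronize(device)
```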

cc [@jgong5](https://github.com/jgong5) [@gujinghui](https://github.com/gujinghui) [@EikanWang](https://github.com/EikanWang) [@fengyuan14](https://github.com/fengyuan14) [@guangyey](https://github.com/guangyey)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137678
Approved by: https://github.com/kwen2501, https://github.com/guangyey, https://github.com/jgong5
2024-11-13 05:32:19 +00:00
CaoE
f96a073c9d Use _amp_foreach_non_finite_check_and_unscale_ for CPU grads of ShardedGradScaler (#135232)
Use `_amp_foreach_non_finite_check_and_unscale_` instead of the fallback version for CPU grads of `ShardedGradScaler`, as `_amp_foreach_non_finite_check_and_unscale_` has been supported on CPU since https://github.com/pytorch/pytorch/pull/109281.
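A hedged sketch of calling the fused op on CPU tensors (the op is a private ATen function; treat this as illustrative):

```python
import torch

grads = [torch.randn(8), torch.randn(8)]  # CPU gradients
found_inf = torch.zeros(1)                # set to 1.0 if any inf/nan found
inv_scale = torch.full((1,), 0.5)         # 1 / current scale

# Unscales each grad in place and records non-finite values in found_inf
torch._amp_foreach_non_finite_check_and_unscale_(grads, found_inf, inv_scale)
```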

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135232
Approved by: https://github.com/ezyang
2024-09-14 09:53:17 +00:00
Jessica Vandebon
68b1a09422 Integrate device agnostic APIs in FSDP library [1/n] (#134337)
Summary: For MTIA FSDP support, we need to ensure the FSDP library code handles accelerator devices not limited to CUDA.

Test Plan: CI

Reviewed By: hanzlfs

Differential Revision: D60587415

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134337
Approved by: https://github.com/LucasLLC, https://github.com/awgu
2024-08-27 17:31:11 +00:00
Xuehai Pan
3b798df853 [BE][Easy] enable UFMT for torch/distributed/{fsdp,optim,rpc}/ (#128869)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128869
Approved by: https://github.com/fegin
ghstack dependencies: #128868
2024-06-18 21:49:08 +00:00
Aaron Orenstein
7c12cc7ce4 Flip default value for mypy disallow_untyped_defs [6/11] (#127843)
See #127836 for details.
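As a hedged illustration of what the flipped default means for a module (the PR's exact mechanism is described in #127836):

```python
# mypy: allow-untyped-defs
# With disallow_untyped_defs on by default, a not-yet-typed module opts out
# via the inline config comment above; otherwise every def needs annotations:

def unscale(grad):  # flagged without the opt-out comment
    return grad

def unscale_typed(grad: float) -> float:  # always fine
    return grad
```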

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127843
Approved by: https://github.com/oulgen
ghstack dependencies: #127842
2024-06-08 18:49:29 +00:00
daitian1995
f6366454db Add privateuse1 in FSDP's sharded grad scaler (#126971)
1. Add privateuse1 support in FSDP's sharded grad scaler
2. Support `found_inf` copy for more devices
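A hedged sketch of the kind of device check this enables (hypothetical helper; `torch._C._get_privateuse1_backend_name` is the runtime lookup for the custom backend's registered name):

```python
import torch

def _is_supported_device(tensor: torch.Tensor) -> bool:
    # Accept cuda, cpu, and the registered privateuse1
    # (custom out-of-tree) backend.
    custom = torch._C._get_privateuse1_backend_name()
    return tensor.device.type in ("cuda", "cpu", custom)
```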

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126971
Approved by: https://github.com/awgu, https://github.com/weifengpy
2024-05-24 05:54:25 +00:00
Chirag Pandya
b6201a60c5 [BE] minor logging cleanup in distributed (#122921)
Summary: Minor logging cleanup in the distributed library:
1. Don't use f-strings in log calls; this addresses linter issues and defers formatting (see the sketch after this list).
2. Nit: make use of the unused `e` (error) in a few logs.
3. Change info -> debug as requested in issue #113545.
4. Nit: rename `log` -> `logger` in a few files for consistency.
5. Fix a linter error.
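A short illustration of the f-string point in item 1 (example code, not from the diff):

```python
import logging

logger = logging.getLogger(__name__)
rank = 0

# f-string: formats eagerly, even if the record is filtered out
# logger.debug(f"rank {rank} finished unscale")

# %-style: formatting is deferred until the record is actually emitted
logger.debug("rank %s finished unscale", rank)
```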

Test Plan:
1. Local build passes.
2. Linter is happy.

Reviewers: wanchaol

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122921
Approved by: https://github.com/wanchaol
2024-03-29 03:34:01 +00:00
Jeeja KP
79569d117d Add hpu device support in storage/resize (#119761)
Add hpu device support to:
- the storage `resize_` method
- `is_supported_device` for FSDP
- storage generally

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119761
Approved by: https://github.com/mikaylagawarecki
2024-02-17 01:04:27 +00:00
CaoE
bacbad5bc9 add GradScaler on CPU (#109993)
Step 2 of https://github.com/pytorch/pytorch/issues/111559.
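With this step, the generic scaler can target CPU; a hedged usage sketch (constructor form per `torch.amp`, treat the details as assumptions):

```python
import torch

# GradScaler parameterized by device type rather than being CUDA-only
scaler = torch.amp.GradScaler("cpu")
# loss = model(x).sum()
# scaler.scale(loss).backward()
# scaler.step(optimizer); scaler.update()
```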

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109993
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-01-29 23:42:35 +00:00
Wei (Will) Feng
e4e80dc9b3 [FSDP] sharded grad scaler: copy found_inf after waiting on async reduce_all (#115710)
**Expected behavior**: when rank 0 has an inf grad, ranks 1...k should get `found_inf=1` after `dist.all_reduce`.
**Bug addressed in this PR**: for CPU-offloaded `param.grad`, when rank 0 had an inf, ranks 1...k would not get `found_inf=1`. This is because `found_inf` was copied before `future.wait` on the async `dist.all_reduce`.
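A hedged sketch of the ordering fix (helper name is hypothetical; assumes an initialized process group):

```python
import torch
import torch.distributed as dist

def sync_found_inf(found_inf_device: torch.Tensor, found_inf_cpu: torch.Tensor) -> None:
    # Kick off the async all_reduce, wait for it to complete, and only
    # then copy into the CPU-offloaded tensor. Copying before wait() is
    # the bug this PR fixes.
    work = dist.all_reduce(found_inf_device, async_op=True)
    work.wait()
    found_inf_cpu.copy_(found_inf_device)
```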

Repro the bug with the newly added unit test: `pytest test/distributed/fsdp/test_fsdp_sharded_grad_scaler.py -k test_sharded_grad_scaler_found_inf`

```
  File "/data/users/weif/pytorch/test/distributed/fsdp/test_fsdp_sharded_grad_scaler.py", line 320, in _test_sharded_grad_scaler_found_inf
    self.assertEqual(
  File "/data/users/weif/pytorch/torch/testing/_internal/common_utils.py", line 3576, in assertEqual
    raise error_metas.pop()[0].to_error(
AssertionError: Scalars are not close!

Expected 1.0 but got 2.0.
Absolute difference: 1.0 (up to 1e-05 allowed)
Relative difference: 1.0 (up to 1.3e-06 allowed)
rank: 0 iter: 0 expect origin scale 2.0 to be backed off by 0.5 but got 2.0
```

Verify the bug is fixed: `pytest test/distributed/fsdp/test_fsdp_sharded_grad_scaler.py -k test_sharded_grad_scaler_found_inf`

```
test/distributed/fsdp/test_fsdp_sharded_grad_scaler.py dist init r=1, world=8
dist init r=3, world=8
dist init r=7, world=8
dist init r=4, world=8
dist init r=6, world=8
dist init r=2, world=8
dist init r=0, world=8
dist init r=5, world=8
NCCL version 2.19.3+cuda12.0
.                                                                                                                 [100%]

====================================================================== 1 passed, 19 deselected in 27.43s =========================

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115710
Approved by: https://github.com/awgu
2024-01-10 19:17:25 +00:00
Matthew Hoffman
e40d6ae0a7 Improve torch.cuda.amp type hints (#108630)
Fixes #108629

1. Add the following to their modules' `__all__` so that pyright considers them to be publicly exported:
* [`torch.autocast`](https://pytorch.org/docs/stable/amp.html#torch.autocast)
* [`torch.cuda.amp.GradScaler`](https://pytorch.org/docs/stable/amp.html#torch.cuda.amp.GradScaler)
* [`torch.cuda.amp.autocast`](https://pytorch.org/docs/stable/amp.html#torch.cuda.amp.autocast)
* [`torch.cuda.amp.custom_fwd`](https://pytorch.org/docs/stable/amp.html#torch.cuda.amp.custom_fwd)
* [`torch.cuda.amp.custom_bwd`](https://pytorch.org/docs/stable/amp.html#torch.cuda.amp.custom_bwd)
2. Add `overload`s for `torch.cuda.amp.GradScaler.scale` to differentiate whether a `torch.Tensor` or an `Iterable[torch.Tensor]` is returned, based on the type of the `outputs` parameter.
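A hedged sketch of the overloads in item 2 (signatures illustrative, not copied from the stubs):

```python
from typing import Iterable, overload

import torch

class GradScaler:
    @overload
    def scale(self, outputs: torch.Tensor) -> torch.Tensor: ...
    @overload
    def scale(self, outputs: Iterable[torch.Tensor]) -> list[torch.Tensor]: ...
    def scale(self, outputs):
        # Illustrative runtime body: scale a tensor, or each tensor in turn
        if isinstance(outputs, torch.Tensor):
            return outputs * 2.0
        return [t * 2.0 for t in outputs]
```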

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108630
Approved by: https://github.com/ezyang
2023-09-08 06:06:25 +00:00
Rohan Varma
4d516f44a1 [FSDP][ez] Type optimizer correctly (#102637)
In `ShardedGradScaler`, the optimizer doesn't have to be `SGD`, so annotate it with the base `Optimizer` class.
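A one-line illustration of the corrected annotation (hypothetical function, not the PR's code):

```python
import torch

# Annotate with the base class so any optimizer is accepted, not just SGD
def scaler_step(optimizer: torch.optim.Optimizer) -> None:
    optimizer.step()
```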

Differential Revision: [D46327103](https://our.internmc.facebook.com/intern/diff/D46327103/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102637
Approved by: https://github.com/Skylion007, https://github.com/awgu, https://github.com/fegin
2023-06-04 18:31:13 +00:00
Daniel Dale
477ca1789c Avoid elementwise dispatch of gradient unscaling/validation ops in _foreach_non_finite_check_and_unscale_cpu_ (#100108)
Fixes [#82206](https://github.com/pytorch/pytorch/issues/82206)

When executing a `ShardedGradScaler` step in the context of `cpu_offload`, [the function](ecd2c71871/torch/distributed/fsdp/sharded_grad_scaler.py (L151-L152)) `_foreach_non_finite_check_and_unscale_cpu_` is grindingly slow. This issue is due to the elementwise op dispatching/redispatching/execution that is engendered by the current approach to gradient tensor validation:
ecd2c71871/torch/distributed/fsdp/sharded_grad_scaler.py (L159-L163)

The subsequent `isinf` and `isnan` checks with associated `any` checks result in unscalable elementwise op dispatches:
ecd2c71871/torch/distributed/fsdp/sharded_grad_scaler.py (L173-L181)

This inefficiency is of course hidden in the current FSDP tests given their (appropriately) trivial parameter dimensionality. In the perf analysis below, the example test configures only the final `Linear(4, 8)` module parameters to require grad, so there are 40 elements to iterate through. However, if one increases the dimensionality to a still-modest 320008 elements (changing the final module to `Linear(40000, 8)`), the execution time/cpu cost of the test is dominated by the elementwise op dispatching/redispatching/execution of the `any` validation ops in this function.

To characterize the current behavior, I use a slightly modified version of an existing `ShardedGradScaler` test [^1]. The following modifications to the test are made to allow the analysis:

1. Run just `CUDAInitMode.CUDA_BEFORE` for clarity instead of additional scenarios
2. Increase the final module to `Linear(40000, 8)` (along with modifying the preceding module to make the dimensions work)
3. For the cProfile run (but not valgrind or perf) the test runs just a single [`_train_for_several_steps`](ecd2c71871/torch/testing/_internal/common_fsdp.py (L926-L934)) step per rank (instead of 2 steps)
4. I temporarily reduce `init_scale` further to ensure we don't hit any `inf`s, which would short-circuit our analysis

### Current behavior

The most relevant call subgraph:
![callgrind_subgraph_elementwise_dispatch](https://user-images.githubusercontent.com/7462936/234656744-b7ca81b2-ce5b-4035-9918-0ad57d3689d3.png)

Note that:

1. Instead of dispatching to the relevant autograd op and then redispatching to the relevant CPU op implementation 8 times per test (2 train steps x 2 `any` calls per parameter per step x 2 orig parameters), we (I believe unnecessarily) invoke the relevant dispatch flow elementwise, so 640016 times! (only 1 node in this trace, so 320008 elements/2 x 2 train steps x 2 calls per element per step)
2. Nearly 50% of the relative (inclusive) instruction reads for the entire test in `callgrind` are executed by the `isnan` (320008 execs), `isinf` (320008 execs) and `any` (640016 execs) calls.
3. The `any` pre-dispatch entry point IRs (`torch::autograd::THPVariable_any`) vs actual op implementation IRs (`at::native::structured_any_all_out::impl`) are below to give one a sense of the relative dispatch and op execution cost in an elementwise context[^3].
![THPVariable_any_op_elementwise_dispatch_absolute_IR](https://user-images.githubusercontent.com/7462936/234656886-3c017ee3-8a04-4a7d-bdf8-6c690de42c92.png)
![structured_any_all_out_impl_absolute_IR](https://user-images.githubusercontent.com/7462936/234656915-0b203bb7-bd05-4ceb-a38b-67b0d4862aa7.png)

Using cprofile stats:

```bash
python -c "import pstats; stats=pstats.Stats('/tmp/fsdp_cprofile_8wa9uw39.stats'); stats.print_stats()"
...
ncalls  tottime  	percall  	cumtime  	percall  filename:lineno(function)
1   	20.159   	20.159   	66.805   	66.805 	 torch/distributed/fsdp/sharded_grad_scaler.py:151(_foreach_non_finite_check_and_unscale_cpu_)
160004  18.427    	0.000   	18.427    	0.000 	 {built-in method torch.isinf}
160004  6.026    	0.000    	6.026    	0.000 	 {built-in method torch.isnan}
```
We see that a single step of the scaler runs for more than a minute. Though there is non-trivial cProfile overhead, we can infer from this that per-element op dispatches/executions are on the order of 100ns.

On the order of 100 nanoseconds per dispatch is acceptable if we're using typical tensor access patterns, but if we're dispatching each element for each op, obviously everything is going to come to a grinding halt for many practical use cases.

(Given the cost of this function is currently O(n) in the number of gradient elements, feel free to set `TORCH_SHOW_DISPATCH_TRACE=1` if you want to make this function cry 🤣)

I've attached a flamegraph at the bottom of the PR[^2] that more intuitively demonstrates the manner and extent of resource consumption attributable to this function with just a modest number of gradient elements.

### After the loop refactor in this PR:

The most relevant call subgraph:
![callgrind_subgraph_elementwise_dispatch_fix](https://user-images.githubusercontent.com/7462936/234657001-0a448756-b4ce-468e-9f91-1d21597df057.png)

Note that:

1. Less than 0.4% of the relative (inclusive) instruction reads for the entire test in `callgrind` are executed by the `isnan` (4 execs), `isinf` (4 execs) and `any` (8 execs) calls (versus ~50% and 320008, 320008, 640016 respectively above)
2. The `any` pre-dispatch entry point IRs (`torch::autograd::THPVariable_any`) vs actual op implementation IRs (`at::native::structured_any_all_out::impl`) reflect far less overhead (of secondary importance to item number 1)
![THPVariable_any_op_elementwise_dispatch_absolute_IR_fix](https://user-images.githubusercontent.com/7462936/234659454-b1e262cf-d291-4d44-aff2-e27efe284e9c.png)
![structured_any_all_out_impl_absolute_IR_fix](https://user-images.githubusercontent.com/7462936/234657154-91fa7cb8-e39e-48c7-abf0-cc58f06c0ae1.png)

Using cprofile stats:

```bash
python -c "import pstats; stats=pstats.Stats('/tmp/fsdp_cprofile_pfap7nwk.stats'); stats.print_stats()"
...
ncalls  tottime  	percall  	cumtime  	percall  	filename:lineno(function)
1    	0.013    	0.013    	0.109    	0.109 		torch/distributed/fsdp/sharded_grad_scaler.py:151(_foreach_non_finite_check_and_unscale_cpu_)
2    	0.022    	0.011    	0.022    	0.011 		{built-in method torch.isinf}
2    	0.018    	0.009    	0.018    	0.009 		{built-in method torch.isnan}
```
We can see our function runtime has dropped from more than a minute to ~100ms.
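A hedged before/after sketch of the loop refactor (illustrative shapes and names, not the actual diff):

```python
import torch

grads = [torch.randn(1000), torch.randn(1000)]
found_inf = torch.zeros(1)

# Before: effectively per-element -- one dispatch chain per grad element,
# i.e. O(n) dispatches for n gradient elements
# for grad in grads:
#     for x in grad.view(-1):
#         if torch.isinf(x).any() or torch.isnan(x).any():
#             found_inf.fill_(1.0)

# After: per-tensor -- a handful of dispatches regardless of element count
for grad in grads:
    if torch.isinf(grad).any() or torch.isnan(grad).any():
        found_inf.fill_(1.0)
```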

### Assumptions associated with this loop refactor:

The key assumptions here are:

1. The grads are always on CPU in this function so any MTA-safe constraints ([`can_use_fast_route`](efc3887ea5/aten/src/ATen/native/cuda/AmpKernels.cu (L110-L111)) relating to the relevant CUDA kernel path selection, i.e. slower `TensorIterator` gpu kernel vs `multi_tensor_apply_kernel`) do not apply in this context

2. We've already filtered by dtype and device and can assume the presence of a single CPU device. Unless one manually creates separate CPU devices with non-default indexes (which I don't think FSDP supports, and which should be validated prior to this function), device equality should always be `True` for `cpu` type devices, so we just need to check that the current device is of `cpu` type.[^4]

![elementwise_dispatch](https://user-images.githubusercontent.com/7462936/234660413-8c96ef90-7a23-4307-b8ed-c1fbf932f1e9.svg)

[^1]: `TestShardedGradScalerParityWithDDP.test_fsdp_ddp_parity_with_grad_scaler_offload_true_none_mixed_precision_use_orig_params` test in `test/distributed/fsdp/test_fsdp_sharded_grad_scaler.py`
[^2]: Note the native frame stacks for `torch::autograd::THPVariable_isinf`, `torch::autograd::THPVariable_isnan`, `torch::autograd::THPVariable_any` in particular.
[^3]: There's more `TensorIterator` etc. setup overhead further up the stack beyond `structured_any_all_out`, but roughly speaking
[^4]: Device equality is based on [type and index combination](efc3887ea5/c10/core/Device.h (L47-L51)), CPU device type is -1 by default (`None` on the python side) and is intended to [always be 0](cf21240f67/c10/core/Device.h (L29)) if set explicitly. Though technically, unless in debug mode, this constraint isn't [actually validated](bb4e9e9124/c10/core/Device.h (L171-L184)), so one can actually manually create separate `cpu` devices with invalid indices. I suspect it's safe to ignore that potential incorrect/unusual configuration in this context but let me know if you'd like to add another `cpu` device equality check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100108
Approved by: https://github.com/awgu
2023-04-27 21:33:27 +00:00
Daniel Dale
157c869026 Enable FSDP `use_orig_params=True` mixed precision training when some ranks have no (non-zero sized) parameter shards (#99175)
Fixes #99174

## Enable FSDP ``use_orig_params=True`` mixed precision training when some ranks have no (non-zero sized) parameter shards

### The issue

Now that ``use_orig_params=True`` allows non-uniform ``requires_grad`` (🎉 🚀 thanks @awgu!!!) with [#98221](https://github.com/pytorch/pytorch/pull/98221), there will be circumstances wherein some ranks have no (non-zero sized) local shards of the original parameters (and hence no associated gradients).

### Use Cases
For a simple Transformer case, imagine a user wraps all encoder layers in separate FSDP instances but allows the classifier head to be wrapped in the same FSDP instance as the relatively large embeddings layers. While this is a sub-optimal wrapping strategy for most use-cases, I believe it is expected to be supported (full precision training works in that context).

I originally encountered this issue while extending a package I maintain, leveraging the relaxed ``requires_grad`` constraint to simplify multi-phase scheduled fine-tuning FSDP configuration, so a [concrete example is there](https://finetuning-scheduler.readthedocs.io/en/latest/advanced/fsdp_scheduled_fine_tuning.html#basic-scheduled-fine-tuning-with-fsdp).

### Reproduction and Remediation
Currently, ``ShardedGradScaler`` does not accommodate these situations, failing to initialize ``optimizer_state["found_inf_per_device"]`` when ``unscale_`` is called.

In this PR, I extend the existing ``ShardedGradScaler`` tests with an ``use_orig_params=True`` dimension added to the parameterization and test scenarios wherein one rank possesses no (non-zero sized) parameter shards.

The relevant issue can be reproduced with the tests I'm adding in this PR. The current (pre-PR) execution of these tests fail in ``use_orig_params=True`` mode with this error:

```python
./test_fsdp_sharded_grad_scaler.py::TestShardedGradScalerParityWithDDP::test_fsdp_ddp_parity_with_grad_scaler_offload_false_none_mixed_precision_use_orig_params Failed with Error: Process 0 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/home/speediedan/repos/pytorch/torch/testing/_internal/common_distributed.py", line 657, in run_test
    getattr(self, test_name)()
  File "/home/speediedan/repos/pytorch/torch/testing/_internal/common_distributed.py", line 543, in wrapper
    fn()
  File "/home/speediedan/repos/pytorch/torch/testing/_internal/common_utils.py", line 259, in instantiated_test
    test(self, **param_kwargs)
  File "/home/speediedan/repos/pytorch/torch/testing/_internal/common_distributed.py", line 174, in wrapper
    return func(*args, **kwargs)
  File "/home/speediedan/repos/pytorch/test/distributed/fsdp/test_fsdp_sharded_grad_scaler.py", line 187, in test_fsdp_ddp_parity_with_grad_scaler
    self._test_fsdp_parity(
  File "/home/speediedan/repos/pytorch/torch/testing/_internal/common_fsdp.py", line 1152, in _test_fsdp_parity
    fsdp_loss = self._train_for_several_steps(
  File "/home/speediedan/repos/pytorch/torch/testing/_internal/common_fsdp.py", line 1016, in _train_for_several_steps
    sharded_grad_scaler.step(optim)
  File "/home/speediedan/repos/pytorch/torch/distributed/fsdp/sharded_grad_scaler.py", line 291, in step
    return super().step(optimizer, *args, **kwargs)
  File "/home/speediedan/repos/pytorch/torch/cuda/amp/grad_scaler.py", line 368, in step
    assert len(optimizer_state["found_inf_per_device"]) > 0, "No inf checks were recorded for this optimizer."
AssertionError: No inf checks were recorded for this optimizer.
```

A few implementation notes/considerations and questions:

1. Rather than just initializing ``per_device_found_inf``, one could disable the grad scaler altogether for the relevant ranks, altering ``unscale_`` to reduce over a subgroup or some rank-mask construct so that the ``all_reduce`` calls in ``distributed/fsdp/sharded_grad_scaler.py:unscale_()`` don't hang. Given that users may subsequently add parameter groups to an optimizer that would require re-enabling the scaler, and the complexity of maintaining a separate mask construct or process subgroup, I thought this implementation was cleaner (see the sketch after this list).
2. I extended ``_train_for_several_steps`` and ``_test_fsdp_parity`` in ``/torch/testing/_internal/common_fsdp.py`` with the ability to configure ``sharded_grad_scaler_kwargs`` for future testing flexibility.
3. Should the user be warned that no parameter shards were associated with a given rank? My initial thought is that this should be considered an implementation detail, part of supporting ``use_orig_params`` with heterogeneous ``requires_grad``, and therefore should be transparently handled by PyTorch. Should a DEBUG level message be added? If so, likely further upstream rather than at the scaler step level.
4. Rather than extend the existing ``ShardedGradScaler`` tests with an ``use_orig_params=True`` dimension added to the parameterization, let me know if you prefer that I instead narrow the scope of the new testing to a single additional test, e.g.:
	```python
	# from typing import Optional
	from typing import Optional, List
	# ...
	# use_orig_params = ["enable_use_orig_params", None]
	use_orig_params: List[Optional[str]] = [None]
	# ...
	configs = list(itertools.product(cpu_offload_config, sharding_strategy_config, mixed_precision, use_orig_params))
	configs.append((CPUOffload(offload_params=False), None, "enable_mixed_precision", "enable_use_orig_params"))
	```
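Regarding note 1 above, a hedged sketch of the initialization-based remediation (hypothetical helper, not the PR's actual code):

```python
import torch

def ensure_inf_checks_recorded(optimizer_state: dict, device: torch.device) -> None:
    # Ranks holding no (non-zero sized) parameter shards still record a
    # per-device found_inf entry, so GradScaler.step's
    # "No inf checks were recorded" assertion is satisfied.
    per_device = optimizer_state.setdefault("found_inf_per_device", {})
    if not per_device:
        per_device[device] = torch.zeros((1,), device=device)
```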
Thanks as always to the PyTorch distributed team for your astonishingly impressive and valuable contributions to the open-source ML engineering community!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99175
Approved by: https://github.com/awgu
2023-04-15 05:13:23 +00:00
Edward Z. Yang
5df59f957f Fix G001,G002,G003 in logs to % syntax (#97812)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97812
Approved by: https://github.com/Skylion007, https://github.com/kiukchung, https://github.com/malfet, https://github.com/mlazos
2023-04-01 01:43:33 +00:00
Kazuaki Ishizaki
35fd5c548e Fix typos under torch/distributed directory (#95638)
This PR fixes typos in comments and messages of `.py` files under torch/distributed directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95638
Approved by: https://github.com/usamah1, https://github.com/H-Huang, https://github.com/kit1980
2023-03-27 21:13:44 +00:00
Andrew Gu
0d4bbd1996 [Lint] Add FSDP/composable API files to ufmt include (#90873)
This PR adds FSDP and composable API files to `.lintrunner.toml` so that (1) lintrunner enforces that those files are formatted and (2) `lintrunner f` formats those files for you.

There are two requirements here (see https://github.com/pytorch/pytorch/wiki/lintrunner for details):
1. Install lintrunner:
```
pip install lintrunner
lintrunner init
```
2. `lintrunner f` before you finalize your PR, which would now be enforced by CI after this PR.

The code changes in this PR outside of `.lintrunner.toml` are the result of `lintrunner f`.

---

I only plan to land this PR if all of the composable API developers agree that this is something that makes sense and is not too intrusive to the workflow.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90873
Approved by: https://github.com/yhcharles, https://github.com/mrshenli, https://github.com/rohan-varma
2023-01-18 05:33:34 +00:00
Edward Z. Yang
eef019c14a Lint rule to forbid direct use of logging.info/etc APIs (#90907)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
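An illustration of the pattern such a rule typically enforces (assumption: the rule flags module-level `logging.info(...)` calls in favor of a named logger):

```python
import logging

# Forbidden style: logs through the root logger, bypassing per-module config
# logging.info("initialized process group")

# Preferred style: a module-scoped logger
logger = logging.getLogger(__name__)
logger.info("initialized process group")
```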

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90907
Approved by: https://github.com/jansel
2022-12-16 05:13:51 +00:00
Andrew Gu
e3cf81e0a7 [FSDP] ufmt /fsdp (#87811)
This applies `ufmt` to all of the FSDP files in the `torch/distributed/fsdp/` directory.

**Test Plan**
CI

**Notes**
For VSCode users,
- Install `ufmt`: https://pypi.org/project/ufmt/
- Install VSCode `ufmt` extension: https://marketplace.visualstudio.com/items?itemName=omnilib.ufmt
- Include in `settings.json`:
```
{
    "[python]": {
        "editor.defaultFormatter": "omnilib.ufmt",
        "editor.formatOnSave": true,
    },
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87811
Approved by: https://github.com/rohan-varma, https://github.com/fegin
2022-10-27 04:25:55 +00:00
ProGamerGov
8def154e00 Fix multiple docstring type mistakes (#82474)
### Description

* Docstrings using `(tuple of ints)` show up as `(tuple of python:ints)`, so I fixed them by making the `int` no longer plural (see the sketch after this list). Example: https://pytorch.org/docs/stable/generated/torch.permute.html#torch.permute
* A docstring type in JIT had one of its types incorrectly highlighted as code. Example: https://pytorch.org/docs/stable/generated/torch.jit.script.html#torch.jit.script
* I found some docstring type usages of `string` that had not yet been converted to `str` after #82410
* Some docstrings incorrectly listed their defaults inside the docstring types.
* I also found a docstring that was missing its type
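A hypothetical docstring showing the singular-type convention from the first bullet (not taken from the diff):

```python
def permute(dims):
    """Return a view with dimensions reordered (illustrative docstring).

    Args:
        dims (tuple of int): the desired ordering of dimensions. The
            singular "int" renders correctly, unlike "tuple of ints",
            which shows up as "tuple of python:ints".
    """
```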

### Testing
No testing should be required.

---

In the developer guidelines, there should probably be standards listed for the docstring types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82474
Approved by: https://github.com/albanD
2022-07-29 17:45:37 +00:00
Sisil Mehta
9d3ffed327 [FSDP] Sharded Grad Scaler (#76918)
Summary: Add a shard-aware grad scaler for FSDP + mixed precision support
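A hedged usage sketch (assumes an initialized process group and an FSDP-wrapped model; import path as it exists in current PyTorch):

```python
from torch.distributed.fsdp.sharded_grad_scaler import ShardedGradScaler

# Shard-aware drop-in for GradScaler: inf/nan findings are reduced across
# ranks so every rank agrees on whether to skip the optimizer step.
scaler = ShardedGradScaler()
# loss = model(batch).sum()
# scaler.scale(loss).backward()
# scaler.step(optimizer)
# scaler.update()
```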

Test Plan: Tests added

Differential Revision: D35988676

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76918
Approved by: https://github.com/rohan-varma
2022-05-16 15:53:21 +00:00