Commit Graph

4671 Commits

Author SHA1 Message Date
Jeff Daily
6ede882c0b preferred blas library; cublaslt gemm implementation (#122106)
Following the example of PyTorch supporting a preferred Linalg library (cusolver or magma), this PR introduces a preferred blas library selector of either cublas or cublaslt for CUDA and hipblas or hipblaslt for ROCm via normal hipification of sources.

The default blas implementation remains cublas or hipblas. cublaslt or hipblaslt can be enabled using the environment variable TORCH_BLAS_PREFER_CUBLASLT=1 (or TORCH_BLAS_PREFER_HIPBLASLT=1 as an alias), or by calling `torch.backends.cuda.preferred_blas_library(backend="cublaslt")` (with `backend="hipblaslt"` accepted as an alias).
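
A minimal sketch of the two opt-in paths described above (assuming a build that includes this PR):

```
import os

import torch

# Option 1: environment variable (set before the first CUDA blas call);
# TORCH_BLAS_PREFER_HIPBLASLT=1 is accepted as an alias on ROCm.
os.environ["TORCH_BLAS_PREFER_CUBLASLT"] = "1"

# Option 2: the Python API named above; backend="hipblaslt" is the ROCm alias.
torch.backends.cuda.preferred_blas_library(backend="cublaslt")
```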

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122106
Approved by: https://github.com/lezcano
2024-04-22 15:38:22 +00:00
Chen, Zejun
b1984237a0 [Profiler] Unify the device(CUDA, XPU, PrivateUse1) in torch profiler post processing (#123247)
This PR unifies the CUDA, XPU, and PrivateUse1 device handling in the torch profiler. CUDA, XPU, and PrivateUse1 now share a single string field, `use_device`, to distinguish devices, and share one device code path for calculating Kineto time durations and memory statistics during post-processing.
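
As a purely illustrative sketch (not the PR's code), post-processing can key off one device string instead of separate per-backend flags; the function and attribute names here are hypothetical:

```
from typing import Optional


def accumulate_device_stats(event, use_device: Optional[str], totals: dict) -> None:
    # One code path shared by CUDA, XPU, and PrivateUse1: the backend is
    # identified by the `use_device` string rather than per-backend booleans.
    if use_device is None:
        return  # CPU-only event; nothing device-side to accumulate
    totals.setdefault(f"{use_device}_time_total", 0.0)
    totals[f"{use_device}_time_total"] += getattr(event, "device_time", 0.0)
    totals.setdefault(f"{use_device}_memory_usage", 0)
    totals[f"{use_device}_memory_usage"] += getattr(event, "device_memory_usage", 0)
```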

#suppress-api-compatibility-check

Co-authored-by: Aaron Enye Shi <enye.shi@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123247
Approved by: https://github.com/aaronenyeshi
2024-04-22 01:26:55 +00:00
Aaron Gokaslan
c5fafe9f48 [BE]: TRY002 - Ban raising vanilla exceptions (#124570)
Adds a ruff lint rule to ban raising raw exceptions. Most of these should at the very least be a RuntimeError, ValueError, TypeError, or some other more specific error. There are hundreds of instances of these bad exception types already in the codebase, so I have noqa'd most of them. Hopefully this error code will get committers to rethink what exception type they should raise when they submit a PR.

I also encourage people to gradually go and fix all the existing noqas that have been added so they can be removed over time and our exception typing can be improved.
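
For illustration, the kind of change this rule pushes toward (the function here is made up):

```
def load_checkpoint(path: str):
    # Flagged by ruff TRY002: raising a vanilla Exception
    raise Exception(f"checkpoint not found: {path}")  # noqa: TRY002


def load_checkpoint_fixed(path: str):
    # Preferred: raise a specific exception type instead
    raise FileNotFoundError(f"checkpoint not found: {path}")
```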

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124570
Approved by: https://github.com/ezyang
2024-04-21 22:26:40 +00:00
Aaron Gokaslan
5a1216bb2e [BE]: Update ruff to 0.4.1 (#124549)
Update ruff to 0.4.1.
This version fixes a lot of false negatives/false positives, is 20-40% faster, and has various other bug fixes.

Below is a before and after table showing the execution time of ruff lint and ruff format in milliseconds courtesy of https://astral.sh/blog/ruff-v0.4.0

| Repository                                         | Linter (v0.3) | Linter (v0.4) | Formatter (v0.3) | Formatter (v0.4) |
|----------------------------------------------------|---------------|---------------|------------------|------------------|
| [pytorch/pytorch](https://github.com/pytorch/pytorch) | 328.7         | 251.8         | 351.1            | 274.9            |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124549
Approved by: https://github.com/ezyang
2024-04-21 14:06:23 +00:00
Michael Lazos
0d0b5b2655 Enable dynamo rosenbrock sparse tests (#124542)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124542
Approved by: https://github.com/yf225
ghstack dependencies: #124540, #124541
2024-04-20 05:54:41 +00:00
Michael Lazos
184f16016e Enable dynamo-traced deepcopy test for RMSprop (#124541)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124541
Approved by: https://github.com/yf225
ghstack dependencies: #124540
2024-04-20 05:54:41 +00:00
Michael Lazos
6a730698e2 Enable dynamo-traced Adamax tests (#124540)
Enabling tests related to https://github.com/pytorch/pytorch/issues/121178

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124540
Approved by: https://github.com/yf225
2024-04-20 05:54:41 +00:00
rzou
f0560f7b3b [opcheck] Stop doing test_aot_dispatch_static by default (#124495)
Motivations:
- this is pretty redundant with test_aot_dispatch_dynamic.
- The user story for opcheck is that a user should use opcheck to see
  if their operator was "registered correctly". If a user's custom op
  only supports dynamic shapes, then it's a bit awkward for
  one of the tests (e.g. `test_aot_dispatch_static`) to fail.
- We've already stopped running test_aot_dispatch_static in all of
  our opcheck tests.

Test Plan:
- wait for CI
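
For reference, a hedged sketch of how a test could still opt into the static check explicitly (the import locations and the `mylib::my_op` example are assumptions, not from this PR):

```
import torch
from torch.library import custom_op, opcheck  # locations may differ by version


@custom_op("mylib::my_op", mutates_args=())
def my_op(x: torch.Tensor) -> torch.Tensor:
    return x.clone()


@my_op.register_fake  # fake/meta registration; name may differ by version
def _(x):
    return torch.empty_like(x)


x = torch.randn(3)

# Default run: test_aot_dispatch_static is no longer part of the default test set.
opcheck(torch.ops.mylib.my_op, (x,))

# Explicit opt-in if the op is known to support static shapes (assumed usage).
opcheck(torch.ops.mylib.my_op, (x,), test_utils=["test_aot_dispatch_static"])
```
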
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124495
Approved by: https://github.com/williamwen42
ghstack dependencies: #124180, #124200, #124299, #124134, #124199, #124403, #124414
2024-04-19 21:57:22 +00:00
rzou
25c65d6642 Change register_autograd to reflect ordering of setup_context and backward (#124403)
old: `register_autograd(setup_context, backward, /)`
new: `register_autograd(backward, /, *, setup_context=None)`

Motivations:
- We introduce these APIs as "give us a backward and use setup_context
  to save things for backward".
- setup_context isn't always necessary.
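
A minimal sketch of the new calling convention, assuming the `torch.library.custom_op` API this stack builds on:

```
import torch
from torch.library import custom_op  # location may differ by version


@custom_op("mylib::my_sin", mutates_args=())
def my_sin(x: torch.Tensor) -> torch.Tensor:
    return torch.sin(x)


def backward(ctx, grad):
    (x,) = ctx.saved_tensors
    return grad * torch.cos(x)


def setup_context(ctx, inputs, output):
    (x,) = inputs
    ctx.save_for_backward(x)


# New ordering: backward is positional; setup_context is keyword-only and optional.
my_sin.register_autograd(backward, setup_context=setup_context)
```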

Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124403
Approved by: https://github.com/albanD
ghstack dependencies: #124180, #124200, #124299, #124134, #124199
2024-04-19 17:56:30 +00:00
Michael Lazos
68a027f144 Fixes for 123400 (#123406)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123406
Approved by: https://github.com/janeyx99
ghstack dependencies: #123324, #123404, #123405, #124309
2024-04-19 17:20:57 +00:00
Michael Lazos
1531a29fb9 Enable tests related to 116061 (#123405)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123405
Approved by: https://github.com/janeyx99
ghstack dependencies: #123324, #123404
2024-04-19 17:20:54 +00:00
Michael Lazos
406d99e46c Fix for 117147 (#123404)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123404
Approved by: https://github.com/Skylion007, https://github.com/janeyx99
ghstack dependencies: #123324
2024-04-19 17:20:50 +00:00
Michael Lazos
203d111c54 Enable dynamo test_forloop_goes_right_direction_multi_gpu (#123324)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123324
Approved by: https://github.com/janeyx99
2024-04-19 17:20:41 +00:00
ydwu4
e62169a8fa Support torchbind op dispatch in python (#123367)
We override the `__call__` method and register fake, functional, and proxy default dispatch mode implementations in its python_key_mode_table.

The idea is:
1. when inputs contain a FakeScriptObject, we dispatch through the _get_dispatch mechanism. The dispatch mode keys are implemented automatically in the operator's constructor.
2. when inputs are not fakified, we dispatch through the original C++ dispatcher.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123367
Approved by: https://github.com/zou3519
2024-04-19 17:17:27 +00:00
Jane Xu
b412b75b42 [optim] add fused_adam/adamw_kernel support for CPU device (#123074)
On par with `CUDA` implementation.
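
For context, a minimal sketch of opting into the fused CPU path (assuming the existing `fused=True` optimizer flag now accepts CPU parameters, per this PR):

```
import torch

model = torch.nn.Linear(1024, 1024)  # parameters live on CPU
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, fused=True)

loss = model(torch.randn(64, 1024)).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```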

For the `autocast` logic, same as `CUDA` + `Fused Adam`:
 - check for `inf` in `gradscaler.step`
 - In the fused kernel, if there is an `inf`, do nothing. If not, unscale the grad (also writing it back) and update the param.

**TestPlan**:
```
# extend CUDA only test for CPU fused adagrad
python test_optim.py -k test_fused_matches_forloop
python test_optim.py -k test_fused_large_tensor
python test_torch.py -k test_grad_scaling_autocast_fused

# extend fused test
python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step
python test_optim.py -k test_can_load_older_state_dict

# newly added test (follow 6b1f13ea2f/test/test_cuda.py (L1108))
python test_optim.py -k test_grad_scaling_autocast_fused_optimizers
```

**Benchmark**:
**5.1x** on 56 core SPR
**Parameter-size=1M**
**Nparams=10**
[test script](https://gist.github.com/zhuhaozhe/ef9a290ad3f8f4067b3373a3bdaa33e7)

```
numactl -C 0-55 -m 0 python bench_adam.py
non-fused 6.0174267292022705 s
fused 1.1787631511688232 s
```

**Note: Fused kernel accuracy**
The accuracy failure in CI shows a mismatch slightly higher than the default tolerance:
```
2024-04-02T06:09:16.2213887Z Mismatched elements: 21 / 64 (32.8%)
2024-04-02T06:09:16.2214339Z Greatest absolute difference: 1.5735626220703125e-05 at index (6, 6) (up to 1e-05 allowed)
2024-04-02T06:09:16.2214813Z Greatest relative difference: 1.0073336852656212e-05 at index (4, 1) (up to 1.3e-06 allowed)
```
I have debugged it step by step and, unfortunately, we may not be able to make the fused kernel produce exactly the same results as the non-fused one due to compiler optimizations.
For example, in non-fused impl
```
exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2)
```
and in fused impl
```
  exp_avg_sq_ptr[d] = scalar_t(beta2) * exp_avg_sq_ptr[d];
  //  std::cout << "exp_avg_sq " <<   exp_avg_sq_ptr[d] << std::endl;
  exp_avg_sq_ptr[d] = exp_avg_sq_ptr[d] +
      scalar_t(exp_avg_sq_grad_coefficient) * grad_val * grad_val;
```
If I keep the `std::cout`, I get exactly the same results in the UT:
```
===============param
0.6796758770942688
0.6796758770942688
```
But when I comment it out, there is a difference:
```
===============param
0.6796758770942688
0.6796759366989136
```
So I will make the tolerance a little higher than the default one.

Co-authored-by: Jane Xu <janeyx@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123074
Approved by: https://github.com/jgong5, https://github.com/janeyx99
2024-04-19 11:14:04 +00:00
PyTorch MergeBot
520bc1080e Revert "[Profiler] Unify the device(CUDA, XPU, PrivateUse1) in torch profiler post processing (#123247)"
This reverts commit 768ce2cdda.

Reverted https://github.com/pytorch/pytorch/pull/123247 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/123247#issuecomment-2066152611))
2024-04-19 09:09:03 +00:00
Chen, Zejun
768ce2cdda [Profiler] Unify the device(CUDA, XPU, PrivateUse1) in torch profiler post processing (#123247)
This PR unifies the CUDA, XPU, and PrivateUse1 device handling in the torch profiler. CUDA, XPU, and PrivateUse1 now share a single string field, `use_device`, to distinguish devices, and share one device code path for calculating Kineto time durations and memory statistics during post-processing.

#suppress-api-compatibility-check

Co-authored-by: Aaron Enye Shi <enye.shi@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123247
Approved by: https://github.com/aaronenyeshi, https://github.com/gujinghui
2024-04-19 03:31:13 +00:00
xinan.lin
6fcbeb3489 [ATen] Add CPU fp16 support for nll_loss and cross_entropy_loss (#123256)
Add CPU FP16 support for nll_loss and cross_entropy_loss.
Resolves issue #123328.
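
A quick sketch of the newly supported dtype path (shapes chosen arbitrarily):

```
import torch
import torch.nn.functional as F

logits = torch.randn(8, 10, dtype=torch.float16)  # CPU fp16 input
target = torch.randint(0, 10, (8,))
loss = F.cross_entropy(logits, target)  # now supported for fp16 on CPU
```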

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123256
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/malfet
2024-04-18 11:44:38 +00:00
xinan.lin
c9ab9248ce [Inductor Intel GPU backend Upstream] Generalize device-bias code in (#124249)
Generalize device-biased code in triton_utils.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124249
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/jansel
2024-04-18 03:54:31 +00:00
Michael Lazos
102a223216 Enable dynamo test_state_dict_deterministic (#123323)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123323
Approved by: https://github.com/janeyx99
ghstack dependencies: #123498, #123322
2024-04-18 01:06:28 +00:00
Michael Lazos
d88fcb86d8 Enable dynamo traced test_forloop_goes_right_direction (#123322)
Removed a bunch of skips. I also updated test_forloop_goes_right_direction to *not* use the closure when dynamo is tracing. The reason for this is that testing the disabled optimizer doesn't actually test anything.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123322
Approved by: https://github.com/janeyx99
ghstack dependencies: #123498
2024-04-18 00:50:10 +00:00
Xuehai Pan
93e249969b [BE] enable ruff rule RSE and remove useless parentheses in raise statements (#124261)
Remove useless parentheses in `raise` statements if the exception type is raised with no argument.
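
An illustration of the pattern this rule rewrites (the function is made up):

```
def ensure_positive(value: int) -> None:
    if value <= 0:
        # Before (flagged by ruff RSE102): `raise ValueError()` -- redundant parentheses.
        # After this cleanup:
        raise ValueError
```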

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124261
Approved by: https://github.com/albanD
2024-04-17 19:29:34 +00:00
Pearu Peterson
d2b0c0a34e Fix index_reduce sampler filter when op_info.variant_test_name is specified (#123375)
As in the title: the `index_reduce` sample must correspond to the reduction type specified by `variant_test_name`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123375
Approved by: https://github.com/zou3519, https://github.com/peterbell10
2024-04-17 15:31:28 +00:00
FFFrog
acc466751b Add bfloat16 support to binary_cross_entropy for CPU (#123823)
Fixes #123715

As the title states.

However, we may want to pay attention to https://github.com/pytorch/pytorch/pull/33206, which removed half support for CPU about 4 years ago.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123823
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-04-17 09:44:07 +00:00
Catherine Lee
0abd3f60fd [CI] Reduce CI_SERIAL_LIST list (#124085)
- Add a serial marker to individual tests so the test file can be removed from the CI serial list.
- Run serial-marked tests first, serially.
- Run all other tests afterwards in parallel.

Slowly reduce the list and mark individual tests as serial instead.

Hopefully the number of serial tests is small so sharding evenness doesn't get too messed up.

Hopefully can do 3 procs for sm86 and cpu?

serial no longer looks like a real word to me

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124085
Approved by: https://github.com/seemethere, https://github.com/malfet
2024-04-17 00:23:47 +00:00
Fuzzkatt
1cf62e86a4 skip various unit tests for Jetson (#122531)
Skip multiprocessing, CUDA expandable segments, memory-efficient attention, and flash attention tests on Jetson due to hanging / SIGKILL issues found in NVIDIA internal testing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122531
Approved by: https://github.com/eqy, https://github.com/malfet
2024-04-16 01:26:26 +00:00
rzou
3c25b18d76 Excise old custom ops prototype from custom_op_db (#124062)
Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124062
Approved by: https://github.com/albanD
ghstack dependencies: #123615
2024-04-15 23:32:47 +00:00
rzou
a03711d24d [custom_ops] Support TensorList inputs/outputs (#123615)
We add a `supports_tensorlist` decorator that gives an autograd.Function
the ability to handle TensorLists.

Test Plan:
- custom_op_db tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123615
Approved by: https://github.com/albanD
2024-04-15 23:32:43 +00:00
Andrew Gu
d1a0821e7e [FSDP2] Added pre/post-all-gather extensions (subclass) (#122908)
**Overview**
This PR adds pre/post-all-gather extensions to FSDP2.
- The pre/post-all-gather extensions are specified at the tensor-level on the `sharded_param._local_tensor` (i.e. the tensor wrapped by the sharded `DTensor`). If the user has a tensor-subclass parameter on the module passed to FSDP that preserves the subclass through the sharding ops (e.g. `new_zeros`, `chunk`, etc.), then the `sharded_param._local_tensor` will naturally be of that subclass.
- The pre-all-gather function has signature:
  ```
  def fsdp_pre_all_gather(self) -> Tuple[Tuple[torch.Tensor, ...], Any]
  ```
    - The first return value is a `Tuple[torch.Tensor, ...]` of the all-gather inputs. It is a tuple since a subclass could contribute >1 inner tensors.
    - The second return value is any optional metadata needed to pass through to the post-all-gather.
- The post all-gather function has signature:
  ```
  def fsdp_post_all_gather(
      self,
      all_gather_outputs: Tuple[torch.Tensor, ...],
      metadata: Any,
      param_dtype: torch.dtype,
      *,
      out: Optional[torch.Tensor] = None,
  ) -> Union[Tuple[torch.Tensor, Tuple[torch.Tensor, ...]], None]:
  ```
    - The `all_gather_outputs` are exactly the all-gathered versions of the `fsdp_pre_all_gather` 1st return value (representing the all-gather inputs). We make sure to unflatten these back to ND for the user.
    - The `metadata` is the `fsdp_pre_all_gather` 2nd return value, untouched.
    - The `param_dtype` is the parameter dtype based on the passed-in `MixedPrecisionPolicy`. Namely, if no policy is passed in, then `param_dtype` is the original dtype, and otherwise, it is the `MixedPrecisionPolicy.param_dtype`.
    - If `out` is not specified, then the return value has type `Tuple[torch.Tensor, Tuple[torch.Tensor, ...]]`. The first tuple item is the unsharded parameter (e.g. re-wrapping into some subclass). The second tuple item is a tuple of unsharded inner tensors that FSDP should free during reshard. These should be derived from the all-gather outputs.
    - The `out` argument is required due to FSDP's `resize_` usage. We require an in-place variant for the backward all-gather. Here, `out` will be exactly the object returned as the first tuple item in the out-of-place variant mentioned before. The unsharded inner tensors will be allocated before calling `fsdp_post_all_gather`. When `out` is specified, the `fsdp_post_all_gather` should return `None`. If the post-all-gather does not do any out-of-place ops, then the `out` variant can just be a no-op since the unsharded inner tensors will be the same as the all-gather outputs, which FSDP directly writes to after all-gather. (E.g., this is the case for both float8 and `NF4Tensor`.)
- We check for `fsdp_pre_all_gather` and `fsdp_post_all_gather` directly via `hasattr` to accommodate monkey patching, so we do not strictly require the user to use a tensor subclass. The monkey patch must happen after the local tensors have been finalized (after applying FSDP and after any meta-device init). A schematic sketch of the two hooks is given after this list.
- For now, we require that all gradients in one FSDP parameter group share the same dtype. This is fine for float8 and `NF4Tensor` use cases. If this requirement is too strict, then in the future we can issue 1 reduce-scatter per dtype per group.
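
As a rough, schematic illustration (not the PR's code) of the two hooks above, using a made-up extension class with hypothetical `_data`/`_scale` fields:

```
import torch
from typing import Any, Optional, Tuple, Union


class QuantizedShardExtension:  # schematic; a real use would be a torch.Tensor subclass
    _data: torch.Tensor   # hypothetical inner tensor that participates in the all-gather
    _scale: torch.Tensor  # hypothetical metadata carried from pre- to post-all-gather

    def fsdp_pre_all_gather(self) -> Tuple[Tuple[torch.Tensor, ...], Any]:
        # First return value: the all-gather inputs; second: opaque metadata.
        return (self._data,), self._scale

    def fsdp_post_all_gather(
        self,
        all_gather_outputs: Tuple[torch.Tensor, ...],
        metadata: Any,
        param_dtype: torch.dtype,
        *,
        out: Optional[torch.Tensor] = None,
    ) -> Union[Tuple[torch.Tensor, Tuple[torch.Tensor, ...]], None]:
        (data,) = all_gather_outputs
        scale = metadata
        if out is not None:
            # In-place variant for the backward all-gather: FSDP has already
            # allocated the unsharded inner tensors, so nothing more to do here.
            return None
        unsharded = data.to(param_dtype) * scale  # hypothetical reconstruction
        # Second item: inner tensors that FSDP should free during reshard.
        return unsharded, (data,)
```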

**Design Notes**
- We assume that the `sharded_param._local_tensor` is padded on dim-0.
    - This assumption should not block immediate use cases, and when we pad the `DTensor._local_tensor` by default, this assumption will always be true.
    - This assumption allows us to call `sharded_param._local_tensor.fsdp_pre_all_gather()`; i.e. it tells us from which tensor object to invoke `fsdp_pre_all_gather()`.
    - Suppose we want to compose with CPU offloading. Then, CPU offloading's H2D copy should run first, i.e. `sharded_param._local_tensor.to("cuda").fsdp_pre_all_gather()`, where `_local_tensor.to("cuda")` should return an instance of the subclass so that it still defines `fsdp_pre_all_gather()`. Note that in this case, the subclass instance on GPU is a temporary, which means caching values on it would not be possible. One possibility would be to have `.to("cuda")` move any cached values too.
- `fsdp_post_all_gather` can return an unsharded parameter that either aliases the all-gather output or does not, and there is no way to know a priori which case holds.
    - If the unsharded parameter aliases with the all-gather output, then we should _not_ free the all-gather output in `unshard`.
    - If the unsharded parameter does not alias with the all-gather output, then we prefer to free the all-gather output in `unshard` to avoid holding the unneeded temporary.
    - One approach is for eager-mode to check for this alias (by comparing data pointers). However, this might be adversarial to full-graph compilation. The compromise for simplicity can be to always free the all-gather output in `reshard`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122908
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #119302
2024-04-15 21:35:51 +00:00
Adnan Akhundov
03a05e791a Don't add non-integer Triton kernel arg 1 to equal_to_1 (#123886)
Summary: The Triton compiler adds constant argument 1 to `equal_to_1` [only when it's an int](8c5e33c77e/python/triton/runtime/jit.py (L275)). Here we restrict Inductor's `equal_to_1` in the same way.

Test Plan:

```
$ python test/inductor/test_triton_kernels.py -k test_triton_kernel_equal_to_1_float_arg
...
----------------------------------------------------------------------
Ran 1 test in 6.528s

OK

$ python test/inductor/test_triton_kernels.py -k test_triton_kernel_equal_to_1_arg
...
----------------------------------------------------------------------
Ran 2 tests in 10.142s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123886
Approved by: https://github.com/oulgen
ghstack dependencies: #123703
2024-04-14 20:34:05 +00:00
Yifu Wang
2a2e1d8e4f [functional collective] change the Python APIs to only use the native funcol ops (#123777)
## Summary

After this PR, the functional collective Python APIs will stop honoring `TORCH_DISABLE_NATIVE_FUNCOL` and only use native funcol ops. Specifically, this PR:
- Removed `use_native_funcol()`.
- Removed the code path in the Python APIs when `use_native_funcol()` is `False`.
- Changed the CI tests that runs on both native funcol and legacy funcol through the Python API to only run with native funcol.

## Test Changes

`test_functional_api.py`
- Removed the tests where only one of output_split_sizes or input_split_sizes is specified. This behavior is unreliable and has been removed from the native funcol.
- Removed `TestWaitiness` which tests an implementation detail of the legacy funcol. We have equivalent tests for native funcol in `test/distributed/test_c10d_functional_native.py` b7fac76fc2/test/distributed/test_c10d_functional_native.py (L114-L116)

`test/distributed/_tensor/test_dtensor.py`
`test/distributed/_tensor/test_dtensor_compile.py`
`test/distributed/test_device_mesh.py`
`test/distributed/_tensor/experimental/test_tp_transform.py`
`test/distributed/_tensor/test_matrix_ops.py`
`test/distributed/test_inductor_collectives.py`
- All these tests were double running with both native funcol and legacy funcol. Changed to only run with native funcol.

`test/distributed/test_c10d_functional_native.py`
- Removed the `run_with_native_funcol` decorators.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123777
Approved by: https://github.com/wanchaol
ghstack dependencies: #123776
2024-04-13 03:08:36 +00:00
Aaron Gokaslan
1d6c5972c1 [BE]: Optimize min/max/sum comprehensions C419 (#123960)
Automatic fixes that replace certain list comprehensions with generator expressions where appropriate so that they are immediately consumed. This is preview functionality in ruff for rule C419, and it was applied automatically.
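
A small example of the rewrite this rule performs:

```
values = [3, 1, 4, 1, 5]

# Before (flagged by ruff C419): builds an intermediate list just to reduce it.
total_before = sum([v * v for v in values])

# After: the generator expression is consumed directly, with no intermediate list.
total_after = sum(v * v for v in values)

assert total_before == total_after
```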

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123960
Approved by: https://github.com/malfet
2024-04-12 23:54:15 +00:00
Shawn Xu
04acdad829 [PT] [FSDP] [test] add barrier device ids (#123866)
Summary:
Without this, the `ProcessGroupNCCL` library would try to infer the device id and emit a warning.
This doesn't change the behavior; it just makes it explicit.

> ProcessGroupNCCL.cpp:3720] [PG 0 Rank 1]  using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
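
A hedged sketch of the change in question, assuming a standard NCCL setup where a launcher-provided `LOCAL_RANK` identifies the GPU:

```
import os

import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])  # assumed launcher-provided env var
torch.cuda.set_device(local_rank)
dist.init_process_group("nccl")

# Passing device_ids explicitly avoids the "using GPU ... devices are currently
# unknown" warning quoted above.
dist.barrier(device_ids=[local_rank])
```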

Test Plan: CI

Differential Revision: D55998175

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123866
Approved by: https://github.com/awgu
2024-04-12 18:29:32 +00:00
Thiago Crepaldi
23dbe2b517 Add test for skipping hf logging during export (#123410)
https://github.com/pytorch/pytorch/pull/123402 already supports HF logging because the HF logger is based on the `logging` module.

This PR only adds a test to guard against regression.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123410
Approved by: https://github.com/BowenBao, https://github.com/malfet
2024-04-12 17:42:46 +00:00
Pritam Damania
9dfeec9cdc Add a mode to avoid clone() in DDPSink (#122927)
DDPSink clones the outputs of DDP to avoid in-place modification of the loss (see https://github.com/pytorch/pytorch/issues/61982). However, when outputs are really large (2-3 GB), this adds a lot of peak-memory overhead.

As a result, this PR adds a mode to avoid the clone in cases where users are not modifying the loss in-place.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122927
Approved by: https://github.com/fegin, https://github.com/rohan-varma
2024-04-12 08:56:10 +00:00
Tristan Rice
358ace1a1b functional_collectives: add first differentiable collective -- all_to_all_single_grad (#123599)
This adds the differentiable collective -- all_to_all_single_grad. This is the initial proof of concept PR and I will be adding the remaining collectives in follow up PRs.

This adds a new function called `all_to_all_single_autograd` which is the autograd variant of `all_to_all_single`. For backwards compatibility + initial testing we wanted to make the autograd variant separate to avoid regressions.

This uses `autograd::Function` to register an Autograd op that calls the original `_c10d_functional::all_to_all_single` via the dispatcher. This works with compile and inductor as opposed to the previous Python implementation that had issues. As this uses the existing `_c10d_functional` ops we don't need to register any meta functions or lowering.

To avoid cudaStream issues, this explicitly calls `wait_tensor` in the backward method to ensure it runs under the same stream as the async operation. This hurts performance but can potentially be alleviated using `compile`.
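
A hedged sketch of using the new autograd variant (the module path and exact signature are assumed to mirror the existing `all_to_all_single`):

```
import torch
import torch.distributed._functional_collectives as funcol  # assumed location


def exchange(x: torch.Tensor, group) -> torch.Tensor:
    # Differentiable variant introduced in this PR; gradients flow back through
    # the reverse all-to-all in backward.
    return funcol.all_to_all_single_autograd(
        x, output_split_sizes=None, input_split_sizes=None, group=group
    )
```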

Related work: https://github.com/pytorch/torchrec/blob/main/torchrec/distributed/comm_ops.py

Test plan:

```
pytest test/distributed/test_functional_api.py -k test_all_to_all_single_compile
pytest test/distributed/test_functional_api.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123599
Approved by: https://github.com/yifuwang
2024-04-12 01:48:49 +00:00
Lucas Pasqualin
b7fac76fc2 [DCP] fixes for _load_state_dict_keys and supports nested keys (#123679)
Fixes some issues with `_load_state_dict_keys`, including:
  * updates broken test, which was failing due to incorrect parameters
  * adds support for specifying nested keys, e.g. load state dict keys can now specify something like `"optimizer.state"`, which loads all keys under `optimizer.state`.
  * updates call site to use the private implementation of `_load_state_dict`, which properly handles empty state dicts (otherwise the keys are ignored)

Big shout out to @diego-urgell who not only identified current issues, but recommended the right solutions!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123679
Approved by: https://github.com/diego-urgell, https://github.com/wz337
2024-04-11 20:52:06 +00:00
ydwu4
e979f45610 [while_loop] add a simple op_info test (#123814)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123814
Approved by: https://github.com/tugsbayasgalan, https://github.com/zou3519
2024-04-11 19:59:04 +00:00
Kurt Mohler
ee869c9bb7 Avoid COW materialization in backward ops (4) (#123798)
Affected ops:
* embedding_bag
* mse_loss
* huber_loss
* grid_sample
* ctc_loss
* nll_loss
* pdist
* _segment_reduce

Part of #97856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123798
Approved by: https://github.com/ezyang
ghstack dependencies: #123797
2024-04-11 18:41:41 +00:00
Kurt Mohler
69249a218b Avoid COW materialization in backward ops (3) (#123797)
Affected ops:
* conv ops
* glu
* prelu
* scaled_dot_product_attention
* threshold
* logsigmoid
* binary_cross_entropy
* gelu
* unfold
* smooth_l1_loss
* embedding

Part of #97856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123797
Approved by: https://github.com/ezyang
2024-04-11 18:35:08 +00:00
Episkey0109
02b29e7d07 Add meta function for channel_shuffle operation (#123033)
This commit introduces a meta function for the `channel_shuffle` operation, enabling PyTorch to perform shape inference and optimizations related to this operation without actual computation. The meta function assumes input shape (*, C, H, W) and validates that the number of channels (C) is divisible by the specified number of groups.
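
A rough sketch (not the PR's code) of what such a meta function checks and returns; it only produces an empty tensor of the right shape, never touching data:

```
import torch


def channel_shuffle_meta(x: torch.Tensor, groups: int) -> torch.Tensor:
    # Input is assumed to be (*, C, H, W); validate divisibility, then return
    # an empty tensor with the same shape/dtype for shape inference only.
    c = x.size(-3)
    torch._check(groups > 0, lambda: f"groups must be positive, got {groups}")
    torch._check(
        c % groups == 0,
        lambda: f"number of channels ({c}) must be divisible by groups ({groups})",
    )
    return x.new_empty(x.shape)
```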

Fixes #122771

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123033
Approved by: https://github.com/ezyang, https://github.com/mikaylagawarecki
2024-04-11 10:07:18 +00:00
Edward Z. Yang
8aad72b0d3 Support all unsigned int sizes on unique (#123643)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123643
Approved by: https://github.com/albanD, https://github.com/kit1980
2024-04-11 06:50:12 +00:00
Oguz Ulgen
57a2032c7a Delete Lark (#123689)
Now that we are using MLIR bindings inside Triton, let's delete the Lark parser.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123689
Approved by: https://github.com/jansel
2024-04-11 05:51:06 +00:00
Kurt Mohler
281810e307 Avoid COW materialization in backward ops (2) (#123740)
Affected ops:
* pooling ops
* relu
* pad
* interpolate
* upsample
* multi_margin_loss
* multilabel_margin_loss
* multilabel_soft_margin_loss

Part of #97856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123740
Approved by: https://github.com/ezyang
ghstack dependencies: #123657
2024-04-11 01:35:38 +00:00
Andrew Gu
c64184b097 [FSDP] Made patch functions thread safe with barrier (#123754)
I think if we do not have barriers as added in the PR, we could have a race condition with multi-threading (e.g. MTPG). I think this mainly matters if the test function itself does not run collectives.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123754
Approved by: https://github.com/weifengpy
ghstack dependencies: #122962, #123290, #123362
2024-04-11 00:59:16 +00:00
PyTorch MergeBot
6b18daf205 Revert "Delete Lark (#123689)"
This reverts commit a631461eef.

Reverted https://github.com/pytorch/pytorch/pull/123689 on behalf of https://github.com/PaliC due to This PR seems to be breaking  test_binary_ufuncs.py ([comment](https://github.com/pytorch/pytorch/pull/123689#issuecomment-2048489549))
2024-04-10 21:48:04 +00:00
Kurt Mohler
49d5553f5a Avoid COW materialization in backward ops (1) (#123657)
Affected ops:
* cdist
* sparse.sampled_addmm
* sparse.mm
* cross_entropy
* norm ops

Part of #97856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123657
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2024-04-10 21:07:07 +00:00
Oguz Ulgen
a631461eef Delete Lark (#123689)
Now that we are using MLIR bindings inside Triton, let's delete the Lark parser.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123689
Approved by: https://github.com/jansel
2024-04-10 19:41:54 +00:00
PyTorch MergeBot
d017645dc7 Revert "Support all unsigned int sizes on unique (#123643)"
This reverts commit 8aa08b8b9d.

Reverted https://github.com/pytorch/pytorch/pull/123643 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing lots of jobs with the new dtype 8aa08b8b9d ([comment](https://github.com/pytorch/pytorch/pull/123643#issuecomment-2047905094))
2024-04-10 15:49:40 +00:00
Edward Z. Yang
8aa08b8b9d Support all unsigned int sizes on unique (#123643)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123643
Approved by: https://github.com/albanD, https://github.com/kit1980
2024-04-10 11:46:10 +00:00