Commit Graph

2082 Commits

Author SHA1 Message Date
PyTorch MergeBot
033e733021 Revert "[BE] wrap deprecated function/class with typing_extensions.deprecated (#126898)"
This reverts commit 749a132fb0.

Reverted https://github.com/pytorch/pytorch/pull/126898 on behalf of https://github.com/fbgheith due to switching typing-extensions=4.3.0 to 4.9.0 causes internal failure ([comment](https://github.com/pytorch/pytorch/pull/126898#issuecomment-2142884456))
2024-05-31 19:47:24 +00:00
Mikayla Gawarecki
cd06ae0cb8 Relax use_count constraints for swap_tensors when AccumulateGrad holds a reference (#127313)
### Before this PR:
`torch.utils.swap_tensors(a, b)` required the `use_count` of `a` and `b` to be 1

```python
a = torch.randn(2, 3, requires_grad=True)
b = torch.randn(2, 4)
out = a * 2
out.sum().backward()
# Calling swap_tensors here would fail due to the reference held by AccumulateGrad node, which is not cleaned up after backward
# torch.utils.swap_tensors(a, b)
del out
# Calling swap_tensors here would pass
torch.utils.swap_tensors(a, b)
```
### After this PR:
`torch.utils.swap_tensors(a, b)` requires the `use_count` of `a` and `b` to be 1 or 2 IF the second reference is held by `AccumulateGrad`

A pre-hook will be registered on the `AccumulateGrad` node so that it will fail if it is called (i.e. if user attempts to backward through the graph).

```python
a = torch.randn(2, 3, requires_grad=True)
b = torch.randn(2, 4)
out = a * 2
out.sum().backward()
# Calling swap_tensors here is ok
torch.utils.swap_tensors(a, b)
# If we ever backward to the AccumulateGrad node it will error that it was poisoned by swap_tensors
```

### Application to `nn.Module`

This issue is especially pertinent in context of `nn.Module` where parameters will have `AccumulateGrad` nodes initialized after forward. Specifically, this is intended to address https://github.com/pytorch/pytorch/pull/126814#issuecomment-2127777866. Previously, this would fail at the `m.cpu()` but we want users to be able to do something like the following, and instead raise an error if the user ever attempts to backward through the poisoned `AccumulateGrad` node

```python
import torch
import torch.nn as nn
m = nn.Linear(3, 5)
inp = torch.randn(2, 3)
out = m(inp)
out.sum().backward()
m.cpu()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127313
Approved by: https://github.com/soulitzer
2024-05-30 07:06:55 +00:00
Xuehai Pan
749a132fb0 [BE] wrap deprecated function/class with typing_extensions.deprecated (#126898)
Use `typing_extensions.deprecated` for deprecation annotation if possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` if the category is missing.

Note that only warnings that their messages contain `[Dd]eprecat(ed|ion)` are updated in this PR.

UPDATE: Use `FutureWarning` instead of `DeprecationWarning`.

Resolves #126888

- #126888

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126898
Approved by: https://github.com/albanD
2024-05-29 12:09:27 +00:00
William Wen
5359af0c7e [dynamo] wrap GraphModule exceptions in dynamo-wrapped tests (#126341)
Better approach to https://github.com/pytorch/pytorch/pull/126197 to catch issues like https://github.com/pytorch/pytorch/issues/125568.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126341
Approved by: https://github.com/anijain2305, https://github.com/jansel
2024-05-29 05:18:04 +00:00
Yu, Guangye
c09205a057 Deprecate device-specific GradScaler autocast API (#126527)
# Motivation

## for `torch.amp.GradScaler`,
- `torch.cpu.amp.GradScaler(args...)` is completely equivalent to `torch. amp.GradScaler("cpu", args...)`.
- `torch.cuda.amp.GradScaler(args...)` is completely equivalent to `torch.amp.GradScaler("cuda", args...)`.

So, we intend to depreate them and **strongly recommend** developer to use `torch.amp.GradScaler`.

## for `custom_fwd` and `custom_bwd`,
this is a good solution to make the custom function run with or without effect even in an autocast-enabled region and can be shared by other backends, like CPU and XPU.
So we generalize it to be device-agnostic and put them int `torch/amp/autocast_mode.py` and re-expose to `torch.amp.custom_fwd` and `torch.amp.custom_bwd`. Meanwhile, we deprecate `torch.cuda.amp.custom_fwd` and `torch.cuda.amp.custom_bwd`.

# Additional Context
Add UT to cover the deprecated warning.
No need for more UTs to cover the functionality of `torch.amp.custom_f/bwd`, the existing UTs that previously covered the functionality of `torch.cuda.amp.custom_f/bwd` can cover them.
To facilitate the review, we separate these code changes to two PRs. The first PR cover `torch.amp.GradScaler`. The follow-up covers `custom_fwd` and `custom_bwd`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126527
Approved by: https://github.com/jgong5, https://github.com/gujinghui, https://github.com/janeyx99, https://github.com/EikanWang
2024-05-25 06:41:34 +00:00
Matthew Hoffman
86ad101370 Enable pickling torch._C.Generator (#126271)
Fixes #71398

Add `__reduce__` and `__setstate__` methods for `torch._C.Generator`.

`__reduce__` returns a tuple of 3 values:

1. `torch.Generator` itself.
2. A one-element tuple containing the `torch.device` to create the `Generator` with, since this cannot be changed after the object is created.
3. The state, a three-element tuple: the initial seed, the offset (or `None` if a CPU `Generator`), and the RNG state tensor.

`__setstate__` calls `manual_seed`, `set_offset` (if not `None`), and `set_state` on each respective element of the state.

Added test demonstrating successful reserialization with cpu and cuda `Generator`s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126271
Approved by: https://github.com/ezyang
2024-05-22 14:38:47 +00:00
Nikita Shulga
a379ed6e98 Fix SobolEngine default dtype handling (#126781)
- Change default dtype argument to `None` and fetch it value via `torch.get_default_dtype()` call if not defined
- Fix bug in first draw handling logic, that would ignore dtype in favor of default one due to type promotion
- Add regression tests

Fixes https://github.com/pytorch/pytorch/issues/126478
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126781
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-05-22 01:55:48 +00:00
Tarunbir Gambhir
ad67553c5c Updated test_torch.py to use new OptimizerInfo infrastructure (#125538)
Fixes #123451 (only addresses test_torch.py cases)

This PR solves the specific task to update `test_grad_scaling_autocast` and `test_params_invalidated_with_grads_invalidated_between_unscale_and_step` in `test/test_torch.py` to use the new OptimizerInfo infrastructure.

I have combined tests that call `_grad_scaling_autocast_test` into one called `test_grad_scaling_autocast` and used `_get_optim_inputs_including_global_cliquey_kwargs` to avoid hard-coded configurations.

```
$ lintrunner test/test_cuda.py
ok No lint issues.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125538
Approved by: https://github.com/janeyx99
2024-05-18 15:42:45 +00:00
albanD
19a9de114a Forbid subclassing _TensorBase directly (#125558)
As per title.
This ensures that all the places where we assume the method defined in _tensor.py do exist.

BC-Breaking: This is bc-breaking as the user cannot subclass this private class anymore.
You should replace any use of _TensorBase to Tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125558
Approved by: https://github.com/ezyang
2024-05-08 20:29:29 +00:00
PyTorch MergeBot
a0e2f62edd Revert "Include support for the scatter gather cuda kernels to allow for comp… (#124809)"
This reverts commit 9e24c263f9.

Reverted https://github.com/pytorch/pytorch/pull/124809 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/124809#issuecomment-2091751002))
2024-05-02 21:36:18 +00:00
Danial Javady
9e24c263f9 Include support for the scatter gather cuda kernels to allow for comp… (#124809)
Fixes #121965

This PR hopes to add support complex numbers in the scatter/gather related kernels. For brevity, I will only include `complex<float>` for now as `complex<double>`, for example, will be more complicated.

C++ unit tests are currently passing alongside tests in `test_scatter_gather_ops.py`. Python test suites also seem to be passing.

Please keep the following in mind:
1) I think this is my first time using Pytorch.
2) This is my first contribution to Pytorch.

Environment:
3080 & WSL 2. `nvcc` is at 12.4.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124809
Approved by: https://github.com/mikaylagawarecki
2024-05-01 23:58:35 +00:00
PyTorch MergeBot
4d410155b2 Revert "Include support for the scatter gather cuda kernels to allow for comp… (#124809)"
This reverts commit e09f98c705.

Reverted https://github.com/pytorch/pytorch/pull/124809 on behalf of https://github.com/clee2000 due to windows build failure is real, https://github.com/pytorch/pytorch/actions/runs/8910674030/job/24470387612#step:11:11236 is the correct failure line, ignore the statement saying build passed, batch is errorcodes arent propagating again ([comment](https://github.com/pytorch/pytorch/pull/124809#issuecomment-2088680371))
2024-05-01 16:02:02 +00:00
Danial Javady
e09f98c705 Include support for the scatter gather cuda kernels to allow for comp… (#124809)
Fixes #121965

This PR hopes to add support complex numbers in the scatter/gather related kernels. For brevity, I will only include `complex<float>` for now as `complex<double>`, for example, will be more complicated.

C++ unit tests are currently passing alongside tests in `test_scatter_gather_ops.py`. Python test suites also seem to be passing.

Please keep the following in mind:
1) I think this is my first time using Pytorch.
2) This is my first contribution to Pytorch.

Environment:
3080 & WSL 2. `nvcc` is at 12.4.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124809
Approved by: https://github.com/eqy, https://github.com/mikaylagawarecki
2024-05-01 14:31:31 +00:00
Yifu Wang
91a4740e72 Disable the CUDA fast path for split_with_sizes_copy when capturing (#125052)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125052
Approved by: https://github.com/awgu, https://github.com/eellison, https://github.com/eqy
2024-04-27 07:59:39 +00:00
Aaron Orenstein
a8574a9719 Fix global flake8 issues (#124771)
Prior to this `lintrunner --all-files --take FLAKE8` failed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124771
Approved by: https://github.com/Skylion007
ghstack dependencies: #124428
2024-04-26 15:35:53 +00:00
PyTorch MergeBot
1ac60484c1 Revert "Fix global flake8 issues (#124771)"
This reverts commit f01275934b.

Reverted https://github.com/pytorch/pytorch/pull/124771 on behalf of https://github.com/jeanschmidt due to Unfortunately, I needed to revert #123735 and this one depends on it. So please check if there are no merge conflicts or breakages and feel free to merge this PR again ([comment](https://github.com/pytorch/pytorch/pull/124428#issuecomment-2078699836))
2024-04-26 06:15:17 +00:00
Aaron Orenstein
f01275934b Fix global flake8 issues (#124771)
Prior to this `lintrunner --all-files --take FLAKE8` failed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124771
Approved by: https://github.com/Skylion007
ghstack dependencies: #124428
2024-04-25 14:25:00 +00:00
Isuru Fernando
edcd968b51 Add out wrappers to some decompositions (#115437)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115437
Approved by: https://github.com/lezcano
2024-04-23 06:26:11 +00:00
Aaron Gokaslan
29cc293725 [BE]: FURB142 - Remove set mutations. Use set update (#124551)
Uses set mutation methods instead of manually reimplementing (update, set_difference etc).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124551
Approved by: https://github.com/ezyang
2024-04-21 14:12:33 +00:00
Aaron Gokaslan
5a1216bb2e [BE]: Update ruff to 0.4.1 (#124549)
Update ruff to 0.4.1 .
This version fixes a lot false negatives/false positives, is 20-40% faster, and has various other bug fixes.

Below is a before and after table showing the execution time of ruff lint and ruff format in milliseconds courtesy of https://astral.sh/blog/ruff-v0.4.0

| Repository                                         | Linter (v0.3) | Linter (v0.4) | Formatter (v0.3) | Formatter (v0.4) |
|----------------------------------------------------|---------------|---------------|------------------|------------------|
| [pytorch/pytorch](https://github.com/pytorch/pytorch) | 328.7         | 251.8         | 351.1            | 274.9            |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124549
Approved by: https://github.com/ezyang
2024-04-21 14:06:23 +00:00
Jane Xu
b412b75b42 [optim] add fused_adam/adamw_kernel support for CPU device (#123074)
On par with `CUDA` implementation.

For `autocast` logic, same with `CUDA` + `Fused Adam`:
 - check inf in `gradscalar.step`
 - In fused kernel, if there is `inf`, do nothing. If not, unscale the grad ( also write back) and update the param.

**TestPlan**:
```
# extend CUDA only test for CPU fused adagrad
python test_optim.py -k test_fused_matches_forloop
python test_optim.py -k test_fused_large_tensor
python test_torch.py -k test_grad_scaling_autocast_fused

# extend fused test
python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step
python test_optim.py -k test_can_load_older_state_dict

# newly added test (follow 6b1f13ea2f/test/test_cuda.py (L1108))
python test_optim.py -k test_grad_scaling_autocast_fused_optimizers
```

**Benchmark**:
**5.1x** on 56 core SPR
**Parameter-size=1M**
**Nparams=10**
[test script](https://gist.github.com/zhuhaozhe/ef9a290ad3f8f4067b3373a3bdaa33e7)

```
numactl -C 0-55 -m 0 python bench_adam.py
non-fused 6.0174267292022705 s
fused 1.1787631511688232 s
```

**Note: Fused kernel accuracy**
The accuracy failure in CI shows a little higher than default tolerance
```
2024-04-02T06:09:16.2213887Z Mismatched elements: 21 / 64 (32.8%)
2024-04-02T06:09:16.2214339Z Greatest absolute difference: 1.5735626220703125e-05 at index (6, 6) (up to 1e-05 allowed)
2024-04-02T06:09:16.2214813Z Greatest relative difference: 1.0073336852656212e-05 at index (4, 1) (up to 1.3e-06 allowed)
```
I have debug it step by step and unfortunately we may not able to make the `fused kernel` exactly same with `non fused` one due to compiler optimizations.
For example, in non-fused impl
```
exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2)
```
and in fused impl
```
  exp_avg_sq_ptr[d] = scalar_t(beta2) * exp_avg_sq_ptr[d];
  //  std::cout << "exp_avg_sq " <<   exp_avg_sq_ptr[d] << std::endl;
  exp_avg_sq_ptr[d] = exp_avg_sq_ptr[d] +
      scalar_t(exp_avg_sq_grad_coefficient) * grad_val * grad_val;
```
If I keep `std::cout`, I can get exactly same results in UT
```
===============param
0.6796758770942688
0.6796758770942688
```
But when I comment out it, there will be a difference
```
===============param
0.6796758770942688
0.6796759366989136
```
So I will make the tolerance a little higher than default one.

Co-authored-by: Jane Xu <janeyx@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123074
Approved by: https://github.com/jgong5, https://github.com/janeyx99
2024-04-19 11:14:04 +00:00
Sam Larsen
6502c888cf Enable fx graph cache in torch_test.py when using PYTORCH_TEST_WITH_INDUCTOR=1 (#122010)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122010
Approved by: https://github.com/eellison
2024-03-19 02:17:10 +00:00
Kurt Mohler
13a54ce279 Avoid COW materialization in at::parallel_for/parallel_reduce (#120455)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120455
Approved by: https://github.com/albanD
2024-03-01 05:05:28 +00:00
PyTorch MergeBot
86ff31c4a0 Revert "Avoid COW materialization in at::parallel_for/parallel_reduce (#120455)"
This reverts commit cabc09a5f2.

Reverted https://github.com/pytorch/pytorch/pull/120455 on behalf of https://github.com/izaitsevfb due to breaks xla jobs ([comment](https://github.com/pytorch/pytorch/pull/120455#issuecomment-1970026100))
2024-02-28 22:30:18 +00:00
Kurt Mohler
cabc09a5f2 Avoid COW materialization in at::parallel_for/parallel_reduce (#120455)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120455
Approved by: https://github.com/albanD
2024-02-28 00:37:33 +00:00
Sergii Dymchenko
bd9db6a9c7 Update to TorchFix 0.4.0 (#119424)
`torch.library.Library` updated to `torch.library._scoped_library` in files with many tests where it seems obvious to do, otherwise `noqa: TOR901` added - see https://github.com/pytorch/pytorch/pull/118318 for more context.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119424
Approved by: https://github.com/zou3519
2024-02-12 23:30:12 +00:00
Hirochika Matsumoto
02c24b0b5e Add Python binding resizable to class {Untyped,Typed}Storage (#119286)
This PR exposes `resizable` method of `StorageImpl` to Python frontend to make it accessible for users.

Fixes #119233

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119286
Approved by: https://github.com/ezyang, https://github.com/mikaylagawarecki
2024-02-07 19:15:55 +00:00
CaoE
113138aa55 add test cases for GradScaler on CPU (#109994)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109994
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-02-02 21:49:07 +00:00
Yifu Wang
0f7e63620f CUDA fast path for split_with_sizes_copy.out (#117203)
### Motivation
In per-parameter sharding FSDP, each rank holds one shard of every parameter. Before a bucket of parameters is used, FSDP performs all-gather to reconstruct the full parameters. The following example demonstrates the process for `world_size=2`, `num_params=3` (`A`, `B`, `C` standands for values in param `A`, `B`, `C`):

All-gather output:
```
AAAABBBCCAAAABBBCC
```

After all-gather-copy-out:
```
AAAAAAAA  BBBBBB  CCCC
```

The performance of all-gather-copy-out is crucial for the viability of per-parameter sharding FSDP. After thorough experiments, we believe that acceptable performance for this op is not achievable via composing existing ATen ops today.

We have proven that ideal performance is achievable with a [custom kernel](https://github.com/pytorch/pytorch/pull/115515). This PR aims to incorporate the optimizations to appropriate ATen ops (as suggested by @albanD).

### all-gather-copy-out via Composing ATen Ops

Carrying out the op out via composing ATen ops involves a combination of view ops and copy ops. After thorough experiments, we found that the most natural/performant way to express the op is via `split_with_sizes` + `_foreach_copy_`, which works as follows:

Reshape all-gather output as (world_size, -1):
```
AAAABBBCC
AAAABBBCC
```

`split_with_sizes` + `_foreach_copy_`:
```
AAAA BBB CC
AAAA BBB CC
```

However, the performance of this approach is still far below that of the custom kernel. We've identified the following reasons:
- The approach requires materializing `O(num_params)` intermediate views, which induces large amount of CPU overhead when `num_params` is high.
- `_foreach_copy_` uses the same block size all tensors, leading to waste for small tensors and insufficient thread count for large tensors. This means low effective occupancy.
- `_foreach_copy_` dispatches multiple kernels for typical problem sizes for all-gather-copy-out. This further lowers the effective occupancy.
- Due to the nature of the workload, the underlying copies are unaligned. `_foreach_copy_` isn't aggressive enough in exploiting vectorization oppurtunities in such workloads.

### PR
Introduces a CUDA backend for `split_with_sizes_copy.out` that addresses the above inefficiencies. See code for details.

### Benchmarks
The benchmarks are conducted on a set of representative problems sizes on an A100. CPU overhead and GPU execution time is measured separately, as reasonable CPU overhead doesn't directly affect e2e throughput. The reported copy bandwidth is calculated with GPU execution time.

Compared to the baseline, we observe 3x-10x higher throughput compared to the baseline depending on the problem size. We also observe lower CPU overhead across the board compared to the baseline.

Baseline:
```
num_params=150   world_size=8     mixed=True    Param size: 0.059 GB    Copy bandwidth: 67.564 GB/s (gpu ms/iter: 0.869, cpu ms/iter 10.460)
num_params=54    world_size=8     mixed=True    Param size: 1.453 GB    Copy bandwidth: 260.373 GB/s (gpu ms/iter: 5.582, cpu ms/iter 0.572)
num_params=54    world_size=8     mixed=True    Param size: 0.512 GB    Copy bandwidth: 239.585 GB/s (gpu ms/iter: 2.135, cpu ms/iter 0.587)
num_params=50    world_size=8     mixed=True    Param size: 0.200 GB    Copy bandwidth: 205.361 GB/s (gpu ms/iter: 0.976, cpu ms/iter 0.534)
num_params=3     world_size=8     mixed=True    Param size: 0.983 GB    Copy bandwidth: 268.397 GB/s (gpu ms/iter: 3.663, cpu ms/iter 0.084)
num_params=9     world_size=8     mixed=True    Param size: 0.802 GB    Copy bandwidth: 265.240 GB/s (gpu ms/iter: 3.024, cpu ms/iter 0.154)
num_params=3     world_size=8     mixed=True    Param size: 1.573 GB    Copy bandwidth: 268.918 GB/s (gpu ms/iter: 5.849, cpu ms/iter 0.087)
num_params=9     world_size=8     mixed=True    Param size: 2.248 GB    Copy bandwidth: 268.141 GB/s (gpu ms/iter: 8.384, cpu ms/iter 0.151)
num_params=150   world_size=128   mixed=True    Param size: 0.064 GB    Copy bandwidth: 73.237 GB/s (gpu ms/iter: 0.874, cpu ms/iter 10.664)
num_params=54    world_size=128   mixed=True    Param size: 1.458 GB    Copy bandwidth: 259.902 GB/s (gpu ms/iter: 5.609, cpu ms/iter 0.584)
num_params=54    world_size=128   mixed=True    Param size: 0.515 GB    Copy bandwidth: 238.703 GB/s (gpu ms/iter: 2.158, cpu ms/iter 0.612)
num_params=50    world_size=128   mixed=True    Param size: 0.203 GB    Copy bandwidth: 205.144 GB/s (gpu ms/iter: 0.987, cpu ms/iter 0.559)
num_params=3     world_size=128   mixed=True    Param size: 0.983 GB    Copy bandwidth: 270.467 GB/s (gpu ms/iter: 3.635, cpu ms/iter 0.073)
num_params=9     world_size=128   mixed=True    Param size: 0.802 GB    Copy bandwidth: 267.700 GB/s (gpu ms/iter: 2.997, cpu ms/iter 0.133)
num_params=3     world_size=128   mixed=True    Param size: 1.573 GB    Copy bandwidth: 268.913 GB/s (gpu ms/iter: 5.849, cpu ms/iter 0.093)
num_params=9     world_size=128   mixed=True    Param size: 2.248 GB    Copy bandwidth: 266.589 GB/s (gpu ms/iter: 8.433, cpu ms/iter 0.207)
num_params=150   world_size=1024  mixed=True    Param size: 0.202 GB    Copy bandwidth: 135.107 GB/s (gpu ms/iter: 1.495, cpu ms/iter 10.904)
num_params=54    world_size=1024  mixed=True    Param size: 1.524 GB    Copy bandwidth: 258.675 GB/s (gpu ms/iter: 5.890, cpu ms/iter 0.996)
num_params=54    world_size=1024  mixed=True    Param size: 0.575 GB    Copy bandwidth: 238.919 GB/s (gpu ms/iter: 2.408, cpu ms/iter 0.765)
num_params=50    world_size=1024  mixed=True    Param size: 0.246 GB    Copy bandwidth: 209.836 GB/s (gpu ms/iter: 1.172, cpu ms/iter 0.611)
num_params=3     world_size=1024  mixed=True    Param size: 1.007 GB    Copy bandwidth: 270.607 GB/s (gpu ms/iter: 3.720, cpu ms/iter 0.100)
num_params=9     world_size=1024  mixed=True    Param size: 0.818 GB    Copy bandwidth: 266.375 GB/s (gpu ms/iter: 3.071, cpu ms/iter 0.176)
num_params=3     world_size=1024  mixed=True    Param size: 1.611 GB    Copy bandwidth: 270.601 GB/s (gpu ms/iter: 5.952, cpu ms/iter 0.099)
num_params=9     world_size=1024  mixed=True    Param size: 2.248 GB    Copy bandwidth: 268.558 GB/s (gpu ms/iter: 8.371, cpu ms/iter 0.207)
num_params=150   world_size=8     mixed=False   Param size: 0.035 GB    Copy bandwidth: 43.749 GB/s (gpu ms/iter: 0.797, cpu ms/iter 10.531)
num_params=54    world_size=8     mixed=False   Param size: 0.961 GB    Copy bandwidth: 254.084 GB/s (gpu ms/iter: 3.781, cpu ms/iter 0.752)
num_params=54    world_size=8     mixed=False   Param size: 0.282 GB    Copy bandwidth: 216.792 GB/s (gpu ms/iter: 1.299, cpu ms/iter 0.717)
num_params=50    world_size=8     mixed=False   Param size: 0.149 GB    Copy bandwidth: 188.025 GB/s (gpu ms/iter: 0.793, cpu ms/iter 0.633)
num_params=3     world_size=8     mixed=False   Param size: 0.655 GB    Copy bandwidth: 267.793 GB/s (gpu ms/iter: 2.447, cpu ms/iter 0.107)
num_params=9     world_size=8     mixed=False   Param size: 0.634 GB    Copy bandwidth: 264.232 GB/s (gpu ms/iter: 2.401, cpu ms/iter 0.182)
num_params=3     world_size=8     mixed=False   Param size: 1.049 GB    Copy bandwidth: 268.455 GB/s (gpu ms/iter: 3.906, cpu ms/iter 0.089)
num_params=9     world_size=8     mixed=False   Param size: 1.711 GB    Copy bandwidth: 267.633 GB/s (gpu ms/iter: 6.394, cpu ms/iter 0.177)
num_params=150   world_size=128   mixed=False   Param size: 0.038 GB    Copy bandwidth: 46.698 GB/s (gpu ms/iter: 0.807, cpu ms/iter 10.488)
num_params=54    world_size=128   mixed=False   Param size: 0.963 GB    Copy bandwidth: 253.450 GB/s (gpu ms/iter: 3.799, cpu ms/iter 0.655)
num_params=54    world_size=128   mixed=False   Param size: 0.283 GB    Copy bandwidth: 216.857 GB/s (gpu ms/iter: 1.307, cpu ms/iter 0.671)
num_params=50    world_size=128   mixed=False   Param size: 0.151 GB    Copy bandwidth: 189.059 GB/s (gpu ms/iter: 0.799, cpu ms/iter 0.572)
num_params=3     world_size=128   mixed=False   Param size: 0.655 GB    Copy bandwidth: 269.849 GB/s (gpu ms/iter: 2.429, cpu ms/iter 0.078)
num_params=9     world_size=128   mixed=False   Param size: 0.634 GB    Copy bandwidth: 264.501 GB/s (gpu ms/iter: 2.399, cpu ms/iter 0.149)
num_params=3     world_size=128   mixed=False   Param size: 1.049 GB    Copy bandwidth: 268.426 GB/s (gpu ms/iter: 3.906, cpu ms/iter 0.086)
num_params=9     world_size=128   mixed=False   Param size: 1.711 GB    Copy bandwidth: 267.495 GB/s (gpu ms/iter: 6.398, cpu ms/iter 0.170)
num_params=150   world_size=1024  mixed=False   Param size: 0.122 GB    Copy bandwidth: 101.151 GB/s (gpu ms/iter: 1.211, cpu ms/iter 10.476)
num_params=54    world_size=1024  mixed=False   Param size: 1.000 GB    Copy bandwidth: 252.323 GB/s (gpu ms/iter: 3.963, cpu ms/iter 0.633)
num_params=54    world_size=1024  mixed=False   Param size: 0.318 GB    Copy bandwidth: 218.322 GB/s (gpu ms/iter: 1.455, cpu ms/iter 0.622)
num_params=50    world_size=1024  mixed=False   Param size: 0.185 GB    Copy bandwidth: 196.369 GB/s (gpu ms/iter: 0.944, cpu ms/iter 0.576)
num_params=3     world_size=1024  mixed=False   Param size: 0.671 GB    Copy bandwidth: 269.369 GB/s (gpu ms/iter: 2.491, cpu ms/iter 0.076)
num_params=9     world_size=1024  mixed=False   Param size: 0.645 GB    Copy bandwidth: 264.441 GB/s (gpu ms/iter: 2.439, cpu ms/iter 0.140)
num_params=3     world_size=1024  mixed=False   Param size: 1.074 GB    Copy bandwidth: 269.955 GB/s (gpu ms/iter: 3.978, cpu ms/iter 0.073)
num_params=9     world_size=1024  mixed=False   Param size: 1.711 GB    Copy bandwidth: 267.168 GB/s (gpu ms/iter: 6.405, cpu ms/iter 0.147)
```
New kernel:
```
num_params=150   world_size=8     mixed=True    Param size: 0.059 GB    Copy bandwidth: 560.946 GB/s (gpu ms/iter: 0.105, cpu ms/iter 1.066)
num_params=54    world_size=8     mixed=True    Param size: 1.453 GB    Copy bandwidth: 732.657 GB/s (gpu ms/iter: 1.984, cpu ms/iter 0.417)
num_params=54    world_size=8     mixed=True    Param size: 0.512 GB    Copy bandwidth: 753.514 GB/s (gpu ms/iter: 0.679, cpu ms/iter 0.419)
num_params=50    world_size=8     mixed=True    Param size: 0.200 GB    Copy bandwidth: 719.400 GB/s (gpu ms/iter: 0.279, cpu ms/iter 0.410)
num_params=3     world_size=8     mixed=True    Param size: 0.983 GB    Copy bandwidth: 782.121 GB/s (gpu ms/iter: 1.257, cpu ms/iter 0.098)
num_params=9     world_size=8     mixed=True    Param size: 0.802 GB    Copy bandwidth: 766.458 GB/s (gpu ms/iter: 1.047, cpu ms/iter 0.134)
num_params=3     world_size=8     mixed=True    Param size: 1.573 GB    Copy bandwidth: 790.611 GB/s (gpu ms/iter: 1.989, cpu ms/iter 0.099)
num_params=9     world_size=8     mixed=True    Param size: 2.248 GB    Copy bandwidth: 789.754 GB/s (gpu ms/iter: 2.847, cpu ms/iter 0.138)
num_params=150   world_size=128   mixed=True    Param size: 0.064 GB    Copy bandwidth: 565.667 GB/s (gpu ms/iter: 0.113, cpu ms/iter 0.996)
num_params=54    world_size=128   mixed=True    Param size: 1.458 GB    Copy bandwidth: 670.681 GB/s (gpu ms/iter: 2.174, cpu ms/iter 0.289)
num_params=54    world_size=128   mixed=True    Param size: 0.515 GB    Copy bandwidth: 676.135 GB/s (gpu ms/iter: 0.762, cpu ms/iter 0.264)
num_params=50    world_size=128   mixed=True    Param size: 0.203 GB    Copy bandwidth: 662.603 GB/s (gpu ms/iter: 0.306, cpu ms/iter 0.249)
num_params=3     world_size=128   mixed=True    Param size: 0.983 GB    Copy bandwidth: 769.283 GB/s (gpu ms/iter: 1.278, cpu ms/iter 0.078)
num_params=9     world_size=128   mixed=True    Param size: 0.802 GB    Copy bandwidth: 761.057 GB/s (gpu ms/iter: 1.054, cpu ms/iter 0.104)
num_params=3     world_size=128   mixed=True    Param size: 1.573 GB    Copy bandwidth: 774.325 GB/s (gpu ms/iter: 2.031, cpu ms/iter 0.075)
num_params=9     world_size=128   mixed=True    Param size: 2.248 GB    Copy bandwidth: 773.048 GB/s (gpu ms/iter: 2.908, cpu ms/iter 0.099)
num_params=150   world_size=1024  mixed=True    Param size: 0.202 GB    Copy bandwidth: 641.405 GB/s (gpu ms/iter: 0.315, cpu ms/iter 0.616)
num_params=54    world_size=1024  mixed=True    Param size: 1.524 GB    Copy bandwidth: 646.772 GB/s (gpu ms/iter: 2.356, cpu ms/iter 0.276)
num_params=54    world_size=1024  mixed=True    Param size: 0.575 GB    Copy bandwidth: 658.157 GB/s (gpu ms/iter: 0.874, cpu ms/iter 0.278)
num_params=50    world_size=1024  mixed=True    Param size: 0.246 GB    Copy bandwidth: 642.032 GB/s (gpu ms/iter: 0.383, cpu ms/iter 0.245)
num_params=3     world_size=1024  mixed=True    Param size: 1.007 GB    Copy bandwidth: 728.990 GB/s (gpu ms/iter: 1.381, cpu ms/iter 0.080)
num_params=9     world_size=1024  mixed=True    Param size: 0.818 GB    Copy bandwidth: 689.763 GB/s (gpu ms/iter: 1.186, cpu ms/iter 0.102)
num_params=3     world_size=1024  mixed=True    Param size: 1.611 GB    Copy bandwidth: 765.507 GB/s (gpu ms/iter: 2.104, cpu ms/iter 0.078)
num_params=9     world_size=1024  mixed=True    Param size: 2.248 GB    Copy bandwidth: 757.626 GB/s (gpu ms/iter: 2.967, cpu ms/iter 0.106)
num_params=150   world_size=8     mixed=False   Param size: 0.035 GB    Copy bandwidth: 584.272 GB/s (gpu ms/iter: 0.060, cpu ms/iter 0.656)
num_params=54    world_size=8     mixed=False   Param size: 0.961 GB    Copy bandwidth: 728.234 GB/s (gpu ms/iter: 1.319, cpu ms/iter 0.264)
num_params=54    world_size=8     mixed=False   Param size: 0.282 GB    Copy bandwidth: 730.059 GB/s (gpu ms/iter: 0.386, cpu ms/iter 0.279)
num_params=50    world_size=8     mixed=False   Param size: 0.149 GB    Copy bandwidth: 670.899 GB/s (gpu ms/iter: 0.222, cpu ms/iter 0.274)
num_params=3     world_size=8     mixed=False   Param size: 0.655 GB    Copy bandwidth: 775.699 GB/s (gpu ms/iter: 0.845, cpu ms/iter 0.077)
num_params=9     world_size=8     mixed=False   Param size: 0.634 GB    Copy bandwidth: 773.612 GB/s (gpu ms/iter: 0.820, cpu ms/iter 0.112)
num_params=3     world_size=8     mixed=False   Param size: 1.049 GB    Copy bandwidth: 781.395 GB/s (gpu ms/iter: 1.342, cpu ms/iter 0.081)
num_params=9     world_size=8     mixed=False   Param size: 1.711 GB    Copy bandwidth: 789.156 GB/s (gpu ms/iter: 2.169, cpu ms/iter 0.116)
num_params=150   world_size=128   mixed=False   Param size: 0.038 GB    Copy bandwidth: 517.056 GB/s (gpu ms/iter: 0.073, cpu ms/iter 0.632)
num_params=54    world_size=128   mixed=False   Param size: 0.963 GB    Copy bandwidth: 684.246 GB/s (gpu ms/iter: 1.407, cpu ms/iter 0.294)
num_params=54    world_size=128   mixed=False   Param size: 0.283 GB    Copy bandwidth: 680.593 GB/s (gpu ms/iter: 0.416, cpu ms/iter 0.286)
num_params=50    world_size=128   mixed=False   Param size: 0.151 GB    Copy bandwidth: 682.197 GB/s (gpu ms/iter: 0.221, cpu ms/iter 0.255)
num_params=3     world_size=128   mixed=False   Param size: 0.655 GB    Copy bandwidth: 759.470 GB/s (gpu ms/iter: 0.863, cpu ms/iter 0.074)
num_params=9     world_size=128   mixed=False   Param size: 0.634 GB    Copy bandwidth: 765.694 GB/s (gpu ms/iter: 0.829, cpu ms/iter 0.094)
num_params=3     world_size=128   mixed=False   Param size: 1.049 GB    Copy bandwidth: 766.535 GB/s (gpu ms/iter: 1.368, cpu ms/iter 0.075)
num_params=9     world_size=128   mixed=False   Param size: 1.711 GB    Copy bandwidth: 787.608 GB/s (gpu ms/iter: 2.173, cpu ms/iter 0.105)
num_params=150   world_size=1024  mixed=False   Param size: 0.122 GB    Copy bandwidth: 640.203 GB/s (gpu ms/iter: 0.191, cpu ms/iter 0.668)
num_params=54    world_size=1024  mixed=False   Param size: 1.000 GB    Copy bandwidth: 713.947 GB/s (gpu ms/iter: 1.401, cpu ms/iter 0.274)
num_params=54    world_size=1024  mixed=False   Param size: 0.318 GB    Copy bandwidth: 642.855 GB/s (gpu ms/iter: 0.494, cpu ms/iter 0.276)
num_params=50    world_size=1024  mixed=False   Param size: 0.185 GB    Copy bandwidth: 643.297 GB/s (gpu ms/iter: 0.288, cpu ms/iter 0.262)
num_params=3     world_size=1024  mixed=False   Param size: 0.671 GB    Copy bandwidth: 690.626 GB/s (gpu ms/iter: 0.972, cpu ms/iter 0.078)
num_params=9     world_size=1024  mixed=False   Param size: 0.645 GB    Copy bandwidth: 754.431 GB/s (gpu ms/iter: 0.855, cpu ms/iter 0.109)
num_params=3     world_size=1024  mixed=False   Param size: 1.074 GB    Copy bandwidth: 769.985 GB/s (gpu ms/iter: 1.395, cpu ms/iter 0.080)
num_params=9     world_size=1024  mixed=False   Param size: 1.711 GB    Copy bandwidth: 766.337 GB/s (gpu ms/iter: 2.233, cpu ms/iter 0.103)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117203
Approved by: https://github.com/albanD, https://github.com/awgu
ghstack dependencies: #118512
2024-02-01 18:23:01 +00:00
CaoE
bacbad5bc9 add GradScaler on CPU (#109993)
Step 2 of https://github.com/pytorch/pytorch/issues/111559.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109993
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-01-29 23:42:35 +00:00
Edward Z. Yang
46712b019d Enable local_partial_types (#118467)
When using dmypy, this setting is enabled and cannot be turned off. Force it for regular mypy too.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118467
Approved by: https://github.com/Skylion007
ghstack dependencies: #118414, #118418, #118432
2024-01-28 13:38:22 +00:00
Mikayla Gawarecki
41a56f7828 Fix swap_tensors to swap PyObjects associated with TensorImpl (#116955)
This PR intends to fix the following issue when swapping two tensors

```python
>>> import torch
>>> torch.manual_seed(5)
>>> t1 = torch.randn(2)
>>> t2 = torch.randn(3)
>>> t1
tensor([-0.4868, -0.6038])
>>> t2
tensor([-0.5581,  0.6675, -0.1974])
>>> torch.utils.swap_tensors(t1, t2)
>>> t1
tensor([-0.5581,  0.6675, -0.1974])
>>> t2
tensor([-0.4868, -0.6038])
>>> t1.fill_(0.5) # t1 back to its unswapped state :o
tensor([-0.4868, -0.6038])
```

What happens here is that in `THPVariable_Wrap` (which is used when going back from C++ --> Python), we check if the TensorImpl of the tensor to be returned already has a pointer to a PyObject in its PyObject slot. If this is the case then this object is returned.

57491d2046/torch/csrc/autograd/python_variable.cpp (L271-L292)

When we run any operation that returns the same TensorImpl (e.g. inplace op, `t.to(dtype=t.dtype)`, etc.), although `t1` now has `t2`'s TensorImpl, `t2`'s TensorImpl still has a reference to `t2`, so when we do the op on `t1` and `THPVariable_Wrap` attempts to return the pointer to the TensorImpl's PyObject, we return a pointer to `t2` instead.

The TensorImpl should have the PyObjects in their PyObjectSlots swapped as well in `swap_tensors`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116955
Approved by: https://github.com/albanD
2024-01-24 01:40:18 +00:00
Kurt Mohler
cd084c4909 Add TensorIteratorConfig::add_const_input to avoid COW materialize (#118053)
Part of #97856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118053
Approved by: https://github.com/ezyang
2024-01-23 22:32:39 +00:00
Oguz Ulgen
3b38f7b266 Remove skips for passing tests (#118000)
These tests were already passing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118000
Approved by: https://github.com/yanboliang
2024-01-23 16:11:38 +00:00
PyTorch MergeBot
bb28965924 Revert "Remove skips for passing tests (#118000)"
This reverts commit 3c339b5b21.

Reverted https://github.com/pytorch/pytorch/pull/118000 on behalf of https://github.com/oulgen due to test passing on diff but failing on hud... ([comment](https://github.com/pytorch/pytorch/pull/118000#issuecomment-1905351752))
2024-01-23 06:10:25 +00:00
Oguz Ulgen
3c339b5b21 Remove skips for passing tests (#118000)
These tests were already passing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118000
Approved by: https://github.com/yanboliang
2024-01-23 03:41:23 +00:00
haozhe.zhu@intel.com
0ae952db76 enable mkldnn bf32 matmul (#116015)
### Testing
FP32 matmul vs. mkldnn BF32 matmul  on SPR

single core:

Input | BF32   / ms | FP32  /   ms | Speed up
-- | -- | -- | --
M: 128, N: 128, K: 128, trans_a: False, trans_b: False | 32.842 | 38.279 | 1.165
M: 128, N: 256, K: 128, trans_a: False, trans_b: False | 38.590 | 73.967 | 1.917
M: 8192, N: 768, K: 768, trans_a: False, trans_b: False | 18456.267 | 74588.002 | 4.041

56 cores:
Input | BF32   / ms | FP32 /   ms | Speed up
-- | -- | -- | --
M: 8192, N: 768, K: 768, trans_a: False, trans_b: False | 1199.400 | 1715.548 | 1.430
M: 8192, N: 768, K: 768, trans_a: False, trans_b: True |1129.204 | 1708.912 |  1.513
M: 8192, N: 768, K: 3072, trans_a: False, trans_b: False | 3655.915  | 7992.877 | 2.186
M: 8192, N: 768, K: 3072, trans_a: False, trans_b: True | 3707.993 |  8026.191 | 2.165
Batch: 768, M: 128, N: 64, K: 128  | 1296.419 | 1308.411 | 1.009

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116015
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-01-20 09:30:23 +00:00
CaoE
29516bd2a0 add _amp_foreach_non_finite_check_and_unscale_cpu_ and _amp_update_scale_cpu_ kernels on CPU (#109281)
Step1 of https://github.com/pytorch/pytorch/issues/111559.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109281
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-01-16 15:25:08 +00:00
Edward Z. Yang
2200118f59 Enable some uint{16,32,64} tests that are working (#116809)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116809
Approved by: https://github.com/albanD
2024-01-15 02:25:21 +00:00
Edward Z. Yang
edec54b9de Add torch._lazy_clone to create COW tensors (#113397)
Part of #109833

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* __->__ #113397
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113397
Approved by: https://github.com/ezyang
2024-01-11 01:32:44 +00:00
Edward Z. Yang
8bcdde5058 Support uint{16,32,64} deterministic empty fill and scalar Python binding handling (#116807)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116807
Approved by: https://github.com/albanD
ghstack dependencies: #116805, #116806
2024-01-10 02:17:23 +00:00
Edward Z. Yang
43a23a704a Support uint{16,32,64} copy (#116806)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116806
Approved by: https://github.com/albanD
ghstack dependencies: #116805
2024-01-10 02:17:23 +00:00
Edward Z. Yang
2e983fcfd3 Support unsigned int for randint, item, equality, fill, iinfo, tensor (#116805)
These are some basic utilities that are often used for testing.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116805
Approved by: https://github.com/albanD
2024-01-10 02:17:23 +00:00
Aaron Gokaslan
3fe437b24b [BE]: Update flake8 to v6.1.0 and fix lints (#116591)
Updates flake8 to v6.1.0 and fixes a few lints using sed and some ruff tooling.
- Replace `assert(0)` with `raise AssertionError()`
- Remove extraneous parenthesis i.e.
  - `assert(a == b)` -> `assert a == b`
  - `if(x > y or y < z):`->`if x > y or y < z:`
  - And `return('...')` -> `return '...'`

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116591
Approved by: https://github.com/albanD, https://github.com/malfet
2024-01-03 06:04:44 +00:00
Aaron Gokaslan
bd10fea79a [BE]: Enable F821 and fix bugs (#116579)
Fixes #112371

I tried to fix as many of the bugs as I could, a few I could not figure out what the proper fix for them was though and so I left them with noqas.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116579
Approved by: https://github.com/ezyang
2024-01-01 08:40:46 +00:00
Yanbo Liang
f657b2b1f8 [Dynamo][10/N] Remove TorchVariable and is_allowed (#116312)
After this refactor:
* ```TorchVariable``` definition and all references are removed.
* All ```is_allowed``` references except one are removed.
  - The only left one is in ```torch/_dynamo/decorators:_disallow_in_graph_helper```. It was called when users put ```disallow_in_graph``` decorator on a function. Since we use the lists in ```trace_rules``` to decide the function's trace rule, so the decorator would only be used as customer function rather than torch functions. I'll defer this to a separate decorator refactor PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116312
Approved by: https://github.com/jansel
2023-12-27 18:47:05 +00:00
PyTorch MergeBot
3b709d7c1e Revert "[Dynamo][10/N] Remove TorchVariable and is_allowed (#116312)"
This reverts commit 015bd0e0a1.

Reverted https://github.com/pytorch/pytorch/pull/116312 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/116312#issuecomment-1869825506))
2023-12-26 23:47:15 +00:00
Yanbo Liang
015bd0e0a1 [Dynamo][10/N] Remove TorchVariable and is_allowed (#116312)
After this refactor:
* ```TorchVariable``` definition and all references are removed.
* All ```is_allowed``` references except one are removed.
  - The only left one is in ```torch/_dynamo/decorators:_disallow_in_graph_helper```. It was called when users put ```disallow_in_graph``` decorator on a function. Since we use the lists in ```trace_rules``` to decide the function's trace rule, so the decorator would only be used as customer function rather than torch functions. I'll defer this to a separate decorator refactor PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116312
Approved by: https://github.com/jansel
2023-12-23 09:44:09 +00:00
Mikayla Gawarecki
f206e31e2f Swap slots if slots match in swap_tensor (#116128)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116128
Approved by: https://github.com/albanD
2023-12-21 00:43:30 +00:00
Kurt Mohler
8a8d0adc0b Fix troch.gradient check for spacing arg list length (#115686)
Fixes #114207

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115686
Approved by: https://github.com/albanD
2023-12-13 20:17:20 +00:00