Commit Graph

86268 Commits

Author SHA1 Message Date
morotti
c0991b0316 README: anaconda license violation / no longer recommend anaconda since it's no longer free to use (#150619)
hello,

I was going over the documentation to build pytorch from source.
Unfortunately, the first thing that come up is that you strongly recommend to use anaconda, which shouldn't be used because it's no longer free to use.
Could you please remove that from the doc?

I don't know if you are aware but anaconda is no longer free.
They changed their terms of service in 2020 to restrict commercial usage.
They changed their terms of service in 2024 to forbid downloading anaconda and forbid education and non-profit usage too.
The download is open and doesn't require any registration, but if you download anaconda they will sue you ^^

They started raining lawsuits against users since last year. You may have heard about anaconda vs intel in the news. They started another 5 or so in the last few months.
https://www.reuters.com/legal/litigation/intel-sued-copyright-infringement-over-ai-software-2024-08-09/

You may need to adjust more doc and adjust your build system. The free to use alternatives are miniforge with the conda-forge channel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150619
Approved by: https://github.com/seemethere
2025-04-08 02:10:31 +00:00
CaoE
d7f3cd0ac3 Add Half support for weight_norm on CPU (#148878)
Fixes #148867.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148878
Approved by: https://github.com/leslie-fang-intel, https://github.com/cyyever, https://github.com/albanD
2025-04-08 01:12:29 +00:00
Nikita Shulga
5228986c39 [CUDA] Only use vec128 if CUDA version is newer than 12.8 (#150705)
By addressing a feedback requested at https://github.com/pytorch/pytorch/pull/145746
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150705
Approved by: https://github.com/atalman
2025-04-08 00:46:13 +00:00
Akash Verma
e9e5682a4a [ROCm] Build Pytorch extensions with amdclang++ (#150451)
Here are the following modifications made to cpp_extension.py- 1) Changed compiler flag to use --version.
2) Added a feature to convert alpha-numeric string to numeric string for the version string returned by compiler. This was the source of error as the parser was failing on parsing alpha-numeric version string.

Build with following pytorch extensions- Apex, TorchVision, TorchAudio & DeepSpeed.
Unit tested with following pytorch extensions- Apex, TorchVision.

(cherry picked from commit c873aeac35851a7d5000eb7f24561d3f56c2ffbd)

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150451
Approved by: https://github.com/jeffdaily
2025-04-07 23:31:29 +00:00
Hexin Wang
91173ff89a Fixing NCCL abort hang issue when a ProcessGroupNCCL manages multiple ncclComms (#150690)
Detail of the issue:

If PyTorch issues send/recv to each 2 rank comm, and these comms are managed by a single ProcessGroupNCCL instance, then comms need to abort either in sequence or in group.

I.e. the following sequential abort will cause hang in NCCL. recv(..., comm0, stream);
send(..., comm1, stream);
abort(comm1);
abort(comm0);

Fixes #119797

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150690
Approved by: https://github.com/kwen2501
2025-04-07 23:20:49 +00:00
Animesh Jain
6ea5514e04 [invoke_subgraph] Lazy backward (#150666)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150666
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
2025-04-07 22:44:43 +00:00
Ankita George
78fe079c97 Support having no metadata file for HuggingFaceStorageReader (#150701)
Summary: If there is only one safetensors file, we don't need users to have a metadata file and we can just construct it from the keys of that file. This is a use-case for some HuggingFace models, so adding support for it

Test Plan:
ensure existing tests pass
tested e2e in a notebook

Differential Revision: D72472490

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150701
Approved by: https://github.com/joecummings
2025-04-07 22:10:39 +00:00
Nikita Shulga
fbccbfedaf [BE] Fix Amp.metal compilation warning (#150783)
Deleting unused `uint tid` fixes
```
[114/1416] Compiling /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/Amp.metal to Amp_30.air
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/Amp.metal:70:10: warning: unused parameter 'tid' [-Wunused-parameter]
    uint tid [[thread_position_in_grid]]) {
         ^
1 warning generated.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150783
Approved by: https://github.com/wdvr, https://github.com/atalman
2025-04-07 22:05:00 +00:00
Max Ren
eba05e2d3e [AO] Refactor convert and add QuantAffinePlaceholderObserver (#150644)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150644
Approved by: https://github.com/jerryzh168
ghstack dependencies: #150642, #150643
2025-04-07 20:52:45 +00:00
Max Ren
5653fb3525 [AO] Add Moving Average Affine Observer (#150643)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150643
Approved by: https://github.com/jerryzh168
ghstack dependencies: #150642
2025-04-07 20:52:45 +00:00
Max Ren
ed0dea3e24 [AO] update port_metadata_pass to support quant_affine ops (#150642)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150642
Approved by: https://github.com/jerryzh168
2025-04-07 20:52:44 +00:00
PyTorch MergeBot
bf1132c196 Revert "Generalize poison fork logic for each device backend (#144664)"
This reverts commit d86c14156d.

Reverted https://github.com/pytorch/pytorch/pull/144664 on behalf of https://github.com/atalman due to failing periodic test: python test/test_cpp_extensions_mtia_backend.py TestCppExtensionMTIABackend.test_device_context ([comment](https://github.com/pytorch/pytorch/pull/144664#issuecomment-2784506104))
2025-04-07 20:09:53 +00:00
Pian Pawakapan
f8b53f4a75 [export] raise when Dim.DYNAMIC 0/1 specializes (#150716)
Previously we didn't catch this, mark_dynamic() just doesn't allocate a symbol for it

Differential Revision: D72486930

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150716
Approved by: https://github.com/angelayi
2025-04-07 18:58:42 +00:00
Sam Larsen
2a1e2b88ed [logging] Add pgo remote get/put timings to dynamo_compile (#150322)
Test Plan: https://fburl.com/scuba/dynamo_compile/sandbox/xf950tw8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150322
Approved by: https://github.com/ppanchalia
2025-04-07 18:08:26 +00:00
Annop Wongwathanarat
6fcffd8cd1 Optimize SVE embedding performance (#150176)
Change loop unrolling strategy. Previously, the script only unrolls the inner loop over block_size when block size is multiple of vector length. This version instead unrolls the outer loop which reduces the number of load/store for accumulation into the output array and improves performance for cases when block size is not multiple of vector length.

Benchmarking script:
```python
# SPDX-FileCopyrightText: Copyright 2025 Arm Limited and/or its affiliate <open-source-office@arm.com>
# SPDX-License-Identifier: BSD-3-Clause
import torch
import torch.nn as nn
import numpy as np
import time
import sys

np.random.seed(0)
torch.manual_seed(0)

num_embeddings = 400000
embedding_dim = int(sys.argv[1])
multi_hot = 100
batch_size = 400
nrun = 1000

class SimpleEmbeddingBagModel(nn.Module):
    def __init__(self, num_embeddings, embedding_dim):
        super(SimpleEmbeddingBagModel, self).__init__()

        weights = torch.from_numpy((np.random.random_sample((num_embeddings, embedding_dim)) + 1).astype(np.float32)).to(torch.float16)

        # Defining the EmbeddingBag layer
        self.embedding_bag = torch.nn.EmbeddingBag(num_embeddings, embedding_dim, _weight=weights,
                                                   mode='sum', include_last_offset=True, dtype=torch.float32)

    def forward(self, input, offsets):
        # Forward pass through the EmbeddingBag layer
        result32 = self.embedding_bag(input, offsets, per_sample_weights=None)
        return result32

# Instantiate the model
model = SimpleEmbeddingBagModel(num_embeddings=num_embeddings, embedding_dim=embedding_dim)
model.eval()

# Example input
input_tensor = torch.randint(0, num_embeddings, (batch_size * multi_hot,), dtype=torch.long)

offsets = torch.tensor(range(0, batch_size * multi_hot + 1, multi_hot))

with torch.no_grad():
    # warm up
    output32 = model(input_tensor, offsets)

    ti = time.time_ns()
    for i in range(nrun):
        _ = model(input_tensor, offsets)
    tf = time.time_ns()
    print("{:3d} {:.3E}".format(embedding_dim, (tf-ti)/nrun/1.e6))
```
Speedup on NEOVERSEV1 with 1 thread
![embedding](https://github.com/user-attachments/assets/16e567ed-b9a5-4db3-90b8-dec66d5414a7)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150176
Approved by: https://github.com/digantdesai, https://github.com/malfet
2025-04-07 18:01:54 +00:00
Saurabh Mishra
7d2411d30e [DCP][OSS] Introduce barrier util in the DistWrapper for rank local checkpointing (#150748)
Summary: Introduce barrier util in the DistWrapper for rank local checkpointing. This barrier will be used at the end of the rank local checkpointing to ensure all ranks synchronize.

Test Plan: UTs

Differential Revision: D72541431

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150748
Approved by: https://github.com/MeetVadakkanchery
2025-04-07 17:33:07 +00:00
Isuru Fernando
957faaadca Avoid overflow in vector_norm for scalar input (#144073)
Fixes https://github.com/pytorch/pytorch/issues/143960 where torch.dist gave different results from eager due to vector_norm overflowing and eager mode avoids the overflow for single element reductions by not computing the power and then the root.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144073
Approved by: https://github.com/eellison, https://github.com/laithsakka
2025-04-07 17:10:10 +00:00
fduwjj
06e9deabb6 [c10d][fr] Improve FR dump robustness with all watchdog broadcast wait and more frequent store check (#150652)
When debugging FR missing dump and missing dump logs, I have couple initial findings:
1. On the same rank, if a second watchdog timeout triggers on a different PG(or subPG), that watchdog thread will immediately throw exception instead of sleeping. We want to fix that by still making the watchdog thread to wait for 1 min.
2. The FR dump takes about 900ms to 1200ms so, we are not checking the store frequently enough. But instead of changing the frequency from 1sec to 300ms, we finally decided to just let all ranks just sleep for 1 min universally rather than using a promise.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150652
Approved by: https://github.com/kwen2501
2025-04-07 16:33:27 +00:00
jpvillam
56ab71de98 [ROCm] Expand workspace size for gfx95 (#150632)
Use same workspace size for gfx95* as gfx94*

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150632
Approved by: https://github.com/jeffdaily

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
2025-04-07 16:05:56 +00:00
shiyang-weng
0ad2c5d7e2 Add RECORD_FUNCTION for AOTI (#150150)
Only add RECORD_FUNCTION for shim_fn now.
Next step need to add RECORD_FUNCTION for all the aoti_torch_* functions.

Fixes https://github.com/pytorch/pytorch/issues/148650

Some code gen by aoti
```c++
    AtenTensorHandle buf1_handle;
    AtenTensorHandle buf2_handle;
    AtenTensorHandle buf3_handle;
    AtenTensorHandle buf4_handle;
    {RECORD_FUNCTION("aoti_torch_cpu__embedding_bag", c10::ArrayRef<c10::IValue>());AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_cpu__embedding_bag(L__self___sparse_arch_embedding_bag_collection_embedding_bags_t_cat_0_weight, arg80_1, arg81_1, 0, 0L, 0, nullptr, 1, -1L, &buf1_handle, &buf2_handle, &buf3_handle, &buf4_handle));}
    RAIIAtenTensorHandle buf1(buf1_handle);
    RAIIAtenTensorHandle buf2(buf2_handle);
    RAIIAtenTensorHandle buf3(buf3_handle);
    RAIIAtenTensorHandle buf4(buf4_handle);
    arg80_1.reset();
    arg81_1.reset();
```

On trace
```
{
  "name": "aoti_torch_cpu__embedding_bag",
  "ph": "X",
  "ts": 68874.450000,
  "dur": 361.291000,
  "tid": 2,
  "pid": "CPU Functions",
  "args": {}
},
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150150
Approved by: https://github.com/desertfire, https://github.com/EikanWang
2025-04-07 15:12:29 +00:00
Benjamin Glass
f813d64f54 cpp_wrapper: Fix even more tests (#147225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147225
Approved by: https://github.com/desertfire
ghstack dependencies: #150671, #150672
2025-04-07 14:20:06 +00:00
Benjamin Glass
f0abbabac1 AOTI fallback ops: sort alphabetically (#150672)
This is just a housekeeping task that makes the listed fallback op order match what's in the generated C shim files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150672
Approved by: https://github.com/desertfire
ghstack dependencies: #150671
2025-04-07 14:20:06 +00:00
Benjamin Glass
5e3c8214b5 cpp_wrapper: Re-enable code disabled for forward compatibility (#150671)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150671
Approved by: https://github.com/desertfire
2025-04-07 14:20:06 +00:00
Shivam Raikundalia
99c9a31386 [submodule] [Snapshot/Profiler] Memory Snapshot On Demand (#150559)
Summary:
Profiler side of memory snapshot.

1. Add API to actually do snapshot when client interface is called
2. Add ifdefs to builds so that kineto hooks snapshot correctly.

Design Philosophy: There is one interesting part of this implementation and it is during export. For export we are callign the python impl of the export rather than CPP even though we are already in CPP. This is because it is better to simply have one path of export rather than 2. Personally, I want there to be parity between auto-trace and on-demand so it if we can limit the side paths then we will have an easier time maintaining this relationship

Test Plan: {F1976563426}

Reviewed By: sanrise

Differential Revision: D70733247

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150559
Approved by: https://github.com/sanrise
2025-04-07 13:04:38 +00:00
Zain Huda
e209625334 [torchrec] update local_shards_wrapper to latest version (#150469)
Summary: Adding new ops, support for empty shards, and fixed initializations for downstream checkpointing.

Test Plan: buck2 run 'fbcode//mode/dev-nosan' fbcode//torchrec/distributed/tests:test_shards_wrapper

Differential Revision: D72271275

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150469
Approved by: https://github.com/XilunWu
2025-04-07 13:00:52 +00:00
PyTorch UpdateBot
cdf3b63e32 Update slow tests (#150283)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150283
Approved by: https://github.com/pytorchbot
2025-04-07 11:49:59 +00:00
PyTorch UpdateBot
25662d38d5 [xla hash update] update the pinned xla hash (#132021)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132021
Approved by: https://github.com/pytorchbot
2025-04-07 11:35:56 +00:00
Kurt Mohler
164d2c887b Add check in test_cow_input to ensure COW data is never changed (#150723)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150723
Approved by: https://github.com/Skylion007
2025-04-07 04:35:00 +00:00
Zhengxu Chen
24aadb40fb [precompile] Serialization for GlobalStateGuard (#150636)
Summary: To preserve global state guards we need to make the C++ type serialzable. Using json because it's easier to do and we don't have a lot of data in global state.

Test Plan: test_dynamo -k test_global_state_guard_serialization

Differential Revision: D72410611

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150636
Approved by: https://github.com/williamwen42
2025-04-07 03:10:03 +00:00
eellison
b6929aef08 Fix conv2d strided prologue (#150697)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150697
Approved by: https://github.com/drisspg
2025-04-07 02:26:58 +00:00
Yu, Guangye
d86c14156d Generalize poison fork logic for each device backend (#144664)
# Motivation
Generalize the posion_fork code to make it reusable across different devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144664
Approved by: https://github.com/EikanWang, https://github.com/albanD
2025-04-07 02:06:21 +00:00
Han, Chao1
d98575806b Generalize compile collective to avoid cuda-bias (#150405)
Fixes https://github.com/intel/torch-xpu-ops/issues/1527
Let the combination of `compile` and `collective` to support more devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150405
Approved by: https://github.com/guangyey, https://github.com/jansel

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-04-07 01:54:20 +00:00
Richard Barnes
d8d306cbc6 Suppress -Wunused-function for DSA (#150735)
Test Plan: Sandcastle

Reviewed By: dtolnay

Differential Revision: D72458590

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150735
Approved by: https://github.com/eqy, https://github.com/cyyever
2025-04-07 01:47:35 +00:00
Richard Barnes
370ba6b96f [codemod] Fix -Wambiguous-reversed-operator in aten/src/ATen/cuda/tunable/Tunable.h (#150744)
Summary:
`-Wambiguous-reversed-operator` warns about ambiguous reversed operators, e.g. `a < b` and `b > a` are both valid. Such operators are disallowed in C++20. This codemod fixes the warnings.

#buildsonlynotests - If this diff compiles, it works.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Differential Revision: D72535527

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150744
Approved by: https://github.com/drisspg
2025-04-07 01:45:03 +00:00
Paul Ganssle
47b494ef69 Add type hints to _tensor_docs.add_docstr_all (#150715)
There is some sort of bug in `pytype` where if this function doesn't have type hints, `pytype` will spend 10 minutes inferring the types. Not that this matters much for a project not using `pytype`, but it led me to realize that this function could easily be type hinted and is not, so here is a PR adding some type hints.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150715
Approved by: https://github.com/Skylion007
2025-04-06 22:25:34 +00:00
Scott Wolchok
0aaf35310a Overload unary - operator on at::vec::Vectorized to call neg() (#150568)
Makes Vectorized look even more like a scalar type, getting me closer to being able to use the same generic code with scalars and Vectorized (e.g., for sigmoid, which needs `exp(-x)`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150568
Approved by: https://github.com/Skylion007
ghstack dependencies: #150380
2025-04-06 21:12:27 +00:00
Scott Wolchok
912102b4ec Make at::vec::Vectorized ops work with scalars (#150380)
I noticed that I couldn't use `vec::Vectorized` operations with scalars, even though there is an implicit conversion from `T` to `vec::Vectorized<T>`, so I made it work.

Test Plan: Added tests. Reverted vec_base.h, left the new tests in place, and confirmed that new tests don't compile in that state.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150380
Approved by: https://github.com/Skylion007
2025-04-06 21:12:27 +00:00
Eddie Yan
8adfcd35c3 [cuDNN][SDPA] Loosen constraints for GQA for cuDNN Attention (#150337)
cuDNN attention doesn't require key and value tensors to have the same number of heads

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150337
Approved by: https://github.com/drisspg
2025-04-06 20:31:11 +00:00
Bin Bao
6a8ab902a2 [AOTI][dashboard] Fix mis-calculated memory compression ratio (#150695)
Summary: https://github.com/pytorch/pytorch/pull/149817 introduced an extra warmup run to compute AOTI memory compression ratio, but since weights are only loaded once in the AOTI run, the peak memory seen in the extra warmup won't include the weight, which causes an aritifically high memory compression ratio. This PR removes that extra warmup run, and calls reset_peak_memory_stats in the proper place instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150695
Approved by: https://github.com/yushangdi
2025-04-06 19:51:22 +00:00
Randolf Scholz
6c38b9be73 [typing] Add type hints to __init__ methods in torch.distributions. (#144197)
Fixes #144196
Extends #144106 and #144110

## Open Problems:

- [ ] Annotating with `numbers.Number` is a bad idea, should consider using `float`, `SupportsFloat` or some `Procotol`. https://github.com/pytorch/pytorch/pull/144197#discussion_r1903324769

# Notes

- `beta.py`: needed to add `type: ignore` since `broadcast_all` is untyped.
- `categorical.py`: converted `else` branches of mutually exclusive arguments to `if` branch[^2].
- ~~`dirichlet.py`: replaced `axis` with `dim` arguments.~~ #144402
- `gemoetric.py`: converted `else` branches of mutually exclusive arguments to `if` branch[^2].
- ~~`independent.py`: fixed bug in `Independent.__init__` where `tuple[int, ...]` could be passed to `Distribution.__init__` instead of `torch.Size`.~~ **EDIT:** turns out the bug is related to typing of `torch.Size`. #144218
- `independent.py`: made `Independent` a generic class of its base distribution.
- `multivariate_normal.py`: converted `else` branches of mutually exclusive arguments to `if` branch[^2].
- `relaxed_bernoulli.py`: added class-level type hint for `base_dist`.
- `relaxed_categorical.py`: added class-level type hint for `base_dist`.
- ~~`transforms.py`: Added missing argument to docstring of `ReshapeTransform`~~ #144401
- ~~`transforms.py`: Fixed bug in `AffineTransform.sign` (could return `Tensor` instead of `int`).~~ #144400
- `transforms.py`: Added `type: ignore` comments to `AffineTransform.log_abs_det_jacobian`[^1]; replaced `torch.abs(scale)` with `scale.abs()`.
- `transforms.py`: Added `type: ignore` comments to `AffineTransform.__eq__`[^1].
- `transforms.py`: Fixed type hint on `CumulativeDistributionTransform.domain`. Note that this is still an LSP violation, because `Transform.domain` is defined as `Constraint`, but `Distribution.domain` is defined as `Optional[Constraint]`.
- skipped: `constraints.py`, `constraints_registry.py`, `kl.py`, `utils.py`, `exp_family.py`, `__init__.py`.

## Remark

`TransformedDistribution`: `__init__` uses the check `if reinterpreted_batch_ndims > 0:`, which can lead to the creation of `Independent` distributions with only 1 component. This results in awkward code like `base_dist.base_dist` in `LogisticNormal`.

```python
import torch
from torch.distributions import *
b1 = Normal(torch.tensor([0.0]), torch.tensor([1.0]))
b2 = MultivariateNormal(torch.tensor([0.0]), torch.eye(1))
t = StickBreakingTransform()
d1 = TransformedDistribution(b1, t)
d2 = TransformedDistribution(b2, t)
print(d1.base_dist)  # Independent with 1 dimension
print(d2.base_dist)  # MultivariateNormal
```

One could consider changing this to `if reinterpreted_batch_ndims > 1:`.

[^1]: Usage of `isinstance(value, numbers.Real)` leads to problems with static typing, as the `numbers` module is not supported by `mypy` (see <https://github.com/python/mypy/issues/3186>). This results in us having to add type-ignore comments in several places
[^2]: Otherwise, we would have to add a bunch of `type: ignore` comments to make `mypy` happy, as it isn't able to perform the type narrowing. Ideally, such code should be replaced with structural pattern matching once support for Python 3.9 is dropped.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144197
Approved by: https://github.com/malfet

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-04-06 17:50:35 +00:00
Isalia20
49f6cce736 [MPS] grad scaler (#150255)
Fixes #142397

Basic implementation is done. What's left:
- [x] Different dtype/device tensors in the TensorList
- [x] fast path for grouping the foreach kernel
- [x] Tests

Regarding tests, I found some tests in `test/test_torch.py` for GradScaler but I couldn't figure out what is the best way to enable the test for MPS device.

By removing `@onlyNativeDeviceTypes`, one enables the tests for MPS but also enables tests for all other devices which are not included in the native device types. If I put:
`instantiate_device_type_tests(TestTorchDeviceType, globals(), allow_mps=True)`

This enables lots of tests in that class for MPS which were not(?) being tested before? This part needs some clarification

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150255
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-04-06 17:06:55 +00:00
Natalia Gimelshein
55e62ff74a bf16 grouped gemm (#150374)
Enabled bf16 grouped gemm with an API similar to _scaled_group_gemm, except without scale and fast accum arguments. All transpose variants are enabled, unlike scaled gemm. Ideally we'd factor out a lot more code from scaled gemm, currently there's a lot of repetition between scaled and non-scaled versions. I factored out only a helper kernel that prepares arguments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150374
Approved by: https://github.com/drisspg
2025-04-06 04:53:24 +00:00
PyTorch MergeBot
caf8d9bc17 Revert "Fix conv2d strided prologue (#150697)"
This reverts commit 2e4ae2ab41.

Reverted https://github.com/pytorch/pytorch/pull/150697 on behalf of https://github.com/ngimel due to breaks rocm build ([comment](https://github.com/pytorch/pytorch/pull/150697#issuecomment-2781218658))
2025-04-06 04:50:15 +00:00
Klint Qinami
2d98a1caf5 [MTIA] Map names to operand indices when folding submodules (#150692)
When replacing placeholders with getattrs during constant folding, we can have an argument and parameter name mismatch. In fact, there is no guarantee that the parameter name is equivalent to the argument name used in the module call.

Differential Revision: D72415970

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150692
Approved by: https://github.com/jfix71
2025-04-06 03:11:14 +00:00
Jeff Daily
15768cc34b add unit test for preferred_blas_library settings (#150581)
Follow up to #150212 that was committed without a unit test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150581
Approved by: https://github.com/atalman, https://github.com/malfet

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-04-06 01:44:07 +00:00
Richard Barnes
83b870a28a Fix missing braces for clang CUDA (#150736)
Test Plan: Sandcastle

Differential Revision: D72469764

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150736
Approved by: https://github.com/Skylion007
2025-04-06 01:29:59 +00:00
Nikita Shulga
c830c12a87 [MPSInductor] Fix tiled reduction logic (#150737)
In case of tiles, index must include both reduction dimentions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150737
Approved by: https://github.com/dcci
2025-04-06 00:20:41 +00:00
Isalia20
cfea55dbec [MPS] fix inverse bug for N>1024 (#146754)
Fixes #138200

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146754
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-04-05 21:49:21 +00:00
Mu-Chu Lee
60a45eb862 [AOTInductor] Introduce MaybeOwningAtenTensorHandle for ConstantMap (#150275)
Summary:
We used RAIIAtenTensorHandle for ConstantMap, where RAIIAtenTensorHandle
is a unique_ptr, indicating that all memory handling is by the
AOTInductor internally.

In this PR, we introduce ConstantAtenTensorHandle which replaces
RAIIATenTensorHandle. This class holds a raw AtenTensorHandle, and also
owns a RAIIAtenTensorHandle if user decides to delegate memory
management to AOTInductor.

This is a prerequisite for user managed buffer, this PR, however only
introduces this class and make sure it works with existing AOTInductor
and has the default behavior identical as using RAIIAtenTensorHandle.

Test Plan:
Existing tests. No change should be introduced within this PR.

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150275
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2025-04-05 06:00:35 +00:00
Nikita Shulga
7ac8186851 [MPSInductor] Speedup sum/prod reductions (#150566)
By using cooperative `simd_sum`/`simd_product` instead of a C-style for loop for threadgroup reductions. This also allows significantly reduce amount of shared memory needed to perform those reductions

Using such reduction increases the `torch.compile` performance for gpt-fast using `stories110M` from 29 tokens/sec to 630 tokens/sec on M4 and changes perf of torch.rand as follows:
|size| before | after |
|------------------------|------------|-------------|
| 512x512         | 202.1       | 131.8       |
| 1024x1024   |   780.6    | 176.9       |
| 2048x2048    |   1423.4       | 339.9      |
| 4096x4097    |    2982.2 | 1047.2      |

Unfortunately, none of the SIMDgroup operations are available for 64-bit integers, but one can simulate the behavior using using `simd_shuffle_down` of 64-bit values represented as `int2` types, that yields reduction in $log_2(threadgroup\\_size)$ steps. [`mlx/kernels/reduction/ops.h](86389bf970/mlx/backend/metal/kernels/reduction/ops.h (L15-L18)) contains an implementation of such algorithm, but alas it yields wrong results on M1/M2(and may be M3 machines) if not all threads in the simdgroup are active which could be observed by running
```python
import torch
lib=torch.mps.compile_shader("""
kernel void do_sum(device int* out, constant int* in, uint idx [[thread_position_in_grid]]) {
  out[idx] = metal::simd_shuffle_down(in[idx], 8);
}
""")
x=torch.arange(22, device='mps', dtype=torch.int32)
y=torch.empty_like(x)
lib.do_sum(y, x)
print(y)
```
that returns following on M4
```
tensor([ 8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,  0,  0,  0,  0, 0,  0,  0,  0], device='mps:0', dtype=torch.int32)
```
but same kernel running on M1 returns
```
tensor([ 8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 14, 15, 16, 17, 18, 19, 20, 21], device='mps:0', dtype=torch.int32)
```
This discrepancy in behavior can be addressed by using `simd_shuffle_and_fill_down`, but any kernels using simd_shuffle_and_fill_down cause an internal compiler error on MacOS-13.2. Considering that OS is to be EOL soon, skip the offending tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150566
Approved by: https://github.com/manuelcandales
ghstack dependencies: #150452, #150457
2025-04-05 02:47:27 +00:00