Commit Graph

5527 Commits

Author SHA1 Message Date
PyTorch MergeBot
496c1e78c5 Revert "Implements user buffer registration using MemPool (#133603)"
This reverts commit 25d9be37be.

Reverted https://github.com/pytorch/pytorch/pull/133603 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/133603#issuecomment-2486897708))
2024-11-19 22:42:26 +00:00
Mikayla Gawarecki
b63a84804c Allow NJT by default for weights_only torch.load (take 2) (#140739)
Per discussion with @malfet, only allow the weights_only unpickler to load NJT if `torch.nested` and `torch._dynamo` are imported.

(This is slightly weird, as `torch.nested` is technically imported by default and `torch._dynamo.decorators._DimRange` is what actually needs to be imported.)

We can't import this from `torch.nested`, as doing so would
- undo dynamo's lazy import
- cause a circular import
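A minimal sketch of the kind of guard described above (the helper name is hypothetical; the real check lives inside the weights-only unpickler):

```python
import sys

def _njt_globals_allowed() -> bool:
    # Hypothetical guard: only allowlist NJT globals when the user has already
    # imported torch._dynamo, so we never trigger (or undo) its lazy import.
    return "torch.nested" in sys.modules and "torch._dynamo" in sys.modules
```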

===========================
Redo of https://github.com/pytorch/pytorch/pull/140304, which caused issues because `torch.nested._internal.foo` needed to be imported, leading to errors like

```python
torch/_weights_only_unpickler.py", line 339, in load
    if full_path in _get_allowed_globals():
torch/_weights_only_unpickler.py", line 188, in _get_allowed_globals
    torch.nested._internal.nested_tensor.NestedTensor
AttributeError: module 'torch.nested' has no attribute '_internal'
```

**This likely wasn't caught in our CI because imports are global during unit tests(?), so we use subprocess to properly test this time**

Differential Revision: [D65961691](https://our.internmc.facebook.com/intern/diff/D65961691)

@jbschlosser
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140739
Approved by: https://github.com/malfet
2024-11-19 02:44:53 +00:00
Anant Gulati
b379a28a95 Generalization of distributed test cases for non-CUDA devices (#138216)
# Motivation
This PR is an extension of #131758. As described there, these changes aim to make distributed UTs more accessible to users of all device types.

It is a demonstration of a few changes discussed by @kwen2501 and @jgong5 in the discussion for #131758 (https://github.com/pytorch/pytorch/pull/131758#discussion_r1762422784).

This PR contains two types of changes. The first is to the common distributed folder, where we have added a new class derived from MultiProcessTestCase that abstracts out process group creation/deletion and other functionality for a given device.

New generalized tests can be added by deriving from this base class.
This PR also includes other miscellaneous changes for Gaudi support.

The second changed file is test_functional_api.py, a test file in common distributed. This file is a POC for how we can use this new class to write more device-agnostic distributed test cases.

The following changes have been made to test_functional_api.py:
- Functionality has been added to test non-CUDA devices, using Intel HPU as an example
- Multiple setup steps previously required by MultiProcessTestCase have been abstracted out
- Misc adaptations to allow for generic calls to accelerators, adding test skips instead of explicitly skipping for multiple GPUs
- skipIfHPU flags have been added to skip a few multithreaded test cases that are not yet supported on HPUs

NOTE: Within test_functional_api, there are tests which require multithreading functions that are not yet supported on HPUs. These have been skipped for HPU using the skipHPU decorator.

I will be raising a separate PR to improve the usability of said decorators in a device-agnostic setting, in the manner suggested by @kwen2501 in a comment on this PR.

This PR is a cleaned-up version of a previous PR (#136988), which I closed due to human error. I have addressed some of the comments made by @kwen2501 here as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138216
Approved by: https://github.com/kwen2501, https://github.com/guangyey
2024-11-18 09:38:00 +00:00
Will Constable
1d5a8ee8fb [C10D] call destroy_process_group after MultiProcess tests (#140820)
Faced with an annoying string of warnings like this when running tests,
![Screenshot 2024-11-15 at 11 23 21 AM](https://github.com/user-attachments/assets/91ff4e1d-3c29-4510-9a61-46e7df68a212)

My choices seem to be (1) call destroy_process_group() at the end of
each test fn, (2) do this in some wrapper, (3) do it in the base test
class.

Since tests in MultiProcessTestCase are responsible for calling
init_process_group themselves, they should also be responsible for
calling destroy (or at least method (3) would be asymmetric and may
result in double-destroy).

But it doesn't feel worth it to go add a destroy call manually to each
test, and try/except for a possible second destroy call seems like a
happy middle ground.

Note: tests that want to ensure that destroy runs cleanly can and should
still call destroy _inside_ the test, and this change does not affect
that.
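A minimal sketch of the base-class cleanup described above (class name hypothetical; an `is_initialized()` guard stands in for the try/except the commit describes):

```python
import unittest

import torch.distributed as dist

class ExampleMultiProcessTestCase(unittest.TestCase):
    def tearDown(self):
        # Idempotent cleanup: skip destroy if the test already destroyed the
        # group (or never initialized one), avoiding a double-destroy error.
        if dist.is_available() and dist.is_initialized():
            dist.destroy_process_group()
        super().tearDown()
```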

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140820
Approved by: https://github.com/fegin
2024-11-18 04:26:21 +00:00
Edward Z. Yang
6094f17ada Revert "revert test repro logging" (#140749)
This reverts commit 6323fa673279eac9f2292b9b7790f621a4649af8.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140749
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #138634
2024-11-17 06:25:54 +00:00
PyTorch MergeBot
bf8709b08a Revert "[C10D] call destroy_process_group after MultiProcess tests (#140820)"
This reverts commit 77d1f076da.

Reverted https://github.com/pytorch/pytorch/pull/140820 on behalf of https://github.com/wconstab due to failures on trunk not on PR CI ([comment](https://github.com/pytorch/pytorch/pull/140820#issuecomment-2480644227))
2024-11-16 16:32:14 +00:00
Will Constable
77d1f076da [C10D] call destroy_process_group after MultiProcess tests (#140820)
Faced with an annoying string of warnings like this when running tests,
![Screenshot 2024-11-15 at 11 23 21 AM](https://github.com/user-attachments/assets/91ff4e1d-3c29-4510-9a61-46e7df68a212)

My choices seem to be (1) call destroy_process_group() at the end of
each test fn, (2) do this in some wrapper, (3) do it in the base test
class.

Since tests in MultiProcessTestCase are responsible for calling
init_process_group themselves, they should also be responsible for
calling destroy (or at least method (3) would be asymmetric and may
result in double-destroy).

But it doesn't feel worth it to go add a destroy call manually to each
test, and try/except for a possible second destroy call seems like a
happy middle ground.

Note: tests that want to ensure that destroy runs cleanly can and should
still call destroy _inside_ the test, and this change does not affect
that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140820
Approved by: https://github.com/fegin
ghstack dependencies: #140460, #140815
2024-11-16 14:24:52 +00:00
Oguz Ulgen
a173186566 [RFC] Implement caching for user defined triton kernels (#140326)
This PR adds caching for user defined triton kernels by putting the transitive closure of source code in node.meta along with constant arguments.

One HUGE hack we do here is that a node looks like
```
triton_kernel_wrapper_functional_proxy = torch.ops.higher_order.triton_kernel_wrapper_functional(kernel_idx = 0, constant_args_idx = 1, grid = [(1, 1, 1)], tma_descriptor_metadata = {}, kwargs = {'in_ptr0': arg0_1, 'in_ptr1': arg1_1, 'out_ptr': arg0_1}, tensors_to_clone = ['out_ptr']);
```
so we use a regex to remove the `kernel_idx = 0, constant_args_idx = 1` parts, as they are not relevant to the cache hash. This is horrible, and I'd eventually like to stop using pickle as the hashing mechanism, but that is a longer project.
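An illustrative sketch of the normalization (not the exact pattern from the PR): strip the run-dependent side-table indices from the node repr before hashing.

```python
import re

node_repr = (
    "triton_kernel_wrapper_functional(kernel_idx = 0, "
    "constant_args_idx = 1, grid = [(1, 1, 1)], kwargs = {...})"
)
# Remove the side-table indices, which differ from run to run.
hash_input = re.sub(r"(kernel_idx|constant_args_idx) = \d+,?\s*", "", node_repr)
print(hash_input)  # triton_kernel_wrapper_functional(grid = [(1, 1, 1)], kwargs = {...})
```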

Differential Revision: [D65895744](https://our.internmc.facebook.com/intern/diff/D65895744)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140326
Approved by: https://github.com/zou3519
2024-11-16 02:37:16 +00:00
Michael Lazos
1fd4757fdc Support tensor betas in Adam and AdamW (#134171)
Adds support for beta1 and beta2 to be wrapped in tensors for Adam and AdamW.

Fixes https://github.com/pytorch/pytorch/issues/133898
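A hedged usage sketch; whether `capturable=True` must accompany tensor hyperparameters is an assumption based on how tensor `lr` is handled:

```python
import torch

model = torch.nn.Linear(4, 4)
# betas passed as 0-dim tensors instead of Python floats.
opt = torch.optim.Adam(
    model.parameters(),
    betas=(torch.tensor(0.9), torch.tensor(0.999)),
    capturable=True,
)
```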

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134171
Approved by: https://github.com/janeyx99
2024-11-15 21:55:55 +00:00
xin.li
80d63e7dd9 Fix softmax_backward_data cpu implementation error when argument output is noncontiguous (#139740)
The implementation of the `softmax_backward_data` operator for the CPU backend produces incorrect results when the `output` argument is non-contiguous.

Here is a test case that demonstrates this issue:

```python
torch.manual_seed(0)
op = torch.ops.aten._softmax_backward_data
grad_output = torch.ones(3, 3, 3)
temp = torch.randn(3, 10, 3)
out = temp[:, :3, :]
out = out.contiguous()
print(out.is_contiguous())
grad_input = op(grad_output, out, 1, torch.float32)
print(grad_input)
```

In this test case, the variable `grad_input` yields incorrect results if the line `out = out.contiguous()` is commented out. With this fix, `grad_input` produces the same results regardless of whether `output` is contiguous.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139740
Approved by: https://github.com/zou3519
2024-11-15 19:53:20 +00:00
Syed Tousif Ahmed
25d9be37be Implements user buffer registration using MemPool (#133603)
This PR implements user buffer registration and demonstrates NVLink SHARP (NVLS) reductions, using a combination of allocating special memory via MemPool and registering it with the NCCL buffer registration APIs.

Part of https://github.com/pytorch/pytorch/issues/124807.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133603
Approved by: https://github.com/kwen2501, https://github.com/eqy
2024-11-15 12:47:49 +00:00
Nikita Shulga
9c88b08ac9 [BE] Replace skipIfMPS with expectedFailureMPS (#139940)
Functionally, the two decorators are very similar, but one should rely on expectedFailure as much as possible to get a signal when something is fixed.
- Move the `product_version` variable from `test_mps` to common_utils, but call it `MACOS_VERSION`
- Introduce `skipIfMPSOnMacOS13` to decorate the hard crashes that happen only on macOS 13 (which at this point will not get any fixes and will be deprecated soon)
- Add `device_type='mps'` to all `skipIfMPS` per https://github.com/pytorch/pytorch/issues/140560
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139940
Approved by: https://github.com/janeyx99, https://github.com/huydhn
2024-11-15 03:48:37 +00:00
Bob Ren
c536903c3f revert test repro logging (#140717)
@ezyang noticed this exercises a multithreading bug that is causing tests to become disabled:

```
2024-11-13T21:05:55.8363582Z inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCPU::test_comprehensive_fft_ihfftn_cpu_int32 /opt/conda/envs/py_3.9/lib/python3.9/site-packages/_pytest/threadexception.py:73: PytestUnhandledThreadExceptionWarning: Exception in thread Thread-3
2024-11-13T21:05:55.8364857Z
2024-11-13T21:05:55.8364974Z Traceback (most recent call last):
2024-11-13T21:05:55.8365491Z   File "/opt/conda/envs/py_3.9/lib/python3.9/threading.py", line 980, in _bootstrap_inner
2024-11-13T21:05:55.8366003Z     self.run()
2024-11-13T21:05:55.8366371Z   File "/opt/conda/envs/py_3.9/lib/python3.9/threading.py", line 917, in run
2024-11-13T21:05:55.8366858Z     self._target(*self._args, **self._kwargs)
2024-11-13T21:05:55.8367518Z   File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/fbscribelogger/__init__.py", line 176, in _run_event_loop
2024-11-13T21:05:55.8368189Z     self.loop.run_until_complete(self.task)
2024-11-13T21:05:55.8368774Z   File "/opt/conda/envs/py_3.9/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
2024-11-13T21:05:55.8369348Z     return future.result()
2024-11-13T21:05:55.8369980Z   File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/fbscribelogger/__init__.py", line 214, in _worker
2024-11-13T21:05:55.8370603Z     message = await asyncio.wait_for(
2024-11-13T21:05:55.8371090Z   File "/opt/conda/envs/py_3.9/lib/python3.9/asyncio/tasks.py", line 442, in wait_for
2024-11-13T21:05:55.8371573Z     return await fut
2024-11-13T21:05:55.8372156Z   File "/opt/conda/envs/py_3.9/lib/python3.9/asyncio/queues.py", line 166, in get
2024-11-13T21:05:55.8372613Z     await getter
2024-11-13T21:05:55.8374010Z RuntimeError: Task <Task pending name='Task-1' coro=<FbScribeLogger._worker() running at /opt/conda/envs/py_3.9/lib/python3.9/site-packages/fbscribelogger/__init__.py:214> cb=[_run_until_complete_cb() at /opt/conda/envs/py_3.9/lib/python3.9/asyncio/base_events.py:184]> got Future <Future pending> attached to a different loop
2024-11-13T21:05:55.8375366Z
2024-11-13T21:05:55.8375603Z   warnings.warn(pytest.PytestUnhandledThreadExceptionWarning(msg))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140717
Approved by: https://github.com/ezyang, https://github.com/zxiiro
2024-11-14 19:51:52 +00:00
Nikita Shulga
0f739b8f66 [Codemod] skipIfMps->skipIfMPS (#140562)
As `MPS` is an acronym that stands for Metal Performance Shaders.
This also aligns more closely with `skipCUDAIf` rather than `skipCudaIf`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140562
Approved by: https://github.com/ZainRizvi, https://github.com/r-barnes
2024-11-13 19:45:08 +00:00
zeshengzong
cb71bcc542 Replace clone.detach with detach.clone (#140264)
Fixes #64532

As stated in the issue, replace `clone.detach` with `detach.clone`.
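The ordering matters because detaching first keeps the clone out of the autograd graph entirely, as this short sketch shows:

```python
import torch

x = torch.randn(3, requires_grad=True)

# clone().detach() records the clone in the autograd graph before detaching;
# detach().clone() detaches first, so the clone is never tracked at all.
a = x.clone().detach()
b = x.detach().clone()
assert not a.requires_grad and not b.requires_grad  # same result, less bookkeeping
```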

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140264
Approved by: https://github.com/soulitzer
2024-11-13 07:01:02 +00:00
Tsung-Hsien Lee
953286b850 [DTensorTestbase] Fix @with_comms inactive problem (#139637)
Summary:
`with_comms()` is mostly used as a decorator with an optional input argument `eager_init`. The problem with a decorator that takes an input argument is that it always has to be invoked, i.e., you have to write `with_comms()` rather than `with_comms`, which is what the majority of existing usages do.

This diff provides a solution such that we can use `with_comms`, `with_comms()`, `with_comms(eager_init=False)`, and `with_comms(eager_init=True)`.
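A minimal sketch of the dual-use decorator pattern this describes (not the real DTensorTestbase implementation):

```python
import functools

def with_comms(func=None, *, eager_init=False):
    def decorator(f):
        @functools.wraps(f)
        def wrapper(self, *args, **kwargs):
            # ... set up the process group here, honoring eager_init ...
            return f(self, *args, **kwargs)
        return wrapper
    if func is not None:
        return decorator(func)  # used bare: @with_comms
    return decorator            # used with args: @with_comms(eager_init=True)
```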

Test Plan: Contbuild & OSS CI

Differential Revision: D65385700

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139637
Approved by: https://github.com/wz337
2024-11-13 02:45:02 +00:00
Masaki Kozuki
6a368b3fc5 Add ScalarList overload to _foreach_lerp (#134482)
Related:
- https://github.com/pytorch/pytorch/issues/133367
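A hedged usage sketch of the new overload: one scalar weight per tensor instead of a single shared weight.

```python
import torch

starts = [torch.zeros(3), torch.zeros(3)]
ends = [torch.ones(3), torch.ones(3)]
# ScalarList overload: per-tensor interpolation weights.
out = torch._foreach_lerp(starts, ends, [0.25, 0.75])
print(out[0], out[1])  # tensor of 0.25s, tensor of 0.75s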

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134482
Approved by: https://github.com/janeyx99
2024-11-12 19:03:41 +00:00
Jane Xu
213b8ef163 [BE] add empty tensor testing for _foreach_addcmul/div (#140276)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140276
Approved by: https://github.com/jbschlosser
ghstack dependencies: #140191
2024-11-12 15:35:06 +00:00
Jane Xu
92fb1f79b8 [BE] Test interspersed empty tensors for _foreach_norm test parity (#140191)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140191
Approved by: https://github.com/jbschlosser
2024-11-12 15:35:06 +00:00
Masaki Kozuki
71d8bb7ede implement torch._foreach_rsqrt (#134574)
Related:
- #133367

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134574
Approved by: https://github.com/eqy, https://github.com/janeyx99
2024-11-12 15:34:35 +00:00
Jiang, Yanbing
f77eb07662 Split int4wo weight packing (#139611)
Fixes https://github.com/pytorch/ao/issues/1117.

This PR separates int4wo weight packing between CPU and other devices, to help implement `INT4CPULayout` in torchao, based on https://github.com/pytorch/ao/issues/1117#issuecomment-2451252756.

Now, for CPU, the input `weight` of `_convert_weight_to_int4pack_for_cpu` is [n, k] int32 and the output is [n, k / 2] uint8. The input packed weight of `_weight_int4pack_mm_for_cpu` is [n, k / 2] uint8.
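A conceptual sketch of the CPU layout described above (the actual nibble ordering inside each byte is an assumption; this only illustrates why the output is [n, k / 2] uint8):

```python
import torch

n, k = 8, 16
weight = torch.randint(0, 16, (n, k), dtype=torch.int32)  # 4-bit values in int32

# Two 4-bit values packed per uint8 byte -> [n, k // 2].
lo = weight[:, 0::2].to(torch.uint8)
hi = weight[:, 1::2].to(torch.uint8)
packed = (hi << 4) | lo
assert packed.shape == (n, k // 2) and packed.dtype == torch.uint8
```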

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139611
Approved by: https://github.com/jerryzh168
2024-11-12 10:12:50 +00:00
Joel Schlosser
e7ec294c10 NJT OpInfo tests v2 (#138370)
This PR updates OpInfo-based tests for NJTs:
* Adds extensive coverage across non-contiguous NJTs (both non-contiguous transposed and non-contiguous with holes)
    * The `_sample_njts()` helper that `sample_input_func`s utilize now produces non-contig NJTs as well
* Utilizes a `SampleInput`-based xfail system for granular classification of bugs. For example, it's possible to indicate that a class of ops is expected to fail only on non-contig with holes NJT inputs.
    * I decided on adding `SampleInput`s and utilizing this system over using test parametrization for two reasons:
        * Test perf - adding `SampleInput`s is faster than generating entire new tests
        * Avoiding the possibility of `sample_input_func`s not respecting the non-contig test parameter - this would result in silently incorrect passing of these tests. Keeping the responsibility for `SampleInput` generation firmly within each `OpInfo`'s `sample_input_func` means weirdness like this isn't possible
* Improves `SampleInput` naming for a bunch of `sample_input_func`s. This makes it easier to xfail them as needed. For example, binary / unary / other ops now use the new `_describe_njt()` helper to get a string repr that uniquely defines the type of NJT being passed to the op
* Adds appropriate `XFailRule`s to get tests passing for forward / backward / forward compile / backward compile. In general, each xfail corresponds to some bug that needs to be fixed

```python
# Represents a rule indicating how to xfail a particular test. It allows granularity
# at the device, dtype, op, and individual sample levels. This flexibility allows entire
# bugs to be represented by a single rule, even if this corresponds with multiple conceptual
# test cases across multiple ops.
@dataclass
class XFailRule:
    # expected error type
    error_type: TypeVar = Exception
    # expected error message
    error_msg: str = ".*"
    # function to indicate whether the rule applies; return True if so
    match_fn: Callable[[torch.device, torch.dtype, OpInfo, SampleInput], bool] = None
    # optional name for identifying the rule
    name: str = ""

    def match(self, device, dtype, op, sample) -> bool:
        return self.match_fn(device, dtype, op, sample)
```

Example:
```python
    # Bug when broadcasting a binary op with non-contiguous with holes NJT + dense
    # tensor with 1 in ragged dim.
    XFailRule(
        error_type=RuntimeError,
        error_msg="cannot call binary pointwise function .* with inputs of shapes",
        match_fn=lambda device, dtype, op, sample: (
            isinstance(op, BinaryUfuncInfo)
            and "noncontig_holes" in sample.name
            and "broadcasting 1 over ragged" in sample.name
        ),
        name="binary_noncontig_holes_broadcasting_1_over_ragged",
    ),
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138370
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
ghstack dependencies: #140160
2024-11-11 19:35:24 +00:00
Xiaodong Wang
565a7942ee Recover non-standard bool test for msort (#139870)
Summary:
I was looking into why non-standard bool values fail for msort - it makes sense for argsort and sort to fail, because we're randomly generating uint8 values, so the order will be different (and thus the indices will be different). But msort should work.

After some digging, I found that, interestingly, even though scalar_t is bool, when the actual value is a uint8_t the comparison treats the operands as signed. I tried lhs=255 and rhs=0: lhs < rhs is evaluated as -1 < 0, which is true (but it's supposed to be false).

Therefore we add an explicit type cast.
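A repro sketch of the failure mode: byte 255 is a "non-standard" True, and a signed byte comparison would treat it as -1 and order it before False (0).

```python
import torch

raw = torch.tensor([255, 0, 1], dtype=torch.uint8)
b = raw.view(torch.bool)  # reinterpret bytes as bool without copying
print(torch.msort(b))     # expected: [False, True, True]
```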

Test Plan: Remove the test skip

Differential Revision: D65472170

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139870
Approved by: https://github.com/Skylion007, https://github.com/davidberard98
2024-11-11 02:00:34 +00:00
Natalia Gimelshein
1cdaf1d85f correctly keep track of processed tensors for foreach reductions (#140103)
Fixes #140066

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140103
Approved by: https://github.com/janeyx99

Co-authored-by: Jane Xu <janeyx@meta.com>
2024-11-08 23:04:53 +00:00
Animesh Jain
86792a5a8d [invoke_subgraph] User facing API to support arbitrary args and kwargs (#139162)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139162
Approved by: https://github.com/zou3519
2024-11-08 03:31:19 +00:00
Nikita Shulga
ae01f2b61b Extend CPU implementation of MSELoss to BF16 (#139959)
It's strange that it has not been implemented for the type yet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139959
Approved by: https://github.com/jgong5, https://github.com/janeyx99
ghstack dependencies: #139961
2024-11-07 23:50:15 +00:00
Animesh Jain
75f3056c81 [hop-db] Import invoke_subgraph to avoid Dynamo error on mac (#140038)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140038
Approved by: https://github.com/ydwu4
2024-11-07 22:36:57 +00:00
IvanKobzarev
781c68c865 [aotd] coerce_same_metadata_as_tangent with expected_type for e.g.AsyncCollectiveTensor (#139095)
Based on discussion here: https://github.com/pytorch/pytorch/pull/138731

Introduces the ability for a subclass to implement type conversion to an expected type:
```python
    def __coerce_same_metadata_as_tangent__(
        self, expected_metadata: Any, expected_type: Optional[Type] = None
    ):
```
Here `expected_type=None` means the subclass's own class is expected.

E.g., for `DTensor` we may find an `AsyncCollectiveTensor` tangent where we expected a `Tensor` - in this case the method will be called with `expected_type=Tensor` at runtime.

This PR adds an implementation to AsyncCollectiveTensor that simply triggers `wait()`.
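An illustrative sketch of that implementation (the class here is a stand-in for the real `torch.distributed._functional_collectives.AsyncCollectiveTensor`, and the None-return convention is an assumption):

```python
from typing import Any, Optional, Type

import torch

class AsyncCollectiveTensorSketch(torch.Tensor):
    def wait(self) -> torch.Tensor:
        ...  # block until the pending collective completes

    def __coerce_same_metadata_as_tangent__(
        self, expected_metadata: Any, expected_type: Optional[Type] = None
    ):
        if expected_type is torch.Tensor:
            # AOTAutograd expected a plain Tensor: materialize by waiting.
            return self.wait()
        return None  # assumption: None signals "cannot coerce"
```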

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139095
Approved by: https://github.com/bdhirsh
2024-11-07 16:24:48 +00:00
Gabriel Ferns
2037ea3e15 Add type annotations to Configs (#139833)
Summary:
Adds types to Configs, and fixes a bug in options that was caused by the lack of types.

fixes: https://github.com/pytorch/pytorch/issues/139822

Configs are used by many modules, so I'm not sure which label to apply.

Types also allow https://github.com/pytorch/pytorch/pull/139736 to fuzz configs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139833
Approved by: https://github.com/c00w
2024-11-07 03:49:09 +00:00
Sun, Jiayi
a59132b9c8 fix torch.linalg.norm and torch.norm for torch.complex32 datatype (#133661)
Fix https://github.com/pytorch/pytorch/issues/132634.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133661
Approved by: https://github.com/mingfeima, https://github.com/Skylion007
2024-11-07 03:21:36 +00:00
Edward Z. Yang
4e647871d6 Ensure TORCH_TRACE is run for Dynamo/Distributed tests (#139786)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139786
Approved by: https://github.com/bobrenjc93, https://github.com/c00w, https://github.com/anijain2305
ghstack dependencies: #139716
2024-11-07 01:58:05 +00:00
Colin L. Rice
2a857e940d config: Add env_name_default and env_name_force to Config (#138956)
This allows Configs to handle setting their defaults (or overriding themselves) via environment variables.

The environment variables are resolved at install time (which is usually import time). This is done 1) to avoid any race conditions between threads, etc., and 2) to encourage people to modify the configs directly rather than overriding environment variables to change PyTorch behaviour.
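A hedged sketch of the described API; the import path and exact keyword names are assumptions based on the PR title:

```python
from torch.utils._config_module import Config

# Default is read from the env var at install/import time if it is set;
# the *_force variant would override any later in-process assignment.
use_feature = Config(default=False, env_name_default="TORCHINDUCTOR_USE_FEATURE")
```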
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138956
Approved by: https://github.com/ezyang
ghstack dependencies: #138766
2024-11-06 21:20:42 +00:00
Nikita Shulga
68ef445c33 [MPS][Perf] Dispatch to SDP-math-mps for non-contig Tensors (#139791)
macOS 15 and newer supports those out of the box. This significantly reduces memory requirements and improves performance for some stable diffusion networks.

Test plan: Run
```python
from diffusers import StableDiffusionXLPipeline, AutoencoderKL, EulerAncestralDiscreteScheduler
import torch
import time

vae = AutoencoderKL.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0",
                                    subfolder='vae',
                                    torch_dtype=torch.bfloat16,
                                    force_upcast=False).to('mps')

pipe = StableDiffusionXLPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", vae=vae,
                                                 torch_dtype=torch.bfloat16, variant="fp16").to('mps')
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)

start_time = time.time()
start_mps_mem = torch.mps.driver_allocated_memory()
image = pipe(prompt="Spherical cow in vacuum",
             num_inference_steps=10,
             guidance_scale=8,
             generator=torch.Generator("mps").manual_seed(42),
             ).images[0]
end_mps_mem = torch.mps.driver_allocated_memory()
run_time = time.time() - start_time
print(f"run time in {run_time:.2f} sec, end_mps_mem {end_mps_mem/1024.0**2:.2f} Mb mem increase {(end_mps_mem-start_time)/1024.0**2:.2f} Mb")
image.save(f'bfloat16.png')
```

Before the change, total memory use was 16 GB and the run needed 65 sec to complete; after it, memory drops to 14 GB and the run takes 50 sec to finish on an M2 Pro, though the generated image remains the same:
![image](https://github.com/user-attachments/assets/1a35efef-9f80-4cd0-ac9c-30203eab6bb1)

Fixes https://github.com/pytorch/pytorch/issues/139389
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139791
Approved by: https://github.com/drisspg, https://github.com/Skylion007
ghstack dependencies: #139788, #139784, #139763
2024-11-06 16:25:39 +00:00
Sun, Jiayi
44df6522ee add Half/BFloat16 support for grid_sample on CPU (#134812)
Fix https://github.com/pytorch/pytorch/issues/127224.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134812
Approved by: https://github.com/Skylion007, https://github.com/mingfeima
2024-11-06 14:02:08 +00:00
Jack Taylor
5f266b5a02 [ROCm] re-enable flex attention UTs (#139632)
https://github.com/pytorch/pytorch/pull/136792 accidentally disabled flex attention UTs on ROCm. Re-enabling.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139632
Approved by: https://github.com/drisspg
2024-11-06 12:49:44 +00:00
Xiaodong Wang
e7cf7d00be Support torch.bool in torch.sort + CUDA (#139409)
Summary: This might be outdated, so I'm adding it back to see if we pass all the tests. I'm pretty sure CUDA 12 is OK.

Test Plan: CI

Differential Revision: D65282650

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139409
Approved by: https://github.com/zou3519, https://github.com/ngimel, https://github.com/eqy
2024-11-06 00:02:54 +00:00
PyTorch MergeBot
1d28b8b6d5 Revert "Deprecate torch._utils.is_compiling() and torch._dynamo.external_utils.is_compiling() (#127690)"
This reverts commit e84d1121ad.

Reverted https://github.com/pytorch/pytorch/pull/127690 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. More details in D65483292 ([comment](https://github.com/pytorch/pytorch/pull/127690#issuecomment-2458381056))
2024-11-05 23:10:38 +00:00
Xuehai Pan
e84d1121ad Deprecate torch._utils.is_compiling() and torch._dynamo.external_utils.is_compiling() (#127690)
This PR is split from PR #126898.

- #126898

------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127690
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-11-05 10:44:56 +00:00
zeshengzong
ffb7a08921 Fix torch.histc not checking min > max on cuda for int8 tensors (#139372)
Fixes #139360

86e6513c86/aten/src/ATen/native/cuda/SummaryOps.cu (L323-L324)

Assigning `min` and `max` to the low-precision `input_t` variables `minvalue` and `maxvalue` causes an incorrect comparison result in the following check:

86e6513c86/aten/src/ATen/native/cuda/SummaryOps.cu (L353)

![image](https://github.com/user-attachments/assets/0d5c87f4-3dc6-48bb-bcc8-b1803e7cd487)

Changing the type of `minvalue` and `maxvalue` fixes it, similar to these lines:

86e6513c86/aten/src/ATen/native/cuda/SummaryOps.cu (L280-L282)
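A repro sketch for this bug class (requires a CUDA device): with min > max, the kernel must raise instead of comparing the bounds in the low-precision input type.

```python
import torch

t = torch.arange(10, dtype=torch.int8, device="cuda")
torch.histc(t, bins=4, min=5, max=1)  # expected: RuntimeError about max <= min
```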

**Test Result**
```bash
$ pytest test/test_reductions.py -vv
```
![image](https://github.com/user-attachments/assets/6b5d0d48-ebc2-4a8c-85f4-dbad147c086c)

```bash
$ lintrunner
```
![image](https://github.com/user-attachments/assets/f97c2d6d-78ea-4439-a1ba-907bc9defad7)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139372
Approved by: https://github.com/eqy
2024-11-05 08:42:38 +00:00
Chen, Zejun
9aaf3a04fa [profiler][UT] instantiate profiler UTs for devices and enable UTs for xpu profiler (#134316)
This PR makes the profiler-related UTs device-agnostic. It instantiates the profiler UTs for different device types and enables them on the XPU backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134316
Approved by: https://github.com/etaf, https://github.com/aaronenyeshi, https://github.com/gujinghui
2024-11-05 05:46:13 +00:00
Jane Xu
23169a6bcc Disable foreach tests for complex128 internally (#139649)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139649
Approved by: https://github.com/ngimel
2024-11-04 23:24:47 +00:00
Colin L. Rice
abc5d59dcb config: create Config objects with JK support (#138766)
This teaches install_config_module (and the underlying code) to understand Config objects. Additionally, we've added a JK option to this, which resolves the JK.

This config gets stored within the _ConfigEntry class and is evaluated when __getattr__ is called. If a justknob is set, it will call justknobs_check to see the result.

Due to preceding work, basically everything works correctly here; we only had to update a couple of tests and modify the getattr behaviour.

Note that we are updating the justknob_check function to support a
default option, to make default work.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138766
Approved by: https://github.com/ezyang
2024-11-01 19:20:37 +00:00
Joel Schlosser
ddb291a881 Fix and test several NJT reductions (#139317)
I'm sick of reductions not working properly - spotty dim coverage, missing backwards, etc. This PR fixes quite a bit.

It applies to the following ops:
* `sum` / `mean` / `prod`
* `all` / `any`
* `amin` / `amax`
* `min` / `max`
* `argmin` / `argmax`

The general reduction logic has been factored out into a helper `_apply_reduction(func, func_name, identity_element, *args, **kwargs)`. The idea is that by providing a valid identity element, we can utilize conversions to padded dense when needed for reducing over the ragged dim.
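A sketch of the identity-element idea (`_apply_reduction` itself is internal; this only shows the principle): pad the ragged dim with the reduction's identity so reducing the padded dense tensor matches reducing the jagged values.

```python
import torch

nt = torch.nested.nested_tensor(
    [torch.tensor([1.0, 2.0]), torch.tensor([4.0])], layout=torch.jagged
)
padded = nt.to_padded_tensor(0.0)  # 0 is the identity element for sum
print(padded.sum(dim=1))           # per-sequence sums: tensor([3., 4.])
```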

Extensive test coverage includes:
* reductions across ragged dim
* reductions across non-batch, non-ragged dims
* reductions across both batch and ragged dims
* multiple dim reductions (for ops that support this)
* full reduction -> scalar

Bonus: the PR includes backwards fixes for `sum` and `mean`, which have never worked.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139317
Approved by: https://github.com/cpuhrsch
2024-10-31 20:55:38 +00:00
Antoni Viros
ad637a4c5c Add support for index_put_ in NT (#135722)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135722
Approved by: https://github.com/jbschlosser
2024-10-30 17:17:59 +00:00
PyTorch MergeBot
5861279f47 Revert "Add support for index_put_ in NT (#135722)"
This reverts commit b4836e5b5c.

Reverted https://github.com/pytorch/pytorch/pull/135722 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing on ROCm ([comment](https://github.com/pytorch/pytorch/pull/135722#issuecomment-2445651914))
2024-10-30 01:53:55 +00:00
Antoni Viros
b4836e5b5c Add support for index_put_ in NT (#135722)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135722
Approved by: https://github.com/jbschlosser
2024-10-30 00:03:21 +00:00
Joel Schlosser
23d590e518 More flexible test parametrization with @reparametrize (#138369)
**Background:** The `@parametrize` decorator enjoys widespread usage as a convenient tool for ensuring extensive test coverage. One particular feature that makes this easy is the ability to stack such decorators, testing over the cross-product of inputs. Example:
```python
class MyTestClass(TestCase):
    @parametrize("x", range(3))
    @parametrize("y", [False, True])
    def test_foo(self, x, y):
        # Invoked with:
        # x=0, y=False
        # x=1, y=False
        # x=2, y=False
        # x=0, y=True
        # x=1, y=True
        # x=2, y=True
        ...
```

Note that the `@ops` and `@modules` decorators employ the same underlying machinery for parametrizing over `OpInfo` / `ModuleInfo` entries. These decorators also parametrize over op-specific `device` / `dtype` info *according to what is supported for each op*.
```python
class MyTestClass(TestCase):
    @ops(op_db)
    def test_foo(self, op, device, dtype):
        # Invoked for each OpInfo in the db, along with each device / dtype that corresponds
        # with this op according to the OpInfo entry.
        ...
```

Note that this is in contrast to the naive cross product between ops and devices / dtypes, which would generate too many tests. Certain use cases benefit from a similar type of flexible parametrization that is more intelligent than simple cross-product composition. It is expensive to generate / run too many tests, even if the unneeded ones are skipped appropriately.

This PR attempts to generalize such flexible parametrization and satisfy these use cases through the introduction of a `@reparametrize` decorator, which operates on an existing parametrizer and allows for customized on-the-fly parametrization through the use of an `adapter_fn`. Examples:
```python
# adapter_fn that adds a new arg
def include_is_even_arg(test_name, param_kwargs):
    x = param_kwargs["x"]
    is_even = x % 2 == 0
    new_param_kwargs = dict(param_kwargs)
    new_param_kwargs["is_even"] = is_even
    is_even_suffix = "_even" if is_even else "_odd"
    new_test_name = f"{test_name}{is_even_suffix}"
    yield (new_test_name, new_param_kwargs)

# adapter_fn that excludes certain values
def exclude_odds(test_name, param_kwargs):
    x = param_kwargs["x"]
    is_even = x % 2 == 0
    yield None if not is_even else (test_name, param_kwargs)

class MyTestClass(TestCase):
    @reparametrize(parametrize("x", range(5)), include_is_even_arg)
    def test_foo(self, x, is_even):
        # Invoked with both the x value and the new is_even arg
        ...

    @reparametrize(parametrize("x", range(5)), exclude_odds)
    def test_bar(self, x):
        # Only invoked with even x values
        ...
```

For a more real-world use case, imagine you want to write a set of OpInfo tests that parametrize over additional op-specific things beyond `device` / `dtype` (in NJT's case, this includes contiguity type, whether to operate over the batch / ragged / other dims, etc.). The `@reparametrize` decorator allows you to customize the `@ops` parametrization to add in these additional args as they make sense on a per-op basis.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138369
Approved by: https://github.com/janeyx99
2024-10-29 22:14:38 +00:00
drisspg
80c7c7178e Make sure all SDPA tests are ran with tensor cores enabled (#135592)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135592
Approved by: https://github.com/eqy
2024-10-29 20:53:10 +00:00
Jake Schmidt
2b577ae58f Implement NJT embedding backward (#138627)
Fixes #138352

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138627
Approved by: https://github.com/jbschlosser
2024-10-29 18:44:58 +00:00
PyTorch MergeBot
38645e8a3e Revert "Fix unbind_copy and add its decomposition (#134319)"
This reverts commit 8aedc649bd.

Reverted https://github.com/pytorch/pytorch/pull/134319 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but this is still failing the same test on ExecuTorch ([comment](https://github.com/pytorch/pytorch/pull/134319#issuecomment-2443209139))
2024-10-29 04:54:37 +00:00
wz337
5b39734a0a [DTensor][Test] Fix gloo backend failure when eager_init is turned on (#139097)
We should only pass the `device_id` when the backend is `nccl`. Otherwise, we would run into the following error:
```
RuntimeError: No backend for the parent process group or its backend does not support splitting
```

This also fixes an issue where test failures were not asserted when using `with_comms()` or `with_comms(eager_init=False)`.
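A minimal sketch of the fix (the helper is hypothetical): only nccl supports binding a device at init time, so gloo must not receive `device_id`.

```python
import torch
import torch.distributed as dist

def init_pg(backend: str, rank: int, world_size: int) -> None:
    device_id = torch.device("cuda", rank) if backend == "nccl" else None
    dist.init_process_group(
        backend, rank=rank, world_size=world_size, device_id=device_id
    )
```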

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139097
Approved by: https://github.com/XilunWu
2024-10-29 00:04:06 +00:00
cyy
aa2b17c330 [3/N] Don't skip ASAN on some tests (#139058)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139058
Approved by: https://github.com/ezyang
2024-10-28 23:57:23 +00:00
Guilherme Leobas
8785353f2f Fix tensor subclass + dynamic shapes in torch.compile + aot autograd (#125941)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125941
Approved by: https://github.com/bdhirsh
ghstack dependencies: #133337
2024-10-28 21:58:59 +00:00
Guilherme Leobas
6baccb430b Update TwoTensor impl. to accept outer_size/outer_stride (#133337)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133337
Approved by: https://github.com/bdhirsh
2024-10-28 21:58:59 +00:00
William Wen
52c80f663d change name of dynamo CI chard to dynamo_wrapped (#138233)
Implements https://github.com/pytorch/pytorch/issues/118127
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138233
Approved by: https://github.com/clee2000
2024-10-28 21:42:33 +00:00
Joel Schlosser
8ba9063002 FlexAttention support for NJT (#136792)
This PR adds FlexAttention + NJT support. In particular:
* To handle raggedness, treats the packed sequence dim of input NJTs as a giant "stacked sequence". To ensure user `score_mod` / `mask_mod` functions can still be written in the original NJT sequence space, this PR handles conversions for indices within the giant "stacked sequence" -> sequence relative indices automatically.
* Provides `py_impls` for `NestedTensor` to the HOPs for flex attention forward / backward that simply wrap / unwrap NJTs appropriately
* Adds barebones `new_empty()` support to NJT since FlexAttention utilizes this repeatedly; right now, only `new_empty()` with a shape of `()` is supported
* Tests that FlexAttention with a causal mask matches causal SDPA
* Adds a new public API for FlexAttention usage:
    * `create_nested_block_mask(mask_mod, B, H, njt, BLOCK_SIZE, _compile)` - NJT analogue for `create_block_mask()` that utilizes the `njt`'s ragged structure to create an appropriately-sized block mask (e.g. `(1, 1, total_seqlen, total_seqlen)`). This function handles the index conversion from "stacked sequence" space -> relative sequence space.
      * Minor note: as this is a public API, this function is purposefully named with "nested" instead of "njt" to keep the latter as an informal, mostly internal-only term.

Example usage:
```python
def causal_mask(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

query = ... # NJT of shape (B, H, S*, D)
key = ... # NJT of shape (B, H, S*, D)
value = ... # NJT of shape (B, H, S*, D)
# create_nested_block_mask() automatically converts indices from "stacked sequence" space -> relative sequence space
block_mask = create_nested_block_mask(causal_mask, 1, 1, query)  # block mask conceptual shape is (B, H, sum(S*), sum(S*))
output = flex_attention(query, key, value, block_mask=block_mask)

def causal_score_mod(score, b, h, q_idx, kv_idx):
    return torch.where(q_idx >= kv_idx, score, float("-inf"))

# flex_attention() automatically converts indices from "stacked sequence" space -> relative sequence space for NJT inputs
output2 = flex_attention(query, key, value, score_mod=causal_score_mod)
```

TODO:
* ~~Determine the right level of abstraction for public API helpers + move them alongside other helpers~~ Verify this with others though
* ~~Some cleanup~~
* ~~`njt_score_mod_adapter`~~
* ~~Q: should `create_njt_block_mask()` call `njt_mask_mod_adapter()` so we don't need two calls?~~
* Can we avoid materializing the `sum(s)` length `seq_idx` used for conversion between stacked sequence -> sequence relative indices?
    * Not for now, although future work may deepen the integration between Flex + NJT (possibly requiring custom templates). We should try to cache this though.
* ~~Demonstrate non-causal mask~~
* Support non-contiguous NJTs with holes (**booted to future PR**)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136792
Approved by: https://github.com/drisspg
ghstack dependencies: #138841
2024-10-28 20:01:27 +00:00
Mwiza Kunda
c2ded9ec0d Fix dot reference checks (#138596)
The dot reference implementation should be consistent with the CPU / CUDA implementations, since it may be used for meta dispatch,

i.e.
```python
import torch
x = torch.tensor([1,2,3], dtype=torch.float32)
y = torch.tensor([4,5,6], dtype=torch.float16)
x.dot(y)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: dot : expected both vectors to have same dtype, but found Float and Half
```

However the below does not raise an exception
```python
x.to("meta").dot(y.to("meta"))
```
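A sketch of the missing reference check (the helper name is hypothetical): mirror the eager error so meta dispatch fails the same way the CPU / CUDA kernels do.

```python
import torch

def check_dot_dtypes(self: torch.Tensor, other: torch.Tensor) -> None:
    torch._check(
        self.dtype == other.dtype,
        lambda: "dot : expected both vectors to have same dtype, but found "
        f"{self.dtype} and {other.dtype}",
    )
```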
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138596
Approved by: https://github.com/bdhirsh
2024-10-28 19:11:40 +00:00
Aaron Gokaslan
5d074746e9 [BE]: Add better optional typing (#138426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138426
Approved by: https://github.com/XuehaiPan, https://github.com/malfet
2024-10-27 14:19:00 +00:00
Simon Fan
99608ceed6 Scoped extension building for C++ backed custom ops tests (#136695)
FIXES #125579 #131103 #133197 #133283 #134738 #135369 #135685

Tests that create C++ extensions can cause flakiness in CI due to library namespace conflicts and test ordering. We can build them in temp dirs to ensure isolation.

An alternative is to build these as part of the build process, so errors surface at build time.
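A sketch of the isolation idea: a throwaway build directory (plus a unique module name) per test avoids cross-test namespace collisions. The extension name and source file below are placeholders.

```python
import tempfile

from torch.utils.cpp_extension import load

ext = load(
    name="my_test_extension_unique",     # hypothetical extension name
    sources=["my_test_extension.cpp"],   # hypothetical source file
    build_directory=tempfile.mkdtemp(),  # scoped, throwaway build dir
)
```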

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136695
Approved by: https://github.com/zou3519
2024-10-26 07:41:00 +00:00
Yidi Wu
c6bb9b53f4 [scan] better error handling and remove redundant tests (#137967)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137967
Approved by: https://github.com/zou3519
2024-10-25 19:01:25 +00:00
Tom Ritchford
8aedc649bd Fix unbind_copy and add its decomposition (#134319)
* Fixes https://github.com/pytorch/pytorch/issues/130829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134319
Approved by: https://github.com/amjames, https://github.com/eellison
2024-10-23 19:13:44 +00:00
Tom Ritchford
1bc73f3157 Add decomposition for permute_copy (#130944)
* Extracted from #129476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130944
Approved by: https://github.com/amjames, https://github.com/eellison
2024-10-23 17:42:11 +00:00
William Wen
3441ea7d74 [dynamo] reset compiler stance after test (#138277)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138277
Approved by: https://github.com/anijain2305, https://github.com/jansel
2024-10-23 00:07:33 +00:00
Animesh Jain
4dd4d38ca9 [hierarchical-compilation][hop] Introduce invoke_subgraph (#137538)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137538
Approved by: https://github.com/zou3519
2024-10-22 15:33:34 +00:00
Colin L. Rice
bb8bc7d6b3 config: simplify most of the config handling and fix some bugs (#138377)
This PR combines a number of cleanups in one PR. If any of the specific cleanups don't seem to make sense, let me know and I can remove them.

Cleanups

- This PR adds a set of test suites for the config module code, which handles basically all the APIs and ways it is used. Please let me know if you see anything critical that is not tested that I missed. This test suite is primarily used as the regression test suite for later changes in this diff. Note that there is some dynamo specific testing of the config module, but it isn't as verbose.
- I removed all internal usage of shallow_copy_dict. Those usages could all use the deep copy, and did not depend on the reference behavior of certain config values that shallow_copy_dict allows.
- I removed shallow copy semantics for configuration with a deprecation warning. I think this requires a release note, so hopefully I did that correctly. Let me know if we want to continue to expose shallow copy value semantics, but I just can't find a case where I expect anyone would want it. It also complicated later internal changes to the API (i.e. breaking apart various layers of the config changes).
- I fixed what I believe is a bug in how hashes are calculated on configs. In particular, if you got the hash, then made a config change, and then got the hash again, it would not update the hash. @oulgen, please let me know if I'm misunderstanding this behavior and it is desired.
- I switched our multiple implementations of iterating through the dictionary to a single one. This is primarily to make later changes easier, but it also makes it clear how inconsistent our various config ignoring options are. Let me know if people would be interested in me unifying the various options for ignoring config values.
- I updated the test patcher (not the performance critical one, just the normal one), to use __setattr__ and __getattr__ to remove direct API access to the underlying config fetcher.

For release notes: I'm not sure exactly how to communicate this, but something like
"ConfigModule.to_dict, and ConfigModule.shallow_copy_dict no longer retain their shallow copy semantics, which allowed reference values objects to be modified. If you wish to modify the config object, call load_config explicitly".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138377
Approved by: https://github.com/ezyang, https://github.com/jansel, https://github.com/jovianjaison
2024-10-22 13:40:26 +00:00
Bob Ren
20a2d39557 Log all failing test repros to scuba (#138394)
This has the benefit that

1) It's much easier to aggregate test failure repros into say a CSV or shell script from scuba
2) We can do analysis (e.g. compute the set difference of two sets of tests across two PRs)
3) We can get results faster at test-level granularity, instead of the job-level granularity we see in the HUD/GH.

I tested this by introducing a breaking change, adding ci-scribe label and then verifying that the failed tests were logged to scuba: https://fburl.com/scuba/torch_open_source_signpost/w6qt7qr9

I then reverted the breaking change and published this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138394
Approved by: https://github.com/ezyang
2024-10-21 21:35:47 +00:00
Will Feng
e4ad02892f Upgrade distributed test to g4dn instances (T4 GPUs) (#137161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137161
Approved by: https://github.com/seemethere, https://github.com/eqy, https://github.com/yf225

Co-authored-by: Will Feng <yf225@cornell.edu>
2024-10-20 23:48:54 +00:00
Will Feng
a9f4f89cd5 [CI] Add Compiled DDP / Compiled FSDP2 / compute-comm reordering tests to test_inductor_distributed (#138178)
`test_replicate_with_compiler.py` and `test_fully_shard_compile.py` require bf16, so they need to be run within the test_inductor_distributed job (which uses A10G (SM80) and has bf16 support).

This allows us to migrate distributed jobs to T4 machines in https://github.com/pytorch/pytorch/pull/137161, as the compiled distributed jobs are the only blocking ones now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138178
Approved by: https://github.com/xmfan, https://github.com/fduwjj, https://github.com/fegin, https://github.com/kwen2501
2024-10-20 19:38:18 +00:00
Tom Ritchford
c0582fd0f8 Remove unused Python variables in torch/[b-z]* (#136963)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136963
Approved by: https://github.com/ezyang
2024-10-19 16:45:22 +00:00
wz337
ff598f2f4d [DTensorTestbase] Add an optional eager_init flag to with_comms() to support eager init nccl communicator for DeviceMesh test case (#138108)
Add an optional `eager_init` flag to `with_comms`.
When `eager_init` is True and the backend is `nccl`, we pass the `device_id` to `init_process_group()` for eager initialization.
Otherwise, `device_id` is still `None` and this goes through the normal lazy call.
Default for `eager_init` is False.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138108
Approved by: https://github.com/kwen2501
2024-10-19 01:04:55 +00:00
Ke Wen
c88b77af9c [Distributed][CI] Add SM guard for compiled tests involving BF16 (#138245)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138245
Approved by: https://github.com/yf225
2024-10-18 21:39:39 +00:00
PyTorch MergeBot
7b39fb5712 Revert "Fix unbind_copy and add its decomposition (#134319)"
This reverts commit 9f81270d75.

Reverted https://github.com/pytorch/pytorch/pull/134319 on behalf of https://github.com/clee2000 due to breaking some executorch tests D64568664 ([comment](https://github.com/pytorch/pytorch/pull/134319#issuecomment-2423157700))
2024-10-18 20:09:40 +00:00
PyTorch MergeBot
0ff6f7a040 Revert "[Distributed][CI] Add SM guard for compiled tests involving BF16 (#138245)"
This reverts commit 1581a93e87.

Reverted https://github.com/pytorch/pytorch/pull/138245 on behalf of https://github.com/albanD due to Breaks distributed inductor tests ([comment](https://github.com/pytorch/pytorch/pull/138245#issuecomment-2422462579))
2024-10-18 13:21:17 +00:00
Ke Wen
1581a93e87 [Distributed][CI] Add SM guard for compiled tests involving BF16 (#138245)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138245
Approved by: https://github.com/yf225
2024-10-18 09:10:01 +00:00
Xinya Zhang
770fcaf2ab Fix the Rank of logsumexp Tensor and mGPU support. (#137717)
The logsumexp tensor was intended for internal use only, but is apparently exposed to unit tests and Inductor.

The stream should be selected after picking the current device. Otherwise the code is checking the default device's architecture.

Fixes #131316 #137414

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137717
Approved by: https://github.com/drisspg

Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
2024-10-17 21:58:14 +00:00
Tom Ritchford
9f81270d75 Fix unbind_copy and add its decomposition (#134319)
* Fixes https://github.com/pytorch/pytorch/issues/130829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134319
Approved by: https://github.com/amjames, https://github.com/eellison
2024-10-17 21:27:35 +00:00
Joel Schlosser
906fe05895 Naive impls for NJT matmul (#138121)
Our matmul support is abysmal - let's at least get this working and do it performantly later.

Bonus: implements `bmm` as well.

Jagged <-> padded dense conversions are utilized when possible, with an unbind-based fallback otherwise (the former works with torch.compile and the latter doesn't). Some testing is missing because we don't have factory function support yet :(
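A hedged usage sketch, assuming the NJT x dense case is among those covered by this PR:

```python
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(2, 4), torch.randn(3, 4)], layout=torch.jagged
)  # conceptually (B, S*, 4) with ragged S*
w = torch.randn(4, 5)
out = torch.matmul(nt, w)  # stays jagged; last dim becomes 5
print(out.size(-1))
```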
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138121
Approved by: https://github.com/cpuhrsch
2024-10-17 01:31:46 +00:00
Jane Xu
94537e70b5 Skip test_parity__foreach_mul_fastpath_inplace_cuda_complex128 internally (#138100)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138100
Approved by: https://github.com/Skylion007
2024-10-17 00:34:56 +00:00
PyTorch MergeBot
4b3035f2fe Revert "Add decomposition for permute_copy (#130944)"
This reverts commit e7a4ad3b40.

Reverted https://github.com/pytorch/pytorch/pull/130944 on behalf of https://github.com/clee2000 due to breaking internal builds D64418214 cc @digantdesai @GregoryComer to help get this fixed and remerged ([comment](https://github.com/pytorch/pytorch/pull/130944#issuecomment-2418125356))
2024-10-16 23:18:53 +00:00
PyTorch MergeBot
24ee4af86b Revert "Upgrade distributed test to g4dn instances (T4 GPUs) (#137161)"
This reverts commit 2b7c7a20b9.

Reverted https://github.com/pytorch/pytorch/pull/137161 on behalf of https://github.com/kwen2501 due to breaking trunk ([comment](https://github.com/pytorch/pytorch/pull/137161#issuecomment-2417833666))
2024-10-16 20:05:38 +00:00
Ke Wen
2b7c7a20b9 Upgrade distributed test to g4dn instances (T4 GPUs) (#137161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137161
Approved by: https://github.com/seemethere, https://github.com/eqy
2024-10-16 16:42:57 +00:00
PyTorch MergeBot
78632b97b1 Revert "Upgrade distributed test to g4dn instances (T4 GPUs) (#137161)"
This reverts commit f43c4d28b8.

Reverted https://github.com/pytorch/pytorch/pull/137161 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seems another failure showing up after the upgrade ([comment](https://github.com/pytorch/pytorch/pull/137161#issuecomment-2415941159))
2024-10-16 07:26:34 +00:00
Ke Wen
f43c4d28b8 Upgrade distributed test to g4dn instances (T4 GPUs) (#137161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137161
Approved by: https://github.com/seemethere, https://github.com/eqy
2024-10-16 05:03:08 +00:00
Adnan Akhundov
809ff3b274 Add host-side Triton TMA support to Dynamo (#137677)
This adds Dynamo tracing support for the host-side Triton TMA API (see `create_2d_tma_descriptor` calls on the host in the [Triton tutorial](https://triton-lang.org/main/getting-started/tutorials/09-persistent-matmul.html#sphx-glr-getting-started-tutorials-09-persistent-matmul-py)). A few notes:

- Here we assume the availability of the host-side TMA API added to upstream Triton in https://github.com/triton-lang/triton/pull/4498. As of the time of writing, this is not part of the PT2 OSS Triton pin (although it has been back-ported internally). The OSS Triton pin update should be done in December 2024.
- To capture the chain of calls `t.data_ptr() --> create_{1d,2d}_tma_descriptor(ptr, ...) --> kernel[grid](tma_desc, ...)`, we add three new variable trackers: `DataPtrVariable`, `CreateTMADescriptorVariable` (for the function), `TMADescriptorVariable` (for TMA descriptor object). This is to maintain the path back from the Triton kernel to the Tensor from which the TMA descriptor has been created.
- The newly introduced variables have `reconstruct` methods used in case of graph breaks.
- The `tma_descriptor_metadata` extracted from the captured `create_{1d,2d}_tma_descriptor` calls is propagated through the HOPs in Dynamo and AOTAutograd to be used by the downstream compiler (e.g., Inductor). See the unit tests for how the captured HOP arguments look like.
- In the Dynamo-captured fx graph, we replace the TMA descriptor arguments of the Triton kernel by the underlying Tensors, to be able to track the input/output relationships in terms of Tensors.
- In the Triton kernel mutation analysis pass (in AOTAutograd), we use the `tt.experimental_descriptor_store` TTIR op to detect mutations of the underlying tensors via TMA descriptors. So that downstream AOTAutograd can perform functionalizations as required.
- JIT Inductor and AOT Inductor support will be implemented in follow-up PRs.

Differential Revision: [D64404928](https://our.internmc.facebook.com/intern/diff/D64404928)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137677
Approved by: https://github.com/zou3519
2024-10-16 02:18:48 +00:00
Ke Wen
35fc24fbed [PGNCCL] Fix bugs in non-blocking mode (#137741)
### Fix 1: Throw async error during init wait

Previously we just busy-waited for `ncclSuccess`; if the nonblocking init encountered an error, we never reported it. Added detection of async errors via `ncclGetAsyncError`.

### Fix 2: Add wait after comm split

```
  // After calling ncclCommSplit in non-blocking mode, we should wait for the
  // source communicator to be out of ncclInProgress state.
  // Reason 1:
  //   it's unsafe to call new operations on the parent comm while it's in
  //   ncclInProgress state.
  // Reason 2:
  //   as of NCCL 2.23, the ptr value of child comm will not be filled until the
  //   state of parent comm is ncclSuccess. This may change in the future. See:
  //   https://github.com/NVIDIA/nccl/issues/1472
```
This wait does not mean the child comm is ready for use, nor does it block until that point.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137741
Approved by: https://github.com/shuqiangzhang
2024-10-15 20:35:39 +00:00
Aaron Orenstein
524fe784ec BundledAutotuneCache (take 2) (#137902)
Summary:
Add a cache to combine individual autotune caches into a single cached bundle. We still rely on the individual autotune caches - on a cache hit we copy the individual results into the local caches so they can be retrieved later.

Attempt 2 of #134959 (D60677499).

Various configs:
env: TORCHINDUCTOR_BUNDLED_AUTOTUNE_REMOTE_CACHE
config: bundled_autotune_remote_cache
jk: pytorch/remote_cache:bundled_autotune_remote_cache_version
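
As a usage sketch (the env var and config name come from the list above; whether `"1"`/`True` are the accepted values is an assumption):

```python
import os
os.environ["TORCHINDUCTOR_BUNDLED_AUTOTUNE_REMOTE_CACHE"] = "1"  # env knob

import torch
import torch._inductor.config as inductor_config
inductor_config.bundled_autotune_remote_cache = True  # same knob via config

@torch.compile
def f(x):
    return (x @ x).relu()

f(torch.randn(64, 64))
```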

Test Plan:
unit tests

Manually tested w/ EMU:
```
cd fbcode/accelerators/workloads/models/emu_flash/v1p4
make build_benchmark_model && make save_model_to_path
make test_pt2_latency
```

- On a cold run we got 0 hits and 40 misses; on a warm run we got 40 hits and 0 misses.
- perf seems a little better - for 8 runs:
  - no bundled cache averaged 14m11s
  - bundled cache averaged 14m6s
  - 125ms saved per cache entry seems reasonable

Cache Metrics for a sample run:
no bundled cache:
```
INFO: Cache Metrics:
  FbMemcacheRemoteKernelCache: {hit: 2256, miss: 0, put: 0, exception: 0}
  FbRemoteAutotuneCache: {hit: 0, miss: 0, put: 7, exception: 0}
  FbRemoteFxGraphCache: {hit: 40, miss: 0, put: 0, exception: 0}
  LocalAutotuneCache: {hit: 878, miss: 0, put: 7, exception: 0}
  backend:MemcacheCache: {hit: 2256, miss: 0, put: 7, exception: 0}
  backend:_LocalAutotuneCacheBackend: {hit: 878, miss: 0, put: 7, exception: 0}
  backend:_ManifoldCache: {hit: 40, miss: 0, put: 0, exception: 0}
```
bundled cache:
```
INFO: Cache Metrics:
  FbMemcacheRemoteKernelCache: {hit: 2258, miss: 0, put: 0, exception: 0}
  FbRemoteAutotuneCache: {hit: 0, miss: 0, put: 8, exception: 0}
  FbRemoteBundledAutotuneCache: {hit: 40, miss: 0, put: 0, exception: 0} <<<<<<
  FbRemoteFxGraphCache: {hit: 40, miss: 0, put: 0, exception: 0}
  LocalAutotuneCache: {hit: 878, miss: 0, put: 886, exception: 0}
  backend:MemcacheCache: {hit: 2258, miss: 0, put: 8, exception: 0}
  backend:_LocalAutotuneCacheBackend: {hit: 878, miss: 0, put: 886, exception: 0}
  backend:_ManifoldCache: {hit: 80, miss: 0, put: 0, exception: 0}
```

Differential Revision: D64336043

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137902
Approved by: https://github.com/oulgen
2024-10-15 18:39:47 +00:00
Nikita Shulga
e4d7676c1b [CPU] Expand torch.special.i1 to Half and BF16 (#137899)
To match behavior of `torch.special.i0`

Noticed while looking at the failures in https://github.com/pytorch/pytorch/pull/137849

Also, add explicit high-precision template specialization for  `calc_i0` and `calc_i1` for `BFloat16` and `Half`
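
A quick sanity check of the new dtypes on a build with this change (mirroring how `torch.special.i0` already behaves):

```python
import torch

x = torch.linspace(0.0, 4.0, steps=5)
ref = torch.special.i1(x)                      # float32 reference
half = torch.special.i1(x.to(torch.half))      # now supported on CPU
bf16 = torch.special.i1(x.to(torch.bfloat16))  # now supported on CPU
print(ref, half.float(), bf16.float())
```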

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137899
Approved by: https://github.com/Skylion007
2024-10-15 17:00:58 +00:00
Tom Ritchford
e7a4ad3b40 Add decomposition for permute_copy (#130944)
* Extracted from #129476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130944
Approved by: https://github.com/amjames, https://github.com/eellison
2024-10-15 13:51:20 +00:00
ErezYosef
197601eeea Add Support for Tracking Parameter Names (named_parameters) in Optimizer State Dict (#134107)
A proposal addressing Issue #1489: **Optimizer should track parameter names and not id.**

(also mentioned here: [[RFC] Introducing FQNs/clarity eyeglasses to optim state_dict](https://dev-discuss.pytorch.org/t/rfc-introducing-fqns-clarity-to-optim-state-dict/1552))

## Summary
This PR introduces a backward-compatible enhancement where optimizers track parameter names instead of just their id.
Optimizers can be initialized with `named_parameters()` as:
```python
optimizer = optim.SGD(model.named_parameters(), lr=0.01, momentum=0.9)
```
This allows for greater clarity and ease when handling optimizers, as the parameters' names are preserved within the optimizer’s `state_dict` as:
```
state_dict =
{
    'state': {
    0: {'momentum_buffer': tensor(...), ...},
    1: {'momentum_buffer': tensor(...), ...},
    },
    'param_groups': [
        {
        'lr': 0.01,
        'weight_decay': 0,
        ...
        'params': [0, 1],
        'param_names': ['layer.weight', 'layer.bias']  # optional
        }
    ]
}
```
Loading `state_dict` is not changed (backward-compatible) and the `param_names` key will be ignored.

## Key Features
#### Named Parameters in Optimizer Initialization:
Optimizers can accept the output of `model.named_parameters()` during initialization, allowing them to store parameter names directly.
#### Parameter Names in `state_dict`:
The parameter names are saved as a list in the optimizer’s `state_dict` with key `param_names`, alongside the `params` indices, ensuring seamless tracking of both names and parameters.

## Backward Compatibility
#### No Breaking Changes:
This change is fully backward-compatible. The added `param_names` key in the optimizer's `state_dict` is ignored when loading a state to the optimizer.

#### Customization with Hooks:
For more control, the loaded state_dict can be modified using a custom `register_load_state_dict_pre_hook`, providing flexibility for different design needs.

## Documentation Updates
Please refer to the documentation changes for more details on how this feature is implemented and how it can be used effectively.

## Solution Example:

A suggested solution to the problem mentioned in #1489, for the same parameters but in a different order.
The following `register_load_state_dict_pre_hook` should be added to the optimizer before loading to enable loading the state dict:
```python
def adapt_state_dict_ids(optimizer, state_dict):
    # assuming a single param group.
    current_state_group = optimizer.state_dict()['param_groups'][0]
    loaded_state_group = state_dict['param_groups'][0]

    # same number of params, same names, only different ordering
    current_state_name_to_id_mapping = {}  # mapping --  param_name: id
    for i, name in enumerate(current_state_group['param_names']):
        current_state_name_to_id_mapping[name] = current_state_group['params'][i]

    # changing the ids of the loaded state dict to match the order of the given state dict.
    for i, name in enumerate(current_state_group['param_names']):
        loaded_state_group['params'][i] = current_state_name_to_id_mapping[name]

    return state_dict
```
In this code, the loaded `state_dict` ids are adapted to match the order of the current optimizer's `state_dict`.
Both the previous and the current optimizers must be initialized with `named_parameters()` so that the 'param_names' key is present in the dict.
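
For completeness, a hedged sketch of wiring the hook up, using `adapt_state_dict_ids` from the snippet above; the model and optimizers here are hypothetical, and passing `named_parameters()` to the optimizer relies on this PR:

```python
import torch
from torch import nn, optim

model = nn.Linear(4, 2)
# Initialized with named_parameters() so 'param_names' is recorded (this PR's feature).
opt_old = optim.SGD(model.named_parameters(), lr=0.01, momentum=0.9)
saved_state_dict = opt_old.state_dict()

opt_new = optim.SGD(model.named_parameters(), lr=0.01, momentum=0.9)
opt_new.register_load_state_dict_pre_hook(adapt_state_dict_ids)
opt_new.load_state_dict(saved_state_dict)
```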

### Note
This is my first contribution to PyTorch, and I wish to receive feedback or suggestions for improvement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134107
Approved by: https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2024-10-14 19:24:44 +00:00
iupaikov-amd
c09b567a91 Fixed error string assertion in test_invalid_devices (#137772)
The ROCm distribution returns a different error string for this operation, so the test fails on this assertion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137772
Approved by: https://github.com/Skylion007
2024-10-13 18:10:07 +00:00
Li, Xingyuan
0dbbcfa7ae [Inductor UT] Generalize newly introduced inductor UTs for intel GPU (Part 3) (#136947)
[Inductor UT] Generalize Newly introduced inductor UTs for intel GPU
reuse `test/inductor/test_pattern_matcher.py`
reuse `test/inductor/test_snode_runtime.py`
reuse `test/inductor/test_unbacked_symints.py`
fix `test/inductor/test_triton_kernels.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136947
Approved by: https://github.com/etaf, https://github.com/EikanWang, https://github.com/jansel
2024-10-12 13:21:20 +00:00
PyTorch MergeBot
b55ff476bd Revert "[Distributed] Fix extra context on device 0 (#135273)"
This reverts commit cdd8fa98c7.

Reverted https://github.com/pytorch/pytorch/pull/135273 on behalf of https://github.com/PaliC due to broken tests on trunk ([comment](https://github.com/pytorch/pytorch/pull/137161#issuecomment-2406236337))
2024-10-10 23:47:25 +00:00
Ke Wen
cdd8fa98c7 [Distributed] Fix extra context on device 0 (#135273)
This PR contains multiple fixes for issue https://github.com/pytorch/pytorch/issues/135279:

## First part:
Moves the GPU guard (`cudaSetDevice`) before the `currentStreamCaptureStatusMayInitCtx` call.
As its name suggests, it May Init Ctx.

## Second part:
Even with the above fix, additional contexts are still observed during Work object destruction, e.g.
```
work = dist.all_reduce(tensor, async_op=True)
time.sleep(5)  <-- no additional context yet
del work  <-- additional context shows up
```
### Debug process
Chased it down to the destruction of a `Future` object -- a member variable of `Work`.
Then further down to the following member of `Future`:
```
std::vector<c10::Event> events_;
```
When the `events_` are destroyed, we end up in:
1f3a793790/c10/cuda/impl/CUDAGuardImpl.h (L106-L121)

When there is no "preset" CUDA context (**which is the case for python garbage collector**), line 112: `c10::cuda::GetDevice(&orig_device)` will set `orig_device` to 0. Then, at line 120, `c10::cuda::SetDevice(orig_device)` will "officially" set the context to device 0 --
**that's where rank 1, 2, ... can create extra context on device 0!**
### Solution
This PR adds an explicit destructor to `Future`. In this destructor, destroy each event with a device guard.

## Test
Added test_extra_cuda_context, implemented via
- `pynvml` (if available), or
- memory consumption check.

`python test/distributed/test_c10d_nccl.py -k test_extra_cuda_context`
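
A rough sketch of the pynvml-based check described above (illustrative only, not the actual test code):

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
# With the fix, only rank 0 should hold a CUDA context on device 0; extra
# processes here would indicate the bug this PR addresses.
print(f"processes with a context on device 0: {len(procs)}")
pynvml.nvmlShutdown()
```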

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135273
Approved by: https://github.com/fduwjj, https://github.com/wconstab, https://github.com/eqy
ghstack dependencies: #137161
2024-10-10 17:16:34 +00:00
Joel Schlosser
3e2f276a14 Fix to() on non-contiguous NJTs (#137124)
Called out via torchrec integration: `lengths` is not handled properly.

Future work (not related to non-contiguous NJTs): #137275
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137124
Approved by: https://github.com/soulitzer
ghstack dependencies: #137030, #137031
2024-10-08 15:11:05 +00:00
Wei Feng
14b4099521 [FSDP2] support torch._foreach_copy_(float8) for fully_shard(Float8Linear) (#135955)
This PR unblocks the unit test with a single Float8Linear module. It fixes the following error:
```
torch._foreach_copy_(foreach_copy_dsts, all_gather_inputs)
[rank0]:E0913 13:44:29.829000 2179476 torch/testing/_internal/common_distributed.py:671] RuntimeError: "foreach_tensor_copy" not implemented for 'Float8_e4m3fn'
```

Differential Revision: [D63961071](https://our.internmc.facebook.com/intern/diff/D63961071)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135955
Approved by: https://github.com/vkuzo, https://github.com/eqy
2024-10-07 16:36:31 +00:00
vasiliy
a063a82c8b [redo] Fp8 support for item() with cuda, index_select, and fill_ cpu (#137341)
Summary:

Redo of https://github.com/pytorch/pytorch/pull/128780, easier to copy-paste.

Test Plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137341
Approved by: https://github.com/eqy
2024-10-07 00:58:51 +00:00
Siddharth Kotapati
e27c0048db Enable additional tests for MPS CI runs (#134356)
As part of the follow-up for https://github.com/pytorch/pytorch/issues/133520, this adapts existing unused tests for use in MPS CI runs, focusing on NHWC and other memory-format tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134356
Approved by: https://github.com/malfet, https://github.com/eqy, https://github.com/huydhn
2024-10-04 21:52:38 +00:00
PyTorch MergeBot
cd17b2645c Revert "[Distributed] Fix extra context on device 0 (#135273)"
This reverts commit a93d3873e9.

Reverted https://github.com/pytorch/pytorch/pull/135273 on behalf of https://github.com/albanD due to Broken trunk distributed ci ([comment](https://github.com/pytorch/pytorch/pull/135273#issuecomment-2393772987))
2024-10-04 13:58:57 +00:00
Ke Wen
a93d3873e9 [Distributed] Fix extra context on device 0 (#135273)
This PR contains multiple fixes for issue https://github.com/pytorch/pytorch/issues/135279:

## First part:
Moves the GPU guard (`cudaSetDevice`) before the `currentStreamCaptureStatusMayInitCtx` call.
As its name suggests, it May Init Ctx.

## Second part:
Even with the above fix, additional contexts are still observed during Work object destruction, e.g.
```
work = dist.all_reduce(tensor, async_op=True)
time.sleep(5)  <-- no additional context yet
del work  <-- additional context shows up
```
### Debug process
Chased it down to the destruction of a `Future` object -- a member variable of `Work`.
Then further down to the following member of `Future`:
```
std::vector<c10::Event> events_;
```
When the `events_` are destroyed, we end up in:
1f3a793790/c10/cuda/impl/CUDAGuardImpl.h (L106-L121)

When there is no "preset" CUDA context (**which is the case for python garbage collector**), line 112: `c10::cuda::GetDevice(&orig_device)` will set `orig_device` to 0. Then, at line 120, `c10::cuda::SetDevice(orig_device)` will "officially" set the context to device 0 --
**that's where rank 1, 2, ... can create extra context on device 0!**
### Solution
This PR adds an explicit destructor to `Future`. In this destructor, destroy each event with a device guard.

## Test
Added test_extra_cuda_context, implemented via
- `pynvml` (if available), or
- memory consumption check.

`python test/distributed/test_c10d_nccl.py -k test_extra_cuda_context`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135273
Approved by: https://github.com/fduwjj, https://github.com/wconstab, https://github.com/eqy
2024-10-04 00:44:02 +00:00
Shangdi Yu
c83178d894 Change to export_for_training in XNNPACK tests (#137238)
Summary: as title

Test Plan: CI

Differential Revision: D63344674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137238
Approved by: https://github.com/tugsbayasgalan
2024-10-03 21:28:05 +00:00
Simon Fan
b86269fab5 Unify cpp_extension build directory removal (#136059)
Keeps existing default directory clearing logic, even though it fails when TORCH_EXTENSIONS_DIR is set. To properly clear, we'd need to track all the folders we compiled the extensions to.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136059
Approved by: https://github.com/ezyang, https://github.com/albanD
2024-10-03 06:22:11 +00:00
Ke Wen
7631a04081 [c10d] Fix the device query story of ProcessGroup (#136790)
Function `_get_pg_default_device` is being used outside of `distributed_c10d.py`.

A concern is that people may not be aware of what it actually does, due to bad naming of this function:
`Return the device to use with ``group`` for control flow usage (object collectives, barrier).`

The remediation is as follows:
- Added a deprecation warning to `_get_pg_default_device`;
- Added a private function `_get_object_coll_device` to undertake what it does;
- Added a `_device_capability` function for users who want to query the device support of a PG -- it returns a plain list, no more "default" choice.
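
A hedged usage sketch of the new private helpers (the exact signatures and import location are assumptions based on the description above):

```python
import torch.distributed as dist
from torch.distributed.distributed_c10d import (
    _get_object_coll_device,   # replacement for _get_pg_default_device
    _device_capability,        # plain list of devices the PG supports
)

pg = dist.group.WORLD
dev = _get_object_coll_device(pg)   # device used for object collectives / barrier
caps = _device_capability(pg)
```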

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136790
Approved by: https://github.com/H-Huang
2024-10-03 01:36:22 +00:00
Joel Schlosser
6374a19a6e Fix wrapper subclass serialization with custom sizes / strides (#137030)
Fixes #130154

This PR takes the strategy outlined in the above issue and clears out any cached sizes / strides PyCapsules before serialization. This affects the default subclass serialization logic.

The PyCapsule issue also affects `deepcopy`, so that's fixed here as well.

Note: I originally tried utilizing a context manager to remove / restore cached PyCapsules after serialization, but in practice the state returned from `_reduce_ex_internal()` references the actual `tensor.__dict__()`, so the problem persists once the cached values are restored. Instead, we have to be careful to remove the cached values in the right place so they're not re-cached when pulling out size / stride information for serialization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137030
Approved by: https://github.com/albanD
2024-10-02 18:55:03 +00:00
Benjamin Glass
f984b88718 Ensure noncontiguous tensor creation tests offsetting (#136396)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136396
Approved by: https://github.com/amjames, https://github.com/eellison
ghstack dependencies: #136055
2024-10-02 00:40:43 +00:00
Jez Ng
99eb47fb6d Add CI for Triton CPU backend (#135342)
Where possible, I have marked failing tests (which we intend to fix or triage) as `@xfail_if_triton_cpu`. This will help us track progress of the Triton CPU backend over time. Tests that I don't think we need to address, or that are flaky, have been marked as skips.

Successful CI run: https://github.com/pytorch/pytorch/actions/runs/10822238062/job/30028284549

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135342
Approved by: https://github.com/jansel, https://github.com/desertfire, https://github.com/malfet
2024-10-01 20:43:10 +00:00
Tom Ritchford
b85f21fc1d Add decomposition for squeeze_copy (#130941)
* Extracted from #128416

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130941
Approved by: https://github.com/amjames, https://github.com/eellison
ghstack dependencies: #136653
2024-10-01 10:23:22 +00:00
Nikita Shulga
c610aa80dc Testing: Unblock new_* testing on MPS (#137003)
By changing `other_dtype` to `torch.half` rather than `double` in
`sample_inputs_new_fns` if MPS is available
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137003
Approved by: https://github.com/Skylion007
ghstack dependencies: #136981, #136982, #136983, #136984, #136985, #136986
2024-09-30 19:06:12 +00:00
ankurneog
22a4129a76 Generalization of FSDP common for non-cuda execution (#133209)
## Motivation
The FSDP common code for FSDP UT execution is mostly written with the CUDA device in mind. However, other devices such as the Intel Gaudi support most of the functionality. We are generalizing the base content so that the UT content can be used for non-CUDA device execution.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133209
Approved by: https://github.com/kwen2501
2024-09-27 00:38:10 +00:00
Joel Schlosser
991f8f8ec3 Bias gradient calculation for NJT linear backward (#136660)
Previously NYI - @mikaylagawarecki needs it for Transformers.

Fixes #136652
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136660
Approved by: https://github.com/soulitzer
2024-09-26 21:38:10 +00:00
drisspg
840c6b7a68 [FlexAttention] Add Better error message for cpu tensors (#136673)
Partially address: #136525

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136673
Approved by: https://github.com/Chillee
2024-09-26 16:40:21 +00:00
Howard Huang
141cae2eb8 [pipelining] Fix more leaks and check leaks in tests (#136584)
Fix two more leaks of the same variety as #136507 (see that PR desc and attached gdoc for debug details).

This time, also add a test-time check that helped to discover new leaks and ensures we won't accidentally regress.

Adds `check_tensor_leak` util which internally asserts no tensors are being kept alive by other objects involved in py ref cycles.

Uses objgraph for a nice debug utility when a leak is found.
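
Roughly, the check does something like the following (names and details here are illustrative, not the actual `check_tensor_leak` implementation):

```python
import gc
import torch
import objgraph

def assert_no_leaked_tensors():
    gc.set_debug(gc.DEBUG_SAVEALL)  # keep members of collected cycles in gc.garbage
    gc.collect()
    leaked = [o for o in gc.garbage if isinstance(o, torch.Tensor)]
    gc.set_debug(0)
    if leaked:
        # Render back-references of the first leaked tensor for debugging.
        objgraph.show_backrefs(leaked[0], max_depth=8, filename="/tmp/leak.png")
        raise AssertionError(f"{len(leaked)} tensors were kept alive by ref cycles")
```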

Credit to @H-Huang for pointing out objdump and helping debug the `param_group["intermediates"]` leak.

I manually confirmed that all 3 of the leaks identified/fixed so far are caught by the unit test and checker.

Sample output, if I re-introduce a leak by commenting out `del param_group["intermediates"]` in _backward.py,
and run `python test/distributed/pipelining/test_schedule_multiproc.py -k test_schedule_with_native_zero_bubble`:

```
warnings.warn(
/data/users/whc/pytorch/torch/testing/_internal/common_utils.py:5341: UserWarning: 34 tensors were found in the garbage. Did you introduce a reference cycle?
warnings.warn(
/data/users/whc/pytorch/torch/testing/_internal/common_utils.py:5347: UserWarning: Dumping first 1 objgraphs of leaked tensors rendered to png
Graph written to /tmp/objgraph-ztz642h3.dot (19 nodes)
Graph viewer (xdot) not found, generating a png instead
Image generated as /tmp/objgraph-ztz642h3.png
```

rendering of ` /tmp/objgraph-ztz642h3.png`:
<img width="1671" alt="image" src="https://github.com/user-attachments/assets/9098ff29-224c-4533-935b-83c210ac2e22">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136584
Approved by: https://github.com/kwen2501, https://github.com/H-Huang
ghstack dependencies: #136507

Co-authored-by: Howard Huang <howardhuang@fb.com>
2024-09-26 01:10:40 +00:00
Huy Do
b7a5c7d331 Do not XFAIL test_segfault in fbcode (#136661)
https://github.com/pytorch/pytorch/pull/136252 silence the failure on OSS, but the test actually passed on fbcode [T202241133](https://www.internalfb.com/intern/tasks/?t=202241133)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136661
Approved by: https://github.com/malfet
2024-09-25 22:26:24 +00:00
IvanKobzarev
370c1c4297 [aotd] Fix rrelu compilation (#136008)
Issues:
https://github.com/pytorch/pytorch/issues/135083
https://github.com/pytorch/pytorch/issues/120292

The rrelu decomposition contains a mutation (`copy_`). Decompositions are executed below functionalization; as a result, AOT produces a non-functional graph.

Also, that decomposition is registered as a python_dispatch kernel for AutogradCUDA.
Autograd dispatch happens above functionalization, so registering it for Autograd (to handle all backends) makes functionalization run after it.

Testing:
```
python test/functorch/test_aotdispatch.py -k test_rrelu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136008
Approved by: https://github.com/bdhirsh
2024-09-25 11:26:19 +00:00
Joel Schlosser
888744bd36 NJT binary pointwise broadcasting support via jagged <-> padded dense conversion (#133021)
Related: #132695

This PR uses padded dense <-> jagged conversions to handle binary pointwise broadcasting of (NT, T) and (T, NT). This includes:
* `(B, j0, D) + (1, 1, 1)`
* `(B, j0, D) + (B, 1, 1)`
* `(B, j0, D) + (B, 1, D)`
* etc.

This PR also adds (hacky) support for bool inputs to the jagged <-> padded dense conversions. The underlying CUDA kernels do not support integer / bool inputs; so the following workaround is employed: `convert input -> half, run conversion kernel, convert output -> bool`. Note that this bool support is needed specifically for the backward formula of `fmax`, and likely others.
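
A small example of the (NT, T) case from the list above (the jagged-layout construction is standard NJT usage; the broadcast itself relies on this PR):

```python
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(3, 8), torch.randn(5, 8)], layout=torch.jagged
)                              # (B=2, j0, D=8)
dense = torch.randn(2, 1, 8)   # (B, 1, D)
out = nt + dense               # broadcasts via jagged <-> padded dense conversion
print(out.shape)
```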

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133021
Approved by: https://github.com/cpuhrsch
2024-09-24 19:11:49 +00:00
ankurneog
efed357ef5 Add dtypes support in opinfo for Intel Gaudi (#132840)
## Motivation
This is following up on changes introduced in https://github.com/pytorch/pytorch/pull/128584
we are adding the dtype information to be picked up while executing the UTs for Intel Gaudi/HPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132840
Approved by: https://github.com/albanD
2024-09-24 17:17:15 +00:00
gaopengff
33e10803c8 Fix ut in internal distributed_test.py (#136251)
The test case **test_new_subgroups_by_enumeration_input_rank_exceeds_world_size** failed for me, and passes with this small change. The expected exception is supposed to be "ValueError" rather than "RuntimeError" according to the [code](https://github.com/pytorch/pytorch/blob/v2.4.1/torch/distributed/distributed_c10d.py#L4190).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136251
Approved by: https://github.com/kwen2501
2024-09-24 15:06:20 +00:00
Joel Schlosser
83a3ee0699 Support embedding_bag() with NJT input (#135888)
Fixes #93843

`EmbeddingBag()` / `embedding_bag()` support 1D inputs with offsets to handle raggedness. NJT is a natural fit here as it already maintains offsets of the same form. This PR updates the python-side to support NJT and adds corresponding OpInfo-based NJT tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135888
Approved by: https://github.com/cpuhrsch
2024-09-23 17:35:19 +00:00
Isuru Fernando
f276da7f98 Remove prims.slice_in_dim and prims.slice (#136150)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136150
Approved by: https://github.com/ezyang
2024-09-23 01:27:22 +00:00
Aaron Gokaslan
b6ffa381e1 [BE]: Add half CUDA support nextafter (#136373)
Making CUDA support match CPU support for nextafter
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136373
Approved by: https://github.com/ezyang
2024-09-21 17:13:45 +00:00
Isuru Fernando
0c936c3ecb Add decomps for max_unpool (#133146)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133146
Approved by: https://github.com/amjames, https://github.com/eellison
2024-09-20 21:35:25 +00:00
Jeff Daily
15dba021bb [ROCm][CI] upgrade CI to ROCm 6.2 (#132555)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132555
Approved by: https://github.com/pruthvistony, https://github.com/malfet
2024-09-20 17:39:31 +00:00
Igor Sugak
bce52d0b60 [CODEMOD][caffe2] use npt.NDArray instead of np.ndarray in type annotations (#136288)
Summary:
To facilitate the PSS-2 upgrade, this uses `npt.NDArray` instead of `np.ndarray` in type annotations. In Numpy-1.19 (PSS-1) it's an alias of `np.ndarray` -- a noop.
In Numpy-1.24, `npt.NDArray` is a proper generic type, and without this change uses of `np.ndarray` generate this Pyre type error:
```counterexample
 Invalid type parameters [24]: Generic type `np.ndarray` expects 2 type parameters.
```
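
The shape of the annotation change, for illustration:

```python
import numpy as np
import numpy.typing as npt

def normalize(x: npt.NDArray[np.float64]) -> npt.NDArray[np.float64]:
    # np.ndarray needs two type parameters under the newer stubs;
    # npt.NDArray takes just the dtype.
    return x / np.linalg.norm(x)
```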

Test Plan: Sandcastle plus visual inspection

Differential Revision: D62977370

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136288
Approved by: https://github.com/kit1980
2024-09-19 12:40:36 +00:00
Huy Do
db80b98ec4 XFAIL test_segfault (#136252)
Fixes https://github.com/pytorch/pytorch/issues/128551

As this has been failing in trunk for a while and there is no owner yet to fix it properly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136252
Approved by: https://github.com/andrewkho
2024-09-19 04:17:06 +00:00
fduwjj
3efaa016b1 [c10d] Make test compatible for new pytest (#136158)
Temporary fix to the issue in https://github.com/pytorch/pytorch/issues/127517.

Short-term fix following CPython: 51aefc5bf9/Lib/unittest/case.py (L419-L426)

Differential Revision: [D62878083](https://our.internmc.facebook.com/intern/diff/D62878083)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136158
Approved by: https://github.com/fegin
2024-09-18 17:10:55 +00:00
CaoE
6a6f5b20c5 Add _addmm_activation to lower precision cast policy on AutocastCPU (#135936)
Fixes #132613.
Add `_addmm_activation` to lower precision cast policy on AutocastCPU.
`_addmm_activation` (https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/transformers/transformer.cpp#L39), used by `transformer_encoder_layer_forward`, may throw `RuntimeError: mat1 and mat2 must have the same dtype, but got BFloat16 and Float` when autocast is enabled, since `_native_multi_head_attention` was put in the lower-precision cast policy in https://github.com/pytorch/pytorch/pull/107674 and `_addmm_activation` may therefore encounter mixed data types.
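
A hedged repro sketch of the failure mode (shapes and module parameters are arbitrary; whether the fused fast path that calls `_addmm_activation` triggers depends on the usual eval/no-grad conditions):

```python
import torch

layer = torch.nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True).eval()
x = torch.randn(2, 8, 64)
with torch.no_grad(), torch.autocast("cpu", dtype=torch.bfloat16):
    out = layer(x)  # the fused fast path calls _addmm_activation
```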

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135936
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-09-18 16:31:27 +00:00
Prachi Gupta
b5be4d8c05 Fix ROCm skip decorator for test_ddp_tp and multiprocess UTs (#136161)
skip_if_rocm is intended only for the multiprocess case (when the UT test class is a child of MultiProcessTestCase): each individual process can exit with a skip code. If used for a single-process UT, it will cause the UT to fail, as the process returns a non-zero exit code; use skipIfRocm in single-process UTs.

To avoid the above confusion, this PR renames skip_if_rocm to skip_if_rocm_multiprocess.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136161
Approved by: https://github.com/jithunnair-amd, https://github.com/kwen2501, https://github.com/fegin
2024-09-18 11:01:23 +00:00
Joel Schlosser
a8382847f4 Support rms_norm() for NJT (#135872)
`rms_norm()` is a nice-to-have for ViT :)

This PR:
* SymInt-ifies `rms_norm()`, allowing NJT to use the same decomp.
* Adds torch_function-based input validation logic for nested-specific stuff (no normalization supported over the ragged dim for now) on the python NJT side.
* Adds multi-dim support (on non-ragged, non-batch dims) to `mean()` for NJT.
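
A hedged sketch of the decomposed path on an NJT (normalization over the last, non-ragged dim; behavior relies on this PR):

```python
import torch
import torch.nn.functional as F

nt = torch.nested.nested_tensor(
    [torch.randn(3, 16), torch.randn(5, 16)], layout=torch.jagged
)
out = F.rms_norm(nt, normalized_shape=[16])  # no normalization over the ragged dim
```
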
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135872
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #125947
2024-09-17 18:09:20 +00:00
PyTorch MergeBot
462b727d1e Revert "Add decomposition for permute_copy (#130944)"
This reverts commit ab9a7eadd3.

Reverted https://github.com/pytorch/pytorch/pull/130944 on behalf of https://github.com/jeanschmidt due to Broke internal signal executorch.backends.xnnpack.test.ops.permute.TestPermute, more details on D62737086. @eellison could you please help get this PR merged to main? ([comment](https://github.com/pytorch/pytorch/pull/130944#issuecomment-2355846394))
2024-09-17 13:42:55 +00:00
PyTorch MergeBot
2c4ae81494 Revert "Add decomposition for squeeze_copy (#130941)"
This reverts commit c33b0580e6.

Reverted https://github.com/pytorch/pytorch/pull/130941 on behalf of https://github.com/jeanschmidt due to Need to revert in order to be able to revert https://github.com/pytorch/pytorch/pull/130944, after fixing any merge conflicts, feel free to merge it back ([comment](https://github.com/pytorch/pytorch/pull/130941#issuecomment-2355831480))
2024-09-17 13:39:07 +00:00
PyTorch MergeBot
7fe004f7cf Revert "Add CI for Triton CPU backend (#135342)"
This reverts commit 426580a67d.

Reverted https://github.com/pytorch/pytorch/pull/135342 on behalf of https://github.com/jeanschmidt due to Broke internal signals, see D62737208 for more details ([comment](https://github.com/pytorch/pytorch/pull/133408#issuecomment-2353623816))
2024-09-16 18:33:33 +00:00
Aaron Gokaslan
23c0d2689e [BE][Ez]: Fix missing float16 coverage for adaptive_pool3d_cpu (#136091)
Testing if op info coverage has issues

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136091
Approved by: https://github.com/ezyang
2024-09-16 18:22:16 +00:00
Aaron Gokaslan
b491e2974c [BE][Ez]: Add full half/bfloat16 dtype for unique and isin (#136114)
Fixes #136090

* Add support for isin to tensor half dtypes for CPU (just add a few extra dispatches).
* Seems like the CUDA implementation for bfloat16 was mostly compiled and available all along (it just calls sort internally AND unique). To enable it, we just need to remove an assert to access it (since sort's functionality was updated since the assert was added) and add missing dtype support to unique.
* This unlocks more GPU functionality with minimal code bloat. I also added CPU kernels for the dtypes for parity.
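
For example (assumes a CUDA device and a build with this change; CPU gains the same dtypes):

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0], dtype=torch.bfloat16, device="cuda")
b = torch.tensor([2.0, 4.0], dtype=torch.bfloat16, device="cuda")
print(torch.isin(a, b))  # tensor([False,  True, False], device='cuda:0')
print(torch.unique(a))   # bfloat16 unique now dispatches on CUDA as well
```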

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136114
Approved by: https://github.com/malfet
2024-09-16 17:49:12 +00:00
Tom Ritchford
c33b0580e6 Add decomposition for squeeze_copy (#130941)
* Extracted from #128416

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130941
Approved by: https://github.com/amjames, https://github.com/eellison
2024-09-16 15:46:57 +00:00
Tom Ritchford
ab9a7eadd3 Add decomposition for permute_copy (#130944)
* Extracted from #129476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130944
Approved by: https://github.com/amjames, https://github.com/eellison
2024-09-15 19:35:14 +00:00
Jez Ng
426580a67d Add CI for Triton CPU backend (#135342)
Where possible, I have marked failing tests (which we intend to fix or triage) as `@xfail_if_triton_cpu`. This will help us track progress of the Triton CPU backend over time. Tests that I don't think we need to address, or that are flaky, have been marked as skips.

Successful CI run: https://github.com/pytorch/pytorch/actions/runs/10822238062/job/30028284549

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135342
Approved by: https://github.com/jansel
ghstack dependencies: #133408
2024-09-14 21:45:19 +00:00
CaoE
db393fb95e Add Half support for reflection and replication padding on CPU (#135931)
Fixes #135680

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135931
Approved by: https://github.com/Skylion007
2024-09-14 14:18:55 +00:00
PyTorch MergeBot
1786a17fed Revert "Use _amp_foreach_non_finite_check_and_unscale_ for CPU grads of ShardedGradScaler (#135232)"
This reverts commit 51c5206133.

Reverted https://github.com/pytorch/pytorch/pull/135232 on behalf of https://github.com/CaoE due to wrong commit ([comment](https://github.com/pytorch/pytorch/pull/135232#issuecomment-2350792806))
2024-09-14 02:31:06 +00:00
CaoE
51c5206133 Use _amp_foreach_non_finite_check_and_unscale_ for CPU grads of ShardedGradScaler (#135232)
Use `_amp_foreach_non_finite_check_and_unscale_` instead of fallback version for CPU grads of `ShardedGradScaler ` as `_amp_foreach_non_finite_check_and_unscale_ ` is supported on CPU https://github.com/pytorch/pytorch/pull/109281.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135232
Approved by: https://github.com/ezyang
2024-09-14 02:20:58 +00:00
PyTorch MergeBot
b6d6aa49b8 Revert "Validate input types for torch.nn.Linear and torch.nn.Bilinear (#135596)"
This reverts commit e157ce3ebb.

Reverted https://github.com/pytorch/pytorch/pull/135596 on behalf of https://github.com/malfet due to It's too restrictive, should allow other int-like types, such as `numpy.int64` ([comment](https://github.com/pytorch/pytorch/pull/135596#issuecomment-2349714104))
2024-09-13 18:06:56 +00:00
Sanskar Modi
e157ce3ebb Validate input types for torch.nn.Linear and torch.nn.Bilinear (#135596)
Add validation checks for the input types and display better error messages for them.
Fixes https://github.com/pytorch/pytorch/issues/135463

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135596
Approved by: https://github.com/malfet
2024-09-12 21:28:37 +00:00
Mayank Mishra
9a04cfbeff fix for fp16 (#134106)
This PR is a replacement for https://github.com/pytorch/pytorch/pull/133085 for pushing a quick fix for RMSNorm.
The original author is @kkontny

Previous PR summary:
Since FP16 has quite a small dynamic range, it is very easy to overflow while computing `at::pow(input, 2)`, and it happens in real-world computation.

I tried to use the fused `nn.RMSNorm` implementation instead of `LlamaRMSNorm` inside the `transformers` implementation of Llama (`src/transformers/models/llama/modeling_llama.py`). It started to give wrong answers in FP16 while still giving good ones in FP32. I figured out this happens due to overflow while computing the square of the input tensor.

The original `LlamaRMSNorm` implementation upcasts the input to FP32 to prevent this and give better numerical stability.

```
class LlamaRMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        """
        LlamaRMSNorm is equivalent to T5LayerNorm
        """
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        input_dtype = hidden_states.dtype
        hidden_states = hidden_states.to(torch.float32)
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states.to(input_dtype)
```

The proposed commit fixes the issue. FP16 in RMSNorm has to be treated in a special way to be usable in real-world implementations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134106
Approved by: https://github.com/mikaylagawarecki, https://github.com/eqy
2024-09-11 22:02:07 +00:00
Xinya Zhang
74fd1bf965 [ROCm] Update to AOTriton 0.7b (#134498)
Notable changes:
1. Enable CudaGraph related tests
2. Fix UT problems
3. EXPERIMENTAL Navi31 support. User should enable Navi31 support with Env Var `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1`

Known problem:
1. `test/test_transformers.py` will show massive failures and/or NaN outputs with `--use-pytest`
    + Update: Confirmed that skipping `class TestSDPAPrivateUse1Only` can fix the problem with `--use-pytest`

Note:
AOTriton 0.7b adds support for nested tensors + SDPA but needs more work (and consequently a separate PR) to enable it.

Fixes #133540

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134498
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily, https://github.com/malfet
2024-09-11 20:34:01 +00:00
Tom Ritchford
e05ea2b179 Add decomposition for transpose_copy (#130943)
* Extracted from #128416
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130943
Approved by: https://github.com/amjames, https://github.com/eellison
2024-09-11 19:45:22 +00:00
Chien-Chin Huang
1d9fefff19 [DCP] Fixes the stateless optimizer issue of distributed state_dict (#135535)
Some optimizers don't have states, which can cause get_state_dict/set_state_dict to behave incorrectly. This PR fixes the issues.

fixes: https://github.com/pytorch/pytorch/issues/133415

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135535
Approved by: https://github.com/wz337
2024-09-10 03:10:00 +00:00
Wanchao Liang
cfc227ad43 [reland][dtensor] move DTensor to public namespace (#134203)
reland of https://github.com/pytorch/pytorch/pull/133113

I have to create a new PR because the previous reverted PR could not either be rebased, or imported successfully :(

----

Moving DTensor to be in the public namespace, to formally add the documentation page that includes all the public APIs. This includes:

* many path renames and path import fixes
* a dedicated doc page without too much content yet (adding in the next PRs)
* To preserve the BC for users still using the torch.distributed._tensor, I added a shim script to redirect old path calls to the new module

The BC preservation is evidenced by the fact that all DTensor tests still work without changing the public imports, so it's safe to land the changes.
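
As an import-level illustration (the public module path here is my reading of this PR and is hedged accordingly):

```python
# Old private path keeps working through the shim; the public path is preferred.
from torch.distributed._tensor import DTensor as LegacyDTensor
from torch.distributed.tensor import DTensor, distribute_tensor

print(LegacyDTensor is DTensor)  # expected True if the shim simply re-exports
```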

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134203
Approved by: https://github.com/tianyu-l
2024-09-08 17:08:40 +00:00
Nowtryz
a15aabc975 Add MaskedTensor passthrough: unfold, F.Unfold, F.Fold, stack (#125262)
Hi,
I noticed the `unfold` operator was missing on MaskedTensor.

I tested that my change works when calling unfold and backward on a `MaskedTensor` but I didn't find the tests for the dispatch of such operation. Where is it?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125262
Approved by: https://github.com/cpuhrsch
2024-09-06 19:06:23 +00:00
Will Feng
84ae6b7d6b AOTDispatcher: limit cases when we detach() graph inputs to non-leaves (#134193)
This PR is slightly a revival / update to the discussion from https://github.com/pytorch/pytorch/pull/98960:

Part of FSDP2's tracing strategy right now is that:

(1) it is painful/difficult to handle the case where we have multiple graph input tensors that are aliased to each other and at least one of them is duplicated

(2) we already have longstanding logic in dynamo to remove duplicate input tensors from the graph. Morally, FSDP2 gives us duplicate input tensors in the backward graph for every `unsharded_param`, because we have (a) the `unsharded_param` being closed over by the backward hook to resize/allgather, and (b) the same `unsharded_param` being saved for backward by autograd (we now guarantee in the partitioner that we will always save the base tensor for backward and recompute views)

(3) However, we were still seeing cases where the `unsharded_param` showed up twice in the backward graph inputs, as distinct tensor objects (with different python ids) instead of being true duplicates that dynamo can de-dup.

It turns out that this was because we were `.detach()`ing the `unsharded_param` in AOTDispatcher before plumbing it through the compiled forward (and so autograd would save a detach'd version of the `unsharded_param`). This is precisely because of the logic from https://github.com/pytorch/pytorch/pull/98960.

However, re-reading the detailed comments, it seems unnecessary to do a detach() on a graph input that is a (leaf) `nn.Parameter`, even if it happens to get no gradients in the backward. Since it is a leaf, we don't have to worry about the autograd engine "continuing to backprop through the graph beyond the current tensor" (the leaf has no other grad_fn for autograd to backprop through).

So this PR makes us a bit less aggressive about calling detach() on inputs: we only do it when:

(1) our graph input statically will get a `None` gradient (and also has no metadata mutations, the existing state)

(2) **and** our graph input is a non-leaf tensor (so detach()ing is actually required to prevent autograd from incorrectly backpropping past the non-leaf).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134193
Approved by: https://github.com/yf225

Co-authored-by: Will Feng <yf225@cornell.edu>
2024-09-06 14:06:48 +00:00
Shangdi Yu
590a3e9f8a [export][training ir migration] quantized_decomposed.quantize_per_tensor decomposition (#134525)
Summary:
In the graph of the TestXNNPACKQuantizer.test_dynamic_linear_with_conv test, some quantized_decomposed.quantize_per_tensor.default ops are becoming quantized_decomposed.dequantize_per_tensor.tensor ops when using the new training IR.

This is because we lift params/buffers before calling make_fx. Previously, for the graph that's passed to make_fx, `graph.L__self___linear1.weight` was a tensor;
now, in the training IR, graph.L__self___linear1.weight is a FakeTensor. This caused the node overload to be different.

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_dynamic_linear_with_conv
```

Differential Revision: D61364547

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134525
Approved by: https://github.com/tugsbayasgalan, https://github.com/jerryzh168
2024-09-06 07:06:06 +00:00
Avik Chaudhuri
43f4947d44 fix fake tensor tolist implementation (#135131)
Summary:
When exporting for training with `tolist`, we do not hit `FunctionalTensor.tolist` since we do not functionalize. Unfortunately, this means we hit `FakeTensor.tolist`, which creates unbacked symints that are not backed by proxies.

Rather than trying to patch up this low-level implementation, we replace it with essentially what `FunctionalTensor.tolist` does, which is higher-level: we essentially desugar to `item()` calls and let it take care of unbacked symints.

Test Plan:
Some expected failures are gone now.
Also found a test for `tolist` that was written when `FunctionalTensor.tolist` was implemented but not really doing much; repurposed it now to exercise more modes.

Differential Revision: D62197742

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135131
Approved by: https://github.com/ezyang
2024-09-05 23:20:31 +00:00
Henry Tsang
bb3c2408f4 [inductor][test] in test_unbacked_symints, replace inductor's skipCUDAIf with common device type's skipcudaif (#133936)
Differential Revision: D61506212

Use `skipCUDAIf` from `torch.testing._internal.common_device_type` if we create the test class with `instantiate_device_type_tests`.

`instantiate_device_type_tests` makes sure the class has the attr `device_type`, which works with `skipCUDAIf` from `torch.testing._internal.common_device_type`.

Also skipping test_vertical_pointwise_reduction_fusion for cpu test class, since the test expects cuda.

FAILED [0.0026s] test/inductor/test_unbacked_symints.py::TestUnbackedSymintsCPU::test_vertical_pointwise_reduction_fusion_cpu - AttributeError: 'TestUnbackedSymintsCPU' object has no attribute 'device'

repro:
```
CUDA_VISIBLE_DEVICES="" pytest test/inductor/test_unbacked_symints.py -k cpu -v
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133936
Approved by: https://github.com/ColinPeppler, https://github.com/desertfire
2024-09-05 16:40:14 +00:00
eqy
4f70b3cfae [CUDA][complex][TF32] Update test_noncontiguous_samples tolerances for complex64 (#134526)
Recent cuDNN heuristics change surfaces same TF32 issue as `float32`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134526
Approved by: https://github.com/ezyang
2024-09-04 23:37:16 +00:00
FFFrog
80a6d60829 Moving _run_autocast_outofplace to basic class named TestAutocast to reduce redundance (#134460)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134460
Approved by: https://github.com/EikanWang, https://github.com/ezyang
2024-09-04 10:48:58 +00:00
Nikita Shulga
71383dd3da [MPS] Fix bachnorm_2d for channels last (#134618)
By skipping the gather of the input tensor if the memory layout is channels_last, which is a first step towards fixing https://github.com/pytorch/pytorch/issues/134580

Though the underlying problem is much more interesting: MPS does not have generic support for channels-last, but `c10::is_contiguous()` is true for the channels-last layout.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134618
Approved by: https://github.com/albanD
2024-09-03 19:20:11 +00:00
Tobias Ringwald
758d787901 Added complex support for torch.logsumexp (#133187)
Added complex support for `torch.logsumexp`. Implemented complex backward pass for `torch.logsumexp`.

Fixes #133047

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133187
Approved by: https://github.com/amjames, https://github.com/lezcano
2024-09-03 17:28:36 +00:00
IvanKobzarev
33ba952e31 [subclasses] Do not fakeTensor const prop subclass args (#134855)
The issue:

Const propagation only checks that the arguments do not contain FakeTensors. If an argument is a subclass, it will pass this condition.

As a result, const propagation executes without FakeTensorMode, and having tensor factories inside Subclass.__torch_dispatch__ means the resulting tensor is not fakified.

Solution:

If we have subclass arguments, do not treat const propagation as doable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134855
Approved by: https://github.com/zou3519
2024-09-03 13:31:49 +00:00
chilli
6fce1faa10 change multinomial to use async asserts instead of a synchronization (#134818)
Fixes https://github.com/pytorch/pytorch/issues/134442

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134818
Approved by: https://github.com/ezyang
ghstack dependencies: #134813
2024-09-03 06:33:24 +00:00
Nikita Shulga
f95085fd91 [BE][MPS] Prefer xfail to skip (#134858)
This essentially undoes the large skips (on everything but macOS Sequoia) applied to nn.modules by https://github.com/pytorch/pytorch/pull/128393

Instead it uses the existing `xfail`, but guards it on the `_macos15_or_newer` boolean

Before the change if run on MacOS 14:
```
 % python3 ../test/test_modules.py -v -k Hardswish 2>&1|tail -n3
Ran 57 tests in 0.053s

OK (skipped=32)
```
After
```
% python3 ../test/test_modules.py -v -k Hardswish 2>&1|tail -n3
Ran 57 tests in 0.229s

OK (skipped=10, expected failures=2)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134858
Approved by: https://github.com/janeyx99
2024-08-31 00:29:48 +00:00
Masaki Kozuki
e21d7b77ce Update ForeachfuncInfo.sample_inputs_func to yield scalars & scalarlists that are more friendly to test_meta (#134552)
for `test_meta.py` to see more "PASSED" instead of "XFAIL".

`pytest test_meta.py -k "_foreach_"` ran 6400 test cases and:
- This PR: 4702 passed, 260 skipped, 73732 deselected, 1698 xfailed
- main (92c4771853): 3906 passed, 260 skipped, 73732 deselected, 2494 xfailed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134552
Approved by: https://github.com/janeyx99
2024-08-30 17:30:50 +00:00
PyTorch MergeBot
f997b2b8e6 Revert "Add MaskedTensor passthrough: unfold, F.Unfold, F.Fold, stack (#125262)"
This reverts commit f685018ea9.

Reverted https://github.com/pytorch/pytorch/pull/125262 on behalf of https://github.com/ZainRizvi due to Hi, this PR appears to be calling maskedtensor tests to fail on main. Please rebase your changes onto the latest trunk build to repro the failure. test_maskedtensor.py::TestOperatorsCUDA::test_like_empty_like_layout1_cuda_bool [GH job link](https://github.com/pytorch/pytorch/actions/runs/10604716811/job/29393256312) [HUD commit link](f685018ea9) ([comment](https://github.com/pytorch/pytorch/pull/125262#issuecomment-2316387447))
2024-08-28 23:10:07 +00:00
Nowtryz
f685018ea9 Add MaskedTensor passthrough: unfold, F.Unfold, F.Fold, stack (#125262)
Hi,
I noticed the `unfold` operator was missing on MaskedTensor.

I tested that my change works when calling unfold and backward on a `MaskedTensor` but I didn't find the tests for the dispatch of such operation. Where is it?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125262
Approved by: https://github.com/cpuhrsch
2024-08-28 21:30:39 +00:00
Chien-Lin Chen
40de63be09 parameterized test_graph_optims and test_graph_scaling_fused_optimizers (#133749)
Fixes #123451

This is a rework of a reverted pull request, https://github.com/pytorch/pytorch/pull/125127.
The test failure is fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133749
Approved by: https://github.com/janeyx99
2024-08-28 16:34:06 +00:00
David Berard
289486d007 Move attention kernels back from fake_impls to meta_registrations (#134288)
See #121528 for additional context.

In #120682, we moved the attention kernels from meta_registrations to fake_impls with the intent of fixing the device handling for seed/offset: these are typically on CPU. We needed to put the registrations in fake_impls to do this because meta_registrations doesn't have a way to specify device, whereas fake_impls does. But when we tried to actually fix the device types (#120839), we had to revert the PR because it broke cudagraph handling (during which seed/offset _are_ on CUDA).

Now, we want to put the registrations back in meta_registrations so that we can call these kernels with meta tensors. The use case is later in this stack - we want to be able to use the flop counter with these kernels.

Also - I specifically skip the `compare_tensor_meta()` check in test_fake / test_fake_autocast tests for the `_efficient_attention_forward` and `_flash_attention_forward` kernels, which fails because of the device mismatch from the seed/offset tensors. Then we can un-skip these opinfos. I verified that the efficient_attention_forward bug (#120842) is now caught by these opinfos if I revert the fix from this PR.

Differential Revision: [D61687369](https://our.internmc.facebook.com/intern/diff/D61687369)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134288
Approved by: https://github.com/drisspg
2024-08-27 21:10:36 +00:00
Bo Li
16b8146c9e Exclude test_transformers and unit tests which require recent GPU arch (#132895)
This PR is to exclude test_transformers on ROCm temporarily and skip some unit tests which require recent GPU arch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132895
Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/malfet
2024-08-27 20:40:53 +00:00
Nikita Shulga
8de0d7690c Use newer toAccumulateType signature in Normalization.cpp (#134540)
This fixes BatchNorm behavior when called with empty tensors on the MPS backend. Removed `expectedFailureMPS` in test_nn.py, deleted the expected failure in `test_mps.py`, and adjusted `skipIfMPS` to `expectedFailureMPS` in the BatchNorm2d OpInfo decorator, restricting it to the memory-format tests only.

Test Plan: CI + `python3 -c "import torch; print(torch.nn.BatchNorm2d(3, device='mps')(torch.rand(0, 3, 2, 2, device='mps')))"`

Fixes https://github.com/pytorch/pytorch/issues/134423

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134540
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-08-27 18:09:20 +00:00
Colin Peppler
13049cd6e5 [aotinductor][UserDefinedTritonKernel] fix case with non-constexpr params declared after autotuned params (#134520)
## Context
In some user Triton kernels, we have this set-up for whatever reason.
```
@triton.jit
def mykernel(
  param0,
  param1,
  param2,
  param3: tl.constexpr,   # autotuned
  param4,                 # non-constexpr
):
  ...
```

This is an edge case because it's general practice to declare all constexpr params at the end.

And this will be an issue for AOTI because it fails to codegen all 4 params. That will surface as a device-side error: CUDA IMA, invalid argument...

```
>     void* kernel_args_var_0[] = {&var_0, &var_1, &var_2};
---
<     CUdeviceptr var_3;
<     AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_get_data_ptr(buf0, reinterpret_cast<void**>(&var_3)));
<     void* kernel_args_var_0[] = {&var_0, &var_1, &var_2, &var_3};
```

## Root-cause
* `kernel.constexpr` from the Kernel side-table contains the indices for all `constexpr` params that includes autotuned params.
* `raw_args`, that gets passed to wrapper codegen, excludes autotuned args.
* In the wrapper codegen, we try to find non-constexpr args using `kernel.constexpr` & `raw_args`. This is okay unless there's a `raw_arg` after an autotuned param in the function signature.

79b7fff188/torch/_inductor/codegen/cpp_wrapper_cuda.py (L118-L126)

## Fix
We try to fix this, by calculating the right constexprs wrt `raw_args`.

An illustration
```
         raw_args: [arg0, arg1, arg2, arg4]
 kernel.arg_names: [param0, param1, param2, param3, param4]
kernel.constexprs: [3]                      # param3 is autotuned; this is correct wrt kernel.arg_names
constexpr_indices: []                       # this is correct wrt raw_args
```

Differential Revision: [D61831625](https://our.internmc.facebook.com/intern/diff/D61831625)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134520
Approved by: https://github.com/oulgen
2024-08-27 17:20:27 +00:00
Mikayla Gawarecki
d028b810fe Fix flaky GroupNorm ModuleInfo test (#133899)
Fixes https://github.com/pytorch/pytorch/issues/98677

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133899
Approved by: https://github.com/albanD
2024-08-27 14:45:51 +00:00
wizzniu
3515090006 Fix TypeError when itering NoneType in instantiate_device_type_tests() (#134457)
Fixes #134454

Fix the TypeError introduced by https://github.com/pytorch/pytorch/pull/133082, which calls iter on the NoneType default args ``except_for`` and ``only_for``.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134457
Approved by: https://github.com/shink, https://github.com/albanD
2024-08-27 07:13:36 +00:00
Roy Hvaara
1565940114 [MPS] Add test/test_nn.py to test suite (#134184)
This PR increases test coverage by including the tests in `test/test_nn.py` in the test suite of MPS.

Some of the tests are decorated with `@expectedFailureMPS` for various reasons. Either that the op is not implemented, or that the outputs do not align. Those tests that contain differing results should be investigated further to rule out any live bugs.

```bash
$ python test/run_test.py --mps --verbose -k TestNN
Running test batch 'tests to run' cost 84.76 seconds
```

Ref #133520

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134184
Approved by: https://github.com/albanD, https://github.com/malfet
2024-08-26 23:48:23 +00:00
Benjamin Glass
55236d0cb7 TestForeach::test_parity: Remove check for error message text (#134251)
Previously, error messages were expected to be string equivalent to
error messages thrown by the ref function.  This check fails for dozens
of torch functions, and doesn't appear to add much value for the end
user.  This commit removes this check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134251
Approved by: https://github.com/amjames, https://github.com/janeyx99
ghstack dependencies: #134253, #134344
2024-08-26 22:40:54 +00:00
Benjamin Glass
ef8c474fcf Add the fast path for bfloat16 lgamma (#134344)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134344
Approved by: https://github.com/amjames, https://github.com/janeyx99
ghstack dependencies: #134253
2024-08-26 22:40:54 +00:00
Benjamin Glass
3c5883e550 Fix test_parity xfail for sigmoid (#134253)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134253
Approved by: https://github.com/amjames, https://github.com/janeyx99
2024-08-26 22:40:54 +00:00
Mengwei Liu
a0e062c6f1 Add mean.dtype_out (#133506)
Give it a try and see if CI is happy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133506
Approved by: https://github.com/bdhirsh
2024-08-26 19:26:11 +00:00
Wuxun Zhang
1d231ff8ba [HOO] add hints_wrapper to support passing context hints (#132860)
Fixes #126393

The implementation code is based on feedback here (https://github.com/pytorch/pytorch/pull/121639#issuecomment-2223948842).

Hints are passed as kwargs of hints_wrapper op. It also supports nested hints.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132860
Approved by: https://github.com/ydwu4, https://github.com/zou3519
2024-08-26 18:21:22 +00:00
Benjamin Glass
27d97b9649 Remove unnecessary test skip (#134250)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134250
Approved by: https://github.com/amjames, https://github.com/janeyx99
2024-08-26 14:34:53 +00:00
Laith Sakka
7b6b10417d Remove ansi escape chars in assertExpectedInline and add options to skip comments and to skip empty lines (#134248)
I had a nightmare rewriting tests in test_misc.py specifically:
1. Graphs can have comments that refer to my files ("/lsakka/.."); we really don't care about comments, so add an option to ignore comments.
2. Empty lines added when EXPECTTEST_ACCEPT=1 are changed by the linter, causing the tests or the linter to fail! Add a flag to ignore empty lines.
3. EXPECTTEST_ACCEPT fails when the text has some unreadable characters. Those should not affect string comparison, and they also cause weird diffs when tests fail. I removed the ANSI escape chars: https://github.com/pytorch/pytorch/pull/133045

this is used in

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134248
Approved by: https://github.com/aorenste
ghstack dependencies: #133639, #134364
2024-08-26 02:03:44 +00:00
Xu Han
1034f456ef [inductor] fix munge_exc not support windows path (#134348)
Windows file paths use `\` as the delimiter, which is also an escape character. We need to translate every `\` in the path to `/`, like on Linux.
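
A hedged sketch of the idea (not the actual patch): normalize the separators before the path string is used as a regex pattern.

```python
import re

def normalize_path_for_regex(path: str) -> str:
    # Translate Windows "\" separators to "/" so the path is a valid regex.
    return path.replace("\\", "/")

win_path = r"D:\xu_git\dnnl_cb\pytorch\test"
re.compile(normalize_path_for_regex(win_path))  # compiles fine
# re.compile(win_path) would raise: re.error: incomplete escape \x
```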

Reproduce UT:
```cmd
pytest test\dynamo\test_higher_order_ops.py -v -k test_vmap_grad_vmap_guard_fail
```
Error msg:
```cmd
________________________________________________________________________________________________________ HigherOrderOpVmapGuardTests.test_vmap_grad_vmap_guard_fail _________________________________________________________________________________________________________
Traceback (most recent call last):
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\logging_utils.py", line 89, in test_fn
    fn(self, records)
  File "D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_higher_order_ops.py", line 2714, in test_vmap_grad_vmap_guard_fail
    munge_exc(record.getMessage()),
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_utils.py", line 5252, in munge_exc
    s = re.sub(file, os.path.basename(file), s)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\re.py", line 209, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\re.py", line 303, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_compile.py", line 788, in compile
    p = sre_parse.parse(p, flags)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_parse.py", line 955, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_parse.py", line 444, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_parse.py", line 526, in _parse
    code = _escape(source, this, state)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_parse.py", line 370, in _escape
    raise source.error("incomplete escape %s" % escape, len(escape))
re.error: incomplete escape \x at position 2

To execute this test, run the following from the base repo dir:
    python test\dynamo\test_higher_order_ops.py HigherOrderOpVmapGuardTests.test_vmap_grad_vmap_guard_fail

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
--------------------------------------------------------------------------------------------------------------------------- Captured stdout call ----------------------------------------------------------------------------------------------------------------------------
frames [('total', 2), ('ok', 2)]
inductor []
inline_call []
stats [('calls_captured', 38), ('unique_graphs', 2)]
--------------------------------------------------------------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------------------------------------------------------------
V0824 01:29:00.148000 27840 torch\_dynamo\guards.py:2787] [0/1] [__recompiles] Recompiling function fn in D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_higher_order_ops.py:2699
V0824 01:29:00.148000 27840 torch\_dynamo\guards.py:2787] [0/1] [__recompiles]     triggered by the following guard failure(s):
V0824 01:29:00.148000 27840 torch\_dynamo\guards.py:2787] [0/1] [__recompiles]     - 0/0: torch._functorch.pyfunctorch.compare_functorch_state([('Vmap', 1, 'error')])  # _dynamo\output_graph.py:479 in init_ambient_guards
========================================================================================================================== short test summary info ==========================================================================================================================
FAILED [0.7452s] test/dynamo/test_higher_order_ops.py::HigherOrderOpVmapGuardTests::test_vmap_grad_vmap_guard_fail - re.error: incomplete escape \x at position 2
```
Local test passed:
<img width="860" alt="image" src="https://github.com/user-attachments/assets/90f0d780-0639-4c03-8d7c-6f227c93a3fc">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134348
Approved by: https://github.com/jansel
2024-08-24 05:51:35 +00:00
Aaron Gokaslan
83b5d449a3 Add full float16/bfloat16 support to MaxUnPool (#133774)
It already supported half, so we might as well add bfloat16 support for parity.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133774
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-08-22 13:34:43 +00:00
Zitong Zhan
90c821814e SparseCsrCUDA: cuDSS backend for linalg.solve (#129856)
This PR switches to the cuDSS library and has the same purpose as #127692, which is to add Sparse CSR tensor support to linalg.solve.
Fixes #69538

Minimum example of usage:
```
import torch

if __name__ == '__main__':
    spd = torch.rand(4, 3)
    A = spd.T @ spd
    b = torch.rand(3).to(torch.float64).cuda()
    A = A.to_sparse_csr().to(torch.float64).cuda()

    x = torch.linalg.solve(A, b)
    print((A @ x - b).norm())

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129856
Approved by: https://github.com/amjames, https://github.com/lezcano, https://github.com/huydhn

Co-authored-by: Zihang Fang <zhfang1108@gmail.com>
Co-authored-by: Huy Do <huydhn@gmail.com>
2024-08-22 07:57:30 +00:00
Yuanhao Ji
cdb9c7d228 Add support for using privateuse1 backend name in instantiate_device_type_tests() (#133082)
As you can see, 'privateuse1' appears many times in out-of-tree extension codebases. I think everything about the device type should be the same as for other in-tree backends once the privateuse1 backend is registered.

For example, after registering a privateuse1 backend named "foo", you should allow "foo" to be passed in as a valid device type.

```diff
- instantiate_device_type_tests(TestIndexing, globals(), only_for='privateuse1')
- instantiate_device_type_tests(NumpyTests, globals(), only_for='privateuse1')
+ instantiate_device_type_tests(TestIndexing, globals(), only_for='foo')
+ instantiate_device_type_tests(NumpyTests, globals(), only_for='foo')
```

> https://github.com/Ascend/pytorch/blob/master/test/test_indexing.py#L1654-L1655

The change is to map privateuse1 backend name to 'privateuse1' when calling `filter_desired_device_types()`.
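
A hedged sketch of that mapping (the helper name and the "foo" backend are illustrative): a registered privateuse1 backend name is normalized back to 'privateuse1' before filtering.

```python
def normalize_device_types(device_types, privateuse1_name="foo"):
    # Map the registered custom backend name back to the canonical key.
    return ["privateuse1" if d == privateuse1_name else d for d in device_types]

print(normalize_device_types(["cpu", "cuda", "foo"]))  # ['cpu', 'cuda', 'privateuse1']
```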

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133082
Approved by: https://github.com/albanD
2024-08-22 06:17:21 +00:00
chilli
938f37b745 Added batching rule for sdpa_math, sdpa_efficient_attention forward, cudnn, and flash attention (#133964)
Fixes https://github.com/pytorch/pytorch/issues/117016, https://github.com/pytorch/pytorch/issues/102457, https://github.com/pytorch/pytorch/issues/110525, https://github.com/pytorch/pytorch/issues/108065,

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133964
Approved by: https://github.com/Skylion007
2024-08-22 05:29:49 +00:00
xinan.lin
c2e2602ecd [Inductor] Move GPU_TYPE (the runtime available GPU type, cuda or xpu) from (#132740)
Move GPU_TYPE (the runtime available GPU type, cuda or xpu) from `testing/_internal/inductor_utils.py` to `_inductor/utils.py`, so that we can use it in Inductor itself, not only in test cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132740
Approved by: https://github.com/EikanWang, https://github.com/jansel
2024-08-21 11:18:00 +00:00
krzysztofjordan
2e1830c7c8 Implement 2D version of masked_select for nestedtensors (#133889)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133889
Approved by: https://github.com/soulitzer
2024-08-20 21:46:32 +00:00
Chirag Pandya
994fcb9acd Killswitch based rollout for flight recorder (#133237)
Summary: Defaulting TORCH_NCCL_DUMP_ON_TIMEOUT to "true" and adding a killswitch in case we need to kill this feature in production.

Test Plan: Tests pass manually but need further testing before this is rolled out fully everywhere.

Differential Revision: D61136320

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133237
Approved by: https://github.com/c00w
2024-08-20 04:27:55 +00:00
drisspg
fb26b84390 Update fused kernels and call _safe_softmax from SDPA (#133882)
# UPDATE:
This is take 3 of https://github.com/pytorch/pytorch/pull/131863, which was landed via codev but did not apply correctly.

# Summary
Changes the stance of SDPA on what to do for fully masked out rows

## Current Behavior
Several PyTorch users have expressed frustration over this issue:
- https://github.com/pytorch/pytorch/issues/41508
- https://github.com/pytorch/pytorch/issues/103749
- https://github.com/pytorch/pytorch/issues/103963

These are significant issues with extensive discussion but no satisfactory resolution. The PyTorch team's consensus, as stated here:
https://github.com/pytorch/pytorch/issues/24816#issuecomment-524415617

Can be paraphrased as follows:

When passing in fully masked out rows, attention becomes ambiguous. We have two main options:

1. Uniformly attend to all values:
   ```python
   scores[masked_out_rows] = 1 / len(row)
   out[masked_out_rows] = 1 / len(row) * value
   ```

2. Decide that attention between no queries (masked) and no keys (masked) is meaningless:
   ```python
   output[fully_masked_rows] = NaN
   ```

We went with option 2. Partially because it was easier to implement, but also people argued that users can slice the output to remove the NaNs:
``` Python
>fill_value = -float("inf")
>row0 = torch.randn(4)
>row1 = torch.tensor([fill_value for _ in range(4)])
>matrix = torch.stack([row0, row1]).requires_grad_(True)
>out = torch.softmax(matrix, 1)
>out = out[0]
>print(out)
tensor([0.5377, 0.2729, 0.0692, 0.1201])
```
Cool, problem solved. But what happens when you call backward...
```Python
>out.backward(torch.ones_like(out))
>print(matrix.grad)
tensor([[3.0957e-08, 1.4157e-08, 7.7802e-10, 1.3713e-08],
        [       nan,        nan,        nan,        nan]])
```
Those pesky NaNs are back!

## Why do we see NaNs today?

The core of the problem revolves around using softmax function in sdpa:

```python
> row = torch.tensor([(-float("inf")) for _ in range(4)])
> torch.softmax(row, 0)
tensor([nan, nan, nan, nan])
```

## Quick Aside: Masking in Attention

Attention itself doesn't have a concept of masking. The `sdpa` function has an argument called `attn_mask`, which would be more accurately named `attn_bias`. This is because we don't actually "mask" entries when computing attention. Instead, due to implementation details([performance](https://github.com/pytorch/pytorch/issues/25110#issuecomment-524519087)), we add a value to the masked-out query/key pairs.

We use a large negative number (typically -inf) to decrease the attention weight, as softmax assigns more weight to larger values.
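
For concreteness, a small illustration (not from the PR) of turning a boolean mask into an additive bias:

```python
import torch

mask = torch.tensor([True, True, False])                  # False = masked out
bias = torch.zeros(3).masked_fill(~mask, float("-inf"))   # additive "attn_bias"
scores = torch.randn(3) + bias
print(torch.softmax(scores, dim=-1))                      # masked slot gets weight 0
```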

## Alternative Approaches

If we use a very large negative number instead of -inf:

```python
> row = torch.tensor([(-1e6) for _ in range(4)])
> torch.softmax(row, 0)
tensor([0.2500, 0.2500, 0.2500, 0.2500])
```
However if users always remembered to "slice" out their outputs i.e.:
```Python
>fill_value = -1e6
>...
>out.backward(torch.ones_like(out))
>print(matrix.grad)
tensor([[-0.0563, -0.0564,  0.1613, -0.0486],
        [ 0.0000,  0.0000,  0.0000,  0.0000]])
```
This would bring us back into a better state.

## A Third Option

We don't necessarily need to alter the behavior of softmax for -inf or very large negative numbers. The fundamental goal is to exclude certain query/key pairs from attention, regardless of the underlying implementation.

This PR implements the new semantic for masking w/ attention in fully masked-out rows:
```python
out[masked_out_rows] = 0
```

**Important Note**: This idea isn't entirely new. The [MaskedTensor](https://pytorch.org/tutorials/prototype/maskedtensor_overview#safe-softmax) prototype, a tensor subclass, was designed to handle such cases. However, it remains a prototype feature and hasn't gained widespread adoption.

## Details
This PR stack does 3 things:
1. Adds a PRIVATE _safe_softmax op
2. Updates semantic for flash_cpu fused kernel
3. Updates semantic for efficient_cuda fused kernel

_safe_softmax is not supposed to be used generically and is only meant to be used within the context of SDPA. Due to this fact, instead of decomposing softmax and checking for -inf rows, we instead "cheat" and use nan_to_num.
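
A hedged, eager-mode sketch of that "cheat" (illustrative only, not the real `_safe_softmax` op):

```python
import torch

def safe_softmax_sketch(scores, dim=-1):
    # NaNs from fully -inf rows become 0 after softmax + nan_to_num.
    return torch.softmax(scores, dim=dim).nan_to_num(0.0)

row = torch.full((4,), float("-inf"))
print(torch.softmax(row, -1))    # tensor([nan, nan, nan, nan])
print(safe_softmax_sketch(row))  # tensor([0., 0., 0., 0.])
```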

Why do I think this is okay? (Please find a counterpoint if available.)
There are multiple ways NaNs can emerge. For the fully masked out rows case nan_to_num works. But what if there were other NaNs, wouldn't this silently remove them?

The only case that this can happen is if the input itself had a NaN or an Inf
For example:
```Python
a = torch.ones([4], requires_grad=False, dtype=torch.float16)
a[1] = torch.finfo(torch.float16).max
print(a.softmax(-1))
```
Will return
`tensor([0., 1., 0., 0.], dtype=torch.float16)`

Where
```Python
a = torch.ones([4], requires_grad=False, dtype=torch.float16)
a[1] = float("inf")
a.softmax(-1)
```
returns:
`tensor([nan, nan, nan, nan], dtype=torch.float16)`

If we don't want to allow even the possibility of "inf" or "NaN" attention scores being converted to 0, then we can implement it with something like this:

```Python
max = torch.max(a, dim=-1, keepdim=True)
exp = torch.exp(a - max.values)
denom = torch.sum(exp, dim=-1, keepdim=True)
softmax = exp / denom
softmax = torch.where(max.values == float('-inf'), 0.0, softmax)
```
however we would be paying for this in math performance.

## Why Now
I think one point that has substantially changed where PyTorch should lie on this argument is the fact that we have fused implementations for SDPA now. And these fused implementations allow us to easily and performantly support this new semantic.

Differential Revision: [D61418679](https://our.internmc.facebook.com/intern/diff/D61418679)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133882
Approved by: https://github.com/soulitzer
2024-08-19 18:53:11 +00:00
PyTorch MergeBot
35f36363ec Revert "[dtensor] move DTensor to public namespace (#133113)"
This reverts commit 2ee6b97464.

Reverted https://github.com/pytorch/pytorch/pull/133113 on behalf of https://github.com/wanchaol due to looks like it break some internal type imports ([comment](https://github.com/pytorch/pytorch/pull/133113#issuecomment-2295670911))
2024-08-19 05:00:19 +00:00
Jiang, Yanbing
215b14530a Add Half for sparse.mm reduce (#133672)
This PR adds Half support for sparse.mm reduce in the CPU backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133672
Approved by: https://github.com/Skylion007
2024-08-17 15:20:39 +00:00
Wanchao Liang
2ee6b97464 [dtensor] move DTensor to public namespace (#133113)
Moving DTensor to be in the public namespace, to formally add the
documentation page that includes all the public APIs. This includes:

* many path renames and path import fixes
* a dedicated doc page without too much content yet (adding in the next
  PRs)
* To preserve the BC for users still using the `torch.distributed._tensor`,
  I added a shim script to redirect old path calls to the new module

The BC preservation is evidenced by the fact that all DTensor tests still
pass without changing the public imports, so it's safe to land the
changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133113
Approved by: https://github.com/XilunWu
ghstack dependencies: #133305, #133306
2024-08-17 05:09:52 +00:00
Li, Xingyuan
dcfa415e6e [Inductor UT] Reuse inductor UT for intel GPU test/inductor/test_compiled_optimizers.py (#133083)
[Inductor UT] Reuse Inductor test case for Intel GPU.
Reuse `test/inductor/test_compiled_optimizers.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133083
Approved by: https://github.com/etaf, https://github.com/jansel, https://github.com/mlazos
2024-08-17 01:15:26 +00:00
Denis Vieriu
861bdf96f4 [MPS] Add native strided API for MPSNDArray starting with macOS 15 (#128393)
Add support for native strides in MPS starting with macOS Sequoia. This will get rid of the additional gather and scatter operations needed to solve the strides or storage offsets of the tensors.

Summary of changes (starting with macOS 15):
- Add support for **MPS strided API** (strides/storage offsets etc):
   - [initWithBuffer:offset:descriptor:](https://developer.apple.com/documentation/metalperformanceshaders/mpsndarray/4391636-initwithbuffer?language=objc)
   - [arrayViewWithCommandBuffer:descriptor:aliasing:](https://developer.apple.com/documentation/metalperformanceshaders/mpsndarray/3114040-arrayviewwithcommandbuffer?language=objc)
   - [arrayViewWithShape:strides:](https://developer.apple.com/documentation/metalperformanceshaders/mpsndarray/4408694-arrayviewwithshape?language=objc)
   - [reshapeWithCommandBuffer:sourceArray:shape:destinationArray:](https://developer.apple.com/documentation/metalperformanceshaders/mpsndarrayidentity/4438557-reshapewithcommandbuffer?language=objc)
- Add native support for NHWC convolutions (without incurring any extra copy from NCHW -> NHWC -> NCHW).
- Add support for strided output buffers (previously we would create a contiguous buffer).

OSes older than macOS 15 will run the old gather/scatter code path to solve strides/storage offsets.

---

Couple performance stats collected from torchbench comparing macOS 15 vs macOS 14:
```
- test_train[functorch_maml_omniglot-mps]: 27% faster
- test_train[timm_vision_transformer-mps]: 12% faster
- test_train[hf_T5-mps]: 9.46% faster
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128393
Approved by: https://github.com/albanD

Co-authored-by: Siddharth Kotapati <skotapati@apple.com>
2024-08-16 21:07:50 +00:00
Mikayla Gawarecki
d9576c9440 Fix failures when default is flipped for weights_only (#127627)
Tests on XLA shard not fixed yet but there is an issue here https://github.com/pytorch/xla/issues/7799

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127627
Approved by: https://github.com/albanD
ghstack dependencies: #132349
2024-08-16 00:22:43 +00:00
PyTorch MergeBot
cfec69e2a1 Revert "Update fused kernels and call _safe_softmax from SDPA (#131863)"
This reverts commit caba37e99b.

Reverted https://github.com/pytorch/pytorch/pull/131863 on behalf of https://github.com/izaitsevfb due to breaks executorch test executorch/backends/apple/coreml:test - test_vit_skip_conv (executorch.backends.apple.coreml.test.test_coreml_partitioner.TestCoreMLPartitioner) ([comment](https://github.com/pytorch/pytorch/pull/131863#issuecomment-2291855634))
2024-08-15 17:55:07 +00:00
Jane Xu
c23dceb8f1 Add Adafactor foreach impl (#132336)
This PR adds the foreach impl for Adafactor knowing that there are many ways to improve its runtime perf today (by adding more foreach support). After this PR:
- we have a foreach flag for Adafactor
- It is NOT the default. Why not? It is only slightly faster + uses O(n) more memory where n is the number of params in your max param group. People tend to use Adafactor for memory efficiency.
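
A hedged usage sketch of the flag (assuming it is exposed as a `foreach` keyword argument on `torch.optim.Adafactor`, per the description above):

```python
import torch

params = [torch.randn(16, 16, requires_grad=True)]
opt = torch.optim.Adafactor(params, foreach=True)  # opt-in; not the default

loss = (params[0] ** 2).sum()
loss.backward()
opt.step()
```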

Next steps:
- make torch.compile possible on it
- make it faster (by adding more foreach apis)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132336
Approved by: https://github.com/albanD
ghstack dependencies: #133360
2024-08-15 17:00:33 +00:00
Xuehai Pan
758a0a88a2 [BE][Easy] enable ruff rule PIE790: unnecessary pass statement (#133200)
This PR removes unnecessary `pass` statements. This is semantically safe because the bytecode for the Python code does not change.

Note that if a function has a docstring, an otherwise empty function does not need a `pass` statement as a placeholder.
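
A small example of the rule (PIE790), showing the redundant `pass`:

```python
def noop():
    """Do nothing."""
    pass  # flagged by ruff PIE790: the docstring already serves as the body

def noop_fixed():
    """Do nothing."""
```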

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133200
Approved by: https://github.com/malfet, https://github.com/eqy, https://github.com/kit1980
2024-08-15 15:50:19 +00:00
Aaron Gokaslan
ec49ce5f8e [CUDA]: Add frexp CUDA bfloat16 support (#133313)
Fixes #133263 Add CUDA bfloat16 support to cuda_frexp

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133313
Approved by: https://github.com/ezyang, https://github.com/eqy
2024-08-15 15:20:00 +00:00
soulitzer
4af4910b1a Reland "Construct NJT without graph breaks" (#133196)
This reverts commit 154d40ca488e6979ce9c2de89d8a35b53129ebea.

and adds changes from https://github.com/pytorch/pytorch/pull/133061

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133196
Approved by: https://github.com/ezyang
ghstack dependencies: #133145
2024-08-14 01:11:13 +00:00
drisspg
caba37e99b Update fused kernels and call _safe_softmax from SDPA (#131863)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131863
Approved by: https://github.com/jbschlosser, https://github.com/Chillee
2024-08-13 23:37:50 +00:00
Sun, Jiayi
7be77658e9 [Inductor] support masked vectorization for the tail_loop for INT8 datatype (#131155)
This PR supports masked vectorization for the tail_loop for torch.uint8 and torch.int8 datatype to improve performance.
BTW, I fixed the UT of `byte` by setting the range of the sample inputs to [0, 255] since the range of `torch.uint8` is [0, 255].

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131155
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
ghstack dependencies: #130724
2024-08-13 01:12:05 +00:00
soulitzer
05de2b2d0f Revert "Construct NJT without graph breaks" (#133145)
This reverts commit 911154271309667b55dfb963ec6384bd0048019b.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133145
Approved by: https://github.com/YuqingJ
2024-08-10 03:11:16 +00:00
drisspg
1434e0b121 Add a private _safe_softmax (#131060)
# Summary
Changes the stance of SDPA on what to do for fully masked out rows

## Current Behavior
Several PyTorch users have expressed frustration over this issue:
- https://github.com/pytorch/pytorch/issues/41508
- https://github.com/pytorch/pytorch/issues/103749
- https://github.com/pytorch/pytorch/issues/103963

These are significant issues with extensive discussion but no satisfactory resolution. The PyTorch team's consensus, as stated here:
https://github.com/pytorch/pytorch/issues/24816#issuecomment-524415617

Can be paraphrased as follows:

When passing in fully masked out rows, attention becomes ambiguous. We have two main options:

1. Uniformly attend to all values:
   ```python
   scores[masked_out_rows] = 1 / len(row)
   out[masked_out_rows] = 1 / len(row) * value
   ```

2. Decide that attention between no queries (masked) and no keys (masked) is meaningless:
   ```python
   output[fully_masked_rows] = NaN
   ```

We went with option 2. Partially because it was easier to implement, but also people argued that users can slice the output to remove the NaNs:
``` Python
>fill_value = -float("inf")
>row0 = torch.randn(4)
>row1 = torch.tensor([fill_value for _ in range(4)])
>matrix = torch.stack([row0, row1]).requires_grad_(True)
>out = torch.softmax(matrix, 1)
>out = out[0]
>print(out)
tensor([0.5377, 0.2729, 0.0692, 0.1201])
```
Cool, problem solved. But what happens when you call backward...
```Python
>out.backward(torch.ones_like(out))
>print(matrix.grad)
tensor([[3.0957e-08, 1.4157e-08, 7.7802e-10, 1.3713e-08],
        [       nan,        nan,        nan,        nan]])
```
Those pesky NaNs are back!

## Why do we see NaNs today?

The core of the problem revolves around using softmax function in sdpa:

```python
> row = torch.tensor([(-float("inf")) for _ in range(4)])
> torch.softmax(row, 0)
tensor([nan, nan, nan, nan])
```

## Quick Aside: Masking in Attention

Attention itself doesn't have a concept of masking. The `sdpa` function has an argument called `attn_mask`, which would be more accurately named `attn_bias`. This is because we don't actually "mask" entries when computing attention. Instead, due to implementation details([performance](https://github.com/pytorch/pytorch/issues/25110#issuecomment-524519087)), we add a value to the masked-out query/key pairs.

We use a large negative number (typically -inf) to decrease the attention weight, as softmax assigns more weight to larger values.

## Alternative Approaches

If we use a very large negative number instead of -inf:

```python
> row = torch.tensor([(-1e6) for _ in range(4)])
> torch.softmax(row, 0)
tensor([0.2500, 0.2500, 0.2500, 0.2500])
```
However if users always remembered to "slice" out their outputs i.e.:
```Python
>fill_value = -1e6
>...
>out.backward(torch.ones_like(out))
>print(matrix.grad)
tensor([[-0.0563, -0.0564,  0.1613, -0.0486],
        [ 0.0000,  0.0000,  0.0000,  0.0000]])
```
This would bring us back into a better state.

## A Third Option

We don't necessarily need to alter the behavior of softmax for -inf or very large negative numbers. The fundamental goal is to exclude certain query/key pairs from attention, regardless of the underlying implementation.

This PR implements the new semantic for masking w/ attention in fully masked-out rows:
```python
out[masked_out_rows] = 0
```

**Important Note**: This idea isn't entirely new. The [MaskedTensor](https://pytorch.org/tutorials/prototype/maskedtensor_overview#safe-softmax) prototype, a tensor subclass, was designed to handle such cases. However, it remains a prototype feature and hasn't gained widespread adoption.

## Details
This PR stack does 3 things:
1. Adds a PRIVATE _safe_softmax op
2. Updates semantic for flash_cpu fused kernel
3. Updates semantic for efficient_cuda fused kernel

_safe_softmax is not supposed to be used generically and is only meant to be used within the context of SDPA. Due to this fact, instead of decomposing softmax and checking for -inf rows, we instead "cheat" and use nan_to_num.

Why do I think this is okay? (Please find a counterpoint if available.)
There are multiple ways NaNs can emerge. For the fully masked out rows case nan_to_num works. But what if there were other NaNs, wouldn't this silently remove them?

The only case that this can happen is if the input itself had a NaN or an Inf
For example:
```Python
a = torch.ones([4], requires_grad=False, dtype=torch.float16)
a[1] = torch.finfo(torch.float16).max
print(a.softmax(-1))
```
Will return
`tensor([0., 1., 0., 0.], dtype=torch.float16)`

Where
```Python
a = torch.ones([4], requires_grad=False, dtype=torch.float16)
a[1] = float("inf")
a.softmax(-1)
```
returns:
`tensor([nan, nan, nan, nan], dtype=torch.float16)`

If we don't want to allow even the possibility of "inf" or "NaN" attention scores being converted to 0, then we can implement it with something like this:

```Python
max = torch.max(a, dim=-1, keepdim=True)
exp = torch.exp(a - max.values)
denom = torch.sum(exp, dim=-1, keepdim=True)
softmax = exp / denom
softmax = torch.where(max.values == float('-inf'), 0.0, softmax)
```
however we would be paying for this in math performance.

## Why Now
I think one point that has substantially changed where PyTorch should lie on this argument is the fact that we have fused implementations for SDPA now. And these fused implementations allow us to easily and performantly support this new semantic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131060
Approved by: https://github.com/jbschlosser
2024-08-08 23:09:38 +00:00
Alnis Murtovi
7bd0732cbd Fix flaky internal mixed_mm tests (#133015)
This PR fixes flaky internal tests:
- The AutoHeuristic test was sometimes failing because it required autotuning to happen for mixed_mm which didn't end up happening when there was a fx graph cache hit.
- The tests inside pattern_matcher failed because in some cases pad_mm decided to pad which made the mixed_mm pattern not match anymore (instead of cast -> mm, it was cast -> pad -> mm), and the tests also fail when is_big_gpu is false (which I haven't found an explanation for).

Differential Revision: [D60972176](https://our.internmc.facebook.com/intern/diff/D60972176)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133015
Approved by: https://github.com/Chillee, https://github.com/eellison
2024-08-08 20:32:12 +00:00
Joel Schlosser
75eb66afc0 Support 'non-contiguous with holes' NJTs for contiguous clone() (#132776)
It's possible to construct an NJT with "holes" by specifying both `offsets` and `lengths` metadata. When `nt.clone(memory_format=torch.contiguous_format)` is called on such an NJT, the result should be an NJT without holes.
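
A hedged illustration of such an NJT "with holes" and the clone described here (assuming the `nested_tensor_from_jagged` constructor; shapes are illustrative):

```python
import torch
from torch.nested import nested_tensor_from_jagged

values = torch.randn(10, 3)
offsets = torch.tensor([0, 4, 10])
lengths = torch.tensor([3, 5])   # shorter than the offset spans -> "holes"
nt = nested_tensor_from_jagged(values, offsets=offsets, lengths=lengths)

# After this PR, the contiguous clone should come back without holes.
nt_contig = nt.clone(memory_format=torch.contiguous_format)
```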

This PR fixes this in a simplistic way using `unbind()`, which isn't really supported in `torch.compile`. The longer term solution involves writing a proper kernel to support this.

NB: Another limitation is that the returned NJT does not have the same ragged structure as the input. While we could manually hack the nested int registry (or update the union find when that lands), this is the first instance where a NJT with holes and an NJT without holes could have the same ragged structure, and getting those to play nicely together requires some fairly involved updates. For now, this PR punts on these updates until we can clean this up.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132776
Approved by: https://github.com/ani300, https://github.com/soulitzer
ghstack dependencies: #131898, #131704, #131937
2024-08-08 17:08:11 +00:00
Catherine Lee
4fe6a5dc34 Move slow tests to be in repo (#132379)
Move the slow test json into the pytorch/pytorch repo and add a job that updates it weekly. The job uses the same environment as the commit hash update. It uses code similar to the hash update, but the hash update contains a lot of code specific to it, so I picked out only the relevant parts.

Remove references to the old file and set up testing to read from the new file instead

The old update cadence was every day, the new one is every week

The auto slow test infra plus the lack of pinning between pytorch and test-infra make it really hard to tell whether a test started failing because of a change or because the slow test json changed. While this can have benefits, like disable test issues being effective everywhere immediately, it can also be very confusing, especially since we don't have the same insight into slow tests as we do for disable issues.

Example PR made: https://github.com/pytorch/pytorch/pull/132383 (with all the changes from this PR because it was working on top of this)

We should just get rid of this at some point in favor of the slowTest decorator, but there are some tests that take 5+ minutes to run and I don't want to track them down right now
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132379
Approved by: https://github.com/huydhn
2024-08-07 18:42:56 +00:00
wz337
8b50d5398f [DeviceMesh] Create new group for 1D mesh when default backend is 'gloo' and 'cuda' is available (#132709)
More context in [#132471](https://github.com/pytorch/pytorch/issues/132471) and https://github.com/pytorch/pytorch/issues/132366.

TLDR:
When cuda is available and users move tensors to cuda, we cannot really reuse the default pg if default pg is gloo, as lots of collectives are not supported on gloo for cuda tensors. For example, `dtensor.full_tensor()` would result in a mysterious SIGTERM when all_gather a cuda tensor using gloo. Without the change in this PR, users would have to know the context and explicitly move the cuda tensor to cpu before invoking most collectives, which I think is not so ideal UX.

Therefore, given most collectives are not supported on gloo for cuda tensors, we should init a new pg if the default pg is gloo when torch.cuda.is_available() and device_type is cuda.
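
A hedged sketch of the scenario (requires a multi-process launch such as `torchrun`; shown only to illustrate the setup):

```python
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

dist.init_process_group(backend="gloo")  # default pg is gloo
# With cuda available, a 1D "cuda" mesh should get a cuda-capable group
# instead of reusing the gloo default pg (per this PR).
mesh = init_device_mesh("cuda", (dist.get_world_size(),))
```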

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132709
Approved by: https://github.com/awgu, https://github.com/wanchaol
2024-08-07 16:13:11 +00:00
Apurva Jain
8bc5ef563e Grouped Query Attention (#132689)
### Approach: Using the current function declaration

**Constraint:** Q_Heads % KV_Heads == 0

**Major change:**
- Added a new argument enable_gqa: bool to sdpa function call
- It adds a meaning to the third-to-last dimension.

Sample use cases this would enable:
LLama3

```
# LLama3 8b call to SDPA
query = torch.rand(batch, 32, seq_len_q, D)
key = torch.rand(batch, 8, seq_len_kv, D)
value = torch.rand(batch, 8, seq_len_kv, D)

output = scaled_dot_product_attention(query, key, value, is_causal=True, enable_gqa=True)

# Output Shape
(batch, 32, seq_len_q, D)
```

### Design Choice:

- Check if Query.size(-3) == Key.size(-3) == Value.size(-3) or, Query.size(-3) % Key.size(-3) == 0
- The function adjusts the key and value tensors to match the query tensor's head dimension by using repeat_interleave if their number of heads are not equal, facilitating correct and efficient computation in attention mechanisms.
- By default the enable_gqa flag is set to False, which ensures that regular sdpa functionality remains unchanged.

### Benchmarks:

- **sdpa.py: #130634**
For different batch sizes, enable_gqa=True shows a substantial improvement in the runtime of sdpa

 | batch_size | q_num_heads | kv_num_heads | q_seq_len | kv_seq_len | embed_dim | forward_time when enable_gqa=True   |   forward_time when enable_gqa=False    |
| ------------ | ------------- | -------------- | ----------- | ------------ | ----------- | ----------- | ---------------- |
|     1      |     32      |      8       |   2048    |    2048    |   2048    |   100.71  |  119.70  |
|     8      |     32      |      8       |   2048    |    2048    |   2048    |   539.78  |  628.83  |
|     16     |     32      |      8       |   2048    |    2048    |   2048    |   1056.81  |  1225.48  |
|     32      |     32      |      8       |   2048    |    2048    |   2048    |   2099.54  |  2440.45  |

![Screenshot 2024-07-25 at 9 07 40 PM](https://github.com/user-attachments/assets/a3e5f716-c39f-4096-9e6c-82a735e57b7b)

- **TorchTitan: https://github.com/pytorch/torchtitan/pull/458**

Differential Revision: D60772086

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132689
Approved by: https://github.com/drisspg
2024-08-07 05:35:36 +00:00
Edward Z. Yang
ed224554eb [BE] Don't unnecessarily suggest -k for rerunning tests locally (#132807)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132807
Approved by: https://github.com/malfet
2024-08-07 02:15:18 +00:00
PyTorch MergeBot
cbee9c1fd2 Revert "Deprecate torch._utils.is_compiling() and torch._dynamo.external_utils.is_compiling() (#127690)"
This reverts commit 0e7e61f7ce.

Reverted https://github.com/pytorch/pytorch/pull/127690 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/127690#issuecomment-2272370386))
2024-08-07 00:05:20 +00:00
soulitzer
f50621989b Construct NJT without graph breaks (#130292)
Combines contributions from https://github.com/pytorch/pytorch/pull/130505

Some context can be found in this large comment block:

a5b64d39fd/test/dynamo/test_subclasses.py (L1667-L1681)

Changes in this PR
- For each tensor fakified, check the nested int registry in eager, and eagerly symbolicize if that tensor has already been associated with nested int in eager.
- Adds a separate counter stored on FakeTensorMode as a fake analog to _tensor_id_counter (which keeps track of unique tensors). This counter is initialized to the global eager tensor id counter upon creation of the FakeTensorMode, and needs to be reset when the same FakeTensorMode is reused to trace again (in this PR, we piggyback on the epoch incrementing logic).
- (refactor) Today, we store FakeTensor -> symbolic nested int in the global registry. With this PR, symbolic nested int is stored directly on the FakeTensor. (Eager still caches nested int in the registry, though we should avoid this at some point.)

Basically unchanged, but worth noting:
- `__tensor_unflatten__` is still responsible for determining whether we should cache for now. The logic is somewhat simplified.
- to_copy is still using the trick of updating two different tensors in the registry to point to the same nested int. This is kind of broken, but we try to leave it as is, and plan a better fix with the UnionFind stack.

Differential Revision: [D60406772](https://our.internmc.facebook.com/intern/diff/D60406772)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130292
Approved by: https://github.com/bdhirsh
ghstack dependencies: #131916, #131803
2024-08-06 17:03:39 +00:00
Wei Feng
2a8e94347f [TP] verify numeric parity on Transfromers for multiple iterations (#132543)
Before setting up the float8 numeric parity test, I have to set up a regular TP numeric parity test, preferably testing 10 iterations.

This PR sets a baseline for TP numerics; I can verify fp8 on top of it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132543
Approved by: https://github.com/tianyu-l
ghstack dependencies: #132350
2024-08-04 06:43:27 +00:00
Xuehai Pan
0e7e61f7ce Deprecate torch._utils.is_compiling() and torch._dynamo.external_utils.is_compiling() (#127690)
This PR is split from PR #126898.

- #126898

------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127690
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-08-03 09:43:38 +00:00
Adnan Akhundov
8ad9f89ccc [inductor] Reland: Add flag to ignore unsupported @triton.autotune args in user-written kernel compilation (#132562)
Summary:
This is a reland attempt of [#131431](https://github.com/pytorch/pytorch/pull/131431), as, in its original form, the PR has caused issues internally.

We currently don't support some of the `triton.autotune` arguments when compiling user-written Triton kernels with PT2. In this PR, we're adding a flag to circumvent it. This is to unblock internal compilation in some cases. The flag is supplied with the docs mentioning why it is not a good idea to set it.

Test Plan:
```
python test/inductor/test_triton_kernels.py -k test_triton_kernel_autotune_with_unsupported_args
...
----------------------------------------------------------------------
Ran 3 tests in 3.636s

OK
```

Differential Revision: D60701839

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132562
Approved by: https://github.com/chenyang78
2024-08-03 06:31:28 +00:00
PyTorch MergeBot
bcb4f7c172 Revert "Grouped Query Attention (#128898)"
This reverts commit 6b28af1b79.

Reverted https://github.com/pytorch/pytorch/pull/128898 on behalf of https://github.com/ZainRizvi due to Sorry, this broke a bunch of tests internally. See D60638265 ([comment](https://github.com/pytorch/pytorch/pull/128898#issuecomment-2265961038))
2024-08-02 18:58:46 +00:00
Pearu Peterson
a4ea776881 Add pinned memory support to sparse COO/CSR/CSC/BSR/BSC tensors (#129645)
As in the title:

To register indices/values of a sparse XYZ tensor with CUDA, the following methods are supported
- `sparse_xyz_tensor(indices, values, pin_memory=True)`
- `sparse_xyz_tensor(indices, values).pin_memory()`
- `sparse_xyz_tensor(indices.pin_memory(), values.pin_memory())`
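
A hedged COO example of the registration paths listed above (requires CUDA to be available for pinning):

```python
import torch

i = torch.tensor([[0, 1], [1, 0]])
v = torch.tensor([1.0, 2.0])

a = torch.sparse_coo_tensor(i, v, (2, 2), pin_memory=True)
b = torch.sparse_coo_tensor(i, v, (2, 2)).pin_memory()
c = torch.sparse_coo_tensor(i.pin_memory(), v.pin_memory(), (2, 2))
print(a.is_pinned(), b.is_pinned())
```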

Fixes https://github.com/pytorch/pytorch/issues/115330

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129645
Approved by: https://github.com/amjames, https://github.com/cpuhrsch, https://github.com/eqy
2024-08-02 08:55:55 +00:00
majing
3a355c1891 Correct sample creation of torch.histogram in UT op_db to align PyTorch defined operator semantics (#131630)
Fixes #130916
Per the semantics defined in [torch.histogram](https://pytorch.org/docs/stable/generated/torch.histogram.html#torch-histogram), we need an increasing sequence as the bins tensor; random input doesn't make sense for torch.histogram.
The case compares the CPU backend against another backend. When the input is random, kernel implementations in other backends have to align exactly with the CPU kernel, or the case fails.
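
For example, a valid sample uses monotonically increasing bin edges rather than random values:

```python
import torch

x = torch.rand(100)
bins = torch.linspace(0, 1, steps=11)  # increasing bin edges, as required
hist, edges = torch.histogram(x, bins=bins)
```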

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131630
Approved by: https://github.com/EikanWang, https://github.com/albanD
2024-08-02 01:51:09 +00:00
henrylhtsang
0c3ac428a2 [BE][typing] fix types in common pruning (#132309)
BE task. Add typings and remove mypy errors in torch/testing/_internal/common_pruning.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132309
Approved by: https://github.com/ColinPeppler
2024-08-01 23:34:33 +00:00
Xuehai Pan
30293319a8 [BE][Easy][19/19] enforce style for empty lines in import segments in torch/[o-z]*/ (#129771)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129771
Approved by: https://github.com/justinchuby, https://github.com/janeyx99
2024-08-01 17:07:14 +00:00
Oguz Ulgen
72d2dba992 Add None return type to init (#132335)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132335
Approved by: https://github.com/albanD
2024-08-01 15:26:45 +00:00
jainapurva
6b28af1b79 Grouped Query Attention (#128898)
### Approach: Using the current function declaration

**Constraint:** Q_Heads % KV_Heads == 0

**Major change:**
- Added a new argument enable_gqa: bool to sdpa function call
- It adds a meaning to the third-to-last dimension.

Sample use cases this would enable:
LLama3

```
# LLama3 8b call to SDPA
query = torch.rand(batch, 32, seq_len_q, D)
key = torch.rand(batch, 8, seq_len_kv, D)
value = torch.rand(batch, 8, seq_len_kv, D)

output = scaled_dot_product_attention(query, key, value, is_causal=True, enable_gqa=True)

# Output Shape
(batch, 32, seq_len_q, D)
```

### Design Choice:

- Check if Query.size(-3) == Key.size(-3) == Value.size(-3) or, Query.size(-3) % Key.size(-3) == 0
- The function adjusts the key and value tensors to match the query tensor's head dimension by using repeat_interleave if their number of heads are not equal, facilitating correct and efficient computation in attention mechanisms.
- By default the enable_gqa flag is set to False, which ensures that regular sdpa functionality remains unchanged.

### Benchmarks:

- **sdpa.py: #130634**
For different batch sizes, enable_gqa=True shows a substantial improvement in the runtime of sdpa

 | batch_size | q_num_heads | kv_num_heads | q_seq_len | kv_seq_len | embed_dim | forward_time when enable_gqa=True   |   forward_time when enable_gqa=False    |
| ------------ | ------------- | -------------- | ----------- | ------------ | ----------- | ----------- | ---------------- |
|     1      |     32      |      8       |   2048    |    2048    |   2048    |   100.71  |  119.70  |
|     8      |     32      |      8       |   2048    |    2048    |   2048    |   539.78  |  628.83  |
|     16     |     32      |      8       |   2048    |    2048    |   2048    |   1056.81  |  1225.48  |
|     32      |     32      |      8       |   2048    |    2048    |   2048    |   2099.54  |  2440.45  |

![Screenshot 2024-07-25 at 9 07 40 PM](https://github.com/user-attachments/assets/a3e5f716-c39f-4096-9e6c-82a735e57b7b)

- **TorchTitan: https://github.com/pytorch/torchtitan/pull/458**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128898
Approved by: https://github.com/drisspg
2024-07-31 22:58:51 +00:00
Yu, Guangye
45e6a364ee Avoid autocast deprecation warning (#132207)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132207
Approved by: https://github.com/awgu
2024-07-31 13:13:39 +00:00
ekamiti
9e473fd868 Make adding Buffers more like adding Parameters (#125971)
Add similar semantics for creating a buffer object similar to creating a parameter. This is done by introducing a new Buffer class that can be used for type disambiguation. The underlying functionality of registering a buffer remains the same as the register_buffer method has not been changed. The persistent parameter in the Buffer type is to indicate whether a buffer object should be persistent or not. Other non-test changes have to do with getting the new Buffer type recognized by inductor and dynamo. Remaining changes are test changes to make sure that the Buffer type can be used as a drop in replacement for register_buffer as it just leads to register_buffer being called. The addition of this new functionality still allows for normal tensors to be used as buffers so these changes are intended to be backwards compatible.
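
A hedged usage sketch (assuming the new class is exposed as `torch.nn.Buffer`, mirroring `nn.Parameter`):

```python
import torch
import torch.nn as nn

class MyModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.running_mean = nn.Buffer(torch.zeros(8))               # registered buffer
        self.scratch = nn.Buffer(torch.zeros(8), persistent=False)  # kept out of state_dict

m = MyModule()
print(sorted(name for name, _ in m.named_buffers()))  # ['running_mean', 'scratch']
print(sorted(m.state_dict().keys()))                   # ['running_mean']
```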

Fixes #35735

Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125971
Approved by: https://github.com/albanD, https://github.com/anijain2305, https://github.com/mlazos
2024-07-31 10:32:40 +00:00
Joel Schlosser
524aac413c Initial OpInfo-based testing for NJTs (#131704)
This PR utilizes the info from the existing OpInfo database `op_db` to contribute to general NJT testing.
* New tests in `TestNestedTensorOpInfo`
    * `test_forward()` - compares forward output to an unbind-based reference
    * `test_backward()` - compares forward output and grads to an unbind-based reference
    * `test_forward_compile()` - compares forward compile output (`backend="aot_eager_decomp_partition"`) to eager
    * `test_backward_compile()` - compares forward compile output (`backend="aot_eager_decomp_partition"`) and grads to eager
* To avoid adding a bunch of NJT-specific stuff to the `OpInfo` structure, this PR translates `op_db` -> a NJT-specific `njt_op_db`.
    * `UnaryUfuncInfo`s utilize a new `sample_inputs_unary_njt_pointwise()` which iterates through a comprehensive list of NJTs: contiguous / non-contiguous, dims 2, 3, and 4, transposed / not, etc.
    * `BinaryUfuncInfo`s utilize a new `sample_inputs_binary_njt_pointwise()` which iterates through a comprehensive list of NJTs: contiguous / non-contiguous, dims 2, 3, and 4, transposed / not, etc.
    * `ReductionOpInfo`s utilize a new `sample_inputs_njt_reduction()` which covers full reductions, reductions over the jagged dim, and reductions over the non-jagged dim
* Several xfails were added to get things passing

TODO (future PRs):
* Pass non-contiguous / non-contiguous with holes NJTs (maybe we should have separate tests for these? most ops don't support NJTs with holes today)
* Mixed (NT, T), (T, NT) inputs for binary ops
* Handle other types of OpInfos (beyond unary pointwise, binary pointwise, and reduction) by manually by writing sample_inputs_funcs
* Address all xfails via fixes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131704
Approved by: https://github.com/soulitzer
ghstack dependencies: #131898
2024-07-30 23:02:24 +00:00
Joel Schlosser
d53b11bb6e Strict shape checking for NJTs with TestCase.assertEqual() (#131898)
**Background**: `TestCase.assertEqual()` is commonly used during test case validation. Historically, to support NSTs, the logic was written to compare two nested tensors by unbinding them and comparing their components. This logic applied to NJTs as well, which in practice meant that two NJTs with different nested ints in their shapes could compare equal if their components were equal.

This PR changes the above logic so that NJTs are no longer unbound during comparison, allowing them to receive full shape validation. This makes `TestCase.assertEqual()` stricter for NJTs, requiring them to have the same nested ints in their shapes to compare equal.
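
To make the distinction concrete (illustrative only): two NJTs built independently from equal components carry different nested ints in their ragged dimension, so the stricter check treats their shapes as different.

```python
import torch

a = torch.nested.nested_tensor([torch.ones(2, 3), torch.ones(4, 3)], layout=torch.jagged)
b = torch.nested.nested_tensor([torch.ones(2, 3), torch.ones(4, 3)], layout=torch.jagged)
print(a.shape, b.shape)  # e.g. torch.Size([2, j1, 3]) vs torch.Size([2, j2, 3])
```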

Note that some tests rely on the old, looser behavior. To address this, the PR introduces a base `NestedTensorTestCase` that defines a helper function `assertEqualIgnoringNestedInts()` so that these tests can explicitly opt in to the looser comparison behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131898
Approved by: https://github.com/soulitzer
2024-07-30 20:05:48 +00:00
PyTorch MergeBot
499ead96ff Revert "Grouped Query Attention (#128898)"
This reverts commit d039b14207.

Reverted https://github.com/pytorch/pytorch/pull/128898 on behalf of https://github.com/albanD due to Broken test on main ([comment](https://github.com/pytorch/pytorch/pull/128898#issuecomment-2258314481))
2024-07-30 13:11:24 +00:00
jainapurva
d039b14207 Grouped Query Attention (#128898)
### Approach: Using the current function declaration

**Constraint:** Q_Heads % KV_Heads == 0

**Major change:**
- Added a new argument enable_gqa: bool to sdpa function call
- It adds a meaning to the third-to-last dimension.

Sample use cases this would enable:
LLama3

```
# LLama3 8b call to SDPA
query = torch.rand(batch, 32, seq_len_q, D)
key = torch.rand(batch, 8, seq_len_kv, D)
value = torch.rand(batch, 8, seq_len_kv, D)

output = scaled_dot_product_attention(query, key, value, is_causal=True, enable_gqa=True)

# Output Shape
(batch, 32, seq_len_q, D)
```

### Design Choice:

- Check if Query.size(-3) == Key.size(-3) == Value.size(-3) or, Query.size(-3) % Key.size(-3) == 0
- The function adjusts the key and value tensors to match the query tensor's head dimension by using repeat_interleave if their number of heads are not equal, facilitating correct and efficient computation in attention mechanisms.
- By default the enable_gqa flag is set to False, which ensures that regular sdpa functionality remains unchanged.

### Benchmarks:

- **sdpa.py: #130634**
For different batch sizes, enable_gqa=True shows a substantial improvement in the runtime of sdpa

 | batch_size | q_num_heads | kv_num_heads | q_seq_len | kv_seq_len | embed_dim | forward_time when enable_gqa=True   |   forward_time when enable_gqa=False    |
| ------------ | ------------- | -------------- | ----------- | ------------ | ----------- | ----------- | ---------------- |
|     1      |     32      |      8       |   2048    |    2048    |   2048    |   100.71  |  119.70  |
|     8      |     32      |      8       |   2048    |    2048    |   2048    |   539.78  |  628.83  |
|     16     |     32      |      8       |   2048    |    2048    |   2048    |   1056.81  |  1225.48  |
|     32      |     32      |      8       |   2048    |    2048    |   2048    |   2099.54  |  2440.45  |

![Screenshot 2024-07-25 at 9 07 40 PM](https://github.com/user-attachments/assets/a3e5f716-c39f-4096-9e6c-82a735e57b7b)

- **TorchTitan: https://github.com/pytorch/torchtitan/pull/458**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128898
Approved by: https://github.com/drisspg
2024-07-29 21:49:06 +00:00
PyTorch MergeBot
d5e9fbb012 Revert "BE: reset dynamo before each test in test_module.py (#131372)"
This reverts commit 527901f054.

Reverted https://github.com/pytorch/pytorch/pull/131372 on behalf of https://github.com/kit1980 due to Broke test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_CTCLoss_cuda_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10149118852/job/28065175173) [HUD commit link](ca8153ae67) ([comment](https://github.com/pytorch/pytorch/pull/131372#issuecomment-2257019116))
2024-07-29 21:15:25 +00:00
PyTorch MergeBot
a4723b566f Revert "BE: reset dynamo before each test in test_ops_gradients.py (#131397)"
This reverts commit ca8153ae67.

Reverted https://github.com/pytorch/pytorch/pull/131397 on behalf of https://github.com/kit1980 due to Broke test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_CTCLoss_cuda_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10149118852/job/28065175173) [HUD commit link](ca8153ae67) ([comment](https://github.com/pytorch/pytorch/pull/131372#issuecomment-2257019116))
2024-07-29 21:15:25 +00:00
Tom Ritchford
bdf5a6dca9 Add decomposition for unsqueeze_copy (#130942)
* Extracted from #128416
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130942
Approved by: https://github.com/peterbell10
2024-07-29 21:13:37 +00:00
Xuehai Pan
5cc34f61d1 [CI] add new test config label ci-test-showlocals to control test log verbosity (#131981)
Add a new label `ci-test-showlocals` and add it to the test config filter.
If the PR is labeled with `ci-test-showlocals`, or "ci-test-showlocals" is
present in the PR comment, the test config filter will set the environment
variable `TEST_SHOWLOCALS`. Then `pytest` will show local variables on
failure for better debugging.
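
A hedged sketch of the plumbing on the test-runner side (the exact wiring in the test scripts may differ):

```python
import os

extra_args = []
if os.environ.get("TEST_SHOWLOCALS"):
    extra_args += ["--showlocals", "--tb=long"]  # pytest flags for verbose locals
```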

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131981
Approved by: https://github.com/malfet
ghstack dependencies: #131151
2024-07-29 18:53:14 +00:00
Xuehai Pan
4694ee1ad2 [BE][tests] show local variables on failure in tests (#131151)
------

As per the title, add argument `--locals` for `unittest` and `--showlocals --tb=long` for `pytest` in CI.

Some failures cannot be reproduced on the local machine but exist on cloud CI. This change allows us to investigate the test failure more easily.

Example output: https://github.com/pytorch/pytorch/actions/runs/9961546996/job/27523888353?pr=130710#step:20:3361

```text
/opt/conda/envs/py_3.8/lib/python3.8/site-packages/sympy/core/function.py:307:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

cls = FloorDiv, base = -1.00000000000000, divisor = -1.00000000000000

    @classmethod
    def eval(cls, base, divisor):
        # python test/test_dynamic_shapes.py -k TestDimConstraints.test_dim_constraints_solve_full
        # Assert triggered by inequality solver
        # assert base.is_integer, base
        # assert divisor.is_integer, divisor

        # We don't provide the same error message as in Python because SymPy
        # makes it difficult to check the types.
        if divisor.is_zero:
            raise ZeroDivisionError("division by zero")
        if base in (int_oo, -int_oo, sympy.oo, -sympy.oo) and divisor in (
            int_oo,
            -int_oo,
            sympy.oo,
            -sympy.oo,
        ):
            return sympy.nan
        if base is sympy.nan or divisor is sympy.nan:
            return sympy.nan

        if base.is_zero:
            return sympy.S.Zero
        if base.is_integer and divisor == 1:
            return base
        if base.is_integer and divisor == -1:
            return sympy.Mul(base, -1)
        if (
            isinstance(base, sympy.Number)
            and isinstance(divisor, sympy.Number)
            and (
                base in (int_oo, -int_oo, sympy.oo, -sympy.oo)
                or divisor in (int_oo, -int_oo, sympy.oo, -sympy.oo)
            )
        ):
            r = float(base) / float(divisor)
            if r == math.inf:
                return int_oo
            elif r == -math.inf:
                return -int_oo
            elif math.isnan(r):
                return sympy.nan
            else:
                return sympy.Integer(math.floor(r))
        if isinstance(base, sympy.Integer) and isinstance(divisor, sympy.Integer):
            return sympy.Integer(int(base) // int(divisor))
        if isinstance(base, FloorDiv):
            return FloorDiv(base.args[0], base.args[1] * divisor)

        # Expands (x + y) // b into x // b + y // b.
        # This only works if floor is an identity, i.e. x / b is an integer.
        for term in sympy.Add.make_args(base):
            quotient = term / divisor
            if quotient.is_integer and isinstance(divisor, sympy.Integer):
                # NB: this is correct even if the divisor is not an integer, but it
                # creates rational expressions that cause problems with dynamic
                # shapes.
                return FloorDiv(base - term, divisor) + quotient

        try:
            gcd = sympy.gcd(base, divisor)
            if gcd != 1:
>               return FloorDiv(
                    sympy.simplify(base / gcd), sympy.simplify(divisor / gcd)
                )

base       = -1.00000000000000
cls        = FloorDiv
divisor    = -1.00000000000000
gcd        = 1.00000000000000
quotient   = 1.00000000000000
term       = -1.00000000000000

/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/utils/_sympy/functions.py:159:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

args = (FloorDiv, -1.00000000000000, -1.00000000000000), kwargs = {}

    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
>           retval = cfunc(*args, **kwargs)
E           RecursionError: maximum recursion depth exceeded in comparison
E
E           To execute this test, run the following from the base repo dir:
E               python test/test_sympy_utils.py -k TestValueRanges.test_binary_ref_fn_floordiv_dtype_float
E
E           This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

args       = (FloorDiv, -1.00000000000000, -1.00000000000000)
cfunc      = <functools._lru_cache_wrapper object at 0x7fc5303173a0>
func       = <function Function.__new__ at 0x7fc530317280>
kwargs     = {}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131151
Approved by: https://github.com/ezyang
2024-07-29 18:53:14 +00:00
Shunting Zhang
ca8153ae67 BE: reset dynamo before each test in test_ops_gradients.py (#131397)
https://github.com/pytorch/pytorch/pull/126586 tried to reset dynamo before each unit test. That PR got reverted a couple of times because we saw post-land test failures that we didn't see before merge. This PR only resets dynamo before each test in `test_ops_gradients.py` to make it easier to land.

Eventually, after we reset dynamo in each individual test file, we can move the change to the base class (TestCase) and remove the change from the individual test files.
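
A minimal sketch of the per-test reset pattern described above, assuming a plain `unittest.TestCase` (the real change lives in the PyTorch test harness for this file, and `MyOpTest` is hypothetical):

```python
import unittest

import torch


class MyOpTest(unittest.TestCase):
    def setUp(self):
        super().setUp()
        # Clear dynamo/torch.compile caches so compiled state from a previous
        # test cannot leak into this one.
        torch._dynamo.reset()

    def test_double(self):
        fn = torch.compile(lambda x: x * 2)
        self.assertTrue(torch.equal(fn(torch.ones(3)), torch.full((3,), 2.0)))
```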

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131397
Approved by: https://github.com/zou3519
ghstack dependencies: #131551, #131388, #131372
2024-07-29 17:39:23 +00:00
Shunting Zhang
527901f054 BE: reset dynamo before each test in test_module.py (#131372)
https://github.com/pytorch/pytorch/pull/126586 tried to reset dynamo before each unit test. That PR got reverted a couple of times because we saw post-land test failures that we didn't see before merge. This PR only resets dynamo before each test in `test_module.py` to make it easier to land.

Eventually, after we reset dynamo in each individual test file, we can move the change to the base class (TestCase) and remove the change from the individual test files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131372
Approved by: https://github.com/zou3519
ghstack dependencies: #131551, #131388
2024-07-29 17:39:23 +00:00
PyTorch MergeBot
6cf493158e Revert "Enable FlashAttention on Windows (#131906)"
This reverts commit b90bc66766.

Reverted https://github.com/pytorch/pytorch/pull/131906 on behalf of https://github.com/atalman due to Windows nightly failures ([comment](https://github.com/pytorch/pytorch/pull/131906#issuecomment-2256421183))
2024-07-29 16:49:23 +00:00
Tom Ritchford
962f248437 Add decomposition for expand_copy (#130940)
* Extracted from #129476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130940
Approved by: https://github.com/peterbell10
2024-07-29 16:23:56 +00:00
PyTorch MergeBot
c35f21e5fc Revert "[BE][tests] show local variables on failure in tests (#131151)"
This reverts commit 14158d892a.

Reverted https://github.com/pytorch/pytorch/pull/131151 on behalf of https://github.com/atalman due to Broke CI: test_testing.py::TestTestingCUDA::test_cuda_assert_should_stop_common_device_type_test_suite_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/10131415299/job/28014665693) [HUD commit link](14158d892a) ([comment](https://github.com/pytorch/pytorch/pull/131151#issuecomment-2255921015))
2024-07-29 13:19:38 +00:00
PyTorch MergeBot
06fe99a097 Revert "[CI] add new test config label ci-test-showlocals to control test log verbosity (#131981)"
This reverts commit dfa18bf3f3.

Reverted https://github.com/pytorch/pytorch/pull/131981 on behalf of https://github.com/atalman due to Sorry, need to revert bottom PR, which broke CI: https://github.com/pytorch/pytorch/pull/131151 ([comment](https://github.com/pytorch/pytorch/pull/131981#issuecomment-2255892628))
2024-07-29 13:09:41 +00:00
Xuehai Pan
dfa18bf3f3 [CI] add new test config label ci-test-showlocals to control test log verbosity (#131981)
Add a new label `ci-test-showlocals` and add it to the test config filter.
If the PR is labeled with `ci-test-showlocals`, or "ci-test-showlocals" is
present in a PR comment, the test config filter will set an environment
variable `TEST_SHOWLOCALS`. Then `pytest` will show local variables on
failures for better debugging.
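
A rough sketch (not the actual CI wiring) of what the `TEST_SHOWLOCALS` toggle maps to; `test/test_foo.py` is a hypothetical target and the flags shown are the standard `pytest`/`unittest` options named above:

```python
import os
import subprocess
import sys

extra_args = []
if os.environ.get("TEST_SHOWLOCALS"):
    # pytest: dump local variables in long tracebacks
    extra_args += ["--showlocals", "--tb=long"]
subprocess.run(
    [sys.executable, "-m", "pytest", *extra_args, "test/test_foo.py"],
    check=False,
)
# unittest offers the same via: python -m unittest --locals ...
```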

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131981
Approved by: https://github.com/malfet
2024-07-29 07:40:42 +00:00
Shunting Zhang
f151f25c0b BE: reset dynamo before each test in test_torch.py (#131388)
https://github.com/pytorch/pytorch/pull/126586 tried to reset dynamo before each unit test. That PR got reverted a couple of times because we saw post-land test failures that we didn't see before merge. This PR only resets dynamo before each test in `test_torch.py` to make it easier to land.

Eventually, after we reset dynamo in each individual test file, we can move the change to the base class (TestCase) and remove the change from the individual test files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131388
Approved by: https://github.com/zou3519
ghstack dependencies: #131551
2024-07-29 04:57:34 +00:00
Oguz Ulgen
7c29665f77 Remove mypy ignore from torch/testing/_internal/distributed/ (#131870)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131870
Approved by: https://github.com/aakhundov
ghstack dependencies: #131786
2024-07-28 17:13:53 +00:00
PyTorch MergeBot
d3c17fea90 Revert "[BE] typing for decorators - _library/custom_ops (#131578)"
This reverts commit c65b197b85.

Reverted https://github.com/pytorch/pytorch/pull/131578 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))
2024-07-28 03:29:32 +00:00
Xuehai Pan
14158d892a [BE][tests] show local variables on failure in tests (#131151)
------

As per the title, add argument `--locals` for `unittest` and `--showlocals --tb=long` for `pytest` in CI.

Some failures cannot be reproduced on the local machine but exist on cloud CI. This change allows us to investigate the test failure more easily.

Example output: https://github.com/pytorch/pytorch/actions/runs/9961546996/job/27523888353?pr=130710#step:20:3361

```text
/opt/conda/envs/py_3.8/lib/python3.8/site-packages/sympy/core/function.py:307:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

cls = FloorDiv, base = -1.00000000000000, divisor = -1.00000000000000

    @classmethod
    def eval(cls, base, divisor):
        # python test/test_dynamic_shapes.py -k TestDimConstraints.test_dim_constraints_solve_full
        # Assert triggered by inequality solver
        # assert base.is_integer, base
        # assert divisor.is_integer, divisor

        # We don't provide the same error message as in Python because SymPy
        # makes it difficult to check the types.
        if divisor.is_zero:
            raise ZeroDivisionError("division by zero")
        if base in (int_oo, -int_oo, sympy.oo, -sympy.oo) and divisor in (
            int_oo,
            -int_oo,
            sympy.oo,
            -sympy.oo,
        ):
            return sympy.nan
        if base is sympy.nan or divisor is sympy.nan:
            return sympy.nan

        if base.is_zero:
            return sympy.S.Zero
        if base.is_integer and divisor == 1:
            return base
        if base.is_integer and divisor == -1:
            return sympy.Mul(base, -1)
        if (
            isinstance(base, sympy.Number)
            and isinstance(divisor, sympy.Number)
            and (
                base in (int_oo, -int_oo, sympy.oo, -sympy.oo)
                or divisor in (int_oo, -int_oo, sympy.oo, -sympy.oo)
            )
        ):
            r = float(base) / float(divisor)
            if r == math.inf:
                return int_oo
            elif r == -math.inf:
                return -int_oo
            elif math.isnan(r):
                return sympy.nan
            else:
                return sympy.Integer(math.floor(r))
        if isinstance(base, sympy.Integer) and isinstance(divisor, sympy.Integer):
            return sympy.Integer(int(base) // int(divisor))
        if isinstance(base, FloorDiv):
            return FloorDiv(base.args[0], base.args[1] * divisor)

        # Expands (x + y) // b into x // b + y // b.
        # This only works if floor is an identity, i.e. x / b is an integer.
        for term in sympy.Add.make_args(base):
            quotient = term / divisor
            if quotient.is_integer and isinstance(divisor, sympy.Integer):
                # NB: this is correct even if the divisor is not an integer, but it
                # creates rational expressions that cause problems with dynamic
                # shapes.
                return FloorDiv(base - term, divisor) + quotient

        try:
            gcd = sympy.gcd(base, divisor)
            if gcd != 1:
>               return FloorDiv(
                    sympy.simplify(base / gcd), sympy.simplify(divisor / gcd)
                )

base       = -1.00000000000000
cls        = FloorDiv
divisor    = -1.00000000000000
gcd        = 1.00000000000000
quotient   = 1.00000000000000
term       = -1.00000000000000

/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/utils/_sympy/functions.py:159:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

args = (FloorDiv, -1.00000000000000, -1.00000000000000), kwargs = {}

    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
>           retval = cfunc(*args, **kwargs)
E           RecursionError: maximum recursion depth exceeded in comparison
E
E           To execute this test, run the following from the base repo dir:
E               python test/test_sympy_utils.py -k TestValueRanges.test_binary_ref_fn_floordiv_dtype_float
E
E           This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

args       = (FloorDiv, -1.00000000000000, -1.00000000000000)
cfunc      = <functools._lru_cache_wrapper object at 0x7fc5303173a0>
func       = <function Function.__new__ at 0x7fc530317280>
kwargs     = {}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131151
Approved by: https://github.com/ezyang
2024-07-27 19:39:40 +00:00
Will Feng
9e06572704 [Traceable FSDP2][Inductor] Create grouped nodes for FSDP2 all-gather code block and reduce-scatter code block (after Buffer/Operation split) (#131510)
This PR creates these `GroupedSchedulerNode`s:
- One for each all-gather code block (cast + copy-in + all-gather)
- One for each all-gather-wait code block (all-gather-wait + copy-out)
- One for each reduce-scatter code block (copy-in + reduce-scatter)
- One for each reduce-scatter-wait code block (reduce-scatter-wait)

This serves two goals:
- Prevent outside ops from being fused into these op groups, in order to have more predictable memory usage.
- Make it easier to specify the dependency e.g. from `i+1` all-gather group node to the `i` all-gather-wait group node, to enforce FSDP2 comm ordering (i.e. "serialization of comms").

The actual "reorder-for-FSDP-compute-comm-overlap" PR will come next.

Test commands:
- `pytest -rA  test/distributed/test_compute_comm_reordering.py::TestComputeCommReorderingMultiProc`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131510
Approved by: https://github.com/yifuwang
2024-07-27 08:39:58 +00:00
Xu Han
a90b8b967a [inductor] enable windows inductor UTs (#131767)
Changes:
1. Add `skipIfWindows` function.
2. Fix `fresh_inductor_cache` raising an error on Windows due to not being able to delete loaded modules.
3. Disable some UTs that do not pass on Windows.
4. Enable test_torchinductor in Windows CI.

These tests pass on my dev machine:
<img width="864" alt="image" src="https://github.com/user-attachments/assets/91d5a62f-7383-44b3-b614-99940f196fdb">

TODO: review and fix the skipped cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131767
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-27 02:46:03 +00:00
Luca Wehrstedt
b90bc66766 Enable FlashAttention on Windows (#131906)
Let's just give this a try.

Reland of https://github.com/pytorch/pytorch/pull/131875.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131906
Approved by: https://github.com/drisspg
2024-07-26 21:41:56 +00:00
Shunting Zhang
c8626a4e1f [BE] add a list of inductor test files to skip resetting dynamo (#131551)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131551
Approved by: https://github.com/zou3519
2024-07-26 21:08:15 +00:00
PyTorch MergeBot
0f9bf208ec Revert "[BE][tests] show local variables on failure in tests (#131151)"
This reverts commit 054d214c50.

Reverted https://github.com/pytorch/pytorch/pull/131151 on behalf of https://github.com/jbschlosser due to pollutes test failure output for OpInfo tests ([comment](https://github.com/pytorch/pytorch/pull/131151#issuecomment-2253310448))
2024-07-26 19:03:10 +00:00
Xuehai Pan
63374dda69 [BE][Easy] explicitly define global constants in torch.testing._internal.common_utils (#129826)
This appeases IDE warnings like "torch.testing._internal.common_utils has no member TEST_WITH_ROCM".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129826
Approved by: https://github.com/Skylion007
2024-07-26 06:32:08 +00:00
Aaron Orenstein
c65b197b85 [BE] typing for decorators - _library/custom_ops (#131578)
See #131429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131578
Approved by: https://github.com/oulgen, https://github.com/zou3519
ghstack dependencies: #131568, #131569, #131570, #131571, #131572, #131573, #131574, #131575, #131576, #131577
2024-07-25 22:24:19 +00:00
PyTorch MergeBot
f3df7deab8 Revert "Add flag to ignore unsupported @triton.autotune args in user-written kernel compilation (#131431)"
This reverts commit e9db1b0597.

Reverted https://github.com/pytorch/pytorch/pull/131431 on behalf of https://github.com/clee2000 due to broke internal tests D60211713 ([comment](https://github.com/pytorch/pytorch/pull/131431#issuecomment-2251091957))
2024-07-25 18:00:46 +00:00
Jane Xu
9c4cf866c2 Adafactor forloop basic impl (#129905)
#109581

At this point, the vanilla implementation (the default) is good.
Docs: https://docs-preview.pytorch.org/pytorch/pytorch/129905/generated/torch.optim.Adafactor.html#torch.optim.Adafactor

Specifically, the impl in this PR, which attempts to replicate the paper,
```
optim = torch.optim.Adafactor([weight])
```
is close enough to https://pytorch-optimizers.readthedocs.io/en/latest/optimizer/#pytorch_optimizer.AdaFactor
```
optim_c = AdaFactor([weight], betas=(0, 0.999), scale_parameter=False)
```
is close enough to https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adafactor
```
optim = keras.optimizers.Adafactor(learning_rate=0.01)
```

The three results respectively for the same randomly generated weights:
```
# ours
tensor([[ 0.3807594, -0.3912092],
        [ 0.0762539,  0.5377805],
        [ 0.2459473,  0.4662207]])

# pytorch-optimizer
tensor([[ 0.3807592, -0.3912172],
        [ 0.0762507,  0.5377818],
        [ 0.2459457,  0.4662213]])

# keras
array([[ 0.38076326, -0.39121315],
        [ 0.0762547 ,  0.5377859 ],
        [ 0.24594972,  0.46622536]], dtype=float32)
```

This gives me confidence to move forward in speeding up the implementation now that a baseline has been established. If you're curious about differences:
* keras assigns step_size (rho_t in their code) to `min(lr, 1 / sqrt(step))` whereas the OG impl uses a hardcoded 0.01 instead of lr. We do the same thing as keras, but our lr default is 0.01.
* We differ from the pytorch-optimizers default in that our default will not track momentum (thus `beta1=0`) and we do not apply parameter scaling.

<details>

Keras collab: https://colab.research.google.com/drive/1i3xF8ChL7TWKJGV_5v_5nMhXKnYmQQ06?usp=sharing

My script repro:

```
import torch
from pytorch_optimizer import AdaFactor
torch.set_printoptions(precision=7)

weight = torch.tensor([[ 0.37697506, -0.39500135],
        [ 0.07246649,  0.53399765],
        [ 0.24216151,  0.46243715]], dtype=torch.float32)
# bias = torch.tensor([0, 0], dtype=torch.float32)

weight.grad = torch.tensor([[-0.5940447, -0.7743838],
        [-0.5940447, -0.7743838],
        [-0.5940447, -0.7743838]], dtype=torch.float32)
# bias.grad = torch.tensor([-2.5027974,  1.5422692], dtype=torch.float32)

weight_c = weight.clone()
weight_c.grad = weight.grad.clone()

optim = torch.optim.Adafactor([weight])
optim.step()
print(weight)

optim_c = AdaFactor([weight_c], betas=(0, 0.999), scale_parameter=False)
optim_c.step()
print(weight_c)
```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129905
Approved by: https://github.com/albanD
2024-07-25 13:17:19 +00:00
Xuehai Pan
054d214c50 [BE][tests] show local variables on failure in tests (#131151)
------

As per the title, add argument `--locals` for `unittest` and `--showlocals --tb=long` for `pytest` in CI.

Some failures cannot be reproduced on the local machine but exist on cloud CI. This change allows us to investigate the test failure more easily.

Example output: https://github.com/pytorch/pytorch/actions/runs/9961546996/job/27523888353?pr=130710#step:20:3361

```text
/opt/conda/envs/py_3.8/lib/python3.8/site-packages/sympy/core/function.py:307:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

cls = FloorDiv, base = -1.00000000000000, divisor = -1.00000000000000

    @classmethod
    def eval(cls, base, divisor):
        # python test/test_dynamic_shapes.py -k TestDimConstraints.test_dim_constraints_solve_full
        # Assert triggered by inequality solver
        # assert base.is_integer, base
        # assert divisor.is_integer, divisor

        # We don't provide the same error message as in Python because SymPy
        # makes it difficult to check the types.
        if divisor.is_zero:
            raise ZeroDivisionError("division by zero")
        if base in (int_oo, -int_oo, sympy.oo, -sympy.oo) and divisor in (
            int_oo,
            -int_oo,
            sympy.oo,
            -sympy.oo,
        ):
            return sympy.nan
        if base is sympy.nan or divisor is sympy.nan:
            return sympy.nan

        if base.is_zero:
            return sympy.S.Zero
        if base.is_integer and divisor == 1:
            return base
        if base.is_integer and divisor == -1:
            return sympy.Mul(base, -1)
        if (
            isinstance(base, sympy.Number)
            and isinstance(divisor, sympy.Number)
            and (
                base in (int_oo, -int_oo, sympy.oo, -sympy.oo)
                or divisor in (int_oo, -int_oo, sympy.oo, -sympy.oo)
            )
        ):
            r = float(base) / float(divisor)
            if r == math.inf:
                return int_oo
            elif r == -math.inf:
                return -int_oo
            elif math.isnan(r):
                return sympy.nan
            else:
                return sympy.Integer(math.floor(r))
        if isinstance(base, sympy.Integer) and isinstance(divisor, sympy.Integer):
            return sympy.Integer(int(base) // int(divisor))
        if isinstance(base, FloorDiv):
            return FloorDiv(base.args[0], base.args[1] * divisor)

        # Expands (x + y) // b into x // b + y // b.
        # This only works if floor is an identity, i.e. x / b is an integer.
        for term in sympy.Add.make_args(base):
            quotient = term / divisor
            if quotient.is_integer and isinstance(divisor, sympy.Integer):
                # NB: this is correct even if the divisor is not an integer, but it
                # creates rational expressions that cause problems with dynamic
                # shapes.
                return FloorDiv(base - term, divisor) + quotient

        try:
            gcd = sympy.gcd(base, divisor)
            if gcd != 1:
>               return FloorDiv(
                    sympy.simplify(base / gcd), sympy.simplify(divisor / gcd)
                )

base       = -1.00000000000000
cls        = FloorDiv
divisor    = -1.00000000000000
gcd        = 1.00000000000000
quotient   = 1.00000000000000
term       = -1.00000000000000

/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/utils/_sympy/functions.py:159:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

args = (FloorDiv, -1.00000000000000, -1.00000000000000), kwargs = {}

    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
>           retval = cfunc(*args, **kwargs)
E           RecursionError: maximum recursion depth exceeded in comparison
E
E           To execute this test, run the following from the base repo dir:
E               python test/test_sympy_utils.py -k TestValueRanges.test_binary_ref_fn_floordiv_dtype_float
E
E           This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

args       = (FloorDiv, -1.00000000000000, -1.00000000000000)
cfunc      = <functools._lru_cache_wrapper object at 0x7fc5303173a0>
func       = <function Function.__new__ at 0x7fc530317280>
kwargs     = {}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131151
Approved by: https://github.com/ezyang
2024-07-25 10:10:58 +00:00
William Wen
7718024d2b [3.13] support 3.13 multiline traces in munge_exc (#131207)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131207
Approved by: https://github.com/jansel, https://github.com/anijain2305
ghstack dependencies: #131206
2024-07-24 18:22:30 +00:00
William Wen
106c6a49f5 [dynamo] limit number of compiles per frame (#130891)
Fixes https://github.com/pytorch/pytorch/issues/130776

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130891
Approved by: https://github.com/anijain2305
2024-07-24 16:43:40 +00:00
Adnan Akhundov
e9db1b0597 Add flag to ignore unsupported @triton.autotune args in user-written kernel compilation (#131431)
Summary: We currently don't support some of the `@triton.autotune` arguments when compiling user-written Triton kernels with PT2. In this PR, we're adding a flag to circumvent this, in order to unblock internal compilation in some cases. The flag's documentation explains why setting it is generally not a good idea.

Test Plan:

```
python test/inductor/test_triton_kernels.py -k test_triton_kernel_autotune_with_unsupported_args
...
----------------------------------------------------------------------
Ran 3 tests in 3.636s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131431
Approved by: https://github.com/oulgen, https://github.com/zou3519
2024-07-24 05:37:09 +00:00
Will Feng
fc3d2b26cd Use fake PG for test_compute_comm_reordering.py unit tests (#131415)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131415
Approved by: https://github.com/yifuwang
2024-07-23 22:53:23 +00:00
Aaron Orenstein
5a0068cc69 [BE] mypy: disallow untyped decorators (#131428)
Untyped decorators strip the types from their decorated function, so even if the underlying function is fully typed, callers to it don't get any benefit from type annotations.

Step 1 - Enable the error and override in all the offending files.

#131429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131428
Approved by: https://github.com/justinchuby, https://github.com/oulgen
2024-07-23 21:50:55 +00:00
Shangdi Yu
68c725a094 [custom ops] Add register_vmap for custom ops (#130589)
Fixes #130284
Fixes #130653

- Add `torch.library.register_vmap` to custom ops (see the usage sketch after this list)
- Add `register_vmap` for operators in custom_op_db.
- Make `torch.autograd.Function` support keyword-only arguments for vmap
- Test operators in op_db with `tests/test_vmap`.
- Change `test_vmap` to allow a custom `out_dim` and allow "None" in `out_dim` when testing.
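
A hedged usage sketch of the new registration API (the op name `mylib::numpy_cube` is hypothetical, and the exact signature should be checked against the docs):

```python
import torch
from torch import Tensor


@torch.library.custom_op("mylib::numpy_cube", mutates_args=())
def numpy_cube(x: Tensor) -> Tensor:
    return torch.from_numpy(x.numpy(force=True) ** 3).to(x.device)


def numpy_cube_vmap(info, in_dims, x):
    # The op is elementwise, so applying it to the whole batched tensor is a
    # valid batching rule; the batch dimension stays where it was.
    return numpy_cube(x), in_dims[0]


torch.library.register_vmap("mylib::numpy_cube", numpy_cube_vmap)

x = torch.randn(3, 4)
out = torch.vmap(numpy_cube)(x)  # dispatches to the registered vmap rule
```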

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130589
Approved by: https://github.com/zou3519
2024-07-23 17:48:38 +00:00
Tom Ritchford
16247987a1 Add decomposition for t_copy (#130939)
* Extracted from #128416

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130939
Approved by: https://github.com/peterbell10
2024-07-23 08:29:19 +00:00
Li-Huai (Allan) Lin
99d9b369f4 [Optim] Support tensor lr for all optimizers and check it is 1-element (#131065)
Fixes: #130980
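
A tiny illustration of what this enables (a hedged sketch; the optimizer and values are arbitrary):

```python
import torch

param = torch.nn.Parameter(torch.randn(4))
# lr may now be a 1-element tensor for any optimizer; multi-element tensors are rejected.
opt = torch.optim.SGD([param], lr=torch.tensor(1e-2))

param.sum().backward()
opt.step()
```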
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131065
Approved by: https://github.com/janeyx99
2024-07-23 04:27:05 +00:00
PyTorch MergeBot
b435d84261 Revert "[custom ops] Add register_vmap for custom ops (#130589)"
This reverts commit 074b420641.

Reverted https://github.com/pytorch/pytorch/pull/130589 on behalf of https://github.com/atalman due to Please fix lint and reland ([comment](https://github.com/pytorch/pytorch/pull/130589#issuecomment-2244092174))
2024-07-23 01:44:44 +00:00
Shangdi Yu
074b420641 [custom ops] Add register_vmap for custom ops (#130589)
Fixes #130284
Fixes #130653

- Add `torch.library.register_vmap` to custom ops
- Add `register_vmap` for operators in custom_op_db.
- Make `torch.autograd.Function` support keyword-only arguments for vmap
- Test operators in op_db with `tests/test_vmap`.
- Change `test_vmap` to allow a custom `out_dim` and allow "None" in `out_dim` when testing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130589
Approved by: https://github.com/zou3519
2024-07-23 00:54:52 +00:00
Thomas Ortner
8ae1963a61 [Autograd] Cond Higher-Order Operation (#126911)
This is an updated PR to equip cond with the autograd feature and replaces the old [PR](https://github.com/pytorch/pytorch/pull/126007)

@ydwu4 I tried to incorporate your requests already.

Currently there are two problems that I am struggling to solve:

1. There seems to be an import issue when trying to import cond in `torch/__init__.py`, see [here](8a704035c9/torch/__init__.py (L1914-L1916)). Therefore, I had to comment out those lines, which resolved the import issues, but I believe cond is not properly exposed as torch.cond.
2. I am not entirely sure how to deal with the opinfo test in `hop_db.py`

Co-authored-by: Yidi Wu <yidi@meta.com>
Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126911
Approved by: https://github.com/ydwu4
2024-07-22 23:18:19 +00:00
Tom Ritchford
500cbb5b90 Add decomposition for view_copy (#130938)
* Extracted from #128416
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130938
Approved by: https://github.com/peterbell10
ghstack dependencies: #130937
2024-07-21 20:39:24 +00:00
Tom Ritchford
f628813066 Fix out_wrapper, _make_copy_from_view to handle all signatures (#130937)
* See #128416 and #129476
* Simplify xskip lists in test/functorch/test_ops.py
* Add supports_out=True to OpInfos for copy ops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130937
Approved by: https://github.com/peterbell10
2024-07-21 20:39:24 +00:00
ankurneog
ebc012ace6 Add hooks for execution on intel gaudi devices - 1 (#128584)
## Motivation
This is a follow-up to PR https://github.com/pytorch/pytorch/pull/126970 to support Gaudi devices for PyTorch UT execution.

## Changes
We are adding additional hooks to:
1. Add dtype exceptions for Gaudi/HPU
2. Extend the onlyNativeDevices decorator functionality to support additional devices

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128584
Approved by: https://github.com/albanD
2024-07-20 05:03:36 +00:00
Isuru Fernando
bb4251213b Add decomposition for channel_shuffle (#118775)
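
For context, channel_shuffle can be expressed with only reshape and transpose, which is roughly what such a decomposition looks like (a sketch; the decomposition added by this PR may differ in details):

```python
import torch
import torch.nn.functional as F


def channel_shuffle_decomp(x: torch.Tensor, groups: int) -> torch.Tensor:
    # (N, C, *spatial) -> (N, groups, C // groups, *spatial), swap the two
    # channel factors, then flatten back to (N, C, *spatial)
    n, c, *rest = x.shape
    return x.reshape(n, groups, c // groups, *rest).transpose(1, 2).reshape(n, c, *rest)


x = torch.arange(16.0).reshape(1, 4, 2, 2)
assert torch.equal(channel_shuffle_decomp(x, 2), F.channel_shuffle(x, 2))
```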
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118775
Approved by: https://github.com/peterbell10
2024-07-20 01:24:41 +00:00
Catherine Lee
31e79aae6a Another follow up to #130260 (#130993)
Another followup to https://github.com/pytorch/pytorch/pull/130260
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130993
Approved by: https://github.com/huydhn
2024-07-19 16:43:54 +00:00
kausik
4f60a2e39c Set correct output dtype for dequantize op during convert_pt2e in decomposed mode (#128953)
Earlier, the signature of the dequantize ops for decomposed quantized tensors was changed for wider use cases where the output dtype can differ from torch.float and needs to be passed during dequantization.
Please refer to https://github.com/pytorch/pytorch/pull/121450.

However, setting the correct output dtype for dequantize ops was still missing from the convert_pt2e flow.

This change enables users to use the PT2E quantization flow with a non-torch.float unquantized dtype, such as torch.bfloat16.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128953
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2024-07-19 04:58:02 +00:00
PyTorch MergeBot
fb3674b1f4 Revert "[Autograd] Cond Higher-Order Operation (#126911)"
This reverts commit f7058b735e.

Reverted https://github.com/pytorch/pytorch/pull/126911 on behalf of https://github.com/clee2000 due to broke lint and functorch/test_aotdispatch f7058b735e Probably a landrace since both the test and lint passed on PR ([comment](https://github.com/pytorch/pytorch/pull/126911#issuecomment-2237703182))
2024-07-18 22:06:40 +00:00
Thomas Bohnstingl
f7058b735e [Autograd] Cond Higher-Order Operation (#126911)
This is an updated PR to equip cond with the autograd feature and replaces the old [PR](https://github.com/pytorch/pytorch/pull/126007)

@ydwu4 I tried to incorporate your requests already.

Currently there are two problems that I am struggling to solve:

1. There seems to be an import issue when trying to import cond in `torch/__init__.py`, see [here](8a704035c9/torch/__init__.py (L1914-L1916)). Therefore, I had to comment out those lines, which resolved the import issues, but I believe cond is not properly exposed as torch.cond.
2. I am not entirely sure how to deal with the opinfo test in `hop_db.py`

Co-authored-by: Yidi Wu <yidi@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126911
Approved by: https://github.com/ydwu4
2024-07-18 21:09:09 +00:00
PyTorch MergeBot
fff92d4f18 Revert "Use inductor TestCase for test_replicate_with_compiler.py (#129494)"
This reverts commit 9f392f8294.

Reverted https://github.com/pytorch/pytorch/pull/129494 on behalf of https://github.com/atalman due to broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/129494#issuecomment-2237147504))
2024-07-18 17:42:05 +00:00
drisspg
dd39dca034 Removing some cruff and updating signatures for consistency (#130871)
# Summary

- This removes a bunch of example score mods that were primarily used for testing and places them directly in the test file. We should follow up with merging test_flex_decode and test_flash when the velocity slows down a little
- Fixes a bug with indexing on block mask
- Adds some doc strings to helper funcs and fixes some misc typing things
- Forces functions passed to `create_block_mask` to be mask_mods and updates test files (see the sketch after this list)
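
A hedged sketch of the mask_mod pattern referenced above (import path and argument order are assumed from the public FlexAttention API; a CUDA device is assumed):

```python
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention


def causal_mask(b, h, q_idx, kv_idx):
    # mask_mod: return True where attention from q_idx to kv_idx is allowed
    return q_idx >= kv_idx


B, H, S, D = 2, 4, 256, 64
block_mask = create_block_mask(causal_mask, B, H, S, S, device="cuda")
q = k = v = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
out = flex_attention(q, k, v, block_mask=block_mask)
```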

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130871
Approved by: https://github.com/joydddd, https://github.com/Chillee
2024-07-18 13:32:11 +00:00
Michael Lazos
cf3f4285a8 Add recursive metadata guard test (#131002)
Ensures that nested tensor subclasses are guarded properly. It turns out this case is already handled [here](d77af49380/torch/_dynamo/variables/builder.py (L1496)), which recursively wraps inner tensors, adding metadata guards for them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131002
Approved by: https://github.com/bdhirsh
2024-07-18 08:24:43 +00:00
Sam Larsen
9f392f8294 Use inductor TestCase for test_replicate_with_compiler.py (#129494)
Summary: `test/distributed/_composable/test_replicate_with_compiler.py` exercises inductor. This change introduces a version of MultiProcessTestCase that derives from the inductor TestCase class to make sure we always get a clean cache dir.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129494
Approved by: https://github.com/eellison
2024-07-18 03:08:32 +00:00
Aidyn-A
bd56bcf0ab [TEST] Fix _scaled_mm tests (#130897)
This PR resolves several sets of `_scaled_mm` test failures:
- `scale_a` and `scale_b` are now required arguments, so the function `sample_inputs_scaled_mm` must supply them
- `_scaled_mm` does not support `"meta"` device, so it should be skipped in `test_meta.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130897
Approved by: https://github.com/drisspg
2024-07-18 02:15:00 +00:00
lezcano
af0b5ee924 Reduce number of samples in {svd,pca}_lowrank OpInfos (#127199)
We don't need to generate so many samples for these very expensive ops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127199
Approved by: https://github.com/peterbell10, https://github.com/zou3519
2024-07-17 16:29:36 +00:00
Xuehai Pan
b29b23137c [Easy] Fix argument name collision in dispatched functions (#129562)
Use a positional-only argument to avoid naming collisions with aten op arguments that are named "self".

```python
In [1]: def foo(self, *args, **kwargs):
   ...:     print(self, args, kwargs)
   ...:

In [2]: def bar(self, /, *args, **kwargs):
   ...:     print(self, args, kwargs)
   ...:

In [3]: foo(1, 2, self=3)
TypeError: foo() got multiple values for argument 'self'

In [4]: bar(1, 2, self=3)
1
(2,)
{'self': 3}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129562
Approved by: https://github.com/zou3519, https://github.com/fegin
2024-07-17 14:39:56 +00:00
PyTorch MergeBot
8458dc8966 Revert "Use inductor TestCase for distributed tests (#129494)"
This reverts commit 3cd2ae331a.

Reverted https://github.com/pytorch/pytorch/pull/129494 on behalf of https://github.com/masnesral due to fbcode failures ([comment](https://github.com/pytorch/pytorch/pull/129494#issuecomment-2232063690))
2024-07-17 00:32:48 +00:00
drisspg
2b43d339fe Make FlexAttention API public (#130755)
# Summary

Makes the prototype API flex_attention public

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130755
Approved by: https://github.com/Chillee
2024-07-16 16:21:25 +00:00
Jovian Anthony Jaison
e57101d927 Add testing regarding SparseAdam state_dicts (#130645)
Summary:
- Updated SparseAdam to run test_state_dict_deterministic unit test.
- Made gradients sparse while keeping weights dense in the above test.

Test Plan:
- Ran test_optim.py locally.

Fixes #116507

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130645
Approved by: https://github.com/janeyx99
2024-07-16 11:29:22 +00:00
Sam Larsen
3cd2ae331a Use inductor TestCase for distributed tests (#129494)
Summary: At least some of the tests deriving from MultiProcessTestCase exercise inductor. Using the inductor TestCase class makes sure we always get a clean cache dir.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129494
Approved by: https://github.com/eellison
2024-07-16 01:24:35 +00:00
eqy
5e617d7ef5 [CUDA] Actually bump tolerances for test_grad_pca_lowrank (#130770)
Fixes change in #129902 to actually bump pca rather than svd, thanks @ptrblck for the catch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130770
Approved by: https://github.com/Skylion007
2024-07-16 00:41:10 +00:00
Bilal Khan
54a932b0ac Support for expandable segments with cuda graph trees (#128068)
This PR adds support to use expandable segments with private memory pools, which should unblock using them with cuda graphs and cuda graph trees. Currently, the allocator silently avoids using expandable segments when allocating in a private pool due to checkpoint saving/restoring not meshing well with how we keep track of unmapped blocks.

The PR itself is pretty short, most of the logic for checkpointing and reapplying state for non-expandable segments transfers over without much work.

Expandable segments reserve a virtual address space of size equal to the amount of physical memory on the GPU. Every time we want to `malloc()` or `free()` memory in a memory pool with expandable segments turned on, we map/unmap pages of physical GPU memory under the hood to create a new block that we return to the caller. This is beneficial due to the fact that each memory pool functions as a single segment of memory with a contiguous block of memory addresses that can grow and shrink as needed, avoiding fragmentation from allocating multiple non-contiguous segments that may not be merged together.

The caching allocator handles this by creating an unmapped block for the entire reserved virtual address space at init, which is treated similarly to an unallocated block in a free pool. When callers call `malloc()`, it's split and mapped to create allocated blocks, and calling `free()` similarly caches and merges free blocks in a free pool to be used later. Expandable blocks are unmapped and returned back to Cuda when they are cleaned up, or when we hit an OOM and the allocator attempts to remap cached free blocks. The code paths to map, free, and unmap blocks in expandable segments is similar to that for normal blocks and does all the same work of updating stats on memory usage, moving blocks between active and free pools, and returning memory to Cuda.

With Cuda Graph Trees and private memory pools, we need the ability to take checkpoints of the current state of the memory allocator after each graph capture as well as reapplying the state before capturing a new graph after replaying a captured graph so that the new cuda graph capture has access to the state of the allocator at the point after replaying a previously captured graph so it can reuse empty blocks and allocate new ones.

As mentioned in a comment below, memory in a private pool is cached until the private pool is destroyed and allocations can only grow from extra graph captures; any freeing of memory would result in invalid memory addresses and would break cuda graphs.

One implementation detail to note for unmapped blocks with expandable segments is that unmapped blocks are tracked in a member variable `unmapped` of a `BlockPool`. `unmapped` is *not* part of the checkpointed state of the caching allocator and isn't restored when reapplying checkpoints, since we never free/unmap memory back to cuda and it is persisted across graph captures / replays.

Checkpointing the current state of the memory allocator works as expected with expandable segments. Checkpointing grabs the first block of every segment in the active and free pools of the private pool and traverses the linked list of blocks in the segment to capture the state of every segment, which is then saved and kept for when it is needed to be reapplied. For expandable blocks, the last block in every segment will be an unallocated unmapped block containing the remaining amount of unmapped memory at graph capture time, and this too is saved in the checkpoint.

Reapplying the checkpoints works by freeing all allocated blocks and merging them into a single block per segment, then for each segment, we manually split and allocate all blocks from the checkpoint and then free the blocks marked as unallocated in the checkpoint state. For expandable segments, we need to make some modifications to not split unmapped blocks and avoid manually mapping then freeing unmapped blocks.
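
For orientation, expandable segments themselves are switched on via the allocator config; this PR is about making them work inside the private pools used by CUDA graph trees, not about enabling them. A minimal sketch (the environment variable must be set before the first CUDA allocation):

```python
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

x = torch.randn(1024, 1024, device="cuda")  # served from an expandable segment
print(torch.cuda.memory_stats()["allocated_bytes.all.current"])
```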

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128068
Approved by: https://github.com/eqy, https://github.com/eellison
2024-07-15 23:23:23 +00:00
Alnis Murtovi
50ef099ad0 Learn a heuristic to decide whether to pad before mm (#128643)
This PR introduces AutoHeuristic, a framework to collect results from autotuning, learn a heuristic as a machine learning model (a regression tree), and then ship the learned heuristic by generating code from the regression tree.

The heuristics have been learned on artificial/random data that has been collected with the `gen_data_pad_mm.py` script. The `gen_pad_mm_a100.sh` script can then be used to learn a heuristic and generate it as code.

The best model is decided by doing a grid search over various values for `max_depth` and `min_samples_leaf` and choosing the model with the highest number of correct predictions on the validation set.

The heuristic can return "unsure", which means it is not sure which choice is best; in that case autotuning will happen.

On A100, only tensors where each dimension is >= 512 are considered. For smaller tensors, the heuristics that I learned returned "unsure" too often.

The results for randomly generated data and huggingface look as follows:
`max_wrong_speedup` is max(`wrong_speedups`) where `wrong_speedups` contains all the speedups one could have achieved for those examples where the heuristic made a wrong choice, i.e. a `max_wrong_speedup` of 1.37 means that the heuristic selected a choice, but the other choice would have been 1.37x faster. `gman_wrong_speedup` is the geomean of `wrong_speedups`.

The heuristic is learned as a regression tree that returns higher values for better choices. The threshold decides how much better the better choice has to be for it to be returned, i.e. on A100, if the better choice is less than 1.702530x better than the other choice, "unsure" will be returned. This threshold is determined using the validation set.
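
A toy sketch of this idea (not the AutoHeuristic code; features, data, and the threshold value are made up for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# hypothetical training data: features of the matmul and the measured speedup
# of "pad" over "no pad" for each sample
X = np.random.rand(1000, 3)
y = np.random.rand(1000) + 0.5

tree = DecisionTreeRegressor(max_depth=5, min_samples_leaf=10).fit(X, y)


def decide(feats, threshold=1.7):
    pred = tree.predict([feats])[0]
    if pred >= threshold:
        return "pad"
    if pred <= 1.0 / threshold:
        return "no_pad"
    return "unsure"  # fall back to autotuning


print(decide([0.1, 0.2, 0.3]))
```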

A100
```
       max_depth  min_samples_leaf dataset  correct  wrong  unsure  total  max_wrong_speedup  gman_wrong_speedup  threshold
15         5.0                10     train     2730      4    3023   5757           1.372220            1.193873   1.702530
16         5.0                10       val      878      0    1042   1920                NaN                 NaN   1.702530
17         5.0                10      test      925      2     993   1920           1.741708            1.354954   1.702530
18         5.0                10  hf-train       14      0      22     36                NaN                 NaN   1.702530
19         5.0                10    hf-inf        7      0       1      8                NaN                 NaN   1.702530
```

The numbers for huggingface only include tensors where each dim is >= 512. If all tensors had been included, there would have been the following number of matmuls where at least one dimension is unaligned:
A100 hf-train: 60
A100 hf-inf: 10

## Results on running huggingface locally
This only includes models where the learned heuristic made at least one decision. For the examples here, it takes around 0.25-0.3 seconds to perform autotuning for the padded and unpadded version, so each decision that the heuristic makes saves around 0.25-0.3 seconds.
#pad_mm_autotuning is the number of times autotuning happened in pad_mm and #heuristic_made_decision is the number of times the heuristic made a decision (i.e. it didn't return "unsure").

I ran huggingface locally, running each model 5 times, and took the median speedup and compilation_latency.
Results on huggingface training
```
                          name speedup_heuristic speedup_baseline  speedup_diff compilation_latency_heuristic compilation_latency_baseline  compilation_latency_diff  comp_latency_reduction%  #pad_mm_autotuning  #heuristic_made_decision
               BartForCausalLM   1.19 (+/- 0.00)  1.19 (+/- 0.00)         -0.00              40.33 (+/- 1.13)             40.95 (+/- 0.78)                     -0.62                     1.52                   3                         2
  BartForConditionalGeneration   1.53 (+/- 0.06)  1.47 (+/- 0.05)          0.06              81.93 (+/- 5.20)             82.23 (+/- 1.92)                     -0.30                     0.36                   3                         1
    BlenderbotSmallForCausalLM   1.86 (+/- 0.04)  1.86 (+/- 0.00)          0.00              36.76 (+/- 0.49)             37.62 (+/- 1.33)                     -0.87                     2.31                   3                         2
                     CamemBert   2.36 (+/- 0.01)  2.35 (+/- 0.01)          0.01              97.60 (+/- 1.91)             98.69 (+/- 1.35)                     -1.09                     1.11                   2                         1
                   DistillGPT2   2.57 (+/- 0.01)  2.57 (+/- 0.01)          0.00              57.33 (+/- 0.77)             58.26 (+/- 1.41)                     -0.93                     1.59                   3                         2
             PLBartForCausalLM   2.07 (+/- 0.01)  2.06 (+/- 0.01)          0.01              32.54 (+/- 0.83)             34.65 (+/- 0.71)                     -2.11                     6.10                   3                         2
PLBartForConditionalGeneration   1.87 (+/- 0.00)  1.88 (+/- 0.00)         -0.01              58.45 (+/- 1.24)             58.95 (+/- 1.92)                     -0.50                     0.85                   3                         1
            RobertaForCausalLM   2.39 (+/- 0.01)  2.40 (+/- 0.01)         -0.01              97.38 (+/- 1.52)             97.69 (+/- 1.18)                     -0.31                     0.32                   2                         1
              TrOCRForCausalLM   1.70 (+/- 0.00)  1.70 (+/- 0.00)         -0.00              44.79 (+/- 1.33)             45.25 (+/- 1.08)                     -0.46                     1.01                   3                         2

Mean difference in speedup: 0.01
Mean compilation latency saved: -0.80s
Mean compilation latency reduction: 1.68%
```

Results on huggingface inference
```
                          name speedup_heuristic speedup_baseline  speedup_diff compilation_latency_heuristic compilation_latency_baseline  compilation_latency_diff  comp_latency_reduction%  #pad_mm_autotuning  #heuristic_made_decision
               BartForCausalLM   1.11 (+/- 0.00)  1.11 (+/- 0.00)          0.00              19.02 (+/- 0.28)             19.40 (+/- 0.35)                     -0.38                     1.95                   3                         2
  BartForConditionalGeneration   1.26 (+/- 0.01)  1.23 (+/- 0.03)          0.03              36.84 (+/- 0.40)             36.55 (+/- 0.75)                      0.30                    -0.81                   3                         1
    BlenderbotSmallForCausalLM   1.87 (+/- 0.02)  1.87 (+/- 0.01)          0.00              17.53 (+/- 0.31)             18.03 (+/- 0.43)                     -0.49                     2.74                   3                         2
                   DistillGPT2   2.50 (+/- 0.02)  2.50 (+/- 0.01)          0.00              16.16 (+/- 0.29)             16.40 (+/- 0.18)                     -0.24                     1.46                   3                         2
             PLBartForCausalLM   1.93 (+/- 0.01)  1.94 (+/- 0.01)         -0.00              15.30 (+/- 0.22)             16.01 (+/- 0.71)                     -0.71                     4.43                   3                         2
PLBartForConditionalGeneration   1.98 (+/- 0.01)  1.98 (+/- 0.01)          0.00              25.90 (+/- 0.32)             26.58 (+/- 0.62)                     -0.67                     2.53                   3                         1
              TrOCRForCausalLM   1.61 (+/- 0.00)  1.62 (+/- 0.00)         -0.01              21.38 (+/- 0.37)             21.85 (+/- 0.16)                     -0.47                     2.16                   3                         2

Mean difference in speedup: 0.00
Mean compilation latency saved: -0.38s
Mean compilation latency reduction: 2.07%
```

For now, the heuristic can only be applied to decide whether to pad for mm. One could also learn heuristics for bmm and addmm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128643
Approved by: https://github.com/Chillee, https://github.com/eellison
2024-07-15 23:04:06 +00:00
eqy
6f32dc0c7b Don't pass error message as places in assertGreaterAlmostEqual (#130648)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130648
Approved by: https://github.com/awgu
2024-07-15 22:14:49 +00:00
PyTorch MergeBot
9e161af179 Revert "Increase tolerance for tensorsolve tests (#130620)"
This reverts commit 103b6ccab2.

Reverted https://github.com/pytorch/pytorch/pull/130620 on behalf of https://github.com/clee2000 due to didn't work, test is still failing on this PR and on main, reverting in favor of https://github.com/pytorch/pytorch/pull/130449 instead ([comment](https://github.com/pytorch/pytorch/pull/130620#issuecomment-2229036418))
2024-07-15 17:35:04 +00:00
cyy
28f6ae2718 [9/N] Replace c10::optional with std::optional (#130674)
Follows  #130509

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130674
Approved by: https://github.com/Skylion007
2024-07-15 00:48:43 +00:00
chilli
f9f85bfc0b [Inductor] FlexAttention supports partial masking (#130415) (#130626)
This is the new version of https://github.com/pytorch/pytorch/pull/130415

Updated test script: https://gist.github.com/yanboliang/7c34a82df611d4ea8869cb9e041bfbfc
Updated perf numbers:
```
(pt) [ybliang@devgpu002.ash8 ~/local/debug]$ CUDA_VISIBLE_DEVICES=4 python debug7.py
fwd speedup: 0.7166695598192317
bwd speedup: 0.7142133867805904
(pt) [ybliang@devgpu002.ash8 ~/local/debug]$ CUDA_VISIBLE_DEVICES=4 python debug7.py --partial-mask
fwd speedup: 0.8428246087169973
bwd speedup: 0.8486261278030254
```
Approved by: https://github.com/Chillee

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130626
Approved by: https://github.com/drisspg, https://github.com/yanboliang
2024-07-14 00:37:26 +00:00
Tobias Ringwald
e5de25896f Fixed CUDA randint generation for large ranges. (#126066)
Fixes #125224

For large ranges, calls to CUDA `randint` use a different `unroll_factor` to generate random ints. This `unroll_factor` was not considered correctly in the calculation of the Philox offsets. Thus, some of the random states were reused, resulting in lower entropy (see #125224).

This also affects multiple other random functions, such as `torch.rand` and `torch.randn`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126066
Approved by: https://github.com/eqy, https://github.com/lezcano
2024-07-13 21:42:27 +00:00
rzou
95046c86e3 [HOP] add HOP x torch_dispatch interaction (#130606)
This involved beefing up the Python dispatcher to handle torch_dispatch.
Given a HOP and a torch_dispatch Tensor subclass:
- the HOP will show up in the subclass's `__torch_dispatch__`
- you can also use HOP.py_impl to register a rule for the HOP x
  subclass interaction
- (coming soon) we'll offer a way to register the HOP x subclass
  interaction externally, without needing to touch the subclass's
  `__torch_dispatch__` or the HOP's .py_impl.

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130606
Approved by: https://github.com/ydwu4
2024-07-12 21:51:36 +00:00
albanD
103b6ccab2 Increase tolerance for tensorsolve tests (#130620)
Fix current failure in periodic trunk https://hud.pytorch.org/failure?name=periodic%20%2F%20linux-focal-cuda11.8-py3.10-gcc9-debug%20%2F%20test%20(default%2C%204%2C%205%2C%20linux.4xlarge.nvidia.gpu)&jobName=undefined&failureCaptures=%5B%22functorch%2Ftest_ops.py%3A%3ATestOperatorsCUDA%3A%3Atest_vjp_linalg_tensorsolve_cuda_float32%22%5D

Since it appeared with https://github.com/pytorch/pytorch/pull/128238, which only updates the random seed for the test, I expect this is just bad luck of the draw. Thus, increase the tolerance like we do for other tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130620
Approved by: https://github.com/lezcano, https://github.com/atalman, https://github.com/malfet
2024-07-12 20:08:18 +00:00
leslie-fang-intel
2a1f22e57f Change BN to eval before QAT Convert phase (#130598)
**Summary**
In the QAT convert phase, we fold BN into conv and DCE the BN node. We should change `torch.ops.aten._native_batch_norm_legit.default` to `torch.ops.aten._native_batch_norm_legit_no_training.default` so that the DCE is safe.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130598
Approved by: https://github.com/jgong5, https://github.com/yushangdi
2024-07-12 16:03:56 +00:00
yan-yhy
c16e90fe06 The device_suffix in a test_name is "privateuse1" sometimes. (#130091)
When running some test cases on the privateuse1 device, the device_suffix in a test name is sometimes 'privateuse1'.
For example, a test name should be 'test_Dropout1d_npu', but it is sometimes 'test_Dropout1d_privateuse1' instead.
When setUpClass() doesn't set it, the device_suffix falls back to "privateuse1".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130091
Approved by: https://github.com/zou3519
2024-07-12 02:51:40 +00:00
PyTorch MergeBot
d97d962082 Revert "Add decompositions for copy variants of view ops (#128416)"
This reverts commit 68751799b8.

Reverted https://github.com/pytorch/pytorch/pull/128416 on behalf of https://github.com/izaitsevfb due to breaks test_qs8_permute_copy test in executorch ([comment](https://github.com/pytorch/pytorch/pull/128416#issuecomment-2224023423))
2024-07-11 22:09:23 +00:00
PyTorch MergeBot
a2f630a9a4 Revert "Decompose expand_copy and permute_copy (#129476)"
This reverts commit 7d4cb21098.

Reverted https://github.com/pytorch/pytorch/pull/129476 on behalf of https://github.com/izaitsevfb due to depends on #128416 which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/129476#issuecomment-2224019720))
2024-07-11 22:06:15 +00:00
Yidi Wu
cd9bae30de Allow kwargs in _remove_effect_tokens_pass (#130491)
Summary: Previously, the remove_effect_tokens pass didn't pass kwargs to the internal nodes. This PR fixes it and adds a test for it.

Test Plan: buck2 run caffe2/test:test_export -- -r test_remove_effect_token_kwargs

Reviewed By: angelayi

Differential Revision: D59603147

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130491
Approved by: https://github.com/angelayi
2024-07-11 19:03:19 +00:00
PyTorch MergeBot
578388bed8 Revert "Support for expandable segments with cuda graph trees (#128068)"
This reverts commit fdc83610f2.

Reverted https://github.com/pytorch/pytorch/pull/128068 on behalf of https://github.com/janeyx99 due to Reverting for breaking ROCm tests on trunk, I think the tests need to be qualified with @onlyCUDA ([comment](https://github.com/pytorch/pytorch/pull/128068#issuecomment-2223672381))
2024-07-11 18:58:13 +00:00
Xuehai Pan
973037be6a [BE][Easy] apply autofix for ruff rules unnecessary-collection-call (C408): list() / tuple() / dict() (#130199)
This PR changes the empty collection factory call to Python literals:

- `list()` -> `[]`
- `tuple()` -> `()`
- `dict()` -> `{}`

The Python literals are more performant and safer. For example, the bytecode for building an empty dictionary:

```bash
$ python3 -m dis - <<EOS
import collections

d1 = {}
d2 = dict()

dict = collections.OrderedDict
d3 = dict()
EOS
```

```text
  0           0 RESUME                   0

  1           2 LOAD_CONST               0 (0)
              4 LOAD_CONST               1 (None)
              6 IMPORT_NAME              0 (collections)
              8 STORE_NAME               0 (collections)

  3          10 BUILD_MAP                0
             12 STORE_NAME               1 (d1)

  4          14 PUSH_NULL
             16 LOAD_NAME                2 (dict)
             18 CALL                     0
             26 STORE_NAME               3 (d2)

  6          28 LOAD_NAME                0 (collections)
             30 LOAD_ATTR                8 (OrderedDict)
             50 STORE_NAME               2 (dict)

  7          52 PUSH_NULL
             54 LOAD_NAME                2 (dict)
             56 CALL                     0
             64 STORE_NAME               5 (d3)
             66 RETURN_CONST             1 (None)
```

The dict literal `{}` only has one bytecode `BUILD_MAP`, while the factory call `dict()` has three `PUSH_NULL + LOAD_NAME + CALL`. Also, the factory call is not safe if users override the `dict` name in `locals` or `globals` (see the example of replacing with `OrderedDict` above).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130199
Approved by: https://github.com/malfet
2024-07-11 17:30:28 +00:00
Jiang, Yanbing
6f662e9575 update the input weight of _convert_weight_to_int4pack to [n][k / 2] uint8 (#129940)
This PR updates the input `weight` of `_convert_weight_to_int4pack` from `[n][k] int32` to `[n][k / 2] uint8` for CPU, CUDA, and MPS, which helps decouple the int4 model checkpoint from specific ISAs and platforms in `gpt-fast`. The advantage is that an int4 model checkpoint can be shared across different test machines without being regenerated on one particular platform. Meanwhile, the size of the input `weight` is reduced to 1/8 of its original size.

Before this PR, the packed weight was stored in a CUDA-specific layout: `[n/8][k/(InnerKTiles*16)][32][InnerKTiles/2]`, dtype int32, where InnerKTiles = 2, 4, or 8. The CPU packed weight was viewed with the SAME shape but stored in a different layout: `[n/64][k][32]`, dtype uint8. The weight was therefore strongly coupled with the platform (CPU/CUDA) and ISA (AVX512/AVX2/scalar), and users could not reuse a generated weight on a different ISA or platform, because the compute format differs when the weight is loaded onto the device.
![image](https://github.com/pytorch/pytorch/assets/61222868/64971c4b-29b9-42cf-9aeb-ffa01cea93dd)

Now, we use a common serialized layout (`[n][k/2] uint8`) across devices and ISAs as the input `weight` of `_convert_weight_to_int4pack`, and each backend chooses how to interpret it as its compute layout (see the sketch after the diagram below).
![image](https://github.com/pytorch/pytorch/assets/61222868/c7990761-c723-417b-aca2-7c60db7785c7)
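
As a rough illustration of the serialized layout, here is a minimal sketch (not code from the PR) of packing an `[n][k]` tensor of int4 values, stored one per element, into `[n][k / 2] uint8`. The nibble order shown is an assumption for illustration only; the actual bit packing is defined by each backend's kernel.

```python
import torch

def pack_int4_to_uint8(w: torch.Tensor) -> torch.Tensor:
    # w: [n][k] tensor holding int4 values (0..15), one value per element
    n, k = w.shape
    assert k % 2 == 0
    w = w.to(torch.uint8) & 0xF
    # pair adjacent columns: even column in the high nibble, odd column in the low nibble
    return (w[:, 0::2] << 4) | w[:, 1::2]

w_int4 = torch.randint(0, 16, (64, 128), dtype=torch.int32)
packed = pack_int4_to_uint8(w_int4)  # shape [64, 64], dtype uint8
```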

### Performance
Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores)
There is no obvious performance regression from this PR.
![image](https://github.com/pytorch/pytorch/assets/61222868/6046dcf4-920b-4c63-9ca3-1c8c3cafebde)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129940
Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/mingfeima
2024-07-11 15:26:48 +00:00
Isuru Fernando
5db9bd467e Skip test_nnc_correctness for new op _unsafe_masked_index (#130375)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130375
Approved by: https://github.com/lezcano
2024-07-11 08:17:16 +00:00
Bilal Khan
fdc83610f2 Support for expandable segments with cuda graph trees (#128068)
This PR adds support for using expandable segments with private memory pools, which should unblock using them with CUDA graphs and CUDA graph trees. Currently, the allocator silently avoids using expandable segments when allocating in a private pool because checkpoint saving/restoring does not mesh well with how we keep track of unmapped blocks.

The PR itself is pretty short, most of the logic for checkpointing and reapplying state for non-expandable segments transfers over without much work.

Expandable segments reserve a virtual address space of size equal to the amount of physical memory on the GPU. Every time we want to `malloc()` or `free()` memory in a memory pool with expandable segments turned on, we map/unmap pages of physical GPU memory under the hood to create a new block that we return to the caller. This is beneficial because each memory pool functions as a single segment of memory with a contiguous block of addresses that can grow and shrink as needed, avoiding the fragmentation that comes from allocating multiple non-contiguous segments that may not be merged together.

The caching allocator handles this by creating an unmapped block for the entire reserved virtual address space at init, which is treated similarly to an unallocated block in a free pool. When callers call `malloc()`, it's split and mapped to create allocated blocks, and calling `free()` similarly caches and merges free blocks in a free pool to be used later. Expandable blocks are unmapped and returned to CUDA when they are cleaned up, or when we hit an OOM and the allocator attempts to remap cached free blocks. The code paths to map, free, and unmap blocks in expandable segments are similar to those for normal blocks and do all the same work of updating stats on memory usage, moving blocks between active and free pools, and returning memory to CUDA.

With CUDA Graph Trees and private memory pools, we need the ability to checkpoint the current state of the memory allocator after each graph capture, and to reapply that state before capturing a new graph after replaying a captured one. This gives the new CUDA graph capture access to the allocator state as it was after replaying the previously captured graph, so it can reuse empty blocks and allocate new ones.

As mentioned in a comment below, memory in a private pool is cached until the private pool is destroyed, and allocations can only grow from extra graph captures; any freeing of memory would result in invalid memory addresses and would break CUDA graphs.

One implementation detail to note for expandable segments is that unmapped blocks are tracked in a member variable `unmapped` of a `BlockPool`. `unmapped` is *not* part of the checkpointed state of the caching allocator and isn't restored when reapplying checkpoints, since we never free/unmap memory back to CUDA; it persists across graph captures / replays.

Checkpointing the current state of the memory allocator works as expected with expandable segments. Checkpointing grabs the first block of every segment in the active and free pools of the private pool and traverses the linked list of blocks in each segment to capture the state of every segment, which is then saved until it needs to be reapplied. For expandable blocks, the last block in every segment will be an unallocated, unmapped block containing the remaining amount of unmapped memory at graph capture time, and this too is saved in the checkpoint.

Reapplying the checkpoints works by freeing all allocated blocks and merging them into a single block per segment; then, for each segment, we manually split and allocate all blocks from the checkpoint and free the blocks marked as unallocated in the checkpoint state. For expandable segments, we need to make some modifications so that unmapped blocks are not split, and to avoid manually mapping and then freeing unmapped blocks.
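
A hedged usage sketch (not from the PR's test plan) of what this enables: turning on expandable segments through the allocator config and then capturing and replaying a CUDA graph, whose allocations come from a private memory pool. The environment variable must be set before CUDA is initialized, and a CUDA device is assumed to be available.

```python
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"  # set before importing torch / initializing CUDA

import torch

static_input = torch.randn(64, 64, device="cuda")

# warm up on a side stream before capture, as recommended for CUDA graphs
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    static_output = static_input @ static_input
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):  # capture into the graph's private memory pool
    static_output = static_input @ static_input

static_input.copy_(torch.randn(64, 64, device="cuda"))
g.replay()  # replays the captured kernels; no new allocations are made
```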

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128068
Approved by: https://github.com/zdevito, https://github.com/eqy
2024-07-11 05:33:09 +00:00
Tom Ritchford
7d4cb21098 Decompose expand_copy and permute_copy (#129476)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129476
Approved by: https://github.com/amjames, https://github.com/lezcano
2024-07-10 17:12:01 +00:00