Commit Graph

80400 Commits

William Wen
b71e813bce [dynamo, 3.13] fix bytecode nop tests (#139323)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139323
Approved by: https://github.com/jansel
2024-11-02 00:39:36 +00:00
Bin Bao
8c17830dea [AOTI] Unify how weights are stored as data section (#139471)
Summary: https://github.com/pytorch/pytorch/pull/118076 introduced a cleaner way to link weights as a data section for macOS. Unify the code by adopting that approach for Linux as well.

Test Plan: CI

Differential Revision: D65302273

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139471
Approved by: https://github.com/chenyang78
2024-11-02 00:23:24 +00:00
PyTorch UpdateBot
aa54b2467f [executorch hash update] update the pinned executorch hash (#139133)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139133
Approved by: https://github.com/pytorchbot
2024-11-02 00:14:47 +00:00
eellison
ee2f8a50d3 Class rename (#139490)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139490
Approved by: https://github.com/exclamaforte, https://github.com/zou3519
ghstack dependencies: #139295
2024-11-02 00:10:17 +00:00
PyTorch MergeBot
c95adb9c5b Revert "use more elements per thread for narrow dtypes (#139449)"
This reverts commit f5b9e725d1.

Reverted https://github.com/pytorch/pytorch/pull/139449 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but a bunch of tests are failing after it lands; it looks like a landrace ([comment](https://github.com/pytorch/pytorch/pull/139449#issuecomment-2452723863))
2024-11-01 23:42:16 +00:00
PyTorch MergeBot
b617d4813c Revert "fix dynamo tracking numpy 2 ops (#138686)"
This reverts commit 124eac255e.

Reverted https://github.com/pytorch/pytorch/pull/138686 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I am seeing an inductor failure in the hf_BigBird number-of-graph-breaks metric after it lands ([comment](https://github.com/pytorch/pytorch/pull/138686#issuecomment-2452718164))
2024-11-01 23:34:06 +00:00
Nikita Shulga
77b72d686e [BE][MPS] Make metal shaders compile cleanly (#139522)
I.e. without warnings, by deleting dead code and fixing one
signed-unsigned comparison warning.

Also, pass `-Werror` to the Metal compiler if the WERROR option is set
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139522
Approved by: https://github.com/Skylion007
2024-11-01 23:22:47 +00:00
eellison
2382b3b6d8 [Easy] Add joint graph passes, fallback_random to bisector (#139295)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139295
Approved by: https://github.com/zou3519, https://github.com/exclamaforte
2024-11-01 23:21:53 +00:00
Gabriel Ferns
1e73842029 Refactor FxGraphDrawer to use HTML-like labels (#137726)
Fixes https://github.com/pytorch/pytorch/issues/137499
Testing: Added a new unit test to make sure that the regression case succeeds.
I'm debating whether to make the borders visible. I'm partial to no borders, but it might make the graph harder for some people to read?
![68a2b0e3-orig_fx_graph_diagram](https://github.com/user-attachments/assets/fbc2fd98-9e76-488e-8ebe-c64fbf206932)
Vs.
![2bfe1c4f-orig_fx_graph_diagram](https://github.com/user-attachments/assets/b6bc88ba-dda2-4cf7-84ac-a615e1e03a74)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137726
Approved by: https://github.com/eellison, https://github.com/malfet
2024-11-01 23:19:50 +00:00
David Berard
60542eeb33 [inductor] set sanitize_overflow=False for triton kernels (#139502)
In upstream triton, https://github.com/triton-lang/triton/pull/4589 introduces overflow checks. However, overflow checks likely add some overhead, and have some correctness bugs at the moment (e.g. https://github.com/triton-lang/triton/pull/5033). Let's set `sanitize_overflow=False` but keep `debug=True` so that we can keep using device_assert but without the additional asserts added by `sanitize_overflow`.
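
For reference, a minimal sketch of a Triton kernel that relies on an explicit `tl.device_assert` (the kind of assert `debug=True` keeps enabled); how `sanitize_overflow=False` is threaded through compilation varies by Triton version and is not shown here:

```py
import triton
import triton.language as tl

@triton.jit
def add_one(x_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    # Explicit device-side assert; kept even with sanitize_overflow=False.
    tl.device_assert(x == x, "NaN detected")
    tl.store(x_ptr + offs, x + 1, mask=mask)
```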

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139502
Approved by: https://github.com/bertmaher
2024-11-01 23:10:21 +00:00
Huy Do
da395384a2 Delete Windows GPU jobs in periodic (#139336)
As an outcome of https://fburl.com/gdoc/voce5o06, we could stop running Windows GPU tests in periodic, pending the green light from MS. No one is monitoring these jobs at the moment.

We already have Windows CUDA and CPU build jobs in trunk.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139336
Approved by: https://github.com/ZainRizvi, https://github.com/wdvr, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-11-01 22:26:22 +00:00
Shuqiang Zhang
4c64a7f33f [pgnccl] add a restart test for PGs in blocking mode (#139496)
Summary:
Restarting (aborting and re-initializing a PG) is a basic need if we want
to achieve in-process restart of PGs without tearing down the whole
process.

Add this test to verify that this is supported by current NCCL.
Note that this restart test passes steadily only in blocking mode for now.
In nonblocking mode, there is a problem in either NCCL init or abort that
needs further investigation.
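
As a rough illustration of the restart pattern being exercised (a minimal sketch using public torch.distributed APIs, not the actual test; it assumes a CUDA build and launch via torchrun):

```py
# Hypothetical sketch: tear down the NCCL PG and re-initialize it without
# restarting the process.
import torch
import torch.distributed as dist

def init_use_teardown():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    t = torch.ones(1, device=f"cuda:{rank % torch.cuda.device_count()}")
    dist.all_reduce(t)
    dist.destroy_process_group()

init_use_teardown()  # first lifecycle
init_use_teardown()  # "restart": same process, new PG

```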
Test Plan:
new UT

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139496
Approved by: https://github.com/c-p-i-o, https://github.com/kwen2501
2024-11-01 22:13:37 +00:00
Huy Do
0b13bdd877 Delete parallelnative jobs in periodic (#139328)
As an outcome of https://fburl.com/gdoc/voce5o06, we can now clean up parallelnative build and test jobs in periodic. There is not much value in running them anymore.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139328
Approved by: https://github.com/wdvr, https://github.com/malfet
2024-11-01 22:05:13 +00:00
Huy Do
8eb75cbad6 Delete iOS jobs from periodic (#139345)
As an outcome of https://fburl.com/gdoc/voce5o06, and confirmed with @iseeyuan, we can now clean up iOS lite interpreter jobs on PyTorch CI. There is not much value in running them anymore.

It's stated in https://github.com/pytorch/ios-demo-app/blob/master/README.md that ExecuTorch is the replacement now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139345
Approved by: https://github.com/wdvr, https://github.com/malfet
2024-11-01 22:04:27 +00:00
Huy Do
8ad76efb8d Delete Vulkan jobs from periodic (#139354)
As an outcome of https://fburl.com/gdoc/voce5o06, we can clean up this job now, as the backend has been marked as deprecated (https://pytorch.org/tutorials/prototype/vulkan_workflow.html), to be replaced by the ExecuTorch Vulkan delegate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139354
Approved by: https://github.com/wdvr, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-11-01 22:03:12 +00:00
Mikayla Gawarecki
a979318ef7 Add section to serialization note re weights_only (#139433)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139433
Approved by: https://github.com/malfet
ghstack dependencies: #138936, #139221
2024-11-01 21:51:50 +00:00
Nikita Shulga
a1f854f270 [MPS] Compile kernels into Metallib (#138636)
The PyTorch MPS backend for the most part relies on MPSGraph to provide specific operations, but recently, more and more often, one had to implement custom kernels that were simply embedded in the operator codebase and compiled directly using [`- id<MTLLibrary>newLibraryWithSource:options:error:`](https://developer.apple.com/documentation/metal/mtldevice/1433431-newlibrarywithsource) (the first Metal kernel in the MPS backend was added in https://github.com/pytorch/pytorch/pull/82307).
Later on, as the number of operators grew, those were refactored into the `MetalShaderLibrary` convenience class (see https://github.com/pytorch/pytorch/pull/125550).

But as the number of kernels keeps growing, it's time to take the next step and properly compile them into a `.metallib`.

This PR does exactly that by:
 - Moving shader sources into separate .metal files
 - Adding a check on whether full Xcode is installed or just DeveloperTools
 - If full Xcode is installed, compiling and linking shaders into a .metallib for the Metal 3.0 (available on macOS 13) and Metal 3.1 (available on macOS 14, can use bfloat) standards, and bundling both using the `-sectcreate` linker option and the `getsectiondata` API call. The `metallib_dummy.cpp` file is used to properly express dependencies between the metallib build and torch_cpu link stages. Logic for generating the metallibs is loosely based on https://github.com/ml-explore/mlx/blob/main/mlx/backend/metal/kernels/CMakeLists.txt.
 - If only the DeveloperTools CLI is installed, automatically wrapping each .metal file into a `_metallib.h` that contains the shader source wrapped in `MetalShaderLibrary`

The bulk of the changes introduced in this PR is just moving code around. I.e., for every file that contains a non-templated shader definition in the `aten/src/ATen/native/mps/operators` folder, a corresponding `.metal` file is created in the `aten/src/ATen/native/mps/kernels` folder, and the embedded shader definition is replaced with the following
```cpp
#ifndef PYTORCH_JIT_COMPILE_SHADERS
static auto& lib = MetalShaderLibrary::getBundledLibrary();
#else
#include <ATen/native/mps/OpName_metallib.h>
#endif
```

Some historical stats:
| PyTorch Version  | Number of shaders in MPS | Ops added |
| ------------- | ------------- | ---- |
| 1.12  | 0  | |
| 1.13  | 2  | bitwise_ops and  index.out |
| 2.0  | 4  | cross, repeat and view  |
| 2.1  | 9   | unary_ops, histogram, renorm, binary_ops |
| 2.2  | 11   | gamma and bucketization |
| 2.3  | 12  | naive_matmul (to workaround crash) |
| 2.4 | 13 | quantized_mm |
| 2.5 | 14 | fused_adam |

Pros:
  - Better code structure/readability
  - Eventually allows one to use shared headers (and implement something like `TensorIterator`)
  - Faster runtime (as compilation is done ahead of time) and perhaps better optimized compiled kernels

Cons:
  - Build process is a bit more complicated than it used to be
  - Need to maintain two codepaths (as our CI builders only have DeveloperTools installed)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138636
Approved by: https://github.com/manuelcandales
2024-11-01 21:47:20 +00:00
Edward Z. Yang
a6630bcf87 Profile guided optimization for automatic_dynamic (#139001)
Previously: https://github.com/pytorch/pytorch/pull/138052 but the implementation is done from scratch, so I open a new PR.

This implements the ability to save and load profiles of automatic dynamic decisions, so on subsequent runs we can directly make something automatically dynamic. Unlike the previous implementation, this cache is never enabled by default; instead, you have to specify a "job id" that says it's OK to share results. We will be able to automatically populate this id for internal MAST jobs, but generic OSS users will have to explicitly opt into it.
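
A hedged usage sketch (the knob name below, `TORCH_COMPILE_JOB_ID`, is an assumption based on this description, not confirmed API):

```py
# Hypothetical opt-in: runs that declare the same job id may share saved
# automatic_dynamic profiles across process invocations.
import os
os.environ["TORCH_COMPILE_JOB_ID"] = "my-training-job-v1"  # assumed knob

import torch

@torch.compile
def f(x):
    return x * 2

f(torch.randn(4))  # decisions recorded; a later run with the same job id
f(torch.randn(8))  # could mark these dims dynamic up front
```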

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Differential Revision: [D65065497](https://our.internmc.facebook.com/intern/diff/D65065497)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139001
Approved by: https://github.com/oulgen
2024-11-01 21:43:25 +00:00
Xuan Zhang
9c2ffce71a add condition for freeable input buffer (#139480)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139480
Approved by: https://github.com/yf225
ghstack dependencies: #139396
2024-11-01 21:15:40 +00:00
Huy Do
18f3b3c991 Clean up Android jobs in CI (#139350)
As an outcome of https://fburl.com/gdoc/voce5o06, and confirmed with @iseeyuan, we can now clean up Android lite interpreter jobs on PyTorch CI. There is not much value in running them anymore.

It's stated in https://github.com/pytorch/android-demo-app/blob/master/README.md that ExecuTorch is the replacement now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139350
Approved by: https://github.com/ZainRizvi
2024-11-01 21:10:19 +00:00
Sam Larsen
c412a42ae2 [pt2 logging] move remote cache get/put logging up one level (#139423)
Summary: I need to refactor the way we record CompilationMetrics. Having the relevant timing code in the OSS area of the codebase will make that refactor much easier. I doubt this meaningfully changes the values we see.

Test Plan: Made sure samples show up: https://fburl.com/scuba/dynamo_compile/sandbox/c38zjq0x

Differential Revision: D65290089

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139423
Approved by: https://github.com/oulgen
2024-11-01 21:06:59 +00:00
Animesh Jain
0e57f2b589 [invoke_subgraph] Change the joint_graph output signature to simplify min-cut partitioner (#139326)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139326
Approved by: https://github.com/zou3519
ghstack dependencies: #139216, #139130
2024-11-01 21:02:32 +00:00
Animesh Jain
6a268c3fbb [invoke_subgraph] Generate fake_inputs correctly (#139130)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139130
Approved by: https://github.com/zou3519
ghstack dependencies: #139216
2024-11-01 21:02:32 +00:00
Animesh Jain
4c756cacfd [invoke_subgraph] Re-enable fake tensor model in the fake tensor impl (#139216)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139216
Approved by: https://github.com/zou3519
2024-11-01 21:02:32 +00:00
Justin Chu
5d67efb809 [ONNX] New registration API (#135403)
The ONNX custom ops registration API.

## Design

1. Create a `custom_translation_table: dict[Callable, Sequence[Callable] | Callable]` parameter for specifying extra functions
2. Use a callable as the key to support all possible call_function targets in the fx graph
3. Allow a callable or a Sequence of callables as values.
   - When there is a single callable, it is the translation function for the op
   - When there is a Sequence of callables, the exporter's dispatcher will dispatch to these callables in order based on input dtypes.
   - The translation functions can be plain Python functions that call onnxscript ops (traced), or onnxscript functions.
   - Complex input support: we create special type annotations for annotating real representations of complex inputs, which are needed to handle complex computation in the ONNX graph, as we don't have any ops in ONNX that handle complex inputs. The dispatcher will have knowledge of these newly created type annotations and dispatch correctly. The complex functions will be in the same overload pool as the real functions.

```py
torch.onnx.export(
    dynamo=True,
    custom_translation_table={
        torch.ops.aten.add: [overload1, overload2],
        torch.sym_not: sym_not_onnx,
    },
)
```
Support for functions that handle complex inputs will be in separate PRs.

fixes https://github.com/pytorch/pytorch/issues/138391

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135403
Approved by: https://github.com/titaiwangms
2024-11-01 20:58:54 +00:00
Natalia Gimelshein
f5b9e725d1 use more elements per thread for narrow dtypes (#139449)
Fix a perf issue for narrow dtypes by accessing more elements per thread

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139449
Approved by: https://github.com/Chillee, https://github.com/eqy
2024-11-01 20:41:13 +00:00
Jason Ansel
73c0762a34 [inductor] Simplify remove_kernel_local_buffers (#139452)
I plan to reuse `can_buffer_be_removed_through_fusion` in some heuristics.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139452
Approved by: https://github.com/shunting314
ghstack dependencies: #139364, #139365, #139370
2024-11-01 20:36:39 +00:00
Bert Maher
dcdcb8b364 Avoid overflow in float32-to-int32 test (#139489)
Summary:

Triton has added some integer overflow detection when kernels are compiled with
`debug=True`, and this test results in integer overflow (the float32 bit pattern
of 2.0 is 0x40000000; multiplied by 2, that is 0x80000000, which overflows a signed int32).

Assertion `int32 overflow detected for operation mul` failed
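
The arithmetic, spelled out with a quick Python check of the bit patterns involved:

```py
import struct

# float32 2.0 has bit pattern 0x40000000; doubling that integer value gives
# 0x80000000, which exceeds INT32_MAX (0x7fffffff).
bits = struct.unpack("<i", struct.pack("<f", 2.0))[0]
print(hex(bits))      # 0x40000000
print(hex(bits * 2))  # 0x80000000 -> overflows a signed int32
```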

Fixes #139479

Test Plan:
```
python inductor/test_torchinductor.py -k test_float32_to_int32_cuda
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139489
Approved by: https://github.com/eellison, https://github.com/jansel, https://github.com/chenyang78
2024-11-01 20:22:19 +00:00
Yifu Wang
0dbc284a72 [SymmetricMemory] expose signal_pads as tensors in Python (#138754)
## This Stack

This stack does the following things to support `xformers`-style, comm-aware Triton kernels:
- Exposes `signal_pad`s as tensors in Python
- Adds a binding for `cuMemsetAsync`

These in combination aims to provide users with more flexibility to express custom signaling/synchronization patterns.

## This PR

```python
# Obtain the signal pad of the specified peer rank as a tensor.
# If both shape and dtype are unspecified, the returned tensor will be a
# 1d uint32 tensor, which is most natural for signaling purposes.
symm_mem.get_signal_pad(peer_rank)

# If only shape is specified, it is equivalent to:
# symm_mem.get_signal_pad(peer_rank)[:shape.numel()].view(shape)
symm_mem.get_signal_pad(peer_rank, shape)

# If only dtype is specified, it is equivalent to:
# symm_mem.get_signal_pad(peer_rank).view(dtype)
symm_mem.get_signal_pad(peer_rank, dtype=dtype)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138754
Approved by: https://github.com/weifengpy, https://github.com/lw
2024-11-01 20:17:15 +00:00
Haifeng Jin
124eac255e fix dynamo tracking numpy 2 ops (#138686)
Fixes #136559
As we upgrade to NumPy 2, torch incorrectly filtered out `numpy.random` as unsupported in Dynamo tracing.
This PR changes the filtering rules to include those ops while keeping the NumPy 1 behavior unchanged.

Before this PR, the following tests failed:

```
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/dynamo/test_functions.py -k FunctionTests.test_numpy_random
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/dynamo/test_unspec.py -k UnspecTests.test_to_tensor
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/test_fake_tensor.py -k FakeTensorTest.test_export_numpy
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/test_fake_tensor.py -k PropagateRealTensorsFakeTensorTest.test_export_numpy_propagate_real_tensors
```

With this PR, the supported/unsupported ops in NumPy 1 are not changed.
For NumPy 2, only the `numpy.random` ops that are already supported with NumPy 1 are added to the supported list.

I used the following scripts to check the differences before and after the change for both NumPy 1 & 2.
The output is empty for NumPy 1 since there is no change.
The output for NumPy 2 is the list of `numpy.random` ops that are considered supported.

```py
from torch._dynamo import trace_rules
import numpy as np

def new_numpy_function_ids():
    unsupported_funcs = {"seed", "ranf", "get_bit_generator", "RandomState", "set_bit_generator", "sample"}

    def is_supported(k, v, mod):
        if not callable(v):
            return False
        if not getattr(v, "__module__", None):
            return True
        if v.__module__ == mod.__name__:
            return True
        if v.__module__ == "numpy.random.mtrand" and mod.__name__ == "numpy.random" and k not in unsupported_funcs:
            return True
        return False
    rv = {}
    for mod in trace_rules.NP_SUPPORTED_MODULES:
        for k, v in mod.__dict__.items():
            if is_supported(k, v, mod):
                rv[id(v)] = f"{mod.__name__}.{k}"
    return rv

def old_numpy_function_ids():
    rv = {}
    for mod in trace_rules.NP_SUPPORTED_MODULES:
        rv.update(
            {
                id(v): f"{mod.__name__}.{k}"
                for k, v in mod.__dict__.items()
                if callable(v)
                and (getattr(v, "__module__", None) or mod.__name__) == mod.__name__
            }
        )
    return rv

rv1 = set(old_numpy_function_ids().values())
rv2 = set(new_numpy_function_ids().values())

for v in (rv1 - rv2):
    print(v)
print("****")
for v in (rv2 - rv1):
    print(v)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138686
Approved by: https://github.com/lezcano, https://github.com/williamwen42
2024-11-01 19:51:40 +00:00
Mikayla Gawarecki
ea0e09b3f3 Add utility to get all unsafe globals in checkpoint (no pickletools dependency) (#139221)
Fixes https://github.com/pytorch/pytorch/issues/129698

https://github.com/pytorch/pytorch/pull/139106 without pickletools

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139221
Approved by: https://github.com/malfet
ghstack dependencies: #138936
2024-11-01 19:31:39 +00:00
rzou
f3b485eb2a [reland] Flip triton kernel default layout constraint to "needs_fixed_stride_order" (#137064)
This is to match the default layout constraint for custom operators. By
default, Inductor should match the stride order of inputs to a triton
kernel.

IF THIS IS BREAKING YOU, PLEASE REACH OUT, especially if it's been
more than two weeks since this landed. You can flip the config locally
as a workaround.
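
A sketch of that local workaround (assuming the knob is Inductor's `triton_kernel_default_layout_constraint` config; verify the name against your torch version):

```py
# Hedged workaround: opt back into flexible layouts for user-defined
# triton kernels. The config name/value here are assumptions.
import torch._inductor.config as inductor_config

inductor_config.triton_kernel_default_layout_constraint = "flexible_layout"
```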

Test Plan:
- existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137064
Approved by: https://github.com/albanD, https://github.com/eellison
2024-11-01 19:21:16 +00:00
Colin L. Rice
abc5d59dcb config: create Config objects with JK support (#138766)
This teaches install_config_module (and the underlying code) to
understand Config objects. Additionally, we've added a JK option to this
which resolves the JK.

This config gets stored within the _ConfigEntry class and is evaluated
when __getattr__ is called. If justknobs is set, it'll call
justknobs_check to see the result.

Due to preceding work, basically everything works correctly here; we
had to update a couple of tests and modify the getattr behaviour.

Note that we are updating the justknob_check function to support a
default option, to make defaults work.
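
A toy mock of the described flow (hypothetical names, not the real torch.utils._config_module API):

```py
# Purely illustrative: a config entry whose value defers to a justknobs
# check at attribute-access time, with the default passed through.
class ConfigEntry:
    def __init__(self, default, justknob=None):
        self.default = default
        self.justknob = justknob

def justknobs_check(name, default):
    # Stand-in for the real JK lookup; here it just returns the default.
    return default

def resolve(entry):
    if entry.justknob is not None:
        return justknobs_check(entry.justknob, entry.default)
    return entry.default

print(resolve(ConfigEntry(default=True, justknob="pytorch/inductor:flag")))
```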
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138766
Approved by: https://github.com/ezyang
2024-11-01 19:20:37 +00:00
eqy
6fc63b4ef1 [ROCM][CUDA][NCCL] Disable test_lowering_one_shot_all_reduce on ROCM (#139414)
I'm not sure this is expected to run if it requires buffer-registration support. CC @yifuwang @huydhn @syed-ahmed (see #138029)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139414
Approved by: https://github.com/huydhn, https://github.com/yifuwang
2024-11-01 18:39:47 +00:00
Jason Davies
391ee62180 Ensure scalar tensor device matches attn_mask for convert_boolean_attn_mask_cudnn. (#139450)
This is causing a small performance hit when using SDPA with the cuDNN backend due to unnecessary host-to-device memcpy.
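
A small illustration of the general pattern (hypothetical, not the actual patch): materialize scalars on the mask's device so converting a boolean mask to an additive one does not trigger a host-to-device copy.

```py
import torch

attn_mask = torch.rand(4, 4, device="cuda") > 0.5  # boolean mask on GPU
# Create the scalars directly on attn_mask.device instead of on the CPU.
zero = torch.tensor(0.0, device=attn_mask.device)
neg_inf = torch.tensor(float("-inf"), device=attn_mask.device)
additive_mask = torch.where(attn_mask, zero, neg_inf)
```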
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139450
Approved by: https://github.com/drisspg, https://github.com/eqy
2024-11-01 18:38:02 +00:00
Sam Larsen
d8b606ecb5 [fx graph cache] Support freezing with FX graph caching (#136505)
Summary: The main changes to support freezing are:
1) When pickling constant tensors as part of the cache key calculation: If freezing has not been applied, then keep the existing behavior (pickle the metadata and values). If freezing has been applied, then pickle the values if the constant will be inlined; otherwise, consider only the metadata.
2) If freezing has been applied, modify what we store in the cache: Instead of storing the constant attributes in the cache entry, store the _names_ of the constants, and then grab those constants from the GraphModule when we need to attach the attributes to a newly-loaded Python module. Since the cache lookup path loads the Python module, this bullet means we need to thread a GraphModule argument through several places.
3) Since this feature means that we may need to reload the same Python module path more than once (but attach different constant attributes), I changed PyCodeCache.load_by_key_path to not store an in-memory map of path to module (since there may be more than one). I don't _think_ this will have any effect on performance, however. It's unclear why we were using an in-memory cache here anyway, since this function should only be called once for each module that needs to be loaded.
4) Several tests were removing on-disk PyCodeCache artifacts by iterating over the modules. I made this more straightforward by implementing a cache_clear method that removes the on-disk artifacts. Arguably, this should have been the implementation all along.
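
A condensed sketch of the cache-key rule in (1) (hypothetical helper, not the actual FxGraphCache code):

```py
import torch

def constant_key(tensor, frozen: bool, will_be_inlined: bool):
    """Hypothetical distillation of the constant-hashing rule above."""
    meta = (tuple(tensor.shape), str(tensor.dtype), str(tensor.device))
    if not frozen or will_be_inlined:
        # Pre-freezing behavior, or an inlined frozen constant: hash values too.
        return (meta, tensor.detach().cpu().contiguous().numpy().tobytes())
    return meta  # frozen, non-inlined constant: metadata only

print(constant_key(torch.ones(2, 2), frozen=True, will_be_inlined=False))
```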

Differential Revision: [D63542170](https://our.internmc.facebook.com/intern/diff/D63542170)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136505
Approved by: https://github.com/eellison
2024-11-01 18:29:29 +00:00
vladimirrotariu
7d644f025f make equation behind torch.isclose element-wise (#138459)
The current formula behind torch.isclose, according to the docs, is
![imagen](https://github.com/user-attachments/assets/6b79f6d8-e675-4585-b26b-0c6933f7ecdd)

However, torch.isclose acts element-wise, so this formula may be misleading at first, given that the docs say `input` and `other` are, respectively, the first and second tensors to compare. I propose the following change, to stress the element-wise nature of the norms in the equation:
![imagen](https://github.com/user-attachments/assets/2926a1c6-c4fa-4c48-8874-106521d3f54c)
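
In text form, the proposed element-wise statement (indices added per the screenshots above; `atol`/`rtol` are the usual torch.isclose tolerances):

```math
\lvert \text{input}_i - \text{other}_i \rvert \leq \text{atol} + \text{rtol} \times \lvert \text{other}_i \rvert
```
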
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138459
Approved by: https://github.com/soulitzer
2024-11-01 18:18:33 +00:00
Nikita Shulga
1857be1b48 Fix S390 builds (#139491)
Caused by https://github.com/pytorch/pytorch/pull/137918. Fixed by guarding all cpuinfo use with `!defined(__s390x__) && !defined(__powerpc__)`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139491
Approved by: https://github.com/huydhn, https://github.com/Skylion007
2024-11-01 18:16:29 +00:00
Nikita Shulga
51adab0829 [MPS] Fix reduction ops outputs for empty tensors (#139446)
By adding a switch over all reduction types that either sets the output to a given value or raises a runtime error.
Before this change, reduction ops returned uninitialized values in many cases.

Fixes https://github.com/pytorch/pytorch/issues/139400

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139446
Approved by: https://github.com/Skylion007
2024-11-01 17:32:12 +00:00
Bin Bao
7d081cabfb [AOTI] Forward fix #139458 (#139485)
Summary: A new test added in https://github.com/pytorch/pytorch/pull/139458 only fails on certain CI instances. Skip it for now, as the failing test has a low priority.

@diff-train-skip-merge (to silence the fb bot so that I can land this myself)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139485
Approved by: https://github.com/huydhn, https://github.com/hl475
2024-11-01 17:14:40 +00:00
Scott Wolchok
3e0f4d18eb [PyTorch] Support non-zero beta in fp16_gemv_trans (#138275)
No real reason to have the zero-beta restriction, so let's lift it.
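
For context, this is the standard BLAS-style gemv-with-transpose; with a non-zero beta the kernel must accumulate into the existing output rather than overwrite it:

```math
y \leftarrow \alpha A^{T} x + \beta y
```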

Testing: intentionally broke new paths locally to verify test coverage existed

Differential Revision: [D64407752](https://our.internmc.facebook.com/intern/diff/D64407752/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138275
Approved by: https://github.com/malfet
ghstack dependencies: #139082, #139083, #137918, #138005
2024-11-01 16:49:05 +00:00
Scott Wolchok
195b1b9a9b [PyTorch] Hook up fp16_gemv_trans to gemv fast path for non-aarch64 architectures (#138005)
Following up on the previous rev to use fp16_gemv_trans in gemv, not just gemm-used-for-gemv.

Differential Revision: [D64351092](https://our.internmc.facebook.com/intern/diff/D64351092/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138005
Approved by: https://github.com/malfet
ghstack dependencies: #139082, #139083, #137918
2024-11-01 16:49:05 +00:00
Scott Wolchok
fad5d89321 [PyTorch] Hook up fp16_gemv_trans to x86 fp16 GEMM (#137918)
This is the first big milestone we've been building towards!
(Following rev also hooks this up to actual gemv.)
Testing: To check perf, I ran `python torchchat.py generate stories110M --dtype fp16 --device cpu` on an x86 machine without AVX512FP16. Observed a roughly 5x tokens/sec increase.
Differential Revision: [D64280688](https://our.internmc.facebook.com/intern/diff/D64280688/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D64280688/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137918
Approved by: https://github.com/malfet
ghstack dependencies: #139082, #139083
2024-11-01 16:48:56 +00:00
Scott Wolchok
d79c5143d8 [PyTorch] Add efficient isnan for NEON half (#139083)
Same as the efficient one for float when f16 hardware support is available.

Testing: Added exhaustive isnan test coverage

Differential Revision: [D65003321](https://our.internmc.facebook.com/intern/diff/D65003321/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139083
Approved by: https://github.com/malfet
ghstack dependencies: #139082
2024-11-01 16:40:51 +00:00
Scott Wolchok
9ecd7d1587 [PyTorch] Add efficient isnan for NEON float (#139082)
Just test x != x rather than applying element-by-element scalar isnan.
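
The underlying identity, checked in plain Python (NaN is the only floating-point value that compares unequal to itself):

```py
import math

x = float("nan")
print(x != x)         # True: NaN != NaN
print(math.isnan(x))  # True: same answer as the self-inequality test
print(1.0 != 1.0)     # False: ordinary values equal themselves
```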

Testing: vec_test_all_types checks IsNan

Differential Revision: [D65001633](https://our.internmc.facebook.com/intern/diff/D65001633/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139082
Approved by: https://github.com/malfet
2024-11-01 16:40:51 +00:00
sanchitintel
3cbf0c0bbf [Inductor][CPP] Cache weight tiles in L1D for AMX int8 WoQ GEMM (#136688)
# Summary

The AMX ISA based GEMM micro-kernel template for int8 weight-only quantization (BF16 activation, int8 weights) should cache dequantized weights (int8 -> int32 -> fp32 -> bf16) so that they would not have to be dequantized again in subsequent calls to the _inner-kernel_ that uses the same weights.

This change leverages the fact that even for BF16 x BF16 GEMM template, cache-blocking ensures that `Nr * Kc` weight elements are cached in L1D cache (more info [here](https://static.sched.com/hosted_files/pytorch2024/59/TorchInductor%20CPU%20Backend%20Advancements%20-%20New%20Features%20and%20Performance%20Improvements_20240915.pdf)). Here, `Nr` is the register blocking size for `N` dimension (at the granularity of the GEMM micro-kernel, it's currently also the cache blocking size for `N` dimension, although that may change in the future), and `Kc` is the cache blocking size for `K` dimension.
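
For a feel of the sizes involved (illustrative numbers only; the template's actual `Nr`/`Kc` are chosen by cache-blocking heuristics):

```py
# Hypothetical tile-footprint check for an Nr x Kc bf16 weight tile.
Nr, Kc = 32, 512               # assumed blocking sizes, for illustration
tile_kib = Nr * Kc * 2 / 1024  # 2 bytes per bf16 element
print(f"{tile_kib:.0f} KiB")   # 32 KiB -> fits in a 48 KiB L1D cache
```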

The figure below is from the document linked above -

<img width="476" alt="image" src="https://github.com/user-attachments/assets/e23e5476-d910-46d1-a9b3-cbf77de76d94">

## Performance data

Collected on 48 physical cores of one socket of Intel Xeon  Platinum 8468H (Xeon SP 4th gen). Intel OpenMP & tcmalloc were preloaded.

|M | N | K | Latency with ATen _weight_int8pack_mm | Latency with codegened templated GEMM (current main branch) | Latency with codegened templated GEMM (this PR) |
|-----|-----|-----|------|----------|----|
|4096|4096|4096| 45.844 ms | 9.322 ms| 5.2181 ms |
|4096|11008|4096| 127.618 ms |24.6258 ms | 13.6046 ms|
|4096|4096|11008| 121.953 ms | 25.4692 ms | 10.2669 ms |
|4096|32000|4096| 478.450 ms| 75.3942 ms | 48.21 ms |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136688
Approved by: https://github.com/jgong5
2024-11-01 16:32:22 +00:00
Jason Ansel
b57b4b7f9b [inductor] Move remove_kernel_local_buffers to Kernel (#139370)
This method mutates the kernel, so it fits better in that class.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139370
Approved by: https://github.com/shunting314
ghstack dependencies: #139364, #139365
2024-11-01 16:28:15 +00:00
Jason Ansel
1e934b473c [inductor] Remove Node.last_usage mutation (#139365)
I can't figure out why this is needed.  Let's see if tests fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139365
Approved by: https://github.com/shunting314
ghstack dependencies: #139364
2024-11-01 16:28:15 +00:00
Jason Ansel
286d3ce266 [inductor] Remove SIMDKernel.last_usage (#139364)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139364
Approved by: https://github.com/eellison, https://github.com/shunting314
2024-11-01 16:28:15 +00:00
Shuqiang Zhang
df0c1eceb9 [pgnccl][simple] clean up unused members of PGNCCL (#139436)
Summary:
Found these unused members while prototyping something else.
Better to remove unused members.
Test Plan:
CI

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139436
Approved by: https://github.com/Skylion007
2024-11-01 16:25:04 +00:00