Commit Graph

80776 Commits

Author SHA1 Message Date
Yu, Guangye
8051ee802c Add XPU compiler version control in cmake to keep BC (#139258)
# Motivation
This PR aims to maintain backward compatibility when building PyTorch XPU with the old and new compilers.

# Additional Context
The details are described below. The new compiler (2025.0.0) has some breaking changes compared with the old compiler (2024.1), for example:
1. On Windows, the SYCL library is named `sycl7.lib` in the old compiler but `sycl.lib` in the new compiler.
2. On Linux, to support ABI=0, we have to link `libsycl-preview.so` with the old compiler, whereas with the new compiler we can link `libsycl.so` and get the same ABI compatibility.
3. We added a macro `SYCL_COMPILER_VERSION` so that our new code remains backward compatible with the old compiler. The new features introduced by the new compiler (Event elapsed_time, memory summary, and the device architecture property) are now gated by the `SYCL_COMPILER_VERSION` macro.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139258
Approved by: https://github.com/EikanWang, https://github.com/atalman, https://github.com/gujinghui
2024-11-09 13:31:21 +00:00
xinan.lin
191971e01d [AOTI] Introduce an extensibility mechanism for the c shim codegen to make it easy to produce c shims for out-of-tree OP kernels as well. Add c_shim for XPU. (#136742)

### Motivation
The current c shim codegen only produces C wrappers for ops registered in `aten/src/ATen/native/native_functions.yaml`. For the same backend, a portion of out-of-tree ops are not registered in that file but are registered externally, for example in `third_party/torch-xpu-ops/yaml/native_functions.yaml`. In that case, the existing codegen cannot extend the already-produced in-tree c shims with those out-of-tree ops.

### Design
To extend the c shim with more out-of-tree ops for a backend, this PR adds a boolean option `--aoti-extend` that tells the codegen to extend the c shim with out-of-tree ops.
The generated c shim is stored in the `extend` subdirectory, for example:
```
torch/include/torch/csrc/inductor/aoti_torch/generated/c_shim_xpu.h
torch/include/torch/csrc/inductor/aoti_torch/generated/c_shim_xpu.cpp
torch/include/torch/csrc/inductor/aoti_torch/generated/extend/c_shim_xpu.h
torch/include/torch/csrc/inductor/aoti_torch/generated/extend/c_shim_xpu.cpp
```
Example usage:
`python -m torchgen.gen --source-path third_party/torch-xpu-ops/yaml/ --xpu --aoti-extend --update-aoti-c-shim`
`--xpu`: generate the c shim for XPU
`--aoti-extend`: the out-of-tree ops (defined in `third_party/torch-xpu-ops/yaml/native_functions.yaml`) extend the in-tree ops (defined in `aten/src/ATen/native/native_functions.yaml`)
`--update-aoti-c-shim`: always generate c_shim_xpu.h for the extended c shim
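
Conceptually, `--aoti-extend` just redirects where the generated shims land; a rough sketch of that routing (the helper below is hypothetical, not the actual torchgen code):

```python
import os

def c_shim_output_dir(base_dir: str, aoti_extend: bool) -> str:
    # Hypothetical helper: with --aoti-extend, out-of-tree shims land in an
    # "extend" subdirectory next to the in-tree ones, so both can coexist
    # for the same backend.
    return os.path.join(base_dir, "extend") if aoti_extend else base_dir

print(c_shim_output_dir("torch/csrc/inductor/aoti_torch/generated", aoti_extend=True))
# torch/csrc/inductor/aoti_torch/generated/extend
```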

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136742
Approved by: https://github.com/EikanWang, https://github.com/desertfire
ghstack dependencies: #139025
2024-11-09 13:19:52 +00:00
xinan.lin
929a647363 [Intel GPU] Support RegisterXPU.cpp codegen and compile for the in-tree XPU structured GEMM OPs. (#139025)

Motivation: There are two parts to the ATen ops for XPU: in-tree ops such as the GEMM-related ops, and out-of-tree ops in torch-xpu-ops. For the in-tree part, since PyTorch uses native_functions.yaml registration and comes with convenient codegen capabilities, we want to take advantage of those benefits as well.
At the same time, since AOT Inductor also uses native_functions.yaml to generate c shim wrappers, we also need to enable this mechanism for XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139025
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/desertfire
2024-11-09 13:09:27 +00:00
Andrea Frittoli
0b650c360a Build magma for windows (#139924)
Copy the magma-for-Windows job and script from pytorch/builder (c9aac65e12/.github/workflows/build-magma-windows.yml).

The Linux version was moved here in https://github.com/pytorch/pytorch/pull/139888.

Fixes #140001

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139924
Approved by: https://github.com/atalman
2024-11-09 09:27:59 +00:00
Boyuan Feng
e2e425b4f3 [CUDAGraph] Add dynamo timer to checkpoint, warmup, and record (#139818)
Summary: Add timing logs to cudagraph, including `create deferred_cudagraphify wrapper`, `warmup`, `record`, and `checkpoint`.
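
For illustration, the kind of per-phase timing being added looks roughly like this (a plain stand-in timer, not the actual internal dynamo timing utility):

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def phase_timer(name):
    # Accumulate wall-clock time spent in each named cudagraph phase.
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = timings.get(name, 0.0) + (time.perf_counter() - start)

with phase_timer("warmup"):
    pass  # run the graph once eagerly to warm up
with phase_timer("record"):
    pass  # capture the CUDA graph
with phase_timer("checkpoint"):
    pass  # checkpoint/restore memory state

print(timings)  # e.g. {'warmup': ..., 'record': ..., 'checkpoint': ...}
```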

Test Plan:
1. buck2 run fbcode//mode/opt //pytorch/benchmark:run -- resnet50 -d cuda -t train --inductor --pt2-triton-cudagraph

2. Found the result in [scuba table](https://fburl.com/scuba/pt2_compile_events/0oik8nu9).


Differential Revision: D65505659

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139818
Approved by: https://github.com/eellison
2024-11-09 05:27:11 +00:00
cyy
ab55a99283 Use TORCH_DECLARE_XXX (#139952)
Because those files use TORCH_API

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139952
Approved by: https://github.com/ezyang
2024-11-09 04:56:28 +00:00
Kefei Lu
d2d1258b1b Speed up AMD AOT Inductor lowering by memoizing hipify trie to regex logic (#140156)
Summary:
AMD lowering takes 1.55x longer than on H100. Profiling shows that hipification-related functions take 22% of the overall lowering time.

This diff cuts that time by safely memoizing the trie-to-regex logic. The trick is to incrementally build a state for the trie during trie construction; the state is the hash of all the words added to the trie.
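
A rough sketch of the memoization idea (the `Trie` below is a toy stand-in, and the simple alternation pattern is a simplification of the real trie-to-regex conversion):

```python
import hashlib
import re

class Trie:
    """Toy trie that can emit a regex matching every inserted word."""

    def __init__(self):
        self.children = {}
        self._state = hashlib.sha256()  # incremental hash of all added words
        self._cached_state = None
        self._cached_regex = None

    def add(self, word):
        node = self.children
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = True  # terminal marker
        self._state.update(word.encode())  # state changes whenever a word is added

    def _words(self, node=None, prefix=""):
        node = self.children if node is None else node
        for ch, child in node.items():
            if ch == "":
                yield prefix
            else:
                yield from self._words(child, prefix + ch)

    def to_regex(self):
        # Memoize on the trie's state: recompile only if words were added
        # since the last call, so repeated conversions become free.
        state = self._state.hexdigest()
        if state != self._cached_state:
            pattern = "|".join(re.escape(w) for w in sorted(self._words()))
            self._cached_regex = re.compile(pattern)
            self._cached_state = state
        return self._cached_regex

t = Trie()
t.add("cudaMalloc")
t.add("cudaFree")
assert t.to_regex() is t.to_regex()  # second call hits the memoized regex
```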

Differential Revision: D65659445

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140156
Approved by: https://github.com/ColinPeppler

Co-authored-by: Kefei Lu <kefeilu@meta.com>
2024-11-09 04:28:58 +00:00
Michael Lazos
8b2e3855a9 Make size a property with an assertion (#139794)
Fixes https://github.com/pytorch/pytorch/issues/120568
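
In spirit, the change is along these lines (an illustrative class, not the actual dynamo variable):

```python
class ExampleVariable:
    def __init__(self, size=None):
        self._size = size  # may legitimately be unknown at construction time

    @property
    def size(self):
        # Fail loudly if size is read before it has been populated, instead of
        # silently handing back None.
        assert self._size is not None, "size accessed before being set"
        return self._size
```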

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139794
Approved by: https://github.com/williamwen42
2024-11-09 03:39:41 +00:00
cyy
032135f8a2 [2/N] Turn inline static functions into static (#140068)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140068
Approved by: https://github.com/ezyang
2024-11-09 03:31:24 +00:00
Bob Ren
3b8470c461 add special case for __round__ constant variables (#139583)
Fixes `PYTORCH_TEST_WITH_INDUCTOR=1 tlp python test/test_torch.py TestTorchDeviceTypeCUDA.test_cauchy_cuda_float64` when specialize_float=False

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139583
Approved by: https://github.com/ezyang
ghstack dependencies: #139569, #139457, #139568, #139572, #139846, #139454, #139896, #139935, #139587
2024-11-09 03:25:53 +00:00
Florian (Feuermagier)
f915409c26 FlopCounterMode: Decompose ops for inference mode (#138508)
Fixes #126268

I've basically followed @ezyang's suggestion (I think) to use `func.decompose(...)`. Since `__torch_dispatch__` won't be called a second time for the same op, I've added a second `TorchDispatchMode` (`_DecomposedCounterMode`) that simply dispatches to the parent flop counter. Using `self` as the inner context manager is not possible, since the second call to `__enter__` would re-initialize the counter's tracking state.
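
A condensed sketch of that structure (illustrative classes, not the actual `FlopCounterMode` implementation; assumes `OpOverload.decompose()` returns `NotImplemented` when no decomposition exists):

```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode

class _DecomposedCounter(TorchDispatchMode):
    """Routes ops produced by a decomposition back to the parent counter."""

    def __init__(self, parent):
        super().__init__()
        self.parent = parent

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        return self.parent._count_and_run(func, args, kwargs or {})

class TinyFlopCounter(TorchDispatchMode):
    """Minimal stand-in for FlopCounterMode: counts op calls only."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def _count_and_run(self, func, args, kwargs):
        self.counts[func] = self.counts.get(func, 0) + 1
        return func(*args, **kwargs)

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        # Run the decomposition under a *separate* mode; re-entering `self`
        # would re-initialize its tracking state.
        with _DecomposedCounter(self):
            out = func.decompose(*args, **kwargs)
        if out is not NotImplemented:
            return out
        return self._count_and_run(func, args, kwargs)

with torch.inference_mode(), TinyFlopCounter() as counter:
    torch.nn.functional.linear(torch.randn(2, 3), torch.randn(4, 3))
print(counter.counts)
```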

Let me know if there's something wrong with this implementation, since I'm quite unsure how the decomposition thing actually works :D

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138508
Approved by: https://github.com/ezyang
2024-11-09 03:13:53 +00:00
Bob Ren
4488e23763 Fix another item memo loss location + bool specialization bug (#139587)
This fix was a bit more involved:
1) It fixes a place where item_memo was lost.
2) It updates a test to use eager instead of aot_eager, since aot_eager reveals a very obscure bug related to replacements that's not worth solving; in practice inductor will regenerate the runtime asserts anyway.
3) It updates tensorify to specialize in more places now that the aforementioned bug is fixed.

Fixes `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=6 python test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoCPU.test_comprehensive_linalg_norm_cpu_float16` when `specialize_float=False`

while ensuring `python test/dynamo/test_dynamic_shapes.py DynamicShapesMiscTests.test_runtime_assert_replacement_dynamic_shapes` doesn't regress

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139587
Approved by: https://github.com/ezyang
ghstack dependencies: #139569, #139457, #139568, #139572, #139846, #139454, #139896, #139935
2024-11-09 03:11:19 +00:00
wz337
4893e248a8 [DTensor][Test] Remove safe global context for weights_only torch.load() DTensor (#140173)
We have added the DTensor-related classes to the allowed globals, so we can `torch.load()` a DTensor with `weights_only=True` and no longer need the `safe_globals` context for this test.
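
A small illustration of `weights_only=True` loading, with a plain tensor standing in for a DTensor checkpoint; the commented-out `safe_globals` wrapper is roughly what the test needed before:

```python
import io
import torch

buf = io.BytesIO()
torch.save(torch.randn(4), buf)
buf.seek(0)

# weights_only=True restricts unpickling to an allow-list of safe classes.
t = torch.load(buf, weights_only=True)

# Before DTensor classes were on that allow-list, loading a DTensor checkpoint
# required wrapping the call, roughly:
#   with torch.serialization.safe_globals([DTensor, DeviceMesh, ...]):
#       torch.load(path, weights_only=True)
# With the classes registered, the plain weights_only=True call is enough.
```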

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140173
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #139949
2024-11-09 02:21:44 +00:00
Andrea Frittoli
72976b2486 Use manylinux-builder images with main tag (#140158)
The magma build uses deprecated manylinux-builder images. Update it to use the images with "main" in the tag:

  pytorch/manylinux-builder:cuda<version>-main

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140158
Approved by: https://github.com/atalman
2024-11-09 02:16:00 +00:00
Zhou, Lingzhi
2ede4c9a38 [Partitioner] Enumerate partitions by iterating partition ids (#136598)
Currently, we get all partition ids by iterating over `assignment`, whose size equals the number of nodes in the graph. But we can get the same result by iterating over `partitions_by_id`, whose size is much smaller. With N nodes and P partitions, the time complexity decreases from O(N * N) to O(N * P) after this patch.
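
A toy illustration of the two data structures and the change (the values below are made up; names follow the description above):

```python
# node -> partition id: one entry per graph node (size N)
assignment = {"n0": 0, "n1": 0, "n2": 1, "n3": 1, "n4": 1}
# partition id -> nodes in that partition (size P, with P typically << N)
partitions_by_id = {0: ["n0", "n1"], 1: ["n2", "n3", "n4"]}

# Old: recover the partition ids by scanning every node's assignment -- O(N).
partition_ids_old = set(assignment.values())

# New: enumerate partition ids directly -- O(P).
partition_ids_new = set(partitions_by_id)

assert partition_ids_old == partition_ids_new
```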

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136598
Approved by: https://github.com/ezyang

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-11-09 01:31:46 +00:00
Joel Schlosser
9c678af9f9 Misc. non-contig NJT fixes (#140160)
This PR contains several fixes related to non-contiguous NJTs:
1. Propagates `lengths` through op calls appropriately (see desc of #138098)
    * SDPA now calls `nested_view_from_values_offsets_lengths()` instead of `nested_view_from_values_offsets()`
2. Allows non-contig NJTs in unsqueeze / transpose / select
3. Expands padded dense -> NJT conversion to support non-contig NJTs
4. (unrelated sorry) Updates `split` / `split_with_sizes` to allow for optional `dim`, matching the ATen signature
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140160
Approved by: https://github.com/cpuhrsch
2024-11-09 01:18:26 +00:00
William Wen
be172d2a60 [pt2, docs] Add new PT2 troubleshooting doc (#138620)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138620
Approved by: https://github.com/ezyang

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2024-11-09 01:17:39 +00:00
Ryan Guo
de40a23f6c [dynamo] Remove dead code path for capturing __class__ in UserFunctionVariable (#140034)
This was introduced in https://github.com/pytorch/torchdynamo/commit/d0c10341
as limited support for pre-existing cells, since we know `__class__` wouldn't be modified
in most cases. It's no longer needed now that we have much more support for these cells.

Example:
```python
class Foo():
    def __init__(self):
        super().__init__()

print(Foo.__init__.__code__.co_freevars) # ('__class__',)
print(Foo.__init__.__closure__)          # (<cell at 0x1011fb310: type object at 0x10fe185b0>,)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140034
Approved by: https://github.com/williamwen42, https://github.com/anijain2305, https://github.com/jansel
ghstack dependencies: #140033
2024-11-09 01:03:24 +00:00
Ryan Guo
0b8652a999 [dynamo] Remove NestedUserFunctionVariable.closure_scope (#140033)
This was no longer needed after https://github.com/pytorch/torchdynamo/commit/663e4d92,
which removed the uses of `closure_scope` but not the field itself.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140033
Approved by: https://github.com/williamwen42, https://github.com/anijain2305, https://github.com/jansel
2024-11-09 01:03:24 +00:00
cyy
263d8f7a94 [8/N] Don't skip ASAN on some tests (#140081)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140081
Approved by: https://github.com/ezyang
2024-11-09 01:00:13 +00:00
PyTorch MergeBot
58b661cda2 Revert "[c10d][Logging] Remove args and kwargs from c10d logging (#140169)"
This reverts commit e3b2f04f05.

Reverted https://github.com/pytorch/pytorch/pull/140169 on behalf of https://github.com/ZainRizvi due to Man, this test really wants to fail on trunk. Sorry. Details:  distributed/test_c10d_logger.py::C10dErrorLoggerTest::test_exception_logger [GH job link](https://github.com/pytorch/pytorch/actions/runs/11751023962/job/32740983427) [HUD commit link](e3b2f04f05) ([comment](https://github.com/pytorch/pytorch/pull/140169#issuecomment-2465933413))
2024-11-09 00:23:43 +00:00
Peter Steinbach
090b778b8a Clarify meaning of rate parameter in Gamma distribution (#134847)
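A quick illustration of the parameterization being clarified: `rate` is the inverse scale, so for `torch.distributions.Gamma`:

```python
import torch
from torch.distributions import Gamma

# rate is the inverse scale (beta), not the scale; mean = concentration / rate.
d = Gamma(concentration=torch.tensor(2.0), rate=torch.tensor(4.0))
print(d.mean)      # tensor(0.5000)  == concentration / rate
print(d.variance)  # tensor(0.1250)  == concentration / rate**2
```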
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134847
Approved by: https://github.com/fritzo
2024-11-09 00:22:13 +00:00
PyTorch MergeBot
7eb66173e2 Revert "Fix split decomp returning self (#140065)"
This reverts commit 9d99dceb53.

Reverted https://github.com/pytorch/pytorch/pull/140065 on behalf of https://github.com/ZainRizvi due to Diff been imported internally, but merged externally. And the internal diff has been updated so the diff and PR are now mismatched.  Reverting this PR to get things back into a consistent state. See D65635070 ([comment](https://github.com/pytorch/pytorch/pull/140065#issuecomment-2465928027))
2024-11-09 00:16:26 +00:00
Mengwei Liu
a02e88d19c [miniz] Bump miniz version to 3.0.2 and add patch for zip64 (#140041)
Summary:
Bump miniz version from 2.1.0 to 3.0.2 and apply these patches:

* #79636 patches internal BUCK and bazel build
* #138959 adds `bool compute_crc32` argument
* miniz PR: https://github.com/richgel999/miniz/pull/324 to support
  zip64

Anyone bumping miniz version again, please apply these patches as well.

Test Plan:
Rely on unit test

Imported from OSS

Differential Revision: D65586230

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140041
Approved by: https://github.com/mikaylagawarecki
2024-11-09 00:13:16 +00:00
PyTorch MergeBot
1400fedf76 Revert "add supports_coalescing property in c10d::Backend to determine whether backend supports coalescing (#135338)"
This reverts commit e5574445b0.

Reverted https://github.com/pytorch/pytorch/pull/135338 on behalf of https://github.com/ZainRizvi due to Sorry but this is failing internally. Please see D65663382 for more details ([comment](https://github.com/pytorch/pytorch/pull/135338#issuecomment-2465911854))
2024-11-08 23:52:49 +00:00
Michael Lazos
ea0f60ecfa [Dynamo] allow dynamic callables on tensor variables (#137940)
Fixes https://github.com/pytorch/pytorch/issues/134844

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137940
Approved by: https://github.com/williamwen42
2024-11-08 23:49:34 +00:00
PyTorch MergeBot
beae7725be Revert "Tighten type hints for tensor arithmetic (#135392)"
This reverts commit d378819068.

Reverted https://github.com/pytorch/pytorch/pull/135392 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. See D65641103 for more details ([comment](https://github.com/pytorch/pytorch/pull/135392#issuecomment-2465906839))
2024-11-08 23:44:41 +00:00
Haifeng Jin
2af5172774 fix dynamo tracking numpy 2 ops (#138686)
Fixes #136559
With the upgrade to NumPy 2, torch incorrectly filtered out `numpy.random` as unsupported in dynamo tracing.
This PR changes the filtering rules to include them while keeping behavior with numpy 1 unchanged.

Before this PR, the following tests failed:

```
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/dynamo/test_functions.py -k FunctionTests.test_numpy_random
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/dynamo/test_unspec.py -k UnspecTests.test_to_tensor
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/test_fake_tensor.py -k FakeTensorTest.test_export_numpy
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/test_fake_tensor.py -k PropagateRealTensorsFakeTensorTest.test_export_numpy_propagate_real_tensors
```

With this PR, the supported/unsupported ops in NumPy 1 are not changed.
For NumPy 2, only the `numpy.random` ops that are already supported with NumPy 1 are added to the supported list.

I used the following script to check the differences before and after the change for both NumPy 1 & 2.
The output is empty for NumPy 1 since there is no change.
For NumPy 2, the output is the list of `numpy.random` functions that are now considered supported.

```py
from torch._dynamo import trace_rules
import numpy as np

def new_numpy_function_ids():
    unsupported_funcs = {"seed", "ranf", "get_bit_generator", "RandomState", "set_bit_generator", "sample"}

    def is_supported(k, v, mod):
        if not callable(v):
            return False
        if not getattr(v, "__module__", None):
            return True
        if v.__module__ == mod.__name__:
            return True
        if v.__module__ == "numpy.random.mtrand" and mod.__name__== "numpy.random" and k not in unsupported_funcs:
            return True
        return False
    rv = {}
    for mod in trace_rules.NP_SUPPORTED_MODULES:
        for k, v in mod.__dict__.items():
            if is_supported(k, v, mod):
                rv[id(v)] = f"{mod.__name__}.{k}"
    return rv

def old_numpy_function_ids():
    rv = {}
    for mod in trace_rules.NP_SUPPORTED_MODULES:
        rv.update(
            {
                id(v): f"{mod.__name__}.{k}"
                for k, v in mod.__dict__.items()
                if callable(v)
                and (getattr(v, "__module__", None) or mod.__name__) == mod.__name__
            }
        )
    return rv

rv1 = set(old_numpy_function_ids().values())
rv2 = set(new_numpy_function_ids().values())

for v in (rv1 - rv2):
    print(v)
print("****")
for v in (rv2 - rv1):
    print(v)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138686
Approved by: https://github.com/williamwen42
2024-11-08 23:38:53 +00:00
Yifu Wang
1659e241c8 [experimental] async-tp impl with cutlass-based, progress aware kernel (#139227)
This PR introduces the following:

### torch.ops.symm_mem._async_input_mm

`_async_input_mm(Tensor a, Tensor b, Tensor a_chunk_signals, int a_chunk_pivot) -> Tensor`

An mm impl that supports consuming asynchronous input. It guarantees the following rasterization order, and that an input chunk is not consumed before its corresponding signal arrives.
```
num_chunks = a_chunk_signals.numel()
for chunk_idx in range(a_chunk_pivot, num_chunks + a_chunk_pivot):
    chunk_idx = chunk_idx % num_chunks
    wait_signal(a_chunk_signals, chunk_idx)
    # Compute the output tiles that consume this input chunk
```

### PersistentAsyncInputScheduler

This is a forked version of PersistentScheduler that supports consuming asynchronous input. This tile scheduler introduces the following arguments:

- `tiles_per_chunk_m` – Specifies the size of an M chunk. Chunks are the granularity at which the asynchronous input becomes ready. It must be an integer multiple of the size of an M tile.
- `chunk_signals` – `chunk_signals[i] == 1` indicates that chunk i is ready. Before returning a work tile, `get_current_work()` waits for the signal to ensure that the corresponding chunk is ready.
- `tile_idx_pivot_m` – After applying swizzling, apply `pivot(m) => (m + tile_idx_pivot_m) % tiles_m` to `m`. In a distributed setting, this allows different ranks to process different m indices at the same time, thus avoiding communication hotspots.

Note that this scheduler currently only supports the `KernelTmaWarpSpecializedCooperative` kernel schedule. This is enforced via the template argument `KernelSchedule`.

Usage:
```
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
   Shape<int, int, int, int>,
   CollectiveMainloop,
   CollectiveEpilogue,
   cutlass::gemm::PersistentAsyncInputScheduler<KernelSchedule>>;
```

### _fused_all_gather_matmul_native
An ag-mm impl that combines `torch.ops.symm_mem._async_input_mm` and progress-aware all-gather. This is not yet enabled via the async-tp passes. We will use it as a backend to optimize the current decomposition-based async-tp impl.

## Benchmarks

### 4096x3584x8192
- cublas + nccl: 539us
- decomp-based async-tp w/o cuda graph: 694us
- decomp-based async-tp w/ cuda graph: 478us
- new cutlass kernel: 408us

<img width="478" alt="image" src="https://github.com/user-attachments/assets/39f316ab-36c5-4b41-af77-07854a385dfc">

### 2048x3584x8192
- cublas + nccl: 301us
- decomp-based async-tp w/o cuda graph: 687us
- decomp-based async-tp w/ cuda graph: 356us
- new cutlass kernel: 276us

<img width="441" alt="image" src="https://github.com/user-attachments/assets/9e23ce21-863b-43dd-a562-fb05d3a5a144">

## Next Steps
- Add tuning logic
- Use `_fused_all_gather_matmul_native` as a backend for the decomp-based async-tp impl

Differential Revision: [D65623152](https://our.internmc.facebook.com/intern/diff/D65623152)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139227
Approved by: https://github.com/weifengpy, https://github.com/Chillee
2024-11-08 23:28:25 +00:00
fduwjj
e3b2f04f05 [c10d][Logging] Remove args and kwargs from c10d logging (#140169)
This PR is trying to reland https://github.com/pytorch/pytorch/pull/139804

We no longer want to log args and kwargs directly, because if they contain a tensor or tensor subclass, converting them to a string can take a long time or may not be supported at all.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140169
Approved by: https://github.com/wz337
2024-11-08 23:24:52 +00:00
Scott Wolchok
cc44b55b00 Hook up bf16_gemv_trans to x86 bf16 GEMM (#139220)
This is the big milestone for bf16 and should enable us to close https://github.com/pytorch/torchchat/issues/1253 .

Testing: ran `python torchchat.py generate llama3.2-1b --dtype bf16 --device cpu` on an x86 machine with AVX512-BF16; observed similar tokens/sec with and without the MKL path hand-disabled. Also observed a speedup from ~2.1 tok/sec to 7.4 tok/sec on an x86 machine with only AVX2.

Differential Revision: [D65170967](https://our.internmc.facebook.com/intern/diff/D65170967/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139220
Approved by: https://github.com/malfet
ghstack dependencies: #139084, #139090, #139558, #139081, #139208
2024-11-08 23:24:36 +00:00
Scott Wolchok
25c469bac3 Build bf16 gemv fast path & entry points for non-ARM architectures too (#139208)
Very similar to #137917, but for bf16.

Differential Revision: [D65155971](https://our.internmc.facebook.com/intern/diff/D65155971/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139208
Approved by: https://github.com/malfet
ghstack dependencies: #139084, #139090, #139558, #139081
2024-11-08 23:24:36 +00:00
Scott Wolchok
7f0bf9f961 Move bf16_gemv_trans to ReducedPrecisionFloatGemvFastPathKernel (#139081)
Following the previous move of fp16_gemv_trans.

Testing: Checked for performance regression with llm_benchmarks' `python benchmarks/benchmark_torch_mm.py llm`, didn't find one
Differential Revision: [D64930872](https://our.internmc.facebook.com/intern/diff/D64930872/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139081
Approved by: https://github.com/malfet
ghstack dependencies: #139084, #139090, #139558
2024-11-08 23:24:29 +00:00
Scott Wolchok
44f6d1439e Unbreak vec128_half_neon comparison without FP16 hardware support (#139558)
Discovered this bug when working on Vectorized<BFloat16>; apparently we have no automated testing for aarch64 without FP16.

Testing: Manually disabled the FP16 feature for a local vec_test_all_types run on Mac and saw it pass.

Differential Revision: [D65385267](https://our.internmc.facebook.com/intern/diff/D65385267/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139558
Approved by: https://github.com/malfet
ghstack dependencies: #139084, #139090
2024-11-08 23:24:22 +00:00
Nikita Shulga
ac6b6c6f98 [BE][CI] Use pip3 instead of pip (#140185)
On modern distros (see this oldie but goodie: https://launchpad.net/ubuntu/focal/+package/python-is-python3), the `pip` alias might be missing or might even point to a Python 2 installation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140185
Approved by: https://github.com/wdvr, https://github.com/huydhn, https://github.com/seemethere
2024-11-08 23:15:02 +00:00
Natalia Gimelshein
1cdaf1d85f correctly keep track of processed tensors for foreach reductions (#140103)
Fixes #140066

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140103
Approved by: https://github.com/janeyx99

Co-authored-by: Jane Xu <janeyx@meta.com>
2024-11-08 23:04:53 +00:00
Nikita Shulga
f3cbf67686 [CD] Build aarch64 wheels without conda (#140093)
The manylinuxaarch64-builder image already comes pre-built with all versions of the Python runtime.

Refactor the logic for setting the path to DESIRED_PYTHON from `manywheel/build_common` into `set_desired_python.sh` and call it from aarch64_ci_setup.sh.

In follow-up PRs, move the scons and ninja installation into the base docker image.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140093
Approved by: https://github.com/atalman
2024-11-08 22:24:28 +00:00
Gabriel Ferns
95198f8299 Remove uses of deleted operations (#139447)
resolves: https://github.com/pytorch/pytorch/issues/138721

Summary:

Delete the uses of deleted nodes. The double for-loop is icky here, but N should be pretty small, and removing it would require refactoring the data structures involved, which is a bigger endeavor.
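
Roughly the shape of the cleanup described above (a toy `Node` class, not the actual scheduler data structures):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    users: list = field(default_factory=list)
    deleted: bool = False

a, b, c = Node("a"), Node("b"), Node("c")
a.users = [b, c]
b.users = [c]
c.deleted = True  # c was removed from the graph

nodes = [a, b, c]
# Double for-loop: for each node, drop users that point at deleted nodes.
# O(N^2) in the worst case, but N stays small in practice.
for node in nodes:
    node.users = [u for u in node.users if not u.deleted]

print([u.name for u in a.users])  # ['b']
```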

Test Plan:

Normal test coverage should be sufficient. There were a couple of spots in the scheduler code that didn't check whether users had been deleted, so I'll run a perf test to see what impact that has, and to make sure the N^2 doesn't affect compile times.

Perf:
https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Tue%2C%2029%20Oct%202024%2017%3A41%3A36%20GMT&stopTime=Tue%2C%2005%20Nov%202024%2018%3A41%3A36%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=exclamaforte/prune-deleted-users&lCommit=5cb1aa6f7d8a52acdae0c7cf36b8c2d536d7f0d1&rBranch=main&rCommit=f4ee5a243dbb31e6310e5632b1c87898b299df2c
off of nov4 nightly

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139447
Approved by: https://github.com/eellison
2024-11-08 22:21:53 +00:00
PyTorch MergeBot
347f96061f Revert "[cpu] Modify inductor opt flag --- ftree-loop-vectorize (#136827)"
This reverts commit cf0bb6c435.

Reverted https://github.com/pytorch/pytorch/pull/136827 on behalf of https://github.com/ZainRizvi due to Sorry but this breaks internally. See D65605094 for more details ([comment](https://github.com/pytorch/pytorch/pull/136827#issuecomment-2465805271))
2024-11-08 21:52:33 +00:00
PyTorch MergeBot
a7724518c0 Revert "[Inductor][CPU] Fuse SmoothQuant int8 linear pattern (#139595)"
This reverts commit d72a308e77.

Reverted https://github.com/pytorch/pytorch/pull/139595 on behalf of https://github.com/ZainRizvi due to Sorry but the newly added tests in test_mkldnn_pattern_matcher.py fail internally. See D65661038 for more details ([comment](https://github.com/pytorch/pytorch/pull/139595#issuecomment-2465797016))
2024-11-08 21:45:52 +00:00
PyTorch MergeBot
80d0356b11 Revert "Make Context to be Device-agnostic Step by Step (2/N) (#136526)"
This reverts commit c03324de2d.

Reverted https://github.com/pytorch/pytorch/pull/136526 on behalf of https://github.com/ZainRizvi due to This fails to build internally. See D65604944 for more details ([comment](https://github.com/pytorch/pytorch/pull/136526#issuecomment-2465790157))
2024-11-08 21:40:10 +00:00
PyTorch MergeBot
3483f7809e Revert "Fix typo in associative_scan tests (#139929)"
This reverts commit 7fa94f0363.

Reverted https://github.com/pytorch/pytorch/pull/139929 on behalf of https://github.com/ZainRizvi due to This test is breaking in trunk somehow, which is really weird. functorch/test_control_flow.py::AssociativeScanTests::test_associative_scan_binary_operator_compile_mode_compile_dynamic_shape_combine_mode_pointwise_reverse_False_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/11747748990/job/32732254909) [HUD commit link](7fa94f0363) ([comment](https://github.com/pytorch/pytorch/pull/139929#issuecomment-2465773366))
2024-11-08 21:26:41 +00:00
Zain Rizvi
411203e7c1 Revert D65490202 (#140142)
Summary:
This diff reverts D65490202
This is causing tests to fail on open source. See distributed/test_c10d_logger.py::C10dErrorLoggerTest::test_exception_logger [GH job link](https://github.com/pytorch/pytorch/actions/runs/11736922614/job/32697709457) [HUD commit link](ba9645f6e5)

Test Plan: NA

Differential Revision: D65663063

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140142
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-11-08 21:22:32 +00:00
Catherine Lee
119e0699cc [ez] Add .lintrunner.private.toml to .gitignore (#140166)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140166
Approved by: https://github.com/Skylion007
2024-11-08 20:55:21 +00:00
Bin Bao
63a0d6587e [AOTI] Update the OSS tutorial (#139956)
Summary: Update the OSS tutorial to use the new aoti_compile_and_package and aoti_load_package APIs.
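A minimal sketch of the flow the updated tutorial covers (API locations and signatures here follow recent releases and may differ slightly between versions):

```python
import torch
from torch._inductor import aoti_compile_and_package, aoti_load_package

class M(torch.nn.Module):
    def forward(self, x):
        return torch.sin(x) + 1

example_inputs = (torch.randn(8),)
ep = torch.export.export(M(), example_inputs)

# Compile ahead-of-time and bundle everything into a single .pt2 package.
pkg_path = aoti_compile_and_package(ep, package_path="m.pt2")

# Later (possibly in a different process), load and run the compiled model.
compiled = aoti_load_package(pkg_path)
print(compiled(*example_inputs))
```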
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139956
Approved by: https://github.com/angelayi
ghstack dependencies: #139955
2024-11-08 20:46:57 +00:00
PyTorch MergeBot
07ad74635b Revert "[Reland] Use static_assert to detect get_type_index used in device code (#139966)"
This reverts commit ca7fdfe4d2.

Reverted https://github.com/pytorch/pytorch/pull/139966 on behalf of https://github.com/malfet due to This approach will prevent one from using get_type_index from device code ([comment](https://github.com/pytorch/pytorch/pull/139966#issuecomment-2465701260))
2024-11-08 20:32:43 +00:00
Animesh Jain
e6c5a77485 [dynamo][guards] Profile guard manager in C++ (#140110)
This should remove the pybind noise from the profiling.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140110
Approved by: https://github.com/jansel
ghstack dependencies: #139953
2024-11-08 18:44:08 +00:00
Animesh Jain
a140e65e0f [dynamo] Support method with different __self__ on user defined objects (#139953)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139953
Approved by: https://github.com/jansel
2024-11-08 18:44:08 +00:00
William Wen
d18bca4961 [dynamo] switch to get_framelocals_mapping for 3.10 and below (#140037)
Part of implementing https://github.com/pytorch/pytorch/issues/93753. Next step will be to use a lower overhead data structure over `py::dict`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140037
Approved by: https://github.com/jansel
ghstack dependencies: #139921, #139950
2024-11-08 18:43:54 +00:00
William Wen
bbd427faf5 [dynamo] switch to get_framelocals_mapping for 3.11 (#139950)
Part of implementing https://github.com/pytorch/pytorch/issues/93753

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139950
Approved by: https://github.com/jansel
ghstack dependencies: #139921
2024-11-08 18:43:54 +00:00