Commit Graph

80818 Commits

Author SHA1 Message Date
PyTorch MergeBot
dbb55b448b Revert "[7/N] Fix Wextra-semi warning (#140225)"
This reverts commit ffb979032d.

Reverted https://github.com/pytorch/pytorch/pull/140225 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/140225#issuecomment-2469312229))
2024-11-12 00:02:06 +00:00
Tugsbayasgalan Manlaibaatar
0af38b1034 Remove temp table to post autograd IR (#140085)
This table is not needed

Differential Revision: [D64553397](https://our.internmc.facebook.com/intern/diff/D64553397/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140085
Approved by: https://github.com/justinchuby, https://github.com/bdhirsh
2024-11-11 23:59:09 +00:00
Felix Zimmermann
c223e0642c Tighten type hints for tensor arithmetic (#135392)
Fixes #124015

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135392
Approved by: https://github.com/ezyang
2024-11-11 23:55:27 +00:00
Bob Ren
a96aadf0a0 fix specialization logic in Scalar.h (#140280)
Fixes `test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoCUDA.test_comprehensive_linalg_norm_subgradients_at_zero_cuda_float64` when `specialize_float=False`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140280
Approved by: https://github.com/ezyang
2024-11-11 23:51:15 +00:00
PyTorch MergeBot
222175b3d5 Revert "[Partitioner] Enumerate partitions by iterating partition ids (#136598)"
This reverts commit 2ede4c9a38.

Reverted https://github.com/pytorch/pytorch/pull/136598 on behalf of https://github.com/kit1980 due to breaking internal ExecuTorch tests ([comment](https://github.com/pytorch/pytorch/pull/136598#issuecomment-2469294995))
2024-11-11 23:42:51 +00:00
PyTorch MergeBot
412df50454 Revert "[dynamo] Remove dead code path for capturing __class__ in UserFunctionVariable (#140034)"
This reverts commit de40a23f6c.

Reverted https://github.com/pytorch/pytorch/pull/140034 on behalf of https://github.com/kit1980 due to breaking internal tests, see D65755044 ([comment](https://github.com/pytorch/pytorch/pull/140034#issuecomment-2469290205))
2024-11-11 23:38:00 +00:00
Kevin Sheridan
2817fe8bef Add unaligned attributes to q8gemm/4x4c2-sse2.c (#140188)
Summary:
UBSan hits undefined behavior in this file. This fixes it by marking these pointers as unaligned.

```
caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/__ukernels_sse2__/buck-private-headers/q8gemm/4x4c2-sse2.c:325:5: runtime error: store to misaligned address 0x62900313891f for type 'uint32_t' (aka 'unsigned int'), which requires 4 byte alignment
0x62900313891f: note: pointer points here
 be be be be be  be be be be be be be be  be be be be be be be be  be be be be be be be be  be be be
             ^
UndefinedBehaviorSanitizer: undefined-behavior buck-caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/__ukernels_sse2__/buck-private-headers/q8gemm/4x4c2-sse2.c:325:5 in
```

The fix is to mark these variables as unaligned, following the example of D42179009.

Test Plan: q8gemm.cc + internal integration test

Differential Revision: D65637959

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140188
Approved by: https://github.com/digantdesai
2024-11-11 23:28:07 +00:00
Animesh Jain
5eb1ccadc2 [dynamo][user-defined] Walk __mro__ to get the member descriptor source (#140300)
Fixes https://github.com/pytorch/pytorch/issues/140266

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140300
Approved by: https://github.com/williamwen42
2024-11-11 23:16:48 +00:00
Nathan Brown
a290c1d748 Fix building with system GLOO (#140275)
Leverage the existing FindGloo CMake module to locate the system's library and headers. Add the system's gloo headers to the include path, rather than the bundled third-party gloo, when USE_SYSTEM_GLOO is specified.

Fixes #140274

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140275
Approved by: https://github.com/malfet
2024-11-11 22:58:39 +00:00
Catherine Lee
b742d11b1c [TD] Filepath heuristic also looks at file name (#140170)
The filepath heuristic now also takes the file name into account, not just the directories.

Also includes a bit of refactoring.
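A hedged sketch of the idea (the actual TD heuristic in tools/testing differs; `match_score` and its normalization rules are invented here purely to illustrate letting the file name participate in the match alongside the directories):

```python
def match_score(changed_file: str, test_file: str) -> int:
    # Score shared path components; the file name (with the "test_" prefix
    # and ".py" suffix stripped) now participates, not just the directories.
    a = changed_file.removesuffix(".py").split("/")
    b = test_file.removesuffix(".py").split("/")
    b[-1] = b[-1].removeprefix("test_")
    return sum(x == y for x, y in zip(a, b))

print(match_score("torch/nn/functional.py", "test/nn/test_functional.py"))  # 2
```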
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140170
Approved by: https://github.com/huydhn
2024-11-11 22:55:54 +00:00
Animesh Jain
5f7ea7ca6a [invoke_subgraph] Support symint/int as inputs (#140058)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140058
Approved by: https://github.com/ydwu4, https://github.com/eellison
ghstack dependencies: #139162
2024-11-11 22:26:43 +00:00
Xuan Zhang
d4cdc09881 ILP for auto FSDP wrapping (#140298)
This PR presents a mixed integer linear programming (MILP) formulation that can be utilized to determine, under a memory budget, which modules to wrap as FSDP units. Similar to the auto SAC MILP introduced in https://github.com/pytorch/pytorch/pull/137908, the MILP uses information collected from MemTracker, Runtime Estimator, and SAC Estimator, introduced in these PRs:
* https://github.com/pytorch/pytorch/pull/124688
* https://github.com/pytorch/pytorch/pull/134243
* https://github.com/pytorch/pytorch/pull/135208

End-to-end example and its sample output:

```
import copy
from typing import Tuple

import torch
from torch._subclasses.fake_tensor import FakeTensorMode

from torch.distributed._tools.ilp_utils import (
    aggregate_stats,
    get_peak_memory_runtime_baseline,
    parse_module_info,
)
from torch.distributed._tools.mem_tracker import _ModState, MemTracker
from torch.distributed._tools.runtime_estimator import RuntimeEstimator
from torch.distributed._tools.sac_estimator import SACEstimator
from torch.distributed._tools.fsdp_ilp import fsdp_milp, CommType, CommParams
from torch.testing._internal.distributed._tensor.common_dtensor import (
    ModelArgs,
    Transformer,
)

def _init_model_input_optimizer() -> (
    Tuple[torch.nn.Module, torch.optim.Optimizer, torch.Tensor]
):
    bsz = 2
    model_args = ModelArgs(
        n_layers=6,
        n_heads=12,
        vocab_size=8192,
        max_seq_len=1024,
        dim=6144,
        dropout_p=0.1,
    )
    with torch.device(torch.cuda.current_device()):
        model = Transformer(model_args)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2, foreach=True)
    inp = torch.randint(
        0,
        model_args.vocab_size,
        (bsz, model_args.max_seq_len),
        device=torch.cuda.current_device(),
    )
    return (model, optimizer, inp)

def _run_and_get_mem_tracker(
    model: torch.nn.Module,
    optimizer: torch.optim.Optimizer,
    inp: torch.Tensor,
) -> MemTracker:
    mem_tracker = MemTracker()
    mem_tracker.track_external(model, optimizer)
    with mem_tracker as mt:
        for iter_idx in range(2):  # running twice to initialize optimizer
            output = model(inp)
            output.sum().backward()
            if iter_idx == 1:
                last_snapshot = mt.get_tracker_snapshot("current")
            optimizer.step()
            optimizer.zero_grad()
            if iter_idx == 0:
                mt.reset_mod_stats()
    assert last_snapshot is not None
    for mod_stats in mem_tracker.memory_tracking.values():
        if _ModState.POST_BW not in mod_stats.snapshots.keys():
            mod_stats.snapshots.setdefault(_ModState.POST_BW, []).append(
                copy.deepcopy(last_snapshot)
            )
    return mem_tracker

def _run_and_get_runtime_estimator(
    model: torch.nn.Module,
    optimizer: torch.optim.Optimizer,
    inp: torch.Tensor,
) -> RuntimeEstimator:
    def _run_one_step() -> None:
        output = model(inp)
        output.sum().backward()
        optimizer.step()
        optimizer.zero_grad()

    # Initializing optimizer states and warm-up
    _run_one_step()

    runtime_estimator = RuntimeEstimator()
    with runtime_estimator(estimate_mode_type="operator-level-cost-model"):
        _run_one_step()  # We use only one iteration for estimation
    return runtime_estimator

def _run_and_get_sac_estimator(
    model: torch.nn.Module,
    inp: torch.Tensor,
) -> SACEstimator:
    sac_estimator = SACEstimator()
    with sac_estimator(estimate_mode_type="operator-level-cost-model"):
        loss = model(inp).sum()
    loss.backward()
    return sac_estimator

def main():
    with FakeTensorMode():
        model, optimizer, inp = _init_model_input_optimizer()
        mem_tracker = _run_and_get_mem_tracker(model, optimizer, inp)
        runtime_estimator = _run_and_get_runtime_estimator(model, optimizer, inp)
        sac_estimator = _run_and_get_sac_estimator(model, inp)
        mod_info = aggregate_stats(
            model,
            mem_tracker,
            runtime_estimator,
            sac_estimator,
            torch.device(torch.cuda.current_device()),
        )
        g = parse_module_info(mod_info)

        peak_mem, compute_time = get_peak_memory_runtime_baseline(g)
        print("=== WITHOUT FSDP ===")
        print(f"peak_mem: {round(peak_mem / 2**30, 2)} GiB")
        print(f"compute_time: {round(compute_time, 2)} ms")

        fsdp_decisions, exposed_comm_time, peak_mem = fsdp_milp(
            g,
            world_size=8,
            memory_budget=15,
            comm_params={
                CommType.ALL_GATHER: CommParams(latency=0.01, bandwidth=2 * 1e8),
                CommType.REDUCE_SCATTER: CommParams(latency=0.01, bandwidth=2 * 1e8),
            },
        )
        print("=== WITH FSDP on 8 ranks ===")
        print(f"fsdp units: {sorted(fsdp_decisions)}")
        print(f"peak_mem: {round(peak_mem / 2**30, 2)} GiB")
        print(f"exposed communication time: {round(exposed_comm_time, 2)} ms")

if __name__ == "__main__":
    main()
```

```
=== WITHOUT FSDP ===
peak_mem: 20.92 GiB
compute_time: 1375.49 ms
=== WITH FSDP on 8 ranks ===
fsdp units: ['Transformer', 'Transformer.layers.0.attention.wk', 'Transformer.layers.0.attention.wo', 'Transformer.layers.0.attention.wq', 'Transformer.layers.0.attention.wv', 'Transformer.layers.0.feed_forward.w1', 'Transformer.layers.0.feed_forward.w2', 'Transformer.layers.1', 'Transformer.layers.2', 'Transformer.layers.3', 'Transformer.layers.4', 'Transformer.layers.5', 'Transformer.output', 'Transformer.pos_embeddings']
peak_mem: 13.63 GiB
exposed communication time: 1.02 ms
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140298
Approved by: https://github.com/weifengpy
2024-11-11 22:02:39 +00:00
Bin Bao
2c77352fe2 [AOTI][refactor] Clean up call chain in wrapper codegen (#136531)
Summary: For cpp wrapper, generate_kernel_call and define_kernel need to handle both cpu and gpu kernels. Refactor the code to remove nested super() calls.

Differential Revision: [D65639095](https://our.internmc.facebook.com/intern/diff/D65639095)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136531
Approved by: https://github.com/frank-wei
2024-11-11 22:00:42 +00:00
Huy Do
115c58c52a Update ET pin for #6744 (#140199)
This will be updated to an ET trunk commit after https://github.com/pytorch/executorch/pull/6744 lands. This change also moves ET back from unstable and installs the llama3 dependencies.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140199
Approved by: https://github.com/kit1980
2024-11-11 21:40:12 +00:00
Justin Chu
780b28f67e [ONNX] Update docstring typo in building (#140281)
The oprecorder docstring mistakenly referred to TorchScript when it should say ONNX IR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140281
Approved by: https://github.com/titaiwangms
2024-11-11 21:01:27 +00:00
Jack Taylor
001f7366a7 [ROCm] Correct numerical issues in layer norm backwards kernel (#140259)
It was reported that the backward layer norm kernel on AMD was slightly less accurate than the equivalent NVIDIA implementation.

On AMD we call into a helper kernel, `cuLoadWriteStridedInputs`, which processes strided input and accumulates the partial gradients into shared memory.

In this kernel (https://github.com/pytorch/pytorch/pull/87635) we truncated `mean` and `rstd` from the T_ACC type to T, which causes numerical issues in the warp buffers created in this kernel. This PR uses the correct accumulator type for mean and rstd.

Note: only AMD calls into this code path for backward layer norm, so this was not an issue on NVIDIA.
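A quick way to see the size of the truncation error involved (illustrative only, not the kernel code; here `float16` stands in for T and `float32` for T_ACC):

```python
import torch

x = torch.randn(4096)                 # float32 plays the role of T_ACC
mean_acc = x.mean()                   # statistic kept in accumulator precision
mean_trunc = mean_acc.half().float()  # truncated to T, as the old kernel did
print(f"truncation error: {(mean_acc - mean_trunc).abs().item():.3e}")
```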

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140259
Approved by: https://github.com/jianyuh
2024-11-11 20:44:18 +00:00
Rachel Guo
10e40dd5ca [aoti][tooling] Add support to debug printing for all AOTI model run input args (#140064)
Summary:
Add debug printing around: `void AOTInductorModel::run_impl()`

Example:
```
void AOTInductorModel::run_impl(
    AtenTensorHandle*
        input_handles, // array of input AtenTensorHandle; handles
                        // are stolen; the array itself is borrowed
    AtenTensorHandle*
        output_handles, // array for writing output AtenTensorHandle; handles
                        // will be stolen by the caller; the array itself is
                        // borrowed
    DeviceStreamType stream,
    AOTIProxyExecutorHandle proxy_executor
) {

    auto inputs = steal_from_raw_handles_to_raii_handles(input_handles, 3);
    auto arg0_1 = std::move(inputs[0]);
    auto arg1_1 = std::move(inputs[1]);
    auto arg2_1 = std::move(inputs[2]);
    aoti_torch_print_tensor_handle(arg0_1, "aoti_model_inputs - arg0_1");
    aoti_torch_print_tensor_handle(arg1_1, "aoti_model_inputs - arg1_1");
    aoti_torch_print_tensor_handle(arg2_1, "aoti_model_inputs - arg2_1");
```

Differential Revision: D65616590

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140064
Approved by: https://github.com/chenyang78
2024-11-11 20:10:35 +00:00
Yuanhao Ji
7f1e248b50 [Dynamo] Replace torch._dynamo.optimize() with torch.compile() [1/N] (#139706)
``torch._dynamo.optimize()`` is wrapped for convenience by ``torch.compile()``.
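For reference, a minimal sketch of the equivalence being relied on here (the explicit "inductor" backend string is spelled out for illustration; it is also the default backend for `torch.compile()`):

```python
import torch

def fn(x):
    return torch.sin(x) + x

# torch.compile() is the public convenience wrapper around the
# lower-level torch._dynamo.optimize() decorator.
compiled_new = torch.compile(fn)
compiled_old = torch._dynamo.optimize("inductor")(fn)

x = torch.randn(4)
print(torch.allclose(compiled_new(x), compiled_old(x)))  # True
```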

related commits:

- #139706
- #140238
- #140247
- #140253

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139706
Approved by: https://github.com/jansel, https://github.com/ezyang
2024-11-11 20:04:08 +00:00
Joel Schlosser
e7ec294c10 NJT OpInfo tests v2 (#138370)
This PR updates OpInfo-based tests for NJTs:
* Adds extensive coverage across non-contiguous NJTs (both non-contiguous transposed and non-contiguous with holes)
    * The `_sample_njts()` helper that `sample_input_func`s utilize now produces non-contig NJTs as well
* Utilizes a `SampleInput`-based xfail system for granular classification of bugs. For example, it's possible to indicate that a class of ops is expected to fail only on non-contig with holes NJT inputs.
    * I decided on adding `SampleInput`s and utilizing this system over using test parametrization for two reasons:
        * Test perf - adding `SampleInput`s is faster than generating entire new tests
        * Avoiding the possibility of `sample_input_func`s not respecting the non-contig test parameter - this would result in silently incorrect passing of these tests. Keeping the responsibility for `SampleInput` generation firmly within each `OpInfo`'s `sample_input_func` means weirdness like this isn't possible
* Improves `SampleInput` naming for a bunch of `sample_input_func`s. This makes it easier to xfail them as needed. For example, binary / unary / other ops now use the new `_describe_njt()` helper to get a string repr that uniquely defines the type of NJT being passed to the op
* Adds appropriate `XFailRule`s to get tests passing for forward / backward / forward compile / backward compile. In general, each xfail corresponds to some bug that needs to be fixed

```python
from dataclasses import dataclass
from typing import Callable, Optional

import torch
from torch.testing._internal.opinfo.core import OpInfo, SampleInput

# Represents a rule indicating how to xfail a particular test. It allows granularity
# at the device, dtype, op, and individual sample levels. This flexibility allows entire
# bugs to be represented by a single rule, even if this corresponds with multiple conceptual
# test cases across multiple ops.
@dataclass
class XFailRule:
    # expected error type
    error_type: type = Exception
    # expected error message
    error_msg: str = ".*"
    # function to indicate whether the rule applies; return True if so
    match_fn: Optional[
        Callable[[torch.device, torch.dtype, OpInfo, SampleInput], bool]
    ] = None
    # optional name for identifying the rule
    name: str = ""

    def match(self, device, dtype, op, sample) -> bool:
        return self.match_fn(device, dtype, op, sample)
```

Example:
```python
    # Bug when broadcasting a binary op with non-contiguous with holes NJT + dense
    # tensor with 1 in ragged dim.
    XFailRule(
        error_type=RuntimeError,
        error_msg="cannot call binary pointwise function .* with inputs of shapes",
        match_fn=lambda device, dtype, op, sample: (
            isinstance(op, BinaryUfuncInfo)
            and "noncontig_holes" in sample.name
            and "broadcasting 1 over ragged" in sample.name
        ),
        name="binary_noncontig_holes_broadcasting_1_over_ragged",
    ),
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138370
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
ghstack dependencies: #140160
2024-11-11 19:35:24 +00:00
Yifu Wang
0a0915fb5e [SymmetricMemory] improve the API for stream_write_value32 (#139934)
This PR updates the binding for `stream_write_value32` to be consistent with `memset32`, which IMO makes more sense for this type of utility:
- Changed the API to take a uint32 tensor as an argument, instead of a device pointer
- Changed the Python binding to be a static method of `_SymmetricMemory`, instead of an object method
- Use the dispatcher for device dispatching, as opposed to `SymmetricMemory` backends

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139934
Approved by: https://github.com/weifengpy
ghstack dependencies: #139227
2024-11-11 18:49:22 +00:00
Max Ren
96b64182de Delete Buck1 as it is no longer supported (#140067)
Buck1 is no longer supported in favor of Buck2. This CI job tests the old buck1 flow; however, it is difficult to maintain, especially since buck1 doesn't support aarch64 Mac.

I am suggesting that this CI be deprecated until a decision on buck2 is made and buck2 support is added. As of now, there seems to be no push towards adding buck2 support.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140067
Approved by: https://github.com/huydhn
2024-11-11 18:49:18 +00:00
PyTorch MergeBot
5f4a21dc58 Revert "[SymmetricMemory] improve the API for stream_write_value32 (#139934)"
This reverts commit 2f3a5a15ef.

Reverted https://github.com/pytorch/pytorch/pull/139934 on behalf of https://github.com/malfet due to Broke distributed tests, see https://github.com/pytorch/pytorch/actions/runs/11770673088/job/32784210441 ([comment](https://github.com/pytorch/pytorch/pull/139934#issuecomment-2468641512))
2024-11-11 17:02:07 +00:00
Nikita Shulga
2fe110ff3a [BE][MPS] Standardize indexing shader compilation (#140271)
It was wrong to add it to MPSDevice in the first place, as in the end it's just a regular shader, like all others.
I.e. this PR:
 - Moves contents of `at::mps::indexing_metal_shaders` into `kernels/Indexing.metal`
 - Deletes `MPSDevice::getMetalIndexingLibrary()` and `MPSDevice::metalIndexingPSO` methods
 - Moves `at::native::mps::generateKernelDataOffsets` implementation from `OperationUtils.mm` to `Indexing.mm`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140271
Approved by: https://github.com/Skylion007
2024-11-11 17:00:49 +00:00
Nikita Shulga
f5ffd55a32 [MPS] Add torch.special.i1 op (#140196)
By more-or-less copy-n-pasting 58b661cda2/aten/src/ATen/native/cuda/Math.cuh (L576)

Enable respective tests in test_mps.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140196
Approved by: https://github.com/Skylion007
2024-11-11 16:57:53 +00:00
Aleksei Nikiforov
63715f6567 S390x update builder image (#132983)
Publish the current state of the s390x builder image to allow reproducing the worker setup.
Also, if this image gets published to a docker repository later, it would be possible to download the published image instead of building it into the worker image in https://github.com/pytorch/pytorch/blob/main/.github/scripts/s390x-ci/self-hosted-builder/actions-runner.Dockerfile#L66, which should improve restart time at the cost of additional runtime overhead.

Compared to the first attempt to merge:
- default docker repository settings are added to all runners. Changes are mirrored in this PR.
- the job is moved into a separate workflow file.
- the job no longer attempts to update limits on s390x. Limits should be properly set up on the host; it's not possible to update them from the worker since it runs in a container, and the worker container currently doesn't have sudo installed or configured, or any systemd running.
- the github token is now passed once via a named pipe instead of an environment variable, which should increase the security of tokens (see the sketch below).
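A minimal sketch of that named-pipe pattern, with an illustrative path and setup rather than the actual runner scripts:

```python
import os

fifo = "/tmp/runner-token.pipe"  # illustrative path
if not os.path.exists(fifo):
    os.mkfifo(fifo, mode=0o600)

# Blocks until the host writes the token once; unlike an environment
# variable, the secret never appears in the worker's environment or in
# process listings.
with open(fifo) as f:
    token = f.read().strip()
```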
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132983
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-11-11 16:14:06 +00:00
Richard Zou
04b5b4a94e Add base class for single-subgraph inductor HOPs (#139898)
This PR adds "PrimHOPBase", which is intended to be a base class that
one can extend to create new HOPs that match some criteria:
- they take one subgraph as input, and their semantics are running the
  subgraph on some operands
- the HOP stays alive until Inductor

The motivation is that we are seeing a lot more HOPs (invoke_subgraph,
invoke_quant) that have this property and there can be a lot of shared
code between them.

Future:
- Migrate invoke_subgraph to use this
- There are some TODOs in the code

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139898
Approved by: https://github.com/anijain2305, https://github.com/ydwu4
2024-11-11 16:12:35 +00:00
David Berard
d4b8857e51 [codecache][triton 3.2] hash -> base64 conversion for triton 3.2 (#140190)
In old Triton versions, the hash of the Triton kernel is used directly in the file path for the cached kernel. In Triton 3.2 (after https://github.com/triton-lang/triton/pull/4553), the file path instead uses the base-64-encoded representation of the hash.

This PR checks whether the `_base64` function exists in Triton and, if so, uses the base-64-encoded representation in the path.
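A sketch of that version check; the import location of `_base64` is an assumption (only its existence is stated above), and `hash_to_path_key` is an illustrative name:

```python
def hash_to_path_key(kernel_hash: str) -> str:
    # Triton 3.2+ (triton-lang/triton#4553) stores cached kernels under the
    # base-64-encoded hash; older versions use the raw hash directly.
    try:
        from triton.runtime.cache import _base64  # assumed location
    except ImportError:
        return kernel_hash
    return _base64(kernel_hash)
```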

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140190
Approved by: https://github.com/ezyang
2024-11-11 15:32:28 +00:00
fduwjj
ceb44b22dc [FR] Enable best effort partial analysis and verbose mode for trace printing (#139853)
Based on user feedback, we want to enable two things for the FR analysis script:
1. Print out more information when verbose mode is specified.
2. Perform best-effort analysis when not all ranks have FR traces dumped.

Differential Revision: [D65516081](https://our.internmc.facebook.com/intern/diff/D65516081/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139853
Approved by: https://github.com/c-p-i-o
2024-11-11 14:38:32 +00:00
Sam Larsen
cb15c15157 [logging] Overhaul dynamo_timed and CompilationMetrics logging. (#139849)
Here's the overview:

There's a new context-manager singleton called MetricsContext. Entering the MetricsContext is how we demarcate the boundary on which we'll create a single CompilationMetrics object, and therefore a single dynamo_compile log entry. While we're inside the MetricsContext, we can update/set many different metrics. Most importantly, `dynamo_timed` can also update the in-progress MetricsContext. In the proposal here, we tell `dynamo_timed` that we want it to do so by providing the name of the MetricsContext field to increment. There can be many `dynamo_timed` calls in different parts of the code updating different fields.

When the MetricsContext exits, that's when the logging of everything gathered finally happens. One potential footgun is trying to use `dynamo_timed` when we haven't entered the MetricsContext, but we assert on that problem. Another is that we may re-enter the context recursively; we watch for that and do the logging only when the outermost context exits.
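A minimal sketch of this pattern, with the caveat that the names, fields, and logging here are illustrative rather than the actual torch._dynamo internals:

```python
import time
from contextlib import contextmanager

class MetricsContext:
    """Collects metrics for one compilation; logs once, on outermost exit."""

    def __init__(self):
        self._metrics = {}
        self._depth = 0

    def __enter__(self):
        if self._depth == 0:
            self._metrics = {}  # fresh CompilationMetrics-like record
        self._depth += 1
        return self

    def __exit__(self, *exc):
        self._depth -= 1
        if self._depth == 0:  # recursive re-entry: log only at the outermost exit
            print("dynamo_compile:", self._metrics)  # stand-in for real logging

    def increment(self, field, value):
        assert self._depth > 0, "dynamo_timed used outside of MetricsContext"
        self._metrics[field] = self._metrics.get(field, 0.0) + value

METRICS = MetricsContext()

@contextmanager
def dynamo_timed(field):
    # Times a region and accumulates it into the named MetricsContext field.
    start = time.monotonic()
    try:
        yield
    finally:
        METRICS.increment(field, time.monotonic() - start)

with METRICS:  # one entry per compilation -> one log line on exit
    with dynamo_timed("inductor_compile_time_s"):
        time.sleep(0.01)
```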

Some specifics:
* Introduce MetricsContext, a context manager that, on exit, records the CompilationMetrics (which also logs to dynamo_compile).
* Completely remove the concept of frame_phase_timing. Instead, update the MetricsContext during compilation, either directly or via dynamo_timed.
* Remove some globals we previously used to accumulate counters to later populate a CompilationMetrics. We use CompilationMetrics set/update/increment APIs instead.
* `record_compilation_metrics` is now called on exit from MetricsContext.
* Populate legacy CompilationMetrics fields right before logging, inside `record_compilation_metrics`.
* Remove the one-off `add_remote_cache_time_saved` helper; capture that timing directly into the MetricsContext.

And specifically, several changes to dynamo_timed:
* "Modernize" the parameters and update all callsites accordingly.
* Move the backwards logging of the CompilationMetrics to the backwards compile location.
* Add a parameter for which CompilationMetrics field to update

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139849
Approved by: https://github.com/ezyang
ghstack dependencies: #140094
2024-11-11 14:24:23 +00:00
Xiaodong Wang
565a7942ee Recover non-standard bool test for msort (#139870)
Summary:
I was looking into why a non-standard bool value fails for msort. It makes sense for argsort and sort to fail, because we're randomly generating uint8 values, so the order will be different (and thus the indices will be different). But msort should work.

After some digging, I found it interesting that even though scalar_t is bool, when the actual value is a uint8_t, the comparison treats the operands as signed. With lhs=255 and rhs=0, lhs < rhs is evaluated as -1 < 0, which is true (but it's supposed to be false).

Therefore we add an explicit type cast.
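The arithmetic from that description, worked in a few lines (illustrative only, not the kernel code):

```python
# A bool slot whose raw storage byte is 0xFF (255), compared as signed int8:
def as_int8(byte):
    return byte - 256 if byte >= 128 else byte

lhs, rhs = 0xFF, 0x00
print(as_int8(lhs) < as_int8(rhs))  # True: -1 < 0, the wrong order
print(bool(lhs) < bool(rhs))        # False: True < False, after an explicit cast
```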

Test Plan: Remove the test skip

Differential Revision: D65472170

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139870
Approved by: https://github.com/Skylion007, https://github.com/davidberard98
2024-11-11 02:00:34 +00:00
Yifu Wang
2f3a5a15ef [SymmetricMemory] improve the API for stream_write_value32 (#139934)
This PR updates the binding for `stream_write_value32` to be consistent with `memset32`, which IMO makes more sense for this type of utility:
- Changed the API to take a uint32 tensor as an argument, instead of a device pointer
- Changed the Python binding to be a static method of `_SymmetricMemory`, instead of an object method
- Use the dispatcher for device dispatching, as opposed to `SymmetricMemory` backends

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139934
Approved by: https://github.com/weifengpy
ghstack dependencies: #139227
2024-11-11 01:54:35 +00:00
cyy
ffb979032d [7/N] Fix Wextra-semi warning (#140225)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140225
Approved by: https://github.com/ezyang
2024-11-10 14:28:10 +00:00
Zhenbin Lin
d90c25e3e2 OpenReg: Support event (#140111)
Support events. Since the CPU backend doesn't support asynchronous execution, all event operations are executed immediately on the executor side.
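A minimal sketch of what "executed immediately" means for such a synchronous backend (the class and method names are illustrative, not the OpenReg API):

```python
import time

class SyncEvent:
    """Event for a backend with no asynchronous execution:
    every operation completes as soon as it is issued."""

    def __init__(self):
        self._timestamp = None

    def record(self):
        self._timestamp = time.perf_counter()  # takes effect immediately

    def synchronize(self):
        pass  # nothing is ever pending on a synchronous backend

    def elapsed_time(self, end: "SyncEvent") -> float:
        return (end._timestamp - self._timestamp) * 1000.0  # milliseconds

start, end = SyncEvent(), SyncEvent()
start.record(); time.sleep(0.01); end.record()
print(f"{start.elapsed_time(end):.1f} ms")
```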

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140111
Approved by: https://github.com/ezyang
2024-11-10 08:38:45 +00:00
Yutao Xu
c3087ace58 Update torch-xpu-ops commit pin (#139986)
Update the torch-xpu-ops commit to [5e29831](https://github.com/intel/torch-xpu-ops/commit/5e29831). Includes:
- OneAPI-2025 build issue fix
- Enhancement of the XPU operator coverage

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139986
Approved by: https://github.com/guangyey, https://github.com/jansel
2024-11-10 06:49:38 +00:00
CaoE
94c9bb73c0 [Inductor] [CPP] Update BRGEMM parameters for Half cpp gemm template (#140116)
Update the BRGEMM parameters for the Half cpp gemm template, as the BRGEMM API was changed in https://github.com/pytorch/pytorch/pull/138184.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140116
Approved by: https://github.com/jansel
2024-11-10 06:37:10 +00:00
Sam Larsen
4f6b30bcbc Add testing for the utils surrounding dynamo_timed (#140094)
Summary: This will make it easier to verify that we don't break these utilities in the refactor in https://github.com/pytorch/pytorch/pull/139849.
It's one giant test; I can split it into multiple tests for better readability if people prefer that. My rationale for the giant test is that I found I was just resetting compilation and recompiling the same thing many times, which was slow and wasteful.

Test Plan: The new tests

Differential Revision: [D65682138](https://our.internmc.facebook.com/intern/diff/D65682138)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140094
Approved by: https://github.com/ezyang
2024-11-10 04:17:45 +00:00
zeshengzong
5ef33e40b3 Add size param check of unfold (#139965)
Fixes #76617

Changes:

- Add a check of the input `size` value and give a user-friendly error message
- Fix the `FIXME: move to shape ops test suite` note in the test file

Before
```python
import torch
x = torch.arange(1., 8)
x.unfold(0, -1, 1)

Traceback (most recent call last):
  File "/home/zong/code/unfold.py", line 12, in <module>
    x.unfold(0, -1, 1)
RuntimeError: Storage size calculation overflowed with sizes=[9, -1] and strides=[1, 1]

```

After
```python
import torch
x = torch.arange(1., 8)
x.unfold(0, -1, 1)

Traceback (most recent call last):
  File "/home/zong/code/pytorch/../unfold.py", line 12, in <module>
    x.unfold(0, -1, 1)
RuntimeError: size is -1 but must be >= 0
```

Test Result:
```bash
pytest test/test_shape_ops.py
```

![image](https://github.com/user-attachments/assets/d7bcef62-04e6-4187-9c8f-bc5220ff6c33)

```bash
$ lintrunner
```

![image](https://github.com/user-attachments/assets/6b48d095-5c8a-4e75-9957-dc22d39a73bb)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139965
Approved by: https://github.com/ezyang
2024-11-09 17:12:53 +00:00
atalman
f89b2b9630 Refactor conda-builder -> almalinux-builder (#140157)
This changes the conda-builder workflow to almalinux-builder and switches the Dockerfile to almalinux.
Please note: published conda-builder images will still be available, so workflows that use these images will still work.
We will be switching workflows that use conda-builder images over to almalinux-builder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140157
Approved by: https://github.com/malfet
2024-11-09 16:06:40 +00:00
cyy
7d4f5f7508 [Environment Variable][6/N] Use thread-safe getenv functions (#140200)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140200
Approved by: https://github.com/ezyang
2024-11-09 15:05:51 +00:00
Nikita Shulga
a2ac96cae0 [BE] Rectify some references to caffe2 (#140204)
- Rename `tools.build_pytorch_libs.build_caffe2` to `tools.build_pytorch_libs.build_pytorch`
- Delete a number of `if BUILD_CAFFE2` conditions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140204
Approved by: https://github.com/huydhn, https://github.com/r-barnes, https://github.com/atalman
2024-11-09 14:14:20 +00:00
fduwjj
5107d244ee [c10d][Logging] Remove args and kwargs from c10d logging (#140169)
This PR is trying to reland https://github.com/pytorch/pytorch/pull/139804

We no longer want to log args and kwargs directly, because if they contain a tensor or a tensor subclass, converting them to strings can take a long time or may not even be supported.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140169
Approved by: https://github.com/wz337, https://github.com/kwen2501
2024-11-09 13:57:32 +00:00
Yu, Guangye
052b67e2b4 Add torch.version.xpu (#139466)
# Motivation
We add a new attribute, `torch.version.xpu`, to facilitate problem diagnosis and version control.

# Additional Context
It is aligned with `torch.version.cuda` and `torch.version.hip`.
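Usage mirrors its CUDA and HIP counterparts; on builds without the corresponding backend, each attribute is expected to be None:

```python
import torch

# Each attribute is a version string on matching builds, None otherwise.
print(torch.version.cuda)
print(torch.version.hip)
print(torch.version.xpu)
```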

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139466
Approved by: https://github.com/EikanWang, https://github.com/ezyang, https://github.com/atalman, https://github.com/malfet
ghstack dependencies: #139258
2024-11-09 13:31:21 +00:00
Yu, Guangye
8051ee802c Add XPU compiler version control in cmake to keep BC (#139258)
# Motivation
This PR aims to maintain backward compatibility when building PyTorch XPU with the old and new compilers.

# Additional Context
The details are described here. The new compiler (2025.0.0) has some breaking changes compared with the old compiler (2024.1), for example:
1. On Windows, the SYCL library is named `sycl7.lib` with the old compiler but `sycl.lib` with the new compiler.
2. On Linux, in order to support ABI=0, we have to link `libsycl-preview.so` with the old compiler, but we can link `libsycl.so` with the new compiler and get the same ABI compatibility.
3. We added a macro `SYCL_COMPILER_VERSION` so that our new code has good backward compatibility with the old compiler. The new features (Event elapsed_time, memory summary, and device architecture property) introduced by the new compiler are now gated on the macro `SYCL_COMPILER_VERSION`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139258
Approved by: https://github.com/EikanWang, https://github.com/atalman, https://github.com/gujinghui
2024-11-09 13:31:21 +00:00
xinan.lin
191971e01d [AOTI] Introduce an extensibility mechanism for the c shim codegen to make it easy to produce c shims for out-of-tree OP kernels as well. Add c_shim for XPU. (#136742)

### Motivation
The current c shim codegen only produces C wrappers for ops registered in `aten/src/ATen/native/native_functions.yaml`. For the same backend, a portion of out-of-tree ops may not be registered in that file but registered externally instead, for example in `third_party/torch-xpu-ops/yaml/native_functions.yaml`. In that case, the existing codegen cannot extend the c shims it has already produced for the in-tree ops with the out-of-tree ops.

### Design
To extend the c shim with more ops for a backend from out-of-tree, the PR provides a bool option `--aoti-extend` to indicate that the codegen is extending the c shim from out-of-tree ops.
The generated c shim is stored in the `extend` subdirectory, for example:
```
torch/include/torch/csrc/inductor/aoti_torch/generated/c_shim_xpu.h
torch/include/torch/csrc/inductor/aoti_torch/generated/c_shim_xpu.cpp
torch/include/torch/csrc/inductor/aoti_torch/generated/extend/c_shim_xpu.h
torch/include/torch/csrc/inductor/aoti_torch/generated/extend/c_shim_xpu.cpp
```
Example usage:
`python -m torchgen.gen --source-path third_party/torch-xpu-ops/yaml/ --xpu --aoti-extend --update-aoti-c-shim`
- `--xpu`: generate the c shim for XPU
- `--aoti-extend`: extend the in-tree ops (defined in `aten/src/ATen/native/native_functions.yaml`) with out-of-tree ops (defined in `third_party/torch-xpu-ops/yaml/native_functions.yaml`)
- `--update-aoti-c-shim`: always generate c_shim_xpu.h for the extended c shim

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136742
Approved by: https://github.com/EikanWang, https://github.com/desertfire
ghstack dependencies: #139025
2024-11-09 13:19:52 +00:00
xinan.lin
929a647363 [Intel GPU] Support RegisterXPU.cpp codegen and compile for the in-tree XPU structured GEMM OPs. (#139025)

Motivation: There are two parts of the ATen ops for XPU: in-tree ops, like the GEMM-related ops, and out-of-tree ops in torch-xpu-ops. For the in-tree part, since PyTorch uses native_functions.yaml registration and is equipped with convenient codegen capabilities, we want to take advantage of these benefits as well.
At the same time, since AOT Inductor also uses native_functions.yaml to generate c shim wrappers, we also need to enable this mechanism for XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139025
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/desertfire
2024-11-09 13:09:27 +00:00
Andrea Frittoli
0b650c360a Build magma for windows (#139924)
Copy the magma-for-Windows job and script from pytorch/builder: c9aac65e12/.github/workflows/build-magma-windows.yml

The Linux version was moved here in https://github.com/pytorch/pytorch/pull/139888

Fixes #140001

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139924
Approved by: https://github.com/atalman
2024-11-09 09:27:59 +00:00
Boyuan Feng
e2e425b4f3 [CUDAGraph] Add dynamo timer to checkpoint, warmup, and record (#139818)
Summary: Add time logging to cudagraphs, including `create deferred_cudagraphify wrapper`, `warmup`, `record`, and `checkpoint`.

Test Plan:
1. buck2 run fbcode//mode/opt //pytorch/benchmark:run -- resnet50 -d cuda -t train --inductor --pt2-triton-cudagraph

2. Found the result in [scuba table](https://fburl.com/scuba/pt2_compile_events/0oik8nu9).

Differential Revision: D65505659

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139818
Approved by: https://github.com/eellison
2024-11-09 05:27:11 +00:00
cyy
ab55a99283 Use TORCH_DECLARE_XXX (#139952)
Because those files use TORCH_API

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139952
Approved by: https://github.com/ezyang
2024-11-09 04:56:28 +00:00
Kefei Lu
d2d1258b1b Speed up AMD AOT Inductor lowering by memoizing hipify trie to regex logic (#140156)
Summary:
AMD lowering duration is 1.55x longer than on H100. Profiling shows that hipification-related functions take 22% of the overall lowering time.

This diff cuts that time by safely memoizing the trie-to-regex logic. The trick is to incrementally build a state of the trie during trie construction; the state is the hash of all the words added to the trie.
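A minimal sketch of that memoization idea (the real hipify trie differs; this toy version only shows caching the trie-to-regex conversion on a running hash of the inserted words):

```python
import hashlib
import re

class Trie:
    """Toy trie whose trie-to-regex conversion is memoized on a running
    hash of all inserted words (order-sensitive; fine for a sketch)."""

    _regex_cache = {}  # shared cache, keyed by trie state

    def __init__(self):
        self.root = {}
        self._state = hashlib.sha256()

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node["__end__"] = True
        self._state.update(word.encode())  # incrementally track trie state

    def _pattern(self, node):
        end = "__end__" in node
        branches = [re.escape(ch) + self._pattern(child)
                    for ch, child in node.items() if ch != "__end__"]
        if not branches:
            return ""
        alt = "(?:" + "|".join(branches) + ")"
        return alt + "?" if end else alt

    def regex(self):
        key = self._state.hexdigest()
        if key not in Trie._regex_cache:  # the memoization doing the work
            Trie._regex_cache[key] = re.compile(self._pattern(self.root))
        return Trie._regex_cache[key]

trie = Trie()
for word in ["cudaMalloc", "cudaMemcpy", "cudaFree"]:
    trie.add(word)
print(trie.regex().pattern)
```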

Differential Revision: D65659445

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140156
Approved by: https://github.com/ColinPeppler

Co-authored-by: Kefei Lu <kefeilu@meta.com>
2024-11-09 04:28:58 +00:00
Michael Lazos
8b2e3855a9 Make size a property with an assertion (#139794)
Fixes https://github.com/pytorch/pytorch/issues/120568

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139794
Approved by: https://github.com/williamwen42
2024-11-09 03:39:41 +00:00