Basically adds native _IntWrapper support to dynamo. Here's my process of trying to make symint input support work on dynamo, and how I ended up with this approach [(doc)](https://docs.google.com/document/d/1GvNRQd8BnxlMay_hrEVgEta6VUeUW_hcFeRuB7q1nDY/edit?tab=t.0).
What I did was, before passing inputs to dynamo.export, I first wrap them with a class, `_IntWrapper`. When processing dynamic shapes, I will then add the corresponding dynamic shape specification to the `dynamism` field stored on the `_IntWrapper`. If there is no dynamism specified, then this will get unwrapped back to an integer. When dynamo tracing, when we encounter an `_IntWrapper`, we will convert this to a symint if the dynamism was specified as `Dim.DYNAMIC/AUTO`. Dynamo will then trace a graph that contains symint inputs, which will get passed to AOTAutograd and so on.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152677
Approved by: https://github.com/pianpwk
DSD currently will pop tensors if these tensors are on Meta device. This forbid the use cases that users would like to let DCP to directly initialize the tensors when loading.
This PR also removes test/distributed/checkpoint/e2e/test_pipeline.py which is based on the above feature that is not realistic and is not used anywhere.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153185
Approved by: https://github.com/mori360
**Summary**
This PR adds a new op, `onednn.qbatch_norm2d`, which accepts uint8 inputs on CPU device (instead of QuantizedCPU).
The new ops are implemented with AVX512 instructions and it provides similar performance as its counterpart for QuantizedCPU device `quantized.batch_norm2d`.
The new op supports output dtypes other than uint8 (fp32, fp16 and bf16 are supported).
**Test plan**
```
pytest test/quantization/core/test_quantized_op.py -k test_int8_batch_norm_onednn
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152811
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168, https://github.com/jgong5
ghstack dependencies: #152411
Fixes#147336
## Context
NCU analysis of the fp8 flex attention perf issue in #147336 showed an unexpected increase in shared memory access bank conflicts when loading the V tensor from HBM to SRAM.
Bringing this to the attention of triton developer @davidberard98 he identified the memory layout of the tensor in HBM to be causing non-pipelined loads into SRAM, causing the slowdown.
To summarize:
In flex attention when performing the FP8 GEMM `softmax_scores @ V` the right operand V must be in column-major memory layout. However, the `tl.load` of V blocks from HBM to SRAM cannot be pipelined if the V tensor isn't column-major in HBM already, leading to substantial performance degradation.
This is because triton does not perform async copies with the `cp.async` PTX instruction if the number of contiguous bytes is less than 4 (see [here](81f93f2c8e/lib/Dialect/TritonGPU/Transforms/Pipeliner/PipeliningUtility.cpp (L403))).
i.e., when loading 4 bytes of contiguous data from a tensor stored in row-major in HBM, we have to perform 4 separate non-contiguous writes to SRAM to place those bytes in their new location in the col-major layout in SRAM. Thus the load is not a candidate for pipelining w/ cp.async and just moves data to registers then performs a series of single byte stores.
## Fix summary
- To fix this, we should enforce memory layouts for Q, K, V in FlexAttention when fp8 is being used, to ensure they each exist in HBM in the necessary memory layout to facilitate pipelined loads into SRAM ahead of the FP8 GEMMs
## Benchmarks
Rerunning the repro we see fp8 runtime is reduced from 120% of bf16 to 76% of bf16 runtime.
Before fix:
```
(flex) [danvm@devgpu007.eag6 ~/ml-perf-tools/flex_attention (main)]$ rm -rf /tmp/torchinductor_${USER}; python profile_flex.py --bf16 --fp8
2025-05-11 19:07:33,402 - flex_bench - INFO - Running benchmark: bf16
2025-05-11 19:07:35,885 - flex_bench - INFO - bf16: 424.87228804347734 us
2025-05-11 19:07:35,893 - flex_bench - INFO - Running benchmark: fp8e4m3
2025-05-11 19:07:37,319 - flex_bench - INFO - fp8e4m3: 515.714000000001 us
```
After fix:
```
(flex) [danvm@devgpu007.eag6 ~/ml-perf-tools/flex_attention (main)]$ rm -rf /tmp/torchinductor_${USER}; python profile_flex.py --bf16 --fp8
2025-05-11 17:34:38,223 - flex_bench - INFO - Running benchmark: bf16
2025-05-11 17:34:41,157 - flex_bench - INFO - bf16: 423.4662032967036 us
2025-05-11 17:34:41,167 - flex_bench - INFO - Running benchmark: fp8e4m3
2025-05-11 17:34:42,917 - flex_bench - INFO - fp8e4m3: 326.3694803493453 us
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153357
Approved by: https://github.com/ngimel, https://github.com/davidberard98
Summary:
Some new errors have been showing up on the PT2 dashboard with
```
Invalid type for lengths: Expected BlobReference or torch.Tensor, got: Tensor(shape: torch.Size([10]), stride: (1,), storage_offset: 0)
```
This is caused by [this piece of code](https://fburl.com/code/5nbi9on7) which maps over a set of nodes (in this case type `IDListFeatureListField`) and turns the results into strings to be displayed later. However during pytree.tree_map we call pytree.tree_unflatten which will call the class's init function, which calls `assert_blob` (https://fburl.com/code/h3ainrn9). Because we've mapped over the values and converted them to strings, the assert_blob fails.
I initially thought to disable the assert_blob while tracing (D74684309) but then I think we should actually flatten the list first. Because tlparse will expect just a string out outputs instead of the actual structure.
Test Plan: `buck2 run mode/opt sigmoid/inference/ts_migration:pt2i_readiness_main -- --test_suite ads_all --mode test_full_model --model_id 542947220` fails with something else 😅
Differential Revision: D74744326
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153627
Approved by: https://github.com/yiming0416
C++ Reducer is silently incorrect under CA, its implementation is no-oping the collective. I'm guessing that it was no-op'd because in DDP + python reducer, the C++ reducer is still being initialized.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152735
Approved by: https://github.com/fegin
ghstack dependencies: #153300, #152689
I slap disable on the recomputation hook, otherwise the partitioner may save less/more activations and mismatch with the expected eager count in checkpoint. See code comment `Note: [compiled autograd and checkpoint unpack hook]`.
This fixes all non-nested checkpointing tests. I also wrap nested checkpointing tests, and a few of them still fail.
This also seems to fix all PYTORCH_TEST_WITH_DYNAMO checkpointing tests except for `TestAutograd.test_checkpointing_without_reentrant_custom_function_works`. For those tests, it looks like we fail to HOPify the checkpointed region and when the backward executes the unpack hooks, dynamo tried to trace them. This messed up the internal state tracking of checkpointing, some raising the _StopRecomputationError and others raising the same count mismatch error as CA.
FIXES https://github.com/pytorch/pytorch/issues/127115
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153300
Approved by: https://github.com/jansel
This enables Gloo CUDA when used with a backend that supports GPUDirect which currently is only the IBVERBS backend.
This requires some changes to Gloo which are in https://github.com/pytorch/gloo/pull/441
Since we're now depending on gloo_cuda we need to split ProcessGroupGloo into two pieces, one with the CPU bits (libtorch_cpu) and one with CUDA kernels in libtorch_cuda. This unfortunately requires some major refactoring as some CPU code is shared across both.
The gloo submodule is updated to depend on the new Gloo changes
Test plan:
```py
import os
import time
transport = "TCP"
#transport = "IBVERBS"
os.environ["GLOO_DEVICE_TRANSPORT"] = transport
rank = int(os.environ["RANK"])
os.environ["CUDA_VISIBLE_DEVICES"] = str(rank)
ibv = "mlx5_0:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_9:1,mlx5_10:1,mlx5_11:1".split(",")[rank]
ibv_name, ibv_port = ibv.split(":")
os.environ["TORCH_GLOO_IBV_NAME"] = ibv_name
os.environ["TORCH_GLOO_IBV_PORT"] = ibv_port
os.environ["TORCH_GLOO_IBV_INDEX"] = "3"
import torch
import torch.distributed as dist
dist.init_process_group("gloo")
rank = dist.get_rank()
# initial sanity check
#device = "cpu"
#t = torch.zeros(10, device=device)
#dist.all_reduce(t)
#print("sanity complete")
device = "cpu"
iters = 10
warmup_iters = 2
for nelem in [10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000]:
t = torch.zeros(nelem, device=device)
torch.cuda.current_stream().synchronize()
for i in range(warmup_iters):
dist.all_reduce(t)
torch.cuda.current_stream().synchronize()
start = time.perf_counter()
for i in range(iters):
dist.all_reduce(t)
torch.cuda.current_stream().synchronize()
dur = (time.perf_counter() - start)
qps = iters/dur
bandwidth_gb = t.nbytes * iters / dur / 1e9
gb = t.nbytes / 1e9
if rank == 0:
print(f"{transport=} {device=} {iters=} {nelem=} {qps=} {gb=} {bandwidth_gb=}\n", end="")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153406
Approved by: https://github.com/fduwjj
Summary:
Currently things are hardcoded to only work with nccl backend. Extend it
to allow NCCL + custom plugin backend.
The split-specific methods/attributes have not been added to the base
Backend and Options as some of them are specific to backend implementations.
Instead, explicit checks have been added to the split_group method for the
expected methods and attributes.
I am open to making them part of base Backend based if folks prefer.
Test Plan:
CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152175
Approved by: https://github.com/shuqiangzhang, https://github.com/kwen2501
We handle fake tensor caching in two ways:
1. If the inputs have no symbols (SymInt, etc) then we cache on the FakeTensorMode.
2. If the inputs have symbols then we cache on the ShapeEnv.
This way the symbols in the inputs and outputs are associated with the guards in place at the time of the call.
However - it's possible to have an op where there are no symbols in the inputs but there is an unbacked symbol in the output. In this case we shouldn't cache at all because what would that really mean?
So this PR changes the caching behavior so that if there's a symbol in the output which doesn't come in some way from the input then we refuse to cache that op.
Added a test which checks for this case.
While in there I also did a couple other related changes:
1. Added negative caching - if we see that an (op, args) failed to cache previously we don't even bother trying to cache it again.
2. Reworked the inner behavior of _cached_dispatch_impl a little to make it more clear which bits we expect to be able to throw _BypassDispatchCache and add some comments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153034
Approved by: https://github.com/masnesral, https://github.com/tugsbayasgalan
## Summary
- The unit test `pytest test/distributed/test_symmetric_memory.py -k test_fused_scaled_matmul_reduce_scatter_scatter` was not running for some reason when #149247 was merged, giving false green CI signals. When it was ran manually recently, the test failed, highlighting a bug causing incorrect numerics when `scatter_dim=1`.
- This PR fixes the bug, which was related to how we swap dims 0<=>scatter_dim at the beginning of the custom op (for more efficient cross-device data movement I believe), then swap it back prior to reduction.
## Test plan
- I confirmed the unit test `pytest test/distributed/test_symmetric_memory.py -k test_fused_scaled_matmul_reduce_scatter_scatter` is now passing.
- I confirmed e2e training w/ torchtitan looks good ([logs](https://www.internalfb.com/phabricator/paste/view/P1812054188))
- I analyzed the tlparse to verify the fused_all_gather_matmul and fused_scaled_matmul_reduce_scatter both appear at least once in the post grad graphs ([tlparse](https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpVbUsdG/dedicated_log_torch_trace_65oh3qj_.log/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000))
## Next steps
1. I think for async TP `fused_scaled_matmul_reduce_scatter` we may only need `scatter_dim_after_maybe_reshape` and not `orig_scatter_dim` after all. I can confirm this and refactor if it is the case.
2. This op is specifically designed for async TP, and many of the arguments don't make sense for a user trying to use this as a standalone op. IMO we should have separate standalone custom op without all the extra function args and internal logic that doesn't apply to non-async TP cases.
3. In a follow up PR I want to add shape annotations to each line (e.g. `# (B, T, H)` etc) to make this easier to debug in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153595
Approved by: https://github.com/fegin
The immediate motivation is to make map support match
ExecuTorch so we can delete ExecuTorch-specific mapping functions, but
this should also straightforwardly improve performance.
Testing: there is existing coverage for this in
vec_test_all_types.cpp. Verified that it really does cover the newly
enabled "don't convert through float" paths by temporarily adding a
TORCH_INTERNAL_ASSERT(false).
Differential Revision: [D73802126](https://our.internmc.facebook.com/intern/diff/D73802126/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152366
Approved by: https://github.com/malfet
ghstack dependencies: #152365
Change trigger for auto format to be pull_request b/c the reusable action used gets the pr number from the pull_request event context, but only run it if ciflow/autoformat is attached to the PR. Tested this on a different PR, and it seems to be working
Changed tag name because ciflow prefixed labels have special handling
Also change to run on all files so it will mimic the normal CI lintrunner call, and because lintrunner, either by itself or using -m mergebase can miss some things. Idk if it would miss for format, but it does for checking lint. Format seems to take shorter than normal lint. I don't know if the comment about making suggestions on non edited file changes is a concern. I didn't really test this part
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153289
Approved by: https://github.com/atalman, https://github.com/malfet
Summary: To support running a subset of these tests with the remote autotuning utilities, I've split out some of the tests into separate classes so that I can derive from the "main" TestMaxAutotune class when creating new tests for remote. I'm not 100% sure what some of these tests do, so please suggest if another grouping / naming might make more sense. The remaining tests in TestMaxAutotune all smelled relevant to me.
Test Plan: existing unit tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153486
Approved by: https://github.com/eellison
This makes it more obvious what's going on when TCPStore shuts down while waiting on a remote key and also shows the remote address.
Test plan:
```
[W514 18:33:36.536327028 TCPStore.cpp:138] [c10d] recvValueWithTimeout failed on SocketImpl(fd=3, addr=[localhost]:34658, remote=[localhost]:1234): Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash?
```
```py
import os
rank = int(os.environ["RANK"])
import time
from torch import distributed as dist
store = dist.TCPStore(
host_name="localhost",
port=1234,
is_master=(rank == 0),
wait_for_workers=False,
)
time.sleep(1)
print("starting")
if rank != 0:
store.get("foo")
else:
time.sleep(1)
print("done")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153586
Approved by: https://github.com/XilunWu
Change logging.error to logging.exception to log additional information when relevant. A few places have slipped in logging.errors in try except since I last did a clean up here and the rule is stabilized so I am enabling it codebase wide. I have NOQA'd much of our custom exception stack trace handling for RPC calls and distributed and tried to a fix a few errors based on whether we immediately reraised it or if we didn't print any exception handling where it could be useful.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153473
Approved by: https://github.com/albanD, https://github.com/cyyever
### Summary
Fixes#148494
Explicitly prefetch the cache lines of the next `B` block to accelerate int8 WoQ (BF16 activation, int8 statically quantized weights) GEMM for small `M` dimension.
Some of this code (outer loops of the GEMM) is being ported over from Intel Extension for PyTorch. The macro-kernel* and the micro-kernel* are essentially the same, but optionally prefetch a block of B. Templatization is being used to prevent branching causing a slowdown due to unnecessary prefetching.
\* - in [BLIS](https://dl.acm.org/doi/10.1145/2764454) parlance
### Performance data with BS 1
Machine: 32 cores of one socket of a Intel Xeon SP Gen 5 machine
| Model | input tokens | output tokens | next-token latency before this PR | Next-token latency after this change | Speedup |
|-----------|-------------|-----------------|--------------------------------------|------------------------------------------|-----------|
|GPT-J | 128 | 128 | 42 ms | 38 ms | 9.52 % |
| GPT-J | 1024 | 1024 | 48 ms | 45 ms | 6.25 % |
|LLaMA 3.1 8B Instruct | 128 | 128 | 52 ms | 47 ms| 9.61% |
|LLaMA 3.1 8B Instruct | 1024 | 1024 | 57 ms | 53 ms| 7.01% |
While the input shapes of GEMMs corresponding to linear for next-token computation remain the same in case of different number of input & output tokens, the difference in next-token latency is due to attention for those cases
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149373
Approved by: https://github.com/leslie-fang-intel, https://github.com/Xia-Weiwen
Co-authored-by: Xia Weiwen <xia.weiwen@hotmail.com>
https://github.com/pytorch/pytorch/pull/129001#discussion_r1645126801 is the motivation for the whole stack of PRs. In `torch/__init__.py`, `torch._C.Type` shadows `from typing import Type`, and there is no type stub for `torch._C.Type` in `torch/_C/__init__.pyi`. So we need to use `from typing import Type as _Type`. After enabling [Generic TypeAlias (PEP 585)](https://peps.python.org/pep-0585) in the `.pyi` type stub files, we can use `type` instead of `typing.Type` or `from typing import Type as _Type`.
------
- [Generic TypeAlias (PEP 585)](https://peps.python.org/pep-0585): e.g. `typing.List[T] -> list[T]`, `typing.Dict[KT, VT] -> dict[KT, VT]`, `typing.Type[T] -> type[T]`.
- [Union Type (PEP 604)](https://peps.python.org/pep-0604): e.g. `Union[X, Y] -> X | Y`, `Optional[X] -> X | None`, `Optional[Union[X, Y]] -> X | Y | None`.
Note that in `.pyi` stub files, we do not need `from __future__ import annotations`. So this PR does not violate issue #117449:
- #117449
------
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150727
Approved by: https://github.com/aorenste
ghstack dependencies: #150726
I slap disable on the recomputation hook, otherwise the partitioner may save less/more activations and mismatch with the expected eager count in checkpoint. See code comment `Note: [compiled autograd and checkpoint unpack hook]`.
This fixes all non-nested checkpointing tests. I also wrap nested checkpointing tests, and a few of them still fail.
This also seems to fix all PYTORCH_TEST_WITH_DYNAMO checkpointing tests except for `TestAutograd.test_checkpointing_without_reentrant_custom_function_works`. For those tests, it looks like we fail to HOPify the checkpointed region and when the backward executes the unpack hooks, dynamo tried to trace them. This messed up the internal state tracking of checkpointing, some raising the _StopRecomputationError and others raising the same count mismatch error as CA.
FIXES https://github.com/pytorch/pytorch/issues/127115
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153300
Approved by: https://github.com/jansel