Instead of collecting local results with all_gather_object followed by a local reduction, this change switches to a single all_reduce with the MIN reduction op to compute the final equals result.
This change is needed to enable the LocalTensor work (all_gather_object introduces challenges for DTensor and LocalTensor integration).
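A minimal sketch of the pattern described above, assuming an already-initialized process group (the helper name and shapes are illustrative, not the actual DTensor code):
```python
import torch
import torch.distributed as dist

def all_ranks_equal(local_equal: bool) -> bool:
    # Each rank contributes 1 if its local comparison passed, else 0.
    flag = torch.tensor([1 if local_equal else 0], dtype=torch.int64)
    # A single all_reduce with MIN yields 1 only if every rank reported 1,
    # replacing the previous all_gather_object + local reduction.
    dist.all_reduce(flag, op=dist.ReduceOp.MIN)
    return bool(flag.item())
```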
topic: not user facing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164999
Approved by: https://github.com/ezyang
Context is in https://www.internalfb.com/excalidraw/EX519691 and https://docs.google.com/document/d/1qnuXLZk_GYt_PksHTwkn7L2ELRDnYlIRPkHAlXTyuhw/edit?tab=t.0.
For the Autoparallel initial trace, we want to trace the graph with global shapes. But for the local_map region, we are forced to trace with the expected local tensors. To the tracers this looks odd, because it's a plain tensor input (representing DTensor's full tensor .to_local()) that we need to "redistribute".
After hacking together a miserable version that had cross-key dependencies, @ydwu4 proposed this simpler approach of overriding the fake key. This means the shape conversion is invisible to all dispatch keys above fake, which covers all current tracing mechanisms. It manifests as the joint graph for the HOP body being traced with local shapes:
```python
# HOP forward, note local shapes (10, 80)
class GraphModule(torch.nn.Module):
    def forward(self, primals_0: "f32[10, 80]"):
        # No stacktrace found for following nodes
        view: "f32[800]" = torch.ops.aten.view.default(primals_0, [-1]); primals_0 = None
        add: "f32[800]" = torch.ops.aten.add.Tensor(view, 10); view = None
        view_1: "f32[10, 80]" = torch.ops.aten.view.default(add, [10, 80]); add = None
        return (view_1,)

# HOP backward, note local shapes (10, 80)
class GraphModule(torch.nn.Module):
    def forward(self, tangents_0: "f32[10, 80]"):
        # No stacktrace found for following nodes
        clone: "f32[10, 80]" = torch.ops.aten.clone.default(tangents_0); tangents_0 = None
        return (clone,)
```
while the rest of the graph is still traced with global shapes:
```python
# Parent graph joint, note global shapes (80, 80)
class inner_f(torch.nn.Module):
    def forward(self, primals, tangents):
        primals_1: "f32[80, 80]"; tangents_1: "f32[80, 80]";
        primals_1, tangents_1, = fx_pytree.tree_flatten_spec([primals, tangents], self._in_spec)
        # File: /home/xmfan/core/a/pytorch/test/higher_order_ops/test_local_map.py:597 in forward, code: return fn(x)
        call_local_map = torch._higher_order_ops.local_map.call_local_map(primals_1); primals_1 = None
        getitem: "f32[80, 80]" = call_local_map[0]; call_local_map = None
        call_local_map_1 = torch._higher_order_ops.local_map.call_local_map(tangents_1); tangents_1 = None
        getitem_1: "f32[80, 80]" = call_local_map_1[0]; call_local_map_1 = None
        return pytree.tree_unflatten([getitem, getitem_1], self._out_spec)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164340
Approved by: https://github.com/ydwu4
ghstack dependencies: #164296, #164321, #164419, #164420
Reviewed GPT-5 summary:
**Summary / Goal**
Add validation that partitioned forward/backward graphs respect placements.
**Details**
- Validates placement alignment in local_map.
- The HOP's autograd key gets called when we are tracing the joint, so we need to validate (a rough sketch of this check is shown below):
  - the inputs to the HOP's fwd gm (typically these are the Dynamo-rewritten inputs)
  - the inputs to the HOP's partitioned fwd/bwd gms
  - the outputs of the HOP's partitioned fwd/bwd gms
**Motivation**
Catch mismatch errors earlier, improve debugging.
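A rough, hypothetical sketch of the shape of this check (helper name, signature, and placement handling are illustrative, not the actual local_map HOP code):
```python
import torch.fx as fx

def validate_io_counts(gm: fx.GraphModule, in_placements, out_placements) -> None:
    # Count placeholder inputs and output entries of the traced graph and
    # check that they line up with the declared placements.
    n_in = sum(1 for n in gm.graph.nodes if n.op == "placeholder")
    out_node = next(n for n in gm.graph.nodes if n.op == "output")
    n_out = len(out_node.args[0])
    if n_in != len(in_placements):
        raise RuntimeError(f"expected {len(in_placements)} inputs based on placements, found {n_in}")
    if n_out != len(out_placements):
        raise RuntimeError(f"expected {len(out_placements)} outputs based on placements, found {n_out}")

# e.g. a toy traced body with one input and one output:
def body(x):
    return (x + 1,)

validate_io_counts(fx.symbolic_trace(body), in_placements=[None], out_placements=[None])
```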
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164420
Approved by: https://github.com/ezyang
ghstack dependencies: #164296, #164321, #164419
Reviewed GPT-5 summary:
**Summary / Goal**
Fix inconsistent variable naming for forward/backward graphs.
**Details**
- Those methods are actually for both fw and bw graphs now that we reuse the same op for fw/bw
**Motivation**
Improves clarity, avoids confusion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164419
Approved by: https://github.com/bdhirsh
ghstack dependencies: #164296, #164321
Reviewed GPT-5 summary:
**Summary / Goal**
Improve error reporting when local_map subgraph input/output counts mismatch placement info.
**Details**
- Adds descriptive runtime error messages.
**Motivation**
Helps debug local_map misalignments.
```python
AssertionError: Expecting 2 inputs to local_map function based on placements, but found 1. If the count matches for eager, Dynamo may have flattened inputs to the function or found additional tensors used via closures. Please adjust the input placements to match what the traced graph sees:
class GraphModule(torch.nn.Module):
    def forward(self, l_args_0_: "f32[8, 8, 16]"):
        # File: /home/xmfan/core/a/pytorch/test/higher_order_ops/test_local_map.py:523 in mismatch_input, code: return x + scalar, scalar
        child: "f32[8, 8, 16]" = l_args_0_ + 10; l_args_0_ = None
        return (child,)
.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164321
Approved by: https://github.com/ezyang, https://github.com/mlazos
ghstack dependencies: #164296
Reviewed GPT-5 summary:
**Summary / Goal**
Add a utility to compute expected local tensor sizes and strides under *even sharding* in dtensor.
**Details**
- New function in `torch/distributed/tensor/_utils.py`.
- Computes local sizes/strides given global shape, mesh, and placements.
- Enforces divisibility of global dimension by mesh size (strict even sharding).
- Complements `compute_global_tensor_info`.
**Motivation**
Ensures correctness for stride/layout computations in distributed tensors; a rough sketch of the computation follows below.
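A minimal, self-contained sketch of the kind of computation described, assuming strict even sharding; the function name and plain-list arguments are hypothetical stand-ins for the actual helper, which works with a DeviceMesh and placements:
```python
from typing import List, Sequence, Tuple

def local_size_stride_even_sharding(
    global_size: Sequence[int],
    global_stride: Sequence[int],
    mesh_sizes: Sequence[int],   # size of each mesh dimension
    shard_dims: Sequence[int],   # tensor dim sharded on each mesh dim, -1 = replicated
) -> Tuple[List[int], List[int]]:
    local_size = list(global_size)
    stride_div = [1] * len(global_stride)
    for mesh_size, dim in zip(mesh_sizes, shard_dims):
        if dim < 0:
            continue  # replicated on this mesh dim: nothing changes
        # strict even sharding: the global dim must divide evenly by the mesh size
        if local_size[dim] % mesh_size != 0:
            raise ValueError(
                f"dim {dim} of size {local_size[dim]} is not divisible by mesh size {mesh_size}"
            )
        local_size[dim] //= mesh_size
        # dims laid out "outside" the sharded dim (larger stride) shrink by the same factor
        for j in range(len(global_stride)):
            if global_stride[j] > global_stride[dim]:
                stride_div[j] *= mesh_size
    local_stride = [s // d for s, d in zip(global_stride, stride_div)]
    return local_size, local_stride

# e.g. a contiguous (80, 80) tensor sharded on dim 0 over an 8-way mesh:
# prints ([10, 80], [80, 1])
print(local_size_stride_even_sharding([80, 80], [80, 1], [8], [0]))
```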
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164296
Approved by: https://github.com/ezyang
This PR fixes the condition
```
if arg_signatures is None and self.kernel_type == "cpp" or "extern"
```
which is interpreted as
```
if (arg_signatures is None and self.kernel_type == "cpp") or ("extern"):
```
so it always evaluates to `True` (the string literal `"extern"` is truthy). Based on the context, the intended condition was
```
if arg_signatures is None and (self.kernel_type == "cpp" or self.kernel_type == "extern"):
```
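Assuming those are the intended semantics, an equivalent membership test reads a bit more clearly. The helper below is a hypothetical standalone illustration, not the actual Inductor code:
```python
def needs_cpp_or_extern_signature(arg_signatures, kernel_type):
    # Corrected condition: both alternatives now actually compare kernel_type.
    return arg_signatures is None and kernel_type in ("cpp", "extern")

assert needs_cpp_or_extern_signature(None, "extern") is True
assert needs_cpp_or_extern_signature(None, "triton") is False
assert needs_cpp_or_extern_signature(["i64"], "cpp") is False
```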
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165033
Approved by: https://github.com/Skylion007
## TODO
Check on multi indices
```Python
@cute.jit
def score_mod(tSrS_ssa, b_idx, h_idx, q_idx, kv_idx, buffers):
in_ptr4 = buffers[0]
tmp0 = tSrS_ssa
tmp1 = b_idx
tmp2 = h_idx
tmp3 = cute.make_fragment(1, cutlass.Int32)
tmp4 = tmp3.store(32*tmp1 + tmp2)
tmp5 = cute.make_fragment(1, cutlass.BFloat16)
tmp6 = tmp3[0]
tmp7 = tmp5[0] = (in_ptr4[tmp6])
tmp8 = (tmp5.load()).to(cutlass.Float32)
tmp9 = (tmp0 + tmp8)
tSrS_ssa = tmp9
return tSrS_ssa
```
I don't think that
```
tmp4 = tmp3.store(32*tmp1 + tmp2)
tmp5 = cute.make_fragment(1, cutlass.BFloat16)
tmp6 = tmp3[0]
tmp7 = tmp5[0] = (in_ptr4[tmp6])
```
is right, since this tmp6 value will be larger than the actual index dim (in this case it's B) -> see if it's possible to 1d index
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162031
Approved by: https://github.com/v0i0
ghstack dependencies: #161118
This fixes AOTAutograd rms_norm not being bitwise equivalent to eager, because it avoids a decomposition. You can still force the decomposition by putting it in the dispatch table, but if eager mode wouldn't have decomposed (because it went to the fused kernel), we now preserve the fused call by default.
This largely reverts https://github.com/pytorch/pytorch/pull/103275/ for view ops. This means that in inference mode we could hit the wrong C++ kernel; if this occurs we should just SymInt'ify the C++ kernel.
Another neat side effect of this change is that Inductor's generated kernels for rms_norm now have rms_norm in their name.
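A hedged illustration of the property being targeted, assuming a recent build where `torch.nn.functional.rms_norm` is available (this is not the PR's actual test):
```python
import torch
import torch.nn.functional as F

x = torch.randn(8, 64)
w = torch.randn(64)

eager_out = F.rms_norm(x, (64,), weight=w)
compiled_out = torch.compile(F.rms_norm)(x, (64,), weight=w)

# Bitwise equality with eager is the property the change targets when the
# fused call is preserved instead of decomposed.
print(torch.equal(eager_out, compiled_out))
```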
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164939
Approved by: https://github.com/bdhirsh
Previously, when torch.are_deterministic_algorithms_enabled() is True, Inductor would
- skip autotuning pointwise kernels
- pick a fixed (and quite arbitrary) config for reductions
This PR changes the behavior to:
- for pointwise kernels, we still do autotuning
- for reduction kernels, we use the recently added heuristic to pick a config
A sketch of the user-facing toggle this keys off of is shown below.
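The toy compiled function here is illustrative only; it just shows the deterministic-algorithms flag that the Inductor behavior above checks:
```python
import torch

torch.use_deterministic_algorithms(True)
assert torch.are_deterministic_algorithms_enabled()

@torch.compile
def f(x):
    # A reduction: with this change, Inductor picks its config via the
    # heuristic instead of a fixed arbitrary one; pointwise kernels still autotune.
    return (x * 2).sum(dim=-1)

print(f(torch.randn(32, 128)))
```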
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164905
Approved by: https://github.com/jansel, https://github.com/v0i0
ghstack dependencies: #164801, #164532, #164904
Verify the deterministic mode with torch.compile benchmark scripts.
Here is what my testing script does (pasted at the end):
- run a model in default mode and save its result
- run the model again in default mode, but distort the benchmarking results; compare with the saved result
- do the above again in deterministic mode
I tried to test a few models:
- BertForMaskedLM and GoogleFnet: I can repro the numeric change by distorting the benchmark result in the default mode. The non-determinism is gone in the deterministic mode.
- DistillGPT2: I cannot repro the numeric change by distorting the benchmarking result in the default mode. That does not surprise me much; a reduction order change does not always cause a numeric change.
```
model=GoogleFnet
export TORCHINDUCTOR_WRITE_ARE_DETERMINISTIC_ALGORITHMS_ENABLED=0
export TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 # disable autotune cache
export TORCHINDUCTOR_FX_GRAPH_REMOTE_CACHE=0
export TORCHINDUCTOR_FX_GRAPH_CACHE=0
export TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_shunting/
export TORCHINDUCTOR_BENCHMARK_KERNEL=1
export TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1
export INDUCTOR_TEST_DISABLE_FRESH_CACHE=1
# Non deterministic mode
# --float32 rather than --amp to make it easier to repro non-deterministic
echo "Save results for non-deterministic mode"
python benchmarks/dynamo/huggingface.py --backend inductor --float32 --accuracy --only $model --training --disable-cudagraphs --save-model-outputs-to=/tmp/saved-non-deterministic.pkl
echo "Compare results with distorted benchmarking in non-deterministic mode"
TORCHINDUCTOR_DISTORT_BENCHMARKING_RESULT=inverse python benchmarks/dynamo/huggingface.py --backend inductor --float32 --accuracy --only $model --training --disable-cudagraphs --compare-model-outputs-with=/tmp/saved-non-deterministic.pkl
echo "Save results for deterministic mode"
TORCHINDUCTOR_DETERMINISTIC=1 python benchmarks/dynamo/huggingface.py --backend inductor --float32 --accuracy --only $model --training --disable-cudagraphs --save-model-outputs-to=/tmp/saved-deterministic.pkl
echo "Compare results with distorted benchmarking in deterministic mode"
TORCHINDUCTOR_DETERMINISTIC=1 TORCHINDUCTOR_DISTORT_BENCHMARKING_RESULT=inverse python benchmarks/dynamo/huggingface.py --backend inductor --float32 --accuracy --only $model --training --disable-cudagraphs --compare-model-outputs-with=/tmp/saved-deterministic.pkl
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164904
Approved by: https://github.com/jansel, https://github.com/v0i0
ghstack dependencies: #164801, #164532
After lean export, we might want to be able to restore the original FQNs. This PR refactors one util function in export that sort of does this. Note that strict_export has some complicated logic for updating the graph signature as well, which we don't want. I think we can gradually refine this util by handling constants, non-persistent buffers, etc., and change how strict_export does it today.
Differential Revision: [D83687844](https://www.internalfb.com/diff/D83687844)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164401
Approved by: https://github.com/avikchaudhuri
Previously we hardcoded the assumption in cuDNN that the inputs would be dense, which breaks when, e.g., the user is chunking tensors, yielding non-contiguous inputs.
A new test checks this when `TORCH_CUDNN_SDPA_NESTED_TENSOR_ENABLED=1` is set in `test/test_transformers.py`.
One issue I noticed is that the old gating of nested tensors in `sdp_utils.cpp` seems to be a no-op: all of the inputs are reported as "dense" by the time that function is called in the nested tensor tests in `test/test_nestedtensor.py -k sdpa`.
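A hedged sketch of the kind of non-contiguous input the fix targets (not the actual new test; assumes a CUDA build):
```python
import torch
import torch.nn.functional as F

# Chunking a packed QKV projection yields non-contiguous (non-"dense") views.
qkv = torch.randn(2, 8, 128, 3 * 64, device="cuda", dtype=torch.float16)
q, k, v = qkv.chunk(3, dim=-1)  # each [2, 8, 128, 64], non-contiguous
assert not q.is_contiguous()

out = F.scaled_dot_product_attention(q, k, v)
```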
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164958
Approved by: https://github.com/Skylion007, https://github.com/drisspg
Summary: We have an internal user for whom caching broke because the paths that are unzipped are probably different per host. We can't think of a use case where a path change matters when the file content has not changed, so this part is removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165020
Approved by: https://github.com/oulgen
In aot_stage2_autograd:
Before calling fw_compiler, we run pre_compile for the following wrappers:
* FakifiedOutWrapper
* FunctionalizedRngRuntimeWrapper
After, we run post_compile for the following wrappers:
* EffectTokensWrapper
* AOTDispatchSubclassWrapper
* FunctionalizedRngRuntimeWrapper
* FakifiedOutWrapper
In aot_stage2_inference:
Before calling inference compiler, we run pre_compile for the following wrappers (same as above):
* FakifiedOutWrapper
* FunctionalizedRngRuntimeWrapper
After, we run post_compile for the following wrappers (different than above):
* FunctionalizedRngRuntimeWrapper
* FakifiedOutWrapper
* EffectTokensWrapper
* AOTDispatchSubclassWrapper
This PR makes both run the post_compiles in the same order.
Differential Revision: D84213657
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165016
Approved by: https://github.com/zhxchen17, https://github.com/bdhirsh
This commit makes several cleanup changes to MHA.cpp, the main
one of which is removal of shared_ptr from MHAGraphCache as the
cache does not actually intend to share ownership. The changes are:
1. Remove shared_ptr from MHAGraphCache
2. Remove template arguments from MHAGraphCache
3. Remove unnecessary optional<shared_ptr<...>> vars
4. Change some functions with auto return type to the actual type
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164895
Approved by: https://github.com/eqy
The Windows cpp tests take ~1 hour according to logs. Each has run_test called on it individually, so I batched them together so there's just one run_test call for all of them. I believe it now takes ~30 min. I turned off TD since I don't think cpp tests are included in TD.
As always with batch scripts, I'm not sure if the errorlevel/error-surfacing handling is correct.
This code was written with a lot of help from ChatGPT and Copilot.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164861
Approved by: https://github.com/huydhn