Fixes#148938
Context:
In triton 3.3, triton kernels expect a global scratch space arg to be passed in. This is fixed in #148051, which fixed most of the AOTI/cpp_wrapper failures; the fix is to inject a (null) global scratch space arg passed as an argument to all kernels.
But in the case of TMA, we need to call a non-triton-generated function - init1DTMADescriptor. The same `generate_args_decl` function used for calling triton kernels (and modified in #148051 to insert a global scratch space) is used to prepare the arguments to init1DTMADescriptor, and so it had an extra global scratch space arg. Then we'd get a null pointer passed into init1DTMADescriptor, resulting in an IMA later on when the TMA use kernel
This PR: adds an option to `generate_args_decl` to specify whether this is a triton kernel (in which case we should add the global scratch space arg) or not (when we shouldn't add the extra arg).
Note: this doesn't appear in CI because we don't run these tests with Hopper machines in CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149973
Approved by: https://github.com/drisspg
Summary:
Refine the error message if dlopen failed in AOTInductor.
The original error message was ominous, modified to recommend user to
rebuild AOTInductor if needed, otherwise it's fine.
Test Plan:
None. Error message change.
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149812
Approved by: https://github.com/chenyang78, https://github.com/jingsh
Summary:
We might free the active buffer if we free the buffer twice.
Test Plan:
```
LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib
/home/$USER/local/pytorch/build/bin/test_aoti_inference
```
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149810
Approved by: https://github.com/chenyang78
Fixes#139167
This PR:
* uses `named_buffers` to mark static
* Checks that `named_buffers` is of expected type (callable, iterator) before trying to iterate over; if not, we skip this pass
These changes fix the previous errors in dynamo causing to crash (as shown in issue above)
### Unit Test
```
python test/dynamo/test_buffers_override.py
```
Results in:
```
.
----------------------------------------------------------------------
Ran 2 tests in 5.344s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149882
Approved by: https://github.com/anijain2305
Summary: The primary change is to update the autotune-in-a-subproc implementation to avoid using multiprocessing spawn. Spawn (re)executes the toplevel script in the subproc, which can be problematic. The approach here is similar to Triton parallel compile: we Popen a subproc on a controlled entry point and communicate over pipes. That change drove a lot of refactoring in the TuningProcess class, so I took the opportunity to simplify some things, rename some methods, etc.
One other notable change is around the timeout / kill approach. After a timeout, we were previously attempting to stop the subproc in three steps (graceful shutdown, sigkill if graceful fails, sigterm if sigkill fails). I'm gonna argue think that's not useful: 1) The graceful shutdown is never going to work unless the subproc happens to have just completed its task and is ready to receive the next command. 2) If we're going to kill the subproc, let's just take the most aggressive approach and move on as quickly as possible to restarting it rather than waiting to see if previous shutdown attempts succeeded. The only downside that I can find find is maybe a little log spew?, e.g., ` ResourceWarning: subprocess 2987680 is still running`
List of changes:
* Use Popen instead of spawn for the autotuning subprocess.
* Introduced a new entry point `__autotune_main__.py`
* Renamed some TuningProcess methods. For example `shutdown` makes more sense than `terminate` because the latter implies a forced kill.
* Simplified the implementation around benchmarking timeout and how we kill the subproc after a timeout.
* Deprecated the unused timeout configs in `_inductor/config.py`
* Moved `get_ld_library_path` helper to a common utils file.
* Added more unit tests for subproc crashes / timeouts / exceptions, etc.
Test plan:
* New unit tests
* Also ran internally with all combinations of: build mode `opt` and `dev-nosan`, and `buck run` vs. executing the `.par` file directly.
* Made sure the functionality to parallelize autotuning across different GPUs is working (it wasn't clear to me this was behaving the way we wanted it to).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149700
Approved by: https://github.com/aorenste, https://github.com/jansel, https://github.com/eellison
We weren't handling `setattr(tensor_obj, "real", 42)` correctly, because
the attribute is a `GetSetDescriptorType` that has special setter logic.
See added test and comments for more explanations.
This patch makes it so that we graph break in those cases, rather than
resulting in silent incorrectness.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149791
Approved by: https://github.com/mlazos
ghstack dependencies: #149481
Patches over an issue where randomly generated example tensors can cause kernel autotuning to fail, when those tensors would not be possible outputs from previous kernels in the sequence. This fixes a failure in `test_torchinductor_opinfo.py` when run with compile-time autotuning, `test_comprehensive_nanquantile_cuda_float64`.
For clarity, the situation triggering this PR looks like kernels `A -> BCDE -> F` (`BCDE` is fused), where one of the outputs from `A` is a boolean tensor describing some of the input data. Previously, we randomly regenerated that boolean tensor and the input data before passing them to `BCDE`, so that they no longer matched. This caused a `tl.device_assert` call in `BCDE` to fail. With this PR, we reuse the random data input to `A` and the output Boolean tensor, such that they match and pass the device assertion in `BCDE`.
Fixes#147799.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146706
Approved by: https://github.com/desertfire
The "PendingUnbackedSymbolNotFound" error is when an unbacked symbol is created within a piece of code, but this symbol never appears in any of the outputs. I believe the original intention is to help catch incorrectly written meta kernels, where users might've unintentionally created an unbacked symbol but never used it anywhere, but in our case this is intentional. An example is the following test case:
```python
def test_pending_unbacked(self):
class M(torch.nn.Module):
@mark_compile_region
def gn(self, x):
u = x[0].item()
return x * u
def forward(self, x):
for _ in range(4):
x = self.gn(x)
return x
torch._dynamo.config.capture_scalar_outputs = True
torch.compile(M())(torch.randn(8))
```
This fails with the error:
```
torch._dynamo.exc.InternalTorchDynamoError: PendingUnbackedSymbolNotFound: Pending unbacked symbols {zuf1} not in returned outputs (FakeTensor(..., size=(8,)),) .
```
In this case, creating the unbacked symbol is intentional, so we can bypass this using `fake_mode.shape_env.ignore_fresh_unbakced_symbols()`.
Differential Revision: [D71298926](https://our.internmc.facebook.com/intern/diff/D71298926)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149297
Approved by: https://github.com/zou3519
ghstack dependencies: #149296
Summary:
Cache save plan metadata to reduce the collective overhead.
Global plan dedupe and metadata creation are the main overheads on Rank 0. This change saves all this cost for the subsequent saves if the plans do not change. A quick experiment with the 256 rank job, Global step overhead drops by ~99%, from 90s+ to mere 1.5s. 1.5s was mostly spent on creating the checkpoint module directories and near empty collective.
Differential Revision: D71631441
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149785
Approved by: https://github.com/MeetVadakkanchery
This adds a `reduce_scatter` implementation for ProcessGroupGloo. This is a pretty naive implementation as it does 1 allreduce per rank but may be useful for testing in FSDP etc. There was an existing implementation of reduce_scatter_tensor/reduce_scatter_tensor_coalesed that has a very similar implementation but requires a fixed tensor size per rank.
If users find these functions to be too slow we can address them as issues arise.
Gloo now supports all major distributed operations. Quite a few of these were added by @rohan-varma and @yifuwang but they didn't update the support chart. We also have `CUDAWork` variants of most operations so those were also added to the chart.
Test plan:
```
pytest -v test/distributed/test_c10d_gloo.py -k reduce_scatter
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149869
Approved by: https://github.com/fduwjj
### Summary
When the block-size for `N` dimension is `48` for the AMX GEMM micro-kernel for int8 WoQ (BF16 activation, int8 statically quantized weights), the logic for handling the tail is incorrect - we can't always dequantize 32 elements of weights at a time because we may need to dequantize `32` followed by `16` when `block_n` is `48` (for each `K`).
This PR fixes that logic, which was initially exposed with `M=17, N=1024, K=1024`.
This PR also fixes the case of `block_n` being 16.
I had introduced [this bug ](ca9813ea14) after misreading GEMM blockings as `["block_m", "block_k", "block_n"]` instead of `["block_m", "block_n", "block_k"]` (so I had wrongly assumed that `block_n` was always 32).
### Future work
While this PR simply fixes a bug, it's possible to optimize the code pertaining to dequantizing & caching the B buffer - for `block_n` being `16` or `48`, `K` would always be a multiple of 2, so `K * block_n` will always be a multiple of 32. Since `dequantized_B_buf` stores rows contiguously, when `block_n` would be `16` or `48`, we could store 32 BF16 elements at a time instead of storing `16` at a time (when `block_n` is 16), or `32` followed by `16` at a time (when `block_n` is 48). Such an optimization would lower `register -> memory` data movements.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149359
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
This adds a new env var and flag,
autograd_cache_allow_custom_autograd_functions, (env var: `TORCHINDUCTOR_AUTOGRAD_CACHE_ALLOW_CUSTOM_AUTOGRAD`) which allows custom autograd functions into AOTAutogradCache.
@hirsheybar and I worked together to verify that the higher order op AutogradFunctionApply is pure with respect to the dynamo input being passed in, so this *should* be safe. I'm still putting it behind a flag and turning it on slowly, first on an internal model, though. Once we verify that it is correct on the internal model we can work to enable the flag by default.
Differential Revision: [D71633184](https://our.internmc.facebook.com/intern/diff/D71633184/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149751
Approved by: https://github.com/bdhirsh, https://github.com/zou3519
This patch updates existing `test_return_..._subclass` tests in
`test/dynamo/test_subclasses.py`, so that they end up invoking the
`__torch_function__` method of the newly constructed tensor subclass
instnaces.
This exposes a bug in `TensorVariable.method_as_subclass`, where it
forgot to grab the `__func__` out of `__torch_function__`, which led to
the an error down the line.
This patch fixes `TensorVariable.method_as_subclass` by centralizing how
we extract and wrap torch function, in `build_torch_function_fn`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149481
Approved by: https://github.com/jansel
This adds AVG support to ProcessGroupGloo to better support FSDP on CPU. I expect there will be more issues but this is easy enough to support in a naive fashion.
This applies to both reduce and allreduce.
This is a simple SUM + division and may not be the most numerically stable but that's expected. FSDP for low precision data types implements pre/post divide and uses SUM instead.
Test plan:
```
pytest -v test/distributed/test_c10d_gloo.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149781
Approved by: https://github.com/fduwjj
Lazos correctly pointed out this doesn't make sense for compile since
we graph break in compile. This results in tons of unwanted user log
spew. We do want this in export though since it's drastiaclly reduced
the support load for DDEs. This PR does the refactor to keep it in
export but remove it from compile
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149831
Approved by: https://github.com/mlazos
This preserves graph breaks in the case that one graph break directly causes another, e.g. graph breaks in generic context managers.
```python
import torch
class CtxMgr:
def __enter__(self):
return self
def __exit__(self, exc_type, exc_value, traceback):
pass
@torch.compile(backend="eager", fullgraph=True)
def fn():
with CtxMgr():
with CtxMgr():
pass
with CtxMgr():
with CtxMgr():
pass
torch._dynamo.graph_break()
fn()
```
Output:
```
torch._dynamo.exc.Unsupported: Call to `torch._dynamo.graph_break()`
Explanation: User-inserted graph break. Message: None
Hint: Remove the `torch._dynamo.graph_break()` call.
Developer debug context: Called `torch._dynamo.graph_break()` with args `[]`, kwargs `{}`
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/data/users/williamwen/pytorch/playground.py", line 23, in <module>
fn()
File "/data/users/williamwen/pytorch/torch/_dynamo/eval_frame.py", line 664, in _fn
raise e.with_traceback(None) from e.__cause__
torch._dynamo.exc.Unsupported: Graph break under GenericContextWrappingVariable
Explanation: Attempted to graph break in an active context manager(s) that doesn't support graph breaking.
Hint: Move the offending context manager(s) to outside the compiled region.
Hint: This graph break may have been caused by an earlier graph break. Resolving the earlier graph break may resolve this one.
Developer debug context: Active generic context managers: [GenericContextWrappingVariable(CtxMgr), GenericContextWrappingVariable(CtxMgr)]
from user code:
File "/data/users/williamwen/pytorch/playground.py", line 20, in fn
torch._dynamo.graph_break()
Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
```
Note in particular that both graph breaks (torch._dynamo.graph_break and graph break in context manager) are present in the logs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149676
Approved by: https://github.com/jansel, https://github.com/zou3519, https://github.com/anijain2305
While we probably don't want to expand the set of default matmul tunings too much, this is the largest tile size usable by H100 and A100, and is usually the top performing tile size for large matmuls. E.g. on H100 adding this tile size improves perf of multiplying 8192-square matrices from 600->700 tflops. (cuBLAS 12.6 gets 780, so Triton still isn't SOTA, but closer)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149790
Approved by: https://github.com/jansel
Before, we would take the first argument with the largest number of shards, regardless if it had fewer dims than another arg with the same number of shards but more dimensions. This would lead to potentially fewer sharding options
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149721
Approved by: https://github.com/tianyu-l
Fixes#149450
This PR adds fallback support on StaticCudaLauncher for any number of kernel arguments. Above MAX_ARGS, we can do a heap allocation/malloc instead.
For 0 arguments, triton technically does some undefined behavior by allocating a 0 byte array and passing it to cuLaunchKernel. In reality, cuLaunchKernel never accesses the pointer if the singature of the cubin has no parameters, so we can just pass nullptr directly.
We could technically use `alloca` to stack allocate instead of heap allocate, though in my tests it didn't seem to affect runtime performance on benchmarks particularly impressively, and alloca has portability issues, so I'd rather just stick with something simpler for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149442
Approved by: https://github.com/jansel
This PR threads through the correct boxed_forward_device_index from graph_kwargs to CompiledFXGraph.post_compile. This allows us to correctly update BoxedDeviceIndex from cache hits.
We don't actually need to save `boxed_forward_device_index` in CompiledFXGraph because its value is in the cache key, so it always matches to the ambient one anyway. On forward with cudagraphs enabled, derive `boxed_forward_device_index`'s value from `device_idxs`.
Testing:
```
python benchmarks/dynamo/cachebench.py --mode training --benchmark torchbench --model BERT_pytorch --device cuda --repeat 1 --dynamic --output="dynamic.json"
```
Now cache hits properly on FXGraphCache. AOTAutogradCache has a guard failure. Will look into that as a followup.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148130
Approved by: https://github.com/eellison
In the kernelBot leaderboard we support people competing with custom cuda extensions via `load_inline()`, however even on toy kernels this can result in cold starts of up to 90s - this feature is primarily responsible for us having to double our timeout values
I performed an investigation here https://github.com/msaroufim/load_inline_slow and the primary cause was that torch/extension.h and torch/types.h add in about 5,000 header files https://github.com/msaroufim/load_inline_slow/blob/main/header-analysis
So we introduce a mode `no_implicit_headers` which forces users to be explicit about exactly what they want to add. There's a proper test meant to be used in a CLI and a pytest test that's not terribly helpful
Then there's still an open question around what's the most minimal example implementation we can provide. For the baseline kernel we're showing here, it takes about 1 min to compile
1. There's using TensorBase.h (finicky to get right but can get compilation times down to 7s)
2. Just using Tensor.h (down to 15s)
3. Using Shim.h (did not try yet since the syntax is verbose relative to cuda)
This is my take so far https://gist.github.com/msaroufim/079a8d08ffebd0f91a1c2247eb0ce9e0 for a minimal implementation at 15s but @malfet has a simpler one at only 5s
There's more things I'd like to try moving forward like nvrtc and fancier compilation flags. Typical advice around using precompiled headers does not apply to us because we are mostly interested in cold starts where we tear down the machine after running a kernel
Also in a future PR I'd like to fix issue I've noticed with load_inline
1. It needs a force recompilation mode, I was using this quite a bit myself
2. The cache does not take into account changes in environment so the best way to force a recompilation is to change some string in the file
3. Instead of relying on pybind, can we use TORCH_LIBRARY instead
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149480
Approved by: https://github.com/malfet
The main purpose of this PR is to fix offline tuning for ScaledGEMM. The previous UT passed because it was not strict enough. Additionally:
- All the offline tuning tests now do a comparison with the online results to ensure that ParamSignature match.
- We raise an error if submatrices are encountered as this is only supported in online tuning mode.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149677
Approved by: https://github.com/jeffdaily
Summary:
`use_triton_lce_replace_simple_LCE` and `use_triton_lce_replace_normal_LCE`
code is mostly the same, some minor changes to support aten IR
Test Plan:
```
scripts/aetk/aetk -L
%run ~/fbsource/fbcode/caffe2/test/inductor/fb/test_customized_triton_kernel_passes.py
```
will verify the qps after everything done in the stack
Reviewed By: frank-wei
Differential Revision: D68909857
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149702
Approved by: https://github.com/frank-wei
Today, if you run DTensor (or any tensor subclass) under __torch_dispatch__, you will start seeing `CompositeImplicitAutograd` ops show up in the torch_dispatch.
"handling" these ops is trivial: you can just tell them to decompose into their constituent ops. Normally this decomposing happens in autograd, above DTensor, but inference_mode turns autograd off, forcing the subclass to handle the op directly.
It looks like previously we manually added a few CompositeImplicitAutograd entries to DTensor (e.g. linear), but this PR tries to support these ops a bit more generically.
The main difference is that DTensor now needs to check if a given op is `CompositeImplicitAutograd` before attempting to run sharding prop. I ran a quick microbenchmark for the below code with `timeit`, which gave me overhead on the order of ~1us, which is hopefully not too bad for eager mode:
```
def fast_function():
return torch._C._dispatch_has_kernel_for_dispatch_key(op_call.name(), torch._C.DispatchKey.CompositeImplicitAutograd)
import timeit
time_taken = timeit.timeit(fast_function, number=1000)
# printed 0.12..., aka 1.2us
print(f'func={str(op_call)}, time={str(time_taken)}')
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149514
Approved by: https://github.com/kwen2501, https://github.com/albanD, https://github.com/wanchaol
Cudagraphs is careful to not allow any memory recorded to escape globally without having a reference to the tensor. This is because we may later reclaim that memory for a cudagraph recording and we need to mark the tensor as erroring on access. Very occasionally, a stray tensor will have been allocated locally but not yet cleaned up. In this case, we enter the slow path and try to gc.collect() to deallocate it. From a hard to repro internal use case, this was fixed by an additional `cuda.synchronize()`.
i also snuck in an outdated comment and a duplicate line removal.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149741
Approved by: https://github.com/BoyuanFeng, https://github.com/Skylion007
Summary: When we call torch.inference_mode, we seem to skip Autograd key causing the custom op export uses to be not decomposed properly before subclass dispatching starts. We fix this by force desugaring this op at Python key
Test Plan: test
Differential Revision: D71599541
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149698
Approved by: https://github.com/bdhirsh
This is an attempt to fix#119698
I was unable to reproduce the original described problem on the latest trunk but the proposed fix makes sense. Instead of adding locks like the original (unlanded) fix I changed a few of the cache writes to be atomic file swaps (write to temp file, rename file) which should have the same effect without blocking reads.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149654
Approved by: https://github.com/eellison
Summary:
We need to properly fakify torchbind objects, including the ones in graph module attributes, so the resgitered fake implementation works properly.
- _fakify_script_objects in `compile_fx`
- Allow fake torchbind objects in `torchbind_constants`
Remove `node.meta["unbacked_bindings"]` for `aot_compile` in `compile_fx`. Otherwise `ShapeProp` will fail when trying to resolve the `unbacked_bindings` of `with_effect` tokens.
Update `sigrid_transforms_test` to use the latest `torch._inductor.aot_compile` API.
Add a test for `Fakify torchbind objects in compile_fx and add tests for SigridTransformsInstanceTorchBind` in `e2e_test`.
Test Plan:
```
buck run //caffe2/torch/fb/sparsenn:sigrid_test -- -r test_transform_torch_bind
buck run //sigmoid/inference/test:e2e_test_cpu -- -r SigridTransforms
buck2 run mode/dev-nosan sigmoid/inference/ts_migration:pt2i_readiness_main -- --model_id 545017754 --test_suite ads_all --mode test_preproc
```
Differential Revision: D70013257
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149529
Approved by: https://github.com/angelayi
Summary:
1\ The current write item structure does not contain the amount of data that needs to be written.
2\ the planner.item already has a size primitive 'tensor_storage_size'. https://fburl.com/code/7a0gsmw7 But only for tensors.
3\ Right now, the only way the writer layer get hold of this property (fro non tensor data)
first do a lookup in to the actual tensor/bytes
then calculate the nbytes.
This change introduce a way to capture non-tensor data size within a write-plan item.
Test Plan: Existing UT.
Differential Revision: D71599725
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149699
Approved by: https://github.com/MeetVadakkanchery
Create draft_export strategy.
The strategy is added before jit and after strict=True, as the third fallback. Since it is specializing tensors it should not be less robust than the jit trace strategy.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147529
Approved by: https://github.com/titaiwangms
Redundant exception types in `except (PermissionError, OSError):`. Write `except OSError:`, which catches exactly the same exceptions.
https://github.com/pytorch/pytorch/actions/runs/13935844871/job/39141062991
When hipify files, or writing cprofile files, PermissionError is not enough when the file is located in a place that is not writable at all, or other OS errors happened when writing files.
This fix makes the code more robust.
Example error log:
```log
File "deepspeed/ops/adam/fused_adam.py", line 94, in __init__
fused_adam_cuda = FusedAdamBuilder().load()
^^^^^^^^^^^^^^^^^^^^^^^^^
File "deepspeed/ops/op_builder/builder.py", line 540, in load
return self.jit_load(verbose)
^^^^^^^^^^^^^^^^^^^^^^
File "deepspeed/ops/op_builder/builder.py", line 587, in jit_load
op_module = load(name=self.name,
^^^^^^^^^^^^^^^^^^^^
File "torch/utils/cpp_extension.py", line 1597, in load
return _jit_compile(
^^^^^^^^^^^^^
File "torch/utils/cpp_extension.py", line 2031, in _jit_compile
hipify_result = hipify_python.hipify(
^^^^^^^^^^^^^^^^^^^^^
File "torch/utils/hipify/hipify_python.py", line 1167, in hipify
preprocess_file_and_save_result(output_directory, filepath, all_files, header_include_dirs,
File "torch/utils/hipify/hipify_python.py", line 213, in preprocess_file_and_save_result
result = preprocessor(output_directory, filepath, all_files, header_include_dirs, stats,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "torch/utils/hipify/hipify_python.py", line 940, in preprocessor
output_source = RE_QUOTE_HEADER.sub(mk_repl('#include "{0}"', True), output_source)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "torch/utils/hipify/hipify_python.py", line 919, in repl
preprocess_file_and_save_result(output_directory,
File "torch/utils/hipify/hipify_python.py", line 213, in preprocess_file_and_save_result
result = preprocessor(output_directory, filepath, all_files, header_include_dirs, stats,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "torch/utils/hipify/hipify_python.py", line 986, in preprocessor
with clean_ctx.open(fout_path, 'w', encoding='utf-8') as fout:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "torch/utils/hipify/hipify_python.py", line 123, in open
return open(fn, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: [Errno 30] Read-only file system: 'deepspeed/ops/csrc/adam/multi_tensor_apply_hip.cuh'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149464
Approved by: https://github.com/janeyx99
Summary: This diff ports some technique from torch.fx symbolic trace to trace through Python asserts when we run into data dependent symbolic shape assertions, so that we can achieve the same effect as torch dynamo to automatically turn assert into torch.check()s.
Test Plan: buck test mode/opt caffe2/test:test_export -- -r test_python_asserts_with_sym_int
Differential Revision: D71425360
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149444
Approved by: https://github.com/tugsbayasgalan
This hooks up the previous PR to torch.compile. Will add a config flag to hide this behind in a bit, but for now it's useful for testing purposes to have it on by default.
Inductor will automatically choose to use StaticCudaLauncher to launch triton kernels if:
- The kernel is a cuda kernel and inductor can find a cubin file associated with it
- The kernel takes less than 50 arguments
- The kernel doesn't use any special features (launch hooks, large amounts of shared memory)
- The kernel is not user defined (to be supported in a later PR)
We split CompileResult into TritonCompileResult and StaticTritonCompileResult, but have them share implementations of how they exec a python launcher. StaticTritonCompileResult's python launcher has the benefit of a simpler def_args/call_args setup, since it always filters out all constexprs before running, no matter the triton version.
Some key features of StaticTritonCompileResult:
- It is fully serializable
- It stores the minimum amount of stuff, so that later it can be cached easily
- It does not depend on any triton specific types (though it does have various triton metadata).
For now, both TritonCompileResult and StaticTritonCompileResult still `exec` custom python launchers, and use GridExpr. We can change that in the future to simplify if we'd like. For now though, this custom python codegen is good for flexibility when it comes to supporting removal of constexprs, so using it for static launching is nice to not have to pay the cost of removing constexprs at kernel runtime.
Hooking everything up to torch.compile lets me run every unit test with StaticCudaLauncher to make sure that we still pass (even if we bypass StaticCudaLauncher itself). It also lets me check for compilation/runtime performance with these changes.
Fixes#149448
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148890
Approved by: https://github.com/jansel
AOTDispatch doing AOT backward graph preparation does not know real tangents that user will specify when runs backward.
AOTD guesses the tangents. Before - we guessed that memory format of tangents will be as memory format of corresponding outputs. And if specified tangents at runtime are not the same memory format as we guessed during compilation, AOTD does coercion (copy) to guessed memory_format
But as Horace found, there are popular use cases, where the outputs of compiled region will be in specific memory_format. E.g. in 4D tensor transposing dims 1 and 2.
https://github.com/karpathy/nanoGPT/blob/master/model.py#L57
This PR changes the logic, that AOTD expects the same "strideness" of tangents as outputs. As a result it will avoid coercion for the case of transposed dims.
Limitations:
We keep guessing memory_format for:
1/ Dynamic shapes (needs more changes)
2/ Tensor subclasses (needs more changes)
Other changes:
test_torchinductor was always creating contiguous tangents via `torch.randn()`, changing them to be `torch.randn_like()` to compare computation with the same strideness.
(E.g. for cuda float16 strideness affects numerics for fft ops).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144579
Approved by: https://github.com/bdhirsh
# Feature
Fixes https://github.com/pytorch/pytorch/issues/148718 by reordering the tensor dims to `(z, y, x)`.
As a bonus refactor, block pointers no longer needed the `reorder=True` argument to `self.active_range_trees()`. Since this argument is no longer used anywhere, this PR simply deletes it as opposed to updating the logic for the new iteration order.
# Perf impact
It looks like there's a decent perf bump on A100, with cudagraphs enabled. Granted, perf runs seem to have some noise between commits. ([Workflow run](https://github.com/pytorch/pytorch/actions/runs/13914815576).)
Training (all neutral or positive):

Inference (one positive, one very small negative):

As reported in https://github.com/pytorch/pytorch/issues/148718, this PR makes consecutive threads access consecutive memory addresses. This should theoretically give the GPU more opportunities to coalesce loads and stores. From Nvidia's [kernel profiling guide](https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html):
> Local memory is private storage for an executing thread and is not visible outside of that thread. It is intended for thread-local data like thread stacks and register spills. Local memory addresses are translated to global virtual addresses by the AGU unit. Local memory has the same latency as global memory. One difference between global and local memory is that local memory is arranged such that consecutive 32-bit words are accessed by consecutive thread IDs. Accesses are therefore fully coalesced as long as all threads in a warp access the same relative address (e.g., same index in an array variable, same member in a structure variable, etc.).
I couldn't find any information on how coalescing works for other kinds of memory, but the guide mentions it is also supported for accesses to the L2 cache.
> The L2 Request Coalescer (LRC) processes incoming requests for L2 and tries to coalesce read requests before forwarding them to the L2 cache. It also serves programmatic multicast requests from the SM and supports compression for writes.
The [answer to this Stack Overflow post](https://stackoverflow.com/a/5044424) also explains coalescing in a straightforward way. Inductor's current iteration order corresponds to the first (uncoalesced) example in that answer, while the order after this PR corresponds to the second (coalesced) example.
Besides GPUs, this order of accessing data is highly advantageous for systems relying on DMAs, as those are designed to access contiguous spans of memory. This change improves the performance of an elementwise add kernel on an internal model, using internal hardware, by 1.76x. I will share the details with reviewers who are Meta employees via a private channel.
# Test plan
- Updated expected code on CI tests.
- Added a new test checking the {x,y,z}indices and block pointers on a 3D pointwise kernel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149339
Approved by: https://github.com/jansel
Summary:
Fix logging error like:
```
in combinable_nodes
log.debug(
Message: 'ComboKernels: %d template nodes are filtered'
Arguments: (OrderedSet([8]),)
--- Logging error ---
Traceback (most recent call last):
File "/usr/local/fbcode/platform010/lib/python3.10/logging/__init__.py", line 1100, in emit
msg = self.format(record)
File "/usr/local/fbcode/platform010/lib/python3.10/logging/__init__.py", line 943, in format
return fmt.format(record)
File "/data/users/guorachel/fbsource/buck-out/v2/gen/fbcode/854b9ed00d28c5c5/caffe2/torch/fb/model_transform/experimental/benchmark/__mts_gpu_benchmark__/mts_gpu_benchmark#link-tree/torch/_logging/_internal.py", line 818, in format
record.message = record.getMessage()
File "/usr/local/fbcode/platform010/lib/python3.10/logging/__init__.py", line 368, in getMessage
msg = msg % self.args
TypeError: %d format: a real number is required, not OrderedSet
```
encountered in running a prod model + enable combo kernel feature
Test Plan: CI
Differential Revision: D71512220
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149575
Approved by: https://github.com/ColinPeppler
python benchmarks/transformer/score_mod.py --dynamic --max-autotune
previously would crash with
```
"/home/bobren/local/a/pytorch/torch/_inductor/select_algorithm.py", line 2306, in key_of
node.get_device().type,
```
but with this change no longer does
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148991
Approved by: https://github.com/drisspg
WHen we create constraints, we look at the ordering of kwargs according to model signature. But when we trace, we use the ordering that is created based on how user passes in their kwargs. As a result, constraints and dynamic shapes end up having a different order causing issues when they have different dynamic tensor specs.
Differential Revision: [D71478578](https://our.internmc.facebook.com/intern/diff/D71478578)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149528
Approved by: https://github.com/ydwu4
Summary: This adds a version field like the following: `3.10.9+fb (3.10:1dd9be6, May 4 2022, 01:23:45) [Clang 15.0.7 (mononoke://mononoke.internal.tfbnw.net/fbsource 5d1601b0eed7426ac`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149419
Approved by: https://github.com/c00w
Summary:
1\ The current write item structure does not contain the amount of data that needs to be written.
2\ the planner.item already has a size primitive 'tensor_storage_size'. https://fburl.com/code/7a0gsmw7 But only for tensors.
3\ Right now, the only way the writer layer get hold of this property (fro non tensor data)
- first do a lookup in to the actual tensor/bytes
- then calculate the nbytes.
This change introduce a way to capture non-tensor data size within a write-plan item.
Reviewed By: daulet-askarov
Differential Revision: D70497442
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149434
Approved by: https://github.com/MeetVadakkanchery
Summary:
Oftentimes, users complain that a bunch of extra events are prepended to their desired GPU snapshot. This is because they usually attach an OOM logger without knowing and when they go to collect the actual snapshot, it adds all the OOM logger contents. Since OOM and regular snapshot use the same backend, we currently don't have the infra in place to split these snapshots.
As a solution we add a flag to the snapshot frontend to clear out the history when starting the auto-trace record memory history.
A more thorough solution would be to have a user pass in a handle and to have snapshots per handle to seperate the events. However, this would likely be complicated and more work than it is worth as we would have to change the callbacks in the caching allocator and pass these objects between python and cpp.
Test Plan:
See diff below
Differential Revision: D71159720
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149352
Approved by: https://github.com/eqy, https://github.com/aaronenyeshi
Summary: Remove torch.export.export_for_inference, it is redundant and can always be replaced with torch.export.export_for_training() + run_decompositions()
Test Plan: unit tests
Differential Revision: D71069057
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149078
Approved by: https://github.com/tugsbayasgalan
Summary: - For `torch.ops.higher_order.with_effects`'s lowering, we should not extract the items out of an list (i.e. `*result` vs `result`). The `get_attr` nodes consider the result to be in the list format.
Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r test_torchbind_aot_compile
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r list_return
buck run //caffe2/torch/fb/sparsenn:sigrid_test -- -r test_transform_torch_bind # tested together with D70013257
buck run fbcode//mode/dev-nosan //caffe2/test:test_export -- -r test_custom_obj
```
Reviewed By: angelayi
Differential Revision: D71346024
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149510
Approved by: https://github.com/zou3519
Differential Revision: D70022208
- When resolving unbacked symints in ExternKernel for with_effect, we need to ignore the first item in the binding path, because the `example_output` doesn't contain the effect token, but the binding paths do.
- Similarly, `node.meta["val"]` contains the effect token, so when we compute_unbacked_bindings, we need to remove that effect token
- For `torch.ops.higher_order.with_effects`'s lowering, we should not extract the items out of an list (i.e. `*result` vs `result`). The `get_attr` nodes consider the result to be in the list format.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147656
Approved by: https://github.com/angelayi, https://github.com/zou3519
Summary:
We add the aten pattern to optimize big cat node with arbitrary order of inputs to support APS jobs
context: https://docs.google.com/document/d/1G2qFcQu1K7VXbz2uPe0CS2aBirnwtwI_B8lxmlBlAPQ/edit?tab=t.0
Test Plan:
### how to enable
Add the following patterns to the post grad
```
post_grad_fusion_options={
"normalization_aten_pass": {},
"split_cat_aten_pass": {"threshold_to_cat": 10},
},
```
You can tune threshold_to_cat to achieve best performance. If nothing gives, the default value 10 will be used
### unit test
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_aten_passes -- test_split_cat_post_grad
```
Buck UI: https://www.internalfb.com/buck2/9e52168d-c107-4be8-a46b-b9d239f5c50d
Test UI: https://www.internalfb.com/intern/testinfra/testrun/17732923605061752
Network: Up: 112KiB Down: 132KiB (reSessionID-915796e0-4a8f-486a-9f63-afb1e191d24a)
Executing actions. Remaining 0/3 1.0s exec time total
Command: test. Finished 2 local
Time elapsed: 4:57.9s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0
### E2E
baseline
f691990503
proposal
Differential Revision: D71017436
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149027
Approved by: https://github.com/Yuzhen11
Summary: The FlexAttention path generates code that uses this function. Although streams are not used yet in Triton-MTIA, adding this now allows us to not branch out just for MTIA and generate different code.
Test Plan: CI
Reviewed By: chaos5958
Differential Revision: D70072057
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149436
Approved by: https://github.com/chaos5958
Summary:
Recall that we use "ivals" to track intermediate values of mutations during unflattening. Previously, for each such intermediate value, we would create a hidden shared attribute that would be updated / read by respective submodules.
Unfortunately this scheme doesn't work when some but not all of those submodules are swapped out. This is because the swapped in submodules have no knowledge of these hidden attributes. Thus the submodules that are not swapped out end up reading / updating dangling state.
This PR does away with these hidden attributes. Instead, we directly read the underlying buffer or placeholder that was updated, and update those underlying buffers and placeholders in place. This makes the graphs look much closer to their eager origins.
Test Plan: added some tests, ensured existing tests pass
Differential Revision: D71203469
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149206
Approved by: https://github.com/tugsbayasgalan
PR does following
* Turns `inference_mode` to False and `no_grad` for `convert_frame`, if the inference_mode is on globally.
* Turns off inference_mode for fake tensor prop. This ensures that converting from real inference tensor to a fake tensor removes the inference-ness.
* Graph breaks on is_inference and is_inference_mode_enabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149321
Approved by: https://github.com/jansel, https://github.com/zou3519
Summary:
`-Wunused-exception-parameter` has identified an unused exception parameter. This diff removes it.
This:
```
try {
...
} catch (exception& e) {
// no use of e
}
```
should instead be written as
```
} catch (exception&) {
```
If the code compiles, this is safe to land.
Test Plan: Sandcastle
Reviewed By: dtolnay
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149328
Approved by: https://github.com/Skylion007, https://github.com/eqy
This PR includes additional enhancements to TF32 support in TunableOp.
- OpSignature now differentiates between float32 and tf32 data types.
- Offline tuning now supports TF32.
- Unit tests for online and offline tuning of TF32.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149088
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
- Updated HIP flags for Windows (removed non Windows flags on Windows case, added runtime library)
- Set hipcc call for Windows case
- Removed CUDA flags (not used in ROCm) on Windows
- Updated Windows compiler (added case when using ROCm on Windows)
- Fixed path issue in hipify_python
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147382
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Summary: Right now we get Overload names and forward them to the Event List frontend for profiler but we do not forward anything to kineto. This diff checks if there is an overload name for each cpu op and appends it to the name if necessary
Test Plan: Added test in CI
Differential Revision: D71326670
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149333
Approved by: https://github.com/aaronenyeshi
Minor refactor to trace.py
* Removed `_strict_export_lower_to_aten_ir` in favor of just `_strict_export` and `_non_strict_export`
* Matched the APIs of `_strict_export` and `_non_strict_export`
* Instead of a `lower_to_aten_callback` which is a callable, or `dispatch_tracing_mode`, both functions take in a `_to_aten_func` which can be either `_export_to_aten_ir_make_fx` or `_export_to_aten_ir`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149240
Approved by: https://github.com/pianpwk
In the old exporter we allow users to define a symbolic() method to bypass JIT tracing for a block of logic. We can allow users to do similar things by creating symbolic ops at export.
This PR implements `torch.onnx.ops.symbolic` and `torch.onnx.ops.symbolic_multi_out` to allow users to create onnx nodes symbolically with pt2 & fx. The custom pytorch ops were designed such that the attributes are encoded to be part of a valid fx op. Users provide shape and dtype for the meta function to produce the currect fake tensor during export.
An example is

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148905
Approved by: https://github.com/titaiwangms
Fixes https://github.com/ROCm/hip/issues/3764.
Fixes and improvements to CUDA->HIP flag conversion for CPP extensions
- Log flag conversion for debugging purposes.
- Fix cases where it should not touch the -I flags or cases where CUDA appears more than once by replacing only the first instance.
- Fix case where nvcc key may not exist
- Fix case where hipify should ignore flag values and only touch the flag itself
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149245
Approved by: https://github.com/jeffdaily
Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai>
Fix for https://github.com/pytorch/pytorch/issues/144431.
Improves perf from 0.29963893827160504 -> 0.0396331632970453.
In split reductions, we view an input tensor as a single dimension, then reduce over it. When we are reducing over a tensor which has a dimension other than the last dimension as the dense dimension, we should iterate over the dense dimension first in our re-indexing.
This pr also gives evidence for general need of reduction tiling, e.g. for cooperative reduction handling of this..
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147229
Approved by: https://github.com/jansel
Summary:
Avoid in-place update and deepcopy during dudpe. Deepcopy becomes prohibitively expensive with models having a huge number of FQNs. This was manifestd in the Ads 2K experiment as well. Here are the results from the TextRay model in Mitra:
#### Control job with deepcopy regression:
First save ~24.8s
Global step latency is ~7-8s
Test job with the new fix to avoid deepcopy:
First save is ~21s
global step latency ~2s
Test Plan:
```
buck test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/distributed/checkpoint:test_planner
```
https://www.internalfb.com/intern/testinfra/testrun/3940649945104822
Differential Revision: D71245218
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149320
Approved by: https://github.com/MeetVadakkanchery
Summary: The FlexAttention path uses `_maybe_exchange_device`, so it will be needed eventually for MTIA as well.
Test Plan: `buck2 test fbcode//mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- test_maybe_exchange_device`
Reviewed By: chaos5958
Differential Revision: D70072063
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149340
Approved by: https://github.com/chaos5958
This is a trivial rule that for most cases isn't needed, but if we want to consider that the input data is actually `Shard(0)` (instead of `Replicated()` as it is currently assumed), then we need this rule.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149253
Approved by: https://github.com/XilunWu
This change does 2 important things:
(a) Instead of relying on IValue type as source of truth, we use the schema as the source of truth, which is important as IValue types are overloaded and can ambiguously convert incorrectly. For example, a MemoryFormat will look like an int + get converted to an int64_t vs a MemoryFormat!
(b) This PR expands support for many more types to encompass way more schemas, e.g., Optional, Device, dtype, etc. The main win from this PR is the ability for aoti_torch_call_dispatcher to call TensorFactory ops like ones_like/empty_like!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149052
Approved by: https://github.com/albanD
Doing this removes the need of collecting `id` and therefore facilitates serialization. It also improves readability with recompilations. Earlier, recompile message will just show the `id`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149228
Approved by: https://github.com/jansel
found an issue while running `python torchgen/fuse/gen_patterns.py`
exact error:
```shell
Traceback (most recent call last):
File "/Users/mayankmishra/Desktop/non-IBM/pytorch/torchgen/fuse/gen_patterns.py", line 19, in <module>
joint_graph.lazy_init()
File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/pattern_matcher.py", line 2096, in lazy_init
result = fn()
File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/fx_passes/joint_graph.py", line 53, in lazy_init
_pad_mm_init()
File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/fx_passes/pad_mm.py", line 905, in _pad_mm_init
gen_register_replacement(
File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/pattern_matcher.py", line 1584, in gen_register_replacement
pat = _serialize_pattern(
File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/pattern_matcher.py", line 1539, in _serialize_pattern
file_template = get_file_template()
File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/pattern_matcher.py", line 1513, in get_file_template
if isinstance(attr, type) and issubclass(attr, (PatternExpr, _TargetExpr)):
File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/abc.py", line 123, in __subclasscheck__
return _abc_subclasscheck(cls, subclass)
TypeError: issubclass() arg 1 must be a class
```
This PR fixes this issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147723
Approved by: https://github.com/aorenste
Co-authored-by: Aaron Orenstein <aorenste@meta.com>