Previously: https://github.com/pytorch/pytorch/pull/138052, but the implementation is done from scratch, so I opened a new PR.
This implements the ability to save and load profiles of automatic dynamic decisions, so on subsequent runs we can directly make something automatically dynamic. Unlike the previous implementation, this cache is never enabled by default; instead, you have to specify a "job id" that says it's OK to share results. We will be able to automatically populate this id for internal MAST jobs but for generic OSS users you will have to explicitly opt into it.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Differential Revision: [D65065497](https://our.internmc.facebook.com/intern/diff/D65065497)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139001
Approved by: https://github.com/oulgen
The ONNX custom ops registration API.
## Design
1. Create a `custom_translation_table: dict[Callable, Sequence[Callable] | Callable]` parameter for specifying extra translation functions
2. Use a callable as the key to support all possible call_function targets in the fx graph
3. Allow a callable or a Sequence of callables as values.
- When there is a single callable, it is the translation function for the op
- When there is a Sequence of callables, the exporter's dispatcher dispatches to them in order based on input dtypes.
- The translation functions can be a plain python function that calls onnxscript ops (traced), or an onnxscript function.
- Complex input support: We create special type annotations for annotating real representations of complex inputs, which are needed to handle complex computation in the ONNX graph, as we don't have any ops in ONNX that handle complex inputs. The dispatcher will have knowledge of these newly created type annotations and dispatch correctly. The complex functions will be in the same overload pool as the real functions.
```py
torch.onnx.export(
    dynamo=True,
    custom_translation_table={
        torch.ops.aten.add: [overload1, overload2],
        torch.sym_not: sym_not_onnx,
    },
)
```
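For illustration, a translation function can simply be a plain Python function that emits ONNX ops via onnxscript. A minimal hedged sketch (the `sym_not_onnx` name is just the placeholder used in the example above):
```py
# Hedged sketch of a translation function: a plain Python function that emits
# ONNX ops via onnxscript. `sym_not_onnx` is the placeholder name from the
# example above, not an existing PyTorch/ONNX symbol.
from onnxscript import opset18 as op


def sym_not_onnx(x):
    # Translate torch.sym_not by emitting the ONNX Not operator.
    return op.Not(x)
```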
Support for functions that handle complex inputs will come in separate PRs.
fixes https://github.com/pytorch/pytorch/issues/138391
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135403
Approved by: https://github.com/titaiwangms
## This Stack
This stack does the following things to support `xformers`-style, comm-aware Triton kernels:
- Exposes `signal_pad`s as tensors in Python
- Adds a binding for `cuMemsetAsync`
In combination, these aim to give users more flexibility to express custom signaling/synchronization patterns.
## This PR
```python
# Obtain the signal pad of the specified peer rank as a tensor.
# If both shape and dtype are unspecified, the returned tensor will be a
# 1d uint32 tensor, which is most natural for signaling purposes.
symm_mem.get_signal_pad(peer_rank)
# If only shape is specified, it is equivalent to:
# symm_mem.get_signal_pad(peer_rank)[:shape.numel()].view(shape)
symm_mem.get_signal_pad(peer_rank, shape)
# If only dtype is specified, it is equivalent to:
# symm_mem.get_signal_pad(peer_rank).view(dtype)
symm_mem.get_signal_pad(peer_rank, dtype=dtype)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138754
Approved by: https://github.com/weifengpy, https://github.com/lw
Fixes #136559
With the upgrade to NumPy 2, torch falsely filtered out `numpy.random` as unsupported in Dynamo tracing.
This PR changes the filtering rules to include them while keeping the NumPy 1 behavior unchanged.
Before this PR, the following tests failed:
```
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/dynamo/test_functions.py -k FunctionTests.test_numpy_random
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/dynamo/test_unspec.py -k UnspecTests.test_to_tensor
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/test_fake_tensor.py -k FakeTensorTest.test_export_numpy
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/test_fake_tensor.py -k PropagateRealTensorsFakeTensorTest.test_export_numpy_propagate_real_tensors
```
With this PR, the supported/unsupported ops in NumPy 1 are not changed.
For NumPy 2, only the `numpy.random` ops that are already supported with NumPy 1 are added to the supported list.
I used the following scripts to check the differences before and after the change for both NumPy 1 & 2.
The output is empty for NumPy 1 since there is no change.
The output is the list of `numpy.random` ops that are now considered supported for NumPy 2.
```py
from torch._dynamo import trace_rules
import numpy as np


def new_numpy_function_ids():
    unsupported_funcs = {"seed", "ranf", "get_bit_generator", "RandomState", "set_bit_generator", "sample"}

    def is_supported(k, v, mod):
        if not callable(v):
            return False
        if not getattr(v, "__module__", None):
            return True
        if v.__module__ == mod.__name__:
            return True
        if v.__module__ == "numpy.random.mtrand" and mod.__name__ == "numpy.random" and k not in unsupported_funcs:
            return True
        return False

    rv = {}
    for mod in trace_rules.NP_SUPPORTED_MODULES:
        for k, v in mod.__dict__.items():
            if is_supported(k, v, mod):
                rv[id(v)] = f"{mod.__name__}.{k}"
    return rv


def old_numpy_function_ids():
    rv = {}
    for mod in trace_rules.NP_SUPPORTED_MODULES:
        rv.update(
            {
                id(v): f"{mod.__name__}.{k}"
                for k, v in mod.__dict__.items()
                if callable(v)
                and (getattr(v, "__module__", None) or mod.__name__) == mod.__name__
            }
        )
    return rv


rv1 = set(old_numpy_function_ids().values())
rv2 = set(new_numpy_function_ids().values())
for v in rv1 - rv2:
    print(v)
print("****")
for v in rv2 - rv1:
    print(v)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138686
Approved by: https://github.com/lezcano, https://github.com/williamwen42
This is to match the default layout constraint for custom operators. By
default, Inductor should match the stride order of inputs to a triton
kernel.
IF THIS IS BREAKING YOU, PLEASE REACH OUT, especially if it's been
more than two weeks since this landed. You can flip the config locally
as a workaround.
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137064
Approved by: https://github.com/albanD, https://github.com/eellison
This teaches install_config_module (and the underlying code) to
understand Config objects. Additionally, we've added a JK (justknobs) option to
Config, which resolves the JK.
The Config gets stored within the `_ConfigEntry` class and is evaluated
when `__getattr__` is called. If a justknob is set, it'll call
`justknobs_check` to get the result.
Due to preceding work, basically everything works correctly here; we only
had to update a couple of tests and modify the `__getattr__` behaviour.
Note that we are updating the `justknobs_check` function to support a
default option, to make defaults work.
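Conceptually, the resolution works roughly as follows (a stand-alone sketch with a stub `justknobs_check`; not the real `torch.utils._config_module` code, and the justknob name is hypothetical):
```python
# Conceptual sketch only, not the real _ConfigEntry implementation.
def justknobs_check(name, default=True):
    # Stand-in for the internal JK lookup; OSS builds effectively return the default.
    return default


class ConfigEntrySketch:
    def __init__(self, default, justknob=None):
        self.default = default
        self.justknob = justknob  # e.g. "pytorch/compiler:some_feature" (hypothetical name)

    def resolve(self):
        # Evaluated when __getattr__ reads the config attribute.
        if self.justknob is not None:
            return justknobs_check(self.justknob, default=self.default)
        return self.default


entry = ConfigEntrySketch(default=True, justknob="pytorch/compiler:some_feature")
print(entry.resolve())  # True unless the JK overrides it
```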
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138766
Approved by: https://github.com/ezyang
Summary: The main changes to support freezing are:
1) When pickling constant tensors as part of the cache key calculation: If freezing has not been applied, then keep the existing behavior (pickle the metadata and values). If freezing has been applied, then pickle the values if the constant will be inlined; otherwise, consider only the metadata.
2) If freezing has been applied, modify what we store in the cache: Instead of storing the constant attributes in the cache entry, store the _names_ of the constants, and then grab those constants from the GraphModule when we need to attach the attributes to a newly-loaded Python module. Since the cache lookup path loads the Python module, this bullet means we need to thread through a GraphModule argument in several places.
3) Since this feature means that we may need to reload the same Python module path more than once (but attach different constant attributes), I changed PyCodeCache.load_by_key_path to not store an in-memory map of path to module (since there may be more than one). I don't _think_ this will have any effect on performance, however. It's unclear why we were using an in-memory cache here anyway, since this function should only be called once for each module that needs to be loaded.
4) Several tests were removing on-disk PyCodeCache artifacts by iterating over the modules. I made this more straightforward by implementing a cache_clear method that removes the on-disk artifacts. Arguably, this should have been the implementation all along.
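For point (4), the intended test-cleanup pattern looks roughly like this (a sketch, assuming cache_clear removes the on-disk artifacts as described above):
```python
# Sketch of the simplified test cleanup from (4): instead of walking the loaded
# modules and deleting their files by hand, tests call cache_clear(), which is
# assumed here to also remove the on-disk artifacts as described above.
from torch._inductor.codecache import PyCodeCache

PyCodeCache.cache_clear()
```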
Differential Revision: [D63542170](https://our.internmc.facebook.com/intern/diff/D63542170)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136505
Approved by: https://github.com/eellison
# Summary
The AMX ISA based GEMM micro-kernel template for int8 weight-only quantization (BF16 activation, int8 weights) should cache dequantized weights (int8 -> int32 -> fp32 -> bf16) so that they do not have to be dequantized again in subsequent calls to the _inner kernel_ that use the same weights.
This change leverages the fact that even for BF16 x BF16 GEMM template, cache-blocking ensures that `Nr * Kc` weight elements are cached in L1D cache (more info [here](https://static.sched.com/hosted_files/pytorch2024/59/TorchInductor%20CPU%20Backend%20Advancements%20-%20New%20Features%20and%20Performance%20Improvements_20240915.pdf)). Here, `Nr` is the register blocking size for `N` dimension (at the granularity of the GEMM micro-kernel, it's currently also the cache blocking size for `N` dimension, although that may change in the future), and `Kc` is the cache blocking size for `K` dimension.
The figure below is from the document linked above -
<img width="476" alt="image" src="https://github.com/user-attachments/assets/e23e5476-d910-46d1-a9b3-cbf77de76d94">
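For reference, the dequantization chain being cached is, conceptually (plain PyTorch for illustration; the real template emits vectorized AMX/C++ code, and the block sizes below are made up):
```python
# Conceptual illustration of the cached dequantization chain
# (int8 -> int32 -> fp32 -> bf16) for one Nr x Kc weight block.
import torch

Nr, Kc = 32, 256  # illustrative register/cache block sizes, not the real ones
w_int8 = torch.randint(-128, 127, (Nr, Kc), dtype=torch.int8)
w_bf16 = w_int8.to(torch.int32).to(torch.float32).to(torch.bfloat16)
# Subsequent inner-kernel calls that reuse this block consume w_bf16 directly
# instead of repeating the conversion.
```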
## Performance data
Collected on 48 physical cores of one socket of Intel Xeon Platinum 8468H (Xeon SP 4th gen). Intel OpenMP & tcmalloc were preloaded.
|M | N | K | Latency with ATen _weight_int8pack_mm | Latency with codegened templated GEMM (current main branch) | Latency with codegened templated GEMM (this PR) |
|-----|-----|-----|------|----------|----|
|4096|4096|4096| 45.844 ms | 9.322 ms| 5.2181 ms |
|4096|11008|4096| 127.618 ms |24.6258 ms | 13.6046 ms|
|4096|4096|11008| 121.953 ms | 25.4692 ms | 10.2669 ms |
|4096|32000|4096| 478.450 ms| 75.3942 ms | 48.21 ms |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136688
Approved by: https://github.com/jgong5
This PR fixes a compilation time regression manifested in timm_models/hrnet_w18 caused by https://github.com/pytorch/pytorch/pull/136732.
The regression is reproducible locally. The compilation time is a bit noisy, but it's still possible to tell the difference.
```
Before the offending PR
compilation_latency mean=176.022 seconds
compilation_latency mean=176.564 seconds
On the offending PR
compilation_latency mean=180.096 seconds
compilation_latency mean=179.101 seconds
On the fix
compilation_latency mean=173.153 seconds
compilation_latency mean=174.182 seconds
```
(I think the fix being faster than the baseline is due to noise)
The cause of the regression is an inefficiency in `is_user_visible_output()`. Specifically, it used `output_node.args[0].index(node)` to obtain the output index for each node (and we called this for each node twice). The offending PR assumed that `len(output_node.args[0])` is rather small. However, that assumption was proven false by this benchmark (it was 1900+ for timm_models/hrnet_w18).
The fix is to precompute `user_visible_output_strides` once by iterating only over the nodes in `output_node.args[0]`.
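Conceptually, the fix replaces a repeated linear search with a single precomputed map (illustrative Python, not the actual Inductor code):
```python
# Illustrative only. Before: output_node.args[0].index(node) per node, an O(N)
# scan repeated for every node (O(N^2) total). After: build the
# node -> output-index map once by iterating the outputs.
outputs = ["conv_out", "relu_out", "sum_out"]  # stand-ins for output_node.args[0]

idx_slow = {node: outputs.index(node) for node in outputs}  # old pattern
idx_fast = {node: i for i, node in enumerate(outputs)}      # new pattern, one pass

assert idx_slow == idx_fast
```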
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139420
Approved by: https://github.com/ezyang
Fixes https://github.com/pytorch/pytorch/issues/128063.
Now for this snippet
```py
def f(x):
y = torch.sum(torch.sum(x, dim=-1))
z = x / 10.0
z_t = z.t().contiguous().t()
return y, z, z_t
```
Inductor can generate a single kernel for the first reduction and the two pointwise kernels (if loop ordering after fusion is enabled), and the generated kernel reads `x` only ONCE. (Without proper handling, the two pointwise kernels may each access `x` once even if they are fused.)
This PR fixes two subtle bugs regarding LOAF (loop ordering after fusion):
1. When we reorder loops for a FusedSchedulerNode, we check whether each sub-node's sizes match. But some nodes have sizes of `list` type (if the loop is not reordered) while others have sizes of `tuple` type (if the loop is reordered). I could change the upstream code to uniformly use either `list` or `tuple`, but without strong enforcement, future code could break this. So I just convert sizes to a uniform type before comparison (see the sketch after this list).
2. We have a cache for tiling decisions of a BaseSchedulerNode. If we reorder loops for the node, we should invalidate the cache. Otherwise, a stale tiling decision can result in a (very) bad kernel.
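A minimal sketch of the normalization in (1) (illustrative, not the actual scheduler code):
```python
# Illustrative only: compare sub-node sizes after converting both to a common
# type, so a reordered node's tuple sizes and an unreordered node's list sizes
# compare equal.
def sizes_match(a, b):
    return tuple(a) == tuple(b)


assert sizes_match([1024, 768], (1024, 768))
```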
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139376
Approved by: https://github.com/jansel, https://github.com/eellison
Currently, we get all partition ids by iterating over assignment, whose size equals the number of nodes in the graph. But we can get the same result by iterating over partitions_by_id, whose size is much smaller than the number of nodes. Assuming the number of nodes is N and the number of partitions is P, the time complexity decreases from O(N * N) to O(N * P) after this patch (see the sketch below).
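Illustrative sketch of the change (plain Python, not the actual partitioner code):
```python
# Illustrative only: collect partition ids from partitions_by_id (P entries)
# instead of scanning assignment (N entries, one per graph node).
assignment = {"n0": 0, "n1": 0, "n2": 1, "n3": 1, "n4": 1}   # node name -> partition id
partitions_by_id = {0: {"n0", "n1"}, 1: {"n2", "n3", "n4"}}  # partition id -> nodes

ids_old = set(assignment.values())      # O(N)
ids_new = set(partitions_by_id.keys())  # O(P)

assert ids_old == ids_new
```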
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136598
Approved by: https://github.com/tarun292
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Also, update tests to use I (BACKWARD_INPUT) vs B (FULL_BACKWARD)
consistently.
Previously, schedules would issue a 'B' operation and leave it ambiguous
whether that operation should be BACKWARD_INPUT or FULL_BACKWARD,
depending on a separate flag (use_full_backward) passed to the schedule
class, which would determine which behavior was taken at runtime.
Now, use_full_backward is removed and the schedule class is required to
produce unambiguous IR. The logic for 'use_full_backward' is removed
from the runtime.
_validate_pipeline_order is replaced with _simulate_comms_compute. Both
offer similar functionality: validating the correctness of a schedule
IR. 'validate' operates on compute-only IR, while simulate operates on
compute + comm IR. To convert from using validate to simulate, you have
to first insert comm actions via '_add_send_recv'.
'simulate' was inefficiently written before this PR and needed to be
optimized to run quickly for extra large schedules with >32 ranks and
microbatches per rank used in some unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138886
Approved by: https://github.com/H-Huang
As discussed w/ @ezyang offline, one way to de-risk the `specialize_float=False` rollout is to specialize all backed symfloats that we fail to tensorify away. This diff does a few things:
1) It fixes a bug where item_memo gets dropped (due to incorrect epoch invalidation)
2) It updates the tensorify pass to do the backup specialization
This pass was originally part of the [PR](https://github.com/pytorch/pytorch/pull/137782) that flips `specialize_float=False` but we learned that the blast radius is simply too large. We've pivoted to a more milestone driven approach where we learn from the failures of the aforementioned PR and cherry pick fixes into main first. After this current PR lands our strategy is as follows:
1) Integrate turning off specialize float only in the automatic dynamic pass.
2) Put up a canary diff that only turns off specialize float in `backend=eager` mode to sniff out symfloat related bugs in dynamo due to code paths we previously never exercised.
3) Put up a canary diff that only turns off specialize float in `backend=aot_eager` mode to sniff out symfloat related bugs in aotautograd due to code paths we previously never exercised.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138868
Approved by: https://github.com/ezyang
This diff considerably changes the column format of PT2 Compile Events:
- Now, instead of logging one new column per every piece of metadata, we just log a single column, "metadata". This vastly decreases the number of columns we need to log, which should help with retention.
- Now, we only log to scuba for a set of dynamo_timed() events that we actually care about aggregating. To do so, we add a boolean to dynamo_timed() that decides whether or not to log a pt2_compile_event. We'll always log a chromium_event for every dynamo_timed(), but only log a subset of those to scuba.
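Usage-wise, the opt-in looks roughly like this (hedged sketch; the parameter name and context-manager form follow the description above and may differ from the exact signature):
```python
# Hedged sketch: every dynamo_timed() region still emits a chromium event, but
# only regions that opt in via the boolean are logged as PT2 Compile Events.
# Treat the parameter name and usage as illustrative, not a stable API.
from torch._dynamo.utils import dynamo_timed

with dynamo_timed("my_compile_phase", log_pt2_compile_event=True):
    pass  # work we actually care about aggregating
```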
Differential Revision: [D65225598](https://our.internmc.facebook.com/intern/diff/D65225598/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139309
Approved by: https://github.com/oulgen
Summary:
I think we can inplace a buffer if all of the users of said buffer are "inconsequential", defined as having been removed, being completed, or being part of the ancestors set. In particular, this allows LayerNorm to inplace its input buffer.
Implements:
https://github.com/pytorch/pytorch/issues/132826
Test Plan:
New unit test of matmul followed by LayerNorm, make sure there's an inplaced buffer.
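A rough sketch of such a test (illustrative; the `in_out_ptr` naming is an Inductor codegen detail, not a stable API):
```python
# Illustrative sketch of the test plan: compile matmul + LayerNorm and check the
# generated code for an inplaced ("in_out_ptr") buffer. The marker is a codegen
# detail and may differ across backends/versions.
import torch
from torch._inductor.utils import run_and_get_code


def f(a, b, w, bias):
    return torch.nn.functional.layer_norm(a @ b, (256,), w, bias)


args = (torch.randn(64, 256), torch.randn(256, 256), torch.randn(256), torch.randn(256))
_, codes = run_and_get_code(torch.compile(f), *args)
print(any("in_out_ptr" in code for code in codes))
```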
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138383
Approved by: https://github.com/eellison
Reason:
Currently we have multiple traversals over tangents at runtime:
- To check that types and structure are identical to what we guessed during tracing time
- Coerce metadata
- Coerce memory_format
- Unwrap_tensor_subclass
All of them traverse the tree of subclasses via `__tensor_flatten__` calls.
Change:
Do everything in one traversal at runtime (including flattening).
Implementation details:
1. Add memory_format information inside SubclassCreationMeta; for PlainTensors, keep not only the (int) unwrapped_index but the memory_format too. Preparing memory_format is optional (controlled by with_memory_format=True).
2. Remove the unused subclass_utils.create_metadata_for_subclass, which does not have any usages inside torch and would require updating the logic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139068
Approved by: https://github.com/bdhirsh
Used in both the simulator and the add_send_recv pass, the ready_to_schedule
logic works by looking at all the previously scheduled ops on a rank to
see if any of them 'unblocks' the current op to be scheduled. For example,
to schedule a FORWARD op, a previous RECV_F op is needed, unless this is
stage 0 or there is a previous stage on the same rank that ran FORWARD
already.
The old implementation iteratively compared the candidate op to the
previous ops. The new implementation uses set lookups to reduce
complexity. It also maintains the set of previous ops as ops are
scheduled rather than constructing a set on demand.
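Conceptually (illustrative Python, not the actual schedule code):
```python
# Illustrative only: readiness becomes a set-membership test against an
# incrementally maintained set of previously scheduled ops, instead of
# re-scanning the full list (or rebuilding a set) for every candidate.
scheduled_list = []     # old: scanned per candidate
scheduled_set = set()   # new: maintained as ops are scheduled


def schedule(op):
    scheduled_list.append(op)
    scheduled_set.add(op)


schedule("RECV_F(stage=1, mb=0)")

prereq = "RECV_F(stage=1, mb=0)"
ready_old = prereq in scheduled_list  # O(#scheduled ops)
ready_new = prereq in scheduled_set   # O(1)
assert ready_old == ready_new
```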
I did not save benchmark results, but this results in a 10-100x speedup
which is most noticeable for unit tests with artificially huge schedule
IR, the largest of which took longer than 20m before (I never let it
finish) but now takes less than 14s. Most schedules take less than
10ms.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138924
Approved by: https://github.com/H-Huang
ghstack dependencies: #138928, #131762
### Separate dI / dW:
PipelineScheduleRuntime now supports execution of merged FULL_BACKWARD
or separate dI / dW operations.
Separating the B and W may add execution overhead or may be suboptimal
in cases where B and W are 'fused', but it is worthwhile when separating B and W
lets the schedule be more efficient by filling in bubbles. In some
cases, the schedule will still issue B followed by W at certain points;
in those cases we just merge them back into BW ops and execute them as
full backwards rather than executing a B followed by a W.
### V-schedules:
V-schedules have a special case where the last rank has 2 adjacent
stages.
E.g. if rank3 had stage 3 and stage 4, then we should implement direct
transfer of stage3 outputs to stage4 inputs without a
send/recv.
In the scheduling logic, we also must allow scheduling the
stage 4 forward after running stage 3 forward, without expecting a stage
4 RECV_F.
In the runtime, we pass activations between adjacent stages without
using SEND/RECV ops since the stages are on the same rank/process. We
add new APIs to the PipelineStage abstraction for passing the activations
both during forward and backward. Currently the implementation directly
modifies the 'recv buffers' the stage is managing, so the
forward/backward execution logic does not need to know the difference.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131762
Approved by: https://github.com/H-Huang
ghstack dependencies: #138928