This PR makes it so that we don't crash due to logging if we invoke AOTAutogradCache/FXGraphCache without using dynamo. This is preparation for supporting certain VLLM use cases where they store graph modules and have special handling in conjunection with the caches.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150423
Approved by: https://github.com/oulgen
This patch effectively ignores traceable_tensor_subclasses, allowing
Dynamo to always try tracing into the `__torch_function__` of tensor
subclass. This helps us with 2 things:
1. allowing users to directly benefit from better compilation of tensor
subclass, by just upgrading pytorch, without having to change legacy
library code (see earlier patches in the stack for examples).
2. potentially exposing more issues in compiling tensor subclass, so we
can get signals and improve them.
As a consequence, it exposed and fixes 2 subtle bugs:
1. In `build_torch_function_fn`, we could get
`torch._C._disabled_torch_function_impl` because we have a
`Parameter` subclass without `__torch_function__` override or if we
have a tensor subclass with `__torch_dispatch__` override. We graph
break on this for now, and plan to add support -- the logic for
simulating `torch._C._disabled_torch_function_impl` is already in
`SuperVariable`, we just need to reuse it.
2. Sometimes we create `SyntheticLocalSource` and need to remove all the
guards installed on it, but we only removed the ones whose source
_is_ the created synthetic source `s`, but forgot about chained
source like `s.foo`, this showed up as
`SYNTHETIC_LOCAL['tmp_0'].__torch_function__.__func__`.
Differential Revision: [D71906141](https://our.internmc.facebook.com/intern/diff/D71906141)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149792
Approved by: https://github.com/jansel, https://github.com/mlazos
ghstack dependencies: #149482, #149483, #149484
This patch effectively ignores traceable_tensor_subclasses, allowing
Dynamo to always try tracing into the `__torch_function__` of tensor
subclass. This helps us with 2 things:
1. allowing users to directly benefit from better compilation of tensor
subclass, by just upgrading pytorch, without having to change legacy
library code (see earlier patches in the stack for examples).
2. potentially exposing more issues in compiling tensor subclass, so we
can get signals and improve them.
As a consequence, it exposed and fixes 2 subtle bugs:
1. In `build_torch_function_fn`, we could get
`torch._C._disabled_torch_function_impl` because we have a
`Parameter` subclass without `__torch_function__` override or if we
have a tensor subclass with `__torch_dispatch__` override. We graph
break on this for now, and plan to add support -- the logic for
simulating `torch._C._disabled_torch_function_impl` is already in
`SuperVariable`, we just need to reuse it.
2. Sometimes we create `SyntheticLocalSource` and need to remove all the
guards installed on it, but we only removed the ones whose source
_is_ the created synthetic source `s`, but forgot about chained
source like `s.foo`, this showed up as
`SYNTHETIC_LOCAL['tmp_0'].__torch_function__.__func__`.
Differential Revision: [D71906141](https://our.internmc.facebook.com/intern/diff/D71906141)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149792
Approved by: https://github.com/jansel, https://github.com/mlazos
ghstack dependencies: #149482, #149483, #149484
Summary:
This will log a wait counter with for backward compile and fixes weirdness with nested context managers.
Since the old wait counters added through dynamo_timed were never created with the nesting issue. I am also changing the key nomenclature from `pytorch.dynamo_timed` to `pytorch.wait_counter`. We want to use the same nomenclature, to make it easy to find keys.
Reviewed By: jamesjwu
Differential Revision: D72032055
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150235
Approved by: https://github.com/jamesjwu, https://github.com/masnesral
We weren't handling `setattr(tensor_obj, "real", 42)` correctly, because
the attribute is a `GetSetDescriptorType` that has special setter logic.
See added test and comments for more explanations.
This patch makes it so that we graph break in those cases, rather than
resulting in silent incorrectness.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149791
Approved by: https://github.com/mlazos
ghstack dependencies: #149481
Summary: This adds a version field like the following: `3.10.9+fb (3.10:1dd9be6, May 4 2022, 01:23:45) [Clang 15.0.7 (mononoke://mononoke.internal.tfbnw.net/fbsource 5d1601b0eed7426ac`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149419
Approved by: https://github.com/c00w
PR does following
* Turns `inference_mode` to False and `no_grad` for `convert_frame`, if the inference_mode is on globally.
* Turns off inference_mode for fake tensor prop. This ensures that converting from real inference tensor to a fake tensor removes the inference-ness.
* Graph breaks on is_inference and is_inference_mode_enabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149321
Approved by: https://github.com/jansel, https://github.com/zou3519
This allows for each device type to check current devices for Triton compatibility and ensure their Triton backend is present.
This PR replaces the `has_triton()` global method which was previously used for this task, and moves the initial check for each Inductor backend on to their associated `BaseScheduler` subclass. This means that other backends, such as Halide, can also implement their own availability checks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139171
Approved by: https://github.com/jansel
This gives us a decent proxy for how big of a graph we functionally had to parse.
Note that this is a cummulative counter. If people feel strongly, I can either write into the dynamo_timed datasets with metrics contexts, or clear the counters / write a counter per frame id as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147149
Approved by: https://github.com/jansel
Summary: this adds some new dynamo_timed calls in cudagraph_trees, primarily with the aim to add cudagraph-related timing to scuba. Things to note:
* Uses the changes in https://github.com/pytorch/pytorch/pull/141919 to log "runtime" entries
* The logging for chromium/tlparse/scuba relies on us providing a compile_id since it's not available in the environment. A lot of the changes here are just passing around the compile_id
* I believe the spirit of the scuba logging is to capture the overheads of `torch.compile`. Therefore, I'm not adding _every_ dynamo_timed to scuba. For example, "run_eager" is the first real execution of the inductor graph -- it's not cudagraph overhead, per se. Watch out for the two instances of `dynamo_compile_runtime_column_us="runtime_cudagraphify_time_us"`. Those are the spots I believe are _extra_ overhead we'd contribute to torch.compile.
Test Plan:
`python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --print-compilation-time --repeat 5 --cold-start-latency --only dcgan`:
* tlparse: https://fburl.com/21yrdn8h
* scuba: https://fburl.com/scuba/dynamo_compile/sandbox/wt90wnjz
`python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --print-compilation-time --repeat 5 --cold-start-latency --only nanogpt`
* tlparse: https://fburl.com/r9mp7uiv
* scuba: https://fburl.com/scuba/dynamo_compile/sandbox/1nvx94re
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143220
Approved by: https://github.com/eellison
Softmax need do some preparation work that access the input tensor in two passes
- compute amax of each row
- compute (x - amax).exp.sum for each row
When the row size is large, cache can not hold all the active data and accessing the input multiple passes increases execution time since the kernel is membw bounded.
Online softmax uses a customized reduction to compute max and sum at the same time by accessing the data in one pass. Check this paper for more details ( https://arxiv.org/abs/1805.02867 ).
Also here is an online softmax kernel generated by inductor as a reference: https://gist.github.com/shunting314/67ae4fffd45d4f2753c781780332fa54
## Microbenchmark
- `TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDUCTOR_ONLINE_SOFTMAX=0 DO_PERF_TEST=1 python test/inductor/test_online_softmax.py -k test_softmax` : without online softmax
- eager_ms=6.671296119689941
- opt_ms=8.06931209564209
- `TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDUCTOR_ONLINE_SOFTMAX=1 DO_PERF_TEST=1 python test/inductor/test_online_softmax.py -k test_softmax`: with online softmax
- eager_ms=6.634047985076904
- opt_ms=6.230591773986816
Ideally, online softmax should save about 2ms here. We saves about 1.84ms in practice.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127011
Approved by: https://github.com/jansel
Summary: Gather the compilation time of individual triton kernels and log them to dynamo_compile:
* Time compilation in `_worker_compile_triton` and pass back to the main process and logged from `get_result()`.
* Added a way to track the "top N" (or N most-expensive compiles) in the metrics_context. I did this because I doubt we really care to capture potentially thousands of kernel compile times. That would be problematic for scuba logging anyway, so let's limit the number we track from the beginning. Arbitrarily chose 25 for now.
* Format the list of compile times as a json string before logging.
Test Plan:
`python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --print-compilation-time --repeat 5 --cold-start-latency --only nanogpt`
Scuba: https://fburl.com/scuba/dynamo_compile/sandbox/nc4dzm3r
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147022
Approved by: https://github.com/jamesjwu
## Context
> **Note:** `mark_traceable` got renamed to `nonstrict_trace` after
> offline discussion. The reasons are (1) it aligns with `torch.export`'s
> `nonstrict` notion, and (2) it's more definitive in behavior suggestion.
1. [Overall Design](https://docs.google.com/document/d/1O-dR2ZQaJQVt_v67AVcDCw2yJLtqgkZFwoXK0buEWRg/edit?tab=t.0)
2. [Dynamo graph representation with `torch._higher_order_ops.flat_apply`](https://docs.google.com/document/d/1YHl5nPTJvYeCPE5TO9uA18DPWNgUYGE4gCn6bFvXcBM/edit?tab=t.0#heading=h.xtw3hhbro4gn)
## Summary
This patch adds a `torch._dynamo.nonstrict_trace` decorator, which
currently is an enhanced version of `torch._dynamo.allow_in_graph` (see
docstring for their differences). Specifically, this patch focuses on
the UI and functionality prototyping/plumbing.
The main enhancement is supporting more input types, and the
implementation challenge lies in reconstructing the input objects from
Dynamo `VariableTracker` (while accounting for buffered side-effects and
guards). This patch takes a middle-ground (simple implementation with a
bit of user labor), by
1. asking the user to provide pytree registration for non-proxy-able
input types,
2. letting Dynamo trace through `pytree_flatten` (which accounts for
buffered side-effects and guards automatically),
3. and passing in the TreeSpec as a graph attribute constant into
`torch._higher_order_ops.flat_apply` (which unflattens the inputs and
invokes the underlying function).
## Next Steps
In subsequent patches, we will try to support the following:
- annotating on class method
- reads to global tensors
- inputs that contains `pytree.register_constant`-ed instances.
- function as input
- more output types (e.g., any pytree-registered type)
- `torch.nn.Module` as inputs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146367
Approved by: https://github.com/zou3519
ghstack dependencies: #146714
This is for "for some large number Z, make sure the error messages are readable English." - beginning to audit all `unimplemented` sites and making sure that all messages are at least English-readable. Hints may not necessarily be provided.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147385
Approved by: https://github.com/jansel
Part of the required fix for https://github.com/intel/torch-xpu-ops/issues/1264.
To support `roi_align`, torchvision uses `is_compile_supported` in `torch/_dynamo/utils.py` to compile a non-deterministic version of the op for backwards passes. This PR adds XPU device to the supported compile devices.
The `is_compile_supported()` util function has extremely limited usage, only being used in `torchvision.ops.roi_align` and `torch.utils._content_store.has_storage()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147541
Approved by: https://github.com/guangyey, https://github.com/jansel
Co-authored-by: lei,zhenyuan <zhenyuan.lei@intel.com>
Current implementation reads as: we will only actually use the "python_reducer" config if the DDP forward is compiled. Otherwise, we will silently fallback to C++ reducer + no DDPOptimizer.
I'm changing this behavior to always use the python reducer if the config is specified.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147123
Approved by: https://github.com/fegin
This PR adds support for list subclasses. Among other things are
1) Tracking the mutations on internal vts like `_dict_vt` and `_list_vt` using sources. This helps identify if there was a mutation in the underlying data structures, and we need to reconstruct it.
2) `UserDefinedObjectVariable` now has a new method - `is_modified` which `side_effect` infra relies upon to check mutations in the underlying vts (like `_dict_vt`).
3) `reconstruction` logic ensures that we use `dict.__getitem__` and `list.__getitem__` methods. This is super important because we don't want to call the overridden `__getitem__` methods.
If this PR is hard to review, please let me know. I can break it into several small PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146819
Approved by: https://github.com/StrongerXi, https://github.com/jansel
`get_top()` is really confusing when talking about a stack, because it can mean the most recently started event on the stack or the toplevel event in perfetto(which displays the stack upside down). Rename to `get_outermost` and fix the bug associated with it, so that it returns the correct value out of the stack.
Running nanogpt now puts `guard_latency_us` correctly in the `dynamo` event:
```
tlp python benchmarks/dynamo/torchbench.py --backend inductor --device cuda --only nanogpt --amp --cold-start-latency --print-compilation-time --training --performance 2>&1 --dynamic-shapes | tee out.log
```
<img width="1281" alt="image" src="https://github.com/user-attachments/assets/4eeb371a-4d81-415a-acc4-7d303a4b2a93" />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146649
Approved by: https://github.com/masnesral, https://github.com/anijain2305
Logging branches based on RecompileLimitExceeded or not. If we exceed the limit, we fallback to eager before even trying to analyze the frame. We handle RecompileLimitExceeded outside of the try/catch/finally that edits the metrics context:
72405b0c0f/torch/_dynamo/convert_frame.py (L908-L935).
dynamo_config and recompile_reason are both known before we raise the RecompileLimitExceeded, so we can add them with the rest of the "common" metrics. which are logged on metric_context decorator exit and is always called
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146544
Approved by: https://github.com/masnesral
This reintroduces the change backed out by #145393 and fixes the underlying problem.
Although using a BuiltinVariable was better than nothing when we saw a GenericAlias it had problems if there was a graph break and we had to reconstruct the original python code which BuiltinVariable did as a simple `list` instead of a `list[int]`.
This changes it to use a TypingVariable instead and then teaches TypingVariable how to reconstruct.
Original commit changeset: 77b9193acb23
python test/dynamo/test_repros.py ReproTests.test_graph_break_on_jit_isinstance
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145554
Approved by: https://github.com/anijain2305
ghstack dependencies: #145551, #145552, #145553
Issue:
https://github.com/pytorch/pytorch/issues/144888
Torchbench of timm lcnet_050 model fails on accuracy in case of `--frezing` `--inference` `--bfloat16`
`res_error==0.12`
If to turn off convolution inductor constant folding - `res_error==0.016`
`float16 error ~ 0.00669`
`float16 without conv folding ~ 0.0018`
convolution folding results in increase of error almost at one order of magnitude.
I think we should revisit and try to do something to improve the accuracy for conv folding.
E.g. For example doing conv folding at compilation time with float64?
At the moment I am adding counters to identify if convolution folding happened, and in case of bfloat16 and conv_folding - increase multiplier to the max level (10) to pass accuracy test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145623
Approved by: https://github.com/eellison
This bit me while I was trying to debug some trace issues.
In general this config is already quite large when dumping, so adding
more fields doesn't make it significantly worse.
Also a number of the items we are type checking for (except the test
configs), don't even show up. Primarily this will help us when debugging
rocm, halide, and trace configs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144700
Approved by: https://github.com/ezyang
**Problem statement**: I want to be able to centralize and simplify the process by which people add columns/data to existing spans. We have MetricsContext and ChromiumEventLogger, and there's various choices you can make to decide where and when to log different levels of observability for your events. To resolve this, I want a central API for "adding to events under dynamo_timed".
**CompileEventLogger** is intended as a frontend for MetricsContext and ChromiumEventLogger so we can use the same class for handling everything.
CompileEventLogger is intended be used within a `dynamo_timed()` context. Its purpose is to 1. log to existing events that are in progress (i.e. within dynamo_timed), and 2. log instant events to chromium that are independent of any specific span.
CompileEventLogger has three log levels:
- CHROMIUM: Log only to chromium events, visible via tlparse.
- PT2_COMPILE: Log to chromium_events + pt2_compile_events
- COMPILATION_METRIC: Log to compilation metrics in addition to the toplevel chromium and pt2_compile_event.
In addition, we have a function CompileEventLogger.add() that automagically chooses the correct log level. For now, it is conservative, and will never automagically choose to log CompilationMetrics (though I could imagine it figuring out the metadata are all keys in CompilationMetric and therefore loggable there).
The goal here is to make one single interface to log stuff for observability reasons, and make it as easy as possible.
Not included in this diff:
- V1 of this diff will not have implementations of `increment` and `add_to_set` which MetricsContext has, so those usages are not replaced yet. But I'll add those in a followup.
- We don't handle `RuntimeMetricsContext`. It's unclear if I want that to be part of this, because under RuntimeMetricsContext there might not be a toplevel event to log to, so chromium events doesn't make sense in that context. So I might leave that separate for now.
Differential Revision: [D67346203](https://our.internmc.facebook.com/intern/diff/D67346203/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143420
Approved by: https://github.com/aorenste
In hinsight, we never needed a DICT_SUBCLASS_GUARD_MANAGER, because Dynamo would inline through the overridden keys method. In this PR, we ensure that while creating guards and constructing variable trackers, we get the `d.keys()` value by using `dict.keys(d)`. This ensures that we do not call overridden keys method. Therefore, the C++ guard can use `PyDict_Next` directly to check the guards.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143722
Approved by: https://github.com/jansel
In hinsight, we never needed a DICT_SUBCLASS_GUARD_MANAGER, because Dynamo would inline through the overridden keys method. In this PR, we ensure that while creating guards and constructing variable trackers, we get the `d.keys()` value by using `dict.keys(d)`. This ensures that we do not call overridden keys method. Therefore, the C++ guard can use `PyDict_Next` directly to check the guards.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143722
Approved by: https://github.com/jansel
Summary: Mostly cosmetic, but one bug fix:
* Bug fix: Make sure compile_id is converted to a string in the compilation metrics so it's printed as, e.g., "0/1" instead of "[0, 1]"
* Sort collections in `collection_to_str`
* Print non-string elements as `"<unknown>"` instead of None (since we don't expect non-strings)
* Move the population of the legacy metrics and any pre-processing to a new factory method in CompilationMetrics
Test Plan:
```
python test/dynamo/test_structured_trace.py
python test/dynamo/test_utils.py
```
Internal testing: https://fburl.com/scuba/dynamo_compile/sandbox/l0me8auf
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143332
Approved by: https://github.com/ppanchalia
Summary: Mostly cosmetic, but one bug fix:
* Bug fix: Make sure compile_id is converted to a string in the compilation metrics so it's printed as, e.g., "0/1" instead of "[0, 1]"
* Sort collections in `collection_to_str`
* Print non-string elements as `"<unknown>"` instead of None (since we don't expect non-strings)
* Move the population of the legacy metrics and any pre-processing to a new factory method in CompilationMetrics
Test Plan:
```
python test/dynamo/test_structured_trace.py
python test/dynamo/test_utils.py
```
Internal testing: https://fburl.com/scuba/dynamo_compile/sandbox/l0me8auf
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143332
Approved by: https://github.com/ppanchalia
Summary:
Support garbage collection after pt2 compilation.
Add jk to control the global rollout / rollback of this functionality
Add env var to control individual job's rollout
Test Plan:
Test the model training job with / without this changes
Reviewers:
@yuxihu @ezyang , @Yuzhen11 ,
Subscribers:
Tasks:
Tags:
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143364
Approved by: https://github.com/ezyang
(Got reverted due to a silly bug, fixed now.)
This is causing FbFxGraphRemoteCache.init to no longer be idempotent, i.e. only safe to call once per compile. AOTAutogradCache initializes a new remote cache for the forward and the backward.
Technically, we could make AOTAutogradCache smart and globally thread through a single FbFxGraphRemoteCache everywhere. But there's no reason to do so, as this class is just the handle to access the cache. Plus, it's very brittle for FbFxGraphRemoteCache to not be safe to call multiple times
Differential Revision: [D66701970](https://our.internmc.facebook.com/intern/diff/D66701970/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141967
Approved by: https://github.com/laithsakka
This logs a waitcounter of the name pytorch.dynamo_timed.{key}.
Primarily sending this now to make sure everyone likes the API, then
I'll add tests, and migrate one dynamo_timed to use it. (likely starting
with
https://github.com/pytorch/pytorch/pull/141379).
Testing is a bit harder, since we don't normally have any way to read
_WaitCounter state AFAICT. I want to poke around and see if I can figure
out a way to read the state, otherwise I'll just mock it to at least
make sure it's mostly working.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141402
Approved by: https://github.com/jamesjwu, https://github.com/masnesral
A subsequeunt patch attempts to fix a side-effect issue for range
iterators, which in turn exposed an exising issue on guards for range
iterators -- the following test started failing:
```
PYTORCH_TEST_WITH_DYNAMO=1 python test/test_tensor_creation_ops.py TestTensorCreationCPU.test_hstack_column_stack_cpu_int16
```
This patch adds a `RANGE_ITERATOR_MATCH` guard to make sure that we
properly guard on range iterators, and adds a regression test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141902
Approved by: https://github.com/jansel
ghstack dependencies: #141713, #141714, #141715
This is causing FbFxGraphRemoteCache.init to no longer be idempotent, i.e. only safe to call once per compile. AOTAutogradCache initializes a new remote cache for the forward and the backward.
Technically, we could make AOTAutogradCache smart and globally thread through a single FbFxGraphRemoteCache everywhere. But there's no reason to do so, as this class is just the handle to access the cache. Plus, it's very brittle for FbFxGraphRemoteCache to not be safe to call multiple times.
(Same problem, different fix of D66502138)
Differential Revision: [D66508492](https://our.internmc.facebook.com/intern/diff/D66508492/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141707
Approved by: https://github.com/ezyang
Prior to this patch, we are using `ConstantVariable.create` to create VT
for frozenset objects, and intended yet failed to predicate that on all
itmes being literals (see https://github.com/pytorch/pytorch/pull/140984#discussion_r1847393736).
The code was from https://github.com/pytorch/torchdynamo/commit/7c03434 and
the original goal was to help DBR quantization, but as the new test in
this patch shows, it could lead to silent incorrectness.
Upon a closer look, this exposes some subtleties in how Dynamo handles
`ConstantVariable` and `LOAD_CONST`, so this patch both fixes the
aforementioned issue and documents, enforces, and makes explicit the
invariants around `ConstantVariable` and `LOAD_CONST` -- only immutable
objects are supported.
Specifically, this patch:
1. refine the checks for wrapping a `frozenset` object, document why we
can't just wrap its items directly due to lack of `Sourcec` for set
items, and use a safe workaround (`SourcelessBuilder`) to ensure
soundness while keeping the DBR quantization support.
2. Adds more types to `common_constant_types`, thereby making
`ConstantVariable.is_base_literal` more lenient, and strictly checks
this property in the constructor of `ConstantVariable`.
3. Change relevant uses of `create_instruction("LOAD_CONST", ...)` to
`create_load_const` which checks `is_safe_constant`, and makes
developer overrides explicit by using `create_load_const_unchecked`
when needed.
4. In a few places, use more specific `VariableTracker`, e.g.,
`TypingVariable` rather than `ConstantVariable`, and
`FrozensetVariable` rather than `SetVariable`.
(2) and (3) are mainly to future-proof Dynamo against bugs like (1).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141504
Approved by: https://github.com/jansel
This turns on AOTAutogradCache for all inductor tests. It clears AOTAutogradCache on each test as well, by virtue of the local cache using the same directory to store cache entries.
I've also tested with INDUCTOR_TEST_DISABLE_FRESH_CACHE=1, running all the tests. AOTAutogradCache successfully caches 99% of these. There are a few tests that use view_replay and therefore save functional tensors, which cause AOTAutogradCache to fail to pickle its result. Will look into next steps there, but for now, it seems okay if the cache just misses on those cases where it can't serialize the result. It would be better to check before pickling, though.
I've made the following small bugfixes to get this working:
- Inductor is sometimes used in a standalone mode without dynamo, which leads to attribute errors in check_can_cache. In general, we should *never* crash in cache checking, only bypass. So I change a try catch to check Exception instead of just a specific exception.
- Add extra structured logging for metadata on cache hits
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140890
Approved by: https://github.com/bdhirsh
Summary: Fix outstanding TODOs related to logging of CompilationMetrics by moving the population of common fields to record_compilation_metrics() instead of populating those independently wherever we use a the metrics_context contextmanager:
* Keep track of start and end time in MetricsContext and pass those to record_compilation_metrics() and populate those fields in that function.
* Pass exception info to record_compilation_metrics() and populate those field in that function.
* Add a new contextmanager, chromium_event_timed, to create the start/end "dynamo" event. This is important because I want this contextmanager to complete _after_ building the CompilationMetrics.
* Populate the compile_id field centrally in record_compilation_metrics().
* Populate the structured_logging_overhead centrally in record_compilation_metrics().
* Add the CompilationMetrics to the current chromium event in record_compilation_metrics(), after all common fields have been added. In a future diff, I can also add _all_ compilation metrics to the chromium event.
Test plan: Unit tests. Also see internal testing:
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/sandbox/jrascnf9
* pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/l3jnla06
* tlparse: https://fburl.com/bq5a9nqs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141291
Approved by: https://github.com/jamesjwu
Summary:
Add the following inductor fx graph cache stats to dynamo compile
- inductor_fx_cache_hit_count
- inductor_fx_cache_miss_count
- inductor_fx_cache_backend_type
- inductor_fx_cache_hit_keys
- inductor_fx_cache_miss_keys
- remote_cache_version
Test Plan: Run local tests and staging logger: P1683061460
Differential Revision: D66232206
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141190
Approved by: https://github.com/masnesral
* Automatically applies ruff rule 401. Turns loops into equivalent list comprehensions which are faster and do not leak the scope of the loop variables.
* list comprehensions not only often have better typing, but are 50+% faster than for loops on overhead. They also preserve length information etc and are better for the interpreter to optimize.
* Manually went back and made mypy happy after the change.
* Also fixed style lints in files covered by flake8 but not by pyfmt
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140980
Approved by: https://github.com/justinchuby, https://github.com/malfet