Summary: This starts writing the compiler_config metadata into logger
Test Plan:
Modified existing test case to make sure this is not null.
(Also eyeballed what we're logging tomake sure it's reasonable
Reviewed By: masnesral
Differential Revision: D84014636
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165581
Approved by: https://github.com/masnesral
Summary:
This stores information on where fx graphs come from, which makes it
significantly easier to debug.
One outstanding question
1) I only stored the kernel stack traces, do we also want the node mappings?
Test Plan:
I wrote a explicit logging test which makes a module, fx traces it, compiles it, and makes sure the logging infomration shows up.
```
clr@devvm17763 ~/fbsource/fbcode/caffe2/test/dynamo
% buck2 test @//mode/opt fbcode//caffe2/test/dynamo:test_dynamo -- test_utils
File changed: fbsource//xplat/caffe2/test/dynamo/test_utils.py
File changed: fbcode//caffe2/test/dynamo/test_utils.py
Buck UI: https://www.internalfb.com/buck2/528dea32-2416-4a62-a1ec-39f3c0efdd2e
Test UI: https://www.internalfb.com/intern/testinfra/testrun/13229324015574003
Network: Up: 0B Down: 0B
Executing actions. Remaining 0/2
Command: test.
Time elapsed: 17.3s
Tests finished: Pass 16. Fail 0. Fatal 0. Skip 0. Build failure 0
```
Rollback Plan:
Differential Revision: D82037582
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162669
Approved by: https://github.com/yushangdi
Note: Adding unit test for this is tricky as having errors in the specific unit test would cause test_utils.py to crash all together.
Tested as follows:
1. Added x = 1/0 after guarded_code = compile_inner(code, one_graph, hooks, transform) in convert_frame.py
2. Printed exception_stack_trace and got: ['Traceback (most recent call last):\n File "/data/users/jovian/pytorch/torch/_dynamo/convert_frame.py", line 1207, in _compile\n x = 1/0\n ~^~\nZeroDivisionError: division by zero\n']
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161096
Approved by: https://github.com/c00w
We add a logging around when an ID_MATCH guard is added at a place where inbuilt_inline_nn_modules would inline it. This is done with the aim of tagging recompiles that could be avoided by setting inbuilt_inline_nn_modules flag.
It will help us log and track the flag's adoption and potentially quantify saving in the the number of recompiles.
Differential Revision: D80075975
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160592
Approved by: https://github.com/anijain2305
Users may want compile-related but customized logging info to dynamo_compile. One example is to logging the current training iteration index when recompilation happens. In general, current training iteration index is not available to compiler, since the same compiled function may be called multiple times in the same training iteration. The user could provide the training iteration index in a user hook where torch.compile logs it when recompilation happens.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157961
Approved by: https://github.com/masnesral
#153622 introduced a hook for getting the relevant code objects after frame tracing. The idea is to have vLLM use this instead of monkey-patching `inline_call_()` to determine the source code files to hash. Unfortunately, the hook runs too late; the vLLM backend needs access to the set of source code filenames while it's running.
This PR replaces the newly-added hook with a utility function that a backend can call to get this information. I've made the change in vLLM and can verify that this allows the information to be queried at the right time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155249
Approved by: https://github.com/zou3519
This PR:
* Expands `Hooks` with a new, optional `frame_traced_fn` field. It should be a callable receiving the list of traced code objects
* Maintains a list of `traced_code` objects in the `TracingContext` of an `OutputGraph`
* Whenever an `inline_call()` is encountered, the corresponding code object is added to this set
* `OutputGraph`'s associated `f_code` is added to the list just before the hook is called
I believe use of this hook should enable the source code hashing that vLLM does in a better way than monkey-patching `inline_call()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153622
Approved by: https://github.com/jansel
https://github.com/pytorch/pytorch/issues/148222
Goal:
At the moment autograd saved tensors hooks are run in eager after compiled forward.
They are executed at the same time for all saved tensors.
Hooks can be used to reduce amout of memory used for saved tensors, doing quantization or offloading to cpu.
This is suboptimal for optimization of peak memory.
Better solution will be to put the hooks in the graph, as close as possible to the last usage of the tensor.
To get user specified autograd saved tensors hooks in the graph.
Logic:
UX:
If user specifies with torch.autograd.graph.saved_tensors_hooks(pack_gm, unpack_gm).
Where pack_gm and unpack_gm are torch.fx.GraphModule.
Then AotAutograd will retrace those graph modules, doing decompositions and functionalization in aot_autograd, inlining the result graphs in forward epilogue and backward prologue.
User may want to use control logic in the hooks, for example applying quantization only for specific dtypes and sizes.
This is also possible, user can put it into torch.fx.wrap function and use symbolic trace to make a GraphModule.
In that case AotAutograd cahing will work only in case when user explicitly set to the torch.fx.wrap call_function node "user_cache_hash" metadata.
If this metadata set - then aot_autograd cache can use saved cache artifact.
If metadata is not set - then cache is bypassed.
Dynamo:
Dynamo traces pack and unpack hooks and installs them as subgraph and explicitly adds to the output_graph. (As those subgraphs are not used and will not be copied in the result by default).
The complexity here is that at this moment we do not have example of inputs for the hooks.
We trace pack_hook with some Tensor from the inputs.
The result subgraphs are added to the hashing of AotAutograd Cache.
In AotAutograd we retrace the graph with the true saved tensors coming from partitioner.
Backwards Compatibility:
As current hooks are executed in eager mode and not all of them will be traceable - we only try to put in the graph hooks, explicitly marked by user with annotation (@_inlineable_saved_tensors_hooks).
For other hooks or if compiled autograd is enabled - keep the same logic.
Recompilations:
Hooks are guarded with lambda guard matching function id to cause recompilation if user reruns compiled function.
Aot_autograd:
After partitioner prepared forward and backward module - we trace prepared at Dynamo graphs for pack and unpack hooks and inline them in epilogue of forward and prologue of backward. Forward outputs and backward inputs are changed, transparently for user.
We do not try to put it close the last usage etc., relying on inductor to do this optimization.
```
INFO: TRACED GRAPH
===== Forward graph pre saved_tensors_hooks inlining 3 =====
/data/users/ivankobzarev/a/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
def forward(self, primals_1: "Sym(s0)", primals_2: "Sym(s1)", primals_3: "f32[s0, s1][s1, 1]cuda:0"):
# File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6660 in simple_fn, code: x = x + 1
add: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.add.Tensor(primals_3, 1); primals_3 = None
# File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6661 in simple_fn, code: x = SAF.apply(x)
view: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.view.default(add, [primals_1, primals_2])
return (view, add, primals_1, primals_2)
INFO: TRACED GRAPH
===== Backward graph pre saved_tensors_hooks inlining 3 =====
/data/users/ivankobzarev/a/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
def forward(self, primals_1: "Sym(s0)", primals_2: "Sym(s1)", primals_3: "f32[s0, s1][s1, 1]cuda:0"):
# File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6660 in simple_fn, code: x = x + 1
add: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.add.Tensor(primals_3, 1); primals_3 = None
# File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6661 in simple_fn, code: x = SAF.apply(x)
view: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.view.default(add, [primals_1, primals_2])
return (view, add, primals_1, primals_2)
INFO: TRACED GRAPH
===== saved_tensors_pack_hook add 3 =====
/data/users/ivankobzarev/a/pytorch/torch/fx/_lazy_graph_module.py class pack_float8(torch.nn.Module):
def forward(self, x_1: "f32[s0, s1][s1, 1]cuda:0"):
# No stacktrace found for following nodes
_to_copy: "f8e4m3fn[s0, s1][s1, 1]cuda:0" = torch.ops.aten._to_copy.default(x_1, dtype = torch.float8_e4m3fn); x_1 = None
return (torch.float32, _to_copy)
INFO: TRACED GRAPH
===== saved_tensors_unpack_hook add 3 =====
<eval_with_key>.22 from /data/users/ivankobzarev/a/pytorch/torch/fx/experimental/proxy_tensor.py:1225 in wrapped class pack_float8(torch.nn.Module):
def forward(self, x_1: "f32[s0, s1][s1, 1]cuda:0"):
# No stacktrace found for following nodes
_to_copy: "f8e4m3fn[s0, s1][s1, 1]cuda:0" = torch.ops.aten._to_copy.default(x_1, dtype = torch.float8_e4m3fn); x_1 = None
return (torch.float32, _to_copy)
INFO: TRACED GRAPH
===== Forward graph 3 =====
/data/users/ivankobzarev/a/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
def forward(self, primals_1: "Sym(s0)", primals_2: "Sym(s1)", primals_3: "f32[s0, s1][s1, 1]cuda:0"):
# File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6660 in simple_fn, code: x = x + 1
add: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.add.Tensor(primals_3, 1); primals_3 = None
# No stacktrace found for following nodes
_to_copy: "f8e4m3fn[s0, s1][s1, 1]cuda:0" = torch.ops.aten._to_copy.default(add, dtype = torch.float8_e4m3fn)
# File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6661 in simple_fn, code: x = SAF.apply(x)
view: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.view.default(add, [primals_1, primals_2]); add = None
return (view, _to_copy, primals_1, primals_2)
INFO: TRACED GRAPH
===== Backward graph 3 =====
<eval_with_key>.21 class GraphModule(torch.nn.Module):
def forward(self, primals_1: "Sym(s0)", primals_2: "Sym(s1)", add_packed_2: "f8e4m3fn[s0, s1][s1, 1]cuda:0", tangents_1: "f32[s0, s1][s1, 1]cuda:0"):
# No stacktrace found for following nodes
_to_copy: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten._to_copy.default(add_packed_2, dtype = torch.float32); add_packed_2 = None
# File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6661 in simple_fn, code: x = SAF.apply(x)
add_7: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.add.Tensor(tangents_1, _to_copy); tangents_1 = _to_copy = None
return (None, None, add_7)
```
Differential Revision: [D72187044](https://our.internmc.facebook.com/intern/diff/D72187044)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150032
Approved by: https://github.com/bdhirsh
Summary: I forgot to remove this unused field in D73809989.
Test Plan: `buck test 'fbcode//mode/opt' fbcode//caffe2/test:fbonly -- --exact 'caffe2/test:fbonly - test_compilation_metrics_logger_in_sync (caffe2.test.fb.test_fb.TestFBOnly)'`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153413
Approved by: https://github.com/c00w
Summary: I'm just trying to fix the test again. It's out of date because it's disabled and some dynamo_timed-related fields are gone now.
Test Plan: `python test/dynamo/test_utils.py -k dynamo_timed`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152387
Approved by: https://github.com/anijain2305
We correctly handed different python version in the explicit ir_nodes test, but
didn't handle it in the dynamo_timed test. Just explicitly deleting the fields
there so the dynamo_timed test passes on all python versions.
(I noticed it breaking on 3.13).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148987
Approved by: https://github.com/jansel
This gives us a decent proxy for how big of a graph we functionally had to parse.
Note that this is a cummulative counter. If people feel strongly, I can either write into the dynamo_timed datasets with metrics contexts, or clear the counters / write a counter per frame id as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147149
Approved by: https://github.com/jansel
Summary: Gather the compilation time of individual triton kernels and log them to dynamo_compile:
* Time compilation in `_worker_compile_triton` and pass back to the main process and logged from `get_result()`.
* Added a way to track the "top N" (or N most-expensive compiles) in the metrics_context. I did this because I doubt we really care to capture potentially thousands of kernel compile times. That would be problematic for scuba logging anyway, so let's limit the number we track from the beginning. Arbitrarily chose 25 for now.
* Format the list of compile times as a json string before logging.
Test Plan:
`python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --print-compilation-time --repeat 5 --cold-start-latency --only nanogpt`
Scuba: https://fburl.com/scuba/dynamo_compile/sandbox/nc4dzm3r
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147022
Approved by: https://github.com/jamesjwu
This bit me while I was trying to debug some trace issues.
In general this config is already quite large when dumping, so adding
more fields doesn't make it significantly worse.
Also a number of the items we are type checking for (except the test
configs), don't even show up. Primarily this will help us when debugging
rocm, halide, and trace configs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144700
Approved by: https://github.com/ezyang
We don't support assigning to objects or numeric constants at the top level in
config modules, no need to test for them.
(This specifically breaks later sorting refactoring, since it requires <
to be implemented).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143535
Approved by: https://github.com/ppanchalia
Summary:
Support garbage collection after pt2 compilation.
Add jk to control the global rollout / rollback of this functionality
Add env var to control individual job's rollout
Test Plan:
Test the model training job with / without this changes
Reviewers:
@yuxihu @ezyang , @Yuzhen11 ,
Subscribers:
Tasks:
Tags:
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143364
Approved by: https://github.com/ezyang