Summary: Add more details about the export_memory_timeline API, as we've landed new representations of the memory timeline data.
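For context, a minimal usage sketch (model, inputs, and the output path are placeholders; the export format is chosen by the file extension, per the commits below):
```
import torch
from torch.profiler import profile, ProfilerActivity

# memory profiling needs these three flags enabled
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    profile_memory=True,
    record_shapes=True,
    with_stack=True,
) as prof:
    model(inputs)  # placeholder workload

# writes an HTML page with an embedded plot; other suffixes
# (e.g. raw.json.gz) select other representations
prof.export_memory_timeline("memory_timeline.html", device="cuda:0")
```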
Test Plan: CI, should be no functional change, as we only changed comments.
Differential Revision: D50123450
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110949
Approved by: https://github.com/davidberard98
`_enable_dynamo_cache_lookup_profiler` used to be toggled when running `__enter__` or `__exit__` on the profiler. But it's possible to turn the profiler on and off without the context manager (e.g. with a schedule and calling `.step()`). Instead, we should put these calls (which are supposed to run when the profiler turns on/off) where `_enable_profiler()` and `_disable_profiler()` are called.
This puts `_enable_dynamo_cache_lookup_profiler` and `_set_is_profiler_enabled` into `_run_on_profiler_(start|stop)` and calls those in the 3 places where `_(enable|disable)_profiler` is called.
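A rough sketch of the resulting shape (the bodies are illustrative, not the exact implementation):
```
# hooks that must run whenever the profiler toggles, regardless of whether
# the context manager or a schedule triggered it
def _run_on_profiler_start():
    _enable_dynamo_cache_lookup_profiler(True)
    _set_is_profiler_enabled(True)

def _run_on_profiler_stop():
    _enable_dynamo_cache_lookup_profiler(False)
    _set_is_profiler_enabled(False)

# each is called alongside the three _enable_profiler()/_disable_profiler() sites
```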
Differential Revision: [D48619818](https://our.internmc.facebook.com/intern/diff/D48619818)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107720
Approved by: https://github.com/wconstab
We hope PyTorch's profiling parsing capabilities can also apply to custom devices. Building on the previous work in https://github.com/pytorch/pytorch/pull/101554, we have made supplementary updates to PyTorch profiling to extend its parsing capabilities to custom devices. These modifications do not affect the original logic of the code and mainly cover the following aspects (a usage sketch follows the list):
1. Added the relevant logic for use_device in torch.profiler.profiler._KinetoProfile.
2. In torch.autograd.profiler and torch.autograd.profiler_util, the ability to parse custom-device profiling data has been added, based on the privateuse1 and use_device attributes.
3. In torch._C._autograd.pyi, custom-device-related attributes have been added. The underlying C++
logic will be added in subsequent pull requests.
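A hedged usage sketch, assuming a registered privateuse1 backend named "foo" (the backend name and workload are hypothetical):
```
import torch

# hypothetical: a custom backend registered under the privateuse1 key
torch.utils.rename_privateuse1_backend("foo")

# use_device routes parsing of the collected events to the custom device
with torch.autograd.profiler.profile(use_device="foo") as prof:
    run_model()  # placeholder workload on the custom device

print(prof.key_averages().table(sort_by="self_cpu_time_total"))
```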
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106142
Approved by: https://github.com/aaronenyeshi
Summary: Return early if we can easily determine that the operator's qualified name is invalid before attempting to retrieve the schema. In particular, "::" should always be present. A quick estimate shows that this is >50x faster (100 us -> 2 us).
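A sketch of the idea (names are illustrative):
```
def get_schema_or_none(qualified_op_name: str):
    # cheap validity check: a qualified name is always "namespace::op"
    if "::" not in qualified_op_name:
        return None
    # only fall through to the expensive retrieval for plausible names
    return lookup_schema(qualified_op_name)  # hypothetical expensive call
```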
Test Plan: CI
Differential Revision: D47562587
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105495
Approved by: https://github.com/aaronenyeshi
Summary:
Rather than processing the events into a time-and-sizes plot, dump the actual events as (timestamp, action, num of bytes, category) when the output file name ends in `raw.json.gz`.
This allows downstream analysis tools to process these events. It also avoids having to control the granularity of the earlier json.gz output in the memory profiler.
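For example, a minimal sketch (model and inputs are placeholders; the suffix selects the raw event dump):
```
import torch

with torch.profiler.profile(
    profile_memory=True, record_shapes=True, with_stack=True
) as prof:
    model(inputs)  # placeholder workload

# each dumped event is (timestamp, action, num of bytes, category)
prof.export_memory_timeline("memory_events.raw.json.gz", device="cuda:0")
```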
Test Plan: CI Tests
Differential Revision: D47416544
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105094
Approved by: https://github.com/davidberard98
This PR re-lands
- [Typing] Fix PEP 484 Violation (#105022)
- Update mypy to 1.4.1 (#91983)
that were reverted due to a conflict with the internal source repo.
Mostly fixes for PEP-484 violations (i.e. when a default arg is set to None but the type is not annotated as Optional); see the sketch after the TODO list below.
Plus a few real fixes:
- Add missing `_get_upgraders_entry_map` to `torch/_C/__init__.pyi`
- Add missing return statement to `torch._export.deserialize_graph`
- Fix error message in `torch.ao.ns.fx.weight_utils.get_lstm_mod_weights`
- Add an assert in `torch/optim/optimizer.py` that the Optional list is not None
TODO (in followup PR):
- Fix erroneous `isinstance` check in `torch/ao/quantization/_pt2e/qat_utils.py`
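For reference, the typical PEP-484 fix looks like this (hypothetical function):
```
from typing import Optional

# before: default is None, but the annotation does not allow it
def load_checkpoint(path: str = None): ...

# after: the implicit Optional is made explicit
def load_checkpoint(path: Optional[str] = None): ...
```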
Unrelated, to bypass CI failures due to the gcc9 dependency update in Ubuntu-18.04:
- Add a hack to `.ci/docker/install_conda.sh` to squash the older libstdc++ from the conda environment in favor of the one from the OS
- Update bazel CUDA builds to focal, as with libstdc++-6.0.32 bazel builds lose the ability to catch exceptions (probably because they link with cupti statically, but I could not find where that is done)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105227
Approved by: https://github.com/atalman, https://github.com/albanD, https://github.com/Skylion007
Not sure how it worked before, but arguments must be annotated as Optional if they are defaulted to None.
Towards enabling mypy-1.4.1 in lintrunner
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105022
Approved by: https://github.com/izaitsevfb, https://github.com/huydhn, https://github.com/Skylion007
This fixes a bug in profiler code, exposed by https://github.com/pytorch/pytorch/pull/104368, that relied on the fact that `import torch._dynamo` also imports `torch._inductor.config`:
```
$ python -c "import torch._inductor;print(torch._inductor.config)"
Traceback (most recent call last):
File "<string>", line 1, in <module>
AttributeError: module 'torch._inductor' has no attribute 'config'
(base) $ python -c "import torch._dynamo;print(torch._inductor.config)"
<module 'torch._inductor.config' from '/home/nshulga/git/pytorch/pytorch/torch/_inductor/config.py'>
```
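A minimal sketch of the defensive fix, assuming the profiler code imports the submodule explicitly rather than relying on the side effect shown above:
```
# import the config submodule directly instead of assuming that something
# else (e.g. `import torch._dynamo`) has already pulled it in as a side effect
import torch._inductor.config as inductor_config  # noqa: F401
```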
### Testing
D47159397
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104477
Approved by: https://github.com/aaronenyeshi, https://github.com/malfet
Summary: Trigger tracing for MTIA events on the Python side when ProfilerActivity.MTIA is specified.
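A hedged usage sketch (assumes a build where the MTIA activity is available; the workload is a placeholder):
```
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.MTIA]) as prof:
    run_kernel_add()  # placeholder for the MTIA workload

print(prof.key_averages().table())
```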
Test Plan:
Test diff: D45437426
```
hg graft D45437426
```
- in one terminal
```
cd ~/fbsource/fbcode
buck2 run -j 8 \
//infra_asic_fpga/firmware/tools/mad/service:mad_service
```
- in another terminal
PyTorch profiler
```
buck run mode/dev-nosan -j 8 //caffe2/torch/fb/acc_runtime/afg/tests:test_afg -- -m kernel_add
```
Differential Revision: D46122853
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102288
Approved by: https://github.com/aaronenyeshi
Summary: Since CUPTI lazy re-init crashes with CUDA Graphs in CUDA 11, we should disable it. Remove this workaround once the majority of workloads move to CUDA 12.
Test Plan: CI Tests
Reviewed By: xw285cornell
Differential Revision: D45921028
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101879
Approved by: https://github.com/xw285cornell
There are known issues with profiling cuda graphs - particularly, if you create a cuda graph before the first use of the profiler, and then run that cuda graph during profiling.
One workaround is to add `with profile(): pass` before creating the cuda graph that you want to profile later.
For convenience, we provide this function to apply the workaround. This also adds a test for the workaround, to ensure that it continues working.
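The workaround itself is small; a sketch (model and inputs are placeholders):
```
import torch
from torch.profiler import profile

# warm up the profiler once before any CUDA graph is captured
with profile():
    pass

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = model(static_input)  # placeholder capture

# profiling a later replay of `g` now works
with profile() as prof:
    g.replay()
```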
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100441
Approved by: https://github.com/Chillee, https://github.com/aaronenyeshi
Summary:
Support the file extension .html, which will include a PNG image of the plot embedded into an HTML file.
This allows users to avoid processing the timeline manually in their own frontend UI.
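For instance (a sketch, assuming `prof` is a completed memory-profiling run on a CUDA device):
```
# embeds a PNG image of the memory timeline plot into a standalone HTML file
prof.export_memory_timeline("memory_timeline.html", device="cuda:0")
```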
Test Plan:
CI Tests
Ran on resnet50 model and generated this html file w/ plot:
See attached html file: {F954232276}
Screenshot: {F954232469}
Differential Revision: D45152735
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99751
Approved by: https://github.com/davidberard98
Enable some sensible flake8-simplify rules. I mainly wanted to enable the SIM101 and `yield from` (SIM103) checks. @kit1980 since you wanted to be tagged on this CI check.
Enabling this check also helped flag one logical bug so it's definitely beneficial (also fixed in this PR).
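For example, SIM101 flags repeated `isinstance` calls that can be merged (illustrative snippet):
```
# before (SIM101): duplicate isinstance calls on the same variable
if isinstance(x, int) or isinstance(x, float):
    handle_number(x)

# after: a single call with a tuple of types
if isinstance(x, (int, float)):
    handle_number(x)
```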
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97984
Approved by: https://github.com/ezyang
Fixes https://github.com/pytorch/pytorch/issues/82915
This rare flaky issue caught my attention today when it failed flakily on MacOS in https://github.com/pytorch/pytorch/actions/runs/4494182574/jobs/7906827531. The test expected 3 traces to be written but got only 2 of them.
Looking a bit closer at the `tensorboard_trace_handler` function, there is a potential filename clash: the millisecond-since-epoch timestamp `"{}.{}.pt.trace.json".format(worker_name, int(time.time() * 1000))` is used as part of the name. As `tensorboard_trace_handler` is used as a callback handle in the test, the names end up too close to each other (1 millisecond apart), i.e.
```
huydo-mbp_13494.1679526197252.pt.trace.json
huydo-mbp_13494.1679526197253.pt.trace.json
huydo-mbp_13494.1679526197250.pt.trace.json
```
Switching to nanoseconds reduces the chance of two or more of them having the same timestamp while keeping the naming convention intact, i.e. `huydo-mbp_13804.1679526325182878000.pt.trace.json`
I suspect that this is also the cause of Windows flakiness.
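In code, the change is essentially (a sketch, assuming `worker_name` as built by the handler):
```
import time

# before: millisecond resolution; back-to-back traces can collide
file_name = "{}.{}.pt.trace.json".format(worker_name, int(time.time() * 1000))

# after: nanosecond resolution keeps the naming convention but makes
# collisions between consecutive traces far less likely
file_name = "{}.{}.pt.trace.json".format(worker_name, time.time_ns())
```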
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97392
Approved by: https://github.com/malfet, https://github.com/aaronenyeshi
Summary: Rather than starting the timeline at t=0, keep the actual timestamps of the memory events.
Test Plan: CI Tests
Reviewed By: leitian, chaekit
Differential Revision: D43807624
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96535
Approved by: https://github.com/davidberard98
Summary: Added the functionality to export the memory timeline plot as a list of times and sizes, which the post processing visualization can parse and plot.
Test Plan: CI Tests
Reviewed By: leitian, fengxizhou
Differential Revision: D43680760
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96137
Approved by: https://github.com/chaekit
Summary:
There are a few races/permission errors in file creation; fixing:
OSS:
1. caffe2/torch/_dynamo/utils.py, get_debug_dir: multiple processes may conflict on the directory name even though it uses microsecond timestamps; add the pid to it (a sketch follows the list)
2. caffe2/torch/_dynamo/config.py: it may not be a correct assumption that we have write permission to the cwd
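A sketch of the first fix (names and timestamp format are illustrative):
```
import os
import time

def get_debug_dir(root: str) -> str:
    # include the pid so concurrent processes never race on the same
    # directory, even if they start within the same timestamp tick
    dir_name = f"run_{time.strftime('%Y_%m_%d_%H_%M_%S')}_{os.getpid()}"
    return os.path.join(root, dir_name)
```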
Test Plan: sandcastle
Differential Revision: D42905908
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93407
Approved by: https://github.com/soumith, https://github.com/mlazos
This PR adds the `_profile_using_dynolog` function to `torch/__init__.py`. It allows registering the optimizer step post-hook, which is required to collect iteration-based traces using dynolog.
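A hedged sketch of the hook registration (assumes the global post-hook helper in torch.optim; the hook body is illustrative):
```
from torch.optim.optimizer import register_optimizer_step_post_hook

def _on_optimizer_step(optimizer, args, kwargs):
    # e.g. tell the external tracer (dynolog) that one iteration finished
    ...

handle = register_optimizer_step_post_hook(_on_optimizer_step)
# handle.remove() unregisters the hook
```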
Other related changes for tests to pass:
1. Updated `optimizer.pyi`
2. Updated `overrides.py`
3. The test `test_kineto_profiler_multiple_steppers` in `test_profiler.py` has been broken down into two cases:
- `test_kineto_profiler_multiple_steppers_with_override_True` : this test uses the override argument
- `test_kineto_profiler_multiple_steppers_with_override_False` : this test uses the environment variable
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90101
Approved by: https://github.com/albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90880
# Summary
Enables multiple step trackers. Previously we only had one place to mark that a step() had occurred in the program: the PyTorch profiler's step().
We are now working on adding an Optimizer step hook - https://github.com/pytorch/pytorch/issues/88446
- This could mean programs that already call profiler.step() every iteration can end up double incrementing steps
- If a model uses multiple optimizers we can also have double or more counting of the step.
## Solution
We fix this by adding a layer of abstraction before calling step() on the kineto library. The idea is to maintain steps per requester in a dictionary (see the sketch after the examples):
```
{
"ProfilerStep": 100, # triggered by profiler step() call
"Optimizer1Step": 100, # Optimizer 1 or 2 are just examples, could be SGD, Adam etc
"Optimizer2Step": 100,
}
```
To figure out the global step count, just take the max of the dict values (100).
```
{
"ProfilerStep": 100,
"Optimizer1Step": 101, # Optimizer1 got incremented first say
"Optimizer2Step": 100,
}
```
Then the global step count is 101.
## Calling kineto
We only call the kineto step() function when global count increments.
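A minimal sketch of this abstraction (illustrative, not the exact class):
```
class StepTracker:
    """Per-requester step counts; the global step is their max."""
    _step_dict: dict = {}
    _global_step: int = 0

    @classmethod
    def increment_step(cls, requester: str) -> int:
        cls._step_dict[requester] = cls._step_dict.get(requester, 0) + 1
        new_global = max(cls._step_dict.values())
        if new_global > cls._global_step:
            cls._global_step = new_global
            kineto_step()  # hypothetical: notify kineto only on a real advance
        return cls._global_step
```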
# Test Plan:
Added a unit test
buck2 run mode/dev-nosan caffe2/test:profiler
Differential Revision: D41751157
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90880
Approved by: https://github.com/chaekit
TF32 is not supported on ROCm, hence torch/profiler/_pattern_matcher.py's FP32MatMulPattern should return False for ROCm instead of checking the results of torch.cuda.get_arch_list(). Otherwise, depending on the gfx arch running the test, test_profiler.py's test_profiler_fp32_matmul_pattern (__main__.TestExperimentalUtils) will fail.
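A sketch of the guard (illustrative; the non-ROCm branch stands in for the existing arch check):
```
import torch

def fp32_matmul_pattern_skip() -> bool:
    # TF32 is not supported on ROCm, so skip the pattern there outright
    if torch.version.hip is not None:
        return True
    # otherwise fall back to the arch-list check (illustrative)
    return not any("sm_8" in arch for arch in torch.cuda.get_arch_list())
```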
Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84077
Approved by: https://github.com/jeffdaily, https://github.com/kit1980
There are various Tensors created in the backward pass which do not correspond to parameters. We don't want to mark these as gradients, but we do still want to convey as much information as possible. Thus, this PR introduces an AUTOGRAD_DETAIL category. (Which can be grouped with GRADIENT in visualization if one wishes to take a coarse grained view of the world.)
Differential Revision: [D40868661](https://our.internmc.facebook.com/intern/diff/D40868661/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88926
Approved by: https://github.com/chaekit