Summary:
Attempt to fix the following exception, which occurred when profiling a PyTorch model (a Meta-internal LLM) that also ran a ThreadPoolExecutor in the background:
```
Exception Found: !stack.empty() INTERNAL ASSERT FAILED at "fbcode/caffe2/torch/csrc/autograd/profiler_python.cpp":987, please report a bug to PyTorch. Python replay stack is empty.
```
The root cause appears to be that a thread's call stack can be empty at a point where the code asserts that it is not.
I fixed this with minimal changes to profiler_python.cpp.
Approach:
* Ensuring that the stack in question is not empty before trying to pop from it.
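A minimal sketch of the kind of guard described above, not the actual change to profiler_python.cpp. `ReplayStack` and `pop_if_present` are hypothetical names introduced only for illustration:

```cpp
#include <optional>
#include <vector>

// Hypothetical stand-in for a per-thread Python replay stack.
using ReplayStack = std::vector<int>;

// Pop the top entry if there is one. An empty stack is reported to the
// caller as std::nullopt instead of tripping an assertion.
inline std::optional<int> pop_if_present(ReplayStack& stack) {
  if (stack.empty()) {
    return std::nullopt;
  }
  int top = stack.back();
  stack.pop_back();
  return top;
}
```

The caller decides what an empty stack means (for example, skipping the event), which is a design choice this sketch leaves open.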
Test Plan:
* Tested manually on a reproducible scenario that otherwise triggered the assertion failure (the repro is too large to include here). The assertion failure no longer occurs.
* CI
Differential Revision: D67691558
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143940
Approved by: https://github.com/Skylion007, https://github.com/sraikund16
Summary:
Move the profiler's Approximate Clock from libtorch to libc10. The main reason is to let c10 features read the time.
The clock uses the TSC when available for performance. In a subsequent diff, the CUDA Caching Allocator's memory snapshot implementation will use this same clock to timestamp memory events.
Test Plan: CI
Differential Revision: D50601935
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111972
Approved by: https://github.com/davidberard98
This PR replaces c10::guts::to_string with std::to_string. The bulk of the change is using void* as the optimizer state key: strings are needed only for serialization, and hashing a pointer is more efficient than hashing a string.
Some other guts functions in the affected source files are also replaced.
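As a rough illustration of the pointer-keyed layout, assuming a simplified state type: `ParamState`, `StateMap`, and `serialized_key` are hypothetical names, and only `std::to_string` and `std::unordered_map` are real library facilities here.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical per-parameter optimizer state.
struct ParamState {
  long step = 0;
};

// Keyed by the parameter's address: hashing a pointer avoids building and
// hashing a string on every lookup.
using StateMap = std::unordered_map<const void*, ParamState>;

// A string form is produced only when serializing, here from a position.
inline std::string serialized_key(std::size_t index) {
  return std::to_string(index);
}
```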
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108480
Approved by: https://github.com/Skylion007
This is my commandeer of https://github.com/pytorch/pytorch/pull/82154 with a couple extra fixes.
The high level idea is that when we start profiling we see python frames which are currently executing, but we don't know what system TID created them. So instead we defer the TID assignment, and then during post processing we peer into the future and use the system TID *of the next* call on that Python TID.
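The post-processing step can be sketched as follows, assuming events are stored in chronological order. The `Event` struct and `assign_deferred_tids` are hypothetical and much simpler than the profiler's real records:

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <vector>

// Hypothetical event record; field names are illustrative only.
struct Event {
  std::uint64_t python_tid;                 // interpreter-level thread id
  std::optional<std::uint64_t> system_tid;  // empty if unknown at capture time
};

// Walk events from newest to oldest, remembering the nearest later system TID
// seen for each Python TID. A deferred event then takes the system TID of the
// *next* call on the same Python thread.
inline void assign_deferred_tids(std::vector<Event>& events) {
  std::unordered_map<std::uint64_t, std::uint64_t> next_known;
  for (auto it = events.rbegin(); it != events.rend(); ++it) {
    if (it->system_tid) {
      next_known[it->python_tid] = *it->system_tid;
    } else if (auto found = next_known.find(it->python_tid);
               found != next_known.end()) {
      it->system_tid = found->second;
    }
  }
}
```

An event whose Python thread never makes another call stays unassigned in this sketch; how the real code handles that case is not stated in the text.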
As an aside, it turns out that CPython does some bookkeeping (ee821dcd39/Include/cpython/pystate.h (L159-L165), thanks @dzhulgakov for the pointer), but you'd have to do some extra work at runtime to know how to map their TID to ours so for now I'm going to stick to what I can glean from post processing alone.
As we start observing more threads it becomes more important to be principled about how we start up and shut down. (Since threads may die while the profiler is running.) #82154 had various troubles with segfaults that wound up being related to accessing Python thread pointers which were no longer alive. I've tweaked the startup and shutdown interaction with the CPython interpreter and it should be safer now.
Differential Revision: [D42336292](https://our.internmc.facebook.com/intern/diff/D42336292/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91684
Approved by: https://github.com/chaekit
Refcounting is hard. (Citation needed.) https://github.com/pytorch/pytorch/pull/81242 introduced a corner case where we would over incref when breaking out due to max (128) depth. https://github.com/pytorch/pytorch/pull/85847 ostensibly fixed a segfault, but in actuality was over incref-ing because `PyEval_GetFrame` returns a borrowed reference while `PyFrame_GetBack` returns a strong reference.
Instead of squinting really hard at the loops, it's much better to use the RAII wrapper and do the right thing by default.
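To show why an RAII wrapper helps, here is a self-contained sketch that uses a mock refcounted `Frame` rather than real CPython objects. `Frame`, `FrameRef`, `get_back_strong`, and `count_depth` are all hypothetical; they only mimic the borrowed-versus-strong distinction described above.

```cpp
#include <utility>

// Hypothetical refcounted frame, standing in for a CPython frame object.
struct Frame {
  int refcount = 1;
  Frame* back = nullptr;
};

// RAII owner of exactly one strong reference.
class FrameRef {
 public:
  // Adopt a reference the caller already owns (a "strong" return value).
  static FrameRef steal(Frame* f) { return FrameRef(f); }
  // Take a new reference to something we were only lent ("borrowed").
  static FrameRef borrow(Frame* f) {
    if (f) ++f->refcount;
    return FrameRef(f);
  }
  FrameRef(FrameRef&& other) noexcept
      : ptr_(std::exchange(other.ptr_, nullptr)) {}
  FrameRef& operator=(FrameRef&& other) noexcept {
    if (this != &other) {
      reset();
      ptr_ = std::exchange(other.ptr_, nullptr);
    }
    return *this;
  }
  FrameRef(const FrameRef&) = delete;
  FrameRef& operator=(const FrameRef&) = delete;
  ~FrameRef() { reset(); }
  Frame* get() const { return ptr_; }
  explicit operator bool() const { return ptr_ != nullptr; }

 private:
  explicit FrameRef(Frame* f) : ptr_(f) {}
  void reset() {
    if (ptr_) --ptr_->refcount;
    ptr_ = nullptr;
  }
  Frame* ptr_;
};

// Mimics an API that returns a strong reference to the parent frame.
inline Frame* get_back_strong(Frame* f) {
  if (f->back) ++f->back->refcount;
  return f->back;
}

// Count frames up to max_depth. Breaking out early leaks nothing because
// every reference is released by a destructor or by move assignment.
inline int count_depth(Frame* current_borrowed, int max_depth) {
  int depth = 0;
  FrameRef frame = FrameRef::borrow(current_borrowed);
  while (frame && depth < max_depth) {
    ++depth;
    frame = FrameRef::steal(get_back_strong(frame.get()));
  }
  return depth;
}
```

In this mock, refcounts return to their starting values whether the walk reaches the root or stops at the depth limit, which is the property the hand-written loops got wrong.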
I noticed the over incref issue because of a memory leak where Tensors captured by the closure of a function would be kept alive by zombie frames.
Differential Revision: [D42184394](https://our.internmc.facebook.com/intern/diff/D42184394/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91646
Approved by: https://github.com/albanD
Apply clang-tidy fixups to prefer member initializers and modernize-pass-by-value. This is mostly a no-op, but it should make a few ctors slightly more readable and more efficient. It also adds some missing moves, which prevents a lot of unnecessary copying.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91538
Approved by: https://github.com/ezyang
This applies some more clang-tidy fixups. Particularly, this applies the modernize loops and modernize-use-transparent-functors checks. Transparent functors are less error prone since you don't have to worry about accidentally specifying the wrong type and are newly available as of C++17.
Modern foreach loops tend to be more readable and can be more efficient to iterate over, since the hand-written loop condition is removed.
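A small sketch of what the two checks encourage. The names `Counts`, `lookup`, `total`, and `total_counts` are made up for this example and do not come from the codebase:

```cpp
#include <functional>
#include <map>
#include <numeric>
#include <string>
#include <string_view>
#include <vector>

// std::less<> deduces its operand types, so there is no element type to get
// wrong, and lookups can take a std::string_view without building a string.
using Counts = std::map<std::string, int, std::less<>>;

inline int lookup(const Counts& counts, std::string_view key) {
  auto it = counts.find(key);
  return it == counts.end() ? 0 : it->second;
}

// std::plus<> in place of spelling out std::plus<int>.
inline int total(const std::vector<int>& values) {
  return std::accumulate(values.begin(), values.end(), 0, std::plus<>{});
}

// Range-based for: no index variable and no hand-written loop condition.
inline int total_counts(const Counts& counts) {
  int sum = 0;
  for (const auto& entry : counts) {
    sum += entry.second;
  }
  return sum;
}
```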
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91449
Approved by: https://github.com/ezyang
Apply clang-tidy check modernize-use-emplace. This is slightly more efficient by using an inplace constructor and is the recommended style in parts of the codebase covered by clang-tidy. This just manually applies the check to the rest of the codebase. Pinging @ezyang as this is related to my other PRs he reviewed like #89000
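For readers unfamiliar with the check, a minimal illustration follows. `Sample` and `make_samples` are hypothetical; the constructor also shows the pass-by-value-then-move idiom from the earlier fixups.

```cpp
#include <string>
#include <utility>
#include <vector>

struct Sample {
  // Pass by value, then move into the member.
  Sample(std::string n, int v) : name(std::move(n)), value(v) {}
  std::string name;
  int value;
};

inline std::vector<Sample> make_samples() {
  std::vector<Sample> out;
  out.push_back(Sample("before", 1));  // builds a temporary, then moves it in
  out.emplace_back("after", 2);        // constructs in place from the arguments
  return out;
}
```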
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91077
Approved by: https://github.com/ezyang
This PR unifies and rationalizes some of the input representation in Result. The current approach of storing separate types in separate vectors is tedious for two types (Tensors and scalars), but would be even more annoying with the addition of TensorLists. A similar disconnection exists with sizes and strides which the user is also expected to zip with tensor_metadata.
I simplified things by moving inputs to a variant and moving sizes and strides into TensorMetadata. This also forced collection of sizes and strides in python tracer which helps to bring it in line with op profiling. Collection of TensorLists is fairly straightforward; `InputOutputEncoder` already has a spot for them (I actually collected them in the original TorchTidy prototype) so it was just a matter of plumbing things through.
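The shape of that change can be sketched roughly as below. These are simplified, hypothetical definitions: the real `TensorMetadata` and input types carry more fields, and `describe` exists only to show how a consumer visits the variant instead of zipping parallel vectors.

```cpp
#include <cstdint>
#include <string>
#include <variant>
#include <vector>

// Sizes and strides live with the tensor they describe.
struct TensorMetadata {
  std::vector<std::int64_t> sizes;
  std::vector<std::int64_t> strides;
};
struct Scalar {
  double value;
};
using TensorList = std::vector<TensorMetadata>;

// One slot per input, whatever its kind.
using Input = std::variant<std::monostate, TensorMetadata, TensorList, Scalar>;

// Standard C++17 helper for visiting a variant with a set of lambdas.
template <class... Ts>
struct overloaded : Ts... {
  using Ts::operator()...;
};
template <class... Ts>
overloaded(Ts...) -> overloaded<Ts...>;

inline std::string describe(const Input& input) {
  return std::visit(
      overloaded{
          [](std::monostate) { return std::string("none"); },
          [](const TensorMetadata& t) {
            return "tensor, rank " + std::to_string(t.sizes.size());
          },
          [](const TensorList& l) {
            return "list of " + std::to_string(l.size()) + " tensors";
          },
          [](const Scalar&) { return std::string("scalar"); }},
      input);
}
```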
Differential Revision: [D40734451](https://our.internmc.facebook.com/intern/diff/D40734451/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87825
Approved by: https://github.com/slgong-fb, https://github.com/chaekit
A recurring problem with assigning Tensor IDs is that we want to preserve identity when storage changes but we don't observe TensorImpl destruction so identity assignment is not robust to the ABA problem with respect to TensorImpl*. ~TensorImpl is far too hot to instrument; even adding a call to a no-op function in a different compilation unit increases overhead by tens of percent. (OSS builds do not have any sort of LTO.)
Fortunately there is a solution. A PyTorch Tensor is a `c10::intrusive_ptr<c10::TensorImpl>`, which in turn holds a storage. (Which is a `c10::intrusive_ptr<c10::StorageImpl>`) `c10::intrusive_ptr` has a `c10::weak_intrusive_ptr` class for taking non-owning references to the underlying object. The implementation involves both a strong refcount and weak refcount in `c10::intrusive_ptr`. If the strong refcount of an intrusive_ptr goes to zero and there are no weak references then everything is deleted. However if there is a weak reference then the intrusive_ptr calls `release_resources()` but not delete.
This has the effect of freeing the underlying resources (ensuring that program semantics are unchanged) but leaves behind an empty shell of an `intrusive_ptr` that the `weak_intrusive_ptr`s use to check status. And herein lies the solution: as long as we hold a weak reference to a TensorImpl we will block deletion and prevent the `TensorImpl*` from being reused.
This PR uses a `c10::weak_intrusive_ptr<c10::TensorImpl>` to store the address of profiled TensorImpls and then converts it to a raw pointer (or rather, a `TensorImplAddress`) during post processing when we no longer care about blocking address reuse.
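The idea can be approximated with the standard library, with the caveat that this is an analogy and not how `c10::weak_intrusive_ptr` works internally. It assumes the common behavior of `std::make_shared`, where the object and control block share one allocation that is not freed while any `std::weak_ptr` remains. `FakeTensorImpl`, `ProfiledTensor`, `record`, and `finalize` are hypothetical names.

```cpp
#include <cstdint>
#include <memory>

// Hypothetical stand-in for a TensorImpl.
struct FakeTensorImpl {
  int payload = 0;
};

// Captured while profiling: a weak reference plus the address it pins.
struct ProfiledTensor {
  std::weak_ptr<FakeTensorImpl> weak;
  std::uintptr_t address;
};

inline ProfiledTensor record(const std::shared_ptr<FakeTensorImpl>& t) {
  return ProfiledTensor{t, reinterpret_cast<std::uintptr_t>(t.get())};
}

// Post processing: keep only the numeric address. Dropping the weak
// reference means address reuse is no longer blocked from here on.
inline std::uintptr_t finalize(ProfiledTensor&& p) {
  std::uintptr_t address = p.address;
  p.weak.reset();
  return address;
}
```

Unlike this sketch, the approach in the text recovers the address from the weak pointer itself, so no separate `address` field is implied there.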
Differential Revision: [D40492848](https://our.internmc.facebook.com/intern/diff/D40492848/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87244
Approved by: https://github.com/slgong-fb, https://github.com/albanD
While an optimizer can store state however it likes, in practice most optimizer state corresponds to a particular parameter. (This is the case for all `torch.optim` optimizers.) Thus, it turns out to be ergonomic to collect using that structure. Note that this doesn't lock us into anything; we can always collect state with non Tensor keys if the use case arises.
One simplification that arises is that Module and Optimizer collection has very similar structure. So similar, in fact, that it is possible to use a common template for config. I also found that a lot of the `check_and_store` logic could be simplified and inlined by this joining of collected optimizer state.
Differential Revision: [D40210703](https://our.internmc.facebook.com/intern/diff/D40210703/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86753
Approved by: https://github.com/slgong-fb, https://github.com/aaronenyeshi
Summary:
- catch .grad tensor info
- update data type and `check_and_store`, etc
- update unit test case
Test Plan: buck run mode/opt //caffe2/test:profiler
Reviewed By: chaekit
Differential Revision: D39711295
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86355
Approved by: https://github.com/chaekit
Summary:
Incorrect reference counting of Python objects caused intermittent, corner-case-only segfaults.
- before: increment once, decrement in a loop.
- after: increment and decrement in different but consistent loops.
Test Plan: buck run mode/opt //caffe2/test:profiler
Differential Revision: D39902973
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85847
Approved by: https://github.com/robieta, https://github.com/aaronenyeshi
`PythonTracer` holds a pointer to an owning `RecordQueue`, however that relationship is not enforced and it is possible to dangle that pointer if the ProfilerState owning the `RecordQueue` is destroyed without proper cleanup.
We currently use a singleton to enforce the requirement that only one python tracer is active at a time, however a better formulation is to simply enforce that with an atomic bool and manage object lifetime through composition. In this new architecture, `RecordQueue` explicitly holds a unique_ptr to the python tracer instance. That way if `~RecordQueue` is called it will call `~PythonTracer` which can then clean up any state. Overall it is just a simpler ownership model, and less prone to unexpected failures.
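A stripped-down sketch of that ownership model, under the assumption that the tracer's only state is the "active" flag. The class names match the text, but the bodies are hypothetical and omit everything the real classes do:

```cpp
#include <atomic>
#include <memory>

class PythonTracer {
 public:
  // Returns nullptr if another tracer is already active.
  static std::unique_ptr<PythonTracer> make() {
    bool expected = false;
    if (!active_.compare_exchange_strong(expected, true)) {
      return nullptr;
    }
    return std::unique_ptr<PythonTracer>(new PythonTracer());
  }
  // Cleanup runs on every teardown path, including abnormal ones.
  ~PythonTracer() { active_.store(false); }
  static bool active() { return active_.load(); }

 private:
  PythonTracer() = default;
  static inline std::atomic<bool> active_{false};
};

class RecordQueue {
 public:
  RecordQueue() : tracer_(PythonTracer::make()) {}
  bool has_tracer() const { return tracer_ != nullptr; }

 private:
  // Destroyed together with the queue, so the tracer cannot outlive it.
  std::unique_ptr<PythonTracer> tracer_;
};
```

Because the queue owns the tracer, there is no back-pointer that can dangle when the queue goes away, and the atomic flag alone enforces the one-tracer-at-a-time rule.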
Differential Revision: [D38955616](https://our.internmc.facebook.com/intern/diff/D38955616/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83964
Approved by: https://github.com/slgong-fb
The ownership model between `RecordQueue` and `PythonTracer` is brittle; if a profiler is popped without proper shutdown it can dangle a reference in `PythonTracer` which will segfault when dereferenced. The next PR will address this; to start we simply move the code into `torch/csrc/profiler/orchestration` to limit the sloc delta when making actual changes.
Differential Revision: [D38933962](https://our.internmc.facebook.com/intern/diff/D38933962/)
**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D38933962/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83895
Approved by: https://github.com/slgong-fb
Summary:
Record nn.Module's parameters for detailed memory profiling:
- extend 'module_' in value cache & NNModuleInfo to save parameters
- python binding and unit test case
Test Plan: buck run mode/opt //caffe2/test:profiler -- -r test_nnmodule
Differential Revision: D38379717
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83209
Approved by: https://github.com/robieta
There are a couple of bugs in the python tracer related to how we cache values. The first is that `ValueCache::store<CallType::PyModuleCall>` wrongly assumes that it will only be called from the profiling callback and calls `PyEval_GetFrame`, effectively violating the encapsulation of the cache by accessing global state. Secondly, we use `arg` to cache bound C functions. This turns out not to be correct, and collisions are resulting in incorrect traces.
In both cases, we can solve the problem by introducing a concept of ephemeral data which is used to materialize a cached value, but is not part of the cache key. (And the author is responsible for making sure that is done correctly.)
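One way to picture ephemeral data is a cache whose `store` takes a key and a separate argument used only to build the value on a miss. This `ValueCache` is a hypothetical simplification that shares only its name with the real class; the key type and the `"name:"` prefix are invented for the example.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

class ValueCache {
 public:
  using Key = const void*;

  // `key` alone identifies the entry. `ephemeral` is consulted only to
  // materialize the value on a miss and never takes part in the lookup.
  const std::string& store(Key key, const std::string& ephemeral) {
    auto it = cache_.find(key);
    if (it == cache_.end()) {
      it = cache_.emplace(key, "name:" + ephemeral).first;
    }
    return it->second;
  }

  std::size_t size() const { return cache_.size(); }

 private:
  std::unordered_map<Key, std::string> cache_;
};
```

The caller passes in whatever context is needed (instead of the cache reaching for global state), and remains responsible for supplying ephemeral data that is consistent for a given key.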
Differential Revision: [D38062921](https://our.internmc.facebook.com/intern/diff/D38062921/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81958
Approved by: https://github.com/ngimel