Pull Request resolved: https://github.com/pytorch/pytorch/pull/77699
The current machinery to connect libtorch to libtorch_python for profiling is... meh. Adequate for separate components that mostly just need to send a trigger, but not really clean. This PR makes an abstract interface class that the python tracer subclasses so the profiler can actually get at the tracer singleton, albeit through a restricted interface. This will help fold Python tracing into the new unified event structure.
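For illustration, the shape of the pattern sketched in Python (the real implementation is C++ spanning libtorch and libtorch_python; all names here are illustrative, not the actual API):
```python
from abc import ABC, abstractmethod
from typing import Optional

class PythonTracerBase(ABC):
    """Restricted interface the profiler is allowed to see (names illustrative)."""

    @abstractmethod
    def start(self) -> None: ...

    @abstractmethod
    def stop(self) -> None: ...

_tracer: Optional[PythonTracerBase] = None

def register_tracer(tracer: PythonTracerBase) -> None:
    # The tracer side registers its singleton behind the interface.
    global _tracer
    _tracer = tracer

def get_tracer() -> Optional[PythonTracerBase]:
    # The profiler reaches the tracer only through the restricted interface.
    return _tracer
```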
Differential Revision: [D36325739](https://our.internmc.facebook.com/intern/diff/D36325739/)
Approved by: https://github.com/aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77698
This PR adds tree building to the profiler's post processing. The basic algorithm is to sort the events, maintain a stack and a priority queue of event ends, and push/pop accordingly. The logic for merging Python events is still separate in `profiler_kineto.cpp`. That can be removed when Python events have an `EventType`.
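A minimal sketch of the stack-based replay for a single, properly nested stream (the real post processing also coordinates a priority queue of event ends across streams; all names here are illustrative):
```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Event:  # stand-in for the profiler's internal event type
    start: int
    end: int
    children: List["Event"] = field(default_factory=list)

def build_tree(events: List[Event]) -> List[Event]:
    events = sorted(events, key=lambda e: e.start)
    stack: List[Event] = []
    roots: List[Event] = []
    for e in events:
        # Pop everything that has finished by the time this event starts.
        while stack and stack[-1].end <= e.start:
            stack.pop()
        # Whatever remains on top of the stack encloses the current event.
        (stack[-1].children if stack else roots).append(e)
        stack.append(e)
    return roots
```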
Differential Revision: [D36321105](https://our.internmc.facebook.com/intern/diff/D36321105/)
Approved by: https://github.com/aaronenyeshi
Summary:
After https://github.com/pytorch/pytorch/pull/77608, `example_inputs` is a required input for `prepare_fx` and `prepare_qat_fx`.
This makes quantizing submodules harder, so we added this utility function to get a dictionary from fqn (fully qualified name) to submodule example inputs.
Example Call:
```
example_inputs = (tensor0,)
get_fqn_to_example_inputs(m, example_inputs)
```
Example output:
```
{
"linear1": (tensor1,),
"linear2": (tensor2,),
"sub": (tensor3,),
"sub.linear1": (tensor4,),
...
}
```
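A hedged sketch of how the mapping might be used to quantize a single submodule, assuming the utility is importable from `torch.ao.quantization.utils` and that a `qconfig_dict` is defined by the caller:
```python
from torch.ao.quantization.quantize_fx import prepare_fx
from torch.ao.quantization.utils import get_fqn_to_example_inputs

fqn_to_example_inputs = get_fqn_to_example_inputs(m, example_inputs)
# Prepare one submodule with the example inputs captured for it.
m.sub.linear1 = prepare_fx(
    m.sub.linear1,
    qconfig_dict,  # assumed to be defined elsewhere
    fqn_to_example_inputs["sub.linear1"],
)
```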
Test Plan:
python test/test_quantization.py TestUtils
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78146
Approved by: https://github.com/vkuzo
Summary: Add an argument to ``torch.cuda.make_graphed_callables`` specifying the number of warm-up iterations. By default it needs 3 warm-up iterations; to work with NCCL, it needs 11.
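A hedged usage sketch with the argument name this PR adds (the module and inputs are illustrative):
```python
import torch

model = torch.nn.Linear(8, 8).cuda()
sample_args = (torch.randn(4, 8, device="cuda"),)

# Request extra warm-up iterations, e.g. 11 when NCCL is involved.
graphed_model = torch.cuda.make_graphed_callables(
    model, sample_args, num_warmup_iters=11
)
```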
Differential Revision: D36606758
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78124
Approved by: https://github.com/jianyuh
Use pyupgrade (https://github.com/asottile/pyupgrade) and flynt to modernize Python syntax:
```sh
pyupgrade --py36-plus --keep-runtime-typing torch/onnx/**/*.py
pyupgrade --py36-plus --keep-runtime-typing test/onnx/**/*.py
flynt torch/onnx/ --line-length 120
```
- Use f-strings for string formatting
- Use the new `super()` syntax for class initialization
- Use dictionary / set comprehension
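For illustration, the kinds of rewrites these tools apply (examples are illustrative, not taken from the actual diff):
```python
name = "Add"
message = f"exporting {name}"   # was: "exporting %s" % name

class Exporter:
    def __init__(self):
        super().__init__()      # was: super(Exporter, self).__init__()

ops = ["Add", "Mul", "Add"]
op_names = {op for op in ops}   # was: set([op for op in ops])
```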
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77935
Approved by: https://github.com/BowenBao
Euler beta function:
```Python
torch.special.beta(input, other, *, out=None) → Tensor
```
I started working on this before realizing we were missing a gamma implementation (despite providing incomplete gamma implementations), so `reentrant_gamma` and `reentrant_ln_gamma` implementations (using Stirling's approximation) are provided. They use the coefficients computed by Steve Moshier to replicate SciPy's implementation, and likewise mimic SciPy's behavior (instead of the behavior in Cephes).
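As a sanity check, B(x, y) = Γ(x)Γ(y) / Γ(x + y), so the op this PR adds should agree with a composition of existing gamma ops (a hedged sketch; exact numerics can differ at extreme inputs):
```python
import torch

x = torch.tensor([0.5, 2.0, 3.5])
y = torch.tensor([1.5, 4.0, 0.25])

# B(x, y) = exp(lgamma(x) + lgamma(y) - lgamma(x + y))
via_lgamma = torch.exp(torch.lgamma(x) + torch.lgamma(y) - torch.lgamma(x + y))
result = torch.special.beta(x, y)  # the op proposed by this PR
```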
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78031
Approved by: https://github.com/mruberry
Add codegen infrastructure to generate IR nodes for non-native ops.
The proposed change is to add a `non_native` key to the `{backend}_native_functions.yaml` file that contains schema definitions similar to what is found in `native_functions.yaml`, e.g.
```
non_native:
...
- func: expand(Tensor input, int[] size, bool is_scalar_expand) -> Tensor
...
```
These definitions are parsed into a `LazyIrSchema` that can be used for generating IR nodes using `GenLazyIR`.
Fixes #74628
CC: @wconstab @desertfire @henrytwo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76535
Approved by: https://github.com/wconstab
(reopening due to botched merge)
The cuDNN V8 API (main support merged in https://github.com/pytorch/pytorch/pull/60755) potentially exposes many more kernels with `benchmark=True`. While these additional kernels can improve performance, it is often unnecessary to run every kernel returned by the heuristic, and doing so may degrade the user experience by causing the first model iteration to be very slow. To alleviate this issue, this PR introduces `torch.backends.cudnn.benchmark_limit`. `benchmark_limit` specifies the maximum number of working cuDNN kernels to try for a given workload, with the default being 10 (similar to what TensorFlow does). `benchmark_limit = 0` yields the current behavior of trying every kernel returned by the heuristic.
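For example (a minimal usage sketch):
```python
import torch

torch.backends.cudnn.benchmark = True
# Try at most 10 working cuDNN kernels per workload; 0 restores
# the old behavior of trying every kernel the heuristic returns.
torch.backends.cudnn.benchmark_limit = 10
```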
CC @ptrblck @ngimel @xwang233
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77002
Approved by: https://github.com/ngimel
Feels good to delete it from `torch._decomps`. This is mainly to clarify the process for me -
It seems there are still some components missing from the `torch <-> refs` mapping? For example, methods don't seem to work yet for the torch <-> refs mapping, and neither do the meta tests? (cc: @ezyang)
If I replace `torch` with `refs`, then the tests seem to pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78041
Approved by: https://github.com/mruberry
Fixes #68172. Generally, this corrects multiple flaky convolution unit tests seen on ROCm.
The MIOpen integration has been forcing `benchmark=True` even when `torch._C._set_cudnn_benchmark(False)` is called, typically via `torch.backends.cudnn.set_flags(enabled=True, benchmark=False)`. We now add support for MIOpen immediate mode to avoid benchmarking during MIOpen solution selection.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77438
Approved by: https://github.com/ngimel, https://github.com/malfet
This PR adds the item, equal, any, and all references.
While doing this I found the following issues:
- https://github.com/pytorch/pytorch/issues/78070
- https://github.com/pytorch/pytorch/issues/78071
And I fixed a bug where the `convert_element_type` prim could not convert tensors requiring grad to datatypes that don't require grad.
Creating the item reference required adding item as a prim, but per @ngimel's suggestion I removed the prims for any and all and implemented them as references, so this is net negative one prim.
Reference OpInfos are added for any and all, but item and equal don't even have regular OpInfos.
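A minimal sketch of expressing `all` as a reference in terms of `any` (illustrative only; the actual refs implementation may differ):
```python
import torch

def ref_all(a: torch.Tensor, dim=None, keepdim: bool = False) -> torch.Tensor:
    # all(a) == not any(not a), reducing over the same dimensions
    neg = torch.logical_not(a)
    any_neg = torch.any(neg) if dim is None else torch.any(neg, dim=dim, keepdim=keepdim)
    return torch.logical_not(any_neg)

a = torch.tensor([[True, False], [True, True]])
assert torch.equal(ref_all(a, dim=1), torch.all(a, dim=1))
```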
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78072
Approved by: https://github.com/ngimel
This PR...
**Issues Found**
- https://github.com/pytorch/pytorch/issues/78058
- https://github.com/pytorch/pytorch/issues/78054
- https://github.com/pytorch/pytorch/issues/78053
- https://github.com/pytorch/pytorch/issues/78050
- https://github.com/pytorch/pytorch/issues/77932
**Testing**
- disables stride consistency checks in test_ops and test_meta pending resolution of https://github.com/pytorch/pytorch/issues/78050
- skips chalf in reference tests (addressing https://github.com/pytorch/pytorch/issues/78054)
- splits test_python_reference_consistency into one test for the context where torch.foo is torch.foo, and another for when torch.foo is refs.foo
- updates test names to be more natural and consistent:
  - test_python_reference_errors -> test_python_ref_errors
  - test_python_reference_consistency -> test_python_ref and test_python_ref_torch_fallback
  - test_python_reference_meta_functions -> test_python_ref_meta
  - test_reference_testing -> test_numpy_ref
- updates test_python_ref and test_python_ref_torch_fallback to check that the reference is more accurate than the torch op when their results are not close; a warning is raised when this occurs (addressing https://github.com/pytorch/pytorch/issues/77687)
- adds reference inputs for broadcast_tensors
- Updates the "fill_" OpInfo to "fill", adding a NumPy reference and making it an elementwise unary operator
- Adds 1D sample inputs with no elements to the cat OpInfo and updates the NumPy reference to handle them and type promotion correctly
- Adds reference inputs for elementwise ternary operations, like clamp
- Adds a NumPy reference for clamp
- Adds reference inputs to where's OpInfo
- Makes softplus an elementwise unary OpInfo
- Removes the great majority of Python reference OpInfo skips and xfails due to the above test changes
- Adds Python reference OpInfos for fill, dropout, clamp, broadcast_tensors, and where
**Prims**
- adds the fill, empty_strided, and uniform prims
- removes the empty, empty_like, full, and full_like prims -- these are now references that use empty_strided and fill
- renames the "concatenate" and "select" prims to "cat" and "where", respectively, to be consistent with PyTorch
- extends the `_elementwise_meta` operation to accept tensors that don't participate in type promotion, like the `cond` tensor in `where`
- fixes a bug in the stride propagation of broadcast_in_dim
- moves some error checks from prims.cat and prims.where to refs.cat and refs.where, respectively, consistent with our new policy of doing as much error checking in the ref as possible
**Utils**
- adds the canonicalize_device, extract_shape, and extract_shape_from_varargs helpers
- adds the elementwise_unary_scalar_wrapper -- this allows elementwise unary operators to take and return scalar values (e.g., refs.sin(1) returns 0.84...)
**Refs**
- adds the fill, broadcast_tensors, clamp, empty_strided, ones, zeros, and uniform references
- adds the nn.functional.dropout reference
- fixes refs.cat to handle 1D tensors with no elements, consistent with eager mode
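A hedged usage sketch of a couple of the new references (the private `torch._refs` module path is assumed):
```python
import torch
import torch._refs as refs  # private namespace; module path assumed

a = torch.randn(3)
clamped = refs.clamp(a, min=-0.5, max=0.5)  # new clamp reference
s = refs.sin(1)  # scalar wrapper: accepts and returns a scalar (~0.841)
```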
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78026
Approved by: https://github.com/ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77696
https://github.com/pytorch/pytorch/pull/63619 added a RECORD_FUNCTION guard to make calls to `Engine::evaluate_function` visible regardless of the underlying op. While useful, this creates a call that looks like a forward call, which somewhat complicates stitching forward and backward ops. I don't want to add complexity (and therefore work) on the hot path; instead it's fairly straightforward to stitch things back together in post. This PR simply propagates sequence number and forward tid info up to the `evaluate_function` event.
Differential Revision: [D36302562](https://our.internmc.facebook.com/intern/diff/D36302562/)
Approved by: https://github.com/aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77693
Right now the profiler internals are rather ad-hoc and disjoint. As we move towards a unified experience this needs to be addressed. This PR adds an enum specifying the various types of events that can be profiled and specializes the `ExtraFields` struct on the values of the `EventType` enum. This lets us punt more of the heterogeneity onto the type system and allows a caller to simply think in terms of `ExtraFields<EventType::...>`. (No more "X field is always present but only makes sense for Y". e.g. inputs)
For now only ops and backend events are transitioned since they are already in a weird union state. Changes planned for subsequent diffs in the stack:
1) Allocations
2) Python tracer events
3) Kineto (e.g. Cupti) events
4) Use unified event type for more post processing
One rather pleasant observation was that this change exposed several minor bugs in the current implementation:
1) We just didn't plumb `end_thread_id_` from `OpEvent` to `Result`. Switching to using ctors rather than setting fields in `getRecords` fixes this.
2) We were calling `fn.threadId()` to get start TID, but that is wasteful because it is already stored in the `ThreadLocalSubqueue`.
So that gives me some confidence that this is a step in the right direction.
Differential Revision: [D36189044](https://our.internmc.facebook.com/intern/diff/D36189044/)
**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36189044/)!
Approved by: https://github.com/aaronenyeshi
Adds `exp2` and `log10` to the prims (both also exist in the C++ standard library, and Intel SIMD intrinsics include `exp2`).
Adds `exp2`, `log10`, and `frac` to the refs, with corresponding OpInfo entries.
Tried to decompose `exp2` (before adding it as a prim) as:
* `exp(log(2) * x)`, but it wasn't stable for large values.
* `pow(2, x)`, in which case there was a stride mismatch. At a cursory look, `pow` tries to preserve the stride of its first argument when possible.
Tried to decompose `log10` (before adding it as a prim) as:
* `log(x) / log(10)`, which passed for real dtypes but failed for complex at extremals. Probably related to https://github.com/pytorch/pytorch/issues/52332 (not 100% sure).
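A hedged sketch of the kind of check that surfaces the instability (values illustrative):
```python
import torch

x = torch.tensor([10.0, 50.0, 100.0])

# Candidate decomposition vs. the dedicated kernel; the relative error
# can grow with magnitude, which is why exp2 was kept as a prim.
via_exp = torch.exp(torch.log(torch.tensor(2.0)) * x)
direct = torch.exp2(x)
rel_err = (direct - via_exp).abs() / direct
```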
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78046
Approved by: https://github.com/mruberry
cc: @jansel @bertmaher
More or less identical in spirit to the layer norm and batch norm ones.
One annoying thing about all 3 of these is that layer_norm has slightly different `mean/var` semantics than batch norm and group norm. After normalization, `layer_norm` keeps them unsqueezed (so they're something like [1, 5, 1, 1]) while batch norm and group norm squeeze out the 1-dims.
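A minimal illustration of the shape difference, with plain reductions standing in for the decompositions' statistics:
```python
import torch

x = torch.randn(8, 5, 4, 4)

# Unsqueezed statistics, as the layer_norm decomposition keeps them:
mean_keep = x.mean(dim=(0, 2, 3), keepdim=True)  # shape [1, 5, 1, 1]
# Squeezed statistics, as batch norm and group norm return them:
mean_squeezed = x.mean(dim=(0, 2, 3))            # shape [5]
```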
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78029
Approved by: https://github.com/bertmaher
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77996
An issue recently surfaced internally highlighting that removing KinetoThreadLocalState from the TLS at the end of post processing means we are profiling memory during post processing (which violates a whole bunch of invariants in the system). This change switches the global profiling context to a shared_ptr, introduces a class to manage it (`init`, `get`, and `pop` methods), and moves the `pop` call to the beginning of `disableProfiler`.
Differential Revision: [D36555738](https://our.internmc.facebook.com/intern/diff/D36555738/)
Approved by: https://github.com/aaronenyeshi
Summary: Move storage saving to the last step; otherwise, tensors saved after the storages have already been written will not have storage.
Test Plan: Tested by loading the file from `clowder get GLDGLQnKrIsQFg8DAPxq9vg59ZwZbmQwAAAA orig.pt`, converting to flatbuffer, and loading again
Differential Revision: D36552645
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78024
Approved by: https://github.com/Jack-Khuu
In general, if we expect users to use a base class such as `_ConvNd`, we should rename it to something like `BaseConv`. However, because this base class is only used inside the AO packages, there is no need to expose it to the users.
Test Plan:
```
python test/test_quantization.py
python test/test_module_init.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77344
Approved by: https://github.com/jerryzh168
Because AO depends on the torch packages, but very few (if any) torch packages depend on AO, we are moving the AO imports lower. This reduces the probability of cyclic imports: by the time AO starts importing, the rest of torch will already be imported.
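An illustrative ordering sketch (not the actual `torch/__init__.py` contents):
```python
# Illustrative import ordering only; not the actual torch/__init__.py.
from torch import nn      # core packages imported first
from torch import optim
# ...
from torch import ao      # AO last: it imports from the packages above
```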
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77065
Approved by: https://github.com/albanD
Cleaning up onnx module imports to prepare for updating `__init__`.
- Simplify importing the `_C` and `_C._onnx` namespaces
- Remove the alias of the symbolic_helper module in imports
- Remove any module-level function imports; import modules instead
- Alias `symbolic_opsetx` as `opsetx`
- Fix some docstrings
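For illustration, the resulting import style might look like this (a sketch following the list above):
```python
from torch import _C
from torch._C import _onnx as _C_onnx
from torch.onnx import symbolic_helper            # module import, no alias
from torch.onnx import symbolic_opset9 as opset9  # opset modules aliased as opsetN
```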
Requires:
- https://github.com/pytorch/pytorch/pull/77448
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77423
Approved by: https://github.com/BowenBao