Summary: Add the conv padding ops in PyTorch; the corresponding PR in torchao is https://github.com/pytorch/ao/pull/2257.
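For orientation only, one plausible shape of the conv + padding + bn + relu pattern named by the test below; the exact supported patterns and annotations are defined in the torchao PR above, so treat this module as hypothetical:
```python
import torch.nn as nn


class ConvPadBNReLU(nn.Module):
    # Hypothetical example; whether padding precedes or follows the conv in the
    # supported patterns is defined by the torchao PR, not by this sketch.
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3)
        self.pad = nn.ZeroPad2d(1)
        self.bn = nn.BatchNorm2d(8)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.bn(self.pad(self.conv(x))))
```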
Test Plan:
```
buck test 'fbcode//mode/opt' fbcode//caffe2/test:quantization_pt2e -- --exact 'caffe2/test:quantization_pt2e - test_conv_padding_bn_relu (quantization.pt2e.test_quantize_pt2e.TestQuantizePT2E)'
```
Differential Revision: D75494468
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154473
Approved by: https://github.com/Skylion007
`SIGABRT` is a common result of *negative* distributed tests, which check the effectiveness of the NaN assert, watchdog throw, etc.
These errors are not detectable by traditional statements like `with self.assertRaises(RuntimeError)`.
Instead, we need to check the process's return code; e.g. a process killed by `SIGABRT(6)` has a return code of -6.
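A minimal illustration of that return-code check (not the actual test harness):
```python
import signal
import subprocess
import sys

# Spawn a child process that aborts itself, then assert on the return code
# instead of on a Python exception.
proc = subprocess.run(
    [sys.executable, "-c", "import os, signal; os.kill(os.getpid(), signal.SIGABRT)"]
)
# On POSIX, a process killed by SIGABRT(6) reports return code -6.
assert proc.returncode == -signal.SIGABRT
```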
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153167
Approved by: https://github.com/fduwjj
1. Reworked `MultiProcContinousTest` to spawn processes during `setUpClass` instead of `main` (so that we can support multiple test classes in one file).
2. The child processes now run an infinite loop, monitoring test IDs passed from the main process via a task queue; reciprocally, they inform the main process of each test's completion via a completion queue (a minimal sketch of this protocol follows the list).
3. Added a test template.
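A minimal sketch of the task-queue / completion-queue protocol from item 2; the worker function and the single-rank driver are illustrative, not the actual `MultiProcContinousTest` code:
```python
import torch.multiprocessing as mp


def _worker_loop(rank, task_queue, completion_queue, run_single_test):
    # Child-process body: run tests sent by the main process until a sentinel arrives.
    while True:
        test_id = task_queue.get()
        if test_id is None:  # sentinel from the main process: shut down
            break
        run_single_test(rank, test_id)  # caller-provided per-test runner
        completion_queue.put((rank, test_id))  # report completion back to main


if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    tasks, done = ctx.Queue(), ctx.Queue()
    # The real class spawns one worker per rank in setUpClass; a single rank is shown here.
    worker = ctx.Process(target=_worker_loop, args=(0, tasks, done, print))
    worker.start()
    tasks.put("test_example")  # dispatch a test ID
    print(done.get())          # wait for its completion
    tasks.put(None)            # tell the worker to exit
    worker.join()
```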
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153653
Approved by: https://github.com/d4l3k, https://github.com/fegin, https://github.com/fduwjj
This accomplishes the following:
- Fixes a correctness problem with large integer types (it is probably slower, but that cannot be avoided if one wants an accurate answer)
- Makes the op faster for floating-point types (a Metal kernel invocation is cheaper than creating an MPSGraph)
- Eliminates the need for several correctness workarounds
Fixes https://github.com/pytorch/pytorch/issues/154171
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154280
Approved by: https://github.com/dcci
ghstack dependencies: #154275, #154290
Summary: The error caused by the world size not being divisible by `group_size` is a common issue for end users of applications built on top of `new_subgroups()`. However, these applications may use different variable names, such as `num_trainers_per_group`, which can make the current error messages less helpful even though they are correct. To address this, we improved the error messages to display the actual numbers involved, making them clearer and more useful.
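Roughly, the improved check is meant to surface both numbers, along these lines (illustrative wrapper and wording; not the exact code or message in the PR):
```python
import torch.distributed as dist


def checked_new_subgroups(group_size: int):
    # Fail with both numbers in the message when the world size is not
    # divisible by the requested group size.
    world_size = dist.get_world_size()
    if world_size % group_size != 0:
        raise ValueError(
            f"world_size ({world_size}) must be divisible by group_size ({group_size})"
        )
    return dist.new_subgroups(group_size=group_size)
```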
Test Plan: contbuild & OSS CI
Differential Revision: D75226925
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154124
Approved by: https://github.com/wz337
Usage:
```python
import functools

from torch._higher_order_ops.wrap import dynamo_bypassing_wrapper


# Your ordinary function wrapper
def my_hop_fn_impl(fn, *args, k=1, **kwargs):
    def wrapper(*args, **kwargs):
        out = fn(*args, **kwargs)
        if isinstance(out, tuple):
            return (out[0] + k,)
        return out + k

    return wrapper


# Calling `my_hop_fn` instead of the impl directly captures a HOP into the dynamo graph
def my_hop_fn(fn, *args, k=1, **kwargs):
    return dynamo_bypassing_wrapper(
        functools.partial(my_hop_fn_impl, k=k), fn, *args, **kwargs
    )
```
Notes:
- The Dynamo-captured graph now stashes arbitrary callable objects (the `wrapper_fn`); this is equivalent to what SAC does today with `policy_fn`.
- The `wrapper_fn` passed to `dynamo_bypassing_wrapper` should have the signature `Callable -> Callable`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153487
Approved by: https://github.com/ydwu4
Related: #148920
This PR:
* Provides a helper `install_cpp_extension(extension_root)` for building C++ extensions. This is intended to be used in `TestMyCppExtension.setUpClass()` (a usage sketch follows this list)
* Updates libtorch_agnostic tests to use this
* Deletes preexisting libtorch_agnostic tests from `test/test_cpp_extensions_aot.py`
* Fixes `run_test.py` to actually run tests in `test/cpp_extensions/libtorch_agnostic_extension/test/test_libtorch_agnostic.py` to avoid losing coverage. This wasn't being run due to logic excluding tests that start with "cpp"; this is fixed now
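A sketch of the intended `setUpClass` usage; the import path of `install_cpp_extension` below is an assumption (it is the helper added by this PR), so adjust it to wherever the helper actually lives:
```python
import unittest
from pathlib import Path

# Assumed import location for the helper added by this PR.
from torch.testing._internal.common_utils import install_cpp_extension


class TestMyCppExtension(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Build the C++ extension once, before any test in this class runs.
        install_cpp_extension(extension_root=Path(__file__).parent)

    def test_extension_loaded(self):
        # Tests can now import and exercise the freshly built extension.
        ...
```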
After this PR, it is now possible to run:
```
python test/cpp_extensions/libtorch_agnostic_extension/test/test_libtorch_agnostic.py
```
and the test will build the `libtorch_agnostic` extension before running the tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153264
Approved by: https://github.com/janeyx99
Differential Revision: D72437251
Enable rebuilding of the bucket order when `find_unused_parameters=True`.
This should never be worse than not rebuilding the bucket order when `find_unused_parameters=True`:
1. For cases where the bucket order in the first iteration matches the parameter order, rebuilding the bucket order changes nothing.
2. For cases where the bucket order in the first iteration does not match the parameter order, there are two sub-cases:
   a. The bucket order does not change after the first iteration even though the graph is dynamic and there are unused parameters; here, rebuilding the bucket order gives a performance gain.
   b. The bucket order changes after the first iteration due to the dynamic graph; here, neither the parameter order nor the first-iteration bucket order is ideal, so rebuilding or not does not matter.
Enabling bucket-order rebuilding when `find_unused_parameters=True` therefore helps case 2.a without hurting cases 1 and 2.b.
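A minimal usage sketch (the process-group backend and model are illustrative, and the usual torchrun/env:// launch is assumed); the behavioral change is that DDP may now rebuild its gradient buckets after the first iteration even with `find_unused_parameters=True`:
```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT are set, e.g. by torchrun.
dist.init_process_group(backend="gloo")
model = torch.nn.Linear(8, 8)
# With this change, bucket order can be rebuilt after iteration 1 even though
# find_unused_parameters=True.
ddp_model = DDP(model, find_unused_parameters=True)
out = ddp_model(torch.randn(4, 8)).sum()
out.backward()
```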
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153404
Approved by: https://github.com/rohan-varma, https://github.com/fegin
Summary: The bug, introduced in https://github.com/pytorch/pytorch/pull/152765, was caused by passing the `group` parameter to `get_rank()`, which made the call return the rank within that group rather than the global rank of the current process. The fix removes the `group` argument from the `get_rank()` call.
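For context, the two call forms differ as follows (standard `torch.distributed` semantics, shown for illustration; this is not the code from the PR):
```python
import torch.distributed as dist

# Rank of the current process in the default (world) process group.
global_rank = dist.get_rank()

# Rank of the current process *within* a specific subgroup (or -1 if it is not
# a member); this is what get_rank(group) returns.
cur_subgroup, _ = dist.new_subgroups(group_size=2)
rank_in_subgroup = dist.get_rank(cur_subgroup)
```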
Test Plan: contbuild & OSS CI
Differential Revision: D74964213
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153798
Approved by: https://github.com/Skylion007
Change `logging.error` to `logging.exception` to log additional information when relevant. A few `logging.error` calls inside `try`/`except` blocks have slipped in since I last cleaned this up, and the rule is now stabilized, so I am enabling it codebase-wide. I have NOQA'd much of our custom exception stack-trace handling for RPC calls and distributed, and tried to fix a few call sites depending on whether we immediately re-raised the exception or were failing to print exception information where it could be useful.
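For illustration, the difference between the two calls in an exception handler (`load_checkpoint` is a hypothetical stand-in):
```python
import logging


def load_checkpoint(path):
    raise OSError(f"cannot read {path}")  # hypothetical failing operation


try:
    load_checkpoint("/tmp/ckpt.pt")
except Exception:
    # logging.error would record only the message; logging.exception also
    # records the active traceback, which is the extra information we want.
    logging.exception("failed to load checkpoint")
```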
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153473
Approved by: https://github.com/albanD, https://github.com/cyyever
MOTIVATION
This PR includes a minor change to also check the TEST_HPU flag before falling back to CPU. Without this flag, some tests were falling back to CPU, causing them to fail.
Please refer to this RFC as well: https://github.com/pytorch/rfcs/pull/66
CHANGES
- Add the TEST_HPU flag to some of the conditions checking the environment.
- Use the DEVICE_COUNT variable instead of the torch.accelerator.device_count() API, since the latter is not supported on out-of-tree devices like Intel Gaudi.
@ankurneog, @EikanWang, @cyyever, @guangyey
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153461
Approved by: https://github.com/EikanWang, https://github.com/cyyever, https://github.com/albanD
Based on the [conversation](https://github.com/pytorch/pytorch/issues/121791), we plan to drop "highest, high, medium" as the way to represent the fp32 internal computation data type. Instead, we will name the algorithm directly.
### Design Choice: Directly use algorithm names like "TF32", "BF16".
#### Pros
- The names are more informative: "tf32" says more than a vague "high".
- It is easier to extend to new algorithms like `tf32x3`.
#### Cons
- "HIGHEST, HIGH, MEDIUM" indicate the relative precision of the different algorithms; however, we can add documentation to explain that.
### We provide a layered structure for backends/operators.
('f32' is short for 'fp32_precision')

### The following fp32 compute precisions can be set:
- **"ieee"**: Not allowed to use any other internal computation data type.
- **"tf32"**: Allowed to use tf32 as the internal computation data type.
- **"bf16"**: Allowed to use bf16 as the internal computation data type.
- **"none"**: Precision is not set; it can be overridden by its parent node.
### Overriding Precision Settings
A child node is overridden by its parent node if the child is set to the default ("none").
The current default settings are:
```
backend = generic, op = all, precision setting = none
backend = cuda, op = all, precision setting = none
backend = cuda, op = conv, precision setting = tf32
backend = cuda, op = rnn, precision setting = tf32
backend = cuda, op = matmul, precision setting = none
backend = mkldnn, op = all, precision setting = none
backend = mkldnn, op = conv, precision setting = none
backend = mkldnn, op = rnn, precision setting = none
backend = mkldnn, op = matmul, precision setting = none
```
- If the user sets `torch.backends.mkldnn.fp32_precision="bf16"`, its child nodes `torch.backends.mkldnn.matmul.fp32_precision` / `torch.backends.mkldnn.conv.fp32_precision` / `torch.backends.mkldnn.rnn.fp32_precision` will also be overridden to "bf16".
- If the user sets `torch.backends.fp32_precision="bf16"`, `torch.backends.mkldnn.fp32_precision` and its child nodes will also be overridden to "bf16".
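A minimal sketch using the attribute names above (assumes a build that includes this PR's `fp32_precision` API):
```python
import torch

# Set the backend-level (parent) precision; child settings that are still
# "none" inherit it.
torch.backends.mkldnn.fp32_precision = "bf16"

# These were "none" by default, so they are now reported as "bf16" as well.
print(torch.backends.mkldnn.matmul.fp32_precision)
print(torch.backends.mkldnn.conv.fp32_precision)
print(torch.backends.mkldnn.rnn.fp32_precision)

# A child can still be pinned explicitly, e.g. keep convolutions in pure fp32.
torch.backends.mkldnn.conv.fp32_precision = "ieee"
```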
### Backward Compatibility
Since the new API allows more fine-grained control, there can be conflicts with the old API. For example, the previous `torch.backends.cudnn.allow_tf32` cannot represent a state such as `torch.backends.cudnn.rnn.fp32_precision="ieee"` combined with `torch.backends.cudnn.conv.fp32_precision="tf32"`. Therefore, our goals for backward compatibility are:
- If the user only uses the previous APIs, they work as before.
- If the user uses the **new** API to reach a state that is **un-representable** by the old API and then reads that state through the **old** API, we raise a RuntimeError and point the user to the documentation.
### Test Plan
```
python test/test_cuda.py -k test_fp32_precision_with_tf32
python test/test_cuda.py -k test_fp32_precision_with_float32_matmul_precision
python test/test_cuda.py -k test_invalid_status_for_legacy_api
python test/test_mkldnn.py -k test_mlkdnn_get_set
python test/test_mkldnn.py -k test_generic_precision
python test/test_mkldnn.py -k test_invalid
python test/test_mkldnn.py -k test_default_use_parent
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125888
Approved by: https://github.com/jgong5, https://github.com/albanD
Co-authored-by: Jiang, Yanbing <yanbing.jiang@intel.com>
Noticed that some of these ops were contributing a big chunk of the runtime for OpenLLama as well as a few other benchmarks.
At the op level, moving to a TensorIterator-based Metal kernel gives a 20x speedup. Will migrate the inverse trigonometric functions and log ops in a follow-up PR, as this one is already a bit large.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152876
Approved by: https://github.com/malfet
By implementing `div_floor` and `div_trunc`. Do not mark `div_trunc` as OPMATH, to align the following output with CPU (if the division were performed in fp32, the result would be truncated to 25):
```
import torch
print(torch.tensor([[-7.4688, -3.1289]], dtype=torch.float16,device="cpu").div(torch.tensor([-0.2988, -0.8789], dtype=torch.bfloat16,device="cpu"), rounding_mode="trunc"))
tensor([[24., 3.]])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152758
Approved by: https://github.com/dcci
ghstack dependencies: #152663, #152515, #152737, #152743