This PR continues to clean up clang-tidy warnings in torch/csrc/distributed/c10d, following #124701. In addition, a libfmt dependency is added in the CMake code to enable using it in the headers. libfmt has to be added as a private dependency of torch_cuda and torch_hip because they include torch/csrc/distributed/c10d/Utils.hpp, which uses libfmt.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124987
Approved by: https://github.com/malfet
#113713
Currently passing trivial smoke tests, but I just pattern-matched bits and pieces of the autograd defs.
Will also collect benchmark data.
CC @drisspg
Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122510
Approved by: https://github.com/drisspg
Summary:
This makes barrier and rank operations linear instead of quadratic with the number of workers. This drastically improves performance for rendezvous when running with over 1000 hosts.
This uses two approaches for different areas:
* local rank assignment: each worker does 1 set and 1 get; local ranks are assigned on the rank 0 host in an O(n) operation, which keeps total store operations linear in the number of workers.
* exit_barrier: uses a counter and a final flag so each worker does at most 1 set, 1 get, and 1 add (sketched below).
At 4000 hosts, torchelastic can run in as little as 10 seconds, down from 373 seconds.
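A minimal sketch of the counter-plus-flag exit barrier described above, assuming a `TCPStore`-style key-value store; the key names and helper are illustrative, not the actual torchelastic implementation:
```python
from torch.distributed import TCPStore

def exit_barrier(store: TCPStore, world_size: int) -> None:
    # Each worker does at most 1 add, 1 set, and 1 wait: O(n) store ops total.
    arrived = store.add("barrier/count", 1)   # atomic increment
    if arrived == world_size:
        store.set("barrier/done", "1")        # last worker flips the final flag
    store.wait(["barrier/done"])              # everyone blocks until the flag is set
```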
Test Plan:
Tested using many small jobs running on a remote cluster.
{D56549942}
```
torchx run --scheduler mast -- --image=torchelastic_benchmark --j=4000x1
```
Differential Revision: D56605193
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124982
Approved by: https://github.com/kiukchung, https://github.com/kurman
- sets it as a fake stack trace, since we don't have a generic comment feature
- when verbose is disabled, still adds a context manager and flag checks (see the sketch below). The alternative is to use macros, but those wouldn't be usable with TORCH_LOGS
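A hedged sketch of the flag-guarded context-manager pattern this describes; the names are illustrative, not the actual implementation:
```python
import contextlib

verbose_enabled = False  # imagine this flag is driven by TORCH_LOGS at runtime

def emit_fake_stack_trace(comment: str) -> None:
    # Hypothetical stand-in for attaching the comment as a fake stack trace.
    print(f"# {comment}")

@contextlib.contextmanager
def annotate(comment: str):
    # Even with verbose disabled, the context manager and flag check still run;
    # a macro would avoid that cost but couldn't be toggled via TORCH_LOGS.
    if verbose_enabled:
        emit_fake_stack_trace(comment)
    yield

with annotate("why this guard was added"):
    pass
```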
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124954
Approved by: https://github.com/jansel
## Description
Framework overhead is found to be significant for the onednn qconv op (used for quantization with the PT2E X86Inductor backend). This PR reduces the integration overhead by modifying the implementation of qconv.
## Performance results
Running quantized ResNet50 on an Intel(R) Xeon(R) Platinum 8490H machine
Before
```
Average latency: 8.378 ms.
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                     Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------
onednn::qconv2d_pointwise        86.54%       6.954ms        87.42%       7.025ms     132.547us            53
```
After
```
Average latency: 6.255 ms.
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                     Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------
onednn::qconv2d_pointwise        85.05%       6.381ms        85.98%       6.451ms     121.717us            53
```
Test script:
```python
import torch
import torchvision
import time
import copy
import numpy as np
from torch._export import capture_pre_autograd_graph
from torch.ao.quantization.quantize_pt2e import (
    prepare_pt2e,
    convert_pt2e,
)
import torch.ao.quantization.quantizer.x86_inductor_quantizer as xiq
from torch.ao.quantization.quantizer.x86_inductor_quantizer import X86InductorQuantizer

torch._inductor.config.cpp.enable_kernel_profile = True
torch._inductor.config.profiler_mark_wrapper_call = True
torch._inductor.config.freezing = True
torch._inductor.config.cpp_wrapper = True

def bench_model(model, inputs):
    times = []
    with torch.no_grad():
        for _ in range(5):  # warm-up
            output = model(inputs)
        for _ in range(20):
            start_time = time.time()
            output = model(inputs)
            end_time = time.time()
            times.append(end_time - start_time)
        print('Average latency: %0.3f ms.' % (np.median(times) * 1000.0))
        with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CPU]) as p:
            out_ipex = model(inputs)
        print(p.key_averages().table(sort_by="self_cpu_time_total", row_limit=-1))

def pt2e_ptq(m, example_inputs):
    m = m.eval()
    exported_model = capture_pre_autograd_graph(m, example_inputs)
    quantizer = X86InductorQuantizer()
    quantizer.set_global(xiq.get_default_x86_inductor_quantization_config())
    prepared_model = prepare_pt2e(exported_model, quantizer)
    _ = prepared_model(*example_inputs)
    converted_model = convert_pt2e(prepared_model)
    torch.ao.quantization.move_exported_model_to_eval(converted_model)
    with torch.no_grad():
        optimized_model = torch.compile(converted_model)
        _ = optimized_model(*example_inputs)
        _ = optimized_model(*example_inputs)
        bench_model(optimized_model, *example_inputs)
    return optimized_model

if __name__ == "__main__":
    data = torch.randn(16, 3, 224, 224)
    model_fp = torchvision.models.resnet50(weights=torchvision.models.ResNet50_Weights.DEFAULT)
    pt2e_ptq(copy.deepcopy(model_fp), (data,))
```
Differential Revision: [D56288440](https://our.internmc.facebook.com/intern/diff/D56288440)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123240
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
Summary:
This diff fixes a bug in PyTorch where creating a tensor from a list of booleans threw an error.
All credit goes to swolchok for identifying the root cause of the issue and suggesting this fix.
Test Plan: Running our model end to end works as expected and no error occurs.
Differential Revision: D55990810
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124899
Approved by: https://github.com/zhxchen17
**Summary**
This PR is an attempt to land an experimental feature designed in #103686. `local_map` is designed to let users apply a function written for `torch.Tensor` to `DTensor` objects.
As a function, `local_map` takes 2 required arguments (`func` and `out_placements`) and 3 optional arguments (`device_mesh`, `in_placements`, `redistribute_inputs`). `func` is the function to be applied to each local shard of the input `DTensor`s. `out_placements` is the sharding specification of the output `DTensor`s.
`local_map` returns a new function that does the following:
1. Infers `device_mesh` and `in_placements` from the `DTensor` inputs if they're not provided. If `device_mesh` is provided, it must be identical to the device mesh of every `DTensor` input. If `in_placements` is provided, it serves as the required sharding specification of the corresponding `DTensor` input before feeding its local shard into `func`. If it differs from the `DTensor`'s sharding specification, an exception is raised when `redistribute_inputs=False`; otherwise the input is resharded to the required sharding.
2. Calls `func` with the arguments passed in along with `device_mesh`, except that each `DTensor` argument is replaced by its local shard. This `func` may include collectives.
3. For each output of `func` that has a valid (i.e. not `None`) sharding specification in `out_placements`, constructs a new `DTensor` from the output and the specification, and uses this `DTensor` as the output.
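A hedged usage sketch based on the description above; the import path and exact argument shapes are assumptions and may differ from the landed API:
```python
import torch
from torch.distributed._tensor import distribute_tensor, Shard
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor.experimental import local_map  # assumed path

def add_local(x, y):
    # Written against plain torch.Tensor; only ever sees local shards.
    return x + y

# Assumes an initialized process group with 4 GPUs.
mesh = init_device_mesh("cuda", (4,))
x = distribute_tensor(torch.randn(8, 8), mesh, [Shard(0)])
y = distribute_tensor(torch.randn(8, 8), mesh, [Shard(0)])

add_dt = local_map(
    add_local,
    out_placements=[Shard(0)],               # sharding spec for the output DTensor
    in_placements=([Shard(0)], [Shard(0)]),  # required specs for each input
    device_mesh=mesh,
    redistribute_inputs=True,                # reshard inputs if specs mismatch
)
out = add_dt(x, y)  # a DTensor sharded on dim 0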
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123676
Approved by: https://github.com/wanchaol
torch.library.register_fake reports the python module the fake impl is
located in. This is used to check against
`m.set_python_module("foo.bar")` calls in C++.
The module reporting logic was wrong in most cases. This PR fixes it.
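For illustration, a minimal sketch of registering a fake impl; `mylib::mul2` is a hypothetical op, not one from this PR:
```python
import torch

@torch.library.custom_op("mylib::mul2", mutates_args=())
def mul2(x: torch.Tensor) -> torch.Tensor:
    return x * 2.0

@torch.library.register_fake("mylib::mul2")
def _(x):
    # The Python module this function is defined in is what gets reported and
    # checked against any m.set_python_module("foo.bar") call on the C++ side.
    return torch.empty_like(x)
```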
Test Plan:
- exhaustive tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125037
Approved by: https://github.com/williamwen42
Improves the Cutlass backend GEMM template:
* Adds code that allows creating stand-alone test runners for Cutlass GEMM kernels, enabling (manual) debugging of, for example, CUDA IMA errors or similar problems that occur in practice. Includes some utility code and tests that actually compile and run these standalone tests.
* Cleans up the GEMM template code through various refactorings
* Eliminates code sections and options that are unnecessary now that epilogue fusions are being removed.
* Limits the scope of a workaround for (flaky) Cutlass issues with bias broadcasting to the necessary cases.
* Puts some CPU runtime checks into #if / #endif blocks, so it's possible to compile CUTLASS kernels with lower CPU overhead.
* Adds documentation comments
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124577
Approved by: https://github.com/jansel
ghstack dependencies: #124576
Summary:
Previous attempts didn't work out: D49720297 caused an online training SEV due to extra importing, and D56299408 mitigated a tricky bug in the Distributed Shampoo constructor but unfortunately didn't fix the scuba logging either.
see f552546983
Test Plan: {F1491621504}
Differential Revision: D56378270
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124593
Approved by: https://github.com/anijain2305
This PR adds support for the tensor `is_complex` method in Dynamo. Take the following code as an example:
```python
def test_tensor_is_complex(x):
    if x.is_complex():
        return x + 1
    else:
        return x - 1
```
Before this fix, the `is_complex()` call caused a graph break ("torch.* op returned non-Tensor bool call_method is_complex"). After this fix, the graph break is avoided.
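A quick way to exercise this, assuming a build with the fix; `fullgraph=True` makes any remaining graph break fail loudly:
```python
import torch

compiled = torch.compile(test_tensor_is_complex, fullgraph=True)
print(compiled(torch.randn(3)))                         # real input: x - 1
print(compiled(torch.randn(3, dtype=torch.complex64)))  # complex input: x + 1
```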
Fixes #122692
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124927
Approved by: https://github.com/ezyang
A better query for the base commit of a PR.
Some ghstack PRs are not connected to main, so `git merge-base` doesn't work. Instead, use the GitHub API to query for the base of the PR, which should be more accurate.
Sanity checked on one of Ed's ghstack PRs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122214
Approved by: https://github.com/seemethere
I'm restoring the `training` and `inference` options after github.com/pytorch/pytorch/pull/124795 and removing the lesser-known `cppwrapper` option instead, per @desertfire's suggestion. The total number of parameters remains at 10.
Also, the default choices for training and inference are now explicitly spelled out when dispatching the workflow manually, to catch dev attention.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124971
Approved by: https://github.com/ezyang
Test the generic torch.Stream/Event with the fake device guard and hooks. Since we added a fake device backend, it is mutually exclusive with other backends. Tests will be skipped if TEST_CUDA or TEST_ROCM is true.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123614
Approved by: https://github.com/albanD
ghstack dependencies: #123611, #123612
The MTIA device has its own module in PyTorch now.
torch.mtia has the following APIs, similar to other backends. lazy_init is also supported.
```
__all__ = [
    "init",
    "is_available",
    "synchronize",
    "device_count",
    "current_device",
    "current_stream",
    "default_stream",
    "set_stream",
    "stream",
    "device",
]
```
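A hedged usage sketch exercising the listed APIs, assuming they mirror the semantics of other backend modules such as torch.cuda:
```python
import torch

if torch.mtia.is_available():
    torch.mtia.init()
    print(torch.mtia.device_count())
    idx = torch.mtia.current_device()
    with torch.mtia.device(idx):        # device context manager
        s = torch.mtia.current_stream()
        with torch.mtia.stream(s):      # stream context manager
            torch.mtia.synchronize()
```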
------------
For device management, we expand AcceleratorHooksInterface to support generic device management; it can be used in both C++ and Python.
```
def _accelerator_hooks_device_count() -> _int: ...
def _accelerator_hooks_set_current_device(device_index: _int) -> None: ...
def _accelerator_hooks_get_current_device() -> _int : ...
def _accelerator_hooks_exchange_device(device_index: _int) -> _int : ...
def _accelerator_hooks_maybe_exchange_device(device_index: _int) -> _int : ...
```
---------
Adding get_device_module API to retrieve device modules for different device types.
```
def get_device_module(device: Optional[Union[torch.device, str]] = None)
```
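A hedged usage sketch, assuming the API is exposed as `torch.get_device_module`:
```python
import torch

cuda_mod = torch.get_device_module("cuda")  # returns the torch.cuda module
default_mod = torch.get_device_module()     # module for the current device type
print(cuda_mod.is_available())
```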
---------
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123612
Approved by: https://github.com/albanD
ghstack dependencies: #123611
fake_tensor.py had its mypy errors ignored. That seems less than desirable.
Also added `SafePyObjectT<T>`, a tagged wrapper around `SafePyObject` that provides static type checking (with no other guarantees).
Used `SafePyObjectT<TorchDispatchModeKey>` on some of the TorchDispatchModeTLS API to ensure that we don't accidentally inject a different type than expected into the stack.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124428
Approved by: https://github.com/malfet
Summary:
Previously, `requires_grad` was not propagated from the original Tensor to the decomposed tensors.
Test Plan:
python test/test_parametrization.py -k test_register_parametrization_no_grad
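A minimal sketch in the spirit of that test, assuming the fix; the `Sum` parametrization is illustrative, not the one used in the test suite:
```python
import torch
import torch.nn as nn
from torch.nn.utils import parametrize

class Sum(nn.Module):
    # right_inverse decomposes the weight into two tensors; forward recombines.
    def forward(self, a, b):
        return a + b

    def right_inverse(self, w):
        return w * 0.5, w * 0.5

m = nn.Linear(4, 4)
m.weight.requires_grad_(False)
parametrize.register_parametrization(m, "weight", Sum())
# With the fix, the decomposed tensors inherit requires_grad=False.
assert not m.parametrizations.weight.original0.requires_grad
assert not m.parametrizations.weight.original1.requires_grad
```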
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124888
Approved by: https://github.com/lezcano