```
Between the time we switch to the native funcol by default and the time when
we are confident that we can remove the legacy implementation, we want to
ensure that the legacy funcol remains covered by unit tests. This is to
prepare for any potential (but unlikely) reverts. The following utilities
help achieve this goal.
run_with_{native,legacy}_funcol - mark a test to run with only
{native,legacy} funcol. These decorators are for impl specific tests (e.g.
verifying generated code with FileCheck).
run_with_both_funcol_impls - parametrize a test to run with both legacy and
native funcol.
run_with_both_funcol_impls_with_arg - same as run_with_both_funcol_impls, but
passes `enable_native_funcol` to the test so impl specific checks can be
carried out.
```
This PR also marks some tests we want to cover in this fashion. More tests will be marked in subsequent PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119950
Approved by: https://github.com/wanchaol
ghstack dependencies: #119881
This is a lot of files changed! Don't panic! Here's how it works:
* Previously, we set `follow_imports = silent` for our mypy.ini configuration. Per https://mypy.readthedocs.io/en/stable/running_mypy.html#follow-imports, what this does is whenever we have an import to a module which is not listed as a file to be typechecked in mypy, we typecheck it as normal but suppress all errors that occurred in that file.
* When mypy is run inside lintrunner, the list of files is precisely the files covered by the glob in lintrunner.toml, but with files in excludes excluded.
* The top-level directive `# mypy: ignore-errors` instructs mypy to typecheck the file as normal, but ignore all errors.
* Therefore, it should be equivalent to set `follow_imports = normal`, if we put `# mypy: ignore-errors` on all files that were previously excluded from the file list.
* Having done this, we can remove the exclude list from .lintrunner.toml, since excluding a file from typechecking is baked into the files themselves.
* torch/_dynamo and torch/_inductor were previously in the exclude list, because they were covered by MYPYINDUCTOR. It is not OK to mark these as `# mypy: ignore-errors` as this will impede typechecking on the alternate configuration. So they are temporarily being checked twice, but I am suppressing the errors in these files as the configurations are not quite the same. I plan to unify the configurations so this is only a temporary state.
* There were some straggler type errors after these changes somehow, so I fixed them as needed. There weren't that many.
In the future, to start type checking a file, just remove the ignore-errors directive from the top of the file.
The codemod was done with this script authored by GPT-4:
```
import glob
exclude_patterns = [
...
]
for pattern in exclude_patterns:
for filepath in glob.glob(pattern, recursive=True):
if filepath.endswith('.py'):
with open(filepath, 'r+') as f:
content = f.read()
f.seek(0, 0)
f.write('# mypy: ignore-errors\n\n' + content)
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118414
Approved by: https://github.com/thiagocrepaldi, https://github.com/albanD
This replaces a bunch of unnecessary lambdas with the operator package. This is semantically equivalent, but the operator package is faster, and arguably more readable. When the FURB rules are taken out of preview, I will enable it as a ruff check.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116027
Approved by: https://github.com/malfet
Previously:
```
[W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
[W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
```
With this PR, those warnings disappear. They were introduced in #114077
This change was generated with this sed script, applied with `sed -i -f /tmp/x **/*.{py,hpp,cpp,cc}` and hand inspected.
```
s/\bNCCL_BLOCKING_WAIT\b/TORCH_NCCL_BLOCKING_WAIT/g
s/\bNCCL_ENABLE_TIMING\b/TORCH_NCCL_ENABLE_TIMING/g
s/\bNCCL_DESYNC_DEBUG\b/TORCH_NCCL_DESYNC_DEBUG/g
s/\bNCCL_ASYNC_ERROR_HANDLING\b/TORCH_NCCL_ASYNC_ERROR_HANDLING/g
s/\bENABLE_NCCL_HEALTH_CHECK\b/TORCH_ENABLE_NCCL_HEALTH_CHECK/g
s/\bNCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK\b/TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK/g
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114880
Approved by: https://github.com/kwen2501
This test(8340762211/test/distributed/test_multi_threaded_pg.py (L133) ) is failing on internal sandbox with the following error msg:
```
File "/data/sandcastle/boxes/eden-trunk-hg-fbcode-fbsource/buck-out/v2/gen/fbcode/8c7462494077df89/caffe2/test/distributed/__multi_threaded__/multi_threaded#link-tree/torch/testing/_internal/distributed/multi_threaded_pg.py", line 255, in _start_coll
raise Exception(
Exception: world not ready, only 3 PG's registered but world has 4 ranks
exiting thread 1
ERROR
```
Internal error report: https://www.internalfb.com/intern/test/562950031915334?ref_report_id=0
We believe this is because we no longer perform barrier after init (see https://github.com/pytorch/pytorch/pull/99937).
This PR temporarily turn back on ```TORCH_DIST_INIT_BARRIER``` to avoid flaky test for the time being, but we should look into it to find a way to properly do this.
cc. @kumpera @kwen2501
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103568
Approved by: https://github.com/H-Huang
currently the test
```
pytest test/distributed/test_multi_threaded_pg.py -vs
```
has errors
```
Traceback (most recent call last):
File "/private/home/howardhuang/.conda/envs/pytorch/lib/python3.9/threading.py", line 980, in _bootstrap_inner
self.run()
File "/private/home/howardhuang/.conda/envs/pytorch/lib/python3.9/threading.py", line 917, in run
self._target(*self._args, **self._kwargs)
File "/private/home/howardhuang/pytorch-projects/pytorch/torch/testing/_internal/common_distributed.py", line 1029, in _run
self._tls.precision = TestCase._precision
AttributeError: 'TestCollectivesWithBaseClass' object has no attribute '_tls'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93883
Approved by: https://github.com/awgu, https://github.com/wanchaol
Time comparison between using MultithreadedTestCase and MultiProcessTestCase on op db tests is amazing!
using MultiThreadTestCase on a AWS dev node:
```
time pytest test/distributed/_tensor/test_dtensor_ops.py
============= 175 passed, 42 skipped, 397 xfailed in 80.30s (0:01:20) =======
real 1m22.330s
user 1m38.782s
sys 0m18.762s
```
MultiProcessTestCase spends from 40mins to more than 1h, even if using pytest parallel testing tools.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92198
Approved by: https://github.com/XilunWu
This PR did a full rewrite of MultiThreadedTestCase, to make it more
aligned with the MultiProcessTestCase, also changed how it do spawning
and testing, so that we could embed thread local states when running
tests.
This PR enables device_type tests to work with MultiThreadedTestCase
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91650
Approved by: https://github.com/XilunWu
This PR includes:
Changes from @kumpera (https://github.com/pytorch/pytorch/pull/86327): adding MultiThreaded FileSystemWriter for distributed checkpointing, which adds two knobs to FileSystemWriter: thread_count and per_thread_copy_ahead. This increases up to 50% performance improvement on 32 GPUS workloads on AWS.
Add parametrize tests to /test/distributed/_shard/checkpoint/test_file_system_checkpoint.py and /test/distributed/_shard/checkpoint/test_file_system_checkpoint_cpu.py
Modify @with_comms in ShardedTensorTestBase to take in *args and **kwargs.
Tests:
```
python3 test/distributed/checkpoint/test_file_system_checkpoint_cpu.py
```
test/distributed/checkpoint/test_file_system_checkpoint.py(GPU tests) runs fine locally but would timeout on CI. We will use thread-based PG and update this test in following PR.
[T134844615]
## Add docstring and update comments in the following PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87987
Approved by: https://github.com/fduwjj
Add sequence number support for UCC, mostly following format of ProcressGroupNCCL.
Pass new test: `test_all_gather_object_subgroup`
Add skips for gather tests: `test_gather_object` and `test_gather_object_subgroup`
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85047
Approved by: https://github.com/kwen2501
Move a bunch of globals to instance methods and replace all use to them.
We move all PG related globals under World and use a singleton instance under _world.
This creates an undocumented extension point to inject full control of how how c10d
state behaves.
One simple hack is to change _world to an implementation that uses a threadlocal
and enable per-thread PGs.
It almost get DDP working and the PG is missing an implementation of all_reduce.
This enables notebook usage of PTD, which is a big deal for learning it:
https://gist.github.com/kumpera/32cb051fa26b8cad8bdf671f968dcd68
This change ensures BC by keeping the global variables around and have the default _World wrap it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86348
Approved by: https://github.com/rohan-varma
Move a bunch of globals to instance methods and replace all use to them.
We move all PG related globals under World and use a singleton instance under _world.
This creates an undocumented extension point to inject full control of how how c10d
state behaves.
One simple hack is to change _world to an implementation that uses a threadlocal
and enable per-thread PGs.
It almost get DDP working and the PG is missing an implementation of all_reduce.
This enables notebook usage of PTD, which is a big deal for learning it:
https://gist.github.com/kumpera/32cb051fa26b8cad8bdf671f968dcd68
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84153
Approved by: https://github.com/rohan-varma
- [x] Direct dependency on UCX is completely removed, UCC active set API always enabled
- [x] Remove `TORCH_UCC_PROFILING_ENABLE`, always enable profiling
- [x] Fixes profiling of `recv` and `all_gather`
- [x] Use the NCCL TL of UCC on CUDA, as the UCP TL is not well supported on CUDA
Most tests are passing, but there are a few skipped tests:
- `scatter` and `gather` are not supported by the UCP TL of UCC on CPU tensors
- A few flaky tests in PyTorch's CI environment
- Profiler-related failures, some of them will be fixed by @Fuzzkatt in https://github.com/pytorch/pytorch/pull/84368
After this PR is merged, I will continue to work on these skipped failures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83285
Approved by: https://github.com/vtlam, https://github.com/malfet, https://github.com/kwen2501
When running on clusters the startup time for the subprocesses might be much higher which leads to spurious failures.
So increase this to 300s similar to torch/testing/_internal/distributed/distributed_test.py
Also introduces `DISTRIBUTED_TESTS_DEFAULT_TIMEOUT` as suggested by @malfet in #55896
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80330
Approved by: https://github.com/malfet