Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56386
This diff fixes a bug in handler resolution:
_create_static_handler pointed towards etcd, and _create_etcd_handler pointed towards static.
Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed:test_launcher
Added test_launcher to the ci/cd tests
Reviewed By: cbalioglu
Differential Revision: D27858897
fbshipit-source-id: 440155789958c091ce5755e7c9524e4bb704203a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56039
Python will try to eagerly resolve the name references even if
the import failed. Quote them so that it doesn't.
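A minimal sketch of the pattern, assuming an optional dependency that may be missing. The module name `some_optional_backend` and the function `describe` are hypothetical and do not come from the original change.

```python
# Hypothetical optional dependency; it stands in for any module that may be absent.
try:
    import some_optional_backend
except ImportError:
    some_optional_backend = None


# If the annotation were unquoted, `some_optional_backend.Handle` would be
# evaluated when the `def` statement runs and would fail after a failed import.
# As a string, it is stored without being resolved.
def describe(handle: "some_optional_backend.Handle") -> str:
    return f"handle={handle!r}"
```

The quoted annotation is kept as a plain string in `describe.__annotations__`, so defining the function does not depend on the import succeeding.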
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: janeyx99
Differential Revision: D27770536
Pulled By: ezyang
fbshipit-source-id: b111739289498f9bab856fb9424f3080efee4ee0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55695
This makes it possible to run CUDA tests on their own (e.g., to avoid running CPU tests on GPU machines).
Done by moving test methods to a separate class (and sometimes introducing a "common" base class for utils), and then providing new entry points inside a `cuda/` subdirectory.
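The split described above could look roughly like the sketch below. All class and function names are hypothetical, and `load_cuda_only` only stands in for an entry point that would live under `cuda/`.

```python
import unittest


class CommonTestBase(unittest.TestCase):
    """Hypothetical shared base holding helpers used by CPU and CUDA tests."""

    def make_values(self):
        return [1.0, 2.0, 3.0]


class ExampleCpuTest(CommonTestBase):
    def test_sum(self):
        self.assertEqual(sum(self.make_values()), 6.0)


class ExampleCudaTest(CommonTestBase):
    def test_sum_on_gpu(self):
        # A real test would move the data to a CUDA device here.
        self.assertEqual(sum(self.make_values()), 6.0)


def load_cuda_only():
    """Stand-in for a `cuda/` entry point: collect only the CUDA test class."""
    return unittest.TestLoader().loadTestsFromTestCase(ExampleCudaTest)
```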
Test Plan: Checked they are run on Sandcastle.
Reviewed By: mrshenli
Differential Revision: D27618198
fbshipit-source-id: 8f671657f79c8ae115748ab7752fe0066705893b
Summary:
1. Moved module-related tests to test_module_container
2. Created test_types for types and annotations
3. Created test_misc for the rest
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55560
Reviewed By: VitalyFedyunin
Differential Revision: D27650911
Pulled By: walterddr
fbshipit-source-id: d895a7da9e9c3d25a662a37faf4daabc276b9c1a
Summary:
Prettifies JSON files .pytorch-test-times and .pytorch-slow-tests so that not everything is on one single line.
This is of slightly more importance as generated .pytorch-slow-tests ends up getting stored in our test-infra repo ([example](ad9cd87565)), and it is nice to not have that lil red symbol at the end.
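A minimal sketch of this kind of formatting with the standard `json` module. The indent width and the helper name `dump_pretty` are assumptions, not taken from the PR.

```python
import json


def dump_pretty(data: dict) -> str:
    # indent=2 puts each entry on its own line; the explicit trailing newline
    # avoids the "no newline at end of file" marker in diff views.
    return json.dumps(data, indent=2) + "\n"
```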
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55335
Reviewed By: samestep
Differential Revision: D27576930
Pulled By: janeyx99
fbshipit-source-id: be58565b8c8593a9bfcfab383ee19facc79f0572
Summary:
Moves more s3 parsing code to s3_stat_parser.py. This is another step in modularizing the parsing code more correctly. I will also be using this exact function in future slowTest code.
Also replaces some uses of `Any` in the code with `Report`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54808
Test Plan:
.pytorch-test-times generated before the code and after this code is the same.
CI should pass, specifically the test tools GHA.
Reviewed By: walterddr
Differential Revision: D27375783
Pulled By: janeyx99
fbshipit-source-id: bec28551668b2eb3fdd60d802200993e493eac83
Summary:
First step toward moving all S3-related operations into the S3 parser utils.
In the end, s3_stats_parser will provide APIs for:
1. downloading data as reports and uploading data as reports
2. filtering by job name
and will handle all compression and formatting internally.
TODO
- [ ] Refactor out upload into s3_stats_parser
- [ ] Remove all S3/BOTO related checkers and try/catch blocks outside of s3_stats_parser
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54681
Test Plan:
1. Running tools/test/* covers the refactoring logic (test_test_history.py and test_stats.py are the entry points, and both use the 2 new APIs in s3_stats_parser after the refactoring).
2. print_test_stats.py's main argparse entrypoint is covered by CI step Report Test Result step.
3. Running `python test/run_test.py --export-past-test-times` before and after this PR should produce the same file content in .pytorch-test-times.
Reviewed By: ailzhang
Differential Revision: D27346742
Pulled By: walterddr
fbshipit-source-id: fb40162e631e007fed9d5821fe4f190bda2cb52e
Summary:
Since `_test1`, `_test2`, `_build`, and `test` are all stripped, `slow_test` should be stripped as well. This way, the `_slow_test` stats are treated as part of the stats for a particular build job. Currently this has little effect, because the jobs don't share a common stem: the build job name has `_gcc7` while the slow_test CI job name does not.
This makes me think...do we omit the `gcc7` intentionally? Are there other things I should strip, e.g., `multigpu_test`?
See:
ci/circleci: pytorch_linux_xenial_cuda10_2_cudnn7_py3_slow_test
ci/circleci: pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test1
ci/circleci: pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test2
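To illustrate the stemming issue, here is a rough sketch that strips the suffixes named above. The regular expression and the helper name `stem_job_name` are assumptions for illustration and not the code in the PR.

```python
import re

# Hypothetical suffix pattern built only from the suffixes mentioned above.
_JOB_SUFFIX = re.compile(r"_(?:slow_test|test\d*|build)$")


def stem_job_name(job: str) -> str:
    """Strip a trailing build/test suffix from a CI job name."""
    return _JOB_SUFFIX.sub("", job)
```

With this pattern the slow_test job stems to `..._py3` while the test1 and test2 jobs stem to `..._py3_gcc7`, which is the mismatch described above.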
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54528
Reviewed By: samestep
Differential Revision: D27270393
Pulled By: janeyx99
fbshipit-source-id: ffb7289cfe4dba52ded67f50a89f3e75e7bad68d
Summary:
Step 2 to fixing https://github.com/pytorch/pytorch/issues/53882 :)
This changes TARGET_DET_LIST and sharding automation by checking if there's already cached data from the commit in `.pytorch-test-times`. If not, it pulls data from S3 and updates the file to have the stats. This way, S3 pulling does not need to happen more than once for the same commit.
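The caching behavior can be sketched as follows. This is an illustration under assumptions: the function name `load_or_fetch_job_times` and the `fetch` callback (standing in for the S3 pull) are hypothetical, and the `commit`/`job_times` layout follows the sample `.pytorch-test-times` content shown for the export option.

```python
import json
import os
from typing import Callable, Dict


def load_or_fetch_job_times(
    path: str, commit: str, fetch: Callable[[], Dict[str, float]]
) -> Dict[str, float]:
    """Return cached times for `commit`, calling `fetch` only on a cache miss."""
    if os.path.exists(path):
        with open(path) as f:
            cached = json.load(f)
        if cached.get("commit") == commit:
            return cached["job_times"]
    # Cache miss: pull the data once (e.g. from S3) and store it for later runs.
    job_times = fetch()
    with open(path, "w") as f:
        json.dump({"commit": commit, "job_times": job_times}, f)
    return job_times
```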
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54210
Test Plan:
the following methods should run the same set of tests.
First `export CIRCLE_JOB=pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test2` or your favorite CIRCLE JOB.
1. Pull data first and use it:
Download the data from S3 and write it to the cache file with `python test/run_test.py --export-historic-test-times .pytorch-test-times`
Now run `python test/run_test.py --shard 1 10`
2. Make the sharding job pull data:
Delete the file you just created: `rm .pytorch-test-times`
Now run `python test/run_test.py --shard 1 10`
Reviewed By: walterddr
Differential Revision: D27136849
Pulled By: janeyx99
fbshipit-source-id: 51a42c4e2fa3f8cf15e682679dd3eb6130aad927
Summary:
This is an initial attempt at refactoring and consolidating our S3 read logic for print_test_stats.py, test_history.py, and run_test.py. This way, boto3 and botocore do not need to be imported in various places throughout the code base, and duplicated logic (such as the many type definitions) can exist in one place: `tools/stat_utils/s3_stat_parser.py`. walterddr contributed to this PR by moving print_test_stats.py to the tools folder and the corresponding tests to a subfolder within tools.
**NOTE: this removes those tests from CI, as the new `tools/test/test_stats.py` is not in the test/ directory, unlike the other tests listed in TESTS in run_test.py.**
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53755
Test Plan:
This refactoring change should not break anything, so running the files as before should work as they did previously.
To make sure that print_test_stats.py still functions: run `python tools/test/test_stats.py` and make sure all tests pass.
To make sure that test_history.py works, run the example commands from `tools/test_history.py --help` and check that their output matches that shown. Note that the script will continue printing for a while, so don't be alarmed.
Some next steps:
- Actually coming up with similarities among the three current use cases and further refactoring/consolidating of functions (e.g., combining simplify and get_cases)
- Moving more parsing logic to s3_stat_parser.py to have better abstraction between our files
- Adding tests for s3_stat_parser.py when there is more functionality in it
Reviewed By: agolynski, samestep
Differential Revision: D27030285
Pulled By: janeyx99
fbshipit-source-id: e664781324ef7c0c30943bfd7f17c895075ef7a7
Summary:
This will allow for future work to use the test times file (which will save computation time and also allow for more consistency). (Step one to fixing https://github.com/pytorch/pytorch/issues/53882)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54083
Test Plan:
export CIRCLE_JOB=your-favorite-circleci-job e.g., pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test2
`python test/run_test.py --export-historic-test-times` OR
`python test/run_test.py --export-historic-test-times .your-favorite-file`
When opening either .pytorch-test-times or .your-favorite-file, you should see something like:
```
{"commit": "2d559a09392aabb84dfb4a498010b2f01d99818c", "job_times": {"distributed/test_distributed_spawn": 583.5889999999973, "distributed/test_data_parallel": 4.866999999999997, "test_binary_ufuncs": 171.1569999999998, "test_numpy_interop": 2.5649999999999995, "test_public_bindings": 0.011,...}}
```
Note that no tests will be run when this option is specified.
Reviewed By: walterddr
Differential Revision: D27091351
Pulled By: janeyx99
fbshipit-source-id: e191d739268d86de0a0ba0eea0006969859d1940
Summary:
This PR:
1. moves sharding algorithm from run_test.py to framework_utils.py (let me know if you have a better place for it)
2. adds tests for the algorithm in test_testing.py
3. fixes the algorithm so that it doesn't tack on the unknown jobs all to the shard with the minimum time, but instead distributes them around the shards.
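One way to picture the fixed behavior is sketched below. It assumes a greedy longest-first assignment for tests with known times and a round-robin spread for tests without times; both choices and the name `shard_tests` are assumptions, not necessarily what the PR implements.

```python
from typing import Dict, List


def shard_tests(
    num_shards: int, tests: List[str], job_times: Dict[str, float]
) -> List[List[str]]:
    """Balance known tests by time, then spread unknown tests across shards."""
    totals = [0.0] * num_shards
    shards: List[List[str]] = [[] for _ in range(num_shards)]

    # Longest known tests first, each going to the currently lightest shard.
    known = sorted(
        (t for t in tests if t in job_times),
        key=lambda t: job_times[t],
        reverse=True,
    )
    for test in known:
        i = totals.index(min(totals))
        shards[i].append(test)
        totals[i] += job_times[test]

    # Tests with no recorded time are dealt out round-robin instead of all
    # being appended to the shard with the minimum time.
    unknown = [t for t in tests if t not in job_times]
    for k, test in enumerate(unknown):
        shards[k % num_shards].append(test)
    return shards
```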
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53942
Test Plan: python test/test_testing.py -k TestFrameworkUtils
Reviewed By: samestep
Differential Revision: D27047223
Pulled By: janeyx99
fbshipit-source-id: 824b20009c0bb707aa5361de445cdec795d5e3f1
Summary:
First argument is either file name or test module name, but key to `CUSTOM_HANDLERS` is test module name.
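A small sketch of the lookup problem, assuming the fix normalizes the argument to a module name before consulting the handler table. The handler functions and `parse_test_module` here are hypothetical stand-ins; only the `CUSTOM_HANDLERS` name and the example path come from the description.

```python
from typing import Callable, Dict


def parse_test_module(test: str) -> str:
    """Turn 'pkg/test_x.py' or 'pkg/test_x' into the module name 'pkg/test_x'."""
    return test[:-3] if test.endswith(".py") else test


def run_default(module: str) -> str:
    return f"default:{module}"


def run_distributed(module: str) -> str:
    return f"distributed:{module}"


# Keys are module names, never file names.
CUSTOM_HANDLERS: Dict[str, Callable[[str], str]] = {
    "distributed/test_distributed_spawn": run_distributed,
}


def dispatch(test: str) -> str:
    module = parse_test_module(test)
    handler = CUSTOM_HANDLERS.get(module, run_default)
    return handler(module)
```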
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53884
Test Plan: Run `python3 run_test.py -i distributed/test_distributed_spawn.py`
Reviewed By: janeyx99
Differential Revision: D27006164
Pulled By: malfet
fbshipit-source-id: f30b42856cd2754e5981c1c69618f84e392c986a
Summary:
This PR:
1. refactors the logic for S3 stats gathering.
2. renames SLOW_TESTS to TARGET_DET_LIST to avoid confusion with slowTest.
3. detects slow tests (tests with time > 5 min) to add to the TARGET_DET_LIST based on results in S3 from the previous nightly.
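The slow-test detection amounts to a threshold filter over per-test times. A minimal sketch, assuming times are in seconds and keyed by test name (the function and constant names are hypothetical):

```python
from typing import Dict, List

SLOW_THRESHOLD_SECONDS = 5 * 60  # "time > 5 min" from the description


def find_slow_tests(job_times: Dict[str, float]) -> List[str]:
    """Return the names of tests whose recorded time exceeds the threshold."""
    return sorted(
        name for name, seconds in job_times.items()
        if seconds > SLOW_THRESHOLD_SECONDS
    )
```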
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53549
Test Plan:
Set CIRCLE_JOB to your favorite CI job (like `pytorch_linux_bionic_py3_8_gcc9_coverage_test1`).
Run `python test/run_test.py --determine-from=<your fave pytorch files>`
e.g., `python test/run_test.py --determine-from=test/run_test.py`
Reviewed By: mrshenli
Differential Revision: D26904478
Pulled By: janeyx99
fbshipit-source-id: 9576b34f4fee09291d60e36ff2631753a3925094
Summary:
Context: https://github.com/pytorch/pytorch/pull/53299#discussion_r587882857
These are the only hand-written parts of this diff:
- the addition to `.github/workflows/lint.yml`
- the file endings changed in these four files (to appease FB-internal land-blocking lints):
- `GLOSSARY.md`
- `aten/src/ATen/core/op_registration/README.md`
- `scripts/README.md`
- `torch/csrc/jit/codegen/fuser/README.md`
The rest was generated by running this command (on macOS):
```
git grep -I -l ' $' -- . ':(exclude)**/contrib/**' ':(exclude)third_party' | xargs gsed -i 's/ *$//'
```
I looked over the auto-generated changes and didn't see anything that looked problematic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53406
Test Plan:
This run (after adding the lint but before removing existing trailing spaces) failed:
- https://github.com/pytorch/pytorch/runs/2043032377
This run (on the tip of this PR) succeeded:
- https://github.com/pytorch/pytorch/runs/2043296348
Reviewed By: walterddr, seemethere
Differential Revision: D26856620
Pulled By: samestep
fbshipit-source-id: 3f0de7f7c2e4b0f1c089eac9b5085a58dd7e0d97
Summary:
Uses nightly commit stats to automatically shard tests based on execution time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53269
Test Plan:
Set CIRCLE_JOB to an existing job, like `pytorch_linux_bionic_py3_6_clang9_test`.
Then you can run something like: `python test/run_test.py --shard 1 10`
Reviewed By: malfet
Differential Revision: D26819440
Pulled By: janeyx99
fbshipit-source-id: 6bc73d6aa3d52d9850817536be15d7b54a72780e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52910
**Summary**
PR #52158 tried to move all JIT bindings from `torch._C` to a new
submodule `torch._C._jit`, but that...did not go well. This pull request
adds the new `torch._C._jit` submodule, but does not migrate the
existing bindings. Instead, it adds a unit test that fails if any new
bindings are added to `torch._C`. A comment in the test instructs
developers to add their new binding to the allowlist if it really should
be in `torch._C`, or to add it to the appropriate submodule (e.g.,
`torch._C._jit`). The idea is to prevent the issue
described in #51691 from getting *worse* if it cannot be fixed.
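The allowlist check can be sketched without PyTorch itself. The helper below is a hypothetical illustration that compares a module's public names against an allowlist; the filtering rule is an assumption, and the actual unit test may differ.

```python
import types
from typing import List, Set


def unexpected_bindings(module: types.ModuleType, allowlist: Set[str]) -> List[str]:
    """List names on `module` that are not dunder names and not allowlisted."""
    return sorted(
        name
        for name in dir(module)
        if not name.startswith("__") and name not in allowlist
    )
```

A test built on this would assert that the returned list is empty and point developers at the allowlist or the appropriate submodule when it is not.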
**Test Plan**
Continuous integration.
**Fixes**
This commit fixes #51691.
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D26698373
Pulled By: SplitInfinity
fbshipit-source-id: ec9f5426051227a513d4fd09512b624420e0100b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52323
Using the default CPU allocator for ops executed on the qnnpack backend will result in
asan failures with heap overflow, since qnnpack (and xnnpack) can access inputs
beyond their end/beginning.
Here we are enabling this feature specifically to enable dynamic sparse linear op test
using qnnpack engine. In dynamic linear op, the fp32 bias is not packed and
hence can result in out-of-bound access.
Test Plan: test_set_default_mobile_cpu_allocator.py
Reviewed By: z-a-f
Differential Revision: D26263481
fbshipit-source-id: a49227cac7e6781b0db4a156ca734d7671972d9f
Summary:
Implement the first stage of ZeRO, sharding of the optimizer state, as described in [this blog post](https://www.microsoft.com/en-us/research/blog/zero-2-deepspeed-shattering-barriers-of-deep-learning-speed-scale/) and [this paper](https://arxiv.org/abs/1910.02054). This implementation is completely independent from the [DeepSpeed](https://github.com/microsoft/DeepSpeed) framework, and aims at providing ZeRO-compliant building blocks within the PyTorch scheme of things.
This works by:
- acting as a wrapper to a pytorch optimizer. ZeROptimizer does not optimize anything by itself, it only shards optimizers for distributed jobs
- each rank distributes parameters according to a given partitioning scheme (could be updated), and owns the update of a given shard only
- the .step() is called on each rank as expected, the fact that the optimizer actually works on a shard of the model is not visible from the outside
- when the update is completed, each rank broadcasts the updated model shard to all the other ranks
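The steps above can be pictured with a toy, single-process simulation in plain Python. This is not the ZeROptimizer API: the greedy size-based partition, the plain SGD update, and every name below are assumptions chosen only to show the "update own shard, then broadcast" flow.

```python
from typing import Dict, List


def partition(param_sizes: Dict[str, int], world_size: int) -> List[List[str]]:
    """Greedy size-balanced assignment of parameters to ranks (assumed scheme)."""
    shards: List[List[str]] = [[] for _ in range(world_size)]
    totals = [0] * world_size
    for name, size in sorted(param_sizes.items(), key=lambda kv: kv[1], reverse=True):
        rank = totals.index(min(totals))
        shards[rank].append(name)
        totals[rank] += size
    return shards


def simulated_step(
    params: Dict[str, float],
    grads: Dict[str, float],
    owned: List[List[str]],
    lr: float,
) -> List[Dict[str, float]]:
    """Each simulated rank updates only its shard, then shares the result."""
    replicas = [dict(params) for _ in owned]
    # Local update: rank r touches only the parameters it owns.
    for rank, names in enumerate(owned):
        for name in names:
            replicas[rank][name] = params[name] - lr * grads[name]
    # "Broadcast": the owner's updated value is copied to every other rank.
    for rank, names in enumerate(owned):
        for name in names:
            for other in replicas:
                other[name] = replicas[rank][name]
    return replicas
```

After the broadcast, every simulated rank holds the same fully updated parameters even though each one computed only its own shard.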
This can be used with DDP, although some communications are wasted in that case (gradients are all-reduced to all ranks). This implementation was initially developed in [Fairscale](https://github.com/facebookresearch/fairscale), and can also be used with an optimized DDP which only reduces to the relevant ranks. More context on ZeRO and PyTorch can be found in [this RFC](https://github.com/pytorch/pytorch/issues/42849)
The API with respect to loading and saving the state is a known pain point and should probably be discussed and updated. Other possible follow-ups include integrating more closely with a [modularized DDP](https://github.com/pytorch/pytorch/issues/37002), [making the checkpoints partition-agnostic](https://github.com/facebookresearch/fairscale/issues/164), [exposing a gradient clipping option](https://github.com/facebookresearch/fairscale/issues/98) and making sure that mixed precision states are properly handled.
original authors include msbaines, min-xu-ai and myself
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46750
Reviewed By: mruberry
Differential Revision: D25958918
Pulled By: blefaudeux
fbshipit-source-id: 14280f2fd90cf251eee8ef9ac0f1fa6025ae9c50
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49698
Reincarnation of #47620 by jamesr66a.
It's just an initial bunch of things that we're exposing to Python; more
is expected to come in the future. Some things can probably be done better,
but I'm putting this out anyway, since some other people were interested
in using and/or developing this.
Differential Revision: D25668694
Test Plan: Imported from OSS
Reviewed By: bertmaher
Pulled By: ZolotukhinM
fbshipit-source-id: fb0fd1b31e851ef9ab724686b9ac2d172fa4905a
Summary:
Used to temporarily change the working directory, restoring it even if an exception is raised.
Use it in test_type_hints and during code coverage collection.
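A helper with this behavior is commonly written as a context manager. The sketch below is a generic version; the name `set_cwd` is an assumption here and is not confirmed by this description.

```python
import contextlib
import os
from typing import Iterator


@contextlib.contextmanager
def set_cwd(path: str) -> Iterator[None]:
    """Temporarily switch the working directory to `path`."""
    old_cwd = os.getcwd()
    try:
        os.chdir(path)
        yield
    finally:
        # Runs on normal exit and when the body raises, so the original
        # directory is always restored.
        os.chdir(old_cwd)
```

Usage is `with set_cwd(some_dir): ...`; the `finally` clause is what guarantees the restore on exceptions.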
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49657
Reviewed By: walterddr
Differential Revision: D25660543
Pulled By: malfet
fbshipit-source-id: 77f08d57e4b60b95daa4068d0dacf7c25f978526
Summary:
Instead of calling the coverage frontend, import the coverage module and call combine() and html_report().
Fixes https://github.com/pytorch/pytorch/issues/49596 by not using a strict mode when combining those reports
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49615
Reviewed By: seemethere
Differential Revision: D25645196
Pulled By: malfet
fbshipit-source-id: be55b5c23a3569a331cbdf3f86d8c89bc27d5fe1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47829
As per proposal in https://github.com/pytorch/pytorch/issues/44827,
the API needs to return an RRef to support inter-host pipelining.
For now, we just return a local RRef and only support pipeline on a single
host. But having this change in the API upfront ensures we don't make any BC
breaking changes later.
ghstack-source-id: 118366784
Test Plan: waitforbuildbot
Reviewed By: rohan-varma
Differential Revision: D24914022
fbshipit-source-id: e711e7d12efa45645f752f0e5e776a3d845f3ef5
Summary:
This adds a transform to convert a real vector of (D * (D-1))/2 dimension into the cholesky factor of a D x D correlation matrix. This follows the implementation in [NumPyro](https://github.com/pyro-ppl/numpyro/blob/master/numpyro/distributions/transforms.py) by fehiepsi. This is needed for the LKJDistribution which will be added in a subsequent PR.
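As a quick check of the dimension arithmetic, the helper below (hypothetical, not part of the PR) recovers D from the length of the input vector by inverting n = D * (D - 1) / 2.

```python
import math


def matrix_dim_from_vector_len(n: int) -> int:
    """Solve n = D * (D - 1) / 2 for D, rejecting lengths with no integer D."""
    # Quadratic formula: D = (1 + sqrt(1 + 8n)) / 2.
    disc = 1 + 8 * n
    root = math.isqrt(disc)
    if root * root != disc:
        raise ValueError(f"{n} is not of the form D * (D - 1) / 2")
    return (1 + root) // 2
```

For instance, a vector of length 3 corresponds to a 3 x 3 correlation matrix and a vector of length 6 to a 4 x 4 one.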
Also in line with the ongoing effort to refactor distributions test, this moves the transforms test into its own file that uses pytest with parametrized fixtures.
For review:
fehiepsi - could you help review the math?
fritzo - do you have any suggestions for what to do about the event dimension (more details are in the comment below)?
ezyang - could you review the changes in `run_test.py`? Instead of a separate `PYTEST_TESTS`, I have grouped these tests into `USE_PYTEST_LIST` to avoid duplicate logic. The only difference is that we no longer check whether pytest is installed and exclude the listed tests when it is not. I figured that if existing tests are already using pytest, this should not matter.
TODOs (probably not all can be satisfied at the same time):
- [x] Use operations that are JIT friendly, i.e. the transform works with different sized input under JIT.
- [x] Resolve test failures - currently `arange(scalar_tensor)` fails on certain backends but this is needed for JIT. Maybe we should only support same sized tensor under JIT?
- [x] Add tests to check that the transform gives correct gradients and is in agreement with the `log_det_jacobian`.
- [x] Add `input_event_dim` and `output_event_dim` to `CorrCholeskyTransform`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48041
Reviewed By: zhangguanheng66
Differential Revision: D25262505
Pulled By: neerajprad
fbshipit-source-id: 5a57e1c19d8230b53592437590b9169bdf2f71e9