Summary:
fixes https://github.com/pytorch/pytorch/issues/58632.
Added several skips related to test asserts and MKL. Will address them in a separate PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58666
Reviewed By: seemethere, janeyx99
Differential Revision: D28607966
Pulled By: walterddr
fbshipit-source-id: 066d4afce2672e4026334528233e69f68da04965
Summary:
Some machines don't have a versionless `python` on their PATH, which breaks these existing shebangs.
I'm assuming that all the existing versionless `python` shebangs are meant to be `python3` and not `python2`; please let me know if my assumption was incorrect for any of these.
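For illustration, the fix is a one-line change at the top of each affected script; a minimal before/after sketch (hypothetical file):
```
#!/usr/bin/env python3
# was: #!/usr/bin/env python
# On machines with no versionless `python` on PATH, the old shebang fails
# with "env: python: No such file or directory" when the script is run
# directly.
```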
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58275
Test Plan: CI.
Reviewed By: zhouzhuojie
Differential Revision: D28428143
Pulled By: samestep
fbshipit-source-id: 6562be3d12924db72a92a0207b060ef740f61ebf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56868
See __init__.py for a summary of the tool.
The following sections are present in this initial version:
- Model Size. Show the total model size, as well as a breakdown by
stored files, compressed files, and zip overhead. (I expect this
breakdown to be a bit more useful once data.pkl is compressed.)
- Model Structure. This is basically the output of
`show_pickle(data.pkl)`, but as a hierarchical structure.
Some structures cause this view to crash right now, but it can be
improved incrementally.
- Zip Contents. This is basically the output of `zipinfo -l`.
- Code. This is the TorchScript code. It's integrated with a blame
window at the bottom, so you can click "Blame Code", then click a bit
of code to see where it came from (based on the debug_pkl). This
currently doesn't render properly if debug_pkl is missing or
incomplete.
- Extra files (JSON). JSON dumps of each json file under /extra/, up to
a size limit.
- Extra Pickles. For each .pkl file in the model, we safely unpickle it
with `show_pickle`, then render it with `pprint` and include it here
if the size is not too large. We aren't able to install the pprint
hack that the show_pickle CLI uses, so we get one-line rendering for
custom objects, which is not very useful. Built-in types look fine,
though. In particular, bytecode.pkl seems to look fine (and we
hard-code that file to ignore the size limit).
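As a rough sketch of how the Model Size breakdown above can be computed (illustrative only, not the tool's actual code), the stored/compressed split falls out of the zip metadata:
```
import os
import zipfile

def size_breakdown(model_path):
    # TorchScript models are zip archives; per-entry metadata gives both
    # the uncompressed (file_size) and compressed (compress_size) sizes.
    total = os.path.getsize(model_path)
    with zipfile.ZipFile(model_path) as zf:
        stored = sum(i.file_size for i in zf.infolist())
        compressed = sum(i.compress_size for i in zf.infolist())
    # Whatever isn't entry data is zip overhead: headers, central directory.
    return {"total": total, "stored": stored,
            "compressed": compressed, "zip_overhead": total - compressed}
```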
I'm checking in the JS dependencies to avoid a network dependency at
runtime. They were retrieved from the following URLs, then passed
through a JS minifier:
https://unpkg.com/htm@3.0.4/dist/htm.module.js?module
https://unpkg.com/preact@10.5.13/dist/preact.module.js?module
Test Plan:
Manually ran on a few models I had lying around.
Mostly tested in Chrome, but I also poked around in Firefox.
Reviewed By: dhruvbird
Differential Revision: D28020849
Pulled By: dreiss
fbshipit-source-id: 421c30ed7ca55244e9fda1a03b8aab830466536d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56386
The diff resolves a bug around incorrect handler resolution:
_create_static_handler pointed towards etcd, and _create_etcd_handler pointed towards static.
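A minimal sketch of the bug pattern (placeholder handler classes; only the creator function names come from this diff):
```
class StaticHandler: pass
class EtcdHandler: pass

# Before: the creators were cross-wired.
def _create_static_handler(params):
    return EtcdHandler()    # bug: "static" creator built the etcd handler

def _create_etcd_handler(params):
    return StaticHandler()  # bug: "etcd" creator built the static handler

# The fix swaps the bodies so each creator returns what its name promises.
```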
Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed:test_launcher
Added test_launcher to the ci/cd tests
Reviewed By: cbalioglu
Differential Revision: D27858897
fbshipit-source-id: 440155789958c091ce5755e7c9524e4bb704203a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56039
Python will try to eagerly resolve the name references even if
the import failed. Quote them so that it doesn't.
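A minimal illustration of the pattern (module and attribute names are hypothetical):
```
try:
    import optional_dep
except ImportError:
    optional_dep = None

# Unquoted, the annotation is evaluated eagerly at definition time and
# blows up when the import failed:
#     def f(x: optional_dep.Thing) -> None: ...
# Quoted, it is stored as a string and only resolved on demand (e.g., by
# typing.get_type_hints), so the module still imports cleanly:
def f(x: "optional_dep.Thing") -> None:
    pass
```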
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: janeyx99
Differential Revision: D27770536
Pulled By: ezyang
fbshipit-source-id: b111739289498f9bab856fb9424f3080efee4ee0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55695
In order to be able to run CUDA tests on their own (e.g., to avoid running CPU tests on GPU machines).
Done by moving test methods to a separate class (and sometimes introducing a "common" base class for utils), and then providing new entry points inside a `cuda/` subdirectory.
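The resulting layout follows roughly this pattern (class and file names here are illustrative):
```
import unittest

class CommonBase:
    # utils shared by the CPU and CUDA variants
    def _check_roundtrip(self, device):
        pass

class CpuTests(CommonBase, unittest.TestCase):
    def test_roundtrip(self):
        self._check_roundtrip("cpu")

# cuda/test_foo.py defines a separate entry point, so GPU machines can run
# just this class without touching the CPU tests:
class CudaTests(CommonBase, unittest.TestCase):
    def test_roundtrip(self):
        self._check_roundtrip("cuda")

if __name__ == "__main__":
    unittest.main()
```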
Test Plan: Checked they are run on Sandcastle.
Reviewed By: mrshenli
Differential Revision: D27618198
fbshipit-source-id: 8f671657f79c8ae115748ab7752fe0066705893b
Summary:
1. moved module-related stuff to test_module_container
2. created test_types for types and annotations
3. created test_misc for the rest
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55560
Reviewed By: VitalyFedyunin
Differential Revision: D27650911
Pulled By: walterddr
fbshipit-source-id: d895a7da9e9c3d25a662a37faf4daabc276b9c1a
Summary:
Prettifies JSON files .pytorch-test-times and .pytorch-slow-tests so that not everything is on one single line.
This is of slightly more importance as the generated .pytorch-slow-tests ends up getting stored in our test-infra repo ([example](ad9cd87565)), and it is nice not to have that little red missing-trailing-newline symbol at the end.
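The change boils down to serializing with an indent and a trailing newline; a sketch:
```
import json

stats = {"commit": "abc123", "job_times": {"test_nn": 123.4}}
with open(".pytorch-test-times", "w") as f:
    json.dump(stats, f, indent=2)
    f.write("\n")  # the trailing newline is what removes the red marker
```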
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55335
Reviewed By: samestep
Differential Revision: D27576930
Pulled By: janeyx99
fbshipit-source-id: be58565b8c8593a9bfcfab383ee19facc79f0572
Summary:
Moves more S3 parsing code to s3_stat_parser.py. This is another step toward modularizing the parsing code more cleanly. I will also be using this exact function in future slowTest code.
Also replaces some `Any`s in the code with `Report`.
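For instance, signatures of this shape get tightened (a sketch; the helper name is hypothetical and `Report` is simplified here):
```
from typing import Any, Dict, List

Report = Dict[str, Any]  # stand-in for the real type in s3_stat_parser.py

# Before: def get_previous_reports(job: str) -> List[Any]: ...
def get_previous_reports(job: str) -> List[Report]:
    ...
```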
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54808
Test Plan:
The .pytorch-test-times file generated before and after this change is the same.
CI should pass, specifically the test tools GHA.
Reviewed By: walterddr
Differential Revision: D27375783
Pulled By: janeyx99
fbshipit-source-id: bec28551668b2eb3fdd60d802200993e493eac83
Summary:
First step to move all S3-related operations into the S3 parser utils.
In the end, we provide APIs from s3_stat_parser:
1. downloading data as reports and uploading data as reports
2. filtering by job name
and handle all compression and formatting inside.
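A rough sketch of that surface, under assumed names (the real functions live in s3_stat_parser; nothing here is verbatim):
```
from typing import Any, Dict, List

Report = Dict[str, Any]  # simplified stand-in for the real Report type

def download_reports(job: str, num_commits: int) -> List[Report]:
    # fetches the S3 objects for the last `num_commits` commits, handling
    # bz2 decompression and JSON parsing internally
    ...

def upload_reports(job: str, reports: List[Report]) -> None:
    # the symmetric path: formats and compresses before writing to S3
    ...

def filter_reports_by_job(reports: List[Report], job: str) -> List[Report]:
    return [r for r in reports if r.get("job_name") == job]
```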
TODO
- [ ] Refactor out upload into s3_stat_parser
- [ ] Remove all S3/boto-related checkers and try/except blocks outside of s3_stat_parser
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54681
Test Plan:
1. Running tools/test/* covers the refactoring logic (test_test_history.py and test_stats.py serve as entry points, and both use the two new APIs in s3_stat_parser after the refactoring).
2. print_test_stats.py's main argparse entrypoint is covered by the CI "Report Test Result" step.
3. Running `python test/run_test.py --export-past-test-times` before and after this PR should result in the same file content in .pytorch-test-times.
Reviewed By: ailzhang
Differential Revision: D27346742
Pulled By: walterddr
fbshipit-source-id: fb40162e631e007fed9d5821fe4f190bda2cb52e
Summary:
Since `_test1`, `_test2`, `_build`, and `test` are all stripped, `slow_test` should be stripped as well. This way, the _slow_test stats will be considered part of all stats relating to a particular build job, though currently it doesn't do much because the jobs don't share a common stemmed name: the build has `_gcc7` while the slow_test CI job does not.
This makes me think...do we omit the `gcc7` intentionally? Are there other things I should strip, e.g., `multigpu_test`?
See:
ci/circleci: pytorch_linux_xenial_cuda10_2_cudnn7_py3_slow_test
ci/circleci: pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test1
ci/circleci: pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test2
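A sketch of the stemming in question (the suffix set is assumed from this summary, not copied from the code):
```
import re

def stem_job_name(job: str) -> str:
    # strip shard/build/slow-test suffixes so stats group under one base job
    return re.sub(r"(_slow_test|_test[12]?|_build)$", "", job)

# The slow job stems to ..._py3 while the sharded jobs stem to ..._py3_gcc7;
# they still differ by the `_gcc7` piece, as noted above.
print(stem_job_name("pytorch_linux_xenial_cuda10_2_cudnn7_py3_slow_test"))
print(stem_job_name("pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test1"))
```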
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54528
Reviewed By: samestep
Differential Revision: D27270393
Pulled By: janeyx99
fbshipit-source-id: ffb7289cfe4dba52ded67f50a89f3e75e7bad68d
Summary:
Step 2 toward fixing https://github.com/pytorch/pytorch/issues/53882 :)
This changes TARGET_DET_LIST and sharding automation by checking if there's already cached data from the commit in `.pytorch-test-times`. If not, it pulls data from S3 and updates the file to have the stats. This way, S3 pulling does not need to happen more than once for the same commit.
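The caching logic is roughly the following (a sketch; the helper names are assumptions):
```
import json
import os

def pull_job_times_from_s3(commit):
    # stand-in for the real S3 download (hypothetical helper)
    return {}

def obtain_job_times(commit, cache_file=".pytorch-test-times"):
    # reuse the cached stats if they were generated from the same commit
    if os.path.exists(cache_file):
        with open(cache_file) as f:
            cached = json.load(f)
        if cached.get("commit") == commit:
            return cached["job_times"]
    job_times = pull_job_times_from_s3(commit)
    with open(cache_file, "w") as f:
        json.dump({"commit": commit, "job_times": job_times}, f)
    return job_times
```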
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54210
Test Plan:
The following methods should run the same set of tests.
First `export CIRCLE_JOB=pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test2` or your favorite CIRCLE JOB.
1. Pull data first and use it:
Download the data from S3 and write it to the cache file with `python test/run_test.py --export-historic-test-times .pytorch-test-times`
Now run `python test/run_test.py --shard 1 10`
2. Make the sharding job pull data:
Delete the file you just created: `rm .pytorch-test-times`
Now run `python test/run_test.py --shard 1 10`
Reviewed By: walterddr
Differential Revision: D27136849
Pulled By: janeyx99
fbshipit-source-id: 51a42c4e2fa3f8cf15e682679dd3eb6130aad927
Summary:
This is an initial attempt at refactoring and consolidating our S3 read logic for print_test_stats.py, test_history.py, and run_test.py. This way, boto3 and botocore do not need to be imported in various places throughout the code base, and duplicated logic (such as the many type definitions) can exist in one place: `tools/stat_utils/s3_stat_parser.py`. walterddr contributed to this PR by moving print_test_stats.py to the tools folder and the corresponding tests to a subfolder within tools.
**NOTE: this removes those tests from CI, as the new `tools/test/test_stats.py` is not in the test/ directory like the other tests listed in TESTS in run_test.py.**
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53755
Test Plan:
This refactoring change should not break anything, so running the files as before should work as they did previously.
To make sure that print_test_stats.py still functions: run `python tools/test/test_stats.py` and make sure all tests pass.
To make sure that test_history.py works, run the example commands from `tools/test_history.py --help` and check that their output matches that shown. Note that the script will continue printing for a while, so don't be alarmed.
Some next steps:
- Actually coming up with similarities among the three current use cases and further refactoring/consolidating of functions (e.g., combining simplify and get_cases)
- Moving more parsing logic to s3_stat_parser.py to have better abstraction between our files
- Adding tests for s3_stat_parser.py when there is more functionality in it
Reviewed By: agolynski, samestep
Differential Revision: D27030285
Pulled By: janeyx99
fbshipit-source-id: e664781324ef7c0c30943bfd7f17c895075ef7a7
Summary:
This will allow future work to use the test times file (which will save computation time and also allow for more consistency). (Step one toward fixing https://github.com/pytorch/pytorch/issues/53882)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54083
Test Plan:
export CIRCLE_JOB=your-favorite-circleci-job e.g., pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test2
`python test/run_test.py --export-historic-test-times` OR
`python test/run_test.py --export-historic-test-times .your-favorite-file`
When opening either .pytorch-test-times or .your-favorite-file, you should see something like:
```
{"commit": "2d559a09392aabb84dfb4a498010b2f01d99818c", "job_times": {"distributed/test_distributed_spawn": 583.5889999999973, "distributed/test_data_parallel": 4.866999999999997, "test_binary_ufuncs": 171.1569999999998, "test_numpy_interop": 2.5649999999999995, "test_public_bindings": 0.011,...}}
```
Note that no tests will be run when this option is specified.
Reviewed By: walterddr
Differential Revision: D27091351
Pulled By: janeyx99
fbshipit-source-id: e191d739268d86de0a0ba0eea0006969859d1940
Summary:
This PR:
1. moves sharding algorithm from run_test.py to framework_utils.py (let me know if you have a better place for it)
2. adds tests for the algorithm in test_testing.py
3. fixes the algorithm so that it doesn't tack all the unknown jobs onto the shard with the minimum time, but instead distributes them around the shards.
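A minimal sketch of the fixed behavior (an assumed shape, not the exact code that moved into framework_utils.py):
```
import heapq

def calculate_shards(num_shards, tests, job_times):
    # Greedy bin packing: longest known tests first, each onto the
    # currently lightest shard.
    known = sorted((t for t in tests if t in job_times),
                   key=lambda t: job_times[t], reverse=True)
    unknown = [t for t in tests if t not in job_times]
    heap = [(0.0, i) for i in range(num_shards)]
    heapq.heapify(heap)
    shards = [[] for _ in range(num_shards)]
    for t in known:
        total, i = heapq.heappop(heap)
        shards[i].append(t)
        heapq.heappush(heap, (total + job_times[t], i))
    # The fix: spread tests with no timing data round-robin across shards
    # instead of dumping them all onto the single lightest shard.
    for j, t in enumerate(unknown):
        shards[j % num_shards].append(t)
    return shards
```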
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53942
Test Plan: python test/test_testing.py -k TestFrameworkUtils
Reviewed By: samestep
Differential Revision: D27047223
Pulled By: janeyx99
fbshipit-source-id: 824b20009c0bb707aa5361de445cdec795d5e3f1
Summary:
The first argument is either a file name or a test module name, but the keys of `CUSTOM_HANDLERS` are test module names.
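In other words, the `.py` extension has to be stripped before the lookup; a sketch:
```
CUSTOM_HANDLERS = {"distributed/test_distributed_spawn": lambda: None}

def get_handler(arg):
    # normalize "distributed/test_distributed_spawn.py" to the module name
    module = arg[:-3] if arg.endswith(".py") else arg
    return CUSTOM_HANDLERS.get(module)
```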
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53884
Test Plan: Run `python3 run_test.py -i distributed/test_distributed_spawn.py`
Reviewed By: janeyx99
Differential Revision: D27006164
Pulled By: malfet
fbshipit-source-id: f30b42856cd2754e5981c1c69618f84e392c986a
Summary:
This PR:
1. refactors the logic for S3 stats gathering.
2. renames SLOW_TESTS to TARGET_DET_LIST to disambiguate and remove confusion with slowTest.
3. detects slow tests (tests with time > 5 min) to add to the TARGET_DET_LIST based on results in S3 from the previous nightly.
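Conceptually, the detection step looks like this (a sketch with assumed names; the threshold comes from the summary):
```
SLOW_TEST_THRESHOLD = 5 * 60  # seconds

def extend_target_det_list(base_list, nightly_job_times):
    # nightly_job_times: {test_module: seconds} from the previous
    # nightly's S3 stats
    slow = [t for t, secs in nightly_job_times.items()
            if secs > SLOW_TEST_THRESHOLD]
    return sorted(set(base_list) | set(slow))
```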
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53549
Test Plan:
Set CIRCLE_JOB to your favorite CI job (like `pytorch_linux_bionic_py3_8_gcc9_coverage_test1`).
Run `python test/run_test.py --determine-from=<your fave pytorch files>`
e.g., `python test/run_test.py --determine-from=test/run_test.py`
Reviewed By: mrshenli
Differential Revision: D26904478
Pulled By: janeyx99
fbshipit-source-id: 9576b34f4fee09291d60e36ff2631753a3925094
Summary:
Context: https://github.com/pytorch/pytorch/pull/53299#discussion_r587882857
These are the only hand-written parts of this diff:
- the addition to `.github/workflows/lint.yml`
- the file endings changed in these four files (to appease FB-internal land-blocking lints):
- `GLOSSARY.md`
- `aten/src/ATen/core/op_registration/README.md`
- `scripts/README.md`
- `torch/csrc/jit/codegen/fuser/README.md`
The rest was generated by running this command (on macOS):
```
git grep -I -l ' $' -- . ':(exclude)**/contrib/**' ':(exclude)third_party' | xargs gsed -i 's/ *$//'
```
I looked over the auto-generated changes and didn't see anything that looked problematic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53406
Test Plan:
This run (after adding the lint but before removing existing trailing spaces) failed:
- https://github.com/pytorch/pytorch/runs/2043032377
This run (on the tip of this PR) succeeded:
- https://github.com/pytorch/pytorch/runs/2043296348
Reviewed By: walterddr, seemethere
Differential Revision: D26856620
Pulled By: samestep
fbshipit-source-id: 3f0de7f7c2e4b0f1c089eac9b5085a58dd7e0d97
Summary:
Uses nightly commit stats to automatically shard tests based on execution time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53269
Test Plan:
set CIRCLE_JOB to an existing job, like `pytorch_linux_bionic_py3_6_clang9_test`
Then you can run something like: `python test/run_test.py --shard 1 10`
Reviewed By: malfet
Differential Revision: D26819440
Pulled By: janeyx99
fbshipit-source-id: 6bc73d6aa3d52d9850817536be15d7b54a72780e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52910
**Summary**
PR #52158 tried to move all JIT bindings from `torch._C` to a new
submodule `torch._C._jit`, but that...did not go well. This pull request
adds the new `torch._C._jit` submodule, but does not migrate the
existing bindings. Instead, it adds a unit test that fails if any new
bindings are added to `torch._C`. A comment in the test instructs
developers to add their new binding to the allowlist if it really should
be in `torch._C`, or to add it to the appropriate submodule (e.g.,
`torch._C._jit`). The idea is to prevent the issue
described in #51691 from getting *worse* if it cannot be fixed.
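Conceptually the test is a set difference against a frozen allowlist (a sketch, not the actual test code):
```
import torch

def test_no_new_bindings(allowlist):
    # anything in torch._C not on the allowlist fails the test, nudging
    # new bindings into submodules such as torch._C._jit instead
    current = set(dir(torch._C))
    unexpected = current - set(allowlist)
    assert not unexpected, (
        f"New bindings in torch._C: {sorted(unexpected)}. Add them to a "
        "submodule, or extend the allowlist if they truly belong here.")
```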
**Test Plan**
Continuous integration.
**Fixes**
This commit fixes #51691.
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D26698373
Pulled By: SplitInfinity
fbshipit-source-id: ec9f5426051227a513d4fd09512b624420e0100b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52323
Using the default CPU allocator for ops executed on the qnnpack backend will result in
ASAN failures with heap overflow, since qnnpack (and xnnpack) can access input
beyond its end and/or beginning.
Here we are enabling this feature specifically to enable the dynamic sparse linear op test
using the qnnpack engine. In the dynamic linear op, the fp32 bias is not packed and
hence can result in out-of-bounds access.
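Test-side usage looks roughly like this; the binding names are assumptions based on the test file name, not verified from this diff:
```
import torch

# Assumed bindings: the mobile allocator over-allocates so qnnpack's
# out-of-bound reads on the unpacked fp32 bias stay inside the allocation.
torch._C._set_default_mobile_cpu_allocator()
try:
    ...  # run the dynamic sparse linear op under the qnnpack engine
finally:
    torch._C._unset_default_mobile_cpu_allocator()
```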
Test Plan: test_set_default_mobile_cpu_allocator.py
Reviewed By: z-a-f
Differential Revision: D26263481
fbshipit-source-id: a49227cac7e6781b0db4a156ca734d7671972d9f
Summary:
Implement the first stage of ZeRO, sharding of the optimizer state, as described in [this blog post](https://www.microsoft.com/en-us/research/blog/zero-2-deepspeed-shattering-barriers-of-deep-learning-speed-scale/) and [this paper](https://arxiv.org/abs/1910.02054). This implementation is completely independent from the [DeepSpeed](https://github.com/microsoft/DeepSpeed) framework, and aims at providing ZeRO-compliant building blocks within the PyTorch scheme of things.
This works by:
- acting as a wrapper around a PyTorch optimizer. ZeROptimizer does not optimize anything by itself; it only shards optimizers for distributed jobs
- each rank distributes parameters according to a given partitioning scheme (could be updated), and owns the update of a given shard only
- `.step()` is called on each rank as expected; the fact that the optimizer actually works on a shard of the model is not visible from the outside
- when the update is completed, each rank broadcasts the updated model shard to all the other ranks
This can be used with DDP, although some communications are wasted in that case (gradients are all-reduced to all ranks). This implementation was initially developed in [Fairscale](https://github.com/facebookresearch/fairscale), and can also be used with an optimized DDP which only reduces to the relevant ranks. More context on ZeRO and PyTorch can be found in [this RFC](https://github.com/pytorch/pytorch/issues/42849)
The API with respect to loading and saving the state is a known pain point and should probably be discussed and updated. Other possible follow-ups include integrating more closely with a [modularized DDP](https://github.com/pytorch/pytorch/issues/37002), [making the checkpoints partition-agnostic](https://github.com/facebookresearch/fairscale/issues/164), [exposing a gradient clipping option](https://github.com/facebookresearch/fairscale/issues/98), and making sure that mixed precision states are properly handled.
Original authors include msbaines, min-xu-ai, and myself.
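A minimal usage sketch (the class landed as ZeroRedundancyOptimizer in torch.distributed.optim; the keyword names follow the current docs and may differ slightly from this initial version):
```
import torch
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.nn.parallel import DistributedDataParallel as DDP

# assumes torch.distributed is already initialized, one process per GPU
model = DDP(torch.nn.Linear(1000, 1000).cuda())
opt = ZeroRedundancyOptimizer(model.parameters(),
                              optimizer_class=torch.optim.SGD, lr=0.01)
loss = model(torch.randn(20, 1000).cuda()).sum()
loss.backward()
opt.step()  # each rank updates only its own shard, then broadcasts it
```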
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46750
Reviewed By: mruberry
Differential Revision: D25958918
Pulled By: blefaudeux
fbshipit-source-id: 14280f2fd90cf251eee8ef9ac0f1fa6025ae9c50
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49698
Reincarnation of #47620 by jamesr66a.
It's just an initial bunch of things that we're exposing to Python; more
is expected to come in the future. Some things can probably be done better,
but I'm putting this out anyway, since some other people were interested
in using and/or developing this.
Differential Revision: D25668694
Test Plan: Imported from OSS
Reviewed By: bertmaher
Pulled By: ZolotukhinM
fbshipit-source-id: fb0fd1b31e851ef9ab724686b9ac2d172fa4905a