Commit Graph

81 Commits

Author SHA1 Message Date
Catherine Lee
9fd3eba6ce [experiment] More procs in CI (#98098)
Experiment with more procs, but only on master so PRs don't get affected.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98098
Approved by: https://github.com/huydhn
2023-04-07 17:21:32 +00:00
Catherine Lee
0d73cfb3e9 Retry at test file level (#97506)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97506
Approved by: https://github.com/huydhn
2023-03-31 18:36:53 +00:00
PyTorch MergeBot
675dfd2c1f Revert "Retry at test file level (#97506)"
This reverts commit 7d5d5beba2.

Reverted https://github.com/pytorch/pytorch/pull/97506 on behalf of https://github.com/clee2000 due to test_jit_cuda_fuser having a rough time
2023-03-31 06:22:14 +00:00
Catherine Lee
7d5d5beba2 Retry at test file level (#97506)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97506
Approved by: https://github.com/huydhn
2023-03-30 17:12:19 +00:00
Aaron Gokaslan
9171f7d4cd [BE] Modernize PyTorch even more for 3.8 with pyupgrade (#94520)
Applies some more pyupgrade fixits to PyTorch
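
For context, a hedged sketch of the kind of rewrites pyupgrade applies; the before/after snippet below is illustrative and not taken from this PR.

```python
# Illustrative only: typical pyupgrade fixits once the minimum Python version moves up.

# Before: legacy patterns
class Widget(object):                         # explicit `object` base class
    def __init__(self, name):
        self.name = "{}-widget".format(name)  # str.format call
        super(Widget, self).__init__()        # two-argument super()

# After: what pyupgrade rewrites them to
class Widget:                                 # implicit object base
    def __init__(self, name):
        self.name = f"{name}-widget"          # f-string (--py36-plus)
        super().__init__()                    # zero-argument super()
```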

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94520
Approved by: https://github.com/ezyang
2023-02-10 18:02:50 +00:00
Jeff Daily
f44946289b [CI][ROCm] fix device visibility, again (#91813)
The previous PR #91137 was incomplete. Though it successfully queried the number of available GPUs, it still resulted in multiple test files sharing the same GPU. This PR lifts the maxtasksperchild=1 restriction so that each Pool worker keeps using the same GPU for every test file it runs. This also adds a Note in run_test.py for future reference.
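
A minimal sketch of the idea, with a hypothetical worker initializer (this is not the actual run_test.py code): each pool worker pins itself to one GPU at startup, and because workers are long-lived once maxtasksperchild=1 is removed, every test file handled by that worker runs on its assigned GPU.

```python
# Hypothetical sketch of per-worker GPU affinity; names and structure are illustrative.
import os
import multiprocessing as mp

NUM_GPUS = 2  # in CI this would be discovered, e.g. via torch.cuda.device_count()

def init_worker(counter):
    # Each worker claims one GPU exactly once, at startup.
    with counter.get_lock():
        gpu_id = counter.value % NUM_GPUS
        counter.value += 1
    os.environ["HIP_VISIBLE_DEVICES"] = str(gpu_id)

def run_test_file(test_file):
    # Every file handled by this worker sees only its assigned GPU.
    return test_file, os.environ["HIP_VISIBLE_DEVICES"]

if __name__ == "__main__":
    counter = mp.Value("i", 0)
    # No maxtasksperchild=1: workers live for the whole run, so the assignment sticks.
    with mp.Pool(NUM_GPUS, initializer=init_worker, initargs=(counter,)) as pool:
        files = ["test_a.py", "test_b.py", "test_c.py", "test_d.py"]
        for name, gpu in pool.map(run_test_file, files):
            print(f"{name} ran with HIP_VISIBLE_DEVICES={gpu}")
```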

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91813
Approved by: https://github.com/kit1980, https://github.com/huydhn, https://github.com/malfet
2023-01-06 22:19:07 +00:00
Jeff Daily
c18e8c68d8 [ROCm] fix parallel test runners and device visibility (#91137)
Fixes #90940. This PR revamps how tests are run in parallel, as well as device visibility both at the docker container level and within the run_test.py test runner.

First, running multiple test modules concurrently on the same GPU was causing instability for ROCm runners, manifesting as timeouts. ROCm runners have at least 1 GPU each, but often 2 or more. This PR allows NUM_PROCS to be set equal to the number of devices available, while also taking care to set HIP_VISIBLE_DEVICES to avoid oversubscribing any GPU.

Second, we had introduced the `ROCR_VISIBLE_DEVICES` env var (passed via `-e`, #91031) to prepare for two GHA runners per CI node, splitting up GPU visibility at the docker level between the two runners. This effort wasn't fully realized; to date, we haven't had more than one runner per CI host. We abandon it in favor of making all GPUs visible to a single runner and managing GPU resources as stated above.
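
A hedged sketch of the NUM_PROCS idea described above (hypothetical helper, not the actual tools/ code): derive the parallel process count from GPU visibility, honoring an externally set HIP_VISIBLE_DEVICES.

```python
import os

def num_procs_for_rocm() -> int:
    # Hypothetical helper; illustrates deriving NUM_PROCS from GPU visibility.
    visible = os.environ.get("HIP_VISIBLE_DEVICES")
    if visible is not None:
        # e.g. "0,1" -> 2 usable devices for this runner
        return max(len([d for d in visible.split(",") if d.strip()]), 1)
    try:
        import torch
        return max(torch.cuda.device_count(), 1)  # ROCm builds report HIP devices here
    except ImportError:
        return 1

NUM_PROCS = num_procs_for_rocm()
```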

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91137
Approved by: https://github.com/kit1980, https://github.com/huydhn, https://github.com/pruthvistony
2023-01-04 19:40:05 +00:00
Catherine Lee
d632d94cc7 Disable mem leak check (#88373)
tbh at this point it might be easier to make a new workflow and copy the relevant jobs...

Changes:
* Disable cuda mem leak check except for on scheduled workflows
* Make pull and trunk also run on a schedule, which will run the memory leak check
* Periodic will always run the memory leak check, so periodic no longer has parallelization
* Concurrency check changed to be slightly more generous
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88373
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn
2022-11-04 20:47:42 +00:00
Tom Stein
fd60b818b9 [Python] refactor slices on sorted (#86995)
Sometimes you want the smallest element of a collection and reach for `sorted(elements)[0]` without a second thought. However, this is not optimal, since the entire list must be sorted first, which is O(n log n). It is better to use the builtin `min(elements)`, which is O(n).
Furthermore, `sorted(elements)[::-1]` is not very efficient; `sorted(elements, reverse=True)` gives the same result and saves the extra slice operation.

**TLDR: using `sorted(elements)[0]` is slow and can be replaced with `min(elements)`.**
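
A quick, self-contained illustration of both fixits (my snippet, not code from this PR):

```python
import random
import timeit

elements = [random.random() for _ in range(100_000)]

# O(n log n): sorts the whole list just to read one element
assert sorted(elements)[0] == min(elements)  # min() is the O(n) replacement

# The reversed slice does extra work...
assert sorted(elements)[::-1] == sorted(elements, reverse=True)  # ...reverse=True avoids it

print("sorted(...)[0]:", timeit.timeit(lambda: sorted(elements)[0], number=20))
print("min(...):      ", timeit.timeit(lambda: min(elements), number=20))
```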

I stumbled across these code snippets while playing around with CodeQL (see https://lgtm.com/query/4148064474379348546/).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86995
Approved by: https://github.com/jansel
2022-10-25 04:07:19 +00:00
Catherine Lee
7941b042a7 parallelize at file granularity (#85770)
part two of https://github.com/pytorch/pytorch/pull/84961

Runs test files in parallel, at the test file granularity.

* 2 procs at a time
* number of tests run changed by <200, possibly due to more tests being added on master between the base commit and the head commit of the PR
* may cause flakiness, but I haven't seen it in my small sample size of this PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85770
Approved by: https://github.com/huydhn
2022-10-03 16:59:39 +00:00
Catherine Lee
49e10c1598 [ci] test_ops in parallel, ci tests log to file (#85528)
part one of splitting up https://github.com/pytorch/pytorch/pull/84961 into (probably 2) parts

contains
* logging to file
* testing test_ops in parallel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85528
Approved by: https://github.com/huydhn
2022-09-23 20:45:20 +00:00
PyTorch MergeBot
3dce26635f Revert "test in parallel at file granularity (#84961)"
This reverts commit 8107666c6a.

Reverted https://github.com/pytorch/pytorch/pull/84961 on behalf of https://github.com/clee2000 due to makes test_forward_ad_nn_functional_max_unpool2d_cuda_float32 flakily unexpectedly pass
2022-09-21 20:21:25 +00:00
Catherine Lee
8107666c6a test in parallel at file granularity (#84961)
run tests in parallel at the test file granularity

Runs 3 files in parallel using a multiprocessing pool; each file's output goes to a log file, which is then printed when the test file finishes. Some tests cannot be run in parallel (usually due to memory constraints), so we run those afterwards. Sharding is changed to attempt to overlap large files with other large files by running them on the same shard.
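
A minimal sketch of this scheme, with hypothetical file names and helpers (not the actual run_test.py implementation): a pool of 3 workers runs test files, each file's output is captured in its own log and printed only once the file finishes, and non-parallel-safe files run serially afterwards.

```python
import subprocess
import sys
import tempfile
from multiprocessing import Pool

PARALLEL_FILES = ["test_nn.py", "test_autograd.py", "test_torch.py", "test_jit.py"]
SERIAL_FILES = ["test_cpp_extensions_jit.py"]  # e.g. too memory-hungry to overlap

def run_test_file(test_file):
    # Capture output in a per-file log so parallel runs don't interleave on stdout.
    with tempfile.NamedTemporaryFile("w+", suffix=".log") as log:
        ret = subprocess.call([sys.executable, test_file], stdout=log, stderr=subprocess.STDOUT)
        log.seek(0)
        return test_file, ret, log.read()

if __name__ == "__main__":
    with Pool(3) as pool:  # 3 test files at a time
        for name, ret, output in pool.imap_unordered(run_test_file, PARALLEL_FILES):
            print(f"===== {name} (exit code {ret}) =====\n{output}")
    for serial_file in SERIAL_FILES:  # files that can't share the machine run afterwards
        name, ret, output = run_test_file(serial_file)
        print(f"===== {name} (exit code {ret}) =====\n{output}")
```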

test_ops* gets a custom handler because it is simply too big (2 hrs on Windows) and linalg_cholesky fails (I would really like a solution to this if possible, but until then we use the custom handler).

Reduces CUDA test time by a lot and reduces total Windows test time by ~1 hr.

Ref. https://github.com/pytorch/pytorch/issues/82894
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84961
Approved by: https://github.com/huydhn
2022-09-21 16:58:11 +00:00
Sergii Dymchenko
591222f5d9 Fix use-dict-literal lint (#83718)
Fix use-dict-literal pylint suggestions by changing `dict()` to `{}`. This PR makes the change in every Python file except test/jit/test_list_dict.py, where I think the intent is to test the constructor.
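
For illustration (my example, not from the PR), the kind of rewrite this check suggests:

```python
# Flagged by pylint's use-dict-literal check:
config = dict()
# Preferred dict literal (the rewrite applied in this PR):
config = {}

# The same literal style can replace keyword-argument construction too:
opts = dict(verbose=True)
opts = {"verbose": True}
```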
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83718
Approved by: https://github.com/albanD
2022-08-24 00:26:46 +00:00
Huy Do
347b036350 Apply ufmt linter to all py files under tools (#81285)
With ufmt in place https://github.com/pytorch/pytorch/pull/81157, we can now use it to gradually format all files. I'm breaking this down into multiple smaller batches to avoid too many merge conflicts later on.

This batch (as copied from the current BLACK linter config):
* `tools/**/*.py`

Upcoming batches:
* `torchgen/**/*.py`
* `torch/package/**/*.py`
* `torch/onnx/**/*.py`
* `torch/_refs/**/*.py`
* `torch/_prims/**/*.py`
* `torch/_meta_registrations.py`
* `torch/_decomp/**/*.py`
* `test/onnx/**/*.py`

Once they are all formatted, the BLACK linter will be removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81285
Approved by: https://github.com/suo
2022-07-13 07:59:22 +00:00
Michael Suo
d321be61c0 [ci] remove dead code related to test selection (#81163)
Since we are using Rockset for all this now, remove the code that used
the S3 path.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81163
Approved by: https://github.com/janeyx99
2022-07-12 04:50:19 +00:00
Jane Xu
addeb1ed5e [GHA] Add warning when S3 stats for sharding aren't found (#80176)
Adds visibility into when this happens so I don't need to keep looking at the logs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80176
Approved by: https://github.com/suo
2022-06-24 04:14:10 +00:00
Michael Suo
5029a91f7b [ci] delete JOB_BASE_NAME (#80046)
`JOB_BASE_NAME` was a holdover from Jenkins compatibility. Eventually,
it morphed to always be set to the build environment + `-test` or
`-build`, and we used it to detect whether we were in a build or test.

That's sort of pointless, so removing and fixing up the few remaining
use cases.
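
A hedged illustration of the pattern being retired (hypothetical snippet, not the actual CI scripts): the phase was inferred by sniffing the `JOB_BASE_NAME` suffix.

```python
import os

# JOB_BASE_NAME ended up always being the build environment plus "-build" or "-test",
# so scripts detected the phase roughly like this (the need for which this PR removes):
job_base_name = os.environ.get("JOB_BASE_NAME", "")
in_test_phase = job_base_name.endswith("-test")
in_build_phase = job_base_name.endswith("-build")
```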
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80046
Approved by: https://github.com/malfet, https://github.com/janeyx99
2022-06-23 21:06:48 +00:00
PyTorch MergeBot
9f29d09b3b Revert "[ci] delete JOB_BASE_NAME"
This reverts commit a4080240e9.

Reverted https://github.com/pytorch/pytorch/pull/80046 on behalf of https://github.com/malfet due to Broke bazel job
2022-06-23 01:49:42 +00:00
Michael Suo
a4080240e9 [ci] delete JOB_BASE_NAME
`JOB_BASE_NAME` was a holdover from Jenkins compatibility. Eventually,
it morphed to always be set to the build environment + `-test` or
`-build`, and we used it to detect whether we were in a build or test.

That's sort of pointless, so removing and fixing up the few remaining
use cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80046

Approved by: https://github.com/malfet
2022-06-22 23:33:18 +00:00
Michael Suo
842da8a5de [ci] remove TD + test specification code from run_test.py
In the case of target determination, this is just removing comments that
refer to non-existent code.

In the case of the test specification code, this removes (what I believe
to be) an unused feature. If we're using this somehow, let me know and I
can revise the PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79372

Approved by: https://github.com/janeyx99
2022-06-13 16:09:53 +00:00
Michael Suo
943c09a53e [ci] clean up dead code related to PR test selection
This is never used and not tested, so removing it for clarity.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79363

Approved by: https://github.com/janeyx99
2022-06-13 16:09:51 +00:00
Jane Xu
0708630d9f Allow sharding for distributed tests
Addresses my mistake introduced in https://github.com/pytorch/pytorch/pull/76536#issuecomment-1112657429

Also allows for sharding 1 in run_test.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76570
Approved by: https://github.com/jeffdaily, https://github.com/seemethere
2022-04-29 03:55:07 +00:00
Edward Z. Yang
a11c1bbdd0 Run Black on all of tools/
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76089

Approved by: https://github.com/albanD
2022-04-20 17:29:41 +00:00
PyTorch MergeBot
f80d0f4e7c Revert "exclude slow tests from sharding calc for linux-bionic-py3.7-clang9-test"
This reverts commit 5364752b7d.

Reverted https://github.com/pytorch/pytorch/pull/75918 on behalf of https://github.com/clee2000
2022-04-19 20:42:53 +00:00
Catherine Lee
5364752b7d exclude slow tests from sharding calc for linux-bionic-py3.7-clang9-test
Fixes #ISSUE_NUMBER

Sharding for linux-bionic-py3.7-clang9 previously included slow test times in the calculation for how long a test takes, causing the sharding to be uneven:

| Duration      | Count | Name|
| ----------- | ----------- | ---|
| 11.2m      | 221       |linux-bionic-py3.7-clang9 / test (default, 1, 2, linux.2xlarge)|
| 1.1h   | 218        | linux-bionic-py3.7-clang9 / test (default, 2, 2, linux.2xlarge)|

Numbers taken from https://hud.pytorch.org/metrics from 04/10/2022 12:20 PM to 04/17/2022 12:20 PM.

The durations of these jobs on this PR are 39m and 38m.
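
For context, a hedged sketch of what the sharding calculation does (hypothetical helper, not the actual tools/ sharding code): greedily assign test files to the currently lightest shard based on per-file durations; the fix is to feed it durations that exclude slow tests for this config.

```python
def shard_tests(test_times, num_shards=2):
    # test_times: {test_file: duration_seconds}, with slow-test time already excluded
    shards = [{"total": 0.0, "tests": []} for _ in range(num_shards)]
    # Longest-processing-time-first greedy assignment keeps the shards balanced.
    for test, seconds in sorted(test_times.items(), key=lambda kv: kv[1], reverse=True):
        lightest = min(shards, key=lambda s: s["total"])
        lightest["tests"].append(test)
        lightest["total"] += seconds
    return shards

example = {"test_torch.py": 2400, "test_nn.py": 1800, "test_ops.py": 3000, "test_jit.py": 900}
for shard in shard_tests(example):
    print(round(shard["total"] / 60, 1), "min:", shard["tests"])
```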
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75918
Approved by: https://github.com/seemethere, https://github.com/janeyx99
2022-04-19 20:25:30 +00:00
Brian Muse
0effac2b6a Support running pipelines on main in .jenkins and tools
## Release Notes
This commit updates the tools and .jenkins folders to support running on both `master` and `main` branches in preparation for the transition.

## Topics
Code cleanup

Fixes #71806

Pull Request resolved: https://github.com/pytorch/pytorch/pull/73297
Approved by: https://github.com/malfet
2022-03-16 14:44:19 +00:00
Andrey Talman
d1c529bd0b replace platform specific CI environment variables with generic ones (#68133)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/59478

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68133

Reviewed By: saketh-are

Differential Revision: D32401080

Pulled By: atalman

fbshipit-source-id: 057a34a56f8a2d324f4d1ea07da3a09772177897
2021-11-15 07:02:44 -08:00
Rong Rong (AI Infra)
a5a10fe353 Move all downloading logic out of common_utils.py (#61479)
Summary:
and into the tools/ folder

Currently run_test.py invokes tools/test_selections.py to:
1. download and analyze which test files to run
2. download and parse S3 stats and pass the info to local files
3. common_utils.py then uses the downloaded S3 stats to determine which test cases to run

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61479

Reviewed By: janeyx99

Differential Revision: D29661986

Pulled By: walterddr

fbshipit-source-id: bebd8c474bcc2444e135bfd2fa4bdd1eefafe595
2021-07-12 11:23:22 -07:00
Rong Rong (AI Infra)
718db968b8 move CI related functions out of run_test.py (#61124)
Summary:
run_test.py currently does lots of downloading and test file/suite/case parsing, and it doesn't work well outside of the CI environment.

Restructured run_test.py, created tools/test/test_selections.py, and moved all test selection logic (reordering, categorizing slow tests, creating shards) there.

Follow-up PRs should:
- refactor the file read/write logic entangled inside test_selections.py into the stats/ folder
- restructure and add network-independent test logic to test_test_selections.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61124

Test Plan:
- tools/test
- CI

Related PR:
This follows the refactoring example in: https://github.com/pytorch/pytorch/issues/60373

Reviewed By: malfet

Differential Revision: D29558981

Pulled By: walterddr

fbshipit-source-id: 7f0fd9b4720a918d82918766c002295e8df04169
2021-07-06 09:06:42 -07:00
Rong Rong (AI Infra)
7e619b9588 First step to rearrange files in tools folder (#60473)
Summary:
Changes include:
- introduced `linter/`, `testing/`, `stats/` folders in `tools/`
- move appropriate scripts into these folders
- change grepped references in the pytorch/pytorch repo

Next step
- introduce a `build/` folder for build scripts

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60473

Test Plan:
- CI (this is important b/c pytorch/test-infra also relies on some script references)
- tools/tests/

Reviewed By: albanD

Differential Revision: D29352716

Pulled By: walterddr

fbshipit-source-id: bad40b5ce130b35dfd9e59b8af34f9025f3285fd
2021-06-24 10:13:58 -07:00