Commit Graph

178 Commits

Author SHA1 Message Date
imaginary-person
9e53c823b8 Add AVX512 support in ATen & remove AVX support (#61903)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61903

### Remaining Tasks

- [ ] Collate results of benchmarks on two Intel Xeon machines (with & without CUDA, to check if CPU throttling causes issues with GPUs) - make graphs, including Roofline model plots (Intel Advisor can't make them with libgomp, though, but with Intel OpenMP).

### Summary

1. This draft PR produces binaries with 3 types of ATen kernels - default, AVX2, AVX512. Using the environment variable `ATEN_AVX512_256=TRUE` also results in 3 types of kernels, but the compiler can use 32 ymm registers for AVX2, instead of the default 16 (see the sketch after this list). ATen kernels for `CPU_CAPABILITY_AVX` have been removed.

2. `nansum` is not using the AVX512 kernel right now, as it has poorer accuracy for Float16 than AVX2 or DEFAULT do, and their respective accuracies aren't very good either (#59415).
It was more convenient to disable AVX512 dispatch for all dtypes of `nansum` for now.

3. On Windows, ATen Quantized AVX512 kernels are not being used, as quantization tests are flaky. If `--continue-through-failure` is used, then `test_compare_model_outputs_functional_static` fails. But if this test is skipped, `test_compare_model_outputs_conv_static` fails. If both these tests are skipped, then a third one fails. These are hard to debug right now due to not having access to a Windows machine with AVX512 support, so it was more convenient to disable AVX512 dispatch of all ATen Quantized kernels on Windows for now.

4. One test is currently being skipped -
[`test_lstm` in `quantization.bc`](https://github.com/pytorch/pytorch/issues/59098) - it fails only on Cascade Lake machines, irrespective of the `ATEN_CPU_CAPABILITY` used, because FBGEMM uses `AVX512_VNNI` on machines that support it. On such machines, `reduce_range` should be set to `False`.
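
As a quick way to compare the kernel variants from point 1, ATen's runtime dispatch can be capped with the `ATEN_CPU_CAPABILITY` environment variable; a minimal sketch, assuming a build from this PR (the variable must be set before `torch` is imported):

```python
import os
# cap ATen's runtime dispatch level; with this PR the recognized values
# would be "default", "avx2" and "avx512"
os.environ["ATEN_CPU_CAPABILITY"] = "avx2"

import torch
x = torch.randn(1 << 20)
y = x + x  # elementwise kernels now dispatch to at most the AVX2 variant
```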

The list of the changes is at https://gist.github.com/imaginary-person/4b4fda660534f0493bf9573d511a878d.

Credits to ezyang for proposing `AVX512_256` - these use AVX2 intrinsics but benefit from 32 registers, instead of the 16 ymm registers that AVX2 uses.
Credits to limo1996 for the initial proposal, and for optimizing `hsub_pd` & `hadd_pd`, which didn't have direct AVX512 equivalents, and are being used in some kernels. He also refactored `vec/functional.h` to remove duplicated code.
Credits to quickwritereader for helping fix 4 failing complex multiplication & division tests.

### Testing
1. `vec_test_all_types` was modified to test basic AVX512 support, as tests already existed for AVX2.
Only one test had to be modified, as it was hardcoded for AVX2.
2. `pytorch_linux_bionic_py3_8_gcc9_coverage_test1` & `pytorch_linux_bionic_py3_8_gcc9_coverage_test2` now use `linux.2xlarge` instances, which support AVX512. These jobs were used for testing the AVX512 kernels, since AVX512 kernels are used by default in both of these CI checks. Windows CI checks had already been using machines with AVX512 support.

### Would the downclocking caused by AVX512 pose an issue?

I think it's important to note that AVX2 causes downclocking as well, and the additional downclocking caused by AVX512 may not hamper performance on some Skylake machines & beyond, because of the doubled vector size. I think that [this post with verifiable references is a must-read](https://community.intel.com/t5/Software-Tuning-Performance/Unexpected-power-vs-cores-profile-for-MKL-kernels-on-modern-Xeon/m-p/1133869/highlight/true#M6450). Also, AVX512 would _probably not_ hurt performance on a high-end machine, [but measurements are recommended](https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/). If it does, `ATEN_AVX512_256=TRUE` can be used for building PyTorch, so that the AVX2 kernels can use 32 ymm registers instead of the default 16. [FBGEMM uses `AVX512_256` only on Xeon D processors](https://github.com/pytorch/FBGEMM/pull/209), which are said to have poor AVX512 performance.

This [official data](https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-scalable-spec-update.pdf) is for the Intel Skylake family, and the first link helps understand its significance. Cascade Lake & Ice Lake SP Xeon processors are said to be even better when it comes to AVX512 performance.

Here is the corresponding data for [Cascade Lake](https://cdrdv2.intel.com/v1/dl/getContent/338848) -

![CASCADE LAKE AVX2](https://user-images.githubusercontent.com/76181208/120666172-ffec3f80-c451-11eb-8ea1-8933ccc12a1b.PNG)
![CASCADE LAKE AVX512](https://user-images.githubusercontent.com/76181208/120666190-04b0f380-c452-11eb-9faa-38d233c874c8.PNG)

The corresponding data isn't publicly available for Intel Xeon SP 3rd gen (Ice Lake SP), but [Intel mentioned that the 3rd gen has frequency improvements pertaining to AVX512](https://newsroom.intel.com/wp-content/uploads/sites/11/2021/04/3rd-Gen-Intel-Xeon-Scalable-Platform-Press-Presentation-281884.pdf). Ice Lake SP machines also have 48 KB L1D caches, so that's another reason for AVX512 performance to be better on them.

### Is PyTorch always faster with AVX512?

No, but then PyTorch is not always faster with AVX2 either. Please refer to #60202. The benefit from vectorization is apparent with small tensors that fit in caches or in kernels that are more compute-heavy. For instance, AVX512 or AVX2 would yield no benefit for adding two 64 MB tensors, but adding two 1 MB tensors would do well with AVX2, and even more so with AVX512.

It seems that memory-bound computations, such as adding two 64 MB tensors, can be slower with vectorization (depending upon the number of threads used), as the effects of downclocking can then be observed.
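
A small experiment in the spirit of the benchmark scripts elsewhere in this log: timing an elementwise add on a roughly cache-resident tensor versus a large, memory-bound one (the sizes are illustrative):

```python
import torch
from torch.utils.benchmark import Timer

# ~1 MB vs ~64 MB of float32: the small add is compute-bound and benefits
# from wider vectors; the large one is memory-bound and mostly does not
for n in (1 << 18, 64 << 18):
    x, y = torch.randn(n), torch.randn(n)
    t = Timer("x + y", globals={"x": x, "y": y}, num_threads=1)
    print(f"{4 * n / 2**20:.0f} MB:", t.blocked_autorange(min_run_time=2))
```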

Original pull request: https://github.com/pytorch/pytorch/pull/56992

Reviewed By: soulitzer

Differential Revision: D29266289

Pulled By: ezyang

fbshipit-source-id: 2d5e8d1c2307252f22423bbc14f136c67c3e6184
2021-07-22 08:51:49 -07:00
Rong Rong (AI Infra)
9ade039593 fix test file not found issue (#61610)
Summary:
it should not error out if the file is not found.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61610

Reviewed By: samestep

Differential Revision: D29687958

Pulled By: walterddr

fbshipit-source-id: 17cacba8daa131df9bfb37fd58d6e4870ff75198
2021-07-13 17:50:50 -07:00
Rong Rong (AI Infra)
a5a10fe353 Move all downloading logic out of common_utils.py (#61479)
Summary:
and into tools/ folder

Currently run_test.py invokes tools/test_selections.py to:
1. download and analyze which test files to run
2. download and parse S3 stats and pass the info to local files
3. common_utils.py then uses the downloaded S3 stats to determine which test cases to run

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61479

Reviewed By: janeyx99

Differential Revision: D29661986

Pulled By: walterddr

fbshipit-source-id: bebd8c474bcc2444e135bfd2fa4bdd1eefafe595
2021-07-12 11:23:22 -07:00
Jane Xu
2bbcc80de3 Enable disabling test cases on specific platforms (#61427)
Summary:
This adds functionality to our common_utils.py to allow disabling test cases for platforms Mac, Windows, and Linux.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61427

Test Plan:
CI should not change as no issues currently have the line "Platforms:..."

I tested locally by making sure `test_async_script` is skipped while running `python test/test_jit.py -k TestAsync.test_async_script` with a cached modified `.pytorch-disabled-tests.json`:
```
{
  "total_count": 32,
  "incomplete_results": false,
  "items": [
    {
      "url": "https://api.github.com/repos/pytorch/pytorch/issues/60652",
      "repository_url": "https://api.github.com/repos/pytorch/pytorch",
      "labels_url": "https://api.github.com/repos/pytorch/pytorch/issues/60652/labels{/name}",
      "comments_url": "https://api.github.com/repos/pytorch/pytorch/issues/60652/comments",
      "events_url": "https://api.github.com/repos/pytorch/pytorch/issues/60652/events",
      "html_url": "https://github.com/pytorch/pytorch/issues/60652",
      "id": 929288995,
      "node_id": "MDU6SXNzdWU5MjkyODg5OTU=",
      "number": 60652,
      "title": "DISABLED test_async_script (jit.test_async.TestAsync)",
      "user": {
        "login": "ezyang",
        "id": 13564,
        "node_id": "MDQ6VXNlcjEzNTY0",
        "avatar_url": "https://avatars.githubusercontent.com/u/13564?v=4",
        "gravatar_id": "",
        "url": "https://api.github.com/users/ezyang",
        "html_url": "https://github.com/ezyang",
        "followers_url": "https://api.github.com/users/ezyang/followers",
        "following_url": "https://api.github.com/users/ezyang/following{/other_user}",
        "gists_url": "https://api.github.com/users/ezyang/gists{/gist_id}",
        "starred_url": "https://api.github.com/users/ezyang/starred{/owner}{/repo}",
        "subscriptions_url": "https://api.github.com/users/ezyang/subscriptions",
        "organizations_url": "https://api.github.com/users/ezyang/orgs",
        "repos_url": "https://api.github.com/users/ezyang/repos",
        "events_url": "https://api.github.com/users/ezyang/events{/privacy}",
        "received_events_url": "https://api.github.com/users/ezyang/received_events",
        "type": "User",
        "site_admin": false
      },
      "labels": [
        {
          "id": 1301397902,
          "node_id": "MDU6TGFiZWwxMzAxMzk3OTAy",
          "url": "https://api.github.com/repos/pytorch/pytorch/labels/module:%20flaky-tests",
          "name": "module: flaky-tests",
          "color": "f7e101",
          "default": false,
          "description": "Problem is a flaky test in CI"
        },
        {
          "id": 679953883,
          "node_id": "MDU6TGFiZWw2Nzk5NTM4ODM=",
          "url": "https://api.github.com/repos/pytorch/pytorch/labels/oncall:%20distributed",
          "name": "oncall: distributed",
          "color": "f7e101",
          "default": false,
          "description": "Add this issue/PR to distributed oncall triage queue"
        }
      ],
      "state": "open",
      "locked": false,
      "assignee": {
        "login": "rohan-varma",
        "id": 8039770,
        "node_id": "MDQ6VXNlcjgwMzk3NzA=",
        "avatar_url": "https://avatars.githubusercontent.com/u/8039770?v=4",
        "gravatar_id": "",
        "url": "https://api.github.com/users/rohan-varma",
        "html_url": "https://github.com/rohan-varma",
        "followers_url": "https://api.github.com/users/rohan-varma/followers",
        "following_url": "https://api.github.com/users/rohan-varma/following{/other_user}",
        "gists_url": "https://api.github.com/users/rohan-varma/gists{/gist_id}",
        "starred_url": "https://api.github.com/users/rohan-varma/starred{/owner}{/repo}",
        "subscriptions_url": "https://api.github.com/users/rohan-varma/subscriptions",
        "organizations_url": "https://api.github.com/users/rohan-varma/orgs",
        "repos_url": "https://api.github.com/users/rohan-varma/repos",
        "events_url": "https://api.github.com/users/rohan-varma/events{/privacy}",
        "received_events_url": "https://api.github.com/users/rohan-varma/received_events",
        "type": "User",
        "site_admin": false
      },
      "assignees": [
        {
          "login": "rohan-varma",
          "id": 8039770,
          "node_id": "MDQ6VXNlcjgwMzk3NzA=",
          "avatar_url": "https://avatars.githubusercontent.com/u/8039770?v=4",
          "gravatar_id": "",
          "url": "https://api.github.com/users/rohan-varma",
          "html_url": "https://github.com/rohan-varma",
          "followers_url": "https://api.github.com/users/rohan-varma/followers",
          "following_url": "https://api.github.com/users/rohan-varma/following{/other_user}",
          "gists_url": "https://api.github.com/users/rohan-varma/gists{/gist_id}",
          "starred_url": "https://api.github.com/users/rohan-varma/starred{/owner}{/repo}",
          "subscriptions_url": "https://api.github.com/users/rohan-varma/subscriptions",
          "organizations_url": "https://api.github.com/users/rohan-varma/orgs",
          "repos_url": "https://api.github.com/users/rohan-varma/repos",
          "events_url": "https://api.github.com/users/rohan-varma/events{/privacy}",
          "received_events_url": "https://api.github.com/users/rohan-varma/received_events",
          "type": "User",
          "site_admin": false
        }
      ],
      "milestone": null,
      "comments": 0,
      "created_at": "2021-06-24T14:28:33Z",
      "updated_at": "2021-06-24T16:40:42Z",
      "closed_at": null,
      "author_association": "CONTRIBUTOR",
      "active_lock_reason": null,
      "body": "Platforms:Mac, windows, Linux\r\n```\r\nJun 24 00:59:14 ======================================================================\r\nJun 24 00:59:14 ERROR [0.477s]: test_async_script (__main__.ProcessGroupGlooWrapperTest)\r\nJun 24 00:59:14 ----------------------------------------------------------------------\r\nJun 24 00:59:14 Traceback (most recent call last):\r\nJun 24 00:59:14   File \"/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py\", line 398, in wrapper\r\nJun 24 00:59:14     self._join_processes(fn)\r\nJun 24 00:59:14   File \"/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py\", line 590, in _join_processes\r\nJun 24 00:59:14     self._check_return_codes(elapsed_time)\r\nJun 24 00:59:14   File \"/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py\", line 633, in _check_return_codes\r\nJun 24 00:59:14     raise RuntimeError(error)\r\nJun 24 00:59:14 RuntimeError: Process 0 exited with error code 10 and exception:\r\nJun 24 00:59:14 RuntimeError: [/var/lib/jenkins/workspace/third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [172.17.0.2]:21400\r\nJun 24 00:59:14 \r\nJun 24 00:59:14 During handling of the above exception, another exception occurred:\r\nJun 24 00:59:14 \r\nJun 24 00:59:14 Traceback (most recent call last):\r\nJun 24 00:59:14   File \"/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py\", line 516, in run_test\r\nJun 24 00:59:14     getattr(self, test_name)()\r\nJun 24 00:59:14   File \"/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py\", line 400, in wrapper\r\nJun 24 00:59:14     fn()\r\nJun 24 00:59:14   File \"distributed/test_pg_wrapper.py\", line 270, in test_collective_hang\r\nJun 24 00:59:14     self._test_collective_hang(pg)\r\nJun 24 00:59:14   File \"distributed/test_pg_wrapper.py\", line 52, in _test_collective_hang\r\nJun 24 00:59:14     wrapper_pg.allreduce([tensor])\r\nJun 24 00:59:14   File \"/opt/conda/lib/python3.6/unittest/case.py\", line 217, in __exit__\r\nJun 24 00:59:14     expected_regex.pattern, str(exc_value)))\r\nJun 24 00:59:14   File \"/opt/conda/lib/python3.6/unittest/case.py\", line 135, in _raiseFailure\r\nJun 24 00:59:14     raise self.test_case.failureException(msg)\r\nJun 24 00:59:14 AssertionError: \"Ranks 1 failed to pass monitoredBarrier\" does not match \"[/var/lib/jenkins/workspace/third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [172.17.0.2]:21400\"\r\n```\r\n\r\nhttps://www.internalfb.com/intern/opensource/ci/job/log/225221175921058/\n\ncc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse agolynski SciPioneer H-Huang mrzzd cbalioglu gcramer23",
      "performed_via_github_app": null,
      "score": 0.0
    }
  ]
}
```
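
For illustration, a rough sketch of how a `Platforms:` line like the one in the issue body above could be matched against the current machine (the names here are illustrative, not the exact common_utils.py logic):

```python
import platform
import re

def disabled_on_this_platform(issue_body: str) -> bool:
    m = re.search(r"^Platforms:\s*(.+)$", issue_body, re.MULTILINE)
    if m is None:
        return True  # no platform qualifier: the test stays disabled everywhere
    listed = {p.strip().lower() for p in m.group(1).split(",")}
    current = {"darwin": "mac", "windows": "windows", "linux": "linux"}[platform.system().lower()]
    return current in listed

print(disabled_on_this_platform("Platforms:Mac, windows, Linux\r\n..."))  # True on all three
```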

Reviewed By: iramazanli

Differential Revision: D29627799

Pulled By: janeyx99

fbshipit-source-id: 5ef79127cbe0055c4f41766048e66f98cf80d2c4
2021-07-09 09:29:16 -07:00
Jane Xu
fb00194030 Fix typo in common_utils.py (#61365)
Summary:
Missed this in review of https://github.com/pytorch/pytorch/pull/57953. I don't think this has affected much, though.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61365

Reviewed By: walterddr

Differential Revision: D29593764

Pulled By: janeyx99

fbshipit-source-id: 2c6f6aa961eabca0d8b8a7607aaae979667cca3b
2021-07-07 16:28:20 -07:00
driazati
45cc207a88 Fix breakpad build + add test canary (#60990)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60990

This makes the breakpad build more explicit in its messaging and hints to cmake where to look for the library (it wasn't able to find it without `PATHS` on CI even though that works locally). This also adds a smoke test that will fail if breakpad isn't present on a CI job where it is expected (e.g. binary builds).

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D29514316

Pulled By: driazati

fbshipit-source-id: 79514363334788f311ba5d4f25deed3452f0c3eb
2021-07-06 14:15:07 -07:00
Pearu Peterson
374278f431 Improved sparse CSR tensor sampling method (#60283)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/59379

The improved sparse CSR tensor sampling method is described in https://pearu.github.io/csr_sampling.html and features:
- for specified `nnz`, one gets a CSR sample with the same `nnz`
- variability of the number of specified columns per row is maximized
- `crow_indices` content is randomized
- a given row specific `col_indices` content is sorted and filled with unique values (see also https://github.com/pytorch/pytorch/issues/60277)
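
A rough, self-contained sketch of sampling in this spirit (not the exact algorithm from the write-up; for instance, the clamp below may drop entries instead of redistributing them, so `nnz` is only approximately preserved):

```python
import torch

def sample_csr(n_rows, n_cols, nnz):
    # randomize how the nnz entries are spread over the rows
    counts = torch.bincount(torch.randint(n_rows, (nnz,)), minlength=n_rows)
    counts = counts.clamp(max=n_cols)  # a row holds at most n_cols entries
    crow_indices = torch.cat([torch.zeros(1, dtype=torch.long), counts.cumsum(0)])
    # per-row col_indices are sorted and unique
    col_indices = torch.cat(
        [torch.randperm(n_cols)[: int(c)].sort().values for c in counts]
    )
    values = torch.randn(col_indices.numel())
    return torch.sparse_csr_tensor(crow_indices, col_indices, values, (n_rows, n_cols))

s = sample_csr(4, 6, 10)
print(s.crow_indices(), s.col_indices())
```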

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60283

Reviewed By: bhosmer

Differential Revision: D29492605

Pulled By: cpuhrsch

fbshipit-source-id: 8d875b7c2b0573a9ab37047c6d8fe8b540295ce1
2021-07-01 13:26:19 -07:00
Sam Estep
d5a44f9f12 Use expecttest from PyPI (#60658)
Summary:
This PR removes `torch/testing/_internal/expecttest.py` in favor of https://github.com/ezyang/expecttest. See also https://github.com/ezyang/ghstack/pull/71.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60658

Test Plan: CI.

Reviewed By: ezyang

Differential Revision: D29430763

Pulled By: samestep

fbshipit-source-id: b7cdc7ba37330176149fd465312118e2254ae92e
2021-06-28 15:43:34 -07:00
Kushashwa Ravi Shrimali
08020220f3 [Testing] Adding reference tests to OpInfo class (#59369)
Summary:
This PR adds a `ref` argument to the `OpInfo` base class. The idea is to add reference checks for all _eligible_ ops. For more discussion, please check https://github.com/pytorch/pytorch/issues/58294

* [x] Migrate (but not removing yet) and modify helper functions from `UnaryUfuncOpInfo` class to `OpInfo` base class.
* [x] Test the reference checks for multiple ops. (also decide a list of different and eligible ops for this)
* [x] Handle possible edge cases (for example: `uint64` isn't implemented in PyTorch but is there in NumPy, and this needs to be handled -- more on this later) -- _Update_: We decided that these reference tests should only test for values and not types.
* [x] Create a sample PR for a single (of all different categories?) on adding reference functions to the eligible ops. -- _Update_: This is being done in this PR only.
* [x] ~Remove reference tests from `test_unary_ufuncs.py` and test to make sure that nothing breaks.~ (*Update*: We won't be touching Unary Ufunc reference tests in this PR)
* [x] Add comments, remove unnecessary prints/comments (added for debugging).

Note: To keep the PR description short, examples of edge cases encountered have been mentioned in the comments below.
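
For a sense of what a value-only reference check amounts to, here is a small standalone sketch (a hypothetical helper, not the OpInfo machinery itself):

```python
import numpy as np
import torch

def check_against_reference(torch_op, np_ref, sample):
    actual = torch_op(sample).cpu().numpy()
    expected = np_ref(sample.cpu().numpy())
    # compare values only; dtype promotion differences are deliberately ignored
    np.testing.assert_allclose(actual, expected.astype(actual.dtype), rtol=1e-6, atol=1e-6)

check_against_reference(torch.sinh, np.sinh, torch.randn(10, dtype=torch.float64))
```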

cc: mruberry pmeier kshitij12345

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59369

Reviewed By: ngimel

Differential Revision: D29347252

Pulled By: mruberry

fbshipit-source-id: 69719deddb1d23c53db45287a7e66c1bfe7e65bb
2021-06-23 19:26:08 -07:00
Philip Meier
0c916c8a4e up the priority of numpy array comparisons in self.assertEqual (#59067)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/58988.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59067

Reviewed By: jbschlosser

Differential Revision: D28986642

Pulled By: heitorschueroff

fbshipit-source-id: 3ef2d26b4010fc3519d0a1a020ea446ffeb46ba0
2021-06-22 13:07:07 -07:00
Weiqiang Wu
6a87e8d087 Implement erfcx() (#58194)
Summary:
Implement erfcx() https://github.com/pytorch/pytorch/issues/31945

Reference: https://github.com/pytorch/pytorch/issues/50345
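
For context, `erfcx` is the scaled complementary error function, `erfcx(x) = exp(x**2) * erfc(x)`. A dedicated kernel matters because the naive composition breaks down for large inputs:

```python
import torch

x = torch.tensor([1.0, 5.0, 30.0], dtype=torch.float64)
naive = torch.exp(x ** 2) * torch.erfc(x)  # exp(900) -> inf, erfc(30) -> 0, so inf * 0 = nan
print(naive)                   # tensor([0.4276, 0.1107,    nan], dtype=torch.float64)
print(torch.special.erfcx(x))  # tensor([0.4276, 0.1107, 0.0188], dtype=torch.float64)
```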

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58194

Reviewed By: ngimel

Differential Revision: D29285979

Pulled By: mruberry

fbshipit-source-id: 5bcfe77fddfabbeb8c8068658ba6d9fec6430399
2021-06-22 12:38:38 -07:00
Peter Bell
45ae2e7863 Set TORCH_WARN_ONCE to always warn inside of assertNotWarn (#60020)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60020

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D29249909

Pulled By: mruberry

fbshipit-source-id: 10a8d5c05bd8d4aec345f70b132efd3623601f6a
2021-06-21 21:35:54 -07:00
Rong Rong (AI Infra)
5921b5480a ensure xml report path are relative to */pytorch/test (#60380)
Summary:
Changes the approach.

The root cause: `inspect.getfile` returns an absolute path instead of a path relative to `os.getcwd()` in newer Python versions. We sanitize this by removing the CI prefix when it applies.

See:
https://app.circleci.com/pipelines/github/pytorch/pytorch/339568/workflows/43cac71c-759e-471f-83c2-d59c152dcd8a/jobs/14278585 vs. https://app.circleci.com/pipelines/github/pytorch/pytorch/339568/workflows/43cac71c-759e-471f-83c2-d59c152dcd8a/jobs/14278285
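
A minimal sketch of the sanitization idea (the prefix below is illustrative, not the actual CI prefix value):

```python
import os

CI_PREFIX = "/var/lib/jenkins/workspace"  # illustrative

def sanitize_test_path(path: str) -> str:
    # make absolute paths relative to the checkout, so XML report paths
    # stay relative to */pytorch/test regardless of the Python version
    return os.path.relpath(path, CI_PREFIX) if path.startswith(CI_PREFIX) else path

print(sanitize_test_path("/var/lib/jenkins/workspace/test/test_nn.py"))  # test/test_nn.py
```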

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60380

Test Plan:
CI

Plot twist:

windows tests are actually launched via
```
pushd test
python run_test.py
```
while linux/macos tests are
```
python test/run_test.py
```
This might cause a problem when using `os.getcwd()`; we will see from the PR CI results.

Reviewed By: malfet

Differential Revision: D29276969

Pulled By: walterddr

fbshipit-source-id: 336c2805d0c92733e0ff4c309ff2044dc2ed4e21
2021-06-21 20:47:23 -07:00
Rong Rong (AI Infra)
510334f34b [BE] clean up IS_PYTORCH_CI and IN_CI (#60279)
Summary:
`IS_PYTORCH_CI` and `IN_CI` are used inconsistently, and in some cases `IN_CI` is not set because it only exists in .circleci/scripts/setup_ci_environment.sh. This cleans up the 2 flags so that only `IN_CI` is used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60279

Test Plan: CI

Reviewed By: seemethere

Differential Revision: D29239545

Pulled By: walterddr

fbshipit-source-id: a069424a2bb8790a3adfdaf0dc460301026bf8c7
2021-06-20 19:45:07 -07:00
Philip Meier
d5988c5eca remove unused type: ignore directives (#60006)
Summary:
During development it is common practice to put `type: ignore` comments on lines that are correct, but that `mypy` doesn't recognize as such. This often stems from the fact that the `mypy` version in use wasn't able to handle the pattern in question.

With every new release `mypy` gets better at handling complex code. In addition to fixing all the previously accepted but now failing patterns, we should also revisit all `type: ignore` comments to see if they are still needed. Fortunately, we don't need to do it manually: by adding `warn_unused_ignores = True` to the configuration, `mypy` will error out whenever it encounters a `type: ignore` that is no longer needed.
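
For example, with `warn_unused_ignores = True` in the configuration, a stale suppression turns into an error of its own:

```python
# modern mypy types int("5") correctly, so this suppression is stale;
# with warn_unused_ignores = True it is reported as:
#   error: Unused "type: ignore" comment
x: int = int("5")  # type: ignore
```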

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60006

Reviewed By: jbschlosser, malfet

Differential Revision: D29133237

Pulled By: albanD

fbshipit-source-id: 41e82edc5cd5affa7ccedad044b59b94dad4425a
2021-06-18 07:23:31 -07:00
Rong Rong (AI Infra)
b8ab98626b only runs mem leak check on master (#60023)
Summary:
Sets an environment variable to only run the CUDA mem leak check on master CI jobs.

See discussion in https://github.com/pytorch/pytorch/pull/59402#issuecomment-860773034

See stats before/after disabling mem leak check: https://github.com/pytorch/pytorch/pull/59942#issuecomment-860947095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60023

Test Plan:
https://github.com/pytorch/pytorch/issues/60108
https://github.com/pytorch/pytorch/issues/60116

Reviewed By: janeyx99

Differential Revision: D29164182

Pulled By: walterddr

fbshipit-source-id: dfe88c2c1275b6eb35f18b58aacdc220f34ccb59
2021-06-17 07:56:26 -07:00
Eli Uriegas
a62f6b6d04 ci: Add skipIfOnGHA util (#59748)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59748
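
A plausible sketch of such a utility, assuming GitHub Actions' standard `GITHUB_ACTIONS=true` environment variable (not the actual implementation):

```python
import os
import unittest

def skipIfOnGHA(fn):
    return unittest.skipIf(
        os.getenv("GITHUB_ACTIONS") == "true", "test skipped on GitHub Actions"
    )(fn)
```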

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: janeyx99

Differential Revision: D29008217

Pulled By: seemethere

fbshipit-source-id: ffc2f7935df722f26c1252e3833085430ada7433
2021-06-09 21:19:26 -07:00
Jane Xu
97dfc7e300 [Reland] Adding run specified tests option to run_test.py (#59649)
Summary:
Reland of https://github.com/pytorch/pytorch/issues/59487

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59649

Reviewed By: samestep

Differential Revision: D28970751

Pulled By: janeyx99

fbshipit-source-id: 6e28d4dcfdab8a49da4b6a02c57516b08bacd7b5
2021-06-08 16:04:46 -07:00
Rong Rong (AI Infra)
0208e604e3 seems os.environ.get() not working well on windows (#59634)
Summary:
replace with os.getenv() instead

For some reason this was intermittently failing Azure pipelines. I can't log in to the pipeline itself for debugging, but here are 2 examples: [successful](https://app.circleci.com/pipelines/github/pytorch/pytorch/332405/workflows/944609ad-5dcf-49da-984f-26c381d1f16c/jobs/13969059) vs [failed](https://app.circleci.com/pipelines/github/pytorch/pytorch/332518/workflows/21f8a5a6-3b95-432e-be42-ac98008c671b/jobs/13975637)

However, given that the other constants common_utils.py exposes via `os.getenv()` were working, I am making them consistent.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59634

Test Plan: CI/master

Reviewed By: jbschlosser

Differential Revision: D28966412

Pulled By: walterddr

fbshipit-source-id: 7bcb9adf06df0acabd9574459eb6637c3e6a2947
2021-06-08 11:59:39 -07:00
Alban Desmaison
5d6a10a765 Revert D28913223: [pytorch][PR] Adding run-specified-test-cases option in run_test.py
Test Plan: revert-hammer

Differential Revision:
D28913223 (24432eaa29)

Original commit changeset: 0d1f99109734

fbshipit-source-id: 47c073720cff23a5d4cb64556381c46025e90937
2021-06-08 02:18:16 -07:00
Rong Rong (AI Infra)
57d8bccd00 only reorder tests based on git diff if IN_CI (#59565)
Summary:
Do not reorder tests unless running in CI; reordering makes local development test ordering nondeterministic, since most of us branch out from viable/strict rather than the head of master.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59565

Reviewed By: ejguan

Differential Revision: D28943906

Pulled By: walterddr

fbshipit-source-id: e742e7ce4b3fc017d7563b01e93c4cd774d0a537
2021-06-07 17:54:19 -07:00
Jane Xu
24432eaa29 Adding run-specified-test-cases option in run_test.py (#59487)
Summary:
The run-specified-test-cases option allows us to specify a list of test cases to run via a CSV with at minimum two columns: test_filename and test_case_name.

This PR also adds .json to some files we use for better clarity.

Usage:
`python test/run_test.py --run-specified-test-cases <csv_file>` where the csv file can look like:
```
test_filename,test_case_name,test_total_time,windows_only_failure_sha_count,total_sha_count,windows_failure_count,linux_failure_count,windows_total_count,linux_total_count
test_cuda,test_cudnn_multiple_threads_same_device,8068.8409659525,46,3768,53,0,2181,6750
test_utils,test_load_standalone,8308.8062920459,14,4630,65,0,2718,8729
test_ops,test_forward_mode_AD_acosh_cuda_complex128,91.652619369806,11,1971,26,1,1197,3825
test_ops,test_forward_mode_AD_acos_cuda_complex128,91.825633094915,11,1971,26,1,1197,3825
test_profiler,test_source,60.93786725749,9,4656,21,3,2742,8805
test_profiler,test_profiler_tracing,203.09352795241,9,4662,21,3,2737,8807
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59487

Test Plan:
Without specifying the option, everything should be as it was before.

Running `python test/run_test.py --run-specified-test-cases windows_smoke_tests.csv` resulted in this paste P420276949 (you can see internally). A snippet looks like:
```
(pytorch) janeyx@janeyx-mbp pytorch % python test/run_test.py --run-specified-test-cases windows_smoke_tests.csv
Loading specified test cases to run from windows_smoke_tests.csv.
Processed 28 test cases.
Running test_cpp_extensions_jit ... [2021-06-04 17:24:41.213644]
Executing ['/Users/janeyx/miniconda3/envs/pytorch/bin/python', 'test_cpp_extensions_jit.py', '-k', 'test_jit_cuda_archflags'] ... [2021-06-04 17:24:41.213781]
s
----------------------------------------------------------------------
Ran 1 test in 0.000s

OK (skipped=1)
...
```
With pytest, an example executable would be:
`Running test_dataloader ... [2021-06-04 17:37:57.643039]
Executing ['/Users/janeyx/miniconda3/envs/pytorch/bin/python', '-m', 'pytest', 'test_dataloader.py', '-v', '-k', 'test_segfault or test_timeout'] ... [2021-06-04 17:37:57.643327]`

Reviewed By: samestep

Differential Revision: D28913223

Pulled By: janeyx99

fbshipit-source-id: 0d1f9910973426b8756815c697b483160517b127
2021-06-07 16:27:43 -07:00
Natalia Gimelshein
344ecb2e71 flip via TI (#59509)
Summary:
Resubmit of https://github.com/pytorch/pytorch/issues/58747

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59509

Reviewed By: mruberry

Differential Revision: D28918665

Pulled By: ngimel

fbshipit-source-id: b045c7b35eaf22e53b1bc359ffbe5a4fda05dcda
2021-06-05 15:43:29 -07:00
Natalia Gimelshein
5117ac3bb4 Revert D28877076: [pytorch][PR] torch.flip via TI
Test Plan: revert-hammer

Differential Revision:
D28877076 (d82bc3feb8)

Original commit changeset: 4fa6eb519085

fbshipit-source-id: c81e7d3283ff6822db913bf9f49a1533268755d0
2021-06-04 23:03:53 -07:00
lezcano
d82bc3feb8 torch.flip via TI (#58747)
Summary:
Implements an idea by ngimel to improve the performance of `torch.flip` via a clever hack into TI to bypass the fact that TI is not designed to work with negative indices.

Something that might be added is vectorisation support on CPU, given how simple the implementation is now.

Some low-hanging fruits that I did not implement:
- Write it as a structured kernel
- Migrate the tests to opinfos
- Have a look at `cumsum_backward` and `cumprod_backward`,  as I think that they could be implemented faster with `flip`, now that `flip` is fast.

**Edit**
This operation already has OpInfos, and it cannot be migrated to a structured kernel because it supports quantised tensors.

Summary of the PR:
- x1.5-3 performance boost on CPU
- x1.5-2 performance boost on CUDA
- Comparable performance across dimensions, regardless of the strides (thanks TI)
- Simpler code

<details>
<summary>
Test Script
</summary>

```python
from itertools import product

import torch
from torch.utils.benchmark import Compare, Timer

def get_timer(size, dims, num_threads, device):
    x = torch.rand(*size, device=device)

    timer = Timer(
        "torch.flip(x, dims=dims)",
        globals={"x": x, "dims": dims},
        label=f"Flip {device}",
        description=f"dims: {dims}",
        sub_label=f"size: {size}",
        num_threads=num_threads,
    )

    return timer.blocked_autorange(min_run_time=5)

def get_params():
    sizes = ((1000,)*2, (1000,)*3, (10000,)*2)
    for size, device in product(sizes, ("cpu", "cuda")):
        threads = (1, 2, 4) if device == "cpu" else (1,)
        list_dims = [(0,), (1,), (0, 1)]
        if len(size) == 3:
            list_dims.append((0, 2))
        for num_threads, dims in product(threads, list_dims):
            yield size, dims, num_threads, device

def compare():
    compare = Compare([get_timer(*params) for params in get_params()])
    compare.trim_significant_figures()
    compare.colorize()
    compare.print()

compare()
```
</details>

<details>
<summary>
Benchmark PR
</summary>

![image](https://user-images.githubusercontent.com/3291265/119139954-81e46d80-ba3b-11eb-9aad-e825e515d41b.png)

</details>

<details>
<summary>
Benchmark master
</summary>

![image](https://user-images.githubusercontent.com/3291265/119139915-76914200-ba3b-11eb-9aa8-84b3ca220c93.png)

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58747

Reviewed By: agolynski

Differential Revision: D28877076

Pulled By: ngimel

fbshipit-source-id: 4fa6eb519085950176cb3a9161eeb3b6289ec575
2021-06-04 20:13:38 -07:00
albanD
e9e5588588 Improve Tensor traverse to traverse its grad_fn when possible (#58271)
Summary:
There are two main changes here:
- THPVariable will actually visit its grad_fn if there are no other references to the C++ Tensor and no other references to the grad_fn. The critical observation compared to the existing comment (thanks Ed!) is that if we also check that the C++ Tensor object is not referenced somewhere else, we're sure that no one can change the grad_fn refcount between the traverse and the clear.
- THPVariable doesn't need a special clear for these new cases, as we're the only owner of the C++ Tensor, and so the cdata.reset() will necessarily free the Tensor and all its resources.

The two tests are to ensure:
- That the cycles are indeed collectible by the gc
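
As an illustration of the kind of cycle involved (a standard-API sketch, not one of the PR's tests): a hook whose closure captures its own tensor creates tensor -> grad_fn -> hook -> tensor, which the gc can now reclaim:

```python
import gc
import torch

x = torch.randn(2, requires_grad=True)
y = x * 2
y.register_hook(lambda grad: print(y.shape))  # closure keeps y alive via its grad_fn
del x, y
gc.collect()  # with grad_fn traversal, the collector can break this cycle
```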

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58271

Reviewed By: ngimel

Differential Revision: D28796461

Pulled By: albanD

fbshipit-source-id: 62c05930ddd0c48422c79b03118db41a73c1355d
2021-06-01 10:27:52 -07:00
kshitij12345
ea465f7378 OpInfo: true_divide and minor fix (#59154)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59154

Reviewed By: ngimel

Differential Revision: D28780115

Pulled By: mruberry

fbshipit-source-id: 91e254698597fa0c7d4df6053ec017a85e180304
2021-05-30 18:35:30 -07:00
Kushashwa Ravi Shrimali
0c1420aa3c OpInfo: fmod and remainder (#57941)
Summary:
See https://github.com/pytorch/pytorch/issues/54261

cc: mruberry Lezcano kshitij12345

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57941

Reviewed By: mrshenli

Differential Revision: D28744464

Pulled By: mruberry

fbshipit-source-id: 19847277d4f8d3a39a706c2b3c9eddf0dedcb20c
2021-05-27 20:32:56 -07:00
Meghan Lele
b14c3205fd [JIT] Add torch._C.ScriptDict (#52659)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52659

**Summary**
This commit adds `torch._C.ScriptDict`, a dictionary type that has reference
semantics across the Python/TorchScript boundary. That is, modifications
made to instances of `torch._C.ScriptDict` in TorchScript are visible in
Python even when it is not returned from the function. Instances can be
constructed by passing an instance of a Python dictionary to
`torch.jit.script`. In the case of an empty dictionary, its type is
assumed to be `Dict[str, Tensor]` to be consistent with the handling of
empty dictionaries in TorchScript source code.

`torch._C.ScriptDict` is implemented using a modified version of pybind's `stl_bind.h`-style bindings attached to `ScriptDict`, `ScriptDictIterator` and `ScriptDictKeyIterator`, wrapper classes around `c10::impl::GenericDict` and `c10::impl::GenericDict::iterator`. These bindings allow instances of `torch._C.ScriptDict` to be used as if they were regular Python `dict`s. Reference semantics are achieved by simply retrieving the `IValue` contained in `ScriptDict` in `toIValue` (invoked when converting Python arguments to `IValue`s before calling TorchScript code).
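
A usage sketch based on the description above (following the construction path this commit describes):

```python
from typing import Dict

import torch

@torch.jit.script
def add_entry(d: Dict[str, int]) -> None:
    d["new"] = 1

d = torch.jit.script({"a": 0})  # constructs a torch._C.ScriptDict
add_entry(d)
print(d["new"])  # 1 -- the mutation made in TorchScript is visible in Python
```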

**Test Plan**
This commit adds `TestScriptDict` to `test_list_dict.py`, a set of tests
that check that all of the common dictionary operations are supported
and that instances have reference semantics across the
Python/TorchScript boundary.

Differential Revision: D27211605

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Pulled By: SplitInfinity

fbshipit-source-id: 446d4e5328375791aa73eb9e8b04dfe3465af960
2021-05-27 10:25:30 -07:00
Ivan Yashchuk
aaca12bcc2 Deprecate in docs torch.svd and change svd -> linalg_svd (#57981)
Summary:
This PR adds a note to the documentation that torch.svd is deprecated together with an upgrade guide on how to use `torch.linalg.svd` and `torch.linalg.svdvals` (Lezcano's instructions from https://github.com/pytorch/pytorch/issues/57549).
In addition, all usage of the old svd function is replaced with the new one from the torch.linalg module, except in the `at::linalg_pinv` function, which fails the XLA CI build (https://github.com/pytorch/xla/issues/2755; see the failure in draft PR https://github.com/pytorch/pytorch/pull/57772).
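
The gist of the upgrade guide, as a runnable comparison (note that the deprecated API returns `V` while the new one returns `Vh`):

```python
import torch

A = torch.randn(5, 3, dtype=torch.float64)

U, S, V = torch.svd(A)                                 # deprecated
U2, S2, Vh = torch.linalg.svd(A, full_matrices=False)  # replacement

print(torch.allclose(A, U @ torch.diag(S) @ V.T))    # True
print(torch.allclose(A, U2 @ torch.diag(S2) @ Vh))   # True
print(torch.linalg.svdvals(A))                       # singular values only
```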

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57981

Reviewed By: ngimel

Differential Revision: D28345558

Pulled By: mruberry

fbshipit-source-id: 02dd9ae6efe975026e80ca128e9b91dfc65d7213
2021-05-11 18:04:10 -07:00
Rong Rong (AI Infra)
29753339b7 Do not download slow test when on sandcastle (#57953)
Summary:
Downloading the slow_test list on Sandcastle causes timeouts; this is an even bigger issue since `common_utils.py` is reused in many internal projects/modules.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57953

Test Plan: CI

Reviewed By: janeyx99

Differential Revision: D28325527

fbshipit-source-id: ae47c9e43ad6f416008005bb26ceb2f3d6966f2e
2021-05-10 10:39:10 -07:00
Alexander
18c89a904b Modernize test-suite in sparse tensor CSR (#56392)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56392

Fixes for gh-56371 and gh-56369

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D27913212

Pulled By: mruberry

fbshipit-source-id: 2c78fe9fa4b6c6b566d9eb01f71e6016d672a545
2021-04-27 15:22:17 -07:00
Jeffrey Wan
d01302431c Enable fast gradcheck for real inputs and outputs (#55237)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55237

In this PR, we reenable fast-gradcheck and resolve misc issues that arise:
Before landing this PR, land #55182 so that slow tests are still being run periodically.

Bolded indicates the issue is handled in this PR, otherwise it is handled in a previous PR.

**Non-determinism issues**:
- ops that do not have deterministic implementation (as documented https://pytorch.org/docs/stable/generated/torch.use_deterministic_algorithms.html#torch.use_deterministic_algorithms)
  - test_pad_cuda (replication_pad2d) (test_nn)
  - interpolate (test_nn)
  - cummin, cummax (scatter_add_cuda_kernel) (test_ops)
  - test_fn_gradgrad_prod_cpu_float64 (test_ops)

Randomness:
  - RRelu (new module tests) - we fix by using our own generator as to avoid messing with user RNG state (handled in #54480)

Numerical precision issues:
- jacobian mismatch: test_gelu (test_nn, float32, not able to replicate locally) - we fixed this by disabling for float32 (handled in previous  PR)
- cholesky_solve (test_linalg): #56235 handled in previous PR
- **cumprod** (test_ops) - #56275 disabled fast gradcheck

Not yet replicated:
 - test_relaxed_one_hot_categorical_2d (test_distributions)
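
For reference, fast mode is the `fast_mode` flag of `torch.autograd.gradcheck`; it checks the Jacobian along random directions instead of entry by entry, trading exhaustiveness for speed:

```python
import torch
from torch.autograd import gradcheck

x = torch.randn(4, dtype=torch.float64, requires_grad=True)
assert gradcheck(torch.sinh, (x,), fast_mode=True)
```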

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D27920906

fbshipit-source-id: 894dd7bf20b74f1a91a5bc24fe56794b4ee24656
2021-04-22 19:46:37 -07:00
Jeffrey Wan
5dcc7ac35c Add new scheduled job to circle-ci workflow (#55182)
Summary:
Under this setting the job should run 3 times a day.

When the environment variable `PYTORCH_TEST_WITH_SLOW_GRADCHECK` is set to `ON`, the default value for `fast_mode` in the gradcheck wrapper is set to False. This is overridden by whatever value the user explicitly passes in.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55182

Reviewed By: albanD

Differential Revision: D27919236

Pulled By: soulitzer

fbshipit-source-id: 3a55ec6edcfc6e65fbc3a8a09c63aaea1bd1c5bf
2021-04-21 17:05:10 -07:00
Ailing Zhang
59b61f912a Switch assertWarnsOnceRegex logic to check any instead of all. (#56434)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56434

If we hit multiple TORCH_WARN calls from different sources when running the
statement, it makes more sense to check that the regex is
matched by any one of the warning messages instead of all of them.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D27871946

Pulled By: ailzhang

fbshipit-source-id: 5940a8e43e4cc91aef213ef01e48d506fd9a1132
2021-04-20 10:37:36 -07:00
Sam Estep
e3900d2ba5 Add lint for unqualified noqa (#56272)
Summary:
As this diff shows, currently there are a couple hundred instances of raw `noqa` in the codebase, which just ignore all errors on a given line. That isn't great, so this PR changes all existing instances of that antipattern to qualify the `noqa` with respect to a specific error code, and adds a lint to prevent more of this from happening in the future.

Interestingly, some of the examples the `noqa` lint catches are genuine attempts to qualify the `noqa` with a specific error code, such as these two:
```
test/jit/test_misc.py:27:            print(f"{hello + ' ' + test}, I'm a {test}") # noqa E999
test/jit/test_misc.py:28:            print(f"format blank") # noqa F541
```
However, those are still wrong because they are [missing a colon](https://flake8.pycqa.org/en/3.9.1/user/violations.html#in-line-ignoring-errors), which actually causes the error code to be completely ignored:

- If you change them to anything else, the warnings will still be suppressed.
- If you add the necessary colons then it is revealed that `E261` was also being suppressed, unintentionally:
  ```
  test/jit/test_misc.py:27:57: E261 at least two spaces before inline comment
  test/jit/test_misc.py:28:35: E261 at least two spaces before inline comment
  ```

I did try using [flake8-noqa](https://pypi.org/project/flake8-noqa/) instead of a custom `git grep` lint, but it didn't seem to work. This PR is definitely missing some of the functionality that flake8-noqa is supposed to provide, though, so if someone can figure out how to use it, we should do that instead.
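
The distinction, in miniature:

```python
# unqualified: suppresses every error on the line (now rejected by the new lint)
import os  # noqa

# qualified (note the colon): suppresses only F401
import sys  # noqa: F401

# missing colon: flake8 treats this as a bare noqa, so the code is ignored
# and every error on the line is suppressed
import re  # noqa F401
```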

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56272

Test Plan:
CI should pass on the tip of this PR, and we know that the lint works because the following CI run (before this PR was finished) failed:

- https://github.com/pytorch/pytorch/runs/2365189927

Reviewed By: janeyx99

Differential Revision: D27830127

Pulled By: samestep

fbshipit-source-id: d6dcf4f945ebd18cd76c46a07f3b408296864fcb
2021-04-19 13:16:18 -07:00
Can Balioglu
42f5d66080 [DDP] Fixes flaky tests caused by incorrect floating-point comparison (#56192)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50699.

The root cause was that some floating-point assertions had a "greater than or **equal to**" condition. The "equal to" part was causing flakiness due to the strict equality check (`==`) in `TestCase.assertGreaterEqual()`. This PR introduces a new assertion method called `assertGreaterAlmostEqual()` in `common_utils.py` that mitigates the problem by behaving similarly to `TestCase.assertAlmostEqual()`.
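
A plausible sketch of the new helper, with assertAlmostEqual-style rounding (the real implementation is in common_utils.py and may differ in signature):

```python
def assertGreaterAlmostEqual(self, first, second, places=7):
    """Assert first >= second, tolerating an almost-equal difference."""
    if first >= second or round(second - first, places) == 0:
        return
    raise self.failureException(f"{first} not greater than or almost equal to {second}")
```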

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56192

Reviewed By: zhaojuanmao

Differential Revision: D27804724

Pulled By: cbalioglu

fbshipit-source-id: bc44a41ca4ce45dfee62fb3769fb47bfd9028831
2021-04-15 17:15:42 -07:00
David Riazati
3c6b52ae62 Cache slow/disabled test files (#55682)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55682

Fixes #55648

For now it downloads and writes the relevant files to the system's temp dir and marks them as valid for 3 hours.
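
A minimal sketch of the caching idea, with illustrative names (not the actual helper):

```python
import os
import tempfile
import time
import urllib.request

def fetch_cached(url: str, name: str, max_age: float = 3 * 60 * 60) -> str:
    path = os.path.join(tempfile.gettempdir(), name)
    fresh = os.path.exists(path) and time.time() - os.path.getmtime(path) < max_age
    if not fresh:
        urllib.request.urlretrieve(url, path)  # re-download once the cache expires
    return path
```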

Test Plan: Imported from OSS

Reviewed By: malfet, nikithamalgifb

Differential Revision: D27685616

Pulled By: driazati

fbshipit-source-id: 27469b85fe4b6b4addde6b22bf795bca3d4990ef
2021-04-12 09:17:07 -07:00
Mike Ruberry
399b66c813 Ports logdet from method_tests() to op_db (#55743)
Summary:
Per title. Also updates some tensor construction helpers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55743

Reviewed By: ngimel

Differential Revision: D27702060

Pulled By: mruberry

fbshipit-source-id: f64b7bee855733ad1f4fd182819ceec5831d9878
2021-04-11 20:39:16 -07:00
Alexander
6ee333cdb5 modernize test_sparse (#54572)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54572

Adding device generic tests to `test_sparse`.
Follow-up PR: #54153

I think it is ready for review.
Looking forward to your comments. cc mruberry

Thanks

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D27562663

Pulled By: mruberry

fbshipit-source-id: c48973e707f779b529bc7f61b75103194b428987
2021-04-09 12:19:29 -07:00
Jane Xu
2a24a2418a common_utils.py use new file names for disabled/slow tests (#55620)
Summary:
Following these changes in renaming the files:
https://github.com/pytorch/pytorch/pull/55618
https://github.com/pytorch/test-infra/pull/3

We should update the use sites in common_utils.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55620

Reviewed By: samestep

Differential Revision: D27651884

Pulled By: janeyx99

fbshipit-source-id: 298a981e55e0b7c95202294d9bc4b3fcce359590
2021-04-09 09:25:20 -07:00
Philip Meier
f4967d68f5 make torch.testing asserts importable (#54769)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54769

Follow-up to #53820. This

- makes the `asserts.py` module private as per suggestion from rgommers in https://github.com/pytorch/pytorch/pull/53820#issuecomment-802661387. With this the functions should only be accessible through `torch.testing`, giving us the option the change the underlying structure later.
- moves the code from `torch/testing/__init__.py` to `torch/testing/_core.py` (happy to accept other name suggestions). Otherwise we can't import the new `_asserts.py` in `torch/testing/__init__.py` due to circular imports.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D27438451

Pulled By: mruberry

fbshipit-source-id: c7292b4d5709185b42b4aac8016648562688040e
2021-04-07 23:53:02 -07:00
Jane Xu
2e9eb5afa2 Use slow tests stats in common_utils (#55190)
Summary:
This is a step in adding automatic slowTest detection to our testing infrastructure. This uses stats (updated daily) in https://github.com/pytorch/test-infra/blob/master/stats/.pytorch-slow-tests to determine whether more tests need to be marked as slow as they are run.

More details in previous PR draft/proposal [here](https://github.com/pytorch/pytorch/pull/54456#issue-598388491), though I no longer think we need the third step as using a raw git file does not require much processing.

Upon looking at [logs](https://circleci.com/api/v1.1/project/github/pytorch/pytorch/12060292/output/107/0?file=true&allocation-id=606660dbd8e5857bcc2b2e0f-0-build%2F60DCA8CD) for the coverage tests as of the first commit [when I had not skipped the tests so we could see their actual times], here are some slow tests that weren't marked as slow before:
```
test_fn_gradgrad_unfold_cpu_complex128 (__main__.TestGradientsCPU) (172.554s)
test_matmul_4d_4d_complex_cpu (__main__.TestAutogradDeviceTypeCPU) (180.057s)
test_conv1d_basic (__main__.TestXNNPACKConv1dTransformPass) (94.737s)
```

And here is a test that wasn't actually slow but was still marked as slow based on stats:
```
test_trunc_normal (__main__.TestNNInit) ... ok (1.208s)
```

The new logs show the above tests as skipped (as they should be):
[Coverage Test 1](https://app.circleci.com/pipelines/github/pytorch/pytorch/296224/workflows/ba6c2917-51f8-4fb8-be57-90151c2e5c25/jobs/12126156) and [Coverage Test 2](https://app.circleci.com/pipelines/github/pytorch/pytorch/296224/workflows/ba6c2917-51f8-4fb8-be57-90151c2e5c25/jobs/12126155)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55190

Reviewed By: samestep

Differential Revision: D27566663

Pulled By: janeyx99

fbshipit-source-id: c13f8c676bb8eb15d9d697d224dbaef7df98aef3
2021-04-07 08:04:39 -07:00
lezcano
fd02fc5d71 Port put_ and take from TH to ATen (#53356)
Summary:
The two ports were done together, as they can be implemented with the same kernel; in TH, they were already implemented with the same kernel.

Resolves https://github.com/pytorch/pytorch/issues/24751
Resolves https://github.com/pytorch/pytorch/issues/24614
Resolves https://github.com/pytorch/pytorch/issues/24640
Resolves https://github.com/pytorch/pytorch/issues/24772

This port makes sure that it interacts correctly with the "deterministic algorithms" flag, as done in https://github.com/pytorch/pytorch/pull/51388

This PR also makes these two functions correct in the following aspects (all of them added to the tests as well):
- Support for complex numbers
- Correct handling of scalar inputs and zero-dimensional inputs
- Implementation that does not do any copies nor sorting of any of the input tensors
- Faster and more correct implementation of the backwards (now it works as it should when `source.shape() != index.shape()`)
- Now `put_(..., accumulate=True)` is implemented correctly with atomic operations on GPU / CPU (when possible) and is deterministic (modulo the loss of precision that might happen due to the reordering of a sum of floats)
- Adds the `torch.put` function that was missing (`index_put` exists, for example)
- Corrected docs

It also adds a much more thorough testing to the operations and their gradients.
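
For orientation, both ops index the input as if it were flattened to 1-D (row-major order):

```python
import torch

src = torch.tensor([[1., 2.], [3., 4.]])
index = torch.tensor([0, 3])

print(torch.take(src, index))  # tensor([1., 4.])

src.put_(index, torch.tensor([10., 40.]), accumulate=True)
print(src)  # tensor([[11.,  2.], [ 3., 44.]])
```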

There is a BC-breaking change: now we check that the inputs do not overlap in the `put_` operation. This was handled in the TH implementation (for some of the cases; other cases were wrong) by making contiguous copies of the inputs. How should we handle this one?

**Edit.** Benchmarks:
<details>
<summary>Script</summary>

```python
from IPython import get_ipython
import torch
from itertools import product

torch.manual_seed(13)
torch.set_num_threads(1)

ipython = get_ipython()

cpu = torch.device('cpu')
cuda = torch.device('cuda')

def run_test(ndims, size, index_len, device, cmd):
    print(f"cmd: {cmd}, ndims: {ndims}, tensor_size: {size}, index_len: {index_len}, device: {device}")

    large_tensor = torch.rand(*([size] * ndims), device=device)
    small_tensor = torch.rand((index_len,), device=device)
    index = torch.randint(size * ndims, (index_len,), dtype=torch.long, device=device)
    if cmd == "put":
        command = "large_tensor.put_(index, small_tensor, accumulate=False)"
        if device == cuda:
            command += "; torch.cuda.synchronize()"
    elif cmd == "accumulate":
        command = "large_tensor.put_(index, small_tensor, accumulate=True)"
        if device == cuda:
            command += "; torch.cuda.synchronize()"
    elif cmd == "take":
        command = "torch.take(large_tensor, index)"
        if device == cuda:
            command += "; torch.cuda.synchronize()"
    ipython.magic(f"timeit {command}")
    print()

for method, device in product(["accumulate", "put", "take"], [cpu, cuda]):
    run_test(3, 1000, 10, device, method)
    run_test(3, 1000, 1000, device, method)
    run_test(3, 1000, 10000, device, method)
    run_test(2, 10000, 100000, device, method)
```
</details>

```python
put_(accumulate=False)
```

<details>
<summary>ATen CPU (1.5x - 2x speedup)</summary>

```python
cmd: put, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu
1.05 µs ± 2.35 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu
3.15 µs ± 5.13 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu
21.6 µs ± 13.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: put, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu
238 µs ± 781 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
</details>

<details>
<summary>TH CPU</summary>

```python
cmd: put, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu
722 ns ± 2.67 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu
4.89 µs ± 18.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu
42.5 µs ± 96.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: put, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu
428 µs ± 774 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
</details>
<details>
<summary>ATen GPU (same speed)</summary>

```python
cmd: put, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda
8.99 µs ± 16 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda
10.4 µs ± 24.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda
10.4 µs ± 11.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda
15.6 µs ± 1.12 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
</details>

<details>
<summary>TH GPU</summary>

```python
cmd: put, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda
8.44 µs ± 31.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda
9.09 µs ± 4.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda
9.77 µs ± 0.998 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda
15.8 µs ± 5.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
</details>

```python
put_(accumulate=True)
```

<details>
<summary>ATen CPU (x2 speedup)</summary>

```python
cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu
1.12 µs ± 2.91 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu
3.14 µs ± 2.05 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu
20.8 µs ± 25.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: accumulate, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu
264 µs ± 263 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
</details>

<details>
<summary>TH CPU</summary>

```python
cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu
814 ns ± 1.87 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu
5.11 µs ± 6.02 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu
43.9 µs ± 49.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: accumulate, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu
442 µs ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
</details>
<details>
<summary>ATen GPU (3x - 11x speedup)</summary>

```python
cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda
9.01 µs ± 14.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda
10.4 µs ± 15.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda
10.3 µs ± 44.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: accumulate, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda
12.6 µs ± 19 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
</details>

<details>
<summary>TH GPU</summary>

```python
cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda
34.7 µs ± 131 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda
38.2 µs ± 116 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda
61.2 µs ± 50.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: accumulate, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda
140 µs ± 24.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
</details>

```python
take()
```

<details>
<summary>ATen CPU (1.1x speedup)</summary>

```python
cmd: take, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu
1.18 µs ± 2.34 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu
2.79 µs ± 2.96 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu
16.6 µs ± 10.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu
161 µs ± 984 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
</details>

<details>
<summary>TH CPU</summary>

```python
cmd: take, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu
1.1 µs ± 3.14 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu
2.93 µs ± 7.31 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu
18.6 µs ± 14.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu
178 µs ± 139 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
</details>
<details>
<summary>ATen GPU (same speed)</summary>

```python
cmd: take, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda
9.38 µs ± 23.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda
10.7 µs ± 9.77 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda
10.6 µs ± 107 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda
11.5 µs ± 21.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
</details>

<details>
<summary>TH GPU</summary>

```python
cmd: take, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda
9.31 µs ± 7.57 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda
9.52 µs ± 5.78 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda
9.73 µs ± 17.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda
11.7 µs ± 5.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
</details>

cc mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53356

Reviewed By: mruberry

Differential Revision: D27520243

Pulled By: ngimel

fbshipit-source-id: e3979349c2c62d2949e09fb05e5fd4883fbc9093
2021-04-05 18:05:38 -07:00
Heitor Schueroff
6d87b3667f Added support for TensorList inputs in OpInfo (#54922)
Summary:
Stack:
* https://github.com/pytorch/pytorch/issues/54954 Fixed OpInfo jit tests failing for TensorList inputs
* __#54922 Added support for TensorList inputs in OpInfo__

Updated OpInfo to accept either a `Tensor` or `TensorList` as `sample.input` and added workarounds to make this work with gradcheck.

Note: JIT testing support for TensorList inputs will be added in a follow up PR.

Fixes https://github.com/pytorch/pytorch/issues/51996

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54922

Reviewed By: H-Huang

Differential Revision: D27448952

Pulled By: heitorschueroff

fbshipit-source-id: 3f24a56f6180eb2d044dcfc89ba59fce8acfe278
2021-03-31 04:42:10 -07:00
Edward Yang
b5ab348253 Fix missing format string qualifier (#54705)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54705

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D27338808

Pulled By: ezyang

fbshipit-source-id: b21c931c2306e525bc444766bc203bb303868dbf
2021-03-27 11:55:36 -07:00
Nikita Vedeneev
61b074581c torch.prod backward for complex types. (#48125)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53511
torch.det depends on torch.prod, which in turn depends on several other functions that themselves depend on torch.prod, so there is a circular relationship; hence this PR enables complex backward support for several functions at once.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48125

Reviewed By: pbelevich

Differential Revision: D27188589

Pulled By: anjali411

fbshipit-source-id: bbb80f8ecb83a0c3bea2b917627d3cd3b84eb09a
2021-03-19 09:44:08 -07:00
Edward Yang
a2a7179695 Fix bug in assertRaises NotImplemented handling when no exception is thrown (#54126)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54126

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: agolynski, mruberry

Differential Revision: D27109510

Pulled By: ezyang

fbshipit-source-id: ba5a4de85ca00f81724f3d4e645797e8f32aa3b1
2021-03-17 12:30:51 -07:00
Edward Yang
c2f41b6b84 Add meta device to generic device testing framework, skip NotImplementedError (#53682)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53682

With this, under the meta device, 101 tests passed and 16953 skipped.
It ain't much, but it's a start.

Some various bits and bobs:
- NotImplementedError suppression at test level is implemented
  in the same way as CUDA memory leak check, i.e., by wrapping
  test methods and monkeypatching them back in.
- I had to reimplement assertRaises/assertRaisesRegex from scratch to
  ignore NotImplementedError when _ignore_not_implemented_error is True.
  The implementation relies on a small amount of private API that hasn't
  changed since 2010
- expectedAlertNondeterministic doesn't really work so I skipped them
  all; there's probably a way to do it better

I tested this using `pytest --disable-warnings --tb=native -k meta --sw
test/*.py` and a pile of extra patches to make collection actually work
(lol).
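
For context, a tiny sketch of the meta device these tests run under: tensors carry only shape/dtype metadata, and an op either propagates shapes or raises NotImplementedError (which the suppression above turns into a skip):

```python
import torch

x = torch.empty(4, 4, device="meta")  # no storage, just metadata
y = x + x                             # shapes/dtypes propagate without real data
print(y.shape, y.device)              # torch.Size([4, 4]) meta
```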

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D26955539

Pulled By: ezyang

fbshipit-source-id: ac21c8734562497fdcca3b614a28010bc4c03d74
2021-03-14 20:41:19 -07:00
Edward Yang
d47d246206 Add 'noarch' tests which only run in one CI config (#53747)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53747

Fixes #53743

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D26971343

Pulled By: ezyang

fbshipit-source-id: cee7aa10063ae674f741406a3af830e4b4f128df
2021-03-14 20:39:07 -07:00