Setting a timeout value when testing multiprocess DataLoader to prevent ASAN jobs timing out after 4 hours.
We are seeing multiple timeout issues running ASAN tests on HUD https://hud.pytorch.org/hud/pytorch/pytorch/master/1?per_page=50&name_filter=asan, for example:
* Without mem leak check enabled https://github.com/pytorch/pytorch/actions/runs/3794216079/jobs/6455118197
* With mem leak check https://github.com/pytorch/pytorch/actions/runs/3792743994/jobs/6449356306
Looking a bit closer into the test, the hang happens when a multiprocess DataLoader is used in `test_utils`. Here is a snapshot of those processes after logging into the hung runner:
```
UID PID PPID C STIME TTY TIME CMD
jenkins 1 0 0 Dec28 pts/0 00:00:00 bash
jenkins 8 0 0 Dec28 pts/1 00:00:00 sh -c pip install dist/torch-2.0.0a0+git97db9fd-cp37-cp37m-linux_x86_64.whl[opt-einsum] && .jenkins/pytorch/test.sh
jenkins 20 8 0 Dec28 pts/1 00:00:00 /bin/bash .jenkins/pytorch/test.sh
jenkins 764 20 0 Dec28 pts/1 00:00:07 python test/run_test.py --exclude-jit-executor --exclude-distributed-tests --shard 5 5 --verbose
jenkins 788 764 0 Dec28 pts/1 00:00:00 /opt/conda/bin/python -c from multiprocessing.semaphore_tracker import main;main(6)
jenkins 3743 764 0 Dec28 pts/1 00:00:05 /opt/conda/bin/python -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=7, pipe_handle=11) --multiprocessing-fork
jenkins 3766 3743 0 Dec28 pts/1 00:00:06 /opt/conda/bin/python -bb test_utils.py -v --import-slow-tests --import-disabled-tests
jenkins 3878 3766 0 Dec28 pts/1 00:00:06 /opt/conda/bin/python -bb test_utils.py -v --import-slow-tests --import-disabled-tests
jenkins 3879 3766 0 Dec28 pts/1 00:00:00 /opt/conda/bin/python -bb test_utils.py -v --import-slow-tests --import-disabled-tests
jenkins 3880 3766 0 Dec28 pts/1 00:00:00 /opt/conda/bin/python -bb test_utils.py -v --import-slow-tests --import-disabled-tests
jenkins 3881 3766 0 Dec28 pts/1 00:00:00 /opt/conda/bin/python -bb test_utils.py -v --import-slow-tests --import-disabled-tests
jenkins 3893 0 0 01:45 pts/2 00:00:00 /bin/bash
jenkins 3904 3893 0 01:46 pts/2 00:00:00 ps -ef
```
The specific hanging test was `test_random_seed`, which spawned 4 subprocesses to load data. After I killed one of them, the test continued and printed the following stack trace:
```
test_random_seed (__main__.TestDataLoaderUtils) ... [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
[W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
[W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
[W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
[W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
[W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
[W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
[W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
ERROR (9345.840s)
test_random_seed (__main__.TestDataLoaderUtils) ... test_random_seed errored - num_retries_left: 3
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1134, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/opt/conda/lib/python3.7/multiprocessing/queues.py", line 104, in get
    if not self._poll(timeout):
  File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 414, in _poll
    r = wait([self], timeout)
  File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 921, in wait
    ready = selector.select(timeout)
  File "/opt/conda/lib/python3.7/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 3878) is killed by signal: Terminated.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "test_utils.py", line 469, in test_random_seed
    x2 = run()
  File "test_utils.py", line 464, in run
    return next(iter(dataloader))
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 635, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1330, in _next_data
    idx, data = self._get_data()
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1296, in _get_data
    success, data = self._try_get_data()
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1147, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 3878) exited unexpectedly
[W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
[W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
[W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
[W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
[W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
[W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
[W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
[W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
ok (0.137s)
```
This doesn't fix the underlying issue; I'll still need to follow up to see why the workers hang. However, it should allow the test to terminate gracefully and report errors instead of timing out the whole job.
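For reference, a minimal sketch of what a DataLoader timeout looks like (values are illustrative; `timeout` bounds how long each batch fetch from the workers may block before raising instead of hanging forever):
```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(64, 3))
# timeout (seconds) bounds each batch fetch from the worker processes;
# if a worker hangs, the loader raises instead of blocking indefinitely.
loader = DataLoader(dataset, batch_size=8, num_workers=4, timeout=120)
next(iter(loader))
```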
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91476
Approved by: https://github.com/kit1980
Towards fixing https://github.com/pytorch/pytorch/issues/82482
This PR fixes two things:
## 1) memory leak
The `.detach()` call prevents a true memory leak in some cases where the user function uses multiple ops in a row that save their inputs. The following chain of objects keeps each other alive:
- the `storage` object
- a recomputed Tensor y
- y's grad_fn FooBackward (in c++)
- FooBackward's SavedVariables (in c++)
- SavedVariable Hook
- the `inner_pack` function
- captures `storage`
Since part of this cycle is in C++, the Python gc is not able to break it.
Should THPCppFunction_traverse actually visit its SavedVariables, which in turn should visit their hooks? I think the answer is yes, but I haven't dug into which Python object is traversing what; if there is non-unique ownership of the C++ object, the traversal gets a lot trickier. @ezyang do you think we should dive into this more?
In this case, this can be easily solved anyway by storing `y.detach()` in the `storage` object, as we don't care about the temporary backward graph that gets created during the second forward call.
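A minimal sketch of the idea, assuming a simplified pack/unpack hook pair (the real checkpoint code differs; `storage` and `inner_pack` are the names from the description above):
```python
import torch

storage = []

def inner_pack(t):
    # Saving t directly would keep its grad_fn alive and, through the C++
    # SavedVariables and their hooks, inner_pack and storage themselves:
    # a Python/C++ reference cycle that the Python gc cannot break.
    storage.append(t.detach())  # detach() drops the temporary backward graph
    return len(storage) - 1

def inner_unpack(idx):
    return storage[idx]

x = torch.randn(3, requires_grad=True)
with torch.autograd.graph.saved_tensors_hooks(inner_pack, inner_unpack):
    y = x.exp().sin()  # a chain of ops that save their inputs/outputs
y.sum().backward()
```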
## 2) Lifetime of the recomputed buffers
The new storage system is now such that the lifetime of the recomputed buffer is directly linked to the SavedVariable C++ object, meaning that this buffer gets deleted iff (if and only if) the SavedVariable is cleared.
This means that we now get the exact same behavior as the version without the saved variable hook where Tensors are saved directly on the SavedVariable object.
This is great as this solves all the cases where the non-checkpoint version used to work but the checkpoint version does not (even double access or retain_graph=True).
The one drawback of this approach, though, is that the buffers do NOT get cleared when the user passes in `retain_graph=True`! The next backward won't even re-run the forward, as it already has all the buffers available. Is this a problem that you think we need to find a solution for, @rohan-varma, or is it niche enough that we don't care for now?
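A sketch of that caveat, assuming the hook-based (non-reentrant) checkpoint variant this PR touches and an arbitrary scalar-producing `fn`:
```python
import torch
from torch.utils.checkpoint import checkpoint

def fn(t):
    return t.exp().sum()

x = torch.randn(4, requires_grad=True)
out = checkpoint(fn, x, use_reentrant=False)
out.backward(retain_graph=True)  # re-runs fn and fills the recompute buffers
out.backward()  # buffers were never cleared, so no recompute happens here
```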
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82776
Approved by: https://github.com/ezyang, https://github.com/rohan-varma
This functionality does not seem to be used, and there are some requests to update the dependency.
Add `third_party` to torch_cpu include directories if compiling with
Caffe2 support, as `caffe2/quantization/server/conv_dnnlowp_op.cc` depends on `third_party/fbgemm/src/RefImplementations.h`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75394
Approved by: https://github.com/janeyx99, https://github.com/seemethere
Summary:
Following triage review discussion, it would be best for these tests not to be triaged as high priority by automation, but by the triagers in the oncall.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74555
Reviewed By: albanD
Differential Revision: D35099202
Pulled By: janeyx99
fbshipit-source-id: 657a0317141de3a598476a6f601ec26cc26231b1
(cherry picked from commit 057519cb2494d0f9a0b169f359ac87ba9e89f088)
Summary:
Working towards https://docs.google.com/document/d/10yx2-4gs0gTMOimVS403MnoAWkqitS8TUHX73PN8EjE/edit?pli=1#
This PR:
- Ensure that all the submodules are listed in an rst file (this ensures they are considered by the coverage tool)
- Remove some long-deprecated code that just errors out on import
- Remove the allow list altogether to ensure nothing gets added back there
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73983
Reviewed By: anjali411
Differential Revision: D34787908
Pulled By: albanD
fbshipit-source-id: 163ce61e133b12b2f2e1cbe374f979e3d6858db7
(cherry picked from commit c9edfead7a01dc45bfc24eaf7220d2a84ab1f62e)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67779
Not all flaky failures from this test are URLErrors; I think we should
err on the side of being expansive with retries here.
Test Plan: Imported from OSS
Reviewed By: jamesr66a
Differential Revision: D32145434
Pulled By: suo
fbshipit-source-id: 3c3274b2080681fcafb3ea6132e420605f65c429
Summary:
Action following https://github.com/pytorch/pytorch/issues/66232
This change does require some context: there were several suggestions regarding what to do about this group of tests: tests that are core and crucial to all of PyTorch and are too broad to be owned by one team.
1. Let's add a "module: core" and put people behind it! This idea sounds appealing unless you are one of the people backing the label. From talking to albanD among others, this idea of putting all these core tests on the shoulders of a few people or one team isn't super fair, and I have not yet found anyone willing to take on this job.
2. Taking advantage of the fact that we already have a triaging oncall that takes turns triaging issues, we can leave these tests essentially unlabeled and allow the oncall to triage these tests. Since these tests are crucial to PyTorch, we'll add the "high priority" label to mark them different from other unowned tests (see https://github.com/pytorch/pytorch/issues/67552).
3. I _could_ still create an unbacked label "module: core" and attribute these tests there, but I don't like the idea of creating a facade that the tests are "triaged" to a label when no one is actually taking a look.
Now we could potentially break these tests down into smaller files so that each piece _could_ be owned by a team, but 1. I don't know if this is currently feasible and 2. This approach does not prevent that from happening in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67553
Reviewed By: albanD
Differential Revision: D32025004
Pulled By: janeyx99
fbshipit-source-id: 1fb1aa4c27e305695ab6e80ae3d02f90519939c0
Summary:
Closes https://github.com/pytorch/pytorch/issues/63753
This PR changes the assumption regarding the default branch of a repo to the following:
> If `main` exists then use `main`, otherwise use `master`
This will make torchhub more robust w.r.t. the ongoing changes where repos use `main` instead of `master` as the development / default branch.
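An illustration of the resulting behavior (repo names are just examples):
```python
import torch

# With this change, omitting the ref works regardless of whether the
# repo's default branch is `main` or `master`:
model = torch.hub.load('pytorch/vision', 'resnet18')

# An explicit ref can still be pinned with the `repo:ref` syntax:
model = torch.hub.load('pytorch/vision:main', 'resnet18')
```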
cc nairbv NicolasHug
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64364
Reviewed By: saketh-are
Differential Revision: D30731551
Pulled By: NicolasHug
fbshipit-source-id: 7232a30e956dcccca21933a29de5eddd711aa99b
Summary:
This PR adds more detailed error messages to torchhub if the commit hash validation goes wrong, providing suggestions to the users on how to resolve the issue.
It also documents why such validation is important.
EDIT: it also avoids validating some stuff when we know "stuff" isn't a commit, since there's no risk in this case
CC malfet mthrok
cc nairbv NicolasHug
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64362
Reviewed By: gchanan, malfet
Differential Revision: D30731191
Pulled By: NicolasHug
fbshipit-source-id: d1ee7c2ef2591dd7a5291977af1635ada2552d1b
Summary:
Fixes https://github.com/pytorch/pytorch/issues/64760
This should hopefully put the torchhub tests back.
This also avoids skipping the torchhub tests: currently the tests are skipped if they fail, which pretty much defeats the purpose of having a test in the first place since we're never notified when they do fail.
cc ezyang seemethere malfet lg20987 pytorch/pytorch-dev-infra nairbv NicolasHug
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64807
Reviewed By: seemethere
Differential Revision: D30994585
Pulled By: NicolasHug
fbshipit-source-id: 561782c22462b5cfec99cca153eb59623db5660a
Summary:
We currently build breakpad from [this fork](https://github.com/driazati/breakpad) to include extra logic to restore signal handlers that were previously present. With some [new additions](https://github.com/google/breakpad/compare/main...driazati:main) this fork now includes a CMake based build, so we can add breakpad as a proper dependency rather than rely on including it in Docker images as a system library which is error prone (we have a bunch of images) and hard to extend to MacOS / Windows. This also includes some changes to the crash handling code to support MacOS / Windows in a similar way to Linux.
```python
import torch
# On Windows this writes crashes to C:\Users\<user>\AppData\pytorch_crashes
# On MacOS/Linux this writes crashes to /tmp/pytorch_crashes
torch.utils._crash_handler.enable_minidumps()
# Easy way to cause a segfault and trigger the handler
torch.bincount(input=torch.tensor([9223372036854775807]))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63186
Reviewed By: malfet, seemethere
Differential Revision: D30318404
Pulled By: driazati
fbshipit-source-id: 0d7daf3701cfaba5451cc529a0730272ab1eb1dc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60990
This makes the breakpad build more explicit in its messaging and hints to cmake where to look for the library (it wasn't able to find it without `PATHS` on CI even though that works locally). This also adds a smoke test that will fail if breakpad isn't present on a CI job where it is expected (e.g. binary builds).
Test Plan: Imported from OSS
Reviewed By: malfet
Differential Revision: D29514316
Pulled By: driazati
fbshipit-source-id: 79514363334788f311ba5d4f25deed3452f0c3eb
Summary:
## Motivation
Allow an out-of-tree PyTorch plug-in, for a device type other than CUDA, to add its runtime interface to the `torch` module. The runtime interface of the device can then be referred to by the device type name in the `torch` module, e.g. `torch.cuda` or `torch.xpu`.
## Solution
- Add a register interface for the plug-in to add its platform Python module into the `torch` module under the device type name. E.g., `torch.xpu` can be used to refer to the XPU runtime interface after the XPU runtime module is registered with `torch._register_device_module('xpu', xpu_module)` in Intel's XPU plug-in.
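A minimal sketch of the registration flow; the `xpu_module` here is a stand-in stub for a real plug-in's runtime module:
```python
import types
import torch

# Stand-in stub for a real plug-in's runtime module (e.g. Intel's XPU
# runtime); on builds that already define torch.xpu this would raise.
xpu_module = types.ModuleType('xpu')
xpu_module.is_available = lambda: False  # illustrative stub

torch._register_device_module('xpu', xpu_module)
print(torch.xpu.is_available())  # the runtime is now reachable as torch.xpu
```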
## Additional Context
More details about runtime has been discussed in https://github.com/pytorch/pytorch/issues/53707.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59857
Reviewed By: mrshenli
Differential Revision: D29309320
Pulled By: ezyang
fbshipit-source-id: b9802a5f937ddef9e0bdaf2f7692dfe463912fbe
Summary:
During development it is common practice to put `type: ignore` comments on lines that are correct, but that `mypy` doesn't recognize as such. This often stems from the fact that the `mypy` version in use wasn't able to handle the given pattern.
With every new release `mypy` gets better at handling complex code. In addition to fixing all the previously accepted but now failing patterns, we should also revisit all `type: ignore` comments to see if they are still needed. Fortunately, we don't need to do this manually: by adding `warn_unused_ignores = True` to the configuration, `mypy` will error out when it encounters a `type: ignore` that is no longer needed.
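The configuration change itself is a single line; a sketch assuming a standard `mypy.ini` (the exact config file in this repo may differ):
```
[mypy]
warn_unused_ignores = True
```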
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60006
Reviewed By: jbschlosser, malfet
Differential Revision: D29133237
Pulled By: albanD
fbshipit-source-id: 41e82edc5cd5affa7ccedad044b59b94dad4425a
Summary:
This PR
* adds the breakpad build to most of the remaining docker images (except the mobile + slim ones)
* pins to a [fork of breakpad](https://github.com/google/breakpad/compare/master...driazati:master?expand=1) to enable daisy-chaining of signal handlers
* renames the API to be nicer
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59236
Reviewed By: malfet
Differential Revision: D28792511
Pulled By: driazati
fbshipit-source-id: 83723e74b7f0a00e1695210ac2620a0c91ab4bf2
Summary:
We should iterate over all pages of the branches API. Otherwise, even using "pytorch/vision" would fail to find master.
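A minimal sketch of the paginated iteration, assuming the `requests` library (the real implementation may use urllib; endpoint and page size are illustrative):
```python
import requests

def all_branches(repo: str) -> list:
    branches, page = [], 1
    while True:
        resp = requests.get(
            f"https://api.github.com/repos/{repo}/branches",
            params={"per_page": 100, "page": page},
        )
        resp.raise_for_status()
        data = resp.json()
        if not data:  # an empty page means we have seen every branch
            return branches
        branches.extend(b["name"] for b in data)
        page += 1
```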
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56138
Reviewed By: heitorschueroff
Differential Revision: D27872346
Pulled By: ailzhang
fbshipit-source-id: 55881558f7980b1fb08b0d08ed6687a38df06edd
Summary:
As this diff shows, currently there are a couple hundred instances of raw `noqa` in the codebase, which just ignore all errors on a given line. That isn't great, so this PR changes all existing instances of that antipattern to qualify the `noqa` with respect to a specific error code, and adds a lint to prevent more of this from happening in the future.
Interestingly, some of the examples the `noqa` lint catches are genuine attempts to qualify the `noqa` with a specific error code, such as these two:
```
test/jit/test_misc.py:27: print(f"{hello + ' ' + test}, I'm a {test}") # noqa E999
test/jit/test_misc.py:28: print(f"format blank") # noqa F541
```
However, those are still wrong because they are [missing a colon](https://flake8.pycqa.org/en/3.9.1/user/violations.html#in-line-ignoring-errors), which actually causes the error code to be completely ignored:
- If you change them to anything else, the warnings will still be suppressed.
- If you add the necessary colons then it is revealed that `E261` was also being suppressed, unintentionally:
```
test/jit/test_misc.py:27:57: E261 at least two spaces before inline comment
test/jit/test_misc.py:28:35: E261 at least two spaces before inline comment
```
I did try using [flake8-noqa](https://pypi.org/project/flake8-noqa/) instead of a custom `git grep` lint, but it didn't seem to work. This PR is definitely missing some of the functionality that flake8-noqa is supposed to provide, though, so if someone can figure out how to use it, we should do that instead.
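For reference, the difference the colon makes:
```python
# Missing colon: the code is ignored and ALL errors on the line are suppressed.
print(f"format blank")  # noqa F541

# Correct form: only F541 is suppressed (E261 etc. still fire).
print(f"format blank")  # noqa: F541
```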
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56272
Test Plan:
CI should pass on the tip of this PR, and we know that the lint works because the following CI run (before this PR was finished) failed:
- https://github.com/pytorch/pytorch/runs/2365189927
Reviewed By: janeyx99
Differential Revision: D27830127
Pulled By: samestep
fbshipit-source-id: d6dcf4f945ebd18cd76c46a07f3b408296864fcb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56048
This reverts commit c411017a41.
This implementation broke CI in pytorch/vision and it's not handling
tags properly. So I want to revert it first to unblock vision CI and
send out a proper fix later.
Test Plan: Imported from OSS
Reviewed By: gchanan
Differential Revision: D27771701
Pulled By: ailzhang
fbshipit-source-id: 932f9be72a1ae1816f4032643b3c2dde0cb7ae4c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52422
As mentioned in https://github.com/pytorch/pytorch/issues/52415,
`torch.utils.checkpoint` doesn't support checkpointing for functions which have
non-tensor inputs and outputs.
This PR resolves this issue by ensuring the autograd machinery ignores the
non-tensor inputs and outputs and processes the tensors accordingly.
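An illustrative example of what this enables (function and shapes are made up):
```python
import torch
from torch.utils.checkpoint import checkpoint

def masked_sum(x, mask, scale=1.0):
    # `mask` does not require grad and `scale` is not a tensor at all;
    # the autograd machinery now ignores both when checkpointing.
    return (x * mask).sum() * scale

x = torch.randn(4, requires_grad=True)
mask = torch.tensor([1.0, 0.0, 1.0, 1.0])
out = checkpoint(masked_sum, x, mask, 2.0)
out.backward()
```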
ghstack-source-id: 124406867
Test Plan:
1) unit test
2) waitforbuildbot
Reviewed By: albanD
Differential Revision: D26507228
fbshipit-source-id: 0a5a1591570814176185362e83ad18dabd9c84b0
Summary:
We want to store the file names that triggers each test suite so that we can use this data for categorizing those test files.
~~After considering several solutions, this one is the most backwards compatible, and the current test cases in test_testing.py for print test stats don't break.~~
The previous plan did not work, as there are multiple Python test jobs that spawn the same suites. Instead, the new S3 format will store test files (e.g., `test_nn` and `distributed/test_distributed_fork`) which will contain the suites they spawn, which will contain the test cases run within the suite. (Currently, there is no top layer of test files.)
Because of this major structural change, a lot of changes have now been made (thank you samestep!) to test_history.py and print_test_stats.py to make this new format backwards compatible.
Old test plan:
Make sure that the data is as expected in S3 after https://github.com/pytorch/pytorch/pull/52873 finishes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52869
Test Plan: Added tests to test_testing.py which pass, and CI.
Reviewed By: samestep
Differential Revision: D26672561
Pulled By: janeyx99
fbshipit-source-id: f46b91e16c1d9de5e0cb9bfa648b6448d979257e
Summary:
This fixes the previous erroring out by adding stricter conditions in cpp_extension.py.
To test, run a split torch_cuda build on Windows with `export BUILD_SPLIT_CUDA=ON && python setup.py develop` and then run the following test: `python test/test_utils.py TestStandaloneCPPJIT.test_load_standalone`. It should pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51596
Reviewed By: malfet
Differential Revision: D26213816
Pulled By: janeyx99
fbshipit-source-id: a752ce7f9ab9d73dcf56f952bed2f2e040614443
Summary:
_resubmission of gh-49654, which was reverted due to a cross-merge conflict_
This caught one incorrect annotation in `cpp_extension.load`.
xref gh-16574.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50278
Reviewed By: walterddr
Differential Revision: D25865278
Pulled By: ezyang
fbshipit-source-id: 25489191628af5cf9468136db36f5a0f72d9d54d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47973
Currently torch.Assert is not scriptable, which makes it not very useful for production code. According to jamesr66a, moving this to C++ op land will help with scriptability. This PR implements the change.
Note: with the current code, the Assert is scriptable but becomes a no-op after being scripted. Would love suggestions on how to address that (can be in a future PR).
Test Plan:
```
python test/test_utils.py TestAssert.test_assert_scriptable
python test/test_utils.py TestAssert.test_assert_true
python test/test_fx.py TestFX.test_symbolic_trace_assert
```
Reviewed By: supriyar
Differential Revision: D24974299
Pulled By: vkuzo
fbshipit-source-id: 20d4f4d8ac20d76eee122f2cdcdcdcaf1cda3afe
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47399
Currently torch.Assert is not scriptable, which makes it not very useful for production code. According to jamesr66a, moving this to C++ op land will help with scriptability. This PR implements the change.
Note: with the current code, the Assert is scriptable but becomes a no-op after being scripted. Would love suggestions on how to address that (can be in a future PR).
Test Plan:
```
python test/test_utils.py TestAssert.test_assert_scriptable
python test/test_utils.py TestAssert.test_assert_true
python test/test_fx.py TestFX.test_symbolic_trace_assert
```
Imported from OSS
Reviewed By: eellison
Differential Revision: D24740727
fbshipit-source-id: c7888e769c921408a3020ca8332f4dae33f2bc0e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47763
Changing the name due to the discussion in
https://github.com/pytorch/pytorch/pull/47399.
Test Plan:
```
python test/test_utils.py TestAssert.test_assert_true
python test/test_fx.py TestFX.test_symbolic_trace_assert
python test/test_fx_experimental.py
```
Imported from OSS
Reviewed By: ezyang
Differential Revision: D24891767
fbshipit-source-id: 01c7a5acd83bf9c962751552780930c242134dd2
Summary:
This PR just adds more polish to the benchmark utils:
1) `common.py`, `timer.py`, and `valgrind_wrapper/timer_interface.py` are now MyPy strict compliant. (except for three violations due to external deps.) Compare and Fuzzer will be covered in a future PR.
2) `CallgrindStats` now uses `TaskSpec` rather than accepting the individual fields which brings it closer to `Measurement`.
3) Some `__repr__` logic has been moved into `TaskSpec` (which `Measurement` and `CallgrindStats` use in their own `__repr__`s) for a more unified feel and less horrible f-string hacking, and the repr's have been given a cleanup pass.
4) `Tuple[FunctionCount, ...]` has been formalized as the `FunctionCounts` class, which has a much nicer `__repr__` than the raw tuple, as well as some convenience methods (`__add__`, `__sub__`, `filter`, `transform`) for easier DIY stat exploration; see the sketch after this list. (I find myself using the latter two a lot now.) My personal experience is that manipulating `FunctionCounts` is massively more pleasant than the raw tuples of `FunctionCount`. (Though it's still possible to get at the raw data if you want.)
5) Better support for multi-line `stmt` and `setup`.
6) Compare now also supports rowwise coloring, which is often the more natural layout for A/B testing.
7) Limited support for `globals` in `collect_callgrind`. This should make it easier to benchmark JIT models. (CC ZolotukhinM)
8) More unit tests, including extensive tests for the Callgrind stats manipulation APIs.
9) Mitigate issue with `MKL_THREADING_LAYER` when run in Jupyter. (https://github.com/pytorch/pytorch/issues/37377)
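A hedged sketch of the DIY exploration mentioned in item 4, with method names per the description above (the exact module path and stats accessor may differ across versions; `collect_callgrind` requires valgrind to be installed):
```python
from torch.utils._benchmark import Timer

stats = Timer(
    "torch.ones((8, 8)).sum()", setup="import torch"
).collect_callgrind()                  # a CallgrindStats
counts = stats.stats(inclusive=False)  # a FunctionCounts
ceval = counts.filter(lambda fn: "ceval" in fn)             # keep interpreter frames
names = counts.transform(lambda fn: fn.rsplit(":", 1)[-1])  # drop file paths
# and `counts_a - counts_b` (__sub__) diffs two runs, as in A/B testing
```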
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46023
Test Plan: changes should be covered by existing and new unit tests.
Reviewed By: navahgar, malfet
Differential Revision: D24313911
Pulled By: robieta
fbshipit-source-id: 835d4b5cde336fb7ff0adef3c0fd614d64df0f77
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45934
Per https://pytorch.org/docs/stable/checkpoint.html, pytorch checkpoint requires all inputs to the function being checkpointed to require grad, but this assumption is not necessarily true. Consider the following two examples:
```
output = MultiheadedMaskedAtten(input, mask)
output = LSTM(input, seq_length)
```
Both length and mask are tensors that won't require grad; currently, if you try to checkpoint, torch.autograd.backward will complain:
```
File "/mnt/xarfuse/uid-124297/7d159c34-seed-nspid4026531836-ns-4026531840/torch/autograd/function.py
", line 87, in apply
return self._forward_cls.backward(self, *args)
File "/mnt/xarfuse/uid-124297/7d159c34-seed-nspid4026531836-ns-4026531840/torch/utils/checkpoint.py"
, line 99, in backward
torch.autograd.backward(outputs, args)
File "/mnt/xarfuse/uid-124297/7d159c34-seed-nspid4026531836-ns-4026531840/torch/autograd/__init__.py
", line 132, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: element 1 of tensors does not require grad and does not have a grad_fn
```
This diff allows skipping the non-grad-requiring tensors when running autograd.backward.
Documentation for this feature was added as well.
Test Plan: added unit test to make sure partial tensor grads can be used in checkpoint().
Differential Revision: D24094764
fbshipit-source-id: 6557e8e74132d5a392526adc7b57b6998609ed12
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45586
Test Plan: The unit test has been softened to be less platform sensitive.
Reviewed By: mruberry
Differential Revision: D24025415
Pulled By: robieta
fbshipit-source-id: ee986933b984e736cf1525e1297de6b21ac1f0cf
Summary:
This PR allows Timer to collect deterministic instruction counts for (some) snippets. Because of the intrusive nature of Valgrind (effectively replacing the CPU with an emulated one) we have to perform our measurements in a separate process. This PR writes a `.py` file containing the Timer's `setup` and `stmt`, and executes it within a `valgrind` subprocess along with a plethora of checks and error handling. There is still a bit of jitter around the edges due to the Python glue that I'm using, but the PyTorch signal is quite good and thus this provides a low friction way of getting signal. I considered using JIT as an alternative, but:
A) Python specific overheads (e.g. parsing) are important
B) JIT might do rewrites which would complicate measurement.
Consider the following bit of code, related to https://github.com/pytorch/pytorch/issues/44484:
```
from torch.utils._benchmark import Timer
counts = Timer(
"x.backward()",
setup="x = torch.ones((1,)) + torch.ones((1,), requires_grad=True)"
).collect_callgrind()
for c, fn in counts[:20]:
print(f"{c:>12} {fn}")
```
```
812800 ???:_dl_update_slotinfo
355600 ???:update_get_addr
308300 work/Python/ceval.c:_PyEval_EvalFrameDefault'2
304800 ???:__tls_get_addr
196059 ???:_int_free
152400 ???:__tls_get_addr_slow
138400 build/../c10/core/ScalarType.h:c10::typeMetaToScalarType(caffe2::TypeMeta)
126526 work/Objects/dictobject.c:_PyDict_LoadGlobal
114268 ???:malloc
101400 work/Objects/unicodeobject.c:PyUnicode_FromFormatV
85900 work/Python/ceval.c:_PyEval_EvalFrameDefault
79946 work/Objects/typeobject.c:_PyType_Lookup
72000 build/../c10/core/Device.h:c10::Device::validate()
70000 /usr/include/c++/8/bits/stl_vector.h:std::vector<at::Tensor, std::allocator<at::Tensor> >::~vector()
66400 work/Objects/object.c:_PyObject_GenericGetAttrWithDict
63000 ???:pthread_mutex_lock
61200 work/Objects/dictobject.c:PyDict_GetItem
59800 ???:free
58400 work/Objects/tupleobject.c:tupledealloc
56707 work/Objects/dictobject.c:lookdict_unicode_nodummy
```
Moreover, if we backport this PR to 1.6 (just copy the `_benchmarks` folder) and load those counts as `counts_1_6`, then we can easily diff them:
```
print(f"Head instructions: {sum(c for c, _ in counts)}")
print(f"1.6 instructions: {sum(c for c, _ in counts_1_6)}")
count_dict = {fn: c for c, fn in counts}
for c, fn in counts_1_6:
_ = count_dict.setdefault(fn, 0)
count_dict[fn] -= c
count_diffs = sorted([(c, fn) for fn, c in count_dict.items()], reverse=True)
for c, fn in count_diffs[:15] + [["", "..."]] + count_diffs[-15:]:
print(f"{c:>8} {fn}")
```
```
Head instructions: 7609547
1.6 instructions: 6059648
169600 ???:_dl_update_slotinfo
101400 work/Objects/unicodeobject.c:PyUnicode_FromFormatV
74200 ???:update_get_addr
63600 ???:__tls_get_addr
46800 work/Python/ceval.c:_PyEval_EvalFrameDefault
33512 work/Objects/dictobject.c:_PyDict_LoadGlobal
31800 ???:__tls_get_addr_slow
31700 build/../aten/src/ATen/record_function.cpp:at::RecordFunction::RecordFunction(at::RecordScope)
28300 build/../torch/csrc/utils/python_arg_parser.cpp:torch::FunctionSignature::parse(_object*, _object*, _object*, _object**, bool)
27800 work/Objects/object.c:_PyObject_GenericGetAttrWithDict
27401 work/Objects/dictobject.c:lookdict_unicode_nodummy
24115 work/Objects/typeobject.c:_PyType_Lookup
24080 ???:_int_free
21700 work/Objects/dictobject.c:PyDict_GetItemWithError
20700 work/Objects/dictobject.c:PyDict_GetItem
...
-3200 build/../c10/util/SmallVector.h:at::TensorIterator::binary_op(at::Tensor&, at::Tensor const&, at::Tensor const&, bool)
-3400 build/../aten/src/ATen/native/TensorIterator.cpp:at::TensorIterator::resize_outputs(at::TensorIteratorConfig const&)
-3500 /usr/include/c++/8/x86_64-redhat-linux/bits/gthr-default.h:std::unique_lock<std::mutex>::unlock()
-3700 build/../torch/csrc/utils/python_arg_parser.cpp:torch::PythonArgParser::raw_parse(_object*, _object*, _object**)
-4207 work/Objects/obmalloc.c:PyMem_Calloc
-4500 /usr/include/c++/8/bits/stl_vector.h:std::vector<at::Tensor, std::allocator<at::Tensor> >::~vector()
-4800 build/../torch/csrc/autograd/generated/VariableType_2.cpp:torch::autograd::VariableType::add__Tensor(at::Tensor&, at::Tensor const&, c10::Scalar)
-5000 build/../c10/core/impl/LocalDispatchKeySet.cpp:c10::impl::ExcludeDispatchKeyGuard::ExcludeDispatchKeyGuard(c10::DispatchKey)
-5300 work/Objects/listobject.c:PyList_New
-5400 build/../torch/csrc/utils/python_arg_parser.cpp:torch::FunctionParameter::check(_object*, std::vector<pybind11::handle, std::allocator<pybind11::handle> >&)
-5600 /usr/include/c++/8/bits/std_mutex.h:std::unique_lock<std::mutex>::unlock()
-6231 work/Objects/obmalloc.c:PyMem_Free
-6300 work/Objects/listobject.c:list_repeat
-11200 work/Objects/listobject.c:list_dealloc
-28900 build/../torch/csrc/utils/python_arg_parser.cpp:torch::FunctionSignature::parse(_object*, _object*, _object**, bool)
```
Remaining TODOs:
* Include a timer in the generated script for cuda sync.
* Add valgrind to CircleCI machines and add a unit test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44717
Reviewed By: soumith
Differential Revision: D24010742
Pulled By: robieta
fbshipit-source-id: df6bc765f8efce7193893edba186cd62b4b23623
Summary:
This PR cleans up some of the rough edges around `Timer` and `Compare`
* Moves `Measurement` to be dataclass based
* Adds a bunch of type annotations. MyPy is now happy.
* Allows missing entries in `Compare`. This is one of the biggest usability issues with `Compare` right now, both from an API perspective and because the current failure mode is really unpleasant.
* Greatly expands the testing of `Compare`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45361
Test Plan: Changes to Timer are covered under existing tests, changes to `Compare` are covered by the expanded `test_compare` method.
Reviewed By: bwasti
Differential Revision: D23966816
Pulled By: robieta
fbshipit-source-id: 826969f73b42f72fa35f4de3c64d0988b61474cd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45188
This is a symbolically traceable alternative to Python's `assert`.
It should be useful to allow people who want to use FX to also
be able to assert things.
A bunch of TODO(before land) comments are inline; would love thoughts
on where the best place is for this code to live, and what this
function should be called (since `assert` is reserved).
Test Plan:
```
python test/test_fx.py TestFX.test_symbolic_trace_assert
```
Imported from OSS
Reviewed By: jamesr66a
Differential Revision: D23861567
fbshipit-source-id: d9d6b9556140faccc0290eba1fabea401d7850de
Summary:
I noticed that the recently introduced adaptive_autorange tests occasionally time out CI, and I've been meaning to improve the Timer tests for a while. This PR allows unit tests to swap the measurement portion of `Timer` with a deterministic mock so we can thoroughly test behavior without having to worry about flaky CI measurements. It also means that the tests can be much more detailed and still finish very quickly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45173
Test Plan: You're lookin' at it.
Reviewed By: ezyang
Differential Revision: D23873548
Pulled By: robieta
fbshipit-source-id: 26113e5cea0cbf46909b9bf5e90c878c29e87e88
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43622
- Moves the model loading part of `torch.hub.load()` into a new `torch.hub.load_local()` function that takes in a path to a local directory that contains a `hubconf.py` instead of a repo name.
- Refactors `torch.hub.load()` so that it now calls `torch.hub.load_local()` after downloading and extracting the repo.
- Updates `torch.hub` docs to include the new function + minor fixes.
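A hedged sketch of the new entry point as described above (the directory path is illustrative):
```python
import torch

# Load from a local checkout that contains a hubconf.py; no download happens:
model = torch.hub.load_local('/path/to/local/vision_repo', 'resnet18')
```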
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44204
Reviewed By: malfet
Differential Revision: D23817429
Pulled By: ailzhang
fbshipit-source-id: 788fd83c87a94f487b558715b2809d346ead02b2
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44219
Rebasing https://github.com/pytorch/pytorch/pull/44288 and fixing the git history.
This allows users to benchmark code without having to specify how long to run the benchmark. It runs the benchmark until the variance (IQR / median) is low enough that we can be confident in the measurement.
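A sketch of the resulting usage, assuming the then-private `torch.utils._benchmark` location used elsewhere in these notes:
```python
from torch.utils._benchmark import Timer

m = Timer(
    "torch.ones((128, 128)).sum()",
    setup="import torch",
).adaptive_autorange()  # keeps measuring until IQR / median is small enough
print(m)
```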
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44607
Test Plan: There are unit tests, and we manually tested using Examples posted in git.
Reviewed By: robieta
Differential Revision: D23671208
Pulled By: bitfort
fbshipit-source-id: d63184290b88b26fb81c2452e1ae701c7d513d12
Summary:
Move the timing utils to `torch.utils._benchmark`. I couldn't figure out how to get setuptools to pick it up and put it under `torch` unless it is in the `torch` directory. (And I think it has to be for `setup.py develop` anyway.)
I also modified the record function benchmark since `Timer` and `Compare` should always be available now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41506
Reviewed By: ngimel
Differential Revision: D22601460
Pulled By: robieta
fbshipit-source-id: 9cea7ff1dcb0bb6922c15b99dd64833d9631c37b
Summary:
`HTTPError` is raised when the server is overloaded, while `URLError` is raised when the network is not available.
Since `HTTPError` is an extension of `URLError`, catching `URLError` catches both exceptions.
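An illustration of the exception hierarchy this relies on:
```python
from urllib.error import URLError
from urllib.request import urlopen

try:
    urlopen("https://github.com/pytorch/hub")
except URLError as e:
    # Matches both a network failure (URLError) and a server-side
    # error such as 503 (HTTPError, a subclass of URLError).
    print(f"download failed, will retry: {e}")
```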
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39477
Differential Revision: D21873560
Pulled By: malfet
fbshipit-source-id: 11806671b768705465f562087521ad4887fd20f7
Summary:
Invoke `Popen.communicate` with `timeout` argument and kill the process in `TimeoutExpired` handler
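A minimal sketch of the pattern (command and timeout are illustrative):
```python
import subprocess

proc = subprocess.Popen(["git", "fetch"], stdout=subprocess.PIPE)
try:
    out, _ = proc.communicate(timeout=60)
except subprocess.TimeoutExpired:
    proc.kill()                   # reap the hung process...
    out, _ = proc.communicate()   # ...and collect whatever it produced
```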
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39191
Differential Revision: D21773510
Pulled By: malfet
fbshipit-source-id: 52b94315f8aa4d6c330dd5c9a8936100e49aef2d
Summary:
This updates assertEqual and assertEqual-like functions to either require both or neither of atol and rtol be specified. This should improve clarity around handling precision in the test suite, and it allows us to remove the legacy positional atol argument from assertEqual. In addition, the "message" kwarg is replaced with a kwarg-only "msg" argument whose name is consistent with unittest's assertEqual argument.
In the future we could make "msg" an optional third positional argument to be more consistent with unittest's assertEqual, but requiring it be specified should be clear, and we can easily update the signature to make "msg" an optional positional argument in the future, too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38872
Differential Revision: D21740237
Pulled By: mruberry
fbshipit-source-id: acbc027aa1d7877a49664d94db9a5fff91a07042
Summary:
Fixes https://github.com/pytorch/pytorch/issues/38401
* `torch.hub.load_state_dict_from_url()` now also downloads to `$TORCH_HOME/hub/checkpoints` instead of `$TORCH_HOME/checkpoints` like `torch.hub.load()` and others.
* Make `hub_dir` private, add and use `get_dir()` instead.
Also updated docs. Did not see a need for additional unit tests.
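The new public accessor, illustrated (the cache location depends on `$TORCH_HOME`):
```python
import torch

print(torch.hub.get_dir())   # e.g. ~/.cache/torch/hub
torch.hub.set_dir('/tmp/my_hub_cache')  # the setter counterpart
```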
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38969
Differential Revision: D21725880
Pulled By: ailzhang
fbshipit-source-id: 58cc6b32ddbda91e58c1c1433cc3916223556ea1
Summary:
This updates assertEqual and assertEqual-like functions to either require both or neither of atol and rtol be specified. This should improve clarity around handling precision in the test suite, and it allows us to remove the legacy positional atol argument from assertEqual. In addition, the "message" kwarg is replaced with a kwarg-only "msg" argument whose name is consistent with unittest's assertEqual argument.
In the future we could make "msg" an optional third positional argument to be more consistent with unittest's assertEqual, but requiring it be specified should be clear, and we can easily update the signature to make "msg" an optional positional argument in the future, too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38872
Differential Revision: D21717199
Pulled By: mruberry
fbshipit-source-id: 9feb856f94eee911b44f6c7140a1d07c1b026d3a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35615
Python 2 has reached end-of-life and is no longer supported by PyTorch.
Now we can clean up a lot of cruft that we put in place to support it.
These changes were all done manually, and I skipped anything that seemed
like it would take more than a few seconds, so I think it makes sense to
review it manually as well (though using side-by-side view and ignoring
whitespace change might be helpful).
Test Plan: CI
Differential Revision: D20842886
Pulled By: dreiss
fbshipit-source-id: 8cad4e87c45895e7ce3938a88e61157a79504aed
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34258
This PR allows both atol and rtol to be specified, uses defaults based on the prior analysis (spreadsheet attached to https://github.com/pytorch/pytorch/pull/32538), but retains the absolute tolerance behavior in cases where precision was previously specified explicitly.
Test Plan: Imported from OSS
Differential Revision: D21110255
Pulled By: nairbv
fbshipit-source-id: 57b3a004c7d5ac1be80ee765f03668b1b13f4a7e
Summary:
TensorBoard tests using SummaryWriter() may fail with a pandas import
complaint if TensorFlow packages are installed in the same python
environment as PyTorch:
```
Traceback (most recent call last):
  File "test_tensorboard.py", line 212, in test_writer
    with self.createSummaryWriter() as writer:
  File "test_tensorboard.py", line 64, in createSummaryWriter
    return SummaryWriter(temp_dir)
  ...
  File "[...]/site-packages/pandas/core/arrays/categorical.py", line 52, in <module>
    import pandas.core.algorithms as algorithms
AttributeError: module 'pandas' has no attribute 'core'
```
The exact failure may depend on the pandas version. We've also seen:
```
  File "[...]/site-packages/pandas/core/arrays/categorical.py", line 9, in <module>
    import pandas.compat as compat
AttributeError: module 'pandas' has no attribute 'compat'
```
The module import chain leading to the failure is: tensorboard imports
tensorflow, which imports tensorflow_estimator, which imports pandas.
pandas includes a submodule named 'bottleneck', whose name collides with
the PyTorch 'test/bottleneck/' subdirectory.
So IF tensorboard, tensorflow, tensorflow_estimator, and pandas are
installed in the python environment AND IF testing is run from within
PyTorch's 'test/' directory (or maybe just with 'test/' in PYTHONPATH,
etc.), then TensorBoard tests using SummaryWriter() will fail.
Rename the 'bottleneck/' directory slightly to avoid the name collision.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29650
Differential Revision: D19698638
Pulled By: ezyang
fbshipit-source-id: cb59342ed407cb37aefc833d67f768a8809129ac
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30445
Create distributed and rpc directories under caffe/test for better management
of unit tests.
Differential Revision: D18702786
fbshipit-source-id: e9daeed0cfb846ef68806f6decfcb57c0e0e3606
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31230
A major issue with distributed autograd currently is that we block an
RPC thread when we call Engine::execute_with_graph_task.
To resolve this issue, I've made modifications to the local autograd engine
such that `execute_with_graph_task` returns a Future instead. The `execute()`
methods for Engine::execute() and DistEngine::execute() still wait() on this
Future which ensures there is no change in behavior yet.
In follow up PRs we can modify the distributed autograd engine to take
advantage of this Future.
Closes #26359
ghstack-source-id: 96298057
Test Plan: waitforbuildbot
Differential Revision: D18999709
fbshipit-source-id: 388f54467fd2415a0acb7df17bd063aedc105229
Summary:
Passing variadic inputs to `checkpoint_sequential` was deprecated in https://github.com/pytorch/pytorch/issues/21006. This case warned with a `DeprecationWarning` in PyTorch 1.2, but should simply fail with a `TypeError` from PyTorch 1.3 on. This patch removes the `DeprecationWarning` for PyTorch 1.2.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25985
Differential Revision: D18809875
Pulled By: albanD
fbshipit-source-id: e84dd8629c04979c4b2dc63e8ada94292e8cedd0
Summary:
Resubmit of https://github.com/pytorch/pytorch/pull/25980.
Our old serialization was in tar format (e.g. `resnet18-5c106cde.pth` was in this format), so let's only support automatic unzipping if checkpoints are zipfiles.
We can still manage to get it to work with tarfiles, but let's delay that until there's an ask.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26723
Differential Revision: D17551795
Pulled By: ailzhang
fbshipit-source-id: 00b4e7621f1e753ca9aa07b1fe356278c6693a1e
Summary:
This PR does a few small improvements to hub:
- add support for a `verbose` option in `torch.load`. Note that this mutes the "hitting cache" message but keeps the first-download message, as suggested. Fixes https://github.com/pytorch/pytorch/issues/24791
- add support loading state dict from tar file or zip file in `torch.hub.load_state_dict_from_url`.
- add `torch.hub.download_url_to_file` as public API, and add BC bit for `_download_url_to_file`.
- makes the hash check in the filename optional through `check_hash`; many users don't have control over the naming, and relaxing this constraint could potentially avoid duplicating download code on the user end.
- move pytorch CI off `pytorch/vision` and use `ailzhang/torchhub_example` as a dedicated test repo. fixes https://github.com/pytorch/pytorch/issues/25865
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25980
Differential Revision: D17495679
Pulled By: ailzhang
fbshipit-source-id: 695df3e803ad5f9ca33cfbcf62f1a4f8cde0dbbe