This PR fixes the instruction count benchmark:
1. Fixes the import path, which had been updated.
2. Allows building the benchmark with fewer compiler options (removes all `-W` options).
Test plan:
```
BENCHMARK_USE_DEV_SHM=1 python main.py --mode ci
```
Tested manually; works on the CI machine.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85359
Approved by: https://github.com/robieta
### Description
The `sys.executable` string does not account for whether the file is a symlink. This led to a false negative when checking whether two Python interpreters were the same while using an interpreter that was symlinked to another one.
Resolving the real path fixes the problem.
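A minimal sketch of the check (the helper name `same_interpreter` is illustrative, not the actual function in the benchmark code):
```
import os
import sys

def same_interpreter(other: str) -> bool:
    # Comparing against `sys.executable` directly yields a false negative
    # when one path is a symlink to the other; resolve both first.
    return os.path.realpath(other) == os.path.realpath(sys.executable)
```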
### Testing
Tested manually.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82734
Approved by: https://github.com/ngimel
We define specializations for pybind11-defined templates
(in particular, PYBIND11_DECLARE_HOLDER_TYPE), and consequently
it is important that these specializations *always* be #include'd
when making use of pybind11 templates whose behavior depends on
these specializations; otherwise we can cause an ODR violation.
The easiest way to ensure that all the specializations are always
loaded is to designate a header (in this case, torch/csrc/utils/pybind.h)
that ensures the specializations are defined, and then add a lint
to ensure this header is included whenever pybind11 headers are
included.
The existing grep linter didn't have enough knobs to do this
conveniently, so I added some features. I'm open to suggestions
for how to structure the features better. The main changes:
- Added an `--allowlist-pattern` flag, which turns off the grep lint
  for a file if some other line matching the allowlist pattern exists.
  This is used to stop the grep lint from complaining about pybind11
  includes when the utils include already exists.
- Added a `--match-first-only` flag, which makes grep report only the
  first matching line. Even if there are multiple problematic includes,
  I only need to fix one of them. We don't /really/ need this, but when
  I was running `lintrunner -a` to fix up the preexisting codebase it was
  annoying without it, since the overall lintrunner driver fails if there
  are multiple edits on the same file. (Both behaviors are sketched after
  this list.)
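A rough sketch of the two behaviors (illustrative only; the real logic lives in the grep linter adapter and differs in detail):
```
import re
from typing import List, Optional

def first_violation(lines: List[str], pattern: str,
                    allowlist_pattern: Optional[str]) -> Optional[int]:
    # --allowlist-pattern: if any line matches, suppress the lint entirely
    # (e.g. the torch/csrc/utils/pybind.h include is already present).
    if allowlist_pattern and any(re.search(allowlist_pattern, l) for l in lines):
        return None
    for i, line in enumerate(lines):
        if re.search(pattern, line):
            # --match-first-only: report only the first match, so lintrunner
            # never has to apply multiple edits to the same file.
            return i
    return None
```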
I excluded any files that didn't otherwise have a dependency on
torch/ATen; these were mostly caffe2 and the valgrind wrapper compat
bindings.
Note that the grep-based replacement is fairly crude, but the clang-tidy
lint cleaned it up in most cases.
See also https://github.com/pybind/pybind11/issues/4099
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82552
Approved by: https://github.com/albanD
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72667
Created by running
```
python -m libcst.tool codemod --no-format --jobs=1 convert_type_comments.ConvertTypeComments ~/fbsource/fbcode/caffe2/torch/utils/ --no-quote-annotations
```
and then manually cleaning up unreadable function headers (needed because the codemod does no autoformatting).
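For illustration, the codemod rewrites Python 2 style type comments into inline annotations, roughly like this (hypothetical example):
```
from typing import List

# Before: a type comment
def scale(values, factor):
    # type: (List[float], float) -> List[float]
    return [v * factor for v in values]

# After the codemod: inline annotations
def scale(values: List[float], factor: float) -> List[float]:
    return [v * factor for v in values]
```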
Test Plan:
Wait for CI. Type annotations are usually safe to land, but the JIT
compiler can sometimes choke if there's a problem.
Reviewed By: grievejia
Differential Revision: D34148011
fbshipit-source-id: 8f7c7a3b5ef78e0dea6d10ce70072f39e6d1ecc3
(cherry picked from commit 25a929ef8d)
Summary:
- Delete `-Wno-unused-variable` from the top-level `CMakeLists.txt`
- Still suppress those warnings for tests and `torch_python`
- Delete a number of unused variables from caffe2 code
- Use `(void)var;` to suppress unused variables in range loops
- Use `C10_UNUSED` for global constructors and use `constexpr` instead of `static` for global constants
- Do not delete `caffe2::OperatorBase::Output` calls, as they have side effects
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66041
Reviewed By: ngimel
Differential Revision: D31360142
Pulled By: malfet
fbshipit-source-id: 6fdfb9f91efdc49ca984a2f2a17ee377d28210c8
Summary:
PyTorch core does not depend on numpy, so the benchmarks should not depend on it either.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60564
Reviewed By: robieta
Differential Revision: D29497375
Pulled By: malfet
fbshipit-source-id: d9566e5b2e48868cef5568cd62f691af19ccf1f1
Summary:
Switches most of the simple for loops outside of `jit` directories to use `c10::irange`.
Generated with D28874212.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59481
Test Plan: Sandcastle
Reviewed By: ngimel
Differential Revision: D28909681
fbshipit-source-id: ec9ab1bd602933238d9d0f73d4d8d027b75d9d85
Summary:
The JIT will typically need two warmup runs to do profiling and optimization.
This is not a perfect solution, but it will substantially reduce the number of surprised people when the docs say that `torch.utils.benchmark.Timer` takes care of warmup.
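For example (a sketch; the extra warmup happens inside Timer, so nothing changes at the call site):
```
import torch
from torch.utils.benchmark import Timer

@torch.jit.script
def f(x):
    return x * 2 + 1

# Timer now runs the statement a couple of times before measuring, so the
# JIT's profiling and optimization passes are not counted in the timings.
t = Timer(stmt="f(x)", globals={"f": f, "x": torch.ones(100)})
print(t.blocked_autorange())
```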
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58801
Reviewed By: desertfire
Differential Revision: D28644244
Pulled By: robieta
fbshipit-source-id: cc54ed019e882a379d6e4a0c6a01fd5873dd41c3
Summary:
This is actually something I discovered a while ago with the wall of serotonin. It was really easy for large-scale runs to get bottlenecked on disk access. I have a hack in the working files of that machine to use `/dev/shm`, but I figured I should formalize it and make a respectable utility.
I also added a parameter to tweak the run cadence, and a printout when a CorePool is created; both are just to make the CI logs a bit nicer. (A printout every second on a 40-minute CI job is a bit much...)
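A hedged sketch of the core idea (the env-var name `BENCHMARK_USE_DEV_SHM` comes from the test plan quoted earlier in these notes; the helper itself is illustrative):
```
import os
import tempfile

def make_scratch_dir() -> str:
    # Prefer the RAM-backed /dev/shm for scratch files when requested and
    # available, so large-scale runs aren't bottlenecked on disk access.
    if os.getenv("BENCHMARK_USE_DEV_SHM") and os.path.isdir("/dev/shm"):
        return tempfile.mkdtemp(dir="/dev/shm")
    return tempfile.mkdtemp()
```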
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56711
Reviewed By: agolynski
Differential Revision: D28392248
Pulled By: robieta
fbshipit-source-id: b6aa7445c488d8e4ab9d4b31ab18df4e12783d8f
Summary:
As this diff shows, currently there are a couple hundred instances of raw `noqa` in the codebase, which just ignore all errors on a given line. That isn't great, so this PR changes all existing instances of that antipattern to qualify the `noqa` with respect to a specific error code, and adds a lint to prevent more of this from happening in the future.
Interestingly, some of the examples the `noqa` lint catches are genuine attempts to qualify the `noqa` with a specific error code, such as these two:
```
test/jit/test_misc.py:27: print(f"{hello + ' ' + test}, I'm a {test}") # noqa E999
test/jit/test_misc.py:28: print(f"format blank") # noqa F541
```
However, those are still wrong because they are [missing a colon](https://flake8.pycqa.org/en/3.9.1/user/violations.html#in-line-ignoring-errors), which actually causes the error code to be completely ignored:
- If you change those error codes to anything else, the warnings will still be suppressed.
- If you add the necessary colons then it is revealed that `E261` was also being suppressed, unintentionally:
```
test/jit/test_misc.py:27:57: E261 at least two spaces before inline comment
test/jit/test_misc.py:28:35: E261 at least two spaces before inline comment
```
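To make the distinction concrete (illustrative lines, not from the codebase):
```
# Blanket suppression (the antipattern): hides every error on the line.
print( "hello")  # noqa
# Qualified suppression: note the colon; hides only E201 (whitespace after '(').
print( "hello")  # noqa: E201
```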
I did try using [flake8-noqa](https://pypi.org/project/flake8-noqa/) instead of a custom `git grep` lint, but it didn't seem to work. This PR is definitely missing some of the functionality that flake8-noqa is supposed to provide, though, so if someone can figure out how to use it, we should do that instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56272
Test Plan:
CI should pass on the tip of this PR, and we know that the lint works because the following CI run (before this PR was finished) failed:
- https://github.com/pytorch/pytorch/runs/2365189927
Reviewed By: janeyx99
Differential Revision: D27830127
Pulled By: samestep
fbshipit-source-id: d6dcf4f945ebd18cd76c46a07f3b408296864fcb
Summary:
Generally wildcard imports are bad for the reasons described here: https://www.flake8rules.com/rules/F403.html
This PR replaces wildcard imports with an explicit list of imported items where possible, and adds a `# noqa: F403` comment in the other cases (mostly re-exports in `__init__.py` files).
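For instance (a hypothetical example of the transformation):
```
# Before (F403): the imported names cannot be traced to their source.
from torch.nn.functional import *

# After: an explicit list of imported items.
from torch.nn.functional import relu, softmax
```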
This is a prerequisite for https://github.com/pytorch/pytorch/issues/55816, because currently [`tools/codegen/dest/register_dispatch_key.py` simply fails if you sort its imports](https://github.com/pytorch/pytorch/actions/runs/742505908).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55838
Test Plan: CI. You can also run `flake8` locally.
Reviewed By: jbschlosser
Differential Revision: D27724232
Pulled By: samestep
fbshipit-source-id: 269fb09cb4168f8a51fd65bfaacc6cda7fb87c34
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54211
This was a little more annoying than expected, because the `exclude = ` key in `mypy.ini` is weird. I'll file an upstream issue about that.
I ignored one file, `torch/distributed/elastic/agent/server/api.py` that had ~8 errors that were hard to figure out. This can be done in a follow-up.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55712
Reviewed By: walterddr
Differential Revision: D27694976
Pulled By: malfet
fbshipit-source-id: 228d8be6af040343ce46595dabaca212e69ccc68
Summary:
- Fixes https://github.com/pytorch/pytorch/issues/54114
- Capped the estimated block size at the largest multiple of ten below C++ `INT_MAX` (sketched below)
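A sketch of the cap (illustrative; the real logic sits in the benchmark utility's block-size estimation):
```
INT_MAX = 2**31 - 1  # upper bound of a C++ int on typical platforms

def cap_block_size(estimated: int) -> int:
    # Clamp to the largest multiple of ten below INT_MAX so the C++
    # measurement loop's counter cannot overflow.
    return min(estimated, (INT_MAX // 10) * 10)
```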
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55200
Test Plan: the unit test no longer throws an exception, as expected
Reviewed By: robieta
Differential Revision: D27542652
Pulled By: naveedgol
fbshipit-source-id: 3ba68ce84d5fa1d8338cdd5c9f9e5d8c9adda51c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53295
A lot of the time spent in `collect_callgrind` is spinning up Valgrind and executing the initial `import torch`. In most cases the actual run loop is a much smaller fraction. As a result, we can reuse the same process to do multiple replicates and do a much better job amortizing that startup cost. This also tends to result in more stable measurements: the kth run is more repeatable than the first because everything has been given a chance to settle into a steady state. The instruction microbenchmarks lean heavily on this behavior. I found that in practice doing several `n=100` replicates to be more reliable than one monolithic 10,000+ iteration run. (Since rare cases like memory consolidation will just contaminate that one replicate, as opposed to getting mixed into the entire long run.)
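Usage looks roughly like this (requires Valgrind; the `repeats` argument name reflects this change and is best treated as an assumption):
```
import torch
from torch.utils.benchmark import Timer

t = Timer("y = x + 1", globals={"x": torch.ones(10)})
# One Valgrind process is reused for all replicates, amortizing startup
# and the initial `import torch`.
stats = t.collect_callgrind(number=100, repeats=10)
```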
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D26907093
Pulled By: robieta
fbshipit-source-id: 72e5b48896911f5dbde96c8387845d7f9882fdb2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53294
Just a bunch of little things, none of which are big enough to need a full PR.
1) C++ wall time should release the GIL
2) Add option to retain `callgrind.out` contents. This will allow processing with kCachegrind for more detailed analysis.
3) Stop subtracting the baseline instruction counts. (People just found it confusing when they saw negative instruction counts.) There is a finesse in #53295 that drops the baseline to ~800 instructions for `number=100`, and at that level it's not worth correcting.
4) Add a `__mul__` overload to function counts. e.g. if `c0` was run with `number=100` and `c1` with `number=200`, then `c0 * 2 - c1` is needed to diff them properly (sketched after this list). (Obviously there are correctness concerns, but I think it's fine as a caveat emptor convenience method.)
5) Tweak the `callgrind_annotate` call, since by default it filters very small counts.
6) Make some args keyword-only, since the types could be ambiguous otherwise.
7) Don't omit rows from slices. It was annoying to print something like `stats[:25]` and have `__repr__` hide the lines in the middle.
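A toy sketch of the arithmetic behind item (4), using plain counters instead of the real function-counts type:
```
from collections import Counter

c0 = Counter({"aten::add": 1_000})  # collected with number=100
c1 = Counter({"aten::add": 2_050})  # collected with number=200

# Scale c0 by 2 so both runs cover the same number of iterations, then diff.
scaled = Counter({fn: n * 2 for fn, n in c0.items()})
delta = {fn: c1[fn] - scaled[fn] for fn in c1}
print(delta)  # {'aten::add': 50}
```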
Test Plan: Imported from OSS
Reviewed By: Chillee
Differential Revision: D26906715
Pulled By: robieta
fbshipit-source-id: 53d5cd92cd17212ec013f89d48ac8678ba6e6228
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53293
Instruction count benchmarks need some includes for IValues, but this is also just generally useful. (Unlike Python, where you can just drop imports anywhere, C++ will get very upset if you `#include` in a function body...)
Test Plan: Imported from OSS
Reviewed By: Chillee
Differential Revision: D26906684
Pulled By: robieta
fbshipit-source-id: cbdfd79d3b8383100ff2e6857b6f309c387cbe2a
Summary:
Context: https://github.com/pytorch/pytorch/pull/53299#discussion_r587882857
These are the only hand-written parts of this diff:
- the addition to `.github/workflows/lint.yml`
- the file endings changed in these four files (to appease FB-internal land-blocking lints):
- `GLOSSARY.md`
- `aten/src/ATen/core/op_registration/README.md`
- `scripts/README.md`
- `torch/csrc/jit/codegen/fuser/README.md`
The rest was generated by running this command (on macOS):
```
git grep -I -l ' $' -- . ':(exclude)**/contrib/**' ':(exclude)third_party' | xargs gsed -i 's/ *$//'
```
I looked over the auto-generated changes and didn't see anything that looked problematic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53406
Test Plan:
This run (after adding the lint but before removing existing trailing spaces) failed:
- https://github.com/pytorch/pytorch/runs/2043032377
This run (on the tip of this PR) succeeded:
- https://github.com/pytorch/pytorch/runs/2043296348
Reviewed By: walterddr, seemethere
Differential Revision: D26856620
Pulled By: samestep
fbshipit-source-id: 3f0de7f7c2e4b0f1c089eac9b5085a58dd7e0d97
Summary:
`torch.__config__._cxx_flags` gets called on import, which means that Timer can't be used at all (even just the wall-time parts) if that call fails. This is needlessly restrictive.
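A hedged sketch of the shape of the fix (the actual code path is in the Timer internals):
```
def _try_get_cxx_flags():
    # Query the flags lazily and tolerate failure: wall-time measurement
    # doesn't need them, so a failing _cxx_flags() shouldn't break Timer.
    try:
        import torch
        return torch.__config__._cxx_flags()
    except Exception:
        return None
```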
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52124
Reviewed By: albanD
Differential Revision: D26395917
Pulled By: robieta
fbshipit-source-id: 4336a77dba131f80d386368ef715eed63c1cbcb4
Summary:
Add some much-needed documentation on the Timer callgrind output format, and expand what is shown on the website.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51664
Reviewed By: tugsbayasgalan
Differential Revision: D26246675
Pulled By: robieta
fbshipit-source-id: 7a07ff35cae07bd2da111029242a5dc8de21403c
Summary:
This adds benchmarking tooling for sparse tensors. To implement it, we extended the benchmarking utilities of https://github.com/pytorch/pytorch/pull/38338 to sparse tensors: the **FuzzedTensor** class was extended with a new **FuzzedSparseTensor** class, and two new operator classes were added, `UnaryOpSparseFuzzer` and `BinaryOpSparseFuzzer`.
The class `FuzzedSparseTensor` adds new input parameters to the constructor:
1. `sparse_dim`: The number of sparse dimensions in a sparse tensor.
2. `nnz`: Number of non-zero elements in the sparse tensor.
3. `density`: The density of the sparse tensor.
4. `coalesced`: whether the tensor is coalesced (the sparse tensor format permits both coalesced and uncoalesced tensors).
It also removes `probability_contiguous`, `max_allocation_bytes`, `roll_parameter`, and `tensor_constructor`, since those parameters are specific to dense tensors.
In addition, I've extended the `torch.utils.benchmark.examples` to work with the new classes `FuzzedSparseTensor`, `UnaryOpSparseFuzzer` and `BinaryOpSparseFuzzer`.
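A sketch of how the new class slots into a fuzzer (the import path and any argument names beyond the four listed above are assumptions based on the dense `FuzzedTensor` API):
```
from torch.utils.benchmark import Fuzzer, FuzzedParameter, FuzzedSparseTensor

fuzzer = Fuzzer(
    parameters=[
        FuzzedParameter("k", minval=16, maxval=16 * 1024, distribution="loguniform"),
    ],
    tensors=[
        FuzzedSparseTensor(
            name="x",
            size=("k", "k"),
            sparse_dim=2,     # number of sparse dimensions
            density=0.05,     # fraction of non-zero elements
            coalesced=True,   # emit coalesced tensors
        )
    ],
    seed=0,
)

for tensors, tensor_params, params in fuzzer.take(2):
    x = tensors["x"]  # a fuzzed sparse tensor ready for benchmarking
```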
Hopefully, this tooling and these examples will help in building other benchmarks in other PRs. Looking forward to your thoughts and feedback. cc robieta, mruberry, ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48397
Reviewed By: ejguan
Differential Revision: D26008137
Pulled By: mruberry
fbshipit-source-id: 2f37811c7c3eaa3494a0f2500e519267f2186dfb
Summary:
I have a hard time finding this tutorial every time I need it, so I'm sure other people have the same issue :D
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50374
Reviewed By: zhangguanheng66
Differential Revision: D25872173
Pulled By: albanD
fbshipit-source-id: f34f719606e58487baf03c73dcbd255017601a09
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47871
- FuzzedTensor now supports complex data types
- Compare no longer calls min on empty ranges when a table has empty cells
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D25237500
Pulled By: robieta
fbshipit-source-id: 76248647313d4d81590a68297a5f6768fa7d3d82
Summary:
Adds a standalone script which can be used to test different BLAS libraries. Right now I've deliberately kept it limited (only a couple BLAS libs and only GEMM and GEMV). It's easy enough to expand later.
CC ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47058
Reviewed By: zhangguanheng66
Differential Revision: D25078946
Pulled By: robieta
fbshipit-source-id: b5f7f7ec289d59c16c5370b7a6636c10a496b3ac
Summary:
CC: gchanan jspisak seemethere
I previewed the docs and they look reasonable. Let me know if I missed anything.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46880
Reviewed By: seemethere, izdeby
Differential Revision: D24551503
Pulled By: robieta
fbshipit-source-id: 627f73d3dd4d8f089777bca8653702735632b9fc