The problem:
- The new CustomOp API depends on torchgen.model
- torchgen.model imports `yaml`
- `yaml` is not a PyTorch runtime dependency
To unblock myself, because I'm not sure how long it'll take to
convince people yaml should be a PyTorch runtime dependency
(unless one of you wants to approve #100166), this PR removes the
yaml dependency from torchgen.model.
It does so by splitting torchgen.utils (the offender) into
torchgen.utils (no yaml) and torchgen.yaml (which uses yaml).
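For illustration, a minimal sketch of the new layering (the helper name below is hypothetical; the only point is where `import yaml` lives):
```python
# torchgen/yaml.py (illustrative contents): the one place that imports yaml.
import yaml

def load_yaml(text: str):
    # Thin wrapper so the rest of torchgen depends on torchgen.yaml,
    # not on PyYAML directly.
    return yaml.safe_load(text)

# torchgen/utils.py and torchgen/model.py no longer import yaml, so
# `from torchgen.model import DispatchKey` works without PyYAML installed.
```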
Test Plan:
- CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100203
Approved by: https://github.com/ezyang, https://github.com/Skylion007
I want to use torchgen to generate code, and my yaml file format is the same as `native_functions.yaml`.
I will use PrivateUse1, but I don't want to expose PrivateUse1 to the user in my yaml file.
So I want to achieve the following result (e.g. my device is `YPU`):
```
>>> from torchgen.model import DispatchKey
>>> str(DispatchKey.PrivateUse1)
"YPU"
>>> DispatchKey.parse("YPU")
DispatchKey.PrivateUse1
```
I also thought that not everyone would need this feature, so I added a new function to handle this scenario.
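A self-contained sketch of the idea (hypothetical names, not necessarily the exact function this PR adds to torchgen): a single module-level override that both the string and parse directions consult.
```python
from enum import Enum

class DispatchKey(Enum):
    # Toy stand-in for torchgen.model.DispatchKey
    CPU = "CPU"
    PrivateUse1 = "PrivateUse1"

_privateuse1_name = "PrivateUse1"

def rename_privateuse1(name: str) -> None:
    # Record the backend's user-facing name (e.g. "YPU").
    global _privateuse1_name
    _privateuse1_name = name

def key_to_str(key: DispatchKey) -> str:
    return _privateuse1_name if key is DispatchKey.PrivateUse1 else key.value

def parse_key(name: str) -> DispatchKey:
    return DispatchKey.PrivateUse1 if name == _privateuse1_name else DispatchKey(name)

rename_privateuse1("YPU")
assert key_to_str(DispatchKey.PrivateUse1) == "YPU"
assert parse_key("YPU") is DispatchKey.PrivateUse1
```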
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99406
Approved by: https://github.com/ezyang
simplify method_def generation
Summary:
This removes some duplication. This was originally done to streamline
a subsequent change, but that change turned out to be
misguided. Nevertheless, this is a nice simplification.
Test Plan:
This should change the code gen by removing some redundant
parentheses. Rely on CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100059
Approved by: https://github.com/ezyang
Implements a simple content-addressable store for storages (with tensors implemented as cheap references on top), enabling incremental serialization of tensors to disk, which I intend to use in the accuracy repro extractor. Check the comment at the top of torch/utils/_content_store.py for more details on the intended use case.
One major piece of this PR is implementing the content hash for tensors. For our prospective use case, we may need to repeatedly hash up to 80 GB of tensor data every time we snapshot (and we may snapshot multiple times). Using a conventional cryptographic hash and hashing each snapshot would likely take on the order of minutes, which seemed too slow to me. So instead, I implemented a crappy hash function that can be run on GPU. It is at least somewhat theoretically grounded: using random parameters generated by Philox, we use the standard shift-multiply and xor sum universal hash family. The hash function is a bit dorky though; instead of properly doing 160-bit math, it just runs the 32-bit hash five times and concatenates the results. By the way, this sets the first precedent for a kernel in the PyTorch library which MUST be torch.compile'd to be run (in fact, this kernel does not run in eager mode because of the use of xor_sum, which doesn't actually exist in ATen.)
I had to add a few more primitives to inductor, namely randint (over the entire int range) and xor_sum. Fortunately, these primitives are natively supported by Triton/C++, and so they were very easy to plumb through. xor_sum is exposed as a prim, while randint special cases on when low/high span the entire 32-bit signed integer range.
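A minimal eager-mode sketch of the hashing scheme (illustrative only; the function name and float32-only restriction are mine, and the actual kernel in torch/utils/_content_store.py draws its parameters from Philox and relies on the compiled xor_sum reduction, so this slow CPU version only mirrors the idea):
```python
import functools
import operator
import torch

def illustrative_content_hash(t: torch.Tensor, seed: int = 0) -> str:
    """Multiply-add-then-xor-reduce hash over the raw 32-bit words of a
    float32 tensor, run five times with independent random parameters and
    concatenated into a 160-bit hex digest."""
    assert t.dtype == torch.float32, "sketch only handles float32 storage"
    words = (t.detach().contiguous().view(-1).view(torch.int32).to(torch.int64)
             & 0xFFFFFFFF)
    g = torch.Generator().manual_seed(seed)
    digest = []
    for _ in range(5):
        # Random odd multiplier and offset, standing in for the Philox draws.
        a = int(torch.randint(0, 2**29, (1,), generator=g)) * 2 + 1
        b = int(torch.randint(0, 2**31, (1,), generator=g))
        mixed = ((words * a + b) & 0xFFFFFFFF).tolist()
        digest.append(functools.reduce(operator.xor, mixed, 0))
    return "".join(f"{h:08x}" for h in digest)
```
For a fixed seed the digest is deterministic, e.g. `illustrative_content_hash(torch.randn(1024))` yields a 40-character hex string.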
Thanks to Jeff Johnson for letting me bounce ideas off him on a Saturday morning lol.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99809
Approved by: https://github.com/voznesenskym
### This change
- Implements the ruff linter in pytorch lintrunner. It is adapted from https://github.com/justinchuby/lintrunner-adapters/blob/main/lintrunner_adapters/adapters/ruff_linter.py. It does **both linting and fixing**. 🔧
- Migrated all flake8 configs to the ruff config and enabled it for the repo. ✅
- **`ruff` lints the whole repo in under 2s** 🤯
Fixes https://github.com/pytorch/pytorch/issues/94737 Replaces #99280
@huydhn @Skylion007
### <samp>🤖 Generated by Copilot at 6b982dd</samp>
### Summary
🧹🛠️🎨
Add `[tool.ruff]` section to `pyproject.toml` to configure `ruff` code formatter and linter. This change aims to improve code quality and consistency with a single tool.
> _`ruff` cleans the code_
> _like a spring breeze in the fields_
> _`pyproject.toml`_
### Walkthrough
* Configure `ruff` code formatter and linter for the whole project ([link](https://github.com/pytorch/pytorch/pull/99785/files?diff=unified&w=0#diff-50c86b7ed8ac2cf95bd48334961bf0530cdc77b5a56f852c5c61b89d735fd711R22-R79))
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99785
Approved by: https://github.com/malfet, https://github.com/Skylion007
Share code between the paths that handle test results in parallel vs. serial mode.
Note that the original code had an inconsistency between the two modes: it would execute `print_to_stderr(err_message)` for every test that ran in parallel, but for serial tests it would only invoke `print_to_stderr(err_message)` if `continue_on_error` was also specified. By sharing code, this PR makes that behavior consistent between the two modes.
Also adding some comments.
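Roughly, the shared logic looks like this (a sketch with illustrative names, not the literal run_test.py code):
```python
import sys

def print_to_stderr(message: str) -> None:
    print(message, file=sys.stderr)

def handle_failure(test, err_message, failures, continue_on_error) -> bool:
    # Used by both the parallel and serial paths: always surface the error,
    # record the failing test, and tell the caller whether to keep going.
    print_to_stderr(err_message)
    failures.append(test)
    return continue_on_error
```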
### <samp>🤖 Generated by Copilot at 029342c</samp>
> _Sing, O Muse, of the skillful coder who refined_
> _The PyTorch testing script, `run_test.py`, and shined_
> _A light on its obscure logic, with docstrings and comments_
> _And made it run more smoothly, with better error contents_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99467
Approved by: https://github.com/huydhn, https://github.com/malfet
Summary: Removes the dependency on the unified YAML file
Test Plan:
Smoke test via some caffe2 tests.
```
buck2 run xplat/caffe2:supported_mobile_models_test
```
Build a major FoA app that uses model tracing and confirm it still works.
```
buck2 build fb4a
```
CI/CD for the rest. If operator tracing / bundling were broken, I'd hope the 1000+ tests spawned by this change would catch it.
Differential Revision: D44946368
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99122
Approved by: https://github.com/dhruvbird
The strategy is that we will heap allocate a LargeNegativeIntSymNodeImpl whenever we have a large negative int, so that we can keep the old `is_symbolic` test (now called `is_heap_allocated`) on SymInt. Whenever we need to do something with these ints, though, we convert them back into a plain `int64_t` (and then, e.g., wrap it in whatever user-specified SymNodeImpl they need.) We cannot wrap directly in the user-specified SymNodeImpl as we generally do not know what the "tracing context" is from C++. We expect large negative ints to be rare, so we don't apply optimizations like singleton-ifying INT_MIN. Here's the order to review:
* c10/core/SymInt.h and cpp
* `is_symbolic` renamed to `is_heap_allocated` as I needed to audit all use sites: the old `is_symbolic` test would return true for large negative int, but it would be wrong to then try to dispatch on the LargeNegativeIntSymNodeImpl which supports very few operations. In this file, I had to update expect_int,
* If you pass in a large negative integer, we instead heap allocate it in `promote_to_negative`. The function is written in a funny way to keep compact constructor code for SymInt (the heap allocation happens out of line)
* clone is now moved out-of-line
* New method maybe_as_int which will give you a constant int if it is possible, either because it's stored inline or in LargeNegativeIntSymNodeImpl. This is the preferred replacement for previous use of is_symbolic() and then as_int_unchecked().
* Rename toSymNodeImpl to toSymNode, which is more correct (since it returns a SymNode)
* Complete rewrite of `normalize_symints.cpp` to use new `maybe_as_int`. Cannot easily use the old code structure, so it's now done using a macro and typing out each case manually (it's actually not that bad.)
* Reimplementations of all the unary operators by hand to use `maybe_as_int`, relatively simple.
* c10/core/LargeNegativeIntSymNodeImpl.h - Just stores an int64_t value, but it has to be big and negative. Most methods are not implemented, since we will rewrap the large negative int in the real SymNodeImpl subclass before doing operations with it.
* The rest of the files are just rewriting code to use `maybe_as_int`. There is a nontrivial comment in c10/core/SymIntArrayRef.h
Very minor test adjustment in c10/test/core/SymInt_test.cpp. Plan to exercise this properly in the next PR.
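As a conceptual model of the new fast path (pure Python, purely illustrative; the real implementation is C++ in c10/core/SymInt.h):
```python
from typing import Optional

class SymIntModel:
    """Toy stand-in for SymInt after this change (names illustrative)."""

    def __init__(self, value: Optional[int] = None, node=None):
        self._value = value  # small/ordinary int stored inline
        self._node = node    # SymNode, or the large-negative-int wrapper

    def is_heap_allocated(self) -> bool:
        # Replaces the old is_symbolic(): true both for genuinely symbolic
        # nodes and for heap-allocated large negative constants.
        return self._node is not None

    def maybe_as_int(self) -> Optional[int]:
        # Preferred over is_symbolic() + as_int_unchecked(): returns a
        # concrete int whenever one exists, else None for symbolic values.
        if self._node is None:
            return self._value
        return getattr(self._node, "constant_int", None)
```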
Companion XLA PR: https://github.com/pytorch/xla/pull/4882
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99157
Approved by: https://github.com/albanD
Summary:
Original commit changeset: ba36f8751adc
Original Phabricator Diff: D44788697
Test Plan: model loading is fine after reverting the diff
Reviewed By: zyan0, sayitmemory
Differential Revision: D44921259
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99168
Approved by: https://github.com/izaitsevfb
If `CMAKE_GENERATOR=Visual Studio 16 2019`, the build will fail unless `USE_NINJA=False` is also set.
This PR changes the behavior so that if `CMAKE_GENERATOR` is set and not equal to Ninja, Ninja won't be used.
This just makes it easier to use another generator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98605
Approved by: https://github.com/kit1980
The higher-order derivative calculations of `max_pool2d` require indices to be provided, but the `mps_max_pool2d` kernel doesn't calculate them. Computing indices afterwards during backpropagation would be expensive and unnecessary, since users can directly call `max_pool2d` with `return_indices=True`, which calculates `indices` along the way.
This PR adds a warning for it.
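For example, the user-side call that makes indices available up front:
```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8, requires_grad=True)
# Ask for indices at pooling time instead of recomputing them later;
# the indices are what the higher-order derivative formulas need.
out, indices = F.max_pool2d(x, kernel_size=2, return_indices=True)
```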
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98582
Approved by: https://github.com/soulitzer
This PR addresses the issue seen in PR #97417, where the newly added op requires `kwargs`; currently tools/autograd/gen_annotated_fn_args.py does not support `kwargs`, and only `func_args` are generated for test_overrides.py.
The PR adds a new field `is_kwarg_only` to each argument, indicating whether it is a kwarg-only argument. See example:
```
annotated_args = {
    torch._C._VariableFunctions._cast_Byte: [{'is_kwarg_only': 'False', 'name': 'self', 'simple_type': 'Tensor'}],
    ...
```
The full comparison of the generated file `annotated_fn_args.py` can be found here:
- **Before**: [P681991116](https://www.internalfb.com/phabricator/paste/view/P681991116)
- **After**: [P681994218](https://www.internalfb.com/intern/paste/P681994218/)
Differential Revision: D44698310
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98396
Approved by: https://github.com/ezyang
We used to keep track of the average of stats; however, when we munge the data to find interesting insights, this makes things difficult (e.g. finding the total test time for an oncall). The pin is updated so that we keep track of the sum instead, along with an "occurrences" field, so that the average can be rederived from sum/occurrences.
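For example (field names illustrative), sums and occurrence counts aggregate cleanly across records, and the average can still be recovered at query time:
```python
# Two uploaded records for the same oncall (field names illustrative):
records = [
    {"oncall": "dynamo", "sum_duration_s": 120.0, "occurrences": 3},
    {"oncall": "dynamo", "sum_duration_s": 60.0, "occurrences": 2},
]

total = sum(r["sum_duration_s"] for r in records)  # 180.0 seconds overall
count = sum(r["occurrences"] for r in records)     # 5 test runs
average = total / count                            # 36.0, rederived on demand
```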
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98359
Approved by: https://github.com/huydhn
### <samp>🤖 Generated by Copilot at 79f1b37</samp>
This pull request improves the workflow and data processing for uploading contribution and testing statistics to Rockset and S3. It renames and updates a workflow file, removes unused code from a script, and adds a new script to aggregate and upload test results.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97954
Approved by: https://github.com/huydhn
Inductor codegen is suboptimal when calling all_reduce_coalesced with input args. We need to fix inductor's calling convention for that, or find another approach.
Might not work if any output is unused.
Test code:
```python
import torch
import torch.distributed as dist
import torch.nn.functional as F
from functorch import make_fx
import os
import torch.distributed._functional_collectives as ft_c
from torch.testing._internal.common_distributed import (
    spawn_threads_and_init_comms,
)
from torch._inductor.compile_fx import compile_fx_inner


def my_fun(a, b):
    c = a * 3
    tensors = ft_c.all_reduce_coalesced([a, c, b], "sum", [0])
    return ((tensors[1] + tensors[0] + tensors[2]).sum(), )


@spawn_threads_and_init_comms(world_size=1)
def inductor_main(self):
    x = torch.arange(4).cuda() * (dist.get_rank() + 1)
    y = torch.arange(4).cuda() * (dist.get_rank() + 1)
    x = x.to(torch.float)
    y = y.to(torch.float) * 0.5

    res = make_fx(my_fun)(x, y)
    print(f"fx graph:\n{res.graph}")

    ind = compile_fx_inner(res, [x, y])
    print(f"inductor done:\n{ind}")


os.environ["PROXY_TENSOR_TRACING"] = "1"
os.environ["TORCH_COMPILE_DEBUG"] = "1"
torch._dynamo.config.output_code = True

if __name__ == "__main__":
    inductor_main(None)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97157
Approved by: https://github.com/fegin
Among the changes is the introduction of gather_dim and scatter_dim in DeviceMesh collectives to simplify user code.
The current plan is to keep padding and gather/scatter dim support in DeviceMesh while we explore optimization opportunities in Inductor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96226
Approved by: https://github.com/wanchaol