Commit Graph

80790 Commits

Author SHA1 Message Date
Sam Larsen
cb15c15157 [logging] Overhaul dynamo_timed and CompilationMetrics logging. (#139849)
Here's the overview:

There's a new contextmanager singleton called MetricsContext. Entering the MetricsContext is how we demarcate the boundary on which we'll create a single CompilationMetrics object, and therefore, a single dynamo_compile log entry. While we're inside the MetricsContext, we can update/set many different metrics. Most importantly: `dynamo_timed` can also update the in-progress MetricsContext. In the proposal here, we tell `dynamo_timed` that we want it to do so by providing the name of the MetricsContext field to increment. There can be many `dynamo_timed` calls in different parts of the code updating different fields. Then when the MetricsContext exits, that's when the logging of everything gathered finally happens. One potential footgun is trying to use `dynamo_timed` when we haven't entered the MetricsContext, but we assert on that problem. Another problem is that we re-enter the context recursively, but we watch for that and do the logging only when the outermost exits.

Some specifics:
* Introduce MetricsContext - a context manager that on exit, records the CompilationMetrics (which also logs to dynamo_compile).
* Completely remove the concept of frame_phase_timing. Instead, update the MetricsContext during compilation, either directly or via dynamo_timed.
* Remove some globals we previously used to accumulate counters to later populate a CompilationMetrics. We use CompilationMetrics set/update/increment APIs instead.
* `record_compilation_metrics` is now called on exit from MetricsContext.
* Populate legacy CompilationMetrics fields right before logging, inside `record_compilation_metrics`.
* Remove the one-off `add_remote_cache_time_saved` helper; capture that timing directly into the MetricsContext.

And specifically, several changes to dynamo_timed:
* "Modernize" the parameters and update all callsites accordingly.
* Move the backwards logging of the CompilationMetrics to the backwards compile location.
* Add a parameter for which CompilationMetrics field to update
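
The behaviour described above (one log entry per outermost context, timers that increment a named field, an assertion when no context is active) can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `SketchMetricsContext`, `sketch_timed`, and the field names `backend_compile_ns` and `inductor_compile_ns` are hypothetical.

```python
import time
from contextlib import contextmanager


class SketchMetricsContext:
    """Hypothetical re-entrant context that reports once, on outermost exit."""

    def __init__(self, on_exit):
        self._on_exit = on_exit  # callback that receives the gathered metrics
        self._depth = 0
        self._metrics = {}

    def __enter__(self):
        if self._depth == 0:
            self._metrics = {}  # fresh state for each outermost entry
        self._depth += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        self._depth -= 1
        if self._depth == 0:
            self._on_exit(dict(self._metrics))  # log only when outermost exits
        return False

    def in_progress(self):
        return self._depth > 0

    def increment(self, field, value):
        # Guard against the footgun: updating metrics with no context entered.
        assert self.in_progress(), "metrics updated outside the context"
        self._metrics[field] = self._metrics.get(field, 0) + value


@contextmanager
def sketch_timed(ctx, field):
    """Time a region and add the elapsed nanoseconds to `field` in `ctx`."""
    start = time.perf_counter_ns()
    try:
        yield
    finally:
        ctx.increment(field, time.perf_counter_ns() - start)


logged = []
ctx = SketchMetricsContext(logged.append)
with ctx:
    with sketch_timed(ctx, "backend_compile_ns"):
        with ctx:  # recursive re-entry does not produce a second log entry
            with sketch_timed(ctx, "inductor_compile_ns"):
                sum(range(1000))
```

After the outermost `with` exits, `logged` holds a single dictionary containing both fields.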

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139849
Approved by: https://github.com/ezyang
ghstack dependencies: #140094
2024-11-11 14:24:23 +00:00
Xiaodong Wang
565a7942ee Recover non-standard bool test for msort (#139870)
Summary:
I was looking into why the non-standard bool value will fail for msort - it makes sense for argsort and sort to fail, because we're randomly generating uint8 so the order will be different (and thus the indices will be different). But msort should work.

After some digging, it's interesting that even though scalar_t is bool, when the actual value is a uint8_t, the comparison will treat them as signed. I tried lhs=255 and rhs=0: lhs < rhs is equivalent to -1 < 0 which is true (but it's supposed to be False)

Therefore we add an explicit type cast.
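
The arithmetic behind the failure can be reproduced outside of C++. The snippet below only illustrates the signed reinterpretation described above; it is not the cast used in the PR, which the message does not show.

```python
import ctypes

raw_lhs, raw_rhs = 255, 0  # a non-standard "true" byte and a standard "false" byte

# Reading the same bits as signed 8-bit values turns 255 into -1.
signed_lhs = ctypes.c_int8(raw_lhs).value
signed_rhs = ctypes.c_int8(raw_rhs).value
signed_less = signed_lhs < signed_rhs

# Casting each operand to a proper boolean first restores the expected ordering.
cast_less = bool(raw_lhs) < bool(raw_rhs)
```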

Test Plan: Remove the test skip

Differential Revision: D65472170

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139870
Approved by: https://github.com/Skylion007, https://github.com/davidberard98
2024-11-11 02:00:34 +00:00
Yifu Wang
2f3a5a15ef [SymmetricMemory] improve the API for stream_write_value32 (#139934)
This PR updates the binding for `stream_write_value32` to be consistent with `memset32`, which IMO makes more sense for this type of utility:
- Changed the API to take a uint32 tensor as argument, instead of a device pointer
- Changed the Python binding to be a static method of `_SymmetricMemory`, instead of an object method
- Use the dispatcher for device dispatching, as opposed to `SymmetricMemory` backends

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139934
Approved by: https://github.com/weifengpy
ghstack dependencies: #139227
2024-11-11 01:54:35 +00:00
cyy
ffb979032d [7/N] Fix Wextra-semi warning (#140225)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140225
Approved by: https://github.com/ezyang
2024-11-10 14:28:10 +00:00
Zhenbin Lin
d90c25e3e2 OpenReg: Support event (#140111)
Support events. Since cpu backend doesn't support asynchronous execution, all event operations will be executed immediately on the executor side.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140111
Approved by: https://github.com/ezyang
2024-11-10 08:38:45 +00:00
Yutao Xu
c3087ace58 Update torch-xpu-ops commit pin (#139986)
Update the torch-xpu-ops commit to [5e29831](https://github.com/intel/torch-xpu-ops/commit/5e29831). Includes:
- OneAPI-2025 build issue fix
- Enhancement of the XPU operator coverage

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139986
Approved by: https://github.com/guangyey, https://github.com/jansel
2024-11-10 06:49:38 +00:00
CaoE
94c9bb73c0 [Inductor] [CPP] Update BRGEMM parameters for Half cpp gemm template (#140116)
Update BRGEMM parameters for Half cpp gemm template as BRGEMM api is changed https://github.com/pytorch/pytorch/pull/138184.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140116
Approved by: https://github.com/jansel
2024-11-10 06:37:10 +00:00
Sam Larsen
4f6b30bcbc Add testing for the utils surrounding dynamo_timed (#140094)
Summary: This will make it easier to verify that we don't break these utilities for the refactor in https://github.com/pytorch/pytorch/pull/139849.
It's one giant test. I can split it into multiple for better readability if ppl prefer that. My rationale for the giant test is that I found I was just resetting compilation and recompiling the same thing many times, which was slow and wasteful.

Test Plan: The new tests

Differential Revision: [D65682138](https://our.internmc.facebook.com/intern/diff/D65682138)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140094
Approved by: https://github.com/ezyang
2024-11-10 04:17:45 +00:00
zeshengzong
5ef33e40b3 Add size param check of unfold (#139965)
Fixes #76617

Changes:

- Add check of input `size` value, give user friendly hint message
- fix `FIXME: move to shape ops test suite` in test file

Before
```python
import torch
x = torch.arange(1., 8)
x.unfold(0, -1, 1)

Traceback (most recent call last):
  File "/home/zong/code/unfold.py", line 12, in <module>
    x.unfold(0, -1, 1)
RuntimeError: Storage size calculation overflowed with sizes=[9, -1] and strides=[1, 1]

```

After
```python
import torch
x = torch.arange(1., 8)
x.unfold(0, -1, 1)

Traceback (most recent call last):
  File "/home/zong/code/pytorch/../unfold.py", line 12, in <module>
    x.unfold(0, -1, 1)
RuntimeError: size is -1 but must be >= 0
```
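
The shape of such an early check can be sketched in plain Python. `check_unfold_size` is a hypothetical helper written for illustration, not the validation code added to PyTorch; only the error text is taken from the output above.

```python
def check_unfold_size(size: int) -> None:
    """Reject negative sizes early (hypothetical helper, not PyTorch's code)."""
    if size < 0:
        raise RuntimeError(f"size is {size} but must be >= 0")


try:
    check_unfold_size(-1)
    message = None
except RuntimeError as err:
    message = str(err)
```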

Test Result:
```bash
pytest test/test_shape_ops.py
```

![image](https://github.com/user-attachments/assets/d7bcef62-04e6-4187-9c8f-bc5220ff6c33)

```bash
$ lintrunner
```

![image](https://github.com/user-attachments/assets/6b48d095-5c8a-4e75-9957-dc22d39a73bb)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139965
Approved by: https://github.com/ezyang
2024-11-09 17:12:53 +00:00
atalman
f89b2b9630 Refactor conda-builder -> almalinux-builder (#140157)
This changes the conda-builder workflow to almalinux-builder and switches Docker file to almalinux.
Please note: Published conda-builder images will still be available, hence workflows that use these images will still work.
We will be switching workflows that use conda-builder images to almalinux-builder

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140157
Approved by: https://github.com/malfet
2024-11-09 16:06:40 +00:00
cyy
7d4f5f7508 [Environment Variable][6/N] Use thread-safe getenv functions (#140200)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140200
Approved by: https://github.com/ezyang
2024-11-09 15:05:51 +00:00
Nikita Shulga
a2ac96cae0 [BE] Rectify some references to caffe2 (#140204)
- Rename `tools.build_pytorch_libs.build_caffe2` to `tools.build_pytorch_libs.build_pytorch`
- Delete a number of `if BUILD_CAFFE2` conditions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140204
Approved by: https://github.com/huydhn, https://github.com/r-barnes, https://github.com/atalman
2024-11-09 14:14:20 +00:00
fduwjj
5107d244ee [c10d][Logging] Remove args and kwargs from c10d logging (#140169)
This PR is trying to reland https://github.com/pytorch/pytorch/pull/139804

We no longer want to log args and kwargs directly: if they contain a tensor or tensor subclass, converting them to a string can take a lot of time or may not be supported at all.
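
A rough sketch of the idea, assuming a decorator-style exception logger: `log_exceptions`, the logger name, and the `Expensive` stand-in are all hypothetical and are not the c10d implementation.

```python
import functools
import logging

logger = logging.getLogger("sketch_c10d_logger")  # hypothetical logger name


def log_exceptions(func):
    """Log the failing function's name and error type, never its arguments."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as error:
            # Deliberately avoid str(args) / str(kwargs): formatting large
            # objects can be slow or unsupported.
            logger.error("%s failed: %s", func.__qualname__, type(error).__name__)
            raise

    return wrapper


class Expensive:
    """Stand-in for a tensor whose string conversion we want to avoid."""

    def __init__(self):
        self.formatted = False

    def __repr__(self):
        self.formatted = True
        return "Expensive()"


@log_exceptions
def broken_collective(payload):
    raise ValueError("boom")


payload = Expensive()
try:
    broken_collective(payload)
except ValueError:
    pass
```

The error is still logged and re-raised, but `payload.formatted` stays `False` because the argument is never converted to a string.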

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140169
Approved by: https://github.com/wz337, https://github.com/kwen2501
2024-11-09 13:57:32 +00:00
Yu, Guangye
052b67e2b4 Add torch.version.xpu (#139466)
# Motivation
We add a new attribute `torch.version.xpu` to facilitate problem diagnosis and version control.

# Additional Context
It is aligned with `torch.version.cuda` and `torch.version.hip`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139466
Approved by: https://github.com/EikanWang, https://github.com/ezyang, https://github.com/atalman, https://github.com/malfet
ghstack dependencies: #139258
2024-11-09 13:31:21 +00:00
Yu, Guangye
8051ee802c Add XPU compiler version control in cmake to keep BC (#139258)
# Motivation
This PR aims to maintain backward compatibility when building PyTorch XPU with the old and new compilers.

# Additional Context
The details are described here. The new compiler (2025.0.0) has some breaking changes compared with the old compiler (2024.1), for example:
1. On Windows, the sycl library is named `sycl7.lib` in the old compiler but is named `sycl.lib` in the new compiler.
2. On Linux, in order to support ABI=0, we have to link `libsycl-preview.so` in the old compiler but we could link `libsycl.so` in the new compiler to have the same ABI compatibility.
3. We added a macro `SYCL_COMPILER_VERSION` so that our new code stays backward compatible with the old compiler. The new features (Event elapsed_time, memory summary, and device architecture property) introduced by the new compiler are now guarded by the macro `SYCL_COMPILER_VERSION`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139258
Approved by: https://github.com/EikanWang, https://github.com/atalman, https://github.com/gujinghui
2024-11-09 13:31:21 +00:00
xinan.lin
191971e01d [AOTI] Introduce an extensibility mechanism for the c shim codegen to make it easy to produce c shims for out-of-tree OP kernels as well. Add c_shim for XPU. (#136742)
### Motivation
The current c shim codegen only produces C wrappers for ops registered in `aten/src/ATen/native/native_functions.yaml`. When some ops for the same backend are out-of-tree and registered externally instead, for example in `third_party/torch-xpu-ops/yaml/native_functions.yaml`, the existing codegen cannot extend the c shims already produced for the in-tree ops with those out-of-tree ops.

### Design
To extend the c shim for a backend with more out-of-tree ops, this PR provides a bool option `--aoti-extend` that tells the codegen to extend the c shim from out-of-tree.
The generated c shim is stored in the `extend` subdirectory, for example:
```
torch/include/torch/csrc/inductor/aoti_torch/generated/c_shim_xpu.h
torch/include/torch/csrc/inductor/aoti_torch/generated/c_shim_xpu.cpp
torch/include/torch/csrc/inductor/aoti_torch/generated/extend/c_shim_xpu.h
torch/include/torch/csrc/inductor/aoti_torch/generated/extend/c_shim_xpu.cpp
```
Example usage:
`python -m torchgen.gen --source-path third_party/torch-xpu-ops/yaml/ --xpu --aoti-extend --update-aoti-c-shim`
`--xpu`: generate c shim for XPU
`--aoti-extend`: the out-of-tree ops (defined in `third_party/torch-xpu-ops/yaml/native_functions.yaml`) extend the in-tree ops (defined in `aten/src/ATen/native/native_functions.yaml`)
`--update-aoti-c-shim`: always generate c_shim_xpu.h for the extend c_shim.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136742
Approved by: https://github.com/EikanWang, https://github.com/desertfire
ghstack dependencies: #139025
2024-11-09 13:19:52 +00:00
xinan.lin
929a647363 [Intel GPU] Support RegisterXPU.cpp codegen and compile for the in-tree XPU structured GEMM OPs. (#139025)
Motivation: There are two parts of aten ops for XPU: one is in-tree ops, like GEMM-related ops, and the other is out-of-tree ops in torch-xpu-ops. For the in-tree part, since PyTorch uses native_functions.yaml registration and is equipped with convenient codegen capabilities, we want to take advantage of these benefits as well.
At the same time, since AOT Inductor also uses native_functions.yaml to generate c shim wrappers, we also need to enable this mechanism for XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139025
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/desertfire
2024-11-09 13:09:27 +00:00
Andrea Frittoli
0b650c360a Build magma for windows (#139924)
Copy the magma for windows job and script from pytorch/builder c9aac65e12/.github/workflows/build-magma-windows.yml

The linux version is moved here in https://github.com/pytorch/pytorch/pull/139888

Fixes #140001

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139924
Approved by: https://github.com/atalman
2024-11-09 09:27:59 +00:00
Boyuan Feng
e2e425b4f3 [CUDAGraph] Add dynamo timer to checkpoint, warmup, and record (#139818)
Summary: Add time log to cudagraph, including `create deferred_cudagraphify wrapper`, `warmup`, `record`, and `checkpoint`.

Test Plan:
1. buck2 run fbcode//mode/opt //pytorch/benchmark:run -- resnet50 -d cuda -t train --inductor --pt2-triton-cudagraph

2. Found the result in [scuba table](https://fburl.com/scuba/pt2_compile_events/0oik8nu9).

Differential Revision: D65505659

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139818
Approved by: https://github.com/eellison
2024-11-09 05:27:11 +00:00
cyy
ab55a99283 Use TORCH_DECLARE_XXX (#139952)
Because those files use TORCH_API

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139952
Approved by: https://github.com/ezyang
2024-11-09 04:56:28 +00:00
Kefei Lu
d2d1258b1b Speed up AMD AOT Inductor lowering by memoizing hipify trie to regex logic (#140156)
Summary:
AMD lowering duration is 1.55x longer than H100. Profiling shows hipification related functions took 22% of overall lowering time.

This diff cuts that time by safely memoizing the trie to regex logic. The trick is to incrementally build a state of the trie during the trie construction. The state is the hash of all the words added to the trie.
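
A minimal sketch of the technique, under stated assumptions: `SketchTrie` is a hypothetical class rather than the hipify trie, the cache is a plain class-level dictionary, and hash collisions are ignored. The state here also depends on insertion order, so two tries built from the same words in a different order would simply miss the cache.

```python
import re


class SketchTrie:
    """Trie whose regex export is memoized on a hash of the inserted words."""

    _cache = {}  # state hash -> regex string, shared by all instances

    def __init__(self):
        self.root = {}
        self.state = 0  # incrementally updated hash of every word added

    def add(self, word):
        node = self.root
        for char in word:
            node = node.setdefault(char, {})
        node[""] = {}  # the empty key marks the end of a word
        self.state = hash((self.state, word))

    def _to_regex(self, node):
        # One alternative per child; an "" key means a word may end here.
        alternatives = []
        optional = False
        for char, child in sorted(node.items()):
            if char == "":
                optional = True
            else:
                alternatives.append(re.escape(char) + self._to_regex(child))
        if not alternatives:
            return ""
        body = "(?:" + "|".join(alternatives) + ")"
        return body + "?" if optional else body

    def pattern(self):
        # Rebuild the regex only when this exact state has not been seen.
        if self.state not in self._cache:
            self._cache[self.state] = self._to_regex(self.root)
        return self._cache[self.state]


first = SketchTrie()
second = SketchTrie()
for word in ("hip", "hipify"):
    first.add(word)
    second.add(word)

regex = first.pattern()                  # computed once
same_object = second.pattern() is regex  # served from the cache
```

Because both tries reach the same state, the second call returns the cached string instead of walking the trie again.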

Differential Revision: D65659445

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140156
Approved by: https://github.com/ColinPeppler

Co-authored-by: Kefei Lu <kefeilu@meta.com>
2024-11-09 04:28:58 +00:00
Michael Lazos
8b2e3855a9 Make size a property with an assertion (#139794)
Fixes https://github.com/pytorch/pytorch/issues/120568

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139794
Approved by: https://github.com/williamwen42
2024-11-09 03:39:41 +00:00
cyy
032135f8a2 [2/N] Turn inline static functions into static (#140068)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140068
Approved by: https://github.com/ezyang
2024-11-09 03:31:24 +00:00
Bob Ren
3b8470c461 add special case for __round__ constant variables (#139583)
Fixes `PYTORCH_TEST_WITH_INDUCTOR=1 tlp python test/test_torch.py TestTorchDeviceTypeCUDA.test_cauchy_cuda_float64` when specialize_float=False

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139583
Approved by: https://github.com/ezyang
ghstack dependencies: #139569, #139457, #139568, #139572, #139846, #139454, #139896, #139935, #139587
2024-11-09 03:25:53 +00:00
Florian (Feuermagier)
f915409c26 FlopCounterMode: Decompose ops for inference mode (#138508)
Fixes #126268

I've basically followed @ezyang's suggestion (I think) to use `func.decompose(...)`. Since `__torch_dispatch__` won't be called a second time for the same op, I've added a second `TorchDispatchMode` (`_DecomposedCounterMode`) that simply dispatches to the parent flop counter. Using `self` as the inner context manager is not possible, since the second call to `__enter__` would re-initialize the counter's tracking state.

Let me know if there's something wrong with this implementation, since I'm quite unsure how the decomposition thing actually works :D

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138508
Approved by: https://github.com/ezyang
2024-11-09 03:13:53 +00:00
Bob Ren
4488e23763 Fix another item memo loss location + bool specialization bug (#139587)
This fix was a bit more involved:
1) It fixes an item_memo loss place.
2) It updates a test to be eager instead of aot_eager since it reveals a very obscure bug related to replacements that's not worth solving since in practice inductor will regenerate the runtime asserts anyways
3) It updates tensorify to specialize more places now that the aforementioned bug is fixed.

Fixes `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=6 python test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoCPU.test_comprehensive_linalg_norm_cpu_float16` when `specialize_float=False`

while ensuring `python test/dynamo/test_dynamic_shapes.py DynamicShapesMiscTests.test_runtime_assert_replacement_dynamic_shapes` doesn't regress

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139587
Approved by: https://github.com/ezyang
ghstack dependencies: #139569, #139457, #139568, #139572, #139846, #139454, #139896, #139935
2024-11-09 03:11:19 +00:00
wz337
4893e248a8 [DTensor][Test] Remove safe global context for weights_only torch.load() DTensor (#140173)
We have added DTensor related classes to allowed globals so we can torch.load(DTensor) with weights_only=True. So we don't need the safe_globals context for this test anymore.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140173
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #139949
2024-11-09 02:21:44 +00:00
Andrea Frittoli
72976b2486 Use manylinux-builder images with main tag (#140158)
The magma build uses deprecated manylinux-builder images. Update it to use the images with "main" in the tag:

  pytorch/manylinux-builder:cuda<version>-main

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140158
Approved by: https://github.com/atalman
2024-11-09 02:16:00 +00:00
Zhou, Lingzhi
2ede4c9a38 [Partitioner] Enumerate partitions by iterating partition ids (#136598)
Currently, we collect all partition ids by iterating over `assignment`, whose size is the same as the number of nodes in the graph. We can reach the same result by iterating over `partitions_by_id`, whose size is much smaller than the number of nodes. With N nodes and P partitions, the time complexity decreases from O(N * N) to O(N * P) after this patch.
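
The difference between the two enumerations can be sketched as follows. The names `assignment` and `partitions_by_id` come from the description above, but their shapes here (node name to partition id, and partition id to node list) are assumptions made for illustration.

```python
# Assumed shapes: node -> partition id, and partition id -> list of nodes.
assignment = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 1}
partitions_by_id = {0: ["a", "b"], 1: ["c", "d", "e"]}

# Old approach: visit every node's assignment (N = 5 entries).
ids_via_nodes = sorted(set(assignment.values()))

# New approach: read the keys of the partition table (P = 2 entries).
ids_via_partitions = sorted(partitions_by_id)
```

Both approaches yield the same ids; the second one touches P entries instead of N each time the ids are enumerated.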

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136598
Approved by: https://github.com/ezyang

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-11-09 01:31:46 +00:00
Joel Schlosser
9c678af9f9 Misc. non-contig NJT fixes (#140160)
This PR contains several fixes related to non-contiguous NJTs:
1. Propagates `lengths` through op calls appropriately (see desc of #138098)
    * SDPA now calls `nested_view_from_values_offsets_lengths()` instead of `nested_view_from_values_offsets()`
2. Allows non-contig NJTs in unsqueeze / transpose / select
3. Expands padded dense -> NJT conversion to support non-contig NJTs
4. (unrelated sorry) Updates `split` / `split_with_sizes` to allow for optional `dim`, matching the ATen signature
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140160
Approved by: https://github.com/cpuhrsch
2024-11-09 01:18:26 +00:00
William Wen
be172d2a60 [pt2, docs] Add new PT2 troubleshooting doc (#138620)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138620
Approved by: https://github.com/ezyang

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2024-11-09 01:17:39 +00:00
Ryan Guo
de40a23f6c [dynamo] Remove dead code path for capturing __class__ in UserFunctionVariable (#140034)
This was introduced in https://github.com/pytorch/torchdynamo/commit/d0c10341
as limited support for pre-existing cells, since we know `__class__` wouldn't be modified
in most cases. It's no longer needed now that we have much more support for these cells.

Example:
```python
class Foo():
    def __init__(self):
        super().__init__()

print(Foo.__init__.__code__.co_freevars) # ('__class__',)
print(Foo.__init__.__closure__)          # (<cell at 0x1011fb310: type object at 0x10fe185b0>,)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140034
Approved by: https://github.com/williamwen42, https://github.com/anijain2305, https://github.com/jansel
ghstack dependencies: #140033
2024-11-09 01:03:24 +00:00
Ryan Guo
0b8652a999 [dynamo] Remove NestedUserFunctionVariable.closure_scope (#140033)
This was no longer needed after https://github.com/pytorch/torchdynamo/commit/663e4d92,
which removed the uses of `closure_scope` but not the field itself.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140033
Approved by: https://github.com/williamwen42, https://github.com/anijain2305, https://github.com/jansel
2024-11-09 01:03:24 +00:00
cyy
263d8f7a94 [8/N] Don't skip ASAN on some tests (#140081)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140081
Approved by: https://github.com/ezyang
2024-11-09 01:00:13 +00:00
PyTorch MergeBot
58b661cda2 Revert "[c10d][Logging] Remove args and kwargs from c10d logging (#140169)"
This reverts commit e3b2f04f05.

Reverted https://github.com/pytorch/pytorch/pull/140169 on behalf of https://github.com/ZainRizvi due to Man, this test really wants to fail on trunk. Sorry. Details:  distributed/test_c10d_logger.py::C10dErrorLoggerTest::test_exception_logger [GH job link](https://github.com/pytorch/pytorch/actions/runs/11751023962/job/32740983427) [HUD commit link](e3b2f04f05) ([comment](https://github.com/pytorch/pytorch/pull/140169#issuecomment-2465933413))
2024-11-09 00:23:43 +00:00
Peter Steinbach
090b778b8a Clarify meaning of rate parameter in Gamma distribution (#134847)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134847
Approved by: https://github.com/fritzo
2024-11-09 00:22:13 +00:00
PyTorch MergeBot
7eb66173e2 Revert "Fix split decomp returning self (#140065)"
This reverts commit 9d99dceb53.

Reverted https://github.com/pytorch/pytorch/pull/140065 on behalf of https://github.com/ZainRizvi due to Diff been imported internally, but merged externally. And the internal diff has been updated so the diff and PR are now mismatched.  Reverting this PR to get things back into a consistent state. See D65635070 ([comment](https://github.com/pytorch/pytorch/pull/140065#issuecomment-2465928027))
2024-11-09 00:16:26 +00:00
Mengwei Liu
a02e88d19c [miniz] Bump miniz version to 3.0.2 and add patch for zip64 (#140041)
Summary:
Bump miniz version from 2.1.0 to 3.0.2 and apply these patches:

* #79636 patches internal BUCK and bazel build
* #138959 adds `bool compute_crc32` argument
* miniz PR: https://github.com/richgel999/miniz/pull/324 to support
  zip64

Anyone bumping miniz version again, please apply these patches as well.

Test Plan:
Rely on unit test

Imported from OSS

Differential Revision: D65586230

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140041
Approved by: https://github.com/mikaylagawarecki
2024-11-09 00:13:16 +00:00
PyTorch MergeBot
1400fedf76 Revert "add supports_coalescing property in c10d::Backend to determine whether backend supports coalescing (#135338)"
This reverts commit e5574445b0.

Reverted https://github.com/pytorch/pytorch/pull/135338 on behalf of https://github.com/ZainRizvi due to Sorry but this is failing internally. Please see D65663382 for more details ([comment](https://github.com/pytorch/pytorch/pull/135338#issuecomment-2465911854))
2024-11-08 23:52:49 +00:00
Michael Lazos
ea0f60ecfa [Dynamo] allow dynamic callables on tensor variables (#137940)
Fixes https://github.com/pytorch/pytorch/issues/134844

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137940
Approved by: https://github.com/williamwen42
2024-11-08 23:49:34 +00:00
PyTorch MergeBot
beae7725be Revert "Tighten type hints for tensor arithmetic (#135392)"
This reverts commit d378819068.

Reverted https://github.com/pytorch/pytorch/pull/135392 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. See D65641103 for more details ([comment](https://github.com/pytorch/pytorch/pull/135392#issuecomment-2465906839))
2024-11-08 23:44:41 +00:00
Haifeng Jin
2af5172774 fix dynamo tracking numpy 2 ops (#138686)
Fixes #136559
As we upgrade to NumPy 2, torch falsely filtered out `numpy.random` as unsupported in dynamo tracking.
This PR changes the filtering rules to include them while keeping behavior with numpy 1 unchanged.

Before this PR, the following tests failed:

```
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/dynamo/test_functions.py -k FunctionTests.test_numpy_random
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/dynamo/test_unspec.py -k UnspecTests.test_to_tensor
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/test_fake_tensor.py -k FakeTensorTest.test_export_numpy
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/test_fake_tensor.py -k PropagateRealTensorsFakeTensorTest.test_export_numpy_propagate_real_tensors
```

With this PR, the supported/unsupported ops in NumPy 1 are not changed.
For NumPy 2, only the `numpy.random` ops that are already supported with NumPy 1 are added to the supported list.

I used the following scripts to check the differences before and after the change for both NumPy 1 & 2.
The output is empty for NumPy 1 since there is no change.
The output is a list of the `numpy.random` functions that are considered supported for NumPy 2.

```py
from torch._dynamo import trace_rules
import numpy as np

def new_numpy_function_ids():
    unsupported_funcs = {"seed", "ranf", "get_bit_generator", "RandomState", "set_bit_generator", "sample"}

    def is_supported(k, v, mod):
        if not callable(v):
            return False
        if not getattr(v, "__module__", None):
            return True
        if v.__module__ == mod.__name__:
            return True
        if v.__module__ == "numpy.random.mtrand" and mod.__name__== "numpy.random" and k not in unsupported_funcs:
            return True
        return False
    rv = {}
    for mod in trace_rules.NP_SUPPORTED_MODULES:
        for k, v in mod.__dict__.items():
            if is_supported(k, v, mod):
                rv[id(v)] = f"{mod.__name__}.{k}"
    return rv

def old_numpy_function_ids():
    rv = {}
    for mod in trace_rules.NP_SUPPORTED_MODULES:
        rv.update(
            {
                id(v): f"{mod.__name__}.{k}"
                for k, v in mod.__dict__.items()
                if callable(v)
                and (getattr(v, "__module__", None) or mod.__name__) == mod.__name__
            }
        )
    return rv

rv1 = set(old_numpy_function_ids().values())
rv2 = set(new_numpy_function_ids().values())

for v in (rv1 - rv2):
    print(v)
print("****")
for v in (rv2 - rv1):
    print(v)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138686
Approved by: https://github.com/williamwen42
2024-11-08 23:38:53 +00:00
Yifu Wang
1659e241c8 [experimental] async-tp impl with cutlass-based, progress aware kernel (#139227)
This PR introduces the following:

### torch.ops.symm_mem._async_input_mm

`_async_input_mm(Tensor a, Tensor b, Tensor a_chunk_signals, int a_chunk_pivot) -> Tensor`

An mm impl that supports consuming asynchronous input. It guarantees the following rasterization order, and that the corresponding signal arrives before an input chunk is consumed.
```
num_chunks = a_chunk_signals.numel()
for chunk_idx in range(a_chunk_pivot, num_chunks + a_chunk_pivot):
    chunk_idx = chunk_idx % num_chunks
    wait_signal(a_chunk_signals, chunk_idx)
    # Compute output tiles that consumes the input chunk
```
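
To make the rasterization order concrete, the index arithmetic of the loop above can be run on its own. `chunk_visit_order` is a hypothetical helper that only reproduces the order in which chunks are waited on; it does no signalling or computation.

```python
def chunk_visit_order(num_chunks, pivot):
    """Return chunk indices in the order the loop above waits on them."""
    return [idx % num_chunks for idx in range(pivot, num_chunks + pivot)]


order = chunk_visit_order(4, 1)  # [1, 2, 3, 0]
```

With a pivot of 1 and four chunks, consumption starts at chunk 1 and wraps around to chunk 0 last.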

### PersistentAsyncInputScheduler

This is a forked version of PersistentScheduler that supports consuming asynchronous input. This tile scheduler introduces the following arguments:

- `tiles_per_chunk_m` – Specifies the size of an M chunk. Chunks are the granularity at which the asynchronous input becomes ready. It must be an integer multiple of the size of an M tile.
- `chunk_signals` – `chunk_signals[i] == 1` indicates that chunk i is ready. Before returning a work tile, get_current_work() waits for the signal to ensure that the corresponding chunk is ready.
- `tile_idx_pivot_m` – After applying swizzling, apply `pivot(m) => (m + tile_idx_pivot_m) % tiles_m` to `m`. In a distributed setting, this allows different ranks to process different m indices at the same time, thus avoiding communication hotspots.
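
The effect of the pivot across ranks can be illustrated with a small sketch. It assumes, purely for illustration, that each rank uses its own rank index as `tile_idx_pivot_m`; the description above does not say how the pivot is chosen.

```python
def pivot(m, tile_idx_pivot_m, tiles_m):
    """Apply pivot(m) => (m + tile_idx_pivot_m) % tiles_m."""
    return (m + tile_idx_pivot_m) % tiles_m


tiles_m = 4
# Assumption: rank r uses pivot r, so each rank starts at a different m index.
schedule = {
    rank: [pivot(m, rank, tiles_m) for m in range(tiles_m)]
    for rank in range(tiles_m)
}
```

Under that assumption, the ranks work on four distinct m indices at every step, which is the hotspot-avoiding behaviour the argument is meant to provide.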

Note that this scheduler currently only supports the `KernelTmaWarpSpecializedCooperative` kernel schedule. This is enforced via the template argument `KernelSchedule`.

Usage:
```
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
   Shape<int, int, int, int>,
   CollectiveMainloop,
   CollectiveEpilogue,
   cutlass::gemm::PersistentAsyncInputScheduler<KernelSchedule>>;
```

### _fused_all_gather_matmul_native
An ag-mm impl that combines `torch.ops.symm_mem._async_input_mm` and progress-aware all-gather. This is not yet enabled via the async-tp passes. We will use it as a backend to optimize the current decomposition-based async-tp impl.

## Benchmarks

### 4096x3584x8192
- cublas + nccl: 539us
- decomp-based async-tp w/o cuda graph: 694us
- decomp-based async-tp w/ cuda graph: 478us
- new cutlass kernel: 408us

<img width="478" alt="image" src="https://github.com/user-attachments/assets/39f316ab-36c5-4b41-af77-07854a385dfc">

### 2048x3584x8192
- cublas + nccl: 301us
- decomp-based async-tp w/o cuda graph: 687us
- decomp-based async-tp w/ cuda graph: 356us
- new cutlass kernel: 276us

<img width="441" alt="image" src="https://github.com/user-attachments/assets/9e23ce21-863b-43dd-a562-fb05d3a5a144">

## Next Steps
- Add tuning logic
- Use `_fused_all_gather_matmul_native` as a backend for the decomp-based async-tp impl

Differential temp Revision: [D65623152](https://our.internmc.facebook.com/intern/diff/D65623152)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139227
Approved by: https://github.com/weifengpy, https://github.com/Chillee
2024-11-08 23:28:25 +00:00
fduwjj
e3b2f04f05 [c10d][Logging] Remove args and kwargs from c10d logging (#140169)
This PR is trying to reland https://github.com/pytorch/pytorch/pull/139804

We no longer want to log args and kwargs directly: if they contain a tensor or tensor subclass, converting them to a string can take a lot of time or may not be supported at all.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140169
Approved by: https://github.com/wz337
2024-11-08 23:24:52 +00:00
Scott Wolchok
cc44b55b00 Hook up bf16_gemv_trans to x86 bf16 GEMM (#139220)
This is the big milestone for bf16 and should enable us to close https://github.com/pytorch/torchchat/issues/1253 .

Testing: ran `python torchchat.py generate llama3.2-1b --dtype bf16 --device cpu` on an x86 machine with AVX512-bf16 and observed similar tokens/sec with and without the MKL path hand-disabled. Also observed a speedup from ~2.1 tok/sec to 7.4 tok/sec on an x86 machine with only AVX2.

Differential Revision: [D65170967](https://our.internmc.facebook.com/intern/diff/D65170967/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139220
Approved by: https://github.com/malfet
ghstack dependencies: #139084, #139090, #139558, #139081, #139208
2024-11-08 23:24:36 +00:00
Scott Wolchok
25c469bac3 Build bf16 gemv fast path & entry points for non-ARM architectures too (#139208)
Very similar to #137917, but for bf16.

Differential Revision: [D65155971](https://our.internmc.facebook.com/intern/diff/D65155971/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139208
Approved by: https://github.com/malfet
ghstack dependencies: #139084, #139090, #139558, #139081
2024-11-08 23:24:36 +00:00
Scott Wolchok
7f0bf9f961 Move bf16_gemv_trans to ReducedPrecisionFloatGemvFastPathKernel (#139081)
Following the previous move of fp16_gemv_trans.

Testing: Checked for performance regression with llm_benchmarks' `python benchmarks/benchmark_torch_mm.py llm`, didn't find one
Differential Revision: [D64930872](https://our.internmc.facebook.com/intern/diff/D64930872/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139081
Approved by: https://github.com/malfet
ghstack dependencies: #139084, #139090, #139558
2024-11-08 23:24:29 +00:00
Scott Wolchok
44f6d1439e Unbreak vec128_half_neon comparison without FP16 hardware support (#139558)
Discovered this bug when working on Vectorized<BFloat16>; apparently we have no automated testing for aarch64 without FP16.

Testing: Manually disable FP16 feature for local vec_test_all_types run on Mac; see pass.

Differential Revision: [D65385267](https://our.internmc.facebook.com/intern/diff/D65385267/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139558
Approved by: https://github.com/malfet
ghstack dependencies: #139084, #139090
2024-11-08 23:24:22 +00:00
Nikita Shulga
ac6b6c6f98 [BE][CI] Use pip3 instead of pip (#140185)
On modern distros (see this oldie but goodie: https://launchpad.net/ubuntu/focal/+package/python-is-python3 ), the `pip` alias might be missing or might even point to a Python 2 installation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140185
Approved by: https://github.com/wdvr, https://github.com/huydhn, https://github.com/seemethere
2024-11-08 23:15:02 +00:00
Natalia Gimelshein
1cdaf1d85f correctly keep track of processed tensors for foreach reductions (#140103)
Fixes #140066

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140103
Approved by: https://github.com/janeyx99

Co-authored-by: Jane Xu <janeyx@meta.com>
2024-11-08 23:04:53 +00:00