Commit Graph

34349 Commits

Author SHA1 Message Date
Behrang Javaherian
386776c49a [torch] Reduce memory usage by adding flags to clear intermediate graphs used for optimization during inference. (#115657)
Summary: At inference time the intermediate graphs used for optimization are not needed, so the Executor's graph is the only graph we need to keep around; these two flags control releasing the others.

Test Plan:
The flags are all off by default.

baseline
```
buck run mode/opt-clang  sigrid/predictor/client/localnet:run_model -- --model_id_to_load=951679039 --model_snapshot_to_load=244 --torch_jit_do_not_store_optimized_graph=true
I1212 10:24:20.407408 401092 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 951679039_244 is 182863 Kb
```
```
buck run mode/opt-clang  sigrid/predictor/client/localnet:run_model -- --model_id_to_load=951679039 --model_snapshot_to_load=244 --torch_jit_do_not_store_optimized_graph=true --torch_jit_release_profiling_graph_after_optimization=true
I1212 10:31:37.663487 464000 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 951679039_244 is 186127 Kb
```
```
buck run mode/opt-clang  sigrid/predictor/client/localnet:run_model -- --model_id_to_load=951679039 --model_snapshot_to_load=244 --torch_jit_do_not_store_optimized_graph=true --torch_jit_release_profiling_graph_after_optimization=true --torch_jit_execution_plan_avoid_extra_graph_copy=true
I1212 10:29:42.848093 447218 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 951679039_244 is 129451 Kb
```

Differential Revision: D52081631

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115657
Approved by: https://github.com/houseroad
2023-12-18 17:56:39 +00:00
Wanchao Liang
dd367b7c8f check tensor subclass when using torch.compile + SAC (#115960)
As titled: when using SAC + torch.compile, we currently only check for
functional tensors but not for other tensor subclasses, so SAC under
torch.compile would ignore tensor types like tensor subclasses. Fixed in
this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115960
Approved by: https://github.com/bdhirsh
2023-12-18 17:49:06 +00:00
angelayi
e43d33f4f7 [export] Support torch.sym* ops (#115854)
Fixes https://github.com/pytorch/pytorch/issues/108830 and https://github.com/pytorch/executorch/issues/1379#issuecomment-1853322866

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115854
Approved by: https://github.com/zhxchen17
2023-12-18 17:48:47 +00:00
Aaron Gokaslan
647f14e70b [BE]: Enable clang-tidy check for readability-string-compare (#115994)
Adds a clang-tidy check to flag unnecessary uses of string compare() that are less efficient and less readable when an equality overload exists.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115994
Approved by: https://github.com/albanD
2023-12-18 16:13:00 +00:00
Nikita Shulga
d7caef7996 [CI] Update clang-format (#116002)
To 17.0.6 build using https://github.com/pytorch/test-infra/blob/main/.github/workflows/clang-tidy-linux.yml

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116002
Approved by: https://github.com/suo
2023-12-18 14:58:46 +00:00
Mu-Chu Lee
c285ca7916 [AOTInductor] Add updating constant buffer to active buffer. (#116001)
Summary:
Refactor updating the inactive constant buffer to also allow updating the
active buffer.

Test Plan:
Existing tests cover inactive buffer updates.
UpdateConstantsCuda in the cpp test covers active buffer updates.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116001
Approved by: https://github.com/chenyang78
2023-12-18 11:49:03 +00:00
Pearu Peterson
34fe850d00 SymInt'ify sparse_compressed_tensor (#107903)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107903
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #115586
2023-12-17 17:36:20 +00:00
Tej Singh
eafeba71c1 Adamw refactor (#115983)
Fixes #104899; refactors AdamW by abstracting out common code shared with Adam.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115983
Approved by: https://github.com/janeyx99
2023-12-17 06:58:39 +00:00
Yue Dong
87ea6fb844 Make input contiguous for DTensor reduce scatter to fix the incorrect numerical values (#115847)
Summary:
This change makes the input tensor contiguous for DTensor reduce scatter in the case where no padding is needed.

There's no exception thrown during training, but we ran into numerical correctness issues without this change.
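
A minimal sketch of the idea, assuming a plain `torch.distributed` reduce-scatter call (the helper below is hypothetical, for illustration only, and is not the actual DTensor internals):

```python
import torch
import torch.distributed as dist

def reduce_scatter_contiguous(output, local_shard, group=None):
    # Non-contiguous inputs can silently produce wrong numerics,
    # so make the shard contiguous before handing it to the collective.
    if not local_shard.is_contiguous():
        local_shard = local_shard.contiguous()
    dist.reduce_scatter_tensor(output, local_shard, group=group)
```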

Test Plan:
**CI**
CI test

**WHEN model test**:
- Verified loss for each iteration is within the expected range.
- Verified NE is on par with this change on 4B training data.

Differential Revision: D52170822

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115847
Approved by: https://github.com/wanchaol
2023-12-17 01:35:09 +00:00
Menglu Yu
bc4115ffcf [Inductor][Observability] Change to log.debug to avoid excessively long logs (#115474)
Summary: As titled.

Test Plan: CI

Differential Revision: D52003825

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115474
Approved by: https://github.com/jackiexu1992, https://github.com/yanboliang
2023-12-17 00:25:54 +00:00
voznesenskym
b06b02559e Support non grapharg and intermediary grad access (#115898)
Support for something we need for both FSDP and optimizers. For sourced args that are not inputs (params, etc.), we use the dynamic_getattr flow on tensors. This soundly handles the storage, registration, and guarding downstream of tensor_wrap for the grad values. For non-sourced args (true intermediates), we only support None (the idea being that if we have a true intermediate in the graph with grad, we are already doing something weird).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115898
Approved by: https://github.com/bdhirsh
ghstack dependencies: #115315, #112184
2023-12-16 18:43:37 +00:00
kflu
c5dcb50c00 [easy] aten ops: support passing all args as kwargs, including self (#114920)
Summary:
This is important for writing aten IR-based graph transformations.

```
In [4]: [x.name for x in torch.ops.aten.reshape.default._schema.arguments]
Out[4]: ['self', 'shape']

In [8]: torch.ops.aten.reshape.default(torch.rand(1,2), shape=[2])
Out[8]: tensor([0.7584, 0.4834])

# === CANNOT CALL `self` BY KWARGS ===

In [7]: torch.ops.aten.reshape.default(self=torch.rand(1,2), shape=[2])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[7], line 1
----> 1 torch.ops.aten.reshape.default(self=torch.rand(1,2), shape=[2])

TypeError: OpOverload.__call__() got multiple values for argument 'self'

```

# Where's the problem?

1. The aten ops' first arg is usually named `self` (aten/src/ATen/native/native_functions.yaml)
2. Unfortunately, in `torch._ops.{OpOverload, OpOverloadPacket}.__call__()`, the first arg is (by Python convention) also named `self`.

So when `self` is passed by kwargs, `OpOverloadPacket.__call__` receives:

```
OpOverloadPacket.__call__(self, {"self": ...})
```

Python does not allow a value for the same argument name to be supplied twice, and hence

> TypeError: OpOverload.__call__() got multiple values for argument 'self'

# How to fix?

**Note that**, in the above, `self` is an instance of `OpOverloadPacket`, and the "self" kwarg is the input tensor to the aten op. To fix, we only need to differentiate the two `self`s.

In Python, the first arg of a method does not need to be named `self`. So we change the `__call__` definition to:

```
def __call__(_self, ...):
```

Now the call becomes:

```
OpOverloadPacket.__call__(_self, {"self": ...})
```

where:
* `_self` is the instance to the `OpOverloadPacket`
* `"self"` is the input tensor to the aten op.

Test Plan:
```
In [4]: [x.name for x in torch.ops.aten.reshape.default._schema.arguments]
Out[4]: ['self', 'shape']

In [3]: torch.ops.aten.reshape.default(self=torch.rand(1,2), shape=[2])
Out[3]: tensor([0.5127, 0.3051])
```

Differential Revision: D51731996

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114920
Approved by: https://github.com/houseroad
2023-12-16 18:32:58 +00:00
Jeff Daily
e3aefe2970 Revert "Initial Flash Attention support on ROCM (#114309)" (#115975)
This reverts commit 5bddbed399.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115975
Approved by: https://github.com/atalman, https://github.com/malfet
2023-12-16 03:40:14 +00:00
vfdev-5
2a2f2e454a [inductor] Fixed issue with true div on integer input with dyn shapes (#115920)
Related to https://github.com/pytorch/pytorch/issues/115742, `Cpu/CudaTests.test_div8`

Description:
- Fixed issue with true div on integer input with dyn shapes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115920
Approved by: https://github.com/peterbell10
2023-12-16 02:06:39 +00:00
Yanbo Liang
14a6b24c8b [Dynamo][8/N] Wrap itertools.* as ItertoolsVariable (#115802)
This is part of a series of changes before removing ```is_allowed```.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115802
Approved by: https://github.com/voznesenskym
2023-12-16 01:42:02 +00:00
Jane Xu
056a882cb9 add markDynamoStrictTest to TestOptimRenewed, removing flakiness (#115947)
fixes #115406 fixes #115394 fixes #115393 fixes #115392 fixes #115391

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115947
Approved by: https://github.com/albanD, https://github.com/zou3519
2023-12-16 01:33:32 +00:00
youkaichao
034e871710 [Dynamo] Look up variables from old frame, rather than copy variables to new frame; skip some copy to save time. (#115062)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115062
Approved by: https://github.com/williamwen42
2023-12-16 00:02:59 +00:00
Will Constable
9fcf6fb6fe [C10D] Add waitForDumpOrTimeout to log on dump abandonment (#115876)
Helps call attention to any cases where the dump actually times out.

The timeout is likely to hit if we run into slow stacktrace processing.

Log any exceptions encountered in the background thread, but don't raise
them; we're already willing to abandon the debug dump, and want to
proceed with our normal execution (in the case of dumppipe) or shutdown
process (when dumping happens on timeout and shutdown is already
initiated).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115876
Approved by: https://github.com/zdevito
ghstack dependencies: #115807
2023-12-15 22:13:06 +00:00
Will Constable
82e0d00da9 [c10d] Polish NCCL PG monitor thread log message (#115888)
We turned on monitor thread by default in https://github.com/pytorch/pytorch/pull/112518, and we want the error message that is displayed when the monitor kills the process to be more informative.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115888
Approved by: https://github.com/wconstab
2023-12-15 22:00:29 +00:00
angelayi
1f3bdf40ad [export] Update schema version (#115712)
Since the pytorch 2.1 release we've made some BC-breaking changes to the serialized schema. We should update it in time for the 2.2 release. Some of the changes include:

* https://github.com/pytorch/pytorch/pull/114371 - custom class objects / pybinded objects are no longer saved directly to the `ExportedProgram` structure. Instead, the name is serialized inside of the program, and the actual bytes are stored in a separate location from the exported program, allowing it to be saved to a different location.
* https://github.com/pytorch/pytorch/pull/111204 - `GraphSignature` structure changed and `call_spec` is removed from the `GraphModule` schema
* https://github.com/pytorch/pytorch/pull/111407 - `loss_outout` -> `loss_output`
* https://github.com/pytorch/pytorch/pull/113075 - `example_inputs` removed from the `ExportedProgram` structure (this originally did not store anything), `dialect` added to the `ExportedProgram` structure.
* https://github.com/pytorch/pytorch/pull/113689 - tensor constants are now lifted as inputs to the graph, and their locations are stored in the `GraphSignature`
* https://github.com/pytorch/pytorch/pull/114172 - removed `equality_constraints` and added a `SymExprHint` for all symbolic expressions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115712
Approved by: https://github.com/gmagogsfm
2023-12-15 21:43:03 +00:00
PyTorch MergeBot
50c9665f92 Revert "[export] Support torch.sym* ops (#115854)"
This reverts commit 347cb91946.

Reverted https://github.com/pytorch/pytorch/pull/115854 on behalf of https://github.com/atalman due to OSSCI oncall, broke multple jobs ([comment](https://github.com/pytorch/pytorch/pull/115854#issuecomment-1858486796))
2023-12-15 21:07:52 +00:00
PyTorch MergeBot
80a9625d9f Revert "non-strict export with dynamic shapes (#115862)"
This reverts commit 1bb0d0fc1f.

Reverted https://github.com/pytorch/pytorch/pull/115862 on behalf of https://github.com/atalman due to OSSCI oncall, failing trunk / macos-12-py3-arm64 / test ([comment](https://github.com/pytorch/pytorch/pull/115862#issuecomment-1858482486))
2023-12-15 21:04:12 +00:00
Avik Chaudhuri
1bb0d0fc1f non-strict export with dynamic shapes (#115862)
Differential Revision: [D52175048](https://our.internmc.facebook.com/intern/diff/D52175048/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115862
Approved by: https://github.com/zhxchen17
2023-12-15 20:11:30 +00:00
angelayi
347cb91946 [export] Support torch.sym* ops (#115854)
Fixes https://github.com/pytorch/pytorch/issues/108830 and https://github.com/pytorch/executorch/issues/1379#issuecomment-1853322866

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115854
Approved by: https://github.com/zhxchen17
2023-12-15 20:08:04 +00:00
PyTorch MergeBot
91b848bf81 Revert "markDynamoStrictTest on more tests (#115879)"
This reverts commit 8b650cdd3c.

Reverted https://github.com/pytorch/pytorch/pull/115879 on behalf of https://github.com/atalman due to OSSCI oncall, broke inductor ([comment](https://github.com/pytorch/pytorch/pull/115879#issuecomment-1858418921))
2023-12-15 20:00:09 +00:00
PyTorch MergeBot
c006c8b50e Revert "markDynamoStrictTest some more (#115885)"
This reverts commit 55ce4693ff.

Reverted https://github.com/pytorch/pytorch/pull/115885 on behalf of https://github.com/atalman due to OSSCI oncall, broke inductor ([comment](https://github.com/pytorch/pytorch/pull/115885#issuecomment-1858409669))
2023-12-15 19:51:24 +00:00
Wanchao Liang
61abacf829 [tp] improve documentation (#115880)
Improve the TP documentation in terms of format and descriptions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115880
Approved by: https://github.com/XilunWu
2023-12-15 18:44:22 +00:00
PyTorch MergeBot
d5115bfb06 Revert "[AOTI][refactor][3/n] Declare python_kernel_name and cpp_kernel_name in ExternKernel (#115831)"
This reverts commit 287a865677.

Reverted https://github.com/pytorch/pytorch/pull/115831 on behalf of https://github.com/desertfire due to rocm CI failure ([comment](https://github.com/pytorch/pytorch/pull/115831#issuecomment-1858322270))
2023-12-15 18:34:55 +00:00
PyTorch MergeBot
1b506e7469 Revert "non-strict export with dynamic shapes (#115862)"
This reverts commit f54bb1ed56.

Reverted https://github.com/pytorch/pytorch/pull/115862 on behalf of https://github.com/atalman due to OSSCI oncall, failing trunk / macos-12-py3-arm64 / test ([comment](https://github.com/pytorch/pytorch/pull/115862#issuecomment-1858197497))
2023-12-15 17:03:42 +00:00
Avik Chaudhuri
f54bb1ed56 non-strict export with dynamic shapes (#115862)
Differential Revision: [D52175048](https://our.internmc.facebook.com/intern/diff/D52175048/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115862
Approved by: https://github.com/zhxchen17
2023-12-15 16:38:45 +00:00
Jeff Daily
b062ea3803 [ROCm] add hipblaslt support (#114329)
Disabled by default. Enable with the env var DISABLE_ADDMM_HIP_LT=0. Tested on both ROCm 5.7 and 6.0.
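
A hedged usage sketch (the env var name comes from this message; it has to be set before the process reads it):

```python
import os
os.environ["DISABLE_ADDMM_HIP_LT"] = "0"  # opt in to the hipBLASLt addmm path

import torch
a = torch.randn(64, 64, device="cuda")
b = torch.randn(64, 64, device="cuda")
c = torch.addmm(torch.zeros(64, 64, device="cuda"), a, b)
```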

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114329
Approved by: https://github.com/malfet
2023-12-15 15:36:46 +00:00
Bin Bao
287a865677 [AOTI][refactor][3/n] Declare python_kernel_name and cpp_kernel_name in ExternKernel (#115831)
Summary: Both ExternKernelAlloc and ExternKernelOut need the two fields, so declare them in the base class. Also add cpp codegen for IndexPutFallback and InplaceBernoulliFallback in this PR.

Differential Revision: [D52189999](https://our.internmc.facebook.com/intern/diff/D52189999)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115831
Approved by: https://github.com/chenyang78
2023-12-15 14:40:44 +00:00
rzou
55ce4693ff markDynamoStrictTest some more (#115885)
Featuring
test_native_mha.py
test_nn.py
test_prims.py
test_schema_check.py
test_serialization.py
test_show_pickle.py
test_sort_and_select.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115885
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845, #115855, #115856, #115857, #115858, #115870, #115871, #115879
2023-12-15 13:19:52 +00:00
rzou
8b650cdd3c markDynamoStrictTest on more tests (#115879)
Featuring:
test_mobile_optimizer.py
test_module_init.py
test_modules.py
test_multiprocessing.py
test_multiprocessing_spawn.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115879
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845, #115855, #115856, #115857, #115858, #115870, #115871
2023-12-15 13:19:52 +00:00
Jun Luo
2d43e31aa9 Fix wrong behavior of is_alias_of and c10d::reducer on MTIA (#115553)
Reviewed By: kirteshpatil

Differential Revision: D51860023

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115553
Approved by: https://github.com/fduwjj
2023-12-15 11:14:41 +00:00
Yanbo Liang
b4d6443bcf [Dynamo] Log innermost user frame filename & lineno for better error aggregation (#115899)
CompilationMetrics example:
```
frame_key='1',
co_name='fn',
co_filename='/data/users/ybliang/debug/debug1.py',
co_firstlineno=58,
cache_size=0,
accumulated_cache_size=0,
guard_count=None,
graph_op_count=None,
graph_node_count=None,
graph_input_count=None,
entire_frame_compile_time_s=None,
backend_compile_time_s=None,
fail_type="<class 'torch._dynamo.exc.Unsupported'>",
fail_reason='custome dict init with args/kwargs unimplemented',
fail_user_frame_filename='/data/users/ybliang/debug/debug1.py',
fail_user_frame_lineno=61
```
where:
* ```fail_type``` and ```fail_reason``` are exceptions inside of Dynamo.
* ```fail_user_frame_filename``` and ```fail_user_frame_lineno``` are where the original user code triggered the exception.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115899
Approved by: https://github.com/davidberard98, https://github.com/ydwu4
2023-12-15 08:24:55 +00:00
Yifu Wang
4edc921857 Introduce 3 low-latency, intra-node allreduce algorithms for small messages to PyTorch (#114001)
## Summary
This PR added 3 intra-node GPU allreduce algorithms to PyTorch:
- One-shot allreduce (inspired by FasterTransformer): all ranks simultaneously read and accumulate data from other ranks.
- Two-shot allreduce (inspired by FasterTransformer): all ranks simultaneously read and accumulate `1 / world_size` of the data from other ranks. Then all ranks read the accumulated data from other ranks (effectively one-shot reduce-scatter + one-shot all-gather).
- Hybrid cube mesh allreduce (original): a one-shot allreduce variant that avoids transmission over PCIe on HCM topology.

## Micro Benchmarks
![image](https://github.com/pytorch/pytorch/assets/4156752/7bd25ffc-cd5b-4acb-bd65-b01bc136726e)

![image](https://github.com/pytorch/pytorch/assets/4156752/3ced31b4-6c31-4f34-a2d8-c072df29ae0e)

![image](https://github.com/pytorch/pytorch/assets/4156752/5b942c05-4fcc-4ec9-ae29-12c64080bb1c)

## Details
The intra-node algos are organized behind `c10d::IntraNodeComm`, which is responsible for:
- Managing handshaking and cuda IPC handle exchange among ranks.
- Querying NVLink connection and detecting topology.
- Performing algo selection based on available info.
- Launching the selected allreduce kernel.

`c10d::IntraNodeComm` is integrated into `c10d::ProcessGroupNCCL` as follows:
- When the `ENABLE_INTRA_NODE_COMM` environment variable is set, `c10d::ProcessGroupNCCL` initializes a `c10d::IntraNodeComm` for its ranks (see the usage sketch after this list).
  - If the setup is not suitable for intra-node comm (e.g. not all ranks are from the same node), the rendezvous logic guarantees all participants fall back consistently.
- `c10d::ProcessGroupNCCL::allreduce` consults `c10d::IntraNodeComm` whether to use intra-node allreduce and carries out the communication accordingly.

We currently detect two types of topologies from the NVLink connection mesh:
- Fully connected: all GPU pairs have a direct NVLink connection (e.g. NVSwitch or a fully connected subset of a hybrid cube mesh)
  - `msg <= 256KB`: one-shot allreduce.
  - `256KB < msg <= 10MB`: two-shot allreduce.
  -  `msg > 10MB`: instructs the caller to fallback to NCCL.
- Hybrid cube mesh
  - `msg <= 256KB`: one-shot allreduce.
  - `msg > 256KB`: instructs the caller to fallback to NCCL.
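
A hedged usage sketch of the opt-in path described above (only `ENABLE_INTRA_NODE_COMM` comes from this PR; everything else is the standard `torch.distributed` API, and algorithm selection happens internally):

```python
import os
import torch
import torch.distributed as dist

os.environ["ENABLE_INTRA_NODE_COMM"] = "1"  # must be set before process group init

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())
x = torch.ones(1024, device="cuda")  # small message: may take the one-shot path
dist.all_reduce(x)                   # falls back to NCCL when unsupported
```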

## Next Steps
- Fine tune algo selection based on GPU model, topology, link speed.
- Potentially optimize the two-shot allreduce impl. According to FasterTransformer, two-shot allreduce is preferred up to 50MB. There might be room for improvement, but PyTorch does impose more constraints:
  - FasterTransformer uses a single process to drive multiple devices. It can use `cudaDeviceEnablePeerAccess` to enable device-level peer access.
  - PyTorch uses multiple processes to drive multiple devices. With cuda IPC, a device can only share a specific region with other devices. This means extra copies may be unavoidable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114001
Approved by: https://github.com/yf225
2023-12-15 08:17:35 +00:00
Aidyn-A
cd47e335d1 [TEST] Skip test_schema_correctness for float8 dtype (#115757)
According to https://github.com/pytorch/pytorch/issues/107256#issuecomment-1705341870, the ops tested in `test_schema_correctness` are not yet supported with `torch.float8_e4m3fn`. Until they are supported, it is best to skip the test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115757
Approved by: https://github.com/drisspg
2023-12-15 06:26:46 +00:00
Zhijing Li (Accelerator Enablement)
c1c9b739e2 Back out "[aotinductor] replace lld with the default ld linker (#115478)" (#115875)
Summary:
Back out the diff

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115875
Approved by: https://github.com/chenyang78
2023-12-15 05:56:06 +00:00
rzou
478f0e96dc markDynamoStrictTest more tests (#115871)
For:
test_dispatch.py
test_fake_tensor.py
test_indexing.py
test_linalg.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115871
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845, #115855, #115856, #115857, #115858, #115870
2023-12-15 05:26:54 +00:00
rzou
7f686c8fe1 More markDynamoStrictTest (#115870)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115870
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845, #115855, #115856, #115857, #115858
2023-12-15 05:26:54 +00:00
leslie-fang-intel
9ae0e62929 [PT2] [Quant] Change the QConv2d Binary post op name from add to sum (#115329)
**Summary**
Change the QConv2d Binary fusion post op name from `add` to `sum`, since we are actually using OneDNN `post op sum` instead of `Binary_Add` for now.

**TestPlan**
```
python -m pytest test_quantized_op.py -k test_qconv2d_sum_pt2e
python -m pytest test_quantized_op.py -k test_qconv2d_sum_relu_pt2e
python -m pytest test_quantized_op.py -k test_qconv2d_sum_relu_float_output_pt2e
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115329
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2023-12-15 05:10:47 +00:00
Angela Yi
8e2d63cbc3 [export][reland] Remove runtime assertion pass (#115597)
Summary:
Reland of https://github.com/pytorch/pytorch/pull/115196
D52054112 to fix internal failures.

Test Plan: CI

Differential Revision: D52054110

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115597
Approved by: https://github.com/ydwu4, https://github.com/zhxchen17
2023-12-15 03:22:03 +00:00
Bin Bao
7d4ccd7b9e [AOTI][refactor][2/n] Rename kernel to python_kernel_name (#115766)
Differential Revision: [D52164940](https://our.internmc.facebook.com/intern/diff/D52164940)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115766
Approved by: https://github.com/chenyang78
ghstack dependencies: #115783
2023-12-15 03:08:13 +00:00
Will Constable
8e1cff96e3 [C10D] Log PG size in init log (#115807)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115807
Approved by: https://github.com/XilunWu
2023-12-15 02:38:54 +00:00
Nikita Shulga
5989e1222d [BE] Set torch.cuda.has_half to True (#115884)
This check was introduced by https://github.com/pytorch/pytorch/pull/5417 and then turned into a tautology by https://github.com/pytorch/pytorch/pull/10147

So I guess it's time to let go of all that dynamic initialization (and maybe just delete it in 2.3?)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115884
Approved by: https://github.com/kit1980
2023-12-15 02:30:55 +00:00
Jesse Cai
a8e354a9a0 [sparse][semi-structured] enable fp32 support, separate sparse and dense constraints (#115550)
Summary:

Both cuSPARSELt and CUTLASS support 1:2 semi-structured sparsity for
fp32, which this PR enables (thanks @alexsamardzic).

Furthermore, this PR also updates the sparse_config to take into account
the different shape constraints for sparse and dense matrices.

Technically, cuSPARSELt supports smaller sparse matrix constraints as it
seems to pad to the CUTLASS constraints under the hood. However, in
practice small sparse matrices are not commonly used and we care more
about the dense constraints for LLM inference.

For now, we keep the CUTLASS constraints in place for both cuSPARSELt
and CUTLASS tensors.

This PR also reconnects the _FUSE_TRANSPOSE flag for cuSPARSELt tensors.
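
A rough sketch of the newly enabled fp32 path, assuming `torch.sparse.to_sparse_semi_structured` on hardware that meets the shape constraints (the 1:2 mask below is only illustrative):

```python
import torch
from torch.sparse import to_sparse_semi_structured

# 1:2 pattern: one nonzero in every pair of contiguous elements
mask = torch.tensor([[0.0, 1.0]], device="cuda").tile(128, 64)
w = torch.randn(128, 128, device="cuda") * mask       # fp32 weights
w_sparse = to_sparse_semi_structured(w)                # fp32 accepted per this PR
x = torch.randn(128, 128, device="cuda")
y = torch.mm(w_sparse, x)
```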

Test Plan:
```
python test/test_sparse_semi_structured.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115550
Approved by: https://github.com/cpuhrsch
2023-12-15 02:28:17 +00:00
Mikayla Gawarecki
6d5fe07659 Fix numpy warning when importing torch without numpy installed (#115867)
Fixes #115638

I verified locally that with no numpy installed, the warning no longer occurs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115867
Approved by: https://github.com/soulitzer
2023-12-15 02:22:12 +00:00
rzou
85262b0a9e markDynamoStrictTest some test_cpp_extensions.* (#115858)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115858
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845, #115855, #115856, #115857
2023-12-15 01:22:38 +00:00
rzou
4ccd8eb613 Add Dynamo test expected failure mechanism (#115845)
Tests that are added to a list in dynamo_test_failures.py will
automatically be marked as expectedFailure when run with
PYTORCH_TEST_WITH_DYNAMO=1. I'm splitting this PR off on its own so that
I can test various things on top of it.

Also added an unMarkDynamoStrictTest that is not useful until we turn
on strict mode by default.
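
How such a mechanism can work, in a minimal sketch (the set name and wrapper below are hypothetical, not the actual contents of dynamo_test_failures.py):

```python
import os
import unittest

dynamo_expected_failures = {"TestFoo.test_bar"}  # hypothetical entry

def apply_dynamo_expected_failures(test_cls):
    # Under PYTORCH_TEST_WITH_DYNAMO=1, wrap listed tests with expectedFailure.
    if os.environ.get("PYTORCH_TEST_WITH_DYNAMO") != "1":
        return test_cls
    for name in list(vars(test_cls)):
        if name.startswith("test") and f"{test_cls.__name__}.{name}" in dynamo_expected_failures:
            setattr(test_cls, name, unittest.expectedFailure(getattr(test_cls, name)))
    return test_cls
```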

Test Plan:
- code reading
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115845
Approved by: https://github.com/voznesenskym
2023-12-15 01:22:17 +00:00
Bin Bao
f90a5f891b [AOTI][refactor][1/n] Rename cpp_kernel to cpp_kernel_name (#115783)
Differential Revision: [D52142184](https://our.internmc.facebook.com/intern/diff/D52142184)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115783
Approved by: https://github.com/chenyang78, https://github.com/jansel
2023-12-15 00:50:17 +00:00
Joel Schlosser
6fee208064 Handle -1 in jagged layout NT view ops (#115843)
Allows for inheriting the ragged and batch dims via -1:
```python
nt.view(-1, -1, D)
nt.expand(B, -1, D)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115843
Approved by: https://github.com/soulitzer
ghstack dependencies: #115636
2023-12-15 00:42:47 +00:00
PyTorch MergeBot
59f7355f86 Revert "[ROCm] add hipblaslt support (#114329)"
This reverts commit bb2bb8cca1.

Reverted https://github.com/pytorch/pytorch/pull/114329 on behalf of https://github.com/atalman due to OSSCI oncall, trunk  tests are failing ([comment](https://github.com/pytorch/pytorch/pull/114329#issuecomment-1857003155))
2023-12-14 23:53:30 +00:00
zdevito
66b04e3cb7 [nccl flight recorder] nullptr profiling name (#115851)
Sometimes the profiling name can be a nullptr, which
throws on conversion to std::string. This adds a check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115851
Approved by: https://github.com/wconstab
2023-12-14 23:40:54 +00:00
Oguz Ulgen
21b8127f1c [Inductor] Deduplicate grid wrapper statements for user defined triton kernels (#115849)
Noticed that on many MRS kernels the grid wrapper for autotuning is huge, with a bunch of duplicates, because num_warps and num_stages are not needed for the grid calculation. Let's deduplicate these entries.

Previously, we would see a wrapper like
```
    def grid_wrapper_for_add_kernel_2d_autotuned_0(meta):
        if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
        if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
        if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
        if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
```
now it looks like
```
    def grid_wrapper_for_add_kernel_2d_autotuned_0(meta):
        if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
        if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115849
Approved by: https://github.com/jansel
2023-12-14 23:26:04 +00:00
Pearu Peterson
194d57dae7 Add values backward support for sparse CSR, CSC, BSR, and BSC tensors (#115586)
Fixes https://github.com/pytorch/pytorch/issues/107286

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115586
Approved by: https://github.com/cpuhrsch, https://github.com/albanD
2023-12-14 23:09:13 +00:00
Zhengxu Chen
ef6a0faf89 [export] Fix canonicalization. (#115830)
Summary: Add the missing layout argument branch.

Test Plan: buck2 test 'fbcode//mode/dev-nosan' fbcode//sigmoid/inference/test_gpu:export_package_sparse_toy_test

Differential Revision: D52166501

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115830
Approved by: https://github.com/angelayi
2023-12-14 22:48:26 +00:00
Jeff Daily
bb2bb8cca1 [ROCm] add hipblaslt support (#114329)
Disabled by default. Enable with env var DISABLE_ADDMM_HIP_LT=0. Tested on both ROCm 5.7 and 6.0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114329
Approved by: https://github.com/malfet
2023-12-14 21:41:22 +00:00
Will Constable
04ef21f5dd [C10D] Make dumpDebuggingInfo share a mutex across PGs (#115803)
The mutex was originally added to avoid racing to dump debuginfo,
where a race in this case would result in a corrupted dump file.

The reason a mutex helps is that it forces all dump requests to be
serialized, so that an observer would either see an in-progress file, a
complete file, or no file.  Without a mutex, a fourth state is possible
(a file that has been written to by multiple threads and is invalid).

Because the mutex was a ProcessGroupNCCL class member, and each PG
instance has its own watchdog thread that can launch a dump, it was not
doing its job.  Making the mutex static shares it between instances of
the class and ensures serialization of dumps triggered by any PG.

(Note: dumps triggered by different PGs have the same, global contents
anyway- there is only one global flight recorder, so it doesn't matter
who triggers it.)
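
The same idea in a minimal Python illustration (not the C++ change itself): a lock stored on the class is shared by every instance, so dumps are serialized no matter which instance triggers them.

```python
import threading

class ProcessGroupLike:
    _dump_mutex = threading.Lock()  # class-level ("static"): shared across all instances

    def dump_debug_info(self, write_dump):
        # Any instance may trigger a dump, but only one dump runs at a time.
        with ProcessGroupLike._dump_mutex:
            write_dump()
```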

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115803
Approved by: https://github.com/kwen2501
ghstack dependencies: #115771, #115798, #115800, #115801
2023-12-14 21:17:44 +00:00
PyTorch MergeBot
7ecddaef23 Revert "Introduce 3 low-latency, intra-node allreduce algorithms for small messages to PyTorch (#114001)"
This reverts commit adfbd2b219.

Reverted https://github.com/pytorch/pytorch/pull/114001 on behalf of https://github.com/atalman due to OSSCI oncall, breaks periodic jobs ([comment](https://github.com/pytorch/pytorch/pull/114001#issuecomment-1856539040))
2023-12-14 20:33:10 +00:00
David Berard
67232199b1 [dynamo] Log shape_env_guard_count separately from guard_count (#115776)
guard_count counts all the shape_env guards as a single guard; log the shape_env_guard_count separately so those metrics can be used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115776
Approved by: https://github.com/yanboliang
2023-12-14 20:12:49 +00:00
Wei Wei
87547a26b8 [aotinductor] add no weight change version of fuse_parallel_linear (#115791)
Summary: We need a new version of fuse_parallel_linear w/o creating new weights for real-time update.

Reviewed By: khabinov

Differential Revision: D52128296

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115791
Approved by: https://github.com/khabinov
2023-12-14 18:36:17 +00:00
PyTorch MergeBot
ca4caf4eac Revert "[inductor] Do variance calculation in opmath type (#115181)"
This reverts commit 42390a097b.

Reverted https://github.com/pytorch/pytorch/pull/115181 on behalf of https://github.com/atalman due to OSSCI oncall, broke periodic tests ([comment](https://github.com/pytorch/pytorch/pull/115181#issuecomment-1856360644))
2023-12-14 18:21:49 +00:00
Will Constable
0fe014bd8a [C10D] Change PGNCCL logs to prefix [PG {} Rank {}] (#115801)
Adds a PG {process group uid} prefix component to logs.

This is helpful in situations where there are multiple process groups,
and rank information by itself is confusing.  (For example rank0 on PG1
may correspond to rank3 on PG0.  People may assume 'rank0' references
the global (PG0) world, but it may reference a sub-pg.  Prefacing the PG
helps clarify this.)

Does NOT change logs from inside WorkNCCL functions, since WorkNCCL
doesn't know what PG ID it corresponds to. Will address these logs
separately.

Example:

```
[I ProcessGroupNCCL.cpp:787] [PG 0 Rank 0] ProcessGroupNCCL initialization ...
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115801
Approved by: https://github.com/fduwjj
ghstack dependencies: #115771, #115798, #115800
2023-12-14 18:17:16 +00:00
Will Constable
e94267587b [C10D] Refactor NCCL logs to use common prefix helper (#115800)
Put the repeated code that string formats [Rank {rank}] in one place.

Sets up for the next PR that also adds more info to this prefix.

(Does not change exception messages, which could be done as well.
Exception messages are not formatted quite the same way. This tries
instead to keep from changing log behavior (in this PR) and only
refactors code.)

Did limited testing (some logs were observed OK).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115800
Approved by: https://github.com/fduwjj
ghstack dependencies: #115771, #115798
2023-12-14 18:13:24 +00:00
Will Constable
eb6e70cf66 [C10D] Only open NCCL dump pipe file once per process (#115798)
The NCCL flight recorder is per-process (it is shared by all
processgroups), but individual process groups used to construct their
own pipe for being signaled to dump the flight recorder.

This ensures that only one pipe per process is created, by only creating
the pipe on the first ProcessGroup (uid_ == 0) which should be the world
group.

Filenames are still keyed off of rank, but this should now be global
rank instead of sub-pg rank, making the filenames unique across the
whole trainer process.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115798
Approved by: https://github.com/zdevito
ghstack dependencies: #115771
2023-12-14 17:48:26 +00:00
Will Constable
74d2b9dd15 [C10D] Make DumpPipe disabled when FlightRecorder disabled (#115771)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115771
Approved by: https://github.com/fduwjj
2023-12-14 17:42:46 +00:00
Ruichao Xiao
c80e2d5bb2 [fbcode] consolidate usage of fp8 linears for inference models (#115808)
Summary:
ATT, this will use the implementation of D51812709 for fp8 linears.

Meanwhile, it also adds a use case of delay quantization.

Test Plan:
```
CUDA_VISIBLE_DEVICES=7 buck run mode/opt  -c fbcode.platform010_cuda_version=12 -c fbcode.nvcc_arch=h100 -c fbcode.use_link_groups=false caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --local-model /home/xiaoruichao/test_models/463113248.input.predictor.disagg.gpu.merge --lower-backend AOT_INDUCTOR --fp8-linear-quantization-type delay_quantization --disable-acc-tracer-aot-inductor
```

```
CUDA_VISIBLE_DEVICES=7 buck run mode/opt  -c fbcode.platform010_cuda_version=12 -c fbcode.nvcc_arch=h100 -c fbcode.use_link_groups=false caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --local-model /home/xiaoruichao/test_models/463113248.input.predictor.disagg.gpu.merge --lower-backend AOT_INDUCTOR --fp8-linear-quantization-type delay_quantization --disable-acc-tracer-aot-inductor
```

Reviewed By: tter1

Differential Revision: D51840344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115808
Approved by: https://github.com/ipiszy
2023-12-14 16:59:48 +00:00
Xinya Zhang
5bddbed399 Initial Flash Attention support on ROCM (#114309)
This pull request adds initial Flash Attention support for the AMD/ROCM platform. It adds a specialized Triton repository/branch as a compile-time dependency for the Flash Attention math library on AMD/ROCM. This Triton submodule is not used at runtime and will not be shipped in the final pytorch package. We plan to release this specialized Triton as a separate project.

Known limitations:

- [ ] Only supports MI200 series GPUs (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`).
- [ ] Only supports power of two sequence lengths.
- [ ] No support for varlen APIs.
- [ ] Only supports head dimensions 16, 32, 64, 128.
- [ ] Performance is still being optimized.
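
A hedged usage sketch within those limitations (standard SDPA API; nothing here is specific to this PR beyond the flash backend now existing on ROCm):

```python
import torch
import torch.nn.functional as F

q, k, v = (torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16) for _ in range(3))
# Force the flash backend so a silent fallback doesn't hide missing support.
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v)
```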

Fixes https://github.com/pytorch/pytorch/issues/112997

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114309

Approved by: https://github.com/jeffdaily, https://github.com/malfet

---------

Co-authored-by: Joseph Groenenboom <joseph.groenenboom@amd.com>
2023-12-14 08:52:57 -08:00
Mikayla Gawarecki
ac60a70e06 Migrated loss functions to ModuleInfos (#115584)
Migrates most tests in `common_nn.py:criterion_tests` to ModuleInfos.

**I can split this up if it is too large to review**

What this PR does not include:
- [`no_batch_dim` tests](https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/common_nn.py#L3995-L4112)
- [tests that use the functional variant of the loss function and `wrap_functional`](https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/common_nn.py#L1079-L1128)

#### On test times
This PR increases test time by ~58s locally
Before this PR:
```
>>> python test/test_nn.py -k Loss
Ran 1003 tests in 28.977s
```
After this PR
```
>>> python test/test_nn.py -k Loss
Ran 368 tests in 23.073s
```

```
>>> python test/test_modules.py -k Loss
Ran 836 tests in 63.900s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115584
Approved by: https://github.com/janeyx99
ghstack dependencies: #115617
2023-12-14 16:21:05 +00:00
vfdev-5
f727bed2e6 [inductor] Updated upsample_bilinear2d decomposition (#104182)
Description:
- Updated upsample_bilinear2d decomposition
  - added uint8 dtype support
  - code improvements
- Added uint8 dtype tests
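
A small repro-style sketch of the decomposed path (plain `torch.compile` + `F.interpolate`; sizes mirror the benchmark below):

```python
import torch
import torch.nn.functional as F

def resize(x):
    return F.interpolate(x, size=(256, 256), mode="bilinear", align_corners=False, antialias=False)

compiled_resize = torch.compile(resize)
img = torch.randint(0, 256, (1, 3, 500, 400), dtype=torch.uint8)
out = compiled_resize(img)  # uint8 now handled by the updated decomposition
```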

Perf considerations:
- There is a minor perf regression (speed-up ~0.7) for uint8, align_corners=True cases when the output is smaller than or equal to (256, 256)
- For cases where the output is larger than (256, 256) and the input dtype is uint8, the nightly output is wrong, so IMO the large perf regression (speed-up around ~0.2) should not be taken into account.

## Perfs benchmarks

```
[--------------------------------------------------------------------------------------------------------------------------------------------------------- Interpolate, cpu --------------------------------------------------------------------------------------------------------------------------------------------------------]
                                                                                                                                                    |  Eager (2.3.0a0+gitafcfdb1) PR  |  Compiled (2.3.0a0+gitafcfdb1) PR  |  Compiled (2.3.0a0+gitde89a53) Nightly  |  speed-up PR vs Nightly  |  Eager (2.3.0a0+gitde89a53) Nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input (1, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (256, 256)       |        565.212 (+-3.548)        |        1384.210 (+-10.798)         |           1230.996 (+-32.930)           |     0.889 (+-0.000)      |          566.253 (+-1.526)
      Input (1, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (256, 256)      |        565.404 (+-1.614)        |         1491.649 (+-7.763)         |            2974.959 (+-6.006)           |     1.994 (+-0.000)      |          566.476 (+-1.742)
      Input (1, 3, 500, 400), torch.uint8, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (256, 256)           |        270.761 (+-0.861)        |         1557.777 (+-4.699)         |            1080.919 (+-4.243)           |     0.694 (+-0.000)      |          269.829 (+-0.986)
      Input (1, 3, 500, 400), torch.uint8, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (256, 256)          |        270.960 (+-0.995)        |        1723.913 (+-12.433)         |            3191.938 (+-6.194)           |     1.852 (+-0.000)      |          269.962 (+-1.657)
      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (256, 256)     |        1555.884 (+-5.169)       |         1178.753 (+-4.957)         |            1910.445 (+-5.988)           |     1.621 (+-0.000)      |          1560.804 (+-6.793)
      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (256, 256)    |        1651.193 (+-6.952)       |         1323.466 (+-6.059)         |            3374.842 (+-8.168)           |     2.550 (+-0.000)      |          1653.497 (+-8.018)
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (256, 256)         |        978.482 (+-10.183)       |         1383.768 (+-4.341)         |            2147.841 (+-6.581)           |     1.552 (+-0.000)      |          979.983 (+-1.499)
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (256, 256)        |        1074.472 (+-5.031)       |         1414.912 (+-5.754)         |           3590.968 (+-10.042)           |     2.538 (+-0.000)      |          1074.589 (+-3.948)
      Input (4, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (256, 256)       |        2168.703 (+-8.964)       |        5400.528 (+-26.628)         |           4777.299 (+-11.891)           |     0.885 (+-0.000)      |          2168.133 (+-7.667)
      Input (4, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (256, 256)      |       2169.132 (+-12.618)       |        6583.866 (+-28.959)         |           11986.894 (+-45.838)          |     1.821 (+-0.000)      |         2174.488 (+-10.317)
      Input (4, 3, 500, 400), torch.uint8, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (256, 256)           |        992.808 (+-6.086)        |         5985.028 (+-9.532)         |            4334.158 (+-9.423)           |     0.724 (+-0.000)      |          989.604 (+-5.499)
      Input (4, 3, 500, 400), torch.uint8, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (256, 256)          |        987.618 (+-6.350)        |        6963.044 (+-28.885)         |           15441.096 (+-55.324)          |     2.218 (+-0.000)      |          985.573 (+-5.159)
      Input (4, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (256, 256)     |       6695.557 (+-35.067)       |        4657.603 (+-14.220)         |           8058.708 (+-41.684)           |     1.730 (+-0.000)      |         6714.996 (+-38.626)
      Input (4, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (256, 256)    |       7040.481 (+-39.486)       |        5445.704 (+-16.659)         |           13906.618 (+-53.298)          |     2.554 (+-0.000)      |         7034.453 (+-44.626)
      Input (4, 3, 500, 400), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (256, 256)         |       3926.186 (+-10.660)       |        5741.433 (+-12.748)         |           9356.036 (+-40.848)           |     1.630 (+-0.000)      |         3930.598 (+-17.086)
      Input (4, 3, 500, 400), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (256, 256)        |        4308.536 (+-9.607)       |        6122.755 (+-47.278)         |           15637.567 (+-54.392)          |     2.554 (+-0.000)      |         4307.463 (+-11.268)
      Input (1, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (200, 300)     |       2512.740 (+-10.860)       |         1573.590 (+-5.061)         |            451.355 (+-1.210)            |     0.287 (+-0.000)      |         2511.727 (+-10.930)
      Input (1, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (200, 300)    |       2489.926 (+-11.915)       |         1537.233 (+-4.212)         |            2501.470 (+-7.446)           |     1.627 (+-0.000)      |         2500.000 (+-12.155)
      Input (1, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (200, 300)         |        632.032 (+-2.108)        |         1496.994 (+-4.194)         |            404.759 (+-1.064)            |     0.270 (+-0.000)      |          630.122 (+-4.086)
      Input (1, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (200, 300)        |        629.174 (+-4.386)        |         1708.935 (+-8.817)         |            2643.296 (+-9.723)           |     1.547 (+-0.000)      |          628.388 (+-1.326)
      Input (1, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (200, 300)   |        4409.941 (+-8.016)       |         1160.133 (+-4.698)         |            1897.089 (+-9.392)           |     1.635 (+-0.000)      |         4450.959 (+-10.438)
      Input (1, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (200, 300)  |       4493.427 (+-11.703)       |         1329.226 (+-4.740)         |           2835.872 (+-12.241)           |     2.133 (+-0.000)      |          4506.973 (+-9.914)
      Input (1, 3, 1200, 1300), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (200, 300)       |        901.712 (+-4.071)        |         1320.739 (+-5.197)         |            2207.605 (+-8.219)           |     1.671 (+-0.000)      |          904.757 (+-4.558)
      Input (1, 3, 1200, 1300), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (200, 300)      |        990.080 (+-3.922)        |         1702.563 (+-7.909)         |           3074.196 (+-10.478)           |     1.806 (+-0.000)      |          990.482 (+-4.444)
      Input (4, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (200, 300)     |       9785.550 (+-58.445)       |        6135.680 (+-33.569)         |           1628.572 (+-19.770)           |     0.265 (+-0.000)      |         9893.606 (+-62.377)
      Input (4, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (200, 300)    |       9710.191 (+-57.597)       |        6066.824 (+-36.364)         |           10469.110 (+-42.775)          |     1.726 (+-0.000)      |         9919.022 (+-72.190)
      Input (4, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (200, 300)         |       2790.356 (+-12.188)       |        6134.101 (+-28.694)         |            1576.832 (+-6.030)           |     0.257 (+-0.000)      |         2761.122 (+-11.503)
      Input (4, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (200, 300)        |       2778.711 (+-13.603)       |        6608.528 (+-37.776)         |           10841.549 (+-49.429)          |     1.641 (+-0.000)      |         2753.037 (+-10.995)
      Input (4, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (200, 300)   |      45533.868 (+-102.618)      |         4962.994 (+-8.215)         |           9003.968 (+-38.179)           |     1.814 (+-0.000)      |        43531.261 (+-102.951)
      Input (4, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (200, 300)  |       45932.699 (+-81.207)      |        5595.682 (+-11.482)         |           12302.907 (+-50.254)          |     2.199 (+-0.000)      |         43916.455 (+-80.468)
      Input (4, 3, 1200, 1300), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (200, 300)       |        3827.804 (+-8.057)       |        6311.580 (+-25.021)         |           11760.614 (+-51.531)          |     1.863 (+-0.000)      |         3849.959 (+-10.848)
      Input (4, 3, 1200, 1300), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (200, 300)      |        4169.007 (+-8.452)       |        6820.716 (+-35.310)         |           15264.633 (+-49.982)          |     2.238 (+-0.000)      |         4183.875 (+-19.104)
      Input (1, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (600, 700)       |        1306.914 (+-7.470)       |        10598.101 (+-38.410)        |           2678.031 (+-11.051)           |     0.253 (+-0.000)      |          1307.470 (+-8.519)
      Input (1, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (600, 700)      |        1307.268 (+-8.197)       |        10161.123 (+-45.643)        |           17148.842 (+-55.402)          |     1.688 (+-0.000)      |          1308.077 (+-8.553)
      Input (1, 3, 300, 400), torch.uint8, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (600, 700)           |        548.574 (+-2.157)        |        10072.806 (+-41.368)        |            2408.971 (+-6.997)           |     0.239 (+-0.000)      |          547.726 (+-1.721)
      Input (1, 3, 300, 400), torch.uint8, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (600, 700)          |        546.664 (+-1.484)        |        11123.694 (+-43.636)        |           18058.070 (+-48.552)          |     1.623 (+-0.000)      |          547.151 (+-1.627)
      Input (1, 3, 300, 400), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (600, 700)     |       7935.051 (+-71.022)       |        7654.533 (+-29.512)         |           12414.194 (+-87.450)          |     1.622 (+-0.000)      |         7900.056 (+-53.997)
      Input (1, 3, 300, 400), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (600, 700)    |       8546.732 (+-53.118)       |        8583.572 (+-35.656)         |          19111.824 (+-166.978)          |     2.227 (+-0.000)      |         8515.433 (+-63.300)
      Input (1, 3, 300, 400), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (600, 700)         |       6202.642 (+-34.355)       |        8915.622 (+-62.293)         |           14327.295 (+-52.188)          |     1.607 (+-0.000)      |         6213.329 (+-39.740)
      Input (1, 3, 300, 400), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (600, 700)        |       6811.128 (+-33.747)       |        9647.316 (+-50.837)         |           20830.594 (+-62.979)          |     2.159 (+-0.000)      |         6822.512 (+-37.092)
      Input (4, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (600, 700)       |       5079.586 (+-19.067)       |        42238.442 (+-87.643)        |           11282.141 (+-42.477)          |     0.267 (+-0.000)      |         5104.234 (+-17.706)
      Input (4, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (600, 700)      |       5079.575 (+-16.306)       |        41512.995 (+-83.710)        |          68789.816 (+-440.001)          |     1.657 (+-0.000)      |         5097.446 (+-21.724)
      Input (4, 3, 300, 400), torch.uint8, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (600, 700)           |        2039.974 (+-8.614)       |       42322.773 (+-111.866)        |           10399.237 (+-43.140)          |     0.246 (+-0.000)      |         2043.808 (+-10.707)
      Input (4, 3, 300, 400), torch.uint8, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (600, 700)          |       2036.214 (+-10.083)       |        44353.281 (+-71.548)        |          73340.412 (+-324.780)          |     1.654 (+-0.000)      |          2039.000 (+-9.554)
      Input (4, 3, 300, 400), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (600, 700)     |       33821.523 (+-96.639)      |        30552.094 (+-65.023)        |          49494.486 (+-872.916)          |     1.620 (+-0.000)      |         33844.404 (+-92.466)
      Input (4, 3, 300, 400), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (600, 700)    |      36196.104 (+-128.169)      |        34038.432 (+-79.697)        |          75761.226 (+-905.194)          |     2.226 (+-0.000)      |         36260.473 (+-94.642)
      Input (4, 3, 300, 400), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (600, 700)         |       24827.821 (+-77.335)      |        37006.218 (+-86.318)        |          61297.625 (+-898.192)          |     1.656 (+-0.000)      |         24823.275 (+-80.945)
      Input (4, 3, 300, 400), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (600, 700)        |       27266.138 (+-70.262)      |        40109.475 (+-94.248)        |          92086.075 (+-404.922)          |     2.296 (+-0.000)      |         27287.992 (+-89.507)

Times are in microseconds (us).

[--------------------------------------------------------------------------------------------------------------------------------------------------------- Interpolate, cuda ---------------------------------------------------------------------------------------------------------------------------------------------------------]
                                                                                                                                                      |  Eager (2.3.0a0+gitafcfdb1) PR  |  Compiled (2.3.0a0+gitafcfdb1) PR  |  Compiled (2.3.0a0+gitde89a53) Nightly  |  speed-up PR vs Nightly  |  Eager (2.3.0a0+gitde89a53) Nightly
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input (1, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (1234, 1345)   |         98.259 (+-0.014)        |          97.156 (+-0.008)          |             97.443 (+-0.031)            |     1.003 (+-0.000)      |           98.248 (+-0.021)
      Input (1, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (1234, 1345)  |         97.048 (+-0.016)        |          97.480 (+-0.018)          |             96.819 (+-0.126)            |     0.993 (+-0.000)      |           97.045 (+-0.015)
      Input (1, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (1234, 1345)       |         97.944 (+-0.028)        |          91.686 (+-0.411)          |             93.894 (+-1.011)            |     1.024 (+-0.000)      |           97.933 (+-0.008)
      Input (1, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (1234, 1345)      |         98.008 (+-0.011)        |          91.205 (+-0.346)          |             96.854 (+-0.058)            |     1.062 (+-0.000)      |           97.203 (+-0.010)
      Input (4, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (1234, 1345)   |        384.318 (+-0.011)        |         382.793 (+-0.007)          |            382.472 (+-0.011)            |     0.999 (+-0.000)      |          384.701 (+-0.012)
      Input (4, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (1234, 1345)  |        384.266 (+-0.009)        |         385.333 (+-0.024)          |            382.554 (+-0.022)            |     0.993 (+-0.000)      |          384.386 (+-0.016)
      Input (4, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (1234, 1345)       |        383.924 (+-0.011)        |         570.071 (+-0.030)          |            545.615 (+-0.051)            |     0.957 (+-0.000)      |          384.044 (+-0.012)
      Input (4, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (1234, 1345)      |        384.184 (+-0.016)        |         560.857 (+-0.026)          |            552.447 (+-0.040)            |     0.985 (+-0.000)      |          384.063 (+-0.016)
      Input (1, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (2345, 2456)   |        122.188 (+-0.053)        |         116.744 (+-1.006)          |            163.762 (+-0.015)            |     1.403 (+-0.000)      |          121.874 (+-0.015)
      Input (1, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (2345, 2456)  |        122.156 (+-0.012)        |         182.692 (+-0.013)          |            161.653 (+-0.018)            |     0.885 (+-0.000)      |          121.926 (+-0.014)
      Input (1, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (2345, 2456)       |        105.852 (+-0.324)        |         119.545 (+-0.294)          |            190.527 (+-0.023)            |     1.594 (+-0.000)      |          105.999 (+-0.446)
      Input (1, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (2345, 2456)      |        106.507 (+-0.282)        |         120.060 (+-0.257)          |            162.330 (+-0.012)            |     1.352 (+-0.000)      |          106.567 (+-0.385)
      Input (4, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (2345, 2456)   |        447.907 (+-0.015)        |         463.863 (+-1.779)          |            650.492 (+-0.331)            |     1.402 (+-0.000)      |          446.596 (+-0.017)
      Input (4, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (2345, 2456)  |        447.750 (+-0.017)        |         723.832 (+-0.170)          |            641.539 (+-0.075)            |     0.886 (+-0.000)      |          446.467 (+-0.019)
      Input (4, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (2345, 2456)       |        439.549 (+-0.031)        |         507.772 (+-2.879)          |            758.795 (+-0.482)            |     1.494 (+-0.000)      |          440.372 (+-0.025)
      Input (4, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (2345, 2456)      |        439.538 (+-0.029)        |         509.260 (+-2.704)          |            654.195 (+-2.621)            |     1.285 (+-0.000)      |          440.362 (+-0.026)

Times are in microseconds (us).
```

[Source](f4751a3196/perf_interp_mode.py), [Output](899f34c024/output/20231213-214209-upsample-bilinear-pr_vs_nightly-speedup.md)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104182
Approved by: https://github.com/lezcano
2023-12-14 14:50:06 +00:00
Yifu Wang
adfbd2b219 Introduce 3 low-latency, intra-node allreduce algorithms for small messages to PyTorch (#114001)
## Summary
This PR added 3 intra-node GPU allreduce algorithms to PyTorch:
- One-shot allreduce (inspired by FasterTransformer): all ranks simultaneously read and accumulate data from other ranks.
- Two-shot allreduce (inspired by FasterTransformer): all ranks simultaneously read and accumulate `1 / world_size` data from other ranks. Then all ranks read accumulated data from other ranks. (effectively one-shot reduce-scatter + one-shot all-gather).
- Hybrid cube mesh allreduce (original): a one-shot allreduce variant that avoids transmission over PCIe on HCM topology.

## Micro Benchmarks
![image](https://github.com/pytorch/pytorch/assets/4156752/7bd25ffc-cd5b-4acb-bd65-b01bc136726e)

![image](https://github.com/pytorch/pytorch/assets/4156752/3ced31b4-6c31-4f34-a2d8-c072df29ae0e)

![image](https://github.com/pytorch/pytorch/assets/4156752/5b942c05-4fcc-4ec9-ae29-12c64080bb1c)

## Details
The intra-node algos are organized behind `c10d::IntraNodeComm`, which is responsible for:
- Managing handshaking and cuda IPC handle exchange among ranks.
- Querying NVLink connection and detecting topology.
- Performing algo selection based on available info.
- Launching the selected allreduce kernel.

`c10d::IntraNodeComm` is integrated into `c10d::ProcessGroupNCCL` as follows:
- When the `ENABLE_INTRA_NODE_COMM` environment variable is set, `c10d::ProcessGroupNCCL` initializes a `c10d::IntraNodeComm` for its ranks.
  - If the setup is not suitable for intra-node comm (e.g. not all ranks are from the same node), the rendezvous logic guarantees all participants fall back consistently.
- `c10d::ProcessGroupNCCL::allreduce` consults `c10d::IntraNodeComm` whether to use intra-node allreduce and carries out the communication accordingly.

We currently detect two types of topologies from the NVLink connection mesh:
- Fully connected: all GPU pairs have a direct NVLink connection (e.g. NVSwitch or a fully connected sub-set of a hybrid cube mesh)
  - `msg <= 256KB`: one-shot allreduce.
  - `256KB < msg <= 10MB`: two-shot allreduce.
  -  `msg > 10MB`: instructs the caller to fallback to NCCL.
- Hybrid cube mesh
  - `msg <= 256KB`: one-shot allreduce.
  - `msg > 256KB`: instructs the caller to fallback to NCCL.
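
A minimal usage sketch of the integration described above (hedged: launcher, device assignment, and tensor sizes are assumed; the thresholds and algo selection happen inside `ProcessGroupNCCL`/`IntraNodeComm`):

```python
import os

# Must be set before the process group is created.
os.environ["ENABLE_INTRA_NODE_COMM"] = "1"

import torch
import torch.distributed as dist

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

x = torch.ones(1024, device="cuda")  # small message: eligible for one-shot allreduce
dist.all_reduce(x)                   # larger messages transparently fall back to NCCL
```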

## Next Steps
- Fine tune algo selection based on GPU model, topology, link speed.
- Potentially optimize the two-shot allreduce impl. According to FasterTransformer, two-shot allreduce is preferred up to 50MB. There might be room for improvement, but PyTorch does impose more constraints:
  - FasterTransformer uses a single process to drive multiple devices. It can use `cudaDeviceEnablePeerAccess` to enable device-level peer access.
  - PyTorch uses multiple processes to drive multiple devices. With CUDA IPC, a device can only share a specific region with other devices. This means extra copies may be unavoidable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114001
Approved by: https://github.com/yf225
2023-12-14 08:13:08 +00:00
Xuehai Pan
36c6c0c7dc [pytree] expand tree_map to accept multi-inputs (#115642)
Fixes #115419
Fixes #91323
Closes #115549

- #115419
- #91323
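
A hedged sketch of the multi-input form (assuming the signature follows `tree_map(fn, tree, *rests)` and that all trees share the same structure):

```python
import torch
from torch.utils._pytree import tree_map

lhs = {"a": torch.ones(2), "b": [torch.zeros(3)]}
rhs = {"a": torch.full((2,), 2.0), "b": [torch.ones(3)]}

# fn now receives one leaf from each tree instead of a single leaf.
summed = tree_map(lambda x, y: x + y, lhs, rhs)
print(summed["a"])  # tensor([3., 3.])
```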

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115642
Approved by: https://github.com/vmoens, https://github.com/zou3519
2023-12-14 06:16:42 +00:00
eqy
7e1542b938 [CUDA][FP8] Skip test_dtypes on FP8 _scaled_mm (#115661)
This test isn't actually parametrized by `dtype`, so it seems to surface bogus failures where "unsupported" types "work" when in reality fp8 is used every time.

CC @drisspg I'm guessing this doesn't surface in upstream CI because there are no SM9.0 runners yet?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115661
Approved by: https://github.com/drisspg
2023-12-14 05:12:33 +00:00
Will Constable
f5458f8f00 [C10D] Make DumpPipe pipe file configurable (#115770)
Add the TORCH_NCCL_DEBUG_INFO_PIPE_FILE env var, allowing the pipe file
location to be set separately from the dump file location.

Defaults PIPE_FILE to empty, meaning disabled.
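
A hedged sketch of setting the new knob (the path is assumed for illustration; leaving the variable unset keeps the pipe disabled):

```python
import os

# Assumed example value; the pipe file location is now independent of the dump file location.
os.environ["TORCH_NCCL_DEBUG_INFO_PIPE_FILE"] = "/tmp/nccl_debug_pipe"
```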
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115770
Approved by: https://github.com/zdevito
2023-12-14 03:54:43 +00:00
Fuzzkatt
ef01e78fd9 disable test_ddp_profiling_autograd_profiler in distributed_test.py (#115704)
test was previously disabled in upstream: https://github.com/pytorch/pytorch/issues/77342, currently failing in NVIDIA internal CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115704
Approved by: https://github.com/soulitzer
2023-12-14 01:41:37 +00:00
Pavan Balaji
ffc826bf10 [nccl-pg] Store PG global rank information in tracing logs (#115730)
Storing the list of global ranks associated with each PG allows us to correlate traces across different ranks.

Test Plan:

OSS CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115730
Approved by: https://github.com/fduwjj
2023-12-14 00:59:17 +00:00
Yidi Wu
b38e14c12a [Reland][HigherOrderOp] remove unused get_item in MapHigherOrder (#115758)
Summary: This is a reland of https://github.com/pytorch/pytorch/pull/115207

Test Plan: Modified existing tests.

Reviewed By: yanboliang

Differential Revision: D52045157

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115758
Approved by: https://github.com/angelayi
2023-12-14 00:41:46 +00:00
PyTorch MergeBot
626b7dc847 Revert "Migrated loss functions to ModuleInfos (#115584)"
This reverts commit f138b08d2e.

Reverted https://github.com/pytorch/pytorch/pull/115584 on behalf of https://github.com/atalman due to OSS CI oncall, breaks slow test ([comment](https://github.com/pytorch/pytorch/pull/115584#issuecomment-1854855080))
2023-12-13 23:34:30 +00:00
vfdev-5
c7ae2c170f [inductor] Added non-integer expr support for floordiv in triton codegen (#115751)
Description:
- Added non-integer expr support for floordiv in triton codegen
- Added a test
  - the cpp test is skipped because it is currently failing; https://github.com/pytorch/pytorch/pull/115647 may fix it

This PR is fixing compilation error with the following code:
```python
import torch

def func(x, a):
    n = (a * 1.234) // 8.234
    y = x + n
    return y

cfunc = torch.compile(func, dynamic=True, fullgraph=True)

device = "cuda"
x = torch.tensor(0, dtype=torch.float32, device=device)
a = 33

out = cfunc(x, a)
expected = func(x, a)
torch.testing.assert_close(out, expected)
```
Error message on Nightly:
```
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
torch._dynamo.exc.BackendCompilerFailed: backend='compile_fx_wrapper' raised:
CompilationError: at 7:38:def triton_(in_ptr0, out_ptr0, ks0, xnumel, XBLOCK : tl.constexpr):
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask)
    tmp1 = ((1.23400000000000*ks0) // 8.23400000000000)
                                      ^
AssertionError()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115751
Approved by: https://github.com/peterbell10
2023-12-13 23:17:42 +00:00
Peter Bell
ad76a4e1e7 [inductor] Allow sympy expressions to participate in type promotion (#115676)
In the test example we have `add(i64[10], sympy.Expr)` where
`sympy.Expr` is not considered a promoting arg so isn't factored into
the type promotion. However, in eager it would promote to float32.
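
A hedged illustration of the scenario (the PR's example uses a sympy expression produced inside the graph; here a dynamic Python int argument is assumed to produce one):

```python
import torch

def f(x, a):
    # With dynamic=True, `a` becomes a symbolic int and `a * 0.5` a float-valued expression.
    return x + a * 0.5

cf = torch.compile(f, dynamic=True, fullgraph=True)
x = torch.arange(10, dtype=torch.int64)

# Eager promotes int64 + Python float to the default float dtype (float32);
# after this fix the compiled result should match.
assert f(x, 7).dtype == torch.float32
assert cf(x, 7).dtype == torch.float32
```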

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115676
Approved by: https://github.com/lezcano
ghstack dependencies: #115677, #115699, #115700
2023-12-13 22:22:37 +00:00
Michael Lazos
869e52e3dd Support torch function user objects (#111765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111765
Approved by: https://github.com/jansel
2023-12-13 22:11:52 +00:00
Scott Wolchok
81321baf5c [PyTorch] Remove ArrayRefTensor::dtype (#113578)
Knocks off a few nanoseconds from CPU inference due to not having to set this field; paths that would've needed it are expensive anyway.

Differential Revision: [D51182794](https://our.internmc.facebook.com/intern/diff/D51182794/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113578
Approved by: https://github.com/khabinov, https://github.com/Neilblaze
ghstack dependencies: #112800, #113577
2023-12-13 21:32:14 +00:00
Chien-Chin Huang
8c57fde21f Let all_reduce_coalesced accept one tensor as well (#115650)
This diff introduces a change to the `all_reduce_coalesced` function in `distributed_c10d.py`. The function now accepts a single tensor as well as a list of tensors. This allows for more flexibility in the use of the function.

This is just syntactic sugar so the compiler can use `all_reduce_coalesced` without worrying about converting the input to a list.
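
A hedged usage sketch (assumes an already-initialized NCCL process group):

```python
import torch
import torch.distributed as dist

# dist.init_process_group("nccl") is assumed to have been called on every rank.
t = torch.ones(4, device="cuda")

dist.all_reduce_coalesced([t])  # existing list form
dist.all_reduce_coalesced(t)    # new: a bare tensor is accepted as well
```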

Differential Revision: [D51433236](https://our.internmc.facebook.com/intern/diff/D51433236/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115650
Approved by: https://github.com/wconstab
ghstack dependencies: #115523, #115302, #115648, #115649
2023-12-13 21:32:01 +00:00
Scott Wolchok
b9af126908 [PyTorch] Add input numel assert for minimal arrayref interface (#113577)
We currently have no shape checking on CPU IIUC. Now we at least do numel checking for the minimal arrayref interface.

Differential Revision: [D51165703](https://our.internmc.facebook.com/intern/diff/D51165703/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113577
Approved by: https://github.com/chenyang78, https://github.com/jansel
ghstack dependencies: #112800
2023-12-13 21:31:55 +00:00
Yanbo Liang
db851b1bc9 [Dynamo][7/N] Wrap python modules under torch as regular PythonModuleVariable (#115724)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115724
Approved by: https://github.com/jansel
2023-12-13 21:23:14 +00:00
Chien-Chin Huang
54d552e991 [funcol] Directly import DeviceMesh to avoid circular dependency (#115649)
This diff aims to directly import DeviceMesh from torch.distributed.device_mesh instead of importing it from dist._tensor. This is done to avoid a circular dependency issue. The code changes in each file of the diff are as follows:

- torch/distributed/_functional_collectives.py: import DeviceMesh from torch.distributed instead of dist._tensor.

Overall, this diff aims to improve the code by avoiding circular dependencies and improving the import statements.
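
A hedged sketch of the import change (module paths taken from the description above):

```python
# Before: imported indirectly through the DTensor package, which participated in the cycle.
# from torch.distributed._tensor import DeviceMesh

# After: imported directly from its own module.
from torch.distributed.device_mesh import DeviceMesh
```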

==
The above summary is generated by LLM with minor manual fixes. The following summary is by me.

The original import causes some issues when compiling DDP with compiled_autograd. The root cause of the compilation failure has not been identified, but fixing the lazy initialization is worthwhile in its own right and indirectly fixes the compilation issues for DDP.

Differential Revision: [D51857246](https://our.internmc.facebook.com/intern/diff/D51857246/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115649
Approved by: https://github.com/wconstab, https://github.com/wz337
ghstack dependencies: #115523, #115302, #115648
2023-12-13 20:44:58 +00:00
Will Constable
c90fdb9ac0 Fix torch.distributed.breakpoint (#115705)
Switches from calling breakpoint() internally to using a subclass of
Pdb.

Fixes #115685
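
A hedged usage sketch (the `rank` argument is assumed; behavior as described in the fix):

```python
import torch.distributed as dist

# Called collectively by all ranks; the chosen rank enters the Pdb subclass
# while the other ranks block until it resumes.
dist.breakpoint(rank=0)
```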

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115705
Approved by: https://github.com/wanchaol, https://github.com/fegin
2023-12-13 20:33:56 +00:00
Alexander Yermolovich
23bff71de4 [llvm][oncall] Fix build for llvm-18+ (#115652)
Summary:
https://reviews.llvm.org/D137838 moved Host.h and some other files under TargetParser.
https://github.com/llvm/llvm-project/pull/74261 Removed it from Support folder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115652
Approved by: https://github.com/davidberard98
2023-12-13 20:11:31 +00:00
soulitzer
4d8ad4fb82 Move SingletonSymNodeImpl from c10 to aten (#114895)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114895
Approved by: https://github.com/jbschlosser
2023-12-13 20:01:18 +00:00
suo
926236305f [sigmoid] fix for FX tracing unflattened modules (#115708)
Differential Revision: [D52095387](https://our.internmc.facebook.com/intern/diff/D52095387/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115708
Approved by: https://github.com/zhxchen17
2023-12-13 19:43:46 +00:00
Carlos Mocholí
75d3bbaaa2 Fix cudagraph check message (#115664)
This error message is printed when CUDAGraph trees are used with multiple device indices.

However, the message seems to say the opposite.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115664
Approved by: https://github.com/soulitzer
2023-12-13 18:44:43 +00:00
Peter Bell
42390a097b [inductor] Do variance calculation in opmath type (#115181)
Fixes #114903

Previously, large split variance reductions stored the intermediates in float16
precision, which may lead to overflow because the intermediate result is
unnormalized.

In #114903 we see two different `num_split` decisions made based on the
hardware capabilities, one of which has large enough intermediates to cause
overflows.
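
A hedged repro sketch of the failure mode (shape and scale are assumed, and whether overflow actually occurs depends on the `num_split` choice the backend makes):

```python
import torch

# The unnormalized sum of squared deviations across a large float16 reduction is
# roughly numel * var, which can exceed the float16 max (~65504) even though the
# final, normalized variance is small.
x = (torch.randn(2**22, device="cuda") * 200).half()
var = torch.compile(lambda t: t.var())(x)
print(var, x.float().var())  # should roughly agree once intermediates use opmath (fp32)
```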

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115181
Approved by: https://github.com/shunting314
2023-12-13 18:40:44 +00:00
Pavan Balaji
afa62d6237 [nccl-pg] Pass group global rank information to NCCL PG (#114736)
We were only passing a subset of the group-creation information to the
NCCL PG. Specifically, we were missing the information on which global
ranks belong to a particular PG.

This allows the NCCL PG to use this additional information for things
like better trace logging.

Test Plan:

OSS CI

Reviewers:

Subscribers:

Tasks:

Tags:

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114736
Approved by: https://github.com/kwen2501
2023-12-13 18:02:51 +00:00
Pearu Peterson
193f87857e [BC breaking] Remove check_sparse_nnz argument of gradcheck (#115658)
As in the title, per the deprecation plan.
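
A hedged migration sketch (assuming `masked=True` is the documented replacement for the removed flag):

```python
import torch
from torch.autograd import gradcheck

def fn(t):
    return (t * t).sum()

x = torch.randn(3, dtype=torch.double, requires_grad=True)
# Old (removed by this PR): gradcheck(fn, (x,), check_sparse_nnz=True)
assert gradcheck(fn, (x,), masked=True)
```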

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115658
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
2023-12-13 17:34:30 +00:00
voznesenskym
310f6ab11a [fsdp] Replace acc_grad hooking with register_post_accumulate_grad_hook on flat_param (#112184)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112184
Approved by: https://github.com/albanD
ghstack dependencies: #115315
2023-12-13 16:24:44 +00:00
Peter Bell
fb80f05ee2 [inductor] Fix angle decomposition return type (#115700)
The current decomposition always returns float32 when the input isn't complex.
Instead, we should do proper type promotion.
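
A hedged illustration of the eager behavior the decomposition should match:

```python
import torch

# angle preserves the floating dtype of real inputs rather than forcing float32.
assert torch.angle(torch.randn(4, dtype=torch.float64)).dtype == torch.float64
```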

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115700
Approved by: https://github.com/lezcano
ghstack dependencies: #115677, #115699
2023-12-13 14:16:31 +00:00
Peter Bell
9cdc80d581 [inductor] Fix torch.bernoulli decomposition return type (#115699)
Strangely enough, `torch.bernoulli` doesn't return a boolean tensor; instead
it matches the output type of the in-place bernoulli.
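
A hedged illustration of the eager behavior being matched:

```python
import torch

p = torch.full((3,), 0.5, dtype=torch.float64)
# Output dtype follows the input's dtype, not torch.bool.
assert torch.bernoulli(p).dtype == torch.float64
```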

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115699
Approved by: https://github.com/lezcano
ghstack dependencies: #115677
2023-12-13 14:16:31 +00:00
atalman
3807fc690f [OSSCI oncall] fix lint (#115737)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115737
Approved by: https://github.com/DanilBaibak
2023-12-13 14:15:26 +00:00
Scott Wolchok
f9cf6ae889 [PyTorch] AOTI: add minimal arrayref interface (#112800)
This implements an optional alternate interface to the AOTI
generated DSO, intended to increase efficiency for models running on
CPU and requiring minimal overhead. See comment in config.py for more
explanation.

This took a while to get right (e.g., I initially required 1-D
MiniArrayRef<T> for the inputs, but found that multi-dimensional
ArrayRefTensor<T> ended up simplifying the implementation and allowed
test_aot_inductor.py to run) and is somewhat intricate, so I am
anticipating that review will require some back-and-forth.

Differential Revision: [D50699890](https://our.internmc.facebook.com/intern/diff/D50699890/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D50699890/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112800
Approved by: https://github.com/chenyang78
2023-12-13 12:06:35 +00:00