Commit Graph

13670 Commits

Author SHA1 Message Date
Animesh Jain
79a04f2df9 [dynamo][guards-cpp-refactor] Permit dict version guard in DictGuardManager (#121327)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121327
Approved by: https://github.com/jansel
2024-03-08 01:24:00 +00:00
Adnan Akhundov
3d089de851 Add torch.cond support to AOT Inductor (#121120)
Summary: In this PR, `torch.cond` support and the necessary codegen infrastructure are added to the C++ wrapper (AOTInductor and friends).

Notable additions:

- A new mechanism in the Python wrapper codegen to precompile and save the Triton kernels (generated and user-defined) which haven't been covered by the active path through the control flow given the sample inputs. As we can't do the runtime autotuning of the kernels outside the active path, we precompile and save them with the `launchers[0]` (corresponding to the first config).

- Codegen infra for `torch.cond` in the C++ wrapper (ABI- and non-ABI-compatible). The `torch.cond` codegen has been slightly refactored to avoid duplication across the Python and C++ wrappers.

- More extensions of the caching sites in the wrapper code to cache per codegened graph (e.g., `codegen_int_array_var`) + some infra for tracking the current codegened graph in the wrapper (both during codegen-ing in the `Scheduler.codegen` and in the `WrapperCodeGen.generate` functions).

- New unit tests to cover the added AOT Inductor + `torch.cond` functionality.

Codegen examples from the new unit tests:

- [`test_cond_simple_abi_compatible_cpu`](https://gist.github.com/aakhundov/862d5de9aa460f5df399e1387f7b342e)
- [`test_cond_simple_abi_compatible_cuda`](https://gist.github.com/aakhundov/d70b81f95fa8cc768cedef9acacb25bb)
- [`test_cond_simple_non_abi_compatible_cpu`](https://gist.github.com/aakhundov/c0ae7a8cbb6fa311c838e1b580f9a3f6)
- [`test_cond_simple_non_abi_compatible_cuda`](https://gist.github.com/aakhundov/08b945d4e8a32c97b7f9ff6272f4a223)
- [`test_cond_nested_abi_compatible_cuda`](https://gist.github.com/aakhundov/ce664f433c53e010ce4c0d96a6c13711)
- [`test_cond_with_parameters_abi_compatible_cuda`](https://gist.github.com/aakhundov/77afbeb8eaab5c5b930a3f922a7baf12)
- [`test_cond_with_multiple_outputs_abi_compatible_cuda`](https://gist.github.com/aakhundov/8cc06105ec8a3fe88be09b3f6e32c690)
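
For orientation, a minimal eager-mode sketch of the kind of model these tests exercise (assuming the `torch.cond` higher-order op; `CondModel` is a hypothetical name, not taken from the tests):

```python
import torch

class CondModel(torch.nn.Module):
    def forward(self, p, x):
        # Both branches are traced, so AOT Inductor must codegen each subgraph.
        def true_fn(x):
            return torch.sin(x)

        def false_fn(x):
            return torch.cos(x)

        return torch.cond(p, true_fn, false_fn, [x])

m = CondModel()
print(m(torch.tensor(True), torch.randn(4)))
```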

Test Plan:

```
$ python test/inductor/test_aot_inductor.py -k test_cond
...
----------------------------------------------------------------------
Ran 42 tests in 170.619s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121120
Approved by: https://github.com/jansel, https://github.com/chenyang78
2024-03-07 22:39:57 +00:00
Scott Wolchok
4c58f2b675 [PyTorch] Use uint32_t for ProcessedNode::num_outputs (#121335)
We already use uint32_t for indexing, and the notion of a single graph node with more than four billion outputs stretches credulity.

Differential Revision: [D54598821](https://our.internmc.facebook.com/intern/diff/D54598821/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121335
Approved by: https://github.com/Skylion007
2024-03-07 21:15:05 +00:00
PyTorch MergeBot
2b1661c7a0 Revert "[compiled autograd] support custom ops backed by c++ autograd::Function (#120681)"
This reverts commit 05c256849b.

Reverted https://github.com/pytorch/pytorch/pull/120681 on behalf of https://github.com/izaitsevfb due to breaking internal builds, see D54617701 ([comment](https://github.com/pytorch/pytorch/pull/120681#issuecomment-1984214079))
2024-03-07 18:53:51 +00:00
Shengbao Zheng
60aaba4128 create function to get ProcessGroupNCCL uid (#121132)
Summary: expose ProcessGroupNCCL uid

Differential Revision: D54446056

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121132
Approved by: https://github.com/aaronenyeshi
2024-03-07 18:34:38 +00:00
Bin Bao
7e598c0053 [Inductor] Enable ABI-compatible mode for cpp-wrapper JIT (#121309)
Differential Revision: [D54617284](https://our.internmc.facebook.com/intern/diff/D54617284)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121309
Approved by: https://github.com/chenyang78
2024-03-07 14:22:06 +00:00
cyy
4305c64fea Change ATEN generator argument type to const std::optional<Generator>& (#120076)
This PR proposes using `const std::optional<Generator>&` for underlying functions to avoid unnecessary copy and move operations. The torchgen code was changed to generate the new type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120076
Approved by: https://github.com/malfet
2024-03-07 09:52:21 +00:00
Chen_Liqing
291ce86a6c Modify StorageImplCreateHelper (#118459)
I want to use tensor.untyped_storage()[a:b] for the ``PrivateUse1`` backend, but it fails. The code goes into ``THPStorage_get``:
bb6eba189f/torch/csrc/Storage.cpp (L525-L540)

Here ``torch`` creates a new ``c10::StorageImpl`` but does not take the ``PrivateUse1`` backend into account.
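
For illustration, the call in question (a sketch on CPU; the PR makes the same path respect a ``PrivateUse1`` device's registered StorageImpl creation helper):

```python
import torch

x = torch.arange(8, dtype=torch.uint8)
sub = x.untyped_storage()[2:6]   # THPStorage_get builds a new c10::StorageImpl here
print(type(sub), sub.nbytes())   # an UntypedStorage of 4 bytes
```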

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118459
Approved by: https://github.com/albanD
2024-03-07 06:26:55 +00:00
briancoutinho
b9087f8571 [profiler] Add execution_trace_observer as an optional argument to profiler (#119912)
# Update Profiler API to collect Execution Traces

## TLDR
We would like to simplify collecting Execution Trace and Kineto together. Execution Trace and Kineto both provide meaningful information that can be combined to enable benchmarking, performance analysis and simulating new hardware.
```
import torch

def main():
    with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA,
        ],
        …
        execution_trace_observer=ExecutionTraceObserver() # <<<<<<< NEW
    ) as prof:
        ...
        prof.step()
```

See test/profiler/test_profiler.py 'test_execution_trace_with_kineto' for an example of using this API.

## What are Execution Traces?
[Chakra Execution Traces](https://github.com/mlcommons/chakra/wiki) offer a graph based representation of AI/ML workloads.  It stands apart from conventional AI/ML frameworks by focusing on replay benchmarks, simulators, and emulators, prioritizing agile performance modeling and adaptable methodologies.
- Chakra is part of the MLCommons industry standard and is being adopted by other companies besides NVIDIA too.
- At Meta we have instrumented the PyPer framework to collect Execution Traces. More details on our [PyTorch implementation of Chakra can be found here](https://github.com/mlcommons/chakra/wiki)

Chakra essentially enables benchmarking and co-design for ML models without having to reproduce entire software stacks, and helps companies collaborate [[chakra paper](https://arxiv.org/pdf/2305.14516.pdf)]

## Why correlate Execution Trace with PyTorch/Kineto Trace

Execution Traces and Kineto traces provide different, complementary types of information. While PyTorch ETs focus on CPU operators with explicit dependencies between them, Kineto traces encode GPU operators with their start and end times. In addition, collecting them at different timestamps will be inaccurate, as several operations (NCCL, embedding lookup) are data dependent and may not match correctly.
Thus, it makes sense to collect both the ET and the Kineto trace together. The problem is that there are currently two separate code paths.

## Proposal
The proposal is to modify the PyTorch profiler (Kineto) API so that an Execution Trace can be collected simultaneously; see the TLDR section.

# Testing
Updated the unit test for collecting Kineto and Execution Trace together.
- Check that the collected ET has the right range of events.
- Compare two sets of IDs: record function IDs in the ET and external IDs in Kineto. We check that these differ by a constant.

```
pytest test/profiler/test_profiler.py  -k test_execution_trace_with_kineto -rP

Running 1 items in this shard

test/profiler/test_profiler.py [W execution_trace_observer.cpp:682] Enabling Execution Trace Observer
STAGE:2024-03-05 09:05:05 1119546:1119546 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[W execution_trace_observer.cpp:694] Disabling Execution Trace Observer
STAGE:2024-03-05 09:05:05 1119546:1119546 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-03-05 09:05:05 1119546:1119546 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119912
Approved by: https://github.com/sanrise, https://github.com/aaronenyeshi
2024-03-07 01:30:26 +00:00
Denis Yaroshevskiy
b0e2ed4d67 removing some macros (#120314)
Summary: Will be making some changes in the surrounding code; they are going to be easier without the macros.

Differential Revision: D54001770

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120314
Approved by: https://github.com/zhxchen17
2024-03-06 22:06:05 +00:00
Tobias Ringwald
76f3663efe Fixed a memory leak when calling from_numpy on a numpy array with an unsupported dtype (#121156)

Fixes #121138.

The lambda function that DECREFs the object is not called when the dtype conversion function throws. This PR moves the conversion before the INCREF, which prevents the memory leak.
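
A minimal repro sketch of the fixed behavior (assuming a numpy string array as the unsupported dtype):

```python
import sys

import numpy as np
import torch

arr = np.array(["a", "b"])             # numpy.str_ is not a supported dtype
before = sys.getrefcount(arr)
for _ in range(100):
    try:
        torch.from_numpy(arr)          # raises TypeError during dtype conversion
    except TypeError:
        pass
assert sys.getrefcount(arr) == before  # previously grew by one per failed call
```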

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121156
Approved by: https://github.com/soulitzer, https://github.com/albanD
2024-03-06 19:37:38 +00:00
Simon Fan
05c256849b [compiled autograd] support custom ops backed by c++ autograd::Function (#120681)
- Adds support for custom ops backed by c++ custom autograd functions, e.g. fbgemm
- Include files more granularly to avoid namespace pollution and circular imports

limitations:
- Requires the user to audit their code and opt in their custom autograd::Function via autograd::Function::is_traceable, plus possibly an additional compiled_args + apply_with_saved implementation. This was the only way I could think of to guarantee soundness.
- Will throw if we can't hash the saved_data, i.e. for any type other than list and dict not implemented in at::IValue::hash b0cfa96e82/aten/src/ATen/core/ivalue.cpp (L364)
- Can technically fail silently if both the typeid hash and the typeid string name of the custom autograd::Function collide at the same time, and an identical autograd graph containing a different custom autograd::Function with an identical implementation is called. This case seems extremely unlikely, and the only alternative to hash collisions I can think of is compiling with reflection.
- Tensors not saved via save_variables are not lifted; they are specialized on the TensorImpl*'s hash (treated as a memory address). If needed, we can lift them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120681
Approved by: https://github.com/jansel
2024-03-06 18:01:56 +00:00
PyTorch MergeBot
b529c19bdf Revert "Batch Norm Consolidation (#116092)"
This reverts commit 5680f565d5.

Reverted https://github.com/pytorch/pytorch/pull/116092 on behalf of https://github.com/jeffdaily due to broke ROCm, PR signal was clean but trunk was not, the merge should have been blocked but wasn't ([comment](https://github.com/pytorch/pytorch/pull/116092#issuecomment-1981373237))
2024-03-06 17:10:01 +00:00
Animesh Jain
e3bd6efe72 [dynamo][guards-cpp-refactor] Prevent duplication of leaf guards (#121164)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121164
Approved by: https://github.com/jansel
ghstack dependencies: #121121, #121147, #121154
2024-03-06 08:36:45 +00:00
Animesh Jain
b6b2d5b00a [dynamo][guards-cpp-refactor] Pass source name for debug ease (#121154)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121154
Approved by: https://github.com/jansel
ghstack dependencies: #121121, #121147
2024-03-06 08:36:45 +00:00
Animesh Jain
52d89d8491 [dynamo][guards-cpp-refactor] Simplify DictGuardManager by removing KeyValueDictGuardManager (#121147)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121147
Approved by: https://github.com/jansel
ghstack dependencies: #121121
2024-03-06 08:36:45 +00:00
Animesh Jain
af7f55ffc8 [dynamo][guards-cpp-refactor] Add argnames in pybind'ings (#121121)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121121
Approved by: https://github.com/jansel
2024-03-06 08:36:45 +00:00
PyTorch MergeBot
8087912622 Revert "[XPU][Profiler] Add Logic To The Profiler For Processing XPU-backend Data (#120185)"
This reverts commit 0ab2ec3738.

Reverted https://github.com/pytorch/pytorch/pull/120185 on behalf of https://github.com/briancoutinho due to This PR contains a list search in '_parse_kineto_events()' that can lead to very high cost of running this post trace, training jobs getting stuck for mins ([comment](https://github.com/pytorch/pytorch/pull/120185#issuecomment-1980180774))
2024-03-06 06:39:51 +00:00
Sheng Fu
31bfa59970 Capture primitive data type arguments for profiling python_function (#120949)
RECORD_FUNCTION in python_function only captures arguments that are Tensors. However, it is very common for users to pass non-tensor arguments to custom ops, for example the sequence length in a GPT attention custom op. My previous PR tried to capture all non-tensor arguments, but that turned out to be very expensive in some cases.

This PR adds support for primitive (or container-of-primitive) arguments in RECORD_FUNCTION.
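
A minimal sketch of the pattern this targets (the `ScaledReLU` op and its `scale` argument are hypothetical):

```python
import torch

class ScaledReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale):        # `scale` is a plain float, not a Tensor
        ctx.scale = scale
        return torch.relu(x) * scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out * ctx.scale, None

x = torch.randn(8, requires_grad=True)
with torch.profiler.profile(record_shapes=True) as prof:
    ScaledReLU.apply(x, 2.0)           # the primitive argument can now be recorded too
print(prof.key_averages().table())
```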
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120949
Approved by: https://github.com/soulitzer
2024-03-06 05:09:22 +00:00
Tugsbayasgalan Manlaibaatar
5680f565d5 Batch Norm Consolidation (#116092)
**Summary:**

This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:

```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
  aten.native_batch_norm ->
  aten._native_batch_norm_legit (export only) ->
  _batch_norm_legit_cpu/cuda (kernels, export only) ->
  _batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```

Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.

Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:

```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```

The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:

```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```

Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2 week FC
window, and the ops used in the old stack are planned to
be removed after the 6 month BC window.

Test Plan: `OpInfo` tests for `batch_norm_with_update`.

Reviewers: albanD, bdhirsh

Subscribers: albanD, bdhirsh, supriyar

Tasks: https://github.com/pytorch/pytorch/issues/111384

Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2024-03-06 04:50:46 +00:00
Valentin Andrei
8bb3e0b643 [pytorch] Name the main and autograd threads for better debugging (#121170)
The main thread and the autograd threads are latency-critical. They launch CPU/GPU/accelerator kernels, and if for some reason they get preempted, the rank can become a straggler in a distributed training application. By naming these threads we can debug performance issues that impact the latency-sensitive threads.

I used Kineto traces to verify that the thread names were propagated:

<img width="851" alt="Screenshot 2024-03-04 at 3 07 43 PM" src="https://github.com/pytorch/pytorch/assets/23515689/68b4a09c-b8e5-4f14-a5c0-6593f866c03f">

Also:

```
nvidia-smi
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   3065920      C   ...me#python#py_version_3_10     1968MiB |
|    1   N/A  N/A   3065926      C   ...me#python#py_version_3_10     1978MiB |
|    2   N/A  N/A   3065930      C   ...me#python#py_version_3_10     2084MiB |
|    3   N/A  N/A   3065936      C   ...me#python#py_version_3_10     2016MiB |
|    4   N/A  N/A   3065939      C   ...me#python#py_version_3_10     1998MiB |
|    5   N/A  N/A   3065943      C   ...me#python#py_version_3_10     2070MiB |
|    6   N/A  N/A   3065948      C   ...me#python#py_version_3_10     2026MiB |
|    7   N/A  N/A   3065952      C   ...me#python#py_version_3_10     2070MiB |
+-----------------------------------------------------------------------------+
[me@myhost ~]$ ps -T -p 3065920
    PID    SPID TTY          TIME CMD
3065920 3065920 pts/14   00:01:04 pt_main_thread
...
3065920 3092181 pts/14   00:00:40 pt_autograd_d0
3065920 3092182 pts/14   00:00:00 pt_autograd_d1
3065920 3092183 pts/14   00:00:00 pt_autograd_d2
3065920 3092184 pts/14   00:00:00 pt_autograd_d3
3065920 3092185 pts/14   00:00:00 pt_autograd_d4
3065920 3092186 pts/14   00:00:00 pt_autograd_d5
3065920 3092187 pts/14   00:00:00 pt_autograd_d6
3065920 3092188 pts/14   00:00:00 pt_autograd_d7
...

```
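
The names can also be checked from Python on Linux (a sketch that reads the kernel's per-thread `comm` files):

```python
import glob

# Expect pt_main_thread plus pt_autograd_d<N> entries after a backward pass.
for path in sorted(glob.glob("/proc/self/task/*/comm")):
    with open(path) as f:
        print(f.read().strip())
```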

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121170
Approved by: https://github.com/albanD
2024-03-05 22:15:39 +00:00
Peter Bell
34a28f01dd [Autograd] Improve error for leaf tensors as out argument to fallback (#121089)
Closes  #120988

Currently operators that hit the autograd fallback call `check_inplace`
on all mutated inputs, including out arguments. This leads to a slightly
confusing error message:
```
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
```

Compared to functions that don't fallback, which raise
```
RuntimeError: add(): functions with out=... arguments don't support automatic differentiation, but one of the arguments requires grad.
```

This changes the error message to make clear the issue is with the out argument,
but does not tighten the check to outright ban out arguments that require grad.
Instead, I use the same checks from `check_inplace` which allows non-leaf tensors
that require grad to pass without error.
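
For reference, the non-fallback message quoted above is easy to reproduce:

```python
import torch

out = torch.ones(3, requires_grad=True)   # a leaf tensor that requires grad
try:
    torch.add(torch.ones(3), torch.ones(3), out=out)
except RuntimeError as e:
    # add(): functions with out=... arguments don't support automatic
    # differentiation, but one of the arguments requires grad.
    print(e)
```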
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121089
Approved by: https://github.com/lezcano, https://github.com/soulitzer
ghstack dependencies: #121142
2024-03-05 21:13:27 +00:00
cyy
6ecd65886a Remove unnecessary const_casts (#121225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121225
Approved by: https://github.com/soulitzer
2024-03-05 17:34:24 +00:00
cyy
507611f9ae [CUDACachingAllocator] Turn Allocator::allocate into non-const (#120969)
Ideally, the method should be non-const since it changes the allocator state. Some const_casts are also removed along the way.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120969
Approved by: https://github.com/albanD
2024-03-05 09:53:05 +00:00
Bin Bao
bd19d6d822 [AOTI] Use torchgen to generate C shim functions (#120513)
Summary: The current C shim layer manually implements a C interface for a handful of ops. Obviously that's not scalable if we want to extend it to cover all aten ops. This new torchgen script automatically generates C shim interfaces for CPU and CUDA backends. The interface follows the same parameter passing rules as the current C shim layer, such as

* Use plain C data types to pass parameters
* Use AtenTensorHandle to pass at::Tensor
* Use pointer type to pass optional parameter
* Use pointer+length to pass list
* Use device_type+device_index to pass device
* When a parameter is a pointer of pointer, e.g. AtenTensorHandle**, the script generates either a list of optional values or an optional list of values

https://gist.github.com/desertfire/83701532b126c6d34dae6ba68a1b074a is an example of the generated torch/csrc/inductor/aoti_torch/generated/c_shim_cuda.cpp file. The current version doesn't generate C shim wrappers for all aten ops, and on the other hand probably generates more wrappers than needed, but it should serve as a good basis.

This PR by itself won't change AOTI codegen and thus won't introduce any FC breakage. The actual wrapper codegen changes will come in another PR with some version control flag to avoid FC breakage.

Differential Revision: [D54258087](https://our.internmc.facebook.com/intern/diff/D54258087)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120513
Approved by: https://github.com/jansel
2024-03-05 04:28:44 +00:00
Francesco Fusco
26431db939 [ONNX] Perform implicit casting of constants for the onnx::where operator (#118733) (#120619)
This PR fixes the `Where` operator being bound to different types in cases where the dtype is not explicitly set. It extends implicit casting to the onnx::Where operator and includes a corresponding unit test.

Fixes #118733
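
A sketch of the failure mode (a hypothetical model; assumes the TorchScript-based exporter): the float constant and a non-default input dtype previously left `onnx::Where` bound to mismatched types.

```python
import io

import torch

class WhereModel(torch.nn.Module):
    def forward(self, cond, x):
        # 0.5 is a constant whose dtype is not explicitly set
        return torch.where(cond, x, 0.5)

cond = torch.tensor([True, False])
x = torch.tensor([1.0, 2.0], dtype=torch.float64)  # non-default dtype
torch.onnx.export(WhereModel(), (cond, x), io.BytesIO())
```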

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120619
Approved by: https://github.com/BowenBao, https://github.com/thiagocrepaldi
2024-03-04 19:27:30 +00:00
rzou
3ef0befdc9 Better error messages for impl_abstract_pystub (#120959)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120959
Approved by: https://github.com/drisspg
2024-03-04 15:24:36 +00:00
Animesh Jain
7f81563e5e [dynamo][guards-cpp-refactor] Skip type and length check guard for DictGuardManager (#120739)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120739
Approved by: https://github.com/jansel
ghstack dependencies: #120673
2024-03-02 13:15:53 +00:00
Animesh Jain
82d1465d8d [dynamo][guards-cpp-refactor] DICT_CONTAINS guard (#120673)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120673
Approved by: https://github.com/jansel
2024-03-02 13:15:53 +00:00
Shuqiang Zhang
c8e56b4965 [c10d] dump from one and only one thread (PG0's monitor thread) (#120893)
Summary:
When there are multiple PGs in a process and a hardware failure happens,
we found that multiple PGs/threads in the same process compete to dump
the same records at the same time, which affects the reliability of the
dumps.

In this PR, we make the change such that only one thread/PG
can dump: PG0's monitor thread. We use a static variable to indicate
that something (e.g., a collective timeout) has triggered the dump
locally.

The monitor thread dumps debug info under any one of these 3 conditions:
1. the static variable is set to true by the watchdog thread when it detects
a timeout or a pipe dump signal
2. a timeout signal is received from other ranks through the TCPStore
3. there is no heartbeat from the watchdog
Test Plan:
python test/distributed/test_c10d_nccl.py -k
test_timeout_dumps_on_stuck_ranks

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120893
Approved by: https://github.com/wconstab
2024-03-02 00:13:13 +00:00
Will Constable
581fe26792 [C10D] Add ProcessGroup op_id to track ops inside coalescing region (#120745)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120745
Approved by: https://github.com/zdevito
2024-03-01 23:45:43 +00:00
albanD
8cb4855d1e Release the GIL in serialization when it is safe to do so (#120818)
In particular this ensures we release the GIL when serializing:
- PyBytes objects (this is how we get the pickle object)
- Storage objects

Other string-like objects keep the GIL, which is fine because we only use this for very small strings today (for endianness), so releasing the GIL is not important there.
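
A sketch of what this enables: another Python thread can now make progress while a save is writing bytes and storages.

```python
import io
import threading

import torch

buf = io.BytesIO()
t = torch.randn(1 << 20)

saver = threading.Thread(target=torch.save, args=(t, buf))
saver.start()
# The main thread is not blocked for the bulk of the serialization.
saver.join()
```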
Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120818
Approved by: https://github.com/colesbury
2024-03-01 22:37:26 +00:00
Simon Fan
82b356193d Move VariableInfo into its own file to avoid circular dependency (#120732)
VariableInfo is used by both `custom_function.h` (in a templated class) and `compiled_autograd.h` (in a class with some templated methods). An alternative would have been to create a `compiled_autograd.cpp` and forward-declare VariableInfo, but VariableInfo is also used in other nodes like PyNode, so it felt cleaner to do it this way.

Differential Revision: [D54287007](https://our.internmc.facebook.com/intern/diff/D54287007)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120732
Approved by: https://github.com/jansel
2024-03-01 08:48:13 +00:00
Ma Jian
518a23bb03 support bool as Scalar Type in TorchScript (#113835)
Fixes #112402
Fixes #75465

From the description in #75465, the bool type should be a subtype of int, and `register_prim_ops.cpp` already supports converting from bool to int or float.
So this patch fixes bool as a Scalar in TorchScript.
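
A minimal sketch of what this enables (a hypothetical scripted function passing a bool where a Scalar is expected):

```python
import torch

@torch.jit.script
def add_flag(x: torch.Tensor, flag: bool) -> torch.Tensor:
    return x.add(flag)   # bool is accepted as a Scalar

print(add_flag(torch.zeros(2), True))   # tensor([1., 1.])
```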

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113835
Approved by: https://github.com/davidberard98
2024-03-01 04:20:15 +00:00
PyTorch MergeBot
76d3a6bb4a Revert "[C10D] Add ProcessGroup op_id to track ops inside coalescing region (#120745)"
This reverts commit 381a7ad3f1.

Reverted https://github.com/pytorch/pytorch/pull/120745 on behalf of https://github.com/kit1980 due to The new test fails internally, see D54343421 ([comment](https://github.com/pytorch/pytorch/pull/120745#issuecomment-1972047106))
2024-02-29 22:06:13 +00:00
Edward Z. Yang
0a7666801d SymIntify prod_backward (#120776)
Fixes https://github.com/pytorch/pytorch/issues/120608

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120776
Approved by: https://github.com/albanD
2024-02-29 20:05:22 +00:00
Shuqiang Zhang
313abcdba2 [c10d] fix the unwanted reason (#120863)
Summary:
Addressing #120849. Current c10d treats a reason as a failure, hence giving
some unwanted false-positive errors. This is a quick fix, but we need to
revisit the error handling logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120863
Approved by: https://github.com/kwen2501
2024-02-29 19:58:11 +00:00
Bin Bao
52e3c78a43 [AOTI][refactor] Move a few util functions in aoti_torch (#119987)
Summary: Move these util functions from an anonymous namespace to a common header so that later torchgen-ed files can use them.

Differential Revision: [D54258088](https://our.internmc.facebook.com/intern/diff/D54258088)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119987
Approved by: https://github.com/chenyang78
2024-02-29 15:46:47 +00:00
Shengbao Zheng
5b9e5f854b [profiler] Log process group id instead of backend id (#120475)
Summary:
https://github.com/pytorch/pytorch/pull/104373 introduced backend_id
> an unique ID for the actual backend object, this is also exposed in record_param_comms, so we can correlate these collectives with the right backend object.

However, it is inconvenient to correlate collectives with the backend id. Using the pg id (uid) to correlate directly is a better solution.
This PR changes the ID exposed in record_param_comms from backend_id to pg_id.

Differential Revision: D53558257

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120475
Approved by: https://github.com/aaronenyeshi
2024-02-29 15:04:33 +00:00
Yifu Wang
f988f649be [IntraNodeComm] accept P2P buffer size as constructor argument (#120856)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120856
Approved by: https://github.com/wanchaol
ghstack dependencies: #120855
2024-02-29 11:43:52 +00:00
Yifu Wang
22b5548f5d [IntraNodeComm] refactor all_reduce variants as private methods (#120855)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120855
Approved by: https://github.com/Chillee, https://github.com/wanchaol
2024-02-29 11:43:52 +00:00
Sergii Dymchenko
09aefe1502 Fix ouput typos (#120870)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120870
Approved by: https://github.com/clee2000
2024-02-29 08:29:14 +00:00
Animesh Jain
82cbd9b131 [dynamo][guards-cpp-refactor] PythonLambdaGuardAccessor (#120730)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120730
Approved by: https://github.com/jansel
ghstack dependencies: #120864
2024-02-29 07:25:13 +00:00
Will Constable
381a7ad3f1 [C10D] Add ProcessGroup op_id to track ops inside coalescing region (#120745)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120745
Approved by: https://github.com/zdevito
ghstack dependencies: #120724, #120270
2024-02-29 01:03:31 +00:00
Will Constable
f85d3a022c [C10D] Fix pointToPoint op Flight Recording (#120270)
Fix and test issues with both coalesced and individual send/recv ops

Considered an alternate approach and then ditched it
 - alternate approach: #119757
 - reason ditched: prefer recording individual collective events inside
   coalescing region instead of just the event at the end of the region,
   which also would not have tensor sizes or opnames without additional
   state variables added

Another approach also ditched
- record events on workEnqueue instead of initWork
- reason ditched: too messy to get input/output shapes tagged on
  recording when recording in workEnqueue.  Adding the info onto the
  Work obj would be possible, but adds to overhead of copying Works
  which we do on every collective. We can get info off the input/output
  tensors directly in initWork, but we don't want to keep refs to those
  tensors alive while the work is Enqueued, so we'd have to specifically
  copy size lists or something.

This PR instead avoids creating a work inside pointToPoint when
coalescing is active; only at endCoalescing() is a work finally
initialized and enqueued. During coalescing, pointToPoint() makes a
record() call instead of creating a work, and this record() call picks
up tensor shapes and op names.

It ALSO changes initWork to accept a 'record' argument. This defaults to
false, and should only be set to true if the caller ensures the work
will be enqueued by workEnqueue, ensuring its cuda events are live when
used by flight recorder's update_state().

The testing uncovers some odd pre-existing behaviors and leaves them
alone for now. We could change some of these:
- seq starts off at 1, not 0, for the first op (but this is inconsistent)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120270
Approved by: https://github.com/shuqiangzhang
ghstack dependencies: #120724
2024-02-29 01:03:31 +00:00
Will Constable
7f4d673885 [C10D] Add record_id to flight recorder (#120724)
In cases where the sequence number is shared between events (e.g. coalesced
collectives), we want to ensure a unique (and ordered) ID per record.

Note: the records are already in a list, so their ID could be implicitly
observed.  But (1) it's a ring buffer, so absolute ID is lost once the
buffer rolls over once, (2) users may sort or process or filter their
flight records, so having the ID be an explicit member of an entry is
still useful

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120724
Approved by: https://github.com/zdevito
2024-02-29 01:03:31 +00:00
PyTorch MergeBot
4903e33e19 Revert "Capture non tensor arguments in record_function (#120017)"
This reverts commit 5c5b71b6ee.

Reverted https://github.com/pytorch/pytorch/pull/120017 on behalf of https://github.com/soulitzer due to regresses perf on autograd Function when using profiler ([comment](https://github.com/pytorch/pytorch/pull/120017#issuecomment-1969883792))
2024-02-28 20:43:33 +00:00
Jason Ansel
01ec8df6d8 [Compiled Autograd] Introduce BackwardState capture (#120382)
This adds support for backwards hooks that are *both*:
1) Interior to the graph; and
2) Dynamically generated (e.g. lambdas)

We do this by creating a BackwardState object that is used to register the hooks in the forward, then populated by dynamo *after* the forward runs.
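
A minimal sketch of the newly supported pattern (a lambda hook registered mid-graph under `torch.compile`):

```python
import torch

def fn(x):
    y = x.sin()
    y.register_hook(lambda g: g * 2)   # dynamically generated, interior to the graph
    return y.cos()

x = torch.randn(3, requires_grad=True)
torch.compile(fn)(x).sum().backward()
print(x.grad)
```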

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120382
Approved by: https://github.com/xmfan
2024-02-28 20:36:47 +00:00
Shengbao Zheng
11de40f82f [flight recorder] record process group configuration (#120262)
Summary: Record process group configuration (i.e. ranks involved in a process group) to facilitate NCCL related debugging.

Differential Revision: D53792087

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120262
Approved by: https://github.com/shuqiangzhang
2024-02-28 20:31:08 +00:00
PyTorch MergeBot
a9d9077f12 Revert "Increased compile time max GPUs to 512. Switched to int16_t DeviceIndex. (#119639)"
This reverts commit 7c556428c7.

Reverted https://github.com/pytorch/pytorch/pull/119639 on behalf of https://github.com/kit1980 due to breaking internal builds, see D54286923 ([comment](https://github.com/pytorch/pytorch/pull/119639#issuecomment-1969634480))
2024-02-28 18:57:09 +00:00
Rohan Potdar
f67c77c497 Update engine.cpp (#120773)
Minor comment fix; `backward` and `grad` are flipped here. See https://pytorch.org/docs/stable/_modules/torch/autograd.html#backward

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120773
Approved by: https://github.com/albanD, https://github.com/janeyx99, https://github.com/soulitzer
2024-02-28 18:23:35 +00:00
Xunsong, Huang
0ab2ec3738 [XPU][Profiler] Add Logic To The Profiler For Processing XPU-backend Data (#120185)
This pull request provides an update on recent advancements in the PyTorch profiler with regard to XPU backend support. Following the successful merge of a previous pull request #94502 that established a pathway for the XPU backend within PyTorch, we have now taken steps to enhance the profiler's capabilities for handling and displaying profile data directly related to the XPU backend.

# Motivation

The current pull request builds upon this foundation by refining the profiler's data processing scripts, particularly `profiler_util.py`, to accommodate XPU backend-specific profile data. The aim is to align the handling and presentation of this data with that of the CUDA backend, offering users a consistent experience across different device profiles. This includes generating outputs such as JSON files compatible with Chrome trace tooling, among other formats.

# Principles

1. Minimal Impact: The modifications introduced should support XPU backend data with minimal disruption to the existing profiling scripts.
2. Consistency: Changes should maintain stylistic and functional consistency with existing `CUDA` and `privateuse1` pathways, ensuring no adverse effects on other logic paths.
3. Exclusivity: Ensure that the new XPU pathway does not interfere with or impede other pathways.

# Solutions

### a. Pathway Identification:

Introduction of a `use_xpu` flag within `torch.autograd.profiler.profile` interfaces to distinguish XPU-specific profiling.

### b. `use_device` Logic Revision:

With the introduction of the XPU pathway, `use_device` no longer implies a binary relationship with `use_cuda`. Consequently, we have revised related logic to remove implicit assertions and establish independent device distinction.

### c. Kernel List Segregation:

To accommodate the non-binary nature of device pathways, we have enabled kernel lists to identify specific device affiliations through separate list objects.

### d. Formatted Output:

To ensure output consistency, we have employed code duplication and keyword substitution techniques to facilitate the formatting of XPU-related profile data.

# Additional Enhancements

### a. Enumerations in `.pyi` Files:

Added recognition items for `DeviceType` and `ProfilerActivity` specific to XPU.

### b. Correct DeviceType Returns:

Revised `deviceTypeFromActivity` logic to accurately differentiate between device backends, even when they share common flags such as `libkineto::ActivityType::GPU_MEMCPY`.

### c. Bug Fixes in `cuda_corr_map`:

Addressed a corner case where erroneous parent-child event relationships were formed due to shared function event identifiers. The solution involves refining `cuda_corr_map` processing to prevent a function event from being misidentified as both the linker and linkee.

# Further Abstraction

Looking forward, we acknowledge the potential for further abstraction in the codebase. The current changes necessitated by XPU support have highlighted opportunities for reducing redundancy by consolidating naming conventions and utilizing a singular `device` naming system that relies on `DeviceType` attributes or string flags for differentiation. This would involve significant refactoring to replace device-specific flags and variables. This topic needs further discussions about whether we could and when we should deprecate all those flags and variables named with `cuda`.

# Next Pull Request

The next pull request will be contingent on Kineto's adoption of Intel's forthcoming PTI-sdk library, which will enable direct usage of XPU-related tracers. Subsequent modifications to `libkineto_init()` will aim to endow PyTorch running on XPU backends with comprehensive profiling capabilities on XPU devices.

We appreciate your attention to these enhancements and welcome any feedback or questions you may have regarding these developments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120185
Approved by: https://github.com/aaronenyeshi, https://github.com/gujinghui
2024-02-28 17:50:32 +00:00
Chao Zhou
a11a49af58 Add NCCL work sequence number to work info (#120596)
Summary: Expose the sequence number in work info. The number can help applications identify an NCCL work item more precisely.

Test Plan:
1. pytest test/distributed/test_c10d_nccl.py::WorkHookTest::test_on_completion_hook_seq
2. pytest test/distributed/test_c10d_nccl.py::WorkHookTest

Differential Revision: D54180050

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120596
Approved by: https://github.com/kwen2501
2024-02-28 07:54:37 +00:00
Yu, Guangye
12995a5d9d [2/2] Intel GPU Runtime Upstreaming for Generator (#118613)
# Motivation
According to [[1/2] Intel GPU Runtime Upstreaming for Generator](https://github.com/pytorch/pytorch/pull/118528), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the second PR covers the changes under `python frontend`.

# Design
Currently, it primarily offers generator-related APIs, including

- `torch.xpu.default_generators`
- `torch.xpu.get_rng_state`
- `torch.xpu.get_rng_state_all`
- `torch.xpu.initial_seed`
- `torch.xpu.manual_seed`
- `torch.xpu.manual_seed_all`
- `torch.xpu.seed`
- `torch.xpu.seed_all`
- `torch.xpu.set_rng_state`
- `torch.xpu.set_rng_state_all`

# Additional Context
The differences with CUDA:
The generator-related frontend Python APIs map 1:1 to CUDA's.
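
A usage sketch of the listed APIs (a hypothetical snippet; requires a PyTorch build with XPU support and an available XPU device):

```python
import torch

torch.xpu.manual_seed(42)          # seed the current XPU device
state = torch.xpu.get_rng_state()  # snapshot the RNG state
print(torch.xpu.initial_seed())    # 42
torch.xpu.set_rng_state(state)     # restore it later
```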

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118613
Approved by: https://github.com/gujinghui, https://github.com/EikanWang, https://github.com/jgong5, https://github.com/albanD
2024-02-28 05:28:11 +00:00
Yang Chen
1627d9e06d [aot_inductor] added a utility function aoti_torch_print_tensor_handle (#120660)
Added a function to print tensor values for a tensor handle.
It can be injected into the cpp wrapper code to help debug
numerical issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120660
Approved by: https://github.com/desertfire
2024-02-28 02:08:34 +00:00
Yu, Guangye
1aa9099839 [CLANGTIDY] Enable clang-tidy in torch/csrc/xpu (#120616)
# Motivation
Following [#118504](https://github.com/pytorch/pytorch/pull/118504), this enables clang-tidy in `torch/csrc/xpu`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120616
Approved by: https://github.com/albanD
2024-02-28 01:35:25 +00:00
Tobias Ringwald
7c556428c7 Increased compile time max GPUs to 512. Switched to int16_t DeviceIndex. (#119639)
Fixes #115331.

This PR increases the number of valid GPU devices to 512 (from 64) in order to future-proof PyTorch for providers that offer [single nodes with a large device count](https://www.tensorwave.com/). Until now, `DeviceIndex` was an `int8_t`, thus multiple changes were necessary:

- `DeviceIndex` changed to `int16_t`. Updated consumers that assume it to be an `int8_t`.
- Updated bounds checking for `torch.device()` in the Python frontend. Right now, we allow funny things like `torch.device('cpu', 200).index == -56`, which is undefined behavior. I inserted some checks to only allow values between 0 and `c10::Device::MAX_NUM_DEVICES - 1`.
- Updated the `ArgumentInfo` struct as it hardcodes the device index as an 8-bit field [^1]. Might be a breaking change, not sure if users rely on this.
- Introduced `c10::Device::MAX_NUM_DEVICES` as a replacement for the old `C10_COMPILE_TIME_MAX_GPUS`

[^1]: This field was unsigned, so I guess this has also been undef behavior the whole time? Our default device index is -1, so this always wrapped around to 255 when written to the `ArgumentInfo` struct. When I switched the `DeviceIndex` to `int16_t`, it actually stayed 255 after unpacking from `ArgumentInfo` again, as the `DeviceIndex` was now wide enough that it didn't wrap back to -1.
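
A sketch of the new bounds checking (the exact error type and message here are illustrative):

```python
import torch

print(torch.device("cuda", 7).index)   # 7, as before

try:
    torch.device("cpu", 600)           # out of range: rejected instead of wrapping
except (RuntimeError, ValueError) as e:
    print(e)
```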
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119639
Approved by: https://github.com/cyyever, https://github.com/albanD, https://github.com/huydhn
2024-02-27 07:05:48 +00:00
Levy Zhao
b6139b1e57 [PyTorch][CUDA Caching Allocator] Export sync-stream-and-free-HBM counter in memory_stats for performance debugging (#120050)
Differential Revision: D53734057

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120050
Approved by: https://github.com/xw285cornell
2024-02-27 04:34:53 +00:00
Animesh Jain
63f874b476 [dynamo][guards-cpp-refactor] DictGetItemGuardAccessor for f_locals (#120593)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120593
Approved by: https://github.com/jansel
2024-02-27 03:13:55 +00:00
William Wen
ecb3f33a1a [dynamo] fix segfault in _debug_get_cache_entry_list (#120635)
Fix https://github.com/pytorch/pytorch/issues/120607.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120635
Approved by: https://github.com/jansel
2024-02-26 23:31:09 +00:00
Animesh Jain
a299db2983 [dynamo][guards-cpp-refactor] NO_HASATTR guard (#120469)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120469
Approved by: https://github.com/jansel
2024-02-26 04:37:40 +00:00
Shan19900305
685d862c45 Add SparsePrivateUse1 in backend_to_string, layout_from_backend and check_base_legacy_new. (#119263)
1) Use the items stored in torch._tensor_classes to check the item passed from the Python side;
2) Add SparsePrivateUse1 to backend_to_string, layout_from_backend and check_base_legacy_new;
3) Use a more general API to get the Python module name in the get_storage_obj and get_name functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119263
Approved by: https://github.com/ezyang
2024-02-26 01:54:30 +00:00
Animesh Jain
4328e772bf [dynamo][guards-cpp-refactor] DICT_VERSION guard (#120416)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120416
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089, #120091, #120119, #120123, #120093, #120096, #120342, #120344, #120359
2024-02-25 23:24:24 +00:00
Animesh Jain
c269e48af0 [dynamo][guards-cpp-refactor] DictGuardManager (#120359)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120359
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089, #120091, #120119, #120123, #120093, #120096, #120342, #120344
2024-02-25 23:24:24 +00:00
Animesh Jain
775a4388d9 [dynamo][guards-cpp-refactor] WEAKREF_ALIVE guard (#120344)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120344
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089, #120091, #120119, #120123, #120093, #120096, #120342
2024-02-25 23:24:04 +00:00
cyy
81f0b2c14e [Clang-tidy header][19/N] Enable clang-tidy on torch/csrc/autograd/profiler_legacy.* (#120552)
This PR enables clang-tidy on torch/csrc/autograd/profiler_legacy.* and cleans up some clang-tidy path rules.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120552
Approved by: https://github.com/Skylion007
2024-02-25 03:29:40 +00:00
Shuqiang Zhang
8e20385447 [c10d] fix the macro definition of NCCL_COMM_DUMP (#120502)
Summary:
Only if both macros are defined should we dump the comm debug info;
otherwise, use the original definition.

The previous implementation missed the function definition when IS_NCCL_EXP is defined but NCCL_COMM_DUMP is not.

Test Plan:
Build and unit test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120502
Approved by: https://github.com/dsjohns2, https://github.com/Skylion007
2024-02-23 20:59:39 +00:00
Animesh Jain
007606e520 [dynamo][guards-cpp-refactor] TENSOR_MATCH guard (#120342)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120342
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089, #120091, #120119, #120123, #120093, #120096
2024-02-23 20:10:09 +00:00
Animesh Jain
4b65d192f0 [dynamo][guards-cpp-refactor] DYNAMIC_INDICES guard (#120096)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120096
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089, #120091, #120119, #120123, #120093
2024-02-23 20:10:09 +00:00
Animesh Jain
a92ce46dc3 [dynamo][guards-cpp-refactor] GlobalWeakRefGuardAccessor (#120093)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120093
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089, #120091, #120119, #120123
2024-02-23 20:10:01 +00:00
Animesh Jain
bb331b1eb5 [dynamo][guards-cpp-refactor] LENGTH_CHECK guard (#120123)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120123
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089, #120091, #120119
2024-02-23 20:09:52 +00:00
Animesh Jain
2eac593ffd [dynamo][guards-cpp-refactor] TUPLE_ITERATOR_LEN guard (#120119)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120119
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089, #120091
2024-02-23 20:09:43 +00:00
Animesh Jain
da95421f05 [dynamo][guards-cpp-refactor] TupleIteratorGetItemAccessor (#120091)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120091
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089
2024-02-23 20:09:34 +00:00
Shuqiang Zhang
39f0a5ecc9 [c10d] simplify the dump timeout logic and unify the async call (#120331)
Summary:
The current dump timeout logic is a bit cumbersome as it needs two times: 1.
the timeout, 2. the wake-up time. In theory the caller just needs to wait
for at most the timeout value for the dump and then declare the dump to be
either successful or not. We also unify the async calls using std::async
instead of a customized async launch function for each operation.
Test Plan:
Unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120331
Approved by: https://github.com/wconstab
2024-02-23 19:46:40 +00:00
cyy
97918e8c37 [Clang-tidy header][18/N] Enable clang-tidy on headers in torch/csrc/cuda (#118504)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118504
Approved by: https://github.com/albanD
2024-02-23 16:47:33 +00:00
Shuqiang Zhang
2b0168aeb0 [c10d] update the work progress of PG periodically (#120438)
Summary:
Previously, I added lastEnqueuedSeq_ and lastCompletedSeq_ to store the states of PG progress,
but they were logged only when a timeout was detected.

We found this is not enough, since the 'straggler' itself might not detect
the timeout, in which case there is no log from the 'straggler'.

In this PR, we log these states periodically so that it is
much easier to identify the straggler by checking which rank
has the smallest lastEnqueuedSeq_.
Test Plan:
Added logging; build succeeds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120438
Approved by: https://github.com/wconstab, https://github.com/XilunWu, https://github.com/kwen2501
2024-02-23 01:40:43 +00:00
Yifu Wang
1c9fc720ae Change the .clone() in native funcol's all_reduce to use at::MemoryFormat::Contiguous (#120042)
Summary:
While I think it probably makes more sense to only require `all_reduce` input to be non-overlapping and dense, today `ProcessGroupNCCL` requires it to be contiguous. This is also what the `all_reduce` in non-native funcol does.

Also marking a test affected by this with `@run_with_both_funcol_impls`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120042
Approved by: https://github.com/wanchaol
2024-02-22 20:24:15 +00:00
briancoutinho
b88621040a [profiler] Add kineto init delay when used in daemon mode (#120276)
Fixes #112389

## About

The PyTorch (Kineto) profiler registers with the profiling daemon Dynolog to enable on-demand profiling. The user should only need to set the env variable `KINETO_USE_DAEMON`. To enable this we need to initialize the Kineto library early rather than lazily on a PyTorch profiler call. This initialization happens in a static initializer.
- Kineto init function basically registers a callback using the CUDA CUPTI library https://github.com/pytorch/kineto/blob/main/libkineto/src/init.cpp#L130-L148
- However, the above needs the dynamic linking to libcupti.so to have taken place.
- I understand now that static initializations of compilation units will be called before the dynamic linking, leading to the segfault in #112389

![image](https://github.com/pytorch/pytorch/assets/6922212/29c9e79b-8080-4198-aaae-8a5696dccaec)

## Workaround
We add a delay in the initialization that can be configured using the env variable 'KINETO_DAEMON_INIT_DELAY_S'. This may not be the best approach, but it could help resolve the issue.

## Testing
Tested this out with [linear_model_example.py](https://github.com/facebookincubator/dynolog/blob/main/scripts/pytorch/linear_model_example.py)
First export the daemon env variable

### Without any delay
```
>$ python3 linear_model_example.py

INFO:2024-02-21 19:34:50 2366287:2366287 init.cpp:131] Registering daemon config loader, cpuOnly =  1
INFO:2024-02-21 19:34:50 2366287:2366287 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:34:50 2366287:2366287 IpcFabricConfigClient.cpp:93] Setting up IPC Fabric at endpoint: dynoconfigclientb8f91363-d8d6-47a7-9103-197661e28397 status = initialized
INFO:2024-02-21 19:34:50 2366287:2366287 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:34:50 2366287:2366287 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
cpu
99 1385.468505859375
```

### With 5 seconds delay
```
>$ KINETO_DAEMON_INIT_DELAY_S=5 python3 linear_model_example.py

cpu
99 284.82305908203125
10099 8.817167282104492
INFO:2024-02-21 19:34:26 2359155:2359214 init.cpp:131] Registering daemon config loader, cpuOnly =  1
ERROR: External init callback must run in same thread as registerClient (1782580992 != -1922169024)
INFO:2024-02-21 19:34:26 2359155:2359214 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:34:26 2359155:2359214 IpcFabricConfigClient.cpp:93] Setting up IPC Fabric at endpoint: dynoconfigclient49270a3f-e913-4ea6-b9e0-cc90a853a869 status = initialized
INFO:2024-02-21 19:34:26 2359155:2359214 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:34:26 2359155:2359214 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
20099 8.817167282104492
```

### With an invalid delay
```
>$ KINETO_DAEMON_INIT_DELAY_S=abc python3 linear_model_example.py

INFO:2024-02-21 19:35:02 2369647:2369647 init.cpp:131] Registering daemon config loader, cpuOnly =  1
INFO:2024-02-21 19:35:02 2369647:2369647 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:35:02 2369647:2369647 IpcFabricConfigClient.cpp:93] Setting up IPC Fabric at endpoint: dynoconfigclient0e12a349-af7b-4322-901d-1ff22f91fd4c status = initialized
INFO:2024-02-21 19:35:02 2369647:2369647 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:35:02 2369647:2369647 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
cpu
```

### Unit test updated as well.

## Impact
This should not impact any general user. The initialization only occurs if `KINETO_USE_DAEMON` is set in the environment.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120276
Approved by: https://github.com/anupambhatnagar, https://github.com/aaronenyeshi
2024-02-22 18:17:33 +00:00
Sheng Fu
5c5b71b6ee Capture non tensor arguments in record_function (#120017)
Summary: RECORD_FUNCTION only captures an argument when it is a Tensor. However, it is very common for users to pass arguments with primitive data types (int, float, index, bool). This diff adds support for non-tensor arguments in RECORD_FUNCTION.

Test Plan:
unit test
    buck test  mode/dev-nosan caffe2/test:profiler -- test_execution_trace_with_pt2 test_execution_trace_alone test_execution_trace_with_kineto test_execution_trace_start_stop test_execution_trace_repeat_in_loop test_execution_trace_no_capture

Differential Revision: D53674768

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120017
Approved by: https://github.com/soulitzer
2024-02-22 09:40:08 +00:00
PyTorch MergeBot
fff9d98e58 Revert "Increased compile time max GPUs to 512. Switched to int16_t DeviceIndex. (#119639)"
This reverts commit e0268821dd.

Reverted https://github.com/pytorch/pytorch/pull/119639 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think the Window failures are legit as they are failing now in trunk, i.e. 450339ab2d ([comment](https://github.com/pytorch/pytorch/pull/119639#issuecomment-1958428416))
2024-02-22 00:12:54 +00:00
Tobias Ringwald
e0268821dd Increased compile time max GPUs to 512. Switched to int16_t DeviceIndex. (#119639)
Fixes #115331.

This PR increases the number of valid GPU devices to 512 (from 64) in order to future-proof PyTorch for providers that offer [single nodes with a large device count](https://www.tensorwave.com/). Until now, `DeviceIndex` was an `int8_t`, thus multiple changes were necessary:

- `DeviceIndex` changed to `int16_t`. Updated consumers that assume it to be an `int8_t`.
- Updated bounds checking for `torch.device()` in the Python frontend. Right now, we allow funny things like `torch.device('cpu', 200).index == -56`, which is undefined behavior. I inserted some checks to only allow values between 0 and `c10::Device::MAX_NUM_DEVICES - 1`.
- Updated the `ArgumentInfo` struct as it hardcodes the device index as an 8-bit field [^1]. Might be a breaking change, not sure if users rely on this.
- Introduced `c10::Device::MAX_NUM_DEVICES` as a replacement for the old `C10_COMPILE_TIME_MAX_GPUS`

[^1]: This field was unsigned, so I guess this has also been undef behavior the whole time? Our default device index is -1, so this always wrapped around to 255 when written to the `ArgumentInfo` struct. When I switched the `DeviceIndex` to `int16_t`, it actually stayed 255 after unpacking from `ArgumentInfo` again, as the `DeviceIndex` was now wide enough that it didn't wrap back to -1.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119639
Approved by: https://github.com/cyyever, https://github.com/albanD
2024-02-21 21:10:49 +00:00
soulitzer
27c5bbe5cb Add is_nested_int() (#119975)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119975
Approved by: https://github.com/jbschlosser
ghstack dependencies: #119661, #119974
2024-02-21 21:10:02 +00:00
Animesh Jain
9c64068ef8 [dynamo][guards-cpp-refactor] TypeGuardAccessor (#120089)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120089
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068
2024-02-21 17:56:48 +00:00
Animesh Jain
ec6783990a [dynamo][guards-cpp-refactor] GlobalsGuardAccessor (#120068)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120068
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067
2024-02-21 17:56:48 +00:00
Animesh Jain
66c52d678f [dynamo][guards-cpp-refactor] GetItemGuardAccessor (#120067)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120067
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065
2024-02-21 17:56:36 +00:00
Animesh Jain
7a0c2a9d0a [dynamo][guards-cpp-refactor] NO_TENSOR_ALIASING guard (#120065)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120065
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064
2024-02-21 17:56:18 +00:00
Animesh Jain
8d5ae8c0b3 [dynamo][guards-cpp-refactor] TENSOR_ALIASING guard (#120064)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120064
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062
2024-02-21 17:56:05 +00:00
Animesh Jain
034955b2fc [dynamo][guards-cpp-refactor] DATA_PTR_MATCH guard (#120062)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120062
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061
2024-02-21 17:55:46 +00:00
Animesh Jain
cc6cf89c30 [dynamo][guards-cpp-refactor] GLOBAL_STATE guard (#120061)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120061
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060
2024-02-21 17:55:32 +00:00
Animesh Jain
5066bec743 [dynamo][guards-cpp-refactor] DEFAULT_DEVICE guard (#120060)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120060
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833
2024-02-21 17:55:17 +00:00
Shuqiang Zhang
a24cba35b0 [c10d][flight recorder] dump additinal NCCL debug info (#120063)
Summary:
This PR is mainly about the flight recorder side of the changes: it takes a
map of maps as input and dumps it as a picklable object. It also adds functions
that should be compiled only when NCCL_COMM_DUMP is defined.
Test Plan:
Integration tests with NCCL will be done later; here we only do the
c10d side of the dump test, aka NCCLTraceTest.

Testing the dump function is a bit tricky as we don't have
existing C++ unit tests for it. So we still use the Python NCCLTraceTest with
the Python binding of _dump_nccl_trace(): we manually feed
dump_nccl_trace with a map of test info, assert the pickled result, and
print the converted Python dict:
```
(sqzhang_1) [sqzhang@devgpu009.cln1 ~/pytorch (main)]$  python
test/distributed/test_c10d_nccl.py NCCLTraceTest
NCCL version 2.19.3+cuda12.0
[rank0]:[E ProcessGroupNCCL.cpp:1200] [PG 0 Rank 0] ProcessGroupNCCL
preparing to dump debug info.
.NCCL version 2.19.3+cuda12.0
.NCCL version 2.19.3+cuda12.0
{'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2':
'Value2', 'Key1': 'Value1'}}
{'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2':
'Value2', 'Key1': 'Value1'}}
.NCCL version 2.19.3+cuda12.0
{'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2':
'Value2', 'Key1': 'Value1'}}
{'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2':
'Value2', 'Key1': 'Value1'}}
.NCCL version 2.19.3+cuda12.0
.NCCL version 2.19.3+cuda12.0
.NCCL version 2.19.3+cuda12.0
.NCCL version 2.19.3+cuda12.0
.
----------------------------------------------------------------------
Ran 8 tests in 95.761s
OK
```
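
For reference, a minimal sketch of consuming the pickled dump from the Python side (assuming a build where the `_dump_nccl_trace` binding is available):

```python
import pickle

from torch._C._distributed_c10d import _dump_nccl_trace

# The dump is pickled bytes; the NCCL debug info added here appears as a
# map of maps, e.g. {'ncclID1': {'Key1': 'Value1', ...}, ...}.
info = pickle.loads(_dump_nccl_trace())
print(info)
```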

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120063
Approved by: https://github.com/wconstab
2024-02-21 16:35:23 +00:00
cyy
3cd6a21e8f [DeviceIndex][6/N] Use DeviceIndex in more places (#120133)
This PR follows the series of patches beginning with #119142 and fixes various XPU and python related methods to use DeviceIndex.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120133
Approved by: https://github.com/Skylion007
2024-02-21 06:24:23 +00:00
Yifu Wang
2d6c0cc81b Run test_functional_api.py with both legacy and native funcol impls (#119982)
Additional changes: the tests in test_functional_api.py use a multi-threaded pg, which is implemented in Python. For the native ops to call into the Python pg implementation, glue code in PyProcessGroup is required for each collective. This PR also adds a few pieces of previously missing glue code, which are necessary for running test_functional_api.py with native funcol.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119982
Approved by: https://github.com/wanchaol
2024-02-20 21:15:37 +00:00
Animesh Jain
389b56b4c4 [dynamo][guards-cpp-refactor] GetAttrGuardAccessor (#119833)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119833
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827
2024-02-20 05:33:08 +00:00
Animesh Jain
96f45d15d8 [dynamo][guards-c++-refactor] EQUALS_MATCH guard (#119827)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119827
Approved by: https://github.com/jansel
ghstack dependencies: #119822
2024-02-20 05:33:08 +00:00
Animesh Jain
0802951081 [dynamo][guards-c++-refactor] Introduce LeafGuard, GuardManager and GuardAccessor classes (#119822)
The full blown implementation is in this stack - https://github.com/pytorch/pytorch/pull/110590 which is passing all the test cases on CI. That stack is hard to review. So, breaking apart.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119822
Approved by: https://github.com/jansel
2024-02-20 05:33:08 +00:00
Yifu Wang
40786ca509 Handle unwaited work objects on process termination (#119881)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119881
Approved by: https://github.com/wconstab
2024-02-19 02:46:02 +00:00
cyy
a9953a5ef3 Remove unused c10/util/C++17.h inclusion and outdated checks (#120149)
This continues the work of cleaning up pre-C++17 code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120149
Approved by: https://github.com/ezyang
2024-02-17 14:28:17 +00:00
Shuqiang Zhang
30000aa3fd [c10d] remove one line of verbose log (#120138)
Summary:
I don't find existing DBG-mode support in c10d. This line is flooding the log; removing it to unblock users.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120138
Approved by: https://github.com/wconstab
2024-02-17 06:39:57 +00:00
Bin Bao
fa0e39560c [AOTI] Fix a typo (#120094)
Differential Revision: [D53861810](https://our.internmc.facebook.com/intern/diff/D53861810)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120094
Approved by: https://github.com/khabinov, https://github.com/sijiac
2024-02-17 05:28:58 +00:00
Aaron Enye Shi
7973ac586d [Memory Snapshot] Add CUDAAllocatorConfig details into snapshot metadata (#119404)
Summary:
Include the CUDAAllocatorConfig at the time of snapshot into the snapshot file. These include adding variables:

```
  double garbage_collection_threshold;
  size_t max_split_size;
  size_t pinned_num_register_threads;
  bool expandable_segments;
  bool release_lock_on_cudamalloc;
  bool pinned_use_cuda_host_register;
  std::string last_allocator_settings;
  std::vector<size_t> roundup_power2_divisions;
```

Test Plan:
`PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True ` produces
```
{'PYTORCH_CUDA_ALLOC_CONF': 'expandable_segments:True',
 'max_split_size': -1,
 'garbage_collection_threshold': 0.0,
 'expandable_segments': True,
 'pinned_num_register_threads': 1,
 'release_lock_on_cudamalloc': False,
 'pinned_use_cuda_host_register': False,
 'roundup_power2_divisions': {'1': 0,
  '2': 0,
  '4': 0,
  '8': 0,
  '16': 0,
  '32': 0,
  '64': 0,
  '128': 0,
  '256': 0,
  '512': 0,
  '1024': 0,
  '2048': 0,
  '4096': 0,
  '8192': 0,
  '16384': 0,
  '32768': 0}}
```
`PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:2000,roundup_power2_divisions:[256:1,512:2,1024:4,>:8]"` produces
```
{'PYTORCH_CUDA_ALLOC_CONF': 'max_split_size_mb:2000,roundup_power2_divisions:[256:1,512:2,1024:4,>:8]',
 'max_split_size': 2097152000,
 'garbage_collection_threshold': 0.0,
 'expandable_segments': False,
 'pinned_num_register_threads': 1,
 'release_lock_on_cudamalloc': False,
 'pinned_use_cuda_host_register': False,
 'roundup_power2_divisions': {'1': 1, '2': 1, '4': 1, '8': 1, '16': 1, '32': 1, '64': 1, '128': 1, '256': 1, '512': 2, '1024': 8, '2048': 8, '4096': 8, '8192': 8, '16384': 8, '32768': 8}
}
```

Differential Revision: D53536199

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119404
Approved by: https://github.com/zdevito
2024-02-17 01:16:37 +00:00
Yifu Wang
4ac857f94e Support broadcast in native funcol (#119229)
### Summary

@LucasLLC recently implemented `broadcast` in funcol. This is not yet available in the native funcol ops. This PR adds support for broadcast for native funcol.

- Added `_c10d_functional::broadcast` and `_c10d_functional::broadcast_`
- Integrated with the Python funcol broadcast and `AsyncCollectiveTensor` (see the sketch below the list)
- Implemented Inductor lowering. Verified correctness and buffer reuse behavior
- Validated dynamo traceability
- Validated AOTInductor compile-ability
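
A minimal usage sketch of the Python-side broadcast (a sketch assuming a torchrun-style launch with an NCCL process group; `wait()` follows `AsyncCollectiveTensor` semantics):

```python
import torch
import torch.distributed as dist
import torch.distributed._functional_collectives as funcol

dist.init_process_group("nccl")  # assumes env:// rendezvous is set up
t = torch.randn(4, device="cuda")
out = funcol.broadcast(t, src=0, group=dist.group.WORLD)
result = out.wait()  # AsyncCollectiveTensor also syncs on first use
```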

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119229
Approved by: https://github.com/wanchaol
ghstack dependencies: #119104
2024-02-16 21:01:34 +00:00
soulitzer
312ce35c1f Rename singleton int to nested int (#119661)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119661
Approved by: https://github.com/ezyang
2024-02-16 19:21:17 +00:00
Dan Johnson
124c251510 Guarantee init cuda before attaching hooks (#120052)
Summary: If cuda is not initialized before calling attachAllocatorTraceTracker, then the CudaCachingAllocator device_allocator is empty, which means that the registration hooks are not set up. As a result, a new segment_alloc will not be registered, causing an expensive dynamic registration each time the segment is used. The fix is to guarantee that cuda is initialized before attaching the hooks. If cuda is already initialized, then this lazyInitCUDA is a no-op.

Test Plan:
Testing this on fsdp+tp example model where cuda is not initialized before init_process_group.

Job without the fix keeps dynamically registering:
https://www.internalfb.com/mlhub/pipelines/runs/mast/torchx-fsdp_2d_main-j544j0vn7zqh4c?job_attempt=0&version=0&env=PRODUCTION
The following keeps looping:
[0]:2024-02-14T10:48:18.873079 twshared0039:4836:6232 [0] NCCL INFO CTRAN-MAPPER: registered buffer 0x7f6ebe000000 len 608124000, state 1
[0]:2024-02-14T10:48:18.873087 twshared0039:4836:6232 [0] NCCL INFO *dynamicRegist = true
[0]:2024-02-14T10:48:18.903234 twshared0039:4836:6232 [0] NCCL INFO CTRAN-MAPPER: deregister buffer 0x7f6ebe000000 len 608124000, state 1
[0]:2024-02-14T10:48:18.903240 twshared0039:4836:6232 [0] NCCL INFO CTRAN-MAPPER: deregiter buffer 0x7f6ebe000000 len 608124000

Job with the fix does not have this issue:
https://www.internalfb.com/mlhub/pipelines/runs/mast/torchx-fsdp_2d_main-hzm5dwqncr7l7?version=0&env=PRODUCTION

Reviewed By: minsii, kwen2501, xw285cornell

Differential Revision: D53770989

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120052
Approved by: https://github.com/kwen2501
2024-02-16 17:36:53 +00:00
Yu, Guangye
8f9f12c068 Intel GPU Runtime Upstreaming for Device Allocator (#118091)
# Motivation
According to [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842) and [[RFC] Intel GPU Runtime Upstreaming for Allocator](https://github.com/pytorch/pytorch/issues/116322), we will upstream the key functionality of the device `Allocator` dedicated to XPU to PyTorch. Following our design, we will prepare to generalize `Allocator` in parallel.

# Design
In the current design, XPU uses an `XPUAllocator` class, inherited from `c10::Allocator`. `XPUAllocator` is a manager to handle `DeviceCachingAllocator`, which is a per-device implementation of the caching mechanism to manage the already cached or newly allocated memory. The caching mechanism is similar to other backends, like CUDA. We can visualize the design as below.
<p align="center">
<img width="162" alt="image" src="https://github.com/pytorch/pytorch/assets/106960996/6b17b8cf-e7d1-48b4-b684-f830c409d218">
</p>

# Additional Context
We're going to implement our design gradually. This PR covers the device `Allocator` dedicated to XPU. The second PR covers the host `Allocator`.
Besides these PRs, we plan to make the device `Allocator` device-agnostic through another PR.
In this PR, our device `Allocator` has the same memory management mechanism as CUDA, but lacks features such as expandable segments and statistics. We will add these features back in a subsequent PR that generalizes `Allocator`.

The differences with CUDA:
only the key functionality is provided; AsyncAllocator, gpu_trace, history recording, graph functionality, memory snapshots, memory statistics, expandable segments, etc. are not yet supported.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118091
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/albanD
ghstack dependencies: #117611, #117619, #117734
2024-02-16 06:46:00 +00:00
Mu-Chu Lee
b8be8b639f Add Runtime Constant-Folding function of AOTInductor for AOTInductorModels used internally. (#119823)
Summary:
1. Make sure folded constants generated internally don't get exposed.
2. Add runConstantFolding and related API calls

Test Plan:
```buck2 run mode/opt-split-dwarf -c fbcode.nvcc_arch=v100,a100 caffe2/caffe2/fb/predictor/tests_gpu:pytorch_predictor_container_gpu_test -- --gtest_filter=*PyTorchPredictorContainerTest.LoadAOTInductorModel*
```
The test triggers the added predictor tests `test_aot_inductor_merge_net_file_*.predictor_20240206`,
which would trigger runConstantFolding from predictor's module loading.

Reviewed By: SherlockNoMad

Differential Revision: D53718139

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119823
Approved by: https://github.com/chenyang78
2024-02-16 06:45:48 +00:00
Yu, Guangye
4dc75f9084 Intel GPU Runtime Upstreaming for Event (#117734)
# Motivation
As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the next runtime component we would like to upstream is `Event`, which tracks the status of an operation being executed. In some circumstances, `Event` allows fine-grained control over operation execution.

# Design
`XPUEvent` is a movable but not copyable wrapper around a SYCL event. It should be created lazily on an XPU device when recording an `XPUStream`. Meanwhile, `XPUEvent` can wait for another `XPUEvent` or for all the submitted kernels on an `XPUStream` to complete. To align with the other backends, the C++ files related to `Event` will be placed in the `aten/src/ATen/xpu` folder. For the frontend, the `XPUEvent` runtime API will be bound to Python as `torch.xpu.Event`. The corresponding C++ code will be placed in `torch/csrc/xpu/Event.cpp` and the Python code in `torch/xpu/streams.py`, respectively.
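
A minimal usage sketch of the Python binding described above (a sketch assuming an XPU build; the method names are assumed to mirror `torch.cuda.Event`):

```python
import torch

# Record an event on the current XPU stream and wait for it to complete.
stream = torch.xpu.current_stream()
event = torch.xpu.Event()
event.record(stream)
event.synchronize()
```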

# Additional Context
It is worth mentioning that the `elapsed_time` method is temporarily not supported by `XPUEvent`. We will be adding support for it soon. Meanwhile `XPUEvent` doesn't support IPC from different processes. For the other parts, we have almost a 1:1 mapping with CUDA.

The following APIs are not provided:
- `torch.cuda.Event.ipc_handle`
- `CUDAEvent`'s constructor with `IpcEventHandle`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117734
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #117611, #117619
2024-02-16 06:28:26 +00:00
cyy
d4882e438a [DeviceIndex][5/N] Use DeviceIndex in more places (#119866)
This PR follows the series of patches beginning with #119142 and fixes various CUDA related methods to use DeviceIndex.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119866
Approved by: https://github.com/Skylion007
2024-02-15 07:01:43 +00:00
cyy
5f9b432494 [2/N] Replace std::tie with structural binding (#119879)
This PR follows #119774; the Python-generated code was changed to use structured bindings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119879
Approved by: https://github.com/albanD
2024-02-15 02:56:34 +00:00
Eddie Yan
cd380c794f [CUDNN][SDPA] Experimental cuDNN Flash Attention v2 Inference (#115663)
#113713

Going to clean up some of the checks and will remove draft status after.
Can be tested on SM80+ with `TORCH_CUDNN_MHA_ENABLED=1`.
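
A sketch of exercising the experimental path (shapes are illustrative; the env var gating is per the note above):

```python
# Run as: TORCH_CUDNN_MHA_ENABLED=1 python this_script.py  (SM80+ only)
import torch
import torch.nn.functional as F

q = k = v = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.half)
out = F.scaled_dot_product_attention(q, k, v)
```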

CC @drisspg @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115663
Approved by: https://github.com/drisspg
2024-02-14 22:02:06 +00:00
Joel Schlosser
9ec8dd2467 Reify view_func() closures as ViewFuncs (#118404)
Replaces `view_func()` closures with a reified `ViewFunc` data structure. Codegen generates a `ViewFunc` subclass for each view op (e.g. `NarrowViewFunc`) containing state needed to reconstruct the view. The `ViewFunc` API allows for querying and hot-swapping any `SymInt`s or `Tensors` in the state through `get_symints()` / `get_tensors()` / `clone_and_set()`, which will be essential for fake-ification later on.

```cpp
/// Base class for view functions, providing reapplication of a view on a new base.
/// Each view op should get a codegenerated subclass of this class containing
/// any state needed to reconstruct the view. The class also provides convenience
/// accessors for saved SymInts / tensor state. This is useful for e.g. fake-ification,
/// where we want to use symbolic values or fake tensors instead.
struct TORCH_API ViewFunc {
  virtual ~ViewFunc() {}
  /// Returns any SymInts in the saved state.
  virtual std::vector<c10::SymInt> get_symints() const { return {}; }
  /// Returns the number of SymInts in the saved state.
  virtual size_t num_symints() const { return 0; }
  /// Returns any tensors in the saved state.
  virtual std::vector<at::Tensor> get_tensors() const { return {}; }
  /// Returns the number of tensors in the saved state.
  virtual size_t num_tensors() const { return 0; }
  /// Reapplies the view on the given base using the saved state.
  virtual at::Tensor operator()(const at::Tensor&) const = 0;
  /// Returns a clone of this ViewFunc, optionally with the specified saved state.
  virtual std::unique_ptr<ViewFunc> clone_and_set(
      std::optional<std::vector<c10::SymInt>> = c10::nullopt,
      std::optional<std::vector<at::Tensor>> = c10::nullopt) const = 0;

protected:
  /// Sets the values of any SymInts in the saved state. The input vector size must
  /// match the number of SymInts in the saved state (i.e. the size of the list
  /// returned by get_symints()).
  virtual void set_symints(std::vector<c10::SymInt>) {}
  /// Sets the values of any Tensors in the saved state. The input vector size must
  /// match the number of Tensors in the saved state (i.e. the size of the list
  /// returned by get_tensors()).
  virtual void set_tensors(std::vector<at::Tensor>) {}
};
```

New codegen files:
* `torch/csrc/autograd/generated/ViewFunc.h`
* `torch/csrc/autograd/generated/ViewFuncs.cpp`

The templates for these also contain impls for `ChainedViewFunc` and `ErroringViewFunc`, which are used in a few places within autograd.

Example codegen for `slice.Tensor`:
```cpp
// torch/csrc/autograd/generated/ViewFuncs.h
#define SLICE_TENSOR_VIEW_FUNC_AVAILABLE
struct SliceTensorViewFunc : public torch::autograd::ViewFunc {
  SliceTensorViewFunc(int64_t dim, c10::optional<c10::SymInt> start, c10::optional<c10::SymInt> end, c10::SymInt step) : dim(dim), start(start), end(end), step(step)
  {};
  virtual ~SliceTensorViewFunc() override {};
  virtual std::vector<c10::SymInt> get_symints() const override;
  virtual size_t num_symints() const override;
  virtual std::vector<at::Tensor> get_tensors() const override;
  virtual size_t num_tensors() const override;
  virtual at::Tensor operator()(const at::Tensor&) const override;
  virtual std::unique_ptr<ViewFunc> clone_and_set(
      std::optional<std::vector<c10::SymInt>> = c10::nullopt,
      std::optional<std::vector<at::Tensor>> = c10::nullopt) const override;

protected:
  virtual void set_symints(std::vector<c10::SymInt>) override;
  virtual void set_tensors(std::vector<at::Tensor>) override;

private:
  int64_t dim;
  c10::optional<c10::SymInt> start;
  c10::optional<c10::SymInt> end;
  c10::SymInt step;
};
...

// torch/csrc/autograd/generated/ViewFuncs.cpp
std::vector<c10::SymInt> SliceTensorViewFunc::get_symints() const {
  ::std::vector<c10::SymInt> symints;
  symints.reserve((start.has_value() ? 1 : 0) + (end.has_value() ? 1 : 0) + 1);
  if(start.has_value()) symints.insert(symints.end(), *(start));
  if(end.has_value()) symints.insert(symints.end(), *(end));
  symints.push_back(step);
  return symints;
}

size_t SliceTensorViewFunc::num_symints() const {
  return static_cast<size_t>((start.has_value() ? 1 : 0) + (end.has_value() ? 1 : 0) + 1);
}

void SliceTensorViewFunc::set_symints(std::vector<c10::SymInt> symints) {
  TORCH_INTERNAL_ASSERT(symints.size() == num_symints());
  auto i = 0;
  if(start.has_value()) start = symints[i];
  i += (start.has_value() ? 1 : 0);
  if(end.has_value()) end = symints[i];
  i += (end.has_value() ? 1 : 0);
  step = symints[i];
}

std::vector<at::Tensor> SliceTensorViewFunc::get_tensors() const {
  ::std::vector<at::Tensor> tensors;
  return tensors;
}

size_t SliceTensorViewFunc::num_tensors() const {
  return static_cast<size_t>(0);
}

void SliceTensorViewFunc::set_tensors(std::vector<at::Tensor> tensors) {
  TORCH_INTERNAL_ASSERT(tensors.size() == num_tensors());

}

at::Tensor SliceTensorViewFunc::operator()(const at::Tensor& input_base) const {
  return at::_ops::slice_Tensor::call(input_base, dim, start, end, step);
}

std::unique_ptr<ViewFunc> SliceTensorViewFunc::clone_and_set(
    std::optional<std::vector<c10::SymInt>> symints,
    std::optional<std::vector<at::Tensor>> tensors) const {
  auto output = std::make_unique<SliceTensorViewFunc>(dim, start, end, step);
  if (symints.has_value()) {
    output->set_symints(std::move(*(symints)));
  }
  if (tensors.has_value()) {
    output->set_tensors(std::move(*(tensors)));
  }
  return output;
}
```

The `_view_func()` / `_view_func_unsafe()` methods now accept two additional (optional) args for `symint_visitor_fn` / `tensor_visitor_fn`. If these are defined, they are expected to be python callables that operate on a single SymInt / tensor and return a new one. This allows for the hot-swapping needed during fake-ification.
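
A minimal sketch of replaying a view through the reified `ViewFunc` from Python (the visitor argument names are taken from the description above):

```python
import torch

base = torch.randn(4, 6, requires_grad=True)
view = base.narrow(1, 1, 3)

# Replay the saved view op on a fresh base.
new_base = torch.randn(4, 6)
replayed = view._view_func(new_base)
assert replayed.shape == view.shape

# Visitor callables can hot-swap saved SymInts / tensors during replay.
replayed2 = view._view_func(new_base, symint_visitor_fn=lambda s: s)
```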

For testing, there are extensive pre-existing tests, and I added a test to ensure that hot-swapping functions correctly.
```sh
python test/test_autograd.py -k test_view_func_replay
python test/test_ops.py -k test_view_replay
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118404
Approved by: https://github.com/ezyang
2024-02-14 22:00:43 +00:00
atalman
244b124bb8 Add linux cpu test for 3.12 (#117853)
This is a continuation of the work in: https://github.com/pytorch/pytorch/pull/113987

Co-authored-by: albanD <desmaison.alban@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117853
Approved by: https://github.com/albanD
2024-02-14 20:52:23 +00:00
cyy
87c6cd2f00 [1/N] Replace std::tie with structural binding (#119774)
This PR replaces some std::tie calls with structured bindings from C++17. This not only makes the code more compact, but can also yield some performance gain.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119774
Approved by: https://github.com/albanD, https://github.com/malfet
2024-02-14 09:25:04 +00:00
Shuqiang Zhang
a45c627f27 [c10d][flight recorder] store a copy of string in entry (#119837)
Summary:
Previously, we just stored the char pointer in the entry; the string is a
temporary object and will already be destructed when we want to dump/access it.

A quick fix is to store a copy of the string, but without changing the
upstream char*.

An alternative is to change every profilingTitle into std::string; this,
however, would need a comprehensive overhaul of the code up to the
c10d::work layer above workNCCL, RecordFunction, etc.

We chose the first option for this change.

Resolves #119808

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119837
Approved by: https://github.com/zdevito, https://github.com/wconstab
2024-02-14 09:13:56 +00:00
Jason Ansel
75a6d6aef7 [inductor] Support storage resizing (#119749)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119749
Approved by: https://github.com/yf225
ghstack dependencies: #119647, #119671
2024-02-14 03:03:38 +00:00
cyy
cb0886ecf2 [DeviceIndex][4/N] Use DeviceIndex in more places (#119741)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119741
Approved by: https://github.com/aaronenyeshi, https://github.com/ezyang
2024-02-14 00:29:10 +00:00
Jason Ansel
cf117e37d5 Refactor THPStorage_resize_ (#119671)
Moving code around to allow it to be reused in the next PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119671
Approved by: https://github.com/yf225
ghstack dependencies: #119647
2024-02-13 23:28:47 +00:00
albanD
ca777fbbb7 Add Accelerator device and shell hooks (#119329)
This adds a concept of Accelerator that points to one of our devices. See DeviceAccelerator.h in this PR for details https://github.com/pytorch/pytorch/pull/119329/files#diff-83cc748bed5df1a453c272cc5ecc7e572d4eb694c5125384d8fbd17a0b5f50c8
It also adds scaffolding for shared C++ API to allow generic feature implementation. This PR in particular updates the autograd engine to use this generic API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119329
Approved by: https://github.com/ezyang, https://github.com/huydhn
2024-02-13 23:15:24 +00:00
Edward Z. Yang
6665b96ebb Rewrite maybe_reduce more carefully for unbacked SymInt (#119562)
Fixes https://github.com/pytorch/pytorch/issues/119476

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119562
Approved by: https://github.com/albanD
ghstack dependencies: #119559
2024-02-13 21:40:06 +00:00
Ke Wen
28f299a870 [c10d] Fix compilation of NCCL_EXP path (#119805)
Fixes issue pointed out in https://github.com/pytorch/pytorch/pull/119421#issuecomment-1941694621

When refactoring ProcessGroupNCCL, some code in the NCCL_EXP path wasn't done cleanly.

Cc: @kunalb @H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119805
Approved by: https://github.com/H-Huang
2024-02-13 21:26:59 +00:00
Guilherme Leobas
3319dbcd23 Update vmap guard to avoid recompilations (#119061)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119061
Approved by: https://github.com/zou3519
2024-02-13 20:50:23 +00:00
Shuqiang Zhang
abadbbc4b0 [c10d][flight recorder] remove unintended assignment of entry (#119748)
Summary:
auto& entry = entries_.at(*id % max_entries_);
entry = entries_.at(*id % max_entries_);
The second line above has the unintended consequence of invoking the copy
assignment of entry objects, since the reference itself cannot be re-assigned.

What could also cause the crash is that the entry reference could become invalid if entries_ is
resized by other threads, which could result in a 'copy to a garbage
location'. The fix is to use a pointer, which can be re-assigned after
re-acquiring the lock.

Tests: python test/distributed/test_c10d_nccl.py NCCLTraceTest

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119748
Approved by: https://github.com/wconstab, https://github.com/fegin
2024-02-13 20:18:58 +00:00
cyy
47a2e6b6b8 Fix C++20 build (#112333)
Currently the C++20 build fails because of incorrect template initialization order. This PR adjusts the order of these classes and a constructor to address the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112333
Approved by: https://github.com/albanD
2024-02-13 05:10:19 +00:00
min-jean-cho
2502a01110 Linear-BN Fusion: add precondition check (#119264)
Fixes #118990

The root cause is that `out_features` of Linear does not match `num_features` of BatchNorm, resulting in a shape mismatch while computing `fused_w` and `fused_b`. This can happen in linear-bn folding because the linear layer operates over the last dim, `(*, H_in)`, while the bn layer operates over the channel dim, `(N, C_in, H, W)`.

To preserve the shapes of the original linear weight and bias in linear-bn folding, check that linear `out_features` matches bn `num_features`. If they don't match, bn `num_features` needs to be 1 to broadcast.
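
A small sketch of the precondition (hypothetical shapes for illustration):

```python
import torch.nn as nn

# Folding is shape-safe only when linear out_features matches bn
# num_features (or num_features == 1, which broadcasts).
lin = nn.Linear(8, 16)
bn_ok = nn.BatchNorm1d(16)   # 16 == 16: foldable
bn_skip = nn.BatchNorm1d(4)  # 4 != 16 and != 1: fusion must be skipped
```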

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119264
Approved by: https://github.com/eellison
2024-02-13 04:16:34 +00:00
PyTorch MergeBot
214f06ae3a Revert "Add Accelerator device and shell hooks (#119329)"
This reverts commit 4b9568a360.

Reverted https://github.com/pytorch/pytorch/pull/119329 on behalf of https://github.com/huydhn due to Breaks internal build and requires OSS file update to fix it ([comment](https://github.com/pytorch/pytorch/pull/119329#issuecomment-1940278598))
2024-02-13 02:23:45 +00:00
cyy
10f3abc6b8 [DeviceIndex][3/N] Use DeviceIndex in more places (#119635)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119635
Approved by: https://github.com/ezyang
2024-02-12 21:31:27 +00:00
suo
82248f0b1c [export] improve FakeTensor serialization (#119531)
Recently we made it possible to serialize ExportedPrograms with fake parameters/buffers/etc.

The serialization regime was kind of whacky; basically we serialized a stub and reassembled the FakeTensor using metadata that we had stashed elsewhere in the Graph state.

This was bad for a few reasons:
- Storing the metadata separately from the actual serialized object caused situations where you could have one but not the other. An example case is if you had a FakeTensor contained inside a TorchBind object—there was no obvious place to store the metadata for this. This actually happens—TensorQueue in fbgemm does this.
- It created an annoying cycle: we had to deserialize the Graph's tensor metadata in order to deserialize (potentially faked) constants, but we need constants in order to deserialize the Graph.

This fixes all that. The basic idea is to patch the reducer function for FakeTensor at serialization time, and serialize a copy of the FakeTensor metadata. We already are policing BC for the TensorMeta schema struct so it's not a net increase in the BC surface.

As a bonus, I fixed a weird bug with torchbind tracing where we were accidentally reinterpreting a torch.ScriptObject as a torch.ScriptModule (which was the root cause of some weird behavior @bahuang was seeing last week).

Differential Revision: [D53601251](https://our.internmc.facebook.com/intern/diff/D53601251/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119531
Approved by: https://github.com/zhxchen17
2024-02-12 19:28:08 +00:00
Edward Z. Yang
482345d747 Refactor out shape test into InputMetadata::maybe_reduce (#119559)
I'm going to gut this function shortly, and having it all on
InputMetadata is convenient for this purpose.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119559
Approved by: https://github.com/soulitzer
2024-02-12 19:27:48 +00:00
Yifu Wang
27ffede878 [reland] Fix estimate_nccl_collective_runtime (#118986)
`estimate_nccl_collective_runtime` has been broken and the errors have been silently swallowed by inductor. This PR:
- Fixes the issues described in https://github.com/pytorch/pytorch/issues/118497.
- Adds white-box testing so future issues can be surfaced in tests.
- Add support for native funcol IRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118986
Approved by: https://github.com/yf225
ghstack dependencies: #119102
2024-02-12 18:48:06 +00:00
Ke Wen
b2043c0543 [c10d] PGNCCL refactor part 2: Simplify ProcessGroupNCCL into single-device style (#119421)
Part 2 and last part of #118674:
Introduce actual "single-device" code change to ProcessGroupNCCL.

assert size == 1 and test refactor have been done in #119099.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119421
Approved by: https://github.com/shuqiangzhang
2024-02-12 18:45:49 +00:00
Shuqiang Zhang
893dcac068 [c10d] explicitly abort communicators in destroy_process_group call (#119250)
Summary:
This PR tries to resolve issue #119215.

Basically, process group shutdown (and hence ncclCommAbort) is called in the
destroy_process_group APIs for the corresponding PGs, and in the
destructor of ProcessGroup we avoid calling abort/ncclCommAbort.
Instead, the destructor just checks whether users have already explicitly called destroy_process_group. If
not, it logs a warning and encourages/expects users to do so
to clean up PG resources.
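
The expected usage pattern, sketched:

```python
import torch.distributed as dist

dist.init_process_group("nccl")  # assumes env:// rendezvous is set up
# ... run collectives ...
# Explicitly shut down so communicators are aborted here; relying on the
# ProcessGroup destructor now only produces a warning.
dist.destroy_process_group()
```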

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119250
Approved by: https://github.com/minsii, https://github.com/kwen2501
2024-02-12 18:40:28 +00:00
Bin Bao
52a3de6cbf [AOTI][refactor] Move ThreadLocalCachedOutputTensor into a separate header (#119392)
Summary: Move common functionality into a separate header so that later JIT and AOT Inductor can share it.

Test Plan: CI

Differential Revision: D53523452

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119392
Approved by: https://github.com/khabinov
2024-02-12 15:56:16 +00:00
PyTorch MergeBot
24bdd03d23 Revert "Reify view_func() closures as ViewFuncs (#118404)"
This reverts commit d5a6762263.

Reverted https://github.com/pytorch/pytorch/pull/118404 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/118404#issuecomment-1938600260))
2024-02-12 12:38:51 +00:00
PyTorch MergeBot
0342b227e5 Revert "[c10d] PGNCCL refactor part 2: Simplify ProcessGroupNCCL into single-device style (#119421)"
This reverts commit f3e7d80993.

Reverted https://github.com/pytorch/pytorch/pull/119421 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/119421#issuecomment-1938169747))
2024-02-12 07:34:20 +00:00
cyy
8a3c241094 Remove unused header inclusion (#119667)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119667
Approved by: https://github.com/Skylion007
2024-02-12 05:36:25 +00:00
Mu-Chu Lee
dcb08a7044 Add CUDAEvent recording for constant folding to show up. (#119216)
Summary: Add a wrapping layer of calls so that CUDAEvent recording shows up for constant folding.

Test Plan: Existing tests

Differential Revision: D53437934

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119216
Approved by: https://github.com/khabinov
2024-02-12 03:46:36 +00:00
cyy
568740f080 [DeviceIndex][2/N] Use DeviceIndex instead of int in allocators (#119545)
Follows #119142
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119545
Approved by: https://github.com/ezyang
2024-02-10 20:27:59 +00:00
Mu-Chu Lee
e71c202520 Use CUDA if cuda's macro is set for AOTI runner's pybind (#119616)
Summary:
Use CUDA if CUDA's macro is set for the AOTI runner's pybind.
This is a duplicate of #119438, re-submitted due to landing issues.

Test Plan:
Existing tests (D52303882)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119616
Approved by: https://github.com/khabinov
2024-02-10 11:00:47 +00:00
Yu, Guangye
8fd11cb307 [2/2] Intel GPU Runtime Upstreaming for Stream (#117619)
# Motivation
Following [[1/2] Intel GPU Runtime Upstreaming for Stream](https://github.com/pytorch/pytorch/pull/117611), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), this second PR covers the changes in the Python frontend.

# Design
Currently, it primarily offers stream-related APIs (a usage sketch follows the list), including
 - `torch.xpu.StreamContext`
 - `torch.xpu.current_stream`
 - `torch.xpu.set_stream`
 - `torch.xpu.synchronize`
 - `torch._C._xpu_getCurrentRawStream`
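
A minimal usage sketch (a sketch assuming an XPU build; `torch.xpu.Stream` is assumed to mirror `torch.cuda.Stream`):

```python
import torch

s = torch.xpu.Stream()  # assumed constructor, mirroring torch.cuda.Stream
with torch.xpu.StreamContext(s):
    assert torch.xpu.current_stream() == s  # work launched here targets s
torch.xpu.synchronize()
```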

# Additional Context
We will implement functions like `torch.xpu.Stream.wait_event`, `torch.xpu.Stream.wait_stream`, and `torch.xpu.Stream.record_event` in the next PR, related to `Event`.

The differences with CUDA:
there is no default or external stream in XPU, and the following APIs are lacking:
- `torch.cuda.ExternalStream`
- `torch.cuda.default_stream`
- `torch.cuda.is_current_stream_capturing`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117619
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/albanD
ghstack dependencies: #117611
2024-02-10 03:39:42 +00:00
Ke Wen
f3e7d80993 [c10d] PGNCCL refactor part 2: Simplify ProcessGroupNCCL into single-device style (#119421)
Part 2 and last part of #118674:
Introduce actual "single-device" code change to ProcessGroupNCCL.

assert size == 1 and test refactor have been done in #119099.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119421
Approved by: https://github.com/shuqiangzhang
2024-02-09 20:23:20 +00:00
albanD
4b9568a360 Add Accelerator device and shell hooks (#119329)
This adds a concept of Accelerator that points to one of our devices. See DeviceAccelerator.h in this PR for details https://github.com/pytorch/pytorch/pull/119329/files#diff-83cc748bed5df1a453c272cc5ecc7e572d4eb694c5125384d8fbd17a0b5f50c8
It also adds scaffolding for shared C++ API to allow generic feature implementation. This PR in particular updates the autograd engine to use this generic API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119329
Approved by: https://github.com/ezyang
2024-02-09 18:54:28 +00:00
Joel Schlosser
d5a6762263 Reify view_func() closures as ViewFuncs (#118404)
Replaces `view_func()` closures with a reified `ViewFunc` data structure. Codegen generates a `ViewFunc` subclass for each view op (e.g. `NarrowViewFunc`) containing state needed to reconstruct the view. The `ViewFunc` API allows for querying and hot-swapping any `SymInt`s or `Tensors` in the state through `get_symints()` / `get_tensors()` / `clone_and_set()`, which will be essential for fake-ification later on.

```cpp
/// Base class for view functions, providing reapplication of a view on a new base.
/// Each view op should get a codegenerated subclass of this class containing
/// any state needed to reconstruct the view. The class also provides convenience
/// accessors for saved SymInts / tensor state. This is useful for e.g. fake-ification,
/// where we want to use symbolic values or fake tensors instead.
struct TORCH_API ViewFunc {
  virtual ~ViewFunc() {}
  /// Returns any SymInts in the saved state.
  virtual std::vector<c10::SymInt> get_symints() const { return {}; }
  /// Returns the number of SymInts in the saved state.
  virtual size_t num_symints() const { return 0; }
  /// Returns any tensors in the saved state.
  virtual std::vector<at::Tensor> get_tensors() const { return {}; }
  /// Returns the number of tensors in the saved state.
  virtual size_t num_tensors() const { return 0; }
  /// Reapplies the view on the given base using the saved state.
  virtual at::Tensor operator()(const at::Tensor&) const = 0;
  /// Returns a clone of this ViewFunc, optionally with the specified saved state.
  virtual std::unique_ptr<ViewFunc> clone_and_set(
      std::optional<std::vector<c10::SymInt>> = c10::nullopt,
      std::optional<std::vector<at::Tensor>> = c10::nullopt) const = 0;

protected:
  /// Sets the values of any SymInts in the saved state. The input vector size must
  /// match the number of SymInts in the saved state (i.e. the size of the list
  /// returned by get_symints()).
  virtual void set_symints(std::vector<c10::SymInt>) {}
  /// Sets the values of any Tensors in the saved state. The input vector size must
  /// match the number of Tensors in the saved state (i.e. the size of the list
  /// returned by get_tensors()).
  virtual void set_tensors(std::vector<at::Tensor>) {}
};
```

New codegen files:
* `torch/csrc/autograd/generated/ViewFunc.h`
* `torch/csrc/autograd/generated/ViewFuncs.cpp`

The templates for these also contain impls for `ChainedViewFunc` and `ErroringViewFunc`, which are used in a few places within autograd.

Example codegen for `slice.Tensor`:
```cpp
// torch/csrc/autograd/generated/ViewFuncs.h
#define SLICE_TENSOR_VIEW_FUNC_AVAILABLE
struct SliceTensorViewFunc : public torch::autograd::ViewFunc {
  SliceTensorViewFunc(int64_t dim, c10::optional<c10::SymInt> start, c10::optional<c10::SymInt> end, c10::SymInt step) : dim(dim), start(start), end(end), step(step)
  {};
  virtual ~SliceTensorViewFunc() override {};
  virtual std::vector<c10::SymInt> get_symints() const override;
  virtual size_t num_symints() const override;
  virtual std::vector<at::Tensor> get_tensors() const override;
  virtual size_t num_tensors() const override;
  virtual at::Tensor operator()(const at::Tensor&) const override;
  virtual std::unique_ptr<ViewFunc> clone_and_set(
      std::optional<std::vector<c10::SymInt>> = c10::nullopt,
      std::optional<std::vector<at::Tensor>> = c10::nullopt) const override;

protected:
  virtual void set_symints(std::vector<c10::SymInt>) override;
  virtual void set_tensors(std::vector<at::Tensor>) override;

private:
  int64_t dim;
  c10::optional<c10::SymInt> start;
  c10::optional<c10::SymInt> end;
  c10::SymInt step;
};
...

// torch/csrc/autograd/generated/ViewFuncs.cpp
std::vector<c10::SymInt> SliceTensorViewFunc::get_symints() const {
  ::std::vector<c10::SymInt> symints;
  symints.reserve((start.has_value() ? 1 : 0) + (end.has_value() ? 1 : 0) + 1);
  if(start.has_value()) symints.insert(symints.end(), *(start));
  if(end.has_value()) symints.insert(symints.end(), *(end));
  symints.push_back(step);
  return symints;
}

size_t SliceTensorViewFunc::num_symints() const {
  return static_cast<size_t>((start.has_value() ? 1 : 0) + (end.has_value() ? 1 : 0) + 1);
}

void SliceTensorViewFunc::set_symints(std::vector<c10::SymInt> symints) {
  TORCH_INTERNAL_ASSERT(symints.size() == num_symints());
  auto i = 0;
  if(start.has_value()) start = symints[i];
  i += (start.has_value() ? 1 : 0);
  if(end.has_value()) end = symints[i];
  i += (end.has_value() ? 1 : 0);
  step = symints[i];
}

std::vector<at::Tensor> SliceTensorViewFunc::get_tensors() const {
  ::std::vector<at::Tensor> tensors;
  return tensors;
}

size_t SliceTensorViewFunc::num_tensors() const {
  return static_cast<size_t>(0);
}

void SliceTensorViewFunc::set_tensors(std::vector<at::Tensor> tensors) {
  TORCH_INTERNAL_ASSERT(tensors.size() == num_tensors());

}

at::Tensor SliceTensorViewFunc::operator()(const at::Tensor& input_base) const {
  return at::_ops::slice_Tensor::call(input_base, dim, start, end, step);
}

std::unique_ptr<ViewFunc> SliceTensorViewFunc::clone_and_set(
    std::optional<std::vector<c10::SymInt>> symints,
    std::optional<std::vector<at::Tensor>> tensors) const {
  auto output = std::make_unique<SliceTensorViewFunc>(dim, start, end, step);
  if (symints.has_value()) {
    output->set_symints(std::move(*(symints)));
  }
  if (tensors.has_value()) {
    output->set_tensors(std::move(*(tensors)));
  }
  return output;
}
```

The `_view_func()` / `_view_func_unsafe()` methods now accept two additional (optional) args for `symint_visitor_fn` / `tensor_visitor_fn`. If these are defined, they are expected to be python callables that operate on a single SymInt / tensor and return a new one. This allows for the hot-swapping needed during fake-ification.

For testing, there are extensive pre-existing tests, and I added a test to ensure that hot-swapping functions correctly.
```sh
python test/test_autograd.py -k test_view_func_replay
python test/test_ops.py -k test_view_replay
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118404
Approved by: https://github.com/ezyang
2024-02-09 18:51:36 +00:00
Kurt Mohler
90dabff260 Avoid COW materialize in various operations (#119506)
Operations affected include dot, cross, scatter/gather, shape, sort,
triangular, unary, scalar, pad, complex, to_list, fft

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119506
Approved by: https://github.com/ezyang
ghstack dependencies: #119501, #119502, #119503, #119504
2024-02-09 14:47:19 +00:00
cyy
560c92c324 [DeviceIndex] Use DeviceIndex instead of int in CUDA wrappers (#119142)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119142
Approved by: https://github.com/ezyang
2024-02-08 23:00:56 +00:00
Yang Chen
9f8ade04cc [aot_inductor] replace TORCH_CHECK with AOTI_CHECK in the generate cpp code (#119220)
In some cases where we have TORCH_CHECK in loops, the host
compiler may spend hours optimizing the run_impl function. This PR
mitigates the issue by replacing TORCH_CHECK with a custom AOTI_CHECK,
where we force the underlying assert function to be noinline.

If forcing noinline causes any serious perf regression, we could
either add an option to turn noinline on/off, or add
another option to just turn AOTI_CHECK into a no-op, similar
to the ```assert``` macro from cassert.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119220
Approved by: https://github.com/hl475, https://github.com/desertfire
2024-02-08 21:57:27 +00:00
Qianli Scott Zhu
71e772f827 Update logging.cpp for explicit chrono import (#119469)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119469
Approved by: https://github.com/davidberard98
2024-02-08 21:57:23 +00:00
PyTorch MergeBot
7315ec7505 Revert "Fix estimate_nccl_collective_runtime (#118986)"
This reverts commit 0dab6fb352.

Reverted https://github.com/pytorch/pytorch/pull/118986 on behalf of https://github.com/atalman due to Breaks internal tests ([comment](https://github.com/pytorch/pytorch/pull/118986#issuecomment-1934680463))
2024-02-08 18:11:53 +00:00
Sheng Fu
2b9cba86cf Fix deadlock in ExecutionTraceObserver (#119242) (#119398)
Summary:

With a compiled PyTorch module, in execution_trace_observer.cpp, the function convertIValue calls TensorImpl->storage_offset(). That call triggers a recursive call into recordOperatorStart, causing a deadlock on ob.g_mutex.

This diff fixes the deadlock by replacing std::mutex with std::recursive_mutex.

Since PyTorch has only one thread for FWD and one thread for BWD, contention is very low and performance should NOT be a concern.

Test Plan:
Unit Test
    buck test  mode/dev-nosan caffe2/test:profiler -- test_execution_trace_with_pt2

Differential Revision: D53533253

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119398
Approved by: https://github.com/aaronenyeshi
2024-02-08 18:00:51 +00:00
zdevito
7f05c72864 [nccl flight recorder] record time we discover start and complete (#119249)
Some APIs like ncclCommAbort can cause nccl kernels to finish even if
they were previously stuck. Because we can gather the trace buffer after
those calls, we can end up seeing some collectives marked completed even though
that completion happened several minutes after they started and clearly after
the timeout. This changes how we record state so that we keep track of the time
we discover a state change; even if the collective eventually gets marked complete,
we can observe that this happened minutes after it was scheduled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119249
Approved by: https://github.com/wconstab
2024-02-08 16:48:33 +00:00
Peter Bell
08657b82f5 Reduce scope of dispatching in logcumsumexp_backward (#119397)
Everything inside the `AT_DISPATCH` block is being compiled 5 times,
so it makes sense to limit it to the only line that uses `scalar_t`, which is
the `numeric_limits` query.

Also a small optimization: instead of computing `grad.log()` and `(-grad).log()`,
we can compute `grad.abs().log()`, which is 2 pointwise ops instead of 3.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119397
Approved by: https://github.com/lezcano, https://github.com/albanD
2024-02-08 15:09:22 +00:00
Pritam Damania
f579c65ef6 Release GIL for torch::autograd::clear_autocast_cache (#119416)
Fixes #119262

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119416
Approved by: https://github.com/albanD
2024-02-08 03:22:48 +00:00
Chien-Chin Huang
1d2382f141 [DDP] Use compiled_autograd to trace DDP backward allreduce (#110662)
**Summary**
The reducer of `DistributedDataParallel` is implemented in C++, and it is not easy to trace the allreduces launched in the reducer. This PR modifies `DistributedDataParallel` to launch one allreduce per gradient when `compiled_autograd` is enabled. The changes allow us to use `compiled_autograd` to trace the allreduces, which can later be optimized (fused) in Inductor.

**Key Logic**
1. If `ddp_python_hook` is True, we assume `compiled_autograd` is used. `DistributedDataParallel` registers `compiled_accum_grad_hook` for all parameters.
2. In the first forward() call, if `DistributedDataParallel` is not compiled, all  `compiled_accum_grad_hook` are deregistered. If `DistributedDataParallel` is compiled, all `compiled_accum_grad_hook` will be compiled by `compiled_autograd`.
3. `compiled_accum_grad_hook` launches an allreduce to reduce the gradient of the parameter (a sketch of this pattern follows the list).
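
An illustrative stand-in for this pattern using the public post-accumulate-grad hook API (the real `compiled_accum_grad_hook` is internal; `ReduceOp.AVG` assumes an NCCL backend):

```python
import torch
import torch.distributed as dist

def make_allreduce_hook():
    # One allreduce per parameter gradient, launched as the grad lands;
    # assumes dist.init_process_group(...) has already been called.
    def hook(param):
        dist.all_reduce(param.grad, op=dist.ReduceOp.AVG, async_op=True)
    return hook

model = torch.nn.Linear(8, 8)
for p in model.parameters():
    p.register_post_accumulate_grad_hook(make_allreduce_hook())
```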

**Bucketing**
The compiled backward is slow because there is no bucketing for the allreduces. We rely on Inductor to bucket the allreduces.

The bucketing is done in a separate PR.

Differential Revision: [D49428482](https://our.internmc.facebook.com/intern/diff/D49428482/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110662
Approved by: https://github.com/wconstab
2024-02-08 03:03:15 +00:00
Yu, Guangye
9a992b0918 [4/4] Intel GPU Runtime Upstreaming for Device (#116869)
# Motivation
According to [[1/4] Intel GPU Runtime Upstreaming for Device](https://github.com/pytorch/pytorch/pull/116019), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), this last PR  covers the changes under lazy initialization.

# Design
This PR primarily offers support for multi-processing via lazy initialization. We lazily initialize our runtime, avoiding initializing XPU until the first time it is accessed. In our design, we extend `cuda_lazy_init` to `device_lazy_init`, a device-agnostic API that can support any backend, and change `maybe_initialize_cuda` to `maybe_initialize_device` to support lazy initialization for both CUDA and XPU while maintaining scalability.

# Additional Context
We adopt a similar design to CUDA. So we share some code with CUDA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116869
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/malfet
ghstack dependencies: #119248
2024-02-08 03:01:21 +00:00
Ke Wen
029a16c41f [c10d] PGNCCL refactor part 1: adds assert size==1 (#119099)
Breaking #118674 into multiple smaller PRs.
This is the first one.
It adds `assert size==1` to PGNCCL, and refactors some old tests written in multi-device style (which would otherwise fail at the assert).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119099
Approved by: https://github.com/wconstab, https://github.com/XilunWu
2024-02-07 22:29:29 +00:00
Natalia Gimelshein
6fe5a3adaf release GIL for cudaEventDestroy (#119393)
cudaEventDestroy can become blocking under some circumstances, and holding the GIL in that case will lead to deadlocks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119393
Approved by: https://github.com/Skylion007
2024-02-07 22:16:18 +00:00
albanD
a6e16fe202 Fix global in header warning (#119380)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119380
Approved by: https://github.com/janeyx99
2024-02-07 20:35:21 +00:00
Kaiming Ouyang
35aa353c48 Change watchdog log from "NCCL" to "Process group" (#118121)
This PR changes the watchdog log.
To avoid the impression that NCCL itself creates the watchdog thread and reports the error log, it is better to change "NCCL" to "Process group" to better indicate the source of the log.

@wconstab

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118121
Approved by: https://github.com/kwen2501, https://github.com/wconstab
2024-02-07 20:14:49 +00:00
Hirochika Matsumoto
02c24b0b5e Add Python binding resizable to class {Untyped,Typed}Storage (#119286)
This PR exposes the `resizable` method of `StorageImpl` to the Python frontend to make it accessible to users.
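
A minimal sketch of the new binding (assuming it mirrors the C++ `StorageImpl::resizable()`):

```python
import torch

t = torch.arange(4)
print(t.untyped_storage().resizable())  # True for a regular CPU storage
```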

Fixes #119233

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119286
Approved by: https://github.com/ezyang, https://github.com/mikaylagawarecki
2024-02-07 19:15:55 +00:00
Yifu Wang
0dab6fb352 Fix estimate_nccl_collective_runtime (#118986)
`estimate_nccl_collective_runtime` has been broken and the errors have been silently swallowed by inductor. This PR:
- Fixes the issues described in https://github.com/pytorch/pytorch/issues/118497.
- Adds white-box testing so future issues can be surfaced in tests.
- Add support for native funcol IRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118986
Approved by: https://github.com/yf225
ghstack dependencies: #118910, #118911, #118437
2024-02-07 18:02:51 +00:00
Bin Bao
40ec155e58 [AOTI][refactor] Split common aoti_runtime utils into a separate header (#119066)
Summary: Split common utils from aoti_runtime/model.h into a separate header file, because when turning on ABI-compatible mode for JIT Inductor we won't need AOTInductorModel, but we do need some common utils, e.g. RAIIAtenTensorHandle.

Differential Revision: [D53478809](https://our.internmc.facebook.com/intern/diff/D53478809)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119066
Approved by: https://github.com/khabinov
2024-02-07 16:54:00 +00:00
Yu, Guangye
5c46600f84 [RELAND] refactor lazy init to device-agnostic (#119248)
# Motivation
This PR intends to extend `cuda_lazy_init` to `device_lazy_init`, a device-agnostic API that can support any backend, and to change `maybe_initialize_cuda` to `maybe_initialize_device` to support lazy initialization for CUDA while maintaining scalability.

# Design
We maintain a flag for each backend to manage the lazy initialization state separately.

# Additional Context
No additional UTs are needed.
This is a reland PR; the original PR is [refactor lazy init to device-agnostic](https://github.com/pytorch/pytorch/pull/118846).
This is a common PR, and does not trigger xpu ciflow.

Differential Revision: [D53478332](https://our.internmc.facebook.com/intern/diff/D53478332)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119248
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/atalman
2024-02-07 15:58:51 +00:00
Simon Fan
1435cfecfa Increase accumulate_grad_ gradient's expected refcount to account for pybind (#119068)
Account for the pybind binding of the op holding one extra reference when torch.ops.inductor.accumulate_grad_.default is called at runtime.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119068
Approved by: https://github.com/jansel
ghstack dependencies: #118817, #119334
2024-02-07 10:25:43 +00:00
Simon Fan
8e14e1d514 Fix gradient refcounts in pybind and compiled autograd (#118817)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118817
Approved by: https://github.com/jansel
2024-02-07 10:25:42 +00:00
PyTorch MergeBot
d85631b721 Revert "Fix deadlock in ExecutionTraceObserver (#119242)"
This reverts commit 6fc775ae13.

Reverted https://github.com/pytorch/pytorch/pull/119242 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/119242#issuecomment-1931445631))
2024-02-07 07:37:22 +00:00
William Wen
ee1c2449f7 [dynamo] delete dynamo cache entry when guard function is invalidated [attempt 2] (#119107)
Attempt #2 for https://github.com/pytorch/pytorch/pull/117875 to fix https://github.com/pytorch/pytorch/issues/112090.

Summary of changes:
- ~Changed CacheEntry linked list into a doubly-linked list structure to support deletion.~ (done by C++ refactor)
- Added CacheEntry and ExtraState borrowed references to GuardFn so that GuardFn can tell ExtraState to delete CacheEntry when the GuardFn is invalidated.
- ~Added ExtraState raw reference to CacheEntry so that we can get ExtraState to correctly point to the first CacheEntry if it gets deleted.~ (done by C++ refactor)
- CacheEntry destructor needs to reset GuardFn refs to ExtraState/CacheEntry in order to prevent use-after-free.
- code_context values that are nn.GraphModules need to be weakrefs in order to prevent circular references.
- Added tests that check for memory leaks and cache deletion operations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119107
Approved by: https://github.com/jansel
2024-02-07 03:32:42 +00:00
Sheng Fu
6fc775ae13 Fix deadlock in ExecutionTraceObserver (#119242)
Summary:
With a compiled PyTorch module, the function convertIValue in execution_trace_observer.cpp calls TensorImpl->storage_offset(). That call triggers a recursive call into recordOperatorStart, which deadlocks on ob.g_mutex.

This diff fixes the deadlock by replacing std::mutex with std::recursive_mutex.

Since PyTorch has only one thread for the forward pass and one for the backward pass, contention on the mutex is very low and performance should not be a concern.
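
A runnable Python sketch of the same idea (the actual fix is std::recursive_mutex in C++; threading.RLock plays its role here):

```
import threading

g_mutex = threading.RLock()  # re-entrant: the same thread may re-acquire it

def convert_ivalue(depth: int = 0) -> None:
    with g_mutex:
        # Emulate the observer callback re-entering while the lock is held;
        # with a plain threading.Lock this second acquire would deadlock.
        if depth == 0:
            convert_ivalue(depth + 1)

convert_ivalue()  # completes only because the mutex is recursive
```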

Test Plan:
Unit Test
    buck test  mode/dev-nosan caffe2/test:profiler -- test_execution_trace_with_pt2

Differential Revision: D53299183

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119242
Approved by: https://github.com/aaronenyeshi
2024-02-06 23:36:22 +00:00
PyTorch MergeBot
9d46fe603d Revert "[c10d] PGNCCL refactor part 1: adds assert size==1 (#119099)"
This reverts commit 4ab852b6c5.

Reverted https://github.com/pytorch/pytorch/pull/119099 on behalf of https://github.com/atalman due to Breaks internal tests ([comment](https://github.com/pytorch/pytorch/pull/119099#issuecomment-1930839754))
2024-02-06 22:14:36 +00:00
Chen_Liqing
422b4271ae Change PrivateUse1's resize_bytes to PrivateUse1HooksInterface (#117839)
Reopen from https://github.com/pytorch/pytorch/pull/117211
Modifies the logic for entering the registration branch so that existing UTs are not affected.
Co-authored-by: albanD <desmaison.alban@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117839
Approved by: https://github.com/albanD
2024-02-06 20:51:56 +00:00
William Wen
ae4e866bba [dynamo] refactor CacheEntry and ExtraState to eval_frame.c to C++ (#118438)
Part of implementing CacheEntry invalidation to fix https://github.com/pytorch/pytorch/issues/112090.

Changes:
- Move CacheEntry and ExtraState to C++
- Use pybind to control reference counting
- Use std::list instead of manually implementing a linked list

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118438
Approved by: https://github.com/jansel
2024-02-06 20:48:11 +00:00
Edward Z. Yang
3f0fd36835 Introduce size oblivious guards (#118579)
Fixes https://github.com/pytorch/pytorch/issues/117361

The implementation here slightly diverges from what was proposed in the issue, so I will recap what this PR is doing here. Today, when doing computations involving size-like unbacked SymInts, we assume for all operations that the compile time range of the integer is `[2, inf]`, even though at runtime we also accept zero and one.

This PR removes the carte blanche assumption, and instead does the analysis in a much more limited and controlled fashion: only for guards which we have designated as "size oblivious" are we willing to do the analysis under the assumption that the range of all size-like unbacked SymInts is `[2, inf]`; otherwise, we will faithfully only do analysis with `[0, inf]` (or whatever the user provided) bounds.

The infra pieces of this PR are:

* Remove runtime_var_to_range from torch/fx/experimental/symbolic_shapes.py; modify `_constrain_range_for_size` to refine the range without clamping min to 2, and instead add the symbol to a `size_like` set in the ShapeEnv
* When evaluating an expression, if the expression is requested to be evaluated in a `size_oblivious` way, we attempt to statically compute the value of the expression with the assumption that all symbols in `size_like` are updated to assume that they are `>= 2`.
* Add Python and C++ APIs for guarding on a SymBool in a size-oblivious way. In C++, I also need to add some helpers for performing symbolic comparisons, since the stock comparisons immediately specialize in the "normal" way.

The rest of the changes of the PR are marking various spots in PyTorch framework code as size oblivious, based on what our current test suite exercises.

As you review the places where we have marked things as size oblivious, it may become clear why I ended up not opting for the "designate a branch as the default branch when it's not statically obvious which way to go": for some of the conditions, this answer is rather non-obvious. I think potentially there is another refinement on top of this PR, which is something like "I don't care if you can't figure it out with ValueRange analysis, go down this path anyway if there are unbacked sizes involved." But even if we add this API, I think we are obligated to attempt the ValueRange analysis first, since it can lead to better outcomes sometimes (e.g., we are able to figure out that something is contiguous no matter what the unbacked size is.)

When is it permissible to mark something as size oblivious? Heuristically, it is OK anywhere in framework code if it gets you past a guard on unbacked SymInt problem. It is somewhat difficult to provide a true semantic answer, however. In particular, these annotations don't have any observational equivalence guarantee; for example, if I have `torch.empty(u0, 1).squeeze()`, we will always produce a `[u0]` size tensor, even though if `u0 == 1` PyTorch will actually produce a `[]` size tensor. The argument that I gave to Lezcano is that we are in fact defining an alternate semantics for a "special" size = 0, 1, for which we have these alternate eager mode semantics. In particular, suppose that we have a constant `special1` which semantically denotes 1, but triggers alternate handling rules. We would define `torch.empty(special1, 1).squeeze()` to always produce a `[special1]` size tensor, making its semantics coincide with unbacked SymInt semantics. In this model, the decision to designate guards as size oblivious is simply a user API question: you put them where ever you need some handling for special1! As we conservatively error out whenever it is not obvious what `special1` semantics should be, it is always valid to expand these semantics to cover more cases (although you can always choose the wrong semantics!)
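
A small sketch of the Python API added here (`guard_size_oblivious`), as it might be used at a framework guard:

```
import torch
from torch.fx.experimental.symbolic_shapes import guard_size_oblivious

def maybe_squeeze(t):
    # Under a size-oblivious guard, size-like unbacked SymInts are assumed
    # to be >= 2, so this condition can resolve statically instead of
    # raising a guard-on-unbacked-SymInt error.
    if guard_size_oblivious(t.shape[0] == 1):
        return t.squeeze(0)
    return t
```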

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118579
Approved by: https://github.com/eellison, https://github.com/lezcano
2024-02-06 19:45:32 +00:00
Ke Wen
4ab852b6c5 [c10d] PGNCCL refactor part 1: adds assert size==1 (#119099)
Breaking #118674 into multiple smaller PRs.
This is the first one.
It adds `assert size==1` to PGNCCL, and refactors some old tests written in multi-device style (which would otherwise fail at the assert).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119099
Approved by: https://github.com/wconstab
2024-02-06 06:59:47 +00:00
Yifu Wang
5086e1cf3f Remove distributed/c10d/Functional.hpp (#119138)
This file is useless and was accidentally checked in.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119138
Approved by: https://github.com/Skylion007
2024-02-05 21:58:08 +00:00
PyTorch MergeBot
ab613a4019 Revert "refactor lazy init to device-agnostic (#118846)"
This reverts commit 520771d7b3.

Reverted https://github.com/pytorch/pytorch/pull/118846 on behalf of https://github.com/atalman due to Failing, tests https://github.com/pytorch/torchdistx/blob/main/src/python/torchdistx/_C/fake.cc#L11  ([comment](https://github.com/pytorch/pytorch/pull/118846#issuecomment-1927651305))
2024-02-05 18:06:30 +00:00
Bin Bao
79b20aec76 [AOTI] Support copy_, _fft_c2c and view_as_real in C shim (#119125)
Summary: These ops exist in GoogleFnet. Also add a Complex fallback for convert_element_type. After this PR, we can enable ABI-compatible for AOTInductor OSS CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119125
Approved by: https://github.com/chenyang78
2024-02-04 15:48:58 +00:00
Yifu Wang
372e9550bd ProcessGroupGloo::reduce_scatter_tensor_coalesced (#118911)
### Motivation
Despite our plan to reduce gloo usage, it is still widely used as a testing tool (in both the PyTorch CI and user tests) for code that only uses nccl in real-world scenarios. There are some coverage issues around the all-gather and reduce-scatter variants, which are currently worked around in ugly ways (e.g. [this](b9e86bc93d/torch/distributed/_functional_collectives_impl.py (L216-L219)) and [this](b9e86bc93d/torch/distributed/_functional_collectives_impl.py (L262-L272))). For native funcol I ran into the same issues, but I'd rather just fix the coverage.

### This PR
We already have a fallback impl for `_reduce_scatter_base`, composed from all-reduce + scatter. The scatter was not necessary: it introduced extra communication and a sync point, and it forced the impl to fail on `asyncOp=True`. This PR does the following:
- Simulate reduce-scatter with `allreduce(inp).chunk(world_size)[rank]` (see the sketch after this list). This is still 2x the communication of a real reduce-scatter (since all-reduce = reduce-scatter + all-gather), but it is strictly better than what we have now.
- By doing the above, the comm becomes async and we no longer have to fail on `asyncOp=True`.
- The general logic is implemented in `reduce_scatter_tensor_coalesced`; `_reduce_scatter_base` just calls it with a single input/output.
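
A minimal Python sketch of the simulated reduce-scatter described above (illustrative only; the actual implementation lives in ProcessGroupGloo's C++):

```
import torch
import torch.distributed as dist

def simulated_reduce_scatter(inp: torch.Tensor, group=None) -> torch.Tensor:
    # All-reduce the full input, then keep only this rank's chunk. This is
    # 2x the communication of a true reduce-scatter, but fully async-capable.
    world_size = dist.get_world_size(group)
    rank = dist.get_rank(group)
    out = inp.clone()
    dist.all_reduce(out, group=group)
    return out.chunk(world_size)[rank].clone()
```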

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118911
Approved by: https://github.com/shuqiangzhang
ghstack dependencies: #118910
2024-02-03 02:42:47 +00:00
lancerts
857508fa36 Change the internal assert to torch_check in torch::nn::functional::InterpolateFuncOptions (#117831)
Fixes #117333

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117831
Approved by: https://github.com/malfet
2024-02-03 02:15:11 +00:00
briancoutinho
d91d21fd6f [submodule kineto] Enable profiler connection to daemon during init for cpu only jobs (#118320)
Fixes #112389 and https://github.com/facebookincubator/dynolog/issues/208

This PR enables profiler initialization for CPU-only use cases. The main goal is to enable on-demand profiling with a daemon when using the CPU-only mode of PyTorch.
* When CUDA is available, the profiler is initialized on the first CUDA stream creation (or lazily when the profiler is run).
* Since the CUDA stream creation callback does not exist in CPU-only PyTorch, the profiler is never initialized on its own.
* Thus the job does not register with Dynolog even when the "KINETO_USE_DAEMON" env variable is set.

Part of the fix is in Kineto (https://github.com/pytorch/kineto/pull/861), which we point to in PyTorch.
The change in PyTorch is to correctly set the `cpuOnly` argument.

## Test Plan:

Build PyTorch from source with USE_CUDA=0 so we have a CPU-only build. Git hash = `a40951defd87b9a5e582cf9112bf7a8bd0930c79`
(See instructions in PyTorch repo)

For the setup we run dynolog daemon in another terminal
```
buck2 run dynolog/src:dynolog  -- --enable_ipc_monitor &
```

Now run an example model in PyTorch - see [linear_model.py](https://github.com/facebookincubator/dynolog/blob/main/scripts/pytorch/linear_model_example.py) , and set the device to 'cpu' inside the code instead of 'cuda'.
```
export KINETO_USE_DAEMON=1
python linear_model_example.py
```
Output shows the profiler registration with dynolog
```
(pytorch) [bcoutinho@devgpu038.ftw6 ~/local/pytorch (main)]$ python linear_model_example.py
INFO:2024-01-25 11:08:53 1807792:1807792 init.cpp:122] Registering daemon config loader, cpuOnly =  1
INFO:2024-01-25 11:08:53 1807792:1807792 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-01-25 11:08:53 1807792:1807792 IpcFabricConfigClient.cpp:93] Setting up IPC Fabric at endpoint: dynoconfigclient0dc36b8a-e14c-4260-958b-4b2e7d15e986 status = initialized
INFO:2024-01-25 11:08:53 1807792:1807792 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-01-25 11:08:53 1807792:1807792 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
```

We can also collect a trace using
```
[bcoutinho@devgpu038.ftw6 ~/fbsource/fbcode (3bc85f968)]$ buck2 run dynolog/cli:dyno -- gputrace --log-file /tmp/test.json
Kineto config =
ACTIVITIES_LOG_FILE=/tmp/test.json
PROFILE_START_TIME=0
ACTIVITIES_DURATION_MSECS=500
PROFILE_REPORT_INPUT_SHAPES=false
PROFILE_PROFILE_MEMORY=false
PROFILE_WITH_STACK=false
PROFILE_WITH_FLOPS=false
PROFILE_WITH_MODULES=false
response length = 147
response = {"activityProfilersBusy":0,"activityProfilersTriggered":[1807792],"eventProfilersBusy":0,"eventProfilersTriggered":[],"processesMatched":[1807792]}
Matched 1 processes
Trace output files will be written to:
    /tmp/test_1807792.json
```
And trace file contains the trace correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118320
Approved by: https://github.com/aaronenyeshi
2024-02-03 01:40:56 +00:00
willfengg
63fd6883fd [c10d] logging utility for cpp-python stacktrace (#118924)
Users may not know which line of code called a collective in a big codebase. When debugging, we can print the Python/C++ stack trace, e.g. in case a user calls ``ProcessGroup.reduce`` instead of ``torch.distributed.reduce``:

```
LOG(INFO) << "ProcessGroupNCCL::_allgather_base stacktrace: "
                       << get_python_cpp_trace();
```

Output (using _allgather_base as an example): one example Python-side frame is ``all_gather_into_tensor from /data/users/weif/pytorch/torch/distributed/distributed_c10d.py:2838``
```
ProcessGroupNCCL::_allgather_base stacktrace: #0 torch::unwind::unwind() from ??:0
#1 torch::CapturedTraceback::gather(bool, bool, bool) from ??:0
#2 c10d::get_python_cpp_trace[abi:cxx11]() from :0
#3 c10d::ProcessGroupNCCL::_allgather_base(at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&) from ??:0
#4 c10d::ops::(anonymous namespace)::_allgather_base_CUDA(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long) from Ops.cpp:0
#5 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<at::Tensor, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long), std::tuple<at::Tensor, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from :0
#6 torch::autograd::basicAutogradNotImplementedFallbackImpl(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from autograd_not_implemented_fallback.cpp:0
#7 c10d::ProcessGroup::_allgather_base(at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&) from :0
#8 pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&)#1}, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup*, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&)#1}&&, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (*)(c10d::ProcessGroup*, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from :0
#9 pybind11::cpp_function::dispatcher(_object*, _object*, _object*) from :0
#10 cfunction_call from /usr/local/src/conda/python-3.10.12/Objects/methodobject.c:543
#11 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.12/Objects/call.c:215
#12 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:112
#13 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#14 all_gather_into_tensor from /data/users/weif/pytorch/torch/distributed/distributed_c10d.py:2838
#15 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#16 do_call_core from /usr/local/src/conda/python-3.10.12/Python/ceval.c:5945
#17 wrapper from /data/users/weif/pytorch/torch/distributed/c10d_logger.py:75
#18 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#19 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#20 _all_gather_flat_param from /data/users/weif/pytorch/torch/distributed/fsdp/_flat_param.py:1399
#21 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#22 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#23 unshard from /data/users/weif/pytorch/torch/distributed/fsdp/_flat_param.py:1308
#24 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#25 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#26 _unshard from /data/users/weif/pytorch/torch/distributed/fsdp/_runtime_utils.py:332
#27 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#28 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#29 _pre_forward_unshard from /data/users/weif/pytorch/torch/distributed/fsdp/_runtime_utils.py:448
#30 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#31 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#32 _pre_forward from /data/users/weif/pytorch/torch/distributed/fsdp/_runtime_utils.py:413
#33 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#34 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#35 forward from /data/users/weif/pytorch/torch/distributed/fsdp/fully_sharded_data_parallel.py:839
#36 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#37 do_call_core from /usr/local/src/conda/python-3.10.12/Python/ceval.c:5945
#38 _call_impl from /data/users/weif/pytorch/torch/nn/modules/module.py:1520
#39 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#40 do_call_core from /usr/local/src/conda/python-3.10.12/Python/ceval.c:5945
#41 _wrapped_call_impl from /data/users/weif/pytorch/torch/nn/modules/module.py:1511
#42 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#43 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.12/Objects/call.c:431
#44 slot_tp_call from /usr/local/src/conda/python-3.10.12/Objects/typeobject.c:7494
#45 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.12/Objects/call.c:215
#46 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:112
#47 inner from /data/users/weif/pytorch/run_fsdp.py:72
#48 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#49 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#50 run from /data/users/weif/pytorch/run_fsdp.py:76
#51 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#52 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#53 main from /data/users/weif/pytorch/run_fsdp.py:133
#54 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#55 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#56 <module> from /data/users/weif/pytorch/run_fsdp.py:137
#57 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#58 PyEval_EvalCode from /usr/local/src/conda/python-3.10.12/Python/ceval.c:1134
#59 run_eval_code_obj from /usr/local/src/conda/python-3.10.12/Python/pythonrun.c:1291
#60 run_mod from /usr/local/src/conda/python-3.10.12/Python/pythonrun.c:1312
#61 pyrun_file from /usr/local/src/conda/python-3.10.12/Python/pythonrun.c:1208
#62 _PyRun_SimpleFileObject from /usr/local/src/conda/python-3.10.12/Python/pythonrun.c:456
#63 _PyRun_AnyFileObject from /usr/local/src/conda/python-3.10.12/Python/pythonrun.c:90
#64 pymain_run_file_obj from /usr/local/src/conda/python-3.10.12/Modules/main.c:357
#65 Py_BytesMain from /usr/local/src/conda/python-3.10.12/Modules/main.c:1090
#66 __libc_start_call_main from ??:0
#67 <unwind unsupported> from ??:0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118924
Approved by: https://github.com/kwen2501
2024-02-02 23:49:18 +00:00
titaiwangms
a3cec6a7fa [ONNX] Eliminate redundant TODOs (#119060)
Remove TODOs created by titaiwangms/AllenTiTaiWang/titaiwang:

1. Resolved TODOs
2. Turned TODOs to NOTEs if they are not actionable
3. Merged duplicated TODOs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119060
Approved by: https://github.com/kit1980, https://github.com/thiagocrepaldi
2024-02-02 23:37:52 +00:00
Yifu Wang
fd000340fd ProcessGroupGloo::allgather_into_tensor_coalesced (#118910)
### Motivation
Despite our plan to reduce gloo usage, it is still widely used as a testing tool (in both the PyTorch CI and user tests) for code that only uses nccl in real-world scenarios. There are some coverage issues around the all-gather and reduce-scatter variants, which are currently worked around in ugly ways (e.g. [this](b9e86bc93d/torch/distributed/_functional_collectives_impl.py (L216-L219)) and [this](b9e86bc93d/torch/distributed/_functional_collectives_impl.py (L262-L272))). For native funcol I ran into the same issues, but I'd rather just fix the coverage.

**I think it's reasonable to think of this as a fix rather than adding new features. This is orthogonal to the potential reduction of gloo usage**.

### This PR

This PR adds `ProcessGroupGloo::allgather_into_tensor_coalesced`. This is very straightforward - `ProcessGroupGloo` already supports `allgather_coalesced`, to which we can funnel `allgather_into_tensor_coalesced` (see the sketch below).
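
A conceptual Python sketch of the funneling (the real code is C++, and the chunk-view wiring is my assumption about the implementation, not a verbatim port):

```
def allgather_into_tensor_coalesced(pg, outputs, inputs):
    # Each flat output of size world_size * input is viewed as world_size
    # chunks; allgather_coalesced fills them in place through the views.
    output_lists = [list(out.chunk(pg.size())) for out in outputs]
    return pg.allgather_coalesced(output_lists, inputs)
```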

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118910
Approved by: https://github.com/shuqiangzhang
2024-02-02 17:53:28 +00:00
Yu, Guangye
520771d7b3 refactor lazy init to device-agnostic (#118846)
# Motivation
This PR extends `cuda_lazy_init` to `device_lazy_init`, a device-agnostic API that can support any backend, and changes `maybe_initialize_cuda` to `maybe_initialize_device` to support lazy initialization for CUDA while remaining scalable to other backends.

# Design
We maintain a flag for each backend to manage the lazy initialization state separately.

# Additional Context
No additional UTs are needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118846
Approved by: https://github.com/malfet
2024-02-02 12:10:39 +00:00
albanD
54668ad6dc Cleanup max cuda device (#118779)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118779
Approved by: https://github.com/ezyang
2024-02-01 21:11:28 +00:00
mantaionut
b0e65dd1b4 Fix TCP Store Windows (#118860)
In https://github.com/pytorch/pytorch/pull/107607 a new Validate flow was added; however, on Windows it was not calling addMiscellaneousSocket.
Added the missing call to addMiscellaneousSocket on Windows.

Fixes #118737

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118860
Approved by: https://github.com/awgu, https://github.com/malfet
2024-02-01 18:46:18 +00:00
garfield1997
ff9ce94489 Create empty host tensor for privateuseone (#118854)
For the H2D copy of local_used_map_ on the privateuseone device, reuse the CUDA logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118854
Approved by: https://github.com/ezyang
2024-02-01 15:32:55 +00:00
Yu, Guangye
a205e7bf56 [3/4] Intel GPU Runtime Upstreaming for Device (#116850)
# Motivation
Following [[1/4] Intel GPU Runtime Upstreaming for Device](https://github.com/pytorch/pytorch/pull/116019), and as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), this third PR covers the changes under `libtorch_python`.

# Design
This PR primarily offers device-related APIs in python frontend, including
- `torch.xpu.is_available`
- `torch.xpu.device_count`
- `torch.xpu.current_device`
- `torch.xpu.set_device`
- `torch.xpu.device`
- `torch.xpu.device_of`
- `torch.xpu.get_device_name`
- `torch.xpu.get_device_capability`
- `torch.xpu.get_device_properties`
along with the following private helpers:
- `torch.xpu._DeviceGuard`
- `torch.xpu._is_compiled`
- `torch.xpu._get_device`

# Additional Context
We will implement support for lazy initialization in the next PR.
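
A minimal usage sketch of the new device APIs (assuming an XPU-enabled build):

```
import torch

if torch.xpu.is_available():
    print(torch.xpu.device_count())      # number of XPU devices
    torch.xpu.set_device(0)              # select device 0
    print(torch.xpu.current_device())    # -> 0
    print(torch.xpu.get_device_name(0))  # device name string
```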

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116850
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/malfet
2024-02-01 12:31:26 +00:00
Michael Suo
eaa45f47f8 [sigmoid] fix for torchbind serialization (#118791)
Summary:
There is an annoying inconsistency in how we pickle custom objs.
`torch.save` will invoke regular pickle, for which we have bound `__setstate__`/`__getstate__` methods on `torch.ScriptObject`: https://fburl.com/code/4howyl4u.

This serializes in a different format than TorchScript does, which uses the TS C++ pickler.

The issue we were facing was using the Python pickler to save, and the C++ pickler to load. If we use the C++ pickler to both save and load (plus some plumbing to get type/object resolution to work correctly), then things should work.

Test Plan:
ran SherlockNoMad's repro
```
buck2 run 'fbcode//mode/dev-nosan' scripts/bahuang:export_torchbind -- --logging DBG
```

Got to a new error, which has to do with how we're initializing the graph, but will leave that for future diffs.

Reviewed By: SherlockNoMad

Differential Revision: D53248454

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118791
Approved by: https://github.com/qxy11, https://github.com/SherlockNoMad, https://github.com/khabinov
2024-02-01 10:09:07 +00:00
Mu-Chu Lee
2b48891e62 [AOTInductor] Add Runtime Constant-folding for AOTInductor (#118765)
Summary:
Add Runtime Constant-folding for AOTInductor.
It also includes the invocation of constant folding at load time.

The constant-folding lowering is a 2-step process.
First, we split the graph into 2 modules; one of them is the constant module, which doesn't depend on any input, so the whole module can be inferred (constant-folded) once and reused. The constant module is lowered and codegen-ed as usual, then cached (let's call this the constant code). The constant code reuses the whole lowering/profiling/etc. process; the only difference is that we do not generate any headers or initialization for the constant code.
Second, after handling the constant module, we take care of the main module (the part that depends on the user input). For the main module, compared with a normal lowering, we take in one additional component: the constant code. The additional step here is that we inject the constant code into the codegen-ed main module and create the caller for the main module to consume the result of the constant module.

Test Plan: Unit tests included in commit.

Differential Revision: D53274382

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118765
Approved by: https://github.com/chenyang78
2024-02-01 04:54:25 +00:00
Andrew Calvano
649f2e3000 Fix for out of bounds registers_ access in mobile TorchScript interpreter (#110300)
Summary:
The TorchScript interpreter had multiple opcodes whose logic had the potential to access the registers_ array out of bounds.

This change ensures that all registers_ accesses are in bounds or an exception will be thrown.

Test Plan: contbuild + OSS signals

Differential Revision: D49748737

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110300
Approved by: https://github.com/malfet, https://github.com/kimishpatel
2024-01-31 19:40:02 +00:00
Shan19900305
99b69e1ffb add PrivateUse1 device support in function options_from_string. (#118627)
add PrivateUse1 device support in function options_from_string.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118627
Approved by: https://github.com/soulitzer
2024-01-31 18:52:58 +00:00
Bin Bao
1128cf96f0 [AOTI] Support _embedding_bag in C shim (#118706)
Summary: At some point I will stop manually adding ops to the C shim and use torchgen to generate that code. For the near term, I need to add a few more in order to switch the AOTInductor dashboard run.

Differential Revision: [D53249074](https://our.internmc.facebook.com/intern/diff/D53249074)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118706
Approved by: https://github.com/frank-wei, https://github.com/aakhundov
ghstack dependencies: #118704, #118705
2024-01-31 15:02:40 +00:00
Bin Bao
8db8ff652c [AOTI] Add aoti_torch_view_dtype in C shim (#118705)
Summary: Support ir.ComplexView in the ABI-compatible codegen

Differential Revision: [D53249039](https://our.internmc.facebook.com/intern/diff/D53249039)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118705
Approved by: https://github.com/frank-wei
ghstack dependencies: #118704
2024-01-31 14:42:29 +00:00
cyy
4a019047ad Enable nested namespace check in clang-tidy (#118506)
It is time to enable nested namespaces in the code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118506
Approved by: https://github.com/albanD
2024-01-31 00:32:35 +00:00
Shuqiang Zhang
e180218949 [c10d] Log the last enqueued and completed collective (#118582)
Summary:
During debugging of some timed-out jobs, I found it difficult to
identify which rank is at fault, even though we have logs of many ranks
reporting a timeout on a specific collective seq.

If we can also report lastEnqueuedSeq and lastCompletedSeq, it would be
much easier to identify:
1. whether a rank has not even joined a collective call (not enqueued)
2. or it has joined the collective call but not completed it

For the first case, it is most likely a problem in the user's code;
for the second case, it could be a lower-layer issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118582
Approved by: https://github.com/wconstab
2024-01-30 20:13:55 +00:00
garfield1997
fbf92500fb enable privateuseone to perform streaming backward (#117111)
Fixes #116957

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117111
Approved by: https://github.com/soulitzer
2024-01-30 15:13:31 +00:00
Yifu Wang
64efec9953 Port FakeProcessGroup to cpp (#118426)
### Summary
Native functional collective ops require the backend to be implemented in C++, so this PR ports `FakeProcessGroup` to C++ so that it also works for native functional collective ops.

The existing tests involving `FakeProcessGroup` all pass. In addition, `DeviceMeshTest::test_fake_pg_device_mesh` now passes with `_USE_NATIVE_C10D_FUNCTIONAL=1`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118426
Approved by: https://github.com/wanchaol
ghstack dependencies: #113057
2024-01-30 11:40:13 +00:00
Shuqiang Zhang
c7af626a26 [c10d] allow nonblocking wrap of ncclCommInitRankConfig (#118256)
resolve #117749

Summary:
Updated the PR with the following intentions:

1. identify eagerMode init (as opposed to lazy init), in which case we will create NCCL comms without guarantees that they are fully initialized if NONBLOCKING mode is also enabled.
2. Python users can do other work (e.g., model init) between invoking init_process_group and their first collective call.
3. c10d would guarantee/wait for communicators to be initialized before issuing the first collective call.
4. For NCCL collective calls, the contract between Python users and c10d is not changed much from blocking calls (c10d would wait for the NCCL call to reach ncclSuccess, or time out, whichever happens first).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118256
Approved by: https://github.com/kwen2501
2024-01-30 06:23:20 +00:00
Yifu Wang
b778f44e97 Allow using native c10d_functional via _functional_collectives (#113057)
This diff introduces an env var `_USE_NATIVE_C10D_FUNCTIONAL` that tells `_functional_collective` to use native `c10d_functional` ops. The Python version and the native version will co-exist until we completely switch to the native version after more testing and verification.

NOTE: `DeviceMesh` support for native `c10d_functional` will be added in a subsequent PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113057
Approved by: https://github.com/LucasLLC, https://github.com/wconstab, https://github.com/wanchaol
2024-01-30 02:34:25 +00:00
PyTorch MergeBot
fb11354594 Revert "[c10d] relax the nccl error check for nonblocking mode (#118254)"
This reverts commit 993e4f3911.

Reverted https://github.com/pytorch/pytorch/pull/118254 on behalf of https://github.com/clee2000 due to has internal failures D53170606 ([comment](https://github.com/pytorch/pytorch/pull/118254#issuecomment-1915267786))
2024-01-29 17:56:40 +00:00
Wenyin Fu
65f8276bc6 add an option to specify custom addr2line binary (#118328)
There is a need for users to pick their own addr2line binary in their deployments, for reasons such as the default addr2line being too slow. This option lets users quickly experiment with alternatives.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118328
Approved by: https://github.com/zdevito, https://github.com/aaronenyeshi
2024-01-29 16:36:38 +00:00
Will Constable
5f59d0c748 [C10D] Disarm PGNCCL Heartbeat Monitor to gather data (#118344)
Summary:
Leave monitoring thread 'running' in log-only mode. Use the kill logs to
correlate with actual job outcomes (e.g. does stuck job detector agree?)

Later, re-enable (using a justknobs knob this time)

Test Plan: CI

Differential Revision: D53108142

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118344
Approved by: https://github.com/shuqiangzhang, https://github.com/yifuwang, https://github.com/malfet, https://github.com/kwen2501
2024-01-29 06:09:36 +00:00
eqy
8d790abab9 [NCCL][c10d] Log failing pointer if deregistration fails (#118455)
For debugging convenience

CC @minsii @Aidyn-A @syed-ahmed @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118455
Approved by: https://github.com/wconstab
2024-01-27 11:03:02 +00:00
PyTorch MergeBot
dabb90f2a4 Revert "[Exception] [6/N] Remove use of torch::TypeError (#117964)"
This reverts commit 87335fabae.

Reverted https://github.com/pytorch/pytorch/pull/117964 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/117964#issuecomment-1913079096))
2024-01-27 08:44:34 +00:00
Shuqiang Zhang
993e4f3911 [c10d] relax the nccl error check for nonblocking mode (#118254)
resolve https://github.com/pytorch/pytorch/issues/117749

Summary:
This is the first step to enable NCCL nonblocking mode.

In NCCL nonblocking mode, ncclInProgress is an expected return value
when checking communicators. Without this relaxation, the watchdog thread
would throw NCCL errors during work checking even though the status is expected.

Test Plan:
Set nonblocking mode in unit tests, and make sure all existing NCCL
tests pass

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118254
Approved by: https://github.com/kwen2501
2024-01-27 03:49:00 +00:00
David Berard
40c08795b0 [JIT] python IR bindings: consolidate tests, add short docs in OVERVIEW.md (#118319)
Document the existence of the Python IR bindings, add brief comments about them, and consolidate the tests into one file to serve as examples for users.
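
A quick sketch of what the bindings look like in use (inspecting a scripted function's IR):

```
import torch

@torch.jit.script
def f(x):
    return x + 1

g = f.graph             # torch._C.Graph exposed via the Python bindings
for node in g.nodes():  # iterate over the IR nodes
    print(node.kind())  # e.g. "aten::add"
```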
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118319
Approved by: https://github.com/eellison
2024-01-27 03:11:51 +00:00
Edward Z. Yang
9bce208dfb Replace follow_imports = silent with normal (#118414)
This is a lot of files changed! Don't panic! Here's how it works:

* Previously, we set `follow_imports = silent` for our mypy.ini configuration. Per https://mypy.readthedocs.io/en/stable/running_mypy.html#follow-imports, what this does is whenever we have an import to a module which is not listed as a file to be typechecked in mypy, we typecheck it as normal but suppress all errors that occurred in that file.
* When mypy is run inside lintrunner, the list of files is precisely the files covered by the glob in lintrunner.toml, but with files in excludes excluded.
* The top-level directive `# mypy: ignore-errors` instructs mypy to typecheck the file as normal, but ignore all errors.
* Therefore, it should be equivalent to set `follow_imports = normal`, if we put `# mypy: ignore-errors` on all files that were previously excluded from the file list.
* Having done this, we can remove the exclude list from .lintrunner.toml, since excluding a file from typechecking is baked into the files themselves.
* torch/_dynamo and torch/_inductor were previously in the exclude list, because they were covered by MYPYINDUCTOR. It is not OK to mark these as `# mypy: ignore-errors` as this will impede typechecking on the alternate configuration. So they are temporarily being checked twice, but I am suppressing the errors in these files as the configurations are not quite the same. I plan to unify the configurations so this is only a temporary state.
* There were some straggler type errors after these changes somehow, so I fixed them as needed. There weren't that many.

In the future, to start type checking a file, just remove the ignore-errors directive from the top of the file.

The codemod was done with this script authored by GPT-4:

```
import glob

exclude_patterns = [
    ...
]

# Prepend the ignore-errors directive to every previously-excluded Python file
for pattern in exclude_patterns:
    for filepath in glob.glob(pattern, recursive=True):
        if filepath.endswith('.py'):
            with open(filepath, 'r+') as f:
                content = f.read()
                f.seek(0, 0)
                f.write('# mypy: ignore-errors\n\n' + content)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118414
Approved by: https://github.com/thiagocrepaldi, https://github.com/albanD
2024-01-27 02:44:11 +00:00
lancerts
af1338bfbf fix escape nested comments in C++ (#117882)
Fixes #115243; it is tricky to deal with the nested comment in doxygen + sphinx. Change 6 below is adopted as the fix; all the other changes do not work.

After adopting change 6, we realized the original
`torch::optim::SGD sgd(0.9);` was not a correct call to the SGD constructor, so we
modified it to the correct one:
`torch::optim::SGD sgd(model->parameters(), 0.9);`

- Original in [link](https://pytorch.org/cppdocs/api/function_namespacetorch_1ad98de93d4a74dd9a91161f64758f1a76.html#exhale-function-namespacetorch-1ad98de93d4a74dd9a91161f64758f1a76): `///   torch::optim::SGD sgd(/*lr=*/0.9);`
![image](https://github.com/pytorch/pytorch/assets/7495155/0054b355-4925-4112-93b4-9385fdc34bb9)

- Change 1, this solution is referenced from [here](https://stackoverflow.com/questions/24978463/doxygen-escape-nested-comments-in-c): `///   torch::optim::SGD sgd(/&zwj;* lr= *&zwj;/0.9);`
![image](https://github.com/pytorch/pytorch/assets/7495155/77ff2d18-3097-4265-8dcd-31d78acb9c6e)

- Change 2: `///   torch::optim::SGD sgd(/* lr= *//* 0.9);`
![image](https://github.com/pytorch/pytorch/assets/7495155/b520f8de-ead7-4009-b0fb-f4517daba077)

- Change 3: `///   torch::optim::SGD sgd(/\*lr=\*/0.9);`
![image](https://github.com/pytorch/pytorch/assets/7495155/07e9e608-4640-43c0-994a-37983b803003)

- Change 4: `///   torch::optim::SGD sgd(/&lowast; lr= &lowast;/0.9);`
![image](https://github.com/pytorch/pytorch/assets/7495155/121e55c5-0802-4ff3-bbd7-3521e1299d94)

- Change 5:
```
/// \rst
/// .. code-block:: cpp
///
///   torch::nn::Linear model(3, 4);
///   torch::load(model, "model.pt");
///   \verbatim
///   torch::optim::SGD sgd(/*lr=*/0.9);
///   \endverbatim
///   std::istringstream stream("...");
///   torch::load(sgd, stream);
///
///   auto tensor = torch::ones({3, 4});
///   torch::load(tensor, "my_tensor.pt");
/// \endrst
```
![image](https://github.com/pytorch/pytorch/assets/7495155/e675f551-e939-4be8-b24a-e2e53377dd08)

- Change 6: `///   torch::optim::SGD sgd(0.9);  // 0.9 is the learning rate`
![image](https://github.com/pytorch/pytorch/assets/7495155/ecf0adc4-9b0b-4aef-b0bc-72d4b17c45fa)
![image](https://github.com/pytorch/pytorch/assets/7495155/01bf5d5b-8450-4599-8c9a-00204ab56119)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117882
Approved by: https://github.com/cpuhrsch, https://github.com/malfet
2024-01-27 02:37:23 +00:00
Min Si
838d3620cd [NCCL PG] log NCCL comm at creation and abort (#118335)
Summary: It helps correlate an NCCL PG with its corresponding NCCL comm in separate logs.

Differential Revision: D53107647

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118335
Approved by: https://github.com/wconstab
2024-01-27 01:43:53 +00:00
rzou
b256b7b348 Add way to actually delete a torch.library.Library object (#118318)
Relying on object lifetimes in Python is a bad idea due to reference
cycles. Previously, when a torch.library.Library object gets destroyed,
it clears all the registrations associated with it, but it's unclear
when it actually gets destroyed due to the existence of refcycles.

This PR:
- adds torch::Library::clear(), which deterministically releases all of
  the RAII registration handles of the torch::Library object
- adds a new `torch.library._scoped_library` context manager, which creates
  a library and cleans it up at the end of the scope using the previous item.
  All tests (unless they already handle library lifetimes) should use
  this new API
- Rewrites some flaky tests to use `_scoped_library`.

In the future we'll probably migrate all of our torch.library tests to
use `_scoped_library`, but that's kind of annoying because we have
multiple thousands of LOC

I'm hoping this will deflake those tests; we'll see.
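
A minimal sketch of the new context manager (the library name and op in the example are illustrative):

```
import torch

# Registrations made on `lib` are deterministically cleaned up on exit.
with torch.library._scoped_library("mylib", "FRAGMENT") as lib:
    lib.define("twice(Tensor x) -> Tensor")
    lib.impl("twice", lambda x: x * 2, "CPU")
    print(torch.ops.mylib.twice(torch.ones(3)))  # tensor([2., 2., 2.])
```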
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118318
Approved by: https://github.com/albanD
2024-01-26 22:30:51 +00:00
Thiago Crepaldi
939008a268 Fix RuntimeError: NYI: Named tensors are not supported with the tracer (#118393)
This PR relands #108238, which was closed as stale due to CLA issues and because the CI check marked the PR as not mergeable.

Repro 1:

```python
import torch

def f(x):
    return x[x > 0]

jf = torch.jit.trace(f, torch.tensor(2., device="cuda"))
```
Error:

```bash
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/pytorch/torch/jit/_trace.py", line 874, in trace
    traced = torch._C._create_function_from_trace(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<stdin>", line 2, in f
RuntimeError: NYI: Named tensors are not supported with the tracer
```

Repro2:

```python
import torch
import torch.nn.functional as F
from torch import nn
import copy

class Net(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, inputs):
        x = copy.deepcopy(inputs) # RuntimeError: NYI: Named tensors are not supported with the tracer
        x = F.relu(x)
        return x

model = Net()
images = torch.randn(8, 28, 28)
torch.jit.trace(model, images)
```

Error 2:

```bash
Traceback (most recent call last):
  File "/opt/pytorch/test_deepcopy.py", line 18, in <module>
  File "/opt/pytorch/torch/jit/_trace.py", line 806, in trace
    return trace_module(
           ^^^^^^^^^^^^^
  File "/opt/pytorch/torch/jit/_trace.py", line 1074, in trace_module
    module._c._create_method_from_trace(
  File "/opt/pytorch/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/torch/nn/modules/module.py", line 1501, in _slow_forward
    result = self.forward(*input, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/test_deepcopy.py", line 12, in forward
    x = F.relu(x)
        ^^^^^^^^^^
  File "/opt/conda/envs/ptca/lib/python3.11/copy.py", line 153, in deepcopy
    y = copier(memo)
        ^^^^^^^^^^^^
  File "/opt/pytorch/torch/_tensor.py", line 122, in __deepcopy__
    new_storage = self._typed_storage()._deepcopy(memo)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/torch/storage.py", line 847, in _deepcopy
    return self._new_wrapped_storage(copy.deepcopy(self._untyped_storage, memo))
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/ptca/lib/python3.11/copy.py", line 153, in deepcopy
    y = copier(memo)
        ^^^^^^^^^^^^
  File "/opt/pytorch/torch/storage.py", line 112, in __deepcopy__
    new_storage = self.clone()
                  ^^^^^^^^^^^^
  File "/opt/pytorch/torch/storage.py", line 126, in clone
    return type(self)(self.nbytes(), device=self.device).copy_(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: NYI: Named tensors are not supported with the tracer
```

----
 #48054 RuntimeError: NYI: Named tensors are not supported with the tracer
 #49538 jit tracer doesn't work with unflatten layer
 #31591 when i try to export a pytorch model to ONNX, got RuntimeError: output of traced region did not have observable data dependence with trace inputs; this probably indicates your program cannot be understood by the tracer.
   - This bug was closed but still exists; multiple comments on it still show the error. This is addressed here.

Likely fixes the following issues (but untested)

 #63297 Named tensor in tracer
 #2323 [Bug] torch.onnx.errors.UnsupportedOperatorError when convert mask2former to onnx

Fix zero-dimensional tensors when used with jit.trace. They are currently assigned an empty set {} for names, which is not the same as "no name", so jit.trace bails with
  "NYI: Named tensors are not supported with the tracer".
This happens when trying to save a non-trivial model as ONNX; the simplest repro I have seen is #48054 above, which has been added as test/jit/test_zero_dim_tensor_trace.py.

Test plan:
  New unit test added
  Broken scenarios tested locally
  CI

Fixes #48054

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118393
Approved by: https://github.com/zou3519
2024-01-26 19:31:23 +00:00
cyy
6da0e7f84b [Clang-tidy header][17/N] Apply clang-tidy on headers in torch/csrc/cuda (#117829)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117829
Approved by: https://github.com/albanD
2024-01-26 13:33:24 +00:00
Bin Bao
4e456fd95b [AOTI] Support scalar to tensor in the ABI-compatible mode (#118024)
Differential Revision: [D53019485](https://our.internmc.facebook.com/intern/diff/D53019485)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118024
Approved by: https://github.com/ezyang
2024-01-26 03:15:05 +00:00
Jason Ansel
2de24c11f6 [inductor] Slightly faster memory allocation on CUDA (#118255)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118255
Approved by: https://github.com/peterbell10
ghstack dependencies: #118065, #118070, #118171
2024-01-25 20:49:14 +00:00
Jason Ansel
817debeb89 [inductor] Slightly faster memory allocation on CPU (#118171)
Based on `python benchmarks/dynamo/microbenchmarks/overheads.py`:
- Before `12.2us`
- After `10.5us`

This is inspired by a2c17a2b00 -- but in Python rather than C++

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118171
Approved by: https://github.com/jgong5, https://github.com/peterbell10
ghstack dependencies: #118065, #118070
2024-01-25 16:54:57 +00:00
Bin Bao
ee1dbb2acf [AOTI] Fix a None as index codegen issue (#118187)
Summary: Fix an ABI-compatible codegen issue when index_put has None in its indices.

Differential Revision: [D53047489](https://our.internmc.facebook.com/intern/diff/D53047489)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118187
Approved by: https://github.com/chenyang78
ghstack dependencies: #118168, #118169
2024-01-25 11:53:44 +00:00
Bin Bao
d1e661a1ce [AOTI] Add _scaled_dot_product_efficient_attention to C shim (#118169)
Summary: _scaled_dot_product_efficient_attention is used in some TIMM models

Differential Revision: [D53032358](https://our.internmc.facebook.com/intern/diff/D53032358)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118169
Approved by: https://github.com/chenyang78
ghstack dependencies: #118168
2024-01-25 11:53:44 +00:00
Bin Bao
5c7a18c5cb [AOTI] Refactor shim_common.cpp (#118168)
Summary: Use new_tensor_handle to reduce code repetition

Differential Revision: [D53032353](https://our.internmc.facebook.com/intern/diff/D53032353)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118168
Approved by: https://github.com/chenyang78
2024-01-25 11:53:29 +00:00
Will Constable
a40951defd [C10D] Fix nccl flightrecorder ignored dump timeout (#118142)
Don't call future.get() unless it's ready, because it waits.
Also, refactor the code a bit for simplicity.

We should do a follow-on PR to clean up the timeouts further, but this
should fix the glaring timeout bug.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118142
Approved by: https://github.com/shuqiangzhang
ghstack dependencies: #118044, #118046, #118047
2024-01-25 04:25:36 +00:00
cyy
87335fabae [Exception] [6/N] Remove use of torch::TypeError (#117964)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117964
Approved by: https://github.com/albanD
2024-01-25 03:35:58 +00:00
soulitzer
67300a11cb Support custom autograd Function forward AD return non-Tensor in forward (#118234)
Fixes https://github.com/pytorch/pytorch/issues/117491

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118234
Approved by: https://github.com/albanD
ghstack dependencies: #117552
2024-01-25 03:24:29 +00:00
soulitzer
5b819d9ef0 Properly move retains_grad hook on in-place over view for base (#117552)
Fixes https://github.com/pytorch/pytorch/issues/117366
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117552
Approved by: https://github.com/albanD
2024-01-25 00:27:13 +00:00
drisspg
4e29f01bf2 Remove sdp_kernel and replace with sdpa_kernel in attention namespace (#114689)
# Summary
Simplification of Backend Selection

This PR deprecates the `torch.backends.cuda.sdp_kernel` context manager and replaces it with a new context manager, `torch.nn.attention.sdpa_kernel`, which also changes the API.

With `sdp_kernel`, one would specify the backend choice by disabling the backends one did not want to run. The purpose of this backend manager was only to be a debugging tool: "turn off the math backend" and see if you can run one of the fused implementations.

Problems:
- This pattern makes sense if the majority of users don't care to know anything about the backends that can be run. However, if users are seeking out this context manager, then they are explicitly trying to run a specific backend.
- This is not scalable. We are working on adding the cudnn backend, and with this API more implementations would need to be turned off whenever a user wants to explicitly run a given backend.
- Discoverability of the current context manager. It is somewhat unintuitive that this backend manager lives in backends/cuda/init when it now also controls the CPU fused kernel behavior. I think centralizing it in the attention namespace will be helpful.

Other concerns:
- Typically, backends (kernels) for operators are entirely hidden from users as implementation details of the framework. We have already exposed this to users, albeit not by default and with beta warnings. Does making backend choices even more explicit lead to problems when we potentially want to remove existing backends (perhaps input shapes will get covered by newer backends)?

A nice side effect is that, now that we aren't using the `BACKEND_MAP` in test_transformers, many, many dynamo failures are passing for the CPU tests. See the sketch of the new API below.
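
A minimal sketch of the new API; the backend is named explicitly rather than by negation:

```
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel

q = k = v = torch.randn(2, 8, 128, 64)

# Explicitly opt in to one backend instead of turning the others off.
with sdpa_kernel(SDPBackend.MATH):
    out = torch.nn.functional.scaled_dot_product_attention(q, k, v)
```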

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114689
Approved by: https://github.com/cpuhrsch
2024-01-24 22:28:04 +00:00
Bin Bao
821b2c543c [AOTI] Support .item() in the ABI-compatible mode (#117989)

Differential Revision: [D52965076](https://our.internmc.facebook.com/intern/diff/D52965076)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117989
Approved by: https://github.com/ezyang, https://github.com/chenyang78
2024-01-24 20:17:59 +00:00
dilililiwhy
b025e5984c Get Device instance with correct type when privateuse1 backend is registered (#117966)
Fixes #ISSUE_NUMBER
If a privateuse1 backend is registered, let torch.device return the corresponding Device instance when only an index is given.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117966
Approved by: https://github.com/albanD, https://github.com/malfet
2024-01-24 19:03:18 +00:00
Ke Wen
1e185c7803 [c10d] Barrier uses stream sync instead of device sync (#117804)
Resubmitting #96785

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117804
Approved by: https://github.com/wconstab
2024-01-24 18:42:14 +00:00
Mikayla Gawarecki
41a56f7828 Fix swap_tensors to swap PyObjects associated with TensorImpl (#116955)
This PR intends to fix the following issue when swapping two tensors

```python
>>> import torch
>>> torch.manual_seed(5)
>>> t1 = torch.randn(2)
>>> t2 = torch.randn(3)
>>> t1
tensor([-0.4868, -0.6038])
>>> t2
tensor([-0.5581,  0.6675, -0.1974])
>>> torch.utils.swap_tensors(t1, t2)
>>> t1
tensor([-0.5581,  0.6675, -0.1974])
>>> t2
tensor([-0.4868, -0.6038])
>>> t1.fill_(0.5) # t1 back to its unswapped state :o
tensor([-0.4868, -0.6038])
```

What happens here is that in `THPVariable_Wrap` (which is used when going back from C++ --> Python), we check if the TensorImpl of the tensor to be returned already has a pointer to a PyObject in its PyObject slot. If this is the case then this object is returned.

57491d2046/torch/csrc/autograd/python_variable.cpp (L271-L292)

When we run any operation that returns the same TensorImpl (e.g. an inplace op, `t.to(dtype=t.dtype)`, etc.), although `t1` now has `t2`'s TensorImpl, `t2`'s TensorImpl still holds a reference back to `t2`. So when we run the op on `t1` and `THPVariable_Wrap` returns the pointer stored in the TensorImpl's PyObject slot, we get a pointer to `t2` instead.

The TensorImpls should therefore have the PyObjects in their PyObjectSlots swapped as well in `swap_tensors`.
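
With the PyObject slots swapped too, the same sequence behaves as expected (a sketch of the intended post-fix behavior):
```python
import torch

torch.manual_seed(5)
t1, t2 = torch.randn(2), torch.randn(3)
torch.utils.swap_tensors(t1, t2)
t1.fill_(0.5)  # mutates the swapped payload...
print(t1)      # ...tensor([0.5000, 0.5000, 0.5000]), not the stale pre-swap values
```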

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116955
Approved by: https://github.com/albanD
2024-01-24 01:40:18 +00:00
Nikita Shulga
bff348b28f [AOTI] Add missing include to model.h (#118075)
At least when one tries to compile the AOTI code on Darwin, compilation fails with an "implicit instantiation of undefined template" error:
```
In file included from /Users/nshulga/git/pytorch/pytorch/torch/include/torch/csrc/inductor/aoti_runtime/arrayref_tensor.h:3:
/Users/nshulga/git/pytorch/pytorch/torch/include/torch/csrc/inductor/aoti_runtime/model.h:69:21: error: implicit instantiation of undefined template 'std::basic_stringstream<char>'
  std::stringstream ss;
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118075
Approved by: https://github.com/desertfire
ghstack dependencies: #118074
2024-01-23 14:34:00 +00:00
Will Constable
455bba38f4 [C10D] Make Flight Recorder report time_created in ns (#118047)
Addresses (6) from #117883

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118047
Approved by: https://github.com/zdevito
ghstack dependencies: #118044, #118046
2024-01-23 08:18:08 +00:00
Will Constable
5df92a9244 [C10D] Add version tag to NCCL Flight Recorder Dump (#118046)
Addresses (3) from #117883

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118046
Approved by: https://github.com/zdevito
ghstack dependencies: #118044
2024-01-23 08:18:08 +00:00
Will Constable
dace1fda2e [C10D] Make NCCL Flight Recorder dump produce a dict (#118044)
Putting the list of entries into a particular key of a top-level dict
paves the way for adding other metadata as other top level keys.
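
A hypothetical sketch of the resulting dump shape (the key name here is assumed for illustration, not taken from the diff):
```python
dump = {
    "entries": [
        # the per-collective records that previously were the entire dump
    ],
    # other top-level metadata keys (e.g. a version tag, see #118046)
    # can now be added alongside the entries
}
```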

Addresses 1 and 2 from #117883

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118044
Approved by: https://github.com/zdevito
2024-01-23 08:18:08 +00:00
Will Constable
6049998971 [C10D] Finer-grain nccl heartbeat, avoid false positive hangs (#118016)
Summary:
Previously, the heartbeat was incremented once per pass of a for loop over the list
of in-progress work items, under the assumption that the processing would either
be predictably quick or hang completely.

In fact, CUDA API contention can cause the processing of work items to slow down
arbitrarily without truly deadlocking. To guard against this, we bump the
heartbeat at the smallest unit of progress: one work item being successfully
processed.
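
In Python-style pseudocode, the change moves the bump from once per pass to once per item (all names here are illustrative; the real code is C++ in ProcessGroupNCCL):
```python
def watchdog_pass(in_progress, heartbeat):
    for work in list(in_progress):
        if work.is_completed():   # each check can stall on contended CUDA APIs
            work.finish()
            in_progress.remove(work)
            heartbeat += 1        # smallest unit of progress (the fix)
    # previously: heartbeat += 1 here, once per full pass over the list
    return heartbeat
```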

Test Plan: CI

Differential Revision: D52973948

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118016
Approved by: https://github.com/shuqiangzhang, https://github.com/kwen2501
2024-01-23 07:25:18 +00:00
PyTorch MergeBot
b5799d9977 Revert "[c10d] Barrier uses stream sync instead of device sync (#117804)"
This reverts commit 0f6bbb1c07.

Reverted https://github.com/pytorch/pytorch/pull/117804 on behalf of https://github.com/clee2000 due to sorry the docs test failure is real, I think it wants the lines after the .. note to be indented https://github.com/pytorch/pytorch/actions/runs/7616827874/job/20745016788.  Marking as nosignal due to bad Dr. CI categorization ([comment](https://github.com/pytorch/pytorch/pull/117804#issuecomment-1904882487))
2024-01-22 21:54:03 +00:00
Ke Wen
0f6bbb1c07 [c10d] Barrier uses stream sync instead of device sync (#117804)
Resubmitting #96785

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117804
Approved by: https://github.com/wconstab
2024-01-22 20:14:51 +00:00
Jeff Daily
01abb5af21 additional support for float8_e4m3fnuz and _e5m2fnuz (#115214)
Follow up to #107586.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115214
Approved by: https://github.com/peterbell10, https://github.com/malfet
2024-01-22 18:33:41 +00:00
Guilherme Leobas
80cf0ce153 Enhance torch.vmap support from inside torch.compile (#116050)
This work rewrites vmap support in torch.compile by inlining most of
the frames into the existing FX graph. It also unlocks support for
features that were previously missing, such as keyword args.
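
For example, a program like the following (a minimal sketch) can now be compiled with `vmap` inlined into the graph:
```python
import torch

def per_sample_loss(x):
    return torch.nn.functional.relu(x).sum()

batched = torch.vmap(per_sample_loss)     # one loss per batch element
compiled = torch.compile(batched, fullgraph=True)
print(compiled(torch.randn(8, 4)).shape)  # torch.Size([8])
```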

Fixes: https://github.com/pytorch/pytorch/issues/114306

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116050
Approved by: https://github.com/zou3519
2024-01-22 17:53:45 +00:00
cyy
39df084001 [Clang-tidy header][16/N] Enable clang-tidy on headers in torch/csrc/autograd (#117821)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117821
Approved by: https://github.com/Skylion007
2024-01-22 00:52:56 +00:00
eqy
8f7caaee67 [cuDNN] Fix cuDNN version parsing against future versions of cuDNN (#117908)
Remove the unnecessary dependence on a fixed number of digits per version field.
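
For context, a hedged Python illustration of digit-count-free parsing (cuDNN 9 moved to a MAJOR*10000 integer encoding, while older releases used MAJOR*1000; this sketch is not the actual parsing code):
```python
def parse_cudnn_version(v: int) -> tuple:
    # Split a cuDNN integer version into (major, minor, patch).
    if v >= 90000:  # cuDNN >= 9: MAJOR*10000 + MINOR*100 + PATCH
        return v // 10000, (v % 10000) // 100, v % 100
    return v // 1000, (v % 1000) // 100, v % 100  # older: MAJOR*1000 + ...
```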

CC @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117908
Approved by: https://github.com/cpuhrsch
2024-01-21 05:00:01 +00:00
fduwjj
05ef2030ea [c10d] Add logs for NCCL Comm Abort call (#117868)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117868
Approved by: https://github.com/kwen2501
2024-01-20 21:34:13 +00:00
Wei Lu
a1b3b5748f [Pytoch][Vulkan] Create context for conv1d (#117780)
Summary:
`conv1d` has two arguments, `weight` and `bias`, which are stored as constant tensors on the CPU and transferred to the GPU on every inference call. We create a context for this operator to avoid the repeated transfer. Specifically, we
- created `Conv1dPackedContext`, `create_conv1d_context` and `run_conv1d_context` in `Convolution.h` and `Convolution.cpp`
- registered them in `Register.cpp`
- rewrote the graph representation of the op in `vulkan_rewrite.cpp`

Test Plan:
## Numerical test
```
[luwei@82308.od /data/sandcastle/boxes/fbsource (8a8d911dc)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*conv1d*"
Buck UI: https://www.internalfb.com/buck2/7760800b-fd75-479a-9368-be5fcd5a7fef
Network: Up: 0B  Down: 0B
Jobs completed: 4. Time elapsed: 0.6s.
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *conv1d*
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.conv1d_simple
[       OK ] VulkanAPITest.conv1d_simple (159 ms)
[ RUN      ] VulkanAPITest.conv1d
[       OK ] VulkanAPITest.conv1d (57 ms)
[----------] 2 tests from VulkanAPITest (217 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (217 ms total)
[  PASSED  ] 2 tests.
```

Full test result in P1053644934, summary as below
```
[----------] 419 tests from VulkanAPITest (28080 ms total)
[----------] Global test environment tear-down
[==========] 419 tests from 1 test suite ran. (28080 ms total)
[  PASSED  ] 418 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
```
## Graph representation comparison
We created a model using `conv1d` and traced it as below
```
import torch
import torch.nn as nn

# Define a simple model that uses conv1d
class MyModel(torch.nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.conv1d = nn.Conv1d(16, 33, 3)

    def forward(self, x):
        return self.conv1d(x)

# Create an instance of the model
model = MyModel()

# Create a dummy input tensor for tracing
input_tensor = torch.randn(20, 16, 50)

# Use torch.jit.trace to trace the model and generate a graph
traced_model = torch.jit.trace(model, input_tensor)
```
Then we converted the traced model to Vulkan backend using `optimize_for_mobile`
```
from torch.utils import mobile_optimizer

# `to_preserve` is a list of extra method names to keep (defined elsewhere)
vulkan_model = mobile_optimizer.optimize_for_mobile(
    traced_model, backend="vulkan", preserved_methods=to_preserve
)
```
Next we can print the graph of the Vulkan model with `print(vulkan_model.graph)`:
- before this diff: `conv1d` was used
```
graph(%self.1 : __torch__.___torch_mangle_16.MyModel,
      %x : Tensor):
  %60 : Device = prim::Constant[value="cpu"]()
  %self.conv1d.bias : Float(33, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=<Tensor>]()
  %37 : bool = prim::Constant[value=0]()
  %36 : NoneType = prim::Constant()
  %59 : Device = prim::Constant[value="vulkan"]()
  %self.conv1d.weight : Float(33, 16, 3, strides=[48, 3, 1], requires_grad=0, device=cpu) = prim::Constant[value=<Tensor>]()
  %7 : int = prim::Constant[value=1](), scope: __module.conv1d # /mnt/xarfuse/uid-23453/243f3953-seed-nspid4026532834_cgpid7972545-ns-4026532831/torch/nn/modules/conv.py:306:0
  %18 : int[] = prim::Constant[value=[1]]()
  %19 : int[] = prim::Constant[value=[0]]()
  %39 : Tensor = aten::to(%x, %59, %36, %37, %37)
  %20 : Tensor = aten::conv1d(%39, %self.conv1d.weight, %self.conv1d.bias, %18, %19, %18, %7)
  %58 : Tensor = aten::to(%20, %60, %36, %37, %37)
  return (%58)
```
- after this diff: `conv1d` was replaced with `run_conv1d_context`
```
graph(%self.1 : __torch__.___torch_mangle_6.MyModel,
      %x : Tensor):
  %85 : Device = prim::Constant[value="cpu"]()
  %51 : bool = prim::Constant[value=0]()
  %50 : NoneType = prim::Constant()
  %84 : Device = prim::Constant[value="vulkan"]()
  %53 : Tensor = aten::to(%x, %84, %50, %51, %51)
  %prepack_folding_forward._jit_pass_packed_weight_0 : __torch__.torch.classes.vulkan.Conv1dPackedContext = prim::GetAttr[name="prepack_folding_forward._jit_pass_packed_weight_0"](%self.1)
  %22 : Tensor = vulkan_prepack::run_conv1d_context(%53, %prepack_folding_forward._jit_pass_packed_weight_0)
  %83 : Tensor = aten::to(%22, %85, %50, %51, %51)
  return (%83)
```

Differential Revision: D52865379

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117780
Approved by: https://github.com/yipjustin
2024-01-20 02:35:32 +00:00
Scott Wolchok
ad3d41692e [PyTorch] return decltype(auto) from getItem (#117569)
This allows getItem to take advantage of the nicer (sometimes-const-reference) return type from `List::get() const` added in the previous diff.

Differential Revision: [D52809097](https://our.internmc.facebook.com/intern/diff/D52809097/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117569
Approved by: https://github.com/iseeyuan, https://github.com/malfet
ghstack dependencies: #117568
2024-01-19 21:04:53 +00:00
PyTorch MergeBot
b637fdc8b3 Revert "additional support for float8_e4m3fnuz and _e5m2fnuz (#115214)"
This reverts commit 74e1362499.

Reverted https://github.com/pytorch/pytorch/pull/115214 on behalf of https://github.com/PaliC due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/115214#issuecomment-1900815152))
2024-01-19 17:35:04 +00:00
dilililiwhy
924ed91612 Move getDurationFromFirstEvent to USE_C10D_NCCL ifdef (#117738)
Fixes #117517

Move the NCCL-related function *getDurationFromFirstEvent* behind the USE_C10D_NCCL ifdef (related to https://github.com/pytorch/pytorch/issues/114575).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117738
Approved by: https://github.com/wconstab, https://github.com/XilunWu
2024-01-19 04:28:47 +00:00
cyy
38d9b3d937 Remove use of math_compat.h (#116167)
Because ANDROID >= 21 is assumed in CI tests, it is time to remove old workarounds. math_compat.h contains only wrapper math functions for Android, so its usage can be removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116167
Approved by: https://github.com/ezyang
2024-01-19 03:37:55 +00:00
cyy
5c17f66a3d [Exception] [5/N] Remove torch::IndexError (#117713)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117713
Approved by: https://github.com/ezyang
2024-01-19 03:36:15 +00:00
Ke Wen
c16e6e4cf7 [ProcessGroup] Make watchdog check work queue more frequently (#117297)
Today the watchdog's sleep interval is 1 s. That's a bit long compared to the speed of modern GPU links (or network links).

Take DDP and Ampere for example:

DDP's bucket size = 25 MB
Ampere's NVLink speed = 250 GB/s

25 MB / 250 GB/s = 0.1 ms per bucket, so a 1 s check interval is far too coarse. We are updating the interval to 100 ms.

Update:
the original estimate of 100 ms per bucket was off by 1000x (the correct figure is 0.1 ms), but let's see how 100 ms goes before making the checking more aggressive.
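
The unit conversion, spelled out:
```python
bucket_bytes = 25e6               # DDP bucket size, 25 MB
nvlink_bps   = 250e9              # Ampere NVLink bandwidth, 250 GB/s
print(bucket_bytes / nvlink_bps)  # 1e-04 s, i.e. 0.1 ms per bucket
```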

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117297
Approved by: https://github.com/fduwjj
2024-01-19 02:33:31 +00:00
Jeff Daily
74e1362499 additional support for float8_e4m3fnuz and _e5m2fnuz (#115214)
Follow up to #107586.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115214
Approved by: https://github.com/peterbell10
2024-01-19 00:50:18 +00:00
PyTorch MergeBot
2f84a9d37c Revert "[CUDNN][SDPA] Experimental cuDNN Flash Attention v2 Inference (#115663)"
This reverts commit 5aa92b5090.

Reverted https://github.com/pytorch/pytorch/pull/115663 on behalf of https://github.com/PaliC due to Unfortunately, this pr breaks cuda builds internally ([comment](https://github.com/pytorch/pytorch/pull/115663#issuecomment-1899388813))
2024-01-18 23:40:30 +00:00
Jason Ansel
a669319450 [inductor] Faster C++ kernel python bindings (#117500)
Calling C++ from Python via ctypes is notoriously slow.  This switches to generating our own C++ bindings directly, which is a >5x speedup on this kernel-launch-bound microbenchmark:
```python
from ctypes import c_void_p
import torch
from torch import empty
from torch._inductor.codecache import AsyncCompile
from torch._dynamo.testing import rand_strided
from torch._inductor.utils import print_performance
from torch._inductor.wrapper_benchmark import compiled_module_main

async_compile = AsyncCompile()

src = '''
#include "/tmp/torchinductor_jansel/gb/cgbau5vlj6cetmcjbjbtw6x4rrivaln6f45s5d72gy2bfx5foz3k.h"
extern "C" void kernel(const float* in_ptr0,
                       float* out_ptr0)
{
    {
        auto tmp0 = in_ptr0[static_cast<long>(0L)];
        auto tmp1 = static_cast<float>(1.0);
        auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
        out_ptr0[static_cast<long>(0L)] = tmp2;
    }
}
'''

cpp_fused_add_ctypes = async_compile.cpp(src)
cpp_fused_add_cpython = async_compile.cpp_pybinding(["const float*", "float*"], src)

async_compile.wait(globals())
del async_compile

def call(arg0_1):
    buf0 = empty((1,), device='cpu', dtype=torch.float32)
    if use_ctypes:
        for _ in range(100):
            cpp_fused_add_ctypes(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr()))
    else:
        for _ in range(100):
            cpp_fused_add_cpython(arg0_1, buf0)
    del arg0_1
    return (buf0,)

def benchmark_compiled_module(times=1000, repeat=100):
    arg0_1 = rand_strided((1,), (1,), device='cpu', dtype=torch.float32)
    return print_performance(lambda: call(arg0_1), times=times, repeat=repeat)

print("old ctypes bindings: ", end='')
use_ctypes = True
compiled_module_main('None', benchmark_compiled_module)
print("new bindings:        ", end='')
use_ctypes = False
compiled_module_main('None', benchmark_compiled_module)
```
Output:
```
old ctypes bindings: 0.000073
new bindings:        0.000013
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117500
Approved by: https://github.com/desertfire
2024-01-18 16:20:12 +00:00
Bin Bao
26956980c6 [AOTI] Add torch._export.aot_load (#117610)
Summary: Add a `torch._export.aot_load` API that can load an AOTInductor-compiled model.so into a Python process.
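
A hedged usage sketch (the model, inputs, and the `aot_compile` pairing are placeholders for whatever produced the model.so):
```python
import torch

model = MyModel().eval()               # placeholder module
example_inputs = (torch.randn(8, 16),)

so_path = torch._export.aot_compile(model, example_inputs)  # -> model.so
loaded = torch._export.aot_load(so_path, device="cpu")
out = loaded(*example_inputs)
```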

Test Plan: CI

Differential Revision: D52825456

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117610
Approved by: https://github.com/angelayi, https://github.com/khabinov, https://github.com/chenyang78
2024-01-18 15:02:16 +00:00
Tobias Ringwald
bc9cb04822 Replaced CHECK with TORCH_CHECK in order to not abort, but throw a RuntimeError instead (#117653)

Fixes #117499.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117653
Approved by: https://github.com/antoniojkim, https://github.com/JackCaoG, https://github.com/alanwaketan
2024-01-18 07:47:22 +00:00
Ke Wen
6d96beb6be [c10d] Remove health check (#117699)
https://github.com/pytorch/pytorch/pull/114916 and https://github.com/pytorch/pytorch/pull/116222 added support for eager NCCL comm init (performed as soon as `init_process_group` is called).

If any user cares about the time difference and wants to see NCCL init errors early, they can use eager init now.
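
Eager init is opted into by binding the process group to a device at creation (a sketch; `device_id` is the kwarg added in the PRs above):
```python
import torch
import torch.distributed as dist

# Binding a device triggers eager NCCL comm init, so init errors
# surface here instead of at the first collective.
dist.init_process_group("nccl", device_id=torch.device("cuda:0"))
```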

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117699
Approved by: https://github.com/wconstab
2024-01-18 02:14:49 +00:00