Commit Graph

1178 Commits

Author SHA1 Message Date
PyTorch MergeBot
cb4919344a Revert "[BE] update type annotations for basic utilities in torch/__init__.py (#129001)"
This reverts commit e53d959028.

Reverted https://github.com/pytorch/pytorch/pull/129001 on behalf of https://github.com/XuehaiPan due to lint failure ([comment](https://github.com/pytorch/pytorch/pull/129001#issuecomment-2186944549))
2024-06-24 16:18:43 +00:00
Xuehai Pan
e53d959028 [BE] update type annotations for basic utilities in torch/__init__.py (#129001)
Changes:

1. Make some arguments positional-only as we only support Python 3.8+
2. Clean up `torch.typename(obj)` implementation.
3. Update type annotations., especially `is_tensor()` and `is_masked_tensor()` using `TypeGuard`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129001
Approved by: https://github.com/malfet
2024-06-24 14:35:41 +00:00
Yifu Wang
217aac96d7 Introduce a prototype for SymmetricMemory (#128582)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

This PR introduces a prototype for `SymmetricMemory` (including a CUDA implementation) - a remote-memory access-based communication primitive. It allows for user-defined communication patterns/kernels and is designed to be torch.compile-friendly. It addresses the major limitations of `IntraNodeComm` and `ProcessGroupCudaP2p` and serves as a replacement for them.

### SymmetricMemory

`SymmetricMemory` represents symmetric allocations across a group of devices. The allocations represented by a `SymmetricMemory` object are accessible by all devices in the group. The class can be used for **op-level custom communication patterns** (via the get_buffer APIs and the synchronization primitives), as well as **custom communication kernels** (via the buffer and signal_pad device pointers).

### Python API Example

```python
from torch._C.distributed_c10d import _SymmetricMemory

# Set a store for rendezvousing symmetric allocations on a group of devices
# identified by group_name. The concept of groups is logical; users can
# utilize predefined groups (e.g., a group of device identified by a
# ProcessGroup) or create custom ones. Note that a SymmetricMemoryAllocator
# backends might employ a more efficient communication channel for the actual
# rendezvous process and only use the store for bootstrapping purposes.
_SymmetricMemory.set_group_info(group_name, rank, world_size, store)

# Identical to empty_strided, but allows symmetric memory access to be
# established for the allocated tensor via _SymmetricMemory.rendezvous().
# This function itself is not a collective operation.
t = _SymmetricMemory.empty_strided_p2p((64, 64), (64, 1), torch.float32, group_name)

# Users can write Python custom ops that leverages the symmetric memory access.
# Below are examples of things users can do (assuming the group's world_size is 2).

# Establishes symmetric memory access on tensors allocated via
# _SymmetricMemory.empty_strided_p2p(). rendezvous() is a one-time process,
# and the mapping between a local memory region and the associated SymmetricMemory
# object is unique. Subsequent calls to rendezvous() with the same tensor will receive
# the cached SymmetricMemory object.
#
# The function has a collective semantic and must be invoked simultaneously
# from all rendezvous participants.
symm_mem = _SymmetricMemory.rendezvous(t)

# This represents the allocation on rank 0 and is accessible from all devices.
buf = symm_mem.get_buffer(0, (64, 64), torch.float32)

if symm_mem.rank == 0:
    symm_mem.wait_signal(src_rank=1)
    assert buf.eq(42).all()
else:
    # The remote buffer can be used as a regular tensor
    buf.fill_(42)
    symm_mem.put_signal(dst_rank=0)

symm_mem.barrier()

if symm_mem.rank == 0:
    symm_mem.barrier()
    assert buf.eq(43).all()
else:
    new_val = torch.empty_like(buf)
    new_val.fill_(43)
    # Contiguous copies to/from a remote buffer utilize copy engines
    # which bypasses SMs (i.e. no need to load the data into registers)
    buf.copy_(new_val)
    symm_mem.barrier()
```

### Custom CUDA Comm Kernels

Given a tensor, users can access the associated `SymmetricMemory` which provides pointer to remote buffers/signal_pads needed for custom communication kernels.

```cpp
TORCH_API c10::intrusive_ptr<SymmetricMemory> get_symmetric_memory(
    const at::Tensor& tensor);

class TORCH_API SymmetricMemory : public c10::intrusive_ptr_target {
 public:
  ...
  virtual std::vector<void*> get_buffer_ptrs() = 0;
  virtual std::vector<void*> get_signal_pad_ptrs() = 0;
  virtual void** get_buffer_ptrs_dev() = 0;
  virtual void** get_signal_pad_ptrs_dev() = 0;
  virtual size_t get_buffer_size() = 0;
  virtual size_t get_signal_pad_size() = 0;
  virtual int get_rank() = 0;
  virtual int get_world_size() = 0;
  ...
};
```

### Limitations of IntraNodeComm and ProcessGroupCudaP2p
Both `IntraNodeComm` (used by `ProcessGroupCudaP2p`) manages a single fixed-size workspace. This approach:
- Leads to awkward UX in which the required workspace needs to be specified upfront.
- Can not avoid extra copies for some algorithms in eager mode (e.g., custom/multimem all-reduce, reduce-scatter, all-gather).
- Prevents torch.compile from eliminating all copies.

In addition, they only offer out-of-the-box communication kernels and don't expose required pointers for user-defined, custom CUDA comm kernels.

* __->__ #128582

Differential Revision: [D58849033](https://our.internmc.facebook.com/intern/diff/D58849033)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128582
Approved by: https://github.com/wanchaol
2024-06-21 08:49:11 +00:00
Jiong Gong
914d3ca2ba [inductor][cpp] BF16 AMX micro-gemm support (#127195)
This PR adds the intrinsics based micro-gemm for BF16 using Advanced Matrix eXtension (AMX) instructions available in Intel 4th and 5th Xeon processors. A compilation check is added to `codecache.py` to check the validity of the compiler support. Also, since AMX requires an initialization in the Linux kernel to extra register states, an initialization function is added to do that and triggered via `codecache.py`.

Performance speedups with >=10% on BF16 AMP, max_autotune vs. no autotune, measured on Intel(R) Xeon(R) Platinum 8488C:
Static shapes
Single-threaded
| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| timm_models | mixer_b16_224 | 1.54 |
| timm_models | convit_base | 1.53 |
| huggingface | MobileBertForQuestionAnswering | 1.52 |
| torchbench | fastNLP_Bert | 1.44 |
| torchbench | llama | 1.33 |
| timm_models | swin_base_patch4_window7_224 | 1.31 |
| torchbench | dlrm | 1.28 |
| torchbench | timm_vision_transformer_large | 1.28 |
| huggingface | MobileBertForMaskedLM | 1.27 |
| timm_models | vit_base_patch16_224 | 1.26 |
| timm_models | beit_base_patch16_224 | 1.23 |
| timm_models | jx_nest_base | 1.21 |
| torchbench | pyhpc_equation_of_state | 1.18 |
| huggingface | Speech2Text2ForCausalLM | 1.15 |
| timm_models | pit_b_224 | 1.14 |
| timm_models | twins_pcpvt_base | 1.14 |
| torchbench | maml_omniglot | 1.1 |
| timm_models | botnet26t_256 | 1.1 |

Multi-threaded
| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | BERT_pytorch | 1.35 |
| torchbench | lennard_jones | 2.43 |
| torchbench | hf_Albert | 1.35 |
| torchbench | hf_T5 | 1.34 |
| torchbench | soft_actor_critic | 1.34 |
| torchbench | fastNLP_Bert | 1.28 |
| huggingface | LayoutLMForSequenceClassification | 1.26 |
| torchbench | llama | 1.24 |
| huggingface | GPT2ForSequenceClassification | 1.19 |
| torchbench | hf_Bart | 1.17 |
| torchbench | hf_Bert_large | 1.16 |
| torchbench | hf_GPT2 | 1.16 |
| timm_models | gmixer_24_224 | 1.16 |
| torchbench | hf_GPT2_large | 1.15 |
| torchbench | maml_omniglot | 1.14 |
| torchbench | hf_Bert | 1.13 |
| torchbench | hf_DistilBert | 1.13 |
| torchbench | hf_T5_large | 1.12 |
| huggingface | MT5ForConditionalGeneration | 1.11 |

Dynamic shapes
Single-threaded
| Model Family | Model Name | Speedup |
|--------------|------------|-------|
| timm_models | mixer_b16_224 | 1.52 |
| timm_models | convit_base | 1.5 |
| huggingface | MobileBertForQuestionAnswering | 1.49 |
| torchbench | fastNLP_Bert | 1.42 |
| torchbench | timm_vision_transformer_large | 1.28 |
| timm_models | swin_base_patch4_window7_224 | 1.27 |
| torchbench | llama | 1.26 |
| huggingface | MobileBertForMaskedLM | 1.25 |
| timm_models | vit_base_patch16_224 | 1.25 |
| timm_models | beit_base_patch16_224 | 1.24 |
| timm_models | jx_nest_base | 1.2 |
| torchbench | dlrm | 1.19 |
| timm_models | pit_b_224 | 1.13 |
| timm_models | twins_pcpvt_base | 1.13 |
| torchbench | hf_Bert_large | 1.12 |
| torchbench | hf_BigBird | 1.11 |
| huggingface | Speech2Text2ForCausalLM | 1.11 |
| timm_models | eca_botnext26ts_256 | 1.11 |
| timm_models | botnet26t_256 | 1.1 |

Multi-threaded
| Model Family | Model Name | Speedup |
|--------------|------------|-------|
| torchbench | BERT_pytorch | 1.18 |
| torchbench | lennard_jones | 2.18 |
| torchbench | hf_Albert | 1.37 |
| torchbench | soft_actor_critic | 1.31 |
| huggingface | GPT2ForSequenceClassification | 1.29 |
| torchbench | hf_T5 | 1.28 |
| torchbench | fastNLP_Bert | 1.27 |
| torchbench | hf_Bart | 1.21 |
| torchbench | hf_Bert_large | 1.19 |
| torchbench | hf_T5_large | 1.19 |
| torchbench | hf_Bert | 1.16 |
| torchbench | hf_GPT2 | 1.16 |
| huggingface | CamemBert | 1.16 |
| torchbench | hf_GPT2_large | 1.13 |
| torchbench | functorch_maml_omniglot | 1.12 |
| huggingface | BertForMaskedLM | 1.12 |
| huggingface | MT5ForConditionalGeneration | 1.12 |
| torchbench | hf_DistilBert | 1.11 |
| timm_models | mixnet_l | 1.11 |
| timm_models | tf_mixnet_l | 1.11 |

No perf regressions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127195
Approved by: https://github.com/jansel
2024-06-21 07:21:47 +00:00
cyy
5c676bb8b3 Remove Caffe2 handling from onnx_unpack_quantized_weights (#129021)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129021
Approved by: https://github.com/justinchuby, https://github.com/albanD
2024-06-21 06:16:44 +00:00
Deng Weishi
b542825066 Enable deterministic support for oneDNN (#127277)
This PR is a part of RFC https://github.com/pytorch/pytorch/issues/114848.
For the request for Torchbenchmark models, this PR enables the deterministic attribute for the oneDNN operators for XPU backends, like convolution, deconvolution and matmult.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127277
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/desertfire, https://github.com/gujinghui
2024-06-21 05:21:24 +00:00
PyTorch MergeBot
63a724d8e1 Revert "Introduce a prototype for SymmetricMemory (#128582)"
This reverts commit 8771e3429c.

Reverted https://github.com/pytorch/pytorch/pull/128582 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128582#issuecomment-2181656181))
2024-06-20 22:31:29 +00:00
PyTorch MergeBot
e84cf805d2 Revert "Modularize aten parameter parser and checker (#125308)"
This reverts commit 60bbdc0b40.

Reverted https://github.com/pytorch/pytorch/pull/125308 on behalf of https://github.com/fbgheith due to test failures when run by meta ([comment](https://github.com/pytorch/pytorch/pull/125308#issuecomment-2181327211))
2024-06-20 18:52:05 +00:00
Yifu Wang
8771e3429c Introduce a prototype for SymmetricMemory (#128582)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

This PR introduces a prototype for `SymmetricMemory` (including a CUDA implementation) - a remote-memory access-based communication primitive. It allows for user-defined communication patterns/kernels and is designed to be torch.compile-friendly. It addresses the major limitations of `IntraNodeComm` and `ProcessGroupCudaP2p` and serves as a replacement for them.

### SymmetricMemory

`SymmetricMemory` represents symmetric allocations across a group of devices. The allocations represented by a `SymmetricMemory` object are accessible by all devices in the group. The class can be used for **op-level custom communication patterns** (via the get_buffer APIs and the synchronization primitives), as well as **custom communication kernels** (via the buffer and signal_pad device pointers).

### Python API Example

```python
from torch._C.distributed_c10d import _SymmetricMemory

# Set a store for rendezvousing symmetric allocations on a group of devices
# identified by group_name. The concept of groups is logical; users can
# utilize predefined groups (e.g., a group of device identified by a
# ProcessGroup) or create custom ones. Note that a SymmetricMemoryAllocator
# backends might employ a more efficient communication channel for the actual
# rendezvous process and only use the store for bootstrapping purposes.
_SymmetricMemory.set_group_info(group_name, rank, world_size, store)

# Identical to empty_strided, but allows symmetric memory access to be
# established for the allocated tensor via _SymmetricMemory.rendezvous().
# This function itself is not a collective operation.
t = _SymmetricMemory.empty_strided_p2p((64, 64), (64, 1), torch.float32, group_name)

# Users can write Python custom ops that leverages the symmetric memory access.
# Below are examples of things users can do (assuming the group's world_size is 2).

# Establishes symmetric memory access on tensors allocated via
# _SymmetricMemory.empty_strided_p2p(). rendezvous() is a one-time process,
# and the mapping between a local memory region and the associated SymmetricMemory
# object is unique. Subsequent calls to rendezvous() with the same tensor will receive
# the cached SymmetricMemory object.
#
# The function has a collective semantic and must be invoked simultaneously
# from all rendezvous participants.
symm_mem = _SymmetricMemory.rendezvous(t)

# This represents the allocation on rank 0 and is accessible from all devices.
buf = symm_mem.get_buffer(0, (64, 64), torch.float32)

if symm_mem.rank == 0:
    symm_mem.wait_signal(src_rank=1)
    assert buf.eq(42).all()
else:
    # The remote buffer can be used as a regular tensor
    buf.fill_(42)
    symm_mem.put_signal(dst_rank=0)

symm_mem.barrier()

if symm_mem.rank == 0:
    symm_mem.barrier()
    assert buf.eq(43).all()
else:
    new_val = torch.empty_like(buf)
    new_val.fill_(43)
    # Contiguous copies to/from a remote buffer utilize copy engines
    # which bypasses SMs (i.e. no need to load the data into registers)
    buf.copy_(new_val)
    symm_mem.barrier()
```

### Custom CUDA Comm Kernels

Given a tensor, users can access the associated `SymmetricMemory` which provides pointer to remote buffers/signal_pads needed for custom communication kernels.

```cpp
TORCH_API c10::intrusive_ptr<SymmetricMemory> get_symmetric_memory(
    const at::Tensor& tensor);

class TORCH_API SymmetricMemory : public c10::intrusive_ptr_target {
 public:
  ...
  virtual std::vector<void*> get_buffer_ptrs() = 0;
  virtual std::vector<void*> get_signal_pad_ptrs() = 0;
  virtual void** get_buffer_ptrs_dev() = 0;
  virtual void** get_signal_pad_ptrs_dev() = 0;
  virtual size_t get_buffer_size() = 0;
  virtual size_t get_signal_pad_size() = 0;
  virtual int get_rank() = 0;
  virtual int get_world_size() = 0;
  ...
};
```

### Limitations of IntraNodeComm and ProcessGroupCudaP2p
Both `IntraNodeComm` (used by `ProcessGroupCudaP2p`) manages a single fixed-size workspace. This approach:
- Leads to awkward UX in which the required workspace needs to be specified upfront.
- Can not avoid extra copies for some algorithms in eager mode (e.g., custom/multimem all-reduce, reduce-scatter, all-gather).
- Prevents torch.compile from eliminating all copies.

In addition, they only offer out-of-the-box communication kernels and don't expose required pointers for user-defined, custom CUDA comm kernels.

* __->__ #128582

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128582
Approved by: https://github.com/wanchaol
2024-06-19 03:38:58 +00:00
PyTorch MergeBot
77830d509f Revert "Introduce a prototype for SymmetricMemory (#128582)"
This reverts commit 7a39755da2.

Reverted https://github.com/pytorch/pytorch/pull/128582 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128582#issuecomment-2176685232))
2024-06-18 18:11:43 +00:00
cyy
163847b1bb [1/N] [Caffe2] Remove caffe2_aten_fallback code (#128675)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128675
Approved by: https://github.com/r-barnes
2024-06-17 21:25:59 +00:00
Yifu Wang
7a39755da2 Introduce a prototype for SymmetricMemory (#128582)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

This PR introduces a prototype for `SymmetricMemory` (including a CUDA implementation) - a remote-memory access-based communication primitive. It allows for user-defined communication patterns/kernels and is designed to be torch.compile-friendly. It addresses the major limitations of `IntraNodeComm` and `ProcessGroupCudaP2p` and serves as a replacement for them.

### SymmetricMemory

`SymmetricMemory` represents symmetric allocations across a group of devices. The allocations represented by a `SymmetricMemory` object are accessible by all devices in the group. The class can be used for **op-level custom communication patterns** (via the get_buffer APIs and the synchronization primitives), as well as **custom communication kernels** (via the buffer and signal_pad device pointers).

### Python API Example

```python
from torch._C.distributed_c10d import _SymmetricMemory

# Set a store for rendezvousing symmetric allocations on a group of devices
# identified by group_name. The concept of groups is logical; users can
# utilize predefined groups (e.g., a group of device identified by a
# ProcessGroup) or create custom ones. Note that a SymmetricMemoryAllocator
# backends might employ a more efficient communication channel for the actual
# rendezvous process and only use the store for bootstrapping purposes.
_SymmetricMemory.set_group_info(group_name, rank, world_size, store)

# Identical to empty_strided, but allows symmetric memory access to be
# established for the allocated tensor via _SymmetricMemory.rendezvous().
# This function itself is not a collective operation.
t = _SymmetricMemory.empty_strided_p2p((64, 64), (64, 1), torch.float32, group_name)

# Users can write Python custom ops that leverages the symmetric memory access.
# Below are examples of things users can do (assuming the group's world_size is 2).

# Establishes symmetric memory access on tensors allocated via
# _SymmetricMemory.empty_strided_p2p(). rendezvous() is a one-time process,
# and the mapping between a local memory region and the associated SymmetricMemory
# object is unique. Subsequent calls to rendezvous() with the same tensor will receive
# the cached SymmetricMemory object.
#
# The function has a collective semantic and must be invoked simultaneously
# from all rendezvous participants.
symm_mem = _SymmetricMemory.rendezvous(t)

# This represents the allocation on rank 0 and is accessible from all devices.
buf = symm_mem.get_buffer(0, (64, 64), torch.float32)

if symm_mem.rank == 0:
    symm_mem.wait_signal(src_rank=1)
    assert buf.eq(42).all()
else:
    # The remote buffer can be used as a regular tensor
    buf.fill_(42)
    symm_mem.put_signal(dst_rank=0)

symm_mem.barrier()

if symm_mem.rank == 0:
    symm_mem.barrier()
    assert buf.eq(43).all()
else:
    new_val = torch.empty_like(buf)
    new_val.fill_(43)
    # Contiguous copies to/from a remote buffer utilize copy engines
    # which bypasses SMs (i.e. no need to load the data into registers)
    buf.copy_(new_val)
    symm_mem.barrier()
```

### Custom CUDA Comm Kernels

Given a tensor, users can access the associated `SymmetricMemory` which provides pointer to remote buffers/signal_pads needed for custom communication kernels.

```cpp
TORCH_API c10::intrusive_ptr<SymmetricMemory> get_symmetric_memory(
    const at::Tensor& tensor);

class TORCH_API SymmetricMemory : public c10::intrusive_ptr_target {
 public:
  ...
  virtual std::vector<void*> get_buffer_ptrs() = 0;
  virtual std::vector<void*> get_signal_pad_ptrs() = 0;
  virtual void** get_buffer_ptrs_dev() = 0;
  virtual void** get_signal_pad_ptrs_dev() = 0;
  virtual size_t get_buffer_size() = 0;
  virtual size_t get_signal_pad_size() = 0;
  virtual int get_rank() = 0;
  virtual int get_world_size() = 0;
  ...
};
```

### Limitations of IntraNodeComm and ProcessGroupCudaP2p
Both `IntraNodeComm` (used by `ProcessGroupCudaP2p`) manages a single fixed-size workspace. This approach:
- Leads to awkward UX in which the required workspace needs to be specified upfront.
- Can not avoid extra copies for some algorithms in eager mode (e.g., custom/multimem all-reduce, reduce-scatter, all-gather).
- Prevents torch.compile from eliminating all copies.

In addition, they only offer out-of-the-box communication kernels and don't expose required pointers for user-defined, custom CUDA comm kernels.

* __->__ #128582

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128582
Approved by: https://github.com/wanchaol
2024-06-15 10:20:21 +00:00
Wang, Eikan
60bbdc0b40 Modularize aten parameter parser and checker (#125308)
In this PR, we abstracted the different types of aten operation parameters as `ParameterMetadata`. This structure intends to be used to represent and store the metadata of each aten operation parameter. Currently, it only supports `Tensor`, `TensorList`, and `Scalar`.

```C++
using ParameterMetadataValue = std::variant<TensorMetadata, std::vector<TensorMetadata>, c10::Scalar>;
```

With this PR, we can extend other parameter-type support in a more modularize way, like `string`, `int`, `double`, and other different types to be summarized as the following list. The list is collected from all aten operations and ordered by the number of being used.

- `Tensor`
- `bool`
- `int64_t`
- `TensorList`
- `Scalar`
- `c10::SymIntArrayRef`
- `::std::optional<Tensor>`
- `IntArrayRef`
- `double`
- `c10::SymInt`
- `::std::optional<ScalarType>`
- `::std::optional<double>`
- `::std::optional<bool>`
- `::std::optional<Layout>`
- `::std::optional<Device>`
- `::std::optional<int64_t>`
- `Dimname`
- `::std::optional<Generator>`
- `c10::string_view`
- `::std::optional<c10::string_view>`
- `OptionalIntArrayRef`
- `::std::optional<Scalar>`
- `OptionalSymIntArrayRef`
- `::std::optional<MemoryFormat>`
- `::std::optional<c10::SymInt>`
- `ScalarType`
- `ArrayRef<Scalar>`
- `DimnameList`
- `::std::optional<ArrayRef<double>>`
- `::std::array<bool,3>`
- `::std::optional<DimnameList>`
- `c10::List<::std::optional<Tensor>>`
- `::std::array<bool,2>`
- `Storage`
- `::std::array<bool,4>`
- `Device`
- `DeviceIndex`
- `ITensorListRef`
- `Stream`
- `Layout`
- `MemoryFormat`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125308
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-06-15 09:18:44 +00:00
Simon Fan
4b96575a09 [dynamo][aot autograd] Silently disable default saved tensor hooks during tracing (#123196)
FIXES #113263. Same idea as in https://github.com/pytorch/pytorch/pull/113417, but we need a more intrusive C API to silently nop default saved tensor hooks, in order to support user-code that use torch.autograd.disable_saved_tensors_hooks (see test_unpack_hooks_can_be_disabled). We mock the output of get_hooks while leaving push/pop untouched.

For compiled autograd, we're firing pack hooks once and unpack hooks twice right now, I'll look into this separately from this issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123196
Approved by: https://github.com/soulitzer
2024-06-14 20:28:08 +00:00
Tristan Rice
7c370d2fb0 expose set_thread_name to Python and set thread names (#128448)
This adds a new multiprocessing method `_set_thread_name` and calls it from torchelastic and dataloader main functions. This will allow better monitoring of processes as we can separate elastic and dataloading processes from the main training process.

Threads named:

* torchrun/elastic
* PyTorch dataloader worker processes + pin memory thread
* TCPStore
* ProcessGroupNCCL background threads
* WorkerServer httpserver thread

Test plan:

```
$ torchrun --nnodes 1 --nproc_per_node 1 --no-python /bin/bash -c 'ps -eL | grep pt_'
3264281 3264281 pts/45   00:00:02 pt_elastic
3264281 3267950 pts/45   00:00:00 pt_elastic
```

dataloading

```py
import torch
import time

from torch.utils.data import (
    DataLoader,
    Dataset,
)

class NoopDataset(Dataset):
    def __getitem__(self, index):
        return index

    def __len__(self):
        return 10

dataloader = DataLoader(NoopDataset(), num_workers=2)

for i, x in enumerate(dataloader):
    print(i, x)
    time.sleep(10000)
```

```
$ python3 ~/scripts/dataloader_test.py
$ ps -eL | grep pt_
1228312 1228312 pts/45   00:00:02 pt_main_thread
1228312 1230058 pts/45   00:00:00 pt_main_thread
1228312 1230059 pts/45   00:00:00 pt_main_thread
1230052 1230052 pts/45   00:00:00 pt_data_worker
1230052 1230198 pts/45   00:00:00 pt_data_worker
1230052 1230740 pts/45   00:00:00 pt_data_worker
1230055 1230055 pts/45   00:00:00 pt_data_worker
1230055 1230296 pts/45   00:00:00 pt_data_worker
1230055 1230759 pts/45   00:00:00 pt_data_worker
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128448
Approved by: https://github.com/c-p-i-o, https://github.com/andrewkho, https://github.com/rsdcastro
2024-06-13 16:38:23 +00:00
PyTorch MergeBot
817ce6835b Revert "[cuDNN][SDPA] Remove TORCH_CUDNN_SDPA_ENABLED=1, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343)"
This reverts commit 4c971932e8.

Reverted https://github.com/pytorch/pytorch/pull/125343 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/125343#issuecomment-2163690162))
2024-06-12 18:47:52 +00:00
Xu Han
9e39c62908 correct avx512_vnni isa name. (#128318)
`x86` has two vnni isa currently: `avx2_vnni` and `avx512_vnni`.
This PR correct the function name to `avx512_vnni`.

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128318
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/desertfire
2024-06-12 16:12:49 +00:00
Kulin Seth
8df56afc20 Add support in Python API for the recommended max working set size. (#128289)
Adds ways for users to request recommended max size for Metal on Mac. It plumbs through
https://developer.apple.com/documentation/metal/mtldevice/2369280-recommendedmaxworkingsetsize?language=objc

Can be used like
```
        max_memory = torch.mps.recommended_max_memory()
        print ("Recommended Max Memory : ", (max_memory/(1024*1024*1024)), "GB")
```

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128289
Approved by: https://github.com/malfet
2024-06-12 16:03:57 +00:00
Aaron Orenstein
3c971d2ef3 Flip default value for mypy disallow_untyped_defs [final] (#127836)
Not requiring all functions to have types allows a lot of 'Any' types to slip in - which poison types and make mypy unable to properly typecheck the code.  I want to flip the default so that new files are required to have fully typed defs and we can have a burndown list of files that fail to require full types.

The preceding stack of PRs (cut up simply to limit the number of file changes per PR "reasonable") adds `# mypy: allow-untyped-defs` to any file which didn't immediately pass mypy with the flag flipped.  Due to changing files and merge conflicts it will probably be necessary to have several passes through before landing this final PR which turns the option on.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127836
Approved by: https://github.com/oulgen, https://github.com/Skylion007
2024-06-12 15:28:42 +00:00
PyTorch MergeBot
c9c1fed065 Revert "Flip default value for mypy disallow_untyped_defs [10+2/11] (#128374)"
This reverts commit c13e03c874.

Reverted https://github.com/pytorch/pytorch/pull/128374 on behalf of https://github.com/clee2000 due to sorry I need to revert this in order to revert something else, to remerge, just rebase and fix the merge conflict ([comment](https://github.com/pytorch/pytorch/pull/128374#issuecomment-2161772864))
2024-06-11 23:34:03 +00:00
Aaron Orenstein
c13e03c874 Flip default value for mypy disallow_untyped_defs [10+2/11] (#128374)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128374
Approved by: https://github.com/Skylion007
2024-06-11 15:58:28 +00:00
Oguz Ulgen
5b5d269d34 Speed up fx graph iteration by implementing it in C++ (#128288)
Before this change
```
python benchmarks/dynamo/microbenchmarks/fx_microbenchmarks.py
iterating over 100000000 FX nodes took 19.5s (5132266 nodes/s)
```

After this change
```
python benchmarks/dynamo/microbenchmarks/fx_microbenchmarks.py
iterating over 100000000 FX nodes took 3.4s (29114001 nodes/s)
```

5.7x improvement

Differential Revision: [D58343997](https://our.internmc.facebook.com/intern/diff/D58343997)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128288
Approved by: https://github.com/jansel, https://github.com/albanD
2024-06-11 05:48:31 +00:00
eqy
4c971932e8 [cuDNN][SDPA] Remove TORCH_CUDNN_SDPA_ENABLED=1, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343)
Looks like one of the first failures seen is `test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` when `test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` passes.

What seems interesting here is that the `torch.compile` version fails while the eager version passes. Not sure what the difference would be here...

Nevertheless, is there a recommended mechanism to skip cuDNN SDPA as a backend for this test? CC @drisspg
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125343
Approved by: https://github.com/Skylion007
2024-06-09 06:53:34 +00:00
Aaron Orenstein
dcfa7702c3 Flip default value for mypy disallow_untyped_defs [1/11] (#127838)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127838
Approved by: https://github.com/oulgen
2024-06-08 18:16:33 +00:00
Xu Han
ba81c3c290 [inductor] add cpp builder code. (take 2) (#125849)
Fully manual rebase the code of PR: https://github.com/pytorch/pytorch/pull/124045
The old PR seems crashed due to too many commits, and too many times rebase. Please reference: https://github.com/pytorch/pytorch/pull/124045#issuecomment-2103744588

-------
It is the first step of RFC https://github.com/pytorch/pytorch/issues/124245.
Changes:
1. Add cpp builder code, the new cpp_builder support Windows OS.
2. Add CPU ISA checker which is cross OS and exported from backend cpuinfo.
3. Switch compiler ISA checker to new cpp builder.
4. CppCodeCache use the new ISA checker.
5. Add temprary `test_new_cpp_build_logical` UT to help on transfer to new code.
<img width="1853" alt="Image" src="https://github.com/pytorch/pytorch/assets/8433590/ce6519ab-ba92-4204-b1d6-7d15d2ba2cbe">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125849
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-06-07 20:49:58 +00:00
Shuqiang Zhang
30788739f4 [c10d] add a simple test to demonstrate the user usage of collectives (#127665)
Summary:
Just play around the UT and think it would be good to give an simple
example of user function which can be used for different subclasses of
_ControlCollectives, and test the user function can be executed
correctly

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127665
Approved by: https://github.com/d4l3k
2024-06-05 04:32:11 +00:00
Mikayla Gawarecki
a135776307 Remove tensor subclass detection logic from weights_only unpickler (#127808)
Remove logic to auto-detect and allow subclasses that did not override certain methods from the weights_only unpickler from https://github.com/pytorch/pytorch/pull/124331 for 2.4 release

Subclasses should be loadable using `torch.serialization.add_safe_globals`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127808
Approved by: https://github.com/malfet
2024-06-05 02:14:30 +00:00
Tristan Rice
597922ba21 Reapply "distributed debug handlers (#126601)" (#127805)
This reverts commit 7646825c3e.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127805
Approved by: https://github.com/PaliC
2024-06-04 19:44:30 +00:00
Shan19900305
3bcc3cddb5 Using scalarType instead string in function _group_tensors_by_device_and_dtype. (#127869)
Now torch.dtype can pass through pybind11, so modify function _group_tensors_by_device_and_dtype to using scalar type. And without convert torch.dtype and string in python and c++ side.
@ezyang @bdhirsh
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127869
Approved by: https://github.com/ezyang
2024-06-04 18:19:33 +00:00
Jeff Daily
0e7bd7fedd [ROCm] TunableOp improvements (#124362)
- use less memory; smaller default hipblaslt workspace size
- options to avoid cache effects
  - icache flush option
  - rotating buffers during tuning
- python APIs
- unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124362
Approved by: https://github.com/xw285cornell
2024-06-03 22:30:11 +00:00
PyTorch MergeBot
7646825c3e Revert "distributed debug handlers (#126601)"
This reverts commit 3d541835d5.

Reverted https://github.com/pytorch/pytorch/pull/126601 on behalf of https://github.com/PaliC due to breaking internal typechecking tests ([comment](https://github.com/pytorch/pytorch/pull/126601#issuecomment-2141076987))
2024-05-31 01:21:24 +00:00
Tristan Rice
3d541835d5 distributed debug handlers (#126601)
This adds debug handlers as described in:
* https://gist.github.com/d4l3k/828b7be585c7615e85b2c448b308d925 (public copy)
* https://docs.google.com/document/d/1la68szcS6wUYElUUX-P6zXgkPA8lnfzpagMTPys3aQ8/edit (internal copy)

This is only adding the C++ pieces that will be used from the main process. The Python and torchrun pieces will be added in a follow up PR.

This adds 2 handlers out of the box:

* `/handler/ping` for testing purposes
* `/handler/dump_nccl_trace_pickle` as a POC integration with Flight Recorder

Test plan:

```
python test/distributed/elastic/test_control_plane.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126601
Approved by: https://github.com/kurman, https://github.com/c-p-i-o
2024-05-30 02:21:08 +00:00
Xuehai Pan
ba3b05fdf3 [1/N][Easy] fix typo for usort config in pyproject.toml (kown -> known): sort stdlib (#127122)
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo make `usort` do more and generate the changes in the PR. Except `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127122
Approved by: https://github.com/kit1980
2024-05-25 08:25:50 +00:00
Yifu Wang
4a09117d16 Introduce ProcessGroupCudaP2P (#122163)
## Context
This stack prototypes automatic micro-pipelining of `all-gather -> matmul` and `matmul -> reduce-scatter` via Inductor. The idea originates from the paper [Overlap Communication with Dependent Computation via
Decomposition in Large Deep Learning Models](https://dl.acm.org/doi/pdf/10.1145/3567955.3567959). The implementation and some key optimizations are heavily influenced by @lw's implementation in xformers.

The stack contains several components:
- `ProcessGroupCudaP2P` - a thin wrapper around `ProcessGroupNCCL`. It in addition maintains a P2P workspace that enables SM-free, one-sided P2P communication which is needed for optimal micro-pipelining.
- `fused_all_gather_matmul` and `fused_matmul_reduce_scatter` dispatcher ops.
- Post-grad fx pass that detects `all-gather -> matmul` and `matmul -> reduce-scatter` and replaces them with the fused dispatcher ops.

To enable the prototype feature:
- Set the distributed backend to `cuda_p2p`.
- Set `torch._inductor.config._micro_pipeline_tp` to `True`.

*NOTE: the prototype sets nothing in stone w.r.t to each component's design. The purpose is to have a performant baseline with reasonable design on which each component can be further improved.*

## Benchmark
Setup:
- 8 x H100 (500W) + 3rd gen NVSwitch.
- Llama3 8B training w/ torchtitan.
- 8-way TP. Reduced the number of layers from 32 to 8 for benchmarking purpose.

Trace (baseline): https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmpjaz8zgx0
<img width="832" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/4addba77-5abc-4d2e-93ea-f68078587fe1">

Trace (w/ micro pipelining): https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmpn073b4wn
<img width="963" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/4f44e78d-8196-43ab-a1ea-27390f07e9d2">

## This PR
`ProcessGroupCudaP2P` is a thin wrapper around `ProcessGroupNCCL`. By default, it routes all collectives to the underlying `ProcessGroupNCCL`. In addition, `ProcessGroupCudaP2P` initializes a P2P workspace that allows direct GPU memory access among the members. The workspace can be used in Python to optimize intra-node communication patterns or to create custom intra-node collectives in CUDA.

`ProcessGroupCudaP2P` aims to bridge the gap where certain important patterns can be better optimized via fine-grained P2P memory access than with collectives in the latest version of NCCL. It is meant to complement NCCL rather than replacing it.
Usage:
```
    # Using ProcessGroupCudaP2P
    dist.init_process_group(backend="cuda_p2p", ...)

    # Using ProcessGroupCudaP2P while specifying ProcessGroupCudaP2P.Options
    pg_options = ProcessGroupCudaP2P.Options()
    dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...)

    # Using ProcessGroupCudaP2P while specifying ProcessGroupNCCL.Options
    pg_options = ProcessGroupNCCL.Options()
    dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...)

    # Using ProcessGroupCudaP2P while specifying both
    # ProcessGroupCudaP2P.Options and ProcessGroupNCCL.Options
    pg_options = ProcessGroupCudaP2P.Options()
    pg_options.nccl_options = ProcessGroupNCCL.Options()
    dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...)

    # Down-casting the backend to access p2p buffers for cuda_p2p specific
    # optimizations
    if is_cuda_p2p_group(group):
        backend = get_cuda_p2p_backend(group)
        if required_p2p_buffer_size > backend.get_buffer_size():
            # fallback
        p2p_buffer = backend.get_p2p_buffer(...)
    else:
        # fallback
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122163
Approved by: https://github.com/wanchaol
2024-05-24 18:33:18 +00:00
Aaron Orenstein
70dc59c55f Fix perf regression caused by #122074 (#126996)
The original change was about 9.5% slower than then before #122074 .
This improves it to be only about 1.4% slower.

Also touched up some unrelated nits that the linter complained about.

Fixes #126293

Ran torchbench 3 times on each change. Perf values before (stable), after (fix),
and with #122074 backed out (backout):
```
../inductor-tools/scripts/modelbench/inductor_single_run.sh single inference performance torchbench pyhpc_isoneutral_mixing amp first dynamic cpp
stable:
43.948x
45.754x
44.906x

fix:
47.505x
49.987x
47.493x

backout:
48.243x
48.199x
48.192x

../inductor-tools/scripts/modelbench/inductor_single_run.sh single inference performance torchbench pyhpc_equation_of_state amp first static default
stable:
15.224x
13.286x
15.354x

fix:
16.402x
16.370x
16.183x

backout:
16.554x
16.675x
16.787x

../inductor-tools/scripts/modelbench/inductor_single_run.sh single inference performance torchbench lennard_jones float32 first static default
stable:
1.712x
1.651x
1.640x

fix:
1.804x
1.798x
1.792x

backout:
1.864x
1.824x
1.836x
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126996
Approved by: https://github.com/jansel
2024-05-24 04:27:22 +00:00
PyTorch MergeBot
1b29c16e5e Revert "Introduce ProcessGroupCudaP2P (#122163)"
This reverts commit 2dd2699860.

Reverted https://github.com/pytorch/pytorch/pull/122163 on behalf of https://github.com/jithunnair-amd due to This is breaking ROCm distributed CI on trunk ([comment](https://github.com/pytorch/pytorch/pull/122163#issuecomment-2127518473))
2024-05-23 16:06:14 +00:00
Matthew Hoffman
86ad101370 Enable pickling torch._C.Generator (#126271)
Fixes #71398

Add `__reduce__` and `__setstate__` methods for `torch._C.Generator`.

`__reduce__` returns a tuple of 3 values:

1. `torch.Generator` itself.
2. A one-element tuple containing the `torch.device` to create the `Generator` with, since this cannot be changed after the object is created.
3. The state, a three-element tuple: the initial seed, the offset (or `None` if a CPU `Generator`), and the RNG state tensor.

`__setstate__` calls `manual_seed`, `set_offset` (if not `None`), and `set_state` on each respective element of the state.

Added test demonstrating successful reserialization with cpu and cuda `Generator`s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126271
Approved by: https://github.com/ezyang
2024-05-22 14:38:47 +00:00
Yifu Wang
2dd2699860 Introduce ProcessGroupCudaP2P (#122163)
## Context
This stack prototypes automatic micro-pipelining of `all-gather -> matmul` and `matmul -> reduce-scatter` via Inductor. The idea originates from the paper [Overlap Communication with Dependent Computation via
Decomposition in Large Deep Learning Models](https://dl.acm.org/doi/pdf/10.1145/3567955.3567959). The implementation and some key optimizations are heavily influenced by @lw's implementation in xformers.

The stack contains several components:
- `ProcessGroupCudaP2P` - a thin wrapper around `ProcessGroupNCCL`. It in addition maintains a P2P workspace that enables SM-free, one-sided P2P communication which is needed for optimal micro-pipelining.
- `fused_all_gather_matmul` and `fused_matmul_reduce_scatter` dispatcher ops.
- Post-grad fx pass that detects `all-gather -> matmul` and `matmul -> reduce-scatter` and replaces them with the fused dispatcher ops.

To enable the prototype feature:
- Set the distributed backend to `cuda_p2p`.
- Set `torch._inductor.config._micro_pipeline_tp` to `True`.

*NOTE: the prototype sets nothing in stone w.r.t to each component's design. The purpose is to have a performant baseline with reasonable design on which each component can be further improved.*

## Benchmark
Setup:
- 8 x H100 (500W) + 3rd gen NVSwitch.
- Llama3 8B training w/ torchtitan.
- 8-way TP. Reduced the number of layers from 32 to 8 for benchmarking purpose.

Trace (baseline): https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmpjaz8zgx0
<img width="832" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/4addba77-5abc-4d2e-93ea-f68078587fe1">

Trace (w/ micro pipelining): https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmpn073b4wn
<img width="963" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/4f44e78d-8196-43ab-a1ea-27390f07e9d2">

## This PR
`ProcessGroupCudaP2P` is a thin wrapper around `ProcessGroupNCCL`. By default, it routes all collectives to the underlying `ProcessGroupNCCL`. In addition, `ProcessGroupCudaP2P` initializes a P2P workspace that allows direct GPU memory access among the members. The workspace can be used in Python to optimize intra-node communication patterns or to create custom intra-node collectives in CUDA.

`ProcessGroupCudaP2P` aims to bridge the gap where certain important patterns can be better optimized via fine-grained P2P memory access than with collectives in the latest version of NCCL. It is meant to complement NCCL rather than replacing it.
Usage:
```
    # Using ProcessGroupCudaP2P
    dist.init_process_group(backend="cuda_p2p", ...)

    # Using ProcessGroupCudaP2P while specifying ProcessGroupCudaP2P.Options
    pg_options = ProcessGroupCudaP2P.Options()
    dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...)

    # Using ProcessGroupCudaP2P while specifying ProcessGroupNCCL.Options
    pg_options = ProcessGroupNCCL.Options()
    dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...)

    # Using ProcessGroupCudaP2P while specifying both
    # ProcessGroupCudaP2P.Options and ProcessGroupNCCL.Options
    pg_options = ProcessGroupCudaP2P.Options()
    pg_options.nccl_options = ProcessGroupNCCL.Options()
    dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...)

    # Down-casting the backend to access p2p buffers for cuda_p2p specific
    # optimizations
    if is_cuda_p2p_group(group):
        backend = get_cuda_p2p_backend(group)
        if required_p2p_buffer_size > backend.get_buffer_size():
            # fallback
        p2p_buffer = backend.get_p2p_buffer(...)
    else:
        # fallback
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122163
Approved by: https://github.com/wanchaol
2024-05-22 09:33:05 +00:00
Tristan Rice
ac51920656 Reapply "c10d: add Collectives abstraction (#125978)" (#126695)
This reverts commit d9c3485146.

Reapplies #125978.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126695
Approved by: https://github.com/c-p-i-o
2024-05-21 18:00:09 +00:00
PyTorch MergeBot
d9c3485146 Revert "c10d: add Collectives abstraction (#125978)"
This reverts commit 4b2ae2ac33.

Reverted https://github.com/pytorch/pytorch/pull/125978 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/125978#issuecomment-2119858015))
2024-05-20 07:40:41 +00:00
Simon Fan
be67985bd7 [compiled autograd] log in cpp using python logger (#126483)
Internal infra may not preserve python and c++ log ordering e.g. MAST logs: https://fburl.com/mlhub/38576cxn, all the `[python_compiled_autograd.cpp] Creating cache entry [...]` logs of the entire run are at the beginning of the file

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126483
Approved by: https://github.com/jansel
ghstack dependencies: #126144, #126146, #126148
2024-05-19 23:49:52 +00:00
Mikayla Gawarecki
66dc8fb7ff Allow tensor subclasses and add torch.serialization.add_safe_globals that allows users to allowlist classes for weights_only load (#124331)
#### Conditions for allowlisting tensor subclasses
We allow tensor subclasses types that
(1) Do not override `__setstate__`, `__getattr__`, `__setattr__`, `__get__`, `__set__` or `__getattribute__` of `torch.Tensor` (`torch.Tensor` does not have a definition of `__getattr__`, `__get__` or `__set__` so we check that these are `None`)
(2) Use the generic `tp_alloc`
(3) Are in a module that *has been imported by the user*
to be pushed onto the stack as strings by `GLOBAL` instructions, while storing the type in a dict

The strings will be converted to the classes as appropriate when executing `REBUILD` with `_rebuild_from_type_v2`

*Note that we use `inspect.getattr_static(sys.modules[module], name)` to get the class/function as this method claims to have no code execution.

The rationale for the 3 conditions above is as follows:

The rebuild func provided by `Tensor.__reduce_ex__` is `torch._tensor._rebuild_from_type_v2`, which is defined as such (note the call to `getattr`, `Tensor.__setstate__` and the call to `as_subclass` as well as the call to `_set_obj_state` which calls `setattr`)

4e66aaa010/torch/_tensor.py (L57-L71)

`as_subclass` is implemented with a call to `THPVariable_NewWithVar`

that will eventually call `tp_alloc` here
4e66aaa010/torch/csrc/autograd/python_variable.cpp (L2053)

The `func` arg to `_rebuild_from_type_v2` for wrapper subclasses is `Tensor.rebuild_wrapper_subclass`, which will similarly call into `THPVariable_NewWithVar` and hit the above `tp_alloc`

**Note that we do not call `tp_init` or `tp_new` (i.e. `cls.__init__` or `cls.__new__`) when unpickling**

### How do we check something is a tensor subclass/constraints around imports

In order to check whether `bla` is a tensor subclass in the bytecode `GLOBAL module.name`, we need to do an `issubclass` check, which entails converting the global string to the appropriate type. We *do not* arbitrarily import modules but will perform this check as long as the given subclass (given by `module.name`) has already been imported by the user (i.e. `module in sys.modules` and `issubclass(getattr(sys[modules], name), torch.Tensor)`

This PR also allowlisted  `torch._utils._rebuild_wrapper_subclass` and `torch.device` (used by `_rebuild_wrapper_subclass`)

### API for allow listing
This PR also added `torch.serialization.{add/get/clear}_safe_globals` that enables user to allowlist globals they have deemed safe and manipulate this list (for example they could allowlist a tensor subclass with a custom `__setstate__` if they have checked that this is safe).

Next steps:
- Add testing and allowlist required classes for all in-core tensor subclasses (e.g. `DTensor`, `FakeTensor` etc.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124331
Approved by: https://github.com/albanD
2024-05-17 17:56:57 +00:00
Tristan Rice
4b2ae2ac33 c10d: add Collectives abstraction (#125978)
This adds a new `Collectives` API for doing distributed collectives operations. This is intended to replace the [current Elastic store abstraction](https://github.com/pytorch/pytorch/blob/main/torch/distributed/elastic/utils/store.py) with more performant and debugable primitives.

Design doc: https://docs.google.com/document/d/147KcKJXEHvk1Q6tISLbJVvLejHg_1kIhBQeu-8RQxhY/edit

The standard implementation is using `StoreCollectives` but other more performant backends will be added in a follow up PR.

Test plan:

```
python test/distributed/test_collectives.py -v
```

This tests both functionality using multiple threads as well as timeout behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125978
Approved by: https://github.com/shuqiangzhang
2024-05-17 05:09:11 +00:00
Gustav Larsson
52fad83335 [onnx.export] Avoid linear look up in env for exist_in_env (#124909)
This PR is part of a series of PRs to significantly speed up torch.onnx.export for models with many nodes (e.g. LLM). See #121422 for more analysis.

- As part of torch.onnx.export, a reverse look-up is made in env. This is done for each node, and this look-up costs in proportional to the graph size, which incurs and overall O(N^2) time complexity.
- A pragmatic solution is simply to keep a separate data structure to make this de facto constant time. So, this introduces a set containing all the values of env. Open to other ideas. Ideally `exist_in_env` wouldn't be needed at all, but to preserve current behavior exactly I'm not sure how that can be done.
- Resolves (4) in #121422.
- This code change and the choice of py::set looks a bit more natural on top of #123063, where the env is changed from a std::unordered_map to a py::dict.

Partially fixes #121422
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124909
Approved by: https://github.com/srikris-sridhar, https://github.com/justinchuby
2024-05-09 22:38:00 +00:00
PyTorch MergeBot
6fd745255e Revert "add uuid in cudaDeviceProperties (#125083)"
This reverts commit 3f36145db2.

Reverted https://github.com/pytorch/pytorch/pull/125083 on behalf of https://github.com/izaitsevfb due to Fails internal builds with: no member named 'uuid' in 'hipDeviceProp_t' ([comment](https://github.com/pytorch/pytorch/pull/125083#issuecomment-2103315320))
2024-05-09 19:52:45 +00:00
Jeff Daily
3f36145db2 add uuid in cudaDeviceProperties (#125083)
Replaces #99967.

Fixes #99903.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125083
Approved by: https://github.com/pruthvistony, https://github.com/albanD, https://github.com/eqy
2024-05-08 19:15:55 +00:00
PyTorch MergeBot
2e237fcd70 Revert "[inductor] add cpp builder code. (#124045)"
This reverts commit 469383755f.

Reverted https://github.com/pytorch/pytorch/pull/124045 on behalf of https://github.com/clee2000 due to broke inductor/test_codecache and inductor/test_max_autotune 469383755f https://github.com/pytorch/pytorch/actions/runs/8996772350/job/24724775182 ([comment](https://github.com/pytorch/pytorch/pull/124045#issuecomment-2100851419))
2024-05-08 15:33:20 +00:00
Xu Han
469383755f [inductor] add cpp builder code. (#124045)
Previous full PR https://github.com/pytorch/pytorch/pull/115248 is failed to merge due to fb_code is hard to debug.
I also tried to submit them as two pieces, https://github.com/pytorch/pytorch/pull/118514 https://github.com/pytorch/pytorch/pull/118515. And they have passed PreCI at that time.

Now I tried to split https://github.com/pytorch/pytorch/pull/115248 into smaller piece, and it is the first step of RFC https://github.com/pytorch/pytorch/issues/124245.
Changes:
1. Add cpp builder code, the new cpp_builder support Windows OS.
2. Add CPU ISA checker which is cross OS and exported from backend cpuinfo.
3. Switch compiler ISA checker to new cpp builder.
4. CppCodeCache use the new ISA checker.
5. Add temprary `test_new_cpp_build_logical` UT to help on transfer to new code.
<img width="1853" alt="Image" src="https://github.com/pytorch/pytorch/assets/8433590/ce6519ab-ba92-4204-b1d6-7d15d2ba2cbe">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124045
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-05-08 05:27:15 +00:00
PyTorch MergeBot
2f79a18324 Revert "[inductor] add cpp builder code. (#124045)"
This reverts commit 7864d287a1.

Reverted https://github.com/pytorch/pytorch/pull/124045 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing trunk jobs 7864d287a1 including lint ([comment](https://github.com/pytorch/pytorch/pull/124045#issuecomment-2099306071))
2024-05-07 21:04:49 +00:00
Xu Han
7864d287a1 [inductor] add cpp builder code. (#124045)
Previous full PR https://github.com/pytorch/pytorch/pull/115248 is failed to merge due to fb_code is hard to debug.
I also tried to submit them as two pieces, https://github.com/pytorch/pytorch/pull/118514 https://github.com/pytorch/pytorch/pull/118515. And they have passed PreCI at that time.

Now I tried to split https://github.com/pytorch/pytorch/pull/115248 into smaller piece, and it is the first step of RFC https://github.com/pytorch/pytorch/issues/124245.
Changes:
1. Add cpp builder code, the new cpp_builder support Windows OS.
2. Add CPU ISA checker which is cross OS and exported from backend cpuinfo.
3. Switch compiler ISA checker to new cpp builder.
4. CppCodeCache use the new ISA checker.
5. Add temprary `test_new_cpp_build_logical` UT to help on transfer to new code.
<img width="1853" alt="Image" src="https://github.com/pytorch/pytorch/assets/8433590/ce6519ab-ba92-4204-b1d6-7d15d2ba2cbe">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124045
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-05-07 20:07:41 +00:00
Aaron Orenstein
b23b6e7108 Ensure that vmap is restored properly if an exception is thrown during frame eval (#122074)
We save and restore the DynamicLayerStack during frame eval but since fx graph has no way to express a try/finally we just assume it will happen. If we throw an exception between the push and pop to the stack then we're left in a state that affects following operations poorly.  Make sure that if it's in a bad state we restore it after frame eval.

Repro:
before:
```
$ rm test/dynamo_skips/TestSparseCPU.test_log1p_cpu_uint8
$ rm test/dynamo_expected_failures/FuncTorchHigherOrderOpTests.test_vmap_free_tensor
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest test/jit/test_sparse.py test/dynamo/test_dynamic_shapes.py test/inductor/test_torchinductor_dynamic_shapes.py test/test_sparse.py -k 'test_log1p_cpu_uint8'
============= 1 passed, 8588 deselected in 9.75s =============
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest test/jit/test_sparse.py test/dynamo/test_dynamic_shapes.py test/inductor/test_torchinductor_dynamic_shapes.py test/test_sparse.py -k
'test_vmap_free_tensor_dynamic_shapes or test_log1p_cpu_uint8'
================== short test summary info ===================
FAILED [0.0632s] test/test_sparse.py::TestSparseCPU::test_log1p_cpu_uint8 - AssertionError: "only Tensors of floating point dtype can require gradients"
does not match "You are attempting to call Tensor.requires_grad_() (or perhaps using torch.autograd.functional.* APIs) inside of a function ...
======= 1 failed, 1 skipped, 8587 deselected in 10.99s =======
```
(Note that adding test_vmap_free_tensor_dynamic_shapes causes test_vmap_free_tensor_dynamic_shapes to fail)
after:
```
$ rm test/dynamo_skips/TestSparseCPU.test_log1p_cpu_uint8
$ rm test/dynamo_expected_failures/FuncTorchHigherOrderOpTests.test_vmap_free_tensor
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest test/jit/test_sparse.py test/dynamo/test_dynamic_shapes.py test/inductor/test_torchinductor_dynamic_shapes.py test/test_sparse.py -k 'test_log1p_cpu_uint8'
============= 1 passed, 8588 deselected in 9.89s =============
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest test/jit/test_sparse.py test/dynamo/test_dynamic_shapes.py test/inductor/test_torchinductor_dynamic_shapes.py test/test_sparse.py -k
'test_vmap_free_tensor_dynamic_shapes or test_log1p_cpu_uint8'
======= 1 passed, 1 skipped, 8587 deselected in 11.34s =======
```
(test_vmap_free_tensor_dynamic_shapes passes either way)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122074
Approved by: https://github.com/oulgen
2024-05-07 19:36:52 +00:00
PyTorch MergeBot
5fd0b6e5f7 Revert "add uuid in cudaDeviceProperties (#125083)"
This reverts commit f35fe4eaf1.

Reverted https://github.com/pytorch/pytorch/pull/125083 on behalf of https://github.com/clee2000 due to test_uuid is flaky.  ex https://github.com/pytorch/pytorch/actions/runs/8988855916/job/24692369523 https://hud.pytorch.org/flakytest?name=test_uuid&suite=TestCuda&file=%25&limit=300 ([comment](https://github.com/pytorch/pytorch/pull/125083#issuecomment-2099029993))
2024-05-07 18:16:27 +00:00
Jeff Daily
f35fe4eaf1 add uuid in cudaDeviceProperties (#125083)
Replaces #99967.

Fixes #99903.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125083
Approved by: https://github.com/pruthvistony, https://github.com/albanD, https://github.com/eqy
2024-05-07 01:26:01 +00:00
Randolf Scholz
ccaf03fd89 Fix: nn.Parameter return type identified as Tensor instead of nn.Parameter (#125106)
Fixes #125105

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125106
Approved by: https://github.com/ezyang, https://github.com/albanD
2024-04-29 23:25:23 +00:00
Simon Fan
43a7ab2a21 [compiled autograd] introduce verbose logs, add autograd node info to graph (#124954)
- sets it as a fake stack trace as we don't have a generic comment feature
- when verbose is disabled, still adds a contextmanager and flag checks. the alternative is to use MACROS, but that wouldn't be usable with TORCH_LOGS

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124954
Approved by: https://github.com/jansel
2024-04-27 01:10:37 +00:00
egienvalue
73744a2c00 torch.mtia module for MTIA device backend (#123612)
MTIA device has its own Module in PyTorch now.
torch.mtia has following APIs similar to other backends. The lazy_init is also supported.
```
__all__ = [
    "init",
    "is_available",
    "synchronize",
    "device_count",
    "current_device",
    "current_stream",
    "default_stream",
    "set_stream",
    "stream",
    "device",
]

```
------------
For device management. We expand AccleratorHooksInterface to support generic device management and it can be used in both C++ and PyThon.
```
def _accelerator_hooks_device_count() -> _int: ...
def _accelerator_hooks_set_current_device(device_index: _int) -> None: ...
def _accelerator_hooks_get_current_device() -> _int : ...
def _accelerator_hooks_exchange_device(device_index: _int) -> _int : ...
def _accelerator_hooks_maybe_exchange_device(device_index: _int) -> _int : ...
```

---------
Adding get_device_module API to retrieve device modules for different device types.
```
def get_device_module(device: Optional[Union[torch.device, str]] = None)
```
---------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123612
Approved by: https://github.com/albanD
ghstack dependencies: #123611
2024-04-26 16:17:54 +00:00
Aaron Orenstein
609c958281 Fix mypy issues in fake_tensor.py (#124428)
fake_tensor.py had mypy error ignored. That seems less than desirable.

Also added SafePyObjectT<T> which is a tagged wrapper around a SafePyObject but provides static type checking (with no other guarantees).

Used `SafePyObjectT<TorchDispatchModeKey>` on some of the TorchDispatchModeTLS API to ensure that we don't accidentally inject a different type than expected into the stack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124428
Approved by: https://github.com/malfet
2024-04-26 15:35:53 +00:00
PyTorch MergeBot
f131c2c199 Revert "Fix mypy issues in fake_tensor.py (#124428)"
This reverts commit 25c0d3f3f0.

Reverted https://github.com/pytorch/pytorch/pull/124428 on behalf of https://github.com/jeanschmidt due to Unfortunately, I needed to revert #123735 and this one depends on it. So please check if there are no merge conflicts or breakages and feel free to merge this PR again ([comment](https://github.com/pytorch/pytorch/pull/124428#issuecomment-2078699836))
2024-04-26 06:15:17 +00:00
PyTorch MergeBot
e04c7b19f4 Revert "torch.mtia module for MTIA device backend (#123612)"
This reverts commit 381653de63.

Reverted https://github.com/pytorch/pytorch/pull/123612 on behalf of https://github.com/jeffdaily due to this PR broke ROCm with message RuntimeError: Cannot have MTIA with other devices ([comment](https://github.com/pytorch/pytorch/pull/123612#issuecomment-2077649762))
2024-04-25 16:06:46 +00:00
Yu, Guangye
cdc66e9dc3 refactor autocast python APIs (#124479)
# Motivation
Refactor autocast usage scenario in `torch/amp/autocast_mode.py` and `torch/utils/checkpoint.py` to fix the bug - convention conflict between `torch.xxx.get_autocast_xxx_dtype` defined in `autocast_mode.py` and `torch.xxx.get_autocast_dtype` defined in `checkpoint.py`.

# Solution
Use device-agnostic APIs like `torch.get_autocast_dtype`, ..., instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124479
Approved by: https://github.com/jgong5, https://github.com/gujinghui, https://github.com/EikanWang, https://github.com/albanD
ghstack dependencies: #124359
2024-04-25 14:33:33 +00:00
Aaron Orenstein
25c0d3f3f0 Fix mypy issues in fake_tensor.py (#124428)
fake_tensor.py had mypy error ignored. That seems less than desirable.

Also added SafePyObjectT<T> which is a tagged wrapper around a SafePyObject but provides static type checking (with no other guarantees).

Used `SafePyObjectT<TorchDispatchModeKey>` on some of the TorchDispatchModeTLS API to ensure that we don't accidentally inject a different type than expected into the stack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124428
Approved by: https://github.com/malfet
2024-04-25 14:07:53 +00:00
Animesh Jain
e68d65dae2 [dynamo][cpp-guards] Differentiate dict guards wrt to guarding on key order (#124779)
We guard on key order
1) When a key is a non-constant object
2) When we actually need key order - like .values, .items etc

For dicts/OrderedDicts that do not require key order guarding, we just rely on usual `GuardManger + DictGetItemGuardAccessor`. This is faster than going through the `list(d.keys())` based design for OrderedDicts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124779
Approved by: https://github.com/jansel
2024-04-25 08:20:35 +00:00
egienvalue
381653de63 torch.mtia module for MTIA device backend (#123612)
MTIA device has its own Module in PyTorch now.
torch.mtia has following APIs similar to other backends. The lazy_init is also supported.
```
__all__ = [
    "init",
    "is_available",
    "synchronize",
    "device_count",
    "current_device",
    "current_stream",
    "default_stream",
    "set_stream",
    "stream",
    "device",
]

```
------------
For device management. We expand AccleratorHooksInterface to support generic device management and it can be used in both C++ and PyThon.
```
def _accelerator_hooks_device_count() -> _int: ...
def _accelerator_hooks_set_current_device(device_index: _int) -> None: ...
def _accelerator_hooks_get_current_device() -> _int : ...
def _accelerator_hooks_exchange_device(device_index: _int) -> _int : ...
def _accelerator_hooks_maybe_exchange_device(device_index: _int) -> _int : ...
```

---------
Adding get_device_module API to retrieve device modules for different device types.
```
def get_device_module(device: Optional[Union[torch.device, str]] = None)
```
---------

Differential Revision: [D56443356](https://our.internmc.facebook.com/intern/diff/D56443356)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123612
Approved by: https://github.com/albanD
ghstack dependencies: #123611
2024-04-24 20:51:20 +00:00
egienvalue
408aa0182c Build device generic torch.Stream and torch.Event based on c10::Stream/Event (#123611)
This diff intends to build device generic torch.Stream and torch.Event for newly added accelerators in PyTorch.
------------
**torch.Stream APIs**
```
# Defined in torch/csrc/Stream.cpp
class Stream(_StreamBase):
    stream_id: _int  # Stream id
    device_index: _int
    device_type: _int

    device: _device  # The device of the stream

    @overload
    def __new__(self, device: Optional[DeviceLikeType] = None, priority: _int = 0) -> Stream: ...
    @overload
    def __new__(self, stream_id: _int, device_index: _int, device_type: _int, priority: _int = 0) -> Stream: ...
    def wait_event(self, event: Event) -> None: ...
    def wait_stream(self, other: Stream) -> None: ...
    def record_event(self, event: Optional[Event] = None) -> Event: ...
    def query(self) -> None: ...
    def synchronize(self) -> None: ...
    def __hash__(self) -> _int: ...
    def __repr__(self) -> str: ...
    def __eq__(self, other: object) -> _bool: ...
```
------------------
**torch.Event APIs**:
- IPC related APIs are not implemented, since many device backends don't support it, but we leave interfaces there for future adaption of torch.cuda.Stream.
- currently only the enable_timing is supported, since it is the most common one used in other device backends. We have to refactor the event flag system in PyTorch to support more fancy flag.
- elapsedTime API is added to c10::Event

```
# Defined in torch/csrc/Event.cpp
class Event(_EventBase):

    device: _device  # The device of the Event
    event_id: _int # The raw event created by device backend

    def __new__(self,
        device: Optional[DeviceLikeType] = None,
        enable_timing: _bool = False,
        blocking: _bool = False,
        interprocess: _bool = False) -> Event: ...
    @classmethod
    def from_ipc_handle(self, device: DeviceLikeType, ipc_handle: bytes) -> Event: ...
    def record(self, stream: Optional[Stream] = None) -> None: ...
    def wait(self, stream: Optional[Stream] = None) -> None: ...
    def query(self) -> _bool: ...
    def elapsed_time(self, other: Event) -> _float: ...
    def synchronize(self) -> None: ...
    def ipc_handle(self) -> bytes: ...
    def __repr__(self) -> str: ...
```

-----------

c10::Event provides new APIs
- calculate **elapsedTime**.
- Get raw event id
- Synchronize event.

```
  double elapsedTime(const Event& event) const {
    return impl_.elapsedTime(event.impl_);
  }

  void* eventId() const {
    return impl_.eventId();
  }

  void synchronize() const {
    return impl_.synchronize();
  }
```
----------
TODO: need to find a good way to test them in PyTorch with API mocks.

Differential Revision: [D56443357](https://our.internmc.facebook.com/intern/diff/D56443357)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123611
Approved by: https://github.com/albanD, https://github.com/jeffdaily
2024-04-24 20:51:17 +00:00
Yu, Guangye
25f321b84f Refactor autocast C++ APIs to be device-agnostic (#124359)
# Motivation
This PR aims to refactor autocast **C++** APIs to be device-agnostic and deprecate the device-specific autocast  **C++** APIs.
In C++ side,
- `is_enabled()` -> `is_enabled(device_type)`.
- `set_enabled(new_enabled)` -> `set_enabled(device_type, new_enabled)`.
- `get_autocast_dtype()` -> `get_autocast_dtype(device_type)`
- `set_autocast_dtype(dtype)` -> `set_autocast_dtype(device_type, dtype)`

These following C++ APIs are deprecated and should be removed in PyTorch 2.5
- `is_cpu_enabled`
- `set_cpu_enabled`
- `get_autocast_cpu_dtype`
- `set_autocast_cpu_dtype`
- `is_xpu_enabled`
- `set_xpu_enabled`
- `get_autocast_xpu_dtype`
- `set_autocast_xpu_dtype`
- `is_ipu_enabled`
- `set_ipu_enabled`
- `get_autocast_ipu_dtype`
- `set_autocast_ipu_dtype`
- `is_hpu_enabled`
- `set_hpu_enabled`
- `get_autocast_hpu_dtype`
- `set_autocast_hpu_dtype`
- `is_xla_enabled`
- `set_xla_enabled`
- `get_autocast_xla_dtype`
- `set_autocast_xla_dtype`
- `is_privateuseone_enabled`
- `set_privateuseone_enabled`
- `get_autocast_privateuseone_dtype`
- `set_autocast_privateuseone_dtype`

In Python side,
provide 4 generic autocast APIs:
- `torch.is_autocast_enabled(device_type)`
- `torch.set_autocast_enabled(device_type, new_enabled)`
- `torch.get_autocast_dtype(device_type)`
- `torch.set_autocast_dtype(device_type, dtype)`

# Additional Context
We will submit another PR to refactor autocast **Python** APIs based on this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124359
Approved by: https://github.com/jgong5, https://github.com/albanD
2024-04-23 10:38:50 +00:00
Ashwin Hari
5f5778476a rename ort to maia (#123265)
Fixes #123264

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123265
Approved by: https://github.com/albanD
2024-04-23 00:33:25 +00:00
Jeff Daily
6ede882c0b preferred blas library; cublaslt gemm implementation (#122106)
Following the example of PyTorch supporting a preferred Linalg library (cusolver or magma), this PR introduces a preferred blas library selector of either cublas or cublaslt for CUDA and hipblas or hipblaslt for ROCm via normal hipification of sources.

The default blas implementation remains cublas or hipblas.  cublaslt or hipblaslt can be enabled using environment variable TORCH_BLAS_PREFER_CUBLASLT=1 (or TORCH_BLAS_PREFER_HIPBLASLT=1 as an alias) or by calling `torch.backends.cuda.preferred_blas_library(backend="cublaslt")` or as an alias `backend="hipblaslt"`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122106
Approved by: https://github.com/lezcano
2024-04-22 15:38:22 +00:00
Chen, Zejun
b1984237a0 [Profiler] Unify the device(CUDA, XPU, PrivateUse1) in torch profiler post processing (#123247)
This PR unifies the CUDA, XPU and PrivateUse1 in the torch profiler. Now CUDA, XPU and PrivateUse1 can together use string object `use_device` to distinguish each other and share one device path for calculating kineto time durations and memory statistics for post processing.

#suppress-api-compatibility-check

Co-authored-by: Aaron Enye Shi <enye.shi@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123247
Approved by: https://github.com/aaronenyeshi
2024-04-22 01:26:55 +00:00
PyTorch MergeBot
0feab7d6c3 Revert "Build device generic torch.Stream and torch.Event based on c10::Stream/Event (#123611)"
This reverts commit cb17721899.

Reverted https://github.com/pytorch/pytorch/pull/123611 on behalf of https://github.com/jeffdaily due to This broke ROCm. see test_overrides.py ([comment](https://github.com/pytorch/pytorch/pull/123611#issuecomment-2067363780))
2024-04-19 22:44:26 +00:00
PyTorch MergeBot
929242a15c Revert "torch.mtia module for MTIA device backend (#123612)"
This reverts commit d7e1bf9ff9.

Reverted https://github.com/pytorch/pytorch/pull/123612 on behalf of https://github.com/jeffdaily due to This broke ROCm. see test_overrides.py ([comment](https://github.com/pytorch/pytorch/pull/123611#issuecomment-2067363780))
2024-04-19 22:44:26 +00:00
ydwu4
e62169a8fa Support torchbind op dispatch in python (#123367)
We override the `__call__` method and register fake, functional, proxy default dispatch mode implementation in its python_key_mode_table.

The idea is:
1. when inputs contains FakeScriptObject,  we dispatch it through _get_dispatch mechanism. We implement dispatch mode keys automatically in the operator's constructor.
2. when inputs are not fakified, we dispatch through the original c++ dispatcher.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123367
Approved by: https://github.com/zou3519
2024-04-19 17:17:27 +00:00
PyTorch MergeBot
520bc1080e Revert "[Profiler] Unify the device(CUDA, XPU, PrivateUse1) in torch profiler post processing (#123247)"
This reverts commit 768ce2cdda.

Reverted https://github.com/pytorch/pytorch/pull/123247 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/123247#issuecomment-2066152611))
2024-04-19 09:09:03 +00:00
Chen, Zejun
768ce2cdda [Profiler] Unify the device(CUDA, XPU, PrivateUse1) in torch profiler post processing (#123247)
This PR unifies the CUDA, XPU and PrivateUse1 in the torch profiler. Now CUDA, XPU and PrivateUse1 can together use string object `use_device` to distinguish each other and share one device path for calculating kineto time durations and memory statistics for post processing.

#suppress-api-compatibility-check

Co-authored-by: Aaron Enye Shi <enye.shi@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123247
Approved by: https://github.com/aaronenyeshi, https://github.com/gujinghui
2024-04-19 03:31:13 +00:00
egienvalue
d7e1bf9ff9 torch.mtia module for MTIA device backend (#123612)
MTIA device has its own Module in PyTorch now.
torch.mtia has following APIs similar to other backends. The lazy_init is also supported.
```
__all__ = [
    "init",
    "is_available",
    "synchronize",
    "device_count",
    "current_device",
    "current_stream",
    "default_stream",
    "set_stream",
    "stream",
    "device",
]

```
------------
For device management. We expand AccleratorHooksInterface to support generic device management and it can be used in both C++ and PyThon.
```
def _accelerator_hooks_device_count() -> _int: ...
def _accelerator_hooks_set_current_device(device_index: _int) -> None: ...
def _accelerator_hooks_get_current_device() -> _int : ...
def _accelerator_hooks_exchange_device(device_index: _int) -> _int : ...
def _accelerator_hooks_maybe_exchange_device(device_index: _int) -> _int : ...
```

---------
Adding get_device_module API to retrieve device modules for different device types.
```
def get_device_module(device: Optional[Union[torch.device, str]] = None)
```
---------
@exported-using-ghexport

Differential Revision: [D52923602](https://our.internmc.facebook.com/intern/diff/D52923602/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123612
Approved by: https://github.com/albanD
ghstack dependencies: #123611
2024-04-18 17:38:06 +00:00
egienvalue
cb17721899 Build device generic torch.Stream and torch.Event based on c10::Stream/Event (#123611)
This diff intends to build device generic torch.Stream and torch.Event for newly added accelerators in PyTorch.
------------
**torch.Stream APIs**
```
# Defined in torch/csrc/Stream.cpp
class Stream(_StreamBase):
    stream_id: _int  # Stream id
    device_index: _int
    device_type: _int

    device: _device  # The device of the stream

    @overload
    def __new__(self, device: Optional[DeviceLikeType] = None, priority: _int = 0) -> Stream: ...
    @overload
    def __new__(self, stream_id: _int, device_index: _int, device_type: _int, priority: _int = 0) -> Stream: ...
    def query(self) -> _bool: ...
    def synchronize(self) -> None: ...
    def wait_event(self, event: Event) -> None: ...
    def wait_stream(self, other: Stream) -> None: ...
    def record_event(self, event: Optional[Event] = None) -> Event: ...
    def query(self) -> None: ...
    def synchronize(self) -> None: ...
    def __hash__(self) -> _int: ...
    def __repr__(self) -> str: ...
    def __eq__(self, other: object) -> _bool: ...
```
------------------
**torch.Event APIs**:
- IPC related APIs are not implemented, since many device backends don't support it, but we leave interfaces there for future adaption of torch.cuda.Stream.
- currently only the enable_timing is supported, since it is the most common one used in other device backends. We have to refactor the event flag system in PyTorch to support more fancy flag.
- elapsedTime API is added to c10::Event

```
# Defined in torch/csrc/Event.cpp
class Event(_EventBase):

    device: _device  # The device of the Event
    event_id: _int # The raw event created by device backend

    def __new__(self,
        device: Optional[DeviceLikeType] = None,
        enable_timing: _bool = False,
        blocking: _bool = False,
        interprocess: _bool = False) -> Event: ...
    @classmethod
    def from_ipc_handle(self, device: DeviceLikeType, ipc_handle: bytes) -> Event: ...
    def record(self, stream: Optional[Stream] = None) -> None: ...
    def wait(self, stream: Optional[Stream] = None) -> None: ...
    def query(self) -> _bool: ...
    def elapsed_time(self, other: Event) -> _float: ...
    def synchronize(self) -> None: ...
    def ipc_handle(self) -> bytes: ...
    def __repr__(self) -> str: ...
```

-----------

c10::Event provides new APIs
- calculate **elapsedTime**.
- Get raw event id
- Synchronize event.

```
  double elapsedTime(const Event& event) const {
    return impl_.elapsedTime(event.impl_);
  }

  void* eventId() const {
    return impl_.eventId();
  }

  void synchronize() const {
    return impl_.synchronize();
  }
```
----------
TODO: need to find a good way to test them in PyTorch with API mocks.

Differential Revision: [D55351839](https://our.internmc.facebook.com/intern/diff/D55351839/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123611
Approved by: https://github.com/albanD
2024-04-18 17:35:09 +00:00
rzou
648c39c47d Add OpOverload.redispatch; use it in new custom ops API (#124089)
A kernel has "dispatcher convention" if there is an additional keyset
arg at the beginning of the argument list. This PR:
- adds a way to register kernels with dispatcher_convention using
  Library.impl (pass dispatcher_convention = True)
- adds OpOverload.redispatch

We use both of the above in the new custom ops API: we register the
autograd kernel in dispatcher convention so that we can actually call
redispatch like how pytorch built-in ops do it.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124089
Approved by: https://github.com/albanD
ghstack dependencies: #123937, #124064, #124065, #124066, #124071
2024-04-18 12:48:04 +00:00
PHLens
9aba918bd8 Support Accelerator OOM Error (#121200) (#121702)
Fixes #121200
This PR introduces AcceleratorOutOfMemoryError for all privateuse1 backend. For python, there is a PyError object which will be set only when privateuse1 is registered. All privateuse1 backend then can use this error for memory errors. Maybe more error types in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121702
Approved by: https://github.com/guangyey, https://github.com/albanD
2024-04-15 21:41:46 +00:00
NiTIAN
3dde6a461f fix cpp path in torch/_C/_autograd.pyi (#123924)
The file `tools/autograd/init.cpp` does not exist, I think the right path is `torch/csrc/autograd/init.cpp`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123924
Approved by: https://github.com/Skylion007
2024-04-12 22:32:00 +00:00
Shengbao Zheng
4e9094533e [c10d/nccl-pg] allow user to pass process group description (#123472)
Summary:
We need a way to allow user set a customized description for a process group, e.g. FSDP, PP.

Here are several use cases of user specified group_desc:
- Logging: we can easily match a log line and understand what's this collective/pg is used to.
- Pytorch traces (e.g. Kineto, Execution Trace) can benefit from the PG desc since trace analysis, benchmarks will be able to easily differentiate PG purpose like FSDP, PP.
- Lower layer collectives(e.g. NCCL) debug: we will be able to expose PG desc to NCCL communicator so NCCL layer operations can be easily correlated to a PG.

Solution: Add a group_desc field to c10d

Differential Revision: D55781850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123472
Approved by: https://github.com/kwen2501
2024-04-12 08:44:21 +00:00
Shivam Raikundalia
3ebbeb75fd [Profiler] Make Kineto traces export ns granularity for finer timestamps (#122425) (#123650)
Summary:

Kineto traces use microsecond level granularity because of chrome tracing defaults to that precision. Fix by adding preprocessor flag to TARGETS and BUCK files. Also remove any unnecessary ns to us conversions made in the profiler itself.

This diff contains profiler changes only. Libkineto changes found in D54964435.

Test Plan:
Check JSON and chrome tracing to make sure values are as expected. Tracing with flags enabled should have ns precision. Tracings without flags should be same as master.
Zoomer: https://www.internalfb.com/intern/zoomer/?profiling_run_fbid=796886748550189
Ran key_averages() to make sure FunctionEvent code working as expected:
--  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls

                                          ProfilerStep*         0.74%       3.976ms        64.40%     346.613ms      69.323ms       0.000us         0.00%      61.710ms      12.342ms             5
                      Optimizer.zero_grad#SGD.zero_grad         0.76%       4.109ms         0.76%       4.109ms     821.743us       0.000us         0.00%       0.000us       0.000us             5
                                          ## forward ##         6.89%      37.057ms        27.19%     146.320ms      29.264ms       0.000us         0.00%      58.708ms      11.742ms             5
                                           aten::conv2d         0.22%       1.176ms         7.74%      41.658ms     157.199us       0.000us         0.00%      27.550ms     103.962us           265
                                      aten::convolution         0.79%       4.273ms         7.52%      40.482ms     152.762us       0.000us         0.00%      27.550ms     103.962us           265
                                     aten::_convolution         0.69%       3.688ms         6.73%      36.209ms     136.637us       0.000us         0.00%      27.550ms     103.962us           265
                                aten::cudnn_convolution         6.04%      32.520ms         6.04%      32.520ms     122.719us      27.550ms         8.44%      27.550ms     103.962us           265
                                             aten::add_         2.42%      13.045ms         2.42%      13.045ms      30.694us      12.700ms         3.89%      12.700ms      29.882us           425
                                       aten::batch_norm         0.19%       1.027ms         8.12%      43.717ms     164.971us       0.000us         0.00%      16.744ms      63.185us           265
                           aten::_batch_norm_impl_index         0.31%       1.646ms         7.93%      42.691ms     161.096us       0.000us         0.00%      16.744ms      63.185us           265
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------

Differential Revision: D55925068

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123650
Approved by: https://github.com/aaronenyeshi
2024-04-11 04:29:20 +00:00
Aaron Orenstein
2bcc83dfbd Preserve dispatch state across function tracing (#122073)
If we throw an exception in the "wrong" place we can end up with the dispatch state being in a weird state which can cause all future dispatching to fail. Preserve and restore it as part of `preserve_global_state` so we know it's sane after that.

Also fake_tensor's in_kernel_invocation_manager() was leaving a bit set in the dispatcher (DispatchKey.Dense) which affected follow-on code.  Fixed that to reset after as well.

Repro:

before:
```
$ rm test/dynamo_skips/TestSparseCPU.test_to_dense_with_gradcheck_sparse_cpu_complex64
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest -s test/dynamo/test_export.py test/test_sparse.py -k 'test_to_dense_with_gradcheck_sparse_cpu_complex64'
======== 1 passed, 6173 deselected in 5.21s =============
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest -s test/dynamo/test_export.py test/test_sparse.py -k 'test_torch_inference_mode_ctx or test_to_dense_with_gradcheck_sparse_cpu_complex64'
========= 1 skipped, 6172 deselected, 1 error in 5.29s =========
```
(note that test_to_dense_with_gradcheck_sparse_cpu_complex64 passes on its own but failed when including the skipped test_export.py tests)
after:
```
$ rm test/dynamo_skips/TestSparseCPU.test_to_dense_with_gradcheck_sparse_cpu_complex64
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest -s test/dynamo/test_export.py test/test_sparse.py -k 'test_to_dense_with_gradcheck_sparse_cpu_complex64'
===================== 1 passed, 6173 deselected in 5.42s =====================
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest -s test/dynamo/test_export.py test/test_sparse.py -k 'test_torch_inference_mode_ctx or test_to_dense_with_gradcheck_sparse_cpu_complex64'
===================== 1 passed, 1 skipped, 6172 deselected in 7.30s ======================
```
(note that test_to_dense_with_gradcheck_sparse_cpu_complex64 passes in both runs)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122073
Approved by: https://github.com/zou3519
2024-04-10 18:57:01 +00:00
Shivam Raikundalia
c9c099b271 Add kwargs to RecordFunctionFast (#123600)
Differential Revision: [D55897888](https://our.internmc.facebook.com/intern/diff/D55897888/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123600
Approved by: https://github.com/davidberard98
2024-04-10 18:17:50 +00:00
FFFrog
5c1bde99c0 Fix the uncorrect return value of Tensor.numpy() (#123538)
Fixes #123494

As the ISSUE stated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123538
Approved by: https://github.com/XuehaiPan, https://github.com/Skylion007
2024-04-10 14:47:24 +00:00
Jason Ansel
e3ea316623 [dynamo] Save/restore cublas_allow_tf32 in convert_frame (#123509)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123509
Approved by: https://github.com/anijain2305
2024-04-07 03:37:47 +00:00
PyTorch MergeBot
c66d503194 Revert "[Profiler][submodule] Make Kineto traces export ns granularity for finer timestamps (#122425)"
This reverts commit 6f7dd2f84a.

Reverted https://github.com/pytorch/pytorch/pull/122425 on behalf of https://github.com/malfet due to Breaks ROCM builds ([comment](https://github.com/pytorch/pytorch/pull/122425#issuecomment-2041129241))
2024-04-06 16:19:00 +00:00
Shivam Raikundalia
6f7dd2f84a [Profiler][submodule] Make Kineto traces export ns granularity for finer timestamps (#122425)
Summary:
Kineto traces use microsecond level granularity because of chrome tracing defaults to that precision. Fix by adding preprocessor flag to TARGETS and BUCK files. Also remove any unnecessary ns to us conversions made in the profiler itself.

This diff contains profiler changes only. Libkineto changes found in D54964435.

Test Plan:
Check JSON and chrome tracing to make sure values are as expected. Tracing with flags enabled should have ns precision. Tracings without flags should be same as master.
Tracing with flags enabled: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Mar_18_14_37_22.4155151.pt.trace.json.gz&bucket=gpu_traces
Tracing without flags enabled: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Mar_18_14_39_15.4166047.pt.trace.json.gz&bucket=gpu_traces
Tracing on main: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Mar_18_14_42_43.4177559.pt.trace.json.gz&bucket=gpu_traces

Ran key_averages() to make sure FunctionEvent code working as expected:
--  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls

                                          ProfilerStep*         0.74%       3.976ms        64.40%     346.613ms      69.323ms       0.000us         0.00%      61.710ms      12.342ms             5
                      Optimizer.zero_grad#SGD.zero_grad         0.76%       4.109ms         0.76%       4.109ms     821.743us       0.000us         0.00%       0.000us       0.000us             5
                                          ## forward ##         6.89%      37.057ms        27.19%     146.320ms      29.264ms       0.000us         0.00%      58.708ms      11.742ms             5
                                           aten::conv2d         0.22%       1.176ms         7.74%      41.658ms     157.199us       0.000us         0.00%      27.550ms     103.962us           265
                                      aten::convolution         0.79%       4.273ms         7.52%      40.482ms     152.762us       0.000us         0.00%      27.550ms     103.962us           265
                                     aten::_convolution         0.69%       3.688ms         6.73%      36.209ms     136.637us       0.000us         0.00%      27.550ms     103.962us           265
                                aten::cudnn_convolution         6.04%      32.520ms         6.04%      32.520ms     122.719us      27.550ms         8.44%      27.550ms     103.962us           265
                                             aten::add_         2.42%      13.045ms         2.42%      13.045ms      30.694us      12.700ms         3.89%      12.700ms      29.882us           425
                                       aten::batch_norm         0.19%       1.027ms         8.12%      43.717ms     164.971us       0.000us         0.00%      16.744ms      63.185us           265
                           aten::_batch_norm_impl_index         0.31%       1.646ms         7.93%      42.691ms     161.096us       0.000us         0.00%      16.744ms      63.185us           265
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------

Differential Revision: D55087993

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122425
Approved by: https://github.com/aaronenyeshi
2024-04-06 06:04:28 +00:00
sraikund16
6fa72480d3 Enhance RecordFunctionFast input args and use input args in triton_heuristics.py (#123459)
Summary: Now that we can input shapes as input args for RecordFunctionFast, let's add that to the triton heuristics. Also, lets add the ability to pass in a tuple into the RecordFunctionFast constructor.

Test Plan:
Ran both the _inductor/test_profile.py and profiler/test_profiler.py unit tests. Also added tuple based unit test to profiler/test_profiler.py

Ran record_function_fast.py from the following branch
https://github.com/pytorch/pytorch/compare/sraikund/record_funct_test?expand=1

No shape or args: tests function fast with no args and profile without record_shapes
With shape tests: tests function fast with args and profile with record_shapes true
Args no shape: tests function fast with args inputted but record_shapes set to false
Args shape tuple: tests function fast with args inputted in form of tuple and record_shapes true

Stdout:

No shape or args:: 1.8491458892822266 us
With shape:: 2.211381196975708 us
Args no shape:: 1.9212646484375 us
With shape tuple:: 2.245788335800171 us

Differential Revision: D55809967

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123459
Approved by: https://github.com/davidberard98
2024-04-06 02:44:06 +00:00
rzou
81e7a7c955 Add mutated_args field to custom_op (#123129)
If provided, we:
- autogenerate an ADInplaceOrView implementation
- assume that no mutated inputs are returned as outputs. There are
  already aliasing runtime checks that check this.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123129
Approved by: https://github.com/albanD
ghstack dependencies: #123108, #123109, #123110
2024-04-05 22:03:51 +00:00
rzou
067851dd0d Expand is_functional_schema to work with torch._C._FunctionSchema (#123108)
Previously it worked with torchgen.model.FunctionSchema. This PR extends
it to work with torch._C._FunctionSchema by making
torchgen.model.FunctionSchema look more like torch._C._FunctionSchema.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123108
Approved by: https://github.com/albanD
2024-04-05 22:03:39 +00:00
rzou
d486cb7c1b Deprecate calling FakeTensor.data_ptr in eager-mode (#123292)
Today, we error out on FakeTensor.data_ptr under torch.compile. This PR
moves to error out on FakeTensor.data_ptr under eager mode to avoid
diverging behavior.

We do this by adding another bit onto FakeTensor that we'll remove after
the deprecation cycle.

Test Plan:
- tested locally
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123292
Approved by: https://github.com/eellison
ghstack dependencies: #123261, #123282, #123291
2024-04-04 20:35:24 +00:00
Shivam Raikundalia
4732375042 make RecordFunctionFast take inputs (#123208)
Summary: RECORD_FUNCTION in C++ and torch.profiler.record_function already support recording inputs. Let's do the same for RecordFunctionFast.

Test Plan: Add tests in test_profiler.py that take args and also do not take args so we can support it being an optional parameter

Differential Revision: D55648870

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123208
Approved by: https://github.com/davidberard98
2024-04-03 21:58:09 +00:00
Yu, Guangye
eb7adc3ae0 Refactor gpu trace to be device-agnostic (#121794)
# Motivation
Refactor gpu trace to be device-agnostic. gpu trace is usually used in runtime components, including Device, Stream, Event, Guard, and Allocator. It should be device-agnostic and can be shared among each device backend.

# Solution
move `_cuda_trace.py` to `_gpu_trace.py`, which makes each device backend owns their callback, respectively.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121794
Approved by: https://github.com/jgong5, https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui
2024-03-30 13:04:38 +00:00
Yu, Guangye
f4ff063c33 Add attributes to xpu device prop (#121898)
# Motivation
Add some attributes to `XPUDeviceProp` and expose them via `torch.xpu.get_device_properties` and `torch.xpu.get_device_capability`. They can be used in `torch.compile`  or directly passed to triton to generate more optimized code based on device properties.

# Additional Context
expose the following attributes to `torch.xpu.get_device_properties`:
- `has_fp16` (newly added)
- `has_fp64` (newly added)
- `has_atomic64` (newly added)
- `driver_version`
- `vendor`
- `version`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121898
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/malfet, https://github.com/albanD, https://github.com/atalman
2024-03-30 00:25:39 +00:00
William Wen
35382f0573 [dynamo, 3.12] Use CPython internal _PyOpcode_Caches instead of hardcoding (#122335)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122335
Approved by: https://github.com/jansel
ghstack dependencies: #122146
2024-03-27 20:39:39 +00:00
Frank Lin
249e65b92d Graph-Safe RNG State Exchange for Tensor Parallelism (#114068)
See #113541

The PR allows for registering and controlling multiple RNG states using indices, ensuring cudagraph-safe operations, and includes both C++ and Python API changes to support this functionality.

cc  @eellison @anijain2305 @jansel @ezyang @ptrblck @csarofeen @mcarilli
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114068
Approved by: https://github.com/ezyang, https://github.com/eqy, https://github.com/xuzhao9
2024-03-27 01:14:38 +00:00
rzou
c81c9ba472 Disallow {FakeTensor,FunctionalTensor}.data_ptr (#122514)
This PR:
- disallows FakeTensor.data_ptr when it is called inside PT2 or fx tracing.
- disallows FunctionalTensor.data_ptr (python FunctionalTensor is only used in
  PT2)

The motivation behind this is that the leading cause of segfaults when
using custom ops with PT2 is calling .data_ptr on FunctionalTensor or
FakeTensor.

This change is BC-breaking. If your code broke as a result of this, it's
because there was a bug in it (these .data_ptr should never be
accessed!). You can either fix the bug (recommended) or get the previous
behavior back with:
```
from torch._subclasses.fake_tensor import FakeTensor
from torch._subclasses.functional_tensor import FunctionalTensor

data_ptr = 0 if isinstance(tensor, (FakeTensor, FunctionalTensor)) else tensor.data_ptr()
```

Test Plan:
- existing tests

Differential Revision: [D55366199](https://our.internmc.facebook.com/intern/diff/D55366199)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122514
Approved by: https://github.com/ezyang, https://github.com/albanD, https://github.com/yifuwang, https://github.com/kurtamohler
2024-03-26 23:55:42 +00:00
Bin Bao
537cd66e73 [Inductor] Support custom op in JIT with cpp wrapper (#122554)
Summary:  To call custom ops in an ABI-compatible way requires doing boxed call with varargs across C shim. In the JIT mode, we can get around it by calling into Python.  https://gist.github.com/desertfire/be2a65b0a9b47780bb716b53ac2cd2b3 is an example of generated code.

Differential Revision: [D55326556](https://our.internmc.facebook.com/intern/diff/D55326556)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122554
Approved by: https://github.com/jansel, https://github.com/chenyang78
2024-03-26 18:48:45 +00:00
PyTorch MergeBot
4dc09d6aa4 Revert "Graph-Safe RNG State Exchange for Tensor Parallelism (#114068)"
This reverts commit e9dcda5cba.

Reverted https://github.com/pytorch/pytorch/pull/114068 on behalf of https://github.com/ezyang due to memory leak in another ci ([comment](https://github.com/pytorch/pytorch/pull/114068#issuecomment-2018044527))
2024-03-25 13:49:04 +00:00
Edward Z. Yang
5891c5b3a6 Factor meta conversion through serializable MetaTensorDesc (#122044)
Fixes https://github.com/pytorch/pytorch/issues/121085

This PR pretty involved so pay attention to this description.  At a high
level, the refactor is intended to be mechanical: anywhere in
MetaConverter where previously we took a Tensor as argument, we now take
a MetaTensorDesc, which contains all of the information that we would
have queried off of the Tensor, but placed into a separate data
structure which we can serialize or use to recreate a fake tensor in
a separate fake tensor mode in exact fidelity to the original.

However, this transformation is not always entirely mechanical.  Here
is what you need to pay attention to:

- The memo table from real Tensor -> meta/fake Tensor is now broken
  into two memo tables: real Tensor -> stable int id -> meta/fake
  Tensor.  The stable int id is needed so that when we do serialization,
  we know when tensors/storages alias each other and can ensure we preserve
  this aliasing upon deserialization.

  The way I have implemented changes the weak reference behavior.
  Previously, when either the real Tensor OR the meta/fake Tensor went
  dead, we would remove the entry from the memo table.  Now, this only
  removes entries from one of the two memo tables.  This semantically
  makes sense, because the user may have held on to the stable int id
  out of band, and may expect a real Tensor to continue to be numbered
  consistently / expect to be able to lookup a meta/fake tensor from
  this id.  If this is unacceptable, it may be possible to rejigger
  the memo tables so that we have real Tensor -> stable int id
  and real Tensor -> meta/fake Tensor, but TBH I find the new
  implementation a lot simpler, and arranging the memo tables in this
  way means that I have to muck around with the real tensor to save
  to the memo table; in the current implementation, I never pass the
  Tensor to meta_tensor function AT ALL, which means it is impossible
  to accidentally depend on it.

- When I fill in the fields of MetaTensorDesc in describe_tensor, I need
  to be careful not to poke fields when they are not valid.  Previously,
  preconditions were implicitly checked via the conditional structure
  ("is this sparse? is this nested?") that is tested before we start
  reading attributes.  This structure has to be replicated in
  describe_tensor, and I have almost assuredly gotten it wrong on my
  first try (I'll be grinding through it on CI; a careful audit will
  help too, by auditing that I've tested all the same conditionals that
  the original access was guarded by.)

- I originally submitted https://github.com/pytorch/pytorch/pull/121821
  for the symbolic shapes change, but it turned out the way I did it
  there didn't actually work so well for this PR.  I ended up just
  inlining the symbolic shapes allocation logic into MetaConverter
  (look for calls to maybe_specialize_sym_int_with_hint), maybe there
  is a better way to structure it, but what I really want is to
  just read sizes/strides/offset directly off of MetaTensorDesc; I
  don't want another intermediate data structure.

- Some fields aren't serializable. These are documented as "NOT
  serializable".  ctx/type should morally be serializable and I just
  need to setup a contract with subclasses to let them be serialized.
  The fake_mode is used solely to test if we are refakefying with
  a pre-existing ShapeEnv and we want to reuse the SymInt
  directly--serializing this case is hopeless but I am kind of hoping
  after this refactor we do not need this at all.  view_func is not
  serializable because it's a bound C implemented method.  Joel has
  promised me that this is not too difficult to actually expose as a
  true data structure, but this is the edgiest of edge cases and there
  is no reason to deal with it right now.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122044
Approved by: https://github.com/eellison
2024-03-25 06:21:17 +00:00
Guilherme Leobas
4eaa000acc Teach dynamo about torch.func.jvp (#119926)
List of changes:
- Replace JVP_NESTING by torch._C._functorch.maybe_current_level()
- Remove all increment nesting functions from wrap_fx_proxy_cls
- fwAD.make_dual receives the dual_level as keyword argument
- Add jvp_increment_nesting, set_fwd_grad_enabled and dual_level context managers to dynamo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119926
Approved by: https://github.com/zou3519
2024-03-22 20:25:47 +00:00
PyTorch MergeBot
f65373e278 Revert "Factor meta conversion through serializable MetaTensorDesc (#122044)"
This reverts commit e2d89e9704.

Reverted https://github.com/pytorch/pytorch/pull/122044 on behalf of https://github.com/jeanschmidt due to Seems that some landrace caused this PR to break lint ([comment](https://github.com/pytorch/pytorch/pull/122044#issuecomment-2015025490))
2024-03-22 12:46:21 +00:00
Edward Z. Yang
e2d89e9704 Factor meta conversion through serializable MetaTensorDesc (#122044)
Fixes https://github.com/pytorch/pytorch/issues/121085

This PR pretty involved so pay attention to this description.  At a high
level, the refactor is intended to be mechanical: anywhere in
MetaConverter where previously we took a Tensor as argument, we now take
a MetaTensorDesc, which contains all of the information that we would
have queried off of the Tensor, but placed into a separate data
structure which we can serialize or use to recreate a fake tensor in
a separate fake tensor mode in exact fidelity to the original.

However, this transformation is not always entirely mechanical.  Here
is what you need to pay attention to:

- The memo table from real Tensor -> meta/fake Tensor is now broken
  into two memo tables: real Tensor -> stable int id -> meta/fake
  Tensor.  The stable int id is needed so that when we do serialization,
  we know when tensors/storages alias each other and can ensure we preserve
  this aliasing upon deserialization.

  The way I have implemented changes the weak reference behavior.
  Previously, when either the real Tensor OR the meta/fake Tensor went
  dead, we would remove the entry from the memo table.  Now, this only
  removes entries from one of the two memo tables.  This semantically
  makes sense, because the user may have held on to the stable int id
  out of band, and may expect a real Tensor to continue to be numbered
  consistently / expect to be able to lookup a meta/fake tensor from
  this id.  If this is unacceptable, it may be possible to rejigger
  the memo tables so that we have real Tensor -> stable int id
  and real Tensor -> meta/fake Tensor, but TBH I find the new
  implementation a lot simpler, and arranging the memo tables in this
  way means that I have to muck around with the real tensor to save
  to the memo table; in the current implementation, I never pass the
  Tensor to meta_tensor function AT ALL, which means it is impossible
  to accidentally depend on it.

- When I fill in the fields of MetaTensorDesc in describe_tensor, I need
  to be careful not to poke fields when they are not valid.  Previously,
  preconditions were implicitly checked via the conditional structure
  ("is this sparse? is this nested?") that is tested before we start
  reading attributes.  This structure has to be replicated in
  describe_tensor, and I have almost assuredly gotten it wrong on my
  first try (I'll be grinding through it on CI; a careful audit will
  help too, by auditing that I've tested all the same conditionals that
  the original access was guarded by.)

- I originally submitted https://github.com/pytorch/pytorch/pull/121821
  for the symbolic shapes change, but it turned out the way I did it
  there didn't actually work so well for this PR.  I ended up just
  inlining the symbolic shapes allocation logic into MetaConverter
  (look for calls to maybe_specialize_sym_int_with_hint), maybe there
  is a better way to structure it, but what I really want is to
  just read sizes/strides/offset directly off of MetaTensorDesc; I
  don't want another intermediate data structure.

- Some fields aren't serializable. These are documented as "NOT
  serializable".  ctx/type should morally be serializable and I just
  need to setup a contract with subclasses to let them be serialized.
  The fake_mode is used solely to test if we are refakefying with
  a pre-existing ShapeEnv and we want to reuse the SymInt
  directly--serializing this case is hopeless but I am kind of hoping
  after this refactor we do not need this at all.  view_func is not
  serializable because it's a bound C implemented method.  Joel has
  promised me that this is not too difficult to actually expose as a
  true data structure, but this is the edgiest of edge cases and there
  is no reason to deal with it right now.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122044
Approved by: https://github.com/eellison
ghstack dependencies: #122018
2024-03-22 03:56:34 +00:00
PyTorch MergeBot
968c4c4154 Revert "Refactor gpu trace to be device-agnostic (#121794)"
This reverts commit 74deacbf31.

Reverted https://github.com/pytorch/pytorch/pull/121794 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it breaks ROCm jobs in trunk 74deacbf31, please help take a look and reland the change ([comment](https://github.com/pytorch/pytorch/pull/121794#issuecomment-2013674083))
2024-03-21 20:33:17 +00:00
Frank Lin
e9dcda5cba Graph-Safe RNG State Exchange for Tensor Parallelism (#114068)
See #113541

The PR allows for registering and controlling multiple RNG states using indices, ensuring cudagraph-safe operations, and includes both C++ and Python API changes to support this functionality.

cc  @eellison @anijain2305 @jansel @ezyang @ptrblck @csarofeen @mcarilli
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114068
Approved by: https://github.com/ezyang
2024-03-21 01:57:08 +00:00
Yu, Guangye
74deacbf31 Refactor gpu trace to be device-agnostic (#121794)
# Motivation
Refactor gpu trace to be device-agnostic. gpu trace is usually used in runtime components, including Device, Stream, Event, Guard, and Allocator. It should be device-agnostic and can be shared among each device backend.

# Solution
move `_cuda_trace.py` to `_gpu_trace.py`, which makes each device backend owns their callback, respectively.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121794
Approved by: https://github.com/jgong5, https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui
2024-03-21 01:52:58 +00:00
Michael Lazos
c20cf97366 Move some cudagraphs checks into C++ (#122251)
Based off of https://github.com/pytorch/pytorch/pull/111094
This + cpp guards improves TIMM geomean optimizer performance by about 20%

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122251
Approved by: https://github.com/eellison
2024-03-21 01:02:23 +00:00
PyTorch MergeBot
0696db8202 Revert "Teach dynamo about torch.func.jvp (#119926)"
This reverts commit 17489784b6.

Reverted https://github.com/pytorch/pytorch/pull/119926 on behalf of https://github.com/peterbell10 due to broken mac jobs on main ([comment](https://github.com/pytorch/pytorch/pull/119926#issuecomment-2010327997))
2024-03-20 18:34:43 +00:00
Guilherme Leobas
17489784b6 Teach dynamo about torch.func.jvp (#119926)
List of changes:
- Replace JVP_NESTING by torch._C._functorch.maybe_current_level()
- Remove all increment nesting functions from wrap_fx_proxy_cls
- fwAD.make_dual receives the dual_level as keyword argument
- Add jvp_increment_nesting, set_fwd_grad_enabled and dual_level context managers to dynamo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119926
Approved by: https://github.com/zou3519
2024-03-20 13:09:19 +00:00
PyTorch MergeBot
36e5c1dcab Revert "Teach dynamo about torch.func.jvp (#119926)"
This reverts commit edd04b7c16.

Reverted https://github.com/pytorch/pytorch/pull/119926 on behalf of https://github.com/jeanschmidt due to lots of breakages in pull jobs, checking if reverting this one will help ([comment](https://github.com/pytorch/pytorch/pull/119926#issuecomment-2007915919))
2024-03-19 18:59:46 +00:00
PyTorch MergeBot
f9ed1c432d Revert "Refactor gpu trace to be device-agnostic (#121794)"
This reverts commit 0ff1109e26.

Reverted https://github.com/pytorch/pytorch/pull/121794 on behalf of https://github.com/jeanschmidt due to Reverting to see if rocm trunk errors are related ([comment](https://github.com/pytorch/pytorch/pull/121794#issuecomment-2007519408))
2024-03-19 15:40:26 +00:00
Guilherme Leobas
edd04b7c16 Teach dynamo about torch.func.jvp (#119926)
List of changes:
- Replace JVP_NESTING by torch._C._functorch.maybe_current_level()
- Remove all increment nesting functions from wrap_fx_proxy_cls
- fwAD.make_dual receives the dual_level as keyword argument
- Add jvp_increment_nesting, set_fwd_grad_enabled and dual_level context managers to dynamo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119926
Approved by: https://github.com/zou3519
2024-03-19 13:06:42 +00:00
Yu, Guangye
0ff1109e26 Refactor gpu trace to be device-agnostic (#121794)
# Motivation
Refactor gpu trace to be device-agnostic. gpu trace is usually used in runtime components, including Device, Stream, Event, Guard, and Allocator. It should be device-agnostic and can be shared among each device backend.

# Solution
move `_cuda_trace.py` to `_gpu_trace.py`, which makes each device backend owns their callback, respectively.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121794
Approved by: https://github.com/jgong5, https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui
2024-03-19 06:02:28 +00:00
Animesh Jain
8860c625ea [dynamo][guards-cpp-refactor] Integrate cpp guard manager with CheckFnManager (#120726)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120726
Approved by: https://github.com/jansel
2024-03-19 03:11:31 +00:00
andrewor14
773ae817f7 Batch Norm Consolidation (#116092)
**Summary:**

This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:

```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
  aten.native_batch_norm ->
  aten._native_batch_norm_legit (export only) ->
  _batch_norm_legit_cpu/cuda (kernels, export only) ->
  _batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```

Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.

Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:

```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```

The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:

```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```

Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2 week FC
window, and the ops used in the old stack is planned to
be removed after the 6 month BC window.

Test Plan: `OpInfo` tests for `batch_norm_with_update`.

Reviewers: albanD, bdhirsh

Subscribers: albanD, bdhirsh, supriyar

Tasks: https://github.com/pytorch/pytorch/issues/111384

Differential Revision: [D54805279](https://our.internmc.facebook.com/intern/diff/D54805279)
Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2024-03-18 21:01:30 +00:00
Bin Bao
8be80706b4 [AOTI] Add pybind for tensor_converter util functions (#121744)
Differential Revision: [D54818716](https://our.internmc.facebook.com/intern/diff/D54818716)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121744
Approved by: https://github.com/chenyang78
ghstack dependencies: #121523, #121743
2024-03-14 22:20:51 +00:00
Zihua Wu
d62bdb087d [Profiler] add missing field device_resource_id (#121480)
Fixes #121479

Co-authored-by: Aaron Shi <enye.shi@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121480
Approved by: https://github.com/aaronenyeshi
2024-03-12 21:42:53 +00:00
PyTorch MergeBot
fd0dbcd891 Revert "Batch Norm Consolidation (#116092)"
This reverts commit 7b4f70eda5.

Reverted https://github.com/pytorch/pytorch/pull/116092 on behalf of https://github.com/osalpekar due to Causes build failure in //caffe2:aten-hip (AMD build) target. See [D54707318](https://www.internalfb.com/diff/D54707318) for more details, may require internal build system changes to resolve. ([comment](https://github.com/pytorch/pytorch/pull/116092#issuecomment-1989542965))
2024-03-11 22:22:41 +00:00
andrewor14
7b4f70eda5 Batch Norm Consolidation (#116092)
**Summary:**

This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:

```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
  aten.native_batch_norm ->
  aten._native_batch_norm_legit (export only) ->
  _batch_norm_legit_cpu/cuda (kernels, export only) ->
  _batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```

Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.

Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:

```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```

The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:

```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```

Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2 week FC
window, and the ops used in the old stack is planned to
be removed after the 6 month BC window.

Test Plan: `OpInfo` tests for `batch_norm_with_update`.

Reviewers: albanD, bdhirsh

Subscribers: albanD, bdhirsh, supriyar

Tasks: https://github.com/pytorch/pytorch/issues/111384

Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2024-03-08 15:07:15 +00:00
Shengbao Zheng
60aaba4128 create function to get ProcessGroupNCCL uid (#121132)
Summary: expose ProcessGroupNCCL uid

Differential Revision: D54446056

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121132
Approved by: https://github.com/aaronenyeshi
2024-03-07 18:34:38 +00:00
PyTorch MergeBot
b529c19bdf Revert "Batch Norm Consolidation (#116092)"
This reverts commit 5680f565d5.

Reverted https://github.com/pytorch/pytorch/pull/116092 on behalf of https://github.com/jeffdaily due to broke ROCm, PR signal was clean but trunk was not, the merge should have been blocked but wasn't ([comment](https://github.com/pytorch/pytorch/pull/116092#issuecomment-1981373237))
2024-03-06 17:10:01 +00:00
PyTorch MergeBot
8087912622 Revert "[XPU][Profiler] Add Logic To The Profiler For Processing XPU-backend Data (#120185)"
This reverts commit 0ab2ec3738.

Reverted https://github.com/pytorch/pytorch/pull/120185 on behalf of https://github.com/briancoutinho due to This PR contains a list search in '_parse_kineto_events()' that can lead to very high cost of running this post trace, training jobs getting stuck for mins ([comment](https://github.com/pytorch/pytorch/pull/120185#issuecomment-1980180774))
2024-03-06 06:39:51 +00:00
Tugsbayasgalan Manlaibaatar
5680f565d5 Batch Norm Consolidation (#116092)
**Summary:**

This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:

```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
  aten.native_batch_norm ->
  aten._native_batch_norm_legit (export only) ->
  _batch_norm_legit_cpu/cuda (kernels, export only) ->
  _batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```

Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.

Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:

```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```

The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:

```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```

Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2 week FC
window, and the ops used in the old stack is planned to
be removed after the 6 month BC window.

Test Plan: `OpInfo` tests for `batch_norm_with_update`.

Reviewers: albanD, bdhirsh

Subscribers: albanD, bdhirsh, supriyar

Tasks: https://github.com/pytorch/pytorch/issues/111384

Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2024-03-06 04:50:46 +00:00
Scott Wolchok
cac36e232e [PyTorch] Split StaticModule out of test_static_runtime (#121028)
I want to use StaticModule in another (internal) test, so splitting it out.

Differential Revision: [D54384817](https://our.internmc.facebook.com/intern/diff/D54384817/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121028
Approved by: https://github.com/suo
2024-03-05 23:14:07 +00:00
rzou
3ef0befdc9 Better error messages for impl_abstract_pystub (#120959)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120959
Approved by: https://github.com/drisspg
2024-03-04 15:24:36 +00:00
albanD
8cb4855d1e Release the GIL in serialization when it is safe to do so (#120818)
In particular this ensures we release the GIL when serializing:
- PyBytes objects (this is how we get the pickle object)
- Storage objects

Other string-like objects keep the gil which is fine because we only use this for very small strings today (for endianess) and so releasing the GIL is not important there
Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120818
Approved by: https://github.com/colesbury
2024-03-01 22:37:26 +00:00
Guilherme Leobas
491c2b4665 Let torch dynamo inline torch.func.grad (#118407)
When dynamo sees torch.func.grad, it tries to inline all frames related
to.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118407
Approved by: https://github.com/zou3519
2024-02-28 20:05:00 +00:00
Xunsong, Huang
0ab2ec3738 [XPU][Profiler] Add Logic To The Profiler For Processing XPU-backend Data (#120185)
This pull request is writing to provide an update on the recent advancements made in the PyTorch profiler with regards to XPU backend support. Following the successful merge of a previous pull request #94502 that established a pathway for the XPU backend within PyTorch, we have now taken steps to enhance the profiler's capabilities for handling and displaying profile data directly related to the XPU backend.

# Motivation

The current pull request builds upon this foundation by refining the profiler's data processing scripts, particularly `profiler_util.py`, to accommodate XPU backend-specific profile data. The aim is to align the handling and presentation of this data with that of the CUDA backend, offering users a consistent experience across different device profiles. This includes generating outputs such as JSON files compatible with Chrome trace tooling, among other formats.

# Principles

1. Minimal Impact: The modifications introduced should support XPU backend data with minimal disruption to the existing profiling scripts.
2. Consistency: Changes should maintain stylistic and functional consistency with existing `CUDA` and `privateuse1` pathways, ensuring no adverse effects on other logic paths.
3. Exclusivity: Ensure that the new XPU pathway does not interfere with or impede other pathways.

# Solutions

### a. Pathway Identification:

Introduction of a `use_xpu` flag within `torch.autograd.profiler.profile` interfaces to distinguish XPU-specific profiling.

### b. `use_device` Logic Revision:

With the introduction of the XPU pathway, `use_device` no longer implies a binary relationship with `use_cuda`. Consequently, we have revised related logic to remove implicit assertions and establish independent device distinction.

### c. Kernel List Segregation:

To accommodate the non-binary nature of device pathways, we have enabled kernel lists to identify specific device affiliations through separate list objects.

### d. Formatted Output:

To ensure output consistency, we have employed code duplication and keyword substitution techniques to facilitate the formatting of XPU-related profile data.

# Additional Enhancements

### a. Enumerations in `.pyi` Files:

Added recognition items for `DeviceType` and `ProfilerActivity` specific to XPU.

### b. Correct DeviceType Returns:

Revised `deviceTypeFromActivity` logic to accurately differentiate between device backends, even when they share common flags such as `libkineto::ActivityType::GPU_MEMCPY`.

### c. Bug Fixes in `cuda_corr_map`:

Addressed a corner case where erroneous parent-child event relationships were formed due to shared function event identifiers. The solution involves refining `cuda_corr_map` processing to prevent a function event from being misidentified as both the linker and linkee.

# Further Abstraction

Looking forward, we acknowledge the potential for further abstraction in the codebase. The current changes necessitated by XPU support have highlighted opportunities for reducing redundancy by consolidating naming conventions and utilizing a singular `device` naming system that relies on `DeviceType` attributes or string flags for differentiation. This would involve significant refactoring to replace device-specific flags and variables. This topic needs further discussions about whether we could and when we should deprecate all those flags and variables named with `cuda`.

# Next Pull Request

The next pull request will be contingent on Kineto's adoption of Intel's forthcoming PTI-sdk library, which will enable direct usage of XPU-related tracers. Subsequent modifications to `libkineto_init()` will aim to endow PyTorch running on XPU backends with comprehensive profiling capabilities on XPU devices.

We appreciate your attention to these enhancements and welcome any feedback or questions you may have regarding these developments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120185
Approved by: https://github.com/aaronenyeshi, https://github.com/gujinghui
2024-02-28 17:50:32 +00:00
Isuru Fernando
f7e79299c7 register torch.return_types in torch.fx._pytree (#120027)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120027
Approved by: https://github.com/lezcano, https://github.com/zou3519, https://github.com/XuehaiPan
ghstack dependencies: #119284
2024-02-23 21:52:42 +00:00
Isuru Fernando
c3496d50f0 Fix torch.return_types init signature (#119284)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119284
Approved by: https://github.com/peterbell10, https://github.com/XuehaiPan
2024-02-23 21:52:34 +00:00
soulitzer
312ce35c1f Rename singleton int to nested int (#119661)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119661
Approved by: https://github.com/ezyang
2024-02-16 19:21:17 +00:00
Yu, Guangye
8f9f12c068 Intel GPU Runtime Upstreaming for Device Allocator (#118091)
# Motivation
According to [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842) and [[RFC] Intel GPU Runtime Upstreaming for Allocator](https://github.com/pytorch/pytorch/issues/116322), we will upstream the key functionality of device `Allocator` dedicated for XPU to PyTorch. And following our design prepare to generalize `Allocator` in parallel.

# Design
In the current design, XPU uses an `XPUAllocator` class, inherited from `c10::Allocator`. `XPUAllocator` is a manager to handle `DeviceCachingAllocator`, which is a per-device implementation of the caching mechanism to manage the already cached or newly allocated memory. The caching mechanism is similar to other backends, like CUDA. We can visualize the design as below.
<p align="center">
<img width="162" alt="image" src="https://github.com/pytorch/pytorch/assets/106960996/6b17b8cf-e7d1-48b4-b684-f830c409d218">
</p>

# Additional Context
We're going to implement our design gradually. This PR covers the device `Allocator` dedicated to XPU. The second PR covers the host `Allocator`.
Besides these PRs, we plan to generalize the device `Allocator` device-agnostic through another PR.
In this PR, our device `Allocator` has the same memory management mechanism as CUDA, but lacks features such as expendable segments and statistics. We will add these features back in the subsequent PR which intend to generalize `Allocator`.

The differences with CUDA:
only key functionality, and lack of AsyncAllocator, gpu_trace, history_record, graph functionality, memory snapshot, memory statistics, expandable segment...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118091
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/albanD
ghstack dependencies: #117611, #117619, #117734
2024-02-16 06:46:00 +00:00
Yu, Guangye
4dc75f9084 Intel GPU Runtime Upstreaming for Event (#117734)
# Motivation
As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the next runtime component we would like to upstream is `Event` which handles the status of an operation that is being executed. Typically, in some circumstances, we can fine-grain control of the operation execution via `Event`.

# Design
`XPUEvent` is a movable but not a copyable wrapper around sycl event. It should be created lazily on an XPU device when recording an `XPUStream`. Meanwhile, `XPUEvent` can wait for another `XPUEvent` or all the submitted kernels on an `XPUStream` to complete. Align to the other backend, the C++ files related to `Event` will be placed in `aten/src/ATen/xpu` folder. For frontend code, `XPUEvent` runtime API will be bound to Python `torch.xpu.Event`. The corresponding C++ code will be placed in `torch/csrc/xpu/Event.cpp` and Python code will be placed in `torch/xpu/streams.py` respectively.

# Additional Context
It is worth mentioning that the `elapsed_time` method is temporarily not supported by `XPUEvent`. We will be adding support for it soon. Meanwhile `XPUEvent` doesn't support IPC from different processes. For the other parts, we have almost a 1:1 mapping with CUDA.

lack of the below APIs:
- `torch.cuda.Event.ipc_handle`
- `CUDAEvent`'s constructor with `IpcEventHandle`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117734
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #117611, #117619
2024-02-16 06:28:26 +00:00
Eddie Yan
cd380c794f [CUDNN][SDPA] Experimental cuDNN Flash Attention v2 Inference (#115663)
#113713

Going to clean up some of the checks and will remove draft status after.
Can be tested on SM80+ with `TORCH_CUDNN_MHA_ENABLED=1`.

CC @drisspg @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115663
Approved by: https://github.com/drisspg
2024-02-14 22:02:06 +00:00
Guilherme Leobas
3319dbcd23 Update vmap guard to avoid recompilations (#119061)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119061
Approved by: https://github.com/zou3519
2024-02-13 20:50:23 +00:00
Shuqiang Zhang
893dcac068 [c10d] explicitly abort communicators in destroy_process_group call (#119250)
Summary:
This PR tries to resolve issue #119215.

Basically,  processgroup shutdown (and hence ncclCommAbort) is called in
destroy_process_group APIs for the corresponding PGs. and in the
destructor of ProcessGroup, we avoid calling abort/ncclCommAbort.
Instead, it just checks if the users have explicitly already called destroy_process_group. If
not, Destructor will log a warning and encourage/expect users to do so
for cleanup of resources of PGs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119250
Approved by: https://github.com/minsii, https://github.com/kwen2501
2024-02-12 18:40:28 +00:00
Yu, Guangye
8fd11cb307 [2/2] Intel GPU Runtime Upstreaming for Stream (#117619)
# Motivation
According to [[1/2] Intel GPU Runtime Upstreaming for Stream](https://github.com/pytorch/pytorch/pull/117611), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the second PR covers the changes under `python frontend`.

# Design
Currently, it primarily offers stream-related APIs, including
 - `torch.xpu.StreamContext`
 - `torch.xpu.current_stream`
 - `torch.xpu.set_stream`
 - `torch.xpu.synchronize`
 - `torch._C._xpu_getCurrentRawStream`

# Additional Context
We will implement functions like `torch.xpu.Stream.wait_event`, `torch.xpu.Stream.wait_stream`, and `torch.xpu.Stream.record_event` in the next PR related with `Event`.

The differences with CUDA:
no default and external stream in XPU and lack of below APIs:
- `torch.cuda.ExternalStream`
- `torch.cuda.default_stream`
- `toch.cuda.is_current_stream_capturing`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117619
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/albanD
ghstack dependencies: #117611
2024-02-10 03:39:42 +00:00
William Wen
ee1c2449f7 [dynamo] delete dynamo cache entry when guard function is invalidated [attempt 2] (#119107)
Attempt #2 for https://github.com/pytorch/pytorch/pull/117875 to fix https://github.com/pytorch/pytorch/issues/112090.

Summary of changes:
- ~Changed CacheEntry linked list into a doubly-linked list structure to support deletion.~ (done by C++ refactor)
- Added CacheEntry and ExtraState borrowed references to GuardFn so that GuardFn can tell ExtraState to delete CacheEntry when the GuardFn is invalidated.
- ~Added ExtraState raw reference to CacheEntry so that we can get ExtraState to correctly point to the first CacheEntry if it gets deleted.~ (done by C++ refactor)
- CacheEntry destructor needs to reset GuardFn refs to ExtraState/CacheEntry in order to prevent use-after-free.
- code_context values that are nn.GraphModules need to be weakrefs in order to prevent circular references.
- Added tests that check for memory leaks and cache deletion operations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119107
Approved by: https://github.com/jansel
2024-02-07 03:32:42 +00:00
William Wen
ae4e866bba [dynamo] refactor CacheEntry and ExtraState to eval_frame.c to C++ (#118438)
Part of implementing CacheEntry invalidation to fix https://github.com/pytorch/pytorch/issues/112090.

Changes:
- Move CacheEntry and ExtraState to C++
- Use pybind to control reference counting
- Use std::list instead of manually implementing a linked list

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118438
Approved by: https://github.com/jansel
2024-02-06 20:48:11 +00:00
Yu, Guangye
a205e7bf56 [3/4] Intel GPU Runtime Upstreaming for Device (#116850)
# Motivation
According to [[1/4] Intel GPU Runtime Upstreaming for Device](https://github.com/pytorch/pytorch/pull/116019), As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), this third PR  covers the changes under `libtorch_python`.

# Design
This PR primarily offers device-related APIs in python frontend, including
- `torch.xpu.is_available`
- `torch.xpu.device_count`
- `torch.xpu.current_device`
- `torch.xpu.set_device`
- `torch.xpu.device`
- `torch.xpu.device_of`
- `torch.xpu.get_device_name`
- `torch.xpu.get_device_capability`
- `torch.xpu.get_device_properties`
- ====================
- `torch.xpu._DeviceGuard`
- `torch.xpu._is_compiled`
- `torch.xpu._get_device`

# Additional Context
We will implement the support of lazy initialization in the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116850
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/malfet
2024-02-01 12:31:26 +00:00
Michael Suo
eaa45f47f8 [sigmoid] fix for torchbind serialization (#118791)
Summary:
There is an annoying inconsistency in how we pickle custom objs.
`torch.save` will invoke regular pickle, for which we have bound `__setstate__`/`__getstate__` methods on `torch.ScriptObject`: https://fburl.com/code/4howyl4u.

This serializes in a different format than TorchScript does, which uses the TS C++ pickler.

The issue we were facing was using the Python pickler to save, and the C++ pickler to load. If we use the C++ pickler to both save and load (plus some plumbing to get type/object resolution to work correctly), then things should work.

Test Plan:
ran SherlockNoMad's repro
```
buck2 run 'fbcode//mode/dev-nosan' scripts/bahuang:export_torchbind -- --logging DBG
```

Got to a new error, which has to do with how we're initializing the graph, but will leave that for future diffs.

Reviewed By: SherlockNoMad

Differential Revision: D53248454

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118791
Approved by: https://github.com/qxy11, https://github.com/SherlockNoMad, https://github.com/khabinov
2024-02-01 10:09:07 +00:00
Yifu Wang
b778f44e97 Allow using native c10d_functional via _functional_collectives (#113057)
This diff introduces an env var `_USE_NATIVE_C10D_FUNCTIONAL` that tells `_functional_collective` to use native `c10d_functional` ops. The Python version and the native version will co-exist until we completely switch to the native version after more testing and verification.

NOTE: `DeviceMesh` support for native `c10d_functional` will be added in a subsequent PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113057
Approved by: https://github.com/LucasLLC, https://github.com/wconstab, https://github.com/wanchaol
2024-01-30 02:34:25 +00:00
Colin Peppler
1f6aa4b336 [mypy] Enable follow_imports = normal for mypy-torch.backends.* (#116311)
Summary:

Test Plan:

```
lintrunner --take MYPYINDUCTOR --all-files
ok No lint issues.

lintrunner -a
ok No lint issues.
Successfully applied all patches.
```

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116311
Approved by: https://github.com/int3
2024-01-25 20:17:22 +00:00
Jeff Daily
01abb5af21 additional support for float8_e4m3fnuz and _e5m2fnuz (#115214)
Follow up to #107586.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115214
Approved by: https://github.com/peterbell10, https://github.com/malfet
2024-01-22 18:33:41 +00:00
Guilherme Leobas
80cf0ce153 Enhance torch.vmap support from inside torch.compile (#116050)
This work rewrites vmap support in torch.compile by inlining most of
the frames into the existing FX graph. It also unlocks to PyTorch to
support features that were previously missing, such as keyword args.

Fixes: https://github.com/pytorch/pytorch/issues/114306

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116050
Approved by: https://github.com/zou3519
2024-01-22 17:53:45 +00:00
PyTorch MergeBot
b637fdc8b3 Revert "additional support for float8_e4m3fnuz and _e5m2fnuz (#115214)"
This reverts commit 74e1362499.

Reverted https://github.com/pytorch/pytorch/pull/115214 on behalf of https://github.com/PaliC due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/115214#issuecomment-1900815152))
2024-01-19 17:35:04 +00:00
Jeff Daily
74e1362499 additional support for float8_e4m3fnuz and _e5m2fnuz (#115214)
Follow up to #107586.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115214
Approved by: https://github.com/peterbell10
2024-01-19 00:50:18 +00:00
PyTorch MergeBot
2f84a9d37c Revert "[CUDNN][SDPA] Experimental cuDNN Flash Attention v2 Inference (#115663)"
This reverts commit 5aa92b5090.

Reverted https://github.com/pytorch/pytorch/pull/115663 on behalf of https://github.com/PaliC due to Unfortunately, this pr breaks cuda builds internally ([comment](https://github.com/pytorch/pytorch/pull/115663#issuecomment-1899388813))
2024-01-18 23:40:30 +00:00
Eddie Yan
5aa92b5090 [CUDNN][SDPA] Experimental cuDNN Flash Attention v2 Inference (#115663)
#113713

Going to clean up some of the checks and will remove draft status after.
Can be tested on SM80+ with `TORCH_CUDNN_MHA_ENABLED=1`.

CC @drisspg @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115663
Approved by: https://github.com/drisspg
2024-01-18 01:20:36 +00:00
FFFrog
7b0926cc3e Fix wrong class inheritance in pyi (#116404)
As the title stated.

f6dfbffb3b/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp (L153)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116404
Approved by: https://github.com/ezyang, https://github.com/wconstab
2024-01-12 21:25:29 +00:00
Bin Bao
70f3a530d7 [AOTI] Add pybind for AOTIModelContainerRunnerCpu and AOTIModelContainerRunnerCuda (#116269)
Summary: Now we can allocate an AOTIModelContainerRunner object instead of relying on torch.utils.cpp_extension.load_inline. Also renamed AOTInductorModelRunner to AOTIRunnerUtil in this PR.

Test Plan: CI

Reviewed By: khabinov

Differential Revision: D52339116

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116269
Approved by: https://github.com/khabinov
2024-01-04 18:58:24 +00:00
Aaron Gokaslan
bd10fea79a [BE]: Enable F821 and fix bugs (#116579)
Fixes #112371

I tried to fix as many of the bugs as I could, a few I could not figure out what the proper fix for them was though and so I left them with noqas.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116579
Approved by: https://github.com/ezyang
2024-01-01 08:40:46 +00:00
angelayi
6b91e6907e Add setUserEnabledNNPACK config (#116152)
When exporting a model with a convolution kernel on cpu, if mkldnn is disabled and nnpack is enabled, export will go down the nnpack optimized convolution kernel for certain shapes ((code pointer)[cd449e260c/aten/src/ATen/native/Convolution.cpp (L542-L552)]). This means that we will automatically create a guard on that certain shape. If users want to export without any restrictions, one option is to disable nnpack. However, no config function exists for this, so this PR is adding a config function, similar to the `set_mkldnn_enabled` function.

Original context is in https://fb.workplace.com/groups/1075192433118967/posts/1349589822345892/?comment_id=1349597102345164&reply_comment_id=1349677642337110.

To test the flag, the following script runs successfully:
```
import os

import torch
from torchvision.models import ResNet18_Weights, resnet18

torch.set_float32_matmul_precision("high")

model = resnet18(weights=ResNet18_Weights.DEFAULT)
model.eval()

with torch.no_grad():
    # device = "cuda" if torch.cuda.is_available() else "cpu"
    torch.backends.mkldnn.set_flags(False)
    torch.backends.nnpack.set_flags(False)   # <--- Added config
    device = "cpu"
    model = model.to(device=device)
    example_inputs = (torch.randn(2, 3, 224, 224, device=device),)
    batch_dim = torch.export.Dim("batch", min=2, max=32)
    so_path = torch._export.aot_compile(
        model,
        example_inputs,
        # Specify the first dimension of the input x as dynamic
        dynamic_shapes={"x": {0: batch_dim}},
        # Specify the generated shared library path
        options={
            "aot_inductor.output_path": os.path.join(os.getcwd(), "resnet18_pt2.so"),
            "max_autotune": True,
        },
    )

```

I'm not sure who to add as reviewer, so please feel free to add whoever is relevant!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116152
Approved by: https://github.com/malfet
2023-12-27 06:00:16 +00:00
Nikita Shulga
0aa185f394 [BE] Make torch.cuda.has_magma a build time check (#116299)
Perhaps originally one needed to query about GPU capability, but right now it's a simple check for a build time flag: 52f0457d7d/aten/src/ATen/cuda/detail/CUDAHooks.cpp (L165-L171)

Alternative, to avoid `at::hasMAGMA()` call  one can implement it as follows:
```cpp
  const auto use_magma = caffe2::GetBuildOptions().at("USE_MAGMA");
  return PyBool_FromLong(use_magma == "1");
```

Make this check very similar to `_has_mkldnn`
0978482afa/torch/csrc/Module.cpp (L1793-L1794)

Test plan:
 Run `lldb -- python3 -c "import torch;print(torch.cuda.has_magma)"` and make sure it returns True and that `cuInit` is not called

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116299
Approved by: https://github.com/seemethere, https://github.com/albanD
2023-12-26 23:37:23 +00:00
fduwjj
f6dfbffb3b [c10d] Add hashing as a debug feature for before and after NCCL collective call (#113238)
For now, we use `TORCH_DISTRIBUTED_DEBUG = DETAIL` to turn a debug feature which calculate the hashing for input tensors and output results of c10d collective in NCCL.  This is a debugging feature so that we can rule out the bug from c10d level.

<img width="840" alt="image" src="https://github.com/pytorch/pytorch/assets/6937752/cdc70b0b-ae3c-4efd-86ff-adc5c5ba505f">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113238
Approved by: https://github.com/wconstab, https://github.com/fegin
2023-12-25 22:25:38 +00:00
zdevito
4afe2687d5 Reland "Serve multistream graph captures from correct pool (#114647)" (#116199)
Fixes a variable shadowing problem that broke internal builds.

This reverts commit fe15645619.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116199
Approved by: https://github.com/eellison
2023-12-20 21:22:34 +00:00
PyTorch MergeBot
fe15645619 Revert "Serve multistream graph captures from correct pool (#114647)"
This reverts commit 8a445f7bd5.

Reverted https://github.com/pytorch/pytorch/pull/114647 on behalf of https://github.com/jeanschmidt due to breaking multiple internal build jobs, please check internal diff in order to obtain more details ([comment](https://github.com/pytorch/pytorch/pull/114647#issuecomment-1864840724))
2023-12-20 17:11:42 +00:00
zdevito
8a445f7bd5 Serve multistream graph captures from correct pool (#114647)
This fixes #114320 by placing the logic for determining whether to allocate
to a pool inside a callback that is controlled by CUDAGraph.cpp or by the
python bound api to allocate a stream directly to a pool.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114647
Approved by: https://github.com/ngimel, https://github.com/eellison
2023-12-18 18:24:15 +00:00
fduwjj
40ce9a4cfb [c10d] Create a python c10d API _set_pg_timeout to set timeout (#115453)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115453
Approved by: https://github.com/wconstab, https://github.com/H-Huang
2023-12-12 20:52:43 +00:00
Nikita Shulga
b706c4116d [MPS] Add MacOS 14 runtime check (#115512)
Prerequisite for adding more complex type support and FFT operation

Check using `conjugateWithTensor:name:` selector defined as follows
```objc
/// Returns the complex conjugate of the input tensor elements.
///
/// - Parameters:
///   - tensor: The input tensor.
///   - name: An optional string which serves as an identifier for the operation..
/// - Returns: A valid `MPSGraphTensor` object containing the elementwise result of the applied operation.
-(MPSGraphTensor *) conjugateWithTensor:(MPSGraphTensor *) tensor
                                   name:(NSString * _Nullable) name
MPS_AVAILABLE_STARTING(macos(14.0), ios(17.0), tvos(17.0))
MPS_SWIFT_NAME( conjugate(tensor:name:) );
```

- Rename `isOnMacOS13orNewer(unsigned minor)` hook to `isOnMacOSorNewer(major, minor)`
- Replace `torch._C.__mps_is_on_macos_13_or_newer` with `torch._C._mps_is_on_macos_or_newer`
- Add `torch.backends.mps.is_macos_or_newer` public API
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115512
Approved by: https://github.com/albanD
2023-12-11 21:11:42 +00:00
Chip Turner
937d616e82 Re-enable type checking for distributed_c10d.py (#115223)
Re-enable type checking for distributed_c10d.py

Type checking for distributed_c10d.py was inadvertently turned off in issues that have accumulated since.

Note: the backwards compatibility linter does not like some of these changes.  But they were incorrect before.  This needs human verification, however.

#suppress-api-compatibility-check

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115223
Approved by: https://github.com/wconstab
2023-12-09 11:07:54 +00:00
albanD
a2b89154bf New swap function (#111747)
This PR is proposing a new approach to solve the nn/optim only linked by python object identity problem.
The idea is to have a function that can swap the content of two Tensors t1 and t2 while preserving all the old references.
This would allow us to swap the `model.weight` with a new Tensor (can be any subclass of Tensor and any TensorImpl (xla, sparse, nested tensorimpl would work)). The use within nn will be done in a follow up.

This is done by swapping the whole content of the PyObject and then putting back the fields associated with external references (refcount, gc tracking and weakrefs).
Note that we have to properly handle all the cases where there is memory used before the public pointer PyObject* and where the PyObject is bigger due to dict/weakref being inlined (older CPython version) or due to slots.

The main limitation of this approach is that the number of slots need to match for the objects being swapped and thus limit usage of slots in subclasses.

Draft right now to see what @colesbury thinks about doing this?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111747
Approved by: https://github.com/colesbury
2023-12-08 18:49:35 +00:00
Pritam Damania
f505d76462 Bug fixes to DDP _update_process_group API. (#114194)
https://github.com/pytorch/pytorch/pull/113580 introduced the `DDP._update_process_group` API. However, the implementation did not correctly reset all of the necessary state in the reducer. In particular if an error occurred during backward, DDP would end up in an incorrect state.

As a result, in this PR I've enhanced the unit test to test for this case and also appropriately fixed resetting Reducer state.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114194
Approved by: https://github.com/rohan-varma
2023-11-27 23:52:40 +00:00
Antonio Kim
7fc292930c Add support for torch.Generator type in TorchScript (#110413)
- Add support for `torch.Generator` type in TorchScript
- Add `generator` args to all `torch.nn.init` functions that call `uniform_` or `normal_`
- Add support for `torch.Generator` in LTC's TorchScript backend (CC: @wconstab)

CC: @eellison @davidberard98 @GlebKazantaev @behzad-a
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110413
Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/glebk-cerebras, https://github.com/davidberard98
2023-11-21 23:07:21 +00:00
Jez Ng
631fb33fd6 Enable import following in MYPYNOFOLLOW (now MYPYINDUCTOR) (#113830)
Skipping importing some packages for now to make this change more
tractable.

For some reason, lintrunner on CI raises errors in all imported `.pyi` files,
even though it doesn't on my local machine. The errors are all from missing
generic types, as the MYPYINDUCTOR config has `disallow_any_generics`
set. I have thus added `disable-error-code` comments to the relevant files,
though I fixed a few that were easy enough.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113830
Approved by: https://github.com/Skylion007
ghstack dependencies: #113722, #113721
2023-11-17 18:24:21 +00:00
Jez Ng
0c8362de1a [dynamo] Make {guards,eval_frame}.py pass follow_imports typechecking (#113721)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113721
Approved by: https://github.com/Skylion007
ghstack dependencies: #113722
2023-11-17 18:24:21 +00:00
Brian Vaughan
dbb96ef30d improve annotation device parameters where a device ordinal is allowed (#113647)
Using mypy in code that depends on pytorch, I noticed that the type annotation doesn't allow a device ordinal.

`error: Argument "device" to "to_empty" of "Module" has incompatible type "int"; expected "str | device"  [arg-type]`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113647
Approved by: https://github.com/albanD
2023-11-17 14:41:22 +00:00
Pritam Damania
17e2313dd3 Add an API to DDP for dynamically updating the underlying process group. (#113580)
# Motivation

If we would like to reinitialize DDP with a different PG with `torch.compile`, we need to do the following:

```
del old_ddp
del old_pg
pg = init_pg(...)
ddp = DDP(pg)
model = torch.compile(DDP)
```

This results in recompilation of the entire model and is very expensive. Since the only thing we need to update is the PG, we should be able to do this without having to compile the model again.

# Proposal

As a result, in this PR I've introduced an `_update_process_group` API which can dynamically update the underlying ProcessGroup used by DDP without needing to reinitialize DDP again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113580
Approved by: https://github.com/fduwjj
2023-11-15 09:05:02 +00:00
drisspg
9b0f2f8d94 expose sdpa helpers to python (#110496)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110496
Approved by: https://github.com/jbschlosser
2023-11-15 07:34:34 +00:00
Jez Ng
fda94124d7 [inductor] Make {cudagraph_trees,decomposition,post_grad}.py pass follow_imports typechecking (#113609)
I added explicit imports to `kernel/__init__.py` as mypy doesn't seem to
understand an empty `__init__.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113609
Approved by: https://github.com/eellison
2023-11-15 05:04:11 +00:00
PyTorch MergeBot
252e68a83b Revert "Add support for torch.Generator type in TorchScript (#110413)"
This reverts commit 54493fe8c4.

Reverted https://github.com/pytorch/pytorch/pull/110413 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is, unfortunately, still breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/110413#issuecomment-1811625557))
2023-11-15 00:51:23 +00:00
Jez Ng
ffc3731dc4 Update TensorBase.to()'s' signature; create {guards,compiled_autograd}.pyi (#113536)
I had to explicitly import submodules in torch/_C/_dynamo/__init__.pyi
because mypy doesn't seem to understand empty `__init__.py[i]` files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113536
Approved by: https://github.com/ezyang
ghstack dependencies: #113412, #113535
2023-11-14 04:31:12 +00:00
Antonio Kim
54493fe8c4 Add support for torch.Generator type in TorchScript (#110413)
- Add support for `torch.Generator` type in TorchScript
- Add `generator` args to all `torch.nn.init` functions that call `uniform_` or `normal_`
- Add support for `torch.Generator` in LTC's TorchScript backend (CC: @wconstab)

CC: @eellison @davidberard98 @GlebKazantaev @behzad-a
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110413
Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/glebk-cerebras, https://github.com/davidberard98
2023-11-13 23:18:14 +00:00
Jez Ng
6805d1e1d6 [inductor] Make graph.py pass follow_imports typechecking (#113518)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113518
Approved by: https://github.com/Skylion007
ghstack dependencies: #113413
2023-11-11 22:15:46 +00:00
Jez Ng
a8cf04fd2a [inductor] Make {output_graph,pad_mm}.py pass follow_imports typechecking (#113413)
I changed OutputGraph.nn_modules' type to `Dict[str, Any]` because it
seems that `register_attr_or_module` can populate it with essentially
any type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113413
Approved by: https://github.com/Skylion007
2023-11-11 22:15:46 +00:00
Jez Ng
a2c32b8bd0 [inductor] Make codegen/{common,wrapper,cuda/cutlass_utils}.py pass follow_imports typechecking (#113411)
SymIntType is referenced by wrapper.py, so I added its .pyi definition.
I also added SymBoolType along the way for completeness.

The `insinstance` checks in wrapper.py reference torch.Type, which seems
to cause mypy to choke. Not entirely sure why; I've just added
type-ignore comments for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113411
Approved by: https://github.com/Skylion007
ghstack dependencies: #113409, #113410
2023-11-10 19:58:08 +00:00
Jez Ng
a3a55df4af [dynamo] Add .pyi declaration of _CacheEntry (#113305)
This is required for enabling follow-imports=silent; referenced by
_dynamo/types.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113305
Approved by: https://github.com/Skylion007, https://github.com/ezyang
ghstack dependencies: #113304
2023-11-09 21:55:49 +00:00
Wes Bland
9d765d28ca [pytorch] Add binding to get nccl version suffix (#112884)
Summary: Adds a Python to C binding to get the NCCL_SUFFIX value for more accurate NCCL version information and add that to the NCCL version tuple.

Differential Revision: D50978181

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112884
Approved by: https://github.com/kwen2501
2023-11-08 02:51:22 +00:00
Richard Zou
d1c092ae1b Update impl_abstract_pystub to be less boilerplatey (#113182)
Summary:

We've made the following changes:
- The new way to use the API is `m.impl_abstract_pystub(module, context)`.
  Every subsequent m.def of an op inside the TORCH_LIBRARY block gives
  the op the `impl_abstract_pystub`.
- Added a mechanism to determine if an operator was defined in Python or C++.
  Library.define in Python appends the op to a global set, which is analogous
  to what we do for tracking Library.impl.
- If someone does `torch.library.impl_abstract` in Python for an operator, then
  we require that it has an `impl_abstract_pystub` specified and we also check
  that the module in the `impl_abstract_pystub` is the same as the module where
  the call to `torch.library.impl_abstract` exists.
- Unfortunately we can't check the "context" (which is the buck target on
  buck-based systems) because buck sits above us.

bypass-github-export-checks

Test Plan: - existing tests

Differential Revision: D51080493

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113182
Approved by: https://github.com/ezyang
2023-11-08 00:39:00 +00:00
PyTorch MergeBot
bc3e2e03cd Revert "Update impl_abstract_pystub to be less boilerplatey (#112851)"
This reverts commit 6ae4e3a8d2.

Reverted https://github.com/pytorch/pytorch/pull/112851 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/112851#issuecomment-1799539354))
2023-11-07 18:53:13 +00:00
Richard Zou
6ae4e3a8d2 Update impl_abstract_pystub to be less boilerplatey (#112851)
Summary:
We've made the following changes:
- The new way to use the API is `m.impl_abstract_pystub(module, context)`.
  Every subsequent m.def of an op inside the TORCH_LIBRARY block gives
  the op the `impl_abstract_pystub`.
- Added a mechanism to determine if an operator was defined in Python or C++.
  Library.define in Python appends the op to a global set, which is analogous
  to what we do for tracking Library.impl.
- If someone does `torch.library.impl_abstract` in Python for an operator, then
  we require that it has an `impl_abstract_pystub` specified and we also check
  that the module in the `impl_abstract_pystub` is the same as the module where
  the call to `torch.library.impl_abstract` exists.
- Unfortunately we can't check the "context" (which is the buck target on
  buck-based systems) because buck sits above us.

Test Plan: - existing tests

Differential Revision: D50972148

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112851
Approved by: https://github.com/ezyang
2023-11-07 16:07:42 +00:00
PyTorch MergeBot
9a28a7b498 Revert "Add support for torch.Generator type in TorchScript (#110413)"
This reverts commit 27e31ab6e8.

Reverted https://github.com/pytorch/pytorch/pull/110413 on behalf of https://github.com/PaliC due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/110413#issuecomment-1799003164))
2023-11-07 15:53:32 +00:00
Ke Wen
bb7ac12cbf [ProcessGroupNCCL] Avoid recording stream for broadcast and scatter (#112896)
Summary: Follows PR #111431, save memory for DTensor init

Test Plan: Sandcastle

Reviewed By: wanchaol

Differential Revision: D50985365

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112896
Approved by: https://github.com/wanchaol
2023-11-07 15:44:04 +00:00
Will Constable
ff51f94e32 [Reland] Fix default timeouts for python entrypoints (e.g. init_process_group) (#113094)
Previous PRs changed the c++ default timeout for PGNccl, but this path
was only hit in some cases, and the python defaults took over in other
cases.

This PR ensures that NCCL pg always default to the changed NCCL-specific
timeout value.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113094
Approved by: https://github.com/fduwjj
2023-11-07 05:34:26 +00:00
PyTorch MergeBot
75adb9f371 Revert "Fix default timeouts for python entrypoints (e.g. init_process_group) (#112893)"
This reverts commit f9d47e1381.

Reverted https://github.com/pytorch/pytorch/pull/112893 on behalf of https://github.com/clee2000 due to sorry this seems to have broken inductor f9d47e1381 https://github.com/pytorch/pytorch/actions/runs/6776367936/job/18418174752 ([comment](https://github.com/pytorch/pytorch/pull/112893#issuecomment-1796979811))
2023-11-06 22:49:53 +00:00
Antonio Kim
27e31ab6e8 Add support for torch.Generator type in TorchScript (#110413)
- Add support for `torch.Generator` type in TorchScript
- Add `generator` args to all `torch.nn.init` functions that call `uniform_` or `normal_`
- Add support for `torch.Generator` in LTC's TorchScript backend (CC: @wconstab)

CC: @eellison @davidberard98 @GlebKazantaev @behzad-a
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110413
Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/glebk-cerebras, https://github.com/davidberard98
2023-11-06 21:27:02 +00:00
Will Constable
f9d47e1381 Fix default timeouts for python entrypoints (e.g. init_process_group) (#112893)
Previous PRs changed the c++ default timeout for PGNccl, but this path
was only hit in some cases, and the python defaults took over in other
cases.

This PR ensures that NCCL pg always default to the changed NCCL-specific
timeout value.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112893
Approved by: https://github.com/xw285cornell, https://github.com/kwen2501, https://github.com/XilunWu
ghstack dependencies: #112611, #112803
2023-11-06 20:48:39 +00:00
Jez Ng
b8ac5bbcbd [dynamo] Enable typechecking for bytecode_transformation.py (#112561)
As part of this diff, I have upgraded the `python_version` config setting to 3.11. `bytecode_transformation.py` (and a few other files) have functions using APIs only available in Python 3.11+. Those APIs are gated by a sys.version_info check in their typeshed .pyi files. So setting the min version to 3.11 allows those functions to typecheck properly.

An alternative is to make the relevant types Any:

```
if sys.version_info >= (3, 11):
    _Positions = dis.Positions
else:
    _Positions = Any
```

However, with python_version = 3.8, that means we're not getting any useful typechecking signal when encountering values of type _Position.

Changing the python_version to 3.11 does mean that we will stop typechecking codepaths that run only on lower versions, but that seems a small price to pay. It does also mean that we won't catch code that uses newer APIs without the appropriate version check, but again, not sure this has much of an impact.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112561
Approved by: https://github.com/ezyang
2023-11-04 19:36:27 +00:00
isdanni
43fb5147e2 [BE] Enable Ruff's Flake8 PYI001 (#112823)
Enable [unprefixed-type-param (PYI001)](https://docs.astral.sh/ruff/rules/unprefixed-type-param/#unprefixed-type-param-pyi001)

Link: #110950
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112823
Approved by: https://github.com/Skylion007
2023-11-03 17:25:39 +00:00
Yifu Wang
ac9476ba99 Add .boxed() to c10d::ProcessGroup and c10d::Work's pybind (#111997)
Summary:
When passed from C++ to Python, `c10d::ProcessGroup` and `c10d::Work` are automatically converted to their pybind class which can't be used for dispatcher ops. `.boxed()` exposes `c10d::ProcessGroup` and `c10d::Work` as boxed custom class object to Python.

```python
import tempfile

import torch
import torch.distributed as dist

if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmpf:
        dist.init_process_group(
            backend="nccl", init_method=f"file://{tmpf.name}", rank=0, world_size=1
        )
        group = dist.group.WORLD
        print(group)
        print(group.boxed())
```

```
<torch.distributed.distributed_c10d.ProcessGroup object at 0x7fe42fb78d30>
ScriptObject <__torch__.torch.classes.c10d.ProcessGroup>
```

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111997
Approved by: https://github.com/lw
2023-11-02 20:35:20 +00:00
Kurt Mohler
fd209543d5 Add torch.utils.deterministic.fill_uninitialized_memory flag (#111377)
Part of #109802

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111377
Approved by: https://github.com/albanD, https://github.com/aaronenyeshi
2023-11-01 16:10:09 +00:00
Jithun Nair
333d5821ee [ROCm] Add gcnArchName to collect_env and torch.cuda.get_device_properties (#107477)
Printing just the device name is not helpful when investigating PyTorch issues filed for specific AMD GPUs, as the support/issue might depend on the gfx arch, which is part of the gcnArchName property.

`torch.cuda.get_device_properties(0).gcnArchName` will print the value of the `gcnArchName` property: eg.
```
>>> torch.cuda.get_device_properties(0).gcnArchName
'gfx906:sramecc+:xnack-'
```

```
root@6f064e3c19fb:/data/pytorch/test# python ../torch/utils/collect_env.py
...
GPU models and configuration: AMD Radeon Graphics(gfx906:sramecc+:xnack-)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107477
Approved by: https://github.com/albanD
2023-10-31 23:05:36 +00:00
PyTorch MergeBot
ace2713d1e Revert "Add torch.utils.deterministic.fill_uninitialized_memory flag (#111377)"
This reverts commit f1785373c0.

Reverted https://github.com/pytorch/pytorch/pull/111377 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/111377#issuecomment-1784179040))
2023-10-29 17:41:55 +00:00
dshi7
fbff99ffea Add regex matching to Inductor all2all collective unit tests (#112077)
Fixes #111776

Support check_regex in FileCheck() by adding `find_regex` in `struct TORCH_API StringCordView`.
Callsite accepts RE syntax for std::regex.

However, I haven't figured out submatch ID yet.
For example, "buf5[0], buf6_inputs[0]" is still considered a match.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112077
Approved by: https://github.com/yf225
2023-10-26 08:29:30 +00:00
Kurt Mohler
f1785373c0 Add torch.utils.deterministic.fill_uninitialized_memory flag (#111377)
Part of #109802

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111377
Approved by: https://github.com/albanD
2023-10-26 02:39:06 +00:00
Jez Ng
ad3572a5dc Unify torch.SymInt and torch.types.SymInt (#110573)
Per @ezyang, this should be fine

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110573
Approved by: https://github.com/ezyang
2023-10-24 16:17:23 +00:00
Richard Zou
66b74d231a Change torch.library.impl to accept a device string (#111659)
torch.library.impl now accepts a device string (e.g. "cpu", "cuda"). It
still accepts DispatchKey strings, but we no longer document this, because
using arbitrary DispatchKeys is more for the power users.

We map the device string to a DispatchKey and then register the impl for
said DispatchKey. A user may also specify multiple device strings at once
or specify "types=default" to get a CompositeExplicitAutograd registration.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111659
Approved by: https://github.com/soulitzer
ghstack dependencies: #111380
2023-10-23 23:02:41 +00:00
Ke Wen
18cc8a92ac [ProcessGroupNCCL] Avoid recording stream for synchronous ops (#111431)
For synchronous ops (i.e. `asyncOp = False`), we don't want to record streams because we know that the NCCL stream will join back to the "current" stream right after this op. So we might just as well keep the stream ownership of the input/output tensors unchanged. The benefit would be that the allocation/free of the tensors would look deterministic to the "current" stream so that the caching allocator can reuse memory pool for this stream in a clever way.

To prevent the input/output tensors from being recycled by python, we rely on the stashing mechanism in ProcessGroupNCCL (which can be also turned on by setting `TORCH_NCCL_AVOID_RECORD_STREAMS=1`).

This mechanism change is for libraries like FSDP which uses `all_gather_into_tensor` and `reduce_scatter_tensor` in a synchronous way and which cannot set `TORCH_NCCL_AVOID_RECORD_STREAMS=1` for their users. And therefore, this change is limited to these two collectives for now.

Cc: @awgu @janeyx99 @albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111431
Approved by: https://github.com/H-Huang
2023-10-19 00:41:09 +00:00
Nikita Shulga
f84755bcac Fix _CudaStreamBase type annotations (#111387)
Make it inherit from `Stream` as indeed it is, see 97a513ed07/torch/csrc/cuda/Stream.cpp (L208) and
```
python3 -c "import torch;print(torch._C._CudaStreamBase.__base__)"
<class 'torch.Stream'>
```

Fixes https://github.com/pytorch/pytorch/issues/111268

TODO (in separate PR): Revive `test_typing` and add regression test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111387
Approved by: https://github.com/jeanschmidt, https://github.com/Skylion007
2023-10-16 23:26:58 +00:00
PyTorch MergeBot
1e70f4d02c Revert "Reland #2 "[C10] PG observability hooks. (#108815, #110907)" (#111072)"
This reverts commit bb1424d46e.

Reverted https://github.com/pytorch/pytorch/pull/111072 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/111072#issuecomment-1765399829))
2023-10-16 23:03:26 +00:00
Jesse Cai
4c01686027 Public API for constructing NT with jagged layout from tensor list (#111078)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111078
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
ghstack dependencies: #109123
2023-10-13 03:27:41 +00:00
Will Constable
bb1424d46e Reland #2 "[C10] PG observability hooks. (#108815, #110907)" (#111072)
This reverts commit 314a502eb0.

Changes since original PR:
Reland 1
 *  rename torch.distributed.hooks to torch.distributed._hooks

Reland 2
 * make _hooks importable even if !distributed.is_available()
 * handle cuda driver exit intermittent failure caused by new cuda api usage in callback caller (see prev PR in stack)

(original PR https://github.com/pytorch/pytorch/pull/108815 desc copied below)

Expose a set of observability hooks into C10D such that our users can
detect collectives failure both faster and more easily.

The design is similar to NCCL desync debug that it minimized the
overhead by doing most of the work out of the main thread.

This PR introduces a new module torch.distributed.hooks that exposes the following set of methods:

    register_collective_start_hook
    register_collective_end_hook
    register_process_group_hook

The process group hook exposes PG creation on the member ranks and call them inline from the
the PG creation code. This is fine since this happens during initialization and a limited number of times.

The collective start/end hooks are fired from a single background thread. It reads
events from a C++ queue and dispatches over.

Queue notification is oddly done using a pipe, this is needed so python can abort the thread on shutdown
and have it as background thread. This is not possible with more reasonable choices like a condvar.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111072
Approved by: https://github.com/malfet
ghstack dependencies: #111061
2023-10-12 16:59:23 +00:00
PyTorch MergeBot
314a502eb0 Revert "Reland "[C10] PG observability hooks. (#108815)" (#110907)"
This reverts commit 7678cd22af.

Reverted https://github.com/pytorch/pytorch/pull/110907 on behalf of https://github.com/huydhn due to Sorry for reverting this, but macos job in trunk starts failing after this 7678cd22af ([comment](https://github.com/pytorch/pytorch/pull/110907#issuecomment-1756497387))
2023-10-11 00:23:42 +00:00
Will Constable
7678cd22af Reland "[C10] PG observability hooks. (#108815)" (#110907)
This reverts commit ff0358b038.

(original PR https://github.com/pytorch/pytorch/pull/108815 desc copied below)

Expose a set of observability hooks into C10D such that our users can
detect collectives failure both faster and more easily.

The design is similar to NCCL desync debug that it minimized the
overhead by doing most of the work out of the main thread.

This PR introduces a new module torch.distributed.hooks that exposes the following set of methods:

    register_collective_start_hook
    register_collective_end_hook
    register_process_group_hook

The process group hook exposes PG creation on the member ranks and call them inline from the
the PG creation code. This is fine since this happens during initialization and a limited number of times.

The collective start/end hooks are fired from a single background thread. It reads
events from a C++ queue and dispatches over.

Queue notification is oddly done using a pipe, this is needed so python can abort the thread on shutdown
and have it as background thread. This is not possible with more reasonable choices like a condvar.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110907
Approved by: https://github.com/fduwjj
2023-10-10 20:09:40 +00:00
soulitzer
fda0a965c7 [reland] Support SingletonSymNode mul with coefficient (#110673)
reland of https://github.com/pytorch/pytorch/pull/110369
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110673
Approved by: https://github.com/ezyang
2023-10-10 19:37:17 +00:00
PyTorch MergeBot
ff0358b038 Revert "[C10] PG observability hooks. (#108815)"
This reverts commit 0c7a877745.

Reverted https://github.com/pytorch/pytorch/pull/108815 on behalf of https://github.com/albanD due to Add a new torch.distributed.hooks namespace but does not document it, test was added this morning ([comment](https://github.com/pytorch/pytorch/pull/108815#issuecomment-1751327751))
2023-10-06 19:49:49 +00:00
Rodrigo Kumpera
0c7a877745 [C10] PG observability hooks. (#108815)
Expose a set of observability hooks into C10D such that our users can
detect collectives failure both faster and more easily.

The design is similar to NCCL desync debug that it minimized the
overhead by doing most of the work out of the main thread.

This PR introduces a new module torch.distributed.hooks that exposes the following set of methods:

    register_collective_start_hook
    register_collective_end_hook
    register_process_group_hook

The process group hook exposes PG creation on the member ranks and call them inline from the
the PG creation code. This is fine since this happens during initialization and a limited number of times.

The collective start/end hooks are fired from a single background thread. It reads
events from a C++ queue and dispatches over.

Queue notification is oddly done using a pipe, this is needed so python can abort the thread on shutdown
and have it as background thread. This is not possible with more reasonable choices like a condvar.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108815
Approved by: https://github.com/wconstab, https://github.com/fduwjj
2023-10-06 18:52:46 +00:00
PyTorch MergeBot
1c3fae46ee Revert "Support SingletonSymNode mul with coefficient (#110369)"
This reverts commit eb8feb8ff8.

Reverted https://github.com/pytorch/pytorch/pull/110369 on behalf of https://github.com/PaliC due to bottom diff is causing a plethora of internal failures ([comment](https://github.com/pytorch/pytorch/pull/110369#issuecomment-1749802899))
2023-10-05 23:51:28 +00:00
Edward Z. Yang
6a974bec5d Change flash attention outputs to be SymInt instead of int (#110533)
Fixes https://github.com/pytorch/pytorch/issues/110322

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110533
Approved by: https://github.com/albanD
2023-10-05 01:00:07 +00:00
soulitzer
eb8feb8ff8 Support SingletonSymNode mul with coefficient (#110369)
We want to be able to use SingletonSymNode to represent strides for Jagged layout tensor. The following is for 3D, but easily generalizable to higher dimensions.

Constraints:
- [B, x, D] (where x represents the "variably lengthed dim") can be strided in two ways [x, 1, sum(x)] and [dx, d, 1]. We need two different placeholder values depending on how the jagged tensor is strided.
- When doing operations we need the strides of output tensors to be expressable in terms of the strides and sizes of the inner tensors. Given [B, x, D] @ [D, D'], the output strides is [x * D', D', 1] rather than some opaque [x2, D', 1]. This constraint exists because if I'm tracing, I need a symint to represent the output stride. This symint needs to come from somewhere; I get it in several ways: (1) create a constant, (2) unbacked symint, (3) create a new input using a source, (4) output of an operation on an existing symint. It is clear that (4) is what we want here, which brings us to the design below.

Design:

Given the two constraints, the most straightforward way to implement this is actually to update SingletonSymNode to include some scalar factor, i.e. Morally, SingletonSymNode represents `factor * [s_0, s_1, …, s_n]` This enables us to symbolically compute strides from sizes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110369
Approved by: https://github.com/ezyang
ghstack dependencies: #110044
2023-10-04 22:56:15 +00:00
Moritz Hennen
09c598745c Rename torch._C._TensorBase to TensorBase (#109940)
I have gone ahead and implemented the renaming of the type `torch._C._TensorBase` to a non-private class name `TensorBase`.
The changes also include leaving `torch._C._TensorBase` as an alias to the new type: 70458768fb/torch/csrc/autograd/python_variable.cpp (L2196-L2197) both in the c++ code and in the corresponding `__init__.pyi.in` file:
70458768fb/torch/_C/__init__.pyi.in (L1522)

Fixes #109438

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109940
Approved by: https://github.com/ezyang
2023-09-25 19:10:22 +00:00
Brian Hirsh
63526a63f5 Make FunctionalTensor subclass to be more like functorch (interaction with ZeroTensor + Conjugate key) (#109023)
I added some tests for Conj, Neg and ZeroTensor for both python and C++ functionalization. This also fixes a nasty segfult when running a functorch `jacfwd` test with `torch.compile`, once AOTAutograd is using `FunctionalTensor`.

Changes:

(1) I use Jeffrey's `make_wrapper_subclass(extra_dispatch_keys)` kwarg to plumb extra dispatch keys ontoto the wrapper, mirroring what C++ functionalization does (C++ functionalization will mirror all dispatch keys from the inner tensor to the wrapper, except for python and functorch keys).

(2) FunctionalTensorMode will decompose CompositeImplicitAutograd ops, since (for example) ZeroTensor kernels can send ops like `.to()` directly to the Python key. We'll need a way to toggle this later for pre-dispatch functionalization

(3) Bound `_ForceDispatchKeyGuard` and BatchedTensorImpl's dispatch keyset to python

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109023
Approved by: https://github.com/zou3519
ghstack dependencies: #108654, #109662, #109632
2023-09-22 07:09:04 +00:00
Randolf Scholz
c6b9481c15 Update type hint for Tensor.__getitem__. (#109531)
Better type-hint that's similar in spirit to `numpy.ndarray.__getitem__`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109531
Approved by: https://github.com/ezyang
2023-09-21 18:19:38 +00:00
hauntsaninja
2cd0b94533 Hide __getattr__ from type checkers (#109683)
Visibility of this causes type checkers to conservatively assume that all attributes are defined on torch module.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109683
Approved by: https://github.com/ngimel, https://github.com/ezyang, https://github.com/malfet
2023-09-21 17:01:23 +00:00
soulitzer
5252fcb133 Handle constant SymBool in unary and binary operations (#109169)
In this PR:
- When Constant SymNode are detected in unary/binary ops demote them to plain int/bool before proceeding. Sometimes this means doing a unary op with a Constant SymNode would result in a plain bool.
- Introduce an is_symbolic method, only available from Python. We need this because isinstance(x, SymInt) is no longer sufficient to check whether a given int/SymInt is symbolic or not. See later PR in the stack to see how this is used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109169
Approved by: https://github.com/ezyang
2023-09-20 20:37:15 +00:00
Brian Hirsh
f22b303f65 Add TorchDispatch version of functionalization (#106404)
This PR adds a new `FunctionalTensor` subclass, and `FunctionalTensorMode` torch dispatch mode. Together, this class/mode are a lightweight wrapper around our existing C++ functionalization logic.

This idea came from Ed - later in the stack, I want to be able to run functionalization **underneath** torch_dispatch, when performing tracing in AOTAutograd. I can't do this easily with vanilla C++ functionalization, because it has a dedicated dispatch key that always runs before TorchDispatch. However, by adding a torch_dispatch mode shim around functionalization, we can use functionalization as a torch_dispatch mode, which will make it easier to run underneath other modes later.

This PR provides the basic new classes, and some light testing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106404
Approved by: https://github.com/ezyang
2023-09-15 20:19:25 +00:00
Randolf Scholz
32f50b7021 Improve type annotations for jit.script (#108782)
Fixes #108781

- [x] added `@overload` for `jit.script`
- [x] added typing unittest in `test/typing/pass/jit.py`
    - NOTE: unittest is not automatically checked by mypy when executing lintrunner currently. (how to fix?)
- [x] used `stubgen` to create [torch/jit/_script.pyi](https://github.com/pytorch/pytorch/pull/108782/files#diff-738e66abee2523a952b3ddbaecf95e187cce559473cf8c1b3da7c247ee5d1132) and added overloads there. (adding them inside `_script.py` itself interfered with JIT engine)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108782
Approved by: https://github.com/ezyang
2023-09-13 19:20:25 +00:00
soulitzer
2bcff92540 Add NestedTensor python subclass (#108314)
Description coming soon

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108314
Approved by: https://github.com/jbschlosser
ghstack dependencies: #108808
2023-09-11 18:29:20 +00:00
Aaron Bockover
bd1229477d [ONNX] Add initial support for FP8 ONNX export (#107962)
This PR resurrects @tcherckez-nvidia's #106379 with changes to resolve conflicts against newer `main` and defines our own constants for the new ONNX types to [avoid breaking Meta's internal usage of an old ONNX](https://github.com/pytorch/pytorch/pull/106379#issuecomment-1675189340).

- `::torch::onnx::TensorProto_DataType_FLOAT8E4M3FN=17`
- `::torch::onnx::TensorProto_DataType_FLOAT8E5M2=19`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107962
Approved by: https://github.com/justinchuby, https://github.com/titaiwangms
2023-09-08 20:40:39 +00:00
Rodrigo Kumpera
b26af5d5ac [c10d] Add TCPSTore libuv backend support to c10d rendezvous. (#108284)
This enables libuv under env and tcp urls.

Under env either use the environment variable USE_LIBUV=1
or the url parameter use_lib=1.

Under tcp use the url parameter use_lib=1.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108284
Approved by: https://github.com/H-Huang, https://github.com/XilunWu
2023-09-07 21:39:58 +00:00
David Berard
06b173780d [dynamo] "TorchDynamo Cache Lookup" event: use C++ api (#108436)
**Background**: "TorchDynamo Cache Lookup" events appear in traces to indicate a dynamo cache lookup; it's useful to check when cache lookups are taking a long time. To add a profiler event, one can use the `torch.profiler.record_function` context manager, or the C++ equivalent. Previously, the python version was used; first, when the profiler was enabled, callbacks for record_function_enter and record_function_exit were registered; then those would be called before and after every cache lookup.

**This PR**: Instead of calling the python bindings for `torch.profiler.record_function`, directly call the C++ implementation. This simplifies a lot of the code for binding C/C++. It also improves performance; previously there was a lot of overhead in the "TorchDynamo Cache Lookup" event, making the event artificially take a long time. After this change the events now appear shorter, because there's less overhead in starting/stopping the event: in other words, the profiler no longer distorts the results as much.

**Performance results**:
I ran using the script below on a cpu-only 1.6GHz machine. I report the median time (from 100 measurements) of a "TorchDynamo Cache Lookup" event before and after this PR. I think it is reasonable to consider the difference to be due to a reduction in overhead.

<details>

<summary>Benchmarking script</summary>

```python
def fn(x, y):
    return (x * y).relu()

a, b = [torch.rand((4, 4), requires_grad=True) for _ in range(2)]

opt_fn = torch.compile(fn)

opt_fn(a, b)
opt_fn(a, b)

with torch.profiler.profile() as prof:
    opt_fn(a, b)
```

</details>

Median before PR: 198-228 us (median of 100, measured 5 times)
Median after PR: 27us

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108436
Approved by: https://github.com/anijain2305, https://github.com/jansel
2023-09-04 04:37:26 +00:00
Pritam Damania
704b0b3c67 [RESUBMIT] Standardize on error types for distributed errors. (#108191)
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.

This results in messy code during error handling somewhat like this:
```
if "NCCL" in exception_str:
  ...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
  ...
if "The client socket has timed out after" in exception_str:
  ...
if "Broken pipe" in exception_str:
  ...
if "Connection reset by peer" in exception_str:
  ...
```

To address this issue, in this PR I've ensured added these error types:

1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and referred to PG backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108191
Approved by: https://github.com/H-Huang
2023-08-30 21:47:39 +00:00
Brian Hirsh
da54f3c519 reorder proxy / fake modes so they always run last (#104482)
**Update:** Made refactor of the original PR. See the original description below, but here I'll describe the updates:

(1) TLS changes in `TorchDispatchModeTLS.h/cpp`.

I added a `TorchDispatchModeKey` enum, that (for now) just contains PROXY and FAKE. The ModeTLS used to just contain a `std::vector<std::shared_ptr<c10::SafePyObject>>` corresponding to the mode stack. It now **also** contains a separate array of "infra modes", indexed by mode key (PROXY and FAKE, with a new addition, FUNCTIONAL, coming later in the stack).

`TorchDispatchModeTLS::push_onto_stack` and `TorchDispatchModeTLS::pop_stack` are now a bit more complicated. Pushing accepts an optional mode_key, which if set, tells us to add the given mode directly to our "infra_modes" array. Popping will first check the "user mode" stack, before trying to pop anything from the infra mode stack. It also optionally returns the mode key of the mode we popped if there was one - that way if we push that same mode back onto the TLS later, we know where it goes.

`TorchDispatchModeTLS::dispatch_mode_enabled()` now accepts an optional `skip_infra_modes` param, so you can separately query if there are "any modes at all", or if there are "any user modes".

`TorchDispatchModeTLS::get/set/unset_mode()` all take in a mode key, and get/set/unset the mode at that particular mode key (meaning they are only meant to be used for infra modes).

There were also some mild codegen changes to support the new enum

(2) `fake_tensor.py/proxy_tensor.py/_python_dispatch.py`

The way I tell the infra that certain subclasses/modes are "infra" is through the enum: I gave `FakeTensor` and `FakeTensorMode` a `self._mode_key = torch._C.TorchDispatchModeKey.FAKE`. `TorchDispatchMode.__enter/exit__()` (in `_python_dispatch.py` now check if the current mode has a mode key, and if so they plumb it into any `push_onto_stack()` calls (which eventually instructs `TorchDispatchModeTLS` where to put the mode). Same thing for `ProxyTorchDispatchMode`.

I also had to change both of these mode's enter/exit, to handle the fact that there can no longer be multiple proxy/fake modes on the mode stack at once. I updated them both to have a `self.enter_stack: List[Optional[TorchDispatchMode]]` - whenever we push a given mode in `__enter__`, we remove the current ambient fake/proxy mode from the mode stack, and save it in `enter_stack`, so that on exit we can reset the state properly.

(2) dispatching logic in `python_arg_parser.cpp`

This is where the core dispatching logic changes are. I added two helpers, `dispatch_on_subclass()` and `dispatch_on_mode()`. The overall dispatching order is now:
```
(a) dispatch_on_mode()  # try user modes first (where the mode stack automatically considers infra modes last)
(b) dispatch_on_subclass() # try user subclasses next (skipping infra subclasses)
(c) dispatch_on_subclass() # try infra subclasses next (skipping user subclasses)
```

Note that we still want "user subclasses" to run before "infra modes". As Ed helped me realize, this will work today: If proxy/fake modes in step 1, they'll return NotImplemented if they see a user subclass, allowing us to redispatch to the user subclass.

How do (b) and (c) distinguish between user and infra subclasses? Infra subclasses (FakeTensor, and later FunctionalTensor) are required to have a `_mode_key` hidden on the subclass - so we filter via arguments that do/don't have the _mode_key.

(3) I also changed `DoubleTensor` to `TwoTensor` to minimize confusion (@albanD  pointed out that DoubleTensor would be easily confused with `torch.FloatTensor` and friends).

----- original description below -----

The main purpose of this PR is to fix the "ordering problem" between torch_dispatch modes, where we want to ensure that our Fake and Proxy dispatch modes always run **after** any dispatch modes created by the user, regardless of where they are in the stack. See this doc for more details: https://docs.google.com/document/d/1COQ291nOZvtFnzGTQMJqoYZ3sttEYFw_7HbfSyL8gcA/edit

Full set of changes below. I ended up including a few semi-related changes in this PR that I documented - but if folks would rather I separate them out, happy to try to do that.

**(1) Add dedicated TLS slots for FakeTensorMode and ProxyTensorMode**

This is the main component of this PR. There are two new slots, `TorchDispatchModeTLS.fake_mode_` and `TorchDispatchModeTLS.proxy_mode_`, which correspond to a single "global" fake and proxy mode. There is now an invariant that `torchDispatchModeState.stack_` can never contain either of these modes.

I also added a `TorchDispatchModeTLS::maybe_highest_mode()` helper that consults the `stack_` as well as both the proxy and fake slots, and returns the highest priority mode - this is because there are a few places in the codebase where we legitimately want to get the highest priority mode, *including* fake or proxy, if one is set.

This also made the implementations of the existing `disable_proxy_modes_tracing()` and `get_innermost_proxy_mode()` marginally simpler.

**(2) Updated the dispatching logic in handle_torch_function_no_python_arg_parser()**

This is the function that actually figures out which torch_dispatch implementation to call, given the current mode stack and tensor subclass inputs. This function got marginally more complicated as part of the refactor: First we inspect the mode stack and any non-fake subclass inputs. Then we check for the proxy mode slot. Then we check for the Fake mode slot, before finally checking for any fake subclass inputs.

**(3) new python `_get_fake_tensor_mode()` and `_get_proxy_tensor_mode()` API's**

Before, if you wanted to see if proxy or fake modes were active in python, you would have to consult the mode stack. Since these two modes are no longer part of the actual mode stack, I added two new API's to directly check if either proxy or fake modes are active.

**(4) Allow traceable tensor subclasses to access storages from python**
This is convenient later in the stack, where AOTAutograd needs to detect aliasing of inputs and outputs, where those inputs and outputs might be tensor subclasses. Previously, `x.untyped_storage()` would raise an error if `x` was a subclass. In this PR, I tried to relax this constraint as little as possible: `THPVariable_storage()` will only try to return a storage to python if the tensor subclass that you are passing in is "traceable"

**(5) Fixed subclass fakeification**

@wanchaol recently added support to be able to fakeify tensor subclasses. That fakeification logic works in most cases, but there is one case it doesn't handle: autograd metadata. In particular, since autograd sees our tensor subclasses and not their desugared tensors, we need to make sure that our fakeified subclass has the same autograd metadata as the original subclass. I updated `meta_utils.py` to make sure that the autograd metadata is correct.

**(6) make tensor subclasses resizeable**

Previously we didn't allow tensor subclasses to be resizeable. I ran into an issue where fakeifying a tensor subclass occasionally requires swapping out its storage, which can involve resizing the tensor. Mechanically, this required updating `at::for_blob()` to expose a way to request that the tensor that you create has resizeable storage, and then using this new API in `_make_wrapper_tensor()`.

**(7) Added a basic DoubleTensor subclass for testing**

I use this subclass more later in this stack in my AOTAutograd tests - but it serves as a simple subclass example to test the dispatch ordering in this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104482
Approved by: https://github.com/ezyang
ghstack dependencies: #107415
2023-08-29 02:36:48 +00:00
PyTorch MergeBot
d4ff06ec84 Revert "Standardize on error types for distributed errors. (#107651)"
This reverts commit 0e2317479b.

Reverted https://github.com/pytorch/pytorch/pull/107651 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing inductor test in trunk for one of its model moco ([comment](https://github.com/pytorch/pytorch/pull/107651#issuecomment-1696578138))
2023-08-28 23:58:33 +00:00
Pritam Damania
0e2317479b Standardize on error types for distributed errors. (#107651)
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.

This results in messy code during error handling somewhat like this:
```
if "NCCL" in exception_str:
  ...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
  ...
if "The client socket has timed out after" in exception_str:
  ...
if "Broken pipe" in exception_str:
  ...
if "Connection reset by peer" in exception_str:
  ...
```

To address this issue, in this PR I've ensured added these error types:

1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and referred to PG backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107651
Approved by: https://github.com/H-Huang
2023-08-28 21:58:15 +00:00
Elias Ellison
0a9778a372 Expose cudaStreamCaptureMode in CUDA Graphs, use local setting in inductor (#107407)
>  capture_error_mode (str, optional): specifies the cudaStreamCaptureMode for the graph capture stream.
Can be "global", "thread_local" or "relaxed". During cuda graph capture, some actions, such as cudaMalloc,
 may be unsafe. "global" will error on actions in other threads, "thread_local" will only error for
 actions in the current thread, and "relaxed" will not error on these actions.

Inductor codegen is single-threaded, so it should be safe to enable "thread_local" for inductor's cuda graph capturing. We have seen errors when inductor cudagraphs has been used concurrently with data preprocessing in other threads.

Differential Revision: [D48656014](https://our.internmc.facebook.com/intern/diff/D48656014)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107407
Approved by: https://github.com/albanD, https://github.com/eqy
2023-08-25 01:44:26 +00:00
David Berard
614b865721 [profiler] _RecordFunctionFast - faster python bindings for record_function (#107195)
torch.profiler.record_function is relatively slow; for example, in some benchmarks I was running, x.view_as(x) was ~2us, and ~16-17us when wrapped in a record_function context. The reasons for this are: dispatcher overhead from going through an op (the main source of overhead), python binding / python conversion overhead, and some overhead from the context manager.

This new implementation is faster, but it won't work with torchscript. Based on the benchmarks I was running, it adds 0.5-0.7us overhead per call when the profiler is turned off. To use it, you can just:

```python
with torch._C._profiler_manual._RecordFunctionFast("title"):
    torch.add(x, y)
```

It implements a context manager in python which directly calls the record_function utilities, instead of calling through an op.
* The context manager is implemented directly in python because the overhead from calling a python function seems non-negligible
* All the record_function calls, python object conversions are guarded on checks for whether the profiler is enabled or not. It seems like this saves a few hundred nanoseconds.

For more details about the experiments I ran to choose this implementation, see [my record_functions experiments branch](https://github.com/pytorch/pytorch/compare/main...davidberard98:pytorch:record-function-fast-experiments?expand=1).

This also adds a `torch.autograd.profiler._is_profiler_enabled` global variable that can be used to check whether a profiler is currently enabled. It's useful for further reducing the overhead, like this:

```python
if torch.autograd.profiler._is_profiler_enabled:
    with torch._C._profiler_manual._RecordFunctionFast("title"):
        torch.add(x, y)
else:
    torch.add(x, y)
```

On BERT_pytorch (CPU-bound model), if we add a record_function inside CachedAutotuning.run:
* Naive torch.profiler.record_function() is a ~30% slowdown
* Always wrapping with RecordFunctionFast causes a regression of ~2-4%.
* Guarding with an if statement - any regression is within noise

**Selected benchmark results**: these come from a 2.20GHz machine, GPU build but only running CPU ops; running `x.view_as(x)`, with various record_functions applied (with profiling turned off). For more detailed results see "record_functions experiments branch" linked above (those results are on a different machine, but show the same patterns). Note that the results are somewhat noisy, assume 0.05-0.1us variations

```
Baseline:: 1.7825262546539307 us  # Just running x.view_as(x)
profiled_basic:: 13.600390434265137 us  # torch.profiler.record_function(x) + view_as
precompute_manual_cm_rf:: 2.317216396331787 us  # torch._C._profiler_manual._RecordFunctionFast(), if the context is pre-constructed + view_as
guard_manual_cm_rf:: 1.7994389533996582 us  # guard with _is_profiler_enabled + view_as
```

Differential Revision: [D48421198](https://our.internmc.facebook.com/intern/diff/D48421198)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107195
Approved by: https://github.com/albanD, https://github.com/aaronenyeshi
2023-08-22 18:48:30 +00:00
ydwu4
a408920817 Reland fakify FunctionalTensor (#107569)
Try to rebase and reland https://github.com/pytorch/pytorch/pull/107062 . One difference compared with previous is to make the DTensor logic same as previously in _clone_input.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107569
Approved by: https://github.com/zou3519
2023-08-22 15:46:25 +00:00
Aaron Gokaslan
79d35bfc01 [BE]: Add PYI files to ruff lintrunner (#107524)
Due to a bug with the lintrunner yaml, PYI files were not convered. This PR builds off #107519 covers all PYI files and adds one noqa to fix a B006 bugprone error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107524
Approved by: https://github.com/ezyang
2023-08-21 18:55:41 +00:00
PyTorch MergeBot
96c5be8bc4 Revert "Fakify leaf of FunctionalTensor (#107062)"
This reverts commit 3349725766.

Reverted https://github.com/pytorch/pytorch/pull/107062 on behalf of https://github.com/ydwu4 due to This appears to have broken the test TestDTensorCompile.test_dtensor_fullgraph. Probably a land race ([comment](https://github.com/pytorch/pytorch/pull/107062#issuecomment-1685447747))
2023-08-21 00:30:16 +00:00
Aaron Gokaslan
b1e8e01e50 [BE]: Apply PYI autofixes to various types (#107521)
Applies some autofixes from the ruff PYI rules to improve the typing of PyTorch. I haven't enabled most of these ruff rules yet as they do not have autofixes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107521
Approved by: https://github.com/ezyang
2023-08-20 02:42:21 +00:00
ydwu4
3349725766 Fakify leaf of FunctionalTensor (#107062)
This PR allows dynamo to fakify FunctionalTensorWrapper by unwrapping, replacing and wrapping again for FunctionalTensorWrapper so that FunctionalTensorWrapper can be passed in as input for dynamo.optimize and we can support something like this
```python
ff = torch.func.functionalize(f)
torch.compile(ff)(x)
```

This PR didn't follow the \_\_tensor_flatten\_\_ and \_\_tensor_unflatten\_\_ protocol right now because we're not sure the plan of doing that for FunctionalTensorWrapper (it's implemented in C++).

**Test Plan:**
Add a new test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107062
Approved by: https://github.com/zou3519
ghstack dependencies: #107042
2023-08-19 17:33:42 +00:00
Edward Z. Yang
36bb7a1f42 Add fast traceback utilities (#107358)
This adds some utilities for conveniently working with fast combined CapturedTraceback from Python. The main goal of these utilities is to make it easier for people to use CapturedTraceback as a drop-in replacement for `traceback.extract_stack`, which is 20x slower than CapturedTraceback.

I port symbolic shapes to use the new CapturedTraceback code, to validate that the APIs work and are useful.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107358
Approved by: https://github.com/zdevito, https://github.com/albanD
ghstack dependencies: #107438
2023-08-18 19:05:54 +00:00
PyTorch MergeBot
3c11184ca8 Revert "Fakify leaf of FunctionalTensor (#107062)"
This reverts commit 6cb0128c8a.

Reverted https://github.com/pytorch/pytorch/pull/107062 on behalf of https://github.com/ZainRizvi due to This appears to have broken the test TestDTensorCompile.test_dtensor_fullgraph.  Probably a land race ([comment](https://github.com/pytorch/pytorch/pull/107062#issuecomment-1684124230))
2023-08-18 16:02:54 +00:00
ydwu4
6cb0128c8a Fakify leaf of FunctionalTensor (#107062)
This PR allows dynamo to fakify FunctionalTensorWrapper by unwrapping, replacing and wrapping again for FunctionalTensorWrapper so that FunctionalTensorWrapper can be passed in as input for dynamo.optimize and we can support something like this
```python
ff = torch.func.functionalize(f)
torch.compile(ff)(x)
```

This PR didn't follow the \_\_tensor_flatten\_\_ and \_\_tensor_unflatten\_\_ protocol right now because we're not sure the plan of doing that for FunctionalTensorWrapper (it's implemented in C++).

**Test Plan:**
Add a new test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107062
Approved by: https://github.com/zou3519
ghstack dependencies: #107042
2023-08-18 03:05:45 +00:00
Zachary DeVito
80988b6277 Introduce memory stacks for free (#106758)
Previously when we recorded a free action in a memory trace, we would provide
the stack for when the block was allocated. This is faster because we do not
have to record stacks for free, which would otherwise double the number of stacks
collected. However, sometimes knowing the location of a free is useful for
figuring out why a tensor was live. So this PR adds this behavior. If
performance ends up being a concern the old behavior is possible by passing
"alloc" to the context argument rather than "all".

Also refactors some of glue logic to be consistent across C++ and Python and
routes the Python API through the C++ version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106758
Approved by: https://github.com/albanD
2023-08-14 20:38:15 +00:00
PyTorch MergeBot
71be8f2223 Revert "Add initial support for FP8 ONNX export (#106379)"
This reverts commit 08704f96f0.

Reverted https://github.com/pytorch/pytorch/pull/106379 on behalf of https://github.com/kit1980 due to breaking multiple internal builds ([comment](https://github.com/pytorch/pytorch/pull/106379#issuecomment-1675192700))
2023-08-11 18:22:35 +00:00
eellison
3495f0c999 Generate mypy hints for torch.Tag, add a couple of pointwise ops (#106910)
Replace https://github.com/pytorch/pytorch/pull/106739, since i had a bad CLA commit.

- adds clone, and convert_element_dtype to pointwise
- adds codegen for mypy hints of torch.Tag and removes existing ignores for them

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106910
Approved by: https://github.com/mlazos
2023-08-10 05:12:27 +00:00
Tal Cherckez
08704f96f0 Add initial support for FP8 ONNX export (#106379)
Add support for ONNX_NAMESPACE::TensorProto_DataType_FLOAT8E5M2 and ONNX_NAMESPACE::TensorProto_DataType_FLOAT8E4M3FN to enable export of torch models that use FP8 (E4M3 and E5M2) to ONNX (opset 19)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106379
Approved by: https://github.com/justinchuby, https://github.com/thiagocrepaldi, https://github.com/malfet
2023-08-10 01:02:45 +00:00
Ramin Azarmehr
cdfd0ea162 [MPS] Introduce torch.mps.Event() APIs (#102121)
- Implement `MPSEventPool` to recycle events.
- Implement python bindings with `torch.mps.Event` class using the MPSEventPool backend. The current member functions of the Event class are `record()`, `wait()`, `synchronize()`, `query()`, and `elapsed_time()`.
- Add API to measure elapsed time between two event recordings.
- Added documentation for Event class to `mps.rst`.
- Added test case to `test_mps.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102121
Approved by: https://github.com/albanD, https://github.com/kulinseth
2023-08-08 03:45:45 +00:00
poseljacob
a25eee1d77 _force_original_view_tracking to work as both context manager and function (#106706)
Fix _force_original_view_tracking to work as a function as well as a context manager, as stated by documentation.

Applied similar fixes to PR: https://github.com/pytorch/pytorch/pull/105291
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106706
Approved by: https://github.com/albanD
2023-08-07 23:29:22 +00:00
MooYeh
fb6652b56e [profiler] add profiler parsing support for custom device. (#106142)
We hope PyTorch profiling parsing ability can also be applicable to custom devices. Based on previous  work  https://github.com/pytorch/pytorch/pull/101554, we have made supplementary updates to PyTorch profiling to extend its parsing capabilities for custom devices. These modifications do not affect the original logic of the code and mainly include the following aspects:
1. Added the relevant logic for use_device in torch.profiler.profiler._KinetoProfile.
2. In torch.autograd.profiler and torch.autograd.profiler_util, custom devices profiling data  parsing ability has been added using privateuse1 and use_device attributes.
3. In torch._C._autograd.pyi and torch._C._autograd.pyi, custom devices related attributes have been added. The underlying C++
logic will be added in subsequent pull requests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106142
Approved by: https://github.com/aaronenyeshi
2023-08-02 20:23:22 +00:00
Alex Settle
9ba0558d48 Add sequence_nr to aot_autograd to map forward ops to their corresponding backward ops (#103129)
Fixes #102375

Sequence_nr increments in the forward pass and decrements in the backward pass.  Backward ops with the same sequence_nr as a forward op represent the backward implementation for the op.  The long term goal is to make this information available to the profiler so users can observe which ops are fused by the inductor openai triton kernels.

Added a test for this feature **test/dynamo/test_aot_autograd.py::AotAutogradFallbackTests::test_aot_sequence_nr**.  The test case uses **aot_export_module()** to create a joint fwd/bwd fx graph.  Then it walks all the nodes in fx graph using fx_graph.graph.nodes.   The seq_nr of each node is recorded in node.meta.  During the fwd pass the seq_nr increments and it decrements during the bwd pass.  This allows the user to map forward ops to their corresponding bwd ops which is useful for performance analysis.

Expected output from the test case.

 SeqNr|OrigAten|SrcFn
0|aten.convolution.default|l__self___conv1
0|aten.add.Tensor|l__self___bn1
1|aten._native_batch_norm_legit_functional.default|l__self___bn1
2|aten.relu.default|l__self___relu1
3|aten.add.Tensor|add
4|aten.view.default|flatten
5|aten.t.default|l__self___fc1
6|aten.unsqueeze.default|l__self___fc1
7|aten.mm.default|l__self___fc1
8|aten.squeeze.dim|l__self___fc1
9|aten.add.Tensor|l__self___fc1
10|aten.sub.Tensor|l__self___loss_fn
11|aten.abs.default|l__self___loss_fn
12|aten.mean.default|l__self___loss_fn
12|aten.ones_like.default|
12|aten.expand.default|
12|aten.div.Scalar|
11|aten.sgn.default|
11|aten.mul.Tensor|
8|aten.unsqueeze.default|
7|aten.t.default|
7|aten.mm.default|
7|aten.t.default|
7|aten.t.default|
7|aten.mm.default|
6|aten.squeeze.dim|
5|aten.t.default|
4|aten.view.default|
2|aten.threshold_backward.default|
1|aten.native_batch_norm_backward.default|
0|aten.convolution_backward.default|
0|aten.add.Tensor|

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103129
Approved by: https://github.com/soulitzer
2023-08-02 00:52:52 +00:00
albanD
9d2e15882e Add torch.utils to the docs page, remove dead code and fix docstrings (#105142)
As per title.
Note that the c++ side code for the minidumps part was removed. So trying to call any of these 3 functions today results in an error saying that `torch._C` doesn't have these attributes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105142
Approved by: https://github.com/janeyx99
2023-07-26 14:24:58 +00:00
Brian Coutinho
8d9c8897ed [profiler] add option for kineto synchronization events in the trace (#105187)
Summary:
## About Sync Events
For CUDA profiling mode, we can enable tracing CUDA synchronization events.
* This feature captures synchronization events in CUDA including 1) context/device sync, 2) stream sync, 3) CUDA event sync, 4) CUDA stream wait event (inter stream synchronization). Read more
* We add this flag using the profiler's experimental config option.
* This PR relies on 7b003638c6 change in pytorch/kineto

## Usage
Just set the `enable_cuda_sync_events` option in `_ExperimentalConfig`
```
from torch.autograd.profiler import profile, _ExperimentalConfig
with profile(use_kineto=True, use_cuda=True,
   experimental_config=_ExperimentalConfig(enable_cuda_sync_events=True),
) as prof:
   workload()
```

**Please wait for PyTorch github repo to point to 7b003638c6 or later commit in Kineto**

Test Plan:
## Unit Test
Added a unit test

  buck2 test mode/dev-nosan caffe2/test:profiler --local-only -- test_profiler_cuda_sync_events
  Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0
ttps://www.internalfb.com/intern/testinfra/testrun/281475298097379

Reviewed By: davidberard98

Differential Revision: D46244591

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105187
Approved by: https://github.com/aaronenyeshi
2023-07-26 03:45:04 +00:00
Louis Feng
3a01c056f5 [PyTorch][ET] Collect Process Groups Mapping Info (#104373)
Summary: Add the logics and interface to log ProcessGroup comms configuration (unique ID, type, and ranks info).

Test Plan:
Testing in HPC:
```
TORCH_LOGS=all ../buck-out/v2/gen/fbcode/c8344b52091f4f7f/hpc/models/ads/__ads_10x_launcher__/ads_10x_launcher.par  +launcher=local launcher.num_trainers=4 +data_loader=random data_loader.num_batches=2000
```
Example output in ET:
```
    {
      "name": "## process_group:init ##", "id": 3, "rf_id": 1, "parent": 2, "fw_parent": 0, "seq_id": -1, "scope": 7, "tid": 1, "fw_tid": 0, "op_schema": "",
      "inputs": ["[{'pg_id': 140538064364672, 'backend_id': 140538060772480, 'backend_config': 'cuda:nccl', 'ranks': {0: 0, 1: 1, 2: 2, 3: 3}}, {'pg_id': 140538064363904, 'backend_id': 140538042628864, 'backend_config': 'cuda:nccl', 'ranks': {0: 0, 1: 1, 2: 2, 3: 3}}]"], "input_shapes": [[]], "input_types": ["String"],
      "outputs": [], "output_shapes": [], "output_types": []
    },
```

Differential Revision: D46321690

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104373
Approved by: https://github.com/kwen2501
2023-07-25 03:34:53 +00:00
Richard Barnes
32b67b5a6b Fix RRef Future type annotation (#105296)
Test Plan: Sandcastle

Differential Revision: D47376236

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105296
Approved by: https://github.com/Skylion007
2023-07-23 22:34:09 +00:00
PyTorch MergeBot
117325862c Revert "Add torch.utils to the docs page, remove dead code and fix docstrings (#105142)"
This reverts commit e985719e98.

Reverted https://github.com/pytorch/pytorch/pull/105142 on behalf of https://github.com/huydhn due to Sorry for reverting this but it is failing python doc build job in trunk e985719e98 ([comment](https://github.com/pytorch/pytorch/pull/105142#issuecomment-1644874540))
2023-07-21 01:47:49 +00:00
albanD
e985719e98 Add torch.utils to the docs page, remove dead code and fix docstrings (#105142)
As per title.
Note that the c++ side code for the minidumps part was removed. So trying to call any of these 3 functions today results in an error saying that `torch._C` doesn't have these attributes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105142
Approved by: https://github.com/janeyx99
2023-07-21 00:14:59 +00:00
poseljacob
1aba399138 allow set_multithreading_enabled to act as function and context manager (#105291)
Fixes #104985

Implemented `set_multithreading_enabled` C++ function to directly alter state rather than using `MultithreadingEnabled` class, which was automatically resetting the state when the object was destroyed. This behavior more closely aligns with set_grad_enabled which does work as expected. This allows us to change python class `set_multithreading_enabled` to act as both a function and context manager.

I also added a getter: `torch._C.is_multithreading_enabled`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105291
Approved by: https://github.com/albanD
2023-07-18 16:55:40 +00:00
Nikita Shulga
5837e95d30 [Reland] Update mypy to 1.4.1 (#105227)
This PR re-lands
- [Typing] Fix PEP 484 Violation (#105022)
- Update mypy to 1.4.1 (#91983)

That were reverted due to the conflict with internal source repo.

Mostly fixes for PEP-484 violation (i.e. when default arg is set to None, but type is not annotated as optional)
Plus few real fixes:
  - Add missing `_get_upgraders_entry_map` to `torch/_C/__init__.pyi`
  - Add missing return statement to `torch._export. deserialize_graph`
  - Fix error message in `torch.ao.ns.fx.weight_utils.get_lstm_mod_weights`
  - Add assert it `torch/optim/optimizer.py` that Optional list is not None
TODO (in followup PR):
  - Fix erroneous `isinstance` check in `torch/ao/quantization/_pt2e/qat_utils.py`

Unrelated, to bypass CI failures due to the gcc9 dependency update in Ubuntu-18.04:
- Add hack to squash older libstdc++ from conda environment in favor one from OS to `.ci/docker/install_conda.sh`
- Update bazel cuda builds to focal, as with libstdc++-6.0.32 bazel builds loose the ability to catch exceptions (probably because they link with cupti statically, but I could not found where it is done)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105227
Approved by: https://github.com/atalman, https://github.com/albanD, https://github.com/Skylion007
2023-07-15 20:30:20 +00:00
PyTorch MergeBot
15fd1ea118 Revert "[Reland] Update mypy to 1.4.1 (#105227)"
This reverts commit c9c4f8efc3.

Reverted https://github.com/pytorch/pytorch/pull/105227 on behalf of https://github.com/atalman due to trying to mitigate ci sev #105248 ([comment](https://github.com/pytorch/pytorch/pull/105227#issuecomment-1636510935))
2023-07-14 22:28:35 +00:00
Nikita Shulga
c9c4f8efc3 [Reland] Update mypy to 1.4.1 (#105227)
This PR re-lands
- [Typing] Fix PEP 484 Violation (#105022)
- Update mypy to 1.4.1 (#91983)

That were reverted due to the conflict with internal source repo.

Mostly fixes for PEP-484 violation (i.e. when default arg is set to None, but type is not annotated as optional)
Plus few real fixes:
  - Add missing `_get_upgraders_entry_map` to `torch/_C/__init__.pyi`
  - Add missing return statement to `torch._export. deserialize_graph`
  - Fix error message in `torch.ao.ns.fx.weight_utils.get_lstm_mod_weights`
  - Add assert it `torch/optim/optimizer.py` that Optional list is not None
TODO (in followup PR):
  - Fix erroneous `isinstance` check in `torch/ao/quantization/_pt2e/qat_utils.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105227
Approved by: https://github.com/atalman, https://github.com/albanD, https://github.com/Skylion007
2023-07-14 20:45:12 +00:00
Richard Barnes
15ea0a00cb Fix RRef type annotations (#104876)
Test Plan: Sandcastle

Reviewed By: H-Huang

Differential Revision: D47334579

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104876
Approved by: https://github.com/H-Huang
2023-07-14 17:31:51 +00:00
Edward Z. Yang
9fc3a22731 Turn on typechecking in cudagraph_trees (#105151)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105151
Approved by: https://github.com/Skylion007, https://github.com/eellison
2023-07-14 16:13:50 +00:00
PyTorch MergeBot
3c5a494d7a Revert "Update mypy to 1.4.1 (#91983)"
This reverts commit 634659e262.

Reverted https://github.com/pytorch/pytorch/pull/91983 on behalf of https://github.com/malfet due to It's dependent change was reverted, so reverting this one as well, to keep CI clean ([comment](https://github.com/pytorch/pytorch/pull/91983#issuecomment-1636059709))
2023-07-14 15:59:16 +00:00
Richard Zou
f03a8f0589 [reland] Deprecate registering autograd kernels at not an autograd key (#105078)
Summary:
Context
-------
This PR adds a new fallback to the Autograd dispatch keys.

If you would prefer the old behavior:
- A quick (unsupported) way to get the previous behavior is to call
`torch._C._set_autograd_fallback("nothing")`
- Register "torch::CppFunction::makeFallthrough()" to your Autograd key,
like in https://gist.github.com/zou3519/d09a5f4b1afe2430af09fea67c6ff2c8

It is possible that this PR regresses performance of overhead-bound
models. If this is the case, please reach out (and apply one of the
temporary fixes in the previous section).

Description for reviewers
-------------------------
In order to deprecate registering autograd kernels at not an autograd
key, we add a fallback to the Autograd dispatch keys. This fallback
raises a warning if the user attempts to backprop through the operator
and is also configurable to either warn or not warn.

The goal of this PR is to
- preserve as much BC as possible
- raise a warning that whatever the user is doing is potentially wrong.
- be as performant as possible

There are roughly two cases:
- if the post-autograd kernels return a Tensor that requires grad, then
we install an autograd hook that raises a warning. We are preserving BC
in that it is possible that the user has a torch::autograd::Function
registered to their CPU key.
- if the post-autograd kernels return Tensors that do not require grad,
then we make them require_grad and install a WarnNotImplemented grad fn
that warns in the backward pass. This is mildy BC-breaking (see next
section).

Test Plan:
- bunch of new tests

BC-Breaking Note
----------------
This PR adds a new fallback to the Autograd dispatch keys. It affects
custom operators that do not have a kernel registered to the Autograd
keys (e.g. AutogradCPU and AutogradCUDA).

If the previous behavior was that the custom operator would return
Tensors that do not require grad if the inputs do require grad, then
this PR changes it so that all floating-point and complex returns do
require grad. See the "Context" section above for how to get the old
behavior.

Differential Revision: D47408353

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105078
Approved by: https://github.com/soulitzer
2023-07-14 15:03:07 +00:00
Nikita Shulga
634659e262 Update mypy to 1.4.1 (#91983)
Mostly fixes for PEP-484 violation (i.e. when default arg is set to None, but type is not annotated as optional)
Plus few real fixes:
  - Add missing `_get_upgraders_entry_map` to `torch/_C/__init__.pyi`
  - Add missing return statement to `torch._export. deserialize_graph`
  - Fix error message in `torch.ao.ns.fx.weight_utils.get_lstm_mod_weights`
  -
TODO (in followup PR):
  - Fix erroneous `isinstance` check in `torch/ao/quantization/_pt2e/qat_utils.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91983
Approved by: https://github.com/kit1980, https://github.com/ZainRizvi, https://github.com/huydhn, https://github.com/thiagocrepaldi, https://github.com/aaronenyeshi
2023-07-13 16:30:36 +00:00
PyTorch MergeBot
24aa8b9b9a Revert "Deprecate registering autograd kernels at not an autograd key (#104481)"
This reverts commit ed13ab6664.

Reverted https://github.com/pytorch/pytorch/pull/104481 on behalf of https://github.com/atalman due to failed in periodic tests ([comment](https://github.com/pytorch/pytorch/pull/104481#issuecomment-1631552846))
2023-07-11 21:48:22 +00:00
Aaron Gokaslan
2f95a3d0fc [BE]: Apply ruff PERF fixes to torch (#104917)
Applies automated ruff fixes in the PERF modules and enables all automatic ones. I also updated ruff which applied some additional fixes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104917
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-07-11 20:45:21 +00:00
Richard Zou
ed13ab6664 Deprecate registering autograd kernels at not an autograd key (#104481)
Context
-------
This PR adds a new fallback to the Autograd dispatch keys.

If you would prefer the old behavior:
- A quick (unsupported) way to get the previous behavior is to call
`torch._C._set_autograd_fallback("nothing")`
- Register "torch::CppFunction::makeFallthrough()" to your Autograd key,
like in https://gist.github.com/zou3519/d09a5f4b1afe2430af09fea67c6ff2c8

It is possible that this PR regresses performance of overhead-bound
models. If this is the case, please reach out (and apply one of the
temporary fixes in the previous section).

Description for reviewers
-------------------------
In order to deprecate registering autograd kernels at not an autograd
key, we add a fallback to the Autograd dispatch keys. This fallback
raises a warning if the user attempts to backprop through the operator
and is also configurable to either warn or not warn.

The goal of this PR is to
- preserve as much BC as possible
- raise a warning that whatever the user is doing is potentially wrong.
- be as performant as possible

There are roughly two cases:
- if the post-autograd kernels return a Tensor that requires grad, then
we install an autograd hook that raises a warning. We are preserving BC
in that it is possible that the user has a torch::autograd::Function
registered to their CPU key.
- if the post-autograd kernels return Tensors that do not require grad,
then we make them require_grad and install a WarnNotImplemented grad fn
that warns in the backward pass. This is mildy BC-breaking (see next
section).

Test Plan:
- bunch of new tests

BC-Breaking Note
----------------
This PR adds a new fallback to the Autograd dispatch keys. It affects
custom operators that do not have a kernel registered to the Autograd
keys (e.g. AutogradCPU and AutogradCUDA).

If the previous behavior was that the custom operator would return
Tensors that do not require grad if the inputs do require grad, then
this PR changes it so that all floating-point and complex returns do
require grad. See the "Context" section above for how to get the old
behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104481
Approved by: https://github.com/soulitzer
2023-07-11 16:48:39 +00:00
soulitzer
c85468a94c [autograd Function] Add private API to not materialize grads for non-differentiable outputs (#104291)
Fixes https://github.com/pytorch/pytorch/issues/104272

This PR adds a new private API `materialize_non_diff_grads` (default True) such that when set to False, grad outputs corresponding to outputs marked non-differentiable would receive None instead of a zero-filled tensor. This is overrides the setting of `materialize_grads`, i.e. grad outputs corresponding non-differentiable outputs would still be None even if `materialize_grads=True` (the default).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104291
Approved by: https://github.com/albanD
2023-07-08 14:53:54 +00:00
Howard Huang
9165d46b89 DDP + C10D sparse all_reduce changes (#103916) (#104256)
Summary:

reland of https://github.com/pytorch/pytorch/pull/103916

## Changes

prototyping sparse allreduce using the sparse dispatch key. When passing in sparse tensors into `dist.allreduce()` we can execute our dispatched function.

prior to this change, passing a sparse tensor into `allreduce()` will error out with `Tensor must be dense...`

## Example script

```python
# python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 this_script.py

import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    a = torch.tensor([[0, 2.], [3, 0]]).to(rank)
    a = a.to_sparse()
    print(f"rank {rank} - a: {a}")
    dist.all_reduce(a)

if __name__ == "__main__":
    main()
```

output:
```
rank 1 - a: tensor(indices=tensor([[0, 1],
                       [1, 0]]),
       values=tensor([2., 3.]),
       device='cuda:1', size=(2, 2), nnz=2, layout=torch.sparse_coo)
allreduce_sparse_cuda_
tensor.is_sparse() = 1
in ProcessGroupNCCL::allreduceSparse
rank 0 - a: tensor(indices=tensor([[0, 1],
                       [1, 0]]),
       values=tensor([2., 3.]),
       device='cuda:0', size=(2, 2), nnz=2, layout=torch.sparse_coo)
allreduce_sparse_cuda_
tensor.is_sparse() = 1
in ProcessGroupNCCL::allreduceSparse
```

Test Plan:
Testing commands (OSS):

```
# python
pytest test/distributed/test_c10d_nccl.py -vsk test_sparse_allreduce_ops

# c++
build/bin/ProcessGroupNCCLTest --gtest_filter=ProcessGroupNCCLTest.testSparseAllreduce
```

Testing commands (internal, ondemand GPU):
ddp tests:
```
buck build mode/opt -c hpc_comms.use_ncclexp=default //caffe2/test/distributed:c10d --show-full-output

# Get the .par file from the previous command and use it below
TORCH_SHOW_CPP_STACKTRACE=1 /data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/c8344b52091f4f7f/caffe2/test/distributed/__c10d__/c10d.par -r test_ddp_set_sparse_metadata
```

c10d tests:
```
# build tests and run with log output (python)
buck build mode/opt -c hpc_comms.use_ncclexp=default //caffe2/test/distributed:c10d --show-full-output
NCCL_DEBUG=WARN /data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/c8344b52091f4f7f/caffe2/test/distributed/__c10d__/c10d.par -r test_sparse_allreduce_ops

# python
NCCL_DEBUG=WARN buck test mode/opt -c hpc_comms.use_ncclexp=default //caffe2/test/distributed:c10d -- --exact 'caffe2/test/distributed:c10d - test_sparse_allreduce_ops (test_c10d_nccl.ProcessGroupNCCLTest)'

# c++
NCCL_DEBUG=WARN buck run mode/opt -c hpc_comms.use_ncclexp=default //caffe2/test/cpp/c10d:ProcessGroupNCCLTest -- --gtest_filter=ProcessGroupNCCLTest.testSparseAllreduce
```

Differential Revision: D47056695

Pulled By: H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104256
Approved by: https://github.com/rohan-varma
2023-06-28 00:37:52 +00:00
PyTorch MergeBot
436d035dc7 Revert "DDP + C10D sparse all_reduce changes (#103916)"
This reverts commit fed5fba6e4.

Reverted https://github.com/pytorch/pytorch/pull/103916 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/103916#issuecomment-1608412325))
2023-06-26 22:37:58 +00:00
Howard Huang
fed5fba6e4 DDP + C10D sparse all_reduce changes (#103916)
Summary:
## Changes

prototyping sparse allreduce using the sparse dispatch key. When passing in sparse tensors into `dist.allreduce()` we can execute our dispatched function.

prior to this change, passing a sparse tensor into `allreduce()` will error out with `Tensor must be dense...`

## Example script

```python
# python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 this_script.py

import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    a = torch.tensor([[0, 2.], [3, 0]]).to(rank)
    a = a.to_sparse()
    print(f"rank {rank} - a: {a}")
    dist.all_reduce(a)

if __name__ == "__main__":
    main()
```

output:
```
rank 1 - a: tensor(indices=tensor([[0, 1],
                       [1, 0]]),
       values=tensor([2., 3.]),
       device='cuda:1', size=(2, 2), nnz=2, layout=torch.sparse_coo)
allreduce_sparse_cuda_
tensor.is_sparse() = 1
in ProcessGroupNCCL::allreduceSparse
rank 0 - a: tensor(indices=tensor([[0, 1],
                       [1, 0]]),
       values=tensor([2., 3.]),
       device='cuda:0', size=(2, 2), nnz=2, layout=torch.sparse_coo)
allreduce_sparse_cuda_
tensor.is_sparse() = 1
in ProcessGroupNCCL::allreduceSparse
```

Test Plan:
Testing commands (OSS):

```
# python
pytest test/distributed/test_c10d_nccl.py -vsk test_sparse_allreduce_ops

# c++
build/bin/ProcessGroupNCCLTest --gtest_filter=ProcessGroupNCCLTest.testSparseAllreduce
```

Testing commands (internal, ondemand GPU):
ddp tests:
```
buck build mode/opt -c hpc_comms.use_nccl=exp //caffe2/test/distributed:c10d --show-full-output

# Get the .par file from the previous command and use it below
TORCH_SHOW_CPP_STACKTRACE=1 /data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/c8344b52091f4f7f/caffe2/test/distributed/__c10d__/c10d.par -r test_ddp_set_sparse_metadata
```

c10d tests:
```
# build tests and run with log output (python)
buck build mode/opt -c hpc_comms.use_nccl=exp //caffe2/test/distributed:c10d --show-full-output
NCCL_DEBUG=WARN /data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/c8344b52091f4f7f/caffe2/test/distributed/__c10d__/c10d.par -r test_sparse_allreduce_ops

# python
NCCL_DEBUG=WARN buck test mode/opt -c hpc_comms.use_nccl=exp //caffe2/test/distributed:c10d -- --exact 'caffe2/test/distributed:c10d - test_sparse_allreduce_ops (test_c10d_nccl.ProcessGroupNCCLTest)'

# c++
NCCL_DEBUG=WARN buck run mode/opt -c hpc_comms.use_nccl=exp //caffe2/test/cpp/c10d:ProcessGroupNCCLTest -- --gtest_filter=ProcessGroupNCCLTest.testSparseAllreduce
```

Differential Revision: D46724856

Pulled By: H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103916
Approved by: https://github.com/rohan-varma
2023-06-26 20:42:17 +00:00
Louis Feng
5847cb55e4 [PyPer][ET] Refactor EG to ET (#99694)
Summary:
Change execution graph to execution trace.
See post: https://fb.workplace.com/groups/873291503156329/permalink/1529496217535851/

Test Plan: Run a job.

Reviewed By: chaekit

Differential Revision: D44121392

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99694
Approved by: https://github.com/chaekit
2023-06-22 19:41:54 +00:00
Brian Hirsh
c3c03e7cb8 Reland of https://github.com/pytorch/pytorch/pull/101818 (#103888)
Original PR broke internal

This reverts commit 5ed618132f.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103888
Approved by: https://github.com/albanD
2023-06-21 21:00:56 +00:00
Nikita Shulga
6d2887cc06 Reland "Move tensor grouping to ATen" (#103912)
This is a reland of https://github.com/pytorch/pytorch/pull/100007 with a build fix for Windows debug builds.
`at::native::ParamsHash` only works on structs with standard layout, but `std::string` isn't one in Visual C++ debug builds, which one can easily verified by running something like:
```cpp
#define _DEBUG
#include <type_traits>
#include <string>
static_assert(std::is_standard_layout_v<std::string>, "Oh noes");
```
If above conditon is not met, instead of printing a static_assert output, VC++ raises a very cryptic compilation errors,  see https://github.com/pytorch/pytorch/pull/100007#discussion_r1227116292 for more detail.

Also, using `std::hash` for string should result in a faster hash function.

(cherry picked from commit 74b7a6c75e)

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 5914771</samp>

This pull request introduces a new function `_group_tensors_by_device_and_dtype` that can group tensors by their device and dtype, and updates the `foreach` utilities and several optimizers to use this function. The goal is to improve the performance, readability, and compatibility of the code that handles tensors with different properties. The pull request also adds a test case and type annotations for the new function, and some error checks for the `fused` argument in Adam and AdamW.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103912
Approved by: https://github.com/janeyx99
2023-06-21 09:26:33 +00:00
leslie-fang-intel
9832cfbbfe Quantization oneDNN backend only support VNNI CPU (#103653)
**Summary**

- Update the quantization document that default qconfig with oneDNN backend is recommended to be used on CPUs with Vector Neural Network Instruction support.
- Add the warning message when user uses default qconfig with oneDNN backend on CPU without Vector Neural Network Instruction support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103653
Approved by: https://github.com/jgong5, https://github.com/malfet
2023-06-19 09:50:07 +00:00
lkct
fd4beb7a05 Better function annotations for nn.functional (#102918)
Fixes #102768

- Provides proper function declarations in generated `torch/nn/functional.pyi`.
- Moves some functions from manually defined in `functional.pyi.in` to generated code, in order to single-source the signature.
- Includes some of the functions in `torch._C._nn` into its `.pyi.in`, but not exhaustive (only what's already there).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102918
Approved by: https://github.com/drisspg, https://github.com/malfet
2023-06-16 19:48:04 +00:00
Jon Maltiel Swenson
0da38409a0 [gloo] Make it possible for gloo TCPStore to take over an existing socket fd (#103478)
Summary:
This diff allows the `TCPStore` server associated with a gloo process group to listen on an existing socket already bound to a port.

Without the functionality in this diff, canonical initialization of a gloo `ProcessGroup` is fundamentally racy: 1) ask the OS for a free port by creating a socket bound to port 0, 2) close the socket, 3) attempt to initialize a `TCPStore` server that listens on the previously free port. Of course, the problem is that in between steps 2 and 3, another process on the host may have claimed the port, causing `TCPStore` and overall process group initialization to fail. With this diff, it is now possible for users to completely avoid this race (see unit test for how this can be achieved).

Test Plan:
Added new unit test:
  buck2 test caffe2/test/distributed:store

Differential Revision: D46622317

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103478
Approved by: https://github.com/H-Huang
2023-06-16 17:15:56 +00:00
Matthew Hoffman
bc2caa7fdf Add type hint for retains_grad (#103528)
Fixes #103485

Type checkers don't know about the existence of `retains_grad` otherwise:

```python
torch.randn(10, 10).retains_grad  # Cannot access member "retains_grad" for type "Tensor"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103528
Approved by: https://github.com/albanD, https://github.com/Skylion007, https://github.com/janeyx99
2023-06-13 21:37:32 +00:00
PyTorch MergeBot
0cb5bc3b04 Revert "Move tensor grouping to ATen (#100007)"
This reverts commit 74b7a6c75e.

Reverted https://github.com/pytorch/pytorch/pull/100007 on behalf of https://github.com/izaitsevfb due to Breaks internal builds, see D46629727 ([comment](https://github.com/pytorch/pytorch/pull/100007#issuecomment-1587861598))
2023-06-12 18:30:33 +00:00
PyTorch MergeBot
5ed618132f Revert "change pre_autograd to pre_dispatch tracing (#101818)"
This reverts commit b0392de2c3.

Reverted https://github.com/pytorch/pytorch/pull/101818 on behalf of https://github.com/izaitsevfb due to Breaks internal builds see D46629736 TypeError: wrap_key() got an unexpected keyword argument pre_autograd ([comment](https://github.com/pytorch/pytorch/pull/101818#issuecomment-1587837667))
2023-06-12 18:16:37 +00:00
Nikita Shulga
4cfa06f706 [BE] Deprecate has_XYZ attributes (#103279)
Use [`__getattr__`](https://peps.python.org/pep-0562/) to raise warningwhen one tries to access `has_XYZ` methods and recommend appropriate `torch.backends.XYZ` methods

Make respective properties in `torch._C` private (by prefixing them with underscore), to exclude from `from torch._C import *`.

Added `warnings.simplefilter` to workaround Python-3.11 torch.compile lineinfo issue.

Fixes https://github.com/pytorch/pytorch/issues/102484

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103279
Approved by: https://github.com/janeyx99, https://github.com/Skylion007
2023-06-10 05:17:17 +00:00
Brian Hirsh
b0392de2c3 change pre_autograd to pre_dispatch tracing (#101818)
We discussed in a composability meeting a few weeks ago that `pre_autograd` should probably be renamed to `pre_dispatch`.

One question in this PR was: should I re-use a dispatch key? Or should I create a new dispatch key (that yet again corresponds to "top of the dispatcher")?

~~For now, I ended up sticking our proxy mode on the mode stack corresponding to `PythonTLSSnapshot`, because it was simple and it works. It looks like one of the functorch dispatch keys has higher priority though, so it's possible that functorch will end up running first. Open to options, but we can consider adding a new dispatch key later if that becomes a problem~~

Update: I added a dedicated dispatch key, `PreDispatch`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101818
Approved by: https://github.com/ezyang, https://github.com/Neilblaze, https://github.com/albanD, https://github.com/zou3519
2023-06-09 17:30:15 +00:00
Masaki Kozuki
74b7a6c75e Move tensor grouping to ATen (#100007)
rel: #94344
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100007
Approved by: https://github.com/janeyx99
2023-06-09 15:44:46 +00:00
Edward Z. Yang
96fd283640 Preserve CreationMeta when metafying views. (#103152)
This helps us avoid erroring / generate more accurate error messages
in Dynamo when doing mutations on views.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103152
Approved by: https://github.com/albanD
2023-06-09 12:34:54 +00:00
Richard Li
f1f57e1e54 trigger tracing for MTIA events (#102288)
Summary: trigger tracing for MTIA events on python side when ProfilerActivity.MTIA is specified

Test Plan:
Test diff: D45437426

```
hg graft D45437426
```
- in one terminal

```
cd ~/fbsource/fbcode
buck2 run -j 8 \
    //infra_asic_fpga/firmware/tools/mad/service:mad_service
```
- in another terminal

Pytorch profiler
```
buck run mode/dev-nosan -j 8 //caffe2/torch/fb/acc_runtime/afg/tests:test_afg  -- -m kernel_add
```

Differential Revision: D46122853

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102288
Approved by: https://github.com/aaronenyeshi
2023-06-05 15:10:31 +00:00
Richard Zou
3897c479af Add API to construct the functional variant of an op (#102293)
`register_functional_op`:
- constructs the functional variant of an op
- registers a functionalization kernel to the op

To get this to work:
- `register_functional_op` makes assumptions that it checks about the
op's schema. In particular, the op is not allowed to return anything it
mutates. We can relax these constraints in the future.
- We add a "boxed" python functionalization kernel that handles this
case.

I'm not actually sure (or convinced) this should be public API or how
it should work. If we want this to be public, then it should probably be
a torch.library API, but does that also mean we should give the same
lifetime guarantees? If so, then it would be up to the user to construct
a Library object to actually register the functional variant onto.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102293
Approved by: https://github.com/bdhirsh
2023-06-02 13:36:50 +00:00
dujinhang
2e8ce910bb [Profiler][1/N] add profiler support for custom device. (#101554)
1. `torch.autograd.profiler` interface parameters changed. (use `self.use_device` instead of `self.use_cuda` facilitates access by other devices and integrate it in subsequent pr)
2. Modify `ProfilerEventStub`(aka `std::shared_ptr<CUevent_st>`) to `ProfilerVoidEventStub`(aka `std::shared_ptr<void>`) so that `ProfilerStubs` can be inherited by any `{device}Methods`.
In addition, `cuda_event_start_` is renamed to `device_event_start_` , cuda and other devices can use this event pointer if needed.
4. custom device support using legacy profiling(add `ProfilerState::KINETO_PRIVATEUSE1_FALLBACK` option)
5. add `privateuse1Stubs` register
(parse results and test cases are added in subsequent pr)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101554
Approved by: https://github.com/aaronenyeshi
2023-06-02 09:19:19 +00:00
Pritam Damania
9a2df0a5af [RFC] Add method to DDP to check for backward finalization. (#100773)
Summary: In cases where DDP backward is not finalized, the error is raised only in the next forward iteration of DDP. However, if there are other collective calls between those two points, training scripts could potentially get stuck.

As a result, there should be a way to check if DDP finalized after calling `.backward()`. To address this, I've added a `_check_reducer_finalized` method to validate that DDP indeed did successfully finish reduction.

Test Plan: Added unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100773
Approved by: https://github.com/rohan-varma
2023-05-31 20:43:06 +00:00
Richard Zou
fc31b3a106 Allow existing "Python RAII guards" to be used as context managers (#102579)
This PR adds a `py_context_manager_DEPRECATED` that converts a C++ RAII
guard to an object that may be either used as Python context manager or
as a "Python RAII guard".

We don't convert all of them to Python context manager only due to BC
reasons; people in OSS and internally actually rely on these APIs and I
don't want to break them. We are justified in breaking BC if we wanted
to, but it seemed like too much work for not a lot of gain.

The API is postfixed with "DEPRECATED" to indicate that people should
really use `py_context_manager` (converts C++ RAII guard to Python
context manager) instead.

Test Plan:
- this PR converts all PyTorch usages of _AutoDispatchBelowAutograd to
context manager. I can do the rest in follow-ups.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102579
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2023-05-31 19:55:38 +00:00
David Berard
5324124eac [profiler] Reintroduce forward-backward links (#102424)
**TL;DR:** This re-introduces links between backward kernels and their corresponding forward kernels.

<img width="1020" alt="Screenshot 2023-05-26 at 7 25 22 PM" src="https://github.com/pytorch/pytorch/assets/5067123/02571b59-859c-4c9e-b3ef-121ef3159812">

In the example above, you can see there are two such flows - one for aten::add, and one for aten::binary_cross_entropy

### Details

Forward/backward links were added in https://github.com/pytorch/pytorch/pull/62553, but then disabled in https://github.com/pytorch/pytorch/pull/72904 due to segfaults (e.g. https://github.com/pytorch/pytorch/issues/69443).

Between now and when the fwd-bwd links were disabled, there's been a lot of refactoring; so this PR updates the implementation:
* Use a raw profiler::impl::Result instead of a KinetoEvent
* Move the implementation to collection.cpp, where the TraceWrapper is currently handled.
* Sort the events before processing, because they aren't always in chronological order
* There can now be more than one event in the backward pass that matches the sequenceNr-threadID pair. The implementation needed to be updated to avoid showing multiple endpoints for a given sequenceNr-threadID pair ([ptr to where the bwd sequenceNr-threadID pair is duplicated](6e3e3dd477/torch/csrc/profiler/collection.cpp (L398-L399))).

Next, we need to verify that https://github.com/pytorch/pytorch/issues/69443 is fixed. Running the repro no longer errors. Looking further into the details of the issue it seems like the handling of the [raw linkedActivity pointer (old code from 2021)](6089dcac48/libkineto/src/output_json.cpp (L283)) resulted in the segfault. Now, it doesn't look like the linked activity is used anywhere in output_json.cpp so the issue should be fixed.

### Testing

#### 1. unit test
`test_profiler_fwd_bwd_link` was un-skipped. It was modified to match the new implementation.

#### 2. https://github.com/pytorch/pytorch/issues/69443

I ran the repro in https://github.com/pytorch/pytorch/issues/69443 and verified there were no segfaults.

#### 3. Duplicate flow IDs

When forward-backward connections were first introduced, gpu-cpu async links had not been introduced. There's a possibility that gpu-cpu links and fwd-bwd links could interfere if their IDs overlap.

I manually tested this in chrome://tracing; I edited a file so that a gpu-cpu link had the same ID as one of the fwd-bwd connections. The chrome tracing UI continued showing both types of links.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102424
Approved by: https://github.com/aaronenyeshi
2023-05-31 02:50:38 +00:00
atannous
b469ed72d0 Integrating new API usage metadata logger (#101762)
Summary: The new logger allows passing metadata into the api usage logger. The immediate use case is to pass the serialization_id to the save and load events to be enable tracking serialized models in API events. It could be extended to add more metadata in the future.

Test Plan:
```
buck2 test @//mode/dev //caffe2/caffe2/serialize:inline_container_test
```

Reviewed By: davidberard98

Differential Revision: D45683697

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101762
Approved by: https://github.com/davidberard98
2023-05-26 00:24:26 +00:00
Richard Zou
08fb648fe1 Add mechanism to turn any RAII guard into a Python Context Manager (#102037)
This PR:
- adds a mechanism to turn any RAII guard into a Python Context Manager
- turns ExcludeDispatchKeyGuard into a context manager, and purges usages
of the older torch._C.ExcludeDispatchKeyGuard from the codebase.

The mechanism is that given a RAII guard, we construct a context
manager object that holds an optional guard. When we enter the context
manager we populate the guard, when we exit we reset it.

We don't delete torch._C.ExcludeDispatchKeyGuard for BC reasons (people
are using it in fbcode). If this code actually sticks
(it is using C++17 and that worries me a bit), then I'll apply the
change to other RAII guards we have, otherwise, we can write our own
std::apply.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102037
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
2023-05-24 14:20:52 +00:00
Xuehai Pan
05f6250815 Add missing torch.distributed.ReduceOp.AVG in type stubs (#101534)
Add missing `AVG` to `torch.distributed.ReduceOp` enum for type annotation.

Ref:

88b6a4577b/torch/csrc/distributed/c10d/Types.hpp (L35-L47)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101534
Approved by: https://github.com/Skylion007
2023-05-16 15:51:21 +00:00
Ronan Gautier
ff3f19615f Type conversion between float/complex dtypes (#97935)
This PR is an implementation of the feature request https://github.com/pytorch/pytorch/issues/97888, for the implementation of `torch.dtype.to_complex()` and `torch.dtype.to_float()` methods that convert between float and complex dtypes of the same precision.

Disclaimer: it's the first time I code in C++ so hopefully the code is correct, but I'm not super confident about the PR. Any advice/comment is welcome. It's also my first contribution to a large library, so hopefully I'm not doing anything wrong !

@ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97935
Approved by: https://github.com/ezyang
2023-05-16 14:39:44 +00:00
Richard Zou
6bc0f4a4ee [reland][CustomOp] Add Dispatcher error callback (#101452)
Reland of #101015, original stack reverted due to internal test
flakiness.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101452
Approved by: https://github.com/soulitzer
2023-05-16 13:33:31 +00:00
David Berard
935100cbde [profiler] When record_inputs=True, record scalar lists of length <= 30 (#100593)
Many ops take as inputs scalars or scalar lists which are important to understand the properties of the op. For example, convolution ops' behavior and output shapes often depend on padding and strides, which are provided as scalars of lists of scalars. This will record scalar lists when record_inputs=True.

Details:
During collection (and this was true before this PR as well), we serialize values and tensor metadata into an InputOutputEncoder. After collection occurs, we deserialize these values to attach the information to each of the events.

This PR does this:
- Adds support for serializing scalar lists during collection / serialization
- Adds an extra field called "Concrete Args"
- Splits up the deserialization process into two steps - one for generating "input shapes" and one for generating "concrete args". We split up input shapes and concrete args to avoid interrupting any previous workflows that relied on the specific data in the input shapes category; additionally, it's just a better description. Note that single scalars will remain in the "input shapes" category as they were already in that category in the past.

Differential Revision: [D45798431](https://our.internmc.facebook.com/intern/diff/D45798431)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100593
Approved by: https://github.com/aaronenyeshi
2023-05-16 07:58:46 +00:00
PyTorch MergeBot
7912b34789 Revert "[CustomOp] Add Dispatcher error callback (#101015)"
This reverts commit c0e5d7e7fe.

Reverted https://github.com/pytorch/pytorch/pull/101015 on behalf of https://github.com/huydhn due to Revert this as the earlier commits in the stack have been reverted ([comment](https://github.com/pytorch/pytorch/pull/101015#issuecomment-1548476583))
2023-05-15 19:49:53 +00:00
Ramin Azarmehr
721b144f0f [MPS] Add support for Custom Kernels (#100661)
- This change introduces these APIs to enable developing custom kernels on the MPS Stream:
`torch::mps::get_command_buffer()`
`torch::mps::get_dispatch_queue()`
`torch::mps::commit()`
- Add ObjC test case
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100661
Approved by: https://github.com/kulinseth, https://github.com/malfet
2023-05-15 17:02:33 +00:00
Ramin Azarmehr
0be53d83fc [MPS] Add support for MPSProfiler Python bindings (#101002)
- Added torch.mps.profiler.[start() and stop()] APIs with RST documentation
- Added test case in test_mps
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101002
Approved by: https://github.com/malfet
2023-05-12 21:55:34 +00:00
Richard Zou
c0e5d7e7fe [CustomOp] Add Dispatcher error callback (#101015)
The PyTorch Dispatcher's "no kernel found for DispatchKey" error message
is a bit long and winded. This PR adds a way to add a custom error
callback and changes the CustomOp API to use the custom error callback
to deliver better error messages.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101015
Approved by: https://github.com/ezyang
2023-05-12 13:49:20 +00:00
PyTorch MergeBot
19be2bb875 Revert "[MPS] Add support for Custom Kernels (#100661)"
This reverts commit f39cda83d1.

Reverted https://github.com/pytorch/pytorch/pull/100661 on behalf of https://github.com/malfet due to Break internal builds, but also guarding dispatch_t define behind __OBJC__ guard is not a good practices ([comment](https://github.com/pytorch/pytorch/pull/100661#issuecomment-1540540002))
2023-05-09 17:02:04 +00:00
chunyuan
7012600abe fix cpu autocast check in rnn (#100621)
https://github.com/pytorch/pytorch/pull/100100 added Typechecking while `torch.is_autocast_enabled()` always return `False` on cpu. This PR fixes the autocast check for cpu.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100621
Approved by: https://github.com/albanD
2023-05-09 04:34:18 +00:00
Ramin Azarmehr
f39cda83d1 [MPS] Add support for Custom Kernels (#100661)
- This change introduces these APIs to enable developing custom kernels on the MPS Stream:
`torch::mps::get_command_buffer()`
`torch::mps::get_dispatch_queue()`
`torch::mps::commit()`
- Add ObjC test case
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100661
Approved by: https://github.com/kulinseth, https://github.com/malfet
2023-05-08 20:05:46 +00:00
vfdev-5
6a12f10b08 Publicly exposing torch.backends.cpu.get_cpu_capability() (#100164)
Description:

- As suggested by Nikita, created `torch.backends.cpu` submodule and exposed `get_cpu_capability`.

- In torchvision Resize method we want to know current cpu capability in order to pick appropriate codepath depending on cpu capablities

Newly coded vectorized resize of uint8 images on AVX2 supported CPUs is now faster than older way (uint8->float->resize->uint8). However, on non-avx hardware (e.g. Mac M1) certain configs are slower using native uint8.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100164
Approved by: https://github.com/albanD, https://github.com/malfet
2023-05-03 19:02:07 +00:00
Edward Z. Yang
67e0913de9 Add support for serializing real tensor data in after aot minifier (#99834)
The new minifier script looks like this:

```
import torch._dynamo.repro.after_aot
reader = torch._dynamo.repro.after_aot.InputReader(save_dir='/tmp/tmpcsngx39e')
buf0 = reader.storage('e2b39c716c0d4efb9fa57375a3902b9dab666893', 16)
t0 = reader.tensor(buf0, (4,))
args = [t0]
mod = make_fx(Repro(), tracing_mode='real')(*args)
```

The real tensor data is stored in the storages folder of the checkpoint dump directory. If you delete this folder / it is otherwise missing, we will transparently fall back to generating random data like before. The tensors are serialized using content store from #99809, which means each storage is content-addressed and we will automatically deduplicate equivalent data (which is useful if you keep dumping out, e.g., your parameters.) We don't use the tensor serialization capability from content store, instead all of the tensor metadata is stored inline inside the repro script (so that everything is in one file if you lose the checkpointed tensors).

We also add a stable_hash option to content store, where we use a slow SHA-1 sum on the data in CPU side to compute a hash that is stable across systems with the same endianness.

Out of rage, I also added support for Dtype.itemsize property access.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99834
Approved by: https://github.com/voznesenskym
2023-04-27 11:52:13 +00:00
Edward Z. Yang
3a5427baf4 Add torch.utils._content_store (#99809)
Implements a simple content-addressable store for storages (with tensors implemented as cheap references on top), enabling incremental serialization of tensors to disk, which I intend to use in the accuracy repro extractor.  Check the comment at the top of torch/utils/_content_store.py for more details on the intended use case.

One major piece of this PR is implementing the content hash for tensors.  For our prospective use case, we may need to repeatedly hash up to 80 GB of tensor data every time we snapshot (and we may snapshot multiple times).  Using a conventional cryptographic hash and hashing each snapshot would likely take on order of minutes, which seemed too slow to me.  So instead, I implemented a crappy hash function that can be run on GPU.  It is at least somewhat theoretically grounded: using random parameters generated by Philox, we use the standard shift-multiply and xor sum universal hash family.  The hash function is a bit dorky though; instead of properly doing 160-bit math, it just runs 32-bit hash five times and cats them together.  By the way, this sets the first precedent for kernel in PyTorch library which MUST be torch.compile'd to be run (in fact, this kernel does not run in eager mode because of the use of xor_sum, which doesn't actually exist in ATen.)

I had to add a few more primitives to inductor, namely randint (over the entire int range) and xor_sum.  Fortunately, these primitives are natively supported by Triton/C++, and so they were very easy to plumb through.  xor_sum is exposed as a prim, while randint special cases on when low/high span the entire 32-bit signed integer range.

Thanks to Jeff Johnson for letting me bounce ideas of him on a Saturday morning lol.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99809
Approved by: https://github.com/voznesenskym
2023-04-26 18:02:59 +00:00
Brian Hirsh
62fad315c1 fix per-dispatchkey-mode caching bug (#98030)
The bug was that: if you want to move a mode to the autograd key, we need to use the "functionality" key for it (AutogradFunctionality). But when we do that, we need to clear any PythonDispatcher caches for every op for **every** autograd key (since you could run autograd ops with both cpu and cuda tensors underneath the mode, which both may have been cached).

I didn't add a test, since this ends up getting indirectly tests by export in the PR. If someone would prefer a direct test I can add one.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98030
Approved by: https://github.com/ezyang
2023-04-25 21:58:14 +00:00