Changes:
1. Make some arguments positional-only as we only support Python 3.8+
2. Clean up `torch.typename(obj)` implementation.
3. Update type annotations, especially `is_tensor()` and `is_masked_tensor()`, using `TypeGuard` (a sketch follows below).
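A hedged sketch of items 1 and 3 (illustrative only, not the actual diff): positional-only parameters use the `/` marker available since Python 3.8, and `TypeGuard` lets type checkers narrow the argument after the check.
```python
from typing_extensions import TypeGuard

import torch


def is_tensor(obj: object, /) -> TypeGuard[torch.Tensor]:
    # `/` makes `obj` positional-only; TypeGuard tells the checker that a
    # True result means `obj` is a torch.Tensor.
    return isinstance(obj, torch.Tensor)


def double_in_place(x: object) -> None:
    if is_tensor(x):
        x.mul_(2)  # x is narrowed to torch.Tensor here
```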
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129001
Approved by: https://github.com/malfet
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
This PR introduces a prototype for `SymmetricMemory` (including a CUDA implementation), a communication primitive based on remote memory access. It allows for user-defined communication patterns/kernels and is designed to be torch.compile-friendly. It addresses the major limitations of `IntraNodeComm` and `ProcessGroupCudaP2p` and serves as a replacement for them.
### SymmetricMemory
`SymmetricMemory` represents symmetric allocations across a group of devices. The allocations represented by a `SymmetricMemory` object are accessible by all devices in the group. The class can be used for **op-level custom communication patterns** (via the get_buffer APIs and the synchronization primitives), as well as **custom communication kernels** (via the buffer and signal_pad device pointers).
### Python API Example
```python
import torch
from torch._C.distributed_c10d import _SymmetricMemory

# Set a store for rendezvousing symmetric allocations on a group of devices
# identified by group_name. The concept of groups is logical; users can
# utilize predefined groups (e.g., a group of devices identified by a
# ProcessGroup) or create custom ones. Note that SymmetricMemoryAllocator
# backends might employ a more efficient communication channel for the actual
# rendezvous process and only use the store for bootstrapping purposes.
_SymmetricMemory.set_group_info(group_name, rank, world_size, store)

# Identical to empty_strided, but allows symmetric memory access to be
# established for the allocated tensor via _SymmetricMemory.rendezvous().
# This function itself is not a collective operation.
t = _SymmetricMemory.empty_strided_p2p((64, 64), (64, 1), torch.float32, group_name)

# Users can write Python custom ops that leverage the symmetric memory access.
# Below are examples of things users can do (assuming the group's world_size is 2).

# Establishes symmetric memory access on tensors allocated via
# _SymmetricMemory.empty_strided_p2p(). rendezvous() is a one-time process,
# and the mapping between a local memory region and the associated SymmetricMemory
# object is unique. Subsequent calls to rendezvous() with the same tensor will receive
# the cached SymmetricMemory object.
#
# The function has a collective semantic and must be invoked simultaneously
# from all rendezvous participants.
symm_mem = _SymmetricMemory.rendezvous(t)

# This represents the allocation on rank 0 and is accessible from all devices.
buf = symm_mem.get_buffer(0, (64, 64), torch.float32)

if symm_mem.rank == 0:
    symm_mem.wait_signal(src_rank=1)
    assert buf.eq(42).all()
else:
    # The remote buffer can be used as a regular tensor
    buf.fill_(42)
    symm_mem.put_signal(dst_rank=0)

symm_mem.barrier()

if symm_mem.rank == 0:
    symm_mem.barrier()
    assert buf.eq(43).all()
else:
    new_val = torch.empty_like(buf)
    new_val.fill_(43)
    # Contiguous copies to/from a remote buffer utilize copy engines,
    # which bypass SMs (i.e., no need to load the data into registers)
    buf.copy_(new_val)
    symm_mem.barrier()
```
### Custom CUDA Comm Kernels
Given a tensor, users can access the associated `SymmetricMemory` object, which provides pointers to the remote buffers/signal_pads needed for custom communication kernels.
```cpp
TORCH_API c10::intrusive_ptr<SymmetricMemory> get_symmetric_memory(
    const at::Tensor& tensor);

class TORCH_API SymmetricMemory : public c10::intrusive_ptr_target {
 public:
  ...
  virtual std::vector<void*> get_buffer_ptrs() = 0;
  virtual std::vector<void*> get_signal_pad_ptrs() = 0;
  virtual void** get_buffer_ptrs_dev() = 0;
  virtual void** get_signal_pad_ptrs_dev() = 0;
  virtual size_t get_buffer_size() = 0;
  virtual size_t get_signal_pad_size() = 0;
  virtual int get_rank() = 0;
  virtual int get_world_size() = 0;
  ...
};
```
### Limitations of IntraNodeComm and ProcessGroupCudaP2p
Both `IntraNodeComm` (used by `ProcessGroupCudaP2p`) and `ProcessGroupCudaP2p` manage a single fixed-size workspace. This approach:
- Leads to awkward UX in which the required workspace needs to be specified upfront.
- Cannot avoid extra copies for some algorithms in eager mode (e.g., custom/multimem all-reduce, reduce-scatter, all-gather).
- Prevents torch.compile from eliminating all copies.
In addition, they only offer out-of-the-box communication kernels and don't expose the pointers required for user-defined, custom CUDA comm kernels.
* __->__ #128582
Differential Revision: [D58849033](https://our.internmc.facebook.com/intern/diff/D58849033)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128582
Approved by: https://github.com/wanchaol
In this PR, we abstract the different types of aten operation parameters as `ParameterMetadata`. This structure is intended to represent and store the metadata of each aten operation parameter. Currently, it only supports `Tensor`, `TensorList`, and `Scalar`.
```C++
using ParameterMetadataValue = std::variant<TensorMetadata, std::vector<TensorMetadata>, c10::Scalar>;
```
With this PR, we can extend support for other parameter types in a more modular way, like `string`, `int`, `double`, and the other types summarized in the following list. The list is collected from all aten operations and ordered by how frequently each type is used.
- `Tensor`
- `bool`
- `int64_t`
- `TensorList`
- `Scalar`
- `c10::SymIntArrayRef`
- `::std::optional<Tensor>`
- `IntArrayRef`
- `double`
- `c10::SymInt`
- `::std::optional<ScalarType>`
- `::std::optional<double>`
- `::std::optional<bool>`
- `::std::optional<Layout>`
- `::std::optional<Device>`
- `::std::optional<int64_t>`
- `Dimname`
- `::std::optional<Generator>`
- `c10::string_view`
- `::std::optional<c10::string_view>`
- `OptionalIntArrayRef`
- `::std::optional<Scalar>`
- `OptionalSymIntArrayRef`
- `::std::optional<MemoryFormat>`
- `::std::optional<c10::SymInt>`
- `ScalarType`
- `ArrayRef<Scalar>`
- `DimnameList`
- `::std::optional<ArrayRef<double>>`
- `::std::array<bool,3>`
- `::std::optional<DimnameList>`
- `c10::List<::std::optional<Tensor>>`
- `::std::array<bool,2>`
- `Storage`
- `::std::array<bool,4>`
- `Device`
- `DeviceIndex`
- `ITensorListRef`
- `Stream`
- `Layout`
- `MemoryFormat`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125308
Approved by: https://github.com/jgong5, https://github.com/jansel
FIXES #113263. Same idea as in https://github.com/pytorch/pytorch/pull/113417, but we need a more intrusive C API to silently no-op default saved tensor hooks, in order to support user code that uses torch.autograd.disable_saved_tensors_hooks (see test_unpack_hooks_can_be_disabled). We mock the output of get_hooks while leaving push/pop untouched.
For compiled autograd, we're firing pack hooks once and unpack hooks twice right now; I'll look into this separately from this issue.
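For reference, a minimal sketch of the user-facing eager API this is about (`torch.autograd.graph`); how compiled autograd treats the default hooks follows the description above, not this sketch.
```python
import torch


def pack(t):    # called when autograd saves a tensor for backward
    return t


def unpack(t):  # called when the saved tensor is needed during backward
    return t


x = torch.randn(3, requires_grad=True)
with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
    y = (x * x).sum()
y.backward()

# Inside this region, attempts to register saved-tensor hooks raise with the
# given message.
with torch.autograd.graph.disable_saved_tensors_hooks("hooks disabled here"):
    z = (x * 2).sum()
```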
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123196
Approved by: https://github.com/soulitzer
Not requiring all functions to have type annotations allows a lot of `Any` types to slip in, which poison downstream types and make mypy unable to properly typecheck the code. I want to flip the default so that new files are required to have fully typed defs and we can have a burndown list of files that fail to require full types.
The preceding stack of PRs (cut up simply to keep the number of file changes per PR reasonable) adds `# mypy: allow-untyped-defs` to any file which didn't immediately pass mypy with the flag flipped. Due to changing files and merge conflicts it will probably be necessary to have several passes through before landing this final PR, which turns the option on.
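For illustration, the per-file escape hatch added by the preceding PRs looks like this (hypothetical file contents):
```python
# mypy: allow-untyped-defs
# With the directive above, this file keeps passing mypy even once
# disallow_untyped_defs becomes the repo-wide default.


def helper(x, y):  # no annotations required in this file
    return x + y
```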
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127836
Approved by: https://github.com/oulgen, https://github.com/Skylion007
Looks like one of the first failures seen is `test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` when `test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` passes.
What seems interesting here is that the `torch.compile` version fails while the eager version passes. Not sure what the difference would be here...
Nevertheless, is there a recommended mechanism to skip cuDNN SDPA as a backend for this test? CC @drisspg
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125343
Approved by: https://github.com/Skylion007
Summary:
While playing around with the UT, I thought it would be good to give a simple example of a user function that can be used with different subclasses of `_ControlCollectives`, and to test that the user function can be executed correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127665
Approved by: https://github.com/d4l3k
Now that torch.dtype can pass through pybind11, modify the function _group_tensors_by_device_and_dtype to use the scalar type directly, avoiding the conversion between torch.dtype and string on the Python and C++ sides.
@ezyang @bdhirsh
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127869
Approved by: https://github.com/ezyang
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127122
Approved by: https://github.com/kit1980
## Context
This stack prototypes automatic micro-pipelining of `all-gather -> matmul` and `matmul -> reduce-scatter` via Inductor. The idea originates from the paper [Overlap Communication with Dependent Computation via
Decomposition in Large Deep Learning Models](https://dl.acm.org/doi/pdf/10.1145/3567955.3567959). The implementation and some key optimizations are heavily influenced by @lw's implementation in xformers.
The stack contains several components:
- `ProcessGroupCudaP2P` - a thin wrapper around `ProcessGroupNCCL`. It in addition maintains a P2P workspace that enables SM-free, one-sided P2P communication which is needed for optimal micro-pipelining.
- `fused_all_gather_matmul` and `fused_matmul_reduce_scatter` dispatcher ops.
- Post-grad fx pass that detects `all-gather -> matmul` and `matmul -> reduce-scatter` and replaces them with the fused dispatcher ops.
To enable the prototype feature:
- Set the distributed backend to `cuda_p2p`.
- Set `torch._inductor.config._micro_pipeline_tp` to `True`.
*NOTE: the prototype sets nothing in stone w.r.t. each component's design. The purpose is to have a performant baseline with a reasonable design on which each component can be further improved.*
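A minimal sketch of enabling the prototype per the two bullets above (assumes a typical multi-process launch, e.g. torchrun):
```python
import torch
import torch.distributed as dist

# 1. Use the cuda_p2p backend so the P2P workspace is available.
dist.init_process_group(backend="cuda_p2p")

# 2. Let Inductor's post-grad pass fuse all-gather -> matmul / matmul -> reduce-scatter.
torch._inductor.config._micro_pipeline_tp = True

# ... build the tensor-parallel model and run it under torch.compile ...
```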
## Benchmark
Setup:
- 8 x H100 (500W) + 3rd gen NVSwitch.
- Llama3 8B training w/ torchtitan.
- 8-way TP. Reduced the number of layers from 32 to 8 for benchmarking purposes.
Trace (baseline): https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmpjaz8zgx0
<img width="832" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/4addba77-5abc-4d2e-93ea-f68078587fe1">
Trace (w/ micro pipelining): https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmpn073b4wn
<img width="963" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/4f44e78d-8196-43ab-a1ea-27390f07e9d2">
## This PR
`ProcessGroupCudaP2P` is a thin wrapper around `ProcessGroupNCCL`. By default, it routes all collectives to the underlying `ProcessGroupNCCL`. In addition, `ProcessGroupCudaP2P` initializes a P2P workspace that allows direct GPU memory access among the members. The workspace can be used in Python to optimize intra-node communication patterns or to create custom intra-node collectives in CUDA.
`ProcessGroupCudaP2P` aims to bridge the gap where certain important patterns can be better optimized via fine-grained P2P memory access than with collectives in the latest version of NCCL. It is meant to complement NCCL rather than replacing it.
Usage:
```
# Using ProcessGroupCudaP2P
dist.init_process_group(backend="cuda_p2p", ...)
# Using ProcessGroupCudaP2P while specifying ProcessGroupCudaP2P.Options
pg_options = ProcessGroupCudaP2P.Options()
dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...)
# Using ProcessGroupCudaP2P while specifying ProcessGroupNCCL.Options
pg_options = ProcessGroupNCCL.Options()
dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...)
# Using ProcessGroupCudaP2P while specifying both
# ProcessGroupCudaP2P.Options and ProcessGroupNCCL.Options
pg_options = ProcessGroupCudaP2P.Options()
pg_options.nccl_options = ProcessGroupNCCL.Options()
dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...)
# Down-casting the backend to access p2p buffers for cuda_p2p specific
# optimizations
if is_cuda_p2p_group(group):
    backend = get_cuda_p2p_backend(group)
    if required_p2p_buffer_size > backend.get_buffer_size():
        # fallback
    p2p_buffer = backend.get_p2p_buffer(...)
else:
    # fallback
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122163
Approved by: https://github.com/wanchaol
The original change was about 9.5% slower than before #122074.
This improves it to be only about 1.4% slower.
Also touched up some unrelated nits that the linter complained about.
Fixes #126293
Ran torchbench 3 times on each change. Perf values before (stable), after (fix),
and with #122074 backed out (backout):
```
../inductor-tools/scripts/modelbench/inductor_single_run.sh single inference performance torchbench pyhpc_isoneutral_mixing amp first dynamic cpp
stable:
43.948x
45.754x
44.906x
fix:
47.505x
49.987x
47.493x
backout:
48.243x
48.199x
48.192x
../inductor-tools/scripts/modelbench/inductor_single_run.sh single inference performance torchbench pyhpc_equation_of_state amp first static default
stable:
15.224x
13.286x
15.354x
fix:
16.402x
16.370x
16.183x
backout:
16.554x
16.675x
16.787x
../inductor-tools/scripts/modelbench/inductor_single_run.sh single inference performance torchbench lennard_jones float32 first static default
stable:
1.712x
1.651x
1.640x
fix:
1.804x
1.798x
1.792x
backout:
1.864x
1.824x
1.836x
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126996
Approved by: https://github.com/jansel
Fixes #71398
Add `__reduce__` and `__setstate__` methods for `torch._C.Generator`.
`__reduce__` returns a tuple of 3 values:
1. `torch.Generator` itself.
2. A one-element tuple containing the `torch.device` to create the `Generator` with, since this cannot be changed after the object is created.
3. The state, a three-element tuple: the initial seed, the offset (or `None` if a CPU `Generator`), and the RNG state tensor.
`__setstate__` calls `manual_seed`, `set_offset` (if not `None`), and `set_state` on each respective element of the state.
Added test demonstrating successful reserialization with cpu and cuda `Generator`s.
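For illustration, the round trip this enables (CPU generator shown; a CUDA generator additionally carries its offset):
```python
import pickle

import torch

g = torch.Generator()
g.manual_seed(1234)

g2 = pickle.loads(pickle.dumps(g))
assert g2.initial_seed() == g.initial_seed()
assert torch.equal(g2.get_state(), g.get_state())
```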
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126271
Approved by: https://github.com/ezyang
#### Conditions for allowlisting tensor subclasses
We allow tensor subclass types that
(1) Do not override `__setstate__`, `__getattr__`, `__setattr__`, `__get__`, `__set__` or `__getattribute__` of `torch.Tensor` (`torch.Tensor` does not have a definition of `__getattr__`, `__get__` or `__set__` so we check that these are `None`)
(2) Use the generic `tp_alloc`
(3) Are in a module that *has been imported by the user*
to be pushed onto the stack as strings by `GLOBAL` instructions, while storing the type in a dict
The strings will be converted to the classes as appropriate when executing `REBUILD` with `_rebuild_from_type_v2`
Note that we use `inspect.getattr_static(sys.modules[module], name)` to get the class/function, as this method claims to have no code execution.
The rationale for the 3 conditions above is as follows:
The rebuild func provided by `Tensor.__reduce_ex__` is `torch._tensor._rebuild_from_type_v2`, which is defined as such (note the call to `getattr`, `Tensor.__setstate__` and the call to `as_subclass` as well as the call to `_set_obj_state` which calls `setattr`)
4e66aaa010/torch/_tensor.py (L57-L71)
`as_subclass` is implemented with a call to `THPVariable_NewWithVar`
that will eventually call `tp_alloc` here
4e66aaa010/torch/csrc/autograd/python_variable.cpp (L2053)
The `func` arg to `_rebuild_from_type_v2` for wrapper subclasses is `Tensor.rebuild_wrapper_subclass`, which will similarly call into `THPVariable_NewWithVar` and hit the above `tp_alloc`
**Note that we do not call `tp_init` or `tp_new` (i.e. `cls.__init__` or `cls.__new__`) when unpickling**
### How do we check something is a tensor subclass/constraints around imports
In order to check whether `bla` is a tensor subclass in the bytecode `GLOBAL module.name`, we need to do an `issubclass` check, which entails converting the global string to the appropriate type. We *do not* arbitrarily import modules but will perform this check as long as the given subclass (given by `module.name`) has already been imported by the user (i.e. `module in sys.modules` and `issubclass(getattr(sys.modules[module], name), torch.Tensor)`).
This PR also allowlisted `torch._utils._rebuild_wrapper_subclass` and `torch.device` (used by `_rebuild_wrapper_subclass`)
### API for allow listing
This PR also added `torch.serialization.{add/get/clear}_safe_globals` that enables user to allowlist globals they have deemed safe and manipulate this list (for example they could allowlist a tensor subclass with a custom `__setstate__` if they have checked that this is safe).
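A small usage sketch of the allowlisting API (the subclass name is illustrative):
```python
import io

import torch


class MyTensor(torch.Tensor):  # simple subclass satisfying conditions (1)-(3)
    pass


buf = io.BytesIO()
torch.save(torch.randn(2).as_subclass(MyTensor), buf)
buf.seek(0)

# Explicitly allowlist the subclass for weights_only loading.
torch.serialization.add_safe_globals([MyTensor])
loaded = torch.load(buf, weights_only=True)
```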
Next steps:
- Add testing and allowlist required classes for all in-core tensor subclasses (e.g. `DTensor`, `FakeTensor` etc.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124331
Approved by: https://github.com/albanD
This PR is part of a series of PRs to significantly speed up torch.onnx.export for models with many nodes (e.g. LLM). See #121422 for more analysis.
- As part of torch.onnx.export, a reverse look-up is made in env. This is done for each node, and the look-up cost is proportional to the graph size, which incurs an overall O(N^2) time complexity.
- A pragmatic solution is simply to keep a separate data structure to make this de facto constant time (see the sketch below). So, this introduces a set containing all the values of env. Open to other ideas. Ideally `exist_in_env` wouldn't be needed at all, but to preserve current behavior exactly I'm not sure how that can be done.
- Resolves (4) in #121422.
- This code change and the choice of py::set looks a bit more natural on top of #123063, where the env is changed from a std::unordered_map to a py::dict.
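A sketch of the idea (names hypothetical); the real change lives in the C++ graph-building code and uses a py::set:
```python
# env maps torch IR values to ONNX values; env_values mirrors env.values() so
# membership checks are O(1) instead of scanning env's values for every node.
env = {}
env_values = set()


def record(torch_value, onnx_value):
    env[torch_value] = onnx_value
    env_values.add(onnx_value)


def exist_in_env(onnx_value):
    return onnx_value in env_values
```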
Partially fixes #121422
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124909
Approved by: https://github.com/srikris-sridhar, https://github.com/justinchuby
We save and restore the DynamicLayerStack during frame eval but since fx graph has no way to express a try/finally we just assume it will happen. If we throw an exception between the push and pop to the stack then we're left in a state that affects following operations poorly. Make sure that if it's in a bad state we restore it after frame eval.
Repro:
before:
```
$ rm test/dynamo_skips/TestSparseCPU.test_log1p_cpu_uint8
$ rm test/dynamo_expected_failures/FuncTorchHigherOrderOpTests.test_vmap_free_tensor
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest test/jit/test_sparse.py test/dynamo/test_dynamic_shapes.py test/inductor/test_torchinductor_dynamic_shapes.py test/test_sparse.py -k 'test_log1p_cpu_uint8'
============= 1 passed, 8588 deselected in 9.75s =============
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest test/jit/test_sparse.py test/dynamo/test_dynamic_shapes.py test/inductor/test_torchinductor_dynamic_shapes.py test/test_sparse.py -k
'test_vmap_free_tensor_dynamic_shapes or test_log1p_cpu_uint8'
================== short test summary info ===================
FAILED [0.0632s] test/test_sparse.py::TestSparseCPU::test_log1p_cpu_uint8 - AssertionError: "only Tensors of floating point dtype can require gradients"
does not match "You are attempting to call Tensor.requires_grad_() (or perhaps using torch.autograd.functional.* APIs) inside of a function ...
======= 1 failed, 1 skipped, 8587 deselected in 10.99s =======
```
(Note that adding test_vmap_free_tensor_dynamic_shapes causes test_vmap_free_tensor_dynamic_shapes to fail)
after:
```
$ rm test/dynamo_skips/TestSparseCPU.test_log1p_cpu_uint8
$ rm test/dynamo_expected_failures/FuncTorchHigherOrderOpTests.test_vmap_free_tensor
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest test/jit/test_sparse.py test/dynamo/test_dynamic_shapes.py test/inductor/test_torchinductor_dynamic_shapes.py test/test_sparse.py -k 'test_log1p_cpu_uint8'
============= 1 passed, 8588 deselected in 9.89s =============
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest test/jit/test_sparse.py test/dynamo/test_dynamic_shapes.py test/inductor/test_torchinductor_dynamic_shapes.py test/test_sparse.py -k
'test_vmap_free_tensor_dynamic_shapes or test_log1p_cpu_uint8'
======= 1 passed, 1 skipped, 8587 deselected in 11.34s =======
```
(test_vmap_free_tensor_dynamic_shapes passes either way)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122074
Approved by: https://github.com/oulgen
- sets it as a fake stack trace as we don't have a generic comment feature
- when verbose is disabled, still adds a contextmanager and flag checks. the alternative is to use MACROS, but that wouldn't be usable with TORCH_LOGS
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124954
Approved by: https://github.com/jansel
fake_tensor.py had its mypy errors ignored. That seems less than desirable.
Also added SafePyObjectT<T> which is a tagged wrapper around a SafePyObject but provides static type checking (with no other guarantees).
Used `SafePyObjectT<TorchDispatchModeKey>` on some of the TorchDispatchModeTLS API to ensure that we don't accidentally inject a different type than expected into the stack.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124428
Approved by: https://github.com/malfet
We guard on key order
1) When a key is a non-constant object
2) When we actually need key order - like .values, .items etc
For dicts/OrderedDicts that do not require key order guarding, we just rely on the usual `GuardManager + DictGetItemGuardAccessor`. This is faster than going through the `list(d.keys())`-based design for OrderedDicts.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124779
Approved by: https://github.com/jansel
This diff intends to build device-generic torch.Stream and torch.Event for newly added accelerators in PyTorch.
------------
**torch.Stream APIs**
```
# Defined in torch/csrc/Stream.cpp
class Stream(_StreamBase):
    stream_id: _int  # Stream id
    device_index: _int
    device_type: _int

    device: _device  # The device of the stream

    @overload
    def __new__(self, device: Optional[DeviceLikeType] = None, priority: _int = 0) -> Stream: ...
    @overload
    def __new__(self, stream_id: _int, device_index: _int, device_type: _int, priority: _int = 0) -> Stream: ...
    def wait_event(self, event: Event) -> None: ...
    def wait_stream(self, other: Stream) -> None: ...
    def record_event(self, event: Optional[Event] = None) -> Event: ...
    def query(self) -> _bool: ...
    def synchronize(self) -> None: ...
    def __hash__(self) -> _int: ...
    def __repr__(self) -> str: ...
    def __eq__(self, other: object) -> _bool: ...
```
------------------
**torch.Event APIs**:
- IPC related APIs are not implemented, since many device backends don't support them, but we leave the interfaces there for future adaptation of torch.cuda.Stream.
- Currently only enable_timing is supported, since it is the most common flag used in other device backends. We have to refactor the event flag system in PyTorch to support fancier flags.
- elapsedTime API is added to c10::Event
```
# Defined in torch/csrc/Event.cpp
class Event(_EventBase):
    device: _device  # The device of the Event
    event_id: _int  # The raw event created by device backend

    def __new__(self,
                device: Optional[DeviceLikeType] = None,
                enable_timing: _bool = False,
                blocking: _bool = False,
                interprocess: _bool = False) -> Event: ...
    @classmethod
    def from_ipc_handle(self, device: DeviceLikeType, ipc_handle: bytes) -> Event: ...
    def record(self, stream: Optional[Stream] = None) -> None: ...
    def wait(self, stream: Optional[Stream] = None) -> None: ...
    def query(self) -> _bool: ...
    def elapsed_time(self, other: Event) -> _float: ...
    def synchronize(self) -> None: ...
    def ipc_handle(self) -> bytes: ...
    def __repr__(self) -> str: ...
```
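A hedged usage sketch of the device-generic classes stubbed above (assumes an available accelerator, e.g. CUDA):
```python
import torch

s = torch.Stream(device="cuda")
e = s.record_event()   # torch.Event recorded on s
s.synchronize()        # wait for all work queued on s
assert e.query()       # the recorded event has completed
```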
-----------
c10::Event provides new APIs
- calculate **elapsedTime**.
- Get raw event id
- Synchronize event.
```
double elapsedTime(const Event& event) const {
  return impl_.elapsedTime(event.impl_);
}

void* eventId() const {
  return impl_.eventId();
}

void synchronize() const {
  return impl_.synchronize();
}
```
----------
TODO: need to find a good way to test them in PyTorch with API mocks.
Differential Revision: [D56443357](https://our.internmc.facebook.com/intern/diff/D56443357)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123611
Approved by: https://github.com/albanD, https://github.com/jeffdaily
Following the example of PyTorch supporting a preferred Linalg library (cusolver or magma), this PR introduces a preferred blas library selector of either cublas or cublaslt for CUDA and hipblas or hipblaslt for ROCm via normal hipification of sources.
The default blas implementation remains cublas or hipblas. cublaslt or hipblaslt can be enabled using environment variable TORCH_BLAS_PREFER_CUBLASLT=1 (or TORCH_BLAS_PREFER_HIPBLASLT=1 as an alias) or by calling `torch.backends.cuda.preferred_blas_library(backend="cublaslt")` or as an alias `backend="hipblaslt"`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122106
Approved by: https://github.com/lezcano
This PR unifies the CUDA, XPU and PrivateUse1 in the torch profiler. Now CUDA, XPU and PrivateUse1 can together use string object `use_device` to distinguish each other and share one device path for calculating kineto time durations and memory statistics for post processing.
#suppress-api-compatibility-check
Co-authored-by: Aaron Enye Shi <enye.shi@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123247
Approved by: https://github.com/aaronenyeshi
We override the `__call__` method and register default fake, functional, and proxy dispatch mode implementations in its python_key_mode_table.
The idea is:
1. When inputs contain a FakeScriptObject, we dispatch through the _get_dispatch mechanism. We register the dispatch mode key implementations automatically in the operator's constructor.
2. When inputs are not fakified, we dispatch through the original C++ dispatcher.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123367
Approved by: https://github.com/zou3519
This PR unifies the CUDA, XPU and PrivateUse1 in the torch profiler. Now CUDA, XPU and PrivateUse1 can together use string object `use_device` to distinguish each other and share one device path for calculating kineto time durations and memory statistics for post processing.
#suppress-api-compatibility-check
Co-authored-by: Aaron Enye Shi <enye.shi@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123247
Approved by: https://github.com/aaronenyeshi, https://github.com/gujinghui
MTIA device has its own Module in PyTorch now.
torch.mtia has the following APIs, similar to other backends. The lazy_init is also supported.
```
__all__ = [
"init",
"is_available",
"synchronize",
"device_count",
"current_device",
"current_stream",
"default_stream",
"set_stream",
"stream",
"device",
]
```
------------
For device management, we expand AcceleratorHooksInterface to support generic device management, and it can be used in both C++ and Python.
```
def _accelerator_hooks_device_count() -> _int: ...
def _accelerator_hooks_set_current_device(device_index: _int) -> None: ...
def _accelerator_hooks_get_current_device() -> _int : ...
def _accelerator_hooks_exchange_device(device_index: _int) -> _int : ...
def _accelerator_hooks_maybe_exchange_device(device_index: _int) -> _int : ...
```
---------
Adding get_device_module API to retrieve device modules for different device types.
```
def get_device_module(device: Optional[Union[torch.device, str]] = None)
```
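A hedged sketch of the accessor, assuming it is exposed as `torch.get_device_module` per the signature above:
```python
import torch

# Look up the per-device-type module by name or torch.device.
cuda = torch.get_device_module("cuda")  # -> torch.cuda
cuda.synchronize()
```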
---------
@exported-using-ghexport
Differential Revision: [D52923602](https://our.internmc.facebook.com/intern/diff/D52923602/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123612
Approved by: https://github.com/albanD
ghstack dependencies: #123611
A kernel has "dispatcher convention" if there is an additional keyset arg at the beginning of the argument list. This PR:
- adds a way to register kernels with dispatcher convention using Library.impl (pass dispatcher_convention = True)
- adds OpOverload.redispatch
We use both of the above in the new custom ops API: we register the autograd kernel in dispatcher convention so that we can actually call redispatch the way PyTorch built-in ops do.
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124089
Approved by: https://github.com/albanD
ghstack dependencies: #123937, #124064, #124065, #124066, #124071
Fixes #121200
This PR introduces AcceleratorOutOfMemoryError for all privateuse1 backends. For Python, there is a PyError object which will be set only when privateuse1 is registered. All privateuse1 backends can then use this error for memory errors. Maybe more error types in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121702
Approved by: https://github.com/guangyey, https://github.com/albanD
Summary:
We need a way to allow users to set a customized description for a process group, e.g. FSDP, PP.
Here are several use cases of a user-specified group_desc:
- Logging: we can easily match a log line and understand what this collective/PG is used for.
- PyTorch traces (e.g. Kineto, Execution Trace) can benefit from the PG desc since trace analysis and benchmarks will be able to easily differentiate PG purposes like FSDP, PP.
- Lower-layer collectives (e.g. NCCL) debugging: we will be able to expose the PG desc to the NCCL communicator so NCCL-layer operations can be easily correlated to a PG.
Solution: Add a group_desc field to c10d
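A hedged sketch of how a user might attach the description when creating a group (the keyword name is taken from this summary; the exact surface may differ):
```python
import torch.distributed as dist

dist.init_process_group(backend="nccl")
# Tag the subgroup so logs, traces, and NCCL-level debugging can identify it.
fsdp_pg = dist.new_group(ranks=list(range(dist.get_world_size())), group_desc="FSDP")
```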
Differential Revision: D55781850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123472
Approved by: https://github.com/kwen2501
Summary:
Kineto traces use microsecond-level granularity because Chrome tracing defaults to that precision. Fix by adding a preprocessor flag to TARGETS and BUCK files. Also remove any unnecessary ns-to-us conversions made in the profiler itself.
This diff contains profiler changes only. Libkineto changes found in D54964435.
Test Plan:
Check JSON and chrome tracing to make sure values are as expected. Tracing with flags enabled should have ns precision. Tracings without flags should be same as master.
Zoomer: https://www.internalfb.com/intern/zoomer/?profiling_run_fbid=796886748550189
Ran key_averages() to make sure FunctionEvent code working as expected:
| Name | Self CPU % | Self CPU | CPU total % | CPU total | CPU time avg | Self CUDA | Self CUDA % | CUDA total | CUDA time avg | # of Calls |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ProfilerStep* | 0.74% | 3.976ms | 64.40% | 346.613ms | 69.323ms | 0.000us | 0.00% | 61.710ms | 12.342ms | 5 |
| Optimizer.zero_grad#SGD.zero_grad | 0.76% | 4.109ms | 0.76% | 4.109ms | 821.743us | 0.000us | 0.00% | 0.000us | 0.000us | 5 |
| ## forward ## | 6.89% | 37.057ms | 27.19% | 146.320ms | 29.264ms | 0.000us | 0.00% | 58.708ms | 11.742ms | 5 |
| aten::conv2d | 0.22% | 1.176ms | 7.74% | 41.658ms | 157.199us | 0.000us | 0.00% | 27.550ms | 103.962us | 265 |
| aten::convolution | 0.79% | 4.273ms | 7.52% | 40.482ms | 152.762us | 0.000us | 0.00% | 27.550ms | 103.962us | 265 |
| aten::_convolution | 0.69% | 3.688ms | 6.73% | 36.209ms | 136.637us | 0.000us | 0.00% | 27.550ms | 103.962us | 265 |
| aten::cudnn_convolution | 6.04% | 32.520ms | 6.04% | 32.520ms | 122.719us | 27.550ms | 8.44% | 27.550ms | 103.962us | 265 |
| aten::add_ | 2.42% | 13.045ms | 2.42% | 13.045ms | 30.694us | 12.700ms | 3.89% | 12.700ms | 29.882us | 425 |
| aten::batch_norm | 0.19% | 1.027ms | 8.12% | 43.717ms | 164.971us | 0.000us | 0.00% | 16.744ms | 63.185us | 265 |
| aten::_batch_norm_impl_index | 0.31% | 1.646ms | 7.93% | 42.691ms | 161.096us | 0.000us | 0.00% | 16.744ms | 63.185us | 265 |
Differential Revision: D55925068
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123650
Approved by: https://github.com/aaronenyeshi
If we throw an exception in the "wrong" place we can end up with the dispatch state being in a weird state which can cause all future dispatching to fail. Preserve and restore it as part of `preserve_global_state` so we know it's sane after that.
Also fake_tensor's in_kernel_invocation_manager() was leaving a bit set in the dispatcher (DispatchKey.Dense) which affected follow-on code. Fixed that to reset after as well.
Repro:
before:
```
$ rm test/dynamo_skips/TestSparseCPU.test_to_dense_with_gradcheck_sparse_cpu_complex64
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest -s test/dynamo/test_export.py test/test_sparse.py -k 'test_to_dense_with_gradcheck_sparse_cpu_complex64'
======== 1 passed, 6173 deselected in 5.21s =============
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest -s test/dynamo/test_export.py test/test_sparse.py -k 'test_torch_inference_mode_ctx or test_to_dense_with_gradcheck_sparse_cpu_complex64'
========= 1 skipped, 6172 deselected, 1 error in 5.29s =========
```
(note that test_to_dense_with_gradcheck_sparse_cpu_complex64 passes on its own but failed when including the skipped test_export.py tests)
after:
```
$ rm test/dynamo_skips/TestSparseCPU.test_to_dense_with_gradcheck_sparse_cpu_complex64
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest -s test/dynamo/test_export.py test/test_sparse.py -k 'test_to_dense_with_gradcheck_sparse_cpu_complex64'
===================== 1 passed, 6173 deselected in 5.42s =====================
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest -s test/dynamo/test_export.py test/test_sparse.py -k 'test_torch_inference_mode_ctx or test_to_dense_with_gradcheck_sparse_cpu_complex64'
===================== 1 passed, 1 skipped, 6172 deselected in 7.30s ======================
```
(note that test_to_dense_with_gradcheck_sparse_cpu_complex64 passes in both runs)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122073
Approved by: https://github.com/zou3519
Summary:
Kineto traces use microsecond-level granularity because Chrome tracing defaults to that precision. Fix by adding a preprocessor flag to TARGETS and BUCK files. Also remove any unnecessary ns-to-us conversions made in the profiler itself.
This diff contains profiler changes only. Libkineto changes found in D54964435.
Test Plan:
Check JSON and chrome tracing to make sure values are as expected. Tracing with flags enabled should have ns precision. Tracings without flags should be same as master.
Tracing with flags enabled: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Mar_18_14_37_22.4155151.pt.trace.json.gz&bucket=gpu_traces
Tracing without flags enabled: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Mar_18_14_39_15.4166047.pt.trace.json.gz&bucket=gpu_traces
Tracing on main: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Mar_18_14_42_43.4177559.pt.trace.json.gz&bucket=gpu_traces
Ran key_averages() to make sure FunctionEvent code working as expected:
| Name | Self CPU % | Self CPU | CPU total % | CPU total | CPU time avg | Self CUDA | Self CUDA % | CUDA total | CUDA time avg | # of Calls |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ProfilerStep* | 0.74% | 3.976ms | 64.40% | 346.613ms | 69.323ms | 0.000us | 0.00% | 61.710ms | 12.342ms | 5 |
| Optimizer.zero_grad#SGD.zero_grad | 0.76% | 4.109ms | 0.76% | 4.109ms | 821.743us | 0.000us | 0.00% | 0.000us | 0.000us | 5 |
| ## forward ## | 6.89% | 37.057ms | 27.19% | 146.320ms | 29.264ms | 0.000us | 0.00% | 58.708ms | 11.742ms | 5 |
| aten::conv2d | 0.22% | 1.176ms | 7.74% | 41.658ms | 157.199us | 0.000us | 0.00% | 27.550ms | 103.962us | 265 |
| aten::convolution | 0.79% | 4.273ms | 7.52% | 40.482ms | 152.762us | 0.000us | 0.00% | 27.550ms | 103.962us | 265 |
| aten::_convolution | 0.69% | 3.688ms | 6.73% | 36.209ms | 136.637us | 0.000us | 0.00% | 27.550ms | 103.962us | 265 |
| aten::cudnn_convolution | 6.04% | 32.520ms | 6.04% | 32.520ms | 122.719us | 27.550ms | 8.44% | 27.550ms | 103.962us | 265 |
| aten::add_ | 2.42% | 13.045ms | 2.42% | 13.045ms | 30.694us | 12.700ms | 3.89% | 12.700ms | 29.882us | 425 |
| aten::batch_norm | 0.19% | 1.027ms | 8.12% | 43.717ms | 164.971us | 0.000us | 0.00% | 16.744ms | 63.185us | 265 |
| aten::_batch_norm_impl_index | 0.31% | 1.646ms | 7.93% | 42.691ms | 161.096us | 0.000us | 0.00% | 16.744ms | 63.185us | 265 |
Differential Revision: D55087993
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122425
Approved by: https://github.com/aaronenyeshi
Summary: Now that we can pass input shapes as input args for RecordFunctionFast, let's add that to the triton heuristics. Also, let's add the ability to pass a tuple into the RecordFunctionFast constructor.
Test Plan:
Ran both the _inductor/test_profile.py and profiler/test_profiler.py unit tests. Also added tuple based unit test to profiler/test_profiler.py
Ran record_function_fast.py from the following branch
https://github.com/pytorch/pytorch/compare/sraikund/record_funct_test?expand=1
No shape or args: tests function fast with no args and profile without record_shapes
With shape tests: tests function fast with args and profile with record_shapes true
Args no shape: tests function fast with args inputted but record_shapes set to false
Args shape tuple: tests function fast with args inputted in form of tuple and record_shapes true
Stdout:
No shape or args:: 1.8491458892822266 us
With shape:: 2.211381196975708 us
Args no shape:: 1.9212646484375 us
With shape tuple:: 2.245788335800171 us
Differential Revision: D55809967
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123459
Approved by: https://github.com/davidberard98
Previously it worked with torchgen.model.FunctionSchema. This PR extends
it to work with torch._C._FunctionSchema by making
torchgen.model.FunctionSchema look more like torch._C._FunctionSchema.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123108
Approved by: https://github.com/albanD
Today, we error out on FakeTensor.data_ptr under torch.compile. This PR
extends that to also error out on FakeTensor.data_ptr in eager mode, to avoid
diverging behavior.
We do this by adding another bit onto FakeTensor that we'll remove after
the deprecation cycle.
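A minimal sketch of the new eager behavior (assuming the standard FakeTensorMode entry point; during the deprecation cycle this may warn rather than raise):
```python
import torch
from torch._subclasses.fake_tensor import FakeTensorMode

with FakeTensorMode():
    t = torch.empty(4)      # t is a FakeTensor created under the mode
    try:
        t.data_ptr()        # previously only errored under torch.compile; now errors in eager too
    except Exception as e:
        print(type(e).__name__, e)
```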
Test Plan:
- tested locally
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123292
Approved by: https://github.com/eellison
ghstack dependencies: #123261, #123282, #123291
Summary: RECORD_FUNCTION in C++ and torch.profiler.record_function already support recording inputs. Let's do the same for RecordFunctionFast.
Test Plan: Add tests in test_profiler.py that take args and also do not take args so we can support it being an optional parameter
Differential Revision: D55648870
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123208
Approved by: https://github.com/davidberard98
# Motivation
Add some attributes to `XPUDeviceProp` and expose them via `torch.xpu.get_device_properties` and `torch.xpu.get_device_capability`. They can be used in `torch.compile` or directly passed to triton to generate more optimized code based on device properties.
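A usage sketch, assuming an XPU-enabled build with at least one device (the attribute names follow the list below):
```python
import torch

props = torch.xpu.get_device_properties(0)
print(props.has_fp16, props.has_fp64, props.has_atomic64)
print(props.driver_version, props.vendor, props.version)
print(torch.xpu.get_device_capability(0))
```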
# Additional Context
Expose the following attributes via `torch.xpu.get_device_properties`:
- `has_fp16` (newly added)
- `has_fp64` (newly added)
- `has_atomic64` (newly added)
- `driver_version`
- `vendor`
- `version`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121898
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/malfet, https://github.com/albanD, https://github.com/atalman
This PR:
- disallows FakeTensor.data_ptr when it is called inside PT2 or fx tracing.
- disallows FunctionalTensor.data_ptr (python FunctionalTensor is only used in
PT2)
The motivation behind this is that the leading cause of segfaults when
using custom ops with PT2 is calling .data_ptr on FunctionalTensor or
FakeTensor.
This change is BC-breaking. If your code broke as a result of this, it's
because there was a bug in it (these .data_ptr should never be
accessed!). You can either fix the bug (recommended) or get the previous
behavior back with:
```
from torch._subclasses.fake_tensor import FakeTensor
from torch._subclasses.functional_tensor import FunctionalTensor
data_ptr = 0 if isinstance(tensor, (FakeTensor, FunctionalTensor)) else tensor.data_ptr()
```
Test Plan:
- existing tests
Differential Revision: [D55366199](https://our.internmc.facebook.com/intern/diff/D55366199)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122514
Approved by: https://github.com/ezyang, https://github.com/albanD, https://github.com/yifuwang, https://github.com/kurtamohler
Fixes https://github.com/pytorch/pytorch/issues/121085
This PR is pretty involved, so pay attention to this description. At a high
level, the refactor is intended to be mechanical: anywhere in
MetaConverter where previously we took a Tensor as argument, we now take
a MetaTensorDesc, which contains all of the information that we would
have queried off of the Tensor, but placed into a separate data
structure which we can serialize or use to recreate a fake tensor in
a separate fake tensor mode in exact fidelity to the original.
However, this transformation is not always entirely mechanical. Here
is what you need to pay attention to:
- The memo table from real Tensor -> meta/fake Tensor is now broken
into two memo tables: real Tensor -> stable int id -> meta/fake
Tensor. The stable int id is needed so that when we do serialization,
we know when tensors/storages alias each other and can ensure we preserve
this aliasing upon deserialization (a minimal sketch of the two-table structure follows after this list).
The way I have implemented this changes the weak reference behavior.
Previously, when either the real Tensor OR the meta/fake Tensor went
dead, we would remove the entry from the memo table. Now, this only
removes entries from one of the two memo tables. This semantically
makes sense, because the user may have held on to the stable int id
out of band, and may expect a real Tensor to continue to be numbered
consistently / expect to be able to lookup a meta/fake tensor from
this id. If this is unacceptable, it may be possible to rejigger
the memo tables so that we have real Tensor -> stable int id
and real Tensor -> meta/fake Tensor, but TBH I find the new
implementation a lot simpler, and arranging the memo tables in this
way means that I have to muck around with the real tensor to save
to the memo table; in the current implementation, I never pass the
Tensor to meta_tensor function AT ALL, which means it is impossible
to accidentally depend on it.
- When I fill in the fields of MetaTensorDesc in describe_tensor, I need
to be careful not to poke fields when they are not valid. Previously,
preconditions were implicitly checked via the conditional structure
("is this sparse? is this nested?") that is tested before we start
reading attributes. This structure has to be replicated in
describe_tensor, and I have almost assuredly gotten it wrong on my
first try (I'll be grinding through it on CI; a careful audit will
help too, by auditing that I've tested all the same conditionals that
the original access was guarded by.)
- I originally submitted https://github.com/pytorch/pytorch/pull/121821
for the symbolic shapes change, but it turned out the way I did it
there didn't actually work so well for this PR. I ended up just
inlining the symbolic shapes allocation logic into MetaConverter
(look for calls to maybe_specialize_sym_int_with_hint), maybe there
is a better way to structure it, but what I really want is to
just read sizes/strides/offset directly off of MetaTensorDesc; I
don't want another intermediate data structure.
- Some fields aren't serializable. These are documented as "NOT
serializable". ctx/type should morally be serializable and I just
need to setup a contract with subclasses to let them be serialized.
The fake_mode is used solely to test if we are refakefying with
a pre-existing ShapeEnv and we want to reuse the SymInt
directly--serializing this case is hopeless but I am kind of hoping
after this refactor we do not need this at all. view_func is not
serializable because it's a bound C implemented method. Joel has
promised me that this is not too difficult to actually expose as a
true data structure, but this is the edgiest of edge cases and there
is no reason to deal with it right now.
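As a minimal sketch (not the actual MetaConverter code; the class and method names are illustrative, and `torch.utils.weak.WeakIdKeyDictionary` is assumed here as the weak-keyed map), the split memo tables described in the first bullet can be pictured like this:
```python
import weakref

import torch
from torch.utils.weak import WeakIdKeyDictionary

class MemoSketch:
    """Illustrative only: real Tensor -> stable int id -> meta/fake Tensor."""

    def __init__(self):
        self.next_id = 0
        # Entry disappears when the *real* tensor dies.
        self.real_to_id = WeakIdKeyDictionary()
        # Entry goes stale when the *meta/fake* tensor dies.
        self.id_to_fake = {}

    def stable_id(self, real: torch.Tensor) -> int:
        if real not in self.real_to_id:
            self.real_to_id[real] = self.next_id
            self.next_id += 1
        return self.real_to_id[real]

    def remember_fake(self, stable_id: int, fake: torch.Tensor) -> None:
        self.id_to_fake[stable_id] = weakref.ref(fake)

    def lookup_fake(self, stable_id: int):
        ref = self.id_to_fake.get(stable_id)
        return ref() if ref is not None else None
```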
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122044
Approved by: https://github.com/eellison
List of changes:
- Replace JVP_NESTING by torch._C._functorch.maybe_current_level()
- Remove all increment nesting functions from wrap_fx_proxy_cls
- fwAD.make_dual receives the dual_level as keyword argument
- Add jvp_increment_nesting, set_fwd_grad_enabled and dual_level context managers to dynamo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119926
Approved by: https://github.com/zou3519
See #113541
The PR allows for registering and controlling multiple RNG states using indices, ensuring cudagraph-safe operations, and includes both C++ and Python API changes to support this functionality.
cc @eellison @anijain2305 @jansel @ezyang @ptrblck @csarofeen @mcarilli
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114068
Approved by: https://github.com/ezyang
**Summary:**
This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:
```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
aten.native_batch_norm ->
aten._native_batch_norm_legit (export only) ->
_batch_norm_legit_cpu/cuda (kernels, export only) ->
_batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```
Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.
Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:
```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```
The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:
```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```
Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2-week FC
window, and the ops used in the old stack are planned to
be removed after the 6-month BC window.
Test Plan: `OpInfo` tests for `batch_norm_with_update`.
Reviewers: albanD, bdhirsh
Subscribers: albanD, bdhirsh, supriyar
Tasks: https://github.com/pytorch/pytorch/issues/111384
Differential Revision: [D54805279](https://our.internmc.facebook.com/intern/diff/D54805279)
Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
In particular this ensures we release the GIL when serializing:
- PyBytes objects (this is how we get the pickle object)
- Storage objects
Other string-like objects keep the GIL, which is fine because we only use this for very small strings today (for endianness), so releasing the GIL is not important there.
Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120818
Approved by: https://github.com/colesbury
This pull request provides an update on the recent advancements made in the PyTorch profiler with regard to XPU backend support. Following the successful merge of a previous pull request #94502 that established a pathway for the XPU backend within PyTorch, we have now taken steps to enhance the profiler's capabilities for handling and displaying profile data directly related to the XPU backend.
# Motivation
The current pull request builds upon this foundation by refining the profiler's data processing scripts, particularly `profiler_util.py`, to accommodate XPU backend-specific profile data. The aim is to align the handling and presentation of this data with that of the CUDA backend, offering users a consistent experience across different device profiles. This includes generating outputs such as JSON files compatible with Chrome trace tooling, among other formats.
# Principles
1. Minimal Impact: The modifications introduced should support XPU backend data with minimal disruption to the existing profiling scripts.
2. Consistency: Changes should maintain stylistic and functional consistency with existing `CUDA` and `privateuse1` pathways, ensuring no adverse effects on other logic paths.
3. Exclusivity: Ensure that the new XPU pathway does not interfere with or impede other pathways.
# Solutions
### a. Pathway Identification:
Introduction of a `use_xpu` flag within `torch.autograd.profiler.profile` interfaces to distinguish XPU-specific profiling.
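A usage sketch of the new pathway, assuming an XPU-enabled build with at least one device and the `use_xpu` flag described above:
```python
import torch
from torch.autograd import profiler

with profiler.profile(use_xpu=True, record_shapes=True) as prof:
    x = torch.randn(1024, 1024, device="xpu")
    y = x @ x
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
prof.export_chrome_trace("xpu_trace.json")  # Chrome trace now includes XPU rows
```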
### b. `use_device` Logic Revision:
With the introduction of the XPU pathway, `use_device` no longer implies a binary relationship with `use_cuda`. Consequently, we have revised related logic to remove implicit assertions and establish independent device distinction.
### c. Kernel List Segregation:
To accommodate the non-binary nature of device pathways, we have enabled kernel lists to identify specific device affiliations through separate list objects.
### d. Formatted Output:
To ensure output consistency, we have employed code duplication and keyword substitution techniques to facilitate the formatting of XPU-related profile data.
# Additional Enhancements
### a. Enumerations in `.pyi` Files:
Added recognition items for `DeviceType` and `ProfilerActivity` specific to XPU.
### b. Correct DeviceType Returns:
Revised `deviceTypeFromActivity` logic to accurately differentiate between device backends, even when they share common flags such as `libkineto::ActivityType::GPU_MEMCPY`.
### c. Bug Fixes in `cuda_corr_map`:
Addressed a corner case where erroneous parent-child event relationships were formed due to shared function event identifiers. The solution involves refining `cuda_corr_map` processing to prevent a function event from being misidentified as both the linker and linkee.
# Further Abstraction
Looking forward, we acknowledge the potential for further abstraction in the codebase. The current changes necessitated by XPU support have highlighted opportunities for reducing redundancy by consolidating naming conventions and utilizing a singular `device` naming system that relies on `DeviceType` attributes or string flags for differentiation. This would involve significant refactoring to replace device-specific flags and variables. This topic needs further discussions about whether we could and when we should deprecate all those flags and variables named with `cuda`.
# Next Pull Request
The next pull request will be contingent on Kineto's adoption of Intel's forthcoming PTI-sdk library, which will enable direct usage of XPU-related tracers. Subsequent modifications to `libkineto_init()` will aim to endow PyTorch running on XPU backends with comprehensive profiling capabilities on XPU devices.
We appreciate your attention to these enhancements and welcome any feedback or questions you may have regarding these developments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120185
Approved by: https://github.com/aaronenyeshi, https://github.com/gujinghui
# Motivation
According to [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842) and [[RFC] Intel GPU Runtime Upstreaming for Allocator](https://github.com/pytorch/pytorch/issues/116322), we will upstream the key functionality of the device `Allocator` dedicated to XPU to PyTorch. Following our design, we will also prepare to generalize `Allocator` in parallel.
# Design
In the current design, XPU uses an `XPUAllocator` class, inherited from `c10::Allocator`. `XPUAllocator` is a manager to handle `DeviceCachingAllocator`, which is a per-device implementation of the caching mechanism to manage the already cached or newly allocated memory. The caching mechanism is similar to other backends, like CUDA. We can visualize the design as below.
<p align="center">
<img width="162" alt="image" src="https://github.com/pytorch/pytorch/assets/106960996/6b17b8cf-e7d1-48b4-b684-f830c409d218">
</p>
# Additional Context
We're going to implement our design gradually. This PR covers the device `Allocator` dedicated to XPU. The second PR covers the host `Allocator`.
Besides these PRs, we plan to make the device `Allocator` device-agnostic through another PR.
In this PR, our device `Allocator` has the same memory management mechanism as CUDA, but lacks features such as expandable segments and statistics. We will add these features back in the subsequent PR, which intends to generalize `Allocator`.
The differences with CUDA:
only key functionality, and lack of AsyncAllocator, gpu_trace, history_record, graph functionality, memory snapshot, memory statistics, expandable segment...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118091
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/albanD
ghstack dependencies: #117611, #117619, #117734
# Motivation
As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the next runtime component we would like to upstream is `Event`, which tracks the status of an operation that is being executed. In some circumstances, `Event` lets us control operation execution at a fine grain.
# Design
`XPUEvent` is a movable but not copyable wrapper around a SYCL event. It should be created lazily on an XPU device when recording an `XPUStream`. Meanwhile, `XPUEvent` can wait for another `XPUEvent` or for all the kernels submitted on an `XPUStream` to complete. Aligning with the other backends, the C++ files related to `Event` will be placed in the `aten/src/ATen/xpu` folder. For the frontend, the `XPUEvent` runtime API will be bound to Python as `torch.xpu.Event`. The corresponding C++ code will be placed in `torch/csrc/xpu/Event.cpp` and the Python code in `torch/xpu/streams.py`, respectively.
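A usage sketch of the Python binding, assuming an XPU-enabled build; the record/synchronize/query methods are assumed to mirror `torch.cuda.Event`, per the near 1:1 mapping noted in the Additional Context section:
```python
import torch

e = torch.xpu.Event()
s = torch.xpu.current_stream()
y = torch.ones(1024, device="xpu") * 2
e.record(s)        # record the event on the current stream
e.synchronize()    # block until the recorded work has completed
print(e.query())   # True once the work captured by the event is done
```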
# Additional Context
It is worth mentioning that the `elapsed_time` method is temporarily not supported by `XPUEvent`; we will add support for it soon. Meanwhile, `XPUEvent` doesn't support IPC across processes. For the other parts, we have almost a 1:1 mapping with CUDA.
It lacks the below APIs:
- `torch.cuda.Event.ipc_handle`
- `CUDAEvent`'s constructor with `IpcEventHandle`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117734
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #117611, #117619
Summary:
This PR tries to resolve issue #119215.
Basically, process group shutdown (and hence ncclCommAbort) is called in the
destroy_process_group APIs for the corresponding PGs, and in the
destructor of ProcessGroup we avoid calling abort/ncclCommAbort.
Instead, the destructor just checks whether the user has already explicitly called destroy_process_group. If
not, it will log a warning and encourage/expect users to do so
to clean up the PGs' resources.
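A minimal single-rank sketch of the expected usage after this change (NCCL backend assumed; skipping the explicit cleanup now only triggers the destructor warning):
```python
import torch
import torch.distributed as dist

dist.init_process_group(
    backend="nccl", init_method="file:///tmp/rdzv", rank=0, world_size=1
)
t = torch.ones(4, device="cuda")
dist.all_reduce(t)
dist.destroy_process_group()  # explicitly shut down the PG (and abort NCCL comms)
```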
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119250
Approved by: https://github.com/minsii, https://github.com/kwen2501
# Motivation
According to [[1/2] Intel GPU Runtime Upstreaming for Stream](https://github.com/pytorch/pytorch/pull/117611), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the second PR covers the changes under `python frontend`.
# Design
Currently, it primarily offers stream-related APIs (a usage sketch follows the list), including
- `torch.xpu.StreamContext`
- `torch.xpu.current_stream`
- `torch.xpu.set_stream`
- `torch.xpu.synchronize`
- `torch._C._xpu_getCurrentRawStream`
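A usage sketch, assuming an XPU-enabled build with at least one device; `torch.xpu.Stream` is assumed to be available alongside the APIs listed above:
```python
import torch

s = torch.xpu.Stream()
with torch.xpu.StreamContext(s):
    y = torch.ones(8, device="xpu") * 2  # work enqueued on stream s
torch.xpu.synchronize()                   # wait for all outstanding XPU work
print(torch.xpu.current_stream())
```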
# Additional Context
We will implement functions like `torch.xpu.Stream.wait_event`, `torch.xpu.Stream.wait_stream`, and `torch.xpu.Stream.record_event` in the next PR, related to `Event`.
The differences with CUDA:
there is no default or external stream in XPU, and the below APIs are lacking:
- `torch.cuda.ExternalStream`
- `torch.cuda.default_stream`
- `torch.cuda.is_current_stream_capturing`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117619
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/albanD
ghstack dependencies: #117611
Attempt #2 for https://github.com/pytorch/pytorch/pull/117875 to fix https://github.com/pytorch/pytorch/issues/112090.
Summary of changes:
- ~Changed CacheEntry linked list into a doubly-linked list structure to support deletion.~ (done by C++ refactor)
- Added CacheEntry and ExtraState borrowed references to GuardFn so that GuardFn can tell ExtraState to delete CacheEntry when the GuardFn is invalidated.
- ~Added ExtraState raw reference to CacheEntry so that we can get ExtraState to correctly point to the first CacheEntry if it gets deleted.~ (done by C++ refactor)
- CacheEntry destructor needs to reset GuardFn refs to ExtraState/CacheEntry in order to prevent use-after-free.
- code_context values that are nn.GraphModules need to be weakrefs in order to prevent circular references.
- Added tests that check for memory leaks and cache deletion operations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119107
Approved by: https://github.com/jansel
# Motivation
According to [[1/4] Intel GPU Runtime Upstreaming for Device](https://github.com/pytorch/pytorch/pull/116019), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), this third PR covers the changes under `libtorch_python`.
# Design
This PR primarily offers device-related APIs in the Python frontend (a usage sketch follows the list), including
- `torch.xpu.is_available`
- `torch.xpu.device_count`
- `torch.xpu.current_device`
- `torch.xpu.set_device`
- `torch.xpu.device`
- `torch.xpu.device_of`
- `torch.xpu.get_device_name`
- `torch.xpu.get_device_capability`
- `torch.xpu.get_device_properties`
- ====================
- `torch.xpu._DeviceGuard`
- `torch.xpu._is_compiled`
- `torch.xpu._get_device`
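A usage sketch, assuming an XPU-enabled build with at least one device:
```python
import torch

if torch.xpu.is_available():
    print(torch.xpu.device_count())
    torch.xpu.set_device(0)
    print(torch.xpu.current_device())
    print(torch.xpu.get_device_name(0))
    print(torch.xpu.get_device_capability(0))
```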
# Additional Context
We will implement the support of lazy initialization in the next PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116850
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/malfet
Summary:
There is an annoying inconsistency in how we pickle custom objs.
`torch.save` will invoke regular pickle, for which we have bound `__setstate__`/`__getstate__` methods on `torch.ScriptObject`: https://fburl.com/code/4howyl4u.
This serializes in a different format than TorchScript does, which uses the TS C++ pickler.
The issue we were facing was using the Python pickler to save, and the C++ pickler to load. If we use the C++ pickler to both save and load (plus some plumbing to get type/object resolution to work correctly), then things should work.
Test Plan:
ran SherlockNoMad's repro
```
buck2 run 'fbcode//mode/dev-nosan' scripts/bahuang:export_torchbind -- --logging DBG
```
Got to a new error, which has to do with how we're initializing the graph, but will leave that for future diffs.
Reviewed By: SherlockNoMad
Differential Revision: D53248454
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118791
Approved by: https://github.com/qxy11, https://github.com/SherlockNoMad, https://github.com/khabinov
This diff introduces an env var `_USE_NATIVE_C10D_FUNCTIONAL` that tells `_functional_collective` to use native `c10d_functional` ops. The Python version and the native version will co-exist until we completely switch to the native version after more testing and verification.
NOTE: `DeviceMesh` support for native `c10d_functional` will be added in a subsequent PR.
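A sketch of opting in, assuming the flag is read when `torch.distributed._functional_collectives` is first imported (an assumption about when the env var is consulted):
```python
import os

os.environ["_USE_NATIVE_C10D_FUNCTIONAL"] = "1"  # must be set before the module is imported
import torch.distributed._functional_collectives as funcol  # noqa: E402
```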
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113057
Approved by: https://github.com/LucasLLC, https://github.com/wconstab, https://github.com/wanchaol
Summary:
Test Plan:
```
lintrunner --take MYPYINDUCTOR --all-files
ok No lint issues.
lintrunner -a
ok No lint issues.
Successfully applied all patches.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116311
Approved by: https://github.com/int3
Summary: Now we can allocate an AOTIModelContainerRunner object instead of relying on torch.utils.cpp_extension.load_inline. Also renamed AOTInductorModelRunner to AOTIRunnerUtil in this PR.
Test Plan: CI
Reviewed By: khabinov
Differential Revision: D52339116
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116269
Approved by: https://github.com/khabinov
When exporting a model with a convolution kernel on CPU, if mkldnn is disabled and nnpack is enabled, export will go down the nnpack-optimized convolution kernel path for certain shapes ([code pointer](cd449e260c/aten/src/ATen/native/Convolution.cpp#L542-L552)). This means that we will automatically create a guard on that specific shape. If users want to export without this restriction, one option is to disable nnpack. However, no config function exists for this, so this PR adds one, similar to the `set_mkldnn_enabled` function.
Original context is in https://fb.workplace.com/groups/1075192433118967/posts/1349589822345892/?comment_id=1349597102345164&reply_comment_id=1349677642337110.
To test the flag, the following script runs successfully:
```
import os
import torch
from torchvision.models import ResNet18_Weights, resnet18

torch.set_float32_matmul_precision("high")
model = resnet18(weights=ResNet18_Weights.DEFAULT)
model.eval()
with torch.no_grad():
    # device = "cuda" if torch.cuda.is_available() else "cpu"
    torch.backends.mkldnn.set_flags(False)
    torch.backends.nnpack.set_flags(False)  # <--- Added config
    device = "cpu"
    model = model.to(device=device)
    example_inputs = (torch.randn(2, 3, 224, 224, device=device),)
    batch_dim = torch.export.Dim("batch", min=2, max=32)
    so_path = torch._export.aot_compile(
        model,
        example_inputs,
        # Specify the first dimension of the input x as dynamic
        dynamic_shapes={"x": {0: batch_dim}},
        # Specify the generated shared library path
        options={
            "aot_inductor.output_path": os.path.join(os.getcwd(), "resnet18_pt2.so"),
            "max_autotune": True,
        },
    )
```
I'm not sure who to add as reviewer, so please feel free to add whoever is relevant!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116152
Approved by: https://github.com/malfet
Prerequisite for adding more complex type support and FFT operations.
Check using the `conjugateWithTensor:name:` selector, defined as follows:
```objc
/// Returns the complex conjugate of the input tensor elements.
///
/// - Parameters:
/// - tensor: The input tensor.
/// - name: An optional string which serves as an identifier for the operation..
/// - Returns: A valid `MPSGraphTensor` object containing the elementwise result of the applied operation.
-(MPSGraphTensor *) conjugateWithTensor:(MPSGraphTensor *) tensor
name:(NSString * _Nullable) name
MPS_AVAILABLE_STARTING(macos(14.0), ios(17.0), tvos(17.0))
MPS_SWIFT_NAME( conjugate(tensor:name:) );
```
- Rename `isOnMacOS13orNewer(unsigned minor)` hook to `isOnMacOSorNewer(major, minor)`
- Replace `torch._C.__mps_is_on_macos_13_or_newer` with `torch._C._mps_is_on_macos_or_newer`
- Add `torch.backends.mps.is_macos_or_newer` public API
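A small sketch of the new availability check (the `(major, minor)` argument order mirrors the renamed C++ hook and is an assumption):
```python
import torch

if torch.backends.mps.is_available() and torch.backends.mps.is_macos_or_newer(14, 0):
    print("macOS 14+ only MPSGraph APIs (e.g. conjugateWithTensor:name:) can be used")
```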
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115512
Approved by: https://github.com/albanD
Re-enable type checking for distributed_c10d.py
Type checking for distributed_c10d.py was inadvertently turned off, and type issues have accumulated since.
Note: the backwards compatibility linter does not like some of these changes. But they were incorrect before. This needs human verification, however.
#suppress-api-compatibility-check
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115223
Approved by: https://github.com/wconstab
This PR is proposing a new approach to solve the nn/optim only linked by python object identity problem.
The idea is to have a function that can swap the content of two Tensors t1 and t2 while preserving all the old references.
This would allow us to swap the `model.weight` with a new Tensor (can be any subclass of Tensor and any TensorImpl (xla, sparse, nested tensorimpl would work)). The use within nn will be done in a follow up.
This is done by swapping the whole content of the PyObject and then putting back the fields associated with external references (refcount, gc tracking and weakrefs).
Note that we have to properly handle all the cases where there is memory used before the public pointer PyObject* and where the PyObject is bigger due to dict/weakref being inlined (older CPython version) or due to slots.
The main limitation of this approach is that the number of slots needs to match for the objects being swapped, which limits the usage of slots in subclasses.
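A sketch of the intended semantics; the public helper name `torch.utils.swap_tensors` used here is an assumption (this PR introduces the underlying mechanism):
```python
import torch

t1 = torch.ones(3)
t2 = torch.zeros(3)
ref = t1                          # an old reference to t1's PyObject
torch.utils.swap_tensors(t1, t2)
print(ref)                        # ref now observes the zeros that previously lived in t2
```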
Draft right now to see what @colesbury thinks about doing this?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111747
Approved by: https://github.com/colesbury
https://github.com/pytorch/pytorch/pull/113580 introduced the `DDP._update_process_group` API. However, the implementation did not correctly reset all of the necessary state in the reducer. In particular if an error occurred during backward, DDP would end up in an incorrect state.
As a result, in this PR I've enhanced the unit test to test for this case and also appropriately fixed resetting Reducer state.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114194
Approved by: https://github.com/rohan-varma
Skipping importing some packages for now to make this change more
tractable.
For some reason, lintrunner on CI raises errors in all imported `.pyi` files,
even though it doesn't on my local machine. The errors are all from missing
generic types, as the MYPYINDUCTOR config has `disallow_any_generics`
set. I have thus added `disable-error-code` comments to the relevant files,
though I fixed a few that were easy enough.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113830
Approved by: https://github.com/Skylion007
ghstack dependencies: #113722, #113721
Using mypy in code that depends on pytorch, I noticed that the type annotation doesn't allow a device ordinal.
`error: Argument "device" to "to_empty" of "Module" has incompatible type "int"; expected "str | device" [arg-type]`
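With the widened annotation, passing a device ordinal typechecks; a small sketch (ordinal 0 maps to cuda:0 at runtime, so this needs a CUDA build):
```python
import torch.nn as nn

m = nn.Linear(4, 4)
m = m.to_empty(device=0)  # previously flagged by mypy as int vs. str | device
```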
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113647
Approved by: https://github.com/albanD
# Motivation
If we would like to reinitialize DDP with a different PG with `torch.compile`, we need to do the following:
```
del old_ddp
del old_pg
pg = init_pg(...)
ddp = DDP(pg)
model = torch.compile(ddp)
```
This results in recompilation of the entire model and is very expensive. Since the only thing we need to update is the PG, we should be able to do this without having to compile the model again.
# Proposal
As a result, in this PR I've introduced an `_update_process_group` API which can dynamically update the underlying ProcessGroup used by DDP without needing to reinitialize DDP again.
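A runnable single-rank sketch (gloo backend) of the intended flow; the call shape of `_update_process_group` is taken from the description above:
```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("gloo", init_method="file:///tmp/ddp_rdzv", rank=0, world_size=1)
ddp = DDP(torch.nn.Linear(4, 4))
compiled = torch.compile(ddp)
compiled(torch.randn(2, 4))

new_pg = dist.new_group(ranks=[0])
ddp._update_process_group(new_pg)  # swap the PG in place; no recompilation needed
compiled(torch.randn(2, 4))
```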
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113580
Approved by: https://github.com/fduwjj
SymIntType is referenced by wrapper.py, so I added its .pyi definition.
I also added SymBoolType along the way for completeness.
The `isinstance` checks in wrapper.py reference torch.Type, which seems
to cause mypy to choke. Not entirely sure why; I've just added
type-ignore comments for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113411
Approved by: https://github.com/Skylion007
ghstack dependencies: #113409, #113410
Summary:
We've made the following changes:
- The new way to use the API is `m.impl_abstract_pystub(module, context)`.
Every subsequent m.def of an op inside the TORCH_LIBRARY block gives
the op the `impl_abstract_pystub`.
- Added a mechanism to determine if an operator was defined in Python or C++.
Library.define in Python appends the op to a global set, which is analogous
to what we do for tracking Library.impl.
- If someone does `torch.library.impl_abstract` in Python for an operator, then
we require that it has an `impl_abstract_pystub` specified and we also check
that the module in the `impl_abstract_pystub` is the same as the module where
the call to `torch.library.impl_abstract` exists (a sketch of the Python side follows this list).
- Unfortunately we can't check the "context" (which is the buck target on
buck-based systems) because buck sits above us.
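A sketch of the Python side; the op `mylib::my_op` and its C++ `TORCH_LIBRARY` block (which would call `m.impl_abstract_pystub` pointing at this module) are assumptions for illustration:
```python
import torch

@torch.library.impl_abstract("mylib::my_op")
def my_op_abstract(x):
    # shape/dtype propagation only; no real data access
    return torch.empty_like(x)
```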
bypass-github-export-checks
Test Plan: - existing tests
Differential Revision: D51080493
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113182
Approved by: https://github.com/ezyang
Summary:
We've made the following changes:
- The new way to use the API is `m.impl_abstract_pystub(module, context)`.
Every subsequent m.def of an op inside the TORCH_LIBRARY block gives
the op the `impl_abstract_pystub`.
- Added a mechanism to determine if an operator was defined in Python or C++.
Library.define in Python appends the op to a global set, which is analogous
to what we do for tracking Library.impl.
- If someone does `torch.library.impl_abstract` in Python for an operator, then
we require that it has an `impl_abstract_pystub` specified and we also check
that the module in the `impl_abstract_pystub` is the same as the module where
the call to `torch.library.impl_abstract` exists.
- Unfortunately we can't check the "context" (which is the buck target on
buck-based systems) because buck sits above us.
Test Plan: - existing tests
Differential Revision: D50972148
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112851
Approved by: https://github.com/ezyang
Previous PRs changed the c++ default timeout for PGNccl, but this path
was only hit in some cases, and the python defaults took over in other
cases.
This PR ensures that the NCCL PG always defaults to the changed NCCL-specific
timeout value.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113094
Approved by: https://github.com/fduwjj
As part of this diff, I have upgraded the `python_version` config setting to 3.11. `bytecode_transformation.py` (and a few other files) have functions using APIs only available in Python 3.11+. Those APIs are gated by a sys.version_info check in their typeshed .pyi files. So setting the min version to 3.11 allows those functions to typecheck properly.
An alternative is to make the relevant types Any:
```
if sys.version_info >= (3, 11):
_Positions = dis.Positions
else:
_Positions = Any
```
However, with python_version = 3.8, that means we're not getting any useful typechecking signal when encountering values of type _Position.
Changing the python_version to 3.11 does mean that we will stop typechecking codepaths that run only on lower versions, but that seems a small price to pay. It does also mean that we won't catch code that uses newer APIs without the appropriate version check, but again, not sure this has much of an impact.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112561
Approved by: https://github.com/ezyang
Summary:
When passed from C++ to Python, `c10d::ProcessGroup` and `c10d::Work` are automatically converted to their pybind class which can't be used for dispatcher ops. `.boxed()` exposes `c10d::ProcessGroup` and `c10d::Work` as boxed custom class object to Python.
```python
import tempfile
import torch
import torch.distributed as dist

if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmpf:
        dist.init_process_group(
            backend="nccl", init_method=f"file://{tmpf.name}", rank=0, world_size=1
        )
        group = dist.group.WORLD
        print(group)
        print(group.boxed())
```
```
<torch.distributed.distributed_c10d.ProcessGroup object at 0x7fe42fb78d30>
ScriptObject <__torch__.torch.classes.c10d.ProcessGroup>
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111997
Approved by: https://github.com/lw
Printing just the device name is not helpful when investigating PyTorch issues filed for specific AMD GPUs, as the support/issue might depend on the gfx arch, which is part of the gcnArchName property.
`torch.cuda.get_device_properties(0).gcnArchName` will print the value of the `gcnArchName` property, e.g.:
```
>>> torch.cuda.get_device_properties(0).gcnArchName
'gfx906:sramecc+:xnack-'
```
```
root@6f064e3c19fb:/data/pytorch/test# python ../torch/utils/collect_env.py
...
GPU models and configuration: AMD Radeon Graphics(gfx906:sramecc+:xnack-)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107477
Approved by: https://github.com/albanD
Fixes #111776
Support check_regex in FileCheck() by adding `find_regex` to `struct TORCH_API StringCordView`.
The call site accepts std::regex syntax.
However, I haven't figured out submatch ID yet.
For example, "buf5[0], buf6_inputs[0]" is still considered a match.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112077
Approved by: https://github.com/yf225
torch.library.impl now accepts a device string (e.g. "cpu", "cuda"). It
still accepts DispatchKey strings, but we no longer document this, because
using arbitrary DispatchKeys is more for the power users.
We map the device string to a DispatchKey and then register the impl for
said DispatchKey. A user may also specify multiple device strings at once
or specify "types=default" to get a CompositeExplicitAutograd registration.
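A sketch of registering a kernel via a device string; the op name and schema below are assumptions for illustration:
```python
import torch

lib = torch.library.Library("mylib", "DEF")
lib.define("my_op(Tensor x) -> Tensor")

@torch.library.impl(lib, "my_op", "cpu")  # device string instead of a raw DispatchKey
def my_op_cpu(x):
    return x.clone()

print(torch.ops.mylib.my_op(torch.ones(3)))
```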
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111659
Approved by: https://github.com/soulitzer
ghstack dependencies: #111380
For synchronous ops (i.e. `asyncOp = False`), we don't want to record streams because we know that the NCCL stream will join back to the "current" stream right after this op. So we might just as well keep the stream ownership of the input/output tensors unchanged. The benefit would be that the allocation/free of the tensors would look deterministic to the "current" stream so that the caching allocator can reuse memory pool for this stream in a clever way.
To prevent the input/output tensors from being recycled by python, we rely on the stashing mechanism in ProcessGroupNCCL (which can be also turned on by setting `TORCH_NCCL_AVOID_RECORD_STREAMS=1`).
This mechanism change is for libraries like FSDP which uses `all_gather_into_tensor` and `reduce_scatter_tensor` in a synchronous way and which cannot set `TORCH_NCCL_AVOID_RECORD_STREAMS=1` for their users. And therefore, this change is limited to these two collectives for now.
Cc: @awgu @janeyx99 @albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111431
Approved by: https://github.com/H-Huang
This reverts commit 314a502eb0.
Changes since original PR:
Reland 1
* rename torch.distributed.hooks to torch.distributed._hooks
Reland 2
* make _hooks importable even if !distributed.is_available()
* handle cuda driver exit intermittent failure caused by new cuda api usage in callback caller (see prev PR in stack)
(original PR https://github.com/pytorch/pytorch/pull/108815 desc copied below)
Expose a set of observability hooks into C10D such that our users can
detect collectives failure both faster and more easily.
The design is similar to NCCL desync debug in that it minimizes the
overhead by doing most of the work off the main thread.
This PR introduces a new module torch.distributed.hooks that exposes the following set of methods:
register_collective_start_hook
register_collective_end_hook
register_process_group_hook
The process group hook exposes PG creation on the member ranks and is called inline from
the PG creation code. This is fine since this happens during initialization and a limited number of times.
The collective start/end hooks are fired from a single background thread. It reads
events from a C++ queue and dispatches them.
Queue notification is, oddly, done using a pipe; this is needed so Python can abort the thread on shutdown
and keep it as a background thread. This is not possible with more reasonable choices like a condvar.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111072
Approved by: https://github.com/malfet
ghstack dependencies: #111061
This reverts commit ff0358b038.
(original PR https://github.com/pytorch/pytorch/pull/108815 desc copied below)
Expose a set of observability hooks into C10D such that our users can
detect collectives failure both faster and more easily.
The design is similar to NCCL desync debug in that it minimizes the
overhead by doing most of the work off the main thread.
This PR introduces a new module torch.distributed.hooks that exposes the following set of methods:
register_collective_start_hook
register_collective_end_hook
register_process_group_hook
The process group hook exposes PG creation on the member ranks and is called inline from
the PG creation code. This is fine since this happens during initialization and a limited number of times.
The collective start/end hooks are fired from a single background thread. It reads
events from a C++ queue and dispatches them.
Queue notification is, oddly, done using a pipe; this is needed so Python can abort the thread on shutdown
and keep it as a background thread. This is not possible with more reasonable choices like a condvar.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110907
Approved by: https://github.com/fduwjj
Expose a set of observability hooks into C10D such that our users can
detect collectives failure both faster and more easily.
The design is similar to NCCL desync debug in that it minimizes the
overhead by doing most of the work off the main thread.
This PR introduces a new module torch.distributed.hooks that exposes the following set of methods:
register_collective_start_hook
register_collective_end_hook
register_process_group_hook
The process group hook exposes PG creation on the member ranks and is called inline from
the PG creation code. This is fine since this happens during initialization and a limited number of times.
The collective start/end hooks are fired from a single background thread. It reads
events from a C++ queue and dispatches them.
Queue notification is, oddly, done using a pipe; this is needed so Python can abort the thread on shutdown
and keep it as a background thread. This is not possible with more reasonable choices like a condvar.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108815
Approved by: https://github.com/wconstab, https://github.com/fduwjj
We want to be able to use SingletonSymNode to represent strides for jagged layout tensors. The following is for 3D, but is easily generalizable to higher dimensions.
Constraints:
- [B, x, D] (where x represents the "variable-length dim") can be strided in two ways: [x, 1, sum(x)] and [dx, d, 1]. We need two different placeholder values depending on how the jagged tensor is strided.
- When doing operations we need the strides of output tensors to be expressable in terms of the strides and sizes of the inner tensors. Given [B, x, D] @ [D, D'], the output strides is [x * D', D', 1] rather than some opaque [x2, D', 1]. This constraint exists because if I'm tracing, I need a symint to represent the output stride. This symint needs to come from somewhere; I get it in several ways: (1) create a constant, (2) unbacked symint, (3) create a new input using a source, (4) output of an operation on an existing symint. It is clear that (4) is what we want here, which brings us to the design below.
Design:
Given the two constraints, the most straightforward way to implement this is actually to update SingletonSymNode to include some scalar factor; i.e., morally, SingletonSymNode represents `factor * [s_0, s_1, …, s_n]`. This enables us to symbolically compute strides from sizes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110369
Approved by: https://github.com/ezyang
ghstack dependencies: #110044
I added some tests for Conj, Neg and ZeroTensor for both Python and C++ functionalization. This also fixes a nasty segfault when running a functorch `jacfwd` test with `torch.compile`, once AOTAutograd is using `FunctionalTensor`.
Changes:
(1) I use Jeffrey's `make_wrapper_subclass(extra_dispatch_keys)` kwarg to plumb extra dispatch keys onto the wrapper, mirroring what C++ functionalization does (C++ functionalization will mirror all dispatch keys from the inner tensor to the wrapper, except for python and functorch keys).
(2) FunctionalTensorMode will decompose CompositeImplicitAutograd ops, since (for example) ZeroTensor kernels can send ops like `.to()` directly to the Python key. We'll need a way to toggle this later for pre-dispatch functionalization
(3) Bound `_ForceDispatchKeyGuard` and BatchedTensorImpl's dispatch keyset to python
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109023
Approved by: https://github.com/zou3519
ghstack dependencies: #108654, #109662, #109632
In this PR:
- When Constant SymNodes are detected in unary/binary ops, demote them to plain int/bool before proceeding. Sometimes this means doing a unary op with a Constant SymNode would result in a plain bool.
- Introduce an is_symbolic method, only available from Python. We need this because isinstance(x, SymInt) is no longer sufficient to check whether a given int/SymInt is symbolic or not. See later PR in the stack to see how this is used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109169
Approved by: https://github.com/ezyang
This PR adds a new `FunctionalTensor` subclass, and `FunctionalTensorMode` torch dispatch mode. Together, this class/mode are a lightweight wrapper around our existing C++ functionalization logic.
This idea came from Ed - later in the stack, I want to be able to run functionalization **underneath** torch_dispatch, when performing tracing in AOTAutograd. I can't do this easily with vanilla C++ functionalization, because it has a dedicated dispatch key that always runs before TorchDispatch. However, by adding a torch_dispatch mode shim around functionalization, we can use functionalization as a torch_dispatch mode, which will make it easier to run underneath other modes later.
This PR provides the basic new classes, and some light testing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106404
Approved by: https://github.com/ezyang
**Background**: "TorchDynamo Cache Lookup" events appear in traces to indicate a dynamo cache lookup; it's useful to check when cache lookups are taking a long time. To add a profiler event, one can use the `torch.profiler.record_function` context manager, or the C++ equivalent. Previously, the python version was used; first, when the profiler was enabled, callbacks for record_function_enter and record_function_exit were registered; then those would be called before and after every cache lookup.
**This PR**: Instead of calling the python bindings for `torch.profiler.record_function`, directly call the C++ implementation. This simplifies a lot of the code for binding C/C++. It also improves performance; previously there was a lot of overhead in the "TorchDynamo Cache Lookup" event, making the event artificially take a long time. After this change the events now appear shorter, because there's less overhead in starting/stopping the event: in other words, the profiler no longer distorts the results as much.
**Performance results**:
I ran using the script below on a cpu-only 1.6GHz machine. I report the median time (from 100 measurements) of a "TorchDynamo Cache Lookup" event before and after this PR. I think it is reasonable to consider the difference to be due to a reduction in overhead.
<details>
<summary>Benchmarking script</summary>
```python
import torch

def fn(x, y):
    return (x * y).relu()

a, b = [torch.rand((4, 4), requires_grad=True) for _ in range(2)]
opt_fn = torch.compile(fn)
opt_fn(a, b)
opt_fn(a, b)

with torch.profiler.profile() as prof:
    opt_fn(a, b)
```
</details>
Median before PR: 198-228 us (median of 100, measured 5 times)
Median after PR: 27us
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108436
Approved by: https://github.com/anijain2305, https://github.com/jansel
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.
This results in messy code during error handling somewhat like this:
```
if "NCCL" in exception_str:
...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
...
if "The client socket has timed out after" in exception_str:
...
if "Broken pipe" in exception_str:
...
if "Connection reset by peer" in exception_str:
...
```
To address this issue, in this PR I've added the following error types (see the handling sketch after the list):
1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and referred to PG backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library
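A hedged sketch of what error handling can look like with this hierarchy (it assumes the classes are re-exported under `torch.distributed`; they are defined in C++ and bound through `torch._C`):
```python
import torch.distributed as dist

try:
    dist.init_process_group(backend="gloo")
except dist.DistNetworkError:
    ...  # socket-level failure: retry or report connectivity problems
except dist.DistStoreError:
    ...  # rendezvous store failed
except dist.DistBackendError:
    ...  # process-group backend (e.g. NCCL) failure
except dist.DistError:
    ...  # any other distributed error
```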
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108191
Approved by: https://github.com/H-Huang
**Update:** Made refactor of the original PR. See the original description below, but here I'll describe the updates:
(1) TLS changes in `TorchDispatchModeTLS.h/cpp`.
I added a `TorchDispatchModeKey` enum, that (for now) just contains PROXY and FAKE. The ModeTLS used to just contain a `std::vector<std::shared_ptr<c10::SafePyObject>>` corresponding to the mode stack. It now **also** contains a separate array of "infra modes", indexed by mode key (PROXY and FAKE, with a new addition, FUNCTIONAL, coming later in the stack).
`TorchDispatchModeTLS::push_onto_stack` and `TorchDispatchModeTLS::pop_stack` are now a bit more complicated. Pushing accepts an optional mode_key, which if set, tells us to add the given mode directly to our "infra_modes" array. Popping will first check the "user mode" stack, before trying to pop anything from the infra mode stack. It also optionally returns the mode key of the mode we popped if there was one - that way if we push that same mode back onto the TLS later, we know where it goes.
`TorchDispatchModeTLS::dispatch_mode_enabled()` now accepts an optional `skip_infra_modes` param, so you can separately query if there are "any modes at all", or if there are "any user modes".
`TorchDispatchModeTLS::get/set/unset_mode()` all take in a mode key, and get/set/unset the mode at that particular mode key (meaning they are only meant to be used for infra modes).
There were also some mild codegen changes to support the new enum.
(2) `fake_tensor.py/proxy_tensor.py/_python_dispatch.py`
The way I tell the infra that certain subclasses/modes are "infra" is through the enum: I gave `FakeTensor` and `FakeTensorMode` a `self._mode_key = torch._C.TorchDispatchModeKey.FAKE`. `TorchDispatchMode.__enter/exit__()` (in `_python_dispatch.py`) now checks if the current mode has a mode key, and if so plumbs it into any `push_onto_stack()` calls (which eventually instructs `TorchDispatchModeTLS` where to put the mode). Same thing for `ProxyTorchDispatchMode`.
I also had to change both of these mode's enter/exit, to handle the fact that there can no longer be multiple proxy/fake modes on the mode stack at once. I updated them both to have a `self.enter_stack: List[Optional[TorchDispatchMode]]` - whenever we push a given mode in `__enter__`, we remove the current ambient fake/proxy mode from the mode stack, and save it in `enter_stack`, so that on exit we can reset the state properly.
(3) dispatching logic in `python_arg_parser.cpp`
This is where the core dispatching logic changes are. I added two helpers, `dispatch_on_subclass()` and `dispatch_on_mode()`. The overall dispatching order is now:
```
(a) dispatch_on_mode() # try user modes first (where the mode stack automatically considers infra modes last)
(b) dispatch_on_subclass() # try user subclasses next (skipping infra subclasses)
(c) dispatch_on_subclass() # try infra subclasses next (skipping user subclasses)
```
Note that we still want "user subclasses" to run before "infra modes". As Ed helped me realize, this works today: if the proxy/fake modes are reached in step (a), they'll return NotImplemented when they see a user subclass, allowing us to redispatch to the user subclass.
How do (b) and (c) distinguish between user and infra subclasses? Infra subclasses (FakeTensor, and later FunctionalTensor) are required to have a `_mode_key` hidden on the subclass - so we filter via arguments that do/don't have the _mode_key.
(4) I also changed `DoubleTensor` to `TwoTensor` to minimize confusion (@albanD pointed out that DoubleTensor would be easily confused with `torch.FloatTensor` and friends).
----- original description below -----
The main purpose of this PR is to fix the "ordering problem" between torch_dispatch modes, where we want to ensure that our Fake and Proxy dispatch modes always run **after** any dispatch modes created by the user, regardless of where they are in the stack. See this doc for more details: https://docs.google.com/document/d/1COQ291nOZvtFnzGTQMJqoYZ3sttEYFw_7HbfSyL8gcA/edit
Full set of changes below. I ended up including a few semi-related changes in this PR that I documented - but if folks would rather I separate them out, happy to try to do that.
**(1) Add dedicated TLS slots for FakeTensorMode and ProxyTensorMode**
This is the main component of this PR. There are two new slots, `TorchDispatchModeTLS.fake_mode_` and `TorchDispatchModeTLS.proxy_mode_`, which correspond to a single "global" fake and proxy mode. There is now an invariant that `torchDispatchModeState.stack_` can never contain either of these modes.
I also added a `TorchDispatchModeTLS::maybe_highest_mode()` helper that consults the `stack_` as well as both the proxy and fake slots, and returns the highest priority mode - this is because there are a few places in the codebase where we legitimately want to get the highest priority mode, *including* fake or proxy, if one is set.
This also made the implementations of the existing `disable_proxy_modes_tracing()` and `get_innermost_proxy_mode()` marginally simpler.
**(2) Updated the dispatching logic in handle_torch_function_no_python_arg_parser()**
This is the function that actually figures out which torch_dispatch implementation to call, given the current mode stack and tensor subclass inputs. This function got marginally more complicated as part of the refactor: First we inspect the mode stack and any non-fake subclass inputs. Then we check for the proxy mode slot. Then we check for the Fake mode slot, before finally checking for any fake subclass inputs.
**(3) New python `_get_fake_tensor_mode()` and `_get_proxy_tensor_mode()` APIs**
Before, if you wanted to see if proxy or fake modes were active in python, you would have to consult the mode stack. Since these two modes are no longer part of the actual mode stack, I added two new APIs to directly check if either proxy or fake modes are active.
**(4) Allow traceable tensor subclasses to access storages from python**
This is convenient later in the stack, where AOTAutograd needs to detect aliasing of inputs and outputs, where those inputs and outputs might be tensor subclasses. Previously, `x.untyped_storage()` would raise an error if `x` was a subclass. In this PR, I tried to relax this constraint as little as possible: `THPVariable_storage()` will only try to return a storage to python if the tensor subclass that you are passing in is "traceable".
**(5) Fixed subclass fakeification**
@wanchaol recently added support to be able to fakeify tensor subclasses. That fakeification logic works in most cases, but there is one case it doesn't handle: autograd metadata. In particular, since autograd sees our tensor subclasses and not their desugared tensors, we need to make sure that our fakeified subclass has the same autograd metadata as the original subclass. I updated `meta_utils.py` to make sure that the autograd metadata is correct.
**(6) make tensor subclasses resizeable**
Previously we didn't allow tensor subclasses to be resizeable. I ran into an issue where fakeifying a tensor subclass occasionally requires swapping out its storage, which can involve resizing the tensor. Mechanically, this required updating `at::for_blob()` to expose a way to request that the tensor that you create has resizeable storage, and then using this new API in `_make_wrapper_tensor()`.
**(7) Added a basic DoubleTensor subclass for testing**
I use this subclass more later in this stack in my AOTAutograd tests - but it serves as a simple subclass example to test the dispatch ordering in this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104482
Approved by: https://github.com/ezyang
ghstack dependencies: #107415
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.
This results in messy code during error handling somewhat like this:
```
if "NCCL" in exception_str:
...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
...
if "The client socket has timed out after" in exception_str:
...
if "Broken pipe" in exception_str:
...
if "Connection reset by peer" in exception_str:
...
```
To address this issue, in this PR I've added these error types:
1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and referred to PG backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107651
Approved by: https://github.com/H-Huang
> capture_error_mode (str, optional): specifies the cudaStreamCaptureMode for the graph capture stream.
Can be "global", "thread_local" or "relaxed". During cuda graph capture, some actions, such as cudaMalloc,
may be unsafe. "global" will error on actions in other threads, "thread_local" will only error for
actions in the current thread, and "relaxed" will not error on these actions.
Inductor codegen is single-threaded, so it should be safe to enable "thread_local" for inductor's cuda graph capturing. We have seen errors when inductor cudagraphs has been used concurrently with data preprocessing in other threads.
Differential Revision: [D48656014](https://our.internmc.facebook.com/intern/diff/D48656014)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107407
Approved by: https://github.com/albanD, https://github.com/eqy
torch.profiler.record_function is relatively slow; for example, in some benchmarks I was running, x.view_as(x) was ~2us, and ~16-17us when wrapped in a record_function context. The reasons for this are: dispatcher overhead from going through an op (the main source of overhead), python binding / python conversion overhead, and some overhead from the context manager.
This new implementation is faster, but it won't work with torchscript. Based on the benchmarks I was running, it adds 0.5-0.7us overhead per call when the profiler is turned off. To use it, you can just:
```python
with torch._C._profiler_manual._RecordFunctionFast("title"):
    torch.add(x, y)
```
It implements a context manager in python which directly calls the record_function utilities, instead of calling through an op.
* The context manager is implemented directly in python because the overhead from calling a python function seems non-negligible
* All the record_function calls and python object conversions are guarded by checks for whether the profiler is enabled or not. It seems like this saves a few hundred nanoseconds.
For more details about the experiments I ran to choose this implementation, see [my record_functions experiments branch](https://github.com/pytorch/pytorch/compare/main...davidberard98:pytorch:record-function-fast-experiments?expand=1).
This also adds a `torch.autograd.profiler._is_profiler_enabled` global variable that can be used to check whether a profiler is currently enabled. It's useful for further reducing the overhead, like this:
```python
if torch.autograd.profiler._is_profiler_enabled:
    with torch._C._profiler_manual._RecordFunctionFast("title"):
        torch.add(x, y)
else:
    torch.add(x, y)
```
On BERT_pytorch (CPU-bound model), if we add a record_function inside CachedAutotuning.run:
* Naive torch.profiler.record_function() is a ~30% slowdown
* Always wrapping with RecordFunctionFast causes a regression of ~2-4%.
* Guarding with an if statement - any regression is within noise
**Selected benchmark results**: these come from a 2.20GHz machine, GPU build but only running CPU ops; running `x.view_as(x)`, with various record_functions applied (with profiling turned off). For more detailed results see the "record_functions experiments branch" linked above (those results are on a different machine, but show the same patterns). Note that the results are somewhat noisy; assume 0.05-0.1 us of variation.
```
Baseline:: 1.7825262546539307 us # Just running x.view_as(x)
profiled_basic:: 13.600390434265137 us # torch.profiler.record_function(x) + view_as
precompute_manual_cm_rf:: 2.317216396331787 us # torch._C._profiler_manual._RecordFunctionFast(), if the context is pre-constructed + view_as
guard_manual_cm_rf:: 1.7994389533996582 us # guard with _is_profiler_enabled + view_as
```
Differential Revision: [D48421198](https://our.internmc.facebook.com/intern/diff/D48421198)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107195
Approved by: https://github.com/albanD, https://github.com/aaronenyeshi
This PR allows dynamo to fakify `FunctionalTensorWrapper` by unwrapping, replacing, and re-wrapping it, so that a `FunctionalTensorWrapper` can be passed as an input to `dynamo.optimize` and we can support something like this:
```python
ff = torch.func.functionalize(f)
torch.compile(ff)(x)
```
This PR doesn't follow the \_\_tensor_flatten\_\_ and \_\_tensor_unflatten\_\_ protocol for now because we're not yet sure of the plan for doing that for FunctionalTensorWrapper (it's implemented in C++).
**Test Plan:**
Add a new test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107062
Approved by: https://github.com/zou3519
ghstack dependencies: #107042
This adds some utilities for conveniently working with fast combined CapturedTraceback from Python. The main goal of these utilities is to make it easier for people to use CapturedTraceback as a drop-in replacement for `traceback.extract_stack`, which is 20x slower than CapturedTraceback.
I port symbolic shapes to use the new CapturedTraceback code, to validate that the APIs work and are useful.
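A hypothetical usage sketch (the module path and the `format()` helper are assumptions; only `extract()` is implied by the drop-in comparison above):
```python
from torch.utils._traceback import CapturedTraceback

tb = CapturedTraceback.extract()   # cheap: snapshots raw frames, ~20x faster than extract_stack
print("".join(tb.format()))        # symbolization cost is paid only when formatting
```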
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107358
Approved by: https://github.com/zdevito, https://github.com/albanD
ghstack dependencies: #107438
This PR allows dynamo to fakify `FunctionalTensorWrapper` by unwrapping, replacing, and re-wrapping it, so that a `FunctionalTensorWrapper` can be passed as an input to `dynamo.optimize` and we can support something like this:
```python
ff = torch.func.functionalize(f)
torch.compile(ff)(x)
```
This PR doesn't follow the \_\_tensor_flatten\_\_ and \_\_tensor_unflatten\_\_ protocol for now because we're not yet sure of the plan for doing that for FunctionalTensorWrapper (it's implemented in C++).
**Test Plan:**
Add a new test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107062
Approved by: https://github.com/zou3519
ghstack dependencies: #107042
Previously, when we recorded a free action in a memory trace, we would provide
the stack for when the block was allocated. This is faster because we do not
have to record stacks for frees, which would otherwise double the number of stacks
collected. However, sometimes knowing the location of a free is useful for
figuring out why a tensor was live, so this PR adds that behavior. If
performance ends up being a concern, the old behavior is available by passing
"alloc" to the context argument rather than "all".
Also refactors some of the glue logic to be consistent across C++ and Python and
routes the Python API through the C++ version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106758
Approved by: https://github.com/albanD
- Implements `MPSEventPool` to recycle events.
- Implements Python bindings for a `torch.mps.Event` class backed by the MPSEventPool. The current member functions of the Event class are `record()`, `wait()`, `synchronize()`, `query()`, and `elapsed_time()`.
- Adds an API to measure elapsed time between two event recordings (see the usage sketch below).
- Adds documentation for the Event class to `mps.rst`.
- Adds a test case to `test_mps.py`.
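A usage sketch of the elapsed-time measurement (requires an MPS-capable Mac; the `enable_timing` flag is an assumption mirroring `torch.cuda.Event`):
```python
import torch

if torch.backends.mps.is_available():
    start = torch.mps.Event(enable_timing=True)
    end = torch.mps.Event(enable_timing=True)

    x = torch.randn(1024, 1024, device="mps")
    start.record()
    y = x @ x
    end.record()
    torch.mps.synchronize()
    print(start.elapsed_time(end))  # milliseconds between the two recordings
```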
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102121
Approved by: https://github.com/albanD, https://github.com/kulinseth
We hope PyTorch's profiling parsing ability can also apply to custom devices. Based on previous work https://github.com/pytorch/pytorch/pull/101554, we have made supplementary updates to PyTorch profiling to extend its parsing capabilities for custom devices. These modifications do not affect the original logic of the code and mainly include the following aspects:
1. Added the relevant logic for use_device in torch.profiler.profiler._KinetoProfile.
2. In torch.autograd.profiler and torch.autograd.profiler_util, parsing of custom-device profiling data has been added using the privateuse1 and use_device attributes.
3. In `torch._C._autograd.pyi`, custom-device-related attributes have been added. The underlying C++ logic will be added in subsequent pull requests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106142
Approved by: https://github.com/aaronenyeshi
Fixes #102375
Sequence_nr increments in the forward pass and decrements in the backward pass. Backward ops with the same sequence_nr as a forward op represent the backward implementation for that op. The long term goal is to make this information available to the profiler so users can observe which ops are fused by the inductor OpenAI Triton kernels.
Added a test for this feature, **test/dynamo/test_aot_autograd.py::AotAutogradFallbackTests::test_aot_sequence_nr**. The test case uses **aot_export_module()** to create a joint fwd/bwd fx graph. Then it walks all the nodes in the fx graph using fx_graph.graph.nodes. The seq_nr of each node is recorded in node.meta. During the fwd pass the seq_nr increments, and it decrements during the bwd pass. This allows the user to map forward ops to their corresponding backward ops, which is useful for performance analysis.
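A sketch of the traversal the test performs (the `"seq_nr"` meta key follows the description above; the graph is assumed to come from `aot_export_module()`):
```python
import torch.fx as fx

def dump_seq_nrs(fx_graph: fx.GraphModule) -> None:
    # Walk the joint fwd/bwd graph; seq_nr increments through the forward
    # nodes and decrements through the matching backward nodes.
    for node in fx_graph.graph.nodes:
        seq_nr = node.meta.get("seq_nr")
        if seq_nr is not None:
            print(f"{seq_nr}|{node.target}")
```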
Expected output from the test case:
```
SeqNr|OrigAten|SrcFn
0|aten.convolution.default|l__self___conv1
0|aten.add.Tensor|l__self___bn1
1|aten._native_batch_norm_legit_functional.default|l__self___bn1
2|aten.relu.default|l__self___relu1
3|aten.add.Tensor|add
4|aten.view.default|flatten
5|aten.t.default|l__self___fc1
6|aten.unsqueeze.default|l__self___fc1
7|aten.mm.default|l__self___fc1
8|aten.squeeze.dim|l__self___fc1
9|aten.add.Tensor|l__self___fc1
10|aten.sub.Tensor|l__self___loss_fn
11|aten.abs.default|l__self___loss_fn
12|aten.mean.default|l__self___loss_fn
12|aten.ones_like.default|
12|aten.expand.default|
12|aten.div.Scalar|
11|aten.sgn.default|
11|aten.mul.Tensor|
8|aten.unsqueeze.default|
7|aten.t.default|
7|aten.mm.default|
7|aten.t.default|
7|aten.t.default|
7|aten.mm.default|
6|aten.squeeze.dim|
5|aten.t.default|
4|aten.view.default|
2|aten.threshold_backward.default|
1|aten.native_batch_norm_backward.default|
0|aten.convolution_backward.default|
0|aten.add.Tensor|
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103129
Approved by: https://github.com/soulitzer
As per title.
Note that the C++-side code for the minidumps part was removed, so trying to call any of these 3 functions today results in an error saying that `torch._C` doesn't have these attributes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105142
Approved by: https://github.com/janeyx99
Summary:
## About Sync Events
For CUDA profiling mode, we can enable tracing CUDA synchronization events.
* This feature captures synchronization events in CUDA including 1) context/device sync, 2) stream sync, 3) CUDA event sync, 4) CUDA stream wait event (inter-stream synchronization).
* We add this flag using the profiler's experimental config option.
* This PR relies on the 7b003638c6 change in pytorch/kineto.
## Usage
Just set the `enable_cuda_sync_events` option in `_ExperimentalConfig`
```
from torch.autograd.profiler import profile, _ExperimentalConfig
with profile(use_kineto=True, use_cuda=True,
             experimental_config=_ExperimentalConfig(enable_cuda_sync_events=True),
             ) as prof:
    workload()
```
**Please wait for the PyTorch GitHub repo to point to commit 7b003638c6 or later in Kineto**
Test Plan:
## Unit Test
Added a unit test
buck2 test mode/dev-nosan caffe2/test:profiler --local-only -- test_profiler_cuda_sync_events
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0
https://www.internalfb.com/intern/testinfra/testrun/281475298097379
Reviewed By: davidberard98
Differential Revision: D46244591
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105187
Approved by: https://github.com/aaronenyeshi
As per title.
Note that the C++-side code for the minidumps part was removed, so trying to call any of these 3 functions today results in an error saying that `torch._C` doesn't have these attributes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105142
Approved by: https://github.com/janeyx99
Fixes #104985
Implemented a `set_multithreading_enabled` C++ function that directly alters the state, rather than using the `MultithreadingEnabled` class, which automatically reset the state when the object was destroyed. This behavior more closely aligns with `set_grad_enabled`, which works as expected. This allows the Python class `set_multithreading_enabled` to act as both a function and a context manager.
I also added a getter: `torch._C.is_multithreading_enabled`
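A sketch of the dual behavior (the import location and the getter name follow this description and are assumptions, not verified paths):
```python
import torch
from torch.autograd.grad_mode import set_multithreading_enabled

set_multithreading_enabled(False)               # plain call: state persists afterwards
assert not torch._C.is_multithreading_enabled()

with set_multithreading_enabled(True):          # context manager: restored on exit
    assert torch._C.is_multithreading_enabled()

assert not torch._C.is_multithreading_enabled()
```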
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105291
Approved by: https://github.com/albanD
This PR re-lands
- [Typing] Fix PEP 484 Violation (#105022)
- Update mypy to 1.4.1 (#91983)
That were reverted due to the conflict with internal source repo.
Mostly fixes for PEP-484 violation (i.e. when default arg is set to None, but type is not annotated as optional)
Plus a few real fixes:
- Add missing `_get_upgraders_entry_map` to `torch/_C/__init__.pyi`
- Add missing return statement to `torch._export.deserialize_graph`
- Fix error message in `torch.ao.ns.fx.weight_utils.get_lstm_mod_weights`
- Add an assert in `torch/optim/optimizer.py` that the Optional list is not None
TODO (in followup PR):
- Fix erroneous `isinstance` check in `torch/ao/quantization/_pt2e/qat_utils.py`
Unrelated, to bypass CI failures due to the gcc9 dependency update in Ubuntu-18.04:
- Add hack to squash older libstdc++ from conda environment in favor one from OS to `.ci/docker/install_conda.sh`
- Update bazel cuda builds to focal, as with libstdc++-6.0.32 bazel builds lose the ability to catch exceptions (probably because they link with cupti statically, but I could not find where it is done)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105227
Approved by: https://github.com/atalman, https://github.com/albanD, https://github.com/Skylion007
This PR re-lands
- [Typing] Fix PEP 484 Violation (#105022)
- Update mypy to 1.4.1 (#91983)
That were reverted due to the conflict with internal source repo.
Mostly fixes for PEP-484 violation (i.e. when default arg is set to None, but type is not annotated as optional)
Plus a few real fixes:
- Add missing `_get_upgraders_entry_map` to `torch/_C/__init__.pyi`
- Add missing return statement to `torch._export.deserialize_graph`
- Fix error message in `torch.ao.ns.fx.weight_utils.get_lstm_mod_weights`
- Add an assert in `torch/optim/optimizer.py` that the Optional list is not None
TODO (in followup PR):
- Fix erroneous `isinstance` check in `torch/ao/quantization/_pt2e/qat_utils.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105227
Approved by: https://github.com/atalman, https://github.com/albanD, https://github.com/Skylion007
Summary:
Context
-------
This PR adds a new fallback to the Autograd dispatch keys.
If you would prefer the old behavior:
- A quick (unsupported) way to get the previous behavior is to call
`torch._C._set_autograd_fallback("nothing")`
- Register "torch::CppFunction::makeFallthrough()" to your Autograd key,
like in https://gist.github.com/zou3519/d09a5f4b1afe2430af09fea67c6ff2c8
It is possible that this PR regresses performance of overhead-bound
models. If this is the case, please reach out (and apply one of the
temporary fixes in the previous section).
Description for reviewers
-------------------------
In order to deprecate registering autograd kernels at not an autograd
key, we add a fallback to the Autograd dispatch keys. This fallback
raises a warning if the user attempts to backprop through the operator
and is also configurable to either warn or not warn.
The goal of this PR is to
- preserve as much BC as possible
- raise a warning that whatever the user is doing is potentially wrong.
- be as performant as possible
There are roughly two cases:
- if the post-autograd kernels return a Tensor that requires grad, then
we install an autograd hook that raises a warning. We are preserving BC
in that it is possible that the user has a torch::autograd::Function
registered to their CPU key.
- if the post-autograd kernels return Tensors that do not require grad,
then we make them require_grad and install a WarnNotImplemented grad fn
that warns in the backward pass. This is mildly BC-breaking (see next
section).
Test Plan:
- bunch of new tests
BC-Breaking Note
----------------
This PR adds a new fallback to the Autograd dispatch keys. It affects
custom operators that do not have a kernel registered to the Autograd
keys (e.g. AutogradCPU and AutogradCUDA).
If the previous behavior was that the custom operator would return
Tensors that do not require grad if the inputs do require grad, then
this PR changes it so that all floating-point and complex returns do
require grad. See the "Context" section above for how to get the old
behavior.
Differential Revision: D47408353
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105078
Approved by: https://github.com/soulitzer
Context
-------
This PR adds a new fallback to the Autograd dispatch keys.
If you would prefer the old behavior:
- A quick (unsupported) way to get the previous behavior is to call
`torch._C._set_autograd_fallback("nothing")`
- Register "torch::CppFunction::makeFallthrough()" to your Autograd key,
like in https://gist.github.com/zou3519/d09a5f4b1afe2430af09fea67c6ff2c8
It is possible that this PR regresses performance of overhead-bound
models. If this is the case, please reach out (and apply one of the
temporary fixes in the previous section).
Description for reviewers
-------------------------
In order to deprecate registering autograd kernels at not an autograd
key, we add a fallback to the Autograd dispatch keys. This fallback
raises a warning if the user attempts to backprop through the operator
and is also configurable to either warn or not warn.
The goal of this PR is to
- preserve as much BC as possible
- raise a warning that whatever the user is doing is potentially wrong.
- be as performant as possible
There are roughly two cases:
- if the post-autograd kernels return a Tensor that requires grad, then
we install an autograd hook that raises a warning. We are preserving BC
in that it is possible that the user has a torch::autograd::Function
registered to their CPU key.
- if the post-autograd kernels return Tensors that do not require grad,
then we make them require_grad and install a WarnNotImplemented grad fn
that warns in the backward pass. This is mildly BC-breaking (see next
section).
Test Plan:
- bunch of new tests
BC-Breaking Note
----------------
This PR adds a new fallback to the Autograd dispatch keys. It affects
custom operators that do not have a kernel registered to the Autograd
keys (e.g. AutogradCPU and AutogradCUDA).
If the previous behavior was that the custom operator would return
Tensors that do not require grad if the inputs do require grad, then
this PR changes it so that all floating-point and complex returns do
require grad. See the "Context" section above for how to get the old
behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104481
Approved by: https://github.com/soulitzer
Fixes https://github.com/pytorch/pytorch/issues/104272
This PR adds a new private API `materialize_non_diff_grads` (default True) such that when set to False, grad outputs corresponding to outputs marked non-differentiable receive None instead of a zero-filled tensor. This overrides the setting of `materialize_grads`, i.e. grad outputs corresponding to non-differentiable outputs would still be None even if `materialize_grads=True` (the default).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104291
Approved by: https://github.com/albanD
This is a reland of https://github.com/pytorch/pytorch/pull/100007 with a build fix for Windows debug builds.
`at::native::ParamsHash` only works on structs with standard layout, but `std::string` isn't one in Visual C++ debug builds, which one can easily verify by running something like:
```cpp
#define _DEBUG
#include <type_traits>
#include <string>
static_assert(std::is_standard_layout_v<std::string>, "Oh noes");
```
If the above condition is not met, instead of printing the static_assert message, VC++ raises very cryptic compilation errors; see https://github.com/pytorch/pytorch/pull/100007#discussion_r1227116292 for more detail.
Also, using `std::hash` for string should result in a faster hash function.
(cherry picked from commit 74b7a6c75e)
### 🤖 Generated by Copilot at 5914771
This pull request introduces a new function `_group_tensors_by_device_and_dtype` that can group tensors by their device and dtype, and updates the `foreach` utilities and several optimizers to use this function. The goal is to improve the performance, readability, and compatibility of the code that handles tensors with different properties. The pull request also adds a test case and type annotations for the new function, and some error checks for the `fused` argument in Adam and AdamW.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103912
Approved by: https://github.com/janeyx99
**Summary**
- Update the quantization document to note that the default qconfig with the oneDNN backend is recommended for use on CPUs with Vector Neural Network Instruction support.
- Add a warning message when a user uses the default qconfig with the oneDNN backend on a CPU without Vector Neural Network Instruction support.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103653
Approved by: https://github.com/jgong5, https://github.com/malfet
Fixes #102768
- Provides proper function declarations in generated `torch/nn/functional.pyi`.
- Moves some functions from manually defined in `functional.pyi.in` to generated code, in order to single-source the signature.
- Includes some of the functions in `torch._C._nn` into its `.pyi.in`, but not exhaustive (only what's already there).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102918
Approved by: https://github.com/drisspg, https://github.com/malfet
Summary:
This diff allows the `TCPStore` server associated with a gloo process group to listen on an existing socket already bound to a port.
Without the functionality in this diff, canonical initialization of a gloo `ProcessGroup` is fundamentally racy: 1) ask the OS for a free port by creating a socket bound to port 0, 2) close the socket, 3) attempt to initialize a `TCPStore` server that listens on the previously free port. Of course, the problem is that in between steps 2 and 3, another process on the host may have claimed the port, causing `TCPStore` and overall process group initialization to fail. With this diff, it is now possible for users to completely avoid this race (see unit test for how this can be achieved).
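A hedged sketch of the race-free pattern (the `master_listen_fd` keyword is an assumption based on this description, not a verified signature):
```python
import socket
import torch.distributed as dist

# Bind and listen ourselves, so no other process can grab the port in between.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.bind(("127.0.0.1", 0))
sock.listen()
port = sock.getsockname()[1]

store = dist.TCPStore(
    "127.0.0.1", port, world_size=1, is_master=True,
    master_listen_fd=sock.fileno(),  # server reuses the already-bound socket
)
```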
Test Plan:
Added new unit test:
buck2 test caffe2/test/distributed:store
Differential Revision: D46622317
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103478
Approved by: https://github.com/H-Huang
We discussed in a composability meeting a few weeks ago that `pre_autograd` should probably be renamed to `pre_dispatch`.
One question in this PR was: should I re-use a dispatch key? Or should I create a new dispatch key (that yet again corresponds to "top of the dispatcher")?
~~For now, I ended up sticking our proxy mode on the mode stack corresponding to `PythonTLSSnapshot`, because it was simple and it works. It looks like one of the functorch dispatch keys has higher priority though, so it's possible that functorch will end up running first. Open to options, but we can consider adding a new dispatch key later if that becomes a problem~~
Update: I added a dedicated dispatch key, `PreDispatch`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101818
Approved by: https://github.com/ezyang, https://github.com/Neilblaze, https://github.com/albanD, https://github.com/zou3519
Summary: trigger tracing for MTIA events on python side when ProfilerActivity.MTIA is specified
Test Plan:
Test diff: D45437426
```
hg graft D45437426
```
- in one terminal
```
cd ~/fbsource/fbcode
buck2 run -j 8 \
//infra_asic_fpga/firmware/tools/mad/service:mad_service
```
- in another terminal
Pytorch profiler
```
buck run mode/dev-nosan -j 8 //caffe2/torch/fb/acc_runtime/afg/tests:test_afg -- -m kernel_add
```
Differential Revision: D46122853
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102288
Approved by: https://github.com/aaronenyeshi
`register_functional_op`:
- constructs the functional variant of an op
- registers a functionalization kernel to the op
To get this to work:
- `register_functional_op` makes assumptions that it checks about the
op's schema. In particular, the op is not allowed to return anything it
mutates. We can relax these constraints in the future.
- We add a "boxed" python functionalization kernel that handles this
case.
I'm not actually sure (or convinced) this should be public API or how
it should work. If we want this to be public, then it should probably be
a torch.library API, but does that also mean we should give the same
lifetime guarantees? If so, then it would be up to the user to construct
a Library object to actually register the functional variant onto.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102293
Approved by: https://github.com/bdhirsh
1. `torch.autograd.profiler` interface parameters changed (using `self.use_device` instead of `self.use_cuda` facilitates access by other devices; it is integrated further in subsequent PRs).
2. Modify `ProfilerEventStub` (aka `std::shared_ptr<CUevent_st>`) to `ProfilerVoidEventStub` (aka `std::shared_ptr<void>`) so that `ProfilerStubs` can be inherited by any `{device}Methods`.
In addition, `cuda_event_start_` is renamed to `device_event_start_`, so CUDA and other devices can use this event pointer if needed.
4. Custom device support using legacy profiling (add the `ProfilerState::KINETO_PRIVATEUSE1_FALLBACK` option).
5. Add the `privateuse1Stubs` register.
(Parse results and test cases are added in subsequent PRs.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101554
Approved by: https://github.com/aaronenyeshi
Summary: In cases where DDP backward is not finalized, the error is raised only in the next forward iteration of DDP. However, if there are other collective calls between those two points, training scripts could potentially get stuck.
As a result, there should be a way to check if DDP finalized after calling `.backward()`. To address this, I've added a `_check_reducer_finalized` method to validate that DDP indeed did successfully finish reduction.
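A sketch of the intended usage (single-process gloo setup for illustration only; the private method name follows this description):
```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(torch.nn.Linear(4, 4))
model(torch.randn(2, 4)).sum().backward()
model._check_reducer_finalized()  # raises if the DDP backward never finalized reduction
dist.destroy_process_group()
```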
Test Plan: Added unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100773
Approved by: https://github.com/rohan-varma
This PR adds a `py_context_manager_DEPRECATED` that converts a C++ RAII
guard to an object that may be either used as Python context manager or
as a "Python RAII guard".
We don't convert all of them to Python context manager only due to BC
reasons; people in OSS and internally actually rely on these APIs and I
don't want to break them. We are justified in breaking BC if we wanted
to, but it seemed like too much work for not a lot of gain.
The API is postfixed with "DEPRECATED" to indicate that people should
really use `py_context_manager` (converts C++ RAII guard to Python
context manager) instead.
Test Plan:
- this PR converts all PyTorch usages of _AutoDispatchBelowAutograd to
context manager. I can do the rest in follow-ups.
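A sketch of the converted guard used as a context manager (the `torch._C._AutoDispatchBelowAutograd` binding name is taken from the test plan above):
```python
import torch

x = torch.randn(3, requires_grad=True)
with torch._C._AutoDispatchBelowAutograd():
    y = torch.add(x, 1)  # dispatched below the Autograd key inside the block
print(y.grad_fn)         # None: autograd never saw this op
```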
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102579
Approved by: https://github.com/bdhirsh, https://github.com/albanD
**TL;DR:** This re-introduces links between backward kernels and their corresponding forward kernels.
<img width="1020" alt="Screenshot 2023-05-26 at 7 25 22 PM" src="https://github.com/pytorch/pytorch/assets/5067123/02571b59-859c-4c9e-b3ef-121ef3159812">
In the example above, you can see there are two such flows - one for aten::add, and one for aten::binary_cross_entropy
### Details
Forward/backward links were added in https://github.com/pytorch/pytorch/pull/62553, but then disabled in https://github.com/pytorch/pytorch/pull/72904 due to segfaults (e.g. https://github.com/pytorch/pytorch/issues/69443).
Between now and when the fwd-bwd links were disabled, there's been a lot of refactoring; so this PR updates the implementation:
* Use a raw profiler::impl::Result instead of a KinetoEvent
* Move the implementation to collection.cpp, where the TraceWrapper is currently handled.
* Sort the events before processing, because they aren't always in chronological order
* There can now be more than one event in the backward pass that matches the sequenceNr-threadID pair. The implementation needed to be updated to avoid showing multiple endpoints for a given sequenceNr-threadID pair ([ptr to where the bwd sequenceNr-threadID pair is duplicated](6e3e3dd477/torch/csrc/profiler/collection.cpp (L398-L399))).
Next, we need to verify that https://github.com/pytorch/pytorch/issues/69443 is fixed. Running the repro no longer errors. Looking further into the details of the issue it seems like the handling of the [raw linkedActivity pointer (old code from 2021)](6089dcac48/libkineto/src/output_json.cpp (L283)) resulted in the segfault. Now, it doesn't look like the linked activity is used anywhere in output_json.cpp so the issue should be fixed.
### Testing
#### 1. unit test
`test_profiler_fwd_bwd_link` was un-skipped. It was modified to match the new implementation.
#### 2. https://github.com/pytorch/pytorch/issues/69443
I ran the repro in https://github.com/pytorch/pytorch/issues/69443 and verified there were no segfaults.
#### 3. Duplicate flow IDs
When forward-backward connections were first introduced, gpu-cpu async links had not been introduced. There's a possibility that gpu-cpu links and fwd-bwd links could interfere if their IDs overlap.
I manually tested this in chrome://tracing; I edited a file so that a gpu-cpu link had the same ID as one of the fwd-bwd connections. The chrome tracing UI continued showing both types of links.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102424
Approved by: https://github.com/aaronenyeshi
Summary: The new logger allows passing metadata into the api usage logger. The immediate use case is to pass the serialization_id to the save and load events to enable tracking serialized models in API events. It could be extended to add more metadata in the future.
Test Plan:
```
buck2 test @//mode/dev //caffe2/caffe2/serialize:inline_container_test
```
Reviewed By: davidberard98
Differential Revision: D45683697
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101762
Approved by: https://github.com/davidberard98
This PR:
- adds a mechanism to turn any RAII guard into a Python Context Manager
- turns ExcludeDispatchKeyGuard into a context manager, and purges usages
of the older torch._C.ExcludeDispatchKeyGuard from the codebase.
The mechanism is that given a RAII guard, we construct a context
manager object that holds an optional guard. When we enter the context
manager we populate the guard, when we exit we reset it.
We don't delete torch._C.ExcludeDispatchKeyGuard for BC reasons (people
are using it in fbcode). If this code actually sticks
(it is using C++17 and that worries me a bit), then I'll apply the
change to other RAII guards we have; otherwise, we can write our own
std::apply.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102037
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
This PR is an implementation of the feature request https://github.com/pytorch/pytorch/issues/97888, for the implementation of `torch.dtype.to_complex()` and `torch.dtype.to_float()` methods that convert between float and complex dtypes of the same precision.
Disclaimer: it's the first time I code in C++, so hopefully the code is correct, but I'm not super confident about the PR. Any advice/comment is welcome. It's also my first contribution to a large library, so hopefully I'm not doing anything wrong!
@ezyang
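A sketch of the intended behavior (the method names follow this description; the landed API may differ):
```python
import torch

assert torch.float32.to_complex() == torch.complex64
assert torch.float64.to_complex() == torch.complex128
assert torch.complex64.to_float() == torch.float32
```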
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97935
Approved by: https://github.com/ezyang
Many ops take as inputs scalars or scalar lists which are important to understand the properties of the op. For example, convolution ops' behavior and output shapes often depend on padding and strides, which are provided as scalars or lists of scalars. This will record scalar lists when record_inputs=True.
Details:
During collection (and this was true before this PR as well), we serialize values and tensor metadata into an InputOutputEncoder. After collection occurs, we deserialize these values to attach the information to each of the events.
This PR does this:
- Adds support for serializing scalar lists during collection / serialization
- Adds an extra field called "Concrete Args"
- Splits up the deserialization process into two steps - one for generating "input shapes" and one for generating "concrete args". We split up input shapes and concrete args to avoid interrupting any previous workflows that relied on the specific data in the input shapes category; additionally, it's just a better description. Note that single scalars will remain in the "input shapes" category as they were already in that category in the past.
Differential Revision: [D45798431](https://our.internmc.facebook.com/intern/diff/D45798431)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100593
Approved by: https://github.com/aaronenyeshi
The PyTorch Dispatcher's "no kernel found for DispatchKey" error message
is a bit long and winded. This PR adds a way to add a custom error
callback and changes the CustomOp API to use the custom error callback
to deliver better error messages.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101015
Approved by: https://github.com/ezyang
Description:
- As suggested by Nikita, created `torch.backends.cpu` submodule and exposed `get_cpu_capability`.
- In torchvision's Resize method we want to know the current CPU capability in order to pick the appropriate codepath depending on CPU capabilities.
The newly coded vectorized resize of uint8 images on AVX2-supported CPUs is now faster than the older way (uint8->float->resize->uint8). However, on non-AVX hardware (e.g. Mac M1) certain configs are slower using native uint8.
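A sketch of the intended use (the exact capability strings depend on the build and architecture; treat the values below as examples):
```python
import torch

cap = torch.backends.cpu.get_cpu_capability()  # e.g. "AVX2", "AVX512", "NO AVX"
use_vectorized_uint8_resize = cap in ("AVX2", "AVX512")
print(cap, use_vectorized_uint8_resize)
```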
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100164
Approved by: https://github.com/albanD, https://github.com/malfet
The new minifier script looks like this:
```
import torch._dynamo.repro.after_aot
reader = torch._dynamo.repro.after_aot.InputReader(save_dir='/tmp/tmpcsngx39e')
buf0 = reader.storage('e2b39c716c0d4efb9fa57375a3902b9dab666893', 16)
t0 = reader.tensor(buf0, (4,))
args = [t0]
mod = make_fx(Repro(), tracing_mode='real')(*args)
```
The real tensor data is stored in the storages folder of the checkpoint dump directory. If you delete this folder / it is otherwise missing, we will transparently fall back to generating random data like before. The tensors are serialized using content store from #99809, which means each storage is content-addressed and we will automatically deduplicate equivalent data (which is useful if you keep dumping out, e.g., your parameters.) We don't use the tensor serialization capability from content store, instead all of the tensor metadata is stored inline inside the repro script (so that everything is in one file if you lose the checkpointed tensors).
We also add a stable_hash option to content store, where we use a slow SHA-1 sum on the data on the CPU side to compute a hash that is stable across systems with the same endianness.
Out of rage, I also added support for Dtype.itemsize property access.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99834
Approved by: https://github.com/voznesenskym
Implements a simple content-addressable store for storages (with tensors implemented as cheap references on top), enabling incremental serialization of tensors to disk, which I intend to use in the accuracy repro extractor. Check the comment at the top of torch/utils/_content_store.py for more details on the intended use case.
One major piece of this PR is implementing the content hash for tensors. For our prospective use case, we may need to repeatedly hash up to 80 GB of tensor data every time we snapshot (and we may snapshot multiple times). Using a conventional cryptographic hash and hashing each snapshot would likely take on the order of minutes, which seemed too slow to me. So instead, I implemented a crappy hash function that can be run on GPU. It is at least somewhat theoretically grounded: using random parameters generated by Philox, we use the standard shift-multiply and xor-sum universal hash family. The hash function is a bit dorky though; instead of properly doing 160-bit math, it just runs a 32-bit hash five times and cats the results together. By the way, this sets the first precedent for a kernel in the PyTorch library which MUST be torch.compile'd to be run (in fact, this kernel does not run in eager mode because of the use of xor_sum, which doesn't actually exist in ATen).
I had to add a few more primitives to inductor, namely randint (over the entire int range) and xor_sum. Fortunately, these primitives are natively supported by Triton/C++, and so they were very easy to plumb through. xor_sum is exposed as a prim, while randint special cases on when low/high span the entire 32-bit signed integer range.
Thanks to Jeff Johnson for letting me bounce ideas off him on a Saturday morning lol.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99809
Approved by: https://github.com/voznesenskym
The bug was that: if you want to move a mode to the autograd key, we need to use the "functionality" key for it (AutogradFunctionality). But when we do that, we need to clear any PythonDispatcher caches for every op for **every** autograd key (since you could run autograd ops with both cpu and cuda tensors underneath the mode, which both may have been cached).
I didn't add a test, since this ends up getting indirectly tested by export in the PR. If someone would prefer a direct test I can add one.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98030
Approved by: https://github.com/ezyang