Commit Graph

37529 Commits

Author SHA1 Message Date
rzou
3918dfedc5 [custom_op] Rename register_impl to register_kernel (#124200)
Motivation:
- The API is used for registering an implementation for a specific
  device type.
- "impl" is ambiguous and can be confused with Library.impl.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124200
Approved by: https://github.com/albanD
ghstack dependencies: #124180
2024-04-19 13:54:21 +00:00
rzou
22a2f676c3 [custom_op] add ability to provide manual schema (#124180)
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124180
Approved by: https://github.com/albanD
2024-04-19 13:54:13 +00:00
GdoongMathew
8b1ad51881 Better Error Message in ChainedScheduler and SequentialLR (#121633)
Fixes #121577

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121633
Approved by: https://github.com/janeyx99
2024-04-19 13:37:41 +00:00
Jesse Cai
c9db59e9e4 [sparse] Add fast semi-structured sparsification kernels (#122350)
This PR adds fast semi-structured sparsification kernels to PyTorch.

These kernels enable accelerated semi-structured (2:4) sparsification of dense tensors.

The kernels have been added as aten native functions.

In particular, three new functions have been added:

* `torch._sparse_semi_structured_tile`

This function will return the packed representation and metadata for
both X and X', as well as the thread masks. Note that this applies 2:4
sparsity in a 4x4 tile instead of a 1x4 strip as usual.

* `torch._sparse_semi_structured_apply`

This function takes in an input tensor and thread masks from the above
function and returns a packed representation and metadata from applying
thread masks to the input tensor.

* `torch._sparse_semi_structured_apply_dense`

This function does the same thing as above but instead of returning the
tensor in the sparse representation it returns it in the dense
representation

The subclasses have also been updated to add a new `prune_dense_static_sort` classmethod to create sparse tensors with this format. I've added some additional documentation on how to calculate the compressed tensors needed to create a SparseSemiStructuredTensor oneself.

To this end, there are two new helper functions added:
`sparse_semi_structured_tile`
`compute_compressed_swizzled_bitmask`
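
A rough usage sketch of the new ops (requires a supported GPU; the exact return layouts are assumptions, not the authoritative schemas):
```python
import torch

dense = torch.randn(128, 128, dtype=torch.half, device="cuda")

# Compute packed values, metadata, and per-thread masks for the 4x4-tile 2:4 pattern.
# (The return layout is assumed here; see the op schema for the real signature.)
tile_outputs = torch._sparse_semi_structured_tile(dense)

# The thread masks from the call above can then be reused via
# torch._sparse_semi_structured_apply to sparsify other tensors with the same pattern,
# or via torch._sparse_semi_structured_apply_dense to get the pruned result back in dense form.
```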

Differential Revision: [D56190801](https://our.internmc.facebook.com/intern/diff/D56190801)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122350
Approved by: https://github.com/cpuhrsch
2024-04-19 13:31:58 +00:00
Cen Zhao
96724a769b [ptd] drop ncclGroupStart/end for ncclCommInit (#124363) (#124416)
Summary:

```
ncclGroupStart()
ncclCommInit(..)
ncclGroupEnd()
```

The pattern above is only needed when a *single thread* manages multiple GPUs.

In our case, we always have one process managing one GPU, so we don't need the group operations.

Test Plan: CI

Differential Revision: D56274975

Co-authored-by: Cen Zhao <cenzhao@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124416
Approved by: https://github.com/shuqiangzhang
2024-04-19 13:12:42 +00:00
chilli
8e280862ff Add custom joint graph passes (#124443)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124443
Approved by: https://github.com/aorenste, https://github.com/malfet
2024-04-19 11:54:46 +00:00
Jane Xu
b412b75b42 [optim] add fused_adam/adamw_kernel support for CPU device (#123074)
On par with the `CUDA` implementation.

For the `autocast` logic, same as `CUDA` + `Fused Adam`:
 - check for inf in `gradscaler.step`
 - In the fused kernel, if there is `inf`, do nothing. If not, unscale the grad (also writing it back) and update the param.
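
A minimal sketch of the user-facing change, assuming this PR: `fused=True` now also works for parameters on CPU:
```python
import torch

model = torch.nn.Linear(1024, 1024)  # parameters live on CPU
opt = torch.optim.Adam(model.parameters(), lr=1e-3, fused=True)  # uses the CPU fused kernel after this PR

loss = model(torch.randn(32, 1024)).sum()
loss.backward()
opt.step()
```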

**TestPlan**:
```
# extend CUDA-only tests for CPU fused adam
python test_optim.py -k test_fused_matches_forloop
python test_optim.py -k test_fused_large_tensor
python test_torch.py -k test_grad_scaling_autocast_fused

# extend fused test
python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step
python test_optim.py -k test_can_load_older_state_dict

# newly added test (follow 6b1f13ea2f/test/test_cuda.py (L1108))
python test_optim.py -k test_grad_scaling_autocast_fused_optimizers
```

**Benchmark**:
**5.1x** on 56 core SPR
**Parameter-size=1M**
**Nparams=10**
[test script](https://gist.github.com/zhuhaozhe/ef9a290ad3f8f4067b3373a3bdaa33e7)

```
numactl -C 0-55 -m 0 python bench_adam.py
non-fused 6.0174267292022705 s
fused 1.1787631511688232 s
```

**Note: Fused kernel accuracy**
The accuracy failure in CI shows a difference slightly higher than the default tolerance:
```
2024-04-02T06:09:16.2213887Z Mismatched elements: 21 / 64 (32.8%)
2024-04-02T06:09:16.2214339Z Greatest absolute difference: 1.5735626220703125e-05 at index (6, 6) (up to 1e-05 allowed)
2024-04-02T06:09:16.2214813Z Greatest relative difference: 1.0073336852656212e-05 at index (4, 1) (up to 1.3e-06 allowed)
```
I have debugged it step by step and unfortunately we may not be able to make the `fused` kernel produce exactly the same results as the `non-fused` one due to compiler optimizations.
For example, in the non-fused impl
```
exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2)
```
and in fused impl
```
  exp_avg_sq_ptr[d] = scalar_t(beta2) * exp_avg_sq_ptr[d];
  //  std::cout << "exp_avg_sq " <<   exp_avg_sq_ptr[d] << std::endl;
  exp_avg_sq_ptr[d] = exp_avg_sq_ptr[d] +
      scalar_t(exp_avg_sq_grad_coefficient) * grad_val * grad_val;
```
If I keep the `std::cout`, I get exactly the same results in the UT
```
===============param
0.6796758770942688
0.6796758770942688
```
But when I comment it out, there is a difference
```
===============param
0.6796758770942688
0.6796759366989136
```
So I will make the tolerance a little higher than the default one.

Co-authored-by: Jane Xu <janeyx@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123074
Approved by: https://github.com/jgong5, https://github.com/janeyx99
2024-04-19 11:14:04 +00:00
Boyuan Feng
9a71d12d92 [CUDAGraphTree] Support mutated inputs from prior cudagraph pool (#123231)
# PR
This PR supports mutating inputs in cudagraph trees if these inputs are outputs from a previous cudagraph. Please check #121861 for more details.

# Note on Optimistic Mutation Check
To determine whether to apply cudagraph, we need to check input mutations, which fall into four categories: a) no mutation, b) mutation on parameters/buffers, c) mutation on cudagraph-recorded tensors, d) mutation on non-cudagraph-recorded tensors. We can apply cudagraph for types a, b, c but not for type d. The input mutation type depends on the function, the current node, and the inputs.

Since `check_for_mutation` is slow, there is a trade-off between making type c or type d faster.
- To make type d) faster, we want to run `check_for_mutation` and call the eager function early. However, this adds unnecessary overhead to types a, b, c due to the extra check.
- To make type c) faster, we want to skip `check_for_mutation` at the beginning and only run `check_for_mutation` before `record_function` for a new function. This removes the overhead of `check_for_mutation` for types a, b, c. However, it adds extra overhead to type d due to `check_invariants` for all children nodes.

Instead, we design an optimistic mutation check. The assumption is that, given a function and a node, the input mutation type usually remains the same across inputs. So, if we have ever detected a function on a node as type d, we will never detect it as type c. The detailed design is:
- [Slow Path] On the first invocation of a function on a node, we run `check_for_mutation` once and cache the input mutation type as `non_cudagraph_managed_mutation[node_id][func_id]`.
- [Fast Path] On subsequent invocations of a function on a node, we skip `check_for_mutation`. If `non_cudagraph_managed_mutation[node_id][func_id]` is true, we directly call the eager function. Otherwise, we run `check_invariants` and call the cudagraph function.
- [Slow Path] Before `record_function`, we run `check_for_mutation` again.

**Q1: Would there be overhead for type a,b,c,d?**
A: No. We only check input mutation types for the first invocation of a function on a node.

**Q2: If a function happens to be type c during the first invocation on a node, could we detect it as type d in the future?**
A: Yes. This is done by `check_invariants` and guarantees correctness.

**Q3: If a function happens to be type d during the first invocation on a node, could it still be recognized as type c in the future?**
A: No. But this should happen rarely according to our assumption. In the rare case that it happens, there would not be any correctness issues and the performance is the same as the eager (or inductor optimized) function.
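
A schematic sketch of the caching logic above (all names here are illustrative stubs, not the actual cudagraph_trees code):
```python
from typing import Callable, Dict, Tuple

def check_for_mutation(func, inputs) -> bool:   # expensive: classifies type d vs. a/b/c
    return False

def check_invariants(node_id, inputs) -> None:  # cheap: verifies cudagraph assumptions still hold
    pass

non_cudagraph_managed_mutation: Dict[Tuple[int, int], bool] = {}

def run(node_id: int, func_id: int, func: Callable, cudagraph_func: Callable, inputs):
    key = (node_id, func_id)
    if key not in non_cudagraph_managed_mutation:           # slow path, first call only
        non_cudagraph_managed_mutation[key] = check_for_mutation(func, inputs)
    if non_cudagraph_managed_mutation[key]:                 # type d: bypass cudagraph
        return func(*inputs)
    check_invariants(node_id, inputs)                       # types a/b/c: fast path
    return cudagraph_func(*inputs)
```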

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123231
Approved by: https://github.com/eellison
2024-04-19 10:32:12 +00:00
Tobias Ringwald
58e403c739 Added a docstring for torch.Size.numel. (#124186)
Fixes #61231. Fixes #124167.

This PR documents a rather long-standing issue w.r.t. unexpected behavior of `torch.Size.numel`, first reported almost 5 years ago.
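
For reference, the behavior being documented:
```python
>>> import torch
>>> s = torch.Size([2, 3, 4])
>>> s.numel()  # product of the entries, i.e. the element count of a tensor with this shape
24
>>> len(s)     # number of dimensions, not to be confused with numel()
3
```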
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124186
Approved by: https://github.com/janeyx99
2024-04-19 09:23:02 +00:00
PyTorch MergeBot
520bc1080e Revert "[Profiler] Unify the device(CUDA, XPU, PrivateUse1) in torch profiler post processing (#123247)"
This reverts commit 768ce2cdda.

Reverted https://github.com/pytorch/pytorch/pull/123247 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/123247#issuecomment-2066152611))
2024-04-19 09:09:03 +00:00
Xuehai Pan
a6f044a490 [dynamo, 3.8-3.9] support dataclass with frozen=True in Python 3.8/3.9 (#124393)
Closes #114966

Frozen field assignment in `__init__` in Python 3.8-3.9:

f5bd65ed37/Lib/dataclasses.py (L402-L411)

```python
import builtins

BUILTINS = builtins

def _field_assign(frozen, name, value, self_name):
    # If we're a frozen class, then assign to our fields in __init__
    # via object.__setattr__.  Otherwise, just use a simple
    # assignment.
    #
    # self_name is what "self" is called in this function: don't
    # hard-code "self", since that might be a field name.
    if frozen:
        return f'BUILTINS.object.__setattr__({self_name},{name!r},{value})'
    return f'{self_name}.{name}={value}'
```

Frozen field assignment in `__init__` in Python 3.10+:

812245ecce/Lib/dataclasses.py (L436-L445)

```python
__dataclass_builtins_object__ = object

def _field_assign(frozen, name, value, self_name):
    # If we're a frozen class, then assign to our fields in __init__
    # via object.__setattr__.  Otherwise, just use a simple
    # assignment.
    #
    # self_name is what "self" is called in this function: don't
    # hard-code "self", since that might be a field name.
    if frozen:
        return f'__dataclass_builtins_object__.__setattr__({self_name},{name!r},{value})'
    return f'{self_name}.{name}={value}'
```
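
A small sketch of the pattern this enables under torch.compile on 3.8/3.9 (the class here is illustrative, not from the PR's tests):
```python
import dataclasses
import torch

@dataclasses.dataclass(frozen=True)
class Pair:
    a: torch.Tensor
    b: torch.Tensor

@torch.compile
def fn(x):
    # Frozen field assignment in __init__ goes through object.__setattr__, which on
    # 3.8/3.9 is spelled BUILTINS.object.__setattr__ in the generated code (see above).
    p = Pair(x, x + 1)
    return p.a + p.b

print(fn(torch.randn(3)))
```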

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124393
Approved by: https://github.com/jansel
2024-04-19 05:10:33 +00:00
Nikita Shulga
1ba85b34dd [AOTI] Enable mmapped weights when CUDA is used (#124346)
Refactor the logic that returns the start of the constants pointer into a `_get_constants_start()` method and call it from both the CUDA and CPU readers.

It has no runtime impact, but export time is down from 10m to 3m if mmapped weights are used on AWS p4d.24xlarge.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124346
Approved by: https://github.com/mikekgfb, https://github.com/desertfire
2024-04-19 04:47:27 +00:00
Kiuk Chung
87f44d70b1 [torch/distributed] Check gloo availability when doing isinstance(pg,… (#124233)
Fixes a bug where a reference to `_ProcessGroupWrapper` is used without first checking whether gloo is available. This fails on PyTorch builds that do not include gloo because `_ProcessGroupWrapper` is only pybinded when building with `USE_GLOO=1`. Therefore, creation of a new process group fails with a `NameError` when only NCCL is available as the backend.
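
A rough sketch of the guarded check described above (illustrative, not the exact torch.distributed code):
```python
import torch.distributed as dist

def _is_wrapped(pg) -> bool:
    # _ProcessGroupWrapper is only pybinded when PyTorch is built with USE_GLOO=1,
    # so guard the reference on gloo availability instead of assuming it exists.
    if dist.is_gloo_available():
        from torch._C._distributed_c10d import _ProcessGroupWrapper
        return isinstance(pg, _ProcessGroupWrapper)
    return False
```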

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124233
Approved by: https://github.com/rohan-varma, https://github.com/d4l3k
2024-04-19 04:07:00 +00:00
Chen, Zejun
768ce2cdda [Profiler] Unify the device(CUDA, XPU, PrivateUse1) in torch profiler post processing (#123247)
This PR unifies CUDA, XPU and PrivateUse1 in the torch profiler. Now CUDA, XPU and PrivateUse1 use the string attribute `use_device` to distinguish each other and share one device path for calculating kineto time durations and memory statistics in post processing.

#suppress-api-compatibility-check

Co-authored-by: Aaron Enye Shi <enye.shi@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123247
Approved by: https://github.com/aaronenyeshi, https://github.com/gujinghui
2024-04-19 03:31:13 +00:00
rraminen
803a08f8ae [ROCm] Add cublasGemmAlgo_t -> hipblasGemmAlgo_t (#121030)
This PR adds the cublasGemmAlgo_t -> hipblasGemmAlgo_t mapping to cuda_to_hip_mappings.py.
It is required for building the DeepSpeed transformer extension on ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121030
Approved by: https://github.com/jeffdaily, https://github.com/ezyang
2024-04-19 02:57:16 +00:00
rzou
889e3eeed3 Avoid cuda init to FakeTensorMode (#124413)
Also partially fixes #122109

This PR:
- We add a C++ flag (only_lift_cpu_tensors) to toggle the
  torch.tensor(1, device='cuda') ctor strategy.
  When false (default), it does the current PyTorch behavior
  of unconditionally constructing a concrete CUDA tensor then calling
  lift_fresh on it. When true, we instead construct a concrete CPU
  tensor, call lift_fresh, and then call Tensor.to(device) (under any ambient
  modes).
- FakeTensorMode flips this flag depending on if CUDA is available or
  not. We don't unconditionally set the flag to True because that is
  likely BC-breaking.
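
A sketch of what this enables on a CUDA-less host (assuming this PR):
```python
import torch
from torch._subclasses.fake_tensor import FakeTensorMode

with FakeTensorMode():
    # With only_lift_cpu_tensors flipped on by FakeTensorMode when CUDA is unavailable,
    # this builds a concrete CPU tensor, lifts it, and moves it to "cuda" under the mode,
    # instead of eagerly initializing CUDA.
    t = torch.tensor(1, device="cuda")

print(t.device)  # a fake CUDA tensor; real CUDA was never initialized
```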

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124413
Approved by: https://github.com/eellison
2024-04-19 02:39:35 +00:00
chilli
e620c3e814 Optimized templated attention to use exp2 (#124356)
0.705 (vs. FA2) to 0.860 after this change.
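
The speedup comes from rewriting exp in terms of exp2 inside the Triton template; the identity it relies on (a sketch, not the template code itself):
```python
import torch

LOG2_E = 1.4426950408889634  # log2(e)
x = torch.randn(8)
# exp(x) == 2 ** (x * log2(e)); exp2 maps to a cheaper hardware instruction on the GPU.
torch.testing.assert_close(torch.exp(x), torch.exp2(x * LOG2_E))
```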

<img width="1270" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/d58f57ba-e50e-44ea-8a8a-4f13b8650adf">

to

<img width="1277" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/f1945b67-0cfc-463c-a2f6-5812b90677fe">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124356
Approved by: https://github.com/drisspg
2024-04-19 01:58:19 +00:00
Tristan Rice
ddd0ed1b43 distributed: templated ring attention (#124215)
This adds a templated version of the ring attention forwards function as well as tests it with memory efficient attention. This doesn't add support for memory efficient attention in DTensor. That will be added in a follow up PR.

This templating is also a POC of how to support other attention ops, such as jagged/nested tensor, as well as how to implement striped attention in a scalable way.

Misc changes:

* Fixes all_to_all_single autograd implementation with CUDA + adds NCCL test
* Adds compile support to the ring attention implementations (required some tweaks to process groups)

Test plan:

```
pytest test/distributed/_tensor/test_attention.py
pytest test/distributed/test_functional_api.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124215
Approved by: https://github.com/wanchaol
2024-04-19 00:57:08 +00:00
Bin Bao
4946638f06 [AOTI] Add ABI-compatiblity tests (#123848)
Summary: In AOTInductor-generated CPU model code, there can be direct references to some aten/c10 utility functions and data structures, e.g. at::vec and c10::Half. These are performance critical, so it doesn't make sense to create a C shim for them. Instead, we make sure they are implemented in a header-only way, and use this set of tests to guard future changes.

There are more header files to be updated, but we will do that in follow-up PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123848
Approved by: https://github.com/jansel
ghstack dependencies: #123847
2024-04-19 00:51:24 +00:00
JackCaoG
9ed9b22ec0 Implement efficient_conv_bn_eval_decomp_graph_transform to handle conv and bn fusion after decomp (#123680)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123680
Approved by: https://github.com/ezyang, https://github.com/youkaichao
2024-04-19 00:22:25 +00:00
Shuqiang Zhang
ca6a0e1348 [c10d] remove the env of TORCH_NCCL_ABORT_IN_DESTROY_PG (#124334)
Summary:
This env var was introduced to safely roll out the behavior change in destroy process group (e.g., calling ncclCommsAbort). Now that this behavior change has been rolled out, we no longer need this env var, and we should clean it up to keep the code cleaner.
Test Plan:
Modified/existing ut pass

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124334
Approved by: https://github.com/wconstab
2024-04-18 23:42:55 +00:00
eellison
e4f6340f21 realize inputs to mem bound mm decomposition (#123165)
Differential Revision: [D55639709](https://our.internmc.facebook.com/intern/diff/D55639709)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123165
Approved by: https://github.com/jackiexu1992
2024-04-18 23:10:04 +00:00
Mikayla Gawarecki
5ba6bb7b2f Add swap_tensors path to nn parametrizations (#124130)
Fixes #123859

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124130
Approved by: https://github.com/albanD
2024-04-18 22:22:08 +00:00
Wei Wei
87f651c7e7 fix cpu test errors (#124116)
Similar fix is from @int3 but not landed. Credit to @int3 too.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124116
Approved by: https://github.com/chenyang78
2024-04-18 20:30:58 +00:00
ydwu4
2e48b39603 Fix example_value of map (#124203)
Previously, we didn't expand the shape of the example_value of map to match the inputs (edit: the first mapped dimension). This PR fixes that bug. To make this easier, we change _call_function_and_unflatten_output to accept example_values directly instead of retrieving them from the variable trackers.

Also remove a redundant call function node in strict_mode higher order op in dynamo.

Test Plan:
existing tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124203
Approved by: https://github.com/ezyang, https://github.com/zou3519
2024-04-18 19:18:36 +00:00
PyTorch MergeBot
4a0900d04b Revert "[NJT] Inline through torch.nested.nested_tensor_from_jagged instead of graph break (#124343)"
This reverts commit ef93402f61.

Reverted https://github.com/pytorch/pytorch/pull/124343 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/124343#issuecomment-2064937192))
2024-04-18 18:55:48 +00:00
Sheng Fu
89407eca3b Capture triton kernel in execution trace (#124140)
Summary: This diff captures triton kernels in the execution trace.

Test Plan: buck test  mode/dev-nosan caffe2/test:profiler -- test_execution_trace_with_pt2

Differential Revision: D56162599

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124140
Approved by: https://github.com/briancoutinho
2024-04-18 18:38:26 +00:00
angelayi
74bedbb9e1 [export] Serialize rational symint ranges (#123884)
Some symints result in rational ranges like 10/3, which run into an error ([example](https://www.internalfb.com/intern/everpaste/?handle=GMG2AxkeoFUrh-UDAFcE8pKPgjoUbsIXAAAB)).

Ed will eventually get rid(?) of these rational ranges, but as a workaround export can just clamp the results at serialization time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123884
Approved by: https://github.com/zhxchen17
2024-04-18 18:20:11 +00:00
Aaron Orenstein
37215a4fa2 Fix memory leak in pattern_matcher (#124345)
#121313 changed precompiled patterns so they are more integrated with the pattern matching code. This resulted in a list of "known" patterns (with their example data) being stored globally. Unfortunately, since small FakeTensors store the original tensor as a constant, this meant that we leaked CUDA tensors through the example data.

Fix this by clearing out the constant storage for the example data that we keep around.

Fixes #124081

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124345
Approved by: https://github.com/xuzhao9
2024-04-18 17:38:12 +00:00
egienvalue
d7e1bf9ff9 torch.mtia module for MTIA device backend (#123612)
The MTIA device now has its own module in PyTorch.
torch.mtia has the following APIs, similar to other backends. Lazy init is also supported.
```
__all__ = [
    "init",
    "is_available",
    "synchronize",
    "device_count",
    "current_device",
    "current_stream",
    "default_stream",
    "set_stream",
    "stream",
    "device",
]

```
------------
For device management, we expand AcceleratorHooksInterface to support generic device management; it can be used from both C++ and Python.
```
def _accelerator_hooks_device_count() -> _int: ...
def _accelerator_hooks_set_current_device(device_index: _int) -> None: ...
def _accelerator_hooks_get_current_device() -> _int : ...
def _accelerator_hooks_exchange_device(device_index: _int) -> _int : ...
def _accelerator_hooks_maybe_exchange_device(device_index: _int) -> _int : ...
```

---------
Add a get_device_module API to retrieve the device module for different device types.
```
def get_device_module(device: Optional[Union[torch.device, str]] = None)
```
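
A minimal sketch of the lookup this provides (assuming the API lands as described):
```python
import torch

cuda_mod = torch.get_device_module("cuda")   # returns the torch.cuda module
print(cuda_mod.is_available())

# With no argument, it is expected to return the module for the current accelerator.
default_mod = torch.get_device_module()
```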
---------
@exported-using-ghexport

Differential Revision: [D52923602](https://our.internmc.facebook.com/intern/diff/D52923602/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123612
Approved by: https://github.com/albanD
ghstack dependencies: #123611
2024-04-18 17:38:06 +00:00
egienvalue
cb17721899 Build device generic torch.Stream and torch.Event based on c10::Stream/Event (#123611)
This diff builds device-generic torch.Stream and torch.Event for newly added accelerators in PyTorch.
------------
**torch.Stream APIs**
```
# Defined in torch/csrc/Stream.cpp
class Stream(_StreamBase):
    stream_id: _int  # Stream id
    device_index: _int
    device_type: _int

    device: _device  # The device of the stream

    @overload
    def __new__(self, device: Optional[DeviceLikeType] = None, priority: _int = 0) -> Stream: ...
    @overload
    def __new__(self, stream_id: _int, device_index: _int, device_type: _int, priority: _int = 0) -> Stream: ...
    def query(self) -> _bool: ...
    def synchronize(self) -> None: ...
    def wait_event(self, event: Event) -> None: ...
    def wait_stream(self, other: Stream) -> None: ...
    def record_event(self, event: Optional[Event] = None) -> Event: ...
    def query(self) -> None: ...
    def synchronize(self) -> None: ...
    def __hash__(self) -> _int: ...
    def __repr__(self) -> str: ...
    def __eq__(self, other: object) -> _bool: ...
```
------------------
**torch.Event APIs**:
- IPC-related APIs are not implemented, since many device backends don't support them, but we leave the interfaces there for future adaptation of torch.cuda.Stream.
- Currently only enable_timing is supported, since it is the most common flag used in other device backends. We have to refactor the event flag system in PyTorch to support fancier flags.
- An elapsedTime API is added to c10::Event.

```
# Defined in torch/csrc/Event.cpp
class Event(_EventBase):

    device: _device  # The device of the Event
    event_id: _int # The raw event created by device backend

    def __new__(self,
        device: Optional[DeviceLikeType] = None,
        enable_timing: _bool = False,
        blocking: _bool = False,
        interprocess: _bool = False) -> Event: ...
    @classmethod
    def from_ipc_handle(self, device: DeviceLikeType, ipc_handle: bytes) -> Event: ...
    def record(self, stream: Optional[Stream] = None) -> None: ...
    def wait(self, stream: Optional[Stream] = None) -> None: ...
    def query(self) -> _bool: ...
    def elapsed_time(self, other: Event) -> _float: ...
    def synchronize(self) -> None: ...
    def ipc_handle(self) -> bytes: ...
    def __repr__(self) -> str: ...
```
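
A hedged usage sketch of the device-generic objects above (assumes a machine with the chosen backend available):
```python
import torch

s = torch.Stream(device="cuda")        # any accelerator backend that implements c10::Stream
e = torch.Event(enable_timing=True)    # only enable_timing is supported, per the note above

e.record(s)       # record the event on the stream
s.synchronize()   # block until the stream's work is done
print(e.query())  # True once the recorded work has completed
```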

-----------

c10::Event provides new APIs to:
- calculate **elapsedTime**
- get the raw event id
- synchronize the event

```
  double elapsedTime(const Event& event) const {
    return impl_.elapsedTime(event.impl_);
  }

  void* eventId() const {
    return impl_.eventId();
  }

  void synchronize() const {
    return impl_.synchronize();
  }
```
----------
TODO: need to find a good way to test them in PyTorch with API mocks.

Differential Revision: [D55351839](https://our.internmc.facebook.com/intern/diff/D55351839/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123611
Approved by: https://github.com/albanD
2024-04-18 17:35:09 +00:00
Jason Ansel
7a6edb0b66 Possible fix for einops warning (#124084)
See https://github.com/arogozhnikov/einops/issues/315

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124084
Approved by: https://github.com/peterbell10
2024-04-18 17:09:50 +00:00
Zhengxu Chen
e1062f5738 [export] Add a printer to unflattened module. (#124315)
Summary: add a helper method to print the graph at every level of the unflattened module.

Test Plan: {F1489609684}

Differential Revision: D56263195

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124315
Approved by: https://github.com/tugsbayasgalan
2024-04-18 16:35:51 +00:00
Boyuan Feng
aa2da0cdd2 [Export] Add runtime assert to non-strict export (#123681)
This PR moves insert_deferred_runtime_asserts from dynamo to torch.fx.passes and uses it to add runtime assertions for non-strict export.

Differential Revision: D55944267

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123681
Approved by: https://github.com/tugsbayasgalan, https://github.com/angelayi
2024-04-18 16:13:27 +00:00
soulitzer
ef93402f61 [NJT] Inline through torch.nested.nested_tensor_from_jagged instead of graph break (#124343)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124343
Approved by: https://github.com/jbschlosser
2024-04-18 14:42:54 +00:00
Andrew Gu
bbb6e36495 [FSDP2] Fixed set_requires_gradient_sync's recurse arg (#124318)
The `recurse` argument was not being respected for `set_requires_gradient_sync`. This PR fixes that.

The previous unit test did not have nested FSDP modules with managed parameters, so `recurse=False` was not being exercised. We augment the unit test to disable gradient sync only for the root module and not its children.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124318
Approved by: https://github.com/weifengpy
ghstack dependencies: #120952, #124293
2024-04-18 14:21:57 +00:00
rzou
1542874311 Delete qualname from custom_op decorator (#124092)
I forgot to delete this in an earlier PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124092
Approved by: https://github.com/albanD
ghstack dependencies: #123937, #124064, #124065, #124066, #124071, #124089
2024-04-18 12:48:04 +00:00
rzou
648c39c47d Add OpOverload.redispatch; use it in new custom ops API (#124089)
A kernel has "dispatcher convention" if there is an additional keyset
arg at the beginning of the argument list. This PR:
- adds a way to register kernels with dispatcher_convention using
  Library.impl (pass dispatcher_convention = True)
- adds OpOverload.redispatch

We use both of the above in the new custom ops API: we register the autograd kernel in dispatcher convention so that we can call redispatch the way PyTorch built-in ops do.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124089
Approved by: https://github.com/albanD
ghstack dependencies: #123937, #124064, #124065, #124066, #124071
2024-04-18 12:48:04 +00:00
rzou
645173a0b5 Add torch.library.register_autograd (#124071)
Allows registering autograd for all custom op entry points:
- the new-style custom op API (custom_op)
- the old-style torch.library APIs
- C++ operator registration
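
A hedged sketch of the Python entry point (the op `mylib::mysin` and its kernels are hypothetical):
```python
import torch

@torch.library.custom_op("mylib::mysin", mutates_args=())
def mysin(x: torch.Tensor) -> torch.Tensor:
    return torch.sin(x)

def setup_context(ctx, inputs, output):
    (x,) = inputs
    ctx.save_for_backward(x)

def backward(ctx, grad):
    (x,) = ctx.saved_tensors
    return grad * x.cos()

torch.library.register_autograd("mylib::mysin", backward, setup_context=setup_context)
```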

Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124071
Approved by: https://github.com/albanD
ghstack dependencies: #123937, #124064, #124065, #124066
2024-04-18 12:47:59 +00:00
rzou
8135c4b921 torch.library.register_fake now accepts more types (#124066)
We allow it to accept:
- a string with the op name
- an OpOverload
- a new-style custom op

If any of these refers to a new-style custom op (created with the custom_op decorator), then we dispatch to CustomOpDef.register_fake. Otherwise, we do what we did previously.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124066
Approved by: https://github.com/albanD
ghstack dependencies: #123937, #124064, #124065
2024-04-18 12:47:55 +00:00
xinan.lin
6fcbeb3489 [ATen] Add CPU fp16 support for nll_loss and cross_entropy_loss (#123256)
Add CPU FP16 support for nll_loss and cross_entropy_loss.
Resolve issue #123328.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123256
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/malfet
2024-04-18 11:44:38 +00:00
IvanKobzarev
d59f1da62f [sym_shapes][perf] _find not update unchanged replacements (#124274)
Differential Revision: [D56236380](https://our.internmc.facebook.com/intern/diff/D56236380)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124274
Approved by: https://github.com/ezyang
2024-04-18 08:32:02 +00:00
IvanKobzarev
9eba1995d0 [sym_shapes][perf] Use sympy xreplace instead of subs (#124208)
https://github.com/sympy/sympy/issues/22240
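
A tiny illustration of the difference (plain sympy, not the ShapeEnv code):
```python
import sympy

x, y, z = sympy.symbols("x y z")
expr = x**2 + 2 * y

expr.subs(x, z)        # does mathematical matching and may simplify along the way (slower)
expr.xreplace({x: z})  # pure structural exact replacement, no simplification (faster)
```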

Differential Revision: [D56207553](https://our.internmc.facebook.com/intern/diff/D56207553)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124208
Approved by: https://github.com/ezyang, https://github.com/lezcano
2024-04-18 08:19:03 +00:00
PyTorch MergeBot
2b82345e48 Revert "Re-land precompile triton templates (#124030)"
This reverts commit 030bb13fe8.

Reverted https://github.com/pytorch/pytorch/pull/124030 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/124030#issuecomment-2063191117))
2024-04-18 07:21:41 +00:00
Animesh Jain
704fac5618 [dynamo][cpp-guard] Reland Attempt 1 - Enable cpp guard manager (#124231)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124231
Approved by: https://github.com/jansel
ghstack dependencies: #124230, #124237
2024-04-18 06:36:20 +00:00
PyTorch MergeBot
6e86a40694 Revert "[Dynamo] Check for __bool__ attribute before accessing it (#120943)"
This reverts commit dd7aeedb72.

Reverted https://github.com/pytorch/pytorch/pull/120943 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/120943#issuecomment-2063098295))
2024-04-18 06:34:32 +00:00
PyTorch MergeBot
8ff85b42f9 Revert "Add swap_tensors path to nn parametrizations (#124130)"
This reverts commit 64f6ddf12c.

Reverted https://github.com/pytorch/pytorch/pull/124130 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/124130#issuecomment-2063074856))
2024-04-18 06:12:54 +00:00
Zhuoran Zhao
8ad66e05d2 [4/x][AMD][Lowering Enablement] Enabling meta internal AOTInductor compilation on ROCM (#124123)
Summary: as title

Test Plan: CI & unit test

Differential Revision: D56163334

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124123
Approved by: https://github.com/chenyang78, https://github.com/jansel
2024-04-18 04:19:37 +00:00
xinan.lin
c9ab9248ce [Inductor Intel GPU backend Upstream] Generalize device-bias code in (#124249)
Generalize device-biased code in triton_utils.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124249
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/jansel
2024-04-18 03:54:31 +00:00
Yanan Cao (PyTorch)
27daa110c8 Back out "Refresh OpOverloadPacket if a new OpOverload gets added (#123578)" (#124324)
Summary:
Original commit changeset: 528276bc8a92

Original Phabricator Diff: D56057952

Differential Revision: D56271240

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124324
Approved by: https://github.com/davidberard98
2024-04-18 03:33:54 +00:00
Animesh Jain
f213f262af [dynamo][cpp-guards] Improve when to use Dict vs DictSubclassGuardManager (#124237)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124237
Approved by: https://github.com/jansel, https://github.com/mlazos
ghstack dependencies: #124230
2024-04-18 03:33:37 +00:00
William Wen
812bae09be [dynamo] fix 3.11+ refleak (#124238)
Fixes https://github.com/pytorch/pytorch/issues/119607 for 3.11+.

In 3.11+, `_PyFrame_FastToLocalsWithError` could implicitly run `COPY_FREE_VARS` on the original frame, leading to double increfs since the dynamo shadow frame can rerun `COPY_FREE_VARS`. So the solution is to skip the first `COPY_FREE_VARS` instruction in the shadow frame if it was already executed in the original frame.

Also move the location for clearing the original frame in 3.12 to handle error cases more thoroughly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124238
Approved by: https://github.com/jansel
2024-04-18 03:02:29 +00:00
Animesh Jain
6b4b857a60 [dynamo][nn_module] Enable torch.compile/disable as decorators on the class (#124187)
Support something like the following. This is a UI change, so please review carefully.

~~~
        @torch._dynamo.disable
        class SimpleLinear(torch.nn.Module):
            def __init__(self):
                super().__init__()
                self.layer0 = torch.nn.Linear(4, 4)

            def forward(self, inp):
                return self.layer0(torch.sigmoid(inp))

        @torch.compile(backend=cnts)
        class SimpleModel(torch.nn.Module):
            def __init__(self):
                super().__init__()
                self.layer0 = SimpleLinear()
                self.layer1 = torch.nn.Linear(4, 4)

            def forward(self, inp):
                z = self.layer0(torch.sin(inp))
                return self.layer1(z)
~~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124187
Approved by: https://github.com/yanboliang, https://github.com/jansel
2024-04-18 02:51:30 +00:00
Simon Fan
b6b757701e [aot] trim refcount for subclass runtime wrapper (#124155)
On torchtrain,

before
<img width="1218" alt="image" src="https://github.com/pytorch/pytorch/assets/9547562/b340c114-071a-440c-904c-c042de4d92c5">

after
![image](https://github.com/pytorch/pytorch/assets/9547562/ee3b6e6f-6e46-46bc-a93d-d4603673ee63)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124155
Approved by: https://github.com/jansel, https://github.com/bdhirsh
ghstack dependencies: #124127
2024-04-18 02:34:52 +00:00
Sun, Jiayi
1f04c29be5 [inductor] Freeze the layout of the conv input to channels_last (#122765)
Fix https://github.com/pytorch/pytorch/issues/118082.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122765
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-04-18 02:23:38 +00:00
Sun, Jiayi
51a56efbb9 [inductor] modify the output_stride of ConcatKernel (#122761)
Fix https://github.com/pytorch/pytorch/issues/121613.
Modify the `output_stride` of `ConcatKernel`: if any input to `Concat` is a `Pointwise`, check the layout of all inputs to that `Pointwise`; if any of them is in channels_last format, set channels_last strides for the `output_stride`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122761
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-04-18 02:19:46 +00:00
Sun, Jiayi
78f3b99a94 [inductor] Modify the rules for freezing the layout of x.unwrap_view() in convert_to_reinterpret_view (#122760)
Fix https://github.com/pytorch/pytorch/issues/121607

Modify the rules for freezing the layout of `x.unwrap_view()` in `convert_to_reinterpret_view`: If any read of `x.unwrap_view()` is in channels_last format, freeze the layout of `x.unwrap_view()` to channels_last format.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122760
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel
2024-04-18 02:17:07 +00:00
Shunting Zhang
b71423c2e4 [inductor] let coordesc tuner respect max RBLOCK (#124325)
Fix https://github.com/pytorch/pytorch/issues/124251 .

The coordesc tuner needs to respect the max RBLOCK. When rnumel is a multiple of max-RBLOCK, inductor codegen will skip the rmask. If the coordesc tuner does not consider max-RBLOCK and picks an RBLOCK larger than that, we get a CUDA IMA (illegal memory access) error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124325
Approved by: https://github.com/Chillee, https://github.com/jansel
2024-04-18 02:12:35 +00:00
Pearu Peterson
43b4ac956e Add index_reduce decomposition (#122579)
As in the title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122579
Approved by: https://github.com/peterbell10
ghstack dependencies: #123375
2024-04-18 01:30:47 +00:00
eellison
030bb13fe8 Re-land precompile triton templates (#124030)
Re-land precompile triton templates. This got reverted because we were precompiling templates without checking the cache. I have since added logic and a test to ensure we do not precompile if there is a cache hit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124030
Approved by: https://github.com/shunting314, https://github.com/nmacchioni, https://github.com/yoyoyocmu
2024-04-18 01:22:13 +00:00
Michael Lazos
102a223216 Enable dynamo test_state_dict_deterministic (#123323)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123323
Approved by: https://github.com/janeyx99
ghstack dependencies: #123498, #123322
2024-04-18 01:06:28 +00:00
Michael Lazos
d88fcb86d8 Enable dynamo traced test_forloop_goes_right_direction (#123322)
Removed a bunch of skips. I also updated test_forloop_goes_right_direction to *not* use the closure when dynamo is tracing. The reason for this is that testing the disabled optimizer doesn't actually test anything.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123322
Approved by: https://github.com/janeyx99
ghstack dependencies: #123498
2024-04-18 00:50:10 +00:00
Michael Lazos
57a3dc56d4 Small Adamax fix (#123498)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123498
Approved by: https://github.com/janeyx99
2024-04-18 00:50:03 +00:00
rzou
5a60a1abde Move the implementation of register_fake onto torch.library.Library (#124065)
Motivations:
- This makes things more consistent: using a Library object, you should be able to use all of the registration APIs that tie registrations to the lifetime of the Library.
- I need this for the next PR up in the stack, where we will have
  torch.library.register_fake support both CustomOpDef (from the new
  custom ops API) and other custom ops.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124065
Approved by: https://github.com/albanD
ghstack dependencies: #123937, #124064
2024-04-17 23:51:20 +00:00
rzou
d1e1d671ef Stop requiring a pystub for register_fake by default (#124064)
Previously, if someone used `register_fake` to add a fake impl for an
operator defined in C++, we would require them to add a
`m.set_python_module(<module>)` call to C++. This was to avoid
situations where a user imported the C++ operator without importing the
fake impl.

This "breaks" open registration: there's no way to add a fake impl
outside of a repository that defines an operator, so we want to turn
this behavior off by default in open source.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124064
Approved by: https://github.com/albanD
ghstack dependencies: #123937
2024-04-17 23:51:20 +00:00
Mikayla Gawarecki
64f6ddf12c Add swap_tensors path to nn parametrizations (#124130)
Fixes #123859

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124130
Approved by: https://github.com/albanD
2024-04-17 23:37:28 +00:00
Andrew Gu
b5235694f4 [FSDP2] Made unshard return type consistent (#124293)
We can always return an `UnshardHandle` if `async_op=True` even if the FSDP module does not manage any parameters and hence does not have an `FSDPParamGroup`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124293
Approved by: https://github.com/weifengpy
ghstack dependencies: #120952
2024-04-17 23:33:46 +00:00
Andrew M. James
64f42bfd52 [dynamo] Support list.reverse (#124210)
fixes #123974

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124210
Approved by: https://github.com/peterbell10
2024-04-17 23:33:32 +00:00
Matthias Reso
dd7aeedb72 [Dynamo] Check for __bool__ attribute before accessing it (#120943)
This PR checks if the __bool__ attribute is available before accessing it when handling a UserDefinedObjectVariable.

Fixes #119782

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120943
Approved by: https://github.com/zou3519
2024-04-17 23:26:55 +00:00
Nikita Shulga
00372b1211 Extend int[48]mm ops to float32 input (#124287)
Just for completeness

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124287
Approved by: https://github.com/mikekgfb
2024-04-17 23:10:49 +00:00
vfdev-5
6330acae76 Refactored implementation for upsample_nearest decompostions (#122783)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122783
Approved by: https://github.com/peterbell10
2024-04-17 23:05:40 +00:00
Edward Z. Yang
bebdbb63ce Introduce set_example_value and use it throughout Dynamo (#124176)
I'm going to set up some extra behavior when we set example value, so I need a convenient place to interpose. I cannot easily do it on meta itself because it's a generic dict with no interposition point.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124176
Approved by: https://github.com/oulgen
ghstack dependencies: #124105, #124059
2024-04-17 22:57:11 +00:00
Tugsbayasgalan Manlaibaatar
d23bf9cef0 Add fake impl for aten.unique2 (#124306)
Reapply of: https://github.com/pytorch/pytorch/pull/121571
Differential Revision: [D56258431](https://our.internmc.facebook.com/intern/diff/D56258431)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124306
Approved by: https://github.com/gmagogsfm
2024-04-17 22:55:27 +00:00
Tristan Rice
1ec05c769b all_gather and reduce_scatter autograd (#123989)
This adds `all_gather_tensor_autograd` and `reduce_scatter_tensor_autograd` to the functional_collectives library.

This only supports `sum` mode for `reduce_scatter` but should be easy to extend in the future.

The backwards implementations match the behavior in https://github.com/pytorch/torchrec/blob/main/torchrec/distributed/comm_ops.py

This follows the pattern of #123599 .

Test plan:

```sh
pytest test/distributed/test_functional_api.py -k Autograd
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123989
Approved by: https://github.com/wanchaol
2024-04-17 21:32:22 +00:00
Xuehai Pan
93e249969b [BE] enable ruff rule RSE and remove useless parentheses in raise statements (#124261)
Remove useless parentheses in `raise` statements if the exception type is raised with no argument.
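
For reference, what the rule rewrites:
```python
def require(flag: bool) -> None:
    if not flag:
        # Before: raise ValueError()
        # After (RSE): the parentheses add nothing when the exception takes no arguments.
        raise ValueError
```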

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124261
Approved by: https://github.com/albanD
2024-04-17 19:29:34 +00:00
Oguz Ulgen
24cecf06d7 Update autotune jk knobs (#124214)
Differential Revision: [D56201145](https://our.internmc.facebook.com/intern/diff/D56201145/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124214
Approved by: https://github.com/aakhundov
2024-04-17 17:49:25 +00:00
Animesh Jain
f433517181 [dynamo][decorator] Support disable on nn modules (#124185)
Fixes https://github.com/pytorch/pytorch/issues/123979

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124185
Approved by: https://github.com/weifengpy, https://github.com/yoyoyocmu
2024-04-17 16:20:34 +00:00
Xuehai Pan
7e1c98c171 [dynamo] support object.__setattr__(obj, name, value) (#124068)
Resolves #114964
Resolves #114966

- #114964
- #114966
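
A minimal sketch of the now-traceable pattern (the class here is illustrative):
```python
import torch

class Box:
    pass

@torch.compile
def fn(x):
    b = Box()
    object.__setattr__(b, "value", x + 1)  # previously caused a graph break in Dynamo
    return b.value

print(fn(torch.randn(3)))
```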

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124068
Approved by: https://github.com/jansel
2024-04-17 15:57:14 +00:00
PyTorch MergeBot
36f6928a37 Revert "[Profiler][PrivateUse1] Profiler support PrivateUse1 key (#120556)"
This reverts commit 41613a0803.

Reverted https://github.com/pytorch/pytorch/pull/120556 on behalf of https://github.com/aaronenyeshi due to Breaks GPU Chrome trace UI ([comment](https://github.com/pytorch/pytorch/pull/120556#issuecomment-2061578951))
2024-04-17 15:38:14 +00:00
Pearu Peterson
d2b0c0a34e Fix index_reduce sampler filter when op_info.variant_test_name is specified (#123375)
As in the title: `index_reduce` sample must correspond to reduction type specified by `variant_test_name`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123375
Approved by: https://github.com/zou3519, https://github.com/peterbell10
2024-04-17 15:31:28 +00:00
rzou
47dbfecd37 Rename impl_abstract to register_fake, part 1/2 (#123937)
This PR:
- adds a new torch.library.register_fake and deprecates
  torch.library.impl_abstract. The motivation is that we have a lot of
  confusion around the naming so we are going to align the naming with
  the actual subsystem (FakeTensor).
- renames `m.impl_abstract_pystub("fbgemm_gpu.sparse_ops")` to
  `m.has_python_registration("fbgemm_gpu.sparse_ops")`. No deprecation
  here yet; I need to test how this works with static initialization.
- Renames a bunch of internals to match (e.g. abstractimplpystub ->
  pystub)

I'm scared to rename the Python-side internal APIs (e.g.
torch._library.abstract_impl) because of torch.package concerns. I'll do
that in its own isolated PR next just in case it causes problems.

DEPRECATION NOTE: torch.library.impl_abstract was renamed to to
torch.library.register_fake. Please use register_fake. We'll delete
impl_abstract in a future version of PyTorch.
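
Migration sketch per the deprecation note (the op `mylib::my_op` is hypothetical, defined here only for illustration):
```python
import torch

torch.library.define("mylib::my_op", "(Tensor x) -> Tensor")

# Before (deprecated): @torch.library.impl_abstract("mylib::my_op")
# After:
@torch.library.register_fake("mylib::my_op")
def _(x):
    return torch.empty_like(x)
```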

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123937
Approved by: https://github.com/albanD
2024-04-17 12:46:01 +00:00
PyTorch MergeBot
2dc15b6849 Revert "[sparse] Add fast semi-structured sparsification kernels (#122350)"
This reverts commit 14b2273b0c.

Reverted https://github.com/pytorch/pytorch/pull/122350 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/122350#issuecomment-2061070350))
2024-04-17 11:47:02 +00:00
PyTorch MergeBot
3f89f565bb Revert "Re-land precompile triton templates (#124030)"
This reverts commit d68196e7ef.

Reverted https://github.com/pytorch/pytorch/pull/124030 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/124030#issuecomment-2061044960))
2024-04-17 11:31:33 +00:00
PyTorch MergeBot
77ad630f5d Revert "Dont precompile already seen keys, limit epilogue choices (#122642)"
This reverts commit 050051f412.

Reverted https://github.com/pytorch/pytorch/pull/122642 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/124030#issuecomment-2061044960))
2024-04-17 11:31:32 +00:00
FFFrog
acc466751b Add bfloat16 support to binary_cross_entropy for CPU (#123823)
Fixes #123715

As the title stated.

But maybe we should pay attention to https://github.com/pytorch/pytorch/pull/33206, which removed half support for CPU about 4 years ago.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123823
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-04-17 09:44:07 +00:00
chunyuan
d0211e207c inductor cpp wrapper: add GIL release back (#123897)
Fixes https://github.com/pytorch/pytorch/issues/123517.
This PR adds the GIL release (originally added in https://github.com/pytorch/pytorch/pull/111888) back following the suggestion here: https://github.com/pytorch/pytorch/pull/123897#discussion_r1562509705.
We added a default constructor and an assignment operator for the `RAIIPyObject` class
 (https://github.com/pytorch/pytorch/pull/123897#discussion_r1566262575) in order to declare the `custom_op_wrapper` outside of the GIL acquisition scope.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123897
Approved by: https://github.com/peterbell10, https://github.com/jgong5
2024-04-17 07:18:14 +00:00
Tugsbayasgalan Manlaibaatar
dd3cea3291 Fix derived dim bugs in ep.run_decomp (#123326)
Differential Revision: [D55730289](https://our.internmc.facebook.com/intern/diff/D55730289)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123326
Approved by: https://github.com/avikchaudhuri
2024-04-17 04:00:55 +00:00
Edward Z. Yang
236b0d12fa Don't clamp slices generated from cat kernel (#124139)
Fixes https://github.com/pytorch/pytorch/issues/123793

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124139
Approved by: https://github.com/Microve, https://github.com/peterbell10, https://github.com/Skylion007
2024-04-17 03:13:10 +00:00
eellison
050051f412 Dont precompile already seen keys, limit epilogue choices (#122642)
Two changes:
- In epilogue benchmark fusion, only take the top 6 choices. Basically no choices beyond the top 6 were selected in HF.
- Share a single precompilation function among matmuls with the same key.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122642
Approved by: https://github.com/shunting314
ghstack dependencies: #124030
2024-04-17 03:08:59 +00:00
Animesh Jain
51cc808ac7 [dynamo][cpp-guards] Missing decref on early returns in DictSubclassGuardManager (#124230)
I am sad that I missed this earlier. Good thing is that CI caught it. Will be more careful next time.

This was the reason https://github.com/pytorch/pytorch/pull/123547 is reverted - https://github.com/pytorch/pytorch/pull/123547#issuecomment-2058350245

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124230
Approved by: https://github.com/mlazos
2024-04-17 02:49:07 +00:00
eellison
d68196e7ef Re-land precompile triton templates (#124030)
Re-land precompile triton templates. This got reverted because we were precompiling templates without checking the cache. I have since added logic and a test to ensure we do not precompile if there is a cache hit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124030
Approved by: https://github.com/shunting314, https://github.com/nmacchioni, https://github.com/yoyoyocmu
2024-04-17 02:30:46 +00:00
CK Luk
32ca18ea3b Handle the case when one of the output of forward pass is None (#123988)
Summary: When applying FSDP-2 to the FM-FB benchmark with the FullModel model, we ran into an error where one of the output tensors of the forward pass is None. I double-checked that the same output tensor is also None in FSDP-1. So, we just need to handle the None properly here.

Test Plan:
See that in the internal diff.

Differential Revision: D56087956

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123988
Approved by: https://github.com/awgu
2024-04-17 02:18:32 +00:00
Valentine233
6e4c4e93b6 [Inductor] add contiguous layout optm for bmm input (#122599)
Fixes #117743.

Add contiguous layout optimization for `bmm` input, to avoid additional copies.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122599
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/eellison
2024-04-17 02:12:20 +00:00
Oguz Ulgen
1fd9e320ea Remove unnecessary FileLock in Fx Graph Cache (#124212)
Writing to a file happens via `write_atomic`, so there's no need to take a global lock on the file system. This is likely creating unnecessary waits.

Differential Revision: [D56208628](https://our.internmc.facebook.com/intern/diff/D56208628/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124212
Approved by: https://github.com/masnesral, https://github.com/eellison
2024-04-17 01:02:41 +00:00
Theodore Ehrenborg
f56c4572a6 Fix typos in docs (#124218)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124218
Approved by: https://github.com/albanD
2024-04-17 00:46:08 +00:00
Andrew Gu
bf45ac8c98 [FSDP2] Added explicit unshard(async_op) API (#120952)
This PR adds an `unshard(async_op: bool = False)` API to manually unshard the parameters via all-gather. This can be used for reordering the all-gather with other collectives (e.g. all-to-all).

This currently requires the user to set `TORCH_NCCL_AVOID_RECORD_STREAMS=1` to avoid `recordStream` from `ProcessGroupNCCL` and get expected memory behaviors.

Differential Revision: [D56148725](https://our.internmc.facebook.com/intern/diff/D56148725)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120952
Approved by: https://github.com/wanchaol
2024-04-17 00:39:34 +00:00
Catherine Lee
0abd3f60fd [CI] Reduce CI_SERIAL_LIST list (#124085)
Add serial marker for individual tests so the test file can be removed from the ci serial list
Run serial marked tests first in serial
Run all other tests afterwards in parallel

Slowly reduce list and mark individual tests as serial instead

Hope # of serial tests is small so sharding evenness doesn't get too messed up

Hopefully can do 3 procs for sm86 and cpu?

serial no longer looks like a real word to me

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124085
Approved by: https://github.com/seemethere, https://github.com/malfet
2024-04-17 00:23:47 +00:00
IvanKobzarev
e7cf6f81ea [sym_shapes][perf] Skip assert in check_is_size (#124209)
Differential Revision: [D56207943](https://our.internmc.facebook.com/intern/diff/D56207943)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124209
Approved by: https://github.com/ezyang
2024-04-17 00:10:06 +00:00
Edward Z. Yang
cebf65126c FakeTensorProp assert consistency of sizes when metadata previously existed (#124059)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124059
Approved by: https://github.com/bdhirsh, https://github.com/thiagocrepaldi
ghstack dependencies: #124105
2024-04-16 23:28:42 +00:00
andrewor14
3eea300680 [quant] Do not decompose choose_qparams_per_token_asymmetric (#124178)
Summary: https://github.com/pytorch/pytorch/pull/123452 added
backward support to this op by turning it into
CompositeImplicitAutograd, which meant it gets decomposed during
export/compile. However, this is not desirable behavior for the
PTQ case when we try to lower the model. This commit enables
QAT without breaking PTQ by refactoring the impl into a separate
op that does have backward support.

Test Plan:
python test/test_quantization.py -k test_decomposed_choose_qparams_per_token_asymmetric_backward

Reviewers: jerryzh168, digantdesai, zou3519

Subscribers: jerryzh168, digantdesai, zou3519, supriyar

Differential Revision: [D56192116](https://our.internmc.facebook.com/intern/diff/D56192116)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124178
Approved by: https://github.com/digantdesai
2024-04-16 22:58:48 +00:00
Shunting Zhang
3e90e93a78 [inductor] disable comprehensive padding in fbcode (#124191)
Comprehensive padding causes a small NE change and fails an internal test. Disable it for the internal use case to mitigate.

Differential Revision: [D56197430](https://our.internmc.facebook.com/intern/diff/D56197430)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124191
Approved by: https://github.com/jansel
2024-04-16 22:44:08 +00:00
Xilun Wu
b3f88317ec [dtensor][5/N] have table-wise sharding use LocalShardsWrapper on participating ranks only (#122853)
**Summary**
We wrap DTensor's local tensor in `LocalShardsWrapper` for torchrec's table-wise sharding. The exception is non-participating ranks: on those ranks, the local tensor is an empty torch.Tensor object. The reason for this design is to avoid the complexity of supporting the empty-tensor case in `LocalShardsWrapper`.

**Test**
`torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/torchrec_sharding_example.py -e table-wise`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122853
Approved by: https://github.com/wz337
ghstack dependencies: #120265, #121392, #122843
2024-04-16 22:27:30 +00:00
Xilun Wu
d419fcd19f [dtensor][4/N] have row-wise sharding always use LocalShardsWrapper (#122843)
**Summary**
Always wrap the local tensor in a `LocalShardsWrapper`. This is for uniformity and makes it easier to adopt DTensor as a wrapper for the local shard(s) representation. To support more tensor ops on `LocalShardsWrapper`, users need to extend its `__torch_dispatch__`.

**Test**
`torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/torchrec_sharding_example.py -e row-wise-even`

**Result**
```
Row-wise even sharding example in DTensor
         Col 0-15
-------  ----------
Row 0-1  cuda:0
Row 2-3  cuda:1
Row 4-5  cuda:2
Row 6-7  cuda:3
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122843
Approved by: https://github.com/wz337
ghstack dependencies: #120265, #121392
2024-04-16 22:27:30 +00:00
Xilun Wu
1d7ac7baa0 [dtensor][3/N] add torchrec row-wise uneven sharding example (#121392)
**Test**
`torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/torchrec_sharding_example.py -e row-wise-uneven`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121392
Approved by: https://github.com/wanchaol
ghstack dependencies: #120265
2024-04-16 22:27:28 +00:00
Xilun Wu
9d3543df9a [dtensor][2/N] add torchrec table-wise sharding example (#120265)
**Summary**
This PR serves as a start of this effort by adding an example test that represents TorchRec's `ShardingType.TABLE_WISE` using DTensor.

**Test**
`torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/torchrec_sharding_example.py -e table-wise`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120265
Approved by: https://github.com/wanchaol
2024-04-16 22:27:24 +00:00
PyTorch MergeBot
9d88339b53 Revert "make sure dynamo doesn't inline DTensor __new__ or __torch_dispatch__ (#123347)"
This reverts commit 63dcb5b0f2.

Reverted https://github.com/pytorch/pytorch/pull/123347 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/123347#issuecomment-2059994989))
2024-04-16 22:08:24 +00:00
Lucas Pasqualin
440e4353c7 [DCP] Remove overlapping loader in async case (#123942)
In the async case, the state dict is already on CPU, so maintaining this buffer makes no sense. Additionally, using the overlapping CPU loader introduces new CUDA synchronize calls, leading to additional unnecessary overhead.

Differential Revision: [D56065250](https://our.internmc.facebook.com/intern/diff/D56065250/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123942
Approved by: https://github.com/fegin
ghstack dependencies: #123941
2024-04-16 21:19:31 +00:00
Shawn Xu
606c4f1367 [PT] [ST] fix test_sharded_tensor (#124103)
Summary:
https://github.com/pytorch/pytorch/pull/123230 formalizes the rank validation to support sub groups.

It broke a few UTs, some of which got fixed in https://github.com/pytorch/pytorch/pull/123778

This is to fix the remaining one reported by DanilBaibak

Test Plan: CI

Differential Revision: D56155076

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124103
Approved by: https://github.com/fegin
2024-04-16 21:18:22 +00:00
Lucas Pasqualin
46a25cc0db [DCP] Adds support for non-primitives in async_save by deep copying during cpu offloading (#123941)
Adds support for non-primitives in async_save by deep copying during CPU offloading.

If users are not type checking, the expectation in the async case is likely that the object is copied.

Differential Revision: [D56065237](https://our.internmc.facebook.com/intern/diff/D56065237/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123941
Approved by: https://github.com/fegin
2024-04-16 20:49:25 +00:00
Jesse Cai
14b2273b0c [sparse] Add fast semi-structured sparsification kernels (#122350)
This PR adds in fast semi-structured sparsification kernels to PyTorch.

These kernels enable accelerated semi-structured sparsification
in PyTorch.

The kernels have been added as aten native functions

In particular, three new functions have been added:

* `torch._sparse_semi_structured_tile`

This function will return the packed representation and metadata for
both X and X', as well as the thread masks. Note that this applies 2:4
sparsity in a 4x4 tile instead of a 1x4 strip as usual.

* `torch._sparse_semi_structured_apply`

This function takes in an input tensor and thread masks from the above
function and returns a packed representation and metadata from applying
thread masks to the input tensor.

* `torch._sparse_semi_structured_apply_dense`

This function does the same thing as above but instead of returning the
tensor in the sparse representation it returns it in the dense
representation

The subclasses have also been updated to add a new
`prune_dense_static_sort`
classmethod to create sparse tensors with this format. I've added some
additional documentation on how to calculate the compressed tensors
needed to create a SparseSemiStructuredTensor oneself.

To this end, there are two new helper functions added:
`sparse_semi_structured_tile`
`compute_compressed_swizzled_bitmask`

Differential Revision: [D56190801](https://our.internmc.facebook.com/intern/diff/D56190801)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122350
Approved by: https://github.com/cpuhrsch
2024-04-16 20:31:52 +00:00
Mikayla Gawarecki
383d2d1f6c Add testing and fix issues for weights_only load for LRScheduler (#123775)
Fixes https://github.com/pytorch/pytorch/issues/98921

There were two issues detected:
- `MultiStepLR`: issue is described in https://github.com/pytorch/pytorch/issues/98921, this is resolved by allowlisting `collections.Counter`
- `OneCycleLR`: `state_dict['anneal_func']` is either `<function OneCycleLR._annealing_cos at 0x7f364186f5b0>` or
`<function OneCycleLR._annealing_linear at 0x7f39aa483640>` depending on the `anneal_func` kwarg.
   This leads to `WeightsUnpickler error: Unsupported class __builtin__.getattr` from the `weights_only` Unpickler.

  Fixed the above in a BC-compatible manner by adding `OneCycleLR._anneal_func_type` as a string attribute and removing `OneCycleLR.anneal_func`. A small usage sketch follows.
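The now-supported flow, as a quick sketch (illustrative; the file path and hyperparameters are arbitrary):

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR

opt = SGD([torch.zeros(1, requires_grad=True)], lr=0.1)
sched = MultiStepLR(opt, milestones=[2, 4])
torch.save(sched.state_dict(), "sched.pt")

# MultiStepLR's state dict contains a collections.Counter, which is now
# allowlisted, so a weights_only load no longer errors.
state = torch.load("sched.pt", weights_only=True)
sched.load_state_dict(state)
```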

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123775
Approved by: https://github.com/albanD, https://github.com/malfet
2024-04-16 20:29:27 +00:00
Shengbao Zheng
42e22bb444 [nccl-pg] Pass pg name and desc to NCCL communicator (#124149)
Summary:
Pass the process group name and description to the NCCL communicator so that pg information can be accessed in the NCCL layer.
The information is passed as a commDesc string (i.e. "<pg_desc>:<pg_name>").
This only takes effect when NCCL_COMM_DESCRIPTION is defined.

Differential Revision: D55703310

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124149
Approved by: https://github.com/shuqiangzhang
2024-04-16 20:08:07 +00:00
Rohan
72271fb07e Add NEON ISA support on aarch64 (#123584)
Fixes #104729

This improves the compiled-mode performance of Softmax (by 20%) and of other operations (like batchnorm) that invoke the reduce_all function. It thereby also improves BERT inference by around 8%.

Tested on a graviton 3 instance (c7g.4xl). Tests were run in a single-threaded manner.

Script attached below.
Command: `OMP_NUM_THREADS=1 LRU_CACHE_CAPACITY=1024 DNNL_DEFAULT_FPMATH_MODE=BF16 python TestSoftmax.py`
[TestSoftmax.txt](https://github.com/pytorch/pytorch/files/14910754/TestSoftmax.txt)
```python
import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity

model = nn.Softmax().eval()
compiled_model = torch.compile(model)
inputs = torch.randn(1024, 1024)

with torch.set_grad_enabled(False):
    for _ in range(50):
        compiled_model(inputs) #Warmup
    print("Warmup over")
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        with record_function("model_inference"):
            for _ in range(100):
                compiled_model(inputs)

print(prof.key_averages().table(sort_by="self_cpu_time_total"))
# Check if the compiled model inference and the eager model inference are similar using torch.allclose
print(torch.allclose(compiled_model(inputs), model(inputs)))
```

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123584
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-04-16 18:49:52 +00:00
Simon Fan
67bd43b510 [compiled autograd][dynamo] use aliases for stack restore when partial graphs steal inputs (#124127)
Same idea as https://github.com/pytorch/pytorch/pull/123359, but for when we restore stack variables after calling a partial graph.

Illustrated by the test case:

before:
```python
def function(inputs):
    graph_out_0 = __compiled_fn_2(inputs)
    getitem_1 = graph_out_0[0]
    add = inputs[1]  # <-- error: `inputs` has already been cleared
    del graph_out_0
    add_1 = add + getitem_1
    add = None
    getitem_1 = None
    cpu = add_1.cpu()
    add_1 = None
    return (cpu,)
```
after:
```python
def function(inputs):
    inputs_ref_0 = inputs[1]
    graph_out_1 = __compiled_fn_2(inputs)
    getitem_1 = graph_out_1[0]
    add = inputs_ref_0
    del graph_out_1
    add_1 = add + getitem_1
    add = None
    getitem_1 = None
    cpu = add_1.cpu()
    add_1 = None
    return (cpu,)
```

Co-authored-by: Jason Ansel <jansel@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124127
Approved by: https://github.com/jansel
2024-04-16 17:01:34 +00:00
Lucas Pasqualin
d838cc8f66 [DCP] Returns a copy of sd in copy sd (#123567)
I found that returning the copy is actually useful in situations where you might do something like:

```
ret = _copy_state_dict(obj, cache)
ret.update(some_other_values)
```

and would like `cache` not to change structure as a result of `ret.update(some_other_values)`. Open to feedback here; not returning a copy might force the user to make additional copies for this case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123567
Approved by: https://github.com/wz337
2024-04-16 15:29:32 +00:00
Sijia Chen
0f6ce45bcb [Inductor] handle AMD special launch options (#124146)
Summary: `matrix_instr_nonkdim` and `waves_per_eu` are AMD-specific launch configs that can't be treated as function input args.

Test Plan:
HIP_VISIBLE_DEVICES=7 numactl --cpunodebind=1 --membind=1 buck2 run mode/{opt,amd-gpu} -c fbcode.triton_backend=amd -c fbcode.enable_gpu_sections=true -c fbcode.rocm_arch=mi300 //hammer/modules/sequential/encoders/tests:hstu_bench -- --torch-compile=True

the E2E works well on the magic model

Differential Revision: D56165438

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124146
Approved by: https://github.com/aakhundov
2024-04-16 11:07:17 +00:00
William Wen
9309580d69 [dynamo, 3.12] handle possibility of NULL local variables during graph breaks (#124095)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124095
Approved by: https://github.com/jansel
2024-04-16 08:44:43 +00:00
William Wen
2b3594f90e [dynamo] fix call_finally issue in Python 3.8 (#124122)
Fix https://github.com/pytorch/pytorch/issues/97811 again...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124122
Approved by: https://github.com/jansel
2024-04-16 08:36:20 +00:00
Nikita Shulga
298eb69c91 [EZ] Make weight_int4pack_mm compilable for half input dtype (#124136)
To enable efficient int4 quantization on ARM

Followup after https://github.com/pytorch/pytorch/pull/124022
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124136
Approved by: https://github.com/mikekgfb
2024-04-16 08:10:59 +00:00
Animesh Jain
bb0c768c5b [dynamo][refactor] Move LazyGraphModule handling (#124113)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124113
Approved by: https://github.com/jansel
ghstack dependencies: #124078
2024-04-16 06:39:45 +00:00
PyTorch MergeBot
530bf391cc Revert "[dynamo] Turn on CPP guard manager (#123547)"
This reverts commit 3e98bdd66d.

Reverted https://github.com/pytorch/pytorch/pull/123547 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/123547#issuecomment-2058337419))
2024-04-16 06:38:15 +00:00
Xuehai Pan
2e48f7b044 [pytree] add tree_iter function (#123913)
- Add a new `tree_iter` function.
- Bump `optree` version to `0.11.0` for C++ version of `tree_iter`.

This PR is split from #120300.

- #120300
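A quick usage sketch of the new `tree_iter` (assuming the Python-side helper is exposed from `torch.utils._pytree` alongside the other pytree utilities):

```python
import torch.utils._pytree as pytree

tree = {"a": [1, 2], "b": (3, {"c": 4})}

# tree_iter lazily yields the leaves without materializing an intermediate list.
print(list(pytree.tree_iter(tree)))  # [1, 2, 3, 4]
```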

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123913
Approved by: https://github.com/zou3519
2024-04-16 06:02:08 +00:00
Xuehai Pan
0eab740db3 [Docs][Distributed] Add migration notes for --local-rank option style change for torchrun in PyTorch 2.0 (#109480)
Fixes https://github.com/pytorch/pytorch/pull/94505#issuecomment-1722777767

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109480
Approved by: https://github.com/ezyang
2024-04-16 05:51:57 +00:00
Arun Pa
7530c5a85d [DOC] Fix example and typo (#123959)
Fixes #123554 and fixes #123053

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123959
Approved by: https://github.com/mikaylagawarecki
2024-04-16 05:38:24 +00:00
Shunting Zhang
df5829d0ba [inductor] let rand_strided support fp8 (#124120)
I'm working on https://fb.workplace.com/groups/1075192433118967/posts/1411161629522044/ (a Meta-internal link about an inefficient inner/persistent reduction kernel generated by inductor). I found that the generated benchmark code for a kernel ( https://gist.github.com/shunting314/13a0105f72a1c54d9c220370c7fd3845 ) cannot be run because rand_strided fails to generate tensors for fp8. The errors look like

```
RuntimeError: "normal_kernel_cpu" not implemented for 'Float8_e4m3fn'
```
for CPU
or
```
RuntimeError: "normal_kernel_cuda" not implemented for 'Float8_e4m3fn'
```
for GPU

This PR works around that problem.
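A minimal sketch of this kind of workaround, assuming one samples in a supported dtype and casts afterwards (`rand_strided_fp8` is an illustrative helper, not the actual change):

```python
import torch

def rand_strided_fp8(size, stride, dtype=torch.float8_e4m3fn, device="cpu"):
    # normal_ has no fp8 kernel, so sample in float32 and cast to fp8 afterwards.
    needed = (sum((s - 1) * st for s, st in zip(size, stride)) + 1) if size else 1
    buffer = torch.randn(needed, dtype=torch.float32, device=device).to(dtype)
    return torch.as_strided(buffer, size, stride)

x = rand_strided_fp8((16, 16), (16, 1))
print(x.dtype)  # torch.float8_e4m3fn
```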

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124120
Approved by: https://github.com/Chillee, https://github.com/jansel
2024-04-16 04:15:56 +00:00
FFFrog
791e5db705 Part 3: UFMT fix the rest files in torch/optim due to the pr-sanity-checks (#124055)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124055
Approved by: https://github.com/ezyang
ghstack dependencies: #124048, #124053, #124054
2024-04-16 03:22:39 +00:00
FFFrog
ac74a6783b Part 2: UFMT fix 2 files in torch/optim due to the pr-sanity-checks (#124054)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124054
Approved by: https://github.com/ezyang
ghstack dependencies: #124048, #124053
2024-04-16 03:20:21 +00:00
FFFrog
560efaa471 Part 1: UFMT partial files in torch/optim due to the pr-sanity-checks (#124053)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124053
Approved by: https://github.com/ezyang
ghstack dependencies: #124048
2024-04-16 03:17:18 +00:00
FFFrog
f30704f5f3 add preparatory work for torch/optim/lr_scheduler.py (#124048)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124048
Approved by: https://github.com/albanD
2024-04-16 03:17:18 +00:00
Sam Larsen
6babf00014 [inductor] Bypass FX graph cache when we have HigherOrderOperators (#123325)
Summary: The initial motivation was to avoid caching when we have triton higher-order ops, but it's probably safer to bypass the cache for all higher-order ops and allow/implement caching for them if/when we find it necessary.

Test Plan: Unit test cribbed from: https://docs-preview.pytorch.org/pytorch/tutorials/2783/recipes/torch_compile_user_defined_triton_kernel_tutorial.html?highlight=triton

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123325
Approved by: https://github.com/eellison
2024-04-16 02:51:49 +00:00
Fuzzkatt
1cf62e86a4 skip various unit tests for Jetson (#122531)
Skip multiprocessing, CUDA expandable segments, memory-efficient attention, and flash attention tests on Jetson due to hanging / SIGKILL issues found in NVIDIA internal testing.
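A hedged sketch of the skip pattern (the `IS_JETSON` flag comes from `torch.testing._internal.common_utils`; the test name and body here are made up):

```python
import unittest
import torch
from torch.testing._internal.common_utils import IS_JETSON

class ExampleAttentionTests(unittest.TestCase):
    @unittest.skipIf(IS_JETSON, "hangs / gets SIGKILLed on Jetson")
    def test_flash_attention_smoke(self):
        q = k = v = torch.randn(1, 1, 8, 8)
        torch.nn.functional.scaled_dot_product_attention(q, k, v)
```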

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122531
Approved by: https://github.com/eqy, https://github.com/malfet
2024-04-16 01:26:26 +00:00
Kai Londenberg
aaad0554b4 [Inductor] Fix endless recursion in codecache.DLLWrapper.__getattr__ (#123931)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123931
Approved by: https://github.com/peterbell10
2024-04-16 00:52:21 +00:00
cyy
c2596fd3e0 [Distributed] [4/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124032)
This PR continues to fix some clang-tidy warnings in distributed/c10d code, following https://github.com/pytorch/pytorch/pull/123312.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124032
Approved by: https://github.com/Skylion007
2024-04-16 00:42:18 +00:00
Shivam Raikundalia
9079c76689 Fix Asynchronous PyTorch Profiler Trace (#124080)
Summary: With the merge of D55925068, we introduced an overflow issue when recording a trace using dyno gputrace. It is possible for Torch ops to be enumerated but not have an end time because they were still running when the recording ended; by default such events have an end time set to INT_MIN. When computing duration() for these events as end - start, we get an overflow that results in a very long duration. This was previously masked because INT_MIN was divided by 1000 when converting between microsecond and nanosecond units. This change introduces a patch for Torch ops, and a future PR will add a more universal guard in Kineto.

Test Plan:
Trace recorded using resnet test.

Trace:
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/0/1713199267/localhost/libkineto_activities_2247224.json.gz&bucket=gpu_traces

Differential Revision: D56144914

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124080
Approved by: https://github.com/aaronenyeshi
2024-04-16 00:24:32 +00:00
Will Constable
1885c3972d [C10D] Add dist.get_node_local_rank helper (#123992)
Fixes #122816

Summarizing the pros/cons of the request and motivation from #122816

- (+) it's really common for users to do 'os.getenv["LOCAL_RANK"]' so we
  should provide a helper
- (-) we can't really control if/how local rank information is made
  available, but it is handled automatically if torchrun is used.

We can assume local rank is correctly passed if it is passed.
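A quick sketch of the intended usage under torchrun, which sets LOCAL_RANK automatically (the helper may also accept a fallback argument; only the bare call is shown):

```python
import os
import torch.distributed as dist

# The new helper replaces the common hand-rolled pattern below.
local_rank = dist.get_node_local_rank()
assert local_rank == int(os.environ["LOCAL_RANK"])
```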

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123992
Approved by: https://github.com/shuqiangzhang, https://github.com/zdevito, https://github.com/XilunWu
2024-04-16 00:09:46 +00:00
rzou
2b54b00e30 Update some more APIs to have positional-only args (#124063)
Not BC-breaking since we haven't released these yet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124063
Approved by: https://github.com/albanD
ghstack dependencies: #123615, #124062
2024-04-15 23:32:47 +00:00
rzou
3c25b18d76 Excise old custom ops prototype from custom_op_db (#124062)
Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124062
Approved by: https://github.com/albanD
ghstack dependencies: #123615
2024-04-15 23:32:47 +00:00
rzou
a03711d24d [custom_ops] Support TensorList inputs/outputs (#123615)
We add a `supports_tensorlist` decorator that gives an autograd.Function
the ability to handle TensorLists.

Test Plan:
- custom_op_db tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123615
Approved by: https://github.com/albanD
2024-04-15 23:32:43 +00:00
Markus Hennerbichler
5a15cbfa44 Fix typo in TorchScript annotate docstring (#123719)
The docstring for torch.jit.Attribute already says to use Attribute in an __init__ method of a Module; however, this guidance was wrong in the `annotate` docstring.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123719
Approved by: https://github.com/mikaylagawarecki
2024-04-15 22:52:20 +00:00
-
70ad64e8a6 update docs for separate context and forward functions (#121955)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121955
Approved by: https://github.com/soulitzer
2024-04-15 22:31:12 +00:00
Shengbao Zheng
9fa922c2ed [profiler] Log process group name instead of pg uid (#124035)
Summary:
As part of the work to unify process group identifiers, log <group_name, group_desc> instead of the pg uid in the profiler.
- group_name remains the unique identifier, e.g. "0", "1"
- group_desc is the user-specified name, e.g. "fsdp".

Reviewed By: aaronenyeshi, kwen2501

Differential Revision: D55610682

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124035
Approved by: https://github.com/aaronenyeshi
2024-04-15 21:49:06 +00:00
PHLens
9aba918bd8 Support Accelerator OOM Error (#121200) (#121702)
Fixes #121200
This PR introduces AcceleratorOutOfMemoryError for all privateuse1 backends. For Python, there is a PyError object that is set only when privateuse1 is registered. All privateuse1 backends can then use this error for memory errors. More error types may be added in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121702
Approved by: https://github.com/guangyey, https://github.com/albanD
2024-04-15 21:41:46 +00:00
Andrew Gu
495a4d4a42 [FSDP2] Added mesh arg to fsdp_pre_all_gather (#123953)
This PR adds a `mesh: DeviceMesh` argument to `fsdp_pre_all_gather()` so that the extension can know over which mesh the all-gather is happening. This can be useful in recovering the post-all-gather tensor size in the `fsdp_post_all_gather()` (e.g. for `NF4Tensor`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123953
Approved by: https://github.com/Skylion007, https://github.com/wanchaol
ghstack dependencies: #119302, #122908
2024-04-15 21:35:51 +00:00
Andrew Gu
d1a0821e7e [FSDP2] Added pre/post-all-gather extensions (subclass) (#122908)
**Overview**
This PR adds pre/post-all-gather extensions to FSDP2.
- The pre/post-all-gather extensions are specified at the tensor-level on the `sharded_param._local_tensor` (i.e. the tensor wrapped by the sharded `DTensor`). If the user has a tensor-subclass parameter on the module passed to FSDP that preserves the subclass through the sharding ops (e.g. `new_zeros`, `chunk`, etc.), then the `sharded_param._local_tensor` will naturally be of that subclass.
- The pre-all-gather function has signature:
  ```
  def fsdp_pre_all_gather(self) -> Tuple[Tuple[torch.Tensor, ...], Any]
  ```
    - The first return value is a `Tuple[torch.Tensor, ...]` of the all-gather inputs. It is a tuple since a subclass could contribute >1 inner tensors.
    - The second return value is any optional metadata needed to pass through to the post-all-gather.
- The post all-gather function has signature:
  ```
  def fsdp_post_all_gather(
      self,
      all_gather_outputs: Tuple[torch.Tensor, ...],
      metadata: Any,
      param_dtype: torch.dtype,
      *,
      out: Optional[torch.Tensor] = None,
  ) -> Union[Tuple[torch.Tensor, Tuple[torch.Tensor, ...]], None]:
  ```
    - The `all_gather_outputs` are exactly the all-gathered versions of the `fsdp_pre_all_gather` 1st return value (representing the all-gather inputs). We make sure to unflatten these back to ND for the user.
    - The `metadata` is the `fsdp_pre_all_gather` 2nd return value, untouched.
    - The `param_dtype` is the parameter dtype based on the passed-in `MixedPrecisionPolicy`. Namely, if no policy is passed in, then `param_dtype` is the original dtype, and otherwise, it is the `MixedPrecisionPolicy.param_dtype`.
    - If `out` is not specified, then the return value has type `Tuple[torch.Tensor, Tuple[torch.Tensor, ...]]`. The first tuple item is the unsharded parameter (e.g. re-wrapping into some subclass). The second tuple item is a tuple of unsharded inner tensors that FSDP should free during reshard. These should be derived from the all-gather outputs.
    - The `out` argument is required due to FSDP's `resize_` usage. We require an in-place variant for the backward all-gather. Here, `out` will be exactly the object returned as the first tuple item in the out-of-place variant mentioned before. The unsharded inner tensors will be allocated before calling `fsdp_post_all_gather`. When `out` is specified, the `fsdp_post_all_gather` should return `None`. If the post-all-gather does not do any out-of-place ops, then the `out` variant can just be a no-op since the unsharded inner tensors will be the same as the all-gather outputs, which FSDP directly writes to after all-gather. (E.g., this is the case for both float8 and `NF4Tensor`.)
- We check for `fsdp_pre_all_gather` and `fsdp_post_all_gather` directly via `hasattr` to accommodate monkey patching so that we do not strictly require the user to use a tensor subclass. The monkey patch must happen after the local tensors have been finalized (after applying FSDP and after any meta-device init).
- For now, we require that all gradients in one FSDP parameter group share the same dtype. This is fine for float8 and `NF4Tensor` use cases. If this requirement is too strict, then in the future we can issue 1 reduce-scatter per dtype per group.

**Design Notes**
- We assume that the `sharded_param._local_tensor` is padded on dim-0.
    - This assumption should not block immediate use cases, and when we pad the `DTensor._local_tensor` by default, this assumption will always be true.
    - This assumption allows us to call `sharded_param._local_tensor.fsdp_pre_all_gather()`; i.e. it tells us from which tensor object to invoke `fsdp_pre_all_gather()`.
    - Suppose we want to compose with CPU offloading. Then, CPU offloading's H2D copy should run first, i.e. `sharded_param._local_tensor.to("cuda").fsdp_pre_all_gather()`, where `_local_tensor.to("cuda")` should return an instance of the subclass so that it still defines `fsdp_pre_all_gather()`. Note that in this case, the subclass instance on GPU is a temporary, which means caching values on it would not be possible. One possibility would be to have `.to("cuda")` move any cached values too.
- `fsdp_post_all_gather` can either return an unsharded parameter that aliases with the all-gather output or does not alias, but there is no way to know a priori.
    - If the unsharded parameter aliases with the all-gather output, then we should _not_ free the all-gather output in `unshard`.
    - If the unsharded parameter does not alias with the all-gather output, then we prefer to free the all-gather output in `unshard` to avoid holding the unneeded temporary.
    - One approach is for eager-mode to check for this alias (by comparing data pointers). However, this might be adversarial to full-graph compilation. The compromise for simplicity can be to always free the all-gather output in `reshard`.
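A rough sketch of what a subclass implementing these hooks might look like, matching the signatures quoted above. The subclass, its `_data`/`_scale` fields, and the dequantization math are illustrative placeholders, not a real extension:

```python
from typing import Any, Optional, Tuple

import torch

class ScaledTensor(torch.Tensor):
    # Assume instances carry `_data` (a quantized payload) and `_scale`.

    def fsdp_pre_all_gather(self) -> Tuple[Tuple[torch.Tensor, ...], Any]:
        # Contribute one all-gather input; pass the scale through as metadata.
        return (self._data,), self._scale

    def fsdp_post_all_gather(
        self,
        all_gather_outputs: Tuple[torch.Tensor, ...],
        metadata: Any,
        param_dtype: torch.dtype,
        *,
        out: Optional[torch.Tensor] = None,
    ):
        (data,) = all_gather_outputs
        scale = metadata
        if out is not None:
            # In-place variant for the backward all-gather: nothing else to do,
            # since the all-gather already wrote into the unsharded inner tensor.
            return None
        # Out-of-place variant: return the unsharded parameter plus the inner
        # tensors FSDP may free on reshard.
        unsharded = data.to(param_dtype) * scale
        return unsharded, (data,)
```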

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122908
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #119302
2024-04-15 21:35:51 +00:00
Andrew Gu
ea52918e81 [FSDP2] Generalized all-gather outputs to >1 per parameter (#119302)
This PR is part of the FSDP extensions work. For subclasses such as for QLoRA's `NF4Tensor` (using block-wise quantization) that have multiple inner tensors per parameter, we must generalize to allow each parameter to contribute >1 all-gather inputs and hence have >1 all-gather outputs.

This PR does this generalization by converting `FSDPParam.all_gather_input: torch.Tensor` to `FSDPParam.all_gather_inputs: List[torch.Tensor]`. Unfortunately, since we need to preserve the mapping from all-gather inputs/outputs to their source parameter, we have to introduce `List[List]` instead of simply `List` in several places. Furthermore, we still require the flattened 1D `List` for `torch.split` calls, introducing some redundancy between data structures. Nonetheless, I do not see a way to avoid this if we want the generalization.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119302
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
2024-04-15 21:35:46 +00:00
Animesh Jain
601112fdb4 [dynamo][log] Print missing skipped frame info on debug (#124078)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124078
Approved by: https://github.com/yanboliang
2024-04-15 20:33:17 +00:00
Sam Larsen
e5b404b809 [inductor] Fix fresh_inductor_cache() (#122661)
Summary: Modify fresh_inductor_cache() to clear cached state before mocking the toplevel cache_dir directory. Any lru_cache (or other in-memory cache) can use the @clear_on_fresh_inductor_cache decorator to register itself for clearing. Also change the base inductor TestCase class to use fresh_inductor_cache(). Previously that TestCase only mocked the subdirectory within the toplevel cache dir designated for the FX graph cache artifacts.
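A quick usage sketch of the context manager (a minimal example; the compiled function is arbitrary):

```python
import torch
from torch._inductor.utils import fresh_inductor_cache

@torch.compile
def f(x):
    return x.sin() + 1

# fresh_inductor_cache() mocks a temporary toplevel cache directory; with this
# change it also clears any caches registered via @clear_on_fresh_inductor_cache.
with fresh_inductor_cache():
    f(torch.randn(8))
```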

Test Plan:
- New unit test
- All existing inductor tests will exercise fresh_inductor_cache()

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122661
Approved by: https://github.com/oulgen
2024-04-15 20:28:54 +00:00
Bert Maher
99059affb9 Use packed metadata from triton to reduce launch latency (#123842)
https://github.com/openai/triton/pull/3633 converts some kernel launch metadata from a namedtuple to a regular tuple, which is faster to parse.  Using it here shaves off a microsecond or so from the apparently extremely-sensitive launch path.

Fixes #123597

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123842
Approved by: https://github.com/jansel, https://github.com/shunting314
ghstack dependencies: #123841
2024-04-15 19:43:06 +00:00
Bert Maher
6c9f5064ea Avoid retrieving launch metadata if launch_enter_hook is not installed (#123841)
Fixes #123597

There's a sizable comment in the PR about why this is needed, but essentially the launch path is extremely perf-sensitive (running `launch` takes ~30 microseconds, and according to the linked issue, regressing it to 33us is worth 6% overall on torchbench). The `bin.launch_metadata` call doesn't look very expensive, but microseconds matter, and it is only useful when a launch hook is installed (which seems pretty rare). This change is worth about 2us and, combined with the other diff in the stack, appears to completely eliminate the torchbench regression.

Differential Revision: [D56046347](https://our.internmc.facebook.com/intern/diff/D56046347)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123841
Approved by: https://github.com/jansel, https://github.com/shunting314
2024-04-15 19:43:06 +00:00
Pian Pawakapan
90d1720861 [export] Restore original placeholder names (part 3: constant input de/serialization) (#123590)
Summary:
Note: the original diff D55225818 is broken into 3 parts (top-level renaming, higher-order-op subgraphs, constant input de/serialization) because of its size.

Stacked PR to restore original names to placeholder nodes, replacing the default names arg0_1, arg1_1, ...

This PR supports constant-argument placeholder names (e.g. forward(self, x, y=1)) and their de/serialization by adding a name field for ConstantArgument entries in the graph signature, and a ConstantInputSpec in the input specs for serialization.
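A small sketch of the kind of module this affects (illustrative; the exact spec printout depends on the export version):

```python
import torch

class M(torch.nn.Module):
    def forward(self, x, y=1):
        return x + y

ep = torch.export.export(M(), (torch.randn(2),), {"y": 1})

# With this change, the constant-argument spec for `y` carries its original
# name through de/serialization instead of a generated placeholder name.
print(ep.graph_signature.input_specs)
```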

Test Plan: verification checks on placeholder names for all export() calls, unit test in test/export/test_export.py

Differential Revision: D55506949

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123590
Approved by: https://github.com/angelayi, https://github.com/zhxchen17
2024-04-15 19:09:41 +00:00