Commit Graph

2496 Commits

Author SHA1 Message Date
Ke Wen
5cd7c75bd9 [pipelining] Add tracing frontend (#125448)
This PR allows users to transform a model into a pipeline representation with split stages, according to a split spec.
```
def pipeline(
    module: torch.nn.Module,
    num_chunks: int,
    example_args: Tuple[Any, ...],
    example_kwargs: Optional[Dict[str, Any]] = None,
    split_spec: Optional[Dict[str, SplitPoint]] = None,
    split_policy: Optional[Callable[[fx.GraphModule], fx.GraphModule]] = None,
) -> Pipe:
```
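
For context, a hedged usage sketch based only on the signature above; the import path and the `SplitPoint.BEGINNING` split marker are assumptions, and the exact API may have evolved since this commit.
```
import torch
from torch import nn
# Assumed import path for the tracing frontend and split markers.
from torch.distributed.pipelining import SplitPoint, pipeline

model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
example_input = torch.randn(8, 16)

# Split the model into two stages, cutting before submodule "2" (the second Linear).
pipe = pipeline(
    module=model,
    num_chunks=4,                            # number of microbatches
    example_args=(example_input,),           # used to trace the module
    split_spec={"2": SplitPoint.BEGINNING},
)
```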

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125448
Approved by: https://github.com/H-Huang
ghstack dependencies: #125273
2024-05-04 09:00:25 +00:00
Muralidhar Andoorveedu
b96b1e8cff [Distributed] Add P2P versions of *object_list operations (#124379)
This PR adds `send_object_list` and `recv_object_list` to `distributed_c10d.py`. This extends functionality already present in PyTorch with `broadcast_object_list`; I noticed the point-to-point variants were missing and decided to upstream them.

With this change, sending and receiving arbitrary picklable Python objects is possible.
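
A hedged sketch of how the new collectives could be used (assumes a process group has already been initialized; the object contents are illustrative):
```
import torch.distributed as dist

# Assumes dist.init_process_group(...) has already been called on every rank.
if dist.get_rank() == 0:
    objs = [{"step": 10, "loss": 0.25}, "done"]
    dist.send_object_list(objs, dst=1)      # pickles and sends the objects
elif dist.get_rank() == 1:
    objs = [None, None]                     # pre-sized buffer, filled in place
    dist.recv_object_list(objs, src=0)
```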

Relevant issue: https://github.com/pytorch/pytorch/issues/3473

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124379
Approved by: https://github.com/kwen2501, https://github.com/wconstab
2024-05-03 23:22:58 +00:00
Alexandre Ghelfi, PhD
d18a6f46d0 Adding Compare in torch.utils.benchmark documentation (#125009)
`torch.utils.benchmark.Compare` is not directly exposed in torch.utils.benchmark documentation.

I think this is a valuable resource to add, since it can help people embrace the torch benchmark way of doing things and help people build documentation around it.
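
For reference, a short example of the kind of usage the documentation would cover (timer labels and sizes are illustrative):
```
import torch
import torch.utils.benchmark as benchmark

results = []
for n in (64, 256):
    x = torch.randn(n, n)
    timer = benchmark.Timer(
        stmt="x @ x",
        globals={"x": x},
        label="matmul",
        sub_label=f"{n}x{n}",
        description="float32",
    )
    results.append(timer.blocked_autorange(min_run_time=0.2))

# Compare collects the measurements into a single formatted table.
benchmark.Compare(results).print()
```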

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125009
Approved by: https://github.com/mikaylagawarecki
2024-05-03 00:50:54 +00:00
Ke Wen
0199ce8d6c [pipelining] Add microbatch split and merge utils (#125273)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125273
Approved by: https://github.com/H-Huang
ghstack dependencies: #124776, #124875, #124958
2024-05-02 21:09:47 +00:00
Lucas Pasqualin
799f1460af [DCP] Provides default AsyncStager (#124939)
Differential Revision: [D56575987](https://our.internmc.facebook.com/intern/diff/D56575987/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124939
Approved by: https://github.com/fegin
ghstack dependencies: #122965
2024-05-02 19:48:54 +00:00
Lucas Pasqualin
3741fb3680 [DCP] Introduce async staging extension points (#122965)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

* #124944
* #124939
* __->__ #122965

Differential Revision: [D55493240](https://our.internmc.facebook.com/intern/diff/D55493240/)

*This PR is now ready for merge and is not an RFC*

Major choices are:
- introduce the AsyncStager protocol
- remove `executor` from the parameters
- leave async staging as a separate method (for now)

This proposal seeks to add extension points to dcp.async_save, allowing users to:
- Specify a specific staging method when calling async_save
- Make the staging method itself async, for cases where we may want to overlap with the training loop (e.g., overlap d2h with the training loop and only synchronize at optim.step)
- Potentially specify the execution method for doing async_save in parallel. For example, some users may prefer a subprocess over a thread to avoid GIL issues.

A totally reasonable alternative to this entire proposal is to expect users who want this level of customization
to write their own custom async save methods. Here's an example which addresses the issues mentioned
in PR comments.
```
def custom_async_save(...):
    # This step accomplishes staging and includes the usual 'planning' calls (issue 1).
    buffered_writer = CpuBufferedWriter()  # stateful; contains a copy of state_dict
    dcp.save(state_dict, storage_writer=buffered_writer)

    final_storage_writer = FileSystemWriter()
    mp.spawn(                         # issue 2 is gone, do whatever you want here
        dcp.save,                     # or some custom subprocess method that calls dcp.save under the hood
        buffered_writer.state_dict,   # lots of ways to do this, not really the most important part
        checkpoint_id=checkpoint_id,
        storage_writer=final_storage_writer,
        planner=planner,
        process_group=process_group,  # this actually wouldn't work, but again not the point
    )
    # Leaving out the rest of the details for managing your extra special subprocess.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122965
Approved by: https://github.com/daulet-askarov
2024-05-02 19:01:55 +00:00
Ke Wen
52142192d4 [pipelining] Add stage backward function (#124958)
This is a helper function which:
1. computes the gradients for the stage inputs, and
2. accumulates gradients for the stage module's parameters.

A unit test for this function is also added.
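
A minimal sketch of the idea (not the actual helper added here, just the autograd pattern it builds on): backpropagating the gradients received for the stage outputs accumulates the stage module's parameter gradients as a side effect, and the input gradients are what gets sent to the previous stage.
```
import torch

def stage_backward_sketch(stage_outputs, output_grads, stage_inputs):
    # 1. Backprop the incoming output gradients through this stage's subgraph;
    #    parameter gradients are accumulated into .grad as a side effect.
    torch.autograd.backward(stage_outputs, grad_tensors=output_grads)
    # 2. Return the gradients w.r.t. the stage inputs so they can be sent to
    #    the previous stage (inputs must be leaf tensors with requires_grad).
    return tuple(inp.grad for inp in stage_inputs)
```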

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124958
Approved by: https://github.com/wconstab
ghstack dependencies: #124776, #124875
2024-05-01 07:56:58 +00:00
Mikayla Gawarecki
2480e8b8a1 Add MAP_SHARED option for torch.load(mmap=True) (#124889)
Fixes #124528

Going over the options for our MapAllocator and what they do, I don't think any of the others need to be piped up to `torch.load`.

4f29103749/aten/src/ATen/MapAllocator.h (L8-L16)

~However, I wonder if this `MmapVisibility(Enum)` is a good way to represent "or-ing" together of `mmap` flags if we want to extend it in the future. I looked over the flags for [`mmap(2)`](https://man7.org/linux/man-pages/man2/mmap.2.html), and could not immediately see how most of them would be useful for `torch.load` (would maybe `MAP_LOCKED` (like `mlock`) or `MAP_HUGE` ever be worthwhile?)~

We use the flags provided by the Python `mmap` library so that we can extend the allowed flags and pipe them down to the C++ `mmap` call if there is a need for other flags in the future.
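
A hedged example, assuming the knob landed as `torch.serialization.set_default_mmap_options` (flag values come straight from the standard-library `mmap` module, so this is POSIX-only):
```
import mmap
import torch

torch.save(torch.arange(4.0), "shared.pt")

# Default is MAP_PRIVATE; opt into a shared mapping for subsequent torch.load calls.
torch.serialization.set_default_mmap_options(mmap.MAP_SHARED)
t = torch.load("shared.pt", mmap=True)  # storage is now backed by a shared mapping
```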

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124889
Approved by: https://github.com/albanD
2024-04-30 15:02:19 +00:00
Avik Chaudhuri
e7846447e0 dynamic shapes builder API (#124898)
This PR introduces a new way of building `dynamic_shapes` for export. The idea is to build up a mapping from input tensors to the dynamic shapes that should be assigned to their corresponding fake tensors.

This mapping is automatically converted to the current form of `dynamic_shapes`, which must exactly match the structure of inputs. We do this by using pytree utils.

With the current `dynamic_shapes`, we had to be careful about user-defined classes that are registered with pytree, since such classes are not necessarily polymorphic containers; they may be fine containing tensors, but not dynamic shapes. Thus we had decided to allow input instances of such classes to be associated with dynamic shapes in flattened form. This decision needs to be mirrored in this PR as well. To make it easier to keep these code paths in sync, we refactor the current recursive procedure for associating inputs with dynamic shapes to use the same pytree utils. This needs minor fixes to a few tests where `dynamic_shapes` did not exactly match the structure of inputs.
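
A hedged sketch of the builder-style usage described above; the `ShapesCollection` name and indexing style are assumptions based on what this builder later surfaced as publicly, not necessarily this PR's exact spelling:
```
import torch
from torch.export import Dim, ShapesCollection, export

class M(torch.nn.Module):
    def forward(self, x, y):
        return x + y

x, y = torch.randn(4, 8), torch.randn(4, 8)
batch = Dim("batch")

# Map input tensors directly to their dynamic shapes; the builder converts this
# into the pytree-structured dynamic_shapes form that export expects.
shapes = ShapesCollection()
shapes[x] = (batch, 8)
shapes[y] = (batch, 8)

ep = export(M(), (x, y), dynamic_shapes=shapes)
```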

Differential Revision: D56551992

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124898
Approved by: https://github.com/zhxchen17
2024-04-30 03:59:49 +00:00
Tristan Rice
dc4c75ba72 elastic/rendezvous: make barrier and rank assignment operations O(n) instead of O(n^2) (#124982)
Summary:
This makes barrier and rank operations linear instead of quadratic with the number of workers. This drastically improves performance for rendezvous when running with over 1000 hosts.

This uses 2 approaches for different areas:

* local rank assignment: each worker does 1 set and 1 get; local ranks are assigned on the rank 0 host in an O(n) operation, which keeps total store operations linear in the number of workers.
* exit_barrier: use a counter and a final flag so each worker has to do at most 1 set, 1 get, and 1 add (a sketch of this pattern follows below).
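
A hedged sketch of the counter-plus-flag barrier pattern, written against the generic c10d store API (the real torchelastic implementation differs in details):
```
# Usage (illustrative): store = torch.distributed.TCPStore(host, port, world_size, rank == 0)
def store_barrier(store, world_size, key_prefix="exit_barrier"):
    # Each worker performs exactly one add; the last arrival flips the release flag.
    arrived = store.add(f"{key_prefix}/count", 1)
    if arrived == world_size:
        store.set(f"{key_prefix}/done", "1")
    # Each worker then does a single wait on the flag: O(n) store operations in total.
    store.wait([f"{key_prefix}/done"])
```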

At 4000 hosts, we see torchelastic able to run in as little as 10 seconds, down from 373 seconds.

Test Plan:
This is testing using many small tests running on a remote cluster.

{D56549942}

```
torchx run --scheduler mast -- --image=torchelastic_benchmark --j=4000x1
```

Differential Revision: D56605193

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124982
Approved by: https://github.com/kiukchung, https://github.com/kurman
2024-04-27 02:21:44 +00:00
egienvalue
73744a2c00 torch.mtia module for MTIA device backend (#123612)
The MTIA device now has its own module in PyTorch.
torch.mtia has the following APIs, similar to other backends. lazy_init is also supported.
```
__all__ = [
    "init",
    "is_available",
    "synchronize",
    "device_count",
    "current_device",
    "current_stream",
    "default_stream",
    "set_stream",
    "stream",
    "device",
]

```
------------
For device management, we expand AcceleratorHooksInterface to support generic device management; it can be used in both C++ and Python.
```
def _accelerator_hooks_device_count() -> _int: ...
def _accelerator_hooks_set_current_device(device_index: _int) -> None: ...
def _accelerator_hooks_get_current_device() -> _int : ...
def _accelerator_hooks_exchange_device(device_index: _int) -> _int : ...
def _accelerator_hooks_maybe_exchange_device(device_index: _int) -> _int : ...
```

---------
Add a get_device_module API to retrieve device modules for different device types.
```
def get_device_module(device: Optional[Union[torch.device, str]] = None)
```
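
A hedged usage sketch, assuming the accessor is exposed as `torch.get_device_module` with the signature above:
```
import torch

# Returns the per-backend module (torch.cuda, torch.mtia, ...) for the given device type.
cuda_mod = torch.get_device_module("cuda")
print(cuda_mod.is_available(), cuda_mod.device_count())
```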
---------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123612
Approved by: https://github.com/albanD
ghstack dependencies: #123611
2024-04-26 16:17:54 +00:00
Yu, Guangye
19a83eacb5 add new API torch.amp.is_autocast_available (#124938)
# Motivation
Expose `torch._is_autocast_available` as `torch.amp.is_autocast_available`, making it a public API.
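
A small usage example of the new public API:
```
import torch

# Check whether autocast is implemented for a given device type string.
print(torch.amp.is_autocast_available("cpu"))
print(torch.amp.is_autocast_available("cuda"))
```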

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124938
Approved by: https://github.com/albanD
2024-04-26 08:45:20 +00:00
PyTorch MergeBot
e04c7b19f4 Revert "torch.mtia module for MTIA device backend (#123612)"
This reverts commit 381653de63.

Reverted https://github.com/pytorch/pytorch/pull/123612 on behalf of https://github.com/jeffdaily due to this PR broke ROCm with message RuntimeError: Cannot have MTIA with other devices ([comment](https://github.com/pytorch/pytorch/pull/123612#issuecomment-2077649762))
2024-04-25 16:06:46 +00:00
Edward Z. Yang
b4597fffce Try to reuse old symbol name rather than new symbol name when renaming (#124782)
Previously, unbacked SymInts would gradually get larger and larger as we kept rebinding them.  Now, we do the replacement to preserve the old symbol.

Actually doing this is a bit tricky. Here's the order in which things happen when retracing data-dependent code:

1. Run fake tensor prop: allocate new unbacked SymInt
2. Run proxy tensor mode, calculate bindings and associate them with FX node
3. Run PropagateUnbackedSymInts, rename unbacked bindings to their old ones so they are consistent

So the problem is that when we calculate bindings in step (2), we don't know
what the original names are yet; we only find out later at (3).  But by
the time (3) runs, we've already stuffed some new bindings into
meta["unbacked_bindings"] and we don't know how to update them!  To fix
this, I introduce resolve_unbacked_bindings, which post facto applies any
of the renamings we discovered in (3).

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124782
Approved by: https://github.com/lezcano
ghstack dependencies: #124310, #124314, #124316, #124394, #124739
2024-04-25 14:02:42 +00:00
Edward Z. Yang
13ab24f192 Reimplement unbacked symbol bindings in Inductor (#124394)
This PR has a lot of "draw the rest of the fucking owl" energy. Here's how to break it down.

1. **torch/_inductor/graph.py** - We start by tightening unbacked symbol invariants. Specifically, as we lower FX nodes, we check whether or not every unbacked_binding recorded on the FX node meta, actually ends up getting bound (according to get_unbacked_symbol_defs) in all the buffers generated by the lowering. Hopefully this invariant is self evident. This leads to a lot of failures.
2. **torch/_inductor/ir.py** - Problem 1: There is softness in how Inductor computes defs of unbacked symbols in IR node. Previously, we tried to infer it by looking at the output sizes/strides/etc and see if new unbacked symbols popped up that we hadn't seen in the inputs. I don't know exactly what was buggy about the old code, but sometimes we would fail to notice an unbacked symbol had been bound, or rebind an unbacked symbol multiple times. Fortunately, thanks to the earlier PRs in our stack, we now have a nice list of unbacked symbol bindings from FX, so we now just store it directly on ExternKernel and use it directly to report defs. This has to be done twice: once for FallbackKernel (e.g., nonzero) and once for DynamicScalar (e.g., item) (see also **torch/_inductor/lowering.py**, **torch/_inductor/codegen/wrapper.py** and  **torch/_inductor/codegen/cpp_wrapper_cpu.py** for the lowering and codegen changes for item)
   * **process_kernel** - Sidequest! It turns out that Inductor lowering can reallocate unbacked symbols. This happens specifically when we repropagate fake tensors through the operator in `process_kernel`. This repropagation process is necessary because Inductor may have changed the strides of input tensors, and it must now recompute the strides so that it can continue to appropriately plan the rest of the lowering process. This is fine: we just make sure we do the rebind unbacked + compute_unbacked_bindings dance we've been doing previously in the PR stack. But instead of putting unbacked_bindings on a new FX node, they go straight into our unbacked_bindings on the Inductor IR node.
    * **codegen_unbacked_symbol_defs** - Sidequest! FallbackKernel lowering is done in two steps. First, you emit the FallbackKernel buffer. Then, you emit MultiOutput buffers which actually give access to the individual outputs of FallbackKernel, which may have been multi-output. There is a design decision here: does the FallbackKernel bind the unbacked symbols, or the MultiOutput buffer? Historically, we put the binding on MultiOutput buffer, because it's more convenient: the FallbackKernel buffer is fake, in fact, it doesn't even get a name in C++ codegen. But it's kind of inconsistent with the keypath model that we've been tracking unbacked bindings with: if you have a multi-output node, you'd expect a keypath like `[0].size()[0]` representing the first output's first dimension size. That suggests that it's the FallbackKernel that should define the things. So that was my first implementation. Unfortunately, the C++ codegen is too cursed and I could not understand how to make it work in that case. So now we just unsoundly assume you cannot have multi-output data dependent output, and do the codegen in MultiOutput. There are some comments explaining exactly what we are improperly assuming.
3. **_rename_unbacked_to** in **torch/fx/experimental/symbolic_shapes.py** - Previously, when we renamed unbacked symbols, we clobbered any facts we previously knew about them. So for example, if we had a replacement `u0 -> s0` but then we renamed u0 to u1, we would now setup the replacement `u0 -> u1`, clobbering the old replacement. This apparently didn't matter in earlier PRs in the stack, but with Inductor now on the ball, there were some tests that indicated this was a problem. The solution is easy: if u0 had a preexisting replacement, reapply it to u1. However...
    * **torch/_functorch/_aot_autograd/collect_metadata_analysis.py** - When we run forward analysis, this triggers fake tensor repropagation and fresh allocations. Previously, we just cleared out the pending symbols when finished the analysis. But with the change above, this would also migrate replacements to the new symbols... which are now dead. So now we explicitly suppress generation of these symbols with `ignore_fresh_unbacked_symbols` so that no rebinding happens at all.
    * **torch/_dynamo/eval_frame.py** - same deal; I just searched for all sites we called clear() on pending
4. The last step is fixing the long tail of extra problems that show up, now that unbacked_bindings are load bearing into Inductor
    * **torch/_dynamo/eval_frame.py** - Some of the exports are making copies of nodes without repropagating fake tensors, so in this case, it is important to also copy the `unbacked_bindings` (apparently this didn't matter before without the Inductor changes)
    * **torch/_export/pass_base.py** - I discover that this is doing fake tensor repropagation via a test suite failure. Do the same playbook as AOTAutograd: PropagateUnbackedSymInts too!  Actually, they also have implemented their own tracer as well, so do the same playbook as proxy_tensor: record unbacked_bindings on the newly traced nodes. UGH code duplication.
    * **torch/_subclasses/fake_tensor.py**, **torch/_subclasses/fake_impls.py** (with call site updates at  **torch/_functorch/_aot_autograd/traced_function_transforms.py** and **torch/fx/passes/fake_tensor_prop.py**) - What's this new epoch thing? I noticed that sometimes I would be retracing, call nonzero() on a fake tensor, and not allocate a new unbacked symbol. This is actually bad, because if I don't get a new unbacked symbol, I don't know there's a binding site, and `unbacked_bindings` is now missing a binding. The reason for this is memoization: if I reuse the exact same fake tensor on my retrace, it will already have an unbacked symint memoized on it and we will short circuit allocation. Well, that's no good. So I associate the memos with a fake tensor epoch, and every time you start a new fake tensor propagation from scratch, you bump the epoch so that I clear all the memos.
    * **torch/_inductor/scheduler.py** - I notice in unit tests that V.current_node is not always set when we call process_kernel. So I save it into the IR node and restore it when we are running `get_estimated_runtime`.
    * **torch/fx/experimental/symbolic_shapes.py** - A few things
      * **rebind_unbacked** (re **_tensor_version**). Ordinarily, when you have an unbacked SymInt, you persistently have it all the way to the end of the program. `_tensor_version` violates this: this generates an unbacked SymInt (for reasons I don't quite understand?) and then gets rid of it later. This triggered an assert violation. I think this op is kind of misusing unbacked SymInt, but I didn't know how to refactor it, so it gets a special case.
      * **rebind_unbacked** (re **Simplify SymBool binding**). Ugh, SymBool, what a pain in the butt. I have an assert that you can only rebind unbacked symbol to another unbacked symbol. This assert fails when a boolean is involved, because the result of running keypath on the result is not `u1`, it's `sympy.Piecewise(... sympy.Eq(u1, 1) ...)`. This is actually just `u1`, but Sympy doesn't know it because it doesn't know that `u1` value range is `[0, 1]`. So we manually implement the simplification needed to get the assert to pass.
      * **compute_unbacked_bindings** (re **This is pretty fragile**). There is a really funny disaster involving memoization and Inductor process kernel. Ordinarily when I retrace, if there was a memo hit in the old trace, there will be a memo hit in the new trace. However, Inductor process kernel breaks this, because it recreates fake tensor inputs to the operator call from scratch (since they might have different strides), and obviously these tensor inputs don't have the memo from the old one. I tried a little bit to try to manually transplant the memo to the new fake tensor but it seemed hopeless, so I just let the fresh symbol ride, allocating a new unbacked symbol. However, in one of our tests, we rely on knowing that the first nonzero call is equal to the second (memoized) nonzero call. The equality test looked pretty easy to discharge, so I just went ahead and added a deferred runtime assert to this effect and it worked.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124394
Approved by: https://github.com/jansel
ghstack dependencies: #124310, #124314, #124316
2024-04-25 02:08:59 +00:00
Edward Z. Yang
9692b954c6 FakeTensorProp works with unbacked bindings (#124310)
This is a partial revert of https://github.com/pytorch/pytorch/pull/124059

Like in #124297, profiling has revealed that testing equality on *every* output is kind of expensive. So we only test equality when we know there is an unbacked binding.  This is the same playbook as the previous PR, just on FakeTensorProp instead of PropagateUnbackedSymInts. Note that we also need to populate `unbacked_bindings` in proxy_tensor.py, since we're generating an entirely new graph in that case.

We now have enough propagation that we're able to trigger a bug related to divisibility replacement. In https://github.com/pytorch/pytorch/pull/113165 we allowed to replace `u0` with `u1 * c` for some constant c, when we have determined that u0 is divisible by c. However, where does the binding for u1 come from? What we will have in practice is that there is some node that is supposed to have bound u1, but which actually is getting a `u1 * c` in its output. So, to get u1, we must divide out c. Fortunately, under the divisibility condition, this is always possible (but remember, we must test divisibility at runtime!)

Because we have tightened up asserts, it is now an error to allocate unbacked SymInts and then fail to track them under unbacked_bindings. In torch/_dynamo/eval_frame.py and torch/_functorch/_aot_autograd/collect_metadata_analysis.py there are examples of benign cases where we repropagated fake tensors but then immediately threw away the results. In these cases, it's not appropriate to rebind, since we're still using the old FX graph that has all of the old symbols. So we just manually clear it. It is possible that other cases will need to be updated, so this PR is "risky" from the perspective of hitting fbcode.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124310
Approved by: https://github.com/lezcano
2024-04-25 02:08:51 +00:00
Gagan Jain
c5e567c573 [Torch][Timer] Adding debug info logging interface for expired timers (#123883)
Summary:
Add a function to log additional debug information before killing expired watchdog timers.

Additional information like a stack trace can be added in the debug function using the worker process IDs from expired timers.

Test Plan: buck test mode/opt caffe2/test/distributed/elastic/timer:file_based_timer_test

Differential Revision: D56044153

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123883
Approved by: https://github.com/kurman
2024-04-25 01:15:52 +00:00
egienvalue
381653de63 torch.mtia module for MTIA device backend (#123612)
The MTIA device now has its own module in PyTorch.
torch.mtia has the following APIs, similar to other backends. lazy_init is also supported.
```
__all__ = [
    "init",
    "is_available",
    "synchronize",
    "device_count",
    "current_device",
    "current_stream",
    "default_stream",
    "set_stream",
    "stream",
    "device",
]

```
------------
For device management, we expand AcceleratorHooksInterface to support generic device management; it can be used in both C++ and Python.
```
def _accelerator_hooks_device_count() -> _int: ...
def _accelerator_hooks_set_current_device(device_index: _int) -> None: ...
def _accelerator_hooks_get_current_device() -> _int : ...
def _accelerator_hooks_exchange_device(device_index: _int) -> _int : ...
def _accelerator_hooks_maybe_exchange_device(device_index: _int) -> _int : ...
```

---------
Add a get_device_module API to retrieve device modules for different device types.
```
def get_device_module(device: Optional[Union[torch.device, str]] = None)
```
---------

Differential Revision: [D56443356](https://our.internmc.facebook.com/intern/diff/D56443356)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123612
Approved by: https://github.com/albanD
ghstack dependencies: #123611
2024-04-24 20:51:20 +00:00
Edward Z. Yang
e0e2d897ed Handle Tensor returns in PropagateUnbackedSymInts (#124297)
This subsumes https://github.com/pytorch/pytorch/pull/124069

In the original PR, my idea was that when we run PropagateUnbackedSymInts, we check that the sizes before and after are exactly the same. This ended up turning up lots of bugs that I didn't feel like fixing. Separately, Ivan let me know that this pass was quite expensive in terms of compile time, since we spent a lot of time thinking about the equalities.

To kill two birds with one stone, we now only check for equality precisely when an unbacked SymInt was bound (thanks to the previous PR in this stack, we now have this information). Specifically, we look to see if `meta["unbacked_bindings"]` is set on the old node, and if it is, we assert the old value is equal to the new value from the repropagation. Note that the pytree key is used to actually extract the new value from the example value, as it may be nested inside, e.g., a tensor size.

We do something a bit naughty at the end: we use `defer_runtime_assert` to actually teach ShapeEnv about the equality. This is implementationally equivalent to what we used to do, but we're going to change this later soon.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124297
Approved by: https://github.com/lezcano
ghstack dependencies: #124290
2024-04-24 12:18:33 +00:00
Edward Z. Yang
b04dca1502 Add pending_fresh_unbacked_symbols, populate unbacked_bindings for Dynamo (#124290)
The important comment:

```
        # Whenever we allocate a fresh unbacked Symbol, we add it to this
        # pending list.  Unbacked symbol allocation can occur at unpredictable
        # points during meta tensor propagation, but at some point, we
        # have to know what the binding site for an unbacked symbol is, and
        # this is computed when we actually place the node in the graph.  The
        # important thing is that we always actually handle every unaccounted
        # for unbacked symbol, so this list helps us keep track of them and
        # then make sure they are all accounted for.
        #
        # We could potentially give rise to errors earlier by lexically
        # scoping when we do propagation, and only allowing unbacked symbols
        # to be allocated at this point in time.  However this is inconvenient
        # to do in Dynamo, because fake tensor propagation is far from when we
        # analyze binding sites (set_example_value), so we do it in a more
        # mutatey way.
        #
        # NB: fresh unbacked symbols NEVER get substitutions applied to them,
        # they are binding sites!
```

The compute_unbacked_bindings is the other half of the equation: the thing that actually consumes the pending_fresh_unbacked_symbols and does something with them. Important comment:

```
    After having run fake tensor propagation and producing example_value
    result, traverse example_value looking for freshly bound unbacked
    symbols and record their paths for later.  It is an error if
    we have allocated an unbacked SymInt but it cannot be found in
    example_value.  (NB: this means if you have a multi-output
    function, you must call this on the tuple of tensor output, you
    cannot wait!)
```

For example, if I return a tensor with size `[u0, u1]`, and u1 is a fresh unbacked SymInt, then I'll have `{u1: KeyPath(".size(1)")}`, telling me I can get u1 by running `size(1)` on the result of this node. u0 is not fresh (it probably flowed in as an argument), so I don't generate a binding for it.

I eventually intend to propagate this information all the way to Inductor lowering, where extra metadata about unbacked symbol binding will be canonically used for codegen, instead of trying to infer it from defs/uses.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124290
Approved by: https://github.com/lezcano
2024-04-24 09:11:34 +00:00
rzou
4ceb44c40d Add torch.library.opcheck (#124496)
This PR:
- exposes torch.testing._internal.optests.opcheck as
  torch.library.opcheck
- Adds support for CustomOpDef (aka functions decorated with
  torch.library.custom_op) to opcheck.
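
A minimal sketch of the new entry point (the op definition is illustrative; opcheck runs schema, fake-tensor, and autograd-registration tests against the sample inputs):
```
import torch

@torch.library.custom_op("mylib::scale", mutates_args=())
def scale(x: torch.Tensor, factor: float) -> torch.Tensor:
    return x * factor

@scale.register_fake
def _(x, factor):
    return torch.empty_like(x)

# CustomOpDefs (functions decorated with custom_op) can be passed directly.
torch.library.opcheck(scale, (torch.randn(3), 2.0))
```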

Test Plan:
- Updated tests
- We validated opcheck's design internally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124496
Approved by: https://github.com/williamwen42
2024-04-23 21:48:00 +00:00
Matthew Hoffman
1d3a13d3d1 Conform torch.mps to device module interface (#124676)
Right now `torch.fork_rng()` doesn't support MPS. MPS' device module functions don't line up with the others'. One step of `fork_rng` calls `device_count()`:

302d7e9a6e/torch/random.py (L146)

It is pretty simple to know the MPS device count, based on whether it is built and available.

Also:

302d7e9a6e/torch/random.py (L168)

302d7e9a6e/torch/random.py (L175)

`get_rng_state` and `set_rng_state` are expected to be able to accept a `device` parameter.

@ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124676
Approved by: https://github.com/ezyang
2024-04-23 18:38:48 +00:00
Jeff Daily
6ede882c0b preferred blas library; cublaslt gemm implementation (#122106)
Following the example of PyTorch supporting a preferred Linalg library (cusolver or magma), this PR introduces a preferred blas library selector of either cublas or cublaslt for CUDA and hipblas or hipblaslt for ROCm via normal hipification of sources.

The default BLAS implementation remains cublas or hipblas. cublaslt or hipblaslt can be enabled using the environment variable TORCH_BLAS_PREFER_CUBLASLT=1 (or TORCH_BLAS_PREFER_HIPBLASLT=1 as an alias), or by calling `torch.backends.cuda.preferred_blas_library(backend="cublaslt")` (with `backend="hipblaslt"` accepted as an alias).
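
A short usage sketch of the selector described above:
```
import torch

# Prefer Lt GEMMs for this process ("hipblaslt" is accepted as an alias on ROCm).
torch.backends.cuda.preferred_blas_library(backend="cublaslt")

# Calling without arguments returns the currently preferred backend.
print(torch.backends.cuda.preferred_blas_library())
```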

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122106
Approved by: https://github.com/lezcano
2024-04-22 15:38:22 +00:00
PyTorch MergeBot
929242a15c Revert "torch.mtia module for MTIA device backend (#123612)"
This reverts commit d7e1bf9ff9.

Reverted https://github.com/pytorch/pytorch/pull/123612 on behalf of https://github.com/jeffdaily due to This broke ROCm. see test_overrides.py ([comment](https://github.com/pytorch/pytorch/pull/123611#issuecomment-2067363780))
2024-04-19 22:44:26 +00:00
rzou
bad8d25881 Add torch.library.register_kernel (#124299)
This mirrors the .register_kernel method on the object produced by the
custom_op decorator.
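
A hedged sketch of the module-level spelling (op name and kernels are illustrative):
```
import torch

# The decorated body acts as the CPU kernel for this op.
@torch.library.custom_op("mylib::mysin", mutates_args=(), device_types="cpu")
def mysin(x: torch.Tensor) -> torch.Tensor:
    return torch.sin(x)

# Register an additional backend kernel for the same op, mirroring the
# .register_kernel method on the CustomOpDef object.
@torch.library.register_kernel("mylib::mysin", "cuda")
def _(x):
    return torch.sin(x)
```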

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124299
Approved by: https://github.com/albanD
ghstack dependencies: #124180, #124200
2024-04-19 13:54:21 +00:00
Tobias Ringwald
58e403c739 Added a docstring for torch.Size.numel. (#124186)
Fixes #61231. Fixes #124167.

This PR documents a rather long-standing issue w.r.t. unexpected behavior of `torch.Size.numel`, first reported almost 5 years ago.
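
The surprise the docstring addresses is presumably that `numel()` multiplies the dimensions together rather than counting them:
```
import torch

s = torch.Size([2, 3, 4])
print(len(s))     # 3  -> number of dimensions
print(s.numel())  # 24 -> product of the dimensions, i.e. element count of such a tensor
```
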
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124186
Approved by: https://github.com/janeyx99
2024-04-19 09:23:02 +00:00
egienvalue
d7e1bf9ff9 torch.mtia module for MTIA device backend (#123612)
The MTIA device now has its own module in PyTorch.
torch.mtia has the following APIs, similar to other backends. lazy_init is also supported.
```
__all__ = [
    "init",
    "is_available",
    "synchronize",
    "device_count",
    "current_device",
    "current_stream",
    "default_stream",
    "set_stream",
    "stream",
    "device",
]

```
------------
For device management, we expand AcceleratorHooksInterface to support generic device management; it can be used in both C++ and Python.
```
def _accelerator_hooks_device_count() -> _int: ...
def _accelerator_hooks_set_current_device(device_index: _int) -> None: ...
def _accelerator_hooks_get_current_device() -> _int : ...
def _accelerator_hooks_exchange_device(device_index: _int) -> _int : ...
def _accelerator_hooks_maybe_exchange_device(device_index: _int) -> _int : ...
```

---------
Add a get_device_module API to retrieve device modules for different device types.
```
def get_device_module(device: Optional[Union[torch.device, str]] = None)
```
---------
@exported-using-ghexport

Differential Revision: [D52923602](https://our.internmc.facebook.com/intern/diff/D52923602/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123612
Approved by: https://github.com/albanD
ghstack dependencies: #123611
2024-04-18 17:38:06 +00:00
Boyuan Feng
aa2da0cdd2 [Export] Add runtime assert to non-strict export (#123681)
This PR moves insert_deferred_runtime_asserts from dynamo to torch.fx.passes and uses it to add runtime assertions for non-strict export.

Differential Revision: D55944267

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123681
Approved by: https://github.com/tugsbayasgalan, https://github.com/angelayi
2024-04-18 16:13:27 +00:00
rzou
645173a0b5 Add torch.library.register_autograd (#124071)
Allows registering autograd for all custom op entry points:
- the new-style custom op API (custom_op)
- the old-style torch.library APIs
- C++ operator registration
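
A hedged sketch using the custom_op entry point (op and formulas are illustrative):
```
import torch

@torch.library.custom_op("mylib::square", mutates_args=())
def square(x: torch.Tensor) -> torch.Tensor:
    return x * x

def setup_context(ctx, inputs, output):
    (x,) = inputs
    ctx.save_for_backward(x)

def backward(ctx, grad_out):
    (x,) = ctx.saved_tensors
    return 2 * x * grad_out

# The same call works regardless of how the op was originally registered.
torch.library.register_autograd("mylib::square", backward, setup_context=setup_context)

x = torch.randn(3, requires_grad=True)
square(x).sum().backward()
```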

Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124071
Approved by: https://github.com/albanD
ghstack dependencies: #123937, #124064, #124065, #124066
2024-04-18 12:47:59 +00:00
doloresgarcia
4efdf9a6a6 fix pytorch version for onnx in doc (#124182)
Fixes [#123845](https://github.com/pytorch/pytorch/issues/123845)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124182
Approved by: https://github.com/albanD
2024-04-17 18:05:15 +00:00
rzou
47dbfecd37 Rename impl_abstract to register_fake, part 1/2 (#123937)
This PR:
- adds a new torch.library.register_fake and deprecates
  torch.library.impl_abstract. The motivation is that we have a lot of
  confusion around the naming so we are going to align the naming with
  the actual subsystem (FakeTensor).
- renames `m.impl_abstract_pystub("fbgemm_gpu.sparse_ops")` to
  `m.has_python_registration("fbgemm_gpu.sparse_ops")`. No deprecation
  here yet; I need to test how this works with static initialization.
- Renames a bunch of internals to match (e.g. abstractimplpystub ->
  pystub)

I'm scared to rename the Python-side internal APIs (e.g.
torch._library.abstract_impl) because of torch.package concerns. I'll do
that in its own isolated PR next just in case it causes problems.

DEPRECATION NOTE: torch.library.impl_abstract was renamed to
torch.library.register_fake. Please use register_fake. We'll delete
impl_abstract in a future version of PyTorch.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123937
Approved by: https://github.com/albanD
2024-04-17 12:46:01 +00:00
Edward Z. Yang
cebf65126c FakeTensorProp assert consistency of sizes when metadata previously existed (#124059)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124059
Approved by: https://github.com/bdhirsh, https://github.com/thiagocrepaldi
ghstack dependencies: #124105
2024-04-16 23:28:42 +00:00
lezcano
891736f115 Fix links rendering when surrounding code in Dynamo deepdive (#123427)
I thought the RST was rendering correctly, but here we are.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123427
Approved by: https://github.com/peterbell10
2024-04-13 04:55:15 +00:00
Gagan Jain
016ca546aa Adding health check server hook in torch elastic (#122750) (#123504)
Summary:

Building a hook for an external mechanism to monitor the health of the torch elastic launcher. The health check server takes a dependency on FileTimerServer to check whether the launcher is healthy. It will always report healthy if FileTimerServer is disabled.

The implementation of start_healthcheck_server is unsupported; however, a TCP/HTTP server can be started on a specific port to monitor the liveness of worker_watchdog and take action accordingly.

Test Plan: buck test mode/opt caffe2/test/distributed/elastic/agent/server/test:local_agent_test

Differential Revision: D55837899

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123504
Approved by: https://github.com/kurman
2024-04-11 19:10:56 +00:00
Edward Z. Yang
bbcdd28409 Report LRU cache stats at end of program for symbolic shapes (#123724)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123724
Approved by: https://github.com/Chillee
2024-04-11 05:12:43 +00:00
PyTorch MergeBot
ecb2418dd6 Revert "Adding health check server hook in torch elastic (#122750)"
This reverts commit 61d431fab0.

Reverted https://github.com/pytorch/pytorch/pull/122750 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/122750#issuecomment-2041104931))
2024-04-06 14:31:07 +00:00
Gagan Jain
61d431fab0 Adding health check server hook in torch elastic (#122750)
Summary:
Building a hook for an external mechanism to monitor the health of the torch elastic launcher. The health check server takes a dependency on FileTimerServer to check whether the launcher is healthy. It will always report healthy if FileTimerServer is disabled.

The implementation of start_healthcheck_server is unsupported; however, a TCP/HTTP server can be started on a specific port to monitor the liveness of worker_watchdog and take action accordingly.

Test Plan: buck test mode/opt caffe2/test/distributed/elastic/agent/server/test:local_agent_test

Differential Revision: D55108182

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122750
Approved by: https://github.com/kurman
2024-04-05 23:17:30 +00:00
Huy Do
f5b8c9b730 Ignore some known duplicated modules in doc build config script (#123425)
This is a follow-up fix to https://github.com/pytorch/pytorch/pull/123244#discussion_r1552935150, as @clee2000 pointed out a better way to ignore those duplicated entries.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123425
Approved by: https://github.com/clee2000
2024-04-05 21:12:14 +00:00
Lucas Pasqualin
de7edeea25 [DCP] DCP logger (#121352)
Adds additional logging for improved observability in DCP.

Differential Revision: [D54512626](https://our.internmc.facebook.com/intern/diff/D54512626/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121352
Approved by: https://github.com/wz337, https://github.com/fegin
2024-04-05 17:50:50 +00:00
Guilherme Leobas
c575e378ba Update torch.compile_faq w.r.t to functorch (#122213)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122213
Approved by: https://github.com/zou3519
ghstack dependencies: #122211, #122212
2024-04-05 03:29:11 +00:00
Guilherme Leobas
84658d9c4f Enable capture_func_transforms by default (#122211)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122211
Approved by: https://github.com/zou3519
2024-04-05 03:29:11 +00:00
Huy Do
3d20cc1332 Cleanup some duplicated placeholder py:module docs (#123244)
Fixes https://github.com/pytorch/pytorch/issues/123068
Fixes https://github.com/pytorch/pytorch/issues/111256

While investigating the flaky doc build failure w.r.t. the duplicated `torch.ao.quantization.quantize` docstring warning, i.e. https://github.com/pytorch/pytorch/actions/runs/8532187126/job/23376591356#step:10:1260, I discovered an old but still open bug in Sphinx: https://github.com/sphinx-doc/sphinx/issues/4459.  These warnings have always been there, but they are hidden because we are using `-j auto` to build docs with multiple threads.  It's just by chance that they started to surface now.

The issue can be reproduced by removing `-j auto` from https://github.com/pytorch/pytorch/blob/main/docs/Makefile#L5 and running `make html` locally.  Then these warnings show up consistently.  As `make html` treats warnings as errors, they will fail the build.

```
...
/data/users/huydo/conda/py3.8/lib/python3.8/site-packages/torch/ao/quantization/quantize.py:docstring of torch.ao.quantization.quantize.quantize:1: WARNING: duplicate object description of torch.ao.quantization.quantize, other instance in quantization, use :noindex: for one of them
/data/users/huydo/conda/py3.8/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py:docstring of torch.nn.parallel.data_parallel.data_parallel:1: WARNING: duplicate object description of torch.nn.parallel.data_parallel, other instance in nn, use :noindex: for one of them
/data/users/huydo/conda/py3.8/lib/python3.8/site-packages/torch/nn/utils/spectral_norm.py:docstring of torch.nn.utils.spectral_norm.spectral_norm:1: WARNING: duplicate object description of torch.nn.utils.spectral_norm, other instance in nn, use :noindex: for one of them
/data/users/huydo/conda/py3.8/lib/python3.8/site-packages/torch/nn/utils/weight_norm.py:docstring of torch.nn.utils.weight_norm.weight_norm:1: WARNING: duplicate object description of torch.nn.utils.weight_norm, other instance in nn, use :noindex: for one of them
/data/users/huydo/github/pytorch/docs/source/nn.rst:579: WARNING: duplicate object description of torch.nn.parallel.data_parallel, other instance in generated/torch.nn.functional.torch.nn.parallel.data_parallel, use :noindex: for one of them
/data/users/huydo/github/pytorch/docs/source/nn.rst:594: WARNING: duplicate object description of torch.nn.utils.spectral_norm, other instance in generated/torch.nn.utils.spectral_norm, use :noindex: for one of them
/data/users/huydo/github/pytorch/docs/source/nn.rst:595: WARNING: duplicate object description of torch.nn.utils.weight_norm, other instance in generated/torch.nn.utils.weight_norm, use :noindex: for one of them
/data/users/huydo/github/pytorch/docs/source/quantization.rst:1348: WARNING: duplicate object description of torch.ao.quantization.quantize, other instance in generated/torch.ao.quantization.quantize, use :noindex: for one of them
...
```

The fix is just to clean up those duplicated placeholder py:module docs, which were there because these modules didn't have any docs originally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123244
Approved by: https://github.com/andrewor14, https://github.com/malfet
2024-04-05 03:18:53 +00:00
rzou
44c0c0fc0f Add torch.library.custom_op (#122344)
This is the entrypoint for defining an opaque/blackbox (e.g. PyTorch will
never peek into it) custom op. In this PR, you can specify backend impls
and the abstract impl for this op.

NB: most of this PR is docstrings, please don't be intimidated by the
line count.

There are a number of interesting features:
- we infer the schema from type hints. In a followup I add the ability
  to manually specify a schema.
- name inference. The user needs to manually specify an op name for now.
  In a followup we add the ability to automatically infer a name (this
  is a little tricky).
- custom_op registrations can override each other. This makes them
  more pleasant to work with in environments like colab.
- we require that the outputs of the custom_op do not alias any inputs
  or each other. We enforce this via a runtime check, but can relax this
  into an opcheck test if it really matters in the future.
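
A hedged sketch against the API as it exists today (op name and NumPy round-trip are illustrative; the fake/"abstract" registration is shown under its later name, register_fake):
```
import torch

# The body is opaque to PyTorch; here it round-trips through NumPy.
@torch.library.custom_op("mylib::numpy_mul", mutates_args=())
def numpy_mul(x: torch.Tensor, factor: float) -> torch.Tensor:
    return torch.from_numpy(x.numpy(force=True) * factor)

# Metadata-only implementation so the op works under fake tensors / torch.compile.
@numpy_mul.register_fake
def _(x, factor):
    return torch.empty_like(x)

print(numpy_mul(torch.randn(3), 2.5))
```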

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122344
Approved by: https://github.com/ezyang, https://github.com/albanD
2024-04-03 18:36:17 +00:00
lezcano
b27ee6548d Add a Dynamo deepdive to documentation (#122305)
This supersedes the previous "Guards Overview" as a more comprehensive
approach to most of the main topics within Dynamo.

In the future, we could add specific sections for each of the topics
discussed here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122305
Approved by: https://github.com/msaroufim
2024-04-02 15:08:08 +00:00
Will Feng
489f4a063b Revert "Preserve unbacked SymInt on SymNode (#120816)" (#122988)
This reverts commit 476585b190.

I did a bisect and this seems to be the cause of a compile time regression in the cudagraphs_dynamic test suite between 03/23 and 03/24:
![image](https://github.com/pytorch/pytorch/assets/4063635/21394e06-4906-4690-b5a2-7d16cc475843)
In particular, BERT_pytorch and hf_T5 seem to have a ~50% compile time regression.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122988
Approved by: https://github.com/eellison
2024-04-01 22:11:09 +00:00
Mikayla Gawarecki
487b6d40ec Add RMSNorm module (#121364)
Similar to dbeed9724b/torchmultimodal/modules/layers/normalizations.py (L51)

**The implementation here is not optimized and we welcome pull requests to improve this**

- Use `normalized_shape` instead of singular integer `dim` to be aligned with the `nn.LayerNorm` implementation
- Remove the [upcast to float and downcast](dbeed9724b/torchmultimodal/modules/layers/normalizations.py (L73))
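
A hedged usage example, assuming the module lands as `nn.RMSNorm`:
```
import torch
from torch import nn

# normalized_shape mirrors nn.LayerNorm: normalize over the trailing dimension(s).
rms_norm = nn.RMSNorm(normalized_shape=64)
x = torch.randn(8, 64)
y = rms_norm(x)  # same shape as x; no mean subtraction and no bias
print(y.shape)
```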

Differential Revision: [D55485840](https://our.internmc.facebook.com/intern/diff/D55485840)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121364
Approved by: https://github.com/albanD
2024-03-29 18:05:28 +00:00
David Berard
59f6393209 [docs] Update PT2+Profiler docs (#122272)
Document:
* Torch-Compiled Region
* What to expect in kernels inside a torch-compiled region

For review, see https://docs-preview.pytorch.org/pytorch/pytorch/122272/torch.compiler_profiling_torch_compile.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122272
Approved by: https://github.com/aaronenyeshi
2024-03-28 17:52:28 +00:00
PyTorch MergeBot
8698121636 Revert "Add RMSNorm module (#121364)"
This reverts commit a7306de0dc.

Reverted https://github.com/pytorch/pytorch/pull/121364 on behalf of https://github.com/atalman due to Broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/121364#issuecomment-2025502007))
2024-03-28 15:31:10 +00:00
Aaron Orenstein
a8b7480f0d fix dynamo.explain examples (#122745)
`dynamo.explain()` was updated to return a structure but the docs weren't updated to match.

- Update the docs to use the new API
- Remove some dead code left when `explain` was updated.
- Drive-by: Fix some `nopython` uses that I noticed
- Drive-by: I noticed an ignored error coming from CleanupHook on shutdown - make it check the global before setting it.
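
For reference, a short sketch of the structured result the updated docs describe (attribute names taken from the public ExplainOutput; the example function is illustrative):
```
import torch

def fn(x):
    if x.sum() > 0:  # data-dependent branch -> graph break
        return x + 1
    return x - 1

explanation = torch._dynamo.explain(fn)(torch.randn(8))
print(explanation.graph_count, explanation.graph_break_count)
print(explanation.break_reasons)
```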

Fixes #122573

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122745
Approved by: https://github.com/jansel
2024-03-27 22:53:27 +00:00
Mikayla Gawarecki
a7306de0dc Add RMSNorm module (#121364)
Similar to dbeed9724b/torchmultimodal/modules/layers/normalizations.py (L51)

**The implementation here is not optimized and we welcome pull requests to improve this**

- Use `normalized_shape` instead of singular integer `dim` to be aligned with the `nn.LayerNorm` implementation
- Remove the [upcast to float and downcast](dbeed9724b/torchmultimodal/modules/layers/normalizations.py (L73))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121364
Approved by: https://github.com/albanD
2024-03-27 21:39:30 +00:00