Commit Graph

2671 Commits

Author SHA1 Message Date
Sheng Fu
c1dd3a615f Implement Graph Transform Observer (#127427)
Summary: Implement Graph Transform Observer

Differential Revision: D57887518

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127427
Approved by: https://github.com/angelayi
2024-06-02 06:49:47 +00:00
PyTorch MergeBot
7646825c3e Revert "distributed debug handlers (#126601)"
This reverts commit 3d541835d5.

Reverted https://github.com/pytorch/pytorch/pull/126601 on behalf of https://github.com/PaliC due to breaking internal typechecking tests ([comment](https://github.com/pytorch/pytorch/pull/126601#issuecomment-2141076987))
2024-05-31 01:21:24 +00:00
Alex Baden
5d316c81be [Inductor] Add 0 initialization to Triton masked loads (#127311)
For a masked `tl.load` operation, the Triton language specifies that values masked out (i.e. where the mask evaluates to false) are undefined in the output of the load. Triton provides an optional `other` parameter which, when included, provides an explicit value to use for masked out values from the load. If the output from a masked load without the `other` parameter is used in a conditional, unexpected behavior can occur.

Despite the language specification, all Triton backends currently in use by PyTorch Inductor (NVIDIA, AMD, and Intel) 0-initialize masked loads if `other` is not present (we recently changed the Intel backend behavior to match NVIDIA and AMD because that's what our users expect, even if we are not following the Triton spec to a tee). This PR attempts to "future-proof" Inductor for new backends (or perhaps changes in the current backends? - we did not see any performance change from 0-initializing in the Intel XPU backend, but one could imagine compiler optimizations that remove paths depending on undefined values) by adding an explicit `other` in instances where later conditionals depend on the `tl.load` output. I also removed an exception to `other` behavior for boolean loads, which was put in place for a Triton bug that should be fixed. I added `other` to the getting started documentation as a clue that masked load behavior requires explicit initialization, even though I don't expect `undef` values to cause the example code to fail if the underlying output is not 0-initialized. Finally, I added `other` to the `make_load` function in `select_algorithm.py`, though I wasn't able to determine if that function was actually being called.

Fixes #126535

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127311
Approved by: https://github.com/jansel
2024-05-30 04:50:54 +00:00
Tristan Rice
3d541835d5 distributed debug handlers (#126601)
This adds debug handlers as described in:
* https://gist.github.com/d4l3k/828b7be585c7615e85b2c448b308d925 (public copy)
* https://docs.google.com/document/d/1la68szcS6wUYElUUX-P6zXgkPA8lnfzpagMTPys3aQ8/edit (internal copy)

This is only adding the C++ pieces that will be used from the main process. The Python and torchrun pieces will be added in a follow up PR.

This adds 2 handlers out of the box:

* `/handler/ping` for testing purposes
* `/handler/dump_nccl_trace_pickle` as a POC integration with Flight Recorder

Test plan:

```
python test/distributed/elastic/test_control_plane.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126601
Approved by: https://github.com/kurman, https://github.com/c-p-i-o
2024-05-30 02:21:08 +00:00
rzou
1abcac9dab New Custom Ops Documentation landing page (#127400)
We create a new landing page for PyTorch custom ops (suggested by
jansel). All of our error messages will link here, and I'll work with
the docs team to see if we can boost SEO for this page.

NB: the landing page links some non-searchable webpages. Two of those
(the Python custom ops tutorial and C++ custom ops tutorial) will turn
into actual webpages when PyTorch 2.4 comes around. I'll make the third one
(the Custom Operators Manual) once it stabilizes (we continuously add new
things to it and the length means that we might want to create a custom
website for it to make the presentation more ingestible).

Test Plan:
- view docs preview.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127400
Approved by: https://github.com/jansel
ghstack dependencies: #127291, #127292
2024-05-30 01:06:04 +00:00
Edward Z. Yang
76fc58c160 Document the legacy constructor for Tensor (#122625)
Fixes https://github.com/pytorch/pytorch/issues/122408

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122625
Approved by: https://github.com/albanD
2024-05-29 23:23:19 +00:00
Xuehai Pan
26f4f10ac8 [5/N][Easy] fix typo for usort config in pyproject.toml (kown -> known): sort torch (#127126)
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in the PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126
Approved by: https://github.com/kit1980
2024-05-27 14:49:57 +00:00
PyTorch MergeBot
55c0ab2887 Revert "[5/N][Easy] fix typo for usort config in pyproject.toml (kown -> known): sort torch (#127126)"
This reverts commit 7763c83af6.

Reverted https://github.com/pytorch/pytorch/pull/127126 on behalf of https://github.com/XuehaiPan due to Broken CI ([comment](https://github.com/pytorch/pytorch/pull/127126#issuecomment-2133044286))
2024-05-27 09:22:08 +00:00
Xuehai Pan
7763c83af6 [5/N][Easy] fix typo for usort config in pyproject.toml (kown -> known): sort torch (#127126)
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in the PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126
Approved by: https://github.com/kit1980
ghstack dependencies: #127122, #127123, #127124, #127125
2024-05-27 04:22:18 +00:00
Xuehai Pan
35ea5c6b22 [3/N][Easy] fix typo for usort config in pyproject.toml (kown -> known): sort torchgen (#127124)
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in the PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127124
Approved by: https://github.com/Skylion007
ghstack dependencies: #127122, #127123
2024-05-25 19:20:03 +00:00
Yu, Guangye
e7a42702f9 generalize custom_fwd&custom_bwd to be device-agnostic (#126531)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126531
Approved by: https://github.com/jgong5, https://github.com/gujinghui, https://github.com/albanD, https://github.com/EikanWang
ghstack dependencies: #126527
2024-05-25 06:48:16 +00:00
Yu, Guangye
c09205a057 Deprecate device-specific GradScaler autocast API (#126527)
# Motivation

## for `torch.amp.GradScaler`,
- `torch.cpu.amp.GradScaler(args...)` is completely equivalent to `torch.amp.GradScaler("cpu", args...)`.
- `torch.cuda.amp.GradScaler(args...)` is completely equivalent to `torch.amp.GradScaler("cuda", args...)`.

So, we intend to deprecate them and **strongly recommend** that developers use `torch.amp.GradScaler`.

## for `custom_fwd` and `custom_bwd`,
this is a good solution to make the custom function run with or without effect even in an autocast-enabled region and can be shared by other backends, like CPU and XPU.
So we generalize them to be device-agnostic, put them into `torch/amp/autocast_mode.py`, and re-expose them as `torch.amp.custom_fwd` and `torch.amp.custom_bwd`. Meanwhile, we deprecate `torch.cuda.amp.custom_fwd` and `torch.cuda.amp.custom_bwd`.

# Additional Context
Add a UT to cover the deprecation warning.
No more UTs are needed to cover the functionality of `torch.amp.custom_f/bwd`; the existing UTs that previously covered the functionality of `torch.cuda.amp.custom_f/bwd` cover them.
To facilitate the review, we separate these code changes into two PRs. The first PR covers `torch.amp.GradScaler`. The follow-up covers `custom_fwd` and `custom_bwd`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126527
Approved by: https://github.com/jgong5, https://github.com/gujinghui, https://github.com/janeyx99, https://github.com/EikanWang
2024-05-25 06:41:34 +00:00
lezcano
a30baec0c3 [Docs] Fix NumPy + backward example (#126872)
We were calling backward on a tensor not a scalar...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126872
Approved by: https://github.com/albanD
2024-05-22 21:29:31 +00:00
Kurman Karabukaev
d62b025efc [TorchElastic] Option for sharing TCPStore created by rdzv handlers (#125743)
Summary:

1. Define explicit `use_agent_store` on rdzv handlers. Handlers that set it to true can share the store.
2. Instead of the agent coordinating master_addr/master_port values, the logic is now encapsulated by a *rdzv_handler*, where `RendezvousInfo` will have a `RendezvousStoreInfo` object that handlers must return.
    - Depending on the implementation they can either:
         - point to an existing store (and are expected to set `use_agent_store` to true - point 1). Client code will rely on the `TORCHELASTIC_USE_AGENT_STORE` env variable to know if the store is shared.
         - build args that `torch.distributed.init_process_group` can use to bootstrap by creating a new store.

Additional points:

- When TCPStore is shared, it should be wrapped in PrefixStore to qualify/scope namespace for other usecases.
- `next_rendezvous` signature changed to return instance of `RendezvousInfo` instead of a (store, rank, world_size) tuple for extensibility purposes.
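
A minimal sketch of why returning an object instead of a tuple helps extensibility. The field names below (`bootstrap_store_info`, `master_addr`, `master_port`) and the `describe` helper are illustrative assumptions, not the actual torchelastic definitions:

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class RendezvousStoreInfo:
    # Hypothetical fields: where a store can be reached.
    master_addr: str
    master_port: int


@dataclass
class RendezvousInfo:
    # The three values that used to be returned as a bare tuple.
    store: Any
    rank: int
    world_size: int
    # A later addition; callers that read attributes by name keep working.
    bootstrap_store_info: Optional[RendezvousStoreInfo] = None


def describe(info: RendezvousInfo) -> str:
    # Written against the original three fields only.
    return f"rank {info.rank} of {info.world_size}"


old_style = RendezvousInfo(store={}, rank=0, world_size=4)
new_style = RendezvousInfo(
    store={},
    rank=1,
    world_size=4,
    bootstrap_store_info=RendezvousStoreInfo("localhost", 29500),
)
print(describe(old_style))
print(describe(new_style))
```

With a tuple, adding a fourth element would break every `store, rank, world_size = ...` unpacking site; attribute access does not have that problem.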

Why:
- Reduce moving parts
   - easier to swap implementation
   - improve tractability
   - addressing perf/debug-ability will benefit all usecases
Test Plan: CI

Differential Revision: D57055235

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125743
Approved by: https://github.com/d4l3k
2024-05-22 18:24:11 +00:00
Ke Wen
403012b50a [pipelining] expose APIs per pytorch rule (#126812)
Rule is enforced by #126103.

The rule:
- If `torch.a.b` defines a public class `C` (i.e. to be exposed in torch API namespace), then `torch.a.b` must be a public path, i.e. no `_`.
- `torch.a.b` should ideally have an `__all__` that defines what should be imported from this file when it is imported.
- All other definitions in `torch.a.b` that you don't want to expose should have a `_` prefix.
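
The rule can be sketched with two small checks. This is only an illustration of the convention, not the enforcement code from #126103; `is_public_path`, `exposed_names`, and the stand-in module are hypothetical names:

```python
import types


def is_public_path(module_path: str) -> bool:
    """True if no component of the dotted path starts with an underscore."""
    return not any(part.startswith("_") for part in module_path.split("."))


def exposed_names(module: types.ModuleType) -> list:
    """Names that `from module import *` would pick up."""
    if hasattr(module, "__all__"):
        return sorted(module.__all__)
    return sorted(name for name in vars(module) if not name.startswith("_"))


# Hypothetical module standing in for `torch.a.b`.
mod = types.ModuleType("torch.a.b")
mod.C = type("C", (), {})     # public class
mod._helper = lambda: None    # private helper, underscore-prefixed
mod.__all__ = ["C"]

print(is_public_path("torch.a.b"))   # True
print(is_public_path("torch.a._b"))  # False
print(exposed_names(mod))            # ['C']
```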

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126812
Approved by: https://github.com/wconstab
2024-05-22 16:21:13 +00:00
Sahdev Zala
fe0a36fd7c Fix a link in the compiler backend doc (#126079)
The core aten is the core subset of aten and seems to be the correct link to replace the broken link.

Fixes #125961

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126079
Approved by: https://github.com/svekars
2024-05-21 20:16:04 +00:00
Joel Schlosser
31ba6ee49b Traceable wrapper subclass support for deferred runtime asserts (#126198)
The padded dense -> jagged conversion op has the signature:
```
_fbgemm_dense_to_jagged_forward(Tensor dense, Tensor[] offsets, SymInt? total_L=None) -> Tensor
```

When `total_L` is not specified, the meta registration has a data-dependent output shape (based on `offsets[0][-1]`). Returning an unbacked SymInt here should work in theory, but traceable wrapper subclass support is missing in later code to handle deferred runtime asserts. This PR fixes this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126198
Approved by: https://github.com/ezyang
2024-05-21 01:21:46 +00:00
Mikayla Gawarecki
66dc8fb7ff Allow tensor subclasses and add torch.serialization.add_safe_globals that allows users to allowlist classes for weights_only load (#124331)
#### Conditions for allowlisting tensor subclasses
We allow tensor subclasses types that
(1) Do not override `__setstate__`, `__getattr__`, `__setattr__`, `__get__`, `__set__` or `__getattribute__` of `torch.Tensor` (`torch.Tensor` does not have a definition of `__getattr__`, `__get__` or `__set__` so we check that these are `None`)
(2) Use the generic `tp_alloc`
(3) Are in a module that *has been imported by the user*
to be pushed onto the stack as strings by `GLOBAL` instructions, while storing the type in a dict

The strings will be converted to the classes as appropriate when executing `REBUILD` with `_rebuild_from_type_v2`

*Note that we use `inspect.getattr_static(sys.modules[module], name)` to get the class/function as this method claims to have no code execution.

The rationale for the 3 conditions above is as follows:

The rebuild func provided by `Tensor.__reduce_ex__` is `torch._tensor._rebuild_from_type_v2`, which is defined as such (note the call to `getattr`, `Tensor.__setstate__` and the call to `as_subclass` as well as the call to `_set_obj_state` which calls `setattr`)

4e66aaa010/torch/_tensor.py (L57-L71)

`as_subclass` is implemented with a call to `THPVariable_NewWithVar`

that will eventually call `tp_alloc` here
4e66aaa010/torch/csrc/autograd/python_variable.cpp (L2053)

The `func` arg to `_rebuild_from_type_v2` for wrapper subclasses is `Tensor.rebuild_wrapper_subclass`, which will similarly call into `THPVariable_NewWithVar` and hit the above `tp_alloc`

**Note that we do not call `tp_init` or `tp_new` (i.e. `cls.__init__` or `cls.__new__`) when unpickling**

### How do we check something is a tensor subclass/constraints around imports

In order to check whether `bla` is a tensor subclass in the bytecode `GLOBAL module.name`, we need to do an `issubclass` check, which entails converting the global string to the appropriate type. We *do not* arbitrarily import modules but will perform this check as long as the given subclass (given by `module.name`) has already been imported by the user (i.e. `module in sys.modules` and `issubclass(getattr(sys.modules[module], name), torch.Tensor)`).
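
A minimal sketch of that check using only the standard library. `resolve_imported_subclass` is a hypothetical helper, and `dict` stands in for `torch.Tensor` so the example runs without PyTorch; the real unpickler logic may differ in detail:

```python
import collections
import inspect
import sys


def resolve_imported_subclass(module: str, name: str, base: type):
    """Return the class `module.name` if it is already imported and subclasses `base`."""
    if module not in sys.modules:
        # Never import a module on behalf of the pickle stream.
        return None
    # getattr_static looks the name up without running descriptors
    # or a module-level __getattr__.
    obj = inspect.getattr_static(sys.modules[module], name, None)
    if isinstance(obj, type) and issubclass(obj, base):
        return obj
    return None


print(resolve_imported_subclass("collections", "OrderedDict", dict))
print(resolve_imported_subclass("collections", "deque", dict))
print(resolve_imported_subclass("module_that_was_never_imported", "Foo", dict))
```

Only the first call resolves to a class; the other two return `None` because `deque` is not a `dict` subclass and the last module was never imported.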

This PR also allowlisted  `torch._utils._rebuild_wrapper_subclass` and `torch.device` (used by `_rebuild_wrapper_subclass`)

### API for allow listing
This PR also added `torch.serialization.{add/get/clear}_safe_globals`, which enable users to allowlist globals they have deemed safe and to manipulate this list (for example, they could allowlist a tensor subclass with a custom `__setstate__` if they have checked that this is safe).

Next steps:
- Add testing and allowlist required classes for all in-core tensor subclasses (e.g. `DTensor`, `FakeTensor` etc.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124331
Approved by: https://github.com/albanD
2024-05-17 17:56:57 +00:00
yuanx749
691af57fbc Fix broken link of scikit-learn (#120972)
The link is broken in https://pytorch.org/docs/main/community/design.html
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120972
Approved by: https://github.com/Skylion007
2024-05-16 11:46:34 +00:00
Edward Z. Yang
44efeac24e Beef up error message for pending assert failure (#126212)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126212
Approved by: https://github.com/Skylion007
2024-05-15 18:22:53 +00:00
Oguz Ulgen
79655a1321 Add force_disable_caches to the docs (#126184)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126184
Approved by: https://github.com/msaroufim
2024-05-15 07:16:08 +00:00
Ke Wen
07d6ab5aa2 [pipelining] Add pipeline schedules (#125975)
1. Add pipeline schedules:
- GPipe
- 1F1B
- Interleaved 1F1B
- LoopedBFS

2. Add basic forward and backward tests:
test_schedule.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125975
Approved by: https://github.com/wconstab
ghstack dependencies: #125729
2024-05-11 21:17:53 +00:00
Will Constable
26b942c4fc [C10D] Document destroy_process_group usage (#122358)
This API was not documented. It has already been a source of confusion,
but recently has become more urgent as improper destruction can lead to
hangs due to ncclCommAbort's requirement of being called collectively.

Fixes #48203
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122358
Approved by: https://github.com/shuqiangzhang
2024-05-09 16:51:31 +00:00
lezcano
acafabaa29 Rename TorchDynamo -> Dyanamo in the dynamo tutorial doc (#123431)
Less verbose and it aligns it with the dynamo deepdive
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123431
Approved by: https://github.com/peterbell10
2024-05-07 05:07:00 +00:00
albanD
76a26a885d Add module tracker (#125352)
This does a few things that were originally a few PRs but I am on a new machine and don't have ghstack.
If it is too problematic to review, I can re-split, just let me know.
This does:
- Cleanup context manager use in test_flop_counter
- Remove need for mod argument in FlopCounterMode, warning about it
- Re-implement a Module tracker from scratch using global forward Module use and multi_grad_hook (we cannot use global backward Module hook because they don't look for nested Tensor and they're custom Function based instead of multi_grad_hook).
- Update FlopCounterMode to use the new ModuleTracker. The existing test suite passes as-is (the only changes there are new tests and the refactoring mentioned above)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125352
Approved by: https://github.com/mikaylagawarecki
2024-05-04 18:33:35 +00:00
Ke Wen
5cd7c75bd9 [pipelining] Add tracing frontend (#125448)
This PR allows users to transform a model into a pipeline representation with split stages, according to a split spec.
```
def pipeline(
    module: torch.nn.Module,
    num_chunks: int,
    example_args: Tuple[Any, ...],
    example_kwargs: Optional[Dict[str, Any]] = None,
    split_spec: Optional[Dict[str, SplitPoint]] = None,
    split_policy: Optional[Callable[[fx.GraphModule], fx.GraphModule]] = None,
) -> Pipe:
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125448
Approved by: https://github.com/H-Huang
ghstack dependencies: #125273
2024-05-04 09:00:25 +00:00
Muralidhar Andoorveedu
b96b1e8cff [Distributed] Add P2P versions of *object_list operations (#124379)
This PR adds `send_object_list` and `recv_object_list` to `distributed_c10d.py`. This is extending functionality already present in PyTorch with `broadcast_object_list` that I noticed was missing and decided to upstream.

With this change, sending and receiving arbitrary picklable python objects is possible.

Relevant issue: https://github.com/pytorch/pytorch/issues/3473

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124379
Approved by: https://github.com/kwen2501, https://github.com/wconstab
2024-05-03 23:22:58 +00:00
Alexandre Ghelfi, PhD
d18a6f46d0 Adding Compare in torch.utils.benchmark documentation (#125009)
`torch.utils.benchmark.Compare` is not directly exposed in torch.utils.benchmark documentation.

I think this is a valuable resource to add since it can help people embrace the torch benchmark way of doing things, and help people build documentation around it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125009
Approved by: https://github.com/mikaylagawarecki
2024-05-03 00:50:54 +00:00
Ke Wen
0199ce8d6c [pipelining] Add microbatch split and merge utils (#125273)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125273
Approved by: https://github.com/H-Huang
ghstack dependencies: #124776, #124875, #124958
2024-05-02 21:09:47 +00:00
Lucas Pasqualin
799f1460af [DCP] Provides default AsyncStager (#124939)
Differential Revision: [D56575987](https://our.internmc.facebook.com/intern/diff/D56575987/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124939
Approved by: https://github.com/fegin
ghstack dependencies: #122965
2024-05-02 19:48:54 +00:00
Lucas Pasqualin
3741fb3680 [DCP] Introduce async staging extension points (#122965)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

* #124944
* #124939
* __->__ #122965

Differential Revision: [D55493240](https://our.internmc.facebook.com/intern/diff/D55493240/)

*This PR is now ready for merge and is not an RFC*

Major choices are:
-- the introduction of the AsyncStager protocol
-- removed `executor` from param.
-- leave async as a separate method (for now)

This proposal seeks to add extension points to dcp.async_save, allowing users to:
- Specify a specific staging method when calling async_save
- Allow a vehicle for also making the staging method async, to allow for cases where we may want to overlap with the training loop (e.g., overlap d2h with training and only synchronize at the optim.step)
- Potentially specify the execution method for doing async_save in parallel. For example some users may prefer a subprocess over a  thread to avoid GIL issues.

A totally reasonable alternative to this entire proposal is to expect users who want this level of customization
to write their own custom async save methods. Here's an example which addresses the issues mentioned
in PR comments.
```
def custom_async_save(...):
    # This step accomplishes staging and includes the usual 'planning' calls (issue 1).
    buffered_writer = CpuBufferedWriter()  # this is stateful, contains a copy of state_dict
    dcp.save(state_dict, storage_writer=buffered_writer)

    final_storage_writer = FileSystemWriter()
    mp.spawn(  # issue 2 is gone, do whatever you want here
        dcp.save,  # or some custom sub-process method which calls dcp.save under the hood
        buffered_writer.state_dict,  # lots of ways to do this, not really the most important part
        checkpoint_id=checkpoint_id,
        storage_writer=final_storage_writer,
        planner=planner,
        process_group=process_group,  # this actually wouldn't work, but again not the point
    )
    # Leaving out the rest of the details for managing your extra special subprocess.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122965
Approved by: https://github.com/daulet-askarov
2024-05-02 19:01:55 +00:00
Ke Wen
52142192d4 [pipelining] Add stage backward function (#124958)
This is a helper function which:
1. computes the gradients for the stage inputs, and
2. accumulates gradients for the stage module's parameters.

A unit test for this function is also added.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124958
Approved by: https://github.com/wconstab
ghstack dependencies: #124776, #124875
2024-05-01 07:56:58 +00:00
Mikayla Gawarecki
2480e8b8a1 Add MAP_SHARED option for torch.load(mmap=True) (#124889)
Fixes #124528

Going over the options for our MapAllocator and what they do, I don't think any of the others need to be piped up to `torch.load`

4f29103749/aten/src/ATen/MapAllocator.h (L8-L16)

~However, I wonder if this `MmapVisibility(Enum)` is a good way to represent "or-ing" together of `mmap` flags if we want to extend it in the future. I looked over the flags for [`mmap(2)`](https://man7.org/linux/man-pages/man2/mmap.2.html), and could not immediately see how most of them would be useful for `torch.load` (would maybe `MAP_LOCKED` (like `mlock`) or `MAP_HUGE` ever be worthwhile?)~

Using the flags provided by the python `mmap` library so that we can extend the allowed flags and pipe them down to the cpp `mmap` call if there is a need for other flags in the future
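
To illustrate what the two visibility flags mean, here is a small sketch that uses only the standard-library `mmap` module (Unix only). It does not call `torch.load`; the helper names `write_first_byte` and `demo` are made up for this example:

```python
import mmap
import os
import tempfile


def write_first_byte(path: str, flags: int) -> bytes:
    """Map the file with the given flags, overwrite byte 0, return the file's contents."""
    with open(path, "r+b") as f:
        mapping = mmap.mmap(
            f.fileno(), 0, flags=flags, prot=mmap.PROT_READ | mmap.PROT_WRITE
        )
        try:
            mapping[0:1] = b"X"
            if flags == mmap.MAP_SHARED:
                mapping.flush()  # push the change to the underlying file
        finally:
            mapping.close()
    with open(path, "rb") as f:
        return f.read()


def demo(flags: int) -> bytes:
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"hello")
        return write_first_byte(path, flags)
    finally:
        os.remove(path)


print(demo(mmap.MAP_PRIVATE))  # file unchanged: the write stays in a private copy
print(demo(mmap.MAP_SHARED))   # file updated: the write is carried through
```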

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124889
Approved by: https://github.com/albanD
2024-04-30 15:02:19 +00:00
Avik Chaudhuri
e7846447e0 dynamic shapes builder API (#124898)
This PR introduces a new way of building `dynamic_shapes` for export. The idea is to build up a mapping from input tensors to the dynamic shapes that should be assigned to their corresponding fake tensors.

This mapping is automatically converted to the current form of `dynamic_shapes`, which must exactly match the structure of inputs. We do this by using pytree utils.

With the current `dynamic_shapes`, we had to be careful about user-defined classes that are registered with pytree, since  such classes are not necessarily polymorphic containers; they may be fine containing tensors, but not dynamic shapes. Thus we had decided to allow input instances of such classes to be associated with dynamic shapes in flattened form. This decision needs to be mirrored in this PR as well. To make it easier to keep these code paths in sync, we refactor the current recursive procedure for associating inputs with dynamic shapes to use the same pytree utils. This needs minor fixes to a few tests where `dynamic_shapes` were not exactly matching the structure of inputs.

Differential Revision: D56551992

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124898
Approved by: https://github.com/zhxchen17
2024-04-30 03:59:49 +00:00
Tristan Rice
dc4c75ba72 elastic/rendezvous: make barrier and rank assignment operations O(n) instead of O(n^2) (#124982)
Summary:
This makes barrier and rank operations linear instead of quadratic with the number of workers. This drastically improves performance for rendezvous when running with over 1000 hosts.

This uses 2 approaches for different areas:

* local rank assignment: each worker does 1 set and 1 get, local ranks are assigned on the rank 0 host in an O(n) operation which reduces total store operations to be linear with number of workers.
* exit_barrier: use a counter and a final flag so each worker has to do max 1 set, 1 get and 1 add.
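
The counter-plus-flag idea for the exit barrier can be sketched as follows. This is a toy model under stated assumptions: `InMemoryStore` is a hypothetical in-process stand-in for a key-value store with `set`/`get`/`add`, threads play the role of workers, and the key names are invented:

```python
import threading


class InMemoryStore:
    """Toy key-value store; counts every operation issued against it."""

    def __init__(self):
        self._data = {}
        self._cond = threading.Condition()
        self.ops = 0

    def set(self, key, value):
        with self._cond:
            self.ops += 1
            self._data[key] = value
            self._cond.notify_all()

    def get(self, key):
        # Blocks until the key has been set.
        with self._cond:
            self.ops += 1
            self._cond.wait_for(lambda: key in self._data)
            return self._data[key]

    def add(self, key, amount):
        # Atomic increment that returns the new value.
        with self._cond:
            self.ops += 1
            self._data[key] = self._data.get(key, 0) + amount
            return self._data[key]


def exit_barrier(store, world_size):
    arrived = store.add("barrier/count", 1)   # 1 add per worker
    if arrived == world_size:
        store.set("barrier/done", b"1")       # 1 set, by the last arrival only
    store.get("barrier/done")                 # 1 get per worker


def run(world_size):
    store = InMemoryStore()
    workers = [
        threading.Thread(target=exit_barrier, args=(store, world_size))
        for _ in range(world_size)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return store.ops


print(run(8))  # 2 * 8 + 1 = 17 store operations
```

Each worker issues a constant number of store operations, so the total grows linearly with the number of workers.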

At 4000 hosts we see torchelastic be able to run in as little as 10 seconds down from 373 seconds.

Test Plan:
This is tested using many small tests running on a remote cluster.

{D56549942}

```
torchx run --scheduler mast -- --image=torchelastic_benchmark --j=4000x1
```

Differential Revision: D56605193

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124982
Approved by: https://github.com/kiukchung, https://github.com/kurman
2024-04-27 02:21:44 +00:00
egienvalue
73744a2c00 torch.mtia module for MTIA device backend (#123612)
MTIA device has its own Module in PyTorch now.
torch.mtia has the following APIs, similar to other backends. lazy_init is also supported.
```
__all__ = [
    "init",
    "is_available",
    "synchronize",
    "device_count",
    "current_device",
    "current_stream",
    "default_stream",
    "set_stream",
    "stream",
    "device",
]

```
------------
For device management, we expand AcceleratorHooksInterface to support generic device management so that it can be used in both C++ and Python.
```
def _accelerator_hooks_device_count() -> _int: ...
def _accelerator_hooks_set_current_device(device_index: _int) -> None: ...
def _accelerator_hooks_get_current_device() -> _int : ...
def _accelerator_hooks_exchange_device(device_index: _int) -> _int : ...
def _accelerator_hooks_maybe_exchange_device(device_index: _int) -> _int : ...
```

---------
Adding get_device_module API to retrieve device modules for different device types.
```
def get_device_module(device: Optional[Union[torch.device, str]] = None)
```
---------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123612
Approved by: https://github.com/albanD
ghstack dependencies: #123611
2024-04-26 16:17:54 +00:00
Yu, Guangye
19a83eacb5 add new API torch.amp.is_autocast_available (#124938)
# Motivation
Expose `torch._is_autocast_available` as `torch.amp.is_autocast_available`, a public API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124938
Approved by: https://github.com/albanD
2024-04-26 08:45:20 +00:00
PyTorch MergeBot
e04c7b19f4 Revert "torch.mtia module for MTIA device backend (#123612)"
This reverts commit 381653de63.

Reverted https://github.com/pytorch/pytorch/pull/123612 on behalf of https://github.com/jeffdaily due to this PR broke ROCm with message RuntimeError: Cannot have MTIA with other devices ([comment](https://github.com/pytorch/pytorch/pull/123612#issuecomment-2077649762))
2024-04-25 16:06:46 +00:00
Edward Z. Yang
b4597fffce Try to reuse old symbol name rather than new symbol name when renaming (#124782)
Previously, unbacked SymInts would gradually get larger and larger as we kept rebinding them.  Now, we do the replacement to preserve the old symbol.

Actually doing this is a bit tricky.  Here’s the order things happen when retracing data dependent:

1. Run fake tensor prop: allocate new unbacked SymInt
2. Run proxy tensor mode, calculate bindings and associate them with FX node
3. Run PropagateUnbackedSymInts, rename unbacked bindings to their old ones so they are consistent

So the problem is when we calculate bindings in step (2), we don't know
what the original names are yet, we only find out later at (3).  But by
the time (3) runs, we've already stuffed some new bindings in
meta["unbacked_bindings"] and we don't know how to update them!  To fix
this, I introduce resolve_unbacked_bindings which post facto applies any
of the renamings we discovered in (3).
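
A simplified sketch of the post facto fix-up described above. Plain strings stand in for the real symbols and key paths, and the function name carries a `_sketch` suffix to make clear it is not the actual implementation:

```python
def resolve_unbacked_bindings_sketch(bindings: dict, renamings: dict) -> dict:
    """Rewrite binding keys recorded in step (2) using renamings found in step (3)."""
    return {
        renamings.get(symbol, symbol): keypath
        for symbol, keypath in bindings.items()
    }


# Step (2): the node records a binding under the freshly allocated name.
recorded = {"u5": "size()[0]"}
# Step (3): we later learn that u5 should go by its old name u0.
renamings = {"u5": "u0"}

print(resolve_unbacked_bindings_sketch(recorded, renamings))
print(resolve_unbacked_bindings_sketch({"u7": "size()[1]"}, renamings))
```

Symbols without a recorded renaming pass through unchanged, so the helper is safe to apply to every node's bindings.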

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124782
Approved by: https://github.com/lezcano
ghstack dependencies: #124310, #124314, #124316, #124394, #124739
2024-04-25 14:02:42 +00:00
Edward Z. Yang
13ab24f192 Reimplement unbacked symbol bindings in Inductor (#124394)
This PR has a lot of "draw the rest of the fucking owl" energy. Here's how to break it down.

1. **torch/_inductor/graph.py** - We start by tightening unbacked symbol invariants. Specifically, as we lower FX nodes, we check whether or not every unbacked_binding recorded on the FX node meta, actually ends up getting bound (according to get_unbacked_symbol_defs) in all the buffers generated by the lowering. Hopefully this invariant is self evident. This leads to a lot of failures.
2. **torch/_inductor/ir.py** - Problem 1: There is softness in how Inductor computes defs of unbacked symbols in IR node. Previously, we tried to infer it by looking at the output sizes/strides/etc and see if new unbacked symbols popped up that we hadn't seen in the inputs. I don't know exactly what was buggy about the old code, but sometimes we would fail to notice an unbacked symbol had been bound, or rebind an unbacked symbol multiple times. Fortunately, thanks to the earlier PRs in our stack, we now have a nice list of unbacked symbol bindings from FX, so we now just store it directly on ExternKernel and use it directly to report defs. This has to be done twice: once for FallbackKernel (e.g., nonzero) and once for DynamicScalar (e.g., item) (see also **torch/_inductor/lowering.py**, **torch/_inductor/codegen/wrapper.py** and  **torch/_inductor/codegen/cpp_wrapper_cpu.py** for the lowering and codegen changes for item)
   * **process_kernel** - Sidequest! It turns out that Inductor lowering can reallocate unbacked symbols. This happens specifically when we repropagate fake tensors through the operator in `process_kernel`. This repropagation process is necessary because Inductor may have changed the strides of input tensors, and it must now recompute the strides so that it can continue to appropriately plan the rest of the lowering process. This is fine: we just make sure we do the rebind unbacked + compute_unbacked_bindings dance we've been doing previously in the PR stack. But instead of putting unbacked_bindings on a new FX node, they go straight into our unbacked_bindings on the Inductor IR node.
    * **codegen_unbacked_symbol_defs** - Sidequest! FallbackKernel lowering is done in two steps. First, you emit the FallbackKernel buffer. Then, you emit MultiOutput buffers which actually give access to the individual outputs of FallbackKernel, which may have been multi-output. There is a design decision here: does the FallbackKernel bind the unbacked symbols, or the MultiOutput buffer? Historically, we put the binding on MultiOutput buffer, because it's more convenient: the FallbackKernel buffer is fake, in fact, it doesn't even get a name in C++ codegen. But it's kind of inconsistent with the keypath model that we've been tracking unbacked bindings with: if you have a multi-output node, you'd expect a keypath like `[0].size()[0]` representing the first output's first dimension size. That suggests that it's the FallbackKernel that should define the things. So that was my first implementation. Unfortunately, the C++ codegen is too cursed and I could not understand how to make it work in that case. So now we just unsoundly assume you cannot have multi-output data dependent output, and do the codegen in MultiOutput. There are some comments explaining exactly what we are improperly assuming.
3. **_rename_unbacked_to** in **torch/fx/experimental/symbolic_shapes.py** - Previously, when we renamed unbacked symbols, we clobbered any facts we previously knew about them. So for example, if we had a replacement `u0 -> s0` but then we renamed u0 to u1, we would now setup the replacement `u0 -> u1`, clobbering the old replacement. This apparently didn't matter in earlier PRs in the stack, but with Inductor now on the ball, there were some tests that indicated this was a problem. The solution is easy: if u0 had a preexisting replacement, reapply it to u1. However...
    * **torch/_functorch/_aot_autograd/collect_metadata_analysis.py** - When we run forward analysis, this triggers fake tensor repropagation and fresh allocations. Previously, we just cleared out the pending symbols when we finished the analysis. But with the change above, this would also migrate replacements to the new symbols... which are now dead. So now we explicitly suppress generation of these symbols with `ignore_fresh_unbacked_symbols` so that no rebinding happens at all.
    * **torch/_dynamo/eval_frame.py** - same deal; I just searched for all sites we called clear() on pending
4. The last step is fixing the long tail of extra problems that show up, now that unbacked_bindings are load bearing into Inductor
    * **torch/_dynamo/eval_frame.py** - Some of the exports are making copies of nodes without repropagating fake tensors, so in this case, it is important to also copy the `unbacked_bindings` (apparently this didn't matter before without the Inductor changes)
    * **torch/_export/pass_base.py** - I discover that this is doing fake tensor repropagation via a test suite failure. Do the same playbook as AOTAutograd: PropagateUnbackedSymInts too!  Actually, they also have implemented their own tracer as well, so do the same playbook as proxy_tensor: record unbacked_bindings on the newly traced nodes. UGH code duplication.
    * **torch/_subclasses/fake_tensor.py**, **torch/_subclasses/fake_impls.py** (with call site updates at  **torch/_functorch/_aot_autograd/traced_function_transforms.py** and **torch/fx/passes/fake_tensor_prop.py**) - What's this new epoch thing? I noticed that sometimes I would be retracing, call nonzero() on a fake tensor, and not allocate a new unbacked symbol. This is actually bad, because if I don't get a new unbacked symbol, I don't know there's a binding site, and `unbacked_bindings` is now missing a binding. The reason for this is memoization: if I reuse the exact same fake tensor on my retrace, it will already have an unbacked symint memoized on it and we will short circuit allocation. Well, that's no good. So I associate the memos with a fake tensor epoch, and every time you start a new fake tensor propagation from scratch, you bump the epoch so that I clear all the memos.
    * **torch/_inductor/scheduler.py** - I notice in unit tests that V.current_node is not always set when we call process_kernel. So I save it into the IR node and restore it when we are running `get_estimated_runtime`.
    * **torch/fx/experimental/symbolic_shapes.py** - A few things
      * **rebind_unbacked** (re **_tensor_version**). Ordinarily, when you have an unbacked SymInt, you persistently have it all the way to the end of the program. `_tensor_version` violates this: it generates an unbacked SymInt (for reasons I don't quite understand?) and then gets rid of it later. This triggered an assert violation. I think this op is kind of misusing unbacked SymInt, but I didn't know how to refactor it, so it gets a special case.
      * **rebind_unbacked** (re **Simplify SymBool binding**). Ugh, SymBool, what a pain in the butt. I have an assert that you can only rebind unbacked symbol to another unbacked symbol. This assert fails when a boolean is involved, because the result of running keypath on the result is not `u1`, it's `sympy.Piecewise(... sympy.Eq(u1, 1) ...)`. This is actually just `u1`, but Sympy doesn't know it because it doesn't know that `u1` value range is `[0, 1]`. So we manually implement the simplification needed to get the assert to pass.
      * **compute_unbacked_bindings** (re **This is pretty fragile**). There is a really funny disaster involving memoization and Inductor process kernel. Ordinarily when I retrace, if there was a memo hit in the old trace, there will be a memo hit in the new trace. However, Inductor process kernel breaks this, because it recreates fake tensor inputs to the operator call from scratch (since they might have different strides), and obviously these tensor inputs don't have the memo from the old one. I tried a little bit to try to manually transplant the memo to the new fake tensor but it seemed hopeless, so I just let the fresh symbol ride, allocating a new unbacked symbol. However, in one of our tests, we rely on knowing that the first nonzero call is equal to the second (memoized) nonzero call. The equality test looked pretty easy to discharge, so I just went ahead and added a deferred runtime assert to this effect and it worked.
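
To make the epoch idea from the fake tensor item above concrete, here is a minimal toy sketch. It is not the real FakeTensor code: `ToyFakeMode`, `ToyFakeTensor`, `bump_epoch` and `nonzero_size` are hypothetical names made up for illustration. The only point is that a memo is honored only if it was recorded in the current epoch, so starting a new propagation forces a fresh symbol.

```python
import itertools


class ToyFakeTensor:
    """Hypothetical stand-in for a fake tensor; it only holds a memo slot."""

    def __init__(self):
        self.nonzero_memo = None
        self.nonzero_memo_epoch = None


class ToyFakeMode:
    """Hypothetical stand-in for the fake tensor mode that owns the epoch."""

    def __init__(self):
        self.epoch = 0
        self._counter = itertools.count()

    def fresh_symbol(self):
        # Allocate a new unbacked symbol name: u0, u1, ...
        return f"u{next(self._counter)}"

    def bump_epoch(self):
        # Called when a new propagation starts from scratch.
        self.epoch += 1

    def nonzero_size(self, t):
        # Reuse the memo only if it was recorded in the current epoch.
        if t.nonzero_memo is not None and t.nonzero_memo_epoch == self.epoch:
            return t.nonzero_memo
        sym = self.fresh_symbol()
        t.nonzero_memo = sym
        t.nonzero_memo_epoch = self.epoch
        return sym
```

Within one epoch, repeated calls on the same tensor return the same symbol; after `bump_epoch()` the stale memo is ignored and a new binding site appears.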

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124394
Approved by: https://github.com/jansel
ghstack dependencies: #124310, #124314, #124316
2024-04-25 02:08:59 +00:00
Edward Z. Yang
9692b954c6 FakeTensorProp works with unbacked bindings (#124310)
This is a partial revert of https://github.com/pytorch/pytorch/pull/124059

Like in #124297, profiling has revealed that testing equality on *every* output is kind of expensive. So we only test equality when we know there is an unbacked binding.  This is the same playbook as the previous PR, just on FakeTensorProp instead of PropagateUnbackedSymInts. Note that we also need to populate `unbacked_bindings` in proxy_tensor.py, since we're generating an entirely new graph in that case.

We now have enough propagation that we're able to trigger a bug related to divisibility replacement. In https://github.com/pytorch/pytorch/pull/113165 we allowed replacing `u0` with `u1 * c` for some constant c, when we have determined that u0 is divisible by c. However, where does the binding for u1 come from? What we will have in practice is that there is some node that is supposed to have bound u1, but which actually is getting a `u1 * c` in its output. So, to get u1, we must divide out c. Fortunately, under the divisibility condition, this is always possible (but remember, we must test divisibility at runtime!)
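
As a rough illustration of "divide out c, but test divisibility at runtime", here is a toy sketch on plain integers. `recover_bound_symbol` is a hypothetical helper, not a function in symbolic_shapes.py; it assumes the node's output at runtime is the concrete value of `u1 * c`.

```python
def recover_bound_symbol(observed: int, c: int) -> int:
    """Recover the runtime value of u1 from an observed output equal to u1 * c."""
    if c == 0:
        raise ValueError("divisor must be nonzero")
    # The replacement u0 -> u1 * c is only valid if divisibility really holds,
    # so it has to be checked when the concrete value is known.
    if observed % c != 0:
        raise AssertionError(
            f"runtime assert failed: {observed} is not divisible by {c}"
        )
    return observed // c
```

For example, an observed size of 12 with c = 4 yields u1 = 3, while an observed size of 13 trips the runtime assert.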

Because we have tightened up asserts, it is now an error to allocate unbacked SymInts and then fail to track them under unbacked_bindings. In torch/_dynamo/eval_frame.py and torch/_functorch/_aot_autograd/collect_metadata_analysis.py there are examples of benign cases where we repropagated fake tensors but then immediately threw away the results. In these cases, it's not appropriate to rebind, since we're still using the old FX graph that has all of the old symbols. So we just manually clear it. It is possible that other cases will need to be updated, so this PR is "risky" from the perspective of hitting fbcode.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124310
Approved by: https://github.com/lezcano
2024-04-25 02:08:51 +00:00
Gagan Jain
c5e567c573 [Torch][Timer] Adding debug info logging interface for expired timers (#123883)
Summary:
Adding function to log additional debug information before killing the expired watchdog timers.

Additional information like stack trace can be added in the debug function using worker process IDs from expired timers.

Test Plan: buck test mode/opt caffe2/test/distributed/elastic/timer:file_based_timer_test

Differential Revision: D56044153

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123883
Approved by: https://github.com/kurman
2024-04-25 01:15:52 +00:00
egienvalue
381653de63 torch.mtia module for MTIA device backend (#123612)
The MTIA device now has its own module in PyTorch.
torch.mtia has the following APIs, similar to other backends. lazy_init is also supported.
```
__all__ = [
    "init",
    "is_available",
    "synchronize",
    "device_count",
    "current_device",
    "current_stream",
    "default_stream",
    "set_stream",
    "stream",
    "device",
]

```
------------
For device management, we expand AcceleratorHooksInterface to support generic device management; it can be used in both C++ and Python.
```
def _accelerator_hooks_device_count() -> _int: ...
def _accelerator_hooks_set_current_device(device_index: _int) -> None: ...
def _accelerator_hooks_get_current_device() -> _int : ...
def _accelerator_hooks_exchange_device(device_index: _int) -> _int : ...
def _accelerator_hooks_maybe_exchange_device(device_index: _int) -> _int : ...
```

---------
Adding get_device_module API to retrieve device modules for different device types.
```
def get_device_module(device: Optional[Union[torch.device, str]] = None)
```
---------

Differential Revision: [D56443356](https://our.internmc.facebook.com/intern/diff/D56443356)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123612
Approved by: https://github.com/albanD
ghstack dependencies: #123611
2024-04-24 20:51:20 +00:00
Edward Z. Yang
e0e2d897ed Handle Tensor returns in PropagateUnbackedSymInts (#124297)
This subsumes https://github.com/pytorch/pytorch/pull/124069

In the original PR, my idea was that when we run PropagateUnbackedSymInts, we check that the sizes before and after are exactly the same. This ended up turning up lots of bugs that I didn't feel like fixing. Separately, Ivan let me know that this pass was quite expensive in terms of compile time, since we spent a lot of time thinking about the equalities.

To kill two birds with one stone, we now only check for equality precisely when an unbacked SymInt was bound (thanks to the previous PR in this stack, we now have this information). Specifically, we look to see if `meta["unbacked_bindings"]` is set on the old node, and if it is, we assert the old value is equal to the new value from the repropagation. Note that the pytree key is used to actually extract the new value from the example value, as it may be nested inside an, e.g., tensor size.

We do something a bit naughty at the end: we use `defer_runtime_assert` to actually teach ShapeEnv about the equality. This is implementationally equivalent to what we used to do, but we're going to change this soon.
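
A toy sketch of the check described above, assuming a simplified key path made of `("index", i)` and `("size", d)` steps. `ToyTensor`, `resolve_path` and `equalities_to_assert` are hypothetical names for illustration; the real pass works on pytree key paths and fake tensors.

```python
from dataclasses import dataclass


@dataclass
class ToyTensor:
    """Hypothetical stand-in for a fake tensor; sizes are ints or symbol names."""

    shape: tuple


def resolve_path(value, path):
    """Follow a toy key path into an example value."""
    for kind, arg in path:
        if kind == "index":
            value = value[arg]
        elif kind == "size":
            value = value.shape[arg]
        else:
            raise ValueError(f"unknown path step: {kind!r}")
    return value


def equalities_to_assert(old_meta, new_example_value):
    """Return (old_symbol, new_value) pairs, only for nodes that bound a symbol."""
    bindings = old_meta.get("unbacked_bindings")
    if not bindings:
        return []  # no binding on this node: skip the equality check entirely
    return [
        (sym, resolve_path(new_example_value, path))
        for sym, path in bindings.items()
    ]
```

Nodes without `unbacked_bindings` cost nothing here, which is the compile-time saving; nodes with a binding produce exactly the equalities that would be handed to the deferred runtime assert.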

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124297
Approved by: https://github.com/lezcano
ghstack dependencies: #124290
2024-04-24 12:18:33 +00:00
Edward Z. Yang
b04dca1502 Add pending_fresh_unbacked_symbols, populate unbacked_bindings for Dynamo (#124290)
The important comment:

```
        # Whenever we allocate a fresh unbacked Symbol, we add it to this
        # pending list.  Unbacked symbol allocation can occur at unpredictable
        # points during meta tensor propagation, but at some point, we
        # have to know what the binding site for an unbacked symbol is, and
        # this is computed when we actually place the node in the graph.  The
        # important thing is that we always actually handle every unaccounted
        # for unbacked symbol, so this list helps us keep track of them and
        # then make sure they are all accounted for.
        #
        # We could potentially give rise to errors earlier by lexically
        # scoping when we do propagation, and only allowing unbacked symbols
        # to be allocated at this point in time.  However this is inconvenient
        # to do in Dynamo, because fake tensor propagation is far from when we
        # analyze binding sites (set_example_value), so we do it in a more
        # mutatey way.
        #
        # NB: fresh unbacked symbols NEVER get substitutions applied to them,
        # they are binding sites!
```

The compute_unbacked_bindings is the other half of the equation: the thing that actually consumes the pending_fresh_unbacked_symbols and does something with them. Important comment:

```
    After having run fake tensor propagation and producing example_value
    result, traverse example_value looking for freshly bound unbacked
    symbols and record their paths for later.  It is an error if
    we have allocated an unbacked SymInt but it cannot be found in
    example_value.  (NB: this means if you have a multi-output
    function, you must call this on the tuple of tensor output, you
    cannot wait!)
```

For example, if I return a tensor with size `[u0, u1]`, and u1 is a fresh unbacked SymInt, then I'll have `{u1: KeyPath(".size(1)")}`, telling me I can get u1 by running `size(1)` on the result of this node. u0 is not fresh (it probably flowed in as an argument), so I don't generate a binding for it.
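
The two halves can be sketched together in a few lines. This is a toy model, assuming symbols are plain strings and the example value is just a tuple of sizes; `compute_bindings` is a hypothetical simplification, not the real `compute_unbacked_bindings`.

```python
def compute_bindings(pending, example_sizes):
    """Map each pending fresh symbol to the place it appears in the sizes.

    pending: list of fresh symbol names, consumed (cleared) on success.
    example_sizes: tuple of size entries (ints or symbol names).
    """
    bindings = {}
    for dim, entry in enumerate(example_sizes):
        # Only fresh symbols get a binding; symbols that flowed in do not.
        if entry in pending and entry not in bindings:
            bindings[entry] = f".size({dim})"
    missing = [sym for sym in pending if sym not in bindings]
    if missing:
        # Every allocated symbol must be accounted for somewhere in the output.
        raise AssertionError(
            f"fresh unbacked symbols not found in example value: {missing}"
        )
    pending.clear()
    return bindings
```

With `pending = ["u1"]` and sizes `("u0", "u1")`, this yields a binding for u1 at `.size(1)` and none for u0, matching the example above.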

I eventually intend to propagate this information all the way to Inductor lowering, where extra metadata about unbacked symbol binding will be canonically used for codegen, instead of trying to infer it from defs/uses.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124290
Approved by: https://github.com/lezcano
2024-04-24 09:11:34 +00:00
rzou
4ceb44c40d Add torch.library.opcheck (#124496)
This PR:
- exposes torch.testing._internal.optests.opcheck as
  torch.library.opcheck
- Adds support for CustomOpDef (aka functions decorated with
  torch.library.custom_op) to opcheck.

Test Plan:
- Updated tests
- We validated opcheck's design internally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124496
Approved by: https://github.com/williamwen42
2024-04-23 21:48:00 +00:00
Matthew Hoffman
1d3a13d3d1 Conform torch.mps to device module interface (#124676)
Right now `torch.fork_rng()` doesn't support MPS. MPS' device module functions don't line up with the others'. There is a step of `fork_rng` to call `device_count()`:

302d7e9a6e/torch/random.py (L146)

It is pretty simple to know the MPS device count, based on whether it is built and available.
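
A minimal sketch of that rule, with the two conditions passed in as plain booleans so it runs without a PyTorch build. `mps_device_count` here is a hypothetical standalone function, not the actual `torch.mps` implementation.

```python
def mps_device_count(is_built: bool, is_available: bool) -> int:
    """Report 1 when MPS is both built and available, otherwise 0."""
    return int(is_built and is_available)
```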

Also:

302d7e9a6e/torch/random.py (L168)

302d7e9a6e/torch/random.py (L175)

`get_rng_state` and `set_rng_state` are expected to be able to accept a `device` parameter.

@ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124676
Approved by: https://github.com/ezyang
2024-04-23 18:38:48 +00:00
Jeff Daily
6ede882c0b preferred blas library; cublaslt gemm implementation (#122106)
Following the example of PyTorch supporting a preferred Linalg library (cusolver or magma), this PR introduces a preferred blas library selector of either cublas or cublaslt for CUDA and hipblas or hipblaslt for ROCm via normal hipification of sources.

The default blas implementation remains cublas or hipblas.  cublaslt or hipblaslt can be enabled using environment variable TORCH_BLAS_PREFER_CUBLASLT=1 (or TORCH_BLAS_PREFER_HIPBLASLT=1 as an alias) or by calling `torch.backends.cuda.preferred_blas_library(backend="cublaslt")` or as an alias `backend="hipblaslt"`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122106
Approved by: https://github.com/lezcano
2024-04-22 15:38:22 +00:00
PyTorch MergeBot
929242a15c Revert "torch.mtia module for MTIA device backend (#123612)"
This reverts commit d7e1bf9ff9.

Reverted https://github.com/pytorch/pytorch/pull/123612 on behalf of https://github.com/jeffdaily due to This broke ROCm. see test_overrides.py ([comment](https://github.com/pytorch/pytorch/pull/123611#issuecomment-2067363780))
2024-04-19 22:44:26 +00:00
rzou
bad8d25881 Add torch.library.register_kernel (#124299)
This mirrors the .register_kernel method on the object produced by the
custom_op decorator.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124299
Approved by: https://github.com/albanD
ghstack dependencies: #124180, #124200
2024-04-19 13:54:21 +00:00
Tobias Ringwald
58e403c739 Added a docstring for torch.Size.numel. (#124186)
Fixes #61231. Fixes #124167.

This PR documents a rather long-standing issue w.r.t. unexpected behavior of `torch.Size.numel`, first reported almost 5 years ago.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124186
Approved by: https://github.com/janeyx99
2024-04-19 09:23:02 +00:00
egienvalue
d7e1bf9ff9 torch.mtia module for MTIA device backend (#123612)
MTIA device has its own Module in PyTorch now.
torch.mtia has the following APIs, similar to other backends. lazy_init is also supported.
```
__all__ = [
    "init",
    "is_available",
    "synchronize",
    "device_count",
    "current_device",
    "current_stream",
    "default_stream",
    "set_stream",
    "stream",
    "device",
]

```
------------
For device management, we expand AcceleratorHooksInterface to support generic device management; it can be used in both C++ and Python.
```
def _accelerator_hooks_device_count() -> _int: ...
def _accelerator_hooks_set_current_device(device_index: _int) -> None: ...
def _accelerator_hooks_get_current_device() -> _int : ...
def _accelerator_hooks_exchange_device(device_index: _int) -> _int : ...
def _accelerator_hooks_maybe_exchange_device(device_index: _int) -> _int : ...
```

---------
Adding get_device_module API to retrieve device modules for different device types.
```
def get_device_module(device: Optional[Union[torch.device, str]] = None)
```
---------
@exported-using-ghexport

Differential Revision: [D52923602](https://our.internmc.facebook.com/intern/diff/D52923602/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123612
Approved by: https://github.com/albanD
ghstack dependencies: #123611
2024-04-18 17:38:06 +00:00
Boyuan Feng
aa2da0cdd2 [Export] Add runtime assert to non-strict export (#123681)
This PR moves insert_deferred_runtime_asserts from dynamo to torch.fx.passes and uses it to add runtime assertion for non-strict export.

Differential Revision: D55944267

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123681
Approved by: https://github.com/tugsbayasgalan, https://github.com/angelayi
2024-04-18 16:13:27 +00:00
rzou
645173a0b5 Add torch.library.register_autograd (#124071)
Allows registering autograd for all custom op entry points:
- the new-style custom op API (custom_op)
- the old-style torch.library APIs
- C++ operator registration

Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124071
Approved by: https://github.com/albanD
ghstack dependencies: #123937, #124064, #124065, #124066
2024-04-18 12:47:59 +00:00
doloresgarcia
4efdf9a6a6 fix pytorch version for onnx in doc (#124182)
Fixes [123845](https://github.com/pytorch/pytorch/issues/123845)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124182
Approved by: https://github.com/albanD
2024-04-17 18:05:15 +00:00
rzou
47dbfecd37 Rename impl_abstract to register_fake, part 1/2 (#123937)
This PR:
- adds a new torch.library.register_fake and deprecates
  torch.library.impl_abstract. The motivation is that we have a lot of
  confusion around the naming so we are going to align the naming with
  the actual subsystem (FakeTensor).
- renames `m.impl_abstract_pystub("fbgemm_gpu.sparse_ops")` to
  `m.has_python_registration("fbgemm_gpu.sparse_ops")`. No deprecation
  here yet; I need to test how this works with static initialization.
- Renames a bunch of internals to match (e.g. abstractimplpystub ->
  pystub)

I'm scared to rename the Python-side internal APIs (e.g.
torch._library.abstract_impl) because of torch.package concerns. I'll do
that in its own isolated PR next just in case it causes problems.

DEPRECATION NOTE: torch.library.impl_abstract was renamed to
torch.library.register_fake. Please use register_fake. We'll delete
impl_abstract in a future version of PyTorch.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123937
Approved by: https://github.com/albanD
2024-04-17 12:46:01 +00:00
Edward Z. Yang
cebf65126c FakeTensorProp assert consistency of sizes when metadata previously existed (#124059)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124059
Approved by: https://github.com/bdhirsh, https://github.com/thiagocrepaldi
ghstack dependencies: #124105
2024-04-16 23:28:42 +00:00
lezcano
891736f115 Fix links rendering when surrounding code in Dynamo deepdive (#123427)
I thought the RST was rendering correctly, but here we are.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123427
Approved by: https://github.com/peterbell10
2024-04-13 04:55:15 +00:00
Gagan Jain
016ca546aa Adding health check server hook in torch elastic (#122750) (#123504)
Summary:

Builds a hook for an external mechanism to monitor the health of the torch elastic launcher. The health check server depends on FileTimerServer to check whether the launcher is healthy. It will always report healthy if FileTimerServer is disabled.

The implementation of start_healthcheck_server is unsupported; however, a TCP/HTTP server can be started on a specific port to monitor the aliveness of worker_watchdog and take action accordingly.

Test Plan: buck test mode/opt caffe2/test/distributed/elastic/agent/server/test:local_agent_test

Differential Revision: D55837899

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123504
Approved by: https://github.com/kurman
2024-04-11 19:10:56 +00:00
Edward Z. Yang
bbcdd28409 Report LRU cache stats at end of program for symbolic shapes (#123724)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123724
Approved by: https://github.com/Chillee
2024-04-11 05:12:43 +00:00
PyTorch MergeBot
ecb2418dd6 Revert "Adding health check server hook in torch elastic (#122750)"
This reverts commit 61d431fab0.

Reverted https://github.com/pytorch/pytorch/pull/122750 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/122750#issuecomment-2041104931))
2024-04-06 14:31:07 +00:00
Gagan Jain
61d431fab0 Adding health check server hook in torch elastic (#122750)
Summary:
Builds a hook for an external mechanism to monitor the health of the torch elastic launcher. The health check server depends on FileTimerServer to check whether the launcher is healthy. It will always report healthy if FileTimerServer is disabled.

The implementation of start_healthcheck_server is unsupported; however, a TCP/HTTP server can be started on a specific port to monitor the aliveness of worker_watchdog and take action accordingly.

Test Plan: buck test mode/opt caffe2/test/distributed/elastic/agent/server/test:local_agent_test

Differential Revision: D55108182

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122750
Approved by: https://github.com/kurman
2024-04-05 23:17:30 +00:00
Huy Do
f5b8c9b730 Ignore some known duplicated modules in doc build config script (#123425)
This is a follow-up fix of https://github.com/pytorch/pytorch/pull/123244#discussion_r1552935150 as @clee2000 points out a better way to ignore those duplicated entries.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123425
Approved by: https://github.com/clee2000
2024-04-05 21:12:14 +00:00
Lucas Pasqualin
de7edeea25 [DCP] DCP logger (#121352)
Adds additional logging for improved observability in DCP.

Differential Revision: [D54512626](https://our.internmc.facebook.com/intern/diff/D54512626/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121352
Approved by: https://github.com/wz337, https://github.com/fegin
2024-04-05 17:50:50 +00:00
Guilherme Leobas
c575e378ba Update torch.compile_faq w.r.t to functorch (#122213)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122213
Approved by: https://github.com/zou3519
ghstack dependencies: #122211, #122212
2024-04-05 03:29:11 +00:00
Guilherme Leobas
84658d9c4f Enable capture_func_transforms by default (#122211)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122211
Approved by: https://github.com/zou3519
2024-04-05 03:29:11 +00:00
Huy Do
3d20cc1332 Cleanup some duplicated placeholder py:module docs (#123244)
Fixes https://github.com/pytorch/pytorch/issues/123068
Fixes https://github.com/pytorch/pytorch/issues/111256

While investigating the flaky doc build failure w.r.t. the duplicated `torch.ao.quantization.quantize` docstring warning, i.e. https://github.com/pytorch/pytorch/actions/runs/8532187126/job/23376591356#step:10:1260, I discovered an old but still open bug in Sphinx https://github.com/sphinx-doc/sphinx/issues/4459. These warnings have always been there, but they are hidden because we are using `-j auto` to build docs with multiple threads. It's just by chance that they start to surface now.

The issue can be reproduced by removing `-j auto` from https://github.com/pytorch/pytorch/blob/main/docs/Makefile#L5 and running `make html` locally. These warnings then show up consistently. As `make html` treats warnings as errors, they will fail the build.

```
...
/data/users/huydo/conda/py3.8/lib/python3.8/site-packages/torch/ao/quantization/quantize.py:docstring of torch.ao.quantization.quantize.quantize:1: WARNING: duplicate object description of torch.ao.quantization.quantize, other instance in quantization, use :noindex: for one of them
/data/users/huydo/conda/py3.8/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py:docstring of torch.nn.parallel.data_parallel.data_parallel:1: WARNING: duplicate object description of torch.nn.parallel.data_parallel, other instance in nn, use :noindex: for one of them
/data/users/huydo/conda/py3.8/lib/python3.8/site-packages/torch/nn/utils/spectral_norm.py:docstring of torch.nn.utils.spectral_norm.spectral_norm:1: WARNING: duplicate object description of torch.nn.utils.spectral_norm, other instance in nn, use :noindex: for one of them
/data/users/huydo/conda/py3.8/lib/python3.8/site-packages/torch/nn/utils/weight_norm.py:docstring of torch.nn.utils.weight_norm.weight_norm:1: WARNING: duplicate object description of torch.nn.utils.weight_norm, other instance in nn, use :noindex: for one of them
/data/users/huydo/github/pytorch/docs/source/nn.rst:579: WARNING: duplicate object description of torch.nn.parallel.data_parallel, other instance in generated/torch.nn.functional.torch.nn.parallel.data_parallel, use :noindex: for one of them
/data/users/huydo/github/pytorch/docs/source/nn.rst:594: WARNING: duplicate object description of torch.nn.utils.spectral_norm, other instance in generated/torch.nn.utils.spectral_norm, use :noindex: for one of them
/data/users/huydo/github/pytorch/docs/source/nn.rst:595: WARNING: duplicate object description of torch.nn.utils.weight_norm, other instance in generated/torch.nn.utils.weight_norm, use :noindex: for one of them
/data/users/huydo/github/pytorch/docs/source/quantization.rst:1348: WARNING: duplicate object description of torch.ao.quantization.quantize, other instance in generated/torch.ao.quantization.quantize, use :noindex: for one of them
...
```

The fix is just to clean up those duplicated placeholder py:module docs, which were there because these modules didn't have any docs originally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123244
Approved by: https://github.com/andrewor14, https://github.com/malfet
2024-04-05 03:18:53 +00:00
rzou
44c0c0fc0f Add torch.library.custom_op (#122344)
This is the entrypoint for defining an opaque/blackbox (e.g. PyTorch will
never peek into it) custom op. In this PR, you can specify backend impls
and the abstract impl for this op.

NB: most of this PR is docstrings, please don't be intimidated by the
line count.

There are a number of interesting features:
- we infer the schema from type hints. In a followup I add the ability
  to manually specify a schema.
- name inference. The user needs to manually specify an op name for now.
  In a followup we add the ability to automatically infer a name (this
  is a little tricky).
- custom_op registrations can override each other. This makes them
  more pleasant to work with in environments like colab.
- we require that the outputs of the custom_op do not alias any inputs
  or each other. We enforce this via a runtime check, but can relax this
  into an opcheck test if it really matters in the future.
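
To give a feel for what "infer the schema from type hints" means, here is a toy sketch built on `inspect.signature`. It is not the torch.library implementation: `infer_schema`, the `_TYPE_NAMES` table and the exact schema string format are assumptions for illustration, and only a few builtin scalar types are handled.

```python
import inspect

# Hypothetical mapping from Python annotations to schema type names.
_TYPE_NAMES = {int: "int", float: "float", bool: "bool", str: "str"}


def infer_schema(fn) -> str:
    """Build a schema-like string such as "(float x, int n) -> float"."""
    sig = inspect.signature(fn)
    params = []
    for name, param in sig.parameters.items():
        # Every parameter needs a supported type hint; there is no guessing.
        if param.annotation not in _TYPE_NAMES:
            raise ValueError(f"parameter {name!r} needs a supported type hint")
        params.append(f"{_TYPE_NAMES[param.annotation]} {name}")
    if sig.return_annotation not in _TYPE_NAMES:
        raise ValueError("return value needs a supported type hint")
    return f"({', '.join(params)}) -> {_TYPE_NAMES[sig.return_annotation]}"


def scale(x: float, n: int) -> float:
    return x * n
```

Calling `infer_schema(scale)` produces a string describing the argument and return types; a function with a missing hint is rejected rather than guessed at.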

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122344
Approved by: https://github.com/ezyang, https://github.com/albanD
2024-04-03 18:36:17 +00:00
lezcano
b27ee6548d Add a Dynamo deepdive to documentation (#122305)
This supersedes the previous "Guards Overview" as a more comprehensive
approach to most of the main topics within Dynamo.

In the future, we could add specific sections for each of the topics
discussed here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122305
Approved by: https://github.com/msaroufim
2024-04-02 15:08:08 +00:00
Will Feng
489f4a063b Revert "Preserve unbacked SymInt on SymNode (#120816)" (#122988)
This reverts commit 476585b190.

I did a bisect and this seems to be the cause of compile time regression in cudagraphs_dynamic test suite between 03/23 and 03/24:
![image](https://github.com/pytorch/pytorch/assets/4063635/21394e06-4906-4690-b5a2-7d16cc475843)
Particularly, BERT_pytorch and hf_T5 seem to have a ~50% compile time regression.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122988
Approved by: https://github.com/eellison
2024-04-01 22:11:09 +00:00
Mikayla Gawarecki
487b6d40ec Add RMSNorm module (#121364)
Similar to dbeed9724b/torchmultimodal/modules/layers/normalizations.py (L51)

**The implementation here is not optimized and we welcome pull requests to improve this**

- Use `normalized_shape` instead of singular integer `dim` to be aligned with the `nn.LayerNorm` implementation
- Remove the [upcast to float and downcast](dbeed9724b/torchmultimodal/modules/layers/normalizations.py (L73))
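
For reference, the RMS normalization itself can be sketched in plain Python for a single 1-D vector. This is an unoptimized toy, not the module's code: `rms_norm`, the list-based input and the explicit `eps` default are assumptions for illustration, and a real `normalized_shape` may span several trailing dimensions.

```python
import math


def rms_norm(x, weight=None, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    if weight is None:
        weight = [1.0] * len(x)  # identity scaling when no weight is given
    return [v * inv_rms * w for v, w in zip(x, weight)]
```

With `eps=0.0` and no weight, the output always has a mean square of 1, which is a quick sanity check on the formula.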

Differential Revision: [D55485840](https://our.internmc.facebook.com/intern/diff/D55485840)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121364
Approved by: https://github.com/albanD
2024-03-29 18:05:28 +00:00
David Berard
59f6393209 [docs] Update PT2+Profiler docs (#122272)
Document:
* Torch-Compiled Region
* What to expect in kernels inside a torch-compiled region

For review, see https://docs-preview.pytorch.org/pytorch/pytorch/122272/torch.compiler_profiling_torch_compile.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122272
Approved by: https://github.com/aaronenyeshi
2024-03-28 17:52:28 +00:00
PyTorch MergeBot
8698121636 Revert "Add RMSNorm module (#121364)"
This reverts commit a7306de0dc.

Reverted https://github.com/pytorch/pytorch/pull/121364 on behalf of https://github.com/atalman due to Broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/121364#issuecomment-2025502007))
2024-03-28 15:31:10 +00:00
Aaron Orenstein
a8b7480f0d fix dynamo.explain examples (#122745)
`dynamo.explain()` was updated to return a structure but the docs weren't updated to match.

- Update the docs to use the new API
- Remove some dead code left when `explain` was updated.
- Drive-by: Fix some `nopython` uses that I noticed
- Drive-by: I noticed an ignored error coming from CleanupHook on shutdown - make it check the global before setting it.

Fixes #122573

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122745
Approved by: https://github.com/jansel
2024-03-27 22:53:27 +00:00
Mikayla Gawarecki
a7306de0dc Add RMSNorm module (#121364)
Similar to dbeed9724b/torchmultimodal/modules/layers/normalizations.py (L51)

**The implementation here is not optimized and we welcome pull requests to improve this**

- Use `normalized_shape` instead of singular integer `dim` to be aligned with the `nn.LayerNorm` implementation
- Remove the [upcast to float and downcast](dbeed9724b/torchmultimodal/modules/layers/normalizations.py (L73))
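
For orientation, the computation an RMSNorm layer performs can be sketched in plain Python. This is an illustrative sketch of the standard RMSNorm formula over a single flat vector, not the code from this PR; the function name `rms_norm` and the `eps` default are assumptions made for the example.

```python
import math

def rms_norm(x, weight=None, eps=1e-6):
    """Toy RMSNorm over a flat list of floats (illustrative only)."""
    # Mean of squares of the input; eps is added for numerical stability.
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    # Optional elementwise scale, analogous to a learnable weight.
    if weight is None:
        weight = [1.0] * len(x)
    return [v * inv_rms * w for v, w in zip(x, weight)]

out = rms_norm([3.0, 4.0])
```

After normalization with a unit weight, the mean of the squared outputs is approximately 1, which is the defining property of the operation.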

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121364
Approved by: https://github.com/albanD
2024-03-27 21:39:30 +00:00
Frank Lin
249e65b92d Graph-Safe RNG State Exchange for Tensor Parallelism (#114068)
See #113541

The PR allows for registering and controlling multiple RNG states using indices, ensuring cudagraph-safe operations, and includes both C++ and Python API changes to support this functionality.

cc  @eellison @anijain2305 @jansel @ezyang @ptrblck @csarofeen @mcarilli
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114068
Approved by: https://github.com/ezyang, https://github.com/eqy, https://github.com/xuzhao9
2024-03-27 01:14:38 +00:00
Edward Z. Yang
85845a29db Refactor ShapeEnvSettings so it's directly on ShapeEnv (#122310)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122310
Approved by: https://github.com/masnesral, https://github.com/lezcano
2024-03-26 14:16:33 +00:00
PyTorch MergeBot
4dc09d6aa4 Revert "Graph-Safe RNG State Exchange for Tensor Parallelism (#114068)"
This reverts commit e9dcda5cba.

Reverted https://github.com/pytorch/pytorch/pull/114068 on behalf of https://github.com/ezyang due to memory leak in another ci ([comment](https://github.com/pytorch/pytorch/pull/114068#issuecomment-2018044527))
2024-03-25 13:49:04 +00:00
Edward Z. Yang
476585b190 Preserve unbacked SymInt on SymNode (#120816)
Previously, when we applied a replacement, a SymInt that was
previously an unbacked SymInt would then transmute into whatever
we replaced it into (e.g., a constant).

This has a major downside: we often look at SymInts associated with
FX nodes (e.g., the meta of x.item() return) to find out where the
unbacked SymInt was allocated.  If we replace it, we no longer can
find out where, e.g., u1 was allocated!  But we need to know this
so we can generate deferred runtime asserts like u1 == s0.

To solve this problem, I have a special mode for replace, resolve_unbacked=False, which lets you disable substitutions on unbacked SymInts. When reporting node.expr, we preferentially avoid applying unbacked SymInt substitutions. To understand if we might accidentally reapply the substitution later, before we have reached the deferred runtime assert, we must study the calls to simplify() in ShapeEnv. My audit turns up these sites:

* `produce_guards`: this is fine, deferred runtime asserts never show up here, we must NOT have unbacked SymInts show up here. Similarly `get_nontrivial_guards`.
* `_maybe_evaluate_static`: this is fine, we are using this to determine if it is necessary to produce a guard/runtime assert. We don't want to reissue a runtime assert if we've already asserted on it, and replacements can help us understand if this has occurred.
* `_simplify_floor_div`: this is a legitimate bug, it needs to be `resolve_unbacked=False`
* `_refine_ranges`: this is fine, a refined range doesn't affect what runtime asserts we issue
* `_update_divisible`: this updates the `self.divisible` set, which specifies when we can simplify away divisibility constraints. Since this affects replacements only, it won't cause us to oversimplify a user provided expression.

There are some situations where we DO want to always apply the substitution, specifically when we have the duplicate symbol problem (we retrace an item call and get u0 and u1 which refer to the same thing.) I don't want two symbols in this case, so a special `rename_unbacked_to` is provided which sets up the unconditional renaming.
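
The effect of the `resolve_unbacked=False` mode can be illustrated with a toy substitution over tuple-based expressions. This is only a sketch under stated assumptions: the expression encoding, the `replace` helper, and the convention that symbols starting with `u` are unbacked are all hypothetical and do not reflect the actual `ShapeEnv` code.

```python
def replace(expr, replacements, resolve_unbacked=True):
    """Substitute symbols in a toy expression made of nested tuples.

    Strings starting with "u" stand in for unbacked SymInts
    (a naming convention assumed for this sketch only).
    """
    if isinstance(expr, str):
        if expr.startswith("u") and not resolve_unbacked:
            return expr  # keep the unbacked symbol visible
        return replacements.get(expr, expr)
    if isinstance(expr, tuple):
        return tuple(replace(e, replacements, resolve_unbacked) for e in expr)
    return expr  # constants pass through unchanged

expr = ("mul", "u1", "s0")
replacements = {"u1": 3}
resolved = replace(expr, replacements)
preserved = replace(expr, replacements, resolve_unbacked=False)
```

With the default mode the symbol `u1` disappears from the expression, whereas the second call keeps it, so a later pass could still find where `u1` was allocated.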

Along the way, I make a refinement to `_update_var_to_range`: if you update a var range for a size-like unbacked SymInt, you are now no longer allowed to set its lower bound below 2. This is because if you could, then our size oblivious tests for it would be inconsistent. Actually, I think there is still some inconsistency, because if you assert `u0 == 0` we will still end up with this in deferred runtime asserts, and we will then use this to simplify these statements to be True everywhere else. Maybe we should forbid this kind of refinement; not done in this PR.

Fixes https://github.com/pytorch/pytorch/issues/119689

Fixes https://github.com/pytorch/pytorch/issues/118385

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120816
Approved by: https://github.com/lezcano
2024-03-24 02:56:16 +00:00
liqunfu
bbe846f430 Add symbolic_opset19.py and symbolic_opset20.py to support opset 19/20, extend opset 18 support (#118828)
Start to fix https://github.com/pytorch/pytorch/issues/114801

Co-authored-by: Thiago Crepaldi <thiagofc@microsoft.com>
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118828
Approved by: https://github.com/thiagocrepaldi
2024-03-22 18:01:33 +00:00
Sahdev Zala
17175cdbc7 [Docs] Add extended debugging options for troubleshooting (#122028)
Fixes #120889

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122028
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-03-21 17:00:45 +00:00
Frank Lin
e9dcda5cba Graph-Safe RNG State Exchange for Tensor Parallelism (#114068)
See #113541

The PR allows for registering and controlling multiple RNG states using indices, ensuring cudagraph-safe operations, and includes both C++ and Python API changes to support this functionality.

cc  @eellison @anijain2305 @jansel @ezyang @ptrblck @csarofeen @mcarilli
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114068
Approved by: https://github.com/ezyang
2024-03-21 01:57:08 +00:00
Nathan
ae983d2d6e Fix typo in sparse.rst (#121826)
Change word "on" to "one" when talking in the third person.

Fixes #121770
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121826
Approved by: https://github.com/janeyx99
2024-03-19 00:17:19 +00:00
Jane Xu
37e563276b Document complex optimizer semantic behavior (#121667)
<img width="817" alt="image" src="https://github.com/pytorch/pytorch/assets/31798555/565b389d-3e86-4767-9fcb-fe075b50aefe">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121667
Approved by: https://github.com/albanD
2024-03-16 00:43:47 +00:00
Tugsbayasgalan Manlaibaatar
53d2188df9 Update get_aten_graph_module (#121937)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121937
Approved by: https://github.com/andrewor14
2024-03-15 20:35:55 +00:00
Aidyn-A
af86d67d61 [Doc][NVTX] Add documentation for nvtx.range (#121699)
The context manager `torch.cuda.nvtx.range` has been around for about 4 years (see #42925). Unfortunately, it was never documented and as a consequence users are just unaware of it (see #121663).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121699
Approved by: https://github.com/janeyx99
2024-03-15 20:26:44 +00:00
lezcano
d0d09f5977 Fix torch.compile links (#121824)
Fixes https://github.com/pytorch/pytorch.github.io/issues/1567

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121824
Approved by: https://github.com/svekars, https://github.com/peterbell10, https://github.com/malfet
ghstack dependencies: #121823
2024-03-15 19:49:37 +00:00
Matthias Reso
a9274c9a2c Fix aoti doc to avoid cannot bind non-const lvalue reference error (#121672)
This PR corrects the example in the AOTInductor docs, which currently fails with:
```
/home/ubuntu/test/inference.cpp:21:62: error: cannot bind non-const lvalue reference of type ‘std::vector<at::Tensor>&’ to an rvalue of type ‘std::vector<at::Tensor>’
   21 |     std::cout << runner.run({torch::randn({2, 10}, at::kCPU)})[0] << std::endl;
      |
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121672
Approved by: https://github.com/desertfire
2024-03-12 23:43:40 +00:00
PyTorch MergeBot
0398dc9e8e Revert "[DCP] Makes fsspec public (#121508)"
This reverts commit d482614fec.

Reverted https://github.com/pytorch/pytorch/pull/121508 on behalf of https://github.com/osalpekar due to this causes torchrec tests to fail internally with this error: ModuleNotFoundError: No module named 'fsspec'. see [D54779117](https://www.internalfb.com/diff/D54779117) ([comment](https://github.com/pytorch/pytorch/pull/121508#issuecomment-1992137831))
2024-03-12 17:02:43 +00:00
Lucas Pasqualin
d482614fec [DCP] Makes fsspec public (#121508)
Fixes #118033

Also removes `_checkpointer.py` class
original PR's:
- https://github.com/pytorch/pytorch/pull/121330
- https://github.com/pytorch/pytorch/pull/121329

We're also disabling `test_fsdp` since it is failing on random PR's

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121508
Approved by: https://github.com/fegin
2024-03-09 01:14:18 +00:00
chilli
ed8eebd1c2 Changed cublas reproducibility URL (#121534)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121534
Approved by: https://github.com/Skylion007
2024-03-08 23:46:21 +00:00
angelayi
f2d5e96db4 [export] Add docs for 2.3 release (#121466)
- Added docs about non-strict export
- Added example using derived dims
- Added api docs for ep.run_decompositions() (https://github.com/pytorch/pytorch/issues/119480)
- Tried to include/cover everything in https://docs.google.com/document/d/1kZ_BbB3JnoLbUZleDT6635dHs88ZVYId8jT-yTFgf3A/edit
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121466
Approved by: https://github.com/zhxchen17
2024-03-08 22:29:48 +00:00
Ke Wen
c78f72d7e7 [c10d] Deprecate torch.distributed.pipeline (#121464)
In favor of PiPPy (Pipeline Parallelism for PyTorch) https://github.com/pytorch/PiPPy

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121464
Approved by: https://github.com/wz337, https://github.com/awgu
2024-03-08 19:55:02 +00:00
Wanchao Liang
30982ce072 [tp] doc fixes (#121431)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121431
Approved by: https://github.com/wz337
2024-03-08 17:46:44 +00:00
Karol Blaszczak
8ed0932172 Update link to OpenVINO backend in torch.compiler.rst (#121303)
This is a permalink, so it will remain active regardless of documentation version changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121303
Approved by: https://github.com/soulitzer
2024-03-08 08:17:13 +00:00
Chien-Chin Huang
2e789ad522 [DCP][state_dict][doc] Update the distributed state_dict document (#121290)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121290
Approved by: https://github.com/LucasLLC
ghstack dependencies: #121273, #121276
2024-03-08 07:58:18 +00:00
Lucas Pasqualin
96ed37ac13 [DCP] Makes async_save public (#121325)
Makes async_save public

Differential Revision: [D54593610](https://our.internmc.facebook.com/intern/diff/D54593610/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121325
Approved by: https://github.com/wz337
ghstack dependencies: #121317
2024-03-08 05:13:13 +00:00
Cheng Ni
9bff1599b6 [Torch Elastic][Draft] Refactor SubprocessHandler to separate module for easier subclass (#120373)
Summary:
## No Functional Change
- Refactor Subprocess Handler into a separate folder for easier subclassing
- SubprocessHandler
    - added `local_rank_id` in `SubprocessHandler` to make it available as a field in the class
    - pass in `local_rank_id` from subprocess start

Test Plan: No functional changes.

Differential Revision: D54038627

#suppress-api-compatibility-check

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120373
Approved by: https://github.com/kurman
2024-03-08 01:37:34 +00:00
Dheeraj Peri
b1657beac1 feat: Add min, max ranges to mark_dynamic API (#119737)
Fixes https://github.com/pytorch/pytorch/issues/115137

This PR adds:

- mark_dynamic API will accept `min`, `max` values to create a bounded constraint on the dim.
- test case in test_misc.py which checks if `ConstraintViolationError` is triggered if `torch.compile` gets an input dimension out of bounds.
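
The idea of a bounded constraint on a dimension can be sketched without PyTorch. The snippet below is a toy model only: `ConstraintViolation`, `is_within_bounds`, and `check_bounded_dim` are hypothetical names for illustration and are not the implementation behind `mark_dynamic`.

```python
class ConstraintViolation(Exception):
    """Hypothetical stand-in for a constraint-violation error."""

def is_within_bounds(size, min=None, max=None):
    """Return True if size lies within the optional [min, max] bounds."""
    if min is not None and size < min:
        return False
    if max is not None and size > max:
        return False
    return True

def check_bounded_dim(size, min=None, max=None):
    """Return size, or raise if it falls outside the declared bounds."""
    if not is_within_bounds(size, min=min, max=max):
        raise ConstraintViolation(f"dim {size} is outside [{min}, {max}]")
    return size
```

A dimension of 8 passes a `[2, 16]` bound, while a dimension of 32 would raise, mirroring the out-of-bounds case that the new test exercises.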

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119737
Approved by: https://github.com/ezyang, https://github.com/jansel
2024-03-07 23:26:03 +00:00
suo
c3c15eb9a6 [export] update docs to not export raw functions (#121272)
as title

Differential Revision: [D54555101](https://our.internmc.facebook.com/intern/diff/D54555101/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121272
Approved by: https://github.com/zhxchen17
2024-03-07 17:18:07 +00:00
Wanchao Liang
1a28ebffb3 [TP] Introduce Sequence Parallel Style for LayerNorm/RMSNorm/Dropout (#121295)
As titled, this PR introduces a dedicated `ParallelStyle` to shard the
nn.LayerNorm/nn.Dropout/RMSNorm layers. We were mainly using manual
distribute_module calls before when sharding the RMSNorm layer, but I
think we should have a dedicated TP API to easily shard those layers,
instead of having users manually use DTensors.

I call this SequenceParallel, which might cause some confusion because we
technically "deprecated" a SequenceParallel style months ago. But this
time the SequenceParallel style is significantly different from the
previous one (which used to shard two consecutive Linear layers). I
believe getting the name right is the first priority, rather than
worrying about reusing the old name.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121295
Approved by: https://github.com/awgu, https://github.com/tianyu-l
ghstack dependencies: #121294
2024-03-07 02:04:59 +00:00
Zhengxu Chen
8aeb247a3d [export] Remove WrapperModule. (#121042)
Summary: WrapperModule seems like a good idea but may introduce surprising behavior for users. For example, it never registers enclosed modules as submodules, so it's unclear what the state dict for the exported program should look like: some people may argue for including every state in the state dict, while others want to keep them as constants.

Test Plan: CI

Reviewed By: tugsbayasgalan

Differential Revision: D54326331

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121042
Approved by: https://github.com/angelayi
2024-03-05 18:10:22 +00:00
Tianyu Liu
af5376c444 [dtensor] add support for loss parallel (#119877)
Loss parallel is the last piece of sequence parallelism to enable. It enables efficient distributed cross entropy computation when the input is sharded on the class dimension (in a classification problem with many classes). The implementation is via a context manager `loss_parallel`, after enabling which users can directly use `torch.nn.functional.cross_entropy` or `torch.nn.CrossEntropyLoss` without modifying other parts of their code.

Here are the underlying rationales why we are going through these op replacements:

1. `nn.functional.cross_entropy` is the common method that OSS users use for things like transformer training. To avoid changing user code, we want users to keep using this function for loss calculation if they already do.
2. `nn.functional.cross_entropy` boils down to `aten.log_softmax` and `aten.nll_loss_forward/backward`, and DTensor already supports those ops (#117723 #119255 #118917 #119256). They compute with the input *replicated* on the class dimension.
3. However, when the input of this loss calculation is **sharded on the class dimension**, running the sharded computation efficiently requires running both `aten.log_softmax` and `aten.nll_loss_forward` with multiple all-reduce collectives **in the middle of** those aten ops. This is not possible if we just override these two ops, so we need some way to **decompose** them into smaller ops so that collectives can run in between.
4. We explored the existing decompositions (#118950). They seem to work, except that `log_softmax_backward` and `nll_loss_backward` combined together in aten are implemented in an inefficient way, which would trigger an additional expensive collective. Recently some users also reported similar issues: https://github.com/pytorch/pytorch/issues/119261.
5. Therefore, we currently do our own decomposition inside a context manager, specifically for sequence parallelism. Once we have a better decomposition in core, we can possibly use that instead of reinventing the wheel here.
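
To make point 3 concrete, the following pure-Python sketch simulates cross entropy for one sample whose logits are split across shards along the class dimension. It is an illustration of the general technique, not this PR's implementation: the function names are hypothetical, and the "all-reduce" steps are simulated with ordinary `max` and `sum` over the shard list.

```python
import math

def sharded_cross_entropy(logit_shards, target):
    """Cross entropy for one sample whose logits are split across shards."""
    # Simulated all-reduce (max): every shard subtracts the same global maximum.
    global_max = max(max(shard) for shard in logit_shards)
    # Each shard sums its local exponentials; a simulated all-reduce (sum) combines them.
    local_sums = [sum(math.exp(v - global_max) for v in shard) for shard in logit_shards]
    global_sum = sum(local_sums)
    # Only the shard that owns the target class contributes its logit.
    offset = 0
    target_logit = None
    for shard in logit_shards:
        if offset <= target < offset + len(shard):
            target_logit = shard[target - offset]
        offset += len(shard)
    if target_logit is None:
        raise IndexError("target class is outside all shards")
    # loss = -(log_softmax at the target) = log(sum) + max - target_logit
    return math.log(global_sum) + global_max - target_logit

def reference_cross_entropy(logits, target):
    """Unsharded reference computation for comparison."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]

shards = [[1.0, 2.0], [0.5, 3.0]]
loss = sharded_cross_entropy(shards, target=3)
```

The two reductions in the middle of the computation are the reason a plain override of `log_softmax` and `nll_loss_forward` is not sufficient: the collectives have to run between the local steps.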

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119877
Approved by: https://github.com/wanchaol
2024-03-02 05:06:26 +00:00
Lucas Pasqualin
9d5dea7812 [DCP] Adds storage reader and planner classes for online loading/sharding of models in torch.save format (#119816)
as title

Differential Revision: [D53718041](https://our.internmc.facebook.com/intern/diff/D53718041/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119816
Approved by: https://github.com/fegin
2024-03-01 00:21:05 +00:00
Kurman Karabukaev
67d3e4f2a2 [TorchElastic] Refactoring to support non-default logging strategy (#120691)
Summary:
Pulling out logging parameters into a logging specs that can be overridden (follow-up changes on possible mechanism)

Why?
Right now the logging approach is quite rigid:
- Requires for log directory to exist and not be empty
- Will create tempdir otherwise,
- Creates subdir for a run
- creates subdir for each attempt
- creates files named as stdout.log, stderr.log, error.json

In some instances some of the users would like to customize the behavior including file names based on context. And we do have right now a mechanism to template multiplexed teed output prefix.

With current changes, users can create custom log spec that can use env variables to change the behavior.

Notes:
Made `LaunchConf.logs_specs` as an optional field that will be bound to `DefaultLogsSpecs` instance. There are large number of clients (code) that use the API directly without using torchrun API. For those cases, we have to explicitly pass LogSpecs implementation if we would like to override the implementation. For the regular torchrun users, we can use pluggable approach proposed in the follow up change.

Test Plan: CI + unit tests

Differential Revision: D54176265

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120691
Approved by: https://github.com/ezyang
2024-02-29 20:59:17 +00:00
Oleg Khabinov
4b18ab869f [torch.export] Support is_compiling() flag for non-strict mode (#119602)
Summary: In the non-strict mode of torch.export() we didn't set the `is_compiling()` flags to `True`, which some models need.

Test Plan: Unit tests and manual testing.

Differential Revision: D53624452

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119602
Approved by: https://github.com/suo
2024-02-29 05:52:51 +00:00
Yu, Guangye
12995a5d9d [2/2] Intel GPU Runtime Upstreaming for Generator (#118613)
# Motivation
According to [[1/2] Intel GPU Runtime Upstreaming for Generator](https://github.com/pytorch/pytorch/pull/118528), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the second PR covers the changes under `python frontend`.

# Design
Currently, it primarily offers generator-related APIs, including

- `torch.xpu.default_generators`
- `torch.xpu.get_rng_state`
- `torch.xpu.get_rng_state_all`
- `torch.xpu.initial_seed`
- `torch.xpu.manual_seed`
- `torch.xpu.manual_seed_all`
- `torch.xpu.seed`
- `torch.xpu.seed_all`
- `torch.xpu.set_rng_state`
- `torch.xpu.set_rng_state_all`

# Additional Context
The differences with CUDA:
The generator-related frontend Python APIs map 1:1 to CUDA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118613
Approved by: https://github.com/gujinghui, https://github.com/EikanWang, https://github.com/jgong5, https://github.com/albanD
2024-02-28 05:28:11 +00:00
Lucas Pasqualin
1c1028ac49 [DCP] Adds utility for converting torch save to dcp (#119815)
as title

Differential Revision: [D53718040](https://our.internmc.facebook.com/intern/diff/D53718040/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119815
Approved by: https://github.com/fegin
ghstack dependencies: #119813, #119814
2024-02-22 17:22:11 +00:00
Lucas Pasqualin
1ab441a7dd [DCP] Adds utility for converting dcp to torch save format (#119814)
as title

Differential Revision: [D53718042](https://our.internmc.facebook.com/intern/diff/D53718042/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119814
Approved by: https://github.com/fegin
ghstack dependencies: #119813
2024-02-22 16:55:58 +00:00
Hugues de Saxcé
8464654ae4 Add missing words to torch.utils.checkpoint doc (#120196)
This PR adds a couple of missing words in the Checkpointing documentation, it doesn't have a specific issue number related to it.

Changes are:
- "backward." -> "backward propagation."
- "to be advanced than" -> "to be more advanced than"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120196
Approved by: https://github.com/soulitzer
2024-02-20 20:18:42 +00:00
andrewor14
6ea4480818 [quant][pt2e] Add model_is_exported util function (#119726)
Summary: This commit adds the `model_is_exported` util function
for users to be able to easily tell what APIs to call to move
their models between train and eval modes. This has the
additional advantage of hiding the implementation of how we
detect a model is exported, in case the metadata format changes
in the future.

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_model_is_exported

Differential Revision: [D53812972](https://our.internmc.facebook.com/intern/diff/D53812972)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119726
Approved by: https://github.com/tugsbayasgalan, https://github.com/albanD
2024-02-16 19:29:36 +00:00
soulitzer
312ce35c1f Rename singleton int to nested int (#119661)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119661
Approved by: https://github.com/ezyang
2024-02-16 19:21:17 +00:00
Yu, Guangye
8f9f12c068 Intel GPU Runtime Upstreaming for Device Allocator (#118091)
# Motivation
According to [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842) and [[RFC] Intel GPU Runtime Upstreaming for Allocator](https://github.com/pytorch/pytorch/issues/116322), we will upstream the key functionality of the device `Allocator` dedicated to XPU to PyTorch. Following our design, we are also preparing to generalize `Allocator` in parallel.

# Design
In the current design, XPU uses an `XPUAllocator` class, inherited from `c10::Allocator`. `XPUAllocator` is a manager to handle `DeviceCachingAllocator`, which is a per-device implementation of the caching mechanism to manage the already cached or newly allocated memory. The caching mechanism is similar to other backends, like CUDA. We can visualize the design as below.
<p align="center">
<img width="162" alt="image" src="https://github.com/pytorch/pytorch/assets/106960996/6b17b8cf-e7d1-48b4-b684-f830c409d218">
</p>
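
The caching idea behind a per-device allocator can be sketched as a toy class. This is a deliberately simplified illustration under stated assumptions: `ToyCachingAllocator` is a hypothetical name, blocks are plain tuples, and only exact-size reuse is modeled. Real caching allocators also round sizes, split blocks, and track streams, none of which is shown here.

```python
class ToyCachingAllocator:
    """Toy per-device caching allocator that reuses freed blocks by size."""

    def __init__(self):
        self.free_blocks = {}  # size -> list of cached blocks
        self.next_id = 0
        self.raw_allocs = 0    # how often the (simulated) driver was asked

    def malloc(self, size):
        cached = self.free_blocks.get(size)
        if cached:
            # Serve the request from the cache without a new raw allocation.
            return cached.pop()
        self.raw_allocs += 1
        block = (self.next_id, size)
        self.next_id += 1
        return block

    def free(self, block):
        # Keep the block for reuse instead of returning it to the driver.
        self.free_blocks.setdefault(block[1], []).append(block)

allocator = ToyCachingAllocator()
first = allocator.malloc(1024)
allocator.free(first)
second = allocator.malloc(1024)
```

In this sketch the second request is served from the cache, so only one raw allocation is counted.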

# Additional Context
We're going to implement our design gradually. This PR covers the device `Allocator` dedicated to XPU. The second PR covers the host `Allocator`.
Besides these PRs, we plan to make the device `Allocator` device-agnostic through another PR.
In this PR, our device `Allocator` has the same memory management mechanism as CUDA, but lacks features such as expandable segments and statistics. We will add these features back in the subsequent PR that generalizes `Allocator`.

The differences with CUDA:
Only the key functionality is covered; XPU lacks AsyncAllocator, gpu_trace, history_record, graph functionality, memory snapshot, memory statistics, expandable segment...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118091
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/albanD
ghstack dependencies: #117611, #117619, #117734
2024-02-16 06:46:00 +00:00
Yu, Guangye
4dc75f9084 Intel GPU Runtime Upstreaming for Event (#117734)
# Motivation
As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the next runtime component we would like to upstream is `Event`, which handles the status of an operation that is being executed. In some circumstances, `Event` gives us fine-grained control over operation execution.

# Design
`XPUEvent` is a movable but not copyable wrapper around a SYCL event. It should be created lazily on an XPU device when recording an `XPUStream`. Meanwhile, `XPUEvent` can wait for another `XPUEvent` or all the submitted kernels on an `XPUStream` to complete. To align with the other backends, the C++ files related to `Event` will be placed in the `aten/src/ATen/xpu` folder. For frontend code, the `XPUEvent` runtime API will be bound to Python `torch.xpu.Event`. The corresponding C++ code will be placed in `torch/csrc/xpu/Event.cpp` and the Python code in `torch/xpu/streams.py`.

# Additional Context
It is worth mentioning that the `elapsed_time` method is temporarily not supported by `XPUEvent`. We will be adding support for it soon. Meanwhile `XPUEvent` doesn't support IPC from different processes. For the other parts, we have almost a 1:1 mapping with CUDA.

It lacks the following APIs:
- `torch.cuda.Event.ipc_handle`
- `CUDAEvent`'s constructor with `IpcEventHandle`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117734
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #117611, #117619
2024-02-16 06:28:26 +00:00
lancerts
444c628e06 Include the scalar tensor auto-transfer in the doc (#119967)
Fixes #119609

@albanD

Co-authored-by: albanD <desmaison.alban@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119967
Approved by: https://github.com/albanD
2024-02-15 22:37:39 +00:00
drisspg
744898b311 Add doc page for environment variables that affect PyTorch Runtime (#119087)
# Summary

The goal of this PR is to add a doc page listing a number of environment variables that affect the PyTorch runtime. It will likely not be exhaustive, but hopefully it will be extended and updated to stay relevant.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119087
Approved by: https://github.com/janeyx99, https://github.com/eqy
2024-02-15 21:41:38 +00:00
Kazuaki Ishizaki
a2f07bb317 Fix typo under docs directory (#119657)
This PR fixes typo under `docs` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119657
Approved by: https://github.com/colesbury
2024-02-15 21:14:34 +00:00
Eddie Yan
cd380c794f [CUDNN][SDPA] Experimental cuDNN Flash Attention v2 Inference (#115663)
#113713

Going to clean up some of the checks and will remove draft status after.
Can be tested on SM80+ with `TORCH_CUDNN_MHA_ENABLED=1`.

CC @drisspg @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115663
Approved by: https://github.com/drisspg
2024-02-14 22:02:06 +00:00
andrewor14
8ec8d78ef2 [quant][pt2e][be] Rename eval_utils -> export_utils (#119725)
It's not really eval_utils anymore, since we added some training
related utils. Instead it should be util functions that are
related to general export use cases.

Differential Revision: [D53711494](https://our.internmc.facebook.com/intern/diff/D53711494)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119725
Approved by: https://github.com/tugsbayasgalan
2024-02-13 19:10:06 +00:00
Yu, Guangye
8fd11cb307 [2/2] Intel GPU Runtime Upstreaming for Stream (#117619)
# Motivation
According to [[1/2] Intel GPU Runtime Upstreaming for Stream](https://github.com/pytorch/pytorch/pull/117611), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the second PR covers the changes under `python frontend`.

# Design
Currently, it primarily offers stream-related APIs, including
 - `torch.xpu.StreamContext`
 - `torch.xpu.current_stream`
 - `torch.xpu.set_stream`
 - `torch.xpu.synchronize`
 - `torch._C._xpu_getCurrentRawStream`

# Additional Context
We will implement functions like `torch.xpu.Stream.wait_event`, `torch.xpu.Stream.wait_stream`, and `torch.xpu.Stream.record_event` in the next PR, which relates to `Event`.

The differences with CUDA:
XPU has no default or external stream and lacks the following APIs:
- `torch.cuda.ExternalStream`
- `torch.cuda.default_stream`
- `torch.cuda.is_current_stream_capturing`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117619
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/albanD
ghstack dependencies: #117611
2024-02-10 03:39:42 +00:00
Mikayla Gawarecki
3372aa51b4 Integrate swap_tensors into nn.Module.load_state_dict (#117913)
Added a `torch.Tensor` method that defines how to transform `other`, a value in the state dictionary, to be loaded into `self`, a param/buffer in an `nn.Module` before swapping via `torch.utils.swap_tensors`
* `param.module_load(sd[key])`

This method can be overridden using `__torch_function__`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117913
Approved by: https://github.com/albanD
2024-02-09 22:32:29 +00:00
Angela Yi
0827510fd3 [export] Remove torch._export.export (#119095)
XLA changes: https://github.com/pytorch/xla/pull/6486

Test Plan: CI

Differential Revision: D53316196

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119095
Approved by: https://github.com/ydwu4, https://github.com/zhxchen17, https://github.com/tugsbayasgalan, https://github.com/avikchaudhuri, https://github.com/jerryzh168
2024-02-08 21:22:04 +00:00
Yu, Guangye
9a992b0918 [4/4] Intel GPU Runtime Upstreaming for Device (#116869)
# Motivation
According to [[1/4] Intel GPU Runtime Upstreaming for Device](https://github.com/pytorch/pytorch/pull/116019), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), this last PR  covers the changes under lazy initialization.

# Design
This PR primarily adds support for multi-processing via lazy initialization. We initialize our runtime lazily, so XPU is not initialized until the first time it is accessed. In our design, we extend `cuda_lazy_init` to `device_lazy_init`, a device-agnostic API that can support any backend. We also change `maybe_initialize_cuda` to `maybe_initialize_device` to support lazy initialization for both CUDA and XPU while maintaining scalability.
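
The lazy-initialization pattern described here can be sketched in a few lines of plain Python. This is a generic illustration, not the PyTorch code: `LazyRuntime`, its fields, and the placeholder return value are all assumptions made for the example.

```python
class LazyRuntime:
    """Toy device runtime that defers setup until first use."""

    def __init__(self):
        self.initialized = False
        self.init_count = 0

    def _lazy_init(self):
        # Run the (stand-in) expensive setup at most once.
        if not self.initialized:
            self.init_count += 1
            self.initialized = True

    def device_count(self):
        # Every public entry point triggers initialization on demand.
        self._lazy_init()
        return 1  # placeholder value for the sketch

runtime = LazyRuntime()
```

Creating the object does not touch the device, so a process can still fork or spawn workers before anything is initialized; the setup runs only when an entry point such as `device_count` is first called.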

# Additional Context
We adopt a similar design to CUDA. So we share some code with CUDA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116869
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/malfet
ghstack dependencies: #119248
2024-02-08 03:01:21 +00:00
Mateus Devino
64aaa8f508 Fix typo on Contribution Guide (#119428)
Fixes #119427

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119428
Approved by: https://github.com/awgu, https://github.com/kit1980
2024-02-08 01:07:27 +00:00
Mikayla Gawarecki
d5a718d27b Add swap_tensors path to nn.Module._apply (#117167)
Added `torch.__future__.{get/set}_swap_module_params_on_conversion`, which defaults to `False` for now, but we probably want to override this and default to `True` in `nn.Module._apply` if the input is a tensor subclass.

From offline discussion, for now we are **not** allowing `swap_tensor` after the first module forward has been run*** if the autograd graph is still alive. The reason being that `torch.utils.swap_tensors(t1, t2)` requires the `use_count` of both `TensorImpl`s associated with `t1` and `t2` to be 1.  The first forward pass will install `AccumulateGrad` nodes on each param, which [bump the refcount of the associated TensorImpl](6cf1fc66e3/torch/csrc/autograd/variable.cpp (L307)). **Future work might be to swap the refs that the `AccumulateGrad` nodes hold if it is necessary.**

***From this, it might seem like we don't need to handle gradients. However, I still handle the grads for the edge case that the grads are set via `p.grad = grad` OR the autograd graph is no longer alive because the output has been garbage collected.

If any `swap_tensors` fails on any of the parameters in the `nn.Module` we raise an error.

**`RNNBase` overrides `nn.Module._apply()` and installs weakrefs on some parameters. As a result, all modules that inherit from `RNNBase` (`RNN`, `GRU` and `LSTM`) cannot use the `swap_tensors` path as of now.**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117167
Approved by: https://github.com/albanD
ghstack dependencies: #118028
2024-02-07 18:55:44 +00:00
Peter Bell
7c95cc5e03 Add basic reference documentation for symbolic_shapes.py (#118997)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118997
Approved by: https://github.com/albanD
2024-02-07 14:33:42 +00:00
Svetlana Karslioglu
5ae6f6cffe Test seo torch cuda (#119324)
Testing if this will help improve SEO of this page.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119324
Approved by: https://github.com/albanD
2024-02-07 00:39:51 +00:00
Edward Z. Yang
3f0fd36835 Introduce size oblivious guards (#118579)
Fixes https://github.com/pytorch/pytorch/issues/117361

The implementation here slightly diverges from what was proposed in the issue, so I will recap what this PR is doing here. Today, when doing computations involving size-like unbacked SymInts, we assume for all operations that the compile time range of the integer is `[2, inf]`, even though at runtime we also accept zero and one.

This PR removes the carte blanche assumption, and instead does the analysis in a much more limited and controlled fashion: only for guards which we have designated as "size oblivious" are we willing to do the analysis under the assumption that the range of all size-like unbacked SymInts is `[2, inf]`; otherwise, we will faithfully only do analysis with `[0, inf]` (or whatever the user provided) bounds.

The infra pieces of this PR are:

* Remove runtime_var_to_range from torch/fx/experimental/symbolic_shapes.py; modify `_constrain_range_for_size` to refine the range without clamping min to 2, and instead add the symbol to a `size_like` set in the ShapeEnv
* When evaluating an expression, if the expression is requested to be evaluated in a `size_oblivious` way, we attempt to statically compute the value of the expression with the assumption that all symbols in `size_like` are updated to assume that they are `>= 2`.
* Add Python and C++ APIs for guarding on a SymBool in a size-oblivious way. In C++, I also need to add some helpers for performing symbolic comparisons, since the stock comparisons immediately specialize in the "normal" way.
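
The difference between ordinary and size-oblivious evaluation can be sketched with a toy range analysis. This is a minimal sketch under assumptions, not the `ShapeEnv` code: `ValueRange`, `static_eq`, and `eval_eq` are hypothetical names, and only equality against a constant is modeled.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ValueRange:
    lower: float
    upper: float


def static_eq(vr: ValueRange, value: int) -> Optional[bool]:
    """Return True/False if `symbol == value` is decided by the range, else None."""
    if vr.lower == vr.upper == value:
        return True
    if value < vr.lower or value > vr.upper:
        return False
    return None  # undecidable from the range alone


def eval_eq(vr, value, *, size_like, size_oblivious=False):
    # Only a size-oblivious evaluation of a size-like symbol assumes >= 2.
    if size_oblivious and size_like:
        vr = ValueRange(max(vr.lower, 2), vr.upper)
    return static_eq(vr, value)


u0 = ValueRange(0, math.inf)  # unbacked size: 0 and 1 are allowed at runtime

print(eval_eq(u0, 1, size_like=True))                       # None
print(eval_eq(u0, 1, size_like=True, size_oblivious=True))  # False
```

In this toy model the faithful `[0, inf]` analysis cannot decide `u0 == 1`, while the size-oblivious one answers `False` because it treats the symbol as `>= 2`.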

The rest of the changes of the PR are marking various spots in PyTorch framework code as size oblivious, based on what our current test suite exercises.

As you review the places where we have marked things as size oblivious, it may become clear why I ended up not opting for the "designate a branch as the default branch when it's not statically obvious which way to go": for some of the conditions, this answer is rather non-obvious. I think potentially there is another refinement on top of this PR, which is something like "I don't care if you can't figure it out with ValueRange analysis, go down this path anyway if there are unbacked sizes involved." But even if we add this API, I think we are obligated to attempt the ValueRange analysis first, since it can lead to better outcomes sometimes (e.g., we are able to figure out that something is contiguous no matter what the unbacked size is.)

When is it permissible to mark something as size oblivious? Heuristically, it is OK anywhere in framework code if it gets you past a "guard on unbacked SymInt" problem. It is somewhat difficult to provide a true semantic answer, however. In particular, these annotations don't have any observational equivalence guarantee; for example, if I have `torch.empty(u0, 1).squeeze()`, we will always produce a `[u0]` size tensor, even though if `u0 == 1` PyTorch will actually produce a `[]` size tensor. The argument that I gave to Lezcano is that we are in fact defining an alternate semantics for a "special" size = 0, 1, for which we have these alternate eager mode semantics. In particular, suppose that we have a constant `special1` which semantically denotes 1, but triggers alternate handling rules. We would define `torch.empty(special1, 1).squeeze()` to always produce a `[special1]` size tensor, making its semantics coincide with unbacked SymInt semantics. In this model, the decision to designate guards as size oblivious is simply a user API question: you put them wherever you need some handling for special1! As we conservatively error out whenever it is not obvious what `special1` semantics should be, it is always valid to expand these semantics to cover more cases (although you can always choose the wrong semantics!)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118579
Approved by: https://github.com/eellison, https://github.com/lezcano
2024-02-06 19:45:32 +00:00
Edward Z. Yang
6620176da7 Add documentation for meta device (#119119)
Fixes https://github.com/pytorch/pytorch/issues/119098

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119119
Approved by: https://github.com/bdhirsh
2024-02-04 01:05:22 +00:00
Mikayla Gawarecki
9ffed22391 Document file format returned by torch.save (#118719)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118719
Approved by: https://github.com/albanD
2024-02-03 02:11:44 +00:00
Yu, Guangye
a205e7bf56 [3/4] Intel GPU Runtime Upstreaming for Device (#116850)
# Motivation
According to [[1/4] Intel GPU Runtime Upstreaming for Device](https://github.com/pytorch/pytorch/pull/116019), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), this third PR covers the changes under `libtorch_python`.

# Design
This PR primarily offers device-related APIs in python frontend, including
- `torch.xpu.is_available`
- `torch.xpu.device_count`
- `torch.xpu.current_device`
- `torch.xpu.set_device`
- `torch.xpu.device`
- `torch.xpu.device_of`
- `torch.xpu.get_device_name`
- `torch.xpu.get_device_capability`
- `torch.xpu.get_device_properties`
- ====================
- `torch.xpu._DeviceGuard`
- `torch.xpu._is_compiled`
- `torch.xpu._get_device`

# Additional Context
We will implement the support of lazy initialization in the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116850
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/malfet
2024-02-01 12:31:26 +00:00
Michael Lazos
e426924c19 Change classification to beta for TORCH_LOGS (#118682)
Changes classification of TORCH_LOGS to beta
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118682
Approved by: https://github.com/svekars
2024-01-31 21:50:55 +00:00
CaoE
bacbad5bc9 add GradScaler on CPU (#109993)
Step 2 of https://github.com/pytorch/pytorch/issues/111559.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109993
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-01-29 23:42:35 +00:00
albanD
a40be5f4dc Autograd doc cleanup (#118500)
I don't think we'll realistically go through deprecation for these now since there are a couple of uses of each online. So document appropriately.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118500
Approved by: https://github.com/soulitzer
2024-01-29 21:51:33 +00:00
Will Constable
abe3c55a6a Update DDP dynamo debug docs (#118295)
Refreshes https://github.com/pytorch/pytorch/pull/114201 and updates it to include other log names that also include ddp_optimizer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118295
Approved by: https://github.com/LucasLLC, https://github.com/wanchaol
2024-01-29 14:58:26 +00:00
Tobias Ringwald
62c1e4a578 Added missing CircularPad*d references so the docs are actually built. (#118465)
Fixes #118429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118465
Approved by: https://github.com/Skylion007
2024-01-27 22:39:01 +00:00
Lucas Pasqualin
ff8e33556e Enables load balancing duplicates in DCP (#116469)
Enables the deduplication of saved entries by load balancing duplicates across ranks.

Tested with existing and modified tests. Additionally tested with the following code snippet, which saves a 20GB DDP model in **~3 seconds on 8 ranks**. Before this PR, the same operation was measured at ~19 seconds.

```
def run(local_rank, world_size, param_size, num_params, work_dir):

    os.environ["RANK"] = str(local_rank)
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    device = torch.device(f"cuda:{local_rank}")
    torch.cuda.set_device(device)
    dist.init_process_group(backend="nccl", rank=local_rank, world_size=world_size)

    model = Model(param_size=param_size, num_params=num_params)
    model = DistributedDataParallel(model, gradient_as_bucket_view=True)
    _patch_model_state_dict(model)

    sz = sum(t.nelement() * t.element_size() for t in model.parameters())
    rank_0_print(f"Model size: {sz / 1_000_000_000.0} GB")
    rank_0_print("Saving the model with DCP...")

    checkpointer = _FileSystemCheckpointer(
        f"{work_dir}/dcp",
        sync_files=False,
        single_file_per_rank=False,
        thread_count=1
    )

    begin_ts = time.monotonic()
    checkpointer.save(state_dict={"model": model})
    end_ts = time.monotonic()
    rank_0_print(f"Took {end_ts - begin_ts} seconds with DCP")
```
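
The idea of load balancing duplicates across ranks can be sketched as follows. The description above does not spell out the algorithm, so the greedy "largest first, least-loaded holder" strategy here is an assumption, and `dedup_balanced` is a hypothetical helper rather than a DCP function.

```python
from collections import defaultdict


def dedup_balanced(per_rank_items):
    """per_rank_items: one {key: size_in_bytes} dict per rank.

    Returns one list of keys per rank so that every key is written exactly once.
    """
    holders = defaultdict(list)
    sizes = {}
    for rank, items in enumerate(per_rank_items):
        for key, size in items.items():
            holders[key].append(rank)
            sizes[key] = size

    load = [0] * len(per_rank_items)
    plan = [[] for _ in per_rank_items]
    # Place the largest entries first; ties are broken by key for determinism.
    for key in sorted(sizes, key=lambda k: (-sizes[k], k)):
        # Among the ranks that hold this entry, pick the least-loaded one.
        target = min(holders[key], key=lambda r: (load[r], r))
        plan[target].append(key)
        load[target] += sizes[key]
    return plan


# Under DDP every rank holds the same tensors, so every entry is a duplicate.
replicated = {"w1": 40, "w2": 30, "w3": 20, "w4": 10}
plan = dedup_balanced([replicated, replicated])
print(plan)  # [['w1', 'w4'], ['w2', 'w3']]
```

Without balancing, a naive deduplication would have one rank write all 100 units; here each rank writes 50.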

Differential Revision: [D52435926](https://our.internmc.facebook.com/intern/diff/D52435926/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116469
Approved by: https://github.com/fegin, https://github.com/wz337
2024-01-26 22:34:14 +00:00
Sherlock Huang
6596a3f23d [Export] Remove ScriptObjectMeta (#118241)
Summary: As title. Use CustomObjArgument as ScriptObjectMeta

Test Plan: CIs

Reviewed By: zhxchen17

Differential Revision: D53062230

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118241
Approved by: https://github.com/zhxchen17
2024-01-26 00:37:19 +00:00
drisspg
4e29f01bf2 Remove sdp_kernel and replace with sdpa_kernel in attention namespace (#114689)
# Summary
Simplification of Backend Selection

This PR deprecates the `torch.backends.cuda.sdp_kernel` context manager and replaces it with a new context manager, `torch.nn.attention.sdpa_kernel`. The new context manager also changes the API.

For `sdp_kernel` one would specify the backend choice by taking the negation of what kernel they would like to run. The purpose of this backend manager was only to be a debugging tool: "turn off the math backend" and see if you can run one of the fused implementations.

Problems:
- This pattern makes sense if the majority of users don't care to know anything about the backends that can be run. However, users who reach for this context manager are explicitly trying to run a specific backend.
- This is not scalable. We are working on adding the cudnn backend, and with this API more implementations will need to be turned off if a user wants to explicitly run a given backend.
- Discoverability of the current context manager. It is somewhat unintuitive that this backend manager is in backends/cuda/init when it now also controls the CPU fused kernel behavior. I think centralizing to the attention namespace will be helpful.
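
The scalability problem in the list above can be illustrated with a toy model of the two API styles. Everything here is hypothetical (`Backend`, `allowed_old_style`, `allowed_new_style`); it models opt-out flags versus an opt-in list and does not call the real PyTorch context managers.

```python
from enum import Enum, auto


class Backend(Enum):  # hypothetical stand-in for the real backend enum
    MATH = auto()
    FLASH = auto()
    MEM_EFFICIENT = auto()
    CUDNN = auto()  # a newly added backend


def allowed_old_style(enable_math=True, enable_flash=True, enable_mem_efficient=True):
    """Opt-out flags: this signature predates CUDNN, so callers cannot turn it off."""
    flags = {
        Backend.MATH: enable_math,
        Backend.FLASH: enable_flash,
        Backend.MEM_EFFICIENT: enable_mem_efficient,
        Backend.CUDNN: True,
    }
    return {b for b, on in flags.items() if on}


def allowed_new_style(backends):
    """Opt-in list: name exactly the backends that may run."""
    return set(backends)


old = allowed_old_style(enable_math=False, enable_mem_efficient=False)
new = allowed_new_style([Backend.FLASH])
print(sorted(b.name for b in old))  # ['CUDNN', 'FLASH']
print(sorted(b.name for b in new))  # ['FLASH']
```

With opt-out flags, every new backend forces existing callers to add another "disable" argument; with an opt-in list, code that asks for one backend keeps meaning the same thing.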

Other concerns:
- Typically backends (kernels) for operators are entirely hidden from users and are implementation details of the framework. We have exposed this to users already, albeit not by default and with beta warnings. Does making backend choices even more explicit lead to problems when we potentially want to remove existing backends (perhaps input shapes will get covered by newer backends)?

A nice side effect is now that we aren't using the `BACKEND_MAP` in test_transformers many, many dynamo failures are passing for CPU tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114689
Approved by: https://github.com/cpuhrsch
2024-01-24 22:28:04 +00:00
Zhengxu Chen
abd759d50d [fx] Add hooks to intercept node replacements. (#117825)
Summary: Adding an experimental API to FX graph module to place "hooks" that fire every time we change or replace nodes in a graph, so that we can properly update the new name in the graph signature and potentially other places.

Test Plan:
buck test mode/opt  -c fbcode.enable_gpu_sections=true caffe2/test/distributed/_tensor/experimental:tp_transform

buck test mode/opt caffe2/test:test_export -- -r test_replace_hook

Differential Revision: D52896531

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117825
Approved by: https://github.com/avikchaudhuri
2024-01-23 22:28:40 +00:00
Matteo Migliarini
fdac55c35d Added example regarding weight_decay distinction with per-parameter API (#117436)
Added new example and description regarding per-parameter `weight_decay` distinction for bias and non-bias terms.
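
As a rough sketch of the bias/non-bias distinction, the helper below builds two option dicts in the list-of-dicts shape that `torch.optim` optimizers accept for per-parameter options. It works on parameter names only so that it runs without PyTorch; `split_decay_groups` and the names in `names` are hypothetical, and real code would put the parameter tensors (for example from `model.named_parameters()`) under `"params"`.

```python
def split_decay_groups(param_names, weight_decay):
    """Return two option dicts: biases without decay, everything else with it."""
    bias = [n for n in param_names if n.endswith("bias")]
    other = [n for n in param_names if not n.endswith("bias")]
    return [
        {"params": other, "weight_decay": weight_decay},
        {"params": bias, "weight_decay": 0.0},
    ]


names = ["fc1.weight", "fc1.bias", "fc2.weight", "fc2.bias"]  # hypothetical names
groups = split_decay_groups(names, weight_decay=1e-2)
print(groups[0])  # {'params': ['fc1.weight', 'fc2.weight'], 'weight_decay': 0.01}
print(groups[1])  # {'params': ['fc1.bias', 'fc2.bias'], 'weight_decay': 0.0}
```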

Fixes #115935

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117436
Approved by: https://github.com/janeyx99
2024-01-22 21:26:02 +00:00
Wanchao Liang
2bb2cc0b71 [tp] add clarification to doc and improve TP examples (#117618)
This PR adds a clarification about the evenly sharded assumption in the main
TP doc and improves the TP examples by adding device mesh constructions

fixes https://github.com/pytorch/pytorch/issues/100044

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117618
Approved by: https://github.com/wconstab, https://github.com/awgu
2024-01-22 18:56:50 +00:00
Stas Bekman
86b4b27e26 [docs] start a new FSDP notes doc (#117323)
As discussed on [slack](https://pytorch.slack.com/archives/C3PDTEV8E/p1703699711772289) adding Andrew Gu's advanced FSDP design notes with a few additions from myself based on our discussion.

I hope I did the RST right, I haven't done RST in a while.

- The first section is Andrew's words verbatim + formatting
- The second section is Andrew's words verbatim + formatting + a few of my additions that were confirmed by Andrew, and which hopefully should help understand the process better.

tagging @albanD as requested.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117323
Approved by: https://github.com/awgu
2024-01-22 15:46:35 +00:00
PyTorch MergeBot
02209b5880 Revert "[docs] start a new FSDP notes doc (#117323)"
This reverts commit 7f474da6bc.

Reverted https://github.com/pytorch/pytorch/pull/117323 on behalf of https://github.com/awgu due to broke docs ([comment](https://github.com/pytorch/pytorch/pull/117323#issuecomment-1902740900))
2024-01-21 19:47:27 +00:00
Stas Bekman
7f474da6bc [docs] start a new FSDP notes doc (#117323)
As discussed on [slack](https://pytorch.slack.com/archives/C3PDTEV8E/p1703699711772289) adding Andrew Gu's advanced FSDP design notes with a few additions from myself based on our discussion.

I hope I did the RST right, I haven't done RST in a while.

- The first section is Andrew's words verbatim + formatting
- The second section is Andrew's words verbatim + formatting + a few of my additions that were confirmed by Andrew, and which hopefully should help understand the process better.

tagging @albanD as requested.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117323
Approved by: https://github.com/albanD, https://github.com/awgu
2024-01-21 15:11:24 +00:00
suo
4057d005ff Initial torchbind support in PT2 (#117697)
This PR adds the bare minimum functionality to get torchbind working in an e2e testable way on PT2.

It implements:
* ProxyTensor support
* Simple torch.export support (proxytensor-only path, e.g. non-strict).
* add some tests exercising the path.

Because all this is not fully baked, I hide the functionality behind a feature flag (`enable_torchbind_tracing()`) so it does not affect regular users for now.

Still on the agenda:
* Dynamo support
* Actual FakeMode support
* Mutability support

Hoping to get this first bit in as a standalone, as it will unblock some more extensive experimentation/testing going on internally.

Differential Revision: [D51825372](https://our.internmc.facebook.com/intern/diff/D51825372/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117697
Approved by: https://github.com/SherlockNoMad
2024-01-19 06:28:20 +00:00
PyTorch MergeBot
2f84a9d37c Revert "[CUDNN][SDPA] Experimental cuDNN Flash Attention v2 Inference (#115663)"
This reverts commit 5aa92b5090.

Reverted https://github.com/pytorch/pytorch/pull/115663 on behalf of https://github.com/PaliC due to Unfortunately, this pr breaks cuda builds internally ([comment](https://github.com/pytorch/pytorch/pull/115663#issuecomment-1899388813))
2024-01-18 23:40:30 +00:00
Angela Yi
92d718aed1 [export] Add lifted constant obj to input (#116985)
Test Plan: wip

Differential Revision: D52556070

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116985
Approved by: https://github.com/suo
2024-01-18 22:10:53 +00:00
suo
ccc8440609 [export] introduce WrapperModule (#117571)
Simple module to wrap a callable. This is a useful utility for when we start requiring that torch.export take an nn.Module.

Differential Revision: [D52791310](https://our.internmc.facebook.com/intern/diff/D52791310/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117571
Approved by: https://github.com/tugsbayasgalan, https://github.com/avikchaudhuri
ghstack dependencies: #117570
2024-01-18 03:40:34 +00:00
Eddie Yan
5aa92b5090 [CUDNN][SDPA] Experimental cuDNN Flash Attention v2 Inference (#115663)
#113713

Going to clean up some of the checks and will remove draft status after.
Can be tested on SM80+ with `TORCH_CUDNN_MHA_ENABLED=1`.

CC @drisspg @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115663
Approved by: https://github.com/drisspg
2024-01-18 01:20:36 +00:00
Kurman Karabukaev
a60b566d37 [TorchElastic] Support for overprovisioning in C10 based rendezvous (#117066)
Summary:
Allow TorchElastic to manage more nodes than the maximum nnodes specified in a job. The extra nodes will be used as spare capacity/warm nodes for schedulers that support elasticity.

RFC: https://github.com/pytorch/pytorch/issues/114097

Test Plan: Integration tests

Differential Revision: D52343874

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117066
Approved by: https://github.com/zdevito
2024-01-18 01:16:55 +00:00
Peter Bell
001585f446 [fx][inductor] Add statically_known_true utility for SymBool (#117359)
This adds a function `statically_known_true` for `SymBool` that works
like inductor's `is_expr_static_and_true`. That is, it tries to simplify the
expression to a constant or returns `False` if it cannot be simplified.

This is useful in cases that can be optimized if the condition is met;
otherwise it doesn't affect correctness, so we can avoid adding guards.

I also use this new function in inductor for `FakeTensorUpdater` and
`remove_noop_pass` which both generated unexpected guards previously.
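
The contrast between an ordinary guarded check and `statically_known_true` can be sketched with a toy range-based model. This is an illustrative assumption, not the FX implementation: `try_static`, `guard_ge`, the module-level `guards` list, and this simplified `statically_known_true(lower, upper, bound)` signature are all hypothetical.

```python
import math

guards = []  # guards recorded by the hypothetical tracer


def try_static(lower, upper, bound):
    """Decide `x >= bound` from the range [lower, upper]; None if undecidable."""
    if lower >= bound:
        return True
    if upper < bound:
        return False
    return None


def guard_ge(lower, upper, bound, hint):
    # Ordinary evaluation: fall back to the example value and record a guard.
    result = try_static(lower, upper, bound)
    if result is None:
        result = hint >= bound
        guards.append(f"x >= {bound} is {result}")
    return result


def statically_known_true(lower, upper, bound):
    # Never guards: anything not provably true is reported as False.
    return try_static(lower, upper, bound) is True


print(statically_known_true(4, math.inf, 2))  # True
print(statically_known_true(0, math.inf, 2))  # False
print(guards)                                 # []
print(guard_ge(0, math.inf, 2, hint=8))       # True
print(guards)                                 # ['x >= 2 is True']
```

A caller that only uses the answer to pick a faster path can take the `False` branch safely, which is why no guard is needed.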

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117359
Approved by: https://github.com/lezcano
2024-01-15 18:01:10 +00:00
Sai-Pra
19502ff6aa Fixed typo in build_activation_images.py (#117458)
In line 24 of build_activation_images.py, I changed "programmaticly" to "programmatically" to be grammatically correct.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117458
Approved by: https://github.com/malfet
2024-01-15 03:27:40 +00:00
vasiliy
a6d33614d6 add float8 types to dtypes table (#117375)
Summary:

As titled

Test Plan:

CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117375
Approved by: https://github.com/ezyang
2024-01-15 00:23:07 +00:00
Edward Z. Yang
d006cae2a8 Update documentation for unsigned int types (#116804)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116804
Approved by: https://github.com/albanD
ghstack dependencies: #116595, #116803
2024-01-08 22:02:10 +00:00
Guo Yejun
5323b2daa5 [docs] add mode="reduce-overhead" into torch.compile to enable cuda g… (#116529)
…raph

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116529
Approved by: https://github.com/eellison
2024-01-05 22:54:20 +00:00
Angela Yi
6413511713 [export][refactor][4/n] Make equality_constraints optional (#116233)
Summary: needed to remove equality_constraints eventually :P

Test Plan: CI

Differential Revision: D52351709

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116233
Approved by: https://github.com/tugsbayasgalan
2024-01-05 00:50:52 +00:00
Mikayla Gawarecki
0f6f582c0d Add config to disable TransformerEncoder/MHA fastpath (#112212)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112212
Approved by: https://github.com/jbschlosser
2024-01-02 23:59:30 +00:00
lezcano
b18d8d4595 Add a wrapper to transform a NumPy function into a PyTorch function (#114610)
A less general version of this wrapper was used in the keynote on
`torch.compile(numpy)`. We expose a generic version of the wrapper
that works seamlessly with `torch.compile`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114610
Approved by: https://github.com/albanD
2024-01-02 18:35:29 +00:00
Anupam Bhatnagar
4371939751 Removing HTA documentation (#116513)
Removing HTA documentation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116513
Approved by: https://github.com/aaronenyeshi, https://github.com/malfet, https://github.com/atalman
2023-12-28 23:04:23 +00:00
angelayi
6b91e6907e Add setUserEnabledNNPACK config (#116152)
When exporting a model with a convolution kernel on cpu, if mkldnn is disabled and nnpack is enabled, export will go down the nnpack optimized convolution kernel for certain shapes ((code pointer)[cd449e260c/aten/src/ATen/native/Convolution.cpp (L542-L552)]). This means that we will automatically create a guard on that certain shape. If users want to export without any restrictions, one option is to disable nnpack. However, no config function exists for this, so this PR is adding a config function, similar to the `set_mkldnn_enabled` function.

Original context is in https://fb.workplace.com/groups/1075192433118967/posts/1349589822345892/?comment_id=1349597102345164&reply_comment_id=1349677642337110.

To test the flag, the following script runs successfully:
```
import os

import torch
from torchvision.models import ResNet18_Weights, resnet18

torch.set_float32_matmul_precision("high")

model = resnet18(weights=ResNet18_Weights.DEFAULT)
model.eval()

with torch.no_grad():
    # device = "cuda" if torch.cuda.is_available() else "cpu"
    torch.backends.mkldnn.set_flags(False)
    torch.backends.nnpack.set_flags(False)   # <--- Added config
    device = "cpu"
    model = model.to(device=device)
    example_inputs = (torch.randn(2, 3, 224, 224, device=device),)
    batch_dim = torch.export.Dim("batch", min=2, max=32)
    so_path = torch._export.aot_compile(
        model,
        example_inputs,
        # Specify the first dimension of the input x as dynamic
        dynamic_shapes={"x": {0: batch_dim}},
        # Specify the generated shared library path
        options={
            "aot_inductor.output_path": os.path.join(os.getcwd(), "resnet18_pt2.so"),
            "max_autotune": True,
        },
    )

```

I'm not sure who to add as reviewer, so please feel free to add whoever is relevant!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116152
Approved by: https://github.com/malfet
2023-12-27 06:00:16 +00:00
Lucas Pasqualin
b342286646 adds async save, makes checkpointer private (#116293)
Adds Async Save and also makes `Checkpointer` classes private.

The original PR was here: https://github.com/pytorch/pytorch/pull/115864

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116293
Approved by: https://github.com/fegin
2023-12-22 05:22:39 +00:00
suo
bc3ef1684e [export] refactor unflatten.py to be a top-level API (#115466)
This is in preparation for the merging of the internal and external versions of
the unflattener. Unflatten needs to be its own API because we are adding more
options to it in forthcoming diffs.

Differential Revision: [D52001133](https://our.internmc.facebook.com/intern/diff/D52001133/)

@diff-train-skip-merge
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115466
Approved by: https://github.com/zhxchen17
2023-12-21 20:52:29 +00:00
Damien
2d2016fdf8 WIP Add compatibility with channels_last_3d for conv3d (#114790)
Part of a multi-PR work to fix #59168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114790
Approved by: https://github.com/albanD
2023-12-20 19:28:25 +00:00
Bin Bao
fabf9433e7 [AOTI][refactor] Organize model runner files (#116022)
Summary: Move runner util files into a subdirectory and put AOTIModelContainerRunnerCpu into a separate file

Differential Revision: [D52300693](https://our.internmc.facebook.com/intern/diff/D52300693)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116022
Approved by: https://github.com/khabinov
2023-12-20 15:35:34 +00:00
FFFrog
327bdcdb14 Some tiny modification about torch.set/get_default_device (#116014)
1. fix bug of torch.set_default_device in multi-threading
2. add new interface named torch.get_default_device

Fixes #115333
Fixes #115917

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116014
Approved by: https://github.com/malfet, https://github.com/jansel
2023-12-19 05:08:06 +00:00
Wanchao Liang
61abacf829 [tp] improve documentation (#115880)
Improve the TP documentation in terms of format and descriptions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115880
Approved by: https://github.com/XilunWu
2023-12-15 18:44:22 +00:00
Will Constable
28e4004286 Add doc for torch.distributed.breakpoint (#115656)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115656
Approved by: https://github.com/wanchaol, https://github.com/fegin
ghstack dependencies: #115705
2023-12-14 14:45:36 +00:00
angelayi
dd9a989b83 [export][reland][refactor][1/n] Split dynamic shapes (#115556)
Reland of https://github.com/pytorch/pytorch/pull/114764
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115556
Approved by: https://github.com/zhxchen17
2023-12-12 05:36:41 +00:00
atalman
b88be1686d Revert "[export][refactor][1/n] Move dynamic shapes logic (#114764)" (#115508)
GitHub first oncall.
This reverts commit 53bf8cfcf9.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115508
Approved by: https://github.com/malfet, https://github.com/angelayi
2023-12-11 14:54:51 +00:00
William Wen
f614ed78b8 [docs, dynamo] fix typos in dynamo custom backend docs (#115444)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115444
Approved by: https://github.com/eellison
2023-12-08 23:58:26 +00:00
albanD
a2b89154bf New swap function (#111747)
This PR is proposing a new approach to solve the nn/optim only linked by python object identity problem.
The idea is to have a function that can swap the content of two Tensors t1 and t2 while preserving all the old references.
This would allow us to swap the `model.weight` with a new Tensor (can be any subclass of Tensor and any TensorImpl (xla, sparse, nested tensorimpl would work)). The use within nn will be done in a follow up.

This is done by swapping the whole content of the PyObject and then putting back the fields associated with external references (refcount, gc tracking and weakrefs).
Note that we have to properly handle all the cases where there is memory used before the public pointer PyObject* and where the PyObject is bigger due to dict/weakref being inlined (older CPython version) or due to slots.

The main limitation of this approach is that the number of slots needs to match for the objects being swapped, which limits the usage of slots in subclasses.
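
A pure-Python analogy of "swap the content while preserving all the old references" is shown below. It swaps `__class__` and `__dict__` on ordinary objects, which is far simpler than the C-level PyObject swap this PR describes; `Param`, `QuantizedParam`, and `swap_contents` are hypothetical names used only for this sketch.

```python
class Param:
    def __init__(self, data):
        self.data = data


class QuantizedParam(Param):  # hypothetical subclass
    pass


def swap_contents(a, b):
    # Swap type and attributes; the objects' identities stay where they were.
    a.__class__, b.__class__ = b.__class__, a.__class__
    a.__dict__, b.__dict__ = b.__dict__, a.__dict__


weight = Param([1.0, 2.0])
optimizer_state = {"params": [weight]}   # holds a reference by identity
replacement = QuantizedParam([9, 9])

swap_contents(weight, replacement)

held = optimizer_state["params"][0]
print(held is weight, type(held).__name__, held.data)  # True QuantizedParam [9, 9]
```

The container never had to be updated: it still points at the same object, which now carries the new type and data. The analogy also has a similar restriction, since Python rejects `__class__` assignment between classes whose object layouts differ (for example, mismatched `__slots__`).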

Draft right now to see what @colesbury thinks about doing this?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111747
Approved by: https://github.com/colesbury
2023-12-08 18:49:35 +00:00
Linus
5f2ff29569 Fix typo in https://pytorch.org/docs/stable/sparse.html (#115282)
Fixes #111473

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115282
Approved by: https://github.com/svekars
2023-12-08 18:31:33 +00:00
Wongboo
68f74dd162 Add python and C++ support for LPPool3d (#114199)
Adds Python and C++ support for LPPool3d. Fixes #114114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114199
Approved by: https://github.com/mikaylagawarecki
2023-12-08 18:18:44 +00:00
Iris Zhang (PyTorch)
23fa9621e4 [DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099) (#115193)
Summary:

Rename _device_mesh.py to device_mesh.py, update all callsites, add documentation.
We created stubs for the public classes and methods in torch.distributed.device_mesh so that torch.distributed.device_mesh can be imported whether or not distributed is available.

Original diff reverted: D51629761
Original PR reverted: https://github.com/pytorch/pytorch/pull/115099
Prior to landing, all CI signals passed. Shipit added the "ci/trunk" label to the PR but DID NOT wait for it and went ahead with committing. More context can be found in the reverted PR above.

Test Plan: CI.

Differential Revision: D51861018

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115193
Approved by: https://github.com/fegin
2023-12-08 08:44:32 +00:00
Lucas Pasqualin
5432088098 Adds Checkpointer Wrapper for DCP [3/N] (#114603)
Adds a useful high level wrapper for calling `dist.save/load` with the correct storage readers and writers.

Instead of doing:

```
DCP.save(
    state_dict={...},
    storage_writer=StorageWriter(...)
)

DCP.load(
    state_dict={...},
    storage_reader=StorageReader(...)
)
```

We can now do:

```
checkpointer = Checkpointer(...)

checkpointer.save(state_dict={...})
checkpointer.load(state_dict={...})
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114603
Approved by: https://github.com/fegin, https://github.com/wz337
2023-12-08 01:03:21 +00:00
Howard Huang
3e66385ddd Add Work to distributed docs (#115172)
Summary:
Documenting the `Work` object

For a collective (broadcast, all_reduce, etc.) called with async_op=True, we return a `Work` object on which users can call `.wait()`, `.is_success()`, among other things, but this class is not documented.

Test Plan: Preview the docs build in OSS

Differential Revision: D51854974

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115172
Approved by: https://github.com/wconstab
2023-12-07 18:12:10 +00:00
angelayi
53bf8cfcf9 [export][refactor][1/n] Move dynamic shapes logic (#114764)
1/n of refactoring export code:
* Moved dynamic shapes/constraints/dynamic_dims logic in torch/_export/__init__.py and torch/export/__init__.py to torch/export/dynamic_shapes.py

Differential Revision: [D51823962](https://our.internmc.facebook.com/intern/diff/D51823962)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114764
Approved by: https://github.com/ydwu4
2023-12-06 16:46:38 +00:00
drisspg
d4c79a3078 Add an attention bias subclass for a lower right causal masking (#114823)
# Summary
This PR introduces a new Tensor subclass that is designed to be used with torch.nn.functional.scaled_dot_product_attention. Currently we have a boolean `is_causal` flag that allows users to do causal masking without the need to actually create the "realized" attention bias and pass it into sdpa. We originally added this flag since there is native support in both fused kernels we support. This provides a big performance gain (the kernels only need to iterate over ~0.5x the sequence), and for very large sequence lengths this can provide very large memory improvements.

The flag was introduced early on in the kernel development, and at the time it implicitly meant "upper_left" causal attention. This distinction only matters when the attention_bias is not square. For a more detailed breakdown see: https://github.com/pytorch/pytorch/issues/108108. The kernels' default behavior has since changed, largely due to the rise of autoregressive text generation. And unfortunately this would lead to a BC break. In the long term it may actually be beneficial to change the default meaning of `is_causal` to represent lower_right causal masking.
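
To illustrate why the alignment only matters for non-square biases, here is a minimal pure-Python sketch. The helper name `causal_mask` and its `alignment` argument are hypothetical and are not the API added by this PR; `True` means the query position may attend to the key position.

```python
def causal_mask(q_len: int, k_len: int, alignment: str) -> list[list[bool]]:
    """Build a q_len x k_len causal mask (hypothetical helper, not the PR's API)."""
    if alignment == "upper_left":
        # The diagonal starts at the top-left corner of the mask.
        offset = 0
    elif alignment == "lower_right":
        # The diagonal ends at the bottom-right corner of the mask.
        offset = k_len - q_len
    else:
        raise ValueError(f"unknown alignment: {alignment!r}")
    return [[j <= i + offset for j in range(k_len)] for i in range(q_len)]


# Non-square case: 2 queries, 4 keys. The two alignments differ.
upper = causal_mask(2, 4, "upper_left")
lower = causal_mask(2, 4, "lower_right")

# Square case: both alignments produce the same mask.
square_equal = causal_mask(3, 3, "upper_left") == causal_mask(3, 3, "lower_right")
```

With 2 queries and 4 keys, the last query sees only the first two keys under upper-left alignment but all four keys under lower-right alignment, which is the case that matters when decoding with a cache of earlier keys.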

The larger theme, though, is laid out here: https://github.com/pytorch/pytorch/issues/110681. The thesis is that there is a lot of innovation in SDPA revolving around the attention_bias being used. This is the first of hopefully a few more attention_biases that we would like to add. The next interesting one would be `sliding_window`, which is used by the popular mistral model family.

Results from benchmarking are below. I improved the meff_attention perf, hence the slightly decreased max perf.
```Shell
+---------+--------------------+------------+-----------+-----------+-----------+-----------+----------------+----------+
|  Type   |      Speedup       | batch_size | num_heads | q_seq_len | k_seq_len | embed_dim |     dtype      | head_dim |
+---------+--------------------+------------+-----------+-----------+-----------+-----------+----------------+----------+
| Average | 1.2388050062214226 |            |           |           |           |           |                |          |
|   Max   | 1.831672915579016  |    128     |    32     |   1024    |   2048    |   2048    | torch.bfloat16 |    64    |
|   Min   | 0.9430534166730135 |     1      |    16     |    256    |    416    |   2048    | torch.bfloat16 |   128    |
+---------+--------------------+------------+-----------+-----------+-----------+-----------+----------------+----------+
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114823
Approved by: https://github.com/cpuhrsch
2023-12-06 08:29:26 +00:00
Joel Schlosser
22704426c3 Expand dynamic dims support for traceable subclasses (#114311)
Continuation of #112185, following the design in this [doc](https://docs.google.com/document/d/1ipSxcTzEMMOAPvxP-YJlD5JBZZmIGgh8Q34ixtOUCRo).

Summary:
* Introduce `SubclassSymbolicPolicy` containing separate dynamic dim / constraint policies for the outer and inner tensors
    * Expand the automatic dynamic algorithm to recurse into inner tensors and produce one of these for a subclass instance
    * Maintain legacy behavior for subclasses by recursively calling `mark_dynamic()` on inner tensors *of the same dim as outer* when `mark_dynamic(outer, ...)` is called
    * Addresses this: 6a86cf00ad/torch/_dynamo/variables/builder.py (L1750)
* Add `outer_size` and `outer_stride` arguments to `__tensor_unflatten__()` so that you can find out what symbols were allocated for the outer size / stride (you are expected to return a tensor that compares equal to the outer symbols)
    * Signatures now:
    ```python
    # attrs is a list of inner tensor attributes on x; inner_tensor = getattr(x, attr)
    # ctx is anything useful for rebuilding the class we want to guard on
    attrs, ctx = x.__tensor_flatten__()
    ...
    # inner_tensors is a dict of {attr -> tensor}
    # ctx is taken unmodified from flattening and (eventually) guarded on
    # outer_size is the expected size of the output; possibly symbolic
    # outer_stride is the expected strides of the output; possibly symbolic
    y = MySubclass.__tensor_unflatten__(inner_tensors, ctx, outer_size, outer_stride)

    # at the __tensor_unflatten__() call-site in PT2, we assert y.shape == outer_size and y.stride() == outer_stride
    # the assert simplifies symbols when there are relationships between outer and inner symbols
    ```
    * Size info needed for `NestedTensor` at least, stride info needed for `DTensor` at least
    * Punting on `outer_storage_offset` because storage_offset handling is horribly broken in PT2 right now
* ~~Add new `__tensor_mark_dynamic__()` to allow overriding the behavior of mark_dynamic on a per-subclass basis~~ (booted to future work)
* ~~Add guards for tensor subclasses by calling `__tensor_flatten__()` in the guard to test equality on `ctx`~~
    * Now handled in #114469
* Next PR: add TENSOR_MATCH guards on inner tensors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114311
Approved by: https://github.com/ezyang, https://github.com/drisspg, https://github.com/voznesenskym, https://github.com/bdhirsh
2023-12-05 21:09:25 +00:00
angelayi
5fdae89c03 [docs][aoti] Link to export docs in AOTI docs (#115088)
Context: https://fb.workplace.com/groups/1075192433118967/posts/1341833143121560/?comment_id=1341841786454029

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115088
Approved by: https://github.com/desertfire
2023-12-05 20:22:42 +00:00
Anupam Bhatnagar
85d4708512 HTA docs (#115060)
Added documentation for Holistic Trace Analysis

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115060
Approved by: https://github.com/aaronenyeshi
2023-12-05 19:38:09 +00:00
Nikita Shulga
a827ac71f2 Revert "[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099)"
This reverts commit eaa64339d6.
2023-12-05 08:59:36 -08:00
Iris Zhang (PyTorch)
eaa64339d6 [DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099)
Summary:
Rename _device_mesh.py to device_mesh.py, update all callsites, and add documentation.

Original diff reverted: D51629761
Original PR reverted: https://github.com/pytorch/pytorch/pull/114991
It was failing a public module binding test on macOS, due to the change in import order in torch/distributed/fsdp/_common_utils.py. Since the original import would still work, we remove the changes in this file.

Test Plan: CI.

Differential Revision: D51825114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115099
Approved by: https://github.com/wanchaol, https://github.com/fegin
2023-12-05 05:44:52 +00:00
soulitzer
a7bcc78bff Make it clearer that current selective AC is PT2-only and private (#115081)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115081
Approved by: https://github.com/albanD
2023-12-04 23:01:22 +00:00
Xuehai Pan
55064a4ef9 [BE] add parentheses to kwargs unpacking func(*args, **(kwargs or {})) (#115026)
This PR adds parentheses to kwargs unpacking `func(*args, **(kwargs or {}))` for better code readability.

The forms with and without the parentheses are semantically equivalent because they produce the same bytecode.

```console
$ echo "func(*args, **kwargs or {})" | python3 -m dis -
  0           0 RESUME                   0

  1           2 PUSH_NULL
              4 LOAD_NAME                0 (func)
              6 LOAD_NAME                1 (args)
              8 BUILD_MAP                0
             10 LOAD_NAME                2 (kwargs)
             12 JUMP_IF_TRUE_OR_POP      1 (to 16)
             14 BUILD_MAP                0
        >>   16 DICT_MERGE               1
             18 CALL_FUNCTION_EX         1
             20 POP_TOP
             22 LOAD_CONST               0 (None)
             24 RETURN_VALUE

$ echo "func(*args, **(kwargs or {}))" | python3 -m dis -
  0           0 RESUME                   0

  1           2 PUSH_NULL
              4 LOAD_NAME                0 (func)
              6 LOAD_NAME                1 (args)
              8 BUILD_MAP                0
             10 LOAD_NAME                2 (kwargs)
             12 JUMP_IF_TRUE_OR_POP      1 (to 16)
             14 BUILD_MAP                0
        >>   16 DICT_MERGE               1
             18 CALL_FUNCTION_EX         1
             20 POP_TOP
             22 LOAD_CONST               0 (None)
             24 RETURN_VALUE
```
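
The same equivalence can be checked programmatically. The sketch below is an illustration, not part of the PR: it compares the parsed AST and the compiled bytecode of the two spellings using only the standard library.

```python
import ast

bare = "func(*args, **kwargs or {})"
parenthesized = "func(*args, **(kwargs or {}))"

# ast.dump() omits line/column attributes by default, so the extra
# parentheses do not affect the comparison.
same_ast = ast.dump(ast.parse(bare)) == ast.dump(ast.parse(parenthesized))

# Compile both expressions (without evaluating them) and compare raw bytecode.
same_bytecode = (
    compile(bare, "<bare>", "eval").co_code
    == compile(parenthesized, "<parenthesized>", "eval").co_code
)

print(same_ast, same_bytecode)
```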
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115026
Approved by: https://github.com/Skylion007
2023-12-03 20:03:26 +00:00
PyTorch MergeBot
3a2e2044cd Revert "[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#114710) (#114991)"
This reverts commit 729ac7317a.

Reverted https://github.com/pytorch/pytorch/pull/114991 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/114991#issuecomment-1837214567))
2023-12-02 17:55:51 +00:00
Wanchao Liang
28925902fa [TP] fully rewrite Tensor Parallel APIs (#114732)
This PR rewrites the Tensor Parallel implementation. The Tensor Parallel
APIs are supposed to be a very thin wrapper over the DTensor APIs, but the
current implementation got too messy and buggy, and it is really hard to
debug what went wrong when using it. It is crucial for advanced users
and developers to understand the API and its implementation easily,
without going through all the different types of functions and utils, so
that they can trust what happens under the hood.

In particular this PR:

* Make ParallelStyle a real contract API for parallelize_module to
  take. Each concrete ParallelStyle only needs to implement `apply` to
  apply the sharding to an nn.Module; remove all unnecessary fields. This
  also enables easier ParallelStyle authoring going forward.
* Keep the ColwiseParallel and RowwiseParallel public interface, but
  refactor them so that the parameter sharding and the input/output
  handling live within the style itself, making it easy to understand
  how Linear/Embedding layers are sharded and how the input/output
  transformations are performed.
* Remove all the private _prepare_input/_prepare_output_fn fields for
  both ColwiseParallel and RowwiseParallel. Since we have thrown
  deprecation messages in nightly for a while, TP is a prototype
  release, and the fields are private, it should be safe to remove them.
* Refactor the recently landed PrepareModuleInput/Output styles: change
  output_layouts to desired_input/output_layouts and group the
  functions inside the style itself. These two styles have no default
  arguments, so users need to specify them and think about the sharding
  layouts. Fix bugs about not handling the `use_local_output` flag.
* Make default arguments None instead of a Placement object; it is
  standard Python practice not to use a custom object instance as a
  default argument.
* Remove all dead APIs (i.e. the PairwiseParallel and SequenceParallel
  styles and all prepare input/output functions), as we have thrown
  deprecation messages for a while and are in the process of removing
  all of them from the tests.
* Throw a deprecation warning for `tp_mesh_dim`, as we recommend using
  device mesh slicing/indexing instead of manually specifying the mesh dim.
* Rewrite the documentation for every ParallelStyle and make it
  clearer about what each style is doing.
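
As a rough illustration of two points in the list above (a style contract with a single `apply` method, and `None` as the default argument instead of an object instance), here is a minimal pure-Python sketch. All names in it (`Layout`, `Style`, `ColwiseSketch`, and the dict used as a stand-in for a module) are hypothetical and do not reflect the actual torch.distributed.tensor.parallel signatures.

```python
from abc import ABC, abstractmethod
from typing import Optional


class Layout:
    """Hypothetical stand-in for a placement/layout object."""

    def __init__(self, name: str) -> None:
        self.name = name


class Style(ABC):
    """Hypothetical contract: a concrete style implements only `apply`."""

    @abstractmethod
    def apply(self, module: dict) -> dict:
        ...


class ColwiseSketch(Style):
    def __init__(self, input_layout: Optional[Layout] = None) -> None:
        # None is the default in the signature; the fallback object is
        # created per instance, so nothing is shared between instances.
        if input_layout is None:
            input_layout = Layout("replicate")
        self.input_layout = input_layout

    def apply(self, module: dict) -> dict:
        # Return an annotated copy instead of mutating the input.
        return {**module, "sharding": "colwise", "input_layout": self.input_layout.name}


first = ColwiseSketch()
second = ColwiseSketch()
result = first.apply({"name": "fc1"})
```

Because the fallback `Layout` is built inside `__init__`, `first` and `second` hold distinct objects, which avoids the shared-default pitfall that a `Layout(...)` expression in the signature would introduce.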

TODOs:
* Rewrite TP tests to adjust for the changes we have in this PR
* add more tests to guard the bug fixes

Differential Revision: [D51761183](https://our.internmc.facebook.com/intern/diff/D51761183)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114732
Approved by: https://github.com/wz337, https://github.com/fduwjj
2023-12-02 08:18:12 +00:00
Iris Zhang (PyTorch)
729ac7317a [DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#114710) (#114991)
Summary:

Same content of changes as https://github.com/pytorch/pytorch/pull/114710

Rename _device_mesh.py to device_mesh.py, update all callsites, and add documentation.
ghstack-source-id: 208980207
exported-using-ghexport

Test Plan: CI.

Reviewed By: wanchaol

Differential Revision: D51629761

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114991
Approved by: https://github.com/wanchaol, https://github.com/fduwjj, https://github.com/fegin
2023-12-02 04:39:41 +00:00
Rohan Varma
3c78ea4c9d [DDP][Compile] Test to Ensure torch.compile works w/static_graph=True (#114621)
Resolves https://github.com/pytorch/pytorch/issues/93672. This was
actually fixed by https://github.com/pytorch/pytorch/pull/103487 but I didn't
realize that PR also fixes torch compile at the time.

Differential Revision: [D51596148](https://our.internmc.facebook.com/intern/diff/D51596148/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114621
Approved by: https://github.com/wconstab
2023-12-01 22:18:45 +00:00
Lucas Pasqualin
f073dcd4f7 Stateful Checkpointing for Distributed [1/N] (#113867)
First pass at adding a save/load API, as well as definition of Stateful objects.

Amongst a couple todo's, we still need to explore adding an `all_gather` & potentially a `barrier` while iterating through state keys.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113867
Approved by: https://github.com/fegin, https://github.com/wz337
2023-12-01 19:21:03 +00:00
Philip Meier
373f2060ba fix extending torch native API docs (#114863)
Couldn't think of a better `release notes:` label. Feel free to set a more fitting one
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114863
Approved by: https://github.com/mikaylagawarecki
2023-12-01 06:09:35 +00:00
Jerry Zhang
64fd706b21 [quant][pt2e] Add generate_numeric_debug_handle pass (#114315)
Summary:
This is a util for numeric suite in pt2 export so that we can build
a more streamlined UX for numerical debugging in quant + executorch stack

Test Plan:
python test/test_quantization.py TestGenerateNumericDebugHandle

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114315
Approved by: https://github.com/zhxchen17
2023-12-01 03:38:17 +00:00
William Wen
38ae17d166 [dynamo, docs] update dynamo backend registration docs (#114820)
Update docs to reflect current backend registration API. Add `lookup_backend` to root `dynamo` module.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114820
Approved by: https://github.com/eellison
2023-11-30 21:41:05 +00:00
Nikita Shulga
a9d5133207 [ez][doc] Fix sample code in onnx_dynamo.rst (#114770)
By adding `import torch.nn as nn`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114770
Approved by: https://github.com/atalman, https://github.com/thiagocrepaldi
2023-11-29 19:27:52 +00:00
Guo Yejun
4aa2c51a09 [doc] fix typo on graph 3 that is recorded (#114666)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114666
Approved by: https://github.com/eellison
2023-11-28 20:40:13 +00:00
Guo Yejun
4a35ec3c0e [docs] correct the code for cudagraph trees integration (#114583)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114583
Approved by: https://github.com/eellison
2023-11-28 20:28:52 +00:00
lezcano
4ba3e6758d Canonicalize runtime asserts (#114509)
This allows us to remove quite a few redundant runtime asserts, and potentially a number of guards as well.

On
```
python test/dynamo/test_subclasses.py -k test_unbind
```
we go from
```
inserting runtime assert i0 <= s0
inserting runtime assert 0 <= -i0 + s0
inserting runtime assert i0 + i1 <= s0
inserting runtime assert i0 <= -i1 + s0
inserting runtime assert i0 + i1 + i2 <= s0
inserting runtime assert i0 + i1 <= -i2 + s0
inserting runtime assert Eq(i0 + i1 + i2 + i3, s0)
inserting runtime assert i0 + i1 + i2 + i3 <= s0
inserting runtime assert i0 + i1 + i2 <= -i3 + s0
```
to
```
inserting runtime assert i0 - s0 <= 0
inserting runtime assert i0 + i1 - s0 <= 0
inserting runtime assert i0 + i1 + i2 - s0 <= 0
inserting runtime assert Eq(i0 + i1 + i2 + i3, s0)
```
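
The idea of collapsing equivalent inequalities can be sketched without any symbolic library. The code below is only an illustration under simplifying assumptions (linear expressions stored as `{symbol: coefficient}` dicts, no constant terms); the helper `canonicalize_le` is hypothetical and is not the implementation in this PR. It rewrites `lhs <= rhs` as `lhs - rhs <= 0` and uses the result as a deduplication key for the first six inequalities from the log above.

```python
def canonicalize_le(lhs: dict, rhs: dict) -> tuple:
    """Rewrite `lhs <= rhs` as `lhs - rhs <= 0` and return a hashable key."""
    coeffs = dict(lhs)
    for sym, coeff in rhs.items():
        coeffs[sym] = coeffs.get(sym, 0) - coeff
    # Drop zero coefficients and sort so equal expressions compare equal.
    return tuple(sorted((s, c) for s, c in coeffs.items() if c != 0))


asserts = [
    ({"i0": 1}, {"s0": 1}),                      # i0 <= s0
    ({}, {"i0": -1, "s0": 1}),                   # 0 <= -i0 + s0
    ({"i0": 1, "i1": 1}, {"s0": 1}),             # i0 + i1 <= s0
    ({"i0": 1}, {"i1": -1, "s0": 1}),            # i0 <= -i1 + s0
    ({"i0": 1, "i1": 1, "i2": 1}, {"s0": 1}),    # i0 + i1 + i2 <= s0
    ({"i0": 1, "i1": 1}, {"i2": -1, "s0": 1}),   # i0 + i1 <= -i2 + s0
]

unique = {canonicalize_le(lhs, rhs) for lhs, rhs in asserts}
print(len(asserts), "->", len(unique))
```

Each pair in the list maps to the same key, so six inequalities reduce to three. The sketch does not model the remaining reduction in the log, where the inequality implied by the `Eq(...)` assert is also dropped.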

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114509
Approved by: https://github.com/voznesenskym
2023-11-28 01:38:47 +00:00
voznesenskym
081c5b3adc Add Stateful/Stateless symbolic contexts, use fresh fake mode for dynamo backends (#113926) (#114526)
Summary:

The primary problem we are setting out to solve here is fake tensor freshness. Before this PR, fake tensors after dynamo represented fake tensors *at the end* of trace, so subsequent retraces like aot_autograd would start off with fake tensors in the wrong (end result) state, rather than their expected fresh state. The solution here is to start a fresh fake mode, and re-fakify the tensors. The nuance comes from ensuring that symbols are uniformly created for the symbolic sizes and strides of the tensor.

This PR is the result of *a lot* of back and forth with ezyang and eellison. Initially, the first pass at this was not super different from what we have in the PR - the broad strokes were the same:

1) We cache source->symbol in shape_env
2) We pass policy objects around, stored at dynamo fakification time, and reused for later fakification
3) We create a new fake mode for backends
(from https://github.com/pytorch/pytorch/pull/113605/files)

This is ugly, and has some layering violations. We detoured our decision making through a few other alternatives. Immutable/mutable fake tensor mode was the most interesting alternative, https://github.com/pytorch/pytorch/pull/113653, and was struck down on concerns of complexity in fake mode combined with it not covering all edge cases. We also detoured on what to do about tensor memoization returning back potentially different tensors than requested, and if that was an anti pattern (it is) we want to hack in with the symbol cache (we don't).

We went back to the drawing board here, but with a few concessions:
1) the cache for source->symbol must live outside of shape_env, for both lifecycle, and layering reasons
2) A good amount of work needs to be done to pipe policy around fake_mode and meta_utils correctly, to cover all the cases (ezyang did this)

cc penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 aakhundov kadeng

imported-using-ghimport

Test Plan: Imported from OSS

Reviewed By: huydhn, Chillee

Differential Revision: D51566250

Pulled By: voznesenskym

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114526
Approved by: https://github.com/Chillee, https://github.com/huydhn
2023-11-26 23:40:32 +00:00
Akihiro Nitta
d37c4c6995 Update torch.compiler_troubleshooting.rst (#114530)
If you copy and paste the env var in the docs:
```console
TORCHDYNAMO_REPRO_AFTER=“aot”
```
it leads to this error:
```python
    @functools.wraps(unconfigured_compiler_fn)
    def debug_wrapper(gm, example_inputs, **kwargs):
        compiler_fn = functools.partial(unconfigured_compiler_fn, **kwargs)
>       assert config.repro_after in ("dynamo", "aot", None)
E       torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
E       AssertionError:
```
because `config.repro_after` is `'“aot”'` (with the curly quotes included) rather than `'aot'`.
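
A small sketch of why the pasted value fails the membership check: a shell treats typographic quotes as ordinary characters, so they stay in the value. The function name `is_valid_repro_after` is hypothetical; only the allowed tuple is taken from the assertion shown above.

```python
ALLOWED = ("dynamo", "aot", None)


def is_valid_repro_after(value):
    """Mirror the membership check from the assertion (hypothetical helper)."""
    return value in ALLOWED


pasted = "\u201caot\u201d"  # “aot” copied with typographic quotes
typed = "aot"               # typed by hand, or with plain ASCII quotes in the shell

print(is_valid_repro_after(pasted), is_valid_repro_after(typed))
```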

---

It would've saved a few minutes of my time 😄
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114530
Approved by: https://github.com/Chillee
2023-11-25 23:15:47 +00:00