Commit Graph

2413 Commits

Author SHA1 Message Date
Lucas Pasqualin
1ab441a7dd [DCP] Adds utility for converting dcp to torch save format (#119814)
as title
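
A minimal usage sketch of the new utility (assuming it is exposed as `torch.distributed.checkpoint.format_utils.dcp_to_torch_save`; the exact location may differ):

```
from torch.distributed.checkpoint.format_utils import dcp_to_torch_save

# Convert a DCP checkpoint directory into a single torch.save-style file.
dcp_to_torch_save("checkpoints/dcp_dir", "checkpoints/model.pt")
```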

Differential Revision: [D53718042](https://our.internmc.facebook.com/intern/diff/D53718042/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119814
Approved by: https://github.com/fegin
ghstack dependencies: #119813
2024-02-22 16:55:58 +00:00
Hugues de Saxcé
8464654ae4 Add missing words to torch.utils.checkpoint doc (#120196)
This PR adds a couple of missing words in the Checkpointing documentation, it doesn't have a specific issue number related to it.

Changes are:
- "backward." -> "backward propagation."
- "to be advanced than" -> "to be more advanced than"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120196
Approved by: https://github.com/soulitzer
2024-02-20 20:18:42 +00:00
andrewor14
6ea4480818 [quant][pt2e] Add model_is_exported util function (#119726)
Summary: This commit adds the `model_is_exported` util function
for users to be able to easily tell what APIs to call to move
their models between train and eval modes. This has the
additional advantage of hiding the implementation of how we
detect a model is exported, in case the metadata format changes
in the future.
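
A rough sketch of the intended usage (the import path is an assumption; the utility lives in the pt2e export utils, and its exact public location may differ):

```
import torch
from torch.ao.quantization.pt2e.export_utils import model_is_exported  # assumed path

class M(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x)

eager_model = M()
exported_model = torch.export.export(eager_model, (torch.randn(2, 2),)).module()

# Pick the right train/eval API depending on whether the model is exported.
print(model_is_exported(eager_model))     # False
print(model_is_exported(exported_model))  # True
```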

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_model_is_exported

Differential Revision: [D53812972](https://our.internmc.facebook.com/intern/diff/D53812972)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119726
Approved by: https://github.com/tugsbayasgalan, https://github.com/albanD
2024-02-16 19:29:36 +00:00
soulitzer
312ce35c1f Rename singleton int to nested int (#119661)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119661
Approved by: https://github.com/ezyang
2024-02-16 19:21:17 +00:00
Yu, Guangye
8f9f12c068 Intel GPU Runtime Upstreaming for Device Allocator (#118091)
# Motivation
According to [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842) and [[RFC] Intel GPU Runtime Upstreaming for Allocator](https://github.com/pytorch/pytorch/issues/116322), we will upstream the key functionality of the device `Allocator` dedicated to XPU to PyTorch. Following our design, we also prepare to generalize `Allocator` in parallel.

# Design
In the current design, XPU uses an `XPUAllocator` class, inherited from `c10::Allocator`. `XPUAllocator` manages per-device `DeviceCachingAllocator` instances, each of which implements the caching mechanism that manages already-cached and newly allocated memory. The caching mechanism is similar to that of other backends, such as CUDA. We can visualize the design as below.
<p align="center">
<img width="162" alt="image" src="https://github.com/pytorch/pytorch/assets/106960996/6b17b8cf-e7d1-48b4-b684-f830c409d218">
</p>

# Additional Context
We're going to implement our design gradually. This PR covers the device `Allocator` dedicated to XPU; the second PR will cover the host `Allocator`.
Besides these PRs, we plan to make the device `Allocator` device-agnostic in another PR.
In this PR, our device `Allocator` has the same memory management mechanism as CUDA, but lacks features such as expandable segments and statistics. We will add these features back in the subsequent PR, which intends to generalize `Allocator`.

The differences from CUDA:
Only the key functionality is included; AsyncAllocator, gpu_trace, history_record, graph functionality, memory snapshots, memory statistics, expandable segments, etc. are not yet supported.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118091
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/albanD
ghstack dependencies: #117611, #117619, #117734
2024-02-16 06:46:00 +00:00
Yu, Guangye
4dc75f9084 Intel GPU Runtime Upstreaming for Event (#117734)
# Motivation
As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the next runtime component we would like to upstream is `Event`, which tracks the status of an operation that is being executed. In some circumstances, it also gives us fine-grained control over operation execution.

# Design
`XPUEvent` is a movable, but not copyable, wrapper around a SYCL event. It is created lazily on an XPU device when recording an `XPUStream`. `XPUEvent` can wait for another `XPUEvent` or for all kernels submitted on an `XPUStream` to complete. In line with the other backends, the C++ files related to `Event` are placed in the `aten/src/ATen/xpu` folder. For the frontend, the `XPUEvent` runtime API is bound to Python as `torch.xpu.Event`; the corresponding C++ code lives in `torch/csrc/xpu/Event.cpp` and the Python code in `torch/xpu/streams.py`.

# Additional Context
It is worth mentioning that the `elapsed_time` method is temporarily not supported by `XPUEvent`; we will add support for it soon. `XPUEvent` also doesn't support IPC across different processes. Otherwise, we have an almost 1:1 mapping with CUDA.

The following APIs are not yet supported:
- `torch.cuda.Event.ipc_handle`
- `CUDAEvent`'s constructor with `IpcEventHandle`
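
A minimal usage sketch of the new event API (assumes PyTorch is built with XPU support and at least one XPU device is present; the calls mirror the CUDA event API described above):

```
import torch

stream = torch.xpu.Stream()
event = torch.xpu.Event()

torch.xpu.set_stream(stream)
x = torch.randn(1024, device="xpu") * 2  # work submitted on the current stream
event.record(stream)                     # the underlying SYCL event is created lazily here
event.synchronize()                      # block the host until the recorded work finishes
print(event.query())                     # True once the event has completed
```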

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117734
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #117611, #117619
2024-02-16 06:28:26 +00:00
lancerts
444c628e06 Include the scalar tensor auto-transfer in the doc (#119967)
Fixes #119609

@albanD
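
For reference, the behavior being documented (a 0-dim CPU scalar tensor is moved automatically when used with tensors on another device; the example assumes a CUDA device is available):

```
import torch

gpu = torch.ones(3, device="cuda")  # tensor on the GPU
scalar = torch.tensor(2.0)          # 0-dim (scalar) tensor on the CPU
print(gpu * scalar)                 # works: the scalar is transferred automatically
```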

Co-authored-by: albanD <desmaison.alban@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119967
Approved by: https://github.com/albanD
2024-02-15 22:37:39 +00:00
drisspg
744898b311 Add doc page for environment variables that effect PyTorch Runtime (#119087)
# Summary

The goal of this PR is to add a doc page listing a number of environment variables that affect the PyTorch runtime. It will likely not be exhaustive, but hopefully it will be extended and updated to stay relevant.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119087
Approved by: https://github.com/janeyx99, https://github.com/eqy
2024-02-15 21:41:38 +00:00
Kazuaki Ishizaki
a2f07bb317 Fix typo under docs directory (#119657)
This PR fixes typo under `docs` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119657
Approved by: https://github.com/colesbury
2024-02-15 21:14:34 +00:00
Eddie Yan
cd380c794f [CUDNN][SDPA] Experimental cuDNN Flash Attention v2 Inference (#115663)
#113713

Going to clean up some of the checks and will remove draft status after.
Can be tested on SM80+ with `TORCH_CUDNN_MHA_ENABLED=1`.
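
A minimal sketch of how the experimental path could be exercised (run under `TORCH_CUDNN_MHA_ENABLED=1` on an SM80+ GPU; whether the cuDNN kernel is actually picked depends on the inputs):

```
# Run as: TORCH_CUDNN_MHA_ENABLED=1 python sdpa_cudnn_check.py
import torch
import torch.nn.functional as F

q, k, v = (torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
           for _ in range(3))
out = F.scaled_dot_product_attention(q, k, v)  # may dispatch to the cuDNN kernel
print(out.shape)
```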

CC @drisspg @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115663
Approved by: https://github.com/drisspg
2024-02-14 22:02:06 +00:00
andrewor14
8ec8d78ef2 [quant][pt2e][be] Rename eval_utils -> export_utils (#119725)
It's not really eval_utils anymore, since we added some training-related
utils. Instead, it should hold utility functions related to general
export use cases.

Differential Revision: [D53711494](https://our.internmc.facebook.com/intern/diff/D53711494)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119725
Approved by: https://github.com/tugsbayasgalan
2024-02-13 19:10:06 +00:00
Yu, Guangye
8fd11cb307 [2/2] Intel GPU Runtime Upstreaming for Stream (#117619)
# Motivation
Following [[1/2] Intel GPU Runtime Upstreaming for Stream](https://github.com/pytorch/pytorch/pull/117611), and as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), this second PR covers the changes under the Python frontend.

# Design
Currently, it primarily offers stream-related APIs, including
 - `torch.xpu.StreamContext`
 - `torch.xpu.current_stream`
 - `torch.xpu.set_stream`
 - `torch.xpu.synchronize`
 - `torch._C._xpu_getCurrentRawStream`
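
A minimal usage sketch of the APIs listed above (assumes an XPU-enabled build with at least one device):

```
import torch

s = torch.xpu.Stream()
with torch.xpu.StreamContext(s):      # run the enclosed work on stream s
    y = torch.randn(8, device="xpu") + 1
print(torch.xpu.current_stream())     # restored to the previous stream here
torch.xpu.synchronize()               # block until all queued work is done
```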

# Additional Context
We will implement functions like `torch.xpu.Stream.wait_event`, `torch.xpu.Stream.wait_stream`, and `torch.xpu.Stream.record_event` in the next PR, which is related to `Event`.

The differences from CUDA:
XPU has no default or external streams, and the following APIs are not yet supported:
- `torch.cuda.ExternalStream`
- `torch.cuda.default_stream`
- `torch.cuda.is_current_stream_capturing`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117619
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/albanD
ghstack dependencies: #117611
2024-02-10 03:39:42 +00:00
Mikayla Gawarecki
3372aa51b4 Integrate swap_tensors into nn.Module.load_state_dict (#117913)
Added a `torch.Tensor` method that defines how to transform `other`, a value in the state dictionary, to be loaded into `self`, a param/buffer in an `nn.Module`, before swapping via `torch.utils.swap_tensors`:
* `param.module_load(sd[key])`

This method can be overridden using `__torch_function__`.
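
A rough sketch of the behavior (assuming the swap path is enabled via the `torch.__future__` flag introduced in the previous PR):

```
import torch
import torch.nn as nn

# Opt in to swapping module parameters instead of assigning .data in place.
torch.__future__.set_swap_module_params_on_conversion(True)

m = nn.Linear(2, 2)
sd = {"weight": torch.ones(2, 2), "bias": torch.zeros(2)}
# Internally, each param computes param.module_load(sd[key]) and the result
# is swapped in via torch.utils.swap_tensors.
m.load_state_dict(sd)
print(m.weight)
```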

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117913
Approved by: https://github.com/albanD
2024-02-09 22:32:29 +00:00
Angela Yi
0827510fd3 [export] Remove torch._export.export (#119095)
XLA changes: https://github.com/pytorch/xla/pull/6486

Test Plan: CI

Differential Revision: D53316196

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119095
Approved by: https://github.com/ydwu4, https://github.com/zhxchen17, https://github.com/tugsbayasgalan, https://github.com/avikchaudhuri, https://github.com/jerryzh168
2024-02-08 21:22:04 +00:00
Yu, Guangye
9a992b0918 [4/4] Intel GPU Runtime Upstreaming for Device (#116869)
# Motivation
Following [[1/4] Intel GPU Runtime Upstreaming for Device](https://github.com/pytorch/pytorch/pull/116019), and as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), this last PR covers the changes for lazy initialization.

# Design
This PR primarily adds support for multi-processing via lazy initialization: the runtime is initialized lazily, so XPU is not initialized until it is first accessed. In our design, we extend `cuda_lazy_init` to `device_lazy_init`, a device-agnostic API that can support any backend, and change `maybe_initialize_cuda` to `maybe_initialize_device` to support lazy initialization for both CUDA and XPU while maintaining scalability.

# Additional Context
We adopt a design similar to CUDA's, so we share some code with CUDA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116869
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/malfet
ghstack dependencies: #119248
2024-02-08 03:01:21 +00:00
Mateus Devino
64aaa8f508 Fix typo on Contribution Guide (#119428)
Fixes #119427

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119428
Approved by: https://github.com/awgu, https://github.com/kit1980
2024-02-08 01:07:27 +00:00
Mikayla Gawarecki
d5a718d27b Add swap_tensors path to nn.Module._apply (#117167)
Added `torch.__future__.{get/set}_swap_module_params_on_conversion`, which defaults to `False` for now, but we probably want to override this and default to `True` in `nn.Module._apply` if the input is a tensor subclass.

From offline discussion, for now we are **not** allowing `swap_tensor` after the first module forward has been run*** if the autograd graph is still alive. The reason being that `torch.utils.swap_tensors(t1, t2)` requires the `use_count` of both `TensorImpl`s associated with `t1` and `t2` to be 1.  The first forward pass will install `AccumulateGrad` nodes on each param, which [bump the refcount of the associated TensorImpl](6cf1fc66e3/torch/csrc/autograd/variable.cpp (L307)). **Future work might be to swap the refs that the `AccumulateGrad` nodes hold if it is necessary.**

***From this, it might seem like we don't need to handle gradients. However, I still handle the grads for the edge case that the grads are set via `p.grad = grad` OR the autograd graph is no longer alive because the output has been garbage collected.

If any `swap_tensors` fails on any of the parameters in the `nn.Module` we raise an error.

**`RNNBase` overrides `nn.Module._apply()` and installs weakrefs on some parameters. As a result, all modules that inherit from `RNNBase` (`RNN`, `GRU` and `LSTM`) cannot use the`swap_tensors` path as of now**
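
A minimal sketch of the opt-in flag described above (a simple conversion before any forward pass, so the restriction about a live autograd graph does not apply):

```
import torch
import torch.nn as nn

torch.__future__.set_swap_module_params_on_conversion(True)  # defaults to False

m = nn.Linear(4, 4)
ref = m.weight        # keep a reference to the parameter object
m.to(torch.float64)   # _apply now swaps tensors via torch.utils.swap_tensors
print(ref.dtype)      # the existing reference observes the converted dtype
```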

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117167
Approved by: https://github.com/albanD
ghstack dependencies: #118028
2024-02-07 18:55:44 +00:00
Peter Bell
7c95cc5e03 Add basic reference documentation for symbolic_shapes.py (#118997)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118997
Approved by: https://github.com/albanD
2024-02-07 14:33:42 +00:00
Svetlana Karslioglu
5ae6f6cffe Test seo torch cuda (#119324)
Testing if this will help improve SEO of this page.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119324
Approved by: https://github.com/albanD
2024-02-07 00:39:51 +00:00
Edward Z. Yang
3f0fd36835 Introduce size oblivious guards (#118579)
Fixes https://github.com/pytorch/pytorch/issues/117361

The implementation here slightly diverges from what was proposed in the issue, so I will recap what this PR is doing here. Today, when doing computations involving size-like unbacked SymInts, we assume for all operations that the compile time range of the integer is `[2, inf]`, even though at runtime we also accept zero and one.

This PR removes the carte blanche assumption, and instead does the analysis in a much more limited and controlled fashion: only for guards which we have designated as "size oblivious" are we willing to do the analysis under the assumption that the range of all size-like unbacked SymInts is `[2, inf]`; otherwise, we will faithfully only do analysis with `[0, inf]` (or whatever the user provided) bounds.

The infra pieces of this PR are:

* Remove runtime_var_to_range from torch/fx/experimental/symbolic_shapes.py; modify `_constrain_range_for_size` to refine the range without clamping min to 2, and instead add the symbol to a `size_like` set in the ShapeEnv
* When evaluating an expression, if the expression is requested to be evaluated in a `size_oblivious` way, we attempt to statically compute the value of the expression with the assumption that all symbols in `size_like` are updated to assume that they are `>= 2`.
* Add Python and C++ APIs for guarding on a SymBool in a size-oblivious way. In C++, I also need to add some helpers for performing symbolic comparisons, since the stock comparisons immediately specialize in the "normal" way.

The rest of the changes of the PR are marking various spots in PyTorch framework code as size oblivious, based on what our current test suite exercises.
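
A small sketch of what marking a guard as size oblivious looks like in framework code, using the Python-side helper added here (`guard_size_oblivious` from `torch.fx.experimental.symbolic_shapes`):

```
from torch.fx.experimental.symbolic_shapes import guard_size_oblivious

def maybe_squeeze_front(t):
    # For a size-like unbacked SymInt, this is evaluated assuming size >= 2,
    # so no guard on "size == 1" is emitted at compile time.
    if guard_size_oblivious(t.shape[0] == 1):
        return t.squeeze(0)
    return t
```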

As you review the places where we have marked things as size oblivious, it may become clear why I ended up not opting for the "designate a branch as the default branch when it's not statically obvious which way to go": for some of the conditions, this answer is rather non-obvious. I think potentially there is another refinement on top of this PR, which is something like "I don't care if you can't figure it out with ValueRange analysis, go down this path anyway if there are unbacked sizes involved." But even if we add this API, I think we are obligated to attempt the ValueRange analysis first, since it can lead to better outcomes sometimes (e.g., we are able to figure out that something is contiguous no matter what the unbacked size is.)

When is it permissible to mark something as size oblivious? Heuristically, it is OK anywhere in framework code if it gets you past a guard on unbacked SymInt problem. It is somewhat difficult to provide a true semantic answer, however. In particular, these annotations don't have any observational equivalence guarantee; for example, if I have `torch.empty(u0, 1).squeeze()`, we will always produce a `[u0]` size tensor, even though if `u0 == 1` PyTorch will actually produce a `[]` size tensor. The argument that I gave to Lezcano is that we are in fact defining an alternate semantics for a "special" size = 0, 1, for which we have these alternate eager mode semantics. In particular, suppose that we have a constant `special1` which semantically denotes 1, but triggers alternate handling rules. We would define `torch.empty(special1, 1).squeeze()` to always produce a `[special1]` size tensor, making its semantics coincide with unbacked SymInt semantics. In this model, the decision to designate guards as size oblivious is simply a user API question: you put them where ever you need some handling for special1! As we conservatively error out whenever it is not obvious what `special1` semantics should be, it is always valid to expand these semantics to cover more cases (although you can always choose the wrong semantics!)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118579
Approved by: https://github.com/eellison, https://github.com/lezcano
2024-02-06 19:45:32 +00:00
Edward Z. Yang
6620176da7 Add documentation for meta device (#119119)
Fixes https://github.com/pytorch/pytorch/issues/119098
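
For reference, the behavior being documented:

```
import torch

# Tensors on the meta device carry only metadata (shape, dtype, strides);
# no storage is allocated, so even huge tensors are cheap to create.
t = torch.empty(10**9, device="meta")
print(t.shape, t.dtype, t.device)  # torch.Size([1000000000]) torch.float32 meta
```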

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119119
Approved by: https://github.com/bdhirsh
2024-02-04 01:05:22 +00:00
Mikayla Gawarecki
9ffed22391 Document file format returned by torch.save (#118719)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118719
Approved by: https://github.com/albanD
2024-02-03 02:11:44 +00:00
Yu, Guangye
a205e7bf56 [3/4] Intel GPU Runtime Upstreaming for Device (#116850)
# Motivation
Following [[1/4] Intel GPU Runtime Upstreaming for Device](https://github.com/pytorch/pytorch/pull/116019), and as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), this third PR covers the changes under `libtorch_python`.

# Design
This PR primarily offers device-related APIs in the Python frontend, including
- `torch.xpu.is_available`
- `torch.xpu.device_count`
- `torch.xpu.current_device`
- `torch.xpu.set_device`
- `torch.xpu.device`
- `torch.xpu.device_of`
- `torch.xpu.get_device_name`
- `torch.xpu.get_device_capability`
- `torch.xpu.get_device_properties`
- ====================
- `torch.xpu._DeviceGuard`
- `torch.xpu._is_compiled`
- `torch.xpu._get_device`
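
A minimal usage sketch of the public APIs listed above (assumes an XPU-enabled build):

```
import torch

if torch.xpu.is_available():
    print(torch.xpu.device_count())
    torch.xpu.set_device(0)
    print(torch.xpu.current_device())
    print(torch.xpu.get_device_name(0))
    print(torch.xpu.get_device_capability(0))
    print(torch.xpu.get_device_properties(0))
```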

# Additional Context
We will implement support for lazy initialization in the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116850
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/malfet
2024-02-01 12:31:26 +00:00
Michael Lazos
e426924c19 Change classification to beta for TORCH_LOGS (#118682)
Changes classification of TORCH_LOGS to beta
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118682
Approved by: https://github.com/svekars
2024-01-31 21:50:55 +00:00
CaoE
bacbad5bc9 add GradScaler on CPU (#109993)
Step 2 of https://github.com/pytorch/pytorch/issues/111559.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109993
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-01-29 23:42:35 +00:00
albanD
a40be5f4dc Autograd doc cleanup (#118500)
I don't think we'll realistically go through deprecation for these now, since there are a couple of uses of each online. So document them appropriately.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118500
Approved by: https://github.com/soulitzer
2024-01-29 21:51:33 +00:00
Will Constable
abe3c55a6a Update DDP dynamo debug docs (#118295)
Refreshes https://github.com/pytorch/pytorch/pull/114201 and updates it to include other log names that also include ddp_optimizer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118295
Approved by: https://github.com/LucasLLC, https://github.com/wanchaol
2024-01-29 14:58:26 +00:00
Tobias Ringwald
62c1e4a578 Added missing CircularPad*d references so the docs are actually built. (#118465)
Fixes #118429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118465
Approved by: https://github.com/Skylion007
2024-01-27 22:39:01 +00:00
Lucas Pasqualin
ff8e33556e Enables load balancing duplicates in DCP (#116469)
Enables the deduplication of saved entries by load balancing duplicates across ranks.

Tested with existing and modified tests. Additionally tested with the following code snippet, which saves a 20GB DDP model in **~3 seconds on 8 ranks**. Before this PR, the same operation was measured at ~19 seconds.

```
import os
import time

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

# Note: Model, rank_0_print, _patch_model_state_dict and _FileSystemCheckpointer
# are helpers defined elsewhere in the benchmark script.

def run(local_rank, world_size, param_size, num_params, work_dir):

    os.environ["RANK"] = str(local_rank)
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    device = torch.device(f"cuda:{local_rank}")
    torch.cuda.set_device(device)
    dist.init_process_group(backend="nccl", rank=local_rank, world_size=world_size)

    model = Model(param_size=param_size, num_params=num_params)
    model = DistributedDataParallel(model, gradient_as_bucket_view=True)
    _patch_model_state_dict(model)

    sz = sum(t.nelement() * t.element_size() for t in model.parameters())
    rank_0_print(f"Model size: {sz / 1_000_000_000.0} GB")
    rank_0_print("Saving the model with DCP...")

    checkpointer = _FileSystemCheckpointer(
        f"{work_dir}/dcp",  # was args.work_dir; use the function argument instead
        sync_files=False,
        single_file_per_rank=False,
        thread_count=1
    )

    begin_ts = time.monotonic()
    checkpointer.save(state_dict={"model": model})
    end_ts = time.monotonic()
    rank_0_print(f"Took {end_ts - begin_ts} seconds with DCP")
```

Differential Revision: [D52435926](https://our.internmc.facebook.com/intern/diff/D52435926/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116469
Approved by: https://github.com/fegin, https://github.com/wz337
2024-01-26 22:34:14 +00:00
Sherlock Huang
6596a3f23d [Export] Remove ScriptObjectMeta (#118241)
Summary: As title. Use CustomObjArgument as ScriptObjectMeta

Test Plan: CIs

Reviewed By: zhxchen17

Differential Revision: D53062230

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118241
Approved by: https://github.com/zhxchen17
2024-01-26 00:37:19 +00:00
drisspg
4e29f01bf2 Remove sdp_kernel and replace with sdpa_kernel in attention namespace (#114689)
# Summary
Simplification of Backend Selection

This PR deprecates the `torch.backends.cuda.sdp_kernel` context manager and replaces it with a new context manager, `torch.nn.attention.sdpa_kernel`, which also changes the API.

With `sdp_kernel`, one would specify the backend choice by disabling the kernels one did not want to run. The purpose of this backend manager was only to be a debugging tool: "turn off the math backend" and see if you can run one of the fused implementations.

Problems:
- This pattern makes sense if the majority of users don't care to know anything about the backends that can be run. However, users who reach for this context manager are explicitly trying to run a specific backend.
- This is not scalable. We are working on adding the cuDNN backend, and this API means that more implementations will need to be turned off if a user wants to explicitly run a given backend.
- Discoverability of the current context manager. It is somewhat unintuitive that this backend manager lives in `backends/cuda/__init__` when it now also controls the CPU fused-kernel behavior. I think centralizing it in the attention namespace will be helpful.

Other concerns:
- Typically, backends (kernels) for operators are entirely hidden from users as implementation details of the framework. We have already exposed this to users, albeit not by default and with beta warnings. Does making backend choices even more explicit lead to problems when we potentially want to remove existing backends (perhaps input shapes will get covered by newer backends)?

A nice side effect: now that we aren't using the `BACKEND_MAP` in test_transformers, many, many dynamo failures are passing for the CPU tests.
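
A minimal sketch of the new API (assumes a CUDA device; the point is that you now name the backend you want rather than disabling the others):

```
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

q, k, v = (torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
           for _ in range(3))

with sdpa_kernel(SDPBackend.FLASH_ATTENTION):     # restrict SDPA to flash attention
    out = F.scaled_dot_product_attention(q, k, v)
```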

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114689
Approved by: https://github.com/cpuhrsch
2024-01-24 22:28:04 +00:00
Zhengxu Chen
abd759d50d [fx] Add hooks to intercept node replacements. (#117825)
Summary: Adds an experimental API to the FX graph module to place "hooks" whenever we change or replace nodes in a graph, so that we can properly update the new name in the graph signature and potentially other places.

Test Plan:
buck test mode/opt  -c fbcode.enable_gpu_sections=true caffe2/test/distributed/_tensor/experimental:tp_transform

buck test mode/opt caffe2/test:test_export -- -r test_replace_hook

Differential Revision: D52896531

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117825
Approved by: https://github.com/avikchaudhuri
2024-01-23 22:28:40 +00:00
Matteo Migliarini
fdac55c35d Added example regarding weight_decay distinction with per-parameter API (#117436)
Added a new example and a description of the per-parameter `weight_decay` distinction between bias and non-bias terms.
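
The gist of the added example, sketched here (the exact wording in the docs may differ):

```
import torch
import torch.nn as nn

model = nn.Linear(4, 4)
decay, no_decay = [], []
for name, p in model.named_parameters():
    # Apply weight decay to weights but not to bias terms.
    (no_decay if name.endswith("bias") else decay).append(p)

optimizer = torch.optim.SGD(
    [
        {"params": decay, "weight_decay": 1e-4},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=0.1,
)
```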

Fixes #115935

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117436
Approved by: https://github.com/janeyx99
2024-01-22 21:26:02 +00:00
Wanchao Liang
2bb2cc0b71 [tp] add clarification to doc and improve TP examples (#117618)
This PR adds a clarification about the evenly-sharded assumption in the main
TP doc and improves the TP examples by adding device mesh construction.
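
A minimal sketch of the device mesh construction pattern added to the examples (assumes a distributed run, e.g. via torchrun, with 8 GPUs; note the evenly-sharded assumption mentioned above):

```
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import ColwiseParallel, parallelize_module

mesh = init_device_mesh("cuda", (8,))        # 1-D mesh over 8 ranks
model = nn.Linear(16, 16, device="cuda")     # 16 is evenly divisible by 8
model = parallelize_module(model, mesh, ColwiseParallel())
```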

fixes https://github.com/pytorch/pytorch/issues/100044

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117618
Approved by: https://github.com/wconstab, https://github.com/awgu
2024-01-22 18:56:50 +00:00
Stas Bekman
86b4b27e26 [docs] start a new FSDP notes doc (#117323)
As discussed on [slack](https://pytorch.slack.com/archives/C3PDTEV8E/p1703699711772289) adding Andrew Gu's advanced FSDP design notes with a few additions from myself based on our discussion.

I hope I did the RST right, I haven't done RST in a while.

- The first section is Andrew's words verbatim + formatting
- The second section is Andrew's words verbatim + formatting + a few of my additions that were confirmed by Andrew, and which hopefully should help understand the process better.

tagging @albanD as requested.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117323
Approved by: https://github.com/awgu
2024-01-22 15:46:35 +00:00
PyTorch MergeBot
02209b5880 Revert "[docs] start a new FSDP notes doc (#117323)"
This reverts commit 7f474da6bc.

Reverted https://github.com/pytorch/pytorch/pull/117323 on behalf of https://github.com/awgu due to broke docs ([comment](https://github.com/pytorch/pytorch/pull/117323#issuecomment-1902740900))
2024-01-21 19:47:27 +00:00
Stas Bekman
7f474da6bc [docs] start a new FSDP notes doc (#117323)
As discussed on [slack](https://pytorch.slack.com/archives/C3PDTEV8E/p1703699711772289) adding Andrew Gu's advanced FSDP design notes with a few additions from myself based on our discussion.

I hope I did the RST right, I haven't done RST in a while.

- The first section is Andrew's words verbatim + formatting
- The second section is Andrew's words verbatim + formatting + a few of my additions that were confirmed by Andrew, and which hopefully should help understand the process better.

tagging @albanD as requested.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117323
Approved by: https://github.com/albanD, https://github.com/awgu
2024-01-21 15:11:24 +00:00
suo
4057d005ff Initial torchbind support in PT2 (#117697)
This PR adds the bare minimum functionality to get torchbind working in an e2e testable way on PT2.

It implements:
* ProxyTensor support
* Simple torch.export support (proxytensor-only path, e.g. non-strict).
* add some tests exercising the path.

Because all this is not fully baked, I hide the functionality behind a feature flag (`enable_torchbind_tracing()`) so it does not affect regular users for now.

Still on the agenda:
* Dynamo support
* Actual FakeMode support
* Mutability support

Hoping to get this first bit in as a standalone, as it will unblock some more extensive experimentation/testing going on internally.

Differential Revision: [D51825372](https://our.internmc.facebook.com/intern/diff/D51825372/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117697
Approved by: https://github.com/SherlockNoMad
2024-01-19 06:28:20 +00:00
PyTorch MergeBot
2f84a9d37c Revert "[CUDNN][SDPA] Experimental cuDNN Flash Attention v2 Inference (#115663)"
This reverts commit 5aa92b5090.

Reverted https://github.com/pytorch/pytorch/pull/115663 on behalf of https://github.com/PaliC due to Unfortunately, this pr breaks cuda builds internally ([comment](https://github.com/pytorch/pytorch/pull/115663#issuecomment-1899388813))
2024-01-18 23:40:30 +00:00
Angela Yi
92d718aed1 [export] Add lifted constant obj to input (#116985)
Test Plan: wip

Differential Revision: D52556070

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116985
Approved by: https://github.com/suo
2024-01-18 22:10:53 +00:00
suo
ccc8440609 [export] introduce WrapperModule (#117571)
Simple module to wrap a callable. This is a useful utility for when we start requiring that torch.export take an nn.Module.
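
A hypothetical sketch of the idea (the actual class added by this PR may live elsewhere and differ in detail):

```
import torch
import torch.nn as nn

class WrapperModule(nn.Module):
    """Wrap a plain callable so it can be used where an nn.Module is required."""

    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, *args, **kwargs):
        return self.fn(*args, **kwargs)

def f(x):
    return x.sin() + 1

ep = torch.export.export(WrapperModule(f), (torch.randn(3),))
```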

Differential Revision: [D52791310](https://our.internmc.facebook.com/intern/diff/D52791310/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117571
Approved by: https://github.com/tugsbayasgalan, https://github.com/avikchaudhuri
ghstack dependencies: #117570
2024-01-18 03:40:34 +00:00
Eddie Yan
5aa92b5090 [CUDNN][SDPA] Experimental cuDNN Flash Attention v2 Inference (#115663)
#113713

Going to clean up some of the checks and will remove draft status after.
Can be tested on SM80+ with `TORCH_CUDNN_MHA_ENABLED=1`.

CC @drisspg @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115663
Approved by: https://github.com/drisspg
2024-01-18 01:20:36 +00:00
Kurman Karabukaev
a60b566d37 [TorchElastic] Support for overprovisioning in C10 based rendezvous (#117066)
Summary:
Allow TorchElastic to manage more nodes than the maximum nnodes specified in a job. The extra nodes will be used as spare capacity/warm nodes for schedulers that support elasticity.

RFC: https://github.com/pytorch/pytorch/issues/114097

Test Plan: Integration tests

Differential Revision: D52343874

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117066
Approved by: https://github.com/zdevito
2024-01-18 01:16:55 +00:00
Peter Bell
001585f446 [fx][inductor] Add statically_known_true utility for SymBool (#117359)
This adds a function `statically_known_true` for `SymBool` that works
like inductor's `is_expr_static_and_true`. That is, it tries to simplify the
expression to a constant or returns `False` if it cannot be simplified.

This is useful in cases that can be optimized if the condition is met;
otherwise it doesn't affect correctness, so we can avoid adding guards.

I also use this new function in inductor for `FakeTensorUpdater` and
`remove_noop_pass` which both generated unexpected guards previously.
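
A small sketch of the intended use (assuming the helper is importable from `torch.fx.experimental.symbolic_shapes`):

```
from torch.fx.experimental.symbolic_shapes import statically_known_true

def same_first_dim(src, dst):
    # True only when provably equal; returns False (rather than guarding)
    # when it cannot be determined, so correctness is never affected.
    return statically_known_true(src.shape[0] == dst.shape[0])
```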

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117359
Approved by: https://github.com/lezcano
2024-01-15 18:01:10 +00:00
Sai-Pra
19502ff6aa Fixed typo in build_activation_images.py (#117458)
In line 24 of build_activation_images.py, I changed "programmaticly" to "programmatically" to be grammatically correct.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117458
Approved by: https://github.com/malfet
2024-01-15 03:27:40 +00:00
vasiliy
a6d33614d6 add float8 types to dtypes table (#117375)
Summary: As titled

Test Plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117375
Approved by: https://github.com/ezyang
2024-01-15 00:23:07 +00:00
Edward Z. Yang
d006cae2a8 Update documentation for unsigned int types (#116804)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116804
Approved by: https://github.com/albanD
ghstack dependencies: #116595, #116803
2024-01-08 22:02:10 +00:00
Guo Yejun
5323b2daa5 [docs] add mode="reduce-overhead" into torch.compile to enable cuda graph (#116529)
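
For reference, a minimal sketch of the documented usage (assumes a CUDA device):

```
import torch

@torch.compile(mode="reduce-overhead")  # enables CUDA graphs to cut launch overhead
def f(x):
    return torch.sin(x) + x

f(torch.randn(32, device="cuda"))
```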

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116529
Approved by: https://github.com/eellison
2024-01-05 22:54:20 +00:00
Angela Yi
6413511713 [export][refactor][4/n] Make equality_constraints optional (#116233)
Summary: needed to remove equality_constraints eventually :P

Test Plan: CI

Differential Revision: D52351709

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116233
Approved by: https://github.com/tugsbayasgalan
2024-01-05 00:50:52 +00:00
Mikayla Gawarecki
0f6f582c0d Add config to disable TransformerEncoder/MHA fastpath (#112212)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112212
Approved by: https://github.com/jbschlosser
2024-01-02 23:59:30 +00:00