Commit Graph

2971 Commits

Mark Saroufim
f3d16ec76f Add doc preview command (#141590)
Convenience improvement for building the PyTorch docs:
1. The build docs weren't clear that `make html` is the main command intended to be run.
2. Once you run `make html`, you need to view the result; spinning up a simple HTTP server seems like the simplest solution, so this adds a `make serve` command.

Usage

```shell
numpy ❯ make serve PORT=8080 # Add port optionally
Serving HTTP on :: port 8080 (http://[::]:8080/) ...
::1 - - [26/Nov/2024 10:05:41] "GET / HTTP/1.1" 200 -
::1 - - [26/Nov/2024 10:05:41] "GET /_static/copybutton.css HTTP/1.1" 200 -
::1 - - [26/Nov/2024 10:05:41] "GET /_static/katex-math.css HTTP/1.1" 200 -
```
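
Under the hood this is essentially Python's built-in static file server pointed at the Sphinx output; a minimal standalone equivalent, assuming the built docs land in `build/html`:

```python
import functools
import http.server

PORT = 8080  # the Makefile target lets you override this via PORT=...
handler = functools.partial(
    http.server.SimpleHTTPRequestHandler, directory="build/html"
)
http.server.ThreadingHTTPServer(("", PORT), handler).serve_forever()
```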

![Screenshot 2024-11-26 at 10 05 46 AM](https://github.com/user-attachments/assets/3b275c33-1515-4e21-b540-f5a68c8a8e55)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141590
Approved by: https://github.com/svekars, https://github.com/malfet
2024-11-26 21:56:54 +00:00
Nichols A. Romero
a99332eb25 [ROCM] Support Multi-GPU offline tuning in TunableOp (#139673)
This PR enhances offline tuning to support multiple GPUs.

High-level description of algorithm:
- Duplicate GEMMs are first eliminated
- GEMMs are distributed to multi-GPUs for tuning
- Results are gathered into a file with `_full` in the filename

Also adds support for `GemmAndBias` and `ScaledGemm`. A sketch of the flow is below.
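
A minimal sketch of that flow (hypothetical helper names; the real logic lives in TunableOp's offline-tuning code):

```python
import multiprocessing as mp

def tune_shard(device_idx, gemms, results):
    # Hypothetical per-GPU worker: tune each assigned GEMM shape on one device.
    for gemm in gemms:
        results.append((gemm, f"best_solution_gpu{device_idx}"))  # placeholder

def offline_tune(untuned_gemms, num_gpus):
    unique = list(dict.fromkeys(untuned_gemms))              # 1. drop duplicate GEMMs
    shards = [unique[i::num_gpus] for i in range(num_gpus)]  # 2. distribute across GPUs
    with mp.Manager() as mgr:
        results = mgr.list()
        procs = [mp.Process(target=tune_shard, args=(i, shards[i], results))
                 for i in range(num_gpus)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        return list(results)  # 3. gather into the single `_full` results file
```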

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139673
Approved by: https://github.com/jeffdaily, https://github.com/hongxiayang
2024-11-26 19:07:41 +00:00
Stephen Matthews
2bbd984aa2 Fix typo in Reproducibility docs (#141341)
Fixes a trivial issue in the docs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141341
Approved by: https://github.com/svekars
2024-11-26 16:53:26 +00:00
ZhiweiYan-96
c418a9ac75 [Intel GPU] XPUInductorQuantizer for XPU int8 recipe customization (#139578)
# Motivation
This PR adds `XPUInductorQuantizer`, which defines the int8 quantization recipe for the XPU backend.

# Details
`XPUInductorQuantizer` is a class derived from `X86InductorQuantizer`, as both quantizers take advantage of the highly optimized operators in the oneDNN library (qconv, qlinear, qconv/qlinear fusion).

We share the same recipe as `X86InductorQuantizer`, so we have the same `annotate_xxxx` methods. Ideally, `XPUInductorQuantizer` would have an empty class body, since all of the implementation can be inherited from the base class.

In this PR, we override the `annotate_xxx` methods for operators that have NOT been implemented. Any operator the XPU backend does not implement falls back to the fp32 implementation, since the node appears in the graph as a `dq-op-q` pair. This helps provide good out-of-the-box usability for the XPU backend. The implemented operators, on the other hand, use the `annotate_op` methods implemented in the base class and can be lowered successfully.
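
A minimal sketch of that inheritance pattern (the override below is illustrative, not the PR's exact method list):

```python
from torch.ao.quantization.quantizer.x86_inductor_quantizer import X86InductorQuantizer

class XPUInductorQuantizer(X86InductorQuantizer):
    """Shares the X86 int8 recipe; only ops unsupported on XPU are overridden."""

    def _annotate_maxpool2d(self, node, quantization_config):
        # Hypothetical override for an op the XPU backend has not implemented yet:
        # returning without annotating leaves the dq-op-q pair in the graph, so
        # the op falls back to the fp32 implementation.
        return
```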

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139578
Approved by: https://github.com/EikanWang, https://github.com/leslie-fang-intel, https://github.com/CuiYifeng, https://github.com/jerryzh168
ghstack dependencies: #133080
2024-11-26 09:44:14 +00:00
Svetlana Karslioglu
25c0b91dbb [Docs] Make links to source link to source (#141186)
Rewrite the [SOURCE] links in the API docs to point to the source files in the GitHub repo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141186
Approved by: https://github.com/malfet, https://github.com/msaroufim

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-11-22 00:50:19 +00:00
angelayi
878a849c92 [aoti] Remove example inputs from aoti_compile_and_package (#140991)
Differential Revision: [D66136724](https://our.internmc.facebook.com/intern/diff/D66136724)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140991
Approved by: https://github.com/yushangdi, https://github.com/desertfire
ghstack dependencies: #140990
2024-11-20 02:49:47 +00:00
YangQuan
93aef684d9 fix typo in torch.compiler_dynamo_deepdive.rst (#140871)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140871
Approved by: https://github.com/zou3519
2024-11-19 14:42:36 +00:00
Yu Guo
808da50c2d create a new torch.cuda.device_memory_used api (#140870)
Summary:
The current `torch.cuda.memory_usage` returns memory utilization; specifically, for NVIDIA it is the percent of time over the past sample period during which global memory was being read or written.
See more details in https://github.com/pytorch/pytorch/issues/140638
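
For contrast, a short sketch of the two APIs side by side (`device_memory_used` is the new call added here):

```python
import torch

util = torch.cuda.memory_usage(0)        # % of sample period global memory was read/written
used = torch.cuda.device_memory_used(0)  # bytes of device memory currently in use
print(f"utilization: {util}%, used: {used / 2**20:.1f} MiB")
```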

Test Plan: added a new unittest

Differential Revision: D65960134

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140870
Approved by: https://github.com/ngimel, https://github.com/eqy
2024-11-19 06:36:30 +00:00
Tristan Rice
2673a440d0 [distributed] add PG APIs and general doc cleanups (#140853)
Doc updates:

* This adds documentation for the object-oriented ProcessGroup APIs that are being used in torchft as well as https://github.com/pytorch/rfcs/pull/71.
* It also does some general cleanups to simplify distributed.rst by using `:methods`.
* It adds `__init__` definitions for the Stores.
* I've reordered things so the collective APIs come before the Store/PG APIs.

Test plan:

```
lintrunner -a
cd docs && sphinx-autobuild source build/ -j auto -WT --keep-going
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140853
Approved by: https://github.com/kwen2501
2024-11-19 02:06:32 +00:00
PyTorch MergeBot
43de32d948 Revert "create a new torch.cuda.device_memory_used api (#140870)"
This reverts commit 478204cad6.

Reverted https://github.com/pytorch/pytorch/pull/140870 on behalf of https://github.com/yuguo68 because the test is still flaky on ROCm: test_cuda.py::TestCudaMallocAsync is not skipped with the unittest.skipIf(TEST_CUDAMALLOCASYNC ([comment](https://github.com/pytorch/pytorch/pull/140870#issuecomment-2484161914))
2024-11-18 21:26:25 +00:00
Yuanhao Ji
4bb1bf0573 [Docs] Remove duplicate declaration of double_tensor (#140927)
Fixes #140920

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140927
Approved by: https://github.com/malfet
2024-11-18 21:22:30 +00:00
Yu Guo
478204cad6 create a new torch.cuda.device_memory_used api (#140870)
Summary:
The current `torch.cuda.memory_usage` returns memory utilization; specifically, for NVIDIA it is the percent of time over the past sample period during which global memory was being read or written.
See more details in https://github.com/pytorch/pytorch/issues/140638

Test Plan: added a new unittest

Differential Revision: D65960134

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140870
Approved by: https://github.com/ngimel
2024-11-18 19:13:43 +00:00
PyTorch MergeBot
03b7ec9237 Revert "create a new torch.cuda.memory_usage_in_bytes api (#140719)"
This reverts commit 9febc47637.

Reverted https://github.com/pytorch/pytorch/pull/140719 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the test is flaky on ROCm ([comment](https://github.com/pytorch/pytorch/pull/140719#issuecomment-2479832082))
2024-11-15 20:05:32 +00:00
Laith Sakka
500ce29e4c Use has_free_unbacked_symbols instead of bool(free_unbacked_symbols) (#140027)
With 20K features this saves ~20 seconds: 257.021589517593 s -> 237.8304626941681 s. Benchmark command:
buck2 run @fbcode//mode/opt fbcode//torchrec/distributed/tests:pt2_compile_benchmark -- --num-features=2000
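
The intuition, as a sketch (illustrative `is_unbacked` predicate; the real check matches PyTorch's unbacked symbol types):

```python
import sympy

def is_unbacked(sym):
    # Illustrative stand-in: in PyTorch, unbacked SymInts are named u0, u1, ...
    return sym.name.startswith("u")

def free_unbacked_symbols(expr):
    # Old pattern: materialize the full set just to test emptiness -- O(n) every time.
    return {s for s in expr.free_symbols if is_unbacked(s)}

def has_free_unbacked_symbols(expr):
    # New pattern: short-circuit on the first unbacked symbol found.
    return any(is_unbacked(s) for s in expr.free_symbols)

u0, s0 = sympy.symbols("u0 s0")
assert has_free_unbacked_symbols(u0 + s0)
assert not has_free_unbacked_symbols(s0 * 2)
```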

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140027
Approved by: https://github.com/ezyang
2024-11-15 19:01:06 +00:00
Yu Guo
9febc47637 create a new torch.cuda.memory_usage_in_bytes api (#140719)
Summary:
The current `torch.cuda.memory_usage` returns memory utilization; specifically, for NVIDIA it is the percent of time over the past sample period during which global memory was being read or written.

see more details in https://github.com/pytorch/pytorch/issues/140638

Test Plan: added a new unittest

Differential Revision: D65928031

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140719
Approved by: https://github.com/xw285cornell, https://github.com/hongxiayang
2024-11-15 05:59:40 +00:00
Vincent Moens
03cccaa76a Doc: Rewrite the storage.rst file to emphasize untyped storages (#140145)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140145
Approved by: https://github.com/janeyx99
2024-11-13 17:40:16 +00:00
Tongzhou Wang
7b0d199471 [doc] fix grammar in "Extending Torch" (#140209)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140209
Approved by: https://github.com/soulitzer
2024-11-13 05:34:43 +00:00
Tongzhou Wang
4c6eebf4e2 [doc] improve code in fake tensor doc (#140329)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140329
Approved by: https://github.com/soulitzer
2024-11-13 05:14:56 +00:00
William Wen
be172d2a60 [pt2, docs] Add new PT2 troubleshooting doc (#138620)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138620
Approved by: https://github.com/ezyang

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2024-11-09 01:17:39 +00:00
Bin Bao
63a0d6587e [AOTI] Update the OSS tutorial (#139956)
Summary: Update the OSS tutorial to use the new aoti_compile_and_package and aoti_load_package APIs.
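
A condensed sketch of the new two-call workflow the tutorial now demonstrates (toy model; see the tutorial for the full version):

```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

ep = torch.export.export(M(), (torch.randn(4),))
pkg_path = torch._inductor.aoti_compile_and_package(ep)  # builds a .pt2 package
compiled = torch._inductor.aoti_load_package(pkg_path)
print(compiled(torch.randn(4)))
```
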
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139956
Approved by: https://github.com/angelayi
ghstack dependencies: #139955
2024-11-08 20:46:57 +00:00
Jerry Zhang
1fcc99c6bf Update quantization.rst (#139824)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139824
Approved by: https://github.com/svekars
2024-11-08 02:34:50 +00:00
John MacCormick
81d077cca2 Fix to modules.rst: indent line with activation functions (#139667)
At line 205, I believe the code `x = self.activations[act](x)` should be indented so that it is in the body of the for loop. Otherwise, applying the four linear modules has the same effect as applying a single linear module: the composition is still just a linear map, so there is no point in having four of them. In other words, each layer of this network should have a nonlinearity.
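
A sketch of the pattern being fixed (illustrative names, not the exact doc code):

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.linears = nn.ModuleList(nn.Linear(8, 8) for _ in range(4))
        self.activations = nn.ModuleDict({"relu": nn.ReLU(), "tanh": nn.Tanh()})

    def forward(self, x, act):
        for linear in self.linears:
            x = linear(x)
            x = self.activations[act](x)  # indented into the loop: a nonlinearity per layer
        return x

print(Net()(torch.randn(2, 8), "relu").shape)
```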

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139667
Approved by: https://github.com/malfet
2024-11-08 01:12:52 +00:00
Tongzhou Wang
22dd17c7bb [doc] fixing missing colon in custom op doc (#140060)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140060
Approved by: https://github.com/malfet
2024-11-07 23:48:44 +00:00
Mikayla Gawarecki
2ee91db03d Add APIs to separate norm calculation and gradient scaling in nn.utils.clip_grad_norm_ (#139662)
Fixes https://github.com/pytorch/pytorch/issues/139467

Refactors `nn.utils.clip_grad_norm_` into `nn.utils.get_total_norm` and `nn.utils.clip_grads_with_norm_`. `clip_grad_norm_` now calls into these two new ops.

`get_total_norm` is generalized (rather than a `get_grad_norm`, per the discussion on the issue from @awgu).
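
A short sketch of how the new pieces compose:

```python
import torch
from torch import nn

params = [nn.Parameter(torch.randn(3, 3)) for _ in range(2)]
for p in params:
    p.grad = torch.randn_like(p)

grads = [p.grad for p in params]
total_norm = nn.utils.get_total_norm(grads, norm_type=2.0)
nn.utils.clip_grads_with_norm_(params, max_norm=1.0, total_norm=total_norm)
# Equivalent one-shot call, unchanged for existing users:
# nn.utils.clip_grad_norm_(params, max_norm=1.0)
```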

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139662
Approved by: https://github.com/H-Huang
2024-11-07 23:13:23 +00:00
Shangdi Yu
83e36a6bfa AOTI Minifier (#139351)
See documentation at https://docs-preview.pytorch.org/pytorch/pytorch/139351/torch.compiler_aot_inductor_minifier.html.

Add a minifier for AOTI.

Test Plan:
python test/inductor/test_minifier.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139351
Approved by: https://github.com/desertfire
2024-11-07 21:43:44 +00:00
Tom Fogal
b5286ba207 Small fix to Python rendering in documentation. (#138281)
The text was being rendered as normal text, but I believe it was meant to be code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138281
Approved by: https://github.com/janeyx99
2024-11-07 20:48:47 +00:00
Will Constable
2b400236c2 [DCP] Cross-link DCP doc to tutorials (#139776)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139776
Approved by: https://github.com/mhorowitz, https://github.com/LucasLLC, https://github.com/fduwjj
ghstack dependencies: #139938
2024-11-07 02:19:49 +00:00
Jay Zhang
99deedff57 [ONNX] Describe memory usage of TorchDynamo-based exporter. (#139388)
Add new documentation showing one memory-usage benefit brought by the TorchDynamo-based ONNX exporter.

Also add a unit test to make sure the TorchDynamo-based ONNX exporter works well under FakeTensorMode.
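
A sketch of the FakeTensorMode export path those docs describe (toy model; real weights are attached at save time):

```python
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(1024, 1024)

    def forward(self, x):
        return self.linear(x)

with torch.onnx.enable_fake_mode():
    model = M()               # parameters are fake tensors: no real memory allocated
    x = torch.randn(1, 1024)

onnx_program = torch.onnx.dynamo_export(model, x)
# Supply real weights when serializing, e.g. via onnx_program.save(..., model_state=...)
```
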
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139388
Approved by: https://github.com/xadupre
2024-11-06 17:29:11 +00:00
Tongzhou Wang
faab564bda [doc] Fix grammar in export.ir_spec.rst (#139584)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139584
Approved by: https://github.com/zou3519
2024-11-05 23:26:36 +00:00
Ryan Guo
693a0a1bd4 [dynamo][NFC] Rename mutable_local and add documentation (#139339)
This patch addresses the renaming part of #133027, specifically, it
renames the following and adds documentation for relevant classes.
1. `VariableTracker.mutable_local` to `mutation_type`
2. `MutableLocal` to `ValueMutationNew`
3. `MutableSideEffects` to `ValueMutationExisting`
4. `MutableLocalSource` to `SourceType`
5. `MutableLocalSource.Local` to `New`

Note that (2), (3) and (5) are mainly to bring consistency between them
and `AttributeMutationNew`, `AttributeMutationExisting`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139339
Approved by: https://github.com/jansel, https://github.com/mlazos, https://github.com/anijain2305
2024-11-05 19:11:41 +00:00
Henry Tsang
350bc2a166 [export] Add support for symbool to make it usable for torch.cond (#138765)
# Why?

I want the following code to work.

minimal repro:
```
class M(torch.nn.Module):
    def forward(self, dilate_flag):
        return dilate_flag.item()

input1 = (torch.tensor([1], dtype=torch.bool, device="cuda"),)
model = M().cuda()

ep = torch.export.export(model, input1, strict=True)
path = torch._inductor.aot_compile(ep.module(), input1)
aot_model = torch._export.aot_load(path, device="cuda")
actual_output = aot_model(*input1)
```

error: AssertionError: Encountered an unsupported object of type <class 'torch.SymBool'> while writing the metadata for exported program

The second error will be handled by https://github.com/pytorch/pytorch/pull/138760

# Motivation

I could technically bypass it with a torch.int tensor. However, it doesn't work with torch.cond. I want the following to work. It would also require https://github.com/pytorch/pytorch/pull/138760 for aot compile to work.

```
class M(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.dilate_flag = 0

    def forward(self, dilate_flag):
        self.dilate_flag = dilate_flag.item()

        def true_fn(dilate_flag):
            return dilate_flag.clone()

        def false_fn(dilate_flag):
            return dilate_flag.clone()

        torch.cond(
            self.dilate_flag,
            true_fn,
            false_fn,
            (dilate_flag,),
        )
        return self.dilate_flag

input1 = (torch.tensor([1], dtype=torch.bool, device="cuda"),)
input2 = (torch.tensor([0], dtype=torch.bool, device="cuda"),)
inputs = (input1, input2)
model = M().cuda()

for input in inputs:
    expected_output = model(*input)

    ep = torch.export.export(model, input, strict=False)
    path = torch._inductor.aot_compile(ep.module(), input)
    aot_model = torch._export.aot_load(path, device="cuda")
    actual_output = aot_model(*input)

    assert (
        expected_output == actual_output
    ), f"henry they are not equal {expected_output} != {actual_output}"
```

Differential Revision: D64867504

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138765
Approved by: https://github.com/ydwu4
2024-11-04 23:31:49 +00:00
Jane Xu
514c466cd9 Redirect the custom ops landing page :D (#139634)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139634
Approved by: https://github.com/zou3519
2024-11-04 22:25:15 +00:00
Will Constable
3d93caf664 [c10d] Add thread-safety initialization warning (#139638)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139638
Approved by: https://github.com/kwen2501, https://github.com/c-p-i-o, https://github.com/XilunWu
2024-11-04 21:38:47 +00:00
Edward Z. Yang
585dbfa583 Profile guided optimization for automatic_dynamic (#139001)
Previously: https://github.com/pytorch/pytorch/pull/138052, but the implementation was redone from scratch, so I opened a new PR.

This implements the ability to save and load profiles of automatic dynamic decisions, so on subsequent runs we can directly make something automatically dynamic. Unlike the previous implementation, this cache is never enabled by default; instead, you have to specify a "job id" that says it's OK to share results. We will be able to automatically populate this id for internal MAST jobs, but generic OSS users will have to explicitly opt into it.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139001
Approved by: https://github.com/oulgen
2024-11-03 06:29:57 +00:00
PyTorch MergeBot
92d7f29e59 Revert "Profile guided optimization for automatic_dynamic (#139001)"
This reverts commit f6be44c74e.

Reverted https://github.com/pytorch/pytorch/pull/139001 on behalf of https://github.com/ezyang due to more fbcode errors ([comment](https://github.com/pytorch/pytorch/pull/139001#issuecomment-2452985581))
2024-11-02 13:11:04 +00:00
Edward Z. Yang
f6be44c74e Profile guided optimization for automatic_dynamic (#139001)
Previously: https://github.com/pytorch/pytorch/pull/138052, but the implementation was redone from scratch, so I opened a new PR.

This implements the ability to save and load profiles of automatic dynamic decisions, so on subsequent runs we can directly make something automatically dynamic. Unlike the previous implementation, this cache is never enabled by default; instead, you have to specify a "job id" that says it's OK to share results. We will be able to automatically populate this id for internal MAST jobs, but generic OSS users will have to explicitly opt into it.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Differential Revision: [D65065497](https://our.internmc.facebook.com/intern/diff/D65065497)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139001
Approved by: https://github.com/oulgen
2024-11-02 11:50:11 +00:00
PyTorch MergeBot
8d1eaa3da6 Revert "Profile guided optimization for automatic_dynamic (#139001)"
This reverts commit a6630bcf87.

Reverted https://github.com/pytorch/pytorch/pull/139001 on behalf of https://github.com/ezyang due to internal code triggers import cycle ([comment](https://github.com/pytorch/pytorch/pull/139001#issuecomment-2452833882))
2024-11-02 03:38:15 +00:00
Mikayla Gawarecki
a979318ef7 Add section to serialization note re weights_only (#139433)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139433
Approved by: https://github.com/malfet
ghstack dependencies: #138936, #139221
2024-11-01 21:51:50 +00:00
Edward Z. Yang
a6630bcf87 Profile guided optimization for automatic_dynamic (#139001)
Previously: https://github.com/pytorch/pytorch/pull/138052, but the implementation was redone from scratch, so I opened a new PR.

This implements the ability to save and load profiles of automatic dynamic decisions, so on subsequent runs we can directly make something automatically dynamic. Unlike the previous implementation, this cache is never enabled by default; instead, you have to specify a "job id" that says it's OK to share results. We will be able to automatically populate this id for internal MAST jobs, but generic OSS users will have to explicitly opt into it.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Differential Revision: [D65065497](https://our.internmc.facebook.com/intern/diff/D65065497)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139001
Approved by: https://github.com/oulgen
2024-11-01 21:43:25 +00:00
Mikayla Gawarecki
ea0e09b3f3 Add utility to get all unsafe globals in checkpoint (no pickletools dependency) (#139221)
Fixes https://github.com/pytorch/pytorch/issues/129698

https://github.com/pytorch/pytorch/pull/139106 without pickletools
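
A sketch of the intended usage (the utility reports globals that a `weights_only=True` load would reject):

```python
import torch

unsafe = torch.serialization.get_unsafe_globals_in_checkpoint("checkpoint.pt")
print("globals a weights_only load would reject:", unsafe)
# After auditing, opt in only to the ones you trust:
# torch.serialization.add_safe_globals([MyTrustedClass])
```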

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139221
Approved by: https://github.com/malfet
ghstack dependencies: #138936
2024-11-01 19:31:39 +00:00
bskrlj
8e27833e30 Ensure SWA boundary conditions w.r.t. definition (#133773)
According to the documentation, decay is a number in the [0,1] range, [i.e.](https://pytorch.org/docs/stable/optim.html):
```
Decay is a parameter between 0 and 1 that controls how fast the averaged parameters are decayed. If not provided to get_ema_multi_avg_fn, the default is 0.999.
```
An inspection of `swa_utils.py` indicates there are no checks for invalid values of `decay`. Adding asserts, as suggested in this PR, ensures a valid compute range (one way to enforce correct behavior; there are perhaps more suitable ones). The papers `torch` cites for the reference idea/implementation also consider exclusively this range (e.g., https://arxiv.org/pdf/2310.04415).
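
A minimal sketch of the kind of guard this adds (the PR's exact wording and placement in `swa_utils.py` may differ):

```python
import torch

def get_ema_multi_avg_fn(decay=0.999):
    # Guard in the spirit of this PR: decay must lie in [0, 1].
    assert 0.0 <= decay <= 1.0, f"decay must be in [0, 1], got {decay}"

    @torch.no_grad()
    def ema_update(ema_params, current_params, num_averaged):
        for ema_p, p in zip(ema_params, current_params):
            ema_p.lerp_(p, 1.0 - decay)  # ema = decay * ema + (1 - decay) * p

    return ema_update
```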

Fixes https://github.com/pytorch/pytorch/issues/133772

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133773
Approved by: https://github.com/janeyx99
2024-10-31 18:24:08 +00:00
Nhat Minh Luu
261d90c18f Add docs page for torch.inf and torch.nan (#138430)
Fixes #131040

## Description
Add docs for `torch.inf` and `torch.nan`.
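
For illustration, the constants behave like their `math` counterparts:

```python
import math
import torch

assert torch.inf == math.inf
assert torch.isinf(torch.tensor(torch.inf))
assert torch.isnan(torch.tensor(torch.nan))  # nan != nan, so test with isnan
```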

## Checklist
- [x] The issue that is being fixed is referred in the description (see above "Fixes #ISSUE_NUMBER")
- [x] Only one issue is addressed in this pull request
- [x] Labels from the issue that this PR is fixing are added to this pull request
- [x] No unnecessary issues are included into this pull request.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138430
Approved by: https://github.com/ezyang
2024-10-31 05:46:46 +00:00
Boyuan Feng
68134a320e [Flex Attention] Paged Attention (#137164)
This PR adds paged attention for flex attention.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137164
Approved by: https://github.com/drisspg
2024-10-29 17:05:22 +00:00
Jeff Daily
7c7b2d89ba [ROCm] set hipblas workspace (#138791)
Fixes #138532.

This brings hipBLAS behavior in line with cuBLAS behavior with respect to setting the workspace to an allocation from the caching allocator, as well as the env var HIPBLAS_WORKSPACE_CONFIG.
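
Illustrative use, assuming HIPBLAS_WORKSPACE_CONFIG mirrors CUBLAS_WORKSPACE_CONFIG's `:[SIZE_KiB]:[COUNT]` syntax:

```python
import os

# Must be set before the first hipBLAS handle is created.
os.environ["HIPBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # e.g. 8 workspaces of 4096 KiB

import torch

x = torch.randn(512, 512, device="cuda")  # ROCm builds expose HIP devices via the cuda API
y = x @ x                                 # GEMM runs through hipBLAS with the configured workspace
```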

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138791
Approved by: https://github.com/naromero77amd, https://github.com/eqy, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-10-29 01:37:55 +00:00
Svetlana Karslioglu
e00ead400c Add a temporary Survey about the search (#139096)
- Add a link to the new search survey
- Add .css classes needed for the search banner

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139096
Approved by: https://github.com/seemethere, https://github.com/cjyabraham
2024-10-28 23:43:25 +00:00
Joel Schlosser
8ba9063002 FlexAttention support for NJT (#136792)
This PR adds FlexAttention + NJT support. In particular:
* To handle raggedness, treats the packed sequence dim of input NJTs as a giant "stacked sequence". To ensure user `score_mod` / `mask_mod` functions can still be written in the original NJT sequence space, this PR automatically handles converting indices within the giant "stacked sequence" to sequence-relative indices.
* Provides `py_impls` for `NestedTensor` to the HOPs for flex attention forward / backward that simply wrap / unwrap NJTs appropriately
* Adds barebones `new_empty()` support to NJT since FlexAttention utilizes this repeatedly; right now, only `new_empty()` with a shape of `()` is supported
* Tests that FlexAttention with a causal mask matches causal SDPA
* Adds a new public API for FlexAttention usage:
    * `create_nested_block_mask(mask_mod, B, H, njt, BLOCK_SIZE, _compile)` - NJT analogue for `create_block_mask()` that utilizes the `njt`'s ragged structure to create an appropriately-sized block mask (e.g. `(1, 1, total_seqlen, total_seqlen)`). This function handles the index conversion from "stacked sequence" space -> relative sequence space.
      * Minor note: as this is a public API, this function is purposefully named with "nested" instead of "njt" to keep the latter as an informal, mostly internal-only term.

Example usage:
```python
def causal_mask(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

query = ... # NJT of shape (B, H, S*, D)
key = ... # NJT of shape (B, H, S*, D)
value = ... # NJT of shape (B, H, S*, D)
# create_nested_block_mask() automatically converts indices from "stacked sequence" space -> relative sequence space
block_mask = create_nested_block_mask(causal_mask, 1, 1, query)  # block mask conceptual shape is (B, H, sum(S*), sum(S*))
output = flex_attention(query, key, value, block_mask=block_mask)

def causal_score_mod(score, b, h, q_idx, kv_idx):
    return torch.where(q_idx >= kv_idx, score, float("-inf"))

# flex_attention() automatically converts indices from "stacked sequence" space -> relative sequence space for NJT inputs
output2 = flex_attention(query, key, value, score_mod=causal_score_mod)
```

TODO:
* ~~Determine the right level of abstraction for public API helpers + move them alongside other helpers~~ Verify this with others though
* ~~Some cleanup~~
* ~~`njt_score_mod_adapter`~~
* ~~Q: should `create_njt_block_mask()` call `njt_mask_mod_adapter()` so we don't need two calls?~~
* Can we avoid materializing the `sum(s)` length `seq_idx` used for conversion between stacked sequence -> sequence relative indices?
    * Not for now, although future work may deepen the integration between Flex + NJT (possibly requiring custom templates). We should try to cache this though.
* ~~Demonstrate non-causal mask~~
* Support non-contiguous NJTs with holes (**booted to future PR**)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136792
Approved by: https://github.com/drisspg
ghstack dependencies: #138841
2024-10-28 20:01:27 +00:00
Wouter Devriendt
bae3426af7 reimport pr137735 due to merging check issues (#138959)
This is a cherry-pick from #137735 by @mikaylagawarecki, which cannot be merged due to a (wrongly) failing check for codev.

@diff-train-skip-merge

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138959
Approved by: https://github.com/mikaylagawarecki
2024-10-27 16:31:34 +00:00
Yu, Guangye
40c098f731 Introduce a device-agnostic runtime API design (#132204)
# Motivation
According to [[RFC] A device-agnostic Python runtime API design for stream-based accelerators](https://github.com/pytorch/pytorch/issues/128403), this PR intends to introduce a device-agnostic runtime API design.
I personally prefer the **Simple Version** APIs that no longer accept the device type as an input argument. This means we leverage `getAccelerator` to fetch the current accelerator, and it is flexible enough to expand these APIs to handle multiple accelerator scenarios. The design does **NOT** break the previous design philosophies.
I also believe the torch.accelerator namespace is better: it lets users know that the APIs they are calling run on an accelerator rather than the CPU. This is important. Meanwhile, we can follow a simple API design principle:
1. Device-agnostic APIs should be placed under the torch.accelerator namespace and not accept a device_type optional parameter.
2. Device-specific APIs should be placed under device-specific submodules.
3. APIs required by both CPU and accelerators should be placed under the torch namespace and accept a device_type optional parameter.

Also, I list the pros and cons of **Simple Version** here:
Pros:
- `torch.accelerator.foo` will have the same input arguments as `torch.xxx.foo`, bringing a better user experience;
- more concise, making it easier for developers to write device-agnostic code.

Cons:
- no obvious drawbacks.

# Additional Context
I list the new APIs here:
```python
torch.accelerator.is_available() -> bool:
torch.accelerator.current_accelerator() -> torch.device:
torch.accelerator.device_count() -> int:
torch.accelerator.current_device_idx() -> int:
torch.accelerator.set_device_idx(device: Union[torch.device, str, int, None]) -> None:
torch.accelerator.current_stream(device: Union[torch.device, str, int, None]) -> torch.Stream:
torch.accelerator.set_stream(stream: torch.Stream) -> None:
torch.accelerator.synchronize(device: Union[torch.device, str, int, None]) -> None:
```
According to the discussion with Alban, we decided to change the API name `set_device` to `set_device_idx` and `current_device` to `current_device_idx` to be more explicit, and we will submit another PR to support device and stream context managers.
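
A device-agnostic sketch using the APIs as named in this PR (the `*_idx` names reflect this PR and may evolve in follow-ups):

```python
import torch

if torch.accelerator.is_available():
    acc = torch.accelerator.current_accelerator()  # e.g. device(type='cuda') or device(type='xpu')
    torch.accelerator.set_device_idx(0)
    x = torch.randn(8, device=acc)
    torch.accelerator.synchronize()  # wait for outstanding work on the current accelerator
```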

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132204
Approved by: https://github.com/EikanWang, https://github.com/abhilash1910, https://github.com/gujinghui, https://github.com/albanD
2024-10-27 10:37:09 +00:00
Laith Sakka
ed313a5ca2 Introduce torch.sym_add, variadic add (#138660)
Tested internally here: https://www.internalfb.com/diff/D64057744
This is a reland after previous internal failures.
The main change is:
```
 if min is None and max is None:
        torch._check_is_size(size)
        return
```

Partially addresses https://github.com/pytorch/pytorch/issues/128150

When you have big sums of values, we end up computing long chains of
binary addition in our FX graph representation.  Not only is this ugly,
it is also quadratic, as the sympy.Add constructor is O(N) in the number
of arguments. Instead, ensure that we maintain the summation as a
single FX node so we can do the entire addition all in one go.
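
The quadratic-vs-linear point, sketched directly with sympy:

```python
import sympy

xs = sympy.symbols("u0:100")

# Chained binary adds: each sympy.Add construction is O(k) in its argument count,
# so building the chain costs O(N^2) overall.
chain = xs[0]
for x in xs[1:]:
    chain = chain + x

# Variadic: one O(N) construction -- the shape the new op keeps in the FX graph.
flat = sympy.Add(*xs)
assert chain == flat
```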

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138660
Approved by: https://github.com/ezyang, https://github.com/bobrenjc93
2024-10-23 17:42:41 +00:00
Laith Sakka
662d07e93e Remove parallel_and and parallel_or (#138135)
Not used; removal suggested by @ezyang.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138135
Approved by: https://github.com/ezyang
2024-10-23 00:22:22 +00:00