72112 Commits

Author SHA1 Message Date
lezcano
9a5b4d2403 Do not forward parent's value range to CSE variable for variables created within codegen. (#123099)
Consider we are generating code for `ops.gt`, and within it we call
`ops.to_dtype`. Before, we would forward the bounds from `gt` to the
result of `to_dtype`, which is wrong.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123099
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-04-23 06:26:39 +00:00
Isuru Fernando
edcd968b51 Add out wrappers to some decompositions (#115437)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115437
Approved by: https://github.com/lezcano
2024-04-23 06:26:11 +00:00
chilli
e0c5113dec Add support for capturing tensors with score_mod (#124444)
```
import torch
from torch import nn
import torch.nn.functional as F
import torch._inductor.config as config
# torch.set_default_device('cuda')

from torch.nn.attention._templated_attention import _templated_attention as templated_attention
from triton.testing import do_bench
from torch.nn.attention import SDPBackend, sdpa_kernel

index = torch.ops.aten
torch.manual_seed(0)

B = 16
H = 16
S = 2048
D = 64

head_scale = torch.randn(H, device='cuda')
def alibi(score, batch, head, token_q, token_kv):
    return score + torch.ops.aten.index(head_scale, [head]) * (token_q - token_kv)
bias = torch.randn(H, S, S, dtype=torch.float16, device='cuda')

query = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
key = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
value = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)

compiled = torch.compile(templated_attention)
out = compiled(query, key, value, score_mod=alibi)
out2 = templated_attention(query, key, value, score_mod=alibi)
print((out - out2).abs().mean())
assert (out - out2).abs().mean() < 1e-3
print("Flash (no mask): ", do_bench(lambda: F.scaled_dot_product_attention(query, key, value)))
print("Flash (mask): ", do_bench(lambda: F.scaled_dot_product_attention(query, key, value, attn_mask=bias)))
print("flexattention: ", do_bench(lambda: compiled(query, key, value, score_mod=alibi)))
```
<img width="324" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/18c175d0-2720-4dfd-8747-85b8a8f609f5">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124444
Approved by: https://github.com/jansel, https://github.com/drisspg
2024-04-23 06:20:13 +00:00
Mikayla Gawarecki
c82fcb7b30 Add testing and fix weights_only load for quantized types and nn.Parameters with python attrs (#124330)
Adds the following to allowed globals for the `weights_only` unpickler
- [x] `torch._utils._rebuild_qtensor` and qtensor related types
- [x] `torch._utils._rebuild_parameter_with_state` (used when deserializing a parameter that has user-defined attributes like `Param.foo`)

The remaining rebuild functions that have not been allowlisted are

- [x] `torch._utils._rebuild_wrapper_subclass` (allowlisted in above PR)
- [ ] `torch._utils._rebuild_device_tensor_from_numpy`
- [ ] `torch._utils._rebuild_xla_tensor` (legacy)
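
The allowlisting idea can be sketched with the standard library alone. This is a minimal illustration of restricting which globals an unpickler may resolve, not the `weights_only` unpickler itself; `ALLOWED_GLOBALS`, `RestrictedUnpickler` and `restricted_loads` are hypothetical names.

```python
import io
import pickle
from collections import Counter, OrderedDict

# Hypothetical allowlist: (module, name) pairs the loader is willing to resolve.
ALLOWED_GLOBALS = {("collections", "OrderedDict")}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Resolve only allowlisted globals and reject everything else.
        if (module, name) in ALLOWED_GLOBALS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is not allowlisted")


def restricted_loads(data: bytes):
    return RestrictedUnpickler(io.BytesIO(data)).load()


# An allowlisted type round-trips.
loaded = restricted_loads(pickle.dumps(OrderedDict(a=1)))

# A type that is not on the allowlist is rejected.
try:
    restricted_loads(pickle.dumps(Counter("abc")))
    rejected = False
except pickle.UnpicklingError:
    rejected = True
```

Under this scheme, supporting a new rebuild function amounts to adding its `(module, name)` pair to the allowlist.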

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124330
Approved by: https://github.com/albanD
2024-04-23 04:13:26 +00:00
Nikita Shulga
de5d689cf9 [EZ] Update pillow to 10.3.0 (#124614)
As older versions are subject to [CVE-2024-28219](https://nvd.nist.gov/vuln/detail/CVE-2024-28219), although it's not super important from CI PoV

Modernize `torch/utils/tensorboard/summary.py` to use Pillow-9+ APIs (is this file even used for anything anymore?)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124614
Approved by: https://github.com/Skylion007, https://github.com/ZainRizvi
2024-04-23 03:22:23 +00:00
atalman
7706cd7d12 Extend CPU inductor merge rule (#124671)
To help unblock: https://github.com/pytorch/pytorch/pull/123710
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124671
Approved by: https://github.com/leslie-fang-intel, https://github.com/huydhn
2024-04-23 02:18:00 +00:00
Edward Z. Yang
660db767ef Don't clean up fresh inductor cache on error (#124620)
Useful for local debugging.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124620
Approved by: https://github.com/oulgen, https://github.com/desertfire, https://github.com/jansel
2024-04-23 02:13:05 +00:00
Oguz Ulgen
7e095be4b6 Fix test_max_autotune_remote_caching (#124655)
D55206000 broke this test. It is not clear why it did not run in the CI but here's the fix.

Differential Revision: [D56439213](https://our.internmc.facebook.com/intern/diff/D56439213/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124655
Approved by: https://github.com/aorenste
2024-04-23 01:41:04 +00:00
Fadi Botros
375ec25f55 Add missing aten::sort.any op for assistant lm models (#123982)
Differential Revision: D56084098

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123982
Approved by: https://github.com/JacobSzwejbka
2024-04-23 01:35:07 +00:00
cyy
ea61c9cb29 [Distributed] [5/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124043)
This PR continues to fix some clang-tidy warnings in distributed/c10d code, following https://github.com/pytorch/pytorch/pull/124032.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124043
Approved by: https://github.com/ezyang
2024-04-23 00:43:50 +00:00
Ashwin Hari
5f5778476a rename ort to maia (#123265)
Fixes #123264

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123265
Approved by: https://github.com/albanD
2024-04-23 00:33:25 +00:00
leslie-fang-intel
bffecb5aff [Inductor] Enable VecMask store (#123710)
**Summary**
Enable the vectorization of store with `bool` dtype.

**Test Plan**
```
python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_decomposed_fake_quant_per_channel
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123710
Approved by: https://github.com/jgong5, https://github.com/lezcano
ghstack dependencies: #123512
2024-04-23 00:29:47 +00:00
leslie-fang-intel
dd440ac734 Add Matmul recipe into x86_inductor_quantizer (#122776)
**Summary**
Add `matmul` in the quantization recipes, noting that it's not a general recipe but tailored to meet accuracy criteria for specific models. `matmul` recipe is disabled by default.

**Test Plan**
```
python -m pytest quantization/pt2e/test_x86inductor_quantizer.py -k test_attention_block
```

Differential Revision: [D56288468](https://our.internmc.facebook.com/intern/diff/D56288468)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122776
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2024-04-23 00:25:41 +00:00
soulitzer
1fcdea8cd6 Do not import transformers when import torch._dynamo (#124634)
Fixes https://github.com/pytorch/pytorch/issues/123954

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124634
Approved by: https://github.com/thiagocrepaldi, https://github.com/Chillee
ghstack dependencies: #124343
2024-04-23 00:25:20 +00:00
nopperl
0c21161488 Add meta function for torch.histc (#124548)
Registers a meta function for the `aten.histc.default` and `aten.histc.out` ops to support `torch.compile(dynamic=True)`. Fixes #124512.
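
As a rough illustration of what a meta function does, the sketch below computes only the output metadata of a `histc`-like op without touching any data. It is a toy model, not the registered PyTorch implementation; `TensorMeta` and `histc_meta` are hypothetical names, and the only facts assumed are that `histc` returns a 1-D result with `bins` elements and keeps the input dtype.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TensorMeta:
    # Hypothetical stand-in for a tensor that carries metadata only.
    shape: Tuple[int, ...]
    dtype: str


def histc_meta(inp: TensorMeta, bins: int = 100) -> TensorMeta:
    # The output shape depends only on `bins`, never on the input values,
    # so it can be derived without running the real kernel.
    if bins <= 0:
        raise ValueError("bins must be > 0")
    return TensorMeta(shape=(bins,), dtype=inp.dtype)


out = histc_meta(TensorMeta(shape=(4, 8), dtype="float32"), bins=10)
```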

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124548
Approved by: https://github.com/lezcano, https://github.com/peterbell10
2024-04-23 00:24:59 +00:00
Edward Z. Yang
6054789874 Make numel equal test size oblivious in reshape_symint (#124611)
Fixes https://github.com/pytorch/pytorch/issues/124581

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124611
Approved by: https://github.com/bdhirsh
ghstack dependencies: #124139
2024-04-22 23:59:40 +00:00
Nikita Shulga
abf3f90781 [MPS] Fix large copy (#124635)
By slicing `copyFromBuffer:sourceOffset:toBuffer:destinationOffset:size:` into 2GB chunks
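
The chunking itself can be sketched in plain Python. This is only an illustration of splitting one large copy into bounded pieces; it does not use Metal, and `copy_in_chunks` is a hypothetical helper with a tiny chunk size standing in for the 2GB limit.

```python
def copy_in_chunks(src, dst, chunk_size):
    """Copy all of src into dst, moving at most chunk_size bytes per step."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if len(dst) < len(src):
        raise ValueError("destination is smaller than source")
    src_view = memoryview(src)
    dst_view = memoryview(dst)
    steps = 0
    for offset in range(0, len(src), chunk_size):
        # The last chunk may be shorter than chunk_size.
        end = min(offset + chunk_size, len(src))
        dst_view[offset:end] = src_view[offset:end]
        steps += 1
    return steps


src = bytes(range(10))
dst = bytearray(10)
steps = copy_in_chunks(src, dst, chunk_size=4)  # chunks: [0, 4), [4, 8), [8, 10)
```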

Add a regression test, but limit it to machines with 12GB of RAM or more and macOS 14+, as on macOS 13 an attempt to allocate a 4GB tensor fails with:
```
/AppleInternal/Library/BuildRoots/c651a45f-806e-11ed-a221-7ef33c48bc85/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:724: failed assertion `[MPSNDArray initWithDevice:descriptor:] Error: total bytes of NDArray > 2**32'
```

Fixes https://github.com/pytorch/pytorch/issues/124335

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124635
Approved by: https://github.com/kulinseth
2024-04-22 23:43:11 +00:00
Yanbo Liang
72a34eeb99 Dynamo x autograd.Function supports non-{Tensor, symnode, constant} inputs (#124360)
Fixes #118395

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124360
Approved by: https://github.com/zou3519
2024-04-22 23:32:54 +00:00
atalman
302d7e9a6e [Binary Build] Increase timeout for Linux nightly binary builds (#124668)
Related issue: https://github.com/pytorch/pytorch/issues/124667. Please note, this is a mitigation PR. Will follow up with an investigation and a proper fix.

Similar to: 94d6463255

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124668
Approved by: https://github.com/huydhn
2024-04-22 22:38:39 +00:00
Shuqiang Zhang
87a35d5a29 Use new function to log one cluster per line (#124628)
Summary:
For motivation behind the overall stack of diffs see D56218385 summary.

This particular diff makes cpp_dumper take a custom printer function that logs call stacks one group at a time, so it no longer runs into the 30K-character limit of `LOG(INFO)`.
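
The shape of that change can be sketched as follows. This is a toy Python analogue, not the C++ code in the diff; `dump_groups` and the sample frames are hypothetical.

```python
from typing import Callable, Iterable, List


def dump_groups(groups: Iterable[List[str]], printer: Callable[[str], None]) -> int:
    """Send each group of call stacks through `printer` as its own message."""
    count = 0
    for group in groups:
        # One message per group keeps every message short, instead of
        # joining all groups into a single oversized string.
        printer(" | ".join(group))
        count += 1
    return count


messages: List[str] = []
groups = [["frame_a", "frame_b"], ["frame_c"]]
emitted = dump_groups(groups, messages.append)
```

Any callable that accepts a string works as the printer, for example the `info` method of a `logging.Logger`.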

Test Plan:
```
[romanmal@46150.od /data/sandcastle/boxes/fbsource/fbcode (520a7b7b5)]$ buck2 test //caffe2/torch/csrc/distributed/c10d/...
File changed: fbcode//common/base/ThreadStackTrace.cpp
File changed: fbsource//xplat/caffe2/torch/csrc/distributed/c10d/fb/TraceUtils.cpp
File changed: fbcode//caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp
4 additional file change events
Buck UI: https://www.internalfb.com/buck2/d8ceae86-7d6f-4779-ad0c-8e37eddcff98
Network: Up: 0B  Down: 0B
Jobs completed: 2. Time elapsed: 1.5s.
Tests finished: Pass 0. Fail 0. Fatal 0. Skip 0. Build failure 0
NO TESTS RAN
[romanmal@46150.od /data/sandcastle/boxes/fbsource/fbcode (520a7b7b5)]$
```

Tested to print the stack trace:
P1220109730

Differential Revision: D56218360

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124628
Approved by: https://github.com/wconstab
2024-04-22 21:57:39 +00:00
William Wen
501edc7e59 [inductor, test] remove cast for test_tmp_not_defined_issue2_cpu (#114910)
Does this verify that https://github.com/pytorch/pytorch/issues/94017 is fixed?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114910
Approved by: https://github.com/angelayi
2024-04-22 21:51:53 +00:00
Sheng Fu
ba3c00c266 [test_profiler.py] Disable tqdm monitor thread and torch.compile with compile_threads=1 (#124409)
Summary: If tqdm is not shut down properly, it leaves its monitor thread alive. This causes an issue in the multithreading test because that test checks all events against their TIDs, and the events that correspond to these lingering threads all have a TID of (uint64_t)(-1), which is invalid. The workaround is to turn off the monitor thread when tqdm is loaded. Since these are unit tests, it is safe to turn off the monitor thread.
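
A minimal sketch of that workaround, assuming it is done through tqdm's `monitor_interval` class attribute (setting it to 0 disables the monitor thread). The guarded import and the `tqdm_monitor_threads` helper are illustrative and not taken from the test file.

```python
import io
import threading

try:
    import tqdm

    # 0 tells tqdm not to start its background monitor thread.
    tqdm.tqdm.monitor_interval = 0
    HAVE_TQDM = True
except ImportError:
    HAVE_TQDM = False


def tqdm_monitor_threads():
    """Return live threads whose class is named like tqdm's monitor (TMonitor)."""
    return [t for t in threading.enumerate() if type(t).__name__ == "TMonitor"]


if HAVE_TQDM:
    # Drive a progress bar, writing to an in-memory buffer to keep output quiet.
    for _ in tqdm.tqdm(range(3), file=io.StringIO()):
        pass

lingering = tqdm_monitor_threads()
```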

Test Plan: buck test  mode/dev-sand caffe2/test:profiler

Differential Revision: D56310301

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124409
Approved by: https://github.com/aaronenyeshi
2024-04-22 21:51:14 +00:00
IvanKobzarev
c01499ecc6 [sym_shapes][perf] Cache ShapeEnv constrain_symbol_range calls (#124610)
Differential Revision: [D56422688](https://our.internmc.facebook.com/intern/diff/D56422688)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124610
Approved by: https://github.com/ezyang, https://github.com/lezcano
2024-04-22 21:49:08 +00:00
Wanchao Liang
05addd5658 [tp] add kwargs support to prepare_module_input (#124114)
As titled, this PR adds kwargs support to the PrepareModuleInput style.
Some modules have only kwargs inputs and no positional args,
so we should support this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124114
Approved by: https://github.com/XilunWu
2024-04-22 21:46:31 +00:00
Jithun Nair
5785b02ba6 Skip workspace permission change for ROCm CI (#123816)
PR https://github.com/pytorch/pytorch/pull/122922 added chown steps to test.sh and used the trap mechanism to ensure that, even if the test script fails and exits with a non-zero code, it will call the cleanup_workspace function on EXIT.

However, this doesn't work as intended when the CI job gets cancelled, e.g. if a PR pushes new commits and the CI job for the older commit gets cancelled. The trap function doesn't get called because the test script aborts immediately.

Any subsequent jobs scheduled on the same runner then fail in the 'Checkout PyTorch' step when they try to delete the workspace.

This has been resulting in a slew of CI failures on the HUD.

Example of this situation playing out on one of the ROCm runners:
Cancelled job: https://github.com/pytorch/pytorch/actions/runs/8563212279/job/23469711035

![image](https://github.com/pytorch/pytorch/assets/37884920/7192e4fe-8cff-4256-abc8-9f874a3918ff)

Subsequent failed job: https://github.com/pytorch/pytorch/actions/runs/8564517036/job/23472675041

![image](https://github.com/pytorch/pytorch/assets/37884920/24b0af66-cfe9-431f-851a-24a1ccc18e84)

This PR skips the logic introduced by PR 122922 for ROCm CI.

Alternative to https://github.com/pytorch/pytorch/pull/123468 and https://github.com/pytorch/pytorch/pull/123588

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123816
Approved by: https://github.com/pruthvistony, https://github.com/zxiiro, https://github.com/kit1980, https://github.com/malfet
2024-04-22 21:27:32 +00:00
Bin Bao
bb37910e30 [AOTI] Fixes ScatterFallback codegen (#124580)
Summary: For https://github.com/pytorch/pytorch/issues/123184. ScatterFallback currently relies on op name matching for codegen, which makes its cpp codegen fragile. Refactor to use op_overload and fix the relevant unit test failures.

Differential Revision: [D56417815](https://our.internmc.facebook.com/intern/diff/D56417815)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124580
Approved by: https://github.com/chenyang78
2024-04-22 20:47:26 +00:00
Catherine Lee
fd59554be6 Scripts to compile reruns + td exclusions and upload to s3 (#124312)
Edits upload_test_stats to also upload a condensed version that contains reruns, and one that contains the list of td_exclusions.

Grouped by build name + test config
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124312
Approved by: https://github.com/malfet
2024-04-22 20:19:35 +00:00
Edward Z. Yang
0bbbc754dd Add AOTInductor generated cpp code to TORCH_TRACE (#124617)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124617
Approved by: https://github.com/albanD
2024-04-22 19:25:20 +00:00
Jason Ansel
0093735ccd [inductor] Use compile time config values in runtime (#124561)
This removes usage of torch._inductor.config from `torch._inductor.runtime`.  Fixing two issues:
1) If configs change we should really use the compile time ones
2) In compile workers, we want to use the parent process config
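
Both points can be illustrated with a small sketch in which the compiled artifact carries a snapshot of the config instead of reading a global at run time. This is a toy model under stated assumptions: `GLOBAL_CONFIG`, `compile_kernel` and the `max_tiles` key are hypothetical and do not mirror inductor's actual code.

```python
# Hypothetical mutable global config, standing in for a module-level config object.
GLOBAL_CONFIG = {"max_tiles": 2, "debug": False}


def compile_kernel(name):
    # Snapshot the config once, at compile time.
    snapshot = dict(GLOBAL_CONFIG)

    def run():
        # Runtime code reads the snapshot, never GLOBAL_CONFIG.
        return f"{name}: max_tiles={snapshot['max_tiles']}"

    return run


kernel = compile_kernel("k0")
GLOBAL_CONFIG["max_tiles"] = 3  # the config changes after compilation
result = kernel()               # still reflects the compile-time value
```

Because the snapshot is a plain dict, it could also be passed to a worker process rather than re-read there.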

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124561
Approved by: https://github.com/yanboliang
ghstack dependencies: #124552, #124553, #124557, #124559, #124560, #124569
2024-04-22 18:46:40 +00:00
Jason Ansel
cb9fe91f5c [inductor] Remove config check for 3D tiling (#124569)
This makes the check per-kernel (if 3D tiling is used), rather than
global config.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124569
Approved by: https://github.com/yanboliang
ghstack dependencies: #124552, #124553, #124557, #124559, #124560
2024-04-22 18:46:40 +00:00
Jason Ansel
4620a45542 [inductor] Refactor runtime files into torch._inductor.runtime (part 5) (#124560)
I am planning to make the compile_worker process not import torch so it can start up much faster.  This stack is prep for that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124560
Approved by: https://github.com/yanboliang
ghstack dependencies: #124552, #124553, #124557, #124559
2024-04-22 18:46:35 +00:00
Jason Ansel
0cc0e60e30 [inductor] Refactor runtime files into torch._inductor.runtime (part 4) (#124559)
I am planning to make the compile_worker process not import torch so it can start up much faster.  This stack is prep for that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124559
Approved by: https://github.com/yanboliang
ghstack dependencies: #124552, #124553, #124557
2024-04-22 18:46:29 +00:00
Jason Ansel
7fd8870e6b [inductor] Refactor runtime files into torch._inductor.runtime (part 3) (#124557)
I am planning to make the compile_worker process not import torch so it can start up much faster.  This stack is prep for that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124557
Approved by: https://github.com/yanboliang
ghstack dependencies: #124552, #124553
2024-04-22 18:46:24 +00:00
Jason Ansel
bb8815bc31 [inductor] Refactor runtime files into torch._inductor.runtime (part 2) (#124553)
I am planning to make the compile_worker process not import torch so it can start up much faster.  This stack is prep for that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124553
Approved by: https://github.com/yanboliang
ghstack dependencies: #124552
2024-04-22 18:46:20 +00:00
Jason Ansel
480585fd2b [inductor] Refactor runtime files into torch._inductor.runtime (part 1) (#124552)
I am planning to make the compile_worker process not import torch so it can start up much faster.  This stack is prep for that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124552
Approved by: https://github.com/yanboliang
2024-04-22 18:41:12 +00:00
PyTorch MergeBot
16eea7c6a5 Revert "[inductor] Refactor runtime files into torch._inductor.runtime (part 1) (#124552)"
This reverts commit a7035cc11a.

Reverted https://github.com/pytorch/pytorch/pull/124552 on behalf of https://github.com/jeanschmidt due to There are internal breakages, already discussed with author and he'll FF ([comment](https://github.com/pytorch/pytorch/pull/124552#issuecomment-2070548223))
2024-04-22 18:28:05 +00:00
PyTorch MergeBot
56714cb497 Revert "[inductor] Refactor runtime files into torch._inductor.runtime (part 2) (#124553)"
This reverts commit f4d47f5bbb.

Reverted https://github.com/pytorch/pytorch/pull/124553 on behalf of https://github.com/jeanschmidt due to There are internal breakages, already discussed with author and he'll FF ([comment](https://github.com/pytorch/pytorch/pull/124552#issuecomment-2070548223))
2024-04-22 18:28:05 +00:00
PyTorch MergeBot
0b90af0bf5 Revert "[inductor] Refactor runtime files into torch._inductor.runtime (part 3) (#124557)"
This reverts commit fcf28b0ad5.

Reverted https://github.com/pytorch/pytorch/pull/124557 on behalf of https://github.com/jeanschmidt due to There are internal breakages, already discussed with author and he'll FF ([comment](https://github.com/pytorch/pytorch/pull/124552#issuecomment-2070548223))
2024-04-22 18:28:05 +00:00
PyTorch MergeBot
b3d6c2fe9b Revert "[inductor] Refactor runtime files into torch._inductor.runtime (part 4) (#124559)"
This reverts commit 9ea2a09510.

Reverted https://github.com/pytorch/pytorch/pull/124559 on behalf of https://github.com/jeanschmidt due to There are internal breakages, already discussed with author and he'll FF ([comment](https://github.com/pytorch/pytorch/pull/124552#issuecomment-2070548223))
2024-04-22 18:28:05 +00:00
PyTorch MergeBot
0f44ef93ab Revert "[inductor] Refactor runtime files into torch._inductor.runtime (part 5) (#124560)"
This reverts commit 3ac30bc32a.

Reverted https://github.com/pytorch/pytorch/pull/124560 on behalf of https://github.com/jeanschmidt due to There are internal breakages, already discussed with author and he'll FF ([comment](https://github.com/pytorch/pytorch/pull/124552#issuecomment-2070548223))
2024-04-22 18:28:05 +00:00
PyTorch MergeBot
8973c5b846 Revert "[inductor] Remove config check for 3D tiling (#124569)"
This reverts commit 317c0af149.

Reverted https://github.com/pytorch/pytorch/pull/124569 on behalf of https://github.com/jeanschmidt due to There are internal breakages, already discussed with author and he'll FF ([comment](https://github.com/pytorch/pytorch/pull/124552#issuecomment-2070548223))
2024-04-22 18:28:05 +00:00
PyTorch MergeBot
30dec1da84 Revert "[inductor] Use compile time config values in runtime (#124561)"
This reverts commit 3af12447f8.

Reverted https://github.com/pytorch/pytorch/pull/124561 on behalf of https://github.com/jeanschmidt due to There are internal breakages, already discussed with author and he'll FF ([comment](https://github.com/pytorch/pytorch/pull/124561#issuecomment-2070537634))
2024-04-22 18:24:38 +00:00
rzou
d77e7b7c54 Make some kernel static asserts clearer (#124519)
Users get int/int64_t and double/float confused a lot.

Test Plan:
- tested locally
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124519
Approved by: https://github.com/Skylion007
2024-04-22 18:17:40 +00:00
Isuru Fernando
c2f8bfae9c Make torch._inductor.dependencies.Dep a proper class (#124407)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124407
Approved by: https://github.com/peterbell10
2024-04-22 17:09:34 +00:00
Aleksei Nikiforov
77c35334c1 Fix build on s390x (#123250)
Rename s390x-specific zvector functions with same name.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123250
Approved by: https://github.com/malfet
2024-04-22 16:57:08 +00:00
Aleksei Nikiforov
be2e56b5ab s390x: update using vectorization builtins (#124396)
With gcc >= 12 on s390x store builtins
are accidentally optimized out due to
bad type aliasing.

Ensure that proper corresponding types are used,
and if types do mismatch,
first store data into array of correct type
and then memcpy it to destination pointer.

See also:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114676

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124396
Approved by: https://github.com/malfet
2024-04-22 16:55:18 +00:00
mengfeil
0ee514e628 [CI] Upgrade xpu driver to LTS_803.29 (#123920)
Upgrade xpu driver from 647.21 to LTS 803.29

Works for #114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123920
Approved by: https://github.com/chuanqi129, https://github.com/EikanWang, https://github.com/huydhn
2024-04-22 16:45:01 +00:00
Thiago Crepaldi
9c2ac4476c Allow ONNX models without parameters (#121904)
Currently, if initializers are available, they are included in the ONNX model. If they are not available, the model is serialized without them.

However, there are times when the initializers are available but the user prefers not to include them in the model, say for visualizing it in Netron or because the initializers will be specified along with the inputs in the ONNX runtime of choice.

This PR allows users to pass `include_initializers` to the `ONNXProgram.save()` API.

Fixes #100996
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121904
Approved by: https://github.com/titaiwangms
2024-04-22 15:53:38 +00:00
Jeff Daily
6ede882c0b preferred blas library; cublaslt gemm implementation (#122106)
Following the example of PyTorch supporting a preferred Linalg library (cusolver or magma), this PR introduces a preferred blas library selector of either cublas or cublaslt for CUDA and hipblas or hipblaslt for ROCm via normal hipification of sources.

The default blas implementation remains cublas or hipblas.  cublaslt or hipblaslt can be enabled using environment variable TORCH_BLAS_PREFER_CUBLASLT=1 (or TORCH_BLAS_PREFER_HIPBLASLT=1 as an alias) or by calling `torch.backends.cuda.preferred_blas_library(backend="cublaslt")` or as an alias `backend="hipblaslt"`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122106
Approved by: https://github.com/lezcano
2024-04-22 15:38:22 +00:00
Adnan Akhundov
9a322ba1b0 [user triton] Return unbacked SymInts used in the grid (#124594)
Summary: When unbacked SymInts are used only in a grid of a user-written Triton kernel call, there is no dependency between the Triton kernel's buffer and those unbacked SymInts. As a result, the definitions of the unbacked SymInts are not codegen-ed, and the code using them in the grid definition breaks.

Here we add the unbacked SymInts used in the grid to the `get_unbacked_symbol_uses` returned by the `UserDefinedTritonKernel` alongside those used in the `kwargs` (returned by `ExternKernel`).
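
The idea can be sketched with a toy model in which every expression is reduced to the set of symbol names it mentions. The classes below are hypothetical stand-ins that only borrow the method name from the description above; they are not the inductor IR classes.

```python
from typing import Dict, FrozenSet, Sequence, Set


class ToyExternKernel:
    def __init__(self, kwargs: Dict[str, FrozenSet[str]]):
        # Maps each kwarg name to the unbacked symbols its value mentions.
        self.kwargs = kwargs

    def get_unbacked_symbol_uses(self) -> Set[str]:
        uses: Set[str] = set()
        for symbols in self.kwargs.values():
            uses |= symbols
        return uses


class ToyUserDefinedTritonKernel(ToyExternKernel):
    def __init__(self, kwargs: Dict[str, FrozenSet[str]], grid: Sequence[FrozenSet[str]]):
        super().__init__(kwargs)
        # One entry per grid dimension: the symbols that dimension mentions.
        self.grid = grid

    def get_unbacked_symbol_uses(self) -> Set[str]:
        # Report grid symbols alongside the kwargs symbols so that their
        # definitions are treated as dependencies of this kernel call.
        uses = super().get_unbacked_symbol_uses()
        for symbols in self.grid:
            uses |= symbols
        return uses


kernel = ToyUserDefinedTritonKernel(
    kwargs={"x": frozenset({"u0"})},
    grid=[frozenset({"u1"}), frozenset()],
)
uses = kernel.get_unbacked_symbol_uses()
```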

Test Plan:

```
$ python test/inductor/test_aot_inductor.py -k test_triton_kernel_unbacked_symint
...
----------------------------------------------------------------------
Ran 24 tests in 155.764s

OK (skipped=16)
```

Differential Revision: [D56406991](https://our.internmc.facebook.com/intern/diff/D56406991)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124594
Approved by: https://github.com/oulgen
2024-04-22 15:33:30 +00:00