Commit Graph

25384 Commits

Author SHA1 Message Date
Kai Londenberg
a5ec45f2ec [Inductor Cutlass backend] Move tests to separate file (#121489)
Move Cutlass backend related tests to test/inductor/test_cutlass_backend.py - no changes to the tests themselves.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121489
Approved by: https://github.com/jansel
2024-03-12 21:59:48 +00:00
Tugsbayasgalan Manlaibaatar
5478a4e348 Don't run non-strict for test case that doesn't need non-strict (#121710)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121710
Approved by: https://github.com/BoyuanFeng
ghstack dependencies: #121652, #121678, #121687
2024-03-12 21:32:33 +00:00
Tugsbayasgalan Manlaibaatar
90e886aa6c Sanity check for non-strict (#121687)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121687
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #121652, #121678
2024-03-12 18:21:32 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar
443e241cc5 Don't cache predispatch kernels (#121712)
Summary: Title

Test Plan: CI

Differential Revision: D54791087

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121712
Approved by: https://github.com/ydwu4
2024-03-12 18:05:59 +00:00
Wanchao Liang
a26480a4d1 [dtensor] move early return check into redistribute autograd function (#121653)
This PR fixes a bug in redistribute by moving the early-return check into the
redistribute autograd function. Even when we redistribute to the same
placement, the grad_placements from the `to_local` call might differ, so the
redistribute backward still needs to happen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121653
Approved by: https://github.com/awgu
2024-03-12 17:37:30 +00:00
Animesh Jain
4e63d9065a [dynamo] Delete record replay tests as they are not maintained (#121705)
Fixes https://github.com/pytorch/pytorch/issues/115518

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121705
Approved by: https://github.com/mlazos
2024-03-12 17:16:34 +00:00
Animesh Jain
2348e8e4e7 [dynamo][guards-cpp-refactor] Simplify DYNAMIC_INDICES guard (#121614)
Use NO_HASATTR guard for the common part.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121614
Approved by: https://github.com/jansel
2024-03-12 17:08:56 +00:00
PyTorch MergeBot
0398dc9e8e Revert "[DCP] Makes fsspec public (#121508)"
This reverts commit d482614fec.

Reverted https://github.com/pytorch/pytorch/pull/121508 on behalf of https://github.com/osalpekar due to this causes torchrec tests to fail internally with this error: ModuleNotFoundError: No module named 'fsspec'. see [D54779117](https://www.internalfb.com/diff/D54779117) ([comment](https://github.com/pytorch/pytorch/pull/121508#issuecomment-1992137831))
2024-03-12 17:02:43 +00:00
Andrew Gu
85dc254364 [DTensor] Moved Transformer sharding to staticmethod (#121660)
To support FSDP + TP/SP unit tests, let us factor out the canonical TP/SP sharding of `Transformer` to a staticmethod that can be called by other unit tests.

Test Plan:
```
pytest test/distributed/tensor/parallel/test_tp_examples.py -k test_transformer_training
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121660
Approved by: https://github.com/wanchaol, https://github.com/yifuwang
ghstack dependencies: #121360, #121357
2024-03-12 15:08:57 +00:00
Howard Huang
2a99e6f299 Update error message (#121644)
Summary:
We don't want people to move to NCCL exp without explicit opt in. It seems that sparse allreduce was accidentally called and people were confused whether they should use NCCL exp instead.

Update the error message to explicitly say that sparse_allreduce is not supported.

Test Plan: sandcastle

Differential Revision: D54759307

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121644
Approved by: https://github.com/awgu
2024-03-12 13:04:21 +00:00
Adnan Akhundov
06d2392003 Support tt.reduce in Triton kernel analysis pass (#121706)
Summary: Previously, we bailed out of the Triton kernel analysis pass when seeing a `tt.reduce` op. In this PR, we support the op and don't bail out anymore.

Test Plan: This is a bit tricky, as the extension is added to the MLIR walk-based analysis code path, which is active only when the MLIR bindings added in https://github.com/openai/triton/pull/3191 are available. So for now I've run `test_argmax` and `test_reduce_sum` manually with a newer Triton version than the current pin. When the pin updates, we'll make those tests official (left a TODO comment).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121706
Approved by: https://github.com/jansel
2024-03-12 11:38:28 +00:00
Tugsbayasgalan Manlaibaatar
52ad2b682c Generate predispatch tests (#121678)
In this PR, we create another dynamic test class for TestExport tests that serializes/deserializes pre-dispatch IR. I encountered 4 additional failures, but 3 of them are due to a different operator showing up in the graph, and only one is a legitimate failure, which is tracked by another task internally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121678
Approved by: https://github.com/angelayi
ghstack dependencies: #121652
2024-03-12 08:34:50 +00:00
Dmitry Nikolaev
656134c38f [ROCm] enable complex128 in test_addmm_sizes_all_sparse_csr for rocm for trivial (k,n,m) cases (#120504)
This PR enables `test_addmm_sizes_all_sparse_csr_k_*_n_*_m_*_cuda_complex128` for ROCm for trivial cases  (m or n or k = 0)

CUSPARSE_SPMM_COMPLEX128_SUPPORTED is also used for `test_addmm_all_sparse_csr` and `test_sparse_matmul`, both of which are skipped for ROCm by `@skipIfRocm` or `@skipCUDAIf(not _check_cusparse_spgemm_available())`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120504
Approved by: https://github.com/jithunnair-amd, https://github.com/ezyang
2024-03-12 07:29:57 +00:00
lezcano
86a2d67bb9 Simplify guards using info from previous guards (#121463)
Let me see what CI thinks about this one. Will add tests tomorrow.

Fixes https://github.com/pytorch/pytorch/issues/119917
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121463
Approved by: https://github.com/ezyang
2024-03-12 04:22:20 +00:00
Shen Xu
159f30331f [quant][pt2e] Call sub-quantizers' transform_for_annotation in ComposableQuantizer (#121548)
Test Plan:
```
buck run caffe2/test:quantization_pt2e
```

Differential Revision: D54454707

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121548
Approved by: https://github.com/jerryzh168
2024-03-12 02:59:12 +00:00
Tugsbayasgalan Manlaibaatar
7fc497711d Also test predispatch serialization (#121652)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121652
Approved by: https://github.com/zhxchen17, https://github.com/angelayi
2024-03-12 02:37:59 +00:00
eellison
6ca9ae4f86 Express y grid > 2^16 in terms of z grid (#121554)
CUDA has a max y_grid of 65535. If we're computing larger than that we can compose it in terms of z grid, which is currently unused in inductor codegen.
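
The idea can be sketched in plain Python. This is a minimal illustration, not the Inductor codegen itself: the helper names `split_y_grid` and `linear_y_index` are hypothetical, and the sketch assumes the number of y blocks fits in 65535 * 65535 so that the z grid also stays within its limit.

```python
from typing import Tuple

MAX_Y_GRID = 65535  # CUDA limit on gridDim.y (gridDim.z has the same limit)


def ceildiv(a: int, b: int) -> int:
    """Integer division rounding up."""
    return -(-a // b)


def split_y_grid(y_blocks: int) -> Tuple[int, int]:
    """Hypothetical helper: return (y_grid, z_grid) with y_grid <= MAX_Y_GRID
    and y_grid * z_grid >= y_blocks."""
    if y_blocks <= MAX_Y_GRID:
        return y_blocks, 1
    z_grid = ceildiv(y_blocks, MAX_Y_GRID)
    y_grid = ceildiv(y_blocks, z_grid)
    return y_grid, z_grid


def linear_y_index(pid_y: int, pid_z: int, y_grid: int) -> int:
    """Recover the logical y block index from the (y, z) program ids."""
    return pid_z * y_grid + pid_y
```

Because `y_grid * z_grid` can exceed the requested block count, a kernel using this scheme would still need to mask out logical indices that are greater than or equal to `y_blocks`.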

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121554
Approved by: https://github.com/aakhundov
2024-03-12 02:36:19 +00:00
Jane Xu
fb1d7935bb [optim][BE] move complex_2d (last of complex tests) to OptimInfo (#120618)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120618
Approved by: https://github.com/albanD
2024-03-12 02:33:21 +00:00
Xinya Zhang
a37e22de70 Add Flash Attention support on ROCM (#121561)
This patch addresses the major limitations in our previous [PR #115981](https://github.com/pytorch/pytorch/pull/115981) through the new dedicated repository [AOTriton](https://github.com/ROCm/aotriton)

- [x] Only supports MI200 series GPU (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`).
    * MI300X is supported. More architectures will be added once Triton supports them.
- [x] Only supports power of two sequence lengths.
    * Now it supports arbitrary sequence lengths.
- [ ] No support for varlen APIs.
    * varlen API will be supported in the next release of AOTriton
- [x] Only support head dimension 16,32,64,128.
    * Now it supports arbitrary head dimensions <= 256
- [x] Performance is still being optimized.
    * Kernel is selected according to autotune information from Triton.

Other improvements from AOTriton include
* Allow more flexible Tensor storage layout
* More flexible API

This is a more extensive fix to #112997

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121561
Approved by: https://github.com/malfet, https://github.com/atalman
2024-03-12 01:16:53 +00:00
Elias Ellison
5b5d423c2e Benchmark templates (#118880)
Adding support for benchmarking templates in `benchmark_fusion`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118880
Approved by: https://github.com/shunting314
2024-03-11 23:55:13 +00:00
Mu-Chu Lee
7676433012 [AOTInductor] Reuse generated kernels between constant graph and main graph (#121564)
Summary: We copy `src_to_kernel` from the constant graph to the main graph so that we avoid generating duplicate kernels. We also pass through the name counter so that no duplicate names are generated.
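
A rough sketch of the mechanism, under stated assumptions: `KernelRegistry` and `define_kernel` are hypothetical stand-ins for the wrapper codegen state, and only the `src_to_kernel` mapping and the shared name counter come from the description above.

```python
import itertools


class KernelRegistry:
    """Hypothetical stand-in for per-graph kernel bookkeeping."""

    def __init__(self, src_to_kernel=None, name_counter=None):
        # Copy the mapping so the main graph starts with the constant graph's kernels.
        self.src_to_kernel = dict(src_to_kernel or {})
        # Share the counter object so names keep increasing across graphs.
        self.name_counter = name_counter if name_counter is not None else itertools.count()

    def define_kernel(self, src: str) -> str:
        # Reuse the existing kernel when the same source was already generated.
        if src in self.src_to_kernel:
            return self.src_to_kernel[src]
        name = f"kernel_{next(self.name_counter)}"
        self.src_to_kernel[src] = name
        return name


const_graph = KernelRegistry()
a = const_graph.define_kernel("x + 1")

# The main graph inherits both the mapping and the counter.
main_graph = KernelRegistry(const_graph.src_to_kernel, const_graph.name_counter)
b = main_graph.define_kernel("x + 1")  # same source: reused, no new kernel
c = main_graph.define_kernel("x * 2")  # new source: gets a fresh, non-clashing name
```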

Test Plan: Included in commit

Differential Revision: D54706767

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121564
Approved by: https://github.com/desertfire, https://github.com/chenyang78
2024-03-11 22:44:38 +00:00
Andrew Gu
272cf29e4d [FSDP2][BE] Refactored check_1d_sharded_parity to use mesh (#121357)
Eventually, we should just have one unified way to check for parity between a `DTensor`-sharded model and a replicated model. This PR is a small refactor to work toward that. One current gap to use this `check_sharded_parity` function for 2D is that FSDP's `(Shard(0), Shard(0))` layout differs from that of the `DTensor` APIs since FSDP shards on dim-0 after TP shards on dim-0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121357
Approved by: https://github.com/weifengpy
ghstack dependencies: #121360
2024-03-11 22:34:42 +00:00
PyTorch MergeBot
fd0dbcd891 Revert "Batch Norm Consolidation (#116092)"
This reverts commit 7b4f70eda5.

Reverted https://github.com/pytorch/pytorch/pull/116092 on behalf of https://github.com/osalpekar due to Causes build failure in //caffe2:aten-hip (AMD build) target. See [D54707318](https://www.internalfb.com/diff/D54707318) for more details, may require internal build system changes to resolve. ([comment](https://github.com/pytorch/pytorch/pull/116092#issuecomment-1989542965))
2024-03-11 22:22:41 +00:00
PyTorch MergeBot
b2f09c1859 Revert "[compiled autograd] support custom ops backed by c++ autograd::Function (#120681)"
This reverts commit d27509c384.

Reverted https://github.com/pytorch/pytorch/pull/120681 on behalf of https://github.com/xmfan due to breaking internal builds, see D54707287 ([comment](https://github.com/pytorch/pytorch/pull/120681#issuecomment-1989542344))
2024-03-11 22:18:36 +00:00
Sam Larsen
fd13a56f61 Refactor some testing helpers for FX graph cache testing (#121520)
Summary: I plan to enable the FX graph cache for more inductor unit tests. This PR does some refactoring to prepare by moving the `TestCase` base class to `torch._inductor.test_case` (which mirrors the existing `torch._dynamo.test_case`). In a subsequent diff, I'll modify tests importing `torch._dynamo.test_case.TestCase` to use `torch._inductor.test_case.TestCase` instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121520
Approved by: https://github.com/eellison
2024-03-11 21:46:27 +00:00
Andres Lugo-Reyes
e01b07e1e8 [ROCm] Autocast RNN Support (#121539)
Fixes #116361

Implements an Autocast wrapper for MIOpen RNNs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121539
Approved by: https://github.com/albanD, https://github.com/jeffdaily
2024-03-11 21:14:43 +00:00
Zhenghao Zhao
3461404869 [pt2 export]fix name collision on constant name (#121145)
Summary: Taking the right most part of the fqn will cause name conflict when having multiple instances of the same class. Changed to replace "." in fqn by "_" to avoid invalid syntax in input args.
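
The collision can be reproduced with a small sketch. The function names and the example FQNs below are hypothetical and only illustrate the two naming schemes described above.

```python
def rightmost_name(fqn: str) -> str:
    """Old scheme: keep only the part after the last dot."""
    return fqn.rsplit(".", 1)[-1]


def flattened_name(fqn: str) -> str:
    """New scheme: keep the whole path, replacing dots so it stays a valid identifier."""
    return fqn.replace(".", "_")


# Two instances of the same class, each holding a constant with the same attribute name.
fqns = ["layer1.weight_const", "layer2.weight_const"]

old_names = [rightmost_name(f) for f in fqns]  # both collapse to the same name
new_names = [flattened_name(f) for f in fqns]  # distinct names
```

Flattening is not collision-free in general (for example `a.b_c` and `a_b.c` map to the same string), so this should be read as a sketch of the common case rather than a complete uniqueness guarantee.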

Test Plan: added test case

Differential Revision: D54435230

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121145
Approved by: https://github.com/zhxchen17
2024-03-11 20:40:59 +00:00
Daniel Herrera
dccc1ca839 [torch] Use __prepare_scriptable__ for closures (#121553)
Summary:
This fixes a case left incomplete by https://github.com/pytorch/pytorch/pull/106229
The object is using __prepare_scriptable__ correctly inside of torch.jit.script()
but the closure that is obtained below is using the non-prepared version.
This causes issues when the prepared and non-prepared versions are in different python modules.

Test Plan:
```
buck2 run mode/opt caffe2/test:jit -- -r test_decorator
```

Differential Revision: D54308741

Re-exporting, as #120806 #121307 were not properly merged.

Co-authored-by: Daniel Herrera <dherrera@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121553
Approved by: https://github.com/huydhn, https://github.com/seemethere
2024-03-11 19:14:19 +00:00
Aidyn-A
39ed038f41 [TEST] Prepare test_cumulative_trapezoid for SciPy 1.12 (#121541)
Follow up on #119326 with addressed comment: https://github.com/pytorch/pytorch/pull/119326#issuecomment-1939428705:
> I'd like to propose a slightly different approach. We could check if scipy is version `1.12.0`. If so, overload `scipy_cumulative_trapezoid` with a function that specifically checks `t.shape[axis] == 0`, and in that case return an array of the same shape as `t`, which is the expected behavior as far as I understand. That way, we're not just skipping the test cases

I would like to add that the version check is not necessary as in any case the outcome is the same.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121541
Approved by: https://github.com/nWEIdia, https://github.com/albanD
2024-03-11 17:48:29 +00:00
Natalia Gimelshein
89add71168 fix synchronization behavior for copies with type change (#121341)
Fixes #121320

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121341
Approved by: https://github.com/albanD
2024-03-11 17:09:45 +00:00
Catherine Lee
fac06a12c8 CI sanity check test for env vars (#120519)
Make a test that fails on purpose to trigger retries.  Check the opposite of success (that env vars exist)

It's a bit hacky because I want it to fail on the normal flow in order to trigger reruns, but I don't want to expose the failures to users since that would be confusing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120519
Approved by: https://github.com/huydhn
2024-03-11 15:35:45 +00:00
Thiago Crepaldi
6c11d3ce0c Add support to save safetensors checkpoint directly into onnx (#121001)
Currently, when `torch.onnx.dynamo_export` is called within `torch.onnx.enable_fake_mode`, all the external pytorch checkpoint files used to initialize the model are automatically used by `torch.onnx.ONNXProgram.save` to recreate the initializers for the newly exported ONNX model.

This API extends the mechanism for HuggingFace models that use safetensors weights. This PR detects safetensors state files and converts them to PyTorch format using mmap on a temporary file, which is deleted after conversion is finished.

Without this PR, the user would have to convert the safetensors files to pytorch format manually and feed it to `torch.onnx.ONNXProgram.save` manually.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121001
Approved by: https://github.com/BowenBao, https://github.com/malfet
2024-03-11 15:21:59 +00:00
Xia Weiwen
d1510e01fa Upgrade submodule onednn to v3.3.5 (#120767)
This upgrade contains the fixes to the known issues brought by oneDNN v3.3.2, including issues https://github.com/pytorch/pytorch/issues/115346, https://github.com/pytorch/pytorch/issues/120211 and https://github.com/pytorch/pytorch/issues/120406 and those listed in PR #112700.

Issue https://github.com/pytorch/pytorch/issues/115346 (perf regression) was fixed by oneDNN v3.3.4. No new regression was found with v3.3.5. The detailed results of v3.3.4 are given below and compared with v3.1.1 (the oneDNN version in PyTorch before it was updated to v3.3.2).
1. A performance regression with 5.8% perf drop from `pytorch_stargan-train` (see https://github.com/pytorch/benchmark/issues/2076#issuecomment-1847545843)
Validation results with this patch: Latency increased by 0.60%
```
Tested on an Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz instance (IceLake)
oneDNN v3.1.1
metrics-1484287.json
{
    "name": "cpu",
    "environ": {
        "pytorch_git_version": "6c8c5ad5eaf47a62fafbb4a2747198cbffbf1ff0"
    },
    "metrics": {
        "latency": 418.851717
    }
}
oneDNN v3.3.4
{
    "name": "cpu",
    "environ": {
        "pytorch_git_version": "6c8c5ad5eaf47a62fafbb4a2747198cbffbf1ff0"
    },
    "metrics": {
        "latency": 421.381313
    }
}
```

2. Performance regression of FP32 rexnet_100 with Inductor, dynamic shape, multi-threads (see https://github.com/pytorch/pytorch/issues/115346#issue-2030859592)
Validation results with this patch: Latency reduced by 3.23%
```
Tested on an Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz instance (IceLake)
oneDNN v3.1.1
(inductor speedup over eager mode) 2.876x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,rexnet_100,128,2.875904,113.314765,18.455283,0.990437,1302.636134,1315.212902,351,1,0,0

oneDNN v3.3.4
(inductor speedup over eager mode) 3.003x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,rexnet_100,128,3.003012,109.653012,91.547260,0.990048,1302.532506,1315.625370,351,1,0,0
```

3. Performance regression of AMP hf_T5_generate and tinynet_a with Inductor, static shape, multi-threads (see https://github.com/pytorch/pytorch/issues/115346#issuecomment-1856029962)
Validation results with this patch: Latency reduced by 0.85%
```
Tested on an AWS spr metal instance
oneDNN v3.1.1
(inductor speedup over eager mode) 1.120x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,hf_T5_generate,1,1.120018,1197.807729,205.905466,0.442803,125.179904,282.698957,10550,48,8,4

oneDNN v3.3.4
(inductor speedup over eager mode) 1.134x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,hf_T5_generate,1,1.133594,1187.701514,205.855527,0.422012,128.405094,304.268493,10550,48,8,4
```

The following issues about functionality are fixed by this upgrade. Test cases are also added for these issues.
- https://github.com/pytorch/pytorch/issues/120211
- https://github.com/pytorch/pytorch/issues/120406
- https://github.com/pytorch/pytorch/issues/120547

-----

Below are detailed data of torchbench CPU userbenchmark test and Inductor FP32/AMP inference tests. No regression of perf or functionality was found.
I.  *torchbench CPU userbenchmark test*
Suite | Speedup
-- | --
eager_throughtput_bf16_infer | 1.001848
eager_throughtput_fp32_infer | 1.000257
eager_throughtput_fx_int8 | 1.003069
jit_llga_throughtput_amp_bf16 | 1.000682
jit_llga_throughtput_fp32 | 1.000313
eager_throughtput_bf16_train | 0.998222
eager_throughtput_fp32_train | 1.003384

II. *Inductor FP32/AMP inference tests*
i.  FP32 static default
suite | name | thread | batch size | Ratio Speedup(New/old)
-- | -- | -- | -- | --
torchbench | timm_efficientnet | multiple | 64 | 1.09
timm_models | tinynet_a | multiple | 128 | 1.14

ii.  FP32 dynamic default

suite | name | thread | batch size | Ratio Speedup(New/old)
-- | -- | -- | -- | --
torchbench | alexnet | multiple | 128 | 1.08
torchbench | basic_gnn_edgecnn | multiple | 1 | 0.98
torchbench | timm_efficientnet | multiple | 64 | 1.08

iii. AMP static default

suite | name | thread | batch size | Ratio Speedup(New/old)
-- | -- | -- | -- | --
torchbench | hf_distil_whisper | multiple | 1 | 1.18
torchbench | timm_efficientnet | multiple | 64 | 1.32
huggingface | BartForConditionalGeneration | multiple | 2 | 1.19
timm_models | eca_halonext26ts | multiple | 128 | 1.13
timm_models | nfnet_l0 | multiple | 128 | 1.13
timm_models | rexnet_100 | multiple | 128 | 1.45
timm_models | spnasnet_100 | multiple | 128 | 1.15
timm_models | tf_efficientnet_b0 | multiple | 128 | 1.22
timm_models | tinynet_a | multiple | 128 | 1.49
torchbench | hf_Bert_large | single | 1 | 1.16
huggingface | XLNetLMHeadModel | single | 1 | 1.07

iv.  AMP dynamic default

suite | name | thread | batch size | Ratio Speedup(New/old)
-- | -- | -- | -- | --
torchbench | timm_efficientnet | multiple | 64 | 1.32
huggingface | PLBartForConditionalGeneration | multiple | 4 | 1.14
timm_models | nfnet_l0 | multiple | 128 | 1.15
timm_models | rexnet_100 | multiple | 128 | 1.45
timm_models | tinynet_a | multiple | 128 | 1.34
huggingface | XLNetLMHeadModel | single | 1 | 1.09

-----

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120767
Approved by: https://github.com/chuanqi129, https://github.com/jgong5, https://github.com/atalman
2024-03-11 12:56:59 +00:00
Jason Ansel
7cc476ea16 [dynamo] Fix support for nn.Parameter constructor (part 1) (#120163)
This captures calls to `torch.nn.Parameter` by lifting them to graph inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120163
Approved by: https://github.com/albanD, https://github.com/yanboliang
ghstack dependencies: #121086
2024-03-11 05:14:42 +00:00
Jason Ansel
32488b0664 [dynamo] Support _unsafe_set_version_counter (#121086)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121086
Approved by: https://github.com/yanboliang
2024-03-11 05:14:42 +00:00
Ze Sheng
7a4e451184 [Dynamo] Fix function overrides (#120885)
To check for the existence of `__torch_function__`, the code intended to iterate over each element but got a `TupleVariable` when the ordinary `has_torch_function()` was being used. Further unpacking is needed in this case.

Fixes #120653

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120885
Approved by: https://github.com/yanboliang
2024-03-11 02:18:43 +00:00
wz337
60cd2a43ca [DeviceMesh] Add support for nD slicing (#119752)
Fixes one of the issues mentioned in #118639
@mvpatel2000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119752
Approved by: https://github.com/wanchaol
2024-03-10 00:16:37 +00:00
Peter Bell
168a04e752 [inductor] Changes to support newer triton pin (#121267)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121267
Approved by: https://github.com/lezcano
ghstack dependencies: #121438
2024-03-09 18:17:36 +00:00
Peter Bell
459c5bca58 [inductor] Refactor common triton imports into one function (#121438)
This means when codegen depends on a particular import we only need to
add it in one place and it's applied to all triton kernels.

This also changes codegen slightly so instead of generating
`@pointwise` we now generate `@triton_heuristics.pointwise` just so
the imports are the same for all kernel types.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121438
Approved by: https://github.com/lezcano
2024-03-09 18:17:36 +00:00
Yifu Wang
71d0202627 [dynamo] support rewriting dist.all_reduce with explicitly specified reduce op (#120181)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120181
Approved by: https://github.com/wconstab, https://github.com/awgu
2024-03-09 08:28:22 +00:00
PyTorch MergeBot
cf9742371c Revert "Add CUTLASS kernel as choice for _int_mm() Inductor autotuning (#119685)"
This reverts commit 752d164b2f.

Reverted https://github.com/pytorch/pytorch/pull/119685 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is crashing on ROCm 752d164b2f ([comment](https://github.com/pytorch/pytorch/pull/119685#issuecomment-1986773384))
2024-03-09 07:20:53 +00:00
Wanchao Liang
242e03ba86 [dtensor] add async_op option to redistribute and some refactor (#121477)
async output option was only available in `full_tensor()` call, but I think it's
generally good to make this option available in the `redistribute` call directly
so that user can control it

This PR adds async_op option to redistribute call, to allow user control
whether to perform tensor redistribution asynchronously or not.

By default we set this to False, this is to follow the semantics of the c10d
collectives.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121477
Approved by: https://github.com/wz337
2024-03-09 06:17:23 +00:00
Jerry Zhang
a6a67da333 [quant] Add error check for input_edge annotation (#121536)
Summary:
Raises error when an input edge contains non-Node elements like constant values etc in annotation.

Test Plan:
python test/test_quantization.py -k test_input_edge_sanity_check

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121536
Approved by: https://github.com/andrewor14
2024-03-09 06:13:04 +00:00
angelayi
e8836759d0 [export] Add effect token to export (#121424)
Following the creation of effect tokens (https://github.com/pytorch/pytorch/pull/120296), we want to now add support for these tokens in export because the calling/returning convention has changed. The inputs are now `(tokens, params, buffers, constants, user_inputs)` and the outputs are `(tokens, buffer_mutations, user_mutations, user_outputs)`. The graph looks something like:
```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %attr : [num_users=2] = placeholder[target=attr]
    %arg1_1 : [num_users=2] = placeholder[target=arg1_1]
    %with_effects : [num_users=2] = call_function[target=torch._higher_order_ops.effects.with_effects](args = (%arg0_1, _TorchScriptTesting.takes_foo.default, %attr, %arg1_1), kwargs = {})
    %getitem : [num_users=1] = call_function[target=operator.getitem](args = (%with_effects, 0), kwargs = {})
    %getitem_1 : [num_users=1] = call_function[target=operator.getitem](args = (%with_effects, 1), kwargs = {})
    %with_effects_1 : [num_users=2] = call_function[target=torch._higher_order_ops.effects.with_effects](args = (%getitem, _TorchScriptTesting.takes_foo.default, %attr, %getitem_1), kwargs = {})
    %getitem_2 : [num_users=1] = call_function[target=operator.getitem](args = (%with_effects_1, 0), kwargs = {})
    %getitem_3 : [num_users=1] = call_function[target=operator.getitem](args = (%with_effects_1, 1), kwargs = {})
    %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%arg1_1, %getitem_3), kwargs = {})
    return (getitem_2, add)
```
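
The token threading in this graph can be mimicked in plain Python. This is a toy model only: `Token`, `with_effects`, and `takes_foo` below are hypothetical stand-ins that mirror the data flow of the graph above, not the real `torch._higher_order_ops.effects.with_effects` operator.

```python
class Token:
    """Opaque ordering marker; the sequence number exists only for illustration."""

    def __init__(self, seq: int = 0):
        self.seq = seq


def with_effects(token, op, *args):
    """Run a side-effecting op and return (new_token, result).

    Consuming and producing a token creates a data dependency that fixes
    the order of the effectful calls.
    """
    result = op(*args)
    return Token(token.seq + 1), result


def takes_foo(attr, x):
    """Stand-in for a custom op with a side effect on a script object."""
    attr.append(x)
    return x + len(attr)


def lifted_graph(token, attr, x):
    """Mirrors the graph: tokens come first in both inputs and outputs."""
    token, y = with_effects(token, takes_foo, attr, x)
    token, z = with_effects(token, takes_foo, attr, y)
    return token, x + z


state = []
out_token, out = lifted_graph(Token(), state, 1)
```

The second call cannot be reordered before the first, because it needs the token that the first call returns.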

During unlifting, we will first remove the tokens and with_effect calls using the `remove_effect_tokens` pass. (cc @SherlockNoMad on the pass to remove tokens). This is so that this won't change the calling conventions when retracing. The graph after unlifting looks something like:
```
graph():
    %attr_1 : [num_users=2] = get_attr[target=attr]
    %arg1_1 : [num_users=2] = placeholder[target=arg1_1]
    %takes_foo_default_1 : [num_users=1] = call_function[target=torch.ops._TorchScriptTesting.takes_foo.default](args = (%attr_1, %arg1_1), kwargs = {})
    %takes_foo_default : [num_users=1] = call_function[target=torch.ops._TorchScriptTesting.takes_foo.default](args = (%attr_1, %takes_foo_default_1), kwargs = {})
    %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%arg1_1, %takes_foo_default), kwargs = {})
    return (add,)
```

Serialization support will be added in a followup.
Note: tokens only affect custom ops that take in ScriptObjects, not ScriptObject methods yet.

Differential Revision: [D54639390](https://our.internmc.facebook.com/intern/diff/D54639390)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121424
Approved by: https://github.com/tugsbayasgalan
2024-03-09 02:43:26 +00:00
Aidyn-A
eb3919944d [C10d][NCCL] Refactor complex all_reduce and broadcast (#121045)
The necessity of this PR lies in the fact that autograd engine + DDP calls `all_reduce` from C++, so the changes must be made in C++.

```
[rank0]: Traceback (most recent call last):
[rank0]:   File "~/complex_ddp.py", line 72, in <module>
[rank0]:     main()
[rank0]:   File "~/complex_ddp.py", line 64, in main
[rank0]:     loss.backward()
[rank0]:   File "/home/usr/pytorch/torch/_tensor.py", line 525, in backward
[rank0]:     torch.autograd.backward(
[rank0]:   File "/home/usr/pytorch/torch/autograd/__init__.py", line 267, in backward
[rank0]:     _engine_run_backward(
[rank0]:   File "/home/usr/pytorch/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank0]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank0]: TypeError: Input tensor data type is not supported for NCCL process group: ComplexFloat
```

I believe, for minimizing the Python overhead, the same could be done for the rest of the ops, what do you think @kwen2501?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121045
Approved by: https://github.com/eqy, https://github.com/kwen2501
2024-03-09 02:00:54 +00:00
Aleksandar Samardžić
752d164b2f Add CUTLASS kernel as choice for _int_mm() Inductor autotuning (#119685)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119685
Approved by: https://github.com/cpuhrsch
2024-03-09 02:00:50 +00:00
Colin Peppler
13a25c647f [export] improve binary op fast path broadcast check (#121546)
# Context
I believe we have an incorrect guard being created during FakeTensor's binary op fast path.

Consider this case
```
# op.shape: (10, 192); final_shape: (s0, 10, 192)
# Guard Ne(s0, 10) is created when we create SymBool(10 == s0)
if isinstance(op, torch.Tensor) and op.shape == final_shape:
    break
```

As of right now, `op.shape == final_shape` checks whether one of the binary op's operands has the same shape as the binary op's output.
* If one of them is a dynamic shape, then we'll create a guard via `SymBool` creation (i.e. `s0 == 10`).
* If the `SymBool` expr resolves to `false`, then we'll create the guard `Ne(s0, 10)`.

This is a problem when the number of dimensions differs between `op.shape` and `final_shape`. Take the case above, `op.shape: (10, 192); final_shape: (s0, 10, 192)`: although the shapes aren't the same, that doesn't necessarily mean that `s0 != 10`.
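
A toy model of the problem, assuming nothing about the real FakeTensor code: `SymDim` is a hypothetical symbolic dimension that records a guard whenever it is compared, and the two functions contrast an element-first comparison with one that checks the number of dimensions first.

```python
class SymDim:
    """Toy symbolic dimension: comparing it to an int records a guard."""

    def __init__(self, name, hint, guards):
        self.name, self.hint, self.guards = name, hint, guards

    def __eq__(self, other):
        result = self.hint == other
        self.guards.append(f"{'Eq' if result else 'Ne'}({self.name}, {other})")
        return result


def naive_same_shape(op_shape, final_shape):
    # Compares elements pairwise before looking at the rank, so
    # (10, 192) vs (s0, 10, 192) ends up comparing 10 with s0.
    same = all(a == b for a, b in zip(op_shape, final_shape))
    return same and len(op_shape) == len(final_shape)


def rank_checked_same_shape(op_shape, final_shape):
    # Different ranks can never be equal, so bail out before touching any symbol.
    if len(op_shape) != len(final_shape):
        return False
    return all(a == b for a, b in zip(op_shape, final_shape))


guards = []
s0 = SymDim("s0", 4, guards)

naive_result = naive_same_shape((10, 192), (s0, 10, 192))
naive_guards = list(guards)  # a spurious guard on s0 was recorded

guards.clear()
checked_result = rank_checked_same_shape((10, 192), (s0, 10, 192))
checked_guards = list(guards)  # nothing recorded
```

Both functions report that the shapes differ, but only the naive one leaves behind a guard that constrains `s0`.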

Some thoughts (feel free to ignore): what if the number of dimensions is equal but one of the shapes has symbols? Here are three cases:
  1. `op.shape: (9000, 10, 192); final_shape: (s0, 10, 192)` -- not broadcastable.
  2. `op.shape: (1, 10, 192); final_shape: (s0, 10, 192)` -- 0/1 specialization wins?
  3. `op.shape: (100, 10, 192); final_shape: (s0, 10, 192) where s0 = 100` -- Ask user to mark `s0` as a constant.

# Test
```
$ TORCHDYNAMO_VERBOSE=1 PYTORCH_TEST_WITH_DYNAMO=1 pytest -s test/dynamo/test_dynamic_shapes.py -k test_export_fast_binary_broadcast_check_dynamic_shapes

torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (dim0)! For more information, run with TORCH_LOGS="+dynamic".
  - Not all values of dim0 = L['a'].size()[0] in the specified range 3 <= dim0 <= 1024 satisfy the generated guard Ne(L['a'].size()[0], 3).
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121546
Approved by: https://github.com/aakhundov
2024-03-09 01:49:42 +00:00
Lucas Pasqualin
d482614fec [DCP] Makes fsspec public (#121508)
Fixes #118033

Also removes `_checkpointer.py` class
original PR's:
- https://github.com/pytorch/pytorch/pull/121330
- https://github.com/pytorch/pytorch/pull/121329

We're also disabling `test_fsdp` since it is failing on random PR's

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121508
Approved by: https://github.com/fegin
2024-03-09 01:14:18 +00:00
albanD
6791b0c09e Change default torch_function behavior to be disabled when torch_dispatch is defined (take 2) (#120632)
This does not introduce a new test but is tested by checking that all the classes we already have still behave as before now that they don't explicitly disable torch_function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120632
Approved by: https://github.com/ezyang
2024-03-09 01:08:37 +00:00
Aidyn-A
ca9678405a [CUDA graphs] Pool argument for make_graphed_callables (#121475)
This is a convenient feature for situations where users want multiple graph captures and/or graphed callables to share the same memory pool.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121475
Approved by: https://github.com/eellison, https://github.com/eqy
2024-03-09 00:15:38 +00:00