pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Kai Londenberg	a5ec45f2ec	[Inductor Cutlass backend] Move tests to separate file (#121489 ) Move Cutlass backend related tests to test/inductor/test_cutlass_backend.py - no changes to the tests themselves. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121489 Approved by: https://github.com/jansel	2024-03-12 21:59:48 +00:00
Tugsbayasgalan Manlaibaatar	5478a4e348	Don't run non-strict for test case that doesn't need non-strict (#121710 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121710 Approved by: https://github.com/BoyuanFeng ghstack dependencies: #121652, #121678, #121687	2024-03-12 21:32:33 +00:00
Tugsbayasgalan Manlaibaatar	90e886aa6c	Sanity check for non-strict (#121687 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121687 Approved by: https://github.com/avikchaudhuri ghstack dependencies: #121652, #121678	2024-03-12 18:21:32 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar	443e241cc5	Don't cache predispatch kernels (#121712 ) Summary: Title Test Plan: CI Differential Revision: D54791087 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121712 Approved by: https://github.com/ydwu4	2024-03-12 18:05:59 +00:00
Wanchao Liang	a26480a4d1	[dtensor] move early return check into redistribute autograd function (#121653 ) This PR fixes a redistribute bug by moving the early return check into the redistribute autograd function: even when we redistribute to the same placement, the grad_placements from the `to_local` call might be different, so the redistribute backward still needs to happen. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121653 Approved by: https://github.com/awgu	2024-03-12 17:37:30 +00:00
Animesh Jain	4e63d9065a	[dynamo] Delete record replay tests as they are not maintained (#121705 ) Fixes https://github.com/pytorch/pytorch/issues/115518 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121705 Approved by: https://github.com/mlazos	2024-03-12 17:16:34 +00:00
Animesh Jain	2348e8e4e7	[dynamo][guards-cpp-refactor] Simplify DYNAMIC_INDICES guard (#121614 ) Use NO_HASATTR guard for the common part. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121614 Approved by: https://github.com/jansel	2024-03-12 17:08:56 +00:00
PyTorch MergeBot	0398dc9e8e	Revert "[DCP] Makes fsspec public (#121508 )" This reverts commit `d482614fec`. Reverted https://github.com/pytorch/pytorch/pull/121508 on behalf of https://github.com/osalpekar due to this causes torchrec tests to fail internally with this error: ModuleNotFoundError: No module named 'fsspec'. see [D54779117](https://www.internalfb.com/diff/D54779117) ([comment](https://github.com/pytorch/pytorch/pull/121508#issuecomment-1992137831))	2024-03-12 17:02:43 +00:00
Andrew Gu	85dc254364	[DTensor] Moved `Transformer` sharding to staticmethod (#121660 ) To support FSDP + TP/SP unit tests, let us factor out the canonical TP/SP sharding of `Transformer` to a staticmethod that can be called by other unit tests. Test Plan: ``` pytest test/distributed/tensor/parallel/test_tp_examples.py -k test_transformer_training ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121660 Approved by: https://github.com/wanchaol, https://github.com/yifuwang ghstack dependencies: #121360, #121357	2024-03-12 15:08:57 +00:00
Howard Huang	2a99e6f299	Update error message (#121644 ) Summary: We don't want people to move to NCCL exp without explicit opt in. It seems that sparse allreduce was accidentally called and people were confused whether they should use NCCL exp instead. Update the error message to explicitly say that sparse_allreduce is not supported. Test Plan: sandcastle Differential Revision: D54759307 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121644 Approved by: https://github.com/awgu	2024-03-12 13:04:21 +00:00
Adnan Akhundov	06d2392003	Support tt.reduce in Triton kernel analysis pass (#121706 ) Summary: Previously, we bailed out of the Triton kernel analysis pass when seeing a `tt.reduce` op. In this PR, we support the op and don't bail out anymore. Test Plan: This is a bit tricky, as the extension is added to the MLIR walk-based analysis code path, which is active only when the MLIR bindings added in https://github.com/openai/triton/pull/3191 are available. So for now I've run `test_argmax` and `test_reduce_sum` manually with a newer Triton version than the current pin. When the pin updates, we'll make those tests official (left a TODO comment). Pull Request resolved: https://github.com/pytorch/pytorch/pull/121706 Approved by: https://github.com/jansel	2024-03-12 11:38:28 +00:00
Tugsbayasgalan Manlaibaatar	52ad2b682c	Generate predispatch tests (#121678 ) In this PR, we create another dynamic test class for TestExport tests that serializes/deserializes pre-dispatch IR. I encountered 4 additional failures, but 3 of them are due to a different operator showing up in the graph, and only one is a legitimate failure, which is tracked by another task internally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121678 Approved by: https://github.com/angelayi ghstack dependencies: #121652	2024-03-12 08:34:50 +00:00
Dmitry Nikolaev	656134c38f	[ROCm] enable complex128 in test_addmm_sizes_all_sparse_csr for rocm for trivial (k,n,m) cases (#120504 ) This PR enables `test_addmm_sizes_all_sparse_csr_k__n__m_*_cuda_complex128` for ROCm for trivial cases (m or n or k = 0) CUSPARSE_SPMM_COMPLEX128_SUPPORTED also used for `test_addmm_all_sparse_csr` and ` test_sparse_matmul` and both of them are skipped for ROCm by `@skipIfRocm` or `@skipCUDAIf(not _check_cusparse_spgemm_available())` Pull Request resolved: https://github.com/pytorch/pytorch/pull/120504 Approved by: https://github.com/jithunnair-amd, https://github.com/ezyang	2024-03-12 07:29:57 +00:00
lezcano	86a2d67bb9	Simplify guards using info from previous guards (#121463 ) Let me see what CI thinks about this one. Will add tests tomorrow. Fixes https://github.com/pytorch/pytorch/issues/119917 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121463 Approved by: https://github.com/ezyang	2024-03-12 04:22:20 +00:00
Shen Xu	159f30331f	[quant][pt2e] Call sub-quantizers' transform_for_annotation in ComposableQuantizer (#121548 ) Test Plan: ``` buck run caffe2/test:quantization_pt2e ``` Differential Revision: D54454707 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121548 Approved by: https://github.com/jerryzh168	2024-03-12 02:59:12 +00:00
Tugsbayasgalan Manlaibaatar	7fc497711d	Also test predispatch serialization (#121652 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121652 Approved by: https://github.com/zhxchen17, https://github.com/angelayi	2024-03-12 02:37:59 +00:00
eellison	6ca9ae4f86	Express y grid > 2^16 in terms of z grid (#121554 ) CUDA has a max y_grid of 65535. If we're computing larger than that we can compose it in terms of z grid, which is currently unused in inductor codegen. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121554 Approved by: https://github.com/aakhundov	2024-03-12 02:36:19 +00:00
Jane Xu	fb1d7935bb	[optim][BE] move complex_2d (last of complex tests) to OptimInfo (#120618 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120618 Approved by: https://github.com/albanD	2024-03-12 02:33:21 +00:00
Xinya Zhang	a37e22de70	Add Flash Attention support on ROCM (#121561 ) This patch addresses the major limitations in our previous [PR #115981](https://github.com/pytorch/pytorch/pull/115981) through the new dedicated repository [AOTriton](https://github.com/ROCm/aotriton) - [x] Only supports MI200 series GPU (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`). * MI300X is supported. More architectures will be added once Triton supports them. - [x] Only supports power of two sequence lengths. * Now it supports arbitrary sequence lengths. - [ ] No support for varlen APIs. * The varlen API will be supported in the next release of AOTriton. - [x] Only supports head dimensions 16, 32, 64, 128. * Now it supports arbitrary head dimensions <= 256. - [x] Performance is still being optimized. * Kernel is selected according to autotune information from Triton. Other improvements from AOTriton include * Allow more flexible Tensor storage layout * More flexible API This is a more extensive fix to #112997 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121561 Approved by: https://github.com/malfet, https://github.com/atalman	2024-03-12 01:16:53 +00:00
Elias Ellison	5b5d423c2e	Benchmark templates (#118880 ) Adding support for benchmarking templates in `benchmark_fusion` Pull Request resolved: https://github.com/pytorch/pytorch/pull/118880 Approved by: https://github.com/shunting314	2024-03-11 23:55:13 +00:00
Mu-Chu Lee	7676433012	[AOTInductor] Reuse generated kernels between constant graph and main graph (#121564 ) Summary: We copy the src_to_kernel from the constant graph to the main graph so that we can avoid generating duplicate kernels, and pass through the name counter so that no duplicate names are generated. Test Plan: Included in commit Differential Revision: D54706767 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121564 Approved by: https://github.com/desertfire, https://github.com/chenyang78	2024-03-11 22:44:38 +00:00
Andrew Gu	272cf29e4d	[FSDP2][BE] Refactored `check_1d_sharded_parity` to use mesh (#121357 ) Eventually, we should just have one unified way to check for parity between a `DTensor`-sharded model and a replicated model. This PR is a small refactor to work toward that. One current gap to use this `check_sharded_parity` function for 2D is that FSDP's `(Shard(0), Shard(0))` layout differs from that of the `DTensor` APIs since FSDP shards on dim-0 after TP shards on dim-0. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121357 Approved by: https://github.com/weifengpy ghstack dependencies: #121360	2024-03-11 22:34:42 +00:00
PyTorch MergeBot	fd0dbcd891	Revert "Batch Norm Consolidation (#116092 )" This reverts commit `7b4f70eda5`. Reverted https://github.com/pytorch/pytorch/pull/116092 on behalf of https://github.com/osalpekar due to Causes build failure in //caffe2:aten-hip (AMD build) target. See [D54707318](https://www.internalfb.com/diff/D54707318) for more details, may require internal build system changes to resolve. ([comment](https://github.com/pytorch/pytorch/pull/116092#issuecomment-1989542965))	2024-03-11 22:22:41 +00:00
PyTorch MergeBot	b2f09c1859	Revert "[compiled autograd] support custom ops backed by c++ autograd::Function (#120681 )" This reverts commit `d27509c384`. Reverted https://github.com/pytorch/pytorch/pull/120681 on behalf of https://github.com/xmfan due to breaking internal builds, see D54707287 ([comment](https://github.com/pytorch/pytorch/pull/120681#issuecomment-1989542344))	2024-03-11 22:18:36 +00:00
Sam Larsen	fd13a56f61	Refactor some testing helpers for FX graph cache testing (#121520 ) Summary: I plan to enable the FX graph cache for more inductor unit tests. This PR does some refactoring to prepare by moving the `TestCase` base class to `torch._inductor.test_case` (which mirrors the existing `torch._dynamo.test_case`). In a subsequent diff, I'll modify tests importing `torch._dynamo.test_case.TestCase` to use `torch._inductor.test_case.TestCase` instead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121520 Approved by: https://github.com/eellison	2024-03-11 21:46:27 +00:00
Andres Lugo-Reyes	e01b07e1e8	[ROCm] Autocast RNN Support (#121539 ) Fixes #116361 Implements Autocast wrapper for miopen rnn's Pull Request resolved: https://github.com/pytorch/pytorch/pull/121539 Approved by: https://github.com/albanD, https://github.com/jeffdaily	2024-03-11 21:14:43 +00:00
Zhenghao Zhao	3461404869	[pt2 export]fix name collision on constant name (#121145 ) Summary: Taking the right most part of the fqn will cause name conflict when having multiple instances of the same class. Changed to replace "." in fqn by "_" to avoid invalid syntax in input args. Test Plan: added test case Differential Revision: D54435230 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121145 Approved by: https://github.com/zhxchen17	2024-03-11 20:40:59 +00:00
Daniel Herrera	dccc1ca839	[torch] Use __prepare_scriptable__ for closures (#121553 ) Summary: This fixes a case left incomplete by https://github.com/pytorch/pytorch/pull/106229 The object is using __prepare_scriptable__ correctly inside of torch.jit.script(), but the closure that is obtained below is using the non-prepared version. This causes issues when the prepared and non-prepared versions are in different Python modules. Test Plan: ``` buck2 run mode/opt caffe2/test:jit -- -r test_decorator ``` Differential Revision: D54308741 Re-exporting, as #120806 #121307 were not properly merged. Co-authored-by: Daniel Herrera <dherrera@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/121553 Approved by: https://github.com/huydhn, https://github.com/seemethere	2024-03-11 19:14:19 +00:00
Aidyn-A	39ed038f41	[TEST] Prepare test_cumulative_trapezoid for SciPy 1.12 (#121541 ) Follow up on #119326 with addressed comment: https://github.com/pytorch/pytorch/pull/119326#issuecomment-1939428705: > I'd like to propose a slightly different approach. We could check if scipy is version `1.12.0`. If so, overload `scipy_cumulative_trapezoid` with a function that specifically checks `t.shape[axis] == 0`, and in that case return an array of the same shape as `t`, which is the expected behavior as far as I understand. That way, we're not just skipping the test cases I would like to add that the version check is not necessary as in any case the outcome is the same. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121541 Approved by: https://github.com/nWEIdia, https://github.com/albanD	2024-03-11 17:48:29 +00:00
Natalia Gimelshein	89add71168	fix synchronization behavior for copies with type change (#121341 ) Fixes #121320 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121341 Approved by: https://github.com/albanD	2024-03-11 17:09:45 +00:00
Catherine Lee	fac06a12c8	CI sanity check test for env vars (#120519 ) Make a test that fails on purpose to trigger retries. Check the opposite of success (that env vars exist). It's a bit hacky because I want it to fail on the normal flow in order to trigger reruns, but I don't want to expose the failures to users since it's confusing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120519 Approved by: https://github.com/huydhn	2024-03-11 15:35:45 +00:00
Thiago Crepaldi	6c11d3ce0c	Add support to save safetensors checkpoint directly into onnx (#121001 ) Currently, when `torch.onnx.dynamo_export` is called within `torch.onnx.enable_fake_mode`, all the external PyTorch checkpoint files used to initialize the model are automatically used by `torch.onnx.ONNXProgram.save` to recreate the initializers for the newly exported ONNX model. This API extends the mechanism to HuggingFace models that use safetensors weights. This PR detects safetensors state files and converts them to PyTorch format using mmap on a temporary file, which is deleted after conversion is finished. Without this PR, the user would have to convert the safetensors files to PyTorch format manually and feed them to `torch.onnx.ONNXProgram.save`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121001 Approved by: https://github.com/BowenBao, https://github.com/malfet	2024-03-11 15:21:59 +00:00
Xia Weiwen	d1510e01fa	Upgrade submodule onednn to v3.3.5 (#120767 ) This upgrade contains the fixes to the known issues brought by oneDNN v3.3.2, including issues https://github.com/pytorch/pytorch/issues/115346, https://github.com/pytorch/pytorch/issues/120211 and https://github.com/pytorch/pytorch/issues/120406 and those listed in PR #112700. Issue https://github.com/pytorch/pytorch/issues/115346 (perf regression) was fixed by oneDNN v3.3.4. No new regression was found with v3.3.5. The detailed results of v3.3.4 are given below and compared with v3.1.1 (the oneDNN version in PyTorch before it was updated to v3.3.2). 1. A performance regression with 5.8% perf drop from `pytorch_stargan-train` (see https://github.com/pytorch/benchmark/issues/2076#issuecomment-1847545843) Validation results with this patch: Latency increased by 0.60% ``` Tested on an Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz instance (IceLake) oneDNN v3.1.1 metrics-1484287.json { "name": "cpu", "environ": { "pytorch_git_version": "6c8c5ad5eaf47a62fafbb4a2747198cbffbf1ff0" }, "metrics": { "latency": 418.851717 } } oneDNN v3.3.4 { "name": "cpu", "environ": { "pytorch_git_version": "6c8c5ad5eaf47a62fafbb4a2747198cbffbf1ff0" }, "metrics": { "latency": 421.381313 } } ``` 2. 
Performance regression of FP32 rexnet_100 with Inductor, dynamic shape, multi-threads (see https://github.com/pytorch/pytorch/issues/115346#issue-2030859592) Validation results with this patch: Latency reduced by 3.23% ``` Tested on an Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz instance (IceLake) oneDNN v3.1.1 (inductor speedup over eager mode) 2.876x dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks cpu,rexnet_100,128,2.875904,113.314765,18.455283,0.990437,1302.636134,1315.212902,351,1,0,0 oneDNN v3.3.4 (inductor speedup over eager mode) 3.003x dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks cpu,rexnet_100,128,3.003012,109.653012,91.547260,0.990048,1302.532506,1315.625370,351,1,0,0 ``` 3. Performance regression of AMP hf_T5_generate and tinynet_a with Inductor, static shape, multi-threads (see https://github.com/pytorch/pytorch/issues/115346#issuecomment-1856029962) Validation results with this patch: Latency reduced by 0.85% ``` Tested on an AWS spr metal instance oneDNN v3.1.1 (inductor speedup over eager mode) 1.120x dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks cpu,hf_T5_generate,1,1.120018,1197.807729,205.905466,0.442803,125.179904,282.698957,10550,48,8,4 oneDNN v3.3.4 (inductor speedup over eager mode) 1.134x dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks cpu,hf_T5_generate,1,1.133594,1187.701514,205.855527,0.422012,128.405094,304.268493,10550,48,8,4 ``` The following issues about functionality are fixed by this upgrade. Test cases are also added for these issues. 
- https://github.com/pytorch/pytorch/issues/120211 - https://github.com/pytorch/pytorch/issues/120406 - https://github.com/pytorch/pytorch/issues/120547 ----- Below are detailed data of torchbench CPU userbenchmark test and Inductor FP32/AMP inference tests. No regression of perf or functionality was found. I. torchbench CPU userbenchmark test Suite \| Speedup -- \| -- eager_throughtput_bf16_infer \| 1.001848 eager_throughtput_fp32_infer \| 1.000257 eager_throughtput_fx_int8 \| 1.003069 jit_llga_throughtput_amp_bf16 \| 1.000682 jit_llga_throughtput_fp32 \| 1.000313 eager_throughtput_bf16_train \| 0.998222 eager_throughtput_fp32_train \| 1.003384 II. Inductor FP32/AMP inference tests i. FP32 static default suite \| name \| thread \| batch size \| Ratio Speedup(New/old) -- \| -- \| -- \| -- \| -- torchbench \| timm_efficientnet \| multiple \| 64 \| 1.09 timm_models \| tinynet_a \| multiple \| 128 \| 1.14 ii. FP32 dynamic default suite \| name \| thread \| batch size \| Ratio Speedup(New/old) -- \| -- \| -- \| -- \| -- torchbench \| alexnet \| multiple \| 128 \| 1.08 torchbench \| basic_gnn_edgecnn \| multiple \| 1 \| 0.98 torchbench \| timm_efficientnet \| multiple \| 64 \| 1.08 iii. AMP static default suite \| name \| thread \| batch size \| Ratio Speedup(New/old) -- \| -- \| -- \| -- \| -- torchbench \| hf_distil_whisper \| multiple \| 1 \| 1.18 torchbench \| timm_efficientnet \| multiple \| 64 \| 1.32 huggingface \| BartForConditionalGeneration \| multiple \| 2 \| 1.19 timm_models \| eca_halonext26ts \| multiple \| 128 \| 1.13 timm_models \| nfnet_l0 \| multiple \| 128 \| 1.13 timm_models \| rexnet_100 \| multiple \| 128 \| 1.45 timm_models \| spnasnet_100 \| multiple \| 128 \| 1.15 timm_models \| tf_efficientnet_b0 \| multiple \| 128 \| 1.22 timm_models \| tinynet_a \| multiple \| 128 \| 1.49 torchbench \| hf_Bert_large \| single \| 1 \| 1.16 huggingface \| XLNetLMHeadModel \| single \| 1 \| 1.07 iv. 
AMP dynamic default suite \| name \| thread \| batch size \| Ratio Speedup(New/old) -- \| -- \| -- \| -- \| -- torchbench \| timm_efficientnet \| multiple \| 64 \| 1.32 huggingface \| PLBartForConditionalGeneration \| multiple \| 4 \| 1.14 timm_models \| nfnet_l0 \| multiple \| 128 \| 1.15 timm_models \| rexnet_100 \| multiple \| 128 \| 1.45 timm_models \| tinynet_a \| multiple \| 128 \| 1.34 huggingface \| XLNetLMHeadModel \| single \| 1 \| 1.09 ----- Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/120767 Approved by: https://github.com/chuanqi129, https://github.com/jgong5, https://github.com/atalman	2024-03-11 12:56:59 +00:00
Jason Ansel	7cc476ea16	[dynamo] Fix support for nn.Parameter constructor (part 1) (#120163 ) This captures calls to `torch.nn.Parameter` by lifting them to graph inputs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120163 Approved by: https://github.com/albanD, https://github.com/yanboliang ghstack dependencies: #121086	2024-03-11 05:14:42 +00:00
Jason Ansel	32488b0664	[dynamo] Support _unsafe_set_version_counter (#121086 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121086 Approved by: https://github.com/yanboliang	2024-03-11 05:14:42 +00:00
Ze Sheng	7a4e451184	[Dynamo] Fix function overrides (#120885 ) To check existence of `__torch_function__`, the code intended to iterate each element but got `TupleVariable` when the ordinary `has_torch_function()` was being used. Needs further unpack in this case Fixes #120653 Pull Request resolved: https://github.com/pytorch/pytorch/pull/120885 Approved by: https://github.com/yanboliang	2024-03-11 02:18:43 +00:00
wz337	60cd2a43ca	[DeviceMesh] Add support for nD slicing (#119752 ) Fixes one of the issues mentioned in #118639 @mvpatel2000 Pull Request resolved: https://github.com/pytorch/pytorch/pull/119752 Approved by: https://github.com/wanchaol	2024-03-10 00:16:37 +00:00
Peter Bell	168a04e752	[inductor] Changes to support newer triton pin (#121267 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121267 Approved by: https://github.com/lezcano ghstack dependencies: #121438	2024-03-09 18:17:36 +00:00
Peter Bell	459c5bca58	[inductor] Refactor common triton imports into one function (#121438 ) This means when codegen depends on a particular import we only need to add it in one place and it's applied to all triton kernels. This also changes codegen slightly so instead of generating `@pointwise` we now generate `@triton_heuristics.pointwise` just so the imports are the same for all kernel types. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121438 Approved by: https://github.com/lezcano	2024-03-09 18:17:36 +00:00
Yifu Wang	71d0202627	[dynamo] support rewriting dist.all_reduce with explicitly specified reduce op (#120181 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120181 Approved by: https://github.com/wconstab, https://github.com/awgu	2024-03-09 08:28:22 +00:00
PyTorch MergeBot	cf9742371c	Revert "Add CUTLASS kernel as choice for _int_mm() Inductor autotuning (#119685 )" This reverts commit `752d164b2f`. Reverted https://github.com/pytorch/pytorch/pull/119685 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is crashing on ROCm `752d164b2f` ([comment](https://github.com/pytorch/pytorch/pull/119685#issuecomment-1986773384))	2024-03-09 07:20:53 +00:00
Wanchao Liang	242e03ba86	[dtensor] add async_op option to redistribute and some refactor (#121477 ) The async output option was only available in the `full_tensor()` call, but I think it's generally good to make this option available in the `redistribute` call directly so that users can control it. This PR adds an async_op option to the redistribute call, to let users control whether to perform tensor redistribution asynchronously or not. By default we set this to False, to follow the semantics of the c10d collectives. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121477 Approved by: https://github.com/wz337	2024-03-09 06:17:23 +00:00
Jerry Zhang	a6a67da333	[quant] Add error check for input_edge annotation (#121536 ) Summary: Raises error when an input edge contains non-Node elements like constant values etc in annotation. Test Plan: python test/test_quantization.py -k test_input_edge_sanity_check Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/121536 Approved by: https://github.com/andrewor14	2024-03-09 06:13:04 +00:00
angelayi	e8836759d0	[export] Add effect token to export (#121424 ) Following the creation of effect tokens (https://github.com/pytorch/pytorch/pull/120296), we want to now add support for these tokens in export because the calling/returning convention has changed. The inputs are now `(tokens, params, buffers, constants, user_inputs)` and the outputs are `(tokens, buffer_mutations, user_mutations, user_outputs)`. The graph looks something like: ``` graph(): %arg0_1 : [num_users=1] = placeholder[target=arg0_1] %attr : [num_users=2] = placeholder[target=attr] %arg1_1 : [num_users=2] = placeholder[target=arg1_1] %with_effects : [num_users=2] = call_function[target=torch._higher_order_ops.effects.with_effects](args = (%arg0_1, _TorchScriptTesting.takes_foo.default, %attr, %arg1_1), kwargs = {}) %getitem : [num_users=1] = call_function[target=operator.getitem](args = (%with_effects, 0), kwargs = {}) %getitem_1 : [num_users=1] = call_function[target=operator.getitem](args = (%with_effects, 1), kwargs = {}) %with_effects_1 : [num_users=2] = call_function[target=torch._higher_order_ops.effects.with_effects](args = (%getitem, _TorchScriptTesting.takes_foo.default, %attr, %getitem_1), kwargs = {}) %getitem_2 : [num_users=1] = call_function[target=operator.getitem](args = (%with_effects_1, 0), kwargs = {}) %getitem_3 : [num_users=1] = call_function[target=operator.getitem](args = (%with_effects_1, 1), kwargs = {}) %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%arg1_1, %getitem_3), kwargs = {}) return (getitem_2, add) ``` During unlifting, we will first remove the tokens and with_effect calls using the `remove_effect_tokens` pass. (cc @SherlockNoMad on the pass to remove tokens). This is so that this won't change the calling conventions when retracing. 
The graph after unlifting looks something like: ``` graph(): %attr_1 : [num_users=2] = get_attr[target=attr] %arg1_1 : [num_users=2] = placeholder[target=arg1_1] %takes_foo_default_1 : [num_users=1] = call_function[target=torch.ops._TorchScriptTesting.takes_foo.default](args = (%attr_1, %arg1_1), kwargs = {}) %takes_foo_default : [num_users=1] = call_function[target=torch.ops._TorchScriptTesting.takes_foo.default](args = (%attr_1, %takes_foo_default_1), kwargs = {}) %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%arg1_1, %takes_foo_default), kwargs = {}) return (add,) ``` Serialization support will be added in a followup. Note: tokens only affect custom ops that take in ScriptObjects, not ScriptObject methods yet. Differential Revision: [D54639390](https://our.internmc.facebook.com/intern/diff/D54639390) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121424 Approved by: https://github.com/tugsbayasgalan	2024-03-09 02:43:26 +00:00
Aidyn-A	eb3919944d	[C10d][NCCL] Refactor complex all_reduce and broadcast (#121045 ) The necessity of this PR lies in the fact that autograd engine + DDP calls `all_reduce` from C++, so the changes must be made in C++. ``` [rank0]: Traceback (most recent call last): [rank0]: File "~/complex_ddp.py", line 72, in <module> [rank0]: main() [rank0]: File "~/complex_ddp.py", line 64, in main [rank0]: loss.backward() [rank0]: File "/home/usr/pytorch/torch/_tensor.py", line 525, in backward [rank0]: torch.autograd.backward( [rank0]: File "/home/usr/pytorch/torch/autograd/__init__.py", line 267, in backward [rank0]: _engine_run_backward( [rank0]: File "/home/usr/pytorch/torch/autograd/graph.py", line 744, in _engine_run_backward [rank0]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [rank0]: TypeError: Input tensor data type is not supported for NCCL process group: ComplexFloat ``` I believe, for minimizing the Python overhead, the same could be done for the rest of the ops, what do you think @kwen2501? Pull Request resolved: https://github.com/pytorch/pytorch/pull/121045 Approved by: https://github.com/eqy, https://github.com/kwen2501	2024-03-09 02:00:54 +00:00
Aleksandar Samardžić	752d164b2f	Add CUTLASS kernel as choice for _int_mm() Inductor autotuning (#119685 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/119685 Approved by: https://github.com/cpuhrsch	2024-03-09 02:00:50 +00:00
Colin Peppler	13a25c647f	[export] improve binary op fast path broadcast check (#121546 ) # Context I believe we have an incorrect guard being created during FakeTensor's binary op fast path. Consider this case ``` # op.shape: (10, 192); final_shape: (s0, 10, 192) # Guard Ne(s0, 10) is created when we create SymBool(10 == s0) if isinstance(op, torch.Tensor) and op.shape == final_shape: break ``` As of right now, `op.shape == final_shape` checks whether the shape of one of the binary op's operands is the same as the binary op's output shape. * If one of them is a dynamic shape, then we'll create a guard via `SymBool` creation (i.e. `s0 == 10`). * If the `SymBool` expr resolves to `false`, then we'll create the guard `Ne(s0, 10)`. This is a problem when the number of dimensions isn't the same between `op.shape` & `final_shape`. Take the case above for example, `op.shape: (10, 192); final_shape: (s0, 10, 192)`. Although the shapes aren't the same, it doesn't necessarily mean that `s0 != 10`. Some thoughts (feel free to ignore). What if the number of dimensions is equal but one of the shapes has symbols? Here are three cases: 1. `op.shape: (9000, 10, 192); final_shape: (s0, 10, 192)` -- not broadcastable. 2. `op.shape: (1, 10, 192); final_shape: (s0, 10, 192)` -- 0/1 specialization wins? 3. `op.shape: (100, 10, 192); final_shape: (s0, 10, 192) where s0 = 100` -- Ask user to mark `s0` as a constant. # Test ``` $ TORCHDYNAMO_VERBOSE=1 PYTORCH_TEST_WITH_DYNAMO=1 pytest -s test/dynamo/test_dynamic_shapes.py -k test_export_fast_binary_broadcast_check_dynamic_shapes torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (dim0)! For more information, run with TORCH_LOGS="+dynamic". - Not all values of dim0 = L['a'].size()[0] in the specified range 3 <= dim0 <= 1024 satisfy the generated guard Ne(L['a'].size()[0], 3). ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121546 Approved by: https://github.com/aakhundov	2024-03-09 01:49:42 +00:00
Lucas Pasqualin	d482614fec	[DCP] Makes fsspec public (#121508 ) Fixes #118033 Also removes `_checkpointer.py` class original PR's: - https://github.com/pytorch/pytorch/pull/121330 - https://github.com/pytorch/pytorch/pull/121329 We're also disabling `test_fsdp` since it is failing on random PR's Pull Request resolved: https://github.com/pytorch/pytorch/pull/121508 Approved by: https://github.com/fegin	2024-03-09 01:14:18 +00:00
albanD	6791b0c09e	Change default torch_function behavior to be disabled when torch_dispatch is defined (take 2) (#120632 ) This does not introduce a new test but is tested by checking that all the classes we already have still behave as before now that they don't explicitly disable torch_function. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120632 Approved by: https://github.com/ezyang	2024-03-09 01:08:37 +00:00
Aidyn-A	ca9678405a	[CUDA graphs] Pool argument for make_graphed_callables (#121475 ) It is just a nice feature to have for the situations when users want multiple graphs captures and/or graphed callables to share the same memory pool. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121475 Approved by: https://github.com/eellison, https://github.com/eqy	2024-03-09 00:15:38 +00:00
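
Commit 6ca9ae4f86 (#121554) in the log above says that a y grid larger than CUDA's limit of 65535 can be composed in terms of the otherwise unused z grid. The arithmetic behind that idea can be sketched in plain Python. This is only an illustration under stated assumptions, not Inductor's actual codegen: `split_y_grid` and `linear_y_index` are hypothetical names, and the particular way of choosing the split is one possibility among several.

```python
MAX_Y_GRID = 65535  # CUDA's max y_grid, as stated in the commit message


def split_y_grid(y_blocks: int) -> tuple:
    """Hypothetical helper: return (y_grid, z_grid) with
    y_grid <= MAX_Y_GRID and y_grid * z_grid >= y_blocks."""
    if y_blocks <= MAX_Y_GRID:
        return y_blocks, 1
    # Smallest z that lets y fit under the limit (integer ceiling division).
    z_grid = -(-y_blocks // MAX_Y_GRID)
    # Spread the blocks evenly over the z slices (again a ceiling division).
    y_grid = -(-y_blocks // z_grid)
    return y_grid, z_grid


def linear_y_index(y: int, z: int, y_grid: int) -> int:
    """Hypothetical helper: recover the original y block index from (y, z)."""
    return z * y_grid + y
```

Because `y_grid * z_grid` may slightly exceed the requested block count, a kernel using such a split would need to skip (mask out) the linear indices that fall past the end.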

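
Commit 3461404869 (#121145) in the log above explains that taking the rightmost part of an FQN causes name conflicts when there are multiple instances of the same class, and that "." is now replaced by "_" instead. A small plain-Python illustration of the difference follows. The function names and example FQNs are hypothetical and this is not the export code itself.

```python
def rightmost_name(fqn: str) -> str:
    """Old scheme described in the commit: keep only the last FQN component."""
    return fqn.rsplit(".", 1)[-1]


def flattened_name(fqn: str) -> str:
    """New scheme described in the commit: replace '.' with '_'."""
    return fqn.replace(".", "_")


# Two instances of the same (hypothetical) class, each holding a constant
# called "scale".
fqns = ["block1.scale", "block2.scale"]

old_names = [rightmost_name(f) for f in fqns]  # both collapse to "scale"
new_names = [flattened_name(f) for f in fqns]  # stay distinct
```

The flattened names also contain no ".", so each one can be used as a plain argument name.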

25384 Commits