…with large index
Fixes #130806
When an output of 2147483648 (= 131072 * 16384) elements is expected, as in the above issue, it threw the following error:
RuntimeError: HIP error: invalid configuration argument
What happened was that the second parameter passed to hipLaunchKernel was an absurd {2147483648,1,1}.
Found two issues in Indexing.cu:
1: `ptrdiff_t` was used, but it is a signed int; `outTotalSize >= 2147483648` can cause overflow when doing [this](39493aa934/aten/src/ATen/native/cuda/Indexing.cu (L1367)):
2: On ROCm, `std::min -> ::min` did not work as expected when `outTotalSize >= 2147483648`
As a result, 2147483648 was sent to hipLaunchKernel, which the GPU does not support, since this number specifies the number of threads per block. The original code intended to set 128 threads per block; this is debatable, as the perf would not be good for the latest powerful GPUs (maybe a TODO item for a perf update?), but at least it would not cause the `invalid configuration argument` error.
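For context, here is a minimal repro sketch of the failure mode, assuming an embedding lookup whose output has 131072 * 16384 elements; the exact snippet lives in the linked issue, so the shapes below are illustrative only:
```py
# Hedged repro sketch; shapes are assumed from the 131072 * 16384 = 2147483648 output size.
import torch

emb = torch.nn.Embedding(32, 16384, device="cuda")
idx = torch.randint(0, 32, (131072,), device="cuda")
out = emb(idx)                   # output has 2**31 elements
print(out.dim(), out.numel())    # expect: 2 2147483648
```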
[Test]
Run the same code snippet as in the [issue](https://github.com/pytorch/pytorch/issues/130806) and print the output, its dim, and its numel(), which now looks like this:
```
output=tensor([[ 0.4044, -0.0244, -0.6865, ..., -0.7800, 0.1175, 1.6726],
[-1.0866, -0.1609, 0.3538, ..., 1.9105, 0.7882, 1.1583],
[-2.2079, 0.3736, 0.3610, ..., -0.2658, -0.0459, 1.3077],
...,
[ 0.8753, -0.7482, -0.1978, ..., 0.9016, 1.1501, -0.5178],
[-1.5845, -0.6277, 1.4520, ..., 0.5733, -2.1198, -0.0915],
[-0.6310, -1.0239, -0.1910, ..., 0.4309, 0.1630, 0.3239]],
device='cuda:0'), dim=2, numel=2147483648
```
Added a large tensor unit test too.
```
/pytorch# pytest test/nn/test_embedding.py -k test_large_tensors
================================================================================== test session starts ===================================================================================
platform linux -- Python 3.9.19, pytest-7.3.2, pluggy-1.4.0
rootdir: /dockerx/development/pytorch
configfile: pytest.ini
plugins: flakefinder-1.1.0, rerunfailures-14.0, xdist-3.3.1, xdoctest-1.1.0, cpp-2.3.0, hypothesis-5.35.1
collected 288 items / 287 deselected / 1 selected
Running 1 items in this shard
test/nn/test_embedding.py . [100%]
=========================================================================== 1 passed, 287 deselected in 3.16s ============================================================================
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130994
Approved by: https://github.com/jeffdaily, https://github.com/xw285cornell
Currently we require `n % register_block_n == 0`, which typically brings good perf when `n` is a multiple of 8, 16, 32, etc., but falls back to the reference micro-gemm otherwise (where `register_block_n == 1`). This PR optimizes this by padding `n` of the packed weight to a multiple of `register_block_n` (8, 16, 32, etc.), so the micro-gemm can work as-is on the padded `n`. When the weight is padded, we use a local accumulation buffer to hold the micro-gemm result and then unpad (slice) it before storing back to the output buffer.
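A rough Python sketch of the padding/slicing idea (the actual change lives in the C++ micro-gemm template codegen; the function and shapes here are illustrative, not the real implementation):
```py
import torch

def gemm_with_padded_n(x, w, register_block_n=16):
    # x: (m, k), w: (k, n) packed weight; illustrative shapes only.
    n = w.shape[1]
    n_padded = (n + register_block_n - 1) // register_block_n * register_block_n
    # Pad the weight columns up to a multiple of register_block_n:
    w_padded = torch.nn.functional.pad(w, (0, n_padded - n))
    acc = x @ w_padded   # micro-gemm works as-is on the padded n (local accumulation buffer)
    return acc[:, :n]    # unpad (slice) before storing back to the output buffer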
Performance numbers measured on "Intel (R) Xeon (R) CPU Max 9480", single core, bf16.
Before
```
AUTOTUNE linear_unary(512x768, 3073x768, 3073)
  _linear_pointwise 2.3563 ms 100.0%
  cpp_packed_gemm_0 710.5902 ms 0.3%
```
After
```
AUTOTUNE linear_unary(512x768, 3073x768, 3073)
  cpp_packed_gemm_0 1.8909 ms 100.0%
  _linear_pointwise 2.1016 ms 90.0%
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130690
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
ghstack dependencies: #130675
0.12.0 Major Updates:
- Add context manager to temporarily set the dictionary sorting mode
- Add accessor APIs
- Use `stable` tag for `pybind11` for Python 3.13 support
- Fix potential segmentation fault for pickling support
0.12.1 Updates:
- Fix warning regression during import when launched with strict warning filters
Closes #130155
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130139
Approved by: https://github.com/zou3519
ghstack dependencies: #130895
Fixed TrainingIRToRunDecomp failures for test_tensor_attribute_zero_args, and also a few re-traceability failures, because run_decomposition does a retracing.
**edit:** also removed the eliminate_dead_code() call in _unlift because of one onnx test failure:
a constant tensor attr was lifted as a constant_tensor input, but it is not used in the graph after aot_autograd due to a shortcut in its decomposition. This caused the setattr to be removed by eliminate_dead_code, while the graph signature still contains the name of that buffer, which leads to an inconsistency between the transformed graph and the ep's original signature after _unlift. It seems this has happened a few times, where some nodes are accidentally removed and we end up in an inconsistent state.
The alternative to removing it would be: every time we call eliminate_dead_code, we verify the consistency of the graph against 1. the graph before transformation and 2. all the metadata, but I think this deserves a complete design.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130990
Approved by: https://github.com/pianpwk
There's no reason to ban them for vmap or jvp, because without the
{grad, vjp} transforms those just act above PyTorch autograd, which will
end up saving regular Tensors.
Test Plan:
- some tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131191
Approved by: https://github.com/drisspg
#### Issue
Model parameters sometimes do not appear in the `named_parameters()` function, for example when trying to jit.trace an already jit.scripted model. This PR fixes that by relying on `state_dict` to get both parameters (`requires_grad=True`) and buffers.
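A minimal sketch of the idea, assuming a scripted module whose `named_parameters()` does not surface its parameters (the module below is illustrative):
```py
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(2, 2)
        self.register_buffer("scale", torch.ones(2))

    def forward(self, x):
        return self.linear(x) * self.scale

scripted = torch.jit.script(M())
# state_dict() exposes both parameters (requires_grad=True) and buffers,
# even when named_parameters() on the scripted module comes back incomplete:
params_and_buffers = dict(scripted.state_dict())
```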
#### Test Plan
* `pytest test/export/test_converter.py -s -k test_convert_retrace_nested_scripted_modules`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129787
Approved by: https://github.com/angelayi
Summary:
In the export workflow, we always have a lifted graph that doesn't fetch constants through get_attr nodes. This causes compatibility issues when we try to use inductor's split_const_gm function with a lifted graph.
This diff makes an additive change to split_const_gm's interface: when the pass sees a placeholder node that is present in the lifted_constants table, it will also use that as a source of constness.
This change won't break existing code, and the lifted_constants table can be used orthogonally to the existing const-folding mechanisms.
Also, as requested by the MTIA team, we introduce a small callback function used to skip certain nodes during const folding.
For the internal followup counterpart, see D59685145
Test Plan: buck run mode/opt caffe2/test:test_export -- -r split_const_gm
Differential Revision: D59692790
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130743
Approved by: https://github.com/desertfire, https://github.com/SherlockNoMad
Earlier, the signature of dequantize ops for decomposed quantized tensors was changed for wider use cases where the output dtype can be different from torch.float and needs to be passed during dequantization.
Please refer to: https://github.com/pytorch/pytorch/pull/121450
However, setting the correct output dtype for dequantize ops was still missing in the convert_pt2e flow.
This change enables users to use the PT2E quantization flow with a non-torch.float unquantized dtype, such as torch.bfloat16.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128953
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
beartype has served us well in identifying type errors and ensuring we call internal functions with the correct arguments (thanks!). However, the value of having beartype is diminished for the following reasons:
1. When beartype improved its support for Dict[] type checking, it discovered typing mistakes in some functions that were previously uncaught. This caused the exporter to fail with newer versions of beartype where it used to succeed. Since we cannot fix PyTorch and release a new version just because of this, it creates confusion for users who have beartype in their environment when they use torch.onnx.
2. beartype adds an additional call line to the traceback, which makes the already thick dynamo stack even larger, hurting readability when users diagnose errors from the traceback.
3. Since the typing annotations need to be evaluated, we cannot use new syntax like `|` because we need to maintain compatibility with Python 3.8. We don't want to wait for PyTorch to adopt Python 3.10 as the lowest supported version before using the new typing syntax.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130484
Approved by: https://github.com/titaiwangms
While for optimizations like pad_mm there are always only two possible choices, for other decision procedures, like kernel choice selection, the set of "available" choices depends on the input. Instead of storing the choices as metadata, we can look at all choices for which we have collected data (i.e. `df[CHOICE_COL].unique()`).
In this PR, I also try to replace "choice" and "feedback" with the global constants CHOICE_COL and FEEDBACK_COL.
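A minimal illustration of deriving the choice set from the collected data rather than from metadata; the column names are assumed to match the constants mentioned above:
```py
import pandas as pd

CHOICE_COL = "choice"      # global constant names per this PR
FEEDBACK_COL = "feedback"

df = pd.DataFrame(
    {CHOICE_COL: ["triton", "extern", "triton"], FEEDBACK_COL: [1.2, 0.8, 1.1]}
)
# The "available" choices are whatever we actually collected data for:
available_choices = df[CHOICE_COL].unique()
```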
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130304
Approved by: https://github.com/eellison
This is an updated PR to equip cond with the autograd feature; it replaces the old [PR](https://github.com/pytorch/pytorch/pull/126007).
@ydwu4 I tried to incorporate your requests already.
Currently there are two problems that I am struggling to solve:
1. There seems to be an import issue when trying to import cond in `torch/__init__.py`, see [here](8a704035c9/torch/__init__.py (L1914-L1916)). Therefore, I had to comment out those lines, which resolved the import issues, but I believe cond is not properly exposed as torch.cond.
2. I am not entirely sure how to deal with the opinfo test in `hop_db.py`.
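For reference, a hedged sketch of the usage this PR targets once the exposure issue above is sorted out (the import path is an assumption, since torch.cond exposure is problem 1):
```py
import torch
from torch._higher_order_ops.cond import cond  # import path assumed; torch.cond is problem 1 above

def f(x):
    return cond(x.sum() > 0, lambda t: t.sin(), lambda t: t.cos(), (x,))

x = torch.randn(3, requires_grad=True)
out = torch.compile(f)(x)
out.sum().backward()   # with this PR, gradients flow through cond
```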
Co-authored-by: Yidi Wu <yidi@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126911
Approved by: https://github.com/ydwu4
Summary:
Add three top-level APIs for the numeric debugger in the pt2e flow that can log intermediate outputs in the model
and calculate a summary of metric comparisons between nodes in two graphs (a rough usage sketch follows the list):
* `prepare_for_propagation_comparison`
* `extract_results_from_loggers`
* `compare_results`
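A rough end-to-end usage sketch of the three APIs; the import path and argument order are assumptions, and `m_ref` / `m_quant` / `example_inputs` are placeholders for the two graph modules being compared and their inputs:
```py
from torch.ao.quantization import (  # import path assumed for this sketch
    compare_results,
    extract_results_from_loggers,
    prepare_for_propagation_comparison,
)

# m_ref, m_quant: placeholder graph modules to compare; example_inputs: their inputs.
m_ref_logged = prepare_for_propagation_comparison(m_ref)       # insert output loggers
m_quant_logged = prepare_for_propagation_comparison(m_quant)
m_ref_logged(*example_inputs)                                  # run both to record intermediates
m_quant_logged(*example_inputs)

ref_results = extract_results_from_loggers(m_ref_logged)
quant_results = extract_results_from_loggers(m_quant_logged)
summary = compare_results(ref_results, quant_results)          # per-node metric summary
```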
Test Plan:
python test/test_quantization.py -k test_prepare_for_propagation_comparison
python test/test_quantization.py -k test_extract_results_from_loggers
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130643
Approved by: https://github.com/dulinriley, https://github.com/tarun292
We currently can't generate split scans when there are multiple scan
values, so we normally fall back to ATen. However, for the higher order
scan op, we can't fall back, so it makes sense to just generate the slower
kernel anyway. This avoids having special shapes where we fail to
codegen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130936
Approved by: https://github.com/lezcano
Sets `prefer_deferred_runtime_asserts_over_guards=True` for export, so any guards emitted from `SymNode.expect_true` (for example, guards that are implicitly required to be true for an op to succeed) won't lead to constraint violations. Instead these should appear in the graph as runtime asserts, or potentially as replacement expressions for placeholder shapes.
For example, this reshape op should emit s0 * s1 = s2, deferred as a runtime assert.
```
x = torch.randn(4, 8) # [s0, s1]
y = torch.randn(32) # [s2]
out = x.reshape(-1) + y
# this emits Eq(s0 * s1, s2), and we represent y's shape as [s0*s1] in the graph.
```
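A hedged export sketch of the snippet above (the module and dim names are illustrative): with `prefer_deferred_runtime_asserts_over_guards=True`, the `Eq(s0*s1, s2)` guard is deferred instead of raising a constraint violation.
```py
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x, y):
        return x.reshape(-1) + y

dynamic_shapes = {
    "x": {0: Dim("s0"), 1: Dim("s1")},
    "y": {0: Dim("s2")},
}
ep = export(M(), (torch.randn(4, 8), torch.randn(32)), dynamic_shapes=dynamic_shapes)
print(ep)  # expect s0*s1 == s2 to show up as a runtime assert / placeholder replacement
```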
However, other complex guards can still cause export to fail, for instance guards emitted from `SymNode.guard_bool/guard_size_oblivious` (e.g. explicit if-else conditions in user code or lower-level op implementations hit during tracing) can still raise constraint violations. These can be deferred with `allow_complex_guards_as_runtime_asserts=True`. We don't yet make this default, because while this makes export more likely to succeed, it results in non-trivial asserts being emitted that often represent specialization to a variant of the op, or checks related to 0/1 specialization.
We also remove forced specializations for export and kill the `_disable_forced_specializations` flag; now any guard we can't express with Dims/DerivedDims is either handled with hybrid SymInts or should be resolved by rewriting or deferring.
Follow up:
Currently, `ShapeEnv._set_replacement()` is called for complex equality expressions (e.g. s2 -> s0*s1 in the example above), and the ExportedProgram stores `s0*s1` in the input placeholder. This isn't checked for validity when the program is run, so an option is to avoid replacement and/or runtime assert on equality.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130775
Approved by: https://github.com/avikchaudhuri
Fixes #128745
Solves the conflict when users use full_state_dict while the model is wrapped in FSDP.
Currently solves the issue for `full_state_dict=True`, which failed with the error
`'aten.copy_.default: got mixed torch.Tensor and DTensor, need to convert all torch.Tensor to DTensor before calling distributed operators!',).`
TODO: for `broadcast_from_rank0=True, full_state_dict=True`, the error is
`NotImplementedError: c10d::broadcast_: attempted to run this operator with Meta tensors`
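For reference, a hedged sketch of the call path this fixes; `fsdp_model` and `full_sd` are placeholders for an FSDP-wrapped module and a full, unsharded state dict:
```py
from torch.distributed.checkpoint.state_dict import StateDictOptions, set_model_state_dict

# fsdp_model: placeholder FSDP-wrapped module; full_sd: placeholder full state dict.
set_model_state_dict(
    fsdp_model,
    model_state_dict=full_sd,
    options=StateDictOptions(full_state_dict=True),  # the case fixed here
)
```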
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129635
Approved by: https://github.com/fegin
Summary: Finishing up the mechanism to "register" certain types of operators to a registry so that the serializer can handle them correctly. This is expected to be used first by executorch.
Test Plan: buck run mode/opt caffe2/test:test_export -- -r test_export_with_extension_op_serialization
Differential Revision: D59825148
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130851
Approved by: https://github.com/angelayi
- More conservative estimation of plannable inputs
- Consider constant_pad_nd as a pointwise node in concat lowering
- Use aten.cat instead of constant_pad_nd when padding just a single dimension, because it can be memory-planned away (see the sketch below)
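A small sketch of the equivalence exploited by the last point:
```py
import torch

x = torch.randn(4, 6)
# Padding only the last dimension with constant_pad_nd ...
padded = torch.nn.functional.pad(x, (0, 2))
# ... is the same as concatenating a zero block, which can be memory-planned away:
cat_version = torch.cat([x, x.new_zeros(4, 2)], dim=1)
assert torch.equal(padded, cat_version)
```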
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128909
Approved by: https://github.com/Chillee
Might fix #127660; need to test some more cases.
We update the reinplacing pass. If we have something like the following,
where "sin" is a custom op (this situation should also apply to triton
kernels)
```py
def graph(x):
y = sin(x)
z = sin(y)
x.copy_(z)
```
then the reinplacer used to produce the following:
```py
"""step 1: reinplaces the first sin"""
def graph(x):
x_clone = x.clone()
sin_out(x, out=x_clone)
z = sin(x_clone)
x.copy_(z)
"""step 2: reinplaces the second sin"""
def graph(x):
x_clone = x.clone()
sin_out(x, out=x_clone)
sin_out(x_clone, out=x_clone)
x.copy_(x_clone)
```
However, the first clone is unnecessary. It is safe to reinplace
the first sin into the following:
```py
def graph(x):
sin_out(x, out=x)
z = sin(x)
x.copy_(z)
```
because there are no users of `x`'s original value (the copy_ node
doesn't actually use the original value of x!)
This PR updates the reinplacing pass to ignore copy_ in its computation
of whether the original value of the mutated argument is still needed.
NB: this also applies to triton kernels, but it was easier for me to
reason about custom ops (and my repros were all for custom ops).
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130866
Approved by: https://github.com/oulgen
# Summary
- This removes a bunch of example score mods that were primarily used for testing and places them directly in the test file. We should follow up by merging test_flex_decode and test_flash when the velocity slows down a little
- Fixes a bug with indexing on block mask
- Adds some docstrings to helper functions and fixes some misc typing things
- Forces functions passed to `create_block_mask` to be mask_mods and updates test files (a sketch of a mask_mod follows below)
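A hedged sketch of what a mask_mod looks like; the signature is assumed from the flex attention helpers, and the point is that it returns a boolean keep/drop decision rather than modifying scores like a score_mod:
```py
def causal_mask(b, h, q_idx, kv_idx):
    # True means the (q_idx, kv_idx) pair is kept (not masked out).
    return q_idx >= kv_idx

# block_mask = create_block_mask(causal_mask, B, H, Q_LEN, KV_LEN)  # usage sketch
```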
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130871
Approved by: https://github.com/joydddd, https://github.com/Chillee
This PR re-implements pin memory, aiming to get rid of the optional `device` argument and make all related APIs device-agnostic. We add two new abstract APIs in [AcceleratorHooksInterface](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/detail/AcceleratorHooksInterface.h#L12) and redefine pin memory as: "pin memory is always pinned for the current accelerator device". In detail, pin_memory/is_pinned use [getAcceleratorHooksInterface](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/Context.h#L61) to get an appropriate device and invoke the corresponding overridden interfaces, instead of using BackendSelect and then dispatching to CUDA or other specific backends' implementations.
Note: new backends that want to implement and use pin memory just need to inherit from AcceleratorHooksInterface and override the `isPinnedPtr` and `getPinnedMemoryAllocator` methods.
Additional context: to avoid BC-breaking changes, this PR preserves the `device` arg of the related APIs and throws a deprecation warning if the `device` arg is passed. Another PR will be submitted, based on this one, to update all PT callers (`Tensor.is_pinned()`, `Tensor.pin_memory()`, ...) so they stop passing this arg. In the future, the `device` arg will actually be removed.
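A quick sketch of the resulting user-facing behavior:
```py
import torch

t = torch.randn(4)
pinned = t.pin_memory()      # pinned for the current accelerator; no device argument needed
print(pinned.is_pinned())    # True
# t.pin_memory(device="cuda")  # still accepted for now, but emits a deprecation warning per this PR
```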
Relates #124908
Relates #14560
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126376
Approved by: https://github.com/albanD
Summary: Modify the existing `sum` operator in PyTorch, invoked by `torch.sum`, to allow for reductions along the ragged dimension of a nested tensor. This diff enables PyTorch users to invoke `torch.sum` on a nested tensor with `dim=1`, where `ragged_idx=1`.
Functions modified in `caffe2/torch/nested/_internal/ops.py`:
- `sum_dim_IntList()`: The function assumes that `ragged_idx=1`; in the case that `dim=1` as well, where `dim` is the dimension on which we reduce, this diff invokes the PyTorch benchmark found in D58423489. Specifically, this diff pads a nested tensor, e.g. of logical shape `(B, *, M)`, using [`torch.ops.aten._jagged_to_padded_dense_forward`](https://www.internalfb.com/code/fbsource/[92c2a067ab04e3eebc999254fed4ae2fbea6def3]/fbcode/deeplearning/fbgemm/fbgemm_gpu/fb/inductor_lowerings/elementwise_ops.py?lines=26), then reduces across the `*` dimension (`dim == 1`) to a `(B, M)` output tensor.
- `_wrap_jagged_dims()`: This diff adds special handling to allow for the case where `dim` contains `1` and not `0`, but to continue disallowing the case where `dim` contains `0` and not `1`. In this function's creation, I created a helper function, `_get_condition_for_invalid_jagged_reductions()`, which makes it clearer which conditions apply to which operators. Specifically, operators which are enabled with jagged reductions are specified at the top of the file in `SUPPORTED_JAGGED_REDUCTIONS` and have a different set of conditions that need to be tested, as reducing along `dim == 1` without `dim == 0` is now possible.
Functions modified in `caffe2/test/test_nestedtensor.py`:
- `test_sum_int_DimList()`: This diff adds special handling in the `sum` unit test to allow for the case where `dim` contains `1` and not `0`, but to continue disallowing the case where `dim` contains `0` and not `1`.
- `test_sum_int_DimList_ragged_dim_1()`: This diff adds a new unit test which verifies the accuracy and feasibility of reducing along the jagged dimension of a nested tensor.
Notes:
- This diff solely adds functionality for the case in which we reduce only along the ragged dimension. Cases in which we reduce along both the ragged and another dimension, like `dim == (1, 2)`, are not permitted, as this set of diffs focuses primarily on the former.
- The `sum` operator is the only operator which uses the function `_wrap_jagged_dims()`; all other operators use `_wrap_jagged_dim()`. I would like to later look into why this is the case and if we can consolidate this!
- I modified some of the comments in the `sum` function as well as the unit tests for more clarity.
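A hedged usage sketch of the reduction this diff enables, using the public nested-tensor API; the output follows the `(B, M)` description above:
```py
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(3, 8), torch.randn(5, 8)], layout=torch.jagged
)                      # logical shape (B=2, *, M=8), ragged_idx=1
out = nt.sum(dim=1)    # reduce across the ragged dimension -> (B, M) output tensor
print(out.shape)       # torch.Size([2, 8])
```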
Test Plan:
Verify that existing (`test_sum_int_DimList`) and new (`test_sum_int_DimList_ragged_dim_1`) unit tests pass via the following command:
```
buck2 run mode/{opt,inplace} //caffe2/test:nested -- --regex test_sum_int_DimList
```
Differential Revision: D59571209
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130425
Approved by: https://github.com/davidberard98
Reland of: https://github.com/pytorch/pytorch/pull/128016
Summary from previous PR:
We assume only two possible mutually exclusive scenarios:
1. Running the compiled region for training (any of the inputs has requires_grad): produced differentiable outputs should have requires_grad.
2. Running the compiled region for inference (none of the inputs has requires_grad): all outputs do not have requires_grad.
Even if the user runs the region under no_grad() but has an input Tensor with requires_grad, we go with training scenario (1).
With the current state, that means:
1/ needs_autograd should not check torch.is_grad_enabled(), only whether any of the inputs requires_grad
2/ if needs_autograd => trace_joint (we are in training scenario 1) => always run the compiled region under enable_grad() (sketched in the snippet below)
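A hedged pseudocode sketch of the rule in 1/ and 2/; the function name and shape are illustrative, not the actual implementation:
```py
import torch

def needs_autograd(inputs):
    # Intentionally does NOT consult torch.is_grad_enabled():
    return any(isinstance(t, torch.Tensor) and t.requires_grad for t in inputs)

# If needs_autograd(inputs) is True, we trace the joint (training) graph and always run
# the compiled region under torch.enable_grad(); otherwise we compile the inference graph.
```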
Changes in partitioner?
Inference and training graphs differed in the return container (list vs. tuple).
The changes in the partitioner unify this so it always returns a tuple.
As a result, there are some changes in test_aotdispatch.py where graph contents go from list to tuple.
Why was it reverted?
There was an inference regression for the hf_Reformer model.
```
TORCHINDUCTOR_FX_GRAPH_CACHE=0 python benchmarks/dynamo/torchbench.py --performance --inference --bfloat16 --backend inductor --device cuda --only hf_Reformer --cold-start-latency --use-eval-mode
```
One of the compiled graphs contained outputs that are aliases of inputs that are nn.Parameter(requires_grad=True).
Even though the torchbench inference benchmarks run inside `torch.no_grad()`, alias ops (specifically expand, for hf_Reformer) preserve requires_grad.
As a result, we started compiling a training graph instead of an inference graph.
Fix for view ops:
If we have outputs that are aliases of inputs that require grad, those outputs having requires_grad is not a reason to generate a training graph.
This is handled in aot_autograd.py, where output_and_mutation_safe is calculated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128890
Approved by: https://github.com/bdhirsh