**Get the device index via torch.privateuse1._utils._get_device_index, if the method exists.**
Reason:
Before this change, we could only get device_index 0 when `location` is a bare string such as 'privateuse1'.
Using _get_device_index gives the accurate device index in this scenario.
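A rough sketch of the intent (a hypothetical helper, not the actual implementation): pull the index out of the location string instead of always defaulting to 0.
```python
def _device_index_from_location(location: str) -> int:
    # Hypothetical helper: "privateuse1" -> 0, "privateuse1:1" -> 1
    if ":" in location:
        return int(location.split(":", 1)[1])
    return 0
```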
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108123
Approved by: https://github.com/albanD
Summary:
## Context
Both `aten.sum` and `aten.squeeze` have a "most generic" variant, in the form of `aten.sum.dim_IntList` and `aten.squeeze.dims` respectively. Add decompositions for the other, non-generic variants of these operators to express them in terms of the most generic variant.
Note that to register these decomps, the reference implementation under `_refs` had to be removed as registered decompositions. cc: @lezcano @peterbell10
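A minimal sketch of what such a decomposition registration can look like (this assumes the standard `register_decomposition` helper and the aten overload names; it is illustrative, not the code from this PR):
```python
import torch
from torch._decomp import register_decomposition

aten = torch.ops.aten

@register_decomposition(aten.sum.default)
def sum_default(self, *, dtype=None):
    # Express the non-generic overload via the most generic one;
    # dim=None means "reduce over all dimensions" for aten.sum.dim_IntList.
    return aten.sum.dim_IntList(self, None, False, dtype=dtype)
```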
Test Plan: Github CI + Meta Internal CI
Differential Revision: D49965952
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110645
Approved by: https://github.com/peterbell10, https://github.com/digantdesai, https://github.com/manuelcandales
We want to get to a point where most `UserError`s link to `exportdb` examples. This PR makes passing case names non-optional to make this intent clearer and encourage developers who raise `UserError`s to make or point to examples that make fixing such errors more obvious for users.
In addition, sometimes there are multiple examples that are relevant to an error. Thus this PR also enables passing multiple case names.
Differential Revision: [D50020465](https://our.internmc.facebook.com/intern/diff/D50020465/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110733
Approved by: https://github.com/zhxchen17
We had the option but never used `cpu_offload`, since optimizer state_dict offloads the tensors to CPU by default. This is usually what most users want, as the tensors eventually need to be moved to CPU. However, we may want to disable offloading to CPU in some cases, especially for debugging purposes. This PR lets optimizer state_dict read the flag.
Differential Revision: [D48913340](https://our.internmc.facebook.com/intern/diff/D48913340/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108434
Approved by: https://github.com/wz337
This PR adds support for aten.where and for implicit scalar
promotion: when we meet scalar tensors in the dispatching logic,
we implicitly convert them to replicated DTensors.
The latter also enables a bunch of ops in the op db to pass.
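A hedged usage sketch of the behavior described above (assumes a process group is already initialized, e.g. via torchrun; import paths follow torch.distributed._tensor):
```python
import torch
from torch.distributed._tensor import DeviceMesh, distribute_tensor, Replicate

mesh = DeviceMesh("cuda", list(range(torch.cuda.device_count())))
cond = distribute_tensor(torch.rand(8, 8) > 0.5, mesh, [Replicate()])
x = distribute_tensor(torch.rand(8, 8), mesh, [Replicate()])
y = torch.tensor(0.0)  # plain scalar tensor, implicitly treated as a replicated DTensor
out = torch.where(cond, x, y)
```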
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110584
Approved by: https://github.com/fduwjj
Summary:
As title, this diff hides the contiguity requirement on the user-provided mesh when initializing DeviceMesh.
In the current implementation, when testing with inter-node model parallelism, an exception is thrown during mesh validation when the following input is provided:
```
mesh = torch.arange(0, world_size).view(mp_size, dp_size).transpose(0, 1)
device_mesh = DeviceMesh(
"cuda",
mesh.contiguous(),
mesh_dim_names=("dp", "mp")
)
```
Test Plan:
**Unit Test**:
```
buck2 test mode/dev-nosan //caffe2/test/distributed/_tensor:device_mesh -- test_validate_device_mesh
Test UI: https://www.internalfb.com/intern/testinfra/testrun/3940649876878399
Network: Up: 0B Down: 0B
Jobs completed: 6. Time elapsed: 1:58.7s.
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0
```
**Test with MP**
```
mesh = torch.arange(0, world_size).view(mp_size, dp_size).transpose(0, 1)
device_mesh = DeviceMesh(
"cuda",
mesh.contiguous(),
mesh_dim_names=("dp", "mp")
)
```
Without the change: an exception is thrown.
After this change: initialized successfully.
Differential Revision: D49942839
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110628
Approved by: https://github.com/wanchaol, https://github.com/xw285cornell, https://github.com/fduwjj
Previously, `Dim` definitions that shared the same name but had different ranges were allowed to appear in the `dynamic_shapes` argument of an `export` call. They would correspond to the *same* dynamic dimension (identified by the shared name), with an effective range equal to the *intersection* of the different ranges.
However, this behavior can be confusing, because having different definitions with the same name is more likely than not unintentional. Therefore, this PR makes it a user error.
We still allow different definitions with the same name to exist at the same time (no global uniqueness) as long as they are not confused in the same `export` call. Redefinitions with the same bounds are also allowed, in case they are accidentally created by executing the same code multiple times.
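A hedged illustration of the newly rejected pattern, using the public `torch.export.Dim` API:
```python
from torch.export import Dim

batch_a = Dim("batch", min=1, max=16)
batch_b = Dim("batch", min=2, max=32)

# Using both in the same export call's dynamic_shapes is now a user error,
# since they share a name but have different ranges, e.g.:
# dynamic_shapes={"x": {0: batch_a}, "y": {0: batch_b}}  # -> UserError
```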
Differential Revision: [D49965944](https://our.internmc.facebook.com/intern/diff/D49965944/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110638
Approved by: https://github.com/zhxchen17
While the `range_constraints` that is initially derived by processing of constraints only contains symbols that appear in the graph module, eventually the `range_constraints` that are in the exported program seem to contain more symbols than those that appear in the graph module. Clearly this is a regression, because the example of "Expressing Dynamism" in our public docs (https://pytorch.org/docs/stable/export.html#expressing-dynamism) does not show the extra symbols in `range_constraints`, but running the example does.
The problem seems to arise when we are running `_transform` passes, where we regenerate the `range_constraints` from the `shape_env`. However, as a rule, symbols that have `replacements` are actually replaced (by other expressions, including constants or other symbols), so they should never appear in the graph module. Thus we can filter such symbols out from `range_constraints` as well.
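A minimal sketch of the filtering described above (attribute names are assumptions; the actual pass may differ):
```python
def filter_replaced_symbols(range_constraints, shape_env):
    # Drop symbols that the ShapeEnv has replaced (with constants or other symbols),
    # since replaced symbols never appear in the graph module.
    return {
        symbol: value_range
        for symbol, value_range in range_constraints.items()
        if symbol not in shape_env.replacements
    }
```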
Differential Revision: [D49969620](https://our.internmc.facebook.com/intern/diff/D49969620/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110644
Approved by: https://github.com/zhxchen17
Summary: The runtime assertions inserted in the `torch._export.export` by the `_AddRuntimeAssertionsForInlineConstraintsPass` lead to errors in AOT Inductor like #109884. In `torch._export.aot_compile` export and AOT compilation are run consecutively which would lead to the above issue if any assertions are inserted.
In this PR, we're adding a new parameter / flag to `torch._export.aot_compile`, `remove_runtime_assertions`, to remove the assertions inserted during export before AOT compilation. The flag is set to `False` for BC.
Additionally, we remove the flag `add_runtime_assertions_for_inline_constraints` recently added to `torch._dynamo.config`, as it can lead to undesirable `torch._export` behavior and is no longer required for AOT Inductor testing purposes.
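A hedged usage sketch of the new flag (the parameter name is taken from this description; the module `M` and inputs are placeholders):
```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

# Drop the inline-constraint runtime assertions inserted by export
# before AOT compilation (flag name per this PR's description).
so_path = torch._export.aot_compile(
    M(),
    (torch.randn(4),),
    remove_runtime_assertions=True,
)
```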
Test Plan: CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110710
Approved by: https://github.com/zhxchen17, https://github.com/chenyang78
Expose a set of observability hooks into c10d so that our users can
detect collective failures both faster and more easily.
The design is similar to NCCL desync debug in that it minimizes the
overhead by doing most of the work off the main thread.
This PR introduces a new module, torch.distributed.hooks, that exposes the following set of methods:
register_collective_start_hook
register_collective_end_hook
register_process_group_hook
The process group hook exposes PG creation on the member ranks; the hooks are called inline from the
PG creation code. This is fine since this happens during initialization and a limited number of times.
The collective start/end hooks are fired from a single background thread, which reads
events from a C++ queue and dispatches them.
Queue notification is, somewhat oddly, done using a pipe; this is needed so Python can abort the thread on shutdown
and keep it as a background thread, which is not possible with more reasonable choices like a condvar.
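A minimal usage sketch of the new module (the callback signatures aren't spelled out in this description, so the single event argument below is an assumption):
```python
import torch.distributed.hooks as dhooks

# Assumed callback shape: one positional event/info argument with basic metadata.
def on_collective_start(event):
    print("collective started:", event)

def on_collective_end(event):
    print("collective finished:", event)

def on_pg_created(pg_info):
    print("process group created:", pg_info)

dhooks.register_collective_start_hook(on_collective_start)
dhooks.register_collective_end_hook(on_collective_end)
dhooks.register_process_group_hook(on_pg_created)
```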
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108815
Approved by: https://github.com/wconstab, https://github.com/fduwjj
This PR exposes torch._higher_order_ops.cond as torch.cond.
1. We need to add `# noqa: F811` to the _check calls in torch/__init__.py to silence a confusing linter error ("Redefinition of unused 'cond'"): only one cond is imported, and the flagged lines don't define cond, they just use it as an argument.
2. Also add cond to the list that allows it to be traced through, so that dynamo triggers the CondHigherOrder logic instead of creating a TorchVariable.
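A short usage sketch of the newly exposed API (torch.cond takes a predicate, two branch functions with matching output structure, and a tuple of operands):
```python
import torch

def true_fn(x):
    return x.sin()

def false_fn(x):
    return x.cos()

@torch.compile
def f(x):
    # The predicate may be a boolean tensor; under compile this traces to the cond higher-order op.
    return torch.cond(x.sum() > 0, true_fn, false_fn, (x,))

print(f(torch.randn(4)))
```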
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110293
Approved by: https://github.com/zou3519
Fixes recent broken unit tests caused by PR #109908 because cudnn and miopen have separate batch norm functions.
```
2023-10-05T09:35:01.6606614Z _______________ TestQuantizePT2EQAT.test_qat_conv_bn_fusion_cuda _______________
2023-10-05T09:35:01.6606948Z Traceback (most recent call last):
2023-10-05T09:35:01.6607362Z File "/var/lib/jenkins/pytorch/test/quantization/pt2e/test_quantize_pt2e_qat.py", line 323, in test_qat_conv_bn_fusion_cuda
2023-10-05T09:35:01.6607767Z self._verify_symmetric_xnnpack_qat_graph(
2023-10-05T09:35:01.6608217Z File "/var/lib/jenkins/pytorch/test/quantization/pt2e/test_quantize_pt2e_qat.py", line 130, in _verify_symmetric_xnnpack_qat_graph
2023-10-05T09:35:01.6608658Z self._verify_symmetric_xnnpack_qat_graph_helper(
2023-10-05T09:35:01.6609105Z File "/var/lib/jenkins/pytorch/test/quantization/pt2e/test_quantize_pt2e_qat.py", line 173, in _verify_symmetric_xnnpack_qat_graph_helper
2023-10-05T09:35:01.6609623Z m = prepare_qat_pt2e(m, quantizer)
2023-10-05T09:35:01.6610171Z File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/ao/quantization/quantize_pt2e.py", line 178, in prepare_qat_pt2e
2023-10-05T09:35:01.6610561Z _fuse_conv_bn_qat(model)
2023-10-05T09:35:01.6611072Z File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/ao/quantization/pt2e/qat_utils.py", line 501, in _fuse_conv_bn_qat
2023-10-05T09:35:01.6611497Z m = _fuse_conv_bn_qat_helper(m, is_cuda=True)
2023-10-05T09:35:01.6612065Z File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/ao/quantization/pt2e/qat_utils.py", line 575, in _fuse_conv_bn_qat_helper
2023-10-05T09:35:01.6612492Z _get_conv_bn_getitem_nodes(r.replacements)
2023-10-05T09:35:01.6613058Z File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/ao/quantization/pt2e/qat_utils.py", line 383, in _get_conv_bn_getitem_nodes
2023-10-05T09:35:01.6613465Z assert bn_node is not None
2023-10-05T09:35:01.6613716Z AssertionError
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110653
Approved by: https://github.com/jerryzh168, https://github.com/pruthvistony
Add non-package python modules to the public API checks.
The original change is to remove the `ispkg` check in this line
https://github.com/pytorch/pytorch/blob/main/docs/source/conf.py#L518
Everything else is to add the appropriate modules to the rst files, make sure every module we provide can be imported (fixed by either making optional dependencies actually optional or just deleting files that have been un-importable for 3 years), make APIs that are both modules and functions (like torch.autograd.gradcheck) render properly on the docs website without confusion, and add every non-documented API to the allow list (~3k of them).
Next steps will be to try and fix these missing docs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110568
Approved by: https://github.com/zou3519
Adam part of: https://github.com/pytorch/pytorch/issues/110506
TODO:
- If this approach is validated as a good one, it can also be applied to all other optimizers which convert `complex` via list comprehensions
### Results:
`NUM_PARAMS=200, foreach=True`
- main: dynamo: 43s, inductor: 31s, total: 74s
- this PR: dynamo: 3.5s, inductor: 30s, total: 34s (dynamo speedup: 12.3x, overall speedup: 2.1x)
`NUM_PARAMS=1000, foreach=True, has_complex shortcut`:
```
<class 'torch.optim.adam.Adam'> {'lr': 0.01, 'foreach': True} torch.float32 TorchDynamo compilation metrics:
Function Runtimes (s)
------------------------------------ -------------------------------
_compile.<locals>.compile_inner 0.0329, 50.0806, 0.0041
OutputGraph.call_user_compiler 44.9924
```
`NUM_PARAMS=1000, foreach=True`:
```
<class 'torch.optim.adam.Adam'> {'lr': 0.01, 'foreach': True} torch.float32 TorchDynamo compilation metrics:
Function Runtimes (s)
------------------------------------ -------------------------------
_compile.<locals>.compile_inner 0.0389, 58.6069, 0.0043
OutputGraph.call_user_compiler 44.1425
```
### Discussion
- `has_complex` shortcut provides additional 2x dynamo speedup. It is not necessary to achieve a significant overall speedup.
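For context, a minimal sketch of the kind of `has_complex` shortcut being discussed (illustrative only; the real optimizer code differs):
```python
import torch

def _maybe_view_complex_as_real(params):
    # Shortcut: skip the per-parameter conversion entirely when nothing is complex.
    has_complex = any(torch.is_complex(p) for p in params)
    if not has_complex:
        return params
    return [torch.view_as_real(p) if torch.is_complex(p) else p for p in params]
```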
CC: @janeyx99 @mlazos
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110607
Approved by: https://github.com/janeyx99, https://github.com/lezcano
Summary:
Sometimes local_shards are empty on some ranks and out.dtype is float16, which will cause an error if enforce_dtype is True because `data` will be float32.
Callers know best what dtype they want, so we can just let callers decide.
Temporarily keep enforce_dtype for backward compatibility.
Test Plan: Run local and MAST job
Reviewed By: uciyc123
Differential Revision: D46886551
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110561
Approved by: https://github.com/wanchaol, https://github.com/malfet
Summary:
Since we changed the IR that we are working with to pre-autograd aten IR, it's now easier
to use plain pattern matching instead of relying on source_matcher_utils. This
PR refactors the annotation for conv to use aten ops directly.
Also fixed the reentrant test after this change.
Test Plan:
python test/test_quantization.py TestQuantizePT2E
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110308
Approved by: https://github.com/kimishpatel
When we convert to a local tensor, DTensor can't track the autograd or
gradient layout of the local tensor anymore. If the user does something unexpected, there
needs to be a way for the user to hint at the gradient layout of the
local tensor.
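A hedged sketch of what such a hint could look like (the `grad_placements` keyword is an assumption based on this description, not confirmed here; assumes a process group is already initialized):
```python
import torch
from torch.distributed._tensor import DeviceMesh, distribute_tensor, Shard

mesh = DeviceMesh("cuda", list(range(torch.cuda.device_count())))
dt = distribute_tensor(torch.randn(8, 8, requires_grad=True), mesh, [Shard(0)])

# Hint that the gradient flowing back into the local tensor is sharded on dim 0,
# rather than whatever layout DTensor would otherwise assume.
local = dt.to_local(grad_placements=[Shard(0)])
```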
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110629
Approved by: https://github.com/zdevito
Summary:
Original commit changeset: 03980fb054d5
Original Phabricator Diff: D49519512
Bisecting shows that this diff is the cause of S369683. Since this affects Ads production, we need to back out this diff immediately.
Test Plan: See S369683
Reviewed By: ezyang
Differential Revision: D49958638
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110622
Approved by: https://github.com/yanboliang
To reduce the amount of logs:
* for successes, only print the part that says what tests ran and don't print the rest; zip the log into an artifact. The line listing all the test names is really long, but if you view the source of the raw logs it will not wrap, so it will only be one line. The log classifier can also be configured to ignore this line. This gets rid of lines like `test_ops.py::TestCommonCPU::test_multiple_devices_round_cpu_int64 SKIPPED [0.0010s] (Only runs on cuda) [ 9%]`
* for failures/reruns, print logs. Do not zip.
Also
* change log artifact name
Examples of various logs:
a074db0f7f failures
1b439e24c4 failures
possibly controversial haha
should i include an option for always printing?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110033
Approved by: https://github.com/huydhn
Summary:
D49187352 caused our model conversion and loading of the QAT checkpoint to get stuck with a thrift timeout.
We are actively checking in final code and the model for the static-quant HTP prod model, and encountered this breakage at head on Thursday.
A thrift timeout is not a failure, and because of that it's hard to bisect and find this culprit. It is also hard to set up a unit test, because the job simply times out. A better test is needed to guard downstream model conversion against upstream changes.
Our suspicion of why this diff broke us: we create a lot of modules with QAT (in a recursive manner), but our model is not a QAT-traceable module (it is a graph with many QAT modules and floating-point modules). With functools.partial as in the original diff, we end up caching modules in memory, causing the memory of the machine to be taken up completely.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110392
Approved by: https://github.com/junesg, https://github.com/jerryzh168
Defining kernels as static vars is problematic for subsequently loading the model on non-default CUDA devices:
if those kernels were loaded in the context of device #0, they are no longer nullptr, and therefore the kernels won't work on devices other than device #0.
This change makes the device remembered at the model level in AOT mode.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110554
Approved by: https://github.com/chenyang78, https://github.com/desertfire
Fixes #106698
Also added a check for the Python API, because the current error message
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/sehoon/pytorch-latest/torch/nn/modules/adaptive.py", line 128, in __init__
or (min(cutoffs) <= 0) \
ValueError: min() arg is an empty sequence
```
is not very comprehensible.
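For illustration, a hedged sketch of the kind of call that used to surface the opaque message above (assuming the check lives in nn.AdaptiveLogSoftmaxWithLoss, which is what torch/nn/modules/adaptive.py implements):
```python
import torch.nn as nn

# An empty cutoffs list previously failed with "min() arg is an empty sequence";
# with the added check it should raise a clearer ValueError about cutoffs.
m = nn.AdaptiveLogSoftmaxWithLoss(in_features=16, n_classes=10, cutoffs=[])
```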
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106777
Approved by: https://github.com/albanD
Ideally all `_dynamo.exc.UserError`s should have "case names", i.e., link to examples in `exportdb`.
This PR adds case names to several instances of `_dynamo.exc.UserError`. In particular, looking at coverage based on `UserErrorType`:
* `DYNAMIC_CONTROL_FLOW`, `ANTI_PATTERN`, and `STANDARD_LIBRARY` are fully covered.
* `CONSTRAINT_VIOLATION` and `DYNAMIC_DIM` have no coverage. We don't seem to have any dedicated examples of specifying dynamic shapes in `exportdb` (although they are used in some other examples without explanation, to avoid some specialization that would make such examples moot).
* `INVALID_INPUT` is only partly covered. Frankly this is tedious to cover via examples.
Differential Revision: [D49928518](https://our.internmc.facebook.com/intern/diff/D49928518/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110555
Approved by: https://github.com/angelayi, https://github.com/ydwu4
Summary: Today, we get different batch norm ops depending on
the device the model is placed on at export time. Exporting
`model.cpu()` gives `_native_batch_norm_legit`, while exporting
`model.cuda()` gives `cudnn_batch_norm`. QAT fusion currently
only supports the former and silently ignores the latter. This
commit fixes this by additionally matching on the latter op
during QAT fusion.
Test Plan:
python test/test_quantization.py TestQuantizePT2EQAT.test_qat_conv_bn_fusion
python test/test_quantization.py TestQuantizePT2EQAT.test_qat_conv_bn_relu_fusion
Reviewers: jerryzh168, kimishpatel
Subscribers: jerryzh168, kimishpatel, supriyar
Differential Revision: [D49615145](https://our.internmc.facebook.com/intern/diff/D49615145)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109908
Approved by: https://github.com/jerryzh168
There is an issue with float8 type promotion, because _promoteTypesLookup doesn't contain records for a few types between bfloat16 and float8.
I have simply moved the float8 types just after bfloat16; however, I'm not sure whether this breaks serialization.
Please decide if it can stay like this, or whether I should instead insert the missing records filled with "ud" into _promoteTypesLookup rather than moving the types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110279
Approved by: https://github.com/albanD
Part of #109684
- #109684
Changes:
- Add new functions `tree_structure`, `tree_leaves`, `tree_map_` and `tree_map_only_` to Python pytree (a brief usage sketch follows this list).
- Extract reusable tests for pytree to `TestGenericPytree`.
- Change `treespec_dumps` and `treespec_loads` in C++ pytree to call Python pytree and use JSON string as serialization type.
- Rename `torch.utils.pytree` -> `torch.utils._cxx_pytree`.
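A brief usage sketch of the newly added Python pytree helpers (a minimal example; behavior follows `torch.utils._pytree`):
```python
import torch.utils._pytree as pytree

tree = {"a": [1, 2], "b": (3, 4)}
leaves = pytree.tree_leaves(tree)    # [1, 2, 3, 4]
spec = pytree.tree_structure(tree)   # TreeSpec describing the container layout

# In-place map variant: applies the fn for its side effects and returns the original tree.
pytree.tree_map_(print, tree)
```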
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110395
Approved by: https://github.com/zou3519
We want to be able to use SingletonSymNode to represent strides for Jagged layout tensors. The following is for 3D, but easily generalizes to higher dimensions.
Constraints:
- [B, x, D] (where x represents the "variable-length dim") can be strided in two ways, [x, 1, sum(x)] and [dx, d, 1]. We need two different placeholder values depending on how the jagged tensor is strided.
- When doing operations we need the strides of output tensors to be expressible in terms of the strides and sizes of the inner tensors. Given [B, x, D] @ [D, D'], the output strides are [x * D', D', 1] rather than some opaque [x2, D', 1]. This constraint exists because if I'm tracing, I need a symint to represent the output stride. This symint needs to come from somewhere; it can be obtained in several ways: (1) create a constant, (2) unbacked symint, (3) create a new input using a source, (4) output of an operation on an existing symint. It is clear that (4) is what we want here, which brings us to the design below.
Design:
Given the two constraints, the most straightforward way to implement this is actually to update SingletonSymNode to include some scalar factor, i.e. Morally, SingletonSymNode represents `factor * [s_0, s_1, …, s_n]` This enables us to symbolically compute strides from sizes.
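As a plain-tensor sanity check of the stride arithmetic above (ordinary dense tensors, no SingletonSymNode involved):
```python
import torch

B, X, D, D_out = 2, 5, 3, 7
t = torch.empty(B, X, D)
assert t.stride() == (X * D, D, 1)

# Matmul against [D, D_out] produces a contiguous [B, X, D_out] output,
# whose strides are expressible from the sizes: [X * D_out, D_out, 1].
out = t @ torch.empty(D, D_out)
assert out.stride() == (X * D_out, D_out, 1)
```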
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110369
Approved by: https://github.com/ezyang
ghstack dependencies: #110044
When mapping between the original signature of a program and the graph-captured signature of its exported program, we emit errors when we see unexpected original or graph-captured inputs or outputs.
These errors can arise because of various reasons, e.g.:
1. some input or output has been lifted because of mutation
2. some type is not pytree-registered for flattening / unflattening
3. some type cannot be realized with graph operations
(This is probably not an exhaustive list.)
Previously we used to emit errors based on a vanilla id-based membership check between the two sides, mostly anticipating (1) as the reason for errors. But this does not do justice to errors because of (2) or (3).
This PR emits a different error when it finds (3) to be a probable cause. Specifically, it considers only Tensor and Sym* types to be "supported": no other type seems to be realizable by graph operations.
When (2) is a probable cause, we sometimes also hit the same error because we would expect the supported types to show through upon registration. But this kind of error may need some more work in the future.
Differential Revision: [D49885828](https://our.internmc.facebook.com/intern/diff/D49885828/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110472
Approved by: https://github.com/ydwu4
Summary:
This PR removed several APIs from the AOTInductor interface,
which are not used by the client.
It also simplified AOTInductor's model class by removing
the dim info for input/output tensors. We included dim info
before to return max output shapes, which was used by the client
to allocate memory for output tensors. Now, we allocate output
tensor memory from the .so so that we don't need to maintain
such information any more. The deletion of dim info from
the model class also simplified the codegen quite a bit.
Test Plan: ci
Reviewed By: khabinov
Differential Revision: D49835430
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110411
Approved by: https://github.com/khabinov, https://github.com/desertfire, https://github.com/jansel
Summary: Registering a param/buffer writes into a vector inside Object; we need to maintain thread safety if threads read from and write to the vector at the same time.
Test Plan: CI
Differential Revision: D49882601
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110488
Approved by: https://github.com/davidberard98
Summary: This diff refactors the code by moving CUDAAllocatorConfig into the header file. This config refactoring is done so that we can use the same config code for CUDA pinned memory as well.
Test Plan: sandcastle
Differential Revision: D49653265
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110123
Approved by: https://github.com/zdevito
Summary:
See wrapper.codegen_reinterpret_view(): it returns a temporary handle for a tensor, which has the following problem.
```
# NB, the return handle here represents a temporary tensor, which will be automatically
# released.
# Here's a sample usage in the cpp wrapper code:
# ```
# aoti_torch_addmm_out(
# buf1,
# arg1_1,
# RAIIAtenTensorHandle(tmp_tensor_handle_0),
# buf0,
# 1L,
# 1L));
# ```
# RAIIAtenTensorHandle(tmp_tensor_handle_0) will be released after the call to addmm_out.
# This could be problematic when it's used in a different pattern, for example:
# ````
# AtenTensorHandle tensor_args[] = {RAIIAtenTensorHandle(tmp_tensor_handle_2), buf5, buf6};
# aoti_torch_proxy_executor_call_function(..., tensor_args);
# ````
# RAIIAtenTensorHandle(tmp_tensor_handle_2) will be invalid when it's used in the latter
# kernel call.
return f"RAIIAtenTensorHandle({tmp_name})"
```
As a result, ProxyExecutor would generate the following code, which causes an invalid memory access.
Before:
```
// Source Nodes: [fn_with_tuple_output], Original ATen: [fb.fn_with_tuple_output]
AtenTensorHandle tmp_tensor_handle_2;
AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch__reinterpret_tensor(buf3, 2, int_array_0, int_array_1, 0L, &tmp_tensor_handle_2));
...
AtenTensorHandle tensor_args[] = {RAIIAtenTensorHandle(tmp_tensor_handle_2), buf5, buf6};
int64_t int_args[] = {1};
aoti_torch_proxy_executor_call_function(proxy_executor, 1, 1, int_args, 3, tensor_args);
buf3.reset();
```
With the fix in this diff, ProxyExecutor generates the following code.
After:
```
// Source Nodes: [fn_with_tuple_output], Original ATen: [fb.fn_with_tuple_output]
AtenTensorHandle tmp_tensor_handle_2;
AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch__reinterpret_tensor(buf3, 2, int_array_0, int_array_1, 0L, &tmp_tensor_handle_2));
...
aoti_torch_proxy_executor_call_function(proxy_executor, 1, 1, std::vector<int64_t>{1}.data(), 3, std::vector<AtenTensorHandle>{RAIIAtenTensorHandle(tmp_tensor_handle_2), buf5, buf6}.data());
buf3.reset();
```
I am not exactly a big fan of such `std::vector{...}.data()` for creating a temp array, but I can't think of another fix.
Test Plan: buck2 run mode/dev-nosan deeplearning/aot_inductor/test:test_custom_ops
Reviewed By: desertfire
Differential Revision: D49758764
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110451
Approved by: https://github.com/desertfire
Summary: Point-to-point ops don't enqueue their work to the `workMetaList_`, which means the NCCL watchdog does not watch over them, hence they do not respect the collective timeouts.
Test Plan:
While trying to add a test, I found we don't have tests which validate the NCCL watchdog. It looks like this is because we don't have a good way to detect when the NCCL watchdog has thrown an error (the exception is thrown in a side thread) in our testing framework / `MultiprocessTestCase`.
I manually tested this change with the script in https://github.com/pytorch/pytorch/issues/109401, but need to look more closely at how to automate a test for the NCCL watchdog.
Differential Revision: D49418976
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109611
Approved by: https://github.com/wconstab
Triplet Margin Loss takes in a Callable `distance_function` parameter, which is not supported as an argument on the fx graph. See the previous error:
> File "/scratch/eellison/work/pytorch/torch/_dynamo/symbolic_convert.py", line 562, in call_function
self.push(fn.call_function(self, args, kwargs))
File "/scratch/eellison/work/pytorch/torch/_dynamo/variables/torch.py", line 723, in call_function
*proxy_args_kwargs(args, kwargs),
File "/scratch/eellison/work/pytorch/torch/_dynamo/utils.py", line 504, in proxy_args_kwargs
f"call_function args: {typestr(*args)} {typestr(*list(kwargs.values()))}"
File "/scratch/eellison/work/pytorch/torch/_dynamo/exc.py", line 143, in unimplemented
raise Unsupported(msg)
torch._dynamo.exc.Unsupported: call_function args: TensorVariable() TensorVariable() TensorVariable() ConstantVariable(float) NNModuleVariable()
This is fixable by just inlining into `triplet_margin_loss` and continuing to compile it. This required support for `has_torch_function_variadic`.
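A hedged sketch of the now-compilable pattern (illustrative shapes and modules; the key point is the Callable distance_function being inlined rather than causing a graph break):
```python
import torch
import torch.nn as nn

loss = nn.TripletMarginWithDistanceLoss(distance_function=nn.PairwiseDistance())

@torch.compile
def f(anchor, positive, negative):
    return loss(anchor, positive, negative)

out = f(torch.randn(4, 8), torch.randn(4, 8), torch.randn(4, 8))
```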
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110302
Approved by: https://github.com/mlazos
The first reland broke internal (failing diff: D49617462).
The major error looks like it's because there's an internal-only higher order op that needs a new functionalization rule. I'm going to land an internal diff for that and confirm tests pass before relanding this PR.
Also confirmed that the issue from https://github.com/pytorch/pytorch/issues/110121 is fixed, and added a test.
This reverts commit 1b90f07f5a.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110079
Approved by: https://github.com/ezyang
Summary: This diff fixes a heap underflow found by fuzzing in torch/csrc/jit/runtime/vararg_functions.cpp
Test Plan:
CI and
```
arc lionhead crash reproduce 1753074381791061
```
doesn't crash anymore.
Differential Revision: D49537535
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110441
Approved by: https://github.com/Skylion007
Summary:
Previously, we linked against CUDA libs even for the pure cpp backend.
This caused issues for cases where the inference platform does not
have GPUs. This diff removes the CUDA dependency for the cpp backend.
Reviewed By: bertmaher, muchulee8, mikekgfb
Differential Revision: D49800712
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110409
Approved by: https://github.com/bertmaher, https://github.com/desertfire
Description:
- Fixed a misleading test sample case.
Context: the sample input is composed of an input tensor `(N, C, iH, iW)` and a grid tensor `(N, oH, oW, 2)`; however, the grid was defined as `(N, C, oW, 2)`.
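For reference, a small sketch of the expected shapes (this mirrors the grid_sample convention, which is an assumption about which op the sample targets):
```python
import torch
import torch.nn.functional as F

inp = torch.randn(1, 3, 8, 8)          # (N, C, iH, iW)
grid = torch.rand(1, 4, 4, 2) * 2 - 1  # (N, oH, oW, 2), sampling coords in [-1, 1]
out = F.grid_sample(inp, grid, align_corners=False)
assert out.shape == (1, 3, 4, 4)       # (N, C, oH, oW)
```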
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110383
Approved by: https://github.com/peterbell10
**Background**: recordStream calls can result in memory spikes, so we don't want them to appear in FSDP (https://dev-discuss.pytorch.org/t/fsdp-cudacachingallocator-an-outsider-newb-perspective/1486). @awgu is working on fixing this, but it turns out the profiler was causing recordStream to get called when it is enabled.
Why profiler was causing recordStream to get called: NCCL calls add profiler events manually; they register a callback to be executed when the future for the collective is completed; this indicates the end of the CPU-side profiler event for the callback:
c2c7c4035f/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (L1822-L1824)
In order to guarantee safety, ivalue::Future::invokeCallback calls `recordStream` on the future's storage buffers; this marks the fact that other streams (e.g. the one that the callback runs on) may need to use the storage.
c2c7c4035f/aten/src/ATen/core/ivalue_inl.h (L1171-L1173)
**Change**: The end-profiler-event callback doesn't actually use the future, so we don't need to recordStream on it. This PR introduces an optional parameter `uses_future` for adding callbacks; a user can set this variable to "false" to unsafely skip the recordStream, if the user knows that the future will not be used in the lambda.
**Tests**: (a) unit tests; (b) added an assert in recordStream: c2c7c4035f/c10/cuda/CUDACachingAllocator.cpp (L3260) and verified that it doesn't get triggered when running basic distributed tests w/ profiler enabled
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109933
Approved by: https://github.com/wconstab
Summary: after converting nn.MultiheadAttention we weren't deleting the
old in_proj_weight and in_proj_bias, despite not (really) using them.
Test Plan: python test/test_quantization.py -k
"test_custom_module_multi_head_attention"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110407
Approved by: https://github.com/jerryzh168
This PR adds backend as a property to DTensorTestBase and adds "cpu:gloo,cuda:nccl" support in DTensorTestBase, so that we can use the `cpu:gloo,cuda:nccl` backend for checkpoint unit tests.
cc. @wanchaol, @fduwjj, @XilunWu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110397
Approved by: https://github.com/wanchaol
Partially fixes `test_memory_format_factory_like_functions_preserve` with PYTORCH_TEST_WITH_INDUCTOR. Inductor preserves memory layouts for user-visible outputs as annotated on the fx graph that it is passed. That graph is generated by running aot_autograd with decompositions, so if the decompositions give incorrect strides, so will inductor.
This preserves the layout of `_like` operators when it corresponds to a `torch.memory_format`. It doesn't fix a) arbitrary permutations, b) striding of non-dense outputs. Both of these are lower priority compared to preserving channels-last. We would need either https://github.com/pytorch/pytorch/issues/92920 or a `to` variant that takes in a physical layout for arbitrary permutations. I converted the output of rand to the correct layout instead of passing the layout in, so that this would compose with the `replace_random` pass, and because the two pointwise ops will get fused anyway.
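A hedged illustration of the behavior being preserved (a sketch; assumes rand_like's default preserve_format semantics carry through under torch.compile, which is the point of this change):
```python
import torch

@torch.compile
def f(x):
    return torch.rand_like(x)

x = torch.randn(2, 3, 8, 8).to(memory_format=torch.channels_last)
y = f(x)
assert y.is_contiguous(memory_format=torch.channels_last)
```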
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110242
Approved by: https://github.com/int3