Summary: During inference, the intermediate graphs produced for optimization are not used, so the Executor's graph is the only graph we need to keep around; these two flags let us drop the others.
Test Plan:
The flags are all off by default.
baseline
```
buck run mode/opt-clang sigrid/predictor/client/localnet:run_model -- --model_id_to_load=951679039 --model_snapshot_to_load=244 --torch_jit_do_not_store_optimized_graph=true
I1212 10:24:20.407408 401092 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 951679039_244 is 182863 Kb
```
```
buck run mode/opt-clang sigrid/predictor/client/localnet:run_model -- --model_id_to_load=951679039 --model_snapshot_to_load=244 --torch_jit_do_not_store_optimized_graph=true --torch_jit_release_profiling_graph_after_optimization=true
I1212 10:31:37.663487 464000 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 951679039_244 is 186127 Kb
```
```
buck run mode/opt-clang sigrid/predictor/client/localnet:run_model -- --model_id_to_load=951679039 --model_snapshot_to_load=244 --torch_jit_do_not_store_optimized_graph=true --torch_jit_release_profiling_graph_after_optimization=true --torch_jit_execution_plan_avoid_extra_graph_copy=true
I1212 10:29:42.848093 447218 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 951679039_244 is 129451 Kb
```
Differential Revision: D52081631
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115657
Approved by: https://github.com/houseroad
As titled: when using SAC + torch.compile, we currently only check for
functional tensors and not for other tensor subclasses, so SAC
under torch.compile would ignore tensor types like tensor
subclasses. Fixed in this PR.
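A minimal sketch of the kind of check involved, assuming `FunctionalTensor` and `is_traceable_wrapper_subclass` are the relevant helpers; the actual fix lives in the SAC / torch.compile integration:
```python
import torch
from torch._subclasses.functional_tensor import FunctionalTensor
from torch.utils._python_dispatch import is_traceable_wrapper_subclass

# Classify what reaches the SAC policy under torch.compile: previously only
# FunctionalTensor was special-cased, so traceable wrapper subclasses fell
# through and were effectively ignored.
def classify(t: torch.Tensor) -> str:
    if isinstance(t, FunctionalTensor):
        return "functional"
    if is_traceable_wrapper_subclass(t):
        return "wrapper subclass"  # now handled instead of ignored
    return "plain"
```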
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115960
Approved by: https://github.com/bdhirsh
Summary:
Refactor the inactive constant buffer update to also allow updating the active
buffer.
Test Plan:
Existing test for inactive buffer updates.
UpdateConstantsCuda in the cpp test for active buffer updates.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116001
Approved by: https://github.com/chenyang78
Summary:
This change makes the input tensor contiguous for DTensor reduce-scatter in the case where no padding is needed.
No exception is thrown during training, but without this change we ran into numerical correctness issues.
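A minimal sketch of the shape of the fix, assuming a hypothetical helper around `reduce_scatter_tensor`; the real change is inside DTensor's collective path:
```python
import torch
import torch.distributed as dist

def reduce_scatter_no_padding(out: torch.Tensor, shard: torch.Tensor, pg) -> None:
    # Even when no padding is required, a non-contiguous input can silently
    # produce wrong numerics, so force contiguity before the collective.
    if not shard.is_contiguous():
        shard = shard.contiguous()
    dist.reduce_scatter_tensor(out, shard, group=pg)
```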
Test Plan:
**CI**
CI tests
**WHEN model test**:
- Verified the loss for each iteration is within the expected range.
- Verified NE is on par with this change on 4B training data.
Differential Revision: D52170822
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115847
Approved by: https://github.com/wanchaol
Adds support for something we need for both FSDP and optimizers. For sourced args that are not inputs (params, etc.), we use the dynamic_getattr flow on tensors. This soundly handles the storage, registration, and guarding downstream of tensor_wrap for the grad values. For non-sourced args (true intermediates), we only support None (the idea being that if we have a true intermediate in the graph with a grad, we are already doing something weird).
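A repro-style sketch of the access pattern this enables (names and shapes are illustrative; whether a given snippet stays in one graph depends on the surrounding support):
```python
import torch

p = torch.nn.Parameter(torch.randn(4))
p.sum().backward()  # populate p.grad

@torch.compile
def read_grad():
    # p is a sourced tensor that is not a graph input; its .grad is now
    # handled via the dynamic_getattr flow described above.
    return p.grad + 1

print(read_grad())
```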
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115898
Approved by: https://github.com/bdhirsh
ghstack dependencies: #115315, #112184
Summary:
This is important for writing aten-IR-based graph transformations.
```
In [4]: [x.name for x in torch.ops.aten.reshape.default._schema.arguments]
Out[4]: ['self', 'shape']
In [8]: torch.ops.aten.reshape.default(torch.rand(1,2), shape=[2])
Out[8]: tensor([0.7584, 0.4834])
# === CANNOT CALL `self` BY KWARGS ===
In [7]: torch.ops.aten.reshape.default(self=torch.rand(1,2), shape=[2])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[7], line 1
----> 1 torch.ops.aten.reshape.default(self=torch.rand(1,2), shape=[2])
TypeError: OpOverload.__call__() got multiple values for argument 'self'
```
# Where's the problem?
1. the aten ops first arg is usually named `self` (aten/src/ATen/native/native_functions.yaml)
2. Unfortunately, in `torch._ops.{OpOverload, OpOverloadPacket}.__call__()`, the first arg is (by python convention) named `self` too.
So when `self` is passed as a kwarg, `OpOverloadPacket.__call__` receives:
```
OpOverloadPacket.__call__(self, {"self": ...})
```
Python does not allow a value for the same argument to be supplied twice, and hence
> TypeError: OpOverload.__call__() got multiple values for argument 'self'
# How to fix?
**Note that**, in the above, `self` is an instance of `OpOverloadPacket`, and the "self" kwarg is the input tensor to the aten op. To fix, we only need to differentiate the two `self`s.
In Python, the first arg of a method does not need to be named `self`. So we change the `__call__` definition to:
```
def __call__(_self, ...):
```
Now the call becomes:
```
OpOverloadPacket.__call__(_self, {"self": ...})
```
where:
* `_self` is the `OpOverloadPacket` instance
* `"self"` is the input tensor to the aten op.
Test Plan:
```
In [4]: [x.name for x in torch.ops.aten.reshape.default._schema.arguments]
Out[4]: ['self', 'shape']
In [3]: torch.ops.aten.reshape.default(self=torch.rand(1,2), shape=[2])
Out[3]: tensor([0.5127, 0.3051])
```
Differential Revision: D51731996
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114920
Approved by: https://github.com/houseroad
Helps call attention to any cases where the dump actually times out.
The timeout is likely to hit if we run into slow stacktrace processing.
Log any exceptions encountered in the background thread, but don't raise
them; we're already willing to abandon the debug dump, and want to
proceed with our normal execution (in the case of dumppipe) or shutdown
process (when dumping happens on timeout and shutdown is already
initiated).
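A rough Python sketch of the behavior described above (the real implementation is in the C++ watchdog path); names are illustrative:
```python
import logging
import threading

def dump_debug_info(dump_fn, timeout_s: float) -> None:
    def worker():
        try:
            dump_fn()
        except Exception:
            # Log but never raise: we're willing to abandon the dump and
            # continue with normal execution or shutdown.
            logging.exception("debug dump raised; continuing anyway")

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout_s)
    if t.is_alive():
        logging.warning(
            "debug dump timed out after %.1fs (slow stacktrace processing?)",
            timeout_s,
        )
```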
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115876
Approved by: https://github.com/zdevito
ghstack dependencies: #115807
## Summary
This PR added 3 intra-node GPU allreduce algorithms to PyTorch:
- One-shot allreduce (inspired by FasterTransformer): all ranks simultaneously read and accumulate data from other ranks.
- Two-shot allreduce (inspired by FasterTransformer): all ranks simultaneously read and accumulate `1 / world_size` of the data from other ranks. Then all ranks read the accumulated data from other ranks. (Effectively one-shot reduce-scatter + one-shot all-gather.)
- Hybrid cube mesh allreduce (original): a one-shot allreduce variant that avoids transmission over PCIe on HCM topology.
## Micro Benchmarks
## Details
The intra-node algos are organized behind `c10d::IntraNodeComm`, which is responsible for:
- Managing handshaking and cuda IPC handle exchange among ranks.
- Querying NVLink connection and detecting topology.
- Performing algo selection based on available info.
- Launching the selected allreduce kernel.
`c10d::IntraNodeComm` is integrated into `c10d::ProcessGroupNCCL` as follows:
- When the `ENABLE_INTRA_NODE_COMM` environment variable is set, `c10d::ProcessGroupNCCL` initializes a `c10d::IntraNodeComm` for its ranks.
- If the setup is not suitable for intra-node comm (e.g. not all ranks are from the same node), the rendezvous logic guarantees all participants fall back consistently.
- `c10d::ProcessGroupNCCL::allreduce` consults `c10d::IntraNodeComm` on whether to use intra-node allreduce and carries out the communication accordingly.
We currently detect two types of topologies from the NVLink connection mesh (a selection sketch follows this list):
- Fully connected: all GPU pairs have a direct NVLink connection (e.g. NVSwitch or a fully connected subset of a hybrid cube mesh)
- `msg <= 256KB`: one-shot allreduce.
- `256KB < msg <= 10MB`: two-shot allreduce.
- `msg > 10MB`: instructs the caller to fallback to NCCL.
- Hybrid cube mesh
- `msg <= 256KB`: one-shot allreduce.
- `msg > 256KB`: instructs the caller to fallback to NCCL.
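A Python sketch of the selection policy above; the real logic is in `c10d::IntraNodeComm` (C++) and the thresholds are the ones quoted here:
```python
ONE_SHOT_MAX = 256 * 1024        # 256KB
TWO_SHOT_MAX = 10 * 1024 * 1024  # 10MB

def select_allreduce_algo(msg_bytes: int, topology: str) -> str:
    if topology == "fully_connected":
        if msg_bytes <= ONE_SHOT_MAX:
            return "one_shot"
        if msg_bytes <= TWO_SHOT_MAX:
            return "two_shot"
        return "fallback_to_nccl"
    if topology == "hybrid_cube_mesh":
        return "one_shot" if msg_bytes <= ONE_SHOT_MAX else "fallback_to_nccl"
    return "fallback_to_nccl"
```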
## Next Steps
- Fine tune algo selection based on GPU model, topology, link speed.
- Potentially optimize the two-shot allreduce impl. According to FasterTransformer, two-shot allreduce is preferred up to 50MB. There might be room for improvement, but PyTorch does impose more constraints:
- FasterTransformer uses a single process to drive multiple devices. It can use `cudaDeviceEnablePeerAccess` to enable device-level peer access.
- PyTorch uses multiple processes to drive multiple devices. With cuda IPC, a device can only share a specific region with other devices. This means extra copies may be unavoidable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114001
Approved by: https://github.com/yf225
**Summary**
Change the QConv2d binary fusion post-op name from `add` to `sum`, since for now we are actually using OneDNN `post op sum` instead of `Binary_Add`.
**TestPlan**
```
python -m pytest test_quantized_op.py -k test_qconv2d_sum_pt2e
python -m pytest test_quantized_op.py -k test_qconv2d_sum_relu_pt2e
python -m pytest test_quantized_op.py -k test_qconv2d_sum_relu_float_output_pt2e
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115329
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
Summary:
Both cuSPARSELt and CUTLASS support 1:2 semi-structured sparsity for
fp32, which this PR enables (thanks @alexsamardzic).
Furthermore, this PR also updates the sparse_config to take into account
the different shape constraints for sparse and dense matrices.
Technically, cuSPARSELt supports smaller sparse matrix constraints, as it
seems to pad to the CUTLASS constraints under the hood. However, in
practice small sparse matrices are not commonly used and we care more
about the dense constraints for LLM inference.
For now, we keep the CUTLASS constraints in place for both cuSPARSELt
and CUTLASS tensors.
This PR also reconnects the _FUSE_TRANSPOSE flag for cuSPARSELt tensors.
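A hedged usage sketch of the fp32 path enabled here (shapes are illustrative and must satisfy the CUTLASS-style constraints kept in place; the tensor below has a 1:2 pattern by construction):
```python
import torch
from torch.sparse import to_sparse_semi_structured

# 128x128 fp32 matrix where every pair of elements contains one zero (1:2).
A = torch.tensor([[0.0, 1.0]], device="cuda").tile(128, 64)
A_sparse = to_sparse_semi_structured(A)

B = torch.rand(128, 128, device="cuda")
out = torch.mm(A_sparse, B)  # sparse x dense matmul using the 1:2 kernels
```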
Test Plan:
```
python test/test_sparse_semi_structured.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115550
Approved by: https://github.com/cpuhrsch
Tests that are added to a list in dynamo_test_failures.py will
automatically be marked as expectedFailure when run with
PYTORCH_TEST_WITH_DYNAMO=1. I'm splitting this PR off on its own so that
I can test various things on top of it.
Also added an unMarkDynamoStrictTest that is not useful until we turn
on strict mode by default.
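A sketch of how the list is meant to be used (the exact variable name in `dynamo_test_failures.py` may differ):
```python
# In test/dynamo_test_failures.py: tests listed here are wrapped in
# unittest.expectedFailure when running with PYTORCH_TEST_WITH_DYNAMO=1.
dynamo_expected_failures = {
    "TestFoo.test_bar",  # hypothetical test id
}
```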
Test Plan:
- code reading
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115845
Approved by: https://github.com/voznesenskym
Noticed that on many MRS kernels the grid wrapper for autotuning is huge, with a bunch of duplicates, because num_warps and num_stages are not needed for the grid calculation. Let's deduplicate these entries.
Previously, we would see wrapper like
```
def grid_wrapper_for_add_kernel_2d_autotuned_0(meta):
    if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
    if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
    if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
    if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
```
now it looks like
```
def grid_wrapper_for_add_kernel_2d_autotuned_0(meta):
    if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
    if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
```
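A minimal sketch of the deduplication idea, assuming configs arrive as (kwargs, grid) pairs; the actual change is in the inductor wrapper codegen:
```python
def dedupe_grid_entries(configs):
    # num_warps / num_stages don't affect the grid, so drop them from the key
    # and keep only the first entry per unique remaining config.
    seen, unique = set(), []
    for kwargs, grid in configs:
        key = tuple(sorted(
            (k, v) for k, v in kwargs.items()
            if k not in ("num_warps", "num_stages")
        ))
        if key not in seen:
            seen.add(key)
            unique.append((kwargs, grid))
    return unique
```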
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115849
Approved by: https://github.com/jansel
The mutex was originally added to avoid racing to dump debuginfo,
where a race in this case would result in a corrupted dump file.
The reason a mutex helps is that it forces all dump requests to be
serialized, so that an observer would either see an in-progress file, a
complete file, or no file. Without a mutex, a fourth state is possible
(a file that has been written to by multiple threads and is invalid).
Because the mutex was a ProcessGroupNCCL class member, and each PG
instance has its own watchdog thread that can launch a dump, it was not
doing its job. Making the mutex static shares it between instances of
the class and ensures serialization of dumps triggered by any PG.
(Note: dumps triggered by different PGs have the same, global contents
anyway- there is only one global flight recorder, so it doesn't matter
who triggers it.)
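A Python analogue of the fix (the real member is a C++ static mutex in ProcessGroupNCCL):
```python
import threading

class DebugDumper:
    # Class-level lock, shared by all instances: the moral equivalent of
    # making the C++ mutex static so dumps from any PG are serialized.
    _dump_lock = threading.Lock()

    def dump(self, write_fn):
        with DebugDumper._dump_lock:
            write_fn()
```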
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115803
Approved by: https://github.com/kwen2501
ghstack dependencies: #115771, #115798, #115800, #115801
Adds a PG {process group uid} prefix component to logs.
This is helpful in situations where there are multiple process groups,
and rank information by itself is confusing. (For example, rank0 on PG1
may correspond to rank3 on PG0. People may assume 'rank0' references
the global (PG0) world, but it may reference a sub-PG. Prefacing the PG
helps clarify this.)
Does NOT change logs from inside WorkNCCL functions, since WorkNCCL
doesn't know what PG ID it corresponds to. Will address these logs
separately.
Example:
```
[I ProcessGroupNCCL.cpp:787] [PG 0 Rank 0] ProcessGroupNCCL initialization ...
```
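For reference, the prefix has the shape produced by something like this (illustrative Python; the real formatting is in C++):
```python
def log_prefix(pg_uid: int, rank: int) -> str:
    return f"[PG {pg_uid} Rank {rank}]"

assert log_prefix(0, 0) == "[PG 0 Rank 0]"
```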
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115801
Approved by: https://github.com/fduwjj
ghstack dependencies: #115771, #115798, #115800
Put the repeated code that string formats [Rank {rank}] in one place.
Sets up for the next PR that also adds more info to this prefix.
(Does not change exception messages, which could be done as well;
exception messages are not formatted quite the same way. This PR tries
instead to keep from changing log behavior and only
refactor code.)
Did limited testing (some logs were observed OK).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115800
Approved by: https://github.com/fduwjj
ghstack dependencies: #115771, #115798
The NCCL flight recorder is per-process (it is shared by all
processgroups), but individual process groups used to construct their
own pipe for being signaled to dump the flight recorder.
This ensures that only one pipe per process is created, by only creating
the pipe on the first ProcessGroup (uid_ == 0) which should be the world
group.
Filenames are still keyed off of rank, but this should now be global
rank instead of sub-pg rank, making the filenames unique across the
whole trainer process.
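A sketch of the rule described above; `open_dump_pipe` and the filename pattern are hypothetical stand-ins for the real C++ implementation:
```python
def maybe_create_dump_pipe(pg_uid: int, global_rank: int, open_dump_pipe):
    # Only the first ProcessGroup (the world group, uid 0) creates the pipe,
    # so there is exactly one pipe per process, keyed by global rank.
    if pg_uid != 0:
        return None
    return open_dump_pipe(f"flight_recorder_rank_{global_rank}.pipe")
```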
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115798
Approved by: https://github.com/zdevito
ghstack dependencies: #115771
This pull request adds initial Flash Attention support for the AMD/ROCm platform. It adds a specialized Triton repository/branch as a compile-time dependency for the Flash Attention math library on AMD/ROCm. This Triton submodule is not used at runtime and will not be shipped in the final PyTorch package. We plan to release this specialized Triton as a separate project. A usage sketch appears after the limitations list below.
Known limitations:
- [ ] Only supports MI200 series GPUs (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`).
- [ ] Only supports power-of-two sequence lengths.
- [ ] No support for varlen APIs.
- [ ] Only supports head dimensions 16, 32, 64, 128.
- [ ] Performance is still being optimized.
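A usage sketch under the limitations above (shapes are illustrative: power-of-two sequence length, head dim 64):
```python
import torch
import torch.nn.functional as F

q, k, v = (
    torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
    for _ in range(3)
)
# Restrict SDPA to the flash kernel so the new ROCm path is exercised.
with torch.backends.cuda.sdp_kernel(
    enable_flash=True, enable_math=False, enable_mem_efficient=False
):
    out = F.scaled_dot_product_attention(q, k, v)
```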
Fixes https://github.com/pytorch/pytorch/issues/112997
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114309
Approved by: https://github.com/jeffdaily, https://github.com/malfet
---------
Co-authored-by: Joseph Groenenboom <joseph.groenenboom@amd.com>
## Summary
This PR added 3 intra-node GPU allreduce algorithms to PyTorch:
- One-shot allreduce (inspired by FasterTransformer): all ranks simultaneously read and accumulate data from other ranks.
- Two-shot allreduce (inspired by FasterTransformer): all ranks simultaneously read and accumulate `1 / world_size` of the data from other ranks. Then all ranks read the accumulated data from other ranks. (Effectively one-shot reduce-scatter + one-shot all-gather.)
- Hybrid cube mesh allreduce (original): a one-shot allreduce variant that avoids transmission over PCIe on HCM topology.
## Micro Benchmarks
## Details
The intra-node algos are organized behind `c10d::IntraNodeComm`, which is responsible for:
- Managing handshaking and cuda IPC handle exchange among ranks.
- Querying NVLink connection and detecting topology.
- Performing algo selection based on available info.
- Launching the selected allreduce kernel.
`c10d::IntraNodeComm` is integrated into `c10d::ProcessGroupNCCL` as follows:
- When the `ENABLE_INTRA_NODE_COMM` environment variable is set, `c10d::ProcessGroupNCCL` initializes a `c10d::IntraNodeComm` for its ranks (see the usage sketch after this list).
- If the setup is not suitable for intra-node comm (e.g. not all ranks are from the same node), the rendezvous logic guarantees all participants fall back consistently.
- `c10d::ProcessGroupNCCL::allreduce` consults `c10d::IntraNodeComm` on whether to use intra-node allreduce and carries out the communication accordingly.
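A usage sketch of the opt-in described above (launcher command shown as a comment; only the environment variable name comes from this description):
```python
# ENABLE_INTRA_NODE_COMM=1 torchrun --nproc-per-node=8 train.py
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())
t = torch.ones(1024, device="cuda")
dist.all_reduce(t)  # may take the intra-node path when eligible, else NCCL
```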
We currently detect two types of topologies from the NVLink connection mesh:
- Fully connected: all GPU pairs have a direct NVLink connection (e.g. NVSwitch or a fully connected subset of a hybrid cube mesh)
- `msg <= 256KB`: one-shot allreduce.
- `256KB < msg <= 10MB`: two-shot allreduce.
- `msg > 10MB`: instructs the caller to fallback to NCCL.
- Hybrid cube mesh
- `msg <= 256KB`: one-shot allreduce.
- `msg > 256KB`: instructs the caller to fallback to NCCL.
## Next Steps
- Fine tune algo selection based on GPU model, topology, link speed.
- Potentially optimize the two-shot allreduce impl. According to FasterTransformer, two-shot allreduce is preferred up to 50MB. There might be room for improvement, but PyTorch does impose more constraints:
- FasterTransformer uses a single process to drive multiple devices. It can use `cudaDeviceEnablePeerAccess` to enable device-level peer access.
- PyTorch uses multiple processes to drive multiple devices. With cuda IPC, a device can only share a specific region with other devices. This means extra copies may be unavoidable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114001
Approved by: https://github.com/yf225
This test isn't actually parametrized by `dtype`, so it seems to surface bogus failures where "unsupported" types "work" but in reality fp8 is used every time.
CC @drisspg I'm guessing this doesn't surface in upstream CI because there are no SM9.0 runners yet?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115661
Approved by: https://github.com/drisspg
Description:
- Added non-integer expr support for floordiv in triton codegen
- Added a test
- the cpp test is skipped as it is failing; https://github.com/pytorch/pytorch/pull/115647 may fix it
This PR fixes a compilation error with the following code:
```python
import torch
def func(x, a):
    n = (a * 1.234) // 8.234
    y = x + n
    return y
cfunc = torch.compile(func, dynamic=True, fullgraph=True)
device = "cuda"
x = torch.tensor(0, dtype=torch.float32, device=device)
a = 33
out = cfunc(x, a)
expected = func(x, a)
torch.testing.assert_close(out, expected)
```
Error message on Nightly:
```
File "/usr/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
raise self._exception
torch._dynamo.exc.BackendCompilerFailed: backend='compile_fx_wrapper' raised:
CompilationError: at 7:38:def triton_(in_ptr0, out_ptr0, ks0, xnumel, XBLOCK : tl.constexpr):
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:]
xmask = xindex < xnumel
x0 = xindex
tmp0 = tl.load(in_ptr0 + (x0), xmask)
tmp1 = ((1.23400000000000*ks0) // 8.23400000000000)
^
AssertionError()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115751
Approved by: https://github.com/peterbell10
This diff aims to directly import DeviceMesh from torch.distributed.device_mesh instead of importing it from dist._tensor. This is done to avoid a circular dependency issue. The code changes in each file of the diff are as follows:
- torch/distributed/_functional_collectives.py: import DeviceMesh from torch.distributed instead of dist._tensor.
Overall, this diff aims to improve the code by avoiding circular dependencies and improving the import statements.
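The import change, in short:
```python
# Before (routes through dist._tensor and can create a circular dependency):
#   from torch.distributed._tensor import DeviceMesh
# After:
from torch.distributed.device_mesh import DeviceMesh
```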
==
The above summary is generated by LLM with minor manual fixes. The following summary is by me.
The original import causes some issues when compiling DDP with compiled_autograd. The root cause of the compilation failure has not been identified, but it is good to fix the lazy initialization, which indirectly fixes the compilation issues for DDP.
Differential Revision: [D51857246](https://our.internmc.facebook.com/intern/diff/D51857246/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115649
Approved by: https://github.com/wconstab, https://github.com/wz337
ghstack dependencies: #115523, #115302, #115648
Fixes #114903
Previously, large-split variance reductions stored the intermediates in float16
precision, which may lead to overflow since the intermediate result is
unnormalized.
In #114903 we see two different `num_split` decisions made based on the
hardware capabilities, one of which has large enough intermediates to cause
overflows.
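A small illustration of the failure mode (not the inductor code itself): an unnormalized partial sum can exceed the fp16 range even though the final variance is tiny:
```python
import torch

x = torch.full((100_000,), 4.0)
partial_sum_sq = (x * x).sum()   # 1.6e6, fine in fp32
print(partial_sum_sq.half())     # inf: overflows if the intermediate is fp16
print(x.var(unbiased=False))     # 0.0: the normalized result is small
```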
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115181
Approved by: https://github.com/shunting314
We were only passing a subset of the group creation information to the
NCCL PG. We are specifically missing the information on which global
ranks belong to a particular PG.
This allows the NCCL PG to use this additional information for things
like better trace logging.
Test Plan:
OSS CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114736
Approved by: https://github.com/kwen2501
This implements an optional alternate interface to the AOTI
generated DSO, intended to increase efficiency for models running on
CPU and requiring minimal overhead. See comment in config.py for more
explanation.
This took a while to get right (e.g., I initially required 1-D
MiniArrayRef<T> for the inputs, but found that multi-dimensional
ArrayRefTensor<T> ended up simplifying the implementation and allowed
test_aot_inductor.py to run) and is somewhat intricate, so I am
anticipating that review will require some back-and-forth.
Differential Revision: [D50699890](https://our.internmc.facebook.com/intern/diff/D50699890/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D50699890/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112800
Approved by: https://github.com/chenyang78