pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
cyy	b61a556427	Turn onnx functions into static (#147598 ) To avoid exposing ONNX symbols. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147598 Approved by: https://github.com/justinchuby	2025-02-21 07:40:28 +00:00
PyTorch MergeBot	3395da7f7c	Revert "Build a storage reader/writer to write checkpoints in HF format (#146352 )" This reverts commit `c615b8c174`. Reverted https://github.com/pytorch/pytorch/pull/146352 on behalf of https://github.com/jeanschmidt due to Author ignored linting errors ([comment](https://github.com/pytorch/pytorch/pull/146352#issuecomment-2673789271))	2025-02-21 07:30:52 +00:00
Kevin Fu	4986f0f52e	[PT2]: allow empty dict to pass type check (#147167 ) (#147480 ) Summary: Seeing errors like when testing sigmoid for inline_cvr and perevent_cvr models. ``` terminate called after throwing an instance of 'c10::Error' what(): forward() Expected a value of type 'Dict[int, Tuple[Tensor, Tensor, Tensor]]' for argument 'event_based_features' but instead found type 'Dict[Any, Any]'. ``` Let empty dict pass type check. please, do NOT use any of the following flags, those are result of manual interventions in other parts of the system, misuse of them can be very painful for both detect and recover: Test Plan: ``` MODEL_ENTITY_ID=691508446 SNAPSHOT_ID=0 OTHER_MODEL_ENTITY_ID=649645886 OTHER_SNAPSHOT_ID=0 MODULE=local buck2 run mode/opt caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- \ --loadMode=BenchmarkAB \ --inputNetFile=/data/users/${USER}/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}${suffix} \ --otherNetFile=/data/users/${USER}/models/${OTHER_MODEL_ENTITY_ID}/${OTHER_SNAPSHOT_ID}/${OTHER_MODEL_ENTITY_ID}_${OTHER_SNAPSHOT_ID}${suffix} \ --moduleName=${module} \ --submodToDevice "" \ --benchmarkDontRebatchSamples=true \ --sampleInputFilePath=/data/users/${USER}/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/archive_.predictor.disagg.gpu.local/data/sample_inputs/local.pt ``` Reviewed By: yjhao Differential Revision: D69871393 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147480 Approved by: https://github.com/henryoier, https://github.com/jeanschmidt	2025-02-21 07:00:46 +00:00
dilililiwhy	86ae672b6a	Use has_triton_package in _inductor.runtime.hints (#147442 ) Fixes #ISSUE_NUMBER Use existing method for triton check Pull Request resolved: https://github.com/pytorch/pytorch/pull/147442 Approved by: https://github.com/Skylion007	2025-02-21 05:52:00 +00:00
Eddie Yan	533b884870	[cuDNN][SDPA][Nested Tensor] Experimental cuDNN Nested Tensor SDPA Support (forward only) (#141178 ) Disabled by default for now behind `TORCH_CUDNN_SDPA_NESTED_TENSOR_ENABLED=1` Just wanted to get this out before starting a series of SDPA cleanup PRs---the biggest thing is we don't need the boilerplate around all of the `build_graph_and_tensors*` functions anymore as we can now use the `UID`-style referencing of tensor nodes as was done for the Conv-V8 API backend. CC @drisspg Pull Request resolved: https://github.com/pytorch/pytorch/pull/141178 Approved by: https://github.com/jbschlosser	2025-02-21 05:22:19 +00:00
Jerry Zhang	a2c3a2c5c4	Support serialization for uintx/intx in weights_only (#147500 ) Summary: Fixing the issue reported by huggingface Test Plan: python test/test_serialization.py -k test_serialization_uintx_intx Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/147500 Approved by: https://github.com/mikaylagawarecki	2025-02-21 04:38:44 +00:00
Ankita George	c615b8c174	Build a storage reader/writer to write checkpoints in HF format (#146352 ) Summary: Title - we want to write checkpoints in HF format with DCP, this diff allows this for the non-distributed use case. Test Plan: buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/distributed/checkpoint:test_hf_torchtune_storage N6476188 --> able to save and load tensor in hf format Differential Revision: D68444967 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146352 Approved by: https://github.com/saumishr	2025-02-21 03:31:21 +00:00
Tristan Rice	ba214ab56c	TCPStore: soft fail bind when agent store active (#147465 ) This makes it easier to roll out `TORCHELASTIC_USE_AGENT_STORE` by opportunistically swallowing bind errors when the agent store is enabled and the port matches `MASTER_PORT`. This should be very safe as if the store is somehow not up and the envs are set, the TCPStore client connections will fail to connect so we end up with a slightly different error message but success/failure behavior is identical. This also pybinds `c10d::SocketError` into Python so we can assert on the error type in tests. https://docs.google.com/document/d/1CzOn_N53AiFxWGgbyMWSnd2elCJd4lZ-ajPg2lzcxoM/edit?tab=t.0#heading=h.2j2f5dimrdau Test plan: ``` pytest test/distributed/test_store.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147465 Approved by: https://github.com/fduwjj	2025-02-21 03:02:26 +00:00
Yan Zhiwei	8a5265cb37	[Intel GPU] qlinear_pointwise.binary[_tensor] XPU support (#135337 ) # Motivation This PR intends to enable quantized fusion `qlinear+add` at Intel GPU backend. At backend level, we register the op via schema `TORCH_SELECTIVE_NAME("onednn::qlinear_pointwise.binary")` and `TORCH_SELECTIVE_NAME("onednn::qlinear_pointwise.binary_tensor")` which is the one already defined in `x86InductorQuantzer` At Inductor level, we have small modification at `torch/_inductor/fx_passes/quantization.py` to allow signed int8 data type(s8) during op lowering. As for the pattern matching, we greatly reuse the code existing at x86InductorQuantizer. # UT verification ```bash python test/inductor/test_mkldnn_pattern_matcher.py -v \ -k test_qlinear_add_xpu ``` # Runtime Verification ```bash onednn_verbose,primitive,exec,gpu:0,matmul,jit:gemm:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 bia_f32::blocked:ab::f0_mask2 dst_f32::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_linear:1:0.654408+sum:0.00511256+eltwise_relu,,4x4:4x4,0.0319824 ``` The verbose is collected from UT. We can see the attribute ` attr-post-ops:eltwise_linear:1:0.654408+sum:0.00511256+eltwise_relu`, the post add and ReLU is successfully fused on GEMM computation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135337 Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/liangan1, https://github.com/jerryzh168 ghstack dependencies: #133307, #135189 Co-authored-by: guangyey <guangye.yu@intel.com>	2025-02-21 02:09:28 +00:00
Simon Fan	ac88a6c00d	[fx] demote node prepend to self log from warning to debug (#147538 ) FIXES https://github.com/pytorch/pytorch/issues/147175 This is harmless, not sure why this is a user warning. Writing reordering graph passes is more concise when we ignore this warning. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147538 Approved by: https://github.com/yanboliang	2025-02-21 01:32:34 +00:00
Zhengxu Chen	fdb1305ace	reland "[sigmoid] Test OSS model runner with test_export.py" (#147535 ) Summary: There are ~260 tests for all the corner cases of export from test_export.py. Utilizing them to test sigmoid in the OSS setting. Test Plan: buck test mode/opt caffe2/test:test_export -- -r _sigmoid Differential Revision: D69937387 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147535 Approved by: https://github.com/yiming0416	2025-02-20 23:45:13 +00:00
Aaron Orenstein	be0df96b50	Fix c++ implementation of strip_function_call (#147436 ) #143063 was missing handling a couple UCS cases as well as had some bugs in the way it dealt with errors. - Fix all the UCS handling (and make some of the common code more common) - Make sure all the error paths return `nullptr` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147436 Approved by: https://github.com/jansel	2025-02-20 20:41:21 +00:00
Jessica Vandebon	6971b77510	[CPU Stream] Add noop for CPU stream record_event() and wait_event() (#145935 ) Summary: Adds wait_event and record_event endpoints to CPU stream in order to facilitate device-agnostic code. Both methods are noops. Test Plan: CI Differential Revision: D68833927 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145935 Approved by: https://github.com/Skylion007	2025-02-20 18:50:55 +00:00
Catherine Lee	863ac20659	[CI] Do not overwrite return code of test file when fails for rerun disabled tests (#147484 ) Do not overwrite the return code of a single file when it fails. This will allow the log to be printed to stdout and the gha logs Pull Request resolved: https://github.com/pytorch/pytorch/pull/147484 Approved by: https://github.com/ZainRizvi	2025-02-20 17:51:58 +00:00
Sampsa	83bb921a5a	[ROCm] Update meta_registration for efficient attention (#146979 ) Fixes a series of failing and skipped unit tests. For NVIDIA hardware, the logsumexp last dimension is required to be a multiple of 32. This is not the case for ROCm. A related issue: https://github.com/pytorch/pytorch/issues/146848 The unit tests in question: ```bash inductor.test_fused_attention SDPAPatternRewriterCudaDynamicTests test_sdpa_prev_13_cuda inductor.test_fused_attention SDPAPatternRewriterCudaDynamicTests test_sdpa_prev_14_cuda inductor.test_fused_attention SDPAPatternRewriterCudaDynamicTests test_sdpa_prev_15_cuda inductor.test_fused_attention SDPAPatternRewriterCudaDynamicTests test_sdpa_rewriter_11_cuda inductor.test_fused_attention SDPAPatternRewriterCudaDynamicTests test_sdpa_rewriter_14_cuda inductor.test_fused_attention SDPAPatternRewriterCudaDynamicTests test_sdpa_rewriter_15_cuda inductor.test_fused_attention SDPAPatternRewriterCudaDynamicTests test_sdpa_rewriter_17_cuda inductor.test_fused_attention SDPAPatternRewriterCudaDynamicTests test_sdpa_rewriter_1_cuda inductor.test_fused_attention SDPAPatternRewriterCudaDynamicTests test_sdpa_rewriter_1_freezing inductor.test_fused_attention SDPAPatternRewriterCudaDynamicTests test_sdpa_rewriter_2_cuda inductor.test_fused_attention SDPAPatternRewriterCudaDynamicTests test_sdpa_rewriter_3_cuda inductor.test_fused_attention SDPAPatternRewriterCudaDynamicTests test_sdpa_rewriter_4_cuda inductor.test_fused_attention SDPAPatternRewriterCudaDynamicTests test_sdpa_rewriter_6_cuda inductor.test_fused_attention SDPAPatternRewriterCudaTests test_sdpa_prev_13_cuda inductor.test_fused_attention SDPAPatternRewriterCudaTests test_sdpa_prev_14_cuda inductor.test_fused_attention SDPAPatternRewriterCudaTests test_sdpa_prev_15_cuda inductor.test_fused_attention SDPAPatternRewriterCudaTests test_sdpa_rewriter_11_cuda inductor.test_fused_attention SDPAPatternRewriterCudaTests test_sdpa_rewriter_14_cuda inductor.test_fused_attention SDPAPatternRewriterCudaTests test_sdpa_rewriter_15_cuda inductor.test_fused_attention SDPAPatternRewriterCudaTests test_sdpa_rewriter_17_cuda inductor.test_fused_attention SDPAPatternRewriterCudaTests test_sdpa_rewriter_1_cuda inductor.test_fused_attention SDPAPatternRewriterCudaTests test_sdpa_rewriter_1_freezing inductor.test_fused_attention SDPAPatternRewriterCudaTests test_sdpa_rewriter_2_cuda inductor.test_fused_attention SDPAPatternRewriterCudaTests test_sdpa_rewriter_3_cuda inductor.test_fused_attention SDPAPatternRewriterCudaTests test_sdpa_rewriter_4_cuda inductor.test_fused_attention SDPAPatternRewriterCudaTests test_sdpa_rewriter_6_cuda ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146979 Approved by: https://github.com/shunting314	2025-02-20 15:05:13 +00:00
vasiliy	382fbcc1e4	add the `torch.float8_e8m0fnu` dtype to PyTorch (#147466 ) Summary: Continuing the work from https://github.com/pytorch/pytorch/pull/146427 Adds the `torch.float8_e8m0fnu` dtype to PyTorch, as detailed in https://github.com/pytorch/pytorch/issues/146414 . Please see the issue for a detailed definition of the format. Example of basic functionality: ```python import torch # round trip x0 = torch.randn(4, 4, dtype=torch.float32) x1 = x0.to(torch.float8_e8m0fnu) # RNE rounding x2 = x1.to(torch.float32) # 2 ** exponent # creation with empty x0 = torch.empty(4, 4, dtype=torch.float8_e8m0fnu) # printing print(x0) ``` Done in this PR: * numerical correctness * op coverage (except for `torch._scaled_mm`): create tensor, cast to/from float32 * printing a tensor works For future PRs: * performance optimizations for casting * torch._scaled_mm * PT2 * various cleanups (detailed in comments with issue numbers) Test Plan: ``` pytest test/quantization/core/experimental/test_float8.py -s ``` Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/147466 Approved by: https://github.com/drisspg	2025-02-20 13:55:42 +00:00
James Wu	574371d828	Add current cuda device index to FXGraphCache key (#147464 ) This PR intends to fix the cache related issues from https://github.com/pytorch/pytorch/issues/147405. It does not handle the dynamo recompile case in process, because it does not introduce any extra guards. For FXGraphCache and AOTAutogradCache, we simply have to have the device context in the cache key. Note that for any function that accepts tensor inputs, the device context is naturally already included in the cache key by the metadata of example inputs. However, for functions that return constants or have no arguments, the device context still needs to be in the cache key. A more robust fix for this would be to have inductor generate device guards that are dynamic, instead of specialized. This would also help us share more cache artifacts. I've added unit tests for FXGraphCache and AOTAutogradCache, both of which would fail without this change. Differential Revision: [D69875939](https://our.internmc.facebook.com/intern/diff/D69875939) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147464 Approved by: https://github.com/bdhirsh, https://github.com/anijain2305	2025-02-20 12:38:21 +00:00
zeshengzong	6beba8dcce	Optimize `graph.py` typing (#147099 ) Optimize `graph.py` methods type annotation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147099 Approved by: https://github.com/cyyever, https://github.com/aorenste	2025-02-20 09:32:30 +00:00
Luca Wehrstedt	f9b8121350	Make Inductor scheduler aware of _scaled_mm (#146992 ) This is used for example to estimate runtime when doing comms overlap Pull Request resolved: https://github.com/pytorch/pytorch/pull/146992 Approved by: https://github.com/drisspg, https://github.com/eellison, https://github.com/shunting314	2025-02-20 09:02:31 +00:00
Shawn Xu	9da250aada	type `fully_shard` so that the return value can be chained with typing enabled (#147489 ) This allows for ``` fsdped = fully_shard(model) fsdped.set_xyz() ``` same applies if `model` is actually a list of modules Differential Revision: [D69888119](https://our.internmc.facebook.com/intern/diff/D69888119) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147489 Approved by: https://github.com/Skylion007 ghstack dependencies: #147488	2025-02-20 08:43:16 +00:00
zeshengzong	6a72aaadae	Fix `torch.max` optional args `dim`, `keepdim` description (#147177 ) [`torch.max`](https://pytorch.org/docs/stable/generated/torch.max.html#torch.max) optional args `dim`, `keepdim` not described in document, but users can ignore them. ```python >>> import torch >>> a = torch.randn(3,1,3) >>> a.max() tensor(1.9145) >>> a.max(dim=1) torch.return_types.max( values=tensor([[ 1.1436, -0.0728, 1.3312], [-0.4049, 0.1792, -1.2247], [ 0.8767, -0.7888, 1.9145]]), indices=tensor([[0, 0, 0], [0, 0, 0], [0, 0, 0]])) ``` ## Changes - Add `optional` description for `dim`, `keepdim` - Add example of using `dim`, `keepdim` ## Test Result ### Before ![image](https://github.com/user-attachments/assets/3391bc45-b636-4e64-9406-04d80af0c087) ### After ![image](https://github.com/user-attachments/assets/1d70e282-409c-4573-b276-b8219fd6ef0a) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147177 Approved by: https://github.com/colesbury	2025-02-20 08:18:09 +00:00
drisspg	452315c84f	Fix RuntimeError: value cannot be converted to type int64_t without overflow (#147492 ) The exact call is coming from here: `78a94c9114/torch/_inductor/memory.py (L161)` I have no idea why this error is being thrown and what mode/modes might be failing for this Pull Request resolved: https://github.com/pytorch/pytorch/pull/147492 Approved by: https://github.com/eellison	2025-02-20 08:00:26 +00:00
zeshengzong	a000c7e6d2	Add hint message for `pack_padded_sequence` (#146747 ) Fixes #144207 Add truncate hint message in docs [torch.nn.utils.rnn.pack_padded_sequence](https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pack_padded_sequence.html) ## Test Result ![image](https://github.com/user-attachments/assets/46258f36-f6c7-4f11-9213-8513e52a9001) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146747 Approved by: https://github.com/mikaylagawarecki	2025-02-20 06:27:07 +00:00
Aaron Orenstein	db4ce78d46	PEP585: More UP006 fixes (#146392 ) This should be the final PR before we can enable RUFF UP006. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146392 Approved by: https://github.com/justinchuby, https://github.com/albanD, https://github.com/Skylion007	2025-02-20 06:18:13 +00:00
Animesh Jain	76ad19a549	[dynamo][codegen] Implement CSE for pre-graph graph-arg bytecode reconstruction (#147425 ) This reduces fixed overhead seen in a few internal models. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147425 Approved by: https://github.com/jansel, https://github.com/StrongerXi	2025-02-20 05:42:52 +00:00
Yidi Wu	77aa602871	[torchbind] Differentiate ScriptModule and ScriptObject with qualified name (#147399 ) Summary: This PR adds a _is_script_object method to differentiate ScriptModule and ScriptObject, where the former inherits from ScriptObject in C++, so they both pass the isinstance(obj, torch.ScriptObject) check. The qualified name of a ScriptObject (i.e. custom class) starts with "__torch__.torch.classes"; this has been a widely used assumption for dealing with custom classes across our code base. Test Plan: Add new test. Differential Revision: D69685316 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147399 Approved by: https://github.com/yushangdi	2025-02-20 04:57:57 +00:00
Michael Lazos	7185ca8348	[Cutlass] Add test verifying number of precompiles (#147477 ) As title Pull Request resolved: https://github.com/pytorch/pytorch/pull/147477 Approved by: https://github.com/henrylhtsang	2025-02-20 04:47:57 +00:00
bobrenjc93	0d56b7e665	Support size oblivious max equation (#147344 ) Addresses https://github.com/pytorch/pytorch/issues/125914 by detecting when we have a sym_max between {0, 1} and a summation of size-like unbacked symints. The basic idea is max(1, u0 + u1) can be simplified to u0 + u1 if both u0 and u1 are size-like since their value ranges are [2, inf]. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147344 Approved by: https://github.com/angelayi	2025-02-20 04:33:19 +00:00
Shangdi Yu	0b0da81021	Support static method of torchbind attributes in torch.compile with inductor backend (#146927 ) As title. Many changes adapted from https://github.com/pytorch/pytorch/pull/129537. Also this diff is only for static method of torchbind attributes. Some case that's not supported/tested: - dynamic torchbind objects - torchbind objects as an input to the module. Note that in JIT Inductor, the attributes are lifted as inputs. So even if we just have torchbind objects as attributes, they will show up as inputs in the graph. Example generated python code in torch.compile with inductor backend for the test case in `inductor/test_torchbind.py` (P1730554370): ```python async_compile.wait(globals()) del async_compile def call(args): arg1_1, arg2_1, arg3_1 = args args.clear() assert_size_stride(arg1_1, (2, 3), (3, 1)) assert_size_stride(arg2_1, (2, 3), (3, 1)) buf2 = empty_strided_cpu((2, 3), (3, 1), torch.float32) cpp_fused_add_0(arg1_1, arg2_1, buf2) del arg1_1 del arg2_1 # Topologically Sorted Source Nodes: [x, takes_foo_tuple_return], Original ATen: [aten.add] buf3 = torch.ops._TorchScriptTesting.takes_foo_tuple_return.default(arg3_1, buf2) buf4 = buf3[0] assert_size_stride(buf4, (2, 3), (3, 1)) buf5 = buf3[1] assert_size_stride(buf5, (2, 3), (3, 1)) buf6 = buf4; del buf4 # reuse cpp_fused_add_1(buf6, buf5) del buf5 # Topologically Sorted Source Nodes: [y, b], Original ATen: [aten.add] buf7 = torch.ops._TorchScriptTesting.takes_foo.default(arg3_1, buf6) del buf3 del buf6 buf8 = buf7 assert_size_stride(buf8, (2, 3), (3, 1)) # Topologically Sorted Source Nodes: [c], Original ATen: [] buf9 = torch.ops.higher_order.call_torchbind(arg3_1, 'add_tensor', buf2) del arg3_1 del buf7 buf10 = buf9 assert_size_stride(buf10, (2, 3), (3, 1)) del buf9 buf11 = buf2; del buf2 # reuse cpp_fused_add_2(buf11, buf8, buf10) return (buf11, ) def benchmark_compiled_module(times=10, repeat=10): from torch._dynamo.testing import rand_strided from torch._inductor.utils import 
print_performance arg1_1 = rand_strided((2, 3), (3, 1), device='cpu', dtype=torch.float32) arg2_1 = rand_strided((2, 3), (3, 1), device='cpu', dtype=torch.float32) import pickle global arg3_1 arg3_1 = pickle.loads(b'\x80\x04\x95[\x00\x00\x00\x00\x00\x00\x00\x8c\x05torch\x94\x8c\x0cScriptObject\x94\x93\x94)\x81\x94]\x94(K\nK\x14e\x8c0__torch__.torch.classes._TorchScriptTesting._Foo\x94\x86\x94b.') fn = lambda: call([arg1_1, arg2_1, arg3_1]) return print_performance(fn, times=times, repeat=repeat) if __name__ == "__main__": from torch._inductor.wrapper_benchmark import compiled_module_main compiled_module_main('None', benchmark_compiled_module) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146927 Approved by: https://github.com/angelayi	2025-02-20 03:33:19 +00:00
Shawn Xu	de1cb0f351	capture the return value in the contract typing (#147488 ) ---- * the existing typing makes the return type `Optional[nn.Module]` * this doesn't seem to be what the decorator actually does as it does not alter the original return type * This PR aims to fix the typing Differential Revision: [D69888120](https://our.internmc.facebook.com/intern/diff/D69888120) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147488 Approved by: https://github.com/Skylion007	2025-02-20 03:32:34 +00:00
rzou	fea718f062	[BaseHOP] change hop(subgraph, operands) to hop(subgraph, *operands) (#146730 ) Our three main users are OK with this, with two of them (foreach_map, invoke_quant) preferring it like this. I was originally worried about BC issues (this now means you cannot add any positional args) but I think that's not a concern -- one can always add kwonly args. Test Plan - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/146730 Approved by: https://github.com/ydwu4, https://github.com/mlazos	2025-02-20 02:30:36 +00:00
Yan Zhiwei	f79b352f5a	[Intel GPU] qconv_pointwise.binary XPU support (#135189 ) # Motivation This PR intends to enable quantized fusion `qconv+add` and `qconv+add+relu` at Intel GPU backend. At backend level, we register the op via schema `TORCH_SELECTIVE_NAME("onednn::qconv2d_pointwise.binary")` which is the one already defined in `x86InductorQuantzer` At Inductor level, we have small modification at `torch/_inductor/fx_passes/quantization.py` to allow signed int8 data type(s8) during op lowering. As for the pattern matching, we greatly reuse the code existing at x86InductorQuantizer. # UT verification ```bash python test/inductor/test_mkldnn_pattern_matcher.py -v \ -k test_qconv2d_add_xpu \ -k test_qconv2d_add_relu_xpu 2>&1 ``` # Runtime exemplification Following is the oneDNN verbose collected from UT ```bash onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_s8::blocked:acdb::f0 wei_s8::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_s8::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:1:f32 attr-zero-points:src0:0:s32+dst:0:s32 attr-post-ops:eltwise_linear:1:0.337704+sum:0.0241217+eltwise_relu,alg:convolution_direct,mb1_ic3oc6_ih8oh6kh3sh1dh0ph0_iw8ow6kw3sw1dw0pw0,0.151123 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135189 Approved by: https://github.com/liangan1, https://github.com/EikanWang, https://github.com/guangyey, https://github.com/jerryzh168 ghstack dependencies: #133307 Co-authored-by: guangyey <guangye.yu@intel.com>	2025-02-20 02:02:54 +00:00
Riley Dulin	93316cfe94	Move ir_pre_fusion.txt and ir_post_fusion.txt to TORCH_LOGS (#147248 ) Fixes #147002 Moves ir_{pre, post}_fusion.txt to be controlled by TORCH_LOGS instead of TORCH_COMPILE_DEBUG. Updated tests of these logs as well. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147248 Approved by: https://github.com/eellison	2025-02-20 00:26:17 +00:00
William Wen	16e202a38e	[dynamo] improved graph break messages for some common graph break sites [1/N] (#146525 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146525 Approved by: https://github.com/jansel	2025-02-20 00:08:13 +00:00
Pian Pawakapan	1e94c7aaa4	[draft_export] only clear pending unbacked symbols for overwritten kernels (#147427 ) This was wrong, we were doing this in all cases Pull Request resolved: https://github.com/pytorch/pytorch/pull/147427 Approved by: https://github.com/angelayi	2025-02-20 00:07:54 +00:00
Henry Tsang	3986c3e4a6	[reland][cutlass backend] Do not change dtype of GEMM template for cutlass 3x (#147434 ) Reland of https://github.com/pytorch/pytorch/pull/146877 incorporate forward fix (didn't land): https://github.com/pytorch/pytorch/pull/147185 Summary: I think this is a change in the right direction. Right now, when we try to find a cutlass gemm, we generate a bunch of gemm templates, and filter out those that don't fit. For example, if we are doing bf16 x bf16 matmul, the gemm template for fp32 x fp32 is generated and filtered out. However, for the dtype of bias, we would attempt to modify the dtype of the gemm template. I think this is a bad idea, since (1) the usable template is also being generated, and (2) this messes with the configuration name of the template. I tested this offline. There isn't much difference in performance. However, with instantiation level 2222, I noticed far fewer "C++ compile error" messages. This is probably due to using the right template? Follow-ups are needed: 1. benchmark and dashboard 2. check our logic for setting alignment with my change https://www.internalfb.com/intern/paste/P1729604119/ without my change https://www.internalfb.com/intern/paste/P1729624806/ Differential Revision: D69825865 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147434 Approved by: https://github.com/ColinPeppler	2025-02-20 00:07:07 +00:00
Michael Lazos	004d65aeb0	Add type hints to cuda kernel (#147471 ) Missed this in a previous PR Pull Request resolved: https://github.com/pytorch/pytorch/pull/147471 Approved by: https://github.com/eellison	2025-02-19 23:35:10 +00:00
Henry Tsang	48203bec63	[BE] remove sysconfig.get_config_var("LIBDIR") from cuda lib paths (#147409 ) Summary: I think the path is not needed anymore. It was added in https://github.com/pytorch/pytorch/pull/126408, but it has been a while since then. See if CI complains. Differential Revision: D69573185 See also https://github.com/pytorch/pytorch/pull/147158 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147409 Approved by: https://github.com/chenyang78	2025-02-19 23:04:22 +00:00
Gregory Comer	f63db6255f	Re-land exclude upsample_bilinear2d.vec and nearest2d.vec from default export decomposition table (#147153 ) Note: This is a re-land of https://github.com/pytorch/pytorch/pull/141791, which I reverted due to breaking some Meta-internal tests - an internal ET delegate did not handle the non-decomposed upsample_nearest2d, and it was not caught in CI. I've resolved that issue and should be ready to safely re-land. Summary: As upsample_bilinear2d.vec and upsample_nearest2d.vec are core ATen ops, they should not be decomposed by default in the export path. Because the operators have CompositeImplicitAutograd dispatch, their decomposition is registered by default. This change adds an override list for CIA decompositions being registered in the default decomp table. In the long-term, we likely will want to exclude decompositions for all core-tagged CIA ops, but this will require all consumers to be ready to handle the remaining two ops, avg_pool1d, and adaptive_avg_pool1d. Until they are ready, I believe an explicit override list is the safest option. Additionally, I've also removed the ExecuTorch XNNPACK delegate ConvertToUpsampleBilinear2d pass, as the pass breaks (and is not needed), given that the op is not decomposed. The purpose of this pass was originally to pattern match the decomposition and recompose it, but this is no longer necessary. Test Plan: Added a new test (`test_default_decomposition_core_cia_ops`) in test_export.py to verify that upsample_bilinear2d.vec (and in the future, other core-tagged CIA ops) are not decomposed by default. Also, I manually validated end to end with ExecuTorch that the op is not decomposed in to_edge (see N6238522). ``` buck test //caffe2/test:test_export -- test_default_decomposition_core_cia_ops ``` Differential Revision: D69625112 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147153 Approved by: https://github.com/manuelcandales	2025-02-19 23:03:29 +00:00
Justin Chu	41ae15faa3	[ONNX] Add scaffolding for onnx decomp and logic for op tests (#147392 ) Create scaffold for onnx op test data and common logic. This PR creates the scaffolding for new onnx decomp functions described in https://github.com/pytorch/pytorch/issues/139301. It adds two ops: abs and add, and enables the related tests. https://github.com/pytorch/pytorch/issues/139301 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147392 Approved by: https://github.com/titaiwangms ghstack dependencies: #147396	2025-02-19 21:55:12 +00:00
Avik Chaudhuri	24738768a8	more dist ops in non strict (#147417 ) Summary: Previously we added support for `all_reduce` to non strict. This PR extends this support to other non-functional collectives that are remapped in Dynamo: `all_gather`, `all_gather_into_tensor`, `all_to_all_single`, `reduce_scatter_tensor`. Test Plan: added unit tests Differential Revision: D69813991 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147417 Approved by: https://github.com/angelayi	2025-02-19 21:29:16 +00:00
Alex Baden	e758d8b4d1	[Inductor][Triton] Rework casting logic to avoid illegal bitcast (#147395 ) Triton introduced checks for bitcasts where the casted value does not fit into the casted type (e.g. https://github.com/triton-lang/triton/pull/5926, though in this instance I think the issue is related to the type for the broadcast). Some routines in Inductor now perform illegal bitcasts. I reworked the compare and swap w/ index routine used in sort to remove the illegal bitcast (~~I left the bitcast for now, but I think it could probably be removed assuming the reshape does not change the type~~). The explicit cast is correct, and I don't think there are performance issues, but because the cast on the sum is not a bitcast I suppose there could be. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147395 Approved by: https://github.com/eellison	2025-02-19 19:45:01 +00:00
Justin Chu	279c7f262e	[ONNX] Refactor dispatcher and registry (#147396 ) This PR sets up the registry to accept onnx decomp functions to be moved into PyTorch (https://github.com/pytorch/pytorch/issues/139301). The ops from onnx script are currently appended to the registry. When the ops are moved into PyTorch, the moved ops take precedence because they appear first in the registry list. After the migration, hooks for loading ops from onnx script will be removed. 1. Use a private field `_pt_onnx_signature` to store function signatures to avoid conflicts 2. Update the registry to record the signature in OnnxDecompMeta and update the dispatcher to leverage the data structure 3. Update registry to prepare for onnx op registration, and update the onnx_impl decorator to support a no_compile option Signed-off-by: Justin Chu <justinchuby@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/147396 Approved by: https://github.com/titaiwangms	2025-02-19 19:38:28 +00:00
bobrenjc93	4f3c070b25	[inductor] GraphLowering code movement (#147335 ) moved these methods under __init__ to be more idiomatic Pull Request resolved: https://github.com/pytorch/pytorch/pull/147335 Approved by: https://github.com/eellison ghstack dependencies: #147331	2025-02-19 19:32:30 +00:00
Simon Fan	ed83b0b70b	[ddp] decouple python reducer from compilation mode (#147123 ) Current implementation reads as: we will only actually use the "python_reducer" config if the DDP forward is compiled. Otherwise, we will silently fallback to C++ reducer + no DDPOptimizer. I'm changing this behavior to always use the python reducer if the config is specified. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147123 Approved by: https://github.com/fegin	2025-02-19 15:51:40 +00:00
drisspg	303ad1916f	[FlexAttention] Fix weird generate stride call in flex decode (#147435 ) # Summary Seems like we had a redundant tuple unpack and that doesn't appear to be supported in new triton Fixes https://github.com/pytorch/pytorch/issues/147373 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147435 Approved by: https://github.com/BoyuanFeng	2025-02-19 12:12:27 +00:00
Michael Lazos	77dbd28535	[Cutlass] Restore search space for swizzle (#147224 ) This restores the previous search space, since swizzle is now a runtime parameter, there shouldn't be extra compile-time overhead from searching this now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147224 Approved by: https://github.com/eellison ghstack dependencies: #147222, #147223	2025-02-19 09:22:51 +00:00
Michael Lazos	e9b3ff0570	[Cutlass] Add support for runtime param choices, starting with swizzle (#147223 ) This PR adds support for swizzle as a runtime parameter choice. Future runtime parameter choices can be added to the [get_runtime_arg_info](`2d40f9fb52/torch/_inductor/codegen/cuda/cuda_template.py (L282)`) list method and then possible choices can be [looped over similarly to swizzle](`933f921b36/torch/_inductor/codegen/cuda/gemm_template.py (L532)`). For precompile, we now filter choices by hash to only compile each distinct kernel source once. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147223 Approved by: https://github.com/Chillee, https://github.com/eellison ghstack dependencies: #147222	2025-02-19 09:22:51 +00:00
Michael Lazos	81eb2a78ad	[Inductor] Add autotuning artifact logging (#147222 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147222 Approved by: https://github.com/henrylhtsang, https://github.com/eellison	2025-02-19 09:22:42 +00:00
bobrenjc93	655b061ef0	[inductor] Freeze runtime asserts after shape prop but before codegen (#147331 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147331 Approved by: https://github.com/eellison	2025-02-19 06:29:13 +00:00
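
The size-oblivious max entry above (#147344) rests on a small piece of range arithmetic: if every term in a sum is size-like, and size-oblivious reasoning treats each as at least 2, then the sum already exceeds a constant such as 0 or 1, so `max(1, u0 + u1)` collapses to `u0 + u1`. The following is only a minimal sketch of that reasoning in plain Python. `SizeLikeSymbol` and `simplify_max` are hypothetical names invented for illustration, and this is not the symbolic-shapes implementation that PyTorch uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SizeLikeSymbol:
    """Hypothetical stand-in for a size-like unbacked symint."""

    name: str
    lower: int = 2  # size-oblivious reasoning assumes a value range of [2, inf]


def simplify_max(constant: int, symbols: list[SizeLikeSymbol]) -> str:
    """Return a string form of max(constant, sum(symbols)), simplified if possible.

    The max is dropped only when the sum's lower bound already reaches the constant.
    """
    total = " + ".join(s.name for s in symbols)
    lower_bound = sum(s.lower for s in symbols)
    if symbols and lower_bound >= constant:
        return total
    return f"max({constant}, {total})"


u0 = SizeLikeSymbol("u0")
u1 = SizeLikeSymbol("u1")
print(simplify_max(1, [u0, u1]))  # the sum is at least 4, so the max is redundant
print(simplify_max(5, [u0, u1]))  # the sum could be 4, so the max must stay
```

The sketch keeps the max whenever the lower bound of the sum cannot be shown to reach the constant, which mirrors the conservative direction of the simplification described in the commit message.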


46262 Commits