pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
cyy	b61a556427	Turn onnx functions into static (#147598 ) To avoid exposing ONNX symbols. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147598 Approved by: https://github.com/justinchuby	2025-02-21 07:40:28 +00:00
Kevin Fu	4986f0f52e	[PT2]: allow empty dict to pass type check (#147167 ) (#147480 ) Summary: Seeing errors like when testing sigmoid for inline_cvr and perevent_cvr models. ``` terminate called after throwing an instance of 'c10::Error' what(): forward() Expected a value of type 'Dict[int, Tuple[Tensor, Tensor, Tensor]]' for argument 'event_based_features' but instead found type 'Dict[Any, Any]'. ``` Let empty dict pass type check. please, do NOT use any of the following flags, those are result of manual interventions in other parts of the system, misuse of them can be very painful for both detect and recover: Test Plan: ``` MODEL_ENTITY_ID=691508446 SNAPSHOT_ID=0 OTHER_MODEL_ENTITY_ID=649645886 OTHER_SNAPSHOT_ID=0 MODULE=local buck2 run mode/opt caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- \ --loadMode=BenchmarkAB \ --inputNetFile=/data/users/${USER}/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}${suffix} \ --otherNetFile=/data/users/${USER}/models/${OTHER_MODEL_ENTITY_ID}/${OTHER_SNAPSHOT_ID}/${OTHER_MODEL_ENTITY_ID}_${OTHER_SNAPSHOT_ID}${suffix} \ --moduleName=${module} \ --submodToDevice "" \ --benchmarkDontRebatchSamples=true \ --sampleInputFilePath=/data/users/${USER}/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/archive_.predictor.disagg.gpu.local/data/sample_inputs/local.pt ``` Reviewed By: yjhao Differential Revision: D69871393 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147480 Approved by: https://github.com/henryoier, https://github.com/jeanschmidt	2025-02-21 07:00:46 +00:00
Tristan Rice	ba214ab56c	TCPStore: soft fail bind when agent store active (#147465 ) This makes it easier to roll out `TORCHELASTIC_USE_AGENT_STORE` by opportunistically swallowing bind errors when the agent store is enabled and the port matches `MASTER_PORT`. This should be very safe as if the store is somehow not up and the envs are set, the TCPStore client connections will fail to connect so we end up with a slightly different error message but success/failure behavior is identical. This also pybinds `c10d::SocketError` into Python so we can assert on the error type in tests. https://docs.google.com/document/d/1CzOn_N53AiFxWGgbyMWSnd2elCJd4lZ-ajPg2lzcxoM/edit?tab=t.0#heading=h.2j2f5dimrdau Test plan: ``` pytest test/distributed/test_store.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147465 Approved by: https://github.com/fduwjj	2025-02-21 03:02:26 +00:00
Aaron Orenstein	be0df96b50	Fix c++ implementation of strip_function_call (#147436 ) #143063 was missing handling a couple UCS cases as well as had some bugs in the way it dealt with errors. - Fix all the UCS handling (and make some of the common code more common) - Make sure all the error paths return `nullptr` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147436 Approved by: https://github.com/jansel	2025-02-20 20:41:21 +00:00
vasiliy	382fbcc1e4	add the `torch.float8_e8m0fnu` dtype to PyTorch (#147466 ) Summary: Continuing the work from https://github.com/pytorch/pytorch/pull/146427 Adds the `torch.float8_e8m0fnu` dtype to PyTorch, as detailed in https://github.com/pytorch/pytorch/issues/146414 . Please see the issue for a detailed definition of the format. Example of basic functionality: ```python import torch # round trip x0 = torch.randn(4, 4, dtype=torch.float32) x1 = x0.to(torch.float8_e8m0fnu) # RNE rounding x2 = x1.to(torch.float32) # 2 ** exponent # creation with empty x0 = torch.empty(4, 4, dtype=torch.float8_e8m0fnu) # printing print(x0) ``` Done in this PR: * numerical correctness * op coverage (except for `torch._scaled_mm`): create tensor, cast to/from float32 * printing a tensor works For future PRs: * performance optimizations for casting * torch._scaled_mm * PT2 * various cleanups (detailed in comments with issue numbers) Test Plan: ``` pytest test/quantization/core/experimental/test_float8.py -s ``` Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/147466 Approved by: https://github.com/drisspg	2025-02-20 13:55:42 +00:00
PyTorch MergeBot	babb2dc2af	Revert "Add torch._scaled_mm for CPU (#139975 )" This reverts commit `6f7e67c43c`. Reverted https://github.com/pytorch/pytorch/pull/139975 on behalf of https://github.com/wdvr due to failing inductor mkldnn_pattern_matcher_cpu tests ([comment](https://github.com/pytorch/pytorch/pull/139975#issuecomment-2667186865))	2025-02-18 23:58:31 +00:00
William Wen	63e8ad49b8	[dynamo] replace hardcoded eval frame control flags skip_code_recursive_flag/cache_limit_hit_flag (#146355 ) This PR and the previous: - Moves parts of `eval_frame.c` to C++. - Reduces code duplication in `dynamo__custom_eval_frame` and makes the control flow more clear. - Enables `convert_frame` to signal to `eval_frame.cpp` in a general manner how to evaluate this frame, recursive frames, and future frames with the same code object (default/compile, skip, run-only). e.g. this will allow us to change skipping/cache limit hit eval_frame behavior directly from convert_frame without requiring changes to C/C++. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146355 Approved by: https://github.com/jansel ghstack dependencies: #145603	2025-02-18 21:37:12 +00:00
William Wen	75db0fd8a0	[dynamo] refactor dynamo__custom_eval_frame to C++, refactor SKIP_CODE[_RECURSIVE] (#145603 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145603 Approved by: https://github.com/jansel, https://github.com/anijain2305	2025-02-18 21:37:12 +00:00
Jiang, Yanbing	6f7e67c43c	Add torch._scaled_mm for CPU (#139975 ) This PR adds `torch._scaled_mm` for the CPU backend. `_scaled_mm_out_cpu` and `_scaled_mm_cpu` are newly added and included in the `torch._scaled_mm` CPU dispatch. We also add `_scaled_mm_out_cpu_emulated` as a fallback function if the current platform cannot run FP8 matmul using oneDNN. This PR also updates the various UTs related to FP8 to support CPU tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139975 Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet	2025-02-18 18:44:26 +00:00
Michal Gallus	d9cf1debf9	[ROCm][Windows] Fix clang-cl error related to -Wmissing prototypes enabled (#146981 ) Some of the Windows files (fused_kernels.cpp or temp_file.h) contain code that fails to compile with clang-cl when this flag is enabled. This PR resolves the issue by ensuring that, even if we build with clang-cl, those flags are not included on Windows. Alternatively, if needed, I can fix the files mentioned so that they pass under this flag. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146981 Approved by: https://github.com/cyyever, https://github.com/Skylion007	2025-02-18 07:41:12 +00:00
PyTorch MergeBot	49e8f9c965	Revert "Add torch._scaled_mm for CPU (#139975 )" This reverts commit `22fae4c5f9`. Reverted https://github.com/pytorch/pytorch/pull/139975 on behalf of https://github.com/huydhn due to third time is the charm ([comment](https://github.com/pytorch/pytorch/pull/139975#issuecomment-2664622598))	2025-02-18 05:11:32 +00:00
Jiang, Yanbing	22fae4c5f9	Add torch._scaled_mm for CPU (#139975 ) This PR adds `torch._scaled_mm` for the CPU backend. `_scaled_mm_out_cpu` and `_scaled_mm_cpu` are newly added and included in the `torch._scaled_mm` CPU dispatch. We also add `_scaled_mm_out_cpu_emulated` as a fallback function if the current platform cannot run FP8 matmul using oneDNN. This PR also updates the various UTs related to FP8 to support CPU tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139975 Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet	2025-02-17 18:39:10 +00:00
Yan Zhiwei	ae351d4d0e	[Intel GPU] allow_tf32 for oneDNN backend - XPU part (#137570 ) # Motivation Add the context variable `torch.backends.mkldnn.allow_tf32` to control tf32 computation in convolution kernels on the XPU side. The tf32 data type helps improve the performance of deep learning workloads during training/inference. The current PR uses the [oneDNN API fpmath_mode](https://oneapi-src.github.io/oneDNN/dev_guide_attributes_fpmath_mode.html#the-floating-point-math-mode-attribute) to trigger tf32 acceleration in convolution kernels. # Validation * UT to test the context variable: `python test/xpu/test_conv.py -k test_mkldnn_allow_tf32_get_set` * Runtime exemplification ``` onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_f32::blocked:abcd::f0 wei_f32::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_f32::blocked:abcd::f0,attr-scratchpad:user attr-fpmath:tf32,alg:convolution_direct,mb20_ic16oc33_ih50oh24kh3sh2dh0ph0_iw100ow49kw3sw2dw0pw0,0.649902 onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_f32::blocked:abcd::f0 wei_f32::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_f32::blocked:abcd::f0,attr-scratchpad:user attr-fpmath:tf32,alg:convolution_direct,mb20_ic33oc33_ih24oh24kh3sh1dh0ph1_iw49ow49kw3sw1dw0pw1,0.151855 onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,backward_data,src_f32::blocked:abcd::f0 wei_f32::blocked:abcd::f0 bia_undef::undef::: dst_f32::blocked:abcd::f0,attr-scratchpad:user attr-fpmath:tf32,alg:convolution_direct,mb20_ic33oc33_ih24oh24kh3sh1dh0ph1_iw49ow49kw3sw1dw0pw1,0.167969 onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,backward_weights,src_f32::blocked:abcd::f0 wei_f32::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_f32::blocked:abcd::f0,attr-scratchpad:user attr-fpmath:tf32,alg:convolution_direct,mb20_ic33oc33_ih24oh24kh3sh1dh0ph1_iw49ow49kw3sw1dw0pw1,0.26709 onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,backward_weights,src_f32::blocked:abcd::f0 wei_f32::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_f32::blocked:abcd::f0,attr-scratchpad:user attr-fpmath:tf32,alg:convolution_direct,mb20_ic16oc33_ih50oh24kh3sh2dh0ph0_iw100ow49kw3sw2dw0pw0,0.219971 ``` The `fpmath:tf32` field in the verbose output shows that the current context-setting utilities successfully trigger tf32 computation in the conv forward/backward_data/backward_weights kernels. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137570 Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/atalman, https://github.com/malfet Co-authored-by: Yu, Guangye <guangye.yu@intel.com>	2025-02-17 01:46:43 +00:00
Zhou Fang	a8fa4bcfd2	[StaticRuntime] Support a new pattern (aten::to with 5 inputs) for ClipRangesToGatherToOffsets (#147189 ) Summary: Support the following new pattern for ClipRangesToGatherToOffsets: Before optimization: ``` %11175 : Tensor, %11176 : Tensor = fb::clip_ranges_gather(%int_66.1, %getitem_1784.1, %347) %getattr_256.1 : int = prim::dtype(%11175) %to_298.1 : Tensor = aten::to(%11176, %getattr_256.1, %13, %13, %12) %lengths_to_offsets_333.1 : Tensor = fb::lengths_to_offsets(%to_298.1, %8) ``` After optimization: ``` %11199 : int = prim::dtype(%int_66.1) %11200 : Tensor, %11201 : Tensor = fb::clip_ranges_gather_to_offsets(%int_66.1, %getitem_1784.1, %347, %8, %11199) ``` It is similar with https://github.com/pytorch/pytorch/pull/146931, but aten::to has 5 inputs instead of 4. Differential Revision: D69627793 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147189 Approved by: https://github.com/hanyilou123	2025-02-16 22:16:02 +00:00
Animesh Jain	9dc702875d	[dynamo][mappingproxy][inspect] Support existing types.MappingProxyType (#147217 ) Fixes https://github.com/pytorch/pytorch/issues/147162 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147217 Approved by: https://github.com/williamwen42	2025-02-15 07:59:33 +00:00
cyy	8daa742e8b	Remove code for Python < 3.9 (#147181 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/147181 Approved by: https://github.com/albanD	2025-02-15 06:43:26 +00:00
cyy	8f291e8c00	Fix clang-tidy warnings in torch/jit (#146963 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/146963 Approved by: https://github.com/davidberard98	2025-02-15 03:36:59 +00:00
Mu-Chu Lee	a5c0dab900	[AOTInductor] Guard RAII_cpuMalloc with macro (#147150 ) Summary: Silence RAII_cpuMalloc(size_t) defined but not used [-Wunused-function] Test Plan: Existing tests Differential Revision: D69623481 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147150 Approved by: https://github.com/henrylhtsang	2025-02-14 23:21:35 +00:00
PyTorch MergeBot	aac5d1a289	Revert "Add torch._scaled_mm for CPU (#139975 )" This reverts commit `f0bdc27f74`. Reverted https://github.com/pytorch/pytorch/pull/139975 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it looks like internal ideep version is too old to support this ([comment](https://github.com/pytorch/pytorch/pull/139975#issuecomment-2660008996))	2025-02-14 18:31:54 +00:00
Zhengxu Chen	0b84311842	[export] Generate printers/parsers for serialization enum values. (#147126 ) Summary: Generate two helper functions for enum classes in generated_serialization_types.h printEnum: will convert enum values into strings. parseEnum: will convert strings into enum values. Test Plan: CI Differential Revision: D69604850 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147126 Approved by: https://github.com/yiming0416	2025-02-14 02:14:35 +00:00
Jiang, Yanbing	f0bdc27f74	Add torch._scaled_mm for CPU (#139975 ) This PR adds `torch._scaled_mm` for the CPU backend. `_scaled_mm_out_cpu` and `_scaled_mm_cpu` are newly added and included in the `torch._scaled_mm` CPU dispatch. We also add `_scaled_mm_out_cpu_emulated` as a fallback function if the current platform cannot run FP8 matmul using oneDNN. This PR also updates the various UTs related to FP8 to support CPU tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139975 Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet	2025-02-14 02:03:53 +00:00
PyTorch MergeBot	9a883007a2	Revert "Implement cuda graphs implementation of torch.cond and torch.while_loop (#140979 )" This reverts commit `c7515da7b0`. Reverted https://github.com/pytorch/pytorch/pull/140979 on behalf of https://github.com/huydhn due to This change has been reported to break internal code ([comment](https://github.com/pytorch/pytorch/pull/140979#issuecomment-2657361940))	2025-02-13 18:04:26 +00:00
Mu-Chu Lee	e21181642f	[AOTInductor] Align behavior between CPU and GPU (#145459 ) Summary: (1) Make sure CPU and GPU don't have different implementations or behavior when called through the same path and API. After this PR, the only difference between CPU and GPU should be the hardware they run on. (2) This PR fixes the issue of memory access when it==constants_map.end() (3) This PR resolves T179437596 Test Plan: buck2 run mode/dev sigmoid/inference/test:e2e_test_cpu Differential Revision: D68540744 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145459 Approved by: https://github.com/desertfire, https://github.com/hl475	2025-02-13 09:50:18 +00:00
Xia, Weiwen	ca3aabc8e6	[Inductor][CPU] Add a lowering pass for _weight_int4pack_mm_for_cpu (#145250 ) Summary: This is part of the task to enable max-autotune with the GEMM template for WoQ INT4 GEMM on CPU. This PR adds a lowering pass for `torch.ops.aten._weight_int4pack_mm_for_cpu`. This op is used for WoQ int4 in Torchao. The lowering pass is a prerequisite for max-autotune, which is planned to be enabled for this op in subsequent PRs. Test plan ``` python test/inductor/test_mkldnn_pattern_matcher.py -k test_woq_int4 python test/inductor/test_cpu_cpp_wrapper.py -k test_woq_int4 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145250 Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168 ghstack dependencies: #145245	2025-02-13 08:40:12 +00:00
Yu, Guangye	aa20b4b6cf	Friendly handle mem_get_info's runtime error message (#146899 ) # Motivation Friendly handle the runtime error message if the device doesn't support querying the available free memory. See https://github.com/intel/torch-xpu-ops/issues/1352 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146899 Approved by: https://github.com/EikanWang	2025-02-13 06:26:19 +00:00
Rachel Guo	88d0bb0fee	[aoti_debug_printer][BE] explicitly dumping float32, bfloat16, float16 data type (#147020 ) Summary: per request, explicitly dumping the float dtypes for aten tensors in debug printing summary info. can be useful in identifying issues such as "wrong AOTI Lowering precisions" Test Plan: ``` AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=2 TORCH_LOGS="+inductor, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_addmm ``` Differential Revision: D69547344 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147020 Approved by: https://github.com/jingsh, https://github.com/ColinPeppler	2025-02-13 04:41:00 +00:00
Brian Hirsh	ec0b318ddb	[poc] force UntypedStorage.from_buffer(buf) to return meta storage under FakeTensorMode (#146642 ) context here: https://fb.workplace.com/groups/326136610199609/permalink/495389539940981/ This PR is an attempt to make it such that if you create a tensor from an external buffer (using `UntypedStorage.from_buffer(buf)`), we can generate a proper fake tensor for you out of the box. The annoying bit is that there are not any dispatcher ops to interpose on and change behavior. So instead, I took the manual C binding and tweaked the storage device to be "meta" if we see an active fake mode. Put "poc" in the title since I... think this is hopefully reasonable, but I can be convinced that it's not :) ``` from torch._subclasses.fake_tensor import FakeTensorMode import pickle import io import torch from contextlib import nullcontext use_fake_tensor = True with FakeTensorMode() if use_fake_tensor else nullcontext(): obj = [1, 2] f = io.BytesIO() pickle.Pickler(f).dump(obj) byte_storage = torch.ByteStorage._from_buffer(f.getvalue()) # type: ignore[attr-defined] t = torch.ByteTensor(byte_storage) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146642 Approved by: https://github.com/zou3519	2025-02-12 20:57:10 +00:00
Zhou Fang	d774a6333d	[StaticRuntime] Support a new pattern for ClipRangesToGatherToOffsets (#146931 ) Summary: Support the following new pattern for ClipRangesToGatherToOffsets: Before optimization: ``` %18267 : Tensor, %18268 : Tensor = fb::clip_ranges_gather(%int_77.1, %getitem_2484.1, %493) %getattr_368.1 : int = prim::dtype(%18267) %to_443.1 : Tensor = aten::to(%18268, %getattr_368.1, %self._maybe_compute_kjt_to_jt_dict.is_weighted, %self._maybe_compute_kjt_to_jt_dict.is_weighted) %lengths_to_offsets_490.1 : Tensor = fb::lengths_to_offsets(%to_443.1, %8) ``` After optimization: ``` %18297 : int = prim::dtype(%int_77.1) %18298 : Tensor, %18299 : Tensor = fb::clip_ranges_gather_to_offsets(%int_77.1, %getitem_2484.1, %493, %8, %18297) ``` Reviewed By: garroud Differential Revision: D69373835 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146931 Approved by: https://github.com/hanyilou123	2025-02-12 08:19:41 +00:00
Zhengxu Chen	683bb1242c	[export][ez] Update tag_ for union setters. (#146912 ) Summary: ez fix to set tag for union type fields. Test Plan: CI Differential Revision: D69467715 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146912 Approved by: https://github.com/yiming0416	2025-02-12 03:52:36 +00:00
Zhengxu Chen	664550ecbf	[export] Serialize special values of float into strings for json. (#146490 ) Summary: Currently inf is serialized as Infinity in JSON which is not standard compliant. Instead we will tweak all special floating points into strings and handle them at json layer. Test Plan: see D69060784 CI Differential Revision: D69186425 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146490 Approved by: https://github.com/yiming0416	2025-02-11 20:01:27 +00:00
Daniel Galvez	c7515da7b0	Implement cuda graphs implementation of torch.cond and torch.while_loop (#140979 ) This is a new PR for #130386 , which got stale and was closed. Since I force-pushed to that branch in order to rebase it on top of main, the PR can no longer be reopened, according to https://github.com/isaacs/github/issues/361 I fixed the possibly-not-warmed-up problem described here: https://github.com/pytorch/pytorch/pull/130386/files#r1690856534 Since starting this, torch.cond and torch.while_loop now apparently have support for backward passes. I will look into what it might take to support that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140979 Approved by: https://github.com/eqy, https://github.com/eellison	2025-02-11 18:16:15 +00:00
Zhou Fang	fc5913b6bf	[StaticRuntime] Fix a bug that memory planner ignores subblocks (#146728 ) (#146855 ) Summary: When a Static Runtime graph node has sub-blocks, the memory planner does not treat the sub-blocks' inputs as inputs of that node. As a result, the lifetimes of those inputs are computed incorrectly, and the corresponding tensor memory is released earlier than required, which causes errors. Differential Revision: D69195886 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146855 Approved by: https://github.com/swolchok	2025-02-11 13:59:54 +00:00
cyy	15635b14ce	[4/N] Remove unnecessary once flag usage (#146783 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/146783 Approved by: https://github.com/albanD	2025-02-11 13:55:06 +00:00
Ke Wen	30cbf13544	[PGNCCL] Associate tensor allocation support with NCCL version (#146842 ) This is a forward fix to #146589. For NCCL version lower than 2.19, previous PR would see `RuntimeError: NCCL mem allocator is not supported in this NCCL version`. This PR gates the support by checking link-time NCCL version via `ncclGetVersion`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146842 Approved by: https://github.com/XilunWu, https://github.com/wconstab, https://github.com/fduwjj ghstack dependencies: #146589	2025-02-11 02:52:52 +00:00
Yifu Wang	97f6480cf5	Fix an issue where functional collectives don't force fx stride on inputs when compiled (#146467 ) Fixes https://github.com/pytorch/pytorch/issues/146416 Also added contiguity checks in the C++ functional collective ops to prevent striding issues introduced during compilation from manifesting as silent correctness issues. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146467 Approved by: https://github.com/Chillee, https://github.com/lw, https://github.com/shunting314	2025-02-10 19:15:49 +00:00
Hyunho Yeo	5f621c5879	[MTIA] (3/n) Implement PyTorch APIs to query/reset device peak memory usage (#146710 ) Summary: Public summary (shared with Github): This diff implements a C++-Python binding to enable `reset_peak_memory_stats`. Test Plan: The test is implemented in the following diff. Reviewed By: yuhc Differential Revision: D68988673 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146710 Approved by: https://github.com/nautsimon	2025-02-10 16:57:09 +00:00
Ke Wen	effc545274	[DDP] Use NCCL allocated memory for gradient bucket (#146589 ) So that NVLink SHARP comes with zero-copy on H100+ platforms, for DDP applications. Less SM usage, less memory contention between NCCL kernel and compute kernels. Added env `DDP_DISABLE_COMM_MEM` as a back-out option: ``` An environment variable to disable comm-optimized memory pool. Default is 0, which means comm-optimized memory pool is enabled. Users can set it to 1 in case of seeing regression or OOM (because this comm MemPool may not share space with regular compute MemPool). ``` Differential Revision: [D69297766](https://our.internmc.facebook.com/intern/diff/D69297766) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146589 Approved by: https://github.com/syed-ahmed, https://github.com/c-p-i-o, https://github.com/fduwjj	2025-02-10 05:23:11 +00:00
Simon Fan	387c993c3b	[ca] remove private API: _compiled_autograd_should_lift (#146720 ) Since the functional autograd + compiled autograd migration, we don't trace into nodes anymore, and everything is lifted. We can't support this flag which tries to inline make_fx style in CA initial pass. There's no more usage internally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146720 Approved by: https://github.com/zou3519	2025-02-10 04:29:57 +00:00
Dingming Wu	fa34128435	revert PTD's change that leads to signature mismatch of printNcclCommProxyTrace (#146453 ) Summary: D68801098 introduced this function signature mismatch issue for printNcclCommProxyTrace. Revert it so that trunk build can pass. Test Plan: With the change, build of APS model using rcclexp can now pass: `sh scripts/ltian/run_jobs/fb_fm_v2/run_fb_fm_v2_job.sh -h T20_GTT_MI300X -n 16 -b 1024 -t [2024-12-06] -d ai_infra_ngs -e ai_infra_training_rnd_tc -x 0` Reviewed By: c-p-i-o Differential Revision: D69149588 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146453 Approved by: https://github.com/c-p-i-o	2025-02-07 22:43:52 +00:00
Shunting Zhang	bc0191802f	[inductor] add size-asserts for fallback ops (#145904 ) Fix https://github.com/pytorch/pytorch/issues/144717 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145904 Approved by: https://github.com/jansel	2025-02-07 18:44:32 +00:00
Tristan Rice	68631f6e87	PyWork: preserve Python reference counting when used in functional collectives (#146376 ) @fegin found an issue where torchft is not compatible with functional collectives. Found in https://github.com/pytorch/torchtitan/pull/806 The root cause is that PyProcessGroup/PyWork are not compatible with functional collectives due to a nasty ownership bug. PyWork relies on a pybind trampoline to propagate requests to Python. Unfortunately, the way pybind works is that the Python object owns the C++ object rather than there being some form of shared ownership. Thus the PyWork Python object will be collected when it is returned to C++ from the PyProcessGroup, but the C++ PyWork object still exists. When the PyWork object is used, this causes a deadlock because the corresponding Python object no longer exists. To solve this, we introduce a new `PyWorkHolder` class which holds a reference to the `py::object` as well as the trampoline class. This resolves any dependency issues since we can now hold ownership in C++ of both the Python and C++ objects. To make this cleaner we introduce a `WORK_OVERRIDE` macro, which is a patched version of `PYBIND11_OVERRIDE` that returns a `PyWorkHolder` rather than just `PyWork`, and use it for all collectives in PyProcessGroup. Test plan: ``` cd pytorch pytest test/distributed/test_c10d_functional_native.py ``` ``` cd torchft pytest torchft/process_group_test.py -k functional -v -x -s ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146376 Approved by: https://github.com/yifuwang	2025-02-07 18:07:53 +00:00
cyy	25aa7ca62d	Cleanup CallOnce.h (#146700 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/146700 Approved by: https://github.com/albanD	2025-02-07 16:44:45 +00:00
cyy	fa0592b568	Remove some NOLINT (#146610 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/146610 Approved by: https://github.com/Skylion007, https://github.com/malfet	2025-02-07 01:50:06 +00:00
Michal Gallus	9ea1823f96	[ROCm][Windows] Remove external linkage from an anonymous namespace (#146607 ) Fixes a clang-cl compiler error caused by an attempt to export a symbol that doesn't have any external linkage, since it's declared within a local anonymous namespace. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146607 Approved by: https://github.com/jeffdaily	2025-02-06 23:48:20 +00:00
Michael Suo	99dd846672	[torch] fix builds for older pybind (#146630 ) Summary: some versions of pybind we build with don't have `py::set_error`. So just use the underlying python C API. Test Plan: unit tests Differential Revision: D69254629 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146630 Approved by: https://github.com/colin2328, https://github.com/ngimel	2025-02-06 21:22:00 +00:00
Eddie Yan	9ee506bd93	[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 ) Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs... Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441 Approved by: https://github.com/Chillee, https://github.com/malfet	2025-02-06 19:04:50 +00:00
Michael Suo	425804db2b	[torch] fix exception types in custom class magic setattr/getattr (#146516 ) Summary: `c10::AttributeError` is not automatically converted to Python AttributeError, it needs some special macros (e.g. `HANDLE_TH_ERRORS`). Some Python functions like `hasattr` rely on the type of the throw exception to be correct. We don't need the fully generality of those macros, so just do a targeted error type conversion here. Test Plan: added unit test Differential Revision: D69197217 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146516 Approved by: https://github.com/zdevito	2025-02-06 02:14:11 +00:00
Simon Fan	72405b0c0f	[ca] refactor compile reasons and log to tlparse (#146386 ) This PR accumulates compile reasons inside each CacheNode, and logs them to tlparse on each CA compile. This defines a compile as an autograd structure change, and a recompile as a dynamic shape change. sample tlparse: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpdbo7gt/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100 for compiles: ```python [ "!0: Cache miss due to new autograd node: torch::autograd::GraphRoot (NodeCall 0) with key size 39, previous key sizes=[]" ] ``` for recompiles: ```python [ "!0: Cache miss due to new autograd node: torch::autograd::GraphRoot (NodeCall 0) with key size 39, previous key sizes=[]", "!1: Cache miss due to 7 changed tensor shapes (total of 7): sizes[0], sizes[1], sizes[2], sizes[3], sizes[4], sizes[5], sizes[6]" ] ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146386 Approved by: https://github.com/jansel ghstack dependencies: #146229	2025-02-05 23:33:21 +00:00
Zhengxu Chen	cd6c0707a8	[aoti] Assign proxy call args by name, and support default values. (#146263 ) Fixing the following issue when compiling the following program: ``` window = torch.hann_window(N_FFT).to(x.device) stft = torch.stft( x, N_FFT, HOP_LENGTH, window=window, return_complex=True ) magnitudes = stft[..., :-1].abs() ** 2 return magnitudes ``` ``` Traceback (most recent call last): File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/unittest/case.py", line 57, in testPartExecutor yield File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/unittest/case.py", line 623, in run self._callTestMethod(testMethod) File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/unittest/case.py", line 579, in _callTestMethod if method() is not None: ^^^^^^^^ File "/home/zhxchen17/pytorch/torch/testing/_internal/common_utils.py", line 3120, in wrapper method(*args, **kwargs) File "/home/zhxchen17/pytorch/test/inductor/test_torchinductor.py", line 12356, in new_test return value(self) ^^^^^^^^^^^ File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor.py", line 4334, in test_stft self.check_model(model, example_inputs) File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor_utils.py", line 185, in check_model actual = AOTIRunnerUtil.run( ^^^^^^^^^^^^^^^^^^^ File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor_utils.py", line 137, in run optimized = AOTIRunnerUtil.load(device, so_path) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor_utils.py", line 119, in load return torch._export.aot_load(so_path, device) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/zhxchen17/pytorch/torch/_export/__init__.py", line 165, in aot_load runner = torch._C._aoti.AOTIModelContainerRunnerCuda(so_path, 1, device) # type: ignore[assignment, call-arg] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ RuntimeError: Expected extern kernel aten::hann_window to have serialized argument type as_scalar_type for argument 1 but got as_device ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146263 Approved by: https://github.com/angelayi	2025-02-05 15:43:05 +00:00
Simon Fan	e20b0c82d1	[ca] no longer require is_traceable annotations for c++ autograd functions (#146229 ) This PR removes the CA compile-time error for C++ autograd functions, and supports them by having dynamo graph break on them (instead of allow_in_graph). The CppNode's collects are kept as is for now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146229 Approved by: https://github.com/jansel, https://github.com/zou3519	2025-02-05 08:49:17 +00:00
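
Commit 664550ecbf above notes that `inf` serialized as `Infinity` is not standard-compliant JSON and that special floating-point values are turned into strings instead. The following is a minimal Python sketch of that general idea only, not the export serializer's actual code: the helper names `encode_float` and `decode_float` and the exact string spellings are assumptions made for illustration.

```python
import json
import math

# Hypothetical spellings for the special values; the real serializer may differ.
_SPECIAL = {"Infinity": math.inf, "-Infinity": -math.inf, "NaN": math.nan}


def encode_float(x: float):
    """Return a JSON-safe value: inf, -inf and NaN become strings."""
    if math.isnan(x):
        return "NaN"
    if math.isinf(x):
        return "Infinity" if x > 0 else "-Infinity"
    return x


def decode_float(v) -> float:
    """Inverse of encode_float: map the special strings back to floats."""
    if isinstance(v, str):
        return _SPECIAL[v]
    return float(v)


values = [1.5, math.inf, -math.inf, math.nan]
# allow_nan=False makes json.dumps raise if a non-standard value slips through.
payload = json.dumps([encode_float(v) for v in values], allow_nan=False)
restored = [decode_float(v) for v in json.loads(payload)]
```

Passing `allow_nan=False` is a deliberate choice in this sketch: Python's `json.dumps` emits the non-standard tokens `Infinity` and `NaN` by default, so the strict flag turns an unencoded special value into an error rather than silently producing invalid JSON.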

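Commit 425804db2b above points out that Python functions such as `hasattr` rely on the raised exception having the correct type. The pure-Python sketch below illustrates that dependency without involving torch; the two classes and the `probe` helper are hypothetical stand-ins, not PyTorch code.

```python
class RaisesAttributeError:
    """Stand-in for an object whose getattr hook raises the expected type."""

    def __getattr__(self, name):
        raise AttributeError(name)


class RaisesRuntimeError:
    """Stand-in for an object whose getattr hook raises a different type."""

    def __getattr__(self, name):
        raise RuntimeError(f"no attribute {name}")


def probe(obj, name):
    """Return (hasattr result, None), or (None, exception type name) if it raised."""
    try:
        return hasattr(obj, name), None
    except Exception as exc:  # hasattr only swallows AttributeError
        return None, type(exc).__name__


good = probe(RaisesAttributeError(), "missing")
bad = probe(RaisesRuntimeError(), "missing")
```

With the `AttributeError` variant, `hasattr` returns `False` as callers expect; with any other exception type, the error propagates out of `hasattr`, which is the kind of behavior a targeted error type conversion avoids.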

15153 Commits