pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 00:21:07 +01:00

Author	SHA1	Message	Date
Giulio D'Ippolito	004dad48f7	Allow to set custom PYTHONPATH for torch.inductor (#152832 ) When using Bazel, it’s common to encounter issues like [this](https://github.com/bazelbuild/bazel/issues/14640) and [this](https://github.com/bazel-contrib/rules_python/issues/792) where the `PYTHONPATH` environment variable becomes too long and results in an error such as: `OSError: [Errno 7] Argument list too long` . To work around this, users often resort to custom logic to manipulate PYTHONPATH. Currently, PyTorch Inductor constructs the PYTHONPATH for a subprocess using sys.path, which can lead to this issue in certain environments. This PR introduces support for a new environment variable, `TORCH_CUSTOM_PYTHONPATH`, allowing users to override the default `PYTHONPATH` passed to the subprocess. This provides a clean way to avoid an exception when using PyTorch in Bazel. Please let me know if I need to add some documentation to support this PR. I haven't found an open issue specific to this change but I'm confident that this change (or a similar one) would be appreciated by a few. Pull Request resolved: https://github.com/pytorch/pytorch/pull/152832 Approved by: https://github.com/masnesral	2025-05-15 06:35:41 +00:00
Xia, Weiwen	55784be01b	[Quant][X86] add ops to compute uint8 pointwise add/add_relu (#152411 ) Summary This PR adds two new ops, `onednn.qadd.tensor` and `onednn.qadd_relu.tensor`, for int8 elementwise add, which accepts inputs on CPU device (instead of QuantizedCPU). The new ops are implemented with AVX512 instructions and it provides similar or better performance, depending on shape, than its counterpart for QuantizedCPU device `quantized.add` and `quantized.add_relu`. The new op supports output dtypes other than uint8 (fp32, fp16 and bf16 are supported). Test plan ``` pytest test/quantization/core/test_quantized_op.py -k test_int8_add_onednn ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/152411 Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168	2025-05-15 06:23:01 +00:00
Zizeng Meng	a762dd1f67	[Memento] On-demand mode using without torch api (#153171 ) Summary: CUDA Post: https://fb.workplace.com/groups/ai.efficiency.tools.users/permalink/2020094788475989/ # Context In this diff, we want to enable the on-demand mode of memory snapshot so that users can trace any remote process via the dyno command line. # Design decision How do we send the on-demand signal to a remote process? We leverage the dyno-Kineto approach. Since dyno runs on all machines at Meta, it can send a request to the remote machine to start Kineto. Kineto then starts another thread for memoryProfiler (https://fburl.com/code/dxsmmrok). Why do we use a different approach from CUDA? On the CUDA side, we use pybind to load the torch module and invoke the Python API to start/stop profiling. However, this requires compiling the whole torch binary into the predictor, which is not recommended by runtime (andruwang). Thus, we decided to use the C++ API directly to avoid the unnecessary dependency. Why is the snapshot saved as a JSON string instead of pickle? Pickle is primarily designed for use with Python and is not well supported in C++. Also, it is hard for users to download the snapshot file and open it locally. Due to the dependency issue, it is hard to import the gzip/pickle libraries to decode the data. Thus, let's use JSON for now. I will work on the visualizer to speed up rendering and support other formats later. Plan: * For now, encode the file as gz for MTIA on-demand only and update the visualizer to support both types. * Update auto-trace and the CUDA side to encode in gzip as well. * Fully remove the pickle dependency. 
Test Plan: # Remote cogwheel test Servicelab: https://fburl.com/servicelab/pckux7a3 snapshot file manifold: https://fburl.com/manifold/fnotk18c snapshot file in pastry: P1805522232 Visualization on D74399684 {F1977786422} # Local Predictor Test url: https://fburl.com/pytorch_memory_visualizer/y06kskkm {F1977787329} Differential Revision: D74179606 Pull Request resolved: https://github.com/pytorch/pytorch/pull/153171 Approved by: https://github.com/sraikund16	2025-05-15 06:07:04 +00:00
bobrenjc93	181bfabb9e	fix set_logs for a single child log file (#153580 ) Tested via ``` + import logging + torch._logging.set_logs(modules={"torch._functorch._aot_autograd.autograd_cache": logging.DEBUG}) ``` ``` python test/dynamo/test_aot_autograd_cache.py -k test_multi_graph_specialization ``` and verifying logs are printed Pull Request resolved: https://github.com/pytorch/pytorch/pull/153580 Approved by: https://github.com/ColinPeppler	2025-05-15 05:58:45 +00:00
Animesh Jain	9839ec1383	[dynamo][compile-time] Cache method on load builtin (#153524 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/153524 Approved by: https://github.com/StrongerXi, https://github.com/jansel ghstack dependencies: #153522	2025-05-15 05:54:15 +00:00
Animesh Jain	b47be23461	[dynamo][compile-time] Faster inspect getattr_static for torch.Tensor (#153522 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/153522 Approved by: https://github.com/StrongerXi, https://github.com/jansel	2025-05-15 05:54:15 +00:00
henrylhtsang	910d2f96af	[cutlass backend] forward fix cutlass backend A100 test (#153428 ) Forward fix of https://github.com/pytorch/pytorch/pull/153006, which broke a test. In the long run, we should get rid of CUDATemplateCaller.category. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153428 Approved by: https://github.com/ColinPeppler	2025-05-15 05:45:38 +00:00
hanchao	0ca91af6b8	Define USE_C10D_XCCL and USE_XCCL in pytorch (#147593 ) ### Motivation: Add `USE_XCCL` and `USE_C10D_XCCL` to enable support of XCCL backend building in stock PyTorch, similar to `USE_NCCL` and `USE_C10D_NCCL`. By default, `USE_XCCL` is OFF and can be set to ON explicitly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147593 Approved by: https://github.com/guangyey, https://github.com/malfet, https://github.com/albanD, https://github.com/cyyever	2025-05-15 05:39:00 +00:00
Reed Evans	ebd3268538	Removed duplicate patterns from gitignore (#153515 ) Removed duplicate patterns from gitignore. These patterns are duplicated verbatim on lines 148-169. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153515 Approved by: https://github.com/soulitzer	2025-05-15 05:38:42 +00:00
Chien-Chin Huang	b992a665d1	Fix AsyncMM not compiled with SM90a issue (#153519 ) The CMakeLists.txt is wrong and doesn't enable SM90a for AsyncMM.cu Pull Request resolved: https://github.com/pytorch/pytorch/pull/153519 Approved by: https://github.com/drisspg, https://github.com/ngimel, https://github.com/cyyever	2025-05-15 05:23:29 +00:00
Nikita Shulga	d5ddc5ab20	[MPS] Fix float64 scalar tensor handling (#153582 ) The current implementation causes a silent correctness problem when someone tries to `torch.compile` a function where one of the arguments is, say, `np.exp(.3)`, which will be represented as a torch.float64 scalar tensor. Add a regression test for this behavior. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153582 Approved by: https://github.com/dcci	2025-05-15 05:15:14 +00:00
Mandar Deshpande	3e8bda4ad5	[pytorch][triton] flex attention fwd kernel with TMA loads (#151923 ) (#152460 ) Summary: Device side TMA for flex_attention fwd kernel, Q K V tensors Test Plan: Unit test: ``` buck test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:flex_attention -- test_tma_with_customer_kernel_options ``` https://www.internalfb.com/intern/testinfra/testrun/14355223891618726 Differential Revision: D71082691 Pull Request resolved: https://github.com/pytorch/pytorch/pull/152460 Approved by: https://github.com/drisspg	2025-05-15 04:49:32 +00:00
Tsung-Hsien Lee	756fd80734	[BE] Improve the typing related to `model` input argument of `torch.compile()` (#153559 ) Summary: Match the `overload` typing with the original typing in function definition and adjust the corresponding comments. Test Plan: contbuild & OSS CI Differential Revision: D74746243 Pull Request resolved: https://github.com/pytorch/pytorch/pull/153559 Approved by: https://github.com/Skylion007	2025-05-15 04:49:26 +00:00
Robert Burke	d2f6c6df1d	unbreak fb:operator_benchmark_test (#152049 ) Summary: unbreak fb:operator_benchmark_test Test Plan: works on my machine Differential Revision: D73540912 Pull Request resolved: https://github.com/pytorch/pytorch/pull/152049 Approved by: https://github.com/hl475	2025-05-15 03:38:48 +00:00
Xuehai Pan	014726d9d3	[torchgen] Refactor `torchgen.utils.FileManager` to accept `pathlib.Path` (#150726 ) This PR allows `FileManager` to accept `pathlib.Path` as arguments while keeping the original `str` path support. This allows us to simplify the code such as: 1. `os.path.join(..., ...)` with `Path.__truediv__(..., ...)`. `95a5958db4/torchgen/utils.py (L155)` `95a5958db4/torchgen/utils.py (L176)` 2. `os.path.basename(...)` with `Path(...).name`. `95a5958db4/torchgen/utils.py (L161)` 3. Manual file extension split with `Path(...).with_stem(new_stem)` `95a5958db4/torchgen/utils.py (L241-L256)` ------ Pull Request resolved: https://github.com/pytorch/pytorch/pull/150726 Approved by: https://github.com/aorenste	2025-05-15 02:52:24 +00:00
Daniel Vega-Myhre	881a598a1e	[FlexAttention] Enforce Q,K,V memory layouts for fp8 flex attention to avoid perf degradation (#153357 ) Fixes #147336 ## Context NCU analysis of the fp8 flex attention perf issue in #147336 showed an unexpected increase in shared memory access bank conflicts when loading the V tensor from HBM to SRAM. Bringing this to the attention of triton developer @davidberard98 he identified the memory layout of the tensor in HBM to be causing non-pipelined loads into SRAM, causing the slowdown. To summarize: In flex attention when performing the FP8 GEMM `softmax_scores @ V` the right operand V must be in column-major memory layout. However, the `tl.load` of V blocks from HBM to SRAM cannot be pipelined if the V tensor isn't column-major in HBM already, leading to substantial performance degradation. This is because triton does not perform async copies with the `cp.async` PTX instruction if the number of contiguous bytes is less than 4 (see [here](`81f93f2c8e/lib/Dialect/TritonGPU/Transforms/Pipeliner/PipeliningUtility.cpp (L403)`)). i.e., when loading 4 bytes of contiguous data from a tensor stored in row-major in HBM, we have to perform 4 separate non-contiguous writes to SRAM to place those bytes in their new location in the col-major layout in SRAM. Thus the load is not a candidate for pipelining w/ cp.async and just moves data to registers then performs a series of single byte stores. ## Fix summary - To fix this, we should enforce memory layouts for Q, K, V in FlexAttention when fp8 is being used, to ensure they each exist in HBM in the necessary memory layout to facilitate pipelined loads into SRAM ahead of the FP8 GEMMs ## Benchmarks Rerunning the repro we see fp8 runtime is reduced from 120% of bf16 to 76% of bf16 runtime. 
Before fix: ``` (flex) [danvm@devgpu007.eag6 ~/ml-perf-tools/flex_attention (main)]$ rm -rf /tmp/torchinductor_${USER}; python profile_flex.py --bf16 --fp8 2025-05-11 19:07:33,402 - flex_bench - INFO - Running benchmark: bf16 2025-05-11 19:07:35,885 - flex_bench - INFO - bf16: 424.87228804347734 us 2025-05-11 19:07:35,893 - flex_bench - INFO - Running benchmark: fp8e4m3 2025-05-11 19:07:37,319 - flex_bench - INFO - fp8e4m3: 515.714000000001 us ``` After fix: ``` (flex) [danvm@devgpu007.eag6 ~/ml-perf-tools/flex_attention (main)]$ rm -rf /tmp/torchinductor_${USER}; python profile_flex.py --bf16 --fp8 2025-05-11 17:34:38,223 - flex_bench - INFO - Running benchmark: bf16 2025-05-11 17:34:41,157 - flex_bench - INFO - bf16: 423.4662032967036 us 2025-05-11 17:34:41,167 - flex_bench - INFO - Running benchmark: fp8e4m3 2025-05-11 17:34:42,917 - flex_bench - INFO - fp8e4m3: 326.3694803493453 us ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/153357 Approved by: https://github.com/ngimel, https://github.com/davidberard98	2025-05-15 02:41:38 +00:00
eellison	eaf2dee10e	don't run triton mm for k<32 (#153550 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/153550 Approved by: https://github.com/suo Co-authored-by: Natalia Gimelshein <ngimel@meta.com>	2025-05-15 02:36:44 +00:00
karthickai	725bbb6b5f	[inductor][dynamo] Include operator name in size/stride/alignment assertion (#152353 ) Fixes #151930 This PR updates the `assert_size_stride` and `assert_alignment` functions in [guards.cpp](https://github.com/pytorch/pytorch/blob/main/torch/csrc/dynamo/guards.cpp) to accept an optional `op_name` argument and includes it in the error messages. The corresponding type stubs in [guards.pyi](https://github.com/pytorch/pytorch/blob/main/torch/_C/_dynamo/guards.pyi) are updated to match the new function arg. In [inductor/ir.py](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/ir.py) extracts the operator name from the FX graph and passes it into the `codegen_size_asserts` and `codegen_alignment_asserts` functions, so that generated assertions in Triton code include the op name for better debugging. Added unit tests inside [test_torchinductor.py](https://github.com/pytorch/pytorch/blob/main/test/inductor/test_torchinductor.py). - Verified both successful and failing assertion cases include the operator name. - Verified that generated Triton code contains the op name inside the asserts. Pull Request resolved: https://github.com/pytorch/pytorch/pull/152353 Approved by: https://github.com/jansel	2025-05-15 02:33:57 +00:00
henrylhtsang	f5e0806f34	[cutlass backend] Add back descriptive names for epilogue fusion (#153405 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/153405 Approved by: https://github.com/mlazos	2025-05-15 01:47:52 +00:00
zeshengzong	82dc3457e0	Add `load_state_dict` hint doc about invoke order work with lr_scheduler (#149942 ) Fixes #119168 ## Test Result ![image](https://github.com/user-attachments/assets/edb8124c-f103-475a-b903-20fbc71fdea6) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149942 Approved by: https://github.com/janeyx99 Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>	2025-05-15 01:07:36 +00:00
cyy	781ba0ac9d	Update CMake to 3.27 in Windows CI (#153380 ) A prerequisite before newer CMake can be enabled. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153380 Approved by: https://github.com/albanD	2025-05-15 00:19:32 +00:00
Ting Lu	c2bc7e2827	API change for new enum in cusparseltsplitkmode-t for cusparseLT 0.7.0+ (#150536 ) Changing the bool to int to express split_k_mode. Before 0.7.0 we only have 2 cusparseLtSplitKMode_t enum values ONE_KERNEL and TWO_KERNELS so a boolean is enough but since 0.7.0 there are more. For Blackwell, there has to be minor change to parameter split_k_one_kernel (https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/sparse/cuda/cuSPARSELtOps.cpp#L103), since there are new values introduced to enum [cusparseLtSplitKMode_t](https://docs.nvidia.com/cuda/cusparselt/types.html#cusparseltsplitkmode-t) and a bool type is not enough for it (would have to be replaced with integer) https://docs.nvidia.com/cuda/cusparselt/types.html#cusparseltsplitkmode-t Error we see without the change ``` RuntimeError: CUDA error: invalid value when calling `cusparseLtMatmulAlgSetAttribute( &handle, &alg_sel, CUSPARSELT_MATMUL_SPLIT_K_MODE, &splitKMode, sizeof(splitKMode))` To execute this test, run the following from the base repo dir: python test/test_sparse_semi_structured.py TestSparseSemiStructuredCUSPARSELTCUDA.test_csrc_cslt_sparse_mm_search_cuda_int8 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/150536 Approved by: https://github.com/jcaip, https://github.com/atalman	2025-05-14 23:36:53 +00:00
Hashem Hashemi	72fee137dd	[ROCm] Maxpool forward NHWC Perf Improvement targeting Resnet scenarios (#151727 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/151727 Approved by: https://github.com/seemethere Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>	2025-05-14 22:34:55 +00:00
Aaron Gokaslan	e0dece510b	[Ez][BE]: Remove accidental classvar (#153540 ) Untyped variables become ClassVar in dataclasses; this type alias should just be a type alias, with no need for it to be a ClassVar. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153540 Approved by: https://github.com/albanD, https://github.com/aorenste	2025-05-14 21:55:56 +00:00
henrylhtsang	7412b33e91	[inductor] Use get to avoid possible keyerror at the end of precompilation (#153417 ) Shameful admission: I have encountered this error 1-2 times, but don't have a repro. torch/_inductor/select_algorithm.py", line 2022, in wait_on_futures elapsed_times[future], ~~~~~~~~~~~~~^^^^^^^^ torch._inductor.exc.InductorError: KeyError: <Future at 0x7fc4e394fb90 state=finished returned tuple> Pull Request resolved: https://github.com/pytorch/pytorch/pull/153417 Approved by: https://github.com/Skylion007, https://github.com/ColinPeppler	2025-05-14 21:49:43 +00:00
Aidyn-A	f2e8e41855	[Easy][Inductor] Adds safety checks in get_estimated_runtime (#152821 ) This PR adds checks on `gpu_memory_bandwidth` and `gpu_flops` in `get_estimated_runtime`. This will prevent division by zero and other potential incorrect values: `9210a98b92/torch/_inductor/scheduler.py (L864-L865)` `9210a98b92/torch/_inductor/scheduler.py (L874)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/152821 Approved by: https://github.com/eellison, https://github.com/jansel	2025-05-14 21:46:59 +00:00
Aaron Gokaslan	f887bfffda	Fix typo (#153561 ) Fix typo from #153386 Pull Request resolved: https://github.com/pytorch/pytorch/pull/153561 Approved by: https://github.com/albanD	2025-05-14 21:38:51 +00:00
Animesh Jain	03d01860fd	[dynamo][compile-time] Compute logging related flags once (#153426 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/153426 Approved by: https://github.com/jansel	2025-05-14 21:19:06 +00:00
Aaron Gokaslan	1bd6bc7190	[BE]: Enable ruff YTT linter for Python version checks (#153547 ) Adds ruff YTT checks to help future proof version checks and follow best practices here. Also makes it easier for static linters like mypy to detect python version branching. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153547 Approved by: https://github.com/albanD	2025-05-14 21:09:16 +00:00
PyTorch MergeBot	f363a3f51a	Revert "[cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for `sm90`, `sm100` (#149282 )" This reverts commit `9386701b51`. Reverted https://github.com/pytorch/pytorch/pull/149282 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, see [D74729259](https://www.internalfb.com/diff/D74729259). @drisspg may you help out the author have their PR merged? ([comment](https://github.com/pytorch/pytorch/pull/149282#issuecomment-2881546951))	2025-05-14 20:53:49 +00:00
Wang, Chuanqi	c92ea3bc98	[BE] Upgrade XPU support package to 2025.1 in CICD (#151899 ) Address #151097. Including below changes, - Add XPU support package 2025.1 build and test in CI for both Linux and Windows - Keep XPU support package 2025.0 build in CI to ensure no break issue until PyTorch 2.8 release - Upgrade XPU support package from 2025.0 to 2025.1 in CD for both Linux and Windows - Enable XCCL in Linux CD wheel and oneMKL integration in both Linux and Windows - Update XPU runtime pypi packages of CD wheels - Remove deprecated support package version docker image build Pull Request resolved: https://github.com/pytorch/pytorch/pull/151899 Approved by: https://github.com/EikanWang, https://github.com/atalman	2025-05-14 20:21:09 +00:00
David Berard	5e6e52e7c9	[JIT] add GRAPH_DEBUG for setGraphExecutorOptimize (#153549 ) Summary: Optionally log when setGraphExecutorOptimize is called, so we can get insight into the GraphExecutor behavior. Differential Revision: D74692508 Pull Request resolved: https://github.com/pytorch/pytorch/pull/153549 Approved by: https://github.com/PaulZhang12, https://github.com/SamGinzburg	2025-05-14 20:07:25 +00:00
James Wu	dda2c7c8fc	Pass inductor config for static cuda launcher to workers (#153382 ) Async compile workers don't respect inductor configs generally that get changed in the middle of execution because they warm up early. StaticCudaLauncher is especially susceptible to this because it affects triton compilation without being part of the inductor meta. So we'll pass it in via extra configs on each worker run. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153382 Approved by: https://github.com/masnesral, https://github.com/jansel	2025-05-14 20:01:32 +00:00
Aby Mathew C	6a28cc826f	Add TEST_HPU flag to set device type (#153461 ) MOTIVATION This PR includes a minor change to check for TEST_HPU flag as well before falling back to CPU. Without this flag, some tests were falling back to CPU causing them to fail. Please refer to this RFC as well: https://github.com/pytorch/rfcs/pull/66 CHANGES add TEST_HPU flag to some of the conditions checking the environment use DEVICE_COUNT variable instead of torch.accelerator.device_count() API since the latter is not supported on out-of-tree devices like Intel Gaudi. @ankurneog , @EikanWang , @cyyever , @guangyey Pull Request resolved: https://github.com/pytorch/pytorch/pull/153461 Approved by: https://github.com/EikanWang, https://github.com/cyyever, https://github.com/albanD	2025-05-14 19:31:40 +00:00
Ben Zickel	a54bf43baa	Fix support of MixtureSameFamily [bugfix]. (#151317 ) Fixes https://github.com/pyro-ppl/pyro/issues/3419 which is actually a `torch` bug that can be replicated by the below code: ``` from torch import rand from torch.distributions import MixtureSameFamily, Categorical, Binomial max_count = 20 probs = rand(10, 5) binom_probs = rand(10, 5) d = MixtureSameFamily(Categorical(probs=probs), Binomial(max_count, binom_probs)) d.log_prob(d.sample()) ``` which results in: ``` Traceback (most recent call last): File "test.py", line 11, in <module> d.log_prob(d.sample()) File "pytorch\torch\distributions\mixture_same_family.py", line 168, in log_prob self._validate_sample(x) File "pytorch\torch\distributions\distribution.py", line 315, in _validate_sample valid = support.check(value) ^^^^^^^^^^^^^^^^^^^^ File "pytorch\torch\distributions\constraints.py", line 307, in check (value % 1 == 0) & (self.lower_bound <= value) & (value <= self.upper_bound) ^^^^^^^^^^^^^^^^^^^^^^^^^ RuntimeError: The size of tensor a (10) must match the size of tensor b (5) at non-singleton dimension 1 ``` ### Fix explanation (only for cases when the component distribution contains parameters with batch dimensions) - The failure is due to sample validation taking place before padding in `MixtureSameFamily.log_prob`, and hence the fix is to pad before doing sample validation. - The fix itself does not alter the calculations at all. It only affects the sample validation process. - The failure does not occur with the component distribution set to the `Normal` distribution, as its validation is not defined elementwise (the validation itself is elementwise). - I've split the `test_mixture_same_family_log_prob` test into two tests based on the `Normal` and `Binomial` distributions. - Initially, the `Binomial` version of the test did not fail, but this was due to the component distribution having equal batch dimensions of (5, 5) so I changed it to (10, 5). 
### Updated fix explanation (for all cases) - The previous fix caused a bug in sample shape validation (which is done correctly) due to the padding taking place before the sample validation. - The updated fix corrects the support to reflect the fact that the support of `MixtureSameFamily` is equal to the support of its component distribution with the first event dimension removed. - This issue was already anticipated in the [code](`331423e5c2/torch/distributions/mixture_same_family.py (L127)`). Pull Request resolved: https://github.com/pytorch/pytorch/pull/151317 Approved by: https://github.com/albanD, https://github.com/fritzo	2025-05-14 19:24:36 +00:00
clr	534b66fe30	torch.compile: Remove reference to the unused dynamo_config.dynamic_shapes from (#153297 ) tests This config option is not set anywhere, and does nothing, so this should cause no changes to tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153297 Approved by: https://github.com/Skylion007	2025-05-14 19:02:51 +00:00
PyTorch MergeBot	bf0fe4f828	Revert "[CUDA][CUDNN] Dispatch to cuDNN for non-batch-splittable 64-bit NCHW convolutions (#153101 )" This reverts commit `ced90d23d3`. Reverted https://github.com/pytorch/pytorch/pull/153101 on behalf of https://github.com/jeanschmidt due to Seems to have introduced breakages on main, tentative revert: https://github.com/pytorch/pytorch/actions/runs/15024667248/job/42224521705 ([comment](https://github.com/pytorch/pytorch/pull/153101#issuecomment-2881208171))	2025-05-14 18:52:07 +00:00
Nikita Shulga	8749fe8439	[CI][MPS] Speedup test_large_bmm (#153562 ) By computing matmuls of only one random non-zero batch on CPU This reduces test runtime from 11 minutes to 14 sec ``` % python3 test/test_mps.py -v -k test_large_bmm_ test_large_bmm_bfloat16 (__main__.TestMPS.test_large_bmm_bfloat16) ... ok test_large_bmm_float16 (__main__.TestMPS.test_large_bmm_float16) ... ok ---------------------------------------------------------------------- Ran 2 tests in 27.495s ``` TODO: Compute it over two slices when https://github.com/pytorch/pytorch/issues/153560 is fixed Pull Request resolved: https://github.com/pytorch/pytorch/pull/153562 Approved by: https://github.com/Skylion007, https://github.com/clee2000	2025-05-14 18:49:42 +00:00
angelayi	47d6feff7c	[export] Support no inputs in unflattened module (#153474 ) Encountered in this diff D74589491 Pull Request resolved: https://github.com/pytorch/pytorch/pull/153474 Approved by: https://github.com/avikchaudhuri	2025-05-14 18:45:47 +00:00
PyTorch MergeBot	6ef1cbc191	Revert "[ROCm] Maxpool forward NHWC Perf Improvement targeting Resnet scenarios (#151727 )" This reverts commit `e6a9067260`. Reverted https://github.com/pytorch/pytorch/pull/151727 on behalf of https://github.com/jeanschmidt due to Seems to be breaking internal builds, @seemethere may you help the author? [D74729252](https://www.internalfb.com/diff/D74729252) ([comment](https://github.com/pytorch/pytorch/pull/151727#issuecomment-2881122917))	2025-05-14 18:18:17 +00:00
Aaron Gokaslan	533fc58453	[BE]: Fix typing None override other optimizers (#153386 ) Follow up to #153367 to fix other instances of it throughout the codebase Also fully type NamedOptimizer since we were so close Pull Request resolved: https://github.com/pytorch/pytorch/pull/153386 Approved by: https://github.com/tsunghsienlee, https://github.com/janeyx99, https://github.com/jansel, https://github.com/cyyever	2025-05-14 17:48:47 +00:00
Xu Zhang	2362bd4a4c	[Torch][NT] Fix NestedTensor contiguous check condition. (#153237 ) (#153529 ) Fixes #153237 Pull Request resolved: https://github.com/pytorch/pytorch/pull/153529 Approved by: https://github.com/jbschlosser	2025-05-14 17:15:48 +00:00
Ryan Guo	8bb67700a3	[dynamo] Support `delattr` on result of `torch.compile(module)` (#152741 ) This is essentially a follow-up on #122098, where we added support of `getattr` and `setattr` on result of `torch.compile(module)`, but didn't add support for `delattr`. Fixes #150711. Pull Request resolved: https://github.com/pytorch/pytorch/pull/152741 Approved by: https://github.com/anijain2305 ghstack dependencies: #152740	2025-05-14 17:03:59 +00:00
Ryan Guo	6765df052c	[dynamo] Emit warning on global module hooks when calling using output of `torch.compile(module)` (#152740 ) When we do `torch.compile(module)`, we eventually end up returning a new `OptimizedModule` instance, whose `forward` method is the result of `torch.compile(mod.__call__)`, meaning it already captures all the extra logic (e.g., hook firing) for the compiled module. `OptimizedModule` also inherits `nn.Module.__call__`, and thus has its own hook logic. This is useful for torchao, which injects module forward hooks to run in eager for quantization purposes. However, this might create unexpected behavior for global module hooks, because `torch.compile(module)` causes the hook to fire one extra time for `OptimizedModule`, when compared to eager. To preserve BC, we simply emit a warning for this behavior, and let users decide what to do. This is reasonable because the global module hooks are documented to be used for debugging/profiling purposes only. Fixes #149502 Differential Revision: [D74611716](https://our.internmc.facebook.com/intern/diff/D74611716) Pull Request resolved: https://github.com/pytorch/pytorch/pull/152740 Approved by: https://github.com/anijain2305, https://github.com/zou3519	2025-05-14 17:03:59 +00:00
Shangdi Yu	b3dea0c0dd	Change aoti cpp tests to run serially within file (#152960 ) Fixes #152674 https://github.com/pytorch/pytorch/issues/152889 https://github.com/pytorch/pytorch/issues/152888 https://github.com/pytorch/pytorch/issues/152891 `--dist=loadfile` ensures all tests in the same source file run in the same worker. Tests like `FreeInactiveConstantBufferRuntimeConstantFoldingCuda` expect exclusive access to memory during test time to compute diffs (e.g., initMemory - updateMemory2 == DATASIZE). With `-n 3`, tests run in separate processes, but CUDA device memory is shared — and cudaMemGetInfo() reads device-wide global state. ``` python test/run_test.py --cpp --verbose -i cpp/test_aoti_inference -dist=loadfile ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/152960 Approved by: https://github.com/desertfire, https://github.com/cyyever	2025-05-14 17:02:39 +00:00
Anthony Shoumikhin	ba70876407	Update lint_urls.sh (#153246 ) Treat 403, 429 and 503 http errors as success. Ignore non-verbal hostnames. Kill child jobs immediately. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153246 Approved by: https://github.com/malfet	2025-05-14 16:54:49 +00:00
Meet Vadakkanchery	b6b0080419	[DCP] Use multiprocess Pipes instead of Queues to improve communication contract with checkpointer process (#153488 ) Summary: ### Diff Context - PR introduces Pipes for multiprocess comms with checkpointer process. - Pipes allow easier comms contract management due to close() API and catch-all feature when background process is dead (e.g. seg faults). Test Plan: CI Differential Revision: D74668559 Pull Request resolved: https://github.com/pytorch/pytorch/pull/153488 Approved by: https://github.com/saumishr	2025-05-14 16:47:43 +00:00
Aaron Gokaslan	8799bffc34	[BE][Ez]: RUF200 - validate pyproject.toml metadata (#153543 ) Since we have pyproject.toml metadata for [project] and [build-requires], let's turn on the linter rules which validates this optional metadata to make sure it's properly formatted and follows the correct schema for standard Python build tools. Right now, incorrect metadata could silently error with how our CI is invoked or only provide warnings for invalid metadata. This check will help surface those errors. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153543 Approved by: https://github.com/albanD	2025-05-14 16:42:22 +00:00
Anthony Shoumikhin	7d39e73c57	Fix more URLs (#153277 ) Or ignore them. Found by running the lint_urls.sh script locally with https://github.com/pytorch/pytorch/pull/153246 Pull Request resolved: https://github.com/pytorch/pytorch/pull/153277 Approved by: https://github.com/malfet	2025-05-14 16:23:50 +00:00
fengqing.lu	de92296bbb	[Intel GPU] undo broadcast on zero stride tensor for SDPA (#151976 ) Fix https://github.com/pytorch/pytorch/issues/152290. The model hubert uses aten::expand to build attention mask by broadcasting. Pytorch uses strides[d]=0 to represent broadcast, which is not supported by oneDNN. This PR handles this scenario. Pull Request resolved: https://github.com/pytorch/pytorch/pull/151976 Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/drisspg	2025-05-14 16:09:03 +00:00
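
Commit 004dad48f7 above describes an environment variable, `TORCH_CUSTOM_PYTHONPATH`, that overrides the `PYTHONPATH` Inductor would otherwise build from `sys.path` for its subprocess. A minimal sketch of that selection logic follows. It is an illustration only, not the code from the PR: the helper name `build_subprocess_pythonpath` and its parameters are hypothetical, and only the environment variable name comes from the commit message.

```python
import os
import sys


def build_subprocess_pythonpath(environ=None, search_path=None):
    """Hypothetical helper: choose the PYTHONPATH to hand to a child process.

    If TORCH_CUSTOM_PYTHONPATH is set, use it verbatim; otherwise fall back
    to joining the interpreter's search path, which can become very long
    under build systems such as Bazel.
    """
    environ = os.environ if environ is None else environ
    search_path = sys.path if search_path is None else search_path

    custom = environ.get("TORCH_CUSTOM_PYTHONPATH")
    if custom is not None:
        return custom
    return os.pathsep.join(search_path)


# Build an environment for a child process without mutating os.environ.
child_env = dict(os.environ)
child_env["PYTHONPATH"] = build_subprocess_pythonpath()
```

Passing `environ` and `search_path` explicitly keeps the sketch easy to exercise without touching the real process environment.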


88238 Commits
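
Commit 014726d9d3 above lists three `os.path` idioms that can be replaced with `pathlib.Path` operations. The sketch below shows the standard-library equivalents side by side. The file names are made up for illustration and are not taken from `torchgen`. Note that the `/` operator on paths is implemented by `Path.__truediv__`, and `Path.with_stem` requires Python 3.9 or newer.

```python
import os.path
from pathlib import Path

# 1. Joining path components: os.path.join versus the `/` operator.
base = Path("build") / "generated"
joined_old = os.path.join("build", "generated", "ops.cpp")
joined_new = base / "ops.cpp"

# 2. Extracting the final component: os.path.basename versus Path.name.
name_old = os.path.basename(joined_old)
name_new = joined_new.name

# 3. Changing the stem while keeping the extension (Python 3.9+),
#    instead of splitting the extension off by hand.
renamed = joined_new.with_stem("ops_0")
```

Both spellings produce the same path strings on a given platform, so a function can accept either `str` or `Path` by wrapping its argument in `Path(...)` on entry.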