pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
PyTorch MergeBot	cf7447ae99	Revert "cpp_wrapper: Fix even more tests (#147225 )" This reverts commit `d25acac357`. Reverted https://github.com/pytorch/pytorch/pull/147225 on behalf of https://github.com/yangw-dev due to broke test internally test/inductor/test_benchmark_fusion ([comment](https://github.com/pytorch/pytorch/pull/147225#issuecomment-2761944564))	2025-03-28 17:07:52 +00:00
PyTorch MergeBot	1a3bd894ff	Revert "[fbcode]Removing `@NoIntBaseDeprecated` annotation in `caffe2.thrift` file (#149742 ) (#149744 )" This reverts commit `6eac3a0068`. Reverted https://github.com/pytorch/pytorch/pull/149744 on behalf of https://github.com/malfet due to Broke tests, see `80aa88f907/1` ([comment](https://github.com/pytorch/pytorch/pull/149744#issuecomment-2759676260))	2025-03-27 22:31:54 +00:00
Keke Zhai	68414512e6	Implement aten.select.int sharding strategy (#149842 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149842 Approved by: https://github.com/XilunWu	2025-03-27 20:49:00 +00:00
Benjamin Glass	d25acac357	cpp_wrapper: Fix even more tests (#147225 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147225 Approved by: https://github.com/desertfire	2025-03-27 19:21:03 +00:00
vasiliy	01cb3519b3	wire torch._scaled_mm with fp4 operands to the cublas nvfp4 kernel (#148792 ) Summary: When `a` and `b` have dtype `torch.float4_e2m1fn_x2` and `a_scale` and `b_scale` have dtype `torch.float8_e4m3fn`, makes ```python c = torch._scaled_mm(a, b, a_scale, b_scale, out_dtype=torch.bfloat16) ``` call the cuBLAS fp4 gemm kernel, as specified in https://docs.nvidia.com/cuda/cublas/index.html?highlight=fp4#d-block-scaling-for-fp8-and-fp4-data-types note: output scale (`scale_in_D` from the cuBLAS docs) is not tested in this PR - we can enable in a follow-up. Test Plan: ```bash pytest test/test_matmul_cuda.py -s -k mxfp8_nvfp4 ``` Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/148792 Approved by: https://github.com/eqy ghstack dependencies: #148791	2025-03-27 17:32:20 +00:00
Danfeng Wang	6eac3a0068	[fbcode]Removing `@NoIntBaseDeprecated` annotation in `caffe2.thrift` file (#149742 ) (#149744 ) Summary: To align with thrift-python, we are adding the int base class for `non-Flag` enums. In order to not break production code, the annotation `python.NoIntBaseClassDeprecated` is added to opt-out some enums After the related customer code logic changes, we can now safely remove the annotations that were added earlier. Our ultimate goal is to unconditionally add the `int` base to `thrift-py3` enums. Test Plan: ``` buck test 'fbcode//mode/opt' fbcode//caffe2/torch/fb/training_toolkit/applications/bulk_eval/tests:evaluator_test -- --exact 'caffe2/torch/fb/training_toolkit/applications/bulk_eval/tests:evaluator_test - test_setup_evaluation_utils (caffe2.torch.fb.training_toolkit.applications.bulk_eval.tests.evaluator_test.EvaluatorTest)' ``` Reviewed By: ahilger Differential Revision: D71446522 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149744 Approved by: https://github.com/izaitsevfb, https://github.com/huydhn	2025-03-27 17:11:26 +00:00
Mu-Chu Lee	a0253d2840	[Inductor] Use real input to autotune user defined triton kernels (#149553 ) Summary: User defined Triton kernel sometimes rely on real inputs to determine the path of execution. We need real inputs to invoke the correct behavior of the user defined triton kernels (see example in test case, where we have an early return for random inputs) Test Plan: Included in the commit. python test/inductor/test_aot_inductor.py -k triton_autotuning python test/inductor/test_aot_inductor.py -k triton_mutated_autotuning Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/149553 Approved by: https://github.com/davidberard98, https://github.com/eellison	2025-03-26 16:42:48 +00:00
PyTorch MergeBot	1b373f6cd4	Revert "cpp_wrapper: Fix even more tests (#147225 )" This reverts commit `62d351a35b`. Reverted https://github.com/pytorch/pytorch/pull/147225 on behalf of https://github.com/yangw-dev due to broke [ROCM mi300 test](https://github.com/pytorch/pytorch/actions/runs/14066803692/job/39393110086) in [HUD](https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=rocm-mi300%20%2F%20linux-focal-rocm6.3-py3.10%20%2F%20test%20(default%2C%201%2C%206%2C%20linux.rocm.gpu.mi300.2)&mergeLF=true) ([comment](https://github.com/pytorch/pytorch/pull/147225#issuecomment-2752799778))	2025-03-26 00:03:13 +00:00
Benjamin Glass	62d351a35b	cpp_wrapper: Fix even more tests (#147225 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147225 Approved by: https://github.com/desertfire ghstack dependencies: #146706	2025-03-25 17:58:40 +00:00
Shangdi Yu	46dd226702	Fakify torchbind objects in compile_fx and add tests for SigridTransformsInstanceTorchBind (#149529 ) Summary: We need to properly fakify torchbind objects, including the ones in graph module attributes, so the resgitered fake implementation works properly. - _fakify_script_objects in `compile_fx` - Allow fake torchbind objects in `torchbind_constants` Remove `node.meta["unbacked_bindings"]` for `aot_compile` in `compile_fx`. Otherwise `ShapeProp` will fail when trying to resolve the `unbacked_bindings` of `with_effect` tokens. Update `sigrid_transforms_test` to use the latest `torch._inductor.aot_compile` API. Add a test for `Fakify torchbind objects in compile_fx and add tests for SigridTransformsInstanceTorchBind` in `e2e_test`. Test Plan: ``` buck run //caffe2/torch/fb/sparsenn:sigrid_test -- -r test_transform_torch_bind buck run //sigmoid/inference/test:e2e_test_cpu -- -r SigridTransforms buck2 run mode/dev-nosan sigmoid/inference/ts_migration:pt2i_readiness_main -- --model_id 545017754 --test_suite ads_all --mode test_preproc ``` Differential Revision: D70013257 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149529 Approved by: https://github.com/angelayi	2025-03-21 18:58:28 +00:00
IvanKobzarev	2c4bc65366	[aotd] Guess tangents stride as output strides (#144579 ) AOTDispatch doing AOT backward graph preparation does not know real tangents that user will specify when runs backward. AOTD guesses the tangents. Before - we guessed that memory format of tangents will be as memory format of corresponding outputs. And if specified tangents at runtime are not the same memory format as we guessed during compilation, AOTD does coercion (copy) to guessed memory_format But as Horace found, there are popular use cases, where the outputs of compiled region will be in specific memory_format. E.g. in 4D tensor transposing dims 1 and 2. https://github.com/karpathy/nanoGPT/blob/master/model.py#L57 This PR changes the logic, that AOTD expects the same "strideness" of tangents as outputs. As a result it will avoid coercion for the case of transposed dims. Limitations: We keep guessing memory_format for: 1/ Dynamic shapes (needs more changes) 2/ Tensor subclasses (needs more changes) Other changes: test_torchinductor was always creating contiguous tangents via `torch.randn()`, changing them to be `torch.randn_like()` to compare computation with the same strideness. (E.g. for cuda float16 strideness affects numerics for fft ops). Pull Request resolved: https://github.com/pytorch/pytorch/pull/144579 Approved by: https://github.com/bdhirsh	2025-03-20 15:41:36 +00:00
Aleksei Nikiforov	d5b1d99f78	Enable more nightly tests on s390x (#148452 ) Also enable some tests which probably were accidentally disabled. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148452 Approved by: https://github.com/seemethere, https://github.com/malfet	2025-03-18 16:09:39 +00:00
Aaron Gokaslan	a0ac63cbd9	[BE]: Apply ruff PERF403 to use dict comprehensions more often (#149257 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/149257 Approved by: https://github.com/jansel	2025-03-18 00:46:07 +00:00
PyTorch MergeBot	24cfeec2c7	Revert "[BE]: Apply ruff PERF403 to use dict comprehensions more often (#149257 )" This reverts commit `bfee141666`. Reverted https://github.com/pytorch/pytorch/pull/149257 on behalf of https://github.com/malfet due to Let's see if it helps restore compiler benchmark sanity, see `8bc7bd94a5/1` ([comment](https://github.com/pytorch/pytorch/pull/149257#issuecomment-2731133812))	2025-03-17 22:57:00 +00:00
PyTorch MergeBot	afa1eda901	Revert "[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590 )" This reverts commit `ef6296e7f2`. Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/izaitsevfb due to reverted internally, see D71292427 ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2731114626))	2025-03-17 22:43:15 +00:00
Xinya Zhang	2a011ca904	[ROCm] testing: enable MEFF/FA unittests for gfx1100 (#148911 ) Include gfx1100, and optionally enable gfx1201/gfx950 according to env var TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL Pull Request resolved: https://github.com/pytorch/pytorch/pull/148911 Approved by: https://github.com/jeffdaily	2025-03-17 16:41:15 +00:00
Aaron Gokaslan	bfee141666	[BE]: Apply ruff PERF403 to use dict comprehensions more often (#149257 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/149257 Approved by: https://github.com/jansel	2025-03-16 23:52:58 +00:00
Tugsbayasgalan Manlaibaatar	6b1b95ad2a	Support subclass constructor capturing in export (#147014 ) Notable TODOs: 1. Need to implement AutogradHOP to get rid of subclasses before serializing 2. Need to implement mechanism to figure out what subclasses will be used in export when they are not expressed in the inputs Differential Revision: [D69640673](https://our.internmc.facebook.com/intern/diff/D69640673) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147014 Approved by: https://github.com/bdhirsh	2025-03-16 18:19:19 +00:00
PyTorch MergeBot	f9b4856989	Revert "[pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257 )" This reverts commit `c95a6b416b`. Reverted https://github.com/pytorch/pytorch/pull/113257 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. @zou3519 can you please help land this internally? See the sigmoid tests in D71198793 for details. To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/113257#issuecomment-2725982539))	2025-03-14 23:13:34 +00:00
Xuehai Pan	c95a6b416b	[pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257 ) Changes in this PR: 1. Add `is_structseq` and `is_structseq_class` functions to determine a object or a class is PyStructSequence. 2. Add a generic class `structseq` which can be used as the registration key for PyStructSequence types like `namedtuple` for Named Tuple types. 3. Change `is_namedtuple` to accept subclasses of namedtuple to be namedtuple. Before this PR, only namedtuple class directly created by `collections.namedtuple` or `typing.NamedTuple` were namedtuple classes while their subclasses were not. This PR makes `is_namedtuple` return true for subclasses of namedtuple class. Resolves #75982. New tests are included in this PR. - #75982 Pull Request resolved: https://github.com/pytorch/pytorch/pull/113257 Approved by: https://github.com/zou3519	2025-03-14 08:50:30 +00:00
cyy	970fefcc53	Remove outdated skipCUDAIfCudnnVersionLessThan decoration (#148940 ) Test conditions for CUDNN 7 and 8 were removed because we have moved to CUDNN 9. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148940 Approved by: https://github.com/mikaylagawarecki	2025-03-13 18:02:50 +00:00
Shangdi Yu	2a7d583452	Consolidate torchbind fake class registration (#149063 ) Summary: Remove duplicated fake class registration Test Plan: CI Differential Revision: D71052419 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149063 Approved by: https://github.com/angelayi	2025-03-13 06:57:13 +00:00
xinan.lin	037d7af778	[Inductor UT] Enable PYTORCH_TESTING_DEVICE_ONLY_FOR test case filter for test_torchinductor.py (#149023 ) The environ var PYTORCH_TESTING_DEVICE_ONLY_FOR controls the devices in get_desired_device_type_test_bases, so we add RUN_CPU and RUN_GPU to make sure cases are only enabled for devices specified for PYTORCH_TESTING_DEVICE_ONLY_FOR. eg. Only enable GPU cases, not CPU cases even HAS_CPU. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149023 Approved by: https://github.com/jansel, https://github.com/cyyever	2025-03-13 05:15:28 +00:00
Guilherme Leobas	fb53e9e514	Add `__context/cause/suppress_context/traceback__` to Exception (#146499 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146499 Approved by: https://github.com/zou3519, https://github.com/anijain2305 ghstack dependencies: #146504	2025-03-11 18:55:45 +00:00
Ke Wen	ef6296e7f2	[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590 ) This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related): 1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back. - Resolves #147729 - Resolves #146881 - Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user. 2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling. - Resolves #147168 3. Remove tensor life management when async_op=False; only use it when async_op=True. 4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460). 5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line and compute kernels. Joint work with @cenzhaometa who wants to remove the event sync overhead. Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj Differential Revision: [D70937982](https://our.internmc.facebook.com/intern/diff/D70937982) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590 Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj	2025-03-11 18:36:12 +00:00
PyTorch MergeBot	a95eb0c0a7	Revert "[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590 )" This reverts commit `2149f6c684`. Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/ZainRizvi due to Breaking internally, see D70873275. Discussed reverting this with Ke. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2712001270))	2025-03-10 22:38:40 +00:00
PyTorch MergeBot	ebd087e4b5	Revert "[pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257 )" This reverts commit `f08146b67b`. Reverted https://github.com/pytorch/pytorch/pull/113257 on behalf of https://github.com/jovianjaison due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/113257#issuecomment-2711299830))	2025-03-10 17:19:21 +00:00
Xinyuan Zhao	59f14d19ae	Implement gradient for the `residuals` of `torch.linalg.lstsq` (#148526 ) Fixes #147543. I have written some tests in python using `gradcheck`. Please advise where I should put these tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148526 Approved by: https://github.com/lezcano	2025-03-10 12:35:09 +00:00
Ke Wen	2149f6c684	[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590 ) This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related): 1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back. - Resolves #147729 - Resolves #146881 - Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user. 2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling. - Resolves #147168 3. Remove tensor life management when async_op=False; only use it when async_op=True. 4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460). 5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line and compute kernels. Joint work with @cenzhaometa who wants to remove the event sync overhead. Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj Differential Revision: [D70835197](https://our.internmc.facebook.com/intern/diff/D70835197) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590 Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj	2025-03-09 07:32:23 +00:00
PyTorch MergeBot	9cb25f0ea2	Revert "[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590 )" This reverts commit `17dbeb11db`. Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/janeyx99 due to PR break backward compat test ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2708641172))	2025-03-09 03:01:55 +00:00
Ke Wen	17dbeb11db	[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590 ) This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related): 1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back. - Resolves #147729 - Resolves #146881 - Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user. 2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling. - Resolves #147168 3. Remove tensor life management when async_op=False; only use it when async_op=True. 4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460). 5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line and compute kernels. Joint work with @cenzhaometa who wants to remove the event sync overhead. Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj Differential Revision: [D70835197](https://our.internmc.facebook.com/intern/diff/D70835197) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590 Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj	2025-03-08 20:00:12 +00:00
riccardofelluga	8f71d4563e	Fix rms_norm in fp16/bf16 (#147203 ) Fixes #134106. This PR moves the `upcasted_result` down-casting after all computation is done. Since the multiplication with the weight_opt input is not done in half precision, the current code path is doing the following: fp16 -> fp32 -> fp16 -> fp32 -> fp16. What we want tho is to avoid down-casting and this PR proposes: fp16 -> fp32 -> fp16. This results in better accuracy as it avoids truncating. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147203 Approved by: https://github.com/eqy	2025-03-08 04:43:18 +00:00
Xinya Zhang	67742128b7	[ROCm] Bump AOTriton to 0.9.2b (#148433 ) Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.9b: * Optimize these Non-power-of-two head dimensions: 48, 80, 96, 160, 192, 224. Inputs with these head dimensions do not need padding to power-of-two anymore. * `is_causal=True` cases are now supported with persistent dynamic algorithm, which requires an atomic tensor but does load balance between different CTAs * `dropout_p > 0.0` cases now support full 64-bit offsets and use all i64x4 PRNG outputs * The precise AOTriton shared library version can now be identified with `readelf -p .comment libaotriton_v2.so` + However, this does not guarantee the GPU images stored under `aotriton.images` have the same version, since they can be overwritten. * The newly added fused backward kernel will be used for smaller workloads, due to less kernel invocation overhead. * Support gfx1201 (RX 9070XT). Need to be enabled at runtime with `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148433 Approved by: https://github.com/jeffdaily	2025-03-07 22:10:07 +00:00
Aaron Orenstein	a3b77d434a	Subprocess compile (attempt 2) (#148635 ) Add a mode to fx_codegen_and_compile() to compile in a separate process. This is to prepare for async compile where we'll compile and run eager in parallel (and also be able to move the compile phase to a remote computer). Added a test based which runs the test_torchinductor tests with subprocess compiling turned on. Fixed the test which caused the previous version (#146134) to be reverted: ``` $ PYTORCH_TEST_WITH_ROCM=1 PYTORCH_TEST_WITH_SLOW=1 PYTORCH_TEST_SKIP_FAST=1 python test/inductor/test_compile_subprocess.py CpuTests.test_conv_bn_fuse_cpu ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148635 Approved by: https://github.com/jamesjwu	2025-03-07 17:50:14 +00:00
Aaron Gokaslan	edd640a95a	[BE][Ez]: Use itertools.chain.from_iterable when possible (#148190 ) Often makes the code more readable, more efficient, and adds support for infinite iterables. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148190 Approved by: https://github.com/jansel, https://github.com/malfet	2025-03-06 20:37:06 +00:00
Xuehai Pan	f08146b67b	[pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257 ) Changes in this PR: 1. Add `is_structseq` and `is_structseq_class` functions to determine a object or a class is PyStructSequence. 2. Add a generic class `structseq` which can be used as the registration key for PyStructSequence types like `namedtuple` for Named Tuple types. 3. Change `is_namedtuple` to accept subclasses of namedtuple to be namedtuple. Before this PR, only namedtuple class directly created by `collections.namedtuple` or `typing.NamedTuple` were namedtuple classes while their subclasses were not. This PR makes `is_namedtuple` return true for subclasses of namedtuple class. Resolves #75982. New tests are included in this PR. - #75982 Pull Request resolved: https://github.com/pytorch/pytorch/pull/113257 Approved by: https://github.com/zou3519	2025-03-06 18:59:02 +00:00
PyTorch MergeBot	96176e32a9	Revert "[ROCm] Bump AOTriton to 0.9.1b (#148433 )" This reverts commit `8af79b7ec8`. Reverted https://github.com/pytorch/pytorch/pull/148433 on behalf of https://github.com/jovianjaison due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/148433#issuecomment-2704638858))	2025-03-06 18:32:48 +00:00
Prachi Gupta	703176e538	[ROCm] Fix sort for non-standard bool (#147459 ) When converting from uint8 to bool using `view` op, we get a bool that has 0 for false and a non-zero value for true. However, these kinds of bool have undefined behavior. We only read the last bit as 0 or 1 to convert to false or true. In this fix, we convert bools to uint8, which will convert false to 0 and non-zero value to 1. Essentially, converting non-standard bool to a standard bool and fixing the sort op for non-standard bool. Fixes #139972 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147459 Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony	2025-03-06 00:23:02 +00:00
PyTorch MergeBot	897fd9b514	Revert "Subprocess compile (#146134 )" This reverts commit `07f876e960`. Reverted https://github.com/pytorch/pytorch/pull/146134 on behalf of https://github.com/malfet due to looks like it broke slow jobs, see `e1dee4ccb3/3` ([comment](https://github.com/pytorch/pytorch/pull/146134#issuecomment-2702239123))	2025-03-05 22:41:19 +00:00
Tugsbayasgalan Manlaibaatar	5ccd659c0e	Fix decomp for linspace (#147997 ) In python decompositions, we shouldn't do any non-functional operations for functional operators. This should go away once we start decomposing before functionalization. Differential Revision: [D70265200](https://our.internmc.facebook.com/intern/diff/D70265200) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147997 Approved by: https://github.com/zou3519	2025-03-05 22:10:08 +00:00
Xinya Zhang	8af79b7ec8	[ROCm] Bump AOTriton to 0.9.1b (#148433 ) Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.9b: * Optimize these Non-power-of-two head dimensions: 48, 80, 96, 160, 192, 224. Inputs with these head dimensions do not need padding to power-of-two anymore. * `is_causal=True` cases are now supported with persistent dynamic algorithm, which requires an atomic tensor but does load balance between different CTAs * `dropout_p > 0.0` cases now support full 64-bit offsets and use all i64x4 PRNG outputs * The precise AOTriton shared library version can now be identified with `readelf -p .comment libaotriton_v2.so` + However, this does not guarantee the GPU images stored under `aotriton.images` have the same version, since they can be overwritten. * The newly added fused backward kernel will be used for smaller workloads, due to less kernel invocation overhead. * Support gfx1201 (RX 9070XT). Need to be enabled at runtime with `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148433 Approved by: https://github.com/jeffdaily	2025-03-05 19:11:57 +00:00
William Wen	b28cbe5db3	[dynamo] remove internal stack trace for fullgraph=True graph breaks (#148205 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148205 Approved by: https://github.com/zou3519	2025-03-05 01:16:53 +00:00
lzhang2	84b58bd63e	Enable FSDP tests on XPU device (#147518 ) Motivation: Enable FSDP tests on XPU device Pull Request resolved: https://github.com/pytorch/pytorch/pull/147518 Approved by: https://github.com/weifengpy	2025-03-04 23:49:37 +00:00
Afanti	c219c5ca38	Fix code descriptions in the test package. (#148145 ) The parameter and function description have something wrong and make them correct. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148145 Approved by: https://github.com/janeyx99	2025-03-04 19:14:41 +00:00
cyy	ec2805ada8	Remove outdated CUDA version check (#148142 ) Since Torch requires CUDA>=11, some checks can be removed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148142 Approved by: https://github.com/janeyx99, https://github.com/eqy	2025-03-04 03:33:44 +00:00
Aaron Orenstein	07f876e960	Subprocess compile (#146134 ) Add a mode to `fx_codegen_and_compile()` to compile in a separate process. This is to prepare for async compile where we'll compile and run eager in parallel (and also be able to move the compile phase to a remote computer). Added a test based which runs the test_torchinductor tests with subprocess compiling turned on. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146134 Approved by: https://github.com/jamesjwu	2025-03-03 21:10:12 +00:00
Alexander Grund	302c660298	Consistently use load_torchbind_test_lib in tests (#148082 ) The same code is repeated multiple times with slightly different implementations. Use the existing function for brevity and consistency. In the function the code from `test_export` is used which does a single `load_library` with cleaner conditions Pull Request resolved: https://github.com/pytorch/pytorch/pull/148082 Approved by: https://github.com/angelayi	2025-03-03 19:37:28 +00:00
cyy	9aa897b992	Remove unnecessary tensor clone (#148159 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/148159 Approved by: https://github.com/Skylion007	2025-03-02 16:21:39 +00:00
Rengan Xu	0ff2e6a85a	Fix None and equal_to_1 arguments issue in Triton kernel generated by AOTI (#148102 ) Summary: When a Triton kernel has arguments with None values followed by arguments with value 1, AOTI attempts to remove the None arguments and update the indices of the equal_to_1 arguments in triton_meta["configs"]. However, if the same kernel is called multiple times, this optimization process is repeated. Prior to this diff, the indices of equal_to_1 arguments from subsequent calls (second and later) were based on the updated indices from the previous call, resulting in incorrect behavior. This diff aims to localize the updated indices for equal_to_1 arguments within the optimization process of the current call, ensuring accurate and consistent results. Test Plan: Unit Test: ``` buck2 run mode/dev-nosan caffe2/test/inductor:test_aot_inductor -- -r test_triton_kernel_with_none_inputs_and_equal_to_1_arg ``` Differential Revision: D69998314 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148102 Approved by: https://github.com/davidberard98, https://github.com/chenyang78	2025-03-01 18:38:33 +00:00
Sijia Chen	4995e058bf	[user-triton] handle inline_asm_case (#148043 ) Summary: We currently failed the mutation analysis for all inline_asm ops. In this diff, we handle the case when "is_pure" is set to True since it indicates the operation doesn't mutate the input value Test Plan: ../buck-out/v2/gen/fbcode/854b9ed00d28c5c5/caffe2/test/inductor/__triton_kernels__/triton_kernels.par --r test_mutations_inline_asm_kernel ``` test_mutations_inline_asm_kernel_is_pure_true (caffe2.test.inductor.test_triton_kernels.MutationTests) ... W0226 18:10:34.261000 1906801 /data/users/sijiac/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:656] TTIR mutation analysis: Skipping pure tt.elementwise_inline_asm op (is_pure=True) ok ---------------------------------------------------------------------- Ran 2 tests in 0.706s OK ``` Differential Revision: D69878591 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148043 Approved by: https://github.com/zou3519	2025-02-28 20:52:51 +00:00

1 2 3 4 5 ...

5523 Commits