# Motivation
This PR moves the implementation of `torch.cuda.memory._set_allocator_settings` to `torch._C._accelerator_setAllocatorSettings`.
Since the original API was intended as a temporary/internal utility, I am not exposing the new function as a public API.
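A minimal before/after sketch, assuming the old private helper still forwards to the new binding and that both accept the same settings string format as the `PYTORCH_CUDA_ALLOC_CONF` environment variable (e.g. `max_split_size_mb`):
```python
import torch

# Previous private utility; its implementation now lives behind the new binding.
torch.cuda.memory._set_allocator_settings("max_split_size_mb:128")

# New internal location introduced by this PR (still not a public API).
torch._C._accelerator_setAllocatorSettings("max_split_size_mb:128")
```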
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156175
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908, #150312, #156165
Now, instead of erroring out on an `empty_cache` call during graph capture or under a mempool context, we just silently do nothing. This was already the behavior for mempools; cudagraphs used to error out, but it's fine to simply ignore the call.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158152
Approved by: https://github.com/zou3519, https://github.com/eqy
Otherwise it turns the test into a trivial one (that always succeeds), as the following example demonstrates:
```python
import torch
from torch.testing._internal.common_utils import serialTest, run_tests, TestCase


class MegaTest(TestCase):
    @serialTest
    def test_foo(self):
        if hasattr(self.test_foo, "pytestmark"):
            print("foo has attr and it is", self.test_foo.pytestmark)
        print("foo")

    @serialTest()
    def test_bar(self):
        if hasattr(self.test_bar, "pytestmark"):
            print("bar has attr and it is", self.test_bar.pytestmark)
        print("bar")


if __name__ == "__main__":
    run_tests()
```
That will print
```
test_bar (__main__.MegaTest.test_bar) ... bar has attr and it is [Mark(name='serial', args=(), kwargs={})]
bar
ok
test_foo (__main__.MegaTest.test_foo) ... ok
----------------------------------------------------------------------
Ran 2 tests in 0.013s
```
Added an assert that the argument is a boolean in the decorator, to prevent such silent skips in the future.
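A rough sketch (not the actual torch implementation) of the failure mode and the guard: when the decorator factory is used as `@serialTest` without parentheses, the test function itself arrives as the argument, the test body is silently swallowed, and the boolean assert now catches that.
```python
import functools

def serialTest(enabled=True):
    # With @serialTest (no call), the test function lands here instead of a bool.
    assert isinstance(enabled, bool), "use @serialTest() or @serialTest(False)"

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            return fn(*args, **kwargs)
        if enabled:
            # Simplified stand-in for attaching pytest.mark.serial.
            wrapper.pytestmark = getattr(fn, "pytestmark", []) + ["serial"]
        return wrapper

    return decorator
```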
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157388
Approved by: https://github.com/clee2000
Based on the [conversation](https://github.com/pytorch/pytorch/issues/121791), we plan to drop "highest, high, medium" as the way to represent fp32 internal computation data types. Instead, we will use the algorithm name directly.
### Design Choice: Directly use algorithm names like "TF32" and "BF16"
#### Pros
- The names are more informative: 'tf32' says more than a generic "high".
- Easier to extend with new algorithms like `tf32x3`.
#### Cons
- "HIGHEST, HIGH, MEDIUM" indicated the relative precision of the different algorithms, which the algorithm names do not. However, we can add more documentation to discuss this.
### We provide a layered structure for backends/operators.
('f32' is short for 'fp32_precision')

### We provide the following fp32 compute precision settings:
- **"ieee"**: Not allowed to use any other internal computation data type.
- **"tf32"**: Allowed to use tf32 as the internal computation data type.
- **"bf16"**: Allowed to use bf16 as the internal computation data type.
- **"none"**: Precision is not set and can be overridden by its parent node.
### Overriding Precision Settings
A child node can be overridden by its parent node if it is set to the default.
The current default settings are:
```
backend = generic, op = all, precision setting = none
backend = cuda, op = all, precision setting = none
backend = cuda, op = conv, precision setting = tf32
backend = cuda, op = rnn, precision setting = tf32
backend = cuda, op = matmul, precision setting = none
backend = mkldnn, op = all, precision setting = none
backend = mkldnn, op = conv, precision setting = none
backend = mkldnn, op = rnn, precision setting = none
backend = mkldnn, op = matmul, precision setting = none
```
- If the user sets `torch.backends.mkldnn.fp32_precision="bf16"`, its child nodes `torch.backends.mkldnn.matmul.fp32_precision` / `torch.backends.mkldnn.conv.fp32_precision` / `torch.backends.mkldnn.rnn.fp32_precision` will also be overridden to "bf16".
- If the user sets `torch.backends.fp32_precision="bf16"`, `torch.backends.mkldnn.fp32_precision` and its child nodes will also be overridden to "bf16" (see the usage sketch below).
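A minimal usage sketch of the layered override behavior, using only the attribute names mentioned above:
```python
import torch

# Coarse: allow bf16 internally for every fp32 op on the mkldnn backend;
# children that are still "none" inherit this setting.
torch.backends.mkldnn.fp32_precision = "bf16"

# Fine-grained: pin matmul back to strict ieee while conv/rnn keep bf16.
torch.backends.mkldnn.matmul.fp32_precision = "ieee"

# Per-op control on cudnn as well (the defaults above are tf32 for conv/rnn).
torch.backends.cudnn.conv.fp32_precision = "tf32"
torch.backends.cudnn.rnn.fp32_precision = "ieee"
```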
### Backward Compatibility
Since the new API allows users more fine-grained control, there can be conflicts. For example, the previous `torch.backends.cudnn.allow_tf32` is not expressive enough to represent a state such as `torch.backends.cudnn.rnn.fp32_precision="ieee"` together with `torch.backends.cudnn.conv.fp32_precision="tf32"`. Therefore, our goals for backward compatibility are:
- If the user only uses the previous APIs, they will work as before.
- If the user uses the **new** API to change the state to one that is **un-representable** by the old API and then tries to read it through the **old** API, we raise a RuntimeError and point the user to the documentation.
### Test Plan
```
python test/test_cuda.py -k test_fp32_precision_with_tf32
python test/test_cuda.py -k test_fp32_precision_with_float32_matmul_precision
python test/test_cuda.py -k test_invalid_status_for_legacy_api
python test/test_mkldnn.py -k test_mlkdnn_get_set
python test/test_mkldnn.py -k test_generic_precision
python test/test_mkldnn.py -k test_invalid
python test/test_mkldnn.py -k test_default_use_parent
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125888
Approved by: https://github.com/jgong5, https://github.com/albanD
Co-authored-by: Jiang, Yanbing <yanbing.jiang@intel.com>
There are a few considerations here:
1. A user might want to modify the cudaGraph_t either during the stream capture or after the stream capture (but before instantiation). This draft implements modification after stream capture only, though support could be added for modification during stream capture by applying
https://github.com/pytorch/pytorch/pull/140979/files#diff-d7302d133bb5e0890fc94de9aeea4d9d442555a3b40772c9db10edb5cf36a35cR391-R404
2. Previously, the cudaGraph_t would be destroyed before the end of capture_end() unless the user had previously called enable_debug_mode(). There is no way to implement this correctly without removing this restriction, or forcing the user to always call enable_debug_mode(). However, enable_debug_mode() is a confusing API (despite being an instance method, it would modify a static global variable; thus, putting one CUDAGraph object into debug mode puts all of them into debug mode, which is not acceptable in my opinion). Therefore, I made enable_debug_mode() into a no-op. This means that the CPU memory usage will increase after this change. I think this is likely to be fine.
3. No python bindings yet. These should be easy to add. It is probably worthwhile to take some time to make sure that the returned cudaGraph_t can be converted into the cuda-python cudaGraph_t in a reasonable, hopefully type-safe, manner (but without making cuda-python a dependency of pytorch), since I imagine most users will use the pip cuda-python package to make modifications.
4. There are two foot guns:
a. The cudaGraph_t returned by raw_cuda_graph() is not owned by the user, so it will be destroyed once the owning CUDAGraph is destroyed (or calls reset()).
b. The following sequence won't work as intended:
```
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    foo()
g.replay()
raw_graph = g.raw_cuda_graph()
modify(raw_graph)
g.replay()
```
This won't work because the user must call instantiate() again after modifying cudaGraph_t. You could add a "safety" mechanism by traversing the cudaGraph_t to create a hash and seeing if the hash changes between calls to replay(), but this is likely way too expensive.
I think these two foot guns are probably okay given that this is a bit of an experts' API.
Fixes #155106
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155164
Approved by: https://github.com/ngimel
These are created by the user passing cudaEventRecordExternal and
cudaEventWaitExternal to cudaEventRecordWithFlags() and
cudaStreamWaitEvent() respectively.
We do this by allowing the user to specify external=True when
constructing a torch.cuda.Event().
If external=False, the cudaEventRecord and cudaStreamWaitEvent APIs
have a different meaning described here:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cross-stream-dependencies-and-events
In short, they will be used to express fork and join operations in
the graph if external=False.
External events can be used for expressing a fine-grained dependency
on the outcome of some nodes in a cuda graph (rather than all
nodes). They can also be used for timing parts of a cuda graph's
execution, rather than timing the entire graph's execution.
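A rough usage sketch, assuming the `external=True` constructor argument described above; the point is bracketing a subset of the captured work with external events and reading them after replay.
```python
import torch

g = torch.cuda.CUDAGraph()
start = torch.cuda.Event(enable_timing=True, external=True)
end = torch.cuda.Event(enable_timing=True, external=True)

x = torch.zeros(1024, device="cuda")
with torch.cuda.graph(g):
    start.record()
    x += 1          # only this region is bracketed by the external events
    end.record()
    x *= 2          # still part of the graph, outside the measured region

g.replay()
torch.cuda.synchronize()
print("partial graph time (ms):", start.elapsed_time(end))
```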
Finishes #146145
I'm a dummy and don't know how to use ghstack at this time. The first commit is a bug fix for _CudaKernel, which would previously always launch work on the NULL stream, rather than the user-passed stream.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155372
Approved by: https://github.com/ngimel
Fixes #136849
## Test Result
```python
>>> import torch
>>> device = torch.cuda.device_count() + 1
>>> torch.cuda.current_stream(device) # INTERNAL ASSERT FAILED
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/zong/code/pytorch/torch/cuda/__init__.py", line 1083, in current_stream
streamdata = torch._C._cuda_getCurrentStream(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Device index value 3 is out of index range [0, 2)
>>> torch.cuda.default_stream(device) # INTERNAL ASSERT FAILED
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/zong/code/pytorch/torch/cuda/__init__.py", line 1101, in default_stream
streamdata = torch._C._cuda_getDefaultStream(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Device index value 3 is out of index range [0, 2)
>>> torch.cuda.set_per_process_memory_fraction(0.5, device) # INTERNAL ASSERT FAILED
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/zong/code/pytorch/torch/cuda/memory.py", line 193, in set_per_process_memory_fraction
torch._C._cuda_setMemoryFraction(fraction, device)
RuntimeError: Allocator not initialized for device : did you call init?
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155318
Approved by: https://github.com/albanD
This PR introduces `torch.AcceleratorError`, which inherits from `RuntimeError` and contains `error_code`, which in the case of CUDA holds the error returned by `cudaGetLastError`.
`torch::detail::_new_accelerator_error_object(c10::AcceleratorError&)` follows the pattern of CPython's [`PyErr_SetString`](cb8a72b301/Python/errors.c (L282)), namely
- Convert cstr into Python string with `PyUnicode_FromString`
- Create new exception object using `PyObject_CallOneArg` just like it's done in [`_PyErr_CreateException`](cb8a72b301/Python/errors.c (L32))
- Set `error_code` property using `PyObject_SetAttrString`
- Decref all temporary references
Test that it works and captures CPP backtrace (in addition to CI) by running
```python
import os
os.environ['TORCH_SHOW_CPP_STACKTRACES'] = '1'
import torch

x = torch.rand(10, device="cuda")
y = torch.arange(20, device="cuda")
try:
    x[y] = 2
    print(x)
except torch.AcceleratorError as e:
    print("Exception was raised", e.args[0])
    print("Captured error code is ", e.error_code)
```
which produces the following output:
```
Exception was raised CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /home/ubuntu/pytorch/c10/cuda/CUDAException.cpp:41 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) [clone .cold] from CUDAException.cpp:0
#7 void at::native::gpu_kernel_impl<at::native::AbsFunctor<float> >(at::TensorIteratorBase&, at::native::AbsFunctor<float> const&) [clone .isra.0] from tmpxft_000191fc_00000000-6_AbsKernel.cudafe1.cpp:0
#8 at::native::abs_kernel_cuda(at::TensorIteratorBase&) from ??:0
#9 at::Tensor& at::native::unary_op_impl_with_complex_to_float_out<at::native::abs_stub_DECLARE_DISPATCH_type>(at::Tensor&, at::Tensor const&, at::native::abs_stub_DECLARE_DISPATCH_type&, bool) [clone .constprop.0] from UnaryOps.cpp:0
#10 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA_out_abs_out(at::Tensor const&, at::Tensor&) from RegisterCUDA_0.cpp:0
#11 at::_ops::abs_out::call(at::Tensor const&, at::Tensor&) from ??:0
#12 at::native::abs(at::Tensor const&) from ??:0
#13 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeExplicitAutograd__abs>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&> >, at::Tensor (at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from RegisterCompositeExplicitAutograd_0.cpp:0
#14 at::_ops::abs::redispatch(c10::DispatchKeySet, at::Tensor const&) from ??:0
#15 torch::autograd::VariableType::(anonymous namespace)::abs(c10::DispatchKeySet, at::Tensor const&) from VariableType_1.cpp:0
#16 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&), &torch::autograd::VariableType::(anonymous namespace)::abs>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from VariableType_1.cpp:0
#17 at::_ops::abs::call(at::Tensor const&) from ??:0
#18 at::native::isfinite(at::Tensor const&) from ??:0
#19 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeImplicitAutograd__isfinite>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&> >, at::Tensor (at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from RegisterCompositeImplicitAutograd_0.cpp:0
#20 at::_ops::isfinite::call(at::Tensor const&) from ??:0
#21 torch::autograd::THPVariable_isfinite(_object*, _object*, _object*) from python_torch_functions_2.cpp:0
#22 PyObject_CallFunctionObjArgs from ??:0
#23 _PyObject_MakeTpCall from ??:0
#24 _PyEval_EvalFrameDefault from ??:0
#25 _PyObject_FastCallDictTstate from ??:0
#26 _PyStack_AsDict from ??:0
#27 _PyObject_MakeTpCall from ??:0
#28 _PyEval_EvalFrameDefault from ??:0
#29 _PyFunction_Vectorcall from ??:0
#30 _PyEval_EvalFrameDefault from ??:0
#31 _PyFunction_Vectorcall from ??:0
#32 _PyEval_EvalFrameDefault from ??:0
#33 _PyFunction_Vectorcall from ??:0
#34 _PyEval_EvalFrameDefault from ??:0
#35 PyFrame_GetCode from ??:0
#36 PyNumber_Xor from ??:0
#37 PyObject_Str from ??:0
#38 PyFile_WriteObject from ??:0
#39 _PyWideStringList_AsList from ??:0
#40 _PyDict_NewPresized from ??:0
#41 _PyEval_EvalFrameDefault from ??:0
#42 PyEval_EvalCode from ??:0
#43 PyEval_EvalCode from ??:0
#44 PyUnicode_Tailmatch from ??:0
#45 PyInit__collections from ??:0
#46 PyUnicode_Tailmatch from ??:0
#47 _PyRun_SimpleFileObject from ??:0
#48 _PyRun_AnyFileObject from ??:0
#49 Py_RunMain from ??:0
#50 Py_BytesMain from ??:0
#51 __libc_init_first from ??:0
#52 __libc_start_main from ??:0
#53 _start from ??:0
Captured error code is 710
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152023
Approved by: https://github.com/eqy, https://github.com/mradmila, https://github.com/ngimel
ghstack dependencies: #154436
Removes MemPoolContext from custom user mempools. The ground truth for which pool should be used is the active pool in graph_pools, and MemPoolContext just introduced an opportunity for the pool pointed to by MemPoolContext and the active pool in graph_pools to go out of sync (see all the asserts in the code trying to ensure that doesn't happen, and yet it still could happen in a multithreaded scenario; see my recent PR #153990).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154042
Approved by: https://github.com/albanD, https://github.com/syed-ahmed
Fixes #152008
This PR fixes a segmentation fault that occurred when exiting the program due to improper background thread management in CachingHostAllocator.
Previously, the background thread continued running and called process_events() even after the allocator object was destroyed, leading to a crash on exit.
f12d8d60b1/aten/src/ATen/core/CachingHostAllocator.h (L218)
```cpp
// Launch the background thread and process events in a loop.
static bool background_thread_flag [[maybe_unused]] = [this] {
  getBackgroundThreadPool()->run([&]() {
    while (true) {
      process_events(); // <-- This line may cause segfault on exit
      std::this_thread::sleep_for(std::chrono::microseconds(100));
    }
  });
  return true;
}();
```
The fix adds a mechanism to signal the background thread to exit before the object is destructed, ensuring the thread stops safely.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154117
Approved by: https://github.com/ngimel, https://github.com/cyyever
Summary:
To add PT2 information to the memory snapshot, we piggyback off of the Kineto implementation using record_function, similar to how the user annotations are added. To do this we add the following:
1. A stack implementation that we instantiate to keep track of which compile context we are currently in (the top element of the stack). The stack is per device and thread-local, since different threads of a process can be in different compile contexts at a given time. For this reason, we do not need to add mutexes to our stack impl, since no two threads will touch a given stack (a hypothetical sketch follows this list).
2. RecordFunction hooks to properly pipe the correct events to the compile context stack. These hooks are similar to the annotation ones in that we just register them lazily and DO NOT unregister them. This is done out of convenience; in the future, we should save the handles and unregister them to minimize overhead after profiling is finished. As of now, we register this at the FUNCTION scope, which is wide; however, we treat any function that does not start with "Torch-Compiled Region" as a no-op, so we anticipate the difference in performance to be negligible during and after profiling. We also hide this feature behind a flag that is off by default, so existing jobs will be unaffected.
3. Piping for compile context to pickle output
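As an illustration of point 1, here is a hypothetical Python-level sketch of a thread-local stack; since each thread only ever touches its own instance of the stack, no mutex is needed:
```python
import threading
from typing import List, Optional

class CompileContextStack(threading.local):
    """Per-thread stack; the top element is the compile context we are currently in."""

    def __init__(self) -> None:
        self.frames: List[str] = []

    def push(self, context: str) -> None:
        self.frames.append(context)

    def pop(self) -> str:
        return self.frames.pop()

    def top(self) -> Optional[str]:
        return self.frames[-1] if self.frames else None
```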
Test Plan:
In D74039793, we add CompileContext to the visualizer and we see the following {F1977654658}
Differential Revision: D74028214
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152707
Approved by: https://github.com/eqy
I tried `beginAllocateToPool` instead of `_cuda_beginAllocateCurrentStreamToPool` and the error in #151199 does not happen any more.
However, this approach is unsafe for multithreading. When multiple run_eager calls happen concurrently, we expect memory allocations to go to different mem_pools. Since beginAllocateToPool does not check the stream, these allocations may land in the same mem_pool.
So, I use `_cuda_beginAllocateCurrentThreadToPool` to direct all memory allocations on the same thread to a given mem_pool. In particular, `_cuda_beginAllocateCurrentThreadToPool` records the launching thread id, and at runtime checks whether the current thread id matches the launching thread id.
Fixes #151199
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152472
Approved by: https://github.com/eellison, https://github.com/ngimel
It seems there was a typo where `set_device` was called when the intent was to use `current_device`.
As-is, the test will fail on multi-GPU systems with
`TypeError: set_device() missing 1 required positional argument: 'device'`
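For context, the two APIs differ as follows (a trivial sketch; the fix presumably swaps in the zero-argument query):
```python
import torch

idx = torch.cuda.current_device()  # takes no arguments; returns the current device index
torch.cuda.set_device(idx)         # requires an explicit device argument
```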
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152474
Approved by: https://github.com/Skylion007
Although torch.cuda.Event and torch.xpu.Event have cuda_event and sycl_event fields respectively, the event_id exposed from the base class torch.Event is always 0, which can confuse users.
The memory of torch.Event is not useful to torch.cuda.Event and torch.xpu.Event, but we still need to inherit from torch.Event because CPython will check it.
Repro with cuda:
```
>>> import torch
>>> event = torch.cuda.Event()
>>> event.cuda_event
0
>>> event.event_id
0
>>> event.record()
>>> event.cuda_event
127982096
>>> event.event_id
0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151226
Approved by: https://github.com/albanD, https://github.com/guangyey
ghstack dependencies: #151404, #151221, #151411
MemPool is a separate pool of memory handled by the caching allocator. This PR adds an option to let the caching allocator try to use this pool as a last resort instead of OOMing, by associating a use_on_oom bool with each MemPool.
Usage:
Users can optionally specify a ``use_on_oom`` bool (which is False by default) during MemPool creation. If true, then the CUDACachingAllocator will be able to use memory in this pool as a last resort instead of OOMing.
```
pool = torch.cuda.MemPool(allocator, use_on_oom=True)
with torch.cuda.use_mem_pool(pool):
    a = torch.empty(40 * 1024 * 1024, dtype=torch.uint8, device="cuda")
del a
# at the memory limit, this will succeed by using pool's memory in order to avoid the oom
b = torch.empty(40 * 1024 * 1024, dtype=torch.uint8, device="cuda")
```
Testing:
```
python test/test_cuda.py -k test_mempool_limited_memory_with_allocator
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151487
Approved by: https://github.com/eqy, https://github.com/syed-ahmed, https://github.com/ngimel
# Motivation
We propose adding support for the Python with statement on `torch.accelerator.device_index` to enable device switching functionality. This enhancement would simplify writing device-agnostic code and provide benefits across all accelerators. Its device-specific counterparts include [`torch.cuda.device`](00199acdb8/torch/cuda/__init__.py (L482)) and [`torch.cuda._DeviceGuard`](00199acdb8/torch/cuda/__init__.py (L469)).
**Design Philosophy**
It accepts either an `Int` or `None` as input. When `None` is passed, no device switch is performed. Supporting `None` is important for compatibility, as it's possible to encounter `None` values from `torch.device.index`.
Therefore, with this PR, we can do the following:
```python
import torch

src = 0
dst = 1
# Set src to the current device
torch.accelerator.set_device_index(src)
with torch.accelerator.device_index(dst):
    # Inside the with statement, dst is the current device
    assert torch.accelerator.get_device_index() == dst
# Here the current device should be src
assert torch.accelerator.get_device_index() == src
```
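The `None` case mentioned above then composes naturally (a small sketch; `maybe_idx` is just an illustrative name):
```python
import torch

maybe_idx = torch.device("cuda").index  # None when no index was specified
with torch.accelerator.device_index(maybe_idx):
    # No device switch is performed because the index is None.
    pass
```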
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148864
Approved by: https://github.com/albanD
Follow-up work on top of https://github.com/pytorch/pytorch/pull/149480
Wrapper on top of nvrtc inspired by https://gist.github.com/malfet/2c9a25976dd7396430c38af603f791da from @malfet
Compiling toy kernels with this setup takes 0.01s vs 90s using `load_inline()` on my local H100. This was primarily motivated by the timeouts I was seeing in the popcorn leaderboard but would also be useful to integrate into KernelBench
This PR is in the same spirit as https://github.com/pytorch/pytorch/pull/148972 which was a similar UX for Metal
For now we are planning on landing this as a private function because we expect to iterate both on the user-facing API and the internal implementation; we will open up a separate issue to discuss the path towards making this work public and to give a broader overview of the state of custom CUDA kernel authoring in PyTorch.
Future work, as a prerequisite to making the work public:
* divup primitive
* support multiple kernels
* Expose _get_nvrtc_version from native code
* interop with torch.compile
* AMD support
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151484
Approved by: https://github.com/malfet
The cpp contexts are only supported on x86 Linux.
The tests requiring them are skipped on non-Linux systems, but not when the architecture is not x86.
In most places the check is for ARM64, which is not enough; a check for x86 is required instead.
Fix the test decorators and factor out a common one in test_cuda.
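A sketch of the kind of common decorator this refers to (the actual name and placement in test_cuda may differ); the point is that both the OS and the architecture have to be checked:
```python
import platform
import sys
import unittest

IS_X86_LINUX = sys.platform.startswith("linux") and platform.machine() in ("x86_64", "i686")

skipIfNoCppContext = unittest.skipIf(
    not IS_X86_LINUX, "cpp stack-trace contexts are only supported on x86 Linux"
)
```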
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148445
Approved by: https://github.com/eellison
Add a couple of Jetson skips for oom tests in test/test_cuda.py due to failures in NVIDIA CI. Jetson not having full nvml support is a known issue, so this is mostly a test-side fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149587
Approved by: https://github.com/eqy
PYNVML related tests in test/test_cuda.py are failing in nvidia internal CI for Jetson devices because Jetson devices don't fully support nvml (it exists as a stub library). In addition to skipping PYNVML tests for Jetson, this PR also reworks the TEST_PYNVML logic a bit to be more consistent with the rest of TEST_{something} conditions in test/test_cuda.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149578
Approved by: https://github.com/janeyx99, https://github.com/eqy
This is an initial attempt to provide some statistics for the pinned host memory allocations flowing through CachingHostAllocator. Many times in the past we have had inexplicable slowdowns that would be much easier to diagnose if we had some host memory characteristics.
This change tries very hard not to disrupt the initial design of the allocator, and it uses the existing locking mechanism, whenever possible, to gather statistics "for free". The only deviation from that is on the "slow path", where we incur CUDA calls anyway, so taking a short lock is not going to hurt performance much, especially in the steady state where most allocations will come from the cache.
As mentioned before, this is the first PR; it introduces the concept and checks whether it fits the right paradigm. We can always add more later.
Metrics that would require more involved changes to the code base and locks, like requested memory, have been punted for now. I also tried to reuse the Stat structure used in the CUDA caching allocator, in order to maintain symmetry.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147660
Approved by: https://github.com/ngimel
Tests fail in NVIDIA internal CI since we do not support nvml on Jetson, but nvml is required for OOM reporting to work properly, so we are skipping the failing tests for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148134
Approved by: https://github.com/eqy
This PR does two main things (they are in a single PR to show how the newly added APIs are used).
- Add isBuilt and isAvailable APIs to the AcceleratorHook interface. See the inline doc for their exact semantics.
- Use the newly added isBuilt for the accelerator check to ensure it does not poison fork.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146098
Approved by: https://github.com/ngimel, https://github.com/malfet, https://github.com/EikanWang
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
A rewrite of #138964
In addition to rewriting the conditions for using copy2d, this PR fixes a few other problems with #138964:
1) gpu-gpu copies when peer access is disabled shouldn't rely on copy2d
2) copy2d should record even for the host pinned memory, like the regular copy does
3) copy2d shouldn't pretend that it's synchronizing (for the purposes of cuda sanitizer tracer) when it's non-blocking
In this PR, copy2d behaves in exactly the same way as copy does with respect to those additional syncs, except that it makes a different underlying CUDA call.
Tests are added for multiple cases that go through copy2d, and for cases that avoid the copy2d pattern due to unsatisfied conditions.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146256
Approved by: https://github.com/eqy, https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
### **Pull Request: Optimized Non-Contiguous Tensor Copy for CPU to GPU in PyTorch**
#### **Summary**
This PR addresses the performance issue identified in [#111570](https://github.com/pytorch/pytorch/issues/111570), where non-contiguous tensors took significantly longer to transfer from CPU to GPU. Through detailed tracing of the call flow, we identified that PyTorch was creating temporary contiguous buffers for non-contiguous tensor transfers, which introduced unnecessary overhead.
#### **Tracing the Issue**
To pinpoint the cause of the slowdown, we followed the call flow from Python’s `tensor.cuda()` method through PyTorch’s backend, ultimately identifying `copy_kernel_cuda` as the key function responsible for CPU-to-GPU tensor transfers. Here’s a summary of the tracing process:
1. **Python Call: `tensor.cuda()`**
- Starting from Python, the `cuda()` method initiates the tensor transfer to the GPU.
2. **`TensorBody.h: cuda()`**
- The `cuda()` method calls `to()`, specifying the target device as CUDA.
3. **`Tensor.cpp: TensorBase::to()`**
- The `to()` function prepares device and data type options before invoking `_ops::to_dtype_layout::call()`.
4. **Operator Call: `_ops::to_dtype_layout::call()`**
- This operator dispatches the request to the backend-specific function responsible for managing the transfer.
5. **`Copy.cpp: copy_()`**
- The `copy_()` function performs preliminary checks (e.g., zero-tensor immutability) and proceeds to call `copy_impl()`.
6. **`Copy.cpp: copy_impl()`**
- This function sets up a tensor iterator and dispatches the copy operation to the appropriate backend through `copy_stub`.
7. **Dispatch to CUDA: `copy_stub`**
- The dispatch mechanism routes the call to the CUDA-specific function, `copy_kernel_cuda`.
8. **`Copy.cu: copy_kernel_cuda()`**
- Here, we identified that PyTorch was creating temporary contiguous buffers for 1D and 2D non-contiguous tensors, which slowed down the copy process. This behavior is managed by the `copy_requires_temporaries()` function.
#### **Solution**
To address this, we modified `copy_kernel_cuda` to handle non-contiguous 1D and 2D tensors directly by using `cudaMemcpy2DAsync`, which allows efficient, stride-aware memory transfers without temporary buffers. Here’s why this approach improves performance:
- **Efficiency of `cudaMemcpy2DAsync`**: This CUDA function is optimized for pitched (stride-based) memory transfers, allowing it to handle non-contiguous data layouts effectively by specifying memory strides for source and destination tensors.
- **Reduction of Overhead**: By directly copying non-contiguous tensors without intermediate buffers, we eliminate extra memory allocation and achieve faster CPU-to-GPU transfers.
- **Asynchronous Execution**: `cudaMemcpy2DAsync` enables asynchronous transfer on the CUDA stream, further improving performance by taking advantage of CUDA's optimized memory handling for non-contiguous layouts.
#### **Performance Results**
In my testing, I created tensors of size `327680 x 2000` and used slices for transfer performance measurements. The tests show that the average time for transferring a non-contiguous slice (e.g., rows 10,000 to 50,000) from CPU to GPU now closely matches the contiguous case. This improvement indicates that the updated implementation effectively addresses the performance discrepancy. Below are the measured times and validation checks:
```plaintext
Average time for contiguous slice (rows 10,000-50,000): 66 ms
Average time for non-contiguous slice (rows 10,000-50,000): 66 ms
Validation of contiguous and non-contiguous tensor copies:
✅ PASS: Tensor shapes match.
✅ PASS: Tensor contiguity matches.
✅ PASS: Tensor contents match.
✅ PASS: Tensor data types match.
✅ Success: Both contiguous and non-contiguous tensors were copied correctly to the GPU.
```
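For anyone who wants to reproduce a comparison along these lines, here is a rough timing sketch (sizes follow the description; the non-contiguous case uses a column sub-range so the rows are pitched, and exact numbers will vary by machine):
```python
import time
import torch

cpu = torch.randn(327680, 2000)
contig = cpu[10_000:50_000, :1000].contiguous()  # contiguous copy of the block
noncontig = cpu[10_000:50_000, :1000]            # strided view: row pitch != row length

def h2d_ms(t):
    torch.cuda.synchronize()
    start = time.perf_counter()
    t.cuda()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) * 1e3

print(f"contiguous slice:     {h2d_ms(contig):.1f} ms")
print(f"non-contiguous slice: {h2d_ms(noncontig):.1f} ms")
```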
#### **Conclusion**
This PR resolves the identified performance issue by eliminating the need for temporary buffers in non-contiguous 1D and 2D tensor transfers, ensuring faster and more efficient copies from CPU to GPU. Future optimizations could further enhance performance for higher-dimensional non-contiguous tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138964
Approved by: https://github.com/jeffdaily
Co-authored-by: Natalia Gimelshein <ngimel@gmail.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
…s_pinned if device is not initialized
Do not land
RFC
potential fix for #144687
Now `.is_pinned(device="cuda")` does not initialize device and thus doesn't poison the fork (but it complains about `device` arg being deprecated). To not need `device=` arg we'd need to fix get_accelerator to not initialize device.
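A quick way to check the behavior described above (note the deprecation warning for the `device` argument):
```python
import torch

t = torch.empty(16)
print(t.is_pinned(device="cuda"))   # False; with this change it no longer initializes CUDA
print(torch.cuda.is_initialized())  # still False, so a later fork is not poisoned
```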
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145752
Approved by: https://github.com/albanD
Co-authored-by: albanD <albandes@fb.com>
gfx12 passes the condition `torch.cuda.get_device_capability() >= (9, 4)` and uses `default_workspace_size=128MB`, but that is required only for MI300.
Fix the condition to use `("gfx94" in gcn_arch)` instead of `torch.cuda.get_device_properties()` to detect MI300.
Now `default_workspace_size=32MB` is used for gfx12 and the test passes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145227
Approved by: https://github.com/jeffdaily, https://github.com/eqy
# Motivation
We propose supporting the Python with statement on `torch.Stream`. This benefits all accelerators when writing device-agnostic code. Device-specific streams will also be supported because they generally derive from `torch.Stream`.
With this PR, we can do the following:
```python
import torch

s1 = torch.Stream()
# Set s1 to the current stream
torch.accelerator.set_stream(s1)
with torch.Stream() as s2:
    # Inside the with statement, s2 is the current stream
    assert torch.accelerator.current_stream() == s2
# Here the current stream should be s1
assert torch.accelerator.current_stream() == s1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140138
Approved by: https://github.com/albanD
This PR
* makes changes to the workflow files and scripts so we can run CI workflows on the MI300 runners
* skips and fixes several tests that failed on MI300, observed in https://github.com/pytorch/pytorch/pull/140989
Skipped due to unsupported Float8_e4m3fn data type on MI300 (need to update test code to use datatypes supported by MI300):
- distributed.tensor.parallel.test_micro_pipeline_tp.py::MicroPipelineTPTest::test_fuse_all_gather_scaled_matmul_A_dims_\*_gather_dim_\* (24 tests across inductor/distributed configs)
- distributed.tensor.parallel.test_micro_pipeline_tp.py::test_fuse_scaled_matmul_reduce_scatter_A_dims_\*_scatter_dim_\* (12 tests across inductor/distributed configs)
- inductor.test_loop_ordering::LoopOrderingTest::test_fp8_cast_and_t
- inductor.test_loop_ordering::LoopOrderingTest::test_fp8_pattern_2
Skipped due to AssertionError on MI300:
- inductor.test_mkldnn_pattern_matcher.py::test_qconv2d_int8_mixed_bf16
- distributed._tools.test_sac_ilp::TestSACILP::test_sac_ilp_case1
Skipped:
- test_cuda.py::TestCudaMallocAsync::test_clock_speed
- test_cuda.py::TestCudaMallocAsync::test_power_draw
- test_torch.py::TestTorchDeviceTypeCUDA::test_deterministic_cumsum_cuda
Skipped flaky tests on MI300:
- distributed.test_c10d_gloo.py::ProcessGroupGlooTest::test_gather_stress_cuda
- inductor.test_cpu_repro::CPUReproTests::test_lstm_packed_unbatched_False* (256 tests)
Fixed:
- test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_float8_basics_cuda
Features:
- inductor/test_fp8.py - declare a new function to convert FP8 datatypes to ROCm-supported FP8 datatypes. It keeps the same test names for CUDA and ROCm and allows enabling Inductor FP8 tests on CPU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143673
Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/pruthvistony
Co-authored-by: saienduri <saimanas.enduri@amd.com>
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>