Summary:
Restarting (aborting and re-initializing) a PG is a basic need if we want
to achieve in-process restart of PGs without tearing down the whole
process.
Add this test to verify that this is supported by current NCCL.
Note that for now this restart test passes reliably only in blocking mode.
In nonblocking mode, there is a problem in either NCCL init or abort that
needs further investigation.
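A minimal sketch of the restart pattern being exercised (hedged: this uses only the public init/destroy APIs for illustration; the actual UT drives the abort path on the NCCL backend before re-initializing):
```python
import torch
import torch.distributed as dist

# Illustrative restart-within-one-process pattern; assumes the usual
# torchrun/env:// environment variables (RANK, WORLD_SIZE, MASTER_ADDR, ...).
dist.init_process_group("nccl")
dist.all_reduce(torch.ones(1, device="cuda"))

dist.destroy_process_group()      # tear down only the PG, not the process

dist.init_process_group("nccl")   # re-initialize in the same process
dist.all_reduce(torch.ones(1, device="cuda"))
```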
Test Plan:
new UT
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139496
Approved by: https://github.com/c-p-i-o, https://github.com/kwen2501
The PyTorch MPS backend for the most part relies on MPSGraph to provide specific operations, but recently, more and more often, custom kernels had to be implemented that were simply embedded in the operator codebase and compiled directly using [`- id<MTLLibrary>newLibraryWithSource:options:error:`](https://developer.apple.com/documentation/metal/mtldevice/1433431-newlibrarywithsource) (the first Metal kernel was added to the MPS backend in https://github.com/pytorch/pytorch/pull/82307 )
Later on, as the number of operators grew, those were refactored into the `MetalShaderLibrary` convenience class (see https://github.com/pytorch/pytorch/pull/125550 )
But as the number of kernels keeps growing, it's time to take the next step and properly compile them into `.metallib`
This PR does exactly that by:
- Moving shader sources into separate .metal files
- Adding a check on whether full Xcode is installed or just the DeveloperTools CLI
- If full Xcode is installed, compiling and linking shaders into `.metallib` for the Metal 3.0 (available on macOS 13) and Metal 3.1 (available on macOS 14, can use bfloat) standards, and bundling both using the `-sectcreate` linker option and the `getsectiondata` API call. The `metallib_dummy.cpp` file is used to properly express dependencies between the metallib build and the torch_cpu link stages. Logic for generating the metal libraries is loosely based on https://github.com/ml-explore/mlx/blob/main/mlx/backend/metal/kernels/CMakeLists.txt.
- If only the DeveloperTools CLI is installed, automatically wrapping each `.metal` file into a `_metallib.h` header that contains the shader source wrapped in a `MetalShaderLibrary`
The bulk of the changes introduced in this PR is just moving code around: for every file that contains a non-templated shader definition in the `aten/src/ATen/native/mps/operators` folder, a corresponding `.metal` file is created in the `aten/src/ATen/native/mps/kernels` folder, and the embedded shader definition is replaced with the following
```cpp
#ifndef PYTORCH_JIT_COMPILE_SHADERS
static auto& lib = MetalShaderLibrary::getBundledLibrary();
#else
#include <ATen/native/mps/OpName_metallib.h>
#endif
```
Some historical stats:
| PyTorch Version | Number of shaders in MPS | Ops added |
| ------------- | ------------- | ---- |
| 1.12 | 0 | |
| 1.13 | 2 | bitwise_ops and index.out |
| 2.0 | 4 | cross, repeat and view |
| 2.1 | 9 | unary_ops, histogram, renorm, binary_ops |
| 2.2 | 11 | gamma and bucketization |
| 2.3 | 12 | naive_matmul (to work around a crash) |
| 2.4 | 13 | quantized_mm |
| 2.5 | 14 | fused_adam |
Pros:
- Better code structure/readability
- Eventually allows one to use shared headers (and implement something like `TensorIterator`)
- Faster runtime (as compilation is done ahead of time) and perhaps better optimized compiled kernels
Cons:
- Build process is a bit more complicated than it used to be
- Need to maintain two codepaths (as our CI builders only have DeveloperTools installed)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138636
Approved by: https://github.com/manuelcandales
Previously: https://github.com/pytorch/pytorch/pull/138052, but the implementation is done from scratch, so I opened a new PR.
This implements the ability to save and load profiles of automatic dynamic decisions, so on subsequent runs we can directly make something automatically dynamic. Unlike the previous implementation, this cache is never enabled by default; instead, you have to specify a "job id" that says it's OK to share results. We will be able to automatically populate this id for internal MAST jobs but for generic OSS users you will have to explicitly opt into it.
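A hedged opt-in sketch (the exact knob name is an assumption based on the description above; verify against the PR before relying on it):
```py
import torch

# Assumed knob: a compiler-level "job id" that marks automatic-dynamic
# profiles as shareable across runs of the same job; never set by default.
torch.compiler.config.job_id = "my_training_job"
```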
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Differential Revision: [D65065497](https://our.internmc.facebook.com/intern/diff/D65065497)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139001
Approved by: https://github.com/oulgen
The ONNX custom ops registration API.
## Design
1. Create a `custom_translation_table: dict[Callable, Sequence[Callable] | Callable]` parameter for specifying extra functions
2. Use a callable as the key to support all possible call_function targets in the fx graph
3. Allow a callable or a Sequence of callables as values.
- When there is a single callable, it is the translation function for the op
    - When there is a Sequence of callables, the exporter's dispatcher will dispatch to these callables in order based on input dtypes.
    - A translation function can be a plain Python function that calls onnxscript ops (traced), or an onnxscript function.
- Complex input support: We create special type annotations for annotating real representations of complex inputs, which are needed to handle complex computation in the ONNX graph, as we don't have any ops in ONNX that handle complex inputs. The dispatcher will have knowledge of these newly created type annotations and dispatch correctly. The complex functions will be in the same overload pool as the real functions.
```py
torch.onnx.export(
    dynamo=True,
    custom_translation_table={
        torch.ops.aten.add: [overload1, overload2],
        torch.sym_not: sym_not_onnx,
    },
)
```
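For illustration, a translation function such as `sym_not_onnx` above could be a plain Python function calling onnxscript ops (a hedged sketch; the choice of op is an assumption, not part of this PR):
```py
from onnxscript import opset18 as op

def sym_not_onnx(x):
    # ONNX has no dedicated sym_not; Not on a BOOL tensor serves the purpose.
    return op.Not(x)
```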
Support for functions that handle complex inputs will be in separate PRs.
fixes https://github.com/pytorch/pytorch/issues/138391
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135403
Approved by: https://github.com/titaiwangms
Summary:
Triton has added some integer overflow detection when kernels are compiled with
`debug=True`, and this test results in integer overflow (the float32 bit pattern of
2.0 is 0x40000000; times 2 is 0x80000000, which overflows a signed int32).
Assertion `int32 overflow detected for operation mul` failed
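To make the arithmetic concrete, a small self-contained check of the bit pattern involved:
```py
import struct

# Reinterpret the float32 value 2.0 as its raw 32-bit pattern.
bits = struct.unpack("<I", struct.pack("<f", 2.0))[0]
assert bits == 0x40000000
# Doubling that value exceeds INT32_MAX (0x7FFFFFFF), hence the overflow
# flagged by Triton's debug-mode checks.
assert bits * 2 == 0x80000000
```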
Fixes #139479
Test Plan:
```
python inductor/test_torchinductor.py -k test_float32_to_int32_cuda
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139489
Approved by: https://github.com/eellison, https://github.com/jansel, https://github.com/chenyang78
## This Stack
This stack does the following things to support `xformers`-style, comm-aware Triton kernels:
- Exposes `signal_pad`s as tensors in Python
- Adds a binding for `cuMemsetAsync`
These in combination aim to provide users with more flexibility to express custom signaling/synchronization patterns.
## This PR
```python
# Obtain the signal pad of the specified peer rank as a tensor.
# If both shape and dtype are unspecified, the returned tensor will be a
# 1d uint32 tensor, which is most natural for signaling purposes.
symm_mem.get_signal_pad(peer_rank)
# If only shape is specified, it is equivalent to:
# symm_mem.get_signal_pad(peer_rank)[:shape.numel()].view(shape)
symm_mem.get_signal_pad(peer_rank, shape)
# If only dtype is specified, it is equivalent to:
# symm_mem.get_signal_pad(peer_rank).view(dtype)
symm_mem.get_signal_pad(peer_rank, dtype=dtype)
```
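A hedged usage sketch built only on the calls shown above (the signaling protocol itself is illustrative, not prescribed by this PR):
```python
# Conceptual sketch: write a flag into the peer's signal pad via the tensor
# view obtained above; the matching wait/reset on the other side is outside
# the scope of this API surface.
flag = symm_mem.get_signal_pad(peer_rank)   # 1-D uint32 tensor
flag[0] = 1                                 # signal slot 0 of the peer's pad
```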
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138754
Approved by: https://github.com/weifengpy, https://github.com/lw
Fixes #136559
As we upgrade to NumPy 2, torch falsely filtered out `numpy.random` as unsupported in dynamo tracing.
This PR changes the filtering rules to include those ops while keeping behavior with NumPy 1 unchanged.
Before this PR, the following tests failed:
```
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/dynamo/test_functions.py -k FunctionTests.test_numpy_random
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/dynamo/test_unspec.py -k UnspecTests.test_to_tensor
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/test_fake_tensor.py -k FakeTensorTest.test_export_numpy
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/test_fake_tensor.py -k PropagateRealTensorsFakeTensorTest.test_export_numpy_propagate_real_tensors
```
With this PR, the supported/unsupported ops in NumPy 1 are not changed.
For NumPy 2, only the `numpy.random` ops that are already supported with NumPy 1 are added to the supported list.
I used the following script to check the differences before and after the change for both NumPy 1 & 2.
The output is empty for NumPy 1 since there is no change.
The output is a list of the `numpy.random` ops that are considered supported for NumPy 2.
```py
from torch._dynamo import trace_rules
import numpy as np

def new_numpy_function_ids():
    unsupported_funcs = {"seed", "ranf", "get_bit_generator", "RandomState", "set_bit_generator", "sample"}

    def is_supported(k, v, mod):
        if not callable(v):
            return False
        if not getattr(v, "__module__", None):
            return True
        if v.__module__ == mod.__name__:
            return True
        if v.__module__ == "numpy.random.mtrand" and mod.__name__ == "numpy.random" and k not in unsupported_funcs:
            return True
        return False

    rv = {}
    for mod in trace_rules.NP_SUPPORTED_MODULES:
        for k, v in mod.__dict__.items():
            if is_supported(k, v, mod):
                rv[id(v)] = f"{mod.__name__}.{k}"
    return rv

def old_numpy_function_ids():
    rv = {}
    for mod in trace_rules.NP_SUPPORTED_MODULES:
        rv.update(
            {
                id(v): f"{mod.__name__}.{k}"
                for k, v in mod.__dict__.items()
                if callable(v)
                and (getattr(v, "__module__", None) or mod.__name__) == mod.__name__
            }
        )
    return rv

rv1 = set(old_numpy_function_ids().values())
rv2 = set(new_numpy_function_ids().values())
for v in (rv1 - rv2):
    print(v)
print("****")
for v in (rv2 - rv1):
    print(v)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138686
Approved by: https://github.com/lezcano, https://github.com/williamwen42
This is to match the default layout constraint for custom operators. By
default, Inductor should match the stride order of inputs to a triton
kernel.
IF THIS IS BREAKING YOU, PLEASE REACH OUT, especially if it's been
more than two weeks since this landed. You can flip the config locally
as a workaround.
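A hedged sketch of the local workaround (the config name and value are assumptions drawn from Inductor's layout-constraint settings; verify against `torch/_inductor/config.py`):
```py
import torch._inductor.config as inductor_config

# Assumed knob: revert the default layout constraint for user-defined Triton
# kernels back to a flexible layout if the stricter default breaks your code.
inductor_config.triton_kernel_default_layout_constraint = "flexible_layout"
```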
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137064
Approved by: https://github.com/albanD, https://github.com/eellison
This teaches install_config_module (and the underlying code) to
understand Config objects. Additionally, we've added a JK (JustKnobs) option
to Config, whose value is resolved via JustKnobs.
This config gets stored within the _ConfigEntry class and is evaluated
when __getattr__ is called. If justknobs is set, it'll call
justknobs_check to see the result.
Due to preceding work, basically everything works correctly here; we only
had to update a couple of tests and modify the __getattr__ behaviour.
Note that we are updating the justknobs_check function to support a
default option, to make defaults work.
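A hedged sketch of what such a config could look like (the import path, constructor signature, and knob name are assumptions for illustration only):
```py
import sys

# Hypothetical config module using the Config object described above.
from torch.utils._config_module import Config, install_config_module

# Value comes from the JK when available; `default` is the fallback
# (e.g. in OSS builds where JustKnobs is unavailable).
enable_feature = Config(justknob="pytorch/compiler:enable_feature", default=True)

install_config_module(sys.modules[__name__])
```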
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138766
Approved by: https://github.com/ezyang
Summary: The main changes to support freezing are:
1) When pickling constant tensors as part of the cache key calculation: If freezing has not been applied, then keep the existing behavior (pickle the metadata and values). If freezing has been applied, then pickle the values if the constant will be inlined; otherwise, consider only the metadata.
2) If freezing has been applied, modify what we store in the cache: Instead of storing the constant attributes in the cache entry, store the _names_ of the constants, and then grab those constants from the GraphModule when we need to attach the attributes to a newly-loaded Python module. Since the cache lookup path loads the Python module, this bullet means we need to thread through a GraphModule argument in several places.
3) Since this feature means that we may need to reload the same Python module path more than once (but attach different constant attributes), I changed PyCodeCache.load_by_key_path to not store an in-memory map of path to module (since there may be more than one). I don't _think_ this will have any effect on performance, however. It's unclear why we were using an in-memory cache here anyway, since this function should only be called once for each module that needs to be loaded.
4) Several tests were removing on-disk PyCodeCache artifacts by iterating over the modules. I made this more straightforward by implementing a cache_clear method that removes the on-disk artifacts. Arguably, this should have been the implementation all along.
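For example, a hedged sketch of the test cleanup enabled by item 4 (the import path is the usual location of PyCodeCache):
```py
from torch._inductor.codecache import PyCodeCache

# Remove the on-disk PyCodeCache artifacts directly instead of iterating over
# the loaded modules and deleting their files one by one.
PyCodeCache.cache_clear()
```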
Differential Revision: [D63542170](https://our.internmc.facebook.com/intern/diff/D63542170)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136505
Approved by: https://github.com/eellison
# Summary
The AMX ISA based GEMM micro-kernel template for int8 weight-only quantization (BF16 activation, int8 weights) should cache dequantized weights (int8 -> int32 -> fp32 -> bf16) so that they would not have to be dequantized again in subsequent calls to the _inner-kernel_ that uses the same weights.
This change leverages the fact that even for BF16 x BF16 GEMM template, cache-blocking ensures that `Nr * Kc` weight elements are cached in L1D cache (more info [here](https://static.sched.com/hosted_files/pytorch2024/59/TorchInductor%20CPU%20Backend%20Advancements%20-%20New%20Features%20and%20Performance%20Improvements_20240915.pdf)). Here, `Nr` is the register blocking size for `N` dimension (at the granularity of the GEMM micro-kernel, it's currently also the cache blocking size for `N` dimension, although that may change in the future), and `Kc` is the cache blocking size for `K` dimension.
The figure below is from the document linked above -
<img width="476" alt="image" src="https://github.com/user-attachments/assets/e23e5476-d910-46d1-a9b3-cbf77de76d94">
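To make the cache-blocking claim concrete, a back-of-the-envelope footprint check (the block sizes below are illustrative assumptions, not the template's actual values):
```py
# Illustrative L1D footprint of one Nr x Kc block of cached bf16 weights.
nr, kc, bytes_per_bf16 = 32, 512, 2
footprint_kib = nr * kc * bytes_per_bf16 / 1024
print(footprint_kib)  # 32.0 KiB, comfortably within the 48 KiB L1D of a Xeon SP 4th gen core
```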
## Performance data
Collected on 48 physical cores of one socket of Intel Xeon Platinum 8468H (Xeon SP 4th gen). Intel OpenMP & tcmalloc were preloaded.
|M | N | K | Latency with ATen _weight_int8pack_mm | Latency with codegened templated GEMM (current main branch) | Latency with codegened templated GEMM (this PR) |
|-----|-----|-----|------|----------|----|
|4096|4096|4096| 45.844 ms | 9.322 ms| 5.2181 ms |
|4096|11008|4096| 127.618 ms |24.6258 ms | 13.6046 ms|
|4096|4096|11008| 121.953 ms | 25.4692 ms | 10.2669 ms |
|4096|32000|4096| 478.450 ms| 75.3942 ms | 48.21 ms |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136688
Approved by: https://github.com/jgong5