pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
PyTorch MergeBot	496c1e78c5	Revert "Implements user buffer registration using MemPool (#133603 )" This reverts commit `25d9be37be`. Reverted https://github.com/pytorch/pytorch/pull/133603 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/133603#issuecomment-2486897708))	2024-11-19 22:42:26 +00:00
Mikayla Gawarecki	b63a84804c	Allow NJT by default for weights_only torch.load (take 2) (#140739 ) Per discussion with @malfet, only allow weights_only unpickler to load NJT if `torch.nested` and `torch._dynamo` are imported (this is slightly weird as technically `torch.nested` is actually imported by default and `torch._dynamo.decorators._DimRange` is actually what needs to be imported) we can't import this from `torch.nested` as this would - undo dynamo lazy import - cause circular import =========================== Redo of https://github.com/pytorch/pytorch/pull/140304 caused issues as `torch.nested._internal.foo` needs to be imported, which causes issues like ```python torch/_weights_only_unpickler.py", line 339, in load if full_path in _get_allowed_globals(): torch/_weights_only_unpickler.py", line 188, in _get_allowed_globals torch.nested._internal.nested_tensor.NestedTensor AttributeError: module 'torch.nested' has no attribute '_internal' ``` This likely wasn't caught in our CI because imports are global during unit tests(?), so we use subprocess to properly test this time Differential Revision: [D65961691](https://our.internmc.facebook.com/intern/diff/D65961691) @jbschlosser Pull Request resolved: https://github.com/pytorch/pytorch/pull/140739 Approved by: https://github.com/malfet	2024-11-19 02:44:53 +00:00
Anant Gulati	b379a28a95	Generalization of distributed test cases for non-CUDA devices (#138216 ) # Motivation This pr is an extension of #131758. As described in #131758, these changes are looking to make distributed UTs more accessible to users of all device types. It is a demonstration of a few changes discussed by @kwen2501 and @jgong5 in the discussion for #131758(https://github.com/pytorch/pytorch/pull/131758#discussion_r1762422784) This PR contains two types of changes, the first is to the common distributed folder where we have added a new class derived from MultiProcessTestCase which helps abstracts out the process group creation /deletion and other functionality for a given device. The new generalized content can be added by deriving from this base class. Also includes other misc changes for gaudi support The second changed file is test_functional_api. a test file in common distributed. This file is a POC for how we can use this new class to write more device agnostic distributed test cases. The following changes have been made to test_functional_api.py: -Functionality has been added to test for non cuda devices using intel HPU as an example -Multiple set up steps previously required by MultiProcessTestCase have been abstracted out -Misc adaptations to allow for general call to accelerators while adding test skips instead explicitly skipping for multiple GPUs -Skipifhpu flags have been added to enable skipping a few Multithreaded test cases which are as yet not supported on HPUs NOTE: Within test functional api, there are tests which require the use of some multithreading functions which are as yet not supported on HPUs. These have been skipped for hpu using skipHPU decorator. I will be raising a separate PR to improve usability pf said decorators in a device agnostic setting in the manner suggested by @kwen2501 in a comment on this PR. This pr is a cleaned up version of a previous PR(#136988) which I closed due to human error. I have addressed some of the comments made by @kwen2501 in this as well Pull Request resolved: https://github.com/pytorch/pytorch/pull/138216 Approved by: https://github.com/kwen2501, https://github.com/guangyey	2024-11-18 09:38:00 +00:00
Will Constable	1d5a8ee8fb	[C10D] call destroy_process_group after MultiProcess tests (#140820 ) Faced with an annoying string of warnings like this when running tests, <img width="1644" alt="Screenshot 2024-11-15 at 11 23 21 AM" src="https://github.com/user-attachments/assets/91ff4e1d-3c29-4510-9a61-46e7df68a212"> My choices seem to be (1) call destroy_process_group() at the end of each test fn, (2) do this in some wrapper, (3) do it in the base test class. Since tests in MultiProcessTestCase are responsible for calling init_process_group themselves, they should also be responsible for calling destroy (or at least method (3) would be asymmetric and may result in double-destroy). But it doesn't feel worth it to go add a destroy call manually to each test, and try/except for a possible second destroy call seems like a happy middle ground. Note: tests that want to ensure that destroy runs cleanly can and should still call destroy _inside_ the test, and this change does not affect that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140820 Approved by: https://github.com/fegin	2024-11-18 04:26:21 +00:00
Edward Z. Yang	6094f17ada	Revert "revert test repro logging" (#140749 ) This reverts commit 6323fa673279eac9f2292b9b7790f621a4649af8. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140749 Approved by: https://github.com/bobrenjc93 ghstack dependencies: #138634	2024-11-17 06:25:54 +00:00
PyTorch MergeBot	bf8709b08a	Revert "[C10D] call destroy_process_group after MultiProcess tests (#140820 )" This reverts commit `77d1f076da`. Reverted https://github.com/pytorch/pytorch/pull/140820 on behalf of https://github.com/wconstab due to failures on trunk not on PR CI ([comment](https://github.com/pytorch/pytorch/pull/140820#issuecomment-2480644227))	2024-11-16 16:32:14 +00:00
Will Constable	77d1f076da	[C10D] call destroy_process_group after MultiProcess tests (#140820 ) Faced with an annoying string of warnings like this when running tests, <img width="1644" alt="Screenshot 2024-11-15 at 11 23 21 AM" src="https://github.com/user-attachments/assets/91ff4e1d-3c29-4510-9a61-46e7df68a212"> My choices seem to be (1) call destroy_process_group() at the end of each test fn, (2) do this in some wrapper, (3) do it in the base test class. Since tests in MultiProcessTestCase are responsible for calling init_process_group themselves, they should also be responsible for calling destroy (or at least method (3) would be asymmetric and may result in double-destroy). But it doesn't feel worth it to go add a destroy call manually to each test, and try/except for a possible second destroy call seems like a happy middle ground. Note: tests that want to ensure that destroy runs cleanly can and should still call destroy _inside_ the test, and this change does not affect that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140820 Approved by: https://github.com/fegin ghstack dependencies: #140460, #140815	2024-11-16 14:24:52 +00:00
Oguz Ulgen	a173186566	[RFC] Implement caching for user defined triton kernels (#140326 ) This PR adds caching for user defined triton kernels by putting the transitive closure of source code in node.meta along with constant arguments. One HUGE hack we do here is a node looks like ``` triton_kernel_wrapper_functional_proxy = torch.ops.higher_order.triton_kernel_wrapper_functional(kernel_idx = 0, constant_args_idx = 1, grid = [(1, 1, 1)], tma_descriptor_ metadata = {}, kwargs = {'in_ptr0': arg0_1, 'in_ptr1': arg1_1, 'out_ptr': arg0_1}, tensors_to_clone = ['out_ptr']); ``` so we use regex to remove `kernel_idx = 0, constant_args_idx = 1` parts as they are not relevant to cache hash. This is horrible and I'd like to eventually not use pickle as a hashing alternative but this is a longer project. Differential Revision: [D65895744](https://our.internmc.facebook.com/intern/diff/D65895744) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140326 Approved by: https://github.com/zou3519	2024-11-16 02:37:16 +00:00
Michael Lazos	1fd4757fdc	Support tensor betas in Adam and AdamW (#134171 ) Adds support for beta1 and beta2 to be wrapped in tensor for Adam and AdamW. Fixes https://github.com/pytorch/pytorch/issues/133898 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134171 Approved by: https://github.com/janeyx99	2024-11-15 21:55:55 +00:00
xin.li	80d63e7dd9	Fix softmax_backward_data cpu implementation error when argument output is noncontinguous (#139740 ) Implementation of the `softmax_backward_data` operator for the CPU backend produces incorrect results when the `output` argument is non-contiguous. Here is a test case that demonstrates this issue: ```python torch.manual_seed(0) op = torch.ops.aten._softmax_backward_data grad_output = torch.ones(3, 3, 3) temp = torch.randn(3, 10, 3) out = temp[:, :3, :] out = out.contiguous() print(out.is_contiguous()) grad_input = op(grad_output, out, 1, torch.float32) print(grad_input) ``` In this test case, the variable `grad_input` yields incorrect results if the line `out = out.contiguous()` is commented out. With this fix, `grad_input` consistently produces the same results whenever `output` is contiguous. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139740 Approved by: https://github.com/zou3519	2024-11-15 19:53:20 +00:00
Syed Tousif Ahmed	25d9be37be	Implements user buffer registration using MemPool (#133603 ) This PR implements user buffer registration and demonstrates NVLink Sharp (NVLS) reductions using a combination of allocation special memory using MemPool and registering it with the nccl buffer registration APIs. Part of https://github.com/pytorch/pytorch/issues/124807. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133603 Approved by: https://github.com/kwen2501, https://github.com/eqy	2024-11-15 12:47:49 +00:00
Nikita Shulga	9c88b08ac9	[BE] Replace `skipIfMPS` with `expectedFailureMPS` (#139940 ) Functionally two decorators are very similar, but one should rely on expectedFailure as much as possible to get signal when something is fixed. - Move `product_version` variable from `test_mps` to common_utils, but call it `MACOS_VERSION` - Introduce `skipIfMPSOnMacOS13` to decorate the hard crashes that happens only on MacOS13 (which at this point will not get any fixes and will be deprecated soon) - Add `device_type='mps'` to all `skipIfMPS` per https://github.com/pytorch/pytorch/issues/140560 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139940 Approved by: https://github.com/janeyx99, https://github.com/huydhn	2024-11-15 03:48:37 +00:00
Bob Ren	c536903c3f	revert test repro logging (#140717 ) @ezyang noticed this exercises a multithreading bug that is causing tests to become disabled: ``` 2024-11-13T21:05:55.8363582Z inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCPU::test_comprehensive_fft_ihfftn_cpu_int32 /opt/conda/envs/py_3.9/lib/python3.9/site-packages/_pytest/threadexception.py:73: PytestUnhandledThreadExceptionWarning: Exception in thread Thread-3 2024-11-13T21:05:55.8364857Z 2024-11-13T21:05:55.8364974Z Traceback (most recent call last): 2024-11-13T21:05:55.8365491Z File "/opt/conda/envs/py_3.9/lib/python3.9/threading.py", line 980, in _bootstrap_inner 2024-11-13T21:05:55.8366003Z self.run() 2024-11-13T21:05:55.8366371Z File "/opt/conda/envs/py_3.9/lib/python3.9/threading.py", line 917, in run 2024-11-13T21:05:55.8366858Z self._target(self._args, *self._kwargs) 2024-11-13T21:05:55.8367518Z File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/fbscribelogger/__init__.py", line 176, in _run_event_loop 2024-11-13T21:05:55.8368189Z self.loop.run_until_complete(self.task) 2024-11-13T21:05:55.8368774Z File "/opt/conda/envs/py_3.9/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete 2024-11-13T21:05:55.8369348Z return future.result() 2024-11-13T21:05:55.8369980Z File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/fbscribelogger/__init__.py", line 214, in _worker 2024-11-13T21:05:55.8370603Z message = await asyncio.wait_for( 2024-11-13T21:05:55.8371090Z File "/opt/conda/envs/py_3.9/lib/python3.9/asyncio/tasks.py", line 442, in wait_for 2024-11-13T21:05:55.8371573Z return await fut 2024-11-13T21:05:55.8372156Z File "/opt/conda/envs/py_3.9/lib/python3.9/asyncio/queues.py", line 166, in get 2024-11-13T21:05:55.8372613Z await getter 2024-11-13T21:05:55.8374010Z RuntimeError: Task <Task pending name='Task-1' coro=<FbScribeLogger._worker() running at /opt/conda/envs/py_3.9/lib/python3.9/site-packages/fbscribelogger/__init__.py:214> cb=[_run_until_complete_cb() at /opt/conda/envs/py_3.9/lib/python3.9/asyncio/base_events.py:184]> got Future <Future pending> attached to a different loop 2024-11-13T21:05:55.8375366Z 2024-11-13T21:05:55.8375603Z warnings.warn(pytest.PytestUnhandledThreadExceptionWarning(msg)) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/140717 Approved by: https://github.com/ezyang, https://github.com/zxiiro	2024-11-14 19:51:52 +00:00
Nikita Shulga	0f739b8f66	[Codemod] `skipIfMps`->`skipIfMPS` (#140562 ) As `MPS` is an acronym that stands for Metal Performance Shaders Also to closer align with `skipCUDAIf` not `skipCudaIf` Pull Request resolved: https://github.com/pytorch/pytorch/pull/140562 Approved by: https://github.com/ZainRizvi, https://github.com/r-barnes	2024-11-13 19:45:08 +00:00
zeshengzong	cb71bcc542	Replace clone.detach with detach.clone (#140264 ) Fixes #64532 As state in issue, replace `clone.detach` by `detach.clone` Pull Request resolved: https://github.com/pytorch/pytorch/pull/140264 Approved by: https://github.com/soulitzer	2024-11-13 07:01:02 +00:00
Tsung-Hsien Lee	953286b850	[DTensorTestbase] Fix `@with_comms` inactive problem (#139637 ) Summary: `with_comms()` is mostly used as a decorator with an optional input argument `eager_init`. The problem of a decorator with input argument is that it has to be used with invocation always, i.e., you have to use as `with_comms()` rather than `with_comms` which majority of the existing usages. This diff tries to provide a solution such that we could use `with_comms`, `with_comms()`, `with_comms(eager_init=False)`, and `with_comms(eager_init=True)`. Test Plan: Contbuild & OSS CI Differential Revision: D65385700 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139637 Approved by: https://github.com/wz337	2024-11-13 02:45:02 +00:00
Masaki Kozuki	6a368b3fc5	Add ScalarList overload to `_foreach_lerp` (#134482 ) Related: - https://github.com/pytorch/pytorch/issues/133367 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134482 Approved by: https://github.com/janeyx99	2024-11-12 19:03:41 +00:00
Jane Xu	213b8ef163	[BE] add empty tensor testing for _foreach_addcmul/div (#140276 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140276 Approved by: https://github.com/jbschlosser ghstack dependencies: #140191	2024-11-12 15:35:06 +00:00
Jane Xu	92fb1f79b8	[BE] Test interspersed empty tensors for _foreach_norm test parity (#140191 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140191 Approved by: https://github.com/jbschlosser	2024-11-12 15:35:06 +00:00
Masaki Kozuki	71d8bb7ede	implement `torch._foreach_rsqrt` (#134574 ) Related: - #133367 c Pull Request resolved: https://github.com/pytorch/pytorch/pull/134574 Approved by: https://github.com/eqy, https://github.com/janeyx99	2024-11-12 15:34:35 +00:00
Jiang, Yanbing	f77eb07662	Split int4wo weight packing (#139611 ) Fixes https://github.com/pytorch/ao/issues/1117. This PR is to seperate int4wo weight packing between CPU and other devices, to help implement `INT4CPULayout` in torchao based on https://github.com/pytorch/ao/issues/1117#issuecomment-2451252756. Now, for CPU, the input `weight` of `_convert_weight_to_int4pack_for_cpu` is [n, k] int32, output is [n, k / 2] uint8. The input packed weight of `_weight_int4pack_mm_for_cpu` is [n, k / 2] uint8. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139611 Approved by: https://github.com/jerryzh168	2024-11-12 10:12:50 +00:00
Joel Schlosser	e7ec294c10	NJT OpInfo tests v2 (#138370 ) This PR updates OpInfo-based tests for NJTs: * Adds extensive coverage across non-contiguous NJTs (both non-contiguous transposed and non-contiguous with holes) * The `_sample_njts()` helper that `sample_input_func`s utilize now produces non-contig NJTs as well * Utilizes a `SampleInput`-based xfail system for granular classification of bugs. For example, it's possible to indicate that a class of ops is expected to fail only on non-contig with holes NJT inputs. * I decided on adding `SampleInput`s and utilizing this system over using test parametrization for two reasons: * Test perf - adding `SampleInput`s is faster than generating entire new tests * Avoiding the possibility of `sample_input_func`s not respecting the non-contig test parameter - this would result in silently incorrect passing of these tests. Keeping the responsibility for `SampleInput` generation firmly within each `OpInfo`'s `sample_input_func` means weirdness like this isn't possible * Improves `SampleInput` naming for a bunch of `sample_input_func`s. This makes it easier to xfail them as needed. For example, binary / unary / other ops now use the new `_describe_njt()` helper to get a string repr that uniquely defines the type of NJT being passed to the op * Adds appropriate `XFailRule`s to get tests passing for forward / backward / forward compile / backward compile. In general, each xfail corresponds to some bug that needs to be fixed ```python # Represents a rule indicating how to xfail a particular test. It allows granularity # at the device, dtype, op, and individual sample levels. This flexibility allows entire # bugs to be represented by a single rule, even if this corresponds with multiple conceptual # test cases across multiple ops. @dataclass class XFailRule: # expected error type error_type: TypeVar = Exception # expected error message error_msg: str = "." # function to indicate whether the rule applies; return True if so match_fn: Callable[[torch.device, torch.dtype, OpInfo, SampleInput], bool] = None # optional name for identifying the rule name: str = "" def match(self, device, dtype, op, sample) -> bool: return self.match_fn(device, dtype, op, sample) ``` Example: ```python # Bug when broadcasting a binary op with non-contiguous with holes NJT + dense # tensor with 1 in ragged dim. XFailRule( error_type=RuntimeError, error_msg="cannot call binary pointwise function . with inputs of shapes", match_fn=lambda device, dtype, op, sample: ( isinstance(op, BinaryUfuncInfo) and "noncontig_holes" in sample.name and "broadcasting 1 over ragged" in sample.name ), name="binary_noncontig_holes_broadcasting_1_over_ragged", ), ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/138370 Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer ghstack dependencies: #140160	2024-11-11 19:35:24 +00:00
Xiaodong Wang	565a7942ee	Recover non-standard bool test for msort (#139870 ) Summary: I was looking into why the non-standard bool value will fail for msort - it makes sense for argsort and sort to fail, because we're randomly generating uint8 so the order will be different (and thus the indices will be different). But msort should work. After some digging, it's interesting that even though scalar_t is bool, when the actual value is a uint8_t, the comparison will treat them as signed. I tried lhs=255 and rhs=0: lhs < rhs is equivalent to -1 < 0 which is true (but it's supposed to be False) Therefore we add an explicit type cast. Test Plan: Remove the test skip Differential Revision: D65472170 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139870 Approved by: https://github.com/Skylion007, https://github.com/davidberard98	2024-11-11 02:00:34 +00:00
Natalia Gimelshein	1cdaf1d85f	correctly keep track of processed tensors for foreach reductions (#140103 ) Fixes #140066 Pull Request resolved: https://github.com/pytorch/pytorch/pull/140103 Approved by: https://github.com/janeyx99 Co-authored-by: Jane Xu <janeyx@meta.com>	2024-11-08 23:04:53 +00:00
Animesh Jain	86792a5a8d	[invoke_subgraph] User facing API to support arbitrary args and kwargs (#139162 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139162 Approved by: https://github.com/zou3519	2024-11-08 03:31:19 +00:00
Nikita Shulga	ae01f2b61b	Extend CPU implementation of MSELoss to BF16 (#139959 ) It's strange that it has not been implemented for the type yet Pull Request resolved: https://github.com/pytorch/pytorch/pull/139959 Approved by: https://github.com/jgong5, https://github.com/janeyx99 ghstack dependencies: #139961	2024-11-07 23:50:15 +00:00
Animesh Jain	75f3056c81	[hop-db] Import invoke_subgraph to avoid Dynamo error on mac (#140038 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140038 Approved by: https://github.com/ydwu4	2024-11-07 22:36:57 +00:00
IvanKobzarev	781c68c865	[aotd] coerce_same_metadata_as_tangent with expected_type for e.g.AsyncCollectiveTensor (#139095 ) Based on discussion here: https://github.com/pytorch/pytorch/pull/138731 Introducing ability for subclass implement type convertion to expected_type. ``` def __coerce_same_metadata_as_tangent__( self, expected_metadata: Any, expected_type: Optional[Type] = None ): ``` Here if `expected_type=None` means `SubclassClass` is expected. E.g. for `DTensor` we may find tangent `AsyncCollectiveTensor` where we expected `Tensor` - in this case `expected_type=Tensor` will be called during runtime Adding implementation to AsyncCollectiveTensor, that just triggers `wait()`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139095 Approved by: https://github.com/bdhirsh	2024-11-07 16:24:48 +00:00
Gabriel Ferns	2037ea3e15	Add type annotations to Configs (#139833 ) Summary: Adds types to Configs, and fixes a bug in options that was caused by the lack of types. fixes: https://github.com/pytorch/pytorch/issues/139822 Configs are used by many modules so not sure which label to put. Types also allow https://github.com/pytorch/pytorch/pull/139736 to fuzz configs Pull Request resolved: https://github.com/pytorch/pytorch/pull/139833 Approved by: https://github.com/c00w	2024-11-07 03:49:09 +00:00
Sun, Jiayi	a59132b9c8	fix torch.linalg.norm and torch.norm for torch.complex32 datatype (#133661 ) Fix https://github.com/pytorch/pytorch/issues/132634. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133661 Approved by: https://github.com/mingfeima, https://github.com/Skylion007	2024-11-07 03:21:36 +00:00
Edward Z. Yang	4e647871d6	Ensure TORCH_TRACE is run for Dynamo/Distributed tests (#139786 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/139786 Approved by: https://github.com/bobrenjc93, https://github.com/c00w, https://github.com/anijain2305 ghstack dependencies: #139716	2024-11-07 01:58:05 +00:00
Colin L. Rice	2a857e940d	config: Add env_name_default and env_name_force to Config (#138956 ) This allows Configs to handle setting their defaults (or overriding themselves) via environment variables. The environment variables are resolved at install time (which is usually import time). This is done 1) to avoid any race conditions between threads etc..., but 2) to help encourage people to just go modify the configs directly, vs overriding environment variables to change pytorch behaviour. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138956 Approved by: https://github.com/ezyang ghstack dependencies: #138766	2024-11-06 21:20:42 +00:00
Nikita Shulga	68ef445c33	[MPS][Perf] Dispatch to SDP-math-mps for non-contig Tensors (#139791 ) As MacOS-15 or newer supports those out of the box. This significantly reduces memory requirements and improves performance for some stable diffision networks. Test plan: Run ```python from diffusers import StableDiffusionXLPipeline, AutoencoderKL, EulerAncestralDiscreteScheduler import torch import time vae = AutoencoderKL.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", subfolder='vae', torch_dtype=torch.bfloat16, force_upcast=False).to('mps') pipe = StableDiffusionXLPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", vae=vae, torch_dtype=torch.bfloat16, variant="fp16").to('mps') pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config) start_time = time.time() start_mps_mem = torch.mps.driver_allocated_memory() image = pipe(prompt="Spherical cow in vacuum", num_inference_steps=10, guidance_scale=8, generator=torch.Generator("mps").manual_seed(42), ).images[0] end_mps_mem = torch.mps.driver_allocated_memory() run_time = time.time() - start_time print(f"run time in {run_time:.2f} sec, end_mps_mem {end_mps_mem/1024.02:.2f} Mb mem increase {(end_mps_mem-start_time)/1024.02:.2f} Mb") image.save(f'bfloat16.png') ``` Before the change total memory use were 16Gb and needed 65 sec to complete, after it drops down to 14Gb and takes 50 sec to finish on M2Pro, though generated image remains the same: ![image](https://github.com/user-attachments/assets/1a35efef-9f80-4cd0-ac9c-30203eab6bb1) Fixes https://github.com/pytorch/pytorch/issues/139389 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139791 Approved by: https://github.com/drisspg, https://github.com/Skylion007 ghstack dependencies: #139788, #139784, #139763	2024-11-06 16:25:39 +00:00
Sun, Jiayi	44df6522ee	add Half/BFloat16 support for grid_sample on CPU (#134812 ) Fix https://github.com/pytorch/pytorch/issues/127224. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134812 Approved by: https://github.com/Skylion007, https://github.com/mingfeima	2024-11-06 14:02:08 +00:00
Jack Taylor	5f266b5a02	[ROCm] re-enable flex attention UTs (#139632 ) https://github.com/pytorch/pytorch/pull/136792 accidentally disabled flex attention UTs on ROCm. Re-enabling. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139632 Approved by: https://github.com/drisspg	2024-11-06 12:49:44 +00:00
Xiaodong Wang	e7cf7d00be	Support torch.bool in torch.sort + CUDA (#139409 ) Summary: This might be out-dated, so I'm adding it back and see if we pass all the tests. I'm pretty sure cuda12 is ok. Test Plan: CI Differential Revision: D65282650 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139409 Approved by: https://github.com/zou3519, https://github.com/ngimel, https://github.com/eqy	2024-11-06 00:02:54 +00:00
PyTorch MergeBot	1d28b8b6d5	Revert "Deprecate `torch._utils.is_compiling()` and `torch._dynamo.external_utils.is_compiling()` (#127690 )" This reverts commit `e84d1121ad`. Reverted https://github.com/pytorch/pytorch/pull/127690 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. More details in D65483292 ([comment](https://github.com/pytorch/pytorch/pull/127690#issuecomment-2458381056))	2024-11-05 23:10:38 +00:00
Xuehai Pan	e84d1121ad	Deprecate `torch._utils.is_compiling()` and `torch._dynamo.external_utils.is_compiling()` (#127690 ) This PR is split from PR #126898. - #126898 ------ Pull Request resolved: https://github.com/pytorch/pytorch/pull/127690 Approved by: https://github.com/Skylion007, https://github.com/malfet	2024-11-05 10:44:56 +00:00
zeshengzong	ffb7a08921	Fix torch.histc not checking min > max on cuda for int8 tensors (#139372 ) Fixes #139360 `86e6513c86/aten/src/ATen/native/cuda/SummaryOps.cu (L323-L324)` Assign `min` and `max` to with low-precision input_t variable `minvalue` and `maxvalue` cause wrong comparing result in following check in here: `86e6513c86/aten/src/ATen/native/cuda/SummaryOps.cu (L353)` ![image](https://github.com/user-attachments/assets/0d5c87f4-3dc6-48bb-bcc8-b1803e7cd487) Change type of `minvalue` and `maxvalue` to fix it, similar like in line: `86e6513c86/aten/src/ATen/native/cuda/SummaryOps.cu (L280-L282)` Test Result ```bash $ pytest test/test_reductions.py -vv ``` ![image](https://github.com/user-attachments/assets/6b5d0d48-ebc2-4a8c-85f4-dbad147c086c) ```bash $ lintrunner ``` ![image](https://github.com/user-attachments/assets/f97c2d6d-78ea-4439-a1ba-907bc9defad7) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139372 Approved by: https://github.com/eqy	2024-11-05 08:42:38 +00:00
Chen, Zejun	9aaf3a04fa	[profiler][UT] instantiate profiler UTs for devices and enable UTs for xpu profiler (#134316 ) This PR enables the profiler related UT to be device-agnostic. It instantiates the profiler UTs for different device types and enable them on XPU backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134316 Approved by: https://github.com/etaf, https://github.com/aaronenyeshi, https://github.com/gujinghui	2024-11-05 05:46:13 +00:00
Jane Xu	23169a6bcc	Disable foreach tests for complex128 internally (#139649 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139649 Approved by: https://github.com/ngimel	2024-11-04 23:24:47 +00:00
Colin L. Rice	abc5d59dcb	config: create Config objects with JK support (#138766 ) This teaches install_config_module (and the underlying code) to understands Config objects. Additionally we've added a JK option to this which resolves the JK. This config gets stored within the _ConfigEntry class and is evaluated when __getattr__ is called. If justknobs is set, it'll call justknobs_check to see the result. Due to preceeding work, basically everything works correctly here and we had to update a couple of tests, and modify the getattr behaviour. Note that we are updating the justknob_check function to support a default option, to make default work. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138766 Approved by: https://github.com/ezyang	2024-11-01 19:20:37 +00:00
Joel Schlosser	ddb291a881	Fix and test several NJT reductions (#139317 ) I'm sick of reductions not working properly - spotty dim coverage, missing backwards, etc. This PR fixes quite a bit. It applies to the following ops: * `sum` / `mean` / `prod` * `all` / `any` * `amin` / `amax` * `min` / `max` * `argmin` / `argmax` The general reduction logic has been factored out into a helper `_apply_reduction(func, func_name, identity_element, args, kwargs)`. The idea is that by providing a valid identity element, we can utilize conversions to padded dense when needed for reducing over the ragged dim. Extensive test coverage includes: reductions across ragged dim * reductions across non-batch, non-ragged dims * reductions across both batch and ragged dims * multiple dim reductions (for ops that support this) * full reduction -> scalar Bonus: the PR includes backwards fixes for `sum` and `mean`, which have never worked. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139317 Approved by: https://github.com/cpuhrsch	2024-10-31 20:55:38 +00:00
Antoni Viros	ad637a4c5c	Add support for index_put_ in NT (#135722 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135722 Approved by: https://github.com/jbschlosser	2024-10-30 17:17:59 +00:00
PyTorch MergeBot	5861279f47	Revert "Add support for index_put_ in NT (#135722 )" This reverts commit `b4836e5b5c`. Reverted https://github.com/pytorch/pytorch/pull/135722 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing on ROCm ([comment](https://github.com/pytorch/pytorch/pull/135722#issuecomment-2445651914))	2024-10-30 01:53:55 +00:00
Antoni Viros	b4836e5b5c	Add support for index_put_ in NT (#135722 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135722 Approved by: https://github.com/jbschlosser	2024-10-30 00:03:21 +00:00
Joel Schlosser	23d590e518	More flexible test parametrization with @reparametrize (#138369 ) Background: The `@parametrize` decorator enjoys widespread usage as a convenient tool for ensuring extensive test coverage. One particular feature that makes this easy is the ability to stack such decorators, testing over the cross-product of inputs. Example: ```python class MyTestClass(TestCase): @parametrize("x", range(3)) @parametrize("y", [False, True]) def test_foo(self, x, y): # Invoked with: # x=0, y=False # x=1, y=False # x=2, y=False # x=0, y=True # x=1, y=True # x=2, y=True ... ``` Note that the `@ops` and `@modules` decorators employ the same underlying machinery for parametrizing over `OpInfo` / `ModuleInfo` entries. These decorators also parametrize over op-specific `device` / `dtype` info according to what is supported for each op. ```python class MyTestClass(TestCase): @ops(op_db) def test_foo(self, op, device, dtype): # Invoked each OpInfo in the db along with each device / dtype that corresponds # with this op according to the OpInfo entry. ... ``` Note that this in contrast to the naive cross product between ops and devices / dtypes, which would generate too many tests. Certain use cases benefit from a similar type of flexible parametrization that is more intelligent than simple cross-product composition. It is expensive to generate / run too many tests, even if the unneeded ones are skipped appropriately. This PR attempts to generalize such flexible parametrization and satisfy these use cases through the introduction of a `@reparametrize` decorator, which operates on an existing parametrizer and allows for customized on-the-fly parametrization through the use of an `adapter_fn`. Examples: ```python # adapter_fn that adds a new arg def include_is_even_arg(test_name, param_kwargs): x = param_kwargs["x"] is_even = x % 2 == 0 new_param_kwargs = dict(param_kwargs) new_param_kwargs["is_even"] = is_even is_even_suffix = "_even" if is_even else "_odd" new_test_name = f"{test_name}{is_even_suffix}" yield (new_test_name, new_param_kwargs) # adapter_fn that excludes certain values def exclude_odds(test_name, param_kwargs): x = param_kwargs["x"] is_even = x % 2 == 0 yield None if not is_even else (test_name, param_kwargs) class MyTestClass(TestCase): @reparametrize(parametrize("x", range(5)), include_is_even_arg) def test_foo(self, x, is_even): # Invoked with both the x value and the new is_even arg ... @reparametrize(parametrize("x", range(5)), exclude_odds) def test_bar(self, x): # Only invoked with even x values ... ``` For a more real-world use case, imagine you want to write a set of OpInfo tests that parametrize over additional op-specific things beyond `device` / `dtype` (in NJT's case, this includes contiguity type, whether to operate over the batch / ragged / other dims, etc.). The `@reparametrize` decorator allows you to customize the `@ops` parametrization to add in these additional args as they make sense on a per-op basis. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138369 Approved by: https://github.com/janeyx99	2024-10-29 22:14:38 +00:00
drisspg	80c7c7178e	Make sure all SDPA tests are ran with tensor cores enabled (#135592 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135592 Approved by: https://github.com/eqy	2024-10-29 20:53:10 +00:00
Jake Schmidt	2b577ae58f	Implement NJT embedding backward (#138627 ) Fixes #138352 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138627 Approved by: https://github.com/jbschlosser	2024-10-29 18:44:58 +00:00
PyTorch MergeBot	38645e8a3e	Revert "Fix unbind_copy and add its decomposition (#134319 )" This reverts commit `8aedc649bd`. Reverted https://github.com/pytorch/pytorch/pull/134319 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but this is still failing the same test on ExecuTorch ([comment](https://github.com/pytorch/pytorch/pull/134319#issuecomment-2443209139))	2024-10-29 04:54:37 +00:00
wz337	5b39734a0a	[DTensor][Test] Fix gloo backend failure when eager_init is turned on (#139097 ) We should only pass the `device_id` when the backend is `nccl`. Otherwise, we would run into the following error: ``` RuntimeError: No backend for the parent process group or its backend does not support splitting ``` This also fixes test failure is not asserted when using `with_comms()` or `with_comms(eager_init=False)`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139097 Approved by: https://github.com/XilunWu	2024-10-29 00:04:06 +00:00
cyy	aa2b17c330	[3/N] Don't skip ASAN on some tests (#139058 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/139058 Approved by: https://github.com/ezyang	2024-10-28 23:57:23 +00:00
Guilherme Leobas	8785353f2f	Fix tensor subclass + dynamic shapes in torch.compile + aot autograd (#125941 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125941 Approved by: https://github.com/bdhirsh ghstack dependencies: #133337	2024-10-28 21:58:59 +00:00
Guilherme Leobas	6baccb430b	Update TwoTensor impl. to accept `outer_size/outer_stride` (#133337 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133337 Approved by: https://github.com/bdhirsh	2024-10-28 21:58:59 +00:00
William Wen	52c80f663d	change name of dynamo CI chard to dynamo_wrapped (#138233 ) Implements https://github.com/pytorch/pytorch/issues/118127 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138233 Approved by: https://github.com/clee2000	2024-10-28 21:42:33 +00:00
Joel Schlosser	8ba9063002	FlexAttention support for NJT (#136792 ) This PR adds FlexAttention + NJT support. In particular: * To handle raggedness, treats the packed sequence dim of input NJTs as a giant "stacked sequence". To ensure user `score_mod` / `mask_mod` functions can still be written in the original NJT sequence space, this PR handles conversions for indices within the giant "stacked sequence" -> sequence relative indices automatically. * Provides `py_impls` for `NestedTensor` to the HOPs for flex attention forward / backward that simply wrap / unwrap NJTs appropriately * Adds barebones `new_empty()` support to NJT since FlexAttention utilizes this repeatedly; right now, only `new_empty()` with a shape of `()` is supported * Tests that FlexAttention with a causal mask matches causal SDPA * Adds a new public API for FlexAttention usage: * `create_nested_block_mask(mask_mod, B, H, njt, BLOCK_SIZE, _compile)` - NJT analogue for `create_block_mask()` that utilizes the `njt`'s ragged structure to create an appropriately-sized block mask (e.g. `(1, 1, total_seqlen, total_seqlen)`). This function handles the index conversion from "stacked sequence" space -> relative sequence space. * Minor note: as this is a public API, this function is purposefully named with "nested" instead of "njt" to keep the latter as an informal, mostly internal-only term. Example usage: ```python def causal_mask(b, h, q_idx, kv_idx): return q_idx >= kv_idx query = ... # NJT of shape (B, H, S, D) key = ... # NJT of shape (B, H, S, D) value = ... # NJT of shape (B, H, S, D) # create_nested_block_mask() automatically converts indices from "stacked sequence" space -> relative sequence space block_mask = create_nested_block_mask(causal_mask, 1, 1, query) # block mask conceptual shape is (B, H, sum(S), sum(S)) output = flex_attention(query, key, value, block_mask=block_mask) def causal_score_mod(score, b, h, q_idx, kv_idx): return torch.where(q_idx >= kv_idx, score, float("-inf")) # flex_attention() automatically converts indices from "stacked sequence" space -> relative sequence space for NJT inputs output2 = flex_attention(query, key, value, score_mod=causal_score_mod) ``` TODO: ~~Determine the right level of abstraction for public API helpers + move them alongside other helpers~~ Verify this with others though * ~~Some cleanup~~ * ~~`njt_score_mod_adapter`~~ * ~~Q: should `create_njt_block_mask()` call `njt_mask_mod_adapter()` so we don't need two calls?~~ * Can we avoid materializing the `sum(s)` length `seq_idx` used for conversion between stacked sequence -> sequence relative indices? * Not for now, although future work may deepen the integration between Flex + NJT (possibly requiring custom templates). We should try to cache this though. * ~~Demonstrate non-causal mask~~ * Support non-contiguous NJTs with holes (booted to future PR) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136792 Approved by: https://github.com/drisspg ghstack dependencies: #138841	2024-10-28 20:01:27 +00:00
Mwiza Kunda	c2ded9ec0d	Fix dot reference checks (#138596 ) dot reference implementation should be consistent with the cpu / cuda implementations since it may be used for meta dispatch i.e. ```python import torch x = torch.tensor([1,2,3], dtype=torch.float32) y = torch.tensor([4,5,6], dtype=torch.float16) x.dot(y) Traceback (most recent call last): File "<stdin>", line 1, in <module> RuntimeError: dot : expected both vectors to have same dtype, but found Float and Half ``` However the below does not raise an exception ```python x.to("meta").dot(y.to("meta")) ``` Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/138596 Approved by: https://github.com/bdhirsh	2024-10-28 19:11:40 +00:00
Aaron Gokaslan	5d074746e9	[BE]: Add better optional typing (#138426 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/138426 Approved by: https://github.com/XuehaiPan, https://github.com/malfet	2024-10-27 14:19:00 +00:00
Simon Fan	99608ceed6	Scoped extension building for C++ backed custom ops tests (#136695 ) FIXES #125579 #131103 #133197 #133283 #134738 #135369 #135685 Tests that create C++ extensions can cause flakiness in CI due to library namespace conflict and test ordering. We can build them in temp dirs to ensure isolation. An alternative is to build these as part of the build process and have build time errors. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136695 Approved by: https://github.com/zou3519	2024-10-26 07:41:00 +00:00
Yidi Wu	c6bb9b53f4	[scan] better error handling and remove redundant tests (#137967 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137967 Approved by: https://github.com/zou3519	2024-10-25 19:01:25 +00:00
Tom Ritchford	8aedc649bd	Fix unbind_copy and add its decomposition (#134319 ) * Fixes https://github.com/pytorch/pytorch/issues/130829 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134319 Approved by: https://github.com/amjames, https://github.com/eellison	2024-10-23 19:13:44 +00:00
Tom Ritchford	1bc73f3157	Add decomposition for permute_copy (#130944 ) * Extracted from #129476 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130944 Approved by: https://github.com/amjames, https://github.com/eellison	2024-10-23 17:42:11 +00:00
William Wen	3441ea7d74	[dynamo] reset compiler stance after test (#138277 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138277 Approved by: https://github.com/anijain2305, https://github.com/jansel	2024-10-23 00:07:33 +00:00
Animesh Jain	4dd4d38ca9	[hierarchical-compilation][hop] Introduce invoke_subgraph (#137538 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137538 Approved by: https://github.com/zou3519	2024-10-22 15:33:34 +00:00
Colin L. Rice	bb8bc7d6b3	config: simplify most of the config handling and fix some bugs (#138377 ) This PR combines a number of cleanups in one PR. If any of the specific cleanups don't seem to make sense, let me know and I can remove them. Cleanups - This PR adds a set of test suites for the config module code, which handles basically all the APIs and ways it is used. Please let me know if you see anything critical that is not tested that I missed. This test suite is primarily used as the regression test suite for later changes in this diff. Note that there is some dynamo specific testing of the config module, but it isn't as verbose. - I removed all internal usage of shallow_copy_dict. Those usages could all use the deep copy, and did not depend on the reference behavior of certain config values that shallow_copy_dict allows. - I removed shallow copy semantics for configuration with a deprecation warning. I think this requires a release note, so hopefully I did that correctly. Let me know if we want to continue to expose shallow copy value semantics, but I just can't find a case where I expect anyone would want it. It also complicated later internal changes to the API (i.e. breaking apart various layers of the config changes). - I fixed what I believe is a bug in how hashes are calculated on configs. In particular, if you got the hash, then made a config change, and then got the hash again, it would not update the hash. @oulgen, please let me know if I'm misunderstanding this behavior and it is desired. - I switched our multiple implementations of iterating through the dictionary to a single one. This is primarily to make later changes easier, but it also makes it clear how inconsistent our various config ignoring options are. Let me know if people would be interested in me unifying the various options for ignoring config values. - I updated the test patcher (not the performance critical one, just the normal one), to use __setattr__ and __getattr__ to remove direct API access to the underlying config fetcher. For release notes, Not sure exactly how to communicate this, but something like "ConfigModule.to_dict, and ConfigModule.shallow_copy_dict no longer retain their shallow copy semantics, which allowed reference values objects to be modified. If you wish to modify the config object, call load_config explicitly". Pull Request resolved: https://github.com/pytorch/pytorch/pull/138377 Approved by: https://github.com/ezyang, https://github.com/jansel, https://github.com/jovianjaison	2024-10-22 13:40:26 +00:00
Bob Ren	20a2d39557	Log all failing test repros to scuba (#138394 ) This has the benefit that 1) It's much easier to aggregate test failure repros into say a CSV or shell script from scuba 2) We can do analysis (eg. set different two sets of tests across two PRs) 3) We can get results faster at the test-level granularity instead of job-level granularity we see in the HUD/GH. I tested this by introducing a breaking change, adding ci-scribe label and then verifying that the failed tests were logged to scuba: https://fburl.com/scuba/torch_open_source_signpost/w6qt7qr9 I then reverted the breaking change and published this PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138394 Approved by: https://github.com/ezyang	2024-10-21 21:35:47 +00:00
Will Feng	e4ad02892f	Upgrade distributed test to g4dn instances (T4 GPUs) (#137161 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137161 Approved by: https://github.com/seemethere, https://github.com/eqy, https://github.com/yf225 Co-authored-by: Will Feng <yf225@cornell.edu>	2024-10-20 23:48:54 +00:00
Will Feng	a9f4f89cd5	[CI] Add Compiled DDP / Compiled FSDP2 / compute-comm reordering tests to test_inductor_distributed (#138178 ) `test_replicate_with_compiler.py` and `test_fully_shard_compile.py` requires bf16, so needs to be run within test_inductor_distributed job (which uses A10G (SM80) and has bf16 support). This allows us to migrate distributed jobs to T4 machines in https://github.com/pytorch/pytorch/pull/137161, as the compiled distributed jobs are the only blocking ones now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138178 Approved by: https://github.com/xmfan, https://github.com/fduwjj, https://github.com/fegin, https://github.com/kwen2501	2024-10-20 19:38:18 +00:00
Tom Ritchford	c0582fd0f8	Remove unused Python variables in torch/[b-z]* (#136963 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136963 Approved by: https://github.com/ezyang	2024-10-19 16:45:22 +00:00
wz337	ff598f2f4d	[DTensorTestbase] Add an optional `eager_init` flag to `with_comms()` to support eager init nccl communicator for DeviceMesh test case (#138108 ) Add an optional `eager_init` flag to `with_comms`. When `eager_init` is True and backend is `nccl`, we pass the `device_id` to `init_process_group()` for eager initialization. Otherwise, `device_id` is still `None` and this goes through the normal lazy call. Default for `eager_init` is False. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138108 Approved by: https://github.com/kwen2501	2024-10-19 01:04:55 +00:00
Ke Wen	c88b77af9c	[Distributed][CI] Add SM guard for compiled tests involving BF16 (#138245 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138245 Approved by: https://github.com/yf225	2024-10-18 21:39:39 +00:00
PyTorch MergeBot	7b39fb5712	Revert "Fix unbind_copy and add its decomposition (#134319 )" This reverts commit `9f81270d75`. Reverted https://github.com/pytorch/pytorch/pull/134319 on behalf of https://github.com/clee2000 due to breaking some executorch tests D64568664 ([comment](https://github.com/pytorch/pytorch/pull/134319#issuecomment-2423157700))	2024-10-18 20:09:40 +00:00
PyTorch MergeBot	0ff6f7a040	Revert "[Distributed][CI] Add SM guard for compiled tests involving BF16 (#138245 )" This reverts commit `1581a93e87`. Reverted https://github.com/pytorch/pytorch/pull/138245 on behalf of https://github.com/albanD due to Breaks distributed inductor tests ([comment](https://github.com/pytorch/pytorch/pull/138245#issuecomment-2422462579))	2024-10-18 13:21:17 +00:00
Ke Wen	1581a93e87	[Distributed][CI] Add SM guard for compiled tests involving BF16 (#138245 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138245 Approved by: https://github.com/yf225	2024-10-18 09:10:01 +00:00
Xinya Zhang	770fcaf2ab	Fix the Rank of logsumexp Tensor and mGPU support. (#137717 ) The logsumexp tensor was considered for internal use only but apparently exposed to unit tests and inductors. The stream should be selected after picking the current device. Otherwise the code is checking the default device's architecture. Fixes #131316 #137414 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137717 Approved by: https://github.com/drisspg Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>	2024-10-17 21:58:14 +00:00
Tom Ritchford	9f81270d75	Fix unbind_copy and add its decomposition (#134319 ) * Fixes https://github.com/pytorch/pytorch/issues/130829 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134319 Approved by: https://github.com/amjames, https://github.com/eellison	2024-10-17 21:27:35 +00:00
Joel Schlosser	906fe05895	Naive impls for NJT matmul (#138121 ) Our matmul support is abysmal - let's at least get this working and do it performantly later. Bonus: implements `bmm` as well. jagged <-> padded dense conversions are utilized when possible, and an unbind-based fallback otherwise (the former works with torch.compile and the latter doesn't). Some testing is missing because we don't have factory function support yet :( Pull Request resolved: https://github.com/pytorch/pytorch/pull/138121 Approved by: https://github.com/cpuhrsch	2024-10-17 01:31:46 +00:00
Jane Xu	94537e70b5	Skip test_parity__foreach_mul_fastpath_inplace_cuda_complex128 internally (#138100 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138100 Approved by: https://github.com/Skylion007	2024-10-17 00:34:56 +00:00
PyTorch MergeBot	4b3035f2fe	Revert "Add decomposition for permute_copy (#130944 )" This reverts commit `e7a4ad3b40`. Reverted https://github.com/pytorch/pytorch/pull/130944 on behalf of https://github.com/clee2000 due to breaking internal builds D64418214 cc @digantdesai @GregoryComer to help get this fixed and remerged ([comment](https://github.com/pytorch/pytorch/pull/130944#issuecomment-2418125356))	2024-10-16 23:18:53 +00:00
PyTorch MergeBot	24ee4af86b	Revert "Upgrade distributed test to g4dn instances (T4 GPUs) (#137161 )" This reverts commit `2b7c7a20b9`. Reverted https://github.com/pytorch/pytorch/pull/137161 on behalf of https://github.com/kwen2501 due to breaking trunk ([comment](https://github.com/pytorch/pytorch/pull/137161#issuecomment-2417833666))	2024-10-16 20:05:38 +00:00
Ke Wen	2b7c7a20b9	Upgrade distributed test to g4dn instances (T4 GPUs) (#137161 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137161 Approved by: https://github.com/seemethere, https://github.com/eqy	2024-10-16 16:42:57 +00:00
PyTorch MergeBot	78632b97b1	Revert "Upgrade distributed test to g4dn instances (T4 GPUs) (#137161 )" This reverts commit `f43c4d28b8`. Reverted https://github.com/pytorch/pytorch/pull/137161 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seems another failure showing up after the upgrade ([comment](https://github.com/pytorch/pytorch/pull/137161#issuecomment-2415941159))	2024-10-16 07:26:34 +00:00
Ke Wen	f43c4d28b8	Upgrade distributed test to g4dn instances (T4 GPUs) (#137161 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137161 Approved by: https://github.com/seemethere, https://github.com/eqy	2024-10-16 05:03:08 +00:00
Adnan Akhundov	809ff3b274	Add host-side Triton TMA support to Dynamo (#137677 ) This adds Dynamo tracing support for the host-side Triton TMA API (see `create_2d_tma_descriptor` calls on the host in the [Triton tutorial](https://triton-lang.org/main/getting-started/tutorials/09-persistent-matmul.html#sphx-glr-getting-started-tutorials-09-persistent-matmul-py)). A few notes: - Here we assume the availability of the host-side TMA API added to upstream Triton in https://github.com/triton-lang/triton/pull/4498. As of time of writing, this is not a part of the PT2 OSS Triton pin (although back-ported internally). OSS Triton pin update should be done in December 2024. - To capture the chain of calls `t.data_ptr() --> create_{1d,2d}_tma_descriptor(ptr, ...) --> kernel[grid](tma_desc, ...)`, we add three new variable trackers: `DataPtrVariable`, `CreateTMADescriptorVariable` (for the function), `TMADescriptorVariable` (for TMA descriptor object). This is to maintain the path back from the Triton kernel to the Tensor from which the TMA descriptor has been created. - The newly introduced variables have `reconstruct` methods used in case of graph breaks. - The `tma_descriptor_metadata` extracted from the captured `create_{1d,2d}_tma_descriptor` calls is propagated through the HOPs in Dynamo and AOTAutograd to be used by the downstream compiler (e.g., Inductor). See the unit tests for how the captured HOP arguments look like. - In the Dynamo-captured fx graph, we replace the TMA descriptor arguments of the Triton kernel by the underlying Tensors, to be able to track the input/output relationships in terms of Tensors. - In the Triton kernel mutation analysis pass (in AOTAutograd), we use the `tt.experimental_descriptor_store` TTIR op to detect mutations of the underlying tensors via TMA descriptors. So that downstream AOTAutograd can perform functionalizations as required. - JIT Inductor and AOT Inductor support will be implemented in follow-up PRs. Differential Revision: [D64404928](https://our.internmc.facebook.com/intern/diff/D64404928) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137677 Approved by: https://github.com/zou3519	2024-10-16 02:18:48 +00:00
Ke Wen	35fc24fbed	[PGNCCL] Fix bugs in non-blocking mode (#137741 ) ### Fix 1: Throw async error during init wait Previously we just busy wait for `ncclSuccess`, if the nonblocking init encountered error, we never report that. Added detection of async error via `ncclGetAsyncError`. ### Fix 2: Add wait after comm split ``` // After calling ncclCommSplit in non-blocking mode, we should wait for the // source communicator to be out of ncclInProgress state. // Reason 1: // it's unsafe to call new operations on the parent comm while it's in // ncclInProgress state. // Reason 2: // as of NCCL 2.23, the ptr value of child comm will not be filled until the // state of parent comm is ncclSuccess. This may change in the future. See: // https://github.com/NVIDIA/nccl/issues/1472 ``` This wait does not mean the child comm is ready for use, neither does it block till that point. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137741 Approved by: https://github.com/shuqiangzhang	2024-10-15 20:35:39 +00:00
Aaron Orenstein	524fe784ec	BundledAutotuneCache (take 2) (#137902 ) Summary: Add a cache to combine individual autotune caches into a single cached bundle. We still rely on the individual autotune caches - on a cache hit we copy the individual results into the local caches so they can retrieved later. Attempt 2 of #134959 (D60677499). Various configs: env: TORCHINDUCTOR_BUNDLED_AUTOTUNE_REMOTE_CACHE config: bundled_autotune_remote_cache jk: pytorch/remote_cache:bundled_autotune_remote_cache_version Test Plan: unit tests Manually tested w/ EMU: ``` cd fbcode/accelerators/workloads/models/emu_flash/v1p4 make build_benchmark_model && make save_model_to_path make test_pt2_latency ``` - on a cold run we got 0 hits and 40 misses. On a warm run it got 40 hits and 0 miss. - perf seems a little better - for 8 runs: - no bundled cache averaged 14m11s - bundled cache averaged 14m6s - 125ms saved per cache entry seems reasonable Cache Metrics for an sample run: no bundled cache: ``` INFO: Cache Metrics: FbMemcacheRemoteKernelCache: {hit: 2256, miss: 0, put: 0, exception: 0} FbRemoteAutotuneCache: {hit: 0, miss: 0, put: 7, exception: 0} FbRemoteFxGraphCache: {hit: 40, miss: 0, put: 0, exception: 0} LocalAutotuneCache: {hit: 878, miss: 0, put: 7, exception: 0} backend:MemcacheCache: {hit: 2256, miss: 0, put: 7, exception: 0} backend:_LocalAutotuneCacheBackend: {hit: 878, miss: 0, put: 7, exception: 0} backend:_ManifoldCache: {hit: 40, miss: 0, put: 0, exception: 0} ``` bundled cache: ``` INFO: Cache Metrics: FbMemcacheRemoteKernelCache: {hit: 2258, miss: 0, put: 0, exception: 0} FbRemoteAutotuneCache: {hit: 0, miss: 0, put: 8, exception: 0} FbRemoteBundledAutotuneCache: {hit: 40, miss: 0, put: 0, exception: 0} <<<<<< FbRemoteFxGraphCache: {hit: 40, miss: 0, put: 0, exception: 0} LocalAutotuneCache: {hit: 878, miss: 0, put: 886, exception: 0} backend:MemcacheCache: {hit: 2258, miss: 0, put: 8, exception: 0} backend:_LocalAutotuneCacheBackend: {hit: 878, miss: 0, put: 886, exception: 0} backend:_ManifoldCache: {hit: 80, miss: 0, put: 0, exception: 0} ``` Differential Revision: D64336043 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137902 Approved by: https://github.com/oulgen	2024-10-15 18:39:47 +00:00
Nikita Shulga	e4d7676c1b	[CPU] Expand `torch.special.i1` to Half and BF16 (#137899 ) To match behavior of `torch.special.i0` Noticed while looking at the failures in https://github.com/pytorch/pytorch/pull/137849 Also, add explicit high-precision template specialization for `calc_i0` and `calc_i1` for `BFloat16` and `Half` Pull Request resolved: https://github.com/pytorch/pytorch/pull/137899 Approved by: https://github.com/Skylion007	2024-10-15 17:00:58 +00:00
Tom Ritchford	e7a4ad3b40	Add decomposition for permute_copy (#130944 ) * Extracted from #129476 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130944 Approved by: https://github.com/amjames, https://github.com/eellison	2024-10-15 13:51:20 +00:00
ErezYosef	197601eeea	Add Support for Tracking Parameter Names (named_parameters) in Optimizer State Dict (#134107 ) A proposal addressing Issue #1489: Optimizer should track parameter names and not id. (also mentioned in here: [[RFC] Introducing FQNs/clarity eyeglasses to optim state_dict](https://dev-discuss.pytorch.org/t/rfc-introducing-fqns-clarity-to-optim-state-dict/1552) ## Summary This PR introduces a backward-compatible enhancement where optimizers track parameter names instead of just their id. Optimizers can be initialized with `named_parameters()` as: ```python optimizer = optim.SGD(model.named_parameters(), lr=0.01, momentum=0.9) ``` This allows for greater clarity and ease when handling optimizers, as the parameters' names are preserved within the optimizer’s `state_dict` as: ``` state_dict = { 'state': { 0: {'momentum_buffer': tensor(...), ...}, 1: {'momentum_buffer': tensor(...), ...}, }, 'param_groups': [ { 'lr': 0.01, 'weight_decay': 0, ... 'params': [0,1] 'param_names' ['layer.weight', 'layer.bias'] (optional) } ] } ``` Loading `state_dict` is not changed (backward-compatible) and the `param_names` key will be ignored. ## Key Features #### Named Parameters in Optimizer Initialization: Optimizers can accept the output of `model.named_parameters()` during initialization, allowing them to store parameter names directly. #### Parameter Names in `state_dict`: The parameter names are saved as a list in the optimizer’s `state_dict` with key `param_names`, alongside the `params` indices, ensuring seamless tracking of both names and parameters. ## Backward Compatibility #### No Breaking Changes: This change is fully backward-compatible. The added `param_names` key in the optimizer's `state_dict` is ignored when loading a state to the optimizer. #### Customization with Hooks: For more control, the loaded state_dict can be modified using a custom `register_load_state_dict_pre_hook`, providing flexibility for different design needs. ## Documentation Updates Please refer to the documentation changes for more details on how this feature is implemented and how it can be used effectively. ## Solution Example: A suggested solution to the problem mentioned in #1489, for the same parameters but in a different order. The following `register_load_state_dict_pre_hook` should be added to the optimizer before loading to enable loading the state dict : ```python def adapt_state_dict_ids(optimizer, state_dict): # assuming a single param group. current_state_group = optimizer.state_dict()['param_groups'][0] loaded_state_group = state_dict['param_groups'][0] # same number of params, same names, only different ordering current_state_name_to_id_mapping = {} # mapping -- param_name: id for i, name in enumerate(current_state_group['param_names']): current_state_name_to_id_mapping[name] = current_state_group['params'][i] # changing the ids of the loaded state dict to match the order of the given state dict. for i, name in enumerate(current_state_group['param_names']): loaded_state_group['params'][i] = current_state_name_to_id_mapping[name] return state_dict ``` In this code, the loaded `state_dict` ids are adapted to match the order of the current optimizer `state_dict`. Both the previous and the current optimizers are required to be initiated with `named_parameters()` to have the 'param_names' key in the dict. ### Note This is my first contribution to PyTorch, and I wish to receive feedback or suggestions for improvement. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134107 Approved by: https://github.com/janeyx99 Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>	2024-10-14 19:24:44 +00:00
iupaikov-amd	c09b567a91	Fixed error string assertion in test_invalid_devices (#137772 ) ROCm distribution returns different error string for this operation so the test fails this assertion. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137772 Approved by: https://github.com/Skylion007	2024-10-13 18:10:07 +00:00
Li, Xingyuan	0dbbcfa7ae	[Inductor UT] Generalize newly introduced inductor UTs for intel GPU (Part 3) (#136947 ) [Inductor UT] Generalize Newly introduced inductor UTs for intel GPU reuse `test/inductor/test_pattern_matcher.py` reuse `test/inductor/test_snode_runtime.py` reuse `test/inductor/test_unbacked_symints.py` fix `test/inductor/test_triton_kernels.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136947 Approved by: https://github.com/etaf, https://github.com/EikanWang, https://github.com/jansel	2024-10-12 13:21:20 +00:00
PyTorch MergeBot	b55ff476bd	Revert "[Distributed] Fix extra context on device 0 (#135273 )" This reverts commit `cdd8fa98c7`. Reverted https://github.com/pytorch/pytorch/pull/135273 on behalf of https://github.com/PaliC due to broken tests on trunk ([comment](https://github.com/pytorch/pytorch/pull/137161#issuecomment-2406236337))	2024-10-10 23:47:25 +00:00
Ke Wen	cdd8fa98c7	[Distributed] Fix extra context on device 0 (#135273 ) This PR contains multiple fixes for issue https://github.com/pytorch/pytorch/issues/135279: ## First part: Moves the GPU guard (`cudaSetDevice`) before the `currentStreamCaptureStatusMayInitCtx` call. As its name suggests, it May Init Ctx. ## Second part: Even with the above fix, additional contexts are still observed during Work object destruction, e.g. ``` work = dist.all_reduce(tensor, async_op=True) time.sleep(5) <-- no additional context yet del work <-- additional context shows up ``` ### Debug process Chasing it down to destruction of a `Future` object -- a member variable of `Work`. Then further down to the following member of `Future`: ``` std::vector<c10::Event> events_; ``` When the `events_` are destroyed, we hit the road down to: `1f3a793790/c10/cuda/impl/CUDAGuardImpl.h (L106-L121)` When there is no "preset" CUDA context (which is the case for python garbage collector), line 112: `c10::cuda::GetDevice(&orig_device)` will set `orig_device` to 0. Then, at line 120, `c10::cuda::SetDevice(orig_device)` will "officially" set the context to device 0 -- that's where rank 1, 2, ... can create extra context on device 0! ### Solution This PR adds an explicit destructor to `Future`. In this destructor, destroy each event with a device guard. ## Test Added test_extra_cuda_context, implemented via - `pynvml` (if available), or - memory consumption check. `python test/distributed/test_c10d_nccl.py -k test_extra_cuda_context` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135273 Approved by: https://github.com/fduwjj, https://github.com/wconstab, https://github.com/eqy ghstack dependencies: #137161	2024-10-10 17:16:34 +00:00
Joel Schlosser	3e2f276a14	Fix to() on non-contiguous NJTs (#137124 ) Called out via torchrec integration: `lengths` is not handled properly. Future work (not related to non-contiguous NJTs): #137275 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137124 Approved by: https://github.com/soulitzer ghstack dependencies: #137030, #137031	2024-10-08 15:11:05 +00:00
Wei Feng	14b4099521	[FSDP2] support torch._foreach_copy_(float8) for fully_shard(Float8Linear) (#135955 ) this PR unblocks unit test with single Float8Linear module. It fixes following error ``` torch._foreach_copy_(foreach_copy_dsts, all_gather_inputs) [rank0]:E0913 13:44:29.829000 2179476 torch/testing/_internal/common_distributed.py:671] RuntimeError: "foreach_tensor_copy" not implemented for 'Float8_e4m3fn' ``` Differential Revision: [D63961071](https://our.internmc.facebook.com/intern/diff/D63961071) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135955 Approved by: https://github.com/vkuzo, https://github.com/eqy	2024-10-07 16:36:31 +00:00
vasiliy	a063a82c8b	[redo] Fp8 support for item() with cuda, index_select, and fill_ cpu (#137341 ) Summary: Redo of https://github.com/pytorch/pytorch/pull/128780, easier to copy-paste. Test Plan: CI Reviewers: Subscribers: Tasks: Tags: Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/137341 Approved by: https://github.com/eqy	2024-10-07 00:58:51 +00:00
Siddharth Kotapati	e27c0048db	Enable additional tests for MPS CI runs (#134356 ) As part of the follow up for https://github.com/pytorch/pytorch/issues/133520, adapting existing unused tests for use in MPS CI runs. Focusing on nhwc & other memory formatting tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/134356 Approved by: https://github.com/malfet, https://github.com/eqy, https://github.com/huydhn	2024-10-04 21:52:38 +00:00
PyTorch MergeBot	cd17b2645c	Revert "[Distributed] Fix extra context on device 0 (#135273 )" This reverts commit `a93d3873e9`. Reverted https://github.com/pytorch/pytorch/pull/135273 on behalf of https://github.com/albanD due to Broken trunk distributed ci ([comment](https://github.com/pytorch/pytorch/pull/135273#issuecomment-2393772987))	2024-10-04 13:58:57 +00:00
Ke Wen	a93d3873e9	[Distributed] Fix extra context on device 0 (#135273 ) This PR contains multiple fixes for issue https://github.com/pytorch/pytorch/issues/135279: ## First part: Moves the GPU guard (`cudaSetDevice`) before the `currentStreamCaptureStatusMayInitCtx` call. As its name suggests, it May Init Ctx. ## Second part: Even with the above fix, additional contexts are still observed during Work object destruction, e.g. ``` work = dist.all_reduce(tensor, async_op=True) time.sleep(5) <-- no additional context yet del work <-- additional context shows up ``` ### Debug process Chasing it down to destruction of a `Future` object -- a member variable of `Work`. Then further down to the following member of `Future`: ``` std::vector<c10::Event> events_; ``` When the `events_` are destroyed, we hit the road down to: `1f3a793790/c10/cuda/impl/CUDAGuardImpl.h (L106-L121)` When there is no "preset" CUDA context (which is the case for python garbage collector), line 112: `c10::cuda::GetDevice(&orig_device)` will set `orig_device` to 0. Then, at line 120, `c10::cuda::SetDevice(orig_device)` will "officially" set the context to device 0 -- that's where rank 1, 2, ... can create extra context on device 0! ### Solution This PR adds an explicit destructor to `Future`. In this destructor, destroy each event with a device guard. ## Test Added test_extra_cuda_context, implemented via - `pynvml` (if available), or - memory consumption check. `python test/distributed/test_c10d_nccl.py -k test_extra_cuda_context` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135273 Approved by: https://github.com/fduwjj, https://github.com/wconstab, https://github.com/eqy	2024-10-04 00:44:02 +00:00
Shangdi Yu	c83178d894	Change to export_for_training in XNNPACK tests (#137238 ) Summary: as title Test Plan: CI Differential Revision: D63344674 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137238 Approved by: https://github.com/tugsbayasgalan	2024-10-03 21:28:05 +00:00
Simon Fan	b86269fab5	Unify cpp_extension build directory removal (#136059 ) Keeps existing default directory clearing logic, even though it fails when TORCH_EXTENSIONS_DIR is set. To properly clear, we'd need to track all the folders we compiled the extensions to. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136059 Approved by: https://github.com/ezyang, https://github.com/albanD	2024-10-03 06:22:11 +00:00
Ke Wen	7631a04081	[c10d] Fix the device query story of ProcessGroup (#136790 ) Function `_get_pg_default_device` is being used outside of `distributed_c10d.py`. A concern is that people may not be aware of what it actually does, due to bad naming of this function: `Return the device to use with ``group`` for control flow usage (object collectives, barrier).` The remediation is as follows: - Added a deprecation warning to `_get_pg_default_device`; - Added a private function `_get_object_coll_device` to undertake what it does; - Added a `_device_capability` function for users who want to query the device support of a PG -- it returns a plain list, no more "default" choice. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136790 Approved by: https://github.com/H-Huang	2024-10-03 01:36:22 +00:00
Joel Schlosser	6374a19a6e	Fix wrapper subclass serialization with custom sizes / strides (#137030 ) Fixes #130154 This PR takes the strategy outlined in the above issue and clears out any cached sizes / strides PyCapsules before serialization. This affects the default subclass serialization logic. The PyCapsule issue also affects `deepcopy`, so that's fixed here as well. Note: I originally tried utilizing a context manager to remove / restore cached PyCapsules after serialization, but in practice the state returned from `_reduce_ex_internal()` references the actual `tensor.__dict__()`, so the problem persists once the cached values are restored. Instead, we have to be careful to remove the cached values in the right place so they're not re-cached when pulling out size / stride information for serialization. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137030 Approved by: https://github.com/albanD	2024-10-02 18:55:03 +00:00
Benjamin Glass	f984b88718	Ensure noncontiguous tensor creation tests offsetting (#136396 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136396 Approved by: https://github.com/amjames, https://github.com/eellison ghstack dependencies: #136055	2024-10-02 00:40:43 +00:00
Jez Ng	99eb47fb6d	Add CI for Triton CPU backend (#135342 ) Where possible, I have marked failing tests (which we intend to fix or triage) as `@xfail_if_triton_cpu`. This will help us track progress of the Triton CPU backend over time. Tests that I don't think we need to address, or that are flaky, have been marked as skips. Successful CI run: https://github.com/pytorch/pytorch/actions/runs/10822238062/job/30028284549 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135342 Approved by: https://github.com/jansel, https://github.com/desertfire, https://github.com/malfet	2024-10-01 20:43:10 +00:00
Tom Ritchford	b85f21fc1d	Add decomposition for squeeze_copy (#130941 ) * Extracted from #128416 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130941 Approved by: https://github.com/amjames, https://github.com/eellison ghstack dependencies: #136653	2024-10-01 10:23:22 +00:00
Nikita Shulga	c610aa80dc	Testing: Unblock `new_*` testing on MPS (#137003 ) By changing `other_dtype` to `torch.half` rather than `double` in `sample_inputs_new_fns` if MPS is available Pull Request resolved: https://github.com/pytorch/pytorch/pull/137003 Approved by: https://github.com/Skylion007 ghstack dependencies: #136981, #136982, #136983, #136984, #136985, #136986	2024-09-30 19:06:12 +00:00
ankurneog	22a4129a76	Generalization of FSDP common for non-cuda execution (#133209 ) ## Motivation The FSDP common code for FSDP UT execution is mostly written with cuda device in mind. However other devices such the intel Gaudi supports most of the functionality. We are generalizing the base content so that the UT content can be used for non-cuda device execution. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133209 Approved by: https://github.com/kwen2501	2024-09-27 00:38:10 +00:00
Joel Schlosser	991f8f8ec3	Bias gradient calculation for NJT linear backward (#136660 ) Previously NYI - @mikaylagawarecki needs it for Transformers. Fixes #136652 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136660 Approved by: https://github.com/soulitzer	2024-09-26 21:38:10 +00:00
drisspg	840c6b7a68	[FlexAttention] Add Better error message for cpu tensors (#136673 ) Partially address: #136525 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136673 Approved by: https://github.com/Chillee	2024-09-26 16:40:21 +00:00
Howard Huang	141cae2eb8	[pipelining] Fix more leaks and check leaks in tests (#136584 ) Fix two more leaks of the same variety as #136507 (see that PR desc and attached gdoc for debug details). This time, also add a test-time check that helped to discover new leaks and ensure we won't accidently regress. Adds `check_tensor_leak` util which internally asserts no tensors are being kept alive by other objects involved in py ref cycles. Uses objgraph for a nice debug utility when a leak is found. Credit to @H-Huang for pointing out objdump and helping debug the 'param_group["intermediates"]` leak. I manually confirmed that all 3 of the leaks identified/fixed so far are caught by the unit test and checker. Sample output, if I re-introduce a leak by commenting out `del param_group["intermediates"]` in _backward.py, and run `python test/distributed/pipelining/test_schedule_multiproc.py -k test_schedule_with_native_zero_bubble`: ``` warnings.warn( /data/users/whc/pytorch/torch/testing/_internal/common_utils.py:5341: UserWarning: 34 tensors were found in the garbage. Did you introduce a reference cycle? warnings.warn( /data/users/whc/pytorch/torch/testing/_internal/common_utils.py:5347: UserWarning: Dumping first 1 objgraphs of leaked tensors rendered to png Graph written to /tmp/objgraph-ztz642h3.dot (19 nodes) Graph viewer (xdot) not found, generating a png instead Image generated as /tmp/objgraph-ztz642h3.png ``` rendering of ` /tmp/objgraph-ztz642h3.png`: <img width="1671" alt="image" src="https://github.com/user-attachments/assets/9098ff29-224c-4533-935b-83c210ac2e22"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136584 Approved by: https://github.com/kwen2501, https://github.com/H-Huang ghstack dependencies: #136507 Co-authored-by: Howard Huang <howardhuang@fb.com>	2024-09-26 01:10:40 +00:00
Huy Do	b7a5c7d331	Do not XFAIL test_segfault in fbcode (#136661 ) https://github.com/pytorch/pytorch/pull/136252 silence the failure on OSS, but the test actually passed on fbcode [T202241133](https://www.internalfb.com/intern/tasks/?t=202241133) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136661 Approved by: https://github.com/malfet	2024-09-25 22:26:24 +00:00
IvanKobzarev	370c1c4297	[aotd] Fix rrelu compilation (#136008 ) Issues: https://github.com/pytorch/pytorch/issues/135083 https://github.com/pytorch/pytorch/issues/120292 rrelu decomposition contains mutation, copy_. Decompositions are executed below Functionalization, as a result AOT produces non-functional graph. Also that decomposition is registered as python_dispatch kernel for AutogradCUDA. Autograd dispatch happens above Functionalization, so registering it for Autograd to handle all backends makes functionalization running after this. Testing: ``` python test/functorch/test_aotdispatch.py -k test_rrelu ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136008 Approved by: https://github.com/bdhirsh	2024-09-25 11:26:19 +00:00
Joel Schlosser	888744bd36	NJT binary pointwise broadcasting support via jagged <-> padded dense conversion (#133021 ) Related: #132695 This PR uses padded dense <-> jagged conversions to handle binary pointwise broadcasting of (NT, T) and (T, NT). This includes: * `(B, j0, D) + (1, 1, 1)` * `(B, j0, D) + (B, 1, 1)` * `(B, j0, D) + (B, 1, D)` * etc. This PR also adds (hacky) support for bool inputs to the jagged <-> padded dense conversions. The underlying CUDA kernels do not support integer / bool inputs; so the following workaround is employed: `convert input -> half, run conversion kernel, convert output -> bool`. Note that this bool support is needed specifically for the backward formula of `fmax`, and likely others. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133021 Approved by: https://github.com/cpuhrsch	2024-09-24 19:11:49 +00:00
ankurneog	efed357ef5	Add dtypes support in opinfo for Intel Gaudi (#132840 ) ## Motivation This is following up on changes introduced in https://github.com/pytorch/pytorch/pull/128584 we are adding the dtype information to be picked up while executing the UTs for Intel Gaudi/HPU Pull Request resolved: https://github.com/pytorch/pytorch/pull/132840 Approved by: https://github.com/albanD	2024-09-24 17:17:15 +00:00
gaopengff	33e10803c8	Fix ut in internal distributed_test.py (#136251 ) I have failed with test case of test_new_subgroups_by_enumeration_input_rank_exceeds_world_size, and passed with this small change. The expected exception is supposed to be "ValueError" rather than "RuntimeError" according to [code](https://github.com/pytorch/pytorch/blob/v2.4.1/torch/distributed/distributed_c10d.py#L4190). Pull Request resolved: https://github.com/pytorch/pytorch/pull/136251 Approved by: https://github.com/kwen2501	2024-09-24 15:06:20 +00:00
Joel Schlosser	83a3ee0699	Support embedding_bag() with NJT input (#135888 ) Fixes #93843 `EmbeddingBag()` / `embedding_bag()` support 1D inputs with offsets to handle raggedness. NJT is a natural fit here as it already maintains offsets of the same form. This PR updates the python-side to support NJT and adds corresponding OpInfo-based NJT tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135888 Approved by: https://github.com/cpuhrsch	2024-09-23 17:35:19 +00:00
Isuru Fernando	f276da7f98	Remove prims.slice_in_dim and prims.slice (#136150 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136150 Approved by: https://github.com/ezyang	2024-09-23 01:27:22 +00:00
Aaron Gokaslan	b6ffa381e1	[BE]: Add half CUDA support nextafter (#136373 ) Making CUDA support match CPU support for nextafter Pull Request resolved: https://github.com/pytorch/pytorch/pull/136373 Approved by: https://github.com/ezyang	2024-09-21 17:13:45 +00:00
Isuru Fernando	0c936c3ecb	Add decomps for max_unpool (#133146 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133146 Approved by: https://github.com/amjames, https://github.com/eellison	2024-09-20 21:35:25 +00:00
Jeff Daily	15dba021bb	[ROCm][CI] upgrade CI to ROCm 6.2 (#132555 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132555 Approved by: https://github.com/pruthvistony, https://github.com/malfet	2024-09-20 17:39:31 +00:00
Igor Sugak	bce52d0b60	[CODEMOD][caffe2] use npt.NDArray instead of np.ndarray in type annotations (#136288 ) Summary: To facilitate PSS-2 upgrade, this uses `ndt.NDArray` instead of `nd.ndarray` in type annotations. In Numpy-1.19 (PSS-1) it's an alias to `nd.ndarray` -- a noop. In Numpy-1.24, `ndt.NDArray` a proper generic type, and without this change uses of `nd.ndarray` generate this Pyre type error: ```counterexample Invalid type parameters [24]: Generic type `np.ndarray` expects 2 type parameters. ``` Test Plan: Sandcastle plus visual inspection Differential Revision: D62977370 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136288 Approved by: https://github.com/kit1980	2024-09-19 12:40:36 +00:00
Huy Do	db80b98ec4	XFAIL test_segfault (#136252 ) Fixes https://github.com/pytorch/pytorch/issues/128551 As this has been failing in trunk for a while and there is no owner yet to fix it properly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136252 Approved by: https://github.com/andrewkho	2024-09-19 04:17:06 +00:00
fduwjj	3efaa016b1	[c10d] Make test compatible for new pytest (#136158 ) Temporary fix to the issue in https://github.com/pytorch/pytorch/issues/127517. Short-term fix following CPython: `51aefc5bf9/Lib/unittest/case.py (L419-L426)` Differential Revision: [D62878083](https://our.internmc.facebook.com/intern/diff/D62878083) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136158 Approved by: https://github.com/fegin	2024-09-18 17:10:55 +00:00
CaoE	6a6f5b20c5	Add _addmm_activation to lower precision cast policy on AutocastCPU (#135936 ) Fixes #132613. Add `_addmm_activation` to lower precision cast policy on AutocastCPU. `_addmm_activation` https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/transformers/transformer.cpp#L39 of `transformer_encoder_layer_forward` may throw `RuntimeError: mat1 and mat2 must have the same dtype, but got BFloat16 and Float` when autocast is enabled, as `_native_multi_head_attention` is put in lower data type cast policy https://github.com/pytorch/pytorch/pull/107674 and `_addmm_activation` may encounter mixed data types. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135936 Approved by: https://github.com/jgong5, https://github.com/ezyang	2024-09-18 16:31:27 +00:00
Prachi Gupta	b5be4d8c05	Fix ROCm skip decorator for test_ddp_tp and multiprocess UTs (#136161 ) skip_if_rocm is used only in multiprocess case (when UT test class is a child of MultiProcessTestCase). Each individual process can exit with a skip code. If used for single process UT, it will cause the UT to fail as the process returns a non-zero exit code. Use skipIfRocm in single process UTs. To avoid the above confusion, this PR renamed skip_if_rocm to skip_if_rocm_multiprocess. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/136161 Approved by: https://github.com/jithunnair-amd, https://github.com/kwen2501, https://github.com/fegin	2024-09-18 11:01:23 +00:00
Joel Schlosser	a8382847f4	Support rms_norm() for NJT (#135872 ) `rms_norm()` is a nice-to-have for ViT :) This PR: * SymInt-ifies `rms_norm()`, allowing NJT to use the same decomp. * Adds torch_function-based input validation logic for nested-specific stuff (no normalization supported over the ragged dim for now) on the python NJT side. * Adds multi-dim support (on non-ragged, non-batch dims) to `mean()` for NJT. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135872 Approved by: https://github.com/mikaylagawarecki ghstack dependencies: #125947	2024-09-17 18:09:20 +00:00
PyTorch MergeBot	462b727d1e	Revert "Add decomposition for permute_copy (#130944 )" This reverts commit `ab9a7eadd3`. Reverted https://github.com/pytorch/pytorch/pull/130944 on behalf of https://github.com/jeanschmidt due to Broke internal signal executorch.backends.xnnpack.test.ops.permute.TestPermute, more details on D62737086. @eellison could you please help get this PR merged to main? ([comment](https://github.com/pytorch/pytorch/pull/130944#issuecomment-2355846394))	2024-09-17 13:42:55 +00:00
PyTorch MergeBot	2c4ae81494	Revert "Add decomposition for squeeze_copy (#130941 )" This reverts commit `c33b0580e6`. Reverted https://github.com/pytorch/pytorch/pull/130941 on behalf of https://github.com/jeanschmidt due to Need to revert in order to be able to revert https://github.com/pytorch/pytorch/pull/130944, after fixing any merge conflicts, feel free to merge it back ([comment](https://github.com/pytorch/pytorch/pull/130941#issuecomment-2355831480))	2024-09-17 13:39:07 +00:00
PyTorch MergeBot	7fe004f7cf	Revert "Add CI for Triton CPU backend (#135342 )" This reverts commit `426580a67d`. Reverted https://github.com/pytorch/pytorch/pull/135342 on behalf of https://github.com/jeanschmidt due to Broke internal signals, see D62737208 for more details ([comment](https://github.com/pytorch/pytorch/pull/133408#issuecomment-2353623816))	2024-09-16 18:33:33 +00:00
Aaron Gokaslan	23c0d2689e	[BE][Ez]: Fix missing float16 coverage for adaptive_pool3d_cpu (#136091 ) Testing if op info coverage has issues Pull Request resolved: https://github.com/pytorch/pytorch/pull/136091 Approved by: https://github.com/ezyang	2024-09-16 18:22:16 +00:00
Aaron Gokaslan	b491e2974c	[BE][Ez]: Add full half/bfloat16 dtype for `unique` and `isin` (#136114 ) Fixes #136090 * Add support for isin to tensor half dtypes for CPU (just add a few extra dispatches). * Seems like the CUDA implementation for bfloat16 was mostly compiled and available all along (it just calls sort internally AND unique). To enable it, we just need to remove an assert to access it (since sort's functionality was updated since the assert was added) and add missing dtype support to unique. * This unlocks more GPU functionality with minimal code bloat. I also added CPU kernels for the dtypes for parity. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136114 Approved by: https://github.com/malfet	2024-09-16 17:49:12 +00:00
Tom Ritchford	c33b0580e6	Add decomposition for squeeze_copy (#130941 ) * Extracted from #128416 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130941 Approved by: https://github.com/amjames, https://github.com/eellison	2024-09-16 15:46:57 +00:00
Tom Ritchford	ab9a7eadd3	Add decomposition for permute_copy (#130944 ) * Extracted from #129476 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130944 Approved by: https://github.com/amjames, https://github.com/eellison	2024-09-15 19:35:14 +00:00
Jez Ng	426580a67d	Add CI for Triton CPU backend (#135342 ) Where possible, I have marked failing tests (which we intend to fix or triage) as `@xfail_if_triton_cpu`. This will help us track progress of the Triton CPU backend over time. Tests that I don't think we need to address, or that are flaky, have been marked as skips. Successful CI run: https://github.com/pytorch/pytorch/actions/runs/10822238062/job/30028284549 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135342 Approved by: https://github.com/jansel ghstack dependencies: #133408	2024-09-14 21:45:19 +00:00
CaoE	db393fb95e	Add Half support for reflection and replication padding on CPU (#135931 ) Fixes #135680 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135931 Approved by: https://github.com/Skylion007	2024-09-14 14:18:55 +00:00
PyTorch MergeBot	1786a17fed	Revert "Use _amp_foreach_non_finite_check_and_unscale_ for CPU grads of ShardedGradScaler (#135232 )" This reverts commit `51c5206133`. Reverted https://github.com/pytorch/pytorch/pull/135232 on behalf of https://github.com/CaoE due to wrong commit ([comment](https://github.com/pytorch/pytorch/pull/135232#issuecomment-2350792806))	2024-09-14 02:31:06 +00:00
CaoE	51c5206133	Use _amp_foreach_non_finite_check_and_unscale_ for CPU grads of ShardedGradScaler (#135232 ) Use `_amp_foreach_non_finite_check_and_unscale_` instead of fallback version for CPU grads of `ShardedGradScaler ` as `_amp_foreach_non_finite_check_and_unscale_ ` is supported on CPU https://github.com/pytorch/pytorch/pull/109281. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135232 Approved by: https://github.com/ezyang	2024-09-14 02:20:58 +00:00
PyTorch MergeBot	b6d6aa49b8	Revert "Validate input types for `torch.nn.Linear` and `torch.nn.Bilinear` (#135596 )" This reverts commit `e157ce3ebb`. Reverted https://github.com/pytorch/pytorch/pull/135596 on behalf of https://github.com/malfet due to It's too restrictive, should allow other int-like types, such as `numpy.int64` ([comment](https://github.com/pytorch/pytorch/pull/135596#issuecomment-2349714104))	2024-09-13 18:06:56 +00:00
Sanskar Modi	e157ce3ebb	Validate input types for `torch.nn.Linear` and `torch.nn.Bilinear` (#135596 ) Adding validation checks to check the input types and display better error messages for the same. Fixes https://github.com/pytorch/pytorch/issues/135463 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135596 Approved by: https://github.com/malfet	2024-09-12 21:28:37 +00:00
Mayank Mishra	9a04cfbeff	fix for fp16 (#134106 ) This PR is a replacement for https://github.com/pytorch/pytorch/pull/133085 for pushing a quick fix for RMSNorm. The original author is @kkontny Previous PR summary: Since FP16 has quite small dynamic range it is very easy to overflow while computing `at::pow(input, 2)` , and it happens in real world computation. I've tried to use `nn.RMSNorm` fused implementation instead of `LlamaRMSNorm` inside `transformers` implementation of Llama (`src/transformers/models/llama/modeling_llama.py`). It started to give wrong answers in Fp16 while still giving good in FP32. I figured out happens due to overflow while computing square of the input tensor. Original `LLamaRMSNorm` implementation upcasts input to fp32 to prevent this and give better numerical stability. ``` class LlamaRMSNorm(nn.Module): def __init__(self, hidden_size, eps=1e-6): """ LlamaRMSNorm is equivalent to T5LayerNorm """ super().__init__() self.weight = nn.Parameter(torch.ones(hidden_size)) self.variance_epsilon = eps def forward(self, hidden_states): input_dtype = hidden_states.dtype hidden_states = hidden_states.to(torch.float32) variance = hidden_states.pow(2).mean(-1, keepdim=True) hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon) return self.weight * hidden_states.to(input_dtype) ``` Proposed commit fixed the issue. FP16 in RMSNorm has to be treated in special way, to be usable in real world implementations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134106 Approved by: https://github.com/mikaylagawarecki, https://github.com/eqy	2024-09-11 22:02:07 +00:00
Xinya Zhang	74fd1bf965	[ROCm] Update to AOTriton 0.7b (#134498 ) Notable changes: 1. Enable CudaGraph related tests 2. Fix UT problems 3. EXPERIMENTAL Navi31 support. User should enable Navi31 support with Env Var `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1` Know Problem: 1. `test/test_transformers.py` will massive failures and/or NaN outputs with `--use-pytest` + Update: Confirmed skip `class TestSDPAPrivateUse1Only` can fix the problem with `--use-pytest` Note: AOTriton 0.7b adds support to nestedtenosrs+SDPA but need more work (and consequently a separate PR) to enable it. Fixes #133540 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134498 Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily, https://github.com/malfet	2024-09-11 20:34:01 +00:00
Tom Ritchford	e05ea2b179	Add decomposition for transpose_copy (#130943 ) * Extracted from #128416 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130943 Approved by: https://github.com/amjames, https://github.com/eellison	2024-09-11 19:45:22 +00:00
Chien-Chin Huang	1d9fefff19	[DCP] Fixes the stateless optimizer issue of distributed state_dict (#135535 ) Some optimizers don't have states that can cause get_state_dict/set_state_dict behave incorrectly. This PR fixes the issues. fixes: https://github.com/pytorch/pytorch/issues/133415 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135535 Approved by: https://github.com/wz337	2024-09-10 03:10:00 +00:00
Wanchao Liang	cfc227ad43	[reland][dtensor] move DTensor to public namespace (#134203 ) reland of https://github.com/pytorch/pytorch/pull/133113 I have to create a new PR because the previous reverted PR could not either be rebased, or imported successfully :( ---- Moving DTensor to be in the public namespace, to formally add the documentation page that includes all the public APIs. This includes: * many path renames and path import fixes * a dedicated doc page without too much content yet (adding in the next PRs) * To preserve the BC for users still using the torch.distributed._tensor, I added a shim script to redirect old path calls to the new module The BC preserving is evidented by the fact that all DTensor tests are still working without changing the public imports. So it's safe to land the changes Pull Request resolved: https://github.com/pytorch/pytorch/pull/134203 Approved by: https://github.com/tianyu-l	2024-09-08 17:08:40 +00:00
Nowtryz	a15aabc975	Add MaskedTensor passthrough: unfold, F.Unfold, F.Fold, stack (#125262 ) Hi, I noticed the `unfold` operator was missing on MaskedTensor. I tested that my change works when calling unfold and backward on a `MaskedTensor` but I didn't find the tests for the dispatch of such operation. Where is it? Pull Request resolved: https://github.com/pytorch/pytorch/pull/125262 Approved by: https://github.com/cpuhrsch	2024-09-06 19:06:23 +00:00
Will Feng	84ae6b7d6b	AOTDispatcher: limit cases when we detach() graph inputs to non-leaves (#134193 ) This PR is slightly a revival / update to the discussion from https://github.com/pytorch/pytorch/pull/98960: Part of FSDP2's tracing strategy right now is that: (1) it is painful/difficult to handle the case where we have multiple graph input tensors that are aliased to each other and at least one of them is duplicated (2) we already have longstanding in logic to remove duplicate input tensors from the graph in dynamo. Morally, FSDP2 gives us duplicate input tensors in the backward graph for every `unsharded_param`, because we have (a) the `unsharded_param` being closed over by the backward hook to resize/allgather, and (b) the same `unsharded_param` being saved for backward by autograd (we now guarantee in the partitioner that we will always save the base tensor for backward and recompute views) (3) However, we were still seeing cases where the `unsharded_param` showed up twice in the backward graph inputs, as distinct tensor objects (with different python ids) instead of being true duplicates that dynamo can de-dup. It turns on that this was because we were `.detach()`ing the `unsharded_param` in AOTDispatcher before plumbing it through the compiled forward (and so autograd would save a detach'd version of the `unsharded_param`). This is precisely because of the logic from https://github.com/pytorch/pytorch/pull/98960. However, re-reading the detailed comments, it seems unnecessary to do a detach() on a graph input that is a (leaf) `nn.Parameter`, even if it happens to get no gradients in the backward. Since it is a leaf, we don't have to worry about the autograd engine "continuing to backprop through the graph beyond the current tensor" (the leaf has no other grad_fn for autograd to backprop through). So this PR makes us a bit less aggressive about calling detach() on inputs: we only do it when: (1) our graph input statically will get a `None` gradient (and also has no metadata mutations, the existing state) (2) and our graph input is a non-leaf tensor (so detach()ing is actually required to prevent autograd from incorrectly backpropping past the non-leaf. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134193 Approved by: https://github.com/yf225 Co-authored-by: Will Feng <yf225@cornell.edu>	2024-09-06 14:06:48 +00:00
Shangdi Yu	590a3e9f8a	[export][training ir migration] quantized_decomposed.quantize_per_tensor decomposition (#134525 ) Summary: In graph of TestXNNPACKQuantizer.test_dynamic_linear_with_con test, some quantized_decomposed.quantize_per_tensor.default ops are becoming quantized_decomposed.dequantize_per_tensor.tensor ops when using the new training ir. This is because we lift params/buffers before calling make_fx. So previously, for the graph that’s passed to make_fx,`graph.L__self___linear1.weight` is a tensor now in training ir, graph.L__self___linear1.weight is a FakeTensor. This caused the node overload to be different. Test Plan: ``` buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_dynamic_linear_with_conv ``` Differential Revision: D61364547 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134525 Approved by: https://github.com/tugsbayasgalan, https://github.com/jerryzh168	2024-09-06 07:06:06 +00:00
Avik Chaudhuri	43f4947d44	fix fake tensor tolist implementation (#135131 ) Summary: When exporting for training with `tolist`, we do not hit `FunctionalTensor.tolist` since we do not functionalize. Unfortunately, this means we hit `FakeTensor.tolist`, which creates unbacked symints that are not backed by proxies. Rather than trying to patch up this low-level implementation, we replace it with essentially what `FunctionalTensor.tolist` does, which is higher-level: we essentially desugar to `item()` calls and let it take care of unbacked symints. Test Plan: Some expected failures are gone now. Also found a test for `tolist` that was written when `FunctionalTensor.tolist` was implemented but not really doing much; repurposed it now to exercise more modes. Differential Revision: D62197742 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135131 Approved by: https://github.com/ezyang	2024-09-05 23:20:31 +00:00
Henry Tsang	bb3c2408f4	[inductor][test] in test_unbacked_symints, replace inductor's skipCUDAIf with common device type's skipcudaif (#133936 ) Differential Revision: D61506212 Use `skipCUDAIf` from `torch.testing._internal.common_device_type` if we create the test class with `instantiate_device_type_tests`. `instantiate_device_type_tests` would make sure the class has attr device_type, which works with`skipCUDAIf` from `torch.testing._internal.common_device_type`. Also skipping test_vertical_pointwise_reduction_fusion for cpu test class, since the test expects cuda. FAILED [0.0026s] test/inductor/test_unbacked_symints.py::TestUnbackedSymintsCPU::test_vertical_pointwise_reduction_fusion_cpu - AttributeError: 'TestUnbackedSymintsCPU' object has no attribute 'device' repro: ``` CUDA_VISIBLE_DEVICES="" pytest test/inductor/test_unbacked_symints.py -k cpu -v ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/133936 Approved by: https://github.com/ColinPeppler, https://github.com/desertfire	2024-09-05 16:40:14 +00:00
eqy	4f70b3cfae	[CUDA][complex][TF32] Update `test_noncontiguous_samples` tolerances for `complex64` (#134526 ) Recent cuDNN heuristics change surfaces same TF32 issue as `float32` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134526 Approved by: https://github.com/ezyang	2024-09-04 23:37:16 +00:00
FFFrog	80a6d60829	Moving _run_autocast_outofplace to basic class named TestAutocast to reduce redundance (#134460 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134460 Approved by: https://github.com/EikanWang, https://github.com/ezyang	2024-09-04 10:48:58 +00:00
Nikita Shulga	71383dd3da	[MPS] Fix bachnorm_2d for channels last (#134618 ) By skipping gather of input tensor if memory_layout is channels_last, which is a first step towards fixing https://github.com/pytorch/pytorch/issues/134580 Though underlying problem is much more interesting, i.e. MPS does not have a generic support for channels last, but `c10::is_contiguoius()` is true for channels last layout. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134618 Approved by: https://github.com/albanD	2024-09-03 19:20:11 +00:00
Tobias Ringwald	758d787901	Added complex support for `torch.logsumexp` (#133187 ) Added complex support for `torch.logsumexp`. Implemented complex backward pass for `torch.logsumexp`. Fixes #133047 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133187 Approved by: https://github.com/amjames, https://github.com/lezcano	2024-09-03 17:28:36 +00:00
IvanKobzarev	33ba952e31	[subclasses] Do not fakeTensor const prop subclass args (#134855 ) The issue: Const propagation checks only if arguments do not have FakeTensor. If argument is Subclass, it will pass this condition. As a result Const Propogation execution happens without FakeTensorMode and having tensor factories inside Subclass.__torch_dispatch__ results that this Tensor is not Fakified. Solution: If we have subclasses arguments, do not count that const propagation is doable Pull Request resolved: https://github.com/pytorch/pytorch/pull/134855 Approved by: https://github.com/zou3519	2024-09-03 13:31:49 +00:00
chilli	6fce1faa10	change multinomial to use async asserts instead of a synchronization (#134818 ) Fixes https://github.com/pytorch/pytorch/issues/134442 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134818 Approved by: https://github.com/ezyang ghstack dependencies: #134813	2024-09-03 06:33:24 +00:00
Nikita Shulga	f95085fd91	[BE][MPS] Prefer xfail to skip (#134858 ) This essentially undoes large skips on everything but MacOS Sequoia to nn.modules made by https://github.com/pytorch/pytorch/pull/128393 Instead it uses existing `xfail`, but guards it on `_macos15_or_newer` boolean Before the change if run on MacOS 14: ``` % python3 ../test/test_modules.py -v -k Hardswish 2>&1\|tail -n3 Ran 57 tests in 0.053s OK (skipped=32) ``` After ``` % python3 ../test/test_modules.py -v -k Hardswish 2>&1\|tail -n3 Ran 57 tests in 0.229s OK (skipped=10, expected failures=2) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134858 Approved by: https://github.com/janeyx99	2024-08-31 00:29:48 +00:00
Masaki Kozuki	e21d7b77ce	Update `ForeachfuncInfo.sample_inputs_func` to yield scalars & scalarlists that are more friendly to test_meta (#134552 ) for `test_meta.py` to see more "PASSED" instead of "XFAIL". `pytest test_meta.py -k "_foreach_"` ran 6400 test cases and: - This PR: 4702 passed, 260 skipped, 73732 deselected, 1698 xfailed - main (`92c4771853`): 3906 passed, 260 skipped, 73732 deselected, 2494 xfailed Pull Request resolved: https://github.com/pytorch/pytorch/pull/134552 Approved by: https://github.com/janeyx99	2024-08-30 17:30:50 +00:00
PyTorch MergeBot	f997b2b8e6	Revert "Add MaskedTensor passthrough: unfold, F.Unfold, F.Fold, stack (#125262 )" This reverts commit `f685018ea9`. Reverted https://github.com/pytorch/pytorch/pull/125262 on behalf of https://github.com/ZainRizvi due to Hi, this PR appears to be calling maskedtensor tests to fail on main. Please rebase your changes onto the latest trunk build to repro the failure. test_maskedtensor.py::TestOperatorsCUDA::test_like_empty_like_layout1_cuda_bool [GH job link](https://github.com/pytorch/pytorch/actions/runs/10604716811/job/29393256312) [HUD commit link](`f685018ea9`) ([comment](https://github.com/pytorch/pytorch/pull/125262#issuecomment-2316387447))	2024-08-28 23:10:07 +00:00
Nowtryz	f685018ea9	Add MaskedTensor passthrough: unfold, F.Unfold, F.Fold, stack (#125262 ) Hi, I noticed the `unfold` operator was missing on MaskedTensor. I tested that my change works when calling unfold and backward on a `MaskedTensor` but I didn't find the tests for the dispatch of such operation. Where is it? Pull Request resolved: https://github.com/pytorch/pytorch/pull/125262 Approved by: https://github.com/cpuhrsch	2024-08-28 21:30:39 +00:00
Chien-Lin Chen	40de63be09	parameterized test_graph_optims and test_graph_scaling_fused_optimizers (#133749 ) Fixes #123451 This is a rework of a reverted pull request, https://github.com/pytorch/pytorch/pull/125127. The test failure is fixed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133749 Approved by: https://github.com/janeyx99	2024-08-28 16:34:06 +00:00
David Berard	289486d007	Move attention kernels back from fake_impls to meta_registrations (#134288 ) See #121528 for additional context. In #120682, we moved the attention kernels from meta_registrations to fake_impls with the intent of fixing the device handling for seed/offset: these are typically on CPU. We needed to put the registrations in fake_impls to do this because meta_registrations doesn't have a way to specify device, whereas fake_impls does. But when we tried to actually fix the device types (#120839), we had to revert the PR because it broke cudagraph handling (during which seed/offset _are_ on CUDA). Now, we want to put the registrations back in meta_registrations so that we can call these kernels with meta tensors. The use case is later in this stack - we want to be able to use the flop counter with these kernels. Also - I specifically skip the `compare_tensor_meta()` check in test_fake / test_fake_autocast tests for the `_efficient_attention_forward` and `_flash_attention_forward` kernels, which fails because of the device mismatch from the seed/offset tensors. Then we can un-skip these opinfos. I verified that the efficient_attention_forward bug (#120842) is now caught by these opinfos if I revert the fix from this PR. Differential Revision: [D61687369](https://our.internmc.facebook.com/intern/diff/D61687369) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134288 Approved by: https://github.com/drisspg	2024-08-27 21:10:36 +00:00
Bo Li	16b8146c9e	Exclude test_transformers and unit tests which require recent GPU arch (#132895 ) This PR is to exclude test_transformers on ROCm temporarily and skip some unit tests which require recent GPU arch. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132895 Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/malfet	2024-08-27 20:40:53 +00:00
Nikita Shulga	8de0d7690c	Use newer `toAccumulateType` signature in `Normalization.cpp` (#134540 ) Which fixes BatchNorm behavior for if called with empty tensors on MPS backed. Removed `expectedFailureMPS` in test_nn.py, deleted expected failure in `test_mps.py` and adjusted `skipIfMPS` to `expectedFailureMPS` in BatchNorm2d OpInfo decorator, but restrict it only to the memory format tests Test Plan: CI + `python3 -c "import torch; print(torch.nn.BatchNorm2d(3, device='mps')(torch.rand(0, 3, 2, 2, device='mps')))"` Fixes https://github.com/pytorch/pytorch/issues/134423 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134540 Approved by: https://github.com/Skylion007, https://github.com/albanD	2024-08-27 18:09:20 +00:00
Colin Peppler	13049cd6e5	[aotinductor][UserDefinedTritonKernel] fix case with non-constexpr params declared after autotuned params (#134520 ) ## Context In some user Triton kernels, we have this set-up for whatever reason. ``` @triton.jit def mykernel( param0, param1, param2, param3: tl.constexpr, # autotuned param4, # non-constexpr ): ... ``` This is an edge case because it's a general practice to declare all constexprs params at the end. And this will be an issue for AOTI because it fails to codegen all 4 params. That will surface as a device-side error: CUDA IMA, invalid argument... ``` > void* kernel_args_var_0[] = {&var_0, &var_1, &var_2}; --- < CUdeviceptr var_3; < AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_get_data_ptr(buf0, reinterpret_cast<void*>(&var_3))); < void kernel_args_var_0[] = {&var_0, &var_1, &var_2, &var_3}; ``` ## Root-cause * `kernel.constexpr` from the Kernel side-table contains the indices for all `constexpr` params that includes autotuned params. * `raw_args`, that gets passed to wrapper codegen, excludes autotuned args. * In the wrapper codegen, we try to find non-constexpr args using `kernel.constexpr` & `raw_args`. This is okay unless there's a `raw_arg` after an autotuned param in the function signature. `79b7fff188/torch/_inductor/codegen/cpp_wrapper_cuda.py (L118-L126)` ## Fix We try to fix this, by calculating the right constexprs wrt `raw_args`. An illustration ``` raw_args: [arg0, arg1, arg2, arg4] kernel.arg_names: [param0, param1, param2, param3, param4] kernel.constexprs: [3] # param3 is autotuned; this is correct wrt kernel.arg_names constexpr_indices: [] # this is correct wrt raw_args ``` Differential Revision: [D61831625](https://our.internmc.facebook.com/intern/diff/D61831625) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134520 Approved by: https://github.com/oulgen	2024-08-27 17:20:27 +00:00
Mikayla Gawarecki	d028b810fe	Fix flaky GroupNorm ModuleInfo test (#133899 ) Fixes https://github.com/pytorch/pytorch/issues/98677 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133899 Approved by: https://github.com/albanD	2024-08-27 14:45:51 +00:00
wizzniu	3515090006	Fix TypeError when itering NoneType in instantiate_device_type_tests() (#134457 ) Fixes #134454 Fix TypeError introduced by https://github.com/pytorch/pytorch/pull/133082, which uses iter for NoneType of default args ``except_for`` and ``only_for``. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134457 Approved by: https://github.com/shink, https://github.com/albanD	2024-08-27 07:13:36 +00:00
Roy Hvaara	1565940114	[MPS] Add `test/test_nn.py` to test suite (#134184 ) This PR increases test coverage by including the tests in `test/test_nn.py` in the test suite of MPS. Some of the tests are decorated with `@expectedFailureMPS` for various reasons. Either that the op is not implemented, or that the outputs do not align. Those tests that contain differing results should be investigated further to rule out any live bugs. ```bash $ python test/run_test.py --mps --verbose -k TestNN Running test batch 'tests to run' cost 84.76 seconds ``` Ref #133520 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134184 Approved by: https://github.com/albanD, https://github.com/malfet	2024-08-26 23:48:23 +00:00
Benjamin Glass	55236d0cb7	TestForeach::test_parity: Remove check for error message text (#134251 ) Previously, error messages were expected to be string equivalent to error messages thrown by the ref function. This check fails for dozens of torch functions, and doesn't appear to add much value for the end user. This commit removes this check. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134251 Approved by: https://github.com/amjames, https://github.com/janeyx99 ghstack dependencies: #134253, #134344	2024-08-26 22:40:54 +00:00
Benjamin Glass	ef8c474fcf	Add the fast path for bfloat16 lgamma (#134344 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134344 Approved by: https://github.com/amjames, https://github.com/janeyx99 ghstack dependencies: #134253	2024-08-26 22:40:54 +00:00
Benjamin Glass	3c5883e550	Fix test_parity xfail for sigmoid (#134253 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134253 Approved by: https://github.com/amjames, https://github.com/janeyx99	2024-08-26 22:40:54 +00:00
Mengwei Liu	a0e062c6f1	Add mean.dtype_out (#133506 ) Give it a try and see if CI is happy Pull Request resolved: https://github.com/pytorch/pytorch/pull/133506 Approved by: https://github.com/bdhirsh	2024-08-26 19:26:11 +00:00
Wuxun Zhang	1d231ff8ba	[HOO] add hints_wrapper to support passing context hints (#132860 ) Fixes #126393 The implementation code is based on feedback here (https://github.com/pytorch/pytorch/pull/121639#issuecomment-2223948842). Hints are passed as kwargs of hints_wrapper op. It also supports nested hints. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132860 Approved by: https://github.com/ydwu4, https://github.com/zou3519	2024-08-26 18:21:22 +00:00
Benjamin Glass	27d97b9649	Remove unnecessary test skip (#134250 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134250 Approved by: https://github.com/amjames, https://github.com/janeyx99	2024-08-26 14:34:53 +00:00
Laith Sakka	7b6b10417d	Remove ansi escape chars in assertExpectedInline and add options to skip comments and to skip empty lines (#134248 ) I had a night mare rewriting tests in test_misc.py specifically : 1. graphs can have comments that refers to my files "/lsakka/.." we really dont care about comments add option to ignore comments. 2. empty lines added when EXPECTTEST_ACCEPT=1 are changed with linter causing tests to fail or linter fail! add flag to ignore empty lines. 3. EXPECTTEST_ACCEPT fails when the text have some not readable characters. those should not effect comparing strings, also those causes weird diffs comments when tests fails. I removed ansi_escape chars https://github.com/pytorch/pytorch/pull/133045 this is used in Pull Request resolved: https://github.com/pytorch/pytorch/pull/134248 Approved by: https://github.com/aorenste ghstack dependencies: #133639, #134364	2024-08-26 02:03:44 +00:00
Xu Han	1034f456ef	[inductor] fix munge_exc not support windows path (#134348 ) Windows file path use `\` as delimiter, it is also a escape character. We need translate all path `\` to `/`. which like Linux. Reproduce UT: ```cmd pytest test\dynamo\test_higher_order_ops.py -v -k test_vmap_grad_vmap_guard_fail ``` Error msg: ```cmd ________________________________________________________________________________________________________ HigherOrderOpVmapGuardTests.test_vmap_grad_vmap_guard_fail _________________________________________________________________________________________________________ Traceback (most recent call last): File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\logging_utils.py", line 89, in test_fn fn(self, records) File "D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_higher_order_ops.py", line 2714, in test_vmap_grad_vmap_guard_fail munge_exc(record.getMessage()), File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_utils.py", line 5252, in munge_exc s = re.sub(file, os.path.basename(file), s) File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\re.py", line 209, in sub return _compile(pattern, flags).sub(repl, string, count) File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\re.py", line 303, in _compile p = sre_compile.compile(pattern, flags) File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_compile.py", line 788, in compile p = sre_parse.parse(p, flags) File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_parse.py", line 955, in parse p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0) File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_parse.py", line 444, in _parse_sub itemsappend(_parse(source, state, verbose, nested + 1, File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_parse.py", line 526, in _parse code = _escape(source, this, state) File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_parse.py", line 370, in _escape raise source.error("incomplete escape %s" % escape, len(escape)) re.error: incomplete escape \x at position 2 To execute this test, run the following from the base repo dir: python test\dynamo\test_higher_order_ops.py HigherOrderOpVmapGuardTests.test_vmap_grad_vmap_guard_fail This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 --------------------------------------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------------------------------------- frames [('total', 2), ('ok', 2)] inductor [] inline_call [] stats [('calls_captured', 38), ('unique_graphs', 2)] --------------------------------------------------------------------------------------------------------------------------- Captured stderr call ---------------------------------------------------------------------------------------------------------------------------- V0824 01:29:00.148000 27840 torch\_dynamo\guards.py:2787] [0/1] [__recompiles] Recompiling function fn in D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_higher_order_ops.py:2699 V0824 01:29:00.148000 27840 torch\_dynamo\guards.py:2787] [0/1] [__recompiles] triggered by the following guard failure(s): V0824 01:29:00.148000 27840 torch\_dynamo\guards.py:2787] [0/1] [__recompiles] - 0/0: torch._functorch.pyfunctorch.compare_functorch_state([('Vmap', 1, 'error')]) # _dynamo\output_graph.py:479 in init_ambient_guards ========================================================================================================================== short test summary info ========================================================================================================================== FAILED [0.7452s] test/dynamo/test_higher_order_ops.py::HigherOrderOpVmapGuardTests::test_vmap_grad_vmap_guard_fail - re.error: incomplete escape \x at position 2 ``` Local test passed: <img width="860" alt="image" src="https://github.com/user-attachments/assets/90f0d780-0639-4c03-8d7c-6f227c93a3fc"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/134348 Approved by: https://github.com/jansel	2024-08-24 05:51:35 +00:00
Aaron Gokaslan	83b5d449a3	Add full float16/bfloat16 support to MaxUnPool (#133774 ) It already supported half so might as well add bfloat16 support for parity Pull Request resolved: https://github.com/pytorch/pytorch/pull/133774 Approved by: https://github.com/eqy, https://github.com/ezyang	2024-08-22 13:34:43 +00:00
Zitong Zhan	90c821814e	SparseCsrCUDA: cuDSS backend for linalg.solve (#129856 ) This PR switches to cuDSS library and has the same purpose of #127692, which is to add Sparse CSR tensor support to linalg.solve. Fixes #69538 Minimum example of usage: ``` import torch if __name__ == '__main__': spd = torch.rand(4, 3) A = spd.T @ spd b = torch.rand(3).to(torch.float64).cuda() A = A.to_sparse_csr().to(torch.float64).cuda() x = torch.linalg.solve(A, b) print((A @ x - b).norm()) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129856 Approved by: https://github.com/amjames, https://github.com/lezcano, https://github.com/huydhn Co-authored-by: Zihang Fang <zhfang1108@gmail.com> Co-authored-by: Huy Do <huydhn@gmail.com>	2024-08-22 07:57:30 +00:00
Yuanhao Ji	cdb9c7d228	Add support for using privateuse1 backend name in `instantiate_device_type_tests()` (#133082 ) As you can see, 'privateuse1' appears many times in out-of-tree extension codebase. I think that everything about the device type should be as same as other in-tree backends after registering the privateuse1 backend. For example, after registering a privateuse1 backend named "foo", you should allow "foo" to be passed in as a valid device type. ```diff - instantiate_device_type_tests(TestIndexing, globals(), only_for='privateuse1') - instantiate_device_type_tests(NumpyTests, globals(), only_for='privateuse1') + instantiate_device_type_tests(TestIndexing, globals(), only_for='foo') + instantiate_device_type_tests(NumpyTests, globals(), only_for='foo') ``` > https://github.com/Ascend/pytorch/blob/master/test/test_indexing.py#L1654-L1655 The change is to map privateuse1 backend name to 'privateuse1' when calling `filter_desired_device_types()`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133082 Approved by: https://github.com/albanD	2024-08-22 06:17:21 +00:00
chilli	938f37b745	Added batching rule for sdpa_math, sdpa_efficient_attention forward, cudnn, and flash attention (#133964 ) Fixes https://github.com/pytorch/pytorch/issues/117016, https://github.com/pytorch/pytorch/issues/102457, https://github.com/pytorch/pytorch/issues/110525, https://github.com/pytorch/pytorch/issues/108065, Pull Request resolved: https://github.com/pytorch/pytorch/pull/133964 Approved by: https://github.com/Skylion007	2024-08-22 05:29:49 +00:00
xinan.lin	c2e2602ecd	[Inductor] Move `GPU_TYPE`(The runtime avaliable gpu type, cuda or xpu) from (#132740 ) Move GPU_TYPE(The runtime avaliable gpu type, cuda or xpu) from `testing/_internal/inductor_utils.py` to `_inductor/utils.py`. So that we can use it in Inductor, not limited in test case. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132740 Approved by: https://github.com/EikanWang, https://github.com/jansel	2024-08-21 11:18:00 +00:00
krzysztofjordan	2e1830c7c8	Implement 2D version of masked_select for nestedtensors (#133889 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133889 Approved by: https://github.com/soulitzer	2024-08-20 21:46:32 +00:00
Chirag Pandya	994fcb9acd	Killswitch based rollout for flight recorder (#133237 ) Summary: Defaulting TORCH_NCCL_DUMP_ON_TIMEOUT to "true" and adding a kilswitch in case we need to kill this feature in production. Test Plan: Tests pass manually but need futher testing before this is rolled out fully everywhere. Differential Revision: D61136320 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133237 Approved by: https://github.com/c00w	2024-08-20 04:27:55 +00:00
drisspg	fb26b84390	Update fused kernels and call _safe_softmax from SDPA (#133882 ) # UPDATE: This is take 3 of https://github.com/pytorch/pytorch/pull/131863 which was landed via co dev but not applying correclty # Summary Changes the stance of SDPA on what to do for fully masked out rows ## Current Behavior Several PyTorch users have expressed frustration over this issue: - https://github.com/pytorch/pytorch/issues/41508 - https://github.com/pytorch/pytorch/issues/103749 - https://github.com/pytorch/pytorch/issues/103963 These are significant issues with extensive discussion but no satisfactory resolution. The PyTorch team's consensus, as stated here: https://github.com/pytorch/pytorch/issues/24816#issuecomment-524415617 Can be paraphrased as follows: When passing in fully masked out rows, attention becomes ambiguous. We have two main options: 1. Uniformly attend to all values: ```python scores[masked_out_rows] = 1 / len(row) out[masked_out_rows] = 1 / len(row) * value ``` 2. Decide that attention between no queries (masked) and no keys (masked) is meaningless: ```python output[fully_masked_rows] = NaN ``` We went with option 2. Partially because it was easier to implement, but also people argued that users can slice the output to remove the NaNs: ``` Python >fill_value = -float("inf") >row0 = torch.randn(4) >row1 = torch.tensor([(fill_value for _ in range(4)]) >matrix = torch.stack([row0, row1]).requires_grad_(True) >out = torch.softmax(matrix, 1) >out = out[0] >print(out) tensor([0.5377, 0.2729, 0.0692, 0.1201]) ``` Cool, problem solved. But what happends when you call backwards.. ```Python >out.backward(torch.ones_like(out)) >print(matrix.grad) tensor([[3.0957e-08, 1.4157e-08, 7.7802e-10, 1.3713e-08], [ nan, nan, nan, nan]]) ``` Those pesky NaNs are back! ## Why do we see NaNs today? The core of the problem revolves around using softmax function in sdpa: ```python > row = torch.tensor([(-float("inf")) for _ in range(4)]) > torch.softmax(row, 0) tensor([nan, nan, nan, nan]) ``` ## Quick Aside: Masking in Attention Attention itself doesn't have a concept of masking. The `sdpa` function has an argument called `attn_mask`, which would be more accurately named `attn_bias`. This is because we don't actually "mask" entries when computing attention. Instead, due to implementation details([performance](https://github.com/pytorch/pytorch/issues/25110#issuecomment-524519087)), we add a value to the masked-out query/key pairs. We use a large negative number (typically -inf) to decrease the attention weight, as softmax assigns more weight to larger values. ## Alternative Approaches If we use a very large negative number instead of -inf: ```python > row = torch.tensor([(-1e6) for _ in range(4)]) > torch.softmax(row, 0) tensor([0.2500, 0.2500, 0.2500, 0.2500]) ``` However if users always remembered to "slice" out their outputs i.e.: ```Python >fill_value = -1e6 >... >out.backward(torch.ones_like(out)) >print(matrix.grad) tensor([[-0.0563, -0.0564, 0.1613, -0.0486], [ 0.0000, 0.0000, 0.0000, 0.0000]]) ``` This would bring us back into a better state. ## A Third Option We don't necessarily need to alter the behavior of softmax for -inf or very large negative numbers. The fundamental goal is to exclude certain query/key pairs from attention, regardless of the underlying implementation. This PR implements the new semantic for masking w/ attention in fully masked-out rows: ```python out[masked_out_rows] = 0 ``` Important Note: This idea isn't entirely new. The [MaskedTensor](https://pytorch.org/tutorials/prototype/maskedtensor_overview#safe-softmax) prototype, a tensor subclass, was designed to handle such cases. However, it remains a prototype feature and hasn't gained widespread adoption. ## Details This PR stack does 3 things: 1. Adds a PRIVATE _safe_softmax op 2. Updates semantic for flash_cpu fused kernel 3. Updates semantic for efficient_cuda fused kernel _safe_softmax is not supposed to be used generically and is only meant to be used within the context of SDPA. Due to this fact instead of decomposing softmax and checking for -inf rows we instead "cheat" and use nan_to_num. Why I think this is okay? (please find a counter point if avail) There are multiple ways NaNs can emerge. For the fully masked out rows case nan_to_num works. But what if there were other NaNs, wouldn't this silently remove them? The only case that this can happen is if the input itself had a NaN or an Inf For example: ```Python a = torch.ones([4], requires_grad=False, dtype=torch.float16) a[1] = torch.finfo(torch.float16).max print(a.softmax(-1)) ``` Will return `tensor([0., 1., 0., 0.], dtype=torch.float16)` Where ```Python a = torch.ones([4], requires_grad=False, dtype=torch.float16) a[1] = float("inf") a.softmax(-1) ``` returns: `tensor([nan, nan, nan, nan], dtype=torch.float16)` If we dont want to even allow for the possibility of "inf" or "NaN" attention scores to be converted to 0 then we can implemented it something like this ```Python max = torch.max(a, dim=-1, keepdim=True) exp = torch.exp(a - max.values) denom = torch.sum(exp, dim=-1, keepdim=True) softmax = exp / denom softmax = torch.where(max.values == float('-inf'), 0.0, softmax) ``` however we would be paying for this in math performance. ## Why Now I think one point that has substantially changed where PyTorch should lie on this argument is the fact that we have fused implementations for SDPA now. And these fused implementations allow us to easily and performantly support this new semantic. Differential Revision: [D61418679](https://our.internmc.facebook.com/intern/diff/D61418679) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133882 Approved by: https://github.com/soulitzer	2024-08-19 18:53:11 +00:00
PyTorch MergeBot	35f36363ec	Revert "[dtensor] move DTensor to public namespace (#133113 )" This reverts commit `2ee6b97464`. Reverted https://github.com/pytorch/pytorch/pull/133113 on behalf of https://github.com/wanchaol due to looks like it break some internal type imports ([comment](https://github.com/pytorch/pytorch/pull/133113#issuecomment-2295670911))	2024-08-19 05:00:19 +00:00
Jiang, Yanbing	215b14530a	Add Half for sparse.mm reduce (#133672 ) This PR is to add Half support for sparse.mm reduce in CPU backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133672 Approved by: https://github.com/Skylion007	2024-08-17 15:20:39 +00:00
Wanchao Liang	2ee6b97464	[dtensor] move DTensor to public namespace (#133113 ) Moving DTensor to be in the public namespace, to formally add the documentation page that includes all the public APIs. This includes: * many path renames and path import fixes * a dedicated doc page without too much content yet (adding in the next PRs) * To preserve the BC for users still using the `torch.distributed._tensor`, I added a shim script to redirect old path calls to the new module The BC preserving is evidented by the fact that all DTensor tests are still working without changing the public imports. So it's safe to land the changes Pull Request resolved: https://github.com/pytorch/pytorch/pull/133113 Approved by: https://github.com/XilunWu ghstack dependencies: #133305, #133306	2024-08-17 05:09:52 +00:00
Li, Xingyuan	dcfa415e6e	[Inductor UT] Reuse inductor UT for intel GPU `test/inductor/test_compiled_optimizers.py` (#133083 ) [Inductor UT] Reuse Inductor test case for Intel GPU. Reuse `test/inductor/test_compiled_optimizers.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/133083 Approved by: https://github.com/etaf, https://github.com/jansel, https://github.com/mlazos	2024-08-17 01:15:26 +00:00
Denis Vieriu	861bdf96f4	[MPS] Add native strided API for MPSNDArray starting with macOS 15 (#128393 ) Add support for native strides in MPS starting with macOS Sequoia. This will get rid of the additional gather and scatter operations needed to solve the strides or storage offsets of the tensors. Summary of changes (starting with macOS 15): - Add support for MPS strided API (strides/storage offsets etc): - [initWithBuffer:offset:descriptor:](https://developer.apple.com/documentation/metalperformanceshaders/mpsndarray/4391636-initwithbuffer?language=objc) - [arrayViewWithCommandBuffer:descriptor:aliasing:](https://developer.apple.com/documentation/metalperformanceshaders/mpsndarray/3114040-arrayviewwithcommandbuffer?language=objc) - [arrayViewWithShape:strides:](https://developer.apple.com/documentation/metalperformanceshaders/mpsndarray/4408694-arrayviewwithshape?language=objc) - [reshapeWithCommandBuffer:sourceArray:shape:destinationArray:](https://developer.apple.com/documentation/metalperformanceshaders/mpsndarrayidentity/4438557-reshapewithcommandbuffer?language=objc) - Add native support for NHWC convolutions (without incurring any extra copy from NCHW -> NHWC -> NCHW). - Add support for strided output buffers (previously we would create a contiguous buffer OSes older than macOS 15 will run the old gather/scatter code path to solve strides/storage offsets. --- Couple performance stats collected from torchbench comparing macOS 15 vs macOS 14: ``` - test_train[functorch_maml_omniglot-mps]: 27% faster - test_train[timm_vision_transformer-mps]: 12% faster - test_train[hf_T5-mps]: 9.46% faster ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128393 Approved by: https://github.com/albanD Co-authored-by: Siddharth Kotapati <skotapati@apple.com>	2024-08-16 21:07:50 +00:00
Mikayla Gawarecki	d9576c9440	Fix failures when default is flipped for weights_only (#127627 ) Tests on XLA shard not fixed yet but there is an issue here https://github.com/pytorch/xla/issues/7799 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127627 Approved by: https://github.com/albanD ghstack dependencies: #132349	2024-08-16 00:22:43 +00:00
PyTorch MergeBot	cfec69e2a1	Revert "Update fused kernels and call _safe_softmax from SDPA (#131863 )" This reverts commit `caba37e99b`. Reverted https://github.com/pytorch/pytorch/pull/131863 on behalf of https://github.com/izaitsevfb due to breaks executorch test executorch/backends/apple/coreml:test - test_vit_skip_conv (executorch.backends.apple.coreml.test.test_coreml_partitioner.TestCoreMLPartitioner) ([comment](https://github.com/pytorch/pytorch/pull/131863#issuecomment-2291855634))	2024-08-15 17:55:07 +00:00
Jane Xu	c23dceb8f1	Add Adafactor foreach impl (#132336 ) This PR adds the foreach impl for Adafactor knowing that there are many ways to improve its runtime perf today (by adding more foreach support). After this PR: - we have a foreach flag for Adafactor - It is NOT the default. Why not? It is only slightly faster + uses O(n) more memory where n is the number of params in your max param group. People tend to use Adafactor for memory efficiency. Next steps: - make torch.compile possible on it - make it faster (by adding more foreach apis) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132336 Approved by: https://github.com/albanD ghstack dependencies: #133360	2024-08-15 17:00:33 +00:00
Xuehai Pan	758a0a88a2	[BE][Easy] enable `ruff` rule `PIE790`: unnecessary `pass` statement (#133200 ) This PR removes unnecessary `pass` statement. This is semanticly safe because the bytecode for the Python code does not change. Note that if there is a docstring in the function, a empty function does not need a `pass` statement as placeholder. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133200 Approved by: https://github.com/malfet, https://github.com/eqy, https://github.com/kit1980	2024-08-15 15:50:19 +00:00
Aaron Gokaslan	ec49ce5f8e	[CUDA]: Add frexp CUDA bfloat16 support (#133313 ) Fixes #133263 Add CUDA bfloat16 support to cuda_frexp Pull Request resolved: https://github.com/pytorch/pytorch/pull/133313 Approved by: https://github.com/ezyang, https://github.com/eqy	2024-08-15 15:20:00 +00:00
soulitzer	4af4910b1a	Reland "Construct NJT without graph breaks" (#133196 ) This reverts commit 154d40ca488e6979ce9c2de89d8a35b53129ebea. and adds changes from https://github.com/pytorch/pytorch/pull/133061 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133196 Approved by: https://github.com/ezyang ghstack dependencies: #133145	2024-08-14 01:11:13 +00:00
drisspg	caba37e99b	Update fused kernels and call _safe_softmax from SDPA (#131863 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131863 Approved by: https://github.com/jbschlosser, https://github.com/Chillee	2024-08-13 23:37:50 +00:00
Sun, Jiayi	7be77658e9	[Inductor] support masked vectorization for the tail_loop for INT8 datatype (#131155 ) This PR supports masked vectorization for the tail_loop for torch.uint8 and torch.int8 datatype to improve performance. BTW, I fixed the UT of `byte` by setting the range of the sample inputs to [0, 255] since the range of `torch.uint8` is [0, 255]. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131155 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel ghstack dependencies: #130724	2024-08-13 01:12:05 +00:00
soulitzer	05de2b2d0f	Revert "Construct NJT without graph breaks" (#133145 ) This reverts commit 911154271309667b55dfb963ec6384bd0048019b. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133145 Approved by: https://github.com/YuqingJ	2024-08-10 03:11:16 +00:00
drisspg	1434e0b121	Add a private _safe_softmax (#131060 ) # Summary Changes the stance of SDPA on what to do for fully masked out rows ## Current Behavior Several PyTorch users have expressed frustration over this issue: - https://github.com/pytorch/pytorch/issues/41508 - https://github.com/pytorch/pytorch/issues/103749 - https://github.com/pytorch/pytorch/issues/103963 These are significant issues with extensive discussion but no satisfactory resolution. The PyTorch team's consensus, as stated here: https://github.com/pytorch/pytorch/issues/24816#issuecomment-524415617 Can be paraphrased as follows: When passing in fully masked out rows, attention becomes ambiguous. We have two main options: 1. Uniformly attend to all values: ```python scores[masked_out_rows] = 1 / len(row) out[masked_out_rows] = 1 / len(row) * value ``` 2. Decide that attention between no queries (masked) and no keys (masked) is meaningless: ```python output[fully_masked_rows] = NaN ``` We went with option 2. Partially because it was easier to implement, but also people argued that users can slice the output to remove the NaNs: ``` Python >fill_value = -float("inf") >row0 = torch.randn(4) >row1 = torch.tensor([(fill_value for _ in range(4)]) >matrix = torch.stack([row0, row1]).requires_grad_(True) >out = torch.softmax(matrix, 1) >out = out[0] >print(out) tensor([0.5377, 0.2729, 0.0692, 0.1201]) ``` Cool, problem solved. But what happends when you call backwards.. ```Python >out.backward(torch.ones_like(out)) >print(matrix.grad) tensor([[3.0957e-08, 1.4157e-08, 7.7802e-10, 1.3713e-08], [ nan, nan, nan, nan]]) ``` Those pesky NaNs are back! ## Why do we see NaNs today? The core of the problem revolves around using softmax function in sdpa: ```python > row = torch.tensor([(-float("inf")) for _ in range(4)]) > torch.softmax(row, 0) tensor([nan, nan, nan, nan]) ``` ## Quick Aside: Masking in Attention Attention itself doesn't have a concept of masking. The `sdpa` function has an argument called `attn_mask`, which would be more accurately named `attn_bias`. This is because we don't actually "mask" entries when computing attention. Instead, due to implementation details([performance](https://github.com/pytorch/pytorch/issues/25110#issuecomment-524519087)), we add a value to the masked-out query/key pairs. We use a large negative number (typically -inf) to decrease the attention weight, as softmax assigns more weight to larger values. ## Alternative Approaches If we use a very large negative number instead of -inf: ```python > row = torch.tensor([(-1e6) for _ in range(4)]) > torch.softmax(row, 0) tensor([0.2500, 0.2500, 0.2500, 0.2500]) ``` However if users always remembered to "slice" out their outputs i.e.: ```Python >fill_value = -1e6 >... >out.backward(torch.ones_like(out)) >print(matrix.grad) tensor([[-0.0563, -0.0564, 0.1613, -0.0486], [ 0.0000, 0.0000, 0.0000, 0.0000]]) ``` This would bring us back into a better state. ## A Third Option We don't necessarily need to alter the behavior of softmax for -inf or very large negative numbers. The fundamental goal is to exclude certain query/key pairs from attention, regardless of the underlying implementation. This PR implements the new semantic for masking w/ attention in fully masked-out rows: ```python out[masked_out_rows] = 0 ``` Important Note: This idea isn't entirely new. The [MaskedTensor](https://pytorch.org/tutorials/prototype/maskedtensor_overview#safe-softmax) prototype, a tensor subclass, was designed to handle such cases. However, it remains a prototype feature and hasn't gained widespread adoption. ## Details This PR stack does 3 things: 1. Adds a PRIVATE _safe_softmax op 2. Updates semantic for flash_cpu fused kernel 3. Updates semantic for efficient_cuda fused kernel _safe_softmax is not supposed to be used generically and is only meant to be used within the context of SDPA. Due to this fact instead of decomposing softmax and checking for -inf rows we instead "cheat" and use nan_to_num. Why I think this is okay? (please find a counter point if avail) There are multiple ways NaNs can emerge. For the fully masked out rows case nan_to_num works. But what if there were other NaNs, wouldn't this silently remove them? The only case that this can happen is if the input itself had a NaN or an Inf For example: ```Python a = torch.ones([4], requires_grad=False, dtype=torch.float16) a[1] = torch.finfo(torch.float16).max print(a.softmax(-1)) ``` Will return `tensor([0., 1., 0., 0.], dtype=torch.float16)` Where ```Python a = torch.ones([4], requires_grad=False, dtype=torch.float16) a[1] = float("inf") a.softmax(-1) ``` returns: `tensor([nan, nan, nan, nan], dtype=torch.float16)` If we dont want to even allow for the possibility of "inf" or "NaN" attention scores to be converted to 0 then we can implemented it something like this ```Python max = torch.max(a, dim=-1, keepdim=True) exp = torch.exp(a - max.values) denom = torch.sum(exp, dim=-1, keepdim=True) softmax = exp / denom softmax = torch.where(max.values == float('-inf'), 0.0, softmax) ``` however we would be paying for this in math performance. ## Why Now I think one point that has substantially changed where PyTorch should lie on this argument is the fact that we have fused implementations for SDPA now. And these fused implementations allow us to easily and performantly support this new semantic. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131060 Approved by: https://github.com/jbschlosser	2024-08-08 23:09:38 +00:00
Alnis Murtovi	7bd0732cbd	Fix flaky internal mixed_mm tests (#133015 ) This PR fixes flaky internal tests: - The AutoHeuristic test was sometimes failing because it required autotuning to happen for mixed_mm which didn't end up happening when there was a fx graph cache hit. - The tests inside pattern_matcher failed because in some cases pad_mm decided to pad which made the mixed_mm pattern not match anymore (instead of cast -> mm, it was cast -> pad -> mm), and the tests also fail when is_big_gpu is false (which I haven't found an explanation for). Differential Revision: [D60972176](https://our.internmc.facebook.com/intern/diff/D60972176) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133015 Approved by: https://github.com/Chillee, https://github.com/eellison	2024-08-08 20:32:12 +00:00
Joel Schlosser	75eb66afc0	Support 'non-contiguous with holes' NJTs for contiguous clone() (#132776 ) It's possible to construct an NJT with "holes" by specifying both `offsets` and `lengths` metadata. When `nt.clone(memory_format=torch.contiguous_format)` is called on such an NJT, the result should be an NJT without holes. This PR fixes this in simplistic way using `unbind()`, which isn't really supported in `torch.compile`. The longer term solution involves writing a proper kernel to support this. NB: Another limitation is that the returned NJT does not have the same ragged structure as the input. While we could manually hack the nested int registry (or update the union find when that lands), this is the first instance where a NJT with holes and an NJT without holes could have the same ragged structure, and getting those to play nicely together requires some fairly involved updates. For now, this PR punts on these updates until we can clean this up. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132776 Approved by: https://github.com/ani300, https://github.com/soulitzer ghstack dependencies: #131898, #131704, #131937	2024-08-08 17:08:11 +00:00
Catherine Lee	4fe6a5dc34	Move slow tests to be in repo (#132379 ) Move the slow test json to be in the pytorch/pytorch repo and make a job that will update it weekly. The job uses the same environment as the commit hash. It uses similar code to the hash updates, but the hash update contains a lot of code that is specific to the hash update, so I chose to pick out the parts that are relevant Remove references to the old file and set up testing to read from the new file instead The old update cadence was every day, the new one is every week The auto slow test infra + the lack of pinning between pytorch and test-infra makes it really hard to tell if a test started failing because of a change or because of the slow test json changing. While this can have benefits, like disable test issues being effective everywhere immediately, it can also be very confusing, especially since we don't have the same insight into slow tests like we do for disable issues. Example PR made: https://github.com/pytorch/pytorch/pull/132383 (with all the changes from this PR because it was working on top of this) We should just get rid of this at some point in favor of the slowTest decorator, but there are some tests that take 5+ minutes to run and I don't want to track them down right now Pull Request resolved: https://github.com/pytorch/pytorch/pull/132379 Approved by: https://github.com/huydhn	2024-08-07 18:42:56 +00:00
wz337	8b50d5398f	[DeviceMesh] Create new group for 1D mesh when default backend is 'gloo' and 'cuda' is available (#132709 ) More context in [#132471](https://github.com/pytorch/pytorch/issues/132471) and https://github.com/pytorch/pytorch/issues/132366. TLDR: When cuda is available and users move tensors to cuda, we cannot really reuse the default pg if default pg is gloo, as lots of collectives are not supported on gloo for cuda tensors. For example, `dtensor.full_tensor()` would result in a mysterious SIGTERM when all_gather a cuda tensor using gloo. Without the change in this PR, users would have to know the context and explicitly move the cuda tensor to cpu before invoking most collectives, which I think is not so ideal UX. Therefore, given most collectives are not supported on gloo for cuda tensors, we should init a new pg if the default pg is gloo when torch.cuda.is_available() and device_type is cuda. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132709 Approved by: https://github.com/awgu, https://github.com/wanchaol	2024-08-07 16:13:11 +00:00
Apurva Jain	8bc5ef563e	Grouped Query Attention (#132689 ) ### Approach: Using the current function declaration Constraint: Q_Heads % KV_Heads == 0 Major change: - Added a new argument enable_gqa: bool to sdpa function call - It adds a meaning to the last third dimension. Sample use cases this would enable: LLama3 ``` # LLama3 8b call to SDPA query = torch.rand(batch, 32, seq_len_q, D) key = torch.rand(batch, 8, seq_len_kv, D) value = torch.rand(batch, 8, seq_len_kv, D) output = scaled_dot_product_attention(query, key, value, is_causal=True, enable_gqa=True) # Output Shape (batch, 32, seq_len_q, D) ``` ### Design Choice: - Check if Query.size(-3) == Key.size(-3) == Value.size(-3) or, Query.size(-3) % Key.size(-3) == 0 - The function adjusts the key and value tensors to match the query tensor's head dimension by using repeat_interleave if their number of heads are not equal, facilitating correct and efficient computation in attention mechanisms. - By default the enable_gqa flag is set to False, which ensures that regular sdpa functionality remains unchanged. ### Benchmarks: - sdpa.py: #130634 For different batch sizes enable_gqa=True shows a substansial improvement in the run_time of sdpa \| batch_size \| q_num_heads \| kv_num_heads \| q_seq_len \| kv_seq_len \| embed_dim \| forward_time when enable_gqa=True \| forward_time when enable_gqa=False \| \| ------------ \| ------------- \| -------------- \| ----------- \| ------------ \| ----------- \| ----------- \| ---------------- \| \| 1 \| 32 \| 8 \| 2048 \| 2048 \| 2048 \| 100.71 \| 119.70 \| \| 8 \| 32 \| 8 \| 2048 \| 2048 \| 2048 \| 539.78 \| 628.83 \| \| 16 \| 32 \| 8 \| 2048 \| 2048 \| 2048 \| 1056.81 \| 1225.48 \| \| 32 \| 32 \| 8 \| 2048 \| 2048 \| 2048 \| 2099.54 \| 2440.45 \| ![Screenshot 2024-07-25 at 9 07 40 PM](https://github.com/user-attachments/assets/a3e5f716-c39f-4096-9e6c-82a735e57b7b) - TorchTitan: https://github.com/pytorch/torchtitan/pull/458 Differential Revision: D60772086 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132689 Approved by: https://github.com/drisspg	2024-08-07 05:35:36 +00:00
Edward Z. Yang	ed224554eb	[BE] Don't unnecessarily suggest -k for rerunning tests locally (#132807 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/132807 Approved by: https://github.com/malfet	2024-08-07 02:15:18 +00:00
PyTorch MergeBot	cbee9c1fd2	Revert "Deprecate `torch._utils.is_compiling()` and `torch._dynamo.external_utils.is_compiling()` (#127690 )" This reverts commit `0e7e61f7ce`. Reverted https://github.com/pytorch/pytorch/pull/127690 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/127690#issuecomment-2272370386))	2024-08-07 00:05:20 +00:00
soulitzer	f50621989b	Construct NJT without graph breaks (#130292 ) Combines contributions from https://github.com/pytorch/pytorch/pull/130505 Some context can be found in this large comment block: `a5b64d39fd/test/dynamo/test_subclasses.py (L1667-L1681)` Changes in this PR - For each tensor fakified, check the nested int registry in eager, and eagerly symbolicize if that tensor has already been associated with nested int in eager. - Adds a separate counter stored on FakeTensorMode as a fake analog to _tensor_id_counter (which keeps track of unique tensors). This counter is initialized to the global eager tensor id counter upon creation of the FakeTensorMode, and needs to be reset when the same FakeTensorMode is reused to trace again (in this PR, we piggyback on the epoch incrementing logic). - (refactor) Today, we store FakeTensor -> symbolic nested int in the global registry. With this PR, symbolic nested int is stored directly on the FakeTensor. (Eager still caches nested int in the registry, though we should avoid this at some point.) Basically unchanged, but worth noting: - `__tensor_unflatten__` is still responsible for determining whether we should cache for now. The logic is somewhat simplified. - to_copy is still using the trick of updating two different tensors in the registry to point to the same nested int. This is kind of broken, but we try to leave it as is, and plan a better fix with the UnionFind stack. Differential Revision: [D60406772](https://our.internmc.facebook.com/intern/diff/D60406772) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130292 Approved by: https://github.com/bdhirsh ghstack dependencies: #131916, #131803	2024-08-06 17:03:39 +00:00
Wei Feng	2a8e94347f	[TP] verify numeric parity on Transfromers for multiple iterations (#132543 ) Before setting up float8 numeric parity test, I have to set up regular TP numeric parity test, preferrably testing 10 iterations this PR sets a baseline of TP numerics. I can verify fp8 on top of it Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/132543 Approved by: https://github.com/tianyu-l ghstack dependencies: #132350	2024-08-04 06:43:27 +00:00
Xuehai Pan	0e7e61f7ce	Deprecate `torch._utils.is_compiling()` and `torch._dynamo.external_utils.is_compiling()` (#127690 ) This PR is split from PR #126898. - #126898 ------ Pull Request resolved: https://github.com/pytorch/pytorch/pull/127690 Approved by: https://github.com/Skylion007, https://github.com/malfet	2024-08-03 09:43:38 +00:00
Adnan Akhundov	8ad9f89ccc	[inductor] Reland: Add flag to ignore unsupported @triton.autotune args in user-written kernel compilation (#132562 ) Summary: This is a reland attempt of [#131431](https://github.com/pytorch/pytorch/pull/131431), as, in its original form, the PR has caused issues internally. We currently don't support some of the `triton.autotune` arguments when compiling user-written Triton kernels with PT2. In this PR, we're adding a flag to circumvent it. This is to unblock internal compilation in some cases. The flag is supplied with the docs mentioning why it is not a good idea to set it. Test Plan: ``` python test/inductor/test_triton_kernels.py -k test_triton_kernel_ autotune_with_unsupported_args ... ---------------------------------------------------------------------- Ran 3 tests in 3.636s OK ``` Differential Revision: D60701839 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132562 Approved by: https://github.com/chenyang78	2024-08-03 06:31:28 +00:00
PyTorch MergeBot	bcb4f7c172	Revert "Grouped Query Attention (#128898 )" This reverts commit `6b28af1b79`. Reverted https://github.com/pytorch/pytorch/pull/128898 on behalf of https://github.com/ZainRizvi due to Sorry, this broke a bunch of tests internally. See D60638265 ([comment](https://github.com/pytorch/pytorch/pull/128898#issuecomment-2265961038))	2024-08-02 18:58:46 +00:00
Pearu Peterson	a4ea776881	Add pinned memory support to sparse COO/CSR/CSC/BSR/BSC tensors (#129645 ) As in the title: To register indices/values of a sparse XYZ tensor with CUDA, the following methods are supported - `sparse_xyz_tensor(indices, values, pin_memory=True)` - `sparse_xyz_tensor(indices, values).pin_memory()` - `sparse_xyz_tensor(indices.pin_memory(), values.pin_memory())` Fixes https://github.com/pytorch/pytorch/issues/115330 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129645 Approved by: https://github.com/amjames, https://github.com/cpuhrsch, https://github.com/eqy	2024-08-02 08:55:55 +00:00
majing	3a355c1891	Correct sample creation of torch.histogram in UT op_db to align PyTorch defined operator semantics (#131630 ) Fixes #130916 As the semantics defined in [torch.histogram](https://pytorch.org/docs/stable/generated/torch.histogram.html#torch-histogram), we need an increasing sequence as bins tensor. Random input doesn't make sense for torch.histogram. The case is a comparison between CPU backend and another backend. When the input is random, kernel implementation in other backends have to totally align with the CPU kernel, or the case fails. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131630 Approved by: https://github.com/EikanWang, https://github.com/albanD	2024-08-02 01:51:09 +00:00
henrylhtsang	0c3ac428a2	[BE][typing] fix types in common pruning (#132309 ) BE task. Add typings and remove mypy errors in torch/testing/_internal/common_pruning.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/132309 Approved by: https://github.com/ColinPeppler	2024-08-01 23:34:33 +00:00
Xuehai Pan	30293319a8	[BE][Easy][19/19] enforce style for empty lines in import segments in `torch/[o-z]*/` (#129771 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129771 Approved by: https://github.com/justinchuby, https://github.com/janeyx99	2024-08-01 17:07:14 +00:00
Oguz Ulgen	72d2dba992	Add None return type to init (#132335 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132335 Approved by: https://github.com/albanD	2024-08-01 15:26:45 +00:00
jainapurva	6b28af1b79	Grouped Query Attention (#128898 ) ### Approach: Using the current function declaration Constraint: Q_Heads % KV_Heads == 0 Major change: - Added a new argument enable_gqa: bool to sdpa function call - It adds a meaning to the last third dimension. Sample use cases this would enable: LLama3 ``` # LLama3 8b call to SDPA query = torch.rand(batch, 32, seq_len_q, D) key = torch.rand(batch, 8, seq_len_kv, D) value = torch.rand(batch, 8, seq_len_kv, D) output = scaled_dot_product_attention(query, key, value, is_causal=True, enable_gqa=True) # Output Shape (batch, 32, seq_len_q, D) ``` ### Design Choice: - Check if Query.size(-3) == Key.size(-3) == Value.size(-3) or, Query.size(-3) % Key.size(-3) == 0 - The function adjusts the key and value tensors to match the query tensor's head dimension by using repeat_interleave if their number of heads are not equal, facilitating correct and efficient computation in attention mechanisms. - By default the enable_gqa flag is set to False, which ensures that regular sdpa functionality remains unchanged. ### Benchmarks: - sdpa.py: #130634 For different batch sizes enable_gqa=True shows a substansial improvement in the run_time of sdpa \| batch_size \| q_num_heads \| kv_num_heads \| q_seq_len \| kv_seq_len \| embed_dim \| forward_time when enable_gqa=True \| forward_time when enable_gqa=False \| \| ------------ \| ------------- \| -------------- \| ----------- \| ------------ \| ----------- \| ----------- \| ---------------- \| \| 1 \| 32 \| 8 \| 2048 \| 2048 \| 2048 \| 100.71 \| 119.70 \| \| 8 \| 32 \| 8 \| 2048 \| 2048 \| 2048 \| 539.78 \| 628.83 \| \| 16 \| 32 \| 8 \| 2048 \| 2048 \| 2048 \| 1056.81 \| 1225.48 \| \| 32 \| 32 \| 8 \| 2048 \| 2048 \| 2048 \| 2099.54 \| 2440.45 \| ![Screenshot 2024-07-25 at 9 07 40 PM](https://github.com/user-attachments/assets/a3e5f716-c39f-4096-9e6c-82a735e57b7b) - TorchTitan: https://github.com/pytorch/torchtitan/pull/458 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128898 Approved by: https://github.com/drisspg	2024-07-31 22:58:51 +00:00
Yu, Guangye	45e6a364ee	Avoid autocast deprecation warning (#132207 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132207 Approved by: https://github.com/awgu	2024-07-31 13:13:39 +00:00
ekamiti	9e473fd868	Make adding Buffers more like adding Parameters (#125971 ) Add similar semantics for creating a buffer object similar to creating a parameter. This is done by introducing a new Buffer class that can be used for type disambiguation. The underlying functionality of registering a buffer remains the same as the register_buffer method has not been changed. The persistent parameter in the Buffer type is to indicate whether a buffer object should be persistent or not. Other non-test changes have to do with getting the new Buffer type recognized by inductor and dynamo. Remaining changes are test changes to make sure that the Buffer type can be used as a drop in replacement for register_buffer as it just leads to register_buffer being called. The addition of this new functionality still allows for normal tensors to be used as buffers so these changes are intended to be backwards compatible. Fixes #35735 Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125971 Approved by: https://github.com/albanD, https://github.com/anijain2305, https://github.com/mlazos	2024-07-31 10:32:40 +00:00
Joel Schlosser	524aac413c	Initial OpInfo-based testing for NJTs (#131704 ) This PR utilizes the info from the existing OpInfo database `op_db` to contribute to general NJT testing. * New tests in `TestNestedTensorOpInfo` * `test_forward()` - compares forward output to an unbind-based reference * `test_backward()` - compares forward output and grads to an unbind-based reference * `test_forward_compile()` - compares forward compile output (`backend="aot_eager_decomp_partition"`) to eager * `test_backward_compile()` - compares forward compile output (`backend="aot_eager_decomp_partition"`) and grads to eager * To avoid adding a bunch of NJT-specific stuff to the `OpInfo` structure, this PR translates `op_db` -> a NJT-specific `njt_op_db`. * `UnaryUfuncInfo`s utilize a new `sample_inputs_unary_njt_pointwise()` which iterates through a comprehensive list of NJTs: contiguous / non-contiguous, dims 2, 3, and 4, transposed / not, etc. * `BinaryUfuncInfo`s utilize a new `sample_inputs_binary_njt_pointwise()` which iterates through a comprehensive list of NJTs: contiguous / non-contiguous, dims 2, 3, and 4, transposed / not, etc. * `ReductionOpInfo`s utilize a new `sample_inputs_njt_reduction()` which covers full reductions, reductions over the jagged dim, and reductions over the non-jagged dim * Several xfails were added to get things passing TODO (future PRs): * Pass non-contiguous / non-contiguous with holes NJTs (maybe we should have separate tests for these? most ops don't support NJTs with holes today) * Mixed (NT, T), (T, NT) inputs for binary ops * Handle other types of OpInfos (beyond unary pointwise, binary pointwise, and reduction) by manually by writing sample_inputs_funcs * Address all xfails via fixes Pull Request resolved: https://github.com/pytorch/pytorch/pull/131704 Approved by: https://github.com/soulitzer ghstack dependencies: #131898	2024-07-30 23:02:24 +00:00
Joel Schlosser	d53b11bb6e	Strict shape checking for NJTs with TestCase.assertEqual() (#131898 ) Background: `TestCase.assertEqual()` is commonly used during test case validation. Historically, to support NSTs, the logic was written to compare two nested tensors by unbinding them and comparing their components. This logic applied to NJTs as well, which in practice meant that two NJTs with different nested ints in their shapes could compare equal if their components were equal. This PR changes the above logic so that NJTs are no longer unbound during comparison, allowing them to receive full shape validation. This makes `TestCase.assertEqual()` stricter for NJTs, requiring them to have the same nested ints in their shapes to compare equal. Note that some tests rely on the old, looser behavior. To address this, the PR introduces a base `NestedTensorTestCase` that defines a helper function `assertEqualIgnoringNestedInts()` so that these tests can explicitly opt in to the looser comparison behavior. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131898 Approved by: https://github.com/soulitzer	2024-07-30 20:05:48 +00:00
PyTorch MergeBot	499ead96ff	Revert "Grouped Query Attention (#128898 )" This reverts commit `d039b14207`. Reverted https://github.com/pytorch/pytorch/pull/128898 on behalf of https://github.com/albanD due to Broken test on main ([comment](https://github.com/pytorch/pytorch/pull/128898#issuecomment-2258314481))	2024-07-30 13:11:24 +00:00
jainapurva	d039b14207	Grouped Query Attention (#128898 ) ### Approach: Using the current function declaration Constraint: Q_Heads % KV_Heads == 0 Major change: - Added a new argument enable_gqa: bool to sdpa function call - It adds a meaning to the last third dimension. Sample use cases this would enable: LLama3 ``` # LLama3 8b call to SDPA query = torch.rand(batch, 32, seq_len_q, D) key = torch.rand(batch, 8, seq_len_kv, D) value = torch.rand(batch, 8, seq_len_kv, D) output = scaled_dot_product_attention(query, key, value, is_causal=True, enable_gqa=True) # Output Shape (batch, 32, seq_len_q, D) ``` ### Design Choice: - Check if Query.size(-3) == Key.size(-3) == Value.size(-3) or, Query.size(-3) % Key.size(-3) == 0 - The function adjusts the key and value tensors to match the query tensor's head dimension by using repeat_interleave if their number of heads are not equal, facilitating correct and efficient computation in attention mechanisms. - By default the enable_gqa flag is set to False, which ensures that regular sdpa functionality remains unchanged. ### Benchmarks: - sdpa.py: #130634 For different batch sizes enable_gqa=True shows a substansial improvement in the run_time of sdpa \| batch_size \| q_num_heads \| kv_num_heads \| q_seq_len \| kv_seq_len \| embed_dim \| forward_time when enable_gqa=True \| forward_time when enable_gqa=False \| \| ------------ \| ------------- \| -------------- \| ----------- \| ------------ \| ----------- \| ----------- \| ---------------- \| \| 1 \| 32 \| 8 \| 2048 \| 2048 \| 2048 \| 100.71 \| 119.70 \| \| 8 \| 32 \| 8 \| 2048 \| 2048 \| 2048 \| 539.78 \| 628.83 \| \| 16 \| 32 \| 8 \| 2048 \| 2048 \| 2048 \| 1056.81 \| 1225.48 \| \| 32 \| 32 \| 8 \| 2048 \| 2048 \| 2048 \| 2099.54 \| 2440.45 \| ![Screenshot 2024-07-25 at 9 07 40 PM](https://github.com/user-attachments/assets/a3e5f716-c39f-4096-9e6c-82a735e57b7b) - TorchTitan: https://github.com/pytorch/torchtitan/pull/458 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128898 Approved by: https://github.com/drisspg	2024-07-29 21:49:06 +00:00
PyTorch MergeBot	d5e9fbb012	Revert "BE: reset dynamo before each test in test_module.py (#131372 )" This reverts commit `527901f054`. Reverted https://github.com/pytorch/pytorch/pull/131372 on behalf of https://github.com/kit1980 due to Broke test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_CTCLoss_cuda_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10149118852/job/28065175173) [HUD commit link](`ca8153ae67`) ([comment](https://github.com/pytorch/pytorch/pull/131372#issuecomment-2257019116))	2024-07-29 21:15:25 +00:00
PyTorch MergeBot	a4723b566f	Revert "BE: reset dynamo before each test in test_ops_gradients.py (#131397 )" This reverts commit `ca8153ae67`. Reverted https://github.com/pytorch/pytorch/pull/131397 on behalf of https://github.com/kit1980 due to Broke test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_CTCLoss_cuda_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10149118852/job/28065175173) [HUD commit link](`ca8153ae67`) ([comment](https://github.com/pytorch/pytorch/pull/131372#issuecomment-2257019116))	2024-07-29 21:15:25 +00:00
Tom Ritchford	bdf5a6dca9	Add decomposition for unsqueeze_copy (#130942 ) * Extracted from #128416 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130942 Approved by: https://github.com/peterbell10	2024-07-29 21:13:37 +00:00
Xuehai Pan	5cc34f61d1	[CI] add new test config label `ci-test-showlocals` to control test log verbosity (#131981 ) Add a new label `ci-test-showlocals` and add it to test config filter. If the PR is labeled with `ci-test-showlocals` or "ci-test-showlocals" present in the PR comment, the test config filter will set a environment variable `TEST_SHOWLOCALS`. Then `pytest` will show local variables on failures for better debugging. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131981 Approved by: https://github.com/malfet ghstack dependencies: #131151	2024-07-29 18:53:14 +00:00
Xuehai Pan	4694ee1ad2	[BE][tests] show local variables on failure in tests (#131151 ) ------ As per the title, add argument `--locals` for `unittest` and `--showlocals --tb=long` for `pytest` in CI. Some failures cannot be reproduced on the local machine but exist on cloud CI. This change allows us to investigate the test failure more easily. Example output: https://github.com/pytorch/pytorch/actions/runs/9961546996/job/27523888353?pr=130710#step:20:3361 ```text /opt/conda/envs/py_3.8/lib/python3.8/site-packages/sympy/core/function.py:307: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ cls = FloorDiv, base = -1.00000000000000, divisor = -1.00000000000000 @classmethod def eval(cls, base, divisor): # python test/test_dynamic_shapes.py -k TestDimConstraints.test_dim_constraints_solve_full # Assert triggered by inequality solver # assert base.is_integer, base # assert divisor.is_integer, divisor # We don't provide the same error message as in Python because SymPy # makes it difficult to check the types. if divisor.is_zero: raise ZeroDivisionError("division by zero") if base in (int_oo, -int_oo, sympy.oo, -sympy.oo) and divisor in ( int_oo, -int_oo, sympy.oo, -sympy.oo, ): return sympy.nan if base is sympy.nan or divisor is sympy.nan: return sympy.nan if base.is_zero: return sympy.S.Zero if base.is_integer and divisor == 1: return base if base.is_integer and divisor == -1: return sympy.Mul(base, -1) if ( isinstance(base, sympy.Number) and isinstance(divisor, sympy.Number) and ( base in (int_oo, -int_oo, sympy.oo, -sympy.oo) or divisor in (int_oo, -int_oo, sympy.oo, -sympy.oo) ) ): r = float(base) / float(divisor) if r == math.inf: return int_oo elif r == -math.inf: return -int_oo elif math.isnan(r): return sympy.nan else: return sympy.Integer(math.floor(r)) if isinstance(base, sympy.Integer) and isinstance(divisor, sympy.Integer): return sympy.Integer(int(base) // int(divisor)) if isinstance(base, FloorDiv): return FloorDiv(base.args[0], base.args[1] * divisor) # Expands (x + y) // b into x // b + y // b. # This only works if floor is an identity, i.e. x / b is an integer. for term in sympy.Add.make_args(base): quotient = term / divisor if quotient.is_integer and isinstance(divisor, sympy.Integer): # NB: this is correct even if the divisor is not an integer, but it # creates rational expressions that cause problems with dynamic # shapes. return FloorDiv(base - term, divisor) + quotient try: gcd = sympy.gcd(base, divisor) if gcd != 1: > return FloorDiv( sympy.simplify(base / gcd), sympy.simplify(divisor / gcd) ) base = -1.00000000000000 cls = FloorDiv divisor = -1.00000000000000 gcd = 1.00000000000000 quotient = 1.00000000000000 term = -1.00000000000000 /opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/utils/_sympy/functions.py:159: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ args = (FloorDiv, -1.00000000000000, -1.00000000000000), kwargs = {} @wraps(func) def wrapper(args, kwargs): try: > retval = cfunc(args, **kwargs) E RecursionError: maximum recursion depth exceeded in comparison E E To execute this test, run the following from the base repo dir: E python test/test_sympy_utils.py -k TestValueRanges.test_binary_ref_fn_floordiv_dtype_float E E This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 args = (FloorDiv, -1.00000000000000, -1.00000000000000) cfunc = <functools._lru_cache_wrapper object at 0x7fc5303173a0> func = <function Function.__new__ at 0x7fc530317280> kwargs = {} ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131151 Approved by: https://github.com/ezyang	2024-07-29 18:53:14 +00:00
Shunting Zhang	ca8153ae67	BE: reset dynamo before each test in test_ops_gradients.py (#131397 ) https://github.com/pytorch/pytorch/pull/126586 tried to reset dynamo before each unit test. That PR get reverted a couple of times because we see post-land test failures that we don't see before merge. This PR only reset dynamo before each tests in `test_ops_gradients.py` to make it easier to land. Eventually after we reset dynamo in each individual test files, we can move the change to the base class (TestCase) and remove the change in individual test files. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131397 Approved by: https://github.com/zou3519 ghstack dependencies: #131551, #131388, #131372	2024-07-29 17:39:23 +00:00
Shunting Zhang	527901f054	BE: reset dynamo before each test in test_module.py (#131372 ) https://github.com/pytorch/pytorch/pull/126586 tried to reset dynamo before each unit test. That PR get reverted a couple of times because we see post-land test failures that we don't see before merge. This PR only reset dynamo before each tests in `test_module.py` to make it easier to land. Eventually after we reset dynamo in each individual test files, we can move the change to the base class (TestCase) and remove the change in individual test files. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131372 Approved by: https://github.com/zou3519 ghstack dependencies: #131551, #131388	2024-07-29 17:39:23 +00:00
PyTorch MergeBot	6cf493158e	Revert "Enable FlashAttention on Windows (#131906 )" This reverts commit `b90bc66766`. Reverted https://github.com/pytorch/pytorch/pull/131906 on behalf of https://github.com/atalman due to Windows nightly failures ([comment](https://github.com/pytorch/pytorch/pull/131906#issuecomment-2256421183))	2024-07-29 16:49:23 +00:00
Tom Ritchford	962f248437	Add decomposition for expand_copy (#130940 ) * Extracted from #129476 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130940 Approved by: https://github.com/peterbell10	2024-07-29 16:23:56 +00:00
PyTorch MergeBot	c35f21e5fc	Revert "[BE][tests] show local variables on failure in tests (#131151 )" This reverts commit `14158d892a`. Reverted https://github.com/pytorch/pytorch/pull/131151 on behalf of https://github.com/atalman due to Broke CI: test_testing.py::TestTestingCUDA::test_cuda_assert_should_stop_common_device_type_test_suite_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/10131415299/job/28014665693) [HUD commit link](`14158d892a`) ([comment](https://github.com/pytorch/pytorch/pull/131151#issuecomment-2255921015))	2024-07-29 13:19:38 +00:00
PyTorch MergeBot	06fe99a097	Revert "[CI] add new test config label `ci-test-showlocals` to control test log verbosity (#131981 )" This reverts commit `dfa18bf3f3`. Reverted https://github.com/pytorch/pytorch/pull/131981 on behalf of https://github.com/atalman due to Sorry, need to revert bottom PR, which broke CI: https://github.com/pytorch/pytorch/pull/131151 ([comment](https://github.com/pytorch/pytorch/pull/131981#issuecomment-2255892628))	2024-07-29 13:09:41 +00:00
Xuehai Pan	dfa18bf3f3	[CI] add new test config label `ci-test-showlocals` to control test log verbosity (#131981 ) Add a new label `ci-test-showlocals` and add it to test config filter. If the PR is labeled with `ci-test-showlocals` or "ci-test-showlocals" present in the PR comment, the test config filter will set a environment variable `TEST_SHOWLOCALS`. Then `pytest` will show local variables on failures for better debugging. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131981 Approved by: https://github.com/malfet	2024-07-29 07:40:42 +00:00
Shunting Zhang	f151f25c0b	BE: reset dynamo before each test in test_torch.py (#131388 ) https://github.com/pytorch/pytorch/pull/126586 tried to reset dynamo before each unit test. That PR get reverted a couple of times because we see post-land test failures that we don't see before merge. This PR only reset dynamo before each tests in `test_torch.py` to make it easier to land. Eventually after we reset dynamo in each individual test files, we can move the change to the base class (TestCase) and remove the change in individual test files. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131388 Approved by: https://github.com/zou3519 ghstack dependencies: #131551	2024-07-29 04:57:34 +00:00
Oguz Ulgen	7c29665f77	Remove mypy ignore from torch/testing/_internal/distributed/ (#131870 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131870 Approved by: https://github.com/aakhundov ghstack dependencies: #131786	2024-07-28 17:13:53 +00:00
PyTorch MergeBot	d3c17fea90	Revert "[BE] typing for decorators - _library/custom_ops (#131578 )" This reverts commit `c65b197b85`. Reverted https://github.com/pytorch/pytorch/pull/131578 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))	2024-07-28 03:29:32 +00:00
Xuehai Pan	14158d892a	[BE][tests] show local variables on failure in tests (#131151 ) ------ As per the title, add argument `--locals` for `unittest` and `--showlocals --tb=long` for `pytest` in CI. Some failures cannot be reproduced on the local machine but exist on cloud CI. This change allows us to investigate the test failure more easily. Example output: https://github.com/pytorch/pytorch/actions/runs/9961546996/job/27523888353?pr=130710#step:20:3361 ```text /opt/conda/envs/py_3.8/lib/python3.8/site-packages/sympy/core/function.py:307: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ cls = FloorDiv, base = -1.00000000000000, divisor = -1.00000000000000 @classmethod def eval(cls, base, divisor): # python test/test_dynamic_shapes.py -k TestDimConstraints.test_dim_constraints_solve_full # Assert triggered by inequality solver # assert base.is_integer, base # assert divisor.is_integer, divisor # We don't provide the same error message as in Python because SymPy # makes it difficult to check the types. if divisor.is_zero: raise ZeroDivisionError("division by zero") if base in (int_oo, -int_oo, sympy.oo, -sympy.oo) and divisor in ( int_oo, -int_oo, sympy.oo, -sympy.oo, ): return sympy.nan if base is sympy.nan or divisor is sympy.nan: return sympy.nan if base.is_zero: return sympy.S.Zero if base.is_integer and divisor == 1: return base if base.is_integer and divisor == -1: return sympy.Mul(base, -1) if ( isinstance(base, sympy.Number) and isinstance(divisor, sympy.Number) and ( base in (int_oo, -int_oo, sympy.oo, -sympy.oo) or divisor in (int_oo, -int_oo, sympy.oo, -sympy.oo) ) ): r = float(base) / float(divisor) if r == math.inf: return int_oo elif r == -math.inf: return -int_oo elif math.isnan(r): return sympy.nan else: return sympy.Integer(math.floor(r)) if isinstance(base, sympy.Integer) and isinstance(divisor, sympy.Integer): return sympy.Integer(int(base) // int(divisor)) if isinstance(base, FloorDiv): return FloorDiv(base.args[0], base.args[1] * divisor) # Expands (x + y) // b into x // b + y // b. # This only works if floor is an identity, i.e. x / b is an integer. for term in sympy.Add.make_args(base): quotient = term / divisor if quotient.is_integer and isinstance(divisor, sympy.Integer): # NB: this is correct even if the divisor is not an integer, but it # creates rational expressions that cause problems with dynamic # shapes. return FloorDiv(base - term, divisor) + quotient try: gcd = sympy.gcd(base, divisor) if gcd != 1: > return FloorDiv( sympy.simplify(base / gcd), sympy.simplify(divisor / gcd) ) base = -1.00000000000000 cls = FloorDiv divisor = -1.00000000000000 gcd = 1.00000000000000 quotient = 1.00000000000000 term = -1.00000000000000 /opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/utils/_sympy/functions.py:159: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ args = (FloorDiv, -1.00000000000000, -1.00000000000000), kwargs = {} @wraps(func) def wrapper(args, kwargs): try: > retval = cfunc(args, **kwargs) E RecursionError: maximum recursion depth exceeded in comparison E E To execute this test, run the following from the base repo dir: E python test/test_sympy_utils.py -k TestValueRanges.test_binary_ref_fn_floordiv_dtype_float E E This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 args = (FloorDiv, -1.00000000000000, -1.00000000000000) cfunc = <functools._lru_cache_wrapper object at 0x7fc5303173a0> func = <function Function.__new__ at 0x7fc530317280> kwargs = {} ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131151 Approved by: https://github.com/ezyang	2024-07-27 19:39:40 +00:00
Will Feng	9e06572704	[Traceable FSDP2][Inductor] Create grouped nodes for FSDP2 all-gather code block and reduce-scatter code block (after Buffer/Operation split) (#131510 ) This PR creates these `GroupedSchedulerNode`s: - One for each all-gather code block (cast + copy-in + all-gather) - One for each all-gather-wait code block (all-gather-wait + copy-out) - One for each reduce-scatter code block (copy-in + reduce-scatter) - One for each reduce-scatter-wait code block (reduce-scatter-wait) This serves two goals: - Prevent outside ops from being fused into these op groups, in order to have more predicable memory usage. - Make it easier to specify the dependency e.g. from `i+1` all-gather group node to the `i` all-gather-wait group node, to enforce FSDP2 comm ordering (i.e. "serialization of comms"). The actual "reorder-for-FSDP-compute-comm-overlap" PR will come next. Test commands: - `pytest -rA test/distributed/test_compute_comm_reordering.py::TestComputeCommReorderingMultiProc` - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor` - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131510 Approved by: https://github.com/yifuwang	2024-07-27 08:39:58 +00:00
Xu Han	a90b8b967a	[inductor] enable windows inductor UTs (#131767 ) Changes: 1. Add `skipIfWindows` function. 2. Fix `fresh_inductor_cache` raise error on Windows, due to can't delete loaded modules. 3. Disable some UTs, which are not passed on Windows. 4. Enable test_torchinductor in Windows CI. I have tested passed on my dev machine: <img width="864" alt="image" src="https://github.com/user-attachments/assets/91d5a62f-7383-44b3-b614-99940f196fdb"> TODO: review and fix the skipped cases. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131767 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-07-27 02:46:03 +00:00
Luca Wehrstedt	b90bc66766	Enable FlashAttention on Windows (#131906 ) Let's just give this a try. Reland of https://github.com/pytorch/pytorch/pull/131875. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131906 Approved by: https://github.com/drisspg	2024-07-26 21:41:56 +00:00
Shunting Zhang	c8626a4e1f	[BE] add a list of inductor test files to skip resetting dynamo (#131551 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131551 Approved by: https://github.com/zou3519	2024-07-26 21:08:15 +00:00
PyTorch MergeBot	0f9bf208ec	Revert "[BE][tests] show local variables on failure in tests (#131151 )" This reverts commit `054d214c50`. Reverted https://github.com/pytorch/pytorch/pull/131151 on behalf of https://github.com/jbschlosser due to pollutes test failure output for OpInfo tests ([comment](https://github.com/pytorch/pytorch/pull/131151#issuecomment-2253310448))	2024-07-26 19:03:10 +00:00
Xuehai Pan	63374dda69	[BE][Easy] explicitly define global constants in `torch.testing._internal.common_utils` (#129826 ) This appeases IDE warnings like "torch.testing._internal.common_utils has no member TEST_WITH_ROCM". Pull Request resolved: https://github.com/pytorch/pytorch/pull/129826 Approved by: https://github.com/Skylion007	2024-07-26 06:32:08 +00:00
Aaron Orenstein	c65b197b85	[BE] typing for decorators - _library/custom_ops (#131578 ) See #131429 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131578 Approved by: https://github.com/oulgen, https://github.com/zou3519 ghstack dependencies: #131568, #131569, #131570, #131571, #131572, #131573, #131574, #131575, #131576, #131577	2024-07-25 22:24:19 +00:00
PyTorch MergeBot	f3df7deab8	Revert "Add flag to ignore unsupported @triton.autotune args in user-written kernel compilation (#131431 )" This reverts commit `e9db1b0597`. Reverted https://github.com/pytorch/pytorch/pull/131431 on behalf of https://github.com/clee2000 due to broke internal tests D60211713 ([comment](https://github.com/pytorch/pytorch/pull/131431#issuecomment-2251091957))	2024-07-25 18:00:46 +00:00
Jane Xu	9c4cf866c2	Adafactor forloop basic impl (#129905 ) #109581 At this point, the vanilla implementation (the default) is good. Docs: https://docs-preview.pytorch.org/pytorch/pytorch/129905/generated/torch.optim.Adafactor.html#torch.optim.Adafactor Specifically, the impl in this PR, which attempts to replicate the paper, ``` optim = torch.optim.Adafactor([weight]) ``` is close enough to https://pytorch-optimizers.readthedocs.io/en/latest/optimizer/#pytorch_optimizer.AdaFactor ``` optim_c = AdaFactor([weight], betas=(0, 0.999), scale_parameter=False) ``` is close enough to https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adafactor ``` optim = keras.optimizers.Adafactor(learning_rate=0.01) ``` The three results respectively for the same randomly generated weights: ``` # ours tensor([[ 0.3807594, -0.3912092], [ 0.0762539, 0.5377805], [ 0.2459473, 0.4662207]]) # pytorch-optimizer tensor([[ 0.3807592, -0.3912172], [ 0.0762507, 0.5377818], [ 0.2459457, 0.4662213]]) # keras array([[ 0.38076326, -0.39121315], [ 0.0762547 , 0.5377859 ], [ 0.24594972, 0.46622536]], dtype=float32) ``` This gives me confidence to move forward in speeding up the implementation now that a baseline has been established. If you're curious about differences: * keras assigns step_size (rho_t in their code) to `min(lr, 1 / sqrt(step)` whereas the OG impl uses a hardcoded 0.01 instead of lr. We do the same thing as keras, but our lr default is 0.01. * We differ from the pytorch-optimizers default in that our default will not track momentum (thus `beta1=0`) and we do not apply parameter scaling. <details> Keras collab: https://colab.research.google.com/drive/1i3xF8ChL7TWKJGV_5v_5nMhXKnYmQQ06?usp=sharing My script repro: ``` import torch from pytorch_optimizer import AdaFactor torch.set_printoptions(precision=7) weight = torch.tensor([[ 0.37697506, -0.39500135], [ 0.07246649, 0.53399765], [ 0.24216151, 0.46243715]], dtype=torch.float32) # bias = torch.tensor([0, 0], dtype=torch.float32) weight.grad = torch.tensor([[-0.5940447, -0.7743838], [-0.5940447, -0.7743838], [-0.5940447, -0.7743838]], dtype=torch.float32) # bias.grad = torch.tensor([-2.5027974, 1.5422692], dtype=torch.float32) weight_c = weight.clone() weight_c.grad = weight.grad.clone() optim = torch.optim.Adafactor([weight]) optim.step() print(weight) optim_c = AdaFactor([weight_c], betas=(0, 0.999), scale_parameter=False) optim_c.step() print(weight_c) ``` <details> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129905 Approved by: https://github.com/albanD	2024-07-25 13:17:19 +00:00
Xuehai Pan	054d214c50	[BE][tests] show local variables on failure in tests (#131151 ) ------ As per the title, add argument `--locals` for `unittest` and `--showlocals --tb=long` for `pytest` in CI. Some failures cannot be reproduced on the local machine but exist on cloud CI. This change allows us to investigate the test failure more easily. Example output: https://github.com/pytorch/pytorch/actions/runs/9961546996/job/27523888353?pr=130710#step:20:3361 ```text /opt/conda/envs/py_3.8/lib/python3.8/site-packages/sympy/core/function.py:307: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ cls = FloorDiv, base = -1.00000000000000, divisor = -1.00000000000000 @classmethod def eval(cls, base, divisor): # python test/test_dynamic_shapes.py -k TestDimConstraints.test_dim_constraints_solve_full # Assert triggered by inequality solver # assert base.is_integer, base # assert divisor.is_integer, divisor # We don't provide the same error message as in Python because SymPy # makes it difficult to check the types. if divisor.is_zero: raise ZeroDivisionError("division by zero") if base in (int_oo, -int_oo, sympy.oo, -sympy.oo) and divisor in ( int_oo, -int_oo, sympy.oo, -sympy.oo, ): return sympy.nan if base is sympy.nan or divisor is sympy.nan: return sympy.nan if base.is_zero: return sympy.S.Zero if base.is_integer and divisor == 1: return base if base.is_integer and divisor == -1: return sympy.Mul(base, -1) if ( isinstance(base, sympy.Number) and isinstance(divisor, sympy.Number) and ( base in (int_oo, -int_oo, sympy.oo, -sympy.oo) or divisor in (int_oo, -int_oo, sympy.oo, -sympy.oo) ) ): r = float(base) / float(divisor) if r == math.inf: return int_oo elif r == -math.inf: return -int_oo elif math.isnan(r): return sympy.nan else: return sympy.Integer(math.floor(r)) if isinstance(base, sympy.Integer) and isinstance(divisor, sympy.Integer): return sympy.Integer(int(base) // int(divisor)) if isinstance(base, FloorDiv): return FloorDiv(base.args[0], base.args[1] * divisor) # Expands (x + y) // b into x // b + y // b. # This only works if floor is an identity, i.e. x / b is an integer. for term in sympy.Add.make_args(base): quotient = term / divisor if quotient.is_integer and isinstance(divisor, sympy.Integer): # NB: this is correct even if the divisor is not an integer, but it # creates rational expressions that cause problems with dynamic # shapes. return FloorDiv(base - term, divisor) + quotient try: gcd = sympy.gcd(base, divisor) if gcd != 1: > return FloorDiv( sympy.simplify(base / gcd), sympy.simplify(divisor / gcd) ) base = -1.00000000000000 cls = FloorDiv divisor = -1.00000000000000 gcd = 1.00000000000000 quotient = 1.00000000000000 term = -1.00000000000000 /opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/utils/_sympy/functions.py:159: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ args = (FloorDiv, -1.00000000000000, -1.00000000000000), kwargs = {} @wraps(func) def wrapper(args, kwargs): try: > retval = cfunc(args, **kwargs) E RecursionError: maximum recursion depth exceeded in comparison E E To execute this test, run the following from the base repo dir: E python test/test_sympy_utils.py -k TestValueRanges.test_binary_ref_fn_floordiv_dtype_float E E This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 args = (FloorDiv, -1.00000000000000, -1.00000000000000) cfunc = <functools._lru_cache_wrapper object at 0x7fc5303173a0> func = <function Function.__new__ at 0x7fc530317280> kwargs = {} ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131151 Approved by: https://github.com/ezyang	2024-07-25 10:10:58 +00:00
William Wen	7718024d2b	[3.13] support 3.13 multiline traces in munge_exc (#131207 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131207 Approved by: https://github.com/jansel, https://github.com/anijain2305 ghstack dependencies: #131206	2024-07-24 18:22:30 +00:00
William Wen	106c6a49f5	[dynamo] limit number of compiles per frame (#130891 ) Fixes https://github.com/pytorch/pytorch/issues/130776 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130891 Approved by: https://github.com/anijain2305	2024-07-24 16:43:40 +00:00
Adnan Akhundov	e9db1b0597	Add flag to ignore unsupported @triton.autotune args in user-written kernel compilation (#131431 ) Summary: We currently don't support some of the `@triton.autotune` arguments when compiling user-written Triton kernels with PT2. In this PR, we're adding a flag to circumvent it. This is to unblock internal compilation in some cases. The flag is supplied with the docs mentioning why it is not a good idea to set it. Test Plan: ``` python test/inductor/test_triton_kernels.py -k test_triton_kernel_ autotune_with_unsupported_args ... ---------------------------------------------------------------------- Ran 3 tests in 3.636s OK ``` Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/131431 Approved by: https://github.com/oulgen, https://github.com/zou3519	2024-07-24 05:37:09 +00:00
Will Feng	fc3d2b26cd	Use fake PG for test_compute_comm_reordering.py unit tests (#131415 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131415 Approved by: https://github.com/yifuwang	2024-07-23 22:53:23 +00:00
Aaron Orenstein	5a0068cc69	[BE] mypy: disallow untyped decorators (#131428 ) Untyped decorators strip the types from their decorated function so even if the underlying function is fully typed then callers to it don't get any benefit from type annotations. Step 1 - Enable the error and override in all the offending files. #131429 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131428 Approved by: https://github.com/justinchuby, https://github.com/oulgen	2024-07-23 21:50:55 +00:00
Shangdi Yu	68c725a094	[custom ops] Add register_vmap for custom ops (#130589 ) Fixes #130284 Fixes #130653 - Add `torch.library.register_vmap` to custom ops - Add `register_vmap` for operators in ops in custom_op_db. - Make `torch.autograd.Function` support kwarg-only kwargs for vmap - test operators in op_db with `tests/test_vmap`. - change `test_vmap` to allow custom `out_dim` and allow "None" in `out_dim` when testing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130589 Approved by: https://github.com/zou3519	2024-07-23 17:48:38 +00:00
Tom Ritchford	16247987a1	Add decomposition for t_copy (#130939 ) * Extracted from #128416 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130939 Approved by: https://github.com/peterbell10	2024-07-23 08:29:19 +00:00
Li-Huai (Allan) Lin	99d9b369f4	[Optim] Support tensor lr for all optimizers and check it is 1-element (#131065 ) Fixes: #130980 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131065 Approved by: https://github.com/janeyx99	2024-07-23 04:27:05 +00:00
PyTorch MergeBot	b435d84261	Revert "[custom ops] Add register_vmap for custom ops (#130589 )" This reverts commit `074b420641`. Reverted https://github.com/pytorch/pytorch/pull/130589 on behalf of https://github.com/atalman due to Please fix lint and reland ([comment](https://github.com/pytorch/pytorch/pull/130589#issuecomment-2244092174))	2024-07-23 01:44:44 +00:00
Shangdi Yu	074b420641	[custom ops] Add register_vmap for custom ops (#130589 ) Fixes #130284 Fixes #130653 - Add `torch.library.register_vmap` to custom ops - Add `register_vmap` for operators in ops in custom_op_db. - Make `torch.autograd.Function` support kwarg-only kwargs for vmap - test operators in op_db with `tests/test_vmap`. - change `test_vmap` to allow custom `out_dim` and allow "None" in `out_dim` when testing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130589 Approved by: https://github.com/zou3519	2024-07-23 00:54:52 +00:00
Thomas Ortner	8ae1963a61	[Autograd] Cond Higher-Order Operation (#126911 ) This is an updated PR to equip cond with the autograd feature and replaces the old [PR](https://github.com/pytorch/pytorch/pull/126007) @ydwu4 I tried to incorporate your requests already. Currently there are two problems that I struggle with solving: 1. There seems to be an import issue when trying to import cond in `torch/__init__.py`, see [here](`8a704035c9/torch/__init__.py (L1914-L1916)`). Therefore, I had to comment those lines, which resolved the import issues, but I believe cond is not proberly exposed as torch.cond. 2. I am not entirely sure how to deal with the opinfo test in `hop_db.py` Co-authored-by: Yidi Wu <yidi@meta.com> Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126911 Approved by: https://github.com/ydwu4	2024-07-22 23:18:19 +00:00
Tom Ritchford	500cbb5b90	Add decomposition for view_copy (#130938 ) * Extracted from #128416 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130938 Approved by: https://github.com/peterbell10 ghstack dependencies: #130937	2024-07-21 20:39:24 +00:00
Tom Ritchford	f628813066	Fix out_wrapper, _make_copy_from_view to handle all signatures (#130937 ) * See #128416 and #129476 * Simplify xskip lists in test/functorch/test_ops.py * Add supports_out=True to OpInfos for copy ops Pull Request resolved: https://github.com/pytorch/pytorch/pull/130937 Approved by: https://github.com/peterbell10	2024-07-21 20:39:24 +00:00
ankurneog	ebc012ace6	Add hooks for execution on intel gaudi devices - 1 (#128584 ) ## Motivation This is follow up to PR:https://github.com/pytorch/pytorch/pull/126970 to support Gaudi devices for Pytorch UT execution. ## Changes We are adding additional hooks to: 1. Add dtype exceptions for Gaudi/HPU 2. Extend onlyNativeDevices decorator functionality to add additional devices Pull Request resolved: https://github.com/pytorch/pytorch/pull/128584 Approved by: https://github.com/albanD	2024-07-20 05:03:36 +00:00
Isuru Fernando	bb4251213b	Add decomposition for channel_shuffle (#118775 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/118775 Approved by: https://github.com/peterbell10	2024-07-20 01:24:41 +00:00
Catherine Lee	31e79aae6a	Another follow up to #130260 (#130993 ) Another followup to https://github.com/pytorch/pytorch/pull/130260 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130993 Approved by: https://github.com/huydhn	2024-07-19 16:43:54 +00:00
kausik	4f60a2e39c	Set correct output dtype for dequantize op during convert_pt2e in decomposed mode (#128953 ) Earlier the signature of dequantize ops for decomposed quantized Tensor was changed for wider use-cases where the output dtype can be different from torch.float and needs to be passed during dequantization. Please refer: https://github.com/pytorch/pytorch/pull/121450 However, setting of correct output dtype for dequantize ops was still missing in convert_pt2e flow. This change enables the users to use PT2E quantization flow with non torch.float unquantized dtype, such as torch.bfloat16. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128953 Approved by: https://github.com/jgong5, https://github.com/jerryzh168	2024-07-19 04:58:02 +00:00
PyTorch MergeBot	fb3674b1f4	Revert "[Autograd] Cond Higher-Order Operation (#126911 )" This reverts commit `f7058b735e`. Reverted https://github.com/pytorch/pytorch/pull/126911 on behalf of https://github.com/clee2000 due to broke lint and functorch/test_aotdispatch `f7058b735e` Probably a landrace since both the test and lint passed on PR ([comment](https://github.com/pytorch/pytorch/pull/126911#issuecomment-2237703182))	2024-07-18 22:06:40 +00:00
Thomas Bohnstingl	f7058b735e	[Autograd] Cond Higher-Order Operation (#126911 ) This is an updated PR to equip cond with the autograd feature and replaces the old [PR](https://github.com/pytorch/pytorch/pull/126007) @ydwu4 I tried to incorporate your requests already. Currently there are two problems that I struggle with solving: 1. There seems to be an import issue when trying to import cond in `torch/__init__.py`, see [here](`8a704035c9/torch/__init__.py (L1914-L1916)`). Therefore, I had to comment those lines, which resolved the import issues, but I believe cond is not proberly exposed as torch.cond. 2. I am not entirely sure how to deal with the opinfo test in `hop_db.py` Co-authored-by: Yidi Wu <yidi@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126911 Approved by: https://github.com/ydwu4	2024-07-18 21:09:09 +00:00
PyTorch MergeBot	fff92d4f18	Revert "Use inductor TestCase for test_replicate_with_compiler.py (#129494 )" This reverts commit `9f392f8294`. Reverted https://github.com/pytorch/pytorch/pull/129494 on behalf of https://github.com/atalman due to broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/129494#issuecomment-2237147504))	2024-07-18 17:42:05 +00:00
drisspg	dd39dca034	Removing some cruff and updating signatures for consistency (#130871 ) # Summary - This removes a bunch of example score mods that were primarily used for testing and places them directly in the test file. We should follow up with merging test_flex_decode and test_flash when the velocity slows down a little - Fixes a bug with indexing on block mask - Adds some doc strings to helper funcs and fixes some misc typing things - Forces functions passed to `create_block_mask` to mask_mods and updates tests files Pull Request resolved: https://github.com/pytorch/pytorch/pull/130871 Approved by: https://github.com/joydddd, https://github.com/Chillee	2024-07-18 13:32:11 +00:00
Michael Lazos	cf3f4285a8	Add recursive metadata guard test (#131002 ) Ensures that nested tensors subclasses are guarded properly. It turns out this case is already handled [here](`d77af49380/torch/_dynamo/variables/builder.py (L1496)`) which will recursively wrap inner tensors adding metadata guards for them. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131002 Approved by: https://github.com/bdhirsh	2024-07-18 08:24:43 +00:00
Sam Larsen	9f392f8294	Use inductor TestCase for test_replicate_with_compiler.py (#129494 ) Summary: `test/distributed/_composable/test_replicate_with_compiler.py` exercises inductor. This change introduces a version of MultiProcessTestCase that derives from the inductor TestCase class to make sure we always get a clean cache dir. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129494 Approved by: https://github.com/eellison	2024-07-18 03:08:32 +00:00
Aidyn-A	bd56bcf0ab	[TEST] Fix _scaled_mm tests (#130897 ) This PR resolves several sets of `_scaled_mm` test failures: - `scale_a` and `scale_b` are now required arguments, so the function `sample_inputs_scaled_mm` must supply them - `_scaled_mm` does not support `"meta"` device, so it should be skipped in `test_meta.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130897 Approved by: https://github.com/drisspg	2024-07-18 02:15:00 +00:00
lezcano	af0b5ee924	Reduce number of samples in {svd,pca}_lowrank OpInfos (#127199 ) We don't need to generate so many samples for these very expensive ops. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127199 Approved by: https://github.com/peterbell10, https://github.com/zou3519	2024-07-17 16:29:36 +00:00
Xuehai Pan	b29b23137c	[Easy] Fix argument name collision in dispatched functions (#129562 ) Use positional-only argument to avoid naming collision with aten ops arguments that are named "self". ```python In [1]: def foo(self, args, kwargs): ...: print(self, args, kwargs) ...: In [2]: def bar(self, /, args, **kwargs): ...: print(self, args, kwargs) ...: In [3]: foo(1, 2, self=3) TypeError: foo() got multiple values for argument 'self' In [4]: bar(1, 2, self=3) 1 (2,) {'self': 3} ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129562 Approved by: https://github.com/zou3519, https://github.com/fegin	2024-07-17 14:39:56 +00:00
PyTorch MergeBot	8458dc8966	Revert "Use inductor TestCase for distributed tests (#129494 )" This reverts commit `3cd2ae331a`. Reverted https://github.com/pytorch/pytorch/pull/129494 on behalf of https://github.com/masnesral due to fbcode failures ([comment](https://github.com/pytorch/pytorch/pull/129494#issuecomment-2232063690))	2024-07-17 00:32:48 +00:00
drisspg	2b43d339fe	Make FlexAttention API public (#130755 ) # Summary Makes the prototype API flex_attention public Pull Request resolved: https://github.com/pytorch/pytorch/pull/130755 Approved by: https://github.com/Chillee	2024-07-16 16:21:25 +00:00
Jovian Anthony Jaison	e57101d927	Add testing regarding SparseAdam state_dicts (#130645 ) Summary: - Updated SparseAdam to run test_state_dict_deterministic unit test. - Made gradients sparse while keeping weights dense in the above test. Test Plan: - Ran test_optim.py locally. Fixes #116507 Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130645 Approved by: https://github.com/janeyx99	2024-07-16 11:29:22 +00:00
Sam Larsen	3cd2ae331a	Use inductor TestCase for distributed tests (#129494 ) Summary: At least some of the tests deriving from MultiProcessTestCase exercise inductor. Using the inductor TestCase class makes sure we always get a clean cache dir. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129494 Approved by: https://github.com/eellison	2024-07-16 01:24:35 +00:00
eqy	5e617d7ef5	[CUDA] Actually bump tolerances for `test_grad_pca_lowrank` (#130770 ) Fixes change in #129902 to actually bump pca rather than svd, thanks @ptrblck for the catch Pull Request resolved: https://github.com/pytorch/pytorch/pull/130770 Approved by: https://github.com/Skylion007	2024-07-16 00:41:10 +00:00
Bilal Khan	54a932b0ac	Support for expandable segments with cuda graph trees (#128068 ) This PR adds support to use expandable segments with private memory pools which should unblock using it with cuda graphs and cuda graph trees. Currently, the allocator silently avoids using expandable segments when allocating in a private pool due to checkpoint saving/restoring not meshing well with how we keep track of unmapped blocks. The PR itself is pretty short, most of the logic for checkpointing and reapplying state for non-expandable segments transfers over without much work. Expandable segments reserve a virtual address space of size equal to the amount of physical memory on the GPU. Every time we want to `malloc()` or `free()` memory in a memory pool with expandable segments turned on, we map/unmap pages of physical GPU memory under the hood to create a new block that we return to the caller. This is beneficial due to the fact that each memory pool functions as a single segment of memory with a contiguous block of memory addresses that can grow and shrink as needed, avoiding fragmentation from allocating multiple non-contiguous segments that may not be merged together. The caching allocator handles this by creating an unmapped block for the entire reserved virtual address space at init, which is treated similarly to an unallocated block in a free pool. When callers call `malloc()`, it's split and mapped to create allocated blocks, and calling `free()` similarly caches and merges free blocks in a free pool to be used later. Expandable blocks are unmapped and returned back to Cuda when they are cleaned up, or when we hit an OOM and the allocator attempts to remap cached free blocks. The code paths to map, free, and unmap blocks in expandable segments is similar to that for normal blocks and does all the same work of updating stats on memory usage, moving blocks between active and free pools, and returning memory to Cuda. With Cuda Graph Trees and private memory pools, we need the ability to take checkpoints of the current state of the memory allocator after each graph capture as well as reapplying the state before capturing a new graph after replaying a captured graph so that the new cuda graph capture has access to the state of the allocator at the point after replaying a previously captured graph so it can reuse empty blocks and allocate new ones. As mentioned in a below comment, memory in a private pool is cached until the private pool is destroyed and allocations can only grow from extra graph captures, any freeing of memory would result in invalid memory addresses and would break cuda graphs. One implementation detail to note for unmapped blocks with expandable segments is that unmapped blocks are kept track in a member variable `unmapped` of a `BlockPool`. `unmapped` is not part of the checkpointed state of the caching allocator and isn't restored when reapplying checkpoints since we never free/unmap memory back to cuda and is persisted across graph captures / replays. Checkpointing the current state of the memory allocator works as expected with expandable segments. Checkpointing grabs the first block of every segment in the active and free pools of the private pool and traverses the linked list of blocks in the segment to capture the state of every segment, which is then saved and kept for when it is needed to be reapplied. For expandable blocks, the last block in every segment will be an unallocated unmapped block containing the remaining amount of unmapped memory at graph capture time, and this too is saved in the checkpoint. Reapplying the checkpoints works by freeing all allocated blocks and merging them into a single block per segment, then for each segment, we manually split and allocate all blocks from the checkpoint and then free the blocks marked as unallocated in the checkpoint state. For expandable segments, we need to make some modifications to not split unmapped blocks and avoid manually mapping then freeing unmapped blocks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128068 Approved by: https://github.com/eqy, https://github.com/eellison	2024-07-15 23:23:23 +00:00
Alnis Murtovi	50ef099ad0	Learn a heuristic to decide whether to pad before mm (#128643 ) This PR introduces AutoHeuristic, a framework to collect results from autotuning, learn a heuristic as a machine learning model (a regression tree), and then ship the learned heuristic by generating the regression tree to code. The heuristics have been learned on artificial/random data that has been collected with the `gen_data_pad_mm.py` script. The `gen_pad_mm_a100.sh` scripts can then be used to learn a heuristic and generate it to code. The best model is decided by doing a grid search over various values for `max_depth` and `min_samples_leaf` and choosing the model with the highest number of correct predicitons on the validation set. The heuristic can return "unsure" which means that it is not sure which choice is the best choice and as a result autotuning will happen. On A100 only tensors where each dimension is >= 512 are considered. For smaller tensors the heuristics that I learned returned "unsure" too often. The results for randomly generated data and huggingface look as follows: `max_wrong_speedup` is max(`wrong_speedups`) where `wrong_speedups` contains all the speedups one could have achieved for those examples where the heuristic made a wrong choice, i.e. a `max_wrong_speedup` of 1.37 means that the heuristic selected a choice, but the other choice would have been 1.37x faster. `gman_wrong_speedup` is the geomean of `wrong_speedups`. The heuristic is learned as a regression tree, that returns higher values for better choices. The threshold decides how much better the better choice has to be for it to be returned, i.e. on A100 if the better choice is less than 1.702530x better than the other choice, "unsure" will be returned. This threshold is determined using the validation set. A100 ``` max_depth min_samples_leaf dataset correct wrong unsure total max_wrong_speedup gman_wrong_speedup threshold 15 5.0 10 train 2730 4 3023 5757 1.372220 1.193873 1.702530 16 5.0 10 val 878 0 1042 1920 NaN NaN 1.702530 17 5.0 10 test 925 2 993 1920 1.741708 1.354954 1.702530 18 5.0 10 hf-train 14 0 22 36 NaN NaN 1.702530 19 5.0 10 hf-inf 7 0 1 8 NaN NaN 1.702530 ``` The numbers for huggingface only include tensors where each dim is >=512. If all tensors would have been included there would have been the following number of matmuls, where at least one dimension is unaligned: A100 hf-train: 60 A100 hf-inf: 10 ## Results on running huggingface locally This only includes models where the learned heuristic made at least one decision. For the examples here, it takes around 0.25-0.3 seconds to perform autotuning for the padded and unpadded version, so each decision that the heuristic makes saves around 0.25-0.3 seconds. #pad_mm_autotuning is the number of times autotuning happened in pad_mm and #heuristic_made_decision is the number of times the heuristic made a decision (i.e. it didn't return "unsure"). I ran huggingface locally, each model 5 times and took the median speedup and compilation_latency. Results on huggingface training ``` name speedup_heuristic speedup_baseline speedup_diff compilation_latency_heuristic compilation_latency_baseline compilation_latency_diff comp_latency_reduction% #pad_mm_autotuning #heuristic_made_decision BartForCausalLM 1.19 (+/- 0.00) 1.19 (+/- 0.00) -0.00 40.33 (+/- 1.13) 40.95 (+/- 0.78) -0.62 1.52 3 2 BartForConditionalGeneration 1.53 (+/- 0.06) 1.47 (+/- 0.05) 0.06 81.93 (+/- 5.20) 82.23 (+/- 1.92) -0.30 0.36 3 1 BlenderbotSmallForCausalLM 1.86 (+/- 0.04) 1.86 (+/- 0.00) 0.00 36.76 (+/- 0.49) 37.62 (+/- 1.33) -0.87 2.31 3 2 CamemBert 2.36 (+/- 0.01) 2.35 (+/- 0.01) 0.01 97.60 (+/- 1.91) 98.69 (+/- 1.35) -1.09 1.11 2 1 DistillGPT2 2.57 (+/- 0.01) 2.57 (+/- 0.01) 0.00 57.33 (+/- 0.77) 58.26 (+/- 1.41) -0.93 1.59 3 2 PLBartForCausalLM 2.07 (+/- 0.01) 2.06 (+/- 0.01) 0.01 32.54 (+/- 0.83) 34.65 (+/- 0.71) -2.11 6.10 3 2 PLBartForConditionalGeneration 1.87 (+/- 0.00) 1.88 (+/- 0.00) -0.01 58.45 (+/- 1.24) 58.95 (+/- 1.92) -0.50 0.85 3 1 RobertaForCausalLM 2.39 (+/- 0.01) 2.40 (+/- 0.01) -0.01 97.38 (+/- 1.52) 97.69 (+/- 1.18) -0.31 0.32 2 1 TrOCRForCausalLM 1.70 (+/- 0.00) 1.70 (+/- 0.00) -0.00 44.79 (+/- 1.33) 45.25 (+/- 1.08) -0.46 1.01 3 2 Mean difference in speedup: 0.01 Mean compilation latency saved: -0.80s Mean compilation latency reduction: 1.68% ``` Results on huggingface inference ``` name speedup_heuristic speedup_baseline speedup_diff compilation_latency_heuristic compilation_latency_baseline compilation_latency_diff comp_latency_reduction% #pad_mm_autotuning #heuristic_made_decision BartForCausalLM 1.11 (+/- 0.00) 1.11 (+/- 0.00) 0.00 19.02 (+/- 0.28) 19.40 (+/- 0.35) -0.38 1.95 3 2 BartForConditionalGeneration 1.26 (+/- 0.01) 1.23 (+/- 0.03) 0.03 36.84 (+/- 0.40) 36.55 (+/- 0.75) 0.30 -0.81 3 1 BlenderbotSmallForCausalLM 1.87 (+/- 0.02) 1.87 (+/- 0.01) 0.00 17.53 (+/- 0.31) 18.03 (+/- 0.43) -0.49 2.74 3 2 DistillGPT2 2.50 (+/- 0.02) 2.50 (+/- 0.01) 0.00 16.16 (+/- 0.29) 16.40 (+/- 0.18) -0.24 1.46 3 2 PLBartForCausalLM 1.93 (+/- 0.01) 1.94 (+/- 0.01) -0.00 15.30 (+/- 0.22) 16.01 (+/- 0.71) -0.71 4.43 3 2 PLBartForConditionalGeneration 1.98 (+/- 0.01) 1.98 (+/- 0.01) 0.00 25.90 (+/- 0.32) 26.58 (+/- 0.62) -0.67 2.53 3 1 TrOCRForCausalLM 1.61 (+/- 0.00) 1.62 (+/- 0.00) -0.01 21.38 (+/- 0.37) 21.85 (+/- 0.16) -0.47 2.16 3 2 Mean difference in speedup: 0.00 Mean compilation latency saved: -0.38s Mean compilation latency reduction: 2.07% ``` For now, the heuristic can only be applied to decide whether to pad for mm. One could also learn heuristics for bmm and addmm. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128643 Approved by: https://github.com/Chillee, https://github.com/eellison	2024-07-15 23:04:06 +00:00
eqy	6f32dc0c7b	Don't pass error message as `places` in `assertGreaterAlmostEqual` (#130648 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130648 Approved by: https://github.com/awgu	2024-07-15 22:14:49 +00:00
PyTorch MergeBot	9e161af179	Revert "Increase tolerance for tensorsolve tests (#130620 )" This reverts commit `103b6ccab2`. Reverted https://github.com/pytorch/pytorch/pull/130620 on behalf of https://github.com/clee2000 due to didn't work, test is still failing on this PR and on main, reverting in favor of https://github.com/pytorch/pytorch/pull/130449 instead ([comment](https://github.com/pytorch/pytorch/pull/130620#issuecomment-2229036418))	2024-07-15 17:35:04 +00:00
cyy	28f6ae2718	[9/N] Replace c10::optional with std::optional (#130674 ) Follows #130509 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130674 Approved by: https://github.com/Skylion007	2024-07-15 00:48:43 +00:00
chilli	f9f85bfc0b	[Inductor] FlexAttention supports partial masking (#130415 ) (#130626 ) This is the new version of https://github.com/pytorch/pytorch/pull/130415 Updated test script: https://gist.github.com/yanboliang/7c34a82df611d4ea8869cb9e041bfbfc Updated perf numbers: ``` (pt) [ybliang@devgpu002.ash8 ~/local/debug]$ CUDA_VISIBLE_DEVICES=4 python debug7.py fwd speedup: 0.7166695598192317 bwd speedup: 0.7142133867805904 (pt) [ybliang@devgpu002.ash8 ~/local/debug]$ CUDA_VISIBLE_DEVICES=4 python debug7.py --partial-mask fwd speedup: 0.8428246087169973 bwd speedup: 0.8486261278030254 ``` Approved by: https://github.com/Chillee Pull Request resolved: https://github.com/pytorch/pytorch/pull/130626 Approved by: https://github.com/drisspg, https://github.com/yanboliang	2024-07-14 00:37:26 +00:00
Tobias Ringwald	e5de25896f	Fixed CUDA randint generation for large ranges. (#126066 ) Fixes #125224 For large ranges, calls to CUDA `randint` use a different `unroll_factor` to generate random ints. This `unroll_factor` was not considered correctly in the calculation of the Philox offsets. Thus, some of the random states were reused, resulting in lower entropy (see #125224). This also affects multiple other random functions, such as `torch.rand` and `torch.randn`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126066 Approved by: https://github.com/eqy, https://github.com/lezcano	2024-07-13 21:42:27 +00:00
rzou	95046c86e3	[HOP] add HOP x torch_dispatch interaction (#130606 ) This involved beefing up the Python dispatcher to handle torch_dispatch. Given a HOP and a torch_dispatch Tensor subclass: - the HOP will show up in the subclass's `__torch_dispatch__` - you can also use HOP.py_impl to register a rule for the HOP x subclass interaction - (coming soon) we'll offer a way to open register HOP x subclass interaction without needing to touch the subclass's `__torch_dispatch__` or the HOP's .py_impl. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/130606 Approved by: https://github.com/ydwu4	2024-07-12 21:51:36 +00:00
albanD	103b6ccab2	Increase tolerance for tensorsolve tests (#130620 ) Fix current failure in periodic trunk https://hud.pytorch.org/failure?name=periodic%20%2F%20linux-focal-cuda11.8-py3.10-gcc9-debug%20%2F%20test%20(default%2C%204%2C%205%2C%20linux.4xlarge.nvidia.gpu)&jobName=undefined&failureCaptures=%5B%22functorch%2Ftest_ops.py%3A%3ATestOperatorsCUDA%3A%3Atest_vjp_linalg_tensorsolve_cuda_float32%22%5D Since it appeared with https://github.com/pytorch/pytorch/pull/128238 that only updates random seed for the test, I expect this is just bad luck of the draw. Thus increasing tolerance like we do for other tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130620 Approved by: https://github.com/lezcano, https://github.com/atalman, https://github.com/malfet	2024-07-12 20:08:18 +00:00
leslie-fang-intel	2a1f22e57f	Change BN to eval before QAT Convert phase (#130598 ) Summary In the QAT convert phase, we fold bn into conv and do DCE to this BN node. We should change `torch.ops.aten._native_batch_norm_legit.default` to `torch.ops.aten._native_batch_norm_legit_no_training.default` for a safe DCE. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130598 Approved by: https://github.com/jgong5, https://github.com/yushangdi	2024-07-12 16:03:56 +00:00
yan-yhy	c16e90fe06	The device_suffix in a test_name is "privateuse1" sometimes. (#130091 ) When run some test cases on the privateuse1 device, the device_suffix in a test_name is 'privateuse1' sometimes. For examples, a test_name is 'test_Dropout1d_npu', while it would be 'test_Dropout1d_privateuse1' sometimes. When setUpClass() didn't set it, the device_suffix would be "privateuse1". Pull Request resolved: https://github.com/pytorch/pytorch/pull/130091 Approved by: https://github.com/zou3519	2024-07-12 02:51:40 +00:00
PyTorch MergeBot	d97d962082	Revert "Add decompositions for copy variants of view ops (#128416 )" This reverts commit `68751799b8`. Reverted https://github.com/pytorch/pytorch/pull/128416 on behalf of https://github.com/izaitsevfb due to breaks test_qs8_permute_copy test in executorch ([comment](https://github.com/pytorch/pytorch/pull/128416#issuecomment-2224023423))	2024-07-11 22:09:23 +00:00
PyTorch MergeBot	a2f630a9a4	Revert "Decompose expand_copy and permute_copy (#129476 )" This reverts commit `7d4cb21098`. Reverted https://github.com/pytorch/pytorch/pull/129476 on behalf of https://github.com/izaitsevfb due to depends on #128416 which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/129476#issuecomment-2224019720))	2024-07-11 22:06:15 +00:00
Yidi Wu	cd9bae30de	Allow kwargs in _remove_effect_tokens_pass (#130491 ) Summary: Previously, remove_effect_tokens pass didn't pass kwargs to the internal nodes. This PR fix it and add a test for it. Test Plan: buck2 run caffe2/test:test_export -- -r test_remove_effect_token_kwargs Reviewed By: angelayi Differential Revision: D59603147 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130491 Approved by: https://github.com/angelayi	2024-07-11 19:03:19 +00:00
PyTorch MergeBot	578388bed8	Revert "Support for expandable segments with cuda graph trees (#128068 )" This reverts commit `fdc83610f2`. Reverted https://github.com/pytorch/pytorch/pull/128068 on behalf of https://github.com/janeyx99 due to Reverting for breaking ROCm tests on trunk, I think the tests need to be qualified with @onlyCUDA ([comment](https://github.com/pytorch/pytorch/pull/128068#issuecomment-2223672381))	2024-07-11 18:58:13 +00:00
Xuehai Pan	973037be6a	[BE][Easy] apply autofix for ruff rules unnecessary-collection-call (C408): `list()` / `tuple()` / `dict()` (#130199 ) This PR changes the empty collection factory call to Python literals: - `list()` -> `[]` - `tuple()` -> `()` - `dict()` -> `{}` The Python literals are more performant and safer. For example, the bytecode for building an empty dictionary: ```bash $ python3 -m dis - <<EOS import collections d1 = {} d2 = dict() dict = collections.OrderedDict d3 = dict() EOS ``` ```text 0 0 RESUME 0 1 2 LOAD_CONST 0 (0) 4 LOAD_CONST 1 (None) 6 IMPORT_NAME 0 (collections) 8 STORE_NAME 0 (collections) 3 10 BUILD_MAP 0 12 STORE_NAME 1 (d1) 4 14 PUSH_NULL 16 LOAD_NAME 2 (dict) 18 CALL 0 26 STORE_NAME 3 (d2) 6 28 LOAD_NAME 0 (collections) 30 LOAD_ATTR 8 (OrderedDict) 50 STORE_NAME 2 (dict) 7 52 PUSH_NULL 54 LOAD_NAME 2 (dict) 56 CALL 0 64 STORE_NAME 5 (d3) 66 RETURN_CONST 1 (None) ``` The dict literal `{}` only has one bytecode `BUILD_MAP`, while the factory call `dict()` has three `PUSH_NULL + LOAD_NAME + CALL`. Also, the factory call is not safe if users override the `dict` name in `locals` or `globals` (see the example of replacing with `OrderedDict` above). Pull Request resolved: https://github.com/pytorch/pytorch/pull/130199 Approved by: https://github.com/malfet	2024-07-11 17:30:28 +00:00
Jiang, Yanbing	6f662e9575	update the input `weight` of `_convert_weight_to_int4pack` to `[n][k / 2] uint8` (#129940 ) This PR is to update the input `weight` of `_convert_weight_to_int4pack` from `[n][k] int32` to `[n][k / 2] uint8`, both for CPU, CUDA and MPS, which can help decouple int4 model checkpoint with different ISAs and different platforms in `gpt-fast`. The advantage is int4 model checkpoint can be shared in different test machines, without re-generating in one certain platform. Meanwhile, the size of input `weight` can be reduced to `1 / 8`. Before this PR, packed weight stored in CUDA specific layout: `[n/8][k/(InnerKTiles*16)][32][InnerKTiles/2]`, dtype int32, where InnerKTiles = 2, 4, 8. CPU packed weight viewed as the SAME shape but stored in different layout: `[n/64][k][32]`, dtype uint8. Weight is strongly coupled with platforms (CPU/CUDA) and ISAs (AVX512/AVX2/scalar). And users cannot use a generated weight in another different ISA or platform, because when loading weight into devices, the compute format is different. ![image](https://github.com/pytorch/pytorch/assets/61222868/64971c4b-29b9-42cf-9aeb-ffa01cea93dd) Now, we use common serialized layout (`[n][k/2] uint8`) for different devices or ISAs as input `weight` of `_convert_weight_to_int4pack`, and each back chooses how to interpret as compute layout. ![image](https://github.com/pytorch/pytorch/assets/61222868/c7990761-c723-417b-aca2-7c60db7785c7) ### Performance Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores) There is no obvious regression of this PR. ![image](https://github.com/pytorch/pytorch/assets/61222868/6046dcf4-920b-4c63-9ca3-1c8c3cafebde) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129940 Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/mingfeima	2024-07-11 15:26:48 +00:00
Isuru Fernando	5db9bd467e	Skip test_nnc_correctness for new op _unsafe_masked_index (#130375 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130375 Approved by: https://github.com/lezcano	2024-07-11 08:17:16 +00:00
Bilal Khan	fdc83610f2	Support for expandable segments with cuda graph trees (#128068 ) This PR adds support to use expandable segments with private memory pools which should unblock using it with cuda graphs and cuda graph trees. Currently, the allocator silently avoids using expandable segments when allocating in a private pool due to checkpoint saving/restoring not meshing well with how we keep track of unmapped blocks. The PR itself is pretty short, most of the logic for checkpointing and reapplying state for non-expandable segments transfers over without much work. Expandable segments reserve a virtual address space of size equal to the amount of physical memory on the GPU. Every time we want to `malloc()` or `free()` memory in a memory pool with expandable segments turned on, we map/unmap pages of physical GPU memory under the hood to create a new block that we return to the caller. This is beneficial due to the fact that each memory pool functions as a single segment of memory with a contiguous block of memory addresses that can grow and shrink as needed, avoiding fragmentation from allocating multiple non-contiguous segments that may not be merged together. The caching allocator handles this by creating an unmapped block for the entire reserved virtual address space at init, which is treated similarly to an unallocated block in a free pool. When callers call `malloc()`, it's split and mapped to create allocated blocks, and calling `free()` similarly caches and merges free blocks in a free pool to be used later. Expandable blocks are unmapped and returned back to Cuda when they are cleaned up, or when we hit an OOM and the allocator attempts to remap cached free blocks. The code paths to map, free, and unmap blocks in expandable segments is similar to that for normal blocks and does all the same work of updating stats on memory usage, moving blocks between active and free pools, and returning memory to Cuda. With Cuda Graph Trees and private memory pools, we need the ability to take checkpoints of the current state of the memory allocator after each graph capture as well as reapplying the state before capturing a new graph after replaying a captured graph so that the new cuda graph capture has access to the state of the allocator at the point after replaying a previously captured graph so it can reuse empty blocks and allocate new ones. As mentioned in a below comment, memory in a private pool is cached until the private pool is destroyed and allocations can only grow from extra graph captures, any freeing of memory would result in invalid memory addresses and would break cuda graphs. One implementation detail to note for unmapped blocks with expandable segments is that unmapped blocks are kept track in a member variable `unmapped` of a `BlockPool`. `unmapped` is not part of the checkpointed state of the caching allocator and isn't restored when reapplying checkpoints since we never free/unmap memory back to cuda and is persisted across graph captures / replays. Checkpointing the current state of the memory allocator works as expected with expandable segments. Checkpointing grabs the first block of every segment in the active and free pools of the private pool and traverses the linked list of blocks in the segment to capture the state of every segment, which is then saved and kept for when it is needed to be reapplied. For expandable blocks, the last block in every segment will be an unallocated unmapped block containing the remaining amount of unmapped memory at graph capture time, and this too is saved in the checkpoint. Reapplying the checkpoints works by freeing all allocated blocks and merging them into a single block per segment, then for each segment, we manually split and allocate all blocks from the checkpoint and then free the blocks marked as unallocated in the checkpoint state. For expandable segments, we need to make some modifications to not split unmapped blocks and avoid manually mapping then freeing unmapped blocks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128068 Approved by: https://github.com/zdevito, https://github.com/eqy	2024-07-11 05:33:09 +00:00
Tom Ritchford	7d4cb21098	Decompose expand_copy and permute_copy (#129476 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129476 Approved by: https://github.com/amjames, https://github.com/lezcano	2024-07-10 17:12:01 +00:00

... 4 5 6 7 8 ...

5527 Commits