Motivation:
- The API is used for registering an implementation for a specific
device type.
- "impl" is ambiguous and can be confused with Library.impl.
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124200
Approved by: https://github.com/albanD
ghstack dependencies: #124180
This PR adds fast semi-structured sparsification kernels to PyTorch.
These kernels allow for accelerated semi-structured sparsification
in PyTorch.
The kernels have been added as aten native functions.
In particular, three new functions have been added:
* `torch._sparse_semi_structured_tile`
This function will return the packed representation and metadata for
both X and X', as well as the thread masks. Note that this applies 2:4
sparsity in a 4x4 tile instead of a 1x4 strip as usual.
* `torch._sparse_semi_structured_apply`
This function takes in an input tensor and thread masks from the above
function and returns a packed representation and metadata from applying
thread masks to the input tensor.
* `torch._sparse_semi_structured_apply_dense`
This function does the same thing as above, but instead of returning the
tensor in the sparse representation, it returns it in the dense
representation.
The subclasses have also been updated to add a new
`prune_dense_static_sort`
classmethod to create sparse tensors with this format. I've added some
additional documentation on how to calculate the compressed tensors
needed to create a SparseSemiStructuredTensor oneself.
To this end, there are two new helper functions added:
* `sparse_semi_structured_tile`
* `compute_compressed_swizzled_bitmask`
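For reference, a hypothetical usage sketch of the entry points described above (the exact number and order of the returned tensors is an assumption here; treat it as illustrative rather than the precise signatures):
```python
import torch

dense = torch.randn(128, 128, dtype=torch.float16, device="cuda")

# Pack X and X' and compute the per-thread masks (2:4 within each 4x4 tile).
outs = torch._sparse_semi_structured_tile(dense)
threads_masks = outs[-1]  # assumption: the thread masks are the last output

# Re-apply the cached masks to another tensor of the same shape,
# getting back the packed representation and metadata...
packed_new = torch._sparse_semi_structured_apply(dense * 2, threads_masks)

# ...or the pruned tensor materialized in the dense layout.
pruned_dense = torch._sparse_semi_structured_apply_dense(dense * 2, threads_masks)
```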
Differential Revision: [D56190801](https://our.internmc.facebook.com/intern/diff/D56190801)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122350
Approved by: https://github.com/cpuhrsch
Summary:
```
ncclGroupStart()
ncclCommInit(..)
ncclGroupEnd()
```
The above pattern is only needed when we have a *single thread* managing multiple GPUs.
In our case, we always have one process managing one GPU, so we don't need the group operations.
Test Plan: CI
Differential Revision: D56274975
Co-authored-by: Cen Zhao <cenzhao@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124416
Approved by: https://github.com/shuqiangzhang
On par with the `CUDA` implementation.
The `autocast` logic is the same as for `CUDA` + `Fused Adam`:
- check for `inf` in `gradscaler.step`
- in the fused kernel, if there is an `inf`, do nothing; otherwise, unscale the grad (also writing it back) and update the param.
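As a usage sketch (the module and hyperparameters are illustrative; `fused=True` on CPU parameters is what this PR enables):
```python
import torch

model = torch.nn.Linear(64, 64)  # CPU module
opt = torch.optim.Adam(model.parameters(), lr=1e-3, fused=True)  # fused CPU Adam

x = torch.randn(8, 64)
loss = model(x).sum()
loss.backward()
opt.step()
opt.zero_grad()
```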
**TestPlan**:
```
# extend CUDA only test for CPU fused adagrad
python test_optim.py -k test_fused_matches_forloop
python test_optim.py -k test_fused_large_tensor
python test_torch.py -k test_grad_scaling_autocast_fused
# extend fused test
python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step
python test_optim.py -k test_can_load_older_state_dict
# newly added test (follow 6b1f13ea2f/test/test_cuda.py (L1108))
python test_optim.py -k test_grad_scaling_autocast_fused_optimizers
```
**Benchmark**:
**5.1x** speedup on a 56-core SPR
**Parameter-size=1M**
**Nparams=10**
[test script](https://gist.github.com/zhuhaozhe/ef9a290ad3f8f4067b3373a3bdaa33e7)
```
numactl -C 0-55 -m 0 python bench_adam.py
non-fused 6.0174267292022705 s
fused 1.1787631511688232 s
```
**Note: Fused kernel accuracy**
The accuracy failure in CI shows a mismatch slightly above the default tolerance:
```
2024-04-02T06:09:16.2213887Z Mismatched elements: 21 / 64 (32.8%)
2024-04-02T06:09:16.2214339Z Greatest absolute difference: 1.5735626220703125e-05 at index (6, 6) (up to 1e-05 allowed)
2024-04-02T06:09:16.2214813Z Greatest relative difference: 1.0073336852656212e-05 at index (4, 1) (up to 1.3e-06 allowed)
```
I have debugged it step by step and, unfortunately, we may not be able to make the `fused kernel` produce exactly the same results as the `non-fused` one due to compiler optimizations.
For example, in the non-fused impl:
```
exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2)
```
and in the fused impl:
```
exp_avg_sq_ptr[d] = scalar_t(beta2) * exp_avg_sq_ptr[d];
// std::cout << "exp_avg_sq " << exp_avg_sq_ptr[d] << std::endl;
exp_avg_sq_ptr[d] = exp_avg_sq_ptr[d] +
    scalar_t(exp_avg_sq_grad_coefficient) * grad_val * grad_val;
```
If I keep the `std::cout`, I get exactly the same results in the UT:
```
===============param
0.6796758770942688
0.6796758770942688
```
But when I comment it out, there is a difference:
```
===============param
0.6796758770942688
0.6796759366989136
```
So I will make the tolerance a little higher than the default.
Co-authored-by: Jane Xu <janeyx@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123074
Approved by: https://github.com/jgong5, https://github.com/janeyx99
# PR
This PR supports mutating inputs in cudagraph trees, if these inputs are outputs from previous cudagraph. Please check #121861 for more details.
# Note on Optimistic Mutation Check
To determine whether to apply cudagraph, we need to check for input mutations, which fall into four categories: a) no mutation, b) mutation on parameters/buffers, c) mutation on cudagraph-recorded tensors, d) mutation on non-cudagraph-recorded tensors. We can apply cudagraph for types a, b, and c, but not for type d. The input mutation type depends on the function, the current_node, and the inputs.
Since `check_for_mutation` is slow, there is a trade-off between making type c and type d faster.
- To make type d) faster, we want to `check_for_mutation` and call the eager function early. However, this adds unnecessary overhead to types a, b, c due to the extra check.
- To make type c) faster, we want to skip `check_for_mutation` at the beginning and only `check_for_mutation` before `record_function` for a new function. This removes the overhead of `check_for_mutation` for types a, b, c. However, this adds extra overhead to type d due to `check_invariants` for all child nodes.
Instead, we design an optimistic mutation check. The assumption is that, given a function and a node, the input mutation type usually remains the same across inputs. So, if we have ever detected a function on a node as type d, we will never detect it as type c. The detailed design is:
- [Slow Path] On the first invocation of a function on a node, we run `check_for_mutation` once and cache the input mutation type as `non_cudagraph_managed_mutation[node_id][func_id]`.
- [Fast Path] On subsequent invocations of a function on a node, we skip `check_for_mutation`. If `non_cudagraph_managed_mutation[node_id][func_id]` is true, we directly call the eager function. Otherwise, we `check_invariants` and call the cudagraph function.
- [Slow Path] Before `record_function`, we run `check_for_mutation` again.
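A schematic sketch of this dispatch logic (illustrative pseudocode only; the helper names loosely follow the description above but do not match the actual cudagraph-trees implementation):
```python
# (node_id, func_id) -> True if the function mutates non-cudagraph-managed inputs (type d)
non_cudagraph_managed_mutation = {}

def run(func, node, inputs):
    key = (node.id, func.id)
    if key not in non_cudagraph_managed_mutation:
        # Slow path: first invocation of this function on this node.
        non_cudagraph_managed_mutation[key] = has_non_cudagraph_managed_mutation(func, inputs)
    if non_cudagraph_managed_mutation[key]:
        return run_eager(func, inputs)               # type d: fall back to eager
    if is_recorded(func, node) and check_invariants(node, inputs):
        return replay_cudagraph(func, node, inputs)  # fast path for types a, b, c
    # Slow path: re-check right before recording a new cudagraph.
    if has_non_cudagraph_managed_mutation(func, inputs):
        non_cudagraph_managed_mutation[key] = True
        return run_eager(func, inputs)
    return record_function(func, node, inputs)
```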
**Q1: Would there be overhead for type a,b,c,d?**
A: No. We only check input mutation types for the first invocation of a function on a node.
**Q2: If a function happens to be type c during the first invocation on a node, could we detect it as type d in the future?**
A: Yes. This is done by `check_invariants` and guarantees correctness.
**Q3: If a function happens to be type d during the first invocation on a node, could it still be recognized as type c in the future?**
A: No. But this should happen rarely according to our assumption. In the rare case that it happens, there would not be any correctness issues and the performance is the same as the eager (or inductor optimized) function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123231
Approved by: https://github.com/eellison
Closes #114966
Frozen field assignment in `__init__` in Python 3.8-3.9:
f5bd65ed37/Lib/dataclasses.py (L402-L411)
```python
import builtins
BUILTINS = builtins

def _field_assign(frozen, name, value, self_name):
    # If we're a frozen class, then assign to our fields in __init__
    # via object.__setattr__. Otherwise, just use a simple
    # assignment.
    #
    # self_name is what "self" is called in this function: don't
    # hard-code "self", since that might be a field name.
    if frozen:
        return f'BUILTINS.object.__setattr__({self_name},{name!r},{value})'
    return f'{self_name}.{name}={value}'
```
Frozen field assignment in `__init__` in Python 3.10+:
812245ecce/Lib/dataclasses.py (L436-L445)
```python
__dataclass_builtins_object__ = object

def _field_assign(frozen, name, value, self_name):
    # If we're a frozen class, then assign to our fields in __init__
    # via object.__setattr__. Otherwise, just use a simple
    # assignment.
    #
    # self_name is what "self" is called in this function: don't
    # hard-code "self", since that might be a field name.
    if frozen:
        return f'__dataclass_builtins_object__.__setattr__({self_name},{name!r},{value})'
    return f'{self_name}.{name}={value}'
```
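For context, a small pure-Python example exercising the generated `__init__` shown above (illustrative, not from the PR):
```python
import dataclasses

@dataclasses.dataclass(frozen=True)
class Point:
    x: int
    y: int

p = Point(1, 2)
# Direct assignment like `p.x = 3` raises dataclasses.FrozenInstanceError;
# __init__ itself assigns fields via object.__setattr__ (spelled
# BUILTINS.object.__setattr__ on 3.8-3.9 and
# __dataclass_builtins_object__.__setattr__ on 3.10+).
```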
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124393
Approved by: https://github.com/jansel
Fixes a bug where a reference to `_ProcessGroupWrapper` is used without first checking whether gloo is available. This fails on PyTorch builds that do not include gloo because `_ProcessGroupWrapper` is only pybinded when building with `USE_GLOO=1`. Therefore, creation of a new process group fails with a `NameError` when only NCCL is available as the backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124233
Approved by: https://github.com/rohan-varma, https://github.com/d4l3k
This PR unifies CUDA, XPU, and PrivateUse1 in the torch profiler. CUDA, XPU, and PrivateUse1 now all use the string `use_device` to distinguish one another, and share one device path for calculating kineto time durations and memory statistics during post-processing.
#suppress-api-compatibility-check
Co-authored-by: Aaron Enye Shi <enye.shi@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123247
Approved by: https://github.com/aaronenyeshi, https://github.com/gujinghui
Also partially fixes #122109
This PR:
- We add a C++ flag (only_lift_cpu_tensors) to toggle the
torch.tensor(1, device='cuda') ctor strategy.
When false (default), it does the current PyTorch behavior
of unconditionally constructing a concrete CUDA tensor then calling
lift_fresh on it. When true, we instead construct a concrete CPU
tensor, call lift_fresh, and then call Tensor.to(device) (under any ambient
modes).
- FakeTensorMode flips this flag depending on whether CUDA is available or
not. We don't unconditionally set the flag to True because that would
likely be BC-breaking.
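A minimal sketch of the behavior this enables (assuming a machine without CUDA; the flag itself is C++-internal and toggled by FakeTensorMode):
```python
import torch
from torch._subclasses.fake_tensor import FakeTensorMode

with FakeTensorMode():
    # With only_lift_cpu_tensors enabled, no concrete CUDA tensor is ever
    # constructed: a CPU tensor is created, lifted, and moved to 'cuda'
    # under the fake mode.
    t = torch.tensor(1, device="cuda")
    print(t.device)  # cuda:0, even though no GPU is present
```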
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124413
Approved by: https://github.com/eellison
This adds a templated version of the ring attention forward function and tests it with memory-efficient attention. This doesn't add support for memory-efficient attention in DTensor; that will be added in a follow-up PR.
This templating is also a POC of how to support other attention ops, such as jagged/nested tensor attention, as well as how to implement striped attention in a scalable way.
Misc changes:
* Fixes all_to_all_single autograd implementation with CUDA + adds NCCL test
* Adds compile support to the ring attention implementations (required some tweaks to process groups)
Test plan:
```
pytest test/distributed/_tensor/test_attention.py
pytest test/distributed/test_functional_api.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124215
Approved by: https://github.com/wanchaol
Summary: In AOTInductor generated CPU model code, there can be direct references to some aten/c10 utility functions and data structures, e.g. at::vec and c10::Half. These are performance critical and thus it doesn't make sense to create C shim for them. Instead, we make sure they are implemented in a header-only way, and use this set of tests to guard future changes.
There are more header files to be updated, but we will do it in other followup PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123848
Approved by: https://github.com/jansel
ghstack dependencies: #123847
Summary:
This env var was introduced to safely roll out the behavior change in destroy
process group (e.g., call ncclCommsAbort). Now that this behavior change
has already been rolled out, we no longer need this env var, and we should clean
it up to keep our code cleaner.
Test Plan:
Modified/existing ut pass
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124334
Approved by: https://github.com/wconstab
Previously, we didn't expand the shape of the example_value of map to match the inputs (edit: the first mapped dimension). This PR fixes that bug. To make this easier, we change _call_function_and_unflatten_output to accept example_values directly instead of retrieving them from the variable trackers.
Also removes a redundant call_function node in the strict_mode higher-order op in dynamo.
Test Plan:
existing tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124203
Approved by: https://github.com/ezyang, https://github.com/zou3519
#121313 changed precompiled patterns so they are more integrated with the pattern-matching code. This resulted in a list of "known" patterns (with their example data) being stored globally. Unfortunately, since small FakeTensors store the original tensor as a constant, this meant that we leaked CUDA tensors in the example data.
Fix this by clearing out the constant storage for the example data that we keep around.
Fixes #124081
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124345
Approved by: https://github.com/xuzhao9
The MTIA device now has its own module in PyTorch.
torch.mtia has the following APIs, similar to other backends. lazy_init is also supported.
```
__all__ = [
    "init",
    "is_available",
    "synchronize",
    "device_count",
    "current_device",
    "current_stream",
    "default_stream",
    "set_stream",
    "stream",
    "device",
]
```
------------
For device management, we expand AcceleratorHooksInterface to support generic device management; it can be used in both C++ and Python.
```
def _accelerator_hooks_device_count() -> _int: ...
def _accelerator_hooks_set_current_device(device_index: _int) -> None: ...
def _accelerator_hooks_get_current_device() -> _int : ...
def _accelerator_hooks_exchange_device(device_index: _int) -> _int : ...
def _accelerator_hooks_maybe_exchange_device(device_index: _int) -> _int : ...
```
---------
Adds a get_device_module API to retrieve the device module for different device types.
```
def get_device_module(device: Optional[Union[torch.device, str]] = None)
```
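A minimal usage sketch (the device string is illustrative; calling `get_device_module` with no argument is assumed to return the module for the current accelerator):
```python
import torch

mod = torch.get_device_module("cuda")  # e.g. returns torch.cuda
print(mod.device_count(), mod.current_device())
mod.synchronize()
```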
---------
@exported-using-ghexport
Differential Revision: [D52923602](https://our.internmc.facebook.com/intern/diff/D52923602/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123612
Approved by: https://github.com/albanD
ghstack dependencies: #123611
This diff builds a device-generic torch.Stream and torch.Event for newly added accelerators in PyTorch.
------------
**torch.Stream APIs**
```
# Defined in torch/csrc/Stream.cpp
class Stream(_StreamBase):
    stream_id: _int  # Stream id
    device_index: _int
    device_type: _int
    device: _device  # The device of the stream

    @overload
    def __new__(self, device: Optional[DeviceLikeType] = None, priority: _int = 0) -> Stream: ...
    @overload
    def __new__(self, stream_id: _int, device_index: _int, device_type: _int, priority: _int = 0) -> Stream: ...
    def query(self) -> _bool: ...
    def synchronize(self) -> None: ...
    def wait_event(self, event: Event) -> None: ...
    def wait_stream(self, other: Stream) -> None: ...
    def record_event(self, event: Optional[Event] = None) -> Event: ...
    def __hash__(self) -> _int: ...
    def __repr__(self) -> str: ...
    def __eq__(self, other: object) -> _bool: ...
```
------------------
**torch.Event APIs**:
- IPC-related APIs are not implemented, since many device backends don't support them, but we leave the interfaces there for future adaptation of torch.cuda.Stream.
- Currently only enable_timing is supported, since it is the flag most commonly used in other device backends. We would have to refactor the event flag system in PyTorch to support fancier flags.
- An elapsedTime API is added to c10::Event.
```
# Defined in torch/csrc/Event.cpp
class Event(_EventBase):
    device: _device  # The device of the Event
    event_id: _int  # The raw event created by device backend

    def __new__(self,
                device: Optional[DeviceLikeType] = None,
                enable_timing: _bool = False,
                blocking: _bool = False,
                interprocess: _bool = False) -> Event: ...
    @classmethod
    def from_ipc_handle(self, device: DeviceLikeType, ipc_handle: bytes) -> Event: ...
    def record(self, stream: Optional[Stream] = None) -> None: ...
    def wait(self, stream: Optional[Stream] = None) -> None: ...
    def query(self) -> _bool: ...
    def elapsed_time(self, other: Event) -> _float: ...
    def synchronize(self) -> None: ...
    def ipc_handle(self) -> bytes: ...
    def __repr__(self) -> str: ...
```
-----------
c10::Event provides new APIs to:
- calculate **elapsedTime**
- get the raw event id
- synchronize the event
```
double elapsedTime(const Event& event) const {
  return impl_.elapsedTime(event.impl_);
}

void* eventId() const {
  return impl_.eventId();
}

void synchronize() const {
  return impl_.synchronize();
}
```
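A minimal usage sketch of the device-generic objects (the device type shown is an assumption; any backend implementing the hooks should work):
```python
import torch

s = torch.Stream(device="cuda")
e = torch.Event(enable_timing=True)
e.record(s)        # record the event on the stream
s.synchronize()    # block until the stream's work is done
print(e.query())   # True once the recorded work has completed
```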
----------
TODO: need to find a good way to test them in PyTorch with API mocks.
Differential Revision: [D55351839](https://our.internmc.facebook.com/intern/diff/D55351839/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123611
Approved by: https://github.com/albanD
The `recurse` argument was not being respected for `set_requires_gradient_sync`. This PR fixes that.
The previous unit test did not have nested FSDP modules with managed parameters, so `recurse=False` was not being exercised. We augment the unit test to disable gradient sync only for the root module and not its children.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124318
Approved by: https://github.com/weifengpy
ghstack dependencies: #120952, #124293
A kernel has "dispatcher convention" if there is an additional keyset
arg at the beginning of the argument list. This PR:
- adds a way to register kernels with dispatcher_convention using
Library.impl (pass dispatcher_convention = True)
- adds OpOverload.redispatch
We use both of the above in the new custom ops API: we register the
autograd kernel in dispatcher convention so that we can call
redispatch the way PyTorch built-in ops do.
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124089
Approved by: https://github.com/albanD
ghstack dependencies: #123937, #124064, #124065, #124066, #124071
We allow it to accept:
- a string with the op name
- an opoverload
- a new-style custom op
If any of these refers to a new-style custom op (created with the
custom_op decorator), then we dispatch to CustomOpDef.register_fake.
Otherwise, we do what we did previously.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124066
Approved by: https://github.com/albanD
ghstack dependencies: #123937, #124064, #124065
Fixes https://github.com/pytorch/pytorch/issues/119607 for 3.11+.
In 3.11+, `_PyFrame_FastToLocalsWithError` could implicitly run `COPY_FREE_VARS` on the original frame, leading to double increfs since the dynamo shadow frame can rerun `COPY_FREE_VARS`. So the solution is to skip the first `COPY_FREE_VARS` instruction in the shadow frame if it was already executed in the original frame.
Also move the location for clearing the original frame in 3.12 to handle error cases more thoroughly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124238
Approved by: https://github.com/jansel
Removed a bunch of skips. I also updated test_forloop_goes_right_direction to *not* use the closure when dynamo is tracing; the reason is that testing the disabled optimizer doesn't actually test anything.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123322
Approved by: https://github.com/janeyx99
ghstack dependencies: #123498
Motivations:
- This makes things more consistent: using a Library object, you should
be able to use all of the registration APIs that tie registrations to
the lifetime of the Library.
- I need this for the next PR up in the stack, where we will have
torch.library.register_fake support both CustomOpDef (from the new
custom ops API) and other custom ops.
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124065
Approved by: https://github.com/albanD
ghstack dependencies: #123937, #124064
Previously, if someone used `register_fake` to add a fake impl for an
operator defined in C++, we would require them to add a
`m.set_python_module(<module>)` call to C++. This was to avoid
situations where a user imported the C++ operator without importing the
fake impl.
This "breaks" open registration: there's no way to add a fake impl
outside of a repository that defines an operator, so we want to turn
this behavior off by default in open source.
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124064
Approved by: https://github.com/albanD
ghstack dependencies: #123937
I'm going to set up some extra behavior when we set the example value, so
I need a convenient place to interpose. I cannot easily do it on
meta itself because it's a generic dict with no interposition point.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124176
Approved by: https://github.com/oulgen
ghstack dependencies: #124105, #124059
This PR:
- adds a new torch.library.register_fake and deprecates
torch.library.impl_abstract. The motivation is that we have a lot of
confusion around the naming so we are going to align the naming with
the actual subsystem (FakeTensor).
- renames `m.impl_abstract_pystub("fbgemm_gpu.sparse_ops")` to
`m.has_python_registration("fbgemm_gpu.sparse_ops")`. No deprecation
here yet; I need to test how this works with static initialization.
- Renames a bunch of internals to match (e.g. abstractimplpystub ->
pystub)
I'm scared to rename the Python-side internal APIs (e.g.
torch._library.abstract_impl) because of torch.package concerns. I'll do
that in its own isolated PR next just in case it causes problems.
DEPRECATION NOTE: torch.library.impl_abstract was renamed to
torch.library.register_fake. Please use register_fake. We'll delete
impl_abstract in a future version of PyTorch.
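A minimal sketch of the new name in use (toy custom op; mirrors the previous impl_abstract usage):
```python
import torch

@torch.library.custom_op("mylib::twice", mutates_args=())
def twice(x: torch.Tensor) -> torch.Tensor:
    return x * 2

@torch.library.register_fake("mylib::twice")
def _(x):
    # Fake impl: only describes output metadata, no real compute.
    return torch.empty_like(x)
```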
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123937
Approved by: https://github.com/albanD
Two changes:
- In epilogue benchmark fusion, only take the top 6 choices; there were basically no choices taken after this in HF.
- Share a single precompilation function among matmuls with the same key.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122642
Approved by: https://github.com/shunting314
ghstack dependencies: #124030
Summary: When applying FSDP-2 to the FM-FB benchmark with the FullModel model, we ran into an error because one of the output tensors of a forward pass is None. I double-checked that the same output tensor is also None in FSDP-1, so we just need to handle the None properly here.
Test Plan:
See that in the internal diff.
Differential Revision: D56087956
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123988
Approved by: https://github.com/awgu
This PR adds an `unshard(async_op: bool = False)` API to manually unshard the parameters via all-gather. This can be used for reordering the all-gather with other collectives (e.g. all-to-all).
This currently requires the user to set `TORCH_NCCL_AVOID_RECORD_STREAMS=1` to avoid `recordStream` from `ProcessGroupNCCL` and get expected memory behaviors.
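A sketch of reordering the all-gather manually (assumes a torchrun launch with an initialized process group; the module and shapes are illustrative, and the handle-based wait follows the `async_op` semantics above):
```python
import torch
import torch.distributed as dist
from torch.distributed._composable.fsdp import fully_shard

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())
model = fully_shard(torch.nn.Linear(16, 16, device="cuda"))

handle = model.unshard(async_op=True)   # launch the parameter all-gather early
# ... other collectives (e.g. all-to-all) can be issued here to overlap ...
handle.wait()                           # parameters are unsharded after this
out = model(torch.randn(8, 16, device="cuda"))
```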
Differential Revision: [D56148725](https://our.internmc.facebook.com/intern/diff/D56148725)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120952
Approved by: https://github.com/wanchaol
- Add serial marker for individual tests so the test file can be removed from the CI serial list
- Run serial-marked tests first, in serial
- Run all other tests afterwards, in parallel
- Slowly reduce the list and mark individual tests as serial instead
- Hope the # of serial tests is small so sharding evenness doesn't get too messed up
- Hopefully can do 3 procs for sm86 and cpu?

serial no longer looks like a real word to me
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124085
Approved by: https://github.com/seemethere, https://github.com/malfet
Summary: https://github.com/pytorch/pytorch/pull/123452 added
backward support to this op by turning it into
CompositeImplicitAutograd, which meant it gets decomposed during
export/compile. However, this is not desirable behavior for the
PTQ case when we try to lower the model. This commit enables
QAT without breaking PTQ by refactoring the impl into a separate
op that does have backward support.
Test Plan:
python test/test_quantization.py -k test_decomposed_choose_qparams_per_token_asymmetric_backward
Reviewers: jerryzh168, digantdesai, zou3519
Subscribers: jerryzh168, digantdesai, zou3519, supriyar
Differential Revision: [D56192116](https://our.internmc.facebook.com/intern/diff/D56192116)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124178
Approved by: https://github.com/digantdesai
**Summary**
We wrap DTensor's local tensor in `LocalShardsWrapper` for torchrec's table-wise sharding. The exception is on non-participating ranks: there, the local tensor is an empty torch.Tensor object. The reason for this design is to avoid the complexity of supporting the empty-tensor case in `LocalShardsWrapper`.
**Test**
`torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/torchrec_sharding_example.py -e table-wise`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122853
Approved by: https://github.com/wz337
ghstack dependencies: #120265, #121392, #122843
**Summary**
Always wrap the local tensor in a `LocalShardsWrapper`. This is for uniformity and makes it easier to adopt DTensor as a wrapper for the local shard(s) representation. To support more tensor ops over `LocalShardsWrapper`, users need to extend its `__torch_dispatch__`.
**Test**
`torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/torchrec_sharding_example.py -e row-wise-even`
**Result**
```
Row-wise even sharding example in DTensor
         Col 0-15
-------  ----------
Row 0-1  cuda:0
Row 2-3  cuda:1
Row 4-5  cuda:2
Row 6-7  cuda:3
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122843
Approved by: https://github.com/wz337
ghstack dependencies: #120265, #121392
**Summary**
This PR serves as the start of this effort by adding an example test that represents TorchRec's `ShardingType.TABLE_WISE` using DTensor.
**Test**
`torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/torchrec_sharding_example.py -e table-wise`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120265
Approved by: https://github.com/wanchaol
Fixes https://github.com/pytorch/pytorch/issues/98921
There were two issues detected:
- `MultiStepLR`: issue is described in https://github.com/pytorch/pytorch/issues/98921, this is resolved by allowlisting `collections.Counter`
- `OneCycleLR`: `state_dict['anneal_func']` is either `<function OneCycleLR._annealing_cos at 0x7f364186f5b0>` or
`<function OneCycleLR._annealing_linear at 0x7f39aa483640>` depending on the `anneal_func` kwarg.
This leads to `WeightsUnpickler error: Unsupported class __builtin__.getattr` from the `weights_only` Unpickler.
Fixed the above in a BC-compatible manner by adding `OneCycleLR._anneal_func_type` as a string attribute and removing `OneCycleLR.anneal_func`.
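A small repro of the now-working path (the model and optimizer are placeholders):
```python
import torch

model = torch.nn.Linear(2, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[5, 10])

torch.save(sched.state_dict(), "sched.pt")
# Previously raised "Unsupported class" for collections.Counter under weights_only.
state = torch.load("sched.pt", weights_only=True)
sched.load_state_dict(state)
```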
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123775
Approved by: https://github.com/albanD, https://github.com/malfet
Summary:
Pass the process group name and desc to the NCCL communicator in order to access pg information in the NCCL layer.
The information is passed as the commDesc string (i.e. "<pg_desc>:<pg_name>").
This functionality is only enabled when NCCL_COMM_DESCRIPTION is defined.
Differential Revision: D55703310
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124149
Approved by: https://github.com/shuqiangzhang
Fixes #104729
This improves the compiled-mode performance of Softmax (by 20%) and other operations (like batchnorm) that invoke the reduce_all function, thereby also improving BERT inference by around 8%.
Tested on a Graviton 3 instance (c7g.4xl). Tests were run single-threaded.
Script attached below.
Command: `OMP_NUM_THREADS=1 LRU_CACHE_CAPACITY=1024 DNNL_DEFAULT_FPMATH_MODE=BF16 python TestSoftmax.py`
[TestSoftmax.txt](https://github.com/pytorch/pytorch/files/14910754/TestSoftmax.txt)
```python
import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity

model = nn.Softmax().eval()
compiled_model = torch.compile(model)
inputs = torch.randn(1024, 1024)

with torch.set_grad_enabled(False):
    for _ in range(50):
        compiled_model(inputs)  # Warmup
    print("Warmup over")
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        with record_function("model_inference"):
            for _ in range(100):
                compiled_model(inputs)
    print(prof.key_averages().table(sort_by="self_cpu_time_total"))
    # Check if the compiled model inference and the eager model inference are similar using torch.allclose
    print(torch.allclose(compiled_model(inputs), model(inputs)))
```
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123584
Approved by: https://github.com/jgong5, https://github.com/malfet
I found that returning the copy is actually useful in situations where you might do something like:
```
ret = _copy_state_dict(obj, cache)
ret.update(some_other_values)
```
and would like `cache` not to change structure as a result of `ret.update(some_other_values)`. Open to suggestions here; not returning a copy might force the user to make some additional copies for this case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123567
Approved by: https://github.com/wz337
Summary: `matrix_instr_nonkdim` and `waves_per_eu` are AMD-specific launch configs that can't be treated as function input args.
Test Plan:
HIP_VISIBLE_DEVICES=7 numactl --cpunodebind=1 --membind=1 buck2 run mode/{opt,amd-gpu} -c fbcode.triton_backend=amd -c fbcode.enable_gpu_sections=true -c fbcode.rocm_arch=mi300 //hammer/modules/sequential/encoders/tests:hstu_bench -- --torch-compile=True
the E2E works well on the magic model
Differential Revision: D56165438
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124146
Approved by: https://github.com/aakhundov
Summary: With the merge of D55925068, we introduced an overflow issue when recording a trace using dyno gputrace. This is because it is possible for TorchOps to be enumerated but not have an end time if they were still running when the recording ended. By default these events have an end time set to INT_MIN. When computing duration() for such events as end - start, we get an overflow resulting in a very long duration. This was avoided before because we were dividing INT_MIN by 1000 while converting between microseconds and nanoseconds. This change introduces a patch for TorchOps; a future PR will add a more universal guard in kineto.
Test Plan:
Trace recorded using resnet test.
Trace:
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/0/1713199267/localhost/libkineto_activities_2247224.json.gz&bucket=gpu_traces
Differential Revision: D56144914
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124080
Approved by: https://github.com/aaronenyeshi
Summary:
As part of the work of unifying the process group identifier, log <group_name, group_desc> instead of the pg uid in the profiler.
- group_name remains the unique identifier, e.g. "0", "1"
- group_desc is the user-specified name, e.g. "fsdp"
Reviewed By: aaronenyeshi, kwen2501
Differential Revision: D55610682
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124035
Approved by: https://github.com/aaronenyeshi
Fixes #121200
This PR introduces AcceleratorOutOfMemoryError for all privateuse1 backends. For Python, there is a PyError object that will be set only when privateuse1 is registered. All privateuse1 backends can then use this error for memory errors. More error types may follow in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121702
Approved by: https://github.com/guangyey, https://github.com/albanD
**Overview**
This PR adds pre/post-all-gather extensions to FSDP2.
- The pre/post-all-gather extensions are specified at the tensor-level on the `sharded_param._local_tensor` (i.e. the tensor wrapped by the sharded `DTensor`). If the user has a tensor-subclass parameter on the module passed to FSDP that preserves the subclass through the sharding ops (e.g. `new_zeros`, `chunk`, etc.), then the `sharded_param._local_tensor` will naturally be of that subclass.
- The pre-all-gather function has signature:
```
def fsdp_pre_all_gather(self) -> Tuple[Tuple[torch.Tensor, ...], Any]
```
- The first return value is a `Tuple[torch.Tensor, ...]` of the all-gather inputs. It is a tuple since a subclass could contribute >1 inner tensors.
- The second return value is any optional metadata needed to pass through to the post-all-gather.
- The post all-gather function has signature:
```
def fsdp_post_all_gather(
    self,
    all_gather_outputs: Tuple[torch.Tensor, ...],
    metadata: Any,
    param_dtype: torch.dtype,
    *,
    out: Optional[torch.Tensor] = None,
) -> Union[Tuple[torch.Tensor, Tuple[torch.Tensor, ...]], None]:
```
- The `all_gather_outputs` are exactly the all-gathered versions of the `fsdp_pre_all_gather` 1st return value (representing the all-gather inputs). We make sure to unflatten these back to ND for the user.
- The `metadata` is the `fsdp_pre_all_gather` 2nd return value, untouched.
- The `param_dtype` is the parameter dtype based on the passed-in `MixedPrecisionPolicy`. Namely, if no policy is passed in, then `param_dtype` is the original dtype, and otherwise, it is the `MixedPrecisionPolicy.param_dtype`.
- If `out` is not specified, then the return value has type `Tuple[torch.Tensor, Tuple[torch.Tensor, ...]]`. The first tuple item is the unsharded parameter (e.g. re-wrapping into some subclass). The second tuple item is a tuple of unsharded inner tensors that FSDP should free during reshard. These should be derived from the all-gather outputs.
- The `out` argument is required due to FSDP's `resize_` usage. We require an in-place variant for the backward all-gather. Here, `out` will be exactly the object returned as the first tuple item in the out-of-place variant mentioned before. The unsharded inner tensors will be allocated before calling `fsdp_post_all_gather`. When `out` is specified, the `fsdp_post_all_gather` should return `None`. If the post-all-gather does not do any out-of-place ops, then the `out` variant can just be a no-op since the unsharded inner tensors will be the same as the all-gather outputs, which FSDP directly writes to after all-gather. (E.g., this is the case for both float8 and `NF4Tensor`.)
- We check for `fsdp_pre_all_gather` and `fsdp_post_all_gather` directly via `hasattr` to accommodate monkey patching so that we do not strictly require the user to use a tensor subclass. The monkey patch must happen after the local tensors have been finalized (after applying FSDP and after any meta-device init).
- For now, we require that all gradients in one FSDP parameter group share the same dtype. This is fine for float8 and `NF4Tensor` use cases. If this requirement is too strict, then in the future we can issue 1 reduce-scatter per dtype per group.
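A schematic skeleton of the two extension points (the signatures follow the description above; the subclass and its dtype logic are purely illustrative, and a real implementation would be a proper wrapper tensor subclass such as float8 or `NF4Tensor`):
```python
from typing import Any, Optional, Tuple, Union
import torch

class MyShardedTensor(torch.Tensor):
    def fsdp_pre_all_gather(self) -> Tuple[Tuple[torch.Tensor, ...], Any]:
        # Contribute one inner tensor to the all-gather; no extra metadata.
        return (self.to(torch.float16),), None

    def fsdp_post_all_gather(
        self,
        all_gather_outputs: Tuple[torch.Tensor, ...],
        metadata: Any,
        param_dtype: torch.dtype,
        *,
        out: Optional[torch.Tensor] = None,
    ) -> Union[Tuple[torch.Tensor, Tuple[torch.Tensor, ...]], None]:
        (gathered,) = all_gather_outputs
        if out is not None:
            # In-place variant (backward): the unsharded inner tensor is the
            # all-gather output itself, so there is nothing more to do.
            return None
        unsharded = gathered.to(param_dtype)
        # Return (unsharded param, inner tensors FSDP may free on reshard).
        return unsharded, (gathered,)
```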
**Design Notes**
- We assume that the `sharded_param._local_tensor` is padded on dim-0.
- This assumption should not block immediate use cases, and when we pad the `DTensor._local_tensor` by default, this assumption will always be true.
- This assumption allows us to call `sharded_param._local_tensor.fsdp_pre_all_gather()`; i.e. it tells us from which tensor object to invoke `fsdp_pre_all_gather()`.
- Suppose we want to compose with CPU offloading. Then, CPU offloading's H2D copy should run first, i.e. `sharded_param._local_tensor.to("cuda").fsdp_pre_all_gather()`, where `_local_tensor.to("cuda")` should return an instance of the subclass so that it still defines `fsdp_pre_all_gather()`. Note that in this case, the subclass instance on GPU is a temporary, which means caching values on it would not be possible. One possibility would be to have `.to("cuda")` move any cached values too.
- `fsdp_post_all_gather` can either return an unsharded parameter that aliases with the all-gather output or does not alias, but there is no way to know a priori.
- If the unsharded parameter aliases with the all-gather output, then we should _not_ free the all-gather output in `unshard`.
- If the unsharded parameter does not alias with the all-gather output, then we prefer to free the all-gather output in `unshard` to avoid holding the unneeded temporary.
- One approach is for eager-mode to check for this alias (by comparing data pointers). However, this might be adversarial to full-graph compilation. The compromise for simplicity can be to always free the all-gather output in `reshard`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122908
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #119302
This PR is part of the FSDP extensions work. For subclasses such as for QLoRA's `NF4Tensor` (using block-wise quantization) that have multiple inner tensors per parameter, we must generalize to allow each parameter to contribute >1 all-gather inputs and hence have >1 all-gather outputs.
This PR does this generalization by converting `FSDPParam.all_gather_input: torch.Tensor` to `FSDPParam.all_gather_inputs: List[torch.Tensor]`. Unfortunately, since we need to preserve the mapping from all-gather inputs/outputs to their source parameter, we have to introduce `List[List]` instead of simply `List` in several places. Furthermore, we still require the flattened 1D `List` for `torch.split` calls, introducing some redundancy between data structures. Nonetheless, I do not see a way to avoid this if we want the generalization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119302
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
Summary: Modify fresh_inductor_cache() to clear cached state before mocking the toplevel cache_dir directory. Any lru_caches (or otherwise) can use the @clear_on_fresh_inductor_cache decorator to register the cache for clearing. Also change the base inductor TestCase class to use fresh_inductor_cache(). Previously that TestCase was only mocking the subdirectory within the toplevel cache dir designated for the FX graph cache artifacts.
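A minimal sketch of the decorator pattern described above (the import path is an assumption based on torch._inductor.utils):
```python
import functools
from torch._inductor.utils import clear_on_fresh_inductor_cache, fresh_inductor_cache

@clear_on_fresh_inductor_cache
@functools.lru_cache(None)
def cached_lookup(key: str) -> str:
    # Stand-in for an expensive computation whose result depends on cache-dir state.
    return key.upper()

with fresh_inductor_cache():
    # Inside this block the top-level cache dir is mocked and caches registered
    # via the decorator (like cached_lookup) have been cleared.
    cached_lookup("conv_template")
```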
Test Plan:
- New unit test
- All existing inductor tests will exercise fresh_inductor_cache()
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122661
Approved by: https://github.com/oulgen
Fixes #123597
There's a sizable comment in the PR about why this is needed, but essentially the launch path is extremely perf-sensitive (running `launch` takes ~30 microseconds, and according to the linked issue, regressing it to 33us is worth 6% overall on torchbench). The `bin.launch_metadata` call doesn't look super expensive, but microseconds matter, and this is only useful when we have a launch hook installed (which seems pretty rare?). This change is worth about 2us, and when combined with the other diff in the stack seems to completely eliminate the torchbench regression.
Differential Revision: [D56046347](https://our.internmc.facebook.com/intern/diff/D56046347)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123841
Approved by: https://github.com/jansel, https://github.com/shunting314
Summary:
note: breaking the original diff D55225818 into 3 parts (top-level renaming, higher-order-op subgraphs, constant input de/serialization) because of its size.
Stacked PR to restore original names to placeholder nodes, replacing the default names arg0_1, arg1_1, ...
This PR supports constant argument placeholder (e.g. forward(self, x, y=1)) names and de/serialization, by adding a name field for ConstantArguments in the graph signature, and ConstantInputSpec in the input specs for serialization.
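Illustrative of the naming being preserved (toy module; the exact placeholder names after export are an assumption):
```python
import torch

class M(torch.nn.Module):
    def forward(self, x, y=1):
        return x + y

ep = torch.export.export(M(), (torch.randn(3),))
print([n.name for n in ep.graph.nodes if n.op == "placeholder"])  # e.g. ["x"] rather than ["arg0_1"]
```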
Test Plan: verification checks on placeholder names for all export() calls, unit test in test/export/test_export.py
Differential Revision: D55506949
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123590
Approved by: https://github.com/angelayi, https://github.com/zhxchen17